ecDNA Prediction Using HiChIP Data
Background
- Focal amplifications that include both enhancers and their target oncogenes have been observed in many cancers, such as EGFR in glioblastoma, MYC in group 3 medulloblastoma, and MYCN in both neuroblastoma and Wilms tumor cell2019scacheri.
- H3K27ac ChIP-seq, ATAC-seq, POLR2A ChIP-seq, and RNA-seq signals at two EGFR enhancers for four glioblastoma lines (GBM3565, GBM3094, GSC23, G459)
- HiChIP data used: GSE73865 (O’Brien et al., 2016), GSE90683 (Boeva et al., 2017)
- These oncogenes were co-amplified with super-enhancers, not only in contiguous regions but also in more complex, non-contiguous amplicons. On the linear reference genome, such amplicons appear broken into cis and trans genomic loci, which points to a role for ecDNAs in oncogene amplification.
- These regulatory elements have been preserved and have evolved within cells in circular form, referred to as extrachromosomal circular DNA (ecDNA) cell2013korbel.
- Bioinformatics tools for analyzing whole genome sequencing (WGS) data can exhibit varying performance based on their underlying assumptions and the quality of the input data 38746056,39209966.
Methods
- Convert contacts to network
- Assortativity (https://networkx.org/nx-guides/content/algorithms/assortativity/correlation.html)
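
The two steps above (contacts to network, then assortativity) can be sketched without external dependencies. This is a minimal illustration, not the pipeline used here: the function names `contacts_to_graph` and `degree_assortativity`, the `min_count` threshold, and the toy bin labels are all hypothetical. It assumes contacts arrive as (bin_a, bin_b, count) triples and treats the graph as unweighted and undirected. For such a graph, `networkx.degree_assortativity_coefficient` should give a comparable value.

```python
from collections import defaultdict
from math import sqrt


def contacts_to_graph(contacts, min_count=1):
    """Build an undirected adjacency map from (bin_a, bin_b, count) triples.

    Self-contacts and contacts below min_count (a hypothetical threshold)
    are dropped.
    """
    adj = defaultdict(set)
    for bin_a, bin_b, count in contacts:
        if bin_a != bin_b and count >= min_count:
            adj[bin_a].add(bin_b)
            adj[bin_b].add(bin_a)
    return adj


def degree_assortativity(adj):
    """Pearson correlation of node degrees at the two ends of each edge.

    Every undirected edge is counted in both directions, which keeps the
    measure symmetric. Returns NaN when it is undefined (no edges, or all
    nodes share the same degree).
    """
    xs, ys = [], []
    for u, neighbours in adj.items():
        for v in neighbours:
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    n = len(xs)
    if n == 0:
        return float("nan")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Toy example: one hub bin contacting three other bins (a star graph).
toy_contacts = [
    ("chr8:127Mb", "chr8:128Mb", 12),
    ("chr8:127Mb", "chr10:121Mb", 7),
    ("chr8:127Mb", "chr10:122Mb", 5),
]
graph = contacts_to_graph(toy_contacts)
r = degree_assortativity(graph)  # -1.0 for a star: the hub links only to leaves
```

A strongly negative value means high-degree bins mostly contact low-degree bins; whether that pattern separates ecDNA hubs from chromosomal loci is a question for the analysis, not something this sketch establishes.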
Public Datasets
- Database 35388171
- ecDNA HiChIP datasets 31748743.
- In a MYC-amplified colorectal cancer cell line, ecDNA hubs are tethered by the BET protein BRD4 34819668.
- HiChIP datasets from SNU16 cells (amplified for MYC and FGFR2) 31748743.
Previous Results
Results
Methods
- HiNT: Gini-ranking 32293513
- github: https://github.com/parklab/HiNT
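
The Gini index behind this ranking is half the relative mean absolute difference of a sample. As a rough illustration only, the same quantity can be computed from sorted values in O(n log n) rather than from all pairwise differences. The function name `gini_sorted` is hypothetical and is not part of HiNT; the sketch assumes finite, non-negative inputs and does no NaN handling.

```python
def gini_sorted(values):
    """Gini coefficient from sorted values.

    Uses the identity G = sum_i (2*i - n - 1) * x_i / (n * sum(x)) for
    x sorted ascending and i = 1..n, which equals half the relative mean
    absolute difference taken over all n*n ordered pairs.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return float("nan")
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


even = gini_sorted([5, 5, 5])       # 0.0: all values equal
spread = gini_sorted([1, 2, 3, 4])  # 0.25
```

A value near 0 means the ratios are spread evenly; values closer to 1 mean a few entries dominate, which is the kind of signal a ranking on Gini would surface.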
Code Analysis
The HiNT source code (https://github.com/parklab/HiNT):
import os
from operator import itemgetter

import numpy as np
from scipy.stats import rankdata

def gini(x):
    # (Warning: This is a concise implementation, but it is O(n**2)
    # in time and memory, where n = len(x). *Don't* pass in huge
    # samples!)
    # Mean absolute difference
    mad = np.nanmean(np.abs(np.subtract.outer(x, x)))
    # Relative mean absolute difference
    rmad = mad/np.nanmean(x)
    # Gini coefficient
    g = 0.5 * rmad
    return g
def getGini(mat1,mat2):
    # Load the sample and background matrices (tab-delimited text)
    matrix1 = np.genfromtxt(mat1,delimiter="\t")
    matrix2 = np.genfromtxt(mat2,delimiter="\t")
    # Replace NaN/inf entries with 0
    matrix1[np.isfinite(matrix1)==0] = 0
    matrix2[np.isfinite(matrix2)==0] = 0
    rowsum1 = np.sum(matrix1,axis=1)
    rowsum2 = np.sum(matrix2,axis=1)
    colsum1 = np.sum(matrix1,axis=0)
    colsum2 = np.sum(matrix2,axis=0)
    # Rows/columns that are empty in either matrix
    ridx1 = np.where(rowsum1==0)
    cidx1 = np.where(colsum1==0)
    ridx2 = np.where(rowsum2==0)
    cidx2 = np.where(colsum2==0)
    ridx = np.union1d(ridx1[0], ridx2[0])
    cidx = np.union1d(cidx1[0], cidx2[0])
    # Drop those rows and columns from both matrices
    temp1 = np.delete(matrix1,ridx,0)
    temp2 = np.delete(matrix2,ridx,0)
    selectedData1 = np.delete(temp1,cidx,1)
    selectedData2 = np.delete(temp2,cidx,1)
    # Scale each matrix by its own mean
    average1 = np.mean(selectedData1)
    average2 = np.mean(selectedData2)
    tm1 = np.divide(selectedData1,average1)
    tm2 = np.divide(selectedData2,average2)
    # Element-wise sample/background ratio; entries where the background
    # is 0 become inf (or NaN for 0/0)
    division = np.divide(tm1,tm2)
    giniIndex = gini(np.asarray(division).reshape(-1))
    maximum = np.nanmax(np.asarray(division).reshape(-1))
    return giniIndex,maximum
def getRankProduct(matrix1MbInfo,background1MbInfo,outdir,name):
    rpout = os.path.join(outdir,name + '_chrompairs_rankProduct.txt')
    outf = open(rpout,'w')
    ginis = []
    maximums = []
    chrompairs = []
    # One Gini index and one maximum ratio per chromosome pair
    for chrompair in matrix1MbInfo:
        #print chrompair
        matrix1 = matrix1MbInfo[chrompair]
        matrix2 = background1MbInfo[chrompair]
        giniIndex,maximum = getGini(matrix1,matrix2)
        chrompairs.append(chrompair)
        ginis.append(giniIndex)
        maximums.append(maximum)
    # rankdata ranks ascending from 1, so len - rank gives 0 to the largest value
    rankgini = len(ginis) - rankdata(ginis)
    rankmaximum = len(maximums) - rankdata(maximums)
    #print rankgini,rankmaximum
    # Rank product: product of the two ranks, each scaled by the number of pairs
    rps = (np.divide(rankgini,len(ginis)*1.0))*(np.divide(rankmaximum,len(maximums)*1.0))
    # Note: stacking names with numbers may coerce every field to a string,
    # in which case the sort below compares strings rather than floats
    result = np.stack((chrompairs,ginis,maximums,rps),axis=-1)
    sortedResult = sorted(result, key=itemgetter(-1))
    outf.write('\t'.join(['ChromPair',"GiniIndex","Maximum","RankProduct"]) + '\n')
    for res in sortedResult:
        chrompair, gini, maximum, rp = res
        newres = [chrompair, str(gini), str(maximum), str(rp)]
        outf.write('\t'.join(newres) + '\n')
    outf.close()
    return rpout
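
To make the rank-product step in getRankProduct concrete, here is a small dependency-free sketch with made-up numbers. The helper names `average_ranks` and `rank_product` and the toy values are hypothetical; `average_ranks` is meant to mimic the default tie handling of `scipy.stats.rankdata` (ties share their average rank), which is an assumption about how closely this mirrors the original.

```python
def average_ranks(values):
    """1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_product(ginis, maximums):
    """(n - rank(gini)) / n * (n - rank(max)) / n for each item."""
    n = len(ginis)
    rank_g = [n - r for r in average_ranks(ginis)]
    rank_m = [n - r for r in average_ranks(maximums)]
    return [(g / n) * (m / n) for g, m in zip(rank_g, rank_m)]


# Toy values for three chromosome pairs (invented for illustration)
toy_ginis = [0.2, 0.8, 0.5]
toy_maximums = [3.0, 9.0, 12.0]
rps = rank_product(toy_ginis, toy_maximums)
# Pair 2 has the largest Gini and pair 3 the largest maximum, so both get 0;
# pair 1 is lowest on both and gets (2/3) * (2/3).
```

One consequence visible in the toy output: a pair that ranks first on either metric receives a rank product of exactly 0, regardless of its rank on the other metric, so ties at 0 are possible at the top of the sorted list.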