Genome-wide Profiling of RNA splicing in prostate tumor from RNA-seq data using virtual microarrays
© Srinivasan et al.; licensee BioMed Central Ltd. 2012
Received: 16 July 2012
Accepted: 10 September 2012
Published: 26 November 2012
Second generation RNA sequencing technology (RNA-seq) offers the potential to interrogate genome-wide differential RNA splicing in cancer. However, since short RNA reads spanning spliced junctions cannot be mapped contiguously onto the chromosomes, there is a need for methods to profile splicing from RNA-seq data. Before the advent of RNA-seq technologies, microarrays containing probe sequences representing exon-exon junctions of known genes were used to hybridize cellular RNAs for measuring context-specific differential splicing. Here, we extend this approach to detect tumor-specific splicing in prostate cancer from an RNA-seq dataset.
A database, SPEventH, representing probe sequences of under a million non-redundant splice events in human is created with exon-exon junctions of optimized length for use as a virtual microarray. SPEventH is used to map tens of millions of reads from matched tumor-normal samples from ten individuals with prostate cancer. Differential counts of reads mapped to each event from tumor and matched normal samples are used to identify statistically significant tumor-specific splice events in prostate.
We find sixty-one (61) splice events that are differentially expressed with a p-value of less than 0.0001 and a fold change of greater than 1.5 in prostate tumor compared to the respective matched normal samples. Interestingly, the only evidence in the public database, EST BF372485, for one of the tumor-specific splice events, which joins an intron in the KLK3 gene to an intron in KLK2, is also derived from prostate tumor tissue. Also, the 765 events with a p-value of less than 0.001 are shown to cluster all twenty samples in a context-specific fashion, with a few exceptions stemming from low coverage of samples.
We demonstrate that virtual microarray experiments using a non-redundant database of splice events in human are both an efficient and a sensitive way to profile genome-wide splicing in biological samples and to detect tumor-specific splicing signatures in datasets from RNA-seq technologies. The signature from the large number of splice events that could cluster tumor and matched-normal samples into two tight, separate clusters suggests that differential splicing is yet another RNA phenotype, alongside gene expression and SNPs, that can be exploited for tumor stratification.
Until recently, the extent of RNA diversity resulting from alternative splicing had been consistently underestimated. In the early 1990s, researchers projected, based on PCR methods, that only 5% of human genes were alternatively spliced; by the end of the decade this was revised upward to 35% by mining EST databases [1, 2]. The estimate rose to 74% in 2003 based on exon-exon junction microarrays and then all the way to 94% in 2008 with the use of second generation RNA sequencing (RNA-seq). What was once the exception has now become the norm, a fact that may be especially significant given that the human genome contains only a few more genes than C. elegans. As highlighted by the ENCODE project, RNA splicing is complicated and has called into question the very definition of the gene as a unit of heredity. Since the majority of human genes contain multiple exons and express at least two splice products, gene expression data cannot be fully interpreted without considering alternative splicing as well.
It is known that alternatively spliced transcripts of a given gene can code for protein variants with varied biological functions or cellular localizations. Splicing by the human spliceosome involves a complex interplay between trans-acting and cis-acting elements. Regulation, disruption or mutation of any one of these elements has the potential to provide tumor cells with a selective advantage during cancer evolution. Numerous gene-by-gene studies have linked splicing to cancer. Genome-wide profiling of splicing in cancer using microarray technologies has revealed that differential splicing may play a key role in cancer progression and metastasis. A comprehensive review of systematic profiling of splicing in various cancer types using genome-wide microarray technologies suggests that differential exon inclusion or skipping events may drive cancer and can be used as cancer biomarkers.
The exposure of the entire RNA content within samples by RNA-seq technologies, including novel splice isoforms, has encouraged the development of methods for de novo identification of splice events expressed in samples [8–10]. These de novo methods are attractive because a large number of splice events are believed to be as yet unidentified. More recently, these methods have been used to discover disease-specific splicing from RNA-seq data [11, 12].
Genome-wide profiling of alternative splicing is not new. Before the advent of RNA-seq technologies, platforms for genome-wide profiling of RNA splicing in biological samples included exon arrays, splice junction arrays [14, 15], and genome-wide tiling arrays. Use of these technologies to profile known splicing events in various biological contexts has already revealed the importance of splicing in cancer research. Recent reviews of genome-wide profiling of splicing in cancer using various microarray platforms suggest that splicing in cancer is prevalent and regulated, and that novel therapeutic strategies are emerging [17, 18].
The success of microarrays in profiling known splicing in cancer can be extended to identifying tumor-specific splicing events in reads from RNA-seq using virtual microarray experiments. In such an experiment, short RNA reads from RNA-seq can be considered the virtual equivalent of cellular RNA, in silico mapping of reads the virtual equivalent of hybridization, and the sequences of exon-exon junction probes the virtual equivalent of the microarray platform. Hence, a non-redundant reference database of known splice junctions can be used to directly map RNA reads in order to detect and measure the expression levels of known splice events. Although such an approach is limited to detection, augmenting the database with predicted junctions also builds discovery into it.
Here we have profiled under a million known and predicted splice events to identify tumor-specific splicing in prostate tumor, using an RNA-seq dataset of matched tumor-normal samples from ten individuals downloaded from the NCBI public repository.
Results and discussion
Validation of SPEventH based prediction
[Table: total number of junctions predicted by SPEventH and Tophat for the top ten most highly expressed genes in T11; recoverable column header: Number of exons]
Comparison of SPEventH and Tophat predictions
[Table 2: accession IDs of the 20 samples from 10 individuals used in this study, along with read coverage]
[Table: number of exon-exon junctions to which reads were uniquely mapped from each sample (column 1) by the two methods, SPEventH (column 2) and Tophat (column 3)]
Prostate cancer specific splicing from SPEventH
[Table: splice events up- and down-regulated in a tumor-specific fashion with a p-value of < 0.0001; recoverable column headers: Source of evidence; Are they from alt. splicing region?]
Base level expression around splice sites
[Table: PCR primers for events for which a transcript sequence was available in the public databases]
In order to computationally validate the findings with samples from other individuals not included in the discovery process, we profiled splicing in another RNA-seq dataset generated independently by another group (SRP003611). Two splice events, in the genes PPP3CA and SLC20A2, are found to be significantly up- and down-regulated with a p-value of less than 0.001 in both datasets (shown in bold in Table 4), even though the sample preparation protocols for the two datasets are different. Also, the only evidence in the public EST transcript database (BF372485) for a tumor-specific splice event connecting an intron of the KLK3 gene to an intron of the KLK2 gene is derived from prostate tumor.
Deep sequencing of RNA provides a promising means of understanding the role of alternative splicing in cancer. A reference database of splice events, such as SPEventH, provides a useful tool to expedite the analysis of RNA-seq data, while also providing a ready link between raw data and the existing body of knowledge necessary for biological interpretation and downstream validation. To our knowledge, the SPEventH database is the only splice event database that includes splice junctions resulting from alternative 5’ and 3’ events, a class of splicing that is both prevalent and understudied in human. The key attributes of this database are high confidence, non-redundancy, detailed annotations, and a simple format for ease of use.
We have demonstrated the value of SPEventH in the identification of prostate tumor-specific splicing from RNA-seq datasets. We have identified a large number of tumor-specific splice events in prostate cancer and have authenticated the findings by computing base-level expression immediately around the donor and acceptor sites for each differentially expressed splice event. Also, the significance of the hundreds of splice events with p-values less than 0.001 was addressed by clustering the 20 samples using the RPKM values for these events. We observe that, despite normalizing for sequence coverage using RPKM, samples with low coverage could not be clustered using the signature events. This suggests that at least 30–40 million reads may be necessary for profiling splicing.
Here, RNA-seq datasets generated by two independent groups have been compared to validate the findings. We find that the most significant events from the two datasets are quite divergent, suggesting either heterogeneity in the cancer types or differences in the sample preparation protocols of the two groups. Since the most up-regulated genes from the validation dataset (SRP003611) included more snoRNAs than those from the discovery dataset (SRP002628), we believe that the validation set may not have been selected for protein-coding RNAs. Despite such discrepancies, we found two splice events, from the genes PPP3CA and SLC20A2, that are significantly up- and down-regulated in a tumor-specific fashion.
Identification of tumor-specific splice events is complicated by the expression of a large number of constitutive junctions from differentially expressed genes. In order to isolate the events that arise purely from differential splicing, all events from differentially expressed genes were removed from the final signature. This crude approach has likely also removed many cancer-driving differential splice events that happen to lie in differentially expressed genes. Better methods are needed to address this issue.
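The filtering step described above can be illustrated with a minimal sketch. The record layout (dictionaries with `event_id` and `gene` keys), the function name and the example gene symbols are all hypothetical and are not taken from the original analysis; the only assumption is that each splice event carries a gene symbol and that a list of differentially expressed genes is available.

```python
def remove_events_from_de_genes(events, de_genes):
    """Keep only splice events whose gene is NOT differentially expressed.

    events   -- iterable of dicts with (hypothetical) keys "event_id" and "gene"
    de_genes -- iterable of gene symbols called differentially expressed
    """
    de = set(de_genes)
    return [event for event in events if event["gene"] not in de]


# Illustrative input only; these identifiers are placeholders.
candidate_events = [
    {"event_id": "evt_a", "gene": "GENE1"},
    {"event_id": "evt_b", "gene": "GENE2"},
    {"event_id": "evt_c", "gene": "GENE3"},
]
signature = remove_events_from_de_genes(candidate_events, {"GENE2"})
```

As the text notes, a gene-level filter of this kind discards every event in a differentially expressed gene, including any that are genuinely differentially spliced.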
This is perhaps one of the first efforts to compare the performance of a de novo splice prediction method such as Tophat with a splice detection method such as SPEventH in the identification of tumor-specific splice signatures. De novo methods like Tophat are considered attractive for their capacity to discover novel events. However, we see that Tophat performs poorly at predicting known splice events, including constitutive junctions of highly expressed genes, consistently across a large number of samples. In contrast, detection methods like SPEventH are not only sensitive for known events but are also amenable to comparison across a large number of samples. Also, by augmenting the database with predicted splice events from gene prediction algorithms, discovery is built into SPEventH-based profiling of splicing across a large number of samples. With advancing sequencing technologies, improving bioinformatics tools, and proliferating public datasets, a reference database of annotated splice events in human will mature and become critical for profiling alternative splicing in biological samples.
Materials and methods
Results reported in this manuscript did not involve any experimental work on human or animal samples. All reported findings were obtained by bioinformatics analysis of data downloaded from the NCBI Short Read Archive, as listed in Table 2.
Construction of SPEventH database
A reference splice event database, SPEventH, containing probes of optimal length representing exon-exon and exon-intron junction sequences for 731,954 splice events, is derived from SEHS1.0, a database of non-redundant splice junctions in human that has been validated for genome-wide profiling of alternative splicing using microarray technology. SEHS1.0 was created using the alignment of 8.5 million transcripts, including 8,089,335 ESTs, 287,440 mRNAs, 34,389 RefSeq sequences, 66,803 known genes and 99,128 predicted genes, onto the hg18 assembly of the human genome. Construction of SEHS1.0 involved parsing and processing the 8.5 million aligned full-length and partial transcript sequences, identifying the splice sites, applying an alignment quality filter, making the set non-redundant, selecting sequences spanning the splice sites, and attaching detailed annotations to the results. The quality filter removed misalignments arising from low sequence quality. The key parameters required each splice event to have canonical donor and acceptor motifs; however, splice events supported by multiple sequences were included regardless of the intron motifs. Uniqueness of a splice event was defined by the intron start and end coordinates. Annotations include the data source (such as RefSeq or GENSCAN), the list of accessions of sequences that provide evidence for each splice event, the intron donor and acceptor motifs, as well as chromosomal coordinates and gene symbols.
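The non-redundancy and motif rules described above can be sketched as follows. This is a minimal illustration, not the SEHS1.0 pipeline itself: the input record layout, the function name, the use of the chromosome as part of the key, and the restriction of "canonical" to the GT/AG donor/acceptor pair are all assumptions made for the example.

```python
from collections import defaultdict

# Assumption: "canonical" is taken here to mean the GT/AG intron motif only.
CANONICAL_MOTIFS = {("GT", "AG")}


def build_nonredundant_events(alignments):
    """Collapse aligned introns into unique splice events.

    alignments -- iterable of dicts with (hypothetical) keys:
                  chrom, intron_start, intron_end, donor, acceptor, accession
    An event is kept if its motif is canonical, or if more than one
    distinct accession supports it.
    """
    evidence = defaultdict(set)
    motifs = {}
    for aln in alignments:
        # Uniqueness is defined by intron start and end coordinates;
        # the chromosome is added to the key as an assumption.
        key = (aln["chrom"], aln["intron_start"], aln["intron_end"])
        evidence[key].add(aln["accession"])
        motifs[key] = (aln["donor"], aln["acceptor"])

    events = {}
    for key, accessions in evidence.items():
        if motifs[key] in CANONICAL_MOTIFS or len(accessions) > 1:
            events[key] = {
                "motif": motifs[key],
                "accessions": sorted(accessions),
            }
    return events


# Illustrative records with placeholder accessions and coordinates.
example = [
    {"chrom": "chr1", "intron_start": 100, "intron_end": 200,
     "donor": "GT", "acceptor": "AG", "accession": "acc1"},
    {"chrom": "chr1", "intron_start": 100, "intron_end": 200,
     "donor": "GT", "acceptor": "AG", "accession": "acc2"},
    {"chrom": "chr1", "intron_start": 300, "intron_end": 400,
     "donor": "CT", "acceptor": "AC", "accession": "acc3"},
    {"chrom": "chr2", "intron_start": 500, "intron_end": 600,
     "donor": "CT", "acceptor": "AC", "accession": "acc4"},
    {"chrom": "chr2", "intron_start": 500, "intron_end": 600,
     "donor": "CT", "acceptor": "AC", "accession": "acc5"},
]
events = build_nonredundant_events(example)
```

In this toy input, the chr1:300-400 intron is dropped (non-canonical, single supporting sequence), while the chr2 intron is retained despite its motif because two sequences support it.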
The probes in the SPEventH database are 56 bases long, comprising 28 bases from each of the two exons participating in the junction. This length is optimized to reliably exclude mapping of 36-base RNA reads that do not span a junction. That is, when 36-base reads are mapped to probes containing 28 bases on either side of the junction, a minimum overlap of 8 bases into the adjoining exon is ensured, even for reads that map starting at position 1 or ending at position 56 of the probe.
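The probe-length arithmetic above can be checked with a short sketch. The function names are illustrative; the only inputs are the read length (36) and the flank length (28) stated in the text, with positions counted from 1 along the probe.

```python
def junction_overhang(read_start, read_len=36, flank=28):
    """Return (bases in upstream exon, bases in downstream exon) for a read
    aligned to a junction probe, starting at 1-based position read_start."""
    probe_len = 2 * flank
    read_end = read_start + read_len - 1
    if read_start < 1 or read_end > probe_len:
        raise ValueError("read does not fit on the probe")
    # Probe positions 1..flank belong to the upstream exon,
    # positions flank+1..2*flank to the downstream exon.
    upstream = max(0, min(read_end, flank) - read_start + 1)
    downstream = max(0, read_end - max(read_start, flank + 1) + 1)
    return upstream, downstream


def min_overhang(read_len=36, flank=28):
    """Smallest number of bases any fully contained read places in either exon."""
    last_start = 2 * flank - read_len + 1
    return min(
        min(junction_overhang(start, read_len, flank))
        for start in range(1, last_start + 1)
    )
```

With the default values, a read starting at position 1 covers 28 upstream and 8 downstream bases, and a read ending at position 56 covers 8 upstream and 28 downstream bases, so `min_overhang()` evaluates to 8. More generally, the guaranteed overlap is the read length minus the flank length whenever the read is longer than the flank; a read no longer than the flank can sit entirely within one exon.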
Mapping short RNA reads to SPEventH
The dataset of short RNA reads derived from 20 samples, comprising tumor and matched normal prostate tissues from 10 individuals, is downloaded from the NCBI repository using the accession IDs listed in Table 2. Tens of millions of short reads from each sample are mapped to junction-spanning probes in SPEventH using bowtie. Two mismatches are allowed during mapping and only uniquely mapped reads are considered. The bowtie output is parsed by loading its fields into a MySQL database, creating one table per sample. Using an SQL query, the number of reads mapped to each SPEventH junction is computed and stored in another table. A combined table is then generated from the twenty individual tables, containing the read counts of all splice events in the database for all samples used in the study (SRP002628). The read counts are normalized across samples by computing RPKM values. The p-value and fold change (log2) are computed by comparing RPKMs for normal and tumor samples for all events in SPEventH using the R statistical package. A reduced table of events with a p-value < 0.001 across all normal and tumor samples is created. Hierarchical clustering of all 20 samples is then performed based on the RPKM values of the selected events in each sample.
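The counting and normalization steps can be sketched in a few lines of Python, as a stand-in for the MySQL and R steps described above rather than a reproduction of them. Several things here are assumptions: that the input is bowtie's default tab-separated output with the reference (probe) name in the third column, that it has already been restricted to uniquely mapped reads, that the probe length of 56 bases is used as the feature length in the RPKM formula, and that fold change is taken as the ratio of mean RPKMs with a pseudocount. The statistical test is not sketched because the text does not name it.

```python
import math
from collections import Counter


def count_reads_per_probe(bowtie_lines):
    """Count alignments per probe from tab-separated alignment lines.

    Assumes column 3 holds the probe (reference) name and that the
    input contains only uniquely mapped reads.
    """
    counts = Counter()
    for line in bowtie_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def rpkm(count, total_mapped_reads, feature_len=56):
    """Reads per kilobase of feature per million mapped reads."""
    return count * 1e9 / (total_mapped_reads * feature_len)


def log2_fold_change(tumor_values, normal_values, pseudocount=1.0):
    """log2 ratio of mean tumor to mean normal expression.

    The pseudocount (an assumption) avoids division by zero.
    """
    tumor_mean = sum(tumor_values) / len(tumor_values)
    normal_mean = sum(normal_values) / len(normal_values)
    return math.log2((tumor_mean + pseudocount) / (normal_mean + pseudocount))


# Illustrative alignment lines with placeholder read and probe names.
example_lines = [
    "read1\t+\tprobe_A\t3\tACGT\tIIII\t0\t",
    "read2\t-\tprobe_A\t5\tACGT\tIIII\t0\t",
    "read3\t+\tprobe_B\t0\tACGT\tIIII\t0\t",
]
probe_counts = count_reads_per_probe(example_lines)
```

Because every probe has the same length, dividing by the feature length rescales all events equally; the per-sample normalization comes from the total number of mapped reads.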
Mapping short RNA reads to hg18 assembly
The Tophat program (version 1.1.2) with default parameters is used to map reads from all 20 samples onto the hg18 genome assembly. The default parameters of Tophat and the genome assembly version are consistent with those used for mapping reads to SPEventH. For comparison, the junction coordinates are normalized to those of SPEventH using a Perl script and a MySQL database.
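One way such a coordinate normalization could look is sketched below in Python, as a stand-in for the Perl script mentioned above and not a reproduction of it. It assumes that the junctions are reported in 12-column BED format with two blocks per junction, where the blocks are the flanking exon segments, so that the intron lies between the end of the first block and the start of the second. The function names and the example line are illustrative, and the coordinate convention used by SPEventH itself is not stated in the text, so the output is left as 0-based, half-open coordinates.

```python
def bed_junction_to_intron(bed_line):
    """Convert a two-block BED junction line into intron coordinates.

    Returns (chrom, intron_start, intron_end, strand) with a 0-based,
    half-open interval, following BED conventions.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom = fields[0]
    chrom_start = int(fields[1])
    strand = fields[5]
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    if len(block_sizes) != 2 or len(block_starts) != 2:
        raise ValueError("expected exactly two blocks per junction")
    intron_start = chrom_start + block_sizes[0]   # first intronic base
    intron_end = chrom_start + block_starts[1]    # first base of second block
    return chrom, intron_start, intron_end, strand


def load_introns(bed_lines):
    """Parse junction lines, skipping blank lines and 'track' headers."""
    introns = set()
    for line in bed_lines:
        if not line.strip() or line.startswith("track"):
            continue
        introns.add(bed_junction_to_intron(line))
    return introns


# Illustrative line with placeholder values.
example_bed = [
    "track name=junctions",
    "chr1\t100\t300\tJUNC1\t5\t+\t100\t300\t255,0,0\t2\t20,30\t0,170",
]
tophat_introns = load_introns(example_bed)
```

Once both sets of junctions are expressed as intron coordinates in the same convention, shared and method-specific junctions follow from ordinary set intersection and difference.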
SPEventH: SPlice Events in Human
BED: Browser Extensible Data format
RPKM: Reads Per Kilobase of exon model per Million mapped reads.
The authors wish to acknowledge the Department of Biotechnology, New Delhi, India for support to Dr. Srinivasan via the Ramalingaswamy Fellowship. The authors also acknowledge the Department of Information Technology, GoI, for its support to the institute under the ‘Center of Excellence in Bioinformatics Training and Research’.
- Mironov AA, Fickett JW, Gelfand MS: Frequent alternative splicing of human genes. Genome Res. 1999, 9: 1288-1293. 10.1101/gr.9.12.1288.
- Modrek B, Resch A, Grasso C, Lee C: Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res. 2001, 29: 2850-2859. 10.1093/nar/29.13.2850.
- Johnson JM, Castle J, Garrett-Engele P, Kan Z, Loerch PM, Armour CD, Santos R, Schadt EE, Stoughton R, Shoemaker DD: Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science. 2003, 302: 2141-2144. 10.1126/science.1090100.
- Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456: 470-476. 10.1038/nature07509.
- Srinivasan S: Alternative Splicing in Eukaryotes: The norm not an anomaly. Curr Sci. 2011, 100: 813-814.
- Birney E, Stamatoyannopoulos JA, Dutta A, et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447: 799-816. 10.1038/nature05874.
- Dutertre M, Lacroix-Triki M, Driouch K, de la Grange P, Gratadou L, Beck S, Millevoi S, Tazi J, Lidereau R, Vagner S, Auboeuf D: Exon-based clustering of murine breast tumor transcriptomes reveals alternative exons whose expression is associated with metastasis. Cancer Res. 2010, 70: 896-905. 10.1158/0008-5472.CAN-09-2703.
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25: 1105-1111. 10.1093/bioinformatics/btp120.
- Bryant DW, Shen R, Priest HD, Wong W-K, Mockler TC: Supersplat -- spliced RNA-seq alignment. Bioinformatics. 2010, 26 (12): 1500-1505. 10.1093/bioinformatics/btq206.
- Au KF, Jiang H, Lin L, Xing Y, Wong WH: Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Res. 2010, 38 (14): 4570-4578. 10.1093/nar/gkq211.
- Twine NA, Janitz K, Wilkins MR, Janitz M: Whole transcriptome sequencing reveals gene expression and splicing differences in brain regions affected by Alzheimer’s disease. PLoS One. 2011, 6: e16266. 10.1371/journal.pone.0016266.
- Ren S, Peng Z, Mao J-H, Yu Y, Yin C, Gao X, Cui Z, Zhang J, Yi K, Xu W, Chen C, Wang F, Guo X, Lu J, Yang J, Wei M, Tian Z, Guan Y, Tang L, Xu C, Wang L, Gao X, Tian W, Wang J, Yang H, Wang J, Sun Y: RNA-seq analysis of prostate cancer in the Chinese population identifies recurrent gene fusions, cancer-associated long noncoding RNAs and aberrant alternative splicings. Cell Res. 2012, 22: 806-821. 10.1038/cr.2012.30.
- Liu J, Xiao Y, Xiong H-M, Li J, Huang B, Zhang H-B, Feng D-Q, Chen X-M, Wang X-Z: Alternative splicing of apoptosis-related genes in imatinib-treated K562 cells identified by exon array analysis. Int J Mol Med. 2012, 29: 690-698.
- Fehlbaum P, Guihal C, Bracco L, Cochet O: A microarray configuration to quantify expression levels and relative abundance of splice variants. Nucleic Acids Res. 2005, 33: e47. 10.1093/nar/gni047.
- Bingham J, Sudarsanam S, Srinivasan S: Profiling human phosphodiesterase genes and splice isoforms. Biochem Biophys Res Commun. 2006, 350: 25-32. 10.1016/j.bbrc.2006.08.180.
- Zhang X, Shiu S-H, Shiu S, Cal A, Borevitz JO: Global analysis of genetic, epigenetic and transcriptional polymorphisms in Arabidopsis thaliana using whole genome tiling arrays. PLoS Genet. 2008, 4: e1000032. 10.1371/journal.pgen.1000032.
- Ghigna C, Valacca C, Biamonti G: Alternative splicing and tumor progression. Curr. Genomics. 2008, 9: 556-570. 10.2174/138920208786847971.
- Germann S, Gratadou L, Dutertre M, Auboeuf D: Splicing programs and cancer. J Nucleic Acids. 2012, 2012: 269570.
- Jang BI, Li Y, Graham DY, Cen P: The Role of CD44 in the Pathogenesis, Diagnosis, and Therapy of Gastric Cancer. Gut Liver. 2011, 5: 397-405. 10.5009/gnl.2011.5.4.397.
- Heider KH, Dämmrich J, Skroch-Angel P, Müller-Hermelink HK, Vollmers HP, Herrlich P, Ponta H: Differential expression of CD44 splice variants in intestinal- and diffuse-type human gastric carcinomas and normal gastric mucosa. Cancer Res. 1993, 53: 4197-4203.
- Patil A, Deshmukh M, Singh N, Srivastava R, Satya Swati K, Verma M, Saurabh G, Veeresh S, Srivatsan R, Srinivasan S: From data repositories to potential biomarkers: application to prostate cancer. Curr Sci. 102: 1111-1116.
- Bingham JL, Carrigan PE, Miller LJ, Srinivasan S: Extent and diversity of human alternative splicing established by complementary database annotation and microarray analysis. OMICS. 2008, 12: 83-92. 10.1089/omi.2007.0041.
- Carrigan PE, Bingham JL, Srinivasan S, Brentnall TA, Miller LJ: Characterization of alternative spliceoforms and the RNA splicing machinery in pancreatic cancer. Pancreas. 2011, 40: 281-288.
- Miller W, Rosenbloom K, Hardison RC, Hou M, Taylor J, Raney B, Burhans R, King DC, Baertsch R, Blankenberg D, Kosakovsky Pond SL, Nekrutenko A, Giardine B, Harris RS, Tyekucheva S, Diekhans M, Pringle TH, Murphy WJ, Lesk A, Weinstock GM, Lindblad-Toh K, Gibbs RA, Lander ES, Siepel A, Haussler D, Kent WJ: 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res. 2007, 17: 1797-1808. 10.1101/gr.6761107.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.