Oncogene (2013), 1–9 & 2013 Macmillan Publishers Limited All rights reserved 0950-9232/13 www.nature.com/onc

ORIGINAL ARTICLE

Identification of latent biomarkers in hepatocellular carcinoma by ultra-deep whole-transcriptome sequencing K-T Lin1,8, Y-J Shann2,8, G-Y Chau3, C-N Hsu4,5,6 and C-YF Huang1,2,7 There is an urgent need to identify biomarkers for hepatocellular carcinoma due to limited treatment options and the poor prognosis of this common lethal disease. Whole-transcriptome shotgun sequencing (RNA-Seq) provides new possibilities for biomarker identification. We sequenced B250 million pair-end reads from a pair of adjacent normal and tumor liver samples. With the aid of bioinformatics tools, we determined the transcriptome landscape and sought novel biomarkers by further empirical validations in 55 pairs of adjacent normal and tumor liver samples with various viral statuses such as HBV( þ ), HCV( þ ) and HBV(  )HCV(  ). We identified a novel gene with coding regions, termed DUNQU1, which has a tissue-specific expression pattern in tumor liver samples of HCV( þ ) and HBV(  )HCV(  ) hepatocellular carcinomas. Overexpression of DUNQU1 in Huh7 cell lines enhances the ability to form colonies in soft agar. Also, we identified three novel differentially-expressed protein-coding genes (ALG1L, SERPINA11 and TMEM82) that lack documented expression profiles in liver cancer and showed that the level of SREPINA11 is correlated with pathology stages. Moreover, we showed that the alternative splicing event of FGFR2 is associated with virus infection, tumor size, cirrhosis and tumor recurrence. The findings indicate that these new markers of hepatocellular carcinoma may be of value in improving prognosis and could have potential as new targets for developing new treatment options. Oncogene advance online publication, 21 October 2013; doi:10.1038/onc.2013.424 Keywords: hepatocellular carcinoma; DUNQU1; FGFR2; alternative splicing; RNA-Seq

INTRODUCTION Hepatocellular carcinoma (HCC) is one of the three fastestgrowing cancers in the US and is the most lethal cancer in Asia. Compared to other cancers, HCC has a relatively poor prognosis and limited treatment options. Traditionally, to dissect how the functional units are deployed in different cells, gene expression microarrays are the most frequently used tools. Gene signatures derived from these microarrays are considered to be the blueprints of events taking place in the cells under particular conditions at specific time points. In previous genome-wide studies, many hypotheses were generated from the gene signatures to explain biological outcomes. Recently, RNA-Seq, one of the applications of the secondgeneration sequencing techniques, was developed. Since RNASeq offers single-base resolution on the whole-genome scale, it provides the opportunity to greatly improve our knowledge of both the quantitative and qualitative aspects of the human transcriptome.1 It has been reported that RNA-Seq can detect at least 25% more known genes than traditional gene expression arrays, as well as many novel transcripts in intergenic regions.2,3 Also, studies using RNA-Seq have shown that the number of functional units in the human genome is much larger than previously anticipated.4 The transcription of mammalian genomes is now known to take place across almost all sections of the genome, and many alternative splicing (AS) events in the human transcriptome are very noisy, even in normal cells.5,6 It quickly became clear that the complexity of the human transcriptome 1

cannot be explained solely by the B2% of the human genome analyzed by gene expression microarrays. With the aid of bioinformatics tools, many potential biomarkers that were latent variables in previous genome-wide studies can now be seen by RNA-Seq.7 In the present study, we sought out potential biomarkers and validated their expression patterns in 55 pairs of adjacent normal and tumor liver samples with diverse viral status and gender.

RESULTS The catalog of the transcriptome landscape of HCC In total, we sequenced B120 million read pairs per sample from a pair of adjacent normal and tumor liver samples (Supplementary Table 1). Aligned normal reads covered B6.52% of the human genome and B7.59% for tumor reads (Supplementary Figure 1a and Supplementary Table 2). We also identified many novel exon junctions shown in Supplementary Figure 2. All of the above information can be either downloaded from our website (http:// bioagent.iis.sinica.edu.tw/HCCT2012) or browsed on the UCSC Genome Browser for visualization and track comparison. Novel differentially-expressed protein-coding genes whose expression profiles were missing in liver cancer RNA-Seq can detect more differentially-expressed protein-coding genes (DE genes) than previous genome-wide arrays. We filtered

Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan; 2Institute of Biopharmaceutical Sciences, National Yang-Ming University, Taipei, Taiwan; Division of General Surgery, Department of Surgery, Taipei Veterans General Hospital, Taipei, Taiwan; 4Institute of Information Science, Academia Sinica, Taipei, Taiwan; 5USC/ Information Sciences Institute, Marina del Rey, CA, USA; 6Division of Biomedical Informatics, Department of Medicine, University of California, San Diego, La Jolla, CA, USA and 7 Cancer Research Center and Genome Research Center, National Yang-Ming University, Taipei, Taiwan. Correspondence: Professor CYF Huang, Institute of Clinical Medicine, National Yang-Ming University, No. 155, Li-Non St, Sec. 2 Taipei 112, Taiwan, Taiwan. E-mail: [email protected] 8 These authors contributed equally to this work. Received 3 January 2013; revised 15 August 2013; accepted 19 August 2013 3

Latent biomarkers in hepatocellular carcinoma K-T Lin et al

2 out 2,576 up-regulated and 855 down-regulated genes (Supplementary Table 3) and found 17 DE gene candidates without annotations from the genome-wide arrays (U95, U133 and U133 plus 2.0) (Supplementary Table 4). Also, the 17 DE genes lacked documented expression profiles for liver cancer in Gene Expression Omnibus, Gene Expression Atlas, ArrayExpress and Oncomine (Supplementary Table 4 and 5).8–11 By real-time PCR in 55 pairs of HCC patient samples, we showed that ALG1L, SERPINA11 and TMEM82 indeed had expected expression patterns (Figure 1b and Supplementary Table 6). In particular, the DDCt values of SERPINA11 were significantly different between stage I/II and stage III/IV (P ¼ 0.0202) and negatively correlated with pathology stages (stage I/II ¼ 1 and stage III/IV ¼ 2, Pearson’s correlation ¼ –0.328, and P ¼ 0.0145) (Figure 1c). That is to say, SERPINA11 was significantly lower in the later stages, such as stages III and IV. ALG1L is a putative glycosyltransferase. An altered mRNA expression level of glycosyltransferases might be helpful for early detection of carcinomas and tumor progression.12 SERPINA11 is a serine proteinase inhibitor that might be secreted. The downregulation of SERPINA11 has been correlated with breast cancer initiation and progression.13 TMEM82 has a transmembrane domain. These new DE genes may improve our understanding of the carcinogenesis of HCC. A novel gene, termed DUNQU1, has a tissue-specific expression pattern and may play a role in liver tumorigenesis RNA-Seq can address a certain proportion of the transcribed genome beyond current gene annotations (Supplementary Figure 1b). To investigate the uncharacterized areas of the human genome, we extracted the putative mRNA sequences from the Cufflinks de novo assembly results. In total, 38155 normal and

38146 tumor transcript sequences were found. Among them, 28569 normal and 28278 tumor sequences contained coding regions longer than or equal to 50 amino acids (Supplementary Table 7). Of the predicted coding regions, 231 normal and 286 tumor peptides were located in intergenic regions based on Ensembl 65 gene annotations. These coding peptides in intergenic regions represent potentially novel, unidentified genes. In particular, 224 of the 286 tumor peptides were specific to the sequenced liver tumor. This implies the existence of tumorspecific protein-coding genes in intergenic regions. To prioritize the candidates, we ranked the tumor-specific peptides by FPKM values and manually determined whether there were multi-exon peptides with predicted functional domains. Among tumorspecific peptides whose FPKM value was greater than 0.5, we found only one multi-exon candidate that had predicted functional domains. It was a 101-amino-acid peptide encoded by a gene with 3 exons. We named the gene DUNQU1, from the Chinese for ‘the latent one’. Our analysis of the junction reads suggested that DUNQU1 comprises three exons (E1, E2 and E3) and expresses two isoforms: SP1 (E1 þ E2 þ E3) and SP2 (E1 þ E3) (Supplementary Figure 3a). The mRNA transcript of SP1 was predicted to be 5438 bp, whereas SP2 was 5345 bp. The predicted peptide sequences for SP1 and SP2 were 101 and 94 amino acids, respectively. According to InterProScan, both isoforms have a phosphodiesterase domain (PF01663) and alkaline phosphatase-like domains (G3DSA:3.40.720.10 and SSF53649), and both have a potential signal peptide (SignalP-NN(euk)) (Supplementary Figure 4).14 Analysis of the predicted DUNQU1 peptide sequences revealed high identity between DUNQU1 and the N-terminus of ENPP7 (Supplementary Figure 5). ENPP7, also known as alkaline sphingomyelinase (alk-SMase) or NPP7, is expressed in intestine and liver

Figure 1. Flow chart of this study and novel differentially-expressed genes in liver cancer. (a) Using the total RNA extracted from a pair of adjacent normal and tumor liver samples, we performed RNA-Seq to obtain short reads and constructed the transcriptome landscape by using several bioinformatics tools such as TopHat, MapSplice, SpliceTrap and Cufflinks. Several major findings, such as novel differentially expressed protein-coding genes (DE genes), novel genes and AS events, were selected and verified in a set of 55 pairs of adjacent normal and tumor HCC patient samples. (b) DDCt values of ALG1L, SERPINA11 and TMEM82 were derived from real-time PCR of 55 pairs of adjacent normal and tumor liver samples. Each gene has a violin plot showing the shape of the distribution of DDCt values (gray area), their median (white dot) and their interquartile range (black box). (c) DDCt values of SERINA11 were grouped into two pathological stages. P-value was obtained from t-test comparing stage I/II and stage III/IV. Oncogene (2013) 1 – 9

& 2013 Macmillan Publishers Limited

Latent biomarkers in hepatocellular carcinoma K-T Lin et al

3 and may function as a tumor suppressor. It hydrolyzes sphingomyelin to ceramide, which inhibits cell proliferation and induces apoptosis. Loss-of-function mutations and (AS) of ENPP7 have been associated with colon cancer and liver tumorigenesis.15,16 DUNQU1 is located at the Chromosome band 16p11.2 (Supplementary Figure 3b), whose deletion has been linked to autism and obesity.17,18 Upstream of DUNQU1, we found a transcription factor binding site and a DNase hypersensitivity cluster at the same location (Supplementary Figure 3b).19 The first exon of DUNQU1 is conserved in Xenopus tropicalis, Tetraodon, Fugu, stickleback, medaka, zebrafish and lamprey (Supplementary Figure 3b). This sequence conservation corroborates the existence of DUNQU1 as an expressed gene. The transcription start site identified by 50 RACE was close to the transcription start site suggested by RNA-Seq data (Supplementary Figure 6). The tumorspecific expression pattern of DUNQU1 in the sequenced liver tumors was confirmed by end-point PCR (Figure 2a). Direct sequencing of the PCR products also revealed exactly the same sequence as produced by RNA-Seq (within the PCR-amplified region). To investigate the possible role of DUNQU1 in liver tumorigenesis, we first investigated whether DUNQU1 is expressed only in tumor tissues. Two sets of primers were designed and used to detect DUNQU1 cDNA in a nested PCR experiment of 55 pairs of HCC samples (Supplementary Figure 3a). No signal was detected in most normal livers (except 4 HBV( þ ) male samples), even after extensive PCR amplification, whereas there was clear DNA amplification in most liver tumors (Figure 2c). Some of the samples showed three amplified DNA fragments, corresponding to the SP1 (425 bp), SP2 (332 bp) and SP1/SP2 hybrid forms (highest in the gel because of their loop structure). Direct

sequencing of a gel-eluted DNA fragment of the hybrid form showed a mixture of SP1 and SP2 (Supplementary Figure 7). To explore the expression patterns of DUNQU1 in HCC cell lines, we performed real-time PCRs and found that DUNQU1 expressed two isoforms in Hep3B, HepG2, Huh7 and PLC5 cell lines, but not in Mahlavu cell lines (Figure 3a). Quantification of the expression levels of DUNQU1 showed that, compared with the sequenced tumor liver, Hep3B had a higher expression level of DUNQU1 while the other three cell lines had a relatively lower expression level of DUNQU1 (Figure 3b). To explore the effect of DUNQU1 in HCC cells, we overexpressed DUNQU1 in Huh7 cell line which was examined via PCR (Figure 3c). Soft agar assay showed that overexpression of DUNQU1 increased the colony formation of Huh7 cell lines (Figure 3d). The results raise the possibility that DUNQU1 might play one or more roles in liver tumorigenesis. Alternative splicing events show the changes in cell behaviors and may serve as new biomarkers of HCC To detect the alterations of AS events between normal and tumor liver tissues, we used SpliceTrap to identify significant AS events by comparing the exon inclusion ratios between normal and tumor reads. SpliceTrap reported 1003 AS events from 825 exons in 648 genes.20 Pathway analysis showed that the potentially AS genes were enriched in metabolism and cell-cell communication pathways (Supplementary Table 8).21 Most of the AS events were from exons with low inclusion ratios (Supplementary Figure 8). We report 38 AS events with high exon inclusion ratios (X 0.4) (Supplementary Table 9). After manual curation, we concluded that 14 of the AS events were relatively obvious (highlighted in

Figure 2. DUNQU1 has a tissue-specific expression pattern in HCV( þ ) and HBV(  )HCV(  ) liver tumors. (a) The RT-PCR results confirm that DUNQU1 is only expressed in the tumor part of the sequenced liver tumors (HCC1428). (b) The table shows the viral status and gender information for the 55 pairs of patient samples used for further RT-PCR validation. (c) In most cases, DUNQU1 was only detectable in liver tumors. The only exceptions were 4 pairs of male HBV( þ ) samples, showing signals in both adjacent normal and tumor liver samples. Some samples showed an extra DNA fragment at B500 bp. This signal represents a hybrid form of SP1 and SP2 arising during PCR (Supplementary Figure 7). & 2013 Macmillan Publishers Limited

Oncogene (2013) 1 – 9

Latent biomarkers in hepatocellular carcinoma K-T Lin et al

4

Figure 3. DUNQU1 enhances soft agar colony formation. (a) 40-cycle of real-time PCR products (Primer DUN-1 and DUN-7) were used for electrophoresis to confirm DUNQU1 expression in Hep3B, HepG2, Huh7, Mahlavu and PLC5 cell lines. (b) Quantification of DUNQU1 in all samples was normalized against the endogenous control b-actin. DDCt values were used to represent relative abundance of DUNQU1 compared to the sequenced tumor liver (1428T). Positive number represents that the expression level of DUNQU1 is higher than the sequenced tumor liver (1428T). (c) Primers DUN-8 and DUN-9 were used to detect both endogenous and exogenous DUNQU1 levels in lentivirus-transducted Huh7 cell lines. Ratios are relative to the WT Huh7 cell lines. (d) Soft agar colony formation assay. Mixed stable Huh7 cell lines were seeded in soft agar in triplicate. EGFP is a negative control for overexpression experiments.

red in Supplementary Table 9). Three of the AS events (FGFR2, EXOC7 and ADAM15) are cancer-related, and one AS event with a novel exon (TELO2) may have a role in the cell cycle.22 FGFR2 has a mutually exclusive AS event, which corresponds to the switch from its epithelial isoform to mesenchymal isoform (Supplementary Figure 9a). The AS event of FGFR2 is necessary for epithelial-mesenchymal transition, which is a critical event during tumorigenesis.23–25 Reduced protein level of FGFR2-IIIb is reportedly correlated with the tumor stages of HCC.26 EXOC7 changed its exon 7 inclusion ratio in the sequenced tumor liver (Supplementary Figure 9b). Similar to FGFR2, the AS event of EXOC7 has been reported to be epithelial-mesenchymal transitiondriven in human breast cancer.27 The protein encoded by EXOC7, Exo70, is a exocyst component that regulates cell migration and maintains the epithelial polarity at the plasma membrane. ADAM15 has different exon inclusion ratios for exon 20 and exon 21. The different use of exons 19–21 has been reported previously and was proposed as a diagnostic marker for cancer diagnostics.28,29 ADAM15 is an adhesion receptor on endothelial cells, and the protein expression of ADAM15 is reportedly associated with cancer cell proliferation and progression.30 Finally, TELO2 has a novel exon involved in an alternative 50 splice site event and seemed lost in the sequenced tumor liver (Supplementary Figure 10). The predicted transcript (CUFF.11822.1) with the novel exon had a shorter coding region (718 amino acids) than the coding region of the canonical transcript (ENST00000262319) (837 amino acids). Taken together, the AS events showed the changes in cell behaviors such as cell-cell adhesion, polarity and migration in HCC.

The switch from FGFR2-IIIb to FGFR2-IIIc in the liver tumors was significantly-associated with virus infection, and increased FGFR2IIIc inclusion ratio was associated with cirrhosis and tumor recurrence Real-time PCR of 43 pairs of adjacent normal and HCC samples showed that FGFR2-IIIb was down-regulated in 67.44% of the liver tumors, whereas FGFR2-IIIc was up-regulated in 46.51% of the liver Oncogene (2013) 1 – 9

tumors (Figure 4a). 65.11% of the HCC samples has stronger down-regulation of FGFR2-IIIb than FGFR2-IIIc. This resulted in that the FGFR2-IIIc inclusion ratios increased in 65.12% of the HCC samples (Figure 4b). Contingency table analysis showed that the DDCt values of FGFR2-IIIc and overall FGFR2 were associated with viral status (Table 1). Also, combining HBV( þ ) and HCV( þ ) HCCs together, we found the presence of virus infection in a patient who was associated with the downregulation of overall FGFR2 (P ¼ 0.001145) and FGFR2-IIIb (P ¼ 0.03511). Moreover, the switch from FGFR2-IIIb to FGFR2-IIIc in the liver tumors was significantly associated with virus infection (Table 2 and Figure 4b). Interestingly, we also found that FGFR2-IIIc inclusion ratios in the adjacent normal tissues were significantly correlated with tumor sizes (Pearson’s correlation ¼ 0.4661 and P value ¼ 0.001631)(Supplementary Figure 11b). Furthermore, FGFR2-IIIc inclusion ratio change (tumor minus normal) was associated with both cirrhosis and tumor recurrence (Table 2 and Supplementary Figure 11c). In adult normal and fetal liver tissues, FGFR2-IIIc inclusion ratios were near 50% or below 50%, whereas the ratios were much higher in the tumor cell lines (Supplementary Figure 11a). These evidences raised the possibility that FGFR2-IIIc might also have roles in liver tumorigenesis. DISCUSSION Our knowledge converges to a point based on what we have observed, and the point is often not fixed. By analyzing the first ultra-deep transcriptome landscape of human liver cancer, taking into account empirical validations and published evidence, the present study identified potential biomarkers for HCC, including ALG1L, SERPINA11, TMEM82 and DUNQU1 and the AS event of FGFR2. They were latent variables in the previous genome-wide studies of HCC. Thanks to the power of RNA-Seq, their importance can now be revealed. RNA-Seq is revolutionizing both the size and complexity of human transcriptome analysis. The idea of ‘genome-wide’ has no & 2013 Macmillan Publishers Limited

Latent biomarkers in hepatocellular carcinoma K-T Lin et al

5

Figure 4. Distributions of DDCt values of FGFR2 isoforms and FGFR2-IIIc inclusion ratios in 43 pairs of HCC samples. (a) Blue bars are DDCt values of FGFR2-IIIb and green bars are for FGFR2-IIIc. Patient IDs are ordered first by virus states and then sorted by the DDCt values of FGFR2IIIb in ascending order. (b) Patient IDs are arranged in the same order as in (a). The green bars are the FGFR2-IIIc inclusion ratio changes, which are the differences of tumor inclusion ratios minus normal inclusion ratios.

Table 1.

Contingency table analysis for FGFR2 and its isoforms based on the DDCt values of realtime PCR IIIb DDCt

Variable

IIIc DDCt

FGFR2 DDCt

%

þ



P value

þ



P-value

þ



Age at diagnosis, years X60 23 o60 20

53.49% 46.51%

9 5

14 15

0.3528

10 13

13 7

0.2233

8 7

15 13

1

Gender Male Female

21 22

48.84% 51.16%

4 10

17 12

0.104

11 12

10 10

1

6 9

15 13

0.5256

Pathology Stage I II III IV

16 12 14 1

37.21% 27.91% 32.56% 2.33%

4 5 4 1

12 7 10 0

0.4315

6 7 10 0

10 5 4 1

0.1537

4 5 6 0

12 7 8 1

0.7116

Cirrhosis Yes No

16 27

37.21% 62.79%

2 12

14 15

0.04471

8 15

8 12

0.7611

3 12

13 15

0.1095

Viral status HBV HCV None

18 13 12

41.86% 30.23% 27.91%

4 3 7

14 10 5

0.1112

11 3 9

7 10 3

0.02924

5 1 9

13 12 3

0.001976

Vascular invasion 0 2 4

18 16 6

45.00% 40.00% 15.00%

4 8 2

14 10 5

0.427

7 12 4

11 6 3

0.2612

5 8 2

13 10 5

0.6307

Tumor size, cm p5 45

23 20

53.49% 46.51%

5 9

18 11

0.1912

10 13

13 7

0.2233

6 9

17 11

0.2193

Recurrence Yes No Unknown n

15 23 5 43

34.88% 53.49% 11.63%

5 8 1 14

10 15 4 29

1

6 14 3 23

9 9 2 20

0.4699

5 8 2 15

10 15 3 28

1

Categorization

n

32.56%

53.49%

P-value

34.88%

* ‘ þ ’ means an increase of expression level where DCt (normal)–DCt (tumor) is positive. ‘  ’ means the expression level decreases.

& 2013 Macmillan Publishers Limited

Oncogene (2013) 1 – 9

Latent biomarkers in hepatocellular carcinoma K-T Lin et al

6 Table 2.

Contingency table analysis of FGFR2-IIIc inclusion ratio change Variable

Switch from IIIb to IIIc

FGFR2-IIIc inclusion ratio change

Categorization

n

%

Yes

No

P-value

þ



P-value

Age at diagnosis, year Z60 o60

23 20

53.49% 46.51%

10 8

13 12

1

15 13

8 7

1

Gender Male Female

21 22

48.84% 51.16%

9 9

12 13

1

15 13

6 9

0.5256

Pathology stage I II III IV

16 12 14 1

37.21% 27.91% 32.56% 2.33%

6 6 6 0

10 6 8 1

0.9553

11 8 9 0

5 4 5 1

0.7116

Cirrhosis Yes No

16 27

37.21% 62.79%

8 10

8 17

0.526

14 14

2 13

0.02293

Viral status HBV/HCV None

31 12

72.09% 27.91%

16 2

15 10

0.04637

23 5

8 7

0.07395

Vascular invasion 0 2 4

18 16 6

45.00% 40.00% 15.00%

7 9 2

11 9 5

0.7154

13 11 4

5 7 3

0.7567

Tumor size, cm p5 45

20 23

46.51% 53.49%

8 10

12 13

1

16 12

7 8

0.5401

Recurrence Yes No Unknwon n

15 23 5 43

34.88% 53.49% 11.63%

5 11 2 18

10 12 3 25

0.7522

6 17 5 15

9 6 0 8

0.02216

41.86%

65.22%

* ‘ þ ’ means an increase of FGFR2-IIIc inclusion ratio. ‘  ’ means the inclusion ratio decreased.

longer been limited to the B2% of the human genome. Known coding genes such as ALG1L, SERPINA11 and TMEM82, which were analyzed in the present study, are not detectable in traditional ‘genome-wide’ arrays and can serve as new DE genes in terms of expression patterns. Long non-coding RNAs (lncRNAs) such as MALAT1 and HULC have been shown to be associated with cancers and also had expected expression patterns in our sequenced samples (Supplementary Figure 12).31–33 Moreover, the Encyclopedia of DNA Elements (ENCODE) project recently reported that B75% of the human genome is transcribable at some point in some cells and can produce highly overlapped transcripts from both DNA strands.34 Taken together, the evidence suggests that we should rethink our approach to understanding the human transcriptome and elements in the human genome. DUNQU1, for instance, which is not documented in any current gene annotations including GENCODE (version 14) and thereby exceeds the boundary of B75% reported by ENCODE, has intriguing expression patterns and potentially has functions in liver tumorigenesis. Our cDNA libraries were constructed from mRNAs with a poly(A) tail. Therefore, non-Poly(A) RNAs such as ribosomal RNAs will not be captured. Also, cell types can have a considerable impact on whether or not a particular transcript is sequenced. In our sequenced liver samples, B80% of the reads were derived from the top 100 highest-expressed genes such as ALB. Transcripts from Oncogene (2013) 1 – 9

the remainder of the genome thus are less likely to be sequenced. Moreover, different sequencing strategy and alignment settings can result in significantly different sizes of mapped regions. For example, mapping pair-end reads by Bowtie can cover B6 to B8% of the genome depending on the seed length (Supplementary Figure 13a). If we use only the first end of pairend reads to simulate alignments of single reads, the sizes range from B6 to B18% (Supplementary Figure 13b). We postulate that the reasons described above explain why there was such a discrepancy between results in our analysis, that is, 6.5 to 7.6% vs B75%. Since 6% of the genome can be perfectly aligned by 75 bp pair-end reads, it is likely that 6% is the minimum number of transcribable regions on the human genome (Supplementary Figure 13b). Still, this figure is much larger than the B2% that is currently accepted. In addition to the transcribed regions, an important dimension of the human transcriptome is AS. AS events can alter the expression of proteins and serve as potential targets for new treatment options.35 In the present study, we showed that the AS events of FGFR2 might be related to virus infection and the FGFR2IIIc inclusion ratios were related to tumor size, cirrhosis and tumor recurrence. Most importantly, it is intriguing to see that the more the FGFR2-IIIc in the liver tumors, the lower the tumor recurrence. It suggests that the bad ones expressing FGFR2-IIIc have been removed by surgery, and the paracrine networks are & 2013 Macmillan Publishers Limited

Latent biomarkers in hepatocellular carcinoma K-T Lin et al

7 disrupted.36,37 It might be the reason why the tumor recurrence rate became lower. If this is the case, detecting the abundance of FGFR2-IIIc might be a good target for improving prognosis.38 With the help of RNA-Seq and appropriate bioinformatics tools, it is now possible to investigate AS events more accurately. A recent study also performed RNA-Seq to analyze 10 pairs of HCC samples, and the results were significantly different both quantitatively and qualitatively with those of the present study.39 In the study by Huang et al., ‘single-end’ RNA-Seq (36 bases) were performed, capturing B21.6 million single-end reads with B10.6 million aligned reads per sample. In contrast, in the present study, we performed ‘pair-end’ RNA-Seq (75 bases) and captured B126 million read ‘pairs’ per sample, and each sample had B100 million read pairs properly aligned (Supplementary Table 1). The purposes of identifying DE genes in the two studies were also very different. Huang et al. randomly validated some DE genes to demonstrate the accuracy of RNA-Seq. In the present study, we identified DE genes not detected in previous genome-wide studies and correlated their expression levels with clinicopathological characteristics of HCCs. Moreover, there were also differences between the two studies with regard to AS. The main subject of interest in the study by Huang et al. was the identification of novel junctions, whereas in the present study we were primarily concerned with exploring the switch of AS events associated with HCC. Huang et al. identified a novel junction of ATAD2 expressed in 5 adjacent non-cancerous and 20 tumor liver samples. Although we detected many novel junctions (Supplementary Figure 2), we did not detect the novel junction of ATAD2. In the present study, we investigated the switch of AS events such as FGFR2 and ADAM15. A recent study suggested that at least B500 million single-end reads (50 bases) are required for detecting the changes of isoforms.40 This might explain why Huang et al. found it difficult to confirm their novel junctions because the junction of interest might not be supported by a sufficient amount of robust reads bridging the intron. Huang et al. did not release their RNA-Seq reads, so this cannot be confirmed. We have deposited our RNA-Seq reads on the GEO database and have made all alignments as well as assembled isoforms readily accessible on the UCSC genome browser. In conclusion, we have characterized the first ultra-deep liver cancer transcriptome landscape by validating several novel findings. The new biomarkers might be used as new diagnostic or prognostic markers for HCC biopsies by RT-qPCR or immunohistochemistry when the antibodies are available. Not only do these findings provide new insights for the field of liver cancer research, but they also serve as a valuable resource for understanding of the human transcriptome.

MATERIALS AND METHODS Clinical samples and RNA-Seq experiment The sequenced samples of HCC (1428T) and adjacent normal liver (1428N) tissues were obtained from a patient who had undergone curative hepatic resection for HCC at the Department of Surgery, Taipei Veterans General Hospital (Taipei, Taiwan). Curative resection was defined as complete clearance of the tumour macroscopically, with a microscopically clear margin. The patient had not received any preoperative treatment, such as chemotherapy, ethanol injection or transarterial chemoembolization. The diagnosis of HCC was confirmed by histological examination of surgicallyresected specimens. Tumour specimen and paired non-tumour liver tissue were obtained immediately after surgical resection; the non-tumour liver tissues were taken more than 10 mm away from the HCC. Samples of tumor tissue were free of necrotic region and were collected after histological examination. For empirical validations, 55 pairs of HCC samples from the Taiwan Liver Cancer Network were selected based on gender and viral status (Figure 2b). These samples were used in accordance with the IRB procedures of National Yang-Ming University. RNA samples including 1428N, 1428T and the 55 pairs of HCC samples were prepared from frozen resected tumor and non-tumor tissues directly. & 2013 Macmillan Publishers Limited

RNA quality was confirmed by gel electrophoresis. RNA preparation for 1428N and 1428T were carried out according to the manual of the RNeasy Mini Kit (Qiagen, Hilden, Germany), with an extra on-column DNase digestion for sample preparation. Free of DNA contamination and RNA integrity was first checked by gel electrophoresis. RNA samples for RNASeq experiment were further checked with Agilent Bioanalyzer to assess sample integrity. RNA-Seq experiment was performed in the genomic center of National Yang-Ming University. In total, we sequenced 8 lanes (4 lanes for adjacent normal and 4 lanes for tumor) using the Illumina GA2 platform with a 75 bp pair-end sequencing protocol and base calling with the Illumina pipeline, version 1.6. The sequencing data is deposit under accession number SRA043490. DUNQU1’s accession numbers are JF934746 and JF934747.

cDNA synthesis, primers and PCR reaction conditions cDNA was synthesized from 1 mg of total RNA using the ThermoScript RT-PCR system (Invitrogen, Life Technologies, Grand Island, NY, USA) with random primers. Real-time PCRs were performed using StepOnePlus system (Applied Biosystems, Life Technologies, Grand Island, NY, USA) and QuantiFast SYBR Green Real-time PCR Master Mix (Qiagen) with cycling conditions of 5 min at 95 1C and 40 cycles of 20 s at 95 1C and 40 s at 66 1C. The changes in gene expression were analyzed by the DDCt method, using -actin as an endogenous control. All primers are listed in Supplementary Table 10.

Short reads process To process B100 GB raw data (B253.6 million 75 bp pair-end reads), we used TopHat 1.4 and MapSplice 1.15.2 to identify candidate junctions.41,42 For transcriptome landscape construction, we did two runs. Run A aimed to assign FPKM values to known genes by Cufflinks.43 For each gene with normal FPKM (n) and tumor FPKM (t) values, we calculated the fold change based on the following equation: 8 < t=n; nX1 and tX1 Fold changeðn; tÞ ¼ 1=n; nX1 and to1 : t=1; no1 and tX1 Run B aimed to predict putative transcripts by de novo assembly using Cufflinks.43 The reference genome for bias detection and the correction algorithm was based on hg19. We used Ensembl 65 version for gene annotation.44

Differentially-included exons and inclusion ratio formula We combined the Gene Transfer Format (GTF) files generated from run B and the Ensembl 65 gene annotation to construct a new database for SpliceTrap to estimate the exon inclusion ratios in adjacent normal and tumor tissues for AS events such as CAssette exon (CA), Alternative Acceptor (AA), Alternative Donor (AD), and Intron Retention (IR).20

Statistical analysis and FGFR2-IIIc inclusion ratio Association test of contingency table analysis was carried out by Fisher exact test. Correlation test was carried out by Pearson’s product moment correlation coefficient and follows a Student’s t-distribution. FGFR2-IIIc inclusion ratio was estimated by the following equation: .  DCt DCt DCt FGFR2  IIIc inclusion ratio ¼ 2  ðIIIc Þ 2  ðIIIc  IIIb Þ þ 1

Plasmid and lentivirus protocol Two expression constructs (EGFP and 3Flag-DUNQU1) use the LJM1 lentiviral vector in which expression is driven by the CMV promoter. Lentiviral constructs were co-transfected with LP1, LP2 and LP/VSVG plasmid into 293T cells. Medium containing virus particles were harvested at 72 h and passed through a 0.45 mM filter. Target cells were transducted with lentivirus and mixed population of stable cell lines were generated by selection with puromycin for 7 days.

Soft agar colony formation assay Four percent agar in water was prepared and autoclaved, then kept in a water bath of 56 1C. The two-layer agar plate was prepared by mixing culture medium with a volume of 4% melted agar and then adding to culture dish immediately. For bottom layer, 5 ml culture medium Oncogene (2013) 1 – 9

Latent biomarkers in hepatocellular carcinoma K-T Lin et al

8 containing 0.75% agar was added into a 60 mm culture dish and kept at room temperature to allow the plate to solidify. For the top layer, 3  104 cells was mixed in 3 ml of medium containing 0.4% agar and added to the bottom layer plate and then placed in an incubator at 37 1C for 14 days.

ABBREVIATIONS HCC, hepatocellular carcinoma; RNA-Seq, whole-transcriptome shotgun sequencing; HCV, hepatitis C virus; HBV, hepatitis B virus; FPKM, fragments per kilobase of transcript per million mapped reads; DNA, deoxyribonucleic acid; GTF, Gene Transfer Format; AS, alternative splicing; CA, cassette exon; AA, alternative acceptor; AD, alternative donor; IR, intron retention; PCR, polymerase chain reaction; DE genes, differentially expressed protein-coding genes; TSS, transcription start side; PTC, premature termination codon; IHC, immunohistochemistry; RT-qPCR, reverse transcription quantitative real time polymerase chain reaction.

CONFLICT OF INTEREST The authors declare no conflict of interest.

ACKNOWLEDGEMENTS We thank the Taiwan Liver Cancer Network for providing the liver tumor tissue samples and related clinical data (all are anonymous) for this work. This network currently includes five major medical centers in Taiwan (National Taiwan University Hospital, Chang-Gung Memorial Hospital-Linko, Veteran General Hospital-Taichung, Chang-Gung Memorial Hospital-Kaohsiung and Veteran General Hospital-Kaohsiung). Taiwan Liver Cancer Network is supported by grants from the National Science Council (NSC94–3112-B-182–002, NSC97–3112-B-182–004) and National Health Research Institutes, Taiwan. We also want to thank National Core Facility Program for Biotechnology (Bioinformatics Consortium of Taiwan, NSC102–2319-B-010–002), National Research Program for Biopharmaceuticals (NRPB, NSC10102325-B-492–001) and National Center for High-performance Computing of National Applied Research Laboratories (NCHC, NARLabs) for providing computing and storage resources. Finally, we want to thank Professor Adrian R Krainer for his valuable comments on the manuscript and hosting at Cold Spring Harbor Laboratory. This research was supported by grants from the National Science Council (NSC101– 2627-B-010–001- and NSC102-2627-B-010-001-), Taipei Veterans General Hospital (V102E2–006), the National Health Research Institutes (NHRI-EX102–10029BI), Ministry of Economic Affairs (101-EC-17-A-17-S1–152) and the Ministry of Education, Aim for the Top University Plan (National Yang-Ming University) to C-YF. Huang. This research was also supported by the research aboard grant from the National Science Council (NSC 100–2917-I-010–001) to K-T. Lin.

AUTHOR CONTRIBUTIONS Kuan-Ting Lin and Yih-Jyh Shann designed the study, analyzed, interpreted the data and drafted the article. Kuan-Ting Lin performed RNA-Seq and statistical analysis. Yih-Jyh Shann performed the experimental validations. Gar-Yang Chau, Chun-Nan Hsu and Chi-Ying F Huang participated in the design of the study. All authors agreed to publication.

REFERENCES 1 Ozsolak F, Milos PM. RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 2011; 12: 87–98. 2 Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 2008; 321: 956–960. 3 Morin R, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh T et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively short-read sequencing. Biotechniques 2008; 45: 81–94. 4 Lander ES. Initial impact of the sequencing of the human genome. Nature 2011; 470: 187–197. 5 Clark MB, Amaral PP, Schlesinger FJ, Dinger ME, Taft RJ, Rinn JL et al. The reality of pervasive transcription. PLoS Biol 2011; 9: e1000625 , discussion e1001102. 6 Pickrell JK, Pai AA, Gilad Y, Pritchard JK. Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genet 2010; 6: e1001236.

Oncogene (2013) 1 – 9

7 Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 2011; 8: 469–477. 8 Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF et al. NCBI GEO: archive for functional genomics data sets--10 years on. Nucleic Acids Res 2011; 39: D1005–D1010. 9 Kapushesky M, Adamusiak T, Burdett T, Culhane A, Farne A, Filippov A et al. Gene Expression Atlas update--a value-added database of microarray and sequencing-based functional genomics experiments. Nucleic Acids Res 2012; 40: D1077–D1081. 10 Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M et al. ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res 2011; 39: D1002–D1004. 11 Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R, Yu J, Briggs BB et al. Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia 2007; 9: 166–180. 12 Petretti T, Kemmner W, Schulze B, Schlag PM. Altered mRNA expression of glycosyltransferases in human colorectal carcinomas and liver metastases. Gut 2000; 46: 359–366. 13 Parris TZ, Danielsson A, Nemes S, Kova´cs A, Delle U, Fallenius G et al. Clinical implications of gene dosage and gene expression patterns in diploid breast carcinoma. Clin Cancer Res 2010; 16: 3860–3874. 14 Zdobnov EM, Apweiler R. InterProScan--an integration platform for the signaturerecognition methods in InterPro. Bioinformatics 2001; 17: 847–848. 15 Hertervig E, Nilsson A, Nyberg L, Duan RD. Alkaline sphingomyelinase activity is decreased in human colorectal carcinoma. Cancer 1997; 79: 448–453. 16 Cheng Y, Wu J, Hertervig E, Lindgren S, Duan D, Nilsson A et al. Identification of aberrant forms of alkaline sphingomyelinase (NPP7) associated with human liver tumorigenesis. Br J Cancer 2007; 97: 1441–1448. 17 Eichler EE, Zimmerman AW. A hot spot of genetic instability in autism. N Engl J Med 2008; 358: 737–739. 18 Bochukova EG, Huang N, Keogh J, Henning E, Purmann C, Blaszczyk K et al. Large, rare chromosomal deletions associated with severe early-onset obesity. Nature 2010; 463: 666–670. 19 Myers RM, Stamatoyannopoulos J, Snyder M, Dunham I, Hardison RC, Bernstein BE et al. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol 2011; 9: e1001046. 20 Wu J, Akerman M, Sun S, McCombie WR, Krainer AR, Zhang MQ. SpliceTrap: a method to quantify alternative splicing under single cellular conditions. Bioinformatics 2011; 27: 3010–3016. 21 Kamburov A, Pentchev K, Galicka H, Wierling C, Lehrach H, Herwig R. ConsensusPathDB: toward a more complete picture of cell biology. Nucleic Acids Res 2011; 39: D712–D717. 22 Takai H, Wang RC, Takai KK, Yang H, de Lange T. Tel2 regulates the stability of PI3K-related protein kinases. Cell 2007; 131: 1248–1259. 23 Grosso AR, Martins S, Carmo-Fonseca M. The emerging role of splicing factors in cancer. EMBO Rep 2008; 9: 1087–1093. 24 David CJ, Manley JL. Alternative pre-mRNA splicing regulation in cancer: pathways and programs unhinged. Genes Dev 2010; 24: 2343–2364. 25 Warzecha CC, Sato TK, Nabet B, Hogenesch JB, Carstens RP. ESRP1 and ESRP2 are epithelial cell-type-specific regulators of FGFR2 splicing. Mol Cell 2009; 33: 591–601. 26 Amann T, Bataille F, Spruss T, Dettmer K, Wild P, Liedtke C et al. Reduced expression of fibroblast growth factor receptor 2IIIb in hepatocellular carcinoma induces a more aggressive growth. Am J Pathol 2010; 176: 1433–1442. 27 Shapiro IM, Cheng AW, Flytzanis NC, Balsamo M, Condeelis JS, Oktay MH et al. An EMT-driven alternative splicing program occurs in human breast cancer and modulates cellular phenotype. PLoS Genet 2011; 7: e1002218. 28 Kleino I, Ortiz RM, Huovila AP. ADAM15 gene structure and differential alternative exon use in human tissues. BMC Mol Biol 2007; 8: 90. 29 Ortiz RM, Karkkainen I, Huovila AP. Aberrant alternative exon use and increased copy number of human metalloprotease-disintegrin ADAM15 gene in breast cancer cells. Genes Chromosomes Cancer 2004; 41: 366–378. 30 Mochizuki S, Okada Y. ADAMs in cancer cell proliferation and progression. Cancer Sci 2007; 98: 621–628. 31 Lin R, Maeda S, Liu C, Karin M, Edgington TS. A large noncoding RNA is a marker for murine hepatocellular carcinomas and a spectrum of human carcinomas. Oncogene 2007; 26: 851–858. 32 Panzitt K, Tschernatsch MM, Guelly C, Moustafa T, Stradner M, Strohmaier HM et al. Characterization of HULC, a novel gene with striking up-regulation in hepatocellular carcinoma, as noncoding RNA. Gastroenterology 2007; 132: 330–342.

& 2013 Macmillan Publishers Limited

Latent biomarkers in hepatocellular carcinoma K-T Lin et al

9 33 Gutschner T, Diederichs S. The hallmarks of cancer: a long non-coding RNA point of view. RNA Biol 2012; 9: 703–719. 34 Ecker JR, Bickmore WA, Barroso I, Pritchard JK, Gilad Y, Segal E. Genomics: ENCODE explained. Nature 2012; 489: 52–55. 35 Hua Y, Sahashi K, Rigo F, Hung G, Horev G, Bennett CF et al. Peripheral SMN restoration is essential for long-term rescue of a severe spinal muscular atrophy mouse model. Nature 2011; 478: 123–126. 36 Turner N, Grose R. Fibroblast growth factor signalling: from development to cancer. Nat Rev Cancer 2010; 10: 116–129. 37 Huijts PE, van Dongen M, de Goeij MC, van Moolenbroek AJ, Blanken F, Vreeswijk MP et al. Allele-specific regulation of FGFR2 expression is cell type-dependent and may increase breast cancer risk through a paracrine stimulus involving FGF10. Breast cancer research: BCR 2011; 13: R72. 38 Somarelli JA, Schaeffer D, Bosma R, Bonano VI, Sohn JW, Kemeny G et al. Fluorescence-based alternative splicing reporters for the study of epithelial plasticity in vivo. Rna 2013; 19: 116–127.

39 Huang Q, Lin B, Liu H, Ma X, Mo F, Yu W et al. RNA-Seq analyses generate comprehensive transcriptomic landscape and reveal complex transcript patterns in hepatocellular carcinoma. PLoS One 2011; 6: e26168. 40 Toung JM, Morley M, Li M, Cheung VG. RNA-sequence analysis of human B-cells. Genome Res 2011; 21: 991–998. 41 Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNASeq. Bioinformatics 2009; 25: 1105–1111. 42 Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res 2010; 38: e178. 43 Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010; 28: 511–515. 44 Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y et al. Ensembl 2011. Nucleic Acids Res 2011; 39: D800–D806.

Supplementary Information accompanies this paper on the Oncogene website (http://www.nature.com/onc)

& 2013 Macmillan Publishers Limited

Oncogene (2013) 1 – 9

Identification of latent biomarkers in hepatocellular carcinoma by ultra-deep whole-transcriptome sequencing.

There is an urgent need to identify biomarkers for hepatocellular carcinoma due to limited treatment options and the poor prognosis of this common let...
1MB Sizes 0 Downloads 0 Views