Decoding genomes Shankar Balasubramanian*1

Biochemical Society Transactions

www.biochemsoctrans.org

*Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.

Heatley Medal Lecture

Delivered at the 76th Harden Conference held at Wellcome Trust Genome Campus, Hinxton, Cambridge, on 2 September 2014

Shankar Balasubramanian

Abstract

The primary sequence of DNA can be decoded a million times faster and more cheaply than was possible 20 years ago. This capability is transforming our understanding of biology and has stimulated efforts to influence modern medicine through routine sequencing of human genomes. I describe how Solexa-Illumina sequencing originated from our laboratory and was developed into widely used commercial sequencing platforms. I also discuss examples of how this approach is being employed to exploit genome sequencing for medicine.

Key words: nucleic acid, genome, genomic medicine, sequencing.

1 email [email protected].

Biochem. Soc. Trans. (2015) 43, 1–5; doi:10.1042/BST20140254

Background

In 1952, Todd and Brown remarked that “there can be no finality about any nucleic acid structure at the present time, since it is clear that there is no available method for determining the nucleotide sequence” [1]. This statement was made in the wake of the elucidation of the chemical structure of DNA and RNA, and preceded the Watson and Crick publication of the three-dimensional structure of double-helical B-form DNA [2], which disclosed the molecular basis of the genetic code. This marked the beginning of significant interest in creating methods to decode the primary sequence of nucleic acids. The earliest sequencing method I am aware of involved the stepwise degradation of RNA from the 3′-end, via cycles of oxidation followed by 1,2-elimination and dephosphorylation, published by Todd and colleagues in 1953 [3].

Many methods were explored over the subsequent two decades, of which two emerged as particularly practical. The approach of Maxam and Gilbert [4] applied chemistries that cleave the DNA strand: a combination of four sets of reactions and conditions was sufficiently selective for the nucleobases to resolve all four bases once the cleaved fragments were separated by polyacrylamide gel electrophoresis. Sanger et al. [5] devised a distinct method that used dideoxynucleotide triphosphate ‘terminators’ to cause base-specific termination of DNA-templated synthesis by a DNA polymerase. Once again, polyacrylamide gel electrophoresis was vital to separate the resultant DNA fragments in order to decode the sequence. Both methods were subsequently used to decode small genomes that included viruses, bacteria and phage. The Sanger method proved amenable to considerable further optimization and automation, leading to the fluorescence-based capillary electrophoresis sequencers that were used in the Human Genome Project.

By the mid- to late 1990s, automated sequencers based on the Sanger method were able to decode of the order of 10⁵ bases per experimental run. A single copy of the human genome is just over 3×10⁹ bases. Decoding the first human genome was an expensive, time-consuming and hugely important endeavour, executed by a sizeable international consortium to provide a reference human genome sequence [6]. A paradigm shift was needed to go beyond the reference genome and render the sequencing of human (and other large) genomes a routine practical process.

In the present article, I describe the genesis, development and application of a working method for decoding human genomes at a speed, cost and accuracy that has enabled sequencing on a population scale. The approach originated at the University of Cambridge; we then set up the start-up company Solexa to develop and commercialize the technology, and Solexa was later acquired by Illumina, which developed the technology further. I also briefly present a perspective on the future of decoding DNA on a population scale.
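As a rough illustration of the scale gap described above, the two figures quoted in the text (about 10⁵ bases per run and just over 3×10⁹ bases per genome) can be combined into a back-of-envelope count of runs. This is only a sketch of that arithmetic: the function name `runs_needed` and the `depth` parameter are illustrative assumptions, and real projects required many-fold redundancy plus assembly and finishing that are ignored here.

```python
import math

# Figures quoted in the text; everything else is illustrative arithmetic.
BASES_PER_SANGER_RUN = 1e5   # "of the order of 10^5 bases per experimental run"
HUMAN_GENOME_BASES = 3e9     # "just over 3x10^9 bases"


def runs_needed(genome_bases: float, bases_per_run: float, depth: float = 1.0) -> int:
    """Minimum number of runs to produce depth x genome_bases of raw sequence."""
    return math.ceil(genome_bases * depth / bases_per_run)


# A single pass over the genome, with no redundancy at all.
single_pass_runs = runs_needed(HUMAN_GENOME_BASES, BASES_PER_SANGER_RUN)
print(single_pass_runs)
```

Even this lower bound is in the tens of thousands of runs, which is the sense in which a paradigm shift was needed.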


© The Authors Journal compilation © 2015 Biochemical Society

Biochemical Society Transactions (2015) Volume 43, part 1

Origins of Solexa sequencing

The initiation of this project happened by accident. During the early stages of my career, I was carrying out some single-molecule experiments in collaboration with David Klenerman in the Chemistry Department at Cambridge. We built a single-molecule fluorescence detection instrument with the intention of exploring the action of a DNA polymerase (Klenow fragment) on a DNA template primer by single-molecule fluorescence. During the course of these exploratory studies, we evaluated various sites to label with the fluorophore, which included the DNA, the polymerase protein, the dNTP monomers or, indeed, pairwise combinations [7]. We also explored keeping all the components in free solution, or immobilizing the DNA on a bead or on a glass surface [8–10]. In discussions over beer, David Klenerman and I came to the realization that extension of a DNA template primer immobilized on a surface by a polymerase, using a fluorescently tagged dNTP, could be exploited to decode the template DNA strand (Figure 1). This led to a change in our thinking and direction in 1997, from exploratory research on the mechanism of a DNA polymerase towards a method for decoding DNA.

Figure 1 Solid-phase DNA sequencing

Solexa sequencing

The method for solid-phase sequencing would require four building blocks, each colour-coded with a fluorescent dye to mark the identity of the base (Figure 1). A DNA template primer, immobilized at the surface, would be extended by one nucleotide through high-fidelity incorporation by a DNA polymerase, followed by immediate termination of synthesis, preventing a second nucleotide from being spontaneously incorporated. At this point, the dye could be detected by imaging to reveal the identity of the incorporated cognate base and, after retrieving the information, both the dye and the terminator entity could be removed in order to enable a second round of incorporation. The decoding of DNA would thus be done exactly one base at a time with stepwise molecular control.
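The cycle described above (incorporate one blocked, dye-labelled nucleotide; image; cleave dye and terminator; repeat) can be sketched as a simple loop. This is a minimal toy model and not the instrument's software: the dye colours, the function name `sequence_by_synthesis` and the decision to ignore strand orientation, chemistry and imaging noise are all assumptions made for illustration.

```python
# Toy model of stepwise sequencing with reversible terminators.
# Dye colours are hypothetical labels; strand orientation is ignored.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
DYE_FOR_BASE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_DYE = {dye: base for base, dye in DYE_FOR_BASE.items()}


def sequence_by_synthesis(template: str, cycles: int) -> str:
    """Return the bases called over the given number of cycles, one per cycle."""
    calls = []
    for position in range(min(cycles, len(template))):
        # 1. Incorporation: the polymerase adds the nucleotide complementary
        #    to the template base; the 3' block prevents a second addition.
        incorporated = COMPLEMENT[template[position]]
        # 2. Imaging: the dye colour reveals which base was incorporated.
        observed_dye = DYE_FOR_BASE[incorporated]
        calls.append(BASE_FOR_DYE[observed_dye])
        # 3. Cleavage: dye and 3' block are removed, so the next loop
        #    iteration can extend the strand by exactly one more base.
    return "".join(calls)


read = sequence_by_synthesis("ACGTTGCA", cycles=5)
template_prefix = "".join(COMPLEMENT[base] for base in read)
print(read, template_prefix)
```

The point of the sketch is the control flow: because each cycle is terminated after one incorporation, the read length equals the number of cycles and each position yields exactly one colour.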


We recognized early on that the polymerase would need to be modified to accommodate any chemical changes to the dNTP. This was ultimately achieved by protein engineering that was guided by the substantial body of knowledge on the structure, function and mechanism of DNA polymerases. We adopted the strategy of using a protecting group on the 3′-oxygen of the incoming dNTP to restrict polymerase extension to one nucleotide. A small protecting group minimized perturbation in the polymerase active site during catalysis. The fluorescent dye was attached via a cleavable linker to the non-Watson–Crick edges of the pyrimidines (at C5) and purines (via N7 deaza modification). Our first-generation sequencing nucleotide is shown in Figure 2. The chemistry required to cleave the carbon–oxygen bond of the azidomethyl group at the 3′-oxygen, and the corresponding bond in the linker, was carried out by Staudinger reduction [11], using a water-soluble phosphine. This overall approach was reduced to practice to provide a method for stepwise solid-phase sequencing of an immobilized DNA sample.

Figure 2 First-generation sequencing nucleotide

To achieve high capacity from solid-phase sequencing, a massively parallel DNA sample array is prepared by fragmenting the genomic source DNA, then immobilizing the sample by contacting a surface at high sample dilution. This can produce a single-molecule array of sample DNA fragments in which each sample molecule is separated from its neighbours by a distance greater than the diffraction limit. Using optical imaging, each sample molecule can be decoded in parallel by solid-phase fluorescence sequencing. An important change to our original strategy was to transform the single-molecule DNA array by adopting the method of Kawashima and colleagues [12] to locally amplify each fragment molecule on the surface, producing a cluster of identical fragments without loss of spatial integrity. This on-chip amplification step preserves the simplicity and clonal purity of the initial single-molecule array, whereas the amplified fragments improve sequencing accuracy by reducing the stochastic errors that arise from single-molecule analysis, and also allow imaging to be carried out with less-expensive optics.

In 1997, Klenerman and I predicted, on the basis of some rudimentary estimates and calculations, that this overall approach would be capable of decoding approximately one billion bases of DNA per experiment (Figure 3). In 2006, Solexa launched a commercial instrument called the Genome Analyzer (Figure 4) that sequenced one billion bases of human DNA, accurately, in a single experiment, realizing our goal to sequence on the scale of the human genome per experiment.
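One way to see why a cluster of identical copies can reduce stochastic error relative to a single molecule is a toy majority-vote model. This is an illustrative assumption and not a description of how the instrument calls bases (a real cluster produces a summed fluorescence signal): here each molecule in a cluster is assumed to give the wrong signal independently with a hypothetical probability, and the cluster is miscalled only if more than half of its molecules are wrong.

```python
from math import comb


def majority_error(per_molecule_error: float, cluster_size: int) -> float:
    """Probability that more than half of the molecules in a cluster are wrong.

    Assumes independent errors; use odd cluster sizes to avoid ties.
    """
    p = per_molecule_error
    n = cluster_size
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Hypothetical 10% per-molecule error, for illustration only.
for size in (1, 11, 101):
    print(size, majority_error(0.1, size))
```

Under these assumptions a cluster of one reproduces the single-molecule error rate, and the probability of a wrong call falls rapidly as the cluster grows.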

Figure 3 An early calculation that predicted a pathway to a system capable of sequencing several billion bases of DNA

From Solexa to Illumina

In early 2007, Illumina acquired Solexa and the core Solexa sequencing technology, and started a phase of evolving and optimizing the first-generation sequencing platform. The core technology has been improved and developed into a suite of sequencing platforms, from the MiSeq, a benchtop sequencer for life scientists that is capable of sequencing billions of bases of DNA in hours, to the high-end HiSeq platforms, now able to decode a trillion bases of DNA per experimental run in under a week. The cost of decoding an accurate human genome has now fallen to below US$1000, and several whole human genomes can be sequenced at 30-fold depth in just a few days on a single instrument. The current state of the Solexa-Illumina sequencing approach is approximately a million-fold improvement in speed, capacity and cost compared with the state-of-the-art platforms in general use when we started the project in 1997.
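The relationship between run capacity, genome size and sequencing depth quoted above can be checked with simple arithmetic. The sketch below uses only figures stated in the text (a genome of about 3×10⁹ bases, 30-fold depth, a trillion bases per run); the variable names are illustrative, and the result is an upper bound that ignores reads lost to quality filtering or uneven coverage.

```python
# Figures taken from the text; the calculation itself is a rough upper bound.
HUMAN_GENOME_BASES = 3e9   # "just over 3x10^9 bases"
DEPTH = 30                 # "30-fold depth"
BASES_PER_RUN = 1e12       # "a trillion bases of DNA per experimental run"

# Raw sequence required for one genome at the stated depth.
bases_per_genome = HUMAN_GENOME_BASES * DEPTH

# Whole genomes that fit into a single run, rounding down.
genomes_per_run = int(BASES_PER_RUN // bases_per_genome)

print(bases_per_genome, genomes_per_run)
```

A capacity of roughly ten genomes per run is consistent with the statement that several whole human genomes can be sequenced at 30-fold depth on a single instrument.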

Figure 4 Genome Analyzer (Solexa, 2006)

Whole genomes and genomic medicine

Subsequent to the release of the first Solexa sequencing systems in 2006, many innovative research users of the technology have developed a growing range of applications. These applications exploit the ability to decode a large number of DNA (or RNA) fragments generated by experiment, where one does not presume anything about the sequence (other than the source/organism). An added feature is that if any part of the source genome is represented more than once on the sequencing array, it can be counted for comparison, because sequencing is a digital approach to quantifying the fragments of DNA provided by the input. Such applications include the mapping of proteins to the genome (ChIP-Seq), quantitative analysis of transcripts (RNA-Seq) and the determination of the three-dimensional structure of chromatin by chromatin conformational capture (Hi-C), to name but a few.

However, our main drive for pursuing the development of this technology had been to enable the accurate, systematic exploration of the genetic basis of human disease by whole human genome sequence analysis. This motivation had been triggered in large part by a visit that Klenerman and I made to the Wellcome Trust Sanger Institute in early 1998. Discussions with key scientists who led the Human Genome Project (which was approximately 30% complete at that time) left us with two important messages. First, the project would be completed. Secondly, although it would provide an important reference human genome, many more fully decoded human genomes would be required to begin to understand exactly how genetic variation causes human disease. We had envisaged the need ultimately to sequence human genomes on a population scale to elucidate the underlying genetics of complex and rare genetic conditions. The first human genome we sequenced by the Solexa approach was also the first African genome [13]. This marked the emergence of routine reporting of studies comprising one or more sequenced human genomes.

The clinical applications of whole human genome sequencing quickly followed the release of commercial Solexa-Illumina sequencing systems into the public domain. One of the earliest examples (published in 2010) arose from the laboratory of Marra and co-workers [14], who used whole-genome sequencing to track the evolution of a mutating cancer genome in a patient undergoing chemotherapy. The pattern of genetic changes was employed to help physicians make rational prescribing decisions as the tumour became progressively drug-resistant. This case demonstrated the potential to apply real-time whole-genome sequencing to bring patient benefit. It also suggested that an increased understanding of pathways of cancer evolution, particularly when the cancer is subjected to particular clinical therapies, might lead to rational combination therapies from the outset for an improved patient outcome. Large-scale cancer genome sequencing projects are systematically unravelling a catalogue of genetic variants that drive different cancers. Such work has already been instrumental in identifying prominent genetic signatures that seem to recur in cancers [15].

‘Rare diseases’ collectively affect 1 in 17 people, usually in early childhood, and are predominantly genetic in origin (http://www.raredisease.org.uk/). Although such conditions are collectively not rare, individual cases are challenging to detect, characterize and treat. In pioneering pediatric clinics, whole-genome sequencing (typically of the patient plus both parents for comparison) is starting to be employed as standard practice for the rapid diagnosis of genetic rare diseases in infants, with positive results [16,17].
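The digital counting idea mentioned earlier in this section (a region represented more than once on the array can simply be counted) underlies quantitative applications such as RNA-Seq. A minimal sketch follows, assuming reads have already been assigned to named regions by some upstream alignment step; the region labels `geneA` and `geneB` and the function names are hypothetical, and no normalization for region length is attempted.

```python
from collections import Counter


def count_fragments(assigned_regions):
    """Count how many sequenced fragments were assigned to each region."""
    return Counter(assigned_regions)


def relative_abundance(counts):
    """Convert raw counts to fractions of all counted fragments."""
    total = sum(counts.values())
    return {region: n / total for region, n in counts.items()}


# Hypothetical input: one region label per sequenced fragment.
reads = ["geneA", "geneB", "geneA", "geneA"]
counts = count_fragments(reads)
fractions = relative_abundance(counts)
print(counts, fractions)
```

Because each fragment is read out individually, comparison between regions or samples reduces to comparing integer counts, which is what makes the approach digital.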

The future

In a pioneering initiative launched by the U.K. Government and the National Health Service (NHS), our sequencing approach will be employed to decode the genomes of 100 000 patients (∼0.2% of the U.K. population) (http://www.genomicsengland.co.uk/). This pilot project will build infrastructure comprising linked data, clinical records and know-how to help us optimize the use of genome data for medicine on a national scale. Other such projects are inevitable, and the next 10 years will reveal the extent to which human genome sequencing will usefully shape the way we consider and practise medicine.

Acknowledgements

I am mindful of the growing expectations for scientists to pursue research that can be predicted to be ‘useful’. I am grateful for having had the opportunity to see something practical grow from the blue-skies research that David Klenerman and I embarked on nearly 20 years ago. The underpinning science was funded by the Biotechnology and Biological Sciences Research Council (BBSRC) of the U.K. Experimental science is the product of contributions from many individuals. I thank David Klenerman and the co-workers in my laboratory and the Klenerman laboratory who carried out exploratory work between 1995 and 2000 that turned out to be more important than we had realized at that time. I thank the many contributors from Solexa and Illumina who have driven, and continue to drive, the core technology way beyond my expectations.

References

1 Brown, D.M. and Todd, A.R. (1952) Nucleotides. Part IX. The synthesis of adenylic acids a and b from 5′-trityl adenosine. J. Chem. Soc. 52, 44–51
2 Watson, J.D. and Crick, F.H. (1953) Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171, 737–738
3 Brown, D.M., Fried, M. and Todd, A.R. (1953) The determination of nucleotide sequence in polyribonucleotides. Chem. Ind., 352–353
4 Maxam, A.M. and Gilbert, W. (1977) A new method for sequencing DNA. Proc. Natl. Acad. Sci. U.S.A. 74, 560–564
5 Sanger, F., Nicklen, S. and Coulson, A.R. (1977) DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U.S.A. 74, 5463–5467
6 International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431, 931–945
7 Furey, W.S., Joyce, C.M., Osborne, M.A., Klenerman, D., Peliska, J.A. and Balasubramanian, S. (1998) Use of fluorescence resonance energy transfer to investigate the conformation of DNA substrates bound to the Klenow fragment. Biochemistry 37, 2979–2990
8 Osborne, M.A., Balasubramanian, S., Furey, W.S. and Klenerman, D. (1998) Optically biased diffusion of single molecules studied by confocal fluorescence microscopy. J. Phys. Chem. B 102, 3160–3167
9 Osborne, M.A., Furey, W.S., Klenerman, D. and Balasubramanian, S. (2000) Single-molecule analysis of DNA immobilised on microspheres. Anal. Chem. 72, 3678–3681
10 Osborne, M.A., Barnes, C.L., Balasubramanian, S. and Klenerman, D. (2001) Probing DNA surface attachment and local environment using single molecule spectroscopy. J. Phys. Chem. B 105, 3120–3126
11 Staudinger, H. and Meyer, J. (1919) Über neue organische Phosphorverbindungen III. Phosphinmethylenderivate und Phosphinimine. Helv. Chim. Acta 2, 635–646
12 Adessi, C., Matton, G., Ayala, G., Turcatti, G., Mermod, J.-J., Mayer, P. and Kawashima, E. (2000) Solid phase DNA amplification: characterisation of primer attachment and amplification mechanisms. Nucleic Acids Res. 28, E87
13 Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P., Milton, J., Brown, C.G., Hall, K.P., Evers, D.J., Barnes, C.L., Bignell, H.R. et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59
14 Jones, S.J.M., Laskin, J., Li, Y.Y., Griffith, O.L., An, J., Bilenky, M., Butterfield, Y.S., Cezard, T., Chuah, E., Corbett, R. et al. (2010) Evolution of an adenocarcinoma in response to selection by targeted kinase inhibitors. Genome Biol. 11, R82
15 Alexandrov, L.B., Nik-Zainal, S., Wedge, D.C., Aparicio, S.A., Behjati, S., Biankin, A.V., Bignell, G.R., Bolli, N., Borg, A., Børresen-Dale, A.L. et al. (2013) Signatures of mutational processes in human cancer. Nature 500, 415–421
16 Saunders, C.J., Miller, N.A., Soden, S.E., Dinwiddie, D.L., Noll, A., Alnadi, N.A., Andraws, N., Patterson, M.L., Krivohlavek, L.A., Fellis, J. et al. (2012) Rapid whole-genome sequencing for genetic disease diagnosis in neonatal intensive care units. Sci. Transl. Med. 4, 154ra135
17 Jacob, H.J., Abrams, K., Bick, D.P., Brodie, K., Dimmock, D.P., Farrell, M., Geurts, J., Harris, J., Helbling, D., Joers, B.J. et al. (2013) Genomics in clinical practice: lessons from the front lines. Sci. Transl. Med. 5, 194cm5

Received 22 September 2014 doi:10.1042/BST20140254
