Large-scale DNA sequencing

Tim Hunkapiller, Robert J. Kaiser, Ben F. Koop and Leroy Hood

California Institute of Technology, Pasadena, California, USA

The problems of large-scale sequencing are discussed, together with existing and future strategies, and the technology capable of achieving the goal of a productive, efficient process is reviewed.

Current Opinion in Biotechnology 1991, 2:92-101

© Current Biology Ltd ISSN 0958-1669

Abbreviations: ALF, automatic laser fluorescence; PCR, polymerase chain reaction; Taq, Thermus aquaticus; YAC, yeast artificial chromosome.

Introduction

The timetable for completing the human genome initiative (greater than 3 x 10^9 bp) is generally thought to be of the order of 15 years. To best define the procedures and technologies needed to fulfill this goal, paradigm large-scale sequencing projects (the analysis of mostly contiguous segments of DNA, 1 Mb or larger) will be initiated for proposed completion in the next 5 years. These large-scale projects will focus on the analysis of either the genomes of simple model organisms such as Escherichia coli and yeast, or discrete and highly defined 1-2 Mb mammalian loci that are of biological interest and informationally complex. To date, no projects of this scale have been completed, although several megabase projects have recently been initiated or proposed [1••]. Certainly the number of sequencing projects that, at this stage, are considered relatively large is growing rapidly as the technologies for sequencing mature. Release 64 of the GenBank sequence databank contains 32 sequences that are greater than approximately 25 kb, 12 of which have been released in the past two years [2•-11•,12••,13••]. The past year has also seen the completion of the largest sequence to date: the complete genome of human cytomegalovirus, which is almost a quarter of a million bases in length [12••].

There are many processes and steps involved in DNA sequencing and each has a variety of possible technological solutions. There are probably any number of strategies that could in theory generate data at the rate necessary for projects of this type. They must all, however, be judged not only on theoretical peak potential, but also on their potential for day-to-day consistency and sustained throughput. We believe that sustainable throughput will be less concerned with the technology used to ascertain the actual base data than with the integration of the necessary subprocesses (e.g. cloning, mapping, sequencing, informatics). This integration will ultimately depend on how well the entire sequencing process is automated.

The body of literature spanning the technologies related to sequencing is so diverse, and the number of new sequences so large (several thousand over the past year alone), that it is difficult to cover all of the significant publications and accomplishments of the past year or two here. Also, the literature related specifically to large-scale sequencing is limited as this is still a very immature field. Therefore, this review will focus on a discussion of the problems to be solved, the potential methods and technologies to provide a high-throughput, low-cost process, and developments which, in our opinion, will be the most significant in the next 5 years.

Technology and methods

Mapping and cloning

Before sequencing, the DNA must be identified, isolated and characterized. In any large-scale project, the DNA must be cloned into relatively large units that facilitate screening, handling and ordering. Much of the effort of such a project must be directed to this primary cloning process, which includes any subcloning on a scale larger than that directly sequenceable. The potential difficulties encountered in this type of cloning, particularly in selecting enough clones to cover the entire area of interest, are the first and possibly the worst of the bottlenecks that occur in projects of this type. Although a large original clone may require multiple further subcloning stages in order to provide sequenceable units of DNA template, the larger the initial clone, the easier it is to define and isolate the DNA of the desired locus. These subcloning steps are generally easier when the source DNA has a defined size, as libraries can be smaller, thereby facilitating the screening process and other steps.

DNA sequencing

DNA sequence chemistry

Two general methods of DNA sequence analysis are currently available. These are the dideoxynucleotide chain termination method of Sanger [14,15] and the chemical degradation method of Maxam and Gilbert [16]. Both methods result in the production of a nested set of DNA fragments with the same starting point, terminating at every nucleotide throughout the sequence. This set is electrophoresed through a polyacrylamide gel for size separation. Detection of reporter groups on each fragment reveals the order of the sequenced bases. Various methods are used to label, separate and detect the set of fragments. For all methods, the raw data from each sequencing operation must be entered into a computer and aligned or assembled with the data already accumulated to generate the finished sequence of a large region of DNA.

The chain termination method of Sanger employs an enzyme (DNA polymerase), deoxynucleotides and chain-terminating dideoxynucleotides to extend a small oligonucleotide primer which has been annealed to a single-stranded DNA template of interest. Chain elongation is terminated whenever a dideoxy analog is incorporated instead of a deoxynucleotide. Correct titration of the normal and analog nucleotides results in a nested set of DNA fragments complementary to the template strand, each fragment terminated by a dideoxynucleotide. Reporter labels can be incorporated into the primer, deoxynucleotides or dideoxynucleotides, depending on the type of label and the method of detection. The Sanger method has become the usual method of choice for larger sequencing efforts [2•-11•,12••,13••].

The most critical reagent used in the Sanger method is the DNA polymerase. Various polymerases have been used, each with differing optimal conditions and results [17]. Three have been of particular importance. Historically, the Klenow fragment of E. coli DNA polymerase has been the most commonly used enzyme. It can operate at elevated temperatures (37°C-55°C) and is often the least affected by modifications to the nucleotide analogs (e.g. the addition of fluorophores). Klenow incorporates the various deoxy- and dideoxynucleotides with different efficiencies, resulting in a variation in the number of molecules generated for each fragment length and hence in relative signal intensity. This variation is dependent on both nucleotide and substrate (local sequence). Chemically and genetically modified versions of T7 polymerase (Sequenase 1.0 and 2.0) are now often used in place of Klenow because they catalyse higher rates of nucleotide incorporation and have greater processivity [18]. These two factors not only permit generally longer sequencing reactions (number of bases readable), but also provide a better opportunity for tailoring the reaction mixture to favor a particular window of sequence along the template. The substitution of Mn2+ for Mg2+ as the cation in Sequenase reactions has greatly increased the consistency of incorporation. This is particularly important for automated interpretation. The polymerase of the heat-tolerant Thermus aquaticus (Taq) has recently been applied to DNA sequencing as well as to the polymerase chain reaction (PCR) [19,20]. Taq shows interesting similarity to Klenow in its particular patterns of inconsistency of incorporation (C Fuller, personal communication), but in spite of this still provides a considerably more even band intensity, although not as good as that seen for Sequenase with Mn2+. Also, because of its higher optimum reaction temperature, no separate annealing step is required. Taq gives much better results than other enzymes with templates that manifest secondary structure-related stops. By cycling the temperatures of the denaturing, annealing and elongation steps, as with PCR, a linear amplification of signal is possible. This cycle sequencing can prove particularly effective for sequencing double-stranded templates (C Heiner, personal communication).

The chemical degradation method of Maxam and Gilbert uses a set of between four and six chemical reaction conditions to degrade a labeled DNA fragment at different specific nucleotides prior to polyacrylamide gel electrophoresis. The chemical degradation method, although significantly more labor intensive, is often very useful for analyzing regions of DNA that prove difficult to sequence with standard Sanger dideoxynucleotide procedures. Because chemical methods of DNA strand scission are not as sensitive to secondary structure effects, sequencing artifacts are far less prevalent than with the Sanger method. In addition, as modified by Church and Gilbert [21], the chemical degradation method may be used for direct sequence analysis of genomic and other very large DNAs or cloned fragments.
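To make the chain-termination logic concrete, the following toy Python simulation (our own illustration, not from the article; the template string, termination probability and molecule count are invented for demonstration) generates the four nested fragment sets and recovers the synthesized strand by reading fragment lengths in order, just as one reads a gel from bottom to top:

```python
import random

def sanger_reaction(template, ddntp, stop_prob=0.3, n_molecules=5000):
    """Simulate one chain-termination reaction for a single dideoxy base.

    Synthesis walks along the template; wherever the complementary base
    equals ddntp, the chain terminates with probability stop_prob (set by
    the dideoxy:deoxy titration). Returns the observed fragment lengths.
    """
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    lengths = set()
    for _ in range(n_molecules):
        for pos, base in enumerate(template, start=1):
            if comp[base] == ddntp and random.random() < stop_prob:
                lengths.add(pos)  # fragment of this length, ending in ddntp
                break
    return lengths

template = "ATGGCGTACCTTGAC"
calls = {}  # fragment length -> terminating dideoxy base
for ddntp in "ACGT":
    for length in sanger_reaction(template, ddntp):
        calls[length] = ddntp

# Reading the 'gel' from shortest to longest fragment yields the newly
# synthesized (complementary) strand, 5' to 3'.
print("".join(calls[i] for i in sorted(calls)))
```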

Sequencing strategies

There are two basic strategies for DNA sequencing: directed and random. Each has advantages and disadvantages as well as many variations. Directed methods isolate or otherwise define the sequenceable DNA using knowledge of its location within a larger clone or source DNA. Random methods refer to the random selection of templates from a limited pool of randomly generated subclones that are then reassembled after sequencing using overlapping sequence similarity.

The directed strategies have the advantage of requiring less actual DNA sequencing and of sequencing in a known location relative to the entire fragment. One directed strategy used in conjunction with enzymatic sequencing makes use of custom synthetic oligonucleotide primers simply to 'walk' along the template DNA sequence on both strands. Sequence data from each reaction are used to design the primer for the next sequencing reaction [22]. In this manner, there is no ambiguity about the ordering of sequence data and redundancy is minimized. Primer synthesis, however, can be prohibitively expensive and slower than the rate of sequencing possible. Also, directed methods require more process steps and organization than random strategies, and thus are difficult to automate fully and have a relatively large cost per sequencing reaction.

Using random strategies, a large amount of data is rapidly accumulated with limited subcloning or synthesis efforts. Because the relative location of a subclone is initially unknown, the final assembly of data becomes an important issue and the possible bottleneck in this strategy. In addition, a completely random strategy can result in a large degree of redundancy after 80 to 90% of the complete nucleotide sequence has been determined, thus increasing the overall cost. This redundancy, however, enables definition of the final consensus sequence with greater confidence. It is impractical to randomly select enough clones to ensure complete coverage of the larger source DNA for all but the smallest projects. Therefore, for final completion of the sequence, a more selective subcloning or directed sequencing approach must be pursued.

For essentially all of the recently characterized large sequences, the strategies employed included two fractionation steps and combined both random and directed methods. In most cases, an ordered set of relatively small restriction fragments [13••] or plasmids [12••] isolated from larger clones or genomic DNA was subcloned randomly into M13 vectors for Sanger sequencing. Maxam and Gilbert methods have been applied only rarely. Directed primer sequencing has usually been applied to sequence across restriction sites, and to verify fragment order and identify adjacent fragments.
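The diminishing returns of a purely random strategy can be made concrete with the standard Poisson coverage approximation. The Python sketch below is our own illustration (the 40 kb cosmid insert and 400 bp usable read length are assumed figures, not taken from the article): 2-3x redundancy leaves roughly 5-15% of bases uncovered, while near-complete coverage demands severalfold more raw sequence, which is why directed closure must eventually take over.

```python
import math

def fraction_covered(redundancy):
    # Poisson approximation: a base is missed with probability e^(-c),
    # where c = (number of reads x read length) / target length.
    return 1.0 - math.exp(-redundancy)

target_len = 40_000  # assumed cosmid insert size, bp
read_len = 400       # assumed usable bases per random subclone

for c in (1, 2, 3, 5, 8):
    n_reads = math.ceil(c * target_len / read_len)
    missed = target_len * (1.0 - fraction_covered(c))
    print(f"{c}x redundancy: {n_reads:4d} reads, "
          f"{fraction_covered(c):6.1%} covered, ~{missed:5.0f} bp in gaps")
```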

Subcloning strategies

Although it is possible to use genomic or large clone (e.g. cosmid) DNA directly as a template for sequencing [21], the process is greatly facilitated by the amplification of more discrete regions. The generation of these templates, however, can be the most labor-intensive aspect of efforts attempting high throughput. The strategy used for template selection and amplification is closely linked to whether the overall sequencing strategy is mostly random or directed in approach. Most current sequencing efforts facilitate template preparation by first isolating discrete pieces (subclones) of larger, characterized clones. There are many methods available for subcloning large DNA fragments into cosmid, bacteriophage lambda or plasmid vectors [23]. For random-based projects, regions of the cloned insert DNAs are usually enzymatically digested or physically sheared into smaller fragments, and subcloned into M13- or other F1-derived bacteriophage vectors, such as M13mp18/19, pEMBL, pUC118/119 or other bacterial plasmids, prior to sequence analysis. Large-scale sequencing projects to date have used several strategies, including: directed and random subcloning of restriction fragments [2•,9•,11•,24,25]; directed enzymatic deletion of subcloned restriction fragments using exonuclease III, nuclease Bal31 or T4 DNA polymerase [26]; and random ('shotgun') subcloning of DNA fragments generated by sonic shearing, DNase I digestion or partial restriction enzyme digestion [27-29]. Other methods involve direct, ordered amplification via PCR of regions of larger clones, or direct sequencing using mobile priming sites provided by transposon insertion [30,31]. Unfortunately, most directed methods require considerable time and effort to select and prepare subclones or amplification targets (e.g. construction of limited subclone libraries, clone characterization, and primer synthesis) compared with random methods, especially for very large DNAs such as cosmid inserts.

Sequence determination

Once a set of sequencing reactions has been carried out, the resulting set of fragments must be separated and detected. Various methods may include labeling either before or after separation, but traditional or 'snapshot' methods all have distinct separation and detection steps. Because gels have practical limitations to length, multiple loadings or gels may be required to accommodate all the possible fragment lengths generated from one sequencing reaction. The most commonly used detection methods rely on radiolabelling the set of fragments (e.g. during enzymatic synthesis or by prior end-labelling) before electrophoresis, then recording the separation via autoradiography on X-ray film. This requires significant manual handling. The film must then be read, the base order identified and the data entered into a computer. Most reactions provide 300-400 bases of data, but greater than 800 is possible.

An interesting variation on the autoradiographic procedure used with the Sanger method places many sets of sequencing reactions together on one gel [32]. An entire series of subcloning vectors, each with a unique tag sequence downstream of a common priming site, is used to make a series of randomly cloned libraries. Template from one subclone of each library is aliquoted to each of the four reaction pots required for one complete sequence. Sanger sequencing without labeling is performed and the four lanes/sequences electrophoresed. Each lane then has the set of fragments for a particular nucleotide for multiple clones. After separation, the gel is blotted onto nitrocellulose or some other matrix. Because each vector has a unique tag sequence downstream of the priming site, clones from each library can be visualized in turn by hybridizing the blot with labeled probes specific for each tag site. Once the data from one hybridization are recorded, the blot can be stripped and rehybridized with another tag probe that will visualize clones from another library. Forty or more hybridizations are possible for each blot, but the processes involved are difficult to automate. This method can provide a tremendous amount of raw sequence information from a single gel. Average read lengths, excluding the tag sequence, are only of the order of 200-250 bp, with errors. Without significant improvement in the average read length, such multiplex methods may be most successful when used for sequences with limited repetitive information, such as those from prokaryotes. A modification of multiplex sequencing that combines Maxam and Gilbert chemistry and genomic sequencing is currently being scaled up for an attempt at a megabase prokaryotic project [1••].

An important advance in the separation of the fragments resulting from DNA sequencing reactions that deserves close observation is ultrathin capillary gel electrophoresis. Electrophoretic separations performed in very small diameter capillaries can be very rapid because a much greater field can be applied to the gel as a result of reduced Joule heating. Migration speeds have been increased by an order of magnitude [33••]. The DNA fragments can be labeled and detected fluorescently (see below). Further optimization of the optics, detection system and gels promises to reduce greatly the amount of reaction mixture loaded, potentially to the point of excluding the need for subclone amplification.

Automation

The obvious problems of relatively large-scale sequencing efforts to date have not been so much an inability to produce a peak performance effort as an inability to sustain levels of data generation and quality over an extended period of time. Therefore, strategies must be chosen with careful consideration of automation potential and reduction of possible bottleneck steps. We can envisage economical large-scale operations, including the human genome initiative, only with such an emphasis on automation and the development of a consistent process.

Automated sequence determination

In the past, most autoradiographs of DNA sequence have been read manually or aided by various pointer devices. Now, several commercial systems are available that attempt to interpret the sequence automatically from a computerized image of the film. Unfortunately, it has proven to be very difficult to run gels consistently enough, and to develop algorithms sufficiently robust, to read more than 300 bp reliably without significant operator intervention and interpretation. This is primarily because of the difficulty of establishing the correct register between four (or more) independent lanes on gels that are prone to significant mobility artifacts. Multiplex methods are particularly sensitive to this problem, as enormous numbers of gels can be generated, with the potential for further distortion during the blotting process. Variations of labeling strategies, including non-radioactive methods as well as better gel construction, could facilitate these efforts in the future. Use of more powerful computer hardware that can employ more powerful software may also help.

In order to provide a more automated procedure for sequence detection which overcomes the problem of gel non-uniformity, chemistry and instrumentation have been developed that allow all four reactions to be electrophoresed together in one lane [34-36]. This chemistry uses oligonucleotide primers labeled with four different fluorescent dyes, one for each of the four nucleotides. Thus the nested fragments produced in the separate A-, C-, G- and T-specific reactions may be combined, co-electrophoresed and distinguished during electrophoresis on the basis of their unique fluorescent emission spectra. An argon ion laser beam scans across the bottom of the gel and excites the passing dye-labeled DNA molecules. The light emitted is collected and focused through a four-wavelength selectable filter onto a single photomultiplier to give a continuous four-point spectrum. The order of passing dye-labeled DNA molecules is translated directly into sequence data by a computer. The multiple fluorophore approach avoids the problems resulting from lane-to-lane variation in banding patterns which often plague standard DNA sequencing methods, as no gel artifacts can alter the order of the bands in a single lane.

Applied Biosystems, Inc. currently markets an instrument, the 373A, based on these principles. In our experience, the 373A is consistently capable of resolving approximately 500 bp from the priming site per sequencing reaction, with an error and ambiguity rate of 1-5% depending on the quality of the template DNA and the sequencing method. Each instrument has the ability to analyze simultaneously 24 channels (templates) in 14 h. Recently, we have completed the longest metazoan sequence to date using this technology: approximately 94 kb of the mouse alpha T-cell receptor locus (RK Wilson, BF Koop and LE Hood, unpublished data). Other similar-sized projects are also nearing completion (C Venter, personal communication).

Other DNA sequencing instruments have been described which automate the task of data processing using fluorescence detection [37-40]. Pharmacia has recently introduced the only other commercial system currently available, the Automatic Laser Fluorescent (ALF) DNA sequencer [38]. The ALF, unlike the 373A, uses only one dye label. Four lanes are loaded for each reaction set, just as for standard autoradiography. Detection of the laser-activated fluorescence is achieved with forty fixed detectors, each corresponding to one of the forty lanes that can potentially be loaded (10 sets of sequence reactions). No moving parts are involved, as the laser beam is directed through the width of the gel from one side in between the gel plates. In contrast, the 373A relies on an optical system that travels back and forth across the gel to redirect the beam. One of the largest eukaryotic DNA sequencing projects published to date, analysis of the human hypoxanthine phosphoribosyltransferase gene locus (57 kb), was undertaken using the prototype of this instrument [13••]. The more limited number of reactions that can be run simultaneously (10 sets) is somewhat compensated by shorter run times (6 h) that allow more than one run a day per instrument. Because of the fixed nature of the detectors, however, the instrument is sensitive to gel artifacts that distort lanes. The simplicity of primer synthesis, a single labelled primer being required per reaction set, means that this instrument may have particular usefulness in specific primer-based sequencing approaches or for directed gap closure. Additionally, the system can be used with Maxam and Gilbert chemical sequencing, which is not yet possible using the four-dye strategies.

Sequence reaction automation

True automation of DNA sequencing requires the automated performance of the actual sequencing reactions. This is a critical step for large-scale sequencing because the consistency and throughput required are generally beyond the long-term capabilities of most researchers, especially as the risk of sample and pipetting errors increases substantially with increased sample numbers. The first robotic system to address this concern was developed by Seiko to automate Maxam and Gilbert sequencing reactions [41]. This instrument is no longer commercially available. Others have modified chemical robots to perform the repetitive pipetting tasks of sequencing. For example, several laboratories have adapted the Beckman Biomek 1000 robotic workstation for the automation of Sanger sequencing reactions which use modified T7 polymerase [42] (B Roe, personal communication). The Biomek is used to perform the sequencing reactions in 96-well microtiter plates. The instrument uses single and multiple tip pipetting tools to transfer reagents to and from microfuge tubes and microtiter wells situated in four slots on the instrument tablet. Although the Beckman robot is a useful tool, it suffers from the lack of truly integrated temperature control and an imprecision in pipetting small volumes, only being accurate to within 1-2 µl. Consequently, other commercial systems designed specifically for DNA sequencing are currently under development. Applied Biosystems has described one such system that relies on a single, syringe-based, steel-tipped pipette capable of precision volume control to 0.1 µl and has both heating and cooling modules. Each instrument is capable of supplying the reaction needs of two to three automated sequencers daily (C Venter, personal communication). No specialized system is yet available commercially.

Further automation of the sequencing process

Although real-time detection of DNA sequences by fluorescence emission has solved one major problem, other tasks such as subcloning, DNA preparation, sample processing and data analysis still limit the amount of data that can actually be generated. Other automated systems and high-throughput strategies are being developed which may help ameliorate these constraints. Among the most notable are new methods for DNA preparation and sample processing. In the past, DNA sequencing has made extensive use of fragments subcloned into bacteriophage M13 and, to a lesser extent, plasmid vectors. Single- or double-stranded recombinant DNA is purified by standard methods and subjected to sequence analysis. Most of these methods require extensive centrifugation, chromatography or gel electrophoresis. Recently, a modification of the PCR method was described in which asymmetric priming during amplification allows production of an excess of single-stranded DNA [43•,44]. This method has not proven completely reproducible, however. Alternative high-throughput methods which use a robotic laboratory workstation for template DNA preparation have also been described [45•,46•]. These approaches have made use of existing laboratory robotics to automate standard DNA preparative methods. Other approaches to the preparation of template DNA have included the use of biotin with magnetic beads or streptavidin in conjunction with PCR and robotic methods. Magnetic manipulation of DNA templates may play a key role in the automation of pre-sequencing reaction procedures [47••,48•].

Bottlenecks

Many problems have to be solved before a very large-scale sequencing project can expect to succeed. Those that cannot be solved outright must be anticipated and the protocol redesigned to circumvent them, or they will become bottlenecks. These problems can arise from limitations in both biology and technology. The problems of long repeats in eukaryote sequences, for example, cannot ever be entirely dismissed. Longer sequencing runs, directed sequencing and algorithm development, however, can all play a role in reducing the problem and in at least identifying possible repeat regions. Nonclonable sequences, incomplete or biased libraries, recombination and other cloning artifacts can all greatly disrupt a sequencing effort as production is interrupted to solve problems that arise, particularly during sequence assembly and closure.

Error rate and cost are also fundamental issues. It is necessary to define rigorously the tolerated and likely level of error resulting from any sequencing effort. To date, the actual level of error in the sequence databases is only poorly understood, as inadequate models exist for the estimation of the error rate of raw data from standard, manual sequencing methods. Rates have been examined more efficiently for automated sequencers, but the lack of comparison is unfortunate. Mathematical and empirical tools must be developed to address this issue, particularly as single base changes between alleles can have profound implications for the biological manifestation of individual sequences. Well defined and rigorous mathematical rules for determining levels of confidence for consensus bases (sequences) must be devised and accepted by the community.

Cost is perhaps even more poorly understood than error. Figures of $2-$5 per 'finished' base, the final consensus, are often cited, although these estimates are not generally based on well documented cost analyses from rigorously maintained records. Cost is primarily a function of the capital investment in equipment and the productivity achieved. The efficiency of resource use therefore becomes paramount in reducing costs. For very large projects, an industrial or production approach which adopts an assembly line model is required in order to achieve efficiency.

The generation of very large sequences places particular demands on the computational resources required for data management and analysis. Unfortunately, it is insufficient to scale up existing methods adequate for much smaller problems; rather, a redesign of the basic computational tools is required. For genetic sequencing projects, most of the emphasis has been placed on the problems of analyzing data for information content after it has been compiled. Much less effort has been aimed at tracking the process of obtaining these data, as most sequencing projects to date have been on a scale that lends itself to 'hands on' management. These ad hoc approaches are precluded by the level of automation, throughput and quality assurance required for very large-scale efforts. Thus, considerable effort must be aimed at the problems of sample management, data acquisition and maintenance, and final sequence assembly, as well as analysis of the final data.

Clearly, efficiency, error and cost are all important considerations when evaluating sequencing technology. There is much debate as to whether current technology, even if modified, is applicable to any large-scale sequencing efforts, let alone the genome initiative. In principle, scaling up the effort can successfully generate significantly higher throughput regardless of which technology is used. The issue is, however, an economical increase in throughput rather than the absolute rate of production. Thus, development of technology must be viewed in parallel with efforts to design an efficient overall process. For example, it is pointless to develop an extremely rapid sequencer if reactions cannot be generated quickly enough to keep it maximally functional. The real technological bottlenecks must be carefully identified, targeted and reduced. Currently, we do not believe that running sequencing gels and obtaining the raw sequence data will be the most limiting bottleneck for large projects in the next 3-5 years.

Suggested strategies

There are three basic stages involved in conducting large-scale sequencing projects: production sequencing, technology and methods development, and biological analysis of the resulting data. The production group must provide an industrial operation for consistent, high-throughput sequencing, and is not necessarily involved in the latter two processes. The total resource expenditure can be divided into five projected categories as follows: production (50%), development (20%), trouble-shooting (10%), quality control (10%) and computation (10%).

Sequencing strategy

Although directed and multiplex sequencing approaches have a particular elegance of design, they have not yet proven to be practical or efficient for large projects. Directed methods require the expensive synthesis of specific primers or large-scale clone selection by, for example, deletional subcloning. On a large scale, both of these methods would be enormously difficult to organize and automate. Also, it has yet to be proved that multiplex sequencing can consistently provide sequences longer than 250 bp on average, and automatic gel readers simply have not worked without significant user interaction. Random- or shotgun-based strategies gain in practicality what they lose in elegance. Selection and preparation of templates is simple and automatable. Protocols are well tested and successful, the cost per sequenced raw base being least for simple, random subclones. The degree of redundancy required for confidence in the consensus is no greater for mixed random strategies than it is for any directed strategies. The extra redundancy that is associated with random strategies can be greatly ameliorated with improvements in the sequencing protocols and software, and the judicious application of directed primed sequencing for closure. Simplicity and consistency should be the watchwords for any process that demands a high degree of automation to achieve an appropriate and economical throughput. We believe the shotgun approach is both simple and consistent.

Subcloning strategies

The usefulness of a particular strategy varies with the size and complexity of the project. Initially, we believe that random subcloning approaches involving fractionation of whole cosmids will prove the most useful. The random subcloning of whole cosmids eliminates extra subcloning and fractionation steps and multiple subcloning procedures. The past experience of our own and other DNA sequencing laboratories has shown that the most accurate nucleotide sequence data are accumulated using single-stranded template DNA. Single-stranded sequencing with the Sanger method has proven to be both consistent and automatable. However, the use of M13 RF, phagemids and plasmids could allow a more flexible choice of subcloning strategies, particularly if double-stranded cycle sequencing protocols employing PCR consistently prove to be comparable to single-stranded sequencing. Double-stranded vectors provide the opportunity to sequence both strands of a single template preparation, thus providing twice the sequence data per unit cost. Increased subclone length can additionally have a significant impact on the number of subclones required to provide basic coverage of the source clone (e.g. cosmid) being sequenced (calculations not shown). Cycle sequencing protocols also significantly reduce the amount of template required for adequate sequencing, thereby simplifying the preparation process.
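To illustrate the kind of calculation alluded to above (the sketch is ours; the insert size, read length and redundancy target are assumed figures, not the authors'), the following counts the templates that must be prepared when each subclone yields one read versus two reads, as when both strands of a double-stranded vector are sequenced:

```python
import math

def templates_needed(target_len, redundancy, read_len, reads_per_template):
    # Templates required so that total raw bases = redundancy x target length.
    total_raw = redundancy * target_len
    return math.ceil(total_raw / (read_len * reads_per_template))

cosmid = 40_000  # assumed insert size, bp
for reads in (1, 2):  # single-stranded vs double-stranded template
    n = templates_needed(cosmid, redundancy=5, read_len=450,
                         reads_per_template=reads)
    print(f"{reads} read(s) per template: {n} template preparations")
# -> 445 vs 223: preparations are halved for the same raw redundancy.
```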

Polymerase

Almost all of the large-scale sequences to date have been generated with Klenow or Sequenase. We believe it is likely that most large-scale efforts will commence with Sequenase. Although standard Taq sequencing also seems to work well for M13 inserts, the enzyme is more expensive than Sequenase. If the cycle method for Taq can consistently provide sequence data from double-stranded vectors almost comparable with that from single-stranded templates, we believe that it may be a superior method. The polymerase from the heat-stable bacterium Bacillus stearothermophilus has many properties in common with Taq and may also prove to be a useful reagent.

Closure strategies

Closure, ultimately required for a successful sequencing project, can prove to be its downfall. The first 75-90% of coverage is generally straightforward for most cosmids. The last 10-25% can take as long or longer to determine, currently providing perhaps the biggest bottleneck for large sequencing projects. The more repetitive the sequence being studied, the greater the difficulty of closure. Therefore, the overall sequencing strategy used should be chosen so as to minimize closure problems. Closure of particular source clones cannot take precedence over general sequencing. Instead, closure efforts must go on in parallel with day-to-day subclone sequencing and the priority must be to keep the flow of sequence data relatively constant.

It has been our experience that the random strategy, as described above, will provide over 85% of the finished double-stranded sequence for any given DNA fragment. At this point, a more directed approach to joining contiguous sequences and closing sequence gaps must be employed. The two most useful closure strategies we have used to date are primer-directed sequencing and PCR-directed closure. Once a large percentage of a cosmid insert has been sequenced and assembled, contiguous sequence stretches (contigs) may be extended by synthesizing custom sequencing primers. These primers can be used for primer-directed sequencing of subclones at the ends of the contigs. Alternatively, the cosmid DNA can be sequenced directly. As discussed above, this is particularly attractive with Taq cycle sequencing. A point to keep in mind when selecting the protocol for solving one problem is its compatibility with the other protocols employed. For example, if Taq sequencing is particularly useful for gap closure, that might also favor its use for general sequencing, if all other factors are relatively consistent, in order to minimize the number of different procedures in the production loop.

If the relative locations of contigs are known (e.g. from restriction maps), gaps may be filled by PCR-directed closure. In this approach, custom primers are synthesized at the ends of adjacent contigs and used as primers for PCR amplification of the intervening region. The whole cosmid or yeast artificial chromosome (YAC) clone can be used as the template for PCR, and gaps greater than 4 kb can be bridged by this method. Following PCR, the gap DNA is accurately sized and purified by agarose gel electrophoresis, and subcloned into the sequencing vector. This provides custom subclones which bridge the gap, and which may be sequenced using standard primers as well as with the amplification primers. The only potential drawback is the possibility of point errors generated by Taq DNA polymerase. To avoid this problem, a minimum of three subclones, at least one from each orientation, must be sequenced. Alternatively, the amplified DNA can be sequenced directly.

If the relative locations of two or more contigs are unknown, a combinatorial approach to PCR-directed closure can be useful for aligning and spacing the sequences as long as the number of contigs is small. Mathematical tests can be employed to determine at which stage this is most likely to be useful. Amplification primers may be synthesized for both ends of each contig and used in a variety of PCR combinations. Subsequent agarose gel electrophoresis will provide data about the relative locations, orientations and spacing of the contigs. In addition, the PCR-directed closure approach is useful for confirming the joining of two adjacent restriction fragments. Both the primer-directed and PCR-directed methods can also be used to confirm base order in regions of repetitive sequence or severe secondary structure.
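The combinatorial bookkeeping for this approach grows quadratically with the number of open contigs. The short sketch below (the contig names are hypothetical and the pairing logic is our own illustration of the idea, not a published protocol) enumerates the primer-pair PCRs needed to test every possible junction:

```python
from itertools import combinations

contigs = ["c1", "c2", "c3", "c4"]  # hypothetical unordered contigs
# Each contig contributes two outward-facing ends ('L' and 'R'), each
# with its own amplification primer.
ends = [(c, side) for c in contigs for side in ("L", "R")]

# Any pair of ends from two *different* contigs is a candidate junction;
# a PCR product from that primer pair implies adjacency, and the product
# size on an agarose gel gives the gap length.
candidate_pcrs = [(a, b) for a, b in combinations(ends, 2) if a[0] != b[0]]
print(len(candidate_pcrs), "primer-pair PCRs for", len(contigs), "contigs")
# -> 24; the count grows as 2n(n-1), so the method stays practical only
#    while the number of open contigs is small.
```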

Automatic sequencer

We believe, as a result of extensive experience, that reliance on automatic sequencing rather than manual methods will be the most efficient long-term strategy. Automatic methods are more consistent and appear more error-free on a gel-to-gel basis. Although manual methods can potentially result in a much greater number of bases read from a single set of reactions (by using, for example, multiple loadings, wedge gels and 35S labeling), the average amount of data generated by most researchers is no more, or even less, than that generated by automated sequencers. Many of the steps of manual sequencing (gel drying, film loading and developing, film reading, data entry, etc.) have proven difficult to automate adequately. People are the least consistent and most expensive component of a large-scale effort. Automation of as much of the entire process as possible must be our goal if large sequencing efforts are to be successful over any length of time. Extensive experience with the 373A automatic fluorescent sequencer indicates that this method is reliable and has excellent potential for evolution into an even more productive tool. Because of this, we feel that during at least the next 3-5 years the fluorescence-based instruments will offer an attractive approach to megabase sequencing.

It must be emphasized that the potential productivity of the available automated instruments is not the current bottleneck. Therefore, the introduction of new, much faster instruments might have no significant impact on the best sequencing strategies unless they also influence other aspects of the entire procedure. Increasing the sequenceable unit length, for example, could have significantly more impact, as it would reduce the number of subclones required and improve the assembly process.

Realistic estimates of potential DNA sequencing throughput are difficult to make, but we can extrapolate from our own experience. Theoretically, a 373A can generate 24 samples of 450-500 bp per day (1.2 x 10^4 bp total). If used 5 days a week for a complete year, a single 373A could generate up to 3.1 x 10^6 bp of raw data. At an average redundancy factor of five for shotgun sequencing, this means ideally more than 600 kb per year of finished data per instrument. In a 6 month period in which two 373A instruments were operating, we experienced equipment failure 15-20% of the time. We experienced another 20-25% level of sequencing failure due to factors such as bad template and failed reactions. In addition, two technicians using manual preparative procedures could only keep two machines going 80% of the time (4 days/week) with our current procedures. Taken together, these values would result in a rate of about 300 kb/year/machine. Of this, 10% will be sequenced from the subcloning vector, 15% will be cosmid vector and 15-20% will be redundant because of the overlap between cosmids. With a fivefold redundancy, about 200 kb of finished data per 373A per technician a year could be accomplished with our present procedures. Increasing the efficiency of this process by only 50% and doubling the number of instruments would allow two technicians to sequence a megabase of DNA a year (not including major cloning and mapping). We believe that in the next 2-3 years another two- to fivefold increase in throughput is achievable without dramatic changes to the basic 373A design (e.g. software, comb design and lane number, and rapid primer synthesis).
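These throughput figures can be re-derived explicitly. The sketch below simply strings together the numbers quoted in the text; the only assumption on our part is taking the midpoints of the quoted failure-rate ranges:

```python
# Ideal raw throughput of one 373A (figures quoted in the text)
raw_per_day = 24 * 500                    # 24 templates x ~500 bp
raw_per_year = raw_per_day * 5 * 52       # 5 runs/week, 52 weeks
print(f"ideal raw data:      {raw_per_year / 1e6:.2f} Mb/year")  # ~3.1 Mb

# De-rate by the observed failure modes (midpoints of quoted ranges)
effective = raw_per_year * 0.825 * 0.775 * 0.80  # uptime, reactions, prep
finished = effective / 5                         # fivefold shotgun redundancy
print(f"finished-equivalent: {finished / 1e3:.0f} kb/year")      # ~300 kb

# Subtract subcloning vector (10%), cosmid vector (15%) and
# inter-cosmid overlap (~17.5%) reads
net = finished * (1 - 0.10 - 0.15 - 0.175)
print(f"net finished:        {net / 1e3:.0f} kb/year/instrument")  # ~200 kb
```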

Final sequence determination

In order to choose the best sequencing strategy, it is critical to define the criteria which allow determination of the level of confidence in the final results. It is interesting that essentially no sequence has been published together with the rules used to define the raw or final sequences, other than the number of times each clone or region was sequenced. These final sequences have been the result of the confidence the researcher had in his or her raw data, along with generally ill-defined (or at least unstated) rules for consensus. Very large projects can no longer rely on these subjective, hands-on approaches. Perhaps this task, in particular, must be automated, as it is too large to enable individual researchers to check all raw data. High quality redundant data and the methods by which we can automate the base-calling process must be relied upon to provide confidence.

In order to automate the consensus base-calling process, objective rules that quantitatively describe the likelihood of each base call must be defined. These rules must be predefined such that the final answer is not a bare consensus, but rather a probability for each base position determined; for example, at position 1031: A, 97%; G, 2%; C, 0.5%; T, 0.5%. These rules must also establish the likelihood of error. In general, the error rate should be at least an order of magnitude less than the known level of variation in the species (or loci) being studied. Of course, 'objective' here refers to the methodology, and is not meant to imply that any researcher can define the rules for truth. The rules are still best guesses, although they can be tested against known and simulated data. Specifically, 'objective' refers to the fact that two different researchers with the same set of rules will get the same final answer if they begin with the same raw data. Although this does not guarantee that the consensus view is correct, it does guarantee that any researcher can know precisely how the answer was determined and make their own judgement as to its validity. As experience increases, the consensus rules can be further refined, the data resubmitted to analysis and an improved consensus view generated.
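A minimal sketch of the kind of objective rule called for above (the uniform prior, the single per-read accuracy and the independence assumption are our own simplifications for illustration, not a published method): each aligned read contributes a likelihood, and the output is a posterior probability for every base rather than a bare call.

```python
def consensus_probabilities(column, read_accuracy=0.97):
    """Toy Bayesian consensus for one aligned position.

    Assumes independent reads, a uniform prior over the four bases, and
    a single per-read accuracy, with errors spread evenly over the other
    three bases. Returns the posterior probability of each base.
    """
    posterior = {}
    for base in "ACGT":
        p = 1.0
        for call in column:
            p *= read_accuracy if call == base else (1 - read_accuracy) / 3
        posterior[base] = p
    total = sum(posterior.values())
    return {b: p / total for b, p in posterior.items()}

# Five overlapping reads covering one position (cf. 'position 1031'):
probs = consensus_probabilities(["A", "A", "A", "G", "A"])
print({b: f"{p:.2%}" for b, p in probs.items()})
# Two researchers applying this same rule to the same raw data will
# always obtain the same probabilities.
```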

Sequencing as an industrial process

No process as complex as large-scale sequencing is any faster than its slowest step. Technology is often assumed to be the bottleneck of large-scale sequencing efforts. Currently, that bottleneck is not running and reading gels, but rather keeping that one aspect of the overall process fed with material and in operation day in and day out. Considered individually, most of the other subprocesses required (e.g. clone preparation and sequencing reactions) also do not appear likely to be a severe block. Why, then, has the community not achieved megabase sequencing? We believe that the true bottleneck has been the lack of a serious commitment to an industrial approach: the recognition that consistency is usually much more important to the success of any large project than some ideal technological peak performance that is only occasionally attainable [49••]. It must be stressed that the requirements of the information technology associated with large-scale sequencing (e.g. data acquisition, management and distribution, as well as fragment assembly and data analysis) also play a critical role in designing an intelligent operational scheme, but are beyond the focus of this treatment.

The future

Clearly, the development of novel DNA sequencing technologies (e.g. scanning tunneling electron microscopy, mass spectrometry, hybridization with sets of small DNA probes, and single molecule degradation and detection) must be explored throughout the next 5 year period. There is a possibility that such novel technology will improve to the point where, in the second 5 year period of the genome initiative, it could replace the more conventional DNA sequencing approaches described in this review. Two important points can be made in this regard. First, we believe a strategy based on automated fluorescent sequencing could be optimized such that the human genome could be successfully sequenced over a 15 year period. Admittedly, an alternative technique might be developed during this time. Second, if a novel sequencing technology does replace the current approach, many of the production line lessons to be learned from employing conventional methods over the next 5 years can be usefully applied, at least in part, to any new approach (e.g. data management). In any case, we believe that with appropriate support for technology over the next 10 years, there will be a 50-100-fold improvement in the rate of DNA sequence production.

Conclusion

We believe that although technological innovation is imperative at many levels, the fundamental requirement for the success of large-scale sequencing projects is the development of an appropriate management structure for the process and the effort. It must be possible for the experimental strategies chosen to be easily organized into a pipeline process that avoids potential individual bottlenecks, even if some of the separate steps are not the fastest methods available. Ultimately, protocols must be chosen that are automatable in practice, such as fluorescent sequencing. Any designed process must involve few decision or branch points in its implementation. In general, the process must be the same every day; the protocols used must not be specific to that day's effort (e.g. different sequencing strategies for particular sets of templates), or frequently dependent on previously generated data. There are three critical factors in establishing a successful large-scale project: the choice of appropriate and adequate technology; the design of a coherent, relatively stable process protocol; and the use of well-trained people for the job. We believe that with the correct planning, successful very large-scale sequencing projects can be initiated with the technologies and procedures currently available. The largest and most important problem, as it should be, is the identification and definition of the most interesting biological problems that can be investigated through the use of these technologies.

References and recommended reading

Papers of special interest, published within the annual period of review, have been highlighted as:
• of interest
•• of outstanding interest

1. ROBERTS L: Genome Center Grants Chosen. Science 1990, 249:1497.
•• This article provides a description of the officially funded genome-related sequencing grants.

2. MARGOT JB, DEMERS GW, HARDISON RC: Complete Nucleotide Sequence of the Rabbit Beta-like Globin Gene Cluster. J Mol Biol 1989, 205:15-40.
• In the context of this review, this paper and [3•-11•] are remarkable for the quantity of data and amount of effort they represent rather than for the individual sequence information they contain.

3. DEN DUNNEN JT, VAN NECK JW, CREMERS FPM, LUBSEN NH, SCHOENMAKERS JGG: Nucleotide Sequence of the Rat Gamma-crystallin Gene Region and Comparison with an Orthologous Human Region. Gene 1989, 78:201-214.
• See [2•].

4. HAMPE A, SHAMOON B, GOBET M, SHERR C, GALIBERT F: Nucleotide Sequence and Structural Organization of the Human fms Proto-oncogene. Oncogene Res 1989, 4:9-17.
• See [2•].

5. BENIAN GM, KIFF JE, NECKELMANN N, MOERMAN DG, WATERSTON RH: Sequence of an Unusually Large Protein Implicated in Regulation of Myosin Activity in C. elegans. Nature 1989, 342:45-50.
• See [2•].

6. SHULL MM, PUGH DG, LINGREL JB: Characterization of the Human Na,K-ATPase Alpha-2 Gene and Identification of Intragenic Restriction Fragment Length Polymorphisms. J Biol Chem 1989, 264:17532-17543.
• See [2•].

7. CHEN EY, LIAO Y-C, SMITH DH, BARRERA-SALDANA HA, GELINAS RE, SEEBURG PH: The Human Growth Hormone Locus: Nucleotide Sequence, Biology and Evolution. Genomics 1989, 4:479-497.
• See [2•].

8. SHEHEE WR, LOEB DD, ADEY NB, BURTON FH, CASAVANT NC, COLE P, DAVIES CJ, MCGRAW RA, SCHICHMAN SA, SEVERYNSE DM, VOLIVA CF, WEYTER FW, WISELY GB, EDGELL MH, HUTCHISON CA: Nucleotide Sequence of the BALB/c Mouse Beta-globin Complex. J Mol Biol 1989, 205:41-62.
• See [2•].

9. ROGOWSKY PM, POWELL BS, SHIRASU K, LIN T-S, MOREL P, ZYPRIAN EM, STECK TR, KADO CI: Molecular Characterization of the vir Regulon of Agrobacterium tumefaciens: Complete Nucleotide Sequence and Gene Organization of the 28.63-kbp Regulon Cloned as a Single Unit. Plasmid 1990, 23:85-106.
• See [2•].

10. IKEJIRI K, UENO T, MATSUGUCHI T, TAKAHASHI K, ENDO H, YAMAMOTO M: Rat IGFII Gene for Insulin-like Growth Factor II. Unpublished, GenBank accession number X17012, 1990.
• See [2•].

11. PRITCHARD AE, SEILHAMER JJ, MAHALINGAM R, GHALAMBOR M, SABLE CL, VENUTI SE, CUMMINGS DJ: Nucleotide Sequence of the Mitochondrial Genome of Paramecium. Nucleic Acids Res 1990, 18:173-180.
• See [2•].

12. CHEE MS, BANKIER AT, BECK S, BOHNI R, BROWN CM, CERNY R, HORSNELL T, HUTCHINSON CA III, KOUZARIDES T, MARTIGNETTI JA, PREDDIE E, SATCHWELL SC, TOMLINSON P, WESTON KM, BARRELL BG: Analysis of the Protein Coding-content of the Sequence of Human Cytomegalovirus Strain AD169. Curr Top Microbiol Immunol 1990, 154:125-170.
•• Reports the largest sequence yet completed: almost 250,000 bases.

13. EDWARDS A, VOSS H, RICE P, CIVITELLO A, STEGEMANN J, SCHWAGER C, ZIMMERMANN J, ERFLE H, CASKEY CT, ANSORGE W: Automated DNA Sequencing of the Human HPRT Locus. Genomics 1990, 6:593-608.
•• This is the first paper that describes the use of automated DNA sequencing for a large-scale project.

14. SANGER F, NICKLEN S, COULSON AR: DNA Sequencing with Chain Terminating Inhibitors. Proc Natl Acad Sci USA 1977, 74:5463-5467.

15. SANGER F, COULSON A, BARRELL B, SMITH A, ROE B: Cloning in Single-stranded Bacteriophage as an Aid to Rapid DNA Sequencing. J Mol Biol 1980, 143:161-178.

16. MAXAM A, GILBERT W: A New Method for Sequencing DNA. Proc Natl Acad Sci USA 1977, 74:560-564.

17. BANKIER AT, BARRELL BG: Sequencing Single-stranded DNA Using the Chain-termination Method. In Nucleic Acid Sequencing edited by Howe CJ and Ward ES. Oxford: IRL Press, 1989, pp 37-78.

18. TABOR S, RICHARDSON CC: DNA Sequence Analysis with a Modified Bacteriophage T7 Polymerase. Proc Natl Acad Sci USA 1987, 84:4767-4771.

19. PETERSON MG: DNA Sequencing Using Taq Polymerase. Nucleic Acids Res 1988, 16:10915.

20. INNIS MA, MYAMBO KB, GELFAND DH, BROW MD: DNA Sequencing with Thermus aquaticus DNA Polymerase and Direct Sequencing of Polymerase Chain Reaction-amplified DNA. Proc Natl Acad Sci USA 1988, 85:9436-9440.

21. CHURCH G, GILBERT W: Genomic Sequencing. Proc Natl Acad Sci USA 1984, 81:1991-1995.

22. STRAUSS EC, KOBORI JA, SIU G, HOOD LE: Specific Primer-directed DNA Sequencing. Anal Biochem 1986, 154:353-360.

23. MESSING J: New M13 Vectors for Sequencing. Methods Enzymol 1983, 101:20-78.

24. SANGER F, COULSON AR, HONG GF, HILL DF, PETERSEN GB: Nucleotide Sequence of Bacteriophage Lambda DNA. J Mol Biol 1982, 162:729-774.

25. BAER R, BANKIER AT, BIGGIN MD, DEININGER PL, FARRELL PJ, GIBSON TJ, HATFULL G, HUDSON GS, SATCHWELL SC, SEGUIN C, TUFFNELL PS, BARRELL BG: DNA Sequence and Expression of the B95-8 Epstein-Barr Virus Genome. Nature 1984, 310:207-211.

26. HENIKOFF S: Unidirectional Digestion with Exonuclease III in DNA Sequence Analysis. Methods Enzymol 1987, 155:156-163.

27. BANKIER AT, BARRELL BG: Shotgun DNA Sequencing. Tech Nucl Acid Biochem 1983, B5:1.

28. DEININGER P: Random Subcloning of Sonicated DNA: Application to Shotgun DNA Sequence Analysis. Anal Biochem 1983, 129:216-223.

29. ANDERSON S: Shotgun DNA Sequencing Using Cloned DNase I-generated Fragments. Nucleic Acids Res 1981, 9:3015-3027.

30. ADACHI T, MIZUUCHI M, ROBINSON EA, APPELLA E, O'DEA MH, GELLERT M, MIZUUCHI K: DNA Sequence of the Escherichia coli gyrB Gene: Application of a New Sequencing Strategy. Nucleic Acids Res 1987, 15:771-784.

31. AHMED A: Use of Transposon-promoted Deletions in DNA Sequence Analysis. Methods Enzymol 1987, 155:177-204.

32. CHURCH GM, KIEFFER-HIGGINS S: Multiplex DNA Sequencing. Science 1988, 240:185-188.

33. LUCKEY JA, DROSSMAN H, KOSTICHKA AJ, MEAD DA, NORRIS TB, SMITH LM: High-speed DNA Sequencing by Capillary Electrophoresis. Nucleic Acids Res 1990, 18:4417-4421.
•• This is a representative article on a state-of-the-art alternative separation method using capillary systems that promises considerable advantages in throughput.

34. SMITH LM, FUNG S, HUNKAPILLER MW, HUNKAPILLER T, HOOD LE: The Synthesis of Oligonucleotides Containing an Aliphatic Amino Group at the 5' Terminus: Synthesis of Fluorescent DNA Primers for Use in DNA Sequence Analysis. Nucleic Acids Res 1985, 13:2399-2412.

35. SMITH LM, SANDERS JZ, KAISER RJ, HUGHES P, DODD C, CONNELL CR, HEINER C, KENT SBH, HOOD LE: Fluorescence Detection in Automated DNA Sequence Analysis. Nature 1986, 321:674-679.

36. CONNELL C, FUNG S, HEINER C, BRIDGHAM J, CHAKERIAN V, HERON E, JONES B, MENCHEN S, MORDAN W, RAFF M, RECKNOR M, SMITH L, SPRINGER J, WOO S, HUNKAPILLER M: Automated DNA Sequence Analysis. Biotechniques 1987, 5:342-348.

37. PROBER JM, TRAINOR GL, DAM RJ, HOBBS FW, ROBERTSON CW, ZAGURSKY RJ, COCUZZA AJ, JENSEN MA, BAUMEISTER K: A System for Rapid DNA Sequencing with Fluorescent Chain-terminating Dideoxynucleotides. Science 1987, 238:336-341.

38. ANSORGE W, SPROAT B, STEGEMANN J, SCHWAGER C, ZENKE M: Automated DNA Sequencing: Ultrasensitive Detection of Fluorescent Bands During Electrophoresis. Nucleic Acids Res 1987, 15:4593-4602.

39. BRUMBAUGH JA, MIDDENDORF LR, GRONE DL, RUTH JL: Continuous, On-line DNA Sequencing Using Oligonucleotide Primers with Multiple Fluorophores. Proc Natl Acad Sci USA 1988, 85:5610-5614.

40. KAMBARA H, NISHIKAWA T, KATAYAMA Y, YAMAGUCHI T: Optimization of Parameters in the DNA Sequenator Using Fluorescent Detection. Biotechnology 1988, 6:816-821.

41. WADA A: Automated High-speed DNA Sequencing. Nature 1987, 325:771-772.

42. WILSON RK, YUEN AS, CLARK SM, SPENCE C, ARAKELIAN P, HOOD LE: Automation of Dideoxynucleotide DNA Sequencing Reactions Using a Robotic Workstation. Biotechniques 1988, 6:776-777.

43. GYLLENSTEN UB, ERLICH HA: Generation of Single-stranded DNA by the Polymerase Chain Reaction and its Application to Direct Sequencing of the HLA-DQA Locus. Proc Natl Acad Sci USA 1988, 85:7652-7656.
• This study uses unequal molar amounts of two PCR primers to generate an excess of single-stranded DNA which can then be sequenced directly.

44. WILSON RK, KONO D, ZALLER D, HOOD L: Rapid Analysis of T-Cell Receptor Gene Structure and Expression. In Current Communications: Polymerase Chain Reaction edited by Erlich HA, Gibbs R, Kazazian Jr HH. New York: Cold Spring Harbor Laboratory, 1989, pp 217-223.

45. ZIMMERMANN J, VOSS H, KRISTENSEN T, SCHWAGER C, STEGEMANN J, ERFLE H, ANSORGE W: Automated Preparation and Purification of M13 Templates for DNA Sequencing. Methods Mol Cell Biol 1989, 1:29.
• Describes the use of glass fiber filter grids and acetic acid precipitation in conjunction with a robotic workstation to isolate M13 templates for DNA sequencing.

46. MARDIS ER, ROE BA: Automated Methods for Single-stranded DNA Isolation and Dideoxynucleotide DNA Sequencing Reactions on a Robotic Workstation. Biotechniques 1989, 7:840-850.
• A methods article that describes a semi-automated procedure (with a Biomek robotic workstation) for the preparation of single-stranded DNA. This procedure replaces phenol protein separation with heat and Triton X-100.

47. UHLEN M, HULTMAN T, WAHLBERG J: Approaches to Solid-phase DNA Technology Using PCR. J Cell Biochem 1989, 13E:310.
•• The ability to conduct sample preparation and sequencing reactions on a solid-phase support greatly increases our ability to automate the entire sequencing process.

48. MITCHELL LG, MERRIL CR: Affinity Generation of Single-stranded DNA for Dideoxy Sequencing Following the Polymerase Chain Reaction. Anal Biochem 1989, 178:239-242.
• Incorporation of biotin into one of the primers in a PCR reaction enables capture by streptavidin. Single-stranded DNA for sequencing is obtained by performing a sodium hydroxide wash.

49. ROBERTS L: A Meeting of the Minds on the Genome Project. Science 1990, 250:756-757.
•• This is a news article that describes the organizational philosophies of the various large-scale sequencing projects.

T Hunkapiller, RJ Kaiser, BF Koop and L Hood, Division of Biology, 139-7U, California Institute of Technology, Pasadena, California 91125, USA.
