GENOMICS 13, 1008-1017 (1992)

Analyzing and Comparing Nucleic Acid Sequences by Hybridization to Arrays of Oligonucleotides: Evaluation Using Experimental Models

E. M. SOUTHERN, U. MASKOS, AND J. K. ELDER

Department of Biochemistry, University of Oxford, South Parks Road, Oxford OX1 3QU, United Kingdom

Received December 26, 1991; revised March 17, 1992

An efficient method was developed for making complete sets of oligonucleotides of defined length, covalently attached to the surface of a glass plate, by synthesizing them in situ. A device carrying all octapurine sequences was used to explore factors affecting molecular hybridization of the tethered oligonucleotides, to develop computer-aided methods for analyzing the data, and to test the feasibility of using the method for sequence analysis. Further development is needed before the method can be used routinely, but our work shows that it has a number of potential advantages over gel-based methods: it should be easy to automate; the quality of the sequence results can be evaluated statistically; it provides a powerful way of comparing related sequences and detecting mutation; it can be applied to both DNA and RNA; and specific motifs can be incorporated into all sequences of the array to focus analysis on sequences of biological interest. © 1992 Academic Press, Inc.

0888-7543/92 $5.00 Copyright © 1992 by Academic Press, Inc. All rights of reproduction in any form reserved.

INTRODUCTION

The development of methods for synthesizing oligodeoxyribonucleotides of defined sequence has extended the potential of molecular hybridization to providing sequence information on the target DNA, making it possible to design probes for the analysis of specific sequences, for example, known mutations associated with human diseases (Conner et al., 1983). Other possibilities are opened up if the probe consists of a set of oligonucleotides that may be used to provide a “fingerprint” comprising a binary code of presences and absences of particular short sequences in the test sequence (Craig et al., 1990). If the set of oligonucleotides is complete, that is, consists of all possible sequences of a given length, the result of the analysis is a complete description of the sequence probed and in principle can determine its base sequence (Lysov et al., 1988; Southern, 1988; Drmanac et al., 1989; Khrapko et al., 1989; Pevzner, 1989).

Two different experimental configurations have been developed for the hybridization reaction: the target sequence may be immobilized and the oligonucleotides labeled (Drmanac et al., 1989; Strezoska et al., 1991), or the oligonucleotides may be immobilized and the target sequence labeled (Southern, 1988; Khrapko et al., 1989, 1991). Each method has advantages over the other for particular applications. Labeling the oligonucleotides is advantageous when a large number of target sequences are to be analyzed, as in fingerprinting. On the other hand, for applications that require large numbers of oligonucleotides of different sequence, it is advantageous to immobilize the oligonucleotides and use the target sequence as the labeled probe. However, it is not easy to synthesize large numbers of oligonucleotides even with automated methods.

This paper describes a convenient method for synthesizing oligonucleotides covalently and stably attached to a glass surface. Efficient methods were developed to make large arrays of oligonucleotides of different sequence on the surface of a glass plate. Simple procedures for hybridization of probes to the tethered oligonucleotides are described. Methods were developed for imaging and for analyzing quantitatively the results of hybridizing radioactive probes to the arrays, using phosphorimaging (Johnston et al., 1990). We describe a simple procedure for making complete sets of all sequences of a given length and explore the potential of such devices for sequence determination. Although we analyze sequences of only 20 bases on arrays of 256 octapurines, our studies assess many features required of a working system. In these model experiments, the sequences were determined correctly. We also show how image processing procedures can be applied to the results for the rapid analysis of mutation. The method we developed to derive the base sequences and to analyze mutation from the hybridization data provides a measure of the quality of the assignment, an important feature missing from gel-based methods. Our experience suggests that it should be possible to automate all steps of the procedure, from making the devices to analyzing the results.

MATERIALS AND METHODS

3-Glycidoxypropyltrimethoxysilane, hexaethylene glycol, and HPLC-grade acetonitrile were purchased from Aldrich and used without further purification. All DNA synthesis materials were purchased from Applied Biosystems.

The first step in synthesizing the linker on the glass surface was carried out in one of two ways (Southern and Maskos, 1988; Maskos and Southern, 1992a). A mixture of xylene (10 ml), 3-glycidoxypropyltrimethoxysilane (3 ml) (modified after Engelhardt and Mathes, 1977), and a trace of Hünig's base (after Boksanyi et al., 1976) was held in contact with the glass by clamping two plates together with silicone rubber tubing, outer diameter 1.2 mm, lining the edges. The solution was introduced with a syringe, and the sealed assembly was held in an oven at 80°C overnight. Alternatively, a solution of 5% of the silane in water was used, keeping the pH between 5.5 and 5.8 (after Chang et al., 1976) for 30 min at 90°C. For the second step of the synthesis, the plates were heated in neat hexaethylene glycol, containing a catalytic amount of concentrated sulfuric acid, overnight at 80°C to yield alkyl hydroxyl-derivatized support. After washing with methanol and ether, the plates were air-dried and stored at -20°C.

To synthesize oligonucleotides on the surface of the plates, “masks” were prepared by gluing 65 lines of silicone rubber tubing (1.2 mm o.d.) with silicone rubber cement to the surface of a plain glass plate (220 × 220 mm), at 3-mm intervals. Clamping the mask against a derivatized plate produced channels that were gassed with argon before the coupling solutions were introduced. To ensure precise registration of channels in sequential coupling steps, one side of the plate was placed against a stop fixed to one side of the template. It is not necessary to locate the front and back edges of the plate with respect to the channels, as the silicone rubber barriers extend beyond them. Precursor and activator solutions were held in two 10-ml gas-tight syringes that were driven simultaneously by an infusion pump. The syringes were connected via Teflon (Kel-F) tubing to two sawn-off syringe needles that were pressed together by forcing them through a narrow hole in a Teflon cylinder.
Mixing occurred as the reagents entered the channel that delivered them to the surface of the plate. After coupling for a minimum of 1 min, the channels were thoroughly rinsed with acetonitrile, and the plate was separated from the mask for immersion in dichloroacetic acid (2.5% in dichloromethane) for 100 s. The plate was then washed with acetonitrile, dried in a stream of argon, and again placed against the mask with care taken to register the edges. In alternate coupling steps, the plate was rotated through 90° with respect to the mask. Hydrogen phosphonate chemistry (Froehler et al., 1986) was used for two reasons. First, since H-phosphonate precursors can be applied at a much lower concentration than phosphoramidites, costs were lower; although coupling yields are expected to be slightly lower, 97% compared to 99% under ideal conditions, this should not make a great difference for the very short sequences synthesized (Andrus et al., 1988). This was established in preliminary experiments on microscope slides (Maskos and Southern, 1992b). A second advantage of the H-phosphonate method is the elimination of oxidation after every coupling. As the phosphonate linkage is more stable, the oxidation step can be deferred to a single reaction at the end, making cycle times much faster. Ammonia deprotection was carried out in a specially designed bomb. A cavity (235 × 235 × 21 mm) was cut in a polyamide block (305 × 305 × 27 mm) to contain the glass plate. A sheet of silicone rubber (305 × 305 × 2 mm) formed a seal over the cavity. The assembly was clamped between two 5-mm-thick steel plates held together by 12 bolts to withstand the considerable pressure generated by 800 ml of concentrated ammonia at 55°C during the deprotection step. Oligopurine arrays were synthesized on the surface of a glass plate (150 × 150 mm).
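
The yield argument made above for short sequences can be checked with a little arithmetic. The following is a minimal sketch, not part of the original protocol: the function name is illustrative, and it assumes that every coupling step has the same yield and that no other losses occur.

```python
def full_length_fraction(step_yield: float, couplings: int) -> float:
    """Fraction of chains that received every coupling, assuming a constant per-step yield."""
    return step_yield ** couplings


if __name__ == "__main__":
    # Octamers need eight couplings; 0.97 and 0.99 are the yields quoted in the text.
    for label, step_yield in (("H-phosphonate", 0.97), ("phosphoramidite", 0.99)):
        fraction = full_length_fraction(step_yield, 8)
        print(f"{label}: {fraction:.2f} of chains are full-length octamers")
```

Under these assumptions the two chemistries give roughly 78% and 92% full-length octamers, a modest gap that widens quickly for longer chains.
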
After derivatization with the linker, oligonucleotides were synthesized on one surface of the glass plate using the protocol described above, except that 3-mm-wide channels were used for all coupling steps. In the first and second steps, precursors for A and G were introduced through alternating channels; in the third and fourth, through alternating pairs of channels; in the fifth and sixth, through alternating groups of four channels; and in the seventh and eighth, through alternating groups of eight channels, as shown on the axes in Fig. 4, to create four copies of all octapurines. The base sequence at any position can be worked out from the protocol for adding the precursors.

A pool of 256 different oligonucleotides of the formula 5'-A(C,T)₈A was synthesized on an Applied Biosystems Model 381A, deprotected in concentrated ammonia at 55°C overnight, and purified using an Applied Biosystems oligonucleotide purification cartridge (OPC) (McBride et al., 1988). An A residue at the 3' end was chosen so that a single oligonucleotide synthesis column could be used to generate the whole pool. The A nucleotide at the 5' end would offer the same substrate to the enzyme in every member of the pool, thus eliminating differences in the rate or amount of labeling with polynucleotide kinase. After end-labeling with polynucleotide kinase and [γ-³²P]ATP, an aliquot was used in a hybridization to the oligopurine plate. The solvent was 3.5 M tetramethylammonium chloride (TMACl) containing 50 mM Tris-HCl, pH 8.0, 2 mM EDTA, and SDS at less than 0.04 mg/ml. After hybridization at 4°C, the plate was rinsed for 10 min at 4°C in the solvent, sealed in a plastic bag, and exposed to a storage phosphor screen (Fuji STIII) at 4°C overnight in the dark. The screens were scanned by a PhosphorImager 400A (Molecular Dynamics), and digital images were transferred to a Sun-4 workstation for analysis.

RESULTS

Making complete sets of oligonucleotides. If a nucleic acid molecule is hybridized to a complete set of oligonucleotides of a given length, it is possible in principle to determine its sequence by overlapping those oligonucleotides that give a positive signal. (Figures 5A and 5B show two examples of 20-base sequences and the overlapping sets of 13 octamers that they comprise; these examples are discussed in the description of our experiments.) Several authors have considered the theoretical basis of the process (Drmanac and Crkvenjakov, 1987; Bains and Smith, 1988; Lysov et al., 1988; Southern, 1988; Khrapko et al., 1989; Bains, 1991). One of the factors influencing the power of the analysis is the length of the oligonucleotides. The length of random sequence that can be analyzed by a complete set of oligonucleotides of a given length is approximately the square root of the number of oligonucleotides in the set.¹ Each increase of one base in the length of the oligonucleotides increases the size of the set by a factor of 4. Octamers, of which there are 65,536, may be useful in the range up to 200 bases (Lysov et al., 1988; Khrapko et al., 1989; Pevzner, 1989; Maskos, 1991), and decanucleotides, of which there are more than a million, may analyze up to a kilobase. The smallest set that has been suggested to be of practical use for sequence determination comprises 256 hexanucleotides, with degeneracy in the two central positions; tests suggest that such a set could be used to analyze sequences of up to 70 bases (Bains and Smith, 1988; Bains, 1991). Clearly, efficient methods are needed to synthesize large sets.

Our approach is to synthesize arrays of oligonucleotides in situ using a linker (Fig. 1), which leaves them covalently attached to the surface of a glass plate and available for hybridization. Arrays representing all sequences are built by a procedure that applies nucleotide precursors to the surface of the plate in rows and columns; the logic of the procedure is similar to the familiar way of writing the triplet code in which all 64 triplets can be represented just once in 16 rows and four columns. A protocol that could be used to build an array of all 256 tetranucleotides is shown in Fig. 2. This process can be continued to any chosen depth to produce a two-dimensional array of all oligonucleotides of any length in which each oligonucleotide sequence occurs only once. If all four bases are used, $4^s$ oligonucleotides of length s are synthesized in s steps, applying the nucleotide precursors in $4^{s/2}$ rows and columns. Shapes other than stripes can be used to make complete sets of oligomers. The same end result could be achieved by nesting quadrants inside squares that decreased in area fourfold at each step. The method also lends itself to incorporating specific motifs. For example, the triplets representing a chosen amino acid can be incorporated into all sequences. Degenerate bases at chosen sites can be incorporated to produce the type of array suggested by Bains and Smith (1988), by using mixed precursors at appropriate steps of the synthesis.

FIG. 1. A linker that leaves synthetic oligonucleotides covalently attached to glass. The linker (Maskos and Southern, 1992a; Southern and Maskos, 1988) is a primary aliphatic hydroxyl group attached to the glass through a chain comprising only stable carbon-carbon and aliphatic ether linkages. Oligonucleotide synthesis is initiated from the primary hydroxyl group using standard phosphoramidite or H-phosphonate chemistry (Andrus et al., 1988; Applied Biosystems, 1987).

¹ We would like to know how the number of oligonucleotides, r, in a random sequence of bases is related to the probability that no oligonucleotide occurs more than once. Consider a sequence comprising oligonucleotides of length s, each overlapping its neighbors by s - 1 bases, and make the simplifying assumption that overlapping oligonucleotides are independent of each other. The problem can then be cast as a classical occupancy problem where balls (oligonucleotides) are placed randomly into n cells ($n = 4^s$ for four bases, $n = 2^s$ for two bases), until for the first time a ball is placed into a cell already occupied; the process then terminates (Feller, 1968). The probability $p_r$ that the process continues for more than r balls is

$$p_r = \left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{r-1}{n}\right).$$

When $r \ll n$,

$$\log p_r \approx -\frac{1 + 2 + \cdots + (r-1)}{n} = -\frac{r(r-1)}{2n},$$

and the value of r for which $p_r = 0.5$ is approximately $r \approx 1.18\sqrt{n}$. In reality, each oligonucleotide in the sequence is dependent on the s - 1 preceding oligonucleotides. A theory that treats this problem is available (Zubkov and Mikhailov, 1979).
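
The occupancy estimate in the footnote is easy to evaluate numerically. The sketch below is illustrative only: the function names are assumptions, and it adopts the footnote's simplification that overlapping oligonucleotides are independent.

```python
import math


def prob_all_distinct(r: int, n: int) -> float:
    """Product expression: probability that r oligonucleotides drawn from n are all different."""
    p = 1.0
    for i in range(1, r):
        p *= 1.0 - i / n
    return p


def half_point(n: int) -> int:
    """Smallest r for which the probability of no repeat has fallen to 0.5 or below."""
    r, p = 1, 1.0  # invariant: p equals prob_all_distinct(r, n)
    while p > 0.5:
        p *= 1.0 - r / n
        r += 1
    return r


if __name__ == "__main__":
    n_octamers = 4 ** 8
    r_half = half_point(n_octamers)
    print(f"octamers: p falls to 0.5 near r = {r_half} "
          f"({r_half / math.sqrt(n_octamers):.2f} x sqrt(n))")
    # Two-base case used in the model experiments: 13 octamers drawn from 2**8 octapurines.
    print(f"chance of a repeat among 13 octapyrimidines: {1 - prob_all_distinct(13, 2 ** 8):.2f}")
```

The second figure corresponds to the probability quoted for a random 20-base pyrimidine sequence analyzed on an array of 256 octapurines.
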
Sequencing trial with complete sets of oligopurines. We used the above procedure to make an array comprising four copies of all 256 octapurines, a total of 256 different sequences in 1024 cells in an area of 96 × 96 mm. This set required only eight coupling steps, carried out manually over a period of a day. For initial characterization of the octapurine plate, we used a probe with complementarity to all the oligonucleotides on the plate, a set of all octapyrimidine sequences made by conventional automated synthesis, using an equimolar addition of T and C at each step. The overhanging A residues added at each end go some way toward mimicking the effects of longer sequence probes (Mellema et al., 1984). This probe set was hybridized to the oligopurine arrays in the presence of TMACl, which reduces the effects of base composition (Maskos and Southern, 1992b, and manuscript in preparation) (Fig. 3). The same oligopurine array was used to analyze pyrimidine sequences. Two 24-mers were synthesized:

Sequence I: AATTTCCCTTCCTTCCTCTCTCAA
Sequence II: AATTTCCCTTCCCTCCTCTCTCAA

These sequences were chosen for the following reasons. Using the product expression in the footnote for the case of two bases and ignoring the effect of overlapping oligonucleotides, there is a 27% probability that, in a random sequence of 20 pyrimidines, any of the 13 constituent octamers will occur more than once. Thus a 20-mer is a

FIG. 2. Procedure for building arrays of oligonucleotides using intersecting lines of precursors. In the illustration, all tetranucleotides are made by laying down the first bases in channels one-sixteenth the width of the plate and running from top to bottom. The second bases are laid down in similar channels running across the plate; the third and fourth bases are laid down in the same way in channels each encompassing four of the first- and second-layer channels. The sequences are written in the cells of the array from upper left to lower right, corresponding to the direction of synthesis, i.e., 3' to 5' for the chemistry used in this work. Different geometries can be used, for example, nested squares, and the order of application can be changed, for example, the narrow channels can be used after the wider ones, to achieve the desired result of a complete set of sequences with defined length in which each member of the set is represented just once. Of course, changing the geometry or order changes the position of the oligonucleotides of the set. To incorporate specific motifs into the whole set, precursors are simply applied over the entire surface of the plate. For example, to make the set suggested by Bains and Smith (1988) of the general form NNXXNN, where N represents a defined base and X a mixture of all four, a mixture of all four precursors would be applied over the whole plate in two steps between Steps 2 and 3 of the protocol described above.
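
The stripe protocol of this legend can be sketched in a few lines. This is an illustrative reconstruction rather than the authors' software: `build_array` is a hypothetical name, odd steps are assumed to use channels running top to bottom and even steps channels running across the plate, and each pair of steps widens the channels by a factor equal to the number of bases.

```python
def build_array(alphabet: str, length: int) -> list[list[str]]:
    """Return a square grid whose cells hold the sequences in order of synthesis (3' to 5')."""
    bases = len(alphabet)
    if length % 2:
        raise ValueError("this sketch handles even lengths only")
    side = bases ** (length // 2)
    grid = [["" for _ in range(side)] for _ in range(side)]
    for step in range(length):
        width = bases ** (step // 2)  # channel width, in units of the narrowest channel
        for row in range(side):
            for col in range(side):
                # Steps 1, 3, 5, ... follow columns; steps 2, 4, 6, ... follow rows.
                index = col if step % 2 == 0 else row
                grid[row][col] += alphabet[(index // width) % bases]
    return grid


if __name__ == "__main__":
    tetranucleotides = build_array("ACGT", 4)  # the 16 x 16 illustration in the legend
    octapurines = build_array("AG", 8)         # one copy of the octapurine array
    print(len(tetranucleotides), "x", len(tetranucleotides[0]), "tetranucleotide cells")
    print(len({cell for row in octapurines for cell in row}), "distinct octapurines")
```

Because the column index fixes the bases added in the odd steps and the row index those added in the even steps, every cell receives a different sequence, which is why the sequence at any position can be read off from the protocol.
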


FIG. 3. Experiments with oligopurine arrays: hybridization with oligopyrimidines. The plate carries four copies of an array of all 256 octapurines, one in each of the four quadrants. The protocol for building the array was a simplification of that described in Fig. 2, in which only the precursors for A and G were applied as shown at the top left of Fig. 4 and described in the text. The base sequence at any position in the four octapurine arrays can be worked out from the protocol for adding the precursors, shown in Fig. 4, given that coupling is from 3' to 5'. (Left) The probe was a synthetic mixture of all pyrimidine octamer sequences flanked by an A residue at each end, as described in the text. Hybridization was in 10 ml of 3.5 M TMACl at 4°C for 7 h. (Right) The same plate probed with sequence I. Hybridization was in 10 ml of 3.5 M TMACl at 29°C for 5.25 h. The photographs show the raw images that were processed as described below before sequence reconstruction. Due to imperfections in the channels, the cells of the array on the glass plate varied slightly in size. To quantify the intensity in each cell, a grid was superimposed over the array so that each spot lay at approximately the center of a grid cell; the pixel values within a square centered at the cell's center of mass and of area one-quarter that of the cell were summed. Effects of variation in the yield of oligonucleotides were reduced and occasional hot spots in the background were eliminated by taking a truncated mean calculated for each octanucleotide by ranking the four cell intensities obtained from the quadrants, discarding the highest and lowest values, and taking the sum of the central two values. Sequence II was analyzed in the same way in 10 ml of 3.5 M TMACl at 25°C for 1.25 h (raw data not shown, but see the bottom right-hand panel of Fig. 4).
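
The replicate handling described in this legend, together with the cell-by-cell normalization against the octapyrimidine reference described in the text, can be sketched as follows. The function names and the example numbers are hypothetical, and the summation of pixels around each cell's center of mass is omitted.

```python
def truncated_mean(quadrant_values: list[float]) -> float:
    """Rank the four quadrant intensities, drop the extremes, and sum the central two."""
    if len(quadrant_values) != 4:
        raise ValueError("expected one intensity from each of the four quadrants")
    ranked = sorted(quadrant_values)
    return ranked[1] + ranked[2]


def normalise(sample: dict[str, float], reference: dict[str, float]) -> dict[str, float]:
    """Divide each cell of a sequencing experiment by the same cell of the reference image."""
    return {cell: sample[cell] / reference[cell] for cell in sample}


if __name__ == "__main__":
    # Invented intensities for one octapurine; 100.0 stands in for a background hot spot.
    print(truncated_mean([5.0, 100.0, 7.0, 1.0]))
    print(normalise({"AAGGGAAA": 6.0}, {"AAGGGAAA": 3.0}))
```

Discarding the highest and lowest replicate makes the estimate insensitive to a single hot spot or a single failed cell, at the cost of using only half of the measurements.
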

suitable length to analyze on an array of octapurines. The sequences also have a mixture of alternating singletons, doublets, and triplets of each of the pyrimidines, which allows us to examine effects of sequence on hybridization behavior. Sequences I and II differ by a single base and so can be used to test the method for detection of point mutations described below. The oligonucleotides were labeled and hybridized to the array as described in the legends to Figs. 3 and 4.

Imaging, quantitation, and normalization of hybridization to the array. The hybridization pattern was digitized by phosphorimaging (Johnston et al., 1990). The greater spatial resolution of autoradiography is not needed for the arrays used in the experiments described here, although the 88-µm pixel size of the PhosphorImager puts a limit on the size of cells that can be analyzed by this method. Phosphorimaging can be carried out above freezing temperatures, avoiding the ice crystals that form during exposure to film at -70°C and cause background; a further advantage is the automatic digitization of the image, which can be immediately fed to the software used to display, manipulate, and quantify the data. There were about 1000 pixels per cell of the oligonucleotide array, and the total measurement in a cell of average intensity represented many thousands of radioactive disintegrations. Thus, counting errors should be negligible when compared with variation resulting from other sources. We were able to compensate for much, but not all, of the variation in yield due to differences in base composition by using TMACl (Melchior and von Hippel, 1973; Wood et al., 1985; Jacobs et al., 1988; Maskos and Southern, 1992b, and in preparation) as solvent during hybridization. Residual effects of base composition, base sequence, and variation in the yield of oligopurines on the plate were compensated for, cell by cell, using the values from the hybridization with the oligopyrimidine probe to normalize data from the sequencing experiments: the integrated intensity of each cell in sequencing experiments was divided by that of the corresponding cell when the array was probed with the complete set of 256 octapyrimidines. In effect, the right-hand image in Fig. 3 was divided by the left-hand image in Fig. 3.

FIG. 4. Images used in sequence reconstruction and mutant detection. The left-hand panels show the ideal images to be expected if signal were obtained only in cells complementary to octapyrimidines in the probe. The sequences written against the cells in the top panel are those expected from sequence I; those in the bottom panel are those in sequence II; the solid cells in the center panel are those sequences that are found only in sequence I (red) or II (blue); and the circles show the sequences present in both I and II. The right-hand panels show experimental data: the upper image is the truncated mean of the four quadrants of the right-hand panel of Fig. 3; the lower panel is a similar truncated mean for sequence II. The center panels show the data used to reconstruct the sequences of Table 1. These were derived from the right-hand panels by summing one-quarter of the pixels around the center of mass of each cell and normalizing with corresponding data from hybridization with the full set of octapyrimidines (see left-hand panel of Fig. 3). The center row of panels is the result of subtracting the bottom from the top panels. In these images, deep red represents high positive intensity and deep blue high negative intensity. The data used to construct the center panel were also used to reconstruct the sequences around the sequence difference shown in Table 2.

A quantitative approach to sequence assembly. The sequencing experiments did not show the ideal of 13 dark cells representing complementary oligonucleotides on a clear background (left-hand column, Fig. 4). Contrast improved as we applied the various corrections described in the legend to Fig. 3: summation of pixels around the center of mass; truncated mean of the quadrant cell intensities; and normalization against the octapyrimidine set. But even after processing, not all the complementary octapurines produce dark spots, and some of the dark spots (center column, Fig. 4) are partly mismatched duplexes. Thus, it is not possible to score the result by eye. Nor is it possible to assemble the sequence using algorithms that start with a set of the most strongly positive oligonucleotides (Drmanac and Crkvenjakov, 1987; Lysov et al., 1988; Drmanac et al., 1989; Khrapko et al., 1989; Pevzner, 1989), as no simple criterion allows us to decide between absence or presence of a given oligonucleotide.

We therefore devised a quantitative approach to the problem of sequence assembly that makes use of the intensity in all cells and evaluates the goodness of fit of all possible $2^{20}$ 20-mers to the whole of the data. This fit was obtained from the sum of the squares of the differences between mock and true data cell intensities. We first

Sequence I              Sequence II             Difference

TTTCCCTTCCTTCCTCTCTC    TTTCCCTTCCCTCCTCTCTC    tttCCCTTCCXTCCTCTCtc

TTTCCCTT                TTTCCCTT
TTCCCTTC                TTCCCTTC
TCCCTTCC                TCCCTTCC
CCCTTCCT                CCCTTCCC                CCCTTCCX
CCTTCCTT                CCTTCCCT                CCTTCCXT
CTTCCTTC                CTTCCCTC                CTTCCXTC
TTCCTTCC                TTCCCTCC                TTCCXTCC
TCCTTCCT                TCCCTCCT                TCCXTCCT
CCTTCCTC                CCCTCCTC                CCXTCCTC
CTTCCTCT                CCTCCTCT                CXTCCTCT
TTCCTCTC                CTCCTCTC                XTCCTCTC
TCCTCTCT                TCCTCTCT
CCTCTCTC                CCTCTCTC

FIG. 5. Octanucleotide sets making up 20-mers I and II and the set of pairs comprising the difference. Sequences I and II were hybridized to the oligopurine plate to produce Figs. 3 and 4. Columns of octapyrimidines beneath each 20-mer show the two sets of 13 constituent octamers. The sequence on the right indicates with an X where sequences I and II differ, and underneath this sequence is the set of eight pairs of octamers expected to be seen in the difference image in Fig. 4 (X in these sequences can be T or C). The uppercase characters in the difference 20-mer represent the two 15-base sequences assembled in the reconstruction; the lowercase represents sequence that is not reconstructed.
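
The columns of this figure can be regenerated directly from the two 20-mers. The sketch below is illustrative; the function names are assumptions, and the X notation follows the legend.

```python
SEQ_I = "TTTCCCTTCCTTCCTCTCTC"
SEQ_II = "TTTCCCTTCCCTCCTCTCTC"


def constituent_kmers(seq: str, k: int = 8) -> list[str]:
    """All overlapping k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def difference_pairs(seq_a: str, seq_b: str, k: int = 8) -> list[str]:
    """k-mer pairs at the same offset that differ, with the differing base written as X."""
    pairs = []
    for a, b in zip(constituent_kmers(seq_a, k), constituent_kmers(seq_b, k)):
        if a != b:
            pairs.append("".join(x if x == y else "X" for x, y in zip(a, b)))
    return pairs


if __name__ == "__main__":
    print(len(constituent_kmers(SEQ_I)), "octamers in each 20-mer")
    for pair in difference_pairs(SEQ_I, SEQ_II):
        print(pair)
```
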

normalized the intensity values over the array to total 13 (the number of octamers needed to make a 20-mer; Fig. 5). For each candidate sequence, a fit was calculated according to a simple model in which every occurrence of an octamer in the sequence contributes one unit of intensity to the array as a whole: 0.2 unit to that octamer's data cell and 0.8 unit distributed uniformly over the other cells of the array. The average of the 13 most intense cells was 0.2. For each sequence, Table 1 shows the sum of the squares of the residuals between mock and true data. Care must be taken in interpreting these values; they can be used only to rank the sequences, as they have not been normalized with respect to the level of noise in the data. For quantitative comparison, a more realistic model is needed, which takes account of cross-hybridization between partially matched oligonucleotides and of noise in the data; likelihood ratios calculated according to such a model for different sequences would allow their relative merits to be assessed and would also allow sequences of different length to be assessed quantitatively (Elder and Southern, in preparation). For both sequences I and II, the sequence with the best fit was the correct one (Table 1). It should be noted, too, that other sequences with good fits differ from the correct sequence by base changes at the ends. Because of the large numbers of possible sequences, this exhaustive method could not be used on sequences comprising all four bases (there are $4^{20} \approx 10^{12}$ 20-mers of all four bases), but the number of sequences to be evaluated can be reduced by excluding oligomers that have cell intensities below a threshold value. The exhaustive method can then be applied to the reduced pool of oligomers. The choice of threshold is a balance between reducing the pool of oligomers and minimizing the chance of excluding oligomers that occur in the sequence and

TABLE 1

Sequences Assembled from the Results of Hybridizing Sequences I and II to the Octapurine Array

Sequence                RSS(a)   Rank

TTTCCCTTCCTTCCTCTCTC    0.254    1
TTTCCCTTCCTTCCTCTCTt    0.265    2
cTTCCCTTCCTTCCTCTCTC    0.268    3
cTTTCCCTTCCTTCCTCTCT    0.269    4
tTTTCCCTTCCTTCCTCTCT    0.270    5

TTTCCCTTCCCTCCTCTCTC    0.388    1
tTTTCCCTTCCCTCCTCTCT    0.393    2
cTTTCCCTTCCCTCCTCTCT    0.399    3
TTTCCCTTCCCTCCTCTCTt    0.404    4
tTTTCCCTTCCCTCCTCTCc    0.408    5

Note. The assembly procedure described in the text was used to rank the sequences. The top ranked sequence is correct for both sequences I and II. Miscalls, represented in lowercase, concentrate at the ends as can be seen from the differences between the correct sequence and the next four in rank.
(a) RSS, sum of squares of residuals.

TABLE 2

Finding the Site of a Mutation

Sequence I/II      RSS(a)   Rank

CCCTTCCTTCCTCTC    0.343    1
CCCTTCCCTCCTCTC

tCCTTCCTTCCTCTC    0.389    2
tCCTTCCCTCCTCTC

CtCTTCCTTCCTCTC    0.403    3
CtCTTCCCTCCTCTC

CCCTTCCTTCCTCTt    0.404    4
CCCTTCCCTCCTCTt

ttCTTCCTTCCTCTC    0.407    5
ttCTTCCCTCCTCTC

Note. Pairs of sequences with one mutation in the middle were assembled from the difference image using a fitting procedure similar to that used to reconstruct the sequences. Note that the method not only detects the mutation, but also returns the sequence of 2s - 1 bases centered on the mutation, where s is the length of oligonucleotides in the array. Thus 15 bases around the mutation are determined from an array of octamers.
(a) RSS, sum of squares of residuals.
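
The residual sums of squares in Table 1 come from comparing mock and measured cell intensities for every candidate sequence. The sketch below illustrates that exhaustive fit under the simple model given in the text (0.2 unit to the matching cell, 0.8 unit spread over the others). It is not the authors' program: the names are hypothetical, cells are indexed by the probe's own pyrimidine k-mers rather than by the complementary purines, and the demonstration uses 4-mers and an 8-base sequence so that it runs quickly in pure Python.

```python
from itertools import product

BASES = "CT"  # the model probes contain only pyrimidines


def kmer_counts(seq: str, k: int) -> dict[str, int]:
    counts: dict[str, int] = {}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] = counts.get(seq[i:i + k], 0) + 1
    return counts


def mock_data(seq: str, k: int, own: float = 0.2) -> dict[str, float]:
    """Each k-mer occurrence adds `own` to its cell and spreads the rest over the other cells."""
    n_cells = len(BASES) ** k
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    spread = (1.0 - own) / (n_cells - 1)
    mock = {}
    for cell in ("".join(p) for p in product(BASES, repeat=k)):
        hits = counts.get(cell, 0)
        mock[cell] = hits * own + (total - hits) * spread
    return mock


def normalise(data: dict[str, float], total: float) -> dict[str, float]:
    """Scale the intensities so that they sum to the number of k-mers in the sequence."""
    scale = total / sum(data.values())
    return {cell: value * scale for cell, value in data.items()}


def rss(seq: str, data: dict[str, float], k: int, own: float = 0.2) -> float:
    mock = mock_data(seq, k, own)
    return sum((mock[cell] - data[cell]) ** 2 for cell in data)


def rank_sequences(data: dict[str, float], length: int, k: int, top: int = 5):
    """Score every sequence of the given length and return the `top` best (RSS, sequence) pairs."""
    scored = ((rss("".join(p), data, k), "".join(p)) for p in product(BASES, repeat=length))
    return sorted(scored)[:top]


if __name__ == "__main__":
    true_seq = "TTCCTCTC"  # toy example, not one of the experimental sequences
    data = normalise(mock_data(true_seq, 4), total=len(true_seq) - 4 + 1)
    for score, seq in rank_sequences(data, len(true_seq), 4):
        print(f"{seq}  {score:.4f}")
```

With noise-free mock data the true sequence fits exactly; with real data the ranking, not the absolute value of the residual, is what carries information, as the text cautions.
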


depends on the degree of cross-hybridization and the noise level in the data. We are exploring this and other ways of tackling search problems of realistic size. A similar approach for quantitative evaluation of sequences determined by gel-based methods would be much more difficult to implement. Although much remains to be done to bring the method to practical application, we conclude that it has potential for sequence determination, with attributes that make it substantially different from the gel-based methods.

Detection of mutation. Methods that detect sequence differences have important applications in genetic mapping (Kurnit, 1979; Solomon and Bodmer, 1979; Botstein et al., 1980), in DNA fingerprinting (Jeffreys et al., 1985), and in scanning genes for mutations (Wallace et al., 1979; Myers et al., 1985a, 1985b; Cotton et al., 1988; Orita et al., 1989; Saiki et al., 1989; Nickerson et al., 1990). For these applications, and where long sequences are to be analyzed, it would be an advantage to eliminate from consideration regions of identical sequence and focus the analysis on the differences. Sequences I and II differ only by one C to T substitution. However, a single base change affects the sequences of the eight octamers that include that position (Fig. 5). The two hybridization patterns show differences in several spots, as expected (Fig. 4). There should be eight octapurines present in each pattern that are absent in the other and five that are common to both. The difference between sequences I and II is readily seen in the two sequences of Table 1, but the strong difference between the two hybridization patterns suggests a more direct way of finding the underlying sequence difference, using an approach that we developed to find differences in complex patterns of restriction fragments (Southern, 1979).
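
The counts of eight and five given above can be confirmed by mapping each probe octamer to the octapurine cell it should occupy and comparing the two sets. This is a minimal sketch with hypothetical names; it assumes perfect Watson-Crick pairing and writes every sequence 5' to 3'.

```python
SEQ_I = "TTTCCCTTCCTTCCTCTCTC"
SEQ_II = "TTTCCCTTCCCTCCTCTCTC"
COMPLEMENT = {"T": "A", "C": "G"}


def complementary_purine(pyrimidine_oligo: str) -> str:
    """Reverse complement of an oligopyrimidine, i.e., the oligopurine it pairs with."""
    return "".join(COMPLEMENT[base] for base in reversed(pyrimidine_oligo))


def expected_cells(seq: str, k: int = 8) -> set[str]:
    """Octapurine cells expected to give signal when seq is hybridized to the array."""
    return {complementary_purine(seq[i:i + k]) for i in range(len(seq) - k + 1)}


if __name__ == "__main__":
    cells_i, cells_ii = expected_cells(SEQ_I), expected_cells(SEQ_II)
    print("only in I:", len(cells_i - cells_ii))
    print("only in II:", len(cells_ii - cells_i))
    print("in both:", len(cells_i & cells_ii))
```
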
As the two images were produced from the same plate, they can be overlaid precisely, and one image can be subtracted from the other (center and center-right panel of Fig. 4). This reduces the intensity of the spots that are in common, emphasizing those that differ in the two images. Figure 5C shows the set of octamer pairs embracing at their center the single base difference between sequences I and II. The two sets of eight octamers overlap to produce a pair of 15-mers with the mismatch at the center. Again, the difference between the images is not the ideal, which would have eight spots contributed from sequence I and eight from sequence II, and we adapted the method for sequence assembly described above to find the pair of sequences of 15 bases that gave the best fit to the data represented in the central image of Fig. 4. There are $2^{15}$ ordered pairs of pentadecapyrimidines that differ only at the eighth base position. Each pair was assessed against the difference data by generating a set of mock difference data and calculating a fit in a way similar to that used to generate the full sequences. The result (Table 2) is clear and correct. There are two reasons why this approach gives a more reliable analysis of sequence difference than that obtained by comparing the two full sequences that best fit the data sets: the problem to be solved is simpler as only the region containing the difference is relevant, and a more robust measure of goodness of fit between mock and true data is obtained by using both sets of data simultaneously. Two further advantages are that the sequence length does not need to be known, and the size of the computing problem does not alter with the length of the sequences that are compared, since both depend only on the length of oligonucleotides in the array. The method illustrated for transitions can easily be adapted to handle other types of mutation such as transversions and small insertions or deletions.

DISCUSSION

The model experiments that correctly reconstructed two different sequences from the results of hybridizing them to an array of octapurines are an encouraging start in the development of a method of analysis with potential for large-scale sequencing, but several problems must be solved before the method can be put to practical use. The arrays of octapurines used in our studies had 256 cells that would only accommodate tetranucleotides of all four bases, which is not long enough to form stable duplexes. However, the same size of array could be used to determine sequences up to 70 bases long, using the approach suggested by Bains and Smith (1988) in which the array comprises hexamers with the two central bases undefined. Although the inclusion of degenerate bases within the oligonucleotides greatly extends the length of sequence that can be analyzed, it introduces problems in the analysis. First, as each sequence in the array is a mixture of 16 sequences, the signal will be reduced. Second, cross-hybridizations will be more difficult to resolve, as there will be more opportunities for mismatched duplexes to form in cells containing mixed sequences. Larger arrays of single sequences will be needed to sequence runs of more than a hundred bases, as would be required for most practical sequencing applications. An array of all octanucleotides would occupy an area 256 × 256 mm with a cell size of 1 mm², which is probably close to the limit of the method described here; the set of 256 octapurines used in our experiments would form just one row of such an array. However, there are many alternative ways in which smaller cells could be formed on a surface. Printing methods with a resolution of 10 µm could be used to form masks resistant to the oligonucleotide precursors or the deprotecting agents. Photolithographic techniques can be used to remove photolabile blocking groups at each cycle of synthesis (Fodor et al., 1991). This method can produce cells down to a size of 25 µm.
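The array dimensions discussed in this part of the Discussion follow from counting sequences: a complete set of oligonucleotides of length n over b bases has b^n members, and a square layout has a side of the square root of that number of cells. The short Python sketch below reproduces the arithmetic; the function names are illustrative, and the square layout and the cell sizes passed in are assumptions taken from the figures quoted in the text.

```python
import math


def n_sequences(length, n_bases=4):
    """Number of distinct oligonucleotides of a given length."""
    return n_bases ** length


def square_side_mm(length, cell_um, n_bases=4):
    """Side of a square array holding every sequence once, in millimetres."""
    n = n_sequences(length, n_bases)
    side_cells = math.isqrt(n)
    if side_cells * side_cells != n:
        raise ValueError("number of cells is not a perfect square")
    return side_cells * cell_um / 1000.0


if __name__ == "__main__":
    print(n_sequences(8, n_bases=2))   # octapurines (A or G at each position)
    print(n_sequences(4))              # tetranucleotides of all four bases
    print(n_sequences(2))              # sequences mixed in one gapped-hexamer cell
    print(square_side_mm(8, 1000))     # all octanucleotides, 1 mm cells
    print(square_side_mm(12, 10))      # all dodecamers, 10 µm cells
    print(square_side_mm(12, 25))      # all dodecamers, 25 µm cells
```

With these inputs the octanucleotide array has a side of 256 cells, and the dodecamer array has a side of 4096 cells, that is roughly 41 mm at a 10 µm cell size and roughly 102 mm at 25 µm.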
Such small cell sizes would produce a device 40-100 mm square for an array of all dodecamers, with the potential to analyze runs of around 4000 bases.

The combination of the oligopurine array and pyrimidine target sequence was chosen for a number of reasons: it enabled us to explore the behavior of all base compositions in an array that was smaller than an array of octanucleotides of all four bases, and oligopurines and oligopyrimidines both lack foldback sequences, which we expect will interfere with intermolecular hybridization. In addition, the sequences we used in the test lacked repeats and long homopolymer runs. Such features have effects on hybrid yields that must be eliminated or taken into account in the interpretation of the data. The empirical method of standardizing the array with a complete set of complementary sequences used here could be extended to overcome many of these problems. An alternative would be to learn the rules governing duplex formation between oligonucleotides in sufficient detail to allow calculation of correction factors. Such rules are likely to be complex (Breslauer et al., 1986; Wetmur, 1991), but it has been our experience that they emerge from detailed analysis of experiments such as the one we used to calibrate the array of octapurines, to be reported in detail elsewhere (Maskos, 1991; Maskos and Southern, in preparation).

The exhaustive sequence assembly method we used must be modified to deal with long sequences of all four bases, but simulations suggest that simply removing those cells with faint signal from the analysis reduces the problem to a manageable size. Alternative strategies suggested for the efficient assembly of long sequences (Pevzner, 1989) do not address the problems presented by the nature of the data, which our experiments show can be far from the ideal of perfect discrimination between perfectly and imperfectly paired duplexes. Many of the problems associated with reconstructing long sequences are removed by the image subtraction method we used for sequence comparison. These experiments suggest that a complete set of octanucleotides could be used to compare quite long sequences, perhaps several hundred bases long, and find any single base difference.
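The comparison by image subtraction can be illustrated with a minimal Python sketch that searches every ordered pair of 15-base pyrimidine sequences differing only at the central base. Several things here are assumptions made for illustration rather than the procedure used in this work: the difference data are idealized (+1 for an octamer present only in the first target, -1 for one present only in the second, with no noise or cross-hybridization), the fit is a simple sum of products between mock and observed differences, the two 20-base targets are hypothetical, and the octamers are written in the target (pyrimidine) sense rather than as the complementary array oligonucleotides.

```python
from itertools import product

K = 8  # oligonucleotide length on the array

# Hypothetical pyrimidine targets differing by one C -> T change (not from the paper).
SEQ_I = "TCTTCCTCTCCCTTTCTCCT"
SEQ_II = "TCTTCCTCTTCCTTTCTCCT"


def kmers(seq, k=K):
    """All overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def ideal_difference(seq_a, seq_b, k=K):
    """Idealized subtracted image: +1 for k-mers only in a, -1 for k-mers only in b."""
    a, b = set(kmers(seq_a, k)), set(kmers(seq_b, k))
    diff = {word: 1.0 for word in a - b}
    diff.update({word: -1.0 for word in b - a})
    return diff


def score_pair(cand_a, cand_b, observed, k=K):
    """Assumed fit measure: sum of products of mock and observed differences."""
    mock = ideal_difference(cand_a, cand_b, k)
    return sum(sign * observed.get(word, 0.0) for word, sign in mock.items())


def best_pair(observed, alphabet="CT", k=K):
    """Exhaustive search over ordered pairs of (2k-1)-mers differing at the center."""
    n, mid = 2 * k - 1, k - 1
    best_score, best = float("-inf"), None
    for letters in product(alphabet, repeat=n):
        a = "".join(letters)
        for base in alphabet:
            if base == a[mid]:
                continue
            b = a[:mid] + base + a[mid + 1:]
            score = score_pair(a, b, observed, k)
            if score > best_score:
                best_score, best = score, (a, b)
    return best, best_score


if __name__ == "__main__":
    observed = ideal_difference(SEQ_I, SEQ_II)
    print(best_pair(observed))
```

With a two-letter alphabet the loop visits 2^15 ordered pairs whatever the length of the targets, which is the property noted in the text: the size of the search depends only on the length of the oligonucleotides in the array. Real data would need the mock differences to be weighted by measured hybridization yields.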
As point mutation causes a high proportion of human genetic disease (Cooper and Krawczak, 1989), such a method would find many applications, for example, in screening genes such as the cystic fibrosis transmembrane conductance regulator (CFTR) (Zielenski et al., 1991) and the p53 (Harris, 1991) genes, which are affected by mutation at many sites in the sequence.

Comparison with Gel-Based Methods and Others

The approach to analyzing nucleic acid sequences described here bears some similarities to early methods that were based upon the analysis of oligonucleotides produced by degrading RNA or DNA to its constituent oligonucleotides and analyzing these by two-dimensional separation methods (Murray, 1970). These methods had some success but were replaced by the gel-based methods (Maxam and Gilbert, 1977; Sanger et al., 1977), which are much simpler and able to read long sequences quickly. The advantages of the present method of analyzing oligonucleotide composition are the greater potential resolving power compared with that of the 2-D separation methods and the simplicity of the procedures.

Gel-based methods of DNA sequence analysis are well established, and many ancillary techniques have been developed to support their application to large-scale sequencing problems. Nevertheless, alternatives are being sought for the very large problems of sequencing whole genomes now being considered. A major disadvantage of the gel-based methods is the difficulty of automating some of the steps. It has been possible to automate the reading process, from autoradiographs of gels (Elder et al., 1986) or by continuously monitoring fluorescent bands as they pass a fixed point in the gel. But it has proved difficult to automate most of the manipulations apart from sample preparation (Hunkapiller et al., 1991), and it is here that the present method may have practical advantages, as we have found that all manipulations are simple, quick, and suitable for automation. Furthermore, the dimensional stability and logical layout of the array is particularly suitable for processing by computer. Indeed it is difficult to see how the images could be analyzed without the aid of a computer; this leads naturally to a quantitative approach to assembling the sequence and further to an estimate of the quality of the result, thus addressing one of the major problems of gel-based sequencing methods.

The method of analysis presented here, like that described by Lysov et al. (1988), is aimed at analyzing relatively short sequences rapidly. It differs from the method described by Drmanac and Crkvenjakov (1987), which hybridizes oligonucleotides to arrays of cloned target sequences and therefore builds up the entire sequence encompassed by the clones in the array by the stepwise accumulation of data.

Because the analysis of mutation has applications in many areas of biology, methods for analyzing sequence differences have become increasingly important. Most of the methods in current use involve gel electrophoresis and so have the same experimental difficulty as gel-based sequence analysis.
The approach described here removes the need for electrophoresis. Some of the more powerful methods for detecting sequence differences, such as electrophoresis of heteroduplexes on denaturing gels and gel electrophoresis of single strands on nondenaturing gels, do not give information on the nature of the sequence difference. Like direct sequence analysis, the comparison of hybridization patterns identifies the sequence difference.

The potential to address specific biological problems, by incorporating known sequence motifs into the array, is a further advantage of the oligonucleotide arrays. Several specific sequence motifs have been shown to be associated with specific functions. For example, polyadenylation of mRNA requires a sequence, most frequently AAUAAA, toward the 3′ end of the mRNA (Humphrey and Proudfoot, 1988). An array of 12-mers comprising AATAAA extended by all sequences of six bases (e.g., N₆AATAAA, AATAAAN₆, or N₃AATAAAN₃) could be used to analyze cDNA made from mRNA populations. The result would be a set of oligonucleotide sequence “addresses” defining those mRNAs that differed in the populations, which could be used to prepare hybridization probes to isolate the relevant mRNAs. Amino acid sequence motifs, such as those characteristic of zinc fingers (Klug and Rhodes, 1987) and calcium-binding domains (Babu et al., 1985), can be back-translated into base sequences, which could be incorporated in just the same way, to search mRNAs for functional sets.

Large arrays of oligonucleotides may find applications other than sequence analysis, for example, in the study of DNA binding proteins; and the potential to incorporate nucleotide analogues may lead to other uses.
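The motif-anchored arrays described above can be enumerated directly: each layout fixes the hexamer AATAAA and fills the six remaining positions with every combination of the four bases, giving 4^6 = 4096 sequences per layout. The Python sketch below is illustrative only; the function name and the representation of a layout as a pair of flank lengths are assumptions, and the sequences are written in the sense of the motif rather than as the complementary oligonucleotides that would be synthesized on the array.

```python
from itertools import product


def motif_probes(motif="AATAAA", left=0, right=0, alphabet="ACGT"):
    """Yield the motif flanked by every combination of `left` and `right` bases."""
    for lead in product(alphabet, repeat=left):
        for tail in product(alphabet, repeat=right):
            yield "".join(lead) + motif + "".join(tail)


# The three 12-mer layouts named in the text, as (left, right) flank lengths.
LAYOUTS = {"N6-motif": (6, 0), "motif-N6": (0, 6), "N3-motif-N3": (3, 3)}

if __name__ == "__main__":
    for name, (left, right) in LAYOUTS.items():
        probes = list(motif_probes(left=left, right=right))
        print(name, len(probes), probes[0], probes[-1])
```

A back-translated amino acid motif could be passed as `motif` in the same way, with one call per codon choice where the back-translation is degenerate.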

ACKNOWLEDGMENTS

We are grateful to Drs. W. R. A. Brown, A. Porter, and C. Tyler-Smith for commenting on the manuscript and to Mr. M. Johnson for constructing much of the specialized equipment. U.M. acknowledges the support of the Maximilianeum, Munich, Germany, the Studienstiftung des deutschen Volkes, the Bayerische Begabtenförderung, and the Deutscher Akademischer Austauschdienst. This work was supported by the U.K. Medical Research Council Human Genome Mapping Project.

REFERENCES

Andrus, A., Efcavitch, J. W., McBride, L. J., and Giusti, B. (1988). Novel activating and capping reagents for improved hydrogenphosphonate DNA synthesis. Tetrahedron Lett. 29: 861-864.
Applied Biosystems (1987). “Model 381A DNA Synthesizer User’s Manual, Version 1.23,” Applied Biosystems, Inc., Foster City, CA.
Babu, Y. S., Sack, J. S., Greenhough, T. J., Bugg, C. E., Means, A. R., and Cook, W. J. (1985). Three-dimensional structure of calmodulin. Nature 315: 37-40.
Bains, W. (1991). Hybridization methods for DNA sequencing. Genomics 11: 294-301.
Bains, W., and Smith, G. C. (1988). A novel method for nucleic acid sequence determination. J. Theor. Biol. 135: 303-307.
Boksányi, L., Liardon, O., and Kováts, E. sz. (1976). Chemically modified silicon dioxide surfaces. Reaction of n-alkyldimethylsilanols and n-oxaalkyldimethylsilanols with the hydrated surface of silicon dioxide: the question of the limiting surface concentration. Adv. Colloid Interface Sci. 6: 95-137.
Botstein, D., White, R. L., Skolnick, M., and Davis, R. W. (1980). Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am. J. Hum. Genet. 32: 314-331.
Breslauer, K. J., Frank, R., Blocker, H., and Marky, L. A. (1986). Predicting DNA duplex stability from the base sequence. Proc. Natl. Acad. Sci. USA 83: 3746-3750.
Chang, S. H., Gooding, K. M., and Regnier, F. E. (1976). Use of oxiranes in the preparation of bonded phase supports. J. Chromatogr. 120: 321-333.
Conner, B. J., Reyes, A. A., Morin, C., Itakura, K., Teplitz, R. L., and Wallace, R. B. (1983). Detection of sickle cell βS-globin allele by hybridization with synthetic oligonucleotides. Proc. Natl. Acad. Sci. USA 80: 278-282.
Cooper, D. N., and Krawczak, M. (1989). The mutational spectrum of single base-pair substitutions causing human genetic disease: Patterns and predictions. Hum. Genet. 85: 55-74.
Cotton, R. G. H., Rodrigues, N. R., and Campbell, R. D. (1988). Reactivity of cytosine and thymine in single-base-pair mismatches with hydroxylamine and osmium tetroxide and its application to the study of mutations. Proc. Natl. Acad. Sci. USA 85: 4397-4401.
Craig, A. G., Nizetic, D., Hoheisel, J. D., Zehetner, G., and Lehrach, H. (1990). Ordering of cosmid clones covering the Herpes simplex virus type I (HSV-I) genome: A test case for fingerprinting by hybridization. Nucleic Acids Res. 18: 2653-2660.
Drmanac, R., and Crkvenjakov, R. (1987). Yugoslav Patent Application 570.
Drmanac, R., Labat, I., Brukner, I., and Crkvenjakov, R. (1989). Sequencing of megabase plus DNA by hybridization: Theory of the method. Genomics 4: 114-128.
Elder, J. K., Green, D. K., and Southern, E. M. (1986). Automatic reading of DNA sequencing gel autoradiographs using a large format digital scanner. Nucleic Acids Res. 14: 417-424.
Engelhardt, H., and Mathes, D. (1977). Chemically bonded stationary phases for aqueous high-performance exclusion chromatography. J. Chromatogr. 142: 311-320.
Feller, W. (1968). “An Introduction to Probability Theory and Its Applications,” Vol. 1, 3rd ed., Wiley, New York.
Fodor, S. P. A., Read, J. L., Pirrung, M. C., Stryer, L., Lu, A. T., and Solas, D. (1991). Light-directed, spatially addressable parallel chemical synthesis. Science 251: 767-773.
Froehler, B. C., Ng, P. G., and Matteucci, M. D. (1986). Synthesis of DNA via deoxynucleoside H-phosphonate intermediates. Nucleic Acids Res. 14: 5399-5407.
Harris, A. L. (1991). Cancer genes: Telling changes of base. Nature 350: 377-378.
Humphrey, T., and Proudfoot, N. J. (1988). A beginning to the biochemistry of polyadenylation. Trends Genet. 4: 243-245.
Hunkapiller, T., Kaiser, R. J., Koop, B. F., and Hood, L. (1991). Large-scale and automated DNA sequence determination. Science 254: 59-67.
Jacobs, K. A., Rudersdorf, R., Neill, S. D., Dougherty, J. P., Brown, E. L., and Fritsch, E. F. (1988). The thermal stability of oligonucleotide duplexes is sequence independent in tetraalkylammonium salt solutions: Application to identifying recombinant DNA clones. Nucleic Acids Res. 16: 4637-4650.
Jeffreys, A. J., Wilson, V., and Thein, S. L. (1985). Hypervariable ‘minisatellite’ regions in human DNA. Nature 314: 67-71.
Johnston, R. F., Pickett, S. C., and Barker, D. L. (1990). Autoradiography using storage phosphor technology. Electrophoresis 11: 355-360.
Khrapko, K. R., Lysov, Yu. P., Khorlyn, A. A., Shick, V. V., Florentiev, V. L., and Mirzabekov, A. D. (1989). An oligonucleotide hybridization approach to DNA sequencing. FEBS Lett. 256: 118-122.
Khrapko, K. R., Lysov, Yu. P., Khorlin, A. A., Ivanov, I. B., Yershov, G. M., Vasilenko, S. K., Florentiev, V. L., and Mirzabekov, A. D. (1991). A method for DNA sequencing by hybridization with oligonucleotide matrix. DNA Sequence 1: 375-388.
Klug, A., and Rhodes, D. (1987). ‘Zinc fingers’: A novel protein motif for nucleic acid recognition. Trends Biochem. Sci. 12: 464-469.
Kurnit, D. (1979). Evolution of sickle variant gene. Lancet 1: 104.
Lysov, Yu. P., Florentiev, V. L., Khorlyn, A. A., Khrapko, K. R., Shick, V. V., and Mirzabekov, A. D. (1988). Dokl. Akad. Nauk SSSR 303: 1508-1511.
Maskos, U. (1991). “A Novel Method of Nucleic Acid Sequence Analysis,” D. Phil. thesis, Oxford University.
Maskos, U., and Southern, E. M. (1992a). Oligonucleotide hybridisations on glass supports: A novel linker for oligonucleotide synthesis and hybridisation properties of oligonucleotides synthesised in situ. Nucleic Acids Res. 20: 1679-1684.
Maskos, U., and Southern, E. M. (1992b). Parallel analysis of oligodeoxyribonucleotide (oligonucleotide) interactions. I. Analysis of factors influencing oligonucleotide duplex formation. Nucleic Acids Res. 20: 1675-1678.
Maxam, A. M., and Gilbert, W. (1977). A new method for sequencing DNA. Proc. Natl. Acad. Sci. USA 74: 560-564.
McBride, L. J., McCollum, C., Davidson, S., Efcavitch, J. W., Andrus, A., and Lombardi, S. J. (1988). A new, reliable cartridge for the rapid purification of synthetic DNA. BioTechniques 6: 362-367.
Melchior, W. B., Jr., and von Hippel, P. H. (1973). Alteration of the relative stability of dA·dT and dG·dC base pairs in DNA. Proc. Natl. Acad. Sci. USA 70: 298-302.
Mellema, J.-R., van der Woerd, R., van der Marel, G. A., van Boom, J. H., and Altona, C. (1984). Proton NMR study and conformational analysis of d(CGT), d(TCG) and d(CGTCG) in aqueous solution. The effect of a dangling thymidine and of a thymidine mismatch on DNA mini-duplexes. Nucleic Acids Res. 12: 5061-5078.
Murray, K. (1970). Nucleotide ‘maps’ of digests of deoxyribonucleic acid. Biochem. J. 118: 831-841.
Myers, R. M., Larin, Z., and Maniatis, T. (1985a). Detection of single base substitutions by ribonuclease cleavage at mismatches in RNA:DNA duplexes. Science 230: 1242-1246.
Myers, R. M., Lumelsky, N., Lerman, L. S., and Maniatis, T. (1985b). Detection of single base substitutions in total genomic DNA. Nature 313: 495-497.
Nickerson, D. A., Kaiser, R., Lappin, S., Stewart, J., Hood, L., and Landegren, U. (1990). Automated DNA diagnostics using an ELISA-based oligonucleotide ligation assay. Proc. Natl. Acad. Sci. USA 87: 8923-8927.
Orita, M., Suzuki, Y., Sekiya, T., and Hayashi, K. (1989). Rapid and sensitive detection of point mutations and DNA polymorphisms using the polymerase chain reaction. Genomics 5: 874-879.
Pevzner, P. A. (1989). l-tuple DNA sequencing: Computer analysis. J. Biomol. Struct. Dyn. 7: 63-73.
Saiki, R. K., Walsh, P. S., Levenson, C. H., and Erlich, H. A. (1989). Genetic analysis of amplified DNA with immobilized sequence-specific oligonucleotide probes. Proc. Natl. Acad. Sci. USA 86: 6230-6234.
Sanger, F., Nicklen, S., and Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74: 5463-5467.
Solomon, E., and Bodmer, W. (1979). Evolution of sickle variant gene. Lancet 1: 923.
Southern, E. M. (1979). Analysis of restriction-fragment patterns from complex deoxyribonucleic acid species. Biochem. Soc. Symp. 44: 37-41.
Southern, E. M. (1988). Analyzing polynucleotide sequences. International Patent Application PCT GB 89/00460.
Southern, E. M., and Maskos, U. (1988). Support-bound oligonucleotides. International Patent Application PCT GB 89/01114.
Stretzoska, Z., Paunesku, T., Radosavljevic, D., Labat, I., Drmanac, R., and Crkvenjakov, R. (1991). DNA sequencing by hybridization: 100 bases read by a non-gel-based method. Proc. Natl. Acad. Sci. USA 88: 10,089-10,093.
Wallace, R. B., Shaffer, J., Murphy, R. E., Bonner, J., Hirose, T., and Itakura, K. (1979). Hybridization of synthetic oligodeoxyribonucleotides to φX174 DNA: The effect of a single base pair mismatch. Nucleic Acids Res. 6: 3543-3557.
Wetmur, J. G. (1991). DNA probes: Applications of the principles of nucleic acid hybridization. Crit. Rev. Biochem. Mol. Biol. 26: 227-259.
Wood, W. I., Gitschier, J., Lasky, L. A., and Lawn, R. M. (1985). Base composition-independent hybridization in tetramethylammonium chloride: A method for oligonucleotide screening of highly complex gene libraries. Proc. Natl. Acad. Sci. USA 82: 1585-1588.
Zielenski, J., Rozmahel, R., Bozon, D., Kerem, B.-S., Grzelczak, Z., Riordan, J. R., Rommens, J., and Tsui, L.-C. (1991). Genomic DNA sequence of the cystic fibrosis transmembrane conductance regulator (CFTR) gene. Genomics 10: 214-228.
Zubkov, A. M., and Mikhailov, V. G. (1979). Repetitions of s-tuples in a sequence of independent trials. Theory Probab. Its Appl. 24: 269-282.
