Protein Science (1992), 1, 1083-1091. Cambridge University Press. Printed in the USA

Copyright © 1992 The Protein Society

Sequence analysis of peptide mixtures by automated integration of Edman and mass spectrometric data

R.S. JOHNSON

AND

K.A. WALSH

Department of Biochemistry, University of Washington, Seattle, Washington 98195 (RECEIVED April 20, 1992; REVISED MANUSCRIPT RECEIVED May 29, 1992)

Abstract

A computer algorithm is described that utilizes both Edman and mass spectrometric data for simultaneous determination of the amino acid sequences of several peptides in a mixture. Gas phase sequencing of a peptide mixture results in a list of observed amino acids for each cycle of Edman degradation, which by itself may not be informative and typically requires reanalysis following additional chromatographic steps. Tandem mass spectrometry, on the other hand, has a proven ability to analyze sequences of peptides present in mixtures. However, mass spectrometric data may lack a complete set of sequence-defining fragment ions, so that more than one possible sequence may account for the observed fragment ions. A combination of the two types of data reduces the ambiguity inherent in each. The algorithm first utilizes the Edman data to determine all hypothetical sequences with a calculated mass equal to the observed mass of one of the peptides present in the mixture. These sequences are then assigned figures of merit according to how well each of them accounts for the fragment ions in the tandem mass spectrum of that peptide. The program was tested on tryptic and chymotryptic peptides from hen lysozyme, and the results are compared with those of another computer program that uses only mass spectral data for peptide sequencing. In order to assess the utility of this method, the program is tested using simulated mixtures of varying complexity and tandem mass spectra of varying quality.

Keywords: collision-induced dissociation; Edman degradation; electrospray ionization; protein sequencing; tandem mass spectrometry

In recent years cDNA sequencing has largely replaced Edman degradation as the primary means of sequencing proteins. Nevertheless, if difficulties are encountered in obtaining clones encompassing the entire coding region of the protein or if the protein is small, the most expeditious route may be to sequence the protein directly. Traditionally, this has involved proteolytic or chemical cleavage of the protein, followed by isolation of the resulting peptides, and sequential Edman degradation of each peptide (Walsh et al., 1981). This process requires a time-consuming effort to isolate individual peptides, and it is not uncommon to find that a chromatographic fraction still contains a mixture. If the mixture contains a major and a minor component, one can rely on the differing high performance liquid chromatography (HPLC) detector responses of the two phenylthiohydantoin (PTH) amino acids to sequence two peptides simultaneously. However, it is very difficult to sequence such mixtures


Reprint requests to: K.A. Walsh, Department of Biochemistry, SJ-70, University of Washington, Seattle, Washington 98195.

if more than two peptides are present or if two peptides are present in nearly equal amounts. Furthermore, Edman degradation is slow in that it typically requires 30 to 60 min per residue. Tandem mass spectrometry has been shown to be a useful alternative means of protein and peptide sequencing that overcomes the slow pace of Edman degradation: data collection is much more rapid, and peptides present in complex mixtures can be analyzed (Hunt et al., 1986; Biemann, 1990). Tandem mass spectrometry (MS/MS), as the name implies, involves multiple mass analyzers employed for the purpose of determining relationships between fragment ions and their precursors (McLafferty, 1983). For peptide sequencing this usually means that the first of two mass analyzers is statically set to transmit peptide ions of one mass into a collision chamber where the selected peptide ion is collisionally activated by interaction with neutral atoms. The mass-to-charge ratios of the fragment ions are measured by scanning the second mass analyzer, and the peptide sequence can in favorable cases be unambiguously deduced from the resulting MS/MS spectrum.


R.S. Johnson and K.A. Walsh

More often than not, however, an MS/MS spectrum alone does not contain sufficient information for delineating a peptide sequence. There are at least three reasons for this. First, the entire sequence can be deduced only if fragmentations occur between each pair of adjacent amino acids. This is not always the case; in particular, it is not unusual for the cleavage between the first and second amino acids to be absent so that the order within the amino-terminal dipeptide is unknown. The absence of cleavage between adjacent amino acids can occur elsewhere, but does so less frequently. In some cases, this lack of fragmentation could be misleading if the sum of the masses of adjacent amino acids equals the mass of a single amino acid. For instance, if a peptide had an N-terminal sequence of Gly-Gly-, but there were no fragment ions indicating this sequence, one could easily conclude that this peptide had an N-terminal Asn (a Gly residue has half the mass of Asn). Second, data obtained on mass spectrometers that employ low-energy collisional activation of precursor ions are not sufficient to permit the differentiation of the isomeric amino acids leucine and isoleucine or the isobaric residues glutamine and lysine. Third, there often are a number of possible sequences that account for the observed fragment ions equally well, and in the absence of additional information this ambiguity decreases the confidence of the sequence assignment. This has been recognized and dealt with by investigators in a number of ways. Obviously, a cDNA sequence can provide a framework from which MS/MS data can be interpreted. Otherwise, the carboxyl groups can be methylated and the MS/MS spectrum of the derivatized peptide compared with the spectrum of the underivatized peptide (Hunt et al., 1986). Those fragment ions that contain carboxyl groups are shifted by multiples of 14 mass units, and with this additional information some of the difficulties encountered in the interpretation of the MS/MS spectrum of the underivatized peptide are removed. In the normal course of sequencing a protein, several proteolytic digests are utilized to obtain overlapping peptides. If the MS/MS spectra of such peptides are acquired, then the interpretation can be carried out so as to account for the maximum number of product ions in all of the tandem mass spectra (Ghosh & Biemann, 1991). Another technique is to carry out a manual subtractive Edman degradation of the peptide fractions, whereby the masses of the peptides are determined after each Edman cycle. In many cases the determination of the N-terminal amino acid is sufficient for delineating the peptide sequence. As an alternative, we propose the use of gas phase sequencing data to assist in the interpretation of MS/MS spectra. The use of Edman and mass spectral data has already been shown to be a powerful combination in that the peptide mass provides an estimate of the number of Edman cycles to be programmed and also confirms that the C-terminus of the peptide was sequenced (Stults et al., 1988). Knowledge of the peptide mass may also assist in

the interpretation of the sequence data if the mixture is not too complex. Matsuo et al. (1981) have shown that more complex peptide mixtures can be deconvoluted if the peptide masses are known and the Edman data are very carefully quantitated. In that study, a mixture of six chymotryptic peptides of glucagon was sequenced utilizing Edman data and the peptide masses determined by fast atom bombardment mass spectrometry (FABMS); however, the algorithm presented in that paper required two tenuous assumptions. First, all of the peptides present in the mixture must be observed by the mass spectrometer. If a peptide were to be present, but not detected by mass spectrometry (as will happen, particularly in FABMS), then the algorithm fails. Second, the algorithm requires that the amino acids present in each cycle be quantitatively determined; e.g., if two peptides have leucine in position 9, the investigator must somehow decide that two, and not one or three, leucines were released in the ninth Edman cycle. Due to incomplete yields in the Edman degradation, the requirement that amino acids be quantitatively determined becomes increasingly difficult to fulfill as the sequencing proceeds. Furthermore, quantitation becomes more difficult as the mixture increases in complexity. Neither of these two requirements can be realistically fulfilled. In this paper we propose another means of aiding the interpretation of MS/MS spectra of peptides, which utilizes gas phase Edman data of peptide mixtures. Unlike the algorithm described above, this program, MADMAE (Mixture Analysis Derived by Mass spectrometry And Edman), does not require the quantitative determination of PTH-amino acids released by Edman degradation, nor does it require that all peptides present in the mixture be observed mass spectrometrically. The data input is a list of amino acids observed in each cycle of Edman degradation, the mass of one of the peptides in the mixture, and the list of fragment ion masses and relative abundances derived from the MS/MS spectrum of that peptide. All of these data can be acquired on unmodified and commercially available instruments.
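The mass coincidences mentioned in this introduction (Gly-Gly versus Asn, Gln versus Lys, Leu versus Ile) can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of MADMAE; the residue masses are standard monoisotopic values, and both the function name and the 0.5 Da default tolerance are assumptions chosen to resemble unit-resolution, low-energy MS/MS data.

```python
# Monoisotopic residue masses (Da) for the residues discussed above.
RESIDUE_MASS = {
    "Gly": 57.02146, "Asn": 114.04293, "Gln": 128.05858,
    "Lys": 128.09496, "Leu": 113.08406, "Ile": 113.08406,
}


def indistinguishable(mass_a: float, mass_b: float, tol: float = 0.5) -> bool:
    """Return True if two masses differ by no more than tol (assumed tolerance, Da)."""
    return abs(mass_a - mass_b) <= tol


gly_gly = 2 * RESIDUE_MASS["Gly"]

# Two adjacent Gly residues weigh the same as one Asn residue.
print("Gly-Gly vs Asn:", indistinguishable(gly_gly, RESIDUE_MASS["Asn"]))
# Gln and Lys differ by about 0.036 Da: isobaric at this tolerance.
print("Gln vs Lys:    ", indistinguishable(RESIDUE_MASS["Gln"], RESIDUE_MASS["Lys"]))
# Leu and Ile are isomers with identical mass.
print("Leu vs Ile:    ", indistinguishable(RESIDUE_MASS["Leu"], RESIDUE_MASS["Ile"]))
```

Tightening `tol` to, say, 0.01 Da separates Gln from Lys but still cannot separate Leu from Ile or Gly-Gly from Asn, which is why an independent source of information such as Edman data is useful.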

Results

The algorithm involves two major steps. First, the gas phase Edman data are used to derive a list of all possible sequences that have a calculated mass equal to the observed peptide mass. Second, each sequence in this list is assigned a score based on the fraction of MS/MS product ion current that can be accounted for, given the fragment ion types shown in Scheme 1. For instance, if one of the possible sequences accounts for all of the ions present in the MS/MS spectrum as being fragmentations of the type shown in Scheme 1, then that sequence is assigned a score of 1.00. If only half of the ion current (or ion abundance) is accounted for, then the score is 0.50.

Scheme 1. Low-energy collision-induced dissociation products.

    b_n ion:  H-(NH-CHR-CO)_(n-1)-NH-CHR-C≡O+
    y_n ion:  [H-(NH-CHR-CO)_n-OH]H+
    a_n ion:  H-(NH-CHR-CO)_(n-1)-NH+=CHR

Other product ions: (1) a, b, or y ions lacking NH3 or H2O; (2) internal fragments (two peptide bonds cleaved); (3) internal fragments lacking CO, H2O, or NH3. The b- and a-type ions encompass the N-terminal amino acid and are numbered from N- to C-terminus; likewise, y-type ions encompass the C-terminus and are numbered from C- to N-terminus.

This score is then modified further according to the degree to which the putative set of b- and y-type ions can define the sequence. The b and y ions are the major sequence-specific ion types found in low-energy MS/MS spectra of peptides, and ideally one or the other would be present at every position if a hypothetical sequence was correct. In some cases, an incorrect sequence might accidentally account for many of the product ions as ones that are not sequence-specific ions (e.g., internal fragments), thereby receiving an artificially high score. To weight against these sequences, all of the scores are multiplied by a factor equal to the number of peptide bonds that are represented by a b- or y-type ion divided by the total number of peptide bonds in the hypothetical sequence. For instance, in the sequence below:

[Diagram: Asn / Gly / Phe / Pro / Leu / Lys, with y-ion markers above four of the cleavage sites and b-ion markers below two of them.]

there are a total of five peptide bonds, four of which are represented by b or y ions, so the score is multiplied by the factor 4/5. Following this procedure, the hypothetical sequences are ranked according to this final score.

Fig. 1. Nebulization-assisted electrospray mass spectrum of HPLC fraction 2 from a tryptic digest of carboxymethylated hen lysozyme. Approximately 5 pmol were consumed to obtain this spectrum. [Spectrum not reproduced. x-axis: mass-to-charge ratio; labeled peaks at 447.7 and 715.0.]

The overall procedure is illustrated below. A 110-pmol sample of a tryptic digest of carboxymethylated lysozyme (as quantitated by amino acid analysis) was loaded onto a C8 reversed-phase HPLC column; 11 fractions were collected by hand and brought nearly to dryness on a vacuum centrifuge. Half of each fraction was used to obtain both MS and MS/MS spectra; the electrospray ionization (ESI) mass spectrum of fraction 2 is shown in Figure 1. The two major ions are doubly charged as judged by the appearance of sodium adducts that are heavier by 11 Da per elementary charge (Da/e) (replacement of a proton by sodium results in a mass change of 22 Da, or a Da/e change of 11 if the ion is doubly charged). Thus, at least two peptides of molecular weights 893.4 and 1,428.0 are present in this fraction. Table 1 shows the hypothetical Edman data for these two peptides; note that both peptides have glutamyl residues at position 2, and that the program input does not require quantitation. Nor are any assumptions made concerning the length of the peptide. Although it appears as though one of the peptides might be eight amino acids in length, the fact that there is no quantitation could mean that the smaller peptide is longer than eight if the two peptides have identical amino acids starting at position 9. However, given the Edman data and a molecular weight of 893.4, this peptide could only be seven or eight amino acids long. This is determined by summing the greatest amino acid residue masses in each cycle to determine the minimum length and summing the least amino acid residue masses to obtain the maximum peptide length. Within these constraints, all possible permutations of the Edman data are made, and the masses of these hypothetical sequences are calculated and compared to the observed peptide mass. In this case, only two possibilities are found: Cys(cm)-Glu-Leu-Ala-Ala-Ala-Met-Gln and Cys(cm)-Glu-Leu-Ala-Ala-Ala-Met-Lys. Because this is

Table 1. Mixture analysis of two tryptic peptides

Edman cycle no.    1   2   3   4   5   6   7   8   9  10  11  12
Amino acids        F   E   S   N   F   N   T   Q   A   T   N   R
                   C       L   A   A   A   M   K

MW 893.4: 2 sequences        MW 1,428.0: 2 sequences
CELAAAMK (a)                 FESNFNTQATNR (a)
CELAAAMQ                     FESNFNTKATNR

(a) Correct sequence.
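The first step of the procedure (length bounds followed by enumeration of mass-matched permutations) can be sketched as follows. This is an illustrative reconstruction, not the MADMAE source: the function names are hypothetical, the residue masses are standard monoisotopic values with "C" taken as carboxymethylated cysteine, and the 0.5 Da peptide mass tolerance is an assumption, since the tolerance applied to the peptide mass is not stated.

```python
from typing import List, Tuple

# Monoisotopic residue masses (Da); "C" is carboxymethyl-Cys (assumption of this sketch).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 161.01467, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H at the N-terminus plus OH at the C-terminus


def length_bounds(cycles: List[str], observed_mass: float, tol: float = 0.5) -> Tuple[int, int]:
    """Minimum length from the heaviest residue per cycle, maximum from the lightest."""
    target = observed_mass - WATER
    heaviest = lightest = 0.0
    min_len = max_len = 0
    for n, cycle in enumerate(cycles, start=1):
        heaviest += max(RESIDUE_MASS[aa] for aa in cycle)
        lightest += min(RESIDUE_MASS[aa] for aa in cycle)
        if min_len == 0 and heaviest >= target - tol:
            min_len = n
        if lightest <= target + tol:
            max_len = n
    return min_len, max_len


def candidate_sequences(cycles: List[str], observed_mass: float, tol: float = 0.5) -> List[str]:
    """Enumerate one-residue-per-cycle sequences whose mass matches observed_mass."""
    target = observed_mass - WATER
    found: List[str] = []

    def extend(prefix: str, mass: float) -> None:
        if prefix and abs(mass - target) <= tol:
            found.append(prefix)
        # Stop at the last cycle, or once the running mass is already too heavy.
        if len(prefix) == len(cycles) or mass > target + tol:
            return
        for aa in cycles[len(prefix)]:
            extend(prefix + aa, mass + RESIDUE_MASS[aa])

    extend("", 0.0)
    return sorted(found)


# Edman data of Table 1, one string of observed residues per cycle.
TABLE_1_CYCLES = ["FC", "E", "SL", "NA", "FA", "NA", "TM", "QK", "A", "T", "N", "R"]
print(length_bounds(TABLE_1_CYCLES, 893.4))
print(candidate_sequences(TABLE_1_CYCLES, 893.4))
print(candidate_sequences(TABLE_1_CYCLES, 1428.0))
```

Under these assumptions the sketch yields a length of seven or eight residues for the 893.4 peptide and two candidates for each mass, in line with Table 1. No quantitation of the residues in each cycle is needed, only their presence.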


a tryptic peptide, the latter peptide would be favored, and is in fact correct. Likewise, the peptide of molecular weight 1,428.0 could have only two possible sequences, Phe-Glu-Ser-Asn-Phe-Asn-Thr-Lys-Ala-Thr-Asn-Arg or Phe-Glu-Ser-Asn-Phe-Asn-Thr-Gln-Ala-Thr-Asn-Arg. Again, the specificity of trypsin would favor the latter sequence. In this example, where the mixture is relatively simple, the Edman data and the peptide molecular weights alone are sufficient for determining the peptide sequences. A more complex mixture can be simulated by assuming the presence of a third tryptic peptide (Trp-Trp-Cys(cm)-Asn-Asp-Gly-Arg) along with the two that actually coeluted on the HPLC column. The hypothetical Edman data for this mixture are shown at the top of Table 2, and the MS/MS spectra of these three peptides are shown in Figure 2A-C. This time 9 possible sequences corresponded to a calculated molecular weight of 893.4, 6 to a mass of 1,428.0, and 40 sequences to a mass of 993.4. To differentiate among all of these possibilities, each sequence is scored and ranked according to how well it accounts for the product ions found in the MS/MS spectra and the degree to which each peptide bond is represented by a b- or y-type ion (Scheme 1). The results are depicted in Table 2; only the top 10 sequences are shown for the peptide of mass 993.4. Except for the Gln/Lys ambiguity at position 8 (glutamine and lysine have the same nominal mass), the scores are sufficiently divergent so as to allow for confident sequence assignments. In this case, where the masses alone were not sufficient for delineating sequences of peptides in the mixture, the MS/MS spectra provide conclusive proof of the structure.

Table 2. Mixture analysis of three tryptic peptides

Edman cycle no.    1   2   3   4   5   6   7   8   9  10  11  12
Amino acids        F   E   S   N   F   N   T   Q   A   T   N   R
                   C   W   L   A   A   A   M   K
                   W       C       D   G   R

MW 893.4: 9 sequences    MW 993.4: 40 sequences    MW 1,428.0: 6 sequences
Score  Sequence          Score  Sequence           Score  Sequence
0.98   CELAAAMK (a)      0.75   WWCNDGR (a)        0.62   FESNFNTQATNR (a)
0.97   CELAAAMQ          0.44   WWCAAATK           0.62   FESNFNTKATNR
0.57   CELADGTK          0.44   WWCAAATQ           0.16   FWSNFGTKATNR
0.56   CELADGTQ          0.30   WWSADAMK           0.16   FWSNFGTQATNR
0.00   FELADATK          0.30   WWSADAMQ           0.00   CECADAMKATNR
0.00   FELADATQ          0.21   WECNDNR            0.00   CECADAMQATNR
0.00   FWSNANR           0.14   WELNFGTK
0.00   WECADGR           0.14   WELNFGTQ
0.00   WWLAFAT           0.14   CWSADARK
                         0.14   CWSADARQ

(a) Correct sequence.

Fig. 2. Tandem mass spectra of carboxymethylated hen lysozyme tryptic peptides of molecular weights (A) 893.4, (B) 1,428.0, and (C) 993.4. In all cases the doubly-charged ion was chosen as the precursor ion. The ion nomenclature is as shown in Scheme 1, except that amino acid immonium ions are depicted as the corresponding single letter code. Approximately 5-15 pmol were consumed during the acquisition of each spectrum. [Spectra not reproduced. x-axis: mass-to-charge ratio.]

The degree of complexity of the mixture that remains interpretable by MADMAE can be demonstrated by adding two more peptides (GTDVQAWIR and GYSLGNWVCAAK) to the three discussed above. Table 3 shows the hypothetical Edman data and the MADMAE output for this simulated mixture of five tryptic peptides. As can be seen, the number of possible sequences has grown for each of the original three peptides; for instance, with a mixture of three peptides, the peptide of mass 1,428.0 had only six possibilities, but with five peptides there are 6,053 sequences. Nevertheless, even with a few thousand possible sequences, the correct ones are ranked first with scores that are significantly higher than the alternative sequences. Again the exceptions arise from having both Gln and Lys at position 8; this same ambiguity would also occur if both Leu and Ile were found in the same Edman cycle. Also shown in Table 3 are the numbers of fragment

Table 3. Mixture analysis of five tryptic peptides

Edman cycle no.    1   2   3   4   5   6   7   8   9  10  11  12
Amino acids        F   E   S   N   F   N   T   Q   A   T   N   R
                   C   W   L   A   A   A   M   K   R   A   A   K
                   W   T   C   V   D   G   R   I   C
                   G   Y   D   L   Q       W   V
                                   G

MW 893.4: 541 sequences      MW 993.4: 1,494 sequences    MW 1,044.6: 1,227 sequences
CPU time: 1.4 min            CPU time: 1.4 min            CPU time: 1.5 min
27 fragment ions             51 fragment ions             62 fragment ions
Score  Sequence              Score  Sequence              Score  Sequence
0.98   CELAAAMK (a)          0.75   WWCNDGR (a)           0.90   GTDVQAWIR (a)
0.97   CELAAAMQ              0.55   WWCNAAW               0.79   GESVQAWIR
0.59   CELVGGMK              0.48   WWCNGGTI              0.72   GEDAQAWIR
0.58   CELVGGMQ              0.44   WWCAAATK              0.72   GTDVDNRIR
0.57   CELADGTK              0.44   WWCAAATQ              0.65   WWSLQARV

MW 1,325.8: 7,061 sequences  MW 1,428.0: 6,053 sequences
CPU time: 2 min 40 s         CPU time: 2 min 25 s
47 fragment ions             21 fragment ions
Score  Sequence              Score  Sequence
0.65   GYSLGNWVCAAK (a)      0.62   FESNFNTQATNR (a)
0.53   GYSLGARVCTNK          0.62   FESNFNTKATNR
0.52   GYSLGNRVCTAK          0.48   FESNFNMKAANR
0.51   GYSLQNTICAAK          0.48   FTDNFNTKATNR
0.50   GYSLQNWICAN           0.48   FTDNFNTQATNR

(a) Correct sequence.
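The scores reported in Tables 2 and 3 combine the two quantities described earlier: the fraction of product ion current accounted for, and the fraction of peptide bonds represented by a b- or y-type ion. The sketch below shows that combination under simplifying assumptions that are not the authors' implementation: only singly charged b and y ions are matched (MADMAE also accepts the other ion types of Scheme 1), the match tolerance is the ±0.5 Da mentioned in the text, and the function names and the example peak list are hypothetical.

```python
from typing import List, Set, Tuple

# Monoisotopic residue masses (Da); "C" is carboxymethyl-Cys (assumption of this sketch).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 161.01467, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728
WATER = 18.01056


def b_and_y_ions(sequence: str) -> List[Tuple[int, float]]:
    """Return (bond, m/z) pairs for singly charged b and y ions.

    Bond k (1-based) is the peptide bond between residues k and k+1.
    """
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    ions: List[Tuple[int, float]] = []
    for k in range(1, len(sequence)):
        ions.append((k, sum(masses[:k]) + PROTON))          # b ion, N-terminal piece
        ions.append((k, sum(masses[k:]) + WATER + PROTON))  # y ion, C-terminal piece
    return ions


def score(sequence: str, peaks: List[Tuple[float, float]], tol: float = 0.5) -> float:
    """Fraction of ion current explained, multiplied by the fraction of bonds covered."""
    total = sum(intensity for _, intensity in peaks)
    if len(sequence) < 2 or total <= 0:
        return 0.0
    ions = b_and_y_ions(sequence)
    explained = 0.0
    bonds_seen: Set[int] = set()
    for mz, intensity in peaks:
        hits = {bond for bond, ion_mz in ions if abs(ion_mz - mz) <= tol}
        if hits:
            explained += intensity
            bonds_seen |= hits
    return (explained / total) * (len(bonds_seen) / (len(sequence) - 1))


# Hypothetical peak list (m/z, abundance) for Asn-Gly-Phe-Pro-Leu-Lys:
# y ions at four bonds, b ions at two of the same bonds, nothing at the fifth bond.
peaks = [(561.24, 30.0), (504.22, 25.0), (357.15, 20.0),
         (260.10, 15.0), (115.05, 5.0), (172.07, 5.0)]
print(round(score("NGFPLK", peaks), 2))                     # all current explained, 4 of 5 bonds
print(round(score("NGFPLK", peaks + [(300.0, 100.0)]), 2))  # half the current unexplained
```

With every peak explained and four of five bonds covered, the example reproduces the 4/5 factor from the worked sequence; adding an unexplained peak of equal total abundance halves the score.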

ions in the data input and the computer time used to carry out the entire procedure. Because it seems that the algorithm employed by MADMAE can effectively deal with moderately complex mixtures of peptides, limitations of the program may relate to the difficulties of accurately identifying all amino acids in every cycle of Edman degradation for very complex mixtures. It is apparent that if an amino acid were omitted from the MADMAE input, the correct sequence could not possibly be found, and the output would be misleading. For instance, tryptophan or threonine may be overlooked in late cycles of gas phase Edman data. Table 4 shows the results obtained on three tryptophan-containing peptides using the same Edman data as in Table 3 except that Trp was omitted from either cycle 1 or 7 (depending on the peptide). In two cases, the peptides of mass 993.4 and 1,325.8 gave results that were clearly erroneous in that the lower-ranked sequences received scores nearly identical to the top sequence (in contrast to the scores in Table 3). The remaining peptide, upon omission of Trp from cycle 7, gave MADMAE output that could less convincingly be dismissed as erroneous.

To overcome this problem, the suspected amino acid could be arbitrarily inserted in every Edman cycle of the input. Table 5 shows the MADMAE output when the Edman data for the same five peptides as Table 3 are used, except that Trp is inserted into each cycle (unless it was already present). As can be seen, this greatly increases the number of possible sequences that match each molecular

Table 4. MADMAE output when tryptophan is omitted from Edman data of Table 3

MW 993.4:                    MW 1,044.6:                  MW 1,325.8:
-Trp in cycle 1              -Trp in cycle 7              -Trp in cycle 7
Score  Sequence              Score  Sequence              Score  Sequence
0.39   GEDAFATKR             0.72   GTDVDNRIR             0.53   GYSLGARVCTNK
0.39   GEDAFATQR             0.65   WWSLQARV              0.52   GYSLGNRVCTAK
0.39   GESVFATKR             0.63   GESVDNRIR             0.51   GYSLQNTICAAK
0.39   GESVFATQR             0.62   GTDVAGRIRT            0.45   GYSLGGRICTNK
0.34   GESVFAWKA             0.57   GEDADNRIR             0.45   GYSLDGRKCAAK

Table 5. Insertion of Trp into each Edman cycle of Table 3

MW 893.4: 806 sequences      MW 993.4: 2,593 sequences    MW 1,044.6: 2,290 sequences
CPU time: 21.8 min           CPU time: 20.2 min           CPU time: 21.9 min
Score  Sequence              Score  Sequence              Score  Sequence
0.98   CELAAAMK (a)          0.75   WWCNDGR (a)           0.90   GTDVQAWIR (a)
0.97   CELAAAMQ              0.55   WWCNAAW               0.79   GESVQAWIR
0.59   CELVGGMK              0.50   GEWAFAWK              0.73   WTWVQARV
0.58   CELVGGMQ              0.50   GEWAFAWQ              0.72   GEDAQAWIR
0.57   CELADGTK              0.48   WWCNGGTI              0.72   GTDVDNRIR

MW 1,325.8: 30,567 sequences MW 1,428.0: 46,447 sequences
CPU time: 26.5 min           CPU time: 31.4 min
Score  Sequence              Score  Sequence
0.65   GYSLGNWVCAAK (a)      0.62   FESNFNTQATNR (a)
0.53   GYSLGARVCTNK          0.62   FESNFNTKATNR
0.52   GYSLGNRVCTAK          0.48   FESNFNMKAANR
0.51   GYSLGGRWCAAK          0.48   FTDNFNTKATNR
0.51   GYSLQNTICAAK          0.48   FTDNFNTQATNR

(a) Correct sequence.
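The input manipulation behind Table 5 is simple to express: a suspected residue is added to every Edman cycle in which it was not already reported. The sketch below is illustrative, with hypothetical function names; the cycle strings are the Edman data of Table 3 written one string per cycle. The permutation count it prints is only an upper bound on the full-length search space, not the number of mass-matched sequences reported by MADMAE.

```python
import math
from typing import List


def insert_residue(cycles: List[str], residue: str) -> List[str]:
    """Return a copy of the cycle lists with `residue` present in every cycle."""
    return [cycle if residue in cycle else cycle + residue for cycle in cycles]


def full_length_permutations(cycles: List[str]) -> int:
    """Number of ways to pick one residue per cycle over all cycles."""
    return math.prod(len(cycle) for cycle in cycles)


# Edman data of Table 3, one string of observed residues per cycle.
TABLE_3_CYCLES = ["FCWG", "EWTY", "SLCD", "NAVL", "FADQG", "NAG",
                  "TMRW", "QKIV", "ARC", "TA", "NA", "RK"]

with_trp = insert_residue(TABLE_3_CYCLES, "W")
print(with_trp)
print(full_length_permutations(TABLE_3_CYCLES), "->", full_length_permutations(with_trp))
```

Adding one residue to most cycles multiplies the number of permutations to be examined by more than an order of magnitude, which is consistent with the longer computation times listed in Table 5.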

weight and also increases the computation time to as long as 30 min. For instance, from this set of Edman data the peptide of mass 1,428.0 has 46,447 possible sequences. Nevertheless, the correct one is still ranked highest. This demonstrates that it is better to include weak Edman data than to exclude them altogether. Moreover, omission of difficult amino acids need not be a problem if that amino acid is artificially inserted into each Edman cycle.

Another source of concern has to do with the quality of the MS/MS spectra that is required to obtain useful output from MADMAE. In the examples presented so far, the fragment ion mass and relative abundance list was obtained from the MS/MS spectra using the peak-finding routine (provided by the manufacturer) with a threshold of zero intensity; i.e., all peaks including much of the low-intensity noise were selected for input to MADMAE. Introduction of the additional low-intensity noise has little effect on the sequence scores. To simulate MS/MS spectra of increasingly poor quality, this threshold was raised from 0% to 10%, 20%, and 30% of the most abundant fragment ion, with the result that fewer fragment ions were used as input. In addition, the fragment ion mass error was increased to 0.7 Da (normally ±0.5 Da), since peaks of low signal-to-noise have poorer mass accuracy. The results using the same five-peptide mixture are summarized in Table 6, which, as expected, shows that the range of sequence scores decreases as the number of fragment ions is reduced. The correct sequence for the peptide of mass 993.4 is ranked highest even if the number of fragment ions is reduced to 14 and 9, but an input of 6 fragment ions results in the six highest ranked sequences receiving the same score; thus, the specificity is lost when too few fragment ions are input to MADMAE. Likewise, the correct sequence for the peptide of mass 893.4 is readily apparent from an input of 11 fragment ions, whereas 7 fragment ions are insufficient. If fewer peptides are present in the mixture, there will be a reduced number of possible sequences generated such that it might be expected that poorer quality MS/MS spectra could be satisfactory. However, when the Edman data for the three-peptide mixture of Table 2 are used with MS/MS data containing limited numbers of fragment ions, MADMAE still gives ambiguous output.

Table 6. MADMAE output using reduced numbers of fragment ions

MW 993.4
10% threshold,               20% threshold,               30% threshold,
14 fragment ions             9 fragment ions              6 fragment ions
Score  Sequence              Score  Sequence              Score  Sequence
0.53   WWCNDGR (a)           0.40   WWCNDGR (a)           0.39   WWSADAMK
0.51   WWCAAATK              0.37   WWCAAATK              0.39   WWSADAMQ
0.51   WWCVGGTK              0.37   WWCVGGTK              0.39   WWCAAATK
0.48   WWCAAATQ              0.36   WYLVQGTK              0.39   WWCAAATQ
0.48   WWCVGGTQ              0.36   WYLVQGTQ              0.39   WWCVGGTK

MW 893.4
10% threshold,               20% threshold,               30% threshold,
22 fragment ions             11 fragment ions             7 fragment ions
Score  Sequence              Score  Sequence              Score  Sequence
0.99   CELAAAMK (a)          0.86   CELAAAMK (a)          0.43   CELAGNMV
0.99   CELAAAMQ              0.86   CELAAAMQ              0.43   CELADGTK
0.59   CELVGGMK              0.50   CELADGTK              0.43   CELADGTQ
0.59   CELVGGMQ              0.50   CELADGTQ              0.43   CELAAAMK (a)
0.58   CELADGTK              0.49   CELVGGMK              0.43   CELAAAMQ

(a) Correct sequence.

Tryptic peptides are most frequently used as examples in the demonstration of the utility of sequencing peptides with triple quadrupole mass spectrometers equipped with an electrospray ion source. With the exception of those containing histidine, tryptic peptides have only two sites of protonation, located at the N-terminal amino group and the C-terminal lysine or arginine side chain. This has the effect of providing favorable fragmentations when the doubly charged peptide ion is chosen as the precursor in the MS/MS experiment. In many cases, a complete series of y-type ions and a partial series of overlapping b-type ions are found. For some proteins, however, tryptic cleavage sites may not be conveniently located throughout the entire sequence, and, for all proteins, overlapping peptides derived from digests using proteolytic enzymes of different specificity must be sequenced. In our experience with this instrumentation, nontryptic peptides do not always provide MS/MS spectra from which the entire peptide sequence can be determined, and so it was of interest to test MADMAE on a simulated mixture of chymotryptic peptides of carboxymethylated lysozyme (Table 7). In this case, MADMAE also worked well for sequencing nontryptic peptides.

Up to this point, MADMAE has been presented as a way to sort out uninterpretable Edman data using MS/MS data, but can the MS/MS spectra alone give unambiguous sequence assignments? To assist in the sequencing of peptides by MS/MS, another computer program called LUTEFISK was written in-house, which uses only the MS/MS data from low-energy collision-induced dissociations of multiply-charged precursors to derive a list of possible sequences (Johnson et al., 1991). This program is conceptually similar to those used by others (Ishikawa & Niwa, 1986; Johnson & Biemann, 1989; Yates et al., 1991). As with MADMAE, the LUTEFISK output scores correspond to the fraction of product ion current that can be accounted for according to the fragmentations shown in Scheme 1. In contrast to MADMAE, LUTEFISK does not multiply the scores by factors based on the presence of y or b ions at every peptide bond, since all of the sequences generated by LUTEFISK are based on the presence of y or b ions. To provide an indication of the quality of data used in these studies, Figure 2 depicts the tandem mass spectra of three of the peptides discussed below, and Table 8 shows the corresponding output from LUTEFISK. The peptide of mass 893.4 represents an ideal case where the correct sequence has a score of 0.97 and is ranked first. One other sequence has an identical score, where the C-terminal Lys is replaced by Gly-Ala. Because this is a tryptic peptide, the sequence with a C-terminal Lys could be considered to be more likely. However, even with this ideal case, position 3 has the ambiguity Leu/Ile (arbitrarily assigned as Leu in this program), which cannot be determined from these data. The peptide of mass 993.4 is even more problematic as all 10 of the sequences listed in Table 8 account for the fragmentations equally well. Furthermore, the third and fourth positions are undifferentiable (parentheses indicate pairs of amino acids for which sequence-specific fragmentations are absent). The situation is equally ambiguous for the peptide of molecular weight 1,428.0, where the two N-terminal positions are undifferentiable. There are also a number of

Table 7. Mixture analysis of three chymotryptic peptides

Edman cycle no.    1   2   3   4   5   6
Amino acids        G   I   L   Q   I   N
                   V   C   A   A   K   F
                   I   R   G   C   R   L

MW 656.4: 12 sequences       MW 695.4: 7 sequences        MW 774.4: 9 sequences
Score  Sequence              Score  Sequence              Score  Sequence
0.75   GILQIN (a)            0.74   VCAAKF (a)            0.63   IRGCRL (a)
0.53   IIGQIN                0.16   ICGAKF                0.35   IRLARF
0.50   VIAQIN                0.16   GCLAKF                0.20   ICLQKL
0.25   IRAAIN                0.13   GIACKF                0.17   VRACRL
0.15   GIAQRL                0.00   GCGCKL                0.17   GRLCRL

(a) Correct sequence.

Table 8. LUTEFISK output

MW 893.4                     MW 993.4                     MW 1,428 (a)
Score  Sequence              Score  Sequence              Score  Sequence
0.97   CELAAAMK (b)          0.81   QGWGRVYK              0.77   [LA]ETNFNSPC
0.97   CELAAAMGA             0.81   WWGRVYK               0.76   [RR]TNFNSPC
0.90   CELAAA[TS]A           0.80   EGWCNDGR              0.72   L[WN]NFNSPC
0.90   CELAAA[SG]D           0.80   EGWCGGDGR             0.72   LYHNFNS[EE]
0.87   CELAA[DS]K            0.80   WW[CN]DGR (b)         0.72   TSFDNTCQAITNR
0.87   CELAA[TT]K            0.80   WWCGGDGR              0.72   [WV]ETNFNSPC
0.87   CELAA[DS]GA           0.80   WWC[QT]GR             0.72   [RE]ETNFNSPC
0.87   S[MA]LAAAMK           0.80   QGWCG[TA]GR           0.72   SNFNTQATNR (b)
0.87   S[TT]LAAAMK           0.80   QGWC[QT]GR            0.72   [CD]HNFNSPC
0.85   CELAAGCMAIA           0.80   WWCG[TA]GR            0.72   [FF]TNFNSPC

(a) Two additional N-terminal amino acids (of total mass 276.1 Da) are not differentiable and are not shown.
(b) Correct sequence.

sequences with similar or slightly better scores than the correct sequence. Such examples support our contention that low-energy tandem mass spectra alone seldom provide sufficient information to sequence peptides, and that in most cases additional experiments of some kind must be carried out in order to eliminate any ambiguities.

Discussion

Given the relative ease with which cDNA can be sequenced and translated into the amino acid sequence, the value of tandem mass spectrometers for sequencing peptides relates to their speed, their ability to detect posttranslational modifications, and their applicability to unresolved mixtures of peptides. For large proteins, structure determinations are mostefficiently carried out by cloning and sequencing the corresponding cDNA (once a partial sequence is available from protein microanalysis) and then using the mass spectrometer to check for errors andpost- or cotranslational modifications. Exceptions arise when clones covering the entire protein are for some reason difficult to obtain or if the protein is small. In the latter case, selecting the right clone may prove more time-consuming than sequencing the protein directly. It has been demonstrated that with sufficient funds, expertise, and material, small proteins can be sequenced solely by four-sector tandem mass spectrometry (Biemann, 1990); however, this cannot be done completely using the less expensive and easier-to-use triple quadrupole instruments, which cannot differentiate between leucine and isoleucine. HPLC fractionscontaining peptide mixtures could be subjectedto amino acid analysis (using approximately 100 pmol of material), and in favorable cases (Griffin et al., 1989) where either leucine or isoleucine are found in the absence of the other, one could make this differentiation. Alternatively, one must carry out gas phase Edman degradations to distinguish between a Leu/Ile locus identified by MS/MS. Because Edman data areusually necessary to differentiate leucine and isoleucine, that information may as well be used to help interpret the MS/MS spectra. In some cases, Edman data may have been acquired first and found to be too complex to interpret. 
In this situation, the MS and MS/MS spectra can be obtained without the additional chromatographic purifications (and attendant loss of material) that are typically applied. This approach of using both tandem mass spectrometry and Edman degradation is not without some disadvantages, e.g., maintaining both types of instrumentation in one laboratory. Moreover, use of two instruments limits the overall sensitivity. In the example provided above, 100 pmol of a tryptic digest was applied to an HPLC column, and one-half of each collected fraction was used to obtain the mass spectra (one MS spectrum and two MS/MS spectra of the two peptides present), which left sufficient material for gas phase sequencing. MADMAE is also subject to the same limitations as Edman sequencing, i.e., N-terminal blocking groups, cross-linked peptides, and any other modified amino acids cause problems. Moreover, if both leucine and isoleucine are present in the same Edman cycle then this differentiation cannot be made. As discussed earlier, if an amino acid is omitted from the Edman data input then the correct sequence for that peptide cannot be determined. In many cases, this should be evident from the reduced range of scores between the highest ranked sequence and those that follow. If it is known or suspected that a certain problematic amino acid is present, then that residue can be artificially introduced into each Edman cycle without greatly affecting the outcome of MADMAE. As the complexity of a peptide mixture increases, the Edman data approach a completely random state where all 20 amino acids are observed in each cycle. Thus, for increasingly complex mixtures MADMAE finds greater numbers of sequences that match a peptide's molecular weight, and it becomes more likely that additional sequences will account for the MS/MS data as well as the correct sequence. In addition, the computation time for very complex mixtures would be prohibitive. It should be pointed out, however, that this approach would prove most useful in the context of structure determinations of lower molecular weight proteins by sequencing proteolytic peptides that had been partially purified in a single chromatographic step. In these situations, the mixtures are not likely to be of greater complexity than the examples discussed here. Finally, it must be kept in mind that the range of sequence scores will narrow (and the discrimination will decrease) as fewer fragment ion masses are used as input. Despite these limitations, MADMAE may prove useful for laboratories with the appropriate instrumentation.
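Two of the limitations above, the Leu/Ile ambiguity and the growth of the search space with mixture complexity, can be illustrated with a minimal sketch of the mass-filtering step. This is not the MADMAE implementation (which was written in Fortran and also scores candidates against MS/MS fragment ions); the function names, the 0.5 Da mass tolerance, and the two-peptide mixture (GLFK and AIVR) are hypothetical choices made only for illustration. The residue masses are standard monoisotopic values.

```python
from itertools import product
from math import prod

# Standard monoisotopic residue masses (Da) for the residues used below.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "V": 99.06841,
    "L": 113.08406, "I": 113.08406,  # Leu and Ile are isobaric
    "F": 147.06841, "K": 128.09496, "R": 156.10111,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def peptide_mass(seq):
    """Neutral monoisotopic mass of a linear, unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in seq) + WATER


def search_space(cycles):
    """Number of hypothetical full-length sequences implied by the Edman data."""
    return prod(len(c) for c in cycles)


def candidates(cycles, observed_mass, tol=0.5):
    """Enumerate sequences built from the residues seen in each Edman cycle
    and keep those whose calculated mass matches the observed mass.

    cycles: list of strings, one per Edman cycle, e.g. ["GA", "LI", ...].
    tol:    assumed mass tolerance in Da (hypothetical value).
    Shorter peptides are allowed by also testing prefixes of the cycle list.
    """
    hits = []
    for length in range(1, len(cycles) + 1):
        for combo in product(*cycles[:length]):
            seq = "".join(combo)
            if abs(peptide_mass(seq) - observed_mass) <= tol:
                hits.append(seq)
    return hits


# Hypothetical mixture of GLFK and AIVR: each cycle shows two residues.
edman_cycles = ["GA", "LI", "FV", "KR"]
print(search_space(edman_cycles))             # 16 full-length combinations
print(candidates(edman_cycles, 463.279))      # ['GLFK', 'GIFK']
```

In this toy case the mass filter reduces 16 combinations to two, GLFK and GIFK, which differ only at the position where leucine and isoleucine appear in the same cycle; neither the peptide mass nor low-energy fragment ions can separate them. If all 20 amino acids were observed in each of 10 cycles, `search_space` would return 20**10 combinations, which is why very complex mixtures become computationally prohibitive.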
Use of this combined approach eliminates the need for additional chromatographic purification of mixtures, as is usually required when using gas phase sequencing alone. Furthermore, MADMAE permits some degree of multiplexing of Edman degradation, thereby enhancing the speed of a relatively slow procedure. For a complete primary structure determination of a protein, gas phase Edman data are currently required for differentiating leucine and isoleucine when MS/MS spectra are obtained from low-energy collisions. The combination of these two very different but complementary sequencing methods can greatly reduce or eliminate the ambiguities present when only one technique is used and at the very least ensure greater confidence in sequence determinations.

Materials and methods

Mass spectra were acquired on a triple quadrupole mass spectrometer (Sciex API-III, Thornhill, Ontario, Canada) equipped with a nebulization-assisted electrospray ion source (Covey et al., 1988). Methanol:water (1:1) with 0.1% formic acid was added to the partially lyophilized HPLC fractions and continuously infused at a flow of 1.7 µL/min through the electrospray needle, which was held at a potential of 5 kV. A concentric flow of air surrounding the needle was applied at a flow of 0.4 L/min, and a dry nitrogen curtain gas flowing at 1.2 L/min was used to help desolvate ions entering the mass spectrometer. Poly(propylene glycol) was used to calibrate both quadrupoles 1 and 3 (Q1 and Q3); ions were detected with a continuous dynode electron multiplier operating in a pulse counting mode. Ion-spray mass spectra were acquired at unit resolution, whereas in MS/MS the resolution of both Q1 and Q3 was decreased so as to increase sensitivity. The precursor ions were accelerated to a kinetic energy of 25 eV and were activated by collision with argon gas (3.5 × 10¹⁴ atoms/cm²) in Q2. Ion-spray mass spectra were typically scanned from Da/e 200 to 1,500 at a rate of 26 s/scan, and three to five spectra were summed.

Hen lysozyme (Sigma) was reduced in 6 M guanidine hydrochloride, pH 8, 20 mM Tris, 2 mM EDTA, with a 50-fold excess of dithiothreitol (DTT) over cystines, and alkylated with a 2.6-fold excess of iodoacetic acid over DTT. The alkylated protein was dialyzed using a Centricon 10 concentrator and treated with either trypsin (Worthington) or chymotrypsin (Sigma) using a 50:1 substrate-to-enzyme ratio.

HPLC was carried out using Waters 510 pumps and a Waters gradient controller with a Waters 481 UV detector monitoring at 206 nm. Samples were placed onto a 2-mm Brownlee C8 cartridge column using a Rheodyne injector, and the column was operated at a flow of 400 µL/min using acetonitrile and water with 0.05% trifluoroacetic acid as solvent.

Calculations were performed on a Macintosh IIfx that had been programmed in Fortran using an Absoft compiler.

Acknowledgments

We thank Drs. H. Charbonneau, L.H. Ericsson, and H. LeTrong for instructive discussions, particularly regarding the pitfalls and problems encountered in gas phase sequencing. This work was supported by grants from the National Institutes of Health (RR 05543 and HL40990) to K.A.W., and a Fellowship award to R.S.J. by the American Heart Association, Washington Affiliate.

References

Biemann, K. (1990). Sequencing of peptides by tandem mass spectrometry and high-energy collision-induced dissociation. Methods Enzymol. 193, 455-479.
Covey, T.R., Bonner, R.F., Shushan, B.I., & Henion, J. (1988). The determination of protein, oligonucleotide and peptide molecular weights by ion-spray mass spectrometry. Rapid Commun. Mass Spectrom. 2, 249-256.
Ghosh, A. & Biemann, K. (1991). Computer-aided assembly of complete protein sequences from the probable (CID-MS derived) sequences of proteolytic peptides. In Proceedings of the 39th Annual American Society for Mass Spectrometry Conference (Caprioli, R.M., Ed.), pp. 1225-1226. Nashville, Tennessee.
Griffin, P.R., Kumar, S., Shabanowitz, J., Charbonneau, H., Namkung, P.C., Walsh, K.A., Hunt, D.F., & Petra, P.H. (1989). The amino acid sequence of the sex steroid-binding protein of rabbit serum. J. Biol. Chem. 264, 19066-19075.
Hunt, D.F., Yates, J.R., III, Shabanowitz, J., Winston, S., & Hauer, C.R. (1986). Protein sequencing by tandem mass spectrometry. Proc. Natl. Acad. Sci. USA 83, 6233-6237.
Ishikawa, K. & Niwa, Y. (1986). Computer-aided peptide sequencing by fast atom bombardment mass spectrometry. Biomed. Environ. Mass Spectrom. 13, 373-380.
Johnson, R.S. & Biemann, K. (1989). Computer program (SEQPEP) to aid in the interpretation of high-energy collision tandem mass spectra of peptides. Biomed. Environ. Mass Spectrom. 18, 945-957.
Johnson, R.S., Ericsson, L.H., & Walsh, K.A. (1991). LUTEFISK without the odor - A computer program for the interpretation of low energy CID spectra of electrosprayed peptide ions. In Proceedings of the 39th Annual American Society for Mass Spectrometry Conference (Caprioli, R.M., Ed.), pp. 1233-1234. Nashville, Tennessee.
Matsuo, T., Matsuda, H., & Katakuse, I. (1981). Computer program PAAS for the estimation of possible amino acid sequence of peptides. Biomed. Mass Spectrom. 8, 137-143.
McLafferty, F.W., Ed. (1983). Tandem Mass Spectrometry. Wiley, New York.
Stults, J.T., Henzel, W.J., & Chakel, J.A. (1988). Integration of microsequencing, mass spectrometry, and tandem mass spectrometry for maximum efficiency in peptide sequencing. In Proceedings of the 36th Annual American Society for Mass Spectrometry Conference (Harrison, W.W., Ed.), pp. 405-406. San Francisco, California.
Walsh, K.A., Ericsson, L.H., Parmelee, D.C., & Titani, K. (1981). Advances in protein sequencing. Annu. Rev. Biochem. 50, 261-284.
Yates, J.R., III, Griffin, P.R., Hood, L.E., & Zhou, J.X. (1991). Computer aided interpretation of low energy MS/MS mass spectra of peptides. In Techniques in Protein Chemistry II (Villafranca, J.J., Ed.), pp. 477-486. Academic Press, San Diego, California.
