BIOLOGICAL MASS SPECTROMETRY, VOL. 20, 115-120 (1991)

A New Procedure for Peptide Alignment in Protein Sequence Determination Using Fast Atom Bombardment Mass Spectral Data

P. Petrilli
Istituto di Industrie Agrarie, Università di Napoli, Napoli, Italy

C. Sepe
Dipartimento di Chimica Organica e Biologica, Università di Napoli, Napoli, Italy

P. Pucci†
Istituto Chimico, Università della Basilicata and Servizio di Spettrometria di Massa, CNR-Università di Napoli, via Pansini 5, 80131 Napoli, Italy

A computer program has been developed that allows the correct alignment of peptides generated by a first cleaving agent during protein sequence determination studies. The program processes data obtained from fast atom bombardment mass spectrometric analysis of different digests of the protein. The recorded mass values are used to identify peptides in these digests that overlap peptides from the first cleavage, thus making it possible to establish unambiguously the correct order of these peptides in the protein chain. The procedure has been tested on a model protein by reconstructing the complete sequence of the human β-globin chain, determining the correct alignment of 14 tryptic peptides.

INTRODUCTION

The modern approach to protein sequence determination relies on the possibility of sequencing peptides as long as 40 residues or more by ‘gas-phase’ automated Edman degradation.¹ The protein under investigation is cleaved at a few specific residues to generate a small number of large peptides which can be easily separated and directly sequenced. However, the alignment of these peptides in the actual protein chain has still to be determined; this is normally accomplished by the ‘overlapping procedure’, in which different proteolytic cleavages of the protein are used to produce peptides that overlap peptides from the first cleavage, thus making it possible to establish their correct order in the polypeptide chain.

Several computer-based strategies have been proposed in recent years to provide information on the ‘overlapping peptides’ without the need for long and difficult purification and sequencing procedures.²⁻⁶ In particular, Shimonishi and co-workers⁶ reported a method of protein sequence determination based on the analysis of a complex mixture of peptides by fast atom bombardment mass spectrometry (FAB/MS),⁷ both for molecular weight measurement and for sequence determination of individual components.

We are now investigating the use of FAB/MS in the identification of overlapping peptides during the overlapping procedure, in an attempt to eliminate most of the peptide separation steps. To this end, a computer program allowing the determination of the correct alignment of peptides generated by the first cleavage of the protein has been developed. The program makes use of FAB/MS data from peptide mixture analysis to reconstruct the original protein structure. In this paper we report the application of this program to the reconstruction of the entire sequence of the human β-globin chain, chosen as a model protein, determining the correct alignment of 14 peptides generated by tryptic digestion of the protein.

† Author to whom correspondence should be addressed.

EXPERIMENTAL

The program was written in Applesoft BASIC on an Apple IIe personal computer (64K RAM) equipped with two 5.25-inch disk drives and an Epson MX-80 printer. The Tasc compiler from Microsoft was used to speed up data processing. A copy of the program may be obtained by sending a returnable postage-paid envelope and a 5.25-inch floppy disk to the authors.

FAB mass spectra were recorded on a Kratos MS 50 double-focusing mass spectrometer fitted with an M-Scan gun operating at 9 keV (2.5 µA). Xenon was used as the primary ionizing beam. Samples were dissolved in 0.1 M HCl and loaded onto a glycerol-coated probe tip; thioglycerol was added immediately before inserting the sample into the ion source.

Human β-globin was prepared as previously described from the blood of a normal individual.⁸ Tryptic and chymotryptic digestions were carried out in 0.4% ammonium bicarbonate, pH 8.5, at 37 °C for 4 h, using enzyme-to-substrate ratios of 1:75 and 1:50 (w/w), respectively. V-8 protease hydrolysis was performed in the same buffer at pH 8.0 and 40 °C for 6 h.

1052-9306/91/030115-06 $05.00

© 1991 by John Wiley & Sons, Ltd.

Accepted (revised) December 1990


Peptic digestion was carried out in 5% formic acid at 37 °C for 4 h.

Manual Edman degradation steps were performed on the peptide mixtures as already described,⁹ using 5% phenylisothiocyanate in pyridine as coupling agent; the N-terminal residue of each peptide was identified by mass difference following FAB/MS analysis of the truncated peptide mixtures. Protein N-terminal sequence analysis was performed by manual Edman degradation; phenylthiohydantoin (PTH)-amino acids were identified by high-performance liquid chromatography using the previously described procedure.¹⁰

Trypsin, chymotrypsin, V-8 protease, pepsin and thioglycerol were from Sigma; reagents and solvents used in the gas-phase Edman degradation were Beckman sequence grade; all other chemicals were from Carlo Erba.
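The identification of an N-terminal residue "by mass difference" can be illustrated with a minimal sketch. It assumes nominal (integer) residue masses, which is our assumption and not a mass table given in the paper, and the function name `n_terminal_candidates` is hypothetical. The example values are MH+ signals of the chymotryptic digest listed in Table 2.

```python
# Nominal (integer) residue masses in Da. This table is an assumption of the
# sketch; the paper does not list the masses used by the original program.
RESIDUE_MASS = {
    "Gly": 57, "Ala": 71, "Ser": 87, "Pro": 97, "Val": 99, "Thr": 101,
    "Cys": 103, "Leu": 113, "Ile": 113, "Asn": 114, "Asp": 115,
    "Gln": 128, "Lys": 128, "Glu": 129, "Met": 131, "His": 137,
    "Phe": 147, "Arg": 156, "Tyr": 163, "Trp": 186,
}


def n_terminal_candidates(mh_before, mh_after):
    """Return the residues whose nominal mass equals the Edman mass shift.

    mh_before: MH+ of a peptide in the untreated digest.
    mh_after:  MH+ of the same peptide after one Edman cycle.
    """
    shift = mh_before - mh_after
    return sorted(name for name, mass in RESIDUE_MASS.items() if mass == shift)


# Chymotryptic digest, Table 2: 952 -> 853 is a shift of 99 Da.
print(n_terminal_candidates(952, 853))
# 1069 -> 941 is a shift of 128 Da, shared by two residues at nominal mass.
print(n_terminal_candidates(1069, 941))
```

At nominal mass a shift of 128 Da cannot distinguish Gln from Lys, and 113 Da cannot distinguish Leu from Ile, so a sketch of this kind returns every candidate instead of a single residue.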

RESULTS

Tryptic digestion of the human β-globin chain yields the 14 peptides whose sequences are reported in Table 1. Manual Edman degradation of the intact globin gave the N-terminal sequence Val-His-Leu-Thr-Pro-Glu, showing that the first peptide listed in Table 1 is the actual N-terminus of the protein. Overlapping peptides were generated by enzymatic hydrolysis of the globin chain with chymotrypsin, pepsin and V-8 protease; the resulting peptide mixtures were analysed directly by FAB/MS without any purification step.

The principle of the method is that FAB/MS analysis of the unfractionated digests can lead to the identification of peptides which overlap tryptic peptides, thus providing evidence for their alignment in the intact protein. Unambiguous assignment of mass signals to overlapping sequences is based on the molecular weight of the peptides, determined by direct mass spectrometric analysis of the digests, and on the identification of their N-terminal residues. A portion of each digest was submitted to a single step of manual Edman degradation followed by FAB/MS analysis of the truncated peptides; in most cases, the mass shift of the signals observed after the Edman reaction led to the determination of the N-terminal residues of the peptides. Table 2 reports the molecular ions recorded for each digest before and after one cycle of Edman degradation and the corresponding N-terminal residue when it was unambiguously identified.

The sequences of the tryptic peptides listed in Table 1 and the mass spectral data obtained from the proteolytic digests (see Table 2) were entered into the Mass Overlap program. The program examines all the possible alignments of the 13 tryptic peptides, assigning each signal to all the sequences which fit the mass value and the corresponding N-terminal residue. Figure 1 reports the program output relative to those mass values which could only be assigned to single

Table 1. Amino acid sequences of the 14 peptides generated by tryptic digest of human β-globin chain

No.   Residues   Sequence
1     1-8        Val-His-Leu-Thr-Pro-Glu-Glu-Lys
2     1-17       Lys-Val-Leu-Gly-Ala-Phe-Ser-Asp-Gly-Leu-Ala-His-Leu-Asp-Asn-Leu-Lys
3     1-16       Leu-Leu-Gly-Asn-Val-Leu-Val-Cys-Val-Leu-Ala-His-His-Phe-Gly-Lys
4     1-13       Gly-Thr-Phe-Ala-Thr-Leu-Ser-Glu-Leu-His-Cys-Asp-Lys
5     1-12       Glu-Phe-Thr-Pro-Pro-Val-Gln-Ala-Ala-Tyr-Gln-Lys
6     1-13       Val-Asn-Val-Asp-Glu-Val-Gly-Gly-Glu-Ala-Leu-Gly-Arg
7     1-10       Leu-Leu-Val-Val-Tyr-Pro-Trp-Thr-Gln-Arg
8     1-12       Val-Val-Ala-Gly-Val-Ala-Asn-Ala-Leu-Ala-His-Lys
9     1-9        Leu-His-Val-Asp-Pro-Glu-Asn-Phe-Arg
10    1-19       Phe-Phe-Glu-Ser-Phe-Gly-Asp-Leu-Ser-Thr-Pro-Asp-Ala-Val-Met-Gly-Asn-Pro-Lys
11    1-9        Ser-Ala-Val-Thr-Ala-Leu-Trp-Gly-Lys
12    1-4        Ala-His-Gly-Lys
13    1-2        Tyr-His
14    1-2        Val-Lys


overlapping sequences; these signals indicated the correct alignment of the overlapped tryptic peptides. As an example, the signal at m/z 794 carrying an Asp residue at the N-terminus could only be generated by the sequence Asp-Asn-Leu-Lys-Gly-Thr-Phe, which overlaps tryptic peptides 2 and 4 in Table 1, thus demonstrating their linkage in the intact protein. Using the same logic, the data reported in Fig. 1 allowed us to align some of the tryptic peptides with confidence as follows:

Peptides 7 - 10
Peptides 2 - 4
Peptides 9 - 3 - 5 - 8 - 13

Therefore, in a single set of experiments on unfractionated digests, the number of peptides to be ordered was reduced from the initial 14 to the eight peptides whose sequences are listed in Table 3.

Table 2. Observed mass values of chymotryptic, peptic and V-8 protease digests of human β-globin before and after one cycle of Edman degradation. The N-terminal amino acid corresponding to each peptide is shown in parentheses

Chymotrypsin
MH+     Edman
952     853 (Val)
1499    1442 (Gly)
1286    1229 (Gly)
698     597 (Thr)
1368    1239 (Glu)
987     930 (Gly)
794     679 (Asp)
633     562 (Ala)
857     720 (His)
846     745 (Thr)
1069    941 (Gln)
885     757 (Gln)

Pepsin
MH+     Edman
1494    1395 (Val)
1209    1110 (Val)
824     725 (Val)
859     712 (Phe)
1798    1612 (Trp)
1472    1286 (Trp)
946     760 (Trp)
871     772 (Val)
1308    1195 (Leu)
2204    2117 (Ser)
1634    1503 (Met)
1228    1115 (Leu)
794     679 (Asp)
931     784 (Phe)
1239    1140 (Val)

V-8 protease
MH+     Edman
824     725 (Val)
695     -
1616    1488 (Lys)
2094    2023 (Ala)
953     839 (Asn)
938     851 (Ser)
1180    1066 (Asn)
1305    1192 (Leu)
1745    1598 (Phe)

MASS VALUE: 698    N-TERMINAL RESIDUE: THR    OVERLAPPED PEPTIDES: 7 - 10    POSITIONS: FROM POS. 8 TO POS. 12
MASS VALUE: 794    N-TERMINAL RESIDUE: ASP    OVERLAPPED PEPTIDES: 2 - 4     POSITIONS: FROM POS. 14 TO POS. 20
MASS VALUE: 1069   N-TERMINAL RESIDUE: LYS    OVERLAPPED PEPTIDES: 5 - 8     POSITIONS: FROM POS. 11 TO POS. 21
MASS VALUE: 931    N-TERMINAL RESIDUE: PHE    OVERLAPPED PEPTIDES: 9 - 3     POSITIONS: FROM POS. 8 TO POS. 15
MASS VALUE: 1239   N-TERMINAL RESIDUE: VAL    OVERLAPPED PEPTIDES: 3 - 5     POSITIONS: FROM POS. 7 TO POS. 17
MASS VALUE: 953    N-TERMINAL RESIDUE: ASN    OVERLAPPED PEPTIDES: 8 - 13    POSITIONS: FROM POS. 7 TO POS. 14

Figure 1. Program output relative to the mass values assigned to unique overlapping sequences during the first cycle of the Mass Overlap program. The positions indicated refer to sequence numbering reported in Table 1.

MASS VALUE: 1209   N-TERMINAL RESIDUE: VAL    OVERLAPPED PEPTIDES: 1 - 3           POSITIONS: FROM POS. 1 TO POS. 11
MASS VALUE: 1472   N-TERMINAL RESIDUE: TRP    OVERLAPPED PEPTIDES: 3 - 2           POSITIONS: FROM POS. 7 TO POS. 20
MASS VALUE: 946    N-TERMINAL RESIDUE: TRP    OVERLAPPED PEPTIDES: 3 - 2           POSITIONS: FROM POS. 7 TO POS. 14
MASS VALUE: 1305   N-TERMINAL RESIDUE: LEU    OVERLAPPED PEPTIDES: 6 - 7           POSITIONS: FROM POS. 26 TO POS. 36
MASS VALUE: 1634   N-TERMINAL RESIDUE: MET    OVERLAPPED PEPTIDES: 5 - (8,4) - 6   POSITIONS: FROM POS. 25 TO POS. 40

Figure 2. Mass Overlap output of the second cycle of the program. Mass values are those associated with single overlapping sequences. The positions indicated refer to sequence numbering reported in Table 3.

Moreover, the positive identification of correctly linked peptides placed increasing constraints on the potential combinations of the tryptic peptides, greatly decreasing the total number of possibilities. Thus, those mass signals that corresponded to several sequences in the first cycle of the program have to be re-examined, taking into account the new set of peptides generated by the linkages already assigned. All these mass values were entered again into the Mass Overlap program together with the sequences of the new set of eight peptides (Table 3 and Fig. 3). Figure 2 shows the computer output relative to the five mass values corresponding to unique overlapping sequences. The same logic described above allowed us to order the peptides listed in Table 3 as follows:

Peptides 1 - 3 - 2
Peptides 5 - (8, 4) - 6 - 7

The signal at m/z 1634 was assigned to an amino acid sequence overlapping peptides 5, 8, 4 and 6 from Table


Table 3. Amino acid sequences of the new set of eight peptides obtained after one cycle of the Mass Overlap program following the linking procedure

No.   Residues   Sequence
1     1-8        Val-His-Leu-Thr-Pro-Glu-Glu-Lys
2     1-13       Val-Asn-Val-Asp-Glu-Val-Gly-Gly-Glu-Ala-Leu-Gly-Arg
3     1-9        Ser-Ala-Val-Thr-Ala-Leu-Trp-Gly-Lys
4     1-4        Ala-His-Gly-Lys
5     1-15       Leu-Leu-Val-Val-Tyr-Pro-Trp-Thr-Gln-Arg-Phe-Phe-Glu-Ser-Phe
      16-29      Gly-Asp-Leu-Ser-Thr-Pro-Asp-Ala-Val-Met-Gly-Asn-Pro-Lys
6     1-15       Lys-Val-Leu-Gly-Ala-Phe-Ser-Asp-Gly-Leu-Ala-His-Leu-Asp-Asn
      16-30      Leu-Lys-Gly-Thr-Phe-Ala-Thr-Leu-Ser-Glu-Leu-His-Cys-Asp-Lys
7     1-15       Leu-His-Val-Asp-Pro-Glu-Asn-Phe-Arg-Leu-Leu-Gly-Asn-Val-Leu
      16-30      Val-Cys-Val-Leu-Ala-His-His-Phe-Gly-Lys-Glu-Phe-Thr-Pro-Pro
      31-45      Val-Gln-Ala-Ala-Tyr-Gln-Lys-Val-Val-Ala-Gly-Val-Ala-Asn-Ala
      46-51      Leu-Ala-His-Lys-Tyr-His
8     1-2        Val-Lys

3 (see Fig. 2); as peptides 8 and 4 lie inside the overlapping sequence and we could find no indication of their relative position, the correct order of these two peptides might be either 8 - 4 or 4 - 8; this uncertainty is indicated by the parentheses. The power of the method is well illustrated here: the second cycle of the program reduced the number of peptides to be ordered to just two, still using the same experimental data. The sequences of the two peptides are reported in Table 4, the second peptide still containing the uncertainty about the relative position of peptides 8 and 4 of Table 3.

The complete sequence of the intact protein was obtained by iterating the procedure discussed above; the sequences of the two peptides reported in Table 4 and the mass values still not assigned to a single sequence were entered into the Mass Overlap program. After the third cycle, the signals at m/z 871 and 2094 were unambiguously assigned to the overlapping sequences Val-Gly-Gly-Glu-Ala-Leu-Gly-Arg-Leu and Ala-Leu-Gly-Arg-Leu-Leu-Val-Val-Tyr-Pro-Trp-Thr-Gln-Arg-Phe-Phe-Glu, respectively, thus allowing reconstruction of the complete sequence of the β-globin chain.
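The linking step that shrinks the peptide set from one cycle to the next can be sketched as follows. This is an illustration only, not the authors' BASIC code: the function name `link_peptides`, the one-letter sequences (transcribed from Table 1) and the representation of a link as an ordered pair are all our assumptions. The links are those read from Fig. 1.

```python
# Tryptic peptides of Table 1 in one-letter code (numbering as in Table 1).
TRYPTIC = {
    1: "VHLTPEEK", 2: "KVLGAFSDGLAHLDNLK", 3: "LLGNVLVCVLAHHFGK",
    4: "GTFATLSELHCDK", 5: "EFTPPVQAAYQK", 6: "VNVDEVGGEALGR",
    7: "LLVVYPWTQR", 8: "VVAGVANALAHK", 9: "LHVDPENFR",
    10: "FFESFGDLSTPDAVMGNPK", 11: "SAVTALWGK", 12: "AHGK",
    13: "YH", 14: "VK",
}

# Unambiguous links from the first cycle (Fig. 1): (a, b) means that
# peptide a is immediately followed by peptide b in the protein.
FIRST_CYCLE_LINKS = [(7, 10), (2, 4), (9, 3), (3, 5), (5, 8), (8, 13)]


def link_peptides(peptides, links):
    """Merge linked peptides into longer ones.

    Returns a list of (chain, sequence) pairs, where chain is the tuple of
    original peptide numbers in order. Assumes the links are consistent
    (no peptide has two successors or two predecessors, and no cycles).
    """
    successor = dict(links)
    has_predecessor = {b for _, b in links}
    merged = []
    for start in peptides:
        if start in has_predecessor:
            continue  # this peptide is absorbed into a chain starting elsewhere
        chain = [start]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        merged.append((tuple(chain), "".join(peptides[i] for i in chain)))
    return merged


for chain, sequence in link_peptides(TRYPTIC, FIRST_CYCLE_LINKS):
    print(chain, len(sequence), sequence)
```

With the six links of the first cycle, the 14 starting peptides collapse into eight, in agreement with Table 3 (which renumbers them). An ambiguity such as the (8, 4) pair of the second cycle is not handled by this sketch and would need an explicit representation.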

Table 4. Amino acid sequences of the new set of two peptides obtained after two cycles of the Mass Overlap program following the linking procedure

No.   Residues   Sequence
1     1-15       Val-His-Leu-Thr-Pro-Glu-Glu-Lys-Ser-Ala-Val-Thr-Ala-Leu-Trp
      16-30      Gly-Lys-Val-Asn-Val-Asp-Glu-Val-Gly-Gly-Glu-Ala-Leu-Gly-Arg
2     1-15       Leu-Leu-Val-Val-Tyr-Pro-Trp-Thr-Gln-Arg-Phe-Phe-Glu-Ser-Phe
      16-30      Gly-Asp-Leu-Ser-Thr-Pro-Asp-Ala-Val-Met-Gly-Asn-Pro-Lys-Val
      31-45      Lys-Ala-His-Gly-Lys-Lys-Val-Leu-Gly-Ala-Phe-Ser-Asp-Gly-Leu
      46-60      Ala-His-Leu-Asp-Asn-Leu-Lys-Gly-Thr-Phe-Ala-Thr-Leu-Ser-Glu
      61-75      Leu-His-Cys-Asp-Lys-Leu-His-Val-Asp-Pro-Glu-Asn-Phe-Arg-Leu
      76-90      Leu-Gly-Asn-Val-Leu-Val-Cys-Val-Leu-Ala-His-His-Phe-Gly-Lys
      91-105     Glu-Phe-Thr-Pro-Pro-Val-Gln-Ala-Ala-Tyr-Gln-Lys-Val-Val-Ala
      106-116    Gly-Val-Ala-Asn-Ala-Leu-Ala-His-Lys-Tyr-His

The relative order of Val-Lys (residues 30-31) and Ala-His-Gly-Lys (residues 32-35) in peptide 2 is not established.

[Flow chart; legible box labels: "original set of peptide sequences", "SEARCH for OVERLAPS"]

Figure 3. Flow chart illustrating the overall strategy of the Mass Overlap program.

DISCUSSION

The Mass Overlap program consists of three sections, named SET UP SEARCH, SEARCH OVERLAP and LINK, which can be selected from a main menu; Fig. 3 reports a flow chart of the program that shows the overall strategy. The amino acid sequences of the peptides to be aligned are entered into the program, and the data are processed and organized by the SET UP SEARCH routine; this section was developed to make the search for overlapping sequences faster. It should be noted that this routine has to be run every time the peptide set changes following the linking procedure, as shown in Fig. 3. The SEARCH OVERLAP program assigns the mass signals recorded in the spectra to all the possible sequences that fit the mass value and the N-terminal residue. These sequences are formed by combining the sequences of the input peptides; the maximum number of peptides in a single combination is four, a limit set by the usable mass range of the mass spectrometer.
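The search described above can be sketched as a brute-force enumeration. This is an illustration under stated assumptions, not the authors' implementation: the paper does not describe the precomputation done by SET UP SEARCH, so the sketch simply tries every ordered combination of up to `max_peptides` peptides. The names `mh_plus` and `search_overlap`, the one-letter sequences (transcribed from Table 1) and the nominal integer residue masses are ours.

```python
from itertools import permutations

# Nominal (integer) residue masses, one-letter code; an assumption of this sketch.
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
        "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
        "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}
WATER_PLUS_PROTON = 19  # H2O (18) + H (1), nominal

# Tryptic peptides of Table 1 (numbering as in Table 1).
TRYPTIC = {
    1: "VHLTPEEK", 2: "KVLGAFSDGLAHLDNLK", 3: "LLGNVLVCVLAHHFGK",
    4: "GTFATLSELHCDK", 5: "EFTPPVQAAYQK", 6: "VNVDEVGGEALGR",
    7: "LLVVYPWTQR", 8: "VVAGVANALAHK", 9: "LHVDPENFR",
    10: "FFESFGDLSTPDAVMGNPK", 11: "SAVTALWGK", 12: "AHGK",
    13: "YH", 14: "VK",
}


def mh_plus(sequence):
    """Nominal MH+ of a linear peptide given in one-letter code."""
    return sum(MASS[residue] for residue in sequence) + WATER_PLUS_PROTON


def search_overlap(peptides, target_mh, n_term, max_peptides=4):
    """List fragments that overlap 2..max_peptides peptides and fit the data.

    A fragment must start inside the first peptide of the combination, end
    inside the last one, begin with residue n_term and have MH+ == target_mh.
    Each hit is (order, first_pos, last_pos, fragment), with 1-based positions
    counted along the joined sequence, as in the program output of Fig. 1.
    """
    hits = []
    for k in range(2, max_peptides + 1):
        for order in permutations(peptides, k):
            joined = "".join(peptides[i] for i in order)
            first_len = len(peptides[order[0]])
            last_start = len(joined) - len(peptides[order[-1]])
            for start in range(first_len):
                if joined[start] != n_term:
                    continue
                for end in range(last_start + 1, len(joined) + 1):
                    fragment = joined[start:end]
                    if mh_plus(fragment) == target_mh:
                        hits.append((order, start + 1, end, fragment))
    return hits


# Signal at m/z 794 with N-terminal Asp (Table 2), pairs of peptides only.
for hit in search_overlap(TRYPTIC, 794, "D", max_peptides=2):
    print(hit)
```

A signal supports a link only when the search returns a single overlapping sequence for it and no sequence inside one peptide fits equally well; the sketch reports overlapping candidates and leaves that uniqueness check to the caller. Restricting the start to the first peptide and the end to the last keeps each combination cheap to scan, which matters because the number of ordered combinations grows quickly with `max_peptides`.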


When a mass signal can be unambiguously associated with a single sequence that overlaps two or more peptides, these peptides are linked together using the LINK program; the linking procedure generates a new set of peptides which is passed directly to the SET UP SEARCH section for preliminary calculations, as illustrated in Fig. 3.

In this paper we have reported the reconstruction of the entire primary structure of the human β-globin chain, apart from a single uncertainty, starting from the sequences of 14 tryptic peptides and using mass spectral data obtained by the analysis of three different proteolytic digests for the overlapping sequences. However, even when the complete sequence of the protein cannot be obtained, the Mass Overlap program will be able to indicate the correct alignment of some of the peptides, suggesting further experiments to obtain the missing overlapping sequences which will lead to the reconstruction of the entire protein.

The whole strategy is based on the assumption that each mass signal recorded in the spectra, being generated by proteolytic fragments of the protein, has to be assigned to a sequence occurring in the protein structure. This sequence might either be located within the sequence of a tryptic fragment or overlap two or more of them, leading to the linking of these peptides. It should be noted that all the mass signals recorded in the FAB/MS analysis of the β-globin digests (see Table 2) were unambiguously assigned to β-globin peptides. The crucial point in the entire procedure is the identification of those mass values which can only be associated with unique overlapping sequences; in this respect, the fundamental role of the N-terminal residues in eliminating most of the incorrect possibilities has to be underlined.

The proposed procedure, which avoids most of the difficult chromatographic separations of peptides, can find application in the field of protein sequence determination. However, we want to underline the limits of this approach. The method requires knowledge of the sequences of all the peptides generated by the first cleavage of the protein, and this is not always possible during protein sequence analysis. Moreover, the occurrence of post-translational modifications in proteins will generate modified peptides giving mass values in the FAB spectra which the program cannot account for. The results presented in this paper are our contribution towards the development of fast computer-based procedures that make use of FAB/MS analysis of peptide mixtures, in an effort to speed up protein sequence determination.

Acknowledgements

This work was supported by Progetto di Ricerca di Interesse Nazionale ‘Livelli di Organizzazione Strutturale di Proteine: Metodi di Indagine’ and by grants from Ministero Pubblica Istruzione, Rome, Italy. Mass spectral data were obtained at Servizio di Spettrometria di Massa, CNR-Università di Napoli; the assistance of the staff is gratefully acknowledged. We are particularly indebted to Dr A. Malorni.


REFERENCES

1. R. M. Hewick, M. W. Hunkapiller, L. E. Hood and W. J. Dreyer, J. Biol. Chem. 256, 7990 (1981).
2. Y. Shimonishi, Y. M. Hong, T. Kitagishi, T. Matsuo, H. Matsuda and I. Katakuse, Eur. J. Biochem. 112, 251 (1980).
3. Y. M. Hong, T. Takao, S. Aimoto and Y. Shimonishi, Biomed. Mass Spectrom. 10, 450 (1983).
4. B. W. Gibson and K. Biemann, Proc. Natl Acad. Sci. USA 81, 1956 (1984).
5. P. Petrilli, Int. J. Pept. Prot. Res. 17, 85 (1985).
6. T. Takao, T. Hitouji, S. Aimoto, Y. Shimonishi, S. Hara, T. Takeda, Y. Takeda and T. Miwatani, FEBS Lett. 152, 1 (1983).
7. M. Barber, R. S. Bordoli, R. D. Sedgwick and A. N. Tyler, J. Chem. Soc. Chem. Commun. 325 (1981).
8. P. Pucci, P. Ferranti, A. Malorni and G. Marino, Biomed. Environ. Mass Spectrom. 18, 20 (1989).
9. P. Pucci, C. Carestia, G. Fioretti, A. M. Mastrobuoni and L. Pagano, Biochem. Biophys. Res. Commun. 130, 85 (1985).
10. P. Pucci, G. Sannia and G. Marino, J. Chromatogr. 270, 371 (1983).
