Software for analysis and manipulation of genetic linkage data.

Am. J. Hum. Genet. 50:1267-1274, 1992

Software for Analysis and Manipulation of Genetic Linkage Data

Raymond Weaver, Cynthia Helms, Santosh K. Mishra, and Helen Donis-Keller

Department of Genetics, Washington University School of Medicine, St. Louis

Summary

We present eight computer programs written in the C programming language that are designed to analyze genotypic data and to support existing software used to construct genetic linkage maps. Although each program has a unique purpose, they all share the common goals of affording a greater understanding of genetic linkage data and of automating tasks to make computers more effective tools for map building. The PIC/HET and FAMINFO programs automate calculation of relevant quantities such as heterozygosity, PIC, allele frequencies, and informativeness of markers and pedigrees. PREINPUT simplifies data submissions to the Centre d'Etude du Polymorphisme Humain (CEPH) database by creating a file with genotype assignments that CEPH's INPUT program would otherwise require to be input manually. INHERIT is a program written specifically for mapping the X chromosome: by assigning a dummy allele to males in the nonpseudoautosomal region, it eliminates falsely perceived noninheritances in the data set. The remaining four programs complement the previously published genetic linkage mapping software CRI-MAP and LINKAGE. TWOTABLE produces a more readable format for the output of CRI-MAP two-point calculations; UNMERGE is the converse of CRI-MAP's merge option; and GENLINK and LINKGEN automatically convert between the genotypic data file formats required by these packages. All eight applications read input from the same types of data files that are used by CRI-MAP and LINKAGE. Their use has simplified the management of data, has increased knowledge of the information content of pedigrees, and has reduced the amount of time needed to construct genetic linkage maps of chromosomes.

Received December 5, 1991; revision received February 11, 1992. Address for correspondence and reprints: Dr. Helen Donis-Keller, Department of Genetics, Washington University School of Medicine, 660 South Euclid Avenue, St. Louis, MO 63110. © 1992 by The American Society of Human Genetics. All rights reserved. 0002-9297/92/5006-0015$02.00

Introduction

Improvements in laboratory techniques for constructing genetic linkage maps have not only helped increase the amount of data being applied to such research but have also dramatically accelerated the rate at which new data are generated. While more information enables maps of ever-increasing resolution to be constructed, it also demands that additional tools with which to analyze and manipulate data be created. Reliance on electronic digital computers to perform these processes has become ubiquitous.

Several interactive software packages to construct multilocus linkage maps are available. The Linkage Analysis Program, or LINKAGE (Lathrop et al. 1984), performs linkage analysis by the family method and calculates genetic risk. LINKAGE is commonly used to map disease loci. CRI-MAP (Donis-Keller et al. 1987; Green et al. 1989), which implements the EM algorithm (Lander and Green 1987) and a novel and efficient method for calculating likelihoods (P. Green, unpublished data), is used in constructing multilocus linkage maps, as well as for detecting errors in data. MAPMAKER (Lander et al. 1987) also uses the EM algorithm to perform multilocus linkage analysis and provides a command language with which genetic linkage data can be explored. These programs have strengths and weaknesses; for example, CRI-MAP is more efficient when full data (i.e., all parental genotypes) are available, but LINKAGE is better at dealing with missing genotypes.

Even with computers and their attendant software, building linkage maps is a monumental task. A continuous 1-cM-resolution genetic linkage map spanning


the entire 3,300-cM human genome, for example, requires, at a minimum, 3,300 markers. If these markers are highly informative (heterozygosity >.70), only 9% (= [1 - .70]^2) of children will, on average, be untyped homozygotes. Therefore, building such a map by using the Centre d'Etude du Polymorphisme Humain (CEPH) reference pedigree primary panel, which contains roughly 520 individuals, requires over 1,560,000 genotypes (3,300 markers × 520 individuals × 0.91 ≈ 1,561,560). The huge volume of data required to successfully construct maps becomes obvious, especially in consideration of the fact that markers are not, as in this ideal case, evenly spaced.

PIC/HET, FAMINFO, PREINPUT, INHERIT, TWOTABLE, UNMERGE, GENLINK, and LINKGEN are eight programs that we wrote in the C programming language (Kernighan and Ritchie 1978) that support and complement published mapping software. We have automated several tasks that had been done manually or not at all, thereby drawing on the power of the computer but reducing interaction with it. The programs also increase understanding of the data set used in map construction, by compiling genetic information into general observations. Our guiding design philosophy was to provide software that is accessible to molecular geneticists, rather than only to computer scientists. To this end, little knowledge of computers is required to use the programs, and the user interface is limited and straightforward. These programs make computing more of an asset and less of an obstacle.

In the present paper, we explain the computing environment in which our software was developed. We describe in detail the function of each program and, where appropriate, the algorithms used to implement that function.

Material and Methods

The programs were coded and compiled with a Sun 3/60 and with SPARCstation computers manufactured by Sun Microsystems. These machines run the SunOS operating system, which is derived from UNIX 4.3 BSD.
In addition, we were provided with debugging tools through windowing software such as SunView and OpenWindows.

Data are input to the programs from ASCII text files that contain genotypes for genetic markers (e.g., probe-enzyme systems) in CEPH-format pedigrees of two or more generations. We refer to members of the youngest generation of each family as "siblings," to the next oldest generation as "parents" (father and


mother), and to their parents as the pedigree's "grandparents." Under the CEPH standard, the father and mother must be, respectively, the first and second individuals listed in each pedigree.

All programs except LINKGEN operate on files having the same form as that required by CRI-MAP; these are called ".gen" files. At the beginning of such a file is a list of the markers for which there are data. The data in the file are grouped by pedigree. Every individual in a family has an associated list of the genotypes, for that individual, at each locus. Figure 1 shows the structure of a .gen file, together with an example.

The primary input to LINKGEN is a file whose content is similar to that of a .gen file but which has a different structure. We call such a file a ".lnk" file; this format is used by LINKAGE. Information is again grouped by pedigree, but the family ID is repeated for each member. The names and order of genetic markers are implicit in .lnk files, as there is no list of markers to identify the data. Another difference between the file formats is that .lnk files provide the ability to classify disease loci in multiple ways. If the data set contains disease loci, they may be classified either by allele numbers (as are other markers) or with an affection status of unaffected, affected, or unknown. Although LINKAGE also allows classification by binary factor notation or by quantitative factors, these options have not been implemented in GENLINK. Finally, liability classes may be associated with each disease locus. A liability class is an integer that defines disease penetrance as a function of parameters such as age. Figure 2 shows the .lnk file structure and an example with one disease locus.

The genotypic data files on which the programs operate are created in a number of ways. Arranging the data in the proper format with a text editor is straightforward but painstaking; we make .lnk files in this manner.
However, the MS-DOS CEPH data management program simplifies the task by providing the option of automatically creating a .lnk file for CEPH data. Two additional procedures are available for creating .gen files. One method uses a HyperCard application called "Pedigree Stacks" (Six Ponds Software), which runs on Macintosh computers (our lab uses a Macintosh IIx). This application facilitates data entry using electronic "cards," one of which is shown in figure 3. Pedigree Stacks then uses these cards to make a .gen file. This file is then transferred to a Sun computer via Ethernet. Alternatively, we make .gen files with data received from CEPH. These data are initially in binary files and can be converted to ASCII format by using an MS-DOS program supplied by CEPH. We perform the conversions on a Dell 386 PC, then transfer the converted files, using Kermit communications software, to a Sun computer. Finally, files of the .gen structure are created using information in the ASCII files.

Figure 1 (a), general form of a .gen file:

(number of families)
(number of probe-enzyme systems)
(name of 1st probe-enzyme system)
...
(name of nth probe-enzyme system)
For each family:
  (family ID)
  (number of members)
  For each member of family:
    (ID) (mother's ID) (father's ID) (sex: female 0, male 1)
    (locus 1 allele 1) (loc 1 a2) ... (loc n a1) (loc n a2)

Figure 1. .gen file structure. (a), General file form. Data are grouped by family, with an ID and the number of members preceding the individual's information in each. (b), Small two-generation .gen file with one family. Zeros are used for unknown alleles and for parental IDs of individuals whose parents are not present in the data set. Arrows indicate the correspondence between the general form and the example .gen file.

Figure 2 (a), general form of a .lnk file:

For each individual:
  (family ID) (ID) (father's ID) (mother's ID) (sex: female 2, male 1) ... (locus 1 allele 1) (loc 1 a2) ... (loc n a1) (loc n a2)

Figure 2. .lnk file structure. (a), General file form. Data are grouped by individual. Angle brackets enclose optional quantities: a .lnk file need not have disease loci, and disease loci need not have liability classes. (b), .lnk file containing the same markers as the .gen file of fig. 1. An affection status has been used to classify the disease locus, and each individual has an associated liability class. Arrows indicate the correspondence between the general form and the example .lnk file.
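The differences between the two layouts in figures 1 and 2 (header versus no header, parent order, and sex coding) can be illustrated with a short sketch in C. This is not the source of GENLINK or of any program described here; the function name gen_to_lnk is hypothetical, family and individual IDs are assumed to be integers, zero is assumed to mean "unknown" in both formats, and disease loci and liability classes are not handled.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Sketch only: copy a marker-only .gen stream to .lnk-style records.
 * Assumptions (not from the original programs): family and individual IDs
 * are integers, and there are no disease loci or liability classes.
 * Returns the number of individuals written, or -1 on malformed input.
 */
int gen_to_lnk(FILE *in, FILE *out)
{
    int n_families, n_loci, n_members, family_id;
    int id, mother, father, sex, allele;
    int f, m, a, written = 0;
    char name[64];

    assert(in != NULL && out != NULL);

    if (fscanf(in, "%d %d", &n_families, &n_loci) != 2 ||
        n_families < 1 || n_loci < 1)
        return -1;

    /* Marker names exist only in the .gen header; .lnk files omit them. */
    for (a = 0; a < n_loci; a++)
        if (fscanf(in, "%63s", name) != 1)
            return -1;

    for (f = 0; f < n_families; f++) {
        if (fscanf(in, "%d %d", &family_id, &n_members) != 2 || n_members < 1)
            return -1;
        for (m = 0; m < n_members; m++) {
            /* .gen order: ID, mother, father, sex (female 0, male 1). */
            if (fscanf(in, "%d %d %d %d", &id, &mother, &father, &sex) != 4 ||
                (sex != 0 && sex != 1))
                return -1;
            /* .lnk order: family ID, ID, father, mother, sex (female 2, male 1). */
            fprintf(out, "%d %d %d %d %d",
                    family_id, id, father, mother, sex == 0 ? 2 : 1);
            /* Two alleles per locus are copied unchanged; 0 stays "unknown". */
            for (a = 0; a < 2 * n_loci; a++) {
                if (fscanf(in, "%d", &allele) != 1)
                    return -1;
                fprintf(out, " %d", allele);
            }
            fputc('\n', out);
            written++;
        }
    }
    return written;
}
```

A caller could, for example, pass standard input and output, as in gen_to_lnk(stdin, stdout). Because the marker names are dropped on the way to .lnk form, the reverse conversion needs them supplied separately.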

Program Descriptions

PIC/HET

PIC/HET calculates PIC (Botstein et al. 1980), heterozygosity, the maximum number of informative meioses, and allele frequencies for each genetic marker that has been input. These quantities are based on collective data from each pedigree's parents, which ensures that only biologically unrelated individuals are used (related CEPH parents are automatically omitted from the calculations). Although older generations provide more individuals on which to base these quantities, parents are preferable for several reasons. Pedigrees often include only two generations or have incomplete data for previous generations. Also, because genotyping is performed twice on parents (on separate blots or PCR gels), these data can be confirmed and are more accurate. Finally, the allele distribution among grandparents may be skewed, since they may not be genotyped if the parents have proved uninformative. Figure 4 shows an example of PIC/HET's output.

Figure 3. HyperCard Pedigree Stacks card for CEPH kindred 1349. The genotypes for a marker have been entered into the card's fields.

The user of PIC/HET chooses between two methods of determining heterozygosity. Direct calculation follows from the quantity's definition: heterozygosity is simply the ratio of the number of heterozygotes to the number of individuals sampled. Alternatively, allele-frequency approximation estimates homozygosity from the n calculated allele frequencies (each p_i) for each marker and subtracts this fraction from 1, to determine heterozygosity H by

$$H = 1 - \sum_{i=1}^{n} p_i^2 . \qquad (1)$$

Allele-frequency approximation attempts to infer the heterozygosity for a population under the assumption of Hardy-Weinberg equilibrium, so this method does not always accurately reflect the data in the actual sample. Conversely, applying direct calculation to a small data set occasionally yields unlikely results (e.g., a two-allele system with heterozygosity >50%).

PIC/HET uses the frequencies p_i of n alleles in implementing an algorithm given by Botstein et al. (1980) to determine PIC as

$$\mathrm{PIC} = 1 - \sum_{i=1}^{n} p_i^2 - \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} 2 p_i^2 p_j^2 . \qquad (2)$$

Note that when heterozygosity is calculated with equation (1), PIC will never be greater than heterozygosity, because equation (2) subtracts a further nonnegative term from the same quantity.
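The two heterozygosity methods and equation (2) can be sketched in a few lines of C. This is an illustration written for this description, not the PIC/HET source: the function names, the MAX_ALLELE bound, and the storage of genotypes as consecutive allele pairs are assumptions, as is the choice to leave individuals with a zero allele out of the direct calculation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ALLELE 16 /* hypothetical cap on allele numbers, for illustration */

/* Fill freq[0..MAX_ALLELE] from n_ind genotypes stored as allele pairs.
 * Zero (unknown) alleles are disregarded, so freq[0] is always 0.
 * Returns the number of nonzero alleles counted. */
int allele_freqs(const int *geno, int n_ind, double freq[MAX_ALLELE + 1])
{
    int counts[MAX_ALLELE + 1] = {0};
    int i, total = 0;

    for (i = 0; i < 2 * n_ind; i++) {
        assert(geno[i] >= 0 && geno[i] <= MAX_ALLELE);
        if (geno[i] != 0) {
            counts[geno[i]]++;
            total++;
        }
    }
    for (i = 0; i <= MAX_ALLELE; i++)
        freq[i] = total ? (double)counts[i] / total : 0.0;
    return total;
}

/* Direct calculation: heterozygotes / individuals sampled.
 * Assumption: individuals with any zero allele are not counted as sampled. */
double het_direct(const int *geno, int n_ind)
{
    int i, typed = 0, het = 0;

    for (i = 0; i < n_ind; i++) {
        int a = geno[2 * i], b = geno[2 * i + 1];
        if (a == 0 || b == 0)
            continue;
        typed++;
        if (a != b)
            het++;
    }
    return typed ? (double)het / typed : 0.0;
}

/* Equation (1): one minus the sum of squared allele frequencies. */
double het_approx(const double *p, int n)
{
    double hom = 0.0;
    int i;

    for (i = 0; i < n; i++)
        hom += p[i] * p[i];
    return 1.0 - hom;
}

/* Equation (2): equation (1) minus the sum over pairs of 2 p_i^2 p_j^2. */
double pic(const double *p, int n)
{
    double cross = 0.0;
    int i, j;

    for (i = 0; i < n - 1; i++)
        for (j = i + 1; j < n; j++)
            cross += 2.0 * p[i] * p[i] * p[j] * p[j];
    return het_approx(p, n) - cross;
}
```

For a two-allele marker with frequencies 0.48 and 0.52, het_approx gives about 0.50 and pic about 0.37. Alleles with zero frequency contribute nothing to either sum, so the array produced by allele_freqs can be passed as freq + 1 with n = MAX_ALLELE.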

Both the maximum number of informative meioses and allele frequencies are calculated for each marker in a straightforward manner. PIC/HET multiplies the number of heterozygous parents in each family by the number of its siblings who have nonzero genotypes. The sum of these products is the maximum number of informative meioses for a marker. The actual number may be somewhat smaller because PIC/HET does not consider that both parents may be heterozygous for the same alleles. All siblings in such a family who are also heterozygous are generally not useful in measuring linkage, unless close flanking markers exist. The frequency of each allele in a system is simply the ratio of the number of observations of that allele to the number of observations of all alleles. Zero alleles are disregarded. When PIC/HET is used with X-linked data, genotype data from males should not be included in the calculations.

FAMINFO

FAMINFO determines, from data for some or all of the markers in its input file, both whether the parents in each family are informative and the number of siblings who provide information. This information is organized into a table that qualifies input data. Markers lie along the horizontal axis of the table; pedigree IDs are on the vertical axis. Each entry in the table contains two fields. The first field is a character describing parental informativeness; the second field is an integer representing the number of siblings in the family who have nonzero genotypes. Figure 5 shows both an example of FAMINFO's output and a key for all possible character codes.

The table that this program creates can be used to infer the number of informative meioses that a marker has. For example, a table entry of B6 for a family corresponds to the contribution of 12 informative meioses (two heterozygous parents multiplied by six siblings) to the relevant marker's total.

Since the FAMINFO program succinctly describes each pedigree's content, it can be used to make general observations. FAMINFO identifies those families for which no data exist and represents the informativeness of a marker, thereby supporting ongoing data collection. The program also reveals which polymorphisms have identical informativeness, before data have been collected on a large number of families. Researchers then have the choice of either reducing redundant data collection or intentionally duplicating genotypes to aid in error checking.

Figure 4. Example of PIC/HET output for four markers (columns: P/E System, #Ind, Hetero, PIC, IM, Allele/Freq). The number of individuals used for each marker's calculations is reported along with the relevant quantities. This number varies because parents may have no information for a locus.

PREINPUT

Collaborators to CEPH are required to submit new data in binary files. Prior to version 4 (available January 1990) of the CEPH database, CEPH's MS-DOS-based program, INPUT, was used to perform this function by providing a keyboard interface to read data and then creating the appropriate binary file. Although this method is sufficient for CEPH submissions, the large number of keystrokes that it requires makes it laborious and error prone. Furthermore, if the genotypes are already stored as a .gen file, reentering the data is, in theory, unnecessary. We wrote PREINPUT to simplify data submission when using INPUT. As of CEPH version 4, INPUT is no longer current; it has been replaced by another data management program, called "CEPH." Despite its advantages, however, the CEPH program still requires the user to key in all the genotypic data. Therefore, the combination of PREINPUT and INPUT is still a useful data-submission tool when a .gen file for the data set exists.

PREINPUT eliminates much of the data entry required, by replacing INPUT's user interface with a less interactive one. The number of keystrokes involved is substantially reduced: the user supplies only the location, band sizes, and allele definitions for a genetic marker. The remaining data (genotype assignments) are read from a .gen file containing the relevant marker. PREINPUT uses these data to create a text file that exactly corresponds to the keystrokes needed to run CEPH's INPUT program. INPUT can then read this file, instead of reading keyboard input, and can create the appropriate binary file. PREINPUT can be run repeatedly for submission of multiple markers.

To ensure that PREINPUT creates a valid file, the same error-checking method that is used by INPUT is performed. Allele definitions must be composed of band sizes that the user has defined, and all alleles at the relevant locus in the .gen file must be defined. Because genotypes submitted to CEPH cannot contain zeros, all genotypes in which both allele assignments are zero are ignored when PREINPUT creates its output file. If a genotype contains one zero allele and one nonzero allele, the user is given the option of ignoring the individual or defining a second nonzero allele.

Figure 5. Output of FAMINFO. (a), Example of a six-marker, 12-family table. (b), Explanatory key for the meanings of codes in the character field of table entries:

n  neither parent has information
m  mother only has information but is homozygous
f  father only has information but is homozygous
b  both have information but are homozygous
x  mother only has information and is heterozygous
y  father only has information and is heterozygous
M  both have information but only mother is heterozygous
F  both have information but only father is heterozygous
B  both have information and are heterozygous for different genotypes
S  both have information and are heterozygous for the same genotype

INHERIT

CRI-MAP tests individuals in its input .gen files, for noninheritances or for genotype assignments that do not reflect Mendelian inheritance. Although automatic detection of these inconsistencies is an important feature of CRI-MAP, problems arise when X-chromosome data are used. Males have only one copy of X and hence have only one allele at each locus in the nonpseudoautosomal region. The .gen file structure used by CRI-MAP and CEPH data-submission


standards, however, require two alleles for each individual. CEPH, moreover, does not accept genotypes that have a single zero allele. Therefore, for males, the allele on X is typically duplicated to complete the required two-allele genotype assignment. Use of this convention causes CRI-MAP to detect many noninheritances in X-chromosome markers. Because many individuals are falsely identified as having invalid inheritance, it is impossible to identify the actual noninheritances.

INHERIT eliminates this problem by implementing a corrective strategy suggested in the CRI-MAP program documentation. First, the program identifies markers in a .gen file that are likely to be in the nonpseudoautosomal region, i.e., those for which all males are homozygotes. The genotype assignments are then altered to make the males heterozygous, with the first allele unchanged and the second a dummy value of -1. The value -1 was chosen because it ensures that the new allele assignment is distinguishable from all actual genotypes, so the original data set may be recreated if necessary. Figure 6 shows an example of a perceived noninheritance and its correction by the program INHERIT.

After processing X-chromosome data files with INHERIT, CRI-MAP flags only real noninheritances, making data checking considerably easier. The program has the additional feature of counting the number of heterozygous males associated with all markers in the input file. A very low count may indicate a nonpseudoautosomal marker whose heterozygous males reflect genotyping errors.

TWOTABLE

The two-point option of the CRI-MAP package calculates two-point LOD scores for pairs of markers in its input .gen file. The user can specify one-group input; i.e., a single group of markers is specified, and two-point calculations are performed for every marker against all other markers in the group. Alternatively, two-group input comprises two disjoint subsets of the markers in the .gen file.
For two-group input, calculations are performed only for pairs of markers that are not members of the same group. The results of these calculations are reported by listing the two markers involved in each calculation, the maximum LOD score obtained, and the recombination fraction at which this maximum occurred. Lower LOD scores, calculated at other recombination fractions, are also given.

Figure 6. Perceived noninheritances in males in the nonpseudoautosomal region of the X chromosome. (a), Actual genotype, which includes only one allele for males. (b), Each male allele duplicated to complete the two-allele genotype for data file storage. CRI-MAP identifies this as a noninheritance. (c), Conversion of one allele to a dummy -1 for all males, which eliminates the perception of noninheritance.

TWOTABLE reads a CRI-MAP two-point output file and produces with it an alternative representation of the data. The file TWOTABLE creates is a table with marker numbers along each axis and with the results of the two-point calculations in the table cells. There are two forms of the table, one for one-group input and the other for two-group input. If TWOTABLE is given a one-group CRI-MAP file, all markers are shown on both axes. The portion of the table above the diagonal holds the maximum LOD scores for marker pairs, and the portion below the diagonal displays the recombination fractions at which these LOD scores occurred. Two-group input files result in a somewhat

different table: the first group's markers lie along the horizontal edge, and the second group's markers are listed vertically. In this case, each cell contains both the LOD score and the associated recombination fraction for the relevant marker pair, separated by a slash. Examples of these two output formats are shown in figure 7.

TWOTABLE makes the most relevant data more accessible by eliminating the less important nonmaximum LOD scores. Because the two-point calculation results are displayed in a table, comparing any two markers is easier than with CRI-MAP's two-point output. This display format, however, becomes less useful as the number of markers increases: TWOTABLE splits the table into appropriate pieces when it is too large to print on a single page, but the amount of data in large tables is difficult to digest. Moreover, TWOTABLE is limited to certain classes of two-point output. For the program to work correctly, CRI-MAP must specify sex-equal two-point calculations and must run its two-point option on all markers in its input file.

Figure 7. TWOTABLE output for two-point calculations performed on six markers. (a), Table created from input file that had single group of markers. (b), Input file that had two disjoint groups of markers.

Numbers in the tables below correspond to these probe-enzyme systems:
0 L436-B    1 TDp446-M    2 TDp445-M    3 TDp391-M    4 HTY2070-C14MSP    5 HTY2070-C24ECORI

(a) Bottom diagonal is recombination fraction; top is LOD score

      0      1      2      3      4      5
0     -      9.33   5.42   7.22   0.00   0.35
1     0.00   -      3.61   1.20   0.50   0.35
2     0.00   0.00   -      2.71   0.00   0.00
3     0.00   0.00   0.00   -      0.00   0.00
4     0.50   0.36   0.50   0.50   -      28.90
5     0.33   0.38   0.50   0.10   0.00   -

(b) Recombination fraction/LOD score

      0           1           2
3     0.00/7.22   0.00/1.20   0.00/2.71
4     0.50/0.00   0.36/0.50   0.50/0.00
5     0.33/0.35   0.38/0.35   0.50/0.00

UNMERGE

The CRI-MAP software provides a merge option, by which two .gen files containing different markers and/or families are combined. This function is most useful when a map is to be built with markers whose data are contained in multiple files. We have written UNMERGE, a program that complements CRI-MAP by performing the opposite function of merge: a subset of the probe-enzyme systems in a .gen file is extracted and used to create a new file. The user may also specify a subset of the original families for inclusion in the new file. A second version of the program, UNMERGE2, creates an additional new .gen file that UNMERGE does not. The second file contains the markers that are in the original file but that are not in the user-specified UNMERGE list. However, UNMERGE2 does not provide the option of deleting families. UNMERGE is useful in defining the data-set input to PIC/HET, to FAMINFO, or for genetic linkage calculations.

GENLINK

Using both CRI-MAP and LINKAGE to build multilocus linkage maps is sometimes advantageous. Although the information contained in .gen and .lnk files is roughly equivalent, manual modification of one file to conform to the structure of the other is slow and error-prone. GENLINK automates this process. If the data set contains disease loci, the user may classify these in the new file, by either affection status (unaffected, affected, or unknown) or allele number. All disease loci are assumed to be autosomal dominant. The program is also limited in its ability to create files with liability classes. A constant-liability class of the user's choice can be associated with each disease locus in the .lnk file, but such a specification corresponds to the trivial case of equal disease penetrance for all individuals. The feature is useful, however, to the extent that the most common liability class of a disease can be specified when GENLINK is run, after which appropriate modifications can be made to the .lnk file by using a text editor.

LINKGEN

LINKGEN performs the inverse task of GENLINK by converting .lnk files to .gen form. Because .lnk files do not include the names of genetic markers, an additional input file that contains these names is required. LINKGEN can process input files containing an arbitrary number of disease loci; however, it cannot process disease loci that have been classified using either binary factor notation or quantitative factors. If the input file has disease loci that are classified by an affection status, each individual's status is converted to a two-allele genotype, which is determined assuming an autosomal dominant mode of inheritance. The input file may also contain liability classes associated with disease loci. The program discards this information, as .gen files do not include liability classes.

Discussion

Our programs have been very useful in genetic linkage map construction, by helping to manage and understand data. While not all the programs overtly complement CRI-MAP or LINKAGE, they have been most valuable when used in conjunction with map-building software. Adding the eight programs to the user's set of computing tools has significantly reduced the time required to construct maps.

The primary limitation to the software is the required input format. Without a program similar to Six Ponds Software's Pedigree Stacks HyperCard application, creating .gen or .lnk files on which to run our software may be laborious. Some programs also restrict the classes of input that are valid. TWOTABLE, for example, accepts CRI-MAP two-point output files only if all markers are used in the two-point analysis. GENLINK and LINKGEN cannot process files with disease loci that are classified using binary factor notation or quantitative factors. Future versions of some programs will focus on increasing their power by broadening their acceptable classes of input, including generalizing the allowed pedigree structures to non-CEPH format.

The power of our applications can be increased by adding more options. Among the additions being considered is permitting the user to choose individuals who are to be used for calculations in PIC/HET. The current version uses biologically unrelated parents, but basing the calculations on grandparents may be desired occasionally. PIC/HET will also be revised to handle haplotyped systems of markers. UNMERGE will be enhanced by allowing the user to specify individual genotypes to be eliminated from the created .gen file. Although these features will make the programs more complex, we plan to maintain their ease of use through straightforward and limited user interfaces.

Portability and Availability

Because the code for these programs was written in Kernighan and Ritchie's (1978) C, it should be portable to any system for which a C compiler is available. All user interface is textual, so no special terminals or software are needed to support the display. However, the programs contain several references to the SunOS operating system (e.g., for opening a file) that are specific to computers that run UNIX. Minor modifications can be made to create synonymous commands in other environments. The programs do not inherently require substantial memory to run, but using large input files significantly increases the memory needed. The source code, documentation, and executable files for these programs, including the Pedigree Stacks application, are freely available through Dr. Helen Donis-Keller.

Acknowledgments

We wish to thank Dr. Philip Green, Todd Steinbrueck, and Ralph Normington, Jr., for advice during the course of this project and for helpful comments on the manuscript. We also thank Dr. Green for allowing us to use the part of the code for CRI-MAP which reads .gen files. This work was supported by National Institutes of Health grants HG00304 and HG00201 (to H.D.-K.).

References

Botstein D, White R, Skolnick M, Davis RW (1980) Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32:314-331

Donis-Keller H, Green P, Helms C, Cartinhour S, Weiffenbach B, Stephens K, Keith T, et al (1987) A genetic linkage map of the human genome. Cell 51:319-337

Green P, Falls K, Crooks S (1989) Documentation for CRI-MAP, version 2.4. Available from P Green

Kernighan BW, Ritchie DM (1978) The C programming language. Prentice-Hall, Englewood Cliffs, NJ

Lander ES, Green P (1987) Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA 84:2363-2367

Lander ES, Green P, Abrahamson J, Barlow A, Daly MA, Lincoln SE, Newburg L (1987) MAPMAKER: an interactive computer package for constructing primary genetic linkage maps of experimental and natural populations. Genomics 1:174-181

Lathrop GM, Lalouel JM, Julier C, Ott J (1984) Strategies for multilocus linkage analysis in humans. Proc Natl Acad Sci USA 81:3443-3446
