Immunogenetics DOI 10.1007/s00251-013-0754-1

ORIGINAL PAPER

Genomic characteristics of the T cell receptor (TRB) locus in the rabbit (Oryctolagus cuniculus) revealed by comparative and phylogenetic analyses

Rachele Antonacci & Francesco Giannico & Salvatrice Ciccarese & Serafina Massari

Received: 25 October 2013 / Accepted: 20 December 2013
© Springer-Verlag Berlin Heidelberg 2014

Abstract The present study identifies the genomic structure and the gene content of the T cell receptor beta (TRB) locus in the Oryctolagus cuniculus whole genome assembly. The rabbit locus spans less than 600 kb, and its general genomic organization is highly conserved with respect to other mammalian species. A pool of 74 TRB variable (TRBV) genes distributed in 24 subgroups is located upstream of two tandemly aligned D-J-C gene clusters, each composed of one TRBD gene, six TRBJ genes, and one TRBC gene, followed by a single TRBV gene with an inverted transcriptional orientation. All TRB genes (functional, ORF, pseudogenes) of this paper have been approved by the IMGT/WHO-IUIS nomenclature committee. Additionally, five potentially functional protease serine (PRSS) trypsinogen or trypsinogen-like genes were identified: two PRSS-like genes in tandem, followed by two PRSS genes with unique traits, lie downstream of the TRBV1 gene, and one PRSS gene is located about 400 kb away, downstream of the TRBV genes. Comparative and phylogenetic analyses revealed that multiple duplication events within a few subgroups have generated the germline repertoire of the rabbit TRBV genes, which is substantially larger than those described in humans, mice, and dogs, suggesting that strong evolutionary pressure has selected for the development of a species-specific TRBV repertoire. Hence, the genomic organization of the TRB locus in these genomes appears to be the result of a balance between the maintenance of a core number of genes essential for immunological performance and the requirement for newly arisen genes.

Keywords T cell receptor · TRB locus · Trypsinogen genes · Rabbit genome · Comparative genomics · IMGT

Electronic supplementary material The online version of this article (doi:10.1007/s00251-013-0754-1) contains supplementary material, which is available to authorized users.

R. Antonacci · S. Ciccarese
Dipartimento di Biologia, Università degli Studi di Bari “Aldo Moro”, Bari, Italy

F. Giannico
Dipartimento dell’Emergenza e dei Trapianti di Organi, Università degli Studi di Bari “Aldo Moro”, Bari, Italy

S. Massari (*)
Dipartimento di Scienze e Tecnologie Biologiche ed Ambientali, Università del Salento, Via Monteroni 165, Centro Ecotekne, 73100 Lecce, Italy

Introduction

The various roles of T lymphocytes in adaptive immune responses to infection, which include the provision of helper functions to other immune cells and cytolytic control of infected cells, require that T cell populations recognize a large variety of foreign peptides bound to major histocompatibility (MH) proteins. This recognition depends on the presence, on the T cell membrane, of a heterodimeric receptor (TR) consisting of an alpha and a beta chain. Each chain contains a variable and a constant domain (Lefranc and Lefranc 2001). To generate a large repertoire of TR capable of recognizing the diverse peptide–MH complexes, the variable (V) and joining (J) genes of the TR alpha (TRA) locus and the V, diversity (D), and J genes of the TR beta (TRB) locus undergo somatic rearrangements during the development of T lymphocytes in the thymus. The resulting rearranged TRAV-TRAJ and TRBV-TRBD-TRBJ regions encode the V-ALPHA and V-BETA domains of the TR alpha and beta chains (Lefranc and Lefranc 2001; Jung and Alt 2004). After transcription, the V-(D)-J sequence is spliced to the constant (C) gene. These rearrangements are responsible for the generation of the combinatorial diversity of the TR. The TRA and TRB loci are localized on two distinct chromosomes.

The human and mouse TRB loci have long been characterized (Rowen et al. 1996; Glusman et al. 2001; ImMunoGeneTics (IMGT) Repertoire, http://www.imgt.org). In both species, the TRB locus spans about 650 kb and consists of a pool of TRB variable (TRBV) genes positioned at the 5′ end of two D-J-C clusters in tandem, each composed of a single TRBD gene, 6–7 TRBJ genes, and one TRBC gene, followed by a single TRBV gene with an inverted transcriptional orientation. In humans, 67 TRBV genes, distributed in 32 subgroups, have been identified, whereas mice have approximately half as many TRBV genes (35), which are nevertheless distributed in 31 subgroups. More recently, the genomic organization of the TRB locus has also been updated in the domestic dog (Mineccia et al. 2012), a mammalian species belonging to the order Carnivora. In dogs, the TRB locus spans about 350 kb, making it smaller than its human and mouse counterparts. There are 37 canine TRBV genes, assigned to 25 subgroups. As in human and mouse, the dog TRBD, TRBJ, and TRBC genes are organized in two D-J-C clusters, each with a single TRBD gene, six TRBJ genes, and one TRBC gene. As in the other species, a TRBV gene in reverse orientation relative to the other genes has been identified at the 3′ end of the locus. A more complex organization characterizes the TRB locus in Artiodactyla, specifically in cattle (Connelley et al. 2009). The bovine locus, extending beyond 700 kb, contains 134 TRBV genes assigned to 24 subgroups, positioned at the 5′ end of three D-J-C clusters, each comprising one TRBD gene, 5–7 TRBJ genes, and a single TRBC gene. A TRBV gene with an inverted transcriptional orientation is located at the 3′ end.
Although these four species belong to different mammalian orders, one striking feature of their TRB loci is the conserved synteny. This includes the identity of the immediate neighboring genes, namely the monooxygenase, dopamine-beta-hydroxylase-like 2 (MOXD2) and the ephrin type-B receptor 6 (EPHB6) genes, which flank, respectively, the 5′ and 3′ end of each TRB locus, and the presence of a gene family that lies within each TRB locus, intermingled among the TRBV genes. This gene family consists of the protease serine (PRSS) trypsinogen genes, a group of genes encoding the inactive precursors of trypsin, a class of serine proteases that digest proteins by cleaving at lysine or arginine. Trypsinogens are produced by the pancreas and secreted into the duodenum, where they are activated by enteropeptidases to form trypsin, which, in turn, activates itself and other digestive enzymes (Kitamoto et al. 1994). So far, the mouse TRB locus is the genomic region with the largest number of PRSS genes, with 20 trypsinogen genes identified (Glusman et al. 2001). Seven genes are located towards the 5′ end of the locus, between the TRBV1 and TRBV2 genes. The other 13 genes are arranged at the 3′ end of the TRBV array, 5′ to the first D-J-C cluster. Eleven of the 20 genes are potentially functional, while 9 are pseudogenes. In the other species, in which the number is much lower, the PRSS genes are likewise arranged in two distinct genomic regions. In the human TRB locus, there are eight PRSS genes (Rowen et al. 1996). One potentially functional gene and two pseudogenes are positioned upstream of TRBV1 gene, while two functional genes and three pseudogenes (one of them transcribed) are located 5′ to the first D-J-C cluster. In dogs, only four PRSS genes, all predicted to be functional, were identified. Three of them are downstream of the TRBV1 gene; only one lies 5′ to the first D-J-C cluster (Mineccia et al. 2012). Similarly, five functional PRSS genes were identified in the bovine genome: four are located downstream of the TRBV1 gene, and the last lies 5′ to the first D-J-C cluster (Connelley et al. 2009).

In 2009, the Broad Institute (www.broadinstitute.org) released the second assembly of the European rabbit (Oryctolagus cuniculus) genome sequence. We used this genome assembly to infer the structure and the gene content of the TRB locus in a mammalian species belonging to the order Lagomorpha. All TRB genes (functional, ORF, pseudogenes) of this paper have been approved by the IMGT/WHO-IUIS nomenclature committee. The principal aim of this paper is to present comparative and phylogenetic data to gain further insight into the genomic evolution of the TRB locus in mammals.

Materials and methods

Sequence analysis

To determine the location of the TRB locus, the rabbit OryCun2.0 whole genome shotgun sequence was searched using the BLAST algorithm. A sequence of 597,165 bp (gaps included) was retrieved directly from the reference sequence NW_003159384.1 (O. cuniculus OryCun2.0 chrUn0060 genomic scaffold) available at NCBI, from positions 1130000 to 1727165. The analyzed region comprises the MOXD2 and EPHB6 genes, already annotated within the scaffold and flanking, respectively, the 5′ and 3′ ends of the TRB locus. The TRB genes were identified using the rabbit sequences available in the GEDI (for GenBank/ENA/DDBJ/IMGT/LIGM-DB) databases: a cDNA collection by Isono et al. (D17416-426; Isono et al. 1994) and partial genomic sequences of the region (M14576, M1457677, M26312, and S60737). We also used the human, mouse, and dog genomic sequences previously described (Mineccia et al. 2012).
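The scaffold coordinates quoted above can be checked with simple arithmetic. The following is a minimal sketch, assuming only the two positions given in the text; the helper name `span_length` and its `inclusive` flag are illustrative and not part of any tool used in the study. It shows that the reported 597,165 bp equals the plain difference between the end and start positions; the text does not state whether both endpoints were counted.

```python
def span_length(start: int, end: int, inclusive: bool = False) -> int:
    """Length in bp of a region delimited by two scaffold positions.

    With inclusive=False the length is end - start (half-open convention);
    with inclusive=True both endpoints are counted (end - start + 1).
    """
    if end < start:
        raise ValueError("end must not precede start")
    return end - start + (1 if inclusive else 0)


# Positions on NW_003159384.1 as quoted in the text.
START, END = 1_130_000, 1_727_165

print(span_length(START, END))                  # 597165, the length reported
print(span_length(START, END, inclusive=True))  # 597166 if both ends counted
```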


Locations of the TRB genes are provided in Supplementary Table 1, and their sequences are in Supplementary Fig. 1. Sequence comparison with the human, mouse, and dog counterparts also allowed the identification and characterization of the rabbit protease genes. The locations of the PRSS genes are provided in Supplementary Table 2.

Genome analyses

Computational analysis of the rabbit TRB locus was conducted with the following programs: RepeatMasker for the identification of genome-wide repeats and low-complexity regions (Smit, A.F.A., Hubley R., Green P., RepeatMasker at http://www.repeatmasker.org) and PipMaker (Schwartz et al. 2000; http://www.pipmaker.bx.psu.edu/pipmaker/) for the alignment of the rabbit sequence with itself and with the human, mouse, and dog counterparts, as described by Mineccia et al. (2012). For the analysis, we used the entire rabbit TRB genomic region (from the MOXD2 to the EPHB6 gene) plus the D-J cluster 2 derived from the GEDI ID S60737. In detail, we combined 1,565 bp derived from S60737 with the retrieved sequence of the NW_003159384.1 scaffold (positions 1644263 to 1645297). The final length of the analyzed sequence was 589,517 bp (gaps excluded).

Nomenclature

Rabbit TRB genes were named following the IMGT nomenclature established for human and mouse (IMGT®, http://www.imgt.org): TRBV genes were assigned to 24 different subgroups on the basis of the percentage of nucleotide identity. TRBC1 and TRBC2 were named according to their location from 5′ to 3′ in the locus; TRB D-J-C clusters are designated according to the constant genes. TRBD and TRBJ genes are named according to the cluster to which they belong. The rabbit PRSS genes were classified on the basis of homology with the corresponding human genes.
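Subgroup assignment by nucleotide identity can be illustrated with a short sketch. It assumes pre-aligned sequences of equal length and groups genes by single linkage whenever a pair exceeds a threshold, here the 75 % cut-off quoted in the Results. The single-linkage rule, the function names, and the toy sequences are illustrative assumptions; the paper does not describe the grouping procedure beyond the identity criterion.

```python
from __future__ import annotations

from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def group_by_identity(seqs: dict[str, str], threshold: float = 75.0) -> list[set[str]]:
    """Single-linkage grouping: genes sharing > threshold % identity are merged."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if percent_identity(seqs[a], seqs[b]) > threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Hypothetical toy sequences, not real TRBV genes.
toy = {
    "geneA": "ACGTACGTACGTACGTACGT",
    "geneB": "ACGTACGTACGTACGTACGA",  # 95 % identical to geneA
    "geneC": "TTTTGGGGCCCCAAAATTTT",  # 25 % identical to geneA
}
print(group_by_identity(toy))
```

With single linkage, a gene joins a subgroup if it exceeds the threshold with any member; a stricter rule (for example, requiring the threshold against every member) would be an equally reasonable reading of the criterion.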
The human PRSS genes were already described in GEDI ID L36092 at the following positions: 44765–50247 (PRSS58, alias TRYX3), 58379–63843 (TRY2P, pseudogene), 79799–84159 (PRSS3P3, alias TRY3, pseudogene), 583153–586688 (PRSS1, alias TRY1), 594083–597648 (PRSS3P1, alias TRY5, pseudogene), 604575–608219 (PRSS3P2, alias TRY6, pseudogene), and 625027–628590 (PRSS2, alias TRY2).

Phylogenetic analysis

The TRBV genes used for the phylogenetic analysis were retrieved from the following genomic sequences deposited at NCBI Gene: NG_001333 (human TRB locus contig); NW_003726086.1 (dog TRB locus contig); NW_003159384.1 (this work) (rabbit TRB locus). Multiple alignments of the gene sequences under analysis were carried out with the MUSCLE program (Edgar 2004). Phylogenetic analyses were conducted in MEGA5 (Tamura et al. 2011). We used the neighbor-joining (NJ) method to reconstruct the phylogenetic tree. The evolutionary distances were computed using the p-distance method (Nei and Kumar 2000) and are expressed as the number of base differences per site. All positions containing gaps and missing data were eliminated. Other methods were also applied to draw the phylogenetic tree in MEGA5: maximum parsimony, and minimum evolution based on distances with Poisson correction and with the gamma model.
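The distance measure described above is simple enough to sketch directly. The example below computes pairwise p-distances (proportion of differing sites) after removing every alignment column that contains a gap or a non-ACGT character in any sequence. It is an illustration only: the function name, the treatment of any non-ACGT symbol as missing data, and the toy alignment are assumptions, and the tree reconstruction performed in MEGA5 is not reproduced here.

```python
from __future__ import annotations

from itertools import combinations


def p_distance_matrix(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise p-distances with complete deletion of gapped/ambiguous columns."""
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("all sequences must have the same aligned length")

    # Complete deletion: keep a column only if every sequence has A, C, G or T.
    keep = [
        i for i in range(length)
        if all(aligned[n][i].upper() in "ACGT" for n in names)
    ]
    if not keep:
        raise ValueError("no comparable columns remain after deletion")

    distances = {}
    for a, b in combinations(names, 2):
        diffs = sum(aligned[a][i].upper() != aligned[b][i].upper() for i in keep)
        distances[(a, b)] = diffs / len(keep)
    return distances


# Hypothetical toy alignment: column 7 has a gap, column 9 has an N.
toy_alignment = {
    "s1": "ACGTAC-TAG",
    "s2": "ACGTTCGTAG",
    "s3": "ACCTACGTNG",
}
for pair, d in p_distance_matrix(toy_alignment).items():
    print(pair, d)
```

Eight of the ten columns survive deletion in this toy case, so each distance is a count of differences divided by eight.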

Results

Analysis of the rabbit TRB locus retrieved from the genome assembly

We employed the whole genome assembly (OryCun2.0) of the European rabbit (O. cuniculus) released by the Broad Institute (www.broadinstitute.org) to NCBI (BioProject ID: 42933) to identify the TRB locus in this species. We retrieved directly from the assembly the sequence comprising the MOXD2 and the EPHB6 genes, which flank, respectively, the 5′ and 3′ ends of all mammalian TRB loci studied so far (Rowen et al. 1996; Glusman et al. 2001; Connelley et al. 2009; Mineccia et al. 2012). We recovered a sequence of a little more than 597 kb (gaps included) from the rabbit unplaced genomic scaffold, OryCun2.0 chrUn0060 (NCBI ID: NW_003159384.1; pos. 1130000–1727165). First, we identified and annotated all TRB genes using the rabbit sequences already available in the GEDI databases (see “Sequence analysis” section in Materials and methods). We also used a homology-based method, comparing the retrieved sequence with the corresponding regions of other mammals (see “Genome analyses” section in Materials and methods). In particular, the masked rabbit sequence was aligned against the human, mouse, and dog counterparts using the PipMaker program (Schwartz et al. 2000), and the alignment was expressed as a percentage identity plot (Supplementary Fig. 2). The beginning and end of each coding exon were then identified accurately by the presence of splice sites or of the flanking recombination signal (RS) sequences of the V, D, and J genes.

The analysis of the genomic sequence revealed that, as in most mammalian species, the general structural organization of the rabbit TRB locus has the common feature of a library of TRBV genes positioned 5′ of two D-J-C clusters, followed by a single TRBV gene with an inverted transcriptional orientation located at the 3′ end. It is noteworthy that, in the assembly, the second D-J-C cluster is composed of the TRBD2 gene tightly associated with a single TRBJ2 gene (Supplementary Fig. 2). A recombination event or an assembly error could explain the absence of the other TRBJ2 genes from the current OryCun2.0 assembly. In fact, Harindranath et al. (1991) had previously reported the complete genomic sequence of the rabbit D-J cluster 2, consisting of six functional TRBJ segments with a very strong similarity to both the human and mouse counterparts. Hence, as in humans and mice, the rabbit TRB locus consists of two D-J-C clusters aligned in tandem, each containing one TRBD gene, six TRBJ genes, and one TRBC gene. Interestingly, as in dog and mouse (Supplementary Fig. 2), and probably cattle (Connelley et al. 2009), the first TRBV gene at the 5′ end of the rabbit locus is functional, whereas it is a pseudogene in the human locus (IMGT Repertoire, http://www.imgt.org).

The comparison of the entire TRB sequence with those of the other species also allowed us to identify and annotate in the rabbit genome five genes unrelated to TRB, belonging to the PRSS group (Supplementary Fig. 2). In particular, four PRSS genes downstream of the TRBV1 gene and one upstream of the D-J-C clusters were identified. All PRSS genes appear functional, with correct acceptor and donor splice sites and no frameshifts or stop codons in their coding regions.

Classification of the TRB genes and phylogenetic analysis of the TRBV genes

Considering the percentage of nucleotide identity with respect to the corresponding human genes and based on the genomic position within the locus, each rabbit TRB gene was classified and its nomenclature established according to IMGT®, the international ImMunoGeneTics information system® (IMGT®, http://www.imgt.org, Lefranc et al. 2009), and IMGT/GENE-DB (Giudicelli et al. 2005).
The functionality of the V, D, J, and C genes was predicted through the manual alignment of sequences, adopting the following parameters: (a) identification of the leader sequence at the 5′ end of the TRBV genes; (b) determination of proper RS sequences located at the 3′ end of the TRBV (V-RS), the 5′ and 3′ ends of the TRBD (5′D-RS and 3′D-RS), and the 5′ end of the TRBJ (J-RS) genes, respectively; (c) determination of conserved acceptor and donor splice sites; (d) estimation of the expected length of the coding regions; (e) absence of frameshifts and stop codons in the coding regions of the genes.

In the retrieved sequence, we have annotated a pool of 74 TRBV genes grouped into 24 distinct subgroups according to the criterion that sequences with a nucleotide identity of more than 75 % belong to the same subgroup (the upper part of Supplementary Table 1 and Supplementary Fig. 1). Six subgroups have multiple members, with a massive expansion of TRBV5 (18 genes), TRBV6 (14 genes), and TRBV7 (14 genes), while the TRBV21 subgroup consists of seven members. TRBV9 is the only variable gene classified on the basis of its genomic position within the rabbit locus, although it displays a more significant similarity (78 %) to the human TRBV13 gene. Sixteen out of 74 TRBV genes (about 22 %) have been predicted to be pseudogenes (Supplementary Table 3). The deduced amino acid (AA) sequences of the rabbit TRBV genes were manually aligned according to the IMGT unique numbering for the V-REGION (Lefranc et al. 2003) to maximize homology (Fig. 1). Only potentially functional genes and in-frame pseudogenes are shown. All sequences exhibit the typical framework regions and complementarity determining regions and the four conserved amino acids: cysteine 23 (1st-CYS) in FR1-IMGT, tryptophan 41 (CONSERVED-TRP) in FR2-IMGT, hydrophobic (here L, M, and V) 89, and cysteine 104 (2nd-CYS) in FR3-IMGT (Lefranc et al. 2003). Conversely, the CDR-IMGT vary in amino acid composition and length (Fig. 1).

One TRBD, six TRBJ, and one TRBC gene form each D-J-C cluster. According to their similarity with the other species, they were annotated and classified as TRBD1 and TRBD2; TRBJ1 and TRBJ2, followed by a hyphen and a number corresponding to their position in the cluster; and TRBC1 and TRBC2 (the lower part of Supplementary Table 1). The TRBD genes consist of a 12-bp (TRBD1) and a 15-bp (TRBD2) G-rich stretch that can be productively read in all three coding phases and encodes 1–4 glycine residues, depending on the phase. The 5′D-RS and 3′D-RS that flank the D-REGION are well conserved (Fig. 2a). The TRBJ genes are typically 43–52 bp long and were also identified by the canonical FGXG amino acid motif, whose presence characterizes functional J genes. They are flanked by the J-RS at the 5′ end and a donor splice site at the 3′ end (Fig. 2b).
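The FGXG check on the TRBJ genes can be sketched in a few lines. The example translates a J-REGION in each of the three forward reading frames with the standard genetic code and reports the frames whose translation contains the FGXG motif. The three nucleotide sequences are copied from the Fig. 2b panel; labelling them TRBJ1-2, TRBJ1-3, and TRBJ1-4 follows the row labels 1.2, 1.3, and 1.4 of that panel and is an assumption, as are the function names.

```python
import re

# Standard genetic code, codons enumerated with base order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

FGXG = re.compile(r"FG.G")


def translate(seq: str, offset: int = 0) -> str:
    """Translate complete codons of seq starting at the given offset (0, 1 or 2)."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def fgxg_frames(seq: str) -> list:
    """Return (offset, translation) for each frame whose translation has FGXG."""
    hits = []
    for offset in range(3):
        protein = translate(seq, offset)
        if FGXG.search(protein):
            hits.append((offset, protein))
    return hits


# J-REGION sequences as printed in Fig. 2b (gene labels are assumed).
J_REGIONS = {
    "TRBJ1-2": "CCAACTATGACTACACCTTCGGCTCAGGGACGAGGCTGACGGTTGTG",
    "TRBJ1-3": "TTCTGGAAACACCGTCTATTTCGGGGAGGGAAGCCAGCTCACTGTTGTAG",
    "TRBJ1-4": "CTTCTAATGAAAAACTGTTCTTTGGCAACGGAACGAAGCTGTCTGTCTTGG",
}

for gene, seq in J_REGIONS.items():
    print(gene, fgxg_frames(seq))
```

Only complete codons are translated, so a trailing nucleotide that would pair with the first bases of the C gene after splicing is ignored in this sketch.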
The coding regions were all predicted to be functional except for the TRBJ1-2 gene, because one “g” is missing at the end of its coding region (Lefranc 2011). However, the functionality of some TRBJ genes should be validated by expression data, since the sequences of their J-RS nonamers do not appear to agree with the recombination signal sequence logos (http://www.imgt.org/IMGTrepertoire/LocusGenes/) (Fig. 2b, Supplementary Table 1 and Supplementary Fig. 1).

The exon-intron organization of the two TRBC genes was also determined (Supplementary Table 1 and Supplementary Fig. 1). As in all known mammalian species, they are composed of four exons and three introns. The sizes of the exons and introns are the same for both genes, except for the third intron, which is 14 bp longer in TRBC1. The part of the 3′UTR downstream of EX4 and delimited by a donor splice site measures 201 bp for TRBC1 and 191 bp for TRBC2. The nucleotide identity of the coding regions is 97.94 %, while the 3′UTR differs extensively between the two genes. Table 1 summarizes the percentage of nucleotide identity for each exon and intron, calculated in pairwise comparisons between the two TRBC genes in rabbits as well as in humans, mice, and dogs. Notably, the level of similarity between the two rabbit genes is higher than that observed in the other mammalian species. The amino acid sequences of the exons were also deduced (Fig. 2c). The rabbit TRBC genes encode similar proteins of 177 AA. In particular, the nucleotide differences result in six AA changes, three located in the extracellular C-beta domain and three in the cytoplasmic region. The first exon, encoding the extracellular region, was described according to the IMGT unique numbering for the C-DOMAIN (Lefranc et al. 2005) and, as in human and dog, comprises 129 AA. The second exon, of 21 AA, encodes the connecting region. The transmembrane region (22 AA) is encoded by the 3′ part of exon 3 and the first codon of exon 4, and the cytoplasmic region (5 AA) by exon 4.

Fig. 1 The IMGT protein display of the rabbit TRBV genes. Only functional genes and in-frame pseudogenes are shown. The description of the strands and loops is according to the IMGT unique numbering for V-REGION (Lefranc et al. 2003). The amino acid length of CDR-IMGT is also indicated in square brackets
[Alignment panel. Columns: Gene; Functionality; LEADER; FR1-IMGT (1-26: A strand 1-15, B strand 16-26); CDR1-IMGT (27-38, BC loop); FR2-IMGT (39-55: C strand 39-46, C' strand 47-55); CDR2-IMGT (56-65, C'C" loop); FR3-IMGT (66-104: C" strand 66-74, D strand 75-84, E strand 85-96, F strand 97-104); CDR3-IMGT (105-115, FG loop). Genes displayed: TRBV1, TRBV2, TRBV3, TRBV4, TRBV5-1, TRBV5-2, TRBV5-4, TRBV5-5, TRBV5-7, TRBV5-8, TRBV5-9, TRBV5-10, TRBV5-11, TRBV5-12, TRBV5-14, TRBV5-15, TRBV5-16, TRBV5-18, TRBV6-2, TRBV6-3, TRBV6-4, TRBV6-5, TRBV6-6, TRBV6-7, TRBV6-8, TRBV6-9, TRBV6-10, TRBV6-11, TRBV6-12, TRBV6-13, TRBV6-14, TRBV7-1, TRBV7-2, TRBV7-3, TRBV7-5, TRBV7-7, TRBV7-8, TRBV7-9, TRBV7-10, TRBV7-11, TRBV7-13, TRBV7-14, TRBV9, TRBV10, TRBV12, TRBV18, TRBV19, TRBV20-1, TRBV20-2, TRBV21-1, TRBV21-2, TRBV21-3, TRBV21-4, TRBV21-5, TRBV21-6, TRBV21-7, TRBV25, TRBV26, TRBV27, TRBV28, TRBV29, TRBV30]

The evolutionary relationship of the TRBV genes was also investigated by comparing the rabbit genes with all available corresponding human and dog genes, adopting two selection criteria: (1) only potentially functional genes and in-frame pseudogenes (except the human TRBV1) were included; (2) only one gene for each of the subgroups was selected. We chose human and dog as representative species with, respectively, the largest and the smallest number of TRBV subgroups (Supplementary Table 4). Thus, the coding nucleotide sequences of all selected TRBV genes (from FR1-IMGT to FR3-IMGT) from rabbits, humans, and dogs were combined in the same alignment, and an unrooted phylogenetic tree was constructed using the NJ method (Saitou and Nei 1987) (Fig. 3). The tree shows that each of the 24 rabbit genes forms a monophyletic group with a corresponding human and/or dog gene, consistent with the occurrence of distinct subgroups prior to the divergence of the mammalian species under analysis. Notably, the tree recapitulates the accepted phylogeny of these species. In fact, the orthologous rabbit and human genes are more closely related to each other than to the dog counterparts in the majority of the phylogenetic groupings (18 out of 24). This would explain the divergence of the canine TRBV5-2 gene, which clusters with the human TRBV9, and of the TRBV20 gene, which lies at the base of the branch including the TRBV29/20 subgroups. The phylogenetic groupings also confirm that our previous classification of the rabbit TRBV genes is consistent with that of the corresponding orthologs in humans and/or dogs

Fig. 2 Nucleotide and deduced amino acid sequences of the rabbit TRBD (a), TRBJ (b) and TRBC genes (c). The consensus sequences of the heptamer and nonamer (Hesse et al. 1989) are provided at the top of the figure and underlined. The numbering adopted for the gene classification is reported on the left of each gene. In a, the inferred amino acid sequences of the TRBD genes in the three coding frames are reported. In b, the donor splice site for each TRBJ is shown. The canonical FGXG amino acid motifs are underlined. Possible unusual bases within the RS sequences are indicated in italics. In c, IMGT protein display of the TRBC genes. Description of the strands and loops is according to the IMGT unique numbering for C-DOMAIN (Lefranc et al. 2005). The amino acid differences between the two genes are boxed
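The comparison of a recombination signal (RS) with the consensus heptamer and nonamer mentioned in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by the authors: the function names are hypothetical, the consensus is the one printed in Fig. 2 for the 23-spacer RS (heptamer CACAGTG, nonamer ACAAAAACC), and the two example rows are assumed to be the 23-spacer RS flanking the first and second TRBD gene, in the order in which they appear in the figure.

```python
# Consensus elements of the 23-spacer RS as printed in Fig. 2 (Hesse et al. 1989).
HEPTAMER_CONSENSUS = "CACAGTG"
NONAMER_CONSENSUS = "ACAAAAACC"


def mismatches(observed: str, consensus: str) -> list[int]:
    """Return the 1-based positions at which `observed` differs from `consensus`."""
    if len(observed) != len(consensus):
        raise ValueError("observed and consensus must have the same length")
    return [
        pos + 1
        for pos, (obs, ref) in enumerate(zip(observed.upper(), consensus.upper()))
        if obs != ref
    ]


def describe_rs(heptamer: str, spacer: str, nonamer: str) -> dict:
    """Summarise one RS: deviations from the consensus and the spacer length."""
    return {
        "heptamer_mismatches": mismatches(heptamer, HEPTAMER_CONSENSUS),
        "nonamer_mismatches": mismatches(nonamer, NONAMER_CONSENSUS),
        "spacer_length": len(spacer),
    }


# The two 23-spacer RS rows shown in Fig. 2a (assignment to TRBD1/TRBD2
# follows the row order of the figure and is an assumption).
rs_first = describe_rs("cacggtg", "atgcaagtcaacaggccgccttt", "acaaaaacc")
rs_second = describe_rs("cacaatg", "attcagatagaggagatgctttt", "acaaaaagc")

print(rs_first)
print(rs_second)
```

Reporting positions rather than a bare mismatch count keeps the output comparable with the figure, where individual deviating bases are marked.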

except for the TRBV9, which was classified on the basis of its genomic position within the TRB locus even though it is related to the human TRBV13 gene.

Classification of the PRSS genes

In the rabbit TRB locus, five members of the protease serine gene family were recognized. The rabbit PRSS genes were provisionally classified on the basis of their homology with the corresponding human genes and in accordance with their genomic position within the TRB locus. In this context, downstream of the TRBV1 gene, proceeding from 5′ to 3′, the first protease gene was identified as orthologous to the human functional PRSS58 gene (alias TRYX3), with 80.56 % nucleotide similarity. The following protease gene was recognized as TRY2 based on its orthology with the human TRY2P pseudogene. The two successive protease genes and the one positioned upstream of the D-J-C clusters showed a nucleotide similarity ranging from 76.48 to 86.37 % with the human functional PRSS1 (alias TRY1) and PRSS2 (alias TRY2) genes. We tentatively classified them as PRSS1, PRSS2, and PRSS3, respectively, according to their genomic position

Table 1 Percentage of nucleotide identities of each exon/intron between rabbit (Orycun), human (Homsap), mouse (Musmus), and dog (Canlupfam) TRBC genes. The length of each exon and intron is also reported

                   EX 1   Intron 1   EX 2   Intron 2   EX 3   Intron 3   EX 4 (CDS)   EX 4 (3′ UTR)
TRBC1 vs TRBC2 (% identity)
Orycun             98     99         100    99         100    43         80           46
Homsap             100    70         88     75         92     6          93           8
Musmus             99     11         80     23         85     8          80           14
Canlupfam          99     99         100    98         99     8          94           30
Length (nt)
Orycun TRBC1       388    841        18     132        107    313        21           201
Orycun TRBC2       388    841        18     132        107    299        21           191
Homsap TRBC1       387    442        18     153        108    323        18           384
Homsap TRBC2       387    517        18     144        108    292        24           520
Musmus TRBC1       375    552        18     98         108    310        18           575
Musmus TRBC2       375    506        18     145        108    283        18           n. d.
Canlupfam TRBC1    387    697        18     157        107    323        18           224
Canlupfam TRBC2    387    700        18     157        107    295        18           205

within the rabbit TRB locus. The positions of the non-TRB genes within the rabbit genomic scaffold are reported in Supplementary Table 2. The deduced amino acid sequences of the rabbit protease genes were aligned with those of the functional human genes and are shown in Supplementary Fig. 3. The rabbit PRSS1, PRSS2, and PRSS3 genes (Supplementary Fig. 3a) exhibit, as in human, the typical hallmarks of the trypsinogen gene family: the signal peptide; the activation peptide with the crucial DDDDK recognition pattern for enterokinase cleavage, preceding the Lys-Ile scissile bond (Lu et al. 1999); the catalytic triad His/Asp/Ser; and the highly conserved cysteine profile. Notably, the PRSS1 and PRSS2 proteins are both 249 amino acids long, differ from each other by two amino acids and have, in comparison to the other PRSS genes analyzed, an 11-amino acid rather than an 8-amino acid activation peptide, due to an additional Asp and two Ser residues. Finally, the PRSS58 gene in rabbits, as in humans, lacks the features of the trypsinogen gene family, confirming its classification as a PRSS-like gene with an unknown function (Supplementary Fig. 3b). Similarly, the structure of the rabbit TRY2 gene can be defined as PRSS-like. Thus, the genomic organization of the PRSS genes within the rabbit TRB locus includes two in tandem PRSS-like genes followed by two PRSS genes originating from a recent duplication, positioned at the 5′ of the region, and one PRSS gene located about 400 Kb away, a distance that encompasses the array of the TRBV genes.

Architecture of the rabbit TRB genomic region

In order to better define the genomic structure of the rabbit TRB locus, we combined the complete sequence of the rabbit D-J-C cluster 2 (GEDI ID: S60737) with the retrieved genomic assembly (see “Sequence analysis” section in Materials and methods). The entire region (589,517 bp, from the MOXD2 to the EPHB6 gene) was screened with the RepeatMasker program (see “Materials and methods”) to highlight interspersed repeats and the compositional properties (G + C content) (Supplementary Table 5). The density of the total interspersed repeats was 31.12 %, comparable to human (33.83 %) (Supplementary Table S2 in Mineccia et al. 2012). In the rabbit sequence, the repeat elements do not show any identity with those of other species (Supplementary Fig. 2), indicating that the repetitive sequences contribute to the genomic architecture of the TRB locus. However, unlike in human (as well as in mouse and dog), not only LINEs (10.81 %) but especially SINEs (14.18 %) contribute significantly to the architecture of the rabbit region (Supplementary Table 5). The rabbit G + C content of 47.26 % is higher than that of human (42.45 %), mouse (39.78 %), and dog (44.30 %). To delineate the genomic structure, the masked sequence of the rabbit TRB locus was aligned against itself with the PipMaker program (Supplementary Fig. 4). Inspection of the resulting dot-plot matrix allowed us to identify portions of the sequence that align with multiple regions within the sequence itself. The dot-plot matrix confirms the high level of nucleotide identity between TRBV genes, as indicated by dots and diagonal lines in the central part of the dot-plot, with TRBV1 at the 5′ end and TRBV30 at the 3′ end lying outside the identity region. In particular, within the TRBV identity region, we observed parallel lines forming two rectangular shapes that identify two multiple tandem duplications. The first and wider rectangle identifies an internal homology unit in which the TRBV5, TRBV6, and TRBV7 subgroup genes have arisen through a series of

Fig. 3 Evolutionary relationships of the TRBV genes. The analyses were conducted in MEGA5 (Tamura et al. 2011). The evolutionary history was inferred using the Neighbor-Joining method (Saitou and Nei 1987). The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (1,000 replicates) is shown next to the branches (Felsenstein 1985). The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances were computed using the p-distance method (Nei and Kumar 2000) and are in the units of the number of base differences per site. Codon positions included were first + second + third. All positions containing gaps and missing data were eliminated. The TRBV nomenclature is according to IMGT-ONTOLOGY concepts of classification (Lefranc 2011). The IMGT standardized abbreviation for taxon is used: six letters for species (Homsap, Musmus, Orycun, Bostau, Mondom) and nine letters for subspecies (Canlupfam)
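The distance measure named in this caption, the p-distance computed after eliminating every alignment column that contains a gap or missing data, can be illustrated with a short Python sketch. It is not the MEGA5 implementation and it does not build the Neighbor-Joining tree; the function names, the set of characters treated as gaps or missing data, and the three toy sequences are assumptions made only for the example.

```python
from itertools import combinations

# Characters treated as gap or missing data (an assumption for this sketch).
GAP_CHARS = set("-.?N")


def complete_deletion(alignment: dict[str, str]) -> dict[str, str]:
    """Remove every column in which at least one sequence has a gap or missing base."""
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        col for col in range(len(seqs[0]))
        if all(s[col] not in GAP_CHARS for s in seqs)
    ]
    return {name: "".join(s[col] for col in keep) for name, s in zip(names, seqs)}


def p_distance(a: str, b: str) -> float:
    """Proportion of sites at which two equal-length sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(alignment: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise p-distances over the columns that survive complete deletion."""
    cleaned = complete_deletion(alignment)
    return {
        (m, n): p_distance(cleaned[m], cleaned[n])
        for m, n in combinations(cleaned, 2)
    }


# Toy alignment with hypothetical sequences; column 9 contains a gap and is dropped.
toy = {
    "seqA": "ATGCCGTA-A",
    "seqB": "ATGCTGTACA",
    "seqC": "ATGATGTACA",
}
d = distance_matrix(toy)
for pair, value in d.items():
    print(pair, round(value, 4))
```

The same function also gives a percentage of nucleotide identity for a pair of aligned sequences, as 100 × (1 − p).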

tandem duplication events. The homology unit is interrupted by insertions or deletions that occurred after the initial duplication of the whole region, as well as by lines orthogonal to the main diagonal, which indicate sequence similarity between LINE reverted motifs (gray arrows in Supplementary Fig. 4). The other rectangular shape identifies the duplications of the TRBV20 and TRBV21 subgroup genes. Finally, at the 5′ of the region, a line parallel to the main diagonal detects a homology unit of about 18 Kb in which the two PRSS genes have arisen through a tandem duplication event.
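The way a tandem duplication appears as a line parallel to the main diagonal of a self dot-plot can be shown with a small Python sketch. This is a deliberately simplified stand-in based on exact k-mer matches, not the alignment procedure of PipMaker; the function names, the word size, and the toy sequence (a hypothetical 12-nt unit repeated twice between two flanks) are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def self_matches(seq: str, k: int):
    """Yield (i, j), i < j, for every pair of positions sharing an identical k-mer."""
    positions = defaultdict(list)
    for start in range(len(seq) - k + 1):
        positions[seq[start:start + k]].append(start)
    for hits in positions.values():
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                yield hits[a], hits[b]


def diagonal_offsets(seq: str, k: int) -> Counter:
    """Count matches per diagonal (offset j - i) of the self dot-plot.

    A tandem duplication shows up as one heavily populated off-diagonal whose
    offset equals the length of the duplicated unit.
    """
    return Counter(j - i for i, j in self_matches(seq.upper(), k))


# Hypothetical toy sequence: flank + unit + unit + flank.
unit = "ACGTTGCAAGCT"
toy_seq = "GGATCC" + unit + unit + "TTAGGC"

offsets = diagonal_offsets(toy_seq, k=6)
print(offsets.most_common(3))
```

In a real locus the duplicated copies have diverged, so an alignment-based tool is needed; exact k-mers are used here only to keep the example self-contained.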

Discussion

Recent advances in DNA sequencing technology have made a large number of complete genome sequences available for different mammalian species. To give these sequence data meaning, the genomes need to be annotated by characterizing regions of a single genome, based either on certain properties or on content information. They should then be compared by defining relationships between regions in one or more genomes.


In this context, we have identified the genomic organization and the gene content of the TRB locus from the whole genome assembly (OryCun2.0) of the European rabbit (O. cuniculus), which has been released by the Broad Institute. The analysis has shown that the general genomic organization of the rabbit TRB locus, encompassing less than 600 Kb, is highly conserved with respect to all other mammalian species studied so far: it consists of a pool of TRBV genes located upstream of two in tandem D-J-C gene clusters, followed by a TRBV gene with an inverted transcriptional orientation (Fig. 4). Although the genomic organization and the size of the rabbit TRB locus are comparable with those of humans and mice, differences in the gene content can be observed (Supplementary Table 4). In rabbits, the total number of TRBV genes is higher than in the human and mouse loci, whereas the number of TRBV subgroups is lower (IMGT Repertoire, http://www.imgt.org). Therefore, the larger TRBV germline repertoire in rabbits is mainly due to the complexity of the duplication events that have expanded the number of genes within a subgroup, rather than to the emergence of different gene subgroups (Supplementary Fig. 4). The most remarkable duplication event has occurred within the region in which the TRBV5, TRBV6, and TRBV7 subgroup genes are located. The gene members of these expanded subgroups are intercalated with each other in a recurrent pattern of alternating TRBV5, TRBV6, and TRBV7 genes, producing ten complete homology units. It is worth mentioning that in the human TRB locus, too, the major expansion has involved the corresponding TRBV5, TRBV6, and TRBV7 subgroups, incorporated in six homology units (IMGT Repertoire, http://www.imgt.org). Similarly, the genomic analysis of the dog TRB locus has indicated that the most remarkable duplication event has occurred in a region in which three genes, TRBV2, TRBV3, and TRBV4, are located (Mineccia et al. 2012). On the other hand, duplication events involving two-gene blocks have caused the amplification of the mouse TRBV12 and TRBV13 subgroups, with three genes each, as well as a massive expansion of the bovine TRBV6 and TRBV9 subgroups, with 40 and 35 members, respectively (IMGT Repertoire, http://www.imgt.org; Connelley et al. 2009). Altogether, our data indicate that the evolutionary dynamics of the TRBV genes have more often involved internal duplication events at the 5′ part of the genomic TRB region, with two or three subgroup genes duplicated as a block. Rarer duplications have occurred in the 3′ TRB region and, in those instances, single subgroups, such as the TRBV20 and TRBV21 subgroups in cattle and the TRBV21 subgroup in rabbits, are involved. These events have led to differences in the TRBV gene content that are undoubtedly related to the different functions that each TRBV gene plays in the diverse species according to the variety of antigens.

Despite the substantial disproportion in the number of genes, the tight phylogenetic relationships of the rabbit TRBV subgroups with respect to those of the other species (Fig. 3) confirm that duplication of ancestral subgroup genes, followed by diversification, is the major mode of evolution of the TRBV genes in this species. The rabbit germline repertoire is also not functionally reduced, considering that only 22 % of the TRBV genes are classified as pseudogenes. This percentage is comparable with that of humans, but less so with that of other species (Supplementary Table 4).

In contrast to the “divergent evolution” of the TRBV genes imposed by ligands, the TRBC genes in all mammalian species are subject to concerted evolutionary pressures with intra-species homogenization (Rudikoff et al. 1992; Glusman et al. 2001; Antonacci et al. 2008). In rabbits, more than in other species, both the nucleotide and the protein sequences of the two TRBC genes are very similar to each other (Table 1; Fig. 2c). On the other hand, a strong selective pressure preserves the sequence diversity of the 3′ UTR of each TRBC gene in rabbits, just as it does in other mammalian species (Antonacci et al. 2008).

The genome analysis has revealed that repetitive sequences, principally SINEs (14.18 %) and LINEs (10.81 %), are prevalent throughout the TRB locus and make a substantial contribution to its structure (Supplementary Table 5; Supplementary Fig. 2). Unlike in humans (6.62 %), mice (2.17 %), and dogs (7.19 %) (Mineccia et al. 2012), SINEs are the predominant repeat elements within the rabbit TRB locus. It is noteworthy that SINEs are the most abundant elements within the rabbit TRG locus as well (Massari et al. 2012). However, upon checking the sequence, we noticed that LINEs can be correlated with the complexity of the gene duplications during evolution. For instance, the dot-plot matrix reveals the presence of two LINE reverted motifs that may have been the substrate for the expansion of the TRBV5, TRBV6, and TRBV7 gene subgroups (Supplementary Fig. 4).

As in the other mammalian species, the MOXD2 and EPHB6 genes border the rabbit TRB locus at the 5′ and 3′ ends, respectively, while a multigene family, different in expression and function, is arranged within two distinct genomic regions interposed among the TRB genes (Fig. 4). The genomic organization of the rabbit PRSS genes is identical to that of dog and cattle (Connelley et al. 2009; Mineccia et al. 2012), except for one PRSS gene that is missing in dogs. There is also a much smaller interval (about 200 Kb) between the two groups of PRSS genes in the dog genome. With regard to the human TRB locus, a single apparently functional PRSS-like gene is about 400 Kb away from the two expressed PRSS1 and PRSS2 genes, while in mice only about 240 Kb separate the two groups of genes, with one PRSS-like and three apparently functional PRSS genes present in the first region and eight in the second (Rowen et al. 1996; Glusman et al. 2001). Thus, the distance between the two PRSS regions is not fixed but varies according to the expansion of the TRBV genes.

Fig. 4 Schematic representation of the genomic organization of the rabbit TRB locus as deduced from the genome assembly OryCun2.0 (gaps excluded) plus GEDI ID S60737. The diagram shows the position of all related and nonrelated TRB genes according to nomenclature. Boxes representing genes are not to scale. Exons are not shown

All rabbit genes appear putatively functional, with the primary structures of PRSS58 and TRY2 notably different from those of the classic pancreatic trypsinogen genes (Supplementary Fig. 3). In contrast, the PRSS1, PRSS2, and PRSS3 genes retain the typical features of the trypsinogen gene family, such as the highly conserved tetra-aspartate motif preceding the Lys-Ile scissile bond. A large body of research has defined the primary role of this acidic motif as a critical element for the inhibition of autoactivation within the pancreas. Indeed, experiments with single and multiple Ala-mutants have confirmed that each Asp residue plays a role in autoactivation control, and that their effects are synergistic in a multiplicative manner (Nemoda and Sahin-Tóth 2005). Moreover, Chen et al. (2003) have indicated that the number of Asp residues in the activation peptide of trypsinogens has evolved progressively during the course of vertebrate evolution: two Asp are present in the sea lamprey, three or four in fish, amphibians, and birds, and four in mammals. The rabbit PRSS1 and PRSS2 genes have, in their activation peptide, two additional Ser residues and an increased number of Asp residues (five), which could be associated with a further decreased tendency to trypsinogen autoactivation. The rabbit PRSS genes are not the only example of a mammalian activation peptide with five Asp residues. In mice, 4 out of 12 trypsinogen isoforms are expressed at high levels in the pancreas, with one isoform representing 60 % of the total trypsinogen product (Németh et al. 2013). This isoform has five Asp, while the others have four. Therefore, rabbits and mice share an expressed PRSS isoform with five Asp. Altogether, these observations suggest that the mechanism of inhibition of trypsinogen autoactivation may not apply to all mammalian species; in fact, it seems to be both species- and isoform-specific.

The rabbit provides a valuable model for immunological research, and the characterisation of the TRB locus presented herein will be a useful resource for researchers using this model. Our work has defined the repertoire of the TR and other genes present in the rabbit TRB locus and demonstrated its similarities to and differences from model species from other major mammalian orders.

Acknowledgments The financial support of the University of Bari and MIUR (PON 254/Ric. Potenziamento del CENTRO RICERCHE PER LA SALUTE DELL'UOMO E DELL'AMBIENTE Cod. PONa3_00334) is gratefully acknowledged.

References

Antonacci R, Di Tommaso S, Lanave C, Cribiu EP, Ciccarese S, Massari S (2008) Organization, structure and evolution of 41 Kb of genomic DNA spanning the D-J-C region of the sheep TRB locus. Mol Immunol 45:493–509
Chen JM, Kukor Z, Le Maréchal C, Tóth M, Tsakiris L, Raguénès O, Férec C, Sahin-Tóth M (2003) Evolution of trypsinogen activation peptides. Mol Biol Evol 20:1767–1777
Connelley T, Aerts J, Law A, Morrison WI (2009) Genomic analysis reveals extensive gene duplication within the bovine TRB locus. BMC Genomics 10:192
Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinforma 5:113
Felsenstein J (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791
Giudicelli V, Chaume D, Lefranc M-P (2005) IMGT/GENE-DB: a comprehensive database for human and mouse immunoglobulin and T cell receptor genes. Nucleic Acids Res 33:D256–D261
Glusman G, Rowen L, Lee I, Boysen C, Roach JC, Smit AF, Wang K, Koop BF, Hood L (2001) Comparative genomics of the human and mouse T cell receptor loci. Immunity 15:337–349
Harindranath N, Alexander CB, Mage RG (1991) Evolutionarily conserved organization and sequences of germline diversity and joining regions of the rabbit T-cell receptor beta 2 chain. Mol Immunol 28:881–888
Hesse JE, Lieber MR, Mizuuchi K, Gellert M (1989) V(D)J recombination: a functional definition of the joining signals. Genes Dev 3:1053–1061
Isono T, Isegawa Y, Seto A (1994) Sequence and diversity of variable gene segments coding for rabbit T-cell receptor beta chains. Immunogenetics 39:243–248
Jung D, Alt FW (2004) Unraveling V(D)J recombination; insights into gene regulation. Cell 116:299–311
Kitamoto Y, Yuan X, Wu Q, McCourt DW, Sadler JE (1994) Enterokinase, the initiator of intestinal digestion, is a mosaic protease composed of a distinctive assortment of domains. Proc Natl Acad Sci U S A 91:7588–7592
Lefranc M-P (2011) From IMGT-ONTOLOGY IDENTIFICATION axiom to IMGT standardized keywords: for immunoglobulins (IG), T cell receptors (TR), and conventional genes. Cold Spring Harb Protoc 6:604–613
Lefranc M-P, Lefranc G (2001) The T cell receptor facts book. Academic, New York, pp 1–398
Lefranc M-P, Pommié C, Ruiz M, Giudicelli V, Foulquier E, Truong L, Thouvenin-Contet V, Lefranc G (2003) IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains. Dev Comp Immunol 27:55–77
Lefranc M-P, Pommié C, Kaas Q, Duprat E, Bosc N, Guiraudou D, Jean C, Ruiz M, Da Piédade I, Rouard M, Foulquier E, Thouvenin V, Lefranc G (2005) IMGT unique numbering for immunoglobulin and T cell receptor constant domains and Ig superfamily C-like domains. Dev Comp Immunol 29:185–203
Lefranc M-P, Giudicelli V, Ginestoux C, Jabado-Michaloud J, Folch G, Bellahcene F, Wu Y, Gemrot E, Brochet X, Lane J, Regnier L, Ehrenmann F, Lefranc G, Duroux P (2009) IMGT®, the international ImMunoGeneTics information system®. Nucleic Acids Res 37:D1006–D1012
Lu D, Futterer K, Korolev S, Zheng X, Tan K, Waksman G, Sadler JE (1999) Crystal structure of enteropeptidase light chain complexed with an analog of the trypsinogen activation peptide. J Mol Biol 292:361–373
Massari S, Ciccarese S, Antonacci R (2012) Structural and comparative analysis of the T cell receptor gamma (TRG) locus in Oryctolagus cuniculus. Immunogenetics 64:773–779
Mineccia M, Massari S, Linguiti G, Ceci L, Ciccarese S, Antonacci R (2012) New insight into the genomic structure of dog T cell receptor beta (TRB) locus inferred from expression analysis. Dev Comp Immunol 37:279–293
Nei M, Kumar S (2000) Molecular evolution and phylogenetics. Oxford University Press, New York
Németh BC, Wartmann T, Halangk W, Sahin-Tóth M (2013) Autoactivation of mouse trypsinogens is regulated by chymotrypsin C via cleavage of the autolysis loop. J Biol Chem 288:24049–24062
Nemoda Z, Sahin-Tóth M (2005) The tetra-aspartate motif in the activation peptide of human cationic trypsinogen is essential for autoactivation control but not for enteropeptidase recognition. J Biol Chem 280:29645–29652
Rowen L, Koop BF, Hood L (1996) The complete 685-kilobase DNA sequence of the human beta T cell receptor locus. Science 272:1755–1762
Rudikoff S, Fitch WM, Heller M (1992) Exon-specific gene correction (conversion) during short evolutionary periods: homogenization in a two-gene family encoding the beta-chain constant region of the T-lymphocyte antigen receptor. Mol Biol Evol 9:14–26
Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425
Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W (2000) PipMaker—a web server for aligning two genomic DNA sequences. Genome Res 10:577–586
Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S (2011) MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 28:2731–2739
