Predicting the Expression of Recombinant Monoclonal Antibodies in Chinese Hamster Ovary Cells Based on Sequence Features of the CDR3 Domain Leon P. Pybus and David C. James ChELSI Institute, Dept. of Chemical and Biological Engineering, University of Sheffield, Mappin Street, Sheffield S1 3JD, U.K

Greg Dean, Tim Slidel, Colin Hardman, Andrew Smith, Olalekan Daramola, and Ray Field MedImmune Ltd., Granta Park, Cambridge CB21 6GH, U.K DOI 10.1002/btpr.1839 Published online November 20, 2013 in Wiley Online Library (wileyonlinelibrary.com)

Despite the development of high-titer bioprocesses capable of producing >10 g L⁻¹ of recombinant monoclonal antibody (MAb), some so-called “difficult-to-express” (DTE) MAbs only reach much lower process titers. For widely utilized “platform” processes the only discrete variable is the protein coding sequence of the recombinant product. However, there has been little systematic study to identify the sequence parameters that affect expression. This information is vital, as it would allow us to rationally design genetic sequence and engineering strategies for optimal bioprocessing. We have therefore developed a new computational tool that enables prediction of MAb titer in Chinese hamster ovary (CHO) cells based on the recombinant coding sequence of the expressed MAb. Model construction utilized a panel of MAbs, which following a 10-day fed-batch transient production process varied in titer 5.6-fold, allowing analysis of the sequence features that impact expression over a range of high and low MAb productivity. The model identified 18 light chain (LC)-specific sequence features within complementarity determining region 3 (CDR3) capable of predicting MAb titer with a root mean square error of 0.585 relative expression units. Furthermore, we identify that CDR3 variation influences the rate of LC-HC dimerization during MAb synthesis, which could be exploited to improve the production of DTE MAb variants via increasing the transfected LC:HC gene ratio. Taken together these data suggest that engineering intervention strategies to improve the expression of DTE recombinant products can be rationally implemented based on an identification of the sequence motifs that render a recombinant product DTE. © 2013 American Institute of Chemical Engineers Biotechnol. Prog., 30:188–197, 2014

Keywords: bioinformatics, recombinant monoclonal antibodies, Chinese hamster ovary cells, difficult-to-express proteins

Correspondence concerning this article should be addressed to D. C. James at [email protected].

Biotechnol. Prog., 2014, Vol. 30, No. 1

Introduction

Multigram-per-liter fed-batch production titers have now largely become the norm for recombinant monoclonal antibodies (MAbs), shifting production bottlenecks to downstream process operations [1]. However, even for seemingly easy-to-express molecules such as MAbs, production titers can vary extensively [2]. Therefore, cost-of-goods (COGs) of upstream processing is still an issue for DTE MAbs. Indeed, whilst we accept that many attributes contribute to a protein product “manufacturability” assessment (e.g., binding affinity, biophysical properties, etc.), a fundamental consideration is whether the production system can achieve an adequate upstream volumetric titer of product. It is this “stop/go” factor that determines production scale and the duration of cell line/process development activity. Great progress has been made towards methods that assist in the in silico redesign of antibodies for high affinity or biophysical properties [3]. However, there remains an unmet need for reliable in silico tools that predict critical performance characteristics.

At a structural level, antibody diversity is manifested primarily in the antigen-binding sites, composed of three complementarity determining region (CDR) loops from each heavy chain (HC; H1, H2, and H3) and light chain (LC; L1, L2, and L3). These hyper-variable loops are therefore likely to be the “hot spot” regions that render a MAb DTE. It is therefore somewhat surprising that there is very little information available on the relationship between CDR sequence and MAb productivity.

Design of an “optimal” gene sequence requires a thorough understanding of the interaction between the multiple properties of a gene sequence and host-specific and environmental variables [4]. However, previous methods designed to increase MAb synthesis have generally focused on one or a few design rules. For example, recent reports suggest that manipulation of codon use bias can be employed by industry to improve recombinant protein expression [5–7]. Consequently, commercial optimization of gene expression is currently based on crude sequence analysis tools that adjust relatively few generic features such as codon use bias, GC content, cryptic splice sites, etc. [8,9]. Even companies specializing in optimization of mammalian gene expression acknowledge, “robust rules for designing a gene for heterologous expression are not available.” [4]

One aspect of the gene design process that requires more attention is the treatment of coding sequence. The ability to systematically understand the effect of hierarchical sequence properties on expression represents a challenging problem, as there are many more variables than observations. To identify the subset of predictive design variables for recombinant protein expression we must therefore utilize bioinformatics tools such as partial least squares regression (PLSR) to construct predictive models from such a complex multidimensional dataset [10]. Such approaches have been utilized previously to answer a number of other bioprocess-related problems [11–14].

In this study we have, for the first time, minimized experimental variables by utilizing a transient expression platform process and MAbs varying only in CDR3 sequence (the most likely target for MAb sequence engineering for improved affinity) [15] to restrict cell line/process/expression vector specific variability [6,16,17], and constructed a PLSR-based regression model that captures the relationship between MAb CDR3 gene sequence and MAb production titer. This approach produces a reduced feature set from which it is revealed that (for this panel of MAbs) expression is primarily a function of LC mRNA/protein sequence and structural features. Rather than affecting the folding rate of LC itself, we infer that these sequence/structural features disrupt the rate of VH-VL association, an effect that can be overcome by increasing the transfected LC:HC gene ratio.

Materials and Methods

Expression vectors

MAb heavy chain (HC) and light chain (LC) genes driven by an EF1α promoter were encoded on separate MedImmune proprietary expression vectors. Vectors were engineered with an oriP origin of replication to drive EBNA1-based plasmid retention post-transfection. Plasmid DNA for transfection was purified using the EndoFree Plasmid Mega Kit (Qiagen, Crawley, UK) and a QIAvac vacuum system (Qiagen) according to the manufacturer’s instructions.


Fed-batch transient production process

CHO cell line EB27 [18] was seeded into 50 mL CultiFlask® bioreactors (Sartorius, Surrey, UK) at a concentration of 1 × 10⁶ viable cells mL⁻¹ in 10 mL of CD CHO media (Life Technologies, Paisley, UK) supplemented with 6 mM L-glutamine (Life Technologies) and, unless stated otherwise, transfected with 6.25 μg of each vector (yielding a molar excess of 6:4 LC:HC gene) using Lipofectamine™ LTX with PLUS™ reagent (Life Technologies) according to the manufacturer’s protocol. Transiently transfected cultures were maintained for 10 days using a fed-batch process. Nutrient supplementation was carried out 6 h after transfection using the CHO EfficientFeed™ Kit (30% v/v CHO CD EfficientFeed™ A, and 30% v/v CHO CD EfficientFeed™ B; Life Technologies). The concentration of recombinant MAb in media samples (volumetric titer) at harvest (day 10 of culture) was determined by Protein A HPLC. The concentration of secreted light chain (LC) and light chain dimer (LC2) in media samples was determined by quantitative western blotting as previously described [16].

Biochemical and biophysical feature description of the mRNA/protein sequence

To encode the sequences of HC and LC CDR3 we adopted the notion of pseudo amino acid composition [19,20], in which each gene sequence was represented by 672 biochemical and biophysical features, which can be categorized into 17 groups (Table 1). Except for amino acid composition, codon composition, mRNA minimum folding energy, BiP binding prediction, and proline cis-isomer prediction, all other parameters are generated by classifying the properties of codons or amino acids into three groups (Table 1): High (H), Medium (M), or Low (L). For example, in terms of hydrophobicity there are three groups of amino acids: low (polar; R, K, E, D, Q, N), medium (neutral; G, A, S, T, P, H, Y), and high (hydrophobic; C, V, L, I, M, F, W) [26]. We integrated these local classifications of a codon or amino acid over the entire protein sequence by calculating the following three quantities: C (composition), T (transition), and D (distribution) (Table 2). The detailed computational procedures and a well-illustrated example have been previously reported [35,36].

Partial least squares regression (PLSR) and variable selection based on variable importance in projection (VIP) scores

PLSR analysis was implemented using the pls package within the R computing environment (www.r-project.org) [37]. To avoid overfitting, the number of PLSR latent variables was assessed using a filtering measure known as VIP, implemented with the VIP.R add-on for the pls package [38]. The success of each PLS model constructed was evaluated using a number of commonly used measures. Firstly, the root mean square error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}}$$

In addition, the correlation coefficient Q² was also calculated:

$$Q^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

where $\hat{y}_i$ are the predicted measurements, $y_i$ are the observed measurements, $\bar{y}$ is the mean of the observed measurements, and $N$ is the number of samples tested.
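The two model-quality measures defined above can be written in a few lines. The following is a minimal Python sketch, not the authors' R code; the function names `rmse` and `q2` and the example numbers are hypothetical and chosen only to illustrate the arithmetic.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    squared_errors = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    return math.sqrt(squared_errors / n)

def q2(observed, predicted):
    """Q^2 = 1 - (residual sum of squares) / (sum of squares about the observed mean)."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    return 1.0 - ss_residual / ss_total

# Hypothetical relative titers, for illustration only (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(rmse(observed, predicted), q2(observed, predicted))
```

If the predicted values passed in come from leave-one-out cross-validation, the same function gives a cross-validated Q²; which predictions to pass is left to the caller.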

Results

Recombinant IgG1κ monoclonal antibodies varying in HC and LC CDR3 sequence and HC-LC pairing exhibit variable transient production titer in CHO cells

The panel utilized in this study consists of 24 recombinant MAb variants derived from MedImmune’s antibody discovery platform. Briefly, a single scFv (VH-VL) was isolated by phage display and binding affinity for the therapeutic target was optimized via mutagenesis of the VH and VL CDR3. Repertoires of improved variants were recombined, yielding MAbs exhibiting a significant increase in affinity and cellular potency (data not shown).


Table 1. The Biochemical and Physicochemical Feature Components of the mRNA/Amino Acid Sequence

Categories classified into three groups (Biochemical/Biophysical Description: High, Medium, Low):

| Category | High | Medium | Low |
|---|---|---|---|
| CHO Codon Adaptation Index (CAI) [21] | Phe – TTC, TTT; Leu – CTG; Ile – ATC, ATT; Met – ATG; Val – GTG; Ser – AGC, TCT; Pro – CCT; Thr – ACA; Ala – GCC; Tyr – TAC, TAT; His – CAC, CAT; Gln – CAG; Asn – AAC, AAT; Lys – AAG; Asp – GAT, GAC; Glu – GAG; Cys – TGT, TGC; Trp – TGG; Arg – AGA, AGG; Gly – GGC | Ser – TCC, AGT, TCA; Pro – CCA, CCC; Thr – ACC, ACT; Ala – GCT, GCA; Lys – AAA; Glu – GAA; Arg – CGG, CGA, CGC; Gly – GGA, GGG, GGT | Leu – CTC, CTT, TTG, CTA, TTA; Ile – ATA; Val – GTC, GTT, GTA; Ser – TCG; Pro – CCG; Thr – ACG; Ala – GCG; Gln – CAA; Arg – CGT |
| GC content [22] | 0–1 G/C in codon sequence | 2 G/C in codon sequence | 3 G/C in codon sequence |
| % of residues at the surface of proteins [23] | K, S, D, T, N, G, A | E, Q, Y, L, P, R, V | I, H, F, W, M, C |
| % of residues buried within proteins [23] | V, G, L, A, I | S, F, T, C | N, D, P, Y, W, H, M, E, Q, R, K |
| α-helix forming propensity [24] | G, P, S, T, Y | D, F, I, N, R, V, W | A, C, E, H, K, L, M, Q |
| β-sheet forming propensity [24] | C, D, E, K, N, P | A, G, H, L, M, Q, R, S | F, I, T, V, W, Y |
| Turn forming propensity [24] | A, F, H, I, L, M, V | C, E, K, Q, R, T, W, Y | D, G, N, P, S |
| Hydrophilicity [25] | R, D, E, K | S, N, Q, G, P, T, A, H | C, M, V, I, L, Y, F, W |
| Hydrophobicity [26] | C, V, L, I, M, F, W | G, A, S, T, P, H, Y | R, K, E, D, Q, N |
| Polarity [27] | H, Q, R, K, N, E, D | P, A, T, G, S | L, I, F, W, C, M, V, Y |
| Polarizability [28] | K, M, H, F, R, Y, W | C, P, N, V, E, Q, I, L | G, A, S, D, T |
| Normalized van der Waals volume [29] | M, H, K, F, R, Y, W | N, V, E, Q, I, L | G, A, S, C, T, P, D |

Categories computed directly from the sequence:

| Category | Biochemical/Biophysical Description |
|---|---|
| Codon composition [30] | Number of each codon in the sequence |
| mRNA minimum folding energy [31] | Prediction of mRNA minimum folding energy using m-fold software |
| Amino acid composition [32] | Number of each amino acid in the sequence |
| BiP binding site prediction [33] | Number of predicted BiP binding sites in the sequence |
| Cis-proline peptide bond prediction [34] | Number of cis-proline peptide bonds in the sequence |
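As an illustration of how a codon-level property from Table 1 could be turned into a class sequence, the sketch below assigns each codon to one of the three GC-content groups (0–1, 2, or 3 G/C bases). It is a minimal Python example under stated assumptions: the function names are hypothetical, the input is taken to be an in-frame coding sequence, and the example fragment is invented rather than taken from the study.

```python
def gc_class(codon):
    """Return the G/C-count group of one codon: '0-1', '2', or '3'."""
    gc_count = sum(base in "GC" for base in codon.upper())
    return "0-1" if gc_count <= 1 else str(gc_count)

def codon_gc_classes(mrna):
    """Split an in-frame coding sequence into codons and classify each one."""
    if len(mrna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    codons = [mrna[i:i + 3] for i in range(0, len(mrna), 3)]
    return [gc_class(codon) for codon in codons]

# Hypothetical 4-codon fragment, for illustration only (not a sequence from the study).
example = "ATGGCCCAGTTT"
print(codon_gc_classes(example))
```

The resulting list of class labels plays the same role for codons that the hydrophobicity sequence plays for amino acids when composition, transition, and distribution are computed.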

Table 2. Computation of the 21 Hydrophobic Feature Components for a Hypothetical Protein Sequence

| Protein sequence | G | M | G | R | L | P | E | E | Q | C | R | Y | M | L | A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hydrophobicity sequence | M | H | M | L | H | M | L | L | L | H | L | M | H | H | M |
| Sequence index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| H composition C(H) |  | 1 |  |  | 2 |  |  |  |  | 3 |  |  | 4 | 5 |  |
| M composition C(M) | 1 |  | 2 |  |  | 3 |  |  |  |  |  | 4 |  |  | 5 |
| L composition C(L) |  |  |  | 1 |  |  | 2 | 3 | 4 |  | 5 |  |  |  |  |
| H/M transition T(H/M) |  | 1 | 2 |  |  | 3 |  |  |  |  |  |  | 4 |  | 5 |
| M/L transition T(M/L) |  |  |  | 1 |  |  | 2 |  |  |  |  | 3 |  |  |  |
| H/L transition T(H/L) |  |  |  |  | 1 |  |  |  |  | 2 | 3 |  |  |  |  |
| H distribution |  | H1 |  |  | H25 |  |  |  |  | H50 |  |  | H75 | H100 |  |
| M distribution | M1 |  | M25 |  |  | M50 |  |  |  |  |  | M75 |  |  | M100 |
| L distribution |  |  |  | L1 |  |  | L25 | L50 | L75 |  | L100 |  |  |  |  |

Composition and transition rows show running counts; each transition count is entered under the second residue of the transition.

According to the hydrophobicity of each amino acid (Table 1), the protein sequence “GMGRLPEEQCRYMLA” was converted to a hydrophobicity sequence “MHMLHMLLLHLMHHM.” It is composed of 5 Hs, 5 Ms, and 5 Ls. Therefore, for composition parameters C(H) = 5/15, C(M) = 5/15, and C(L) = 5/15. There are a total of 11 transitions in the sequence with 5 between H and M, 3 between M and L, and 3 between H and L. Therefore, for transition parameters T(H/M) = 5/11, T(M/L) = 3/11, and T(H/L) = 3/11. The distribution of H, M, and L is given by the fraction of the sequence length within which the first, 25%, 50%, 75%, and 100% of the H’s, M’s, and L’s occur. For example, the distribution for H is D(H1) = 2/15, D(H25) = 5/15, D(H50) = 10/15, D(H75) = 13/15, D(H100) = 14/15.
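The worked example above can be reproduced with a short script. This is a minimal Python sketch of the composition/transition/distribution calculation, not the authors' implementation; the names `GROUPS`, `CLASS_OF`, and `ctd_features` are hypothetical. The rounding rule used for the distribution terms (take the first occurrence for "1", otherwise round the 25/50/75/100% count up to the next whole occurrence) is an assumption chosen because it reproduces the D(H) values quoted in the example.

```python
from fractions import Fraction

# Hydrophobicity groups as given in the text: L (polar), M (neutral), H (hydrophobic).
GROUPS = {"L": "RKEDQN", "M": "GASTPHY", "H": "CVLIMFW"}
CLASS_OF = {aa: label for label, residues in GROUPS.items() for aa in residues}

def ctd_features(protein):
    """Compute the 21 C/T/D descriptors for one three-class property."""
    classes = [CLASS_OF[aa] for aa in protein]
    n = len(classes)
    features = {}

    # Composition: fraction of residues in each class.
    for label in "HML":
        features[f"C({label})"] = Fraction(classes.count(label), n)

    # Transition: share of all class changes that occur between each pair of classes.
    changes = [frozenset(pair) for pair in zip(classes, classes[1:]) if pair[0] != pair[1]]
    for a, b in (("H", "M"), ("M", "L"), ("H", "L")):
        count = changes.count(frozenset((a, b)))
        features[f"T({a}/{b})"] = Fraction(count, len(changes)) if changes else Fraction(0)

    # Distribution: relative position of the first, 25%, 50%, 75%, and 100% occurrence.
    for label in "HML":
        positions = [i + 1 for i, c in enumerate(classes) if c == label]
        for pct in (1, 25, 50, 75, 100):
            if not positions:
                features[f"D({label}{pct})"] = Fraction(0)
                continue
            # Assumed rounding: "1" is the first occurrence; otherwise round up.
            k = 1 if pct == 1 else -(-len(positions) * pct // 100)
            features[f"D({label}{pct})"] = Fraction(positions[k - 1], n)
    return features

features = ctd_features("GMGRLPEEQCRYMLA")
print(features["C(H)"], features["T(H/M)"], features["D(H50)"])
```

Exact fractions are used so the results can be compared directly with the ratios in the text; `Fraction` reduces them, so 5/15 is reported as 1/3.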

MAbs derived from this platform differed only in HC (at 11 Kabat positions) and LC (at 6 Kabat positions) CDR3 sequence, and comprised discrete combinations of six HC variants (1, 2, 3, 4, 5, and 6) and four LC variants (A, B, C, and D). HC and LC genes differing only in CDR3 sequence were encoded on identical expression vectors under the control of the EF1α promoter, and transfected at a LC:HC gene copy ratio of 6:4 by lipofection into a CHO-K1 derived host cell line (EB27) engineered to express EBNA-1. Across the panel of MAbs, 10-day production process titers varied 5.6-fold between MAb pairing 5C (lowest titer: 24 mg L⁻¹) and 3A (highest titer: 137 mg L⁻¹) (Figure 1). One-way ANOVA of the transient production data (Figure 1 insets) revealed that variation in MAb process titer derived primarily from variation in LC (p-value = 1.8 × 10⁻¹⁰) rather than HC (p-value = 0.98) sequence.

Figure 1. Relative transient MAb titer for a panel of 24 IgG1κ MAbs with mixed-and-matched heavy chain (HC) and light chain (LC) sequences. A panel of four LC (A, B, C, and D) and six HC (1, 2, 3, 4, 5, and 6) sequences encoded on separate plasmids were combined to obtain a panel of 24 different IgG1κ MAbs. Each HC and LC pairing was cotransfected into the CHO cell line EB27 by lipofection at a LC:HC gene ratio of 6:4. MAb volumetric titer was measured after 10 days of a fed-batch process. Data are split into training (black bars) and test (white bars) sets to aid in model validation. Transfections were performed two times in duplicate and data shown are the mean ± standard deviation of single samples taken from each transfection that were analyzed once via protein A HPLC for titer analysis. Data are represented relative to the expression of the MAb pairing with the lowest production titer (5C). The inset shows the distribution of MAb expression levels for each HC and LC sequence.

Comparative analysis of HC and LC CDR3 mRNA and protein sequences using partial least squares regression reveals LC CDR3 sequence features that modulate recombinant monoclonal antibody production

To identify CDR3 sequence (mRNA and protein) features or biophysical properties significantly associated with variation in MAb production we (i) defined a broad set of quantifiable parameters listed in Table 1 and (ii) used PLS regression methodology (outlined in the methods) to determine the optimum subset of sequence features or properties that are predictive of MAb titer. Firstly, based on MAb titer each MAb pairing was allocated into either training (16 MAb variants) or test (8 MAb variants) data sets equally spaced across the range of observed titers (Figure 1). The training data set was utilized to derive a two latent variable (LV) PLS model selected according to minimization in leave-one-out cross validation error (LV1 accounting for 82% of the variation and LV2 accounting for 88% of the variation), which yielded a root mean square error (RMSETraining) of 0.57 and a Q² of 0.805. To evaluate model quality, the PLS model was used to predict MAb titer for the test data set. A prediction error rate of 0.35 (RMSETest) was observed, highlighting the ability of the model to predict MAb titer. The experimental (measured) versus predicted MAb titer values for all MAb variants are plotted in Figure 2A. The majority of predicted values lie close to the y = x line, suggestive of a good model fit. The residual error plot (Figure 2B) derived from PLS model predictions exhibits a random distribution pattern, with 22 out of 24 MAb variants having a residual error below 1.0. MAb variant 2B and 4B production titers were poorly predicted, exhibiting high residual error (>1.0). It may be noted that both these MAbs contain LC sequence B, suggesting that the model does not capture an important interaction between LC B and certain HC sequences.

Figure 2. PLSR and VIP variable filtering derives a model that identifies LC CDR3 sequence features that are highly predictive of MAb expression. The optimal number of sequence features that are predictive of MAb titer was identified using a PLSR and VIP variable filtering method, reducing the 672 variables described in Table 1 to 51 sequence features capable of predicting MAb titer with a root mean square error of 0.35 for the test data set. (A) Plot of the measured versus predicted values of MAb titer. (B) Model residual error for both the training and test data sets. In both A and B the training set is represented as open circles and the test set by black triangles.

Figure 3. PLSR coefficient values for variables selected by VIP filtering. From a total of 672 variables described in Table 1, 51 sequence features that were most predictive of MAb titer were identified using a PLSR and VIP variable filtering method.

The PLS regression coefficient values are shown in Figure 3, comprising 51 CDR3 sequence features (selected from 672 evaluated) that are predictive of MAb production. Only LC-specific sequence features were identified as important predictors, of which 28/51 show negative regression coefficient values (hence negatively correlate with MAb production), and 23/51 show positive regression coefficient values (hence positively correlate with MAb production). Of this predictive set of sequence features, the PLSR coefficients most negatively influencing MAb titer (and predictive of MAb titer generally) were those describing a LC CDR3 sequence composed of an increased number of amino acids with a medium propensity to form turns (variable 51, Figure 3), where the last such amino acid is distributed further along the LC CDR3 sequence (variable 50, Figure 3). The reduced feature set of variables also shows significant VIP values. The VIP parameter defines the relative importance of each variable in the PLS model; variables with a VIP score ≥ 1 can be considered important to a given model. Variables with a VIP score
