Accepted Manuscript

Next Generation Sequencing-based analysis of RNA polymerase functions

Tomasz Heyduk, Ewa Heyduk

PII: S1046-2023(15)00177-2
DOI: http://dx.doi.org/10.1016/j.ymeth.2015.04.030
Reference: YMETH 3680

To appear in: Methods

Received Date: 5 March 2015
Revised Date: 23 April 2015
Accepted Date: 24 April 2015

Please cite this article as: T. Heyduk, E. Heyduk, Next Generation Sequencing-based analysis of RNA polymerase functions, Methods (2015), doi: http://dx.doi.org/10.1016/j.ymeth.2015.04.030

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Next Generation Sequencing-based analysis of RNA polymerase functions

Tomasz Heyduk* and Ewa Heyduk

E.A. Doisy Department of Biochemistry and Molecular Biology, Saint Louis University School of Medicine, Saint Louis, MO 63104.

*Corresponding author. Phone: 314-977-9238; Fax: 314-977-9205; e-mail: [email protected].

Abstract

Next Generation Sequencing (NGS), which has revolutionized genome-wide studies, allows analysis of complex nucleic acid mixtures containing thousands of sequences. This extraordinary analytical power of NGS can be harnessed for the analysis of in vitro experiments in which the dependence of the activity of a DNA-acting protein on DNA template sequence is studied in a single reaction for thousands of DNA sequence variants. This allows rapid accumulation of data on the DNA sequence dependence of the process of interest to a depth not accessible by standard experimentation. We use the example of bacterial RNA polymerase promoter melting activity to describe the NGS-based methodology for studying DNA template dependence of protein activity.

Keywords: Next Generation Sequencing, RNA polymerase, transcription, sequence dependence.

1. Introduction

The sequence of DNA plays an essential role in initiating and controlling the activities of proteins that bind and operate on DNA templates. There has been tremendous interest in determining the molecular mechanisms that explain the DNA sequence dependence of the activities of these proteins. Deciphering these mechanisms will significantly expand the understanding of many basic cellular processes and will also be important in the development of means to control or inhibit the activities of these proteins for therapeutic purposes. Despite significant progress in this area, much still remains to be learned.

Next Generation Sequencing (NGS) revolutionized analysis of complex nucleic acid mixtures [1-6]. A single NGS experiment can reveal the identity of millions of sequences in a sample. The accessibility of NGS to the research community has been increasing at a rapid rate. Most core DNA sequencing facilities are now equipped for NGS, and less expensive compact instruments have recently been introduced, placing NGS within the reach of individual laboratories. Typical uses of NGS include de novo genome sequencing [5, 6] and a rapidly expanding collection of genome-wide analyses such as gene expression analysis (RNA-seq) [7] or genome-wide mapping of protein-DNA interactions (ChIP-seq) [8-10]. NGS provides information not only on the identity of the sequences present in the sample but also on their relative amounts. Read count (how many times a given sequence is read during NGS analysis), after proper normalization to account for bias introduced by sequencing library preparation and the sequencing itself, is proportional to the relative abundance of a sequence in the sample. This possibility of quantitative analysis of complex nucleic acid mixtures is employed in many genome-wide analyses such as, for example, RNA-seq [7].
The quality of NGS read count as a quantitative readout could be high enough to extend its applicability to highly parallel quantitative analyses of biochemical experiments employing thousands of DNA template sequence variants in a single reaction. We recently demonstrated the power of such NGS-based analysis by analyzing DNA template dependence of promoter DNA melting by bacterial RNA polymerase (RNAP) [11]. We use this reaction as an example to describe the NGS-based methodology here.

RNAP is an outstanding example of a protein whose many functions depend on the sequence of DNA [12, 13]. The mechanisms behind the DNA sequence dependence of these functions are still not well understood. RNA polymerase performs DNA template-directed synthesis of mRNA [12, 13]. This activity involves a series of steps that include sequence-specific binding of the enzyme to the promoter DNA, melting of the DNA duplex, initiation of RNA synthesis, breaking of the specific enzyme-promoter DNA contacts (promoter escape), processive synthesis of the RNA product (elongation), and termination, which involves release of RNA and dissociation of the enzyme from the DNA template [12, 13]. Each of these steps depends on the sequence of the DNA template in multiple ways that are challenging to sort out. Although we describe here the NGS-based approach for parallel investigation of thousands of sequence variants of a DNA template using RNAP promoter melting activity as an example, the described protocols can in most cases be easily modified for other activities of RNAP or for studies of different DNA-dependent proteins.

An E. coli promoter is defined by two conserved hexameric sequences (the -35 and -10 elements) with consensus sequences of TTGACA and TATAAT, respectively [14, 15]. Promoter melting by bacterial RNAP occurs during an isomerization step of the RNAP-promoter complex following initial recognition of the promoter. The resulting “open” complex, in the presence of NTPs, can initiate synthesis of RNA [16]. The sequence of the -10 promoter element plays a critical role in promoter melting by RNAP [16]. The nontemplate strand of the -10 element is sequence-specifically bound by RNAP [17-21], which provides an important driving force for promoter melting. Sequence determinants for the recognition of the -10 element in the single-stranded form were studied in detail, demonstrating essential roles for -11A and -7T in high-affinity binding to RNAP [22]. The adenine at position -11 likely flips out of the DNA duplex, nucleating the promoter melting reaction [22-27]. Structural analysis revealed that, similarly to -11A, -7T is bound to RNAP in a flipped-out conformation [27].

2. Experimental design.

2.1. Overall design.

The first step in NGS-based parallel analysis of DNA template sequence dependence of protein activity is to prepare the DNA template. The details will depend on the specifics of the experiment, but most often this will involve preparing DNA with the sequence of interest completely or partially randomized, or mixing many DNAs containing variants of the sequence of interest. With the appropriate DNA template prepared, the next step is to design an experimental scheme that allows isolation of the DNA that became enriched as a result of the activity of the protein of interest. For example, in our studies of promoter melting activity of RNAP, we used heparin challenge followed by native gel electrophoresis to isolate the DNA fraction that was melted by RNAP after a specific reaction time [11]. Such DNA is then converted to a sequencing library compatible with the selected NGS platform. The library is sequenced and the number of reads for each sequence of interest is extracted from the raw sequencing data. These data, after normalization to an appropriate control sample, can be interrogated in many different ways to obtain insights into the DNA template sequence dependence of the process under study. We use the example of RNAP promoter melting activity to describe each of the above steps in detail below.

2.2. Preparation of DNA templates.

DNA templates for promoter melting were prepared from two synthetic oligonucleotides corresponding to the upper and lower strands of the DNA template (λPR promoter). These two oligonucleotides contained 17 nt of overlapping complementary sequence at their 3’ ends, allowing their extension to a full duplex with DNA polymerase. One of the strands (O1) contained the sequence of interest (the -10 promoter element in this case), which was completely randomized during oligonucleotide synthesis. To obtain a promoter library with a randomized -10 sequence we used the following O1 and O2 oligonucleotides:

O1:

ATC TAT CAC CGC AAG GGA TAA ATA TCT AAC ACC GTG CGT GTT GAC TAT TTT ACC TCT GGC GGT NNN NNN GGT TGC ATG TAG TAA GG

O2:

TCA GTT GCC GCT TTC TTT CTT GCT GAC TGC TTA ATC GCT TCT AGG GAT ATA GGT AAT TCC ATA CCA CCT CCT TAC TAC ATG CAA CC

NNNNNN denotes the 6 bp random sequence. The complementary 3’ end sequences are the last 17 nt of each oligonucleotide (GGT TGC ATG TAG TAA GG in O1 and CCT TAC TAC ATG CAA CC in O2). The DNA duplex sample prepared from such synthetic oligonucleotides was thus a mixture of DNA templates containing all possible sequence variants (4^6 = 4096 in the case of the 6 bp -10 element). In designing a DNA library, it is important to consider the length of the randomized sequence in comparison to the expected number of reads that will be obtained from NGS. Optimally, the number of reads should be ~100-fold higher than the number of sequence variants in the sample to assure acceptable noise in read count data if one desires to quantitate every sequence in the library. For example, a 10 nt randomized sequence corresponds to 4^10 = 1,048,576 sequence variants. A single lane on an Illumina instrument will produce ~150 million reads (about half of that on a high-capacity Ion Torrent chip), which should be sufficient to allow quantitative analysis of all sequence variants of a 10 nt long randomized segment. Longer randomized sequence segments can still be studied using the NGS approach, but not every sequence in the library would then be detectable.
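The library-size arithmetic in this section can be checked with a short script. The following Python sketch is only an illustration; the function names are hypothetical and are not part of any published tool. It computes the number of variants for a randomized segment of a given length and the number of reads needed for ~100-fold coverage, and it confirms that the 3’ ends of O1 and O2 listed above are reverse complements of each other over 17 nt.

```python
# Sketch: library complexity and read-depth arithmetic for a randomized segment.
# Function names are illustrative assumptions, not part of any published tool.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def library_variants(n_random: int) -> int:
    """Number of distinct sequences in a fully randomized segment of n_random nt."""
    return 4 ** n_random


def reads_needed(n_random: int, fold: int = 100) -> int:
    """Reads required for roughly fold-times coverage of every variant."""
    return fold * library_variants(n_random)


# Last 17 nt of O1 and O2 as listed in the text (spaces removed).
O1_3PRIME = "GGTTGCATGTAGTAAGG"
O2_3PRIME = "CCTTACTACATGCAACC"

if __name__ == "__main__":
    for n in (6, 10):
        print(n, library_variants(n), reads_needed(n))
    # True if the two 3' ends can anneal over their full 17 nt
    print(reverse_complement(O1_3PRIME) == O2_3PRIME)
```

For a 10 nt segment the 100-fold target is about 105 million reads, which is within the ~150 million reads quoted for a single Illumina lane.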

In order to prepare the duplex from the O1 and O2 oligonucleotides, they are mixed at 2 µM concentration and annealed by heating to 95 °C for 2 minutes and slowly cooling down to room temperature. The annealed partial duplex (0.25 µM) is extended by a 2-cycle PCR extension with Klentaq enzyme. The extended duplex DNA is purified with the Wizard SV DNA purification kit (Promega). We find that this simple purification is usually sufficient, but if there are doubts regarding the quality of the template, the DNA template can be further purified by ion exchange FPLC using a 1 ml Resource Q column [24].

2.3. Isolation of DNA sequences preferentially melted by RNAP.

We used the relative resistance of RNAP-promoter open complexes to heparin challenge to isolate the complexes that successfully progressed to the open complex. While this approach allows easy isolation of DNA enriched by preferential melting by RNAP, its limitation is that heparin resistance is not a universal property of all open complexes. RNAP holoenzyme was prepared as described [28]. Reactions were initiated by mixing, at room temperature, 2.5 µl of 100 nM promoter DNA with 2.5 µl of 150 nM holoenzyme in 20 mM Tris-HCl (pH 8.0) buffer containing 150 mM NaCl, 10 mM MgCl2, 0.1 mg/ml BSA, 0.1 mM DTT and 5% glycerol. Reactions were terminated at various times by adding 2 µl of a heparin-Ficoll mixture. The final concentration of heparin was 0.2 mg/ml. The reaction mixtures (7 µl) were loaded on a 7.5% native polyacrylamide gel running at constant voltage (90 V) in TBE buffer. The gel was stained with SYBR Green and the bands corresponding to the heparin-resistant complex were excised from the gel. DNA was eluted from the gel slices by incubating them in 100 mM Tris-HCl (pH 8.0), 0.5 M NaCl and 5 mM EDTA at 50 °C for 2 hrs. Eluted DNA was filtered, precipitated with ethanol and dissolved in water.

We also used an alternative approach in which heparin-resistant complexes are isolated on a nitrocellulose filter. Before use, nitrocellulose filters were washed with 0.4 M KOH, followed by 3 washes with water and 3 washes with transcription buffer. Reactions were initiated as described above and were terminated by addition of heparin to 0.2 mg/ml. Reaction mixtures were loaded onto the nitrocellulose filter and washed twice with transcription buffer with no salt. DNA was eluted from the filter with 20 mM Tris (pH 7.5), 0.2% SDS and 0.3 M sodium acetate buffer. Eluted DNA was precipitated with ethanol overnight. The advantage of the filter binding approach is the possibility of probing reactions at shorter times, giving access to faster kinetics.

Separation of DNA sequences enriched by promoter melting by RNAP is relatively straightforward. In order to probe a different DNA template-dependent activity of RNAP, an appropriate separation scheme would have to be established. For example, we have been using the NGS-based approach to analyze the dependence of promoter escape on the sequence of the ITS (Initial Transcribed Sequence, the first ~20 bp following the transcription start site [29-33]). In this case, we prepare promoter libraries containing randomized ITS sequences. Transcription is initiated from a preformed open complex and RNA products are isolated after stopping the reaction at various time points. RNA is purified, ligated with appropriate adapters, and reverse transcribed to DNA that can then be used to prepare libraries for NGS analysis as described below.

2.4. Preparation of libraries for NGS analysis.

DNA samples enriched in sequences that promote or inhibit the RNAP activity of interest need to be converted into libraries compatible with the chosen NGS platform. The two most accessible NGS platforms are Illumina (www.illumina.com) and Ion Torrent (www.lifetechnologies.com). The data produced by these two platforms will differ in details that are mostly unimportant for the purpose of the approach described.
For example, unlike Illumina, Ion Torrent sequencing produces reads of variable length and exhibits elevated sequencing errors in homopolymeric sequence stretches. As discussed below, these differences should not significantly affect the results since the data are normalized to control samples prepared and sequenced in the same way as the test samples. Thus, the choice of sequencing platform will largely depend on cost and accessibility considerations. Illumina and Ion Torrent each require their own specific sequences at the ends of the DNA libraries for clonal amplification, attachment to the solid support and sequencing primer hybridization. These sequences are included in the PCR primers that are used to prepare sequencing libraries.

2.4.1. Sequencing library for Illumina platform.

The following primers can be used to prepare Illumina-compatible sequencing libraries from DNA templates used in the experiments:

O3:

CCCTACACGACGCTCTTCCGATCT-XXXXXX-template specific sequence

O4:

CAAGCAGAAGACGGCATACGAGAT-template specific sequence

O5:

AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

XXXXXX corresponds to a 6 bp DNA barcode sequence that allows multiplexed sequencing of many samples [4]. A specific barcode sequence can be used to tag DNA corresponding to a specific experiment, and DNA from many experiments can then be mixed and sequenced together. The resulting NGS data are then split into files corresponding to each experiment by sorting the data according to the barcodes used. Multiplexing using DNA barcodes can greatly reduce the cost of sequencing per experiment; the degree of multiplexing is determined by the desired number of reads per experiment. For example, we would typically combine 30-40 barcoded samples for Illumina sequencing when a 6 bp randomized segment of the DNA template was studied because this would still produce several million reads per experiment. The template specific sequence of the O3 primer should be upstream of the randomized sequence segment of the template (typically right next to it), whereas the template specific sequence of the O4 primer should be somewhere downstream such that the length of DNA produced by PCR with these primers is within the range optimal for Illumina sequencing (typically 200-500 bp). We prepare Illumina sequencing libraries by a two-step, low-cycle (to reduce PCR artifacts and sequence bias) PCR. The first step involves PCR with the O3 and O4 primers using DNA obtained from the experiment as a template. The second step involves PCR with the O4 and O5 primers using the product of the first PCR as a template. If the PCR reactions produce a clean band of the expected size, purification of the PCR products using a PCR cleanup kit (for example, Wizard SV (Promega)) is sufficient. Otherwise, the PCR product of the correct size can be obtained by running the PCR reactions on an agarose gel and eluting the DNA from gel slices containing the desired product. The concentration of DNA libraries for sequencing should be measured with the Qubit assay (Life Technologies, Grand Island, NY), which prevents overestimation of the PCR product concentration (often the case when it is measured by UV spectroscopy). With the primers described above, NGS sequence reads will start with the barcode sequence, followed by the template specific sequence of the O3 primer and by the randomized sequence segment.

2.4.2. Sequencing library for Ion Torrent platform.

The following primers could be used to prepare Ion Torrent-compatible sequencing libraries from DNA templates used in the experiments:

O6:

CCATCTCATCCCTGCGTGTCTCCGACTCAG-XXXXXXXXXX-template specific sequence

O7:

CCTCTCTATGGGCAGTCGGTGAT-template specific sequence

XXXXXXXXXX corresponds to a 10 bp DNA barcode sequence. Sequencing libraries are produced by PCR with the O6 and O7 primers using DNA obtained from the experiment as a template. For Ion Torrent sequencing, more stringent purification of PCR products is advisable, such as double purification on AMPure XP beads (Beckman Coulter, Inc.) or elution of the correct size band from an agarose gel. With the primers described above, NGS sequence reads will start with the barcode sequence, followed by the template specific sequence of the O6 primer and by the randomized sequence segment.

2.4.3. Reference sample.

The procedures involved in the preparation of sequencing libraries (and the sequencing process itself) can distort the data due to the sequence bias inherent to these procedures. PCR (which we use to prepare the libraries) is the largest source of such sequence bias [34]. Thus, the relative number of reads produced by NGS will not exactly represent the relative amounts of the corresponding sequences in the analyzed DNA sample. It is therefore essential to design the experiments such that an appropriate reference sample is available that can be used to transform raw read count data into a relative enrichment value (the ratio of the read count of a given sequence in the test sample to that in the reference sample). When test and reference samples are processed and sequenced in the same way, the distortion of the data due to sequence bias will be essentially eliminated. When we studied promoter melting by RNAP using native gel electrophoresis to isolate open complexes, we used a sample of free DNA (no RNAP) that was run on the gel, eluted from the gel and converted to an NGS-compatible library using the same protocol as the one used for RNAP-DNA complexes. Such a convenient reference sample may not always be available. For example, in the analysis of DNA template sequence dependence of promoter escape, no zero-time reference sample is possible since there is no transcription in the absence of RNAP. In this case we calculate enrichment factors for each sequence between early and late time points of the transcription reaction to identify sequences that promote fast promoter escape.

2.5. Pre-processing of NGS data.

The outcome of NGS sequencing is a FastQ file containing raw read data and the associated base calling quality information (Fig. 1 shows, as an illustration, an example of FastQ file structure from a 100 bp Illumina sequencing run). To preprocess the NGS data we use FASTX tools (http://hannonlab.cshl.edu/fastx_toolkit/index.html), which can be used in command line mode or through Galaxy, an open web-based platform (http://galaxyproject.org). While other options are available to perform this pre-processing, we describe here pre-processing in Galaxy, as this is probably the most user-friendly approach and would be easy for a novice to use. Due to the large size of sequencing data files, it is much more convenient to work with Galaxy installed locally on the computer rather than using it remotely over the Internet. Galaxy for local installation can be downloaded from http://galaxyproject.org.

2.5.1. Splitting the data according to the barcodes.

The first step is to split the data according to the barcodes used for multiplexing the samples using the FASTX Barcode Splitter tool. A simple text file containing the tab-separated list of barcode sequences used and the names assigned to them is used as the input to this tool. The output is a set of FastQ files, each containing only the sequences with the corresponding barcode (Fig. 1) and thus corresponding to a specific experiment.

2.5.2. Trimming the data.

Sequences in FastQ files for individual experiments obtained as described above are longer than the segment of the template that was randomized. The next pre-processing step is trimming to remove constant sequences. The object of the trimming is to retain for further analysis only the parts of reads corresponding to the randomized segment of the template (for example, the 6 bp region of the randomized -10 element in the DNA template obtained using the O1 and O2 oligonucleotides). There are two options for trimming the data. The 5’ part of each read contains constant sequences corresponding to the barcode and the DNA template sequence complementary to the primer used for preparing the sequencing library (Fig. 1). One simple trimming option is to trim from the 5’ end of the reads the number of bases equal to the sum of the lengths of the barcode and the primer complementary sequence, and from the 3’ end all bases following the randomized sequence. We use the FASTX Trim Sequences tool in Galaxy, where the trimming is accomplished by entering the positions of the first and the last base to keep. Fig. 1 illustrates the outcome of such trimming where the first and last base to keep were set at 19 and 24, respectively. This simple trimming works fine in most cases but it assumes that all reads in the FastQ file are derived from the DNA template of interest (i.e. there was no significant contamination with unrelated DNA, for example, PCR artifacts). On some occasions we observed significant amounts of contaminating reads that could skew the results if not removed prior to further analysis. In such cases, before the trimming described above is performed, the reads can first be filtered to remove those that do not show the expected pattern of specific constant sequences separated by the correct length of randomized sequence. For example, in the case of the O1/O2 DNA template all the reads should contain the following string, which describes the proper length of the randomized sequence fragment and the constant sequences flanking it: CGGT[A-T][A-T][A-T][A-T][A-T][A-T]GGTT. NGS read data can thus be filtered to remove the reads that do not contain this string. We use the regular expression [35] based Select tool in Galaxy to perform this operation, with the above string entered as the filtering criterion.

2.5.3. Filtering by quality.

The last preprocessing step is to remove from the data reads with low base calling accuracy. We use the FASTX Filter by Quality tool, where the desired quality cut-off and the percent of bases that must have quality better than the cut-off value can be chosen (we typically use a cut-off value of 20 and 90% for these parameters).
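The pre-processing steps of Section 2.5 (barcode splitting, optional filtering on the flanking constant sequences, trimming to the randomized segment, and quality filtering) can be sketched in plain Python for readers who prefer a script to Galaxy. This is a minimal illustration under stated assumptions and not a re-implementation of the FASTX tools: all function names are hypothetical, barcodes are matched exactly at the start of the read, quality strings are assumed to use Phred+33 encoding, and a base passes if its quality is at least the cut-off. Note that, read as a regular-expression character range, [A-T] also matches N and other letters between A and T; the sketch uses the stricter class [ACGT], which is a substitution made here.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)

# Constant flanks of the O1/O2 template around a 6 nt randomized segment.
FLANK_PATTERN = re.compile(r"CGGT[ACGT]{6}GGTT")


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from 4-line FastQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual


def split_by_barcode(records: Iterable[Record],
                     barcodes: Dict[str, str]) -> Dict[str, List[Record]]:
    """Group reads by the barcode found at their 5' end (exact match only)."""
    bins: Dict[str, List[Record]] = {name: [] for name in barcodes.values()}
    for rec in records:
        for barcode, name in barcodes.items():
            if rec[1].startswith(barcode):
                bins[name].append(rec)
                break
    return bins


def has_expected_flanks(rec: Record) -> bool:
    """True if the read contains the constant flanks with a 6 nt insert."""
    return FLANK_PATTERN.search(rec[1]) is not None


def trim(rec: Record, first: int, last: int) -> Record:
    """Keep bases first..last (1-based, inclusive) of sequence and quality."""
    header, seq, qual = rec
    return header, seq[first - 1:last], qual[first - 1:last]


def passes_quality(rec: Record, min_q: int = 20,
                   min_percent: float = 90.0, offset: int = 33) -> bool:
    """True if at least min_percent of bases have quality >= min_q."""
    quals = [ord(ch) - offset for ch in rec[2]]
    if not quals:
        return False
    good = sum(q >= min_q for q in quals)
    return 100.0 * good / len(quals) >= min_percent


def preprocess(records: Iterable[Record], first: int, last: int) -> List[Record]:
    """Filter on flanks, trim to the randomized segment, then filter by quality."""
    trimmed = (trim(r, first, last) for r in records if has_expected_flanks(r))
    return [r for r in trimmed if passes_quality(r)]
```

With a 6 nt barcode followed by 12 nt of template specific primer sequence, calling preprocess(records, first=19, last=24) retains the 6 nt randomized segment, matching the positions used in Fig. 1.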

2.6. Quantitative analysis of NGS data.

2.6.1. Calculation of the number of reads for each sequence.

Calculation of the number of reads corresponding to each sequence of interest in the DNA template library is a key element of the analysis; it converts large, gigabyte-size sequencing data into much smaller numerical data that allow quantitative analyses of the template sequence dependence of RNAP activities. We use a relatively simple approach that counts only exact matches between the reads and the list of sequences of interest. The major advantage of this approach is that it is very fast, but by disallowing mismatches it does not take into account the non-ideal nature of the sequencing, where errors in base assignments are made with some non-zero frequency. However, this is not a significant disadvantage: in experiments with randomized templates, every single-base variant of a sequence is itself a legitimate member of the library, so allowing for mismatches would not make sense. We first use the FASTX Collapse Sequences tool in Galaxy, which transforms the FastQ file into a FASTA file in which the heading for each sequence includes the number of times the sequence appeared in the input FastQ file (Fig. 1). This FASTA file is then converted into a table with two columns listing the sequences and their corresponding read counts, respectively (Fig. 1), using the Galaxy FASTA-to-Tabular tool and Galaxy text manipulation tools. This table contains all sequences for which at least a single read was obtained. If there is a need to create a table with read counts for a specific subset of sequences, we use a simple R script based on the following command: merge(b,c,all.y=T), where b is the data table listing sequences and their read counts and c is a variable containing the list of the desired subset of sequences. The tables listing sequences and their corresponding read counts constitute the primary data that are the foundation of subsequent analyses. If allowing for mismatches in counting the reads is desired, FASTX Barcode Splitter or the tools from the Biostrings package in R could be used to count the reads.
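The exact-match counting and the subset lookup described in this section can also be sketched in a few lines of Python. This is an illustrative equivalent and not the Galaxy or R workflow itself; the function names are hypothetical. The subset lookup mimics the behavior of merge(b,c,all.y=T) in that every requested sequence is retained and sequences without reads receive a missing value (None here, NA in R).

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def count_reads(trimmed_reads: Iterable[str]) -> Counter:
    """Count exact occurrences of each trimmed read sequence."""
    return Counter(trimmed_reads)


def counts_for_subset(counts: Counter,
                      subset: Iterable[str]) -> Dict[str, Optional[int]]:
    """Read counts for the requested sequences; None marks sequences with no reads."""
    return {seq: counts.get(seq) for seq in subset}


if __name__ == "__main__":
    reads = ["TATAAT", "TATAAT", "TACAAT"]  # toy input, not experimental data
    table = count_reads(reads)
    print(counts_for_subset(table, ["TATAAT", "GGGGGG"]))
```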

2.6.2. Calculation of enrichment factors.

Preparation of the libraries for sequencing and sequencing itself introduce sequence bias, which needs to be corrected for. Fig. 2A illustrates how the raw read count can be affected by this sequence bias. The figure shows the raw read count distribution for all 4096 sequence variants of the 6 bp randomized segment of the O1/O2 duplex. Ideally, the read counts for all sequences should be very similar (assuming that oligonucleotide synthesis with the mixture of bases at each position incorporates each base randomly with the same probability) but instead the observed read count values span a range of about one order of magnitude (Fig. 2A). There was a significant correlation between read count and GC content of the sequence (Fig. 2B). PCR amplification used in preparing sequencing libraries has been identified as the main contributor to the sequence bias [34]. We eliminate sequencing bias by using the relative enrichment factor as the primary data for further analyses. Relative enrichment is the ratio of the read count for a given sequence in the experiment to the read count for the same sequence in the appropriate control experiment. In the case of DNA melting, our reference sample was free DNA that was run on the gel (or through a nitrocellulose filter) and processed to obtain sequencing libraries in the same way as the test samples. To calculate the enrichment factor we first normalize read counts in each analyzed file to the total of all reads in the file and then divide the appropriate columns from the tables (Fig. 1) listing sequences and their normalized read counts. Fig. 3A, in which enrichment factors between two independent reference samples (free DNA bands eluted from the gel) are plotted, illustrates how calculation of enrichment factors eliminates sequencing bias. Enrichment values were ~1, as expected for a comparison of two control samples, even though raw read counts could vary substantially (Fig. 2).
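The enrichment calculation described above (normalization of each sample to its total read count, followed by a per-sequence ratio of test to reference) can be written compactly. The Python sketch below is an illustration with hypothetical function names; it assumes that read counts are held in dictionaries keyed by sequence and it skips sequences that have no reads in the reference sample, a choice made here to avoid division by zero.

```python
from typing import Dict


def normalize(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw read counts to fractions of the total reads in the sample."""
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


def relative_enrichment(test: Dict[str, int],
                        reference: Dict[str, int]) -> Dict[str, float]:
    """Ratio of normalized test count to normalized reference count per sequence."""
    test_frac = normalize(test)
    ref_frac = normalize(reference)
    return {
        seq: test_frac.get(seq, 0.0) / frac
        for seq, frac in ref_frac.items()
        if frac > 0  # sequences absent from the reference are skipped
    }


if __name__ == "__main__":
    # Toy counts, not experimental data.
    reference = {"A": 100, "B": 100}
    test = {"A": 100, "B": 300}
    print(relative_enrichment(test, reference))
```

In the toy example the amount of sequence B is three times higher in the test sample while A is unchanged; because read counts report fractions, the enrichment values are 1.5 for B and 0.5 for A. Two samples that differ only in sequencing depth give enrichment values of 1 for every sequence.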

While the calculation of enrichment factors using an appropriate reference sample corrects for sequence bias, the fact that the read counts for various sequences span a large range of values will have some effect on the outcomes of the experiment. The error of the enrichment value increases as the read count decreases, and thus the enrichment data for sequences with low read count will be noisier. In contrast to Fig. 3A, enrichment values calculated between the test and control samples show preferential enrichment of specific sequences (Fig. 3B). Data in Fig. 3B are for an experiment in which DNA template sequence dependence of open complex formation was probed using nitrocellulose filter binding to separate the open complex. These results are very similar to our previously published data [11], which used gel electrophoresis to isolate the open complex, validating the use of nitrocellulose filters as a separation step in these experiments.

2.6.3. Enrichment factors as experimental observables.

Enrichment factors were demonstrated to properly reproduce previously determined effects of base substitutions in the -10 promoter element (Fig. 4A) and also correlate reasonably well with the kinetics of promoter melting determined by biochemical assays (Fig. 4B), confirming their utility as an experimental observable. However, it is important to emphasize that when NGS read count is used as a quantitative readout in in vitro biochemical experiments, it reports the fraction of each sequence in the sample rather than its amount. This has at least two interesting and important consequences for the behavior of relative enrichment as the observable.

The first is the dependence of the range of the changes in relative enrichment on the total number of sequences in the sample. To illustrate this, one can consider a sample consisting of just two sequences that are initially present in equal amounts (i.e. their corresponding fractions are 0.5). If the process under study enriches one of these sequences so much that it is the only sequence remaining in the sample (i.e. its fraction in the sample becomes 1), its relative enrichment value will be 2. Thus, the range of the changes in enrichment value for this sample is from 0 to 2. However, if the sample initially contains 100 different sequences (each at a fraction of 0.01) and one of them is enriched by the process of interest such that it is the only remaining sequence in the sample (fraction equal to 1), its enrichment value will be 100 and the range of possible enrichment values will thus be from 0 to 100.

The second interesting property of relative enrichment, derived from the fact that it reports the fractions and not the total amounts of specific sequences in the sample, is its nonlinear behavior. This again can be illustrated by a simple example of a hypothetical sample containing two sequences, each initially at a fraction of 0.5. If the process under study does not change the amount of one sequence but increases the amount of the other sequence three-fold, their respective fractions become 0.25 and 0.75 and the corresponding relative enrichments will be 0.5 and 1.5, so a three-fold change in amount appears as only a 1.5-fold enrichment. These properties of relative enrichment as the observable should be taken into account when designing NGS-based analyses and interpreting the outcomes of these experiments. For example, if the time course of a process is studied, the time dependence of relative enrichment for sequences differing greatly in their kinetics of enrichment can exhibit quite complicated behavior (Fig. 4C). For sequences with fast enrichment kinetics, the initial rapid increase in the relative enrichment would be followed by a decline at longer times (Fig. 4C) because the sequences with slower kinetics will slowly build up, decreasing the fraction of the sequences that were initially preferentially enriched.

2.7. Interpretation of the outcomes of NGS-based analysis.

The data such as in Fig.
3B describe DNA template sequence dependence of the protein activity (in this case, promoter melting activity) but of course they do not provide an explanation for why these sequence dependencies occur. However, this exhaustive experimental data provides foundation for further analyses to uncover the molecular mechanisms behind DNA template sequence control protein activity. These data can be

17

interrogated by computational tools to formulate testable hypotheses. The simplest form of the subsequent analysis of NGS-derived data is to determine if any obvious sequence patterns could be discovered that can be associated with the protein activity using sequence alignment tools. For example, the data in Fig. 3B can be filtered to include only the sequences that were highly enriched by promoter melting reaction. Sequence logo of such sequences shows expected strong preference for A at -11 and T at -7 (Fig. 5A) and correlates well with the pattern of evolutionary conserved bases in -10 element [15]. This plot is very similar to the one that we previously obtained using gel electrophoresis for isolating the open complex [11], again validating the use of nitrocellulose filter approach to isolate open complexes for NGS-based analyses. Fig. 5B lists the top 10 (highest relative enrichment) sequences and their associated enrichment factors. The sequence with the highest enrichment factor corresponds to the consensus -10 sequence. All sequences on this list have A at -11 and T at -7 further illustrating the importance of these bases for efficient melting of the promoter by RNAP. Further correlations between base identities at different positions can be investigated by filtering the relative enrichment data using regular expressions for specific combinations of bases and examining the corresponding relative enrichment values. Further analyses could also involve examining correlations between specific DNA sequence dependent properties of DNA template and experimental relative enrichment values to test if these properties of DNA could play a role in the process under study. For example, we examined and observed a correlation between relative enrichment values and the thermodynamic stability of DNA duplex within -10 region (Fig. 6) suggesting that the energy of DNA melting in this region plays a role in the kinetics of open complex formation. 
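
The regular-expression filtering described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used to produce Fig. 3B or Fig. 5. The read counts are made-up numbers and the function names (`relative_enrichment`, `filter_by_pattern`) are hypothetical. The sketch assumes that relative enrichment is the ratio of a sequence's read fraction in the test sample to its read fraction in the control sample, and that the 6 bp randomized segment is written 5' to 3' so that string positions 1 to 6 correspond to promoter positions -12 to -7.

```python
import re

def relative_enrichment(test_counts, control_counts):
    """Ratio of each sequence's read fraction in the test sample to its
    read fraction in the control sample (assumed definition)."""
    test_total = sum(test_counts.values())
    control_total = sum(control_counts.values())
    enrichment = {}
    for seq, control_reads in control_counts.items():
        if control_reads == 0:
            continue  # the ratio is undefined without control reads
        test_fraction = test_counts.get(seq, 0) / test_total
        control_fraction = control_reads / control_total
        enrichment[seq] = test_fraction / control_fraction
    return enrichment

def filter_by_pattern(enrichment, pattern):
    """Keep only sequences that fully match a regular expression."""
    compiled = re.compile(pattern)
    return {s: e for s, e in enrichment.items() if compiled.fullmatch(s)}

# Made-up read counts for four 6 bp variants (illustration only).
control = {"TATAAT": 1000, "GATAAT": 1000, "TCTAAT": 1000, "TATAAC": 1000}
test = {"TATAAT": 2400, "GATAAT": 1200, "TCTAAT": 300, "TATAAC": 100}

enrichment = relative_enrichment(test, control)

# A at -11 (second base) and T at -7 (sixth base); any base elsewhere.
a11_t7 = filter_by_pattern(enrichment, r".A...T")
for seq, value in sorted(a11_t7.items(), key=lambda kv: -kv[1]):
    print(seq, round(value, 2))
```

Other base combinations can be examined by changing the pattern, for example `r"[AT]A...T"` to restrict position -12 to A or T while still requiring A at -11 and T at -7.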
The nature of these analyses will obviously depend on the process under study and will thus be different in each case. In most cases, though, further follow-up experiments will be required to validate the conclusions derived from the in-depth sequence dependence data obtained by the NGS-based approach.
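
The fraction-based behavior of relative enrichment discussed in Section 2.6.2 can be reproduced numerically. The sketch below is illustrative only. It assumes that relative enrichment is the fraction of a sequence after the process divided by its fraction before, that each template starts at the same amount, and that open complexes accumulate with simple first-order kinetics, 1 - exp(-kt). The function names are hypothetical. The rate constants are those listed in the legend of Fig. 4C, but the model is not necessarily the one used to calculate that figure.

```python
import math

def fractions(amounts):
    """Convert amounts to fractions of the total."""
    total = sum(amounts)
    return [a / total for a in amounts]

def relative_enrichment(before, after):
    """Fraction after the process divided by the fraction before it."""
    return [fa / fb for fb, fa in zip(fractions(before), fractions(after))]

# Range depends on library size: one sequence takes over the whole sample.
print(relative_enrichment([1, 1], [1, 0]))                # two sequences
print(relative_enrichment([1] * 100, [1] + [0] * 99)[0])  # 100 sequences

# Nonlinearity: one amount is unchanged, the other is tripled.
print(relative_enrichment([1, 1], [1, 3]))

def enrichment_time_course(rate_constants, times):
    """Relative enrichment of each template at each time point (t > 0),
    assuming equal starting amounts and first-order kinetics."""
    n = len(rate_constants)
    course = []
    for t in times:
        formed = [1.0 - math.exp(-k * t) for k in rate_constants]
        total = sum(formed)
        # Initial fraction of every template is 1/n, so dividing by it
        # is the same as multiplying by n.
        course.append([(f / total) * n for f in formed])
    return course

rates = [0.5, 0.05, 0.01, 0.004, 0.000005]  # s^-1, from the Fig. 4C legend
times = [6, 60, 600, 6000]                  # s, arbitrary sampling times
for t, row in zip(times, enrichment_time_course(rates, times)):
    print(t, [round(x, 2) for x in row])
```

Under these assumptions the fastest template has its highest relative enrichment at the earliest time point and declines as the slower templates accumulate, which is the qualitative behavior described for Fig. 4C.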

3. Discussion and future directions. Our data demonstrate that NGS can be used as a reliable, reproducible biochemical readout in in vitro experiments testing DNA template dependence of RNAP activities using promoter libraries containing thousands of sequence variants. While we used the promoter melting reaction here as an example to illustrate the NGS-based approach, it will certainly be possible to adapt this approach for analyzing DNA template dependencies of many other RNAP activities and, more generally, of activities of other proteins that act on DNA or RNA templates. The two critical elements that will allow such analyses are the preparation of an appropriate DNA library and an appropriate experimental procedure that allows isolation of the DNA/RNA template fraction enriched by the protein activity of interest. The in vitro analytical power of NGS as an observable in highly parallel biochemical experiments is not necessarily limited to the analysis of nucleic acids. We have recently described an NGS-based approach [36] that allows parallel analysis of interactions of thousands of peptides. In this approach, the peptides are produced from DNA libraries by in vitro transcription/translation under experimental conditions that result in tagging of each peptide in the library with its own coding RNA. This tagging then allows the use of NGS as a quantitative readout of the relative amounts of each peptide bound to the target. The rapidly increasing accessibility of NGS techniques will undoubtedly further widen the applicability of this technology beyond its standard current uses.

4. Acknowledgement

This work was supported by NIH grants R21AI112919 and R01GM109974.

5. References.

1. Voelkerding, K.V., S.A. Dames, and J.D. Durtschi, Next-generation sequencing: from basic research to diagnostics. Clin Chem, 2009. 55(4): p. 641-58.
2. Niedringhaus, T.P., et al., Landscape of next-generation sequencing technologies. Anal Chem, 2011. 83(12): p. 4327-41.
3. Metzker, M.L., Sequencing technologies - the next generation. Nat Rev Genet, 2010. 11(1): p. 31-46.
4. Meyer, M. and M. Kircher, Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc, 2010. 2010(6): p. pdb prot5448.
5. van Dijk, E.L., et al., Ten years of next-generation sequencing technology. Trends Genet, 2014. 30(9): p. 418-26.
6. Buermans, H.P. and J.T. den Dunnen, Next generation sequencing technology: Advances and applications. Biochim Biophys Acta, 2014. 1842(10): p. 1932-41.
7. Wang, Z., M. Gerstein, and M. Snyder, RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 2009. 10(1): p. 57-63.
8. Barski, A., et al., High-resolution profiling of histone methylations in the human genome. Cell, 2007. 129(4): p. 823-37.
9. Johnson, D.S., et al., Genome-wide mapping of in vivo protein-DNA interactions. Science, 2007. 316(5830): p. 1497-502.
10. Robertson, G., et al., Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods, 2007. 4(8): p. 651-7.
11. Heyduk, E. and T. Heyduk, Next generation sequencing-based parallel analysis of melting kinetics of 4096 variants of a bacterial promoter. Biochemistry, 2014. 53(2): p. 282-92.
12. RNA Polymerases as Molecular Motors, ed. H. Buc and T. Strick. 2009: Royal Society of Chemistry.
13. Record. M.T.Jr, R., W.S., Craig, M.L., McQuade, K.L., Schlax, P.J., in Escherichia coli and Salmonella: cellular and molecular biology., F.C. Neidhardt, Curtis III, R., Ingraham J.L., Lin, E.C.C., Low, K.R., Magasanik, B., Reznikoff, W.S., Riley, M., Schaechter, M., Umbarger, H.E., Editor. 1996, ASM Press: Washington, DC. p. 792-820.
14. Pribnow, D., Nucleotide sequence of an RNA polymerase binding site at an early T7 promoter. Proc Natl Acad Sci U S A, 1975. 72(3): p. 784-8.
15. Shultzaberger, R.K., et al., Anatomy of Escherichia coli sigma70 promoters. Nucleic Acids Res, 2007. 35(3): p. 771-88.
16. Saecker, R.M., M.T. Record, Jr., and P.L. Dehaseth, Mechanism of bacterial transcription initiation: RNA polymerase - promoter binding, isomerization to initiation-competent open complexes, and initiation of RNA synthesis. J Mol Biol, 2011. 412(5): p. 754-71.
17. Savinkova, L.K., et al., [Binding of RNA-polymerase from Escherichia coli with oligodeoxyribonucleotides homologous to transcribed and non-transcribed DNA stands in the "-10"-promoter region of bacterial genes]. Mol Biol (Mosk), 1988. 22(3): p. 807-12.
18. Marr, M.T. and J.W. Roberts, Promoter recognition as measured by binding of polymerase to nontemplate strand oligonucleotide. Science, 1997. 276(5316): p. 1258-60.
19. Roberts, C.W. and J.W. Roberts, Base-specific recognition of the nontemplate strand of promoter DNA by E. coli RNA polymerase. Cell, 1996. 86(3): p. 495-501.
20. Callaci, S. and T. Heyduk, Conformation and DNA binding properties of a single-stranded DNA binding region of sigma 70 subunit from Escherichia coli RNA polymerase are modulated by an interaction with the core enzyme. Biochemistry, 1998. 37(10): p. 3312-20.
21. Ring, B.Z., W.S. Yarnell, and J.W. Roberts, Function of E. coli RNA polymerase sigma factor sigma 70 in promoter-proximal pausing. Cell, 1996. 86(3): p. 485-93.
22. Matlock, D.L. and T. Heyduk, Sequence determinants for the recognition of the fork junction DNA containing the -10 region of promoter DNA by E. coli RNA polymerase. Biochemistry, 2000. 39(40): p. 12274-83.
23. Tsujikawa, L., et al., RNA polymerase alters the mobility of an A-residue crucial to polymerase-induced melting of promoter DNA. Biochemistry, 2002. 41(51): p. 15334-41.
24. Heyduk, E., et al., A consensus adenine at position -11 of the nontemplate strand of bacterial promoter is important for nucleation of promoter melting. J Biol Chem, 2006. 281(18): p. 12362-9.
25. Schroeder, L.A., et al., Evidence for a tyrosine-adenine stacking interaction and for a short-lived open intermediate subsequent to initial binding of Escherichia coli RNA polymerase to promoter DNA. J Mol Biol, 2009. 385(2): p. 339-49.
26. Schroeder, L.A., M.E. Karpen, and P.L. deHaseth, Threonine 429 of Escherichia coli sigma 70 is a key participant in promoter DNA melting by RNA polymerase. J Mol Biol, 2008. 376(1): p. 153-65.
27. Feklistov, A. and S.A. Darst, Structural basis for promoter -10 element recognition by the bacterial RNA polymerase sigma subunit. Cell, 2011. 147(6): p. 1257-69.
28. Ko, J. and T. Heyduk, Kinetics of promoter escape by bacterial RNA polymerase: effects of promoter contacts and transcription bubble collapse. Biochem J, 2014. 463(1): p. 135-44.
29. Chan, C.L. and C.A. Gross, The anti-initial transcribed sequence, a portable sequence that impedes promoter escape, requires sigma70 for function. J Biol Chem, 2001. 276(41): p. 38201-9.
30. Hsu, L.M., et al., In vitro studies of transcript initiation by Escherichia coli RNA polymerase. 1. RNA chain initiation, abortive initiation, and promoter escape at three bacteriophage promoters. Biochemistry, 2003. 42(13): p. 3777-86.
31. Vo, N.V., et al., In vitro studies of transcript initiation by Escherichia coli RNA polymerase. 3. Influences of individual DNA elements within the promoter recognition region on abortive initiation and promoter escape. Biochemistry, 2003. 42(13): p. 3798-811.
32. Vo, N.V., et al., In vitro studies of transcript initiation by Escherichia coli RNA polymerase. 2. Formation and characterization of two distinct classes of initial transcribing complexes. Biochemistry, 2003. 42(13): p. 3787-97.
33. Hsu, L.M., Monitoring abortive initiation. Methods, 2009. 47(1): p. 25-36.
34. Aird, D., et al., Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol, 2011. 12(2): p. R18.
35. Friedl, J., Mastering Regular Expressions. 2002: O'Reilly.
36. Heyduk, E. and T. Heyduk, Ribosome display enhanced by next generation sequencing: A tool to identify antibody-specific peptide ligands. Anal Biochem, 2014. 464C: p. 73-82.
37. Schneider, T.D. and R.M. Stephens, Sequence logos: a new way to display consensus sequences. Nucleic Acids Res, 1990. 18(20): p. 6097-100.
38. Crooks, G.E., et al., WebLogo: a sequence logo generator. Genome Res, 2004. 14(6): p. 1188-90.

6. Figure Legends

Figure 1. Illustration of the structure of raw NGS data (FastQ file) and its pre-processing. The top panel shows an example of a FastQ data file produced by a 100 bp Illumina sequencing run. The example shows three reads (two of which are identical to facilitate illustration of the pre-processing). Each read is described by four lines of text: the first line describes parameters of the run, the second line shows the sequence, the third line is a “+” character, and the fourth line provides coded information about the quality of base assignments. Sequences highlighted in blue, green and red correspond to the barcode used, the constant sequence of the barcoding primer, and the randomized segment of the DNA template, respectively. The outcomes of the trimming (a), collapsing (b), and FASTA-to-table (c) pre-processing steps are shown.

Figure 2. (A) Distribution of read counts for each of the 4096 6 bp sequences in the DNA library obtained from O1 and O2 oligonucleotides. (B) Correlation between read count and GC content of the 6 bp randomized segment of the promoter.

Figure 3. Relative enrichment for each of the 4096 6 bp sequences in the DNA library obtained from O1 and O2 oligonucleotides (A) between two independent DNA samples that were run through a nitrocellulose filter in the absence of RNAP and processed the same way as the open complexes, and (B) between the open complexes (formed after 6 s of the promoter melting reaction) isolated using a nitrocellulose filter and the control sample (free DNA).

Figure 4. (A) Relative enrichment factors for the indicated mutants of the consensus -10 element (TATAAT) and the λPR wt -10 element sequence (GATAAT). (B) Correlation between open complex formation rates determined by standard methodology for promoters containing 14 sequence variants of the -10 element (data from [24]) and enrichment factors at the 6 s time point for the same sequences. Data for mutants at -11 are in red. (C) Calculated time course of enrichment factor changes for a mixture of 5 DNA templates with the following rate constants of open complex formation: 0.5 s⁻¹ (black), 0.05 s⁻¹ (red), 0.01 s⁻¹ (green), 0.004 s⁻¹ (blue), and 0.000005 s⁻¹ (magenta). Reprinted with permission from Heyduk, E. and Heyduk, T. Biochemistry 53(2), 282-92. Copyright (2014) American Chemical Society.

Figure 5. (A) Sequence logo [37] of highly enriched (relative enrichment >2.5) sequences for the data set shown in Fig. 3B. The logo was prepared using WebLogo [38]. (B) Ten sequences with the highest relative enrichment values.

Figure 6. Correlation between relative enrichment at the 6 s time point and the free energy of DNA duplex melting for all 4096 -10 element sequence variants. Reprinted (adapted) with permission from Heyduk, E. and Heyduk, T. Biochemistry 53(2), 282-92. Copyright (2014) American Chemical Society.

Fig. 1.

Fig. 2.

Fig. 3.

Fig. 4.

Fig. 5.

Fig. 6.

Highlights
• NGS could be used as a reliable quantitative readout in in vitro experiments.
• RNAP activity on thousands of DNA template sequence variants can be studied in one reaction.
• The approach allows a depth of analysis not accessible to standard approaches.
• The approach can be adapted to studies of any protein acting on a DNA template.
