Published by the International Society of Protistologists

The Journal of

Eukaryotic Microbiology

Journal of Eukaryotic Microbiology ISSN 1066-5234

SHORT COMMUNICATION

Exploring Biotic Interactions Within Protist Cell Populations Using Network Methods Shu Chenga,1, Dana C. Pricea,1, Slim Karkarb & Debashish Bhattacharyaa,b a Department of Ecology, Evolution and Natural Resources, Rutgers University, New Brunswick, New Jersey, 08901, USA b Institute of Marine and Coastal Science, Rutgers University, New Brunswick, New Jersey, 08901, USA

Keywords Assortativity; Bigelowiella natans; network analysis; Paulinella ovalis; scale-free network; single cell genomics. Correspondence D. Bhattacharya, Rutgers University, 59 Dudley Road, Foran Hall 102, New Brunswick, New Jersey, 08901, USA Telephone number: +1 848-932-6218; FAX number: +1 732-932-8746; e-mail: [email protected] 1

ABSTRACT The study of diseased human cells and of cells isolated from the natural environment will likely be revolutionized by single cell genomics (SCG). Here, we used protein similarity networks to explore within- and between-cell DNA differences from SCG data derived from six individual rhizarian cells related to Paulinella ovalis and proteins from the complete genome of another rhizarian, Bigelowiella natans. We identified shared and distinct DNA components within our SCG data and between P. ovalis and B. natans. We show that network properties such as assortativity and degree effectively discriminate genome features between SCG assemblies and that SCG data follow the power law with a small number of protein families dominating networks.

Equal contribution made by these authors.

Received: 4 October 2013; revised 6 January 2014; accepted February 2, 2014. doi:10.1111/jeu.12113

NETWORK methods offer rapid and powerful tools for analyzing complex information such as genomic or functional genomic data (e.g. Bapteste et al. 2012; Barab asi and Oltvai 2004; Komurov et al. 2012; Zhou et al. 2010). Rather than studying collections of individual genes, it is possible with network methods to interpret the biology of cells and interactions between them as a system comprised of thousands of interacting components. Genomewide protein networks can be built using measures of similarity (e.g. with a BLAST cutoff value) to create edges that connect the different nodes (proteins) with groups of related sequences in a network graph referred to as a connected component (Beauregard-Racine et al. 2011). The components represent genes and gene families that are shared among the studied genomes. Network analysis does not require predictive gene modeling and can be implemented with freely available tools such as Cytoscape (http://www.cytoscape.org/). This straightforward and inclusive approach has been used to identify ancient connections between members of gene families (i.e. when relaxed cutoff values are used), for defining horizontal gene transfer events, and for recognizing gene fusions that may have important functional consequences (e.g. Alvarez-Ponce et al. 2013; Beauregard-Racine et al. 2011).

These types of weak or reticulate evolutionary signal are often difficult to assess using phylogenetic methods that rely on simultaneous multiple sequence alignments to generate bifurcating trees. Metagenomic and single cell genome (SCG) data from natural samples (e.g. Halary et al. 2010; Kalisky et al. 2011; Yoon et al. 2011) offer other promising targets for applying network approaches to study gene distribution. Despite the promise, SCG methods still require considerable refinement because these genome assemblies may show significant coverage bias introduced by multiple displacement amplification (MDA, Rodrigue et al. 2009; Woyke et al. 2010 [although the recently developed MALBAC procedure may ameliorate this issue; Zong et al. 2012]). In addition, the challenge remains to assemble and analyze complex DNA mixtures that include the host DNA and potentially, associated nucleic acids from symbionts, pathogens, and prey (Bhattacharya et al. 2012, 2013; Yoon et al. 2011). Given that population-level SCG data will likely become more widely available, here we explore the use of network methods to study the protein complements of a wild-caught sample of microbial eukaryotes. The data are derived from six individual Paulinella ovalislike cells (phagotrophic rhizarian protists) that form a sister

© 2014 The Author(s) Journal of Eukaryotic Microbiology © 2014 International Society of Protistologists Journal of Eukaryotic Microbiology 2014, 61, 399–403

399

Single Cell Network Analysis

Cheng et al.

group to photosynthetic Paulinella species and were previously isolated from a single sample collected in Chesapeake Bay, U.S. All of the P. ovalis-like cells share 100% small subunit rDNA identity (Bhattacharya et al. 2012). The MDA-derived DNA was sequenced using the Roche 454 and Illumina platforms (for details, see Table S1 and Bhattacharya et al. 2012). A seventh and complete proteome from the photosynthetic (chlorarachniophyte) rhizarian Bigelowiella natans (Curtis et al. 2012) was included in the analysis to provide a eukaryotic reference point when interpreting the P. ovalis-like network graphs. Our primary goal was to create network data structures that highlight both the (meta) genomic similarities and differences between the six conspecific protist cells, and more broadly, between the two rhizarian species. All six P. ovalis-like cells were sampled at the same time and location, therefore the network components that discriminate (e.g. are “assortative”, see below) these genome data represent local differences in biotic interactions among the individual cells and their environment. Assortativity in this instance is a measure of the tendency of sequences from a given SCG assembly to preferentially connect to each other in a connected component. The nominal assortative coefficient ri for each SCG and between each SCG was calculated to determine if protein sequences tend to be assortative with respect to origin. This approach affords a priori identification and selection of interesting genome features (i.e. a “visual snapshot”) prior to computationally exhaustive BLAST and phylogenomic analyses, and can act as a final check on metagenome sequencing data to ensure that variability has been properly described. MATERIALS AND METHODS Protein similarity networks The program EGN (Halary et al. 2010; http://www.evol-net.fr) was used to build sequence similarity networks, defined by BLASTp protein sequence identity ≥ 40% and an e-value threshold < 1e5. Each subgraph (or component) in the network with such edges represents an operational gene family whose sequences do not share significant similarity to other components. Proteins in the data sets were annotated using Blast2GO (Conesa et al. 2005). Two undirected networks (G-6) and (G-7) were generated: G-6 was made using open reading frames (ORFs) of ≥ 30 amino acids extracted via EMBOSS (Rice et al. 2000) and clustered via CD-HIT (Li and Godzik 2006) at 92.5% amino acid similarity from each assembly of six P. ovalis cells (Bhattacharya et al. 2012), whereas G-7 was made using ORFs from the six individual P. ovalis cells and the predicted proteome from B. natans (Curtis et al. 2012). All analyses are based on the G-6 network unless otherwise specified. The 454 sequence data used to assemble the P. ovalis-like genomes are archived at the NCBI Sequence Read Archive (SRA) under Accession SRA049870.1. The 454 assemblies for each P. ovalis-like cell and the ORFs

400

used for this study are available at http://dbdata.rutgers. edu/data/ovalis/. Homology network We constructed a homology network, G-h, in which “Homology edges” were retained from the previous network G-6 when the minimal match coverage was ≥ 90% of the query sequence length. This cutoff resulted in edges that limited the networks to putative homologs resulting in over one-half (56%) of the proteins from the G-6 network being removed due to a null degree of connection (i.e. they were singletons). The G-h network contained 10,726 components with a total of 25,152 proteins, based on using only the 454 data from the six P. ovalislike cells. Given the large number of components representing one or few genes/gene families, 131 connected components of size ≥ 5 proteins were used to calculate assortativity (see below). Network visualization and analysis We analyzed the major component of network G-7 using the Louvain community detection method (Blondel et al. 2008). This is a graph clustering heuristic based on modularity optimization that identifies densely connected regions of a graph and resolves a hierarchical clustering structure within networks. On the basis of the clusters provided by the algorithm at the first level, we visualized the main component of G-7 using Cytoscape, with each cluster corresponding to a node. The weight of the links between the cluster nodes corresponds to the number of links between the proteins in the original network G-7. We measured assortativity in the connected components from the homology network G-h with respect to the six SCGs. Here, consider the element eij of a 2 9 2 mixing matrix X which denotes the observed frequency of edges between proteins of two categories i and j :

i i X= j Σ

eji

j eij

Σ

1 e þe

In the undirected case, eij ¼ eji ¼ ij 2 ji , eii denotes the frequency of links between nodes of the same category i, and ai and is the frequency of links involving proteins of category i. Then, the assortativity coefficient is defined similarly to Newman (2003): P P 2 eii  a ri ¼ i P 2i i 1  i ai Here, proteins of category i are proteins from the cell to be considered, whereas proteins of category j are the proteins from the other five cells. A positive coefficient indicates that nodes of the same category tend to be linked together, whereas a negative value indicates that nodes of different categories tend to be linked together.

© 2014 The Author(s) Journal of Eukaryotic Microbiology © 2014 International Society of Protistologists Journal of Eukaryotic Microbiology 2014, 61, 399–403

Single Cell Network Analysis

Cheng et al.

RESULTS AND DISCUSSION Analysis of protein composition As described above, here we used networks primarily as a visualization tool for complex SCG data to generate knowledge and testable hypotheses (e.g. Atkinson et al. 2009) about the ecology and genome biology of P. ovalislike cells. Analysis of the individual P. ovalis-like SCG assemblies using the Core Eukaryotic Genes Mapping Approach (CEGMA; http://korflab.ucdavis.edu/datasets/cegma/) showed that across all cells, 240 unique hits (BLASTx e-value cutoff ≤ 1e10) exist to the KOG database of 458 core eukaryotic proteins (Parra et al. 2007). Individual cell assemblies however show between 24 (Cell 5) to 141 (Cell 1) significant hits to these core proteins (Cell 6 was not included in the analysis) and only 11 protein hits were shared across all the cell assemblies. Of a set of 248 “ultra-conserved” eukaryotic proteins available in CEGMA, we recovered hits to 122 of these sequences across all cells. These data demonstrate that the combined 454 SCG assemblies account for about one-half of the expected set of core eukaryotic proteins and that these individual cell assemblies provide largely nonoverlapping data with regard to the expected nuclear genome complement of P. ovalis-like cells. To explore whether this result reflects only issues associated with genome amplification bias or a combination of bias and insufficient data, we re-ran CEGMA with Cells 1 and 2 for which we had generated significantly more Illumina paired-end data (4.7 and 3.8 Gbp, respectively, Bhattacharya et al. 2012). No such data existed for Cells 3–6. Again, using a BLASTx e-value cutoff ≤ 1e10, we found 356/458 (75.5%) of the core eukaryotic proteins and 180/248 (72.5%) of the ultraconserved set in the Cell 1 assembly. These numbers were 333/458 (72.7%) and 164/248 (66.1%), respectively, for Cell 2. Therefore, the large majority of core eukaryotic proteins are present in these P. ovalis-like MDA samples, although they were not recovered in the exploratory 454 sequencing approach used here for the six cells. Protein networks The G-7 network was created using ORFs of length > 30 amino acids from 180 to 308 Mbp of 454 data from each P. ovalis-like cell (Table S1) and included predicted proteins from the B. natans genome project. The largest connected component joining proteins of a putative common origin (i.e., either a shared ancestry of the entire protein or of a protein domain) was comprised of 1,043 sequences. The cluster-based representation of this component (Fig. 1A) uses a pie chart to show the relative abundance of cells for each cluster. This analysis identifies the broadly conserved proteins or protein domains (i.e. > 30 amino acids in length) that are shared by all six P. ovalis-like cells along with B. natans. Within the component, the largest cluster (#1; Fig. 1A) contains constitutive components of eukaryote genomes (predominantly kinases and ribonucleotide/nucleoside [ATP] binding

enzymes). The two clusters sharing a dense edge (#2, #3) provide an example of the network’s ability to differentiate between functional gene families and simple domain sharing. Both clusters are enriched in ankyrin repeat domain-containing proteins (ARD, 167 of 238 total sequences). This widespread domain (Mosavi et al. 2002) is responsible for the dense edge between methyltransferases from cluster #3 and proteins involved in signal transduction in cluster #2. This result emphasizes the capacity of cluster-based analysis to correctly isolate signal (the shared ARD domain) from noise in classifying proteins that have distinct functions. To identify proteins associated with each cell that potentially provide insights into shared and unique biotic interactions, we computed assortativity of the sequences from the six SCG cells in the G-h network (that only included P. ovalis-like SCG data). A heatmap of assortativity values for each cell (Fig. 1B) shows that these data, as expected due to their con-specificity, generally lack an assortative pattern. Cell 1 has the most significant amount of unique DNA associated with it, which results in very high assortativity (red region indicated with an arrow) for a small number of components. This analysis provides a rapid method for assessing the relative contribution of noneukaryote DNA associated with a SCG assembly. BLASTp analysis confirms this result by showing that Cell 1 is the most divergent SCG in this analysis with a large number of proteobacterial proteins (Fig. S1). The cumulative increase in the number of individual protein families (i.e. components) as cells are added to the analysis (Fig. 1C) demonstrates the complexity of natural samples, whereby each SCG introduces novel DNA associated with it. With deeper sequencing, the collection of P. ovalis-like protein families will presumably converge (e.g. at ca. 10–15 K genes as in many unicellular protists; see Fig. 1C) as the host genome is fully sequenced. Additional growth will derive from biotic interactions that give rise to foreign DNAs present in the environment and associated with individual cells, and can be identified via their assortativity patterns. It is widely recognized that various genome features (e.g. types of protein folds, nucleotide patterns; Luscombe et al. 2002; Proulx et al. 2005) follow power-law distributions, whereby a few types of features are dominant. We find here that the probability that a protein in our network has k edges to another node (or degree k) also follows a power-law distribution P(k) ~ kc, in which c = 2.39. The protein families with the greatest number of edges in Fig. 1D (e.g. Na [+]-dependent inorganic phosphate cotransporter, prolyl aminopeptidase) are ancient, functionally important proteins that have undergone gene duplication and divergence, processes thought to underlie the power-law behavior of DNA data (Luscombe et al. 2002). The degree distribution shown here provides the opportunity to identify, in uncharacterized SCG data, conserved eukaryotic proteins for downstream analysis. In summary, our exploratory study demonstrates the utility of network topology structure to accurately represent the complex nature of SCG data derived from natural

© 2014 The Author(s) Journal of Eukaryotic Microbiology © 2014 International Society of Protistologists Journal of Eukaryotic Microbiology 2014, 61, 399–403

401

Single Cell Network Analysis

Cheng et al.

Figure 1 Network analysis of proteins from Paulinella ovalis-like Cells 1–6. A. Network representation of genome data from six P. ovalis-like cells and from Bigelowiella natans reveals the extent of protein sequence sharing between the targeted genomes. Circles with pie charts show the level of representation of different cells and of B. natans in the communities. Edges (gray lines) are weighted by the number of sequences connecting any two communities. B. Heatmap showing cell-based assortativity coefficients. The arrow identifies the relatively high assortativity of Cell 1 data. C. Cumulative growth in unique gene families (represented by component counts in the network) based on 454 data from the six SCGs (light gray panel, black solid line shows the linear fit). The broken line represents the approximate gene family number in the target eukaryote that does not continue to grow with the addition of SCG data once the nuclear genome has been fully sequenced with SCG analysis. D. Scale-free degree distribution of protein connections (k) in the P. ovalis-like cell networks is described as a straight line on this log–log plot.

environments. We show that network community analysis effectively highlights shared genome components and is robust against arbitrary sequence similarity, whereas properties such as assortativity and degree allow the identification of both outlier cells and core sets of shared proteins for detailed study. As the growing power of SCG (e.g. Zong et al. 2012) allows the assembly of near complete SCGs and associated biota, it will be possible to emphasize genome differences in entire populations of cells, both free living and as tissue constituents (e.g. cancer cells) from normal and diseased states. The use of differing identity and coverage values will result in components that can serve downstream uses such as phylogenetics, single nucleotide polymorphism analysis, and to identify unique and shared prey, symbionts, and pathogens (e.g. Bhattacharya et al. 2012; Yoon et al. 2011).

402

ACKNOWLEDGMENTS This work was partially supported by NSF grants 0827023 and 0936884. SC was supported by the Gordon and Betty Moore Foundation through Grant GBMF2807 to Paul Falkowski at Rutgers University. LITERATURE CITED Alvarez-Ponce, D., Lopez, P., Bapteste, E. & McInerney, J. O. 2013. Gene similarity networks provide tools for understanding eukaryote origins and evolution. Proc. Natl Acad. Sci. USA, 110: E1594–1603. Atkinson, H. J., Morris, J. H., Ferrin, T. E. & Babbitt, P. C. 2009. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS ONE, 4:e4345.

© 2014 The Author(s) Journal of Eukaryotic Microbiology © 2014 International Society of Protistologists Journal of Eukaryotic Microbiology 2014, 61, 399–403

Single Cell Network Analysis

Cheng et al.

Bapteste, E., Lopez, P., Bouchard, F., Baquero, F., McInerney, J. O. & Burian, R. M. 2012. Evolutionary analyses of non-genealogical bonds produced by introgressive descent. Proc. Natl Acad. Sci. U S A, 109:18266–18272. Barabasi, A. L. & Oltvai, Z. N. 2004. Network biology: understanding the cell’s functional organization. Nat. Rev. Genet., 5:101–113. Beauregard-Racine, J., Bicep, C., Schliep, K., Lopez, P., Lapointe, F. J. & Bapteste, E. 2011. Of woods and webs: possible alternatives to the tree of life for studying genomic fluidity in E. coli. Biol. Direct, 6:39. Bhattacharya, D., Price, D. C., Yoon, H. S., Yang, E. C., Poulton, N. J., Andersen, R. A. & Das, S. P. 2012. Single cell genome analysis supports a link between phagotrophy and primary plastid endosymbiosis. Sci. Rep., 2:356. Bhattacharya, D., Price, D. C., Bicep, C., Bapteste, E., Sarwade, M., Rajah, V. D. & Yoon, H. S. 2013. Identification of a marine cyanophage in a protist single-cell metagenome assembly. J. Phycol., 49:207–212. Blondel, V., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. 2008. Fast unfolding of communities in large networks. J. Stat. Mech., 2008:P10008. Conesa, A., Gotz, S., Garcia-Gomez, J. M., Terol, J., Talon, M. & Robles, M. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21:3674–3676. Curtis, B., Tanifuji, G., Burki, F., Gruber, A., Irimia, M., Maruyama, S., Arias, M., Ball, S., Gile, G., Hirawaka, Y., Hopkins, J., Kuo, A., Ransing, S., Schmutz, J., Symeonidi, A., Elias, M., Eveleigh, R., Herman, E., Klute, M., Nakayama, T., Obornik, M., ReyesPrieto, A., Armbrust, E. V., Aves, S., Beiko, R. G., Coutinho, P., Dacks, J. B., Durnford, D. G., Fast, N. M., Green, B. R., Gris€ppner, M. P., Ishida, dale, C. J., Hempel, F., Henrissat, B., Ho K., Kim, E., Koren y, L., Kroth, P. G., Liu, Y., Malik, S. B., Maier, U. G., McRose, D., Mock, T., Neilson, J. A., Onodera, N. T., Poole, A. M., Pritham, E. J., Richards, T. A., Rocap, G., Roy, S. W., Sarai, C., Schaack, S., Shirato, S., Slamovits, C. H., Spencer, D. F., Suzuki, S., Worden, A. Z., Zauner, S., Barry, K., Bell, C., Bharti, A. K., Crow, J. A., Grimwood, J., Kramer, R., Lindquist, E., Lucas, S., Salamov, A., McFadden, G. I., Lane, C. E., Keeling, P. J., Gray, M. W., Grigoriev, I. V. & Archibald, J. M. 2012. Algal genomes reveal evolutionary mosaicism and the fate of nucleomorphs. Nature, 492:59–65. Halary, S., Leigh, J. W., Cheaib, B., Lopez, P. & Bapteste, E. 2010. Network analyses structure genetic diversity in independent genetic worlds. Proc. Natl Acad. Sci. U S A, 107:127–132. Kalisky, T., Blainey, P. & Quake, S. R. 2011. Genomic analysis at the single-cell level. Annu. Rev. Genet., 45:431–445. Komurov, K., Dursun, S., Erdin, S. & Ram, P. T. 2012. NetWalker: a contextual network analysis tool for functional genomics. BMC Genomics, 13:282. Li, W. & Godzik, A. 2006. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22:1658–1659. Luscombe, N. M., Qian, J., Zhang, Z., Johnson, T. & Gerstein, M. 2002. The dominance of the population by a selected few:

power-law behaviour applies to a wide variety of genomic properties. Genome Biol., 3:RESEARCH0040. Mosavi, L., Minor Jr, D. & Peng, Z. 2002. Consensus-derived structural determinants of the ankyrin repeat motif. Proc. Natl Acad. Sci., 99:16029–16034. Newman, M. E. J. 2003. Mixing patterns in networks. Phys. Rev. E, 67:026126. Parra, G., Bradnam, K. & Korf, I. 2007. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 23:1061–1067. Proulx, S. R., Promislow, D. E. & Phillips, P. C. 2005. Network thinking in ecology and evolution. Trends Ecol. Evol., 20:345– 353. Rice, P., Longden, I. & Bleasby, A. 2000. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet., 16:276–277. Rodrigue, S., Malmstrom, R. R., Berlin, A. M., Birren, B. W., Henn, M. R. & Chisholm, S. W. 2009. Whole genome amplification and de novo assembly of single bacterial cells. PLoS ONE, 4:e6864. Woyke, T., Tighe, D., Mavromatis, K., Clum, A., Copeland, A., Schackwitz, W., Lapidus, A., Wu, D., McCutcheon, J. P., McDonald, B. R., Moran, N. A., Bristow, J. & Cheng, J. F. 2010. One bacterial cell, one complete genome. PLoS ONE, 5: e10314. Yoon, H. S., Price, D. C., Stepanauskas, R., Rajah, V. D., Sieracki, M. E., Wilson, W. H., Yang, E. C., Duffy, S. & Bhattacharya, D. 2011. Single cell genomics reveals trophic interactions and evolutionary history of uncultured protists. Science, 332:714–717. Zhou, J., Deng, Y., Luo, F., He, Z., Tu, Q. & Zhi, X. 2010. Functional molecular ecological networksmBio, 1:e00169-10. Zong, C., Lu, S., Chapman, A. R. & Xie, X. S. 2012. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science, 338:1622–1626.

SUPPORTING INFORMATION Additional Supporting Information may be found in the online version of this article: Fig. S1. Hierarchically clustered taxonomic distribution of single cell data from P. ovalis-like Cells 1–6. This map is based on BLASTp hits of the predicted proteins. Note that Cell 1 has a relatively far distance from the other five cells mainly due to its extremely high proportion of proteobacterial hits. Bacteria/phages are marked as red text and eukaryotes as green text. Table S1. Assembly output for the six Paulinella ovalis-like cells studied in our work. The 454 reads were assembled using the native Roche GS De Novo Assembler Software V2.5 beta.

© 2014 The Author(s) Journal of Eukaryotic Microbiology © 2014 International Society of Protistologists Journal of Eukaryotic Microbiology 2014, 61, 399–403

403

Exploring biotic interactions within protist cell populations using network methods.

The study of diseased human cells and of cells isolated from the natural environment will likely be revolutionized by single cell genomics (SCG). Here...
320KB Sizes 0 Downloads 2 Views