Protein co-expression network analysis (ProCoNA)
© Gibbs et al.; licensee BioMed Central Ltd. 2013
Received: 22 February 2013
Accepted: 23 May 2013
Published: 1 June 2013
Skip to main content
© Gibbs et al.; licensee BioMed Central Ltd. 2013
Received: 22 February 2013
Accepted: 23 May 2013
Published: 1 June 2013
Biological networks are important for elucidating disease etiology due to their ability to model complex high dimensional data and biological systems. Proteomics provides a critical data source for such models, but currently lacks robust de novo methods for network construction, which could bring important insights in systems biology.
We have evaluated the construction of network models using methods derived from weighted gene co-expression network analysis (WGCNA). We show that approximately scale-free peptide networks, composed of statistically significant modules, are feasible and biologically meaningful using two mouse lung experiments and one human plasma experiment. Within each network, peptides derived from the same protein are shown to have a statistically higher topological overlap and concordance in abundance, which is potentially important for inferring protein abundance. The module representatives, called eigenpeptides, correlate significantly with biological phenotypes. Furthermore, within modules, we find significant enrichment for biological function and known interactions (gene ontology and protein-protein interactions).
Biological networks are important tools in the analysis of complex systems. In this paper we evaluate the application of weighted co-expression network analysis to quantitative proteomics data. Protein co-expression networks allow novel approaches for biological interpretation, quality control, inference of protein abundance, a framework for potentially resolving degenerate peptide-protein mappings, and a biomarker signature discovery.
Systems biology embraces the complexity found in biological networks by taking a holistic view of the cell [1, 2]. As systems biology moves forward, models making use of quantitative proteomic data will become increasingly necessary since this information is not accessible using other analytical methods [3, 4].
Large-scale quantitative proteomics, however, is still developing and can be challenging and complex in practice [5, 6]. In order to boost throughput and ease computation, tag-based approaches are used . Briefly, proteins are digested enzymatically, producing a multitude of peptide fragments. Using liquid chromatography coupled to mass spectroscopy (referred to as LC-MS), the digested mixture is quantified, resulting in a set of features containing both mass and net elution time measurements. Peptides are identified by mapping features to entries in an accurate mass and time tag (AMT) database. Tag databases are constructed using pooled samples processed on a tandem MS/MS platform .
Currently, a majority of protein networks are constructed using protein-protein interaction (PPI) databases. However, manually curated PPI databases are regularly revised as our understanding of biology grows. PPI databases are typically heterogeneous, containing different experiment types and model organisms, leading to sparse annotation and a lack of experimental concordance . In addition, interaction temporality and contextual information is lacking. Coverage, selection bias, and detection bias all remain problems [9, 10].
Three quantitative LC-MS data sets including both human and mouse disease studies were used. The Thermo Electron Exactive platform was used to generate data. The Pacific Northwest National Labs (PNNL, http://omics.pnl.gov) developed the Accurate Mass and Time (AMT) tag databases. VIPER (v3.48) is used to align individual samples and identify peptides using an AMT database . Identifications have confidence metrics: the probability for a correct match, the STAC score, and the probability for a unique database match, the uniqueness probability (UP) . Peptides with STAC scores > 0 and UP > 0 were used. Peptide abundances were normalized by total ion count per sample and log10 scaled. Missing data are encountered when peptides are identified in a subset of samples. “Missingness filtration” involves removing any peptide with greater than X% missing data across samples. In this case, peptides with greater than 10% missing data were removed.
The infectious disease data came from the NIAID Systems Virology project (publically available data is found at http://www.systemsvirology.org). We utilized both longitudinal SARS-CoV and influenza mouse studies. These data are generated using C57BL/6J mice exposed to either a mouse adapted SARS-CoV (MA-15) or avian influenza virus (A/Vietnam/1203/2004 (H5N1, VN1203)) [15, 16]. Measurements took place on post infection days 1, 2, 4 and 7.
SARS control samples include three technical replicates per day. Infected samples are five technical replicates with viral dosages of 102, 103, 104, and 105 PFU per day. Abundance measurements for 16,890 peptides mapping to 3,277 proteins were recorded. After missingness filtration, 2,008 peptides mapping to 707 proteins remained. 352 proteins were associated with a single peptide, while 355 proteins had two or more peptides associated.
Influenza control samples include three technical replicates per day. Infected samples include five technical replicates with dosages of 102, 103, and 104 PFU per day. Abundances for 10,285 peptides mapping to 2,661 proteins were recorded. After missingness filtration, 989 peptides associated with 493 proteins remained. 274 proteins were associated with a single peptide, while 219 proteins had at least 2 peptides associated.
The human proteomics data (currently unpublished) comes from a sub-cohort of participants selected from a large (N=6000) longitudinal study of musculoskeletal health in older (≥ 65 years) men (from the Osteoporotic Fractures in Men (MrOS) Study) [17, 18]. The protocol was approved by the local institutional review boards. All participants provided written informed consent. The sub-cohort is focused on the sarcopenia phenotype, which is related to loss of lean mass and muscle performance . A subset of 68 samples from two phenotyped groups (sarcopenic (N=38) and non-sarcopenic (N=30)) based on lean mass and leg power are used. Abundances for 10,679 peptides mapping to 1,868 proteins were recorded. After missingness filtration, 2,845 peptides mapping to 685 proteins remained. 505 proteins were associated with a single peptide, and 180 were associated with at least two peptides.
Protein co-expression networks contain nodes representing peptides connected with edges weighted by similarity in abundance profile. Edge weights are calculated using peptide intensity measurements. Although not always representative of absolute abundance, intensity is frequently used to track relative peptide abundance and to infer protein abundance . In this work, we did not attempt to rectify situations where proteins were represented by a single peptide or where degenerate peptides mapped to multiple proteins. See Additional file 1 for a description of the software used in this work.
Construction of the network follows the WGCNA method [21–24]. Pearson’s correlations are computed pairwise between all peptides, retaining the sign as in Mason et al., resulting in a signed similarity matrix . According to the scale-free criterion, a power (beta) is selected that transforms the distribution of node degrees in the similarity matrix to log-linear, producing the appropriate adjacency matrix. Topological overlap is a similarity metric that incorporates information from neighboring nodes, making it robust to noisy correlations. The TOM is computed as TOMij = (lij + aij) / [min (ki, kj) + 1 - aij] where lij is defined as the dot product on row i and column j in adjacency matrix [a] and ki (the connectivity) is the summation of row i in adjacency matrix [a]. Modules, or subnetworks, are composed of strongly connected peptides. Modules are discovered by hierarchical clustering of the distance matrix, 1-TOM, using the “average” agglomeration method, followed by branch cutting with the dynamic hybrid treecut algorithm . The following parameters were used after visualization and exploratory analysis: deepSplit = 2, minModuleSize = 30, mergeThreshold = 0.1.
Similar to Iancu et al. [26, 27], module significance was examined using permutation testing. Empirical p-values are computed by comparing the mean topological overlap of peptides within a module to a similarly sized random peptide sample. These samples are taken from the total set of peptides used in the network. For a given module with size n, mean edge weights are computed. For a number of trials, t, a sample of peptides is drawn with size equal to n, and the mean edge weight computed. If this value is equal to, or higher than the observed module mean, a count is incremented. The p-value is equal to (counts/t). In this work 10,000 random samples were drawn.
After assigning peptides to modules, an aggregate module signature is computed. The first right-singular vector, or eigenpeptide, is computed from a singular value decomposition of the standardized abundance module matrix. The eigenpeptide has length equal to the number of samples. This vector acts as an overall summary of the module. Modules can be prioritized according to correlations between the eigenpeptide and biological phenotypes. Additionally, the relative centrality, or “importance” of any given peptide within a module is found by computing a Pearson’s correlation to the eigenpeptide (called the Kme) [28, 29]. Peptides with a strong correlation to the eigenpeptide are said to be more central, and important within the module, allowing prioritization on peptides.
Concordance among a set of peptides relates to the shared sign of the slope when regressed against a given variable such as time or infection status. Our approach to this problem involves constructing protein sub-networks, initially as “all-to-all” networks. After applying a topological overlap threshold, edges start to fall away. This results in a disjoint set of connected components.
Alternatively, the expression fold change (infected vs. non-infected) of peptides can be compared within a protein to determine discordance. A fold change cut-off of 1.5 was used to define peptides as up or down regulated. If both up and down labeled peptides mapped to a protein, then the protein was counted as discordant.
Similar to testing for module significance, the connectivity among peptides mapping to a given protein can be tested by permutation. For each protein with greater than two peptides, the pairwise topological overlaps are averaged. Then for a set number of trials (10,000), the same number of peptides is randomly sampled from all peptides in the network, and the mean pair wise topological overlap recorded. The empirical p-value is taken as the number of times the random sample has values equal or greater than the observed case divided by the number of trials. This test can also be applied using correlations between the peptides.
Co-expression modules are thought to reflect, to some degree, true protein interactions. To examine this, we compare the contents of modules with known PPIs. As previously done, permutation testing was used to determine whether a significant amount of PPI edges exist within a module. Within each module, peptides with weak connections to the module eigenpeptide were removed (Kme < 0.333). Centrality filtration is performed to focus the analysis on peptides associated with overall module function. The remaining peptides are mapped to proteins. Proteins with any number of mapping peptides are included. The number of observed PPIs within a module is recorded and compared to the number of PPIs in a random module for a set number of trials. P-values are computed using permutation testing as before. The PPI databases HPRD  and MPPI  were used for human and mouse data respectively.
After PPI enrichment tests, significant sets of proteins were collected by module. Querying KEGG [32, 33], using the R package KEGGSOAP , with these proteins provided a list of potential pathways to investigate by module. For each pathway returned, a hypergeometric test was performed using significant PPIs from the module and other proteins taking part in the pathway. The universe is defined as the subset of proteins in the mass tag database with known roles in KEGG pathways. P-values are adjusted using the Benjamini and Yekutieli method .
Functional enrichment on modules was computed using the R package GOstats . Peptides are mapped to proteins and counted once in any module. The universe is defined as all proteins found in the AMT mass tag database (similar to microarray studies). Annotation databases “org.Mm.eg.db” and “org.Hs.eg.db” (Bioconductor 2.8) are used for mouse and human annotations. Conditional hypergeometric testing is used to account for correlated GO terms.
Co-expression network construction methods are applicable to proteomics
With regard to distinct and significant modules, the SARS network contained 14 modules ranging from 65 to 369 peptides, with a mean size of 133.9 peptides. The Influenza network contained 6 modules, with sizes ranging from 56 to 327 peptides and a mean size of 141.3 peptides. The sarcopenia network contained 19 modules ranging in size from 36 to 477 peptides, with a mean size of 142.25 peptides. An initial examination indicates that low to moderate levels of missing data did not negatively impact the model fit (Figure 3). Importantly, we note that all of the identified de-novo modules have significant connectivity with the exception of the sarcopenia network, which contained one module without significant connectivity (p-value 0.33).
The influenza network showed strong correlations with average weight loss, an important indicator of infection severity. Two modules showed positive correlation (p-values 2e-10 and 2e-6), and two modules showed negative correlation (p-values 8e-10 and 2e-15).
The sarcopenia network showed the weakest correlations with sample phenotypes. Several modules correlate with technical variables, indicating that the normalization method did not completely remove systematic effects. This finding guided the re-evaluation of data processing and motivated new methods in normalization, which is in preparation by Baraff et al.
Given a complex biological mixture, a significant problem in quantitative proteomics remains in confidently identifying the protein component. This problem is made worse by the existence of degenerate peptides, which can lead to multiple solutions for peptide to protein mapping. We find that the connectivity of a protein’s constituent peptides is far from random. This is potentially useful for resolving cases of degenerate peptide mapping, increasing confidence in protein identification.
Topological overlap can be utilized with a threshold to identify high confidence edges between peptides. Upon examination, we found that with a topological overlap threshold of 80% (keeping only edges in the top 20% of all weights), the majority of proteins remained connected (Sarcopenia 84%, SARS 72%, influenza 63%).
Network topology may be useful for resolving degenerate peptide mappings
Examining discordance in the fold change of peptide expression, and using proteins with unique peptide mappings, 48 of 218 proteins in the SARS data were discordant. After applying a topological overlap threshold of 0.8 (as above), the number of discordant components dropped to 24, and with a threshold of 0.9 dropped to 12. In the sarcopenia network, peptides that were modeled against leg strength provided the most discordant proteins, resulting in 11 of 139 discordant proteins. After applying the topological overlap thresholds of 0.8 and 0.9, these dropped to 4 and 3 respectively. In influenza, there were 15 (out of 115 proteins) classified as discordant; this reduced to 7 and 3 respectively after applying the 0.8 and 0.9 thresholds. It appears that edge strength is informative with respect to concordance among constituent peptides for any given protein, showing potential utility for both protein inference and quantification.
De novo co-expression modules are thought to be useful for detecting new interactions and/or functional pathway members. We first evaluate, however, whether the de novo modules are enriched for known interactions to aid in assessing the utility of this approach. Using the HURD and MPPI protein-protein interaction databases, significant interactions were identified in all three experiments. After adjusting for multiple testing, the influenza network had 5/6 modules with significant PPI enrichment, the SARS network had 12/14 modules significantly enriched and the MrOS network had 10/19. Some modules overlapped in terms of mapped proteins which is typically the result of highly similar or related protein sequences, such as sets of histones.
We then examined whether modules with significant PPIs were also enriched for known pathways, similar to what has been seen in de-novo gene expression studies. For the influenza modules with significant PPI enrichment, we examined known pathways in the KEGG database. When defining the universe (or background for comparison) as all proteins contained in the mass tag database (5,521 proteins), a range of significant pathways were found, including “regulation of actin cytoskeleton” (mmu:04810), the “tight junctions” pathway (mmu:04530), and the “antigen presentation and processing” pathway (mmu:04612). However, if the universe is restricted to only those proteins with KEGG annotations (2,539 proteins), the antigen presentation pathway alone remained significant in two modules. These pathways are important in the pathological progression of influenza, highlighting the relationship between network structure and biology. Identification of enriched pathways in these unbiased modules could potentially aid discovery of novel interactions or pathway members.
Given the significant numbers of PPI interactions, modules may have overarching functional organization. To study this, Gene Ontology enrichment, by module, was evaluated using the GOstats package . All three data sources showed GO term enrichment with highly significant Bonferroni adjusted p-values (Additional file 2: Table S1). In the SARS and influenza networks, enrichment for biological processes such as DNA packaging, cellular component assembly, and cellular complex assembly was observed. The sarcopenia network modules also showed significant functional enrichment, including immune response and blood processes. This further reiterates the non-random, biological composition of these modules and provides support for the use of this approach to network inference in proteomics. We note that this framework is generalizable to many data types and is not limited to proteomics.
We have demonstrated the feasibility of constructing de novo peptide co-expression networks. We show that these networks have a biologically meaningful and approximately scale-free topology and contain statistically significant modules. We also noted that the network structure and connectivity of the modules are potentially useful for resolution of degenerate peptides and inference of protein abundance. Across three distinct experiments, we have illustrated how module summaries significantly correlate with clinically relevant phenotypes. In addition, we have shown how de novo modules show significant enrichment for known PPI and biological function. Peptides can be ranked according to their module centrality and relationship to phenotypic traits, allowing researchers to prioritize targets for further research. Finally, modules can provide a natural aggregate representation for composite biomarker discovery.
This work was supported by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services [HHSN272200800060C; 5U54AI081680], National Institutes of Health, National Center for Advancing Translational Sciences [5UL1RR024140], National Library of Medicine [3T15LM7088-18S1], and National Cancer Institute [5P30CA069533].
The Osteoporotic Fractures in Men (MrOS) Study is supported by National Institutes of Health funding. The following institutes provide support: the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS), the National Institute on Aging (NIA), the National Center for Research Resources (NCRR), and NIH Roadmap for Medical Research under the following grant numbers: U01 AR45580, U01 AR45614, U01 AR45632, U01 AR45647, U01 AR45654, U01 AR45583, U01 AG18197, U01 AG027810, and UL1 RR024140. The National Institute for Dental and Craniofacial Research (NIDCR) provides funding for the MrOS Dental ancillary study “Oral and Skeletal Bone Loss in Older Men” under the grant number R01 DE014386.
The experimental proteomics measurements described herein were supported in part by the NIH NIGMS P41 BTRC [RR185220 and GM103493-10] and was performed in the Environmental Molecular Sciences Laboratory, EMSL, a national scientific user facility sponsored by DOE/Biological and Environmental Research (BER) and located at Pacific Northwest National Laboratory, which is operated by the Battelle Memorial Institute for the DOE under Contract DE-AC05-76RL0 1830. All authors read and approved the final manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.