Availability of MudPIT data for classification of biological samples

Background : Mass spectrometry is an important analytical tool for clinical proteomics. Primarily employed for biomarker discovery, it is increasingly used for developing methods which may help to provide unambiguous


Background
The identification of proteins whose quantitative levels change is a key aspect of investigating biological systems, as well as of developing strategies for classifying samples into pre-specified categories, such as healthy and diseased. In fact, one of the main objectives of clinical proteomics is to use relevant biomarkers for improving disease diagnosis or for monitoring the efficacy of treatments [1].
A procedure for discriminating biological samples involves a preliminary evaluation of experimental data, useful for building classification models [2]. In this context, a wide variety of algorithms has been used for processing raw mass spectra, mainly generated by MALDI [3-10] and SELDI technologies [11-14]. Results from diagnostic studies based on SELDI have generated both excitement and scepticism; in particular, SELDI does not allow a direct identification of proteins and is based on m/z signals only. On the other hand, MALDI is mainly used for the identification of peptides, and its reproducibility is strongly dependent on the sample preparation method. Moreover, in many studies, the selected discriminant mass spectrometry signals were subsequently identified by liquid chromatography (LC) coupled to mass spectrometry (MS). Nevertheless, few works have directly taken LC-MS data into consideration for discriminating biological samples [15,16]. Instead, some authors have used them, combined with machine learning algorithms, for improving tandem mass (MS/MS) spectra quality assessment and, hence, protein identification [17-20].
Recently, improvements in the robustness and reproducibility of the MudPIT (Multidimensional Protein Identification Technology) approach, based on two-dimensional liquid chromatography coupled to tandem mass spectrometry, have permitted a correct grouping of phenotypes by using unsupervised algorithms [21-23]. Based on these findings, MudPIT may represent an attractive methodology for improving sample classification methods. It automatically yields thousands of features comprising spectra, peptide sequences and related proteins [24,25]. In addition, label-free quantification approaches based on spectral count (SpC) or SEQUEST-based SCORE evaluation permit the high-throughput discovery of multiple biomarkers [26-28], which could contain a higher level of discriminatory information.
The present study investigates in depth the availability of MudPIT data for the classification of biological samples. We focused on the classification performances achievable by processing different data-types, such as spectra, peptides and proteins. Specifically, we applied a class of machine learning algorithms, i.e., Support Vector Machines (SVMs), to identify the most predictive features and to score the data-types according to the inference performances of the algorithm [29,30]. Finally, since the identification of features that allow a classification model to be built is a key challenge for high-dimensional data, we evaluated how the applied selection method correlates with an independent label-free quantification approach. To this end, we measured the overlap between the features selected by SVM and the differentially expressed proteins selected by means of the MAProMa software [31].

Data collections
For the purposes of this study, two different pre-existing collections of experimental data were used. They were previously obtained by MudPIT analysis of complex samples, such as adipose and cardiac tissues [23,32]. Specifically, collection 1 comprised 30 diseased samples and 11 healthy controls, while collection 2 comprised 18 diseased samples and 18 healthy controls (Supplementary Figure S1). Experimental details of the MudPIT analysis are reported in Supplementary Information.

Data handling of MS results
Raw mass spectra (MS) produced by MudPIT were handled using MZmine software [33]. Peak detection was performed by the chromatogram builder module using the Centroid algorithm. Each file containing MS spectra was processed individually and converted to pairs of m/z and intensity values by considering all data points above the specified noise level (e3). Then, m/z data points were connected to form chromatograms.
In particular, the minimum time span was set to 1 min, the minimum absolute height to e3 and the m/z tolerance to 0.5. Finally, peak lists were aligned by the Join aligner method, applying tolerance ranges of 0.5 and 1 min for mass and retention time, respectively.
The experimental tandem mass spectra (MS/MS) were correlated to in-silico tryptic peptide sequences, and accordingly to parent proteins, by using a database search method based on the SEQUEST algorithm [34].
The validity of spectrum/peptide matching was assessed using SEQUEST-defined parameter thresholds (Supplementary Information). Finally, protein and peptide lists obtained from each sample were handled and aligned using MAProMa software and an in-house R script, respectively [31,35].
In order to evaluate the reproducibility of the MudPIT approach, protein lists of technical replicates were aligned and then processed using a linear-regression-based analysis:

\[ Y_i = \beta_0 + \beta_1 X_i + u_i \]

where:
$i = 1, \ldots, n$, with $n$ = number of variables (proteins);
$Y_i$ is the spectral count (SpC) value of protein $i$ in the first replicate analysis;
$X_i$ is the spectral count (SpC) value of protein $i$ in the second replicate analysis;
$\beta_0$ is the intercept of the regression line of the population;
$\beta_1$ is the slope or gradient of the regression line of the population;
$u_i$ is the error term.
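As an illustration of this reproducibility check, the following minimal Python sketch fits the regression of the first-replicate SpC values on the second-replicate ones by ordinary least squares. It is not the procedure used to produce the reported results: the function name `fit_replicates` and the SpC values are hypothetical and only show how the intercept, the slope and R² can be obtained from two aligned protein lists.

```python
from statistics import mean


def fit_replicates(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x.

    x: SpC values of the second replicate (X_i)
    y: SpC values of the first replicate (Y_i)
    Returns (intercept, slope, r_squared).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long vectors with at least two proteins")
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("SpC values must not be constant")
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical SpC values of six proteins aligned across two technical replicates.
spc_second = [2, 5, 9, 14, 30, 52]  # X_i
spc_first = [3, 5, 8, 15, 29, 50]   # Y_i

b0, b1, r2 = fit_replicates(spc_second, spc_first)
print(f"intercept={b0:.3f} slope={b1:.3f} R2={r2:.4f}")
```

A slope and an R² close to 1, together with an intercept close to 0, would indicate that the two replicates assign similar spectral counts to the same proteins.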

Proteomic datasets
Each sample belonging to collections 1 and 2 was represented by five different datasets, including the global protein and peptide profiles and the m/z precursor ions from three different chromatographic steps (60, 120, 400 mM) of the applied analytical method (Supplementary Information). Each dataset was formatted as an s × f matrix, where s represents the number of samples and f the number of features. Entries of the protein data matrix were the spectral count (SpC) values assigned by the SEQUEST algorithm to each identified protein; in the same way, Xcorr (cross-correlation score) values and peak area intensities (AUC) were used for the peptide and mass spectra data matrices, respectively (Supplementary Table S1).
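The s × f layout can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `build_matrix`, the sample and protein identifiers and the counts are hypothetical, and filling features that were not detected in a sample with 0 is an assumption of this sketch rather than a documented choice of the study.

```python
def build_matrix(samples, missing=0.0):
    """Arrange per-sample feature values into an s x f matrix.

    samples: dict mapping sample id -> dict mapping feature id -> value
             (e.g. protein -> SpC, peptide -> Xcorr, m/z ion -> peak area).
    missing: value used when a feature was not observed in a sample
             (assumption of this sketch).
    Returns (sample_ids, feature_ids, matrix), with one row per sample.
    """
    sample_ids = sorted(samples)
    feature_ids = sorted({f for values in samples.values() for f in values})
    matrix = [
        [samples[s].get(f, missing) for f in feature_ids]
        for s in sample_ids
    ]
    return sample_ids, feature_ids, matrix


# Hypothetical spectral counts for two samples (identifiers are invented).
spc = {
    "sample_A": {"P1": 12, "P2": 3},
    "sample_B": {"P2": 5, "P3": 8},
}

sample_ids, features, matrix = build_matrix(spc)
for sample_id, row in zip(sample_ids, matrix):
    print(sample_id, row)
```

Sorting the identifiers keeps the column order stable, so the same feature occupies the same column in every dataset built from the same feature list.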

Label-free quantification approach
Proteins differentially expressed between the considered phenotype groups were identified by using a label-free quantification approach. In particular, SEQUEST-based SCORE values were processed by means of the DAve and DCI formulas, which are implemented in MAProMa software [31]. In addition, SpC values were evaluated by using the G-test [36] and the unpaired Student's t-test. Proteins with DAve ≥ 0.3 (or ≤ −0.3) and DCI ≥ 300 (or ≤ −300), or statistically significant in at least one test (P > 95%), were considered for the purposes of the study (Supplementary Information and Figure S2).
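The DAve/DCI filter can be illustrated with a short Python sketch based on the definitions given in the caption of Figure 3. It is not the MAProMa implementation: the function names and the example SCORE values are hypothetical, returning 0 for DAve when a protein is absent from both samples is an assumption of this sketch, and the G-test and t-test branch of the selection rule is deliberately left out.

```python
def dave(x, y):
    """DAve = ((X - Y) / (X + Y)) / 0.5, ranging from -2 to +2."""
    if x + y == 0:
        return 0.0  # assumption: protein absent from both samples
    return ((x - y) / (x + y)) / 0.5


def dci(x, y):
    """DCI = ((X + Y) * (X - Y)) / 2."""
    return ((x + y) * (x - y)) / 2


def is_differential(x, y, dave_min=0.3, dci_min=300):
    """True when both indices pass the thresholds with the same sign."""
    d, c = dave(x, y), dci(x, y)
    up_in_first = d >= dave_min and c >= dci_min
    up_in_second = d <= -dave_min and c <= -dci_min
    return up_in_first or up_in_second


# Hypothetical SCORE values of one protein in two compared samples.
x, y = 30, 10
print(dave(x, y), dci(x, y), is_differential(x, y))
```

Because X + Y is never negative, DAve and DCI always share the sign of X − Y, so the two thresholds act as a joint requirement on the relative change (DAve) and on its magnitude (DCI).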

Evaluation procedures by SVM
In order to investigate the classification performance achievable with the different data-types (spectra, peptides and proteins), we designed specific RapidMiner workflows (RM-WF), mainly aimed at implementing a class of algorithms widely used in the machine learning community, i.e., the Support Vector Machine (SVM) [30].
In our investigation we sequentially applied two main operational processes, i.e., feature selection and model construction (and validation). We briefly summarize below the RM-WF designed for each phase (a complete description of each operator is reported in Supplementary Figure S3).
1. Feature selection phase. Due to the high number of signals, feature selection may help to improve both the inference quality and the understanding of the data. For this reason we first applied a standard feature selection procedure [29]. Broadly speaking, we weighted each signal by an information theory criterion (i.e., the information gain ratio [37]). Only signals with a weight greater than 0.6 were then carried forward to the next phase; in this way, only 10 signals were retained. The RM-WF in this case is simple, providing only the infogain weighting capability, as reported in Supplementary Figure S3 (a).

2. Model construction and validation phase. To evaluate the classification performance achievable with the different data-types, we employed SVM algorithms as black boxes to score each input data-type according to the inference performances of the algorithm [29]. In order to avoid overfitting, we first sub-sampled a set of different data instances, i.e., for each dataset, this phase was applied to (data) instances never used in the above feature selection step. Then, for each instance, we considered only the intensity (and count) values corresponding to the 10 signals suggested by the feature selection. This approach was applied together with an optimization procedure to learn the algorithm parameters. As a matter of fact, different learning models may have many parameters, and it is often unclear which values are best for the learning task at hand; in our case, SVMs involve different kernel types and, in turn, each kernel function uses specific values that need to be defined in the learning algorithm [30]. In order for the SVMs to perform as well (and as uniformly) as possible for each data-type, we optimized these parameters over the same space of common values. That is, we searched for the best parameter values (i.e., those providing the highest SVM inference performances) among all the combinations of common ranges for each input data collection. The RM-WF reported in Supplementary Figure S3 (b) specifies the main steps used in this phase. Finally, standard indices (i.e., sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, F-score, balanced accuracy, informedness and Matthews correlation coefficient) were used as performance measures to verify which data-types provide the best SVM classification [2].
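The weighting step of the feature selection phase can be sketched as follows. This is a simplified illustration and not the RapidMiner operator used in the study: here each numeric signal is binarized at its median before the information gain ratio is computed, which is an assumption of this sketch (the RapidMiner operator may discretize and normalize weights differently). The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain ratio of one numeric feature split at its median."""
    cut = median(values)
    groups = [v > cut for v in values]
    split_info = entropy(groups)
    if split_info == 0:
        return 0.0  # the feature does not split the samples at all
    n = len(labels)
    conditional = 0.0
    for side in (True, False):
        subset = [lab for lab, g in zip(labels, groups) if g == side]
        if subset:
            conditional += len(subset) / n * entropy(subset)
    return (entropy(labels) - conditional) / split_info


def select_features(matrix, labels, names, min_weight=0.6):
    """Weight every column of an s x f matrix and keep those above min_weight."""
    columns = list(zip(*matrix))
    weights = {name: gain_ratio(col, labels) for name, col in zip(names, columns)}
    selected = [name for name in names if weights[name] > min_weight]
    return selected, weights


# Toy data: six samples (rows), two signals (columns), invented values.
labels = ["D", "D", "D", "H", "H", "H"]
matrix = [[9, 1], [8, 9], [7, 2], [1, 8], [2, 3], [3, 7]]
selected, weights = select_features(matrix, labels, ["A", "B"])
print(selected, weights)
```

In line with the model construction phase described above, the instances used to compute these weights should be kept apart from the instances later used to train and validate the SVM.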

Results and Discussion
In this study, we investigated the classification of phenotypes by applying support vector machine (SVM) algorithms to experimental data obtained by the MudPIT approach (Figure 1). Identified proteins, peptides and experimental mass spectra (m/z) were processed to evaluate the generalization capability of SVM on the diseased vs. healthy cases used in this study (Supplementary Figure S1). For this purpose, a RapidMiner workflow was implemented (Supplementary Figure S3). First, a set of data was used as input to the SVM learning algorithm. Some learning parameters were optimized over the same common space of values [30].
Finally, data were evaluated according to the inference performance of the algorithm by using standard indices broadly applied to measure precision and recall capability [2].
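For reference, the standard indices can all be derived from the four cells of a confusion matrix, as in the following minimal Python sketch. The function name and the counts are hypothetical (they are not taken from Tables 1 and 2), the F-score is assumed to be the balanced F1 measure, and returning 0 for undefined ratios is a convention of this sketch.

```python
from math import sqrt


def _ratio(numerator, denominator):
    """Safe division: undefined ratios are reported as 0.0 (sketch convention)."""
    return numerator / denominator if denominator else 0.0


def classification_indices(tp, fn, tn, fp):
    """Standard indices from a 2 x 2 confusion matrix (diseased = positive)."""
    sens = _ratio(tp, tp + fn)
    spec = _ratio(tn, tn + fp)
    ppv = _ratio(tp, tp + fp)
    npv = _ratio(tn, tn + fn)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": ppv,
        "npv": npv,
        "accuracy": _ratio(tp + tn, tp + fn + tn + fp),
        "f_score": _ratio(2 * ppv * sens, ppv + sens),
        "balanced_accuracy": (sens + spec) / 2,
        "informedness": sens + spec - 1,
        "mcc": _ratio(
            tp * tn - fp * fn,
            sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
    }


# Hypothetical validation outcome: 10 diseased and 5 healthy samples.
indices = classification_indices(tp=8, fn=2, tn=4, fp=1)
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy, informedness and MCC are less sensitive than plain accuracy to unequal group sizes, which matters for an unbalanced collection such as collection 1 (30 diseased vs. 11 healthy samples).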
By applying a standard feature selection procedure, ten features having a weight greater than 0.6 were selected from each dataset (see the feature selection phase in Materials and Methods). The model delivered by the SVM operator was applied to independent validation datasets for estimating the performances of the phenotype classification. Tables 1 and 2, reporting the standard indices, show the diagnostic capabilities of SVM using two independent collections of samples and different data-types. Of note, the results suggest that SVM achieves a better classification capability using protein and peptide datasets rather than mass spectra datasets. In fact, better values of accuracy, F-score, informedness and MCC were observed for both collection 1 and collection 2. Conversely, sample classification by means of m/z data proved to be more difficult. In particular, low specificity values were observed using the mass spectra of collection 1, while the mass spectra of collection 2 gave low overall classification accuracy values.
The different classification performances obtained by SVM may be related to the complexity of the m/z data.
In this regard, an overview of the data was performed by means of Principal Component Analysis (PCA) [38]. As opposed to protein and peptide data, PCA showed that mass spectra, especially for collection 1, did not allow a clear differentiation between diseased and healthy groups in the multidimensional space (Supplementary Figure S4). In this context, the large number of mass spectra can make their data mining difficult. In fact, a single step of liquid chromatography separation allows the collection of a number of features (m/z values) about 15 and 3 times larger than the protein and peptide ones, respectively (Figure 2). This large amount of data may be due to the redundant acquisition of spectra, as well as to biological and/or chemical modifications of peptides/proteins (e.g., post-translational modifications). Moreover, m/z values may be affected by chemical noise as well as by day-to-day instrument variations. Therefore, preprocessing of the raw data significantly influences the quality of the classification results [39,40]. Nevertheless, further errors may be introduced during spectra alignment, while overlapping m/z regions may create ambiguities in peak detection, leading to increased noise and to a loss of information and discriminatory ability.
The identification of peptides and proteins through the interpretation of tandem mass spectra can be regarded as a cleaning and simplification of the m/z data complexity. This aspect probably improved the feature selection process and, consequently, the classification performance of the SVM model. For each collection, about 20% of the selected features were common to the protein and peptide datasets.
In addition, around 80% of the proteins and peptides selected by SVM matched the differentially expressed proteins selected by MAProMa software (Figure 3). This correspondence represents a mutual validation of these two different procedures, and it means that differentially expressed proteins may also be used for a correct grouping of sample phenotypes. For this reason, the use of statistical parameters associated with identified proteins and peptides represents a robust procedure for the rapid extraction of potential biomarkers.
In addition, the MudPIT approach provides good analytical reproducibility (Figure 4). In fact, although only 60-80% of proteins are identified in both of two replicate analyses, most of the variation is due to low-abundance proteins, which are usually identified with a low number of peptides. However, a statistical model has been proposed for estimating the number of replicates required for saturated sampling of a complex protein mixture [41].
Our findings are in good agreement with the most widely used semi-quantitative methods for the identification of biomarkers using the LC-MS/MS approach [25,27]. As for the identification of clinically useful biomarkers, in the last decade SELDI-TOF analysis has been widely used, and many diseases have been studied mainly by serum/plasma protein profiling. Although preliminary results generated great expectations, scepticism later prevailed [42]. This failure is probably due to the fact that SELDI profiling is based on m/z signals only and does not permit a direct identification and quantification of peptides/proteins. In addition, blood samples, although relatively simple to collect, have a very complex composition with prominent and unspecific changes, which is a drawback for biomarker discovery based on m/z signals. In contrast, in the present manuscript we have shown the improved availability of peptide/protein outcomes for biomarker discovery and phenotype discrimination. In comparison to mass spectra, sequenced proteins and peptides are less affected by experimental errors, and their use can help to avoid the reproducibility problems due to different instrumental settings occurring over time. In addition, models of healthy/diseased tissues represent a source of biomarkers at higher concentrations than plasma, which may be considered mainly useful for monitoring them using other LC-MS procedures [43].

Conclusion
To realize the potential of MS-based proteomics in the context of clinical utility, for disease diagnosis and prognosis, comparative studies are of great importance. In the present work, MudPIT data, both experimental mass spectra and sequenced peptides/proteins, were processed by SVM to evaluate the corresponding classification performances. The overall accuracy was higher than 77% in all investigated cases. In particular, proteins and peptides provided a more discriminant informative content than experimental mass spectra (overall accuracy higher than 87% in both collections 1 and 2). This result is probably due to the translation of mass spectra into peptides/proteins, which eliminates the experimental noise and highlights the actual features useful for phenotype classification. Overall, the presented findings indicate that the impressive amount of data produced by the MudPIT approach can be processed for identifying multiple biomarkers and for classifying biological samples, by applying both supervised and unsupervised algorithms. These procedures permit the evaluation of actual samples and the translation of proteomic methodology to clinical application. In this context, the MudPIT approach can be a useful tool for improving the extraction of informative features and, therefore, diagnostic procedures. In the near future, new and more efficient algorithms will probably be applied, and the discovered biomarkers will be validated by means of fast and high-resolution mass spectrometry and data-independent analysis [44,45]. These aspects will be of primary importance when combined with clinical data and for investigating mechanisms of pathogenesis. In fact, the improved quality of data has the potential to optimize existing protein quantification methods and to address the increasing demand of systems biology studies for correlating molecular expression with biological processes.

List of abbreviations used
• SVM (Support Vector Machine)
• MudPIT (Multidimensional Protein Identification Technology)
• MAProMa (Multidimensional Algorithm Protein Map)
• MALDI (Matrix-Assisted Laser Desorption/Ionization)
• SELDI (Surface-Enhanced Laser Desorption/Ionization)
• LC (Liquid Chromatography)
• MS (Mass Spectrometry)
• MS/MS (Tandem Mass Spectra)
• SpC (Spectral Count)
• DAve (Differential Average)
• DCI (Differential Confidence Index)

Figure 1 -
Enzymatically digested peptides are first separated by Strong Cation Exchange (SCX), using steps of increasing salt concentration, followed by Reverse Phase (RP) chromatography, using an acetonitrile gradient. Eluted peptides are then directly analyzed by tandem mass spectrometry, producing MS and MS/MS spectra. By means of specific algorithms, such as SEQUEST, and by applying appropriate data filtering criteria (see Supplementary Information), the comparison of experimental MS and MS/MS spectra with those predicted in silico from a protein sequence database allows the characterization of the peptide sequences and the corresponding proteins, without limits of isoelectric point (pI), molecular weight (Mw) or hydrophobicity. Using MudPIT, five different datasets per sample were collected for the study purposes. Specifically, in addition to the complete protein and peptide profiles, m/z data corresponding to the 60 mM, 120 mM and 400 mM salt concentration steps were mined, collecting three different datasets of spectra.

Tables
Table 1 - Performance of classification obtained by using SVM - Collection 1
Specificity (Spec.), sensitivity (Sens.), positive predictive value (PPV), negative predictive value (NPV), accuracy (Acc.), F-score, balanced accuracy (Bal. Acc.), informedness and Matthews correlation coefficient (MCC) of collection 1. Evaluation capabilities have been obtained using observations not considered in the signal selection phase.

Figure 2 - Features selected for the study purposes. Number of features (m/z ions, peptides and proteins) collected by analyzing, by MudPIT, all samples belonging to collection 1 and collection 2. DB1, DB2 and DB3 correspond to m/z data mined from the 60 mM, 120 mM and 400 mM salt concentration steps, respectively.

Figure 3 - DAve values for proteins and peptides selected by SVM. DAve evaluates changes in protein expression and is defined as ((X − Y)/(X + Y))/0.5, while DCI, which describes the confidence of differential expression, is defined as ((X + Y) × (X − Y))/2, where X and Y represent the SEQUEST-based SCORE values (or spectral counts) of a given protein in two compared samples. Conventionally, the signs (+/−) of DAve (and DCI) indicate whether proteins are up-regulated in the first or in the second sample, respectively.

Figure 4 - MudPIT repeatability. Linear regression analysis obtained by considering the SpC values of proteins identified in two technical replicates of MudPIT analysis. R² and slope values were close to 1. The red rectangle highlights the proteins identified with a low number of peptides, which represent the less reproducible portion of the data.

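Table 2 - Performance of classification obtained by using SVM - Collection 2
Specificity (Spec.), sensitivity (Sens.), positive predictive value (PPV), negative predictive value (NPV), accuracy (Acc.), F-score, balanced accuracy (Bal. Acc.), informedness and Matthews correlation coefficient (MCC) of collection 2. Evaluation capabilities have been obtained using observations not considered in the signal selection phase.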