Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids

Jordan, Rick; Visweswaran, Shyam; Gopalakrishnan, Vanathi

doi:10.1186/2043-9113-4-13

Research
Open access
Published: 23 October 2014

Journal of Clinical Bioinformatics volume 4, Article number: 13 (2014)

Abstract

Background

Computational methods for mining of biomedical literature can be useful in augmenting manual searches of the literature using keywords for disease-specific biomarker discovery from biofluids. In this work, we develop and apply a semi-automated literature mining method to mine abstracts obtained from PubMed to discover putative biomarkers of breast and lung cancers in specific biofluids.

Methodology

A positive set of abstracts was defined by the terms ‘breast cancer’ and ‘lung cancer’ in conjunction with 14 separate ‘biofluids’ (bile, blood, breastmilk, cerebrospinal fluid, mucus, plasma, saliva, semen, serum, synovial fluid, stool, sweat, tears, and urine), while a negative set of abstracts was defined by the terms ‘(biofluid) NOT breast cancer’ or ‘(biofluid) NOT lung cancer.’ More than 5.3 million total abstracts were obtained from PubMed and examined for biomarker-disease-biofluid associations (34,296 positive and 2,653,396 negative for breast cancer; 28,355 positive and 2,595,034 negative for lung cancer). Biological entities such as genes and proteins were tagged using ABNER, and processed using Python scripts to produce a list of putative biomarkers. Z-scores were calculated, ranked, and used to determine significance of putative biomarkers found. Manual verification of relevant abstracts was performed to assess our method’s performance.

Results

Biofluid-specific markers were identified from the literature, assigned relevance scores based on frequency of occurrence, and validated using known biomarker lists and/or databases for lung and breast cancer [NCBI’s On-line Mendelian Inheritance in Man (OMIM), Cancer Gene annotation server for cancer genomics (CAGE), NCBI’s Genes & Disease, NCI’s Early Detection Research Network (EDRN), and others]. The specificity of each marker for a given biofluid was calculated, and the performance of our semi-automated literature mining method assessed for breast and lung cancer.

Conclusions

We developed a semi-automated process for determining a list of putative biomarkers for breast and lung cancer. New knowledge is presented in the form of biomarker lists; ranked, newly discovered biomarker-disease-biofluid relationships; and biomarker specificity across biofluids.

Background

The volume of scientific information has become overwhelming, making it difficult for scientists and physicians to query effectively. While many data mining and literature mining methods have been described [1–11], new methods are still needed. Articles have been written about drawing implicit connections from separate literatures [12–15], and many unidentified connections exist within publicly available material. Identifying putative disease biomarkers may lead to the discovery of new connections between biofluids and diseases.

Negative abstract sets, which are abstracts that are specifically not about the entity or relationship of interest, can help eliminate false positives from text mining findings. It is also important to examine all abstracts, both positive and negative, so that the results are comprehensive and statistical significance measures can be calculated accurately. However, negative abstract sets do not appear to be discussed in detail in the literature.

A literature search identified several biomedical text mining papers describing the use of a negative set of abstracts [2, 16–19]. Implementations of negative sets of abstracts seem to be described far less often than would be expected. Adamic et al. [2] presented a statistical approach for finding gene-disease relations. The authors described a frequency of occurrence count and an expected number of relevant abstracts vs. a random set. Gene pairs and gene symbol disambiguation results were compared to a human-edited breast cancer gene database.

The method of Al-Mubaid et al. [16] for discovering protein-to-disease associations from MEDLINE abstracts employed a protein and disease name dictionary and “positive” and “negative” sets of abstracts. The positive set consisted of abstracts relevant to a given disease, as determined by a PubMed keyword search; the negative set contained a random set of abstracts that did not mention the disease. The method identified proteins relevant to the disease by comparing the frequency distributions of protein names in the positive set and in the overall set (the union of the positive and negative sets), and selected those proteins for which the two distributions differed with statistical significance.

Andrade and Valencia [17] were interested in annotating the biological function of protein sequences, and discussed the ‘treatment of text with statistical methods’. Their approach estimated the significance of each word in a given set of protein family abstracts by comparing its abundance and distribution with those in a background set of abstracts from varying protein families.

Younesi et al. [18, 19] divided the biomarker terminology into six concept classes (clinical management; diagnostics; prognosis; statistics; evidence; and antecedent). By including this extra level of restriction, the authors were able to significantly reduce the number of retrieved relevant documents. Frequency and entropy ranking methods were used for the acquired gene lists, with frequency ranking performing better overall.

Individual biofluids have been characterized [20–25]; however, we have found only one comprehensive comparison of more than a few biofluids. Alterovitz et al. [26] compared 10 biofluid proteomes to 16 tissue proteomes to determine tissue function and tissue-specific candidate biomarkers that could be found in a given biofluid. Gene Ontology (GO [27, 28]; http://www.geneontology.org/) was used for functionality mapping, NCBI’s Online Mendelian Inheritance in Man (OMIM [29]; http://www.ncbi.nlm.nih.gov/omim/) for disease mapping, and the Pharmacogenomics Knowledge Base (PharmGKB [30]; https://www.pharmgkb.org/) for drug mapping; a relative entropy measure was the scoring method of choice. PubMed co-citation frequencies were used to determine the overall quality of the candidate biomarkers.

Comparisons such as those described above have the potential to reveal critical knowledge as to which biomarkers for a disease may be detected in a given biofluid. As some biofluids are more easily obtainable than others, elimination of invasive sampling procedures is highly desirable. However, details describing which potential biomarkers can be obtained in given biofluids are not clearly defined.

In this paper, we developed a semi-automated process for determining a list of putative biomarkers for breast and lung cancers, with a putative biomarker being described as a ‘gene’ or ‘protein’. A total of 5.3 million PubMed abstracts were analysed for biomarker-disease associations (34,296 positive and 2,653,396 negative for breast cancer; 28,355 positive and 2,595,034 negative for lung cancer). The abstract sets were further stratified among 14 biofluids. New knowledge is provided in the form of known disease biomarker lists, ranked newly discovered biomarker-disease-biofluid relationships, and biomarker specificity across biofluids. On average, we expect true positive rates for new discoveries to be 87.5% for breast cancer and 70.59% for lung cancer (see Additional file 1). These biomarker-disease associations and accompanying z-scores will be used as informative prior values in future disease modeling activities.

Methodology

Automation

Python scripts were developed to reduce the amount of manual effort needed to achieve final scores for each potential biofluid biomarker, and to eliminate manual errors. Figure 1 shows a flowchart that summarizes the experimental methodology used.

Information retrieval

To retrieve abstracts related to breast and lung cancer, a PubMed query was performed using the following limits: Abstracts, English, and Human. Query results for each disease-biofluid combination can be found in Table 1 (see Additional file 2 for the biofluid synonyms used). An abstract consists of journal entry information, title, authors, affiliations, text, copyright information, and PubMed ID. The following sets of abstracts were obtained using the selected criteria from the positive and/or negative queries (defined below).

Table 1 Size of the abstract sets returned from queries of breast and lung cancer

Positive Abstract Sets

A positive abstract set is defined as the set of abstracts obtained by using the following combination of keywords, ‘breast cancer AND (biofluid)’, e.g. breast cancer AND plasma, or ‘lung cancer AND (biofluid)’. From this point forward, all positive abstract sets will be called “positive sets” for brevity. Positive set queries were performed on 4-29-2013 for breast cancer and 5-2-2013 for lung cancer. The underlying assumption being made is that any possible biomarker mentioned in these abstract sets is related to both the disease and the biofluid. Queries were returned from PubMed as large text files, and Python scripts were implemented to process the files.

Negative Abstract Sets

We define a negative abstract set as a set of abstracts returned using the keywords ‘(biofluid) NOT breast cancer’ or ‘(biofluid) NOT lung cancer’. From this point forward, all negative abstract sets will be called “negative sets” for brevity. Negative set queries were performed on 4-29-2013 for breast cancer and 5-2-2013 for lung cancer. Queries were returned from PubMed as large text files, and Python scripts were implemented to process the files.
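The two query definitions above can be sketched as a small helper. This is only an illustration of the keyword patterns quoted in the text: the function name `build_queries` is hypothetical, and the sketch does not apply the PubMed limits (Abstracts, English, Human) or the biofluid synonyms from Additional file 2.

```python
# The 14 biofluids named in the abstract of this article.
BIOFLUIDS = [
    "bile", "blood", "breastmilk", "cerebrospinal fluid", "mucus", "plasma",
    "saliva", "semen", "serum", "synovial fluid", "stool", "sweat", "tears",
    "urine",
]
DISEASES = ["breast cancer", "lung cancer"]


def build_queries(disease, biofluid):
    """Return the (positive, negative) keyword queries for one disease-biofluid pair."""
    positive = f"{disease} AND {biofluid}"   # e.g. 'breast cancer AND plasma'
    negative = f"{biofluid} NOT {disease}"   # e.g. 'plasma NOT breast cancer'
    return positive, negative


# One (positive, negative) query pair per disease-biofluid combination.
queries = {(d, b): build_queries(d, b) for d in DISEASES for b in BIOFLUIDS}
```

Because the positive query requires the disease term and the negative query excludes it, the two result sets for a given pair do not overlap.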

Filtering information

Python scripts were developed to remove unwanted punctuation and other unwanted information from the abstracts.

Named entity recognition

ABNER [31] (A Biomedical Named Entity Recognizer; http://pages.cs.wisc.edu/~bsettles/abner/) v1.5 was used to tag mentions of proteins, DNA, RNA, cell lines, and cell types in the positive and negative sets. Version 1.5 is trained on the NLPBA and BioCreative corpora. Reported performance measures for ABNER are in the range of 65.9-77.8% for protein recall and 68.1-74.5% for protein precision. Our method uses entities tagged as “Protein”, “DNA”, and “RNA”. ABNER also provides a batch tagging process, which proved to be extremely useful.

Entity extraction

Python scripts were developed to produce a list of tagged entities from the ABNER results file (.sgml), remove unwanted characters, tags, tagged entries, and duplicate putative biomarkers from the list, and to tally the final count of each biological entity found. PubMed identifiers were retained for tracking and manual verification purposes.
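A minimal sketch of this extraction step is shown below. It assumes that the tagged file marks entities inline with paired tags such as `<PROTEIN> ... </PROTEIN>`; the exact layout of the ABNER output used by the authors is not given in the text, and the function name `extract_entities` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed inline tag format, e.g. "<PROTEIN> p53 </PROTEIN>"; the real
# ABNER .sgml output may differ. Only the three entity types used by the
# method (Protein, DNA, RNA) are captured; cell lines and cell types are ignored.
TAG_PATTERN = re.compile(r"<(PROTEIN|DNA|RNA)>\s*(.*?)\s*</\1>", re.DOTALL)


def extract_entities(tagged_abstracts):
    """Map each tagged entity (lower-cased) to the set of PubMed IDs mentioning it.

    tagged_abstracts: dict of PubMed ID -> tagged abstract text.
    """
    mentions = defaultdict(set)
    for pmid, text in tagged_abstracts.items():
        for _tag, entity in TAG_PATTERN.findall(text):
            name = " ".join(entity.split()).lower()  # collapse internal whitespace
            if name:
                mentions[name].add(pmid)             # sets remove duplicate mentions
    return dict(mentions)
```

Keeping the PubMed IDs in a set per entity both removes duplicates and preserves the identifiers needed for later manual verification.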

Dictionary

A file named Protein Nomenclature was downloaded from the Human Protein Reference Database (Copyright © 2002-09, Johns Hopkins University and The Institute of Bioinformatics; Additional file 3) to use as a dictionary file. The file contains 19,327 unique IDs. Each entry consists of the HPRD id, gene symbol, RefSeq id, and aliases (separated by semi-colons). The gene symbol was used as the consensus name for all other aliases found. The entities were mapped via another Python script.
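The mapping from aliases to a consensus gene symbol could look like the sketch below. It assumes the four fields are tab-separated, which the text does not state, and the HPRD and RefSeq identifiers in the demo rows are made-up placeholders. The function name `load_alias_map` is hypothetical.

```python
def load_alias_map(lines):
    """Build a dict of lower-cased alias -> set of consensus gene symbols.

    Each row is assumed to be tab-separated:
    HPRD id, gene symbol, RefSeq id, aliases (semi-colon separated).
    """
    alias_map = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed rows
        _hprd_id, symbol, _refseq_id, aliases = fields[:4]
        for name in [symbol] + aliases.split(";"):
            key = name.strip().lower()
            if key:
                alias_map.setdefault(key, set()).add(symbol)
    return alias_map


# Placeholder rows for illustration only; the ids are not real HPRD/RefSeq ids.
demo_rows = [
    "00001\tCEACAM5\tNP_000001.1\tCEA; CD66e",
    "00002\tCEACAM8\tNP_000002.1\tCEA; CD66b",
]
alias_map = load_alias_map(demo_rows)
```

Storing a set of symbols per alias makes shared aliases visible: in the demo, `alias_map["cea"]` contains both CEACAM5 and CEACAM8, which is the kind of double counting the Discussion describes for the CEA alias.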

Scoring

Counts were performed at the abstract level, where a mention of a given biomarker was assigned a count of 1, regardless of the frequency of mentions within the abstract.
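The abstract-level counting rule can be sketched as follows; the function name `document_frequency` and the input layout are illustrative assumptions rather than the authors' script.

```python
def document_frequency(abstract_markers):
    """Count, for each marker, the number of abstracts mentioning it at least once.

    abstract_markers: dict of PubMed ID -> list of marker mentions,
    where the same marker may be repeated within one abstract.
    """
    counts = {}
    for mentions in abstract_markers.values():
        for marker in set(mentions):  # repeats within one abstract count once
            counts[marker] = counts.get(marker, 0) + 1
    return counts
```

For example, an abstract that mentions BRCA1 twice contributes 1, not 2, to the BRCA1 count.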

A z-score expresses how many standard deviations a value lies from the mean of its distribution. Z-scores were computed as follows:

Briefly, following Al-Mubaid [16], let S₁ = {A₁, A₂, …, A_n} be the positive set of abstracts (i.e. disease/biofluid), where each A is an abstract; let S_p = {P₁, P₂, …, P_m} be the set of dictionary proteins (markers) mentioned in the positive set S₁; and let S₂ be the negative set of abstracts.

For each protein (marker) P_i in S_p, compute the document frequency (df) of P_i in both sets S₁ and S₂ as:

$$\mathrm{df}_1(P_i) = \text{number of } S_1 \text{ documents in which } P_i \text{ is mentioned},$$

$$\mathrm{df}_2(P_i) = \text{number of } S_2 \text{ documents in which } P_i \text{ is mentioned},$$

$$\mathrm{df}_t(P_i) = \mathrm{df}_1(P_i) + \mathrm{df}_2(P_i).$$

For each protein in the set S_p, compute an expectation (ex) value and an evidence (ev) value as:

$$\mathrm{ex}(P_i) = \frac{\mathrm{df}_t(P_i)}{|S_1 \cup S_2|} \cdot |S_1|,$$

$$\mathrm{ev}(P_i) = \mathrm{df}_1(P_i).$$

Here ex measures the expected number of S₁ abstracts mentioning P_i, and ev measures the actual number of S₁ abstracts in which P_i has appeared. The larger the difference between the observed and expected document frequencies, ev(P_i) – ex(P_i), the more likely it is that P_i and the disease are significantly associated.

The difference is normalized by:

$$f(P_i) = \frac{\mathrm{ev}(P_i) - \mathrm{ex}(P_i)}{\mathrm{df}_t(P_i)}.$$

The z-score is then calculated by:

$$Z(P_i) = \frac{f(P_i) - \mathrm{mean}(f)}{\mathrm{SD}(f)},$$

where mean(f) is the mean of the f values of all proteins of S_p and SD(f) is the standard deviation of the f values.
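The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script. It reads the normalization as the difference ev − ex divided by dft, treats the positive and negative sets as disjoint so that the overall set size is |S₁| + |S₂|, and uses the population standard deviation; the text does not say whether the population or sample form was used. The function name `z_scores` is hypothetical.

```python
from statistics import mean, pstdev


def z_scores(df1, df2, n_pos, n_neg):
    """Compute a z-score for every marker found in the positive set.

    df1, df2: dicts of marker -> document frequency in the positive / negative set.
    n_pos, n_neg: number of abstracts in the positive / negative set.
    """
    f = {}
    for marker, ev in df1.items():              # ev = df1 (evidence)
        dft = ev + df2.get(marker, 0)           # total document frequency
        ex = dft / (n_pos + n_neg) * n_pos      # expected positive-set count
        f[marker] = (ev - ex) / dft             # normalized difference
    mu = mean(f.values())
    sd = pstdev(f.values())                     # assumption: population SD
    if sd == 0:
        return {marker: 0.0 for marker in f}    # all f equal: no marker stands out
    return {marker: (value - mu) / sd for marker, value in f.items()}


# Toy counts for illustration only (not data from this study).
toy = z_scores({"A": 10, "B": 5, "C": 1}, {"A": 0, "B": 5, "C": 99}, 100, 900)
```

In the toy example, marker A appears only in positive abstracts and receives the highest score, while marker C appears almost entirely in negative abstracts and receives the lowest.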

A threshold value of 1.0 was established as a significance cut-off (see Figure 2). These z-score values will be used as informative prior values in future modeling efforts (Additional file 4 and Additional file 5).

Verification of relationships

One possible method of verification is to remove ‘verification documents’ (ones specifically pertaining to a disease-protein relationship) from the abstract pool and use them for subsequent verification [16]. Our method allows these abstracts to remain in the pool, and verification is performed by comparing our results to a combined disease biomarker list (Additional file 6: Table S1 & Additional file 7: Table S2). The list was created using the following sources: OMIM [29] (O in table; http://www.ncbi.nlm.nih.gov/omim/), a cancer gene annotation system for cancer genomics [32] (CAGE (C); http://mgrc.kribb.re.kr/cage/pageHome.php?m=hm), NCBI’s Genes & Disease [33] (G; http://www.ncbi.nlm.nih.gov/books/NBK22183/), NCI’s Early Detection Research Network [34] (EDRN (E); http://edrn.nci.nih.gov/), an expert provided list (X) of validated cancer markers [35], and a recently released breast cancer paper [36] (P). Markers that are present in at least one of these lists, as well as in our dictionary, were considered verified. The list for breast cancer was compiled using OMIM, CAGE, Genes & Disease, the expert provided list, and the previously mentioned paper. The lung cancer list was compiled from OMIM, CAGE, EDRN, and the expert provided list.

True positive rate determination

Negative abstracts were used initially to eliminate some false positives. However, this process alone is unlikely to eliminate all false positives.

In processing the abstracts, it became apparent that manual examination of abstracts would eventually be required to verify the results. The PubMed identifier of every abstract in which a biomarker was mentioned was retained with that biomarker, allowing for manual tracking and further verification of our results. Relevant abstracts were investigated further. Three criteria were used for a pass/fail outcome: each abstract was examined for mentions of the biomarker, the disease, and the biofluid. All three had to be present for an abstract to pass, and synonyms and/or root words were deemed adequate (e.g. biliary instead of bile).
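The examination described here was performed manually. As a hedged illustration only, the three pass/fail criteria could be pre-screened with a simple term check like the one below; the function name `passes_criteria` and the term lists are assumptions, and plain substring matching is cruder than a human reader (for instance, it would find "bile" inside "mobile").

```python
def passes_criteria(text, marker_terms, disease_terms, biofluid_terms):
    """Return True only if the abstract mentions a marker, a disease, and a biofluid term.

    Each *_terms argument is a list of acceptable names, including any
    synonyms or root words (e.g. ["bile", "biliary"]).
    """
    lowered = text.lower()

    def any_present(terms):
        return any(term.lower() in lowered for term in terms)

    # All three criteria must be met for a pass.
    return all(any_present(terms) for terms in (marker_terms, disease_terms, biofluid_terms))
```

A check of this kind could only shortlist abstracts for review; it does not replace the manual verification the method relies on.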

Results

Positive and negative sets

Table 1 describes the number of relevant abstracts obtained from the PubMed searches. Fourteen biofluids were evaluated. From this table, blood, plasma, and serum returned the most positive and negative abstracts from both breast and lung cancer queries. Over five million total abstracts were examined.

Known markers per biofluid

Our known marker lists are combinations of several ‘biomarker lists’ obtained from well-known databases. The known breast cancer marker list contains 211 gene symbols that mapped to our dictionary (Additional file 6: Table S1; 159 found in this exercise), and the known lung cancer marker list has 209 markers that mapped to our dictionary (Additional file 7: Table S2; 145 found in this exercise). Known marker results presented in Table 2 were obtained by identifying putative biomarkers with a z-score exceeding the significance threshold (>1.0), and confirming the gene symbol in our known disease biomarker list. Table 2 also summarizes the biofluids that produced markers with significant z-scores and/or the number of known markers found for breast and lung cancer.

Table 2 Number of markers identified for each disease-biofluid combination

Z-score threshold optimization

We chose a z-score threshold based on empirical findings. Figure 2 plots the number of known markers and new markers (log₁₀) against the z-score threshold, which was varied between 1 and 4 in increments of 0.5. Based on this plot, we chose a non-stringent z-score threshold of 1.0, which allows us to identify the maximum number of known and new markers.
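The threshold sweep can be sketched as below, assuming a dict of z-scores per marker and a set of known marker symbols; both inputs and the function name `sweep_thresholds` are illustrative, and the toy values are not data from this study.

```python
def sweep_thresholds(z, known, start=1.0, stop=4.0, step=0.5):
    """Count known and new markers whose z-score exceeds each threshold.

    z: dict of marker -> z-score; known: set of known marker symbols.
    Returns a list of (threshold, n_known, n_new) tuples.
    """
    rows = []
    n_steps = int(round((stop - start) / step))
    for k in range(n_steps + 1):
        threshold = start + k * step
        hits = {marker for marker, score in z.items() if score > threshold}
        rows.append((threshold, len(hits & known), len(hits - known)))
    return rows


# Toy values for illustration only.
rows = sweep_thresholds({"A": 4.2, "B": 2.6, "C": 1.2, "D": 0.3}, {"A", "C"})
```

Because raising the threshold can only remove markers, the lowest threshold in the sweep always yields the largest counts, which matches the choice of 1.0 described above.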

Comparison of identification of potential biomarkers by disease-biofluid

Table 2 shows the breakdown of the number of markers found by our method. In most biofluids, the number found in breast cancer outnumbers the number found in lung cancer, with the exceptions being breastmilk (removed from our breast cancer examination due to both positive and negative search terms containing the root ‘breast’) and mucus (greater association with respiratory system).

Known markers found significant vs. non-significant

While the truth is unknown as to the members of the comprehensive pool of breast or lung cancer biomarkers, and thus a true positive value cannot be obtained, estimates can be made. Although these numbers are not shown, one can easily calculate the percentage of known markers identified as significant vs. not-significant using the counts from Table 2.

For breast cancer, the percentages range from 5% in plasma and serum to 37.5% in stool (considering only biofluids with a non-zero number of known significant markers). For lung cancer, the range is from 3% in serum to 37% in mucus.

Newly discovered markers found significant vs. non-significant

The percentage of newly discovered markers (markers not found in known marker list) that were found to be significant vs. the percentage that were identified but not found to be significant was calculated.

For breast cancer, the percentages range from 6.67% in stool to 29.3% in bile (considering only biofluids with a non-zero number of known significant markers). For lung cancer, the range is from 7.9% in plasma to 27.2% in synovial fluid.

Potential marker biofluid specificity

Biomarker commonality and specificity were sought across biofluids. This was a significant finding in that we have not seen many potential biomarker comparisons across more than a few biofluids. Additional file 8: Table S3 shows the known + significant biomarkers within biofluids for breast and lung cancer.

A total of 21 known + significant markers were identified for breast cancer. Nine biofluids produced known IDs with significant scores. A breakdown of this list shows that 14 are only identified in combination with one biofluid, 3 with two biofluids, 1 with 3 biofluids (ERBB2; mentioned in blood, plasma, and serum), 1 with 4 biofluids (NCOA3; mentioned in bile, blood, plasma, and serum), 1 with 6 biofluids (BRCA2; mentioned in bile, blood, mucus, saliva, serum, and sweat), and 1 with 7 biofluids (BRCA1; mentioned in blood, mucus, plasma, saliva, serum, sweat, and urine abstracts).

A total of 26 known + significant putative markers were identified for lung cancer. Eight biofluids produced known IDs with significant scores. A breakdown of this list shows that 21 are only mentioned in combination with one biofluid, 3 with two biofluids, 1 with 3 biofluids (EML4; mentioned in blood, mucus, and serum), and 1 with 4 biofluids (KRAS; mentioned in blood, breastmilk, mucus, and serum).
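A breakdown of this kind can be computed from a mapping of marker to biofluids, as in the sketch below. The function name `biofluid_breakdown` is hypothetical, and the demo uses only the four multi-biofluid breast cancer markers named in the text, not the full list of 21.

```python
from collections import Counter


def biofluid_breakdown(marker_biofluids):
    """Return a Counter of {number of biofluids: number of markers}.

    marker_biofluids: dict of marker -> set of biofluids in which the marker
    was both known and significant.
    """
    return Counter(len(fluids) for fluids in marker_biofluids.values())


# Breast cancer markers reported above with three or more biofluids.
breast_multi = {
    "ERBB2": {"blood", "plasma", "serum"},
    "NCOA3": {"bile", "blood", "plasma", "serum"},
    "BRCA2": {"bile", "blood", "mucus", "saliva", "serum", "sweat"},
    "BRCA1": {"blood", "mucus", "plasma", "saliva", "serum", "sweat", "urine"},
}
breakdown = biofluid_breakdown(breast_multi)
```

Markers that fall in the count-of-one bucket are the candidates for biofluid-specific markers, while those in higher buckets are shared across biofluids.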

Manual verification of findings

A manual check of relevant abstracts was performed to ensure the reliability of our results. Each relevant PubMed abstract was manually examined to verify the biomarker mentioned. The results of this manual verification can be seen in Additional file 1: Table S4. Four known biomarkers (CHEK2 in both plasma and urine, CDKN1B, PCNA, and THBS1) were identified as false positives (red) in our breast cancer list, and seven (KRAS, GDNF in both breastmilk and plasma, MYCL1 in both blood and serum, CD40LG, CGA, CTAG1A, ERCC6, and HRAS) in our lung cancer list. KRAS is interesting in that it produced a false positive in association with breastmilk, but had verified positive findings in associations with blood, mucus, and serum.

True positive rate estimation of new discoveries

Manual verification allowed us to calculate the true positive rates across the biofluids-diseases. The results found in Additional file 1: Table S4 show an average error rate for breast cancer of 12.5%, and an average lung cancer error rate of 29.41%. From these calculations, one can conclude that 87.5% of the breast cancer new discoveries would be true positives, and 70.59% of the lung cancer new discoveries would be true positives.
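The true positive rates follow directly from the average error rates, treating each true positive rate as the complement of the corresponding error rate:

```latex
\begin{align*}
\mathrm{TPR}_{\text{breast}} &= 100\% - 12.5\% = 87.5\% \\
\mathrm{TPR}_{\text{lung}}   &= 100\% - 29.41\% = 70.59\%
\end{align*}
```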

Discussion

We have presented a method to determine the possibility of relatedness between potential biomarkers in biofluids and disease (breast and lung cancers), using positive and negative sets of abstracts and a z-score.

Error exists in ABNER’s [31] tagging, our dictionary consensus, and possibly anywhere manual processing of the data occurs. Negation was not addressed at this time.

A potential dictionary problem was identified in that some members of a protein family had a generic alias in common. This led to results such as ceacam5 and ceacam8 both being identified for the CEA alias. Adding another unique ID such as “ceacam_family” to account for this double counting was considered; however, it was decided to let the counts stand, as there may be double counting elsewhere in the dictionary of which we are unaware.

In some situations a potential biomarker may need to only be mentioned in one negative set abstract to exhibit non-significance by our method. As disease-specific potential markers are sought, common biomarkers implicated in several diseases may not reach a significant score by our method because of their mention in abstracts describing other diseases including other types of cancer.

A requirement for potential biomarkers to appear in different abstracts was not applied. Several biomarker mentions may come from the same abstract. Similarly, there was not a requirement for different biofluids to appear in different abstracts. One biomarker discussed in association with more than one biofluid may appear in the list for each biofluid.

The number of known cancer biomarkers found but deemed not significant was reported. The results may be due to the way the negative search space was defined. It is possible that abstracts of other cancers or diseases exist in our negative set, and thus any biomarker mentioned in association with any other disease would negate our positive findings for breast and/or lung cancer.

Databases used for verification are probably far from being complete, which could be why our list of known + significant biomarkers is smaller than expected. Another explanation could be that certain markers just may not be found in a given biofluid. We will work to improve our verification methods over time.

Lastly, only abstracts were examined in this work. Obviously, full text examination would produce more findings as well as more confidence in the findings, but access to full text remains a limiting factor for all text-mining researchers.

Conclusions

We have presented a method that utilizes literature mining to create a list of documented putative biomarker-biofluid relationships for breast and lung cancer. Over 5 million abstracts were analyzed for biomarker-disease associations. These abstract sets were further stratified among 14 biofluids. Some false positives were initially eliminated by examining negative sets of abstracts and establishing a threshold z-score. New knowledge pertaining to breast and lung cancer is presented in the forms of known disease biomarker lists; ranked, newly discovered biomarker-disease-biofluid relationships; and biomarker specificity across biofluids. The relationships obtained from literature mining were verified by comparison to well-known published databases. Manual examination of abstracts allowed for known relationship verification and true positive rate calculations. On average, we can expect an 87.5% true positive rate for our breast cancer new discoveries, and a 70.59% true positive rate for our lung cancer new discoveries.

Future work in this area will include further automation of our semi-automated process, applying our method to other diseases, assembling a disease database to make our z-score findings available to others, as well as converting our z-score values into prior probabilities for use as informative priors in Bayesian disease modeling.

References

Hirschman L, Park JC, Tsujii J, Wong L, Wu CH: Accomplishments and challenges in literature data mining for biology. Bioinformatics. 2002, 18: 1553-1561. 10.1093/bioinformatics/18.12.1553.
Article CAS PubMed Google Scholar
Adamic LA, Wilkinson D, Huberman BA, Adar E: A literature based method for identifying gene-disease connections. Proc IEEE Comput Soc Bioinform Conf. 2002, 1: 109-117.
Article PubMed Google Scholar
Wren JD, Bekeredjian R, Stewart JA, Shohet RV, Garner HR: Knowledge discovery by automated identification and ranking of implicit relationships. Bioinformatics. 2004, 20: 389-398. 10.1093/bioinformatics/btg421.
Article CAS PubMed Google Scholar
Xuan W, Wang P, Watson SJ, Meng F: Medline search engine for finding genetic markers with biological significance. Bioinformatics. 2007, 23: 2477-2484. 10.1093/bioinformatics/btm375.
Article CAS PubMed Google Scholar
Hristovski D, Peterlin B, Mitchell JA, Humphrey SM: Using literature-based discovery to identify disease candidate genes. Int J Med Inform. 2005, 74: 289-298. 10.1016/j.ijmedinf.2004.04.024.
Article PubMed Google Scholar
Novichkova S, Egorov S, Daraseila N: MedScan, a natural language processing engine for MEDLINE abstracts. Bioinformatics. 2003, 19: 1699-1706. 10.1093/bioinformatics/btg207.
Article CAS PubMed Google Scholar
Srinivasan P: Text mining: generating hypotheses from MEDLINE. J Am Soc Inform Sci Technol. 2004, 55: 396-413. 10.1002/asi.10389.
Article CAS Google Scholar
Leonard JE, Colombe JB, Levy JL: Finding relevant references to genes and proteins in Medline using a Bayesian approach. Bioinformatics. 2002, 18: 1515-1522. 10.1093/bioinformatics/18.11.1515.
Article CAS PubMed Google Scholar
Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet. 2006, 7: 119-129. 10.1038/nrg1768.
Article CAS PubMed Google Scholar
Krallinger M, Valencia A, Hirschman L: Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol. 2008, 9 (Suppl.2): S8-
Article PubMed Central PubMed Google Scholar
Cohen AM, Hersh WR: A survey of current work in biomedical text mining. Brief Bioinform. 2005, 6: 57-71. 10.1093/bib/6.1.57.
Article CAS PubMed Google Scholar
Swanson DR: Medical literature as a potential source of new knowledge. Bull Med Libr Assoc. 1990, 78: 29-37.
PubMed Central CAS PubMed Google Scholar
Zhu S, Okuno Y, Tsujimoto G, Mamitsuka H: Application of a new probabilistic model for mining implicit associated cancer genes from OMIM and Medline. Cancer Inform. 2006, 2: 361-371.
PubMed Central CAS Google Scholar
Frijters R, Van Vugt M, Smeets R, Van Schaik R, De Vlieg J, Alkema W: Literature mining for the discovery of hidden connections between drugs, genes and diseases. PLoS Comput Biol. 2010, 6: e1000943-10.1371/journal.pcbi.1000943.
Article PubMed Central PubMed Google Scholar
Li H, Liu C: Biomarker identification using text mining. Comput Math Methods Med. 2012, 2012: 135780-
PubMed Central PubMed Google Scholar
Al-Mubaid H, Singh RK: A new text mining approach for finding protein-to-disease associations. Am J Biochem Biotechnol. 2005, 1: 145-152.
Article CAS Google Scholar
Andrade MA, Valencia A: Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics. 1998, 14: 600-607. 10.1093/bioinformatics/14.7.600.
Article CAS PubMed Google Scholar
Younesi E, Toldo L, Muller B, Friedrich CM, Novac N, Scheer A, Hofmann-Apitius M, Fluck J: Mining biomarker information in biomedical literature. BMC Med Inform Decis Mak. 2012, 12: 148-10.1186/1472-6947-12-148.
Article PubMed Central PubMed Google Scholar
Deyati A, Younesi E, Hofmann-Apitius M, Novac N: Challenges and opportunities for oncology biomarker discovery. Drug Discov Today. 2012, 18: 614-624.
Article PubMed Google Scholar
Veenstra T, Conrads T, Hood B, Avellino A, Ellenbogen R, Morrison R: Biomarkers: mining the biofluid proteome. Mol Cell Proteomics. 2005, 4: 409-418. 10.1074/mcp.M500006-MCP200.
Article CAS PubMed Google Scholar
Zhou M, Conrads T, Veenstra T: Proteomics approaches to biomarker detection. Brief Funct Genom Proteomics. 2005, 4: 69-75. 10.1093/bfgp/4.1.69.
Article CAS Google Scholar
Lee Y, Wong D: Saliva: An emerging biofluid for early detection of diseases. Am J Dent. 2009, 22: 241-248.
PubMed Central PubMed Google Scholar
Gao K, Zhou H, Zhang L, Lee J, Zhou Q, Hu S, Wolinsky L, Farrell J, Eibl G, Wong D: Systemic disease-induced salivary biomarker profiles in mouse models of melanoma and non-small cell lung cancer. PLoS One. 2009, 4: e5875-10.1371/journal.pone.0005875.
Article PubMed Central PubMed Google Scholar
Xu X, Veenstra T: Analysis of biofluids for biomarker research. Proteomics Clin Appl. 2008, 2: 1403-1412. 10.1002/prca.200780173.
Article CAS PubMed Google Scholar
Delaleu N, Immervoll H, Cornelius J, Jonsson R: Biomarker profiles in serum and saliva of experimental Sjogren’s syndrome: associations with specific autoimmune manifestations. Arthritis Res Ther. 2008, 10: R22-10.1186/ar2375.
Article PubMed Central PubMed Google Scholar
Alterovitz G, Xiang M, Liu J, Chang A, Ramoni MF: System-wide peripheral biomarker discovery using information theory. Pac Symp Biocomput. 2008, 231-242.
Google Scholar
Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) database: sharing knowledge in uniprot with gene ontology. Nucleic Acids Res. 2004, 32 (Database issue): D262-D266.
Article PubMed Central CAS PubMed Google Scholar
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
Article PubMed Central CAS PubMed Google Scholar
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Ostell J, Miller V, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchecko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the national center for biotechnology information. Nucleic Acids Res. 2007, 35 (Database issue): D5-D12. Epub 2006 Dec 14
Hewett M, Oliver DE, Rubin DL, Easton KL, Stuart JM, Altman RB, Klein TE: PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res. 2002, 30 (1): 163-165. 10.1093/nar/30.1.163.
Settles B: ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005, 21: 3191-3192. 10.1093/bioinformatics/bti475.
Park YK, Kang TW, Baek SJ, Kim KI, Kim SY, Lee D, Kim YS: CaGe: a web-based cancer gene annotation system for cancer genomics. Genom Inform. 2012, 10 (1): 33-39. 10.5808/GI.2012.10.1.33. Epub 2012 Mar 31
National Center for Biotechnology Information (US): Genes and Disease [Internet]. 1998, Bethesda (MD): National Center for Biotechnology Information (US). Available from: http://www.ncbi.nlm.nih.gov/books/NBK22183/
Wagner PD, Srivastava S: New paradigms in translational science research in cancer biomarkers. Transl Res. 2012, 159 (4): 343-353. 10.1016/j.trsl.2012.01.015. Epub 2012 Feb 3
Bigbee WL, Gopalakrishnan V, Weissfeld JL, Wilson DO, Dacic S, Lokshin AE, Siegfried JM: A multiplexed serum biomarker immunoassay panel discriminates clinical lung cancer patients from high-risk individuals found to be cancer-free by CT screening. J Thorac Oncol. 2012, 7 (4): 698-708. 10.1097/JTO.0b013e31824ab6b0.
Cancer Genome Atlas Network: Comprehensive molecular portraits of human breast tumours. Nature. 2012, Advanced online publication

Acknowledgements

The research reported in this publication was partially supported by the following grants from the National Institutes of Health: National Library of Medicine Award Number R01LM010950 (to VG), National Institute of General Medical Sciences Award Number R01GM100387 (to VG), and National Cancer Institute Award Number P50CA090440. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
Rick Jordan, Shyam Visweswaran & Vanathi Gopalakrishnan
Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
Shyam Visweswaran & Vanathi Gopalakrishnan
Department of Computational & Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
Shyam Visweswaran & Vanathi Gopalakrishnan

Corresponding author

Correspondence to Rick Jordan.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

RJ wrote the Python scripts, downloaded the abstracts, performed the analysis, and created the figures and tables. VG conceived of the study and participated in its design and coordination. SV provided methodology and participated in study design. All authors participated in drafting the manuscript and read and approved the final manuscript.

Electronic supplementary material

Additional file 1: Table S4: Manually verified biomarker table. Biomarker-specific abstracts were manually examined for accuracy. Each abstract was checked for mentions of the biofluid, disease, and biomarker; the absence of any one of these terms resulted in a ‘false positive’ result. (DOCX 129 KB)

Additional file 2: SupplementaryBiofluidTable. (XLSX 11 KB)

Additional file 3: SupplementaryProteinlist. (TXT 2 MB)

Additional file 4: SupplementaryBCResults. (XLSX 1 MB)

Additional file 5: SupplementaryLCResults. (XLSX 768 KB)

Additional file 6: Table S1: List of breast cancer identifiers. (DOCX 52 KB)

Additional file 7: Table S2: List of lung cancer identifiers. (DOCX 53 KB)

Additional file 8: Table S3: Significant validated potential markers for breast and lung cancer that are either common to several biofluids or biofluid-specific. Biomarkers highlighted in yellow are either breast cancer markers found in the list of validated lung cancer biomarkers (Additional file 7: Table S2), or lung cancer markers found in the list of validated breast cancer biomarkers (Additional file 6: Table S1). It is doubtful that these markers are disease-specific. CDH1 is the only biomarker found in both cancer lists. (DOCX 124 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Jordan, R., Visweswaran, S. & Gopalakrishnan, V. Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids. J Clin Bioinform 4, 13 (2014). https://doi.org/10.1186/2043-9113-4-13

Received: 26 June 2014
Accepted: 02 October 2014
Published: 23 October 2014
DOI: https://doi.org/10.1186/2043-9113-4-13
