Extraction of echocardiographic data from the electronic medical record is a rapid and efficient method for study of cardiac structure and function
© Wells et al.; licensee BioMed Central Ltd. 2014
Received: 23 June 2014
Accepted: 11 September 2014
Published: 20 September 2014
Measures of cardiac structure and function are important human phenotypes that are associated with a range of clinical outcomes. Studying these traits in large populations can be time consuming and costly. Utilizing data from large electronic medical records (EMRs) is one possible solution to this problem. We describe the extraction and filtering of quantitative transthoracic echocardiographic data from the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) study, a large, racially diverse, EMR-based cohort (n = 15,863).
There were 6,076 echocardiography reports for 2,834 unique adult subjects. Missing data were uncommon, with over 90% of data points present. Data irregularities were primarily related to inconsistent use of measurement units and transcriptional errors. The reported filtering method required manual review of very few data points (<1%), and filtered echocardiographic parameters were similar to published data from epidemiologic populations of similar ethnicity. Moreover, the cohort is comparable in size to, and in some cases larger than, community-based cohorts of similar race/ethnicity.
These results demonstrate that echocardiographic data can be efficiently extracted from EMRs, and suggest that EMR-based cohorts have the potential to make major contributions toward the study of epidemiologic and genotype-phenotype associations for cardiac structure and function in diverse populations.
Keywords: Electronic health records, Echocardiography, Natural language processing
Indices of cardiac structure and function are clinically relevant parameters associated with important outcomes. Several measures of cardiac structure, including wall thickness and left ventricular dilation, predict cardiovascular disease events and heart failure [1, 2]. Additionally, left atrial size is related to incidence of atrial fibrillation, stroke, and death, and aortic root size is associated with risk of heart failure, stroke, and mortality [1, 3, 4]. These pathologic changes in cardiac structure and function occur in response to myocardial injury and a wide range of stressful stimuli. Discovery of environmental, physiologic, and genetic factors influencing cardiac remodeling will improve understanding of underlying pathologic mechanisms, and may identify pathways and targets for therapeutic intervention.
Structural cardiac parameters are assessed using a variety of imaging modalities. However, the cost of cardiac imaging can prevent their broad application in large research cohorts. One possible solution to this challenge is to leverage data, obtained during clinical care, found in large EMRs. With the growth of EMRs, there have been considerable efforts to utilize clinical data to support cohort development, clinical research, and genetic research. However, EMR data are complex, often unstructured, and prone to multiple types of errors and local idiosyncrasies. A key challenge has been the development of strategies to accurately extract high value clinical data and identify phenotypes of interest. Nonetheless, there are multiple examples of successful use of EMR-derived data for research, particularly for genetic association studies [1, 3, 4, 6–11].
The most common technique to assess cardiac structure in the clinical setting is transthoracic echocardiography (referred to subsequently as “echocardiography”). These studies routinely acquire important parameters that define cardiac structure including left ventricular septal and posterior wall thicknesses, left ventricular end systolic and diastolic diameters, left atrial diameter, and aortic root diameter. As a potentially important data source in a clinically relevant population, echocardiographic information within EMRs is subject to multiple sources of error, including incorrect data entry, transcriptional errors, and inconsistent use of measurement units. Moreover, the data may not be structured in such a way that extraction is straightforward. As such, rapid and efficient methods for the extraction and filtering of echocardiographic data from EMRs are needed. In this report, as part of the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) study, we describe the extraction and filtering of six canonical quantitative echocardiographic variables including left ventricular septal thickness, left ventricular posterior wall thicknesses, left ventricular end systolic diameter, left ventricular end diastolic diameter, left atrial diameter, and aortic root diameter in a racially diverse population from a large EMR for eventual genetic association studies.
The Vanderbilt University Medical Center (VUMC) biorepository, BioVU, is a resource linking DNA samples to a de-identified EMR, termed the Synthetic Derivative (SD), that contains approximately 20 years of data on over 2 million individuals. The SD is generated by the application of a one-way hash to the EMRs that removes or de-identifies protected health information such as proper names, geographical locations, medical record numbers, and social security numbers. Dates are randomly shifted by up to six months (but consistently within any single record). As of March 2014, BioVU houses DNA samples from over 175,000 subjects. The design and implementation of the Synthetic Derivative and BioVU have been previously described, as has utilization of the resource for replication of known associations between genetic variants and common diseases.
EAGLE, as part of the larger Population Architecture using Genomics and Epidemiology I (PAGE I) study, selected 15,863 BioVU samples (EAGLE BioVU) from diverse populations for genotype-phenotype studies, including 11,503 African Americans, 1,702 Hispanics, and 1,098 Asians, to generate a near-complete cross-section of all minority populations in BioVU as of 2012. Subjects have been phenotyped for a range of important characteristics including body mass index (BMI), serum lipid levels, renal function, and hemoglobin A1C [Dumitrescu L, Goodloe R, Boston J, Farber Eger E, Pendergrass SA, Bush WS, Crawford DC: Towards a phenome-wide catalog of human clinical traits impacted by genetic ancestry, submitted]. We restricted echocardiographic data extraction and analyses to the 13,957 non-white adult (age ≥ 18 years) subjects within EAGLE BioVU.
Extraction of echocardiographic data from EMRs
Echocardiography reports in the VUMC EMR are in the portable document format (PDF) and have undergone three formatting iterations since 1997. Reports prior to 1997 are not in digital format and not included in the EMRs. Each report contains structured, semi-structured, and unstructured data. Structured data are generally quantitative measures such as wall thicknesses, chamber dimensions, or flow velocities. Semi-structured data fields contain subjective interpretations of parameters with a limited number of potential values. These fields frequently contain ordinal data. For example, valvular lesions and abnormalities of ventricular function are often subjectively quantified as “mild”, “moderate”, or “severe”. Unstructured fields contain unrestricted prose descriptions of clinically relevant findings as interpreted by the reader.
Fields containing structured, semi-structured, and unstructured data were identified within echocardiography reports in the EMRs. Numeric values for left ventricular septal thickness, left ventricular posterior wall thicknesses, left ventricular end systolic diameter, left ventricular end diastolic diameter, left atrial diameter, and aortic root diameter were subsequently parsed from reports using natural language processing.
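As an illustration of the parsing step, the following is a minimal Python sketch that pulls labelled numeric measurements out of report text with regular expressions. It is not the pipeline used in this study: the field labels, the sample report line, and the function name parse_measurements are all hypothetical, and real reports would need patterns tuned to each of the three formatting iterations.

```python
import re

# Hypothetical field labels; the labels used in actual clinical reports will differ.
FIELD_LABELS = {
    "septal_thickness": r"IVS(?:d)?",
    "posterior_wall_thickness": r"LVPW(?:d)?",
    "lv_end_diastolic_diameter": r"LVIDd",
    "lv_end_systolic_diameter": r"LVIDs",
    "left_atrial_diameter": r"LA",
    "aortic_root_diameter": r"Ao(?:\s+root)?",
}

# A number, optionally followed by a unit (mm or cm).
VALUE = r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mm|cm)?"


def parse_measurements(text):
    """Return {field: (value, unit or None)} for each label found in text."""
    found = {}
    for field, label in FIELD_LABELS.items():
        pattern = re.compile(rf"\b{label}\b\s*[:=]?\s*{VALUE}", re.IGNORECASE)
        match = pattern.search(text)
        if match:
            found[field] = (float(match.group("value")), match.group("unit"))
    return found


# Made-up report line mixing cm, mm, and a value with no unit at all.
report = "IVSd: 1.1 cm  LVPWd: 10 mm  LVIDd: 48 mm  LVIDs 30  LA: 3.9 cm  Ao root: 32 mm"
measurements = parse_measurements(report)
```

Keeping the unit (or its absence) alongside each value preserves the information needed later to detect unit-discordant entries rather than silently assuming one unit.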
Systematic filtering of quantitative echocardiographic data
Step 1: Identification and characterization of outliers
We first examined the distributions of and relationships between quantitative echocardiographic parameters. Extreme outliers, unrealistic values, and unusual relationships (i.e., relative values between parameters) were identified and evaluated manually, including comparison with other echocardiogram reports. Data points were retained when found to be valid, edited when obvious data entry errors were identified, and removed in all other cases.
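The text does not state a numerical rule for what counted as an extreme outlier. As one possible way to shortlist candidates for manual review, the Python sketch below assumes an interquartile-range fence with a multiplier of 3; the multiplier, the function name, and the sample values are illustrative assumptions, not the study's criteria.

```python
import statistics


def flag_extreme_outliers(values, k=3.0):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Made-up septal thickness values in mm; 110 mimics a misplaced decimal point.
septal_mm = [9, 10, 10, 11, 11, 11, 12, 12, 13, 110]
flagged = flag_extreme_outliers(septal_mm)  # indices to send for manual review
```

Flagged indices are only candidates: as described above, each one would still be checked by hand and then retained, edited, or removed.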
Step 2: Identification of measurement unit-discordant outliers
Step 3: Check of anatomically constrained relationships
Step 4: Harmonization of measurement units
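A minimal Python sketch of how unit harmonization could work is given here, assuming that values were recorded in either centimetres or millimetres and that millimetres are the target unit. The plausibility ranges, field names, and function name are hypothetical and are not taken from this study.

```python
# Hypothetical plausible ranges in mm; illustrative only, not the study's limits.
PLAUSIBLE_MM = {
    "septal_thickness": (4.0, 30.0),
    "lv_end_diastolic_diameter": (25.0, 90.0),
}


def harmonize_to_mm(field, value):
    """Return (value_in_mm, status); status is 'mm', 'converted', or 'review'."""
    low, high = PLAUSIBLE_MM[field]
    if low <= value <= high:
        return value, "mm"            # already plausible as millimetres
    if low <= value * 10 <= high:
        return value * 10, "converted"  # plausible only if read as centimetres
    return None, "review"             # neither reading is plausible


as_mm = harmonize_to_mm("lv_end_diastolic_diameter", 48.0)
from_cm = harmonize_to_mm("lv_end_diastolic_diameter", 4.8)
suspect = harmonize_to_mm("lv_end_diastolic_diameter", 480.0)
```

Values that fit neither reading are returned for manual review rather than converted, which keeps the automated step conservative.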
Step 5: Final review for residual aberrant data points
Data were again displayed graphically, as in the initial characterization of erroneous data, to identify residual outliers. There were 10 residual outliers identified for review after re-examination of the data (2 aortic root, 3 posterior wall thickness, 5 septal thickness). Six values were confirmed to be accurate (1 aortic root, 1 posterior wall thickness, 4 septal thickness) and retained while 3 (2 posterior wall thickness, 1 septal thickness) were aberrant and removed from the dataset. One aortic root outlier proved to be entered into an incorrect data field (swapped with atrial diameter) and was retained after editing.
Evaluation of the filtering strategy
The cleaning process was semi-automated in that suspect data points identified at each step underwent manual review to determine their accuracy. Remaining data points were assumed to be free of the specified error types and handled automatically. The accuracy of the filtering strategy for error identification was determined by calculating the proportion of data points selected for review that were, in fact, erroneous as determined by manual cardiologist review. The presence and frequency of error types not included in the filtering strategy, missed errors, and introduced errors were determined by comparing the filtered values from 500 random reports to a gold standard generated manually by a cardiologist.
Results and discussion
During filtering, manual review was performed on 113 data points (16 extreme outliers from Step 1; 55 unit-discordant data points from Step 2; 32 anatomically impossible values from Step 3; 10 residual outliers from Step 5), representing less than 1% of the dataset, and no reports were lost entirely. Among manually reviewed data points, only 11 were determined not to contain errors. Thus, 102 of 113 data points were correctly identified, yielding an accuracy of 90% for error identification. Using expert manual review, errors for 64 reviewed data points were successfully cleaned while 38 were removed from the dataset due to unresolvable errors. Among the 500 reports randomly selected to evaluate the performance of the filtering strategy there were 2,612 data points. Expert manual review found no errors that evaded cleaning (sensitivity to error detection ~100%), and there was perfect agreement between the cardiologist and the output of the filtering method.
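The counts quoted above are internally consistent, as the short Python check below shows. It reproduces the arithmetic only; the variable names are illustrative. The proportion computed here is the share of flagged data points that were true errors, which is the quantity the text reports as accuracy.

```python
# Data points sent for manual review at each filtering step (from the text).
reviewed_by_step = {"step1": 16, "step2": 55, "step3": 32, "step5": 10}
total_reviewed = sum(reviewed_by_step.values())   # 113

no_error_found = 11
true_errors = total_reviewed - no_error_found      # 102

# Outcome of the true errors: corrected versus removed from the dataset.
cleaned, removed = 64, 38
assert cleaned + removed == true_errors

percent_correctly_flagged = round(100 * true_errors / total_reviewed, 1)  # 90.3
```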
The filtering method was highly accurate with regard to selection of data points for review, and also produced high sensitivity with regard to error detection and agreement with a manually curated gold standard. While data regarding the extraction and filtering of quantitative structural data from clinical echocardiography reports are limited, the performance of the current approach was similar to that reported for extraction of ejection fraction and semi-structured echocardiographic data. For example, one report demonstrated that reduced cardiac function (defined as ejection fraction <40%) can be identified within echocardiography reports with a sensitivity of 98.4% and specificity of 100%, and others have extracted semi-structured data elements with a sensitivity and specificity of 78% and 99%, respectively.
Echocardiograms in EAGLE BioVU
Unique adult subjects with an echocardiogram
Median [IQR] echocardiograms/subjects
1 [1, 2]
Range of echocardiograms/subjects
Demographics of entire adult cohort in EAGLE BioVU (N = 13,957)
Body mass index (kg/m²)
Serum creatinine (mg/dL)
Total cholesterol (mg/dL)
Demographics among individuals with and without echocardiography performed
Echocardiography not performed
P (χ² or t-test)
Within group (%)*
Within race (%)**
Within group (%)*
Within race (%)**
Serum creatinine (mg/dL)
Serum creatinine (mg/dL)
Total cholesterol (mg/dL)
Total cholesterol (mg/dL)
Demographic and echocardiographic data for African American subjects from population-based cohorts in published GWAS for echocardiographic traits
Atherosclerosis risk in communities
Coronary artery risk development in young adults study
Jackson heart study
Age (mean ± SD): 57.9 ± 17.6; 58.6 ± 15.9; P = 0.27; 59 ± 6; 59 ± 6; P = 1; 30 ± 4; 29 ± 4; P = 3.4 × 10⁻⁶; 55 ± 13; 54 ± 13; P = 0.04
LV diastolic dimension, mm: 44.4 ± 7.6; 48.5 ± 8.8; 46 ± 6; 49 ± 6; 48 ± 4.5; 51 ± 4.7; 49 ± 4.1; 51 ± 4
Left atrial dimension, mm: 37.4 ± 7.3; 40.1 ± 8.0; 39 ± 6; 39 ± 6; 35 ± 5; 36 ± 4.8
Aortic root diameter, mm: 28.4 ± 3.6; 32.8 ± 4.1; 30 ± 4; 34 ± 4; 26 ± 3; 30 ± 3.5; 30 ± 2.8; 34 ± 3
Posterior wall thickness, mm: 10.5 ± 2.2; 11.4 ± 2.5; 11 ± 2; 12 ± 2; 8 ± 1; 9 ± 1.4; 8 ± 1; 9 ± 1
LV systolic dimension, mm: 29.0 ± 8.4; 33.3 ± 10.7; 29 ± 4; 32 ± 5
Interventricular septal wall thickness, mm: 11.0 ± 2.5; 12.0 ± 2.8; 12 ± 2; 12 ± 3; 9 ± 2; 10 ± 1.6; 9 ± 1; 9 ± 1.5
We extracted echocardiographic traits recorded in the EMRs for 2,834 individuals in EAGLE BioVU, the majority of whom are African American (85.6%). Our experience shows that extraction and filtering of these parameters can be done rapidly and efficiently. Missing data are uncommon, with only ~10% of data points being absent. There are systematic errors, mostly related to choice of measurement units, that are easily corrected. However, there are also non-systematic outliers due to inconsistent use of units and transcriptional errors. Nonetheless, our method required manual review of very few data points (<1%). So, while we chose to have these values reviewed by a content expert and retained where possible, there would be a low penalty, in terms of data loss, for simply removing them.
Several important features of the EAGLE BioVU echocardiography cohort are worthy of note. First is its relatively large size, which is similar to, or larger than, several population-based cohorts of African Americans used in previous genetic studies of echocardiographic traits. In addition, the EAGLE BioVU echocardiography population also has demographic and echocardiographic characteristics similar to those of African American subjects from community-based cohorts. Given the difficulties and cost associated with the ascertainment and phenotyping of large cohorts, clinic-based populations, such as the one developed here, represent an important complementary methodology.
There are limitations to the data extraction and cohort development methods presented here. First, of necessity, the EAGLE BioVU echocardiography population is limited to individuals for whom clinically indicated echocardiography was performed at a single, tertiary care, referral center. Additionally, echocardiographic measures were obtained under clinical standards of care and not using standardized research protocols. As such, the EAGLE BioVU echocardiography cohort is subject to the ascertainment biases and data heterogeneity concerns inherent to all clinic-based populations. Finally, only structured, quantitative echocardiographic measures of cardiac structure were extracted. There are clinically important semi-structured and unstructured data, including parameters of myocardial contractility, valvular function, and recognized patterns of disease that were not extracted.
The decision to focus initially on quantitative data was made because these data are critical for the study of cardiac structure (i.e., they define cardiac structure), are often recorded in tabular format, making them more amenable to extraction, and are particularly susceptible to transcriptional errors and measurement unit inconsistencies that must be removed prior to analysis. Semi-structured data, in general, consist of simplified vocabulary explicitly stating the presence or absence of a limited range of conditions (e.g., valvular stenosis or insufficiency), and, if present, the degree of disease (e.g., mild, moderate, severe). Information extraction methods exploiting these characteristics to identify canonical concepts and corresponding values have been described, but these approaches are less robust for rare phenomena and atypical expressions common in large clinical datasets. Expanding the concept-value vocabulary of such tools could mitigate these limitations, but manual review may always be required to filter unusual findings and phrasings. In general, filtering of semi-quantitative echocardiographic data is less of a concern because possible values for each variable are more constrained (e.g., absent, mild, moderate, severe), limiting the problem of extreme outliers due to transcriptional errors, and because values have no units, obviating the need to assure unit harmonization.
Despite these limitations, this report demonstrates that quantitative echocardiographic data can be extracted quite efficiently from large EMRs. Moreover, the comparable size, demographics, and echocardiographic measures of this population to epidemiologic cohorts provide reassurances that, despite many intrinsic biases, EMR-derived datasets are useful for the study of cardiac structure and function.
These results demonstrate that extraction of echocardiographic data from the EMR environment can be rapid and efficient, and suggest that EMR-based cohorts have the potential to be important data sources for the study of cardiac structure and function. Future directions include refinement of methods for extraction and filtering of semi-structured and unstructured echocardiographic data from EMRs and leveraging of the EAGLE BioVU echocardiography cohort for study of cardiac structure and function genotype-phenotype associations.
Electronic medical records
Epidemiologic Architecture for Genes Linked to Environment study
Vanderbilt University Medical Center
Synthetic Derivative, a de-identified version of the VUMC EMR used for research
Portable document format
Genome-wide association study
Coronary Artery Risk Development in Young Adults Study
Atherosclerosis Risk in Communities cohort
Jackson Heart Study.
The authors would like to acknowledge the Vanderbilt University Center for Human Genetics Research, Computational Genomics Core who provided computational and/or analytical support for this work.
- Vasan RS, Larson MG, Levy D, Evans JC, Benjamin EJ: Distribution and categorization of echocardiographic measurements in relation to reference limits: the Framingham Heart Study: formulation of a height- and sex-specific classification and its prospective validation. Circulation. 1997, 96: 1863-1873. 10.1161/01.CIR.96.6.1863.
- Vasan RS, Larson MG, Benjamin EJ, Evans JC, Levy D: Left ventricular dilatation and the risk of congestive heart failure in people without myocardial infarction. N Engl J Med. 1997, 336: 1350-1355. 10.1056/NEJM199705083361903.
- Benjamin EJ, D’Agostino RB, Belanger AJ, Wolf PA, Levy D: Left atrial size and the risk of stroke and death. The Framingham Heart Study. Circulation. 1995, 92: 835-841. 10.1161/01.CIR.92.4.835.
- Gardin JM, Arnold AM, Polak J, Jackson S, Smith V, Gottdiener J: Usefulness of aortic root dimension in persons ≥65 years of age in predicting heart failure, stroke, cardiovascular mortality, all-cause mortality and acute myocardial infarction (from the cardiovascular health study). Am J Cardiol. 2006, 97: 270-275. 10.1016/j.amjcard.2005.08.039.
- Wojczynski MK, Tiwari HK: Definition of phenotype. Adv Genet. 2008, 60: 75-105.
- Denny JC, Crawford DC, Ritchie MD, Bielinski SJ, Basford MA, Bradford Y, Chai HS, Bastarache L, Zuvich R, Peissig P, Carrell D, Ramirez AH, Pathak J, Wilke RA, Rasmussen L, Wang X, Pacheco JA, Kho AN, Hayes MG, Weston N, Matsumoto M, Kopp PA, Newton KM, Jarvik GP, Li R, Manolio TA, Kullo IJ, Chute CG, Chisholm RL, Larson EB, et al: Variants near FOXE1 are associated with hypothyroidism and other thyroid conditions: using electronic medical records for genome- and phenome-wide studies. Am J Hum Genet. 2011, 89: 529-542. 10.1016/j.ajhg.2011.09.008.
- Gioli-Pereira L, Bernardez-Pereira S, Marcondes-Braga FG, Spina JMR, da Silva RMM, Ferreira NE, Bacal F, Fernandes FB, Mansur AJ, Krieger JE, Pereira AC: Genetic and electroNic medIcal records to predict outcomes in heart failure patients (GENIUS-HF) - design and rationale. BMC Cardiovasc Disord. 2014, 14: 1-5. 10.1186/1471-2261-14-1.
- Kullo IJ, Ding K, Jouni H, Smith CY, Chute CG: A genome-wide association study of red blood cell traits using the electronic medical record. PLoS One. 2010, 5 (9): doi:10.1371/journal.pone.0013011.
- Crosslin DR, McDavid A, Weston N, Nelson SC, Zheng X, Hart E, de Andrade M, Kullo IJ, McCarty CA, Doheny KF, Pugh E, Kho A, Hayes MG, Pretel S, Saip A, Ritchie MD, Crawford DC, Crane PK, Newton K, Li R, Mirel DB, Crenshaw A, Larson EB, Carlson CS, Jarvik GP, Electronic Medical Records and Genomics (eMERGE) Network: Genetic variants associated with the white blood cell count in 13,923 subjects in the eMERGE Network. Hum Genet. 2012, 131: 639-652. 10.1007/s00439-011-1103-9.
- Denny JC, Ritchie MD, Crawford DC, Schildcrout JS, Ramirez AH, Pulley JM, Basford MA, Masys DR, Haines JL, Roden DM: Identification of genomic predictors of atrioventricular conduction: using electronic medical records as a tool for genome science. Circulation. 2010, 122: 2016-2021. 10.1161/CIRCULATIONAHA.110.948828.
- Kullo IJ, Ding K, Shameer K, McCarty CA, Jarvik GP, Denny JC, Ritchie MD, Ye Z, Crosslin DR, Chisholm RL, Manolio TA, Chute CG: Complement receptor 1 gene variants are associated with erythrocyte sedimentation rate. Am J Hum Genet. 2011, 89: 131-138. 10.1016/j.ajhg.2011.05.019.
- Roden DM, Pulley JM, Basford MA, Bernard GR, Clayton EW, Balser JR, Masys DR: Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clin Pharmacol Ther. 2008, 84: 362-369. 10.1038/clpt.2008.89.
- Ritchie MD, Denny JC, Crawford DC, Ramirez AH, Weiner JB, Pulley JM, Basford MA, Brown-Gentry K, Balser JR, Masys DR, Haines JL, Roden DM: Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet. 2010, 86: 560-572. 10.1016/j.ajhg.2010.03.003.
- Matise TC, Ambite JL, Buyske S, Carlson CS, Cole SA, Crawford DC, Haiman CA, Heiss G, Kooperberg C, Marchand LL, Manolio TA, North KE, Peters U, Ritchie MD, Hindorff LA, Haines JL, for the PAGE Study: The Next PAGE in understanding complex traits: design for the analysis of population architecture using genetics and epidemiology (PAGE) study. Am J Epidemiol. 2011, 174: 849-859. 10.1093/aje/kwr160.
- Garvin JH, DuVall SL, South BR, Bray BE, Bolton D, Heavirland J, Pickard S, Heidenreich P, Shen S, Weir C, Samore M, Goldstein MK: Automated extraction of ejection fraction for quality measurement using regular expressions in Unstructured Information Management Architecture (UIMA) for heart failure. J Am Med Inform Assoc. 2012, 19: 859-866. 10.1136/amiajnl-2011-000535.
- Chung J, Murphy S: Concept-value pair extraction from semi-structured clinical narrative: a case study using echocardiogram reports. AMIA Annu Symp Proc. 2005, 2005: 131-135.
- Fox ER, Musani SK, Barbalic M, Lin H, Yu B, Ogunyankin KO, Smith NL, Kutlar A, Glazer NL, Post WS, Paltoo DN, Dries DL, Farlow DN, Duarte CW, Kardia SL, Meyers KJ, Sun YV, Arnett DK, Patki AA, Sha J, Cui X, Samdarshi TE, Penman AD, Bibbins-Domingo K, Buzkova P, Benjamin EJ, Bluemke DA, Morrison AC, Heiss G, Carr JJ, et al: Genome-wide association study of cardiac structure and systolic function in African Americans: the candidate gene association resource (CARe) study. Circ Cardiovasc Genet. 2013, 6: 37-46. 10.1161/CIRCGENETICS.111.962365.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.