# Potential identification of pediatric asthma patients within pediatric research database using low rank matrix decomposition

- Teeradache Viangteeravat

**3**:16

https://doi.org/10.1186/2043-9113-3-16

© Viangteeravat; licensee BioMed Central Ltd. 2013

**Received: **20 July 2013

**Accepted: **22 August 2013

**Published: **28 September 2013

## Abstract

Asthma is a prevalent disease in pediatric patients, and most cases begin in the very early years of life. Early identification of patients at high risk of developing the disease can alert clinicians to provide the best treatment for managing asthma symptoms. Evaluating patients at high risk of developing asthma from huge data sets (e.g., electronic medical records) is often challenging and very time consuming, and a lack of complex data analysis or proper clinical logic determination can produce invalid results and irrelevant treatments. In this article, we used data from the Pediatric Research Database (PRD) to develop an asthma prediction model from past All Patient Refined Diagnosis Related Groupings (APR-DRGs) coding assignments. The knowledge gleaned from this asthma prediction model, from both routine use by physicians and experimental findings, will be fused into a knowledge-based database for dissemination to those involved with asthma patients. Success with this model may lead to expansion to other diseases.


## Background

### Data mining in medical informatics

Because of their predictive power, various healthcare systems are attempting to use available data mining techniques to discover hidden relationships and trends in the huge amounts of data available within clinical databases and convert them into valuable information that can be used by physicians and other clinical decision makers. In general, data mining techniques learn from past examples and model the oftentimes non-linear relationships between independent and dependent variables. The resulting model provides formalized knowledge and prediction of outcomes. For example, Shekar et al. used a data mining based decision tree algorithm to discover the most common refractive errors in both males and females [1]. Palaniappan et al. presented a prototype that combines the strengths of online analytical processing (OLAP) and data mining techniques for clinical decision support systems (DSS) [2]. Prather et al. used data mining techniques to explore the factors contributing to the cost and outcomes of prenatal care [3]. Chae et al. applied a data mining approach to policy analysis in the health insurance domain [4]. Advanced data mining techniques have also been used to help evaluate healthcare utilization costs for employees and dependents in organizations [5].

More advanced machine learning methods, such as artificial neural networks and support vector machines, have been adopted in various areas of biomedicine and bioinformatics, including genomics and proteomics [6]. For biological data, clustering is probably the most widely used data mining technique; common approaches include hierarchical clustering, *k*-means clustering, self-organizing maps, fuzzy clustering, and expectation maximization, along with classifiers such as backpropagation neural networks and support vector machines [7, 8]. Bayesian models are widely used to classify data into predefined classes based on a set of features. Given the training examples, a Bayesian model stores the probability of each class, the probability of each feature, and the probability of each feature given each class. When a new, unseen example arrives, it can be classified according to these probabilities [9, 10]. This classification technique is one of the most widely used in medical data mining. Decision tree models, such as the Iterative Dichotomiser 3 (ID3) heuristic, belong to the subfield of machine learning that learns rules from examples. The ID3 heuristic uses a measure called entropy to quantify disorder in a set of data [11, 12]; the idea behind it is to find the best attribute for classifying the records in the data set. The outcome is a set of learned rules and a model used to predict unseen examples from past seen examples. Non-negative matrix factorization (NMF) has been widely used in the field of text mining [13, 14]. The constraint that distinguishes it from other methods is that the two factor matrices *W* and *H* obtained from *V* (i.e., *nmf* (*V*) *→ WH*) must be non-negative, i.e., all elements must be greater than or equal to zero. Typically, *W* and *H* are initialized with random non-negative values to start the NMF algorithm. The convergence time varies, and reaching a global minimum is not guaranteed [15].
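As a concrete illustration of the non-negativity constraint described above, NMF can be sketched with the classic Lee-Seung multiplicative update rules (one common way to compute an NMF; the references above do not prescribe a specific algorithm, and the toy matrix below is invented):

```python
import numpy as np

def nmf(V, k, iters=200, seed=0):
    """Factor a non-negative V into W @ H with W, H >= 0 using
    Lee-Seung multiplicative updates. W and H start from random
    non-negative values, so only a local minimum is reached."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W, H = rng.random((m, k)), rng.random((k, n))
    eps = 1e-10  # guards against division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.random.default_rng(1).random((6, 4))   # toy non-negative data
W, H = nmf(V, k=2)
assert (W >= 0).all() and (H >= 0).all()      # the defining constraint
```

Because the initialization is random, repeated runs can converge to different factorizations with different reconstruction errors, which is the convergence caveat noted above.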

Here, we apply a methodology and classification technique in data mining called low rank matrix decomposition (LRMD) that allows the computer to learn from past APR-DRG data sets for asthma, extract dominant features, and then predict outcomes. APR-DRGs and the mathematics behind LRMD are summarized below.

### All patient refined diagnosis related groups (APR-DRGs)

APR-DRG is a grouping methodology developed in a joint effort between 3M Health Information Systems (HIS) and the National Association of Children’s Hospitals and Related Institutions (NACHRI). APR-DRGs are proprietary and provide the most comprehensive and complete severity-of-illness classification available for pediatric patients. The system was designed to be more appropriate for general patient populations than the old Diagnosis Related Group (DRG) [16]: while the DRG was designed and normed on Medicare patients only, the APR-DRG was designed and normed on a general population. We use APR-DRG based weights normed on a pediatric patient population. There are 316 APR-DRGs; common codes include, but are not limited to, 138 Bronchiolitis/RSV pneumonia, 141 Asthma, 160 Major repair of heart anomaly, 225 Appendectomy, 420 Diabetes, 440 Kidney transplant, 662 Sickle cell anemia crisis, and 758 Childhood behavioral disorder. Each group has four severity of illness (SOI) levels and four risk of mortality (ROM) levels, whereas the DRG and Medicare Severity Diagnosis Related Groups (MS-DRG) have only a single severity and risk of mortality per group. For example, there are multiple diagnosis codes for asthma, and an encounter might have asthma as the principal diagnosis or a secondary diagnosis; if the encounter was primarily for asthma treatment, the APR-DRG code will be 141, and all such asthma encounters will be assigned the same APR-DRG code. In our internal system we code inpatient encounters to APR-DRG as well as DRG. These data are available in our PRD back through 2009 [17], including Emergency Room (ER), Ambulatory Surgery (AS), and Observation (OBS) encounters.

## Methods

### Singular value decomposition

The singular value decomposition (SVD) approximates a matrix *A*∈*R*^{M×N}, where M ≥ N, by a product *UV*^{T}, where *U*∈*R*^{M×k} and *V*∈*R*^{N×k} [18, 19]. Since any rank *k* matrix can be decomposed in such a way, and any pair of such matrices yields a rank *k* matrix, the problem becomes an unconstrained minimization over pairs of matrices (U, V) with the minimization objective

$$f\left(U,V\right)={\left\|A-U{V}^{T}\right\|}_{2}^{2}$$

where *A*^{(k)} = *UV*^{T} is a rank *k* approximation of matrix A. To find the optimum choices of U, V in the *l*_{2} norm sense [20, 21], the partial derivatives of the objective *f*(*U,V*) with respect to U and V are set to zero. Setting $\frac{\partial f\left(U,V\right)}{\partial U}=0$ yields *U* = *AV*(*V*^{T}*V*)^{-1}. Considering an orthogonal solution with *V*^{T}*V* = *I*, this reduces to *U* = *AV*. Substituting back into $\frac{\partial f\left(U,V\right)}{\partial V}=0$, we have

$${A}^{T}AV=V\left({U}^{T}U\right)$$

where *U*^{T}*U* = Λ is diagonal for an orthogonal solution. The columns of V are thus mapped by *A*^{T}*A* to multiples of themselves, i.e., they are eigenvectors of *A*^{T}*A*. Therefore, the gradient $\frac{\partial f\left(U,V\right)}{\partial \left(U,V\right)}$ vanishes at an orthogonal (U,V) if and only if the columns of V are eigenvectors of *A*^{T}*A* and the columns of U are eigenvectors of *AA*^{T}, scaled by the square roots of their eigenvalues [18, 19]. More generally, the gradient vanishes at any (U,V) if and only if the columns of U are spanned by eigenvectors of *AA*^{T} and the columns of V are spanned by eigenvectors of *A*^{T}*A*. In terms of the singular value decomposition $A={U}_{o}S{V}_{o}^{T}$, the gradient vanishes at (U,V) if and only if there exist matrices ${P}_{U}^{T}{P}_{V}=I\in {R}^{k\times k}$ such that *U* = *U*_{o}*SP*_{U} and *V* = *V*_{o}*P*_{V}. Thus, using the singular vectors that correspond to the largest singular values represents the global properties (i.e., feature vectors) of A while satisfying the minimization in the *l*_{2} norm sense [19].
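Numerically, the best rank-*k* approximation in the *l*_{2} sense is obtained by truncating the SVD to the *k* largest singular values. A minimal NumPy sketch (the matrix here is random illustrative data, not study data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 5))                     # M x N with M >= N

# A = U_o S V_o^T; keep the k largest singular values and vectors.
U0, s, V0t = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U0[:, :k] @ np.diag(s[:k]) @ V0t[:k, :]

# Eckart-Young: the l2 (Frobenius) error of the truncation equals the
# energy of the discarded singular values.
err = np.linalg.norm(A - A_k)
assert np.isclose(err, np.sqrt((s[k:] ** 2).sum()))
```

The assertion verifies the optimality property used above: no rank-2 matrix can get closer to A in the Frobenius norm than the truncated SVD.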

### Low rank matrix decomposition

Low rank matrix decomposition expresses a matrix *X*∈*R*^{M×N} as a sum of simple rank one matrices so as to capture the nature of the matrix, in which *X* is represented by the summation of *r* (the rank of the matrix) outer products:

$$X=\sum_{i=1}^{r}{u}_{i}{v}_{i}^{T}$$

For *X*∈*R*^{M×N}, {*u*_{1}, *u*_{2}, …, *u*_{r}} and {*v*_{1}, *v*_{2}, …, *v*_{r}} are sets of linearly independent column vectors with dimensions M and N, respectively. Each constituent outer product ${u}_{i}{v}_{i}^{T}$ is rank one: an M×N matrix whose column (row) vectors are each a linear multiple of the vector *u*_{i} (*v*_{i}). To be more precise, a necessary condition is that the vector set {*u*_{1}, *u*_{2}, …, *u*_{r}} form a basis for the column space of matrix X and the vector set $\left\{{v}_{1}^{T},{v}_{2}^{T},\dots ,{v}_{r}^{T}\right\}$ form a basis for the row space of matrix X. It is noted, however, that there exist an infinite number of distinct selections of these basis vectors for the case r ≥ 2, and hence an infinite number of distinct rank decompositions. The ultimate selection to be made is typically based on the application as well as computational considerations. To provide a mathematically based method for selecting the required basis vectors, let us consider the functional relationship

$${f}_{k}\left(\left\{{u}_{i}\right\},\left\{{v}_{i}\right\}\right)={\left\|X-\sum_{i=1}^{k}{u}_{i}{v}_{i}^{T}\right\|}_{p}$$

For *p* = 1, 2, where the integer *k* ranges over the interval 1 ≤ *k* ≤ *r*.

The functional *f*_{k} is convex in {*u*_{i}} for a fixed set of {*v*_{i}}, and vice versa; for the proof, please refer to [22]. The convexity property is important since it ensures that any local minimum of *f*_{k}(*v*) (i.e., with *u* fixed), and vice versa, is also a global minimum. With regard to the above equation, a specific selection of the vector sets {*u*_{1}, *u*_{2}, …, *u*_{k}}∈*R*^{M} and {*v*_{1}, *v*_{2}, …, *v*_{k}}∈*R*^{N} is to be made so as to minimize this functional. The optimal selection then provides the best rank *k* approximation of matrix *X* in the *l*_{p} norm sense, designated by

$${X}^{\left(k\right)}=\sum_{i=1}^{k}{u}_{i}{v}_{i}^{T}$$

For convenience, we express this approximation in a normalized form as

$${X}^{\left(k\right)}=\sum_{i=1}^{k}{\sigma }_{i}^{o}{u}_{i}{v}_{i}^{T}$$

where the *u*_{i} and *v*_{i} are unit norm vectors and the ${\sigma }_{i}^{o}$ are positive scalars. The most widely employed matrix decomposition procedure is the singular value decomposition (SVD). The SVD provides an effective method for mitigating the deleterious effects of additive noise and is characterized by minimizing the function *f*_{k}({*u*_{i}},{*v*_{i}}) in the *l*_{2} norm sense, that is

$$\underset{\left\{{u}_{i}\right\},\left\{{v}_{i}\right\}}{\mathrm{min}}{\left\|X-\sum_{i=1}^{k}{u}_{i}{v}_{i}^{T}\right\|}_{2}$$

The *l*_{1} norm criterion, in contrast, can be of practical use when analyzing data that contain outliers. Namely, it is useful to express the criterion as an objective function that finds the best rank *k* approximation of matrix *X*∈*R*^{M×N} as measured by the *l*_{1} norm, that is

$$\underset{U,V}{\mathrm{min}}{\left\|X-U{V}^{T}\right\|}_{1}=\underset{\left\{{u}_{i}\right\},\left\{{v}_{i}\right\}}{\mathrm{min}}{\left\|X-\sum_{i=1}^{k}{u}_{i}{v}_{i}^{T}\right\|}_{1}$$

In order to find the optimum solution that minimizes this *l*_{1} objective function, we introduce a concept called alternating optimization, explained in detail below.

### Alternating optimization

Consider minimizing *f*(*U*,*V*) = ||*X* - *UV*^{T}||_{1}. By fixing U, the objective function becomes

$$f\left(V\right)={\left\|X-U{V}^{T}\right\|}_{1}$$

where the columns of *V*^{T} are denoted by ${V}^{T}=\left[{\tilde{v}}_{1}\phantom{\rule{0.12em}{0ex}}{\tilde{v}}_{2}\cdots {\tilde{v}}_{n}\right]$. It is straightforward to see that f(V) can be rewritten as a sum of independent criteria,

$$f\left(V\right)=\sum_{i=1}^{n}{\left\|{x}_{i}-U{\tilde{v}}_{i}\right\|}_{1}$$

where *x*_{i} is the *i*-th column of X. Solving each of these independent *l*_{1} problems for the columns of *V*^{T} gives the solution for fixed U. On the other hand, by fixing V, the objective function can be expressed as

$$f\left(U\right)={\left\|{X}^{T}-V{U}^{T}\right\|}_{1}$$

and a similar method may be used to solve for U. The iteration process of finding ${\tilde{v}}_{i}$ and then finding ${\tilde{u}}_{i}$ (i.e., the alternating optimization) is continued until a stopping criterion is met, i.e., the matrices from two successive iterations are sufficiently close; for example, $\left|\right|{X}_{i-1}^{\left(k\right)}-{X}_{i}^{\left(k\right)}{\left|\right|}_{2}<\mathrm{\u03f5}$ with ϵ = 10^{-7}. However, it must be noted that finding a global minimum is not guaranteed. In the following section, we establish a guideline for selecting the initial values.
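A compact sketch of this alternating scheme is given below, with each independent *l*_{1} subproblem solved by iteratively reweighted least squares (an assumption on our part; the article does not name a specific least-absolute-deviations solver, and any such solver would serve). The test matrix is invented: exactly rank-2 data contaminated with one gross outlier, the case where the *l*_{1} criterion is most useful.

```python
import numpy as np

def lad(A, b, iters=30, eps=1e-8):
    # argmin_v ||b - A v||_1, approximated by iteratively
    # reweighted least squares.
    v = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(b - A @ v), eps)
        v = np.linalg.solve(A.T @ (A * w[:, None]), A.T @ (w * b))
    return v

def lrmd_l1(X, k, outer=25, tol=1e-7, seed=0):
    # Alternate: with U fixed, each column of V^T is an independent
    # l1 problem; with V fixed, each row of U is one. Stop when two
    # successive rank-k approximations are sufficiently close.
    m, n = X.shape
    U = np.random.default_rng(seed).standard_normal((m, k))
    prev = np.zeros_like(X)
    for _ in range(outer):
        Vt = np.column_stack([lad(U, X[:, j]) for j in range(n)])
        U = np.vstack([lad(Vt.T, X[i, :]) for i in range(m)])
        Xk = U @ Vt
        if np.linalg.norm(prev - Xk) < tol:
            break
        prev = Xk
    return U, Vt

# Exactly rank-2 data contaminated with one gross outlier entry.
rng = np.random.default_rng(1)
X = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 8))
X[0, 0] += 25.0
U, Vt = lrmd_l1(X, k=2)
```

The robustness intuition: an *l*_{2} fit would smear the outlier's influence across the whole approximation, while the *l*_{1} fit can leave a large residual concentrated at the contaminated entry.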

### Selection criterion

The alternating optimization requires an initial choice of *U*, where *U*∈*R*^{M×k}. We consider the following cases: (i) a rank *k* = 1 approximation, and (ii) rank 1 < *k* ≤ *r*, where *r* = rank(*X*). In order to take the global data into account, a good initial value of *U* for a rank *k* = 1 (i.e., *U*∈*R*^{M×1}) approximation may be obtained as follows. First, compute the *l*_{1} norm of each column vector in *X*, and denote these norms by ${x}_{1}^{c},{x}_{2}^{c},\dots ,{x}_{n}^{c}$. Next, compute the *l*_{1} norm of each row vector in *X*, and denote these norms by ${x}_{1}^{r},{x}_{2}^{r},\dots ,{x}_{m}^{r}$. Now find the maximum value in $\left\{{x}_{1}^{c},{x}_{2}^{c},\dots ,{x}_{n}^{c},{x}_{1}^{r},{x}_{2}^{r},\dots ,{x}_{m}^{r}\right\}$. If the maximum corresponds to a column norm, say from column *j*, then choose that column (i.e., *U* = *X*(:, *j*)) as the initial choice for *U*. If the maximum corresponds to a row norm, say row *i*, then we start with the transposed form of the fixed-U criterion and choose that row (i.e., *V*^{T} = *X*(*i*, :)) as the initial choice for *V*^{T}. We can extend this concept to find the initial choice of *U* for rank *k* = 2 by applying the rank one approximation twice in succession, so that the objective function can be expressed as

$$f\left(U,V\right)={\left\|X-{u}_{1}{v}_{1}^{T}-{u}_{2}{v}_{2}^{T}\right\|}_{1}$$

where *U* = [*u*_{1} *u*_{2}] and *V* = [*v*_{1} *v*_{2}], with *u*_{1}, *u*_{2}, *v*_{1}, *v*_{2} vectors. The initial choice for U at rank *k* = 2 is therefore *U* = [*u*_{1} *u*_{2}] (i.e., the two largest *l*_{1} column or row norms). In a similar fashion, a selection criterion for the initial U at rank *k* (1 < *k* ≤ *r*) can also be obtained. Thus the column space of *X* (i.e., U) represents a feature vector that is considered a global property (i.e., the best low rank approximation) of *X* minimizing the above objective function in the *l*_{1} norm sense [22].
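The rank-one selection rule above is simple to state in code. A sketch (the 3×3 array is invented for illustration):

```python
import numpy as np

def initial_choice(X):
    # l1 norm of every column and every row of X.
    col = np.abs(X).sum(axis=0)
    row = np.abs(X).sum(axis=1)
    if col.max() >= row.max():
        j = int(col.argmax())            # largest norm is a column:
        return X[:, [j]], None           # U = X(:, j)
    i = int(row.argmax())                # largest norm is a row: start
    return None, X[[i], :]               # transposed, V^T = X(i, :)

X = np.array([[1., 2., 0.],
              [0., 9., 1.],
              [1., 1., 1.]])
U, Vt = initial_choice(X)
# Column 1 has the largest l1 norm (2 + 9 + 1 = 12), so U = X(:, 1).
```

Seeding the iteration with the dominant column (or row) takes the global data into account, rather than starting from an arbitrary random vector.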

### Convergence subsequence

Since the sequence of approximation errors produced by the alternating optimization is non-increasing and bounded below (i.e., *E*(*U*,*V*) ≥ 0), we have

$${E}_{1}\ge {E}_{2}\ge \cdots \ge {E}_{i}\ge \cdots \ge 0$$

and lim_{i → ∞}*E*_{i} = *E*_{final} ≥ 0. Therefore the entire infinite-length sequence lies inside a hypersphere (i.e., a closed and bounded set of points) of finite volume centered at *X* and with a radius of *E*_{1}. Since this hypersphere has finite volume, it is possible to construct a finite number of smaller hyperspheres, each with radius ϵ > 0, such that the union of all these small hyperspheres contains the large hypersphere of radius *E*_{1}. For every ϵ > 0 there will be at least one hypersphere of radius ϵ containing an infinite number of points of the sequence. Thus, there is at least one cluster point, and this cluster point is the limit of a convergent subsequence. Therefore, the sequence ${X}_{i}^{\left(k\right)}$ produced by the algorithm must contain at least one convergent subsequence.

### Feature extraction methodology

Text parsing software and the Natural Language Toolkit (NLTK) [24], written in Python, were used to parse all encounter data sets for this preliminary study. Let *X* = [*x*_{ij}] define the *m* × *n* term-by-encounter matrix for decomposition. Each element *x*_{ij} of the matrix *X* is a weighted frequency at which term *i* occurs in encounter *j*, where term *i* ∈ {gender, age, discharge status, admitting diagnosis, secondary diagnoses, principal diagnosis, principal procedure, secondary procedures}. The stop-word corpus from NLTK was used to filter out unimportant terms.

In each iteration, three subsets of the data were used as training data and the remaining subset as testing data; in rotation, each subset served as the testing set in exactly one iteration. The rank used to test the LRMD was *k* = 4; hence the *U* and *V* matrix factors were (number of terms) × 4 and 4 × 1200, respectively. Of the possible asthma encounters, 900 were used for training and the remaining 300 for testing our classifier. The initial matrix factors *U* and *V* were selected to meet our selection criterion (see Selection criterion), and the alternating iteration was continued until the matrices from two successive iterations were sufficiently close (see Alternating optimization). All classification results were obtained using Python version 2.7.4.
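To make the matrix construction concrete, a toy version of the term-by-encounter matrix might be built as follows. The three encounter strings and the tiny stop-word list are invented for illustration; the study itself parsed PRD encounters and used NLTK's stop-word corpus, and weighted rather than raw frequencies.

```python
from collections import Counter

stop_words = {"of", "the", "with"}           # stand-in for NLTK's corpus
encounters = [                               # invented toy encounters
    "male age 5 asthma 493.92 discharge home",
    "female age 6 asthma 493.92 with wheezing 786.07 discharge home",
    "male age 4 bronchiolitis 466.0 discharge home",
]

tokens = [[t for t in e.split() if t not in stop_words] for e in encounters]
terms = sorted({t for doc in tokens for t in doc})

# x_ij = frequency of term i in encounter j (raw counts here).
X = [[Counter(doc)[t] for doc in tokens] for t in terms]
```

Each column of `X` is then one encounter's feature vector, ready for the LRMD factorization described above.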

## Results

**Example of dominant features using LRMD**

| Variables | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| admitting diagnoses (ICD-9-CM) | 786.07, 493.90 | 786.07, 786.09 | 786.07, 493.92 | 786.07, 493.92 | 786.07 |
| secondary diagnoses (ICD-9-CM) | v175, 530.81, 786.05, v174.9, 466.0 | 486.00, 692.9, v175, 786.2 | 786.05, v175, 785.0, 787.03 | 786.05, 780.60, 692.9 | v175, 692.9, 785.0 |
| principal diagnoses (ICD-9-CM) | 493.92, 494.90, 493.91 | 493.92, 493.90 | 493.92, 786.06 | 493.92, 493.91 | 493.92 |
| principal procedures (ICD-9-CM) | N/A | 939.4 | N/A | N/A | 939.4 |
| age (year) | 4-7 | 3.5-7 | 4-6.5 | 4-6 | 4.5-8 |
| gender | male | female | female | female | male |
| discharge status | home | home | home | home | home |

**Sensitivity and specificity**

| Cutoff score for similarity to features in training set (1 = perfect correlation, 0 = no correlation) | LRMD sensitivity | LRMD specificity | NMF sensitivity | NMF specificity |
|---|---|---|---|---|
| > 0.65 | 0.92 | 0.54 | 0.85 | 0.53 |
| > 0.75 | 0.84 | 0.76 | 0.80 | 0.70 |
| > 0.85 | 0.74 | 0.80 | 0.72 | 0.78 |
| > 0.9 | 0.56 | 0.90 | 0.54 | 0.88 |

## Discussion

The results presented in this paper should not be taken as an accurate representation of our patient data, as they do not include all the data records. These data are meant to demonstrate the potential of the PRD and the feasibility of the LRMD data mining technique. Additional experiments with a larger number of features (rank *k* > 4) and more encounter data sets (2009–2012) should produce better models that capture the diversity of contexts described by those encounters. Using ICD-9-CM codes has limitations because they are generally assigned for billing purposes rather than for clinical research. We plan to access free-text fields in the near future, such as physician and clinician notes, and include them in our classifier. Additional socio-demographic variables, such as income, type of insurance, environment, nutrition, genome, and comorbidity covariates, could potentially be added to the model to support the evaluation of potential causes for readmission.

## Conclusions

Using data mining techniques to learn from past examples within rich data sources such as electronic medical records not only permits users to detect expected events, such as those predicted by models, but also helps them discover unexpected patterns and relationships that can then be examined and assessed to develop new insights. We hope that the rules learned with the LRMD technique will greatly advance progress toward the goal of identifying pediatric patients at high risk of asthma and help support clinical decisions.

## Declarations

### Acknowledgements

The author thanks the UTHSC department of ITS Computing Systems and Office of Biomedical Informatics for use of informatics resources and collaborations. We gratefully acknowledge Rae Shell and Grady Wade for assistance on proofreading and providing good comments. This work was supported by the Children’s Foundation Research Institute (CFRI).


## References

- Chandra Shekar DV, Sesha Srinivas V: Clinical Data Mining: An Approach for Identification of Refractive Errors. Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS 2008) Vol I. 2008, Hong Kong, 19-21 March.
- Palaniappan S, Ling C: Clinical Decision Support Using OLAP With Data Mining. IJCSNS International Journal of Computer Science and Network Security. 2008, 8 (9).
- Prather JC, et al: Medical data mining: knowledge discovery in a clinical data warehouse. Proc AMIA Annu Fall Symp. 1997, 101-105.
- Chae YM, et al: Data mining approach to policy analysis in a health insurance domain. Int J Med Inform. 2001, 62 (2-3): 103-111. 10.1016/S1386-5056(01)00154-X.
- Hedberg SR: The data gold-rush. Byte. 1995, 20 (10): 83-88.
- Mohri M, Rostamizadeh A, Talwalkar A: Foundations of Machine Learning. 2012, The MIT Press.
- Huang Z: Extensions to the *k*-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery. 1998, 2 (3): 283-304.
- Jain AK, Murty MN, Flynn PJ: Data Clustering: A Review. ACM Computing Surveys. 1999.
- Neapolitan RE: Learning Bayesian Networks. 2004, Prentice Hall.
- Gelman A: A Bayesian formulation of exploratory data analysis and goodness-of-fit testing. International Statistical Review. 2003, 71 (2): 369-382.
- Mitchell T: Machine Learning. 1997, McGraw-Hill, 55-58.
- Grzymala-Busse JW: Selected algorithms of machine learning from examples. Fundamenta Informaticae. 1993, 18: 193-207.
- Liu WX, et al: Nonnegative matrix factorization and its applications in pattern recognition. Chinese Science Bulletin. 2006, 51 (1): 7-18. 10.1007/s11434-005-1109-6.
- Cemgil AT: Bayesian inference for nonnegative matrix factorisation models. Comput Intell Neurosci. 2009, 785152.
- Berry MW, Gillis N, Glineur F: Document Classification Using Nonnegative Matrix Factorization and Underapproximation. 2009, IEEE.
- Sedman AB, Bahl V, Bunting E, Bandy K, Jones S, Nasr SZ, Schulz K, Campbell DA: Clinical redesign using all patient refined diagnosis related groups. Pediatrics. 2004, 114 (4): 965-969. 10.1542/peds.2004-0650.
- Viangteeravat T: Giving Raw Data a Chance to Talk: A demonstration of de-identified Pediatric Research Database and exploratory analysis techniques for possible cohort discovery and identifiable high risk factors for readmission. Proceedings of the 12th Annual UT-ORNL-KBRIN Bioinformatics Summit. 2013.
- Srebro N, Jaakkola T: Weighted Low Rank Approximation. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003). 2003, Washington DC.
- Young E: Singular Value Decomposition. http://www.schonemann.de//svd.htm
- Cadzow JA: Signal enhancement: a useful signal processing tool. Spectrum Estimation and Modeling, Fourth Annual ASSP Workshop. 1988, 162-167.
- Cadzow JA: Minimum l(1), l(2), and l(infinity) norm approximate solutions to an overdetermined system of linear equations. Digital Signal Processing. 2002, 12 (4): 524-560. 10.1006/dspr.2001.0409.
- Viangteeravat T: Discrete Approximation using L1 norm Techniques. Master's Thesis, Electrical Engineering, Vanderbilt University. 2000.
- Cadzow JA: Application of the l1 norm in Signal Processing. Department of Electrical Engineering, Vanderbilt University. 1999, Nashville.
- Perkins J: Python Text Processing with NLTK 2.0 Cookbook. 2010, Birmingham: Packt Publishing.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.