Skip to main content


High-throughput identification of reference genes for research and clinical RT-qPCR analysis of breast cancer samples

Article metrics

  • 5647 Accesses

  • 34 Citations



Quantification and normalization of RT-qPCR data critically depends on the expression of so called reference genes. Our goal was to develop a strategy for the selection of reference genes that utilizes microarray data analysis and combines known approaches for gene stability evaluation and to select a set of appropriate reference genes for research and clinical analysis of breast samples with different receptor and cancer status using this strategy.


A preliminary search of reference genes was based on high-throughput analysis of microarray datasets. The final selection and validation of the candidate genes were based on the RT-qPCR data analysis using several known methods for expression stability evaluation: comparative ∆Ct method, geNorm, NormFinder and Haller equivalence test.


A set of five reference genes was identified: ACTB, RPS23, HUWE1, EEF1A1 and SF3A1. The initial selection was based on the analysis of publically available well-annotated microarray datasets containing different breast cancers and normal breast epithelium from breast cancer patients and epithelium from cancer-free patients. The final selection and validation were performed using RT-qPCR data from 39 breast cancer biopsy samples. Three genes from the final set were identified by the means of microarray analysis and were novel in the context of breast cancer assay. We showed that the selected set of reference genes is more stable in comparison not only with individual genes, but also with a system of reference genes used in commercial OncotypeDX test.


A selection of reference genes for RT-qPCR can be efficiently performed by combining a preliminary search based on the high-throughput analysis of microarray datasets and final selection and validation based on the analysis of RT-qPCR data with a simultaneous examination of different expression stability measures. The identified set of reference genes proved to be less variable and thus potentially more efficient for research and clinical analysis of breast samples comparing to individual genes and the set of reference genes used in OncotypeDX assay.


Gene-expression analysis is ubiquitously used both in life sciences and in medical research [13]. Microarray and reverse transcription quantitative real-time polymerase chain reaction (RT-qPCR) analyses are commonly used to measure transcript abundance. Microarrays allow the massive parallel analysis of thousands of genes but still require significant time and financial expenses. RT-qPCR is used as a conventional routine method for expression profiling of a moderate number of genes and is highly suitable when only a small amount of sample is available [1, 4]. High throughput thermal cyclers and PCR platforms help to overcome the limitation of RT-qPCR and to perform simultaneous expression analysis of tens and hundreds of genes for one or more samples depending on the platform [1].

In research practice these two technologies are able to solve a broad spectrum of questions supplementing each other. In clinical diagnostics RT-qPCR is still more popular with the advantage of speed and a relatively low cost [5].

Based on a totally different readout strategy, RT-qPCR has become the standard method for validation of microarray data. However, the application of RT-qPCR requires an appropriate normalization to obtain accurate and reliable quantification of gene expression levels. The purpose of normalization is to remove experimentally induced errors that can be introduced at a number of stages throughout the procedure (variability in sample acquisition and RNA extraction protocols, different reverse transcription reactions and PCR efficiencies) and to reveal the true biological changes in expression [4, 68]. Lack of appropriate normalization can lead to misinterpretation of a target gene expression profile. This is especially pronounced when the samples come from different individuals, different tissues, different time courses, different environmental conditions, etc.

To date, the reference gene concept is a gold standard used for normalization [1]. According to this concept reference genes are endogenous controls whose expression remains stable across all sources of variation during the experimental workflow. Results of RT-qPCR studies that use improper reference genes (e.g. genes that are not constitutively expressed) can be significantly different from results obtained with proper reference genes [7, 9, 10]. So far, as ideal reference genes do not exist, it is essential to select appropriate reference genes for accurate normalization of the RT-qPCR results for each experimental condition. It is worth noting that the expression levels of many commonly used reference genes such as GAPDH, ACTB, RPS18, UBC, B2M, and HPRT1 have been reported to vary considerably in multiple tissues and cells [4, 7, 9, 1115]. However, in the absence of a better choice, the majority of researchers continue to use these genes, sometimes even without preliminary validation [9, 13, 16]. Thus for many studies there is a necessity to identify other candidate reference genes. One of possible ways is RT-qPCR screening of all 535 transcripts involved in maintaining cellular function by RT-qPCR [11, 15]. Another way in our genomic era is to use the wealth of accumulated available data obtained by high-throughput microarray and sequencing technologies. This idea is getting more and more popular and was used in some recent studies of plants [1719], diatoms [20], mammals [21, 22] and human [2327]. Certain attempts were also made to assess the stability characteristics of transcripts in different tissue types and cells of several species in different physiological states [14, 28, 29]. However, no genes were identified as generally stably expressed. Thus it is still a problem to reveal genes that are stably expressed in a given biological context, and the analysis of well-annotated microarray data may be very helpful to solve this problem. This approach promises to be particularly fruitful for heterogenic samples like biopsies which may contain different cell types in different ratios that can vary widely between samples.

In this study, we identified a set of genes that can be used as a reference simultaneously for analysis of breast cancer samples with an unknown hormone receptor status and different cancer status. To achieve this goal we analyzed four available well-annotated Affymetrix Human Genome U133A array datasets containing different breast cancers and normal breast epithelium from breast cancer and cancer-free patients. Then we validated identified candidate reference genes by RT-qPCR on 39 fresh breast cancer biopsies and compared expression stability of these genes and commonly used control genes including ACTB, GAPDH, RPLP0, GUSB and TFRC that form a set of reference genes in a commercial Oncotype DX Breast Cancer diagnostic test [2].

In order to select an optimal set of reference genes we developed a new strategy that utilizes the advantages of the most widely used algorithms for gene stability evaluation: comparative ∆Ct method [8], geNorm [12], NormFinder [1, 30] and Haller equivalence test [31]. Each of these popular methods for gene stability evaluation has its own advantages and pitfalls, and the developed strategy allows scientists to avoid weighing all the pros and cons of individual methods each time, but instead to combine all these methods. We showed that novel candidate reference genes identified by our microarray search have expression stability comparable to or even higher than the stability of commonly used reference genes. We also demonstrated that the selected set of 5 reference genes is more stable than the set of 5 reference genes used in Oncotype DX Breast Cancer diagnostic test. The RT-qPCR data used for the validation study was obtained in compliance with the global standardization accords presented in the MIQE guidelines [32], see Additional file 1.


Large-scale selection of an extended list of candidate reference genes using microarray data

Four GEO Series [33, 34] of Affymetrix Human Genome U133A array data were taken: GSE17705 (title: “Endocrine Sensitivity Index Validation Dataset”, 298 samples, [35]), GSE10780 (title: “Proliferative genes dominate malignancy-risk gene signature in histologically-normal breast tissue”, 185 samples, [36]), GSE20711 (title: “Epigenetic portraits of human breast cancers (expression data)”, 90 samples, [37]), GSE20437 (title: “Histologically normal epithelium from breast cancer patients and cancer-free prophylactic mastectomy patients”, 42 samples, [38]).

For each data series raw microarray data were processed using Bioconductor [39] xps package [40] implementation of RMA [41]. For each probeset a variability characteristic was computed as a ratio of 95- and 5-percentiles of log-transformed expression levels, and an ascending ordering of probesets was performed according to this variability characteristic. After these steps four lists of probesets were obtained (each corresponding to one data series). Genes corresponding to probesets that belong to top-500 for all four lists were included into the extended list of candidate reference genes.

The variability characteristic used for stability ranking is methodologically close to the interquartile range, but controls expression variation for 90 percent of samples in a dataset instead of 50 percent. Unlike mean deviation and standard deviation that are commonly used for expression stability comparison [14, 15, 24] the ratio of quantiles is absolutely stable with respect to data outliers.

For genes with stable expression the stability ranking based on the ratio of 95- and 5-percentiles of log-transformed expression levels is similar to ranking based on a relative width of expression range in a logarithmic scale: if 5-percintile p 5 of log-transformed expression level equals m(1-w), and 95-percintile p 95 equals m(1 + w), and 0 ≤ w<< 1 then

p 95 p 5 = m 1 + w m 1 w = 1 + w 1 + w + w 2 + = 1 + 2 w + 2 w 2 1 w 1 + 2 w .

The top-500 lists of most stable probesets for four datasets used for the generation of an extended list of candidate genes contained 347–410 unique gene symbols. However, pairwise intersection of top-500 lists corresponding to two datasets (GSE17705 – ER-positive breast cancer patients [35]; GSE20711 – breast cancer patients [37]) contained only 48 unique gene symbols, and interception of top-500 lists corresponding to three datasets (a small dataset GSE20437 – histologically normal epithelium from breast cancer patients and cancer-free prophylactic mastectomy patients [38] – added) contained 27 unique gene symbols. Hence, our extended list of candidate reference genes – it contained 25 unique gene symbols – was formed based mainly on three datasets out of four, and the forth dataset (GSE10780 – a larger set of histologically-normal breast tissues from breast carcinoma patients [36]) only validated our selection.


For validation of reference genes, samples from 39 patients (mean age 60.5 years; range 35–87 years) were analyzed. Patients were treated for breast cancer at the Gertsen Moscow Research Institute of Oncology or the Moscow State Oncology Medical Center, Russia, in 2012. Patient selection was based only on the availability of tumor tissue. All patients gave written informed consent to access their tissues and review their medical records in accordance with the principles of the Declaration of Helsinki. All patients were treated by mastectomy. Patients did not receive neoadjuvant chemotherapy.

Histologically, 32 tumors were diagnosed as ductal carcinomas, 1 was a mucus-producing adenocarcinoma, and 6 were of other histological types. TMN classification of all tumors is presented in Additional file 2. Twenty five tumors (64%) were estrogen-receptor (ER) positive, and 14 cases (36%) were ER-negative. Progesterone receptors (PR) were detectable in 24 cases (61.5%), whereas 15 tumors (38.5%) were PR-negative. Detailed information about tumor receptor status is provided in Additional file 2. All tumor samples were placed in appropriate volume of RNAlater RNA stabilization reagent (Qiagen, Germany) within 30 min after mastectomy and incubated at 4°C at least overnight but no more than 3 days. If total RNA was not isolated from a tissue sample after 3 day incubation, the sample was frozen at −80°C.

RNA extraction and quality control

Tumor samples stabilized in RNAlater reagent were crushed in liquid nitrogen. QIAzol lysis reagent (Qiagen, Germany) was added and the homogenate was centrifuged through a QIAshredder column (Qiagen, Germany). Total RNA was extracted from the eluate by the miRNeasy mini kit (Qiagen, Germany) according to the manufacturer’s instructions. Genomic DNA contamination was removed by performing DNaseI digestion on the RNA binding column for 15 min. Total RNA was eluted in 50 μl of RNase-free H2O and stored at −80°C. RNA yield and purity was assessed spectrophotometrically by measuring OD260 and OD260/280 ratio respectively in RNase-free H2O using NanoDrop ND-1000 (NanoDrop Technologies, USA). For all samples OD260/280 ratio was between 2.0 and 2.2.

RNA integrity was determined using the Experion Automated Electrophoresis Station with the standard-sensitivity RNA analysis kit according to the manufacturer’s instructions (Bio-Rad, USA). Values of RNA quality indicator (RQI) generated by Experion system’s software ranged from 6.4 to 9.4.

Primer and probe design and validation

A gene list used for validation is presented in Table 1. Primer pairs and probes were designed using Primer3 software [42] according to the recommendations described in [43], and their specificity was determined with Primer-blast program [44]. All sequences were tested for potential secondary structure and dimerization formation using OligoAnalyzer 3.1 program [45]. Primer specificity was confirmed by agarose gel electrophoresis after RT-qPCR reaction. Validated sequences are shown in Table 2 (see details in Additional file 3). Oligonucleotides were synthesized by Syntol (Russia).

Table 1 Description of selected candidate reference genes
Table 2 Description of primers and probes for validated candidate reference genes

The amplification efficiency of each set of primer pair and probe was determined by plotting the Ct values obtained for 7 serial dilutions of cDNA (1:25, 1:100, 1:250, 1:500, 1:1000, 1:2500, and 1:5000). For this aim cDNA was synthesized by reverse transcription of the RNA pool which was prepared by mixing of an equal amount of total RNA isolated from five random tumor samples from the biopsy collection. The corresponding RT-qPCR primer efficiencies (E) were calculated according to the equation E = 10(−1/slope). Each reaction was performed in triplicates. All PCR-mixes were pipetted by the automated pipetting system epMotion 5075 (Eppendorf, Germany).

PCR efficiencies for all candidate reference genes were higher than 1.83 and lower than 2.1 (Table 2, see Additional files 4 and 5), except for TPT1 (1.76). The amplicon sizes for all primer pairs are presented in Table 2.

cDNA synthesis and quantitative RT-qPCR

RNA samples were reverse transcribed to cDNA with the Reverse Transcription reaction mix using random hexamer primer according to the manufacturer’s instruction (Syntol, Russia). Initially 2 μg of total RNA was incubated at 65°C for 5 min followed by an incubation with 10 μl 2.5× reaction mix, 1 μl of 15 OD/μl random hexamer, 1 μl of 50U/ μl MMLV reverse transcriptase, 1 μl of 5U/ μl RNase inhibitor and nuclease-free water in final volume 25 μl for 1 hour at 37°C and 5 min at 95°C. First strand cDNA samples were stored at −20°C.

To measure the transcript levels of selected candidate genes by RT-qPCR, a 2.5× PCR reaction mix for qPCR (Syntol, Russia) was applied and analysis was performed on a 96-well DT-prime detection system (DNA-Technology, Russia). Each reaction was performed in triplicates in a reaction volume of 25 μl in high-profile 96-well unskirted PCR plates (BioRad, USA). All reactions contained 1 μl of cDNA diluted in 10 times, 10 μl of 2.5× Master Mix containing 6.25 mM MgCl2, 2.5 μl of 25 mM MgCl2 (till final concentration in a reaction mix 5 mM), 0.6 μl of 10 μM of each primers, 0.6 μl of 10 μM of a probe and 9.7 μl of DNase/RNase-free water. The PCR program consists of an initial 10 min template denaturation step at 94°C for enzyme activation, followed by 50 cycles of 94°C for 20 sec, 64°C for 10 sec, and 72°C for 15 sec. Nontemplate controls were also performed in triplicate for each gene set of primer pair and probe. Sets for all 18 candidate genes were analyzed in the same run. Each run was repeated three times. All PCR-mixes were pipetted by the automated pipetting system epMotion 5075 (Eppendorf, Germany). To verify the presence of genomic DNA contamination in the isolated total RNA, RT-qPCR were performed on equivalent amounts of total RNA without reverse transcription (as template) for each tumor sample.

Baseline and threshold values were automatically determined for all reactions in the plate using RealTime PCR v.7.3 software (DNA-Technology, Russia).

All RT-qPCR experiment data comply with the Minimum Information for Publication of Quantitative Real-Time PCR Experiments (MIQE) guidelines [32]. The MIQE checklist is presented in Supplementary Information (see Additional file 1).

Analysis of gene expression stability by RT-qPCR

To analyze the stability of the candidate reference genes, four different methods were used [9]: comparative ΔCt method, geNorm, NormFinder and Haller equivalence test. Briefly, the geNorm tool is based on the principle that the expression ratio of two ideal reference genes should be constant in samples from different experimental conditions or cell types [12]. The expression stability value (MgeNorm) of a gene is defined as the standard deviation of the log-transformed expression ratio for the particular gene relative to all other genes under investigation in a panel of cDNA samples. Lower values of MgeNorm correspond to higher gene expression stability. Because a freely available geNorm program does not allow the data to be processed efficiently if the RT-qPCR for each cDNA is repeated several times, we performed the geNorm analysis of our RT-qPCR data manually as described in [9].

The comparative ΔCt method compares a Ct ratio of “pairs of genes” within each sample and assigns a relative variation value MΔCt to each gene. Higher values of relative variation MΔCt correspond to lower gene expression stability.

NormFinder is a freely available program [46] which calculates the stability index using analysis of variance (ANOVA) on log-transformed expression values [1, 30]. The NormFinder algorithm estimates overall gene expression variation MNormFinder as well as the variation between subgroups, such as ER-positive and ER-negative breast cancer tissue samples. Higher values of MNormFinder correspond to higher gene expression variability.

Haller equivalence test gives for each pair of samples k, l an α-confidence interval δ kl , L δ kl U for gene expression difference in samples k, l[31]. An integral measure of gene variability in Haller equivalence test approach (MHaller) was set to the 95-th percentile of a set δ kl L , δ kl U k l with a confidence level α set to 0.05. Similarly to the other approaches considered in our study, the higher MHaller value corresponds to higher gene expression variability. The analysis of gene expression stability using Haller equivalence test is described in details in [9].

Gene expression levels were estimated using a standard exponential model A = I/ECt, where A is an expression level, I is a threshold value, Ct is a number of cycle corresponding to the threshold value, and E is a qPCR efficiency for a gene.

Data generation for “quasi-genes”

In order to compare the stability of a set of reference genes with the stability of individual genes using standard methods that were used for the stability evaluation of genes (comparative ∆Ct method, geNorm, NormFinder, Haller equivalence test) we implemented a simple procedure that generates data for a “quasi-gene” that can be associated with a set of reference genes.

An expression level, assigned to a “quasi-gene” by this procedure, was equal to a normalization factor defined by the set of genes, or, in other words, to the geometric mean of expression levels for genes that belong to the set: A ˜ = i = 1 n A i n ,

In order to find an appropriate Ct-value C ˜ , required by some methods for stability evaluation, we used a standard exponential model I = AECt, where A is an expression level, E is a gene efficiency, I is a threshold, Ct is threshold cycle. We also accepted as a fact that all Ct-values for a fixed sample correspond to the same threshold I ˜ , and set an efficiency for a “quasi-gene” to the geometric mean of efficiency values of genes in the set E ˜ = i = 1 n E i n . Due to the exponential model assumptions, log E ˜ I ˜ = log E ˜ A ˜ + C ˜ log E ˜ E ˜ = log E ˜ A ˜ + C ˜ , so Ct-value, assigned to by the procedure to a “quasi-gene” for a fixed sample, was equal to

log E ˜ I ˜ log E ˜ A ˜ = log E ˜ I ˜ log E ˜ i = 1 n A i n = log E ˜ I ˜ 1 n i = 1 n log E ˜ I ˜ E i C t i = 1 n i = 1 n C t i log E ˜ E i .

The RT-qPCR data contained 3 replicates for every gene and 3 replicates for every sample, and hence, there were nine Ct-values for each gene-sample pair. Taking the respective Ct-values of this 3x3 matrix for each gene of a system we got the corresponding matrix of Ct-values for a “quasi-gene”.


The strategy for the selection of reference genes that was used in this study consists of the following stages: the large-scale screening of an extended list of candidate reference genes based on microarray data series, the selection of a candidate gene set for validation by RT-qPCR, and the determination of the final set of reference genes.

Large-scale selection of an extended list of candidate reference genes based on microarray data

The large-scale selection of the extended list of candidate reference genes was performed by a selection of genes that are present in sub-lists of most stable genes in all four analyzed microarray data sets (see the details in the Methods). Note that this approach allows a combining of microarray studies that utilize different microarray technologies. The total number of such genes was 25 (Table 3). Sixty percent of these genes (15 genes) were present in the list of maintenance genes expressed across eleven human adult and fetal tissues [11], and 9 percent (3 genes) were also present in the list of 47 transcripts expressed at the same level across eleven human adult and fetal tissues [11]. The extended list of candidates included eight RPL and eight RPS genes. These results are concordant with the results of [14, 15, 24].

Table 3 An extended list of candidate reference genes

Such genes as ACTB and GAPDH, which are widely used for normalization of RT-qPCR data, were not included into the resulting extended list of candidate genes. The GAPDH gene did not belong to any list of top-500 stable probesets, while gene ACTB (as well as UBC) belonged to three lists of top-500 stable probesets, but was not present in top-500 list corresponding to a dataset of histologically normal epithelium from breast cancer patients and cancer-free prophylactic mastectomy patients (GSE20437). It is also worth noting that ACTB and UBC (in contrast with, e.g., GUSB) had high intergroup difference in expression level for samples from cancer and cancer-free patients. However, the intragroup variation was low for each of the groups. Hence, the assessment of the expression level of these genes can be potentially used for the estimation of a tumor/normal epithelium ratio in a biopsy sample. An ability of housekeeping genes to define different biological states was also reported in [15].

Selection of a gene set for the experimental validation and RT-qPCR validation

The selection of a gene set for validation from the preliminary extended list was performed by an identification of genes with the highest stability values in the majority of the analyzed microarray data series, but the following additional limitation was imposed: a simultaneous selection of genes with known interactions, as well as a selection of genes with similar biological functions should be minimized. Informally, this limitation increases the independence of genes in the selected set.

Nine genes from the extended list of candidate reference genes (TPT1, HCFC1, PTMA, RPL23A, RPL37A, RPL39, RPS23, HUWE1, EEF1A1) were selected for the experimental validation of the expression stability and comparison with other genes traditionally used as reference genes in RT-qPCR studies of breast tissue (ACTB, GAPDH, RPLP0, GUSB, TFRC, SF3A1, MRPL19, PSMC4, TBP[2, 23, 27, 47, 48]) (Table 1). As no unique criterion exists for the comparison of gene expression stability, we combined four different approaches: comparative ΔCt method [8], geNorm [12], NormFinder [1, 30], and Haller equivalence test [31] (see Methods section for the details).

The simplest method, the comparative ΔCt method, uses only raw values of threshold cycles Ct. An application of the ΔCt method to the results of RT-qPCR analysis of 39 breast cancer tissue samples for 18 genes from the set for validation gave the following results (Figure 1a, Additional file 6: Figure S1). Ten genes, namely, ACTB, HUWE1, RPS23, EEF1A1, SF3A1, HCFC1, RPL37A, MRPL19, TBP and TFRC showed nearly the same expression stability for the whole set of samples (Figure 1a). The RPL23A gene displayed the lowest stability in the whole set of samples and ER-negative tumors (Additional file 6: Figure S1). The highest variability for ER- positive tumors was exhibited by PSMC4.

Figure 1

Evaluation of candidate reference gene expression stability by RT-qPCR. Gene expression variations calculated for aggregated set of samples by a: comparative ΔCt (MΔCt); b: geNorm (MgeNorm) and d: NormFinder (MNormFinder). c: Maximum fold changes of gene expression levels MHaller calculated by Haller equivalence test.

In contrast to the comparative ΔCt method, geNorm, NormFinder and Haller equivalence test use expression levels as input data. The values of expression stability MgeNorm for 18 candidate genes generated by the most widely used algorithm geNorm are presented in Figure 1b and Additional file 6: Figure S1. The 10 top-ranked genes were similar to the top-10 genes in the rating generated by the comparative ΔCt method in the whole set of samples: stability values of their expression were nearly the same in each sample group and only the exact order of genes was slightly different.

Unlike comparative ΔCt method and geNorm, NormFinder and Haller equivalence test examine the expression stability of each candidate independently from other candidates. The results of NormFinder application to the studied set of genes is presented in Figure 1d and Additional file 6: Figure S1. In intergroup variation analysis groups were formed by ER-positive and ER-negative samples, respectively (Additional file 7: Figure S2). Higher stability was observed for ACTB, RPS23, HUWE1, EEF1A1, MRPL19, SF3A1, HCFC1, RPL37A and TBP.

The Haller equivalence test allows to evaluate a fold change value (MHaller) of candidate reference genes expression level between different samples for a given significance value. The results of the expression variability estimation for the 18 candidate genes studied are presented in Figure 1c and Additional file 6: Figure S1. In contrast to the comparative ΔCt method, geNorm and NormFinder PSMC4 demonstrated the lowest variability for the whole set of samples as well as for ER-negative and ER-positive tumors. The next gene in the stability ranking was RPL39, the third gene was ACTB that demonstrated lowest variation according to other methods of stability evaluation that A large group of genes from the validation set (GUSB, RPS23, MRPL19, RPL37A, SF3A1, HUWE1, TBP) showed very similar MHaller values and were the next in stability ranking. In concordance with the results of the comparative ΔCt method, geNorm and NormFinder, RPL23A had the greatest expression variability. The relatively high MHaller values of all candidate genes are consistent with highly heterogeneous nature of biopsies and with other studies [15, 47]. Also it is worth noting that MHaller stability values are “pessimistic” in comparison with other methods as they represent the “poor case”, but not the “averaged” one (see details in Methods).

An important inference is that such genes as GAPDH and RPLP0, commonly employed as reference genes (e.g., in the commercial Oncotype DX diagnostic test) were ranked among the most variable candidate genes by almost all methods.

Final selection of reference genes

Thus four rankings of gene expression stability were obtained by the following methods: comparative ∆Ct method [8], geNorm [12], NormFinder [1, 30] and Haller equivalence test [31]. In spite of the fact that generally the results of all stability evaluation methods are consistent each of these ranking has its own order of the genes. Based on both four stability rankings and intergroup variations calculated by NormFinder we formed a subset of the best 5 reference genes: ACTB, RPS23, HUWE1, EEF1A1, SF3A1 (other good choices can be obtained by a replacement of EEF1A1 or SF3A1 by MRPL19). Note that three out of five genes in this set (RPS23, HUWE1, EEF1A1) were selected by microarray analysis and were top-ranked in the extended list of candidate genes. ACTB was not included in the extended list of candidate genes only because it had high intergroup variation between normal epithelium samples from cancer and cancer-free patients. However it showed low variability value for other microarray data series used for the selection of the extended list.

In order to validate the selected subset of reference genes we added a “quasi-gene” corresponding to the set of 5selected genes (see Methods section for details) and calculated the stability measures by all four methods (Table 4). The results show that the selected subset clearly outperforms not only every individual gene from the tested set, but also a system of reference genes used by OncotypeDX test. The only method that did not rank the selected set as the most stable was the Haller equivalence test: it ranked the individual gene PSMC4 higher than “quasi-genes”.

Table 4 Stability characteristics evaluated for the system of 18 individual genes and 2 “quasi-genes” – one corresponding to the selected system of reference genes, and the other corresponding to the system of reference genes used in OcotypeDX test


RT-qPCR has become the most popular method for detection and quantification of mRNA transcripts. An efficient normalization enables the gathering of reproducible and biologically relevant RT-qPCR data correcting non-biological sample-to-sample variations that could be introduced by protocol-dependent inconsistencies [1, 30]. To achieve an efficient normalization the use of internal control genes (reference genes) has been established as the gold standard normalization method. Formerly, the genes that have housekeeping roles in basal cellular activities were selected as internal control for gene expression studies. However, it is obvious now that a housekeeping function does not guarantee that a gene has a stable expression level. It was well documented that expression levels of many commonly used housekeeping genes, including GAPDH, ACTB, RPS18, UBC, B2M, HPRT1 and others, may vary significantly in different tissues and cells [4, 7, 9, 1115]. Therefore it is necessary to identify novel candidate reference genes for given individual studies.

Our goal was to select a set of reference genes that can be efficiently used for both research and clinical analysis of breast cancer samples with an unknown hormone receptor status and different cancer status. In order to achieve this goal we developed the strategy for the selection of reference genes that consists of the following stages: 1) selection of an extended list of candidate reference genes based on microarray data, 2) selection of a set of candidate genes for validation by RT-qPCR and 3) selection of the final set.

Similarly to the studies presented in [14, 24] we started from the analysis of available microarray datasets, however, we used variation characteristics that are more stable with respect to outliers. This analysis allowed us to identify 25 genes with stable expression level for the studied sample types (Table 3). Forming the list of candidate genes for validation, we took into account the possible interactions of genes and their involvement in the same biological pathways. The final set of reference genes was selected based on four stability rankings generated by the most widely used methods for evaluation of gene expression variability. This allowed us to take advantage of all considered methods. We noted that both microarray and RT-qPCR data allows to evaluate only the stability of a transcript portion of an individual gene in a fixed amount of total RNA, or, in other words, to evaluate the relative stability of gene expression. Hence it is a natural idea to use tools that estimate the expression stability of a gene relative to the expression of other genes, e.g. the comparative ∆Ct method [8] and the geNorm [12]. However, low values of such relative variation characteristics can be a result of gene co-regulation. That is why the usage of tools based on stability measures of individual genes independently on other genes such as NormFinder and Haller equivalence test is very reasonable.

All methods used for the gene stability evaluation showed that the genes selected by microarray analysis are comparable with or even better than traditionally used reference genes: in the final subset 3 out of 5 genes belonged to the list of genes selected by means of microarray analysis. These three genes are RPS23, HUWE1 and EEF1A1.


We presented experimentally validated evidence that a selection of reference genes for RT-qPCR can be efficiently performed by combining a preliminary search based on the high-throughput analysis of microarray datasets with subsequent selection and validation using RT-qPCR and simultaneous examination of different expression stability measures. We showed that the identified set of reference genes, including ACTB, RPS23, HUWE1, EEF1A1 and SF3A1 proved to be less variable and thus potentially more efficient for research and clinical analysis of breast samples in comparison with individual genes and a set of reference genes used in OncotypeDX assay.



Reverse transcription quantitative real-time polymerase chain reaction


Minimum Information for Publication of Quantitative Real-Time PCR Experiments


Estrogen receptor


Progesterone receptor


RNA quality indicator




  1. 1.

    Vandesompele J, Kubista M, Pfaffl M: Reference gene validation software for improved normalization. Real-Time PCR: Current Technology and Applications. Edited by: Logan J, Edwards K, Saunders N. 2009, UK: Caister Academic Press, 47-83.

  2. 2.

    Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T, Hiller W, Fisher ER, Wickerham DL, Bryant J, Wolmark N: A multigene assay to predict recurrence of tamoxifen treated node-negative breast cancer. N Engl J Med. 2004, 351: 2817-2826. 10.1056/NEJMoa041588.

  3. 3.

    Cardoso F, Van’t Veer L, Rutgers E, Loi S, Mook S, Piccart-Gebhart MJ: Clinical application of the 70-gene profile: the MINDACT trial. J Clin Oncol. 2008, 26: 729-735. 10.1200/JCO.2007.14.3222.

  4. 4.

    Huggett J, Dheda K, Bustin S, Zumla A: Real-time RT-PCR normalisation; strategies and considerations. Genes Immun. 2005, 6: 279-284. 10.1038/sj.gene.6364190.

  5. 5.

    Bustin SA: Developments in real-time PCR research and molecular diagnostics. Expert Rev Mol Diagn. 2010, 10: 713-715. 10.1586/erm.10.65.

  6. 6.

    Tricarico C, Pinzani P, Bianchi S, Paglierani M, Distante V, Pazzagli M, Bustin SA, Orlando C: Quantitative real-time reverse transcription polymerase chain reaction: normalization to rRNA or single housekeeping genes is inappropriate for human tissue biopsies. Anal Biochem. 2002, 309: 293-300. 10.1016/S0003-2697(02)00311-1.

  7. 7.

    Dheda K, Huggett JF, Chang JS, Kim LU, Bustin SA, Johnson MA, Rook GA, Zumla A: The implications of using an inappropriate reference gene for real-time reverse transcription PCR data normalization. Anal Biochem. 2005, 344: 141-143. 10.1016/j.ab.2005.05.022.

  8. 8.

    Silver N, Best S, Jiang J, Thein SL: Selection of housekeeping genes for gene expression studies in human reticulocytes using real-time PCR. BMC Mol Biol. 2006, 7: 33-10.1186/1471-2199-7-33.

  9. 9.

    Krainova NA, Khaustova NA, Makeeva DS, Fedotov NN, Gudim EA, Riabenko EA, Galatenco VV, Sakharov DA, Maltseva DV, Tonevitsky AG: Evaluation of potential reference genes for qRT-PCR data normalization in HeLa cells. Appl Biochem Microbiol. 2013

  10. 10.

    Lanoix D, Lacasse AA, St-Pierre J, Taylor SC, Ethier-Chiasson M, Lafond J, Vaillancourt C: Quantitative PCR pitfalls: the case of the human placenta. Mol Biotechnol. 2012, 52: 234-243. 10.1007/s12033-012-9539-2.

  11. 11.

    Warrington JA, Nair A, Mahadevappa M, Tsyganskaya M: Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiol Genomics. 2000, 2: 143-147.

  12. 12.

    Vandesompele J, De Preter K, Pattyn F, Poppe B, Van Roy N, De Paepe A, Speleman F: Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol. 2002, 3: RESEARCH0034-

  13. 13.

    Galiveti CR, Rozhdestvensky TS, Brosius J, Lehrach H, Konthur Z: Application of housekeeping npcRNAs for quantitative expression analysis of human transcriptome by real-time PCR. RNA. 2010, 16: 450-461. 10.1261/rna.1755810.

  14. 14.

    Hruz T, Wyss M, Docquier M, Pfaffl MW, Masanetz S, Borghi L, Verbrugghe P, Kalaydjieva L, Bleuler S, Laule O, Descombes P, Gruissem W, Zimmermann P: RefGenes: identification of reliable and condition specific reference genes for RT-qPCR data normalization. BMC Genomics. 2011, 12: 156-10.1186/1471-2164-12-156.

  15. 15.

    Hsiao LL, Dangond F, Yoshida T, Hong R, Jensen RV, Misra J, Dillon W, Lee KF, Clark KE, Haverty P, Weng Z, Mutter GL, Frosch MP, MacDonald ME, Milford EL, Crum CP, Bueno R, Pratt RE, Mahadevappa M, Warrington JA, Stephanopoulos G, Stephanopoulos G, Gullans SR: A compendium of gene expression in normal human tissues. Physiol Genomics. 2001, 7 (2): 97-104.

  16. 16.

    Bustin SA: Why the need for qPCR publication guidelines? - The case for MIQE. Methods. 2010, 50 (4): 217-226. 10.1016/j.ymeth.2009.12.006.

  17. 17.

    Czechowski T, Stitt M, Altmann T, Udvardi MK, Scheible WR: Genome-wide identification and testing of superior reference genes for transcript normalization in Arabidopsis. Plant Physiol. 2005, 139: 5-17. 10.1104/pp.105.063743.

  18. 18.

    Cassan-Wang H, Soler M, Yu H, Camargo EL, Carocha V, Ladouce N, Savelli B, Paiva JA, Leplé JC, Grima-Pettenati J: Reference Genes for High-Throughput Quantitative Reverse Transcription-PCR Analysis of Gene Expression in Organs and Tissues of Eucalyptus Grown in Various Environmental Conditions. Plant Cell Physiol. 2012, 53: 2101-2116. 10.1093/pcp/pcs152.

  19. 19.

    Narsai R, Ivanova A, Ng S, Whelan J: Defining reference genes in Oryza sativa using organ, development, biotic and abiotic transcriptome datasets. BMC Plant Biol. 2010, 10: 56-10.1186/1471-2229-10-56.

  20. 20.

    Alexander H, Jenkins BD, Rynearson TA, Saito MA, Mercier ML, Dyhrman ST: Identifying reference genes with stable expression from high throughput sequence data. Front Microbiol. 2012, 3: 385-

  21. 21.

    Fedrigo O, Warner LR, Pfefferle AD, Babbitt CC, Cruz-Gordillo P, Wray GA: A pipeline to determine RT-QPCR control genes for evolutionary studies: application to primate gene expression across multiple tissues. PLoS One. 2010, 5 (9): e12545-10.1371/journal.pone.0012545.

  22. 22.

    Zhou L, Lim QE, Wan G, Too HP: Normalization with genes encoding ribosomal proteins but not GAPDH provides an accurate quantification of gene expressions in neuronal differentiation of PC12 cells. BMC Genomics. 2010, 11: 75-10.1186/1471-2164-11-75.

  23. 23.

    Szabo A, Perou CM, Karaca M, Perreard L, Palais R, Quackenbush JF, Bernard PS: Statistical modeling for selecting housekeeper genes. Genome Biol. 2004, 5: R59-10.1186/gb-2004-5-8-r59.

  24. 24.

    Popovici V, Goldstein DR, Antonov J, Jaggi R, Delorenzi M, Wirapati P: Selecting control genes for RT-QPCR using public microarray data. BMC Bioinforma. 2009, 10: 42-10.1186/1471-2105-10-42.

  25. 25.

    Gabrielsson BG, Olofsson LE, Sjögren A, Jernås M, Elander A, Lönn M, Rudemo M, Carlsson LM: Evaluation of reference genes for studies of gene expression in human adipose tissue. Obes Res. 2005, 13 (4): 649-652. 10.1038/oby.2005.72.

  26. 26.

    Stamova BS, Apperson M, Walker WL, Tian Y, Xu H, Adamczy P, Zhan X, Liu DZ, Ander BP, Liao IH, Gregg JP, Turner RJ, Jickling G, Lit L, Sharp FR: Identification and validation of suitable endogenous reference genes for gene expression studies in human peripheral blood. BMC Med Genomics. 2009, 2: 49-10.1186/1755-8794-2-49.

  27. 27.

    Gur-Dedeoglu B, Konu O, Bozkurt B, Ergul G, Seckin S, Yulug IG: Identification of endogenous reference genes for qRT-PCR analysis in normal matched breast tumor tissues. Oncol Res. 2009, 17 (8): 353-365. 10.3727/096504009788428460.

  28. 28.

    Cheng WC, Chang CW, Chen CR, Tsai ML, Shu WY, Li CY, Hsu IC: Identification of reference genes across physiological states for qRT-PCR through microarray meta-analysis. PLoS One. 2011, 6 (2): e17347-10.1371/journal.pone.0017347.

  29. 29.

    Kwon MJ, Oh E, Lee S, Roh MR, Kim SE, Lee Y, Choi YL, In YH, Park T, Koh SS, Shin YK: Identification of novel reference genes using multiplatform expression data and their validation for quantitative gene expression analysis. PLoS One. 2009, 4 (7): e6162-10.1371/journal.pone.0006162.

  30. 30.

    Andersen CL, Jensen JL, Ørntoft TF: Normalization of real-time quantitative reverse transcription-PCR data: a model-based variance estimation approach to identify genes suited for normalization, applied to bladder and colon cancer data sets. Cancer Res. 2004, 64 (15): 5245-5250. 10.1158/0008-5472.CAN-04-0496.

  31. 31.

    Haller F, Kulle B, Schwager S, Gunawan B, Von Heydebreck A, Sültmann H, Füzesi L: Equivalence test in quantitative reverse transcription polymerase chain reaction: confirmation of reference genes suitable for normalization. Anal Biochem. 2004, 335: 1-9. 10.1016/j.ab.2004.08.024.

  32. 32.

    Bustin S: The MIQE guidelines: minimum information for publication of quantitative real-time PCR experiments. Clin Chem. 2009, 55 (4): 611-622. 10.1373/clinchem.2008.112797.

  33. 33.

    Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30: 207-210. 10.1093/nar/30.1.207.

  34. 34.

    Barrett T, Troup DB, Wilhite SE, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A: NCBI GEO: archive for functional genomics data sets—10 years on. Nucleic Acids Res. 2011, 39 (suppl. 1): D1005-D1010.

  35. 35.

    Symmans WF, Hatzi C, Sotiriou C, Andre F, Peintinger F, Regitnig P, Daxenbichler G, Desmedt C, Domont J, Marth C, Delaloge S, Bauernhofer T, Valero V, Booser DJ, Hortobagyi GN, Pusztai L: Genomic index of sensitivity to endocrine therapy for breast cancer. J Clin Oncol. 2010, 28: 4111-4119. 10.1200/JCO.2010.28.4273.

  36. 36.

    Chen DT, Nasir A, Culhane A, Venkataramu C, Fulp W, Rubio R, Wang T, Agrawal D, McCarthy SM, Gruidl M, Bloom G, Anderson T, White J, Quackenbush J, Yeatman T: Proliferative genes dominate malignancy-risk gene signature in histologically-normal breast tissue. Breast Cancer Res Treat. 2010, 119: 335-346. 10.1007/s10549-009-0344-y.

  37. 37.

    Dedeurwaerder S, Desmedt C, Calonne E, Singhal SK, Haibe-Kains B, Defrance M, Michiels S, Volkmar M, Deplus R, Luciani J, Lallemand F, Larsimont D, Toussaint J, Haussy S, Rothé F, Rouas G, Metzger O, Majjaj S, Saini K, Putmans P, Hames G, Van Baren N, Coulie PG, Piccart M, Sotiriou C, Fuks F: DNA methylation profiling reveals a predominant immune component in breast cancers. EMBO Mol Med. 2011, 3: 726-741. 10.1002/emmm.201100801.

  38. 38.

    Graham K, Delas Morenas A, Tripathi A, King C, Kavanah M, Mendez J, Stone M, Slama J, Miller M, Antoine G, Willers H, Sebastiani P, Rosenberg CL: Gene expression in histologically normal epithelium from breast cancer patients and from cancer-free prophylactic mastectomy patients shares a similar profile. Br J Cancer. 2010, 102: 1284-1293. 10.1038/sj.bjc.6605576.

  39. 39.

    Gentlemen RC, Carey VC, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80.1-R80.16.

  40. 40.

    Stratova C: Introduction to the xps Package: Overview. URL:

  41. 41.

    Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264. 10.1093/biostatistics/4.2.249.

  42. 42.

    Primer3 (v. 0.4.0). []

  43. 43.

    Nolan T, Hands RE, Bustin S: Quantification of mRNA using real-time RT-PCR. Nat Protoc. 2006, 1 (3): 1559-1582. 10.1038/nprot.2006.236.

  44. 44.

    NCBI/Primer-BLAST. []

  45. 45.

    OligoAnalyzer 3.1. []

  46. 46.

    NormFinder software. []

  47. 47.

    Lyng MB, Laenkholm AV, Pallisgaard N, Ditzel HJ: Identificationof genes for normalization of real-time RT-PCR data in breast carcinomas. BMC Cancer. 2008, 8: 20-10.1186/1471-2407-8-20.

  48. 48.

    De Kok JB, Roelofs RW, Giesendorf BA, Pennings JL, Waas ET, Feuth T, Swinkels DW, Span PN: Normalization of gene expression measurements in tumor tissues: comparison of 13 endogenous control genes. Lab Invest. 2005, 85 (1): 154-159. 10.1038/labinvest.3700208.

Download references


The authors thank D.S. Makeeva, O.V. Kondrashina and K. Phomicheva for taking part in RT-qPCR experiments, Dr. T.R. Samatov and I.I. Davydov for valuable comments and discussions. This work was supported by Russian Ministry of Science Contracts No. 14.512.11.0083, 16.522.11.7057 and 14.514.11.4025 and Bundesministerium für Bildung und Forschung (BMBF No. RUS 10/022).

Author information

Correspondence to Diana V Maltseva.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

DVM contributed to the conception of the study, assisted in the RT-qPCR studies and data processing, wrote the first draft of the manuscript. NAK performed the RNA sample preparations, part of RT-qPCR studies, contributed to the data processing and interpretation and helped to draft the manuscript. EOM performed the majority of RT-qPCR studies. NNF performed the majority of RT-qPCR data processing and helped to draft the manuscript. AEL performed the processing of microarray data. MUS performed the processing of the patient data and some of immunohistochemistry analysis of biopsies. VVG contributed to study conception, processing of the RT-qPCR and microarray data, data interpretation and edited the manuscript. US and AGT contributed to study conception, and critical manuscript review. All authors read and approved the final manuscript.

Electronic supplementary material

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article


  • Reference genes
  • Microarrays
  • Reverse transcription quantitative real-time polymerase chain reaction (RT-qPCR)
  • Gene expression
  • Breast cancer