To provide a rationale for our architectural decision, we first describe the multi-centric study workflow, which dictates software requirements and design. We then summarize the issues of overlapping functionality between BDMS and CRIS software, and of user interfaces to clinical/biospecimen data.
2.1 Workflow of Biospecimen Collection and Processing in Multi-centric Studies
Enrollment of patients based on the protocol's inclusion and exclusion criteria is a complex process, as eligible individuals are rarely available immediately. The study protocol's "event calendar", a predetermined sequence of time points ("events") relative to a subject's enrollment date, determines the biospecimen-collection schedule. Note that many or even most time points are not associated with biospecimen collection, but may involve subject interviews, clinical examination, special investigations (e.g., radiology) or outreach (e.g., reminders through phone, letters or E-mail). The numerous study parameters recorded across all events, such as measures of disease progression or clinical improvement specific to the disease condition being followed, are segregated into logically related units called case report forms (CRFs).
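The relationship between the event calendar and the biospecimen-collection schedule can be illustrated with a minimal Python sketch. The event names, day offsets and the `Event`, `schedule` and `collection_schedule` identifiers are hypothetical and are not taken from any particular protocol or system; the sketch only shows that target dates are derived from the enrollment date and that collection events form a subset of all events.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Event:
    name: str
    offset_days: int            # days after the subject's enrollment date
    collects_biospecimen: bool  # False for interviews, reminders, etc.


# Hypothetical protocol calendar; names and offsets are illustrative only.
CALENDAR = [
    Event("Baseline visit", 0, True),
    Event("Phone reminder", 25, False),
    Event("Month-1 visit", 30, True),
    Event("Month-3 interview", 90, False),
]


def schedule(enrollment: date, calendar=CALENDAR):
    """Return (event, target date) pairs for one subject."""
    return [(ev, enrollment + timedelta(days=ev.offset_days)) for ev in calendar]


def collection_schedule(enrollment: date, calendar=CALENDAR):
    """Return only the events at which biospecimens are collected."""
    return [(ev, when) for ev, when in schedule(enrollment, calendar)
            if ev.collects_biospecimen]


if __name__ == "__main__":
    for ev, when in collection_schedule(date(2024, 1, 15)):
        print(when.isoformat(), ev.name)
```

Deriving dates from offsets, rather than storing them, means that a corrected enrollment date changes every dependent target date consistently.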
To reduce shipping costs, centers perform local biospecimen processing, aliquot creation and temporary storage prior to batch shipments. The actual number of aliquots may vary for individual subjects because of material-collection constraints (especially in pediatric patients), and in intensive-care/emergency situations, scheduled collections may be missed. Actual biospecimen collection and quantity must therefore be closely tracked to inform study progress. To streamline collection and processing, an analytic center typically provides collection centers in advance with a batch of aliquot containers (vials) bearing barcode labels that record standard information such as patient ID, event, sample type and aliquot number.
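As an illustration of the information such a label carries, the following Python sketch encodes and decodes the four fields named above. The pipe-delimited layout, the two-digit aliquot number and the names `AliquotLabel`, `encode_label` and `decode_label` are assumptions made for this example; real label layouts are study-specific.

```python
from typing import NamedTuple


class AliquotLabel(NamedTuple):
    patient_id: str
    event: str
    sample_type: str
    aliquot_number: int


SEPARATOR = "|"  # assumed delimiter; must not occur inside any field


def encode_label(label: AliquotLabel) -> str:
    """Build the text that would be printed as a barcode."""
    fields = [label.patient_id, label.event, label.sample_type]
    if any(not f or SEPARATOR in f for f in fields):
        raise ValueError("fields must be non-empty and must not contain the separator")
    if label.aliquot_number < 1:
        raise ValueError("aliquot numbers start at 1")
    return SEPARATOR.join(fields + [f"{label.aliquot_number:02d}"])


def decode_label(text: str) -> AliquotLabel:
    """Parse scanned barcode text back into its four fields."""
    parts = text.strip().split(SEPARATOR)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    patient_id, event, sample_type, number = parts
    return AliquotLabel(patient_id, event, sample_type, int(number))


if __name__ == "__main__":
    text = encode_label(AliquotLabel("P0042", "M1", "SERUM", 3))
    print(text, decode_label(text))
```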
The samples are shipped in batches, and the aliquots received are scanned at the data and sample coordinating center for verification against the previously entered collection data. Discrepancy resolution generally involves human intervention (e.g., phone calls to collection centers). After any additional local processing, aliquots are stored in freezers, with locations recorded using a coordinate system (e.g., site-freezer-rack-slot). Biospecimens are consumed following local analysis or shipping to external biomarker laboratories, either in bulk for specialized analyses, or when individually requested by collaborators. For the former, the external lab may send analytical results back in a variety of formats (typically spreadsheets), and these must also be bulk-imported. Specimen consumption must be tracked accurately to guide future ancillary studies and sample requests.
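The verification step on receipt amounts to a set comparison between the barcodes entered at collection and those scanned at the coordinating center. The following Python sketch shows one way to classify the outcome; the function name `reconcile` and the result categories are illustrative assumptions, and in practice the flagged items would be handed to staff for manual resolution.

```python
from collections import Counter


def reconcile(expected_barcodes, scanned_barcodes):
    """Compare barcodes recorded at collection with those scanned on receipt."""
    expected = set(expected_barcodes)
    scan_counts = Counter(scanned_barcodes)
    scanned = set(scan_counts)
    return {
        "verified": sorted(expected & scanned),    # recorded and received
        "missing": sorted(expected - scanned),     # recorded but not received
        "unexpected": sorted(scanned - expected),  # received but never recorded
        "scanned_more_than_once": sorted(
            code for code, n in scan_counts.items() if n > 1
        ),
    }


if __name__ == "__main__":
    report = reconcile(
        ["P1|M1|SERUM|01", "P1|M1|SERUM|02", "P2|M1|SERUM|01"],
        ["P1|M1|SERUM|01", "P2|M1|SERUM|01", "P2|M1|SERUM|01", "P3|M1|SERUM|01"],
    )
    for category, codes in report.items():
        print(category, codes)
```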
2.2 Existing Software for Biospecimen Management
Because individual research groups' needs vary greatly, existing BDMS functionality is very diverse. However, all BDMSs should be able to manage an unlimited number of study protocols: every data element must be associated, directly or indirectly, with the study where it originated.
Angelow et al. describe a "virtual repository" BDMS: biospecimens are not shipped, but are stored (and analyzed) at individual collection centers and managed by a central web-based BDMS. Pulley et al. describe a DNA biobanking system for anonymous subjects: each biospecimen is associated with structured and textual electronic-medical-record (EMR) data that are anonymized using electronic and manual processes. These data characterize individual phenotypes: genotype-phenotype correlations form a focus of the eMERGE network.
CaTissue, supported by the Cancer Biomedical Informatics Grid (caBIG), focuses on tissue banking, providing functionality such as clinical annotations (e.g., pathology reports), but also has general-purpose features. The annotation module has been utilized by other groups [11, 12].
2.3 CRISs and BDMSs: Overlapping Functionality
Clinical Research Information Systems (CRISs) [13–15], with prices ranging from free to several million dollars, are designed to manage workflow and data for an arbitrary number of studies. Both CRISs and BDMSs typically utilize high-end relational database management systems (RDBMSs). When BDMSs are used for clinical studies, they address many areas that CRISs also cover (the latter often in greater depth), as discussed shortly. Despite this overlap, even high-end CRISs do not currently provide comprehensive BDMS capability: biospecimen-inventory management, in particular, falls significantly short.
Large research groups therefore employ both types of systems. In such scenarios, one must determine whether one system shall be used primarily for a particular function (or whether both should be used for complementary functionality), and how to coordinate both systems' contents. Consider the following synchronization challenges:
1. Users: A large multi-center study may involve hundreds of research staff across sites, with a variety of access privileges to either system, and staff turnover may be significant. We consider this issue later in the Discussion.
2. Informed Consent: Consent often has finer details related to the degree of participation allowed by the subject. Based on research goals, subjects may consent to provide some tissues but not others, or to have only certain tests performed: e.g., they may decline genotyping because of concerns (in the USA) that accidental result disclosure may impact their families' health-insurability. Biospecimens may inherit their consent values from the subject (e.g., if the subject drops out and withdraws consent, the consent status of all specimens must automatically change).
3. Collection Schedules: As stated earlier, the study calendar is a superset of the biospecimen-collection calendar. For subjects' convenience, individual collection visits also serve other purposes (e.g., physical examination, interviews), and visits are frequently rescheduled.
4. Analytical Data: The subject's total clinical data constitute a superset of biospecimen-associated analytic data, which are rarely inspected in isolation. Research staff typically enter/edit non-analytical data, either through real-time electronic data capture, or on paper that is later transcribed electronically by data-entry staff. While analytical data can also be entered manually, many parameters may be output electronically by laboratory instruments following batch analyses, and are preferably bulk-imported.
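The consent inheritance described in item 2 can be sketched in Python as follows. The class names, the `permits` method and the consent categories are hypothetical; the point of the sketch is that a specimen's consent status is derived from its subject rather than stored separately, so that a withdrawal at the subject level applies to every specimen at once.

```python
from dataclasses import dataclass, field


@dataclass
class Subject:
    subject_id: str
    # Uses the subject has agreed to, e.g. {"biochemistry", "genotyping"}.
    consented_uses: set = field(default_factory=set)
    withdrawn: bool = False

    def withdraw(self) -> None:
        """Record that the subject has dropped out and withdrawn consent."""
        self.withdrawn = True


@dataclass
class Specimen:
    barcode: str
    subject: Subject

    def permits(self, use: str) -> bool:
        # Evaluated against the subject on every call, never cached, so a
        # subject-level change is reflected by all specimens immediately.
        return (not self.subject.withdrawn) and use in self.subject.consented_uses


if __name__ == "__main__":
    subject = Subject("P0042", {"biochemistry"})
    serum = Specimen("P0042|M1|SERUM|01", subject)
    print(serum.permits("biochemistry"), serum.permits("genotyping"))
    subject.withdraw()
    print(serum.permits("biochemistry"))
```

If consent values were instead copied onto each specimen record, or held in a second system, every change would have to be propagated explicitly, which is the synchronization burden noted above.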
When both systems are in use, issues 3 and 4 above favor using the CRIS wherever possible. However, some data overlap remains (e.g., patient identifiers and basic study-protocol information), and consequently data exchange between the two systems is unavoidable.
2.4 User Interfaces for Clinical Data
User interfaces for interactive data capture must support robust validation and ergonomics. Parameter-level validation includes data type, range and set-membership, and mandatory (non-empty) values. Cross-parameter validation involves testing of rules (e.g., the differential white blood cell count components must total 100). Ergonomic aids include automatic computation of parameters based on formulas, disabling of certain fields based on values of previously entered fields (so-called "skip logic") and keyword-based search of controlled biomedical vocabularies. Finally, based on the study calendar, individual parameters may be recorded only for the CRFs/time points where they apply. The approach of programming such capabilities manually (e.g., Angelow et al.) takes significant expertise and effort, and does not scale. Alternative user-interface-management approaches include:
1. Managing collection schedules and analytical data through the BDMS. CaTissue lets developers specify a Unified Modeling Language (UML) data model, generating relational tables and a basic form interface that supports only data-type and set-membership checks. Calendar functionality (e.g., reminders, reports) lags considerably behind that of CRISs.
Several commercial BDMSs (e.g., FreezerPro and FreezerWorks) provide more end-user-friendly and more fully featured alternatives: some of these are Web-based, while others use two-tier technology (i.e., custom client software installed on multiple desktops communicating directly with a database). In any case, such systems address longitudinal-clinical-study needs only partially.
2. Delegating calendar and analytical-data management to a CRIS. CRISs typically provide extensive interface-generation as well as calendar-driven capabilities: they allow designer-level users to specify the interface declaratively through a data library, and then generate CRFs. We employ this design approach.
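To make the idea of declarative specification concrete, the following Python sketch describes parameters as data and applies one generic validator to them, covering the checks listed earlier in this section: data type, range, set-membership, mandatory values, skip logic and a cross-parameter rule. It does not reproduce the data library of any actual CRIS; the `Parameter` class, the field names and the `validate` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Parameter:
    name: str
    dtype: type
    required: bool = True
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    choices: Optional[frozenset] = None
    # Skip logic: the field applies only when this predicate returns True.
    enabled_if: Optional[Callable[[dict], bool]] = None


DIFFERENTIAL = ("neutrophils_pct", "lymphocytes_pct", "monocytes_pct",
                "eosinophils_pct", "basophils_pct")

# Hypothetical data library for a single CRF.
LIBRARY = [
    Parameter("sex", str, choices=frozenset({"F", "M"})),
    Parameter("pregnant", str, choices=frozenset({"Y", "N"}),
              enabled_if=lambda rec: rec.get("sex") == "F"),
    *[Parameter(n, float, minimum=0, maximum=100) for n in DIFFERENTIAL],
]


def differential_totals_100(record: dict) -> bool:
    try:
        return abs(sum(float(record[n]) for n in DIFFERENTIAL) - 100.0) < 1e-6
    except (KeyError, TypeError, ValueError):
        return True  # missing or non-numeric parts are reported per field


RULES = [("differential white blood cell count components must total 100",
          differential_totals_100)]


def validate(record: dict, library=LIBRARY, rules=RULES) -> list:
    """Return a list of error messages; an empty list means the record passes."""
    errors = []
    for p in library:
        if p.enabled_if is not None and not p.enabled_if(record):
            continue  # field is disabled by skip logic
        value = record.get(p.name)
        if value is None or value == "":
            if p.required:
                errors.append(f"{p.name}: value is mandatory")
            continue
        try:
            value = p.dtype(value)  # data-type check by conversion
        except (TypeError, ValueError):
            errors.append(f"{p.name}: expected {p.dtype.__name__}")
            continue
        if p.minimum is not None and value < p.minimum:
            errors.append(f"{p.name}: below minimum {p.minimum}")
        if p.maximum is not None and value > p.maximum:
            errors.append(f"{p.name}: above maximum {p.maximum}")
        if p.choices is not None and value not in p.choices:
            errors.append(f"{p.name}: not an allowed value")
    errors.extend(msg for msg, rule in rules if not rule(record))
    return errors
```

Adding a parameter or rule then requires a new entry in the library rather than new form-specific code, which is the property that lets this approach scale across studies.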