Behind the Data: Electronic Medical Records
As part of their new data channel, GE Healthcare has just released two slices of their massive Medical Quality Improvement Consortium (MQIC) database of deidentified medical records. We asked one of the experts behind this data to explain the process of collecting and working with Electronic Medical Records and to suggest some avenues for research and visualization.
Ian Gibbs, Research Analyst, Clinical Data Services
Visualizing: Describe the MQIC data — where does it come from? Who has access?
Ian Gibbs: The Clinical Data Services Research Data Set (CDSRDS) is collected from healthcare organizations participating in the Medical Quality Improvement Consortium (MQIC). MQIC is a group of users of GE Healthcare's Centricity Electronic Medical Record (EMR) who have agreed to share and aggregate their clinical data in a de-identified, HIPAA-compliant manner. MQIC has over 300 member organizations and approximately 20,000 healthcare providers, and is represented in 44 states. The data set consists entirely of ambulatory care data representing over 22 million patient records from the outpatient physician office setting, mostly primary care physician offices. Given that the data set is focused entirely on the outpatient part of the healthcare service system, chronic and infectious diseases are excellent areas for research.
The resulting data set is used primarily for providing quality reporting services back to MQIC for the purpose of improving healthcare quality, reducing cost and improving efficiency. Secondarily, the data is also used by government entities, pharmaceutical companies and GE Healthcare for the purpose of supporting various healthcare research interests, such as health outcomes research, disease surveillance and practice based research networks. Use of the data by organizations external to GE Healthcare is overseen by a committee of MQIC member organizations to ensure it is used appropriately.
V: What is the research value of compiling this data in one place?
IG: The CDS Research Data Set provides a unique opportunity to passively evaluate practice care patterns and healthcare outcomes in a real-world data set that is geographically and demographically similar to the United States population as defined by the US Census statistics. The data is collected from disparate sources that vary in the patient populations they serve and their operational/clinical methods for providing healthcare. This variation provides a diverse pool of patients that are suitable for research. For example, the patients in the data set represent all payer types, such as Medicare, Medicaid, self-pay and commercial insurance. Varying operational and clinical methods offer an opportunity to evaluate system-level characteristics in healthcare. For example, it is possible to evaluate potential disparities in practice care patterns between Integrated Delivery Networks (IDNs) and stand-alone physician offices.
V: What are the challenges and benefits inherent in working with patient records?
IG: The major challenge to working with a data set such as this is that it represents an open population both in terms of healthcare organizations and patients. In order to utilize an organization's data set for research purposes, a judicious approach to the data must be taken to ensure that sources of information bias are mitigated as much as possible. Furthermore, because this data reflects the US healthcare system, patients may not be continuously represented in the data set for an entire time period of interest, so data censoring must also be addressed. But both of these challenges merely reflect the unadulterated aspects of this data set: it is real-world and passively collected. Data anomalies in a real-world data set are to be expected, and patient censoring is natural since patients have significant freedom to choose when and where they seek healthcare. The source of these challenges (real-world and passive data collection) is also the source of the primary benefits of this data set. With a real-world, passively collected data set it is possible to understand the clinical context in which care decisions are made and the outcomes that subsequently result from those decisions. That is a unique and powerful aspect of this data set.
V: How do you see the MQIC data being used in coming years? Where does data visualization fit in?
IG: The strength of this data set is not just in the data, but also in the MQIC organization that serves as the source of this data. We expect that practice-based research networks, where healthcare organizations are active participants in research rather than passive participants, will be a major opportunity for further use and development of this data set. Data visualization within healthcare offers a new and exciting opportunity to understand patterns and trends. Traditionally, healthcare research results are conveyed in tabular form or statistical model results, which require a significant understanding of research study design and methods to interpret. A visual form of similar data provides an opportunity to remove that constraint if needed so that it may be shared with the general public. In addition, the visual representation of data may provide insight into patterns and trends in the data that tabulations may hide. The visual representation of data can be extremely powerful and it is a tool that we feel is under-utilized within healthcare research and analysis.
V: What portion of the MQIC data is available to the public in this release?
IG: The data set provided in this release is derived from the CDS Research Data Set. The CDS Research Data Set contains raw, EMR record-level data, and the released data set is aggregated from that source. It is aggregated on certain geographic characteristics, patient demographic characteristics and disease categories (diabetes or hypertension). For each stratum, summary statistics are provided for the data elements that are of interest for the disease categories. Because this is an aggregated data set derived from a data set that is already statistically de-identified, patient privacy is ensured.
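To make the idea of stratum-level aggregation concrete, here is a minimal sketch of how record-level rows could be rolled up into per-stratum summary statistics. It is an illustration only, not GE Healthcare's actual pipeline: the field names (`state`, `gender`, `age_group`, `disease`, `hba1c`), the sample values, and the `aggregate_by_stratum` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical record-level rows; field names and values are invented
# for illustration and do not come from the MQIC data.
records = [
    {"state": "OH", "gender": "F", "age_group": "40-59", "disease": "diabetes", "hba1c": 7.0},
    {"state": "OH", "gender": "F", "age_group": "40-59", "disease": "diabetes", "hba1c": 8.0},
    {"state": "OH", "gender": "M", "age_group": "60+", "disease": "diabetes", "hba1c": 6.5},
]


def aggregate_by_stratum(rows, keys, value_field):
    """Group rows by the given key fields and summarize one numeric field."""
    groups = defaultdict(list)
    for row in rows:
        stratum = tuple(row[k] for k in keys)
        groups[stratum].append(row[value_field])
    # Only the summaries leave this function; individual values do not.
    return {
        stratum: {
            "n": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for stratum, values in groups.items()
    }


summary = aggregate_by_stratum(
    records, ["state", "gender", "age_group", "disease"], "hba1c"
)
for stratum, stats in summary.items():
    print(stratum, stats)
```

Each output row describes a group of patients rather than any one patient, which is why the released data can only support stratum-level analysis.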
V: Suggest some directions for analysis and visualization of this data that would address critical issues.
IG: Because this data set is aggregated, analyses and visualizations are constrained to the level defined by each stratum. This rules out patient-level conclusions: inferring individual characteristics from group-level statistics is the error epidemiologists call the "ecological fallacy". Stratum-level questions remain well suited to visualization. For example, it would be possible to develop a visualization that examines the change in mean hemoglobin A1c values over time, by gender and age, in a heat map of the United States. Healthcare utilization for each disease, in terms of outpatient physician office visits, may also be represented in a similar way.
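As a rough sketch of the data preparation behind such a heat map, the snippet below collapses stratum-level means into one value per state and year, weighting each stratum by its patient count so that larger strata count for more. It assumes each stratum row carries a patient count alongside its mean; the field names (`state`, `year`, `gender`, `n`, `mean_hba1c`), the sample numbers, and the `pooled_means` helper are hypothetical and may not match the released files.

```python
from collections import defaultdict

# Hypothetical stratum-level rows (assumed layout, invented numbers).
strata = [
    {"state": "OH", "year": 2008, "gender": "F", "n": 100, "mean_hba1c": 7.0},
    {"state": "OH", "year": 2008, "gender": "M", "n": 300, "mean_hba1c": 8.0},
    {"state": "TX", "year": 2008, "gender": "F", "n": 50, "mean_hba1c": 7.5},
]


def pooled_means(rows, keys):
    """Combine stratum means into one count-weighted mean per cell."""
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [weighted sum, total n]
    for row in rows:
        cell = tuple(row[k] for k in keys)
        totals[cell][0] += row["mean_hba1c"] * row["n"]
        totals[cell][1] += row["n"]
    # Skip empty cells to avoid dividing by zero.
    return {
        cell: weighted_sum / n
        for cell, (weighted_sum, n) in totals.items()
        if n > 0
    }


# One value per (state, year): the cells a map-based heat map would color.
cells = pooled_means(strata, ["state", "year"])
for cell, value in sorted(cells.items()):
    print(cell, round(value, 2))
```

A simple unweighted average of the two Ohio strata would give 7.5, while the count-weighted value is 7.75. Either way, the result describes the group as a whole and says nothing about any individual patient.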
Find the MQIC data slices on the GE data channel and stay tuned for more releases.