Authors: Roy A. Ruddle ¹ and Marlous S. Hall ²
Affiliations: ¹ School of Computing, University of Leeds, Leeds, U.K.; ² Leeds Institute of Cardiovascular & Metabolic Medicine, University of Leeds, Leeds, U.K.
Keywords: Data Visualization, Electronic Health Records, Data Quality.
Related Ontology Subjects/Areas/Topics: Biomedical Engineering; Data Engineering; Data Management and Quality; Data Manipulation; Data Visualization; Electronic Health Records and Standards; Health Information Systems; Sensor Networks
Abstract:
Descriptive statistics are typically presented as text, but that quickly becomes overwhelming when datasets contain many variables or analysts need to compare multiple datasets. Visualization offers a solution, but is rarely used except to show cardinalities (e.g., the percentage of missing values) or the distributions of a small set of variables. This paper describes dataset- and variable-centric designs for visualizing three categories of descriptive statistics (cardinalities, distributions and patterns). The designs scale to more than 100 variables and use multiple visual channels to encode important semantic differences (e.g., zero vs. one or more missing values). We evaluated our approach using large (multi-million record) primary and secondary care datasets. The miniature visualizations provided our users with a variety of important insights, including differences in character patterns that indicate data validation issues, missing values for a variable that should always be complete, and inconsistent encryption of patient identifiers. Finally, we highlight the need for research into methods of identifying anomalies in the distributions of dates in health data.
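To make the cardinality and character-pattern statistics concrete, the following is a minimal sketch of how they might be computed for a single variable before being visualized. It is not the authors' implementation: the function names, the pattern alphabet (`A` for uppercase, `a` for lowercase, `9` for digits) and the sample identifiers are all assumptions made for illustration.

```python
from collections import Counter


def char_pattern(value: str) -> str:
    """Collapse a value to its character pattern (assumed alphabet: A, a, 9)."""
    out = []
    for ch in value:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)  # keep punctuation and whitespace unchanged
    return "".join(out)


def summarize(values):
    """Return cardinalities and pattern frequencies for one variable."""
    # Treat None and the empty string as missing (an assumption).
    present = [v for v in values if v is not None and v != ""]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "patterns": Counter(char_pattern(v) for v in present),
    }


# Hypothetical identifier column with one missing value and one odd format.
ids = ["AB1234", "CD5678", "ef-901", None, "AB1234"]
summary = summarize(ids)
```

In this sketch, a rare pattern such as `aa-999` alongside a dominant `AA9999` is the kind of difference that could point to a validation issue, and a non-zero `missing` count flags a variable that was expected to be complete.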