A system of metadata to control the process of query, aggregating, cleaning and analysing large datasets of primary care data

Jeremy VanVlymen, Simon de Lusignan


Background Metadata is data that describes other data or resources. It has a defined number of named elements that convey meaning. Medical data are complex to process. For example, in the Primary Care Data Quality (PCDQ) renal programme, we need to collect over 300 variables because there are so many possible causes of renal disease. These variables are not just single columns of data: every variable is extracted as at least a code plus a date, and some also carry a value (code_date_value). Metadata has the potential to improve the reliability of processing large datasets.
Objective To define unique and unambiguous metadata headings for clinical data and derived variables.
Method We defined the look-up tables we would use as a controlled vocabulary to name the core clinical concepts within the metadata. We added six other elements to describe data: (1) the study or audit name; (2) the query used to extract the data; (3) the data collection number; (4) the type of data, including specifying the units; (5) the repeat number (if the variable was extracted more than once); and (6) a processing suffix that defines how the data have been processed.
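To make the naming scheme concrete, the sketch below composes a metadata heading from the core clinical concept plus the six descriptive elements, and parses it back. It is an illustration only: the abstract does not state the element order, the delimiter or any actual values, so the `VariableName` class, the underscore separator and the example values (`CKD`, `Q1`, `SBP` and so on) are all hypothetical.

```python
from dataclasses import dataclass, fields

SEPARATOR = "_"  # assumed delimiter; the abstract does not specify one


@dataclass(frozen=True)
class VariableName:
    """Hypothetical container for one metadata heading (element order is assumed)."""

    study: str       # (1) study or audit name
    query: str       # (2) query used to extract the data
    collection: str  # (3) data collection number
    concept: str     # core clinical concept from the controlled vocabulary
    data_type: str   # (4) type of data, including units
    repeat: str      # (5) repeat number
    suffix: str      # (6) processing suffix

    def __post_init__(self):
        # Reject empty elements and elements containing the delimiter,
        # so that every composed heading can be parsed back unambiguously.
        for f in fields(self):
            value = getattr(self, f.name)
            if not value or SEPARATOR in value:
                raise ValueError(f"invalid value for {f.name!r}: {value!r}")

    def compose(self) -> str:
        """Join the elements into a single heading."""
        return SEPARATOR.join(getattr(self, f.name) for f in fields(self))

    @classmethod
    def parse(cls, heading: str) -> "VariableName":
        """Split a heading back into its elements."""
        parts = heading.split(SEPARATOR)
        if len(parts) != len(fields(cls)):
            raise ValueError(f"expected {len(fields(cls))} elements in {heading!r}")
        return cls(*parts)


# Example values are invented for illustration only.
name = VariableName("CKD", "Q1", "2", "SBP", "mmHg", "1", "RAW")
heading = name.compose()
```

Because every element is validated when the heading is built, a name that composes successfully always parses back to the same elements, which is one way a heading could be kept unique and unambiguous.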
Results The metadata system has enabled the development of a query library and an analysis syntax library that make data processing and analysis more efficient. Its stability means that more effort can go into complex data processing and into semi-automating some processes. However, the system has had implementation problems. It has been particularly hard to stop clinicians using multiple synonyms for the same variable.
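One way the synonym problem could be handled is to resolve every submitted variable name against the controlled vocabulary before it enters the metadata. The sketch below is not the PCDQ implementation: the vocabulary entries, the synonyms and the function names `build_lookup` and `canonical_name` are assumptions made for illustration.

```python
# Hypothetical controlled vocabulary: preferred term -> known synonyms.
CONTROLLED_VOCABULARY = {
    "SBP": {"systolic", "systolic_bp", "sys_bp"},
    "CREAT": {"creatinine", "serum_creatinine"},
}


def build_lookup(vocabulary):
    """Map every normalised term and synonym to its preferred term."""
    lookup = {}
    for preferred, synonyms in vocabulary.items():
        for term in {preferred, *synonyms}:
            key = term.strip().lower()
            # A synonym claimed by two preferred terms would be ambiguous.
            if key in lookup and lookup[key] != preferred:
                raise ValueError(f"{term!r} maps to both {lookup[key]!r} and {preferred!r}")
            lookup[key] = preferred
    return lookup


def canonical_name(term, lookup):
    """Return the preferred term, or raise KeyError for an unknown name."""
    try:
        return lookup[term.strip().lower()]
    except KeyError:
        raise KeyError(f"{term!r} is not in the controlled vocabulary") from None


lookup = build_lookup(CONTROLLED_VOCABULARY)
```

Raising an error for unknown names, rather than accepting them silently, forces a new synonym to be added to the vocabulary deliberately instead of creeping into the dataset.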
Conclusions The PCDQ metadata system provides an auditable method of data processing. It is a method that should improve the reliability, validity and efficiency of processing routinely collected clinical data. This paper sets out to demystify our data processing method and makes the PCDQ metadata system available to clinicians and data processors who might wish to adopt it.


Keywords: data processing methods; metadata; primary care data quality


DOI: http://dx.doi.org/10.14236/jhi.v13i4.608



This is an open access journal, which means that all content is freely available without charge to the user or their institution. Users are allowed to read, download, copy, distribute, print, search, or link to the full texts of the articles in this journal starting from Volume 21 without asking prior permission from the publisher or the author. This is in accordance with the BOAI definition of open access. For permission regarding papers published in previous volumes, please contact us.


Online ISSN 2058-4563 - Print ISSN 2058-4555. Published by BCS, The Chartered Institute for IT