Probabilistic linkage to enhance deterministic algorithms and reduce data linkage errors in hospital administrative data

Gareth Hagger-Johnson, Katie Harron, Harvey Goldstein, Robert Aldridge, Ruth Gilbert


Background: The pseudonymisation algorithm used to link episodes of care belonging to the same patient in England [Hospital Episode Statistics ID (HESID)] has never been formally evaluated to determine the extent of data linkage error.

Objective: To quantify improvements in linkage accuracy from adding probabilistic linkage to existing deterministic HESID algorithms.

Methods: We analysed inpatient admissions to National Health Service (NHS) hospitals in England (Hospital Episode Statistics, HES) over 17 years (1998 to 2015) for a sample of patients (born on the 13th or 28th of any month in 1992, 1998, 2005 or 2012). We compared the existing deterministic algorithm with one that included an additional probabilistic step, assessing both against a reference standard created using enhanced probabilistic matching with additional clinical and demographic information. Missed and false matches were quantified, and the impact on estimates of hospital readmission within one year was determined.
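The two-step approach described in the Methods, a deterministic rule followed by a probabilistic step for pairs the rule does not link, might be sketched as below. This is a minimal illustration only and not the HESID algorithm: the field names (`nhs_number`, `dob`, `sex`, `postcode`, `local_id`), the m- and u-probabilities and the threshold are all assumed values chosen for the example, and the scoring follows the general Fellegi-Sunter style of summing log agreement weights.

```python
import math

# Illustrative m- and u-probabilities (assumed values, not estimated from HES):
# m = P(field agrees | same patient), u = P(field agrees | different patients).
MU = {
    "dob": (0.97, 0.003),
    "sex": (0.99, 0.5),
    "postcode": (0.85, 0.0001),
    "local_id": (0.90, 0.00001),
}


def deterministic_match(a, b):
    """Hypothetical exact-match rule: same NHS number, date of birth and sex."""
    return (
        a.get("nhs_number") is not None
        and a.get("nhs_number") == b.get("nhs_number")
        and a.get("dob") == b.get("dob")
        and a.get("sex") == b.get("sex")
    )


def match_weight(a, b):
    """Sum of log2 agreement/disagreement weights (Fellegi-Sunter style)."""
    total = 0.0
    for field, (m, u) in MU.items():
        x, y = a.get(field), b.get(field)
        if x is None or y is None:
            continue  # a missing value contributes no evidence either way
        if x == y:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def same_patient(a, b, threshold=10.0):
    """Apply the deterministic rule first; fall back to the probabilistic score."""
    if deterministic_match(a, b):
        return True
    return match_weight(a, b) >= threshold
```

In this sketch a pair with a missing NHS number can still be linked when the remaining identifiers agree strongly enough, which is the kind of missed match a purely deterministic rule would leave unlinked.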

Results: HESID produced a high missed match rate, which improved over time (from 8.6% in 1998 to 0.4% in 2015). Missed matches were more common for ethnic minorities, those living in areas of high socio-economic deprivation, foreign patients and those with ‘no fixed abode’. Estimates of the readmission rate were biased for several patient groups owing to missed matches; adding the probabilistic step reduced this bias for nearly all groups.
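As a rough illustration of how missed and false matches can be counted against a reference standard, the sketch below compares record pairs. The pairwise definitions used here are an assumption made for the example and may differ from those used in the study: a missed match is a pair the reference assigns to the same patient but the algorithm under test separates, and a false match is a pair the algorithm joins but the reference separates. The function name and its arguments are hypothetical.

```python
from itertools import combinations


def linkage_error_rates(candidate_ids, reference_ids):
    """Pairwise missed- and false-match rates against a reference standard.

    Both arguments are equal-length sequences giving, for each record, the
    patient ID assigned by the algorithm under test and by the reference.
    """
    if len(candidate_ids) != len(reference_ids):
        raise ValueError("ID sequences must have the same length")

    true_pairs = linked_pairs = missed = false = 0
    for i, j in combinations(range(len(candidate_ids)), 2):
        same_ref = reference_ids[i] == reference_ids[j]
        same_cand = candidate_ids[i] == candidate_ids[j]
        true_pairs += same_ref            # pairs the reference says belong together
        linked_pairs += same_cand         # pairs the algorithm links
        missed += same_ref and not same_cand
        false += same_cand and not same_ref

    return {
        "missed_match_rate": missed / true_pairs if true_pairs else 0.0,
        "false_match_rate": false / linked_pairs if linked_pairs else 0.0,
    }
```

The loop visits every pair of records, so this form suits small samples only; a larger dataset would need grouping by ID rather than an exhaustive pairwise comparison.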

Conclusion: Probabilistic linkage of HES reduced missed matches and bias in estimated readmission rates, with clear implications for commissioning, service evaluation and performance monitoring of hospitals. The algorithm should be modified to address data linkage error, and a retrospective update of the data would address the linkage errors already present and their implications.


Keywords: Probabilistic record linkage; Deterministic record linkage; Hospital discharge; Evaluation

This is an open access journal, which means that all content is freely available without charge to the user or their institution. Users are allowed to read, download, copy, distribute, print, search, or link to the full texts of the articles in this journal starting from Volume 21 without asking prior permission from the publisher or the author. This is in accordance with the BOAI definition of open access. For permission regarding papers published in previous volumes, please contact us.

Online ISSN 2058-4563 - Print ISSN 2058-4555. Published by BCS, The Chartered Institute for IT