Can AI models rival humans in anonymising patient data?
Researchers from the University of Oxford have benchmarked artificial intelligence (AI) tools that automatically remove personal information from patient electronic health records, a key step towards enabling large-scale, confidential medical research.
As healthcare becomes digitised, the wealth of information stored in millions of electronic health records (EHRs) is providing a valuable resource. These routinely collected data are driving advances in research, education, and quality improvement. But increasing interest in using EHRs to train deep-learning models aimed at improving patient outcomes is raising questions over whether current de-identification methods are robust enough to fully protect patient privacy.
“Patient confidentiality is essential to building public trust in healthcare research,” said Dr Rachel Kuo, NIHR Doctoral Research Fellow at Oxford University. “Manual redaction of personally identifiable information such as patient names or locations is time-consuming and expensive. Automated de-identification could alleviate this burden, but we need to be sure that software could meet an acceptable standard of performance.”
The study, published in iScience, was a collaboration between Dr Kuo and Professor Dominic Furniss from the Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Dr Andrew Soltan from the Department of Oncology, and Professor David Eyre from the Big Data Institute.
The research, which looked at routinely collected data from Oxford University Hospitals, was supported by the NIHR Oxford Biomedical Research Centre. It evaluated the ability of large language models (LLMs) and purpose-built software tools to detect and remove patient names, dates, medical record numbers, and other identifiers from real-world records, without altering clinical content.
The first step was to establish a human benchmark. The team manually redacted 3,650 medical records, comparing and correcting the annotations until they had a complete reference set. They then evaluated two task-specific de-identification software tools (Microsoft Azure and AnonCAT) and five general-purpose LLMs (GPT-4, GPT-3.5, Llama-3, Phi-3, and Gemma) at redacting identifiable information.
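The article does not spell out how the comparison was scored, but de-identification benchmarks of this kind are typically measured with entity-level precision and recall against the human gold standard. The minimal Python sketch below illustrates the idea; the spans and scoring choices are illustrative assumptions, not the study's actual evaluation code.

```python
# Minimal sketch of entity-level scoring against a human gold standard.
# Spans and types are hypothetical, not drawn from the study.

def score(gold_spans, predicted_spans):
    """Precision/recall over (start, end, type) identifier spans."""
    gold = set(gold_spans)
    pred = set(predicted_spans)
    tp = len(gold & pred)                          # identifiers correctly redacted
    precision = tp / len(pred) if pred else 1.0    # penalises over-redaction
    recall = tp / len(gold) if gold else 1.0       # the privacy-critical number
    return precision, recall

# Example: gold annotations mark a name and a date; the tool misses the date.
gold = [(0, 10, "NAME"), (25, 35, "DATE")]
pred = [(0, 10, "NAME")]
p, r = score(gold, pred)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=1.00 recall=0.50
```

For privacy purposes recall is the critical figure: an identifier the tool misses is a potential leak, whereas over-redaction only costs clinical detail.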
Microsoft’s Azure de-identification service achieved the highest performance overall, closely matching human reviewers. GPT-4 also performed strongly, demonstrating that modern language models can accurately remove identifiers with minimal fine-tuning or task-specific training.
“One of our most promising findings was that we don't need to retrain complex AI models from scratch,” explained Dr Soltan, NIHR Academic Clinical Lecturer and Engineering Research Fellow. “We found that some models worked well ‘out of the box’, and that others saw their performance nudged upwards with simple techniques.
“For the general-purpose models, this meant showing them just a handful of examples of what a correctly anonymised record looks like. For the specialised software, one model learned to pick up nuances in our hospital’s data, like the format of telephone extensions, after fine-tuning on just a small sample. This is exciting because it shows a practical path for hospitals to adopt these technologies without manually labelling thousands of patient notes.”
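The study's prompts are not reproduced in this article, but few-shot prompting for this task looks broadly like the sketch below. The example records and the `complete()` helper are hypothetical stand-ins for whichever model is under test.

```python
# Sketch of few-shot prompting for de-identification. The example records
# and the `complete` helper are hypothetical; wire in a real client
# (e.g. an OpenAI or local Llama call) to run it against a model.

FEW_SHOT_EXAMPLES = [
    ("Seen by Dr. Jane Smith on 03/04/2021.",
     "Seen by Dr. [NAME] on [DATE]."),
    ("Patient John Doe, MRN 1234567, attended clinic.",
     "Patient [NAME], MRN [ID], attended clinic."),
]

def build_prompt(record: str) -> str:
    """Assemble an instruction plus a handful of worked examples."""
    parts = ["Replace all personal identifiers with placeholders. "
             "Do not change any clinical content.\n"]
    for original, redacted in FEW_SHOT_EXAMPLES:
        parts.append(f"Input: {original}\nOutput: {redacted}\n")
    parts.append(f"Input: {record}\nOutput:")
    return "\n".join(parts)

def complete(prompt: str) -> str:
    raise NotImplementedError("Call your chosen LLM here.")

# redacted = complete(build_prompt("Reviewed Mary Jones, DOB 12/12/1950."))
```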
However, the study also revealed risks. Some models produced ‘hallucinations’, inserting text that was not present in the original record and occasionally fabricating medical details.
“While some large language models perform impressively, others can generate false or misleading text,” explained Dr Soltan. “This behaviour poses a risk in clinical contexts, and careful validation is critical before deployment.”
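The paper's validation pipeline is not detailed in this article, but one simple guard against such hallucinations is to check that everything a model returns, placeholders aside, already appears in the source record. A rough sketch, assuming a bracketed placeholder format:

```python
import re

# Assumed placeholder format; adjust to match the redaction scheme in use.
PLACEHOLDER = re.compile(r"\[(?:NAME|DATE|ID|LOCATION)\]")

def hallucinated_fragments(original: str, redacted: str) -> list[str]:
    """Return output fragments that never appeared in the source record."""
    fragments = [f.strip() for f in PLACEHOLDER.split(redacted) if f.strip()]
    return [f for f in fragments if f not in original]

original = "Seen by Dr. Jane Smith on 03/04/2021. BP stable."
good = "Seen by Dr. [NAME] on [DATE]. BP stable."
bad = "Seen by Dr. [NAME] on [DATE]. BP elevated, started amlodipine."

print(hallucinated_fragments(original, good))  # []
print(hallucinated_fragments(original, bad))   # flags the invented clinical text
```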
The researchers concluded that automating de-identification could significantly reduce the time and cost required to prepare clinical data for research, while maintaining patient privacy in compliance with data protection regulations.
“This work shows that AI can be a powerful ally in protecting patient confidentiality,” said Professor Eyre. “But human judgement and strong governance must remain at the centre of any system that handles patient data.”
As well as the NIHR Oxford BRC, the study was supported by Microsoft Research UK, Cancer Research UK, and the Engineering and Physical Sciences Research Council (EPSRC).

