Factlen Research · Synthetic Data · Explainer · Jun 12, 2026, 8:30 AM · 6 min read

The Evidence for Synthetic Data: Can AI Safely Replace Real Patient Records?

As privacy laws restrict access to real-world datasets, researchers are increasingly turning to synthetic data to train AI. Recent validation studies reveal exactly how well these artificial datasets perform compared to the real thing.

By Factlen Editorial Team

Clinical Researchers 35% · Privacy Advocates 35% · Machine Learning Engineers 30%
Clinical Researchers
Value synthetic data as a way to bypass lengthy privacy approvals and accelerate medical discoveries.
Privacy Advocates
Focus on rigorous mathematical guarantees that synthetic datasets cannot be reverse-engineered to expose real people.
Machine Learning Engineers
Emphasize the 'domain gap' and advocate for hybrid approaches that mix synthetic and real-world data for maximum accuracy.

What's not represented

  • Patient Advocacy Groups
  • Data Privacy Lawyers

Why this matters

Data bottlenecks are a major hurdle to medical and technological breakthroughs. If synthetic data can accurately proxy reality without violating privacy laws, it could substantially accelerate the pace of global research.

Key points

  • Synthetic data uses generative AI to create artificial datasets that mirror real-world statistics without containing personal information.
  • Validation studies show synthetic data can successfully replicate the clinical conclusions of real-world oncology trials.
  • Privacy-by-design techniques can achieve 'Hidden Rates' of 85% or higher, which in one Multiple Sclerosis study qualified the data as non-personal under GDPR.
  • Models trained exclusively on synthetic data suffer from a 'domain gap,' but a hybrid approach using 33% real data came within about 1.4 percentage points of the accuracy of a model trained on 100% real data.

85–93%: Hidden Rate (privacy guarantee) in MS trial
<1%: Difference in univariate distributions vs real data
13.8: Synthetic samples needed to equal one real sample
65.4%: Accuracy of hybrid real/synthetic vision model

The foundation of modern artificial intelligence and medical research rests on a paradox. To build systems that can detect early-stage diseases or predict patient outcomes, researchers need access to massive, highly detailed datasets. Yet, the most valuable information—real human medical records—is rightfully locked behind stringent privacy regulations like the European Union’s GDPR and the United States’ HIPAA. This tension between the hunger for data and the imperative of patient confidentiality has historically forced a compromise, slowing down life-saving research to ensure no individual's privacy is breached.[6]

For years, the standard workaround was anonymization: stripping names, addresses, and Social Security numbers from datasets before sharing them. However, as machine learning models grew more sophisticated, anonymization proved fragile. In datasets involving rare diseases or unique genetic markers, the risk of "re-identification" remained stubbornly high, as algorithms could cross-reference seemingly anonymous data points to unmask individuals. The industry needed a solution that offered the statistical utility of real data without the legal and ethical liabilities.[6]

Enter synthetic data. Rather than masking real patient records, researchers are now using generative artificial intelligence, including Generative Adversarial Networks (GANs) and diffusion models, to manufacture entirely new, artificial datasets. These algorithms ingest real-world data, learn its underlying statistical distributions, and then generate virtual patient profiles. A synthetic patient might have a realistic combination of blood pressure, age, and medication history that closely mirrors broader population trends, but that specific patient does not exist in the real world.[6]

The theoretical appeal is obvious, but the scientific community demands empirical proof. Can a dataset populated by "fake" patients actually yield rigorous, peer-reviewed medical breakthroughs? Over the past year, a wave of validation studies has moved synthetic data from a theoretical computer science concept to a proven clinical tool, demonstrating that artificial data can indeed serve as a highly accurate proxy for reality.[6]

Validation studies show synthetic datasets can successfully defeat re-identification attacks.

One of the most rigorous tests of this concept was published in a validation study that sought to replicate the findings of a major oncology clinical trial. Researchers took the original, highly sensitive trial data and used it to train a generative model, which then produced a synthetic twin of the dataset. The critical test was whether a secondary analysis of the synthetic data would lead scientists to the exact same medical conclusions as the original, real-world data.[3]

The results were striking. When comparing the two datasets, researchers found that the univariate distributions (the spread of individual variables like tumor size or patient age) differed by less than one percent. Furthermore, the bivariate relationships between pairs of health factors maintained a confidence interval overlap of more than fifty percent. Most importantly, the statistical models built on the synthetic data led to the same clinical conclusions regarding treatment efficacy, indicating that the artificial proxy retained the key medical signals of the original trial.[3]
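A univariate comparison of this kind can be pictured as a per-variable check of how far a synthetic summary statistic drifts from the real one. The sketch below is illustrative only: the values are made-up toy numbers, `relative_mean_difference` is a hypothetical helper, and comparing means is a stand-in chosen for simplicity, not the validation study's actual method.

```python
from statistics import mean


def relative_mean_difference(real, synthetic):
    """Absolute difference in means, expressed as a fraction of the real mean."""
    real_mean = mean(real)
    return abs(mean(synthetic) - real_mean) / abs(real_mean)


# Toy values for one variable (for example, patient age); not data from any study.
real_ages = [52, 61, 47, 70, 58, 66, 49, 63]
synthetic_ages = [53, 60, 48, 69, 59, 65, 50, 63]

diff = relative_mean_difference(real_ages, synthetic_ages)
print(f"Relative difference in means: {diff:.2%}")
```

A real audit would repeat this for every variable and would normally compare full distributions rather than a single summary statistic.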

Beyond accuracy, the data must also withstand aggressive privacy audits. A landmark study published in the Journal of Medical Internet Research tackled this by generating two virtual clinical trials for Multiple Sclerosis, based on data from over 2,300 real patients. Led by researcher Pierre-Antoine Gourraud, the team utilized a "privacy-by-design" approach called the avatars technique, which specifically optimizes for both clinical utility and anonymity.[1]

To prove the data was safe, the researchers subjected the synthetic Multiple Sclerosis datasets to simulated adversarial attacks, attempting to re-identify the original patients. The synthetic datasets achieved a "Hidden Rate" of between 85.0% and 93.2%, meaning the attacks overwhelmingly failed. Because the privacy assessment was so robust, the synthetic datasets legally qualified as non-personal data under GDPR, allowing the team to release the virtual placebo arms as open-access resources for the global research community.[1]
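To make the idea of a failed re-identification attack concrete, here is a minimal sketch of one way such a rate could be computed. It rests on assumptions that are not taken from the study: each synthetic record is paired with the real record it was generated from, the simulated attacker simply links every real record to its nearest synthetic record by Euclidean distance, and the `hidden_rate` function and toy coordinates are hypothetical.

```python
import math


def hidden_rate(real_records, synthetic_records):
    """Fraction of real records whose nearest synthetic record is NOT their own.

    Assumption: synthetic_records[i] was generated from real_records[i].
    """
    hidden = 0
    for i, real in enumerate(real_records):
        distances = [math.dist(real, synth) for synth in synthetic_records]
        nearest = distances.index(min(distances))
        if nearest != i:
            # The attacker's best guess points at the wrong individual.
            hidden += 1
    return hidden / len(real_records)


# Toy two-variable records; not data from any study.
real = [(1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (6.0, 6.0)]
synthetic = [(2.1, 2.0), (1.1, 1.0), (5.1, 5.0), (9.0, 9.0)]

print(f"Hidden rate: {hidden_rate(real, synthetic):.0%}")
```

In this toy case three of the four linkage attempts land on the wrong record. A higher rate means the attacker is wrong more often.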

When used to replicate an oncology trial, synthetic data produced nearly identical statistical distributions to the real-world records.

Leading medical institutions are already operationalizing these findings. At the University of Chicago, a collaboration between clinicians and data scientists recently launched a unified data repository designed around synthetic access. The platform features self-service, kiosk-style tools that allow researchers to immediately query synthetic versions of the hospital's electronic medical records. This allows teams to rapidly test hypotheses and explore care trends without waiting months for institutional review board approvals to access the real, sensitive data.[2]

Despite these successes, synthetic data is not a flawless replica of reality, and machine learning engineers warn against treating it as a complete replacement for human data. The primary limitation is known as the "domain gap"—the subtle, often invisible differences between a simulated environment and the messy, unpredictable real world. When models are trained exclusively on artificial data, they often struggle to perform when deployed in real-world settings.[6]

A 2025 study on computer vision models quantified this domain gap with stark clarity. Researcher Alexey Gruzdev found that artificial intelligence models trained solely on synthetic data achieved a dismal 16% accuracy when tested against real-world images. The synthetic data, while statistically clean, lacked the chaotic edge cases, sensor noise, and unpredictable backgrounds that define reality.[4]

However, the same study revealed that synthetic data becomes far more useful when used as an augmentative tool rather than a total replacement. By combining a massive synthetic dataset with just 33% of the original real-world data, the hybrid model achieved 65.41% accuracy. That is within about 1.4 percentage points of the baseline model, which had been trained on 100% real data and achieved 66.82% accuracy.[4]

This research established a practical "exchange rate" for machine learning engineers: it took approximately 13.8 synthetic samples to equal the training value of one real-world sample. Because synthetic data can be generated at near-zero marginal cost, generating fourteen times as much data is often vastly cheaper and faster than navigating the legal and logistical hurdles of acquiring more real-world records.[4]
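The arithmetic behind these figures can be worked through in a few lines. The accuracy numbers and the 13.8 ratio are the ones reported above. The dataset size of 10,000 real samples is a hypothetical example, and treating the exchange rate as a constant that scales linearly is a simplifying assumption.

```python
# Figures reported in the article (source [4]).
BASELINE_ACCURACY = 66.82   # % accuracy, model trained on 100% real data
HYBRID_ACCURACY = 65.41     # % accuracy, 33% real data plus synthetic data
REAL_FRACTION_KEPT = 0.33
SYNTHETIC_PER_REAL = 13.8   # reported "exchange rate"


def synthetic_samples_needed(real_samples_replaced, rate=SYNTHETIC_PER_REAL):
    """Synthetic samples needed to stand in for the given number of real samples.

    Assumes the exchange rate stays constant at any scale.
    """
    return round(real_samples_replaced * rate)


accuracy_gap = BASELINE_ACCURACY - HYBRID_ACCURACY

# Hypothetical project that would otherwise need 10,000 real samples.
total_real = 10_000
replaced = round(total_real * (1 - REAL_FRACTION_KEPT))
needed = synthetic_samples_needed(replaced)

print(f"Accuracy gap vs. baseline: {accuracy_gap:.2f} percentage points")
print(f"Real samples replaced: {replaced}")
print(f"Synthetic samples needed: {needed}")
```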

While synthetic data alone struggles with real-world accuracy, hybrid models achieve baseline performance at a fraction of the data-collection cost.

There is also the critical risk of bias amplification. Generative models are mirrors; they reflect whatever is in their training data. If an original dataset lacks representation from certain demographic groups, the synthetic generator will not only replicate that blind spot but can mathematically amplify it. Ensuring that synthetic datasets are fair requires active auditing and deliberate re-balancing by data scientists before the artificial records are deployed into medical or financial systems.[6]
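A first step in the kind of audit described above might be to compare each demographic group's share of the real data with its share of the synthetic data. The sketch below assumes every record carries a single group label. The group names, counts, and helper functions are hypothetical.

```python
from collections import Counter


def group_shares(labels):
    """Share of records belonging to each group."""
    counts = Counter(labels)
    total = len(labels)
    return {group: count / total for group, count in counts.items()}


def representation_drift(real_labels, synthetic_labels):
    """Synthetic share minus real share per group; negative means the group shrank."""
    real = group_shares(real_labels)
    synth = group_shares(synthetic_labels)
    return {group: synth.get(group, 0.0) - real[group] for group in real}


# Toy example: group B is already under-represented in the real data
# and shrinks further in the synthetic data.
real_groups = ["A"] * 80 + ["B"] * 20
synthetic_groups = ["A"] * 90 + ["B"] * 10

drift = representation_drift(real_groups, synthetic_groups)
for group, change in sorted(drift.items()):
    print(f"Group {group}: {change:+.0%} change in share")
```

A negative drift for an already small group is a signal to re-balance before the synthetic records are used downstream.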

Recognizing both its power and its limits, the technology industry is rapidly scaling its reliance on artificial datasets. Industry analysts at Gartner project that by 2028, the vast majority of all data used in artificial intelligence training will be synthetically generated, fundamentally shifting the economics of machine learning. The global market for synthetic data generation is expected to reach $3.7 billion by the end of the decade.[5]

Ultimately, the evidence suggests that synthetic data can help bridge the gap between privacy and progress. By providing a statistically validated proxy for sensitive information, it allows researchers to share insights across borders, train more robust algorithms, and accelerate medical discoveries, all while protecting the identities of the actual human beings behind the numbers.[6]

How we got here

  1. 2018

    The implementation of GDPR strictly limits how researchers can share and process real patient data.

  2. 2023

    Generative AI breakthroughs accelerate the ability to create high-fidelity synthetic datasets.

  3. 2024

    Researchers successfully use the 'avatars' technique to release open-access synthetic data for Multiple Sclerosis trials.

  4. 2025

    Computer vision studies quantify the 'domain gap,' showing that hybrid real-synthetic datasets can approach the accuracy of models trained entirely on real data.

  5. 2026

    Major hospitals like the University of Chicago deploy self-service synthetic data repositories for clinical researchers.

Viewpoints in depth

Clinical Researchers' View

Prioritize the speed and accessibility of data to accelerate medical discoveries.

For medical researchers, the primary appeal of synthetic data is administrative bypass. Gaining access to real Electronic Medical Records (EMRs) often requires months of navigating Institutional Review Boards (IRBs) and compliance checks. Synthetic data allows clinicians to immediately query databases, test hypotheses, and identify care gaps without touching sensitive information. Their main concern is ensuring the artificial data maintains high statistical fidelity so that early-stage research translates accurately when finally tested against real patients.

Privacy Advocates' View

Focus on the mathematical guarantees that prevent re-identification of real individuals.

Privacy experts view synthetic data as a massive upgrade over traditional anonymization, which has repeatedly proven vulnerable to cross-referencing attacks. However, they emphasize that synthetic data is not automatically safe. If a generative model over-fits its training data, it might inadvertently memorize and regurgitate a real patient's unique profile. Advocates demand rigorous, adversarial privacy testing—such as measuring 'Hidden Rates'—before any synthetic dataset is legally classified as non-personal data under frameworks like GDPR.

Machine Learning Engineers' View

Focus on the 'domain gap' and the practical economics of model training.

Engineers building AI systems are pragmatic about synthetic data's limitations. They acknowledge the 'domain gap': models trained purely on simulated data often fail when exposed to the messy, unpredictable real world. Instead of viewing synthetic data as a total replacement, they treat it as an augmentative tool. By mixing a small percentage of expensive, real-world data with massive volumes of cheap synthetic data, they can achieve near-baseline accuracy while drastically reducing data collection costs.

What we don't know

  • Whether synthetic data can accurately capture the nuances of extremely rare diseases where the original training data is statistically insignificant.
  • How long-term reliance on synthetic data might compound hidden biases over multiple generations of AI models.
  • How international regulatory bodies will standardize the legal definition of 'sufficiently private' synthetic data.

Key terms

Synthetic Data
Artificially generated data that mimics the statistical properties of real data without containing identifiable personal information.
Domain Gap
The mismatch between simulated or synthetic training data and the unpredictable real world, which causes a model trained on the former to lose accuracy when deployed in the latter.
Generative Adversarial Network (GAN)
A type of artificial intelligence system used to generate synthetic data by pitting two neural networks against each other to create highly realistic outputs.
Re-identification
The process of cross-referencing anonymized data points to successfully unmask and identify a real person.
Hidden Rate
A privacy metric measuring how often a simulated re-identification attack fails to link a synthetic record back to the real individual behind it.

Frequently asked

What is synthetic data?

Synthetic data is artificially generated information that mirrors the statistical patterns of real-world data without containing any actual personal details.

Is synthetic data just anonymized data?

No. Anonymization removes names from real records, which can sometimes be reverse-engineered. Synthetic data generates entirely new, fake records from scratch.

Can synthetic data completely replace real data?

Not yet. Studies show that models trained exclusively on synthetic data suffer from a 'domain gap' and perform poorly in the real world. It works best when combined with a small amount of real data.

Is synthetic data legal under GDPR?

Yes, if generated correctly. When privacy assessments prove the data cannot be re-identified, it qualifies as non-personal data and is exempt from strict GDPR restrictions.

Sources

Source coverage

6 outlets

3 viewpoints surfaced

  1. [1] Journal of Medical Internet Research (Privacy Advocates)

    Privacy-by-Design Approach to Generate Two Virtual Clinical Trials for Multiple Sclerosis

  2. [2] University of Chicago (Clinical Researchers)

    Using synthetic data to tell the right story and improve care

  3. [3] PubMed Central (Clinical Researchers)

    Can synthetic data be a proxy for real clinical trial data? A validation study

  4. [4] Towards Data Science (Machine Learning Engineers)

    Synthetic Reality: How 3D-Generated Data Can Replace 66% of Your ML Training Data

  5. [5] EnFuse Solutions (Machine Learning Engineers)

    Synthetic Data Generation: Fueling AI Without Compromising Privacy

  6. [6] Factlen Editorial Team

    Synthesis by Factlen editorial team