The Data Drought: How AI’s Training Gap Could Expose Users to Liability
The artificial intelligence revolution is facing a significant hurdle: the data that powers it is running dry.
The consequences of this data shortage are already evident in the construction of new AI systems. One solution is to train models on the outputs of existing models rather than on entirely new, original data. This practice of recycling AI-generated content, known as “synthetic data”, is rapidly becoming the industry’s stopgap. While synthetic data offers a theoretically limitless supply of training material, it risks flooding models with low-quality input.
Although synthetic data has the potential to “democratise” AI production by reducing reliance on large datasets and correcting historical biases, its use moves data-making processes further from public scrutiny. That opacity makes it harder to identify and attribute original sources, a thorny issue for anyone seeking to build an intellectual property or copyright claim around their work.
The implications for output quality are also significant. When AI models are trained on the outputs of other AI models, errors compound: with every feedback loop, models learn from an increasingly artificial version of reality. The result is a “creative plateau”, in which outputs grow more homogeneous and less tethered to real-world data.
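This compounding effect can be sketched with a toy simulation (purely illustrative; the Gaussian “model” and all numbers here are our own assumptions, not drawn from any real AI system). Each generation fits a simple statistical model to the previous generation’s synthetic output and resamples from it; over many iterations, the diversity of the data collapses even though each individual fit looks reasonable.

```python
import random
import statistics

def fit_and_resample(data, rng):
    """Toy 'model': fit a normal distribution to the data, then
    generate a synthetic dataset of the same size from the fit."""
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    return [rng.gauss(mu, sigma) for _ in data]

rng = random.Random(0)
# A small "real" dataset drawn from a known distribution.
data = [rng.gauss(0.0, 1.0) for _ in range(20)]
initial_spread = statistics.stdev(data)

# Each generation is trained only on the previous generation's output.
for _ in range(1000):
    data = fit_and_resample(data, rng)

final_spread = statistics.stdev(data)
print(f"spread of original data: {initial_spread:.3f}")
print(f"spread after 1000 synthetic generations: {final_spread:.6f}")
```

The spread of the data shrinks dramatically across generations: the feedback loop loses the tails of the original distribution, a simplified analogue of the homogenisation described above.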
For users of AI in every field, this decline in quality translates into greater exposure to liability. Tools that rely on synthetic data may produce fabricated case citations, erroneous financial figures or misleading regulatory guidance, and deliver them with greater confidence.
The solution is not straightforward. Although the open internet may be “tapped out”, large amounts of data remain in private hands. Corporations, in particular, hold proprietary datasets that could be valuable for training models, and many will be alive to their lucrative potential. Synthetic data can also be subjected to quality-control and transparency measures to mitigate the issues arising from its use.
The full impact of synthetic data remains to be seen, as most organisations have only just begun to adopt it. However, current predictions estimate that 80% of the data used by AI models will be synthetic by 2028, so we all need to understand its consequences.