The Data Drought: How AI’s Training Gap Could Expose Users to Liability
The artificial intelligence revolution is facing a significant hurdle: the data that powers it is running dry.
The consequences of this data shortage are already evident in the construction of new AI systems. One solution is to train models on the outputs of existing models rather than on entirely new, original data. This practice of recycling AI-generated content, known as “synthetic data”, is rapidly becoming the industry’s stopgap. While synthetic data offers a theoretically limitless supply of training material, it risks flooding models with low-quality input.
Although synthetic data has the potential to “democratise” AI production by reducing reliance on large datasets and correcting historical biases, its use moves data-making processes further from public scrutiny. That opacity makes it harder to identify and attribute original sources, a thorny issue for anyone seeking to build an intellectual property or copyright claim around their work.
The implications for output quality are also significant. When AI models are trained on the outputs of other AI models, errors compound: with every feedback loop, models learn from an increasingly artificial version of reality. The result is a “creative plateau”, in which outputs grow more homogeneous and less tethered to real-world data.
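This compounding effect can be sketched with a toy simulation (purely illustrative; the Gaussian “model” and all numbers here are our own assumptions, not drawn from any real AI system). Each generation fits a simple statistical model to the previous generation’s synthetic output and resamples from it; over many iterations, the diversity of the data collapses even though each individual fit looks reasonable.

```python
import random
import statistics

def fit_and_resample(data, rng):
    """Toy 'model': fit a normal distribution to the data, then
    generate a synthetic dataset of the same size from the fit."""
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    return [rng.gauss(mu, sigma) for _ in data]

rng = random.Random(0)
# A small "real" dataset drawn from a known distribution.
data = [rng.gauss(0.0, 1.0) for _ in range(20)]
initial_spread = statistics.stdev(data)

# Each generation is trained only on the previous generation's output.
for _ in range(1000):
    data = fit_and_resample(data, rng)

final_spread = statistics.stdev(data)
print(f"spread of original data: {initial_spread:.3f}")
print(f"spread after 1000 synthetic generations: {final_spread:.6f}")
```

The spread of the data shrinks dramatically across generations: the feedback loop loses the tails of the original distribution, a simplified analogue of the homogenisation described above.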
For users of AI in every field, this decline in quality translates into greater exposure to liability. Tools that rely on synthetic data may produce fabricated case citations, erroneous financial figures or misleading regulatory guidance, and deliver them with greater confidence.
The solution is not straightforward. Although the open internet may be “tapped out”, large amounts of data remain in private hands. Corporations, in particular, hold proprietary datasets that could be valuable for training models, and many will be alive to their lucrative potential. Synthetic data can also be subjected to quality-control and transparency measures to mitigate the issues arising from its use.
The full impact of synthetic data remains to be seen, as most organisations have only just begun to adopt it. However, current predictions estimate that 80% of the data used by AI models will be synthetic by 2028, so we all need to understand its consequences.