The AI data crisis
Artificial intelligence – particularly the generative AI and large language models (LLMs) that have captured everyone’s imagination – relies on extensive, varied and high-quality data for its training and development.
But most of the publicly available data on the internet has already been mined to train ChatGPT and other LLMs. Content publishers are increasingly using paywalls and other security measures to shut out the web crawlers used to harvest training data. And without access to new data, the development of AI will slow to a crawl.
AI developers have tried using data generated by AI – so-called ‘synthetic’ data – to train AI models. However, this results in the rapid degradation of AI models and, ultimately, model collapse. Over successive generations of training on synthetic data, parts of the data become over-represented while other elements become under-represented or disappear entirely, eventually producing nonsensical outputs.
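The mechanism can be seen in a toy simulation – not any real AI system. In this sketch (all names and parameters are invented for illustration), each ‘generation’ fits a simple statistical model to its training data, then produces a purely synthetic dataset for the next generation to train on. The spread of the data shrinks generation after generation: rare values vanish first, and the distribution collapses towards a single point.

```python
import random
import statistics

def train_generation(data):
    """Fit a Gaussian 'model' to the data, then emit a fresh
    synthetic dataset sampled from that fitted model."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return [random.gauss(mu, sigma) for _ in data]

random.seed(0)

# Generation zero: 'human-generated' data.
data = [random.gauss(0, 1) for _ in range(20)]
initial_spread = statistics.pstdev(data)

# Each subsequent generation trains only on the previous
# generation's synthetic output.
for _ in range(2000):
    data = train_generation(data)

final_spread = statistics.pstdev(data)

# The tails are lost first; the variety in the data collapses.
print(f"spread of generation 0:    {initial_spread:.4f}")
print(f"spread after 2000 rounds:  {final_spread:.6f}")
```

Because each generation can only reproduce what the previous model captured – and small-sample estimates systematically understate variability – the diversity in the data decays towards zero, which is the over- and under-representation effect described above in miniature.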
So for now, high-quality human-generated data is needed to maintain the accuracy and reliability of AI models. And as AI-generated content spreads across the internet, it becomes increasingly hard to be sure that what is harvested was created by humans rather than machines.
These challenges in accessing new data could lead to a significant first-mover advantage. Companies that have sourced training data from a pre-AI internet may possess more accurate and reliable models. However, even these first movers will need more data if they want to continue developing their products.
Joy Calder on the AI data crisis
Are we running out of data to train AI models? Can you use AI to generate new data to then train future AI models?
Does your data hold value?
Many AI developers are now trying to source high-quality data through collaborations and partnerships.
At the moment, unless you are a content publisher, you are unlikely to be approached by Big Tech with proposals for collaboration.
However, smaller AI developers increasingly seek to include in their supply contracts provisions that let them train their models with the customer data that’s available to the AI solutions they supply. So data from your business could be used to shape future generations of AI.
Companies need to weigh their options. Is it short-sighted to lock up your organisational data, jeopardising the long-term development of AI? Or is it necessary to protect your business from regulatory risk and maintain your competitive advantage?
Your answer will depend on the context, volume and nature of the data involved, as well as the potential benefits and use cases for the specific AI solution within your business.
- Allowing training of AI models based on your organisation’s data can offer significant advantages, primarily by enhancing the performance, accuracy, and effectiveness of the model.
- These improvements in the model could drive further efficiencies for your business and result in substantial cost savings.
- There may also be the opportunity to negotiate a commercial benefit in exchange for use of your organisation’s data.
- If personal data is involved, you will need to consider the consequences under data protection laws.
- This will include establishing whether you have the appropriate lawful basis, privacy notices and/or consents in place to allow use of the data for this purpose.
- A supplier may be able to anonymise the data or exclude sensitive or confidential information.
- This may help to mitigate regulatory risks, but will not necessarily eliminate them.
- You will also need to be sure you have confidence in the supplier’s processes.
- If you are a market leader or have significant market share, are you comfortable with your data being used to enhance a product that might be used by your competitors? The enhancements in the AI model could benefit the supplier’s other customers, including your competitors or disrupters/challengers, potentially diminishing your organisation’s competitive advantage.
- You may be able to ask the supplier to fine-tune or train a version of the model using your organisation’s data that is solely for the benefit of your organisation and which won’t be accessible to third parties.
How has Big Tech tried to fix the data drought?
Some tech companies have already teamed up with content publishers to gain access to their archives. In August 2024, OpenAI and Condé Nast announced a multi-year partnership to allow ChatGPT to use and show content from publications including Vogue, The New Yorker and GQ.
Reports suggest that some technology providers have even considered buying publishing houses in order to gain access to their content.