Large language models (LLMs) are trained on large quantities of content, and an increasing proportion of the available content is itself generated by large language models. This sets up a form of recursion in which AI models are increasingly trained on AI-generated content, producing an irreversible degradation in the quality of their output. This has been described as a form of entropy; Shumailov et al call it Model Collapse.
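To make the recursive mechanism concrete, here is a toy simulation in Python (a minimal sketch of the general idea, not the experiment from the Shumailov paper): a simple model is fitted to data, new data is sampled from the fitted model, and the process repeats. Estimation error compounds from one generation to the next and the tails of the distribution are gradually lost.

```python
# Toy illustration of model collapse: repeatedly fit a Gaussian to data
# sampled from the previous generation's fit. Over many generations the
# estimated spread tends to drift towards zero, a crude analogue of the
# degradation described above. All numbers here are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def one_generation(data, n_samples=100):
    """Fit a Gaussian to `data`, then sample a new dataset from the fit."""
    mu, sigma = data.mean(), data.std()
    return rng.normal(mu, sigma, size=n_samples)

data = rng.normal(loc=0.0, scale=1.0, size=100)  # generation 0: "human" data, N(0, 1)
for generation in range(1, 31):
    data = one_generation(data)
    if generation % 10 == 0:
        print(f"generation {generation:2d}: mean={data.mean():+.3f}, std={data.std():.3f}")
```

Running this typically shows the standard deviation shrinking as the generations accumulate: each model is trained on an ever narrower picture of the original distribution.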
There is an interesting comparison to be made with data poisoning, where an AI model is deliberately polluted with bad data, often by an external attacker, in order to influence and corrupt its output. Model collapse, by contrast, doesn't involve a hostile attack; it may be better understood as a form of self-pollution.
Is there a technical or sociotechnical fix for this? Any fix seems to require limiting the training data: either sticking to the original data sources, or only admitting new training data that can be verified as not LLM-generated. Shumailov et al appeal to some form of "community-wide coordination ... to resolve questions of provenance", but this seems somewhat optimistic.
Dividing content by provenance is of course a non-trivial challenge, and automatic detectors frequently misclassify content written by non-native English speakers as AI-generated, which would further narrow the data available. Thus Shumailov et al conclude that "it may become increasingly difficult to train newer versions of LLMs without access to data that was crawled from the Internet prior to the mass adoption of the technology, or direct access to data generated by humans at scale".
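The simplest version of such a provenance gate is a cutoff date, keeping only data crawled before the mass adoption of LLM tools, plus anything with an explicit human-authorship attestation. The sketch below is purely hypothetical: the field names (`crawl_date`, `human_verified`) and the cutoff date are illustrative assumptions, not a real dataset schema or a proposal from the paper.

```python
# Hypothetical "pre-LLM cutoff" filter for assembling a training corpus.
# Keep documents crawled before mass adoption of LLM tools, or documents
# carrying a verified human-authorship flag. Schema and cutoff are assumptions.

from datetime import date
from typing import TypedDict

class Document(TypedDict):
    text: str
    crawl_date: date        # when the document was collected
    human_verified: bool    # provenance attestation, if any

LLM_MASS_ADOPTION = date(2022, 11, 30)  # illustrative cutoff, e.g. ChatGPT's public release

def keep_for_training(doc: Document) -> bool:
    """Crude provenance gate: pre-cutoff crawl or attested human authorship."""
    return doc["crawl_date"] < LLM_MASS_ADOPTION or doc["human_verified"]

corpus: list[Document] = [
    {"text": "older forum post", "crawl_date": date(2021, 5, 1), "human_verified": False},
    {"text": "recent blog post", "crawl_date": date(2024, 2, 1), "human_verified": False},
]
training_set = [d for d in corpus if keep_for_training(d)]
print(len(training_set))  # 1 -- only the pre-cutoff document survives
```

Even this crude rule shows the bind: the pool of admissible data stops growing at the cutoff, while reliable human-authorship verification at scale remains an open problem.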
What are the implications of this for the attainment of the promised benefits of AI? Imre Lakatos once drew a distinction between progressive research programmes and degenerating ones: a degenerating programme either fails to make interesting (novel) predictions, or becomes increasingly unable to make true predictions. Many years ago, Hubert Dreyfus made exactly this criticism of AI. And to the extent that large language models and other forms of AI are vulnerable to model collapse and entropy, this would again make AI look like a degenerating programme.
Thomas Claburn, What is Model Collapse and how to avoid it (The Register, 26 January 2024)
Ian Sample, Programs to detect AI discriminate against non-native English speakers, shows study (Guardian, 10 Jul 2023)
Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot and Ross Anderson, The Curse of Recursion: Training on Generated Data Makes Models Forget (arXiv:2305.17493v2, 31 May 2023)
David Sweenor, AI Entropy: The Vicious Circle of AI-Generated Content (LinkedIn, 28 August 2023)
Stanford Encyclopedia of Philosophy: Imre Lakatos
Wikipedia: Data Poisoning, Model Collapse, Self Pollution
Related posts: From ChatGPT to Infinite Sets (May 2023), ChatGPT and the Defecating Duck (Sept 2023), Creativity and Recursivity (Sept 2023)