Academic Seminar, Department of Information Management and Business Intelligence

 

Time: 2025-01-10, 10:00-11:30

Venue: Room 104, Li Dasan Building (李达三楼)

Title: Regurgitative Training: The Value of Real Data in Training Large Language Models

Speaker: Mochen Yang (杨漠尘), Associate Professor, University of Minnesota

Host: Professor Yifan Dou (窦一凡), Department of Information Management and Business Intelligence

Abstract:

What happens if we train a new Large Language Model (LLM) using data that are at least partially generated by other LLMs? The explosive success of LLMs, such as ChatGPT and LLaMA, means that a substantial amount of content online will be generated by LLMs rather than humans, which will inevitably enter the training datasets of next-generation LLMs. In this paper, we evaluate the implications of such "regurgitative training" on LLM performance. Through fine-tuning GPT-3.5 with data generated either by itself or by other LLMs in a machine translation task, we find strong evidence that regurgitative training clearly handicaps the performance of LLMs. The ease of getting large quantities of LLM-generated data cannot compensate for the performance loss: even training with a fraction of real data is enough to outperform regurgitative training. The same performance loss of regurgitative training is observed on transformer models that we train from scratch. We carry out textual analyses to compare LLM-generated data with real human-generated data, and find suggestive evidence that the performance disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher error rates and (2) lower lexical diversity in LLM-generated data as compared to real data. Based on these mechanisms, we propose and evaluate three different strategies to mitigate the performance loss of regurgitative training. In the first strategy, we devise data-driven metrics to gauge the quality of each LLM-generated data instance, and then carry out an ordered regurgitative training process where high-quality data are added before low-quality ones. In the second strategy, we combine data generated by multiple different LLMs (as an attempt to increase lexical diversity). In the third strategy, we train an AI detection classifier to differentiate between LLM- and human-generated data, and include LLM-generated data in the order of resemblance to human-generated data. All three strategies can improve the performance of regurgitative training to some extent, but are not always able to fully close the gap relative to training with real data. Our results highlight the value of real, human-generated data in training LLMs, which cannot be easily substituted by synthetic, LLM-generated data. Given the inevitability of having some LLM-generated data in the training sets of future LLMs, our work serves both as a cautionary tale about its performance implications and as a call to action for developing effective mitigation strategies.
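To make the lexical-diversity mechanism mentioned in the abstract concrete, the following is a minimal, illustrative Python sketch of how one might compare the lexical diversity of human-written versus LLM-generated text. It is not the speaker's code: the corpus file names, the whitespace tokenization, and the choice of metrics (type-token ratio and distinct-n) are assumptions made purely for demonstration.

# Illustrative sketch only: compare lexical diversity of two text corpora.
# File names and metric choices are hypothetical, not from the paper.

def tokens(text: str) -> list[str]:
    # Simple whitespace tokenization; a real analysis would use a proper tokenizer.
    return text.lower().split()

def type_token_ratio(text: str) -> float:
    # Unique tokens divided by total tokens: higher means more lexical variety.
    toks = tokens(text)
    return len(set(toks)) / len(toks) if toks else 0.0

def distinct_n(text: str, n: int = 2) -> float:
    # Fraction of unique n-grams among all n-grams (a common diversity measure).
    toks = tokens(text)
    ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

if __name__ == "__main__":
    # Hypothetical corpora: one human-written, one generated by an LLM.
    corpora = {
        "human": open("human_translations.txt", encoding="utf-8").read(),
        "llm": open("llm_translations.txt", encoding="utf-8").read(),
    }
    for name, text in corpora.items():
        print(f"{name}: TTR={type_token_ratio(text):.3f}, "
              f"distinct-2={distinct_n(text, 2):.3f}")

Under the mechanism described in the abstract, one would expect the LLM-generated corpus to score lower on such diversity measures than the human-written one.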

About the Speaker:

Dr. Yang is currently an Associate Professor in the Department of Information and Decision Sciences at the Carlson School of Management, University of Minnesota. His main research revolves around the topic of algorithmic decision-making. His work has appeared in top-tier journals, including Management Science, Information Systems Research, and MIS Quarterly. He is an Associate Editor of the INFORMS Journal on Data Science and serves on the Editorial Review Board of Information Systems Research.

Department of Information Management and Business Intelligence

December 31, 2024
