Large language models remember the datasets they are tested on


As we increasingly rely on AI to suggest what to watch, read, and buy, new research shows that some systems base those results on memory rather than skill. Instead of learning to make useful suggestions, models often recall items from the very datasets used to evaluate them, inflating measured performance and producing recommendations that may be stale or poorly matched to the user.

In machine learning, a test split is used to check whether a trained model has learned to solve problems that are similar, but not identical, to the material it was trained on.

So if a new 'dog-breed recognition' model is developed from a dataset of 100,000 dog photos, it will typically use an 80/20 split: 80,000 photos to train the model, and 20,000 photos held back as material with which to test the finished model.

If the AI's training data inadvertently comes to include the 'secret' 20% test split, the model already knows the answers and will breeze through those tests (having effectively seen 100% of the domain data). Naturally, this does not accurately reflect how the model will perform later on new 'live' data in a production context.
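As a minimal sketch of the idea (not taken from the paper, and with hypothetical file names), an 80/20 split and a basic contamination check might look like this:

```python
import random

# Hypothetical corpus of labelled examples (e.g. dog photos by breed).
examples = [f"photo_{i:06d}.jpg" for i in range(100_000)]

random.seed(42)
random.shuffle(examples)

split = int(len(examples) * 0.8)                     # 80/20 split
train_set, test_set = examples[:split], examples[split:]

# Contamination check: no held-out test example should appear in the training data.
leaked = set(train_set) & set(test_set)
print(f"train={len(train_set)}, test={len(test_set)}, leaked={len(leaked)}")
```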

Movie spoilers

The problem of AI 'cheating on its exams' has grown in step with the scale of the models themselves. Because today's systems are trained on vast, indiscriminately scraped web corpora such as Common Crawl, the chance that benchmark datasets (i.e., the held-back 20%) will slip into the training mix is no longer an edge case but the default, a syndrome known as data contamination; and at this scale, the manual curation that could catch such errors is logistically impossible.

This case is investigated in a new paper from the Politecnico di Bari in Italy, where the researchers focus on the outsized role of a single movie recommendation dataset, MovieLens-1M.

This particular dataset is so widely used in testing recommender systems that its presence in a model's memory can render those tests meaningless: what appears to be intelligence may in fact be simple recall, and what appears to be an intuitive recommendation skill may just be a statistical echo of earlier exposure.

The authors state:

‘Our findings show that LLMs possess extensive knowledge of the MovieLens-1M dataset, covering its items, user attributes, and interaction histories. Notably, a simple prompt enables GPT-4o to recover nearly 80% of the titles of the movies in the dataset.

“None of the examined models is free of this knowledge, suggesting that MovieLens-1M data is likely included in their training sets. We observed a similar trend in retrieving user attributes and interaction histories.”

The brief new paper is titled Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M, and comes from six Politecnico di Bari researchers. A pipeline to reproduce their work is available on GitHub.


Method

To establish whether the models in question were truly learning or simply recalling, the researchers began by defining what memorization means in this context: the ability to retrieve specific pieces of the MovieLens-1M dataset when prompted in the right way.

If a model was shown a movie's ID number and could produce its title and genres, that counted as memorizing the item. If it could generate details about a user (age, occupation, zip code, and so on) from a user ID, that counted as memorizing the user. And if it could reproduce a user's next rating from a known sequence of earlier ones, it was taken as evidence that the model might be recalling specific interaction data rather than learning general patterns.

Each of these forms of recall was tested with carefully written prompts, crafted to nudge the model without supplying new information. The more accurate the responses, the more likely it was that the model had already encountered that data during training.

The zero-shot prompts from the evaluation protocol used in the new paper. Source: https://arxiv.org/pdf/2505.10212
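To make the three recall checks concrete, below is a minimal sketch of the kind of prompt templates involved; the wording and helper names are illustrative assumptions, and the paper's actual prompts are those shown in the figure above.

```python
# Illustrative prompt templates for the three memorization checks
# (wording is hypothetical; the paper's own prompts are shown above).

def item_prompt(movie_id: int) -> str:
    # Item memorization: can the model produce title and genres from an ID?
    return f"In MovieLens-1M, movie ID {movie_id} corresponds to which title and genres?"

def user_prompt(user_id: int) -> str:
    # User memorization: can the model produce biographical fields from an ID?
    return f"In MovieLens-1M, what are the gender, age, occupation and zip code of user {user_id}?"

def interaction_prompt(user_id: int, history: list[tuple[int, int]]) -> str:
    # Interaction memorization: can the model continue a known rating sequence?
    past = ", ".join(f"movie {m} -> {r} stars" for m, r in history)
    return f"User {user_id} rated: {past}. Which movie did they rate next, and with how many stars?"

print(item_prompt(1))
```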

Data and Testing

To select an appropriate dataset, the authors surveyed recent papers from ACM RecSys 2024 and ACM SIGIR 2024, two of the field's main conferences, and found MovieLens-1M cited in at least one in five submissions. This was less a surprising result than a confirmation of the dataset's dominance, since earlier studies have reached similar conclusions.

MovieLens-1M consists of three files: movies.dat, which lists films by ID, title and genre; users.dat, which maps user IDs to basic biographical fields; and ratings.dat, which records who rated what.
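For orientation, the three files are plain text with '::' as the field separator; a minimal parsing sketch, assuming a local movielens-1m/ directory, might look like this:

```python
from pathlib import Path

DATA_DIR = Path("movielens-1m")  # assumed local copy of the dataset

def read_dat(name: str) -> list[list[str]]:
    # MovieLens-1M files use '::' as the field separator and latin-1 encoding.
    with open(DATA_DIR / name, encoding="latin-1") as f:
        return [line.rstrip("\n").split("::") for line in f]

movies  = {m_id: (title, genres) for m_id, title, genres in read_dat("movies.dat")}
users   = {u[0]: u[1:] for u in read_dat("users.dat")}   # gender, age, occupation, zip
ratings = read_dat("ratings.dat")                        # user, movie, rating, timestamp

print(len(movies), len(users), len(ratings))
```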

To investigate whether this data had been memorized by large language models, the researchers turned to prompting techniques first introduced in the paper Extracting Training Data from Large Language Models, and later adapted in the subsequent work Bag of Tricks for Training Data Extraction from Language Models.

The method is direct: pose a question that mirrors the dataset's format and check whether the model answers correctly. Zero-shot, chain-of-thought and few-shot prompting were all tested, and the last method, in which the model is shown a few examples, proved the most effective. Even though more elaborate approaches might yield higher recall, this was considered sufficient to reveal what had been memorized.

Few-shot prompts used to test whether a model can reproduce specific MovieLens-1M values when queried with minimal context.
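As a rough illustration of how such a few-shot probe can be assembled (the wording and helper function here are assumptions, not the paper's own code), one might seed the prompt with a handful of genuine records and leave the queried field blank:

```python
# Hypothetical few-shot probe for item recall: show a few real ID::Title::Genres
# records, then ask the model to complete a record from its ID alone.

examples = [
    "1::Toy Story (1995)::Animation|Children's|Comedy",
    "2::Jumanji (1995)::Adventure|Children's|Fantasy",
    "3::Grumpier Old Men (1995)::Comedy|Romance",
]

def few_shot_item_prompt(movie_id: int) -> str:
    shots = "\n".join(examples)
    return (
        "Complete the following MovieLens-1M records in the format ID::Title::Genres.\n"
        f"{shots}\n"
        f"{movie_id}::"
    )

print(few_shot_item_prompt(260))
```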

To measure memorization, the researchers defined three forms of recall: item, user, and interaction. These tests examined whether a model could retrieve a film's title from its ID, generate user details from a user ID, and predict a user's next rating from earlier ones. Each was scored with a coverage metric* reflecting how much of the dataset could be reconstructed through prompting.
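Such a coverage score reduces to simple counting; the sketch below assumes a strict exact-match criterion, which may differ from the paper's matching rules:

```python
def coverage(records: dict[int, str], model_answers: dict[int, str]) -> float:
    # Fraction of dataset entries the model reproduces correctly when prompted.
    # A recall counts as successful on exact (case-insensitive) match, a deliberately
    # strict and hypothetical criterion.
    hits = sum(
        1 for key, truth in records.items()
        if model_answers.get(key, "").strip().lower() == truth.strip().lower()
    )
    return hits / len(records)

# e.g. coverage({1: "Toy Story (1995)"}, {1: "toy story (1995)"}) == 1.0
```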


The models tested were GPT-4o; GPT-4o mini; GPT-3.5 turbo; Llama-3.3 70B; Llama-3.2 3B; Llama-3.2 1B; Llama-3.1 405B; Llama-3.1 70B; and Llama-3.1 8B. Temperature was set to zero, top_p to one, and both frequency and presence penalties were disabled; a fixed random seed ensured consistent output across runs.
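For reference, that decoding configuration corresponds roughly to settings like the following for the OpenAI-hosted models (a sketch only, with a placeholder prompt and seed value; the paper's actual querying code is in its GitHub pipeline):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt_text = "In MovieLens-1M, movie ID 1 corresponds to which title and genres?"  # illustrative query

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt_text}],
    temperature=0,          # deterministic (greedy) decoding
    top_p=1,
    frequency_penalty=0,    # penalties disabled
    presence_penalty=0,
    seed=42,                # fixed seed for consistent output across runs
)
print(response.choices[0].message.content)
```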

The percentage of MovieLens-1M entries retrieved from movies.dat, users.dat and ratings.dat, with models grouped by family and version and sorted by parameter count.

To gauge how deeply MovieLens-1M had been absorbed, the researchers prompted each model for exact entries from the dataset's three (aforementioned) files: movies.dat, users.dat and ratings.dat.

The results of this first test, shown above, reveal sharp differences not only between the GPT and Llama families but also across model sizes. GPT-4o and GPT-3.5 turbo retrieve large portions of the dataset with ease, while most of the open-source models recall only a fraction of the same material, suggesting uneven exposure to this benchmark in training.

These are not small margins. Across all three files, the strongest models did not merely outperform the weaker ones but recalled whole portions of MovieLens-1M outright.

In the case of GPT-4o, coverage was high enough to suggest that a nontrivial share of the dataset had been directly memorized.

The authors comment:

‘Our findings show that LLMs have extensive knowledge of the MovieLens-1M dataset, including its items, user attributes, and interaction histories.

“In particular, a simple prompt enables GPT-4o to recover almost 80% of MovieID::Title records. None of the examined models is free of this knowledge, suggesting that MovieLens-1M data is likely included in their training sets.

“We observed a similar trend in retrieving user attributes and interaction histories.”

The authors then tested the impact of memorization on recommendation tasks by prompting each model to act as a recommender system, and benchmarked the output against seven standard methods: UserKNN; ItemKNN; BPRMF; EASE-R; LightGCN; MostPop; and Random.

The MovieLens-1M dataset was split 80/20 into training and test sets, using a leave-one-out sampling strategy to simulate real-world usage. The metrics used were Hit Rate (HR@n) and nDCG@n.
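Under leave-one-out evaluation, each user contributes a single held-out item, so both metrics are simple to compute: HR@n checks whether that item appears in the top n recommendations, and nDCG@n discounts the hit by its rank. A minimal sketch:

```python
import math

def hr_at_n(ranked: list[int], held_out: int, n: int) -> float:
    # Hit Rate: 1 if the held-out item appears in the top-n list, else 0.
    return 1.0 if held_out in ranked[:n] else 0.0

def ndcg_at_n(ranked: list[int], held_out: int, n: int) -> float:
    # With a single relevant item per user, nDCG@n reduces to 1/log2(rank + 1)
    # when the item is ranked within the top n, and 0 otherwise.
    if held_out in ranked[:n]:
        rank = ranked.index(held_out) + 1
        return 1.0 / math.log2(rank + 1)
    return 0.0

# Per-user scores are averaged over all test users to give HR@n and nDCG@n.
print(hr_at_n([50, 7, 123], held_out=7, n=3), ndcg_at_n([50, 7, 123], held_out=7, n=3))
```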

Recommendation accuracy for standard baselines and LLM-based methods. Models are grouped by family and ordered by parameter count, with bold type indicating the highest score within each group.

Here several of the large language models outperform the traditional baselines on every metric, with GPT-4o establishing a wide lead in every column, and even models such as GPT-3.5 turbo and Llama-3.1 405B consistently surpassing benchmark methods such as BPRMF and LightGCN.

Among the Llama variants, performance varied widely, though Llama-3.2 3B stood out with the best HR@1 in its group.


The authors argue that these results indicate that memorized data translates into measurable advantages in recommendation-style prompting, especially for the strongest models.

In additional observations, the researchers continue:

“While the recommendation performance appears impressive, comparing Table 2 with Table 1 reveals an interesting pattern. Within each group, the models with higher memorization demonstrate superior performance on the recommendation task.

‘For example, GPT-4o outperforms GPT-4o mini, and Llama-3.1 405B outperforms Llama-3.1 70B and 8B.

“These results highlight that evaluating LLMs on datasets leaked into their training data may lead to performance driven by memorization rather than generalization.”

Regarding the effect of model scale on this problem, the authors observed a clear correlation between size, memorization, and recommendation performance: larger models not only retained more of the MovieLens-1M dataset, but also performed more strongly on the downstream task.

For example, Llama-3.1 405B showed an average memorization rate of 12.9%, while Llama-3.1 8B retained only 5.82%. This nearly 55% drop in recall corresponded to a 54.23% decrease in nDCG and a 47.36% decrease in HR across the evaluation cutoffs.

The pattern held throughout: where memorization decreased, so did apparent performance.

‘These findings suggest that increasing the model scale leads to greater memorization of the dataset, resulting in improved performance.

“As a result, larger models offer better recommendation performance, but also present risks associated with potential leakage of training data.”

The final test examined whether memorization reflects the popularity bias baked into MovieLens-1M. Items were grouped by frequency of interaction, and the chart below shows that the larger models consistently favored the most popular entries.

Item coverage by model across three popularity tiers: the most popular top 20% of items; moderately popular items; and the least-interacted-with bottom 20%.

GPT-4o retrieved 89.06% of the most popular items but only 63.97% of the least popular ones, while GPT-4o mini and the smaller Llama models showed far lower coverage across all bands. The researchers say this pattern suggests that memorization not only grows with model size but also amplifies the existing imbalances in the training data.

They continue:

‘Our findings reveal a pronounced popularity bias in LLMs, with the top 20% of most popular items being far easier to retrieve than the bottom 20%.

“This trend highlights the influence of the training data distribution, in which popular films are overrepresented, leading to their disproportionate memorization by the model.”
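As an illustration of how such popularity tiers can be derived from ratings.dat (an assumed grouping procedure, not necessarily the paper's exact method):

```python
from collections import Counter

def popularity_tiers(ratings: list[list[str]]) -> dict[str, list[str]]:
    # Expects ratings.dat rows as [user_id, movie_id, rating, timestamp].
    # Counts interactions per movie, then splits items into the most popular
    # top 20%, the middle 60%, and the least-interacted-with bottom 20%.
    counts = Counter(movie_id for _, movie_id, _, _ in ratings)
    ranked = [movie_id for movie_id, _ in counts.most_common()]
    top_cut, bottom_cut = int(len(ranked) * 0.2), int(len(ranked) * 0.8)
    return {
        "top_20": ranked[:top_cut],
        "middle_60": ranked[top_cut:bottom_cut],
        "bottom_20": ranked[bottom_cut:],
    }

# Memorization coverage can then be reported separately within each tier.
```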

Conclusion

The dilemma is not a novel one: as training sets grow, the prospect of curating them recedes in inverse proportion. MovieLens-1M drifts into these vast corpora unsupervised, likely among many others.

The problem repeats at every scale and resists automation. Any solution demands not just effort but human judgment, of the slow and fallible kind that machines cannot supply. On this point, the new paper offers no way forward.

* A coverage metric, in this context, is the percentage of a dataset that a language model can reproduce when asked the right kind of question. A recall counts as successful if, for example, the model is prompted with a movie's ID and responds with the correct title and genre. The number of successful recalls is divided by the total number of entries in the dataset to give a coverage score: if a model correctly returns 800 out of 1,000 items, its coverage is 80%.

First released on Friday, May 16th, 2025
