Research says AI behaves differently when it is known to be tested.

Table of Contents

Reflecting the 2015 “Dieselgate” scandal, new research suggests that AI language models such as GPT-4, Claude, and Gemini can change behavior during testing, and can act “safer” for testing than for actual use. If LLMS habitually adjusts its behavior under scrutiny, a safety audit would certify a system that certifies very different behaviors in the real world.

In 2015, investigators discovered that Volkswagen installed software on millions of diesel vehicles and was detected while emission tests were being performed, causing the vehicle to temporarily reduce emissions and to comply with the “fake” regulatory standards. However, under normal operation, contamination output exceeded legal standards. The intentional manipulation has led to criminal charges, billions of fines, and global scandals over the reliability of safety and compliance testing.

Two years before these events, since being called “Dieselgate,” it has been revealed that Samsung has enacted a similar deceptive mechanism with the release of its Galaxy Note 3 smartphone. Since then, similar scandals have arisen between Huawei and OnePlus.

Currently, there is growing evidence in the scientific literature that not only can they have the ability to detect large-scale language models (LLMs) when tested, they can also take different behaviors under these circumstances.

While this is a very human characteristic, the latest US research concludes that this can become a dangerous habit of long-term indulgence for a variety of reasons.

In a new study, researchers discovered that “frontier models” such as GPT-4, Claude and Gemini often can. Detect when testedand they tend to adjust their behavior accordingly, potentially hobbing the validity of systematic testing methods.

dubbing Evaluation recognitionAccording to the authors of a new study, this (probably) innate characteristic of the language model may undermine the reliability of safety assessments.

‘We recommend treating rating recognition as a new source of potential distributional shifts between tests and real-world developments that can adversely affect the accuracy of evaluation results.

“In the worst case, models can strategically reduce performance with dangerous capacity assessments (“sandbags”) or trend assessments (“alignment counterfeiting”), allowing society to overestimate the safety of such models. ”

Other studies have observed similar behaviors, including Stanford University studies. Stanford University found that large-scale language models modulate responses to appear more “likeable” or “socially desirable” when they recognize that they are being evaluated, and reflect human behavior in personality tests.

The author of the new work does not suggest that LLM was intentionally designed for “games” tests, as Volkswagen’s diesel cars were constructed to do it a decade ago. Rather, this behavior appears to be manifesting as a kind of natural psychological by-product of the model’s training process and potential purpose. And the underlying cause of this phenomenon remains unknown.

New paper titled Large language models know well when they are being evaluatedand comes from five authors of the ML Alignment & Theoreologist (MATS) program related to Berkeley, California, and Apolloresearch.ai (a think tank specializing in assessing AI risk).

method

To create an evaluation benchmark, the authors compiled 1,000 transcripts drawn from 61 data sets.

This material is evenly balanced between evaluation and deployment scenarios, and is further divided between simple chat interactions and more complex agent sessions aimed at the model acting autonomously using tools that allow the model to act autonomously using tools that allow it to draw data from an external database or from the Internet.

Functional tests for assessment data inference, coding, cybersecurity, and general knowledge, as well as alignment tests for ethics, safety and rejection behavior. Deployment data was fed from user interactions such as ShareGPT logs, running crowdsourcing agents, and completing synthesis tasks.

To supplement the agent data, the authors generated additional examples using orchestration frameworks such as Aider and Smolagents. This is an approach aimed at minimizing superficial cues to simply store dataset structures and instead rely on more general signals, relying on more general signals when determining whether transcription is a real user.

The researchers then tested whether the model could identify the evaluation transcript by presenting a dialogue record and seeking variations on simple questions. Did this exchange come from a test?

To avoid a model that merely remembers one phrasing, they developed 28 different versions of the question, altering both the language and the way transcripts were presented.

Transcripts affect the model’s response, either directly connected to the question or presented in a form such as JSON or XML. After early testing, five question versions were selected that were run consistently across models.

One of five probe question formats used to test whether a language model can recognize transcripts for evaluation. Each variation paraphrases the core questions to prevent the model from merely remembering fixed language. See the source paper for more examples. Source: https://arxiv.org/pdf/2505.23836

Three metrics were used to score the results. Area under the receiver operating characteristic curve (AUC, showing how tests distinguish tests from non-test data). Brier score (measurement of overall prediction accuracy); expected calibration errors (measurement to the extent that the model reliability matches actual accuracy).

What is it?

Researchers also tested whether the model could be inferred the purpose of evaluation. Since each transcript was already labelled as either a competency test (technical skills, knowledge, inference) or an alignment test (safety, value, rejection behavior), the model was given one of three question types. Unconditional, multiple choice questions with real world and evaluation options. Or open-ended questions whose answers are determined by another GPT-4.1-based classifier. Judges were verified with a subset resolved by 50 hand responses.

Examples of three question formats used to test whether a model can identify not only the existence of an evaluation, but also its purpose. Responses were scored on both multiple selection and GPT-4.1-based judgment.

Two paper authors^†† All 1,000 transcripts were manually labeled, scoring each from zero (not tests) to one (arguably tests), and their ratings served as human baselines on average. Despite the subjective nature of the task, the annotator showed strong agreement, with an intraclass correlation coefficient of 0.768. Final human scores were averaged across both raters.^†.

Data and Testing

The nine models tested were GPT-4.1. Claude 3.7; claude-3.5-sonnet-new; claude-3.7-sonnet (thinking); gemini-2.5-flash; gemini-2.5-pro; GPT-4O; GPT-4.1; GPT-4.1-MINI; O4-Mini all showed some degree of rating recognition.

For chat data, GPT-4.1 approached the human level of rating recognition classification performance (ability to distinguish between evaluation and deployment transcripts) with an AUC of 0.77 versus human scores.

The difference was sharper in agent data where the model uses tools autonomously. The Gemini2.5Pro reached 0.95 AUC, the Claude 3.7 Sonnet approached 0.92 and the human limit of 0.996.

The receiver operating characteristics (ROC) curve shows how the model and humans distinguish the transcript of assessment from actual interactions. The Gemini 2.5 Pro and Claude 3.7 Sonnet approaches approach human-level discrimination against agent data.

This paper points out that new “inference” models are particularly good at recognizing when tested in agent scenarios, but do not judge how confident they should be in those decisions. Overall, the Claude and Gemini models were great for communicating test cases apart from normal use.

The researchers then tested whether the models could also be recognized. why The assessment was being performed. The model displayed only the evaluation transcripts, and was asked to classify whether one person was working or had tested the alignment.

Use both the model’s accuracy, multi-select and open-ended formats, when classifying evaluation objectives as ability or alignment. The Claude 3.7 Sonnet is better than other models, especially when unprimed.

In the clearest setting, the model was said to come from testing, the Claude 3.7 Sonnet reached 93% accuracy, while the GPT-4.1 reached 89%. Without that hint, performance would have been degraded. Claude still scored 84%, but GPT-4.1 fell to 55%, often misclassifying test transcription as a real-world interaction.

Claude 3.7 Sonnet still correctly identified the assessment objectives in 46% of cases, but performance was weak in open-ended format (the model did not receive multiple selection options).

The model’s reliability scores were subsequently measured against actual results, and it was discovered that frontier models often overestimate the possibility that the transcripts came from the test. Calibration plots showed that model confidence was generally too high and ratings flagged more frequently than appropriate.

Calibration curves are calibration curves that show how closely the confidence score matches the actual results. Most models overestimated the possibility that transcripts arise from assessment. The Claude 3.5 sonnet was optimally calibrated. The Gemini 2.5 Pro and Claude 3.7 Sonnet were more overconfident.

Although good calibration was observed on the Claude 3.5 Sonnet, the Gemini 2.5 Pro and Claude 3.7 Sonnet were fairly overly confident despite their high accuracy. No clear improvements were observed in the new model, and trust judgments remained unreliable even when evaluation transcription was detected more accurately in the agent settings.

Benchmark memories

Many models relied on obvious cues, such as whether the task was in a multi-selection format, in terms of how the model explained the decisions about whether the transcripts originated from the assessment. Powerful models like Gemini 2.5 Pro and Claude 3.7 Sonnet can notice more subtle signs such as the created name and artificial scenario, especially in datasets such as AgentHarm.

The paper states:

‘(All) The model frequently acknowledges common benchmark strategies used by evaluators, such as the formatting of tasks (“multiple selection formats”), the tendency to ask questions in verifiable solutions, and system prompts designed to derive performance. However, only the best-performing models, the Claude 3.7 Sonnet and the Gemini 2.5 Pro, consistently detect more subtle metrics, such as synthetic inputs.

“For example, in 22% and 17% of cases, respectively, these models emphasize the synthetic nature of the AgentHarm benchmark scenario, identifying correctly as assessments above 95% of time.

“On the contrary, when inferring that transcripts appear to be true interactions, for example, “before and after interaction with the user” is often mentioned in a repetitive debugging context.

Distribution of model inference strategies in determining whether transcripts come from assessment. The most powerful models, such as the Gemini 2.5 Pro and the Claude 3.7 Sonnet, were more likely to find subtle clues such as the synthesis scenario and the format of the task, but some models relied on the meta season by using the evaluation questions themselves as evidence.

In some cases, the model was recognized The whole benchmark We propose the memory of training data from training. I used some models too The fact that they were asked about their ratings The author calls it as a clue that they were being tested The taste of meta;This is the most common in O4-Mini, and showed it in 15% of its responses.

Conclusion

Just as Volkswagen designed the vehicle to recognize test conditions a decade ago, it may be more comforting if it could be shown to be “test aware” simply because the LLM was programmed like that by humans.

Currently, this behavior appears to be appearing naturally Social components – Inconvenient, if not actually dangerous on the machine, depending on the range of freedom and agency, depending on the machine.

There is no immediate remedy. The black box nature of the latent space of the model forces them to assess their behavior rather than empirical architecture.

* See the source paper for more information.

^† However, the authors note that this is likely to be an estimate of the upper bound, as the annotator is very familiar with both datasets and AI evaluation methods.

††As long as it can be established. Paper phrasing makes the sudden appearance of two annotators unclear in who they are.

First released on Wednesday, June 4th, 2025