Why language models are “lost” in conversation


A new paper from Microsoft Research and Salesforce finds that even the most capable large language models (LLMs) fall apart when instructions are given in stages rather than all at once. The authors found that performance drops by an average of 39% across six tasks when a prompt is split over multiple turns:

A single-turn conversation (left) obtains the best results, but is unnatural for the end user. A multi-turn conversation (right) shows even the highest-ranked and most performant LLMs losing the thread of the conversation. Source: https://arxiv.org/pdf/2505.06120

More striking still, the reliability of responses takes a nosedive, with prestigious models such as ChatGPT-4.1 and Gemini 2.5 Pro swinging between near-perfect answers and manifest failures depending on how the same task is phrased; output consistency can drop by more than half in the process.

To explore this behavior, the paper introduces a method called sharding*, which splits a fully-specified prompt into small fragments and releases them into the conversation one at a time.

In the most basic terms, this is equivalent to giving a cohesive, comprehensive single order at a restaurant, leaving the waiter with nothing to do but acknowledge the request; or else deciding to work through the order collaboratively:

Two extreme versions of a restaurant conversation (not from the new paper; for illustration only).

To be fair, the example above puts the customer in a rather negative light. But the core idea illustrated in the second column is that of a transactional exchange that clarifies the problem set before the problem is addressed, which is a perfectly reasonable and legitimate way to approach a task.

This setup is mirrored in the new work's drip-fed, sharded approach to LLM interactions. The authors note that LLMs often produce excessively long responses and then continue to rely on their own insights, even after those insights have been shown to be wrong or irrelevant. This tendency, combined with other factors, can cause the system to lose track of the exchange entirely.

In fact, the researchers note what many of us have found anecdotally: that the best way to get a conversation back on track is to start a new conversation with the LLM.

‘If a conversation with an LLM did not lead to the expected outcome, starting a new conversation that repeats the same information might yield significantly better results than continuing the ongoing conversation.

‘This is because current LLMs can get lost in the conversation, and our experiments show that persisting in a conversation with the model is ineffective. In addition, since LLMs generate text with randomness, a new conversation may lead to improved outcomes.’

The authors acknowledge that agentic systems such as Autogen and LangChain could potentially improve the results by acting as an interpretive layer between the end user and the LLM, communicating with the LLM only once they have gathered enough ‘sharded’ responses to aggregate into a single cohesive query (which the end user never sees).

However, the authors contend that a separate abstraction layer of this kind should not be necessary, or should otherwise be built directly into the source LLM:

‘An argument could be made that multi-turn capability is not a necessary feature of LLMs, as it can be offloaded to the agent framework. In other words, do we need native multi-turn support in LLMs when an agent framework can orchestrate interactions with users and leverage LLMs only as single-turn operators?’

But having tested this proposition across their array of examples, they conclude:

‘[Relying on] an agent-like framework to process information can be limiting, and we argue that LLMs should natively support multi-turn interaction.’

The new paper is titled LLMs Get Lost in Multi-Turn Conversation, and comes from four researchers at Microsoft Research and Salesforce.

Fragmented conversation

The new method first breaks down conventional single-turn instructions into smaller shards, designed to be introduced at key moments during an LLM interaction, a structure that reflects the exploratory, back-and-forth style of engagement found in systems such as ChatGPT and Google Gemini.


Each original instruction is a single, self-contained prompt that delivers the entire task in one go, combining a high-level question, supporting context, and any relevant conditions. The sharded version breaks this into multiple smaller parts, with each shard adding just one piece of information:

Paired instructions showing (a) the full prompt delivered in a single turn and (b) its sharded version, used to simulate an underspecified, multi-turn interaction. Semantically, both versions deliver the same informational payload.

The first shard always introduces the main goal of the task, while the rest provide clarifying details. Together, they deliver the same content as the original prompt, but spread naturally over several turns of the conversation.
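As a rough illustration of this structure, the sketch below contrasts a fully-specified prompt with a hand-made sharded version. The example content is invented for clarity and is not taken from the paper's data.

```python
# A minimal, hypothetical illustration of full vs. sharded framing
# (the example text is invented; it is not from the paper's benchmarks).

full_instruction = (
    "Write a Python function that returns the n-th Fibonacci number. "
    "Use an iterative approach, handle n = 0 and n = 1 as base cases, "
    "and raise a ValueError for negative inputs."
)

# The first shard states the high-level goal; each later shard adds one
# clarifying detail, so the combined payload matches the full prompt.
sharded_instruction = [
    "I need a Python function that returns the n-th Fibonacci number.",
    "It should use an iterative approach rather than recursion.",
    "Treat n = 0 and n = 1 as base cases.",
    "If n is negative, raise a ValueError.",
]
```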

Each simulated conversation unfolds between three components: the assistant, the model under evaluation; the user, a simulated agent with access to the full instruction in sharded form; and the system, which adjudicates and scores the exchange.

The conversation begins with the user revealing the first shard and the assistant responding freely. The system then classifies that response, for instance as a clarification request or a full answer attempt.

When the model does attempt an answer, a separate component extracts just the relevant span for evaluation, ignoring the surrounding text. On each new turn, the user reveals one additional shard, prompting another response. The exchange continues until the model either gets the answer right or there are no shards left to reveal.

A diagram of the sharded conversation simulation, with the evaluated model highlighted in red.
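The loop can be sketched in a few lines. The callables passed in (call_assistant, classify, extract, is_correct, pick_next_shard) are hypothetical placeholders standing in for the paper's components, not its actual implementation.

```python
# A rough sketch of the simulated-conversation loop described above.
# All callables are hypothetical stand-ins for the paper's components.

def run_sharded_conversation(shards, call_assistant, classify, extract,
                             is_correct, pick_next_shard):
    history = [{"role": "user", "content": shards[0]}]  # first shard: the main goal
    revealed = {shards[0]}

    while True:
        reply = call_assistant(history)                 # model under evaluation
        history.append({"role": "assistant", "content": reply})

        # The system classifies the reply, e.g. clarification vs. answer attempt.
        if classify(reply) == "answer_attempt":
            answer = extract(reply)                     # strip surrounding text
            if is_correct(answer):
                return "success", history

        remaining = [s for s in shards if s not in revealed]
        if not remaining:                               # nothing left to reveal
            return "failure", history

        # The user simulator decides which of the remaining shards to reveal next
        # (any rephrasing would happen before it is shown to the assistant).
        next_shard = pick_next_shard(remaining, history)
        revealed.add(next_shard)
        history.append({"role": "user", "content": next_shard})
```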

Early tests showed that models often asked about information that had not yet been shared, so the authors abandoned the idea of revealing shards in a fixed order. Instead, a simulator decided which shard to reveal next, based on how the conversation was going.

The user simulator, implemented with GPT-4o-mini, was therefore given full access to both the entire instruction and the conversation history so far, and decided on each turn which shard to reveal next, based on how the exchange was unfolding.

The user simulator also rephrased each shard as needed to maintain the flow of the conversation, without changing its meaning. This allowed the simulation to reflect the ‘give-and-take’ of real dialogue while preserving control over the task structure.

Before the conversation begins, the assistant is given only the basic information needed to complete the task, such as a database schema or an API reference. It is not told that the instructions will be broken up, nor is it guided toward any particular way of handling the conversation. This is deliberate: in real use, models are almost never told that a prompt will be incomplete or updated over time, and omitting this context helps the simulation reflect how the model behaves in a more realistic setting.

GPT-4o-mini was also used to categorize model replies and to extract the final answers from them. This kept the simulation flexible, but introduced occasional mistakes; after checking several hundred conversations by hand, however, the authors found that fewer than five percent had problems, and fewer than two percent showed a change in outcome, an error rate considered low enough within the parameters of the project.
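For a sense of what such a judge step might look like, here is a minimal sketch that uses the standard OpenAI Python client to label a reply and pull out its answer span. The prompt wording and JSON schema are invented for illustration and are not the paper's actual prompts.

```python
# A minimal sketch of a GPT-4o-mini "judge" that categorizes an assistant
# reply and extracts its answer span. Prompt wording is hypothetical.

import json
from openai import OpenAI

client = OpenAI()

def classify_and_extract(assistant_reply: str) -> dict:
    """Return {'label': 'clarification' | 'answer_attempt', 'answer': str | None}."""
    prompt = (
        "You will be shown a reply from an AI assistant.\n"
        "1. Label it as 'clarification' or 'answer_attempt'.\n"
        "2. If it is an answer attempt, copy out only the final answer span; "
        "otherwise use null.\n"
        "Respond as JSON with keys 'label' and 'answer'.\n\n"
        f"Reply:\n{assistant_reply}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```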

Simulation scenarios

The authors tested model behavior under different conditions using five types of simulation.

In the Full setting, the model receives the entire instruction in a single turn. This represents the standard benchmark format and serves as the performance baseline.

The Sharded setting breaks the instruction into multiple pieces and delivers them one at a time, simulating a more realistic, underspecified conversation. This is the main setting used to test how well models handle multi-turn input.

In the Concat setting, the shards are stitched back together as a single list, preserving the wording but removing the turn-by-turn structure. This helps to separate the effect of conversational fragmentation from any rephrasing or loss of content.


The Recap setting runs like Sharded, but adds a final turn in which all previous shards are restated before the model gives its final answer. This tests whether a summarizing prompt helps recover lost context.

Finally, the Snowball setting goes further, repeating all previous shards on every turn, keeping the full instruction visible as the conversation unfolds and providing a more forgiving test of multi-turn ability.

Simulation types based on a sharded instruction. A fully-specified prompt can be used to simulate single-turn (Full, Concat) or multi-turn (Sharded, Recap, Snowball) conversations, depending on how quickly the information is released.
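As a rough illustration of how these variants relate to one another, the sketch below builds the Concat, Recap, and Snowball inputs from a list of shards. This is a simplified reading of the setup; the paper's own construction and wrapper wording may differ in detail.

```python
from typing import List

def make_concat(shards: List[str]) -> str:
    # Single turn: the shards' wording is preserved, the turn structure is removed.
    return "Here is everything you need:\n" + "\n".join(f"- {s}" for s in shards)

def make_recap_turns(shards: List[str]) -> List[str]:
    # Sharded turns, plus a final turn restating every shard before the answer.
    recap = "To recap, here is everything so far:\n" + "\n".join(f"- {s}" for s in shards)
    return list(shards) + [recap]

def make_snowball_turns(shards: List[str]) -> List[str]:
    # Each turn repeats all shards revealed so far, keeping the full context visible.
    return ["\n".join(shards[: i + 1]) for i in range(len(shards))]
```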

Tasks and Metrics

Six generation tasks were chosen to cover both programming and natural language domains: code generation prompts were taken from HumanEval and LiveCodeBench; text-to-SQL queries were sourced from Spider; API calls were constructed using data from the Berkeley Function Calling Leaderboard; elementary math problems came from GSM8K; tabular captioning tasks were based on ToTTo; and multi-document summarization was drawn from the Summary of a Haystack dataset.

Model performance was measured using three core metrics: average performance, aptitude, and reliability.

Average performance captured how well the model did overall across multiple attempts; aptitude reflected the best results a model could reach, based on its top-scoring outputs; and reliability measured how much those outcomes varied, with a large gap between best and worst results indicating less stable behavior.

All scores were placed on a 0-100 scale to ensure consistency across tasks, and the metrics were computed per instruction and then averaged to give an overall picture of model performance.
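To make the relationship between the three metrics concrete, here is a small sketch computing them from a set of per-run scores for one instruction. The paper uses percentile-based estimators; this version simply uses the best and worst runs, so it is an approximation of the idea rather than the authors' exact definitions.

```python
from statistics import mean
from typing import Dict, List

def instruction_metrics(scores: List[float]) -> Dict[str, float]:
    """scores: one 0-100 score per simulated conversation for a single instruction."""
    return {
        "average_performance": mean(scores),         # overall level across attempts
        "aptitude": max(scores),                      # best result the model reached
        "unreliability_gap": max(scores) - min(scores),  # spread between best and worst runs
    }

# Example: ten simulated runs of one instruction.
print(instruction_metrics([92, 88, 45, 90, 30, 85, 60, 95, 40, 78]))
```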

The six sharded tasks used in the experiments, covering both programming and natural language generation. Each task is shown with a fully-specified instruction and its sharded version. Between 90 and 120 instructions per task were adapted from established benchmarks.

Candidates and tests

For the main simulation (at an estimated cost of $5,000), 600 instructions spanning the six tasks were sharded and used to simulate three conversation types: Full, Concat, and Sharded. Ten conversations were run for each combination of model, instruction, and simulation type, producing more than 200,000 simulations in total, a scale that allowed both overall performance and the deeper measures of aptitude and reliability to be captured.

Fifteen models were tested, spanning a wide range of providers and architectures, including the OpenAI models GPT-4o (version 2024-11-20), GPT-4o-mini (2024-07-18), GPT-4.1 (2025-04-14), and the reasoning model o3 (2025-04-16).

The Anthropic models were Claude 3 Haiku (2024-03-07) and Claude 3.7 Sonnet (2025-02-19), accessed via Amazon Bedrock.

Google contributed Gemini 2.5 Flash (preview-04-17) and Gemini 2.5 Pro (preview-03-25). The Meta models were Llama 3.1-8B-Instruct, Llama 3.3-70B-Instruct, and Llama 4 Scout-17B-16E, accessed via Together AI.

Other entries were OLMo 2 13B, Phi-4, and Command-A, accessed either locally via Ollama or through the Cohere API; DeepSeek-R1 was accessed via Amazon Bedrock.

For the two ‘thinking’ models (o3 and DeepSeek-R1), the token limit was raised to 10,000 to accommodate longer reasoning chains.

Average performance scores for each model across the six tasks: code, database, actions, data-to-text, math, and summarization. Results are shown for the three simulation types: Full, Concat, and Sharded. Models are ordered by their average Full-setting score. Shading reflects the degree of degradation from the Full setting, with the last two columns reporting the average drop of Concat and Sharded relative to Full.

Of these results, the authors state:

‘At a high level, every model sees its performance degrade on every task when Full and Sharded performance are compared, with an average degradation of 39%. We name this phenomenon Lost in Conversation: models that achieve stellar (90%+) performance in the lab-like setting of fully-specified, single-turn conversation struggle on the exact same tasks in a more realistic setting, when the conversation is underspecified and multi-turn.

Concat scores averaged 95% of Full, indicating that the degradation seen in the Sharded setting cannot be explained by loss of information alone. Smaller models such as Llama 3.1-8B-Instruct, OLMo 2 13B, and Claude 3 Haiku showed more pronounced degradation under Concat, suggesting that smaller models are generally less robust to rephrasing than larger ones.


The authors observe:

‘Surprisingly, more performant models (Claude 3.7 Sonnet, Gemini 2.5, GPT-4.1) get lost in conversation at a similar rate to smaller models (Llama 3.1-8B-Instruct, Phi-4), with average degradations of 30-40%. This is in part due to metric definitions: since smaller models achieve lower absolute scores in Full, they have less scope for degradation than the better models.

‘In short, no matter how strong an LLM’s single-turn performance, we observe significant performance degradation in the multi-turn setting.’

Initial testing also showed that some models held up better on particular tasks: Command-A on Actions, Claude 3.7 Sonnet and GPT-4.1 on Code, and Gemini 2.5 Pro on Data-to-Text, indicating that multi-turn ability varies by domain. Reasoning models such as o3 and DeepSeek-R1 fared no better overall, perhaps because their longer replies introduced more assumptions, which tended to confuse the conversation.

Reliability

The clear relationship between aptitude and reliability seen in single-turn simulations appeared to fall apart under multi-turn conditions: aptitude fell only modestly, but unreliability roughly doubled on average. Models that were stable on full-format prompts, such as GPT-4.1 and Gemini 2.5 Pro, became just as erratic as weaker models such as Llama 3.1-8B-Instruct and OLMo 2 13B once the instructions were fragmented.

An overview of aptitude and unreliability shown as a box plot (a); unreliability results from experiments with the fifteen models (b); and results of a gradual sharding test in which instructions were split into one to eight shards (c).

Model responses often varied by as much as 50 points on the same task, even when nothing new was added, suggesting that the drop in performance is not due to a lack of skill, but to the model becoming increasingly unstable across turns.

The paper states:

‘[Though] better models tend to have slightly higher multi-turn aptitude, all models tend to have similar levels of unreliability. In other words, in multi-turn, underspecified settings, all models we test exhibit very high unreliability, with performance degrading by 50 percentage points on average between the best and worst simulated runs for a fixed instruction.’

To test whether the degradation was tied to the number of turns, the authors ran a gradual sharding experiment, splitting each instruction into between one and eight shards (see the right-most column in the image above).

They found that unreliability rose as the number of shards increased, confirming that even a modest rise in turn count makes models more unstable. Aptitude remained almost unchanged, reinforcing that the problem lies in consistency, not capability.

Temperature control

A separate set of experiments tested whether the unreliability was simply a by-product of sampling randomness. To do this, the authors varied the temperature settings of both the assistant and the user simulator across three values: 1.0, 0.5, and 0.0.
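In outline, this amounts to a small grid sweep over the two temperature knobs. The sketch below shows the idea; run_simulation is a hypothetical stand-in for the conversation loop sketched earlier, not the paper's code.

```python
from itertools import product

# Temperature values reported in the article for both the assistant
# and the user simulator.
TEMPERATURES = (1.0, 0.5, 0.0)

def temperature_sweep(run_simulation):
    """Run the (hypothetical) simulation for every temperature combination."""
    results = {}
    for assistant_t, user_t in product(TEMPERATURES, TEMPERATURES):
        results[(assistant_t, user_t)] = run_simulation(
            assistant_temperature=assistant_t,
            user_temperature=user_t,
        )
    return results
```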

In single-turn formats such as Full and Concat, lowering the assistant’s temperature noticeably improved reliability, reducing variation by as much as 80 percent; in the Sharded setting, however, the same intervention had little effect:

Unreliability scores for different combinations of assistant and user temperature across the Full, Concat, and Sharded settings, with lower values indicating greater response consistency.

Even with both the assistant and the user set to a temperature of zero, unreliability remained high, with GPT-4o showing variation of around 30 percent, suggesting that the instability seen in multi-turn conversations is not merely stochastic noise, but a structural weakness in how models handle fragmented input.

Implications

The authors write about the implications of their findings at unusual length in the paper’s conclusion, arguing that strong single-turn performance does not guarantee multi-turn reliability, and cautioning against over-reliance on fully-specified benchmarks when evaluating real-world readiness (since such benchmarks mask the instability that emerges in more natural, fragmented interactions).

They also suggest that the unreliability is not merely a sampling artifact, but a fundamental limitation in how current models process evolving input, and that this raises concerns for agent frameworks, which depend on sustained reasoning across turns.

Finally, they argue that multi-turn ability should be treated as a core capability of LLMs, rather than offloaded to external systems.

The authors also note that their results likely underestimate the true scale of the problem, drawing attention to the ideal conditions of their test setup: the user simulator had full access to the instruction and could reveal the shards in an optimal order.

In addition, the assistant was evaluated immediately after each turn, before the full conversation had unfolded, so any later confusion or self-contradiction, which would otherwise have dragged performance down further, went unpunished. These choices were necessary for experimental control, but they imply that the reliability gaps observed in the wild are likely to be even greater than those reported.

They conclude:

‘[We] believe the conducted simulations represent a benign testing ground for LLM multi-turn capabilities. Because of the overly simplified conditions of simulation, we believe the degradation observed in experiments is most likely an underestimate of LLM unreliability, and of how frequently LLMs get lost in conversation in real-world settings.’

Conclusion

Anyone who has spent much time with an LLM will likely recognize the problems formulated here from lived experience; and most of us, I imagine, have intuitively abandoned ‘lost’ LLM conversations for fresh ones, in the hope that the LLM may ‘start over’ rather than stay fixated on material that came up in a long, winding, and increasingly infuriating exchange.

It is interesting to note that throwing more context at the problem will not necessarily solve it; and to observe, indeed, that the paper raises more questions than it answers (except in terms of how to sidestep the issue).

* Confusingly, this has nothing to do with the traditional meaning of “shards” in AI.

The authors’ own bold emphasis.

First published on Monday, May 12, 2025
