Why LLMs overthink simple puzzles but give up on hard ones

8 Min Read

Artificial intelligence has made incredible advances with large language models (LLMs) and their more advanced counterparts, large reasoning models (LRMs), redefining how machines process and generate human-like text. These models can write essays, answer questions, and even solve mathematical problems. Yet despite their impressive capabilities, they show a strange pattern of behavior: they often overcomplicate simple problems while failing on genuinely hard ones. Recent work by Apple researchers provides valuable insight into this phenomenon. In this article, we explore why LLMs and LRMs behave this way and what it means for the future of AI.

Understanding LLMs and LRMs

To understand why LLMs and LRMs behave this way, we first need to clarify what these models are. LLMs such as GPT-3 and BERT are trained on vast datasets of text to predict missing or upcoming words in a sequence. This makes them strong at tasks like text generation, translation, and summarization, but they are not inherently designed for logical deduction or problem-solving.

LRMs are a newer class of models designed to address this gap. They incorporate techniques such as Chain-of-Thought (CoT) prompting, where the model generates intermediate reasoning steps before providing the final answer. For example, when solving a mathematical problem, an LRM can break it down into steps, much as a human would, as the sketch below illustrates. This approach improves performance on complex tasks, but as Apple’s research reveals, it runs into trouble as problem complexity varies.
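
To make the idea concrete, here is a minimal sketch of the difference between direct prompting and chain-of-thought prompting. The helper names and the sample question are illustrative placeholders, not the prompts used in Apple’s study; you would pass the resulting strings to whatever LLM client you use.

```python
# Minimal illustration of direct vs. chain-of-thought (CoT) prompting.
# The helper names and the sample question are hypothetical, not from the study.

def direct_prompt(question: str) -> str:
    """Ask only for the final answer."""
    return f"{question}\nAnswer with a single number."

def cot_prompt(question: str) -> str:
    """Ask the model to write out intermediate steps before answering."""
    return (
        f"{question}\n"
        "Think through the problem step by step, "
        "then give the final answer on the last line."
    )

question = "A farmer has 17 sheep. All but 9 run away. How many are left?"
print(direct_prompt(question))
print("---")
print(cot_prompt(question))
```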

Research

Apple’s research team took a different approach to assessing the reasoning capabilities of LLMs and LRMs. Instead of relying on traditional benchmarks such as math and coding tests, which can be affected by data contamination (models remembering answers seen during training), they created controlled puzzle environments. These included well-known puzzles such as the Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. For example, the Tower of Hanoi involves moving disks between pegs according to certain rules, with complexity increasing as more disks are added. By systematically adjusting the complexity of these puzzles while maintaining a consistent logical structure, the researchers could observe how the models perform across a range of difficulties. This method let them analyze not only the final answers but also the reasoning process, offering a deeper look at how these models “think.”
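
To get a sense of how sharply this kind of puzzle scales, note that the minimal Tower of Hanoi solution requires 2^n - 1 moves for n disks, so each added disk roughly doubles the work. The short solver below is a generic textbook recursion, not the paper’s code; it simply makes that growth easy to verify.

```python
# Classic recursive Tower of Hanoi solver; the minimal solution has 2**n - 1 moves.

def hanoi_moves(n: int, source: str = "A", target: str = "C", spare: str = "B"):
    """Return the minimal sequence of (from_peg, to_peg) moves for n disks."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # move n-1 disks out of the way
        + [(source, target)]                         # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # move n-1 disks back on top
    )

for n in range(1, 11):
    print(f"{n} disks -> {len(hanoi_moves(n))} moves")  # 1, 3, 7, ..., 1023
```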

Findings on overthinking and giving up

The study identified three distinct performance regimes based on problem complexity:

  • At low complexity levels, standard LLMs often outperform LRMs, because LRMs tend to overthink and produce unnecessary extra steps.
  • At medium complexity, LRMs show superior performance, thanks to their ability to generate detailed reasoning traces that help them work through multi-step problems.
  • At high complexity, both LLMs and LRMs fail completely. Notably, LRMs reduce their reasoning effort as difficulty increases, even as accuracy collapses.

For simple puzzles, such as the Tower of Hanoi with one or two disks, standard LLMs were more efficient at producing the correct answer. LRMs, by contrast, often overcomplicated these problems, generating long reasoning traces even when the solution was straightforward. This suggests that LRMs may be mimicking the elaborate explanations found in their training data, which can lead to inefficiency.

In moderately complex scenarios, LRM performance improved. Their ability to produce detailed reasoning steps let them tackle problems requiring several logical moves, and they outperformed standard LLMs, which struggled to stay coherent.

For very complex puzzles, however, such as the Tower of Hanoi with many disks, both model types failed completely. Surprisingly, LRMs reduced their reasoning effort once complexity rose beyond a certain point, despite having sufficient computational budget. This “giving up” behavior points to a fundamental limitation in their ability to scale reasoning.
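
One way to observe this pattern in practice is to sweep the puzzle size and log both correctness and the length of the model’s reasoning trace. The harness below is a rough sketch, not the paper’s evaluation code: `solve_with_reasoning` and `check_solution` are hypothetical callables you would supply, and reasoning length (approximated here by a word count) serves as a crude proxy for inference effort.

```python
# Hypothetical harness for tracking accuracy vs. reasoning effort as puzzle size grows.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Result:
    disks: int
    correct: bool
    reasoning_words: int  # crude proxy for inference effort

def sweep_hanoi(
    solve_with_reasoning: Callable[[int], Tuple[list, str]],  # returns (moves, reasoning text)
    check_solution: Callable[[int, list], bool],
    max_disks: int = 12,
) -> List[Result]:
    """Evaluate a model on Tower of Hanoi instances of increasing size."""
    results = []
    for n in range(1, max_disks + 1):
        moves, reasoning = solve_with_reasoning(n)  # your model call goes here
        results.append(
            Result(
                disks=n,
                correct=check_solution(n, moves),
                reasoning_words=len(reasoning.split()),
            )
        )
    return results
```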

Why does this happen?

The overthinking of simple puzzles can be traced to how LLMs and LRMs are trained. These models learn from vast datasets that include both concise and elaborate explanations. For easy questions, they may default to generating verbose reasoning traces, mimicking the lengthy examples in their training data even when a direct answer would suffice. This behavior is not necessarily a defect; it reflects training that prioritizes reasoning over efficiency.

The failure on complex puzzles reflects the inability of LLMs and LRMs to generalize the logical rules they have learned. As problem complexity grows, their reliance on pattern matching breaks down, leading to inconsistent reasoning and a collapse in performance. The study found that LRMs fail to use explicit algorithms consistently and reason inconsistently across different puzzles. This underscores that while these models can simulate reasoning, they do not truly understand the underlying logic the way humans do.

Differing perspectives

The research has sparked debate within the AI community. Some experts argue that these findings could be misinterpreted: LLMs and LRMs may not reason the way humans do, yet they still solve problems effectively within certain complexity limits. They stress that AI “reasoning” does not need to mirror human cognition to be valuable. Similarly, discussions on platforms such as Hacker News have praised the study’s rigorous approach while emphasizing the need for further work on improving AI reasoning. These perspectives highlight the ongoing debate over what counts as reasoning in AI and how to evaluate it.

Implications and future directions

These findings carry significant implications for AI development. While LRMs represent progress toward mimicking human reasoning, their limitations on complex problems and their failure to scale reasoning effort suggest that current models are far from achieving generalizable reasoning. This highlights the need for evaluation methods that consider not only the accuracy of the final answer but also the quality and adaptability of the reasoning process.

Future research should aim to improve models’ ability to execute logical steps accurately and to calibrate reasoning effort to problem complexity. Developing benchmarks that reflect real-world reasoning tasks, such as medical diagnosis or legal argumentation, could provide more meaningful insight into AI capabilities. Furthermore, reducing models’ overreliance on pattern recognition and strengthening their ability to generalize logical rules will be crucial to advancing AI reasoning.

Conclusion

The study offers a critical look at the reasoning abilities of LLMs and LRMs. These models overanalyze simple puzzles yet struggle with more complex ones, exposing both their strengths and their limitations. They work well in certain situations, but their inability to handle highly complex problems highlights the gap between simulated reasoning and true understanding. The research underscores the need to develop AI systems that can reason adaptively across levels of complexity, much as humans do.
