A new paper from researchers in China and Spain finds that even advanced multimodal AI models such as GPT-4.1 struggle to tell the time from images of analog clocks. Small visual changes to a clock can cause major interpretation errors, and fine-tuning helps only with familiar examples. The results raise concerns about the reliability of these models when faced with unfamiliar images in real-world tasks.
A truly deep understanding of a domain such as gravity, or of other basic physical principles, lets us grasp the fundamental abstraction behind specific examples. This allows us to apply that knowledge creatively across contexts and to recognize new instances we have never seen before, because we have identified the underlying principle rather than memorized its surface forms.
When a domain matters enough, we may even begin to perceive it where it does not exist, as in pareidolia, driven by the high cost of failing to recognize a real instance. So powerful is this pattern-recognizing survival mechanism that it disposes us to find patterns where there are none.
The earlier and more repeatedly a domain is instilled in us, the deeper its grounding and the longer it persists across a lifetime. One of the earliest visual datasets we are exposed to as children comes in the form of teaching clocks, used to show us how to tell the time through printed materials or interactive analog clock faces.
Teaching aids help children learn to tell the time. Source: https://www.youtube.com/watch?v=ibbqxbhsnus
Though changing fashions in watch design may sometimes challenge us, the resilience of this early domain mastery is impressive, allowing us to identify analog clock faces even in the face of complex or “quirky” design choices.
Some challenging watch faces from couture watch design. Source: https://www.ablogtowatch.com/wait-a-minute-legibility-is-the-most-import-of-watch-design/
Humans do not need thousands of examples to learn how clocks work; once we grasp the basic concept, we can recognize it in almost any form, however distorted or abstracted.
By contrast, the difficulty AI models have with this task points to a deeper problem: their apparent skill may depend more on sheer exposure than on understanding.
Beyond imitation games?
This tension between surface-level performance and genuine “understanding” has surfaced repeatedly in recent studies of large models. Last month, researchers from Zhejiang University and Westlake University re-framed the question in a paper titled Do PhD-level LLMs Truly Grasp Elementary Addition? (not the focus of this article), which concluded:
‘Despite impressive benchmark scores, the models show a critical reliance on pattern matching rather than true understanding, evidenced by failures involving symbolic representations and violations of basic properties.
‘The decisive performance failure under explicit rules suggests inherent architectural constraints. These insights reveal gaps in current evaluation and highlight the need for architectures capable of genuine mathematical reasoning beyond pattern recognition.’
This week the problem resurfaced in a collaboration between Nanjing University of Aeronautics and Astronautics in China and the Universidad Politécnica de Madrid in Spain. Titled Have Multimodal Large Language Models (MLLMs) Really Learned to Tell the Time on Analog Clocks?, the new paper explores how well multimodal models actually understand time-telling.
Though only the broad strokes of the research can be covered here, the researchers’ initial tests established that OpenAI’s GPT-4.1 multimodal language model struggles to read the time correctly from a diverse set of clock images, often giving wrong answers even in simple cases.
This points to possible gaps in the model’s training data, raising the need for a more balanced dataset in order to test whether the model can actually learn the underlying concept. The authors therefore curated a synthetic dataset of analog clocks covering every possible time evenly, avoiding the biases typically found in internet images:
An example from the researchers’ synthetic analog clock dataset, used to fine-tune GPT models in the new work. Source: https://huggingface.co/datasets/magonsa/analog_watches_finetune
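The paper itself does not reproduce its generation code, but the core idea, times sampled uniformly and rendered onto plain clock faces, is straightforward to sketch. The drawing style, function names, and image dimensions below are illustrative assumptions, not the authors’ actual pipeline:

```python
import math, random
from PIL import Image, ImageDraw

def hand_angles(h, m, s):
    """Angles in degrees, clockwise from 12 o'clock, for hour/minute/second hands."""
    return ((h % 12) * 30 + m * 0.5 + s / 120,
            m * 6 + s * 0.1,
            s * 6)

def draw_clock(h, m, s, size=256):
    """Render a minimal analog clock face showing the given time."""
    img = Image.new("RGB", (size, size), "white")
    d = ImageDraw.Draw(img)
    c, r = size // 2, size // 2 - 8
    d.ellipse([c - r, c - r, c + r, c + r], outline="black", width=3)
    # Hand lengths and widths: short/thick hour hand through long/thin second hand
    for ang, length, width in zip(hand_angles(h, m, s),
                                  (0.5 * r, 0.8 * r, 0.9 * r),
                                  (6, 4, 2)):
        rad = math.radians(ang - 90)  # shift so 0 degrees points at 12
        d.line([c, c, c + length * math.cos(rad), c + length * math.sin(rad)],
               fill="black", width=width)
    return img

# Sampling uniformly over all 43,200 distinct 12-hour positions avoids the
# "ten past ten" bias of scraped web imagery.
h, m, s = random.randrange(12), random.randrange(60), random.randrange(60)
draw_clock(h, m, s).save(f"clock_{h:02d}_{m:02d}_{s:02d}.png")
```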
GPT-4.1 was unable to read these clocks consistently before being fine-tuned on the new dataset. After some exposure to the collection, its performance improved, but only when the new images resembled ones it had already seen.
When the shape of the clock or the style of its hands changed, accuracy dropped sharply. Even small modifications, such as thinner hands tipped with arrowheads (far right in the image below), were enough to throw it off, and GPT-4.1 struggled even more with Dali-style “melted” clocks:
Clock images with a standard design (left), a distorted shape (middle), and modified hands (right), together with the times returned by GPT-4.1 before and after fine-tuning. Source: https://arxiv.org/pdf/2505.10862
The authors speculate that current models such as GPT-4.1 may read clocks primarily through visual pattern matching rather than through any deeper concept of time, arguing:
‘[GPT-4.1] fails when the clock is deformed or when the hands are changed to be thinner and to carry arrowheads. The Mean Absolute Error (MAE) of the time estimates over 150 random times was 232.48 seconds for the initial clocks, 1380.69 seconds when the shape is deformed, and 3726.93 seconds when the hands are changed.
‘These results suggest that the MLLM had not learned to tell the time, but had instead memorized patterns.’
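The MAE figures quoted above are in seconds. Below is a minimal sketch of one plausible way to compute such an error, with the caveat that the 12-hour wrap-around is my assumption; the paper may simply use plain absolute differences:

```python
def dial_seconds(h, m, s):
    """Position on a 12-hour dial, expressed in seconds (0..43199)."""
    return (h % 12) * 3600 + m * 60 + s

def dial_error(pred, true):
    """Smallest separation between two dial positions, in seconds.
    Wrapping around 12 o'clock is an assumption of this sketch."""
    d = abs(dial_seconds(*pred) - dial_seconds(*true)) % 43200
    return min(d, 43200 - d)

def mae(pairs):
    """Mean absolute error over (predicted, actual) time pairs."""
    return sum(dial_error(p, t) for p, t in pairs) / len(pairs)

# Reading 10:09:30 as 10:13:20 is an error of 230 seconds:
print(dial_error((10, 13, 20), (10, 9, 30)))  # -> 230
```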
Enough time
Most training datasets rely on scraped web images, which tend to repeat particular times, especially the 10:10 setting popular in watch advertisements:
From the new paper, an example of the prevalence of “ten past ten” times in analog clock images.
Drawing from this limited range, a model may see only a narrow band of possible clock configurations, limiting its ability to generalize beyond those repeated patterns.
As to why models fail to interpret distorted clocks correctly, the paper offers:
‘GPT-4.1 works extremely well with standard clock images. So it is surprising that modifying the hands, by making them thinner and adding arrowheads, significantly reduces its accuracy.
‘Intuitively, one might expect the more visually complex change (the distorted dial) to have the greater impact on performance, yet that change seems to have a comparatively small effect.
‘This raises the question: how do MLLMs interpret clocks, and why do they fail? One possibility is that the thinner hands impair the model’s ability to perceive direction, weakening its grasp of spatial orientation.
‘Alternatively, there could be other factors that confuse the model when it tries to combine the hour, minute, and second hands into an accurate time reading.’
The authors argue that identifying the root cause of these failures is key to advancing multimodal models. If the problem lies in how the model perceives spatial orientation, fine-tuning may offer a simple fix; but if it stems from a broader difficulty in integrating multiple visual cues, it points to a more fundamental weakness in how these systems process information.
Fine-tuning tests
To test whether the model’s failures could be overcome with exposure, GPT-4.1 was fine-tuned on the comprehensive synthetic dataset described above. Before fine-tuning, its predictions were widely scattered, with major errors on all kinds of clocks. After fine-tuning on the collection, accuracy improved dramatically on standard clock faces, and somewhat less on distorted ones.
Clocks with modified hands, however, such as thin shapes tipped with arrowheads, continued to produce large errors.
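For a concrete picture of what such a run involves, the sketch below prepares one vision fine-tuning record and submits a job through OpenAI’s fine-tuning API. The prompt wording, file names, and model snapshot string are assumptions for illustration; the paper does not publish its exact configuration:

```python
import base64, json
from openai import OpenAI

client = OpenAI()

def record(image_path, h, m, s):
    """One JSONL fine-tuning example: clock image in, time string out."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"messages": [
        {"role": "user", "content": [
            {"type": "text",
             "text": "What time does this clock show? Answer as HH:MM:SS."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}}]},
        {"role": "assistant", "content": f"{h:02d}:{m:02d}:{s:02d}"}]}

# Hypothetical training file built from the synthetic dataset
# (a real run needs many examples, not one)
with open("clocks_train.jsonl", "w") as out:
    out.write(json.dumps(record("clock_10_09_30.png", 10, 9, 30)) + "\n")

upload = client.files.create(file=open("clocks_train.jsonl", "rb"),
                             purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=upload.id,
                                     model="gpt-4.1-2025-04-14")  # assumed snapshot
```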
Two distinct failure modes emerged. With normal and distorted clocks, the model usually misjudged the direction in which the hands pointed; but with clocks whose hand style had been changed, it often confused the function of each hand, mistaking the hour hand for the minute hand, or the minute hand for the second hand:
A comparison showing the model’s initial weakness and the partial gains achieved through fine-tuning, plotting predicted against actual time in seconds for 150 randomly selected clocks. On the left, before fine-tuning, GPT-4.1’s predictions are scattered and often far from the correct values, which are indicated by the red diagonal line. On the right, after fine-tuning on the balanced synthetic dataset, the predictions align far more closely with the ground truth, though some errors remain.
This suggests that the model had learned to associate visual features such as hand thickness with specific roles, and struggled when those cues changed.
The limited improvement on unfamiliar designs raises the further question of whether a model of this type learns the abstract concept of time-telling, or merely refines its pattern matching.
Hand signals
Fine-tuning, then, improved GPT-4.1’s performance on conventional analog clocks, but had far less effect on clocks with thin, arrow-tipped hands, leaving open whether the model’s failures stemmed from a lack of abstract reasoning or from confusion over which hand was which.
To test whether accuracy might improve once that confusion was removed, a new analysis was run on the model’s predictions for the modified-hand dataset. The outputs were split into two groups: cases where GPT-4.1 correctly recognized the hour, minute, and second hands; and cases where it did not.
Predictions were evaluated for Mean Absolute Error (MAE) before and after fine-tuning, and the results were compared against those for standard clocks; the angular error of each hand was also measured, using its position on the dial as the baseline:
Comparison of clock-reading errors on the modified-hand dataset, before and after fine-tuning, with and without hand-role confusion.
Confusing the roles of the hands produced the largest errors: when GPT-4.1 mistook the hour hand for the minute hand, or vice versa, the resulting time estimates were often far off the mark. Errors caused by misjudging the direction of a correctly identified hand were smaller. Of the three hands, the hour hand showed the highest angular error before fine-tuning, and the second hand the lowest:
Angular error per hand type, for predictions with and without hand-role confusion, before and after fine-tuning on the modified-hand dataset.
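Angular error per hand can be derived directly from the reported times: both the predicted and the true time imply a dial angle for each hand, and the error is the smallest angular separation between the two. Whether the paper computes angles this way or from the raw pixel geometry is not spelled out, so treat this as one plausible reading:

```python
def implied_angles(h, m, s):
    """Dial angles (degrees clockwise from 12) implied by a time reading."""
    return {"hour":   (h % 12) * 30 + m * 0.5 + s / 120,
            "minute": m * 6 + s * 0.1,
            "second": s * 6}

def angular_error(pred_deg, true_deg):
    """Smallest angular separation between two hand directions, in degrees."""
    d = abs(pred_deg - true_deg) % 360
    return min(d, 360 - d)

pred, true = (10, 13, 20), (10, 9, 30)
for hand in ("hour", "minute", "second"):
    err = angular_error(implied_angles(*pred)[hand], implied_angles(*true)[hand])
    print(f"{hand}: {err:.2f} degrees")
```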
To focus solely on directional error, the analysis was restricted to cases where the model had correctly identified each hand’s function. If the model had internalized a general concept of time-telling, its performance on these examples should have matched its accuracy on standard clocks. Instead, accuracy remained significantly worse.
To check whether the shape of the hands was obstructing the model’s sense of direction, a second experiment was run. Two new datasets were created, each containing sixty synthetic clocks bearing only an hour hand, pointed at a different minute mark in each image. One set used the original hand design, the other the modified version, and the model was asked to name the tick mark the hand pointed to.
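Scoring this task reduces to mapping a hand’s angle onto the nearest of the sixty minute ticks, which sit six degrees apart; a minimal sketch (the function name is mine):

```python
def nearest_tick(angle_deg):
    """Map a hand angle (degrees clockwise from 12) to the nearest minute tick."""
    return round((angle_deg % 360) / 6) % 60

assert nearest_tick(90) == 15    # pointing at "3 o'clock" = tick 15
assert nearest_tick(357) == 0    # rounds back up to the top of the dial
```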
The results showed a slight drop in accuracy with the modified hands, but not one large enough to explain the model’s broader failures. A single unfamiliar visual feature, it seemed, could disrupt the model’s overall interpretation, even on tasks it had previously performed well.
Summary of GPT-4.1’s performance on standard, distorted, and modified-hand clocks, before and after fine-tuning, highlighting uneven gains and persistent weaknesses.
Conclusion
The paper’s focus may seem trivial at first glance: it does not matter greatly whether vision-language models ever learn to read analog clocks with 100% accuracy. The weight of the work lies in the deeper, recurring question it keeps in view: whether saturating models with more (and more diverse) data can give them the kind of domain understanding humans acquire through abstraction and generalization, or whether the only viable path is to flood the domain with enough examples to anticipate every likely variation at inference time.
Either route raises questions about what current architectures are really able to learn.
First published on Monday, May 19, 2025