“Download more labels!” The fantasy of AI research


A prevailing assumption in current machine learning research is that machine learning itself can be used to improve the quality of AI dataset annotations – in particular, the image captions intended for use in Vision-Language Models (VLMs). The idea is driven by the high cost of human annotation, and the additional burden of supervising annotators' performance.

Perhaps this is the equivalent of the "download more RAM" meme of the early 2000s, which satirized the notion that software-based fixes could solve hardware limitations.

It is also an under-examined issue. While new AI models attract extensive attention in both the public and commercial spheres, annotation tends to be treated as a trivial detail of the machine learning pipeline, hidden behind the excitement surrounding the broader frameworks.

In fact, the ability of machine learning systems to recognize and reproduce patterns (the central use case of almost every AI system) depends on the quality and consistency of real-world annotations created or arbitrated by real people.

Inevitably, systems that attempt to observe and reproduce patterns of annotator behavior (and thereby replace human annotators and enable accurate labeling at scale) cannot hope to perform well on data that is not represented in the examples taken from human observers. "Similar" is not the same, and cross-domain equivalence remains a problematic pursuit in computer vision.

The "upstream data" buck has to stop somewhere, and in this case that is exactly where it stops: with a human mind making subjective distinctions in order to codify data for artificial systems.

The RAG Trade-Off

Until recently, the inaccuracies arising from careless dataset annotation were apparently considered acceptable collateral damage, in the context of the imperfect yet marketable results obtained from generative AI systems.

In fact, a Singapore study published this year concluded that hallucination – the tendency of AI systems to invent material that undermines our intentions – is inevitable, and bound up with the conceptual architecture of such systems.

To counter this, RAG-based agents that can "fact-check" claims through internet searches have become increasingly popular in research and in applied commercial solutions. However, they add resource costs and query latency; and new information bolted onto a trained model cannot compete with the more complex, deeply intertwined connections that characterize the model's native layers.

It would therefore be better if the annotation data informing these models were significantly less flawed in the first place, even if perfection is unattainable (not least because the task strays into the realm of human subjectivity).


RePOPE

A new paper from Germany highlights the problems that arise from relying on aging and widely used datasets, focusing in particular on the accuracy and reliability of image captions. The researchers' findings suggest that label errors within benchmarks can mask or misrepresent hallucination in vision-language models.

An example from the new paper, where the original caption fails to correctly identify an object in an image from the MSCOCO dataset. The researchers' manual revisions to the POPE benchmark dataset address these shortcomings, and illustrate the cost of economizing on annotation curation. Source: https://arxiv.org/pdf/2504.15707

Imagine that a model is shown an image of a street scene and asked whether a bicycle is present. The model answers yes. If the benchmark dataset says there is no bicycle, the model is marked wrong. But if the bicycle is clearly visible in the image, the model's answer was correct and the benchmark has failed – the object was simply missed during annotation. Errors of this kind accumulate across a dataset, giving a distorted picture of which models are accurate and which are hallucinating.

Therefore, if incorrect or ambiguous annotations are treated as ground truth, a model may appear to hallucinate when it is correct, or appear accurate when it is not. This distorts both hallucination measurements and model performance rankings, making it difficult to reliably diagnose or address the problem.
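To see how far a handful of bad labels can skew the headline numbers, consider a deliberately simplified sketch (the figures below are invented for illustration and are not drawn from the paper):

```python
# Hypothetical illustration: how missing annotations inflate measured hallucination.
# Suppose 100 questions are labeled "no" (object absent), but 10 of those labels
# are wrong: the object is actually visible in the image.

questions_labeled_no = 100
labels_actually_wrong = 10          # object really present; the annotator missed it

# A model that answers those 10 questions correctly ("yes") is charged with
# 10 false positives under the faulty benchmark labels.
measured_false_positives = labels_actually_wrong
measured_hallucination_rate = measured_false_positives / questions_labeled_no

# Under corrected labels, the same answers count as true positives instead.
corrected_false_positives = 0
corrected_hallucination_rate = corrected_false_positives / (
    questions_labeled_no - labels_actually_wrong
)

print(f"Measured hallucination rate (faulty labels):  {measured_hallucination_rate:.1%}")
print(f"Actual hallucination rate (corrected labels): {corrected_hallucination_rate:.1%}")
```

Every missed object in the annotations converts a correct answer into a phantom hallucination, and the distortion compounds silently across the benchmark.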

The new paper revisits a widely used benchmark called Polling-based Object Probing Evaluation (POPE), which tests whether vision-language models can correctly say what is in an image.

POPE is based on labels from the influential Microsoft COCO: Common Objects in Context (MSCOCO) dataset – a collection of annotated images whose labels have long been treated as a reliable ground truth.

POPE evaluates object hallucination in large vision-language models by recasting the problem as a binary classification task. Rather than parsing generated captions, the system poses simple yes/no questions to the model about whether a particular object is present in the image, along the lines of 'Is there a <object> in the image?'.

Examples of object hallucination in vision-language models. Bold labels indicate objects marked as present in the original annotation, while red labels indicate objects hallucinated by the model. The example on the left reflects a traditional instruction-based evaluation, while the three examples on the right are drawn from different POPE benchmark variants. Source: https://aclanthology.org/2023.emnlp-main.20.pdf

Ground-truth objects (answer: yes) are paired with sampled nonexistent objects (answer: no), chosen via random, popular (frequency-based), or adversarial (co-occurrence-based) sampling strategies. This setup allows a more stable and prompt-insensitive assessment of hallucination, without relying on complex rule-based caption analysis.
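The sketch below illustrates the general shape of this setup. The tiny object lists, the frequency-based 'popular' sampler, and the question template are stand-ins for illustration, not the actual POPE construction code:

```python
import random
from collections import Counter

# Minimal sketch of POPE-style question construction (illustrative only).
# `dataset` maps an image id to the objects annotated as present in it.
dataset = {
    "img_001": {"person", "bicycle", "dog"},
    "img_002": {"car", "person", "traffic light"},
    "img_003": {"dog", "chair", "tennis racket"},
}
all_objects = set().union(*dataset.values())
object_frequency = Counter(obj for objs in dataset.values() for obj in objs)

def build_questions(image_id, k=2, strategy="random"):
    """Pair each present object ('yes') with k sampled absent objects ('no')."""
    present = dataset[image_id]
    absent_pool = sorted(all_objects - present)
    if strategy == "popular":
        # 'Popular' sampling: negatives drawn from the most frequent objects overall.
        negatives = sorted(absent_pool, key=lambda o: -object_frequency[o])[:k]
    else:
        # Random sampling; an 'adversarial' variant would rank by co-occurrence instead.
        negatives = random.sample(absent_pool, min(k, len(absent_pool)))
    questions = [(f"Is there a {obj} in the image?", "yes") for obj in sorted(present)]
    questions += [(f"Is there a {obj} in the image?", "no") for obj in negatives]
    return questions

for question, ground_truth in build_questions("img_001", strategy="popular"):
    print(question, "->", ground_truth)
```

Because every question reduces to a yes/no comparison against the label, the whole scheme is only as trustworthy as the labels themselves – which is precisely the weakness the new paper probes.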


The authors of the new paper – titled RePOPE: Impact of Annotation Errors on the POPE Benchmark – challenge POPE's assumed accuracy by re-checking the labels of the benchmark's source images (i.e., MSCOCO), and find that a surprising number are wrong or ambiguous.

Examples from the 2014 MSCOCO dataset. Source: https://arxiv.org/pdf/1405.0312

Correcting these errors changes how the models rank, with some that performed well at first falling behind once judged against the revised labels.

In tests, the authors evaluated a set of open-weight vision-language models on both the original POPE benchmark and their relabeled RePOPE version.

According to the paper, the revised annotations produced a marked change in model rankings, particularly for F1 scores, with several models that performed strongly under POPE dropping in position under RePOPE.

The authors argue that this shift shows the extent to which annotation errors obscure the actual hallucination behavior of models, and present RePOPE as a more reliable tool for assessing hallucination vulnerability.

Further examples from the new paper, showing cases where the original POPE labels fail to identify subtle objects – such as the person sitting in the tram cab in the far-right photo, or the chair obscured by the tennis player in the second photo from the left.

Methods and Tests

The researchers re-labeled the annotations from the original MSCOCO dataset that underpin POPE, assigning two human labelers to each data instance. Where the quality of the original label was ambiguous (as in the examples below), those instances were set aside from the test round.

Ambiguous cases, where inconsistent POPE labeling reflects unclear category boundaries: teddy bears labeled as bears, motorcycles as bicycles, or airport vehicles as cars. These cases were excluded from RePOPE, owing to the subjective nature of such classifications and the inconsistency of the original MSCOCO labels.

The paper states:

“The original annotators missed the person in the background or behind the glass, the ‘chair’ in the background occluded by the tennis player, and the coleslaw that contains only small visible strips of carrot.

“For some objects, the COCO annotations can be very inconsistent, because the definitions used by the original annotators differ. Whether a ‘teddy bear’ counts as a ‘bear’, a motorized bike as a ‘bicycle’, or an airport vehicle as a ‘car’ depends on the particular definition, leading to inconsistencies in POPE’s ground-truth annotations. We therefore annotate the corresponding image-query pairs as ‘ambiguous’.”

Results of the re-annotation: positive questions are shared across all three POPE variants. Of the questions labeled “yes” by POPE, 9.3% were found to be incorrect and 13.8% were classified as ambiguous; of the “no” questions, 1.7% were found to be incorrect and 4.3% ambiguous.
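How such a double-annotation pass might be reconciled is sketched below. The three-way outcome and the rule that any disagreement or ambiguity flag excludes the pair are assumptions for illustration, rather than the paper's published procedure:

```python
# Illustrative reconciliation of two human re-annotations against the original
# POPE label for one image-question pair. Field names and policy are assumptions.

def reconcile(original_label, annotator_a, annotator_b):
    """Return the revised label, or 'ambiguous' if the case cannot be settled."""
    # Either annotator flagging ambiguity removes the pair from evaluation.
    if "ambiguous" in (annotator_a, annotator_b):
        return "ambiguous"
    # Disagreement between the two annotators is also treated as ambiguous.
    if annotator_a != annotator_b:
        return "ambiguous"
    # Agreement: keep the original label if it matches, otherwise correct it.
    return annotator_a

pairs = [
    ("yes", "yes", "yes"),        # original label confirmed
    ("yes", "no", "no"),          # original 'yes' was wrong: label corrected
    ("no", "yes", "ambiguous"),   # unclear category boundary: excluded
]
for original, a, b in pairs:
    print(original, a, b, "->", reconcile(original, a, b))
```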

The authors evaluated a variety of open-weight models on POPE and RePOPE, across diverse architectures and model sizes. The selected models include some of the leading architectures on the OpenVLM leaderboard: InternVL2.5 (8B/26B/38B/78B, plus the 8B-MPO/26B-MPO variants); LLaVA-NeXT; Vicuna; Mistral 7B; Llama; LLaVA-OneVision; Ovis2 (1B/2B/4B/8B); PaliGemma-3B; and PaliGemma2 (3B/10B).

Initial results: the high error rate among the original positive labels causes a sharp drop in true positives across all models. False positives vary by subset – nearly doubling on the random subset, changing little on the popular subset, and decreasing slightly on the adversarial subset. The relabeling has a major impact on F1-based rankings: models such as Ovis2-4B and Ovis2-8B, which performed well on POPE's popular and adversarial splits, also rise to the top on RePOPE's random subset. See the source PDF.

The results graph above shows how the numbers of true positives and false positives change once the benchmark labels are corrected.


True positives fell across all models, showing that models were often being credited with correct answers only by virtue of faulty labels, while false positives followed a more varied pattern.

In the "random" version of POPE, false positives nearly doubled for many models, showing that a considerable number of objects flagged as hallucinations were in fact present in the image but had been overlooked in the original annotation. In these cases, many presumed model errors were actually mistakes in the dataset labels.

In the "adversarial" version of POPE, false positives decreased, because the questions were built around objects that frequently co-occur with labeled ones – objects that were often actually present in the image but had been left unlabeled.

These shifts affected precision and recall, but model rankings remained relatively stable on both metrics.

The F1 score – POPE's primary ranking metric – was far more sensitive to the label corrections. On the random subset, models that ranked near the top under the original labels, such as InternVL2.5-8B and -26B, fell toward the bottom when scored with RePOPE, while others, such as Ovis2-4B and -8B, rose to the top.

A similar pattern appeared in the accuracy scores, though the authors note that these may be biased, since the revised dataset contains an imbalanced number of positive and negative examples.
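The reason for that caveat is that accuracy rewards correct answers on whichever class happens to dominate the question set, while F1 considers only the positive class. A small sketch with invented counts makes the gap visible:

```python
# Illustrative comparison of F1 and accuracy on an imbalanced yes/no benchmark.
# Counts are made up: many more 'no' questions than 'yes' questions.
tp, fn = 60, 40      # of 100 'yes' questions, the model finds 60
tn, fp = 380, 20     # of 400 'no' questions, the model rejects 380

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + tn + fp)

# Accuracy looks strong because the dominant negative class is easy,
# while F1 exposes the weak recall on the positive class.
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```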

The authors argue that the strong effect of annotation errors on benchmark results underlines the need for high-quality data. To support more reliable evaluation of object hallucination, they have released the corrected labels on GitHub.

They note, however, that the relabeling does not fully address benchmark saturation, since many models still achieve true positive and true negative rates above 90%. They suggest that additional benchmarks such as DASH-B, which uses a more challenging set of negative examples, should be used alongside RePOPE.

Conclusion

This particular experiment was only possible because of the relatively small scale of the dataset involved. Testing the same hypothesis on a hyperscale dataset would mean working with very limited fragments of the data; and in a large, highly diverse dataset, it could prove nearly impossible to isolate groups that are both statistically representative and semantically coherent, which could skew the results.

Even where that is possible, what remedies are available at the current state of the art? The argument inevitably leads back to the need for better and more plentiful human annotation.

In this respect, "better" and "more plentiful" are arguably separate problems in themselves, since more annotation can be obtained cheaply through crowdsourcing economies such as Amazon Mechanical Turk (AMT) – a potentially exploitative sub-economy that frequently produces poorer results.

Alternatively, annotation tasks could be farmed out to economic regions where the same spend yields more annotations. However, the further annotators are removed from the intended use case of the model their labels will shape, the less likely the resulting model is to align with the needs or expectations of the target domain.

This therefore remains one of the most persistent and unsolved challenges in the economics of machine learning development.

First released on Wednesday, April 23, 2025
