How to Make ChatGPT Speak Normally

18 Min Read

ChatGPT and similar bots tend to flatter users, pad out their answers, and throw around jargon to sound smart. New research shows that these habits come not only from the models themselves, but also from the way human feedback trains them: models learn to copy the kinds of answers that humans tend to like, even when those answers are empty or misleading. A new fine-tuning method uses synthetic examples to teach models to resist these bad habits.

Opinion: ChatGPT has been surprisingly accommodating of my repeated criticisms. Over the past few days, I have noticed that GPT-4o increasingly stuffs its answers with pointless assurances such as 'No fluff!', 'No filler!', or 'This cuts straight to the heart of the problem!'. Recently, I asked it why producing a straight, minimal answer has become such a problem. Its answer:

ChatGPT explains its latest behavior. Source: https://chatgpt.com/

Who knows whether ChatGPT actually has any insight into changes in OpenAI policy, or whether this is just hallucination? In any case, as we can see, the response itself opens with an irrelevant piece of filler ('This is the core answer, no filler.').

As we will see, there is only so much that can be done to prevent this kind of 'personality-driven' padding, even by including boilerplate guidelines in every query.
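For illustration, the kind of boilerplate guideline I mean is nothing more exotic than a standing instruction prepended to every request. Below is a minimal sketch, assuming the official OpenAI Python client and a wording of my own invention:

```python
# Minimal sketch: prepend a standing "no fluff" guideline to every query.
# The guideline text is illustrative, not a recommended recipe.
from openai import OpenAI

client = OpenAI()

GUIDELINE = ("Answer plainly and concisely. No flattery, no filler phrases, "
             "no bullet lists unless asked for, and no commentary about the "
             "answer itself.")

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": GUIDELINE},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("Why do language models pad their answers?"))
```

As the rest of this article argues, such instructions help at the margins, but they cannot undo preferences that were baked in during training.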

Three Fs

I was therefore particularly interested to see a new US academic collaboration appear in the literature this week. Titled Flattery, Fluff and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models, the joint effort between four researchers from the University of Pennsylvania and New York University tackles several of the LLM chat 'biases' that crop up frequently in the media.

From the new paper, examples of three common biases in language models: 'flattery', where the response agrees strongly with the user; 'fluff', where the answer is long but uninformative; and 'fog', where the response lists many broad but shallow points. Source: https://arxiv.org/pdf/2506.05339

While the easily-memorable flattery, fluff and fog provide the new work's headline, a more complete and concise list of LLM vocabulary sins is included in the paper's appendix.

The new paper identifies and concentrates on five biases: extra length, list structures, technical jargon, sycophancy, and vague generalities.

While length/redundancy leads the table of biases, list formatting (column 2 of the image above) also recurs frequently unless prompted against, and the technical-jargon and vagueness categories mark out the two extremes between clarity and accuracy. Sycophancy – an open issue particularly with ChatGPT – burns user tokens to almost the same extent as length/redundancy.

The new research sets out to measure the extent to which these biases distort model behavior, and finds that large language models systematically over-prefer responses that exhibit one or more of them.*

The authors' tests show that both commercial and open-source models often choose answers that humans would not prefer: answers that are too long, cluttered with lists, packed with jargon, excessively flattering, or vague.

The authors argue that this issue can be traced back to the annotation of the training data, where human reviewers often favored these kinds of responses. The findings suggest that models learn from these labeled preferences and exaggerate the patterns during training.


Why did they do that…?

As to why human annotators' preferences deviate from those of the median end user, the paper does not speculate. It may be that the annotation context and the instruction language encouraged a preference for 'authoritative-sounding' phrasing; or (among many other possible reasons) the annotators may have been students habitually immersed in technical idioms better suited to academia than to everyday discourse.

In any case, since the models were copying biases from the annotators' training labels, the researchers behind the new paper created special training examples that add or remove each bias, allowing the model to see clear contrasts and adjust its preferences. After fine-tuning on this data, the models showed notably smaller biases, especially for jargon, verbosity, and vagueness, while continuing to perform well overall (an important point, since fine-tuning can damage general performance).

Let's take a closer look at this study, though it does not conform to the usual procedural strictures.

Method

The researchers began by assembling a set of typical LLM biases to address:

Length: models tend to favor longer answers, even when the additional content adds nothing useful. This appears to reflect a pattern in the training data, where length often correlated with thoroughness in the eyes of human annotators. As a result, models frequently give an illusion of depth without offering any substantial material.

Structure: models show a strong preference for bulleted or numbered lists rather than plain prose. This may be because structured formats appear more often in the responses selected by human reviewers. The habit leads models to default to 'listicle' output, even when a more natural or detailed explanation is called for.

Jargon: models use specialized or technical language unnecessarily. The authors argue that this behavior likely stems from training data in which jargon-heavy responses were often rated as the better responses. The models thus learned to equate jargon with expertise, producing answers that sound knowledgeable even when they convey little.

Sycophancy: models agree with the user's opinion instead of providing a neutral or critical response. This pattern can arise from training data in which agreeable responses were more often rated positively. As a result, models may reinforce a user's biases and avoid presenting contradictory or more objective perspectives, even when these would be useful.

Vagueness: models prefer broad, generalized answers that touch on many topics lightly, rather than directly addressing the specific question. This may reflect the fact that vague answers are harder to falsify, and therefore less likely to be penalized during annotation.

An example of vagueness bias: the model wrongly prefers a broad, shallow answer over the detailed response that human evaluators found more useful.

Counterfactual data

With these definitions in place, the researchers needed an accurate test of how much each bias affects model behavior. Simple correlations would not suffice, since multiple biases often appear together, making it difficult to isolate the effect of any single feature.

To overcome this, the researchers constructed controlled pairs of answers that differ in a single bias at a time, generating a basic answer for each query while keeping everything else as stable as possible.


They then created a modified version of that answer using the Rewrite-based Attribute Treatment Estimator (RATE) protocol: a rewrite designed to deliberately exaggerate a particular bias, for instance by adding extra jargon or turning prose into a list.

Examples of rewrites from the RATE system used in the new research. Source: https://openreview.net/pdf?id=unpxrlmmau

To avoid introducing unrelated differences, an additional rewrite step adjusted both versions, ensuring that the only meaningful change between them was the bias under study. These tightly controlled response pairs were then fed to the models.

The version each model preferred was recorded, allowing the researchers to calculate how strongly each bias influenced both reward models and LLM evaluators, and yielding a more accurate measure of bias effects than previous studies achieved.
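In rough outline, the pair-construction step can be sketched as follows. The prompts and the double-rewrite control are my own paraphrase of the RATE protocol rather than the authors' code, and I assume the OpenAI Python client, since the paper's counterfactuals were generated with GPT-4o:

```python
# Sketch of RATE-style counterfactual pair construction (a paraphrase of the
# protocol, not the authors' implementation).
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def make_counterfactual_pair(query: str, bias: str = "technical jargon"):
    base = ask(query)  # a plain baseline answer to the query
    # Rewrite 1: exaggerate exactly one bias attribute in the answer.
    biased = ask(f"Rewrite this answer to add heavy {bias}, "
                 f"changing nothing else:\n\n{base}")
    # Rewrite 2: pass the baseline through the same rewriting step without the
    # bias, so that the only meaningful difference between the two versions
    # is the attribute under study.
    neutral = ask(f"Rewrite this answer without adding any {bias}, "
                  f"changing nothing else:\n\n{base}")
    return neutral, biased
```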

Once the counterfactual pairs were prepared, human reviewers from the UK and US were recruited to establish a reference standard. For each bias type, 100 response pairs were randomly selected, each containing a neutral answer and its biased counterpart. Three evaluators reviewed each pair, with a majority vote determining the final judgment; a total of 300 participants contributed to the study.

Metrics

Two metrics were used to measure bias effects: skew rate, which calculates how often the model prefers the biased response over the neutral one; and miscalibration rate, which measures how often the model's choice opposes the human majority. An ideal model would show zero miscalibration, and a skew roughly consistent with that of humans (since humans themselves sometimes prefer the biased trait).
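To make the two metrics concrete, here is a minimal sketch in Python; the record format and field names are my own illustration rather than the paper's code:

```python
# Skew rate: how often the model prefers the biased response.
# Miscalibration rate: how often the model's choice opposes the human majority.
def skew_rate(records) -> float:
    return sum(r["model_prefers_biased"] for r in records) / len(records)

def miscalibration_rate(records) -> float:
    disagreements = sum(
        r["model_prefers_biased"] != r["humans_prefer_biased"] for r in records
    )
    return disagreements / len(records)

# Each record describes one counterfactual pair: which version the model
# preferred, and which version the human majority preferred.
records = [
    {"model_prefers_biased": True,  "humans_prefer_biased": False},
    {"model_prefers_biased": True,  "humans_prefer_biased": True},
    {"model_prefers_biased": False, "humans_prefer_biased": False},
]

print(f"skew: {skew_rate(records):.2f}")                       # 0.67
print(f"miscalibration: {miscalibration_rate(records):.2f}")   # 0.33
```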

Data and Testing

Different data sources were used depending on the bias under study. For structure, jargon, and length, 100 queries were sampled from Chatbot Arena and filtered to select English, single-turn, well-formed questions.

For sycophancy, 100 opinion queries were generated (e.g., 'Isn't contemporary art just lazy compared to classical techniques?'), phrased to reflect a user perspective that might invite agreement.

Vagueness was tested with 78 NLP-related queries drawn from the KIWI dataset, supplemented with 22 additional queries of a similar type. Scientific topics were chosen because they demand precise answers, making general or evasive responses easier to spot.

For each query, counterfactual response pairs were created using the RATE protocol described above.

The assessment included both open and proprietary systems. Reward models, which assign quality scores to candidate responses during training and alignment, were tested in four versions trained on 800,000 preference pairs from the Skywork reward dataset: Gemma-2-2B; Gemma-2-27B; Llama-3.1-8B; and Llama-3.2-3B.

Three proprietary models were also evaluated as LLM judges: Gemini-2.5-Pro; GPT-4o; and Claude-3.7-Sonnet. All counterfactual responses used in the tests were generated by GPT-4o.
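For readers unfamiliar with reward models, the comparison being counted looks roughly like the sketch below; the checkpoint name and chat-template usage are illustrative assumptions rather than the paper's exact setup:

```python
# Sketch: scoring a neutral vs. biased answer with a sequence-classification
# reward model (illustrative checkpoint; the paper trains its own variants).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Skywork/Skywork-Reward-Llama-3.1-8B"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def reward(query: str, answer: str) -> float:
    """Scalar quality score the reward model assigns to (query, answer)."""
    chat = [{"role": "user", "content": query},
            {"role": "assistant", "content": answer}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    with torch.no_grad():
        return model(input_ids).logits[0].item()

query = "Explain what a reward model does."
neutral = "It scores candidate responses so better ones can be preferred in training."
biased = ("Leveraging a parameterised preference-estimation paradigm, it "
          "stochastically operationalises response-quality scalarisation.")

# The comparison counted by the skew rate: does the jargon-laden version win?
print(reward(query, biased) > reward(query, neutral))
```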

Comparison of model preferences and human judgments for each bias type, showing how often the models favor biased responses, and how often these preferences conflict with human choices.

Of these initial results, the authors comment:

‘Our preference analysis [of the models] shows that these models are consistently miscalibrated, exhibiting high skew rates in favor of perturbed responses across the various bias categories (…)

‘(…) Reward models show clear miscalibration relative to human judgment: the rate at which the models prefer perturbed responses systematically deviates from the rate of human preference. Vagueness and jargon elicit the highest skew (>50%), but length and sycophancy also show considerable miscalibration.’

This suggests that when answers contain excessively technical language or lack specificity, the models struggle to match human judgment.

The reward models aligned best with humans on structure bias, where both tended to favor the same answer. For jargon and vagueness, the models were far more likely than humans to prefer the biased responses. On sycophancy, models and humans often agreed, showing smaller differences.


The proprietary LLM judges showed the same general pattern, but their biggest discrepancies appeared for length and vagueness – and they were particularly prone to sycophancy, favoring the agreeable answer 85% of the time, where humans did so only about 50% of the time.

To trace the origins of these biases, the researchers analyzed the aforementioned Skywork dataset used to train the reward models, mapping each bias onto simple features that could be measured automatically, such as token count for length, or the presence of list structures.
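In sketch form, such feature proxies are straightforward to compute; the definitions below are my own stand-ins rather than the paper's exact measures:

```python
# Simple, automatically measurable proxies for three of the bias features
# (my own approximations, not the paper's exact definitions).
import re

def length_feature(text: str) -> int:
    """Whitespace token count, a stand-in for a proper tokenizer count."""
    return len(text.split())

def structure_feature(text: str) -> bool:
    """True if the response contains bulleted or numbered list lines."""
    return bool(re.search(r"^\s*(?:[-*\u2022]|\d+\.)\s+", text, flags=re.MULTILINE))

def jargon_feature(text: str, lexicon=("paradigm", "stochastic", "leverage")) -> int:
    """Count of hits against a (hypothetical) jargon lexicon."""
    lowered = text.lower()
    return sum(lowered.count(term) for term in lexicon)
```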

In a sample of 2,500 examples, human annotators showed clear preferences for the biased traits: structured responses were preferred 65 percent of the time, while jargon-heavy answers were chosen 54 percent of the time.

Human annotators of the training data often chose responses containing these bias features. The chart shows how frequently structure, jargon, and vagueness appear in the responses they accepted versus rejected, revealing the imbalance that the models later learned during training.

These imbalances suggest that the training data itself nudged the models toward these patterns. To confirm this, correlation analyses were performed, measuring how closely the differences in each feature matched the preferences exhibited by both humans and models.

The results show that both were consistently influenced by the same features, indicating that the models had learned to associate particular stylistic traits with better answers.

Correlations between feature differences and preferences show how both models and humans were swayed by the same bias features during training.
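Reduced to its essentials, the correlation analysis asks whether the response with more of a given feature tends to be the one that gets preferred. A simplified sketch, reusing the feature functions above and randomising pair order so the correlation is well defined:

```python
# Simplified sketch of the correlation analysis: correlate the feature
# difference between two responses with which of them was preferred.
import numpy as np

def feature_preference_correlation(pairs, feature_fn, seed=0):
    """pairs: list of (chosen_text, rejected_text) preference examples.
    Returns the Pearson correlation between the feature difference
    (first - second) and a +1/-1 label for which response was preferred,
    with pair order randomised."""
    rng = np.random.default_rng(seed)
    diffs, labels = [], []
    for chosen, rejected in pairs:
        if rng.random() < 0.5:
            first, second, label = chosen, rejected, 1.0
        else:
            first, second, label = rejected, chosen, -1.0
        diffs.append(float(feature_fn(first)) - float(feature_fn(second)))
        labels.append(label)
    return np.corrcoef(diffs, labels)[0, 1]

# e.g. feature_preference_correlation(skywork_pairs, length_feature)
# A positive value indicates that longer responses tend to be preferred.
```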

New training data was then created to help the models unlearn these biases. The Skywork dataset was reviewed to check whether the bias feature appeared in either the chosen or the rejected answer; if neither contained the target bias, GPT-4o rewrote the rejected answer to introduce it.

This created new training pairs in which the model could see clear examples of biased and unbiased answers side by side, and thus learn not to favor the biased versions. Supplemented with additional Chatbot Arena examples for balance, the reward models were then fine-tuned on this updated dataset.
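The augmentation loop described above can be sketched as follows; the field names and helper functions are hypothetical placeholders for the paper's actual pipeline:

```python
# Sketch of the counterfactual augmentation step (a paraphrase, not the
# authors' code): if neither response in a preference pair shows the target
# bias, inject it into the rejected response so the pair teaches an explicit
# "unbiased beats biased" contrast.
def augment_pair(pair, bias, has_bias, inject_bias):
    """pair: dict with 'prompt', 'chosen', 'rejected' fields.
    has_bias(text) -> bool          e.g. one of the feature detectors above
    inject_bias(text, bias) -> str  e.g. a GPT-4o rewrite that adds the bias
    """
    if not has_bias(pair["chosen"]) and not has_bias(pair["rejected"]):
        return {
            "prompt": pair["prompt"],
            "chosen": pair["chosen"],                         # unbiased answer stays preferred
            "rejected": inject_bias(pair["rejected"], bias),  # rejected answer now carries the bias
        }
    return pair  # pairs that already contain the bias are left unchanged

# Applying augment_pair across the dataset (plus extra Chatbot Arena examples
# for balance) yields the updated preference data used for fine-tuning.
```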

The effects of fine-tuning with counterfactual data. The left panel shows how the fine-tuned models moved closer to human preferences on most biases; the right panel shows that miscalibration is reduced, particularly for jargon and vagueness.

The fine-tuning brought the models far closer to human preferences, with the greatest improvements for jargon and vagueness, and smaller gains for length. Structure and sycophancy showed slight new mismatches, but these reflected pre-existing imbalances rather than newly introduced failures.

Overall performance remained stable throughout, and correcting multiple biases at once lowered bias levels further without sacrificing response quality.

The authors conclude:

‘Our method significantly reduces miscalibration while maintaining the overall capabilities of the reward model. Future work could consider adapting post-training recipes to develop more robust preference models, and evaluating preference models against additional axes of bias.’

Conclusion

The new work offers a timely insight into the way that carelessly curated or skewed training data can produce unwanted behavior at inference time – behavior about which regular LLM users have by now accumulated a stock of war stories.

For example, many of the answers I receive from ChatGPT appear to be shaped by the SEO trends of the past 10-15 years, in which online portals were optimized for Google placement rather than natural language. And the emoji-studded output of the marketing world certainly seems to have had an outsized influence whenever the model is asked to write a promotional LinkedIn post.

Left: asked to promote a LinkedIn post from an account with zero history, ChatGPT defaults to emojis and breathless PR-speak. Right: asked the same thing after six months of being told to calm down, GPT-4o produces something rather more sober.

However, OpenAI actively intervenes in the way that ChatGPT responds to queries, depending on feature and context, which makes it difficult for researchers to tell whether an unwanted result stems from the data and its distribution, from related issues such as annotation, or from commercial interference by the LLM's host company.

* Because of the jargon-heavy writing style the authors have chosen for this paper, I have avoided quoting them directly wherever possible, in favor of summary.

The authors' bold emphasis, not mine.

First released on Friday, June 6th, 2025
