A new research paper questions the price of “tokens” in AI chat


New research shows that AI services billed by tokens hide the real costs from users. Providers can quietly inflate fees by miscounting tokens and slipping in hidden steps. Some systems perform additional processing that does not affect the output but still appears on the invoice. Auditing tools have been proposed, but without real monitoring, users end up paying more than they realize.

In almost all cases, what consumers pay for AI-powered chat interfaces such as ChatGPT-4o is currently measured in tokens: units of text that go unnoticed during use, yet are counted precisely for billing purposes. Each exchange is priced by the number of tokens processed, but users have no direct way of seeing that count.

Despite consumers' incomplete understanding of what they get for each purchased "token" unit, token-based billing is the standard approach across providers, and it rests on what may prove to be an unstable assumption of trust.

Tokens and Words

Tokens are not exactly the same as words, but they often play a similar role, with most providers using the term "token" to describe small textual units such as words, punctuation marks, and word fragments. The word "unbelievable", for example, might be counted as a single token by one system, while another might split it into un, believ and able, with each piece increasing the cost.

This system applies to both the user's input and the model's replies, with the price based on the total number of these units.
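As a rough illustration, billing of this kind reduces to a simple function of two numbers that the user never sees. The sketch below (in Python, with made-up per-million-token rates and a hypothetical function name) is not any provider's actual API; it only shows what the invoice is computed from:

```python
# Minimal sketch of how token-based billing typically works.
# The rates are hypothetical; real prices vary by provider and model.

def bill_for_exchange(prompt_tokens: int, completion_tokens: int,
                      input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Return the cost of one exchange, priced purely by token counts."""
    return (prompt_tokens / 1_000_000) * input_rate_per_m \
         + (completion_tokens / 1_000_000) * output_rate_per_m

# Example: 420 prompt tokens and 1,180 completion tokens at assumed rates
# of $2.50 / $10.00 per million tokens (illustrative figures only).
cost = bill_for_exchange(420, 1_180, 2.50, 10.00)
print(f"${cost:.5f}")  # the user sees the charge, but not the token counts behind it
```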

The difficulty lies in the fact that users never see this process. Most interfaces do not display token counts while a conversation is underway, and the way tokens are calculated is hard to reproduce. Even if a count is shown after a reply, it is too late to tell whether it was fair, creating a discrepancy between what users see and what they pay for.

Recent research points to deeper problems. One study shows how providers can overcharge without breaking any rules, simply by inflating token counts in ways users cannot see. Another reveals mismatches between what an interface displays and what is actually billed, so that apparent efficiency may be an illusion for the user. A third exposes how models routinely generate internal reasoning steps that are never shown to the user, yet still appear on the invoice.

The findings describe a system that seems precise, where exact numbers imply clarity, yet whose underlying logic remains hidden. Whether this is by design or a structural flaw, the result is the same: users pay more than they can see, and often more than they expect.

Cheaper by the Dozen?

In the first of these papers, titled Does Your LLM Overcharge You? Tokenization, Transparency, and Incentives, from four researchers at the Max Planck Institute for Software Systems, the authors argue that the risks of token-based billing go beyond opacity, pointing to a built-in incentive for providers to inflate token counts.


"At the heart of the problem lies the fact that the tokenization of a string is not unique. Consider, for example, that a user submits the prompt "Where does the next NeurIPS take place?" to a provider, the provider feeds it to an LLM, and the model generates the output "|San| Diego|", consisting of two tokens.

"Since the user is oblivious to the generation process, a self-serving provider has the capability to misreport the tokenization of the output to the user without even changing the underlying string. For example, the provider can simply share the tokenization "|S|a|n| |D|i|e|g|o|" and overcharge the user for nine tokens instead of two!"

The paper presents a heuristic capable of performing this kind of dishonest miscounting without altering the visible output, and without violating plausibility under typical decoding settings. Tested on models from the Llama, Mistral, and Gemma series using real prompts, the method achieves measurable overcharges without appearing anomalous.
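The underlying trick can be demonstrated with any ordinary open tokenizer. The sketch below uses the GPT-2 tokenizer purely as a stand-in (the paper's experiments use Llama, Mistral, and Gemma tokenizers, and its Algorithm 1 is subtler than a character-level split); the point is simply that one visible string admits token sequences of very different lengths:

```python
# Illustrative sketch of over-tokenization: the same output string can be
# reported as far more tokens without changing what the user sees.
# GPT-2's tokenizer is a stand-in; this is NOT the paper's Algorithm 1.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
output_text = "San Diego"

# Canonical segmentation: what an honest provider would bill.
honest_ids = tok(output_text, add_special_tokens=False)["input_ids"]

# A string-preserving but inflated segmentation: one token per character.
# Byte-level BPE has a token for every character, so decoding the
# concatenation reproduces the original string exactly.
inflated_ids = [i for ch in output_text
                for i in tok(ch, add_special_tokens=False)["input_ids"]]

print(len(honest_ids), repr(tok.decode(honest_ids)))      # e.g. 2 tokens -> 'San Diego'
print(len(inflated_ids), repr(tok.decode(inflated_ids)))  # e.g. 9 tokens -> 'San Diego'
```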

Token inflation via "plausible mis-tokenizations". Each panel shows the percentage of overcharged tokens that results from a provider applying the paper's Algorithm 1 to outputs from 400 LMSYS prompts, under various sampling parameters (m and p). All outputs were generated at temperature 1.3, with five repetitions per setting used to compute 90% confidence intervals. Source: https://arxiv.org/pdf/2505.21627

To address the issue, the researchers call for billing based on character count rather than tokens, arguing that this is the only approach that gives providers a reason to report usage honestly, and that if the goal is fair pricing, tying the price to visible characters rather than hidden processes is the only option that withstands scrutiny. Character-based pricing, they argue, would reward shorter, more efficient outputs while removing the incentive for misreporting.
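The contrast can be sketched in a few lines (the rates below are invented, and the paper's actual pricing proposal is more nuanced than a flat per-character fee): a character-based bill depends only on a string the user can recount, while a token-based bill depends on a count only the provider can see:

```python
# Illustrative contrast between token-based and character-based billing.
# Rates are made-up; the point is what each scheme depends on.

def token_bill(reported_token_count: int, rate_per_token: float) -> float:
    # Depends on a count that the provider reports and the user cannot verify.
    return reported_token_count * rate_per_token

def character_bill(visible_output: str, rate_per_char: float) -> float:
    # Depends only on the string the user actually receives.
    return len(visible_output) * rate_per_char

answer = "San Diego"
print(token_bill(9, 0.00002))            # the provider could have reported 2, 9, or 90
print(character_bill(answer, 0.000004))  # anyone can recount the 9 characters
```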

However, there are several further considerations (most of them acknowledged by the authors). First, the proposed character-based scheme introduces additional business logic that could favor vendors over consumers.

'(a) A provider that never misreports has a clear incentive to generate the shortest possible output token sequence, and to improve current tokenization algorithms, such as BPE, so that they compress the output token sequence as much as possible.'

The optimistic reading here is that vendors are thus encouraged to produce concise, meaningful, and valuable output. In practice, however, there are less benign ways for a provider to reduce its text count.

Second, the authors acknowledge that legislation would likely be required to move providers from the opaque token system to a clearer, text-based billing method. A maverick startup might decide to differentiate its product by launching with this kind of pricing model, but anyone with a truly competitive product (and operating at a smaller scale than the largest providers) is disincentivized from doing so.

Finally, larcenous algorithms of the kind the authors describe come with their own computational cost. If the expense of calculating the "upcharge" exceeded the potential profit, the scheme would clearly have no merit. The researchers emphasize, however, that their proposed algorithm is effective and economical.


The authors provide code for their theories on GitHub.

Switch

The second paper, titled Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services, from researchers at the University of Maryland and Berkeley, argues that misaligned incentives in commercial language model services are not limited to token splitting, but extend to a whole class of hidden operations.

These include internal model calls, speculative reasoning, tool usage, and multi-agent interactions, all of which may be billed to the user without visibility or recourse.

Pricing and transparency of reasoning LLM APIs across major providers. All of the listed services charge users for hidden internal reasoning tokens, and none of them exposes these tokens at runtime. Costs vary widely, with OpenAI's o1-pro model charging ten times more per million tokens than Claude Opus 4 or Gemini 2.5 Pro, despite being equally opaque. Source: https://www.arxiv.org/pdf/2505.18471

Unlike conventional billing, where the quantity and quality of a service can be verified, the authors argue that today's LLM platforms are structurally opaque: users are billed on the basis of reported token and API usage, but have no way of confirming that these metrics reflect real or necessary work.

The paper identifies two key forms of manipulation: quantity inflation, where the number of tokens or calls rises without any benefit to the user; and quality downgrade, where lower-performing models or tools are quietly substituted for premium components:

'In reasoning LLM APIs, providers often maintain multiple variants of the same model family, with different capacities, training data, or optimization strategies (e.g., ChatGPT o1, o3). Model downgrade refers to the silent substitution of lower-cost models, which can create a mismatch between the expected and the actual quality of service.

'For example, a prompt may be handled by a smaller-sized model, while the billing remains unchanged. This practice is hard for users to detect, since the final answer may still appear plausible for many tasks.'

The paper documents cases in which more than ninety percent of billed tokens were never shown to the user, and in which internal reasoning inflated token usage by a factor of more than twenty. Whether or not such steps are legitimate, their opacity denies users any basis for assessing their relevance or legitimacy.
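To put an inflation factor of that size into billing terms, here is a back-of-envelope sketch (the per-million-token rate below is an assumed example, not any provider's published price):

```python
# Back-of-envelope effect of hidden reasoning tokens on a single bill.
# The 20x inflation factor is the paper's reported figure; the rate is assumed.
visible_tokens = 1_000              # tokens the user actually reads in the reply
inflation_factor = 20               # total billed tokens per visible token
assumed_rate_per_million = 60.00    # assumed output price, dollars per million tokens

billed_tokens = visible_tokens * inflation_factor
visible_cost = visible_tokens / 1_000_000 * assumed_rate_per_million
billed_cost = billed_tokens / 1_000_000 * assumed_rate_per_million
print(f"cost of visible text: ${visible_cost:.3f}, amount billed: ${billed_cost:.3f}")
# -> cost of visible text: $0.060, amount billed: $1.200
```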

In agentic systems, the opacity deepens, since internal exchanges between AI agents can each incur charges without meaningfully affecting the final output:

'Beyond internal reasoning, agents communicate by exchanging prompts, summaries, and planning instructions. Each agent interprets inputs from other agents and generates outputs to guide the workflow. These inter-agent messages can consume substantial tokens that are often not directly visible to end users.

'All tokens consumed during agent coordination, including generated prompts, responses, and tool-related instructions, are typically not surfaced to the user. When the agents themselves use reasoning models, billing becomes even more opaque.'

To confront these issues, the authors propose a layered auditing framework involving cryptographic proofs of internal activity, verifiable markers of model and tool identity, and independent oversight. The underlying concern, however, is structural: current LLM billing schemes depend on a persistent information asymmetry that leaves users exposed to costs they can neither verify nor itemize.


Count the invisible

The final paper, from ten researchers at the University of Maryland, reframes the billing problem as a structural issue rather than one of misuse or misreporting. The paper, titled CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs, observes that most commercial LLM services now hide the intermediate reasoning that contributes to a model's final answer, yet still bill for those tokens.

The paper argues that this creates an unobservable billing surface, where entire sequences can be fabricated, injected, or inflated without any possibility of detection*:

'(This) invisibility allows providers to misreport token counts or artificially inflate the token count by injecting low-cost, fabricated reasoning tokens. This practice is called token count inflation.

'For example, a single high-efficiency ARC-AGI run by OpenAI's o3 model consumed 111 million tokens, costing $66,772.3. At this scale, even small manipulations can have a major financial impact.

'This information asymmetry allows AI companies to substantially overcharge users, thereby undermining their interests.'
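Those figures are worth unpacking. The short calculation below derives the per-million-token rate implied by the numbers quoted above; the five-percent padding share is purely hypothetical, included only to illustrate the scale involved:

```python
# Worked arithmetic on the ARC-AGI figures quoted above.
# The 5% padding share is hypothetical, used only to illustrate scale.
total_tokens = 111_000_000
total_cost = 66_772.30

implied_rate = total_cost / (total_tokens / 1_000_000)
print(f"implied rate: ${implied_rate:.2f} per million tokens")   # ~ $601.55

hypothetical_padding_share = 0.05
padding_cost = total_cost * hypothetical_padding_share
print(f"cost of 5% padded tokens: ${padding_cost:,.2f}")         # ~ $3,338.62
```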

To counter this asymmetry, the authors propose CoIn, a third-party auditing system designed to verify hidden tokens without revealing their content, using hashed fingerprints and semantic checks to detect signs of inflation.

An overview of CoIn, the auditing system for opaque commercial LLMs. Panel A shows how token embeddings are hashed into a Merkle tree so that token counts can be verified without revealing token content. Panel B shows a semantic validity check, in which a lightweight neural network compares reasoning blocks against the final answer. Together, these components allow a third-party auditor to detect hidden token inflation while preserving the confidentiality of proprietary model behavior. Source: https://arxiv.org/pdf/2505.13778

One component commits to token counts by hashing them into a Merkle tree; the other assesses the relevance of the hidden content by comparing it against the embedding of the final answer. This lets an auditor detect padding and irrelevance, signs that tokens were inserted simply to hike the invoice.
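A minimal sketch of the commitment idea follows (this is not CoIn's actual construction: the hashing scheme, the stand-in strings used in place of embedding fingerprints, and the verification step are all simplifying assumptions). The provider publishes only a Merkle root over its hidden tokens, and the billed count can later be checked against that root without the tokens themselves being revealed:

```python
# Minimal sketch of committing to hidden token usage with a Merkle tree.
# Simplified illustration only; CoIn's real protocol differs.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Reduce leaf hashes pairwise until a single root remains."""
    if not leaves:
        return h(b"")
    level = leaves
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last node on odd levels
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Provider side: hash each hidden reasoning token (plain strings stand in here
# for the token-embedding fingerprints described in the paper) and publish the root.
hidden_tokens = ["step", "1", ":", "reason", "about", "the", "answer"]
leaves = [h(t.encode()) for t in hidden_tokens]
commitment = merkle_root(leaves)

# Auditor side: the billed count must match the number of committed leaves,
# and any spot-checked leaf must hash into the published root.
print(len(hidden_tokens), commitment.hex()[:16])
```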

In tests, CoIn achieved a detection success rate approaching 95% across several forms of inflation, while exposing minimal underlying data. The system still relies on voluntary cooperation from providers, and its handling of edge cases is limited, but the broader point is unmistakable: the current LLM billing architecture assumes an integrity that cannot be verified.

Conclusion

Besides the advantage of collecting payment from users in advance, scrip-style currencies (such as Civitai's "buzz" system) help to abstract users from the true value of the currency they are spending, or the true value of the product they are purchasing. Likewise, giving vendors leeway to define their own units of measurement leaves consumers even further in the dark about what they are actually spending, in real-money terms.

Like the absence of clocks in Las Vegas casinos, measures of this kind aim to make consumers reckless about, or indifferent to, what they are spending.

The token, barely understood by most users and defined in so many different ways, is probably not a suitable unit of measurement for LLM consumption in any case, particularly since it can take many times more tokens to compute an LLM result in languages other than English, compared with an English-based session.
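The disparity is easy to observe with any open tokenizer; the sketch below uses GPT-2's byte-level BPE as a convenient stand-in, and the sample sentences are illustrative:

```python
# Rough illustration of how token counts differ by language for the same meaning.
# GPT-2's tokenizer is a convenient stand-in; exact counts vary by tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English":  "The weather is nice today.",
    "Japanese": "今日はいい天気ですね。",
    "Greek":    "Ο καιρός είναι καλός σήμερα.",
}
for lang, text in samples.items():
    n = len(tok(text, add_special_tokens=False)["input_ids"])
    print(f"{lang:9s} {n:3d} tokens")
```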

However, as the Max Planck researchers note, character-based billing favors more concise languages and could naturally penalize verbose ones. And since visual cues such as a depleting token counter would likely make users a little more frugal in their LLM sessions, it seems unlikely that such a useful GUI addition will arrive anytime soon, at least without legislative action.

* Authors' emphasis. My conversion of the authors' inline citations into hyperlinks.

First released on Thursday, May 29, 2025
