AI is struggling to emulate historical language


A collaboration between researchers in the US and Canada has found that large language models (LLMs) such as ChatGPT struggle to reproduce historical idiom without extensive pretraining.

The researchers investigated several methods of generating period text, starting with simple prompts asking for historically accurate early twentieth-century prose, then moving on to fine-tuning a commercial model on a small collection of books from the era.

They also compared the results against a model trained entirely from scratch on books published between 1880 and 1914.

In the first test, instructing ChatGPT-4o to mimic fin-de-siècle language produced results markedly different from those of a smaller GPT-2-based model fine-tuned on literature of the era.

The models were asked to continue a genuine historical text (top center). The fine-tuned GPT-2 model (bottom right) captures the period style well, though not with complete accuracy. Source: https://arxiv.org/pdf/2505.00030

Although fine-tuning brings the output closer to the original style, human readers could still frequently detect traces of modern language and ideas, suggesting that even carefully tuned models continue to reflect the influence of contemporary training data.

The researchers arrive at the frustrating conclusion that there is no cheap shortcut to convincing machine-generated historical text or dialogue. They also speculate that the challenge itself may be ill-posed:

‘(We) need to consider the possibility that anachronism may be, to some degree, inevitable. Whether the past is represented by instructing a historical model to hold a conversation, or by teaching a contemporary model to ventriloquize an earlier period, some compromise may be required between the goals of authenticity and conversational fluency.

‘There are, after all, no genuine examples of a conversation between twenty-first-century questioners and respondents in 1914. Researchers seeking to create such a conversation will need to reflect on the (way) in which interpretation always involves a negotiation between the present and (the past).’

The new paper is titled Can Language Models Represent the Past without Anachronism?, and comes from three researchers at the University of Illinois, the University of British Columbia, and Cornell University.

A complete disaster

In the first stage of a three-part research approach, the authors tested whether modern language models could mimic historical language through simple prompting alone. Using genuine excerpts from books published between 1905 and 1914, they asked ChatGPT-4o to continue these passages in the same idiom.

The original period text is as follows:


‘In this latter case you will economize about five to six yards of film per minute, whereas films of more than twenty yards per minute must be employed to project landscapes or moving objects. You thus obtain a practical combination of fixed and moving pictures, producing most artistic effects.

‘It also permits of performing the two kinds of cinematographic work, projecting alternately to avoid scintillation, or projecting red and green images simultaneously to reproduce natural colours. Now, a word about the application of cold light to instantaneous photography.’


To assess whether the generated text matched the intended historical style (since people are not especially skilled at guessing the date a text was written), the researchers fine-tuned a RoBERTa model to estimate publication dates, using a subset of the Corpus of Historical American English, which spans 1810 to 2009.

They then used this RoBERTa classifier to assess continuations generated by ChatGPT-4o when it was prompted with genuine text from books published between 1905 and 1914.
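In outline, such a date estimator might look like the sketch below, which frames date prediction as a regression task on top of RoBERTa. The regression framing, model size, and hyperparameters here are assumptions for illustration, not the authors' published configuration.

```python
# Hedged sketch of a RoBERTa-based publication-date estimator:
# a single-output regression head predicting a year from a passage.
import torch
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression")

def estimate_year(passage: str) -> float:
    """Estimate a publication year for a passage. Only meaningful after
    fine-tuning on (passage, year) pairs, e.g. a COHA subset."""
    inputs = tokenizer(passage, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.item()
```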

The system prompt (i.e., the contextual instructions given to ChatGPT about how to approach the task) was:


“Your job is to complete passages from early twentieth-century books. You will be given a passage from a book published in 1913. Continue this passage in the same style, for at least 200 words. Provide only this continuation; do not make framing statements such as ‘Here is the continuation:’.”


With both one-shot and twenty-shot prompting, ChatGPT-4o's output consistently drifted toward a twenty-first-century stylistic register.
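For illustration, a one-shot or twenty-shot setup of this kind might be assembled as below. The message framing and model name follow the paper's description, but the exact prompt assembly is an assumption rather than the authors' code.

```python
# Hedged sketch of few-shot prompting for historical continuation.
from openai import OpenAI

client = OpenAI()

SYSTEM = ("Your job is to complete passages from early twentieth-century "
          "books. Continue the passage in the same style, for at least "
          "200 words. Provide only the continuation.")

def continue_passage(passage: str, examples: list[tuple[str, str]]) -> str:
    """`examples` holds (passage, genuine_continuation) demonstration
    pairs: one pair for one-shot, twenty for twenty-shot."""
    messages = [{"role": "system", "content": SYSTEM}]
    for prompt, continuation in examples:
        messages.append({"role": "user", "content": prompt})
        messages.append({"role": "assistant", "content": continuation})
    messages.append({"role": "user", "content": passage})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```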


The example given in the paper shows one of ChatGPT-4o's blog-like attempts to continue the genuine period text about photography.


“The use of cold light in instantaneous photography has revolutionized the approach to capturing images under a variety of lighting conditions. Traditionally, photographers struggled with the harshness and variability of natural light, or the intrusive nature of flash photography, which can disrupt the atmosphere of a scene and the natural behavior of subjects, particularly in wildlife and candid human photography.

“Cold light, with its softer, more diffused properties, allows for a more natural representation of colors and textures, which is often lost under harsh artificial lighting.”


As is easily seen, this generic, Wikipedia-style text fails to match the prolix and elaborate style of the original period text. Moreover, rather than continuing from where the original content leaves off, it launches into a tangential, abstract meditation on one of the subtopics.

The team also tested GPT-1914, a smaller GPT-2-class model trained from scratch on 26.5 billion tokens of literature dating from 1880 to 1914.

Its output was less coherent than ChatGPT-4o's, but more consistent with the style of the source period. The only example provided in the paper, again a completion of the genuine period text on photography, reads as follows:


“The principle of its action has already been explained (p. 4). We will only mention here that it can be applied only when very rapid photographs are desired on paper coated with collodion, or when photographs are to be taken on gelatine plates.

‘In such cases the exposure should not exceed one second at most; but for still shorter times (for example, if it is desired to develop a picture in half a second), the temperature must not fall below 20°C, otherwise the image will come out darker after development; moreover, the plate loses its sensitivity under these conditions.

“For ordinary purposes, however, it is quite sufficient simply to expose the sensitive surface to a low heat, without maintaining any special precautions” (sic)


Even the genuine original material is opaque and very difficult to follow, which makes it hard to judge how accurately GPT-1914 picked up from the original; but the output certainly sounds more period-appropriate.

The authors concluded from this experiment, however, that simple prompting is rarely enough to overcome the modern bias of large pretrained models such as ChatGPT-4o.

The plot thickens

To measure how closely each model's output resembles authentic historical writing, the researchers used the statistical classifier to estimate a publication date for each text sample, then visualized the results as kernel density plots, which show where the classifier places each passage on a historical timeline.

Estimated publication dates for source and generated text, according to a classifier trained to recognize historical style: GPT-1914 (trained only on period literature) and ChatGPT-4o continuations under one-shot and twenty-shot prompts, each compared against source texts from 1905 to 1914.
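A kernel density comparison of this sort can be reproduced in a few lines. In the sketch below, the date arrays are invented placeholders standing in for classifier estimates, not the paper's data.

```python
# Hedged sketch of the kernel-density comparison described above.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Placeholder date estimates; real values would come from the classifier.
estimated_dates = {
    "Source texts (1905-1914)": np.random.normal(1910, 8, 250),
    "GPT-1914": np.random.normal(1912, 10, 250),
    "ChatGPT-4o (20-shot)": np.random.normal(1995, 25, 250),
}

years = np.linspace(1810, 2009, 400)
for label, dates in estimated_dates.items():
    plt.plot(years, gaussian_kde(dates)(years), label=label)

plt.xlabel("Estimated publication year")
plt.ylabel("Density")
plt.legend()
plt.show()
```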

Although the fine-tuned RoBERTa model used for this task is not perfect, it still captured the general stylistic trends. Passages written by GPT-1914, trained entirely on period literature, clustered around the early twentieth century, much like the original source material.

In contrast, ChatGPT-4o's output tended to resemble twenty-first-century writing, reflecting the data on which it was originally trained, even when it was prompted with multiple historical examples.

The researchers quantified this discrepancy using Jensen-Shannon divergence, a measure of how much two probability distributions differ. GPT-1914 scored 0.006 against the actual historical text, while ChatGPT-4o's one-shot and twenty-shot outputs showed far wider gaps, at 0.310 and 0.350 respectively.
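For readers who want to replicate the metric, here is a hedged sketch: SciPy exposes the Jensen-Shannon *distance*, which must be squared to obtain the divergence; the binning of estimated dates below is an assumption.

```python
# Jensen-Shannon divergence between two sets of estimated dates.
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(dates_a, dates_b, bins=np.arange(1810, 2011, 5)):
    """Histogram each date sample over the same bins, then compare.
    scipy's jensenshannon returns the distance; squaring gives divergence."""
    p, _ = np.histogram(dates_a, bins=bins, density=True)
    q, _ = np.histogram(dates_b, bins=bins, density=True)
    return jensenshannon(p, q) ** 2

# e.g. js_divergence(source_dates, gpt1914_dates) -> ~0.006 in the paper
```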


The authors argue that these findings show that prompting alone, even with multiple examples, cannot reliably shift a modern model toward a historical register.

Complete the passage

The next stage investigated whether fine-tuning could produce better results. Fine-tuning directly alters a model's weights by continuing training on user-supplied data, a process that can affect the model's original core capabilities.

In the first fine-tuning experiment, the team trained GPT-4o-mini on approximately two thousand passage-completion pairs drawn from books published between 1905 and 1914.
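As a rough sketch, passage-completion pairs of this kind can be packaged in the chat-style JSONL format that OpenAI's fine-tuning endpoint expects. The instruction wording and file name below are assumptions, not the authors' actual pipeline.

```python
# Hedged sketch: packaging passage-completion pairs for fine-tuning.
import json

pairs = [
    {"passage": "A word now as to the application of cold light...",
     "completion": "The principle of its action has already been..."},
    # ... roughly 2,000 such pairs drawn from 1905-1914 books
]

with open("period_finetune.jsonl", "w") as f:
    for pair in pairs:
        record = {"messages": [
            {"role": "system",
             "content": "Continue the passage in its original style."},
            {"role": "user", "content": pair["passage"]},
            {"role": "assistant", "content": pair["completion"]},
        ]}
        f.write(json.dumps(record) + "\n")
```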

Using the same RoBERTa-based classifier that had served as a judge in the earlier tests to estimate the stylistic ‘date’ of each output, the researchers found that the fine-tuned model produced text closely aligned with the ground truth.

Its stylistic divergence from the original text, measured by Jensen-Shannon divergence, dropped to 0.002, broadly in line with GPT-1914.

Estimated publication dates for actual and generated text, showing that both GPT-1914 and the fine-tuned GPT-4o-mini align with early twentieth-century writing styles (based on books published between 1905 and 1914).

However, the researchers caution that this metric captures only superficial features of historical style, not deeper conceptual or factual anachronisms:

‘(This) is not a very sensitive test: the RoBERTa model used here as a judge is trained only to predict dates, not to discriminate anachronistic passages, and it probably relies on coarse stylistic evidence to make that prediction. Human readers, or larger models, may be able to detect anachronistic content in passages that superficially sound “period.”’

Human touch

Finally, the researchers conducted human evaluation tests using 250 hand-selected passages from books published between 1905 and 1914, observing that many of these passages would likely be interpreted quite differently now than at the time of writing:

‘Our list included, for example, encyclopedia entries about Alsace (which was part of Germany at the time) and beriberi (which was then explained as a fungal disease rather than a nutritional deficiency). While those are factual differences, we also selected passages that illustrated subtler differences of attitude, rhetoric, or imagination.

‘For instance, early twentieth-century descriptions of non-European places tend to slip into racial generalization. A description of sunrise on the moon written in 1913 imagines a rich chromatic phenomenon, because the writer had not yet seen photographs of a world without (an) atmosphere.’

The researchers wrote short questions that each historical passage could plausibly answer, then fine-tuned GPT-4o-mini on these question-answer pairs. To strengthen the evaluation, they trained five separate versions of the model, each time holding out a different portion of the data for testing.

They then generated responses using the default versions of GPT-4o and GPT-4o-mini alongside the fine-tuned variants, each evaluated only on material it had never seen during training.
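This five-way held-out design reads like standard five-fold cross-validation. The sketch below assumes that framing, with a placeholder list standing in for the real question-answer pairs.

```python
# Hedged sketch: five-fold splits, one fine-tuned model per fold, each
# model evaluated only on the question-answer pairs it never saw.
from sklearn.model_selection import KFold

qa_pairs = [f"qa_{i}" for i in range(250)]  # stand-in for real Q-A pairs

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(qa_pairs)):
    train = [qa_pairs[i] for i in train_idx]  # fine-tune one model here
    test = [qa_pairs[i] for i in test_idx]    # evaluate that model here
    print(f"fold {fold}: {len(train)} train, {len(test)} held out")
```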

Lost in time

To assess how convincingly the models mimicked historical language, the researchers had three expert annotators examine 120 AI-generated completions and judge whether each would appear plausible coming from a writer in 1914.

This direct assessment proved more difficult than expected: although the annotators agreed on nearly 80 percent of their ratings, the imbalance in their judgments (‘implausible’ was chosen about twice as often as ‘plausible’) means the real level of agreement was only moderate, as measured by a Cohen's kappa score of 0.554.
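The gap between raw agreement and kappa is easy to reproduce: with a 2:1 skew toward one label, chance agreement is already high, so roughly 80 percent raw agreement yields only a moderate kappa. The toy labels below are invented purely to illustrate the arithmetic.

```python
# Toy illustration: imbalanced labels depress kappa relative to agreement.
from sklearn.metrics import cohen_kappa_score

rater_a = ["implausible"] * 80 + ["plausible"] * 40
rater_b = (["implausible"] * 70 + ["plausible"] * 10
           + ["implausible"] * 10 + ["plausible"] * 30)

# Raw agreement here is 100/120 ≈ 0.83, yet kappa is only 0.625.
print(cohen_kappa_score(rater_a, rater_b))
```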

The evaluators themselves described the task as difficult, often requiring additional research to assess whether a statement was consistent with what was known or believed in 1914.

Some passages raised difficult questions about tone and perspective: for example, whether a response was appropriately limited in its worldview to reflect the ethnocentrism typical of 1914 (i.e., the tendency to view other cultures through one's own assumptions and prejudices).


Here the challenge was to decide whether a passage expressed cultural biases that seemed historically plausible, without sounding either too modern or gratuitously offensive by today's standards. The authors note that even for scholars familiar with the era, it was difficult to draw a sharp line between language that felt historically accurate and language that reflected present-day ideas.

Nevertheless, the results showed a clear ranking of the models, with the fine-tuned version of GPT-4o-mini judged the most plausible overall.

Annotator judgments of how plausible each model's output appeared.

Whether this level of performance, deemed plausible in 80 percent of cases, is dependable enough for practical historical research remains unclear, particularly since the study did not include a baseline measure of how often genuine period texts were misclassified.

Intruder Alert

Next came the ‘intruder test’, in which annotators were shown four anonymized passages answering the same historical question. Three came from language models, while one was a genuine excerpt from an actual early twentieth-century source.

The task was to identify which passage had genuinely been written during the period.

Rather than asking annotators to assess plausibility directly, this approach measured how often the genuine passage stood out from the AI-generated responses, in effect testing whether the models could fool readers into accepting their output as authentic.

The model rankings matched those from the earlier judgment task: the fine-tuned version of GPT-4o-mini was the most convincing of the models, but still fell short of the real thing.

How often each source was correctly identified as the genuine historical passage.

The test also served as a useful benchmark: since more than half of the genuine passages were correctly identified, the gap between authentic and synthetic prose evidently remained noticeable to human readers.

A statistical analysis known as McNemar's test confirmed that the differences between models were meaningful.
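McNemar's test compares the paired successes and failures of two models on the same items. A sketch follows, with invented counts, since the paper's underlying contingency table is not reproduced here.

```python
# Hedged sketch of McNemar's test for comparing two models on paired items.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of paired outcomes: rows = model A fooled the annotator
# (yes/no), columns = model B fooled the annotator (yes/no).
# These counts are invented for illustration.
table = np.array([[40, 25],
                  [10, 45]])

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(result.statistic, result.pvalue)
```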

The future of the past

The authors found that prompting modern language models to adopt a historical voice did not yield convincing results: fewer than two-thirds of the outputs were judged plausible by human readers, and even that figure may overstate performance.

In many cases, responses included explicit signals that the model was speaking from a present-day perspective; phrasings such as ‘In 1914, we do not yet know…’ or ‘As of 1914, I am not familiar with…’ were so common that they appeared in a fifth of completions. Disclaimers of this kind made it clear that the model was simulating history from the outside rather than writing from within it.

The authors state:

“The underperformance of in-context learning is unfortunate, because these are the easiest and cheapest methods for AI-based historical research. We emphasize that we have not exhaustively explored these approaches.

“In-context learning may yet prove adequate for some subsets of research areas, or may improve in the future. However, our initial evidence is not encouraging.”

The authors conclude that fine-tuning a commercial model on historical text can produce stylistically convincing output at minimal cost, but does not entirely eliminate traces of a modern perspective. Training a model from scratch on period material alone avoids such anachronism, but demands far greater resources and yields less fluent output.

Neither method offers a complete solution, and for now, attempts to simulate historical voices appear to involve a trade-off between authenticity and coherence. The authors call for further work to clarify how best to navigate that tension.

Conclusion

Perhaps the most interesting question raised by the new paper concerns authenticity. Imperfect as they are, loss functions such as LPIPS and metrics such as SSIM give computer vision researchers at least a common methodology for evaluating output against ground truth.

By contrast, there is no ground truth when generating new text in the style of a bygone era; there is only the attempt to inhabit a vanished cultural perspective. The literary traces are all the evidence that remains, and the cultural consciousness in which they arose is beyond recovery, perhaps beyond imagination, so that any attempt to reconstruct that mindset from its traces is itself an act of interpretation.

Even at a practical level, the foundations of modern language models are shaped by current norms and data, and risk reinterpreting or suppressing ideas that would have seemed reasonable or unremarkable to an Edwardian reader, but which now register as (frequently offensive) artifacts of bias, inequality, or misrepresentation.

One wonders, then, not only whether we could ever create such a simulation, but whether it might not, in some sense, push back against us.

First published on Friday, May 2, 2025
