Film and television are often seen as creative, open-ended industries, but they have long been risk-averse. High production costs (which, at least for US projects, can quickly erode the advantages of cheaper foreign location offsets) and fragmented production environments make it difficult for independent companies to absorb major losses.
Over the past decade, therefore, the industry has shown growing interest in whether machine learning can detect trends and patterns in how viewers respond to proposed film and television projects.
The main data sources remain sample-based methods such as the Nielsen system (which offers scale but is rooted in television and advertising) and focus groups, which trade scale for curated demographics. The latter category also includes scorecard feedback from free movie previews, though by that point most of the production budget has already been spent.
“Big Hit” Theory
Early machine learning systems relied on traditional analytical methods such as linear regression, k-nearest neighbors, stochastic gradient descent, decision trees, random forests, and neural networks.
A 2018 study evaluated episode performance based on combinations of characters and writers (most episodes were written by more than one person). Source: https://arxiv.org/pdf/1910.12589
Arguably the most relevant related work unfolding in the wild (though it is often criticized) lies in the area of recommendation systems.
A typical video recommendation pipeline. Videos in the catalog are indexed using manually annotated and automatically extracted features. Recommendations are generated in two stages, by first selecting candidate videos and then ranking them according to user profiles inferred from viewing history. Source: https://www.frontiersin.org/journals/big-data/articles/10.3389/fdata.2023.1281614/full
However, approaches of this kind analyze projects that are already successful, and it is not clear what kind of ground truth would be most applicable to future shows and films. Combined with shifting audience tastes and continually improved and augmented data sources, this means that consistent historical data is usually unavailable.
This is an instance of the cold start problem, in which recommendations must be made for candidates without any prior interaction data. In such cases traditional collaborative filtering collapses, because it relies on patterns of user behavior (such as views, ratings, and shares) to generate predictions, and for most new movies and shows there is simply not enough audience feedback to support these methods.
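To see the failure mode concretely, the toy sketch below (illustrative only, not from the paper) shows that a brand-new title contributes an all-zero column to the user-item interaction matrix, so item-item collaborative filtering has no signal with which to rank it:

```python
# Toy illustration (not from the paper): why collaborative filtering fails
# for a title that has no interaction history yet.
import numpy as np

# Rows = users, columns = existing titles (1 = watched).
interactions = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
], dtype=float)

new_title = np.zeros((3, 1))           # a just-released film: no views yet
matrix = np.hstack([interactions, new_title])

# Item-item cosine similarity; the new title's column has zero norm,
# so its similarity to every other title is zero and it cannot be ranked.
norms = np.linalg.norm(matrix, axis=0)
norms[norms == 0] = 1.0                # avoid division by zero
sims = (matrix.T @ matrix) / np.outer(norms, norms)

print(sims[-1])                        # all zeros for the cold-start title
```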
Comcast Predicts
A new paper from Comcast Technology AI, in collaboration with George Washington University, proposes a solution to this problem: prompting language models with structured metadata about unreleased films.
Given inputs that include cast, genre, overview, content rating, mood, and awards, the model returns a ranked list of the titles most likely to become future hits.
The authors intend the model’s output to serve as a stand-in for audience interest when engagement data is not yet available, helping to avoid early bias toward already well-known titles.
The very short (three-page) paper, titled Predicting Movie Hits Before They Happen with LLMs, comes from six researchers at Comcast Technology AI and one at GWU.
“Our results show that LLMs can significantly outperform the baseline when film metadata is used. This approach can act as an assistive system for multiple use cases, enabling the automatic scoring of the large volumes of new content released daily and weekly.
“By providing early insights before editorial teams or algorithms have accumulated sufficient interaction data, LLMs can streamline the content review process.
“With continuous improvements in LLM efficiency and the rise of recommendation agents, the insights from this work are valuable and adaptable to a wide range of domains.”
If the approach proves robust, it could reduce the industry’s reliance on retrospective metrics and heavily promoted titles by introducing a scalable way to flag promising content before release. Editorial teams could then receive early, metadata-driven forecasts of audience interest, rather than waiting for user behavior to signal demand, and could redistribute exposure across a wider range of new releases.
Methods and Data
The authors outline a four-stage workflow: building a dedicated dataset of unreleased film metadata; establishing baseline models for comparison; evaluating the LLMs using both natural-language inference and embedding-based prediction; and optimizing the outputs through prompt engineering in generative mode, using Meta’s Llama 3.1 and 3.3 language models.
The authors state that publicly available datasets offered no direct way to test their hypothesis (most existing collections predate LLMs and lack detailed metadata), so they built a benchmark dataset from the Comcast entertainment platform.
The dataset tracks newly released films and whether they later became popular, with popularity defined through user interactions.
The collection focuses on films rather than series, with the authors stating:
“We focus on movies, which are more influenced by external knowledge than television series, improving the reliability of our experiments.”
Labels were assigned by analyzing how long it took a title to become popular, across time windows and list sizes of varying lengths. The LLMs were prompted with metadata fields such as genre, overview, content rating, release era, cast, crew, mood, awards, and character types.
For comparison, the authors used two baselines: random ordering, and a Popular Embedding (PE) model (to which we will come shortly).
The project used large language models as the primary ranking method, generating an ordered list of films with predicted popularity scores and accompanying justifications. These outputs were shaped by prompt engineering strategies designed to guide the model’s predictions with structured metadata.
The prompting strategy framed the model as an “editorial assistant” tasked with identifying which upcoming films were most likely to become popular, based solely on structured metadata, and with sorting a fixed list of titles, without introducing any new items, returning the output in JSON format.
Each response consisted of a ranked list, assigned popularity scores, justifications for the rankings, and references to prior examples that influenced the outcome. These multiple levels of metadata were intended to improve the model’s contextual understanding and its ability to anticipate future audience trends.
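The paper does not publish its exact prompt or inference setup, so the sketch below is only an illustration of that general shape: the metadata fields mirror those listed above, while `call_llm` is a hypothetical stand-in for whatever Llama 3.1/3.3 endpoint is used.

```python
# Illustrative sketch only: the exact prompt and API are not published in the paper.
# `call_llm` is a hypothetical stand-in for a hosted Llama 3.1/3.3 inference endpoint.
import json

def build_ranking_prompt(films: list[dict]) -> str:
    catalog = json.dumps(films, indent=2)
    return (
        "You are an editorial assistant. Using only the structured metadata below, "
        "rank these upcoming films by how likely they are to become popular. "
        "Do not add or remove titles. Return JSON of the form "
        '{"ranking": [{"title": ..., "score": ..., "justification": ...}]}.\n\n'
        f"Films:\n{catalog}"
    )

films = [
    {"title": "Film A", "genre": "thriller", "overview": "…", "rating": "PG-13",
     "release_era": "2020s", "cast": ["…"], "mood": "tense", "awards": 2},
    {"title": "Film B", "genre": "comedy", "overview": "…", "rating": "R",
     "release_era": "2020s", "cast": ["…"], "mood": "light", "awards": 0},
]

prompt = build_ranking_prompt(films)
# response = call_llm(prompt)                  # hypothetical model call
# ranking = json.loads(response)["ranking"]    # parse the ranked list back out
```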
Tests
The experiments followed two main stages. First, the authors tested several model variants to establish baselines, identifying which versions performed better than a random ordering approach.
Second, they tested the large language models in generative mode, raising the difficulty of the task by comparing their output against the stronger baseline rather than a random ranking.
This meant the models had to outperform a system that had already shown some ability to predict which films would become popular. As a result, the authors argue, the evaluation better reflects real-world conditions, in which editorial teams and recommendation systems rarely choose between a model and chance, but between competing systems with varying levels of predictive ability.
The Benefits of Ignorance
An important constraint in this setup was the temporal gap between the models’ knowledge cutoff and the films’ actual release dates. The language models were trained on data ending six to twelve months before the films became available, preventing access to any post-release information and ensuring that predictions were based entirely on metadata rather than on memorized audience responses.
Baseline Evaluation
To construct the baselines, the authors used three embedding models to generate semantic representations of the film metadata: BERT; Linq-Embed-Mistral (7B); and Llama 3.3 70B, quantized to 8-bit precision to fit the constraints of the experimental environment.
Linq-Embed-Mistral was selected for its top-level performance on the MTEB (Massive Text Embedding Benchmark) leaderboard.
Each model produced vector embeddings of the candidate films, which were compared against the average embedding of the top popular titles from the weeks preceding each film’s release.
Popularity was inferred from the cosine similarity between these embeddings, with higher similarity scores indicating higher predicted appeal. Each model’s ranking accuracy was then assessed against a random ordering baseline.
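As a rough sketch of this popular-embedding idea (the specific embedding model and the way metadata is flattened to text here are assumptions, not details from the paper), candidates can be ranked by cosine similarity to the centroid of recently popular titles:

```python
# Rough sketch of the embedding baseline described above: rank candidate films
# by cosine similarity to the average embedding of recently popular titles.
# The model name and metadata flattening are assumptions, not from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in embedding model

def to_text(film: dict) -> str:
    # Flatten structured metadata into a single string for embedding.
    return " | ".join(f"{k}: {v}" for k, v in film.items())

def rank_candidates(popular: list[dict], candidates: list[dict]) -> list[tuple[str, float]]:
    pop_emb = model.encode([to_text(f) for f in popular], normalize_embeddings=True)
    centroid = pop_emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)

    cand_emb = model.encode([to_text(f) for f in candidates], normalize_embeddings=True)
    scores = cand_emb @ centroid                  # cosine similarity to the centroid
    order = np.argsort(-scores)
    return [(candidates[i]["title"], float(scores[i])) for i in order]
```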
Performance improvements of the embedding models over the random baseline. Each model was tested with four metadata configurations: V1 includes only genre; V2 includes only the overview; V3 combines genre, overview, content rating, character types, mood, and release era; and V4 adds cast, crew, and awards to the V3 configuration. The results show how richer metadata inputs affect ranking accuracy. Source: https://arxiv.org/pdf/2505.02693
The results (above) show that BERT V4 and Linq-Embed-Mistral 7B produced the strongest improvements in identifying the top three most popular titles, though both fell slightly short when it came to predicting the single most popular item.
BERT V4 was ultimately selected as the baseline for comparison with the LLMs, because its efficiency and overall gains outweighed its limitations.
LLM Ranking
The researchers used two ranking approaches to assess performance: pairwise and listwise. Pairwise ranking evaluates whether the model correctly orders one item relative to another; listwise ranking considers the accuracy of the entire ordered list of candidates.
This combination made it possible to assess not only whether individual movie pairs were ordered correctly (local accuracy), but also how well the full list of candidates reflected the true popularity order (global accuracy).
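As an illustration of the distinction (this is not the paper’s evaluation code), local pairwise agreement between a predicted ranking and the true popularity order can be measured by counting correctly ordered pairs, while the listwise view scores the whole list at once, as with the NDCG metric described below:

```python
# Illustration only: pairwise (local) agreement between a predicted ranking
# and the true popularity ranking, as a fraction of correctly ordered pairs.
from itertools import combinations

def pairwise_accuracy(predicted: list[str], truth: list[str]) -> float:
    pred_pos = {t: i for i, t in enumerate(predicted)}
    true_pos = {t: i for i, t in enumerate(truth)}
    pairs = list(combinations(truth, 2))
    agree = sum(
        (pred_pos[a] < pred_pos[b]) == (true_pos[a] < true_pos[b]) for a, b in pairs
    )
    return agree / len(pairs)

predicted = ["Film C", "Film A", "Film B", "Film D"]
truth     = ["Film A", "Film C", "Film B", "Film D"]
print(pairwise_accuracy(predicted, truth))   # 5 of 6 pairs agree -> ~0.83
```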
To avoid any loss of performance, full unquantized models were employed here, ensuring a consistent and repeatable comparison between the LLM-based predictions and the embedding-based baselines.
Metrics
Notably, both ranking-based and classification-based metrics were used to assess how effectively the language models predicted a film’s popularity and identified the three most popular titles.
Four metrics were applied: Accuracy@1 measured how often the most popular item appeared in the first position; Reciprocal Rank captured how high the true top item appeared in the predicted list, by taking the inverse of its position; Normalized Discounted Cumulative Gain (NDCG@k) rated how well the overall ranking matched actual popularity, with higher scores indicating better alignment; and Recall@3 measured the proportion of truly popular titles that appeared in the model’s top three predictions.
Since most user engagement happens near the top of ranked menus, the evaluation focused on low values of k, reflecting real-world use.
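As a minimal sketch of how these four metrics could be computed for a single ranked list (assuming binary relevance labels; this is not the authors’ evaluation code):

```python
# Minimal sketch: the four reported metrics for one ranked list of candidates,
# where relevance[i] = 1 means the title in position i truly became popular.
import math

def accuracy_at_1(relevance: list[int]) -> float:
    return float(relevance[0] == 1)

def reciprocal_rank(relevance: list[int]) -> float:
    for i, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def ndcg_at_k(relevance: list[int], k: int) -> float:
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_3(relevance: list[int]) -> float:
    total = sum(relevance)
    return sum(relevance[:3]) / total if total else 0.0

relevance = [0, 1, 1, 0, 1]   # predicted order; three titles truly became popular
print(accuracy_at_1(relevance), reciprocal_rank(relevance),
      ndcg_at_k(relevance, 3), recall_at_3(relevance))
```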
Performance improvements of the large language models over BERT V4, measured as percentage gains across ranking metrics. Results were averaged over ten runs per model-prompt combination, with the top two values highlighted. The reported numbers reflect the average percentage improvement across all metrics.
The performance of Llama 3.1 (8B), 3.1 (405B), and 3.3 (70B) was assessed by measuring metric improvements over the previously established BERT V4 baseline. Each model was tested with a series of prompts ranging from minimal to information-rich, to investigate the effect of input detail on prediction quality.
The authors state:
“The best performance is achieved when using Llama 3.1 (405B) and Llama 3.3 (70B) with the most informative prompt. Based on the observed trends, more complex language models with complex, long prompts (MD V4) generally lead to improved performance across the various metrics, but are sensitive to the type of information added.”
Including cast awards in the prompt, in this case the number of major awards received by each film’s top five billed actors, improved performance. This richer metadata formed part of the most detailed prompt configuration, which outperformed simpler versions that excluded cast recognition. The advantage was most evident in the larger models, Llama 3.1 (405B) and 3.3 (70B), both of which showed stronger prediction accuracy when given this additional signal of prestige and audience familiarity.
In contrast, the smallest model, Llama 3.1 (8B), improved when slightly more detail was added, progressing from genre to overview, but its performance declined as further fields were included, suggesting that it could not effectively integrate complex prompts and generalized more weakly as a result.
When prompts were limited to genre alone, all models performed poorly relative to the baseline, indicating that such limited metadata is insufficient to support meaningful predictions.
Conclusion
LLMs have become the poster child of AI, yet much remains unknown about what they can actually do across different industries, so giving them a shot makes sense.
In this particular case, as with stock markets and weather forecasting, the extent to which historical data can serve as the basis for future forecasts is limited. For movies and TV shows, the delivery method is itself currently a moving target, in contrast to the period between 1978 and 2011, when cable, satellite, and portable media (VHS, DVD, and so on) represented a series of transitory or evolving historical disruptions.
Nor can any prediction method account for the extent to which the success or failure of other productions may affect the viability of a proposed property, even though this is frequently the case in a film and television industry that loves to ride a trend.
Nevertheless, used thoughtfully, LLMs could help strengthen recommendation systems during the cold-start phase and provide useful support across a variety of prediction methods.
First released on Tuesday, May 6, 2025