Detecting Small but Significant AI Edits in Real Video


In 2019, U.S. House Speaker Nancy Pelosi was the target of a rather low-tech, deepfake-style attack, in which real video of her was edited to make her appear drunk.

The manipulation required only simple audiovisual editing, not AI, but it remains an important example of how subtle changes to genuine audiovisual output can have a devastating effect.

At the time, the deepfake scene was dominated by the autoencoder-based face-replacement systems that had debuted in late 2017, and whose quality had not improved significantly since then. Those early systems would have struggled to create small but significant alterations of this kind convincingly; localized and expressive editing of this type has only recently become a strand of mainstream research.

The 2022 “Neural Emotion Director” framework alters the mood of famous faces. Source: https://www.youtube.com/watch?v=li6w8prdmjq

Things are very different now. The film and television industries are seriously interested in the post-production alteration of real performances using machine learning approaches; this emerging culture of de facto post hoc perfectionism has even come under recent criticism.

Anticipating (or arguably creating) this demand, the image and video synthesis research scene has produced a wide range of projects that offer “local editing” of facial captures rather than outright replacement: projects such as Stitch It in Time, ChatFace, MagicFace, and DISCO, among others.

Expression editing with the January 2025 project MagicFace. Source: https://arxiv.org/pdf/2501.02260

New face, new wrinkles

However, the enabling technologies are developing far faster than the means of detecting them. Nearly all of the deepfake detection methods that surface in the literature are chasing yesterday’s deepfake techniques on yesterday’s datasets. Until this week, none had addressed the creeping potential of AI systems to make small, localized alterations to video.

Now, a new paper from India has redressed this, with a system that seeks out faces that have been edited (rather than replaced) through AI-based techniques:

Detection of subtle local edits in deepfakes: a real video is altered to produce fakes with nuanced changes such as a frown, modified gender traits, and a shift of expression toward disgust (illustrated here with a single frame). Source: https://arxiv.org/pdf/2503.22121

The authors’ system aims to identify deepfakes that involve subtle, localized facial manipulation. Rather than focusing on global inconsistencies or identity mismatches, the approach targets finer-grained changes such as slight shifts of expression and small edits to specific facial features.

The method draws on the delineation of facial Action Units (AUs) in the Facial Action Coding System (FACS), which defines 64 individual mutable regions of the face.

Some of the 64 constituent expression parts of FACS. Source: https://www.cs.cmu.edu/~face/facs.htm

The authors evaluated their approach against a variety of recent editing methods, and report consistent performance improvements on both older datasets and much more recent attack vectors.


‘By using AU-based features to guide video representations learned via a Masked Autoencoder (MAE), our method effectively captures the localized changes that are crucial for detecting subtle facial edits.

“This approach yields a unified latent representation that encodes both localized edits and broader alterations in face-centered videos, providing a comprehensive and adaptable solution for deepfake detection.”

The new paper is titled Detecting Localized Deepfake Manipulations Using Action Unit-Guided Video Representations, and comes from three authors at the Indian Institute of Technology at Madras.

Method

In line with the approach taken by VideoMAE, the new method begins by applying face detection to a video and sampling evenly spaced frames centered on the detected faces. These frames are then divided into small 3D partitions (i.e., temporally-aware patches), each capturing local spatial and temporal detail.

Schema for the new method. The input video is processed with face detection to extract evenly spaced, face-centered frames, which are divided into “tubular” patches and passed through an encoder that fuses the latent representations from two pretrained pretext tasks. The resulting vector is then used by a classifier to determine whether the video is real or fake.

Each 3D patch contains a fixed-size window of pixels (i.e., 16×16) drawn from a small number of consecutive frames (i.e., two). This lets the model learn short-term motion and expression changes: not just how the face looks, but how it moves.

The patches are embedded and positionally encoded before being passed to an encoder designed to extract features that can distinguish real videos from fakes.
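As a rough illustration of this tubelet-style patching, the sketch below (in PyTorch) carves a 16-frame clip into 16×16×2 patches with a 3D convolution and adds a learnable positional encoding. The class name, dimensions, and VideoMAE-style hyper-parameters are assumptions for demonstration, not values taken from the paper’s code.

```python
# Minimal sketch of tubelet patch embedding, assuming VideoMAE-style
# hyper-parameters (16x16 spatial patches spanning 2 consecutive frames);
# names and sizes are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, tubelet_size=2,
                 in_chans=3, embed_dim=768, num_frames=16):
        super().__init__()
        self.num_patches = (num_frames // tubelet_size) * (img_size // patch_size) ** 2
        # A 3D convolution carves the clip into non-overlapping
        # (tubelet_size x patch_size x patch_size) blocks and projects each to a vector.
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=(tubelet_size, patch_size, patch_size),
                              stride=(tubelet_size, patch_size, patch_size))
        # Learnable positional encoding, one vector per tubelet token.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, video):             # video: (B, C, T, H, W)
        x = self.proj(video)              # (B, D, T', H', W')
        x = x.flatten(2).transpose(1, 2)  # (B, N, D) sequence of tubelet tokens
        return x + self.pos_embed

tokens = TubeletEmbed()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)                       # torch.Size([1, 1568, 768])
```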

The authors acknowledge that this is especially difficult when dealing with subtle manipulations, and address the problem by constructing an encoder that combines two distinct types of learned representation, fusing them to create a more sensitive and generalizable feature space for detecting localized edits.

Pretext Tasks

The first of these representations comes from an encoder trained on a masked autoencoding task. With the video split into 3D patches, most of which are hidden, the encoder learns to reconstruct the missing parts, forcing it to capture important spatiotemporal patterns such as facial motion and consistency over time.

Pretext-task training involves masking portions of the video input and reconstructing either the original frames or per-frame action unit maps, depending on the task, using an encoder-decoder setup.
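A minimal sketch of how such a masked-reconstruction pretext can be set up is shown below, assuming a 50% masking ratio (matching the proportion reported later in the article) and placeholder encoder layers; it is not the authors’ implementation.

```python
# Illustrative sketch of the masked-reconstruction pretext; the encoder here
# is a stand-in Transformer, and the masking ratio is an assumption.
import torch
import torch.nn as nn

def random_tube_mask(num_tokens, mask_ratio=0.5, device="cpu"):
    """Return a boolean mask (True = hidden) over the tubelet tokens."""
    num_masked = int(num_tokens * mask_ratio)
    ids = torch.rand(num_tokens, device=device).argsort()
    mask = torch.zeros(num_tokens, dtype=torch.bool, device=device)
    mask[ids[:num_masked]] = True
    return mask

# toy dimensions: 1568 tubelet tokens of width 768 (see the embedding sketch above)
tokens = torch.randn(1, 1568, 768)
mask = random_tube_mask(tokens.shape[1], mask_ratio=0.5)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, 8, batch_first=True), 2)
visible = encoder(tokens[:, ~mask])   # encode only the visible tokens
# A lightweight decoder would then reconstruct the masked patches,
# supervised with an L1 loss against the original pixel values.
print(visible.shape)
```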

However, the authors contend that this alone does not provide sufficient sensitivity to detect fine-grained edits, and so introduce a second encoder trained to detect facial Action Units (AUs). For this task, the model learns to reconstruct dense AU maps for each frame from partially masked inputs, encouraging it to focus on localized muscle activity, which is where many subtle deepfake edits occur.

Further examples of Facial Action Units (FAUs, or AUs). Source: https://www.eiagroup.com/the-facial-action-coding-system/
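The hypothetical head below shows one way a dense, per-frame AU map could be decoded from tubelet tokens. The 16 action units match the count reported later in the article, but all layer shapes and the decoding scheme itself are illustrative assumptions rather than the paper’s architecture.

```python
# A minimal sketch of an AU-map pretext head: each tubelet token is decoded
# into per-frame AU activations for its own spatial location. Shapes assumed.
import torch
import torch.nn as nn

class AUMapHead(nn.Module):
    def __init__(self, embed_dim=768, num_aus=16, map_size=14, frames_per_token=2):
        super().__init__()
        self.decode = nn.Linear(embed_dim, num_aus * frames_per_token)
        self.num_aus, self.map_size, self.fpt = num_aus, map_size, frames_per_token

    def forward(self, tokens):                        # (B, N, D), N = T' * H' * W'
        b, n, _ = tokens.shape
        t = n // (self.map_size ** 2)
        maps = self.decode(tokens)                    # (B, N, num_aus * fpt)
        maps = maps.view(b, t, self.map_size, self.map_size, self.fpt, self.num_aus)
        # rearrange to (B, frames, num_aus, H', W')
        maps = maps.permute(0, 1, 4, 5, 2, 3).reshape(
            b, t * self.fpt, self.num_aus, self.map_size, self.map_size)
        return maps

pred = AUMapHead()(torch.randn(1, 1568, 768))
print(pred.shape)   # torch.Size([1, 16, 16, 14, 14]) -> 16 frames x 16 AU maps
# Training would minimise the L1 distance between these maps and
# ground-truth AU maps extracted from the unmasked frames.
```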

Once both encoders are pretrained, their outputs are combined using cross-attention. Rather than simply merging the two sets of features, the model uses the AU-based features as queries that guide attention over the spatial features learned from masked autoencoding. In effect, the action unit encoder tells the model where to look.
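The sketch below illustrates this kind of AU-guided cross-attention, with the AU tokens acting as queries over the masked-autoencoder tokens; the module name and dimensions are assumptions for demonstration, not the paper’s exact fusion design.

```python
# Hedged sketch of the fusion step: AU-guided tokens act as queries that
# cross-attend into the masked-autoencoder tokens.
import torch
import torch.nn as nn

class AUGuidedFusion(nn.Module):
    def __init__(self, embed_dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, au_tokens, mae_tokens):
        # query = AU features ("where to look"), key/value = MAE features
        fused, _ = self.attn(query=au_tokens, key=mae_tokens, value=mae_tokens)
        return self.norm(fused + au_tokens)   # residual keeps the AU context

au = torch.randn(1, 1568, 768)    # from the AU-detection encoder
mae = torch.randn(1, 1568, 768)   # from the masked-autoencoding encoder
print(AUGuidedFusion()(au, mae).shape)   # torch.Size([1, 1568, 768])
```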


The result is a fused latent representation intended to capture both the broader motion context and localized, expression-level detail. This combined feature space is then used for the final classification task: predicting whether a video is real or manipulated.
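As an illustration, a minimal classification head over the fused tokens might simply pool the sequence and emit a single real/fake logit; the paper’s actual head is not described in the material quoted here, so the following is an assumption.

```python
# Illustrative classifier head: mean-pool the fused tokens and project to a
# single real/fake logit. A placeholder, not the paper's reported design.
import torch
import torch.nn as nn

class DeepfakeHead(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, 1)

    def forward(self, fused_tokens):                  # (B, N, D)
        pooled = fused_tokens.mean(dim=1)             # average over all tubelet tokens
        return self.classifier(pooled).squeeze(-1)    # (B,) logit: >0 leans "fake"

logit = DeepfakeHead()(torch.randn(2, 1568, 768))
print(torch.sigmoid(logit))                           # per-clip fake probability
```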

Data and Testing

Implementation

The authors implemented the system by preprocessing the input videos with the FaceXZoo PyTorch-based face detection framework, obtaining 16 face-centered frames from each clip. The pretext tasks described above were then trained on the CelebV-HQ dataset, which comprises 35,000 high-quality facial videos.
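A hedged sketch of this preprocessing stage is shown below, sampling 16 evenly spaced frames and cropping around the detected face. OpenCV’s Haar cascade stands in here for the FaceXZoo detector, and the function name and crop size are assumptions.

```python
# Illustrative preprocessing: sample 16 evenly spaced frames from a clip and
# crop around the detected face. A Haar cascade stands in for FaceXZoo.
import cv2
import numpy as np

def face_centered_frames(video_path, num_frames=16, size=224):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    crops = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        faces = detector.detectMultiScale(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        if len(faces):
            x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face
            frame = frame[y:y + h, x:x + w]
        crops.append(cv2.resize(frame, (size, size)))
    cap.release()
    return np.stack(crops)   # (num_frames, size, size, 3)
```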

From the source paper: examples from the CelebV-HQ dataset used in the new project. Source: https://arxiv.org/pdf/2207.12393

Half of the data examples were masked during pretraining, encouraging the system to learn general principles rather than overfitting to the source data.

For the masked frame reconstruction task, the model was trained to predict the missing regions of video frames using an L1 loss, minimizing the difference between the original and reconstructed content.

For the second task, the model was trained to generate maps of 16 facial action units, each representing subtle muscle movements in areas such as the brows, eyelids, nose, and lips, again supervised with an L1 loss.
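Both pretext objectives reduce to a mean absolute error between prediction and target, as in the minimal example below (with random tensors standing in for model outputs and ground truth).

```python
# Both pretext tasks are supervised with an L1 (mean absolute error) loss.
import torch
import torch.nn.functional as F

pred_frames, true_frames = torch.randn(2, 3, 16, 224, 224), torch.randn(2, 3, 16, 224, 224)
pred_au_maps, true_au_maps = torch.randn(2, 16, 16, 14, 14), torch.randn(2, 16, 16, 14, 14)

recon_loss = F.l1_loss(pred_frames, true_frames)   # masked frame reconstruction
au_loss = F.l1_loss(pred_au_maps, true_au_maps)    # action-unit map reconstruction
print(recon_loss.item(), au_loss.item())
```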

After pretraining, the two encoders were fused and fine-tuned for deepfake detection on the FaceForensics++ dataset, which contains both real and manipulated videos.

The FaceForensics++ dataset has been the cornerstone of deepfake detection since 2017, though it is now rather outdated with regard to the latest facial synthesis techniques. Source: https://www.youtube.com/watch?v=x2g48q2i2zq

To account for class imbalance, the authors used focal loss (a variant of cross-entropy loss), which emphasizes the more challenging examples during training.
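A sketch of the standard binary focal loss formulation is shown below; the alpha and gamma values are common defaults, not the settings reported in the paper.

```python
# Binary focal loss in the common formulation (Lin et al.); parameter values
# are illustrative defaults, not the paper's reported settings.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits: raw scores (B,); targets: 0 = real, 1 = fake."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()   # down-weights easy examples

print(focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float()))
```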

All training was performed on a single RTX 4090 GPU with 24GB of VRAM, with a batch size of 8 for 600 epochs (complete passes over the data), and the weights for each pretext task were initialized from pretrained VideoMAE checkpoints.

Tests

Quantitative and qualitative evaluations were conducted against a variety of deepfake detection methods: FTCN; RealForensics; LipForensics; EfficientNet+ViT; Face X-Ray; Alt-Freezing; CADMM; LAA-Net; and SBI (Self-Blended Images). In all cases, source code was available for these frameworks.


The tests focused on locally edited deepfakes, in which only part of the source clip has been modified. The editing architectures used were: Diffusion Video Autoencoders (DVA); Stitch It in Time (STIT); Disentangled Face Editing (DFE); TokenFlow; VideoP2P; Text2LIVE; and FateZero. These methods span a range of generative approaches, including diffusion (as in DVA) and StyleGAN2 (as in STIT).

The authors state:

“To ensure comprehensive coverage of different facial manipulations, we incorporated edits to a variety of facial features and attributes. For facial feature editing, we modified eye size, eye-to-eye distance, nose ratio, nose-to-mouth distance, lip ratio, and cheek ratio. For facial attribute editing, we generated varied expressions such as smile, anger, disgust, and sadness.

“This diversity is essential for validating the robustness of our model against a wide range of localized edits. In total, we generated 50 videos for each of the above editing methods, confirming the strong generalization of our method for deepfake detection.”

Older deepfake datasets were also included in the round, namely Celeb-DFv2 (CDF2); DeepFake Detection (DFD); the DeepFake Detection Challenge (DFDC); and WildDeepfake (DFW).

The evaluation metrics were area under the curve (AUC), average accuracy, and mean F1 score.
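For reference, these metrics can be computed from per-video scores as in the short scikit-learn example below; the labels and scores here are synthetic placeholders, not results from the paper.

```python
# Computing AUC, accuracy, and F1 from per-video fake-probability scores.
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

labels = np.array([0, 0, 1, 1, 1, 0])            # 0 = real, 1 = fake (synthetic)
scores = np.array([0.1, 0.4, 0.8, 0.9, 0.6, 0.2])
preds = scores > 0.5                             # threshold the fake probability

print(f"AUC={roc_auc_score(labels, scores):.2f}",
      f"Acc={accuracy_score(labels, preds):.2f}",
      f"F1={f1_score(labels, preds):.2f}")
```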

From the paper: comparison on recent localized deepfakes shows that the proposed method outperformed all rivals, with a 15-20% improvement in both AUC and average accuracy over the next-best approach.

The authors also provide a visual detection comparison for locally manipulated videos (reproduced only in part here, for lack of space):

A real video was modified using three different localized manipulations to produce fakes that remain visually similar to the original. Shown here are representative frames, along with the average fake-detection scores for each method. While existing detectors struggled with these subtle edits, the proposed model consistently assigned high fake probabilities, indicating greater sensitivity to localized changes.

The researchers comment:

‘[The] existing SOTA detection methods, [LAA-Net], [SBI], [AltFreezing], and [CADMM], experience a significant drop in performance on the latest deepfake generation methods. The current SOTA methods exhibit AUCs as low as 48-71%, demonstrating their poor generalization capabilities on recent deepfakes.

“Our method, in contrast, demonstrates robust generalization, achieving AUCs in the range of 87-93%. A similar trend is noticeable in average accuracy as well. As shown [below], our method also achieves consistently high performance on standard datasets, exceeding 90% AUC, and is competitive with recent deepfake detection models.”

Performance on traditional deepfake datasets shows that the proposed method remained competitive with leading approaches, indicating strong generalization across manipulation types.

The authors observe that these last tests involve models that could reasonably be regarded as outdated, having been introduced before 2020.

For a broader visual depiction of the new model’s performance, the authors provide an extensive table at the end of the paper, part of which is reproduced here:

In these examples, real videos were modified using three localized edits to produce fakes that remained visually similar to the originals. The average confidence scores across these manipulations show that the proposed method detected the fakes more reliably than the other leading approaches. See the final page of the source PDF for the complete results.

The authors argue that for localized edit detection, their method achieves confidence scores above 90%, while existing detection methods remain below 50% on the same task. They interpret this gap as evidence of both the sensitivity and the generalizability of their approach, and as an indication of the difficulty current techniques have with this kind of subtle facial manipulation.

To assess the model’s reliability under real-world conditions, and following the methodology established by CADMM, the authors tested its performance on videos subjected to common distortions, including adjustments to saturation and contrast, Gaussian blur, pixelation, block-based compression artifacts, and additive noise.
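For illustration, the snippet below applies simplified versions of a few of these perturbations (Gaussian blur, pixelation, additive Gaussian noise) with OpenCV and NumPy; the function names and parameter values are assumptions, not those used in the paper’s protocol.

```python
# Simplified robustness perturbations; parameter values are illustrative only.
import cv2
import numpy as np

def gaussian_blur(frame, ksize=7):
    return cv2.GaussianBlur(frame, (ksize, ksize), 0)

def pixelate(frame, factor=8):
    h, w = frame.shape[:2]
    small = cv2.resize(frame, (w // factor, h // factor), interpolation=cv2.INTER_LINEAR)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)

def add_gaussian_noise(frame, sigma=10):
    noise = np.random.normal(0, sigma, frame.shape)
    return np.clip(frame.astype(np.float32) + noise, 0, 255).astype(np.uint8)

frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)  # dummy frame
for fn in (gaussian_blur, pixelate, add_gaussian_noise):
    print(fn.__name__, fn(frame).shape)
```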

The results showed that detection accuracy remained largely stable across these perturbations. The only notable decline occurred with the addition of Gaussian noise, which caused a modest drop in performance; the other alterations had minimal effect.

An illustration of how detection accuracy shifts under different video distortions. The new method remained resilient in most cases, with only a small decline in AUC; the most significant drop occurred when Gaussian noise was introduced.

These findings suggest that the method’s ability to detect localized manipulations is not easily undermined by typical degradations in video quality, supporting its potential robustness in real-world settings.

Conclusion

AI-based manipulation exists in the public mind chiefly in the traditional notion of the deepfake, in which one person’s identity is imposed onto another person’s body. This conception is slowly being updated to acknowledge the more insidious capabilities of generative video systems (the new breed of video deepfakes) and of latent diffusion models (LDMs) in general.

It is therefore reasonable to expect that the kind of local editing the new paper is concerned with may not rise to public attention until a pivotal, Pelosi-style event occurs.

Nevertheless, just as the actor Nicolas Cage has expressed consistent concern about the possibility of post-production processes altering an actor’s performance, we too should perhaps encourage greater awareness of this kind of “subtle” video adjustment.

First published Wednesday, April 2, 2025
