Correcting a Limited Understanding of Mirrors and Reflections in Diffusion Models


Since generative AI first began to attract public interest, the computer vision research field has taken a growing interest in developing AI models that can understand and replicate physical laws. However, the challenge of teaching machine learning systems to simulate phenomena such as gravity and liquid dynamics has been a significant focus of research efforts for at least the past five years.

Since latent diffusion models (LDMs) came to dominate the generative AI scene in 2022, researchers have increasingly focused on the LDM architecture's limited ability to understand and replicate physical phenomena. Now, this issue has gained greater prominence with the landmark development of OpenAI's generative video model Sora, and the (arguably) more consequential recent open-source releases of the video models Hunyuan Video and Wan 2.1.

It Reflects Badly

Most research aimed at improving LDMs' understanding of physics focuses on areas such as gait simulation, particle physics, and other aspects of Newtonian motion. These areas attract attention because inaccuracies in basic physical behaviors would immediately undermine the credibility of AI-generated video.

However, a small but growing strand of research concentrates on one of LDMs' greatest weaknesses: reflection.

An example of 'reflection failure', contrasted with the researchers' own approach, from the paper 'Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections'. Source: https://arxiv.org/pdf/2409.14677

This issue was also a challenge throughout the CGI era, and remains one in the field of video games, where ray-tracing algorithms simulate the path of light as it interacts with surfaces. Ray tracing calculates how virtual rays bounce off or pass through objects, in order to create realistic reflections, refractions, and shadows.

However, because computational cost rises significantly with each additional bounce, real-time applications must trade accuracy for latency by limiting the number of permitted ray bounces.

A representation of practically calculated light-ray paths in a traditional 3D-based (i.e., CGI) scenario, using technologies and principles first developed in the 1960s and brought to full fruition between 'Tron' (1982) and 'Jurassic Park' (1993). Source: https://www.unrealengine.com/en-us/explainers/ray-tracing/what-is-real-time-ray-tracing

For example, rendering a chrome teapot in front of a mirror could involve a ray-tracing process in which rays bounce repeatedly between the reflective surfaces, creating an almost infinite loop with little practical benefit to the final image. In most cases, a reflection depth of two or three bounces already exceeds what the viewer can perceive. A single bounce would yield a black mirror, since light must complete at least two journeys to form a visible reflection.

Each additional bounce increases computational cost steeply, often doubling render times, which makes the handling of reflections one of the most significant opportunities for improving the quality of ray-traced output.
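To make the bounce-budget trade-off concrete, below is a minimal, illustrative sketch of depth-limited recursive ray tracing in Python. It is not drawn from any particular renderer: the `ray`, `scene`, and material attributes are hypothetical stand-ins, and the structure simply shows why each extra reflection level multiplies cost.

```python
import numpy as np

MAX_DEPTH = 3        # two or three bounces usually suffice for perceived realism
BLACK = np.zeros(3)  # colour returned when the ray budget runs out

def trace(ray, scene, depth=0):
    """Return the RGB colour seen along `ray`, recursing on reflective hits.
    `ray` and `scene` are hypothetical objects standing in for a real renderer."""
    if depth >= MAX_DEPTH:
        return BLACK                 # stop bouncing: mirror-in-mirror recursion ends here

    hit = scene.intersect(ray)       # nearest surface hit, or None
    if hit is None:
        return scene.background_colour

    colour = hit.material.shade(hit, scene.lights)   # direct lighting term

    if hit.material.reflectivity > 0:
        # Every recursion level can spawn another full trace, so render cost
        # grows steeply with MAX_DEPTH; hence the hard cap above.
        bounced = ray.reflect(hit.point, hit.normal)
        colour = colour + hit.material.reflectivity * trace(bounced, scene, depth + 1)

    return colour
```

With MAX_DEPTH set to 1, a mirror in such a scene would render black, matching the two-journey requirement described above.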

Naturally, reflections occur, and are essential to ray-traced realism, in scenarios far less obvious than a mirror: the surface of a street or a battlefield after rain; shop windows and glass doorways that reflect the opposite side of the street; or the eyeglasses of depicted characters, in which facing objects and environments may be required to appear.

Simulated twin reflections achieved through traditional compositing for an iconic scene in 'The Matrix' (1999).

Image Problems

For this reason, frameworks that were popular before the advent of diffusion models, such as Neural Radiance Fields (NeRF), and more recent challengers such as Gaussian Splatting, have had their own struggles to render reflections in a natural way.

The Ref2-NeRF project (image below) proposed a NeRF-based modeling method for scenes that contain a glass case. This method modeled refraction and reflection using separate components, which allowed the researchers to estimate the surfaces where refraction occurred, specifically glass surfaces, and to separate and model both the direct and reflected light components.

Examples from the Ref2-NeRF paper. Source: https://arxiv.org/pdf/2311.17116

Other reflection-oriented NeRF solutions from the past four or five years include NeRFReN, Reflecting Reality, and Meta's 2024 Planar Reflection-Aware Neural Radiance Fields project.


As for Gaussian Splatting (GSplat), papers such as Mirror-3DGS and Reflective Gaussian, among others, have offered solutions to the reflection problem, while the 2023 NeRO project proposed a bespoke method for incorporating reflective qualities into neural representations.

MirrorVerse

Getting a diffusion model to respect reflection logic is arguably harder than doing so with explicitly structural, non-semantic approaches such as Gaussian Splatting and NeRF. In a diffusion model, a rule of this kind can only be reliably embedded if the training data contains many varied examples across a wide range of scenarios, making it heavily dependent on the distribution and quality of the original dataset.

Traditionally, adding specific behavior of this kind falls within the purview of a LoRA or a fine-tune of the base model; but these are not ideal solutions, since a LoRA tends to skew output toward its own training data even when it is not prompted to, while fine-tunes, besides being expensive, can fork a major model irrevocably away from the mainstream, spawning a host of related custom tools that will never work with any other strain of the model, including the original one.

In general, improving diffusion models' handling of reflections requires training data that pays closer attention to the physics of reflection. However, many other areas also need similar special attention, and in the context of hyperscale datasets, where custom curation is costly and difficult, it is unrealistic to address every weakness in this way.

Nevertheless, solutions to the LDM reflection problem do crop up from time to time. One recent such initiative, from India, is the MirrorVerse project, which offers an improved dataset and training method capable of advancing the state of the art in this particular challenge in diffusion research.

Rightmost: MirrorVerse's results pitted against two prior approaches (the two center rows). Source: https://arxiv.org/pdf/2504.15397

As the example above (from the featured image of the new study's PDF) shows, MirrorVerse improves on recent offerings that tackle the same problem, but is far from perfect.

In the top-right image, for example, the reflection places the ceramic bottle further to the right than it should be; and in the image below it, the mirror should technically feature a complete reflection of the cup, but does not.

We therefore examine the new method not only because it may represent the current state of the art in diffusion-based reflection, but because it illustrates how data examples featuring reflectivity are likely to be entangled with particular actions and scenes; this may prove an enduringly awkward problem for latent diffusion models, in both static-image and video form.

For that reason, this particular capability may continue to rely on structure-specific approaches such as NeRF, GSplat, and traditional CGI.

The new paper is titled MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World, and comes from three researchers at the Vision and AI Lab, IISc Bangalore, and the Samsung R&D Institute in Bangalore. The paper has an associated project page and a dataset on Hugging Face, with source code released on GitHub.

Method

From the outset, the researchers point out the inability of models such as Stable Diffusion and Flux to respect reflection-based prompts, and illustrate the problem neatly.

From the paper: current state-of-the-art text-to-image models, SD3.5 and Flux, exhibit significant challenges in producing consistent, geometrically accurate reflections when prompted to generate them in a scene.

The researchers developed MirrorFusion 2.0, a diffusion-based generative model aimed at improving the photorealism and geometric accuracy of mirror reflections in synthetic imagery. Training for the model relied on the researchers' own newly curated dataset, generated with their MirrorGen2 pipeline, which was designed to address the generalization weaknesses observed in previous approaches.


MirrorGen2 extends earlier methodology through the introduction of random object positioning, randomized rotation, and explicit object grounding, with the aim of ensuring that reflections remain plausible across a wider range of object poses and placements relative to the mirror surface.

Schema for synthetic data generation in MirrorVerse: the dataset generation pipeline applies key augmentations by randomly positioning, rotating, and grounding objects in the scene using a 3D positioner. Objects are also paired in semantically consistent combinations to simulate complex spatial relationships and occlusions, allowing the dataset to capture more realistic interactions in multi-object scenes.

To further enhance the model's ability to handle complex spatial arrangements, the MirrorGen2 pipeline incorporates paired-object scenes, allowing the system to better represent occlusions and interactions between multiple elements in a reflective setting.

The paper states:

'Categories are manually paired to ensure semantic consistency, for example, pairing chairs with tables. During rendering, after placing and rotating the primary [object], an additional [object] is sampled from the paired category and positioned to prevent overlap, ensuring distinct spatial regions within the scene.'

Regarding explicit object grounding, the authors ensured that the generated objects are 'anchored' to the ground in the output synthetic data, rather than 'hovering', which can occur when synthetic data is generated at scale, or in highly automated ways.

Since dataset innovation is central to the novelty of the paper, we will proceed to coverage of this section earlier than usual.

Data and Testing

SynMirrorV2

The researchers' SynMirrorV2 dataset was devised to improve the diversity and realism of mirror reflection training data. It features 3D objects sourced from the Objaverse and Amazon Berkeley Objects (ABO) datasets, which were then refined through the OBJECT 3DIT filtering process, along with the filtering routine from the V1 MirrorFusion project, to remove unsuitable assets. This resulted in a refined pool of 66,062 objects.

Examples from the Objaverse dataset, which was used to create the curated dataset for the new system. Source: https://arxiv.org/pdf/2212.08051

Scene construction involved placing these objects on textured floors from CC-Textures, with HDRI backgrounds from the PolyHaven CGI repository, using either a full-wall or a tall rectangular mirror. Lighting was standardized with an area light positioned above and behind the objects, at a 45-degree angle. The objects were scaled to fit within a unit cube, and placed using a precomputed intersection of the mirror's and the camera's viewing frustums, to ensure visibility.

Randomized rotations were applied around the y-axis, and a grounding technique was used to prevent 'floating artifacts'.

To simulate more complex scenes, the dataset also incorporates multiple objects arranged according to semantically consistent pairings based on ABO categories. Secondary objects were placed so as to avoid overlap, creating 3,140 multi-object scenes that capture varied occlusion and depth relationships, as illustrated in the sketch below.
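As a rough illustration of the augmentations just described, and not the authors' actual pipeline code, the following Python sketch shows unit-cube scaling, random y-axis rotation, ground anchoring, and non-overlapping placement of a semantically paired secondary object. The category pairings and the bounding-box overlap test are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical manual category pairings, in the spirit of the paper's example.
CATEGORY_PAIRS = {"chair": "table", "cup": "saucer"}

def scale_to_unit_cube(vertices: np.ndarray) -> np.ndarray:
    """Uniformly scale a mesh (N x 3 vertices) to fit inside a unit cube."""
    extent = vertices.max(axis=0) - vertices.min(axis=0)
    return vertices / extent.max()

def rotate_y(vertices: np.ndarray, angle: float) -> np.ndarray:
    """Rotate vertices around the vertical (y) axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return vertices @ rot.T

def ground(vertices: np.ndarray) -> np.ndarray:
    """Anchor the mesh to the floor plane (y=0) to avoid 'floating' objects."""
    return vertices - [0.0, vertices[:, 1].min(), 0.0]

def aabb_overlap(a: np.ndarray, b: np.ndarray, gap: float) -> bool:
    """Crude overlap test on axis-aligned bounding boxes, padded by `gap`."""
    a_min, a_max = a.min(axis=0) - gap, a.max(axis=0) + gap
    b_min, b_max = b.min(axis=0), b.max(axis=0)
    return bool(np.all(a_max >= b_min) and np.all(b_max >= a_min))

def place_pair(primary: np.ndarray, secondary: np.ndarray,
               min_gap: float = 0.1, tries: int = 50):
    """Drop a paired secondary object at a random offset that avoids overlap."""
    primary = ground(rotate_y(scale_to_unit_cube(primary), rng.uniform(0, 2 * np.pi)))
    secondary = ground(rotate_y(scale_to_unit_cube(secondary), rng.uniform(0, 2 * np.pi)))
    for _ in range(tries):
        offset = rng.uniform(-1.5, 1.5, size=2)            # random x/z translation
        moved = secondary + [offset[0], 0.0, offset[1]]
        if not aabb_overlap(primary, moved, min_gap):
            return primary, moved                          # distinct spatial regions
    raise RuntimeError("could not place secondary object without overlap")
```

The rejection-sampling loop in `place_pair` mirrors the paper's stated goal of ensuring that paired objects occupy distinct spatial regions while still producing varied occlusion and depth relationships.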

Examples of rendered views from the authors' dataset featuring multiple (two or more) objects, with object segmentation and depth-map visualizations shown below them.

Training Process

Accepting that synthetic realism alone was insufficient for robust generalization to real-world data, the researchers developed a three-stage curriculum learning process for training MirrorFusion 2.0.

In stage one, the authors initialized the weights of both the conditioning and generation branches with the Stable Diffusion v1.5 checkpoint, and fine-tuned the model on the single-object training split of the SynMirrorV2 dataset. Unlike the aforementioned Reflecting Reality project, the researchers did not freeze the generation branch. The model was trained for 40,000 iterations.


In stage two, the model was fine-tuned for an additional 10,000 iterations on the multi-object training split of SynMirrorV2, teaching the system to handle occlusions and the more complex spatial arrangements found in realistic scenes.

Finally, in stage three, a further 10,000 iterations of fine-tuning were performed on real-world data from the MSD dataset, using depth maps generated by the Matterport3D monocular depth estimator.

Examples from the MSD dataset, with real-world scenes parsed into depth and segmentation maps. Source: https://arxiv.org/pdf/1908.09101

During training, text prompts were dropped for 20% of the training time, to encourage the model to make optimal use of the available depth information (i.e., a 'masking' approach).

Training took place on four NVIDIA A100 GPUs for all stages (no VRAM spec is offered, though it would have been either 40GB or 80GB per card). A learning rate of 1e-5 was used under the AdamW optimizer, with a batch size of 4 per GPU.

This training scheme aimed to gradually increase the difficulty of the tasks presented to the model, beginning with simpler synthetic scenes and progressing toward more challenging compositions, in order to develop robust real-world transferability.
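A simplified sketch of what this three-stage schedule might look like in PyTorch appears below. Only the stated hyperparameters (AdamW at a 1e-5 learning rate, batch size 4 per GPU, 20% prompt dropout, and the per-stage iteration counts) come from the paper's description; `model`, `make_loader`, and `diffusion_loss` are placeholders standing in for the actual MirrorFusion 2.0 components.

```python
import random
from itertools import cycle

import torch

STAGES = [
    ("single_object", 40_000),   # stage 1: SynMirrorV2 single-object split
    ("multi_object", 10_000),    # stage 2: SynMirrorV2 multi-object split
    ("msd_real", 10_000),        # stage 3: real-world MSD data with depth maps
]
PROMPT_DROP_PROB = 0.2           # text prompts dropped 20% of the time

model = ...  # placeholder: conditioning + generation branches, both
             # initialized from the Stable Diffusion v1.5 checkpoint
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for split_name, num_iters in STAGES:
    loader = make_loader(split_name, batch_size=4)   # placeholder per-GPU loader
    for _, batch in zip(range(num_iters), cycle(loader)):
        prompt = batch["prompt"]
        if random.random() < PROMPT_DROP_PROB:
            prompt = ""          # unconditional pass: lean on depth conditioning
        loss = diffusion_loss(model, batch["image"], batch["depth"], prompt)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The staged loop makes the curriculum explicit: the data source changes at fixed iteration boundaries while the optimizer configuration stays constant throughout.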

Tests

The authors evaluated MirrorFusion 2.0 against the previous state-of-the-art, MirrorFusion, which served as the baseline, with experiments conducted on the MirrorBenchV2 dataset, covering both single-object and multi-object scenes.

Additional qualitative tests were performed on samples from the MSD dataset and on the Google Scanned Objects (GSO) dataset.

The assessment used 2,991 single-object images from both seen and unseen categories, and 300 two-object scenes from ABO. Performance was measured using Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) scores, to assess reflection quality in the masked mirror region; CLIP similarity was used to evaluate textual alignment with the input prompts.
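For readers who want a concrete sense of the evaluation, the sketch below shows one plausible way to compute the named metrics over a masked mirror region, using scikit-image for PSNR/SSIM and the lpips package for perceptual similarity. The crop-to-bounding-box step is an assumption about how a masked-region comparison might be implemented, not the authors' published code.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_net = lpips.LPIPS(net="alex")  # perceptual metric backbone

def masked_scores(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray):
    """pred/gt: HxWx3 float images in [0,1]; mask: HxW boolean mirror region."""
    ys, xs = np.where(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    p, g = pred[y0:y1, x0:x1], gt[y0:y1, x0:x1]   # crop to the mirror's bounding box

    psnr = peak_signal_noise_ratio(g, p, data_range=1.0)
    ssim = structural_similarity(g, p, channel_axis=-1, data_range=1.0)

    # lpips expects NCHW tensors scaled to [-1, 1]
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_net(to_t(p), to_t(g)).item()
    return psnr, ssim, lp
```

A best-of-several-seeds selection rule, such as the one used in the paper's quantitative protocol (described below), could then be as simple as `max(candidates, key=lambda img: masked_scores(img, gt, mask)[1])`.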

For the quantitative tests, the authors generated images using four seeds for a given prompt, and selected the resulting image with the highest SSIM score. Two tables of quantitative results are shown below:

Left: quantitative results for single-object reflection generation quality on the MirrorBenchV2 single-object split, with MirrorFusion 2.0 exceeding the baseline (best results in bold). Right: quantitative results for multi-object reflection generation quality on the MirrorBenchV2 multi-object split, where the version of MirrorFusion 2.0 trained on multiple objects outperformed the version trained without them.

The authors comment:

'[The results show] that our method outperforms the baseline method, and that fine-tuning on multiple objects improves results on complex scenes.'

Most of the results emphasized by the authors, however, concern the qualitative tests; the dimensions of the paper's illustrations permit only partial reproduction of its examples here.

Comparison on MirrorBenchV2: while the baseline fails to maintain accurate reflections and spatial consistency, misrepresenting the chair's orientation and distorting the reflections of multiple objects, [the authors'] MirrorFusion 2.0 correctly renders the chair and the sofas, with accurate position, orientation, and structure.

Of these subjective results, the researchers contend that the baseline model fails to accurately render object orientation and spatial relationships in reflections, often producing artifacts such as incorrect rotation and floating objects; whereas MirrorFusion 2.0, trained on SynMirrorV2, preserves correct object orientation and positioning in both single- and multi-object scenes, resulting in more realistic and coherent reflections.

Below are the qualitative results for the aforementioned GSO dataset:

Comparison on the GSO dataset: while the baseline misrepresents object structure, producing incomplete and distorted reflections, MirrorFusion 2.0 preserves spatial integrity and generates accurate geometry, color, and detail, even for out-of-distribution objects.

Here the authors comment:

'MirrorFusion 2.0 produces more accurate and realistic reflections. For instance, in Figure 5 (a - above), MirrorFusion 2.0 correctly reflects the drawer handles (highlighted in green), while the baseline model produces an implausible reflection (highlighted in red).

'Likewise, for the "white-yellow mug" in Figure 5 (b), MirrorFusion 2.0 delivers convincing geometry with minimal artifacts, unlike the baseline, which fails to accurately capture the object's geometry and appearance.'

The final qualitative test was conducted against the aforementioned real-world MSD dataset (partial results shown below):

Results on real-world scenes for MirrorFusion, MirrorFusion 2.0, and MirrorFusion 2.0 fine-tuned on the MSD dataset. The authors claim that MirrorFusion 2.0 captures complex scene details more accurately, including cluttered objects on a table and the presence of multiple mirrors within a three-dimensional environment. Only partial results are presented here, due to the dimensions of the original paper's illustration.

Here the authors observe that while MirrorFusion 2.0 performs well on MirrorBenchV2 and GSO data, it initially struggled with the complex real-world scenes in the MSD dataset. Fine-tuning the model on a subset of MSD improved its ability to handle cluttered environments and multiple mirrors, resulting in more coherent and detailed reflections on the held-out test split.

Additionally, the paper reports a user study in which 84% of participants preferred the generations of MirrorFusion 2.0 over those of the baseline method.

Results of the user study.

Details of the user study are confined to the paper's appendix, and we refer readers there for the particulars.

Conclusion

Some of the results presented in the paper are impressive improvements on the state of the art; but the state of the art in this particular pursuit is so poor that even an unconvincing, aggregate solution can pull ahead with a little effort. The fundamental architecture of a diffusion model is so inimical to the reliable learning and demonstration of consistent physics that the problem itself appears genuinely hard, and evidently not disposed toward an elegant solution.

Furthermore, adding data to existing models is already the standard way of remedying shortfalls in LDM performance, with all the drawbacks listed earlier. It is reasonable to assume that if future hyperscale datasets were to pay more attention to the distribution (and annotation) of reflection-related data points, the resulting models would handle this scenario better.

However, the same could be said of multiple other bugbears in LDM output: which of them most deserves the effort and money entailed in the kind of solution that the authors of the new paper propose here?

First released on Monday, April 28, 2025
