1Brown University, 2Google DeepMind
[Figure: (1) Train a force-conditioned video model with limited synthetic data: a local force model (poke) and a global force model (wind). (2) The video model generalizes force conditioning to different affordances.]
Overview
We investigate using physical forces as a control signal for video generation and propose force prompts which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric.
The main challenge of force prompting is the difficulty of obtaining high-quality paired force-video training data. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with limited demonstrations of a few objects (e.g., flying flags, rolling balls, etc.). Our method can generate videos that simulate forces across diverse geometries, settings, and materials. We also try to understand the source of this generalization and perform ablations on the training data that reveal two key elements: visual diversity and the use of specific text keywords during training.
In addition, our approach is trained on only around 15k training examples for a single day on four A100 GPUs, making these techniques broadly accessible for future research.
Interacting with Images Using Force Prompts
A user can interact with an image by specifying a force vector (location, angle, magnitude) on the image. Given this force prompt, the video generator produces the resulting scene. No physics simulator is used at inference time!
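As a rough illustration of what a force prompt carries, the (location, angle, magnitude) triple can be sketched as a small data structure. The `ForcePrompt` class, its field names, the normalization of location and magnitude to [0, 1], and the angle convention (counterclockwise from the +x axis) are all assumptions made for this example; the page does not specify how the model encodes these values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ForcePrompt:
    """Hypothetical container for a local point force applied to an image."""

    x: float          # horizontal location, assumed normalized to [0, 1]
    y: float          # vertical location, assumed normalized to [0, 1]
    angle_deg: float  # direction, assumed counterclockwise from the +x axis
    magnitude: float  # strength, assumed normalized to [0, 1]

    def __post_init__(self):
        # Reject values outside the assumed normalized range.
        for name in ("x", "y", "magnitude"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def components(self):
        """Return the (fx, fy) components of the force vector."""
        theta = math.radians(self.angle_deg)
        return (self.magnitude * math.cos(theta),
                self.magnitude * math.sin(theta))


# A medium-strength poke pointing left (180 degrees), applied at the
# horizontal center of the image.
poke = ForcePrompt(x=0.5, y=0.25, angle_deg=180.0, magnitude=0.5)
fx, fy = poke.components()
```

Under these assumptions, a global wind prompt would need only the angle and magnitude, since the force applies to the whole image rather than to a single point.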
While the results are currently not real-time or per-frame causal (though generation is causal with respect to the conditioning signal), we believe they show the potential of future video generation models as they become faster, more efficient, and more powerful.
Local Force Prompts
Interactive Force Prompting Demos: Try It Yourself! Click on a thumbnail below to select a demo. Then, click on the white bead in the image and drag along the indicated line. Release the mouse to see the generated video!
Global Force Prompts
Interactive Force Prompting Demos: Try It Yourself! Click on a thumbnail below to select a demo. Then, click on the wind icon to select a wind direction and release the mouse to see the generated video!
Training dataset diversity
The global wind force model is trained on 15k synthetic videos of flags in the wind. The model learns how wind is supposed to affect the flags and generalizes the wind control signal to diverse types of motions, including tethered and aerodynamic motion, as well as fluid dynamics. Pictured here are three different scenes of flags being blown to the right with varying force magnitudes.
The local point force model is trained on 11k videos of plants being poked and 12k videos of balls being poked. This unified dataset allows for modeling of linear motions, as well as oscillatory and complex motions. Pictured here are three different scenes of plants being poked to the left with varying force magnitudes, as well as three scenes of soccer balls being poked upwards with varying force magnitudes.
Force Prompting Can Recreate Some Demos for Prior Works that Use a Physics Simulator at Inference
To demonstrate the point force model's versatility, we curate a benchmark using first-frame images from some prominent physics-in-the-loop papers. We are not claiming that the Force Prompting method outperforms those methods on visual fidelity or physical realism. Rather, we wish to illustrate that our purely neural method can handle some of the same visual scenarios almost as effectively as approaches which require some combination of 3D assets and explicit physics simulation at inference time.
Recreating a PhysGen (ECCV 2024) demo
Hints at Mass Understanding
The same force results in different motion depending on the object's inferred mass
Single book vs. stack of books
Empty laundry basket vs. full laundry basket
Single cube vs. stack of cubes
Wooden ornament vs. metal ornament
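As a textbook reference point for the comparisons above, Newton's second law relates a fixed force to the acceleration of objects with different masses. This is only an idealized illustration; the page does not claim that the model computes or obeys this relation, and the symbols $F$, $m_1$, $m_2$, $a_1$, $a_2$ are introduced here for the example.

```latex
a = \frac{F}{m}
\qquad\Longrightarrow\qquad
\frac{a_1}{a_2} = \frac{F / m_1}{F / m_2} = \frac{m_2}{m_1}
```

For instance, if a stack of books has four times the mass of a single book ($m_2 = 4 m_1$), the same force would ideally give the single book four times the acceleration of the stack.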
Analysis of Effect of Text Keywords on Generalization
We find that using standard keywords (e.g., wind/blow/breeze) at train time is crucial for generalization of the wind model. Interestingly, whether they are present at inference time does not seem to matter significantly. We hypothesize that using these keywords at train time allows the model to connect the conditioning signal with these keywords and the video distributions they represent.
"Wind" keyword is important at train time but not at inference time
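One way to set up a keyword ablation like this is to build a second set of training captions with the wind-related words removed. The sketch below is a minimal illustration under that assumption: the helper names `has_wind_keyword` and `strip_wind_keywords`, and the crude prefix matching, are hypothetical and not taken from the authors' pipeline. Only the keyword list (wind/blow/breeze) comes from the text above.

```python
import re

# Keywords named in the analysis above.
WIND_KEYWORDS = ("wind", "blow", "breeze")

# Crude prefix match: catches "wind", "windy", "blowing", "breezes", but also
# false positives such as "window". A real pipeline would need a finer rule.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(WIND_KEYWORDS) + r")\w*\b", re.IGNORECASE
)


def has_wind_keyword(caption):
    """Return True if the caption contains any wind-related keyword."""
    return _PATTERN.search(caption) is not None


def strip_wind_keywords(caption):
    """Remove wind-related keywords and collapse leftover whitespace."""
    stripped = _PATTERN.sub("", caption)
    return " ".join(stripped.split())


with_keywords = "A flag blowing in the breeze"
without_keywords = strip_wind_keywords(with_keywords)
```

Simply deleting words leaves ungrammatical captions, so an actual ablation might instead rewrite the captions without those words.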
Analysis of Effect of Visual Diversity on Generalization
Our main finding is the surprising generalization given limited paired data; however, this generalization still requires strategically selecting certain types of visual diversity. Here we ablate several of these types of diversity and show the effects of removing them. While we find the generalization ability promising, we also believe that more diverse training data will improve the robustness of the model.
Limitations
Failure Case #1: The Physics is Out-of-Domain for the Base Video Model
The dust is blown in the prompted direction, but the base video model has difficulty generating a physically plausible person-plow-ground interaction
The kite is blown in the prompted direction, but the base video model has difficulty generating a physically plausible video of a kite dragging a person
The egg rolls in the prompted direction, but the base video model has difficulty rolling non-spherical objects, so the egg appears to float
Failure Case #2: The Base Video Model's Prior Competes with the Force Prompt
The rocking chair moves in the prompted direction, but the base video model has trouble distinguishing between foreground and background objects
The rubber duck moves in the prompted direction but bobs up and down due to the base model's prior. Also, all objects move because the base model struggles with object atomicity for complex scenes
The confetti moves in the prompted direction, but the base video model forces the scene to conjure extra confetti