Stable Video Diffusion and the Current State of Video Diffusion

In April 2023, video diffusion was in its infancy, with early tools offering frame-to-frame coherence but little control. They were seen more as novelties than as tools for serious applications. However, this field evolves rapidly, and keeping up with the pace can be challenging, so before getting into the latest release, I will give a rundown of how we got here.

Fast forward six months, and several independent developers had crafted extensions for Stable Diffusion. First came ModelScope, promising but limited by its dataset and heavy watermarks. Zeroscope followed, offering better resolution, lower RAM requirements, and a cleaner dataset. Around May 2023, these, along with commercial releases like RunwayML, became available, still with limited control: either a text prompt or an image as a base.

The real game-changer has been the combination of extensions like AnimateDiff and ControlNet. AnimateDiff enhances video diffusion with longer clips, adjustable prompts, and ControlNet compatibility. ControlNet, in turn, lets you preprocess images for depth, line art, and pose estimation, directing the output more precisely. This combination allows for more purposeful video generation, edging closer to a usable tool, though cohesion remains a challenge.

Among these, AnimateDiff stands out for its directability. But in the last week, Stability AI's own dedicated video diffusion model has been released: Stable Video Diffusion. SVD offers unparalleled quality and improved temporal cohesion, even with only a basic prompt or image as input. In just one week since its release, it has seen significant advancements, like VRAM requirements dropping from 40GB to 8GB. While it does not yet have the directability extensions built around it, it is amazingly fast and easy to get started with.

Next, I'll outline some simple workflows to get started with Stable Video Diffusion in ComfyUI.

To use Stable Video Diffusion in ComfyUI, we will need to download the models. This assumes you have already installed ComfyUI and updated it to the latest version. The models can be found here:

There are currently two models. One is trained to create 2-second videos:

https://huggingface.co/stabilityai/stable-video-diffusion-img2vid

And a second, which creates 3-second videos:

https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/

These need to go into the folder where ComfyUI stores your checkpoints (by default, ComfyUI/models/checkpoints).
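
If you would rather script the download than grab the files through the browser, a minimal sketch using the huggingface_hub package could look like the following. The checkpoint filenames and the ComfyUI path are assumptions on my part, so check the model pages above for the exact names:

```python
# pip install huggingface_hub
# You may need to accept the model license and run `huggingface-cli login` first.
from huggingface_hub import hf_hub_download

# Assumed default ComfyUI checkpoints folder; adjust to your install location.
CHECKPOINT_DIR = "ComfyUI/models/checkpoints"

# Filenames assumed from the model repos at the time of writing;
# verify them on the Hugging Face pages if a download fails.
models = [
    ("stabilityai/stable-video-diffusion-img2vid", "svd.safetensors"),
    ("stabilityai/stable-video-diffusion-img2vid-xt", "svd_xt.safetensors"),
]

for repo_id, filename in models:
    path = hf_hub_download(repo_id=repo_id, filename=filename, local_dir=CHECKPOINT_DIR)
    print(f"Downloaded {filename} to {path}")
```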

The workflow looks like this, and it is relatively simple to get started with. The JSON file is included below.
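
For context, here is a rough sketch of what the workflow is doing under the hood, expressed with the diffusers StableVideoDiffusionPipeline rather than ComfyUI nodes. This is not the ComfyUI workflow itself, just an illustration of the image-to-video call the nodes wrap; the input filename is a placeholder:

```python
# pip install diffusers transformers accelerate
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the longer "xt" checkpoint in half precision.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # keeps VRAM use manageable

# SVD is conditioned on a single image at its training resolution.
image = load_image("input.png").resize((1024, 576))  # placeholder filename

frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=6)
```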

Images must be fed to it at its training resolution, which is 1024x576. Currently, it generates video at 6fps, and interpolation must be done to fill in the rest. This is similar to how current commercial options such as Runway and Pika work, and thankfully, the AI tools for interpolation have come a long way. Here is the raw output I got from this image put into this workflow with no changes:
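
If your source image is not already 1024x576, and you want to bring the 6fps result up to a smoother frame rate, a small sketch along these lines can handle both steps. The filenames are placeholders, and ffmpeg's minterpolate filter is used here as a simple stand-in for the dedicated AI interpolators:

```python
# pip install pillow  (ffmpeg must be installed separately)
import subprocess
from PIL import Image

# Placeholder filenames for illustration.
SRC = "input.jpg"
RESIZED = "input_1024x576.png"
RAW_VIDEO = "svd_raw_6fps.mp4"        # raw output from the ComfyUI workflow
SMOOTH_VIDEO = "svd_interpolated.mp4"

TARGET_W, TARGET_H = 1024, 576        # SVD training resolution

# Center-crop to 16:9, then resize to 1024x576 so nothing is stretched.
img = Image.open(SRC)
w, h = img.size
target_ratio = TARGET_W / TARGET_H
if w / h > target_ratio:              # too wide: crop the sides
    new_w = int(h * target_ratio)
    left = (w - new_w) // 2
    img = img.crop((left, 0, left + new_w, h))
else:                                 # too tall: crop top and bottom
    new_h = int(w / target_ratio)
    top = (h - new_h) // 2
    img = img.crop((0, top, w, top + new_h))
img = img.resize((TARGET_W, TARGET_H), Image.LANCZOS)
img.save(RESIZED)

# Motion-interpolate the 6fps output up to 24fps with ffmpeg.
subprocess.run([
    "ffmpeg", "-y", "-i", RAW_VIDEO,
    "-vf", "minterpolate=fps=24:mi_mode=mci",
    SMOOTH_VIDEO,
], check=True)
```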

Since I began writing this, some further advancements for SVD have come along, and some new workflows have been tested out. Likely the most interesting is this masking workflow set up by a Redditor.

It is a bit more complicated, but it looks like this:

When you load this workflow, you will not see the black-and-white mask image. The way it works is that you select your input image (resized to 1024x576), then right-click and select “Open in MaskEditor”. This opens a window with your image and lets you paint the area you would like to see motion in; areas left unpainted will not have motion. My mask looked like this:
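
If you would rather define the motion region in code than paint it by hand, a small Pillow sketch like this can produce an equivalent mask image. The filenames and rectangle coordinates are placeholders, and it assumes the workflow treats the painted (white) area as the region that is allowed to move, the same way a hand-painted MaskEditor mask is used:

```python
# pip install pillow
from PIL import Image, ImageDraw

# Placeholder paths; the input should already be resized to 1024x576.
INPUT = "input_1024x576.png"
MASK = "motion_mask.png"

img = Image.open(INPUT)

# Start from an all-black (no motion) mask the same size as the input,
# then paint white over the region that should move. The rectangle below
# is an arbitrary example; replace it with the area you care about.
mask = Image.new("L", img.size, 0)
draw = ImageDraw.Draw(mask)
draw.rectangle((600, 150, 1000, 500), fill=255)   # white = motion
# draw.ellipse(...) also works for rounder regions.

mask.save(MASK)
```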

After this, I simply ran the queue and got the final output video seen below:

This mimics some of the features seen in RunwayML’s motion brush, which allows similar masking. And while the fine-grained control is not here yet, with the pace we have been seeing, I hope my next update will include new workflows showing greater directability.

I hope these allow you to get started generating videos with Stable Video Diffusion, and I look forward to seeing what comes down the pipe next week!

Files:

Comfy_SDV_Masked

Comfy_SDV_Simple
