Google's Lumiere Advances AI Video to Near-Reality

Google’s new video-generation AI model, Lumiere, uses a novel diffusion architecture called Space-Time U-Net (STUNet), which figures out both where objects are in a video (space) and how they move and change (time) at the same time. According to Ars Technica, this approach lets Lumiere create a video in a single pass rather than stitching together multiple still frames. Lumiere starts by creating a base frame from a prompt, then uses the STUNet framework to predict how the objects in that frame will move, generating the additional frames with smooth transitions that create the appearance of seamless motion. Lumiere can also generate up to 80 frames, compared to Stable Video Diffusion’s 25.

Although I am more of a text journalist than a video person, the promotional clips Google released alongside its preprint paper show that AI video generation and editing tools have gone from uncanny to near-realistic in just a few years. Lumiere also stakes out Google’s technical position in a field alongside competitors like Runway, Stable Video Diffusion, and Meta’s Emu. Runway, one of the first mainstream text-to-video platforms, launched Gen-2 last March, offering more realistic output, though its videos still struggled to depict motion. Google posted clips and prompts on the Lumiere site, and I ran the same prompts through Runway for comparison. The verdict: yes, some of the showcased clips look a bit artificial, especially if you look closely at skin textures or more atmospheric scenes. But look at that turtle! It moves through the water just like a real turtle! I sent Lumiere’s introduction video to a friend who is a professional video editor. While she noted, “You can clearly tell it’s not completely real,” she said that if I hadn’t told her it was AI, she would have assumed it was CGI. (She also said, “This is going to take my job, isn’t it?”)

Other models stitch videos together from generated keyframes in which the motion has already happened (think of the drawings in a flipbook), while STUNet lets Lumiere focus on the motion itself, based on where generated content should be at any given moment in the video. Google has not been a major player in the text-to-video category, but it has steadily released more advanced AI models and shifted its focus toward multimodality; its Gemini large language model will eventually bring image generation to Bard. Lumiere is not yet available for testing, but it demonstrates that Google can build an AI video platform comparable to, and arguably better than, generally available generators like Runway and Pika. It is also a reminder of how far Google’s AI video work has come since its efforts two years ago.
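To make that contrast concrete, here is a purely illustrative Python sketch (not Lumiere’s actual code; the signal, frame counts, and function names are all invented for the toy). A fast back-and-forth “object position” over 16 frames stands in for motion in a clip: interpolating between a few keyframes smears out the fast motion, while modeling every frame directly does not.

```python
import numpy as np

# Toy setup: a nonlinear "object position" over 16 frames stands in
# for fast motion in a short clip.
n_frames = 16
t = np.linspace(0.0, 1.0, n_frames)
true_motion = np.sin(2 * np.pi * t)  # rapid back-and-forth movement

def keyframe_then_interpolate(n_key=4):
    """Cascaded approach: produce a few sparse keyframes, then fill the
    temporal gaps by interpolation, as many text-to-video pipelines do."""
    key_idx = np.linspace(0, n_frames - 1, n_key).astype(int)
    return np.interp(np.arange(n_frames), key_idx, true_motion[key_idx])

def generate_all_frames():
    """Single-pass approach: every frame is modeled directly, so motion
    between would-be keyframes is not lost (the STUNet idea, schematically)."""
    return true_motion.copy()

# Fast motion between keyframes is smeared out by interpolation,
# but preserved when all frames are generated together.
err_keyframe = np.abs(keyframe_then_interpolate() - true_motion).max()
err_joint = np.abs(generate_all_frames() - true_motion).max()
print(f"keyframe interpolation error: {err_keyframe:.2f}")
print(f"single-pass error: {err_joint:.2f}")
```

This is only a schematic of the timing argument, of course; the real models are diffusion networks operating on pixels, not 1-D trajectories.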

Besides text-to-video generation, Lumiere will also support image-to-video generation, stylized generation (making videos in a specific style), cinemagraphs (animating only a portion of a video), and inpainting (masking out an area of a video to change its color or pattern). However, Google’s Lumiere paper notes, “There is a risk of misuse for creating fake or harmful content with our technology, and we believe it is crucial to develop and apply tools for detecting bias and malicious use cases to ensure safe and fair use.” The paper’s authors do not elaborate on how that will be achieved.