OpenAI’s Video Generator Sora Is Stunning and Utterly Terrifying

On Thursday, OpenAI teased a text-to-video AI generator capable of creating incredibly detailed and realistic videos from text prompts. The model, called Sora, can create videos up to 60 seconds long and is currently being tested by OpenAI’s risk assessment team along with “a number of visual artists, designers, and filmmakers” before an eventual launch to the wider public, according to the announcement.

“Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background,” OpenAI said. “The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.”

The company has remained predictably silent on what its dataset contains and what the model was trained on. However, it did note that Sora was made using a similar process to DALL-E 3, “which involves generating highly descriptive captions for the visual training data.”

Gif of woman walking down street in Tokyo — Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
OpenAI

In addition to generating video from text, Sora can also create a video from an “existing still image,” and can even take an existing video in order to “extend it or fill in missing frames.” That opens up many more potential use cases: the model could be used for anything from restoring old footage, to creating cheap video content, to ushering in a new era of propaganda and disinformation the likes of which the world has never seen before.

In various demos, OpenAI shows that Sora is capable of creating high-definition videos of wooly mammoths galloping through a snowy landscape, a movie trailer for a film about a 30-year-old astronaut shot on 35mm film, Pixar-like animations of cute monsters, and historical footage of California during the Gold Rush.

Sora doesn’t just feel more advanced than current text-to-video generators; it feels light-years ahead of them. Meta released its own video generator in 2022 that was impressive at the time but now looks downright archaic in comparison. Similarly, Google released a text-to-video model in Jan. 2024, but it is also not as detailed or realistic as Sora.

Prompt: Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee.
OpenAI

The stunning accuracy of Sora underscores both the incredibly rapid advances in generative AI that we’ve seen in the past two years and the dangers that come with them. The world is already struggling to grapple with the impact that these models have on disinformation and social engineering. For example, studies have shown that AI deepfakes can be incredibly effective in swaying people’s opinions and perceptions and are even capable of creating false memories. Meanwhile, Congress remains slow to adopt regulation that would rein in the worst of the technology’s impact.

As we enter yet another hotly contentious election year amid the geopolitical turmoil of Russia’s invasion of Ukraine and Israel’s war in Gaza, the dangers and risks posed by this technology are far-reaching. There’s no telling how nation states, terrorist organizations, and political campaigns might weaponize these models, and the danger isn’t limited to bad actors either.

Technology like Sora holds the potential to not just disrupt industries like art and cinema, but completely obliterate them. No longer will production companies need to rely on actors, camera operators, gaffers, and the hundreds of other people who create the movies and TV shows we love. Instead, they can just type a few words into a prompt and get a full video.

Animation of a monster like Pixar movies — Prompt: Animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. The art style is 3D and realistic, with a focus on lighting and texture. The mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. Its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. The use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image.
OpenAI

Of course, there are very strong arguments to be made that there will always be a need for a human when it comes to creating good art like cinema. But those words might fall on deaf ears when it comes to producers and studios looking to make a movie as cheaply as possible.

On top of all of this are the perennial questions of how exactly the model was trained and whose data was used to train it. Sora wasn’t created in a vacuum. Building the model required a massive corpus of image, video, and text data. All of that data likely came from artists, writers, and creators who did not give informed consent to have their work be part of a dataset used to train an AI that will likely push them out of jobs.

The videos are impressive, but utterly terrifying when you stop to consider the model’s implications. This is just another example of how AI threatens so many people’s lives and livelihoods—and perhaps most terrifying of all, OpenAI isn’t done yet.

Wooly mammoths running through a frozen tundra — Prompt: Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.
OpenAI

The text-to-video model is able to create videos up to a minute long.

Tony Ho Tran