Sora: OpenAI’s New Video Generation Model

On February 16, 2024, OpenAI once again shocked the world with the next frontier of generative AI: video generation. Called Sora, a name most likely taken from the Japanese word for “sky,” this text-to-video model can generate high-quality videos of up to 60 seconds from simple text prompts.

Sora’s capabilities are not limited to text-to-video: users can also animate pre-existing images into videos. Work that animators and motion graphics artists spend days on could be done in an instant, with impressive quality. Sora is also capable of video-to-video editing, able to extend a clip with extra footage at the end as well as at the beginning. This would be useful for creating infinite loops, as Sora could generate the transition needed to loop any video perfectly. Furthermore, Sora can generate a completely seamless transition between any two videos while staying creative.

Sora’s ability to simulate a world, and the objects and people interacting within it, is remarkable. When trained at scale, Sora displays object permanence, environmental interactions that leave lasting changes, digital-world simulation, and more. With a single sentence, Sora can achieve what once would have required 3D models and camerawork programmed by a professional 3D animator.

The mechanism behind Sora is similar to that of large language models (LLMs) like ChatGPT, except that it operates on “visual patches.” Patches are the visual counterpart of tokens: just as text, whether code, math, or natural language, is converted into a unified stream of tokens for an LLM to process, videos are compressed into patches, small snapshots of both space and time in a video, which allow the model to analyze and generate videos effectively.
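
To make the idea concrete, here is a minimal sketch of cutting a video into spacetime patches. The patch size (4 frames by 16×16 pixels) and the use of raw pixels are illustrative assumptions only; OpenAI’s technical report describes patching a compressed latent representation, and the actual dimensions are unpublished.

```python
# A minimal sketch of the "visual patch" idea: cut a video tensor into
# spacetime patches, the video analogue of an LLM's text tokens.
# Patch sizes here are assumptions for illustration, not Sora's real config.
import numpy as np

def patchify(video: np.ndarray, pt: int = 4, ph: int = 16, pw: int = 16) -> np.ndarray:
    """Split a (T, H, W, C) video into a sequence of flattened spacetime patches."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dimensions must divide evenly"
    # Split each axis into (number of patches, patch size) blocks...
    blocks = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # ...bring the three "number of patches" axes to the front...
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    # ...and flatten each block into one row: one "token" per patch.
    return blocks.reshape(-1, pt * ph * pw * C)

# Example: 16 frames of 64x64 RGB video -> (16/4) * (64/16) * (64/16) = 64 tokens
video = np.random.rand(16, 64, 64, 3).astype(np.float32)
tokens = patchify(video)
print(tokens.shape)  # (64, 3072)
```

Once a video is flattened into a sequence like this, the model can be trained on it much the way an LLM is trained on token sequences, which is what makes the patch framing so powerful.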

It’s worth noting that other AI models such as Runway and Pika have already achieved text-to-video, image-to-video, and video-to-video, but Sora’s level of detail and maximum video length are far more advanced.

Of course, Sora isn’t perfect yet, and many limitations remain. It does not realistically simulate the physics of simple interactions such as glass shattering, and objects sometimes spontaneously disappear or reappear. These limitations, however, will gradually be overcome with time, and AI-generated videos will likely soon be virtually indistinguishable from real ones.

My first reactions to Sora’s videos went something like: what the… and already?? and how..?? For a couple of minutes, I was simply in awe of the realism of the videos and tried my best to catch the imperfections in the generations. After the initial awe subsided, a new feeling of dread slowly crept over me. The possibilities were truly endless, and something like this in the wrong hands could cause a lot of harm. Impersonation of influential figures, fake news, violence, self-harm, pornography, and suicide were some of my worries about the potential of this software. I don’t believe this technology should be kept from the public, and I think it will inevitably be released at some point in the future, but I don’t exactly feel comfortable knowing that anyone could create anything with a few words. Of course, OpenAI will take precautions: text classifiers that check for “text input prompts that are in violation of [their] usage policies, like those that request extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others”; image classifiers that review every frame before it is shown to the user; and C2PA metadata embedded in each video to mark whether it was generated by Sora.
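
As a concrete illustration of how that last safeguard could be checked, here is a minimal sketch that inspects a media file for C2PA provenance metadata. It assumes the open-source c2patool CLI from the Content Authenticity Initiative is installed, and the JSON field names used below (“active_manifest,” “manifests,” “claim_generator”) are assumptions based on c2patool’s output format, not anything OpenAI has published.

```python
# A minimal sketch of checking a media file for C2PA provenance metadata.
# Assumes the open-source `c2patool` CLI is installed; the JSON field names
# below are assumptions about its output format, not a guaranteed schema.
import json
import subprocess
import sys

def inspect_provenance(path: str) -> None:
    """Print who signed a media file's C2PA manifest, if one exists."""
    try:
        # `c2patool <file>` prints the file's manifest store as JSON.
        result = subprocess.run(["c2patool", path],
                                capture_output=True, text=True)
    except FileNotFoundError:
        sys.exit("c2patool is not installed (see contentauthenticity.org)")
    if result.returncode != 0:
        print(f"{path}: no readable C2PA manifest (untagged, or metadata stripped)")
        return
    store = json.loads(result.stdout)
    active = store.get("active_manifest", "")
    manifest = store.get("manifests", {}).get(active, {})
    # "claim_generator" names the tool that produced the content.
    print(f"{path}: provenance found, generator = {manifest.get('claim_generator', 'unknown')}")

if __name__ == "__main__":
    inspect_provenance(sys.argv[1])
```

One caveat worth noting: C2PA metadata can only prove presence, not absence. A re-encoded or screen-recorded copy of a Sora video would show no manifest at all, which is exactly why classifiers and human judgment still matter.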

And still, jailbreaks will happen, and people will abuse them in some way or another. That we cannot control, but we can control how we prepare for the future of media with AI. Ever since ChatGPT was opened to the public in November 2022, I have learned how to distinguish AI text from human writing. The same goes for AI-generated images: although it has become considerably harder since generative art was first introduced, I’ve gained an intuition for telling whether an image was generated by AI. At first it was really hard, but as I was exposed to more AI content, it became easier to make subtle distinctions from tiny details, or just to have this weird feeling that what I was looking at wasn’t real. I believe this skill should be taught in schools. Distinguishing real media from AI-generated media will only get harder from now on, so giving children the necessary tools and guided experience is essential, especially because they are more vulnerable to being negatively influenced by AI content (misinformation, fraud, etc.). Teaching students the C2PA standard and how to recognize it is important as well. With technology advancing exponentially, it is crucial that we prepare ourselves and our children now for tomorrow.

Phillip Han

ISK TIMES - Journalist
