4 Best Open Source Text-to-Video Models (2023)

Discover the best open source text-to-video models of 2023. This roundup covers four models that turn text prompts into short video clips, from anime-styled output to watermark-free 1024×576 footage, with notes on each model's resolution, training data, and relationship to the others.

Text-to-Video Models

zeroscope_v2 XL

zeroscope_v2 XL is an open source, watermark-free text-to-video model that generates high-quality video at a resolution of 1024×576.

The model was trained on 9,923 clips and 29,769 tagged frames. It is designed for upscaling content created with zeroscope_v2_576w, using the vid2vid technique in the 1111 text2video extension by kabachuha.

This split allows faster exploration at 576×320 (or 448×256) before moving to a high-resolution render. Once a low-resolution draft looks right, replace the necessary files and upscale it with zeroscope_v2 XL, which gives better overall compositions at the higher resolution.
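To make the draft-then-upscale workflow concrete, the sketch below compares the per-frame pixel counts of the resolutions mentioned above. It is plain arithmetic and not part of zeroscope itself; the helper name `pixel_ratio` is only illustrative.

```python
# Resolutions quoted in the text, as (width, height).
DRAFT_SMALL = (448, 256)
DRAFT = (576, 320)
FINAL = (1024, 576)


def pixel_ratio(final, draft):
    """Return how many times more pixels per frame `final` has than `draft`."""
    return (final[0] * final[1]) / (draft[0] * draft[1])


if __name__ == "__main__":
    for name, res in (("576x320", DRAFT), ("448x256", DRAFT_SMALL)):
        print(f"1024x576 has {pixel_ratio(FINAL, res):.2f}x the pixels of {name}")
```

A 1024×576 frame holds 3.2 times as many pixels as a 576×320 frame, and a little over 5 times as many as a 448×256 frame, which is one reason exploring prompts at the draft resolutions is cheaper.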

In practice, that makes zeroscope_v2 XL most useful as the final step of a two-stage workflow: draft at low resolution first, then upscale the clips worth keeping.


potat1

potat1 is an open source text-to-video model that generates video at 1024×576 resolution. It is a prototype model, billed as the first open source text-to-video model at that resolution.

potat1 was trained on a single A100 (40GB) GPU from Lambda Labs, using a dataset of 2,197 clips and 68,388 tagged frames. The frames were tagged with the salesforce/blip2-opt-6.7b-coco captioning model.
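As a quick sanity check on the dataset figures, the sketch below computes the average number of tagged frames per clip for potat1 and, for comparison, for the zeroscope_v2 XL figures quoted earlier in this article. The function name `frames_per_clip` is illustrative, and the averages say nothing about how frames were actually sampled from each clip.

```python
def frames_per_clip(clips, tagged_frames):
    """Average number of tagged frames per training clip."""
    return tagged_frames / clips


# Figures as quoted in this article.
potat1_avg = frames_per_clip(2197, 68388)
zeroscope_xl_avg = frames_per_clip(9923, 29769)

print(f"potat1: {potat1_avg:.1f} tagged frames per clip")
print(f"zeroscope_v2 XL: {zeroscope_xl_avg:.1f} tagged frames per clip")
```

potat1 uses fewer clips but roughly ten times as many tagged frames per clip as zeroscope_v2 XL.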

potat1 is built on the modelscope text-to-video model described below. Unlike that base model, it is entirely watermark-free.


modelscope

The modelscope text-to-video generation diffusion model is an open source model made up of three sub-networks: text feature extraction, a diffusion model that maps text features to the video latent space, and a network that maps the video latent space to the video visual space. It has approximately 1.7 billion parameters, supports English input, and serves as the foundation for numerous other open source text-to-video models.

The diffusion model uses a Unet3D structure and generates video through an iterative denoising process that starts from a pure Gaussian noise video. Note that output from the modelscope model includes a watermark.
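The three-stage flow and the denoising loop described above can be sketched with toy stand-ins. Nothing below is the real modelscope code: `encode_text`, `denoise_step`, and `decode_latent` are hypothetical placeholders, and the "denoiser" simply interpolates toward the text features instead of running a trained Unet3D. The sketch only shows the shape of the computation: encode the prompt, start from Gaussian noise, refine it step by step, then decode.

```python
import random


def encode_text(prompt):
    """Stage 1 stand-in: map a prompt to a small feature vector."""
    # Hypothetical toy encoding: character codes scaled into [0, 1).
    return [(ord(c) % 64) / 64 for c in prompt[:8].ljust(8)]


def denoise_step(latent, text_features, strength=0.2):
    """Stage 2 stand-in: nudge every latent value toward the text features.

    A real model would call a trained Unet3D here to predict the noise.
    """
    return [
        [value + strength * (target - value)
         for value, target in zip(frame, text_features)]
        for frame in latent
    ]


def decode_latent(latent):
    """Stage 3 stand-in: map latent values to 8-bit pixel intensities."""
    return [[max(0, min(255, round(v * 255))) for v in frame] for frame in latent]


def generate(prompt, num_frames=4, steps=30, seed=0):
    """Run the toy pipeline: encode, denoise iteratively, decode."""
    rng = random.Random(seed)
    text_features = encode_text(prompt)
    # Start from pure Gaussian noise: one row of latent values per frame.
    latent = [[rng.gauss(0.0, 1.0) for _ in text_features]
              for _ in range(num_frames)]
    for _ in range(steps):
        latent = denoise_step(latent, text_features)
    return decode_latent(latent)


video = generate("a cat surfing")
print(len(video), "frames of", len(video[0]), "values each")
```

In the toy version every frame converges to the same values because the stand-in denoiser ignores time; a real Unet3D operates across the frame dimension as well, which is what lets it produce motion.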

Because so many other models are built on it, modelscope is a useful baseline to try before moving on to its derivatives.


Animov-512x

Animov-512x is an open source text-to-video model for diffusers, fine-tuned from ModelScope to give the resulting videos an anime-style appearance. It works at a resolution of 512×512.
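Because the model is distributed for diffusers, a generation call might look like the sketch below. The repository id is deliberately left as a `model_id` argument to be looked up on the model's hosting page. `generation_kwargs` and `generate_video` are illustrative names, and the frame count and step count are assumed defaults, not values from the model's documentation. The shape of `result.frames` has changed between diffusers releases, so treat the unwrapping step as an assumption to verify against the installed version.

```python
def generation_kwargs(prompt, width=512, height=512, num_frames=16, steps=25):
    """Build keyword arguments for a diffusers text-to-video pipeline call."""
    if width % 8 or height % 8:
        raise ValueError("width and height should be multiples of 8")
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "num_frames": num_frames,          # assumed default, not from the source
        "num_inference_steps": steps,      # assumed default, not from the source
    }


def generate_video(model_id, prompt, output_path="animov.mp4"):
    """Run the pipeline; needs a CUDA GPU plus the torch and diffusers packages."""
    # Imported lazily so generation_kwargs works without these packages.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    result = pipe(**generation_kwargs(prompt))

    # Assumption: newer diffusers releases nest frames per prompt, older ones
    # return a flat list of frames. Unwrap the first prompt if nested.
    frames = result.frames
    first = frames[0]
    if isinstance(first, list) or getattr(first, "ndim", 3) == 4:
        frames = first
    return export_to_video(frames, output_path)
```

Calling `generate_video(model_id, "a girl walking through a field of flowers, anime style")` with the correct repository id would write the clip to `animov.mp4`; the prompt here is only an example.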