HunyuanVideo Abstract
HunyuanVideo is introduced as a novel open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, leading closed-source models. To train the HunyuanVideo model, several key technologies are employed, including data curation, image-video joint model training, and an efficient infrastructure designed for large-scale model training and inference. Furthermore, by implementing an effective strategy for scaling both model architecture and dataset, a video generative model with over 13 billion parameters has been successfully developed, making it the largest among all open-source models.
Extensive experiments and a series of targeted designs are conducted to ensure high visual quality, motion diversity, text-video alignment, and generation stability. Based on professional human evaluation results, HunyuanVideo is shown to outperform previous state-of-the-art models, such as Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code and weights of the foundation model and its applications, this initiative seeks to bridge the gap between closed-source and open-source video foundation models, empowering the community to experiment with their ideas and fostering a more dynamic and vibrant video generation ecosystem.
HunyuanVideo is trained in a spatio-temporally compressed latent space produced by a Causal 3D VAE. Text prompts are encoded with a large language model and used as the conditioning signal. Taking Gaussian noise and this condition as input, the generative model produces an output latent, which is decoded into images or videos by the 3D VAE decoder.
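For orientation, here is a minimal Python sketch of the generation flow described above (text encoding, latent denoising, 3D VAE decoding). Every name in it (`text_encoder`, `dit`, `vae_decoder`, the latent shape, the Euler-style update) is an illustrative assumption, not the actual HunyuanVideo API.

```python
import torch

def generate_video(prompt, text_encoder, dit, vae_decoder, num_steps=50,
                   latent_shape=(16, 33, 90, 160)):
    """Sketch of the flow described above; all modules and shapes are
    placeholders (latent_shape = channels, frames, height, width)."""
    # 1. Encode the text prompt into conditioning features with the LLM.
    cond = text_encoder(prompt)

    # 2. Start from Gaussian noise in the compressed latent space.
    latent = torch.randn(1, *latent_shape)

    # 3. Iteratively denoise the latent, conditioned on the text features
    #    (a plain Euler-style update stands in for the real sampler here).
    for t in torch.linspace(1.0, 1.0 / num_steps, num_steps):
        pred = dit(latent, t, cond)
        latent = latent - pred / num_steps

    # 4. Decode the clean latent to pixel frames with the 3D VAE decoder.
    return vae_decoder(latent)
```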
HunyuanVideo Key Features
HunyuanVideo Unified Image and Video Generative Architecture
HunyuanVideo adopts a Transformer design with a Full Attention mechanism for unified image and video generation. Specifically, we use a "Dual-stream to Single-stream" hybrid model design for video generation. In the dual-stream phase, video and text tokens are processed independently through multiple Transformer blocks, enabling each modality to learn its own appropriate modulation mechanisms without interference. In the single-stream phase, we concatenate the video and text tokens and feed them into subsequent Transformer blocks for effective multimodal information fusion. This design captures complex interactions between visual and semantic information, enhancing overall model performance.
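A minimal PyTorch-style sketch of this "dual-stream to single-stream" layout is shown below. The block internals are reduced to stock Transformer encoder layers, and all class and parameter names are assumptions for illustration (the real blocks also carry the modulation mechanisms mentioned above).

```python
import torch
import torch.nn as nn

class DualToSingleStream(nn.Module):
    """Layout sketch only: dual-stream blocks process video and text tokens
    separately, then single-stream blocks apply full attention over the
    concatenated sequence."""

    def __init__(self, dim=1024, heads=8, num_dual=2, num_single=2):
        super().__init__()
        def block():
            return nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                              batch_first=True)
        self.video_blocks = nn.ModuleList([block() for _ in range(num_dual)])
        self.text_blocks = nn.ModuleList([block() for _ in range(num_dual)])
        self.single_blocks = nn.ModuleList([block() for _ in range(num_single)])

    def forward(self, video_tokens, text_tokens):
        # Dual-stream phase: each modality is refined independently.
        for vb, tb in zip(self.video_blocks, self.text_blocks):
            video_tokens, text_tokens = vb(video_tokens), tb(text_tokens)

        # Single-stream phase: full attention over video + text tokens.
        tokens = torch.cat([video_tokens, text_tokens], dim=1)
        for sb in self.single_blocks:
            tokens = sb(tokens)

        # Only the video positions are kept for decoding back to a latent.
        return tokens[:, : video_tokens.shape[1]]
```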
HunyuanVideo MLLM Text Encoder
Some previous text-to-video models typically use pretrained CLIP and T5-XXL as text encoders, where CLIP uses a Transformer encoder and T5 uses an encoder-decoder structure. In contrast, we utilize a pretrained Multimodal Large Language Model (MLLM) with a decoder-only structure as our text encoder, which has the following advantages: (i) compared with T5, an MLLM after visual instruction finetuning has better image-text alignment in the feature space, which alleviates the difficulty of instruction following in diffusion models; (ii) compared with CLIP, an MLLM has demonstrated superior ability in image detail description and complex reasoning; (iii) an MLLM can act as a zero-shot learner by following system instructions prepended to user prompts, helping the text features pay more attention to key information. However, the MLLM is based on causal attention, while T5-XXL utilizes bidirectional attention, which produces better text guidance for diffusion models. Therefore, we introduce an extra bidirectional token refiner to enhance the text features.
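The sketch below illustrates this idea in Python/Transformers terms: a system instruction is prepended to the user prompt, the prompt is run through a decoder-only LM to obtain per-token hidden states, and a small bidirectional Transformer refines them. The checkpoint path, the system prompt wording, and the refiner size are placeholders, not HunyuanVideo's actual configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/decoder-only-mllm"   # placeholder checkpoint
SYSTEM_PROMPT = ("Describe the video in detail, covering the subjects, "
                 "their actions, and the scene: ")   # illustrative instruction

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
llm = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Small bidirectional "token refiner": since the LLM itself is causal,
# this extra module lets every text token attend to every other token.
refiner = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=llm.config.hidden_size, nhead=8,
                               batch_first=True),
    num_layers=2,
)

def encode_prompt(user_prompt: str) -> torch.Tensor:
    text = SYSTEM_PROMPT + user_prompt        # zero-shot guidance via system prompt
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = llm(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[-1]            # last-layer per-token features
    return refiner(hidden)                    # bidirectionally refined text condition
```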
HunyuanVideo 3D VAE
HunyuanVideo trains a 3D VAE with CausalConv3D to compress pixel-space videos and images into a compact latent space. We set the compression ratios for video length, space, and channel to 4, 8, and 16, respectively. This significantly reduces the number of tokens for the subsequent diffusion transformer model, allowing us to train on videos at their original resolution and frame rate.
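Two small illustrations of this design follow: a causal 3D convolution that pads only toward past frames (so no frame sees the future), and the latent shape implied by the stated compression ratios. The kernel size, the example resolution and frame count, and the "(T - 1) / 4 + 1" rule are assumptions for illustration, not HunyuanVideo's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Sketch of a causal 3D convolution: temporal padding is applied only on
    the past side, so outputs never depend on future frames."""

    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.pad_t = kernel - 1
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=kernel,
                              padding=(0, kernel // 2, kernel // 2))

    def forward(self, x):                          # x: (B, C, T, H, W)
        # F.pad order is (W_left, W_right, H_left, H_right, T_left, T_right);
        # only the "past" side of the time axis is padded.
        x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))
        return self.conv(x)

# Latent shape implied by the stated ratios (time 4x, space 8x, 16 channels),
# e.g. for a 129-frame 720x1280 clip; the "(T - 1) // 4 + 1" rule assumes the
# causal VAE keeps the first frame and compresses the rest 4x.
frames, height, width = 129, 720, 1280
latent = (16, (frames - 1) // 4 + 1, height // 8, width // 8)
print(latent)   # (16, 33, 90, 160)
```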
HunyuanVideo in ComfyUI
HunyuanVideo is currently supported in ComfyUI through the wrapper nodes developed by kijai. Many thanks to kijai for his contribution: https://github.com/kijai/ComfyUI-HunyuanVideoWrapper
How to Use the HunyuanVideo Text-to-Video Workflow
- Prepare the input: Enter the text prompt describing the desired video in the HunyuanVideo TextEncode node.
- Configure the parameters: Adjust the number of sampling steps according to the video quality you want, and change the denoise parameter in the HunyuanVideo Sampler node to control how strongly the output is altered.
- Generate the content: Set the frame rate of the generated video (see the sketch below for how frame rate and frame count interact), then click Generate to produce the video with HunyuanVideo.
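When picking the frame rate, the quick sketch below shows how clip duration, fps, and the VAE's 4x temporal compression interact. The "4k + 1 frames" pattern is an assumption based on the causal VAE keeping the first frame; check the wrapper node's own limits and defaults.

```python
# Helper for choosing frame counts in the text-to-video workflow.
# Assumes valid frame counts follow the 4k + 1 pattern implied by the causal
# 3D VAE (first frame kept, remaining frames compressed 4x) -- verify against
# the HunyuanVideo wrapper node's actual constraints.

def plan_clip(duration_s: float, fps: int = 24):
    frames = int(duration_s * fps)
    frames = (frames // 4) * 4 + 1           # snap to the 4k + 1 pattern
    latent_frames = (frames - 1) // 4 + 1    # frames the diffusion model sees
    return frames, latent_frames

print(plan_clip(duration_s=5, fps=24))       # (121, 31)
```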