With recent breakthroughs in generative models across language, video, and audio, users are no longer satisfied with merely hearing a virtual character's voice. There is a growing demand to interact with a visual "persona" capable of timely feedback, natural expression, and fluid interaction. RealVideo emerges in response to this need: a real-time streaming video dialogue system powered by autoregressive video generation. Within this framework, dialogue transcends the limitations of text and audio; responses are synthesized into continuous video streams on the fly, making interactions significantly more immersive and approachable. This blog post covers RealVideo's overall architecture, model design, inference acceleration mechanisms, and system-level engineering implementation.

We gratefully acknowledge the insights and design principles provided by TalkingMachines [1], from which many of our module designs are adapted.
The system workflow begins with character initialization: the user provides a reference image and a reference voice file for cloning, which the system uses to instantiate the character. The user can also set the system prompt to specify the role the model should play. Subsequently, RealVideo interacts with the user via text input. The user's messages are processed and recorded by the LLM, which then generates responses conditioned on the global context. The generated text is passed to the TTS module to synthesize speech, which in turn becomes input to the autoregressive diffusion model. The model outputs video in blocks (each ~0.5 seconds). The video latents are then stream-decoded by a VAE, and together with the corresponding audio, streamed to the frontend for real-time interaction.
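To make this flow concrete, the sketch below condenses one dialogue turn into Python-style pseudocode; every object and method name (llm.generate, tts.synthesize, dit.generate_block, vae.decode_block, and so on) is an illustrative placeholder rather than RealVideo's actual API.

```python
def dialogue_turn(user_text: str, session) -> None:
    """One turn: text in -> LLM reply -> TTS audio -> streamed video blocks out.
    All attribute and method names are placeholders for illustration only."""
    # 1. The LLM responds conditioned on the system prompt and full dialogue history.
    reply = session.llm.generate(session.history, user_text)
    session.history += [("user", user_text), ("assistant", reply)]

    # 2. TTS synthesizes the reply in the cloned reference voice.
    audio = session.tts.synthesize(reply, voice=session.reference_voice)

    # 3. The autoregressive diffusion model consumes the audio in ~0.5 s chunks and
    #    emits one latent video block per chunk, reusing its KV cache across blocks.
    for audio_chunk in session.chunk_audio(audio, seconds=0.5):
        latent_block = session.dit.generate_block(audio_chunk)
        # 4. The VAE stream-decodes each latent block into frames.
        frames = session.vae.decode_block(latent_block)
        # 5. Frames and the matching audio are pushed to the frontend as they are ready.
        session.frontend.push(frames, audio_chunk)
```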
In the current architecture, text input can easily be replaced with speech input by introducing ASR and VAD. Similarly, replacing the LLM with a streaming VLM would enable both video and speech input. For modularity, RealVideo uses text input by default; however, we welcome the community to extend it with additional input modalities.
Several open-source, non-real-time, audio-driven video generation models are already capable of producing high-quality results. After testing, we selected WanS2V [2] as our base model for further autoregressive training. WanS2V provides two generation modes: 5-second speech video generation and 5-second video continuation. Since autoregressive training allows the model to generate arbitrarily long videos without explicit continuation, our training focuses on the first mode.
The training pipeline of RealVideo builds upon the CausVid [3] and Self-Forcing [4] frameworks, with several improvements. The training process is divided into two stages:
Real-time generation requires producing each frame within 1/FPS seconds, meaning the inference latency per frame must be tightly bounded. Sparse attention is therefore essential to keep the context length manageable. A simple and effective approach is sliding-window attention: when the video length exceeds a set threshold, old KV-cache entries are truncated (while retaining the reference image's KV cache), keeping the attention context length fixed. However, sliding-window attention introduces two main issues: (i) long-term memory loss: the model forgets earlier frames and fails to accomplish long-duration actions; and (ii) action repetition: the model repeatedly generates short-term actions (especially in T2V and I2V), such as waving a hand over and over, because it cannot determine within the limited context window whether an action has already been completed.
Fortunately, in audio-driven video generation, this limitation is generally acceptable for two reasons: (i) the audio stream strictly constrains per-frame content; and (ii) human dialogues rarely involve long or complex actions. Thus, sliding-window attention is well-suited for audio-driven video generation, as well as for tasks featuring explicit control streams (such as skeleton or style). While not ideal, we selected this approach for the current version due to the greater training and deployment complexity of alternative attention mechanisms. Improvements are planned for future work.
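As a rough illustration of the cache management this implies, the sketch below keeps a fixed attention context by dropping the oldest entries while always retaining the reference-image (sink) entries; the tensor layout and function name are assumptions for illustration, not RealVideo's actual implementation.

```python
import torch

def truncate_kv_cache(k: torch.Tensor, v: torch.Tensor,
                      num_sink: int, max_ctx: int):
    """Sliding-window KV-cache truncation that always keeps the sink tokens.

    k, v:      [batch, heads, seq, dim] cached keys/values; the first `num_sink`
               positions hold the reference-image tokens.
    num_sink:  number of reference-image (sink) tokens kept permanently.
    max_ctx:   maximum context length (sink tokens + sliding window).
    """
    seq_len = k.shape[2]
    if seq_len <= max_ctx:
        return k, v  # still within the window, nothing to drop
    keep_recent = max_ctx - num_sink
    k = torch.cat([k[:, :, :num_sink], k[:, :, -keep_recent:]], dim=2)
    v = torch.cat([v[:, :, :num_sink], v[:, :, -keep_recent:]], dim=2)
    return k, v
```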
The concept of Sink Tokens was originally introduced in language models [5], where retaining critical tokens in the KV cache during sliding window attention significantly enhances performance in long-context text generation. In the context of real-time streaming video generation, tokens from the reference image are the ideal candidates for sink tokens, as they guide the model to maintain consistency with the reference image throughout the generation process. However, during prolonged conversational sessions, the relative distance between these sink tokens and the current generation frame grows continuously. This distance eventually exceeds the positional encoding range covered during training, precipitating a significant train-inference mismatch. This issue manifests directly as "identity drift": as generation time extends, the character in the video gradually deviates from the reference, leading to a progressive decline in visual fidelity.
Fortunately, since RoPE is a relative positional encoding, we can ensure strict alignment between inference and training during sliding-window attention by simply adjusting the positional indices of the sink tokens. A similar observation was made in concurrent work [6]. In the WanS2V architecture, the reference image is assigned a temporal RoPE index of 30, while the 5-second denoising video window occupies the range 0–20. This implies that as long as the relative distance between the reference image (sink tokens) and the current denoising frame stays within the [10, 30] interval, the configuration remains consistent with the training distribution. Consequently, when the index of the current generation frame exceeds 20, we dynamically update the RoPE index of the reference image so that this relative distance remains within the training range.
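The sketch below shows one straightforward way to implement this update. The exact rule (snapping the distance back to its training-time minimum of 10 once the frame index passes 20) is our reading of the description above, and the constants and function name are illustrative only.

```python
REF_TRAIN_INDEX = 30   # temporal RoPE index of the reference image during training
MAX_TRAIN_FRAME = 20   # largest temporal index inside the 5-second training window
MIN_DISTANCE = REF_TRAIN_INDEX - MAX_TRAIN_FRAME   # = 10, smallest distance seen in training

def ref_rope_index(frame_idx: int) -> int:
    """Temporal RoPE index assigned to the reference-image (sink) tokens when
    denoising the frame at temporal index `frame_idx`.

    Keeps the relative distance ref_index - frame_idx inside the [10, 30] range
    covered during training, so inference never leaves the training distribution.
    """
    if frame_idx <= MAX_TRAIN_FRAME:
        return REF_TRAIN_INDEX              # the original training configuration
    return frame_idx + MIN_DISTANCE         # slide the sink tokens forward with the window
```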
This positional-index relationship is also illustrated in the following figure. Experimental results demonstrate that this strategy effectively mitigates character drift over extended durations, significantly improving the stability and visual consistency of the generated video.

Inspired by DMD2 [7], we introduce adversarial training based on noisy latents during the Self-Forcing training stage to enhance visual quality and character consistency. Specifically, we leverage the fake score model's strong feature extraction capabilities on noisy latents to construct the discriminator, as illustrated in the following figure.

First, we sample a timestep corresponding to a low-noise regime (0–200 in our experiment). Noise is sampled based on this timestep and added to either real or generated video latents. These noisy latents, along with the conditioning inputs (reference image, text prompt, audio, and timestep), are fed into the fake score model. We then extract features from the later stages of the fake score model (e.g., Transformer blocks 14, 22, and 30) and pass them into lightweight classification heads. Each head consists of a Cross Attention layer, where a learnable register token serves as the Query and the noisy latent features serve as Key and Value, followed by an MLP. The features output by the three heads are concatenated and projected via a final MLP to yield the classification logit.
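For concreteness, here is a minimal sketch of the discriminator described above; the hidden sizes, head dimensions, and class names are illustrative assumptions rather than the exact RealVideo implementation.

```python
import torch
import torch.nn as nn

class DiscriminatorHead(nn.Module):
    """One lightweight head: a learnable register token cross-attends to the
    noisy-latent features from one fake-score-model block, then an MLP produces
    a per-sample feature vector (later concatenated across heads)."""

    def __init__(self, dim: int = 1024, num_heads: int = 8, out_dim: int = 256):
        super().__init__()
        self.register_token = nn.Parameter(torch.randn(1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, out_dim), nn.GELU())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: [batch, tokens, dim] features from one late transformer block
        q = self.register_token.expand(feats.shape[0], -1, -1)   # query: register token
        attended, _ = self.cross_attn(q, feats, feats)           # key/value: noisy-latent features
        return self.mlp(attended.squeeze(1))                     # [batch, out_dim]

class Discriminator(nn.Module):
    """Concatenate the head outputs and project to a single classification logit."""

    def __init__(self, dim: int = 1024, out_dim: int = 256, num_blocks: int = 3):
        super().__init__()
        self.heads = nn.ModuleList([DiscriminatorHead(dim, out_dim=out_dim) for _ in range(num_blocks)])
        self.final = nn.Sequential(nn.Linear(out_dim * num_blocks, out_dim), nn.GELU(), nn.Linear(out_dim, 1))

    def forward(self, block_feats) -> torch.Tensor:
        # block_feats: list of features from e.g. transformer blocks 14, 22, and 30,
        # detached so no gradient flows back into the fake score model.
        pooled = torch.cat([h(f.detach()) for h, f in zip(self.heads, block_feats)], dim=-1)
        return self.final(pooled)  # [batch, 1] classification logit
```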
We cut off the gradient flow from the classifier back into the fake score model to avoid interfering with the DMD loss. Additionally, because overly frequent discriminator updates can overpower the generator, we (i) reduce the discriminator's learning rate, and (ii) stop updating the discriminator once its loss falls below a set threshold.
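A minimal sketch of the update gating in (ii), assuming a hypothetical loss threshold and a standard PyTorch optimizer step:

```python
def maybe_update_discriminator(d_loss, d_optimizer, loss_floor: float = 0.3) -> bool:
    """Skip the discriminator step once its loss drops below `loss_floor`.

    The exact threshold used by RealVideo is not stated; 0.3 is a placeholder.
    Returns True if an update was performed.
    """
    if d_loss.item() < loss_floor:
        return False              # discriminator already strong enough; freeze it this step
    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()
    return True
```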
Adversarial training noticeably improves perceptual quality and reduces color drift in long-term video generation.

RealVideo's architecture consists of three major components:

The Frontend handles user interactions: it sends user input to the backend and displays the streamed audio and video frames.
The VAE Service orchestrates the following tasks:
The DiT Service hosts a streaming video generation Diffusion Transformer. It receives audio embeddings from the VAE Service and generates corresponding latent video blocks in an autoregressive, streaming fashion, which are then sent back to the VAE Service.
The two primary factors affecting user experience are (i) real-time generation, i.e., whether each video block can be produced at least as fast as it is played back, and (ii) response speed, i.e., the latency from the end of the user's input to the first frame of the video response.
We implemented several optimizations accordingly.
The primary bottleneck for real-time performance is whether the DiT Service can complete the denoising and transmission of the next block within the playback duration of the current block. We implemented several acceleration strategies:
Another acceleration path is model quantization, which is already supported in the current repo; we welcome community efforts to help mature the quantized version.
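To make the real-time budget above concrete, the snippet below spells out the per-block timing constraint; the frame rate and frames-per-block values are assumptions, since this post only states that each block covers roughly 0.5 seconds of video.

```python
# Per-block real-time budget. FPS and FRAMES_PER_BLOCK are illustrative assumptions.
FPS = 24
FRAMES_PER_BLOCK = 12
BLOCK_PLAYBACK_S = FRAMES_PER_BLOCK / FPS   # ~0.5 s of playback per block

def meets_real_time_budget(denoise_s: float, transfer_s: float) -> bool:
    """The DiT Service must denoise and ship the next block before the
    current block finishes playing on the frontend."""
    return denoise_s + transfer_s < BLOCK_PLAYBACK_S
```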
Response speed is defined as the latency between the user's input completion and the start of the video response. This is determined by the latency of upstream models (LLM, TTS) and the video generation system itself.
During interaction, the avatar is often in a silent state (zero audio input). We observed that pure zero inputs caused the avatar to freeze entirely. To address this, we inject random noise into silent audio frames, with variance matching the training audio data's global background noise. This effectively prevents static artifacts and maintains a lifelike presence.
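A minimal sketch of this noise injection is shown below; the noise standard deviation is left as a parameter, since in practice it is estimated from the training data's background noise.

```python
import torch

def fill_silence(audio_frame: torch.Tensor, noise_std: float) -> torch.Tensor:
    """Replace an all-zero (silent) audio frame with low-level Gaussian noise.

    `noise_std` should match the global background-noise level measured on the
    training audio, so the avatar keeps moving naturally instead of freezing.
    """
    if torch.all(audio_frame == 0):
        return torch.randn_like(audio_frame) * noise_std
    return audio_frame
```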

RealVideo is a real-time streaming conversational video system that transforms text interactions into continuous, high-fidelity video responses using autoregressive diffusion. A dual-service architecture (VAE Service + DiT Service), combined with sequence-parallel inference, KV-cache optimization, and pipeline scheduling, enables smooth 0.5-second video-block streaming with end-to-end latency of about two seconds. Together, these modeling and engineering advances make RealVideo one of the first open, practical systems capable of real-time, lifelike, and continuously generated conversational video.
RealVideo Team
Ke Ning, Zhuoyi Yang†, Jiayan Teng†, Shizhan Liu, Cheng Wang, Zhiming Zhang, Zhenxing Zhang, Xiaotao Gu, Jie Tang
† Project Leader
[1] Low, C., & Wang, W. (2025). TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models. arXiv preprint arXiv:2506.03099.
[2] Gao, X., Hu, L., Hu, S., Huang, M., Ji, C., Meng, D., ... & Zhuo, L. (2025). Wan-s2v: Audio-driven cinematic video generation. arXiv preprint arXiv:2508.18621.
[3] Yin, T., Zhang, Q., Zhang, R., Freeman, W. T., Durand, F., Shechtman, E., & Huang, X. (2025). From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 22963-22974).
[4] Huang, X., Li, Z., He, G., Zhou, M., & Shechtman, E. (2025). Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion. arXiv preprint arXiv:2506.08009.
[5] Xiao, G., Tian, Y., Chen, B., Han, S., & Lewis, M. (2024). Efficient Streaming Language Models with Attention Sinks. In The Twelfth International Conference on Learning Representations.
[6] Huang, Y., Guo, H., Wu, F., Zhang, S., Huang, S., Gan, Q., ... & Hoi, S. (2025). Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length. arXiv preprint arXiv:2512.04677.
[7] Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., & Freeman, W. T. (2024). Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems, 37, 47455-47487.