Researchers at Microsoft Research Asia have developed an AI application that transforms a still image of a person and an audio track into an animated video. The application, named VASA-1, animates the person in the image to appear as if they are speaking or singing along with the audio, complete with matching facial expressions.
VASA-1 is designed to generate lifelike talking faces of virtual characters with visual affective skills (VAS) using just a single static image and an audio clip.
In a paper describing the framework, the researchers stated, “Our flagship model, VASA-1, not only produces lip movements that are finely synchronized with the audio but also captures a broad range of facial nuances and natural head motions, enhancing the perception of authenticity and liveliness.”
How Does Microsoft’s VASA-1 Operate?

The core of VASA-1 is a model that generates holistic facial dynamics and head movements, operating within a face latent space. To support this, the team also developed an expressive, disentangled face latent space learned from videos.
“Our approach not only achieves high-quality video output with realistic facial and head movements but also supports the real-time generation of 512×512 videos at up to 40 FPS with minimal latency. This sets the stage for real-time interactions with lifelike avatars that mimic human conversational behaviors,” the researchers elaborated.
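To make the pipeline described above concrete, here is a minimal toy sketch of its overall shape: encode a single image into an appearance latent, generate one motion latent per video frame from the audio, then decode frames. All function names, latent sizes, and internals below are hypothetical stand-ins (VASA-1's model and weights are not public); only the 512×512 resolution and 40 FPS figures come from the paper.

```python
import numpy as np

LATENT_DIM = 64  # size of the face latent space (assumed, for illustration)
FPS = 40         # the paper reports up to 40 FPS at 512x512

def encode_appearance(image: np.ndarray) -> np.ndarray:
    """Stand-in for the encoder mapping one static image to an
    identity/appearance latent. Here: a fixed random projection."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((LATENT_DIM, image.size))
    return proj @ image.ravel() / image.size

def generate_motion_latents(audio: np.ndarray, n_frames: int) -> np.ndarray:
    """Stand-in for the generator producing one facial-dynamics/head-pose
    latent per frame from audio. Here: mean energy per audio window,
    tiled across the latent dimension."""
    windows = np.array_split(audio, n_frames)
    energy = np.array([np.abs(w).mean() for w in windows])
    return np.tile(energy[:, None], (1, LATENT_DIM))

def decode_frame(appearance: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Stand-in for the decoder rendering one 512x512 frame from the
    combined appearance and motion latents."""
    return np.resize(appearance + motion, (512, 512))

def animate(image: np.ndarray, audio: np.ndarray, seconds: float) -> np.ndarray:
    n_frames = int(seconds * FPS)
    app = encode_appearance(image)
    motions = generate_motion_latents(audio, n_frames)
    return np.stack([decode_frame(app, m) for m in motions])

# One second of 16 kHz audio and one 512x512 grayscale image -> 40 frames.
video = animate(np.zeros((512, 512)), np.ones(16000), seconds=1.0)
print(video.shape)  # (40, 512, 512)
```

The sketch only illustrates the data flow (image → appearance latent, audio → per-frame motion latents, latents → frames); the real system replaces each stand-in with learned networks, including a diffusion-based motion generator.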