Last Updated on May 27, 2026 by Staff
Artificial intelligence is getting better at making images, videos and sound. Most AI systems still struggle to make audio that matches what we see in videos. A new AI system from researchers at KAIST, POSTECH and Sony AI wants to change that.
The research team has created a technology called PAVAS, or Physics-Aware Video-to-Audio Synthesis. Unlike other video-to-audio AI systems, PAVAS does not just recognize objects or scenes. Instead, it looks at physical information such as an object’s weight, speed and movement to make more realistic sound effects.
The study is on the arXiv preprint server. It is a big step toward AI systems that understand physics.
Beyond Visual Recognition
Humans naturally connect movement with sound. For example, when we watch a dinosaur walk in a movie like Jurassic Park, we expect loud footsteps and deep rumbling noises.
This is because we understand things like size, mass and speed. Existing AI systems mainly make sound by identifying objects or scenes. They often do not capture how sounds change with motion.
The researchers wanted to solve this problem by building an AI system that understands why sounds happen, rather than one that just copies patterns from data.
How PAVAS Works
PAVAS estimates physical information from video footage. Even though videos do not give direct measurements for weight or speed, the AI looks at motion patterns, the environment and object interactions to infer these properties.
The system then uses this information to make matching sound effects.
For example, if two objects collide at high speed, PAVAS makes loud and sharp impact sounds. If they move slowly or appear light, the audio is softer and less intense.
Researchers say the AI learns not only what is happening but also why a certain sound should happen.
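To make the collision example concrete, here is a minimal Python sketch of the underlying physical intuition: heavier and faster objects carry more kinetic energy, so an impact between them should sound louder. This is not the PAVAS model, which learns such relationships from video. The function names, the reference energy and the idea of reading loudness directly off kinetic energy are all assumptions made for illustration.

```python
import math


def impact_energy(mass_kg: float, speed_m_s: float) -> float:
    """Kinetic energy in joules, using the standard formula 0.5 * m * v^2."""
    return 0.5 * mass_kg * speed_m_s ** 2


def relative_loudness_db(energy_j: float, reference_j: float = 1.0) -> float:
    """Illustrative level in decibels relative to a hypothetical reference energy.

    Assumption: loudness is treated as a simple log ratio of energies.
    Real impact sounds also depend on material, shape and surroundings.
    """
    if energy_j <= 0 or reference_j <= 0:
        raise ValueError("energies must be positive")
    return 10.0 * math.log10(energy_j / reference_j)


# A heavy, fast object versus a light, slow one (made-up values).
fast_heavy = relative_loudness_db(impact_energy(mass_kg=2.0, speed_m_s=10.0))
slow_light = relative_loudness_db(impact_energy(mass_kg=0.5, speed_m_s=2.0))

print(f"fast, heavy impact: {fast_heavy:.1f} dB relative to reference")
print(f"slow, light impact: {slow_light:.1f} dB relative to reference")
```

In this toy setup the fast, heavy collision comes out 20 dB above the slow, light one, which mirrors the louder-versus-softer behavior described above.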
Realistic Audio Results
In tests, the AI performed well on scenes involving collisions, impacts and object movement.
The sounds matched real-world physics, especially when object mass and velocity mattered. Loudness, vibration and tone changed naturally with the force and speed of interactions.
Compared to existing systems, PAVAS made audio that felt more immersive and believable.
The researchers think this technology is especially useful for filmmaking, gaming and advertising where realistic sound design is key.
Different From Other AI Models
Companies like Google and ByteDance have developed AI systems that generate video and audio together. Examples include Google’s Veo 3 and ByteDance’s Seedance 2.0.
However, these systems focus on creating multimedia content from prompts.
PAVAS is different. It specializes in improving existing video scenes with accurate sound effects.
This makes it valuable for post-production work, where creators often need to add or adjust sounds after filming.
Future of Physical AI
The researchers think PAVAS is a step toward “Physical AI”. This means AI systems that understand physics and real-world cause and effect, rather than just producing visually convincing results.
In the future, technologies like PAVAS could improve augmented reality, robotics simulations, gaming and metaverse experiences.
The team also thinks the system could become part of next-generation multimodal AI that understands text, speech, images, videos and physical interactions.
By teaching AI to understand why sounds happen, researchers may be bringing machines closer to understanding the world the way humans do.
