"The visual microphone": researchers develop algorithm that recovers audio from video of objects

The algorithm can glean speech from the vibrations of a potato-chip bag filmed through soundproof glass.

Well, this is interesting: researchers at MIT, Microsoft, and Adobe have developed an algorithm that can reconstruct audio signals by analyzing the microscopic vibrations of inanimate objects on video. The researchers recovered audio (including speech) from videos of a potato-chip bag, aluminum foil, the surface of a glass of water, and the leaves of a potted plant.

“When sound hits an object, it causes the object to vibrate,” says Abe Davis, the first author on the new paper. “The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”

The researchers ran the visual data through a battery of image filters, deciphering slight distortions of movement to recreate audio signals. While the best results came from high-speed cameras (with frame rates of 2,000 to 6,000 FPS), they were also able to infer lower-quality audio information from video recorded at a standard 60 FPS, including the gender of a speaker, the number of speakers, and possibly more.
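To get a feel for the idea, here is a deliberately crude sketch (not the researchers' actual method, which uses phase-based filtering of real video): it simulates a vibrating object as a 1-D intensity bump whose position is nudged, by fractions of a pixel, by an audio waveform, then recovers that motion by tracking the bump's intensity-weighted centroid frame by frame. All names and parameters here are illustrative assumptions.

```python
import math

def render_frames(audio, width=200, amp=2.0):
    # Each "frame" is a 1-D Gaussian intensity bump whose position is
    # displaced (sub-pixel) by the audio amplitude at that instant.
    center = width / 2
    return [[math.exp(-0.5 * ((x - center - amp * s) / 5.0) ** 2)
             for x in range(width)]
            for s in audio]

def recover_motion(frames):
    # Estimate per-frame displacement via the intensity-weighted centroid --
    # a crude stand-in for the paper's filter-based local motion signals.
    centroids = [sum(x * v for x, v in enumerate(f)) / sum(f) for f in frames]
    mean = sum(centroids) / len(centroids)
    return [c - mean for c in centroids]

def corr(a, b):
    # Normalized correlation between two signals.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den

# A 440 Hz tone "filmed" at 2,000 frames per second for 50 ms.
fps, freq = 2000, 440.0
audio = [math.sin(2 * math.pi * freq * n / fps) for n in range(100)]

recovered = recover_motion(render_frames(audio))
print(round(corr(audio, recovered), 3))  # correlation near 1.0
```

Because the camera samples motion in time just like a microphone samples pressure, the frame rate plays the role of the audio sample rate, which is why the high-speed (2,000 to 6,000 FPS) recordings recover far more of the audible band than 60 FPS video does.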

“We’re recovering sounds from objects,” says Davis. “That gives us a lot of information about the sound that’s going on around the object, but it also gives us a lot of information about the object itself, because different objects are going to respond to sound in different ways.”

Watch a video about the research below, and find more details about the paper via MIT.
