Picture this: transforming silent images into sound waves! Researchers have cracked the code, extracting audio from soundless visuals. What's even more remarkable? This innovation was inspired by the sci-fi show, Fringe, and brought to life by a brilliant professor. Using cutting-edge algorithms, they're redefining how we perceive audio and visuals, offering a glimpse into a future where sound leaps from static images.
It's Possible To Extract Audio From A Still, Soundless Image
In the realm of television, we've all witnessed those jaw-dropping moments, like when the FBI miraculously extracted sound from a melted glass pane.
Den of Geek didn't hold back, labeling it as "ridiculous pseudo-science," a critique that certainly raised eyebrows.
However, Professor Kevin Fu, a visionary in electrical and computer engineering and computer science at Northeastern University, took it upon himself to turn skepticism into a challenge.
He set out to demonstrate that the improbable, extracting audio from still images and silent videos, is in fact possible.
“Imagine someone is doing a TikTok video and they mute it and dub music,” Fu said in a press release.
“Have you ever been curious about what they’re really saying? Was it ‘Watermelon watermelon’ or ‘Here’s my password’? Was somebody speaking behind them? You can actually pick up what is being spoken off camera.”
Here's how it works: Most phone cameras have a nifty image stabilization feature. The lens is suspended on springs in a liquid, and an electromagnet moves it to counteract shaky hands.
Now, the interesting part is that this image stabilization system unintentionally picks up audio. When someone or something makes a noise near the camera lens, the springs vibrate a bit and bend the light ever so slightly.
You won't notice it unless you're specifically looking for it, as Professor Fu explains. On its own, this distortion doesn't give you useful audio.
But when combined with another feature in modern phone cameras, it transforms into something worth listening to.
“The way cameras work today to reduce cost basically is they don’t scan all pixels of an image simultaneously – they do it one row at a time,” Fu explained.
“[That happens] hundreds of thousands of times in a single photo. What this basically means is you’re able to amplify by over a thousand times how much frequency information you can get, basically the granularity of the audio.”
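To see why row-by-row capture matters, here is a back-of-the-envelope sketch of our own (the frame rate and row count are typical assumed values, not measurements from the study): because each row is exposed at a slightly different instant, the lens vibration gets sampled once per row rather than once per frame.

```python
# Rough illustration (not the paper's code) of why a rolling shutter
# turns a camera into a high-rate sampler of lens vibration.
# The numbers below are typical assumptions, not values from the study.

def effective_sample_rate(frames_per_second, rows_per_frame):
    """Each row is exposed at a slightly different instant, so the
    stabilization distortion is sampled once per row, not per frame."""
    return frames_per_second * rows_per_frame

global_shutter = effective_sample_rate(30, 1)      # whole frame at once
rolling_shutter = effective_sample_rate(30, 3000)  # one row at a time

print(global_shutter)    # 30 samples/s: far below speech frequencies
print(rolling_shutter)   # 90000 samples/s: comfortably above them
```

With thousands of rows per frame, the granularity of the recoverable signal jumps by three orders of magnitude, which is the amplification Fu describes.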
With this data, unintentionally captured by the way photos are taken, you can recover faint audio from almost any image that has light in it.
The team utilizes a clever machine-learning algorithm called Side Eye, which works its magic to turn this data into useful audio.
“If you want to know if I said yes or no, you can train [Side Eye] on people saying yes and no and then look at the patterns and with high confidence when I get an image later know if someone said yes or no."
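Fu's description maps onto a standard supervised-classification recipe. Below is a toy nearest-centroid sketch of our own, not the actual Side Eye model: the real features would be rolling-shutter distortion patterns extracted from images, while the vectors here are made up purely to show the train-then-predict flow.

```python
# Toy sketch of the "train on people saying yes and no" idea.
# This is NOT Side Eye; it is a minimal nearest-centroid classifier
# on made-up 2-D feature vectors standing in for image distortions.

def centroid(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def train(labeled):
    """labeled: dict mapping a label to its list of feature vectors."""
    return {label: centroid(vecs) for label, vecs in labeled.items()}

def predict(model, vec):
    """Return the label whose centroid is closest to vec."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist2(model[label], vec))

# Made-up training data standing in for extracted image features.
model = train({
    "yes": [[1.0, 0.9], [1.1, 1.0], [0.9, 1.1]],
    "no":  [[-1.0, -0.8], [-0.9, -1.1], [-1.1, -1.0]],
})

print(predict(model, [0.95, 1.05]))  # "yes"
print(predict(model, [-1.0, -1.0]))  # "no"
```

Given a new image, the same feature extraction runs and the nearest learned pattern wins, which is the "high confidence" matching Fu describes.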
In their tests across 10 different smartphones, Professor Fu's team achieved some remarkable results.
They could identify spoken digits with an accuracy of 80.66%, pinpoint the speaker out of 20 choices with an impressive accuracy of 91.28%, and even accurately determine the speaker's gender with a staggering 99.67% accuracy.
While these findings open exciting possibilities, they also raise concerns about potential cybersecurity risks.
If malicious actors can eavesdrop on conversations from still images and videos without intended audio, it could lead to significant privacy and security issues.
To counter this, the team explored potential solutions, such as enhancing spring mechanisms, locking camera lenses, and randomizing how the rolling shutter captures pixels.
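The last of those defenses can be illustrated with a toy simulation of our own (not from the paper): the rows still sample the vibration, but an eavesdropper who assumes the usual top-to-bottom exposure order reconstructs a scrambled waveform.

```python
# Toy illustration (our own, not the paper's) of why randomizing the
# rolling-shutter row order is a plausible defense: reading the image
# top-to-bottom no longer yields the time-ordered vibration signal.
import math
import random

rows = 1000
# A 5-cycle sine wave standing in for the lens vibration over one frame.
tone = [math.sin(2 * math.pi * 5 * r / rows) for r in range(rows)]

# Normal shutter: row r is exposed at time r, so the signal survives.
in_order = tone[:]

# Randomized shutter: rows exposed in a secret shuffled order.
order = list(range(rows))
random.seed(0)
random.shuffle(order)
scrambled = [tone[t] for t in order]

def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

print(round(correlation(tone, in_order), 2))  # 1.0: signal intact
print(round(correlation(tone, scrambled), 2))  # near 0: scrambled
```

Without knowing the secret row order, the attacker's reconstruction correlates poorly with the true vibration, so little usable audio survives.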
However, their primary focus lies in the realm of law and justice. They're keen on exploring how extracted audio could be utilized in legal cases, shedding new light on its potential applications beyond security concerns.
"Maybe there's an alibi and it's being admitted to court and somebody wants to prove somebody was or wasn't there," Fu said. "You might be able to use this technique if you have an authenticated video with a known timestamp to confirm one way or the other. If you hear the person's voice, they're more than likely there."
The study is posted on the pre-print server arXiv and was presented at the 2023 IEEE Symposium on Security and Privacy.