
Transferring Visual Deepfake Detection Methods to Audio Spoof Detection


Supervisor(s): Nicolas Müller
Status: finished
Topic: Others
Author: Roman Umberto Canals
Submission: 2021-10-18
Type of Thesis: Guided Research
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching


This guided research investigates whether visual deepfake detection methodologies can
be transferred and applied to audio deepfake detection. Subsequently, several follow-up
questions are examined.

In recent years, Deepfakes have come to the fore. The term Deepfake refers to seemingly
realistic media content that was generated or modified through the use of artificial intelligence.
On social media, the technology has gone viral in various forms such as face-swapping apps,
deep nostalgia, and assorted entertainment websites.

Despite all of its mind-blowing possibilities, one has to recognize the serious destructive potential
of this technology. Our entire society and our legislative system build on trust in media. We
live in a world where non-repudiation is one of the most highly valued assets. This holds not only for
public figures, for whom it can be universally explosive what was said by whom to whom at what time,
but also for ordinary citizens. Consider the saying that information which appears on the Internet
cannot be deleted. Until now, people were in control of what information about themselves
they present online. With this newly arising technology, this level of control will decrease.
As the "quality" of fake media content increases further and further, the day when anybody can
generate lifelike content with low effort, content that cannot be recognized as fake
by a human, seems close. We are facing an unprecedented societal change that cannot be stopped.
Nevertheless, we need to counteract and mitigate the danger of Deepfakes proactively.
Deepfake recognition has therefore become a hot topic. In recent years, a lot of research has been done on visual
as well as on audio spoof detection. Mittal et al. [1] used affective cues to strengthen
plain validation of multimedia content. Qi et al. [2] developed DeepRhythm, a detection
technique that exploits rhythmical changes of skin colour caused by heartbeats. For a survey,
refer to Tolosana et al. [3].
In this research paper, the following key questions are discussed. Is it possible to adapt
methods from the research area of visual deepfake detection to audio spoof detection? Does
every detail of the methodology even matter? And if not, are innovations in data acquisition
and sophisticated preprocessing techniques more promising?
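A common bridge between the two domains, and a plausible starting point for such a transfer, is to represent audio as a time-frequency "image" so that image-based detectors (e.g. CNNs) can be reused with minimal changes. The following sketch is an illustrative assumption rather than the actual pipeline of this thesis; the function name `log_spectrogram` and all parameter values are hypothetical.

```python
import numpy as np

def log_spectrogram(audio, n_fft=512, hop=256):
    """Convert a 1-D audio signal into a 2-D log-magnitude spectrogram.

    The resulting (freq_bins, time_frames) array can be treated like a
    grayscale image and passed to an image-based deepfake detector.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum per frame
    return np.log(spec + 1e-8).T                # log-compress, freq on axis 0

# Illustration: one second of a 440 Hz sine tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)
image = log_spectrogram(signal)
print(image.shape)  # -> (257, 61)
```

With this representation, the audio spoof detection problem reduces to a 2-D pattern recognition task, which is precisely what makes transferring visual methodologies plausible in the first place.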