Transferring Visual Deepfake Detection Methods to Audio Spoof Detection

Supervisor(s):	Nicolas Müller
Status:	finished
Topic:	Others
Author:	Roman Umberto Canals
Submission:	2021-10-18
Type of Thesis:	Guided Research
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching
Description

This guided research investigates whether visual deepfake detection methodologies can be transferred and applied to audio deepfake detection. Several follow-up questions are then examined.

In recent years, deepfakes have come to the fore. The term deepfake refers to seemingly realistic media content that was generated or modified using artificial intelligence. On social media, the technology has gone viral in different forms such as face-swapping apps, Deep Nostalgia, and various fun websites. Despite all of its mind-blowing possibilities, one has to see the serious destructive potential of this technology. Our entire society and our legislative system build on trust in media. We live in a world where non-repudiation is one of the most highly valued assets. This holds not only for public figures, for whom it is explosive what was said by whom, to whom, and when, but also for ordinary citizens. Consider the saying that information that appears on the Internet cannot be deleted. Until now, people were in control of what information about themselves they presented online. With this newly arising technology, this level of control will decrease continuously. As the “quality” of fake media content keeps increasing, the day is presumably close when anybody can, with little effort, generate lifelike content that a human cannot recognize as fake.

We are facing an unprecedented societal change that cannot be stopped. Nevertheless, we need to proactively counteract and mitigate the danger of deepfakes. Deepfake recognition is therefore an active research topic. In recent years, a lot of research has been done on visual as well as audio spoof detection. Mittal et al. [1] used affective cues to strengthen plain validation of multimedia content. Qi et al. [2] developed DeepRhythm, a detection technique that exploits rhythmical changes of skin colour caused by heartbeats. For a survey, refer to Tolosana et al. [3].

This research paper discusses the following key questions: Is it possible to adapt methods from the research area of visual deepfake detection to audio spoof detection? Does every detail of the methodology even matter? And if not, are innovations in data acquisition and sophisticated preprocessing techniques “more promising”?
