Spoofed Audio Detection Using a Fusion of Transformer Based Architectures

Bulut M., Tahaoğlu G., Ulutaş G., Üstübioğlu B., Ustubioglu A., Ulutaş M., ...

2025 18th International Conference on Information Security and Cryptology (ISCTürkiye), Ankara, Türkiye, 22-23 October 2025, pp. 1-6 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
DOI: 10.1109/isctrkiye68593.2025.11224850
City of Publication: Ankara
Country of Publication: Türkiye
Pages: pp. 1-6
Affiliated with Karadeniz Technical University: Yes

Abstract

The rapid evolution of deepfake audio generation and, more importantly, its easy availability through user-friendly tools have made synthetic speech and its abuse a considerable threat in recent years. Detecting spoofed audio will therefore be a growing concern, and robust, reliable methods with high detection rates are needed. To address this issue, we propose a method that effectively distinguishes spoofed audio from genuine audio. We use cochleagram images for feature extraction, a representation modeled closely on the human ear's biology, and the ViT and XCiT architectures for classification. Finally, so that each architecture can compensate for the other's deficiencies, we apply late score fusion, achieving an EER of 6.94% and a min t-DCF of 0.11 on the ASVspoof2019 LA benchmark dataset, surpassing state-of-the-art methods.
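
The late score fusion and EER evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the fusion rule, so a weighted average of the two models' scores is assumed here. The function names, the `weight` parameter, and the convention that a higher score means "more likely genuine" are likewise assumptions, and the scores are toy values, not results from the paper.

```python
import numpy as np


def fuse_scores(scores_a, scores_b, weight=0.5):
    """Late score fusion, assumed here to be a weighted average.

    scores_a, scores_b: per-utterance scores from two classifiers
    (e.g. one ViT and one XCiT model), higher = more likely genuine.
    weight: hypothetical weight given to the first model.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both models must score the same utterances")
    return weight * a + (1.0 - weight) * b


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    labels: 1 = genuine, 0 = spoofed. An utterance is accepted as
    genuine when its score is >= the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Candidate thresholds: every observed score, plus "reject everything".
    thresholds = np.concatenate([np.unique(scores), [np.inf]])
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        accepted = scores >= t
        far = np.mean(accepted[labels == 0])   # spoofed accepted as genuine
        frr = np.mean(~accepted[labels == 1])  # genuine rejected as spoofed
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


if __name__ == "__main__":
    # Toy scores only; each model misranks a different utterance.
    labels = [1, 1, 1, 0, 0, 0]
    vit_scores = [0.9, 0.4, 0.8, 0.3, 0.6, 0.1]
    xcit_scores = [0.3, 0.8, 0.9, 0.5, 0.2, 0.2]
    fused = fuse_scores(vit_scores, xcit_scores)
    print("ViT EER:  ", equal_error_rate(vit_scores, labels))
    print("XCiT EER: ", equal_error_rate(xcit_scores, labels))
    print("Fused EER:", equal_error_rate(fused, labels))
```

In this toy example each model makes a mistake the other does not, so averaging their scores separates the two classes better than either model alone, which is the motivation the abstract gives for fusing the two architectures. The min t-DCF metric additionally depends on the scores of an automatic speaker verification system and on cost parameters defined by the ASVspoof evaluation plan, so it is not sketched here.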