2025 18th International Conference on Information Security and Cryptology (ISCTürkiye), Ankara, Türkiye, 22-23 October 2025, pp. 1-6 (Full-Text Paper)
With the rapid evolution of deepfake audio generation and, more importantly, its ready availability through easy-to-use tools, synthetic speech and its abuse have become a considerable threat in recent years. Detecting spoofed audio will clearly be a growing concern, so robust and reliable methods that achieve high detection rates are needed. To address this issue, we propose a method for effectively distinguishing spoofed audio from genuine audio. We use cochleagram images for feature extraction, as this is the representation closest to the biology of the human ear, and ViT and XCiT architectures for classification. Finally, to offset the weaknesses of each architecture with the strengths of the other, we apply late score fusion, achieving 6.94% EER and 0.11