2025 18th International Conference on Information Security and Cryptology (ISCTürkiye), Ankara, Turkey, 22 - 23 October 2025, pp.1-6, (Full Text)
The rapid evolution of deepfake audio generation and, more importantly, its availability through easy-to-use tools have made synthetic speech and its abuse a considerable threat in recent years. Detecting spoofed audio will clearly be a growing concern, so robust and reliable detection methods are needed. To address this issue, we propose a method that effectively distinguishes spoofed audio from genuine audio. We use cochleagram images for feature extraction, as this representation is the closest to the biology of the human ear, and ViT and XCiT architectures for classification. Finally, so that each architecture can compensate for the other's deficiencies, we apply late score fusion, achieving 6.94% EER and 0.11
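As an illustration of the last two steps, late score fusion and EER evaluation can be sketched as follows. This is a minimal sketch under stated assumptions, not the method used in this work: the abstract does not specify the fusion rule, so an equal-weight average of the two classifiers' scores is assumed, and the function names, the `weight` parameter, the score convention (higher means more likely genuine), and the toy scores are all hypothetical.

```python
import numpy as np


def late_score_fusion(scores_a, scores_b, weight=0.5):
    """Fuse per-utterance scores from two classifiers by weighted averaging.

    Assumption: both score arrays are on a comparable scale (e.g. probabilities)
    and `weight` is the share given to the first classifier.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    return weight * a + (1.0 - weight) * b


def equal_error_rate(scores, labels):
    """Approximate the EER by scanning every observed score as a threshold.

    Assumption: higher score = more likely genuine; labels are 1 for genuine
    and 0 for spoofed. An utterance is accepted when score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    genuine = scores[labels == 1]
    spoofed = scores[labels == 0]
    thresholds = np.unique(scores)
    # False rejection rate: genuine utterances falling below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: spoofed utterances at or above the threshold.
    far = np.array([(spoofed >= t).mean() for t in thresholds])
    # The EER lies where the two error rates are closest to each other.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0


# Toy scores (hypothetical, not results from the paper).
vit_scores = [0.9, 0.8, 0.4, 0.3]
xcit_scores = [0.7, 0.9, 0.2, 0.6]
labels = [1, 1, 0, 0]

fused = late_score_fusion(vit_scores, xcit_scores)
eer = equal_error_rate(fused, labels)
```

Averaging at the score level keeps the two models independent, so either one can be retrained or replaced without touching the other; a tuned `weight` or a learned fusion layer would be natural alternatives.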