Confidence scoring for deep learning-predicted antibody–antigen complexes: AntiConf as a precision-driven metric

Ünsal, Serbülent; Holland, Benjamin; Sardag, Inci; Timucin, Emel

doi:10.1093/bib/bbag137

Ünsal S., Holland B., Sardag I., Timucin E.

Briefings in Bioinformatics, vol. 27, no. 2, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 27, Issue: 2
Publication Date: 2026
DOI: 10.1093/bib/bbag137
Journal Name: Briefings in Bioinformatics
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, Library, Information Science & Technology Abstracts (LISTA), MEDLINE, Directory of Open Access Journals
Keywords: AF3-based implementations, AlphaFold2, antibody–antigen complexes, model confidence scores, multimer prediction
Affiliated with Karadeniz Technical University: Yes

Abstract

Accurate determination of antibody–antigen (Ab–Ag) complex structures is critical for therapeutic development. While deep learning-based methods, beginning with AlphaFold2 (AF2), have revolutionized multimer prediction, the optimal strategies for Ab–Ag modeling and the reliability of the associated confidence scores remain active areas of research. This study evaluates the performance of AF2, Boltz-1, Boltz-1x, Boltz-2, Chai-1, Protenix, Protenix-1, OpenFold3, and ESMFold on a curated dataset of 200 Ab–Ag complexes. Among the nine methods tested, Protenix-1 emerged as the top performer, with Chai-1 consistently ranking second across multiple success metrics, closely followed by AF2. The effect of recycling iterations varied by method: AF2, Chai-1, and the Protenix variants benefited from additional cycles, whereas the Boltz variants did not. We analyzed various model confidence scores, noting high precision from pDockQ2 and high recall from the predicted Template-Modeling (pTM) score. By integrating these two scores, we developed antibody confidence (AntiConf), a novel metric that achieves superior performance for all methods in terms of precision and recall. These strengths make AntiConf a valuable post score for both computational predictions and downstream experimental workflows, reflecting its potential to improve Ab–Ag complex predictions by AF2 and AF3 architectures. Altogether, this study addresses current limitations in deep learning-based Ab–Ag complex prediction, showcases the potential of AntiConf for future assessment studies, and provides guidance for improving the accuracy of Ab–Ag complex prediction.