Investigating class granularity in historical aerial image segmentation: a comparative analysis of CNN and transformer-based models


DİHKAN M., Akbulut Z., Güler G., Özdemir S., KARSLI F.

International Journal of Remote Sensing, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2026
  • DOI Number: 10.1080/01431161.2026.2660253
  • Journal Name: International Journal of Remote Sensing
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Arctic & Antarctic Regions, BIOSIS, Compendex, Environment Index, Geobase, INSPEC, Public Affairs Index
  • Keywords: Historical aerial imagery, land cover classification, Vision Transformer
  • Karadeniz Teknik Üniversitesi Affiliated: Yes

Abstract

Historical aerial imagery acts as a crucial bridge to the pre-satellite era and is becoming increasingly vital for quantifying long-term land cover change and supporting retrospective environmental studies. In particular, the potential of transformer-based architectures for semantic segmentation of panchromatic historical aerial imagery under increasing segmentation complexity remains underexplored. This study presents a comprehensive comparison of CNN-based (DeepLabV3+, PSPNet, HRNet, ConvNeXt) and transformer-based (ViT, Swin Transformer, BEiT) architectures for semantic segmentation of historical aerial imagery, assessing how effectively each architecture handles increasing class granularity and captures finer-grained land-cover semantics. Experiments were conducted at three levels of detail, binary, three-class, and five-class segmentation, on a dataset of panchromatic images captured over Turkey in 1960. Precision, recall, F1-score, and intersection-over-union (IoU) were used to evaluate class-wise performance. In the binary ground/non-ground segmentation task, all architectures achieved mean F1-scores of 0.945–0.959 and mean IoU values of 0.905–0.921. In the three-class and five-class tasks, mean F1-scores ranged from 0.861–0.902 and 0.754–0.803, with mean IoU values of 0.777–0.831 and 0.623–0.682, respectively. In the three-class setting, ConvNeXt, Swin Transformer, and BEiT clustered towards the upper end of the mean metrics, with only small differences among them. Increasing class granularity from three to five classes reduced mean performance across all architectures, by up to 0.123 in mean F1-score and up to 0.176 in mean IoU. Across all classification levels, ConvNeXt consistently achieved the highest performance, standing out in the five-class task with the highest F1-scores and IoU values, and outperforming both the CNN-based baselines and transformer-based models in the challenging building and road categories.
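As background on the evaluation protocol, the class-wise metrics named in the abstract (precision, recall, F1-score, and IoU) can all be derived from per-class true-positive, false-positive, and false-negative pixel counts. The sketch below is a minimal illustration of those standard definitions, not the authors' evaluation code; the function name `per_class_metrics` and the toy label maps are hypothetical.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Per-class precision, recall, F1, and IoU from flat label maps.

    Illustrative only: computes the standard definitions used in
    semantic-segmentation evaluation from pixel-wise TP/FP/FN counts.
    """
    metrics = {}
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))  # correctly labelled c
        fp = np.sum((y_true != c) & (y_pred == c))  # wrongly labelled c
        fn = np.sum((y_true == c) & (y_pred != c))  # missed pixels of c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        metrics[c] = {"precision": precision, "recall": recall,
                      "f1": f1, "iou": iou}
    return metrics

# Toy flattened label maps for a binary ground (0) / non-ground (1) task.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
m = per_class_metrics(y_true, y_pred, n_classes=2)
```

Mean F1 and mean IoU, as reported per task level in the abstract, would then be the unweighted averages of these per-class values over the 2, 3, or 5 classes.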