Investigating class granularity in historical aerial image segmentation: a comparative analysis of CNN and transformer-based models


DİHKAN M., Akbulut Z., Güler G., Özdemir S., KARSLI F.

International Journal of Remote Sensing, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2026
  • DOI Number: 10.1080/01431161.2026.2660253
  • Journal Name: International Journal of Remote Sensing
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Arctic & Antarctic Regions, BIOSIS, Compendex, Environment Index, Geobase, INSPEC, Public Affairs Index
  • Keywords: Historical aerial imagery, land cover classification, Vision Transformer
  • Karadeniz Teknik Üniversitesi Affiliated: Yes

Abstract

Historical aerial imagery acts as a crucial bridge to the pre-satellite era and is becoming increasingly vital for quantifying long-term land cover change and supporting retrospective environmental studies. In particular, the potential of transformer-based architectures for semantic segmentation of panchromatic historical aerial imagery under increasing segmentation complexity remains underexplored. This study presents a comprehensive comparison of CNN-based (DeepLabV3+, PSPNet, HRNet, ConvNeXt) and transformer-based (ViT, Swin Transformer, BEiT) architectures for semantic segmentation of historical aerial imagery, assessing how effectively each architecture handles increasing class granularity and captures finer-grained land-cover semantics. Experiments were conducted at three levels of detail, binary, three-class, and five-class segmentation, on a dataset of panchromatic images captured over Turkey in 1960. Precision, recall, F1-score, and intersection-over-union (IoU) were used to evaluate class-wise performance. In the binary ground/non-ground segmentation task, all architectures achieved mean F1-scores of 0.945–0.959 and mean IoU values of 0.905–0.921. In the three-class and five-class tasks, mean F1-scores ranged from 0.861–0.902 and 0.754–0.803, with mean IoU values of 0.777–0.831 and 0.623–0.682, respectively. In the three-class setting, ConvNeXt, Swin Transformer, and BEiT clustered towards the upper end of the mean metrics, with only small differences among them. Increasing class granularity from three to five classes reduced mean performance across all architectures, by up to 0.123 in mean F1-score and up to 0.176 in mean IoU. Across all classification levels, ConvNeXt consistently achieved the highest performance, standing out in the five-class task with the highest F1-scores and IoU values, and outperforming both the CNN-based baselines and transformer-based models in the challenging building and road categories.
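As background on the evaluation protocol, the class-wise metrics named in the abstract (precision, recall, F1-score, and IoU) can all be derived from per-class true-positive, false-positive, and false-negative pixel counts. The sketch below is a minimal illustration of those standard definitions, not the authors' evaluation code; the function name `per_class_metrics` and the toy label maps are hypothetical.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Per-class precision, recall, F1, and IoU from flat label maps.

    Illustrative only: computes the standard definitions used in
    semantic-segmentation evaluation from pixel-wise TP/FP/FN counts.
    """
    metrics = {}
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))  # correctly labelled c
        fp = np.sum((y_true != c) & (y_pred == c))  # wrongly labelled c
        fn = np.sum((y_true == c) & (y_pred != c))  # missed pixels of c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        metrics[c] = {"precision": precision, "recall": recall,
                      "f1": f1, "iou": iou}
    return metrics

# Toy flattened label maps for a binary ground (0) / non-ground (1) task.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
m = per_class_metrics(y_true, y_pred, n_classes=2)
```

Mean F1 and mean IoU, as reported per task level in the abstract, would then be the unweighted averages of these per-class values over the 2, 3, or 5 classes.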