2025 EARSeL and DGPF Joint Istanbul Workshop on Topographic Mapping from Space, İstanbul, Türkiye, 29-31 January 2025, vol. 48, pp. 23-29 (Full Text Paper)
Visual Foundation Models (VFMs) demonstrate impressive generalization capabilities for image segmentation and classification tasks, leading to their increasing adoption in the remote sensing field. This study investigates the performance of VFMs in zero-shot building segmentation from aerial imagery using two model pipelines: Grounded-SAM and SAM+CLIP. Grounded-SAM integrates the Grounding DINO backbone with the Segment Anything Model (SAM), while SAM+CLIP first employs SAM to generate masks, followed by Contrastive Language-Image Pretraining (CLIP) for classification. The evaluation, performed on the WHU building dataset using Precision, Recall, F1 score, and Intersection over Union (IoU) metrics, revealed that Grounded-SAM achieved an F1 score of 0.83 and an IoU of 0.71, whereas SAM+CLIP achieved an F1 score of 0.65 and an IoU of 0.49. While Grounded-SAM excelled at accurately delineating partially occluded and irregularly shaped buildings, SAM+CLIP was able to segment larger buildings but struggled to delineate smaller ones. Given the impressive performance of VFMs in zero-shot building segmentation, future efforts aimed at refining these models through fine-tuning or few-shot learning could significantly expand their application in remote sensing.
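The four metrics used in the evaluation can be computed directly from the true-positive, false-positive, and false-negative pixel counts of a predicted binary mask against a ground-truth mask. The sketch below is a minimal, generic implementation of these standard definitions in NumPy; the function name `segmentation_metrics` is illustrative and not taken from the paper's codebase.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Compute Precision, Recall, F1 score, and IoU for two binary masks.

    pred, gt: arrays of the same shape; nonzero pixels are treated as
    "building", zero pixels as background.
    """
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)

    # Pixel-level confusion counts.
    tp = np.logical_and(pred, gt).sum()      # predicted building, is building
    fp = np.logical_and(pred, ~gt).sum()     # predicted building, is background
    fn = np.logical_and(~pred, gt).sum()     # missed building pixels

    # Standard metric definitions, guarding against empty denominators.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return precision, recall, f1, iou
```

Note that F1 and IoU are monotonically related for a single mask pair (IoU = F1 / (2 - F1)), which is consistent with the reported score pairs (0.83/0.71 and 0.65/0.49) up to rounding.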