From Chat to Academia: Calibrating Formality in Low-Resource Languages


Efendioğlu S., Pehlivan H.

IEEE ACCESS, vol. 14, pp. 7776-7791, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 14
  • Publication Date: 2026
  • DOI: 10.1109/access.2026.3652343
  • Journal Name: IEEE ACCESS
  • Indexed in: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
  • Pages: pp. 7776-7791
  • Affiliated with Karadeniz Teknik Üniversitesi: Yes

Abstract

How formal should a sentence sound? The answer is rarely limited to formal or informal. In natural communication, formality changes gradually from casual conversation to professional writing and highly academic prose. For many low-resource languages, however, this continuum of graded stylistic shifts remains largely unmodeled. Turkish, despite its rich stylistic variation, still lacks a systematic framework for capturing such gradual shifts. This study introduces LyreSense, a framework designed to represent the full spectrum of formality in Turkish. We integrate human-written texts with annotation assisted by large language models (LLMs) and controlled synthetic text generation to construct a stylistically diverse corpus. Building on this, we propose a style-intensity-calibrated triplet loss that adapts its margin to differences in formality, enabling embeddings to disentangle subtle stylistic variation independently of semantic content. To train efficiently while preserving model capacity, we apply Low-Rank Adaptation (LoRA) during fine-tuning. Experiments across four ordered formality classes (informal, neutral, formal, and highly formal) demonstrate that LyreSense achieves a Macro-F1 of 0.69. Misclassifications are concentrated between adjacent categories, reflecting the natural continuity of formality, while extreme classes are consistently distinguished. LyreSense is more than a framework for Turkish: it establishes a scalable, language-agnostic pipeline for style-sensitive NLP in low-resource settings. By moving beyond binary style distinctions, it demonstrates how lightweight, efficient models can provide nuanced, human-like style awareness for both research and practical applications.
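The abstract's key technical idea, a triplet loss whose margin is calibrated to style intensity, can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual formulation: the function name, the linear margin schedule, and the `base_margin`/`scale` hyperparameters are assumptions for the sake of the example. The intuition it captures is the one stated above: a negative drawn from a distant formality class (e.g. highly formal vs. informal) should incur a larger margin, and thus be pushed further away in embedding space, than a negative from an adjacent class.

```python
import numpy as np

def calibrated_triplet_loss(anchor, positive, negative,
                            level_a, level_p, level_n,
                            base_margin=0.2, scale=0.1):
    """Triplet loss with a margin that grows with the formality gap.

    anchor/positive/negative are embedding vectors; level_* are integer
    formality levels (0=informal, 1=neutral, 2=formal, 3=highly formal).
    base_margin and scale are illustrative hyperparameters.
    """
    # Euclidean distances in embedding space.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    # How much further, in formality levels, the negative is from the
    # anchor than the positive is; a larger gap demands a larger margin.
    gap = abs(level_a - level_n) - abs(level_a - level_p)
    margin = base_margin + scale * max(gap, 0)
    # Standard hinge form: penalize when the negative is not at least
    # `margin` further from the anchor than the positive.
    return max(d_pos - d_neg + margin, 0.0)
```

With such a loss, a misordered triplet from adjacent classes (informal vs. neutral) yields a smaller penalty than the same geometric arrangement involving extreme classes, which is consistent with the reported error pattern: confusions concentrate between neighboring categories while the extremes stay well separated.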