Evaluating the performance of generative artificial intelligence models in multidimensional data analysis tasks: a comparative study with large language models


DEMİR F., YAVUZ U.

DISCOVER COMPUTING, vol.29, no.1, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article
  • Volume: 29 Issue: 1
  • Publication Date: 2026
  • DOI: 10.1007/s10791-026-09905-1
  • Journal Name: DISCOVER COMPUTING
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Karadeniz Technical University Affiliated: Yes

Abstract

This study presents a comparative performance evaluation of state-of-the-art generative artificial intelligence models in the context of data analysis. Eight large language models (Claude, Gemini, ChatGPT, Qwen, Grok, DeepSeek, LLaMA, and Mistral) were tested on 13 distinct analytical tasks derived from the Titanic dataset. Performance was assessed using a multidimensional scoring rubric comprising five main categories (technical accuracy, analytical depth, machine learning application, presentation and communication, and originality) with a total of 14 sub-criteria. Each model's output was rated on a five-point scale by independent evaluators. Results indicate that Claude and Gemini outperformed the other models, particularly in tasks requiring reasoning and transparency, while LLaMA and Mistral showed weaknesses in higher-order cognitive tasks. Overall, the findings provide theoretical insight into the cognitive capacities of generative artificial intelligence models in data-driven contexts and offer practical guidance for model selection in applied analytics.
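The aggregation behind such a rubric can be sketched as follows. This is a minimal illustration only: the abstract names the five categories and the totals (14 sub-criteria, five-point scale), but the per-category sub-criterion counts and the equal-weight averaging used here are assumptions, not details from the paper.

```python
# Illustrative sketch of a multidimensional scoring rubric as described in the
# abstract. The split of the 14 sub-criteria across categories and the
# equal-weight aggregation are assumptions for demonstration purposes.
from statistics import mean

# Five main categories from the abstract; sub-criterion counts are assumed
# and chosen only so that they sum to the stated 14.
RUBRIC = {
    "technical_accuracy": 3,
    "analytical_depth": 3,
    "machine_learning_application": 3,
    "presentation_and_communication": 3,
    "originality": 2,
}

def score_model(ratings: dict) -> float:
    """Average five-point ratings within each category, then across categories."""
    for category, n_sub in RUBRIC.items():
        values = ratings[category]
        # Each sub-criterion is rated on a 1-5 scale by evaluators.
        assert len(values) == n_sub and all(1 <= v <= 5 for v in values)
    return mean(mean(ratings[category]) for category in RUBRIC)

# Hypothetical ratings for one model's output on one task.
example = {
    "technical_accuracy": [5, 4, 5],
    "analytical_depth": [4, 4, 3],
    "machine_learning_application": [4, 5, 4],
    "presentation_and_communication": [5, 5, 4],
    "originality": [3, 4],
}
print(round(score_model(example), 2))  # prints 4.17
```

In practice, each evaluator's ratings would be collected per model and per task, and such per-category means also allow the category-level comparisons the study reports (e.g. weaker scores on higher-order tasks for some models).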