What issues are data scientists talking about? Identification of current data science issues using semantic content analysis of Q&A communities


PeerJ Computer Science, vol.9, 2023 (SCI-Expanded) identifier identifier identifier

  • Publication Type: Article / Article
  • Volume: 9
  • Publication Date: 2023
  • Doi Number: 10.7717/peerj-cs.1361
  • Journal Name: PeerJ Computer Science
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, Directory of Open Access Journals
  • Keywords: Data science, Q&A communities, Text mining, Topic modeling
  • Karadeniz Technical University Affiliated: Yes


Background: Because of the growing involvement of communities from various disciplines, data science is constantly evolving and gaining popularity. The growing interest in data science-based services and applications presents numerous challenges for their development. Therefore, data scientists frequently turn to various forums, particularly domain-specific Q&A websites, to solve difficulties. These websites evolve into data science knowledge repositories over time. Analysis of such repositories can provide valuable insights into the applications, topics, trends, and challenges of data science. Methods: In this article, we investigated what data scientists are asking by analyzing all posts to date on DSSE, a data science-focused Q&A website. To discover main topics embedded in data science discussions, we used latent Dirichlet allocation (LDA), a probabilistic approach for topic modeling. Results: As a result of this analysis, 18 main topics were identified that demonstrate the current interests and issues in data science. We then examined the topics’ popularity and difficulty. In addition, we identified the most commonly used tasks, techniques, and tools in data science. As a result, “Model Training”, “Machine Learning”, and “Neural Networks” emerged as the most prominent topics. Also, “Data Manipulation”, “Coding Errors”, and “Tools” were identified as the most viewed (most popular) topics. On the other hand, the most difficult topics were identified as “Time Series”, “Computer Vision”, and “Recommendation Systems”. Our findings have significant implications for many data science stakeholders who are striving to advance data-driven architectures, concepts, tools, and techniques