Improving bag-of-poses with semi-temporal pose descriptors for skeleton-based action recognition


Agahian S., Negin F., KÖSE C.

VISUAL COMPUTER, vol. 35, pp. 591-607, 2019 (journal indexed in SCI)

  • Volume: 35, Issue: 4
  • Publication Date: 2019
  • DOI: 10.1007/s00371-018-1489-7
  • Pages: pp. 591-607


Over the last few decades, human action recognition has become one of the most challenging tasks in the field of computer vision. Effortless and accurate extraction of 3D skeleton information has recently become possible by means of economical depth sensors and state-of-the-art deep learning approaches. In this study, we introduce a novel bag-of-poses framework for action recognition using 3D skeleton data. Our assumption is that any action can be represented by a set of predefined spatiotemporal poses. The pose descriptor is composed of three parts: the first part is the concatenation of the normalized coordinates of the skeleton joints; the second part is the temporal displacement of the joints with respect to a predefined temporal offset; and the third part is the temporal displacement with respect to the previous frame in the sequence. To generate the key poses, we apply K-means clustering over all the training pose descriptors of the dataset. An SVM classifier is trained on the generated key poses to classify each pose, and every action in the dataset is then encoded as a histogram of key poses. An ELM classifier is used for action recognition owing to its fast, accurate, and reliable performance compared with other classifiers. The proposed framework is validated on five publicly available benchmark 3D action datasets; it achieves state-of-the-art results on three of them and competitive results on the remaining two.
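As a rough illustration of the pipeline described above — the three-part pose descriptor, K-means key-pose generation, and key-pose histogram encoding — the following sketch uses plain NumPy. The joint count, the choice of joint 0 as the normalization root, the offset value, and the simple k-means loop are illustrative assumptions, not the paper's actual settings; the SVM pose-classification and ELM recognition stages are omitted.

```python
import numpy as np

def pose_descriptor(seq, t, offset=5):
    """Build the three-part descriptor for frame t of a (F, J, 3) skeleton sequence.
    Parts: root-normalized joint coordinates, displacement w.r.t. a temporal
    offset, and displacement w.r.t. the previous frame (illustrative sketch)."""
    pose = seq[t]
    norm = (pose - pose[0]).ravel()            # assumption: joint 0 is the root
    d_off = (pose - seq[max(t - offset, 0)]).ravel()   # offset displacement
    d_prev = (pose - seq[max(t - 1, 0)]).ravel()       # previous-frame displacement
    return np.concatenate([norm, d_off, d_prev])

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means over descriptor rows of X; returns k key-pose centroids."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return C

def encode(seq, C, offset=5):
    """Encode a sequence as a normalized histogram over the key poses in C."""
    D = np.stack([pose_descriptor(seq, t, offset) for t in range(len(seq))])
    labels = np.argmin(((D[:, None] - C[None]) ** 2).sum(-1), axis=1)
    hist = np.bincount(labels, minlength=len(C)).astype(float)
    return hist / hist.sum()
```

In the full framework, the resulting histograms (one per sequence) would be the feature vectors fed to the final action classifier.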