Abstract:
In recent years, Human Activity Recognition (HAR) has become a core research area owing to its wide range of applications, and it has attracted growing attention in the computer vision field. In HAR, activities are typically captured through several sensor modalities, such as vision, inertial, skeleton, and audio sensors. However, each of these sensors has limitations, such as local barriers, image barriers, sensor unreliability, and consumer concerns. Multimodal Human Activity Recognition (MMHAR) addresses these problems by using more than one sensor modality, so that complementary information from different domains can be exploited in the recognition task. Recently, various deep learning-based approaches have been proposed for MMHAR and have achieved state-of-the-art results. Although great efforts have been made in this area using various modalities, little attention has been paid to analyzing the dominance and relevance of one modality over another. This research work demonstrates the importance of multimodal sensor fusion using deep neural networks for HAR and identifies which modality contributes more to recognizing activities. To this end, a novel deep multimodal fusion network based on a two-stream architecture is proposed.
One stream uses a three-dimensional Convolutional Neural Network (3D-CONV) to process the depth sensor data, while the second stream combines a two-dimensional Convolutional Neural Network with Long Short-Term Memory (2D-CONVLSTM) to process the inertial sensor data. Each stream extracts features from its respective modality, and decision-level fusion combines the outputs of the two streams to obtain the final prediction.
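To make the two-stream design concrete, the following is a minimal, illustrative sketch in PyTorch, not the exact implementation evaluated in this work; the layer sizes, input shapes, and names (DepthStream, InertialStream, decision_level_fusion, num_classes=11) are assumptions chosen only to show how a 3D-CONV depth stream, a 2D-CONVLSTM inertial stream, and decision-level score averaging fit together.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthStream(nn.Module):
    """3D-CNN stream for depth video clips shaped (batch, channel, frames, H, W)."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),           # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)        # (batch, 32)
        return self.classifier(h)              # per-class scores

class InertialStream(nn.Module):
    """2D-CNN + LSTM stream for inertial windows shaped (batch, time, channel, H, W)."""
    def __init__(self, num_classes):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(input_size=16, hidden_size=64, batch_first=True)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).flatten(1).view(b, t, -1)  # per-step CNN features
        _, (h_n, _) = self.lstm(feats)          # last hidden state summarizes the sequence
        return self.classifier(h_n[-1])

def decision_level_fusion(depth_logits, inertial_logits):
    """Decision-level fusion: average the per-stream class probabilities, then take argmax."""
    probs = (F.softmax(depth_logits, dim=1) + F.softmax(inertial_logits, dim=1)) / 2
    return probs.argmax(dim=1)

if __name__ == "__main__":
    depth_clip = torch.randn(2, 1, 16, 32, 32)   # toy depth clips: (batch, channel, frames, H, W)
    inertial_seq = torch.randn(2, 10, 1, 8, 8)   # toy inertial windows: (batch, time, channel, H, W)
    depth_net, inertial_net = DepthStream(11), InertialStream(11)
    preds = decision_level_fusion(depth_net(depth_clip), inertial_net(inertial_seq))
    print(preds.shape)                           # torch.Size([2])

In this sketch, decision-level fusion is realized by averaging the per-stream softmax scores, which is one common choice; other combination rules, such as weighted averaging or voting, follow the same pattern.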
The proposed model has been evaluated on the publicly available Berkeley MHAD (Multimodal Human Action Database) benchmark dataset and achieves a state-of-the-art accuracy of 99.73%, outperforming previous methods. To compare the individual modalities, the depth sensor and inertial sensor data are also passed to the corresponding model streams separately in a single-modality Human Activity Recognition (SHAR) setting. In this setting, the depth camera sensor achieves a higher accuracy of 98.89%, compared with 89.34% for the inertial sensor.
Hence, it is concluded that the depth camera sensor contributes more to the recognition task than the inertial sensor data.