Abstract:
In recent years, multimodal sentiment analysis and emotion recognition have become a hot research topic. Instead of manually scrutinizing sentiments and emotions, there is a need for an automatic process that can easily be applied to numerous tasks, such as product analysis, social media monitoring, and election prediction. Although great efforts have been devoted to gauging people's sentiments and emotions from single-modality data such as text, images, or audio, less attention has been paid to the joint analysis of multimodal data in social media. Different modalities usually contain complementary information that must be combined for better learning of sentiments and emotions.
In this research work, a novel framework, Attention-based Multimodal Sentiment Analysis and Emotion Recognition (AMSAER), using deep neural networks is proposed to exploit intra-modality discriminative features as well as inter-modality correlation across the visual, audio, and text modalities. Three separate unimodal attention models, i.e., visual, acoustic, and semantic, are proposed to effectively learn sentiment and emotion classifiers for these modalities. These models automatically extract high-quality, discriminative, sentiment- and emotion-related semantic words, image regions, and raw audio features. Subsequently, a deep hierarchical multimodal model based on intermediate fusion is proposed to learn the correlations among these modality features hierarchically. This model first learns the correlation between bimodal features and then among trimodal features for the joint classification of sentiments and emotions. Finally, all four distinct models are fused into a holistic framework through decision-level fusion for multimodal sentiment analysis and emotion recognition. Extensive experiments have been conducted on the publicly available benchmark IEMOCAP corpus for both sentiment and emotion prediction. The results demonstrate that the proposed architecture outperforms existing techniques by a clear margin in both sentiment and emotion classification, achieving 84% and 92% accuracy, respectively.
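To make the fusion hierarchy concrete, the following is a minimal PyTorch-style sketch of the pipeline described above: unimodal attention encoders produce one discriminative vector per modality, bimodal and then trimodal layers fuse them at an intermediate level, and the four classifiers (three unimodal plus the hierarchical multimodal one) are combined by decision-level fusion. All module names, layer sizes, and the simple probability averaging used for the final fusion are illustrative assumptions, not the authors' exact implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_CLASSES = 4  # e.g., four emotion categories; a sentiment head would use its own class count

    class UnimodalAttention(nn.Module):
        """Attention over a sequence of modality features -> one discriminative vector + class logits."""
        def __init__(self, in_dim, hid_dim):
            super().__init__()
            self.proj = nn.Linear(in_dim, hid_dim)
            self.score = nn.Linear(hid_dim, 1)
            self.classifier = nn.Linear(hid_dim, NUM_CLASSES)

        def forward(self, x):                        # x: (batch, seq_len, in_dim)
            h = torch.tanh(self.proj(x))             # (batch, seq_len, hid_dim)
            alpha = F.softmax(self.score(h), dim=1)  # attention weights over words/regions/frames
            feat = (alpha * h).sum(dim=1)            # weighted sum -> (batch, hid_dim)
            return feat, self.classifier(feat)

    class HierarchicalFusion(nn.Module):
        """Intermediate fusion: bimodal layers first, then a trimodal layer, then a joint classifier."""
        def __init__(self, hid_dim):
            super().__init__()
            self.bi_av = nn.Linear(2 * hid_dim, hid_dim)   # audio + visual
            self.bi_at = nn.Linear(2 * hid_dim, hid_dim)   # audio + text
            self.bi_vt = nn.Linear(2 * hid_dim, hid_dim)   # visual + text
            self.tri = nn.Linear(3 * hid_dim, hid_dim)
            self.classifier = nn.Linear(hid_dim, NUM_CLASSES)

        def forward(self, fa, fv, ft):                     # unimodal feature vectors
            av = torch.relu(self.bi_av(torch.cat([fa, fv], dim=-1)))
            at = torch.relu(self.bi_at(torch.cat([fa, ft], dim=-1)))
            vt = torch.relu(self.bi_vt(torch.cat([fv, ft], dim=-1)))
            tri = torch.relu(self.tri(torch.cat([av, at, vt], dim=-1)))
            return self.classifier(tri)

    def fuse_decisions(logits_list):
        """Decision-level fusion: average the class probabilities of the four models."""
        probs = [F.softmax(logits, dim=-1) for logits in logits_list]
        return torch.stack(probs).mean(dim=0)

In this sketch, the final prediction is the argmax of the averaged probabilities; other decision-level schemes (e.g., weighted voting) would fit the same structure.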