Abstract:
The task of developing intelligent machines that can understand visual scenes and then relay their understanding in natural language has long been a challenge for the fields of Computer Vision and Natural Language Processing. Recently, thanks to advancements in computer hardware, data-gathering techniques, and, most importantly, Artificial Intelligence, a great deal has been achieved in this domain in a short span of time. Through this project, we present the fundamentals of two such models that accomplish the task of image captioning using Neural Networks. We elaborate on their respective methodologies, assess the quality of the captions they produce, and suggest improvements for even higher-quality captions. We then demonstrate that these models can be used in numerous applications with immediate and practical benefits.
Our contribution lies in performing an in-depth analysis of machine-generated captions using automatic evaluation metrics and our proposed ‘consensus-based’ Human Evaluation plan, rather than in designing complex new algorithms for the Image Captioning task. We show that the automatic metrics yield higher scores when captions are compared against more than one reference sentence. Moreover, our study suggests that the evaluation of Image Captioning systems may be fully automated.
We propose strategies for collecting human judgments cheaply and on a large scale, enabling even deeper analysis of the captions produced by machines. We also suggest improvements to the models discussed and then implement both models accordingly.