Amalgamation of Video Description and Multiple Object Localization Using Single Deep Learning Model
Title | Amalgamation of Video Description and Multiple Object Localization Using Single Deep Learning Model |
Publication Type | Conference Paper |
Year of Publication | 2017 |
Authors | Duggal, Shivam, Manik, Shrey, Ghai, Mohan |
Conference Name | Proceedings of the 9th International Conference on Signal Processing Systems |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-5384-7 |
Keywords | deep video, Inverted Y-Shape Model, Metrics, Microsoft Research Video Description Corpus, pubcrawl, resilience, Resiliency, Scalability, VGG-16, Video Caption Generation, Video Object Detection, YOLO |
Abstract | Automatically describing the content of a video is an elementary problem in artificial intelligence that joins computer vision and natural language processing. Through this paper, we propose a single system that can carry out video analysis (object detection and captioning) with reduced time and memory complexity. This single system uses YOLO (You Only Look Once) as its base model. Moreover, to highlight the importance of transfer learning in the development of the proposed system, two further approaches are discussed. The first uses two discrete models: one to extract a continuous bag of words from the frames, and another, a language model, to generate captions from those words. VGG-16 (Visual Geometry Group) is used as the base image encoder model to compare the two approaches, while LSTM is the base language model. The dataset used is the Microsoft Research Video Description Corpus, manually modified to serve the purpose of training the proposed system. The second approach, which uses transfer learning, proves to be the better approach for the development of the proposed system. |
URL | https://dl.acm.org/doi/10.1145/3163080.3163108 |
DOI | 10.1145/3163080.3163108 |
Citation Key | duggal_amalgamation_2017 |
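
To make the comparison in the abstract concrete, below is a minimal sketch of the second approach it describes: a pretrained VGG-16 image encoder (transfer learning) feeding an LSTM language model that generates caption tokens. This is a hypothetical illustration, not the authors' implementation; the choice of PyTorch/torchvision, the vocabulary size, and the layer dimensions are all assumptions.

```python
# Hypothetical sketch (not the authors' code) of the abstract's second
# approach: a pretrained VGG-16 encoder extracts per-frame features, and an
# LSTM language model conditioned on those features generates caption tokens.
# Vocabulary size and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models


class FrameEncoder(nn.Module):
    """VGG-16 with the final 1000-way ImageNet head removed (4096-d output)."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features
        self.avgpool = vgg.avgpool
        # Keep the classifier up to, but not including, the last Linear layer.
        self.fc = nn.Sequential(*list(vgg.classifier.children())[:-1])

    def forward(self, frames):                   # frames: (B, 3, 224, 224)
        x = self.avgpool(self.features(frames))
        return self.fc(torch.flatten(x, 1))      # (B, 4096) frame feature


class CaptionLSTM(nn.Module):
    """LSTM language model whose initial state is set from the frame feature."""

    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(4096, hidden_dim)  # map feature to LSTM state
        self.init_c = nn.Linear(4096, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feature, tokens):          # tokens: (B, T) word indices
        h0 = self.init_h(feature).unsqueeze(0)   # (1, B, H)
        c0 = self.init_c(feature).unsqueeze(0)
        hidden, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(hidden)                  # (B, T, vocab_size) logits


# Example: encode two frames and score a 12-token caption prefix.
encoder, captioner = FrameEncoder().eval(), CaptionLSTM()
with torch.no_grad():
    feats = encoder(torch.randn(2, 3, 224, 224))
    logits = captioner(feats, torch.randint(0, 5000, (2, 12)))
```

Initializing the LSTM state from the image feature is one common way to condition a language model on visual input; the paper's proposed single-model system instead builds on YOLO to perform detection and captioning jointly, which this sketch does not attempt to reproduce.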