Amalgamation of Video Description and Multiple Object Localization Using Single Deep Learning Model

Title: Amalgamation of Video Description and Multiple Object Localization Using Single Deep Learning Model
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Duggal, Shivam; Manik, Shrey; Ghai, Mohan
Conference Name: Proceedings of the 9th International Conference on Signal Processing Systems
Publisher: ACM
Conference Location: New York, NY, USA
ISBN Number: 978-1-4503-5384-7
Keywords: deep video, Inverted Y-Shape Model, Metrics, Microsoft Research Video Description Corpus, pubcrawl, resilience, Resiliency, Scalability, VGG-16, Video Caption Generation, Video Object Detection, YOLO
Abstract

Automatically describing the content of a video is an elementary problem in artificial intelligence that joins computer vision and natural language processing. In this paper, we propose a single system that carries out video analysis (object detection and captioning) with reduced time and memory complexity. This single system uses YOLO (You Only Look Once) as its base model. Moreover, to highlight the importance of using transfer learning in the development of the proposed system, two more approaches are discussed. The first uses two discrete models: one to extract a continuous bag of words from the frames, and a second, a language model, to generate captions from those words. VGG-16 (Visual Geometry Group) is used as the base image decoder model to compare the two approaches, while an LSTM is the base language model. The dataset used is the Microsoft Research Video Description Corpus, which was manually modified to serve the purpose of training the proposed system. The second approach, which uses transfer learning, proves to be the better approach for developing the proposed system.

URL: https://dl.acm.org/doi/10.1145/3163080.3163108
DOI: 10.1145/3163080.3163108
Citation Key: duggal_amalgamation_2017