Amalgamation of Video Description and Multiple Object Localization Using Single Deep Learning Model
Title | Amalgamation of Video Description and Multiple Object Localization Using Single Deep Learning Model |
Publication Type | Conference Paper |
Year of Publication | 2017 |
Authors | Duggal, Shivam, Manik, Shrey, Ghai, Mohan |
Conference Name | Proceedings of the 9th International Conference on Signal Processing Systems |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-5384-7 |
Keywords | deep video, Inverted Y-Shape Model, Metrics, Microsoft Research Video Description Corpus, pubcrawl, resilience, Resiliency, Scalability, VGG-16, Video Caption Generation, Video Object Detection, YOLO |
Abstract | Automatically describing the content of a video is an elementary problem in artificial intelligence that joins computer vision and natural language processing. Through this paper, we propose a single system that can carry out video analysis (object detection and captioning) with reduced time and memory complexity. This single system uses YOLO (You Only Look Once) as its base model. Moreover, to highlight the importance of transfer learning in the development of the proposed system, two further approaches are discussed. The first uses two discrete models: one to extract a continuous bag of words from the frames, and another, a language model, to generate captions from those words. VGG-16 (Visual Geometry Group) is used as the base image encoder model to compare the two approaches, while LSTM is the base language model. The dataset used is the Microsoft Research Video Description Corpus, manually modified to serve the purpose of training the proposed system. The second approach, which uses transfer learning, proves to be the better approach for the development of the proposed system. |
URL | https://dl.acm.org/doi/10.1145/3163080.3163108 |
DOI | 10.1145/3163080.3163108 |
Citation Key | duggal_amalgamation_2017 |
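
To make the comparison in the abstract concrete, below is a minimal sketch of the second approach it describes: a pretrained VGG-16 image encoder (transfer learning) feeding an LSTM language model that generates caption tokens. This is a hypothetical illustration, not the authors' implementation; the choice of PyTorch/torchvision, the vocabulary size, and the layer dimensions are all assumptions.

```python
# Hypothetical sketch (not the authors' code) of the abstract's second
# approach: a pretrained VGG-16 encoder extracts per-frame features, and an
# LSTM language model conditioned on those features generates caption tokens.
# Vocabulary size and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models


class FrameEncoder(nn.Module):
    """VGG-16 with the final 1000-way ImageNet head removed (4096-d output)."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features
        self.avgpool = vgg.avgpool
        # Keep the classifier up to, but not including, the last Linear layer.
        self.fc = nn.Sequential(*list(vgg.classifier.children())[:-1])

    def forward(self, frames):                   # frames: (B, 3, 224, 224)
        x = self.avgpool(self.features(frames))
        return self.fc(torch.flatten(x, 1))      # (B, 4096) frame feature


class CaptionLSTM(nn.Module):
    """LSTM language model whose initial state is set from the frame feature."""

    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(4096, hidden_dim)  # map feature to LSTM state
        self.init_c = nn.Linear(4096, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feature, tokens):          # tokens: (B, T) word indices
        h0 = self.init_h(feature).unsqueeze(0)   # (1, B, H)
        c0 = self.init_c(feature).unsqueeze(0)
        hidden, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(hidden)                  # (B, T, vocab_size) logits


# Example: encode two frames and score a 12-token caption prefix.
encoder, captioner = FrameEncoder().eval(), CaptionLSTM()
with torch.no_grad():
    feats = encoder(torch.randn(2, 3, 224, 224))
    logits = captioner(feats, torch.randint(0, 5000, (2, 12)))
```

Initializing the LSTM state from the image feature is one common way to condition a language model on visual input; the paper's proposed single-model system instead builds on YOLO to perform detection and captioning jointly, which this sketch does not attempt to reproduce.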