2018-11-19
Duggal, Shivam, Manik, Shrey, Ghai, Mohan.  2017.  Amalgamation of Video Description and Multiple Object Localization Using Single Deep Learning Model. Proceedings of the 9th International Conference on Signal Processing Systems. :109–115.

Automatically describing the content of a video is a fundamental problem in artificial intelligence that joins computer vision and natural language processing. Through this paper, we propose a single system that can carry out video analysis (object detection and captioning) with reduced time and memory complexity. This single system uses YOLO (You Only Look Once) as its base model. Moreover, to highlight the importance of using transfer learning in the development of the proposed system, two more approaches are discussed. The first one uses two discrete models: one to extract a continuous bag of words from the frames, and the other, a language model, to generate captions from those words. VGG-16 (Visual Geometry Group) is used as the base image encoding model to compare the two approaches, while LSTM is the base language model used. The dataset used is the Microsoft Research Video Description Corpus, which was manually modified to serve the purpose of training the proposed system. The second approach, which uses transfer learning, proves to be the better one for developing the proposed system.
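The two-model pipeline contrasted in the abstract (a visual model extracting a bag of words from frames, followed by a language model that sequences those words into a caption) can be sketched as follows. This is an illustrative toy, not the paper's code: the sigmoid word scorer stands in for VGG-16, and the fixed word ordering stands in for the LSTM language model; all function names, vocabularies, and scores are assumptions for demonstration.

```python
import math

def frame_to_words(frame_logits, vocab, threshold=0.5):
    # Stand-in for the VGG-16 stage: any vocabulary word whose (toy)
    # sigmoid score exceeds the threshold is considered present in the frame.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in frame_logits]
    return {w for w, s in zip(vocab, scores) if s > threshold}

def video_to_bag(frames, vocab):
    # Union of per-frame word sets gives a bag of words for the whole clip.
    bag = set()
    for logits in frames:
        bag |= frame_to_words(logits, vocab)
    return bag

def words_to_caption(bag, word_order):
    # Stand-in for the LSTM stage: a fixed word order plays the role of
    # the learned language model that sequences the detected words.
    return " ".join(w for w in word_order if w in bag)

vocab = ["person", "horse", "riding", "car"]
word_order = ["person", "riding", "horse", "car"]
# Two frames of toy logit scores, one per vocabulary word.
frames = [[2.0, 1.5, -3.0, -2.0],
          [1.0, 2.5, 3.0, -4.0]]

caption = words_to_caption(video_to_bag(frames, vocab), word_order)
print(caption)  # -> person riding horse
```

The single-model alternative the paper proposes replaces both stages with one YOLO-based network fine-tuned via transfer learning, avoiding the duplicated feature extraction of the two-stage design.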