Biblio

Filters: Keyword is deep video
Yang, Lingxiao, Liu, Risheng, Zhang, David, Zhang, Lei.  2017.  Deep Location-Specific Tracking. Proceedings of the 25th ACM International Conference on Multimedia. :1309–1317.

Convolutional Neural Network (CNN) based methods have shown significant performance gains on the visual tracking problem in recent years. Owing to the many unpredictable changes objects undergo online, such as abrupt motion, background clutter and large deformation, visual tracking remains a challenging task. We propose a novel algorithm, Deep Location-Specific Tracking, which decomposes tracking into a localization task and a classification task and trains an individual network for each. The localization network exploits the information in the current frame and provides a specific location to improve the probability of successful tracking, while the classification network finds the target among many examples generated around the target location in the previous frame, as well as the one estimated by the localization network in the current frame. CNN-based trackers often have a massive number of trainable parameters and are prone to over-fitting to particular object states, leading to reduced precision or tracking drift. We address this problem by learning a classification network based on 1 × 1 convolution and global average pooling. Extensive experimental results on popular benchmark datasets show that the proposed tracker achieves competitive results without using additional tracking videos for fine-tuning. The code is available at https://github.com/ZjjConan/DLST
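
A minimal sketch (assuming PyTorch; channel counts and class count are illustrative, not from the paper) of a classification head built from 1 × 1 convolution and global average pooling, the low-parameter design the abstract credits with reducing over-fitting:

```python
import torch.nn as nn

class GAPClassifier(nn.Module):
    """Lightweight classification head: a 1x1 convolution followed by
    global average pooling, avoiding large fully connected layers."""
    def __init__(self, in_channels=512, num_classes=2):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling

    def forward(self, feature_maps):
        # feature_maps: (batch, in_channels, H, W) from a CNN backbone
        scores = self.conv1x1(feature_maps)   # (batch, num_classes, H, W)
        return self.gap(scores).flatten(1)    # (batch, num_classes)
```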

Zhu, Yi, Liu, Sen, Newsam, Shawn.  2017.  Large-Scale Mapping of Human Activity Using Geo-Tagged Videos. Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. :68:1–68:4.

This paper is the first work to perform spatio-temporal mapping of human activity using the visual content of geo-tagged videos. We utilize a recent deep-learning based video analysis framework, termed hidden two-stream networks, to recognize a range of activities in YouTube videos. This framework is efficient and can run in real time or faster, which is important for recognizing events as they occur in streaming video and for reducing latency when analyzing already captured video. This is, in turn, important for using video in smart-city applications. We perform a series of experiments to show that our approach is able to map activities both spatially and temporally.
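
A hedged sketch of the late-fusion idea underlying two-stream activity recognition (the module names and averaging rule are illustrative; the paper's hidden two-stream network additionally learns motion internally rather than taking precomputed optical flow):

```python
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Averages class scores from an appearance (RGB) stream and a
    motion stream -- the basic two-stream late-fusion scheme."""
    def __init__(self, spatial_net: nn.Module, motion_net: nn.Module):
        super().__init__()
        self.spatial_net = spatial_net
        self.motion_net = motion_net

    def forward(self, rgb_frames, motion_input):
        spatial_scores = self.spatial_net(rgb_frames)
        motion_scores = self.motion_net(motion_input)
        return (spatial_scores + motion_scores) / 2  # late fusion
```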

Song, Baolin, Jiang, Hao, Zhao, Li, Huang, Chengwei.  2017.  A Bimodal Biometric Verification System Based on Deep Learning. Proceedings of the International Conference on Video and Image Processing. :89–93.

To overcome the limitations of single-mode biometric identification, this paper proposes a bimodal biometric verification system based on deep learning. A modified CNN architecture is used to generate better facial features for bimodal fusion. The resulting facial features and the acoustic features extracted by an acoustic feature extraction model are fused at the feature level to form a fusion feature. This fusion feature is used to train a neural network that identifies the target person to whom the corresponding features belong. Experimental results on a bimodal database built from TED-LIUM and CASIA-WebFace demonstrate the superiority and high performance of our bimodal biometric system over single-mode biometrics for identity authentication. Compared with using facial or acoustic features alone, the classification accuracy achieved with the fusion feature increases markedly.
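
A minimal sketch of the feature-level fusion described above, assuming PyTorch; the embedding dimensions, hidden size and identity count are hypothetical. The two modality embeddings are concatenated and the fused vector is classified:

```python
import torch
import torch.nn as nn

class BimodalFusionNet(nn.Module):
    """Concatenates a facial embedding and an acoustic embedding at the
    feature level, then classifies the fused vector."""
    def __init__(self, face_dim=256, audio_dim=128, num_identities=100):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(face_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_identities),
        )

    def forward(self, face_feat, audio_feat):
        fused = torch.cat([face_feat, audio_feat], dim=1)  # feature-level fusion
        return self.classifier(fused)
```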

Pang, Yulei, Xue, Xiaozhen, Wang, Huaying.  2017.  Predicting Vulnerable Software Components Through Deep Neural Network. Proceedings of the 2017 International Conference on Deep Learning Technologies. :6–10.

Vulnerabilities need to be detected and removed from software. Although previous studies have demonstrated the usefulness of prediction techniques for deciding whether software components are vulnerable, improving the effectiveness of these techniques remains a challenging research question. This paper employs a deep neural network with rectified linear units, trained with stochastic gradient descent and batch normalization, to predict vulnerable software components. The features are defined as continuous sequences of tokens in source code files. A statistical feature selection algorithm is then employed to reduce the feature and search space. We evaluated the proposed technique on several Java Android applications, and the results demonstrate that it can predict vulnerable classes, i.e., software components, with high precision, accuracy and recall.
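
A hedged sketch of the kind of predictor described: a feed-forward network with rectified linear units and batch normalization, trained with stochastic gradient descent. The input dimension (a post-selection token-feature vector), layer sizes and learning rate are assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

# Illustrative vulnerability predictor over a token-feature vector
# extracted from a source file (dimensions assumed for the sketch).
model = nn.Sequential(
    nn.Linear(1000, 256),   # 1000 selected token features (assumed)
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 2),       # vulnerable vs. not vulnerable
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()
```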

Yaseen, Muhammad Usman, Anjum, Ashiq, Antonopoulos, Nick.  2017.  Modeling and Analysis of a Deep Learning Pipeline for Cloud Based Video Analytics. Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies. :121–130.

Video analytics systems based on deep learning are becoming the basis of many widespread applications, including smart-city systems for people and traffic monitoring. These systems require massive amounts of labeled data and training time for fine-tuning the hyper-parameters used in object classification. We propose a cloud-based video analytics system built upon an optimally tuned deep learning model to classify objects from video streams. The hyper-parameters, including learning rate, momentum, activation function and optimization algorithm, are tuned through a mathematical model for efficient analysis of video streams. The system is capable of enhancing its own training data by applying transformations, including rotation, flip and skew, to the input dataset, making it more robust and self-adaptive. An in-memory distributed training mechanism rapidly incorporates a large number of distinguishing features from the training dataset, enabling the system to perform object classification with minimal human assistance and external support. The system is validated through an object classification case study using a 100 GB dataset comprising 88,432 video frames on an 8-node cloud. Extensive experimentation reveals an accuracy of 0.97 and a precision of 0.96 after 6.8 hours of training. The system is scalable, robust to classification errors and can be customized for any real-life situation.
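
A minimal sketch of the self-augmentation step the abstract lists (rotation, flip and skew), expressed with torchvision transforms; the rotation and shear ranges are illustrative assumptions:

```python
from torchvision import transforms

# Rotation, flip and skew (shear) applied to enlarge the training set;
# the parameter ranges here are illustrative, not from the paper.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=0, shear=10),  # skew
    transforms.ToTensor(),
])
```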

Duggal, Shivam, Manik, Shrey, Ghai, Mohan.  2017.  Amalgamation of Video Description and Multiple Object Localization Using Single Deep Learning Model. Proceedings of the 9th International Conference on Signal Processing Systems. :109–115.

Automatically describing the content of a video is an elementary problem in artificial intelligence that joins computer vision and natural language processing. In this paper, we propose a single system that carries out video analysis (object detection and captioning) at reduced time and memory complexity. This single system uses YOLO (You Only Look Once) as its base model. To highlight the importance of transfer learning in the development of the proposed system, two further approaches are discussed. The first uses two discrete models: one to extract a continuous bag of words from the frames, and another, a language model, to generate captions from those words. VGG-16 (Visual Geometry Group) is used as the base image encoder to compare the two approaches, while an LSTM is the base language model. The dataset used is the Microsoft Research Video Description Corpus, manually modified to serve the purpose of training the proposed system. The second approach, which uses transfer learning, proves to be the better approach for developing the proposed system.
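
A hedged sketch of the captioning half of such a pipeline: an LSTM language model conditioned on a pooled video feature vector. The layer sizes, vocabulary size and conditioning scheme are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """LSTM language model conditioned on a video feature vector."""
    def __init__(self, feat_dim=512, vocab_size=10000, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_proj = nn.Linear(feat_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, video_feat, caption_tokens):
        # Prepend the projected video feature as the first "word".
        start = self.init_proj(video_feat).unsqueeze(1)   # (B, 1, hidden)
        words = self.embed(caption_tokens)                # (B, T, hidden)
        seq, _ = self.lstm(torch.cat([start, words], dim=1))
        return self.out(seq)                              # next-word logits
```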

Choi, Jun-Ho, Choi, Manri, Choi, Min-Su, Lee, Jong-Seok.  2017.  Impact of Three-Dimensional Video Scalability on Multi-View Activity Recognition Using Deep Learning. Proceedings of the Thematic Workshops of ACM Multimedia 2017. :135–143.

Human activity recognition is one of the important research topics in computer vision and video understanding. It is often assumed that high quality video sequences are available for recognition. However, relaxing this requirement and implementing robust recognition on videos with reduced data rates can yield efficiency in storing and transmitting video data. Three-dimensional video scalability, which refers to the possibility of reducing the spatial, temporal, and quality resolutions of videos, is an effective way to flexibly represent and manage video data. In this paper, we investigate the impact of video scalability on multi-view activity recognition. We employ both a spatiotemporal feature extraction-based method and a deep learning-based method using convolutional and recurrent neural networks. The recognition performance of the two methods is examined, along with in-depth analysis of how their performance varies with respect to various scalability combinations. In particular, we demonstrate that the deep learning-based method achieves significantly better robustness than the feature-based method. Furthermore, we investigate optimal scalability combinations with respect to bitrate in order to provide useful guidelines for an optimal operation policy in resource-constrained activity recognition systems.
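
A minimal sketch of what a spatial/temporal scalability combination means in practice, assuming PyTorch; the reduction factors are illustrative, and quality (bitrate) scalability would additionally involve re-encoding, which is omitted here:

```python
import torch.nn.functional as F

def reduce_scalability(clip, spatial_factor=2, temporal_factor=2):
    """clip: (T, C, H, W) video tensor. Drops frames (temporal
    scalability) and downsamples each frame (spatial scalability)."""
    clip = clip[::temporal_factor]             # keep every k-th frame
    t, c, h, w = clip.shape
    clip = F.interpolate(clip,
                         size=(h // spatial_factor, w // spatial_factor),
                         mode='bilinear', align_corners=False)
    return clip
```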

Duta, Ionut C., Ionescu, Bogdan, Aizawa, Kiyoharu, Sebe, Nicu.  2017.  Simple, Efficient and Effective Encodings of Local Deep Features for Video Action Recognition. Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. :218–225.

For an action recognition system, a decisive component is the feature encoding part, which builds the final representation that serves as input to a classifier. One shortcoming of existing encoding approaches is that they are built around hand-crafted features and are not highly competitive at encoding the deep features needed in many practical scenarios. In this work we propose two solutions specifically designed for encoding local deep features, taking advantage of the nature of deep networks and focusing on capturing the highest feature response of the convolutional maps. The proposed approaches to deep feature encoding provide a way to encapsulate the features extracted with a convolutional neural network over the entire video. In terms of accuracy, our encodings outperform the most widely used and powerful existing encoding approaches by a large margin, while being extremely efficient in computational cost. Evaluated on action recognition tasks, our pipeline obtains state-of-the-art results on three challenging datasets: HMDB51, UCF50 and UCF101.
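
A hedged sketch of the core idea of capturing the highest feature response of the convolutional maps: spatial max pooling per frame, then a simple temporal aggregation over the video. The paper's two encodings may differ in detail; the temporal averaging here is an assumption:

```python
import torch

def encode_video(conv_maps):
    """conv_maps: (T, C, H, W) convolutional feature maps for T frames.
    Keeps the highest response of each of the C maps per frame, then
    averages over time to get one C-dimensional video descriptor."""
    per_frame = conv_maps.amax(dim=(2, 3))   # spatial max: (T, C)
    return per_frame.mean(dim=0)             # temporal average: (C,)
```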

Yang, Wenhan, Deng, Shihong, Hu, Yueyu, Xing, Junliang, Liu, Jiaying.  2017.  Real-Time Deep Video SpaTial Resolution UpConversion SysTem (STRUCT++ Demo). Proceedings of the 25th ACM International Conference on Multimedia. :1255–1256.

Image and video super-resolution (SR) has been explored for several decades, yet few works have been integrated into practical systems for real-time image and video SR. In this work, we present a real-time deep video SpaTial Resolution UpConversion SysTem (STRUCT++). Our demo system achieves real-time performance (50 fps on CPU for CIF sequences and 45 fps on GPU for HDTV videos) and provides several functions: 1) batch processing; 2) full-resolution comparison; 3) local-region zoom-in. These functions make it convenient to super-resolve a batch of videos (at most 10 in parallel), compare with other approaches, and inspect local details of the SR results. The system is built on a Global context aggregation and Local queue jumping Network (GLNet), which has a thinner and deeper structure to aggregate global context, plus an additional local queue-jumping path to better model local structures of the signal. GLNet achieves state-of-the-art performance for real-time video SR.
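
A minimal sketch in the spirit of the design described: a deep convolutional trunk aggregating context, plus a short additive skip path that lets local structure bypass it. GLNet's actual layers are not given in the abstract; every size and factor below is an illustrative assumption:

```python
import torch.nn as nn

class TinySRNet(nn.Module):
    """Deep trunk aggregates context; the additive skip connection acts
    as a short 'queue jumping' path preserving local structure."""
    def __init__(self, channels=32, depth=8, scale=2):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=scale, mode='bilinear',
                                    align_corners=False)
        layers = [nn.Conv2d(3, channels, 3, padding=1), nn.ReLU()]
        for _ in range(depth):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(channels, 3, 3, padding=1)]
        self.trunk = nn.Sequential(*layers)

    def forward(self, lr_image):
        up = self.upsample(lr_image)   # coarse upscaled image
        return up + self.trunk(up)     # skip path + learned residual
```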