Recognition of Visually Perceived Compositional Human Actions by Multiple Spatio-Temporal Scales Recurrent Neural Networks
Title | Recognition of Visually Perceived Compositional Human Actions by Multiple Spatio-Temporal Scales Recurrent Neural Networks |
Publication Type | Journal Article |
Year of Publication | 2018 |
Authors | Lee, Haanvid, Jung, Minju, Tani, Jun |
Journal | IEEE Transactions on Cognitive and Developmental Systems |
Volume | 10 |
Pagination | 1058–1069 |
Date Published | Dec. 2018 |
ISSN | 2379-8939 |
Keywords | action recognition, Biological neural networks, compositionality, conventional convolutional neural network model, convolution, Convolutional codes, convolutional neural network (CNN), critical spatio-temporal information, deep learning model, different compositionality levels, dynamic vision processing, feature extraction, feedforward neural nets, gesture recognition, human action datasets, image motion analysis, image representation, learning (artificial intelligence), MSTRNN model aid, multiple spatio-temporal scales recurrent neural network model, multiple timescale recurrent dynamics, neural activities, recurrent neural nets, recurrent neural network, Recurrent neural networks, RGB input data, spatio-temporal constraints, spatio-temporal information extraction, Spatiotemporal phenomena, symbol grounding, Training, visualization, visually perceived compositional human action recognition |
Abstract | We investigate a deep learning model for action recognition that simultaneously extracts spatio-temporal information from raw RGB input data. The proposed multiple spatio-temporal scales recurrent neural network (MSTRNN) model is derived by combining multiple timescale recurrent dynamics with a conventional convolutional neural network model. The architecture of the proposed model imposes both spatial and temporal constraints simultaneously on its neural activities, and these constraints vary in scale across layers. As suggested by the principle of upward and downward causation, it is assumed that the network can develop a functional hierarchy from these constraints during training. To evaluate and observe the characteristics of the proposed model, we use three human action datasets consisting of different primitive actions and different compositionality levels. The performance of the MSTRNN model on these datasets is compared with that of other representative deep learning models used in the field. The results show that the MSTRNN outperforms baseline models while using fewer parameters. The characteristics of the proposed model are observed by analyzing its internal representation properties. The analysis clarifies how the spatio-temporal constraints of the MSTRNN model help it extract critical spatio-temporal information relevant to its given tasks. |
URL | https://ieeexplore.ieee.org/document/8090898 |
DOI | 10.1109/TCDS.2017.2768422 |
Citation Key | lee_recognition_2018 |
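The "multiple timescale recurrent dynamics" named in the abstract can be sketched as a leaky-integrator (CTRNN/MTRNN-style) update in which each unit's time constant sets how quickly its state changes. The sketch below is illustrative only: the function and variable names are assumptions, not from the paper, and it shows only the generic timescale mechanism, not the full MSTRNN architecture with its convolutional spatial constraints.

```python
import numpy as np

def timescale_step(u, h, x, W_in, W_rec, tau):
    """One leaky-integrator update with per-unit time constants.

    u     : internal (pre-activation) state, shape (n,)
    h     : unit activations, shape (n,)
    x     : input vector, shape (m,)
    W_in  : input weights, shape (n, m)
    W_rec : recurrent weights, shape (n, n)
    tau   : per-unit time constants, shape (n,); larger tau = slower dynamics
    """
    u_new = (1.0 - 1.0 / tau) * u + (1.0 / tau) * (W_in @ x + W_rec @ h)
    return u_new, np.tanh(u_new)

# Fast units (small tau) track the input quickly, while slow units
# (large tau) integrate over longer windows; stacking layers with
# different tau values is what lets a network of this kind develop
# the temporal part of a functional hierarchy.
tau = np.array([2.0, 100.0])       # one fast unit, one slow unit
u = h = np.zeros(2)
W_in = np.ones((2, 1))             # toy weights for illustration
W_rec = np.zeros((2, 2))
u, h = timescale_step(u, h, np.array([1.0]), W_in, W_rec, tau)
```

After one step with the same input, the slow unit's state has moved far less than the fast unit's, which is the intended effect of the per-layer temporal constraint.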