An Augmented Treble Stream Deep Neural Network for Video Analysis – Human action recognition has become an important research area in computer vision because of its wide range of applications, such as intelligent video surveillance, entertainment and robotics. Over about half a century of development, two main research lines have emerged: the first is handcrafted features, or so-called feature engineering, and the second is deep learning, which has become very popular in recent years.
There are many types of human activities. In my research area, we mainly focus on the analysis of individual actions and interactions in real-life videos. To recognise actions, the traditional handcrafted approaches follow a bottom-up pipeline: feature detection and description, followed by a trainable classifier such as an SVM for action classification. With deep learning methods, the features are extracted by an end-to-end trainable model, which also includes a classifier to output the actions.
In terms of video analysis, there are three significant families of models: LRCN, 3D CNNs and multiple-stream CNNs. Our work is based on the multiple-stream CNN models, and we developed a so-called treble-stream network. The picture shows our network architecture. We designed the treble-stream network for automatic feature extraction and motion representation. The model consists of three individual neural networks.
It includes two spatial stream networks and a temporal stream network, all of which learn features from videos. The three networks extract features in parallel, and then the final features from the three streams are fused and classified. The two spatial networks are identical. Each is implemented with a CNN-LSTM architecture: the CNN part takes two consecutive RGB video frames as input for automatic feature extraction, and its output is fed into the LSTM units for sequence learning. And these are the details of the temporal stream network.
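Before moving on, here is a rough sketch of what one spatial CNN-LSTM stream could look like in PyTorch. It is only an illustration under assumptions, not our actual implementation: the names `pair_frames` and `SpatialStream`, the layer sizes, and the choice to stack the two consecutive RGB frames along the channel axis are all hypothetical.

```python
import torch
import torch.nn as nn


def pair_frames(video):
    """Stack each frame with its successor on the channel axis.

    video: (batch, frames, 3, H, W) -> (batch, frames - 1, 6, H, W).
    Channel stacking is an assumption; the talk only says that two
    consecutive RGB frames are the CNN input.
    """
    return torch.cat([video[:, :-1], video[:, 1:]], dim=2)


class SpatialStream(nn.Module):
    """Hypothetical CNN-LSTM stream: a small CNN per step, then an LSTM."""

    def __init__(self, feat_dim=64, hidden_dim=32):
        super().__init__()
        # Illustrative layer sizes only; 6 input channels = 2 RGB frames.
        self.cnn = nn.Sequential(
            nn.Conv2d(6, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (N, feat_dim, 1, 1)
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, clips):
        # clips: (batch, steps, 6, H, W)
        b, t, c, h, w = clips.shape
        # Run the CNN on every step at once, then restore the time axis.
        feats = self.cnn(clips.reshape(b * t, c, h, w)).flatten(1)
        _, (h_n, _) = self.lstm(feats.reshape(b, t, -1))
        return h_n[-1]  # last hidden state: (batch, hidden_dim)


# Random tensor standing in for a batch of 2 clips with 5 frames each.
video = torch.randn(2, 5, 3, 32, 32)
features = SpatialStream()(pair_frames(video))
```

In this sketch, the final LSTM hidden state serves as the feature vector of the stream, which is the kind of per-stream output that would later be fused with the other two streams.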
Finally, we combine the three feature vectors with a score-based fusion method and then apply fully connected and softmax layers for action classification. Now let's see how we train the network, which is very important for a deep learning model. We use PyTorch and Python to implement our model. We first train the three networks separately. In this stage, we also use a pre-trained CNN model and a transfer learning strategy. After that, we treat the three networks as general feature extractors and train the fully connected layers and the classifier. OK, let's look at the experiments and results.
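Before the results, here is a minimal sketch of the score-based fusion idea. The talk does not spell out the exact fusion rule, so this assumes a weighted average of the per-stream score vectors followed by a softmax, and it leaves out the fully connected layer for brevity. The function names and the example scores are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(stream_scores, weights=None):
    """Weighted sum of per-stream class scores.

    stream_scores: one list of class scores per stream.
    weights: one weight per stream; equal weights if omitted (an assumption).
    """
    if weights is None:
        weights = [1.0 / len(stream_scores)] * len(stream_scores)
    n_classes = len(stream_scores[0])
    return [
        sum(w * s[c] for w, s in zip(weights, stream_scores))
        for c in range(n_classes)
    ]


# Made-up scores for three streams over three classes, illustration only.
spatial_a = [2.0, 0.5, 0.1]
spatial_b = [1.5, 0.7, 0.2]
temporal = [0.3, 2.2, 0.4]

probs = softmax(fuse_scores([spatial_a, spatial_b, temporal]))
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of top class
```

Passing `weights` makes it possible to trust one stream more than the others, for example the temporal stream for motion-heavy actions; equal weighting is simply the most neutral default.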
First, we show some feature maps from different CNN layers. We selected the “TaiChi” action video as an example. We can see the human body easily in the feature maps of the first layer, while the later layers describe more abstract information.
We tested the performance of different treble-stream network configurations: a single CNN-LSTM network has a low ability to recognise actions, while the full treble-stream network achieves a remarkable result. Then, we also compared the treble-stream network with other deep learning methods.