Motion Embedded Images: An Approach to Capture Spatial and Temporal Features for Action Recognition

Tri Le, Nham Huynh-Duc, Chung Thai Nguyen, Minh-Triet Tran


The demand for human activity recognition (HAR) from videos has surged across real-life applications, including video surveillance, healthcare, and elderly care. The explosion of short-form videos on social media platforms has further intensified interest in this domain. This research focuses on the problem of HAR in general short videos. In contrast to still images, video clips carry both spatial and temporal information, making it challenging to extract complementary cues on appearance from still frames and on motion between frames. This research makes a two-fold contribution. First, we investigate the use of motion-embedded images in a variant of the two-stream Convolutional Neural Network architecture, in which one stream captures motion from combined batches of frames, while the other applies a standard image-classification ConvNet to static appearance. Second, we introduce a novel dataset of Southeast Asian sports short videos that includes clips both with and without visual effects, a modern factor lacking in currently available benchmark datasets. The proposed model is trained and evaluated on two benchmarks, UCF-101 and SEAGS-V1. The results reveal that the proposed model yields competitive performance compared to prior attempts to address the same problem.
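The two ideas in the abstract can be illustrated with a minimal sketch: collapsing a short batch of frames into a single motion image, and late fusion of the two streams' class scores. This is an assumption-laden toy, not the paper's actual pipeline; the accumulation of absolute frame differences and the weighted averaging of probabilities are illustrative stand-ins, and all function names here are hypothetical.

```python
import math

def motion_image(frames):
    """Collapse a batch of grayscale frames (lists of lists) into one
    'motion image' by averaging absolute inter-frame differences.
    Illustrative stand-in for the paper's motion-embedding step."""
    h, w = len(frames[0]), len(frames[0][0])
    acc = [[0.0] * w for _ in range(h)]
    for prev, cur in zip(frames, frames[1:]):
        for i in range(h):
            for j in range(w):
                acc[i][j] += abs(cur[i][j] - prev[i][j])
    n = max(len(frames) - 1, 1)
    return [[v / n for v in row] for row in acc]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse(spatial_logits, temporal_logits, w=0.5):
    """Late fusion: weighted average of the two streams' class
    probabilities (a common, but here assumed, fusion scheme)."""
    p_s, p_t = softmax(spatial_logits), softmax(temporal_logits)
    return [w * a + (1 - w) * b for a, b in zip(p_s, p_t)]

# Toy example: two 2x2 frames, three hypothetical action classes.
frames = [[[0, 0], [0, 0]], [[1, 0], [0, 1]]]
m_img = motion_image(frames)
fused = fuse([2.0, 0.5, 0.1], [0.2, 1.8, 0.3])
pred = fused.index(max(fused))
```

In a real system each stream would be a trained ConvNet; here the per-stream logits are hard-coded purely to show how the fused prediction is formed.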






This work is licensed under a Creative Commons Attribution 3.0 License.