Motion Embedded Images: An Approach to Capture Spatial and Temporal Features for Action Recognition

Tri Le, Nham Huynh-Duc, Chung Thai Nguyen, Minh-Triet Tran


The demand for human activity recognition (HAR) from videos has surged across real-life applications, including video surveillance, healthcare, and elderly care. The explosion of short-form videos on social media platforms has further intensified interest in this domain. This research focuses on the problem of HAR in general short videos. In contrast to still images, video clips carry both spatial and temporal information, making it challenging to extract complementary cues on appearance from still frames and on motion between frames. This research makes a two-fold contribution. First, we investigate the use of motion-embedded images in a variant of the two-stream Convolutional Neural Network architecture, in which one stream captures motion from combined batches of frames, while the other applies a standard image-classification ConvNet to static appearance. Second, we introduce a novel dataset of Southeast Asian sports short videos that includes clips both with and without visual effects, a modern factor lacking in currently available benchmark datasets. The proposed model is trained and evaluated on two benchmarks, UCF-101 and SEAGS-V1. The results reveal that the proposed model yields competitive performance compared to prior attempts to address the same problem.
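The two ideas in the abstract can be illustrated with a minimal sketch: collapsing a short batch of frames into a single motion image, and late fusion of the two streams' class scores. This is an assumption-laden toy, not the paper's actual pipeline; the accumulation of absolute frame differences and the weighted averaging of probabilities are illustrative stand-ins, and all function names here are hypothetical.

```python
import math

def motion_image(frames):
    """Collapse a batch of grayscale frames (lists of lists) into one
    'motion image' by averaging absolute inter-frame differences.
    Illustrative stand-in for the paper's motion-embedding step."""
    h, w = len(frames[0]), len(frames[0][0])
    acc = [[0.0] * w for _ in range(h)]
    for prev, cur in zip(frames, frames[1:]):
        for i in range(h):
            for j in range(w):
                acc[i][j] += abs(cur[i][j] - prev[i][j])
    n = max(len(frames) - 1, 1)
    return [[v / n for v in row] for row in acc]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse(spatial_logits, temporal_logits, w=0.5):
    """Late fusion: weighted average of the two streams' class
    probabilities (a common, but here assumed, fusion scheme)."""
    p_s, p_t = softmax(spatial_logits), softmax(temporal_logits)
    return [w * a + (1 - w) * b for a, b in zip(p_s, p_t)]

# Toy example: two 2x2 frames, three hypothetical action classes.
frames = [[[0, 0], [0, 0]], [[1, 0], [0, 1]]]
m_img = motion_image(frames)
fused = fuse([2.0, 0.5, 0.1], [0.2, 1.8, 0.3])
pred = fused.index(max(fused))
```

In a real system each stream would be a trained ConvNet; here the per-stream logits are hard-coded purely to show how the fused prediction is formed.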






This work is licensed under a Creative Commons Attribution 3.0 License.