[1] YAO B, FEI-FEI L. Grouplet: A structured image representation for recognizing human and object interactions[C]//2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA: IEEE, 2010: 9-16.
[2] KLÄSER A, MARSZAŁEK M, SCHMID C. A spatio-temporal descriptor based on 3D-gradients[C]//Proceedings of the 2008 British Machine Vision Conference. Leeds, UK, 2008: 275:1-10.
[3] ZHANG Z, HU Y, CHAN S, et al. Motion context: a new representation for human action recognition[C]//Proceedings of the 2008 European Conference on Computer Vision. Marseille, France, 2008: 817-829.
[4] WANG H, KLÄSER A, SCHMID C, et al. Action recognition by dense trajectories[C/OL]//Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, CO, USA, 2011: 3169-3176. DOI: 10.1109/CVPR.2011.5995407.
[5] CHÉRON G, LAPTEV I, SCHMID C. P-CNN: Pose-based CNN features for action recognition[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile, 2015: 3218-3226.
[6] SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]//Proceedings of the 2014 Advances in Neural Information Processing Systems. Montreal, Canada, 2014: 568-576.
[7] WANG L, XIONG Y, WANG Z, et al. Temporal segment networks: Towards good practices for deep action recognition[C]//Proceedings of the 2016 European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016: 20-36.
[8] DONAHUE J, ANNE HENDRICKS L, GUADARRAMA S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA, 2015: 2625-2634.
[9] HE J Y, WU X, CHENG Z Q, et al. DB-LSTM: Densely-connected Bi-directional LSTM for human action recognition[J]. Neurocomputing, 2021, 444: 319-331.
[10] GE H, YAN Z, YU W, et al. An attention mechanism based convolutional LSTM network for video action recognition[J]. Multimedia Tools and Applications, 2019, 78(14): 20533-20556.
[11] SUDHAKARAN S, ESCALERA S, LANZ O. LSTA: Long short-term attention for egocentric action recognition[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA, USA, 2019: 9954-9963.
[12] WU Z, XIONG C, MA C Y, et al. AdaFrame: Adaptive frame selection for fast video recognition[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA, USA, 2019: 1278-1287.
[13] LIU Z, LI Z, WANG R, et al. Spatiotemporal saliency-based multi-stream networks with attention-aware LSTM for action recognition[J]. Neural Computing and Applications, 2020, 32: 14593-14602.
[14] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3d convolutional networks[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile, 2015: 4489-4497.
[15] HUSSEIN N, GAVVES E, SMEULDERS A W. Timeception for complex action recognition[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA, USA, 2019: 254-263.
[16] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA, 2017: 6299-6308.
[17] FAYYAZ M, BAHRAMI E, DIBA A, et al. 3D CNNs with adaptive temporal feature resolutions[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA, 2021: 4731-4740.
[18] XIE S, SUN C, HUANG J, et al. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification[C]//Proceedings of the 2018 European Conference on Computer Vision (ECCV). Munich, Germany, 2018: 305-321.
[19] LI K, LI X, WANG Y, et al. CT-Net: Channel tensorization network for video classification[C]//Proceedings of the 2021 International Conference on Learning Representations. Vienna, Austria, 2021.
[20] HAN K, WANG Y, CHEN H, et al. A survey on vision transformer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(1): 87-110.
[21] KHAN S, NASEER M, HAYAT M, et al. Transformers in vision: A survey[J]. ACM Computing Surveys (CSUR), 2022, 54(10s): 1-41.
[22] BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[C]//Proceedings of the 2021 International Conference on Machine Learning. Virtual Event, 2021.
[23] TRUONG T D, BUI Q H, DUONG C N, et al. DirecFormer: A directed attention in transformer approach to robust action recognition[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, USA, 2022: 20030-20040.
[24] RANASINGHE K, NASEER M, KHAN S, et al. Self-supervised video transformer[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, USA, 2022: 2874-2884.
[25] BULAT A, PEREZ RUA J M, SUDHAKARAN S, et al. Space-time mixing attention for video transformer[J]. Advances in Neural Information Processing Systems, 2021, 34: 19594-19607.
[26] CHEN R, PANDA R, FAN Q. RegionViT: Regional-to-local attention for vision transformers[C]//Proceedings of the 2022 International Conference on Learning Representations. 2022.
[27] YANG J, DONG X, LIU L, et al. Recurring the transformer for video action recognition[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, USA, 2022: 14063-14073.
[28] DU Y, WANG W, WANG L. Hierarchical recurrent neural network for skeleton based action recognition[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA, 2015: 1110-1118.
[29] LI S, LI W, COOK C, et al. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, 2018: 5457-5466.
[30] ZHENG W, LI L, ZHANG Z, et al. Relational network for skeleton-based action recognition [C]//Proceedings of the 2019 IEEE International Conference on Multimedia and Expo. Shanghai, China, 2019: 826-831.
[31] WANG H, WANG L. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA, 2017: 499-508.
[32] SONG S, LAN C, XING J, et al. An end-to-end spatio-temporal attention model for human action recognition from skeleton data[C]//Proceedings of the 2017 AAAI Conference on Artificial Intelligence: Vol. 31. San Francisco, CA, USA, 2017.
[33] LI B, DAI Y, CHENG X, et al. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN[C]//Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops. Hong Kong, China, 2017: 601-604.
[34] LIU M, LIU H, CHEN C. Enhanced skeleton visualization for view invariant human action recognition[J]. Pattern Recognition, 2017, 68: 346-362.
[35] KE Q, AN S, BENNAMOUN M, et al. SkeletonNet: Mining deep part features for 3-d action recognition[J]. IEEE Signal Processing Letters, 2017, 24(6): 731-735.
[36] LI B, DAI Y, CHENG X, et al. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN[C]//2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). Hong Kong, China: IEEE, 2017: 601-604.
[37] KORBAN M, LI X. DDGCN: A dynamic directed graph convolutional network for action recognition[C]//Proceedings of the 2020 European Conference on Computer Vision. Glasgow, UK: Springer, 2020: 761-776.
[38] YAN S, XIONG Y, LIN D. Spatial temporal graph convolutional networks for skeleton-based action recognition[C]//Proceedings of the 2018 AAAI Conference on Artificial Intelligence: Vol. 32. New Orleans, LA, USA, 2018.
[39] PENG W, HONG X, CHEN H, et al. Learning graph convolutional network for skeleton-based human action recognition by neural searching[C]//Proceedings of the 2020 AAAI Conference on Artificial Intelligence: Vol. 34. New York, NY, USA, 2020: 2669-2676.
[40] ZHANG X, XU C, TAO D. Context aware graph convolution for skeleton-based action recognition[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, 2020: 14333-14342.
[41] LI M, CHEN S, CHEN X, et al. Symbiotic graph neural networks for 3d skeleton-based human action recognition and motion prediction[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 44(6): 3316-3333.
[42] CHEN Y, ZHANG Z, YUAN C, et al. Channel-wise topology refinement graph convolution for skeleton-based action recognition[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada, 2021: 13359-13368.
[43] WANG S, ZHANG Y, WEI F, et al. Skeleton-based action recognition via temporal-channel aggregation[A/OL]. 2022. arXiv: 2205.15936. https://doi.org/10.48550/arXiv.2205.15936.
[44] HU L, LIU S, FENG W. Spatial temporal graph attention network for skeleton-based action recognition[A/OL]. 2022. arXiv: 2208.08599. https://doi.org/10.48550/arXiv.2208.08599.
[45] ZENG A, SUN X, YANG L, et al. Learning skeletal graph neural networks for hard 3d pose estimation[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops. Montreal, BC, Canada, 2021: 11436-11445.
[46] ZHANG Y, WU B, LI W, et al. STST: Spatial-temporal specialized transformer for skeleton-based action recognition[C]//Proceedings of the 29th ACM International Conference on Multimedia. Virtual Event, China, 2021: 3229-3237.
[47] PLIZZARI C, CANNICI M, MATTEUCCI M. Spatial temporal transformer network for skeleton-based action recognition[C]//Proceedings of the 2021 International Conference on Pattern Recognition. Virtual Event: Springer, 2021: 694-701.
[48] QIU H, HOU B, REN B, et al. Spatio-temporal tuples transformer for skeleton-based action recognition[A]. 2022.
[49] LIU Y, ZHANG H, XU D, et al. Graph transformer network with temporal kernel attention for skeleton-based action recognition[J]. Knowledge-Based Systems, 2022, 240: 108146.
[50] WANG Q, SHI S, HE J, et al. Iip-transformer: Intra-inter-part transformer for skeleton-based action recognition[C]//2023 IEEE International Conference on Big Data (BigData). Sorrento, Italy: IEEE, 2023: 936-945.
[51] YANG X, ZHANG C, TIAN Y. Recognizing actions using depth motion maps-based histograms of oriented gradients[C]//Proceedings of the 20th ACM International Conference on Multimedia. Nara, Japan, 2012: 1057-1060.
[52] WANG P, LI W, GAO Z, et al. Action recognition from depth maps using deep convolutional neural networks[J]. IEEE Transactions on Human-Machine Systems, 2015, 46(4): 498-509.
[53] WANG P, WANG S, GAO Z, et al. Structured images for RGB-D action recognition[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops. Venice, Italy, 2017: 1005-1014.
[54] FERNANDO B, GAVVES E, ORAMAS J, et al. Rank pooling for action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(4): 773-787.
[55] WANG P, LI W, GAO Z, et al. Depth pooling based large-scale 3-d action recognition with convolutional neural networks[J]. IEEE Transactions on Multimedia, 2018, 20(5): 1051-1061.
[56] XIAO Y, CHEN J, WANG Y, et al. Action recognition for depth video using multi-view dynamic images[J]. Information Sciences, 2019, 480: 287-304.
[57] LECUN Y, BOSER B, DENKER J S, et al. Backpropagation applied to handwritten zip code recognition[J]. Neural Computation, 1989, 1(4): 541-551.
[58] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[59] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[A/OL]. 2020. arXiv: 2010.11929. https://doi.org/10.48550/arXiv.2010.11929.
[60] SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: A large scale dataset for 3d human activity analysis[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA, 2016: 1010-1019.
[61] LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: A large-scale benchmark for 3d human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 42(10): 2684-2701.
[62] WANG J, NIE X, XIA Y, et al. Cross-view action modeling, learning and recognition[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA, 2014: 2649-2656.
[63] LI M, CHEN S, CHEN X, et al. Actional-structural graph convolutional networks for skeleton-based action recognition[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA, USA, 2019: 3595-3603.
[64] SHI L, ZHANG Y, CHENG J, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA, USA, 2019: 12026-12035.
[65] SHI L, ZHANG Y, CHENG J, et al. Skeleton-based action recognition with directed graph neural networks[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA, USA, 2019: 7912-7921.
[66] CHENG K, ZHANG Y, HE X, et al. Skeleton-based action recognition with shift graph convolutional network[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, 2020: 183-192.
[67] LIU Z, ZHANG H, CHEN Z, et al. Disentangling and unifying graph convolutions for skeleton-based action recognition[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, 2020: 143-152.
[68] CHENG K, ZHANG Y, CAO C, et al. Decoupling GCN with DropGraph module for skeleton-based action recognition[C]//Proceedings of the 2020 European Conference on Computer Vision, Part XXIV. Glasgow, UK: Springer, 2020: 536-553.
[69] SONG Y F, ZHANG Z, SHAN C, et al. Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York, United States, 2020: 1625-1633.
[70] YE F, PU S, ZHONG Q, et al. Dynamic GCN: Context-enriched topology learning for skeleton-based action recognition[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York, United States, 2020: 55-63.
[71] SONG Y F, ZHANG Z, SHAN C, et al. Constructing stronger and faster baselines for skeleton-based action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(2): 1474-1488.
[72] CHI H G, HA M H, CHI S, et al. InfoGCN: Representation learning for human skeleton-based action recognition[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, USA, 2022: 20186-20196.
[73] XIANG W, LI C, ZHOU Y, et al. Generative action description prompts for skeleton-based action recognition[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Paris, France, 2023: 10276-10285.