Conventional pose estimation models in video surveillance often rely on graph-based structures. In this paper, we introduce a novel approach that overcomes the limitations of template matching across varying poses and achieves more robust results. Our swimmer pose estimation method is based on deep learning: it uses RTMPose to extract and fuse visual features, and detects swimmers accurately from human joint keypoints. With proper training, the proposed model adapts to various swimming styles. Unlike approaches that require multiple models and extensive training, our method provides an end-to-end prediction pipeline that is straightforward to implement and deploy. In experiments comparing MMPose and RTMPose, our network demonstrates excellent performance in swimmer pose estimation. Furthermore, we developed a real-time dataset of swimmers with annotated keypoints, captured from an underwater perspective. This viewpoint represents the swimmer’s torso better than a side view does, which makes the dataset valuable for a wide range of machine vision applications.