Keypoint detection is an important task in computer vision, aiming to recognize points in an image or video that carry specific meaning or information, for example the eyes and nose of a face, or the joints of the human body. Among neural networks, many well-known Convolutional Neural Network (CNN) models have been proposed for keypoint detection; from DeepPose up to the current state-of-the-art methods, a rich history has formed.
Below is a brief history of its development, together with some notes. The information one finds online may be incomplete, and my own experience is limited, so this record may not be comprehensive; if there are any errors, please point them out, thank you.
1, The idea of keypoint detection before deep learning
Before the rise of deep learning, keypoint detection mainly relied on hand-designed features and machine learning algorithms such as SIFT, SURF, HOG and other feature descriptors, combined with sliding windows and classifiers such as SVM. These methods are effective under specific conditions, but performance is limited in complex scenes and changing lighting.
Corner detection: Early keypoint detection methods focused on detecting corners, such as the Harris corner detector and the Shi-Tomasi corner detector. These methods rely on local gradient information in the image to detect corners.
SIFT (Scale-Invariant Feature Transform): SIFT is a classical feature extraction and matching algorithm that detects keypoints by finding extrema across scales and orientations, and extracts descriptors for matching.
SURF (Speeded-Up Robust Features): SURF is an accelerated version of SIFT that improves computational efficiency by using integral images and detecting keypoints with a fast Hessian matrix.
FAST (Features from Accelerated Segment Test): FAST is a fast corner detection algorithm suited to real-time applications, but it is sensitive to illumination and viewpoint changes.
HOG (Histogram of Oriented Gradients): The HOG feature descriptor is widely used for tasks such as pedestrian detection; it describes features by computing histograms of gradient orientations over local regions of an image.
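To make these classical detectors concrete, here is a minimal OpenCV sketch (the image path is a placeholder); both Harris and SIFT ship with opencv-python:

```python
import cv2
import numpy as np

img = cv2.imread("example.jpg")            # placeholder path; use a real image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Harris corner detection: responses above a threshold are taken as corners.
harris = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
corners = np.argwhere(harris > 0.01 * harris.max())    # (row, col) locations

# SIFT: scale-invariant keypoints plus 128-d descriptors for matching.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
print(f"Harris corners: {len(corners)}, SIFT keypoints: {len(keypoints)}")
```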
2, DeepPose (2014): the first to use deep neural networks for human pose estimation
- milestone: DeepPose is the first framework to apply deep neural networks to human pose estimation and keypoint localization, treating pose estimation as a regression problem and outputting keypoint coordinates directly.
- technological breakthrough: A cascading network structure is introduced to improve detection accuracy by gradually refining the location of key points through multiple stages.
DeepPose is a method for keypoint detection using deep learning techniques, applied in particular to human pose estimation. It is one of the first models to use Convolutional Neural Networks (CNNs) to solve the human pose estimation problem, proposed by Alexander Toshev and Christian Szegedy of Google in 2014. The main goal of DeepPose is to predict the positions of human joints, which is crucial for applications such as action recognition, human-computer interaction, and virtual reality.
The basic idea of DeepPose is to treat the human pose estimation problem as a regression problem in which the network directly predicts the 2D or 3D coordinates of human joints. Specifically, the DeepPose model receives an input image and outputs a series of coordinates, each of which represents a specific human key point, such as the elbow, knee, wrist, and so on.
To accomplish this, DeepPose uses a multi-layer convolutional neural network (CNN) capable of extracting complex features from an image. The output of the CNN is fed into a fully connected layer, which finally outputs the coordinates of each keypoint. During training, the model learns by minimizing the distance between the predicted and true coordinates (usually with a mean squared error loss). The DeepPose model consists of the seven layers of AlexNet plus an additional fully connected regression layer; the output has 2 × (number of joints) values, representing the joint coordinates in the 2D image. Moreover, to address the fixed feature scale and weak regression performance learned by a shallow CNN, the authors keep the coarse (x, y) coordinates regressed by the network, add a stage that crops a region of the original image centered on (x, y), and pass that region through a CNN to learn higher-resolution features for more precise regression of the coordinate values.
That is, accuracy is improved by cascading: the first stage takes the whole-body image as input and predicts the coordinates of all keypoints with relatively low positional accuracy; the second stage crops a local image centered on each predicted point and feeds it into a second-stage network, which regresses that point again to improve accuracy.
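To make the regression formulation concrete, below is a minimal PyTorch sketch of a single DeepPose-style stage: AlexNet convolutional features followed by a fully connected regressor trained with an L2 loss. The joint count and input size are assumptions for illustration; the cascade would run a second such stage on crops centered at the stage-1 predictions.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_JOINTS = 16  # hypothetical joint count (e.g., MPII uses 16)

class DeepPoseStage(nn.Module):
    """One regression stage in the spirit of DeepPose: a CNN backbone
    followed by fully connected layers that output 2k coordinates."""
    def __init__(self, num_joints=NUM_JOINTS):
        super().__init__()
        self.features = models.alexnet(weights=None).features  # AlexNet conv layers
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Linear(4096, 2 * num_joints),   # (x, y) for every joint
        )

    def forward(self, x):  # x: (B, 3, 224, 224)
        return self.regressor(self.features(x)).view(-1, NUM_JOINTS, 2)

model = DeepPoseStage()
pred = model(torch.randn(2, 3, 224, 224))      # predicted normalized coordinates
target = torch.rand(2, NUM_JOINTS, 2)          # dummy ground truth
loss = nn.functional.mse_loss(pred, target)    # the L2 regression loss
```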
DeepPose's innovations:
- 1. End-to-end learning: DeepPose learns the mapping from the input image to keypoint coordinates end to end, without hand-designed features.
- 2. Regression rather than classification: Unlike many early pose estimation methods, DeepPose does not use heat maps or per-keypoint classification; it directly regresses the keypoint coordinates, a novel strategy at the time.
- 3. Flexibility: DeepPose can handle pose estimation under different viewpoints and complex backgrounds, and can be extended to pose estimation in 3D space.
While DeepPose opened a new chapter in the field of pose estimation, it has some limitations. For example, it may perform poorly when joints overlap or are occluded, as the network struggles to distinguish occluded keypoints. In addition, since it regresses coordinates directly, it can suffer from the local-minima problems of the coordinate representation. Subsequent researchers proposed a variety of improvements, including heat maps, multi-task learning, recurrent neural networks (RNNs), and deeper CNN architectures, to improve the accuracy and robustness of pose estimation.
In summary, DeepPose is an important milestone in the field of pose estimation that demonstrates the potential of deep learning for solving complex visual tasks.
3, Stacked Hourglass Network (2016): human pose estimation via stacked hourglass networks
- milestone: A stacked hourglass network structure is proposed: a fully convolutional network (FCN) that performs repeated down-sampling and up-sampling to form multi-scale feature representations, with enhanced resolution sensitivity for pose estimation.
- technological breakthrough: Using heat maps as outputs instead of direct coordinates solves the local minima problem of coordinate regression and improves the robustness of detection.
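The heat-map representation itself is simple: each ground-truth keypoint is rendered as a small 2D Gaussian, and the network predicts one whole map per joint instead of a coordinate pair. A minimal sketch (the 64×64 map size and sigma are typical but assumed values):

```python
import numpy as np

def keypoint_to_heatmap(x, y, height=64, width=64, sigma=2.0):
    """Render one keypoint as a 2D Gaussian heat map, the training target
    used by heat-map-based methods instead of raw coordinates."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

# The network predicts one such map per joint; at inference, the argmax of
# each map gives that joint's location.
hm = keypoint_to_heatmap(20, 30)
print(np.unravel_index(hm.argmax(), hm.shape))   # -> (30, 20), i.e. (row, col)
```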
Stacked Hourglass Network is a deep learning architecture originally proposed by Alejandro Newell, Kaiyu Yang, and Jia Deng in their 2016 paper "Stacked Hourglass Networks for Human Pose Estimation". This work focuses on human pose estimation, i.e., detecting and localizing key points of the human body, such as joint positions, in a given image. Stacked Hourglass Networks have become a milestone in the field due to their excellent performance on human pose estimation tasks.
Stacked Hourglass Network also holds a landmark position in 2D human pose recognition, sweeping the major competition datasets as soon as it was released and attracting a great deal of attention and follow-up improvements thanks to its simple and flexible structure. It inherits and amplifies the idea of multi-resolution features from DeepPose: although regressing an individual joint's coordinates relies on features of a small region of the image, such as the hand, leg, or head, the complete pose of the whole person also relies on large-scale global features, so the CNN learns the spatial relationships among keypoints across the whole image while learning to localize each keypoint.
The Stacked Hourglass Network is composed as follows:
The core of Stacked Hourglass Network is its unique hourglass structure, which is a Fully Convolutional Network (FCN) capable of performing multiple down-sampling and up-sampling operations on the input image to form a multi-scale feature representation.
- Downsampling: The size of the feature map is reduced layer by layer by convolution and pooling layers, which helps in capturing the global information of the image.
- Upsampling: The size of the feature map is recovered by up-sampling layers (nearest-neighbor in the original paper), which helps preserve the local details of the image.
- Stacking: Multiple hourglass structures are stacked together to form a series of repeating modules, each of which outputs a set of heat maps of key points. The network of stacked hourglasses can iteratively refine the locations of keypoints to improve the accuracy of detection.
The workflow is as follows:
- The input image is processed with an initial convolutional layer to generate an initial feature map.
- The feature map enters the first hourglass module and is down-sampled and up-sampled, ultimately outputting a set of heat maps of key points.
- The output heat maps and the original feature map are merged through a skip connection and used as input for the next hourglass module; this process can be repeated several times.
- The output of each hourglass module is used in the final keypoint prediction, not just the output of the last module.
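Below is a minimal PyTorch sketch of a single hourglass module, assuming 64 channels and 16 joints; the real network stacks several of these and supervises the heat maps of every module (intermediate supervision), not just the last one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Residual(nn.Module):
    """A simplified residual block used at every scale of the hourglass."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
    def forward(self, x):
        return F.relu(x + self.conv(x))

class Hourglass(nn.Module):
    """One hourglass: recursive down/up sampling with skip connections."""
    def __init__(self, depth, ch):
        super().__init__()
        self.skip = Residual(ch)                       # full-resolution branch
        self.down = Residual(ch)                       # after pooling
        self.inner = Hourglass(depth - 1, ch) if depth > 1 else Residual(ch)
        self.up = Residual(ch)

    def forward(self, x):
        skip = self.skip(x)                            # keep local detail
        y = self.down(F.max_pool2d(x, 2))              # down-sample
        y = self.up(self.inner(y))                     # recurse at lower scale
        y = F.interpolate(y, scale_factor=2, mode="nearest")  # up-sample
        return skip + y                                # merge the two paths

hg = Hourglass(depth=4, ch=64)
out = hg(torch.randn(1, 64, 64, 64))
heatmaps = nn.Conv2d(64, 16, 1)(out)                   # 16 joint heat maps
print(heatmaps.shape)                                  # -> (1, 16, 64, 64)
```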
Key contributions of Stacked Hourglass:
- Multi-scale feature learning: Through the process of down-sampling and up-sampling, the network learns multi-scale features of the image, which matters for pose estimation since different parts of the human body may appear at different scales.
- Iterative prediction: The stacked hourglass structure allows the network to make multiple iterative corrections to keypoint locations, improving the accuracy of the predictions.
- End-to-end learning: The entire network can be trained end-to-end without any hand-designed features, relying only on automatically learned representations.
At the time of its release, Stacked Hourglass Network achieved the best results at the time on several pose estimation datasets, demonstrating its power in handling complex pose and occlusion situations. In addition, its design ideas have been widely applied to other vision tasks, such as face keypoint detection and hand pose estimation.
4, Cascaded Pyramid Network (CPN, 2017): a cascaded pyramid network for human pose estimation
- milestone: CPN demonstrates how to efficiently extract and integrate information from feature maps at different scales, and then progressively refine joint-position predictions through a cascaded network. CPN's innovation lies in combining global and local information: it first makes a global prediction and then refines the local region of each joint, a strategy that improves localization accuracy and robustness. Finally, it introduces multi-task learning, predicting the existence of joints in addition to the main keypoint-localization task; this joint training improves the model's overall performance.
- technological breakthrough: CPN uses a pyramid scheme to handle multi-scale features, an innovation in pose estimation that captures details of joints of different sizes and improves the model's generalization ability. CPN's cascade structure allows the model to correct its prediction at each stage; this iterative refinement gradually removes errors and ultimately yields more accurate joint positions. CPN's local guidance mechanism focuses on the area around each joint, strengthening the model's understanding of local details, which is especially important when joints are obscured by clothing or other objects.
Cascaded Pyramid Network (CPN) is indeed an important milestone in the field of human pose estimation. Proposed by Yilun Chen et al. at Megvii (Face++) in 2017, CPN targets multi-person pose estimation with a top-down pipeline, and it introduces several innovations that have had a profound impact on subsequent research.
The overall model adopts a top-down detection strategy: it first performs human detection on the input image, obtains candidate boxes, and passes them into the CPN network for human keypoint regression. Keypoints are handled in three ways: direct prediction, prediction with an enlarged receptive field, and context-based prediction.
The structure of the network is as follows:
GlobalNet on the left is responsible for the direct prediction of keypoints, targeting the more easily detected parts such as eyes and elbows; each feature-map output goes through a 1×1 convolutional layer. RefineNet on the right corrects the predictions for the cases the left side struggles with: occlusions, complex backgrounds, and improperly scaled joints.
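In code, GlobalNet is essentially a feature pyramid. Here is a minimal sketch, assuming ResNet-50 stage channel counts and 17 COCO keypoints; RefineNet, which fuses and re-refines these pyramid levels, is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalNet(nn.Module):
    """FPN-style GlobalNet sketch: 1x1 convs laterally connect backbone
    feature maps of different scales, coarser levels are up-sampled and
    added top-down, and each level predicts keypoint heat maps."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), mid=256, joints=17):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, mid, 1) for c in in_channels)
        self.predict = nn.ModuleList(nn.Conv2d(mid, joints, 3, padding=1)
                                     for _ in in_channels)

    def forward(self, feats):          # feats: fine -> coarse backbone stages
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):     # top-down pathway
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [p(l) for p, l in zip(self.predict, laterals)]

feats = [torch.randn(1, c, s, s) for c, s in
         zip((256, 512, 1024, 2048), (64, 32, 16, 8))]
heatmaps = GlobalNet()(feats)          # one set of heat maps per pyramid level
print([h.shape for h in heatmaps])
```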
At the time of its release, CPN's performance on multiple pose estimation benchmarks outperformed existing methods at the time, especially in complex pose and occlusion scenarios. It not only demonstrated the power of deep learning in pose estimation, but also inspired subsequent researchers to explore more complex network architectures and training strategies in pose estimation tasks.
The contribution of CPN is that it proposes an effective multi-scale feature fusion and cascade refinement mechanism, which is an innovation in the field of pose estimation, and also provides valuable references for other related tasks (e.g., human body part segmentation, behavior recognition, etc.). With the continuous progress of deep learning technology, the concepts and methods of CPN are still being borrowed and expanded by subsequent studies.
5, AlphaPose (2018): multiple human poses can be detected and estimated simultaneously in complex scenes
- milestone: AlphaPose is a multi-person pose estimation framework built on the RMPE (Regional Multi-Person Pose Estimation) approach; it produces accurate poses even from imperfect human detections, helping to solve the problem of keypoint attribution in multi-person pose estimation.
- technological breakthrough: A solution for multi-person pose estimation is introduced, using advanced post-processing (e.g., parametric pose NMS) to remove redundant poses and associate keypoints with the right person.
- Robustness to imprecise boxes: AlphaPose uses a symmetric spatial transformer network to extract a high-quality single-person region from an inaccurate bounding box, so pose estimation remains reliable even when detections are loose or shifted, or when people overlap in the image.
- Top-down pipeline: AlphaPose follows a top-down strategy. First, a detector localizes human bounding boxes in the image; then, within each box, single-person keypoint detection finds that person's pose. Decoupling detection from pose estimation in this way improves both speed and accuracy.
- Multi-level fusion: AlphaPose uses a multi-level fusion mechanism to optimize keypoint detection, exploiting feature maps at different levels to refine the pose estimate from coarse to fine.
- Recurrent Neural Networks (RNNs): In some implementations, AlphaPose also introduces recurrent neural networks to model time-series data, which helps track motion and pose changes in video sequences.
- End-to-end training: The entire system can be trained end-to-end, which means that all components (including detection, keypoint estimation, and correlation) can be jointly optimized for the best overall performance.
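To make the top-down pipeline concrete, here is a runnable toy sketch; the detector and pose network are dummy stand-ins and every name is hypothetical, but the control flow mirrors the detect-then-estimate strategy described above:

```python
import torch

def crop_and_resize(image, box, size=(256, 192)):
    """Hypothetical helper: crop a person box from a (C, H, W) image and
    resize it to the pose network's input size."""
    x1, y1, x2, y2 = [int(v) for v in box]
    crop = image[:, y1:y2, x1:x2]
    return torch.nn.functional.interpolate(crop[None], size=size)[0]

def top_down_pose(image, detector, pose_net, thresh=0.5):
    """Sketch of the top-down strategy AlphaPose follows: detect people,
    then run single-person pose estimation inside each box; parametric
    pose NMS would then prune duplicate poses."""
    poses = []
    for box, score in detector(image):
        if score < thresh:
            continue
        heatmaps = pose_net(crop_and_resize(image, box)[None])[0]  # (J, H, W)
        coords = [divmod(int(h.argmax()), h.shape[1]) for h in heatmaps]
        poses.append((box, coords))       # (row, col) per joint, per box
    return poses

# Dummy stand-ins so the sketch runs end to end:
image = torch.rand(3, 480, 640)
detector = lambda img: [((100, 50, 300, 400), 0.9)]
pose_net = lambda x: torch.rand(x.shape[0], 17, 64, 48)
print(len(top_down_pose(image, detector, pose_net)))   # -> 1
```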
6, Simple Baselines (2018): proposing a simple baseline network for human pose estimation and tracking
- milestone: It is demonstrated that a simple residual network (ResNet) architecture and a top-down strategy can also achieve advanced keypoint detection performance, simplifying model design.
- technological breakthrough: Emphasized the importance of data augmentation and of training techniques such as multi-scale training and initializing the network with weights pre-trained on ImageNet.
Core features of Simple Baselines:
- ResNet Backbone: Simple Baselines uses ResNet, a deep and powerful convolutional neural network capable of learning complex features from images, as the underlying feature-extraction network.
- Heatmap Regression: Like many other pose estimation methods, Simple Baselines uses heat-map regression as the output representation. Each keypoint corresponds to one heat map, and the peak of the heat map indicates the keypoint's location.
- Multi-Scale Training: To improve generalization and the detection of objects at different scales, Simple Baselines trains on images at different resolutions, enhancing its robustness to scale variation.
- Heavy Data Augmentation: Simple Baselines employs extensive data augmentation during training, including random cropping, scaling, rotation, and color jittering, to increase the diversity of data the model is exposed to and improve its generalization.
- Flip Test-Time Augmentation: During the testing phase, Simple Baselines also uses flip test-time augmentation: predictions are made on both the original and the horizontally flipped image, and the two are averaged to further improve accuracy.
- Post-processing: Beyond the network itself, Simple Baselines applies post-processing techniques, such as non-maximum suppression (NMS) and adjustments to the predicted heat-map peak locations, to improve the accuracy of keypoint localization.
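That simplicity shows in code. Below is a minimal sketch, assuming torchvision's ResNet-50 and the three-deconvolution head typical of the paper; exact layer settings are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

class SimpleBaseline(nn.Module):
    """Sketch of Simple Baselines: a ResNet backbone followed by a few
    deconvolution (transposed conv) layers that produce joint heat maps."""
    def __init__(self, joints=17):
        super().__init__()
        resnet = models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (2048, H/32, W/32)
        layers, in_ch = [], 2048
        for _ in range(3):                       # three deconvs: x8 upsampling
            layers += [nn.ConvTranspose2d(in_ch, 256, 4, stride=2, padding=1),
                       nn.BatchNorm2d(256), nn.ReLU()]
            in_ch = 256
        self.deconv = nn.Sequential(*layers)
        self.head = nn.Conv2d(256, joints, 1)    # one heat map per joint

    def forward(self, x):
        return self.head(self.deconv(self.backbone(x)))

model = SimpleBaseline()
out = model(torch.randn(1, 3, 256, 192))
print(out.shape)                                 # -> (1, 17, 64, 48)
```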
Simple Baselines greatly simplified the design and training of pose estimation models by demonstrating that a relatively simple architecture, coupled with effective data augmentation and training strategies, can match or even exceed the performance of complex custom models on the pose estimation task. This approach lowered the entry barrier to pose estimation and promoted the development of the field, while underlining the importance of data augmentation and training strategy. In addition, the open-sourcing of Simple Baselines' code gave the research community a strong benchmark that facilitates comparison and improvement in subsequent research.
Across these models, it is clear that generating high-resolution feature maps is key to pose estimation: Simple Baselines uses deconvolution to expand the resolution of feature maps, while Hourglass and CPN use upsampling plus skip connections.
7, HRNet (2019): a deep high-resolution model for human pose estimation
HRNet, a human pose estimation model released by the University of Science and Technology of China and Microsoft Research Asia, set three COCO records and was accepted to CVPR 2019.
In human pose tasks, previous methods such as CPN and Hourglass reconstruct high-resolution representations recovered from low resolution, generally within a high-to-low-resolution network structure (e.g., VGG, ResNet). The CPN paper notes that higher spatial resolution facilitates precise localization of keypoints, while lower resolution carries more semantic information. All of these methods obtain a high-resolution feature representation in some way, but the authors of the HRNet paper argue that such features are not strong enough, precisely because they are recovered from relatively low-resolution representations. Approaching the problem from a different angle, they analyze the essence of the goal: obtaining a high-resolution representation. They therefore design the network to maintain a high-resolution representation throughout, instead of recovering it from a low-resolution one, by connecting high-to-low resolution subnetworks in parallel.
Based on this idea, HRNet designs a network structure in which high- and low-resolution branches run in parallel to extract features, as shown below:
Key features of HRNet:
- Multi-resolution parallel processing: HRNet employs a multi-branch architecture whose branches process features at different resolutions in parallel. Unlike architectures that downsample and then upsample to restore resolution, HRNet maintains a high-resolution feature stream throughout the network, avoiding loss of information.
- Cross-resolution fusion: Information is exchanged between branches through cross-resolution connections: features from a low-resolution branch are upsampled and fused into the high-resolution branch, while high-resolution features are downsampled and passed to the low-resolution branch, promoting complementarity between features at different scales.
- Modular design: HRNet's architecture is modular, so different parts (modules) of the network can be flexibly combined to accommodate different tasks and input sizes. This design gives HRNet good adaptability and scalability across vision tasks.
- Lightweight and efficient: Although HRNet maintains a high-resolution feature representation, its parallel computation and modular design keep it computationally efficient and avoid excessive cost.
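Cross-resolution exchange is the heart of HRNet. Here is a minimal two-branch sketch; the channel counts echo HRNet-W32's convention but are assumptions here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseTwoBranches(nn.Module):
    """Sketch of HRNet-style cross-resolution fusion for two branches:
    every branch receives (resampled) features from the other branch."""
    def __init__(self, ch_hi=32, ch_lo=64):
        super().__init__()
        self.lo_to_hi = nn.Conv2d(ch_lo, ch_hi, 1)   # 1x1 conv, then upsample
        self.hi_to_lo = nn.Conv2d(ch_hi, ch_lo, 3, stride=2, padding=1)  # strided conv

    def forward(self, hi, lo):   # hi: (B, 32, H, W), lo: (B, 64, H/2, W/2)
        hi_out = hi + F.interpolate(self.lo_to_hi(lo), size=hi.shape[-2:],
                                    mode="nearest")
        lo_out = lo + self.hi_to_lo(hi)
        return hi_out, lo_out    # both branches keep running in parallel

fuse = FuseTwoBranches()
hi, lo = fuse(torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24))
print(hi.shape, lo.shape)        # -> (1, 32, 64, 48) (1, 64, 32, 24)
```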
HRNet has achieved remarkable results in pose estimation tasks, especially human pose estimation, where it reaches industry-leading performance on standard datasets such as COCO. In addition, HRNet performs excellently in tasks such as semantic segmentation, image classification, and object detection, thanks to its preservation of detail information and effective use of multi-scale features. A follow-up article will study HRNet specifically; it really is that important!
The HRNet model differs greatly from the previous mainstream methods in its core idea. Before HRNet, 2D human pose estimation algorithms (Hourglass, CPN, Simple Baselines, MSPN, etc.) downsampled a high-resolution feature map to low resolution and then recovered high resolution from it (once or repeatedly), realizing multi-scale feature extraction. HRNet's main characteristic is that the feature map maintains high resolution throughout the whole process, with low-resolution and high-resolution features computed in parallel branches and repeatedly fused with one another.
8, YOLO Series - End-to-End Pose Detection Algorithms
The YOLO (You Only Look Once) family of models is a landmark in the field of object detection and has had a profound impact on a wide range of computer vision tasks, including human pose estimation. Although the YOLO family initially focused on object detection, i.e., recognizing and localizing objects in an image, its real-time performance and accuracy have led to widespread use across application scenarios, including human detection and preliminary pose estimation.
Currently, the best known are the Ultralytics YOLO series and Megvii's YOLOX series, which can readily handle computer vision tasks such as image classification, object detection, keypoint detection, object tracking, and image segmentation.
Milestones of the YOLO series:
- Real-time detection: The defining feature of the YOLO family is its ability to perform object detection at very high speed, which is critical in real-time applications. YOLOv1 introduced, for the first time, object detection in a single pass through the network, greatly improving detection speed.
- End-to-end detection: YOLO integrates object localization and classification into a unified framework, eliminating preprocessing steps such as region proposals, which simplifies the detection pipeline and makes it more efficient.
- Continuous performance improvement: Through YOLOv2, YOLOv3, YOLOv4, YOLOv5, YOLOX, YOLOv8, YOLOv10, and others, the YOLO series has improved detection accuracy while maintaining high speed, thanks to optimized network architectures, improved data augmentation, and more efficient training strategies.
- Flexibility and scalability: The YOLO architecture is easily adapted and extended to the needs of different scenarios, whether lightweight applications on mobile devices such as phones or high-performance detection tasks on servers.
YOLO and human pose estimation: Although the YOLO family primarily targets object detection, its efficiency at human detection provides a solid foundation for subsequent pose estimation. Once YOLO detects the human body, models such as DeepPose, Stacked Hourglass Network, and AlphaPose can focus on the detected region for more detailed pose estimation. Recent Ultralytics releases also ship dedicated pose models that output keypoints directly, as in the sketch below.
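For example, with the Ultralytics package, detection and keypoints come out of a single forward pass. A sketch, assuming ultralytics is installed and "example.jpg" exists:

```python
from ultralytics import YOLO

# "yolov8n-pose.pt" is a pretrained pose model, downloaded on first use.
model = YOLO("yolov8n-pose.pt")
results = model("example.jpg")      # detection + 17 COCO keypoints per person

for r in results:
    print(r.boxes.xyxy)             # person bounding boxes
    print(r.keypoints.xy)           # (num_people, 17, 2) keypoint coordinates
```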
The YOLO series has not only transformed the field of object detection through its efficient real-time detection capabilities, but has also provided strong support for a wide range of computer vision tasks, including human pose estimation. As the YOLO series continues to evolve, we can expect more innovations and advances in future computer vision applications.
9, mmpose - a comprehensive open source pose estimation toolbox
- milestone: MMPose is a comprehensive open-source pose estimation toolkit that contains a variety of pose estimation models, including 2D and 3D pose estimation, as well as hand and face keypoint detection.
- technological breakthrough: facilitates resource sharing in the research community and accelerates progress in the field of pose estimation.
MMPose is an open-source, comprehensive, and modular human pose estimation toolkit developed by the OpenMMLab team. Built on the PyTorch framework, it aims to give researchers and developers a powerful platform for studying and implementing pose estimation algorithms. MMPose is not limited to 2D pose estimation; it also covers 3D pose estimation, hand pose estimation, and facial keypoint detection, with rich model and dataset support.
MMPose features:
- Extensive model support: MMPose contains a range of pose estimation models including, but not limited to, HRNet, Simple Baselines, Stacked Hourglass, and AlphaPose, making it easy for users to compare models and select the right one for their needs.
- Modular design: MMPose's modular design allows users to easily add, modify, or replace model components such as the backbone, neck, and head, which increases the framework's flexibility and scalability.
- Rich dataset support: MMPose supports multiple pose estimation datasets, such as COCO, MPII, AIC, and TotalCapture, which facilitates model training and evaluation.
- Comprehensive documentation and tutorials: MMPose provides thorough documentation and tutorials covering model configuration, data preparation, and the training and inference workflow, which lowers the barrier to entry for newcomers and accelerates research and development.
- Community and ecosystem: MMPose has an active open-source community with regular updates and maintenance, including bug fixes, performance optimizations, and new features, which fosters knowledge sharing and collaboration.
- High performance and reproducibility: MMPose is committed to ensuring that the models and results it provides are highly reproducible, while optimizing code efficiency to support large-scale training and deployment.
As a comprehensive pose estimation toolbox, MMPose provides powerful support for research and applications in pose estimation through its modular design, extensive model support, and rich datasets. It not only accelerates the development of pose estimation techniques, but also promotes knowledge sharing and technological innovation in the field.
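As a quick taste, here is a sketch using MMPose's high-level inference API; it assumes mmpose 1.x is installed, and the exact keys of the result dictionary may differ by version:

```python
from mmpose.apis import MMPoseInferencer

# "human" selects a default 2D human pose model; the image path is a placeholder.
inferencer = MMPoseInferencer("human")
result_generator = inferencer("example.jpg", show=False)
result = next(result_generator)
print(result["predictions"])     # keypoints and scores per detected person
```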
When we look at its code repository introduction, we can see that mmpose has integrated all the algorithms described earlier:
It also includes datasets for various pose estimation tasks:
This, of course, is the focus and key of our final study; it will be examined in detail later.
10, Recent Developments - RTMO Algorithm
We can also see from the latest list that RTMO is currently the SOTA algorithm:
Of course, we can also find it by going directly to the mmpose code repository, together with its latest paper:
RTMO combines the framework of the YOLO series with a single-stage approach to achieve both high speed and high accuracy. We will also study it carefully later.