- 0:00: The video discusses the ViTPose paper, which currently leads 2D pose estimation on the MS COCO dataset.
- 0:13: Previous attempts to use Transformers for 2D pose estimation include TransPose and TokenPose.
- 0:26: TransPose uses a CNN backbone to extract local information from the input image and a Transformer encoder to capture the relationships between the skeleton keypoints.
- 0:58: TokenPose follows a similar approach but adds randomly initialized tokens to represent missing or occluded keypoints.
- 1:33: Another attempt, HRFormer, combines Transformer blocks with convolutional blocks for downsampling and upsampling.
- 2:11: ViTPose simplifies the architecture by using only Transformers, making the problem easier to handle.
- 2:21: ViTPose uses a plain Transformer encoder to turn the input image into tokens (a minimal sketch of the idea follows this summary).
- 3:50: ViTPose offers two decoder options: a classic decoder and a simple decoder.
- 6:15: ViTPose supports multi-dataset training, using a different decoder depending on the dataset (see the second sketch below).
- 7:03: The video presents the ViTPose variants (base, large, huge, and gigantic), which differ in the number of layers and the channel size.
- 7:27: The video discusses the simplicity and scalability of ViTPose.
- 8:33: The video discusses the influence of pre-training data on ViTPose's performance.
- 10:11: The video discusses the influence of input resolution on ViTPose's performance.
- 11:32: The video discusses the influence of the attention type on ViTPose's performance.
- 14:55: The video discusses the influence of partial fine-tuning on ViTPose's performance.
- 16:02: The video discusses the influence of multi-dataset training on ViTPose's performance.
- 16:21: The video discusses the use of knowledge distillation to improve the model's generalizability.
- 21:12: The video presents ViTPose's results compared with other models for 2D pose estimation on the MS COCO dataset.

Positive Learnings:
- ViTPose simplifies 2D pose estimation by using only Transformers.
- Using a plain Transformer encoder to create tokens from the input image has proven effective.
- The different variants (base, large, huge, and gigantic) can further enhance ViTPose's performance.
- Pre-training data improves ViTPose's performance.
- Knowledge distillation improves the model's generalizability.

Negative Learnings:
- Previous attempts to use Transformers for 2D pose estimation, such as TransPose and TokenPose, had limitations.
- TransPose's CNN backbone limits its effectiveness.
- TokenPose's randomly initialized tokens for missing or occluded keypoints are not the most efficient approach.
- HRFormer's combination of Transformer blocks and convolutional blocks for downsampling and upsampling makes it complicated.
- Partial fine-tuning can hurt ViTPose's performance.
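To make the encoder/decoder description at 2:21 and 3:50 concrete, here is a minimal PyTorch sketch of the idea: a plain ViT encoder tokenizes the cropped person image, and a "simple decoder" (bilinear upsampling, ReLU, then a 3x3 convolution) turns the token feature map into one heatmap per keypoint. The hyperparameters roughly follow ViTPose-B and the class names are invented for illustration; this is not the official implementation.

```python
import torch
import torch.nn as nn


class ViTEncoder(nn.Module):
    def __init__(self, img_size=(256, 192), patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        self.grid = (img_size[0] // patch, img_size[1] // patch)  # 16 x 12 patches
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid[0] * self.grid[1], dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                        # x: (B, 3, 256, 192)
        x = self.patch_embed(x)                  # (B, dim, 16, 12)
        x = x.flatten(2).transpose(1, 2)         # (B, 192, dim) token sequence
        x = self.blocks(x + self.pos_embed)
        b, n, c = x.shape
        return x.transpose(1, 2).reshape(b, c, *self.grid)  # back to a 2D feature map


class SimpleDecoder(nn.Module):
    """The 'simple' head: 4x bilinear upsampling, ReLU, then a 3x3 conv to K heatmaps."""

    def __init__(self, dim=768, num_keypoints=17):
        super().__init__()
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, num_keypoints, kernel_size=3, padding=1),
        )

    def forward(self, feat):                     # feat: (B, dim, 16, 12)
        return self.head(feat)                   # (B, K, 64, 48) heatmaps


if __name__ == "__main__":
    model = nn.Sequential(ViTEncoder(), SimpleDecoder())
    print(model(torch.randn(1, 3, 256, 192)).shape)  # torch.Size([1, 17, 64, 48])
```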
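For the multi-dataset training mentioned at 6:15, a hedged sketch of how a shared encoder could be paired with one decoder per dataset. The dataset names, keypoint counts, and routing logic here are assumed examples, not the paper's actual training code; it reuses the ViTEncoder and SimpleDecoder classes from the sketch above.

```python
import torch
import torch.nn as nn

# Shared backbone plus one decoder per dataset, since each dataset defines a
# different keypoint set (counts below are assumed examples).
encoder = ViTEncoder()                         # from the sketch above
decoders = nn.ModuleDict({
    "coco": SimpleDecoder(num_keypoints=17),
    "mpii": SimpleDecoder(num_keypoints=16),
    "aic":  SimpleDecoder(num_keypoints=14),
})
criterion = nn.MSELoss()                       # heatmap regression loss
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(decoders.parameters()), lr=5e-4
)


def train_step(images, target_heatmaps, dataset_name):
    """Route a batch through the shared encoder and its dataset-specific decoder."""
    pred = decoders[dataset_name](encoder(images))
    loss = criterion(pred, target_heatmaps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# e.g. train_step(torch.randn(4, 3, 256, 192), torch.randn(4, 17, 64, 48), "coco")
```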
@amirhosseinmohammadi4731 4 months ago
It was very comprehensive, thanks a lot Soroush
@mjalali3109 a year ago
Congratulations, a perfect and neat job
@francisferri2732 a year ago
Thank you for your videos! They are very good for keeping up with the state of the art.
@soroushmehraban a year ago
Glad you enjoyed it
@rohollahhosseyni8564 a year ago
Great job!
@alihadimoghadam8931 a year ago
nice job
@soroushmehraban a year ago
Thanks
@mrraptorious8090 8 months ago
Hey, I'm wondering how to train ViTPose myself. Did you happen to train it yourself? If so, could you share your experience?
@nikhilchhabra a year ago
Thank you for this interesting video. It would be interesting to see bottom-up pose estimation using Transformers, like ED-Pose. ViTPose is top-down, so (a) inference time increases with the number of people, and (b) it cannot handle overlapping-human scenarios.
@soroushmehraban a year ago
Thanks for the feedback. I didn't know about ED-Pose; I'll surely read it soon.
@Fateme_Pourghasem a year ago
That was great. Thanks.
@soroushmehraban a year ago
Thanks for the feedback
@shklbor 4 months ago
How do they detect poses from heatmaps for, say, 'k' people?
@shklbor 4 months ago
Never mind, it doesn't detect multiple poses.
@shrayesraman5192 a month ago
Conceivably you'd have two stages: human (object) detection, then cropping each person for pose estimation.
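The two-stage, top-down flow described in this thread could look roughly like the sketch below: a person detector proposes boxes, each crop is resized and passed through the pose model, and each keypoint is read off as the argmax of its heatmap, mapped back to image coordinates. `person_detector` and `pose_model` are placeholders, not real APIs, so treat this as an illustration of the idea only.

```python
import torch
import torch.nn.functional as F


def heatmaps_to_keypoints(heatmaps, box):
    """Decode (K, H, W) heatmaps into (K, 3) rows of [x, y, score] in image coordinates."""
    x0, y0, x1, y1 = box
    k, h, w = heatmaps.shape
    scores, idx = heatmaps.view(k, -1).max(dim=1)          # argmax per keypoint channel
    ys = torch.div(idx, w, rounding_mode="floor").float()
    xs = (idx % w).float()
    xs = x0 + xs / (w - 1) * (x1 - x0)                     # map heatmap grid back to the box
    ys = y0 + ys / (h - 1) * (y1 - y0)
    return torch.stack([xs, ys, scores], dim=1)


def estimate_poses(image, person_detector, pose_model, input_size=(256, 192)):
    """Stage 1: detect people. Stage 2: crop, resize, and run the pose model per person."""
    poses = []
    for box in person_detector(image):                     # one pose forward pass per person
        x0, y0, x1, y1 = [int(v) for v in box]
        crop = image[:, y0:y1, x0:x1].unsqueeze(0)         # image assumed to be a (3, H, W) float tensor
        crop = F.interpolate(crop, size=input_size, mode="bilinear", align_corners=False)
        heatmaps = pose_model(crop)[0]                     # e.g. (K, 64, 48) for a 256x192 crop
        poses.append(heatmaps_to_keypoints(heatmaps, (x0, y0, x1, y1)))
    return poses
```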