- 0:00: The video discusses the ViTPose paper, which currently leads 2D pose estimation on the MS COCO dataset.
- 0:13: Previous attempts to use Transformers for 2D pose estimation include TransPose and TokenPose.
- 0:26: TransPose uses a CNN backbone to extract local information from the input image and a Transformer encoder to capture the relationships between the skeleton keypoints.
- 0:58: TokenPose follows a similar approach but adds randomly initialized tokens to represent missing or occluded keypoints.
- 1:33: Another attempt, HRFormer, combines Transformer blocks with convolutional blocks for downsampling and upsampling.
- 2:11: ViTPose simplifies the architecture by using only Transformers, making the problem easier to handle.
- 2:21: ViTPose uses a plain Transformer encoder to turn the input image into tokens (a minimal sketch of the idea follows this summary).
- 3:50: ViTPose offers two decoder options: a classic decoder and a simple decoder.
- 6:15: ViTPose supports multi-dataset training, using a different decoder depending on the dataset (see the second sketch below).
- 7:03: The video presents the ViTPose variants (base, large, huge, and gigantic), which differ in the number of layers and the channel size.
- 7:27: The video discusses the simplicity and scalability of ViTPose.
- 8:33: The video discusses the influence of pre-training data on ViTPose's performance.
- 10:11: The video discusses the influence of input resolution on ViTPose's performance.
- 11:32: The video discusses the influence of the attention type on ViTPose's performance.
- 14:55: The video discusses the influence of partial fine-tuning on ViTPose's performance.
- 16:02: The video discusses the influence of multi-dataset training on ViTPose's performance.
- 16:21: The video discusses the use of knowledge distillation to improve the model's generalizability.
- 21:12: The video presents ViTPose's results compared with other models for 2D pose estimation on the MS COCO dataset.

Positive Learnings:
- ViTPose simplifies 2D pose estimation by using only Transformers.
- Using a plain Transformer encoder to create tokens from the input image has proven effective.
- The different variants (base, large, huge, and gigantic) can further enhance ViTPose's performance.
- Pre-training data improves ViTPose's performance.
- Knowledge distillation improves the model's generalizability.

Negative Learnings:
- Previous attempts to use Transformers for 2D pose estimation, such as TransPose and TokenPose, had limitations.
- TransPose's CNN backbone limits its effectiveness.
- TokenPose's randomly initialized tokens for missing or occluded keypoints are not the most efficient approach.
- HRFormer's combination of Transformer blocks and convolutional blocks for downsampling and upsampling makes it complicated.
- Partial fine-tuning can hurt ViTPose's performance.
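To make the encoder/decoder description at 2:21 and 3:50 concrete, here is a minimal PyTorch sketch of the idea: a plain ViT encoder tokenizes the cropped person image, and a "simple decoder" (bilinear upsampling, ReLU, then a 3x3 convolution) turns the token feature map into one heatmap per keypoint. The hyperparameters roughly follow ViTPose-B and the class names are invented for illustration; this is not the official implementation.

```python
import torch
import torch.nn as nn


class ViTEncoder(nn.Module):
    def __init__(self, img_size=(256, 192), patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        self.grid = (img_size[0] // patch, img_size[1] // patch)  # 16 x 12 patches
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid[0] * self.grid[1], dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                        # x: (B, 3, 256, 192)
        x = self.patch_embed(x)                  # (B, dim, 16, 12)
        x = x.flatten(2).transpose(1, 2)         # (B, 192, dim) token sequence
        x = self.blocks(x + self.pos_embed)
        b, n, c = x.shape
        return x.transpose(1, 2).reshape(b, c, *self.grid)  # back to a 2D feature map


class SimpleDecoder(nn.Module):
    """The 'simple' head: 4x bilinear upsampling, ReLU, then a 3x3 conv to K heatmaps."""

    def __init__(self, dim=768, num_keypoints=17):
        super().__init__()
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, num_keypoints, kernel_size=3, padding=1),
        )

    def forward(self, feat):                     # feat: (B, dim, 16, 12)
        return self.head(feat)                   # (B, K, 64, 48) heatmaps


if __name__ == "__main__":
    model = nn.Sequential(ViTEncoder(), SimpleDecoder())
    print(model(torch.randn(1, 3, 256, 192)).shape)  # torch.Size([1, 17, 64, 48])
```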
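For the multi-dataset training mentioned at 6:15, a hedged sketch of how a shared encoder could be paired with one decoder per dataset. The dataset names, keypoint counts, and routing logic here are assumed examples, not the paper's actual training code; it reuses the ViTEncoder and SimpleDecoder classes from the sketch above.

```python
import torch
import torch.nn as nn

# Shared backbone plus one decoder per dataset, since each dataset defines a
# different keypoint set (counts below are assumed examples).
encoder = ViTEncoder()                         # from the sketch above
decoders = nn.ModuleDict({
    "coco": SimpleDecoder(num_keypoints=17),
    "mpii": SimpleDecoder(num_keypoints=16),
    "aic":  SimpleDecoder(num_keypoints=14),
})
criterion = nn.MSELoss()                       # heatmap regression loss
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(decoders.parameters()), lr=5e-4
)


def train_step(images, target_heatmaps, dataset_name):
    """Route a batch through the shared encoder and its dataset-specific decoder."""
    pred = decoders[dataset_name](encoder(images))
    loss = criterion(pred, target_heatmaps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# e.g. train_step(torch.randn(4, 3, 256, 192), torch.randn(4, 17, 64, 48), "coco")
```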
@amirhosseinmohammadi4731 4 months ago
It was very comprehensive, thanks a lot Soroush
@mjalali3109 a year ago
Congratulations, a perfect and neat job
@francisferri2732 a year ago
Thank you for your videos! They are very good for keeping up with the state of the art.
@soroushmehraban a year ago
Glad you enjoyed it
@rohollahhosseyni8564 a year ago
Great job!
@alihadimoghadam8931 a year ago
nice job
@soroushmehraban a year ago
Thanks
@mrraptorious8090 8 months ago
Hey, I'm wondering how to train ViTPose myself. Did you happen to train it yourself? If so, could you share your experience?
@nikhilchhabra a year ago
Thank you for this interesting video. It would be interesting to see bottom-up pose estimation using Transformers, like ED-Pose. ViTPose is top-down, so (a) inference time increases with the number of people, and (b) it cannot handle overlapping-human scenarios.
@soroushmehraban a year ago
Thanks for the feedback. I didn't know about ED-Pose; I'll surely read it soon.
@Fateme_Pourghasem a year ago
That was great. Thanks.
@soroushmehraban a year ago
Thanks for the feedback
@shklbor 4 months ago
How do they detect poses from heatmaps for, say, 'k' people?
@shklbor 4 months ago
Never mind, it doesn't detect multiple poses.
@shrayesraman5192 a month ago
Conceivably you'd have two stages: human (object) detection, then cropping each person for pose estimation.
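The two-stage, top-down flow described in this thread could look roughly like the sketch below: a person detector proposes boxes, each crop is resized and passed through the pose model, and each keypoint is read off as the argmax of its heatmap, mapped back to image coordinates. `person_detector` and `pose_model` are placeholders, not real APIs, so treat this as an illustration of the idea only.

```python
import torch
import torch.nn.functional as F


def heatmaps_to_keypoints(heatmaps, box):
    """Decode (K, H, W) heatmaps into (K, 3) rows of [x, y, score] in image coordinates."""
    x0, y0, x1, y1 = box
    k, h, w = heatmaps.shape
    scores, idx = heatmaps.view(k, -1).max(dim=1)          # argmax per keypoint channel
    ys = torch.div(idx, w, rounding_mode="floor").float()
    xs = (idx % w).float()
    xs = x0 + xs / (w - 1) * (x1 - x0)                     # map heatmap grid back to the box
    ys = y0 + ys / (h - 1) * (y1 - y0)
    return torch.stack([xs, ys, scores], dim=1)


def estimate_poses(image, person_detector, pose_model, input_size=(256, 192)):
    """Stage 1: detect people. Stage 2: crop, resize, and run the pose model per person."""
    poses = []
    for box in person_detector(image):                     # one pose forward pass per person
        x0, y0, x1, y1 = [int(v) for v in box]
        crop = image[:, y0:y1, x0:x1].unsqueeze(0)         # image assumed to be a (3, H, W) float tensor
        crop = F.interpolate(crop, size=input_size, mode="bilinear", align_corners=False)
        heatmaps = pose_model(crop)[0]                     # e.g. (K, 64, 48) for a 256x192 crop
        poses.append(heatmaps_to_keypoints(heatmaps, (x0, y0, x1, y1)))
    return poses
```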