YOLO-World - a zero-shot detection model, which means you can detect objects without training the model on them: you simply pass the class names or noun phrases you want to detect as text prompts (a minimal usage sketch follows the links below).
GitHub: github.com/Aar...
For queries: you can comment in the comment section or email me at aarohisingla1987@gmail.com
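Here is a minimal usage sketch of zero-shot detection with text prompts, assuming the "ultralytics" package's YOLOWorld wrapper; the weights file name and sample image path are placeholders, not from the video:

```python
# Sketch: zero-shot detection by naming classes at inference time.
# Assumes ultralytics >= 8.1; "yolov8s-world.pt" and "bus.jpg" are placeholders.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")          # load pretrained YOLO-World weights
model.set_classes(["person", "red backpack"])  # zero-shot: just name the classes
results = model.predict("bus.jpg", conf=0.25)  # no training step needed
results[0].show()                              # visualize the detections
```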
YOLO-World builds on the YOLO detector and adds a frozen CLIP-based text encoder that extracts text embeddings from the input texts, e.g., object categories or noun phrases.
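As a rough illustration (not the repo's own code), a frozen CLIP text encoder can turn category names or noun phrases into embeddings like this; the Hugging Face "transformers" package and the "openai/clip-vit-base-patch32" checkpoint are assumptions:

```python
# Sketch: text embeddings from a frozen CLIP text encoder.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
text_encoder.eval()  # frozen: no gradient updates while the detector trains

categories = ["person", "bicycle", "red backpack"]  # free-form noun phrases
inputs = tokenizer(categories, padding=True, return_tensors="pt")

with torch.no_grad():
    text_embeds = text_encoder(**inputs).text_embeds  # (num_classes, embed_dim)

# Normalize so similarity with region features becomes a cosine score.
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
print(text_embeds.shape)  # e.g. torch.Size([3, 512])
```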
YOLO-World contains a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) that facilitates interaction between multi-scale image features and text embeddings. RepVL-PAN can re-parameterize a user's offline vocabulary into the model parameters for fast inference and deployment, as sketched below.
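A sketch of the offline-vocabulary workflow, assuming the ultralytics YOLOWorld API (file names are placeholders): fix the prompt list once, let the text embeddings be folded into the model, and save a plain detector for deployment.

```python
# Sketch: bake a fixed custom vocabulary into the weights for deployment.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")
model.set_classes(["helmet", "safety vest", "forklift"])  # offline vocabulary
model.save("yolov8s-world-custom.pt")  # vocabulary is stored with the weights

# The saved model can then be loaded like an ordinary detector and run
# without passing text prompts, which keeps inference fast and deployment simple.
```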
YOLO-World is pre-trained on large-scale region-text datasets with a region-text contrastive loss to learn region-level alignment between vision and language. For plain image-text datasets, e.g., CC3M, the authors adopt an automatic labeling approach to generate pseudo region-text pairs.
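The snippet below is an illustrative sketch of the idea behind a region-text contrastive objective, not the paper's exact implementation: each region embedding should score highest against the text embedding of its matched category.

```python
# Sketch: contrastive alignment between region and text embeddings.
import torch
import torch.nn.functional as F

num_regions, num_classes, dim = 8, 3, 512
region_embeds = F.normalize(torch.randn(num_regions, dim), dim=-1)  # from the image branch
text_embeds = F.normalize(torch.randn(num_classes, dim), dim=-1)    # from the frozen text encoder
targets = torch.randint(0, num_classes, (num_regions,))             # matched category ids

logits = region_embeds @ text_embeds.t() / 0.07  # cosine similarity with a temperature
loss = F.cross_entropy(logits, targets)          # pull matched region-text pairs together
print(loss.item())
```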