PaliGemma by Google: Train Model on Custom Detection Dataset

7,731 views

Roboflow

A day ago

Learn how to fine-tune PaliGemma, Google's open-source Vision-Language Model, for custom object detection tasks. This step-by-step tutorial walks you through modifying Google's notebook to train PaliGemma on your dataset. We'll use the handwritten digits and math operations dataset from RF100, explore the JSONL format, and demonstrate how to deploy your fine-tuned model for real-world inference. Discover the power of PaliGemma for image captioning, VQA, and object detection, and overcome its limitations.
Chapters:
- 00:00 PaliGemma Capabilities
- 02:03 Environment Setup
- 05:25 Dataset Format
- 09:07 Downloading Pre-trained Model
- 11:27 Loading Dataset
- 13:45 Training and Evaluating the Model
- 15:19 Deploying the Model
- 17:37 Important Considerations
- 20:02 Outro
Resources:
- Roboflow: roboflow.com
- 🔴 Community Session June 6th, 2024 at 08:00 AM PST / 11:00 AM EST / 05:00 PM CET: roboflow.stream
- ⭐ Notebooks GitHub: github.com/roboflow/notebooks
- ⭐ Supervision GitHub: github.com/roboflow/supervision
- 📓 PaliGemma notebook: colab.research.google.com/git...
- 🗞 Gemma arXiv paper: arxiv.org/pdf/2403.08295
- 🗞 SigLIP arXiv paper: arxiv.org/pdf/2303.15343
- 🗞 PaliGemma blog post: blog.roboflow.com/how-to-fine...
- 🔗 RF100: www.rf100.org
- 🔗 PaliGemma model card: www.kaggle.com/models/google/...
- 🔗 PaliGemma fine-tuned checkpoints: huggingface.co/collections/go...
- 🔗 PaliGemma HF Space: huggingface.co/spaces/big-vis...
Stay updated with the projects I'm working on at github.com/roboflow and github.com/SkalskiP! ⭐
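
To make the JSONL format mentioned in the description more concrete, here is a minimal sketch of how one detection record could be built. The field names (`image`, `prefix`, `suffix`), the `detect ...` prompt, the `<locNNNN>` tokens with 1024 bins, and the `y_min, x_min, y_max, x_max` token order are assumptions based on the commonly described PaliGemma detection format, not something this page spells out, so check them against the notebook before relying on them. The helper names and the file name `digits_001.jpg` are hypothetical.

```python
import json


def to_loc_token(value: float, size: float, bins: int = 1024) -> str:
    """Map a pixel coordinate to a <locNNNN> token (assumed: 1024 bins, floor)."""
    index = min(bins - 1, max(0, int(value / size * bins)))
    return f"<loc{index:04d}>"


def detection_record(image_name, width, height, boxes, classes):
    """Build one JSONL record.

    boxes: list of (x_min, y_min, x_max, y_max, label) in pixels.
    classes: all class names of the dataset, listed in the prompt.
    """
    parts = []
    for x_min, y_min, x_max, y_max, label in boxes:
        # Assumed token order: y_min, x_min, y_max, x_max.
        tokens = (
            to_loc_token(y_min, height)
            + to_loc_token(x_min, width)
            + to_loc_token(y_max, height)
            + to_loc_token(x_max, width)
        )
        parts.append(f"{tokens} {label}")
    return {
        "image": image_name,
        "prefix": "detect " + " ; ".join(classes),
        "suffix": " ; ".join(parts),
    }


record = detection_record(
    "digits_001.jpg", 640, 480, [(64, 48, 320, 240, "5")], ["5", "plus"]
)
line = json.dumps(record)  # one line of the .jsonl file
```

Each image contributes one such line to the annotation file. Multiple boxes in the same image are joined with ` ; ` inside `suffix`.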

Comments: 44
@rairorr A month ago
The evaluation technique of visualizing the learning rate is something I’ve never seen before. Beautiful.
@Roboflow A month ago
I’m super glad you liked it. I’ve invested a bit of time to make it look good.
@abdshomad 2 months ago
Always the first to teach others all computer vision tools ... Thank you ... 🎉
@Roboflow 2 months ago
You are always first to comment! Thank you… 🙏🏻
@RipNonoRasta 8 days ago
amazing work!
@Roboflow 7 days ago
Thank you!
@lucasbeyer2985 2 months ago
Great dive, I'm happy to see you figured this all out without us documenting it well yet! For the issue about bad performance on card detection, I think most of the reasons you listed contribute, except I don't think a larger model is needed. 1) It suffers from the same issue as pix2seq (and all max-likelihood detection models) of being "conservative" and needing either lots of tricks/augmentations (see pix2seq) to fix that, or RL-tuning (see the "Tuning computer vision models with task rewards" paper). 2) Order should be fine if trained for long enough. 3) Attention-only fine-tuning with SGD (what's done in the Colab) is probably too weak for "harder" tasks like this (not sure about this one).
@Roboflow 2 months ago
Thanks a lot! I have almost no intuition with this category of models, so any ideas and suggestions are super helpful. I intend to spend more time battling this issue. Going to start here: - arxiv.org/abs/2302.08242 - arxiv.org/abs/2109.10852
@Roboflow 2 months ago
Also if you need any help documenting it I’m happy to help ;)
@lucasbeyer2985 2 months ago
@@Roboflow yup, those two are good reads to start. Probably pix2seq first, as it introduces the general idea but also lots of hacks to make it work OK, and then RL-tuning second, because it shows how to skip all of the tricks in a nice way (with RL), but the writing assumes you kinda know pix2seq already.
@lovol2 2 months ago
Fantastic. Thank you.
@gokayfem 2 months ago
excellent tutorial!
@Pingu_astrocat21 2 months ago
Thank you for this :)
@codeandrobotid2466 2 months ago
Good job, bro!
@suphotnarapong355 2 months ago
Thank you
@hegalzhang1457 4 days ago
Great work! Do you have example code for fine-tuning on an OCR task?
@Roboflow 4 days ago
Not yet. But I plan to play with fine-tuning for OCR and VQA tasks.
@trezero 2 months ago
I'd love to see your take on Phi 3 Vision with Grounding Dino.
@Roboflow 2 months ago
I have not tried Phi 3 yet
@BingxinYang-j7t A month ago
If I want to fine-tune the PaliGemma model for a segmentation task, what dataset should I prepare?
@Roboflow A month ago
We don’t have a written tutorial, but I made a Google Colab notebook some time ago. Would that be enough?
@BingxinYang-j7t A month ago
@@Roboflow sorry, I cannot find the Google Colab about fine-tuning the PaliGemma model for the segmentation task. Would you please give me the web address? Thank you very much.
@marcc0183 A month ago
Thank you for this. One question: I've been researching for a while and haven't found a way. I want to train a model that has information about my trips, meal plans with friends, etc., so I can ask it questions. I don't know whether it is better to use vector embeddings of the images, or to use vision models to convert the images to text and then build RAG over that text together with the photo metadata (from the Google Photos API). Which do you recommend: vector embeddings of images or RAG over text? All the best.
@MABatin-hd1mt A month ago
As always, another great video. However, do you think few-shot detection is possible with PaliGemma, where I fine-tune the model with as little as a single image rather than a large dataset like RF100?
@uttamdwivedi7709 A month ago
Great tutorial @roboflow !!! Is it possible to train the same model for VQA as well as object detection? Can you provide an example of what the JSONL file should look like in such cases?
@Roboflow A month ago
We will soon add support for VQA datasets in Roboflow. I plan to roll out tutorials covering this topic soon.
@safiraghulam1862 A month ago
Hi, Can I fine-tune the model on a medical dataset? Currently, the model is not performing well on this data, and the results indicate that it is not suitable for medical data out-of-the-box. If I fine-tune the model on my dataset, which consists of approximately 200 to 300 images, will it work better? Additionally, is it possible to quantize this model to reduce its size from 3B to something smaller without significantly compromising its performance? Thank you.
@Roboflow A month ago
It is possible to fine-tune on medical images. I have done that for several use cases, such as tumor detection.
@safiraghulam1862 A month ago
@@Roboflow okay
@AkramKhanyt 2 months ago
How did you deploy the model? Could you also share the code for installing the dependencies (supervision) for PaliGemma?
@Roboflow 2 months ago
For now I deployed it simply by cloning this HF Space: huggingface.co/spaces/big-vision/paligemma. But we are wrapping up the work on the github.com/roboflow/inference integration.
@IceStormSerenade 26 days ago
@@Roboflow Hi, can you explain how the deployment works, please? I haven't used Gradio before, so I am having difficulty with it...
@thesuriya_3 A month ago
Can you do this for VQA as well 😊 using PyTorch code?
@darklord96423 2 months ago
Good job, bro, but don't you think that, setting aside answering questions about a photo, YOLO is better than PaliGemma? I mean for detection.
@Roboflow 2 months ago
YOLO is definitely a better detection model, so if you don’t care about other capabilities, you’ll probably be better off staying with YOLO.
@toobasheikh106 2 months ago
Hi, I fine-tuned the model on a small medical object detection dataset with just 223 training examples, and after training for more than 20 epochs it starts giving random predictions (classes that are not in the dataset and even some spam classes). Could you please point out what the reason could be?
@Roboflow 2 months ago
Can you send me the link to your dataset?
@toobasheikh106 2 months ago
@@Roboflow Sorry, it's a private dataset for organoid cell detection and has 4 classes.
@toobasheikh106 2 months ago
It gives me random predictions like this: metaphase; talaga; ExecuteAsync; tooth; JvmSlicer#
@ttkrpink 2 months ago
Great tutorial. I followed it online and everything worked great. When I tried it on a local machine, I got the following error on the line `for image, _, caption in make_predictions(validation_data_iterator(), num_examples=4, batch_size=4):` `ValueError: Sharding GSPMDSharding({devices=[3]`
@Roboflow 2 months ago
Do you have multiple GPUs locally?
@ttkrpink 2 months ago
@@Roboflow yes, I do have 3 GPUs
@Roboflow 2 months ago
@@ttkrpink please try using a `batch_size` value that is divisible by 3: 3, 6, 9, 12, 15, something like this. It looks like it tries to split your data equally across all the GPUs, but you set batch_size=4.
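
The advice in this reply can be sketched as a small helper that rounds a requested batch size up to the nearest multiple of the device count. This is only an illustration: `round_up_to_multiple` is a hypothetical name, and it assumes, as the reply suggests, that the batch is split evenly across all devices. In a JAX-based notebook the device count would typically come from `jax.device_count()`; it is passed in as a plain integer here so the snippet runs without JAX installed.

```python
def round_up_to_multiple(batch_size: int, device_count: int) -> int:
    """Return the smallest multiple of device_count that is >= batch_size."""
    if batch_size < 1 or device_count < 1:
        raise ValueError("batch_size and device_count must be positive")
    # Ceiling division, then scale back up to a multiple of device_count.
    return -(-batch_size // device_count) * device_count


# With 3 GPUs, a requested batch size of 4 becomes 6.
adjusted = round_up_to_multiple(4, 3)
```

Rounding up rather than down avoids ending up with a batch size of 0 when the requested value is smaller than the number of devices.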
@president2 A month ago
What about liquid AI? I saw a video from years ago. Any updates on this process?