Fine-tune Multi-modal LLaVA Vision and Language Models

  15,084 views

Trelis Research

1 day ago

➡️ ADVANCED Vision Fine-tuning Repo: trelis.com/advanced-vision/
➡️ ADVANCED-inference Repo: trelis.com/enterprise-server-...
➡️ ADVANCED-fine-tuning Repo: trelis.com/advanced-fine-tuni...
➡️ Trelis Function-calling Models and Scripts: trelis.com/function-calling/
➡️ ADVANCED Transcription Repo: trelis.com/advanced-transcrip...
➡️ One-click Fine-tuning & Inference Templates: github.com/TrelisResearch/one...
➡️ Trelis Newsletter: Trelis.Substack.com
➡️ Trelis Resources and Support: Trelis.com/About
Affiliate Links (support the channel):
- Vast AI - cloud.vast.ai/?ref_id=98762
- RunPod - tinyurl.com/4b6ecbbn
*Video Resources*
Slides: docs.google.com/presentation/...
One-click RunPod / VastAI Templates: github.com/TrelisResearch/ins...
IDEFICS: huggingface.co/HuggingFaceM4/...
LLaVA: llava.hliu.cc/
Trelis Newsletter: Trelis.Substack.com
Chapters:
0:00 Fine-tuning Multi-modal Models
0:16 Overview
1:30 LLaVA vs ChatGPT
4:53 Applications
5:37 Multi-modal model architecture
9:05 Vision Encoder architecture
14:00 LLaVA 1.5 architecture
16:30 LLaVA 1.6 architecture
18:30 IDEFICS architecture
22:00 Data creation
24:11 Dataset creation
25:29 Fine-tuning
34:25 Inference and Evaluation
37:34 Data loading
40:00 LoRA setup
42:52 Recap so far
43:25 Evaluation pre-training
44:26 Training
45:40 Evaluation post-training
46:45 Technical clarifications
50:29 Summary

Comments: 82
@TrelisResearch
@TrelisResearch 1 month ago
UPDATE APRIL 24th 2024: VRAM requirements have been greatly reduced by adding gradient checkpointing (all figures below are for 16-bit training):
- liuhaotian/llava-v1.5-7b takes a minimum of 24 GB to train and will run on a single A6000.
- liuhaotian/llava-v1.5-13b REQUIRES VRAM OF
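For reference, a minimal sketch of how gradient checkpointing is typically enabled with Hugging Face-style training code; the llava-hf checkpoint name and the TrainingArguments values are illustrative assumptions, not the exact Trelis repo configuration:

import torch
from transformers import LlavaForConditionalGeneration, TrainingArguments

# Gradient checkpointing recomputes activations in the backward pass instead of
# storing them, which is what cuts the VRAM figures quoted above.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",        # assumed HF-format checkpoint
    torch_dtype=torch.bfloat16,
)
model.gradient_checkpointing_enable()
model.enable_input_require_grads()     # needed when most weights are frozen (e.g. LoRA)

training_args = TrainingArguments(
    output_dir="./llava-ft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    gradient_checkpointing=True,
)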
@TemporaryForstudy
@TemporaryForstudy 1 month ago
Nice video. Where do you work in Dublin? I am from India and I want to work at your company. I have a master's degree in AI. I am currently working at an Indian company, but they don't provide remote work and the pay is also low. So please let me know if there is something for me.
@sam_joshua_s
@sam_joshua_s 3 months ago
Most underrated YouTube channel
@ForTheEraOfLove
@ForTheEraOfLove 3 months ago
Reminds me of the Person of Interest episode called "If-Then-Else", where "The Machine" has to make a choice among nearly infinite possibilities. Great show for ML enthusiasts.
@user-my1tx4dc2w
@user-my1tx4dc2w 3 months ago
Amazing video! Thank you for sharing!❤
@Tsardoz
@Tsardoz 1 month ago
Very well explained.
@lourdarunraj9967
@lourdarunraj9967 2 months ago
Amazing content!!!
@NametVevo
@NametVevo 26 days ago
Thank you for your video! I'm just starting in AI and it helps me a lot.
@Cloudvenus666
@Cloudvenus666 1 month ago
One thing to note: it took 9x A6000s for me, as 7 caused CUDA to run out of memory. Nevertheless, this is the best channel for learning how to fine-tune models, and it is worth buying the repos.
@TrelisResearch
@TrelisResearch 1 month ago
Interesting - which model, the 34B? And did you change the batch size or context length?
@Cloudvenus666
@Cloudvenus666 1 month ago
@@TrelisResearch I used the 34B, and didn’t change the configurations. I’m sure that I could have gotten away with 8 GPUs but 7 ran a bit short.
@jacekb4057
@jacekb4057 26 days ago
Man this helps me a lot. Thanks ❤
@3169aaaa
@3169aaaa 25 days ago
@jacekb4057 hi, did you create a notebook related to this?
@3169aaaa
@3169aaaa 25 days ago
@jacekb4057
@danieldemillard9412
@danieldemillard9412 3 months ago
Thanks again for another great video and tutorial. How much effort would it require to swap out your code to work with Mixtral 8x7b? I assume it isn't as trivial as swapping out the model name and fine-tuning. Do you foresee any issues with combining these with Instruct models instead of the base chat models?
@TrelisResearch
@TrelisResearch 3 months ago
Good Q. I don't think it would take much work, although Mixtral doesn't quite fit on a single A100, so training will be slower - maybe 24 hours on 8 A100s. Btw, I'm also just fine-tuning, so if you wanted to swap in Mixtral, it's maybe better to use the original code.
@unsaturated8482
@unsaturated8482 3 months ago
very informative
@imranullah3097
@imranullah3097 3 months ago
❤❤❤❤❤ Kindly also create a video on fine-tuning HiFi-GAN for natural speech synthesis.
@unclecode
@unclecode 1 month ago
Worth 51 minutes of your life. Kudos, I learned a lot. Quick question - got any vids or tips on making something like LLaVA from scratch? Moondream's a good example. I have watched your other vids, but those are more about fine-tuning, like this one. I wanna grasp the whole process of merging models, building the adapter, training it, and releasing a new multi-modal version of the original language model. Thx again
@TrelisResearch
@TrelisResearch 1 month ago
I guess you watched the Moondream video I made, right? That's a start. Yeah, building from scratch is a bit more involved as you have to write the loading scripts. Again, the Moondream model repo is a good place to look and get inspiration. I may get around to building from scratch at some point.
@lalpremi
@lalpremi 3 months ago
Thank you for sharing, very interesting. Wow, your trained model summarizing the given pictures is very impressive and fast. What type of hardware is behind the scenes handling your site? Have a great day. 🙂
@TrelisResearch
@TrelisResearch 3 months ago
I'm running on A6000s on runpod! See: github.com/TrelisResearch/install-guides/blob/main/llm-notebook-setup.md
@UtoobNam
@UtoobNam 1 month ago
Hey! Are you making something similar for the multimodal-output LLaVA (Interactive)?
@user-im4mt4ce1x
@user-im4mt4ce1x 2 months ago
Hi, love the content btw. Do you think fine-tuning Phi-2 with this approach might be a good idea, like what Moondream does? And will this same script work for Phi-2?
@TrelisResearch
@TrelisResearch 2 months ago
Yes, in principle that would work, although you would need to instantiate the model correctly, swapping in Phi for Mistral/Llama.
@mirai5749
@mirai5749 1 month ago
Hello! Are the embeddings expert resamplers? I just read about the Prismer VLM.
@luce_yliu7524
@luce_yliu7524 1 month ago
Great materials! Do you have this repo on your GitHub?
@TrelisResearch
@TrelisResearch 1 month ago
Yup, this is in the ADVANCED-vision repo for purchase from trelis.com/ADVANCED-vision
@user-gp5wb6cz2v
@user-gp5wb6cz2v 3 months ago
Great video! I have fine-tuned a Llama 2 model on a V100 previously, but I'm wondering if a model like llava-v1.6-mistral-7b on Hugging Face would be too large to fit in the 16 GB available on the V100? Any suggestions on how to figure out how much VRAM a model requires? It often doesn't seem obvious from the documentation.
@TrelisResearch
@TrelisResearch 3 months ago
Yeah, so Llama 7B has 7B parameters, and in 16-bit that's two bytes per parameter, so you need about 14 GB of VRAM to load the model, plus some headroom for the kv cache for the context length. For LLaVA you additionally need space for the image model AND for the kv cache for the images. The vision model is quite small - a few hundred GB in size - so that shouldn't add much. I see on the repo that the files are around 16 GB in total for the model plus vision. However, the vision model is cast up to 32 bits, so that can also double its size. All in all, in 16-bit it won't be possible to fit in 16 GB of VRAM unless you do quantization. There's a flag to set that, but it's not stable and I had issues trying it. Basically, the LLaVA 1.6 model is not well supported in HuggingFace, so custom scripts are needed, like I showed in the video here. However, you can train LLaVA 1.5 with 4-bit quantization and that should fit on your V100.
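A back-of-the-envelope version of that arithmetic as a sketch (using the hundreds-of-MB vision-tower figure from the correction further down this thread; optimizer state, activations and the kv cache come on top of these weights-only numbers):

params_llm = 7e9                      # Llama/Mistral 7B language model
params_vision = 0.3e9                 # CLIP-style vision tower, a few hundred million parameters
llm_gb = params_llm * 2 / 1e9         # 2 bytes per parameter in 16-bit -> ~14 GB
vision_gb = params_vision * 4 / 1e9   # ~1.2 GB if the vision tower is cast up to 32-bit
print(f"Weights only: ~{llm_gb + vision_gb:.1f} GB, before kv cache and image tokens")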
@user-gp5wb6cz2v
@user-gp5wb6cz2v 3 months ago
Thank you for taking the time to reply! I assume you meant a few hundred MB for the vision model? That's interesting on the differences between training 1.5 vs 1.6 currently. Do you think there might be some more out-of-the-box approaches to fine-tuning 1.5 or would it still require more custom scripts like yours?
@TrelisResearch
@TrelisResearch 3 months ago
@@user-gp5wb6cz2v oops, yes, hundreds of MB. Actually I just tested 1.6 again yesterday and I think it should be ok with about 24 GB of VRAM. Regarding more out-of-the-box, I'm a bit puzzled why this hasn't happened, and it's been a month or so now, perhaps we'll just have to look towards the next model.
@nguyenhoangnam
@nguyenhoangnam 2 months ago
@@TrelisResearch Correct me if I'm wrong - from what you stated above, you mean your script can fine-tune 1.6 on a 24 GB 3090?
@TrelisResearch
@TrelisResearch 2 months ago
@@nguyenhoangnam In principle it should be possible, but in practice the scripts for 1.6 take quite a bit more. There are some notes on trelis.com/advanced-vision
@divyagarh
@divyagarh 1 month ago
Hi Ronan, once the model is trained, can we ask it to generate an image of a wooden rook or a black/white rook? Or is this model just classifying whether it is a rook or a king piece?
@TrelisResearch
@TrelisResearch 1 month ago
Nice question. The model is just classifying/describing. To go the other direction (image generation) you need a diffusion model, which basically starts from noise and iteratively refines it into an image.
@LukeDupin
@LukeDupin 3 months ago
Awesome
@user-io1jn5ob1p
@user-io1jn5ob1p 3 months ago
Amazing and very informative. Can you please also show us how to fine-tune LLaVA 1.5?
@TrelisResearch
@TrelisResearch 3 months ago
Same approach! I used the same script!
@khalilbezrati8638
@khalilbezrati8638 1 month ago
Thank you for this video. I have a question and I would be happy if you could answer it: do you think that multimodal AIs like LLaVA can be fine-tuned for fraud detection in identity documents (passports, ID cards, driver's licenses)?
@TrelisResearch
@TrelisResearch 1 month ago
Yes, this sounds like a good use case.
@xtu373
@xtu373 2 months ago
Roughly how many examples are needed to fine-tune LLaVA and get better results? 100 examples? What's the minimum number?
@TrelisResearch
@TrelisResearch 2 months ago
It depends on how broad the concepts are that you're aiming to build into the model. For a very narrow fine-tune, it's possible that just 25 images might be enough. You can get a rough sense from the video here and this application. Now, if you additionally wanted to train on other board games, you'd need quite a few more examples.
@AlexBerg1
@AlexBerg1 3 months ago
On a first watch-through, my impression is that fine-tuning LLaVA is a much longer script than fine-tuning Llama.
@TrelisResearch
@TrelisResearch 3 months ago
Yeah, it's much longer because you can't use out-of-the-box trainers with default data preparation (the preparation of prompts for a model with images and vision is different). Out-of-the-box support will probably come, but it will take some time.
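As a minimal sketch of what that custom data preparation looks like for an image-plus-chat model (the llava-hf processor name and the question/answer/image_path field names are illustrative assumptions, not the Trelis repo code):

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")  # assumed HF-format checkpoint

def collate(batch):
    # Each example pairs an image with a chat-formatted prompt, so a stock
    # text-only collator cannot be used.
    prompts = [f"USER: <image>\n{ex['question']} ASSISTANT: {ex['answer']}" for ex in batch]
    images = [Image.open(ex["image_path"]).convert("RGB") for ex in batch]
    return processor(text=prompts, images=images, return_tensors="pt", padding=True)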
@sillystuff6247
@sillystuff6247 3 months ago
Is there a way to upload images to an OpenAI model via the API?
@sherpya
@sherpya 3 months ago
Yes, you need to use the GPT-4 vision model.
@TrelisResearch
@TrelisResearch 3 months ago
See platform.openai.com/docs/api-reference/chat and click on the image input example on the right of the screen.
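For illustration, a small sketch of the pattern that docs page shows for image input (the model name and image URL here are placeholders; check the link above for the current options):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What pieces are on this chess board?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/board.png"}},
        ],
    }],
)
print(response.choices[0].message.content)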
@xtu373
@xtu373 2 months ago
Hi! Can I get the notebook repo for fine-tuning multi-modal LLaVA?
@TrelisResearch
@TrelisResearch 2 months ago
Check out Trelis.com/ADVANCED-vision
@ayushsinghal28
@ayushsinghal28 3 months ago
Can it work with multiple images in a single prompt?
@TrelisResearch
@TrelisResearch 3 months ago
It can!
@jonatan01i
@jonatan01i 1 month ago
Wouldn't it be easier to load the model however it comes and then loop through all the modules, setting them to bfloat16?
@TrelisResearch
@TrelisResearch 1 month ago
Yeah, now that you say it, I don't see why not. Sounds better.
@TrelisResearch
@TrelisResearch 1 month ago
UPDATE: Yeah, I had forgotten that the main reason not to do this is that you need more VRAM to first load everything in float32 (or whatever the default is). So you may OOM.
@jonatan01i
@jonatan01i 1 month ago
@@TrelisResearch Oh wow, I hadn't thought of that. Feels like a lot of hassle - hats off that you pushed through to make it happen. But on further thought: can you not change the number of GPUs aft... no, no, I'll do one better: either load it in fp16, or if that doesn't work, loop through on the CPU, send one set of parameters at a time to the GPU, convert it to bfloat16, then go to the next, and so on.
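A sketch of the two options discussed in this thread, assuming an HF-format llava-hf checkpoint (the names here are illustrative, not the repo code):

import torch
from transformers import LlavaForConditionalGeneration

# Option (a): load directly in bfloat16, so full-precision weights are never
# materialised on the GPU at all.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.bfloat16, device_map="cuda:0"
)

# Option (b): the "one set of parameters at a time" idea - keep the default-dtype
# weights on the CPU, then move and cast leaf modules one by one.
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
for module in model.modules():
    if not list(module.children()):   # leaf modules only
        module.to(device="cuda:0", dtype=torch.bfloat16)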
@TheYephers
@TheYephers 3 months ago
Will these fine-tuning projects run on Colab Pro (A100) as is?
@TrelisResearch
@TrelisResearch 2 months ago
LLaVA 1.5 will, but LLaVA 1.6 won't for now - the memory requirement to fine-tune it is 100 GB. It should be a lot lower, but there is an open issue on the GitHub repo about that high memory usage. So you need 2x A100s or 3x A6000s.
@DeviGoneMad
@DeviGoneMad 2 months ago
@@TrelisResearch But we can use 4-bit quantization to fine-tune LLaVA 1.6, and that will run on Google Colab, right?
@TrelisResearch
@TrelisResearch 2 months ago
@@DeviGoneMad in principle yes, but I haven't been able to get quantization working with the 1.6 models (as opposed to 1.5). :(
@semigoso7274
@semigoso7274 4 days ago
Did you run into this error when checking LoraConfig? ValueError: Target module Sequential( (0): Linear(in_features=1024, out_features=4096, bias=True) (1): GELU(approximate='none') (2): Linear(in_features=4096, out_features=4096, bias=True) ) is not supported. Currently, only the following modules are supported: `torch.nn.Linear`, `torch.nn.Embedding`, `torch.nn.Conv2d`, `transformers.pytorch_utils.Conv1D`. peft version: 0.11.2.dev0, running on an NVIDIA A10G. Everything ran correctly until that part. Great video!
@TrelisResearch
@TrelisResearch 4 days ago
Seems like you tried to set a module as trainable that is not a linear layer. Just look at your LoRA modules and try commenting them out one by one, or comment them all out and then include them one by one. Use print(model) to see the list of modules.
@semigoso7274
@semigoso7274 4 days ago
@@TrelisResearch The layer is the mm_projector (the adapter), which is composed of Sequential( (0): Linear(in_features=1024, out_features=4096, bias=True) (1): GELU(approximate='none') (2): Linear(in_features=4096, out_features=4096, bias=True) ). Did you train the adapter as a whole without any issues, or just the linear parts of the adapter?
@TrelisResearch
@TrelisResearch 3 days ago
@@semigoso7274 Ah yes, you can't do that because the GELU isn't a linear layer. You have to target "model.mm_projector.0" and "model.mm_projector.2".
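A hedged sketch of a LoraConfig along those lines, assuming model is the LLaVA model already loaded; the attention-projection names and rank values are typical defaults rather than the exact repo settings, so check print(model) for the module paths on your own checkpoint:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections in the language model
        "model.mm_projector.0",                  # first Linear of the adapter
        "model.mm_projector.2",                  # second Linear (index 1 is the GELU, which LoRA cannot wrap)
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()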
@pacograciadonat8885
@pacograciadonat8885 1 month ago
Hello, I'm having problems with the image.py file when I try to use a raw image URL. What can I do?
@pacograciadonat8885
@pacograciadonat8885 1 month ago
This is the error I have: cannot identify image file
@TrelisResearch
@TrelisResearch 1 month ago
Howdy! If you purchased repo access, it's best to post an issue there. If you're using a URL, then put the relevant portion of the image.py code into ChatGPT and ask it to adjust the code so that either an image OR a URL can be passed as the input.
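For illustration only (image.py's actual interface is in the paid repo), the usual PIL + requests pattern for accepting either a local path or a raw image URL, which avoids the "cannot identify image file" error:

import io
import requests
from PIL import Image

def load_image(source: str) -> Image.Image:
    # Accept either a raw URL or a local file path.
    if source.startswith(("http://", "https://")):
        resp = requests.get(source, timeout=30)
        resp.raise_for_status()  # fail loudly on 404s and similar
        return Image.open(io.BytesIO(resp.content)).convert("RGB")
    return Image.open(source).convert("RGB")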
@pacograciadonat8885
@pacograciadonat8885 1 month ago
@@TrelisResearch Thank you, really. And one more thing: what is the entire code for fine-tuning on the dataset?
@xiaojinyusaudiobookswebnov4951
@xiaojinyusaudiobookswebnov4951 3 months ago
Can you show how to fine-tune Google's Gemma models?
@TrelisResearch
@TrelisResearch 3 months ago
Same approach as in the embeddings vs fine-tuning videos. Btw, I'm unsure Gemma is that good compared to Mistral or OpenChat.
@tami9154
@tami9154 2 months ago
May I do all this on Windows?
@TrelisResearch
@TrelisResearch 2 months ago
You can do it on Windows if you have a GPU. If you don't have a separate GPU, then you won't have enough RAM.
@fuba44
@fuba44 3 months ago
If you reversed the axes, the queen would be on h5 - maybe it's not a standard chess board? I'm not a big chess guy.
@TrelisResearch
@TrelisResearch 3 months ago
Yeah, it's possible that's the mix-up.
@xtu373
@xtu373 2 months ago
Why did you post this video on YouTube when you are trying to sell the repo? Please change the video title.
@TrelisResearch
@TrelisResearch 2 months ago
Howdy! Hopefully you can learn quite a lot without buying the repo. I don't have ads on this channel and those who do buy repos help to support the channel. That's the business model.
@robxmccarthy
@robxmccarthy 1 month ago
I appreciate everything you share, @@TrelisResearch
@matbeedotcom
@matbeedotcom 3 months ago
Oh hell yeah
@Yo-rw7mq
@Yo-rw7mq 1 month ago
Hey!! Please, how can I estimate the GPU requirements for fine-tuning the following model using LoRA? Model name: llava-hf/llava-v1.6-mistral-7b-hf
@TrelisResearch
@TrelisResearch 1 month ago
I've just pinned a comment showing the memory requirements; you'll see it there.
@Yo-rw7mq
@Yo-rw7mq 1 month ago
Thank you so much.