LLaMA2 for Multilingual Fine Tuning?

16,990 views

Sam Witteveen

1 day ago

Comments: 74
@edouardalbert7788 · 1 year ago
That was quite insightful; we don't talk much about tokenizers, and there is definitely room for improvement. Thanks!
@GenAIWithNandakishor · 1 year ago
Your explanations are simple but deep. Great!
@toddnedd2138 · 1 year ago
Very interesting and informative. Thank you. Looking forward to your next videos on fine-tuning.
@MrDeanelwood · 1 year ago
You have a great channel, Sam. I really like how you're jumping into the topics that most people are ignoring in favour of only the sexy stuff. You're covering important things. Great insights, thank you.
@samwitteveenai · 1 year ago
Thanks, much appreciated. I am trying to stay away from purely the latest sexy stuff and cover more things in a bit more depth, with code etc.
@FalahgsGate · 1 year ago
Very interesting comparison of languages. Thank you for the clarification ❣👏
@futureautomation9518 · 1 year ago
Thank you very much for the info on multilingual support.
@auddy7889 · 1 year ago
Thank you so much, this is very useful. I thought I would learn how to use Llama 2 to fine-tune in Thai, but now I have to reconsider.
@SunnyJocker · 6 months ago
Thanks for sharing this video. It's comprehensive 🎉
@samwitteveenai · 6 months ago
Glad it was helpful!
@ringpolitiet · 1 year ago
Very insightful, thanks. Great to get some technical deep dives.
@ChatchaiPummala · 1 year ago
I'm glad to know that you can speak Thai. I am your fan from Thailand.
@samwitteveenai · 1 year ago
ขอบคุณมากครับ 😃 ("Thank you very much" in Thai)
@micbab-vg2mu · 1 year ago
Thank you for the information. I plan to use Llama 2 for simple tasks in English, such as data retrieval, summarization, and chatting based on provided context. For translation, logic tasks, and coding, I use the GPT-4 API (March version).
@sagartamang0000 · 5 months ago
Very helpful, thank you so much.
@nickki8ara · 1 year ago
Great video Sam!!
@IQmates · 1 year ago
I wish there were tutorials on how to deploy the downloaded model on Azure. With the commercial license, many companies are considering it, but they cannot use Hugging Face due to data security concerns etc.
@samwitteveenai · 1 year ago
Sorry, I don't have much to do with MSFT currently.
@rukaiyahasan2945 · 1 year ago
I am trying to fine-tune a model that works like ChatGPT for the Punjabi language, using mt5-base. However, I am not sure if I should go ahead with it, since it does not even generate text; when I try to use it, I just get a response of 0. I have checked the tokenizers and they work fine with Punjabi. Can anyone please tell me how I might go about this? Thanks in advance!
@georgekokkinakis7288 · 1 year ago
Great review, this is what I needed. I want an open-source LLM in order to build a chatbot for Q&A retrieval from documents in the Greek language using LangChain, so this will help me a lot in finding a model on Hugging Face. Thanks again 😊. Looking forward to the fine-tuning tutorial.
@samwitteveenai · 1 year ago
Hey George, I think you might have been the person who asked about Greek before. Glad to hear this helped.
@georgekokkinakis7288 · 1 year ago
@samwitteveenai Yes, that's me 😅. Your presentations have helped me a lot in my project. It would be great if you could find the time to make a tutorial on how we could use Petals with LangChain. I am asking this because not everyone, including me, has access to high-RAM GPUs or can pay for high-RAM time in Colab in order to run those big LLMs like LLaMA etc.
@samwitteveenai · 1 year ago
Yes, I have started looking into Petals :D
@УукнеУкн · 11 months ago
Your explanations are simple but deep. From today's video I know much more about tokenizers. Great tutorial!!! P.S. Can you make more videos about tokenizers and a deeper understanding of LLMs?
@caiyu538 · 1 year ago
Great lectures.
@aurkom · 1 year ago
Would be nice to see a tutorial on training a tokenizer from scratch.
@Chob_PT · 1 year ago
Any resources you'd recommend on how to actually fine-tune to make the model better at another language? I loved the video but am still confused about whether we should be looking at increasing the vocab size, or whether just feeding in a translated dataset in a different language would be enough. Again, thanks for this.
@samwitteveenai · 1 year ago
Most of the rules of fine-tuning apply; one difference is that you will often add more pre-training on the target language before doing instruction fine-tuning in that language etc. For most languages you can get more general-language data from datasets like OSCAR and Common Crawl, depending on the language. For a lot of languages people have also translated things like the Alpaca dataset.
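A minimal sketch of the data-gathering step described above, assuming the Hugging Face datasets library; the OSCAR config name (Thai here) is illustrative and other snapshots exist:

```python
# Hedged sketch: pull general-domain target-language text for continued
# pre-training, before doing instruction fine-tuning in that language.
from datasets import load_dataset

oscar_th = load_dataset(
    "oscar", "unshuffled_deduplicated_th", split="train", streaming=True
)

# Peek at a few documents that would be packed into pre-training batches.
for i, example in enumerate(oscar_th):
    print(example["text"][:120])
    if i >= 2:
        break
```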
@HazemAzim · 1 year ago
Very insightful, thanks. Arabic is also a problem with tokenizers.
@samwitteveenai · 1 year ago
Yes, this is one I have looked at recently, after making the video, and it is also challenging.
@henkhbit5748 · 1 year ago
Thanks for explaining the impact of different tokenizers. I assume each LLM uses its own specific tokenizer and you cannot use, for example, a T5 tokenizer with a LLaMA model?
@samwitteveenai · 1 year ago
Yes, the models have to use the tokenizers they were trained with.
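As a quick illustration of why tokenizers are not interchangeable, a hedged sketch (not from the video's notebook) that tokenizes the same Thai sentence with the LLaMA-2 and mT5 tokenizers; the usual Hub model IDs are assumed, and LLaMA-2 access is gated:

```python
# Hedged sketch: the same sentence fragments very differently under the
# tokenizers of different models, because each vocabulary was trained separately.
from transformers import AutoTokenizer

llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
mt5_tok = AutoTokenizer.from_pretrained("google/mt5-base")

text = "สวัสดีครับ วันนี้อากาศดีมาก"  # Thai: "Hello, the weather is very nice today"

print("LLaMA-2 tokens:", len(llama_tok.tokenize(text)))  # often many byte-fallback pieces
print("mT5 tokens:    ", len(mt5_tok.tokenize(text)))    # usually far fewer; mT5 is multilingual
```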
@gunasekhar8440 · 11 months ago
Could you help me with how to build my own tokenization model for an Indic language?
@kevinbatdorf · 1 year ago
What's a good model for English-Thai translation? I live in Chiang Mai and would like to build something fun.
@loicbaconnier9150 · 1 year ago
Hi, where did you put your notebooks on Llama 2, please? I can't find them on GitHub. Thanks
@beginnerscode5684 · 1 year ago
Dear Sam, thank you for this video. As you showed, LLaMA is trained mainly on English and supports the Western European languages. My future goal is to train an LLM for an Indo-Aryan script. I have tried Alpaca, but the results were poor; the reason was the same as you mentioned. What would be the steps if we want to fine-tune LLaMA for another language?
@ardasevinc4 · 1 year ago
You would need to extend the vocabulary of the tokenizer and do multiple stages of pre-training, then fine-tune. This would require at least 8 A100 GPUs. Check out the Chinese LLaMA/Alpaca project; they did something similar.
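A rough sketch of the vocabulary-extension step mentioned above (in the spirit of Chinese LLaMA/Alpaca, not a copy of their recipe); the new tokens are illustrative placeholders, and the resized embeddings still need the pre-training stages to become useful:

```python
# Hedged sketch: extend the tokenizer vocabulary and resize the model embeddings.
# In practice the new tokens would come from a tokenizer trained on a large
# target-language corpus, not a hand-written list.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"  # gated on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

new_tokens = ["ਸਤਿ", "ਸ੍ਰੀ", "ਅਕਾਲ"]  # hypothetical Punjabi pieces
num_added = tokenizer.add_tokens(new_tokens)

# Gives the new token IDs freshly initialised embedding (and output) rows.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab is now {len(tokenizer)}")
```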
@beginnerscode5684 · 1 year ago
@ardasevinc4 Yes, thank you for replying. I did check that paper recently! But there is another approach, named Okapi, from the University of Oregon; I will try that out first. To do it like the Chinese LLaMA I would really need GPUs, and unfortunately we don't have them.
@ardasevinc4 · 1 year ago
@beginnerscode5684 Okapi seems interesting, thanks for mentioning it. It'll still be tough to get Llama 2 to speak other languages if the base model's training dataset includes very little of them...
@beginnerscode5684 · 1 year ago
@ardasevinc4 Yes, that is going to be a challenge; if your language is not based on a Latin corpus then it is certainly a challenge.
@juda-marto · 1 year ago
A very informative video, Sam! What tokenization would you recommend to fine-tune Llama 2 for the Indonesian language? In general, how can we make Llama 2 work with Bahasa?
@samwitteveenai · 1 year ago
Unfortunately you can't change the tokenizer on a model once it has been trained. You will have to try with the current one. For Bahasa it won't be as bad as Thai or Greek etc.
@aditiasetiawan563 · 4 months ago
How do I train for another language? Can you help?
@RobertoAntonioMenjívarHernánde · 1 year ago
Thanks buddy, I am working on fine-tuning Dolly 2.0 with my own data, and I hope I don't run into problems because it will be in Spanish; this is a good starting point, thanks! I am working on Q&A, but I don't have a dataset, just my database with my own tables. What would be your hint? My goal would be to write questions about my data and get an answer such as a graph, or something like that.
@pranilpatil4109 · 3 months ago
Hi, now that Llama 3.1 is released, can you tell me roughly how many new tokens I should create from the same tokenizer for another language?
@samwitteveenai · 3 months ago
Yeah, you can just load their tokenizer and check it. Llama 3 is certainly better for many languages.
@pranilpatil4109 · 3 months ago
@samwitteveenai I am doing that and adding those tokens, but I am not sure about the minimum number of tokens I should create per language. I might want to add other languages later.
@BusraSebin · 1 year ago
Hi Sam, thanks for the great video! Can we fine-tune Llama 2 for a translation task from Turkish to German? I have done the tokenizer test for Turkish and it did not give a great result, while as you know German is okay. That's why I'm asking :)
@samwitteveenai · 1 year ago
This may work, but it is far from ideal for two main reasons: 1. the tokenizer issues with Turkish (which you have checked), but also 2. LLaMA-2 was not really built for doing translation. For translation you will probably be better off fine-tuning something like mT5 or another Seq2Seq model.
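For reference, a minimal sketch of what the suggested Seq2Seq route could look like: a single forward/backward pass of mT5 on one Turkish-German pair. The sentence pair and the absence of a real training loop are illustrative assumptions, not a recommended recipe:

```python
# Hedged sketch: one training step of mT5 on a Turkish -> German pair.
# A real setup would use a parallel corpus, a DataLoader, and Seq2SeqTrainer
# or an optimizer loop; this only shows how the loss is computed.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")

src = "Bugün hava çok güzel."             # Turkish source (illustrative)
tgt = "Das Wetter ist heute sehr schön."  # German target (illustrative)

inputs = tokenizer(src, return_tensors="pt")
labels = tokenizer(text_target=tgt, return_tensors="pt").input_ids

out = model(**inputs, labels=labels)  # cross-entropy loss over the German tokens
out.loss.backward()                   # would be followed by optimizer.step()
print("loss:", out.loss.item())
```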
@DanielWeikert · 1 year ago
Can you do a video elaborating on model sizes, loading techniques to reduce GPU memory, etc.? Best regards.
@loicbaconnier9150 · 1 year ago
Hi Sam, do you know how to use Llama 2 via an API from Hugging Face TGI, with LangChain? I don't know how to write the prompts. Thanks
@samwitteveenai · 1 year ago
I was going to make a video about exactly this, but then they made the library no longer open source, so I'm a bit reluctant now. Might do it at some point.
@loicbaconnier9150 · 1 year ago
@samwitteveenai They only changed the license for firms that sell inference, not for using it inside a company. Isn't that right?
@samwitteveenai · 1 year ago
As I understood it, if I was making a chatbot etc. then it would apply in that case. More than that, though, it's how they benefited from other people contributing to it and then changed it later. It just seems they could have handled it better overall.
@loicbaconnier9150 · 1 year ago
@samwitteveenai There is now a new fork made from the Apache 2.0 version.
@samwitteveenai · 1 year ago
I saw that one of the main contributors said they would make an open-source fork with their startup and also do some things like remove the need for Docker etc. I certainly want to support that.
@georgekokkinakis7288 · 1 year ago
I was wondering about the following. As mentioned in the video, if a tokenizer tokenizes each word down to the character level, then that tokenizer is probably not ideal for the language of interest. After watching Sam's excellent tutorial I went to OpenAI's webpage and used their tokenizer. I noticed that when I give it a sentence in Greek, I get character-level tokens. Does this mean that when I send a query to their model, it will tokenize the query at the character level? Because if that's the case, the expenses will go up dramatically for someone who wants to use ChatGPT models for the Greek language. I would appreciate it if someone could confirm or disprove my point. 😊
@TaoWang1 · 1 year ago
It depends on which model you're using. Not all models are the same; some are better and some worse. If you tested the tokenizer of the model you're using and got character-level tokens for the Greek sentence, then yes, the cost of using the model is much higher than for English. And it doesn't only affect the cost; it might also hurt the model's understanding and expression of the Greek language as well.
@samwitteveenai · 1 year ago
You are totally right that ChatGPT etc. cost much more for languages that aren't a good match for their tokenizer. I retweeted a tweet all about this a few months back; I think it was for Turkish. It is often an order of magnitude more expensive. The model can handle the character-level tokens as it is so big, but it is much more expensive.
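A quick way to see this cost effect for yourself, as a hedged sketch (not from the video): count tokens for an English sentence and a Greek equivalent with OpenAI's tiktoken; since API pricing is per token, the ratio roughly tracks the cost difference:

```python
# Hedged sketch: token counts for similar content in English vs Greek.
# "cl100k_base" is the encoding used by gpt-3.5-turbo / gpt-4 at the time of writing.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Two angles are complementary when they sum to 90 degrees."
greek = "Δύο γωνίες είναι συμπληρωματικές όταν το άθροισμά τους είναι 90 μοίρες."

print("English tokens:", len(enc.encode(english)))
print("Greek tokens:  ", len(enc.encode(greek)))  # typically several times higher
```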
@georgekokkinakis7288 · 1 year ago
@samwitteveenai If I hadn't watched your video about tokenizers I wouldn't have noticed it. Thanks once more. Now I know that OpenAI will be very expensive for my case. Unfortunately I haven't yet found any open-source LLM that is good for RetrievalQA in the Greek language ☹️. I think I will try Google Translate. The problem is that I have mathematical terms and Google Translate doesn't deliver what I want. Let me give an example; someone might find it useful. Two angles are called complementary when they sum to 90 degrees, whereas when they sum to 180 degrees they are called supplementary. In Greek, complementary angles = συμπληρωματικές γωνίες and supplementary angles = παραπληρωματικές γωνίες. Google Translate sometimes translates συμπληρωματικές as supplementary and sometimes as complementary. If someone knows a model that works for Greek, they would save my day 😅. My task is closed-domain extractive QA for mathematical definitions and methodologies from texts in Greek. Can BERT-like models be used with LangChain? Sorry for the long post, and thank you once more; your presentations are excellent 👍
@devedtara · 1 year ago
What models support the Arabic language?
@hlumisa.mazomba · 1 year ago
Thank you so much for this. I had something similar in mind; in my case I wanted to fine-tune it for isiXhosa, my home language. Have you had a chance to play around with Facebook's MMS models yet?
@thabolezwemabandla2461 · 8 months ago
Hi, I have a similar task. Did you find any breakthrough with your language, isiXhosa?
@lnakarin · 1 year ago
ขอบคุณแซม ("Thank you, Sam" in Thai)
@samwitteveenai · 1 year ago
ยินดีมากครับ ("You're very welcome" in Thai)
@michallecbych7556 · 1 year ago
Would you show how to train a custom tokenizer, so we can support a new language?
@yuchi65535 · 1 year ago
The tokenizer is trained during pre-training; you would need to retrain the whole model to use a custom tokenizer.
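For the "train a custom tokenizer" part, a small sketch using train_new_from_iterator, assuming the base model ships a fast (Rust-backed) tokenizer that supports it; the corpus lines and output path are placeholders, and, as the reply above notes, a brand-new tokenizer only helps if the model is then pre-trained or heavily re-trained with it:

```python
# Hedged sketch: learn a new vocabulary using the same algorithm and settings
# as an existing fast tokenizer. Use a large target-language corpus in practice.
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

corpus = [
    "ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ, ਤੁਸੀਂ ਕਿਵੇਂ ਹੋ?",  # illustrative Punjabi text
    "ਅੱਜ ਮੌਸਮ ਬਹੁਤ ਵਧੀਆ ਹੈ।",
]

new_tok = base_tok.train_new_from_iterator(iter(corpus), vocab_size=32000)
new_tok.save_pretrained("my-new-language-tokenizer")  # hypothetical output path
```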
@user-wp8yx · 1 year ago
I appear to have this very issue. Too bad the solution is to dump Llama 2.
@samwitteveenai · 1 year ago
What's the language you are after? There could be a multilingual LLaMA around the corner.
@user-wp8yx · 1 year ago
@samwitteveenai Sanskrit. BERT-based models apparently work, but I use Oobabooga and can't get them to work with Ooba. I had some success with Vicuna 1.1, in spite of the tokenizer breaking everything down to single characters; not so much with Vicuna 1.5. No luck with BLOOM, Orca, or LLaMA 1. I haven't tried Llama 2 because Vicuna outperforms it in pre-training for Sanskrit. I'm surprised, with so many South Asians in computing, that more models don't at least speak Hindi.
@user-wp8yx · 11 months ago
Update on the Sanskrit tokens project: I managed to add tokens to Mistral 7B, and I had to "resize the embeddings" and "the head". The model now does inference, but fine-tuning causes a CUDA error. I wonder whether the embeddings are correct, or what the issue is.
@jorgeromero4680 · 1 year ago
I speak Greek.