LLaMA2 for Multilingual Fine Tuning?

15,598 views

Sam Witteveen

LLaMA2 for Multilingual Fine Tuning?
Colab: drp.li/Lgqmg
My Links:
Twitter - / sam_witteveen
Linkedin - / samwitteveen
Github:
github.com/samwit/langchain-t... (updated)
github.com/samwit/llm-tutorials
00:00 Intro
00:49 LLaMA 2 Paper
01:33 Code Time
05:05 LLaMA 2
05:29 Bloom
10:20 GLM2-6B
11:32 MT5
13:03 Open Sourced LLaMA Model RedPajama-INCITE 7B Base
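
As a rough illustration of the comparison covered in the chapters above: a minimal sketch assuming the `transformers` library, where the model IDs, the sample sentences, and the `trust_remote_code` flag are illustrative choices rather than the exact Colab code. A tokenizer that fits a language badly needs far more tokens per sentence, often close to one per character.

```python
# Sketch: count how many tokens each model's tokenizer needs for the same
# sentence in different languages (more tokens = worse fit for that language).
from transformers import AutoTokenizer

MODELS = {
    "LLaMA-2": "meta-llama/Llama-2-7b-hf",   # gated repo: needs license acceptance on HF
    "BLOOM": "bigscience/bloom-560m",        # shares the BLOOM tokenizer
    "ChatGLM2": "THUDM/chatglm2-6b",         # custom tokenizer, needs trust_remote_code
    "mT5": "google/mt5-small",
}

SENTENCES = {
    "English": "My name is Sam and I make videos about LLMs.",
    "Thai": "ผมชื่อแซม และผมทำวิดีโอเกี่ยวกับโมเดลภาษา",
    "Greek": "Το όνομά μου είναι Σαμ και φτιάχνω βίντεο για μοντέλα γλώσσας.",
}

for name, repo in MODELS.items():
    tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    for lang, text in SENTENCES.items():
        n = len(tok.encode(text, add_special_tokens=False))
        print(f"{name:8s} | {lang:7s} | {n:3d} tokens")
```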

Comments: 71
@toddnedd2138 11 months ago
Very interesting and informative. Thank you. Looking forward to your next videos on finetuning.
@GenAIWithNandakishor 8 months ago
Your explanations are simple but deep. Great !!!
@edouardalbert7788 11 months ago
That was quite insightful; we don't talk much about tokenizers, and there is definitely room for improvement. Thanks!
@FalahgsGate 11 months ago
Very interesting comparison of languages. Thank you for the clarification❣👏
@ringpolitiet 11 months ago
Very insightful, thanks. Great with some technical deep dives.
@nickki8ara 11 months ago
Great video Sam!!
@micbab-vg2mu 11 months ago
Thank you for the information. I plan to use Llama 2 for simple tasks in English, such as data retrieval, summarization, and chatting based on the provided context. For translations, logic tasks, and coding, I use the GPT-4 API (March version).
@MrDeanelwood 10 months ago
You have a great channel Sam. I really like how you're jumping into the topics that most people are ignoring and only opting for the sexy stuff. You're covering important things. Great insights, thank you.
@samwitteveenai 10 months ago
Thanks, much appreciated. I am trying to stay away from purely the latest sexy stuff and cover things in a bit more depth, with code etc.
@auddy7889 10 months ago
Thank you so much, this is very useful. I thought I would learn how to fine-tune Llama 2 in Thai, but now I have to reconsider.
@futureautomation9518 11 months ago
Thank you very much for the info on multilingual support.
@SunnyJocker 28 days ago
Thanks for sharing this video. It’s comprehensive🎉
@samwitteveenai 28 days ago
Glad it was helpful!
@caiyu538 6 months ago
Great lectures.
@user-kw3sp7lb5c 5 months ago
Your explanations are simple but deep. From today's video I learned much more about tokenizers. Great tutorial! PS: Can you make more videos about tokenizers and a deeper understanding of LLMs?
@ChatchaiPummala 10 months ago
I'm glad to know that you can speak Thai. I am your FC from Thailand.
@samwitteveenai 10 months ago
ขอบคุณมากครับ (Thank you very much) 😃
@aurkom 10 months ago
Would be nice to see a tutorial on training a tokenizer from scratch
@georgekokkinakis7288 11 months ago
Great review, this is what I needed. I want an open-source LLM to build a QnA retrieval chatbot over documents in the Greek language using LangChain, so this will help me a lot in finding a model on Hugging Face. Thanks again 😊. Looking forward to the fine-tuning tutorial.
@samwitteveenai 11 months ago
Hey George I think you might have been the person who asked about Greek before. Glad to hear this helped.
@georgekokkinakis7288 11 months ago
@samwitteveenai Yes, that's me 😅. Your presentations have helped me a lot in my project. It would be great if you could find the time to make a tutorial on how to use Petals with LangChain. I am asking because not everyone, including me, has access to high-RAM GPUs or can pay for high-RAM time in Colab to run those big LLMs like LLaMA etc.
@samwitteveenai 11 months ago
yes I have started looking into Petals :D
@HazemAzim 9 months ago
Very insightful, thanks. Arabic is also a problem with tokenizers.
@samwitteveenai 9 months ago
Yes, this is one I looked at after making the video, and it is also challenging.
@loicbaconnier9150 10 months ago
Hi, where did you put your notebooks on LLaMA-2? I can't find them on GitHub. Thanks
@juda-marto 10 months ago
A very informative video, Sam! What tokenization would you recommend for fine-tuning Llama 2 for the Indonesian language? In general, how do you make Llama 2 work with Bahasa?
@samwitteveenai 10 months ago
Unfortunately you can't change the tokenizer on a model once it has been trained. You will have to try with the current one. For Bahasa it won't be as bad as Thai or Greek etc.
@lnakarin 11 months ago
ขอบคุณแซม (Thank you, Sam)
@samwitteveenai 11 months ago
ยินดีมากครับ (You're very welcome)
@henkhbit5748 10 months ago
Thanks for explaining the impact of different tokenizers. I assume each LLM uses its own specific tokenizer, and you cannot use, for example, a T5 tokenizer with a LLaMA model?
@samwitteveenai 10 months ago
Yes, the models have to use the tokenizers they were trained with.
@user-bu9mf8jm8z 10 months ago
Thanks buddy, I am working on fine-tuning Dolly 2.0 with my own data, and I hope I don't run into problems because it will be in Spanish; this is a good starting point, thanks! If I am working on Q&A but I don't have a dataset, just my database with my own tables, what would be your hint? My goal would be to write questions about my data and get answers as graphs or something like that.
@IQmates 11 months ago
I wish there were tutorials on how to deploy the downloaded model on Azure. With the commercial license, many companies are considering it but they cannot use HuggingFace due to data security etc.
@samwitteveenai 11 months ago
Sorry I don't have much to do with MSFT currently.
@kevinbatdorf 9 months ago
What's a good model for English Thai translations? I live in Chiang Mai and would like to build something fun.
@hlumisa.mazomba 11 months ago
Thank you so much for this. I had something similar in mind. In my case I wanted to finetune it for IsiXhosa, my home language. Have you had a chance to play around with Facebook's MMS models yet?
@thabolezwemabandla2461 2 months ago
Hi, I have a similar task. Did you find any breakthrough with your language, IsiXhosa?
@Chob_PT 11 months ago
Any resources you'd recommend on how to actually fine-tune to make the model better at another language? I loved the video but I'm still confused about whether we should be looking at increasing the vocab size, or whether just feeding in a translated dataset in a different language would be enough. Again, thanks for this.
@samwitteveenai 11 months ago
So most of the rules of fine-tuning apply; one difference is that you will often add more pre-training on the target language before doing instruction fine-tuning in that language. You can get general-language data for most languages from datasets like OSCAR and Common Crawl, depending on the language. For a lot of languages people have also translated things like the Alpaca dataset.
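
A compressed sketch of that two-stage idea (only the extra pre-training stage is shown), assuming the `datasets` and `transformers` libraries; the OSCAR config name, model ID, and hyperparameters are placeholders rather than a recipe from the video:

```python
# Stage 1: continued pre-training of a causal LM on general target-language text.
# Stage 2 (not shown) would repeat the same loop on an instruction dataset,
# e.g. a translated Alpaca set formatted with a prompt template.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "meta-llama/Llama-2-7b-hf"        # gated repo; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# General-language corpus for the target language (Greek split of OSCAR here;
# the config name depends on the OSCAR version you use).
raw = load_dataset("oscar", "unshuffled_deduplicated_el", split="train[:1%]")

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=1024)

lm_data = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-el-cpt",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1, learning_rate=2e-5, bf16=True),
    train_dataset=lm_data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

In practice this is usually combined with LoRA/PEFT and quantisation so it fits on a single GPU.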
@BusraSebin 10 months ago
Hi Sam, thanks for the great video! Can we fine-tune Llama 2 for a translation task from Turkish to German? I did the tokenizer test for Turkish and it did not give a great result, while, as you know, German is okay. That's why I'm asking :)
@samwitteveenai 10 months ago
This may work but it is far from ideal, for 2 main reasons: 1. the tokenizer issues with Turkish (which you have checked), and 2. LLaMA-2 was not really built for doing translation. For translation you will probably be better off fine-tuning something like mT5 or another Seq2Seq model.
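
For reference, a rough sketch of what that seq2seq fine-tune could look like, assuming `transformers` and `datasets`; the two inline sentence pairs, the task prefix, and the hyperparameters are placeholders standing in for a real parallel corpus:

```python
# Sketch: fine-tuning mT5 as a Turkish -> German translator.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_id = "google/mt5-small"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Two toy pairs standing in for a real parallel tr-de corpus.
pairs = Dataset.from_dict({
    "tr": ["Merhaba, nasılsın?", "Bugün hava çok güzel."],
    "de": ["Hallo, wie geht es dir?", "Das Wetter ist heute sehr schön."],
})

def preprocess(batch):
    # mT5 has no built-in task prefixes; this one is just a convention.
    src = ["translate Turkish to German: " + s for s in batch["tr"]]
    enc = tok(src, max_length=128, truncation=True)
    enc["labels"] = tok(text_target=batch["de"], max_length=128,
                        truncation=True)["input_ids"]
    return enc

data = pairs.map(preprocess, batched=True, remove_columns=pairs.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="mt5-tr-de",
                                  per_device_train_batch_size=8,
                                  learning_rate=3e-4, num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```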
@rukaiyahasan2945 7 months ago
I am trying to fine-tune a model which works like ChatGPT for the Punjabi language, using mt5-base; however, I am not sure if I should go ahead with it since it does not even generate text, and when I try to use it I just get a response of 0. I have checked the tokenizers and they work fine with Punjabi. Can anyone please tell me how I might go about this? Thanks in advance!
@beginnerscode5684 10 months ago
Dear Sam, thank you for this video. As you showed, LLaMA is trained mainly on English and does support the Western European languages. My future goal is to train an LLM for an Indo-Aryan script. I tried Alpaca but the results were not good, for the same reason you mentioned. What would be the steps if we want to fine-tune LLaMA for another language?
@ardasevinc4 10 months ago
You would need to extend the vocabulary of the tokenizer, do multiple stages of pretraining, then fine-tune. This would require at least 8 A100 GPUs. Check out the Chinese LLaMA/Alpaca project; they did something similar.
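
A small sketch of the vocabulary-extension step described here, assuming `transformers`; the new tokens are illustrative, and real recipes such as Chinese-LLaMA merge in a SentencePiece model trained on a large target-language corpus rather than adding a handful of tokens by hand:

```python
# Sketch: extending LLaMA-2's tokenizer with target-language tokens and resizing
# the embedding matrix (and tied LM head) before further pre-training.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"        # gated repo; placeholder choice
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

new_tokens = ["नमस्ते", "धन्यवाद", "भाषा"]      # placeholder Devanagari tokens
num_added = tok.add_tokens(new_tokens)
print(f"added {num_added} tokens, new vocab size: {len(tok)}")

# Grow the input embeddings and output head to match the new vocab size.
model.resize_token_embeddings(len(tok))

# The new rows are randomly initialised, which is why the multiple stages of
# continued pre-training on target-language text are needed afterwards.
```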
@beginnerscode5684 10 months ago
@ardasevinc4 Yes, thank you for replying. I did check that paper recently! But there is another approach named Okapi from the University of Oregon; I will try that out first. To do it like Chinese LLaMA I really need GPUs, and unfortunately we don't have them.
@ardasevinc4 10 months ago
@beginnerscode5684 Okapi seems interesting. Thanks for mentioning it. It'll still be tough to get LLaMA-2 to speak other languages if the base model's training dataset includes very little of them...
@beginnerscode5684 10 months ago
@ardasevinc4 Yes, that is going to be a challenge; if your language is not based on a Latin-script corpus then it certainly is a challenge.
@gunasekhar8440 5 months ago
Could you help me with how to make my own tokenization model for an Indic language?
@DanielWeikert 11 months ago
Can you do a video elaborating on model sizes, loading techniques to reduce GPU memory, etc.? BR
@user-rh4tt1rw1h 10 months ago
What are the models supporting the Arabic language?
@georgekokkinakis7288 11 months ago
I was wondering about the following. As mentioned in the video, if someone uses a tokenizer that tokenizes each word down to character level, then that tokenizer is probably not ideal for the language of interest. After watching Sam's excellent tutorial I went to OpenAI's webpage and used their tokenizer. I noticed that when I give it a sentence in Greek I get character-level tokens. Does this mean that when I send a query to their model it will tokenize the query at the character level? Because if that's the case, the expenses will go up dramatically for someone who wants to use ChatGPT models for the Greek language. I would appreciate it if someone could confirm or disprove my point. 😊
@TaoWang1 11 months ago
It depends on which model you're using. Not all models are the same; some are better and some worse. If you tested the tokenizer of the model you're using and got character-level tokens for the Greek sentence, then yes, the cost of using the model is much higher than for English, and it doesn't only affect the cost; it might also hurt the model's understanding and expression of the Greek language as well.
@samwitteveenai 11 months ago
You are totally right that ChatGPT etc. cost much more for languages that aren't a good match for their tokenizer. I retweeted a tweet all about this a few months back; I think it was for Turkish. It is often an order of magnitude more expensive. The model can handle the character tokens etc. as it is so big, but it is much more expensive.
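
The gap is easy to measure with OpenAI's `tiktoken` library; a quick sketch where the sample sentences are just examples:

```python
# Sketch: token counts for comparable sentences with an OpenAI tokenizer.
# Languages the BPE handles poorly need many more tokens, so the same request
# costs proportionally more.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

samples = {
    "English": "My name is Sam.",
    "Greek": "Το όνομά μου είναι Σαμ.",
    "Thai": "ผมชื่อแซม",
}
for lang, text in samples.items():
    n = len(enc.encode(text))
    print(f"{lang:7s}: {n:2d} tokens for {len(text)} characters")
```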
@georgekokkinakis7288 11 months ago
@samwitteveenai If I hadn't watched your video about tokenizers I wouldn't have noticed it. Thanks once more. Now I know that OpenAI will be very expensive for my case. Unfortunately I haven't yet found any open-source LLM which is good for RetrievalQA in the Greek language ☹️. I think I will try Google translation. The problem is that I have mathematical terms and Google Translate doesn't deliver what I want. Let me give an example; someone might find it useful. Two angles are called complementary angles when they sum to 90 degrees, whereas when they sum to 180 degrees they are called supplementary angles. In Greek, complementary angles = συμπληρωματικές γωνίες, supplementary angles = παραπληρωματικές γωνίες. Google Translate sometimes translates συμπληρωματικές as supplementary and sometimes as complementary. If someone knows a model which works for Greek, they would save my day 😅. My task is to do closed-domain extractive QA for mathematical definitions and methodologies from texts in Greek. Can BERT-like models be used with LangChain? Sorry for the big post, and thank you once more; your presentations are excellent 👍
@loicbaconnier9150 11 months ago
Hi Sam, do you know how to use LLaMA-2 via an API from HuggingFace TGI with LangChain? I don't know how to write the prompts. Thanks
@samwitteveenai 10 months ago
So I was going to make a video about exactly this, but then they made the library no longer open source, so I'm a bit reluctant now. I might do it at some point.
@loicbaconnier9150 10 months ago
@samwitteveenai They only changed the license for firms which sell inference, not for using it within a firm. Isn't that right?
@samwitteveenai 10 months ago
As I understood it, if I was making a chatbot etc. then it would apply in that case. More than that, though, it's how they benefited from other people contributing to it and then changed it later. It just seems they could have handled it better overall.
@loicbaconnier9150 10 months ago
@samwitteveenai There is now a new fork made from the Apache 2.0 version.
@samwitteveenai 10 months ago
I saw that one of the main contributors said they would make an open source fork with their startup and also do some things like remove the need for docker etc. I certainly want to support that.
@user-sm1re8xm5p 10 months ago
For the Greek case, I suspect that a lot of Greek text on the internet is written in Latin characters, so the tokenizer might do better with "to ónoma mou eínai Sam" than with the Greek alphabet. Have to check...
@samwitteveenai 9 months ago
Interesting. Yes it would probably do a lot better with text like that.
@michallecbych7556 11 months ago
Would you show how to train a custom tokenizer, so we can support a new language?
@yuchi65535 11 months ago
The tokenizer is trained during pre-training; you would need to retrain the whole model to use a custom tokenizer.
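
For anyone curious what "training a tokenizer" looks like in practice, a minimal sketch using the fast-tokenizer `train_new_from_iterator` API; the base tokenizer, the tiny corpus, and the vocab size are placeholders, and (as the reply above says) an existing model can't simply swap to the result without retraining:

```python
# Sketch: training a new tokenizer with the same algorithm as an existing one,
# on your own target-language corpus.
from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("gpt2")   # any fast tokenizer works

corpus = [
    "ข้อความภาษาไทยตัวอย่างสำหรับฝึก tokenizer",
    "อีกหนึ่งประโยคตัวอย่าง",
]  # in practice: an iterator over millions of target-language lines

new_tok = old_tok.train_new_from_iterator(corpus, vocab_size=32000)
print(new_tok.tokenize("ข้อความภาษาไทย"))
new_tok.save_pretrained("thai-bpe-tokenizer")

# NOTE: a model already trained with the old tokenizer cannot use this new one
# directly; the model itself would need (re)training with it.
```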
@user-wp8yx 9 months ago
I appear to have this very issue. Too bad the solution is to dump llama2.
@samwitteveenai 8 months ago
What's the language you are after? There could be a multilingual LLaMA around the corner.
@user-wp8yx 8 months ago
@samwitteveenai Sanskrit. BERT-based models apparently work, but I use oobabooga and can't get them to work with ooba. I had some success with Vicuna 1.1, in spite of the tokenizer breaking everything down to single characters. Not so much with Vicuna 1.5. No luck with BLOOM or Orca or LLaMA-1. I haven't tried LLaMA-2 because Vicuna outperforms its pretraining for Sanskrit. I'm surprised, with so many South Asians in computing, that more models don't at least speak Hindi.
@user-wp8yx 5 months ago
Update on the Sanskrit tokens project: I managed to add tokens to Mistral 7B, and I had to "resize the embeddings" and "the head". The model then does inference, but fine-tuning causes a CUDA error. I now wonder whether the embeddings are correct, or what the issue is.
@jorgeromero4680 9 months ago
I speak Greek