How to Create Custom Datasets To Train Llama-2

  92,743 views

Prompt Engineering

In this video, I will show you how to create a dataset for fine-tuning Llama-2 using the Code Interpreter within GPT-4. We will build a dataset that generates a prompt for a given concept, and we will structure the dataset in the proper format to fine-tune a Llama-2 7B model using the Hugging Face autotrain-advanced package.
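As a concrete illustration of the structure described above, here is a minimal sketch of packing everything into the single `text` column that autotrain-advanced consumes by default. The concept/prompt pairs are made up, standing in for the GPT-4-generated data:

```python
import csv

# Hypothetical (concept, prompt) pairs standing in for the GPT-4-generated data.
rows = [
    ("sunset over mountains",
     "A breathtaking sunset over snow-capped mountains, golden light, 4k"),
    ("cyberpunk city",
     "A neon-lit cyberpunk city at night, rain-slicked streets, cinematic"),
]

# Each training example is packed into one string using the
# ###Human / ###Assistant convention shown in the video.
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])
    for concept, prompt in rows:
        writer.writerow(
            [f"###Human:\nGenerate a prompt for: {concept}\n###Assistant:\n{prompt}"]
        )
```

The csv writer quotes the embedded newlines automatically, so the file stays a valid single-column CSV.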
Happy learning :)
#llama2 #finetune #llm
▬▬▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬
☕ Buy me a Coffee: ko-fi.com/promptengineering
|🔴 Support my work on Patreon: Patreon.com/PromptEngineering
🦾 Discord: / discord
▶️️ Subscribe: www.youtube.com/@engineerprom...
📧 Business Contact: engineerprompt@gmail.com
💼Consulting: calendly.com/engineerprompt/c...
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
LINKS:
One-liner fine-tuning of Llama2: • LLAMA-2 🦙: EASIEST WAY ...
ChatGPT as Midjourney Prompt Generator: • ChatGPT & MidJourney: ...
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
Timestamps:
Intro: [00:00]
Testing Vanilla Llama2: [01:20]
Description of Dataset: [02:14]
Code Interpreter: [03:24]
Structure of the Dataset: [04:56]
Using Base Model: [06:18]
Fine-tuning Llama2: [07:25]
Logging during Training: [10:36]
Inference of the Fine-tuned Model: [12:44]
Output Examples: [14:36]
Things to Consider: [15:40]
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
All Interesting Videos:
Everything LangChain: • LangChain
Everything LLM: • Large Language Models
Everything Midjourney: • MidJourney Tutorials
AI Image Generation: • AI Image Generation Tu...

Comments: 108
@chuckwashington6663 · 10 months ago
Thanks, this gives me exactly what I needed to understand how to create a dataset for fine-tuning. Most of the other videos skip over the details of the formatting and other parameters that go into creating your own dataset. Thanks again!
@engineerprompt · 10 months ago
Thank you for your support. I'm glad it was helpful 😊
@pareak · 5 months ago
Thank you so much! This gives me a really good basis for starting to fine-tune my own model, because in the end the model will only be as good as the training set.
@SafetyLabsInc_ca · 8 months ago
Datasets are key for fine-tuning. This is a great video!
@engineerprompt · 8 months ago
Yes! Thank you!
@umeshtiwari9249 · 10 months ago
Thanks, you explained the concept very nicely. It boosts knowledge in an area that people are usually afraid to grasp, but the way you explained it makes it look very easy. Today I gained the ability to fine-tune the model myself. Thanks a lot, sir. Looking forward to more advanced topics from you.
@engineerprompt · 10 months ago
Thanks and welcome!
@kevon217 · 5 months ago
Thanks for covering this topic!
@engineerprompt · 5 months ago
My pleasure!
@LeonvanBokhorst · 10 months ago
Wow. Thanks again, sir 🙏
@drbinxy9433 · 8 months ago
You are a legend, my man.
@engineerprompt · 8 months ago
🙏🙏🙏
@tarun4705 · 10 months ago
Very informative
@derejehinsermu6928 · 10 months ago
Thank you, man. That is exactly what I am looking for.
@engineerprompt · 10 months ago
Glad I could help
@TheCloudShepherd · 7 months ago
Daaaamn bro, that's brilliant
@engineerprompt · 7 months ago
Thank you. More to come on fine-tuning :)
@rahulrajpvr7d · 10 months ago
Thank you so much, brother ❤❤
@HarishRaoS · 4 months ago
Thanks for this video
@abhijitbarman · 10 months ago
@Prompt Engineering Wow, exactly what I was looking for. I have another request: can you please make a video on Prompt-Tuning/P-Tuning, which is also a PEFT technique?
@vobbilisettyjayadeep4346 · 10 months ago
You are a saviour
@engineerprompt · 10 months ago
Thank you 😊
@oliversilverstein1221 · 10 months ago
FYI, you're the man. I don't know why it was so hard to find a good training pipeline; I literally went through all the libraries and no one mentioned autotrain-advanced, lol.
@engineerprompt · 10 months ago
Thank you!
@oxytic · 10 months ago
Great, bro 👍
@samcavalera9489 · 10 months ago
You're an AI champion. Thanks for the fine-tuning lectures 🙏🙏🙏
@engineerprompt · 10 months ago
Thank you for your kind words!
@samcavalera9489 · 10 months ago
@engineerprompt Welcome, brother!
@DemoGPT · 10 months ago
Kudos on the excellent video! Your hard work is acknowledged. Could we expect a video about DemoGPT from you?
@haouarino · 10 months ago
Thank you very much for the video. In the case of plain text, how could the dataset be formatted?
@stickmanland · 9 months ago
Thanks for the informative video. I am wondering: is there a way to do this with local LLMs?
@techmontc8360 · 10 months ago
Hey sir, thank you for the great tutorial. I have a question: it seems that in this training you didn't define the "--model_max_length" parameter. Does it make any difference whether you define this parameter or not?
@vbywrde · 5 months ago
Very coherent and well explained. Thank you kindly. I'm also curious whether you have any advice about creating a dataset that would allow me to fine-tune my model on my database schema. What I'd like to do is run my model locally, ask it to interact with my database, and have it do so in a smooth and natural manner. I'm curious how one would structure a database schema as a dataset for fine-tuning. Any recommendations or advice would be greatly appreciated. Thanks again! Great videos!
@LeKhang98 · 2 months ago
Thank you very much. Is 300 rows a good number for training? I know it depends on many factors, but I don't know how to identify whether my dataset is bad or just too small.
@am0x01 · 4 months ago
Thanks for this great service to the community. In my experiment, the config.json is not created. Is that normal?
@chiachinghsieh2150 · 10 months ago
Thanks SO MUCH for sharing this! Really helpful. I am also trying to train Llama 2 on my own data, but I am facing a problem deploying the model. I trained the model on AWS SageMaker and stored it in an S3 bucket. When I try to deploy the model and feed it a prompt, I keep getting errors. My input follows the rule, like ###Human....###Assistant, but I still get errors. I wonder if I am using the wrong tokenizer, but I couldn't use AutoTokenizer.from_pretrained() in SageMaker. I wonder if you have some advice!
@prestonmccauley43 · 10 months ago
I did something similar with my test dataset to understand this a bit better. I created a Python script to merge all the datasets together. I still seem to be struggling to grasp the core training approach using SFT and which models work with what. It's like the last puzzle piece missing.
@lrkx_ · 10 months ago
If you don't mind sharing, what's the performance of a Mac like when fine-tuning? I'm quite keen to see how long it takes to fine-tune a 7B vs. a 13B parameter model on a consumer machine with a small/medium-sized dataset. Thanks for the tutorial, very helpful!
@vedchaudhary1597 · 9 months ago
7B with 4-bit quantization takes about 12.9 GB of GPU RAM; I don't think a Mac will be able to run it locally.
@valthrudnir · 10 months ago
Hello, thank you for sharing. Is this method applicable to GGML/GPTQ models from, say, TheBloke's repo, for example 'Firefly Llama2 13B v1.2 - GPTQ', or would the training parameters need to be adjusted?
@engineerprompt · 10 months ago
I haven't tried this with quantized models, so I am not sure how that will behave. One thing to keep in mind is that you want to use the "base" model, not the chat version, for best results. I will look at it and see if it can be done.
@MohamedElGhazi-ek6vp · 10 months ago
It's very helpful, thanks. Is it the same process to create data from multiple documents for a question-answering model?
@engineerprompt · 10 months ago
Yes, this will work
@ishaanshettigar1554 · 9 months ago
How does this differ if I'm looking to fine-tune Llama2 7B Code Instruct?
@Phoenix-fr9ic · 8 months ago
Can I fine-tune Llama 2 for PDF-to-question-answer generation?
@AGAsnow · 10 months ago
How could I limit it? For example, if I train it with several relevant paragraphs about The Little Prince, how do I limit it so that it only answers questions that are within the context of that novel?
@xiangyao9192 · 9 months ago
I have a question. Why don't we use the conversation format given by Llama2, which contains the [INST] special tokens, something like that? Thanks
@engineerprompt · 9 months ago
You will need to use that if you are using the instruct/chat version. Since I was fine-tuning the base version, you can define your own format. Hope this helps
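For readers who do want to fine-tune the chat/instruct variant, the Llama-2 chat layout with its [INST] / <<SYS>> special tokens can be sketched as below. The helper name and example strings are illustrative, not part of any library:

```python
def llama2_chat_format(system: str, user: str, answer: str) -> str:
    """Pack one training example in the Llama-2 chat template.

    The chat/instruct checkpoints were trained with the [INST] and <<SYS>>
    special tokens, so fine-tuning data for them should follow this layout
    rather than a custom ###Human/###Assistant scheme.
    """
    return (
        f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
        f"{user} [/INST] {answer} </s>"
    )

example = llama2_chat_format(
    "You are a prompt generator.",          # system message
    "Generate a prompt for: sunset",        # user turn
    "A breathtaking sunset over the sea.",  # assistant answer
)
```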
@user-nj7ry9dl3y · 10 months ago
For fine-tuning large language models (llama-2-13b-chat), what should the format (.txt/.json/.csv) and structure of the training dataset be — for example, an Excel or Docs file, prompt and response, or instruction and output? And how should a tabular dataset be prepared or organized for training?
@mohammedmujtabaahmed490 · 2 months ago
Hey, did you find the answer to your question? If yes, please tell me what format the dataset should be in for fine-tuning.
@marcoabk · 2 months ago
Is there a way to do it with the original Llama2 13B that is already on my hard drive?
@vitocorleon6753 · 9 months ago
I need help, please. I just want to be pointed in the right direction, since I'm new to this and couldn't really find a proper guide summarizing the steps for what I want to accomplish. I want to integrate a Llama 2 70B chatbot into my website. I have no idea where to start. I looked into setting up the environment on one of my cloud servers (it has to be private). Now I'm looking into training/fine-tuning the chat model using data from our DBs. (It's not clear to me here, but I assume it involves two steps: first, I have to have the data in CSV format, since that's easier for me; second, I will need to format it in the Alpaca or OpenAssistant format.) After that, the result should be a deployment-ready model? Just bullet points — I'd highly appreciate that.
@LeoNux-um7tg · 5 months ago
Can I use my own files as datasets? I'm planning to train a model that can remind me of Linux commands and their options, so I don't have to keep reading manuals every time I use commands I don't use regularly.
@Shahawir · 10 months ago
I wonder if it is possible to train LLaMA on data where the inputs are numbers and categorical variables (strings) of fixed length, to predict a time series of fixed length. Does anyone know if this is possible? And how would I fine-tune the model if I have it locally?
@user-pl9gm5qm1s · 8 months ago
How can I build my labels from the input and output? I found that Llama 2 pieces the input and output together; can my labels match the input_ids?
@nqaiser · 8 months ago
What hardware specifications would be needed to fine-tune a 70B model? Once the fine-tuning is complete, can you run the model using oobabooga?
@JJ-yw3ug · 10 months ago
I would like to ask: is an RTX 4090 sufficient to fine-tune the 13B model, or can it only fine-tune the 7B model? I've noticed that the 13B model with default settings doesn't pose a problem for the RTX 4090 in terms of parameter handling, but I'm uncertain whether a single RTX 4090 is enough if fine-tuning is required.
@engineerprompt · 10 months ago
I don't think you can fine-tune 13B with 24 GB of VRAM. Your best bet will be 7B.
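The arithmetic behind this rule of thumb can be sketched roughly. The figures below count only the 4-bit quantized base weights; LoRA optimizer states, gradients, and activations add several more GB on top (batch-size dependent), so treat these as ballpark lower bounds:

```python
def qlora_weight_gb(n_params_billions: float, bits: int = 4) -> float:
    """Approximate GB needed just to hold the quantized base weights."""
    return n_params_billions * 1e9 * bits / 8 / 1024**3

# Base weights alone; training overhead comes on top of these numbers,
# which is why 13B gets tight on a 24 GB card while 7B fits comfortably.
w7 = qlora_weight_gb(7)    # roughly 3.3 GB
w13 = qlora_weight_gb(13)  # roughly 6.1 GB
```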
@jamesljl · 10 months ago
Would you please give a sample of what the CSV file looks like? Thanks a lot!
@engineerprompt · 10 months ago
Let me see what I can do; the format is shown in the video.
@Phoenix-fr9ic · 8 months ago
Can I use this technique for a document-based question-answer generation dataset?
@topg4439 · a month ago
Hey, did you find any solution for the Q&A model?
@MichealAngeloArts · 10 months ago
Do you need all 3 columns (Concept, Description, text) in train.csv, or is just 1 column (text) enough?
@engineerprompt · 10 months ago
Just one column
@godataprof · 10 months ago
Just the last text column
@nutCaseBUTTERFLY · 10 months ago
@engineerprompt So I watched the video 5 times, and it is still not clear which columns go where. You didn't even bother to open the .csv file so that we could see the schema. But you did show us the log file!
@Enju-Aihara · 10 months ago
@engineerprompt I want to know too
@filippobistaffa5913 · 9 months ago
You just need a "text" column present in your train.csv file; the other columns will be ignored. If you want, you can change which column is used with --text_column column_name
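The replies above can be captured in a small sketch: only the chosen text column is consumed and any other columns are ignored. The helper below is illustrative of that behavior, not autotrain's actual implementation:

```python
import csv

def get_text_rows(path: str, text_column: str = "text") -> list:
    """Return the training strings, ignoring any extra columns
    (equivalent in spirit to autotrain's --text_column selection)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if text_column not in (reader.fieldnames or []):
            raise ValueError("missing %r column" % text_column)
        return [row[text_column] for row in reader]

# Tiny demo file with an extra "Concept" column that should be ignored.
with open("demo.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["Concept", "text"])
    w.writerow(["sunset", "###Human: ... ###Assistant: ..."])

texts = get_text_rows("demo.csv")
```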
@fups8222 · 10 months ago
Why can't you fine-tune the chat model of Llama 2? The text completion of the fine-tuned model I'm using gives terrible results with my exact instructions in the prompt. I am using Puffin13B, but when fed exact instructions, it just cannot follow them the way I am prompting it to.
@muhannadobeidat · 2 months ago
Thanks for the video. Two things, please: 1. When you use the autotrain package, all the details are hidden and one cannot see what is being done and in what exact steps; I would suggest a video on that, please, even with the same example. 2. It is not clear to me what data vs. labels are fed into the model during training, what the loss function is, how it is calculated, etc.
@engineerprompt · 2 months ago
I agree with you. Autotrain abstracts a lot of details, but if you are interested in a more detailed setup, I would recommend looking for "fine-tune" videos on my channel. Here is one example: kzbin.info/www/bejne/onS9g6qoh9uljck
@medicationrefill · 10 months ago
Can I train my own LLM using data generated by ChatGPT if the model is intended for academic/commercial use?
@engineerprompt · 10 months ago
You probably can't use it for commercial purposes, but most of the open-source models out there (at least the initial versions) were trained on data generated by ChatGPT.
@brunapupoo4809 · a month ago
When I try to run the command in the terminal, it gives an error: autotrain [] llm: error: the following arguments are required: --project-name
@Univers314 · 10 months ago
Can ChatGPT 3.5 generate files?
@md.rakibulhaque2262 · 6 months ago
I'm getting this error with AutoModelForCausalLM ("from transformers import AutoModelForCausalLM"): MJ_Prompts does not appear to have a file named config.json. Instead, I had to use "from peft import AutoPeftModelForCausalLM" and run inference with AutoPeftModelForCausalLM. One more question: did we train an adapter model here? Please tell me how I can solve this. I am using free Colab.
@gamingisnotacrime6711 · 9 months ago
So if we are fine-tuning the chat model, can we use the same format as above, i.e., Human..., Assistant...?
@engineerprompt · 9 months ago
Yes
@susteven4974 · 10 months ago
How can I fine-tune llama-2-7b-chat? Can I use your dataset format?
@xiangye524 · 7 months ago
I'm getting the error: ValueError: Batch does not contain any data (`None`). At the end of all iterable data available before expected stop iteration. Does anyone know what the issue is? I'm running on Google Colab. Thanks
@georgekokkinakis7288 · 10 months ago
I am facing the following problem: the model gets uploaded to the Hugging Face repo but without a config.json file. Any solutions? Also, can the fine-tuned model run on the free Google Colab, or should we shard it?
@engineerprompt · 10 months ago
Are you fine-tuning it locally or on Google Colab? I am doing it locally without any issues.
@georgekokkinakis7288 · 10 months ago
@engineerprompt I am fine-tuning it on Google Colab. In a post I made on your other video about fine-tuning Llama2, you mentioned that it seems to be a problem with the free tier of Colab. I hope 🙏 you will find a fix, because not everyone owns a GPU.
@dmitrymalyshev3810 · 10 months ago
So if you have the same problem as me: this code will not work on free Google Colab, because there the script doesn't finish the job and doesn't create config.json, so you will have problems. I think that is also the reason why the script doesn't push my model to the Hugging Face Hub. But your work is great, so thanks for that.
@georgekokkinakis7288 · 10 months ago
I am also facing the same problem. In my case, the model is uploaded to the Hugging Face repo, but it is missing the config.json file. Any solutions?
@sauravmukherjeecom · 10 months ago
Is it possible to directly fine-tune GPTQ models?
@stephenf3838 · 10 months ago
QLoRA
@emrahe468 · 10 months ago
How or why do we decide on ###Human:? I see lots of variations in different videos: some use ->:, others use ###Input:, etc.
@engineerprompt · 10 months ago
It's really up to you how you want to define the format. Some models accept instructions along with the user input, so you get to decide based on your application.
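A sketch of that idea: since the delimiters are an arbitrary choice for base models, what matters is applying the exact same template at training and inference time. The template names and strings below are made up for illustration:

```python
# The delimiters are arbitrary; what matters is using the exact same pair
# during fine-tuning and at inference time. These are illustrative choices.
TEMPLATES = {
    "human_assistant": ("###Human:\n{q}\n", "###Assistant:\n{a}"),
    "instruction_response": ("### Instruction:\n{q}\n", "### Response:\n{a}"),
}

def render(style: str, q: str, a: str) -> str:
    """Render one training example with the chosen delimiter style."""
    head, tail = TEMPLATES[style]
    return head.format(q=q) + tail.format(a=a)

ex = render(
    "instruction_response",
    "Generate a prompt for: sunset",
    "A golden sunset over calm water.",
)
```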
@ruizhou1243 · 4 months ago
There is no code snippet showing how this works? I don't understand the details.
@user-ek5mv1fp7y · 8 months ago
Hello, I'm new to this. What I'm trying to do is a personalized chat with Llama2. For example, I have data from a company X that is about 15 columns by 300 rows. So far it answers me, although its answers are still illogical. Anyway, what I want to know is whether I should create the text column with {Human, Assistant} for each column or possible question, and how the data should be prepared to train the model. Please, can someone guide me on this?
@manavshah9062 · 10 months ago
Hey, I fine-tuned the model using my own dataset, but when running the bash command the model somehow did not get uploaded to Hugging Face. After the training completed, I zipped the model and downloaded it. Is there a way I can upload this fine-tuned model to Hugging Face now?
@fernando88to · 10 months ago
How do I use this local template in the localGPT project?
@engineerprompt · 10 months ago
The localGPT code has support for a custom prompt template. You will need to provide your template there.
@user-qb9ku5ye4v · a month ago
Sir, I don't have ChatGPT Plus. Are there any alternatives?
@engineerprompt · a month ago
Look into Groq; it's a free API (at the moment).
@user-qb9ku5ye4v · a month ago
@engineerprompt Thank you so much, sir. But I decided to do it without any LLMs, so I wrote my own code using Python and pandas. If you want, I could share the code with you.
@srikrishnavamsi1470 · 7 months ago
How can I contact you, sir?
@engineerprompt · 7 months ago
Check out my email
@milesbarn · 4 months ago
According to OpenAI's terms, you are not allowed to use GPT-4's output (any of it, even if the data fed into it is yours) to train models other than OpenAI's.
@phat80 · a month ago
Who cares? Who will know? 😅
@jojomama3028 · 8 months ago
Hmm, OK. But I have 156 GB of PDFs about construction norms and rules in France. How do I put all of that documentation into the dataset? I can convert all the PDFs to text files; that is not the issue. How do I make Llama search for answers in THAT particular dataset? Do I have to write Q&A pairs for the whole document to be able to train Llama? What is the point of that? If I have to do the work beforehand and answer the questions, what is the benefit? It would take me years to write all the questions. Perhaps I don't understand the video? I just want to feed the existing language model a specific dataset, not steer how it answers me.
@engineerprompt · 8 months ago
In that case, check out something like localGPT or other RAG solutions.
@BHAVYASRIPOLISHETTI · 9 days ago
I'm getting a "your session crashed after using all available RAM" error.
@islamicinterestofficial · 8 months ago
I'm getting the error: FileNotFoundError: Couldn't find a dataset script at /content/train.csv/train.csv.py or any data file in the same directory. This happens even though I'm running the autotrain command in the same directory where my train.csv file is present. I'm running on Colab, by the way.
@islamicinterestofficial · 8 months ago
The solution is to provide the path of the folder where the CSV file is present, but don't include the CSV file name.