This is awesome. For the next video, I have a suggestion: Suppose I have multiple PDF files containing a lot of information about my organization. How can I use a large language model (LLM) like the one you used above to create a dataset extracted from the knowledge provided in these PDFs?
@ByteBop911 · 2 months ago
I also generate synthetic datasets. A secret tip for alignment: set the mood and tone as parameters in the prompt as well when generating the questions and responses (it makes the dataset a little more dynamic).
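A minimal sketch of the tip above, treating mood and tone as prompt parameters so the generated Q&A pairs vary in style. The template wording and the mood/tone lists here are made-up examples, not from the video:

```python
# Hypothetical sketch: vary mood and tone per prompt so the synthetic
# Q&A dataset is less uniform in style.
import itertools

TEMPLATE = (
    "Write one question and its answer about the topic '{topic}'. "
    "Use a {mood} mood and a {tone} tone. "
    'Return JSON: {{"question": "...", "answer": "..."}}'
)

MOODS = ["curious", "skeptical", "enthusiastic"]
TONES = ["formal", "casual", "technical"]

def build_prompts(topic):
    """One prompt per (mood, tone) pair; each would go to the LLM separately."""
    return [TEMPLATE.format(topic=topic, mood=m, tone=t)
            for m, t in itertools.product(MOODS, TONES)]

prompts = build_prompts("vector databases")
print(len(prompts))  # 9 style variants for one topic
```

Sending each variant as a separate generation request spreads the dataset across styles instead of producing nine near-identical answers.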
@kolasatheesh1719 · 2 months ago
Hi @ByteBop, I would like to create a synthetic dataset for images. Do you know how to do it?
@CryptoMaN_Rahul · 2 months ago
Hey bro, can you tell me whether it costs anything? Also, can you share your code? I'm a newbie and want to learn.
@maruc14 · 2 months ago
Really great tutorial. Keep 'em coming. First time seeing a NIMs demo.
@litttlemooncream5049 · 2 months ago
First time learning that LLMs can generate datasets! Thanks a lot.
@atultiwari88 · 2 months ago
Thank you for this awesome tutorial. I kindly request that you make a video on synthetic dataset generation from PDF files. Thank you so much.
@swetharavishankar4825 · 1 month ago
This is amazing! Can you also explain how to create a classification model from the generated dataset?
@gr8tbigtreehugger · 2 months ago
Super cool! Just did something extremely similar but in Google Sheets, so my non-tech peers can help.
@chaithanyavamshi2898 · 2 months ago
Wow! Great tutorial, Mervin. This is exactly what I was looking for for fine-tuning. I tested it and it worked perfectly. I have a question: from my understanding of your blog and video, this dataset is suitable for ORPO fine-tuning (AI feedback scores)? Can I still use it for SFT by filtering for the responses (rows) with the best scores?
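One way the SFT idea above could look in code. This is a sketch with assumed column names (`chosen`, `rejected`, and their scores), not the tutorial's actual schema: keep only the higher-scored response per row, drop rows below a quality threshold, and you are left with plain prompt/response pairs for SFT.

```python
# Sketch: convert ORPO-style preference rows (two scored responses per
# prompt) into a plain SFT dataset by keeping the best-scored response.
def orpo_rows_to_sft(rows, min_score=7):
    sft = []
    for r in rows:
        # pick the better-scored response for this prompt
        best, score = max(
            (r["chosen"], r["chosen_score"]),
            (r["rejected"], r["rejected_score"]),
            key=lambda pair: pair[1],
        )
        if score >= min_score:  # drop low-quality rows entirely
            sft.append({"prompt": r["prompt"], "response": best})
    return sft

rows = [
    {"prompt": "What is RAG?", "chosen": "Retrieval-augmented generation...",
     "chosen_score": 9, "rejected": "A rock band.", "rejected_score": 2},
    {"prompt": "Low-quality row", "chosen": "meh", "chosen_score": 5,
     "rejected": "worse", "rejected_score": 3},
]
print(orpo_rows_to_sft(rows))  # only the first row survives the filter
```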
@MeinDeutschkurs · 2 months ago
Great insight!
@batigol_9 · 10 days ago
If I have a specific number of subtopics and I don't want to generate new subtopics, how would I choose the number of datasets to generate?
@vitalis · 2 months ago
Can we ask LLMs to outline, create questions, reply to, and summarize BOOKS, and use that to fine-tune LLMs?
@kolasatheesh1719 · 2 months ago
Hey buddy, can't we create a synthetic dataset for images? I mean uploading images and getting responses to questions about them. How would we do that?
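One possible shape for such an image dataset: pair each image with generated question/answer records. This is a hypothetical sketch; `ask_vision_model` is a stub standing in for a real vision-language model call, and the question list is made up:

```python
# Hypothetical sketch: build a synthetic VQA-style dataset by pairing
# each image with generated question/answer text.
QUESTIONS = [
    "What objects are visible in this image?",
    "Describe the scene in one sentence.",
]

def ask_vision_model(image_path, question):
    # stand-in for an actual vision-language model / API call
    return f"[answer about {image_path}] {question}"

def build_image_qa_dataset(image_paths):
    records = []
    for img in image_paths:
        for q in QUESTIONS:
            records.append({"image": img, "question": q,
                            "answer": ask_vision_model(img, q)})
    return records

data = build_image_qa_dataset(["cat.jpg", "street.jpg"])
print(len(data))  # 2 images x 2 questions = 4 records
```

Swapping the stub for a real multimodal model call (and adding quality scoring, as in the text pipeline) would make this a usable image dataset generator.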
@swetharavishankar4825 · 1 month ago
Hey, I am only able to generate 10 examples per run. How do I make sure it generates at least a thousand?
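A common workaround is to loop the generation call in batches until a target count is reached. In this sketch `generate_batch` is a stub standing in for the tutorial's real LLM call; in practice you would also vary the topic or seed per batch so batches don't repeat each other:

```python
# Sketch: batch the generation loop so one run yields 1000+ examples
# instead of a single call's ~10.
def generate_batch(offset, batch_size=10):
    # placeholder for the real LLM call; a real version would vary the
    # prompt/topic per batch to avoid near-duplicate examples
    return [{"question": f"q{offset + i}", "answer": f"a{offset + i}"}
            for i in range(batch_size)]

def generate_dataset(target=1000, batch_size=10):
    dataset = []
    while len(dataset) < target:
        dataset.extend(generate_batch(offset=len(dataset),
                                      batch_size=batch_size))
    return dataset[:target]

print(len(generate_dataset(target=1000)))  # 1000
```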
@bocilmillenium7698 · 2 months ago
Can it be used for the Indonesian language?
@commoncats5437 · 2 months ago
Bro, please create a good model for Tamil. We don't have the best GPUs. If you do it, we can build on it for many use cases.
@MervinPraison · 2 months ago
ollama.com/mervinpraison
@commoncats5437 · 2 months ago
@@MervinPraison 🥰 Thank you!
@john_blues · 2 months ago
@@MervinPraison Pretty cool. Can you point me to how I can do something like this for another language? I am trying to help build one for the Yoruba language.
@fascinatingfactsabout · 2 months ago
I'm still fuzzy about when creating synthetic data is useful in a practical scenario for me, as a single person, not a large company that needs to fine-tune LLMs. Can someone clarify? What's the real-world use for this?
@brishtiteveja · 2 months ago
LLMs can hallucinate on the questions you ask, especially for low-resource languages. For my own language, Bengali, they hallucinate a lot and give wrong answers about facts and events. You can use RAG to reduce hallucination, but RAG depends on the size of the context. If I want to build a specialized model that knows and answers facts about Bengali culture and recent events, it's better to fine-tune on a dataset of those facts and events, so the knowledge becomes part of the model itself and there is no hallucination.

You can think of RAG as an open-book exam, where you can search the book for answers while taking the exam, versus a fine-tuned model being you having the knowledge in your brain. Of course, depending on your memory and reasoning ability, you will give an accurate or a hallucinated answer. But if the knowledge becomes part of your memory accurately and you can retrieve it on demand, you no longer have to search your books every time someone asks you a question. So I hope you now understand why it may be useful to fine-tune. The ultimate goal is "no hallucination" and therefore better accuracy.
@fascinatingfactsabout · 2 months ago
@@brishtiteveja Thanks for the response, I really appreciate it. I believe you're talking about real-world data, though, not data generated by the AI. My question was focused on the usefulness of synthetic data. Or am I interpreting "synthetic data" the wrong way?
@john_blues · 2 months ago
@@brishtiteveja That's a great answer.
@TheBestgoku · 2 months ago
Use Claude to do this. No model in the world is even close to Claude currently. Don't believe the benchmarks; the difference is huge.