This is awesome. For the next video, I have a suggestion: Suppose I have multiple PDF files containing a lot of information about my organization. How can I use a large language model (LLM) like the one you used above to create a dataset extracted from the knowledge provided in these PDFs?
@ByteBop911 · 2 months ago
I also generate synthetic datasets. A secret tip for alignment: set the mood and tone as parameters in the prompt as well when generating the questions and responses (it makes the dataset a little more dynamic).
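A minimal sketch of the tip above, treating mood and tone as prompt parameters so the generated Q&A pairs vary in style. The template wording and the mood/tone lists here are made-up examples, not from the video:

```python
# Hypothetical sketch: vary mood and tone per prompt so the synthetic
# Q&A dataset is less uniform in style.
import itertools

TEMPLATE = (
    "Write one question and its answer about the topic '{topic}'. "
    "Use a {mood} mood and a {tone} tone. "
    'Return JSON: {{"question": "...", "answer": "..."}}'
)

MOODS = ["curious", "skeptical", "enthusiastic"]
TONES = ["formal", "casual", "technical"]

def build_prompts(topic):
    """One prompt per (mood, tone) pair; each would go to the LLM separately."""
    return [TEMPLATE.format(topic=topic, mood=m, tone=t)
            for m, t in itertools.product(MOODS, TONES)]

prompts = build_prompts("vector databases")
print(len(prompts))  # 9 style variants for one topic
```

Sending each variant as a separate generation request spreads the dataset across styles instead of producing nine near-identical answers.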
@kolasatheesh1719 · 2 months ago
Hi @ByteBop, I would like to create a synthetic dataset for images. Do you know how to do it?
@CryptoMaN_Rahul · 2 months ago
Hey bro, can you tell me whether it costs anything? Also, can you share your code? I'm a newbie and want to learn.
@maruc14 · 2 months ago
Really great tutorial. Keep 'em coming. First time seeing a NIMs demo.
@litttlemooncream5049 · 2 months ago
First time learning that LLMs can generate datasets! Thanks a lot.
@atultiwari88 · 2 months ago
Thank you for this awesome tutorial. I kindly request that you make a video on synthetic dataset generation from PDF files. Thank you so much.
@swetharavishankar4825 · 1 month ago
This is amazing! Can you also explain how to create a classification model from the generated dataset?
@gr8tbigtreehugger · 2 months ago
Super cool! Just did something extremely similar but in Google Sheets, so my non-tech peers can help.
@chaithanyavamshi2898 · 2 months ago
Wow! Great tutorial, Mervin. This is exactly what I was looking for for fine-tuning. I tested it and it worked perfectly. I have a question: from my understanding of your blog and video, this dataset is suitable for ORPO fine-tuning (AI feedback scores)? Can I still use it for SFT by filtering for the responses (rows) with the best scores?
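One way the SFT idea above could look in code. This is a sketch with assumed column names (`chosen`, `rejected`, and their scores), not the tutorial's actual schema: keep only the higher-scored response per row, drop rows below a quality threshold, and you are left with plain prompt/response pairs for SFT.

```python
# Sketch: convert ORPO-style preference rows (two scored responses per
# prompt) into a plain SFT dataset by keeping the best-scored response.
def orpo_rows_to_sft(rows, min_score=7):
    sft = []
    for r in rows:
        # pick the better-scored response for this prompt
        best, score = max(
            (r["chosen"], r["chosen_score"]),
            (r["rejected"], r["rejected_score"]),
            key=lambda pair: pair[1],
        )
        if score >= min_score:  # drop low-quality rows entirely
            sft.append({"prompt": r["prompt"], "response": best})
    return sft

rows = [
    {"prompt": "What is RAG?", "chosen": "Retrieval-augmented generation...",
     "chosen_score": 9, "rejected": "A rock band.", "rejected_score": 2},
    {"prompt": "Low-quality row", "chosen": "meh", "chosen_score": 5,
     "rejected": "worse", "rejected_score": 3},
]
print(orpo_rows_to_sft(rows))  # only the first row survives the filter
```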
@MeinDeutschkurs · 2 months ago
Great insight!
@batigol_9 · 10 days ago
If I have a specific number of subtopics and I don't want to generate new subtopics, how would I choose the number of datasets to generate?
@vitalis · 2 months ago
Can we ask LLMs to outline, create questions, reply to, and summarize BOOKS, and use that to fine-tune LLMs?
@kolasatheesh1719 · 2 months ago
Hey buddy, can't we create a synthetic dataset for images? I mean uploading images and getting responses to questions about them. How would we do that?
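One possible shape for such an image dataset: pair each image with generated question/answer records. This is a hypothetical sketch; `ask_vision_model` is a stub standing in for a real vision-language model call, and the question list is made up:

```python
# Hypothetical sketch: build a synthetic VQA-style dataset by pairing
# each image with generated question/answer text.
QUESTIONS = [
    "What objects are visible in this image?",
    "Describe the scene in one sentence.",
]

def ask_vision_model(image_path, question):
    # stand-in for an actual vision-language model / API call
    return f"[answer about {image_path}] {question}"

def build_image_qa_dataset(image_paths):
    records = []
    for img in image_paths:
        for q in QUESTIONS:
            records.append({"image": img, "question": q,
                            "answer": ask_vision_model(img, q)})
    return records

data = build_image_qa_dataset(["cat.jpg", "street.jpg"])
print(len(data))  # 2 images x 2 questions = 4 records
```

Swapping the stub for a real multimodal model call (and adding quality scoring, as in the text pipeline) would make this a usable image dataset generator.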
@swetharavishankar4825 · 1 month ago
Hey, I am only able to generate 10 examples per run. How do I make sure it generates at least a thousand?
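A common workaround is to loop the generation call in batches until a target count is reached. In this sketch `generate_batch` is a stub standing in for the tutorial's real LLM call; in practice you would also vary the topic or seed per batch so batches don't repeat each other:

```python
# Sketch: batch the generation loop so one run yields 1000+ examples
# instead of a single call's ~10.
def generate_batch(offset, batch_size=10):
    # placeholder for the real LLM call; a real version would vary the
    # prompt/topic per batch to avoid near-duplicate examples
    return [{"question": f"q{offset + i}", "answer": f"a{offset + i}"}
            for i in range(batch_size)]

def generate_dataset(target=1000, batch_size=10):
    dataset = []
    while len(dataset) < target:
        dataset.extend(generate_batch(offset=len(dataset),
                                      batch_size=batch_size))
    return dataset[:target]

print(len(generate_dataset(target=1000)))  # 1000
```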
@bocilmillenium7698 · 2 months ago
Can it be used for the Indonesian language?
@commoncats5437 · 2 months ago
Bro, please create a good model for Tamil. We don't have the best GPUs. If you do it, we can build on it for many use cases.
@MervinPraison · 2 months ago
ollama.com/mervinpraison
@commoncats5437 · 2 months ago
@@MervinPraison 🥰 Thank you!
@john_blues · 2 months ago
@@MervinPraison Pretty cool. Can you point me to how I can do something like this for another language? I am trying to help build one for the Yoruba language.
@fascinatingfactsabout · 2 months ago
I'm still fuzzy about when creating synthetic data is useful in a practical scenario for me, as a single person, not a large company that needs to fine-tune LLMs. Can someone clarify? What's the real-world use for this?
@brishtiteveja · 2 months ago
LLMs can hallucinate on the questions you ask, especially for low-resource languages. For my own language, Bengali, they hallucinate a lot and give wrong answers about facts and events. You can use RAG to reduce hallucination, but RAG depends on the size of the context. If I want to build a specialized model that knows and answers facts about Bengali culture and recent events, it's better to fine-tune on a dataset of those facts and events, so the knowledge becomes part of the model itself and there is no hallucination.

You can think of RAG as an open-book exam, where you can search the book for answers while taking the exam, versus a fine-tuned model being you having the knowledge in your brain. Of course, depending on your memory and reasoning ability, you will give an accurate or a hallucinated answer. But if the knowledge becomes part of your memory accurately and you can retrieve it on demand, you no longer have to search your books every time someone asks you a question. So I hope you now understand why it may be useful to fine-tune. The ultimate goal is "no hallucination" and therefore better accuracy.
@fascinatingfactsabout · 2 months ago
@@brishtiteveja Thanks for the response, I really appreciate it. I believe you're talking about real-world data, though, not data generated by the AI. My question was focused on the usefulness of synthetic data. Or am I interpreting "synthetic data" the wrong way?
@john_blues · 2 months ago
@@brishtiteveja That's a great answer.
@TheBestgoku · 2 months ago
Use Claude to do this. No model in the world is even close to Claude currently. Don't believe the benchmarks; the difference is huge.