How To Create Datasets for Finetuning From Multiple Sources! Improving Finetunes With Embeddings.

34,240 views

AemonAlgiz

1 day ago

Today, we delve into the process of setting up datasets for fine-tuning large language models (LLMs). Starting from the initial considerations needed before dataset construction, we navigate through various pipeline setup questions, such as whether embeddings are needed. We discuss how to structure raw text data for fine-tuning, exemplified with real coding and medical-appeals scenarios.
We also explore how to leverage embeddings to provide additional context to our models, a crucial step in building more general and robust models. The video further explains how to transform books into structured data sets using LLMs, with an example of transforming the book 'Twenty Thousand Leagues Under the Sea' into a question-and-answer format.
In addition, we look at the process of fine-tuning LLMs to write in specific programming languages, showing a practical application with a Cypher query for graph databases. Lastly, we demonstrate how to enhance the performance of a medical application with the use of embedded information, utilizing the Superbooga extension.
Whether you're interested in coding, medical applications, book conversion, or simply fine-tuning LLMs in general, this video provides comprehensive insights. Tune in to discover how to augment your models with advanced techniques and tools. Join us on our live stream for a deep dive into how to broaden the context in local models and results from our book training and comedy sets.
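The book-to-dataset workflow described above reduces to three steps: split the raw text into context-sized chunks, ask an LLM to write a question-and-answer pair about each chunk, and collect the pairs into a training file. Below is a minimal Python sketch of that idea; the prompt wording, chunk size, output schema, and the legacy OobaBooga API endpoint on port 5000 are illustrative assumptions, not the exact code from the linked repo.

```python
import json
import requests

# Assumed legacy OobaBooga (text-generation-webui) API endpoint; adjust for your setup.
API_URL = "http://localhost:5000/api/v1/generate"

def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Naively split raw text on paragraph breaks into roughly context-sized chunks."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def chunk_to_qa(chunk: str) -> str:
    """Ask the local model to rewrite one passage as a question/answer pair."""
    prompt = (
        "Read the passage below and write one question a reader might ask about it, "
        "followed by its answer, in the form:\nQuestion: ...\nAnswer: ...\n\n"
        "Passage:\n" + chunk + "\n"
    )
    resp = requests.post(API_URL, json={"prompt": prompt, "max_new_tokens": 300})
    resp.raise_for_status()
    return resp.json()["results"][0]["text"]

if __name__ == "__main__":
    with open("twenty_thousand_leagues.txt", encoding="utf-8") as f:
        book = f.read()
    dataset = [{"text": chunk_to_qa(c)} for c in chunk_text(book)]
    with open("qa_dataset.json", "w", encoding="utf-8") as f:
        json.dump(dataset, f, indent=2)
```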
0:00 Intro
0:44 Considerations For Finetuning Datasets
2:45 Reviewing Embeddings
5:35 Finetuning With Embeddings
8:31 Creating Datasets From Raw/Books
12:08 Coding Finetuning Example
14:02 Medicare/Medicaid Appeals Example
17:01 Outro
Training datasets: github.com/tomasonjo/blog-dat...
Massive Text Embeddings: huggingface.co/blog/mteb
GitHub Repo: github.com/Aemon-Algiz/Datese...
#machinelearning #ArtificialIntelligence #LargeLanguageModels #FineTuning #DataPreprocessing #Embeddings

Comments: 109
@cesarsantos854 1 year ago
This content is top-notch among the ML and AI content on YouTube, showing us how it really works!
@AemonAlgiz 1 year ago
Thank you, I’m glad it’s helpful!
@boogfromopenseason 1 month ago
I would pay a lot of money for this information, thank you.
@fabsync 1 month ago
Finally, a freaking great tutorial! Practical, straight to the point, and it works!!
@RAG3Network 1 month ago
You’re literally a genius! I appreciate you taking the time to share the knowledge with us! Exactly what I was looking for… how to create a dataset, and in such a well-put-together video. Thank you!
@timothymaggenti717 1 year ago
Okay, so after a cup of coffee and watching a couple of times: WOW. You helped me so much, thank you. This has been driving me nuts, and you make it look so easy to fix. I wish I were as smart as you. Thank you again. 🎉
@AemonAlgiz 1 year ago
You always ask the best questions, so keep them coming :)
@flowers134 10 months ago
Amazing, thanks a lot for sharing your reflections on your work and experience! It is much appreciated! This is the first time I've checked out something like this while quickly browsing and stuck with it, without having to review/study and come back later. I am able to get a bird's-eye view of the topic, the options available for work, and the underlying purpose. 🥇 Pure gold. Definitely subscribed!
@HistoryIsAbsurd 4 months ago
Dude, seriously, your content is so clear and easy to follow. Keep it up!
@pelaus01 1 year ago
Amazing work... this channel is pure gold: the exact right amount of concepts, everything spot on. Nothing beats teaching from experience like you do.
@AemonAlgiz 1 year ago
I’m glad it was helpful and thank you for the comment :)!
@leont.17 11 months ago
I very much appreciate that you always have this way of listing the most important bullet points at the beginning
@AemonAlgiz 11 months ago
I’m glad it’s helpful! I figured it would be nice to give a quick overview
@smellslikeupdog80 1 year ago
I knew I subscribed here for good reason. This is consistently extremely high-quality information -- not the regurgitated stuff. This is super educational and has immensely improved my understanding. Please keep going, bud, this is great.
@AemonAlgiz 1 year ago
Thank you! It’s greatly appreciated
@rosenangelow6082 11 months ago
Great explanation with the right level of details and depth. Good stuff. Thanks!
@AemonAlgiz 11 months ago
I’m glad it was helpful!
@kaymcneely7635 1 year ago
Superb presentation. As always. 😊
@redbaron3555 11 months ago
Awesome content!! Thank you very much!!👏🏻👏🏻👍🏻
@babyfox205 3 months ago
Great explanations! Thanks a lot for your efforts making this great content!
@Hypersniper05 1 year ago
That's awesome! And you can even save the new appeal to create more data!
@AemonAlgiz 1 year ago
Indeed! It becomes a very nice self-reinforcing model, which is why I really like the fine-tuning and embedding approach.
@timothymaggenti717 1 year ago
Wow, how do you make everything look so easy? Nice, thanks. So East Coast? Man, you're an early bird.
@AemonAlgiz 1 year ago
I live in MST, haha. I just wake up very early :)
@AemonAlgiz 1 year ago
Comedy dataset update! I have found an approach I think I like for it, though I didn't have time to complete it for this video. So, I will also cover that in today's live stream!
@onurbltc 1 year ago
Awesome video!
@AemonAlgiz 1 year ago
Thank you!
@jonmichaelgalindo 1 year ago
The appeal has been processed by the approval AI... And it passed! The prescription will now be covered. 😊 (Thank you for the video! I think datasets and installing dependencies are ML's greatest pain points at the moment.)
@AemonAlgiz 1 year ago
Thank you! I’m glad it was helpful :)
@li-pingho1441 11 months ago
thank you soooooo much
@Tranquilized_ 1 year ago
You are an Angel. 💜
@AemonAlgiz 1 year ago
Thank you! I’m glad it was helpful :) I do like how you left your name like that, haha.
@arinco3817 1 year ago
This video was awesome! I'm finally starting to wrap my head around this stuff. At the same time I'm realising the power that is being unleashed onto the world! BTW, did you see this new paper: SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression? Looks like it's right up your alley!
@AemonAlgiz 1 year ago
Thank you! I’m glad it’s helpful :D I have not seen this; it's super cool, though, thank you for pointing me to it! I would love to see some implementation of pruning in LLMs. Quantization is incredibly powerful, but we can only compress so much until we hit the limit. With pruning plus weight compression, we could run 30/65B-parameter models on a single consumer GPU.
@kenfink9997 1 year ago
How would building a training set on a codebase look? Is there a good example of automating the generation of a Q&A training set based on code? How do you chunk it to fit in the context window - break it up by functions and classes? Where would the extraneous stuff go, like requirements, imports, etc.? Thanks for the great content!
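One plausible answer to the chunking part of this question is to split source files on top-level definitions rather than fixed character counts, carrying the module's imports along with each chunk so it still reads as self-contained code. A hedged sketch using Python's standard ast module; this is one possible strategy, not something shown in the video:

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split a Python module into one chunk per top-level function or class,
    prepending the module's imports so every chunk stands on its own."""
    tree = ast.parse(source)
    lines = source.splitlines()
    imports = [
        "\n".join(lines[node.lineno - 1 : node.end_lineno])
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    preamble = "\n".join(imports)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            body = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(preamble + "\n\n" + body if preamble else body)
    return chunks
```

Each chunk could then go through the same Q&A-generation step as the book example, e.g., "write a question about what this function does, then answer it."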
@danielmz99 11 months ago
Hey man, thanks for your videos, they are instructive. I am new to LLMs and I think there is a significant gap in YouTube content on the new LLMs. I know there are videos on fine-tuning GPT-3, but I can't find anything like a walkthrough on fine-tuning a larger new open-source model like Falcon-40B Instruct. If there were a playlist going through the process (Q&A fine-tune data definition, synthetic data production, fine-tuning, and testing), I am sure others like myself would be very keen followers.
@AemonAlgiz 11 months ago
I’ll make a playlist today!
@amortalbeing 8 months ago
thanks man
@LeonvanBokhorst 10 months ago
🙏 thanks
@MohamedElGhazi-ek6vp 10 months ago
It's so helpful, thank you! What if I have multiple PDF files at the same time and each one of them has its own subject? Can I do the same for them?
@user-nj1js7ky8p 11 months ago
Amazing work! I would like to know if it is possible to use LangChain to load PDFs and batch-generate instruction datasets?
@SamuelJohnKing 9 months ago
I really love the concept, but whatever I have tried, I get: ERROR: Token indices sequence length is longer than the specified maximum sequence length for this model (194233 > 2048). Could you please update it? It would be of immense value to me :)
@octadion3274 7 months ago
Did you find the solution?
@darklikeashadow6626 3 months ago
Would also love to know :)
@maneeharani8135 27 days ago
How did you resolve this problem?
@cmosguy1 9 months ago
Hey @AemonAlgiz - how did you create the instruction set of data for the Cypher query examples? Did you do all of that manually?
@asdasdaa7063 5 months ago
do you have a video on how to prepare a dataset for creative writing?
@d_b_ 11 months ago
Could you clarify the performance of the LLMs where you provide context but don't do a fine-tune? Was that last OobaBooga medical-appeal demo with a fine-tuned model, or was it just using the additional embedded context?
@user-sl3yn9xv2y 11 months ago
Hi, I have some confusion about your content on leveraging embeddings. My understanding so far is that the embedding approach simply means "few-shot learning." The pipeline is: say I have a query; I embed the query into a vector and then search for similar vectors, which represent relevant examples, in a vector DB. Now I have my initial query + some examples of (query, answer) from the DB. Then I somehow cleverly concat my query with the retrieved examples to form a long instruction/prompt, feed it to the LLM, and just wait for the output. Did I get my understanding right?
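For readers following along, that description matches the usual retrieval-augmented pattern: embed the query, pull the nearest stored chunks, and prepend them to the prompt before generation. A minimal sketch with sentence-transformers and a brute-force cosine search standing in for a real vector database; the model name, example documents, and prompt template are illustrative assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any well-ranked MTEB model works

documents = [
    "Captain Nemo commands the Nautilus.",
    "The Nautilus is powered entirely by electricity.",
    "Professor Aronnax narrates the voyage.",
]
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

def build_prompt(query: str, k: int = 2) -> str:
    """Embed the query, grab the k most similar chunks, and inject them as context."""
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # cosine similarity, since the vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    context = "\n".join(documents[i] for i in top)
    return "Context:\n" + context + "\n\nQuestion: " + query + "\nAnswer:"

print(build_prompt("What powers the Nautilus?"))
```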
@LoneRanger.801 9 months ago
Waiting for new content 😊
@filipbottcher4338 10 months ago
Well done, but how do you handle the max model length of tokenizer.encode?
@adriangabriel3219 1 year ago
Hi @AemonAlgiz, great video! I am using a similar approach (I use LangChain for handing the documents over to an LLM) and I have tried a WizardLM model, which hasn't performed too great. What strategies (fine-tuning, in-context learning, or other models?) would you recommend to improve the performance of answering a question given the retrieved documents? Can you recommend specific models (Flan-T5 or others)?
@AemonAlgiz 1 year ago
Gorilla is specifically tuned for use with LangChain, so that may be an interesting model to test with. What kind of data do you want to use? That may influence my answer here.
@adriangabriel3219 1 year ago
@@AemonAlgiz Haven't heard of Gorilla, so thanks for pointing that out! I would like to answer questions given paragraphs of a technical manual.
@adriangabriel3219 1 year ago
Hi @@AemonAlgiz, I don't quite understand how to use Gorilla with an existing vector database. Could you make a video on that, or do you have guidance for it? Am I supposed to use the OpenAI API for that use case?
@mygamecomputer1691 11 months ago
Hi, I was listening to your description of raw text and then how you converted it. But can you just upload a very short story that has the style you like, take all the defaults on the training tab, use the plain TXT file, and make a LoRA that will be useful in that it will simulate the style I like in the model I want to use?
@unshadowlabs 1 year ago
When you uploaded the additional data in Superbooga, did you have to prep it first in question-and-answer format like you did for the fine-tuning, or were you able to just upload books, files, etc. for that part? Also, thanks for doing these videos! These are by far the most informative on how this stuff works!
@AemonAlgiz 1 year ago
I just naively dumped the entire file, which I wouldn't do for a more sophisticated application. Though Superbooga will just chunk the files for you, so you can just drag and drop massive files.
@unshadowlabs 1 year ago
@@AemonAlgiz Thanks! How do you deal with more complex formatted material, such as research papers? Are the parsers good enough to handle them without a lot of data cleaning or prep work on the paper first?
@AemonAlgiz 1 year ago
@@unshadowlabs This has been my area of expertise for years! I worked in scientific publishing for over a decade, and what I find is that naively parsing them works to some extent, especially with research papers, since they tend to be very topically dense. What you may find challenging is keeping all of the context densely packed, so it may be worth trying to split on taxonomic/ontological concepts.
@unshadowlabs 1 year ago
@@AemonAlgiz Awesome, thanks for the reply! A suggestion for a video: I would love to see how you deal with different types of content and sources, what kind of data processing, wrangling, or cleaning is involved, and what tools you recommend given your expertise, background, and experience.
@AemonAlgiz 1 year ago
This is a great idea; I have dealt with some nightmarish formats.
@wilfredomartel7781 1 year ago
Amazing work! I'm still trying to understand the embeddings approach. 😊
@AemonAlgiz 1 year ago
Basically, we would rather teach the model how to use information than try to teach it everything. So, if we can give the model enough examples of what a procedure looks like, it can learn how to better follow it. Take, for example, a paralegal or a lawyer. They're well educated in how to write legal briefs, though they're not aware of every law in existence. They have learned how to research and leverage information, which is what we're trying to do with this approach.
@Hypersniper05 1 year ago
The only way you'll understand it is by trying it yourself
@wilfredomartel7781 1 year ago
@@Hypersniper05 you are right.
@wilfredomartel7781 1 year ago
​@@AemonAlgiz Thanks for the explanation clearing up my doubt. I will try to reproduce it in my Colab Pro.
@AemonAlgiz 1 year ago
Let me know how the experiment goes!
@mohammedanfalvp8691 9 months ago
I am getting an error like this: Token indices sequence length is longer than the specified maximum sequence length for this model (546779 > 2048). Running this sequence through the model will result in indexing errors. Max retries exceeded. Skipping this chunk.
@darklikeashadow6626 3 months ago
Same here. Does anyone have an answer?
@darklikeashadow6626 3 months ago
Hi @aemonAlgiz, I am new to Python (and LLMs) and wanted to try creating a dataset from a book as well. However, when running the provided code, I got a warning: "Token indices sequence length is longer than the specified maximum sequence length for this model (181602 > 2048). Running this sequence through the model will result in indexing errors. Max retries exceeded. Skipping this chunk." (which happened a lot). The new .json file was empty. I tried changing "model_max_length" from 2048 to 200000 in the tokenizer_config for my model, but that only made the warning disappear (the result was the same). Would love it if anyone has a solution to this :)
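The warning reported in these threads comes from tokenizing an entire book in one call: the Hugging Face tokenizer flags any sequence longer than the model's context window (2048 tokens here). Raising model_max_length only silences the check; the model still can't attend past its trained context, so the text has to be split into sub-context chunks before it is fed in. A hedged sketch of one possible fix, not the repo's actual code:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in; load your model's tokenizer
MAX_LEN = 1500  # stay well under the 2048-token context to leave room for the prompt

def token_chunks(text: str, max_len: int = MAX_LEN) -> list[str]:
    """Encode once, slice the token ids into windows that fit the context,
    then decode each window back to text for downstream prompting."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(ids[i : i + max_len])
        for i in range(0, len(ids), max_len)
    ]

with open("book.txt", encoding="utf-8") as f:
    chunks = token_chunks(f.read())
print(len(chunks), "chunks, each at most", MAX_LEN, "tokens")
```

The long-sequence warning may still print during the initial encode, but the resulting chunks are sized to be safe to send to the model.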
@bleo4485 1 year ago
Hi Aemon, I am new to setting up a local LLM API. Could you explain a little about how to go about setting it up? Thanks!
@AemonAlgiz 1 year ago
Hey there! From the OobaBooga web application you can enable extensions, including the API. It will run on port 5000 by default!
@champ8142 6 months ago
Hi Aemon, I checked api and public_api on the flags/extensions page; any idea why I can't connect to port 5000?
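For anyone stuck at this step: with the api extension enabled (for example, launching the webui with --extensions api), the blocking endpoint of that era could be smoke-tested with a few lines of Python. The port, path, and payload shape below reflect the legacy text-generation-webui API and are an assumption; later builds replaced it with an OpenAI-compatible API on a different port.

```python
import requests

# Assumed legacy text-generation-webui endpoint (enabled via the "api" extension);
# if this refuses to connect or 404s, check the console for the port the API bound to.
resp = requests.post(
    "http://localhost:5000/api/v1/generate",
    json={"prompt": "Write one sentence about embeddings.", "max_new_tokens": 60},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```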
@CallisterPark 1 year ago
Hi @aemonAlgiz - how long did it take to fine-tune stablelm-base-alpha-7b? On what hardware?
@AemonAlgiz 1 year ago
Howdy! Not very long for this, since it was a fairly small fine-tune: about an hour. I use an AMD 7950X3D CPU and an RTX 4090.
@AadeshKulkarni 10 months ago
Which model did you use in OobaBooga?
@aditiasetiawan563 3 months ago
Can you explain the code to convert PDF to JSON? I don't know how you're doing that. It's great, and that's what we need. Thanks in advance!
@bleo4485 1 year ago
Aemon, what time will your live stream be?
@AemonAlgiz 1 year ago
6PM MST :D
@protectorate2823 11 months ago
Hey Aemon, how can I structure my dataset so the model outputs answers in a specific format every time? Is this possible?
@GamingDaveUK 1 year ago
So with Superbooga you could just drop in the file with the Q&A from the book, add an injection point in your prompt, and the LLM has access to the data? That sounds too easy, lol. So say you want OobaBooga to be a storytelling AI: can you add the injection point in that opening prompt, feed it a Q&A made from Stargate scripts, and then have it use that data in responses to set tone and characters?
@AemonAlgiz 1 year ago
Superbooga makes it pretty easy! They have a drag and drop embedding system and it handles the rest for you. It’s not going to be optimal for all use-cases but it works well in general
@srisai00123 3 months ago
Token indices sequence length is longer than the specified maximum sequence length for this model (249345 > 2048). Running this sequence through the model will result in indexing errors. I am facing this issue; please help with a resolution.
@NeuralNet_Ninjas 10 months ago
@AemonAlgiz How do you enable the Superbooga API?
@tatsamui 1 year ago
What's the difference between this and chatting with documents?
@AemonAlgiz 1 year ago
That’s a great question! You can encourage the model to “behave” in a particular way. Though of course you’re not really imbuing the model with knowledge; you’re creating a preference for tokens that satisfy some requirement. For example, if I had enough samples for a solid fine-tune on appeals, it would write appeals in a near-human way. So, combining the influence on the model’s behavior with additional context from documents, you get a more modern version of an expert system. This is a technique we have been using in industry to get models to fulfill very specific use cases.
@Hypersniper05 1 year ago
Think of it as if you were using Bing, but the search results are very specific. This is good for closed domains and very specific tasks. I use it for work as well, on closed-domain data.
@othmankabbaj9960 7 months ago
When training on a dataset like this, it seems the Q&A is too specific to the book. Wouldn't that make the model too specific to the use case you're training for?
@xspydazx 1 month ago
Hmm... I would like to be able to update the LLM by extracting the documents in a folder, extracting the text, and fine-tuning it in. I suppose the best way would be to inject it as a text dump; how, please? I.e., take the whole text and tune a single epoch only, and likewise save my chat history as an input/response dump and tune a single epoch only.

Question: each time we fine-tune, does it take the last layer, make a copy, train the copy, and replace the last layer? Since the model weights are frozen, does this mean they don't get updated? If so, is the LoRA applied to this last layer, essentially replacing it? And if we keep replacing the last layer, do we essentially wipe over the previous training? I have seen that you can target specific layers: how do you determine which layers to target, and then create the config to match those layers?

Question: how do we create a strategy for regular tuning without destroying the last training? Should we be targeting different layers each fine-tune?

Also, why can we not tune it live, i.e., while we are talking to it, discussing with the model and adjusting it as we go? Is adjusting the weights done by autograd in PyTorch with an optimizer such as Adam? With each turn we could produce a loss by supplying the expected outputs and comparing for similarity, and if the output is over a specific threshold, fine-tune once according to that loss, switching between training and evaluation (freezing a specific percentage of the model), essentially working with a live brain. How could we update the LLM through conversation, e.g., by giving it a function (function calling) to execute a single training optimization based on user feedback (positive and negative votes) and the current response chain? And if RAG was used, should the retrieved content be tuned in? Sorry for the long post, but it all connects to the same thing.
@user-bs5xo4nd1t 6 months ago
I am getting this error: "Max retries exceeded. Skipping this chunk."
@JAIRREVOLUTION7 10 months ago
Thanks for your awesome video. If you someday want to work as a mentor for our startup, write me, dude.
@user-hf3fu2xt2j 1 year ago
I still understood literally nothing. What do vector databases have to do with embedding vectors in language models, and how do they get utilized anyway? This video is like: "we mentioned them in adjacent sentences, and this shows they can work together."
@AemonAlgiz 1 year ago
Howdy! I’m happy to try and explain anything that’s not clear. Where are things not making sense?
@user-hf3fu2xt2j 1 year ago
@@AemonAlgiz The whole thing, the entire pipeline, especially for the Q&A purpose. Like, if I have a huge document put into a vector database, an embedding for a question about this document can very well be really far away from any relevant vector in the database, thus making the chances of getting a relevant vector from the database smaller. If this vector affects further model generation, then we won't get an answer to this question. It's also not clear how exactly this vector gets used within the model anyway. Is it concatenation? Is it used as a bias vector? Or is it a soft prompt?
@AemonAlgiz 1 year ago
@@user-hf3fu2xt2j This is a great question! This is why we have the tags around different portions of the input, mainly to control the documents that are queried for. Since we can wrap the input, we have explicit control over which portion of the input text gets embedded for the query. Does that make more sense? Also, the way we chunk inputs helps prevent retrieving portions of the document that aren't relevant. The way I embedded in this example was naive, though we can use very intricate chunking methodologies to get a higher assurance of topical density.
@user-hf3fu2xt2j 1 year ago
@@AemonAlgiz In that case, if we need explicit control over which documents or portions of documents are queried, the queries in question look more like queries to old-fashioned databases and less like questions to a language model, with a lot of manual labour and engineering knowledge required to make fruitful requests.
@caseygoodrich9717 1 year ago
There's a lip-sync issue with your audio.
@pedro336 10 months ago
did you skip the training process?
@stephenphillips8782 6 months ago
I am going to get fired if you don't come back
@vicentegimeno6806 9 months ago
Hi, I'm new to Python and getting an error related to the token sequence length exceeding the maximum limit of the model; could you please help me solve the problem? ERROR: Token indices sequence length is longer than the specified maximum sequence length for this model (194233 > 2048). Running this sequence through the model will result in indexing errors. 2023-08-24 10:41:54.890169: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
@SamuelJohnKing 9 months ago
Would also love an answer to the token indices issue.
@fndTenorio 10 months ago
So in the embedding approach, the embeddings are just additional information that is injected into the prompt itself? In other words, the fine-tuned model knows how to do something, but I can use extra help (the embedded info) to generate a better prompt? If so, we are optimizing the prompt, right? Thanks for the video!
@leemark7739 1 year ago
UnboundLocalError: local variable ‘iter’ referenced before assignment
@leemark7739 11 months ago
How can I solve my problem?
@linuxbrad 9 months ago
Wasted 10 minutes to find out you're using an API, "oogabooga", instead of actually telling us how.