LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101

54,936 views

James Briggs

Comments: 101
@jamesbriggs · 1 year ago
The LangChain docs have moved, so the original wget command in this video will no longer download everything; now you need to use: !wget -r -A.html -P rtdocs python.langchain.com/en/latest/
@mohitagarwal9007 · 1 year ago
This isn't downloading everything either. Is there anything else we can use to get the necessary files?
@jamesbriggs · 1 year ago
@@mohitagarwal9007 yes, I have created a copy of the docs on Hugging Face here: huggingface.co/datasets/jamescalam/langchain-docs-23-06-27 You can download it with `pip install datasets` followed by:

```python
from datasets import load_dataset

data = load_dataset('jamescalam/langchain-docs-23-06-27', split='train')
```
@deniskrr · 10 months ago
@@mohitagarwal9007 just go to the above link and see where you're getting redirected to now. Then copy the link from the browser to the wget command and it should always work.
@jamesbriggs · 1 year ago
if the code isn't loading for you from video links, try opening in Colab here: colab.research.google.com/github/pinecone-io/examples/blob/master/generation/langchain/handbook/xx-langchain-chunking.ipynb
@RamasubramaniamM · 1 year ago
Chunking is one of the most important ideas, and it is largely ignored. Thanks James, love your technical depth.
@ADHDOCD · 1 year ago
Great video. Finally somebody who goes in depth into data prep. I've always wondered about unnecessary (key, value) pairs in JSON files.
@fgfanta · 1 year ago
I need to chunk text for retrieval augmentation, did a search on YouTube, and found... James Briggs' video. I know I will find what I need in it. Nice!
@dikshyakasaju7541 · 1 year ago
Thank you for sharing this informative video showcasing LangChain's powerful text chunking capabilities using the RecursiveCharacterTextSplitter. Previously, I had to write several functions to tokenize and split text while managing context overlap to avoid missing crucial information. However, accomplishing the same task now only requires a few lines of code. Very impressive.
@grandplazaunited · 1 year ago
Thank you for sharing your knowledge. These are some of the best videos on LangChain.
@MaciekMorz · 1 year ago
I have seen a lot of materials on how to store embeddings in Pinecone vector db. But I haven't seen any tutorial yet on how to store vectorstores with different embeddings of different users in one index. I.e. how to retrieve embeddings depending on which user they belong to. What would be the best strategy for this, whether through metadata or something else? It would be great to see a tutorial on this especially using langchain although it seems to me that the current wrapper doesn't really allow this. BTW. The whole series with langchain is great!
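One common answer to the question above is metadata filtering: tag every vector with the owning user's ID at upsert time and filter on it at query time. A minimal sketch of the idea; the field name `user_id`, the payload shape, and the commented query call are illustrative, not from the video:

```python
# Sketch: multi-tenant retrieval in a single vector index via metadata filtering.
# Only the filter/payload construction is shown; the actual index calls are
# illustrated in comments.

def build_user_filter(user_id: str) -> dict:
    # Pinecone-style metadata filter: match only vectors owned by this user
    return {"user_id": {"$eq": user_id}}

def upsert_payload(vec_id: str, values: list, user_id: str) -> dict:
    # Attach the owner to each vector when it is upserted
    return {"id": vec_id, "values": values, "metadata": {"user_id": user_id}}

# At query time you would pass the filter, e.g.:
# index.query(vector=query_vec, top_k=5, filter=build_user_filter("alice"))
```

The same pattern works through LangChain retrievers that accept a `filter` argument, though the exact parameter name varies by wrapper version.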
@harleenmann6280 · 10 months ago
Great video series. Appreciate you sharing your thought process as we go; this is the part most online tech content creators miss. They cover the how, and more often than not miss the why. Thanks again. Enjoying all the videos in this playlist.
@videowatching9576 · 1 year ago
Fascinating channel, thanks! Remarkable to learn about LLMs, how to interact with LLMs, what can be built, and what could be possible over time. I look forward to more.
@BrianStDenis-pj1tq · 11 months ago
At first, it seemed like you switched from tiktoken len to char len of your chunks, when explaining RecursiveCharacterTextSplitter. That wasn't going to work, so I went back and found that you did show, maybe not so much explain, that the splitter is using the tiktoken len function. Makes sense now, thanks!
@hashiromer7668 · 1 year ago
Wouldn't chunking lose information about long term dependencies between passages? For example, if a term is defined in the start of document which is used in the last passage, this dependency won't be captured if we chunk documents.
@jamesbriggs · 1 year ago
yes, this is an issue with it, if you're lucky and using a vector db with returning 5 or so chunks, you might return both chunks and then the LLM sees both, but naturally there's no guarantee of this - I'm not aware of a better approach for tackling this problem with large datasets though
@bobjones7274 · 1 year ago
@@jamesbriggs Somebody on another video said the following, is it relevant here? "You could aggregate chunk togethers asking the LLM to summarize and group them in "meta chunks", you could repeat the process until all years are contained into a single max limit tokens batch. Then, with the meta Data, you'll be able to perform a much more powerful search over the corpus, providing much more context to your LLM with different level of aggregation."
@rodgerb2645 · 1 year ago
@@bobjones7274 sounds interesting, do you remember the video? Can you provide the link? Tnx
@astro_roman · 1 year ago
@@bobjones7274 link, please, I beg you
@JOHNSMITH-ve3rq · 1 year ago
I’ve seen this in many places but where has it been implemented??
@SuperYoschii · 1 year ago
Thanks for the content James! I think they changed something when downloading the htmls with wget. When I run the colab, it only downloads a single index.html file
@muhammadhammadkhan1289 · 1 year ago
You always know what I am looking for. Thanks for this 🙏
@jamesbriggs · 1 year ago
glad it helps!
@redfield126 · 1 year ago
Thank you James for the in depth explanation of data prep. Learning a lot with your videos.
@alvinpinoy · 1 year ago
Very helpful and very well explained. Thanks for sharing your knowledge about this! LangChain really feels as the missing glue between the open web and all those new AI models popping up.
@jamesbriggs · 1 year ago
yeah it's really helpful
@SnowyMango · 1 year ago
This was great! I made the terrible mistake of chunking without considering this simple math, and embedded and indexed into Pinecone at the larger size. Now I have to go redo them all after realizing that at their current sizes they aren't quite suitable for LangChain retrieval.
1 year ago
Great content once again, thanks for sharing. I wish I had this a couple weeks ago :D
@jamesbriggs · 1 year ago
any time!
@mrchongnoi · 1 year ago
You talk about adding context. Where can I get information on adding context? Sorry if it is a remedial question.
@siamhasan288 · 1 year ago
Ayo, I was literally looking for how to prepare my data for the past hour. Thank you for making these.
@eRiicBelleT · 1 year ago
In my case the last two weeks xD
@codecritique · 5 months ago
Thanks for the tutorial, really clear explanation!!
@eRiicBelleT · 1 year ago
Uff the video that I was expecting! Thank youuu!
@mintakan003 · 1 year ago
I played with this a while ago in LangChain. My impression is that in order to do Q&A on documents, one has to do a sequential scan. Every chunk has to be read in. Wouldn't this be prohibitively expensive for a large document set? I know there are vector databases (indices) which can do a pre-screen based on vector similarity. This would be an improvement. But it still involves a sequential scan, now at the vector level. Are there attempts to address this problem? Perhaps parallelism may be one part of the solution?
@jamesbriggs · 1 year ago
it isn't a sequential scan with (most, maybe all) vector DBs, they use approximate search, so the answer is approximated and not everything is fully compared - a good vector db will make this approximation very accurate (like 99% accuracy)
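For intuition about the cost that approximate indexes avoid, here is a pure-Python sketch of the exact (brute-force) baseline: score every stored vector against the query and sort. The data and function names are illustrative; real vector DBs replace this O(N) scan with approximate structures such as HNSW or IVF:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_top_k(query, vectors, k=2):
    # O(N) exhaustive scan: score every stored vector, then sort.
    # This is exactly the cost approximate nearest-neighbor indexes
    # trade away for a small (e.g. ~1%) loss in recall.
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```

An approximate index returns (almost always) the same top-k without touching every vector, which is why large indexes stay fast.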
@AlexBego · 1 year ago
James, I should say Thank You a Lot for your interesting and so useful videos!
@jamesbriggs · 1 year ago
you're welcome, thanks for watching them!
@rodgerb2645 · 1 year ago
Amazing James, I've learned so much from you!
@jamesbriggs · 1 year ago
Awesome to hear :)
@matheusrdgsf · 1 year ago
James you are helping a lot in my activities. Thank you.
@jamesbriggs · 1 year ago
glad to hear!
@Sergedable · 1 year ago
Nice job. Also, it would be great if you could make a video on how to combine multiple documents, for example doc1, doc2, doc3, etc., and use LangChain and GPT-4 to analyze them.
@henkhbit5748 · 1 year ago
Thanks James, for sharing this information.👍 I always thought that the 4k token limit for chatgpt-turbo was independent for the input and the output completion, not combined...
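Because the prompt (including retrieved chunks) and the completion share one context window, it helps to budget explicitly. A quick worked example; all the numbers are illustrative choices, not fixed rules:

```python
# Token budgeting for a 4k-context model (numbers are illustrative).
context_window = 4096        # total tokens shared by prompt + completion
reserved_for_answer = 1000   # room left for the model's output
prompt_template = 200        # instructions, question, formatting
chunk_size = 400             # tokens per retrieved chunk

# What remains for retrieved context after reserving output and template space
available_for_chunks = context_window - reserved_for_answer - prompt_template
max_chunks = available_for_chunks // chunk_size
print(max_chunks)  # how many retrieved chunks fit in the prompt
```

So with these assumptions, roughly seven 400-token chunks fit; shrinking the chunks or the reserved answer space changes the arithmetic directly.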
@mohammadsunasra · 1 year ago
So James, what you mean is: it will first split based on the first separator, then check whether the number of tokens > chunk size, and if so, split again on the next separator, until the number of tokens < chunk size. Right?
@murphp151 · 1 year ago
these videos are pure class
@temiwale88 · 1 year ago
I'm @ 12:34 and this is an amazing explanation thus far. Thank you!
@lf6190 · 1 year ago
Awesome! I was just trying to figure out how to do this with the LangChain docs so that I can learn it quicker!
@fraternitas5117 · 1 year ago
James dropping the great content as usual.
@gunderhaven · 1 year ago
Hi James, thanks for sharing your work. In this video, you briefly mention cleaning up the "messy bits" in the plain text page content and that it is not necessary in your estimation. Could you suggest an approach to clean up those messy bits to some degree? Thanks in advance.
@ayushgautam9462 · 1 year ago
Are you using a Jupyter notebook or working in Google Colab? And how can I run this code in VS Code, if possible?
@ChronicleContent · 1 year ago
I am kind of clueless and don't know much about any of this, but why are we doing this? Don't you think that in the future ChatGPT or others will use the live internet and have the information available? And also have bigger limits? I am trying to understand the vision behind this. Or is it just for now, to be able to "bypass" the limits and use it on up-to-date material until they find a way to have a live-trained model? Sorry if this sounds totally clueless.
@calebmoe9077 · 1 year ago
Thank you James!
@nazimtairov · 1 year ago
Thanks for the tutorial. How can the text, after splitting into chunks, be processed further with LLMChain? I'm getting an error from the OpenAI API:

```python
chain = LLMChain(llm=llm, prompt=chat_prompt, verbose=True)
chain_result = chain.run({
    'source_code': python_code,
    'target_tech': 'python',
    'source_tech': 'Go'
})
```

"This model's maximum context length is 8193 tokens. However, your messages resulted in 13448 tokens. Please reduce the length of the messages."
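One way around that error is to split the input and run the chain once per chunk instead of sending the whole source at once. A minimal pure-Python sketch of the idea, using a whitespace word count as a stand-in for a real token counter like tiktoken; the function name and budget are illustrative:

```python
def split_by_budget(text, max_units=2000, length_fn=lambda s: len(s.split())):
    # Greedily pack paragraphs into chunks that stay under the budget,
    # so each chain call sees a prompt within the model's context limit.
    chunks, current = [], []
    for para in text.split("\n\n"):
        candidate = "\n\n".join(current + [para])
        if current and length_fn(candidate) > max_units:
            chunks.append("\n\n".join(current))
            current = [para]
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Then run the chain once per chunk and stitch the results together, e.g.:
# results = [chain.run({'source_code': c, 'target_tech': 'python',
#                       'source_tech': 'Go'}) for c in chunks]
```

For code specifically, splitting on function or class boundaries instead of blank lines usually gives the model more coherent units to translate.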
@TomanswerAi · 1 year ago
Nice one James. Demystified that step for me there 👍 As you say if people get this part wrong everything else will underperform
@jamesbriggs · 1 year ago
yeah it's super important
@ewanp1396 · 1 year ago
Great video. What software are you using for the video (as in the notebook with blue background)?
@raypixelz · 1 year ago
Awesome. thank you!
@krisszostak4849 · 1 year ago
Hi James, thanks for your amazing work! I've been playing with this lately and I'm not sure I understand the connection between the tiktoken_len function and the chunk_size and length_function args in RecursiveCharacterTextSplitter. So the question is this: in the RecursiveCharacterTextSplitter, if length_function=len (the default), then chunk_size sets the max number of CHARACTERS in the chunk, but if length_function=tiktoken_len (or any other token counter), then chunk_size sets the max number of TOKENS? Is that correct? Thanks!
@RedCloudServices · 1 year ago
James, I hope I am asking this question correctly. Wouldn't it be cheaper to fine-tune an existing GPT model on your entire custom corpus (i.e. your LangChain docs) and then have a chatbot use your finished fine-tuned LLM hosted on OpenAI?
@GrahamAndersonis · 1 year ago
Is there a best practice for chunking mixed documents that also include tables and images? Are you extracting tables/images (out of the chunk) and into a separate CSV/other file, and then providing some kind of ‘hey llm, the table for this chunk is located in this CSV file’ ? If so, how do you write the syntax for this note (within the chunk) to the LLM? Much appreciation in advance.
@sevilnatas · 4 months ago
I am struggling with a chunking scenario involving PDFs that contain a lot of columnar data in tables, and the primary questions users will ask of the PDF data are answered in those tables. Questions depend on finding a value in the first column and then retrieving the value in a specified column of that row. This means the chunked data needs to maintain the integrity of the table. Any suggestions?
@absar66 · 1 month ago
Any solution? I am struggling with the same.
@sevilnatas · 1 month ago
@@absar66 No silver bullets. I did see a project called Marker, I think, that can take PDFs and convert them to markdown text. If it is effective at translating to markdown, columnar text will probably chunk better as markdown. Anyway, just a thought I was considering trying. If you give it a try, let me know how it goes.
@rishniratnam · 1 year ago
Nice video James.
@kevon217 · 1 year ago
any tips for dealing with datasets that have missing values? doesn’t seem like the various transformer encoding classes have defaults for handling entirely empty strings/values. it’ll still spit out a vector which i assume is just padding tokens?
@younginnovatorscenterofint8986 · 1 year ago
Thanks for the content James. I am trying to build a document conversational assistant using LangChain and Hugging Face, but I keep getting this error: "Token indices sequence length is longer than the specified maximum sequence length for this model (2842 > 512). Running this sequence through the model will result in indexing errors."
@maximchuprynsky7472 · 1 year ago
Hello. I have a question/problem. I have a rather large prompt and it exceeds the token limit. Is there any possibility to split it, as well as the basic information from the PDF file?
@ylazerson · 1 year ago
great video - super appreciated!
@generichuman_ · 1 year ago
I'm training a transformer model from scratch just to get a better intuition for how they work. I'm curious if you know the best way to set up the text dataset so that each text chunk is its own entity and won't bleed over into other chunks. For example, if I have a dataset of stories, when one ends and another begins, I don't want the next story to still have context from the last story. I'm using the Hugging Face tokenizer to implement BPE. I hope this makes sense and I would greatly appreciate any guidance!
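A common approach for the question above is to insert a special end-of-text token between documents when packing them into fixed-length training blocks; the EOS marker lets the model learn where one story ends, and fully preventing cross-document attention additionally requires a mask that resets at each boundary. A sketch of the packing step, with an illustrative token ID:

```python
EOS_ID = 0  # illustrative end-of-text token id; use your tokenizer's actual EOS

def pack_documents(tokenized_docs, block_size):
    # Concatenate documents separated by EOS, then cut fixed-size blocks.
    # The EOS marker is the signal "a new document starts here"; to hard-stop
    # attention at the boundary you would also build a block-diagonal
    # attention mask that resets at each EOS.
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(EOS_ID)
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]
```

In practice many pretraining setups rely on the EOS token alone and accept some bleed-over; document-boundary masking is the stricter option.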
@LucaMainieri68 · 1 year ago
Thank you for your amazing video and all the work you do. I was wondering how to use LangChain to perform data analysis on one or more datasets. Let's say I have leads, sales, and orders datasets. Can I use LangChain to perform some analysis, such as asking which customers placed the last order, or how sales were last month? Can you point me in the right direction? Thanks 🙏
@jimjones26 · 1 year ago
I have a question. I am working on loading documentation for several different technologies into one vector database. I want to use this as an AI development assistant for the tech stack I use to create web applications. I am assuming the way you are categorizing your chunks would be an appropriate way to have these different 'columns' of data within one vector DB?
@jesusperdomo8388 · 1 year ago
Please, is it possible for you to work through the code in Visual Studio Code?
@jacobgoldenart · 1 year ago
Thanks James! About chunking: what about when your documentation has a lot of code examples interspersed throughout the text? Is the recursive text splitter able to work with, say, Python code where retaining whitespace is important?
@jamesbriggs · 1 year ago
it won't distinguish any special difference between normal text and code unfortunately, so it will just split on newlines, whitespace, etc as per usual
@artchess0 · 1 year ago
Hi James, thank you very much for your videos. I have a question. What if we need to pass context to our LLM to translate from one language to another? Is it better to chunk in the smallest sizes or up to the token limit of the request to the model? I'm thinking of processing the chunks in parallel and then joining the translation results together, but I don't know what the best approach is. Thank you in advance.
@alivecoding4995 · 1 year ago
How do you work remotely in VSCode with the notebook on Colab?
@li_tsz_fung · 1 year ago
Is LLaMA with LangChain a thing now? It makes sense to me that we should use open source stuff, so that we can run it locally soon.
@jamesbriggs · 1 year ago
I believe so, but haven't had the chance to check it out yet - for sure, will be focusing more on open source soon
@StephenStrong-x1s · 1 year ago
James, this video (and all your postings) is excellent! Exactly what a long-time developer looking to expand into AI needs to get started! Do you do any lectures at conferences?
@ketangote · 1 year ago
Great Video
@dreamphoenix · 1 year ago
Thank you.
@fraternitas5117 · 1 year ago
Could you make content about Nvidia's NeMo?
@Sunghoon4life · 1 year ago
Does the RecursiveCharacterTextSplitter split the text based on tokens or characters? As per the docs it seems to be character-based, but in the video you said it's token-based. Could you please confirm?
@jamesbriggs · 1 year ago
it's splitting on characters (the "\n\n", "\n", " ", and "" separators), but the length function is based on tokens, so it is kind of doing both: it identifies a satisfactory length based on tokens, but the split itself uses characters
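The logic in this reply can be sketched without LangChain: split on a hierarchy of separators, but measure chunk length with a pluggable length function (here a whitespace word count stands in for `tiktoken_len`). This is a simplified illustration of the idea, not LangChain's actual implementation:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " "),
                    length_fn=lambda s: len(s.split())):
    # If the text already fits the budget (measured by length_fn), keep it whole.
    if length_fn(text) <= chunk_size:
        return [text]
    # Otherwise split on the coarsest separator present, recursing with the
    # remaining (finer) separators on any piece that is still too large.
    for i, sep in enumerate(separators):
        if sep in text:
            chunks = []
            for piece in text.split(sep):
                chunks.extend(recursive_split(piece, chunk_size,
                                              separators[i + 1:], length_fn))
            return [c for c in chunks if c]
    # No separator left: return as-is (the real splitter cuts by character here).
    return [text]
```

Swapping `length_fn` between a character count and a token count is exactly what changes whether `chunk_size` means characters or tokens, while the split points themselves stay character-based.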
@rmehdi5871 · 11 months ago
@@jamesbriggs Does this splitting work on any text? My data, taken, I think, via XML format, has these tags: , and . Should I split on those rather than on "\n\n", "\n", " ", "", or do both, perhaps? What is your recommendation?
@tadavid1999 · 1 year ago
Could anyone help me? I'm trying to use !wget -r -A but it is not recognised as a command. I don't understand where I'm going wrong; as far as I know I have all modules installed as well as the correct permissions. I have tried running this in the terminal of Visual Studio Code, in PowerShell as admin (with ChatGPT converting it to a different format), and as a script importing os. It just isn't working for me, and I am very interested in the practical applications of this. Great video by the way, I like how everything is explained step by step!
@jamesbriggs · 1 year ago
I think it should be recognized as a command, the issue may be that the webpage is outdated, could you try `!wget -r -A.html -P rtdocs python.langchain.com/en/latest/` - also another thought, if you're running in terminal drop the `!`, leaving you with `wget -r -A.html -P rtdocs python.langchain.com/en/latest/`
@tadavid1999 · 1 year ago
@@jamesbriggs I've just figured this out. It's because I'm not Linux-based. This video helped me fix the issue, for anyone wanting to follow along: kzbin.info/www/bejne/nXTVd2uQrZZmrck
@tadavid1999 · 1 year ago
@@jamesbriggs I'm trying to use the wget command to download my own website for context, but it keeps downloading only the first page. Any tips on how I can get it to fetch the rest?
@mohammedsaheer4700 · 1 year ago
Can we pass more than 10,000 tokens into LangChain using chunking?
@jamesbriggs · 1 year ago
Yes you can pass in as many as you like, billions even
@paenget · 1 year ago
Amazing❤
@yourmom-in4po · 1 year ago
For some reason, when I try to download all the HTML files using wget, it only downloads the index.html file. Is there any reason for this? I used the provided Google Colab notebook and nothing :(
@jamesbriggs · 1 year ago
I don't know why that would happen using the same command, may be a system difference I'm not sure - but maybe you can refer to this: www.linuxjournal.com/content/downloading-entire-web-site-wget and try modifying the command as per the info above?
@jamesbriggs · 1 year ago
sorry I realize this is because the webpage for the langchain docs moved, it's actually nothing to do with the command, try this: !wget -r -A.html -P rtdocs python.langchain.com/en/latest/
@yourmom-in4po · 1 year ago
@@jamesbriggs Thank you so much!
@rafaelprudencioleite7291 · 1 year ago
Thanks so much for the video. When I use !wget -r -A.html -P rtdocs link... it downloads only the index.html page. I tried in the terminal and it won't work either. Is there a way to handle that?
@Clubcloudcomputing · 1 year ago
Looks like the website changed and does a redirect to a different domain, hence you get only one file. Instead, index the domain that it redirects to.