Both LangChain and Llama Index have added Semantic Chunking (level 4) to their libraries.
LangChain: python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker
Llama Index: llamahub.ai/l/llama-packs/llama-index-packs-node-parser-semantic-chunking?from=all
@GeorgAubele8 ай бұрын
But the Semantic Chunker in LangChain only works with the OpenAI embedder, doesn't it? What I mean: is there a way to use an embedding model other than the OpenAI embedder?
@DataIndependent8 ай бұрын
@@GeorgAubele No, you can use your own, check out the docs, replace the embeddings engine you use
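To make the "replace the embeddings engine" point concrete, here is a minimal plain-Python sketch of breakpoint-based semantic splitting in which the embedder is just a function argument. It is not LangChain's implementation; `semantic_chunks`, `toy_embed`, and the 95th-percentile default are assumptions chosen for illustration, and `toy_embed` is a keyword-counting stand-in for a real embedding model.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; zero vectors count as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: List[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Vector],  # any embedder can be passed in here
    breakpoint_pct: float = 95.0,
) -> List[str]:
    """Start a new chunk wherever the distance between neighbouring
    sentences exceeds the chosen percentile of all neighbour distances."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, breakpoint_pct)
    chunks: List[str] = []
    current = [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in embedder: counts pet words and finance words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pets = {"cat", "cats", "dog", "dogs"}
    finance = {"stock", "stocks", "bond", "bonds"}
    return [
        float(sum(w in pets for w in words)),
        float(sum(w in finance for w in words)),
    ]
```

Swapping the model then means passing a different callable as `embed`; the splitting logic does not change.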
@stavroskyriakidis48399 ай бұрын
Why did YouTube take so long to recommend me this channel? Incredible work!
@DataIndependent9 ай бұрын
Glad you're here my friend
@AshWickramasinghe11 ай бұрын
First video I came across that actually explains LangChain in enough detail that a layman can understand how it works
@DataIndependent11 ай бұрын
Nice I love that - thank you!
@truthwillout19809 ай бұрын
Thanks for this Greg. I've been looking at agentic chunking for a while and this video really helped me with implementation. Not heard of you before I searched but now subbed. Thanks a lot :)
@DataIndependent9 ай бұрын
Awesome - love it thanks for sharing
@MuhammadFaisal_Iqbal2 ай бұрын
00:01 Splitting large data into smaller chunks improves language model performance
01:52 Exploring the 5 levels of text splitting for retrieval
05:44 Chunking strategy is crucial for data loading and retrieval
07:29 The text splitting method is simple but rigid
11:00 Text splitting with chunk overlap
12:38 Text splitting can be customized using various parameters like overlap, separators, and chunk size.
16:02 Recursive character text splitting infers chunk sizes based on text structure.
17:39 Splitting text into chunks and overcoming splitting issues.
20:51 Level three involves document-specific splitting
22:25 Choosing chunk sizes for text splitting
25:40 Utilize parsers for complex and messy data types.
27:05 Extracting images from PDFs using unstructured data
30:03 Using the GPT for vision to create a human message and retrieve image information
31:39 Group similar items for language model task preparation
34:54 Adding positional reward to hierarchical clustering
36:24 Combining sentences for better comparison
39:30 Exploring the distance between text chunks.
41:06 Chunking up text and not keeping it together in embedding space helps in better understanding.
44:04 Identifying outliers above the breakpoint distance threshold
45:38 Text splitting for retrieval
48:33 Text is split into chunks for retrieval
50:07 Language model understanding of propositions
53:11 Creating a function to extract propositions from text
54:37 Using an agent-like system for chunking paragraphs
57:34 Determining the placement of propositions within chunks using simple prompts
59:12 Creating and adding chunks to the agentic chunker
1:02:27 Text splitting through alternative representations
1:03:56 Using LangChain Expression Language for batch processing
1:07:03 Examining text relationships and graph structures
1:08:42 Greg Kamradt is exploring the overlap between AI and business.
@shankstuv6 ай бұрын
Love this! I'm working with transcripts where semantically generated chunks can be quite large. These chunks need to be further divided to fit the limits of the embedding model. Given this, isn't semantic chunking unnecessary if we ultimately have to recursively break down the larger chunks into smaller ones?
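One way to look at this question: the semantic pass still decides where topics begin and end, and the size limit is only enforced afterwards, inside each topic, so a secondary split never mixes two topics. A minimal sketch of that second pass follows, assuming a character budget stands in for the embedding model's token limit; `enforce_max_chars` is a hypothetical helper, not part of any library.

```python
import re
from typing import List


def enforce_max_chars(chunks: List[str], max_chars: int) -> List[str]:
    """Re-split only the chunks that exceed max_chars, at sentence boundaries.

    Chunks that already fit are passed through untouched. A single sentence
    longer than max_chars is kept whole rather than cut mid-sentence.
    """
    result: List[str] = []
    for chunk in chunks:
        if len(chunk) <= max_chars:
            result.append(chunk)
            continue
        # Split after sentence-ending punctuation followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", chunk)
        current = ""
        for sentence in sentences:
            candidate = f"{current} {sentence}".strip()
            if len(candidate) <= max_chars or not current:
                current = candidate
            else:
                result.append(current)
                current = sentence
        if current:
            result.append(current)
    return result
```

Each resulting piece could carry the ID of its parent semantic chunk as metadata, so that retrieval can still return the whole topic when one piece matches.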
@kenchang345610 ай бұрын
I thought the explanation and showing your experimentation for semantic splitting was creative. Thank you very much.
@rabinbhandari68002 ай бұрын
38:03 Can we reduce the times of embedding we are doing while making semantic chunks ?
@NadaaTaiyab8 ай бұрын
Wow! I hadn't even thought about Agentic Chunking! I need to try this. I did some extensive experimentation with chunking on a project at work for a clinical knowledge base and I found that chunking strategies can make the difference between an ok retrieval and an awesome retrieval that works across a higher percentage of queries.
@zinebbhr65110 ай бұрын
Could attention be used here instead of the embeddings? Input every 2 sentences with overlap into an encoder. Above a certain threshold of "attention" from one sentence to another, have both in the same chunk
@JunYamog10 ай бұрын
Thanks, I was thinking about solving my own retrieval problem. I already have a small, crude proof of concept using just simple chunking, embedding, RAG, etc. Now I need to handle bigger user inputs that come in bigger PDF files. I thought about using agents to get around the context window; your agentic chunker is a good starter and makes intuitive sense. I will try this route.
@BrianRhea10 ай бұрын
Incredible! I love the approach to Semantic Splitting. I'm working on creating AI tools that will analyze customer interviews (i.e. founders or user researchers talking to customers and then using AI for the analysis/synthesis). In those transcripts, there are multiple speakers. I'm incorporating your approach here and trying to find a better way to chunk those transcripts by the topic of conversation. Thanks a ton for sharing your work!
@DataIndependent10 ай бұрын
Awesome, thank you Brian! Love it - I'm doing a ton of work on transcripts as well. This company was just shown to me; it works on user research calls for consultants: www.myjunior.ai/
@robxmccarthy10 ай бұрын
Any tips yet based on your findings? I've also been experimenting with semantic chunking of transcripts with somewhat mixed results.
@chakerayachi84688 ай бұрын
you really deserve that like buttons really thanks for this out of the world content
@JoanApita5 ай бұрын
man it took me 3 weeks to find you. thank you please keep on coming.
@alxcnwy3 ай бұрын
Awesome vid - great work, especially on your semantic chunking approach, love the idea!
@alexeponon32505 ай бұрын
Single and multi hop explained concerning the semantic splitting. Nice !!
@pcebro3 ай бұрын
Clear and concise! Your ability to break down complex concepts into easily digestible information is impressive. As a beginner, I found this video incredibly helpful and I'm grateful for sharing your expertise and talent! 🙏
@connor-shorten11 ай бұрын
Amazing!! I am fascinated by how document specific splitting or the bonus level also ties with how we structure our data schema. E.g. extracting metadata like "Introduction" in level 3 or applying a summary to the podcast and indexing that to then link to the raw clip in the bonus level. All amazing, super useful stuff -- I am a bit skeptical on embedding based splitting though, maybe just need to dive in further! Mostly bullish on level 5: agentic splitting with multimodal llms that kind of blend levels 3 and 5
@DataIndependent11 ай бұрын
Awesome Connor I love the comment!
@adityasankhla14337 ай бұрын
With the continuous influx of short form content, props to you for making this so interesting to watch. Didn't even realise it was an hour long. Loved every second of it. Thanks!
@robxmccarthy10 ай бұрын
Love your videos, especially this one. The information density and presentation is off the charts. It is so altruistic of you to put this out there for free. I am especially interested in the semantic chunking. One use case is transcripts, which often have distinct conversation blocks or question-answer pairs. Since it is important to capture the question and answer for full context, I was wondering what methodology might work best. Alternatively, semantically chunking a document vs pre-defined themes - sort of the opposite direction as the agentic chunker. First generate or define the overarching themes or buckets, then assign chunks to them. It seems that there is some real possibility in the semantic chunking methods. 🎉 Looking forward to experimenting more. Thank you again.
@DataIndependent10 ай бұрын
Nice! For that one I actually recommend a slightly different method to explore. No idea if it'll work better for your use case, but it might. Check out this video where I do topic extraction from podcasts. I bet you could use this method and switch up the prompts a bit to pull out Q&A pairs w/ answers: kzbin.info/www/bejne/pnbOqYWHe7N0qZY
@NehaEjaz29Күн бұрын
For semantic chunking, does the chunking process have to follow a sequential order? What if the document isn't well-structured, such as when topic "x" is introduced at the beginning and then revisited at the end? In this case, it seems like two separate chunks would be created for topic "x," as the chunk would split when another topic begins. Do you think it's possible to combine two related pieces of information in a single chunk, even if they aren't in sequential order within the document?
@Arvolve11 ай бұрын
That was great! Semantic and agentic ideas are definitely a way forward. Branching off that, here's a thought: build a meta-transformer that uses a classic transformer, through multi-head attention, to associate high-dimensional vectors between semantic chunks. That would give more efficient parallel processing, capture more nuanced relations between chunks, and manage the splitting iteratively at the macro level.
GPT formatting:
Proposed Meta Transformer Approach:
Chunk-Level Semantic Analysis: The meta transformer, as you propose, would operate on semantically split chunks, not just individual tokens.
High-Dimensional Semantic Space: Each chunk (sequence of tokens) is mapped onto a high-dimensional semantic space.
Iterative Mapping for Optimal Chunking: Through multi-head attention, the model would iteratively determine the best separation points for these chunks.
@DataIndependent11 ай бұрын
That’s a fun idea - I’d love to see a demo or implementation if you share it out
@stonedizzleful6 ай бұрын
This is an insanely detailed from first principles tutorial. Thank you for taking the time to put this together.
@hensonjhensonjesse7 ай бұрын
I also never thought of purpose built chunking or semantic proxies like that. Adding question hypotheticals as the embedding could be extrapolated to other use cases.
@artislove49110 ай бұрын
Hi Greg, many thanks for the work you put into this and to help all of us learn. Great clarity, depth and tempo! 💪
@DataIndependent10 ай бұрын
Awesome thank you! The tempo part is good to hear because you never know
@mattfarmerai3 ай бұрын
This video is incredible! Thank you for sharing this breakdown of RAG chunking.
@rembautimes880819 күн бұрын
Definitely and positively should make the recommendation list on any ML enthusiast. Well explained and I think the proposition retriever is very interesting as well as the graph document. On a side note, there are some practical issues with large documents. For example a 10K document for a large bank could easily be 300 pages and if you’re unlucky they may tack on 400 pages of agreements as exhibits ( which will be a waste of LLM calls 😢). So if a prop retriever is used or a summary retriever you may have a lot of LLM calls to make. 😅
@DataIndependent19 күн бұрын
Love this! Thank you for the comment
@aarshmehtani54686 ай бұрын
While running the code elements=partition_pdf(filename=filename,strategy="hi_res",infer_table_structure=True,model_name="yolox") in my Jupyter notebook, I encountered errors such as TesseractNotFoundError. If anyone has faced this issue or knows how to solve it, please guide me as soon as possible. Great work, sir. I can confidently say that such a combination of content and explanation is unparalleled in the KZbin world. @DataIndependent
@aarshmehtani54686 ай бұрын
That problem is solved now, but a new one has come up: the code is not working properly because of some version or subclass mismatches. Could you please suggest an alternative method?
@andreyseas11 ай бұрын
Nice vid, Greg! You're on the cutting edge with some of these splitting techniques. Well done. 😎
@DataIndependent11 ай бұрын
Thanks man - they were fun explorations
@bernardo42903 ай бұрын
Could you make a video about comparison performance of different chunking methods?
@Roman-i4k8e6 ай бұрын
Liked this semantic splitting! Cool stuff you´ve done there!! Also agentic chunking. Pretty cool!!!
@nlp_team20242 ай бұрын
Could you share the requirement packages with their versions like requirement.txt. There are lots of dependency issues while running the code
@NuwanChamara-e1e2 ай бұрын
coz this LC version is outdated. now v0.3. this code is older.
@xinxunzeng96396 ай бұрын
Great video! it helped me clarify the past and present of all the chunks. I have a question, in agent chunking, there can be an issue of having too much content on a single topic. In extreme cases, an entire book might be about one topic. How should we further break it down in such situations?
@Jaybearno11 ай бұрын
Hi Greg, thanks for the video. It's awesome to have someone publishing good content who's doing the exact same thing as me. Hope to see more videos on advanced topics like this!
@DataIndependent11 ай бұрын
Awesome thank you Jonathan! What is the domain you're working in?
@drakongames54178 ай бұрын
what the ___. how good can a tutorial be. such a gem of a video. thx for making this. new to ml and found this very helpful
@vigneshpadmanabhan40882 ай бұрын
Do you think, for semantic chunking, it would be good to introduce hierarchical clustering and then do the chunking on that, in place of cosine similarity? What are your thoughts on using clustering there?
@paalhoff6310 ай бұрын
Great video, starting out with naive and easy to understand methods of text chunking, ending up with novel ideas that may point to the future
@DataIndependent10 ай бұрын
Awesome - thank you!
@ahmadzaimhilmi8 ай бұрын
That agentic chunking really does sound like an interesting approach . How can we predefine the topics instead of them being automatically generated?
@yashpokar4 ай бұрын
Best explanation on text splitter
@preetisoni9663Ай бұрын
Thanks Greg. You're awesome. Expecting more topics from you.
@DataIndependentАй бұрын
Love it - thank you
@TitanWellnessCenter285211 ай бұрын
Thanks!
@DataIndependent11 ай бұрын
Woah this is cool - I think its my first tip, I appreciate it and I will be enjoying In-n-out animal style fries with it
@TitanWellnessCenter285211 ай бұрын
You're doing an amazing job. I have really enjoyed the hard work you have put in. Keep it up. @@DataIndependent
@amrohendawi60076 ай бұрын
This is an amazing professional content! it hits the point directly
@MrSawaiz9 ай бұрын
This video should have a million views already. Amazing work
@DataIndependent9 ай бұрын
Thanks again sawaiz - text splitting, not sexy, but it's fun!
@vijaybrock7 ай бұрын
Hi Sir, what is the best chunking method to process the complex pdfs such as 10K reports. 10K reports will have so many TABLES, How to load those tables to vectorDBs?
@salahuddinpalagiri45034 ай бұрын
Hey did you find a way around it? Would love to know your input
@actorjohanmatsfredkarlsson229311 ай бұрын
Great video. Thanks for sharing. The level 5 implementation doesn't rewrite the proposition (e.g. It would still say "He likes walking" not "Greg likes walking"), or am I missing something!? I guess that would be another level of improvement? Any ideas how to implement that rewrite?
@actorjohanmatsfredkarlsson229311 ай бұрын
Ah, the answer seems to be in the bonus part. Use a graph retriever.
@DataIndependent11 ай бұрын
Hey thanks for the comment. The first step of getting the proposition would remove any of the “he likes doing X” Or maybe I’m not understanding the question correctly
@quofintech920023 күн бұрын
Your videos are awesome! Congrats!!
@DataIndependent23 күн бұрын
Love that - thank you for sharing!
@erdoganyildiz6179 ай бұрын
Great content! I have one quick question though, You have specified that typically you go with chunk sizes around 2000-4000 characters. But isn't it a problem for the embedding stage? I believe 4000 characters roughly corresponds around 600-1000 tokens, popular small-sized sentence transformers (for embedding purposes) typically have context size around 512. What am I missing here? How do you meaningfully embed the long chunks? Any suggestions? Thanks in advance.
@mzafarr2 ай бұрын
AMAZING VIDEO Greg! Thank you. I am not that experienced with RAG, wanted to get advice on a project I am working on, I am building an AI Lawyer sort of chatbot, which chunking, retrieval strategy would be best for this use case? (laws in Greek only for Greece only).
@nfaza807 ай бұрын
Theory & Importance of Text Splitting:
Context Limits: Language models have limitations on the amount of data they can process at once. Splitting helps by breaking down large texts into manageable chunks.
Signal-to-Noise Ratio: Providing focused information relevant to the task improves the model's accuracy and efficiency. Splitting eliminates unnecessary data, enhancing the signal-to-noise ratio.
Retrieval Optimization: Splitting prepares data for effective retrieval, ensuring the model can easily access the necessary information for its task.

Five Levels of Text Splitting:

Level 1: Character Splitting
Concept: Dividing text based on a fixed number of characters.
Pros: Simplicity and ease of implementation.
Cons: Rigidity and disregard for text structure.
Tools: LangChain's CharacterTextSplitter.

Level 2: Recursive Character Text Splitting
Concept: Recursively splitting text using a hierarchy of separators like double new lines, new lines, spaces, and characters.
Pros: Leverages text structure (paragraphs) for more meaningful splits.
Cons: May still split sentences if chunk size is too small.
Tools: LangChain's RecursiveCharacterTextSplitter.

Level 3: Document Specific Splitting
Concept: Tailoring splitting strategies to specific document types like markdown, Python code, JavaScript code, and PDFs.
Pros: Utilizes document structure (headers, functions, classes) for better grouping of similar information.
Cons: Requires specific splitters for different document types.
Tools: LangChain's various document-specific splitters, Unstructured library for PDFs and images.

Level 4: Semantic Splitting
Concept: Grouping text chunks based on their meaning and context using embedding comparisons.
Pros: Creates semantically coherent chunks, overcoming limitations of physical structure-based methods.
Cons: Requires more processing power and is computationally expensive.
Methods: Hierarchical clustering with positional reward, finding breakpoints between sequential sentences.

Level 5: Agentic Chunking
Concept: Employing an agent-like system that iteratively decides whether new information belongs to an existing chunk or should initiate a new one.
Pros: Emulates human-like chunking with dynamic decision-making.
Cons: Highly experimental, slow, and computationally expensive.
Tools: LangChain Hub prompts for proposition extraction, custom agentic chunker script.

Bonus Level: Alternative Representations
Concept: Exploring ways to represent text beyond raw form for improved retrieval.
Methods: Multi-vector indexing (using summaries or hypothetical questions), parent document retrieval, graph structure extraction.

Key Takeaways:
The ideal splitting strategy depends on your specific task, data type, and desired outcome.
Consider the trade-off between simplicity, accuracy, and computational cost when choosing a splitting method.
Experiment with different techniques and evaluate their effectiveness for your application.
Be mindful of future advancements in language models and chunking technologies.

Further Exploration:
Full Stack Retrieval website: Explore tutorials, code examples, and resources for retrieval and chunking techniques.
LangChain library: Discover various text splitters, document loaders, and retrieval tools.
Unstructured library: Explore options for extracting information from PDFs and images.
LlamaIndex library: Investigate alternative chunking and retrieval methods.
Research papers and articles on text splitting and retrieval.
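Levels 1 and 2 from the summary above are simple enough to sketch in plain Python. The functions below are illustrative only: `split_fixed` and `split_recursive` are hypothetical names, and the logic approximates the idea behind LangChain's CharacterTextSplitter and RecursiveCharacterTextSplitter without reproducing their exact behaviour (for example, the recursive sketch has no overlap handling).

```python
from typing import List, Sequence


def split_fixed(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Level 1: cut every chunk_size characters, sharing `overlap` characters
    between neighbouring chunks."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def split_recursive(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Level 2: try the coarsest separator first and fall back to finer
    ones only for pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Nothing left to split on: hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            chunks.extend(split_recursive(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

The recursive version keeps paragraphs intact when they fit and only breaks on spaces, then on raw characters, when a paragraph is longer than the budget, which is the "Cons" case noted under Level 2.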
@micbab-vg2mu11 ай бұрын
Another great video - thank you:) In my case I need to try Semantic Splitting and Document Specific Splitting.
@DataIndependent11 ай бұрын
Awesome, thanks Micbab
@awakenwithoutcoffee7 ай бұрын
Hi Greg, appreciate the fantastic breakdown of text splitting for LLMs. Personally, I keep finding that LLMs have trouble retrieving whole chapters or pages. I was wondering why we wouldn't split based on page number instead of paragraph; would that slow down the LLM? I'm tempted to train a local LLM on chunking based on user input.
@DataIndependent7 ай бұрын
You could split on page number, no problem! But then if your content spans pages, then you might lose context. You could store the page number and chapters in the metadata, then query and filter for those later on. Training a local LLM would be fun, but also a ton of work
@awakenwithoutcoffee7 ай бұрын
@@DataIndependent Q: What I am wondering is why we don't use an LLM or custom parser to split a text based on a document's chapters (1-3 pages)? Is it because this is difficult to do? This would increase the chunk size, but wouldn't it also solve our contextual issues? - "You could store the page number and chapters in the metadata, then query and filter for those later on." Q: This sounds like a potential solution, but I am honestly a bit overwhelmed by how to apply it. Is there a specific tutorial that you can direct me to for study? Your work is greatly appreciated, Greg. I consider myself a student who one day hopes to contribute to the community like yourself. Cheers!
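The metadata idea quoted above can be sketched without any vector database: keep the page number and chapter next to each chunk, then filter on those fields before or after similarity search. Everything here is an assumption for illustration (`Chunk`, `build_chunks`, `filter_chunks`, and the `(page, chapter, text)` input shape); real vector stores expose their own metadata filter syntax.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Chunk:
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def build_chunks(pages: List[Tuple[int, str, str]], chunk_size: int) -> List[Chunk]:
    """Split each page's text into fixed-size pieces and tag every piece
    with the page number and chapter it came from."""
    chunks: List[Chunk] = []
    for page_number, chapter, text in pages:
        for start in range(0, len(text), chunk_size):
            chunks.append(
                Chunk(
                    text=text[start:start + chunk_size],
                    metadata={"page": page_number, "chapter": chapter},
                )
            )
    return chunks


def filter_chunks(chunks: List[Chunk], **criteria: Any) -> List[Chunk]:
    """Keep only chunks whose metadata matches every key/value in criteria."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in criteria.items())
    ]
```

With this shape, "give me the whole chapter" becomes `filter_chunks(chunks, chapter="Methods")`, and the chunks themselves can stay small enough to embed well.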
@nihilitymandate60735 ай бұрын
Not a comment on the context, but I think that the style of the thumbnail is very smart. It reminds me of the Wired 5 Levels of Difficulty style. I think if the aesthetic is softer, it can be even more popular.
@DemoP.AUSSEIL-bb1ew11 ай бұрын
Congrats! That's an excellent job! I hope you will continue your work and that more benchmarks will come. I am particularly curious whether the benefit of semantic and agentic chunking gets smaller or larger when code, HTML, or CSV is chunked.
@krishnaprasad58749 ай бұрын
What a great video. It would have taken me forever if I had to research and learn about this on my own. What a life saver. Do you have a video or a good resource about optimizing other RAG hyperparameters and about reranking of chunks?
@DataIndependent9 ай бұрын
Nope not yet, but there is more at FullStackRetrieval.com on RAG in general
@YanasChanell6 ай бұрын
That was really helpful, thank you for information you're sharing!
@DataIndependent6 ай бұрын
Thanks Yanas!
@zugbob8 ай бұрын
Awesome, I'm trying to do a similar thing with semantic chunking on historic chat messages, but every new message that comes in means you have to re-do the chunking. Can you think of a better way of chunking chat message history.
@DataIndependent8 ай бұрын
Instead of redoing all the chunks again, you could try finding which cluster the new embedding is closest to and naively add it to that one?
@zugbob8 ай бұрын
@@DataIndependent Cheers, I did that at first but ended up doing something similar to the percentile method you mentioned. The issue was the overlapping possibly unrelated message threw off the cluster. I get the embedding of each new message and measure the similarity distance between each and then when a new message comes in and if it's > 85 percentile then it splits on that (with a minimum of 4-5 messages in a cluster with overlap).
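The approach described in this reply (split when the distance to the previous message exceeds a percentile of the distances seen so far, but only once the current cluster has a minimum size) can be sketched as a small streaming class. This is a guess at the shape of that solution, not the commenter's code: `StreamingChunker` is a hypothetical name, the overlap between clusters is left out, and embeddings are passed in as plain vectors.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; zero vectors count as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: List[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


class StreamingChunker:
    """Assigns each incoming message to the current chunk or opens a new one,
    without re-chunking the history."""

    def __init__(self, pct: float = 85.0, min_size: int = 4) -> None:
        self.pct = pct
        self.min_size = min_size
        self.distances: List[float] = []   # distances between consecutive messages
        self.chunks: List[List[str]] = [[]]
        self._last_vector: Optional[Vector] = None

    def add(self, message: str, vector: Vector) -> None:
        if self._last_vector is not None:
            dist = cosine_distance(self._last_vector, vector)
            is_break = (
                bool(self.distances)
                and dist > percentile(self.distances, self.pct)
                and len(self.chunks[-1]) >= self.min_size
            )
            self.distances.append(dist)
            if is_break:
                self.chunks.append([])
        self.chunks[-1].append(message)
        self._last_vector = vector
```

Because the threshold is computed only from distances seen so far, the first few decisions are noisy; seeding `distances` from an existing chat history would be one way to stabilise it.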
@nathank51409 ай бұрын
This might just work for my meeting transcripts. It's similar to something David Shapiro did, where knowledge base articles are written and then reserved and updated during a conversation. I like the idea of using propositions and doing this at the article level.
@maria-wh3km4 ай бұрын
Awesome video, thanks so much. It's so informative and clear to follow. Well done.
@henkhbit57488 ай бұрын
Thanks, Excellent video about chunking strategies👍 Question: Can i store the pulled html table using unstructured in a vector database together with a normal text and asking question (RAG)?.
@sticksen11 ай бұрын
So, as you are using LangChain and LlamaIndex - what do you prefer for which task? What are the pros and cons of each? I've also used both and have formed an opinion.
@DataIndependent11 ай бұрын
Nice! They both have pros and cons for different tasks. It's up to the dev w/ what they are most comfortable with
@karthikb.s.k.448611 ай бұрын
Great tutorials . Are there any courses or book written by you. Your explanation is excellent . Thank you. Can you please share the code which was shown in the demo.
@DataIndependent11 ай бұрын
Check out fullstackretrieval.com for the code
@haribattula51875 ай бұрын
I guess semantic search is what already vector databases are supporting.. and i don't find any advantages by doing sentence split, then calculating cosine distance and putting them in same bucket.. am i missing something here?
@Himanshu-gg6vo7 ай бұрын
Hi... Any suggestions on how we can handle large chunks, as some of the chunks have a token length greater than 4k?
@IvanTsvetanov-yq7xu8 ай бұрын
Nice video, next level of chunking! Are you planning to have soon max_chunk_size?
@DataIndependent8 ай бұрын
Hey, nope not at the moment, but it would be cool to add
@selfhosted-lover16 күн бұрын
Thanks for your tutorial. Great for a newbie to RAG.
@RushyNova11 ай бұрын
Great content. FYI - Google’s Gemini models are built to be multi-modal from the outset, so they seem to overcome some of the challenges you mentioned when combining text and images.
@DataIndependent11 ай бұрын
Awesome thanks Rushy - ya, I’m ready for a multi modal embedding model
@saiashwalkaligotla66396 ай бұрын
Hey Greg, thanks for an amazing tutorial, really loved all the strategies. I have a couple of doubts. If we do agentic chunking, won't it cost too much considering the scale of the data we have? I also believe it would be very costly to update every time a new item comes in. And what about retrieval strategies, in particular how to handle retrieval in a QA chat setup? Thank you once again.
@DataIndependent6 ай бұрын
Agentic chunking would definitely cost a lot, but the assumption is that latency and cost will go down so it will become more approachable. Do you mean how to do retrieval when building a chat bot?
@saiashwalkaligotla66396 ай бұрын
@@DataIndependent yes, so i always had this doubt , during a chat we have the responses of user and assistants. Now say after 5 conversations , now if a user asks a question, Now how do we do retrieval of embeddings during this situation .
@hensonjhensonjesse7 ай бұрын
Imagine LangSmith for chunking. Something like the agentic flow you have, but all chunk titles, summaries, etc. are kept and can be used for future tuning and other iterations.
@datauv-asia5 ай бұрын
Loved your channel, could you do one with LangServe, please, thanks.
@Munk-tt6tz8 ай бұрын
Your channel is a gem, thank you!
@ANDREACARLINI-u4i2 ай бұрын
Hi, I would like to know if these methods can be used with a vector database, or does that not change anything? I don't know whether the two are related or not.
@derekcarroll790411 ай бұрын
Buy you a cup of coffee ? How about a Starbucks franchise ! This is some very powerful material.. Looking forward to implementing these into my pipelines ! THANK YOU !
@ccapp338911 ай бұрын
I really enjoyed this thanks. I’ve had good IRL business results with your tiers 2 and 3. I’ve used semantic search quite a bit and my jury is still out on the match score’s reliability to granular levels like decision-making breakpoints. So I would probably find tier 4 still more of an aspirational novelty. I like the concepts of 4 & 5 though on the more distant horizon. As an aside - the term “naive” a lot of folks are using lately in the Langchain llamaindex crowd makes me roll my eyes. It just smells like smug Silicon Valley 20-something (not specifically throwing shade at you I’m seeing it all over the place). If someone is chunking a set of documentation and the content is divided into topics by markdown tags they’d call your “tier 3” implementation naive even if it’s clearly the most practical way to chunk the data and achieve an outcome. I would love to see a term arise to discuss the simple-but-often-practical methods with less negative baggage.
@DataIndependent11 ай бұрын
Nice! Thank you for the solid comment. Totally agree that 4 and 5 are experimental for now. It’s really tough to beat the ROI on recursive character. Definitely open to a new word if one fit better
@ccapp338911 ай бұрын
Perhaps a specific term isn’t even needed? They’re all just different methods that can add value in different scenarios. Some are useful for concept-level education, some are useful for practical implementations today, some are useful for future-state theory crafting 🤷♂️
@ajaykumar-rh2gz3 ай бұрын
This is really amazing video first time I have seen
@oleksandr.brazhii10 ай бұрын
Best chunking video to date.
@datagus9 ай бұрын
Has someone solved this issue when running the function partition_pdf(). I get this error: module 'PIL.Image' has no attribute 'LINEAR'
@DataIndependent9 ай бұрын
I would try upgrading all packages
@MuhammadDanyalKhan9 ай бұрын
@DataIndependent Hi Greg. Which type of splitting would you recommend when working with bank statement, invoices, balance sheets etc.
@tomor388011 ай бұрын
Hi Greg, nice video! As for level 4, did you consider using fine-tuned NLI models? I.e., combine 2 sentences if the model predicts an entailment/neutral relationship?
@DataIndependent11 ай бұрын
I did - but it seemed like way overkill for the tutorial scope. I'd like to explore that another time
@danielvalentine13211 ай бұрын
Fantastic video. Been thinking about level 5, a brilliant way to approach chunking, and I see other applications. Level 4 is clever. Retrieving in synthetic, I believe, will be the standard as time moves on.
@DataIndependent11 ай бұрын
Totally agree
@surajthakkar342011 ай бұрын
Hello Greg, great video! Any chance we can get access to the agentic chunking code?
@DataIndependent11 ай бұрын
It's in this repo! Which you can find at fullstackretrieval.com
@surajthakkar342011 ай бұрын
Thank you for the reply, maybe i'm just dumb but I cannot find link to the repository anywhere after the sign up. I tried to use the search bar as well as CTRL+F. Would be great if you could post it here. @@DataIndependent
@DataIndependent11 ай бұрын
Here ya go github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb @@surajthakkar3420
@surajthakkar342011 ай бұрын
Thank you so much Greg!@@DataIndependent
@AdamTwardoch8 ай бұрын
One useful technique for performing sentence embedding is to apply COREFERENCE RESOLUTION, with a cheap model like Haiku: """Identify pronouns, definite noun phrases, and other referring expressions in the text (for example "it", "he", "she", "they", "this", "that", "those", "their" etc.). Determine the antecedent or the entity to which each referring expression refers. Apply coreference resolution by replacing the referring expressions with their antecedents or a more explicit description of the entity."""
@AdamTwardoch8 ай бұрын
This makes each sentence self-sufficient semantically, and you get huge improvements this way. This is useful for any kind of chunking.
@AdamTwardoch8 ай бұрын
Grammar simplification also may be useful: """Optimize the syntax and grammar. Identify syntactic and grammatical problems such as complex phrases, clauses, and sentence structures, as well as passive voice, embedded clauses, or convoluted sentence constructions. Break these syntactic and grammatical problems down into simpler, more concise sentences, and rephrase them using simpler structures, such as active voice or straightforward subject-verb-object constructions."""
@TalhaJSiam8 ай бұрын
Do you have any example notebook or code for this? Or any way I can contact you?
@jlsachse8 ай бұрын
There is a spacy module for coreference resolution called "neuralcoref" which works without an LLM. I wonder whether neuralcoref + a clustering algorithm, e. g. BERTopic, could replace the use of LLMs and make the process cheaper.
@jessaco.865311 ай бұрын
Another banger hit from Greg! How does he do it. Love this video!
@AmeliaMelia-tj3kc3 ай бұрын
a-true-good-teacher
@shuvobarman929410 ай бұрын
When I try to run the same code for reading tables from a PDF and saving images from a PDF, my kernel shuts down and gives a message that it will restart. How can I overcome this? Thanks
@DataIndependent10 ай бұрын
weird - I haven't seen that one before. I would double check that all packages are up to date
@shuvobarman929410 ай бұрын
I tried the same code in Google Colab, and it works just fine. The issue seems to have been with my Anaconda environment. Thanks a lot for creating such an in-depth video. Learned a lot.
@olivert.71777 ай бұрын
29:13 Now there is GPT-4o ... So you were right with the prediction 😂
@Akimbofmg9_11 ай бұрын
Is there a way to dynamically change the chunk size? I have text that I want to split according to, let’s say, 4 anchors. The 4 anchors have x amount of text in between them. So chunk size can stay constant, and I’m trying to use regex to split the text.
@DataIndependent11 ай бұрын
Check out level 2 and specify your own splitters and then chunk size
@Akimbofmg9_11 ай бұрын
@@DataIndependent I’m sorry, I meant to say chunk size cannot stay constant. This is for API call sequences from Windows executables. They have varied names and argument sizes. But they do have module name, API name, arguments, and return values as constants. But the actual text in each field (args, return value, etc.) can vary according to the specific API.
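One way to get variable-size chunks is to drop the fixed chunk size and split at the anchors themselves with a regex lookahead. A minimal sketch, assuming each record starts with a recognizable label; the `module=`, `api=`, `args=`, `ret=` format below is invented for illustration and real API traces will look different.

```python
import re
from typing import List


def split_on_anchor(text: str, anchor_pattern: str) -> List[str]:
    """Split text so that every chunk begins at a match of anchor_pattern.

    The zero-width lookahead keeps the anchor inside the chunk it starts,
    so chunk length is whatever amount of text sits between two anchors.
    """
    splitter = re.compile(rf"(?={anchor_pattern})", flags=re.MULTILINE)
    return [chunk.strip() for chunk in splitter.split(text) if chunk.strip()]


# Hypothetical trace format, for illustration only.
sample = (
    "module=kernel32 api=CreateFileW args=(path, mode, flags) ret=0x1\n"
    "module=ntdll api=NtClose args=(handle) ret=0\n"
)

# One chunk per API call, each starting at its "module=" anchor.
calls = split_on_anchor(sample, r"^module=")

# Splitting one call further at every field anchor gives one piece per field.
fields = split_on_anchor(calls[0], r"\b(?:module|api|args|ret)=")
```

Because the split points come from the pattern rather than a character count, long argument lists and short ones each stay in a single chunk. If a single field can get very large, a size-based splitter could still be applied to that field afterwards.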
@Sylarleft8 ай бұрын
I feel like my mind was blown, brought together then blown again by 'level 4 - semantic search' part of the video
@DataIndependent8 ай бұрын
Love it! Thanks for the comment
@ultracycling_vik11 ай бұрын
Running the code, it always throws an error when using unstructured --> "No module named 'unstructured_inference.inference.elements' " Anyone solved it?
@caiyu53811 ай бұрын
great lectures, great teacher
@DataIndependent11 ай бұрын
Thanks Caiyu!
@srikanthganta762611 ай бұрын
Thanks greg! Love the long form instructional video :D Greatly appreciated
@DataIndependent11 ай бұрын
Awesome! Glad it worked out
@cag68259 ай бұрын
Great video. Some concepts in it overlap with the RAPTOR paper for RAG
@frimis9 ай бұрын
This video is a piece of art ❤
@DataIndependent9 ай бұрын
Thanks Frimis
@furkandemirturk364610 ай бұрын
wow it has been very long time since I made a comment. This content is outstanding! Thank you for creating such a great video.
@DataIndependent10 ай бұрын
heck ya! Thank you! glad to see you back on the comments
@JelckedeBoer11 ай бұрын
Extremely helpful, thanks for the great tutorial!
@DataIndependent11 ай бұрын
nice! thank you
@無產階級4 ай бұрын
Where can I download this Jupyter notebook directly? I already subscribed to the email list on your personal website.
@vctorroferz7 ай бұрын
amazing video ! very helpful ! thanks !
@DarrenAllatt7 ай бұрын
Human beings are always continuously learning. LLMs should have all the abilities that we have.
@GeorgAubele8 ай бұрын
Thank you very much! Great video!
@AR_733311 ай бұрын
Agent chunking is a paradox. We aim to split the document into concise units to eliminate the noise so that the LLM can generate better answers. But we are asking the LLM to figure out the concise units by dumping all the propositions.
@DataIndependent11 ай бұрын
Thanks for the comment! I take the other side of the argument where the correct chunks are task dependent, and creating those with character based methods is too crude.
@AR_733311 ай бұрын
@@DataIndependent, I agree that creating chunks with a character-based method is a naive approach. But my concern is: won't the LLM have the same difficulty processing all the propositions to group the relevant ones together as it did when the entire document (without chunking) was given to it as context?
@joshlopez77279 ай бұрын
Does anyone have an example of agentic chunking (level 5) in JavaScript?
@DataIndependent9 ай бұрын
I bet you could feed the agentic chunking Python code into Gemini (or Claude 3) and get a pretty good starting point to make it yourself