Semantic Chunking

10,877 views

SwingingInTheHood

A day ago

Comments: 20
@naderjanhaoui583 (10 months ago)
If you use an OCR system like the Adobe PDF Services OCR API, you can easily obtain the semantic schema. Unlike regex, which makes it impossible to detect titles, sections, or other parts of the document, OCR allows you to identify every element in your document, such as tables or lists. This ensures that you have a cleanly parsed document.
@SwingingInTheHood (10 months ago)
Thanks for the info. As someone who has successfully used OCR and regular expressions for decades, I would hardly say regex makes it impossible to detect formatting. Au contraire, that's what it was designed for. However, I find working with PDFs and bookmarks becoming much easier. I would recommend Wondershare PDFelement and Nitro PDF Pro. Both have auto-bookmarking features, Nitro's being the best because you can search by font and text.
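For what it's worth, a minimal sketch of the regex approach (the numbered-heading pattern is a hypothetical convention; a real document needs patterns tuned to its own formatting):

```python
import re

# Hypothetical pattern for numbered headings such as "1. Introduction"
# or "2.3 Methods" in text extracted from a PDF.
HEADING_RE = re.compile(r"^(?P<num>\d+(?:\.\d+)*)\.?\s+(?P<title>[A-Z].*)$")

def find_headings(text):
    """Return (section number, title, depth) tuples for lines that look like headings."""
    headings = []
    for line in text.splitlines():
        m = HEADING_RE.match(line.strip())
        if m:
            depth = m.group("num").count(".") + 1  # "2.3" -> depth 2
            headings.append((m.group("num"), m.group("title"), depth))
    return headings
```

In practice you would maintain several such patterns (all-caps headings, "Chapter N", etc.) and pick the set that matches your document family.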
@mazenlahham8029 (a year ago)
Amazing idea, thanks for sharing ❤
@shikharashish7616 (a month ago)
I am very new to this and having trouble understanding it, but I'm still trying my best. Wanted to ask: will it work for documents with complex layouts as well, such as PDFs with multi-column tables, images, tables that span multiple pages, or tables that have images inside them? I'm developing a RAG-based PDF query system, especially for complex PDFs, and I'm confused about which chunking method is best for my task.
@SwingingInTheHood (a month ago)
The issue in your case is extraction. Semantic chunking is basically organizing the content hierarchy. Typically, this is text. A good PDF-to-text extractor is LlamaParse. It does a very good job of maintaining table structures. As for images, now you are talking multi-modal vectorization, which is beyond my pay grade at this moment. It is possible, but you will need to investigate which vectorizers support it, and how the images need to be submitted.
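As an illustration, here is a minimal sketch of the downstream step, assuming the extractor emits markdown (which LlamaParse can): splitting the output into per-heading chunks so tables stay attached to the section they belong to. The function name is illustrative:

```python
def chunk_by_markdown_headings(markdown_text):
    """Split markdown extractor output into (title, body) chunks, one per
    heading, keeping tables and body text attached to the preceding heading."""
    chunks = []
    title, body = None, []
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            # Flush the previous section (or any preamble before the first heading).
            if title is not None or body:
                chunks.append((title or "Preamble", "\n".join(body).strip()))
            title, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if title is not None or body:
        chunks.append((title or "Preamble", "\n".join(body).strip()))
    return chunks
```

Because a markdown table is just body lines between headings, it lands in one chunk instead of being split mid-table.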
@galdx_ (a year ago)
Did some tests here and also noticed a substantial improvement when using the header-per-chunk approach. I searched for some PDF parsers, but could not find one that recognizes the structure of the document and then parses it accordingly. Did you have any luck with it? I believe this problem might have been solved by someone already.
@SwingingInTheHood (a year ago)
A PDF export program that could export documents according to their hierarchical organization would be a dream come true. But, alas, I have yet to find one. I did make a request to ABBYY to look into it. What I have ended up doing is writing code that reads the header I created to chunk the document, then re-organizes all the chunks in hierarchical order. Now, I can import these text files as "book" nodes into Drupal, where they create their own natural "table of contents". And, using my SolrAI module, I vectorize these nodes from within Drupal and now have some pretty organized content that always knows where it is in the hierarchy.
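A minimal sketch of that re-ordering step, assuming each chunk carries a dotted section-number header (a hypothetical convention like "2.10 Appendix"; the real headers and code are not published):

```python
def order_chunks_hierarchically(chunks):
    """Sort (header, body) chunks by their dotted section number, so that
    "2.2" precedes "2.10" despite plain string order putting "2.10" first."""
    def section_key(chunk):
        number = chunk[0].split()[0]          # "2.10 Appendix" -> "2.10"
        return tuple(int(p) for p in number.split("."))  # -> (2, 10)
    return sorted(chunks, key=section_key)
```

Comparing numeric tuples rather than strings is what makes the hierarchy come out in reading order.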
@galdx_ (a year ago)
@@SwingingInTheHood Yes, it solves the issue, but it is not scalable, right? Maybe there is an opportunity there.
@SwingingInTheHood (a year ago)
@@galdx_ Au contraire, Drupal is the most scalable CMS available today. It is the preferred CMS of enterprise organizations. The reason the updates are queued is so that they can be upserted to the vector store in a more manageable manner. If you had hundreds, even thousands, of updates going on hourly, the only difference would be that they would need to be queued and batched instead of the one-at-a-time system I have now. If this is what you mean.
@sharannagarajan4089 (a year ago)
I'm also looking for a solution where the PDF's hierarchical schema is maintained for chunking.
@SwingingInTheHood (a year ago)
Outside of custom regex code, another method I've found is to use PDF bookmarking. If it's not that large a document, I simply go through and bookmark the individual sections, then use a PDF splitter tool to split the document by section. The tool I've been using is Sejda.com, but there are a few of them out there.
@naderjanhaoui583 (10 months ago)
You can use an OCR system. Contact me if you need help.
@SwingingInTheHood (7 months ago)
If you're up to the coding challenge, in this discussion we have created a roadmap for developing this process yourself: community.openai.com/t/using-gpt-4-api-to-semantically-chunk-documents/715689/
@johnday2631 (a year ago)
link to code repo?
@SwingingInTheHood (a year ago)
Not yet. But I think I will create a Github repo and post the code I have created for my use. I'll add the link here when it is done. Thanks for the suggestion.
@Victor-ww2hx (a year ago)
@@SwingingInTheHood still no repo?
@deftcg (8 months ago)
@@Victor-ww2hx bump
@SwingingInTheHood (7 months ago)
Still no repo, primarily because the current code is part of the embedding pipeline in my existing system. Trying to pull it out to make it standalone is just too big a task at the moment. However, I am thinking about making an API available: community.openai.com/t/using-gpt-4-api-to-semantically-chunk-documents/715689/100?u=somebodysysop

Or, if you're up to the coding challenge yourself, in this discussion we have created a roadmap for developing this process yourself: community.openai.com/t/using-gpt-4-api-to-semantically-chunk-documents/715689/
@tommaso6187 (6 months ago)
Amazing video.