If you use an OCR system like the OCR API of Adobe PDF Service, you can easily obtain the semantic schema. Unlike regex, which makes it impossible to detect titles, sections, or other parts of the document, OCR allows you to identify every element in your document, such as tables or lists. This ensures that you have a cleanly parsed document.
@SwingingInTheHood 10 months ago
Thanks for the info. As someone who has successfully used OCR and regular expressions for decades, I would hardly say regex makes it impossible to detect formatting. Au contraire, that's what it was designed for. However, I find that working with PDFs and bookmarks is becoming much easier. I would recommend Wondershare PDFelement and Nitro PDF Pro. Both have auto-bookmarking features, Nitro's being the best because you can search by font and text.
@mazenlahham8029 1 year ago
Amazing idea, thanks for sharing ❤
@shikharashish7616 1 month ago
I am very new to this and having trouble understanding it, but I'm still trying my best. Will it work for documents with complex layouts as well, such as PDFs with multi-column tables, images or tables that span multiple pages, or tables that have images inside them? I am developing a RAG-based PDF query system, especially for complex PDFs, and I am confused about which chunking method is best for my task.
@SwingingInTheHood 1 month ago
The issue in your case is extraction. Semantic chunking is basically organizing the content hierarchy. Typically, this is text. A good PDF-to-text extractor is LlamaParse. It does a very good job of maintaining table structures. As for images, now you are talking multi-modal vectorization, which is beyond my pay grade at this moment. It is possible, but you will need to investigate which vectorizers support it, and how the images need to be submitted.
@galdx_ 1 year ago
I did some tests here and also noticed a substantial improvement when using the header approach per chunk. I searched for PDF parsers, but could not find one that recognizes the structure of the document and then parses it. Did you have any luck with it? I believe that this problem might have been solved by someone already.
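A minimal sketch of what a per-chunk header could look like. The thread does not spell out the header layout, so the `Document:` / `Section:` format and the `Chunk` and `with_header` names here are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str
    section_path: list[str]  # e.g. ["Article 5", "Section 2"]
    text: str


def with_header(chunk: Chunk) -> str:
    """Prefix the chunk text with a header naming its place in the document."""
    # Hypothetical header layout; the thread does not specify one.
    header = (
        f"Document: {chunk.doc_title}\n"
        f"Section: {' > '.join(chunk.section_path)}"
    )
    return f"{header}\n\n{chunk.text}"


chunk = Chunk(
    doc_title="Employee Handbook",
    section_path=["Article 5", "Section 2"],
    text="Overtime is paid at 1.5x the regular rate.",
)
print(with_header(chunk))
```

The idea is that the header is embedded together with the text, so each chunk carries its own position in the document.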
@SwingingInTheHood 1 year ago
A PDF export program that could export documents according to their hierarchical organization would be a dream come true. But, alas, I have yet to find one. I did make a request to ABBYY to look into it. What I have ended up doing is writing code that reads the header I created to chunk the document, then re-organizes all the chunks in hierarchical order. Now, I can import these text files as "book" nodes into Drupal, where they create their own natural "table of contents". And, using my SolrAI module, I vectorize these nodes from within Drupal and now have some pretty organized content that always knows where it is in the hierarchy.
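The re-ordering step could be sketched roughly as follows. This is not the code described above (which has not been published); it assumes, purely for illustration, that each chunk's header yields a dotted section number such as "1.2.3", and the `section_key`, `order_chunks`, and `outline` names are hypothetical.

```python
def section_key(number: str) -> tuple[int, ...]:
    """Turn '1.10.2' into (1, 10, 2) so sections sort numerically, not as text."""
    return tuple(int(part) for part in number.split("."))


def order_chunks(chunks: list[dict]) -> list[dict]:
    """Return chunks sorted into hierarchical (table-of-contents) order."""
    return sorted(chunks, key=lambda c: section_key(c["section"]))


def outline(chunks: list[dict]) -> list[str]:
    """Build an indented table of contents from the ordered chunks."""
    lines = []
    for c in order_chunks(chunks):
        depth = c["section"].count(".")  # one indent level per dot
        lines.append("  " * depth + f'{c["section"]} {c["title"]}')
    return lines


chunks = [
    {"section": "2", "title": "Benefits"},
    {"section": "1.10", "title": "Overtime"},
    {"section": "1", "title": "Employment"},
    {"section": "1.2", "title": "Hours"},
]
print("\n".join(outline(chunks)))
```

Sorting on integer tuples rather than raw strings keeps "1.10" after "1.2", which a plain text sort would get wrong.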
@galdx_ 1 year ago
@SwingingInTheHood Yes, it solves the issue, but it is not scalable, right? Maybe there is an opportunity.
@SwingingInTheHood 1 year ago
@galdx_ Au contraire, Drupal is the most scalable CMS available today. It is the preferred CMS of enterprise organizations. The reason the updates are queued is so that they can be upserted to the vector store in a more manageable manner. If you have hundreds, even thousands, of updates going on hourly, the only difference would be that they would need to be queued and batched instead of the one-per system I have now. If that is what you mean.
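As a rough illustration of queued, batched upserts (not the SolrAI module's actual code), the sketch below drains a queue in fixed-size batches. The `upsert` callable is a hypothetical stand-in for whatever vector store client is in use, and `drain_in_batches` is a name invented for this example.

```python
from collections import deque
from typing import Callable


def drain_in_batches(
    queue: deque, batch_size: int, upsert: Callable[[list], None]
) -> int:
    """Send queued updates to the vector store in batches; return the batch count."""
    batches = 0
    while queue:
        # Take up to batch_size items off the front of the queue.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        upsert(batch)  # hypothetical: one call per batch instead of one per update
        batches += 1
    return batches


sent = []
pending = deque(f"node-{i}" for i in range(5))
count = drain_in_batches(pending, batch_size=2, upsert=sent.append)
print(count, sent)
```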
@sharannagarajan4089 1 year ago
I'm also looking for a solution where the PDF's hierarchical schema is maintained for chunking.
@SwingingInTheHood 1 year ago
Outside of custom regex code, another method I've found is to use PDF bookmarking. If it's not that large of a document, I simply go through and bookmark the individual sections, then use a PDF splitter tool to split the document by section. The tool I've been using is Sejda.com, but there are a few of them out there.
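To show the logic behind splitting by bookmark, here is a small sketch that turns a list of bookmarks into page ranges. It does not read or write PDFs; it assumes the bookmark titles and 1-based start pages have already been extracted by some tool, and `section_ranges` is a hypothetical helper name.

```python
def section_ranges(
    bookmarks: list[tuple[str, int]], total_pages: int
) -> list[tuple[str, int, int]]:
    """Map (title, start_page) bookmarks to (title, first_page, last_page)."""
    ordered = sorted(bookmarks, key=lambda b: b[1])
    ranges = []
    for i, (title, start) in enumerate(ordered):
        if i + 1 < len(ordered):
            end = ordered[i + 1][1] - 1  # stop just before the next bookmark
        else:
            end = total_pages  # last section runs to the end of the document
        # Two bookmarks on the same page yield a one-page section.
        ranges.append((title, start, max(start, end)))
    return ranges


bookmarks = [("Introduction", 1), ("Methods", 4), ("Results", 9)]
for title, first, last in section_ranges(bookmarks, total_pages=12):
    print(f"{title}: pages {first}-{last}")
```

Each range can then be handed to a splitter tool to produce one file per section.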
@naderjanhaoui583 10 months ago
You can use an OCR system. Contact me if you need help.
@SwingingInTheHood 7 months ago
If you're up to the coding challenge, we have created a roadmap in this discussion for developing this process yourself: community.openai.com/t/using-gpt-4-api-to-semantically-chunk-documents/715689/
@johnday2631 1 year ago
link to code repo?
@SwingingInTheHood 1 year ago
Not yet. But I think I will create a GitHub repo and post the code I have created for my use. I'll add the link here when it is done. Thanks for the suggestion.
@Victor-ww2hx 1 year ago
@SwingingInTheHood Still no repo?
@deftcg 8 months ago
@Victor-ww2hx bump
@SwingingInTheHood 7 months ago
Still no repo, primarily because the current code is part of the embedding pipeline in my existing system. Pulling it out to make it standalone is just too big a task at the moment. However, I am thinking about making an API available: community.openai.com/t/using-gpt-4-api-to-semantically-chunk-documents/715689/100?u=somebodysysop Or, if you're up to the coding challenge, this discussion lays out a roadmap for developing the process yourself: community.openai.com/t/using-gpt-4-api-to-semantically-chunk-documents/715689/