LlamaIndex Webinar: Improving RAG with Advanced Parsing + Metadata Extraction

Views: 5,198


Comments: 16

@SamiSabirIdrissi 6 months ago

Overall I think this is super dope! I can't wait to try this. The capability to increase routing accuracy is wild. Pulling relevant data with high accuracy is extremely important! 💪⚡️

@MatijaGrcic 6 months ago

This was great, thanks for sharing.

6 months ago

Same for the metadata tag generation: another OpenAI GPT wrapper doing generation or mapping, depending on whether the tags are suggested or not. As shown, the best result is obtained with the custom metadata. It means that humans are still needed to do the most difficult and time-consuming task, i.e. defining the custom tasks...😢

@awakenwithoutcoffee 6 months ago

Nah, I can't see this not being automated in the foreseeable future. There are already OCR models with long context memory that are able to create metadata tags. Give it a few months and this metadata "problem" will be solved.

@isle1009 6 months ago

8:16 Does Deasie support languages other than English well, especially Korean?

@SamiSabirIdrissi 6 months ago

Very, very interesting. I feel like this is similar to Azure's Document Intelligence feature?

@awakenwithoutcoffee 6 months ago

Yup, similar to Unstructured / LlamaParse + LlamaExtract (new). There are also new OCR models like ColPali!

@pin65371 6 months ago

It seems to me like this would get much more effective with a graph system? When Jerry was asking about how the data would be retrieved, it seemed like a graph would work well with this. When you ask a question, the LLM would first retrieve the relevant parent metadata and then branch out from there. The advantage is that connections that maybe aren't so obvious with vector search would be very obvious with a graph. Also, with a graph you at least have the visibility to manually go in and see what is going on. I liked that last question as well. It seems like maybe it wasn't something they had thought about, but they might look at how to implement something like that. Tokens are getting so cheap now that it would make sense. Especially if, let's say, you are using the OpenAI 4o-mini model: it's 30 cents for a million output tokens. Just getting it to output some extra metadata would essentially be free and would only make the whole system more efficient in the long run.

@awakenwithoutcoffee 6 months ago

I agree, but it is still too expensive and difficult to fully automate correctly. Let's keep in touch through the comments, as we engineers are looking for production-ready techniques. My take is that GraphRAG is not ready yet, but it might be early next year (for enterprise).

@Anselm243 4 months ago

@@awakenwithoutcoffee Could you explain or provide an example of why GraphRAG is not considered ready yet? I have currently built my own knowledge graph from scratch, which is fully customisable and flexible enough to connect any type of entity or relationship types. I have already implemented tags, but this metadata has sparked new ideas. I'm curious as to why you believe GraphRAG may not yet be ready? Any insights would be greatly appreciated!

@awakenwithoutcoffee 4 months ago

@@Anselm243 Hi Anselm, personally I see potential in GraphRAG, but only for specific types of data/industries that could benefit from extensive entities. The downsides are with indexing and entity generation: if the pool of data changes, you would need to re-create the graph. It is also quite hard to get the correct entities extracted for a use case. More entities != better results. What I personally found more useful is to take the concept of GraphRAG (e.g. entity extraction) and apply it to existing RAG systems, which are far easier and cheaper to manage. We can store entities as metadata and use query filtering techniques to capture broader, global context.
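To make the "store entities as metadata and filter on them" idea concrete, here is a minimal sketch in plain Python. It assumes entities were already extracted upstream; the `Chunk` class, the `entities` metadata key, and `filter_by_entities` are hypothetical names for illustration, not the API of LlamaIndex or any other library.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A text chunk plus free-form metadata (hypothetical structure)."""
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_entities(chunks, required):
    """Keep only chunks whose 'entities' metadata contains every required entity.

    Matching is case-insensitive. In a real pipeline this filter would run
    before (or alongside) the similarity search to narrow the candidate set.
    """
    wanted = {e.lower() for e in required}
    return [
        c for c in chunks
        if wanted <= {e.lower() for e in c.metadata.get("entities", [])}
    ]


# Assumed example data: entities are attached to each chunk at indexing time.
chunks = [
    Chunk("Acme signed a supply contract with Globex.", {"entities": ["Acme", "Globex"]}),
    Chunk("Globex opened a new plant.", {"entities": ["Globex"]}),
    Chunk("Quarterly weather summary.", {"entities": []}),
]

globex_chunks = filter_by_entities(chunks, ["globex"])
both = filter_by_entities(chunks, ["Acme", "Globex"])
```

Most vector stores expose some form of metadata filter that could play the role of `filter_by_entities`; the exact call differs per store, so check its documentation rather than relying on this sketch.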

@Anselm243 4 months ago

@@awakenwithoutcoffee Thanks for sharing, that's a good point. I'll need to carefully consider the best way to decouple the graph of old data if it becomes too large, to avoid having to recreate it entirely. This is why I find metadata so interesting: tagging entities with a 'version' can make it easier to remove outdated nodes and replace them with updated ones. One of the main reasons I prefer GraphRAG over other RAG methods is its ability to retrieve neighboring nodes for additional context, which may not necessarily show up in a traditional similarity-based result.
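The two ideas in this comment, version-tagged nodes and neighbor retrieval, can be sketched with a tiny in-memory graph. This is only an illustration under assumed names (`TinyGraph`, `drop_version`, the `version` metadata key are all hypothetical); a real knowledge graph would live in a graph database or a library of your choice.

```python
from collections import defaultdict


class TinyGraph:
    """Minimal undirected graph with per-node metadata (illustrative only)."""

    def __init__(self):
        self.nodes = {}                # node_id -> metadata dict
        self.edges = defaultdict(set)  # node_id -> set of neighbor ids

    def add_node(self, node_id, **metadata):
        self.nodes[node_id] = metadata

    def add_edge(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbors(self, node_id):
        """Return neighboring nodes that still exist, sorted for stable output."""
        return sorted(n for n in self.edges[node_id] if n in self.nodes)

    def drop_version(self, version):
        """Remove every node tagged with the given version, plus its edges."""
        stale = [n for n, meta in self.nodes.items() if meta.get("version") == version]
        for n in stale:
            del self.nodes[n]
            for other in self.edges.pop(n, set()):
                self.edges[other].discard(n)
        return stale


g = TinyGraph()
g.add_node("acme", version="v2")
g.add_node("report-old", version="v1")
g.add_node("report-new", version="v2")
g.add_edge("acme", "report-old")
g.add_edge("acme", "report-new")

before = g.neighbors("acme")    # context reachable from the entity node
removed = g.drop_version("v1")  # retire outdated nodes without a full rebuild
after = g.neighbors("acme")
```

The point of the version tag is that stale nodes can be found and removed by a metadata lookup, so only the affected part of the graph is touched when the source data changes.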

@awakenwithoutcoffee 4 months ago

@@Anselm243 For sure, keep me updated, as I'm always researching the latest stack. You don't need graph databases to utilize entities/metadata btw; you can store them in the metadata and filter on it.

6 months ago

The example with PyPDF is not correct, as nobody uses PyPDF text extracted per page as-is. Instead, there is post-processing on the raw text. All these startup founders think that we are dummies and propose in their "products" the recipes that we have all been using for months or years without pretending to build a company on top of them. Same for metadata.... Almost nothing new here. 😢😮
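As a rough idea of what such post-processing might look like, here is a minimal sketch that stitches per-page text back together so sentences and hyphenated words that cross a page break are rejoined. It takes a plain list of page strings as input (as a PDF library would return page by page); `stitch_pages` and its heuristics are assumptions for illustration, not the commenter's actual recipe.

```python
def stitch_pages(pages):
    """Join per-page text so content that spans a page break stays together.

    Heuristics (deliberately simple, and will misfire on some documents):
    - a page ending in '-' is treated as a word hyphenated across the break;
    - a page ending in sentence punctuation starts a new paragraph;
    - anything else is assumed to continue on the next page.
    """
    out = ""
    for page in pages:
        text = page.strip()
        if not text:
            continue  # skip blank pages
        if not out:
            out = text
        elif out.endswith("-"):
            out = out[:-1] + text        # rejoin the split word
        elif out.endswith((".", "!", "?", ":")):
            out += "\n\n" + text         # previous page ended cleanly
        else:
            out += " " + text            # sentence continues across the break
    return out


pages = [
    "The parser extracts meta-",
    "data from each page and",
    "stores it.",
    "Next section.",
]
stitched = stitch_pages(pages)
```

Real documents also need header/footer removal and table handling, which this sketch leaves out.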

@awakenwithoutcoffee 6 months ago

You bring up an important point: the part about cross-page context confused me, since Jerry basically didn't know why this was happening. Have you found additional information or techniques yourself? I'm looking for production-ready techniques for metadata extraction. One new alternative approach is ColPali.