Open Source Extraction Service

Views: 18,207

Day ago

Earlier this month we announced our most recent OSS use-case accelerant: a service for extracting structured data from unstructured sources, such as text and PDF documents. Today we are exposing a hosted version of the service with a simple front end.
Key Links:
Hosted Extraction Service: extract.langchain.com/
GitHub Repo: github.com/langchain-ai/langc...
Blog: blog.langchain.dev/open-source-extraction-service/

Comments: 23

@FileBrainPro 3 months ago

Looks very interesting guys. Thanks for sharing!

@erkanmalcok 3 months ago

Great work! Everything looks good except "cauliflower". Please keep them in the book :)

@azerberakanat 3 months ago

@aiexplainai2 3 months ago

This is cool! Just curious how the few-shot learning works under the hood: are the examples passed in as part of the prompt?

@mattbloss6302 3 months ago

AWESOME TOOL

@kenchang3456 3 months ago

Interesting. I have a use case where people order mechanical parts and materials, and I'd like to test extracting the unit of measure (e.g. each, dozen), the quantity, and the length (e.g. 1+3/4", 3/8", or 1 3/4 inch (or in)). I wonder what the extraction will do with the 1 3/4 inch example: will it say you want one 3/4" item, or will it return a 1+3/4" item? Thanks for doing this.
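One way to test this is to normalize whatever length string the extractor returns and compare it against the expected value. The sketch below is only an illustration and is not part of the extraction service: the function name `parse_inches` and the accepted formats are assumptions, and it treats "1 3/4 inch" as the mixed number 1+3/4.

```python
import re
from fractions import Fraction

# Hypothetical helper, not part of the extraction service.
# Accepts forms such as: 1+3/4"   3/8"   1 3/4 inch   1 3/4 in
_PATTERN = re.compile(
    r"""^\s*
        (?:(?P<whole>\d+)(?:\s*\+\s*|\s+))?   # optional whole part, joined by '+' or a space
        (?P<num>\d+)\s*/\s*(?P<den>\d+)       # the fraction itself
        \s*(?:"|inch(?:es)?|in\.?)?\s*$       # optional unit marker
    """,
    re.VERBOSE,
)


def parse_inches(text: str) -> Fraction:
    """Return the length in inches as an exact Fraction."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized length: {text!r}")
    whole = int(match.group("whole") or 0)
    return whole + Fraction(int(match.group("num")), int(match.group("den")))
```

If the service returned "3/4" with a quantity of 1 for the "1 3/4 inch" input, a comparison against `Fraction(7, 4)` would flag the misreading.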

@deeptrikannad 2 months ago

This is awesome! Are these libraries available in JavaScript / TypeScript?

@IamalwaysOK 3 months ago

Thanks for the nice video. Is it possible to create a knowledge graph from a PDF document using this method? Is there a better way to create a knowledge graph? Any guidance on that would be appreciated.

@potatodog7910 3 months ago

Wondering the same

@stevenheymans 3 months ago

Make a pydantic model for Node and one for Relationship, define which relationships you want to capture (e.g. TOC hierarchy, sequence, styling tags, ...), combine both in one graph pydantic model, export that model as a JSON schema response model, and prompt the LLM to generate the nodes and relationships. End with a topological sort to get the content back in the right order.
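The data shapes and the final sorting step described above could be sketched roughly as follows. This is a minimal illustration under assumptions: it uses standard-library dataclasses in place of pydantic so it runs without dependencies, it skips the JSON schema export and the LLM call entirely, and the names `Node`, `Relationship`, `Graph`, `ordered_text`, and the `"precedes"` relationship kind are all hypothetical.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Node:
    id: str
    text: str


@dataclass
class Relationship:
    source: str  # id of the node that comes first
    target: str  # id of the node that follows
    kind: str    # e.g. "precedes"; other kinds are ignored when ordering


@dataclass
class Graph:
    nodes: list[Node] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)


def ordered_text(graph: Graph) -> list[str]:
    """Return node texts in an order that respects every 'precedes' edge."""
    by_id = {node.id: node for node in graph.nodes}
    sorter = TopologicalSorter({node.id: set() for node in graph.nodes})
    for rel in graph.relationships:
        if rel.kind == "precedes":
            # The target may only appear after its source.
            sorter.add(rel.target, rel.source)
    return [by_id[node_id].text for node_id in sorter.static_order()]
```

With pydantic, the same three classes would be `BaseModel` subclasses and the combined graph model would be the one exported as the response schema; the sorting step would stay the same.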

@muhannadobeidat 3 months ago

Nice video, but can you clarify whether this is based on calling GPT APIs, so that you need an API key and enough credits to invoke their services? If so, then what's the point?

@sakibali1265 3 months ago

Hi, can anyone explain to me how this is different from function calling in GPTs, apart from parsing a PDF file?

@fullstackburger 3 months ago

My understanding is that the value is in efficiently pulling data from an unstructured source, as opposed to a structured one such as a W2. Well-known LLMs (like ChatGPT) can do this out of the box to some degree, but there's a lot more that goes into it if you want to do it at scale and accurately.

@waneyvin 3 months ago

great job! can we use this for NER as well?

@LangChain 3 months ago

Thank you! You can extract entities (e.g., people, organizations, or locations) although you will not get exact character spans out. You can instruct the models to output a raw snippet of "evidence" as shown in the recording, though.
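If character spans are needed, one possible workaround is to search the source text for the returned evidence snippet afterwards. The sketch below is an assumption-laden illustration, not something the service provides: `find_spans` is a hypothetical helper, and it only works when the model returns the snippet verbatim.

```python
def find_spans(document: str, evidence: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every exact occurrence of evidence."""
    if not evidence:
        return []
    spans = []
    start = document.find(evidence)
    while start != -1:
        spans.append((start, start + len(evidence)))
        # Continue searching one character further to catch later occurrences.
        start = document.find(evidence, start + 1)
    return spans
```

An empty result means the snippet was paraphrased or altered by the model, in which case a fuzzy match would be needed instead.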

@danielneu9136 3 months ago

You can use an extractive method like this, but there is a better method for doing NER with GPTs. It's described in this paper on arXiv: GPT-NER: Named Entity Recognition via Large Language Models

@bobjones7274 3 months ago

I did not understand what you were doing with the reference to Orange? What was actually achieved in that use case?

@lukeotwell3296 3 months ago

Instead of using a PDF, he gave it text to extract data from. Another approach that could have made it clearer would be to copy and paste the text of the PDF and get the same output.

@stevenheymans 3 months ago

So it's basically just sending all the content to an LLM with a prompt and a JSON schema response model. If you're wondering why you're spending so much on tokens, that's why.

@ethan612 3 months ago

Can we use this to extract HTML locator information from a web page?

@bobjones7274 3 months ago

I understand that you might feel that your audience is people who already understand what you are trying to do, and what you are doing. I believe this is a mistake in a publicly released video. That might be fine for videos which you embed inside documentation, where people have already read everything prior to this point. But you are not making it easy for potential new customers to place what you have done in a use case context, nor shown what they can do with the output.

@Licardo7 3 months ago

I think there’s a space for both. I generally hate when there’s a lot of content for basic functions, but any type of advanced concept is just digging through documentation 🫠

@davidjohnson5635 3 months ago

You can always, I don't know, just not watch the video. Not every video recommended to you is actually for you specifically.