Scrape any website with OpenAI Functions & LangChain

Views: 49,367


Comments: 106

@devlearnllm 1 year ago

Additional details about scraping: scraping only for tags on some sites (like WSJ or CNN) yields the best results. Other sites might be different.
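
To make the tag-filtering idea concrete, here is a minimal sketch that keeps only the text found inside a chosen set of tags. It uses Python's standard-library `html.parser` rather than the Beautiful Soup setup from the video, and the tag names (`h3`, `p`), the `TagTextExtractor` class, and the sample HTML are all placeholders made up for illustration.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside a chosen set of tags (hypothetical helper)."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.depth = 0      # number of wanted tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only while we are inside at least one wanted tag.
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_text(html, wanted_tags):
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up page: navigation and footer text should be dropped.
sample_html = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <h3>Markets rally</h3>
  <p>Stocks rose on Tuesday.</p>
  <footer>Copyright</footer>
</body></html>
"""

print(extract_text(sample_html, ["h3", "p"]))
```

Which tags to keep is site-specific, so the list passed to `extract_text` would need some experimentation per site.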

@stevenwessel9641 7 months ago

I’m working on a couple of AI projects with Malik Yusef, Kanye’s main collaborator and one of Virgil’s first mentors. We should connect, lmk 🙏🏼

@devinschumacher 1 year ago

You are now officially a real youtuber.

@georgesanchez8051 1 year ago

Refreshing to not see some bs clickbait video on LLM-uses. Just a clean, focused, and super differentiated walkthrough-video. Subscribed, and looking forward to more!

@devlearnllm 1 year ago

Thank you. I'm glad that this approach resonates with people.

@emlincharly 11 months ago

This video feels like a coworker showing me something cool. Really good video man!

@alexanderroodt5052 1 year ago

With AI assistance I can scrape hundreds of thousands of products/services a week and now have the facilities to talk to thousands of people at once. I learnt most of it on YouTube from people such as yourself, who are grossly underappreciated. Keep up the good work and thanks for sharing!

@devlearnllm 1 year ago

Thanks for the appreciation then. We love it.

@diegosandoval7462 1 year ago

🎯 Key Takeaways for quick navigation:
02:05 🌐 You can scrape websites using LangChain, OpenAI Functions, Playwright, and Beautiful Soup.
03:55 🧩 OpenAI Functions simplify web scraping by eliminating the need to manually declare HTML tags.
05:20 🛍️ You can use this approach to scrape e-commerce websites and extract specific information like item titles and prices.
15:41 🤖 LangChain simplifies interactions with OpenAI's GPT models for various applications, including information extraction.
23:32 ⚙️ Consider chunking large HTML content and building a FastAPI server to enhance this web scraping tool's capabilities.
Made with HARPA AI

@walkingwchris 1 year ago

Well done for explaining the why so clearly. You had me in the first minute.

@richmadrid9563 1 year ago

This is exactly what I was looking for: a way to scrape websites like a human being, and to do it via scripting. Also, I like how you explain things clearly and how they work. I found this channel by accident and decided to watch it. The next thing I knew, I was a new subscriber!

@mr.gk5 4 months ago

Great stuff, keep on making great AI coding content, you got my sub!

@lukeotwell3296 9 months ago

That's a high quality vid right there.

@meinbherpieg4723 8 months ago

Good video. Thanks for taking the time to explain the nuances in depth. You've got my sub ha

@emmanueladepoju4089 1 year ago

First video and I like this channel already!🙂

@devlearnllm 1 year ago

Just a slight correction: openai_api_key is a property in the llm object, in LangChain. It's not a global variable.

@sandratoolan9598 1 year ago

Good luck dude, just keep doing what you doing.

@techgeekguru 1 year ago

Very cool stuff! Like the style of narration focusing on conveying information in a straightforward and matter-of-fact manner, without overemphasizing or exaggerating.

@tebblesfun 1 year ago

Thank you so much! I didn't know how to implement this and I bumped into your video. Such a saver!

@miltondavilaharjula 1 year ago

Great video!! Thank you for sharing. I liked how you simplified the code and explanation. Your project really makes sense as webpages do change their structure and traditional approach may break due to those changes.

@StudioTatsu 1 year ago

Note: 'kwargs' usually stands for keyword arguments. Normally we say "keyword args". Nice vid. :)

@devlearnllm 1 year ago

Thank you haha. So obvious in hindsight.

@Wildhoneybush1 10 months ago

Great job, you are a real YouTuber and I can tell that you will become very popular. 😮🎉

@jazzzAiman 1 year ago

Yup, straight to the point

@devlearnllm 1 year ago

@zakuro8532 9 months ago

You are the King

@tiagoc9754 11 months ago

11:49 is it safe to remove other tags? It's recommended that web pages contain elements such as section, article, main, menu, header, footer, etc. not to mention h(n), label, span, aria. I know many pages out there don't follow the "correct" syntax, but I suppose especially when we talk about huge websites we'll commonly find those patterns. So removing other tags would not affect the result we expect from the AI integration?

@FlutterDev1337 1 year ago

This is awesome content btw!

@ratnpriyarai4793 4 months ago

It was quite useful for me.

@bigbena23 7 months ago

Great video. I guess modifying this to use a local LLM should be easy, right?

@koleshjr 1 year ago

Amazing Amazing

@SergeyNumerov 4 months ago

I wonder how this would handle dynamic content: as in scraping websites where you have to click stuff to reveal valuable content.

@evansmakuba1631 8 months ago

bro codes in light mode...respects

@HazemAzim 1 year ago

Nice and simple. Thanks

@chadmichaellawson3985 1 year ago

Fire!!

@MK-jn9uu 1 year ago

The beginning started mid sentence. Did I miss where you explained how ai will prevent us from rebuilding the scrape code when the website changes?

@devlearnllm 1 year ago

@augmentos 1 year ago

Why when I try to use functions under the new assistant GPT builder, does it keep telling me it's invalid JSON, and I can't paste Python or JavaScript in there to be able to scrape the web?

@4ram16 9 months ago

I’m on a quest to use an LLM for web scraping without identifying HTML. You gave a lot of valuable background information. You referred to “python things” and talked as though you have experience with NodeJS. Why didn’t you use LangchainJS, Puppeteer and Cheerio? How difficult would it be to rewrite your repo for NodeJS?

@lamboqin2180 1 year ago

Thank you for your video and resource! I am trying to build a web app to find news articles that have different standpoints on a chosen topic. Would this code be a good solution for me to scrape news, or would it be better suited to something else, like scraping websites with tighter security (since it uses Chromium)? I see the waiting time is quite long too. What LangChain solution/module would you recommend for my project?

@devlearnllm 1 year ago

Hey there, the wait time is mostly on the LLM part and not the scraping part. You can definitely use this to scrape news sites. LangChain has the OpenAI Functions extraction chain, which has a nice input parser for extracting. All you have to do is define your schema for scraping, then off you go 🚀
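
For illustration, "defining your schema" can be as small as a dictionary of field names and types. The sketch below assumes a JSON-Schema-like shape and adds a tiny checker for extracted items. The field names are borrowed from an error message quoted elsewhere in this thread, and `validate_item` is a hypothetical helper written for this example, not part of LangChain.

```python
# Hypothetical e-commerce schema in a JSON-Schema-like shape (assumption).
ecommerce_schema = {
    "properties": {
        "item_title": {"type": "string"},
        "item_price": {"type": "number"},
    },
    "required": ["item_title"],
}

# Map schema type names to Python types for a basic check.
# Note: Python treats bool as an int, so True would pass as a "number" here.
_PY_TYPES = {"string": str, "number": (int, float)}


def validate_item(item, schema):
    """Return a list of problems; an empty list means the item fits the schema."""
    problems = []
    for name in schema.get("required", []):
        if name not in item:
            problems.append(f"missing required field: {name}")
    for name, spec in schema["properties"].items():
        if name in item and not isinstance(item[name], _PY_TYPES[spec["type"]]):
            problems.append(f"{name}: expected {spec['type']}")
    return problems


print(validate_item({"item_title": "Desk lamp", "item_price": 19.99}, ecommerce_schema))
print(validate_item({"item_price": "19.99"}, ecommerce_schema))
```

A check like this is cheap to run on every LLM result and catches the common failure where a price comes back as a string.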

@plashless3406 10 months ago

Feeding all the HTML to the LLM might exhaust the LLM's context length pretty quickly.
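
One common way around the context limit is to split the cleaned page text into overlapping chunks and run extraction on each chunk. The sketch below is a plain character-based splitter. The function name `chunk_text` and the default sizes are arbitrary assumptions, and a real setup would more likely count tokens than characters.

```python
def chunk_text(text, chunk_size=4000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that an item cut at a
    chunk boundary still appears whole in one of the two neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Because of the overlap, the same item can be extracted twice from neighbouring chunks, so results usually need de-duplicating afterwards.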

@tiagoc9754 11 months ago

As a dev with a JS background, how has your experience with Python been? Why did you move to Python instead of using LangchainJS? Comparing LangChainJS and LangChain Python, do you miss many features going from one framework to the other? Have you ever faced an issue with JS that you could only solve with Python?

@Guy-Scott 1 year ago

Where are you using the openai function calling functionality? Isn't it so that the openai function calling should call a specific function inside of your program? Or am I missing something?

@abhishekchoudhury 1 year ago

Hey @llmschool! This is very insightful, and it got me wondering if we can extract security ownership from DEF 14A filings. The difficulty is that each filing has a different structure; can the LLM handle that?

@devlearnllm 1 year ago

You can try. Let us know how it goes.

@aamdmn2641 1 year ago

Hi, great video! I've implemented a similar approach and wanted to see yours, which has given me new inspiration, so thank you! Why did you use Python, given that you mentioned you come from the JavaScript/TypeScript world?

@devlearnllm 1 year ago

Yup, my background was in JS / React

@sunilbendre123 11 months ago

Thanks for this. Just a quick question: how do I approach this problem if I have something like 300 website links to scrape?

@onirdutta666 1 year ago

I am getting an error "TypeError: Parameters to generic types must be types. Got {'properties': {'item_title': {'type': 'string'}, 'item_price': {'type': 'number'}, 'item_extra_info."..can u help..Thanks in advance

@devlearnllm 1 year ago

Hey, without looking at your code I'm not sure why that's the case. But I merged my code into LangChain (Python) a couple of weeks ago for this use case, and you can follow the guide here: python.langchain.com/docs/use_cases/web_scraping/

@salmankhandu3819 11 months ago

When we get data from a site and pass it to the LLM for extraction, how can we handle large pages? The data goes to the LLM in chunks, so with large pages some data might be truncated.

@rajnishadhikari9280 7 months ago

Can you do the same using an open-source LLM like Llama 3?

@kamalseriki3201 1 year ago

I've been trying to use this in a Django web app with Celery, but I've been getting coroutine errors. I managed to bypass that with the async_to_sync function, but now the task keeps executing without giving any results. What can I do?

@BarışAytimur-e8x 11 months ago

Why use Playwright? Can't you use Selenium instead?

@AlloMission 1 year ago

Thanks

@Guy-Scott 1 year ago

I also printed out the content in the extract function, which is just plain text. How can OpenAI, with just plain text and a schema, convert that plain text to a JSON file? I mean, how does it know where another news_headline or news_short_summary starts?

@devlearnllm 1 year ago

The OpenAI Functions call is encapsulated in LangChain's chain.
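
To show roughly what such an encapsulated call involves, here is a sketch of turning a schema into an OpenAI-style function definition and parsing the JSON arguments the model sends back. This is not LangChain's actual internals: the function name `information_extraction`, the wrapper key `info`, and the mock reply are all assumptions for illustration, and no API request is made.

```python
import json

# Field names taken from the question above; the schema shape is an assumption.
news_schema = {
    "properties": {
        "news_headline": {"type": "string"},
        "news_short_summary": {"type": "string"},
    },
    "required": ["news_headline", "news_short_summary"],
}


def build_extraction_function(schema):
    """Wrap a per-item schema in a function definition that asks for a list of items."""
    return {
        "name": "information_extraction",  # hypothetical name
        "description": "Extract the relevant items from the passage.",
        "parameters": {
            "type": "object",
            "properties": {
                "info": {"type": "array", "items": {"type": "object", **schema}},
            },
            "required": ["info"],
        },
    }


def parse_function_call(message):
    """The model returns its arguments as a JSON string; decode it into Python objects."""
    arguments = json.loads(message["function_call"]["arguments"])
    return arguments["info"]


# Hand-written stand-in for a model reply, so the sketch runs offline.
mock_message = {
    "function_call": {
        "name": "information_extraction",
        "arguments": json.dumps(
            {"info": [{"news_headline": "Example headline",
                       "news_short_summary": "Example summary."}]}
        ),
    }
}

print(json.dumps(build_extraction_function(news_schema), indent=2))
print(parse_function_call(mock_message))
```

The point is that nothing in your program is "called": the function definition is only a typed template, and the model decides where each item starts and ends while filling it in.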

@thomaslyngesen7221 1 year ago

I find the data returned is not valid, article title does not match their summary for instance. Can you comment a little more on the schemas, like is the naming of items important?

@devlearnllm 1 year ago

Sure, which site are you scraping?

@thomaslyngesen7221 1 year ago

@devlearnllm I get pretty good results with your basic 'news' schema, but nothing with the 'e_commerce' schema, which also seems more detailed. Are you mirroring the item names used on the site you want to scrape?

@devlearnllm 1 year ago

@thomaslyngesen7221 For e-commerce sites, it's quite challenging on the scraping side of things to deliver clean data to the LLM to extract. AppSumo is an easy site to scrape, but Amazon or Best Buy seem more challenging. It'll take some experimentation to get them to work.

@devlearnllm 1 year ago

Make sure to pull my latest code, and only scrape for tag. Then the titles should be accurate. Thanks for pointing this out

@funnyperson4016 1 year ago

If I have a list of URLs to scrape and a website behind a login and password with keywords and overall score and other variables I don’t need, will this be able to scrape all keywords from all URLs into a single csv file?

@SilenceOnPS4 1 year ago

Would you know how to scrape PDF documents (download and sort into files) from a website that has a database that is constantly updating? If this is something you can do, I'd love to have a chat and would pay you for your time. I am a beginner in this realm, and would love to figure this out.

@devlearnllm 1 year ago

For sure. You can reach out to me on LinkedIn: www.linkedin.com/in/haiphunghiem/ Or chat with me on LangChain Canada's Discord: discord.gg/rtKE2g266C (my username is toasted_shibe)

@evolution3658 6 months ago

What is it for? For what purpose?

@Flameandfireclan 1 year ago

Hello sir, I’m building a commercial software. And I want to ask your permission before I use your code. Would it be okay if I cloned your code and used it as a part of my software? (I am very impressed by what you have built that’s why I’m interested in using it myself)

@devlearnllm 1 year ago

For sure. I'm flattered. And thanks for asking as well. Please credit me (my name and this video) if you don't mind.

@Flameandfireclan 1 year ago

@devlearnllm Thanks! I’ll make sure to include your name (author) in the documentation and a link to the video! 🙏

@julianomoraisbarbosa 1 year ago

# til

@HiteshGautam-v6y 9 months ago

Can we scrape a website's deep links as well? For example, scraping the About Us page found from the website's home page. Please post it if you can.

@matheusduzziribeiro5637 10 months ago

I'm trying to scrape wsj but I got this error: "RuntimeError: no validator found for , see `arbitrary_types_allowed` in Config". Do you know what this could be?

@andrew54292 9 months ago

Did you ever figure that out?

@jsfnnyc 1 year ago

Lolz at the neighbor's trash 😄

@devlearnllm 1 year ago

The worst.

@hishamazmy8189 8 months ago

amazing

@atrocitus777 1 year ago

Is this worth doing for data you want to scrape that's behind captchas?

@devlearnllm 1 year ago

I haven't tried that yet, but it probably requires some modifications on the Chromium and scraping side (not the extraction side).

@atrocitus777 1 year ago

@devlearnllm OK, I know there are captcha solution providers like 2captcha, but then there are more advanced solutions offered by Bright Data and Scraper API. There are not a lot of video tutorials about those services, but I think this could be pretty powerful when integrated with something like those tools.

@CarlChristiansen-ps5ov 8 months ago

I tried to post a comment about a problem I ran into, but for some reason it doesn't show up in the comments. Does anyone know why? 😅

@viktorvegh7842 1 year ago

Don't you have problems with website security? I tried to scrape some websites and I got an IP ban.

@devlearnllm 1 year ago

Don't go overboard then lol

@HappyDataScience 1 year ago

if you don't mind please change the theme

@SurajSingh-y3n3e 6 months ago

Bro, I watched a 4-minute ad before getting to the actual video.

@devlearnllm 6 months ago

That's crazy. Let me see if I can change that somehow

@Ryan-yj4sd 1 year ago

Nice video. This is totally unscalable, expensive and very slow. Websites don’t change much. You’re far better off asking the AI to write a good scraping bot rather than feeding in HTML into the bot. 😊

@devlearnllm 1 year ago

For now, everything you said is true, except that websites don't change much: competitors' websites and listings on JS-heavy websites change all the time. Over time, we'll see LLM calls get cheaper and faster. And how different is asking ChatGPT to write a scraping bot from an LLM call?

@Ryan-yj4sd 1 year ago

Feeding in the entire HTML is slow and inefficient. I do some professional scraping, and most of my clients' scrapers run for years with almost no maintenance.

@Ryan-yj4sd 1 year ago

@devlearnllm My suggestion is to use an LLM to update a real scraper on the fly, rather than blindly feeding in 4000 characters of text and asking the LLM to extract. LLM compute scales as O(n^2) with context length, and no cost reduction will solve this issue. So keeping the context length as low as possible is always important.

@devlearnllm 1 year ago

@Ryan-yj4sd I don't know what you mean by LLM context length being O(n^2), but the output length is what determines the amount of time it takes to generate. It doesn't matter if the prompt is long or short. I do like the idea of updating a scraper on the fly though. It might end up needing as much HTML as possible to generate new code or a schema accurately anyway. But you gave me a better idea: what if you still push the HTML to the LLM once, create a scraper or schema (like you said), and keep using it until the website changes? Here's where one can put in an evaluator of some sort (another small LLM call, perhaps?) to check the work of the scraper. If the results are poor (you can determine what's good/not good for the LLM evaluator), then we run the first step again. Thoughts?
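
A minimal sketch of that generate-once, evaluate, regenerate loop might look like the following. Everything here is a stand-in: `fake_generate_scraper` replaces the expensive LLM call that would write a scraper, `fake_evaluate` replaces the evaluator (which could be another small LLM call), and `make_cached_scraper` is a hypothetical name for the wrapper.

```python
import re


def make_cached_scraper(generate_scraper, evaluate):
    """Reuse one generated scraper until the evaluator rejects its output."""
    state = {"scraper": None, "generations": 0}

    def regenerate(html):
        state["scraper"] = generate_scraper(html)
        state["generations"] += 1

    def scrape(html):
        if state["scraper"] is None:
            regenerate(html)               # first run: pay for generation once
        records = state["scraper"](html)
        if not evaluate(records):
            regenerate(html)               # site probably changed: rebuild once
            records = state["scraper"](html)
        return records

    scrape.state = state                   # exposed for inspection
    return scrape


def fake_generate_scraper(html):
    """Stand-in for an LLM call: derive a regex scraper from the page's <h2> class."""
    match = re.search(r'<h2 class="([^"]+)">', html)
    css_class = match.group(1) if match else ""
    pattern = re.compile(rf'<h2 class="{re.escape(css_class)}">(.*?)</h2>')
    return lambda page: pattern.findall(page)


def fake_evaluate(records):
    """Stand-in evaluator: an empty result counts as a failure."""
    return len(records) > 0


scrape = make_cached_scraper(fake_generate_scraper, fake_evaluate)
print(scrape('<h2 class="title">A</h2>'), scrape.state["generations"])
print(scrape('<h2 class="title">B</h2>'), scrape.state["generations"])
print(scrape('<h2 class="headline">C</h2>'), scrape.state["generations"])
```

The trade-off is that the evaluator runs on every page, so it has to be much cheaper than the generation step for this to pay off.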

@Ryan-yj4sd 1 year ago

@devlearnllm The algorithmic complexity is O(n^2). In other words, each token sits in a double loop. Of course the input length matters! I double-checked as well:

For transformer-based models like GPT, the primary computational concern is the self-attention mechanism, whose complexity is primarily influenced by the sequence length. The computational complexity of the self-attention mechanism in a transformer scales as \(O(n^2 \times d)\), where:

- \(n\) is the number of tokens in the sequence.
- \(d\) is the dimension of the model (i.e., the number of features or hidden units at each layer).

The quadratic relationship (\(n^2\)) arises from the pairwise comparisons between tokens when calculating attention scores. For each token, the model computes attention scores with every other token, leading to the quadratic term.

Given this, the time taken by the self-attention calculations will be roughly proportional to the square of the input length (keeping other factors like model dimension and hardware constant). In other words, if you double the length of the input, you might expect roughly a fourfold increase in the time taken by the self-attention calculations.

However, in practice, other factors can influence the total processing time, including hardware efficiency, batch processing, and other parts of the model that don't scale quadratically. Still, the quadratic relationship provides a good rough estimate for the scaling behavior of transformers with respect to sequence length.
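
As a worked instance of the scaling claim above, assuming only the self-attention cost is counted and \(d\) is held fixed (the constant \(c\) and the cost function \(T\) are introduced here purely for illustration):

```latex
\[
T(n) \approx c \, n^2 d
\quad\Longrightarrow\quad
\frac{T(2n)}{T(n)} = \frac{c \,(2n)^2 d}{c \, n^2 d} = 4,
\qquad
\frac{T(4n)}{T(n)} = \frac{c \,(4n)^2 d}{c \, n^2 d} = 16 .
\]
```

So under this simplified model, going from a 1,000-token prompt to a 4,000-token prompt would multiply the attention cost by about 16, not 4.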