Scrape any website with OpenAI Functions & LangChain

Views: 46,801

LLMs for Devs

1 day ago

Comments: 106
@devlearnllm 1 year ago
Additional details about scraping: scraping only for tags on some sites (like WSJ or CNN) yields the best results. Other sites might be different.
@stevenwessel9641 4 months ago
I’m working on a couple ai projects with Malik Yusef, Kanye’s main collaborator and one of Virgil’s first mentors. We should connect, lmk 🙏🏼
@georgesanchez8051 1 year ago
Refreshing to not see some bs clickbait video on LLM-uses. Just a clean, focused, and super differentiated walkthrough-video. Subscribed, and looking forward to more!
@devlearnllm 1 year ago
Thank you. I'm glad that this approach resonates with people.
@alexanderroodt5052 1 year ago
With AI assistance I can scrape hundreds of thousands of products/services a week and now have the facilities to talk to thousands of people at once. Learnt most of it from youtube from people such as yourself who are grossly underappreciated. Keep up the good work and thanks for sharing!
@devlearnllm 1 year ago
Thanks for the appreciation then. We love it.
@emlincharly 8 months ago
This video feels like a coworker showing me something cool. Really good video man!
@diegosandoval7462 1 year ago
🎯 Key Takeaways for quick navigation:
02:05 🌐 You can scrape websites using LangChain, OpenAI Functions, Playwright, and Beautiful Soup.
03:55 🧩 OpenAI Functions simplify web scraping by eliminating the need to manually declare HTML tags.
05:20 🛍️ You can use this approach to scrape e-commerce websites and extract specific information like item titles and prices.
15:41 🤖 LangChain simplifies interactions with OpenAI's GPT models for various applications, including information extraction.
23:32 ⚙️ Consider chunking large HTML content and building a FastAPI server to enhance this web scraping tool's capabilities.
Made with HARPA AI
@richmadrid9563 1 year ago
This is exactly what I was looking for: a way to scrape websites like a human being, and to do it via scripting. Also, I like how you explain things clearly and how they work. I found this channel by accident, and decided to watch it. The next thing I knew, I'm a new subscriber!
@devlearnllm 1 year ago
Just a slight correction: openai_api_key is a property in the llm object, in LangChain. It's not a global variable.
@devlearnllm 1 year ago
Sign up for the upcoming AI Agents Master Course: forms.gle/YuMvqfXo6xXUXaR6A
@walkingwchris 11 months ago
Well done for explaining the why so clearly. You had me in the first minute.
@techgeekguru 1 year ago
Very cool stuff! Like the style of narration focusing on conveying information in a straightforward and matter-of-fact manner, without overemphasizing or exaggerating.
@devinschumacher 11 months ago
You are now officially a real youtuber.
@miltondavilaharjula 10 months ago
Great video!! Thank you for sharing. I liked how you simplified the code and explanation. Your project really makes sense, as webpages do change their structure and a traditional approach may break because of those changes.
@sandratoolan9598 1 year ago
Good luck dude, just keep doing what you doing.
@tebblesfun 11 months ago
Thank you so much! I didn't know how to implement this and I bumped into your video. Such a saver!
@StudioTatsu 1 year ago
Note: 'kwargs' usually stands for keyword arguments. Normally we call it "keyword args". Nice vid. :)
@devlearnllm 1 year ago
Thank you haha. So obvious in hindsight.
@emmanueladepoju4089 9 months ago
First video and I like this channel already!🙂
@meinbherpieg4723 5 months ago
Good video. Thanks for taking the time to explain the nuances in depth. You've got my sub ha
@jazzzAiman 1 year ago
Yup, straight to the point
@devlearnllm 1 year ago
TY
@lukeotwell3296 5 months ago
That's a high quality vid right there.
@tiagoc9754 8 months ago
11:49 is it safe to remove other tags? It's recommended that web pages contain elements such as section, article, main, menu, header, footer, etc. not to mention h(n), label, span, aria. I know many pages out there don't follow the "correct" syntax, but I suppose especially when we talk about huge websites we'll commonly find those patterns. So removing other tags would not affect the result we expect from the AI integration?
@bigbena23 4 months ago
Great video. I guess modifying this to use a local LLM should be easy, right?
@onirdutta666 1 year ago
I am getting an error "TypeError: Parameters to generic types must be types. Got {'properties': {'item_title': {'type': 'string'}, 'item_price': {'type': 'number'}, 'item_extra_info."..can u help..Thanks in advance
@devlearnllm 1 year ago
Hey, without looking at your code I'm not sure why that's the case. But I merged my code to LangChain (Python) a couple weeks ago for this usecase and you can follow the guide here: python.langchain.com/docs/use_cases/web_scraping/
@Wildhoneybush1 7 months ago
Great job, you are a real you-tuber and I can tell that you will become very popular. 😮🎉
@MK-jn9uu 1 year ago
The beginning started mid sentence. Did I miss where you explained how ai will prevent us from rebuilding the scrape code when the website changes?
@SergeyNumerov 24 days ago
I wonder how this would handle dynamic content: as in scraping websites where you have to click stuff to reveal valuable content.
@tiagoc9754 8 months ago
As a dev with a JS background, how has your experience with Python been? Why did you move to Python instead of using LangChainJS? Comparing LangChainJS with LangChain Python, do you miss many features when going from one framework to the other? Have you ever faced an issue with JS that you could only solve with Python?
@lamboqin2180 1 year ago
Thank you for your video and resource! I am trying to build a web app to find news articles that take different standpoints on a chosen topic. Would this code be a good solution for me to scrape news, or would it be better suited to something else, like scraping more tightly secured websites (since it uses Chromium)? I see the waiting time is quite long too. What LangChain solution/module would you recommend for my project?
@devlearnllm 1 year ago
Hey there, the wait time is mostly on the LLM part and not the scraping part. You can definitely use this to scrape news sites. LangChain has the OpenAI Functions extraction chain, which has a nice input parser for extracting. All you have to do is define your schema for scraping, then off you go 🚀
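The schema-first approach described in this reply can be sketched roughly as follows. This is a minimal standard-library illustration, not the author's code: the field names `news_headline` and `news_short_summary` come from other comments in this thread, while the `required` list and the `matches_schema` helper are assumptions added for the example. The LLM call itself is left out; the sketch only shows the shape of a schema and one way to check extracted records locally.

```python
# Hypothetical schema in the JSON-schema-like shape that appears in this thread.
# The "required" list is an assumption made for this example.
news_schema = {
    "properties": {
        "news_headline": {"type": "string"},
        "news_short_summary": {"type": "string"},
    },
    "required": ["news_headline", "news_short_summary"],
}

# Map schema type names to Python types for a simple local check.
_TYPES = {"string": str, "number": (int, float)}


def matches_schema(record: dict, schema: dict) -> bool:
    """Return True if record has every required key and declared types match."""
    for key in schema.get("required", []):
        if key not in record:
            return False
    for key, spec in schema["properties"].items():
        if key in record and not isinstance(record[key], _TYPES[spec["type"]]):
            return False
    return True


# Example: drop malformed records that an extraction step might return.
extracted = [
    {"news_headline": "Example headline", "news_short_summary": "Example summary"},
    {"news_headline": "Missing summary"},
]
valid = [r for r in extracted if matches_schema(r, news_schema)]
print(len(valid))
```

A local check like this is optional, but it gives a cheap signal when the extraction step returns incomplete records.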
@4ram16 6 months ago
I’m on a quest to use an LLM for web scraping without identifying HTML. You gave a lot of valuable background information. You referred to “python things” and talked as though you have experience with NodeJS. Why didn’t you use LangChainJS, Puppeteer and Cheerio? How difficult would it be to rewrite your repo for NodeJS?
@Guy-Scott 11 months ago
Where are you using the openai function calling functionality? Isn't it so that the openai function calling should call a specific function inside of your program? Or am I missing something?
@abhishekchoudhury 1 year ago
Hey @llmschool! This is very insightful, and it got me wondering if we can extract security ownership from DEF 14A filings. The difficulty is that each filing has a different structure; can the LLM handle that?
@devlearnllm 1 year ago
You can try. Let us know how it goes.
@augmentos 10 months ago
Why when I try to use functions under the new assistant GPT builder, does it keep telling me it's invalid JSON, and I can't paste Python or JavaScript in there to be able to scrape the web?
@thomaslyngesen7221 1 year ago
I find the data returned is not valid, article title does not match their summary for instance. Can you comment a little more on the schemas, like is the naming of items important?
@devlearnllm 1 year ago
Sure, which site are you scraping?
@thomaslyngesen7221 1 year ago
@devlearnllm I get pretty good results with your basic 'news' schema, but nothing with the 'e_commerce' schema, which is also more detailed it seems. Are you mirroring the item names used at the site you want to scrape?
@devlearnllm 1 year ago
@thomaslyngesen7221 For ecommerce sites, it's quite challenging on the scraping side of things to deliver clean data to the LLM to extract. App Sumo is an easy site to scrape, but Amazon or Bestbuy seem more challenging. It'll take some experimentation to get them to work.
@devlearnllm 1 year ago
Make sure to pull my latest code, and only scrape for tag. Then the titles should be accurate. Thanks for pointing this out
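Restricting the scrape to chosen tags, as suggested here, might look like the sketch below. The tag name is not given in the comment, so `h2` and `p` are placeholders, and `TagTextExtractor` and `extract_text` are hypothetical names. The video uses Beautiful Soup; this example swaps in Python's built-in `html.parser` so that it runs without extra packages, and it assumes the wanted tags are properly closed.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the wanted tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.depth = 0  # number of wanted tags currently open
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only while at least one wanted tag is open.
        if self.depth > 0 and text:
            self.pieces.append(text)


def extract_text(html: str, wanted_tags) -> str:
    """Return the text found inside the wanted tags, joined by spaces."""
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.pieces)


sample = "<div><nav>Menu</nav><h2>Headline</h2><p>Summary text</p><script>x()</script></div>"
print(extract_text(sample, ["h2", "p"]))
```

Everything outside the chosen tags (navigation, scripts) is dropped before the text reaches the LLM, which also shortens the prompt.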
@plashless3406 7 months ago
Feeding all the HTML to the LLM might exhaust the LLM's context length pretty quickly.
@FlutterDev1337 1 year ago
This is awesome content btw!
@kamalseriki3201 11 months ago
I've been trying to use this in a Django web app using Celery, but I've been getting coroutine errors. I managed to bypass that with the async_to_sync function, but now the task keeps executing without giving any results. What can I do?
@sunilbendre123 8 months ago
Thanks for this. Just a quick question: how do I approach this problem if I have something like 300 website links to scrape?
@Flameandfireclan 11 months ago
Hello sir, I’m building a commercial software. And I want to ask your permission before I use your code. Would it be okay if I cloned your code and used it as a part of my software? (I am very impressed by what you have built that’s why I’m interested in using it myself)
@devlearnllm 11 months ago
For sure. I'm flattered. And thanks for asking as well. Please credit me (my name and this video) if you don't mind.
@Flameandfireclan 11 months ago
@devlearnllm Thanks! I’ll make sure to include your name (author) in the documentation and a link to the video! 🙏
@SilenceOnPS4 1 year ago
Would you know how to scrape PDF documents (download and sort into files) from a website that has a database that is constantly updating? If this is something you can do, I'd love to have a chat and would pay you for your time. I am a beginner in this realm, and would love to figure this out.
@devlearnllm 1 year ago
For sure. You can reach out to me on LinkedIn: www.linkedin.com/in/haiphunghiem/ Or chat with me on LangChain Canada's Discord: discord.gg/rtKE2g266C (my username is toasted_shibe)
@aamdmn2641 1 year ago
Hi, great video! I've implemented a similar approach and wanted to see yours, which has given me new inspiration that I'm very grateful for, so thank you! Why did you use Python, given that you mentioned you are from the JavaScript/TypeScript world?
@devlearnllm 1 year ago
Yup, my background was in JS / React
@HazemAzim 11 months ago
Nice and simple. Thanks
@funnyperson4016 1 year ago
If I have a list of URLs to scrape and a website behind a login and password with keywords and overall score and other variables I don’t need, will this be able to scrape all keywords from all URLs into a single csv file?
@julianomoraisbarbosa 11 months ago
# til
@salmankhandu3819 7 months ago
When we get data from a site and provide it to the LLM for scraping, how can we manage large data? The data will go to the LLM in chunks, so with large data some of it might be truncated.
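One common way to reduce the truncation risk raised here is to split the text into overlapping chunks, so that an item cut off at the end of one chunk appears whole in the next. The sketch below is an illustration under stated assumptions, not the code from the video: `chunk_text` is a hypothetical helper, the 4000-character default echoes a figure mentioned elsewhere in this thread, and the 200-character overlap is an arbitrary choice.

```python
def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk shares `overlap` characters with the previous one.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small demonstration with tiny sizes so the overlap is visible.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Because chunks overlap, the same item can be extracted twice, so results from different chunks usually need deduplicating before they are merged.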
@zakuro8532 5 months ago
You are the King
@evansmakuba1631 5 months ago
bro codes in light mode...respects
@BarışAytimur-e8x 7 months ago
why use playwright? can't you use selenium instead?
@Guy-Scott 11 months ago
I also printed out the content in the extract function, which is just plain text. How can OpenAI convert that plain text to a JSON file with just plain text and a schema? I mean, how does it know where another news_headline or news_short_summary starts?
@devlearnllm 11 months ago
The OpenAI Functions call is encapsulated in LangChain's chain.
@priyasharma1290 5 months ago
I tried the same CNN website that is in the code, but I'm not getting any data to send to the LLM.
@AlloMission 9 months ago
Thanks
@koleshjr 1 year ago
Amazing Amazing
@ratnpriyarai4793 25 days ago
It was quite useful for me.
@matheusduzziribeiro5637 7 months ago
I'm trying to scrape wsj but I got this error: "RuntimeError: no validator found for , see `arbitrary_types_allowed` in Config". Do you know what this could be?
@andrew54292 6 months ago
Did you ever figure that out?
@HiteshGautam-v6y 6 months ago
Can we scrape deep links of a website as well? For example, scrape the About Us page of a website that was found from its home page. Please post it if you can.
@chadmichaellawson3985 11 months ago
Fire!!
@rajnishadhikari9280 4 months ago
Can you do the same using an open-source LLM like Llama 3?
@evolution3658 3 months ago
What is it for? For what purpose?
@CarlChristiansen-ps5ov 5 months ago
I tried to post a comment about a problem I ran into, but for some reason it doesn't show up in the comments. Does anyone know why? 😅
@jsfnnyc 1 year ago
Lolz at the neighbor's trash 😄
@devlearnllm 1 year ago
The worst.
@atrocitus777 1 year ago
is this worth doing for data you want to scrape that's behind captchas?
@devlearnllm 1 year ago
I haven't tried that yet, but it probably requires some modifications on the Chromium and scraping side (not the extraction side).
@atrocitus777 1 year ago
@devlearnllm OK. I know there are captcha solution providers like 2captcha, but there are also more advanced solutions offered by Bright Data and Scraper API. There are not a lot of video tutorials about those services, but I think this could be pretty powerful when integrated with tools like those.
@viktorvegh7842 1 year ago
Don't you have problems with website security? I tried to scrape some websites and got an IP ban.
@devlearnllm 1 year ago
Don't go overboard then lol
@HappyDataScience 1 year ago
if you don't mind please change the theme
@hishamazmy8189 5 months ago
amazing
@SurajSingh-y3n3e 3 months ago
Bro, I watched 4 minutes of ads before getting to the actual video.
@devlearnllm 3 months ago
That's crazy. Let me see if I can change that somehow
@Ryan-yj4sd 1 year ago
Nice video. This is totally unscalable, expensive and very slow. Websites don’t change much. You’re far better off asking the AI to write a good scraping bot rather than feeding in HTML into the bot. 😊
@devlearnllm 1 year ago
For now, everything you said is true (except that websites don't change much: competitors' websites and listings on JS-heavy websites change all the time). Over time, we'll see LLM calls become cheaper and faster. And how different is asking ChatGPT to write a scraping bot from making an LLM call?
@Ryan-yj4sd 1 year ago
Feeding in the entire HTML on every call is slow and inefficient. I do some professional scraping, and most of my clients' scrapers run for years with almost no maintenance.
@Ryan-yj4sd 1 year ago
@devlearnllm My suggestion is to use the LLM to make updates to a real scraper on the fly, rather than blindly feeding in 4000 characters of text and asking the LLM to extract. LLM cost in context length is O(n^2), and no cost reduction will solve this issue. So keeping context length as low as possible is always important.
@devlearnllm 1 year ago
@Ryan-yj4sd I don't know what you mean by LLM context length being O(n^2), but the output length is what determines the amount of time it takes to generate. Doesn't matter if the prompt is long or short. I do like the idea of updating a scraper on the fly though. It might end up needing as much HTML as possible to generate new code or schema accurately anyways. But you gave me a better idea: what if you still push HTML to the LLM once, create a scraper or schema (like you said), and keep using it until the website changes? Here's where one can put in an evaluator of some sort (another small LLM call, perhaps?) to check the work of the scraper. If the results are poor (you can determine what's good/not good for the LLM evaluator), then we run the first step again. Thoughts?
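The "generate once, reuse until the evaluator complains" idea in this comment can be sketched as below. All names (`SelfHealingScraper`, `fake_generate`, `fake_evaluate`) are hypothetical, and both the generator and the evaluator are plain-Python stand-ins for what would be LLM calls in practice; the sketch only illustrates the control flow, not a real scraper.

```python
from typing import Callable, Optional


class SelfHealingScraper:
    """Reuse one generated extractor until an evaluator rejects its output."""

    def __init__(self,
                 generate: Callable[[str], Callable[[str], list]],
                 evaluate: Callable[[list], bool]):
        self.generate = generate  # expensive step, e.g. an LLM call (stubbed here)
        self.evaluate = evaluate  # cheap check on the extracted results
        self.extractor: Optional[Callable[[str], list]] = None
        self.regenerations = 0

    def scrape(self, html: str) -> list:
        if self.extractor is not None:
            results = self.extractor(html)
            if self.evaluate(results):
                return results
        # First run, or the cached extractor produced poor results: regenerate once.
        self.extractor = self.generate(html)
        self.regenerations += 1
        return self.extractor(html)


def fake_generate(html: str) -> Callable[[str], list]:
    # Stand-in for "ask the LLM to write a scraper": pick the delimiter the page uses.
    sep = "|" if "|" in html else ";"
    return lambda page: [part for part in page.split(sep) if part]


def fake_evaluate(results: list) -> bool:
    # Stand-in evaluator: a healthy scrape yields more than one item.
    return len(results) > 1


scraper = SelfHealingScraper(fake_generate, fake_evaluate)
print(scraper.scrape("a|b|c"))  # first call generates the extractor
print(scraper.scrape("d|e"))    # cached extractor is reused
print(scraper.scrape("f;g;h"))  # "site changed": evaluator fails, extractor is regenerated
print(scraper.regenerations)
```

The expensive generation step runs only when the cheap check fails, which is the cost trade-off being debated in this thread.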
@Ryan-yj4sd 1 year ago
@devlearnllm The algorithm complexity is O(n^2). In other words, each token sits in a double loop. Of course the input length matters! I double checked as well:
For transformer-based models like GPT, the primary computational concern is the self-attention mechanism, whose complexity is primarily influenced by the sequence length. The computational complexity of self-attention in a transformer scales as \(O(n^2 \times d)\), where:
- \(n\) is the number of tokens in the sequence.
- \(d\) is the dimension of the model (i.e., the number of features or hidden units at each layer).
The quadratic relationship (\(n^2\)) arises from the pairwise comparisons between tokens when calculating attention scores. For each token, the model computes attention scores with every other token, leading to the quadratic term.
Given this, the time taken by the self-attention calculations grows with the square of the input length (keeping other factors like model dimension and hardware constant). In other words, if you double the length of the input, you might expect roughly a fourfold increase in the time taken by the self-attention calculations.
However, in practice, other factors can influence the total processing time, including hardware efficiency, batch processing, and other parts of the model that don't scale quadratically. Still, the quadratic relationship provides a good rough estimate for the scaling behavior of transformers with respect to sequence length.
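The fourfold claim in this comment is simple arithmetic on the n^2 × d term, which the short sketch below works through. `attention_cost` is a hypothetical helper, and the token counts and the model dimension of 768 are arbitrary illustrative values; the result is a relative operation count, not a measured runtime.

```python
def attention_cost(n_tokens: int, d_model: int) -> int:
    """Rough operation count for self-attention, proportional to n^2 * d."""
    return n_tokens ** 2 * d_model


base = attention_cost(1000, 768)     # 1000-token input
doubled = attention_cost(2000, 768)  # same model, input twice as long
print(doubled / base)                # ratio of the two counts: 4.0
```

The model dimension cancels out in the ratio, so the factor of four depends only on the doubling of the sequence length.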
@dxvfdfx 1 year ago
How much do you need to pay for OpenAI Functions if you call it 1,000 times?
@devlearnllm 1 year ago
Call it 1000 times and share it with everyone.