Additional details about scraping: scraping only for tags on some sites (like WSJ or CNN) yields the best results. Other sites may behave differently.
@stevenwessel9641 7 months ago
I’m working on a couple ai projects with Malik Yusef, Kanye’s main collaborator and one of Virgil’s first mentors. We should connect, lmk 🙏🏼
@devinschumacher 1 year ago
You are now officially a real youtuber.
@georgesanchez8051 1 year ago
Refreshing to not see some bs clickbait video on LLM-uses. Just a clean, focused, and super differentiated walkthrough-video. Subscribed, and looking forward to more!
@devlearnllm 1 year ago
Thank you. I'm glad that this approach resonates with people.
@emlincharly 11 months ago
This video feels like a coworker showing me something cool. Really good video man!
@alexanderroodt5052 1 year ago
With AI assistance I can scrape hundreds of thousands of products/services a week and now have the facilities to talk to thousands of people at once. Learnt most of it from youtube from people such as yourself who are grossly underappreciated. Keep up the good work and thanks for sharing!
@devlearnllm 1 year ago
Thanks for the appreciation then. We love it.
@diegosandoval7462 1 year ago
🎯 Key Takeaways for quick navigation:
02:05 🌐 You can scrape websites using LangChain, OpenAI Functions, Playwright, and Beautiful Soup.
03:55 🧩 OpenAI Functions simplify web scraping by eliminating the need to manually declare HTML tags.
05:20 🛍️ You can use this approach to scrape e-commerce websites and extract specific information like item titles and prices.
15:41 🤖 LangChain simplifies interactions with OpenAI's GPT models for various applications, including information extraction.
23:32 ⚙️ Consider chunking large HTML content and building a FastAPI server to enhance this web scraping tool's capabilities.
Made with HARPA AI
@walkingwchris 1 year ago
Well done for explaining the why so clearly. You had me in the first minute.
@richmadrid9563 1 year ago
This is exactly what I was looking for: a way to scrape websites like a human being, and to do it via scripting. Also, I like how you explain things clearly and how they work. I found this channel by accident and decided to watch it. The next thing I knew, I was a new subscriber!
@mr.gk5 4 months ago
Great stuff, keep up on making great AI coding content, you got my sub!
@lukeotwell3296 9 months ago
That's a high quality vid right there.
@meinbherpieg4723 8 months ago
Good video. Thanks for taking the time to explain the nuances in depth. You've got my sub ha
@emmanueladepoju4089 1 year ago
First video and I like this channel already!🙂
@devlearnllm 1 year ago
Just a slight correction: openai_api_key is a property in the llm object, in LangChain. It's not a global variable.
@sandratoolan9598 1 year ago
Good luck dude, just keep doing what you're doing.
@techgeekguru 1 year ago
Very cool stuff! Like the style of narration focusing on conveying information in a straightforward and matter-of-fact manner, without overemphasizing or exaggerating.
@tebblesfun 1 year ago
Thank you so much! I didn't know how to implement this and I bumped into your video. Such a saver!
@miltondavilaharjula 1 year ago
Great video!! Thank you for sharing. I liked how you simplified the code and explanation. Your project really makes sense, as webpages do change their structure and the traditional approach may break due to those changes.
@StudioTatsu 1 year ago
Note: 'kwargs' usually stands for keyword arguments. Normally we call it "keyword args". Nice vid. :)
@devlearnllm 1 year ago
Thank you haha. So obvious in hindsight.
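For readers following along, a minimal sketch of what "keyword args" means in Python. The function name `describe` and the option names are made up for illustration and do not come from the video's code.

```python
def describe(**kwargs):
    """Collect arbitrary keyword arguments into a dict named kwargs."""
    # Inside the function, kwargs is an ordinary dict of name -> value.
    return dict(kwargs)


# Hypothetical option names, chosen only for this example.
options = {"temperature": 0, "model": "example-model"}

# The ** operator unpacks a dict back into keyword arguments at the call site.
result = describe(**options)
print(result)
```

The same `**` syntax is what lets a wrapper function forward settings it does not know about to the function it wraps.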
@Wildhoneybush1 10 months ago
Great job, you are a real YouTuber and I can tell that you will become very popular. 😮🎉
@jazzzAiman 1 year ago
Yup, straight to the point
@devlearnllm 1 year ago
TY
@zakuro8532 9 months ago
You are the King
@tiagoc9754 11 months ago
11:49 is it safe to remove other tags? It's recommended that web pages contain elements such as section, article, main, menu, header, footer, etc. not to mention h(n), label, span, aria. I know many pages out there don't follow the "correct" syntax, but I suppose especially when we talk about huge websites we'll commonly find those patterns. So removing other tags would not affect the result we expect from the AI integration?
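One way to explore this question is to filter a page by tag and inspect what text survives. The sketch below uses Python's standard-library `html.parser` rather than the Beautiful Soup setup from the video, and the tag choices (`h2`, `p`) and sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the wanted tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.depth = 0      # number of wanted tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only while at least one wanted tag is open.
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_text(html, wanted_tags):
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page with semantic wrappers around the content.
sample = (
    "<header><nav>Menu</nav></header>"
    "<main><h2>Title</h2><p>Body <span>text</span></p></main>"
    "<footer>Legal</footer>"
)
print(extract_text(sample, ["h2", "p"]))
```

Text nested inside a wanted tag is kept even when it sits in an unlisted child tag, so dropping wrappers such as `header` or `footer` removes only the text that lives exclusively in them. Whether that loss matters depends on the site.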
@FlutterDev1337 1 year ago
This is awesome content btw!
@ratnpriyarai4793 4 months ago
It was quite useful for me.
@bigbena23 7 months ago
Great video. I guess modifying this to use a local LLM should be easy, right?
@koleshjr 1 year ago
Amazing Amazing
@SergeyNumerov 4 months ago
I wonder how this would handle dynamic content: as in scraping websites where you have to click stuff to reveal valuable content.
@evansmakuba1631 8 months ago
bro codes in light mode...respects
@HazemAzim 1 year ago
Nice and simple. Thanks
@chadmichaellawson3985 1 year ago
Fire!!
@MK-jn9uu 1 year ago
The beginning started mid sentence. Did I miss where you explained how ai will prevent us from rebuilding the scrape code when the website changes?
@devlearnllm 1 year ago
Sign up for the upcoming AI Agents Master Course: forms.gle/YuMvqfXo6xXUXaR6A
@augmentos 1 year ago
Why when I try to use functions under the new assistant GPT builder, does it keep telling me it's invalid JSON, and I can't paste Python or JavaScript in there to be able to scrape the web?
@4ram16 9 months ago
I’m on a quest to use an LLM for web scraping without identifying HTML. You gave a lot of valuable background information. You referred to “python things” and talked as though you have experience with NodeJS. Why didn’t you use LangchainJS, Puppeteer and Cheerio? How difficult would it be to rewrite your repo for NodeJS?
@lamboqin2180 1 year ago
Thank you for your video and resource! I am trying to build a web app to find news articles that have different standpoints on a chosen topic. Would this code be a good solution for me to scrape news, or would it be more suited to something else, like scraping more security-tight websites (since it uses Chromium)? I see the waiting time is quite long too. What LangChain solution/module would you recommend for my project?
@devlearnllm 1 year ago
Hey there, the wait time is mostly on the LLM part and not the scraping part. You can definitely use this to scrape news sites. LangChain has the OpenAI Function extraction chain, which has a nice input parser for extracting. All you have to do is define your schema for scraping, then off you go 🚀
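As a rough illustration of "defining your schema", here is a sketch of a schema dict in the `properties` shape that appears elsewhere in this thread, using the `news_headline` and `news_short_summary` field names mentioned by another commenter. The `required` list and the `matches_schema` helper are assumptions added for this example; they are not part of LangChain.

```python
# Schema shape as seen in the thread; the "required" entry is an assumption.
news_schema = {
    "properties": {
        "news_headline": {"type": "string"},
        "news_short_summary": {"type": "string"},
    },
    "required": ["news_headline"],
}

# Map schema type names to Python types for a simple sanity check.
PYTHON_TYPES = {"string": str, "number": (int, float)}


def matches_schema(record, schema):
    """Return True if record has every required key and only known, well-typed keys."""
    props = schema["properties"]
    for key in schema.get("required", []):
        if key not in record:
            return False
    for key, value in record.items():
        if key not in props:
            return False
        if not isinstance(value, PYTHON_TYPES[props[key]["type"]]):
            return False
    return True


# Hypothetical record, standing in for one item extracted by the LLM.
record = {"news_headline": "Example headline", "news_short_summary": "Example summary"}
print(matches_schema(record, news_schema))
```

A check like this can run on whatever the extraction step returns, so malformed items are caught before they are stored.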
@plashless3406 10 months ago
Feeding all the HTML to the LLM might exhaust the context length of the LLM pretty quickly.
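A common way to work around this is to split the page text into overlapping chunks and send them one at a time. The sketch below splits by character count, which is only a rough proxy for tokens; the default sizes are assumptions, not values from the video.

```python
def chunk_text(text, chunk_size=4000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that an item cut
    at a boundary still appears whole in one of the chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


print(chunk_text("0123456789", chunk_size=4, overlap=1))
```

Each chunk would then go through the extraction step separately, with the results merged and de-duplicated afterwards.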
@tiagoc9754 11 months ago
As a dev with a JS background, how has your experience with Python been? Why did you move to Python instead of using LangChainJS? Comparing LangChainJS vs LangChain Python, do you miss many features from one framework in the other? Have you ever faced an issue with JS that you could only solve with Python?
@Guy-Scott 1 year ago
Where are you using the openai function calling functionality? Isn't it so that the openai function calling should call a specific function inside of your program? Or am I missing something?
@abhishekchoudhury 1 year ago
Hey @llmschool! This is very insightful, and it got me wondering if we can extract security ownership from DEF 14A filings. The difficulty is that each filing has a different structure; can the LLM handle that?
@devlearnllm 1 year ago
You can try. Let us know how it goes.
@aamdmn2641 1 year ago
Hi, great video! I've implemented a similar approach and I wanted to see yours which has given me new inspiration which I'm very grateful for so thank you! Why did you use Python, based on that you mentioned you are from the Javascript/Typescript world?
@devlearnllm 1 year ago
Yup, my background was in JS / React
@sunilbendre123 11 months ago
Thanks for this. Just a quick question: how do I approach this problem if I have around 300 website links to scrape?
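One possible approach for a few hundred links is to run the scrapes concurrently while capping how many are in flight at once. This is a sketch under assumptions: `scrape_one` is a hypothetical placeholder for the real scrape-and-extract step, and the limit of 5 is arbitrary.

```python
import asyncio


async def scrape_one(url):
    """Hypothetical placeholder for the real scrape-and-extract step."""
    await asyncio.sleep(0)  # stands in for network and LLM latency
    return {"url": url, "status": "ok"}


async def scrape_all(urls, max_concurrent=5):
    # The semaphore caps how many scrapes run at the same time.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            try:
                return await scrape_one(url)
            except Exception as exc:
                # Record the failure instead of aborting the whole batch.
                return {"url": url, "status": "error", "detail": str(exc)}

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(300)]
results = asyncio.run(scrape_all(urls))
print(len(results))
```

Keeping the concurrency limit low also reduces the chance of being rate limited or banned by the target site.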
@onirdutta666 1 year ago
I am getting an error: "TypeError: Parameters to generic types must be types. Got {'properties': {'item_title': {'type': 'string'}, 'item_price': {'type': 'number'}, 'item_extra_info.". Can you help? Thanks in advance.
@devlearnllm 1 year ago
Hey, without looking at your code I'm not sure why that's the case. But I merged my code into LangChain (Python) a couple of weeks ago for this use case, and you can follow the guide here: python.langchain.com/docs/use_cases/web_scraping/
@salmankhandu3819 11 months ago
When we get data from a site and provide it to the LLM for scraping, how can we manage large data? The data goes to the LLM in chunks, so with large data some of it might be truncated.
@rajnishadhikari9280 7 months ago
Can you do the same using an open-source LLM like Llama 3?
@kamalseriki3201 1 year ago
I've been trying to use this in a Django web app using Celery, but I've been getting coroutine errors. I managed to bypass that with the async_to_sync function, but now the task keeps executing without giving any results. What can I do?
@BarışAytimur-e8x 11 months ago
Why use Playwright? Can't you use Selenium instead?
@AlloMission 1 year ago
Thanks
@Guy-Scott 1 year ago
I also printed out the content in the extract function, which is just plain text. How can OpenAI, with just plain text and a schema, convert that plain text to a JSON file? I mean, how does it know where another news_headline or news_short_summary starts?
@devlearnllm 1 year ago
The OpenAI Functions call is encapsulated in LangChain's chain.
@thomaslyngesen7221 1 year ago
I find the data returned is not valid; article titles do not match their summaries, for instance. Can you comment a little more on the schemas? For example, is the naming of items important?
@devlearnllm 1 year ago
Sure, which site are you scraping?
@thomaslyngesen7221 1 year ago
@@devlearnllm I get pretty good results with your basic 'news' schema, but nothing with the 'e_commerce' schema, which is also more detailed it seems. Are you mirroring the item names used at the site you want to scrape?
@devlearnllm 1 year ago
@@thomaslyngesen7221 For e-commerce sites, it's quite challenging on the scraping side of things to deliver clean data to the LLM to extract. AppSumo is an easy site to scrape, but Amazon or Best Buy seem more challenging. It'll take some experimentation to get them to work.
@devlearnllm 1 year ago
Make sure to pull my latest code, and only scrape for tag. Then the titles should be accurate. Thanks for pointing this out
@funnyperson4016 1 year ago
If I have a list of URLs to scrape and a website behind a login and password with keywords and overall score and other variables I don’t need, will this be able to scrape all keywords from all URLs into a single csv file?
@SilenceOnPS4 1 year ago
Would you know how to scrape PDF documents (download and sort into files) from a website that has a database that is constantly updating? If this is something you can do, I'd love to have a chat and would pay you for your time. I am a beginner in this realm, and would love to figure this out.
@devlearnllm 1 year ago
For sure. You can reach out to me on LinkedIn: www.linkedin.com/in/haiphunghiem/ Or chat with me on LangChain Canada's Discord: discord.gg/rtKE2g266C (my username is toasted_shibe)
@evolution3658 6 months ago
What is it for? For what purpose?
@Flameandfireclan 1 year ago
Hello sir, I’m building a commercial software. And I want to ask your permission before I use your code. Would it be okay if I cloned your code and used it as a part of my software? (I am very impressed by what you have built that’s why I’m interested in using it myself)
@devlearnllm 1 year ago
For sure. I'm flattered. And thanks for asking as well. Please credit me (my name and this video) if you don't mind.
@Flameandfireclan 1 year ago
@@devlearnllm Thanks! I’ll make sure to include your name (author) in the documentation and a link to the video! 🙏
@julianomoraisbarbosa 1 year ago
# til
@HiteshGautam-v6y 9 months ago
Can we scrape deep links of a website as well? For example, scrape the About Us page of a website that was found from its home page. Please post it if you can.
@matheusduzziribeiro5637 10 months ago
I'm trying to scrape wsj but I got this error: "RuntimeError: no validator found for , see `arbitrary_types_allowed` in Config". Do you know what this could be?
@andrew54292 9 months ago
Did you ever figure that out?
@jsfnnyc 1 year ago
Lolz at the neighbor's trash 😄
@devlearnllm 1 year ago
The worst.
@hishamazmy8189 8 months ago
amazing
@atrocitus777 1 year ago
is this worth doing for data you want to scrape that's behind captchas?
@devlearnllm 1 year ago
I haven't tried that yet, but it probably requires some modifications on the Chromium and scraping side (not the extraction side).
@atrocitus777 1 year ago
OK, I know there are captcha solution providers like 2captcha, but then there are more advanced solutions offered by Bright Data and ScraperAPI. There are not a lot of video tutorials about those services, but I think this could be pretty powerful when integrated with something like those tools. @@devlearnllm
@CarlChristiansen-ps5ov 8 months ago
I tried to upload a comment about a problem I ran into, but for some reason it doesn't show in the comments. Does anyone know why? 😅
@viktorvegh7842 1 year ago
Don't you have problems with website security? I tried to scrape some websites and got an IP ban.
@devlearnllm 1 year ago
Don't go overboard then lol
@HappyDataScience 1 year ago
If you don't mind, please change the theme.
@SurajSingh-y3n3e 6 months ago
Bro, I watched a 4-minute ad before getting to the actual video.
@devlearnllm 6 months ago
That's crazy. Let me see if I can change that somehow
@Ryan-yj4sd 1 year ago
Nice video. This is totally unscalable, expensive and very slow. Websites don’t change much. You’re far better off asking the AI to write a good scraping bot rather than feeding HTML into the bot. 😊
@devlearnllm 1 year ago
For now, everything you said is true (except that websites don't change much: competitors' websites and listings on JS-heavy websites change all the time). Over time, we'll see LLM calls getting cheaper and faster. And how different is asking ChatGPT to write a scraping bot from an LLM call?
@Ryan-yj4sd 1 year ago
Feeding in the entire HTML is slow and inefficient. I do some professional scraping, and most of my clients' scrapes run for years with almost no maintenance.
@Ryan-yj4sd 1 year ago
@@devlearnllm My suggestion is to use the LLM to make the updates to a real scraper on the fly, rather than blindly feeding in 4000 characters of text and asking the LLM to extract. LLM context length is O(n^2), and no cost reduction will solve this issue. So keeping context length as low as possible is always important.
@devlearnllm 1 year ago
@@Ryan-yj4sd I don't know what you mean by LLM context length being O n^2, but the output length is what determines the amount of time it takes to generate. Doesn't matter if the prompt is long or short. I do like the idea of updating a scraper on the fly though. It might end up needing as much HTML as possible to generate new code or schema accurately anyways. But you gave me a better idea: what if you still push HTML to LLM once, create a scraper or schema (like you said), and keep using it until the website changes. Here's where one can put in an evaluator of some sort (another small LLM call, perhaps?) to check the work of the scraper. If the work results are poor (you can determine what's good/not good for the LLM evaluator), then we run the first step again. Thoughts?
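The generate-once, evaluate, regenerate idea in this comment could be sketched roughly as follows. Everything here is hypothetical: `generate_scraper` stands in for the expensive LLM call that produces a scraper, `evaluate` stands in for the cheaper quality check, and the stubs at the bottom only simulate a site whose layout changed.

```python
def run_with_regeneration(html, generate_scraper, evaluate,
                          cached_scraper=None, max_attempts=2):
    """Reuse a cached scraper until its output fails evaluation, then regenerate."""
    scraper = cached_scraper
    for _ in range(max_attempts):
        if scraper is None:
            scraper = generate_scraper(html)  # expensive step, run only when needed
        data = scraper(html)
        if evaluate(data):
            return data, scraper  # caller keeps the scraper for the next run
        scraper = None  # output was poor: discard and regenerate
    raise RuntimeError("no scraper produced acceptable output")


# Hypothetical stubs for demonstration only.
calls = {"generated": 0}


def generate_scraper(html):
    calls["generated"] += 1
    return lambda page: [part for part in page.split("|") if part]


def evaluate(data):
    return len(data) > 0


def stale_scraper(page):
    return []  # pretend the site changed and the old scraper finds nothing


data, scraper = run_with_regeneration(
    "a|b|c", generate_scraper, evaluate, cached_scraper=stale_scraper
)
print(data, calls["generated"])
```

With this shape, the full HTML is sent to the model only when the evaluator rejects the cached scraper's output, which is the cost saving the comment is after.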
@Ryan-yj4sd 1 year ago
@@devlearnllm the algorithm complexity is O(n^2). In other words, each token sits in a double loop. Of course the input length matters! I double checked as well: For transformer-based models like GPT, the primary computational concern is the self-attention mechanism. The self-attention mechanism's complexity in transformers is primarily influenced by the sequence length. The computational complexity of the self-attention mechanism in a transformer scales as \(O(n^2 \times d)\), where: - \(n\) is the number of tokens in the sequence. - \(d\) is the dimension of the model (i.e., the number of features or hidden units at each layer). The quadratic relationship (\(n^2\)) arises from the pairwise comparisons between tokens when calculating attention scores. For each token, the model computes attention scores with every other token, leading to the quadratic term. Given this, the time taken by the model will be proportionally related to the square of the input length (keeping other factors like model dimension and hardware constant). In other words, if you double the length of the input, you might expect roughly a fourfold increase in the time taken by the self-attention calculations. However, in practice, other factors can influence the total processing time, including hardware efficiency, batch processing, and other parts of the model that don't scale quadratically. Still, the quadratic relationship provides a good rough estimate for the scaling behavior of transformers with respect to sequence length.
@dxvfdfx 1 year ago
How much do you need to pay for OpenAI Functions if you call it 1000 times?