Web Scraping with GPT-4 Vision AI + Puppeteer is Mind-Blowingly EASY!

  59,158 views

ByteGrad

A day ago

Comments: 53
@zeeeeeman 8 months ago
This is such a timely video - I'm doing something similar to resurrect a website from the Wayback Machine.
@hxxzxtf 8 months ago
🎯 Key Takeaways for quick navigation:
00:00 🌐 Web scraping has been revolutionized by AI, particularly with the latest Vision AI model, making data extraction more efficient.
01:07 💻 Manually copying HTML and using ChatGPT for extraction is one method, but OpenAI's API offers programmable solutions for scalability.
02:16 🔄 Using Puppeteer with Bright Data's scraping browser helps circumvent website restrictions and rate limiting during scraping.
05:33 🖥️ Puppeteer allows for easy scraping of HTML content, but there's a need to manage and clean up the extracted data before analysis.
08:35 💡 Extracting only necessary data from HTML can optimize costs when using OpenAI's models for analysis.
12:17 💰 Text-based scraping methods can be cost-effective, but they require ongoing maintenance due to HTML structure changes.
14:49 📸 Utilizing OpenAI's GPT-4 Vision API enables data extraction from screenshots, potentially offering a more robust solution for complex web scraping tasks.
17:52 🖼️ Using base64 encoding allows passing images to models, enhancing data processing capabilities.
18:49 💸 Consider cost-effectiveness when choosing between complex HTML-based or text-based approaches for web scraping.
19:58 🎚️ Adjusting image resolution can significantly decrease token usage in web scraping, but it may increase the likelihood of errors.
20:53 🖼️🔄 Balance image resolution and price when utilizing Vision API for web scraping, as higher resolution images incur higher costs.
21:19 🧹 Clean up HTML before web scraping to reduce token usage and ensure accuracy in results.
22:57 🤖 Explore advanced features of AI tools, such as identifying clickable elements, to enhance web scraping automation.
Made with HARPA AI
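Two of the takeaways above (cleaning HTML to save tokens, and passing a screenshot as base64) can be sketched roughly as follows. This is only an illustration, not the code from the video: the helper names `cleanHtml`, `toDataUrl`, and `buildVisionMessage` are made up here, and the message shape assumes the OpenAI chat format with an `image_url` content part and its `detail` setting.

```typescript
// Sketch only: all helper names below are hypothetical, not taken from the video.

/** Drop markup that carries no data (scripts, styles, comments) to cut token usage. */
function cleanHtml(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "") // remove <script>/<style> blocks
    .replace(/<!--[\s\S]*?-->/g, "")                         // remove HTML comments
    .replace(/\s+/g, " ")                                    // collapse runs of whitespace
    .trim();
}

/** Encode PNG bytes as a base64 data URL so the image can travel inside a JSON request. */
function toDataUrl(png: Uint8Array): string {
  return `data:image/png;base64,${Buffer.from(png).toString("base64")}`;
}

/**
 * Build a user message that pairs a text prompt with a screenshot.
 * "low" detail costs fewer tokens but may miss small text; "high" is the opposite trade-off.
 */
function buildVisionMessage(prompt: string, png: Uint8Array, detail: "low" | "high" = "low") {
  return {
    role: "user" as const,
    content: [
      { type: "text" as const, text: prompt },
      { type: "image_url" as const, image_url: { url: toDataUrl(png), detail } },
    ],
  };
}
```

In practice the bytes would probably come from something like Puppeteer's `page.screenshot()`, and the returned message would go into the `messages` array of a chat completion request.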
@rakysreplays8259 27 days ago
The best video I've seen about web scraping
@beemerrox 5 months ago
Wow, this video provides GREAT value. Just in time for what I'm doing now. Thanks mate!
@reidevanson181 8 months ago
What an amazing video - like, it's so niche but so useful
@ByteGrad 8 months ago
Glad you liked it
@SupCortez 5 months ago
Thank you infinitely for sharing this masterclass lesson with the universe for free. Subbed
@Lars16 8 months ago
This is a great video. But the problem with scraping has hardly ever been parsing the HTML or maintaining the parsers. The biggest problem is efficiently accessing websites that actively try to block you by gating their content behind a login or captchas. Then comes IP blocking (or worse, data obfuscation) if you scrape their website in large volume.
@binhtruongdac2861 8 months ago
That's why you need something like Bright Data. Yes, it's not free, unfortunately
@karenapatch1952 6 months ago
Octoparse can deal with this, and it's free. No thanks
@beemerrox 5 months ago
@@karenapatch1952 Thanks! Didn't know, looks awesome!
@Andrew-qc8jh 4 months ago
yeah this is pretty cool to see but it doesn't look that helpful in comparison to methods using beautifulsoup.
@benhasanaltun 1 month ago
Thanks for sharing!
@juliushernandez9855 7 months ago
Can you create a video on how to deploy Puppeteer and Next.js to Vercel?
@niclas.pandey 8 months ago
thank you a lot ♥
@dmitriydorogonov7918 6 months ago
Perfect video, thanks
@laughremixsquad 1 month ago
🫡 those 90,000 tokens. Thank you for your sacrifice. 😢
@felipeblin8616 6 months ago
Great video. One question though: what about hallucination? How can we be sure it's not doing it?
@justcars2454 3 months ago
Doing web scraping this way at a large scale will be very expensive. It's better to use ChatGPT or a better LLM through its API and have it automatically handle the errors until it finds working code. It's even better if it first tries to find hidden API endpoints and then builds the script for the website based on that endpoint. And all of this automatically: you just need to make ChatGPT able to correct itself, write scripts by itself, run them on your PC, and handle errors until it gets the exact script that successfully scrapes what you want.
@jameskayihura1675 3 months ago
Let's say I want to scrape LinkedIn mentions. Basically, LinkedIn will request authentication. Can this be applied to my case? Thanks
@imranhrafi 8 months ago
It's interesting, but what if I want pagination? I will still need to select the next button the old way. Is there any other way of doing the pagination?
@MrVliegendepater 5 months ago
Scrape all URLs from all sitemaps and then define how many levels deep you'd like to go... You will get more info than needed, but it will do the job. If you convert your HTML content to Markdown and then embed the Markdown content into a vector database, you can query anything in the content.
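The sitemap idea in the comment above could look roughly like this. It is a minimal sketch under assumptions: the function names are invented for illustration, "levels deep" is interpreted as the number of path segments in the URL, and XML entities inside `<loc>` values are not decoded.

```typescript
// Sketch only: function names and the depth rule are illustrative assumptions.

/** Pull every <loc> URL out of a sitemap XML string. */
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

/** Depth = number of non-empty path segments, so "/" is 0 and "/blog/post" is 2. */
function pathDepth(url: string): number {
  return new URL(url).pathname.split("/").filter(Boolean).length;
}

/** Keep only the sitemap URLs that are at most maxDepth levels deep. */
function urlsWithinDepth(xml: string, maxDepth: number): string[] {
  return extractSitemapUrls(xml).filter((u) => pathDepth(u) <= maxDepth);
}
```

The filtered list is what would then be fetched, converted to Markdown, and embedded, which are steps this sketch leaves out.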
@RobShocks 7 months ago
Have you thought about or tried using a local model to scrape? It would save all the costs
@Zaddy_Woods 4 months ago
Could you explain a little more please?
@dupatrio9305 5 months ago
Where can I learn basic coding from scratch to be able to do that?
@amitjangra6454 7 months ago
I am scraping (dropping HTML) with Python code and Selenium (approx. 60,000 articles), later creating vector embeddings for Llama 3 and asking it to write an article for me.
@richerite 6 months ago
Do you have a GitHub link? What did you mean by "write article"?
@5minutes106 6 months ago
Were you able to scrape 60,000 articles without getting your IP address blocked? That's impressive if you did
@OnlyUseMeEquip 5 months ago
@@5minutes106 obviously not, you just rotate proxies
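Rotating proxies, as mentioned in the reply above, can be as simple as cycling through a list. A minimal round-robin sketch, assuming placeholder proxy addresses and a hypothetical `ProxyRotator` helper; the only real-world detail relied on is Chromium's `--proxy-server` flag, which Puppeteer can pass through its launch `args`.

```typescript
// Sketch only: ProxyRotator and launchArgsFor are hypothetical helpers,
// and the proxy addresses used with them are placeholders.

class ProxyRotator {
  private readonly proxies: string[];
  private index = 0;

  constructor(proxies: string[]) {
    if (proxies.length === 0) throw new Error("at least one proxy is required");
    this.proxies = proxies;
  }

  /** Return the next proxy, wrapping back to the first after the last. */
  next(): string {
    const proxy = this.proxies[this.index];
    this.index = (this.index + 1) % this.proxies.length;
    return proxy;
  }
}

/** Build the browser launch arguments that route traffic through one proxy. */
function launchArgsFor(proxy: string): string[] {
  return [`--proxy-server=${proxy}`];
}
```

Each batch of requests would then launch a browser with `launchArgsFor(rotator.next())`, so consecutive batches leave from different addresses.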
@gregsLyrics 5 months ago
and how do you get to the next page to scrape?
@Garejoor 8 months ago
can crewAI do this as well?
@hishamazmy8189 7 months ago
amazing
@LifeTrekchannel 7 months ago
How to do this using Braina AI? Braina can run GPT-4 Vision.
@Kamil_Aqil 4 months ago
10/10
@hellokevin_133 8 months ago
Hey man, mind if I ask what programming languages you know other than Javascript/TS ?
@水手大力-y8l 8 months ago
elegant
@amadeuszg1491 8 months ago
I am interested in creating a price comparison website featuring approximately 10-20 shops, each offering around 10,000 similar products. Unfortunately, these shops do not provide APIs for direct access to their data. What would be the most efficient approach to setting up such a website while keeping maintenance costs reasonable?
@Braincompiler 8 months ago
Make it like the other comparison sites and provide an upload for CSV, XML and so on or YOU provide the API for them so their shop systems can push the data ;) Crawling by yourself is the last option and could be made with XPath and stuff.
@amadeuszg1491 8 months ago
@@Braincompiler Yes, but in this case the store needs to send me the CSV or XML file with their products. What if they don't?
@Braincompiler 8 months ago
@@amadeuszg1491 Yes of course. If your comparison site has a benefit for them be sure they will.
@abhisycvirat 8 months ago
I did this 6 years ago, scraped each website and compared the price using SKU
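Comparing prices by SKU, as described in the comment above, comes down to grouping scraped offers by a normalized SKU and keeping the cheapest. A minimal sketch; the `Offer` shape and the normalization rule (trim and uppercase) are assumptions for illustration, not the commenter's actual setup.

```typescript
// Sketch only: the Offer shape and SKU normalization are illustrative assumptions.

interface Offer {
  shop: string;
  sku: string;
  price: number;
}

/** Group offers by normalized SKU and keep the lowest-priced offer for each. */
function cheapestBySku(offers: Offer[]): Map<string, Offer> {
  const best = new Map<string, Offer>();
  for (const offer of offers) {
    const key = offer.sku.trim().toUpperCase(); // shops may format the same SKU differently
    const current = best.get(key);
    if (current === undefined || offer.price < current.price) {
      best.set(key, offer);
    }
  }
  return best;
}
```

Real shops often use their own internal SKUs, so a shared identifier such as a manufacturer part number may be needed before this kind of matching works.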
@subhranshudas8862 8 months ago
how do you handle paginated data?
@binhtruongdac2861 8 months ago
You just need to use the URL with the page number in the query params, then run a for loop to request multiple HTML pages
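The query-parameter approach from the reply above might look like the sketch below. It assumes the site uses a `page` parameter, which varies from site to site, and `buildPageUrls` is a made-up helper name.

```typescript
// Sketch only: buildPageUrls is a hypothetical helper, and the "page" parameter
// name is an assumption; check what the target site actually uses.

/** Build one URL per page by setting the page number in the query string. */
function buildPageUrls(
  baseUrl: string,
  firstPage: number,
  lastPage: number,
  param: string = "page",
): string[] {
  const urls: string[] = [];
  for (let page = firstPage; page <= lastPage; page++) {
    const url = new URL(baseUrl);
    url.searchParams.set(param, String(page)); // keeps any existing query params
    urls.push(url.toString());
  }
  return urls;
}
```

Each resulting URL could then be visited in turn, for example with Puppeteer's `page.goto(url)`, and scraped the same way as the first page.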
@dmytroocheretianyi7577 7 months ago
Perhaps it will be cheaper on Claude.
@laihan4469 6 months ago
How does a full-stack dev work with AI?
@ThePriceIsNeverRight 3 months ago
This is good but costly to maintain 💸
@UserAliyev 8 months ago
First
@semyaza555 8 months ago
2nd
@antronx7 3 months ago
So is this what modern software engineers do these days? Write scripts to glue paid services together?
@Fatman305 3 months ago
Yeah. Makes zero sense... Paying for each scraped page is probably one of the worst ways of doing this. I guess it's fine if your total bill is very low, but really, for serious work it would make way more sense to ask the AI how to store these pages locally and analyze that local data...locally...
@ByteGrad 6 months ago
Hi, my latest course is out now (Professional React & Next.js): bytegrad.com/courses/professional-react-nextjs -- I'm very proud of this course, my best work! I'm also a brand ambassador for Kinde (paid sponsorship). Check out Kinde for authentication and more bit.ly/3QOe1Bh