Best Web Scraping Combo? Use These In Your Projects

41,858 views

John Watson Rooney

A day ago

A full Python project using my 2 current favorite tools, HTTP Client HTTPX and HTML Parser Selectolax.
Scraper API www.scrapingbee.com/?fpr=jhnwr
Patreon: / johnwatsonrooney
Donations: www.paypal.com/donate/?hosted...
Proxies: iproyal.club/JWR50
Hosting: Digital Ocean: m.do.co/c/c7c90f161ff6
Gear I use: www.amazon.co.uk/shop/johnwat...
Disclaimer: These are affiliate links and as an Amazon Associate I earn from qualifying purchases

Comments: 125
1 year ago
I'm just starting out with Python and web scraping, and this video is amazing and taught me a lot of basic things! Thanks a lot for such a fantastic video.
@rics6035 1 year ago
John! Thanks so much for your amazing videos, they are super useful and interesting to watch!
@nuritas8424 1 year ago
the way you explain is so clean. thanks a lot
@arabymo 2 months ago
Putting your mind and thinking into the code! What a way to explain and learn. Thank you.
@drac.96 1 year ago
Nice to see a new intro and the step by step explanation is really good
@pypypy4228 1 year ago
I like this approach. Thank you!
@lucasmoratoaraujo8433 1 year ago
Nice video! Thank you for sharing your knowledge with others!
@hayat_soft_skills 1 year ago
Love the content & especially how to write clean and neat code. The best channel in my 5-year YouTube journey. May Allah give you more power, and we are enjoying the best content. Thanks!
@y2kdeuce2 1 year ago
Hey JWR, JRW here. I've been "scraping" for 20 years now. Amazing how the tools have matured. Dead simple these days. That said, this video is a fantastic example of a cherry picked site to demo these tools. Few real world websites are this simple to parse using CSS. Please dedicate some time to digging through more challenging selectors. Thanks in advance - John
@JohnWatsonRooney 1 year ago
Hey! Cherry-picked examples are unfortunately a part of it; people simply won’t watch a long video where I am trying to work stuff out. Also, this video was more of a demo of how different tools work but end up at the same result. I’m not sure I agree about the CSS parsing part though; I don’t often find an issue on sites that are mostly HTML and CSS with minimal JS.
@contrarypagan 1 year ago
@@JohnWatsonRooney I think if you did some master classes where you tackled some complex sites and worked through things that you would have a fair few views on those videos. The easy options you use are VERY useful for specific tips but I would love to see you work through some real difficult situations as well. But your content is awesome so believe me I am not complaining! Thank you so much.
@AmodeusR 1 year ago
@@JohnWatsonRooney I guarantee there will be people who will watch a long video to see a professional trying to figure things out. Most true learners are just sick of the magic, smooth programming experience many videos show, when in reality, trying to do it ourselves, we just end up struggling a lot. That's even unhealthy for those starting in the area, who think everything is always that simple because of the sheer amount of cherry-picked content. Just make it clear from the beginning that it is "advanced" content, an example of when things are not that simple, so people can relate to it and feel compelled to watch it.
@OPPACHblu_channel 1 year ago
Thanks u for sharing experience, very interesting and helpful!👍
@GelsYT 1 year ago
OH MY GOODNESS! THANK YOU! THIS IS SO MUCH EASIER TO COLLECT THE DATA AND CONSTRUCT IN A DATA STRUCTURE LIKE DICT THANK YOU!
@JohnWatsonRooney 1 year ago
thanks I'm happy to help!
@MrRementer 1 year ago
Yes, happy to see a new Video!
@JohnWatsonRooney 1 year ago
thanks, i hope you enjoyed it!
@MrRementer 1 year ago
@@JohnWatsonRooney I did! I've got a longer question regarding my Amazon Scraping Project, which i am currently doing with Selenium. Everything works fine, its just quite slow.. Is it okay to hit you up with a direct message/email?
@mitchconnor8764 1 year ago
Great video
@azwan1992 1 year ago
I love you man.
@JKnight 1 year ago
That was fantastic. Corey Schafer tier content. Love to see it.
@JohnWatsonRooney 1 year ago
Thank you - he’s the best so very happy to be included there!
@RonWaller 10 months ago
John, thanks for your tutorials. I'm enjoying the web scraping and planning to dig into it more. Curious: this tool used CSS to get the data. Are there other tools to get "dynamic" data or JS data? Just wondering, thanks.
@pythonprogrammer2186 1 year ago
Very nice!
@UniquelyCritical 1 year ago
1000th like here at 5:12 AM CST. Thanks!!!
@JohnWatsonRooney 1 year ago
Thank you!!
@mirkolantieri 1 year ago
Nice video John!
@JohnWatsonRooney 1 year ago
thanks for watching!
@thepoorsultan5112 1 year ago
Already have been using selectolax and httpx combo
@gabiie9839 1 year ago
nice work john, web scraping lord.
@JohnWatsonRooney 1 year ago
thanks for watching!
@joaoalmirante4268 1 year ago
Hey, nice video. But the biggest problem these days with scraping is the amount of JS/non-HTML content, which makes the data a lot harder to get. But overall, thanks for sharing.
@Septumsempra8818 1 year ago
My business got funding!!! Thank you Mr Rooney.
@JohnWatsonRooney 1 year ago
great! thanks for watching!
@hrvojematosevic8769 1 year ago
Are you hiring? :)))
@Septumsempra8818 1 year ago
@@hrvojematosevic8769 developers, Yes.
@roshanyadav4459 1 year ago
Can I join you? I have worked with Scrapy, Playwright, BeautifulSoup and Selenium. I am an intermediate programmer.
@hrvojematosevic8769 1 year ago
@@Septumsempra8818 it's a broad term T_T
@arsalan0561 1 year ago
so I've been learning scrapy basics and following your channel for quite a while. So as per this video this is the latest method to scrape the pages ! what about those ol scrapy start_url and responses to get the whole page and link extractors and follow_url to get to next pages and stuff! i mean do we still need to use them at some point or we could replace them with this method altogether. ? And thanks for the sharing new ways to scrape. cheers
@philtoa334 1 year ago
Thanks.
@rosaarzabala5189 1 year ago
Great combo! Thanks for ur videos 🙌 almost didn't get the doge 👀
@JohnWatsonRooney 1 year ago
thanks ;D
@losefaithinhumanity8238 1 year ago
Hey man I've been watching your content for the past couple of weeks and it's fire. A good content idea would be to create a beginner series where you go through the absolute basics, I'm proposing this because nearly all of the videos on the topic are very outdated. Cheers.
@frynoodles1274 1 year ago
Hi John, I love your videos. What if view-source doesn't return all the HTML on the page that we want? Do we need to use a headless browser and wait for elements to load? Or is there a good requests library we can use instead? Thanks
@JohnWatsonRooney 1 year ago
if it doesn't you have a few options, headless browser is one, or seeing if there are AJAX requests you can use too
@karthikshaindia 1 year ago
Good one. However, request HTML may replaced and comfortable instead BS4. This one have to be decoded for specifically
@juliohernandez4890 7 months ago
Thanks for the wonderful material. Maybe it's just me, but right now the price is not saved. Thanks
@bakasenpaidesu 1 year ago
Great video... Btw u can use pandas to convert dictionary to CSV.
@JohnWatsonRooney 1 year ago
thanks - yes it's much easier too, but pandas is a big library to import
@eternogigante685 4 months ago
Does this work on SPAs rendered by frameworks like react and such?
@jasonbischer4568 6 months ago
Also, a csv file is not being created when I run the script? Any idea?
@michakuczma4076 1 year ago
Great video John. Thanks for that. One question comes to mind: why do you use dataclasses first and then convert them to dictionaries? Why not use dictionaries from the beginning? What's the advantage of dataclasses here besides IDE hints? I don't have much experience with them, that's why I'm asking.
@JohnWatsonRooney 1 year ago
Thanks. In this case there wasn’t much of a benefit I am just in the habit of using them now. The benefit comes in the validation you can do with them when accepting data in and out of your program
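The validation benefit mentioned here can be sketched with a `__post_init__` hook. This is only an illustration, not the code from the video: the `Product` class, its `name` and `price` fields, and the assumed price format are all hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Union


@dataclass
class Product:
    # Hypothetical fields, chosen only for illustration.
    name: str
    price: Union[str, float, None] = None

    def __post_init__(self):
        # Normalise and validate as soon as the object is created.
        self.name = self.name.strip()
        if not self.name:
            raise ValueError("name must not be empty")
        if isinstance(self.price, str):
            # Assumed price format: "£1,299.00" becomes 1299.0, "" becomes None.
            cleaned = self.price.replace("£", "").replace(",", "").strip()
            self.price = float(cleaned) if cleaned else None


item = Product(name="  Trail Tent ", price="£1,299.00")
row = asdict(item)  # plain dict, ready for csv.DictWriter or json.dumps
```

A plain dictionary would accept the untidy values silently; the dataclass cleans them (or fails loudly) at the point where the data enters the program, and `asdict` still gives a dictionary at the end.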
@guillaumebignon6957 1 year ago
Using requests and beautifulsoup up to now, it's great to discover competitive alternatives. Would append data in a json file instead of csv also work ?
@JohnWatsonRooney 1 year ago
Always good to see other options in case they fit your needs better. Yes you can append to a json file, look at json lines too it might be better for you
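For anyone curious what the JSON Lines suggestion looks like in practice, here is a minimal sketch using only the standard library. The function names and the example records are assumptions made for illustration.

```python
import json


def append_jsonl(path, records):
    """Append each record as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A regular JSON array has to be read, extended and rewritten in full to add an item, whereas a JSON Lines file can simply be opened in append mode, which suits a scraper that writes results page by page.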
@yawarvoice 1 year ago
Hi, @John I've been following you for a long time and watching all your scraping videos with Python. I have started to create scraper but the website is not allowing me to access as it is considering my script as a bot, though I have changed the user-agent to latest chrome but still, that website is recognizing me as a bot. My question is that which combo I should use for scraping little complex JS/AJAX/bot-aware websites? People say that selenium is good for that purpose, but you say that selenium is not a good option now a days as it is slow, then what do you suggest, which combo should I use, that can fit in many scenarios, if not all. Looking forward! Thanks.
@randomcat4344 1 year ago
If it's Cloudflare detecting you, try cloudscraper.
@garymichalske2274 1 year ago
Thanks for the video, John. I was finally able to run my code successfully following the steps in this video. I was following the older videos for selenium and playwright but couldn't get the results you displayed in the video. I think the html code on the websites had changed since you recorded the video. The only issue I ran into for this one is my csv file has a blank row between every exported line. So instead of 300 rows, I have 600. Any idea why?
@JohnWatsonRooney 1 year ago
thanks - yes unfortunately that is part of it, websites change so my examples often expire. I try to show the methods as much as I can. As for your CSV, some of your data probably has a newline character at the end, try adding .strip() to each line to see which one it is!
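Another common cause of a blank row after every record, especially on Windows, is opening the file without `newline=""`, which the Python `csv` documentation recommends for files passed to a writer. A minimal sketch covering both causes, assuming the rows are plain dicts; the function names and field names are hypothetical.

```python
import csv


def clean_row(row):
    """Strip stray whitespace and newlines from string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def write_csv(path, rows, fieldnames):
    # newline="" lets the csv module control line endings itself; without it,
    # Windows can turn "\r\n" into "\r\r\n", which shows up as blank rows.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(clean_row(r) for r in rows)
```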
@7Trident3 1 year ago
No dickin around, meat and potatoes! This should be the gold standard on how to make a programming vid.
@JohnWatsonRooney 1 year ago
Thanks I appreciate it!
@dungphung2252 1 year ago
Hi, I just want to know if it works on all websites. Thanks
@tarik9563 1 year ago
maybe a stupid question, what about scraping data that is only generated from a request + captcha?
@lucianocarvajal6698 1 year ago
And can httpx scrape info from dynamic/JavaScript web pages? Because what I see in the video is that it is being used on a normal HTML website.
@JohnWatsonRooney 1 year ago
you can, if you can find the backend API, or otherwise you will need to render the page with browser automation like playwright
@pranavanand24 1 year ago
Hey John, great video! I am a beginner at webscraping and vscode in general. I saw that your import csv part got added automatically, I think? Can you please tell me how to do that? Is that some extension like Auto Import?
@JohnWatsonRooney 1 year ago
i'm gonna be honest with you... i think i typed it in but forgot and edited that part out... sorry
@pranavanand24 1 year ago
@@JohnWatsonRooney oh, okay, got it. No problem. I ended up searching Google and came to know about this extension auto import and included it in my vscode, which is indeed able to add those import lines by itself.
@shivambajaj6228 1 year ago
I usually encounter Error 429 scraping web pages. Is there any way I could bypass that?
@Lahmeinthehouse 1 year ago
Nice video! What do you use for screen recording ?
@JohnWatsonRooney 1 year ago
OBS! It’s free
@Lahmeinthehouse 1 year ago
Great, thanks! Also, do you have LinkedIn ?
@Relaxing_Sounds_Rain 1 year ago
Hi, thank you John. This works very well on Windows but does not work on Ubuntu Linux. Help me please.
@rajkumargerard5474 1 year ago
Can we run this from Spyder or Jupyter? Also, could you please try to scrape Tesco? I had tried it and it was working fine for some time, but now, due to the restrictions, my code doesn't work.
@francius3103 1 year ago
It says venv/bin/activate doesn't exist; there's only a file called python and another one called python3 there :(
@gisleberge4363 1 year ago
Why you "left" requests and beautifulsoup? Just curious about their downsides compared to the ones you recommend here.
@JohnWatsonRooney 1 year ago
HTTPX works just like requests, but it can be async when needed. Selectolax is faster and more focused (CSS selectors only) than BS4. I always say use what you prefer; after exploring different tools I have found that these 2 work best for me!
@joshman844 1 year ago
does this work in google colab?
@jasonbischer4568 6 months ago
Hey John, great video here. I wasn’t getting the price in my results, it was just empty for some reason? I messed around and used the span code as well but it just returned “no text found”. Any ideas? Thanks for everything, your videos are great
@karimbenamar362 10 days ago
same here😢 any idea how to solve the issue ?
@bittumonkb866 1 year ago
What about sites needed login?
@MrPaynealex6 1 year ago
Any recommendations to avoid rate limiting aside from rotating proxies?
@jhonjuniordelaguilapinedo2746 1 year ago
I guess keep using requests library, because the get function lets you put the header and the proxy you want.
@yoanbello6891 1 year ago
great video like always, i have this error scraping a site with python playwright … intercepts pointer events retrying click action, attempt #57, Is a heavy javascript site, i am tryingn to click a button. Thanks
@jfk1337 1 year ago
Why no async?
@CreativeCorners 1 year ago
Respected Sir, I need your help to get links to, or download, FB Marketplace images using the scraping tool. I did a lot of work but I am confused: I got links to all the listed items, but couldn't get the links to the images in the individual listing shown in the new tab. Please guide me.
@CreativeCorners 1 year ago
I'm still waiting for your favorable reply please
@hossamgamal8661 1 year ago
Thanks for sharing such important information. I didn't know that there are modules other than beautifulsoup and requests. I have a question: can you make a video on how to use an authenticated proxy with Selenium? I have used options.add_argument('--proxy-server=ip:port') but it doesn't work for me. It doesn't show the alert box where I should enter the username and password.
@JohnWatsonRooney 1 year ago
I'm going to do some more selenium vids i will try to cover this in those, but i'm not sure exactly why that doesn't work
@hossamgamal8661 1 year ago
@@JohnWatsonRooney Thanks
@y2kdeuce2 1 year ago
@@JohnWatsonRooney how about a Firefox AWS lambda function with a rotating proxy? :D
@nztitirangi 1 year ago
very cool. I kept getting timeouts so did this to solve:

    client = httpx.Client(timeout=None)
    resp = client.get(url)
    return HTMLParser(resp.text)
@thorstenwidmer7275 6 months ago
May I ask which Lenovo this is?
@JohnWatsonRooney 6 months ago
It’s an old x200. I’ve put an SSD and more RAM in it and it’s a workable machine
@itzcallmepro4963 1 year ago
I got errors, searched, and found that the site I am trying to scrape uses Cloudflare protection. Is there any way to bypass that?
@JohnWatsonRooney 1 year ago
Try cloudscraper, it can have some good results
@itzcallmepro4963 1 year ago
@@JohnWatsonRooney I already searched and have used it, thanks very much. But I have another problem now. I am scraping some data, one item being prices. Sometimes there are 2 prices; both are always in the HTML, but sometimes only one is displayed on the page. I can't find any class or anything to differentiate between them so that I get only the element that appears on screen.
@artabra1019 1 year ago
Nice, the asdict method saves a lot of time.
@digitalbangladesh6977 1 year ago
hello sir..what are downsides of scrapy in respect of this project??
@JohnWatsonRooney 1 year ago
None really - I just sometimes feel like it’s overkill for a small project and think of it more for larger scrapers and crawlers
@rovshenhojayev1843 1 year ago
can we add link and image of that product on this lib?
@JohnWatsonRooney 1 year ago
Yes it would work the same way
@mectoystv 1 year ago
Hello, in order to use web scraping, you must ask permission from the owner of the content of the web page?
@nztitirangi 1 year ago
If your scraper behaves similarly to a human then it's fine. If it totally smashes some poor sod's ecommerce page then no. If it's a well-built webapp then they're going to be throttling you anyway, IMHO.
@tanekapace8080 1 year ago
I’m interested in a bot that can fill out online forms at multiple websites. Kindly respond if you could help me
@ramarajesh9554 1 year ago
First
@disrael2101 1 year ago
Source code
@scottmiller2591 1 year ago
This was a good example of how to get started, but I still had some questions:
- In your opinion, why are httpx and selectolax better than requests and BeautifulSoup?
- There are so many places where things can fail (status code =/= 200, website sends you to an "I'm busy" page, etc.) that are missing here. If you are communicating with an unreliable website, this code may fail even with the hobby application, much less something that is scraping professionally. Is there anything in httpx/selectolax that helps with the exception handling compared to requests/BS4?
@JohnWatsonRooney 1 year ago
Httpx has async ready for you when you need it, and selectolax is a much faster parser than bs4. It still comes down to preference: use what works for you! Yes, in this video I didn’t flesh it out fully with error handling, retries and the other parts that would make the script more complete for professional use. I didn’t want to cover too much in one go, and I also wanted to reach as many people as possible.
@scottmiller2591 1 year ago
@@JohnWatsonRooney Thanks for your prompt and useful reply!
@nztitirangi 1 year ago
@@scottmiller2591

    client = httpx.Client(timeout=None)
    resp = client.get(url)
    return HTMLParser(resp.text)
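Note that `timeout=None` turns timeouts off entirely (httpx defaults to 5 seconds), so a stalled server can hang the script. For flaky sites or 429 responses, a bounded timeout plus a few retries with backoff is a common alternative. The sketch below is not from the video: `fetch_with_retries`, its parameters and the set of retryable status codes are assumptions, and `fetch` is any zero-argument callable returning an object with a `status_code` attribute.

```python
import time

# Status codes treated as temporary; this set is an assumption, tune as needed.
RETRYABLE = {429, 500, 502, 503, 504}


def fetch_with_retries(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until the response status is not retryable.

    Waits base_delay, 2*base_delay, 4*base_delay... seconds between attempts
    and returns the last response if every attempt was retryable.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code not in RETRYABLE:
            break
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return response
```

With httpx this could be called as `fetch_with_retries(lambda: client.get(url))`, where `client = httpx.Client(timeout=10.0)`. Passing `sleep` as a parameter keeps the function easy to test without real waiting.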
@nadavnesher8641 1 year ago
Hi John, Thanks for the awesome video! I really like your clear explanations. I was trying to run your code but on a Google Search page but got into some difficulties. I was hoping you could please tell me what I'm doing wrong.

The div class I'm trying to grab: (which represents a Google Search result). But what's returned is an empty list: []

    def parse_queries(html):
        queries = html.css("div.MjjYud")
        print(queries)

I, therefore, cannot grab nested "div", "h3", and "cite" classes which hold the information I require to populate my dataclass attributes (website address, website title, website text). For example: address --> title --> text -->

(*) As you suggested, I also looked at the page source and did find this "MjjYud" div class.

My code:

    import httpx
    from selectolax.parser import HTMLParser
    from dataclasses import dataclass, asdict
    import csv

    @dataclass
    class Query:
        website: str
        title: str
        information: str

    def get_html():
        url = "www.google.com/search?q=data+science+courses"
        resp = httpx.get(url)
        html = HTMLParser(resp.text)
        return html

    def parse_queries(html):
        queries = html.css("div.MjjYud")
        print(queries)
        results = []
        for item in queries:
            new_item = Query(
                website=item.css_first("cite.iUh30 qLRx3b tjvcx").text(),
                news_title=item.css_first("h3.LC20lb MBeuO DKV0Md").text(),
                textual_info=item.css_first("div.VwiC3b yXK7lf MUxGbd yDYNvb lyLwlc lEBKkf").text()  # the inside
            )
            results.append(asdict(new_item))
            print("new_item")
        return results

    def to_csv(res):
        with open("results.csv", "a") as f:
            writer = csv.DictWriter(f, fieldnames=["website", "news_title", "textual_info"])
            writer.writerows(res)

    def main():
        html = get_html()
        res = parse_queries(html)
        to_csv(res)

    main()

Thank you very much for taking the time to read my comment 🙏🏼
@nicolasalarcon58 1 year ago
Hey thank you so much for your explanation!, What happen when products have this structure? ... ... ... ... ... because I cant get anything from this web I tried everything like html.css(div.Fractal-ProductCard__productcard--container ) or html.css(div.productcard--container) or html.css(div.t:m|n:productcard|v:default) and much more
@zelt7466 1 year ago
requests_html one love)
@karlblau2 1 year ago
Great video