Don't think I ever did this so it's well overdue... You helped me get a job as a software engineer. I used things I learned from your vids to make a project that was instrumental in getting a job offer. Thank you so much, you changed the financial trajectory of my whole family! (for others looking for the same, a major contributor to standing out is having an AWS cert)
@JohnWatsonRooney 1 year ago
thank you that's amazing, the reason I do this is to help people and it's great to hear! congratulations on your job!
@IwoGda 1 year ago
What AWS cert is the best?
@sdriding 1 year ago
@@IwoGda probably the Developer Associate
@damilolaowolabi9148 4 months ago
what projects did you build?
@sdriding 4 months ago
@@damilolaowolabi9148 if you want to stand out build a project using the tools listed in a job posting you're really interested in. I had some random personal projects but what got their attention was that one of my projects listed many of the tools they were looking for. It was a ridiculously basic project and practically a laughing point during my interviews but it got me in the door
@ManuelGonzales-ni9sh 1 year ago
Great tutorial John! Would you please consider doing a full tutorial on your nvim theme & config?
@JohnWatsonRooney 1 year ago
Thanks! Yes I will do a video on my nvim, I’ve been configuring it a little more recently and will share soon
@TheJFMR 1 year ago
John, it would be nice if you made a video on how to apply unit testing or Test-Driven Development to a web scraping project 😉 You're a good teacher for that
@JohnWatsonRooney 1 year ago
Interesting idea, I’ll add it to my list thanks!
@Kicsa 1 year ago
I have been enjoying your good videos, thank you for everything. I hope in a couple of weeks, I can start making my own programs.
@runnrnr 1 year ago
Thank you for your videos! I now link them to people who ask me questions about selectolax. I'm the author of selectolax.
@JohnWatsonRooney 1 year ago
Oh cool thank you! Selectolax is great I use it all the time - appreciate your work!
@Алексей-й4з6ш 6 months ago
You should write a better manual, it's very poorly documented
@adarshjamwal3448 1 year ago
Awesome 👍👍 tutorial. I learned a lot from your scraping series. Keep it up.
@JohnWatsonRooney 1 year ago
Thank you, glad I can help
@JuanPerez-iu9vk 3 months ago
I love your VIM workflow. Could you make a video some day about VIM and your config file and plugins?
@samoylov1973 1 year ago
The set comprehension is a nice touch in this video. While watching I thought of converting to a set afterwards, but doing it in one easy go, as you did, is better. One wish: when you explain parts like "when you want to grab all this table information..." (20:19 in the video), please show at least one piece of it through to the end. I'll figure out how to do the others :)
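A minimal sketch of that kind of set comprehension with selectolax (the HTML and selector below are made-up placeholders, not the exact code from the video):

from selectolax.parser import HTMLParser

html = HTMLParser("<div><a href='/p/1'>A</a><a href='/p/1'>A</a><a href='/p/2'>B</a></div>")

# build a de-duplicated set of links in a single pass instead of building a list and calling set() afterwards
links = {node.attributes["href"] for node in html.css("a")}
print(links)  # {'/p/1', '/p/2'} (set order is not guaranteed)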
@mchahal22 2 months ago
Great content, thanks!
@yacinehechmi6012 1 year ago
Greetings from Tunisia, thanks John!! Waiting for that nvim video, I would really love to know what you configured in nvim for Python development.
@valuetraveler2026 1 year ago
Good to see alternatives for parsing (selectolax). Will use rich from now on. I don't personally like to use dataclasses/pydantic for most of my work as it has hundreds of fields, but this is cleaner code than an imperative style down the page
@JohnWatsonRooney 1 year ago
I really like selectolax. And fair enough regarding dataclasses - for me at the moment the benefits outweigh the downsides
@flashwade888 1 year ago
Thank you so much for the detailed tutorial, John! I have a quick question - would it be possible to use dataclasses with Scrapy, please?
@JohnWatsonRooney 1 year ago
Thanks, glad you liked it! Yes, you can use dataclasses with Scrapy since 2.2
@flashwade888 1 year ago
@@JohnWatsonRooney Cheeeeers!! I cannot wait to give it a go!
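For anyone curious about the dataclasses-with-Scrapy point above, a minimal sketch (assuming Scrapy >= 2.2; the spider name, URL and selectors are made-up placeholders):

from dataclasses import dataclass
import scrapy

@dataclass
class Product:
    name: str
    price: str

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder URL

    def parse(self, response):
        # Scrapy 2.2+ accepts dataclass instances as items (via itemadapter)
        for card in response.css("div.product"):
            yield Product(
                name=card.css("h2::text").get(default=""),
                price=card.css("span.price::text").get(default=""),
            )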
@malwaredev33 1 year ago
Excellent video content, all videos are understandable for anyone. Can you tell me what font/theme you're using in VS Code in this video? Thanks
@JohnWatsonRooney 1 year ago
Thanks! Editor is Neovim and colour scheme is called oxocarbon
@amarAK47khan 1 year ago
You are a life saver!
@AhmedAl-Yousofi 1 year ago
What editor are you using?
@JohnWatsonRooney 1 year ago
Neovim
@DrChrisCopeland 1 year ago
I have learned a lot from your videos. Can you do any type of tutorial on report generation for the scrapes? My main use case is: once I identify a page that meets my requirements, I generate a PDF (or something) that would show the page as it was. I've had terrible luck with htmltopdf and similar libraries (or point me in the right direction). Thanks for what you do!
@JohnWatsonRooney 1 year ago
Are you after just a visual representation of the page? Playwright can do that very easily. Or are you grabbing data and want that in a PDF? Sorry, not quite sure what you mean!
@DrChrisCopeland 1 year ago
@@JohnWatsonRooney visual representation as far as I can tell (use case is still in the works/fluid). Once an item/listing on the page meets a requirement, save that individual info to a pdf, run some more stuff, then on to the next item/listing. Due to the subject matter, I don't want to put more in the comments, but yeah I'm learning a lot here and it's all going to work on a non-profit I run in the US.
@DrChrisCopeland 1 year ago
@@JohnWatsonRooney I will look at playwright as well!
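A minimal sketch of the Playwright approach mentioned above for saving a visual copy of a page (the URL and output paths are placeholders; page.pdf only works in headless Chromium):

from playwright.sync_api import sync_playwright

URL = "https://example.com/listing/123"  # placeholder

with sync_playwright() as p:
    browser = p.chromium.launch()  # headless by default
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")
    page.screenshot(path="listing.png", full_page=True)  # image of the whole page
    page.pdf(path="listing.pdf")  # PDF snapshot (Chromium, headless only)
    browser.close()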
@DucNguyen-in1xd 1 year ago
can you give an example of selecting by class?
@JohnWatsonRooney 1 year ago
Class is separated by a dot “div.class”
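A small selectolax example of selecting by class (the HTML and class names are made up):

from selectolax.parser import HTMLParser

html = HTMLParser('<div class="card"><p class="price">$10</p><p class="title">Bag</p></div>')

# tag name followed by a dot and the class name
for node in html.css("div.card p.price"):
    print(node.text())  # $10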
@rz84vlog78 1 year ago
The tutorial really helped me. Is it possible to scrape a website like College Board, since the basic authentication with username and password doesn't seem to work? Would love to at least get some tips so that I can scrape the more complex websites.
@JohnWatsonRooney 1 year ago
Hey thanks, glad it helped. For websites that need a login I generally lean towards browser automation (playwright) simply because it is much quicker and easier to get something working. I'd suggest looking into it if you haven't already - there are a few videos on my channel that could help
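A rough sketch of that browser-automation approach to a login page with Playwright (every URL and selector here is a placeholder to adapt to the real site):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com/login")
    page.fill("input[name='username']", "my_user")      # placeholder selector/value
    page.fill("input[name='password']", "my_password")  # placeholder selector/value
    page.click("button[type='submit']")
    page.wait_for_load_state("networkidle")
    page.goto("https://example.com/account")  # now authenticated
    html = page.content()  # hand this off to selectolax or similar for parsing
    browser.close()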
@michakuczma4076 1 year ago
Is this the M+ 1M font you use in your IDE? Very nice and readable
@JohnWatsonRooney 1 year ago
Yes it is - although I think it's M+ 2M. It's great, I've been using it for a while now
@codetechpro 1 year ago
Hey John, I was wondering, is it possible to fill out a dynamic Visa card form with Selenium or Playwright?
@JohnWatsonRooney 1 year ago
I don't know that one specifically, but I've filled out loads of forms with Playwright and Selenium before - if it loads the page fine you'll have access to the forms to fill out data
@charlescharles4279 1 year ago
Awesome tutorial, do you notice any performance drop when using dataclass to save data during web scraping compared to using dicts?
@JohnWatsonRooney 1 year ago
Thanks! Generally no, the time lost in scraping is in the network connections so I’ve never worried about it much
@anthonymunnelly20 1 year ago
Excellent. Really, really well-done tutorial on a subject that seems straightforward, but isn't.
@tm_Panda... 1 year ago
Hey, I was wondering why you stopped using Scrapy? Was it too big of a framework for the scraping projects you do? Great video as always!
@JohnWatsonRooney 1 year ago
I found that I preferred to write my own solutions from the ground up for what I was trying to do; Scrapy is still a great framework though. I have a video on my channel about it if you are interested in more details
@mxdigitalmediamarketplace 11 months ago
Hello, I'm a newbie at scraping. When I wrote @dataclass it did not let me do it, it says it is not an integer. I'm using Python 3.12, httpx, selectolax and rich, as you mentioned in the tutorial
@mxdigitalmediamarketplace 11 months ago
Hello, following your tutorial, I am getting an error on line 26:
resp = client.get(url, headers=headers)
Traceback (most recent call last):
File "", line 1, in
resp = client.get(url, headers=headers)
NameError: name 'client' is not defined
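That NameError usually just means the httpx client was never created before being used; a minimal sketch of the usual pattern (the URL and header value are placeholders):

import httpx

url = "https://example.com"  # placeholder
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder

# create the client first, typically as a context manager so it gets closed automatically
with httpx.Client() as client:
    resp = client.get(url, headers=headers)
    print(resp.status_code)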
@coyoteden8111 1 year ago
Early morning web scraping lesgo
@ZhCrypto 1 year ago
U are innocent programmer ❤
@atatekeli9295 1 year ago
Hi John, I tried turning your header code into this for macOS:
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.9999.99 Safari/537.36"
}
I use Google Chrome for web scraping, an M1 chip and macOS Ventura 13.4. How can I make it compatible for my scraping?
@JohnWatsonRooney 1 year ago
Hi - the user agent header is what we send with the request to the website - it can be anything, you can use the same one I do or any that you can find on google. It doesn’t need to match your system
@atatekeli9295 1 year ago
@@JohnWatsonRooney Would it cause an error if I use the same code even though it's not configured to my system?
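A minimal sketch of passing that header with httpx - the string is only text sent with the request, so it does not need to match the machine running the script (the URL is a placeholder):

import httpx

# any common browser user agent string works; it is just sent as-is with the request
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.9999.99 Safari/537.36"
}

resp = httpx.get("https://example.com", headers=headers)  # placeholder URL
print(resp.status_code)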
@keifer7813 1 year ago
What do you do when the elements you want have dynamically changing classes like class="xJdnxidXjejns xIdhdn39db xzIJhdidmn8"?
@JohnWatsonRooney 1 year ago
go back up the element tree until you find one that is constant, then reference off of that. I use css selectors so something like "div.constantclass li a" for all the a tags within li tags in divs with class "constantclass"
@ankylosis751 1 year ago
@@JohnWatsonRooney would really love a tutorial on this... and if you've made something similar on these dynamically changing classes can you link me? I'm at my wits' end. Btw superb content man, it's helping me learn Python deeply too
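A small sketch of that "anchor off a stable ancestor" idea with selectolax (the class names and HTML are made up):

from selectolax.parser import HTMLParser

html = HTMLParser("""
<div class="constantclass">
  <ul>
    <li><a class="xJdnxidXjejns" href="/item/1">Item 1</a></li>
    <li><a class="xIdhdn39db" href="/item/2">Item 2</a></li>
  </ul>
</div>
""")

# skip the generated class names entirely and select relative to the stable parent
for a in html.css("div.constantclass li a"):
    print(a.attributes.get("href"), a.text())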
@malwaredev33 1 year ago
Hi bro, how are you?
@richardboreiko 1 year ago
I'm getting an error on page 20 and it's consistent, but the products seem to vary each time the page appears, so they must be getting unordered data from their SQL statement.
File "C:\Users\richa\AppData\Local\Programs\Python\Python310\lib\ssl.py", line 1132, in read
    return self._sslobj.read(len)
TimeoutError: The read operation timed out
It looks like the last line from your code to be executed was this:
File "C:\Users\richa\PycharmProjects\webScraping\JohnWatsonRooney\ModernScrapingBestTools.py", line 28, in get_page
    resp = client.get(url, headers=headers)
File "C:\Users\richa\PycharmProjects\webScraping\venv\lib\site-packages\httpx\_transports\default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out
It happens consistently on www.rei.com/c/backpacks?page=20 but the number of products printed seems to vary before the error occurs. Do you have any debugging suggestions?
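For anyone hitting the same thing: a ReadTimeout means the site took longer to respond than httpx's default 5 second timeout, so one reasonable first step is to raise the timeout and catch the exception so a single slow page doesn't stop the whole run - a minimal sketch (the header value is a placeholder):

import httpx

url = "https://www.rei.com/c/backpacks?page=20"
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder

with httpx.Client(timeout=httpx.Timeout(30.0)) as client:
    try:
        resp = client.get(url, headers=headers)
        print(resp.status_code, len(resp.text))
    except httpx.ReadTimeout:
        print("read timed out - retry later or skip this page")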