How I Scrape multiple pages on Amazon with Python, Requests & BeautifulSoup

94,173 views

John Watson Rooney


3 years ago

In this video I will demonstrate one of the ways to deal with pagination when scraping the Amazon website. We check to see if the next button is available, then collect the URL from it and, using our functions, move on to scrape the next page. This works well because we can let it run and collect all the pages without having to add a number to the URL each time. This method also works for other websites that have a similar style of pagination.
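A minimal sketch of the approach described above, assuming requests and BeautifulSoup are installed. The class names ('a-pagination', 'a-disabled a-last', 'a-last') reflect Amazon's markup at the time of the video and may have changed since; the base URL and search path are placeholders.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.amazon.co.uk"  # placeholder base URL


def get_data(url):
    # Fetch and parse one results page; a browser-like User-Agent helps
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    return BeautifulSoup(r.text, "html.parser")


def get_next_page(soup):
    # Return the next page's URL, or None once the "next" button is disabled
    pages = soup.find("ul", {"class": "a-pagination"})
    if pages is None or pages.find("li", {"class": "a-disabled a-last"}):
        return None
    return BASE_URL + pages.find("li", {"class": "a-last"}).find("a")["href"]


def scrape_all(url):
    # Follow "next" links until there are none left
    while url:
        soup = get_data(url)
        # ... extract product data from soup here ...
        url = get_next_page(soup)
```

Calling scrape_all(BASE_URL + "/s?k=headphones") would then walk every results page without hard-coding page numbers.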
code: github.com/jhnwr/amazon-pagin...
Digital Ocean (Affiliate Link) - m.do.co/c/c7c90f161ff6
-------------------------------------
Disclaimer: These are affiliate links and as an Amazon Associate I earn from qualifying purchases
-------------------------------------
Sound like me:
microphone amzn.to/36TbaAW
mic arm amzn.to/33NJI5v
audio interface amzn.to/2FlnfU0
-------------------------------------
Video like me:
webcam amzn.to/2SJHopS
camera amzn.to/3iVIJol
lights amzn.to/2GN7INg
-------------------------------------
PC Stuff:
case: amzn.to/3dEz6Jw
psu: amzn.to/3kc7SfB
cpu: amzn.to/2ILxGSh
mobo: amzn.to/3lWmxw4
ram: amzn.to/31muxPc
gfx card amzn.to/2SKYraW
27" monitor amzn.to/2GAH4r9
24" monitor (vertical) amzn.to/3jIFamt
dual monitor arm amzn.to/3lyFS6s
mouse amzn.to/2SH1ssK
keyboard amzn.to/2SKrjQA

Comments: 126
@JohnWatsonRooney 3 years ago
UPDATE: check the repo for a short code tweak - github.com/jhnwr/amazon-pagination
def getdata(url):
    r = s.get(url)
    r.html.render(sleep=1)
    soup = BeautifulSoup(r.html.html, 'html.parser')
    return soup
@KhalilYasser 3 years ago
Amazing. Thanks a lot for your support.
@axvex595 3 years ago
I tried this script as well, no luck...
@axvex595 3 years ago
This is the error I'm getting: "The application has failed to start because its side-by-side configuration is incorrect. Please see the application event log or use the command-line sxstrace.exe tool for more detail". Any ideas!?
@genedummac 3 years ago
Great tutorial bro! Please tell me what VS Code theme you are using in this video, I like it. Thanks
@JohnWatsonRooney 3 years ago
@@genedummac Sure, it's called One Dark Pro
@proxyscrape a year ago
Amazing tutorial John! I love how you break down the process of pagination. Keep up the great work :)
@pradeepkumar-qo8lu 3 years ago
This method is more intuitive than concatenating the page numbers. Thanks for the useful content 👍
@vahsek7488 a year ago
The best and simplest way for scraping Amazon products, hats off to you. Lots of love from India
@JayBeeDev 3 years ago
You are a hero John ❤️! Love what you do!
@jonathanfriz4410 3 years ago
Very nice John, I always learn something new with your videos.
@michaeltillcock3864 a year ago
Subscribed, really well explained. I would love to see a video showing web scraping of Wikipedia tables, with a loop that can input different Wikipedia URLs based on URLs stored in an Excel file. I can't find a video on this and I think it would be very popular!
@ammaralzhrani6329 3 years ago
Please keeps going! Your channel will grow slightly. Promise
@samvid1992 3 years ago
Thank you very much. This is exactly what I was looking for.
@leleemagnu6831 3 years ago
Great videos John! I could not be more grateful. Thank you! A suggestion for a more concise ending:
while url:
    data = get_data(url)
    url = get_next_page(data)
    print(url)
print('That\'s all folks THE END')
@JohnWatsonRooney 3 years ago
Haha, thank you very much!
@Mangosrllysuck 3 years ago
Great content! Liked and subbed. Thanks for doing this
@shahalmoveed6191 3 years ago
Thank you sir for your brilliant explanation 💯
@mikkiverma9545 3 years ago
Thanks John, it really helped.
@irfankalam509 3 years ago
Very useful one! Keep Going!
@kamaleshpramanik7645 3 years ago
That is a very helpful video. Thank you very much Sir.
@anilfirat7651 3 years ago
Hi John, another nice one! Thanks a lot! Can you do a video for a price tracker that includes the price inside the buybox, shipping price, fees etc. based on delivery location? Because this information may change based on location. Hope you see this comment. Keep rocking mate!
@fabianrestrepo82 3 years ago
Fantastic!
@raywong9832 a year ago
Hi, thanks for the nice video. I ran into a page that navigates pages via a dropdown box. I was able to scrape all the option values from the dropdown, but I don't have any idea how to navigate to a page from the options. Do you have any existing videos or documentation for me to reference?
@adnanklc1527 3 years ago
This is very nice content, thanks for that. Can we get a full video where we pull user comments from a dynamic page (for example from ten pages) and add them to a list?
@BotanicalOdyssey 3 years ago
These are so great John thank you for posting!
@Gh0stwrter 3 years ago
Great video dude
@stalluri11 3 years ago
Hi John, how do we scrape data when the URL doesn't change for the next page?
@brokerkamil5773 11 months ago
Thx John ❤❤
@erkindalkilic 3 years ago
Thank you very much bro. The information you have given is amazing.
@JohnWatsonRooney 3 years ago
You are most welcome
@erkindalkilic 2 years ago
@@JohnWatsonRooney Sir, how can I reach you? Twitter? Email? Any social media platform?
@eddiethinhvuong1607 3 years ago
Hi John, thanks for the video, very intuitive and easy to understand for a newbie. I've always used Selenium for web scraping and have been learning a bunch from your videos :) I've got a question hopefully you could answer: what would you do if the response is a captcha instead of the actual site when sending a request? I have been trying to find a way to get through it but found none. Thank you!
@JohnWatsonRooney 3 years ago
Thanks! Captchas are a bit more of a challenge, have a look at some captcha solving services and see how they work, you will get a better understanding that way!
@Magma-uw7yo 5 months ago
Is it possible to get the content with a loop if the URL doesn't change? When I click on the button, the content changes but not the URL
@Dr_Knight 2 years ago
Thanks for this video! Is it possible to parse data with Beautiful Soup if there is a button that loads more data on the same page?
@LiverpoolDon1981 3 years ago
dude you're awesome 😎
@3wXpertz a year ago
I have a website where the next page doesn't show any link; it just shows # at the end of the URL. Every page I move to, the URL doesn't change; it shows just # for every page and number I hover over. How do I get the URL for each individual page?
@wikd13 3 years ago
Really helpful video.
@JohnWatsonRooney 3 years ago
Glad you think so!
@yogeshkumarshankariya642 2 years ago
Hi John, what can I do when the next page URL postfix looks like 'MTYzMjMwMzE3NDAwMHw2MTRhZjg0NjYyMjNlMjIxMThiNzYxODY' instead of a number? Also, while scraping, the URL I get in Python is different from what the inspect page shows.
@rajatchauhan6675 3 years ago
What method can be used to scrape "load more" data?
@dongmogilles3209 2 years ago
I wish to scrape over ten pages; how can I use a for loop in your code? Thanks
@almirf3729 a year ago
Awesome video, thanks
@JohnWatsonRooney a year ago
Glad you liked it!
@jayp9148 2 years ago
Hey John, I'm getting this error:
if not page.find('span', {'class', 's-pagination-item s-pagination-disabled'}):
AttributeError: 'NoneType' has no attribute 'find'
@nidhikaushik2861 2 years ago
Hi John, thanks for the amazing content, I have learnt a lot from your videos... I have a question: while printing out the soup it is giving me a 503 Service Unavailable error. How do I deal with that?🙄
@harshparikh7898 3 years ago
thanks a lot!
@TechsAndSpecs 3 years ago
Thanks for this video. How can we limit the search to only the first 10 pages?
@ismailsufiani2810 a year ago
appreciated
@justins7796 3 years ago
A++ videos man, you'll be big in no time :D
@nikhildoye9671 a year ago
Hi John, on Zomato (a food delivery app) the next button gets hidden once we reach the final page. How should I proceed?
@samithagoud189 a year ago
How do I web scrape recent job openings posted in the last 24 hours?
@muhammedasrar8773 3 years ago
Hi John, nice one. Can we get the EAN/UPC from the Amazon website?
@prateeksharma-ig5qg 2 years ago
What can we do if the URL doesn't contain a page number? Please help..
@danielcanizalez8558 2 years ago
Great tutorial, thanks!!!! I need to process like 100 URLs and it returns a 503 :( any help?
@quangjoseph8287 2 years ago
Hi bro, I'm stuck on a problem with Web Worker APIs when scraping websites. The website always sends a preflight request before it sends the main request. Could you please make a video about it?
@mylordlucifer 2 years ago
Thanks
@ollie5845 2 years ago
Does anybody know what theme this is?
@mohammedthanveerulaklam9288 a year ago
The code runs smoothly on the first iteration, but when it moves to the second iteration of the loop, it fails and shows the following error:
if not page.find('li', {'class': 'a-disabled a-last'}):
AttributeError: 'NoneType' object has no attribute 'find'
I don't know the solution, please help....☹
@imranullah7355 2 years ago
Sir, I get an empty list from soup.find_all("div", class_="some class"), although there are some children of this class. What can be the reason?
@pietravalle69 a year ago
I'm new to Python and I get this error: ModuleNotFoundError: No module named 'requests_html'
@rahulwadwani9345 a year ago
Sir, how do you handle 403 errors? Can you make a detailed video on it? If there is one, can you tag it here?
@ikalangitahaja 2 years ago
Great, but it does not work if the product list has no pagination; just check whether the pagination element exists to solve it
@lautarob 2 years ago
I have seen a couple more videos today. All excellent. Thank you so much. I would like to see a video scraping a site protected by login credentials. Would that be possible? Also: would it be possible to scrape the content of a SharePoint site, having admin credentials to access it?
@JohnWatsonRooney 2 years ago
Interesting prospect, not one that I have done really, as most things like that have an API that is best used. I know SharePoint does, but I understand that in most companies getting access to the API would be quite difficult
@lautarob 2 years ago
@@JohnWatsonRooney Thanks John. The main reason to suggest these topics is precisely because they are complex, and your clear and thoughtful explanations would be of great help.
@Adaeze_ifeanyi a year ago
Am I the only one confused by the web scraping method? I have watched tons of your videos and I am still getting errors. I need help; I can't seem to load all the pages, and I have tried all the methods in your videos.
@srivathsgondi191 a year ago
Hi John, I've tried to scrape the Amazon website, but their anti-bot measures keep blocking my requests. My status_code is always 503. How do I fix this?
@JohnWatsonRooney a year ago
Have you tried the correct user agent? Copy some of the headers from when you load the page up in Chrome too, that should help!
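For anyone following this thread, a sketch of that suggestion using the requests library: set browser-like headers on a session before fetching. The header values below are examples copied from a desktop browser, not the only ones that work, and the search URL in the usage note is a placeholder.

```python
import requests


def make_session():
    # Reuse one session so headers (and cookies) apply to every request
    s = requests.Session()
    s.headers.update({
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36"),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-GB,en;q=0.9",
    })
    return s
```

Usage would then be something like r = make_session().get("https://www.amazon.co.uk/s?k=headphones") and checking r.status_code before parsing.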
@srivathsgondi191 a year ago
@@JohnWatsonRooney I fixed the issue by using proxies to make the requests.
@ansadhedhi2469 2 years ago
I am unable to import requests_html. What could be the issue?
@JohnWatsonRooney 2 years ago
Make sure you install it with pip first - pip install requests-html
@technoscopy 2 years ago
Hello, my submitted data is in a JSF ViewState; how do I scrape that? Please help me 😭😭
@renemiche735 3 years ago
Hi John, what is the difference between requests and requests-html? I have no answer to this question. As an example, why use requests-html and not requests in this case? (one vs the other) Thanks for your work from France :) (I got a 503 error, so I'm going to your new tutorial.)
@JohnWatsonRooney 3 years ago
Hi Rene. They have similar names, however they are two separate Python libraries (made by the same person). requests is for working with HTTP protocols, and that is it; requests-html also has its own HTML parser and the ability to render a page's JavaScript, allowing us to scrape more sites
@renemiche735 3 years ago
@@JohnWatsonRooney thank you (merci) 🙏
@KhalilYasser 3 years ago
Thank you very much. I encountered this error:
if not pages.find('li', {'class': 'a-disabled a-last'}):
AttributeError: 'NoneType' object has no attribute 'find'
Can you help me fix that?
@henrygreen737 3 years ago
I am getting the same error. My getdata soup value is returning "To discuss automated access to Amazon data please contact api-services-support@amazon.com.". I tried using a user-agent and got the same result. I don't have an answer, but this is my problem.
@JohnWatsonRooney 3 years ago
check the repo, I updated it - the getdata() function should look like this:
def getdata(url):
    r = s.get(url)
    r.html.render(sleep=1)
    soup = BeautifulSoup(r.html.html, 'html.parser')
    return soup
github.com/jhnwr/amazon-pagination thanks!
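The AttributeError reported throughout this thread means soup.find('ul', ...) returned None: the pagination block never rendered, typically because the page was blocked, returned a captcha, or had no results. A defensive sketch of a next-page helper, assuming BeautifulSoup and the 'a-pagination'/'a-disabled a-last'/'a-last' class names from the video:

```python
from bs4 import BeautifulSoup


def getnextpage(soup, base_url="https://www.amazon.co.uk"):
    # Return the next page's URL, or None if there isn't one
    pages = soup.find("ul", {"class": "a-pagination"})
    if pages is None:
        # Pagination never rendered: blocked page, captcha, or no results
        return None
    if pages.find("li", {"class": "a-disabled a-last"}):
        return None  # "next" button is disabled: last page reached
    next_li = pages.find("li", {"class": "a-last"})
    if next_li is None:
        return None
    return base_url + next_li.find("a")["href"]
```

Checking for None on each lookup, instead of chaining .find() calls, turns the crash into a clean end-of-loop.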
@mezianibelkacem650 3 years ago
Yes, I got this error too
@cm4u825 3 years ago
Hi, I am having this error, what do I do now?
RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.
@JohnWatsonRooney 3 years ago
Hi! This is because you are using Jupyter notebooks or similar, which causes issues - this code needs to be run as a .py file using VS Code or another editor
@cm4u825 3 years ago
@@JohnWatsonRooney Thanks for the reply. Will this work fine in Spyder or another editor? Please share other editor names
@hogrider423 3 years ago
I did exactly the same thing but I have an error message:
line 15, in getnextpage
if not page.find('li', {'class': "a-disabled a-last"}):
AttributeError: 'NoneType' object has no attribute 'find'
What should I do??
@nathanscott4007 2 years ago
I have the same issue
@manny7662 3 months ago
Would you recommend web scraping or using an official API?
@JohnWatsonRooney 3 months ago
Official API if you can. Structured data and no breaking changes without warning (hopefully)
@alessandrowind4544 3 years ago
Hello, I updated the script and tried to solve this, but I still get this error:
Traceback (most recent call last):
  File "C:\Users\Mark\Desktop\script.py", line 76, in <module>
    url = getnextpage(data)
  File "C:\Users\Mark\Desktop\script.py", line 67, in getnextpage
    if not pages.find('li', {'class': 'a-disabled a-last'}):
AttributeError: 'NoneType' object has no attribute 'find'
@alessandrowind4544 3 years ago
In the getdata(url) method, of course, I do other operations like printing out some data
@mattmovesmountains1443 3 years ago
Of all the scraping tools you use, which would you recommend for building a bot to buy a PS5? Seems like a possible viral tutorial right now
@mattmovesmountains1443 3 years ago
After writing this, I decided to try Helium and was able to get a few basic refresh and auto-purchase bots running for the stores that don't use bot detection.
@JohnWatsonRooney 3 years ago
That would have been my suggestion. Perhaps use basic scraping techniques to scrape multiple stores' pages every hour or so to see if it's in stock. If it is, then run Helium to add it to the cart and email you to complete the purchase
@faker_fakerplaymaker3614 2 years ago
Didn't work for me... the HTML was different. It's always different from the tutorials, so I never know how to access the tags.
@MarcelStrobel 3 years ago
Hey John, fantastic content as usual! I get the following error - could you please explain why?
page = soup.find('ul', {'class': 'a-pagination'})
TypeError: slice indices must be integers or None or have an __index__ method
@JohnWatsonRooney 3 years ago
Hi Marcel, it seems that it is not finding that HTML element on the page. I'd suggest checking that the page was rendered properly, and having a look to see why the pagination list isn't appearing
@MarcelStrobel 3 years ago
@@JohnWatsonRooney Hey John, in fact, with requests-html the page wasn't rendered properly. So now I am using Splash and the site is rendered properly. I also checked manually that the specified class is there. Could it be a problem with the Python version? I am using 3.6.2.
@JohnWatsonRooney 3 years ago
@@MarcelStrobel Did you have the sleep=1 in there? Odd, I've not had an issue with that before
@MarcelStrobel 3 years ago
@@JohnWatsonRooney I'm gonna try it and get back to you. Thank you very much for your help!
@MarcelStrobel 3 years ago
@@JohnWatsonRooney I put your tweak in and now I am getting a valid response from requests-html. Still getting the same error as above. I sent you an invite for the code on git. If you have time to look at it, that would be highly appreciated
@ismaelruizranz7799 3 years ago
Great video my friend. Do you know any way to use the requests_html library in a Jupyter notebook?
@JohnWatsonRooney 3 years ago
Thank you. You can't use it with Jupyter I'm afraid, as they both use the same event loop and it clashes
@ismaelruizranz7799 3 years ago
@@JohnWatsonRooney No problem John, I can still learn your method. If anyone has the same problem, the quick fix is to create the bot in a .py file instead of a .ipynb; then you can execute it from the command line. In my case, using Linux, just run python3 bot.py on the command line
@avinashk8231 2 years ago
Make a video on scraping book prices: used book price, hardcover price, paperback price, etc.
@thetravellingdream3480 3 years ago
I am getting this error: cannot import name 'HTMLsession' from 'requests_html'
@JohnWatsonRooney 3 years ago
I think it's a capital S for Session
@thetravellingdream3480 3 years ago
@@JohnWatsonRooney I fixed it by installing requests_html via pip install. Thanks for the reply anyway, great tutorial :)
@viettuan5798 3 years ago
Helpful. Liked and subbed. Can you make a video about how to handle it when Amazon detects the scraping script as a bot? Thanks
@muhammadjamshed2128 a year ago
How can I fetch any Amazon product's BSR into a Google Sheet on a daily basis? Please make a video on it, to track products' BSR, price, and review numbers daily. Thanks for all the tutorials.
@azle7206 3 years ago
Please sir, scrape ASIN data across multiple pages with a script....
@SajjadKhan-cn6mv 2 years ago
if not pages.find('li', {'class': 'a-disabled a-last'}):
AttributeError: 'NoneType' object has no attribute 'find'
Running the exact code... pages is NoneType
@aslammasood9504 2 years ago
Its visibility is not clear.
@UbaidKhan-cm2gz 3 years ago
There is no class "a-disabled a-last". I am scraping amazon.in.
@goujoe2880 a year ago
if not pages.find('li', {'class': 'a-disabled a-last'}):
       ^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'find'
@barathsekar8616 2 years ago
Does anyone here know how to scrape the view-more button?
@cornstarch27 2 years ago
John- This may help you with your bs4 issues: kzbin.info/www/bejne/bHyWhqOhqbZ7b9k.
@im4485 3 years ago
Hi John, can you please explain r.html.html? Why twice?
@python689 a year ago
Hello, help me please: how do I get the text "Wilson Tour Premier All Court 4B" out of this?
soup = BeautifulSoup(html, 'lxml')
title = soup.find('h1', class_='product--title')
The h1 text is "Tennis balls Wilson Tour Premier All Court 4B"
@blogsbarrel4734 3 years ago
if not page.find('li', {'class' : 'a-disabled a-last'}):
@Cubear99 3 years ago
I took all the steps and it does not show page 2; it keeps looping page 1 until I break. If I copy your code, this is what happens:
      2     # this will return the next page URL
      3     pages = soup.find('ul', {'class': 'a-pagination'})
----> 4     if not pages.find('li', {'class': 'a-disabled a-last'}):
      5         url = 'www.amazon.co.uk' + str(pages.find('li', {'class': 'a-last'}).find('a')['href'])
      6         return url
AttributeError: 'NoneType' object has no attribute 'find'
@Adaeze_ifeanyi a year ago
def transform(soup):
    articles = soup.find_all('article', {'itemprop': 'review'})
    for feedback in articles:
        title = feedback.find('h2').text.replace(' ', '')
        ratings = float(feedback.find('div', {'itemprop': 'reviewRating'}).text.replace('/10', '').strip())
        body = feedback.find('div', {'class': 'text_content'}).text.replace('✅', '')
        date = feedback.find('time').text
        reviews = {
            'title': title,
            'ratings': ratings,
            'body': body,
            'date': date
        }
        reviewlist.append(reviews)
    if not feedback.find('li', {'class': 'off'}):
        url = 'www.airlinequality.com/airline-reviews/british-airways' + str(feedback.find('li')).find('a')['href']
        return url
    else:
        return