Render Dynamic Pages - Web Scraping Product Links with Python

Views: 66,609


Thanks to Stuart for sending this site in! I enjoyed this scraping challenge.
This video shows a simple method that can help with dynamically loaded content. I use the requests-html library to render the page in the background quickly and efficiently, and scrape all the product links from the HTML div using an XPath selector. I then loop through each link to get all the product information.
Coming in part 2 - pagination and functions to tidy up the code.
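
The workflow described above can be sketched roughly as follows. This is a minimal sketch, assuming requests-html is installed (`pip install requests-html`); `scrape_product_links` is a hypothetical helper name, and the URL and XPath you pass in are placeholders, not the ones used in the video.

```python
def scrape_product_links(url, container_xpath):
    """Render `url` and return the absolute links inside one container element."""
    # Imported here so the file still loads where requests-html is not installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    response = session.get(url)
    # Runs the page's JavaScript in headless Chromium (downloaded on first use).
    response.html.render(sleep=1)
    # first=True returns a single element, or None if nothing matches.
    container = response.html.xpath(container_xpath, first=True)
    if container is None:
        return set()
    # absolute_links is a set of fully qualified URLs found in the element.
    return container.absolute_links
```

Each returned link can then be fetched with the same session and parsed for the product details.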
-------------------------------------
twitter / jhnwr
code editor code.visualstu...
WSL2 (linux on windows) docs.microsoft...
-------------------------------------

Comments: 168

@JohnWatsonRooney 4 years ago

Keyboard too loud? I've been using my mech kb again.. Is it too distracting?

@11hamma 4 years ago

i think its fine, at least i didnt get distracted

@11hamma 4 years ago

@Vishal Gupta that website is using JavaScript to load the content. But first try using the library explained in this video by John. It looks like you can get the work done with it (I haven't used it myself, so I can't vouch for it). If this library fails, you can definitely use Selenium and get your work done. Selenium opens the page in a browser it controls, which loads all of the page contents and also gives you the option of clicking on a particular web element. A tip: just load the page with the Selenium library. Then pass the source code of that page into bs4, also known as the BeautifulSoup library, and scrape the site in the normal way from there on. This matters because Selenium's methods for extracting information from a website take a lot of time, while bs4 is much faster and has better error handling.
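
The tip above (load with Selenium, parse with bs4) might look something like this. It is a sketch under assumptions: the `selenium` and `beautifulsoup4` packages are installed, a Chrome driver is available, and the function names are made up for illustration.

```python
def fetch_rendered_html(url):
    """Load `url` in a Selenium-controlled browser and return the page source."""
    # Imported here so the file still loads where selenium is not installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # page_source holds the HTML after the browser has run the page's scripts.
        return driver.page_source
    finally:
        driver.quit()


def link_texts(html):
    """Parse HTML with BeautifulSoup and return the text of every <a> tag."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a")]
```

Calling `link_texts(fetch_rendered_html(some_url))` keeps Selenium's job limited to loading the page, leaving all extraction to bs4.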

@Neil4Speed 4 years ago

Not at all, makes it feel like you are working away!

@dimaua1830 2 years ago

I enjoy the sound. It's like in hackers in the movies :)

@kavehyarohi2886 2 years ago

kind of enjoyed it !

@schlotto 1 year ago

THANK YOU for this video and all the others. I am learning web scraping to gather data for my PhD thesis and you have helped me make such great progress in just a few days. :)

@xilllllix 3 months ago

i'm going through ALL of your videos and just finished this one! learning so much it's incredible!

@edcoughlan5742 4 years ago

I can get data from static websites using scrapy with relative ease, but I always come unstuck when I try the same with dynamic websites; I might give "requests-html" a go instead of my usual scrapy-selenium combo... Thanks for the video! 👊👊👊

@JohnWatsonRooney 4 years ago

Glad you liked it - give it a go. I believe scrapy-splash is an add on for scrapy that can reload dynamic pages but I’m yet to try it

@ottomanasina1254 4 years ago

Amazing explanation skills! Everything was clear. One of the greatest video for web scraping so far! Good job, Good luck!!

@mia_bobia_ 2 months ago

this was super useful! I have a project rn that needs to scrape on many pages that need renderer. This looks much more lightweight than what I'm using rn (selenium)

@kewl201 3 years ago

Man this is some amazing content. So glad i found your channel! Definitely earned a subscribe.

@JohnWatsonRooney 3 years ago

Thanks!

@agsantiago22 3 years ago

Lifesaver! Thank you so much! Wish you the best of luck with your channel!

@stuarthoughton3517 4 years ago

Brilliant, John!!! Makes complete sense now. Thank you! 👏🏻

@JohnWatsonRooney 4 years ago

Thanks for sharing the site Stuart I enjoyed this one!

@farhadkhan3893 2 years ago

Awesome!, I was searching for such type of scraping , and I found

@Neil4Speed 4 years ago

Great video John as always - Thanks!

@JohnWatsonRooney 4 years ago

Thank you!

@dobcs3236 1 year ago

You are a great and creative person...keep going champ.

@Nope-12485 2 years ago

Nice video - minus the try/except with no specific exception. I know this is a tutorial, but that's a bad habit to share. Regardless, thank you for the content.

@JohnWatsonRooney 2 years ago

Thanks, and yes you are absolutely right, I don’t do that anymore!
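
To illustrate the point in this thread, here is a small sketch of catching only the exception you expect instead of using a bare `except:`. The helper name `element_text` is hypothetical; the assumption is that the "element" comes from a requests-html `find()` or `xpath()` call with `first=True`, which returns None when nothing matches.

```python
def element_text(element):
    """Return element.text, or None when the selector matched nothing."""
    try:
        return element.text
    except AttributeError:
        # Only the "None has no .text" case is swallowed here; any other
        # error (network failure, typo in a name, etc.) still surfaces.
        return None
```

A bare `except:` would also hide unrelated failures such as a KeyboardInterrupt, which makes a scraper much harder to debug.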

@kavehyarohi2886 2 years ago

You are a truly life saver. great great video. thanks mate

@mohamadalhamawi6437 2 years ago

very helpful tutorial , thank you for your efforts

@agsantiago22 3 years ago

OMG! I would like to hit the "like" button a million times!

@JohnWatsonRooney 3 years ago

Thank you very much!

@Aaron-qn1gu 3 years ago

When I use Xpath, in products (on a different site, but same principles) terminal keeps returning 'None', the site is gwt based, would that affect xpath from working?

@paulblart8262 2 years ago

bravo sir, you gave me my eureka moment 👏

@youcannotsaypopandforgetth7609 3 years ago

Hey John, awesome video (like always). I have a question: in terms of speed, would you recommend Splash or requests-html?

@JohnWatsonRooney 3 years ago

I haven’t done any proper speed tests but they do essentially the same thing so I think it would be marginal. Requests-html has the benefit of being a python package so if that works for your needs I’d use that. Splash has the benefits of scripting though- video to come!

@youcannotsaypopandforgetth7609 3 years ago

@@JohnWatsonRooney Thanks, this helps so much.

@itsmehemant7 1 year ago

oops...You are legend...........I am blind...This is also in docs on top layer 😂(I think I need some sleep)

@royteicher 2 years ago

Hi John and everyone, I'm having trouble with the html.render() method and I'd appreciate any help. The first time the method runs, it downloads Chromium. After I ran it, 3 red lines were printed (Downloading Chromium and stuff I can't remember). I felt like it was taking too long (more than 10 minutes), so I stopped the program. Now when I try to run the method, the script just gets stuck: it is running, but it never continues to the lines after the html.render call. No errors are raised; the script simply never finishes. I tried to pip uninstall requests-html and reinstall it, but I'm getting the same uninformative result. How can I troubleshoot this problem? I'm excited to work with requests-html and to let go of Selenium for standard rendering needs, but I can't. Thanks a lot to anyone who cares enough to give it a try.

@gitgosc7075 2 years ago

great as always, thanks!

@charisthawhite2793 3 years ago

Hello John, if I add the command r.html.render(sleep=1), the output is "Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead." I searched on Google but found no clue. Any idea?

@JohnWatsonRooney 3 years ago

Hiya! Are you running it in a jupyter notebook or similar? The way they work conflicts with the render function - try running it in vs code or similar and that should work

@charisthawhite2793 3 years ago

@@JohnWatsonRooney It's running in VS Code, but I got a new error:
python .\coba.py
Traceback (most recent call last):
  File ".\coba.py", line 19, in <module>
    print(r.html.xpath("//div[@class='span6']/h1", first=True).text)
AttributeError: 'NoneType' object has no attribute 'text'
Can you tell me where I went wrong?
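
For the "Cannot use HTMLSession within an existing event loop" message quoted in this thread, the error text itself points at `AsyncHTMLSession`. A minimal sketch of that route, assuming requests-html is installed and the code runs somewhere an event loop is already active (a Jupyter notebook, for example); `render_page` is a hypothetical name.

```python
async def render_page(url):
    """Fetch and render `url` using the async session from requests-html."""
    # Imported here so the file still loads where requests-html is not installed.
    from requests_html import AsyncHTMLSession

    session = AsyncHTMLSession()
    # Both calls must be awaited; forgetting `await` on get() leaves you with a
    # Future, which has no `.html` attribute.
    response = await session.get(url)
    # arender() is the awaitable counterpart of render().
    await response.html.arender(sleep=1)
    return response.html
```

In a notebook cell this would be used as `page = await render_page("https://example.com")`, where the URL is a placeholder.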

@ubaidkhan-rr3ow 3 years ago

Thank you sir. This make sense to me

@GainzJPN 1 year ago

Thanks, again super easy to follow!

@JohnWatsonRooney 1 year ago

Thank you very much! Appreciate it.

@bagia1000 3 years ago

Hi, I tried your code on another website, but when I got to the print(products) part, it returned a 'NoneType' object. The code gets no URLs. What should I do? I tried setting the user-agent, but it also returned nothing.

@imfinitiamusic.4632 2 years ago

You are the best, subscribed

@mylordlucifer 3 years ago

thanks for learning

@nostalgeomusic 3 years ago

Great video and easy to follow for a noob like me! Appreciate it :D

@JohnWatsonRooney 3 years ago

:D thank you

@nostalgeomusic 3 years ago

@@JohnWatsonRooney Do you have any videos focusing on if statements and/or keyword lists such as changing results, for example; Junior = Entry Level Early Professional = Entry Level Graduate = Entry Level etc...

@samibdh 3 years ago

Thank you man really useful !!

@daddy_eddy 2 years ago

Super!!! I appreciated!

@justinames5439 2 years ago

John: when I follow your code, at "for item in products.absolute_links:", although I specify e.g. 'div.product-subtext', the iteration only returns item.text (the link text of the item) and not the sub-text of the item. This is true of price, name, and so forth. Can you explain this behavior?

@user-et2lr8qf3o 3 years ago

Great job keep it up keep useful

@fsamobby 3 years ago

hi, i'm trying to retrieve the data (the list of employers /vacations initiated by jquery code) from the Canadian job bank, i made "get" request but won't be able to get inner response payload data www.jobbank.gc.ca/jobsearch/jobsearch?searchstring=&locationstring=&sort=M, i can see this payload in the firefox developer tool but failed to find the proper python library and methods to get it, is there any way other than selenium to accomplish this task? I am at the very beginning of the path of learning programming and would be grateful for any help or advice on what to read or watch to figure it out. thanks.

@JohnWatsonRooney 3 years ago

I think you might need to use the same approach as my sports stats video - using postman to replicate the request made by your browser, the copy that over to your python code

@fsamobby 3 years ago

@@JohnWatsonRooney ok, ill try this out. thanks anyway)

@neginbabaiha9287 8 months ago

Very clearly explained. May I ask if there is a GitHub repo containing the code that you used in the video?

@engineerbaaniya4846 4 years ago

Amazing sir please keep posting videos like this we will help u to increase subscriber number

@ssh6467 4 years ago

Thank you♥️♥️ you are BEST💪

@Mr.AIFella 6 months ago

Thank you so much. Your video is going to help me a lot in a project that I'm going to start. One question if you don't mind, when I want to gather text but there is a part of the text is appearing and there is a[ click for more] ~>hyperlink, that prevents the text from being fully copied to the csv file. Do you have a hint or suggestions? I appreciate your help in advance

@mohammadkhosrotabar5658 2 years ago

after use render I got this error: There is no current event loop in thread 'Thread-5 (process_request_thread)'

@Dome8 8 months ago

You missed an explanation: what circumstances should you use xpath v div.?

@abhilash93v 4 years ago

Fantastic demonstration.Would love to know how can we use this module to submit forms or logins

@JohnWatsonRooney 4 years ago

Sure that’s a good idea , I will look into it

@abhilash93v 4 years ago

@@JohnWatsonRooney Looking forward to it..Easing login efforts in flash enabled sites such as gmail or any.Any references now would be much helpful for me in my project!

@kooy2254 3 years ago

Hi John, I am one of your fans. I really wonder how did you learn these techniques before? I am currently in a status that don't know how to be a self-taught web scrapper. In other words, I don't know how to learn from a myriad of knowledges on the internet. But fortunately, I found you

@thetransferaccount4586 1 year ago

good explanation

@pinkypromisesx3 3 years ago

What's the difference between using requests-html vs. scrapy or selenium?

@z.heisenberg 11 months ago

selenium is a tool used for different purpose its by product is EXCELLENT ease in web scraping..its been 2 yrs though

@vincentamus 3 years ago

Hey John, great videos. Thank you so much for it! I wanted to ask, how can I scrape multiple categories(Categories like /computers, /headphones, /monitors/, /keyboards/), do you have any video or idea for that? Thanks for your content!

@navindubimsara9157 7 months ago

Hi bro, Did you find any technique to scrape multiple categories? Please let me know.

@user-vw3qz6ii9s 11 months ago

Amazing video. I'm wondering how can we scrape all the pictures for the product if they are rendered dynamically (like in a slideshow)

@papusa9878 3 years ago

Ohhhh nice I use an API that uses this method

@christenw.1726 1 year ago

Can a modified version of this work on scraping links listed inside a live chat feed?

@JohnWatsonRooney 1 year ago

That’s not something I’ve tried but yes I think so

@itstisn 2 years ago

Hi, thank you so much for your video. I want to ask how to scrape multiple review page in one product? I get confuse

@olafecub 3 years ago

Great video, I didn't know about this option; I normally used bs4.

@rohangadgil4527 1 year ago

With requests_html, when I print the soup, I am getting the message - you are not authorized... in the page html. I tried loading the page manually , it worked , so my IP isnt blocked. Can anyone help me with this.

@alessioturcoliveri9840 1 year ago

Hi John is it possible to parse the requests-html response with bs4? I've tried passing response.text when making a bs4 Soup but it returns None. Can somebody help me?

@JohnWatsonRooney 1 year ago

Hi, yes it is - I’m sure I’ve covered that before. It’s quite a useful method. Try printing the html before making the soup and check is it what you were expecting to see
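
One way to combine the two libraries, as a hedged sketch: after `render()`, the rendered markup is available as the string `response.html.html`, whereas `response.text` is the body of the original HTTP response before any JavaScript ran. The function name and the use of the built-in "html.parser" are assumptions for illustration.

```python
def rendered_soup(url):
    """Render `url` with requests-html and hand the result to BeautifulSoup."""
    # Imported here so the file still loads where these packages are missing.
    from bs4 import BeautifulSoup
    from requests_html import HTMLSession

    session = HTMLSession()
    response = session.get(url)
    response.html.render(sleep=1)
    # response.html.html is the rendered markup as a string.
    return BeautifulSoup(response.html.html, "html.parser")
```

Printing `response.html.html` before building the soup, as suggested above, is a quick check that the content you expect is really there.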

@leoyuanluo 3 years ago

you. are. awesome!

@anirbanpatra3017 10 months ago

Can You explain when should we use what?? I generally prefer sticking to selenium for all my needs.

@momq112233 4 years ago

nice video 👌 and keep going

@benoitdefays578 4 years ago

Hi, first tanks a lot for your tutorial. I have a question, i generate my csv file , but my separator are ',' how can i change the separator ?

@JohnWatsonRooney 4 years ago

Sure, after the CSV file name, add sep=" " and put whichever separator you want between the quotes.
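
The `sep=` argument presumably refers to pandas' `to_csv` (an assumption, since the reply does not name the function). The standard-library `csv` module offers the same control through `delimiter`, which is what this small sketch uses; `rows_to_csv` and the sample rows are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows, separator=";"):
    """Serialise rows to CSV text using a custom field separator."""
    buffer = io.StringIO()
    # delimiter plays the role that sep= plays in pandas' to_csv.
    writer = csv.writer(buffer, delimiter=separator)
    writer.writerows(rows)
    return buffer.getvalue()
```

With pandas, the equivalent call would be along the lines of `df.to_csv("products.csv", sep=";")`, where the file name is a placeholder.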

@richu-21 3 years ago

How can we use threading while scraping thousands of website links?

@alaaabdullah2648 1 year ago

I am trying using website the data shown after write in input field , otherwise the html element empty what should I use? I am using pyautogui to fill the field but I don’t know how to read the data

@GabrielMendes-jy4mp 3 years ago

John, I've done some code web-scraping dynamically like you in this video. But it's taking too much time because for every product it has to open its page. Is it common, is there a faster way for doing this?

@PhilipRhoadesP 1 year ago

Nice! - is there a way of doing this for the _currently displayed page_ ? - on a YT video page I want to scrape all the recommended videos and their titles from that page . .

@sandilemfazi8624 2 years ago

Hey John, very helpful video, but I keep having this one issue when I try to render the url, I get this error message: RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.

@alokyathiraj 2 years ago

Were you able to fix it? I'm having the same problem

@jaredspilky9699 2 years ago

@@alokyathiraj Im having a similar issue as well

@donnaperyginathome 1 year ago

I can't get this to work either. I think maybe the library needs to be updated.

@johnmurray6405 2 years ago

I've followed you code to the tee. It locks up both at Pycharm and VScode at the render statement (r.html.render(sleep=1)). I literally have to close both programs to get them to run again. Any ideas? Great video though.

@JohnWatsonRooney 2 years ago

If it’s the first time running the render method it should download headless chrome - I’m guessing it’s getting stuck there. Maybe try removing requests_html and reinstalling it

@simpleffective186 1 year ago

what can i do if the xpath search doesn't find anything?

@by_westy 2 years ago

i tried that in jupyter and it gave me this error message: **'Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.'**

@alexdiaz4371 2 years ago

can't install requests-html, any ideas? I'm using windows and the error jumps with lxml , tried to install lxml and got same error

@abhishekkamuni9971 4 years ago

Nice video, but according to you which python webscraper takes less resources like memory etc?

@JohnWatsonRooney 4 years ago

If the website is html (no JavaScript) requests and bs4 will be the lightest in my opinion. The method in this video is slower due to the render process but still good for smaller projects - selenium is the slowest and not really designed for scraping but does work when needed

@tokoindependen7458 3 years ago

@@JohnWatsonRooney this information absolutely u must explain in single video, from fastest method and the slowest one, thx sir,

@nikhilsaikondapaneni6657 6 months ago

using render for first time i haven't been able to install any thing and its giving me error

@stern7658 4 months ago

Dude, please update this code. I try to run requests_html, but all it states is that it needs Chromium to work. The thing is, I already have Chromium on my machine, even the binary file. Why, when I run it, does it attempt to download Chromium, which I already have, and then fail to find it? I tried this a few months back and now I've returned to the same problem. I even uninstalled and reinstalled everything, but the problem is the same.

@janlisowski5396 3 years ago

hmm the website I am trying to scrape returns status code 429... and I haven't even started scraping. Do you know what could be causing it?

@surendratamang8848 4 years ago

Sir how will you deal with infinite scrolling if can't find easy

@JohnWatsonRooney 4 years ago

That's a bit more tricky without browser automation (Selenium). We can use "r.html.render(sleep=1, scrolldown=x)", where x is the number of times to page down. Not ideal, but it might work.
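
A sketch of the `scrolldown` idea, assuming requests-html is installed. Note that `render()` is called on the already-fetched response, so the URL goes to `session.get()` rather than to `render()`. The function name and the default of 5 page-downs are arbitrary choices for illustration.

```python
def render_with_scroll(url, page_downs=5):
    """Render `url`, paging down a few times so lazily loaded items appear."""
    # Imported here so the file still loads where requests-html is not installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    response = session.get(url)
    # scrolldown = number of page-downs; sleep gives new content time to load
    # after each one.
    response.html.render(sleep=1, scrolldown=page_downs)
    return response.html
```

How many page-downs are enough depends on the site, so this is something to tune by checking how many items come back.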

@abhishekkamuni9971 4 years ago

Can you login a website using requests-html?

@JohnWatsonRooney 4 years ago

You can yes, you can post to the server - I have an older video on my channel where I cover the basics of this if you are interested

@AlejandroKarlitos 4 years ago

Thank Bro

@patweru7471 1 year ago

Good one, Any idea to do the same for laravel baes??

@engineerbaaniya4846 4 years ago

Awesome

@kerteradih3721 1 year ago

Could you do a solid for me, I’ve suffered trying to scrape this site

@narjesatia 3 years ago

Hi john , trying to run the code , i got this error with render ; AttributeError: 'Future' object has no attribute 'html' ...any help please , didn't find in google .Thanks ,

@bunyaminsahiner9060 3 years ago

While trying to get product links on the category page of the site I work for, it also takes an extra 2 links I don't want for each product. How can I remove these links that I don't want or only one word exists in the links I want, how can I get links with only that word?
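
One possible approach to the question above is to filter the collected links by a word that only the wanted URLs contain. This is a plain-Python sketch; `links_containing` is a hypothetical helper and the example URLs are invented.

```python
def links_containing(links, word):
    """Keep only the links whose URL contains `word`, sorted for stable output."""
    return sorted(link for link in links if word in link)


# Example with made-up URLs: only the product page survives the filter.
sample = {
    "https://shop.example/product/pale-ale",
    "https://shop.example/cart",
    "https://shop.example/help",
}
product_links = links_containing(sample, "/product/")
```

The same function works directly on the set returned by `absolute_links`.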

@itsmehemant7 1 year ago

Hey john, After struggling with stackoverflow I am here finally..."response.html.render(sleep=3)" is giving error in django view (i .e There is no current event loop in thread 'uWSGIWorker1Core8') .....can you help me how to solve this??

@dyegoborges9985 3 years ago

it doesnt work with aliexpress

@LogansRunnersVideo 3 years ago

Trying to recreate on a similar e-commerce website and print(products) from 4:57 gives None type. Any suggestions why?

@tokoindependen7458 3 years ago

Just print html source code, look if u looking out there

@christinahachem6649 3 years ago

hello i'm having a chromium related error when i want to render an html page can you pls tell me how can i fix it?

@jaydecanon1314 1 year ago

you shouldn't be john rooney, you should be john legend

@user-lj8cq1ki3w 1 year ago

can you do scrapping video for tracton gyan website?

@ashishtiwari1912 3 years ago

Cider | 4.0% | 44 cl Trying this: info=r.html.find('div.Select an element with a CSS Selector:',first=True).text The output shows:AttributeError: 'NoneType' object has no attribute 'text'

@splashoui3760 3 years ago

You probably chose your class incorrectly, which is why you have no elements in your output. NoneType means you have no result (empty array).

@artabra1019 4 years ago

what is better bs4 or html.xpath ???

@JohnWatsonRooney 4 years ago

learn to use both but generally if i can i use BS4

@Neil4Speed 4 years ago

Hope you don't mind me asking, but I have been banging my head against this one for a few hours. I am trying to pick up only a specific URL from a container (the container also has non-product URLs):
from requests_html import HTMLSession
import pandas as pd
import time
url = 'www.fragrancenet.com/fragrances'
s = HTMLSession()
r = s.get(url)
r.html.render(sleep=1)
products = r.html.xpath('//*[@id="resultSet"]', first=True)
print(products.absolute_links)
I am only looking for the p-tags under Result set called: Any help would be super appreciated, thanks again John.

@meme_me 2 years ago

used the same code and It didn't work for me, I changed the website to my desired one and I get a bunch of errors... :(

@linxx1184 3 years ago

Hi John, I watched this video many times, you're great at explaining. However, I am getting this error "Navigation Timeout Exceeded: 8000 ms exceeded" when r.html.render(sleep=1) I even bumped up the sleep time. Please help.

@pranit449 3 years ago

Try using timeout=(number you'd like for more than 8s) instead of sleep. worked for me

@linxx1184 3 years ago

@@pranit449 thanks for the advice, it worked with timeout=30 and also added keep_page=True

@marco-3942 1 year ago

Hi , can you help me ??

@sinamobasheri3632 4 years ago

🖤👌🏻

@dickyindra4923 2 years ago

hi sir, can you fix this problem : AttributeError: 'NoneType' object has no attribute 'text' Thanks, btw nice vid

@tsay214 3 years ago

What does first=True do?

@JohnWatsonRooney 3 years ago

with requests-html "find" always returns a list, but using first=True forces it to return only a single item, the first element it finds that matches your find criteria

@tsay214 3 years ago

@@JohnWatsonRooney got it, thanks. On to pt2!
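
The `first=True` behaviour described in this thread can be tried offline with the `HTML` class from requests-html, which parses a string without any network access. A small sketch, assuming requests-html is installed; the function name and the sample markup are invented.

```python
def first_vs_all(markup="<ul><li>Cider</li><li>Stout</li></ul>"):
    """Show what find() returns with and without first=True."""
    # Imported here so the file still loads where requests-html is not installed.
    from requests_html import HTML

    doc = HTML(html=markup)
    all_items = doc.find("li")                # list of every matching element
    first_item = doc.find("li", first=True)   # just the first matching element
    missing = doc.find("table", first=True)   # None, since nothing matches
    return [item.text for item in all_items], first_item.text, missing
```

The `missing` case is the source of the "'NoneType' object has no attribute 'text'" errors reported in several comments: the selector matched nothing, so there was no element to read `.text` from.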

@hammadrafique7313 1 year ago

Great, but i took a lot of time for rendering

@MrGarrincha11 4 years ago

Hello, can you do scraping on this page : stats.nba.com/teams/transition/ I want to compare playtype team1 percentile on offense (also the frequency) against team2 percentile on defense. can you help me, please?

@JohnWatsonRooney 4 years ago

Hi! Yes I can scrape that site - I have a video coming this week that scrapes a site simliar that you will be able to apply to this site too. JR

@MrGarrincha11 4 years ago

@@JohnWatsonRooney Great! Thank you for the really quick answer!

@signin7740 3 years ago

Beerwulf is not a dynamic site....LOL

@samuelricard3895 3 years ago

when I am trying to type r.html.render() I get this Unresolved attribute reference 'html' for class 'Response'

@saeeahmed5213 2 years ago

🥰🥰🥰🥰

@sydpao2224 2 years ago

The accent, where are you from?

@JohnWatsonRooney 2 years ago

UK near London

@barguybrady 4 years ago

So, when copy the Xpath, I get this as a result:

@barguybrady 4 years ago

/html/body/div[7]/div[4]/section/div[10]/div[3]/div[2]/div[2]/div[1]/ul[2]

@JohnWatsonRooney 4 years ago

Are you using chrome or Firefox? That looks like the “full xpath” option, as opposed to just the “xpath”. I am planning to do a video on xpaths to clear it up a bit more

@barguybrady 4 years ago

@@JohnWatsonRooney inspector in Firefox, which leads me to think, then, that there's a difference btw Chrome and Firefox ?

@JohnWatsonRooney 4 years ago

There shouldn’t be but I have seen different results from both

@CodeGlintHub-kn9fx 2 months ago

Couldn't you find any other site instead of a beer website? Why are you promoting harmful things?

@msyahdan183 1 year ago

i have a problem with this code produk = r.html.xpath('/html/body/div[4]/div[2]/div[2]/div[2]/div[1]/div/div[2]',first=True)...the result is None or []..how to fix it?

@RedSpark_ 1 year ago

I'm having the same problem, did you find a solution? Thanks