The Biggest Mistake Beginners Make When Web Scraping

Views: 120,140

A day ago

Comments: 142

@JohnWatsonRooney 2 years ago

The first 1,000 people to use the link or my code johnwatsonrooney will get a 1 month free trial of Skillshare: skl.sh/johnwatsonrooney05221

@josele7666 1 year ago

Would it be possible to scrape any website with playwright with the hidden browser? For example: williamhill or bet365?

@thetransferaccount4586 1 year ago

One of the best tutorials I've seen on web scraping. I wish you could make more of these. Thank you.

@MrViolentKoala 1 year ago

You genius! At first I was skipping over the video like "what is he doing, that's not scraping, that's no help"... but then I actually took a look at the website I wanted to scrape, and there it is! All the info I want, nicely formatted as JSON! I copied your code from GitHub afterwards, and after a few minutes everything was working perfectly. Thanks a lot! (Also, for me it was the same: I only needed the cookie for the request to work.)

@Bobbias 2 years ago

Not every site relies on JS for loading data from the back end. Sometimes your only option is to scrape the front end. Being able to directly query an API is always going to be the better solution, but sometimes you're stuck doing things the hard way.

@JohnWatsonRooney 2 years ago

Absolutely! If it’s only html, then use that!

@Bobbias 2 years ago

@@JohnWatsonRooney yeah, I don't generally do web scraping, but I ended up doing some for my most recent project. Unfortunately, not only was there no API to pull from, just navigating through the HTML was a pain due to a lack of IDs/classes on some relatively deeply nested tables. It resulted in some pretty nasty code to work my way to the relevant data :/

@ToughdataTiktok 2 years ago

I did a similar scraping project on an ancient website with no html id/classes. It was a pain, but regexes did help

@agentnull5242 1 year ago

@@ToughdataTiktok How did they help? I've done this before and had to quit.

@ToughdataTiktok 1 year ago

@@agentnull5242 Regex helped to extract data where there were patterns in the HTML
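As a rough illustration of the regex approach mentioned here, the following is a minimal Python sketch. The HTML fragment, the pattern, and the `extract_rows` helper are all made up for this example; a real page would need its own pattern, and regexes tend to break when the markup changes.

```python
import re

# Hypothetical fragment: nested tables with no ids or classes to hook into.
SAMPLE_HTML = """
<table><tr><td>
  <table>
    <tr><td>Widget A</td><td>$12.50</td></tr>
    <tr><td>Widget B</td><td>$7.00</td></tr>
  </table>
</td></tr></table>
"""

# Match a row made of exactly two cells: a name followed by a dollar price.
ROW_PATTERN = re.compile(
    r"<tr>\s*<td>([^<]+)</td>\s*<td>\$([0-9]+\.[0-9]{2})</td>\s*</tr>"
)


def extract_rows(html: str) -> list[tuple[str, float]]:
    """Return (name, price) pairs for every row that fits the pattern."""
    return [(name.strip(), float(price)) for name, price in ROW_PATTERN.findall(html)]


if __name__ == "__main__":
    print(extract_rows(SAMPLE_HTML))
```

The pattern relies on the repeated shape of the data rows rather than on ids or classes, which is why the outer wrapper row is skipped.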

@markblake77889900 2 years ago

Pretty decent methodology, but a few inaccuracies. CORS does not affect the security of the backend API; it is a client-side security model enforced by the browser. It triggers a "pre-flight" check in order to stop untrusted websites from consuming state-changing APIs. You're not using a browser, so you'll never hit a CORS limitation. To consume the API you just need the application's OAuth2 token, or whatever is being used for API auth. You may also need to set the "Origin" header, which is just idiosyncratic server behaviour.

@JohnWatsonRooney 2 years ago

Great, thank you - I appreciate the corrections

@markblake77889900 2 years ago

@@kesbetik I meant authentication, but authorisation would still make sense. Was it too ambiguous? Have I confused you? Any authenticated API needs some sort of session information. This is often an API token using the OAuth2 mechanism, but it absolutely doesn't have to be. Any suitably random string is good enough, and it can be in the POST body or the header. There are security concerns with each; for example, leaving tokens in the DOM makes them accessible to JavaScript.

@soniablanche5672 1 year ago

Yeah, CORS is to protect legitimate users from attackers rather than protect the server from attackers.
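To illustrate the point made in this thread: CORS is enforced by browsers, so a script only needs to send whatever headers the server itself checks. Below is a minimal sketch using Python's standard library. The URL, token, and origin values are placeholders, the `build_api_request` helper is invented for this example, and the request is only built, not sent.

```python
import urllib.request


def build_api_request(url: str, token: str, origin: str) -> urllib.request.Request:
    """Build a GET request carrying an auth token and an Origin header."""
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",  # assumed OAuth2-style bearer token
            "Origin": origin,                    # only needed if the server checks it
            "Accept": "application/json",
        },
        method="GET",
    )


if __name__ == "__main__":
    req = build_api_request(
        "https://api.example.com/items",  # placeholder URL
        "PLACEHOLDER_TOKEN",
        "https://www.example.com",
    )
    print(req.get_method(), req.full_url)
    print(req.header_items())
```

No pre-flight request is involved here, because pre-flight checks are something a browser performs on behalf of a web page.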

@CodePhiles 2 years ago

this is brilliant and opens a new door for us, thank you John for your great work

@samuelweber3872 11 days ago

Man, this content changed my life. I'm web scraping now. Thanks so much, man.

@MiguelRodriguez2010 7 months ago

What a great explanation, thanks man!

@dmitriyneledva4693 2 years ago

your channel and this video are the best things youtube has ever suggested to me ^^

@Julian-sn9vc 1 year ago

Great tutorial! I have a question though: What do you have to pay attention to when you log in to the site in order to scrape the data in relation to the cookies, and not be blocked? I'm wondering if there are any specific precautions or best practices we should follow when our requests are connected to an account. Thanks!

@David-mj9st 2 years ago

I used to copy the cookie directly from the browser. After this video I should review my code and make it a little better.

@zainkhalid3670 8 days ago

Great! But why is it better than just scraping the front end? Is it because of the structured data?

@fischerdev 1 year ago

I have already done complex web scraping using C#, retrieving more than 1 million records, but I am finding these videos interesting. I would have to learn Python. Two questions: a) What is this inspection tool? Is it called Insomnia? b) How do you handle Google's "I am not a robot" check? Do you have a video about it?

@hrvojematosevic8769 2 years ago

Nice videos, keep creating quality content John :) Just a quick question: can you not do the same via requests by using .Session() and .cookies?

@sweatdog 10 months ago

I also wondered why Session() wasn't used to track the cookies for the request. Perhaps he was demonstrating that the initial "accept cookies" button needs to be clicked using Playwright, and then you are off to the races.

@tobiewaldeck7105 2 years ago

Hi, I have 2 questions. 1. How do I know the context.cookies index? 2. What scraping method should I use on a website where the next button doesn't change the page number but only updates the data dynamically? A Chrome extension or even UiPath can do this for me, but with a site I'm practicing on, the JSON data I'm getting is irrelevant.

@JohnWatsonRooney 2 years ago

Hey! The headers returned will be a list or dictionary. If you print out the whole thing, you can work out how to index or reference it. For the second question, you want to find the AJAX request that’s being made when the next button is clicked. Check out the video on my channel called best scraping method.
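Rather than relying on a fixed list position, a cookie can be looked up by name. This sketch assumes the cookies arrive as a list of dictionaries with `name` and `value` keys, which is the shape Playwright's `context.cookies()` returns. The helper functions, the sample data, and the cookie name `session_id` are made up for illustration.

```python
def cookies_by_name(cookies: list[dict]) -> dict[str, str]:
    """Collapse a list of cookie dicts into a simple name -> value mapping."""
    return {c["name"]: c["value"] for c in cookies}


def cookie_header(cookies: list[dict], wanted: list[str]) -> str:
    """Build a Cookie header value from only the cookies named in `wanted`."""
    lookup = cookies_by_name(cookies)
    return "; ".join(f"{name}={lookup[name]}" for name in wanted if name in lookup)


if __name__ == "__main__":
    # Made-up sample in the same shape as a browser context's cookie list.
    sample = [
        {"name": "consent", "value": "yes", "domain": ".example.com", "path": "/"},
        {"name": "session_id", "value": "abc123", "domain": ".example.com", "path": "/"},
    ]
    print(cookies_by_name(sample)["session_id"])
    print(cookie_header(sample, ["session_id"]))
```

Looking cookies up by name keeps working if the site adds or reorders cookies, which a hard-coded index would not. The name-to-value dictionary can also be passed as the `cookies=` argument of `requests.get`.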

@tobiewaldeck7105 2 years ago

@@JohnWatsonRooney Thank you for replying. I was hoping I would not have to go so deep into web scraping with code, because I only learned some VBA and Python for this purpose. Unfortunately there isn't a one-size-fits-all solution for every website, and creating my own scrapers just takes fewer PC resources.

@gugurlqk 2 years ago

@@JohnWatsonRooney can you explain in more detail the index part?

@iamkian 2 years ago

Thank you. I'm going to be testing a little bit with this.

@fou2flo 2 months ago

Works great for client-side data fetching... for server-side rendering (which is common) you will need something else. Great video btw.

@fransubaru 2 years ago

Amazing content John! Thanks for sharing!

@JohnWatsonRooney 2 years ago

Thanks!

@wangdanny178 2 years ago

another API tutorial with the help of cookies. Thanks John.

@vtrandal 1 year ago

Excellent. I think I got the main points. Thank you! But... I do have multiple problems recreating your results. First, the Forbes website implementation seems to have changed, but I found a fetch request with the filter larger-than:1M and used it in Insomnia as you did. However, I do not follow how you decide which cookie to use in your code. Your hard-coded [3] for "INDEX LIST" baffles me. In fact, any non-empty cookie seems to work (perhaps only for a while).

@SunDevilThor 2 years ago

Pretty cool trick. Thanks for sharing!

@stephenwilson0386 2 years ago

Really glad I found your channel, I'm hoping to learn enough web scraping to get some extra income on the side (or maybe even full-time). Very off-topic, but what window manager are you using?

@JohnWatsonRooney 2 years ago

Thanks, it's a good skill to learn. It also teaches you about how websites work, APIs and handling data. I use i3wm. This one is skinned using Regolith; however, I don't use that anymore, just a much more basic i3 skin.

@stewart5136 1 year ago

Thanks John. The site has changed and now it looks like you can just grab '......page-data/index/page-data.json' - but your video really helped to always inspect and see what's happening. Instead of Playwright could we do the same with requests_html render()?

@JohnWatsonRooney 1 year ago

Unfortunately that happens! Site updates means you have to stay on top of everything. You probably could use render() but it hasn’t worked well for me recently so I stopped using it

@aliacosta7073 1 year ago

Great tutorial, thanks a lot!

@lulu_dotes 2 years ago

Thank you for a very informative and interesting tutorial. May I ask what browser you are using?

@lulu_dotes 2 years ago

what browser are you using for inspecting?

@JohnWatsonRooney 2 years ago

Sure, it’s Firefox

@lulu_dotes 2 years ago

@@JohnWatsonRooney Thank you so much. I'm learning a lot from your tutorials. I am new to web scraping and I'm watching your playlists. Can you recommend one of your playlists to start with?

@twiincentral8780 2 years ago

Thank you for all the great content and specifically this video! Going to try this with Walmart to see if it will work!

@austinrutledge6484 2 years ago

I'm using Walmart as my web scraping test too lol. They have really good bot detection.

@statsnow3354 2 years ago

Can you do asynchronous scraping with Playwright?

@wangdanny178 2 years ago

a follow-up question. Is it the same Network layout in Chrome browser?

@JohnWatsonRooney 2 years ago

Same functionality, yeah, it just looks a little different.

@Kfq1111 2 years ago

Can you use the same method on an access-controlled website? There is a website that runs some questionnaires, and I am trying to get the weighting of each question.

@agrobots 2 years ago

Same question here. I’m trying to scrape sites that require PIV card authentication.

@R1rare 2 years ago

Nice tutorials. This should be the first video that pops up when you're new to making software.

@TrevCole 1 year ago

If the site only displays the json data as POST instead of a GET request, do you have to use front end scraping?

@JohnWatsonRooney 1 year ago

You should be able to replicate the POST request in the same way and get the results back.
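Replicating a POST request works much like replicating a GET: copy the URL, headers, and body shown in the browser's network tab. Here is a sketch with Python's standard library. The endpoint, the payload fields, and the `build_post_request` helper are hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request


def build_post_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request that mirrors what the browser sent."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_post_request(
        "https://www.example.com/api/search",  # placeholder endpoint
        {"page": 1, "pageSize": 50},           # placeholder payload fields
    )
    print(req.get_method(), req.data)
```

If the site expects form-encoded data instead of JSON, the body and the Content-Type header would need to change to match what the network tab shows.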

@yakobadamovich1813 2 years ago

I wish all websites were exposing their APIs like Forbes is doing... Sometimes Selenium is a necessity

@Victor_Marius 1 year ago

What if you need to scrape a website that has multiple s embedded into each other? Do you get the contents of the s with the main page?

@Rapid1898 2 years ago

Hello, where can I find the code from this video? (I can't find a GitHub link or anything else in the description.)

@JohnWatsonRooney 2 years ago

yes sorry added now to desc!

@Rapid1898 2 years ago

@@JohnWatsonRooney Thanks a lot!

@justusmzb7441 9 months ago

How do you handle backends that use some weird, very dynamic security methods? I think it's reCAPTCHA v3 in my case (the JavaScript call goes through a function whose name suggests as much). I was desperately trying to crack the search endpoint of the Al-Jazeera search function for a research project... And simply hijacking the cookies still resulted in a 403, even if they were freshly stolen from a Selenium session just milliseconds before.

@humansaremortal3803 2 years ago

Hello, have you ever tried scraping sockets?

@ViktorRzh 2 years ago

Good job! 👏

@coolcat4805 2 years ago

Great as always, can you share code as you mentioned?

@wangdanny178 2 years ago

I do not know what is wrong, but there is no "button.trustarc-agree-btn". This video is from 2 days ago. Is it possible they have changed the web elements already?

@JohnWatsonRooney 2 years ago

Open the url in incognito mode and it should be there

@san24dec 2 years ago

Really nice explanation. Will this method also work where we interact with web elements to download a file from the front end, like clicking a button and then downloading a CSV file?

@wangdanny178 2 years ago

Incognito mode shows the same web elements. On my side, there is only "accept" and no "accept all" for cookies, and I have to scroll down to the bottom to click "cookies preference", which is totally different from the video.

@johnme60 2 years ago

I have a website that downloads a JavaScript file and calculates a token. This token is then sent to the API as a Bearer token, and it looks like the token is only calculated when the page is loaded in a browser. It's been a week and the token has still not expired, but I really want to make it dynamic. Should I use a browser to load the page and grab the token?

@Kaganerkan 10 months ago

I was going to ask how to submit a form this way, or whether it is even possible. I'm trying to fill in a blank text field and then press submit to send it.

@axedexango 2 years ago

Great videos, thanks for helping us get better. What can I do if I can't find the API? I clicked on all the XHR requests but none of them has the values.

@JohnWatsonRooney 2 years ago

Thanks! Try going to different links and other parts of the site and see if you can find any

@axedexango 2 years ago

@@JohnWatsonRooney I tried, but I can't find any API. They have a WebSocket I can connect to, but I don't get any values from it either.

@Christian-mn8dh 2 years ago

@@axedexango I have the same issue

@drac.96 2 years ago

Can you please share the code? It's not in the description.

@JohnWatsonRooney 2 years ago

yes sorry added now to desc!

@drac.96 2 years ago

@@JohnWatsonRooney Much appreciated!

@KhalilYasser 2 years ago

Thank you very much for your amazing tutorials. I used Insomnia to mimic the request as you did. When I uncheck the cookies and click Send, I still get a response. I tried deleting the cookie entirely, but I received the same response. I also created a new request and changed its settings before doing anything else, but I still got a response. I need to get a blank response when I uncheck the cookie, as you did in the video.

@JohnWatsonRooney 2 years ago

Hey! Are you doing it on the same site? If it's a different website, some will let you in without the cookie.

@KhalilYasser 2 years ago

@@JohnWatsonRooney Yes, I am trying on the same website you explained in the video

@reverendbluejeans1748 1 year ago

My inspector in Chrome looks completely different. I don't know what to do.

@ShehneelAhmedKhan 10 months ago

Informative!

@kafran 1 year ago

Why use Playwright to get the cookie instead of requests' session?

@graczew 2 years ago

Great content as always. I'm curious how you store all the scraped data. I'm on a data analysis path and do some small projects where data is gathered daily. Have you worked on something like that? I have a SQL database to hold all the raw data, then use pandas to clean and analyse it.

@JohnWatsonRooney 2 years ago

Thanks! Small projects I use SQLite, anything bigger I use Postgres. I have used mongodb before too which I may do a video on

@jeroenvermunt3372 2 years ago

@@JohnWatsonRooney Could you do a video (or maybe you already have) on efficiently updating your postgres database. I am currently designing some kind of pipeline myself which stores batches of insert operations and batches of update operations to update in one transaction. I think it would be a useful video for developers trying to implement scrapers in production. Love the content as always!

@renancatan 2 years ago

Guys, what I do is use the SQLAlchemy library to work with SQL Server/Postgres/MySQL. There are some options that can be set when writing a table, like append and replace. I use replace for temporary tables, or for tables that should be dropped and recreated automatically every day. Further down in the same script I use append, so every time the script runs it inserts more rows into SQL. There you can also define the column types, like int, varchar, boolean, etc.
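The replace-versus-append pattern described in this comment can be sketched without SQLAlchemy, using Python's built-in `sqlite3` module as a stand-in (this is a swapped-in technique, not the commenter's code). The table names, column names, and helper functions are invented for the example.

```python
import sqlite3


def load_scrape(conn: sqlite3.Connection, rows: list[tuple[str, float]]) -> None:
    """Refresh the staging table and append the same rows to the history table."""
    with conn:  # commit on success, roll back on error
        # "Replace": drop and recreate the temporary table on every run.
        conn.execute("DROP TABLE IF EXISTS staging_prices")
        conn.execute("CREATE TABLE staging_prices (name TEXT, price REAL)")
        conn.executemany("INSERT INTO staging_prices VALUES (?, ?)", rows)

        # "Append": keep adding rows to a long-lived table.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS price_history ("
            "name TEXT, price REAL, scraped_at TEXT DEFAULT CURRENT_TIMESTAMP)"
        )
        conn.executemany(
            "INSERT INTO price_history (name, price) VALUES (?, ?)", rows
        )


def count(conn: sqlite3.Connection, table: str) -> int:
    """Row count for one of the two known tables."""
    assert table in {"staging_prices", "price_history"}
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_scrape(conn, [("Widget A", 12.5), ("Widget B", 7.0)])
    load_scrape(conn, [("Widget A", 12.0)])
    print(count(conn, "staging_prices"), count(conn, "price_history"))
```

After two runs the staging table holds only the latest batch, while the history table holds every row ever scraped along with a timestamp.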

@graczew 2 years ago

I have just started learning SQLAlchemy, but for small projects it looks like too much. What I like about it is the modeling and data validation. One thing that confuses me is relationships. I wish I could see someone build a whole project: scrape, clean and validate, plus DB design.

@JohnWatsonRooney 2 years ago

@@graczew Hey, sure, I get that. I've got some projects lined up that I can tailor more to this side of scrape/validate and load. The best way to understand basic relationships is to build some basic web apps with something like Flask. It helped me a lot.

@gugurlqk 2 years ago

Hey John, your work is awesome, but for some reason I get a JSONDecodeError. I tried executing the Insomnia code in a separate .py file and it returned the JSON data. However, when I try to execute your code I run into this nasty JSONDecodeError. Do you have any advice on how to fix it?

@anthonyrodz7189 8 months ago

Newbie question, but is Google ending third-party cookies in Chrome going to change this?

@oyuncubirlii 1 year ago

amazing video!!!!

@hamdimohamed8913 2 years ago

What web browser are you using?

@JohnWatsonRooney 2 years ago

I use Firefox mostly but sometimes chrome for demonstrating

@killianrak 1 year ago

Hey! The API that I'm scraping gives me the response after some delay. Does anybody know why this delay occurs and how to bypass it? Thanks.

@fhkdhkdyidyhfufufh9011 1 year ago

Are there sites that do not support web scraping?

@defiler90 1 year ago

When scraping the backend you'll encounter honeypots. If you're scraping sites with strong anti-scraping measures, going through the front end is probably the only way, especially if they're determining whether you're a bot based on behavior.

@sandeshpandit5540 2 years ago

Please tell me, is web scraping a good career option? Do companies hire you for web scraping?

@reverendbluejeans1748 1 year ago

Selenium is used for website testing. You could be an automation QA tester.

@RyuDouro 1 year ago

will this work on all browsers?

@kiddo8714 2 years ago

I want to learn web scraping. Where should I begin on your channel?

@JohnWatsonRooney 2 years ago

Good question, I must reorganize my playlists! Maybe this one: kzbin.info/www/bejne/faqlZWaeqsmZh9k

@jithin.johnson 2 years ago

Good one!

@JohnWatsonRooney 2 years ago

Thanks!

@vetrivelr5708 2 years ago

Great info, but I do have a quick question. On my company website I was able to log in successfully via requests. I tried to find links to the different pages, but I can't find any anchor tags or links for the clickable buttons that lead to other pages. So I used explicit URLs for the corresponding pages, but it always returns only the home page HTML. I used a context manager here. I don't understand why. Any suggestions would be appreciated. Thanks.

@JohnWatsonRooney 2 years ago

If you log in using requests you’ll need to use a session so it saves your logged in status and you can then visit other pages

@vetrivelr5708 2 years ago

@@JohnWatsonRooney Thank you. Of course I used a session, as I learned from you. No matter what page URL I use, it always returns only the home page HTML. That's what I couldn't figure out.

@Rapid1898 2 years ago

Hello, I tried to follow your description, but when I do exactly the same thing as you (on Windows) and reload the page I get 244 requests. Why do you only have 7 in the video? I was also not able to find the page-data.json file. I also saw that you have a "File" column in your inspector window, but I don't have a "File" column to select when right-clicking on the column headers. How can I find the JSON file response you showed in the video?

@JohnWatsonRooney 2 years ago

Are you using Chrome? I find that the inspect element tool looks different in Chrome (this is Firefox). The requests will be there in Chrome too; sometimes you need to click through different pages to see them.

@Rapid1898 2 years ago

@@JohnWatsonRooney Yes, it's Chrome. I will give it a try with Firefox then.

@Rapid1898 2 years ago

I tried it now with Firefox and it works as I see it in your video. Another question, by the way: is it possible to get the code you used in your video somewhere? (I can't find a GitHub link in the video description.)

@zackplauche 1 year ago

Is there a tl;dr of this vid?

@jcvp2493 1 month ago

It is a ten-minute video.

@kusunagi1 2 years ago

What are cookies actually? Just request?

@JohnWatsonRooney 2 years ago

It’s a small bit of data the server puts on the client computer to help identify it

@kamizandesh2750 2 years ago

Thanks for your helpful tutorial. Could you explain web scraping on websites with dynamic content, like Yahoo Finance? There are many examples, but none of them works properly or shows live prices. Best regards.

@kanikakushwaha2009 2 years ago

thank you

@return_1101 2 years ago

What is wrong with Selenium?...

@JohnWatsonRooney 2 years ago

Nothing, it’s a great tool for testing websites. But I think many people lean on it for scraping data when they don’t need to

@return_1101 2 years ago

@@JohnWatsonRooney Thank you very much. This video is very informative. But I only know Selenium and BS4 (and requests).

@ajitha7768 2 years ago

You're the best, man! Looking forward to the new version.

@soniablanche5672 2 years ago

LMAO 5 MB of JSON. Are they sending the entire database?

@wangdanny178 2 years ago

John, another question, about Insomnia: my requests always time out. Do you have any idea why?

@Tothefutureand 1 year ago

Genius

@LoneWolF45236 8 months ago

Didn't know that Robert Downey Jr., aka "Iron Man", had mastered web scraping. 😆

@theTrigant 2 years ago

Why, of all the pages on the web, did you choose a frickin site about the most sinister humans on earth?

@Jjjnmkkmkkk 4 months ago

Can you make your videos in Java with Android Studio? I'm coding with an app that only supports Java. Not everyone is lucky enough to have a desktop like you. Please help me out.

@oopss794 1 year ago

a headless browser won't be able to get an httpOnly cookie for example

@mehmetkamildemirkent4319 2 years ago

Been there, done that.

@Data_Creator 2 years ago

wow...

@coddude3284 1 year ago

Allow saving videos! Why are you disabling it?

@kikiryki 2 years ago

Hi John! Could you please explain in a new video how to automate a Google search for a list of queries (company names as keywords in column A of an XLSX or CSV file), and save some output of the Google search (top 3 results: URL, title, address, business ID) in a new results CSV or XLSX file? Thank you!

@moomoocow526 2 years ago

Automating Google searches is almost impossible given their bot detection. Try using an alternative search engine

@kikiryki 2 years ago

@@moomoocow526 Not necessarily Google, but I want to learn to iterate over a list in a search box (with a dropdown list to choose from) and save some results to CSV.