Scrapy Course - Python Web Scraping for Beginners

376,585 views

freeCodeCamp.org

A day ago

The Scrapy Beginners Course will teach you everything you need to know to start scraping websites at scale using Python Scrapy.
The course covers:
- Creating your first Scrapy spider (a minimal spider sketch follows this topic list)
- Crawling through websites & scraping data from each page
- Cleaning data with Items & Item Pipelines
- Saving data to CSV files, MySQL & Postgres databases
- Using fake user-agents & headers to avoid getting blocked
- Using proxies to scale up your web scraping without getting banned
- Deploying your scraper to the cloud & scheduling it to run periodically
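To give a flavour of what the course builds, here is a minimal spider sketch. The names and selectors follow the books.toscrape.com example site used in the course; treat it as an illustration, not the course's exact code:
import scrapy

class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # each book on the listing page sits inside an article.product_pod element
        for book in response.css("article.product_pod"):
            yield {
                "name": book.css("h3 a::text").get(),
                "price": book.css(".product_price .price_color::text").get(),
                "url": book.css("h3 a::attr(href)").get(),
            }
        # follow the pagination link until there are no more pages
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)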
✏️ Course created by Joe Kearney.
⭐️ Resources ⭐️
Course Resources
- Scrapy Docs: docs.scrapy.org/en/latest/
- Course Guide: thepythonscrapyplaybook.com/f...
- Course Github: github.com/orgs/python-scrapy...
- The Python Scrapy Playbook: thepythonscrapyplaybook.com/
Cloud Environments
- Scrapyd: github.com/scrapy/scrapyd
- ScrapydWeb: github.com/my8100/scrapydweb
- ScrapeOps Monitor & Scheduler: scrapeops.io/monitoring-sched...
- Scrapy Cloud: www.zyte.com/scrapy-cloud/
Proxies
- Proxy Plan Comparison Tool: scrapeops.io/proxy-providers/...
- ScrapeOps Proxy Aggregator: scrapeops.io/proxy-api-aggreg...
- Smartproxy: smartproxy.com/deals/proxyser...
⭐️ Contents ⭐️
⌨️ (0:00:00) Part 1 - Scrapy & Course Introduction
⌨️ (0:08:22) Part 2 - Setup Virtual Env & Scrapy
⌨️ (0:16:28) Part 3 - Creating a Scrapy Project
⌨️ (0:28:17) Part 4 - Build your First Scrapy Spider
⌨️ (0:55:09) Part 5 - Build Discovery & Extraction Spider
⌨️ (1:20:11) Part 6 - Cleaning Data with Item Pipelines
⌨️ (1:44:19) Part 7 - Saving Data to Files & Databases
⌨️ (2:04:33) Part 8 - Fake User-Agents & Browser Headers
⌨️ (2:40:12) Part 9 - Rotating Proxies & Proxy APIs
⌨️ (3:18:12) Part 10 - Run Spiders in Cloud with Scrapyd
⌨️ (4:03:46) Part 11 - Run Spiders in Cloud with ScrapeOps
⌨️ (4:20:04) Part 12 - Run Spiders in Cloud with Scrapy Cloud
⌨️ (4:30:36) Part 13 - Conclusion & Next Steps
🎉 Thanks to our Champion and Sponsor supporters:
👾 davthecoder
👾 jedi-or-sith
👾 南宮千影
👾 Agustín Kussrow
👾 Nattira Maneerat
👾 Heather Wcislo
👾 Serhiy Kalinets
👾 Justin Hual
👾 Otis Morgan
--
Learn to code for free and get a developer job: www.freecodecamp.org
Read hundreds of articles on programming: freecodecamp.org/news

Comments: 378
@NiranjanND · 4 months ago
14:45 source venv/bin/activate is for Mac; if you're on Windows, use .\venv\Scripts\activate in your terminal.
@sampoulis · a month ago
On Windows you just type the name of the venv folder, then \Scripts\activate, as long as you are in the project folder. Example: PS D:\Projects\Scrapy> .venv\Scripts\activate
@johnsyborg · a month ago
Wow, you are my hero.
@anesanes2863 · 25 days ago
In case of security/execution-policy errors you might also need this: Set-ExecutionPolicy Unrestricted -Scope Process
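Pulling these replies together, a typical setup looks roughly like this (the venv folder name "venv" is an assumption; adjust it to your own project):
# macOS / Linux
python3 -m venv venv
source venv/bin/activate

# Windows PowerShell (may need the Set-ExecutionPolicy command above first)
python -m venv venv
.\venv\Scripts\activate

# Windows cmd
venv\Scripts\activate.bat

# then, inside the activated venv
pip install scrapy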
@leolion516 · 11 months ago
Amazing tutorial. I've only gone through half of it, and I can say it's really easy to follow along and it does work! Thanks a lot!
@TriinTamburiin · a year ago
Note for Windows users: to activate the virtual env, type venv\Scripts\activate
@gilangdeatama4436 · 11 months ago
Very useful for Windows users :)
@entrprnrtim · 8 months ago
Didn't work for me. Can't seem to get it to activate.
@jawadlamin4047 · 7 months ago
@entrprnrtim In the terminal, switch from PowerShell to cmd.
@KrishanuDebnath-vv9cs · 3 months ago
The actual one is .\virtualenv\Scripts\Activate
@Sasuke-px5km · 3 months ago
venv/Scripts/Activate.ps1
@user-tu9ct2mv8t · a year ago
The issue we faced in Part 6 was that the values assigned to the attributes of our `BookItem` instance in the `parse_book_page` method were being passed as tuples instead of strings. Removing the commas at the end of the values resolves the issue. Once that is fixed, everything works without needing to modify the `process_item` method.
@Empyrean629 · 5 months ago
Thanks a lot.
@jonwinder1861 · 2 months ago
goat
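To see why that happens, a tiny illustration (the field name and selector are placeholders rather than the exact course code):
# trailing comma -> Python wraps the value in a 1-element tuple
book_item["title"] = response.css(".product_main h1::text").get(),
# no trailing comma -> the value stays a plain string
book_item["title"] = response.css(".product_main h1::text").get()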
@terraflops · a year ago
This tutorial really needed the accompanying code to make sense of what is going on and to fix errors. Thanks.
@flanderstruck3751 · a year ago
Thank you for the time you've put into this tutorial. That being said, you should make clear that the setup is different on Windows than on Mac: there is no bin folder, for example.
@lemastertech · a year ago
Thanks for another great video, freeCodeCamp! This is something I've wanted to spend more time on with Python for a long time!!
@Autoscraping · 3 months ago
A wonderful video that we've used as a reference for our recent additions. Your sharing is highly appreciated!
@Felipe-ib9cx · 7 months ago
I'm starting this course now and I'm very excited! Thanks for the effort of teaching it.
@shameelabid2107 · a year ago
How did you know I needed this course right now? 😍😍😍😍 Btw, thanks for this free education.
@omyeole7221 · a month ago
This is the first coding course I have followed through to the end. Nicely taught. Keep it up.
@riticklath6413 · 26 days ago
Is it good?
@omyeole7221 · 25 days ago
@riticklath6413 Yes.
@deograsswidambe7803 · 9 months ago
Amazing tutorial, I've really enjoyed watching it and it helped me a lot with my project.
@jackytsui422 · 7 months ago
I just finished Part 7 and want to say thanks for the great tutorial!!
@M0hamedElsayed · 11 months ago
Thank you very much for this great course. I really learned a lot. ❤❤❤
@DibyanshuPandey-dg5hh · a year ago
Thanks a lot, freeCodeCamp, for another amazing tutorial ❤️.
@benjamunji1 · 4 months ago
For anyone having errors in Part 8 with the fake headers: you need to import Headers with "from scrapy.http import Headers", and then in the process_request function replace the header assignment with: request.headers = Headers(random_browser_header)
@pnwokeji7027 · 3 months ago
Thanks!
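Based on the fix above, the relevant part of the middleware would look roughly like this (the class name and the static header dict are stand-ins modelled on the course code, not a verbatim copy; in the course the headers come from the ScrapeOps headers API):
from scrapy.http import Headers

class ScrapeOpsFakeBrowserHeaderAgentMiddleware:
    def process_request(self, request, spider):
        # a static dict stands in for the randomly picked browser header
        random_browser_header = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.9"}
        # wrap the plain dict in Scrapy's Headers object before assigning it to the request
        request.headers = Headers(random_browser_header)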
@codewithguillaume · a year ago
Thanks for this crazy course!!!
@bratadippalai · a year ago
Exactly what I wanted at this moment. Thank you!
@mariusvantubbergh · 7 months ago
Thanks for the video. Can Scrapy be used to scrape AJAX responses? Or will Puppeteer / Selenium be more effective?
@rwharrington87 · 8 months ago
Looking forward to this. A MongoDB/pymongo section would be nice for data storage though!
@456user-ql3ot · 11 months ago
Hello, thanks for this great introduction to Scrapy. However, I'm wondering: in Part 8, for both "User Agents" and "Headers", you set up a boolean variable that tracks whether the middleware is enabled 🤔🤔 but I didn't see it being used anywhere. Personally, I thought it was meant to be used in the process_request method.
@johnnygoffla · 6 months ago
Thank you so much for providing this content for free. It's truly incredible that anyone with an internet connection can get a free coding education, and it's all thanks to people like you!
@mupreport · 6 months ago
13:37 creating venv
17:45 create scrapy project
29:31 create spider
33:38 shell
@seaondao · 4 months ago
This is so cool! I was able to follow along until Part 6, but from Part 7 I couldn't, so I will come back in the future once I have basic knowledge of MySQL and databases. (Note to self.)
@sarfrazjahan8615 · 10 months ago
Overall a good video, I learned a lot of things, but I think you should briefly discuss CSS and XPath selectors. I am having problems with them.
@MinhLe-ev4wc · 11 months ago
How do you guys know I need this for my data analysis project? Fantastic videos, guys! Thank you so much x 5000!
@jean-mariecarrara7226 · 7 months ago
Very clear explanation. Many thanks.
@aladinmovies · a month ago
Thanks, Joe Kearney! Nice course, of course. You are a good teacher, love it.
@tomasdileo2518 · 8 months ago
Great tutorial. However, I am stuck on Part 6, as Python is not able to recognize the bookscraper folder as a module and therefore won't let me access the BookItem class to use in the pipeline.
@DayTrading_SinFILTRO · a year ago
Thanks! Very hard to follow; it needs solid knowledge of Python.
@milchreis9726 · 6 months ago
Thank you very much for the good work! I really appreciate the tutorial. I need to point out that the MySQL I installed with the .dmg somehow could not be used from the terminal, so I ended up reinstalling MySQL via the terminal.
@ThanhNguyen-rz4tf · 10 months ago
This is gold for beginners like me. Thanks.
@zee_designs · 8 months ago
Great tutorial, thanks a lot!
@madrasatul-Qamr · 9 months ago
Thank you for these amazing videos. I wonder if anyone can help me: while on Part 5, for some reason when I exited the Scrapy shell I noticed that I was no longer in the virtual environment (i.e. the 'env' prefix was no longer present). When I tried to activate it again using env\Scripts\activate (as I'm using Windows) it kept giving an error. Has anyone else had this problem? If yes, why does it happen and how can I resolve it?
@codetoday1055 · 7 months ago
I had a question: how do you split data by table rows, for example Name, Price, Description, in Scrapy?
@BeMyArt · a year ago
Don't know why, but Scrapy looks easier for me to understand compared to BeautifulSoup 🤔 Maybe it's just because of the teacher, or my individual way of thinking.
@Code___Play · 2 months ago
Very practical and helpful video with very detailed explanations!
@cn7xp · 2 months ago
Finally someone understands what we really need, and it was published 9 months ago; how did I miss it? I hope this time I will have a Python adventure without wasting hundreds of hours for nothing. If it happens again I will curse Python and its developers. So many things keep changing; something can be outdated in seconds and you don't know where to fix it, etc. I will hopefully post an update on this adventure.
@Rodrigo.Aragon · 4 months ago
Great content! It helped me a lot to understand some concepts better. 💯
@bubblegum8630 · 4 months ago
Can someone help me!? In Part 3, when you create bookscraper, I don't have a bookspider.py created for me. What do I do to get it generated? I am confused.
@felicytatomaszewska2934 · 21 days ago
I watched it twice and I think it could be shortened quite a lot and better organized.
@ismailgrira7924 · a year ago
Just in time, thanks! I didn't know what I was going to do for a project I'm working on until I watched the video. Lifesaver.
@utsavaggrawal2697 · a year ago
Make a course on blocking the crypto spammers. Btw, thanks for the Scrapy course, I was searching for this for a while 😃
@flanderstruck3751 · a year ago
Note that I copied the code from the tutorial page for the ScrapeOpsFakeUserAgentMiddleware, and when trying to run it I get the following error: (...) AttributeError: 'dict' object has no attribute 'to_string'. SOLUTION: copy the process_request function exactly as it is in the video, not as it appears on the tutorial page.
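The working fix is presumably that the video's process_request sets only the User-Agent header instead of replacing request.headers with a plain dict (which is what triggers the to_string error). A rough sketch, with a stand-in user-agent string:
class ScrapeOpsFakeUserAgentMiddleware:
    def process_request(self, request, spider):
        # in the course this value comes from the ScrapeOps fake user-agent API; a static string stands in here
        random_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        # setting the single header avoids assigning a plain dict to request.headers
        request.headers['User-Agent'] = random_user_agent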
@gintautasrakauskas5336 · 7 months ago
Thank you. Great job.
@itwasntme7481 · 7 months ago
I have MySQL installed and can get the version in cmd, but when I try in VS Code, I just get "mysql : The term 'mysql' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again. At line:1 char:1"
@user-gb3er2th5f · 7 months ago
I definitely recommend it to everyone 👌👌👌
@Ka-kz3he · 7 months ago
Part 4, 54:07: if you're wondering why 'item_scraped_count' is still only 40, the href is probably already a full URL, so don't prepend the domain again. Teach yourself to improvise 💪
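In other words, let Scrapy resolve the URL instead of prepending the domain yourself; a rough sketch of the relevant lines in parse (parse_book_page is the callback name used in the course, and the selector may differ in your spider):
book_url = book.css("h3 a::attr(href)").get()
if book_url:
    # response.follow resolves relative hrefs against the current page and leaves absolute URLs untouched,
    # so the domain never gets prepended twice
    yield response.follow(book_url, callback=self.parse_book_page)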
@ChristopherFabianMendozaLopez · 6 days ago
This was very helpful, thank you so much for sharing all this knowledge for free!
@mikenb3682 · 10 months ago
Thank you, thank you, and once again, thank you!
@WanKy182 · 11 months ago
1:24:48 Don't forget to remove the commas after book_item['url'] = response.url and all the others when we add the BookItem import, because otherwise some values come through as tuples instead of plain strings.
@minhazulislam683 · 4 months ago
Please help me, I got two errors from this line: from bookscraper.items import BookItem (errors reported on items and BookItem). Has anyone faced the same issue?
@priyanshusamanta858 · 10 months ago
Thanks for such a wonderful web scraping tutorial. Please make a video tutorial on how to download thousands of PDFs from a website and perform PDF scraping with Scrapy. In general, please make a tutorial on PDF scraping as well.
@haleygillenwater8971 · 8 months ago
You don't do the PDF scraping with Scrapy; it isn't designed for parsing PDFs. You can download the PDFs using Scrapy (at least I imagine you can), but you have to use a PDF-scraping module in order to parse the contents of the PDF.
@joem8251 · 8 months ago
This tutorial is excellent! If you haven't made one already, I (and it looks like at least one other person in this thread) would appreciate a CSS & XPath tutorial.
@arinzechukwunwuba5813 · 8 months ago
Good day. Thank you for this, but I have tried to connect MySQL to PyCharm on my Windows OS to no avail. Any help will be appreciated.
@xiaolou8423 · 8 months ago
Hi, I have a question about venv. Do I need to create a new venv for each part, or should I use the venv from Part 2 the whole time?
@emilrueh · 7 months ago
You should use the same venv throughout a project, as it stores the pip-installed libraries, like Scrapy itself.
@user-wf1ep2tw9x · 7 months ago
Oh man, he was just showing me how good his code is!!!!!
@negonifas · a year ago
That's what I need! 👍👍👍
@martingustavoreyes6217 · 8 months ago
Hi, how can I use Scrapy with pages that are using AJAX? Thanks
@milckshakebeans8356 · 9 months ago
When saving to the database at 2:02:00, I had an error because the url was a tuple and 'cannot be converted'. If someone has a similar problem, they can just index into the value like this: str(item["description"][0]) (instead of the code provided, which is str(item["description"])) in the execute call inside the process_item function.
@ibranhr · 9 months ago
I'm still having the errors bro
@milckshakebeans8356 · 9 months ago
@ibranhr I found the error by looking at what was being processed when the error happened. I saw that it was a tuple and fixed it. Try something similar if you know the error is about converting values.
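As a rough illustration inside the pipeline's process_item (the table and column names are assumptions; the real fix is removing the trailing commas in the spider, as noted in an earlier comment, after which the [0] indexing becomes unnecessary):
def process_item(self, item, spider):
    def _as_str(value):
        # values accidentally stored as 1-element tuples (trailing comma) still convert cleanly
        return str(value[0]) if isinstance(value, tuple) else str(value)

    self.cur.execute(
        "INSERT INTO books (url, title, description) VALUES (%s, %s, %s)",
        (_as_str(item["url"]), _as_str(item["title"]), _as_str(item["description"])),
    )
    self.conn.commit()
    return item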
@renantrevisan2406 · 2 months ago
Nice video! Unfortunately Part 6 has a lot of code without debugging, so it's really hard to fix errors. Something is going wrong with my code, but I can't identify what.
@kaanenginsoy562 · 8 months ago
For Windows users: if you get an error, first type Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy Unrestricted -Force and after that type venv\Scripts\activate
@YvonneLoonyBalloony · 6 months ago
This worked for me, many thanks.
@2ru2pacFan · a year ago
Hey guys! Do you have any content on using Puppeteer for JS? :) That would be amazing! Thank you so much for doing what you do.
@lucasgonzalezsonnenberg3204 · a year ago
Is JavaScript good for web scraping? Thank you, BR
@pkavenger9990 · 7 months ago
1:34:58 Instead of using a lot of if statements, use a mapping. For example:
# saving the rating of the book as an integer
ratings = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
rating = adapter.get("rating")
if rating:
    adapter["rating"] = ratings[rating]
This is not only faster but also looks cleaner.
@aissame112 · a year ago
Great job, thanks!!
@eduardabramovich1216 · 9 months ago
I learned the basics of Python and now I want to focus on something to get a job. Is web scraping a skill that can get you a job on its own?
@yooujiin · a year ago
the course I needed months ago 😭
@_Clipper_ · a year ago
Did you try some other course?
@yooujiin · a year ago
@_Clipper_ I bought two Udemy courses. The tutorials on YouTube are limited, and so is this one.
@_Clipper_ · a year ago
@yooujiin Are you in data science? I need some recommendations for ML and web scraping. I tried Jose Pradila's course and it wasn't very in-depth, so I refunded it. Please recommend only if you are in the same field or have had the same suggested by someone you know in DS/AI/ML.
@yooujiin · a year ago
@_Clipper_ I'm currently doing my master's in software development. I would love some recommendations myself. I recommend the Scrapy course by Ahmed Rafik.
@xie-yaohuan · 10 months ago
Thank you for this excellent tutorial! The scraper I had at the end of Part 5 somehow only extracted the first book on each page, resulting in only 50 entries instead of 1000. I compared my code to the tutorial code but couldn't find what I did wrong. I'm wondering if someone had the same issue and managed to solve it. Otherwise, any hints about which part of the code might be the cause are also very much appreciated. (EDIT: solved the issue - thanks again for making this great tutorial!)
@sidcritch7724 · 9 months ago
How did you solve it?
@sprinter5901 · 9 months ago
@sidcritch7724 Most probably he used .get() at the end when declaring the books variable.
@DonNwN · 5 months ago
Please write up the solution to the problem.
@salimtrabelsi2163 · 5 months ago
@xie-yaohuan I got the same problem but I can't fix it; can you give us the solution? Thank you 😀
@joshuabotes2263 · 3 months ago
@sprinter5901 Thanks for this comment. I couldn't understand why it didn't work.
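For anyone hitting the same 50-items issue, the difference is roughly this (the selector follows the books.toscrape.com markup used in the course):
books = response.css("article.product_pod").get()   # .get() returns only the FIRST match as an HTML string, so you end up with one book per page
books = response.css("article.product_pod")         # without .get() you get a list of selectors, one per book, to loop over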
@SpiritualItachi · 3 months ago
For Part 8, if anybody is having trouble with the new headers not being printed to the terminal, make sure that in your settings.py file you enable "ScrapeOpsFakeUserAgentMiddleware" in DOWNLOADER_MIDDLEWARES and not in SPIDER_MIDDLEWARES (see the sketch after this thread).
@jonwinder6622 · 2 months ago
He explained that in the video.
@SpiritualItachi · 2 months ago
@jonwinder6622 Yeah, after going through it again I realized I missed that detail.
@jonwinder6622 · 2 months ago
@SpiritualItachi I don't blame you, it's so easy to overlook since he literally goes through so much lol
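A rough sketch of the settings.py change described in this thread (the middleware path and the 400 priority mirror what the course appears to use; your module names may differ):
# bookscraper/settings.py
DOWNLOADER_MIDDLEWARES = {
    "bookscraper.middlewares.ScrapeOpsFakeUserAgentMiddleware": 400,
}
# registering the same class under SPIDER_MIDDLEWARES has no effect on outgoing request headers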
@lucasgonzalezsonnenberg3204 · a year ago
Amazing video, I learned a lot!!! Would you do a video on how to scrape pages with CAPTCHAs? Thank you very much for your engagement.
@rahmonpro381 · 11 months ago
Thanks for the tutorial. I have a question: which is the better choice for scraping websites, Python or Node?
@erenyeager663 · 11 months ago
Python
@jonwinder1861 · 2 months ago
Python
@rahmonpro381 · 2 months ago
@jonwinder1861 I am using Node.js, it's much faster ^^
@ByTyoma · a year ago
You created the venv in the part2 folder, then it appears in the part3 folder... why not just create it in the freecodecamp folder and use that?
@briyarkhodayar5986 · 8 months ago
I have a question about Part 4: at first you just scraped one page, but later, when we want all the next pages and modify the spider, it still shows me only the first page. I'm not sure what the reason is. Can you help me with that please? Thank you.
@MarwanBahgat · 26 days ago
I'm facing the same thing; did you find a solution?
@oskarquintanagarcia5420 · a year ago
Great job 🤘
@user-yt2cv9ke1y · a year ago
Are there any mature custom Scrapy downloaders that integrate Selenium?
@priyanshusamanta858 · 11 months ago
Please explain in detail how to get the XPaths above. Do they have to be typed manually, or can they be copied from a right-click option??
@mohitchaniyal6170 · 10 months ago
You can copy XPaths, but creating an XPath manually gives you more flexibility. Go through the playlist below: kzbin.info/aero/PLL34mf651faO1vJWlSoYYBJejN9U_rwy-
@abhijeetyou98 · 7 months ago
Interesting and useful video 😎🤔
@michaelday6987 · 11 months ago
In case anyone is having a problem activating the venv on Windows, use the following command: . venv\scripts\activate
@stefanolanzini4316 · a year ago
Could setting everything up using conda instead of venv cause any problems?
@oanvanbinh2965 · 5 months ago
Great tutorial!
@yogeshpatil1586 · 11 months ago
54:36 - 18/05
1:23:16 - 26/05
1:44:19 - 14/06
@sarfrazjahan8615 · 10 months ago
And also make a video on scrapy-playwright for JavaScript-based sites. Thank you.
@Alien-cr1zb · 7 months ago
Is this course enough to do scraping tasks on freelancing websites? If it's not, could anyone mention what I should do after I finish it?
@ezoterikcodex · 3 months ago
Hello folks! I'm currently at Part 5 and here is my question; if there is an answer in the later parts of the course, please let me know. When we export the results to a file, the first book in the file is not "A Light in the Attic". That is the same for me and probably for all of us. Is that a problem, or what is it? The number of items scraped is 1k, so the scraped item count matches the total number of books we want to scrape. To be honest, I'm wondering why the items don't come out in the same order as on the site.
@doctordensol · 11 months ago
Is there a second part to this course about scraping dynamic websites with Scrapy?
@user-kn4ud5mf3o · 8 months ago
On his channel.
@ndungukaranja913 · 2 months ago
In Part 6, at the start of the process_item function, despite having the exact same code as the tutorial, my value = adapter.get(field_name) returns the plain value and not a tuple, so it was unnecessary to add the index on the following line. Does anyone know why this is happening?
@MiscChecker · a year ago
Excellent video
@albint3r532 · 13 days ago
I have a question: once we implement the user-agent and proxy changes in our code, are they also reflected in the Scrapy shell?
@abdulrasheednomwenighoa.1442 · 5 months ago
How did you add the full bookscraper in the terminal?
@bryancapulong147 · 2 months ago
Can Scrapy get data from Cloudflare-protected websites? I just want to extract a list of holidays from our country's government websites to automatically store them in a table, but they don't have an API for it.
@geovannycamargo1282 · a year ago
Is there something equivalent in R?
@luisemilioogando · a year ago
Unable to activate the virtual env on Windows. After being in env/Scripts, what is the command? I have tried .\activate, ./activate and activate alone; none of those works. Finally I changed the terminal to CMD in VS Code; after being in venv/Scripts you type .\activate
@jasonexis1792 · 5 months ago
Great job!
@niidaimehokage5731 · 7 months ago
Can someone tell me why so many people want to use or learn web scraping that requires coding, while there are many web scraping tools that don't require coding? Is it because the no-code web scraping tools are limited? Thank you to anyone who answers my question.
@Alex-kz1rw · 5 months ago
Hey 👋🏻 When I use the Scrapy shell and the view(response) command, I cannot see all of the HTML from the website. I only see the "accept cookies" window; I can accept it in the browser, and after that I have a blank browser. What can I do to fetch the whole HTML?
@efootballpes9736 · a year ago
Hello sir, can you make a full React project, like a clone or something new... I'm trying this but need a course... please
@flanderstruck3751 · 3 months ago
OK, so you schedule your spiders using ScrapeOps. But how do you consume the output of that scraping? As far as I know it's just stored on the virtual server. Can you retrieve it with ScrapeOps?
@0810honeymoon · 7 months ago
Sorry, I just want to know how to let ScrapeOps know the MySQL info (hostname, port, user, password, database) if I don't want to put it in my code directly. I've saved the MySQL password in a local .env file, and I don't want to push the .env file to GitHub, which means ScrapeOps can't connect to MySQL when the scheduled run happens. I also tried adding the MySQL info as key:value arguments when scheduling, but it doesn't work. Can someone help me solve it... :(
@talaldardgn2550 · a year ago
Great job... I hope you make a course on how to dockerize the Scrapy project with a Postgres DB.
@canvietanh7112 · 10 months ago
For example, I want to crawl data from this website, but in the response the JS was not rendered. Can you show the way, please?
@loveprathshukla5677 · a year ago
What a great video! But I have always faced issues scraping a Tailwind CSS website; can anybody tell me how we can do that?
@user-kn4ud5mf3o · 8 months ago
Hmmm, maybe XPath? Never heard of this.
@lukesanford9026 · a year ago
I've been using a combination of BS4 and Selenium for all my scraping needs. Does Scrapy offer something they don't? Should I check this video out?
@akj3344 · a year ago
I wanted to ask the same question.
@CrYpt001 · a year ago
Selenium is overkill for the job and very slow. The elegant scraping solution involves making the requests yourself, allowing for speed and scalability. Think about it: you use a whole browser to automate, and that browser makes the requests anyway. Why use an unnecessary middleman if you can help it?