Hey guys! If you have any ideas about websites that you would like us to show you how to scrape, please let us know! Oh, and which programming language/framework too - we will be branching out into videos for scraping with Node.js and other languages too :)
@unspkn2428 1 year ago
Hello, amazing video. I am looking to use this to log in. I tried sending form data in start_requests but had no luck. Does this support authentication?
@darkomacoritto 1 year ago
Hello, That would be awesome if you could scrape Facebook Marketplace products (description, price, location, etc.). I have tried but never managed to make it work.
@neekolad 5 months ago
Hey, everywhere I look I see basic examples with Scrapy: just go through pages and scrape product info. What I would like to see is a more complex example, like submitting multiple POST requests, or going through categories and scraping product URLs, and then later visiting those URLs to scrape product data, just to get an idea of some common crawler flows and architecture. Or, let's say, using a database to save products and then doing some data manipulation before exporting the data to a client. Cheers
@michaelmuolokwu5039 2 years ago
Awesome video!! At 15:03, how does Playwright know to recursively call back the parse function, since it was not added in the next page request?
@nervous711 1 year ago
15:08 You don't need to specify the callback method? So Scrapy automatically calls the parse function when you go to the next page?
@marceli1588 7 months ago
By default it goes to the parse function. You can specify it for code clarity, or if you want to use another name or have several parse methods in one spider.
@tomz84 2 years ago
Great video, thanks for the content. How would you generalise the scroll logic to websites that have hundreds of lazy-loaded pages?
@JcOnAFieldtrip 1 year ago
hi, does this method work if the url doesn't change with a new page?
@TNuno 2 years ago
Does this work on Windows? I get an error -> 2-11-27 21:57:36 [asyncio] ERROR: Task exception was never retrieved future:
@scrapeops 2 years ago
At the moment, Scrapy Playwright doesn't run on Windows machines. However, you can run it using Windows Subsystem for Linux. github.com/scrapy-plugins/scrapy-playwright/issues/7#issuecomment-817394494
@sebastiangiunta5885 2 years ago
How do you handle pagination when you don't have a simple link, rather the href value is something like javascript:document.form.pageNum.value=17;gotoPage( null). So we'd need playwright to click on that button essentially, load the new page, scrape and repeat.
@scrapeops 2 years ago
There are two ways to do this: 1) As you said, use Playwright to click on the next page button. 2) Open the network tab in your developer tools and try to find the network request that the next page button generates. Oftentimes, clicking a next page button sends either a GET or POST request to an internal API endpoint, which loads the new data. If you find an API endpoint like this, then you can just send your request directly to that API endpoint. This wouldn't require the use of Playwright.
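Option 2 might look roughly like the sketch below. It is only an illustration under assumptions: the endpoint URL `https://example.com/search` is hypothetical, and the form field name `pageNum` is borrowed from the href in the question above. The real URL, method, and field names have to be read from the request shown in the network tab. The code uses only the Python standard library, and it builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: replace with the URL seen in the browser's network tab.
SEARCH_URL = "https://example.com/search"


def build_page_request(page_num: int) -> Request:
    """Build (but do not send) the form POST assumed to sit behind the next page button."""
    # 'pageNum' is an assumed field name, taken from the href in the question.
    body = urlencode({"pageNum": page_num}).encode("utf-8")
    return Request(
        SEARCH_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    # Print what would be sent for the first few pages.
    for page in (1, 2, 3):
        req = build_page_request(page)
        print(req.get_method(), req.full_url, req.data)
```

Inside a Scrapy spider, the equivalent would probably be a `scrapy.FormRequest` with `formdata={"pageNum": "17"}`, which avoids the browser (and its memory cost) entirely.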
@sebastiangiunta5885 2 years ago
@scrapeops Thanks - I've managed to get the pagination working with clicking. The issue now is memory: the site I am trying to scrape has 900 pages I need to cycle through. I am using a while loop to click through all the pages and extract the links. Is there a way to open a fresh page after clicking the next page button and close the old page?
@scrapeops 2 years ago
@sebastiangiunta5885 By default, when you make a request using:

yield scrapy.Request(
    url="httpbin.org/get",
    meta={"playwright": True},
)

it makes the request with a new page. However, if you add:

"playwright_include_page": True

to the request, then it will use the same page. Search the documentation at github.com/scrapy-plugins/scrapy-playwright for "playwright_include_page" and you will see the explanation. So you are probably already creating a new page. Maybe try using the same page and see if you still have issues. Memory issues are one of the big problems of using headless browsers at scale.
@ZEMMOURI_Med 1 year ago
Thanks for the explanation. Could you specify which versions of Python, Scrapy, and scrapy-playwright you used?
@JozsefKun-n5u 11 months ago
I just wanted to run scrapy shell on this script, but I always get this error: twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: quotes. What can I do?
@vicky87587 4 months ago
Great video. Can you show how to run this (scrapy-playwright) on AWS Lambda with ECR?
@leonardschenk173 1 year ago
Hey, on Windows, instead of using WSL, would it be possible to use a Linux-based Docker image to run Scrapy and Playwright?
@scrapeops 1 year ago
Not sure, I've never tried. But it is worth giving it a go.
@sowjanyakake5720 1 year ago
Why is the project setup done in a venv?
@spotshot7023 2 years ago
Hi, can you please tell us how to set up WSL on Windows so that Playwright works?
@scrapeops 2 years ago
Here is a good tutorial video which should help you get WSL set up: kzbin.info/www/bejne/p4rPmIh6gLB-a6M&ab_channel=NeuralNine
@helix8847 2 years ago
How would this work with JSON files? I noticed that on the website I am scraping, the search results are within a "POST" JSON file.
@scrapeops 2 years ago
What do you mean by "within a POST JSON file"? Do you need to send a POST request to an endpoint to get the data? If so, you can just send a POST request to that endpoint with a normal HTTP request and not bother with a headless browser.
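As a rough sketch of what "a normal HTTP request" could look like for a JSON endpoint: everything specific here is an assumption, including the URL `https://example.com/api/search`, the payload fields `query` and `page`, and the `results` key in the response. The real values would come from the request visible in the browser's network tab. The example uses only the Python standard library and does not contact any server.

```python
import json
from urllib.request import Request

# Hypothetical endpoint: copy the real one from the browser's network tab.
API_URL = "https://example.com/api/search"


def build_search_request(query: str, page: int = 1) -> Request:
    """Build (but do not send) a JSON POST request for one page of search results."""
    payload = {"query": query, "page": page}  # assumed field names
    return Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def parse_results(raw_body: bytes) -> list:
    """Decode a JSON response body and return its 'results' list (assumed key)."""
    return json.loads(raw_body.decode("utf-8")).get("results", [])
```

To actually send it, `urllib.request.urlopen(build_search_request("laptop"))` returns a response whose `.read()` can be passed to `parse_results`. In a Scrapy spider, `scrapy.http.JsonRequest(url, data=payload)` fills the same role.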
@helix8847 2 years ago
@scrapeops Yeah, that was the question. Thanks!
@Exramas 1 year ago
Hi, great content! I tried to reproduce your tutorial and ended up with a "ModuleNotFoundError: No module named '_lzma'", and I find myself stuck at the first scrapy crawl quotes... Do you happen to know why, please?
@scrapeops 1 year ago
Looks like it is an issue with how Python is installed on your machine. Check out this Stack Overflow question, it might help you: stackoverflow.com/questions/59690698/modulenotfounderror-no-module-named-lzma-when-building-python-using-pyenv-on
@cocomango-p2p 1 year ago
I got "[asyncio] ERROR: Task exception was never retrieved" while running the code on Windows. Does anyone know the solution to this?
@ahmedaboamra56 1 year ago
scrapy-playwright does not work on Windows; it works in WSL or on a Linux system.
@surajbherwani5488 1 year ago
Please scrape a Discord server.
@AmonAsmodeus 2 months ago
For some reason, when I follow this video I have no issues, but when I follow the article tutorial I get the error: scrapy.exceptions.NotSupported: Unsupported URL scheme 'https': No module named 'scrapy_playwright'. Deactivating the venv and reactivating it does not solve the issue for me.
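The "No module named 'scrapy_playwright'" part of that error suggests the package cannot be imported by the Python interpreter that is running `scrapy`. One possible cause, offered as a guess rather than a diagnosis, is that it was installed into a different environment. A small check like the one below, run with the same `python` that runs Scrapy, may help narrow it down; the helper name `is_importable` is made up for this sketch.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if the current interpreter can find the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Shows which Python is actually running (the venv's or the system one).
    print("interpreter:", sys.executable)
    for name in ("scrapy", "scrapy_playwright", "playwright"):
        status = "found" if is_importable(name) else "MISSING"
        print(f"{name}: {status}")
```

If `scrapy_playwright` shows as missing, running `python -m pip install scrapy-playwright` inside the activated venv, followed by `playwright install`, would be the usual next step.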