Scrapy-Playwright: How To Scrape Dynamic JS Websites (2022)

Views: 21,560

Day ago

Comments: 32

@scrapeops 2 years ago

Hey guys! If you have any ideas about websites that you would like us to show you how to scrape, please let us know! Oh, and which programming language/framework too - we will be branching out into videos for scraping with Node.js and other languages too :)

@unspkn2428 1 year ago

Hello, amazing video. I am looking to use this to log in. I tried sending form data in start_requests but had no luck. Does this support authentication?

@darkomacoritto 1 year ago

Hello, it would be awesome if you could scrape Facebook Marketplace products (description, price, location, etc.). I have tried but never managed to make it work.

@neekolad 5 months ago

Hey, everywhere I look I see basic examples with Scrapy: just go through pages and scrape product info. What I would like to see is a more complex example, like submitting multiple POST requests, or going through categories and scraping product URLs, and then later visiting those URLs to scrape product data. Just to get an idea of some common crawler flows and architecture. Or, let's say, using a database to save products, and then doing some data manipulation before exporting the data to a client. Cheers

@michaelmuolokwu5039 2 years ago

Awesome video!! At 15:03, how does Playwright know to recursively call back the parse function, since it was not added in the next page request?

@nervous711 1 year ago

15:08 You don't need to specify the callback method? So Scrapy automatically calls the parse function when you go to the next page?

@marceli1588 7 months ago

By default, Scrapy sends the response to the spider's parse method when a request has no callback. You can specify the callback anyway for code clarity, or when you want a different method name or several parse methods in one spider.

@tomz84 2 years ago

Great video, thanks for the content. How would you generalise the scroll logic to websites that have hundreds of lazy-loaded pages?

@JcOnAFieldtrip 1 year ago

Hi, does this method work if the URL doesn't change with a new page?

@TNuno 2 years ago

Does this work on Windows? I get an error -> 2-11-27 21:57:36 [asyncio] ERROR: Task exception was never retrieved future:

@scrapeops 2 years ago

At the moment, Scrapy Playwright doesn't run on Windows machines. However, you can run it through Windows Subsystem for Linux. github.com/scrapy-plugins/scrapy-playwright/issues/7#issuecomment-817394494

@sebastiangiunta5885 2 years ago

How do you handle pagination when you don't have a simple link, rather the href value is something like javascript:document.form.pageNum.value=17;gotoPage( null). So we'd need playwright to click on that button essentially, load the new page, scrape and repeat.

@scrapeops 2 years ago

There are two ways to do this: 1) As you said, use Playwright to click on the next page button. 2) Open the network tab in your developer tools and try to find the network request that the next page button generates. Oftentimes, clicking a next page button sends either a GET or a POST request to an internal API endpoint, which loads in the new data. If you find an API endpoint like this, then you can just send your request directly to that API endpoint. This wouldn't require the use of Playwright.

@sebastiangiunta5885 2 years ago

@scrapeops Thanks - I've managed to get the pagination working with clicking. The issue now is memory: the site I am trying to scrape has 900 pages I need to cycle through. I am using a while loop to click through all the pages and extract the links. Is there a way to open a fresh page after clicking the next page button, and close the old page?

@scrapeops 2 years ago

@sebastiangiunta5885 By default, when you make a request using: yield scrapy.Request(url="httpbin.org/get", meta={"playwright": True}) it makes the request with a new page. However, if you add "playwright_include_page": True to the request meta, the page is kept open and passed to your callback as response.meta["playwright_page"], so you can keep working with the same page (and you have to close it yourself when you are done). Search the documentation at github.com/scrapy-plugins/scrapy-playwright for "playwright_include_page" and you will see the explanation. So you are probably already creating a new page. Maybe try using the same page and see if you still have issues. Memory issues are one of the big problems of using headless browsers at scale.

@ZEMMOURI_Med 1 year ago

Thanks for the explanation. Could you specify which versions of Python, Scrapy and scrapy-playwright were used?

@JozsefKun-n5u 11 months ago

I just wanted to run scrapy shell on this script, but I always get this error: twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: quotes. What can I do?

@vicky87587 4 months ago

Great video. Can you show how to run scrapy-playwright on AWS Lambda with ECR?

@leonardschenk173 1 year ago

Hey, on Windows, instead of using WSL, would it be possible to use a Linux-based Docker image to run Scrapy and Playwright?

@scrapeops 1 year ago

Not sure, I've never tried. But it is worth giving it a go.

@sowjanyakake5720 1 year ago

Why is the project setup done in a venv?

@spotshot7023 2 years ago

Hi, can you please tell us how to set up WSL on Windows so that Playwright works?

@scrapeops 2 years ago

Here is a good tutorial video which should help you get WSL set up: kzbin.info/www/bejne/p4rPmIh6gLB-a6M&ab_channel=NeuralNine

@helix8847 2 years ago

How would this work with JSON files? I noticed that on the website I am scraping, the search results are within a "POST" JSON file.

@scrapeops 2 years ago

What do you mean by "within a POST JSON file"? Do you need to send a POST request to an endpoint to get the data? If so, you can just send a POST request to that endpoint with a normal HTTP request and not bother with a headless browser.

@helix8847 2 years ago

@scrapeops Yeah, that was the question. Thanks!
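The suggestion in this thread (send the POST request straight to the endpoint, with no headless browser) can be sketched as follows. It uses only the Python standard library so that it runs outside a Scrapy project; inside a spider, a Scrapy request with method POST would play the same role. The endpoint URL and the payload fields are made up for illustration; the real ones would come from the browser's network tab.

```python
import json
import urllib.request


def build_json_post(url, payload):
    """Build a POST request that carries `payload` as a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def fetch_json(request, timeout=10.0):
    """Send the request and decode the JSON response (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical endpoint and payload; nothing is sent here, the request
    # is only built and inspected.
    req = build_json_post(
        "https://example.com/api/search", {"query": "laptops", "page": 1}
    )
    print(req.get_method(), req.full_url)
```

Calling fetch_json(req) would then return the decoded search results, assuming the endpoint accepts the same body the browser sends.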

@Exramas 1 year ago

Hi, great content! I tried to reproduce your tutorial and ended up with a "ModuleNotFoundError: No module named '_lzma'", and I find myself stuck at the first scrapy crawl quotes... Do you happen to know why, please?

@scrapeops 1 year ago

Looks like it is an issue with how Python is installed on your machine. Check out this Stack Overflow question, it might help you: stackoverflow.com/questions/59690698/modulenotfounderror-no-module-named-lzma-when-building-python-using-pyenv-on

@cocomango-p2p 1 year ago

I got "[asyncio] ERROR: Task exception was never retrieved" while running the code on Windows. Does anyone know the solution to this?

@ahmedaboamra56 1 year ago

scrapy-playwright does not work on Windows; it works in WSL or on a Linux system.

@surajbherwani5488 1 year ago

Please scrape a Discord server.

@AmonAsmodeus 2 months ago

For some reason, when I follow this video I have no issues, but when I follow the article tutorial I get the error: scrapy.exceptions.NotSupported: Unsupported URL scheme 'https': No module named 'scrapy_playwright'. Deactivating the venv and reactivating it does not solve the issue for me.
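The error message above says that the Python interpreter running Scrapy cannot import scrapy_playwright. One possible cause, which is only a guess here, is that the scrapy command resolves to a different environment than the one where the package was installed. A small standard-library check like the sketch below, run with the same interpreter as the spider, shows which interpreter is in use and which of the relevant packages it can see. The helper name is_importable is made up for this example.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the current interpreter can find `module_name`."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # The interpreter path reveals which environment is actually active.
    print("Interpreter:", sys.executable)
    for name in ("scrapy", "scrapy_playwright", "playwright"):
        status = "importable" if is_importable(name) else "MISSING"
        print(f"{name}: {status}")
```

If scrapy_playwright is reported as missing, installing it into that same environment (for example with that interpreter's own pip) would be the next thing to try.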