Scrapy-Playwright: How To Scrape Dynamic JS Websites (2022)

21,560 views

ScrapeOps


Comments: 32
@scrapeops 2 years ago
Hey guys! If you have any ideas about websites that you would like us to show you how to scrape, please let us know! Oh, and which programming language/framework too - we will be branching out into videos on scraping with Node.js and other languages too :)
@unspkn2428 1 year ago
Hello, amazing video. I am looking to use this to log in. I tried sending form data in start_requests but had no luck. Does this support authentication?
@darkomacoritto 1 year ago
Hello, it would be awesome if you could scrape Facebook Marketplace products (description, price, location, etc.). I have tried but never managed to make it work.
@neekolad 5 months ago
Hey, everywhere I look I see basic examples with Scrapy: just go through pages and scrape product info. What I would like to see is a more complex example, like submitting multiple POST requests, or going through categories and scraping product URLs, then later visiting those URLs to scrape product data, just to get an idea of some common crawler flows and architecture. Or, let's say, using a database to save products and then doing some data manipulation before exporting the data to the client. Cheers
@michaelmuolokwu5039 2 years ago
Awesome video!! At 15:03, how does Playwright know to call the parse function again, since no callback was added to the next page request?
@nervous711 1 year ago
15:08 you don't need to specify a callback method? So Scrapy automatically calls the parse function when you go to the next page?
@marceli1588 7 months ago
By default it goes to the parse function. You can specify it for code clarity, or if you want another name or several parse methods in one spider.
@tomz84 2 years ago
Great video, thanks for the content. How would you generalise the scroll logic to websites that have hundreds of lazy-loaded pages?
@JcOnAFieldtrip 1 year ago
Hi, does this method work if the URL doesn't change with a new page?
@TNuno 2 years ago
Does this work on Windows? I get an error -> 2-11-27 21:57:36 [asyncio] ERROR: Task exception was never retrieved future:
@scrapeops 2 years ago
At the moment, Scrapy Playwright doesn't work on Windows machines. However, you can run it under Windows Subsystem for Linux. github.com/scrapy-plugins/scrapy-playwright/issues/7#issuecomment-817394494
@sebastiangiunta5885 2 years ago
How do you handle pagination when you don't have a simple link, but rather the href value is something like javascript:document.form.pageNum.value=17;gotoPage( null)? So we'd essentially need Playwright to click on that button, load the new page, scrape and repeat.
@scrapeops 2 years ago
There are two ways to do this:
1) As you said, use Playwright to click on the next page button.
2) Open the network tab in your developer tools and try to find the network request that the next page button generates. Oftentimes, clicking a next page button sends either a GET or POST request to an internal API endpoint, which loads in the new data. If you find an API endpoint like this, you can just send your request directly to it. This wouldn't require the use of Playwright.
@sebastiangiunta5885 2 years ago
@scrapeops Thanks - I've managed to get the pagination working with clicking. The issue now is memory: the site I am trying to scrape has 900 pages I need to cycle through. I am using a while loop to click through all the pages and extract the links. Is there a way to open a fresh page after clicking the next page button and close the old page?
@scrapeops 2 years ago
@sebastiangiunta5885 By default, when you make a request using: yield scrapy.Request(url="https://httpbin.org/get", meta={"playwright": True}) it makes the request with a new page. However, if you add "playwright_include_page": True to the request meta, the page is kept open and handed to your callback, so you can keep using the same page. Search the documentation at github.com/scrapy-plugins/scrapy-playwright for "playwright_include_page" and you will see the explanation. So you are probably already creating a new page. Maybe try using the same page and see if you still have issues. Memory issues are one of the big problems of using headless browsers at scale.
@ZEMMOURI_Med 1 year ago
Thanks for the explanation. Could you specify which versions of Python, Scrapy and scrapy-playwright you used?
@JozsefKun-n5u 11 months ago
I just wanted to run scrapy shell on this script, but I always get this error: twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: quotes. What can I do?
@vicky87587 4 months ago
Great video. Can you show how we can run scrapy-playwright on AWS Lambda with ECR?
@leonardschenk173 1 year ago
Hey, on Windows, instead of using WSL, would it be possible to use a Linux-based Docker image to run Scrapy and Playwright?
@scrapeops 1 year ago
Not sure, I've never tried. But it is worth giving it a go.
@sowjanyakake5720 1 year ago
Why is the project setup done in a venv?
@spotshot7023 2 years ago
Hi, can you please tell us how to set up WSL on Windows for Playwright to work?
@scrapeops 2 years ago
Here is a good tutorial video which should help you get WSL set up: kzbin.info/www/bejne/p4rPmIh6gLB-a6M&ab_channel=NeuralNine
@helix8847 2 years ago
How would this work with JSON files? I noticed that on the website I am scraping, the search results are within a "POST" JSON file.
@scrapeops 2 years ago
What do you mean by "within a POST JSON file"? Do you need to send a POST request to an endpoint to get the data? If so, you can just send a POST request to that endpoint with a normal HTTP request and not bother with a headless browser.
@helix8847 2 years ago
@scrapeops Yeah, that was the question. Thanks!
@Exramas 1 year ago
Hi, great content! I tried to reproduce your tutorial and ended up with a "ModuleNotFoundError: No module named '_lzma'", so I'm stuck at the first scrapy crawl quotes... Do you happen to know why, please?
@scrapeops 1 year ago
Looks like it is an issue with how Python is installed on your machine. Check out this Stack Overflow question, it might help you: stackoverflow.com/questions/59690698/modulenotfounderror-no-module-named-lzma-when-building-python-using-pyenv-on
@cocomango-p2p 1 year ago
I got "[asyncio] ERROR: Task exception was never retrieved" while running the code on Windows. Does anyone know the solution to this?
@ahmedaboamra56 1 year ago
scrapy-playwright does not work on Windows; it works in WSL or on a Linux system.
@surajbherwani5488 1 year ago
Please scrape a Discord server
@AmonAsmodeus 2 months ago
For some reason, when I follow this video I have no issues, but when I follow the article tutorial I get the error: scrapy.exceptions.NotSupported: Unsupported URL scheme 'https': No module named 'scrapy_playwright'. Deactivating the venv and reactivating it does not solve the issue for me.