Web Scrape Nested Links/Multiple Pages - Web Scraping in R (Part 2)

42,047 views

Dataslice

A day ago

Web Scraping in R: Visiting Nested Links using rvest, Part 2 -- After scraping movies from IMDb (in the last video), I also want to extract the main cast members from each movie, which requires going into each movie's link. In this video, I show you how to scrape links and navigate into them to potentially scrape even more data.
Part 3: • Web Scrape Multiple Pa...
Part 4: • Web Scrape Tables - We...
Part 1: • Web Scrape Text from A...
SelectorGadget for Chrome: chrome.google....
Code: github.com/abh...
Music by: Prod. Riddiman
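
A minimal sketch of the nested-link pattern described above, for readers following along without the video (the search URL and CSS selectors are assumptions based on the IMDb advanced-search layout used in the series; IMDb has since changed its layout, so they may need updating -- see the comments below):

    library(rvest)
    library(dplyr)

    # listing page (example URL)
    page = read_html("https://www.imdb.com/search/title/?genres=adventure")

    name = page %>% html_nodes(".lister-item-header a") %>% html_text()

    # each movie's href is relative, so prepend the domain
    movie_links = page %>% html_nodes(".lister-item-header a") %>%
      html_attr("href") %>% paste0("https://www.imdb.com", .)

    # visit one nested link and return its cast as a single string
    get_cast = function(movie_link) {
      movie_page = read_html(movie_link)
      movie_page %>% html_nodes(".primary_photo+td a") %>%
        html_text() %>% trimws() %>% paste(collapse = ",")
    }

    # one request per movie, so this step is the slow part
    cast = sapply(movie_links, get_cast, USE.NAMES = FALSE)

    movies = data.frame(name, link = movie_links, cast, stringsAsFactors = FALSE)
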

Comments: 82
@Nafke 5 months ago
I love how much knowledge you can pack into such a short vid. I love the editing, the tone, the pace… your tutorials are def among the best on YouTube for any field!
@vedantamohapatra1192 2 years ago
Hey, great vid. However, after two years the IMDb website has made some changes: when you click the movie link it redirects you to a page that has all the details about the movie, but you can't select the cast using SelectorGadget because the cast are in div elements. We have to click another link on this page, "All cast and crew", and then on that page we can select the cast. So how do we do this, since we can't use the movie_links in the sapply function as it returns nothing?
@prasenjeetrathore 3 years ago
man you should post more, love your videos, very easy to follow tutorials 🧡
@postercam 2 years ago
You just helped get my research to an unexpected level of data collection. Thanks so much for this, your videos were an incredible support.
@willykitheka7618 3 years ago
Sorry, I didn't realize you created a series, so I am going through all of them! Once again thanks for sharing... this is super useful!
@grainofsalt2113 3 years ago
Thanks so much for this. Your videos have been SO helpful for my analytics class....you have no idea. I don't know if you get any financial benefit from these videos at all, but I wanted to express my appreciation
@dataslice 3 years ago
That's great to hear! I was a teaching assistant for a data science/analytics course last year and it's what inspired me to make these videos!
@1812CE 4 years ago
Your videos are really great man, keep going!
@dataslice 4 years ago
Thank you! Glad you enjoyed it :-)
@adamsaxton6550 2 years ago
Excellent video. A perfect example that I could follow and apply to my own project. Great.
@vicentefontecilla2025 3 years ago
A big thanks to you. I've been looking for this on the internet for hours and you just explained it in 9 minutes. Thank you so much! Keep making videos!
@suganyavidyadharan2966 4 years ago
Great tutorial... I loved all four parts... You are awesome... Thank you so much for this... I have been looking for videos on web scraping and this is the best... Thank you once again... God bless you...
@dataslice 4 years ago
Glad you enjoyed it!
@JapjeetS 2 years ago
How are you getting the table output in a different tab, and the format looks so clean?
@atilga.n 3 years ago
thank you thank you thank you! amazing explanations and demonstration, really clear, thanks again!
@dataslice 3 years ago
Glad it was helpful!
@KianaAshoftehfard 10 months ago
Thanks for these tutorials. I tried the code and the variables remain empty. Could this problem be due to new site restrictions? SelectorGadget is not working like in your video now.
@paulfong3011 3 years ago
Thank you so, so much for your great tutorial!! I have a question: what should I do when the error "Error in data.frame(name, year, rating, sys, cast, stringsAsFactors = FALSE) : arguments imply differing number of rows: 50, 45" appears because 5 of the movies on IMDb have no rating?
@kavitakamatdivekar5152 3 years ago
same with me
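
One way to avoid this row-count mismatch (a sketch, not from the video): scrape field by field within each movie container using html_node(), which returns NA when an element such as the rating is missing. The container and rating selectors here are assumptions about the old IMDb list layout:

    # one node per movie, so every field vector has the same length
    movie_items = page %>% html_nodes(".lister-item-content")

    name   = movie_items %>% html_node(".lister-item-header a") %>% html_text()
    rating = movie_items %>% html_node(".ratings-imdb-rating strong") %>%
      html_text() %>% as.numeric()   # NA where a movie has no rating yet

    # name and rating now line up row by row, so data.frame() no longer errors
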
@bonumsanguis1051 2 years ago
I tried doing this on Google News to extract the articles from each of the URLs/headlines. It only works for a single sample URL and not for all of them. Please help. 🙏🙏🙏
@retrosak1977 A year ago
Outstanding video 👏
@vineetkaushik5265 2 years ago
Hello Dataslice, thanks for this awesome tutorial. When I do this for another website, the href value is not visible; something else is written instead of href. Is there any way to deal with this issue?
@rekkiesbub423 A year ago
The same thing happened to me. I got through it by inspecting whether the attribute really is "href" or something else, then tried copying different kinds of < > tags into the script and it finally worked. If you know a better way to fix it, please reply.
@radhikaiyer8012 4 years ago
Hi.. this is very informative.. unfortunately, when I try to scrape Wikipedia I get the following error: "Error in open.connection(x, "rb") : HTTP error 404". What am I doing wrong?
@dataslice 4 years ago
Is it possible the link was wrong? I haven't run into the error but it looks like there are some potential solutions on Google that might be worth looking at
@mariemmoula2874 3 years ago
Actually I got the same problem and I didn't find a solution on Google: Error in open.connection(x, "rb") : Could not resolve host: url
@jesusrafaelyanvalenzuela2953 3 years ago
Nice videos! Thanks a lot. The sapply function is relatively slow (about 1 second per link). Why does sapply take so long? Is there a way to make it faster?
@TheNozimjon 3 years ago
There is a function called "map" in the purrr package. It is now common to use that function for iterative tasks.
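
For reference, a minimal sketch of the purrr equivalent, assuming the get_cast() helper from the video. Note that map_chr() won't be faster than sapply() here, since the time is spent on the web requests themselves:

    library(purrr)

    # iterate over the links and return a character vector, like sapply()
    cast = map_chr(movie_links, get_cast)
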
@siddhantsingh7018 4 years ago
Why did you define (movie_link) in the get_cast function and not (movie_links) (LINE 14)? I used your code and still got "object movie_cast not found". Help me out, please!
@dataslice 4 years ago
Movie_link was just a variable name for the input of the function. For example, I could define:

    AddTwo = function(movie_link) { return(movie_link + 2) }
    X = 5

AddTwo(X) should give me 7.
@dataslice 4 years ago
Is your data frame at the end (movies) fully populated?
@jannonflores1113 2 years ago
Thanks for this bro! I was able to scrape propertyfinder because of this. One problem though: how can we make the program faster? Whenever it scrapes across multiple pages it takes much longer than just scraping the outer details.
@venkatsainivarthi6326 3 years ago
I am getting an 'HTTP error 400' when I run line 21. What does this mean?
@dataslice 3 years ago
It may be that the URL is invalid. You might want to print out the URL and see if you can go to it in a web browser to make sure it actually exists.
@qorazx A year ago
For me it always separates with a space when pasting to concatenate the links. If I add sep="" it concatenates with an extra space at the end. If I leave sep out completely I do not get the extra space at the end of the link, but no matter what I do, the space between the main URL that I paste and the href is not going away. Any help? Edit: paste0() worked, without any sep="" specification! Thanks ChatGPT ;)
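
For anyone hitting the same issue, a quick illustration of the difference (the URL fragments are just placeholders):

    base = "https://www.imdb.com"
    href = "/title/tt0111161/"

    paste(base, href)             # "https://www.imdb.com /title/tt0111161/" (default sep is a space)
    paste(base, href, sep = "")   # "https://www.imdb.com/title/tt0111161/"
    paste0(base, href)            # same as above: paste0() never inserts a separator
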
@vanidanara 3 years ago
When you drag over the line at 1:25, the console instantly shows what is inside, but my PC didn't show it. How do I set that up? I'm just new to learning this. Thank you.
@dataslice 3 years ago
I'm actually hitting Command + Enter (Control + Enter for PC) which is a shortcut to run the highlighted code -- apologies for the confusion!
@ilhanilkeralbulut6620 2 years ago
Hello, does anyone have an idea how to scrape a webpage's clickable parts? For example, there is a table on the page with a button to collapse and expand it. When I scrape, I can only get the collapsed part and I need the expanded part. Thanks in advance.
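
rvest alone only sees the HTML as it is served, so content revealed by a click usually needs a browser-automation tool such as RSelenium. A rough sketch, where the URL and the button selector are hypothetical placeholders:

    library(RSelenium)
    library(rvest)

    rD = rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
    remDr = rD$client
    remDr$navigate("https://example.com/page-with-collapsed-table")    # placeholder URL

    # click the expand button, then hand the rendered HTML back to rvest
    btn = remDr$findElement(using = "css selector", ".expand-button")  # placeholder selector
    btn$clickElement()
    Sys.sleep(2)   # give the page a moment to render the expanded rows

    page = read_html(remDr$getPageSource()[[1]])
    table_data = page %>% html_node("table") %>% html_table()

    remDr$close()
    rD$server$stop()
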
@adammeziti9480 2 years ago
Awesome
@ss_051 10 months ago
Thank you very very very much 👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏
@mkhani023 4 years ago
Hello, great tutorial. I am using it to web scrape job listings for a project. I keep getting "character(0)" when I try to get the job description for just one listing. Is there any way around this? Thanks!
@kavitakamatdivekar5152 3 years ago
I was also getting the same for the movie cast. I removed one tag from the selector, i.e. selected the correct CSS parameter for html_nodes, and it worked... e.g. for the cast, ".primary_photo+td" worked for me; in the video ".primary_photo +td a" was used.
@fezzix8223 2 years ago
It sucks that IMDb updated its page. I am having issues with the crew section in the data frame because they changed how the cast elements are laid out on movie pages. To get to the cast as displayed in your video (where SelectorGadget works), you now need to click a separate link. I am very inexperienced, so connecting that extra click to the structure of the program isn't something I'm capable of doing.
@howardly7687 2 years ago
A minor adjustment can be made to account for the IMDb layout change. All you have to do is replace a segment of the film page's URL with a segment from the full cast page's URL. Doing so will give you the link to that film's full cast, where the CSS selector for the cast names remains the same as in the tutorial. To do this, add a single line of code to the pipeline for movie_links:

    movie_links = page %>%
      html_nodes(".lister-item-header a") %>%
      html_attr("href") %>%
      str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits?ref_=tt_cl_sm")) %>%
      paste0("https://www.imdb.com", .)

Make sure you load the stringr library or you won't be able to use str_replace().
@wooorrrrdddlol 4 years ago
Great video! How would you go about getting more than one variable besides the cast? I tried to do this and kept getting an error saying that the read_html argument is missing with no default. Also, how would you separate the cast column into separate columns?
@dataslice 4 years ago
Which variable were you trying to grab besides cast? And for separating the cast, the tidyr 'separate' function should allow you to split a column into multiple columns by a character - that might work
@wooorrrrdddlol 4 years ago
@@dataslice I was trying to get the director for each movie along with the cast variable - are you able to do that?
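
For reference, a sketch of splitting the comma-joined cast column with separate(), which lives in the tidyr package; the number of columns and their names are arbitrary choices here:

    library(tidyr)

    # split the "cast" column (actors pasted together with commas) into separate columns
    movies_split = movies %>%
      separate(cast, into = paste0("cast_", 1:10), sep = ",",
               extra = "drop", fill = "right")
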
@zeeshanhamid4413 3 years ago
Thanks, this really helps with the project I'm attempting. However, I have one issue that I cannot find the solution for. I'm trying to grab information from within a page, but the function to retrieve the information doesn't work. Instead I get an error stating that read_html("link") isn't working because the link argument is missing with no default, although it's defined identically to how you defined movie_links. And when I just view the links variable by itself it looks perfect; the list is how it should be, with the address being correct for each example player (I'm scraping a football website) in the list. I think everything else is working as intended, I just can't finish the data frame because this can't be grabbed for whatever reason. Any help would be much appreciated. Thanks.
@dataslice 3 years ago
Are you doing read_html(link) or read_html(“link”)? The first is passing in a variable named link and the second is the string “link”
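
In other words (a quick illustration, with a made-up URL):

    link = "https://www.imdb.com/title/tt0111161/fullcredits"   # made-up example URL

    read_html(link)     # passes the URL stored in the variable -- this is what you want
    read_html("link")   # tries to read the literal string "link" and fails
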
@maheshgurumoorthi4391 3 years ago
I followed the same steps, but I'm getting the error "Error in open.connection(x, "rb") : HTTP error 400." I think it is because of web security...
@leozborowski 4 years ago
Absolutely great! Clear and to the point, the best tutorial I have found so far. I have a question: I have tried this method on a different website, but when I collect the data into a data frame I get the following error: "Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0". This is probably because some pages returned no data. Could you help please? Many thanks for the great videos!
@dataslice 4 years ago
Were you able to figure it out? It could be because the html_nodes() argument is incorrect / doesn't exist on the page. What does your code look like?
@lingzhao242 4 years ago
Did you figure it out? I have the same problem in my code as well, but I noticed that it works sometimes and sometimes it doesn't. It's really weird.
@lifestoriesfromearth6271 3 years ago
@@dataslice How can we avoid it? Is it a good approach to use try/except (putting "NULL" for empty info) inside the function?
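
The R equivalent is tryCatch(). A sketch of guarding the per-page helper so that empty or failed pages yield NA instead of breaking data.frame(); the cast selector is the one assumed from the tutorial:

    get_cast_safe = function(movie_link) {
      tryCatch({
        movie_cast = read_html(movie_link) %>%
          html_nodes(".primary_photo+td a") %>%
          html_text() %>%
          paste(collapse = ",")
        # an empty selection pastes to "", so turn that into NA as well
        if (movie_cast == "") NA_character_ else movie_cast
      }, error = function(e) NA_character_)   # failed requests also become NA
    }

    cast = sapply(movie_links, get_cast_safe, USE.NAMES = FALSE)
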
@nancyachiengodhiambo9727 4 years ago
Nice presentation, waiting for web scraping of multiple pages
@dataslice 4 years ago
Here's a link to part 3! kzbin.info/www/bejne/aGnTqnh6i56gg9k
@MMansouri 4 years ago
Any idea how this would work with continuous/infinite scrolling?
@dataslice 4 years ago
@@MMansouri I will definitely make a video at some point down the line on how to do this -- but I believe you need to use a combination of rvest and the RSelenium package, which allows you to emulate a web browser and make it scroll down
@MMansouri 4 years ago
dataslice Thanks for the reply! I thought so too. Looking forward to it... Thanks for the great videos.
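
For anyone curious in the meantime, a rough sketch of the rvest + RSelenium combination for infinite scroll; the URL, the number of scrolls, and the item selector are placeholders:

    library(RSelenium)
    library(rvest)

    rD = rsDriver(browser = "firefox", port = 4546L, verbose = FALSE)
    remDr = rD$client
    remDr$navigate("https://example.com/infinite-scroll-page")   # placeholder URL

    # scroll to the bottom a few times so more content loads
    for (i in 1:5) {
      remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
      Sys.sleep(2)   # wait for the new items to load
    }

    # hand the fully rendered HTML to rvest as usual
    page = read_html(remDr$getPageSource()[[1]])
    items = page %>% html_nodes(".item-title") %>% html_text()   # placeholder selector

    remDr$close()
    rD$server$stop()
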
@timfei7221 3 years ago
Thank you so much! You explained a complicated question well in a short video. My computer spent more than a minute running cast = sapply(...) and I was confused about why it took so long. Would you mind explaining why this code takes so long to run?
@dataslice 3 years ago
Good question. sapply() acts as a for loop in this case; it takes in movie_links (which is a vector of movie link URLs), iterates through each URL and calls the get_cast function on it (which scrapes the actors on that link), and finally returns a vector with all the results from each URL. The reason it takes so long is that it's scraping n pages, where n is the number of URLs in movie_links.
@timfei7221 3 years ago
@@dataslice Thank you! I understand the logic now. However, a few more questions came to mind. Does internet speed affect how fast it runs? Are there any approaches that reduce the time taken when scraping a larger amount of data (e.g. scraping 10,000 movies)? Sorry for asking so much :) My major is statistics and actuarial studies, so I learned R for statistical purposes only at uni, and I have ground-level knowledge of data science.
@dataslice 3 years ago
​@@timfei7221 Yes, internet speed will definitely affect how quickly the code runs since the web requests are being made from your computer. I can't think of a way of reducing the overall time, but it might be better to scrape multiple batches of URLs (e.g. 100 at a time) instead of just one giant list. It wouldn't make it quicker but you could append the results to a data frame and even save it to a .csv so you could at least see the results incrementally.
@timfei7221 3 years ago
@@dataslice Thank you so much!
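
A sketch of the batching idea described above, assuming the movie_links vector and get_cast() helper from the tutorial; the batch size and file name are arbitrary:

    batch_size = 100
    batches = split(movie_links, ceiling(seq_along(movie_links) / batch_size))

    results = data.frame()
    for (b in seq_along(batches)) {
      cast = sapply(batches[[b]], get_cast, USE.NAMES = FALSE)
      results = rbind(results, data.frame(link = batches[[b]], cast = cast,
                                          stringsAsFactors = FALSE))
      # save progress after each batch so partial results survive a crash
      write.csv(results, "cast_progress.csv", row.names = FALSE)
    }
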
@thiagorocha2696 3 years ago
Hi, awesome video. You made it very clear and simple. Can you make a video demonstrating how to scrape websites that use JavaScript to render content? The approach will be a little different, I think. Congrats.
@kavitakamatdivekar5152 3 years ago
For some elements like rating and runtime, the number of rows doesn't match. What can be done if there are missing values for the CSS selectors?
@kavitakamatdivekar5152 3 years ago
And what if some extra values come through?
@mrmr2973 3 years ago
great explanation !!
@lingzhao242 4 years ago
Fantastic! Really, really useful. By the way, I got the error "arguments imply differing number of rows: 24, 23" when scraping my page. Can you give any advice on how to fix that?
@dataslice 4 years ago
I just responded to your other comment!
@jtcr1 3 years ago
Great tutorial, thanks a lot!
@marcinterlecki3024 2 years ago
Awesome!
@abdallahel-kafrawy4114 4 years ago
Great job !!! Thanks for your effort and time
@dataslice 4 years ago
No problem, thanks for watching!
@thainapinheiro1019 3 years ago
Incredible! You literally ELI5'd this.
@djangoworldwide7925 3 years ago
Fantastic
@neillubbe79 3 years ago
You are a God
@yesdcotchin 3 years ago
What if the html_node has no href or URL? I'm following along using a Goodreads list. The list and book URLs take the following forms, though this may be irrelevant:
Book page: "www.goodreads.com/book/show/..."
List page: "www.goodreads.com/list/show/..."
From the console:

    > page %>%
    +   html_nodes(".bookTitle span")
    {xml_nodeset (100)}
    [1] Don't Close Your Eyes
    [2] To Kill a Mockingbird
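
A guess at what is happening here (not covered in the video): ".bookTitle span" selects the span that holds only the title text, while the href normally sits on the enclosing a.bookTitle tag, so selecting the anchor itself should expose it. A sketch, assuming that Goodreads markup:

    book_links = page %>%
      html_nodes("a.bookTitle") %>%              # the <a> tag, not the inner span
      html_attr("href") %>%
      paste0("https://www.goodreads.com", .)     # Goodreads hrefs are relative paths
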
@alexandervera8482 2 years ago
love you
@temurgugushvili9368 3 years ago
Thanks for sharing the tutorial, really useful. I tried to use the same logic you showed to build the page link "www.hr.gov.ge/JobProvider/UserOrgVaks/Details/62799": html % html_attr("href") %>% paste("www.hr.gov.ge", ., sep="") but somehow it does not work. Any suggestions? Once again, thank you in advance.