I love it! There's at least one more audience-suggested video in the queue 🤓
@eric13hill 2 years ago
I really enjoyed this type of content, Pat. Seeing different ways to do the same thing opens my mind to other options.
@Riffomonas 2 years ago
Thanks! These have been fun to make
@wilsonsouza3582 a year ago
This video is really important. Thanks for making it! But I would like to see a comparison with the data.table library too.
@Riffomonas a year ago
Thanks for the suggestion - I don't usually use data.table, but I'll keep this in mind if I need to use it at any point 🤓
@johneagle4384 2 years ago
Ah... there is something new to learn. Thank you for showing me something I was totally oblivious to.
@Riffomonas 2 years ago
Thanks for watching John 🤓
@bassamsaleh8034 2 years ago
Amazing video! I was actually surprised by the benchmarks. I thought the base pipe would be faster than magrittr, but that's not true (sorry, magrittr!), so I'll keep using the magrittr pipe. I learned R recently, and the reason I love it is the tidyverse. If someone forced me to use base R, I'd switch to Python immediately hahaha.
@Riffomonas 2 years ago
Thanks for watching! I think if we did a different set of operations with different data it's possible we'd get different results. Context matters a lot with benchmarks.
@k1llyah 2 years ago
The benchmarks in the video are a bit of a red herring: for this operation the bottleneck is probably the cor.test function, which does far more than what is needed. It would make more sense to optimize that part, as pipes are almost never the bottleneck of a program. That said, the base R pipe is not a function; it is resolved by the parser and simply does less than magrittr, so it should always be faster in a fair comparison. You can try, for example, `1 |> f(y = _)` and `1 %>% f(y = .)`. Note that on the micro and nano time scale it can be surprising that the first expression performs slower simply because it is the first expression in the benchmark; try switching them around a few times. For fair benchmarks, make sure that 1) the operations are minimal, 2) the number of iterations is similar and sufficient, and 3) you run them in a fresh R session.
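The fair-comparison advice above can be sketched with the bench package. This is a minimal sketch, not the video's actual benchmark: it assumes the bench and magrittr packages are installed, and `f` is a hypothetical stand-in function so the pipe itself dominates the timing.

```r
# Compare the parser-level base pipe (|>) to the magrittr function pipe (%>%)
# on a trivial call. `f` is a made-up function used only for illustration.
library(bench)
library(magrittr)

f <- function(x, y) x + y

bench::mark(
  base_pipe     = 1 |> f(y = 2),    # rewritten by the parser to f(1, y = 2)
  magrittr_pipe = 1 %>% f(y = 2),   # dispatched through the magrittr function
  iterations = 10000,
  check = TRUE                      # both must return the same value (3)
)
```

Because `|>` is pure syntax rewritten at parse time while `%>%` is a function call, any measured difference is in the nanosecond range and, as the comment notes, sensitive to evaluation order and session state.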
@Riffomonas 2 years ago
Thanks. The context of the benchmarking was in reference to the pipes. I agree that the pipes aren't the problem. Even if they're slower, they significantly improve readability.
@SammanMahmoud 2 years ago
Can you please make a video about big data and parallel computing? Thank you for your videos, Dr. Pat.
@Riffomonas 2 years ago
Thanks for watching! Check out my previous episodes using the furrr package
@julianrozenberg2036 2 years ago
Hi Pat! Thank you for making your sessions so entertaining and useful. Well, I was trying to build a data frame of 66,000 rows and 2,500 columns from text files and encountered two problems: using base R, memory overflows kept shutting down my R session (I have only 6 GB available), and it was relatively slow. Eventually I managed to do it with the tidyverse map function. This is a typical task, and it would be great if you were interested in making a Code Club session about it. Thanks!
@musicspinner 2 years ago
Hi Julian, I think you could really benefit from using the "arrow" library here. Some videos to introduce you to `library(arrow)`: Danielle Navarro: kzbin.info/www/bejne/hWWVfYijf7-DrpI Neal Richardson: kzbin.info/www/bejne/sH-nXoqgZ72DrMU Helped me manage a similar (larger-than-memory data) problem. Good luck. 👌🏽
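The arrow workflow suggested above can be sketched roughly as follows. This is a hedged sketch, not Julian's actual pipeline: the directory path `"text_files/"` and the column names `sample_id` and `value` are hypothetical, and it assumes the input files are delimited text that arrow can read as CSV.

```r
# arrow scans files lazily, so only the rows/columns you collect()
# are ever loaded into memory -- useful for larger-than-memory data.
library(arrow)
library(dplyr)

# Hypothetical directory of delimited text files
ds <- open_dataset("text_files/", format = "csv")

result <- ds |>
  select(sample_id, value) |>   # hypothetical column names
  filter(!is.na(value)) |>      # predicates are pushed down to the scan
  collect()                     # materialize only the filtered subset
```

The key design point is that the dplyr verbs build a query plan against the on-disk dataset, and nothing is read into RAM until `collect()`.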
@Riffomonas 2 years ago
Hi Julian - thanks for watching! R really struggles with wide data frames. I'd encourage you to check out the fread function from the data.table package. I did a video with it about 200 episodes ago ;) kzbin.info/www/bejne/ZnWlhGyejJ5kiqM
@chenxiao315 2 years ago
I think the fact that dplyr::filter is slower than base::subset in your example is because of fixed overhead. If the dataset were much larger, filter should be faster than subset.
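The fixed-overhead hypothesis above is easy to test by benchmarking at two data sizes. A minimal sketch, assuming the bench and dplyr packages are installed (the data frame and sizes are made up for illustration):

```r
# Compare base::subset and dplyr::filter on small vs. large data to see
# whether dplyr's per-call overhead is amortized as the data grows.
library(bench)
library(dplyr)

make_df <- function(n) data.frame(x = runif(n), g = sample(letters, n, TRUE))

small <- make_df(1e3)
large <- make_df(1e7)

# check = FALSE because filter() drops row names while subset() keeps them
bench::mark(
  subset_small = subset(small, x > 0.5),
  filter_small = filter(small, x > 0.5),
  check = FALSE
)

bench::mark(
  subset_large = subset(large, x > 0.5),
  filter_large = filter(large, x > 0.5),
  check = FALSE
)
```

If the hypothesis holds, the gap should narrow (or reverse) on the large data frame, where per-row work dominates the fixed setup cost.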
@Riffomonas 2 years ago
Thanks for watching! I think context is very important, and you could get different results with different data or functions. Regardless, for most of us it's fast no matter what!
@bassamsaleh8034 2 years ago
I'm about to learn the arrow package, but it scares me a bit. I'm wondering if you have used it before. I noticed that many people mentioned it in the comments.
@Riffomonas 2 years ago
I haven't done anything with arrow yet, but it looks pretty straightforward for super long data frames. You might also consider checking out the vroom package and the fread function from data.table.
@s.m.habiburrahaman2443 2 years ago
Stay in tune with what you want to learn; just because it's hard now doesn't mean it's impossible. It's all about mental mindset.
@rayflyers 2 years ago
Your most popular episode, you say? I await my royalties check.
@Riffomonas 2 years ago
Hah! Thanks for always watching 🤓
@xballspitzer3927 2 years ago
MUCH!!!
@matthewson8917 20 days ago
It was surprising that the base pipe was generally slower than the magrittr pipe.
@Riffomonas 20 days ago
Thanks for watching! More experimenting with both suggests that it really depends on the context. Any difference is really minimal.
@VenSensei 2 years ago
If you want speed and efficiency, you can use the collapse and data.table packages.
@Riffomonas 2 years ago
Yep. I’ve used data.table in an earlier episode and I use it a lot with very wide data frames. This episode was really about the pipes. When I want efficiency I use C++ 😂
@VenSensei 2 years ago
@@Riffomonas It's funny you mention that, because that's how I found your channel; "Writing C++ code in R...".
@haraldurkarlsson1147 2 years ago
Pat, This is interesting, but there is a trade-off in memory allocation. The dplyr/magrittr combo seems to use the least in your case. This might become an issue when dealing with large datasets, e.g. maps such as rasters. Any thoughts?
@Riffomonas 2 years ago
Hmmm, I forgot to look at the memory performance 😂 I think I would worry about memory only if it were limiting, and then I'd try different options. I'm usually more time-constrained than memory-constrained.
@haraldurkarlsson1147 2 years ago
Pat, I agree; I think this is only an issue with truly large datasets, and even with these huge weather datasets one might barely see a difference. I agree that readability of code is much more important than shaving off a few seconds.
@broderickeleazar 2 years ago
send you the link of it
@musicspinner 2 years ago
Have you used `library(arrow)` yet? Been working with increasingly large datasets... laptop went on strike. Started using arrow. Laptop started cooperating with me again.
@Riffomonas 2 years ago
Hey, if it works, that's all that matters, right? :) You might also consider checking out the vroom and data.table packages. The latter is great for working with really wide data frames.
@JOHNSMITH-ve3rq a year ago
Now do data.table! 😂😂😂
@Riffomonas a year ago
HAH! We'll see 🤓
@haraldurkarlsson1147 2 years ago
Hi Pat, Could you do a presentation on the echarts4r package? I have tested it a little bit, and it seems like a marriage made in heaven between ggplot2 and plotly. It is the best and easiest-to-use interactive package that I have encountered in my brief life with R. Another neat new package is Quarto, the successor to R Markdown. H
@Riffomonas 2 years ago
Thanks. I'll likely do something with Quarto. Unfortunately, most people aren't very interested in interactive plots.
@haraldurkarlsson1147 2 years ago
@@Riffomonas Pat, You can count me among those who think that, in general, dynamic/interactive plots are more fluff than substance. However, the echarts4r package is quite good at generating time series plots such as those you have generated for your climate change series. H
@danielvaulot5105 2 years ago
@@Riffomonas Actually, this is not completely true. I think interactive plots are quite useful for microbiological data, where you can zoom in on certain parts of the graph, for example to get into the detail of a given group, or for climatological data, where you can zoom in on a period. From a brief look, that is what echarts4r seems to offer. PS: This pipe comparison is very interesting and, I guess, could be critical when you develop Shiny applications, for example, to prevent users from becoming impatient...