Make Your Pandas Code Lightning Fast

Views: 186,630

Days ago

Comments: 329

@hasijasanskar 2 years ago

Whoa.. 3500 times difference. Vectorised is even faster than apply, will give it a try next time for sure. Awesome video as always.

@robmulla 2 years ago

Thanks Sanskar. Yes, using vectorized functions is always much faster. In some cases it's not possible but then there are other ways to speed it up. I might show that in another video if this one is popular.

@amazingdude9042 9 months ago

@robmulla can you make a video on how to make pandas resample faster?

@miaandgingerthememebunnyme3397 2 years ago

That’s my husband! He’s so cool.

@robmulla 2 years ago

Love you boo. 😘

@FilippoGronchi 2 years ago

Fully agree!

@sauloncall 2 years ago

Aww! This is wholesome!

@rahulchoudhary1024 2 years ago

I've been watching your videos non-stop for the past week! And I enjoy the comments from your SO!!! Lovely!

@mohammedgt8102 1 year ago

He is awesome. Taking time out of his day to share knowledge 👏

@kip1272 1 year ago

also, a way to speed it up is to not use & and | for 'and' and 'or' but just use the words 'and' and 'or'. these words are made for boolean expressions and thus work faster. & and | are bitwise operators and are made for integers. using these will force python to make the booleans an integer and then do the bitwise operation and then cast it back to a boolean. this doesn't take that much time if u do it once but in a test scenario inspired by this video it was roughly 45% slower.

@robmulla 1 year ago

Nice tip! I didn’t know that.

@kazmkazm9676 1 year ago

I made the experiment. It is ready to run. What you have suggested is coded below. It is approximately 20 percent faster. import timeit setup = 'import random; random_list = random.sample(range(1,101),100)' # with or first_code = '''\ result_1 = [rand for rand in random_list if (rand >75) or (rand 75) | (rand
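For anyone who wants to rerun this kind of comparison, here is a minimal sketch. The thresholds (`> 75`, `< 25`) and the repeat count are assumptions, since the snippet above is cut off. It only covers plain Python scalars: on pandas Series or NumPy arrays `&` and `|` are still required, because `and`/`or` raise a `ValueError` there.

```python
import random
import timeit

random.seed(0)
random_list = random.sample(range(1, 101), 100)


def with_keywords(values):
    # `or` short-circuits: the second test is skipped when the first is True.
    return [v for v in values if (v > 75) or (v < 25)]


def with_bitwise(values):
    # `|` always evaluates both comparisons, then combines the two bools.
    return [v for v in values if (v > 75) | (v < 25)]


# Both spellings select the same elements; only the speed may differ.
same_result = with_keywords(random_list) == with_bitwise(random_list)

t_keywords = timeit.timeit(lambda: with_keywords(random_list), number=2_000)
t_bitwise = timeit.timeit(lambda: with_bitwise(random_list), number=2_000)
print(f"or: {t_keywords:.4f}s   |: {t_bitwise:.4f}s   same result: {same_result}")
```

The exact ratio will depend on the machine and Python version, so treat any percentage as a rough indication only.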

@kip1272 1 year ago

@kazmkazm9676 the difference was even bigger between & and 'and', if I remember correctly.

@A372575 1 year ago

Great, never realized that. Will start using 'and' and 'or' now onwards.

@nathanielbonini8951 1 year ago

This is spot on. I had a filter running that was going to take 2 days to complete on a 12M line CSV file using iteration - clearly not good. Now it takes 6 seconds.

@Zenoandturtle 1 year ago

That is unbelievable. Astounding time difference. I was recently watching a presentation on a candlestick algorithm, and the presenter used a vectorised method and I was confused (I am new to Python), but this video made it all clear. Fantastic presentation.

@robmulla 1 year ago

Glad you found it interesting. Thanks for watching!

@jti107 2 years ago

I didn’t realize you could write 10k as 10_000. I work with astronomical units so makes variables more readable. Great video!

@robmulla 2 years ago

Thanks! Yes, they introduced that functionality with underscores in numbers with python 3.6 - it really helps make numbers more readable.
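A quick illustration of the underscore syntax, as a small sketch. The variable names are made up for the example.

```python
# Underscores in numeric literals are ignored by the interpreter (Python 3.6+).
n_rows = 10_000              # same value as 10000
au_in_km = 149_597_870.7     # works for floats too

# The same separator is available when formatting and parsing numbers.
print(n_rows, f"{n_rows:_}", f"{n_rows:,}")
parsed = int("10_000")
```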

@kailashlate6348 9 months ago

😊😊😊

@kailashlate6348 9 months ago

😊

@deepakramani05 2 years ago

As I work with Pandas and large datasets, I often come across code that uses iterrows. Most developers just don't care about time or come from programming backgrounds that keep them from using efficient methods. I wish more people used vectorization.

@robmulla 2 years ago

Thanks. That’s exactly why I wanted to make this video. Hopefully people will find it helpful.

@pr0skis 2 years ago

Some of the biggest bottlenecks are from IO... especially when trying to read then concat multiple large Excel files. Shaving a few seconds in the algos just isnt gonna make much of a difference

@allenklingsporn6993 2 years ago

@pr0skis Hard to say that definitively, though, right? You have no idea how anyone is using pandas. If they have slow algos running iteratively, it can very easily become much slower than I/O functions. I've seen some pretty wild pandas use in my business, and a lot of it is really terrible at runtime, especially anything that is wrapped in a GUI (sometimes even with multiprocessing...).

@nitinkumar29 1 year ago

@pr0skis you can convert the Excel files to CSV and use those instead, because CSV I/O is faster.

@jaimeduncan6167 1 year ago

It's the same with a relational database, we call them the cursor kids. They loop and loop and loop when they can use a set operation to go hundreds of times faster and often with less code.

@i-Mik 1 year ago

Thanks for the great video! I have a project with some calculations. They take some minutes through the loops. I'm going to use the vectorized way, so I'll write another comment with a comparison later. Some days later... I rewrote a significant part of my code. Made it vectorized, and I got fantastic results. One example: old code - 1m 3s, new code - 6s. One more: old code - 14m 58s, new code - 11s. Awesome!

@robmulla 1 year ago

So awesome! It's really satisfying when you are able to improve the speed of code by orders of magnitude.

@nirbhay_raghav 1 year ago

My man made a df out of the time diff to plot them!! Really useful video. Will definitely keep this in mind from now.

@robmulla 1 year ago

Haha. Thanks Nirbhay!

@balajikrishnamoorthy5464 1 year ago

I am a beginner, and I admire your sound knowledge of Pandas.

@robmulla 1 year ago

Thanks for watching. Hope you learned some helpful stuff.

@LimitlesslyUnlimited 2 years ago

Haha coincidentally I'd been raving about vectorized to my friends the last few months. It's soo good. The moment I saw your title I figured you're probably talking about vectorize too haha. Awesome video and great content!!

@robmulla 2 years ago

You called it! Thanks for the positive feedback. Hope to create more videos like it soon.

@robertnolte519 2 years ago

Same! Still hasn't worked on picking up chicks at the bar, but I'm not giving up.

@colmduffy2272 2 years ago

There are several videos on pandas vectorization. This is the best.

@robmulla 2 years ago

I appreciate you saying that! Thanks for watching.

@OPPACHblu_channel 1 year ago

Somehow I came across the vectorized method right at the beginning of my Python and pandas journey. Thanks for sharing your experience, lightning fast.

@robmulla 1 year ago

It’s a great thing to learn early!

@craftydoeseverything9718 11 months ago

Hey, I just thought I'd mention, I really appreciate that you use really huge test datasets, since a lot of the time, test datasets used in tutorials are quite small and don't show how code will scale. This video does it perfectly, though!

@robertjordan114 2 years ago

Man where have you been all my Python-Life!?!? Thank you so much for this! Outstanding!!!

@robmulla 2 years ago

Thanks Robert for watching. Glad you found it helpful!

@robertjordan114 2 years ago

The problem I'm dealing with is that I am looping through some poorly designed tables and building a sql statement to be applied and then appending the output to a list. Not sure if a vectorized approach will work since I have that sql call, but the apply might save me from needing to recreate the df prior to appending every time.

@robmulla 2 years ago

@robertjordan114 Interesting. Not sure what your data is like, but a lot of the time it can be better to write a nice SQL statement that puts the data in the correct format first. That way you put the processing demands on the SQL server and it can usually optimize really well.

@robertjordan114 2 years ago

Oh you have no idea, my source table has one column with the name of the column in my lookup table and another with the value that I need to filter on in that lookup table. The loop creates the where clause based on the number of related rows in the initial dataset, and then I'm executing that sql statement to return the values to a python data frame which I then convert to a pandas data frame and append. Like I said, amateur hour! 🤣

@OktatOnline 1 year ago

I'm over here as a newbie data scientist, copying the logic step-by-step in order to have good coding habits in the future lmao. Thanks for the video, really valuable!

@robmulla 1 year ago

Glad you found it helpful!

@gabriel-mckee 1 year ago

Great video! I wish I had known not to loop over my array for my machine learning project... going to go improve my code now!

@robmulla 1 year ago

Glad you learned something new!

@FilippoGronchi 2 years ago

That's another awesome video....extremely useful in the real world work. Thanks again Rob

@robmulla 2 years ago

Thanks for watching Filippo!

@artemqqq7153 1 year ago

Dude, that row[column] thing was a shock to me, thanks!

@robmulla 1 year ago

Glad you learned something!

@blogmaster7920 1 year ago

This can be really helpful, when moving data from one source to another through Internet.

@robmulla 1 year ago

Absolutely, compressing can make any data transfer faster.

@hussamcheema 2 years ago

Wow amazing. Please keep making more videos like this.

@robmulla 2 years ago

Thanks for the feedback. I’ll try my best.

@alexandremachado1014 2 years ago

Hey man, nice video! Kudos from reddit!

@robmulla 2 years ago

Glad you enjoyed it. So cool that the reddit community liked this video so much. Hopefully my next one will be as popular.

@thebreath6159 1 year ago

Ok this channel is great for data science, I’ll follow

@robmulla 1 year ago

Thanks for subbing!

@anoopbhagat13 2 years ago

Wow! That's an excellent way to speed up the code.

@robmulla 2 years ago

Thanks Anoop. Hope your future pandas code is a bit faster because of this video :D

@LaHoraMaker 1 year ago

I loved that you used Madrid Python user group for the pandas logo :)

@robmulla 1 year ago

I did?! I didn't even realize. What's the timestamp where I show that logo?

@ajaybalakrishnan5208 1 year ago

Awesome. Thanks Rob for introducing this concept to me.

@robmulla 1 year ago

Happy it helped!

@sphericalintegration 1 year ago

Thank you for this, Rob. This video made me subscribe because in 10 minutes you solved one of my biggest problems. And your Boo is right - you are pretty cool. Thanks again, sir.

@robmulla 1 year ago

That's awesome that I was able to help you out. Check my other videos where I go over similar tips! Glad you agree with my Boo

@prodmanaiml9317 2 years ago

More video tips for pandas would be excellent!

@robmulla 2 years ago

Great suggestion. I'll try to keep the pandas videos coming.

@Vonbucko 1 year ago

Awesome video man! Appreciate the tips, I'll definitely be subscribing!

@robmulla 1 year ago

I appreciate that a ton. Share with a friend too!

@pietraderdetective8953 2 years ago

I have always struggled to understand how vectorization works. This video of yours is the one that made it crystal clear for me. What a great video! Can you please do more of these efficient pandas videos and use some stock market data? Thanks!

@robmulla 2 years ago

Thanks for the feedback. I’m so happy you found this useful. I’ll try my best to do a future video related to stock market data.

@DataScienceconMilton 6 months ago

Thank you very much for this video Rob. It is very helpful for beginners like me. Have a great day.

@kingj5983 6 months ago

Wow, awesome video, thanks! Although it takes time to figure out how to turn my limit conditions into logical calculation and return a bool dataframe

@sweealamak628 1 year ago

I'm kicking myself now for not finding your video 10 months ago. I'm near the completion of my code and resorted to a mix of iterating For loops and small scale vectorisation by declaring new columns after applying some logic. I seriously need to adopt your methods and redo my code because mine is just not fast enough!

@robmulla 1 year ago

I totally feel you. It took me years before I really understood how important it is to avoid iterating over rows. Once you learn it, all your pandas code will be much faster though.

@sweealamak628 1 year ago

@robmulla I just altered one of my For loops and used your Vectorized approach! Not only is it faster, I did it in just 3 lines of code and the syntax is much easier to read! I feel so embarrassed for myself cos it's much more straightforward than I thought! Now the tricky thing is, I work on a time series dataset where I compare previous rows of data to the current row to get the "result". I assume I can use the "shift" method to look back at a previous row of data. If it works, I'm gonna Vectorize everything! THANKS SO MUCH!

@RichieStockholm 2 years ago

I expect a video about moped gangs in the future, Rob.

@robmulla 2 years ago

That’s a great idea Richie! I practically majored in moped gangs in college. 😂

@FabioRBelotto 1 year ago

Great video. I am working on a df with millions of rows and pandas apply was struggling. I solved it using a vectorized solution as shown here. Much much better. Can you think of a situation where vectorization would not be possible?

@robmulla 1 year ago

Glad this helped! As far as examples where vectorization is not possible: For example, if you need to perform an operation that requires branching, such as selecting different values based on some condition, vectorization may not be possible. In this case, you would need to use a loop or some other non-vectorized approach to perform the operation. Another example where vectorization may not be possible is when working with datasets that have varying lengths or shapes. In this case, it may not be possible to perform operations on the entire dataset using vectorized methods. Hope that helps.

@johnidouglasmarangon 1 year ago

Great video Bob, thanks. I'm curious, which interface for Jupyter Notebook are you using?

@robmulla 1 year ago

Glad you liked it. This is jupyterlab with the solarized dark theme. Check out my full video on jupyter where I go into detail about it.

@johnidouglasmarangon 1 year ago

@robmulla Tks Bob ✌️

@robmulla 1 year ago

@johnidouglasmarangon no problem. Jane!

@GregZoppos 1 year ago

Wow, thanks! I'm a beginner in data science, this is really interesting to me.

@robmulla 1 year ago

Great to hear! Good luck in your data science journey.

@Graham_Wideman 1 year ago

1:19 "a random integer between one and 100." I believe that should be from 0 to 99 (ie: inclusive at both ends). In case nobody else mentioned it.

@robmulla 1 year ago

Good catch! I think you are the first to point that out.
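A small sketch of why the range would come out as 0 to 99. It assumes the column was generated with NumPy's `randint` and a `(0, 100)` range; the exact call from the video is not quoted in this thread, so that part is a guess.

```python
import numpy as np

# randint's upper bound is exclusive, so this can return 0..99 but never 100.
zero_to_99 = np.random.randint(0, 100, size=10_000)
print(zero_to_99.min(), zero_to_99.max())

# To really get integers from 1 to 100 inclusive, shift both bounds up by one.
one_to_100 = np.random.randint(1, 101, size=10_000)
print(one_to_100.min(), one_to_100.max())
```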

@alysmtech3683 1 year ago

Jesus, I'm over here blowing up my laptop. Had no idea, thank you!

@robmulla 1 year ago

Hah. My name is Rob. But glad you learned something new.

@bgotura 1 year ago

I love how that Pandas logo has cannibalized the city of Madrid (Spain) logo

@BILALAHMAD-cz9gu 1 year ago

This man is amazing, but my English is poor... I will definitely learn English because of this man.

@robmulla 1 year ago

Thanks. So glad it helped even though it’s not your native tongue!

@moodiiie 1 year ago

That’s all I do at work, vectorize is the way to go. I was able to do some complex logic with them.

@robmulla 1 year ago

Love it.

@Sinke_100 1 year ago

Cool, for really large dataset and when conditions aren't too complicated that vectorized method is amazing, apply is nice alternative cause you can write function, there should be a module that converts normal functions in this vectorized syntax cause it's quite complicated to write

@robmulla 1 year ago

Glad it was helpful! There are some packages that compile functions (such as numba with its jit decorator), and there is also np.vectorize.

@Sinke_100 1 year ago

@robmulla I tried playing with it a bit. Pandas is similar to numpy, and I've worked with numpy quite a bit. I wrote a bool_calculation function with 3 distinct dfs for the age condition, pct_sleeping and time in bed, and its return value was the final condition. df.loc supports putting a function directly in its statement, so I did that. Finally I compared the dfs created with both methods, and they are the same. My suggestion is that you should explain this more complex stuff in more depth.

@incremental_failure 1 year ago

Vectorization is the whole point of Pandas. But there are cases where vectorization is impossible and you need to process row-by-row, in that case it's best to switch to numba for a precompiled function.

@djangoworldwide7925 10 months ago

As an R user we use vectorization using mutate without even thinking about the other methods for such task. R is so much more suitable for data science and wrangling

@ledestonilo7274 1 year ago

Interesting. Thank you will try it.

@robmulla 1 year ago

Awesome! Let me know how it goes.

@andrewcoyne9768 1 year ago

Love the video, thanks Rob! Is there a vectorized way to create a column that is the sum of several columns? I tried df['total'] = df.iloc[:, 5:13].sum(), which was way faster but returned all NaN values. Any help would be appreciated.

@robmulla 1 year ago

So close! I think all you need to do is change `.sum()` to `.sum(axis=1)` and it should work!
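A minimal sketch of why the original attempt returned NaN and what `axis=1` changes. The three-column frame here is a stand-in for the `iloc[:, 5:13]` slice in the question.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# Without axis=1, sum() adds down each column and returns a Series indexed by
# column name ("a", "b", "c"). Assigning that to a new column aligns on the
# row index (0, 1, 2), nothing matches, and every row ends up NaN.
col_totals = df.iloc[:, 0:3].sum()

# axis=1 adds across each row, giving one value per row with the same index.
df["total"] = df.iloc[:, 0:3].sum(axis=1)
print(df)
```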

@andrewcoyne9768 1 year ago

@robmulla Brilliant! Works perfect now. Thanks for the quick reply

@dreamdeckup 2 years ago

I had to do the same thing in my first internship lol. The script went from 4 hours to like 10 minutes to run

@robmulla 2 years ago

Yea, when I learned this it 100% changed the way I write pandas code.

@cbritton27 2 years ago

I had a similar situation creating a new column based on conditions. My data set has 520,000 records so the apply was very slow. I got good results with using the select function from numpy. I'm curious how that would compare to the vectorization in your case. Edit: in my case, the numpy select is slightly faster than the vectorization.

@robmulla 2 years ago

Thanks for sharing. It would be cool to see an example code snippet similar to what I used in this video for comparison.
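A sketch of what such a snippet could look like with `np.select`. The column names follow the ones quoted elsewhere in this thread; the four sample rows are invented for illustration, and this is not the commenter's actual code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [95, 30, 40, 20],
    "time_in_bed": [4, 8, 8, 3],
    "pct_sleeping": [0.2, 0.9, 0.3, 0.9],
    "favorite_food": ["pizza", "taco", "ice cream", "salad"],
    "hate_food": ["broccoli", "eggs", "liver", "olives"],
})

# One boolean mask per rule; np.select picks the first matching choice per row
# and falls back to `default` when no rule matches.
conditions = [
    (df["pct_sleeping"] > 0.5) & (df["time_in_bed"] > 5),
    df["age"] > 90,
]
choices = [df["favorite_food"], df["favorite_food"]]
df["reward"] = np.select(conditions, choices, default=df["hate_food"])
print(df[["age", "reward"]])
```

Because each rule gets its own entry, adding a third or fourth outcome only means appending to both lists rather than nesting calls.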

@linkernick5379 1 year ago

Polars lib is quite fast with my 1 million dataset, I recommend to try.

@MrJak3d 2 years ago

Damn, I knew lvl 2 but lvl 3 was awesome!

@robmulla 2 years ago

Thanks Jake! Yea, vectorized functions are super fast. If you can't vectorize then there are other ways to make it faster (like chunking and multiprocessing)... I might make a video about that next!

@adamleon8504 1 year ago

in these cases it is easy to vectorize but how can you vectorize when the process or the function that needs the df as input is more complex? For example can you vectorize a procedure that uses specific rows and not one column based on a condition and then use these elements to perform calculations with step and not on the same row for example df.loc[i,"A"] - df.loc[i-1,"B"]?
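For the specific pattern in the question, `shift` is one possible answer. A minimal sketch with made-up numbers follows; the `diff` column name is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]})

# Loop version of the expression from the comment, for comparison.
looped = [None] + [df.loc[i, "A"] - df.loc[i - 1, "B"] for i in range(1, len(df))]

# Vectorized: shift(1) moves B down one row, so row i sees B from row i-1.
# The first row has no predecessor and becomes NaN.
df["diff"] = df["A"] - df["B"].shift(1)
print(df)
```

A row condition can be layered on top with a boolean mask, for example `df.loc[mask, "diff"]`, without going back to a loop.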

@justsayin...1158 1 year ago

It's a great tip, but I don't feel like, I understood, what vectorized means, or how I make a function vectorized. Is it just creating the boolean array by applying the conditions to the whole data frame in this way, or are there other ways to vectorize as well?

@mltamarlin 27 days ago

You should plot the results on a log plot, which would give the fold speedup for each.

@mic9657 1 year ago

great tips! and very well presented

@robmulla 1 year ago

Glad you like it. Thanks for watching.

@mbcebrix 1 year ago

Is vectorization applicable for huge datasets? Like millions of rows, for example.

@robmulla 1 year ago

If it can fit in your computer’s memory then yes!

@rockwellshabani5180 2 years ago

Would vectorization also be faster than an np.where statement with multiple conditions?

@robmulla 2 years ago

Great question! I think someone tested it out in the reddit thread where I posted it and found maybe a slight speed increase over the vectorized version.

@georgebrandon7696 1 year ago

np.where() is what I use almost exclusively. However, it tends to be a little unreadable if you need to use additional if statements to go from binary (either or) to 3 or more possible values. Of course, one could also nest np.where() statements too. :)

@kanishkpareek6650 1 year ago

your teaching style is awesome. where can i find your videos in a structured manner??

@spicytuna08 2 years ago

oh my!!! awesome. thanks!!!

@robmulla 2 years ago

Thanks 🙏

@PeterSeres 2 years ago

Nice video! Thanks for detailed explanation. My only problem with this is that I often have to apply functions that depend on sequential time data and a loop setup makes the most sense since the next time step depends on the previous time steps. Are there some advanced methods on how to set up more complex vectorized functions that don't fit into a one-liner expression?

@robmulla 2 years ago

Yes there are! I think I'll probably make a few more videos on the topic considering how interested people seem in this. But I'd suggest if you can do any of your processing that goes across rows in groups - first do a `groupby()` and then you can multiprocess the processing of each group on a different CPU thread. If you have 8 or 16 CPU threads you can speed things up a lot!

@DrewLevitt 2 years ago

Pandas has a lot of useful time series methods, but without knowing exactly what you're trying to do, it'd be hard to suggest any specific functions. But if you only need to refer to step (n-1) when processing step n, you can use df.shift() to store step n-1 IN the row for step n. Hope this helps!

@danielbrett247 1 year ago

Not everything can be vectorized, commonly when processing time series data. For these, a great tool to know about is Numba's njit.

@robmulla 1 year ago

Agreed. Njit / numba can be great when you need pseudo-compiled Python code.

@ersineser7610 2 years ago

Thank you very much for great video.

@robmulla 2 years ago

Glad you liked it! Thanks for the feedback.

@FF-ct5dr 1 year ago

The Pandas doc literally tells you that iterrows is slow and should be avoided lol. As for vectorization, Pandas uses numpy arrays (slightly tweaked so they can hold different types), which are stored in contiguous memory blocks... So ofc vectorization will be faster than apply/map.

@robmulla 1 year ago

Yep. This is obvious to a seasoned veteran, but as I mentioned in the video, for many newbies who haven't read the docs and aren't fully aware of the backend, they don't know that iterrows is a bad idea.

@Geza_Molnar_ 1 year ago

@robmulla Maybe, when you have time for that, you could publish a video that describes to newbies what "RTFM" means, and what is the benefit of that. You are popular, a role model for some 🙂 (in this case "M" -> docs)

@Reftsquabble 1 year ago

i _am_ excited! show the solution in machine code next pls thx

@robmulla 1 year ago

Working on it…

@diegoalmeida2221 1 year ago

Nice video, though in some cases we want to use a specific complex function from a library. The apply method works fine for that case. But is there a way to use it with vectorization?

@robmulla 1 year ago

You can try to vectorize using something like numba. But it depends on the complexity of the function.

@Pranavshashi 1 year ago

I'm new so this might be a silly question. I am using an API to get additional data for each row in my dataset. Can I use vectorized approach while making API calls as well?

@robmulla 1 year ago

That's kind of different. You just want to gather the results from the api as fast as possible. Check out something like async calls to the api. This might help: stackoverflow.com/questions/71232879/how-to-speed-up-async-requests-in-python

@Pranavshashi 1 year ago

@robmulla thanks!

@kathrynpanger2289 2 years ago

What if I want to apply a more complicated or non-numeric test, like instead of df['pct-sleeping'] > 0.5 I was looking at whether "teeth" was in df['dream-themes'], (a list of the tags concerning the things the sleeper dreamt about e.g. [teeth, whale, dog, slide, school]). Is the only way to do this with .apply, or can this still be vectorized?

@robmulla 2 years ago

This is a good question. I think it depends on the dtype of the dream-theme columns. Would it only contain a single value or potentially multiple ones? Check the ‘isin’ function in the pandas docs, its a vectorized function of doing this.

@DrewLevitt 2 years ago

I haven't tested this but try df['dream-themes'].str.contains('teeth'). If df['dream-themes'] is a bunch of comma-delimited strings, you should be good to go (but watch out for partial matches e.g. "teeth whitening" contains "teeth"); not sure whether this will work if df['dream-themes'] contains a bunch of proper lists. Try it and let me know!
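For the case where each cell holds a proper Python list, one option that avoids `.apply` is `explode`. This is a sketch under that assumption, with made-up rows and a hypothetical `has_teeth` column. It also sidesteps the partial-match problem, since it compares whole tags.

```python
import pandas as pd

df = pd.DataFrame({
    "dream-themes": [
        ["teeth", "whale", "dog"],
        ["slide", "school"],
        ["teeth whitening"],
        [],
    ]
})

# explode() gives one row per tag while repeating the original index label,
# so an exact comparison followed by a per-label any() flags matching rows.
tags = df["dream-themes"].explode()
df["has_teeth"] = tags.eq("teeth").groupby(level=0).any()
print(df)
```

If the column holds comma-delimited strings instead, splitting first with `.str.split(",")` would produce lists that the same approach can handle.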

@andrew3068 1 year ago

Super awesome video.

@robmulla 1 year ago

I appreciate that. Thanks for commenting!

@alanhouston5874 1 year ago

Can you save lists using Parquet Or is it only applicable to dataframes?

@A372575 1 year ago

Thanks, one query in the case of vectorizing: which one would be faster, np.where or the method you mentioned?

@scientistgeospatial 1 year ago

Thanks Rob

@robmulla 1 year ago

Thanks for watching!

@vinitjha_ 11 months ago

which font do you use? That's awesome font and color scheme

@kennethstephani692 2 years ago

Great video!!!

@robmulla 2 years ago

Thank you!!

@Hitdouble 1 year ago

What theme do you use in your notebook?

@robmulla 1 year ago

I have a whole video on my jupyter lab setup. But it’s just the solarized dark theme.

@abdulkadirguven1173 1 year ago

Thank you very much Rob.

@blakeedwards3582 1 year ago

What theme are you using to get your Jupyter Notebook to look like that?

@robmulla 1 year ago

Solarized dark theme. I have a whole video about my jupyter setup

@bm647 1 year ago

Great video! very useful

@robmulla 1 year ago

Glad you found it useful!

@amitamola2014 1 year ago

So what about the scenario when we want to perform same operation but only in one column? Such as, if pct_sleep

@robmulla 1 year ago

You can use something like qcut in this case or a vectorized statement with and or statement.

@Chris_87BC 1 year ago

Great video! I am currently looping through a data frame column for each customer and print the data to PDF. Is there a vectorized version that can be much faster?

@dh00mketu 1 year ago

Why didn't you remove get rewards function from other run times?

@robmulla 1 year ago

Oops. Did I do it incorrectly? Can you share the timestamp?

@chndrl5649 1 year ago

Could also use query instead of loc

@robmulla 1 year ago

Not sure that would work for this case because we aren’t straight filtering.
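One way `query` could still be used here is to let it find the matching rows and then hand their index labels to `.loc` for the assignment. A sketch with invented sample rows and the column names quoted elsewhere in this thread:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [95, 30, 40, 20],
    "time_in_bed": [4, 8, 8, 3],
    "pct_sleeping": [0.2, 0.9, 0.3, 0.9],
    "favorite_food": ["pizza", "taco", "ice cream", "salad"],
    "hate_food": ["broccoli", "eggs", "liver", "olives"],
})

df["reward"] = df["hate_food"]

# query() only filters, so it cannot assign by itself. Its result still
# carries the original index labels, which .loc can use for the update.
idx = df.query("(pct_sleeping > 0.5 and time_in_bed > 5) or age > 90").index
df.loc[idx, "reward"] = df.loc[idx, "favorite_food"]
print(df[["age", "reward"]])
```

Whether this is any faster than a plain boolean mask is not something this sketch measures; it mainly trades the `&`/`|` syntax for a readable string expression.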

@jrwkc 1 year ago

when you vectorize with loc, don't you have to vectorize the right side of the equation too. df['favorite_food'] is not masked. It's the whole array. Right? So you are setting the reward to the first N of df['favorite_food'] where N is the length of the mask.

@robmulla 1 year ago

I don't think so because pandas will use the index when populating. But I'm also not 100% sure.
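This is easy to check on a tiny frame. In the sketch below (sample values invented), the masked rows are 0 and 2; if pandas took the first two values of the right-hand side, row 2 would receive "taco" instead of "ice cream".

```python
import pandas as pd

df = pd.DataFrame({
    "age": [95, 30, 99, 20],
    "favorite_food": ["pizza", "taco", "ice cream", "salad"],
    "hate_food": ["broccoli", "eggs", "liver", "olives"],
})

df["reward"] = df["hate_food"]
mask = df["age"] > 90  # True for rows 0 and 2

# The right-hand side is the full, unmasked column. pandas aligns it to the
# selected rows by index label rather than taking its first two values.
df.loc[mask, "reward"] = df["favorite_food"]
print(df[["age", "reward"]])
```

The alignment only applies when the right-hand side is a pandas object. A plain NumPy array or list on the right has no index and would need to match the number of selected rows.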

@jrwkc 1 year ago

@robmulla make github repos so we can test! that would be great

@rahulkmail 4 months ago

Thanks for sharing a nice information

@leejunzhao 1 year ago

Question: I followed your code, and an error came up saying " 'reward_calc' is not defined ".

@wenbozhao4325 1 year ago

In the video the fn name has no letter c at the end

@robmulla 1 year ago

Oh. Good catch. Sorry it was confusing

@beastmaroc7585 1 year ago

thank you so much for this game changer tips ....

@robmulla 1 year ago

Thanks for watching!

@DK-rl1sf 2 years ago

Is there a way to use np.vectorize() instead of df.loc so things are tidier?

@robmulla 2 years ago

That's a great point. I've used np.vectorize before but not too frequently. I agree the current solution isn't very clean to read and could be much tidier.

@DK-rl1sf 2 years ago

@robmulla No, this is not so much a suggestion and more a question. I'm new to this and literally don't know. I had previously read about np.vectorize(). I tried doing your vectorize method but using np.vectorize but couldn't figure out the syntax.

@robmulla 2 years ago

@DK-rl1sf Yea, there is a lot of overhead when using pandas instead of numpy, but you get the benefit of named columns, easy filtering and sorting. In my experience np.vectorize worked, but I was working with just numpy arrays, not pandas dataframes.
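For reference, the `np.vectorize` syntax on DataFrame columns could look roughly like this. The function name `pick_reward` and the sample rows are hypothetical; the rule itself is the one quoted elsewhere in this thread. Per the NumPy documentation, `np.vectorize` is provided for convenience rather than performance and is essentially a loop, so it should not be expected to match the `.loc` version for speed.

```python
import numpy as np
import pandas as pd


def pick_reward(pct_sleeping, time_in_bed, age, favorite_food, hate_food):
    # Plain scalar logic, written the way you would for a single row.
    if (pct_sleeping > 0.5 and time_in_bed > 5) or age > 90:
        return favorite_food
    return hate_food


df = pd.DataFrame({
    "age": [95, 30, 40, 20],
    "time_in_bed": [4, 8, 8, 3],
    "pct_sleeping": [0.2, 0.9, 0.3, 0.9],
    "favorite_food": ["pizza", "taco", "ice cream", "salad"],
    "hate_food": ["broccoli", "eggs", "liver", "olives"],
})

# otypes=[object] keeps the returned strings intact; without it NumPy infers
# a fixed-width string type from the first result and may truncate later ones.
vec_reward = np.vectorize(pick_reward, otypes=[object])
df["reward"] = vec_reward(
    df["pct_sleeping"].to_numpy(),
    df["time_in_bed"].to_numpy(),
    df["age"].to_numpy(),
    df["favorite_food"].to_numpy(),
    df["hate_food"].to_numpy(),
)
print(df[["age", "reward"]])
```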

@elgoogffokcuf 1 year ago

What about Numba, if it can bring some more optimization, it will be nice if you make a video for it.

@robmulla 1 year ago

Numba/jit is great to speed up more complex operations. I've had limited experience with it, but every case it really sped things up. Doing it as a video is a good idea.

@lucienjaegers2028 1 year ago

Nice trick, but what if you code it completely in C / C++ / Rust? Literature says those are 50 - 80 times faster?

@robmulla 1 year ago

I have a whole video on polars, which is written in rust. It’s faster for sure. But keep in mind pandas backend is just C code.

@ivanrubnenkov919 2 months ago

now instead of two actions + .loc in the third example use np.where for oneliner and it will be even faster

@alexisdebrand6209 1 year ago

so useful thank you !!!!!

@robmulla 1 year ago

You're welcome! Thanks for commenting.

@YuanYuan-uk8sz 1 year ago

thank you your very extremely perfect video,so so helpful for me,love you so much

@robmulla 1 year ago

I'm so glad! Share it with a friend or two who you think might also appreciate it.

@GroWithUmar 2 years ago

Amazing video

@robmulla 2 years ago

Thanks for the feedback!

@agoodwin-8127 2 years ago

I typically use numpy where in this situation (mainly because I like the syntax better!), so I was curious about the speed vs. the level 3 solution. Where ran a little faster (~15-20% for datasets sized 10K - 50M records).

    # level 3 - vectorized
    %%timeit
    df = get_data(10_000)
    df['reward'] = df['hate_food']
    df.loc[((df['pct_sleeping'] > 0.5) & (df['time_in_bed'] > 5)) | (df['age'] > 90), 'reward'] = df['favorite_food']
    # 3.74 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    # level 4 - where
    %%timeit
    df = get_data(10_000)
    df['reward'] = np.where(((df['pct_sleeping'] > 0.5) & (df['time_in_bed'] > 5)) | (df['age'] > 90), df['favorite_food'], df['hate_food'])
    # 3.15 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@robmulla 2 years ago

I love that you ran that experiment! I'm actually surprised the numpy version isn't even faster. Thanks for sharing.

@georgebrandon7696 1 year ago

@robmulla I'll throw another wrench into this. df.at[] vs df.loc[]. df.at[] is considerably faster than df.loc[]. But I've never run conditionals with df.at[]. I'm also an np.where() user. :)

@ErikS- 1 year ago

3.5 seconds for a for loop with only 10k rows... Is this done in a Docker container or another VM(-like) environment?

@robmulla 1 year ago

Just done locally on my fairly beefy machine.

@saxegothaea_conspicua 1 year ago

can you vectorize using query? I suppose you can't

@co.n.g.studios5710 2 years ago

Nice vid. Wouldn't it even be faster, if using the .values for the columns? is this even applicable in the case presented in the example? Looking forward to your answer, cheers

@robmulla 2 years ago

Thanks for the comment. Yes using .values could be faster thanks for pointing that out. Not sure about specific part in this video but worth a try.

@bilalbayrakdar7100 1 year ago

so sql like logical filtering is the real deal dude

@robmulla 1 year ago

Not sure if I follow. I do love SQL though!

@rickk3658 1 year ago

3500 times faster is all well and good, but I'd like to know your speed up magic at 2:31. You were turning create dataset into a function. As you type the colon, the rest of the code in the cell became properly indented. My version of JupyterLab does not do that. What's the secret?

@robmulla 1 year ago

Oh. I’m using the black auto formatter with nb_black. It’s really helpful to keep your code clean in jupyter.

@TeXiCiTy 1 year ago

For looping over big datasets I switch to polars when speed becomes an issue.

@robmulla 1 year ago

I have an entire video on my channel about polars. It’s great! Check it out.

@krishnapullak 1 year ago

Nice tip

@robmulla 1 year ago

Thx for watching.

@SillyLittleMe 1 year ago

Hey, this is a great video and truly shows the benefit of vectorisation. I would like to point out that always remembering the vectorized way of writing is hard. Fortunately, the NumPy module provides a neat function called "vectorize" that vectorizes your non-vectorized function. An example (from the docs):

    import numpy as np

    ## this is the function
    def myfunc(a, b):
        "Return a-b if a>b, otherwise return a+b"
        if a > b:
            return a - b
        else:
            return a + b

    ## vectorising the function and then applying it
    vfunc = np.vectorize(myfunc)
    vfunc([1, 2, 3, 4], 2)
    # array([3, 4, 1, 2])

This works on DataFrames as well. Do note though that this is not true vectorisation; because of that, in some cases it performs similarly to functions like "apply". However, for the most part it does a tremendous job and has significantly increased the speed of my functions. The reasons why it is not "true vectorisation" are mentioned in this thread: stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c

@robmulla 1 year ago

Thanks! I've used that before and it does come in handy. Also using things like jit/numba can compile numpy operations.