Whoa... a 3500x difference. Vectorized is even faster than apply; I'll give it a try next time for sure. Awesome video as always.
@robmulla2 жыл бұрын
Thanks Sanskar. Yes, using vectorized functions is always much faster. In some cases it's not possible but then there are other ways to speed it up. I might show that in another video if this one is popular.
@amazingdude90429 ай бұрын
@@robmulla can you make a video on how to make pandas resample faster ?
@miaandgingerthememebunnyme33972 жыл бұрын
That’s my husband! He’s so cool.
@robmulla2 жыл бұрын
Love you boo. 😘
@FilippoGronchi2 жыл бұрын
Fully agree!
@sauloncall2 жыл бұрын
Aww! This is wholesome!
@rahulchoudhary10242 жыл бұрын
I've been watching your videos nonstop for the last week! And I enjoy the comments from your SO!!! Lovely!
@mohammedgt8102 Жыл бұрын
He is awesome. Taking time out of his day to share knowledge 👏
@kip1272 Жыл бұрын
Also, in plain Python code (for example inside a list comprehension), a way to speed it up is to not use & and | for 'and' and 'or' but just use the words 'and' and 'or'. These keywords are made for boolean expressions and short-circuit, so they tend to be faster. & and | are bitwise operators made for integers; they always evaluate both sides and go through an extra operator call. This doesn't take that much time if you do it once, but in a test scenario inspired by this video the bitwise version was roughly 45% slower. Note that this only applies to single values: on a pandas Series or NumPy array you still need & and |.
@robmulla Жыл бұрын
Nice tip! I didn’t know that.
@kazmkazm9676 Жыл бұрын
I made the experiment. It is ready to run. What you have suggested is coded below. It is approximately 20 percent faster. import timeit setup = 'import random; random_list = random.sample(range(1,101),100)' # with or first_code = '''\ result_1 = [rand for rand in random_list if (rand >75) or (rand 75) | (rand
@kip1272 Жыл бұрын
@@kazmkazm9676 the difference was even bigger between & and 'and', if I remember correctly.
@A372575 Жыл бұрын
Great, never realized that. Will start using 'and' and 'or' from now on.
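A rough sketch of the comparison discussed in this thread. The snippet posted above is cut off, so the thresholds (75 and 25), the function names, and the repeat count here are assumptions rather than the commenter's exact code. It also shows why the tip does not carry over to pandas: on a Series the keywords raise an error, so `&` and `|` are still required there.

```python
import random
import timeit

import pandas as pd

random.seed(0)
random_list = random.sample(range(1, 101), 100)  # the numbers 1..100, shuffled


def with_or():
    # `or` short-circuits: the second comparison is skipped when the first is True
    return [x for x in random_list if (x > 75) or (x < 25)]


def with_pipe():
    # `|` always evaluates both comparisons, then combines the two bools
    return [x for x in random_list if (x > 75) | (x < 25)]


same_result = with_or() == with_pipe()
t_or = timeit.timeit(with_or, number=2_000)
t_pipe = timeit.timeit(with_pipe, number=2_000)
print(f"or: {t_or:.4f}s   |: {t_pipe:.4f}s")  # timings vary by machine

# On a pandas Series the keywords do not work element-wise.
s = pd.Series(random_list)
mask = (s > 75) | (s < 25)  # element-wise, returns a boolean Series
try:
    (s > 75) or (s < 25)
    keyword_works_on_series = True
except ValueError:
    # pandas refuses to reduce a whole Series to a single True/False
    keyword_works_on_series = False
```

So the keyword tip is probably best kept for plain Python loops and comprehensions, while the vectorized pandas code from the video keeps the bitwise operators.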
@nathanielbonini8951 Жыл бұрын
This is spot on. I had a filter running that was going to take 2 days to complete on a 12M line CSV file using iteration - clearly not good. Now it takes 6 seconds.
@Zenoandturtle Жыл бұрын
That is unbelievable. Astounding time difference. I was recently watching a presentation on a candlestick algorithm, and the presenter used the vectorised method and I was confused (I am new to Python), but this video made it all clear. Fantastic presentation.
@robmulla Жыл бұрын
Glad you found it interesting. Thanks for watching!
@jti1072 жыл бұрын
I didn’t realize you could write 10k as 10_000. I work with astronomical units so makes variables more readable. Great video!
@robmulla2 жыл бұрын
Thanks! Yes, underscores in numeric literals were introduced in Python 3.6, and they really help make numbers more readable.
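A tiny illustration of the underscore syntax mentioned here (Python 3.6+). The variable names are made up for the example.

```python
# Underscores are ignored by the parser; they only help the reader.
n_rows = 10_000
au_in_km = 149_597_870.7  # one astronomical unit, in kilometres

# The same grouping works when parsing strings and when formatting output.
parsed = int("1_000_000")
formatted = f"{n_rows:_}"
```

Here `n_rows` is exactly the integer 10000, and the `_` format specifier writes the separators back out when printing.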
@kailashlate63489 ай бұрын
😊😊😊
@kailashlate63489 ай бұрын
😊
@deepakramani052 жыл бұрын
As I work with pandas and large datasets, I often come across code that uses iterrows. Most developers just don't care about time, or come from programming backgrounds that keep them from using efficient methods. I wish more people used vectorization.
@robmulla2 жыл бұрын
Thanks. That’s exactly why I wanted to make this video. Hopefully people will find it helpful.
@pr0skis2 жыл бұрын
Some of the biggest bottlenecks are from IO... especially when trying to read and then concat multiple large Excel files. Shaving a few seconds off the algos just isn't gonna make much of a difference.
@allenklingsporn69932 жыл бұрын
@@pr0skis Hard to say that definitively, though, right? You have no idea how anyone is using pandas. If they have slow algos running iteratively, it can very easily become much slower than I/O functions. I've seen some pretty wild pandas use in my business, and a lot of it is really terrible at runtime, especially anything that is wrapped in a GUI (sometimes even with multiprocessing...).
@nitinkumar29 Жыл бұрын
@@pr0skis you can convert the Excel files to CSV and then use the CSV files, because CSV IO is faster.
@jaimeduncan6167 Жыл бұрын
It's the same with a relational database, we call them the cursor kids. They loop and loop and loop when they can use a set operation to go hundreds of times faster and often with less code.
@i-Mik Жыл бұрын
Thanks for the great video! I have a project with some calculations. They take some minutes through the loops. I'm going to use the vectorized way, so I'll write another comment with a comparison later. Some days later... I rewrote a significant part of my code. Made it vectorized, and I got fantastic results. One example: old code - 1m 3s, new code - 6s. One more: old code - 14m 58s, new code - 11s. Awesome!
@robmulla Жыл бұрын
So awesome! It's really satisfying when you are able to improve the speed of code by orders of magnitude.
@nirbhay_raghav Жыл бұрын
My man made a df out of the time diff to plot them!! Really useful video. Will definitely keep this in mind from now.
@robmulla Жыл бұрын
Haha. Thanks Nirbhay!
@balajikrishnamoorthy5464 Жыл бұрын
I am a beginner and admire your sound knowledge of pandas.
@robmulla Жыл бұрын
Thanks for watching. Hope you learned some helpful stuff.
@LimitlesslyUnlimited2 жыл бұрын
Haha coincidentally I'd been raving about vectorized to my friends the last few months. It's soo good. The moment I saw your title I figured you're probably talking about vectorize too haha. Awesome video and great content!!
@robmulla2 жыл бұрын
You called it! Thanks for the positive feedback. Hope to create more videos like it soon.
@robertnolte5192 жыл бұрын
Same! Still hasn't worked on picking up chicks at the bar, but I'm not giving up.
@colmduffy22722 жыл бұрын
There are several videos on pandas vectorization. This is the best.
@robmulla2 жыл бұрын
I appreciate you saying that! Thanks for watching.
@OPPACHblu_channel Жыл бұрын
Somehow I came across the vectorized method right at the beginning of my Python and pandas journey. Thanks for sharing your experience, lightning fast.
@robmulla Жыл бұрын
It’s a great thing to learn early!
@craftydoeseverything971811 ай бұрын
Hey, I just thought I'd mention, I really appreciate that you use really huge test datasets, since a lot of the time the test datasets used in tutorials are quite small and don't show how code will scale. This video does it perfectly, though!
@robertjordan1142 жыл бұрын
Man where have you been all my Python-Life!?!? Thank you so much for this! Outstanding!!!
@robmulla2 жыл бұрын
Thanks Robert for watching. Glad you found it helpful!
@robertjordan1142 жыл бұрын
The problem I'm dealing with is that I am looping through some poorly designed tables, building a SQL statement to be applied, and then appending the output to a list. Not sure if a vectorized approach will work since I have that SQL call, but apply might save me from needing to recreate the df every time before appending.
@robmulla2 жыл бұрын
@@robertjordan114 Interesting. Not sure what your data is like- but it can be better a lot of the times to write a nice SQL statement that puts the data in the correct formatting first. That way you put the processing demands on the SQL server and it can usually optimize really well.
@robertjordan1142 жыл бұрын
Oh you have no idea, my source table has one column with the name of the column in my lookup table and another with the value that I need to filter on in that lookup table. The loop creates the where clause based on the number of related rows in the initial dataset, and then I'm executing that SQL statement to return the values to a Python data frame, which I then convert to a pandas data frame and append. Like I said, amateur hour! 🤣
@OktatOnline Жыл бұрын
I'm over here as a newbie data scientist, copying the logic step-by-step in order to have good coding habits in the future lmao. Thanks for the video, really valuable!
@robmulla Жыл бұрын
Glad you found it helpful!
@gabriel-mckee Жыл бұрын
Great video! I wish I had known not to loop over my array for my machine learning project... going to go improve my code now!
@robmulla Жыл бұрын
Glad you learned something new!
@FilippoGronchi2 жыл бұрын
That's another awesome video....extremely useful in the real world work. Thanks again Rob
@robmulla2 жыл бұрын
Thanks for watching Filippo!
@artemqqq7153 Жыл бұрын
Dude, that row[column] thing was a shock to me, thanks!
@robmulla Жыл бұрын
Glad you learned something!
@blogmaster7920 Жыл бұрын
This can be really helpful, when moving data from one source to another through Internet.
@robmulla Жыл бұрын
Absolutely, compressing can make any data transfer faster.
@hussamcheema2 жыл бұрын
Wow amazing. Please keep making more videos like this.
@robmulla2 жыл бұрын
Thanks for the feedback. I’ll try my best.
@alexandremachado10142 жыл бұрын
Hey man, nice video! Kudos from reddit!
@robmulla2 жыл бұрын
Glad you enjoyed it. So cool that the reddit community liked this video so much. Hopefully my next one will be as popular.
@thebreath6159 Жыл бұрын
Ok this channel is great for data science, I’ll follow
@robmulla Жыл бұрын
Thanks for subbing!
@anoopbhagat132 жыл бұрын
Wow ! That's an excellent way of speed up the code.
@robmulla2 жыл бұрын
Thanks Anoop. Hope your future pandas code is a bit faster because of this video :D
@LaHoraMaker Жыл бұрын
I loved that you used Madrid Python user group for the pandas logo :)
@robmulla Жыл бұрын
I did?! I didn't even realize. What's the timestamp where I show that logo?
@ajaybalakrishnan5208 Жыл бұрын
Awesome. Thanks Rob for introducing this concept to me.
@robmulla Жыл бұрын
Happy it helped!
@sphericalintegration Жыл бұрын
Thank you for this, Rob. This video made me subscribe because in 10 minutes you solved one of my biggest problems. And your Boo is right - you are pretty cool. Thanks again, sir.
@robmulla Жыл бұрын
That's awesome that I was able to help you out. Check my other videos where I go over similar tips! Glad you agree with my Boo
@prodmanaiml93172 жыл бұрын
More video tips for pandas would be excellent!
@robmulla2 жыл бұрын
Great suggestion. I'll try to keep the pandas videos coming.
@Vonbucko Жыл бұрын
Awesome video man! Appreciate the tips, I'll definitely be subscribing!
@robmulla Жыл бұрын
I appreciate that a ton. Share with a friend too!
@pietraderdetective89532 жыл бұрын
I have always been struggling to understand how vectorize work..this video of yours is the one made it crystal clear for me. What a great video! Can you please do more of these efficient pandas videos and use some stock market data? Thanks!
@robmulla2 жыл бұрын
Thanks for the feedback. I’m so happy you found this useful. I’ll try my best to do a future video related to stock market data.
@DataScienceconMilton6 ай бұрын
Thank you very much for this video Rob. It is very helpful for beginners like me. Have a great day.
@kingj59836 ай бұрын
Wow, awesome video, thanks! Although it takes time to figure out how to turn my limit conditions into logical calculation and return a bool dataframe
@sweealamak628 Жыл бұрын
I'm kicking myself now for not finding your video 10 months ago. I'm near the completion of my code and resorted to a mix of iterating For loops and small scale vectorisation by declaring new columns after applying some logic. I seriously need to adopt your methods and redo my code because mine is just not fast enough!
@robmulla Жыл бұрын
I totally feel you. It took me years before I really understood how important it is to avoid iterating over rows. Once you learn it, all your pandas code will be much faster though.
@sweealamak628 Жыл бұрын
@@robmulla I just altered one of my for loops and used your vectorized approach! Not only is it faster, I did it in just 3 lines of code and the syntax is much easier to read! I feel so embarrassed because it's much more straightforward than I thought! Now the tricky thing is, I work on a time series dataset where I compare previous rows of data to the current row to get the "result". I assume I can use the "shift" method to look back at a previous row of data. If it works, I'm gonna vectorize everything! THANKS SO MUCH!
@RichieStockholm2 жыл бұрын
I expect a video about moped gangs in the future, Rob.
@robmulla2 жыл бұрын
That’s a great idea Richie! I practically majored in moped gangs in college. 😂
@FabioRBelotto Жыл бұрын
Great video. I am working on a df with millions of rows and pandas apply was struggling. I solved it using a vectorized solution as shown here. Much, much better. Can you think of a situation where vectorization would not be possible?
@robmulla Жыл бұрын
Glad this helped! As far as examples where vectorization is not possible: simple branching on a condition can usually still be vectorized with boolean masks (as in this video), but if the branching depends on state carried over from earlier rows, for example when each row's result depends on the previous row's result, it may not be possible to write it as one vectorized expression. In that case you would need a loop or some other non-vectorized approach. Another example is data with varying lengths or shapes, such as a column of lists, where whole-column operations may not apply. Hope that helps.
@johnidouglasmarangon Жыл бұрын
Great video Bob, thanks. I'm curious, which interface for Jupyter Notebook are you using?
@robmulla Жыл бұрын
Glad you liked it. This is jupyterlab with the solarized dark theme. Check out my full video on jupyter where I go into detail about it.
@johnidouglasmarangon Жыл бұрын
@@robmulla Tks Bob ✌️
@robmulla Жыл бұрын
@@johnidouglasmarangon no problem. Jane!
@GregZoppos Жыл бұрын
Wow, thanks! I'm a beginner in data science, this is really interesting to me.
@robmulla Жыл бұрын
Great to hear! Good luck in your data science journey.
@Graham_Wideman Жыл бұрын
1:19 "a random integer between one and 100." I believe that should be from 0 to 99 (ie: inclusive at both ends). In case nobody else mentioned it.
@robmulla Жыл бұрын
Good catch! I think you are the first to point that out.
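A quick check of the point above, assuming the video generated the column with `np.random.randint(0, 100)` (the exact call is not shown in this thread). NumPy's `randint` includes the low end and excludes the high end, so that call yields 0 to 99.

```python
import numpy as np

# randint(low, high) draws from the half-open interval [low, high)
zero_to_99 = np.random.randint(0, 100, size=100_000)
lo, hi = zero_to_99.min(), zero_to_99.max()

# To really get 1..100 inclusive, shift both bounds up by one
one_to_100 = np.random.randint(1, 101, size=100_000)
```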
@alysmtech3683 Жыл бұрын
Jesus, I'm over here blowing up my laptop. Had no idea, thank you!
@robmulla Жыл бұрын
Hah. My name is Rob. But glad you learned something new.
@bgotura Жыл бұрын
I love how that Pandas logo has canibalized the city of Madrid (Spain) logo
@BILALAHMAD-cz9gu Жыл бұрын
This man is amazing. My English is poor, but I will definitely learn English because of this man.
@robmulla Жыл бұрын
Thanks. So glad it helped even though it’s not your native tongue!
@moodiiie Жыл бұрын
That’s all I do at work, vectorize is the way to go. I was able to do some complex logic with them.
@robmulla Жыл бұрын
Love it.
@Sinke_100 Жыл бұрын
Cool, for really large dataset and when conditions aren't too complicated that vectorized method is amazing, apply is nice alternative cause you can write function, there should be a module that converts normal functions in this vectorized syntax cause it's quite complicated to write
@robmulla Жыл бұрын
Glad it was helpful! There are some packages that compile functions (Numba, with its jit decorator), and there is also np.vectorize.
@Sinke_100 Жыл бұрын
@@robmulla I tried playing with it a bit. pandas is similar to NumPy, and I have worked with NumPy quite a bit. I wrote a function bool_calculation that builds three separate conditions (age, pct_sleeping and time in bed) and returns the final combined condition. df.loc accepts a function directly in its statement, so I passed it in that way. Finally I compared the dfs created with both methods, and they are the same. My suggestion is that you explain this more complex stuff in more depth.
@incremental_failure Жыл бұрын
Vectorization is the whole point of Pandas. But there are cases where vectorization is impossible and you need to process row-by-row, in that case it's best to switch to numba for a precompiled function.
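A minimal sketch of the Numba route mentioned here, for a calculation where each result depends on the previous result. The function name `clamped_running_total` and the data are hypothetical, and the `try/except` is only there so the sketch still runs (as ordinary, slow Python) when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        # Fallback: leave the function uncompiled if Numba is missing
        return func


@njit
def clamped_running_total(changes):
    # A running total that is never allowed to drop below zero.
    # Each step needs the previous step's clamped result, which makes it
    # awkward to express as a single column-wise operation.
    out = np.empty(changes.shape[0])
    level = 0.0
    for i in range(changes.shape[0]):
        level = max(level + changes[i], 0.0)
        out[i] = level
    return out


changes = np.array([5.0, -8.0, 3.0, -1.0])
levels = clamped_running_total(changes)
```

With pandas, the usual pattern would be to pass `df["col"].to_numpy()` into such a function and assign the returned array back as a new column.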
@djangoworldwide792510 ай бұрын
As an R user we use vectorization using mutate without even thinking about the other methods for such task. R is so much more suitable for data science and wrangling
@ledestonilo7274 Жыл бұрын
Interesting. Thank you will try it.
@robmulla Жыл бұрын
Awesome! Let me know how it goes.
@andrewcoyne9768 Жыл бұрын
Love the video, thanks Rob! Is there a vectorized way to create a column that is the sum of several columns? I tried df['total'] = df.iloc[:, 5:13].sum(), which was way faster but returned all NaN values. Any help would be appreciated.
@robmulla Жыл бұрын
So close! I think all you need to do is change it to `.sum(axis=1)` and it should work!
@andrewcoyne9768 Жыл бұрын
@@robmulla Brilliant! Works perfect now. Thanks for the quick reply
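A small reproduction of what likely happened in this exchange, using a made-up three-column frame instead of columns 5 to 13. Without `axis=1`, `sum()` returns one value per column, labelled by column name; assigning that to a new column aligns on the row index, finds no matching labels, and fills with NaN.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# Default axis=0: one total per column, indexed by 'a', 'b', 'c'
col_sums = df.iloc[:, 0:3].sum()
df["total_wrong"] = col_sums  # labels do not match the row index 0, 1, 2

# axis=1: one total per row, indexed like the DataFrame itself
df["total"] = df.iloc[:, 0:3].sum(axis=1)
```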
@dreamdeckup2 жыл бұрын
I had to do the same thing in my first internship lol. The script went from 4 hours to like 10 minutes to run
@robmulla2 жыл бұрын
Yea, when I learned this it 100% changed the way I write pandas code.
@cbritton272 жыл бұрын
I had a similar situation creating a new column based on conditions. My data set has 520,000 records so the apply was very slow. I got good results with using the select function from numpy. I'm curious how that would compare to the vectorization in your case. Edit: in my case, the numpy select is slightly faster than the vectorization.
@robmulla2 жыл бұрын
Thanks for sharing. It would be cool to see an example code snippet similar to what I used in this video for comparison.
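A possible version of the `np.select` approach described above, written against the column names quoted elsewhere in this thread (`pct_sleeping`, `time_in_bed`, `age`, `favorite_food`, `hate_food`). The three-row frame is invented for illustration, and this is not the commenter's actual code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [30, 95, 40],
    "time_in_bed": [8, 4, 3],
    "pct_sleeping": [0.9, 0.2, 0.4],
    "favorite_food": ["pizza", "taco", "ice-cream"],
    "hate_food": ["broccoli", "eggs", "candy corn"],
})

# Conditions are checked in order; the first one that is True picks the choice
conditions = [
    (df["pct_sleeping"] > 0.5) & (df["time_in_bed"] > 5),
    df["age"] > 90,
]
choices = [df["favorite_food"], df["favorite_food"]]

# Rows matching no condition fall through to the default
df["reward"] = np.select(conditions, choices, default=df["hate_food"])
```

`np.select` tends to pay off when there are several conditions with different outcomes, since each condition and its result sit next to each other in the two lists.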
@linkernick5379 Жыл бұрын
Polars lib is quite fast with my 1 million dataset, I recommend to try.
@MrJak3d2 жыл бұрын
Damn, I knew lvl 2 but lvl 3 was awesome!
@robmulla2 жыл бұрын
Thanks Jake! Yea, vectorized functions are super fast. If you can't vectorize then there are other ways to make it faster (like chunking and multiprocessing)... I might make a video about that next!
@adamleon8504 Жыл бұрын
in these cases it is easy to vectorize but how can you vectorize when the process or the function that needs the df as input is more complex? For example can you vectorize a procedure that uses specific rows and not one column based on a condition and then use these elements to perform calculations with step and not on the same row for example df.loc[i,"A"] - df.loc[i-1,"B"]?
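For the specific case asked about here, `df.loc[i, "A"] - df.loc[i-1, "B"]`, one option is `shift`, which moves a column down by one row so that the previous row's value lines up with the current row. A small sketch with invented numbers:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]})

# Loop version: row i uses A from row i and B from row i - 1
looped = [df.loc[i, "A"] - df.loc[i - 1, "B"] for i in range(1, len(df))]

# Vectorized version: shift(1) puts B from row i - 1 onto row i
df["diff"] = df["A"] - df["B"].shift(1)
```

The first row has no previous row, so its result is NaN. This works when each step only reads earlier input values; if a step needs an earlier computed result, a loop (possibly compiled) may still be needed.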
@justsayin...1158 Жыл бұрын
It's a great tip, but I don't feel like, I understood, what vectorized means, or how I make a function vectorized. Is it just creating the boolean array by applying the conditions to the whole data frame in this way, or are there other ways to vectorize as well?
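One way to picture the question above: "vectorized" roughly means writing the operation once against whole columns, so the per-row work runs inside pandas/NumPy rather than in a Python loop. Boolean masks are one form of it; column arithmetic is another. The tiny frame and the `hours_awake` column below are made up for the sketch.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 95, 60], "time_in_bed": [9, 3, 4]})

# Row by row: Python evaluates the condition once per row
looped = [
    (row["age"] > 90) or (row["time_in_bed"] > 5)
    for _, row in df.iterrows()
]

# Vectorized: each comparison runs on a whole column and returns a boolean Series
mask = (df["age"] > 90) | (df["time_in_bed"] > 5)

# Arithmetic on columns is vectorized in the same way
df["hours_awake"] = 24 - df["time_in_bed"]
```

Both approaches give the same answers; the difference is only where the loop happens.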
@mltamarlin27 күн бұрын
You should plot the results on a log plot, which would give the fold speedup for each.
@mic9657 Жыл бұрын
great tips! and very well presented
@robmulla Жыл бұрын
Glad you like it. Thanks for watching.
@mbcebrix Жыл бұрын
Is vectorization applicable for huge datasets? Like millions of datasets for example.
@robmulla Жыл бұрын
If it can fit in your computer’s memory then yes!
@rockwellshabani51802 жыл бұрын
Would vectorization also be faster than an np.where statement with multiple conditions?
@robmulla2 жыл бұрын
Great question! I think someone tested it out in the reddit thread where I posted it and found maybe a slight speed increase over the vectorized version.
@georgebrandon7696 Жыл бұрын
np.where() is what I use almost exclusively. However, it tends to be a little unreadable if you need to use additional if statements to go from binary (either or) to 3 or more possible values. Of course, one could also nest np.where() statements too. :)
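A short sketch of the nesting described here, going from two outcomes to three. The `sleep_level` column and the cut-offs 0.8 and 0.5 are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"pct_sleeping": [0.9, 0.6, 0.2]})

# The "else" branch of the outer np.where is itself another np.where
df["sleep_level"] = np.where(
    df["pct_sleeping"] > 0.8,
    "high",
    np.where(df["pct_sleeping"] > 0.5, "medium", "low"),
)
```

Each extra outcome adds one more level of nesting, which is where readability starts to suffer.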
@kanishkpareek6650 Жыл бұрын
your teaching style is awesome. where can i find your videos in a structured manner??
@spicytuna082 жыл бұрын
oh my!!! awesome. thanks!!!
@robmulla2 жыл бұрын
Thanks 🙏
@PeterSeres2 жыл бұрын
Nice video! Thanks for detailed explanation. My only problem with this is that I often have to apply functions that depend on sequential time data and a loop setup makes the most sense since the next time step depends on the previous time steps. Are there some advanced methods on how to set up more complex vectorized functions that don't fit into a one-liner expression?
@robmulla2 жыл бұрын
Yes there are! I think I'll probably make a few more videos on the topic considering how interested people seem in this. But I'd suggest if you can do any of your processing that goes across rows in groups - first do a `groupby()` and then you can multiprocess the processing of each group on a different CPU thread. If you have 8 or 16 CPU threads you can speed things up a lot!
@DrewLevitt2 жыл бұрын
Pandas has a lot of useful time series methods, but without knowing exactly what you're trying to do, it'd be hard to suggest any specific functions. But if you only need to refer to step (n-1) when processing step n, you can use df.shift() to store step n-1 IN the row for step n. Hope this helps!
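Combining the two suggestions in this thread (group first, then look one step back with `shift`), a sketch might look like the following. The `sensor`, `step` and `value` columns are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "sensor": ["a", "a", "a", "b", "b"],
    "step": [0, 1, 2, 0, 1],
    "value": [1.0, 4.0, 9.0, 10.0, 15.0],
}).sort_values(["sensor", "step"])

# shift(1) within each group stores step n-1 on the row for step n,
# without leaking values from one sensor into the next
df["prev_value"] = df.groupby("sensor")["value"].shift(1)
df["change"] = df["value"] - df["prev_value"]
```

The first row of each group has no previous step, so `prev_value` and `change` are NaN there.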
@danielbrett247 Жыл бұрын
Not everything can be vectorized, which is common when processing time series data. For these cases, a great library to know about is Numba, with its njit decorator.
@robmulla Жыл бұрын
Agreed. Njit / Numba can be great when you need pseudo-compiled Python code.
@ersineser76102 жыл бұрын
Thank you very much for great video.
@robmulla2 жыл бұрын
Glad you liked it! Thanks for the feedback.
@FF-ct5dr Жыл бұрын
The pandas docs literally tell you that iterrows is slow and should be avoided lol. As for vectorization, pandas uses NumPy arrays (slightly tweaked so as to hold different types), which are stored in contiguous memory blocks... So of course vectorization will be faster than apply/map.
@robmulla Жыл бұрын
Yep. This is obvious to a seasoned veteran, but as I mentioned in the video, for many newbies who haven't read the docs and aren't fully aware of the backend, they don't know that iterrows is a bad idea.
@Geza_Molnar_ Жыл бұрын
@@robmulla Maybe, when you have time for that, you could publish a video that describes to newbies what "RTFM" means, and what is the benefit of that. You are popular, a role model for some 🙂 (in this case "M" -> docs)
@Reftsquabble Жыл бұрын
i _am_ excited! show the solution in machine code next pls thx
@robmulla Жыл бұрын
Working on it…
@diegoalmeida2221 Жыл бұрын
Nice video, though in some cases we want to use a specific complex function from a library. The apply method works fine for that case. But is there a way to use it with vectorization?
@robmulla Жыл бұрын
You can try to vectorize using something like numba. But it depends on the complexity of the function.
@Pranavshashi Жыл бұрын
I'm new so this might be a silly question. I am using an API to get additional data for each row in my dataset. Can I use vectorized approach while making API calls as well?
@robmulla Жыл бұрын
That's kind of different. You just want to gather the results from the api as fast as possible. Check out something like async calls to the api. This might help: stackoverflow.com/questions/71232879/how-to-speed-up-async-requests-in-python
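A bare-bones sketch of the async idea, using only the standard library. `fetch_extra` is a stand-in that sleeps instead of calling a real API, and the concurrency limit of 10 is an arbitrary assumption; a real version would swap in an async HTTP client and respect the API's rate limits.

```python
import asyncio


async def fetch_extra(row_id):
    # Placeholder for a real HTTP request to the API
    await asyncio.sleep(0.01)
    return {"id": row_id, "extra": row_id * 2}


async def fetch_all(row_ids, limit=10):
    semaphore = asyncio.Semaphore(limit)  # cap the number of requests in flight

    async def bounded(row_id):
        async with semaphore:
            return await fetch_extra(row_id)

    # gather runs the calls concurrently and returns results in input order
    return await asyncio.gather(*(bounded(r) for r in row_ids))


results = asyncio.run(fetch_all(range(50)))
```

Because the results come back in input order, they can be turned into a DataFrame and joined back onto the original rows afterwards. Inside a Jupyter notebook, which already runs an event loop, `await fetch_all(...)` would be used instead of `asyncio.run(...)`.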
@Pranavshashi Жыл бұрын
@@robmulla thanks!
@kathrynpanger22892 жыл бұрын
What if I want to apply a more complicated or non-numeric test, like instead of df['pct-sleeping'] > 0.5 I was looking at whether "teeth" was in df['dream-themes'], (a list of the tags concerning the things the sleeper dreamt about e.g. [teeth, whale, dog, slide, school]). Is the only way to do this by with .apply or can this still be vectorized?
@robmulla2 жыл бұрын
This is a good question. I think it depends on the dtype of the dream-themes column. Would it only contain a single value or potentially multiple ones? Check the 'isin' function in the pandas docs; it's a vectorized way of doing this.
@DrewLevitt2 жыл бұрын
I haven't tested this but try df['dream-themes'].str.contains('teeth'). If df['dream-themes'] is a bunch of comma-delimited strings, you should be good to go (but watch out for partial matches e.g. "teeth whitening" contains "teeth"); not sure whether this will work if df['dream-themes'] contains a bunch of proper lists. Try it and let me know!
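A sketch of both cases raised in this thread, with made-up data and underscore column names. For comma-delimited strings, `str.contains` works but also matches partial tags; splitting and exploding the tags gives an exact match. For a column of real Python lists, `explode` can be applied directly.

```python
import pandas as pd

df = pd.DataFrame({
    "dream_themes": ["teeth, whale, dog", "slide, school", "teeth whitening, dog"],
    "dream_lists": [["teeth", "whale"], ["slide"], ["teeth whitening"]],
})

# Substring search: fast, but "teeth whitening" also counts as a hit
has_teeth_substring = df["dream_themes"].str.contains("teeth", regex=False)

# Exact tag match on strings: one row per tag, compare, then collapse per original row
tags = df["dream_themes"].str.split(",").explode().str.strip()
has_teeth_exact = tags.eq("teeth").groupby(level=0).any()

# Exact tag match on a column of lists: explode works directly
has_teeth_in_list = df["dream_lists"].explode().eq("teeth").groupby(level=0).any()
```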
@andrew3068 Жыл бұрын
Super awesome video.
@robmulla Жыл бұрын
I appreciate that. Thanks for commenting!
@alanhouston5874 Жыл бұрын
Can you save lists using Parquet Or is it only applicable to dataframes?
@A372575 Жыл бұрын
Thanks, one query: in the case of vectorizing, which one would be faster, np.where or the method you mentioned?
@scientistgeospatial Жыл бұрын
Thanks Rob
@robmulla Жыл бұрын
Thanks for watching!
@vinitjha_11 ай бұрын
which font do you use? That's awesome font and color scheme
@kennethstephani6922 жыл бұрын
Great video!!!
@robmulla2 жыл бұрын
Thank you!!
@Hitdouble Жыл бұрын
What theme do you use in your notebook?
@robmulla Жыл бұрын
I have a whole video on my jupyter lab setup. But it’s just the solarized dark theme.
@abdulkadirguven1173 Жыл бұрын
Thank you very much Rob.
@blakeedwards3582 Жыл бұрын
What theme are you using to get your Jupyter Notebook to look like that?
@robmulla Жыл бұрын
Solarized dark theme. I have a whole video about my jupyter setup
@bm647 Жыл бұрын
Great video! very useful
@robmulla Жыл бұрын
Glad you found it useful!
@amitamola2014 Жыл бұрын
So what about the scenario when we want to perform same operation but only in one column? Such as, if pct_sleep
@robmulla Жыл бұрын
You can use something like qcut in this case, or a vectorized statement with an and/or condition.
@Chris_87BC Жыл бұрын
Great video! I am currently looping through a data frame column for each customer and print the data to PDF. Is there a vectorized version that can be much faster?
@dh00mketu Жыл бұрын
Why didn't you remove get rewards function from other run times?
@robmulla Жыл бұрын
Oops. Did I do it incorrectly? Can you share the timestamp?
@chndrl5649 Жыл бұрын
Could also use query instead of loc
@robmulla Жыл бұрын
Not sure that would work for this case because we aren’t straight filtering.
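One way `query` could be made to fit this case, offered as an untested-in-the-video idea rather than a recommendation: `query` only filters, but the index of the filtered rows can be handed to `.loc` for the assignment. This assumes the DataFrame has a unique index; the small frame is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 95, 40],
    "time_in_bed": [8, 4, 3],
    "pct_sleeping": [0.9, 0.2, 0.4],
    "favorite_food": ["pizza", "taco", "ice-cream"],
    "hate_food": ["broccoli", "eggs", "candy corn"],
})

df["reward"] = df["hate_food"]

# query returns the matching rows; keep only their index labels
idx = df.query("(pct_sleeping > 0.5 and time_in_bed > 5) or age > 90").index
df.loc[idx, "reward"] = df.loc[idx, "favorite_food"]
```

Whether this is any faster than a plain boolean mask would need to be timed; it is mostly a readability choice.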
@jrwkc Жыл бұрын
when you vectorize with loc, don't you have to vectorize the right side of the equation too. df['favorite_food'] is not masked. It's the whole array. Right? So you are setting the reward to the first N of df['favorite_food'] where N is the length of the mask.
@robmulla Жыл бұрын
I don't think so because pandas will use the index when populating. But I'm also not 100% sure.
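A small check of the alignment question above, with invented data. If pandas filled the masked rows positionally with the first N values of `favorite_food`, rows 1 and 3 would receive "pizza" and "taco"; because assignment through `.loc` aligns on the index, each row gets its own favorite food.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 95, 40, 99],
    "favorite_food": ["pizza", "taco", "ice-cream", "sushi"],
    "hate_food": ["broccoli", "eggs", "candy corn", "liver"],
})

df["reward"] = df["hate_food"]
mask = df["age"] > 90  # True for rows 1 and 3 only

# The right-hand side is the full column; pandas matches it up by index label
df.loc[mask, "reward"] = df["favorite_food"]
```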
@jrwkc Жыл бұрын
@@robmulla make github repos so we can test! that would be great
@rahulkmail4 ай бұрын
Thanks for sharing a nice information
@leejunzhao Жыл бұрын
Question: I followed your code, and an error came up saying " 'reward_calc' is not defined ".
@wenbozhao4325 Жыл бұрын
In the video the fn name has no letter c at the end
@robmulla Жыл бұрын
Oh. Good catch. Sorry it was confusing
@beastmaroc7585 Жыл бұрын
Thank you so much for these game-changing tips...
@robmulla Жыл бұрын
Thanks for watching!
@DK-rl1sf2 жыл бұрын
Is there a way to use np.vectorize() instead of df.loc so things are tidier?
@robmulla2 жыл бұрын
That's a great point. I've used np.vectorize before but not too frequently. I agree the current solution isn't very clean to read and could be much tidier.
@DK-rl1sf2 жыл бұрын
@@robmulla No, this is not so much a suggestion and more a question. I'm new to this and literally don't know. I had previously read about np.vectorize(). I tried doing your vectorize method but using np.vectorize but couldn't figure out the syntax.
@robmulla2 жыл бұрын
@@DK-rl1sf Yea, there is a lot of overhead when using pandas instead of numpy - but you get the benefit of named columns, easy filtering and sorting. In my experience np.vectorize worked but i was working with just numpy arrays not pandas dataframes.
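For the syntax question in this thread, a sketch of `np.vectorize` applied to DataFrame columns. `pick_reward` is a hypothetical scalar function following the logic quoted elsewhere in the comments, not the function from the video. `otypes=[object]` is passed so that the output is not given a fixed-width string type based on the first result, which could cut off longer strings.

```python
import numpy as np
import pandas as pd


def pick_reward(pct_sleeping, time_in_bed, age, favorite_food, hate_food):
    # Ordinary scalar logic, so `and` / `or` are fine here
    if (pct_sleeping > 0.5 and time_in_bed > 5) or age > 90:
        return favorite_food
    return hate_food


df = pd.DataFrame({
    "age": [30, 95, 40],
    "time_in_bed": [8, 4, 3],
    "pct_sleeping": [0.9, 0.2, 0.4],
    "favorite_food": ["pizza", "taco", "ice-cream"],
    "hate_food": ["broccoli", "eggs", "candy corn"],
})

vectorized_reward = np.vectorize(pick_reward, otypes=[object])
df["reward"] = vectorized_reward(
    df["pct_sleeping"], df["time_in_bed"], df["age"],
    df["favorite_food"], df["hate_food"],
)
```

The call site is tidy, but `np.vectorize` still calls the Python function once per row under the hood, so it should not be expected to match the speed of the boolean-mask version.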
@elgoogffokcuf Жыл бұрын
What about Numba, if it can bring some more optimization, it will be nice if you make a video for it.
@robmulla Жыл бұрын
Numba/jit is great to speed up more complex operations. I've had limited experience with it, but every case it really sped things up. Doing it as a video is a good idea.
@lucienjaegers2028 Жыл бұрын
Nice trick, but what if you code it completely in C / C++ / Rust? Literature says those are 50 - 80 times faster?
@robmulla Жыл бұрын
I have a whole video on polars, which is written in rust. It’s faster for sure. But keep in mind pandas backend is just C code.
@ivanrubnenkov9192 ай бұрын
now instead of two actions + .loc in the third example use np.where for oneliner and it will be even faster
@alexisdebrand6209 Жыл бұрын
So useful, thank you!!!!!
@robmulla Жыл бұрын
You're welcome! Thanks for commenting.
@YuanYuan-uk8sz Жыл бұрын
thank you your very extremely perfect video,so so helpful for me,love you so much
@robmulla Жыл бұрын
I'm so glad! Share it with a friend or two who you think might also appreciate it.
@GroWithUmar2 жыл бұрын
Amazing video
@robmulla2 жыл бұрын
Thanks for the feedback!
@agoodwin-81272 жыл бұрын
I typically use numpy where in this situation (mainly because I like the syntax better!), so I was curious about the speed vs. the level 3 solution. Where ran a little faster (~15-20% for datasets sized 10K - 50M records).

    # level 3 - vectorized
    %%timeit
    df = get_data(10_000)
    df['reward'] = df['hate_food']
    df.loc[((df['pct_sleeping'] > 0.5) & (df['time_in_bed'] > 5)) | (df['age'] > 90), 'reward'] = df['favorite_food']
    # 3.74 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    # level 4 - where
    %%timeit
    df = get_data(10_000)
    df['reward'] = np.where(((df['pct_sleeping'] > 0.5) & (df['time_in_bed'] > 5)) | (df['age'] > 90), df['favorite_food'], df['hate_food'])
    # 3.15 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@robmulla2 жыл бұрын
I love that you ran that experiment! I'm actually surprised the numpy version isn't even faster. Thanks for sharing.
@georgebrandon7696 Жыл бұрын
@@robmulla I'll throw another wrench into this. df.at[] vs df.loc[]. df.at[] is considerably faster than df.loc[] for single-value access. But I've never run conditionals with df.at[]. I'm also an np.where() user. :)
@ErikS- Жыл бұрын
3.5 seconds for a for loop with only 10k rows... Is this done in a Docker container or another VM(-like) environment?
@robmulla Жыл бұрын
Just done locally on my fairly beefy machine.
@saxegothaea_conspicua Жыл бұрын
can you vectorize using query? I suppose you can't
@co.n.g.studios57102 жыл бұрын
Nice vid. Wouldn't it even be faster, if using the .values for the columns? is this even applicable in the case presented in the example? Looking forward to your answer, cheers
@robmulla2 жыл бұрын
Thanks for the comment. Yes using .values could be faster thanks for pointing that out. Not sure about specific part in this video but worth a try.
@bilalbayrakdar7100 Жыл бұрын
so sql like logical filtering is the real deal dude
@robmulla Жыл бұрын
Not sure if I follow. I do love SQL though!
@rickk3658 Жыл бұрын
3500 times faster is all well and good, but I'd like to know your speed up magic at 2:31. You were turning create dataset into a function. As you type the colon, the rest of the code in the cell became properly indented. My version of JupyterLab does not do that. What's the secret?
@robmulla Жыл бұрын
Oh. I’m using the black auto formatter with nb_black. It’s really helpful to keep your code clean in jupyter.
@TeXiCiTy Жыл бұрын
For looping over big datasets I switch to polars when speed becomes an issue.
@robmulla Жыл бұрын
I have an entire video on my channel about polars. It’s great! Check it out.
@krishnapullak Жыл бұрын
Nice tip
@robmulla Жыл бұрын
Thx for watching.
@SillyLittleMe Жыл бұрын
Hey, this is a great video and truly shows the benefit of vectorisation. I would like to point out that always remembering the vectorized way of writing is hard. Fortunately, the NumPy module provides a neat function called "vectorize" that vectorizes your non-vectorized function. An example (from the docs):

    ## this is the function
    def myfunc(a, b):
        "Return a-b if a>b, otherwise return a+b"
        if a > b:
            return a - b
        else:
            return a + b

    ## vectorising the function and then applying it
    vfunc = np.vectorize(myfunc)
    vfunc([1, 2, 3, 4], 2)
    array([3, 4, 1, 2])

This works on DataFrames as well. Do note, though, that this is not true vectorisation; because of that, in some cases it performs similarly to functions like "apply". However, for the most part it does a tremendous job and has significantly increased the speed of my functions. The reasons why it is not "true vectorisation" are mentioned in this thread: stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c
@robmulla Жыл бұрын
Thanks! I've used that before and it does come in handy. Also using things like jit/numba can compile numpy operations.