Dask in 15 Minutes | Machine Learning & Data Science Open-source Spotlight #5

50,691 views

Dan Bochman

Comments: 85
@mwd6478 4 years ago
Really liked how you showed how to pipeline the data to keras! None of the other Dask videos I've seen show those next steps, it's just pandas stuff
@skyleryoung1010 4 years ago
Great video! Great use of examples and explained in a way that made Dask very approachable. Thanks!
@Aman-Satat 3 days ago
thank you! got what i was looking for
@cristian-bull 3 years ago
I really appreciate the batch-on-the-fly example with keras.
@apratimdey6118 1 year ago
Hi Dan, this was a great introductory video. I am learning Dask and this was very helpful.
@guideland1 3 years ago
Really great Dask introduction and the explanation is so easy to understand. That was useful. Thank you!
@alikoko1 2 years ago
Great video! Thank you man 👍
@sripadvantmuri89 2 years ago
Great explanations for beginners!! Thanks for this...
@dwivedys 3 years ago
Excellent!
@summerxia7474 2 years ago
Very nice video! Thank you so much! Please make more hands-on videos like this one; you explain things very clearly and helpfully!
@arashkhosravi1982 3 years ago
Great explanation. It was not long, and it is very interesting.
@vitorruffo9431 3 years ago
Good work sir, your video has helped me to get started with Dask. Thank you very much.
@terraflops 3 years ago
Very helpful in understanding Dask; I think I will start using it on my projects.
@ГерманРыков-ъ6в 2 years ago
Amazing. Very interesting topic.
@krocodilnaohote1412 2 years ago
Man, great video, thank you!
@datadiggers_ru 2 years ago
Great intro. Thank you
@vireshsaini705 3 years ago
Really good explanation of working with Dask :)
@parikannappan1580 2 years ago
4:12, visualize(): where can I get documentation? I tried to use it on sorted() and it did not work.
@teslaonly2136 3 years ago
Very nice explanation! Thanks!
@julianurrea 4 years ago
Good job man, I'm starting with Dask and I am excited about its capabilities! Thanks a lot.
@priyabratasinha1478 3 years ago
Thank you so much man... you saved a project... 🙂🙂❤❤🙏🙏
@bodenseeboys 2 years ago
Chapeau, well explained and healthy usage of memes!
@sanketwebsites9764 3 years ago
Awesome video. Thanks a ton!
@af121x 3 years ago
Thank you so much. This video is really helpful to me.
@ChaitanyaBelhekar 4 years ago
Thanks for a great introductory video on Dask!
@the-data-queerie 3 years ago
Thanks for this super informative video!
@shandi1241 3 years ago
Thanks man, you made my morning.
@_Apex 3 years ago
Amazing video!
@magicpotato1707 3 years ago
Thanks man! Very informative.
@piotr780 2 years ago
Very good! Thank you! :)
@Rafaelkenjinagao 4 years ago
Many thanks Dan. Good job, sir!
@someirrelevantthings6198 3 years ago
Literally man, you saved me. Thanks a ton
@jameswestwood1268 4 years ago
Amazing. Please expand and create a section on Dask for Machine Learning.
@danbochman 4 years ago
Thank you! Dask's machine learning support has gotten even better since I made this video. The Dask-ML package mimics the scikit-learn API and model selection, and the new SciKeras package wraps Keras models in the scikit-learn API, so the transition is pretty seamless. Is there anything specific you wish you had more guidance on?
@NitinPasumarthy 4 years ago
Very well explained! Thank you.
@danbochman 4 years ago
Very happy to hear! If there's any particular topic which interests you, feel free to suggest it!
@NitinPasumarthy 4 years ago
@@danbochman Maybe how to build custom UDFs like in Spark?
@danbochman 4 years ago
@@NitinPasumarthy Hmm, to be honest I don't know much about this topic. But I just released a video about visualizing big data with Datashader + Bokeh (I remember you asked for that before): www.linkedin.com/posts/danbochman_datashader-in-15-minutes-machine-learning-activity-6639434661288263680-g62W Hope you'll like it!
@FabioRBelotto 2 years ago
Thanks man. It's really hard to find information about Dask.
@-mwolf 1 year ago
awesome intro
@vegaalej 2 years ago
Great video and explanation! Thank you very much for it! It is really helpful! I tried to run the notebook, and it ran pretty well after some minor updates. I just had problems running the last part, "never run out-of-memory while training"; the generator or steps_per_epoch part gives an error I can't figure out how to solve. Any suggestion on how to fix the code? Thanks!
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
@AlejandroGarcia-ib4kb 2 years ago
Interesting, I have the same problem in the last part of the notebook. It seems related to the IDE; it needs an update.
@haydn.murray 4 years ago
Fantastic video! Keep it up!
@DanielWeikert 3 years ago
Great. I would also like to see more on Dask and deep learning. How exactly would this generator be used in PyTorch, instead of the DataLoader? Thanks for the video(s).
@luisalbertovivas5174 4 years ago
I liked it, Dan. I want to try how it works with scikit-learn and all the models.
@danbochman 4 years ago
For Scikit-learn and other popular models such as XGBoost, it's even easier! Everything is already implemented in the dask-ml package: ml.dask.org/
@jodiesimkoff5778 4 years ago
Great video, looking forward to exploring more of your content! If you don’t mind, could you share any details about your jupyter notebook setup (extensions, etc)? It looks great.
@danbochman 4 years ago
Hey Jodie, really glad you liked it! It's actually not a Jupyter Notebook, it's Google Colaboratory, which is essentially very similar, but runs on Google's servers and not your local machine. I highly recommend it if you're doing work that is not hardware/storage demanding.
@felixtechverse 2 years ago
How do you schedule a Dask task? For example, how do you put a script to run every day at 10:00 with Dask?
@FabioRBelotto 2 years ago
I've added dask.delayed to some functions. When I visualize it, there are several parallel tasks planned, but my CPU does not seem to be affected (it's only using a small percentage).
@unathimatu 3 years ago
This is really great.
2 years ago
Hi 😀 Dan. Thank you for this video. Do you have an example which uses the apply() function? I want to create a new column based on a data transformation. Thank you!
@asd222treed 2 years ago
Great video! Thank you for sharing. But I think there may be some incorrect code in the machine learning with Dask part. There is no X in the code (model.add(..., input_dim=X.shape[1], ...)), and when training, TensorFlow says model.fit_generator is deprecated, and finally it displays an error: AttributeError: 'tuple' object has no attribute 'rank'.
@danbochman 2 years ago
Hey! Whoops, I must've changed the variable name X to df_train and wasn't consistent in the process; it probably didn't pop a message for me because X was still in my notebook workspace. You can either change df_train to X or change X to df_train. Just be consistent and it should work!
@DontMansion 3 years ago
Hello. Can I use Dask for NumPy without using "defs"? Thank you.
@BilalAslamIsAwesome 4 years ago
Hello! Thank you for your tutorial. I downloaded and ran your notebook; at the Keras steps I'm getting an ('X' is not defined) error. I can't see where it was created either. Any ideas on how I can fix this?
@danbochman 4 years ago
Hey Bilal! Whoops, I must've changed the variable name X to df_train and wasn't consistent in the process; it probably didn't pop a message for me because X was still in my notebook workspace. You can either change df_train to X or change X to df_train. Just be consistent and it should work!
@BilalAslamIsAwesome 4 years ago
@@danbochman Great! Thank you, I will try that. Have you been using Dask for everything now, or do you still use pandas and NumPy?
@danbochman 4 years ago
@@BilalAslamIsAwesome Hope it worked for you! Please let me know if it didn't. Yes, I use Dask for mostly anything. Some exceptions:
1. Pandas Profiling - a great package (I have an old video on it) for EDA, which I ALWAYS use when I first approach a new dataset; it doesn't play well with Dask.
2. Complex (interactive) visualizations - my favorite package for this is Plotly, which doesn't synergize well with Dask. If plotting all the data points is a must, then I will use Dask + Datashader + Bokeh.
@joseleonardosanchezvasquez1514 2 years ago
Great
@abhisekbus 3 years ago
Came to handle a 5gig csv, stayed for capybara with friends.
@adityaanvesh 4 years ago
Great intro to Dask for a pandas guy like me...
@indian-inshorts5786 4 years ago
It's really helpful for me, thank you! May I provide a link to your channel in my notebook on Kaggle?
@danbochman 4 years ago
Very happy to hear that you've found the video helpful! (About the link) Sure! The whole purpose of these videos is to share them ;)
@vegaalej 2 years ago
Many thanks for this excellent video! It is really clear and helpful! I just have one question. I tried to run the notebook, and it ran pretty well after some minor updates. Just the last line I was not able to run:
# never run out-of-memory while training
model.fit_generator(generator=dask_data_generator(df_train), steps_per_epoch=100)
It gives me an error message:
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected. [[{{node PyFunc}}]] [[IteratorGetNext]] [Op:__inference_train_function_506]
Any recommendation on how I should modify it to make it run? Thanks, AG
@FindMultiBagger 4 years ago
Thanks! Nice tutorial 🙌🏼 Can we use any ML library on it? Like scikit-learn, PyCaret, etc.
@danbochman 4 years ago
To the best of my knowledge, not directly. You can use any ML/DL framework after you persist your Dask array to memory. However, Dask has its own dask-ml package, in which contributors have migrated most of the common use cases from scikit-learn and PyCaret.
@FindMultiBagger 4 years ago
@@danbochman Can you make a simple use case 🙂: 1) Integrate Dask with PyCaret; it would be so helpful. Because I have used other approaches like Ray, Modin, RAPIDS, and PySpark, but all have some limitations. If you help integrate Dask with PyCaret, it would help the open-source community a lot :) Looking forward to hearing from you :) Thanks
@anirbanaws143 4 years ago
How do I handle it if my data is .xlsx?
@danbochman 4 years ago
To be honest, if you're working with a huge .xlsx file and best performance is necessary, then I would recommend rewriting the .xlsx to .csvs with:
import csv
from openpyxl import load_workbook
wb = load_workbook(filename="data.xlsx", read_only=True, data_only=True)
for name in wb.sheetnames:
    ws = wb[name]
    with open("output/%s.csv" % name.replace(" ", ""), "w", newline="") as f:
        c = csv.writer(f)
        for r in ws.rows:
            c.writerow([cell.value for cell in r])
If this is not an option for you, you can call Dask's delayed decorator on pandas' read_excel function like so:
delayed_ddf = dask.delayed(pd.read_excel)("data.xlsx", sheet_name=0)
ddf = dd.from_delayed(delayed_ddf)
But you won't see a huge performance increase. Hope it helped!
@dendi1076 4 years ago
Hi sir, for Dask to work on my laptop, do I need more than one core? What if I only have one core on my laptop and no other nodes to work with: will Dask still be helpful for reading a CSV file that has millions of rows, and help me speed up the process?
@danbochman 4 years ago
Hello dendi, if you only have one core, you won't be able to speed up performance with parallelization, as you don't have the ability to open different processes/threads. However, you would still be more memory efficient if you have a huge dataset on limited RAM. Dask can still help you work with data that is bigger than or close to your RAM size (e.g. 4GB-8GB for most laptops), by only fetching what is relevant for the current computation in-memory.
@raafhornet 4 years ago
Would it still work if the fraction of data you're sampling for the model is larger than your memory?
@danbochman 4 years ago
Sorry for the late reply; this comment didn't pop up in my notifications... If a fraction of the data is larger than your memory, then no: Dask can't handle data bigger than your in-memory capabilities. It just means that fraction should be fractioned even more :)
@Единыймир.Переводыисубтитры 3 years ago
Many thanks. Now I understand why the file was not read.
@madhu1987ful 3 years ago
Hey, thanks for the awesome video and the explanation. I have a use case: I am trying to build a deep learning TensorFlow model for time series forecasting. For this I need to use a multi-node cluster for parallelization across all nodes. I have a single function which can take data for any one store and predict for that store. Likewise I need to do predictions for 200,000 outlets. How can I use Dask to parallelize this huge task across all nodes of my cluster? Can you please guide me? Thanks in advance.
@danbochman 3 years ago
Hi Madhu, sorry, I wish I could help, but node cluster parallelization tasks depend more on the framework itself (e.g. Kubernetes) than on Dask. You have the dask.distributed module (distributed.dask.org/en/stable/), but handling the multi-worker infrastructure is where the real challenge lies...
@emmaf4320 3 years ago
Hey Dan, I really like the video! I have a question about the for-loop example. Why did it only save half of the time? To my understanding, delayed(inc) took about 1s because of the parallel computing. What else took time to operate?
@danbochman 3 years ago
Hey Lyann, there's a limit to how much you can parallelize. It depends on how many cores you have and how many threads can be opened within each process, and in addition, how much of your computer's resources are available (e.g. several Chrome tabs being open).
@emmaf4320 3 years ago
@@danbochman Thank you!
@markp2381 3 years ago
Great content! One question: isn't it strange to use a function in the form function(do_something)(on_variable) instead of function(do_something(on_variable))?
@danbochman 3 years ago
Hey Mark! Thanks. I understand what you mean, but when a function returns a function, this makes sense, as opposed to a function which outputs the input to the next function. delayed(fn) returns a new function (e.g. "delayed_fn"), and this new function is then called regularly: delayed_fn(x). So it's delayed(fn)(x). All decorators are functions which return callable functions. In this example they are used quite unnaturally because I wanted to keep both versions of the functions. Hope the explanation helped!
@kalagaarun9638 4 years ago
Thanks sir... very well explained, and you covered a wide range of topics!! By the way, what are chunks in dask.array? I find it a bit hard to understand... Can you explain?
@danbochman 4 years ago
Hey! Sorry for the delay; for some reason YouTube doesn't notify me of comments on my videos... (although it's set to do so). If your question is still relevant: chunks are, as the name suggests, chunks of your big arrays split into smaller arrays (in Dask, NumPy ndarrays). There is a certain sweet spot of computational efficiency (parallelization potential) and memory capability (in-memory computations) for which a chunk size is optimal. Given your number of CPU cores, available RAM, and file size, Dask finds the optimal way to split your data into chunks that your machine can process most efficiently.
You might ask: "Isn't fetching only a certain number of data rows each time sufficient?" Well, not exactly: the bottleneck may be at the column dimension (e.g. 10 rows, 2M columns), so row operations should split each row into chunks of columns for more efficient computation. Hope it helped clear up some concepts for you!
@kalagaarun9638 4 years ago
@@danbochman It helped a lot, thanks sir!
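Chunks are easy to inspect directly; a small sketch (the array and chunk sizes are chosen arbitrarily for illustration):

```python
import dask.array as da

# a 1,000 x 1,000 array split into 250 x 500 NumPy chunks
x = da.ones((1000, 1000), chunks=(250, 500))

# .chunks shows the split along each axis: 4 row blocks, 2 column blocks
row_chunks, col_chunks = x.chunks

total = x.sum().compute()  # each chunk is summed in parallel, then combined
```

Each chunk is an ordinary NumPy array that fits comfortably in memory, which is what lets reductions like sum() run block by block.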