Really liked how you showed how to pipeline the data to Keras! None of the other Dask videos I've seen show those next steps; it's just pandas stuff.
@skyleryoung1010 · 4 years ago
Great video! Great use of examples and explained in a way that made Dask very approachable. Thanks!
@Aman-Satat · 3 days ago
Thank you! Got what I was looking for.
@cristian-bull · 3 years ago
I really appreciate the batch-on-the-fly example with keras.
@apratimdey6118 · a year ago
Hi Dan, this was a great introductory video. I am learning Dask and this was very helpful.
@guideland1 · 3 years ago
Really great Dask introduction and the explanation is so easy to understand. That was useful. Thank you!
@alikoko1 · 2 years ago
Great video! Thank you man 👍
@sripadvantmuri89 · 2 years ago
Great explanations for beginners!! Thanks for this...
@dwivedys · 3 years ago
Excellent!
@summerxia7474 · 2 years ago
Very nice video!!!! Thank you so much!!! Please make more hands-on videos like this! You explain things very clearly and helpfully!!!!
@arashkhosravi1982 · 3 years ago
Great explanation. It was not long, and it is very interesting.
@vitorruffo9431 · 3 years ago
Good work sir, your video has helped me to get started with Dask. Thank you very much.
@terraflops · 3 years ago
Very helpful in understanding Dask; I think I will start using it on my projects.
@ГерманРыков-ъ6в · 2 years ago
Amazing. Very interesting topic.
@krocodilnaohote1412 · 2 years ago
Man, great video, thank you!
@datadiggers_ru · 2 years ago
Great intro. Thank you
@vireshsaini705 · 3 years ago
Really good explanation of working with Dask :)
@parikannappan1580 · 2 years ago
4:12, visualize(): where can I get documentation? I tried to use it for sorted() and it did not work.
@teslaonly2136 · 3 years ago
Very nice explanation! Thanks!
@julianurrea · 4 years ago
Good job man, I'm starting with Dask and I am excited about its capabilities! Thanks a lot.
@priyabratasinha1478 · 3 years ago
Thank you so much man... you saved a project... 🙂🙂❤❤🙏🙏
@bodenseeboys · 2 years ago
Chapeau, well explained and healthy usage of memes!
@sanketwebsites9764 · 3 years ago
Awesome video. Thanks a ton!
@af121x · 3 years ago
Thank you so much. This video is really helpful to me.
@ChaitanyaBelhekar · 4 years ago
Thanks for a great introductory video on Dask!
@the-data-queerie · 3 years ago
Thanks for this super informative video!
@shandi1241 · 3 years ago
Thanks man, you made my morning.
@_Apex · 3 years ago
Amazing video!
@magicpotato1707 · 3 years ago
Thanks man! Very informative.
@piotr780 · 2 years ago
Very good! Thank you! :)
@Rafaelkenjinagao · 4 years ago
Many thanks Dan. Good job, sir!
@someirrelevantthings6198 · 3 years ago
Literally man, you saved me. Thanks a ton
@jameswestwood1268 · 4 years ago
Amazing. Please expand and create a section on Dask for Machine Learning.
@danbochman · 4 years ago
Thank you! Dask's machine learning support has gotten even better since I made this video. The Dask-ML package mimics the scikit-learn API and model selection, and the new SciKeras package wraps Keras models in the scikit-learn API, so the transition is pretty seamless. Is there anything specific you wish you could have more guidance on?
@NitinPasumarthy · 4 years ago
Very well explained! Thank you.
@danbochman · 4 years ago
Very happy to hear! If there's any particular topic which interests you, feel free to suggest!
@NitinPasumarthy · 4 years ago
@danbochman Maybe how to build custom UDFs like in Spark?
@danbochman · 4 years ago
@NitinPasumarthy Hmm, to be honest I don't know much about this topic. But I just released a video about visualizing big data with Datashader + Bokeh (I remember you asked for that before): www.linkedin.com/posts/danbochman_datashader-in-15-minutes-machine-learning-activity-6639434661288263680-g62W Hope you'll like it!
@FabioRBelotto · 2 years ago
Thanks man. It's really hard to find information about Dask.
@-mwolf · a year ago
Awesome intro.
@vegaalej · 2 years ago
Great video and explanation! Thank you very much for it! It is really helpful! I tried to run the notebook, and it ran pretty well after some minor updates. I just had problems running the last part, "never run out-of-memory while training"; the generator or steps_per_epoch part is giving some problem I can't figure out how to solve. Any suggestion on how to fix the code? Thanks!
InvalidArgumentError: Graph execution error:
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
Traceback (most recent call last):
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
@AlejandroGarcia-ib4kb · 2 years ago
Interesting, I have the same problem in the last part of the notebook. Seems it is related to the IDE; it needs an update.
@haydn.murray · 4 years ago
Fantastic video! Keep it up!
@DanielWeikert · 3 years ago
Great. I would also like to see more on Dask and deep learning. How exactly would this generator be used in PyTorch, instead of the DataLoader? Thanks for the video(s).
@luisalbertovivas5174 · 4 years ago
I liked it, Dan. I want to try how it works with scikit-learn and all the models.
@danbochman · 4 years ago
For scikit-learn and other popular models such as XGBoost, it's even easier! Everything is already implemented in the dask-ml package: ml.dask.org/
@jodiesimkoff5778 · 4 years ago
Great video, looking forward to exploring more of your content! If you don't mind, could you share any details about your Jupyter notebook setup (extensions, etc.)? It looks great.
@danbochman · 4 years ago
Hey Jodie, really glad you liked it! It's actually not a Jupyter Notebook, it's Google Colaboratory, which is essentially very similar, but runs on Google's servers and not your local machine. I highly recommend it if you're doing work that is not hardware/storage demanding.
@felixtechverse · 2 years ago
How do you schedule a Dask task? For example, how would you set a script to run every day at 10:00 with Dask?
@FabioRBelotto · 2 years ago
I've added dask.delayed to some functions. When I visualize it, there are several parallel tasks planned, but my CPU does not seem to be affected (it's only using a small percentage of it).
@unathimatu · 3 years ago
This is really great.
2 years ago
Hi 😀 Dan. Thank you for this video. Do you have an example which uses the apply() function? I want to create a new column based on a data transformation. Thank you!
@asd222treed · 2 years ago
Great video! Thank you for sharing. But I think there may be some incorrect code in the machine learning with Dask part. There is no X in the code (model.add(..., input_dim=X.shape[1], ...)), and when training, TensorFlow says model.fit_generator is deprecated, and finally it displays an error: AttributeError: 'tuple' object has no attribute 'rank'.
@danbochman · 2 years ago
Hey! Whoops, I must've changed the variable name X to df_train and wasn't consistent in the process; it probably didn't pop a message for me because X was still in my notebook workspace. You can either change df_train to X or change X to df_train. Just be consistent and it should work!
@DontMansion · 3 years ago
Hello. Can I use Dask for NumPy without using "defs"? Thank you.
@BilalAslamIsAwesome · 4 years ago
Hello! Thank you for your tutorial. I downloaded and ran your notebook; at the Keras steps I'm getting an ('X' is not defined) error. I can't see where it was created either. Any ideas on how I can fix this to run it?
@danbochman · 4 years ago
Hey Bilal! Whoops, I must've changed the variable name X to df_train and wasn't consistent in the process; it probably didn't pop a message for me because X was still in my notebook workspace. You can either change df_train to X or change X to df_train. Just be consistent and it should work!
@BilalAslamIsAwesome · 4 years ago
@danbochman Great! Thank you, I will try that. Have you been using Dask for everything now, or do you still use pandas and NumPy?
@danbochman · 4 years ago
@BilalAslamIsAwesome Hope it worked for you! Please let me know if it didn't. Yes, I use Dask for mostly anything. Some exceptions:
1. Pandas Profiling - a great package for EDA (I have an old video on it) which I ALWAYS use when I first approach a new dataset; it doesn't play well with Dask.
2. Complex (interactive) visualizations - my favorite package for this is Plotly, which doesn't synergize well with Dask. If plotting all the data points is a must, then I will use Dask + Datashader + Bokeh.
@joseleonardosanchezvasquez1514 · 2 years ago
Great
@abhisekbus · 3 years ago
Came to handle a 5 GB CSV, stayed for the capybara with friends.
@adityaanvesh · 4 years ago
Great intro to Dask for a pandas guy like me...
@indian-inshorts5786 · 4 years ago
It's really helpful for me, thank you! May I provide a link to your channel in my notebook on Kaggle?
@danbochman · 4 years ago
Very happy to hear that you've found the video helpful! (About the link) Sure! The whole purpose of these videos is to share them ;)
@vegaalej · 2 years ago
Many thanks for this excellent video! It is really clear and helpful! I just have one question. I tried to run the notebook, and it ran pretty well after some minor updates. Just the last line I was not able to make run:
# never run out-of-memory while training
model.fit_generator(generator=dask_data_generator(df_train), steps_per_epoch=100)
It gives me an error message:
InvalidArgumentError: Graph execution error:
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
Traceback (most recent call last):
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
[[{{node PyFunc}}]]
[[IteratorGetNext]]
[Op:__inference_train_function_506]
Any recommendation on how I should modify it to make it run? Thanks, AG
@FindMultiBagger · 4 years ago
Thanks! Nice tutorial 🙌🏼 Can we use any ML library on it? Like scikit-learn, PyCaret, etc.?
@danbochman · 4 years ago
To the best of my knowledge, not directly. You can use any ML/DL framework after you persist your Dask array to memory. However, Dask has its own dask-ml package, to which contributors have migrated most of the common use cases from scikit-learn and PyCaret.
@FindMultiBagger · 4 years ago
@danbochman Can you make a simple use case 🙂 integrating Dask with PyCaret? It would be so helpful, because I have used approaches like Ray, Modin, RAPIDS, and PySpark, but all have some limitations. If you help integrate Dask with PyCaret, it would help the open-source community a lot :) Looking forward to hearing from you :) Thanks
@anirbanaws143 · 4 years ago
How do I handle it if my data is .xlsx?
@danbochman · 4 years ago
To be honest, if you're working with a huge .xlsx file and best performance is necessary, then I would recommend rewriting the .xlsx to .csvs with:
import csv
from openpyxl import load_workbook
wb = load_workbook(filename="data.xlsx", read_only=True, data_only=True)
for sheet_name in wb.sheetnames:
    ws = wb[sheet_name]
    with open("output/%s.csv" % sheet_name.replace(" ", ""), "w", newline="") as f:
        c = csv.writer(f)
        for r in ws.rows:
            c.writerow([cell.value for cell in r])
If this is not an option for you, you can call Dask's delayed decorator on pandas' read_excel function like so:
delayed_ddf = dask.delayed(pd.read_excel)("data.xlsx", sheet_name=0)
ddf = dd.from_delayed(delayed_ddf)
But you won't see a huge performance increase. Hope it helped!
@dendi1076 · 4 years ago
Hi sir, for Dask to work on my laptop, do I need to have more than 1 core? What if I only have 1 core on my laptop and no other nodes to work with? Will Dask still be helpful for reading a CSV file that has millions of rows and help me speed up the process?
@danbochman · 4 years ago
Hello dendi,
If you only have 1 core, you won't be able to speed up performance with parallelization, as you don't have the ability to open different processes/threads. However, you would still be more memory efficient if you have a huge dataset on limited RAM. Dask can still help you work with data that is bigger than or close to your RAM size (e.g. 4GB-8GB for most laptops), by only fetching what is relevant for the current computation in-memory.
@raafhornet · 4 years ago
Would it still work if the fraction of data you're sampling for the model is larger than your memory?
@danbochman · 4 years ago
Sorry for the late reply, this comment didn't pop up in my notifications... If a fraction of the data is larger than your memory, then no, Dask can't handle data bigger than your in-memory capabilities. It just means that fraction should be fractioned even more :)
@Единыймир.Переводыисубтитры · 3 years ago
Many thanks. Now I understand why the file was not read.
@madhu1987ful · 3 years ago
Hey, thanks for the awesome video and the explanation. I have a use case. I am trying to build a deep learning TensorFlow model for time series forecasting. For this I need to use a multi-node cluster for parallelization across all nodes. I have a single function which can take data for any 1 store and predict for that store. Likewise I need to do predictions for 2 lakh outlets. How can I use Dask to parallelize this huge task across all nodes of my cluster? Can you please guide me? Thanks in advance.
@danbochman · 3 years ago
Hi Madhu,
Sorry, wish I could help, but node cluster parallelization tasks depend more on the framework itself (e.g. Kubernetes) than on Dask. You have the dask.distributed module (distributed.dask.org/en/stable/), but handling the multi-worker infrastructure is where the real challenge lies...
@emmaf4320 · 3 years ago
Hey Dan, I really like the video! I have a question with regard to the for loop example. Why did it only save half of the time? To my understanding, delayed(inc) took about 1s because of the parallel computing? What else took time to operate?
@danbochman · 3 years ago
Hey Lyann,
There's a limit to how much you can parallelize. It depends on how many cores you have and how many threads can be opened within each process, and in addition, how much of your computer's resources are available (e.g. if several Chrome tabs are open).
@emmaf4320 · 3 years ago
@danbochman Thank you!
@markp2381 · 3 years ago
Great content! One question: isn't it strange to use a function in this form: function(do_something)(on_variable) instead of function(do_something(on_variable))?
@danbochman · 3 years ago
Hey Mark! Thanks. I understand what you mean, but when a function returns a function this makes sense, as opposed to a function which outputs the input to the next function. delayed(fn) returns a new function (e.g. "delayed_fn"), and this new function is then called regularly: delayed_fn(x). So it's delayed(fn)(x). All decorators are functions which return callable functions. In this example they are used quite unnaturally because I wanted to keep both versions of the functions. Hope the explanation helped!
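The "function returning a function" shape can be shown without Dask at all. This toy wrapper is purely illustrative (it is NOT Dask's real implementation); it just records the call instead of running it, to mimic the delayed(fn)(x) pattern:

```python
def delayed(fn):
    # Returns a NEW function that, when called, records the call
    # instead of executing fn immediately (toy stand-in for dask.delayed)
    def wrapper(*args, **kwargs):
        return ("deferred", fn, args, kwargs)
    return wrapper

def inc(x):
    return x + 1

task = delayed(inc)(41)   # delayed(inc) -> wrapper; wrapper(41) -> a "task"
print(task[0])            # 'deferred' -- nothing has been computed yet

# "Compute" later by actually calling the recorded function
tag, fn, args, kwargs = task
print(fn(*args, **kwargs))  # 42
```

So the outer call builds a new callable, and the second set of parentheses invokes it, which is exactly why the two-call form reads as function(do_something)(on_variable).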
@kalagaarun9638 · 4 years ago
Thanks sir... very well explained and covered a wide range of topics!! By the way, what are chunks in dask.array? I find it a bit hard to understand... Can you explain?
@danbochman · 4 years ago
Hey! Sorry for the delay, for some reason YouTube doesn't notify me about comments on my videos... (although it's set to do so)
If your question is still relevant: chunks are basically, as the name suggests, chunks of your big arrays split into smaller arrays (in Dask -> NumPy ndarrays). There exists a certain sweet spot of computational efficiency (parallelization potential) and memory capability (in-memory computations) for which a chunk size is optimal. Given your number of CPU cores, available RAM, and file size, Dask finds the optimal way to split your data into chunks which would be best processed by your machine.
You might ask: "Isn't fetching only a certain number of data rows each time sufficient?" Well, not exactly; the bottleneck may be at the column dimension (e.g. 10 rows, 2M columns), so row operations should split each row into chunks of columns for more efficient computation.
Hope it helped clear up some concepts for you!
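A tiny illustration of what chunking looks like in practice (the array and chunk sizes here are arbitrary; assumes dask is installed):

```python
import dask.array as da

# A 10,000 x 10,000 array of ones, split into 1,000 x 1,000 chunks:
# Dask tracks a 10 x 10 grid of small NumPy blocks instead of
# one giant in-memory array
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))
print(x.numblocks)   # (10, 10)
print(x.chunksize)   # (1000, 1000)

# Each block can be reduced independently (and in parallel),
# then the per-block results are combined
total = x.sum().compute()
print(total)         # 100000000.0
```

Picking chunks too small adds scheduling overhead per block; too large and a single block may not fit in RAM, which is the sweet spot trade-off described above.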