Learn to Use a CUDA GPU to Dramatically Speed Up Code In Python

86,081 views

Pragmatic AI Labs

Days ago

Comments: 46
@pkeric2626
@pkeric2626 3 years ago
Holy shit, I was looking into this to speed up my Mandelbrot zooms, and that's exactly what you use as an example! This is a dream come true!
@somefriday7594
@somefriday7594 4 years ago
This is amazing! Thank you for taking the effort to make it!
@ChristopheKumsta
@ChristopheKumsta 4 years ago
Hello, thank you for this great introduction to Numba and, more specifically, Numba + CUDA. It really is a very easy way to harness the power of CUDA in simple Python scripts. There is a mistake in the "cuda" example, though: you are calling the regular "create_fractal" instead of the "mandel_kernel" CUDA version. And if you call the CUDA version, "mandel_kernel", you also have to specify the size of the grid (be careful: x and y are reversed). The final call for the CUDA version of the Mandelbrot example is therefore:
image = np.zeros((1024, 1536), dtype=np.uint8)
start = timer()
mandel_kernel[1536, 1024](-2.0, 1.0, -1.0, 1.0, image, 20)
dt = timer() - start
print("Mandelbrot created in %f s" % dt)
imshow(image)
show()
@pragmaticai
@pragmaticai 3 years ago
Thanks, I will take a look.
@codeonion
@codeonion 4 years ago
Awesome! Learning never stops.
@chetana9802
@chetana9802 4 years ago
Good stuff on here :) I like how you made the website documenting the video notes for later reference.
@ramoni3608
@ramoni3608 4 years ago
I tried to follow this on my Windows 10 machine. The function called at 7:16 is still create_fractal() and not mandel_kernel(), so I don't see why it is faster. When I changed it to mandel_kernel(), it complained that I had to provide a launch configuration telling the GPU how many grids and blocks to create. After first properly setting grid and block variables, I added it like so:
mandel_kernel[grid, block](-2.0, 1.0, -1.0, 1.0, image, 20)
It then worked and really was nearly 100x faster than the jit version.
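The grid and block variables this comment mentions (but does not show) can be computed with a ceiling division so the grid covers every pixel of the image; the 32x8 block size below is an illustrative choice, not something from the video:

```python
import math

# Image size from the video's Mandelbrot example
# (note: the NumPy array shape is (rows, cols) = (height, width)).
width, height = 1536, 1024

# Pick a 2D thread block; 32 x 8 = 256 threads is a common starting point.
threads_per_block = (32, 8)

# Round up so the grid covers the whole image even when the image size
# is not an exact multiple of the block size.
blocks_per_grid = (
    math.ceil(width / threads_per_block[0]),   # 48 blocks in x
    math.ceil(height / threads_per_block[1]),  # 128 blocks in y
)
print(blocks_per_grid)  # (48, 128)

# On a CUDA-capable machine the launch would then be:
# mandel_kernel[blocks_per_grid, threads_per_block](-2.0, 1.0, -1.0, 1.0, image, 20)
```

Inside the kernel, each thread would compute its own pixel coordinates and skip any position that falls outside the image bounds.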
@Zartymil
@Zartymil 4 years ago
Yes, I think he made a little mistake there. The first call to a jit function is generally really slow (I guess because the compiling is happening), and from then on it's blazing fast.
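The warm-up effect described here is easy to measure yourself: time the same jitted function twice and the first call includes compilation. A minimal sketch (the summing function and array size are made up for illustration, and it falls back to plain Python if numba is not installed):

```python
from timeit import default_timer as timer
import numpy as np

try:
    from numba import njit
except ImportError:          # no numba: use a no-op decorator so the sketch still runs
    njit = lambda f: f

@njit
def total(arr):
    # Simple reduction loop that numba can compile to machine code.
    s = 0.0
    for x in arr:
        s += x
    return s

data = np.arange(1_000_000, dtype=np.float64)

start = timer(); total(data); first = timer() - start   # includes compilation
start = timer(); total(data); second = timer() - start  # already compiled

print(f"first call: {first:.4f}s, second call: {second:.4f}s")
```

With numba present, the second call is typically orders of magnitude faster than the first, which is why benchmarks should always warm the function up before timing.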
@vallurirajesh
@vallurirajesh 3 years ago
This is very helpful. Most people don't realize the overhead and code refactoring necessary to take advantage of GPUs. I am going to refactor a simple MNIST training program I have which currently uses only NumPy, and see if I can get meaningful improvements in training time.
@pragmaticai
@pragmaticai 3 years ago
A great way to challenge yourself.
@valentinfontanger4962
@valentinfontanger4962 2 years ago
Very clear, I loved it!
@pragmaticai
@pragmaticai 2 years ago
Glad you liked it!
@bernietgn6406
@bernietgn6406 11 months ago
A great and unique video. Thanks a lot for sharing.
@alexzander__6334
@alexzander__6334 3 years ago
6:41: except for the first time you run the function; every run after that will be fast.
@gjubbar
@gjubbar 4 years ago
Nice demo. I am getting into CUDA GPU programming and have a workstation build with a 1950X 16-core CPU and two RTX 2080 Ti GPUs, and I would like to run this demo on that machine and observe the results without using Colab. I will definitely check this out today. By the way, in a Python 3 notebook environment, do I just use pip to install the numba library as shown, or do I have to create a new virtual environment? I am curious about that. Thank you.
@ShaunPrince
@ShaunPrince 1 year ago
Thanks for this. I was able to replicate it locally using Jupyter Notebook with NVIDIA and WSL2; it worked like a charm.
@pragmaticai
@pragmaticai 1 year ago
Excellent!
@harukosaver809
@harukosaver809 3 years ago
Awesome video, I will try this too.
@agnichatian
@agnichatian 2 years ago
Can I use this in an app that has a Kivy GUI?
@pragmaticai
@pragmaticai 2 years ago
As long as an NVIDIA GPU is present, yes.
@xinlnixg4354
@xinlnixg4354 4 years ago
Thanks a lot for sharing.
@PP-tc1zp
@PP-tc1zp 3 years ago
Hi, can you show the same problem solved in code on the CPU, to compare CPU vs. GPU performance?
@QuantumWormhole
@QuantumWormhole 3 years ago
Is the GPU script correct? There are no to_device and copy_to_host calls to copy the image to and from the GPU. And the script uses the create_fractal function rather than the mandel_kernel.
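The explicit transfer pattern this comment asks about looks roughly like this in numba.cuda. A minimal sketch with a made-up add_one kernel (not from the video), including a CPU fallback so it also runs on machines without CUDA:

```python
import numpy as np

host = np.zeros(1024, dtype=np.float32)

try:
    from numba import cuda
    gpu_ok = cuda.is_available()
except ImportError:
    gpu_ok = False

if gpu_ok:
    @cuda.jit
    def add_one(arr):
        # One thread per element; guard against threads past the end.
        i = cuda.grid(1)
        if i < arr.size:
            arr[i] += 1.0

    d_arr = cuda.to_device(host)       # explicit host -> device copy
    add_one[4, 256](d_arr)             # 4 blocks x 256 threads cover 1024 elements
    result = d_arr.copy_to_host()      # explicit device -> host copy
else:
    result = host + 1.0                # CPU fallback when no CUDA device is present
```

Note that numba will also transfer NumPy arrays implicitly when you pass them straight to a kernel, but the implicit copy back and forth on every call is exactly the kind of overhead the explicit to_device / copy_to_host pattern lets you avoid.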
@yogeshwarshendye4857
@yogeshwarshendye4857 3 years ago
Can I use numba for training models with sklearn libraries?
@pragmaticai
@pragmaticai 3 years ago
Great question, not sure if anyone is doing this yet.
@saebifar
@saebifar 6 months ago
How can I speed up my machine learning code (sklearn and TensorFlow)? It's very slow, ahhh 😡
@knowledgelover2736
@knowledgelover2736 3 years ago
Can you use this to speed up kmeans? I have 60 million rows to cluster. On 16 cores it is going for hours.
@pragmaticai
@pragmaticai 3 years ago
Yes, you could do this by hand, which would be a great distributed computing challenge to code. Another option is to use a framework/platform like AWS SageMaker to do distributed k-means; most organizations will do that.
@ardan7779
@ardan7779 4 years ago
So I must buy a 3090 just to run a .py file?
@PP-tc1zp
@PP-tc1zp 3 years ago
I don't understand the numpy array sum example; why do you do this? You can just sum two identical tables with numpy: df = df2 + df2. The effect is the same without a GPU, immediately, so why use the GPU for this operation? I don't see any advantage in the table example.
@pragmaticai
@pragmaticai 3 years ago
This is an academic example that shows the process of copying data to the GPU, doing a vectorized operation, then showing the results. What actually makes sense on the GPU vs. the CPU is something I didn't cover, and I am hoping others can figure out some cool ideas.
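The "vectorized operation" in question is an element-wise ufunc compiled with numba's @vectorize; passing target='cuda' (with a CUDA device present) runs the same ufunc on the GPU instead of the CPU. A minimal sketch with illustrative data, falling back to np.vectorize if numba is absent:

```python
import numpy as np

try:
    from numba import vectorize
except ImportError:
    # Plain-NumPy fallback so the sketch runs without numba
    # (no compilation, just the same element-wise semantics).
    vectorize = lambda signatures: (lambda f: np.vectorize(f))

# Compiled element-wise add; on a CUDA machine you could instead write
# @vectorize(['float32(float32, float32)'], target='cuda')
@vectorize(['float32(float32, float32)'])
def add_ufunc(a, b):
    return a + b

x = np.arange(10, dtype=np.float32)
y = np.arange(10, dtype=np.float32)
z = add_ufunc(x, y)  # element-wise sum, broadcast like any NumPy ufunc
```

For an operation this cheap the host-to-device copies dominate, which is the commenter's point; the GPU only pays off when the per-element work (or the data reuse on the device) is large enough to amortize the transfers.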
@summercamp5183
@summercamp5183 3 years ago
Sir, I still have some doubts; can you please share your contact number or email? I have downloaded 2 files from GitHub: one is a .cu file and the other is a .sh file. The two files are interconnected, since the .cu file takes its input from the .sh file. I don't know how to run them or how to upload them. I request you to please guide me; I would be highly thankful. My project review is coming up.
@pragmaticai
@pragmaticai 3 years ago
YouTube questions are typically the best way to handle an issue, or an issue request on GitHub for a project demo. I will do my best to answer when I have time.
@jakubkahoun8383
@jakubkahoun8383 2 years ago
My head is going to explode from all of this, but I feel that if I learn it, I will become powerful... still no idea how to make my program run on the GPU, even though it's HIGHLY parallel stuff...
@pragmaticai
@pragmaticai 2 years ago
GPU programming is definitely a bit esoteric, but a fun skill to have.
@cleisonarmandomanriqueagui7502
@cleisonarmandomanriqueagui7502 3 years ago
Dammit, I didn't know about this until 2021.
@ajflink
@ajflink 2 years ago
Is there something other than CUDA that I can use? I don't plan to use any NVIDIA GPUs, so CUDA is useless for me. In addition, unless you work in game development or some kind of niche research, work computers will not have an NVIDIA-based GPU. I own several computers and none use NVIDIA.
@pragmaticai
@pragmaticai 2 years ago
Try AMD ROC: numba.pydata.org/numba-doc/latest/roc/index.html
@lawrencepanozzo9492
@lawrencepanozzo9492 4 years ago
So the CUDA version was 7x faster. Nice laptops have had 6-8 cores for many years now, so this GPU implementation is still no faster than multi-core parallelization. I have yet to see one article or one video where the GPU actually creates a performance improvement over njit, prange, multiprocessing, etc. :(
@ShaunakDe
@ShaunakDe 4 years ago
I'd say most of these examples are trivial. The one place where GPGPU compute really helped me in my work was running erosion calculations on a DEM. The individual calculations are quite simple, but they need to happen with massive parallelization (the scene was about 75k pixels wide and high, and they needed to be run in 5x5 blocks). It also helped that the computation could be expressed as a matrix multiplication. Another more direct example would be running a deep-sample FFT on an audio signal: trivial on the GPU, hard on the CPU even with the excellent FFTW library. Here is a really nice report on the subject: www.researchgate.net/publication/233865704_Accelerating_Fast_Fourier_Transformation_for_Image_Processing_using_Graphics_Processing_Unit/figures