Arm-chair fractal geometry expert here. Was expecting to be bored, but was actually thrilled to see how your optimizations would perform. However, I'm disappointed you didn't go on a 45-minute tangent about the existential implications of the fractal nature of life, the universe, and everything.
@daedreaming6267 • 4 years ago
Well you see... What you forgot to take into account is that the fractal nature of things? Really boils down into one thing. And that thing? Is 42.
@SpaghettiToaster • 4 years ago
@@daedreaming6267 What he actually forgot to take into account is that there is no such thing as the fractal nature of things, because things are not fractal in nature.
@Capris4EveR • 4 years ago
@@SpaghettiToaster In my point of view everything in nature is parallel (mirrored to infinity). We can't find the start since it's infinite, but we will be able to predict the future and change the course of our evolution. In my mind a higher being developed us for God knows what purpose, and in parallel with that, someone else created them too; when our turn comes we will create others, if we didn't already (animals).
@Kenjuudo • 3 years ago
@@emjizone I'm pretty sure that if you were to iterate through every single color component of every single pixel in every single frame in every single video for every single video resolution as they are stored on the Google servers, you'd pretty much find the number 42 in all of them. And on the off chance a video's video stream doesn't contain that number, there are also the audio streams to sift through... :)
@austinwoodall5423 • 2 years ago
Fractals are so named for their FRACTionAL dimensionality. For instance, the boundary of the Mandelbrot set is neither a simple 1D curve nor a smooth 2D region; its fractal dimension has actually been proven to be 2. Fractals are often characterized by an infinite perimeter enclosing a finite area (or another applicable analogue).
@dartstone238 • 4 years ago
The SIMD conversion would be super interesting, I think. I do like the low-level part of things.
@Agatio17x • 4 years ago
I would love to watch a SIMD video too!
@ElGuAcA11 • 4 years ago
Me too, I'd love to see it!
@byrontheobald6663 • 4 years ago
From my own testing with other problems (N-body), you don't actually need to do much. If you look here: gcc.gnu.org/projects/tree-ssa/vectorization.html you'll see that GCC will automatically emit SIMD instructions at -O3 (or at -O2 with -ftree-vectorize); just make sure your data is in a vector/array or otherwise contiguous in memory. I'd imagine MSVC will do something very similar.
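For illustration, a minimal sketch (not from the video) of the kind of loop GCC's auto-vectorizer handles, with the relevant flags in a comment:

```cpp
// Built with e.g. g++ -O3 (or -O2 -ftree-vectorize); add
// -fopt-info-vec to have GCC report which loops it vectorized.
// Contiguous arrays and a simple trip count are what make this
// easy for the vectorizer.
void addArrays(const float* a, const float* b, float* c, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```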
@mileswilliams527 • 4 years ago
Yes PLEASE!!! I'm at the point in my coding journey where I need to start getting the most performance I can, and I have started looking at parallelism, so this would be very interesting.
@ScottFCote • 4 years ago
I also would very much like to see this.
@daklabs • 4 years ago
The AVX sounds interesting, I'd like to see more on it!
@gideonmaxmerling204 • 4 years ago
There is a YouTuber named "What's a Creel?" who has some videos on AVX in assembly, and also some on VCL, a vector library for C++.
@dragan38765 • 4 years ago
More SIMD/AVX videos? Yes, please. I'd love to learn more about those, I only know the basics sadly. I pray to the compiler gods that they have mercy upon my soul and vectorize it automatically.
@douggale5962 • 4 years ago
In most cases, they can't auto-vectorize. Usually it's because the compiler can't be certain your pointers aren't aliasing: it thinks your stores may be changing the values of subsequent loads, so it is afraid to use SIMD in case that changes the results. You can help a lot with "restrict" pointers (__restrict in g++). The compiler can be absolutely sure there is no aliasing when you mark them restrict. Putting restrict on the signature's pointers means the function cannot deal with those pointers referring to the same memory; applying that restriction sets the compiler free to assume that its stores won't be feeding back into loads, allowing vectorization.
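A minimal sketch of that idea (function and names hypothetical):

```cpp
// __restrict promises the compiler that in and out never alias, so a
// store through out cannot feed back into a load through in, and the
// loop can be vectorized without runtime aliasing checks.
// (__restrict is the g++/MSVC spelling; standard C++ has no keyword.)
void scale(const double* __restrict in, double* __restrict out,
           int n, double k)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * k;
}
```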
@Bravo-oo9vd • 4 years ago
Instead of praying, you can profile and inspect the compiler output. It's hard to speed up your code when you don't know where the bottlenecks are, and you can help the compiler with auto-vectorization if you provide it with some hints. Writing code with intrinsics directly is not very maintainable; a better thing to do would be to look into parallel processing libraries that manage this complexity for you. Also, by looking at the output assembly you're able to confirm that other optimizations are working, such as instruction-level parallelism.
@GregoryTheGr8ster • 4 years ago
I hate to have to be the one to bring you the news, but there are no compiler gods. I SWEAR THAT I AM NOT MAKING THIS UP!
@glitched_code • 4 years ago
Very interesting, I'm rooting for the SIMD conversion video to come out :)
@javidx9 • 4 years ago
thanks and noted!
@geotale95 • 4 years ago
I'm a simple man. I see brute forcing and the Mandelbrot set, and I click on the video.
@kepeb1 • 4 years ago
I'm a simpleton, I see overused formulaic cliche comments, and I join in.
@geotale95 • 4 years ago
@@kepeb1 Yeah, sorry, I didn't see any other comments and wanted to comment quickly, and that's the best I could do :p
@luisgeniole369 • 4 years ago
I think the Mandelbrot set and fractals in general are so engaging because they're the closest thing we have to visualizing eternal cosmic inflation, and that pokes at a very powerful memory of our collective consciousness. Also the pretty colors.
@RogerBarraud • 4 years ago
One does not simply brute-force into Mandelbrot...
@Omnituens • 4 years ago
I'm in for a dedicated video on the SIMD conversion.
@atrumluminarium • 4 years ago
Yes please I would be interested in seeing more about the intrinsics and performance advantages. I've been programming for years and had no idea that was even a thing you can do in C/C++
@cocorinow • 4 years ago
Javid you angel! Just as I'm starting my master's thesis, having to write a program that will have to crunch through a lot of complex numbers, you post this. Thanks!
@muckiSG • 4 years ago
For everybody interested in the Mandelbrot set and not deep in the subject, I would suggest searching for "numberphile mandelbrot". Really nice videos... have fun!
@nawdawg4300 • 3 years ago
I watched a numberphile video about the mandelbrot set (and the key fact that any value past a radius of 2 explodes to infinity) and I managed to write the code for a multithreaded implementation (not thread pooled) without any external resources. Seeing the mandelbrot pop up on the screen for the first time was an amazing feeling haha. Anyways now that I've stumbled across your video I can't wait to learn more about and implement the thread pool/vector co-processing (I just got my BS CS degree and am disappointed I didn't get to learn more about this stuff). This is definitely one of the better videos I've come across when it comes to mandelbrot implementation, so thanks for that!
@antipainK • 4 years ago
I would love to see how to do the conversion to SIMD! :) Btw, my first C++ project was a Mandelbrot set generator and I made most of the mistakes that you pinpointed here. Now I'm gonna rewrite it after final exams. :D
@marcodebruin5370 • 3 years ago
One way to avoid the "waiting for the slowest thread" problem and keep every thread busy till the end (and thus shorten the total time needed) is to have the workers calculate the iterations for 1 pixel, then ask for the next pixel not yet started, feeding pixels to the threads until you've run out of pixels to do. All threads will be kept busy until the last few pixels. That should still give a significant improvement (especially on screen areas that are unbalanced).
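A minimal sketch of that scheme (names hypothetical), with a shared atomic counter handing out the next unclaimed pixel:

```cpp
#include <atomic>

std::atomic<int> nextPixel{0}; // shared between all workers

void worker(int totalPixels)
{
    for (;;)
    {
        // Claim the next pixel; fetch_add is the whole "job queue"
        int p = nextPixel.fetch_add(1, std::memory_order_relaxed);
        if (p >= totalPixels) break;
        // computeIterationsForPixel(p); // the per-pixel Mandelbrot work
    }
}
```

In practice, handing out small batches (say, a row at a time) instead of single pixels amortizes the atomic traffic while keeping the load balancing.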
@jordanclarke7283 • 4 years ago
I wish to register interest in the SIMD code. Also I wish to acknowledge that the part where you talked about running in release mode versus debug mode was for my benefit after the (stupid) question I asked in your last video regarding the speed of decal rendering. Keep up the good work! Awesome channel! 👍🏻
@rhydermike • 4 years ago
Highly enjoyable video there. I first came across the Mandelbrot set on my BBC Micro, as there was a viewer included on a Disc User Magazine cover disc. It wasn't *quite* as fast as your routine in execution. You've made me want to break off from my regular programming project to render the Mandelbrot set, something I have never done. Just one for the nerdy bucket list. Great stuff, and thanks for making it.
@DiamondShocked • 4 years ago
This is exactly the type of computation that GPUs are designed for, and you would likely see a huge performance improvement there. In addition, while this approach generalizes fractal computation somewhat well, there are many techniques to explore to optimize the rendering of the Mandelbrot set specifically. For example, there is no need for the computation to run to the max iteration count for pixels which are clearly in the middle region and would iterate forever. There are also ways to scale the floating-point values based on the zoom so that precision is no longer a limitation, which would allow you to zoom indefinitely.

Next, it does not make sense that 32 threads should have improved performance over 8 threads on your machine, since your 4-core processor can only effectively execute instructions from 8 threads simultaneously (two instruction streams per CPU core).

I would also say avoid busy-waiting for threads in the thread pool. Simply deschedule these threads if they have no work to do, and make them runnable when they should run. It is perhaps possible that this is handled somehow by the OS/compiler (if a loop is seen to jump to the previous instruction and the condition is not met, the processor may be able to dynamically see that no useful progress can be made and the thread will yield to another?), although that's unlikely. While it is a rule of thumb to avoid busy-waiting at all costs on a single-core machine, there are reasons why busy-waiting may be an appropriate solution on a multi-core system: conditions can change while a thread is busy-waiting due to the behavior of other threads, and the system call to make the thread runnable can be avoided. So the tradeoffs between these options should be weighed. Also, your solution to thread synchronization (an atomic counter) can be replaced more elegantly with a barrier or another synchronization primitive.

An interesting experiment for the approach you have taken would be to disable the compiler's optimization flags, so that simple code optimizations such as loop unrolling or strength reduction can actually be measured.
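On the barrier point, a minimal C++20 sketch (worker and frame counts hypothetical; a real renderer would also need double-buffering before presenting):

```cpp
#include <barrier>
#include <thread>
#include <vector>

int main()
{
    const int workers = 8, frames = 100;
    std::barrier sync(workers + 1); // all workers plus the main thread

    std::vector<std::jthread> pool;
    for (int t = 0; t < workers; ++t)
        pool.emplace_back([&, t] {
            for (int f = 0; f < frames; ++f)
            {
                // computeMyRegion(t, f); // this thread's slice of the frame
                sync.arrive_and_wait();   // slice done; block until next frame
            }
        });

    for (int f = 0; f < frames; ++f)
        sync.arrive_and_wait(); // returns once every worker finished frame f
}
```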
@AusSkiller • 4 years ago
With the way he was doing things, it absolutely makes sense that 32 threads would improve performance over 8 threads. What is probably tripping you up is that the number of threads isn't the reason it'll be faster (more threads doesn't really help, as you probably correctly concluded); rather, it is faster because he divides the area up based on the number of threads, thus reducing the time it takes to process the most computationally expensive section, which with only 32 sections is likely still a major bottleneck.

With just 8 threads and 8 sections, whenever a thread completes its section the hardware thread sits idle until the others are all complete. But with 32 threads and 32 sections, whenever a section is completed another software thread can utilise the hardware thread that was freed up, and the maximum time to complete a section is reduced to a quarter of what it would take with 8 sections, allowing it to be up to nearly 4 times faster in the worst-case scenario for 8 threads.

However, it is entirely dependent on the complexity of the sections and how well balanced they are. If all 8 sections are equally complex then there will be no performance benefit to running 32 threads/sections, and 32 threads/sections may even be marginally slower in that case. Basically, the more sections there are, the better it can load-balance and keep the CPU fully utilised, at least up to the point where the overhead of more threads starts negating the better utilisation.
@luisendymion9080 • 4 years ago
AusSkiller is right. You're speaking of a better but way more complex implementation. Javid's goal is to use a minimalist implementation that still makes good use (although not optimal) of the cores. For his algorithm it makes sense to use more threads because it reduces the overall time to join them.
@DiamondShocked • 4 years ago
@@AusSkiller I see your point. The load balancing is the key issue here. For each thread to have somewhat equal work, each pixel should have an estimated cost and the total cost of the pixels for each thread should be similar. At least that's one way you could do it off the top of my head. Also the pixels chosen for each thread should be done so with memory access locality in mind.
@Splatpope • 4 years ago
Yep, got 60fps all the way to the float precision limit in my OpenGL implementation of the Julia set.
@Splatpope • 4 years ago
Can you tell me how to overcome the float precision limit?
@sanyasanders9580 • 3 years ago
Wow! It's incredible! Thanks for this video.
@GregoryTheGr8ster • 4 years ago
This is a totally riveting video! Not only do you dive deep into the Mandelbrot fractal, to the point where a mighty 64-bit double is inadequate, but you also dive into unearthing the incredible power of the Intel CPUs. I completely fell in love with SSE and AVX when I finally decided to learn how to make use of them, but I have not yet ventured into threads and multi-cores. I know that I need to. It's on my bucket list. No, actually it is much more imminent than that. I will get into multi-core programming soon. This is a *must*.
@thewhitefalcon8539 • 2 months ago
double isn't that mighty in the Mandelbrot set
@antipainK • 4 years ago
Could you make a video about dynamically resizing variables (ones that double their length when they run out of precision)? That would of course require defining all the calculations and make the program slower, but it would enable us to zoom "indefinitely".
@thewhitefalcon8539 • 2 months ago
Variables can't change length, but you can replace objects with bigger objects. There are arbitrary-precision arithmetic libraries for this purpose
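For example (my pick, not something the video uses), Boost.Multiprecision lets you swap the number type directly, at a large runtime cost:

```cpp
#include <boost/multiprecision/cpp_dec_float.hpp>

// A 50-significant-digit decimal float standing in for double.
using big_float = boost::multiprecision::cpp_dec_float_50;

// One Mandelbrot real-part update, same formula as with double:
// x' = x*x - y*y + a
big_float realStep(const big_float& x, const big_float& y, const big_float& a)
{
    return x * x - y * y + a;
}
```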
@gilleswalther5964 • 4 years ago
Great performance! Really impressive what you can get out of the CPU.
@DFPercush • 4 years ago
+1 just for the profile pic. Very apropos.
@ogr2 • 4 years ago
Awesome video man. Actually I was watching another video about the "Mandelbrot Conjunction" and I found it quite interesting to analyze.
@javidx9 • 4 years ago
Cheers Oscar, they just never get boring - I really find Mandelbulbs quite interesting
@Kenjuudo • 3 years ago
Awesome video. It seems I've struck gold by finding your channel. I can already tell I will never regret subscribing to you! Don't ever change your format! You hear???
@ScramblerUSA • 4 years ago
30:02 - should be "bit 255".
@davidbundgaard • 4 years ago
It's such a joy to see you program this fractal generator, and I would love to see more about AVX, GPU calculations on this problem, and threads. Very nice walkthrough. Keep it up as always.
@Endede • 4 years ago
In addition to the great walkthrough of optimizations, and the fantastic explanations of successes and failures of optimizations, I immensely enjoyed your near-giddiness when playing with your Mandelbrot Set Explorer. Clearly, you love coding, but I feel like in this video, it was extra-apparent and frankly infectious.
@Zartymil • 4 years ago
Such a fun video! Didn't even feel like it was almost an hour long. Thanks for the content! Subscribed :)
@javidx9 • 4 years ago
Cheers!
@Hacker-at-Large • 4 years ago
I’d also love to see this extended into using the GPU.
@ryanlyle9201 • 4 years ago
"Quite advanced for this channel". My god, he's holding back his full power.
@zikomo8913 • 4 years ago
I just started learning programming, and I was dreaming of making a Mandelbrot fractal plotter and then improving it bit by bit (going lower and lower towards the hardware) to optimize it. And now you did a video on it, amazing!
@ghollisjr • 3 years ago
I had a lot of fun making a Mandelbrot fractal Web app with WebGL by putting the fractal calculations in a fragment shader. Highly recommend it as a learning project.
@Collinoeight • 4 years ago
Definitely interested in seeing that conversion. This is one of the best CS videos I've seen in quite a while.
@rachelmaxwell4936 • 4 years ago
I enjoy and appreciate all of your content. I would really like a video on intrinsics.
@cameraforchristmas • 4 years ago
What a terrific find! I'm working on a project based on the Mandelbrot set. I fell in love with it when it was an assignment in college many years ago. What is crazy is that I came across your channel not for the Mandelbrot part but when trying to find a good C++ development environment for the Raspberry Pi. I chose wxWidgets, Code::Blocks and MinGW. So, I found the channel vid on wxWidgets setup. Now here I've run into you twice on the same project. How cool is that? I've already compiled my Mandelbrot starting point in C#, my favorite environment. My end point is a Raspberry Pi and a 7" touchscreen in a permanent case. It should allow the user to zoom in (and out). I want it to behave as a purpose-built, single-function appliance. No desktop, no multiple windows, no choosing the app to run, just plug it into the wall power. For extra credit, I thought I'd see if I could output the display on the HDMI connector at the same time so someone can enjoy it on a monitor or the living room TV. This would be the Mandelbrot-set-top-box. Haha. So, thanks twice for the great content! Dave, San Francisco
@ComposingGloves • 4 years ago
If you had just kept reading from the wiki, you would be my microprocessor professor, who seems to think reading slides counts as teaching. Thanks for actually teaching! Your channel is amazing!
@morphx666 • 4 years ago
Absolutely! I'd love to see a video explaining the conversion to SIMD data types.
@mattbland2380 • 4 years ago
Loved this. I remember entering Basic programs to plot the Mandelbrot set back on 8 bit computers and tweaking them myself without understanding them properly and getting wonderfully strange results.
@undersquire • 4 years ago
Love your vids, very informative :) I've learned a lot. I would love to see a series on possibly making your own basic coding language in C++.
@javidx9 • 4 years ago
It's certainly an interesting idea! If I find a need for such a language then it may make a good series
@DestroManiak • 3 years ago
This is fascinating, and immensely helpful for scientific-calculation coders.
@christianantfeld3827 • 4 years ago
I'm shocked by how smart you and the other commenters are. Absolutely incredible if anyone else can follow along.
@taraxacum2633 • 4 years ago
maybe real time ray tracing??? 😄 I am trying to do it rn but I'd love your explanation and tricks!
@javidx9 • 4 years ago
It's a popular request for sure!
@jsflood • 4 years ago
Amazing as always. This is great! Really interesting. I would love to see a video explaining the vector co-processing (immtrin/SIMD/AVX) further. Thank you. Great work :-)
@cumulonimbuscapillatus9312 • 4 years ago
The accidental video recording lag near the end though lmao. Great video as always!
@javidx9 • 4 years ago
lol, hey it's an interesting and very real side effect, so I thought it had value and kept it in :D
@Gordonfreems • 4 years ago
The value of this video is incredible, thank you
@SamClayMold • 4 years ago
Hey javidx9, love this video and your channel. Commenting here to say I am definitely interested in that video about the 256 bit registers and parallelism
@battosaijenkins946 • 3 years ago
Amazing. I was using intrinsics to see if I could outperform the Quake 3 square root dilemma a while back, and this brought back wonderful memories. I can't wait to try this Mandelbrot concept using OpenGL! Thank you~ 👏👏👏
@portlyoldman • 4 years ago
Blimey. Really enjoyed that 😀 I'm not even a C++ programmer, only comfortable with C#, but I learned a lot from this. Also enjoyed the style and presentation. Thanks. Subbed 😁
@javidx9 • 4 years ago
Thank you Jim, much appreciated!
@Cr4zy4pple • 4 years ago
Normal people pay top money for this kind of knowledge, not to mention the unfathomable amount of work put into scripting and compiling these vids. It's baffling how clean, simple yet detailed you can craft these tutorials. My IQ went up exponentially since I started watching you. Never stop making these vids, you'll go down in history as a legend. Also, please drop a link where I can donate small amounts, for poor wankers like myself.
@chrismingay6005 • 4 years ago
Great video, thank you. I really liked the pacing (as always), and while I'm not a C++ programmer in any capacity, I learned quite a lot as well.
@stephenbradshaw6727 • 4 years ago
A splendid video! Thank you very much. It might be interesting to insert a generator function to be called in place of the Mandelbrot-specific code, with arguments that will create different kinds of fractals for exploration. I am certainly interested in seeing more specifics w.r.t. implementing the AVX extensions. Amusingly, the Mandelbrot Set is based on iterating quadratic polynomials, similar to those which describe population growth and are being used to gain some insights into our current global predicament.
@oscill8ocelot • 4 years ago
Consider this my shout-out for more vector processing instruction videos :3
@maybenexttime37 • 4 years ago
Great video! An improvement would be to divide the screen into many thin rows instead. Rows are better than columns for two reasons: 1. Better cache locality for threads when they're working on a particular job. 2. You can increase the number of rows arbitrarily, whereas the number of columns has to be tuned so that you're as close to a multiple of 8 pixels per column as possible (otherwise you give up potential AVX benefits). This is important because in order to reduce bubbling you want to increase the number of jobs so that the workload is more balanced among threads. Of course, to do that you need to decrease the size of each row/column, maybe even down to a single pixel row per job, because at the given width (1280) thread signaling overhead will likely be negligible even then.
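A minimal sketch of that row split (names hypothetical):

```cpp
#include <algorithm>
#include <vector>

struct RowJob { int yStart, yEnd; }; // [yStart, yEnd) rows of the image

// Many thin jobs instead of a few tall columns: threads pull these in
// order, so the load stays balanced and each job walks memory
// contiguously, row by row.
std::vector<RowJob> makeRowJobs(int height, int rowsPerJob)
{
    std::vector<RowJob> jobs;
    for (int y = 0; y < height; y += rowsPerJob)
        jobs.push_back({ y, std::min(y + rowsPerJob, height) });
    return jobs;
}
```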
@oblivionronin • 4 years ago
Awesome long video, love it! Fractals are indeed hypnotizing! Also, yes, I definitely want to see that SIMD optimisation. I was thinking of a couple of things we could do to optimise the display of this: 1. Cache the world position, zoom level and iteration count; if those haven't changed, we can just re-render the frame that's already baked in. 2. Once those are calculated once, you can use that data (time for each thread to finish) to guess where and how you should divide up your thread grid. Maybe some areas (like that big half screen with very little calculation) should be handled by only one thread, and that one sliver of fractal should be calculated by the other 31 threads. (I.e., you hold the time-to-finish for each thread, and an average of them all. On each frame or iteration, if a thread's finish time is shorter than the average, it should take some screen space from other threads (vertically or horizontally); if it's higher, it should give some up. That way, threads dynamically exchange their screen space depending on their own finish time and the average.) Sorry for the long comment, just had to put that out there! :P Cheers!
@antipainK • 4 years ago
Yeah, dividing screen to threads by bitmasks should do the trick. Didn't think of it. Thanks, gonna implement it in my own Mandelbrot generator :)
@blinded6502 • 4 years ago
Just imagine these fractals as continents in a solar-system-sized ocean. Literally breath-taking.
@dioneto6855 • 4 years ago
Funny, because when you started using threads I wondered what would happen if you went to a region of the fractal that's stressful for the processor while recording. And thanks, you killed my curiosity.
@matthewpalermo4957 • 4 years ago
Loved the demonstration of how good the compiler optimization is! Question: Is there an abs_squared method for std complex values? That could be used instead of your implementation and still avoid the square root.
@johngreen1060 • 4 years ago
That would be a dot product of the complex number against itself or multiplication of the number by a complex conjugate of itself.
@fredg8328 • 4 years ago
Compilers are not always good at optimizing. They are smart at small optimizations, but generally you can do better with higher-level ones. And there are a lot of differences between compilers. The Visual Studio compiler is known to be one of the worst at optimizing. But that's the same story as the debate over whether you should use the STL or not: Visual Studio follows the standards more tightly and has to add a lot of checks to the generated code.
@wtk300 • 4 years ago
std::norm(std::complex)
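A tiny sketch of the escape test built on that (std::norm returns |z|², no square root taken):

```cpp
#include <complex>

// |z| > 2 escapes, so test norm(z) = re*re + im*im against 4 instead
// of calling abs(), which would take a square root every iteration.
bool escaped(const std::complex<double>& z)
{
    return std::norm(z) > 4.0;
}
```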
@Saxie81 • 4 years ago
@32:54 I think it would be neat to look at the conversion process!
@javidx9 • 4 years ago
yeah, me too! noted!
@stickfigure31 • 8 months ago
@24:36 I'm sure in most cases using the standard libraries is the preferred way, but tracing this back down to the square root function cutting performance in half did give me an appreciation for why id Software made their own faster (but less accurate) square root algorithm for the Quake engine.
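For reference, the best known of those tricks is Quake III's "fast inverse square root"; a sketch in C++20 form (std::bit_cast replaces the original's undefined-behaviour pointer pun):

```cpp
#include <bit>
#include <cstdint>

// Approximates 1/sqrt(x), accurate to within a fraction of a percent.
float fastRsqrt(float x)
{
    std::uint32_t i = std::bit_cast<std::uint32_t>(x);
    i = 0x5f3759df - (i >> 1);            // magic-constant initial guess
    float y = std::bit_cast<float>(i);
    return y * (1.5f - 0.5f * x * y * y); // one Newton-Raphson refinement
}
```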
@andersonklein3587 • 4 years ago
I absolutely loved this AVX implementation! I might just try to implement that in a game engine I'm building, for parallel processing inside a thread pool. Thank you so much for your amazing video on optimization! PS: I'm actually ashamed that I hadn't used .wait in my code for the pool. I literally had my threads polling constantly and wasting cycles. lol
@carsont1635 • 4 years ago
Thank you for the excellent video! I have a course next semester specifically regarding parallel processing and this was a great introduction into some of the techniques involved. And I would love to see the video mentioned about how to use a cpu's vector processor!
@BuxleyHall • 4 years ago
Thanks for the video! Add my name to those interested in learning more about AVX and SIMD. On a different topic, but still related to rendering the Mandelbrot set, I'd love to see how to get past the resolution limit of 64-bit doubles so one could explore even deeper into the set. Thanks again!
@anakimluke • 4 years ago
THE INTRO SOUND is finally better!! thanks! hahah :)
@ratchet1freak • 4 years ago
About the condition variable wait: you should wrap the unique_lock and the wait in a block to avoid holding the lock for too long (it doesn't really matter in this case because each worker has its own mutex), and you should guard against spurious wakeups. You could also try other primitives for launching the threads, like a semaphore or a barrier. The barrier is the superior method for this kind of waiting, as it can actually wake up multiple threads at a time (which you can't do with condition variables, even with broadcast, due to thundering-herd issues), and the completion function can both signal completion to the main thread and then wait for the next task. (Edit) Also, if you divided tasks by row instead of column you might get some gains because of cache prefetching and false-sharing avoidance.
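A sketch of that wait pattern (worker state hypothetical):

```cpp
#include <condition_variable>
#include <mutex>

struct Worker
{
    std::mutex mtx;
    std::condition_variable cv;
    bool hasWork = false;

    void waitForWork()
    {
        // Scope the lock so it is released as soon as the wait is over;
        // the predicate overload re-checks the flag, so a spurious
        // wakeup simply goes back to sleep.
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this] { return hasWork; });
        hasWork = false;
    }
};
```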
@CrystalMusicProductions • 4 years ago
Wow this video and in general your channel are a great motivation to keep programming and exploring the field :)
@javidx9 • 4 years ago
Thank you Crystal Music!
@bloodypommelstudios7144 • 4 years ago
A few ideas: Every time you scroll, you're recalculating the whole frame. If you saved the iteration data and scrolled 2 rows of pixels, you'd only need to actually calculate the extra 2 rows which have become visible. Hell, you could precalculate a border after the screen-space pixels have been calculated, in anticipation of the user scrolling. There are a few ways certain pixels could be predicted: you get bands of given iteration values and never get lower values inside higher values. Therefore, if you hit the highest iteration value you could find the outline of that shape and simply fill it in. You could then work your way out from that shape to the second-highest iteration value and repeat the process. This would probably be insanely complicated, but in certain circumstances it could be a colossal speed boost. Mirroring: if the screen space passes through zero on the Y axis, you could simply mirror the pixels above the line for a nearly 2x speed boost. Scale x2: if you had a button to double the zoom, then you've already calculated 25% of the pixels, if it's lined up correctly. The same is true zooming out, though it would need calculating slightly differently. Between these I'd imagine it could hit an effortless 4K 60fps.
@kris030 • 4 years ago
Please do more of this, optimization is my jam.
@wesleythomas6858 • 4 years ago
Awesome video. Great touching on the different techniques available, never even knew about SIMD! 👍
@serialhexed • 4 years ago
YES!!! INTERESTED!!!
@duality4y • 4 years ago
Thanks for this! It answers a lot of questions I had regarding SSE and SIMD.
@sabriath • 4 years ago
You can divide the screen into a grid that is 4x the number of threads you plan on using in the pool (or more). As each thread is "start"ed, it reads a position from the list based on a shared counter that is decremented at the time of reading, and when the thread is done it can grab another one, and another, etc., only stopping when the counter goes below zero, at which point it decrements a "threads done" counter. So if a particular set of grids is fast, those threads can help finish off other parts. For a more advanced approach, the same thread can alter the grid coordinates at that list position when it finishes, ready for the next iteration, before advancing. The controller code literally only has to reset the grid counter to the number of grids in the list and the "threads done" counter to the number of threads, then print the screen and repeat when the "threads done" counter hits zero (testing a counter against zero is faster than comparing to a non-zero value). So in a 1280x760 window with 32 worker threads, I would suggest a list of 608 grids of 40x40 pixels each. This would give each thread an average of 19 grids to work on per round.
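A minimal sketch of those two counters (grid counts as suggested above, the rest hypothetical):

```cpp
#include <atomic>

std::atomic<int> gridsRemaining{608}; // 40x40-pixel blocks this frame
std::atomic<int> threadsDone{32};     // workers yet to finish

void gridWorker(/* grid list, frame buffer, ... */)
{
    for (;;)
    {
        int g = gridsRemaining.fetch_sub(1) - 1; // claim a grid index
        if (g < 0) break;                        // nothing left this frame
        // renderGrid(g);                        // compute that block
    }
    // The controller resets both counters and repaints once this hits zero.
    threadsDone.fetch_sub(1);
}
```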
@dhiahassen9414 • 4 years ago
Best fractal programming video on YouTube.
@kyleeames8229 • 4 years ago
I too would be interested in the SIMD conversion video. I'm always fascinated by low level stuff and would happily watch the video.
@patrickc1659 • 4 years ago
Really enjoyed this, thanks. I'm all for hearing about how the AVX-256 works.
@photonymous • 4 years ago
One of the first optimizations I like to try is to take advantage of "superscalarness", i.e. "pipelining". The goal here is to keep your ALU pipelines fed (no bubbles). Achieving it can be as simple as unrolling for loops (processing 4 pixels in one step of the for loop, for example). It can sometimes be shocking how much can be gained from this simple technique. To get the most out of it, you must be mindful of "dependency chains", which prevent concurrent execution of code. There are cases where it can be faster to do some redundant calculations if that increases concurrency.
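A minimal sketch of that unrolling idea (a dot product rather than the Mandelbrot loop, for brevity):

```cpp
// Four independent accumulators break the dependency chain: the four
// multiply-adds per iteration can overlap in the pipelined ALUs
// instead of each waiting on the previous sum. Assumes n % 4 == 0.
double dot4(const double* a, const double* b, int n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4)
    {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```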
@tomay3000 • 4 years ago
Man, this is awesome, the math performance measurements. I didn't know about that header file before. Yes, I do want to see more videos about it and how you did the conversion. TIA.
@rustycherkas8229 • 3 years ago
Processor envy! I explored the Mandelbrot set on my '286 (without a coprocessor). Filling a single EGA screen, at higher zooms, could take hours! Those were the days... Yours is running plenty fast for us consumers, but, seeking to 'optimise' my slow version, I recall using the log() of the zoom (somehow) to 'dynamically' adjust the iteration threshold... It was all long, long ago... Probably should have been "inverse square" functionality, but log() performed well enough... I'm no mathematician...
@unknownuser8044 • 4 years ago
And now, implemented on the GPU? :)
@javidx9 • 4 years ago
Well that is indeed the next phase I suppose, though I reckon there is a lot to squeeze out of the CPU yet!
@Gelikafkal • 4 years ago
@@javidx9 I think the communication between CPU and GPU over the PCI bus will eat up most of the performance that a GPU implementation would give you, as long as you don't have a way to directly give a device pointer to the graphics library/engine.
@mariokart6309 • 3 years ago
@@Gelikafkal GPU compute is very much viable for anything that can take advantage of mass parallelisation, otherwise why would that functionality even exist?
@lucasgasparino6141 • 3 years ago
To anyone interested: the CUDA C book has a bare bones implementation of the Julia/Mandelbrot sets, written in C. It's very intuitive to use, and the difference is GIGANTIC if your GPU is reasonably modern.
@TenshiSeraph7 • 3 years ago
Yup, I reimplemented the Mandelbrot set using recursive CUDA (following a tutorial), although it was not a live-updating viewer. Indeed, the transfer of data between host and device (GPU) is usually costly, but it should be possible to move fractal data directly to the frame buffer without transferring everything back to the CPU. There is also an optimization for parallel implementations of the Mandelbrot set which takes advantage of the fact that all points in the set are connected. You can divide the image into rectangles which are computed by different threads. First you compute only the pixels on the border of each rectangle; if none of the border pixels contain any part of the fractal, then you can skip the calculation of the pixels inside the rectangle (assuming you are not zoomed out too far).
@Dave_thenerd • 3 years ago
19:05 You could use std::fma here, and there's a chance it will use the fma instruction, which should be faster if available in hardware. Although the compiler may have already figured this out.
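A sketch of how that applies to the decomposed update z = z*z + c, with z = x + yi and c = a + bi (std::fma takes scalar floats, not std::complex):

```cpp
#include <cmath>

// new_x = x*x - y*y + a computed as fma(x, x, fma(-y, y, a));
// new_y = 2*x*y + b    computed as fma(2*x, y, b).
// Each std::fma may compile to a single fused multiply-add where the
// hardware supports it.
void mandelStep(double& x, double& y, double a, double b)
{
    double nx = std::fma(x, x, std::fma(-y, y, a));
    y = std::fma(2.0 * x, y, b); // still uses the old x
    x = nx;
}
```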
@johndoe1909 • 4 years ago
Back in the day, around '93, I was working with massive parallelisation on multiple CPUs. I remember there were big differences depending on how you iterated over the axes. This was related to the way the MMU swapped memory in and out (row- or column-oriented). Those were the days!
@colinbrown7947 • 4 years ago
Very nice video! I would love to see you implement more mandelbrot specific optimization, perhaps even looking into perturbation or series approximation.
@DavidWeiss2 • 4 years ago
Please do more like this. I learned a lot!
@rednike500 • 4 years ago
It would be nice to see a video on parallelism and AVX, but your videos are always super interesting. Please continue to do as you like in your videos, you are an amazing programmer.
@skilz8098 • 4 years ago
I found this to be an excellent video and very enjoyable. As for the possibility of you making a video strictly based on the intrinsics library for the CPU's extensions, I would truly be interested in it. I've always used Intel CPUs and I'm decently familiar with the x86 architecture, but never got to work with that part of the CPU's instruction set.

Something else to consider... One possible optimization could be to calculate each iteration of the Mandelbrot set as a recursive function template; the compiler can precalculate the values for you. Then you can cache them repetitively for each loop, each iteration.

Another technique would be similar to how rendering pipelines work with a back buffer and a swap chain. You would write to a texture image that is the back buffer first, calculate the next iteration of values, and write it to the next texture to be rendered. Then have the swap chain present the image to the screen as a single texture mapped to a screen-sized quad, using basic vertex and fragment (pixel) shaders. After that, have a compute shader working in its own thread to calculate the next iteration based on where the zoom took place within the image, the zoom amount, and the zoom level. This then creates the appropriate texture for the next iteration and feeds it into the swap chain's back buffer. This ought to make it possible to have nearly an infinite number of zoom levels with minimal overhead, with a 64-bit unsigned integer representing which zoom level we are on; that zoom-level variable would be incremented or decremented depending on the direction of the next zoom provided by the user.

Another thing to consider is that we do have double-precision floating-point variables to work with, but they shouldn't be needed. It should be possible to work with basic floating-point types, provided we know what level of scale we are at, since this algorithm generates a unique set of floating-point values for each iteration and zoom level. There may be a small amount of overhead, as we would have to keep an extra floating-point value around holding the state of the previous zoom level, so that if the user zooms out we can easily reference the previous state for the next set of calculations. It's almost the same as what you've already done in this video, but it would add a few features: template metaprogramming to precalculate values, and a graphics pipeline using not just the basic shaders, the swap chain, and the back buffer, but compute shaders as well, giving the ability to perform both pre- and post-process calculations of the data for the next iteration to be rendered.
@axelBr1 • 2 years ago
Seeing your code optimisation (at about the 20 min mark) not working was one of the best parts of the video. In the past I would have done the same things that you did, like moving variable declarations outside the loop (I would have probably missed the calculation optimisation though), but doing that troubles me, as it follows bad programming practice; it separates the point of declaration from use and increases the variable's scope. I've been hoping that the compiler would do lots of magic to allow me to write clear code that is optimised.
@Padeir0 • 4 years ago
That was a very enjoyable video! And I'm really hoping for a follow-up video working around the precision issues :D
@postvideo97 • 4 years ago
Very interesting that brute force allows a two/three order of magnitude acceleration. Personal computers have gotten so fast, it's incredible. The usual method of accelerating Mandelbrot set rendering is to use a series approximation: you render a single pixel at full precision (the anchor) and use small residual numbers for all other pixels, shaving many orders of magnitude off the time.
@alles_muss_anders_werden • 3 years ago
I have done the fractal calculation on a 64-bit Linux system that has 2 processors with 6 compute cores each. I included the libomp library and it performed very well.
@homomorphic • 4 years ago
Definitely your best video.
@GrayBlood1331 • 4 years ago
I come to this channel whenever I want to feel stupid.
@diveinsidetech • 4 years ago
Hi David, the vids are awesome and I loved them! You deserve to be earning more subs!
@nonchip • 4 years ago
34:00 since "goto is evil" (as in "not as intuitive to see the loop at all times"), couldn't that "label: ....... if(...) goto label;" just be literally swapped out for "do{......}while(...)" to improve readability? or is there something weird about the mm256 stuff preventing that? afaik "do while" is pretty much the same as "label if goto" internally (just without a "{block}")? 46:30 "i just copied it in verbatim" couldn't you just call the one you've already implemented instead?
@oj0024 • 4 years ago
It would have been quite interesting to compare different compiler optimization options like automatic SIMD, fast-math, and -O3, or the MSVC equivalents.
@javidx9 • 4 years ago
I agree, the source is available, hint hint XD
@michaelmahn4373 • 4 years ago
I was also wondering how automatic SIMD would perform in comparison. But then the program would crash on an old CPU without AVX, whereas it can fall back on a slower method if you use intrinsics, afaik.
@obinator9065 • 4 years ago
@Michael Mahn Compilers are so advanced that there really is no need to go through that. -O3 packs everything full of AVX and SSE.
@stephenkamenar • 4 years ago
40:08 I was working on a similar problem and found that just throwing 1000 threads at it was actually faster than trying to smartly divide the work between threads (I was using "greenthreads").
@prohacker5086 • 4 years ago
Really? That's interesting!
@bob80808 • 4 years ago
"I assumed I could do a better job than the compiler. I thought I was smart. I now know one doesn't mess with the compiler gods." Hahahaha love it!!
@benrex7775 • 4 years ago
This was really interesting. Thank you for making the video.
@s-sugoi835 • 4 years ago
Subbed since 10k, and your videos still motivate me to not give up on C++, even though I mainly program in Python nowadays because of university. C++ sure is beautiful 😊.
@lifeisgameplayit • 4 years ago
Big thumbs up here. I find this tube content rather elegant, with some funky twist in it. ;}
@RogueShadowTCN • 4 years ago
This may be the most C++ video in a while.
@fernandomartins5451 • 4 years ago
Awesome video! I'd love to see more about SIMD and vectorization.
@Kattemageren • 4 years ago
TL;DR Bisqwit also did this, search up "bisqwit parallelism in c++" I know you already covered SIMD I would just like to point out that another one of my favorite youtubers also did a great series on this subject. Search up "bisqwit parallelism in c++". It is a great series that you could probably draw some inspiration from! Great video as always, thanks!
@johngreen1060 • 4 years ago
I would like to know how important aligning to cache lines is in this case, or how bad it would get if we encountered cache associativity issues. People often talk about how important it is but rarely provide numbers.
@javidx9 • 4 years ago
Hi John, usually with vector processors, managing cache and memory alignment is actually quite important, but this algorithm never reads anything from memory, so it's not applicable.