FASTER Ray Tracing with Multithreading // Ray Tracing series

43,346 views

The Cherno

1 day ago

Comments: 146
@TheCherno 2 years ago
Thank you all for watching! If you want to contribute to the optimization discussion, check out the GitHub issue here ► github.com/TheCherno/RayTracing/issues/6. Also check out Brilliant to learn all the math you need for this series! Get started for free, and hurry - the first 200 people get 20% off an annual premium subscription ► brilliant.org/TheCherno/
@Theawesomeking4444 2 years ago
Can you please do a Morton Z-order in C++ tutorial next? I feel that would be nice to learn, considering graphics code uses it a lot.
@ivanivenskii6942 2 years ago
Hello, do you know Russian?
@ivanivenskii6942 2 years ago
@@dav1dsm1th Glory to the hero
@nathans_codes 1 year ago
Can you take a look at the issues and PRs on the Walnut repo? It has some serious problems right now.
@blackbriarmead1966 2 years ago
This video seems made for me. I was on a huge time crunch, so I had to implement a ray tracer with reflections, BVH, etc. in about 36 hours total. It took a lot of coffee but I got it done. It's reasonably performant, but I rendered a similar scene using Cycles in Blender and it is simply so much faster. What takes Blender seconds takes me minutes, even with multithreading, and I don't have "fancy" features such as texture mapping running yet.
@blackbriarmead1966 2 years ago
The way I'm currently doing it is with a library called CTPL, into which I push all of my future operations. I give each thread an n×n block, just like Blender, and as the tasks complete CTPL deals with joining the threads and starting new threads and all of that. I have them all write to the same framebuffer, which I display on the screen so you can keep track of the progress of the render.
@blackbriarmead1966 2 years ago
Update: minimized the size of the bounding boxes in the BVH by using the surface area heuristic, made it 50% faster.
@Fragtex_CN 2 years ago
Hey bro, if it's possible may I have a link to your repository, to learn something from it?
@Alkanen 2 years ago
@@blackbriarmead1966 Simply picking the two objects that create the bounding box with the smallest surface area to combine? Do you loop through all your objects to find the absolute smallest surface area, or do you take a more stochastic approach by sampling the objects and picking the smallest area from the objects in the sample to speed up BVH creation?
@blackbriarmead1966 2 years ago
@@Alkanen The way I do it currently is I sort the objects by their centroids along the x, y, or z axis, depending on the depth of the bounding box. I create two bounding boxes, one starting at the triangle with the smallest value and the other starting at the triangle with the largest value, and I add triangles smaller-to-bigger and bigger-to-smaller respectively. I store the surface area of all of these potential bounding boxes, and I choose the pair of bounding boxes which minimizes the surface area heuristic. The surface area heuristic is the surface area of the bounding box times the number of children it has; the lower this heuristic, the more optimized the BVH. So you would choose the candidates that minimize this, or choose not to split the parent if none of the candidates are better than the parent itself for some reason. I use axis-aligned bounding boxes, which allow for faster intersection calculations than some other methods.
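For anyone curious what that sweep looks like in code, here is a minimal C++ sketch of a SAH split along one axis. The Aabb and Primitive types and the function name are made up for illustration (they're not from the commenter's project), and GLM is assumed for the vector math:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#include <glm/glm.hpp> // assumed: GLM for vector math

// Hypothetical types, just for this sketch.
struct Aabb
{
    glm::vec3 Min{ 1e30f }, Max{ -1e30f };
    void Grow(const Aabb& b) { Min = glm::min(Min, b.Min); Max = glm::max(Max, b.Max); }
    float SurfaceArea() const
    {
        glm::vec3 e = Max - Min;
        return 2.0f * (e.x * e.y + e.y * e.z + e.z * e.x);
    }
};

struct Primitive { Aabb Bounds; glm::vec3 Centroid; };

// Sweep-based SAH split along one axis. Returns the split index
// (primitives [0, split) go left) and writes the best cost, using the cost
// model described above: SA(left) * countLeft + SA(right) * countRight.
std::size_t FindBestSahSplit(std::vector<Primitive>& prims, int axis, float& outCost)
{
    std::sort(prims.begin(), prims.end(), [axis](const Primitive& a, const Primitive& b) {
        return a.Centroid[axis] < b.Centroid[axis];
    });

    const std::size_t n = prims.size();
    std::vector<float> rightArea(n, 0.0f);

    // Sweep from the right, recording the surface area of the growing right-hand box.
    Aabb right;
    for (std::size_t i = n; i-- > 1; )
    {
        right.Grow(prims[i].Bounds);
        rightArea[i] = right.SurfaceArea();
    }

    // Sweep from the left and evaluate every split position.
    Aabb left;
    std::size_t bestSplit = 1;
    outCost = 1e30f;
    for (std::size_t i = 1; i < n; i++)
    {
        left.Grow(prims[i - 1].Bounds);
        float cost = left.SurfaceArea() * float(i) + rightArea[i] * float(n - i);
        if (cost < outCost) { outCost = cost; bestSplit = i; }
    }
    return bestSplit;
}
```

The caller would compare the best cost against the parent's own cost (its surface area times its primitive count) and skip the split if it doesn't improve, exactly as the comment describes.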
@FabricioSTH 2 years ago
Maybe a Matt Parker phenomenon erupts from the internets and we get a 40,832,277,770% improvement. Or maybe not, because we are not starting with Python.
@matthewparker9276 1 year ago
Probably not that much. It's not like the baseline was 1 month to render a frame.
@bishboria 2 years ago
In my own version of this, I initially tried grouping a chunk of rows per thread and got good improvements. But then I noticed that certain blocks would take longer to run if there was a lot going on in that part of the image, so you'd have one thread working alone when all the others were finished. I ended up using a thread pool and allocating each thread in the pool to work on one pixel; once that pixel was calculated, the thread would go back in the pool and pick up the next pixel to work on. This worked very well and keeps maxing out the CPU until there are fewer pixels left to calculate than cores available to work on them. I'd love to change the code to work on the GPU, and I did try for a while to get Metal to work but just couldn't work it out…
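A minimal sketch of that self-balancing idea in C++, using a shared atomic pixel counter rather than a full thread pool (RenderPixel is a stand-in for the real per-pixel shading, not code from the video):

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for whatever shades a single pixel; replace with your renderer's code.
void RenderPixel(uint32_t x, uint32_t y) { (void)x; (void)y; }

// Each worker grabs the next unprocessed pixel index until none are left, so
// cheap regions (sky) and expensive regions (geometry) balance out automatically.
void RenderFrame(uint32_t width, uint32_t height)
{
    std::atomic<uint64_t> nextPixel{ 0 };
    const uint64_t pixelCount = uint64_t(width) * height;

    const unsigned threadCount = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    workers.reserve(threadCount);

    for (unsigned t = 0; t < threadCount; t++)
    {
        workers.emplace_back([&]()
        {
            for (;;)
            {
                const uint64_t i = nextPixel.fetch_add(1, std::memory_order_relaxed);
                if (i >= pixelCount)
                    break;
                RenderPixel(uint32_t(i % width), uint32_t(i / width));
            }
        });
    }
    for (auto& worker : workers)
        worker.join();
}
```

One pixel per atomic increment is the simplest version; grabbing a small batch of pixels per fetch cuts the atomic traffic if that ever shows up in a profile.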
@bunpasi 2 years ago
Good point. Have you tried interleaving the rows? So if you have 8 hyperthreads, you skip 7 rows. The work will probably be divided more evenly that way.
@bishboria 2 years ago
@@bunpasi I think you'd still have a similar problem as with chunking: one thread will be running the final row when all the others have finished and are now idle. The whole CPU won't be maxed out.
@bunpasi 2 years ago
@@bishboria We can simplify the problem by using an image with 3 regions. The 2 upper regions are primarily sky and take 1 ms each to process, whereas the bottom region has a lot of objects and takes 7 ms. With 1 thread, the image will take 9 ms. Now we look at 3 threads, so ideally it will take 9/3 = 3 ms. Scenario 1: we use chunks. Threads 1 and 2 will be done in 1 ms, but thread 3 will take 7 ms, so in total it will take 7 ms. Scenario 2: we skip rows. All threads handle a third of each region: 1/3 + 1/3 + 7/3 = 3 ms. And yes, one thread might lag a few rows behind, but with a height of 1080 px this is orders of magnitude less. Even if one thread is 10 rows behind, each of those rows only adds 7 / (1080 / 3) ≈ 0.02 ms.
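A quick sketch of that row-interleaving idea in C++ (thread t renders rows t, t + N, t + 2N, …; RenderRow is a placeholder for the per-scanline work, not code from the video):

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for whatever shades one scanline.
void RenderRow(uint32_t y) { (void)y; }

void RenderInterleaved(uint32_t height)
{
    const unsigned threadCount = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < threadCount; t++)
    {
        workers.emplace_back([t, threadCount, height]()
        {
            // Thread t takes rows t, t + N, t + 2N, ... so expensive and cheap
            // rows end up spread roughly evenly across all threads.
            for (uint32_t y = t; y < height; y += threadCount)
                RenderRow(y);
        });
    }
    for (auto& worker : workers)
        worker.join();
}
```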
@bishboria 2 years ago
@@bunpasi Yes, I understood you originally. If you prefer to do it that way, go ahead. For now, while I still need to work out how to convert to GPU-based computation, I prefer the thread pool, as I want as much of the CPU maxed out as I can for as long as possible.
@bunpasi 2 years ago
@@bishboria Because in a gaming engine there are a lot more things you might want to do simultaneously, a thread pool (with event queue) might be the best solution indeed. Good luck!
@Theawesomeking4444 1 year ago
4:10 That's actually wrong: GPUs don't have thousands of cores. What they have is bigger SIMD widths, usually 64-256, and CPUs also have SIMD widths of 8-16, so you can actually turn your CPU into a GPU if you're willing to vectorize or use intrinsics.
@Kazyek 2 years ago
Isn't `std::execution::par` enforcing sequenced execution within each thread, which is not required here? I believe simply switching to `std::execution::par_unseq` would be an instant speedup. But ultimately, thread creation has overhead, and creating exactly as many threads as there are logical cores and distributing the work would be faster. But then again, not all threads would have the same amount of work, since some pixels take longer than others, so to fully saturate all threads for the whole frame it would be better to use a work-stealing thread pool. However, exactly N threads (N: number of logical cores) might still be faster even if not perfectly balanced, if you have distinct tiles with thread-local data for better cache locality...
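For reference, the switch being suggested is just the execution-policy argument. A minimal sketch (the two iterator vectors mirror the approach shown in the video, but the names and the ShadePixel placeholder are my own, not the project's actual code):

```cpp
#include <algorithm>
#include <cstdint>
#include <execution>
#include <vector>

// Stand-in for the per-pixel work.
void ShadePixel(uint32_t x, uint32_t y) { (void)x; (void)y; }

void Render(const std::vector<uint32_t>& rows, const std::vector<uint32_t>& columns)
{
    // par_unseq additionally allows the implementation to vectorize/interleave
    // iterations within a thread, so the per-pixel body must not take locks.
    std::for_each(std::execution::par_unseq, rows.begin(), rows.end(),
        [&columns](uint32_t y)
        {
            std::for_each(std::execution::par_unseq, columns.begin(), columns.end(),
                [y](uint32_t x)
                {
                    ShadePixel(x, y);
                });
        });
}
```

Note that on GCC/Clang with libstdc++ the parallel policies are typically backed by Intel TBB (link with -ltbb), while MSVC ships its own backend.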
@srisairayapudi6074 2 years ago
YO BELATED HAPPY BDAY MAN! Wish i came sooner, would have wished you on the day :( HAVE A GOOD ONE EVERYDAY
@ChrisM541 2 years ago
Excellent challenge, cheers Cherno! Loving this series. There's a lot of optimisation possible here - 2x faster (around 60ms/16.6fps to 30ms/33.3fps) is some way below what we'd expect from fully independent worker units (check: are they? include a per-worker timer and look for normal/abnormal timing distribution), all this assuming maximum threads isn't set to 2, of course ;) I'd also be checking the thread allocation process (hint: another, more 'direct' way?), and making sure the work is 100% optimally split up and 100% optimally allocated to the maximum thread count returned from hardware_concurrency() (though historically that's not 100% guaranteed to be useful - it's allowed to return 0 - don't know if that's changed... been a while for me).
@zvxcvxcz 2 years ago
Might be from using so many more threads than there are cores. Probably should really restrict to the number of threads the hardware can actually use and have a proper thread queue.
@jfgh900 2 years ago
I really appreciate this! I've always wondered how multithreading is implemented but always got stuck in the syntax. Are there any plans on showing how to set up rendering on a graphics card?
@Alkanen 2 years ago
Messing around trying to optimise the code a bit, I noticed that your implementation of Random::InUnitSphere() is wrong. It's biased towards values in the directions of the corners of the unit box surrounding the sphere (because it draws a sample from the unit box and then normalizes that sample to fit on the surface of a sphere).
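For anyone hitting the same thing: a common unbiased alternative is rejection sampling. A minimal sketch, assuming GLM; Walnut's actual Random API may differ, so treat the names here as placeholders:

```cpp
#include <random>
#include <glm/glm.hpp>

// Uniform float in [-1, 1]; in the real project you'd call the engine's RNG instead.
static float RandomFloatSigned()
{
    static thread_local std::mt19937 engine{ std::random_device{}() };
    static thread_local std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    return dist(engine);
}

// Uniform point inside the unit sphere via rejection sampling: draw from the
// unit cube and discard anything outside the sphere (~52% of draws survive),
// instead of normalizing cube samples, which biases toward the cube's corners.
glm::vec3 RandomInUnitSphere()
{
    for (;;)
    {
        glm::vec3 p(RandomFloatSigned(), RandomFloatSigned(), RandomFloatSigned());
        if (glm::dot(p, p) < 1.0f)
            return p;
    }
}
```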
@theonetribble5867 2 years ago
Hey, thanks for the series. The first video kick-started my learning process about path tracing. In my opinion the series was a little slow and I was eager to outpace it, so I wrote a Vulkan path tracer in Rust and learned most things by doing them. Now I'm writing my Bachelor's thesis about differentiable path tracing. Btw, Mitsuba 3 is a great tool for learning about path tracing as well, especially if you don't want to deal with C++. Anyway, thanks for the inspiration.
@edu_rinaldi 2 years ago
Any suggested sources for learning the Vulkan ray tracing extension? (And maybe also Vulkan in general.) Thanks in advance :)
@Pedro-jj7gp 2 years ago
I'm also interested in hearing about resources to learn Vulkan and path tracing. I might even try and learn Rust while I'm at it! :)
@theonetribble5867 2 years ago
@@edu_rinaldi Hi, I replied to @Pedro. I hope you get the notification.
@theonetribble5867 2 years ago
@@Pedro-jj7gp Hi, sorry for taking so long to reply. It seems that YouTube doesn't allow me to paste links but didn't warn me (if you can't find the resources, contact me directly if that's possible on YT). There are some resources I used to learn Vulkan, though I still don't quite understand it (I used screen-13, a Vulkan abstraction layer in Rust). First of all there is the Vulkan Tutorial, which helped a lot. I can also recommend the Vulkan lecture series from "Computer Graphics at TU Wien". Specifically for ray tracing there are some blog entries from the Khronos Group explaining the high-level layout. For more detail there is a tutorial from NVIDIA which uses the KHR extension (note: there are, I think, two extensions for Vulkan ray tracing, KHR and one from NVIDIA; the KHR extension also works on AMD GPUs). If you want to learn more about path tracing in general, there is also the Rendering lecture from CG at TU Wien (that's where I learned the most about path tracing). In general, if you want to know things about such topics, I can recommend looking at lectures from universities - many European universities put their lectures online, and MIT also has some stuff under "OpenCourseWare". I can also highly recommend the paper from Eric Veach if you want a more mathematical background, but it's very long and I mostly use it for reference.
@edu_rinaldi 2 years ago
@@theonetribble5867 Thank you so much! ❤️
@marcotroster8247 2 years ago
Superior performance can also be achieved with techniques other than multi-threading. In fact, threading can actually be slower when the synchronization effort outweighs the performance gains (see Amdahl's law). First, notice that CPUs already have parallelism built into their instruction feed pipeline. Fetching / decoding / executing / writing back results can be performed in parallel for successive instructions if they don't depend on each others results. Rearranging commands in assembly can have crazy gains (but with C/C++ we usually don't dig that deep). Second, there are dedicated SIMD instruction sets on modern CPUs that can perform the same operations for multiple inputs (256 / 512 bit wide registers) at once to increase data throughput (e.g. 8 or 16 float ops at once). Third, avoiding allocation can save lots of compute, too. Preprocessing data only once upfront is very nice. And having smaller stack frames to allocate / destroy is also important. Using some static, rewritable cache memory that's owned by one thread can really help performance such that there's smaller stack frames (at the downside of non-threadsafe code). And last, there are different CPU caching layers which have 1000x faster I/O delays. So fitting all the memory in a faster cache and constantly reusing it will skyrocket the performance. CPUs have great latency once the data is loaded into a register. Small and simple is fast. Maybe this inspires some devs here to write faster programs. Cheers, have fun at optimizing 🤓👨🏻‍💻🏎️
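To make the SIMD point concrete, here's a small illustrative AVX sketch that evaluates eight ray-sphere discriminants at once (SoA layout, sphere centered at the origin, ray directions assumed normalized; needs a CPU and compiler with AVX enabled, e.g. -mavx). It is not the video's code, just an example of 8-wide float math:

```cpp
#include <immintrin.h>

// Computes the quadratic discriminant b^2 - 4ac for 8 rays at once against a
// sphere of radius `radius` centered at the origin. Directions are assumed
// normalized, so a = dot(d, d) = 1. Inputs are structure-of-arrays:
// 8 origins and 8 directions, one component per array.
void Discriminant8(const float* ox, const float* oy, const float* oz,
                   const float* dx, const float* dy, const float* dz,
                   float radius, float* outDisc)
{
    __m256 Ox = _mm256_loadu_ps(ox), Oy = _mm256_loadu_ps(oy), Oz = _mm256_loadu_ps(oz);
    __m256 Dx = _mm256_loadu_ps(dx), Dy = _mm256_loadu_ps(dy), Dz = _mm256_loadu_ps(dz);

    // b = 2 * dot(origin, direction)
    __m256 b = _mm256_mul_ps(Ox, Dx);
    b = _mm256_add_ps(b, _mm256_mul_ps(Oy, Dy));
    b = _mm256_add_ps(b, _mm256_mul_ps(Oz, Dz));
    b = _mm256_mul_ps(b, _mm256_set1_ps(2.0f));

    // c = dot(origin, origin) - radius^2
    __m256 c = _mm256_mul_ps(Ox, Ox);
    c = _mm256_add_ps(c, _mm256_mul_ps(Oy, Oy));
    c = _mm256_add_ps(c, _mm256_mul_ps(Oz, Oz));
    c = _mm256_sub_ps(c, _mm256_set1_ps(radius * radius));

    // discriminant = b^2 - 4ac (with a == 1)
    __m256 disc = _mm256_sub_ps(_mm256_mul_ps(b, b),
                                _mm256_mul_ps(_mm256_set1_ps(4.0f), c));
    _mm256_storeu_ps(outDisc, disc);
}
```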
@manuntn08 2 years ago
Thank you very much for the effort you put into this video. I've learnt a lot from your tips. Could you please make some videos about how to optimize the case where the computation for one pixel depends on the pixels around it (for example convolution, Gaussian filtering...)? Once again, thank you and have a nice year!
@jeofthevirtuoussand 2 years ago
I am not a programmer nor a developer, but I am actually curious. Would it be possible to say to the hardware: "hey, can you run ray tracing in parallel on 3 cores but only use 60% of those cores, and assign the remaining 40% to enemy AI calculations"?
@peezieforestem5078 2 years ago
Would you please do more episodes on various methods of multithreading? The C++17-exclusive thing is nice, but I'd like to know the most broadly applicable method, a method that works for C, the most optimal method, etc.
@Alkanen 2 years ago
I suspect the most widely supported variant might be using pthreads. It's originally Unix (well, POSIX), but there are Windows-compatible implementations available if you google for a couple of minutes, and then you'll have code that works on all POSIX-compatible systems, which is pretty nice. And it's in C. Not too bad to work with either, if I remember correctly, but it's been a few decades (jesus, I'm getting old) since I wrote my wrapper around it, so I might be misremembering :)
@peezieforestem5078 2 years ago
@@Alkanen Thank you, mate!
@jumponblocker 2 years ago
I actually had an assignment where we made a raytracer recently. Kind of funny that I also used std::for_each which I had not heard of before. The only difference was that I just looped over 1 vector containing each pixel index rather than an inner and outer loop.
@ezpzgamez 1 year ago
I have been following along with this series while writing in Rust instead of C++ to see how things compare. Until this episode, everything on the Rust side has been matching the C++ performance, if not doing somewhat better. (For comparison with the laptop: my desktop PC with an i9-9900K gets about 15ms where the laptop gets about 60ms single-threaded.) One thing Rust suffers from here is being able to mutate simple structures in an async context. A mutex or RwLock is required to do what this multithreading asks for, unless you allocate temporary buffers (one each for the image data and the accumulation data). In an unsafe context it would be a lot easier, but unfortunately Rust lacks a lot of things here, including some unsafe items: SyncUnsafeCell has yet to be stabilized. So from here on out I guess I'll stick with single-threaded and see how the performance goes. Would rather do that than clone two large vectors on every iteration. Just my two cents from outside of C++ :)
@1ups_15 7 months ago
Hello, thank you for your video, it looks very useful. However, I have a problem: I have noticed that my raytracer doesn't gain any performance from applying your changes - it even gets slightly worse - and when I look at my processor usage using htop, only one of my cores is being used. I am using Linux and compiling with g++ through CMake. Are there some flags I could use to actually make it multithreaded?
@alessandrocaviola1575 2 years ago
On my raytracer I got almost perfect scaling in performance: 4x the speed the moment I multithreaded it on a 4-core CPU, so there is definitely room for improvement there.
@Theodorlei1 1 year ago
Yeah, he got a 2.5x speedup on an 8-core machine on a parallel problem - at least 8x should be possible for him.
@Iuigi_t 1 year ago
Where are the triangles?
@lithium 2 years ago
std::iota is the "fancy function" you're avoiding to generate sequences, fyi ;)
@ovi1326 2 years ago
Allocating a vector of numbers going from 0 to width and height made me very sad, although I get that this is for the sake of simplicity. For anyone interested, though, here are some tips. A more proper way to go about this would be to either implement a custom range iterator (look up LegacyIterator on cppreference) or use std::ranges::iota_view, which is roughly equivalent to Python's `range()` or Rust's `x..y` thingy. You can also just avoid the parallel for_each entirely and instead split the work across multiple threads by giving each of them responsibility over an equally divided range of scanlines. This is pretty straightforward to implement and should yield good enough performance.
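A minimal sketch of that last suggestion - equally divided scanline ranges, one std::thread per range. RenderRow stands in for the per-scanline work and is not the video's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for whatever shades one scanline.
void RenderRow(uint32_t y) { (void)y; }

void RenderChunked(uint32_t height)
{
    const uint32_t threadCount = std::max(1u, std::thread::hardware_concurrency());
    const uint32_t rowsPerThread = (height + threadCount - 1) / threadCount; // ceiling division

    std::vector<std::thread> workers;
    for (uint32_t t = 0; t < threadCount; t++)
    {
        const uint32_t begin = t * rowsPerThread;
        const uint32_t end = std::min(height, begin + rowsPerThread);
        if (begin >= end)
            break;
        workers.emplace_back([begin, end]()
        {
            for (uint32_t y = begin; y < end; y++)
                RenderRow(y);
        });
    }
    for (auto& worker : workers)
        worker.join();
}
```

As the thread above points out, equal chunks can load-balance poorly when one chunk is much more expensive than the others; interleaving rows or pulling rows from a shared queue avoids that.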
@zvxcvxcz 2 years ago
Not just "good enough," but better because there will likely be less cache contention and less thread creation overhead.
@ovi1326 2 years ago
@@zvxcvxcz I meant that there are better methods than simply splitting work by rows, e.g. someone in the comments mentioned using a thread pool to saturate the CPU, which sounds kinda cool.
@gustavbw 1 year ago
Wouldn't allocating the threads on every std::for_each() be highly inefficient compared to pre-allocating the pool when the program starts?
@dmitrysapelnikov 1 year ago
In fact the C++ runtime uses an internal thread pool for parallel for_each(). But AFAIK there is no way for the user to explicitly control this pool.
@eduardoassis2826 1 year ago
Hey, how do you draw over your current window during your explanations? I've been curious for a long time and can't help but ask :).
@helmuthpetelin4613 2 years ago
Hey, have you planned to show how to push the ray tracing to the GPU?
@HandsomeLukeMan 2 years ago
Love the red you've done with your syntax highlighting. How did you do this? I don't see an option for keywords like const and for and if in VA settings? Curious what value of red that is.
@davidrobinson8523 2 years ago
It's from a third-party paid extension, Visual Assist. And yes, it is so much better than the defaults.
@HandsomeLukeMan 2 years ago
@@davidrobinson8523 Yeah, I've got VA but curious what he did to modify his theme. I do not know how I would live without VA now that I've used it for so long.
@thebasicmaterialsproject1892 2 years ago
go on the cherno still killing it
@ivansanz4029 2 years ago
If instead of having each thread do a row you make them do a column, the performance is even better as the "sky" is very cheap to process and the real complex part (the "ground") is distributed better across threads.
@ZeroUm_ 2 years ago
It probably won't do much. If 20% of a scene is sky, with 1080 lines you still have 216 lines divided across a much smaller number of threads; with 8 threads, that's still 27 passes, enough to saturate them equally.
@ivansanz4029 2 years ago
@@ZeroUm_ Yeah I was forward-thinking to when he will use the GPU cores :D
@thomasavino3450 2 years ago
What theme/color scheme are you using? (the default visual assist is not like this)
@Kaldrax 2 years ago
Interesting, I didn't know about this one. I attended a lecture called High Performance Computing last semester in which we did similar things, starting with OpenMPI, then threads, and in the end OpenMP. I absolutely cannot recommend OpenMPI, since it's a total nightmare. OpenMP, on the other hand, would simplify this code: you don't need the iterators, and I believe you can just write #pragma omp parallel for collapse(2) above the nested loops and it will achieve the same performance. 🙂
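For completeness, roughly what that OpenMP version would look like (compile with -fopenmp; ShadePixel is a placeholder for the per-pixel work, not the video's actual function):

```cpp
#include <cstdint>

// Stand-in for the per-pixel work.
void ShadePixel(uint32_t x, uint32_t y) { (void)x; (void)y; }

void Render(uint32_t width, uint32_t height)
{
    // collapse(2) fuses both loops into a single iteration space so OpenMP can
    // distribute individual pixels rather than whole rows across its thread pool;
    // schedule(dynamic) helps when some pixels are much more expensive than others.
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int64_t y = 0; y < (int64_t)height; y++)
        for (int64_t x = 0; x < (int64_t)width; x++)
            ShadePixel((uint32_t)x, (uint32_t)y);
}
```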
@unknownunknown6531 2 years ago
OpenMPI doesn't address the same problem: it is used to distribute a task across multiple computers (a cluster) rather than just one, hence the additional complexity :). OpenMP is indeed the tool to use in this case!
@psychoinferno4227 2 years ago
As an exercise, you should run a profiler and understand why it's only 2x faster on an 8-core machine.
@peezieforestem5078 2 years ago
I did some testing with OpenMP and my code started working slower... not sure why this happens, I made sure to parallelize the independent loops.
@zvxcvxcz 2 years ago
Yup, iterators are gross, OpenMP is way nicer (suck it C++ committee).
@zvxcvxcz 2 years ago
@@peezieforestem5078 Slower than what was done in the video, or slower than the code was before? You shouldn't really do even what he did in the video. In either case, creating way more threads than you actually have the hardware for can cause a lot of contention and cache misses and actually slow things down sometimes. He has 8x the hardware threads and was only getting about 2x the performance... not exactly ideal. What you should really do is create just 8-16 threads when you have 8 physical cores and have a thread queue, so they pick up a new task each time they finish a pixel until there are no pixels left.
@anime_erotika585 1 year ago
7:07 I want multithreading, at my table, until tomorrow!
@CreativeOven 2 years ago
Dude, make us a chapter someday showing you programming in C++ to get to your level... (idea), because some of us are still super stone-age in C++.
@sshawarma 2 years ago
Awesome video as always! Why was the program not running 8x faster? Only thing I can think of is an IO bottleneck.
@psychoinferno4227 2 years ago
Run a profiler and you'll find a different answer. If you want to spoil the fun, see the responses in the GitHub discussion.
@kelvinpoetra 2 years ago
Hello Cherno, I want to ask how to make graphics software, and software such as Microsoft Word. Are the basic stages of making any software all the same?
@ChaoticFlounder 2 years ago
How difficult would it be to implement the ray tracing calculations on the integrated graphics on your CPU?
@zvxcvxcz 2 years ago
"It depends" is the unfortunate answer there. It depends on just what types you're using, what the driver for that GPU exposes and whether it supports the necessary extensions, etc... Maybe you can drop it on there with CUDA or OpenCL, or maybe you can even wrangle the regular display part of the driver into giving you what you need with OpenGL or DirectX, etc... Often laptop manufacturers have not been great about switching between these GPUs (sometimes if you're primarily on the discrete card, the integrated one can be almost totally deactivated, or vice versa). Sometimes that is seen as a plus, since it dealt with battery concerns.
@vasile2321 1 year ago
What RTX do you have in your PC? Thx
@gabrieldesimone4644 2 years ago
Hey there, I'm not familiar with C# or game-making stuff, but I was wondering: since that code is running on CPU cores, how do you make it use GPU cores instead?
@Alkanen 2 years ago
That's coming in a future episode
@zvxcvxcz 2 years ago
3 main options: 1) wrangle your GPU into doing it by sort of telling it that it's doing normal math for output, using OpenGL/DirectX/etc.; 2) use OpenCL; 3) use CUDA.
@steellung 2 years ago
Does anyone know which software he uses for drawing on the screen on the fly?
@rastaarmando7058 2 years ago
It looks very similar to gInk.
@steellung 2 years ago
@@rastaarmando7058 Cool, didn't know this one. Thanks
@erikrl2 2 years ago
He uses ZoomIt
@mackerel987 2 years ago
Hey guys, does anyone get the "no instance of overloaded function 'std::for_each' matches the argument list" error? AFAIK we only need to include the <execution> header for it to work. Am I missing something?
@simonmaracine4721 2 years ago
Make sure you compile with the C++17 flag or newer, and that your compiler supports C++17.
@mackerel987 2 years ago
@@simonmaracine4721 Exactly what was wrong. Thank you.
@ricbattaglia6976 1 year ago
Isn't a GPU render faster? Thanks
@CreativeOven 2 years ago
Comment 10, 10 out of 10 :D. How is Hazel? I see it's not all about drawing those OpenGL 3D lines for those vertices, right? :P
@MorebitsUK 2 years ago
Nice!! Always good content, Cherno. Any idea how to use IntStream in Java to parallelize stuff? FYI I'm using `map`, not `for_each`.
@wuangg 2 years ago
Use IntStream.parallel() to get a parallel IntStream and, after that, use forEach() to perform an action on each element of the stream in parallel; it will use all available processors to do the job. For example:

IntStream stream = IntStream.range(1, 10); // create a sequential ordered IntStream over the range 1 to 10
stream.parallel().forEach(i -> {
    // do stuff with element 'i' here
}); // perform the action on each element of the stream across multiple threads

This is equivalent to C++ std::for_each with the parallel execution policy, which is being shown in this video.
@MorebitsUK 2 years ago
@@wuangg Thanks for the reply, but I'm using map, not forEach. I just need to return something from the map.

String[] results = IntStream.range(0, imageHeight - 1).parallel().map(i -> { // y value
    String row = String.join(System.lineSeparator(), IntStream.range(0, imageWidth).map(j -> { // x value
        Vec3 pixelColour = new Vec3(0, 0, 0);
        float u = (i + Utils.randomFloat(0.0f, 1.0f)) / (float) (imageWidth - 1);
        float v = (j + Utils.randomFloat(0.0f, 1.0f)) / (float) (imageHeight - 1);
        final Ray rayP = camera.getRay(u, v);
        pixelColour.addEquals(rayColor(rayP, finalWorld, maxDepth));
        String pixel = PPM.vectorToRGB(pixelColour, 1);
    }));
@andrewporter1868 2 years ago
Multi-threading is also a mistake. It's a failure to defer parallel computing to the programmer. Instead of providing an asynchronous master-slave universal scheduler system, and on top of that the ability to do cheap software scheduling by providing a simple custom scheduler that can use the exact same code (it's asynchronous, so you just insert the scheduler code at some point in the future on one of your existing execution pathways), we got this pile of garbage that requires us to add all this overhead by synchronizing everything. It's just this massive headache where you can't just write parallel code - you have to think about synchronization too, and if you think too hard, you get a synchronization bug that you spend the afternoon fixing instead of fixing your actual code, the code that's supposed to be part of the design you're implementing, not a standard-library feature that's missing from every language and imposed on us by all major operating systems.
@rckeet 2 years ago
oh yesssssss!!😎
@JATmatic 2 years ago
I made it much faster than the MT version here by fixing the wonky Walnut::Random code and removing branches from the Renderer::TraceRay() loop. The render runs in about ~11 ms on an 8-core Ryzen 2700.
@stinkybeam 1 year ago
I know nothing of programming and coding; watching this video reminds me of high school math class. I think I understand, but actually I don't.
@CP-sr6ml 1 year ago
Don't get me wrong, your content is great, but... why are we bothering with multithreading if we could just move to the GPU? I don't understand why you keep building and even optimizing like this on the CPU side now. Won't that just make it harder/more work to move to the GPU?
@ng.h9315 2 years ago
Wonderful courses 👌, but please continue the "Create Game Engine in C++" course and add a 3D game development option with builds for Android, iOS... Please teach us how to create a game engine like Unreal Engine 😀. I'm waiting for your answer... Thanks for everything, Cherno ♥️
@larryfulkerson4505 2 months ago
I like to write code by the principle of least astonishment.
@AnalogFoundry 2 years ago
I wish the team at Striking Distance Studios would take notes and improve ray-tracing performance in their game called The Callisto Protocol. At the moment their CPU utilization with RT is abysmal.
@zvxcvxcz 2 years ago
Are they not doing their ray tracing on the GPU, though? Recent GPUs have hardware-accelerated ray tracing. I'm not at all familiar with the game or what they've done, other than that it's supposed to be a AAA title. I would expect any AAA to be using the GPU features for this (whether or not they should be).
@AnalogFoundry 2 years ago
@@zvxcvxcz - they are doing RT on the GPU using the dedicated RT cores of AMD and NVIDIA, but building the BVH and related work is handled on the CPU, so RT can be very taxing on the CPU too. The problem with The Callisto Protocol is that it uses very little of the CPU (i.e. it's not well multithreaded) even with the latest and greatest multi-core CPUs, which causes huge FPS issues.
@hymen0callis 1 year ago
Unfortunately, std::for_each() is not very "efficient". Apparently, you got a speedup of only about 2, while I (using the exact same parallelization scheme) got a speedup of 5.5 (I only have 8 logical cores) by using PPL's Concurrency::parallel_for() instead. It's not portable code, but if it is almost 3 times faster, I'll go with Microsoft's PPL. Edit: just watched the next video where you fixed your global RNG. In my code, the RNG was already thread_local, which explains the much higher speedup in my example. So, I guess std::for_each() isn't that slow after all.
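Since the thread_local RNG keeps coming up: a single shared generator (or one behind a lock) serializes all the worker threads. A minimal sketch of a per-thread generator, with illustrative names rather than Walnut's actual Random class:

```cpp
#include <random>

// One generator per thread: no shared state to contend on, and seeding each
// engine from random_device keeps the per-thread streams different.
inline float RandomFloat()
{
    thread_local std::mt19937 engine{ std::random_device{}() };
    thread_local std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    return dist(engine);
}
```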
@stephenkamenar 2 years ago
GPUs don't have 8,000 cores. They have very wide instructions - like SIMD, but on massive amounts of data at the same time. Same difference, though.
@nenomius1148 2 years ago
8:50 running around updating two vectors on each window resize is much simpler than that stinky over-engineered std::views::iota from "modern" C++
@ovi1326 2 years ago
yeah but like think of the cache friendliness of accessing a buffer of memory just to get the next consecutive number
@nenomius1148 2 years ago
@@ovi1326 Yeah, reading consecutive numbers from memory is much more cache-friendly than generating them on CPU in registers
@TheApsiiik 1 year ago
It's been 2 months.. where is next episode!!!1
@closingtheloop2593 2 years ago
Why aren't you doing this in CUDA? Or in an OpenGL fragment shader?
@Jkauppa 2 years ago
multicore avx-512 on cpu
@Jkauppa 2 years ago
screen space dynamic baking surface light map caching
@Jkauppa 2 years ago
update the surface dynamic baked light map only when needed, new or when update is needed, like every 4th frame, at some fps, like 240fps, update light only at 60fps
@Jkauppa 2 years ago
pseudo-coding is a must, so that you are not tied to a language
@Jkauppa 2 years ago
focus on programming pipeline or in the pseudo-algorithm methods
@Jkauppa 2 years ago
language specifics are so 80's :)
@Alkanen 2 years ago
Wohoo!
@IshanChaudharii 2 years ago
Oh my goodness finally!!!! ❤️🥲🎉
@Notsorandomnumbers 2 years ago
Anyone know of a channel similar to this but like one degree more amateur? I find myself having difficulty keeping up at points.
@fanisdeli 2 years ago
Complete assumption because I'm too lazy to look it up: I would think that std::for_each would be smarter than creating a thread for every single item in your iterator - creating the threads would be much slower than actually running on one thread. My assumption is that it creates a few threads, depending on your hardware, and reuses them: when one iteration is done, the same thread is used for a future iteration. That would also explain why using nested std::for_each made no difference in performance.
@zvxcvxcz 2 years ago
You think it's only getting 2x rather than at least ballpark 8x if it is being smart? I think the nested for_each makes no difference because the single loop is already so bad for resource contention (several thousand threads on 8 hardware cores...) and thread creation that it doesn't get any worse than that.
@fanisdeli 2 years ago
@@zvxcvxcz I don't think it could possibly be creating millions (1920*1080) of threads, 60+ times a second. Also, in programming there's no such thing as "it can't get worse" lol. If it was a thread per iteration, then without nesting you'd have 1920 threads; with nesting you'd have over 2 million. So, yeah, that would be WAY worse for sure. Like "freeze the entire OS and blue screen" type of stuff.
@MrMirbat 2 years ago
Thanks for sharing your knowledge. Can you do a tutorial on how to make casino games like slot machines (Book of Ra), Texas hold'em poker, or roulette? Thanks in advance.
@mr.mirror1213 2 years ago
lesss gooo
@zvxcvxcz 2 years ago
Iterators are gross... I would rather use OpenMP.
@anlcangulkaya6244 2 years ago
#pragma omp parallel for
@psychoinferno4227 2 years ago
The performance was nearly identical to the for_each with a parallel execution policy.
@peezieforestem5078 2 years ago
Hey, I tried OpenMP once and my code got slower. I'm not sure why, do you have any ideas?
@zvxcvxcz 2 years ago
@@psychoinferno4227 Yes, but with OpenMP you don't need those silly ranges; that's the advantage there. I would expect the performance to be about the same as what was done in the video if done the same way. Creating thousands of threads on a machine with 8 physical cores is begging for 1) overhead due to thread creation and 2) resource contention as all those threads want to get their task executed, so expect an increase in cache misses. The proper way to do it is to create a properly sized thread pool (somewhere between 8 and 16 threads, most likely, if you have 8 hardware cores) and have a task queue where each pixel's processing is a task. Have each thread pick up a new task when finished until there are no tasks left. I would expect something like a 5x-7.8x improvement rather than 2x. I might be wrong, but that's my naive expectation without knowing too many details about the ray tracing algorithm itself. Offhand I don't think we're memory bottlenecked in this case in terms of throughput, just perhaps by cache misses as the threads swap.
@zvxcvxcz 2 years ago
I use a sort of implied threading in Bash too, with the same model. Can't have an ancient Bash though, because they didn't add the feature to wait for any task to finish until like 4.something. So now you can start 8 commands while hundreds more wait, and each time one finishes the next starts - it's pretty sweet. Prior to that Bash version you could only wait for all tasks to finish, or you had to know the exact task you were waiting for (and of course you can't know ahead of time what order they will finish in, in most cases).
@irfanjames6551 2 years ago
Thanks a lot I was really waiting for the optimisations especially M u l t i - t h r e a d i n g.