Thank you all for watching! If you want to contribute to the optimization discussion, check out the GitHub issue here ► github.com/TheCherno/RayTracing/issues/6 Also check out Brilliant to learn all the math you need for this series! Get started for free, and hurry: the first 200 people get 20% off an annual premium subscription ► brilliant.org/TheCherno/
@Theawesomeking44442 жыл бұрын
Can you please do a Morton Z order in C++ tutorial next? I feel that would be nice to learn considering graphics use it a lot.
@ivanivenskii69422 жыл бұрын
Hello, do you know Russian?
@ivanivenskii69422 жыл бұрын
@@dav1dsm1th Glory to the heroes
@nathans_codes Жыл бұрын
Can you take a look at the issues and PRs on the Walnut repo? It has some serious problems right now.
@blackbriarmead19662 жыл бұрын
This video seems made for me. I was on a huge time crunch, so I had to implement a ray tracer with reflections, BVH, etc. in about 36 hours total. It took a lot of coffee but I got it done. It's reasonably performant, but I rendered a similar scene using Cycles in Blender and it is simply so much faster. What takes Blender seconds takes me minutes, even with multithreading, and I don't have "fancy" features such as texture mapping running yet.
@blackbriarmead19662 жыл бұрын
the way I'm currently doing it is by using a library called CTPL, in which I push all of my future operations. I give each thread an nxn block, just like blender, and as the tasks complete ctpl deals with joining the threads and starting new threads and all of that. I have them all write to the same framebuffer which I display on the screen so you can keep track of the progress of the render
@blackbriarmead19662 жыл бұрын
update: minimized size of bounding boxes in BVH by using surface area heuristic, made it 50% faster
@Fragtex_CN2 жыл бұрын
Hey bro. If it's possible, may I have a link to your repository so I can learn something from it? or2
@Alkanen2 жыл бұрын
@@blackbriarmead1966 Simply picking the two objects that create the bounding box with the smallest surface area to combine? Do you loop through all your objects to find the absolute smallest surface area, or do you take a more stochastic approach, sampling the objects and picking the smallest area from the objects in the sample to speed up BVH creation?
@blackbriarmead19662 жыл бұрын
@@Alkanen The way I do it currently is to sort the objects by their centroids along the x, y, or z axis, depending on the depth of the bounding box. I create two bounding boxes, one starting at the triangle with the smallest value and the other starting at the triangle with the largest value, and I add triangles from smallest to largest and from largest to smallest respectively. I store the surface area of all of these potential bounding boxes and choose the pair that minimizes the surface area heuristic. The surface area heuristic is the surface area of the bounding box times the number of children it has. The lower this heuristic, the more optimized the BVH, so you choose the candidates that minimize it, or choose not to split the parent if none of the candidates is better than the parent itself. I use axis-aligned bounding boxes, which allow for faster intersection calculations than some other methods.
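A minimal sketch of the sweep described above, assuming each primitive is already wrapped in its own axis-aligned box. The `Aabb` type and `findSahSplit` are illustrative names, not taken from the commenter's repository, and the cost is the simplified "surface area times primitive count" from the comment rather than a full SAH with traversal costs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical axis-aligned bounding box used only for this sketch.
struct Aabb {
    float lo[3], hi[3];

    static Aabb empty() {
        const float inf = std::numeric_limits<float>::infinity();
        return {{inf, inf, inf}, {-inf, -inf, -inf}};
    }
    void grow(const Aabb& o) {
        for (int a = 0; a < 3; ++a) {
            lo[a] = std::min(lo[a], o.lo[a]);
            hi[a] = std::max(hi[a], o.hi[a]);
        }
    }
    float centroid(int axis) const { return 0.5f * (lo[axis] + hi[axis]); }
    float surfaceArea() const {
        const float dx = hi[0] - lo[0], dy = hi[1] - lo[1], dz = hi[2] - lo[2];
        return 2.0f * (dx * dy + dy * dz + dz * dx);
    }
};

// Sorts prims by centroid along `axis` and returns the index k that splits them
// into [0, k) and [k, n) with the lowest area * count cost.
// Returns 0 when no split is cheaper than keeping the parent as a leaf.
std::size_t findSahSplit(std::vector<Aabb>& prims, int axis) {
    assert(axis >= 0 && axis < 3);
    const std::size_t n = prims.size();
    if (n < 2) return 0;

    std::sort(prims.begin(), prims.end(), [axis](const Aabb& a, const Aabb& b) {
        return a.centroid(axis) < b.centroid(axis);
    });

    // Back-to-front sweep: rightArea[i] is the area of the box around prims[i..n-1].
    std::vector<float> rightArea(n);
    Aabb box = Aabb::empty();
    for (std::size_t i = n; i-- > 0;) {
        box.grow(prims[i]);
        rightArea[i] = box.surfaceArea();
    }
    const float parentCost = rightArea[0] * static_cast<float>(n);

    // Front-to-back sweep: grow the left box and score every candidate split.
    std::size_t bestSplit = 0;
    float bestCost = parentCost;
    box = Aabb::empty();
    for (std::size_t k = 1; k < n; ++k) {
        box.grow(prims[k - 1]);
        const float cost = box.surfaceArea() * static_cast<float>(k)
                         + rightArea[k] * static_cast<float>(n - k);
        if (cost < bestCost) {
            bestCost = cost;
            bestSplit = k;
        }
    }
    return bestSplit;
}
```

Storing the right-hand areas first keeps the whole evaluation at one sort plus two linear passes per node.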
@FabricioSTH2 жыл бұрын
Maybe a Matt Parker phenomenon erupts from the internets and we get a 40,832,277,770% improvement. Or maybe not, 'cause we are not starting with Python.
@matthewparker9276 Жыл бұрын
Probably not that much. It's not like the baseline was 1 month to render a frame.
@bishboria2 жыл бұрын
In my own version of this, I initially tried grouping a chunk of rows per thread and got good improvements. But then I noticed that certain blocks would take longer to run if there was a lot going on in that part of the image, so you'd have 1 thread working alone when all the others were finished. I ended up using a thread pool and allocating each thread in the pool to work on one pixel; once that pixel was calculated, the thread would go back in the pool and pick up the next pixel to work on. This worked very well and keeps maxing out the CPU until there are fewer pixels left to calculate than cores available to work on them. I'd love to change the code to work on the GPU, and I did try for a while to get Metal to work but just couldn't work it out…
@bunpasi2 жыл бұрын
Good point. Have you tried interlacing the rows? So if you have 8 hyperthreads, you skip 7 rows. It's probably going to be better divided.
@bishboria2 жыл бұрын
@@bunpasi I think you’d still have a similar problem as with chunking: one thread will be running the final row when all the others have finished and are now idle. The whole cpu won’t be maxed out.
@bunpasi2 жыл бұрын
@@bishboria We can simplify the problem by using an image with 3 regions. The 2 upper regions are primarily sky and take 1 ms each to process, whereas the bottom region has a lot of objects and takes 7 ms. With 1 thread, the image will take 9 ms. With 3 threads it would ideally take 9/3 = 3 ms. Scenario 1: We use chunks. Threads 1 and 2 will be done in 1 ms, but thread 3 will take 7 ms, so in total it will take 7 ms. Scenario 2: We skip rows. Each thread handles a third of each region: 1/3 + 1/3 + 7/3 = 3 ms. And yes, one thread might lag a few rows behind, but with a height of 1080 px the cost of that is orders of magnitude smaller. A bottom-region row costs 7 / (1080 / 3) ≈ 0.02 ms, so even a thread that is 10 rows behind only adds about 0.2 ms.
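The three-region example can be checked with a small cost model. This is only a sketch: the function names are made up, costs are in abstract integer units per row (1 for sky rows, 7 for busy rows), and it ignores scheduling overhead entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Finish time when each thread renders one contiguous block of rows.
long chunkedMakespan(const std::vector<long>& rowCost, std::size_t threads) {
    assert(threads > 0);
    const std::size_t rows = rowCost.size();
    long worst = 0;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t begin = rows * t / threads;
        const std::size_t end = rows * (t + 1) / threads;
        long sum = 0;
        for (std::size_t y = begin; y < end; ++y) sum += rowCost[y];
        worst = std::max(worst, sum);
    }
    return worst;
}

// Finish time when thread t renders rows t, t + threads, t + 2 * threads, ...
long interleavedMakespan(const std::vector<long>& rowCost, std::size_t threads) {
    assert(threads > 0);
    long worst = 0;
    for (std::size_t t = 0; t < threads; ++t) {
        long sum = 0;
        for (std::size_t y = t; y < rowCost.size(); y += threads) sum += rowCost[y];
        worst = std::max(worst, sum);
    }
    return worst;
}

// 1080 rows: the top two thirds cost 1 unit per row, the bottom third 7 units.
std::vector<long> exampleRowCosts() {
    std::vector<long> cost(1080, 1);
    std::fill(cost.begin() + 720, cost.end(), 7);
    return cost;
}
```

With 3 threads the chunked split finishes after 2520 units and the interleaved split after 1080, which is the same 7 : 3 ratio as the 7 ms versus 3 ms in the comment.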
@bishboria2 жыл бұрын
@@bunpasi yes I understood you originally. If you prefer to do it that way go ahead. For now, while I still need to work out how to convert to gpu based computation, I prefer the thread pool as I want as much of the cpu maxed out as I can for as long as possible.
@bunpasi2 жыл бұрын
@@bishboria Because in a gaming engine there are a lot more things you might want to do simultaneously, a thread pool (with event queue) might be the best solution indeed. Good luck!
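The "idle threads pick up the next piece of work" idea from this thread can be sketched with a fixed set of workers and one shared atomic counter. Everything here is an assumption for illustration: `shadePixel` is a placeholder for the real per-pixel work, and work is claimed one row at a time rather than one pixel at a time to keep traffic on the counter low.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Placeholder for the real per-pixel ray tracing work (hypothetical).
std::uint32_t shadePixel(std::size_t x, std::size_t y) {
    return static_cast<std::uint32_t>(x ^ y);
}

std::vector<std::uint32_t> renderDynamic(std::size_t width, std::size_t height,
                                         unsigned workers) {
    std::vector<std::uint32_t> image(width * height);
    std::atomic<std::size_t> nextRow{0};

    auto work = [&] {
        // fetch_add hands each row index to exactly one worker, so cheap rows
        // and expensive rows balance out without any up-front partitioning.
        for (std::size_t y = nextRow.fetch_add(1); y < height; y = nextRow.fetch_add(1)) {
            for (std::size_t x = 0; x < width; ++x)
                image[y * width + x] = shadePixel(x, y);
        }
    };

    if (workers == 0) workers = 1;
    std::vector<std::thread> pool;
    for (unsigned i = 0; i + 1 < workers; ++i) pool.emplace_back(work);
    work();  // the calling thread takes part as well
    for (auto& t : pool) t.join();
    return image;
}
```

Each worker writes to a disjoint set of rows, so the shared framebuffer needs no locking. In a real renderer the workers would be created once and reused across frames instead of per call.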
@Theawesomeking4444 Жыл бұрын
4:10 That's actually wrong. GPUs don't have thousands of cores; what they have is bigger SIMD widths, usually 64-256. CPUs also have SIMD widths of 8-16, so you can actually turn your CPU into a GPU if you are willing to vectorize or use intrinsics.
@Kazyek2 жыл бұрын
Isn't `std::execution::par` enforcing sequential execution, which is not required here? I believe simply switching to `std::execution::par_unseq` would be an instant speedup. But ultimately, thread creation has overhead, and creating exactly as many threads as there are logical cores and distributing the work among them would be faster. Then again, not all threads would have the same amount of work, since some pixels take longer than others, so to fully saturate all threads for the whole frame it would be better to use a work-stealing thread pool. However, exactly N threads (N: number of logical cores) might still be faster even if not perfectly balanced, if you give them distinct tiles with thread-local data for better cache locality...
@srisairayapudi60742 жыл бұрын
YO BELATED HAPPY BDAY MAN! Wish i came sooner, would have wished you on the day :( HAVE A GOOD ONE EVERYDAY
@ChrisM5412 жыл бұрын
Excellent challenge, cheers Cherno! Loving this series. There's a lot of optimisation possible here - 2x faster (around 60ms/16.6fps to 30ms/33.3fps) is some way below what we'd expect from fully independent worker units (check: are they? include worker timer and look for normal/abnormal timing distribution), all this assuming maximum threads isn't set to 2, of course ;) I'd also be checking the thread allocation process (hint: another, more 'direct' way?), and making sure the work is 100% optimally split up, and 100% optimally allocated to the maximum threads returned from hardware_concurrency() (though historically not 100% guaranteed to work (return 0), don't know if it's now fixed...been a while for me).
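On the `hardware_concurrency()` caveat: the standard allows it to return 0 when the value cannot be determined, so a guard is cheap insurance. A minimal sketch, where `workerCount` and the fallback of 4 are arbitrary choices for illustration:

```cpp
#include <cassert>
#include <thread>

// Number of worker threads to create. std::thread::hardware_concurrency()
// is only a hint and may be 0, in which case `fallback` is used instead.
unsigned workerCount(unsigned reported = std::thread::hardware_concurrency(),
                     unsigned fallback = 4) {
    assert(fallback > 0);
    return reported == 0 ? fallback : reported;
}
```

Calling `workerCount()` with no arguments queries the hardware; the parameters exist only so the fallback path can be exercised directly.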
@zvxcvxcz2 жыл бұрын
Might be from using so many more threads than there are cores. Probably should really restrict to the number of threads the hardware can actually use and have a proper thread queue.
@jfgh9002 жыл бұрын
I really appreciate this! I've always wondered how multithreading is implemented but always got stuck in the syntax. Are there any plans on showing how to set up rendering on a graphics card?
@Alkanen2 жыл бұрын
Messing around trying to optimise the code a bit, I noticed that your implementation of Random::InUnitSphere() is wrong. It's biased towards values in the directions of the corners of the unit box surrounding the sphere (because it draws a sample from the unit box and then normalizes that sample to fit on the surface of a sphere).
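One common way to avoid that bias is rejection sampling: draw from the cube, throw away anything outside the unit ball, and only then normalize. The sketch below uses its own `Vec3` and `std::mt19937` rather than Walnut's types, so treat the names as placeholders.

```cpp
#include <cassert>
#include <cmath>
#include <random>

struct Vec3 { float x, y, z; };  // stand-in for the project's vector type

// Uniformly distributed direction on the unit sphere.
Vec3 randomUnitVector(std::mt19937& rng) {
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    for (;;) {
        const Vec3 p{dist(rng), dist(rng), dist(rng)};
        const float len2 = p.x * p.x + p.y * p.y + p.z * p.z;
        // Reject the cube corners (outside the ball) and points too close to
        // the origin to normalize safely.
        if (len2 > 1.0f || len2 < 1e-12f) continue;
        assert(len2 > 0.0f);
        const float inv = 1.0f / std::sqrt(len2);
        return {p.x * inv, p.y * inv, p.z * inv};
    }
}
```

Points inside the ball are uniformly distributed over it, so their directions are uniform too. About 52% of the samples are accepted (the ball's share of the cube's volume), so the loop runs roughly twice per call on average.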
@theonetribble58672 жыл бұрын
Hey, thanks for the series. The first video kick-started my learning process about path tracing. In my opinion the series was a little slow and I was eager to outpace it, so I wrote a Vulkan path tracer in Rust and learned most things by doing them. Now I'm writing my bachelor's thesis about differentiable path tracing. By the way, Mitsuba 3 is a great tool for learning about path tracing as well, especially if you don't want to deal with C++. Anyway, thanks for the inspiration.
@edu_rinaldi2 жыл бұрын
Any suggested source for learning Vulkan raytracing extension ? (and maybe also Vulkan in general) Thanks in advance :)
@Pedro-jj7gp2 жыл бұрын
I'm also interested in hearing about resources to learn Vulkan and path tracing. I might even try and learn Rust while I'm at it! :)
@theonetribble58672 жыл бұрын
@@edu_rinaldi Hi, I replied to @Pedro. I hope you get the notification.
@theonetribble58672 жыл бұрын
@@Pedro-jj7gp Hi, sorry for taking so long to reply. It seems that YouTube doesn't allow me to paste links and didn't warn me (if you can't find the resources, contact me directly, if that's possible on YT). There are some resources I used to learn Vulkan, though I still don't quite understand it (I used screen-13, a Vulkan abstraction layer in Rust). First of all there is the Vulkan tutorial, which helped a lot. I can also recommend the Vulkan lecture series from "Computer Graphics at TU Wien". Specifically for ray tracing, there are some blog entries from the Khronos Group explaining the high-level layout. For more detail there is a tutorial from NVIDIA which uses the KHR extension (note that there are, I think, two extensions for Vulkan ray tracing, KHR and one from NVIDIA; the KHR extension also works on AMD GPUs). If you want to learn more about path tracing in general, there is also the Rendering lecture from CG at TU Wien (that's where I learned the most about path tracing). In general, if you want to learn about such topics, I can recommend looking at university lectures: many European universities put their lectures online, and MIT also has some material under "OpenCourseWare". I can also highly recommend the paper from Eric Veach if you want more mathematical background, but it's a very long paper and I mostly use it for reference.
@edu_rinaldi2 жыл бұрын
@@theonetribble5867 Thank you so much! ❤️
@marcotroster82472 жыл бұрын
Superior performance can also be achieved with techniques other than multi-threading. In fact, threading can actually be slower when the synchronization effort outweighs the performance gains, and the serial part of a program caps the achievable speedup anyway (see Amdahl's law). First, notice that CPUs already have parallelism built into their instruction pipeline. Fetching, decoding, executing, and writing back results can be performed in parallel for successive instructions if they don't depend on each other's results. Rearranging instructions in assembly can have crazy gains (but with C/C++ we usually don't dig that deep). Second, there are dedicated SIMD instruction sets on modern CPUs that can perform the same operation on multiple inputs at once (256- or 512-bit wide registers) to increase data throughput (e.g. 8 or 16 float ops at once). Third, avoiding allocation can save lots of compute, too. Preprocessing data only once up front is very nice. Having smaller stack frames to set up and tear down is also important. Using a static, reusable scratch buffer owned by one thread can really help, since it keeps stack frames small (at the cost of code that is no longer thread-safe). And last, there are several CPU cache levels whose access latency is far lower than main memory's (on the order of 100x for the fastest level). So fitting all the data in a fast cache and constantly reusing it will skyrocket the performance. CPUs have great latency once the data is loaded into a register. Small and simple is fast. Maybe this inspires some devs here to write faster programs. Cheers, have fun optimizing 🤓👨🏻💻🏎️
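Amdahl's law, mentioned above, is easy to put numbers on. The sketch below assumes the simplest form of the law (a fixed parallel fraction p spread over n workers, no synchronization cost), and the function names are invented for the example.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: speedup on n workers when a fraction p of the runtime
// parallelizes perfectly and the remaining (1 - p) stays serial.
double amdahlSpeedup(double p, double n) {
    assert(n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

// Inverse: the parallel fraction implied by a measured speedup s on n workers.
// From 1/s = 1 - p * (1 - 1/n), so p = (1 - 1/s) / (1 - 1/n).
double impliedParallelFraction(double s, double n) {
    assert(s > 0.0 && n > 1.0);
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n);
}
```

Taking the roughly 2x on 8 cores quoted elsewhere in these comments, `impliedParallelFraction(2.0, 8.0)` gives 4/7, about 0.57. That is far too low for an embarrassingly parallel renderer, which hints that something is serializing the threads rather than the work itself being serial. The law only models a serial fraction, so contention on shared state looks the same from the outside.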
@manuntn082 жыл бұрын
Thank you very much for the effort you put into this video. I've learnt a lot from your tips. Could you please make some videos about how to optimize the case where the computation for one pixel depends on the surrounding pixels (for example: convolution, Gaussian filtering...)? Once again, thank you and have a nice year!
@jeofthevirtuoussand2 жыл бұрын
I am neither a programmer nor a developer, but I am actually curious. Would it be possible to say to the hardware: "Hey, can you run raytracing in parallel on 3 cores, but only use 60% of the cores and assign the remaining 40% to enemy AI calculations?"
@peezieforestem50782 жыл бұрын
Would you please do more episodes on various methods of multithreading? C++17 exclusive thing is nice, but I'd like to know the broadest applicable method, a method that works for C, the most optimal method, etc.
@Alkanen2 жыл бұрын
I suspect the most widely supported variant might be using pthreads. It's originally Unix (well, POSIX), but there are Windows-compatible implementations available if you google for a couple of minutes, and then you'll have code that works on all POSIX-compatible systems, which is pretty nice. And it's in C. Not too bad to work with either, if I remember correctly, but it's been a few decades (jesus, I'm getting old) since I wrote my wrapper around it, so I might be misremembering :)
@peezieforestem50782 жыл бұрын
@@Alkanen Thank you, mate!
@jumponblocker2 жыл бұрын
I actually had an assignment where we made a raytracer recently. Kind of funny that I also used std::for_each which I had not heard of before. The only difference was that I just looped over 1 vector containing each pixel index rather than an inner and outer loop.
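The single-loop variant described above boils down to converting a flat index into pixel coordinates. A small sketch, assuming row-major layout; `PixelCoord`, `toCoord`, and `toIndex` are names made up for the example.

```cpp
#include <cassert>
#include <cstddef>

struct PixelCoord { std::size_t x, y; };

// Row-major layout: index = y * width + x.
PixelCoord toCoord(std::size_t index, std::size_t width) {
    assert(width > 0);
    return {index % width, index / width};
}

std::size_t toIndex(PixelCoord c, std::size_t width) {
    return c.y * width + c.x;
}
```

With one flat range of `width * height` indices, a single parallel loop covers the whole image and each iteration recovers its own `x` and `y`.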
@ezpzgamez Жыл бұрын
I have been following along with this series while writing in Rust instead of C++ to see how things compare. Until this episode, everything on the Rust side has been matching the C++ performance, if not somewhat better. (In comparison to the laptop, my desktop PC with an i9-9900K gets about 15 ms where the laptop gets about 60 ms single-threaded.) One thing Rust suffers from here is mutating simple structures in an async context. A mutex or rwlock is required to do what the multithreading here asks for, unless you allocate temporary buffers (one each for the image data and the accumulation data). In an unsafe context it would be a lot easier, but unfortunately Rust lacks a lot of things for async, including some unsafe items. SyncUnsafeCell has yet to be stabilized. So from here on out I guess I'll stick with single-threaded and see how the performance goes. I would rather do that than clone two large vectors on every iteration. Just my two cents from outside of C++ :)
@1ups_157 ай бұрын
Hello, thank you for your video, it looks very useful. However, I have a problem: I have noticed that my raytracer doesn't gain any performance from applying your changes. It even gets slightly worse, and when I look at my processor usage using htop, only one of my cores is being used. I am using Linux and compiling with g++ through CMake. Are there some flags I could use to actually make it multithreaded?
@alessandrocaviola15752 жыл бұрын
On my raytracer I got almost perfect scaling in performance: 4x the speed the moment I multithreaded it on a 4-core CPU, so there is definitely room for improvement there.
@Theodorlei1 Жыл бұрын
Yeah, he got a 2.5x speedup on an 8-core machine on a parallel problem. Something close to 8x should be possible for him.
@Iuigi_t Жыл бұрын
Where are the triangles?
@lithium2 жыл бұрын
std::iota is the "fancy function" you're avoiding to generate sequences, fyi ;)
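For reference, filling an index vector with `std::iota` (from `<numeric>`) looks like this. The helper name `makeIndexRange` is invented for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Returns {0, 1, ..., count - 1}, e.g. as the index range for a parallel loop.
std::vector<std::uint32_t> makeIndexRange(std::uint32_t count) {
    std::vector<std::uint32_t> indices(count);
    std::iota(indices.begin(), indices.end(), 0u);  // 0, 1, 2, ...
    assert(indices.empty() || indices.back() == count - 1);
    return indices;
}
```

This replaces a hand-written fill loop; the vector would still need rebuilding whenever the image is resized.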
@ovi13262 жыл бұрын
Allocating a vector of numbers going from 0 to width and height made me very sad, although I get that this is for the sake of simplicity. For anyone interested, though, here are some tips. A more proper way to go about this would be to either implement a custom range iterator (look up legacy iterator on cppreference) or use std::ranges::iota_view (C++20), which is roughly equivalent to Python's `range()` or Rust's `x..y` syntax. You can also avoid parallel for_each altogether and instead split the work across multiple threads by giving each one responsibility for an equally sized range of scanlines. This is pretty straightforward to implement and should yield good enough performance.
@zvxcvxcz2 жыл бұрын
Not just "good enough," but better because there will likely be less cache contention and less thread creation overhead.
@ovi13262 жыл бұрын
@@zvxcvxcz I meant that there are better methods than simply splitting work by rows, ie. someone in the comments mentioned using a thread pool to saturate the cpu which sounds kinda cool
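The equal split of scanlines suggested at the top of this thread could look roughly like the helper below. `RowRange` and `splitRows` are hypothetical names; each returned range would then be handed to one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct RowRange { std::size_t begin, end; };  // half-open: [begin, end)

// Splits `height` rows into `threads` contiguous ranges whose sizes differ
// by at most one row.
std::vector<RowRange> splitRows(std::size_t height, std::size_t threads) {
    assert(threads > 0);
    std::vector<RowRange> ranges;
    ranges.reserve(threads);
    const std::size_t base = height / threads;
    const std::size_t extra = height % threads;  // the first `extra` ranges get one more row
    std::size_t begin = 0;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t count = base + (t < extra ? 1 : 0);
        ranges.push_back({begin, begin + count});
        begin += count;
    }
    return ranges;
}
```

Equal row counts do not mean equal work, which is the objection raised in the replies: a range full of sky finishes long before one full of geometry.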
@gustavbw Жыл бұрын
Wouldn't allocating the threads on every std::for_each() be highly inefficient compared to pre-allocating the pool when the program starts?
@dmitrysapelnikov Жыл бұрын
In fact the c++ runtime uses an internal thread pool for parallel for_each(). But AFAIK there is no way for the user to explicitly control this pool.
@eduardoassis2826 Жыл бұрын
Hey, how do you draw over your current window during explanations? I've been curious for a long time now and can't help but ask :).
@helmuthpetelin46132 жыл бұрын
Hey, do you plan to show how to move the raytracing to the GPU?
@HandsomeLukeMan2 жыл бұрын
Love the red you've done with your syntax highlighting. How did you do this? I don't see an option for keywords like const and for and if in VA settings? Curious what value of red that is.
@davidrobinson85232 жыл бұрын
It's from a third-party paid extension, Visual Assist. And yes, it is so much better than the defaults.
@HandsomeLukeMan2 жыл бұрын
@@davidrobinson8523 Yeah, I've got VA but curious what he did to modify his theme. I do not know how I would live without VA now that I've used it for so long.
@thebasicmaterialsproject18922 жыл бұрын
go on the cherno still killing it
@ivansanz40292 жыл бұрын
If instead of having each thread do a row you make them do a column, the performance is even better as the "sky" is very cheap to process and the real complex part (the "ground") is distributed better across threads.
@ZeroUm_2 жыл бұрын
It probably won't do much, if 20% of a scene is sky, with 1080 lines you still have 216 lines to go divided by a much smaller number of threads. With 8 threads, that's still 27 passes, enough to saturate them equally.
@ivansanz40292 жыл бұрын
@@ZeroUm_ Yeah I was forward-thinking to when he will use the GPU cores :D
@thomasavino34502 жыл бұрын
What theme/color scheme are you using? (the default visual assist is not like this)
@Kaldrax2 жыл бұрын
Interesting, I didn’t know about this one. I attended a lecture called high performance computing last semester in which we did similar things, starting with OpenMPI, then threads and in the end OpenMP. I absolutely cannot recommend OpenMPI since it’s a total nightmare. OpenMP on the other hand would simplify this code. You don’t need the iterators and I believe you can just write #pragma omp parallel for collapse(2) above the nested loops and it will achieve the same performance. 🙂
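The OpenMP version mentioned above would look roughly like this. `shade` and `renderOmp` are stand-ins rather than the series' code. The pragma only takes effect when OpenMP is enabled (`-fopenmp` on GCC and Clang); without it the pragma is ignored and the loops simply run serially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-pixel ray tracing call.
static std::uint32_t shade(int x, int y) {
    return static_cast<std::uint32_t>(x + 31 * y);
}

std::vector<std::uint32_t> renderOmp(int width, int height) {
    assert(width >= 0 && height >= 0);
    std::vector<std::uint32_t> image(static_cast<std::size_t>(width) * height);

    // collapse(2) merges both loops into one iteration space of
    // width * height independent iterations shared among the threads.
    #pragma omp parallel for collapse(2)
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            image[static_cast<std::size_t>(y) * width + x] = shade(x, y);

    return image;
}
```

One caveat, stated with some caution: `collapse` was introduced in OpenMP 3.0, so a compiler that only implements OpenMP 2.0 (as MSVC's classic `/openmp` mode does) may reject it. In that case, dropping `collapse(2)` and parallelizing only the outer loop is the usual fallback.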
@unknownunknown65312 жыл бұрын
OpenMPI does not address the same problem, it is used to distribute a task on multiple computers (a cluster) rather than only one, hence the additional complexity :). OpenMP is the tool to use in this case indeed !
@psychoinferno42272 жыл бұрын
As an exercise, you should run a profiler and understand why it's only 2x faster on an 8 core machine.
@peezieforestem50782 жыл бұрын
I did some testing with OpenMP and my code started working slower... not sure why this happens, I made sure to parallelize the independent loops.
@zvxcvxcz2 жыл бұрын
Yup, iterators are gross, OpenMP is way nicer (suck it C++ committee).
@zvxcvxcz2 жыл бұрын
@@peezieforestem5078 Slower than what was done in the video or slower than the code was before? You shouldn't really do even what he did in the video. In either case, creating way more threads than you actually have the hardware for can cause a lot of contention and cache misses and actually slow things down sometimes. He has 8x the hardware threads and was only getting like 2x the performance... not exactly ideal. What you should really do is create just 8-16 threads when you have 8 physical cores and have a thread queue so they pick up a new task each time they finish a pixel until there are no pixels left.
@anime_erotika585 Жыл бұрын
7:07 I want multithreading, at my table, until tomorrow!
@CreativeOven2 жыл бұрын
Dude, make us a chapter someday showing how you learned to program in C++ to get to your level (just an idea), because some of us are still in the stone age with C++.
@sshawarma2 жыл бұрын
Awesome video as always! Why was the program not running 8x faster? Only thing I can think of is an IO bottleneck.
@psychoinferno42272 жыл бұрын
Run a profiler and you'll find a different answer. If you want to spoil the fun see the responses in the Github discussion.
@kelvinpoetra2 жыл бұрын
Hello Cherno, I want to ask how to make graphics software and software such as Microsoft Word. Are the basic stages of making software the same for all of them?
@ChaoticFlounder2 жыл бұрын
how difficult would it be to implement the RayTracing calculations on the integrated graphics on your cpu?
@zvxcvxcz2 жыл бұрын
"It depends," is the unfortunate answer there. It depends on just what types you're using, what the driver for that GPU exposes, whether it supports the necessary extensions, etc... Maybe you can drop it on there with CUDA or OpenCL, or maybe you can even wrangle the regular display part of the driver into giving you what you need with OpenGL or DirectX, etc... Often laptop manufacturers have not been great about switching these GPUs (sometimes if you're primarily on the discrete card, the integrated one can be almost totally deactivated, or vice versa). Sometimes that is seen as a plus, since it deals with battery concerns.
@vasile2321 Жыл бұрын
What RTX do you have on your pc? Thx
@gabrieldesimone46442 жыл бұрын
Hey there, I'm not familiar with C# or game-making stuff, but I was wondering: that code is running on CPU cores, so how do you make it use GPU cores instead?
@Alkanen2 жыл бұрын
That's coming in a future episode
@zvxcvxcz2 жыл бұрын
3 main options 1) wrangle your GPU into doing so by sort of telling it that it is doing normal math for output using OpenGL/DirectX/etc... 2) use OpenCL 3) use CUDA.
@steellung2 жыл бұрын
Does anyone know which software he uses for drawing on the screen on the fly?
@rastaarmando70582 жыл бұрын
It looks very similar to gInk.
@steellung2 жыл бұрын
@@rastaarmando7058 cool, didn't know this one. Thanks
@erikrl22 жыл бұрын
He uses ZoomIt
@mackerel9872 жыл бұрын
Hey guys. Does anyone get the "no instance of overloaded function:"std::for_each" matches the arguments list " error? Afaik we only need to include the execution header for it to work. Am I missing something?
@simonmaracine47212 жыл бұрын
Make sure you compile with C++17 flag or newer, and your compiler supports C++17.
@mackerel9872 жыл бұрын
@@simonmaracine4721 exactly what was wrong. thank you.
@ricbattaglia6976 Жыл бұрын
Isn't a GPU render faster? Thanks
@CreativeOven2 жыл бұрын
Comment 10 10 out of 10 : D, How is hazel ? I see it is not all about drawing that open GL 3d lines right for those vertices? : P
@MorebitsUK2 жыл бұрын
Nice!! Always good content Cherno. Any Idea on how to use IntStream in Java to parallelize stuff. FYI I'm using `map`; not `for_each`.
@wuangg2 жыл бұрын
Use IntStream.parallel() to return a parallel IntStream, then use forEach() to perform an action on each element of the stream in parallel; it will use all available processors to do the job. For example:
IntStream stream = IntStream.range(1, 10); // sequential ordered IntStream from 1 (inclusive) to 10 (exclusive)
stream.parallel().forEach(i -> {
    // do stuff to element 'i' here
}); // runs the action on each element across multiple threads
This is equivalent to C++ std::for_each with a parallel execution policy, which is what this video shows.
@MorebitsUK2 жыл бұрын
@@wuangg Thanks for the reply, but I'm using Map not For_Each. I just need to return something from the map. String[] results = IntStream.range(0,imageHeight-1).parallel().map(i -> { // y value String row = String.join(System.lineSeparator(), IntStream.range(0, imageWidth).map(j -> { // x value Vec3 pixelColour = new Vec3(0, 0, 0); float u = (i + Utils.randomFloat(0.0f, 1.0f)) / (float) (imageWidth - 1); float v = (j + Utils.randomFloat(0.0f, 1.0f)) / (float) (imageHeight - 1); final Ray rayP = camera.getRay(u, v); pixelColour.addEquals(rayColor(rayP, finalWorld, maxDepth)); String pixel = PPM.vectorToRGB(pixelColour, 1); }));
@andrewporter18682 жыл бұрын
Multi-threading is also a mistake. It's a failure to defer parallel computing to the programmer. Instead of providing an asynchronous master-slave universal scheduler system and then on top of that the ability to do cheap software scheduling by providing a simple custom scheduler that can use the exact same code (it's asynchronous, so you just insert the scheduler code at some point in the future on one of your existing execution pathways), we got this pile of garbage that requires us to add all this overhead by synchronizing everything and it's just this massive headache where you can't just write parallel code but you have to think about synchronization too, and if you think too hard, you get a synchronization bug that you spend the afternoon fixing instead of fixing your actual code that's supposed to be part of the design that you're implementing, not a standard library feature that's missing from every language and imposed on us by all major operating systems.
@rckeet2 жыл бұрын
oh yesssssss!!😎
@JATmatic2 жыл бұрын
I made it much faster than the MT version here by fixing the wonky Walnut::Random code and removing branches from Renderer::TraceRay() loop. Render runs in about ~11ms on Ryzen 2700 8-core.
@stinkybeam Жыл бұрын
I know nothing of programming and coding; watching this video reminds me of high school math class. I think I understand, but actually I don't.
@CP-sr6ml Жыл бұрын
Don't get me wrong, your content is great, but... why are we bothering with multithreading if we could just move to the GPU? I don't understand why you keep building and even optimizing like this on the CPU side now. Won't that just make it harder/more work to move to the GPU?
@ng.h93152 жыл бұрын
Wonderful courses 👌, but please continue the "Create a game engine in C++" course and add a 3D game development option with builds for Android, iOS, and so on. Please teach us how to create a game engine like Unreal Engine 😀. I'm waiting for your answer... Thanks for everything, Cherno ♥️
@larryfulkerson45052 ай бұрын
I like to write code by the principle of least astonishment.
@AnalogFoundry2 жыл бұрын
I wish the team at Striking Distance Studios would take notes and improve ray-tracing performance in their game called The Callisto Protocol. At the moment their CPU utilization with RT is abysmal.
@zvxcvxcz2 жыл бұрын
Are they not doing their raytracing on the GPU though? Recent GPUs have hardware accelerated raytracing. I'm not at all familiar with the game or what they've done other than that it is supposed to be like a AAA title? I would expect any AAA to be using the GPU features on this (whether or not they should be).
@AnalogFoundry2 жыл бұрын
@@zvxcvxcz - they are doing RT on the GPU using dedicated RT cores of AMD and NVIDIA, but building BVH and stuff is handled on the CPU. Thus RT can be very taxing even on the CPU. The problem with Callisto Protocol is that it uses very little of the CPU (i.e. not well multithreaded) even with the latest greatest multi-core CPUs which causes huge fps issues.
@hymen0callis Жыл бұрын
Unfortunately, std::for_each() is not very "efficient". Apparently, you got a speedup of only about 2, while I (using the exact same parallelization scheme) got a speedup of 5.5 (I only have 8 logical cores) by using PPL's Concurrency::parallel_for() instead. It's not portable code, but if it is almost 3 times faster, I'll go with Microsoft's PPL. Edit: just watched the next video where you fixed your global RNG. In my code, the RNG was already thread_local, which explains the much higher speedup in my example. So, I guess std::for_each() isn't that slow after all.
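A per-thread generator of the kind described in the edit above can be sketched with `thread_local`. This is not Walnut's actual `Random` class: the engine, the seeding via `std::random_device`, and the name `randomFloat` are all assumptions for the example.

```cpp
#include <cassert>
#include <random>

// Each thread lazily creates its own engine and distribution on first use,
// so no state is shared and no locking or cache-line ping-pong occurs.
float randomFloat() {
    thread_local std::mt19937 engine{std::random_device{}()};
    thread_local std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    const float value = dist(engine);
    assert(value >= 0.0f);
    return value;
}
```

A single shared generator forces every thread to read and write the same state on each sample, which is the kind of contention that can hold a parallel loop to a fraction of its expected speedup.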
@stephenkamenar2 жыл бұрын
GPUs don't have 8,000 cores. they have very wide instructions. like SIMD but on massive data at the same time. same difference tho
@nenomius11482 жыл бұрын
8:50 running around updating two vectors on each window resize is much simpler than that stinky over-engineered std::views::iota from "modern" C++
@ovi13262 жыл бұрын
yeah but like think of the cache friendliness of accessing a buffer of memory just to get the next consecutive number
@nenomius11482 жыл бұрын
@@ovi1326 Yeah, reading consecutive numbers from memory is much more cache-friendly than generating them on CPU in registers
@TheApsiiik Жыл бұрын
It's been 2 months.. where is next episode!!!1
@closingtheloop25932 жыл бұрын
Why aren't you doing this in CUDA? Or in an OpenGL fragment shader?
@Jkauppa2 жыл бұрын
multicore avx-512 on cpu
@Jkauppa2 жыл бұрын
screen space dynamic baking surface light map caching
@Jkauppa2 жыл бұрын
update the surface dynamic baked light map only when needed, new or when update is needed, like every 4th frame, at some fps, like 240fps, update light only at 60fps
@Jkauppa2 жыл бұрын
pseudo-coding is a must, so that you are not tied to a language
@Jkauppa2 жыл бұрын
focus on programming pipeline or in the pseudo-algorithm methods
@Jkauppa2 жыл бұрын
language specifics are so 80's :)
@Alkanen2 жыл бұрын
Wohoo!
@IshanChaudharii2 жыл бұрын
Oh my goodness finally!!!! ❤️🥲🎉
@Notsorandomnumbers2 жыл бұрын
anyone know of a channel similar to this but like 1 degree more amateur? I find myself having difficulty keeping up at points
@fanisdeli2 жыл бұрын
Complete assumption because I'm too lazy to search it: I would think that std::for_each would be smarter than creating a thread for every single item in your iterator. Creating the threads would be much slower than actually running on one thread. My assumption is that it creates a few threads, depending on your hardware, and it reuses them. When one iteration is done, the same thread is used for a future iteration. That would also explain why using nested std::for_each made no difference in performance.
@zvxcvxcz2 жыл бұрын
You think it's only getting 2x rather than at least ballpark 8x if it is being smart? I think the nested for makes no difference because the single loop is already that bad for resource contention (several thousand threads on 8 hardware cores... ) and thread creation that it doesn't get any worse than that.
@fanisdeli2 жыл бұрын
@@zvxcvxcz I don't think that it could possibly be creating millions (1920*1080) threads, 60+ times a second Also, in programming there's no such thing as "it can't get worse" lol. If it was a thread per iteration, then without nesting you'd have 1920 threads, with nesting you'd have over 2 million. So, yeah, that would be WAY worse for sure. Like "freeze the entire OS and blue screen" type of stuff
@MrMirbat2 жыл бұрын
Thanks for sharing knowledge. Can you do tutorial how to make casino games like slot machines - Book of Ra, Texas holdem poker or roulette? Thanks in advance.
@mr.mirror12132 жыл бұрын
lesss gooo
@zvxcvxcz2 жыл бұрын
Iterators are gross... I would rather use OpenMP.
@anlcangulkaya62442 жыл бұрын
#pragma omp parallel for
@psychoinferno42272 жыл бұрын
The performance was nearly identical to the for_each with a parallel execution policy.
@peezieforestem50782 жыл бұрын
Hey, I tried OpenMP once and my code got slower. I'm not sure why, do you have any ideas?
@zvxcvxcz2 жыл бұрын
@@psychoinferno4227 Yes, but with OpenMP you don't need those silly ranges, that's the advantage there. I would expect the performance to be about the same as to what was done in the video if done the same way like that. Creating thousands of threads on a machine with 8 physical cores is begging for 1) overhead due to thread creation and 2) resource contention as all those threads want to get their task executed, so expect an increase in cache misses. The proper way to do it is to create a properly sized thread pool (somewhere between 8 and 16 most likely if you have 8 hardware cores) and have a task queue where each pixel's processing is a task. Have each thread pick up a new task when finished until there are no tasks left. I would expect something like a 5x-7.8x ish improvement rather than 2x. I might be wrong, but that's my naive expectation without knowing too many details about the raytracing algorithm itself. Offhand I don't think we're being memory bottlenecked in this case in terms of throughput, just perhaps by cache misses as the threads swap.
@zvxcvxcz2 жыл бұрын
I use a sort of implied threading in Bash too, with the same model. You can't have an ancient Bash though, because they didn't add the feature to wait for any task to finish until like 4.something. So now you can start 8 commands while hundreds more wait, and each time one finishes the next starts; it's pretty sweet. Prior to that, in Bash you could only wait for all tasks to finish, or you had to know the exact task you were waiting for (and of course, in most cases, you can't know ahead of time what order they will finish in).
@irfanjames65512 жыл бұрын
Thanks a lot I was really waiting for the optimisations especially M u l t i - t h r e a d i n g.