Programmable Vertex Pulling // Intermediate OpenGL Series

Views: 2,910

Comments: 31

@OGLDEV 3 months ago

Clone the sources: git clone --recurse-submodules github.com/emeiri/ogldev.git
If you want to get the same version that was used in the video you can checkout the tag 'TUT_53_PVP'.
Build on Windows: Open the Visual Studio solution: ogldev\Windows\ogldev_vs_2022\ogldev_vs_2022.sln
Build the project 'OpenGL\Tutorials\Tutorial53_PVP'
* Important note - I forgot to mention that if you use PVP you need to call glBindBufferBase whenever you change the VAO. This info is not registered in the VAO so this has to be done manually. In the checked-in code I fixed it.
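The note about glBindBufferBase can be sketched as a small helper that always issues both binds together. This is a minimal sketch, not the code from the ogldev repository: `GlCalls`, `PvpMesh`, `bindForDraw` and the binding index 0 are hypothetical names and values chosen for illustration, and the two OpenGL entry points are passed in as callbacks so the snippet compiles without a GL loader.

```cpp
#include <functional>

// Thin table of the two OpenGL entry points this sketch needs. In a real
// renderer these would be set to glBindVertexArray and glBindBufferBase.
struct GlCalls {
    std::function<void(unsigned vao)> bindVertexArray;
    std::function<void(unsigned target, unsigned index, unsigned buffer)> bindBufferBase;
};

// Hypothetical per-mesh state for PVP: the VAO that is bound for the draw
// plus the SSBO that holds the vertices.
struct PvpMesh {
    unsigned vao;
    unsigned vertexSsbo;
};

// Value of GL_SHADER_STORAGE_BUFFER from the OpenGL headers.
constexpr unsigned kShaderStorageBuffer = 0x90D2;

// Assumed binding index; it has to match the `binding = N` layout qualifier
// of the buffer block in the vertex shader.
constexpr unsigned kVertexBindingIndex = 0;

// Binds everything a PVP draw needs. The SSBO binding is issued again on
// every call because it is not recorded in the VAO.
inline void bindForDraw(const GlCalls& gl, const PvpMesh& mesh) {
    gl.bindVertexArray(mesh.vao);
    gl.bindBufferBase(kShaderStorageBuffer, kVertexBindingIndex, mesh.vertexSsbo);
}
```

Calling one helper per mesh keeps the VAO bind and the SSBO bind from drifting apart when meshes are switched between draws.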

@CodeParticles 2 months ago

@OGLDEV, after trying to optimize everything and create an engine myself so I don't have to create everything from scratch each time, I finally managed to compare Direct State Access vs Programmable Vertex Pulling. So far, using just a low-polygon icosphere exported from Blender, I found that the average difference between DSA and PVP was about 1-2 milliseconds. DSA was faster, but I suppose it's only a low-polygon model, no? I also optimized my object loading parser, which drove me nuts for about a week because I had to compare std::vectors (reserve vs no reserve) vs std::arrays vs C-style dynamically allocated arrays, and that was an eye opener to say the least. Anyway, I think I will try loading that dinosaur obj file. I think I sent you that one over a year ago. But I would have to add my assimp version and make sure there are no leaks anywhere lol. Very exhausting 2 weeks of research =)
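The "reserve vs no reserve" comparison mentioned here can be illustrated without timing anything, by counting how often a std::vector has to grow. This is a generic sketch, not the commenter's parser; `countReallocations` is a hypothetical helper written for this example.

```cpp
#include <cstddef>
#include <vector>

// Counts how often push_back had to grow the vector's storage while
// appending `count` elements. Every growth means a new allocation plus
// moving the existing elements into it.
inline std::size_t countReallocations(std::size_t count, bool reserveFirst) {
    std::vector<float> data;
    if (reserveFirst) {
        data.reserve(count);  // one allocation up front
    }
    std::size_t reallocations = 0;
    std::size_t lastCapacity = data.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        data.push_back(static_cast<float>(i));
        if (data.capacity() != lastCapacity) {
            ++reallocations;
            lastCapacity = data.capacity();
        }
    }
    return reallocations;
}
```

With `reserveFirst` set, the loop never reallocates. Without it, the number of reallocations depends on the standard library's growth factor, which is one reason load times can differ between the two variants.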

@OGLDEV 2 months ago

Thanks a lot for the feedback. Is the 1-2 ms diff on the entire frame? Have you decided on your preferred method already?

@CodeParticles 2 months ago

@OGLDEV Hi Etay, sorry for the late response. I've been stuck on this for almost 3 weeks now and I finally figured out how to do PVP instancing with compute shaders, what a complete headache that was. In that time I went from trying to optimize little sections such as obj loading and shader loading to rewriting everything to DSA, comparing it to non-DSA, and then having it work with PVP as well. I had an old project where I used a geometry shader to build a cube from a single GL_POINT and used a compute shader to handle the physics with over 1 million particles @ 126.3ms. But the memory bandwidth exploded to 1.5gigs. I wasn't very precise on this section, so as I tried loading rudimentary obj files I wanted to instance a cube obj from Blender. I tried non-DSA instancing with a compute shader in debug mode vs release mode, DSA instancing debug vs release, then DSA PVP. This took me the longest as I didn't quite understand how to use two Shader Storage Buffer Objects for two separate structs, where the 1st had the usual positions, normals, and texcoords, and the 2nd was for instancing and compute shading, but I got it to work. Sorry if tl;dr:
Vsync ON / Compiled with Visual Studio 22 + 12900k rtx 3090FE
// non-DSA instanced cube loading time + rendering with compute shader @ 1 million (debug mode)
// 146.81ms @ fps:87.2 ms/frame 11.4 532megs
// 134.25ms @ fps:87.3 ms/frame 11.4 528megs
// 132.16ms @ fps:87.2 ms/frame 11.5 534megs
// 133.94ms @ fps:86.4 ms/frame 11.5 526megs
// non-DSA instanced cube loading time + rendering with compute shader @ 1 million (release mode)
// 80.06ms @ fps:62.9 ms/frame 15.1 525megs
// 73.85ms @ fps:86.3 ms/frame 11.5 526megs
// 73.66ms @ fps:86.3 ms/frame 11.5 525megs
// 81.60ms @ fps:86.3 ms/frame 11.5 526megs
// DSA instanced box rendering with compute shader @ 1 million (debug mode)
// 136.26ms @ fps:90.0 ms/frame 11.17 526megs
// 137.75ms @ fps:90.0 ms/frame 11.1 527megs
// 138.55ms @ fps:90.0 ms/frame 11.2 526megs
// 136.27ms @ fps:88.1 ms/frame 11.3 527megs
// DSA instanced box rendering with compute shader @ 1 million (release mode)
// 81.83ms @ fps:89.5 ms/frame 10.9 525megs
// 74.84ms @ fps:89.5 ms/frame 11.1 525megs
// 76.99ms @ fps:90.0 ms/frame 11.0 525megs
// 74.93ms @ fps:90.0 ms/frame 11.1 530megs
// DSA instanced PVP with compute shader @ 1 million (debug mode)
// 149.28ms @ fps:144 ms/frame 6.9 698megs
// 142.10ms @ fps:144 ms/frame 6.9 695megs
// 146.81ms @ fps:144 ms/frame 6.9 695megs
// 142.91ms @ fps:144 ms/frame 6.9 695megs (at the cost of memory and ms to compile, fps goes to max)
// DSA instanced PVP with compute shader @ 1 million (release mode)
// 82.29ms @ fps: 102 ms/frame 10.1 694megs
// 86.89ms @ fps: 144 ms/frame 6.9 693megs
// 81.93ms @ fps: 75 ms/frame 11.7 694megs
// 89.49ms @ fps: 98 ms/frame 9.9 693megs (unsure why in release mode fps jumps around)
What I could do better is possibly send just a single uniform mat4 instead of sending it 3 times to improve times. Or maybe even store the entire matrices in another vertex attribute, and there are lots more ways. I still need to optimize the TCS + TES shaders and then try assimp loading. Boy, that's gonna take another several weeks. 😭😭

@MaGetzUb 3 months ago

I came up with a similar method quite recently. I basically used it to directly render _triangulated_ .obj files. The funny thing is that I used vertex attributes as a kind of element buffer on steroids. The vertex attribute is a uvec3 that I use to index into SSBOs; each element of the uvec3 refers to one of the bound SSBOs: x -> vertex position data, y -> texcoord data, z -> normal data. It saves memory and headache. :D Very good video!
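The uvec3 indexing scheme described above can be emulated on the CPU to show the data flow. This is a sketch under assumptions: the three std::vector members stand in for the three bound SSBOs, all type and function names are hypothetical, and the indices are assumed to be already converted from the 1-based numbering of .obj files to 0-based.

```cpp
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// One entry per triangle corner, mirroring an OBJ "f v/vt/vn" corner.
// This plays the role of the uvec3 vertex attribute.
struct CornerIndices {
    std::uint32_t position;  // x component: index into the position buffer
    std::uint32_t texCoord;  // y component: index into the texcoord buffer
    std::uint32_t normal;    // z component: index into the normal buffer
};

// Stand-ins for the three bound SSBOs.
struct ObjBuffers {
    std::vector<Vec3> positions;
    std::vector<Vec2> texCoords;
    std::vector<Vec3> normals;
};

struct Vertex {
    Vec3 position;
    Vec2 texCoord;
    Vec3 normal;
};

// What the vertex shader would do per invocation: use each index component
// to pull one attribute from its own buffer.
inline Vertex fetchVertex(const ObjBuffers& buffers, const CornerIndices& idx) {
    return Vertex{buffers.positions[idx.position],
                  buffers.texCoords[idx.texCoord],
                  buffers.normals[idx.normal]};
}
```

Because each attribute keeps its own index, positions, texcoords and normals that are shared between corners are stored once, which is where the memory saving comes from.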

@OGLDEV 3 months ago

Nice! Thanks :-)

@cybereality 3 months ago

Thanks for the shout out!!!

@OGLDEV 3 months ago

Thank you for your support!

@perefm 1 month ago

Thanks for the video, it is great! In fact, I'm considering changing my engine from "Fixed" vertex pulling to "PVP", but... don't you think that using all those new functions like "GetPosition", "GetTexCoord", etc. will in the end carry a performance penalty? I can see the benefits of using PVP, like having access to the entire buffer in the VS, getting rid of VertexBuffers, or being able to modify the buffer directly from the shader (if we are in rw mode), but apart from this I think that "maybe" it can hurt performance... what do you think?

@OGLDEV 1 month ago

I think that the shader compiler should be able to optimize the wrapper functions so that the code will be inlined directly into the main function. It's a bit difficult to prove it because we can't see the assembly of the shaders (at least, I don't know how) but this is a standard compiler optimization so my guess is that they've implemented it.
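For readers who have not seen the wrapper functions being discussed, here is a CPU-side analogue of what such accessors do. It is only a sketch: the interleaved layout (3 position floats, 2 texcoord floats, 3 normal floats) and the constant names are assumptions made for this example, not the layout used in the video, and the real functions live in GLSL rather than C++.

```cpp
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// Assumed interleaved layout per vertex:
// 3 position floats, 2 texcoord floats, 3 normal floats.
constexpr std::size_t kFloatsPerVertex = 8;
constexpr std::size_t kPositionOffset = 0;
constexpr std::size_t kTexCoordOffset = 3;

// Stand-in for the buffer contents as the shader sees them.
using VertexBuffer = std::vector<float>;

// Each accessor is a single address computation followed by a few loads,
// which is the kind of small function a compiler can inline.
inline Vec3 GetPosition(const VertexBuffer& vb, std::size_t vertexId) {
    const float* p = &vb[vertexId * kFloatsPerVertex + kPositionOffset];
    return Vec3{p[0], p[1], p[2]};
}

inline Vec2 GetTexCoord(const VertexBuffer& vb, std::size_t vertexId) {
    const float* p = &vb[vertexId * kFloatsPerVertex + kTexCoordOffset];
    return Vec2{p[0], p[1]};
}
```

In the shader, the vertex ID would come from `gl_VertexID` rather than from a function argument supplied by the caller.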

@CodeParticles 3 months ago

Really cool upload. Now I'm going to have to benchmark DSA vs PVP just for fun. Thanks! =)

@OGLDEV 3 months ago

Great!

@coreC.. 3 months ago

Post your results please CodeParticles. I would like to know. 👍 Big models, small models..

@dmitryrozovik3996 3 months ago

I think the T&L cache (old historical name) will stay for a while. It gives an advantage when rendering large grids like terrains. Regular meshes are often layout-optimized to benefit from it. In general this technique is more beneficial for Vulkan and D3D12. Those APIs have to create a PSO/pipeline for each shader set and each vertex format. A huge number of PSOs is a headache for big projects. (GL probably does the same under the hood automatically.) A nice feature is the ability not only to pass custom sophisticated data, but also to change the vertex layout dynamically in-flight. Changing the format basically forces the GPU to stop the last draw call and flush internal caches. Having a 'flags' attribute in the vertex data allows passing, for example, static and skinned geometry in a single draw call (assuming some global matrix palette; maybe not the best example). In this case the vertex shader branches will stay coherent. So multiple different kinds of geometry data may be fed from the same vertex buffer, minimizing state changes. One thing I would recommend is to introduce a common header for code and shaders to create a single place for declaring bindings. This should minimize the risk of binding indices going out of sync, and simplify finding the places where they are used in the code.

@yas1945 3 months ago

I no longer do skinned geo in the vertex shader stage, but instead use a compute shader to write directly into the VBO, including motion vectors. From the vertex shader's POV, everything is static, and is totally compatible with what you're doing.

@OGLDEV 3 months ago

Thanks for the feedback. I usually use ogldev\Include\ogldev_engine_common.h for the bindings you mentioned, though in this specific video I'm using a new framework that I'm working on so things are still chaotic right now. Would eventually use that as well.
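The common-header idea discussed in this thread could look roughly like the sketch below. It is not the content of ogldev_engine_common.h; the file name, macro names and binding values are all hypothetical. It relies only on `#define`, which both the C++ and the GLSL preprocessors understand. GLSL has no built-in `#include`, so the sketch assumes the application prepends the header text to the shader source, or uses an extension that provides includes.

```cpp
// shader_bindings.h (hypothetical file shared by C++ and GLSL sources)
#ifndef SHADER_BINDINGS_H
#define SHADER_BINDINGS_H

// Shader storage buffer binding points (assumed values).
#define SSBO_BINDING_VERTICES  0
#define SSBO_BINDING_INSTANCES 1

// Uniform buffer binding points live in a separate index space.
#define UBO_BINDING_PER_FRAME  0

#ifdef __cplusplus
// Only the C++ side sees this check; the GLSL side skips it.
static_assert(SSBO_BINDING_VERTICES != SSBO_BINDING_INSTANCES,
              "SSBO binding points must be unique");
#endif

#endif  // SHADER_BINDINGS_H
```

On the C++ side the macro would be passed as the index argument of glBindBufferBase, and on the GLSL side it would appear as `layout(std430, binding = SSBO_BINDING_VERTICES)`, so both ends read the number from the same place.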

@SillyOrb 3 months ago

Good video. I would have liked some benchmarking, but I understand that that would have been a bit much and isn't the focus of the channel. Another technique that can lift performance on some platforms is to fetch and manually interpolate "static" vertex attributes like UVs and colours in the fragment shader, instead of having them interpolated between vertex and fragment shading. It might have no benefit on platforms that don't provide barycentric coordinates (e.g. via intrinsics), but it might still be worth a try, if there are many attributes and cases, where only a subset of them are used conditionally. One such example is rim lighting, which only affects certain pixels only, yet all vertices need to transfer the required attributes (vertex cache pollution) and all fragments interpolate every attribute by default. There might be other cases where this might make sense. On the other hand, the memory reads can increase especially with higher render resolutions, so there are possible drawbacks that likely depend on the ratio of vertices to fragments shaded per frame. The best way with these things is to implement all variants, then switch them at runtime (so you have the exact same test configuration) and to measure each technique on every supported hardware platform with the expected render resolutions. Cheers.
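The manual interpolation mentioned above boils down to a weighted sum of the three corner values. The sketch below shows that arithmetic on the CPU with hypothetical names; in a real fragment shader the weights would come from whatever barycentric input the platform exposes, and they are assumed here to be already perspective-corrected and to sum to 1.

```cpp
struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// Interpolates a per-vertex attribute (here a UV) from the three corner
// values of a triangle and the barycentric weights of the shaded point.
inline Vec2 interpolate(const Vec2& a0, const Vec2& a1, const Vec2& a2,
                        const Vec3& bary) {
    return Vec2{a0.x * bary.x + a1.x * bary.y + a2.x * bary.z,
                a0.y * bary.x + a1.y * bary.y + a2.y * bary.z};
}
```

The three corner values would be pulled from the vertex buffer in the fragment shader, so only fragments that need the attribute pay for fetching and interpolating it.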

@OGLDEV 3 months ago

Thanks for the feedback, these are interesting suggestions. Definitely worth exploring when I find the time for it.

@stysner4580 3 months ago

Interesting. I'm using DSA to bind a VBO and EBO to the VAO, resulting in 1 VAO bind operation before rendering. Pretty sure that would beat vertex pulling in performance by a long shot, no?

@OGLDEV 3 months ago

Not sure, this needs to be measured. As I said in the video, in most cases the performance depends on the complexity of the fragment shader as well as renderer design issues such as read-after-write dependencies, "render pass" complexity, etc.

@stysner4580 3 months ago

@OGLDEV Yeah, but even if the fragment shader is taking the longest, it's still a linear increase if the vertex stage takes longer, meaning the fastest way to get the vertex data to the vertex shader should always be the best way from the point of view of performance. If anything, PVP makes me think more about possibilities with transform feedback for particle systems than about performance improvements, especially when running a vertex shader on its own (no fragment shader).

@OGLDEV 3 months ago

The claim by Daniel Rakos is that whether you use the fixed function path or the programmable path you basically go through the same hardware, with the exception of the post transform cache. But this was back then and I don't know the current state of affairs. I saw that PVP is mentioned alongside AZDO, which is the current hot topic in terms of OpenGL performance. It all depends on the attention of the GPU vendors to this path and the resources that they may have allocated to improving its performance.

@stysner4580 3 months ago

@OGLDEV I really doubt vendors still care about OpenGL performance. They need driver implementations because it's still widely used, but there's no way they really care about optimizing OpenGL that much. No one is benchmarking it. A "hot topic" within OpenGL is only hot with enthusiasts like ourselves; no one else really cares. I'm currently writing a game engine and I really don't want to bother with Vulkan, but it's really fun to optimize OpenGL for me. Since all the vertex shader fixed-pipeline stuff (excluding the transform cache optimizations) is just going through two memory indirections to get to the shader attributes, I really doubt there is much low-hanging fruit worth it enough for vendors to care about, especially with OpenGL being end-of-life since, what... basically 2012? Pretty sure it's really hard to do a lot better than 1 VAO bind call per drawing operation. I also render pretty much everything in only a couple of "buckets", each having the same shader program and VAO (with bound VBO/optional EBO), using the multidraw functions. Even though there's not really any parallelism for OpenGL (not safe or easy, anyway), I do think my AZDO implementation is more than good enough to come close to an average Vulkan implementation. Though as I said before, I do see some potential in PVP for particle systems. Will be fun to implement!
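The "bucket" approach described above can be sketched as building one indirect command per mesh, with all meshes of a bucket packed into the same vertex and index buffers. This is an illustration under assumptions, not the commenter's engine: it assumes the multidraw call in question is glMultiDrawElementsIndirect, and `MeshRange` and `buildBucketCommands` are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Five tightly packed 32-bit values per draw, in the order that
// glMultiDrawElementsIndirect reads them from the indirect buffer.
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices for this mesh
    std::uint32_t instanceCount;  // number of instances to draw
    std::uint32_t firstIndex;     // offset into the shared index buffer
    std::int32_t  baseVertex;     // offset added to every index
    std::uint32_t baseInstance;   // first instance ID
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "commands must be tightly packed");

// Hypothetical description of one mesh packed into the shared buffers.
struct MeshRange {
    std::uint32_t vertexCount;
    std::uint32_t indexCount;
};

// Builds one command per mesh, assuming the meshes were appended to the
// shared vertex and index buffers in the same order as `meshes`.
inline std::vector<DrawElementsIndirectCommand> buildBucketCommands(
        const std::vector<MeshRange>& meshes) {
    std::vector<DrawElementsIndirectCommand> commands;
    commands.reserve(meshes.size());
    std::uint32_t firstIndex = 0;
    std::int32_t baseVertex = 0;
    for (const MeshRange& m : meshes) {
        commands.push_back({m.indexCount, 1u, firstIndex, baseVertex, 0u});
        firstIndex += m.indexCount;
        baseVertex += static_cast<std::int32_t>(m.vertexCount);
    }
    return commands;
}
```

The resulting array would be uploaded to a buffer bound to GL_DRAW_INDIRECT_BUFFER, so a whole bucket is drawn with one program bind, one VAO bind and one draw call.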

@OGLDEV 3 months ago

I've just sent a couple of questions about that to my friend who works at NVIDIA. Will update you if I hear anything interesting.

@coreC.. 3 months ago

I am curious about the performance of using this technique on modern hardware. You showed some results from a book, but that is from 2012. _I guess the holidays are to blame for the few views/likes, so far._

@OGLDEV 3 months ago

I'm curious too... ;-) Please note that the technique is also used in the '3D graphics rendering cookbook' which is from 2021 so this gives more confidence that it may perform well on modern hardware. The reason for the low view count is that the video is still in the early access period. It will be publicly available later today.

@kheartztv 3 months ago

From my understanding hearing from others, at least on Vulkan this is nearly identical in performance on AMD hardware and with a very slight perf hit on Nvidia GPUs. I haven't personally profiled myself, so take this with a grain of salt.