This was written to demonstrate and test the compute toolchain/implementation I'm currently working on for my master's thesis (github.com/a2f...). With it, the same C++17 code can be compiled to CUDA/PTX, OpenCL/SPIR/SPIR-V, Metal and Vulkan/SPIR-V, and thus run on a multitude of GPUs and CPUs on different platforms. To achieve this, I'm using a modified clang/llvm/libc++ 4.0 toolchain and a layer of host- and device-side code that makes it possible to address all of these backends the same way. This demo in particular shows the use of local/shared memory buffers, local memory barriers, OpenGL buffer sharing and loop unrolling, and that high performance computing is indeed possible with this toolchain.
The N-body simulation is largely based on http.developer.... with some additional optimizations.
More information on N-body simulations: en.wikipedia.o...
Code for this demo: github.com/a2f...
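For readers unfamiliar with the tiled all-pairs scheme that this kind of simulation typically uses, here is a minimal CPU-only sketch in plain C++17. It is not the demo's code and does not use the toolchain's API: the `Body` layout, the `nbody_step` function, the `dt` parameter, and the way softening and damping are applied are all assumptions made for illustration. The `tile` vector only stands in for a local/shared memory buffer; on a GPU, the copy into it would be followed by a local memory barrier.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical body layout for this sketch: position, mass and velocity.
struct Body {
    float x, y, z, mass;
    float vx, vy, vz;
};

// One all-pairs O(n^2) step, processed tile by tile.
// tile_size must be > 0 and softening must be > 0 (it keeps the
// self-interaction term finite). All parameters are illustrative.
inline void nbody_step(std::vector<Body>& bodies, std::size_t tile_size,
                       float softening, float damping, float dt) {
    const std::size_t n = bodies.size();
    std::vector<std::array<float, 3>> accel(n, {0.0f, 0.0f, 0.0f});
    std::vector<Body> tile; // stands in for a local/shared memory buffer
    tile.reserve(tile_size);

    for (std::size_t tile_start = 0; tile_start < n; tile_start += tile_size) {
        const std::size_t tile_end = std::min(n, tile_start + tile_size);
        // Load phase: copy one tile of bodies into the "local" buffer.
        // On a GPU, a local memory barrier would follow this copy.
        tile.assign(bodies.data() + tile_start, bodies.data() + tile_end);

        // Compute phase: every body accumulates the pull of the whole tile.
        for (std::size_t i = 0; i < n; ++i) {
            const Body& self = bodies[i];
            for (const Body& other : tile) {
                const float dx = other.x - self.x;
                const float dy = other.y - self.y;
                const float dz = other.z - self.z;
                // Assumption: softening is added to the squared distance.
                const float dist_sq = dx * dx + dy * dy + dz * dz + softening;
                const float inv_dist = 1.0f / std::sqrt(dist_sq);
                const float s = other.mass * inv_dist * inv_dist * inv_dist;
                accel[i][0] += dx * s;
                accel[i][1] += dy * s;
                accel[i][2] += dz * s;
            }
        }
    }

    // Integration: update velocities (with damping), then positions.
    for (std::size_t i = 0; i < n; ++i) {
        Body& b = bodies[i];
        b.vx = (b.vx + accel[i][0] * dt) * damping;
        b.vy = (b.vy + accel[i][1] * dt) * damping;
        b.vz = (b.vz + accel[i][2] * dt) * damping;
        b.x += b.vx * dt;
        b.y += b.vy * dt;
        b.z += b.vz * dt;
    }
}
```

In this sketch the tile size does not change the result, only the order in which memory is touched; on a real GPU it decides how many bodies are staged in local memory per pass, which is presumably why the best `--tile-size` differs per device.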
Current performance stats (in benchmark mode):
* P6000: ~8400 gflops (--count 262144 --tile-size 512)
* GP100: ~7600 gflops (--count 262144 --tile-size 512)
* GTX 970: ~2770 gflops (--count 131072 --tile-size 256)
* GTX 780: ~2350 gflops (--count 131072 --tile-size 512)
* GTX 1050 Ti: ~1675 gflops (--count 262144 --tile-size 256)
* R9 285: ~850 gflops (--count 131072 --tile-size 64)
* GTX 750: ~840 gflops (--count 65536 --tile-size 256)
* GT 650M: ~375 gflops (--count 65536 --tile-size 512)
* HD 530: ~242 gflops (--count 65536 --tile-size 128)
* HD 4600: ~235 gflops (--count 65536 --tile-size 80)
* i7-6700: ~195 gflops (--count 32768 --tile-size 1024)
* HD 4000: ~165 gflops (--count 32768 --tile-size 128)
* iPhone A10: ~131 gflops (--count 32768 --tile-size 512)
* i7-5820K: ~105 gflops (--count 32768 --tile-size 8)
* i7-4770: ~80 gflops (--count 32768 --tile-size 8)
* i7-3615QM: ~38 gflops (--count 32768 --tile-size 8)
* i7-950: ~29 gflops (--count 32768 --tile-size 4)
* iPhone A8: ~28 gflops (--count 16384 --tile-size 512)
* iPad A7: ~20 gflops (--count 16384 --tile-size 512)
Stats from this video:
* N = 131072, damping = 0.9983, softening = 0.01
* since this is an O(n^2) algorithm, this results in 131072^2 = 17179869184 body/body interactions per iteration
* the initial body setup is a hollow sphere (or on-sphere), with body velocities directed toward the center
* with rendering and video capturing, performance is degraded a little and one iteration of this simulation took about 175ms (w/o rendering/capturing it would be ~155ms)
* with N = 65536 this runs in realtime on a GTX 780 (~38ms per iteration with rendering)
* at 1x, this video would run slightly over 1 hour; it is shown at 16x speed-up, with camera rotations at 3x (to not cause that much confusion ;))
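The interaction count and timings above can be related to a gflops figure with a little arithmetic. The sketch below is only that: the helper names are made up for illustration, and the number of floating-point operations per body/body interaction is an assumption (20 is a commonly used convention for this kind of kernel; the accounting actually used for the benchmark numbers is not stated here).

```cpp
#include <cstdint>

// Body/body interactions per iteration for the all-pairs O(n^2) algorithm.
constexpr std::uint64_t interactions_per_iteration(std::uint64_t body_count) {
    return body_count * body_count;
}

// Interactions processed per second, given the time of one iteration.
constexpr double interactions_per_second(std::uint64_t body_count,
                                         double seconds_per_iteration) {
    return static_cast<double>(interactions_per_iteration(body_count)) /
           seconds_per_iteration;
}

// Throughput in gflops. flops_per_interaction is an assumed convention,
// not a value taken from the demo.
constexpr double gflops(std::uint64_t body_count, double seconds_per_iteration,
                        double flops_per_interaction) {
    return interactions_per_second(body_count, seconds_per_iteration) *
           flops_per_interaction / 1.0e9;
}

// N = 131072 gives 131072^2 = 17179869184 interactions per iteration.
static_assert(interactions_per_iteration(131072) == 17179869184ull);
```

Under the assumed 20 flops per interaction, `gflops(131072, 0.155, 20.0)` comes out at roughly 2200 gflops for the ~155ms iteration time quoted above.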