Making Golang 13x faster with Assembly code

20,544 views

Brent Farris

22 days ago

One of the coolest parts of Go (golang) is that there are many ways to speed up your program. One such way is to take advantage of the ability to create .s and .asm assembly code files that are compiled directly into your program. In this video I go over what I did in my Golang Vulkan game engine to improve the performance of the linear algebra math. By taking advantage of SIMD (AVX) instructions, we can improve some functions by nearly 13x. SIMD stands for "single instruction, multiple data" and is a key capability missing from the standard Go compiler. We can of course use Go's built-in assembly capabilities to improve performance and access otherwise inaccessible CPU instructions for much more than vectorized operations, but vectorization is probably the most common reason people drop into assembly.
Go assembly file ► github.com/KaijuEngine/kaiju/...
Twitter ► brentfarris.com/twitter
Website ► brentfarris.com
GitHub ► brentfarris.com/github
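
To put a figure like "nearly 13x" in context, it helps to have a plain-Go baseline to measure against. The following is only a sketch under assumptions: the `Mat4` type and `MulScalar` function are hypothetical names for illustration and are not taken from the Kaiju engine. It times a scalar 4x4 matrix multiply using the standard `testing.Benchmark` helper.

```go
package main

import (
	"fmt"
	"testing"
)

// Mat4 is a hypothetical row-major 4x4 float32 matrix (not the engine's type).
type Mat4 [16]float32

// MulScalar is a plain-Go baseline: 64 multiply-adds, one float at a time.
func MulScalar(a, b *Mat4) Mat4 {
	var out Mat4
	for row := 0; row < 4; row++ {
		for col := 0; col < 4; col++ {
			var sum float32
			for k := 0; k < 4; k++ {
				sum += a[row*4+k] * b[k*4+col]
			}
			out[row*4+col] = sum
		}
	}
	return out
}

// sink keeps the compiler from discarding the benchmarked result.
var sink Mat4

func main() {
	a := Mat4{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}
	b := Mat4{1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1}

	// testing.Benchmark picks the iteration count and reports ns/op.
	res := testing.Benchmark(func(bm *testing.B) {
		for i := 0; i < bm.N; i++ {
			sink = MulScalar(&a, &b)
		}
	})
	fmt.Println("scalar baseline:", res)
}
```

An assembly-backed version could be timed with the same harness and the two ns/op figures compared; the actual ratio will depend on the CPU and the function being measured.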

Comments: 88
@user-tw2kr6hg4r 20 days ago
you know it's serious computer engineering when the source code is printed on a sheet of paper
@lozyodella4178 18 days ago
😂😂😂
@w花b 18 days ago
First one I've seen like that was Ben Eater, but this one even has colors, that's next level
@araz911 18 days ago
@w花b for syntax highlighting, it's paper from 2024, ok?...!!
@gregandark8571 16 days ago
Always has been.
@AK-vx4dy 19 days ago
In assembly you have only one layer of abstraction... paper 😅
@mrrolandlawrence 14 days ago
Wow, love this. I used to be an ARM programmer many, many years ago. Back in those days you really had to optimise code for the number of CPU cycles needed. Sophie Wilson really made ARM instructions a doddle to use.
@tubbystubby 12 days ago
I started Go half a year ago and have been enjoying it a lot. This was awesome, learned a lot. Thanks for the great content. You know you are getting the juiciest stuff if it's on paper.
@user-tw2kr6hg4r 20 days ago
matrix multiplication in primary school?
@BrentFarris 20 days ago
One year, you're learning to read books without pictures. The next, you're calculating the cross product on a 4 dimensional matrix. Then you go to middle school, learn about girls, forget it all, and have to relearn it in pre-calc.
@Paul-zh2jp 19 days ago
this is what i came to comment lol
@w花b 18 days ago
@BrentFarris Don't have to forget it if no girls approach you. That's a win in my book.
@MEMUNDOLOL 17 days ago
A link to this video could possibly be the best answer to the question "Should I build my own engine or use a ready-made one?"
@grimquokka9843 16 days ago
This is a good idea you came up with, and I also appreciate the paper explanation. Please keep up with these videos, sir.
@mr.daniish 13 days ago
This is some serious knowledge! More of these please
@MaxPicAxe 14 days ago
It's nice how, for four floats, the 2 bits for src, 2 bits for dst and 4 bits for the bitmask conveniently fit into exactly a byte. The next convenient number of floats with this pattern appears to be a very large number, where convenient means that the amount of space src, dst and bitmask take up in bits is a power of 2.
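
The byte layout described in this comment can be sketched in Go. This assumes the comment refers to the immediate operand of the x86 INSERTPS instruction, where bits 7:6 select the source lane, bits 5:4 select the destination lane, and bits 3:0 are the zero mask. The helper name `insertpsImm` is made up for illustration.

```go
package main

import "fmt"

// insertpsImm packs the three fields of an INSERTPS immediate into one byte:
//   bits 7:6  source lane (0-3)
//   bits 5:4  destination lane (0-3)
//   bits 3:0  zero mask (one bit per destination lane to clear)
// The function name is hypothetical; only the bit layout comes from the ISA.
func insertpsImm(src, dst, zeroMask uint8) uint8 {
	return (src&3)<<6 | (dst&3)<<4 | zeroMask&0xF
}

func main() {
	// Copy source lane 2 into destination lane 1, clearing nothing.
	fmt.Printf("%#x\n", insertpsImm(2, 1, 0)) // prints 0x90
}
```

Two 2-bit lane selectors plus a 4-bit mask use all 8 bits, which is why the whole operation fits in a single immediate byte.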
@BrentFarris 13 days ago
You can also operate on doubles, but half as many due to using the same space.
@crowlsyong 20 days ago
What a supernatural gift to the world
@lufsss_ 20 days ago
What a supernatural explanation
@QW3RTYUU 6 days ago
Ben Eater vibes this gives me. Thanks for the video!
17 days ago
Love that you did it with go. It's just such a clean language.
@BrentFarris 17 days ago
I picked up Go after learning that Ken Thompson helped design it. Slices and goroutines/channels are awesome
@shurizzle 11 days ago
@BrentFarris Goroutines/channels come from Plan 9, as does that ASM syntax. After all, Rob Pike is behind both Golang and Plan 9.
@joeybasile1572 13 days ago
Thanks dude. Informative. Good presentation.
@baxiry. 21 days ago
What a supernatural topic
@sanderbos4243 11 days ago
Extremely good explanation
@treelibrarian7618 18 days ago
Just thought you might be interested that the pack operation with 16 insertps's (16p23, 16p5 ops) may instead be done as an in-register matrix transpose using unpckhps/unpcklps (4p23, 8p5 ops) in half the time. I'm not familiar with Golang's inline asm so I'll use Intel asm instead, I'm sure you'll be able to translate:

    vmovups   xmm1, [rbp + start]        ; a3a2a1a0
    vmovups   xmm2, [rbp + start + 16]   ; b3b2b1b0
    vmovups   xmm3, [rbp + start + 32]   ; c3c2c1c0
    vmovups   xmm4, [rbp + start + 48]   ; d3d2d1d0
    vunpckhps xmm5, xmm2, xmm4           ; d3b3d2b2
    vunpcklps xmm4, xmm2, xmm4           ; d1b1d0b0
    vunpcklps xmm2, xmm1, xmm3           ; c1a1c0a0
    vunpckhps xmm3, xmm1, xmm3           ; c3a3c2a2
    vunpcklps xmm1, xmm2, xmm4           ; d0c0b0a0
    vunpckhps xmm2, xmm2, xmm4           ; d1c1b1a1
    vunpckhps xmm4, xmm3, xmm5           ; d3c3b3a3
    vunpcklps xmm3, xmm3, xmm5           ; d2c2b2a2

This has the advantage that whether xmm, ymm, or zmm registers are used, it's still 8 unpack ops to do 1, 2 or 4 4x4 matrix transposes. This formula uses one extra register and produces the result in the same order in the registers as your insertps-based version.

Edit: realized a couple of days later that I used the AVX 3-operand versions of the instructions, not the SSE1 2-operand versions, so I've added the V's. It's not so pretty if you can't use them, since every output has to be copied first and the operand ordering is inconvenient too, so it doesn't fit in 5 registers any more...
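
The unpack-based transpose in this comment can be checked by modelling the two instructions in plain Go. This is only a scalar simulation under assumptions: `vec4`, `unpcklps`, `unpckhps` and `transpose4x4` are illustrative names, lanes are listed low to high (the reverse of the a3a2a1a0 notation in the comment), and nothing here runs actual SIMD instructions.

```go
package main

import "fmt"

// vec4 models one 128-bit register as four float32 lanes, lane 0 first.
type vec4 [4]float32

// unpcklps models UNPCKLPS: interleave the low two lanes of x and y.
func unpcklps(x, y vec4) vec4 { return vec4{x[0], y[0], x[1], y[1]} }

// unpckhps models UNPCKHPS: interleave the high two lanes of x and y.
func unpckhps(x, y vec4) vec4 { return vec4{x[2], y[2], x[3], y[3]} }

// transpose4x4 follows the same eight unpack steps as the assembly above.
func transpose4x4(m [4]vec4) [4]vec4 {
	a, b, c, d := m[0], m[1], m[2], m[3]
	t5 := unpckhps(b, d) // b2 d2 b3 d3
	t4 := unpcklps(b, d) // b0 d0 b1 d1
	t2 := unpcklps(a, c) // a0 c0 a1 c1
	t3 := unpckhps(a, c) // a2 c2 a3 c3
	return [4]vec4{
		unpcklps(t2, t4), // a0 b0 c0 d0
		unpckhps(t2, t4), // a1 b1 c1 d1
		unpcklps(t3, t5), // a2 b2 c2 d2
		unpckhps(t3, t5), // a3 b3 c3 d3
	}
}

func main() {
	m := [4]vec4{{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11}, {12, 13, 14, 15}}
	for _, row := range transpose4x4(m) {
		fmt.Println(row)
	}
}
```

Each output row collects one lane from all four inputs, which is exactly a transpose; the model makes it easy to confirm the lane comments before translating the sequence to Go assembly.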
@lozyodella4178 18 days ago
Is this the language of Gods?
@stercorarius 17 days ago
@lozyodella4178 nah that's Lisp
@treelibrarian7618 15 days ago
A further thought for you: perhaps the transpose is entirely unneeded anyway. This code does the 4x4 matrix multiply without it; see the inline comments for functional details. I adapted this from a 16x16 AVX-512 version where the macro was just 16 FMA instructions with inline broadcast loading the elements of A directly. Here pshufd (SSE2) serves as the broadcast equivalent, and the multiplies and adds are separated.

    ;; 4x4 matrix multiply
    ;; A is the matrix that is scanned horizontally,
    ;; B is the matrix to be scanned vertically.
    ;; output to O
    %macro domatrixrowSSE 0
        pshufd xmm0, xmm3, 0x00   ; broadcast first element of the A row
        mulps  xmm0, xmm4         ; multiply by whole first row of B
        pshufd xmm1, xmm3, 0x55   ; broadcast second element of the A row
        mulps  xmm1, xmm5         ; multiply by second row of B
        addps  xmm0, xmm1         ; add to first result
        pshufd xmm1, xmm3, 0xaa   ; broadcast third element of the A row
        mulps  xmm1, xmm6         ; multiply by B row 3
        addps  xmm0, xmm1         ; add
        pshufd xmm1, xmm3, 0xff   ; broadcast fourth element of the A row
        mulps  xmm1, xmm7         ; multiply by B row 4
        addps  xmm0, xmm1         ; last add
    %endmacro

    multiply4x4function:
        ; this is not complete: replace the tokens of a, b and o
        ; with whatever you have those pointers in.
        ; Can be used as a base for larger matrix multiplies
        ; if you load the prior output content before adding all 4 lines
        ; and change 16/32/48 to 1/2/3x row length in bytes,
        ; and a/b/o point to the relevant parts of the input/output matrices.
        movups xmm4, [b + 0]      ; load whole B matrix
        movups xmm5, [b + 16]
        movups xmm6, [b + 32]
        movups xmm7, [b + 48]
        movups xmm3, [a + 0]      ; load first row of A matrix
        domatrixrowSSE            ; the macro multiplies one row of A by 4 columns of B
        movups [o + 0], xmm0      ; store results to first row of output matrix
        ;   e1r1O =        : e2r1O =        : e3r1O =        : e4r1O =
        ;   e1r1A*e1r1B    :   e1r1A*e2r1B  :   e1r1A*e3r1B  :   e1r1A*e4r1B
        ; + e2r1A*e1r2B    : + e2r1A*e2r2B  : + e2r1A*e3r2B  : + e2r1A*e4r2B
        ; + e3r1A*e1r3B    : + e3r1A*e2r3B  : + e3r1A*e3r3B  : + e3r1A*e4r3B
        ; + e4r1A*e1r4B    : + e4r1A*e2r4B  : + e4r1A*e3r4B  : + e4r1A*e4r4B
        movups xmm3, [a + 16]     ; load second row of A
        domatrixrowSSE
        movups [o + 16], xmm0     ; store to second row of O
        movups xmm3, [a + 32]     ; third row of A
        domatrixrowSSE
        movups [o + 32], xmm0     ; to third row of O
        movups xmm3, [a + 48]     ; 4th row of A
        domatrixrowSSE
        movups [o + 48], xmm0     ; to 4th row of O
    ;; 28p01, 16p5, 8r4w. 16 cycles/matrix on Ice Lake, 28 cycles/matrix on older CPUs with only 1 vfp port (e.g. Sandy Bridge)
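
For readers who want to see the broadcast-and-accumulate scheme from this comment without the assembly, here is a scalar Go sketch. The names `mat4` and `mulRowBroadcast` are illustrative assumptions; the inner loop over `j` stands in for what a single 4-wide multiply and add would do in the SIMD version.

```go
package main

import "fmt"

// mat4 is a hypothetical row-major 4x4 matrix used only for this sketch.
type mat4 [4][4]float32

// mulRowBroadcast computes O = A * B without transposing B.
// For each element A[i][k] it scales the whole row B[k] and
// accumulates the result into output row O[i].
func mulRowBroadcast(a, b mat4) mat4 {
	var o mat4
	for i := 0; i < 4; i++ {
		for k := 0; k < 4; k++ {
			s := a[i][k] // the value that would be broadcast to all lanes
			for j := 0; j < 4; j++ {
				o[i][j] += s * b[k][j] // one lane of the multiply and add
			}
		}
	}
	return o
}

func main() {
	a := mat4{{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}}
	// b swaps the first two columns of whatever it is multiplied into.
	b := mat4{{0, 1, 0, 0}, {1, 0, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}}
	fmt.Println(mulRowBroadcast(a, b))
}
```

Because every step reads a full row of B, no column gather or transpose is needed, which is the point the comment makes about skipping the transpose.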
@shappertallw 3 days ago
@treelibrarian7618 this is insane, I never thought I would see the day where someone cold rolled asm, with SSE instructions no less, in a YT comments section. props
@treelibrarian7618 3 days ago
@shappertallw it's a hobby of mine: I've done it before and I'll probably do it again. I think I might have scared one or two YouTubers away from posting asm-related videos, which was not my intention. I really should be making videos myself...
@Antonio-yy2ec 17 days ago
Pure gold!!
@sirbumblefuck 20 days ago
What a supernatural way of explaining
@kira.herself 20 days ago
What a supernatural video
@timofeysobolev7498 14 days ago
Great video!)
@blockshift758 17 days ago
I always see comments like "matrix math in middle/high school?!" on videos like this, and laugh to myself because I remember we did it in elementary (grades 4-6).
@Decastyled 19 days ago
'Cause you're a supernatural A beating heart of stone You gotta be so cold To make it in this world Yeah, you're a supernatural Living your life cutthroat You gotta be so cold Yeah, you're a supernatural
@Caellyan 17 days ago
What about using something like volk (vector optimized library of kernels)? Is Go FFI slow?
@BrentFarris 17 days ago
You likely can without issues. You may have to benchmark it though, because you do have to pay the small cost of swapping stacks. Go's stack is built lean for goroutines, so it has to swap to a C-compatible stack to call C.
@hyprland 20 days ago
What a super nature
@--bountyhunter-- 19 days ago
what a natural super
@hulakdar 19 days ago
Is there no way to natively emit vector instructions in Go? If that is true, then that is quite unfortunate. Isn't it easier to write those functions in C and link with them instead of writing out assembly?
@BrentFarris 19 days ago
Not at the moment in Go directly. You have a few options:
1. Write assembly directly as we did here (fastest execution).
2. Write vectorized assembly instructions as their own functions (similar to C) and use them at a higher level, but you'll need to take care to follow the calling conventions so you don't clobber your asm work.
3. Use the C vectorization library functions and call them through C. This will have the tiny overhead of swapping stacks, though.
@gregandark8571 18 days ago
@BrentFarris Go is a bullshit language exactly because of such technical gaps :(
@maximus1172 15 days ago
Very cool!! You should also try making the engine in Rust
@BrentFarris 15 days ago
One day, I may. I enjoy trying out languages, and game frameworks/engines tend to be my testbed. Either as the core engine code or as a scripting language depending on the nature of the language.
@TheCyberBully420 10 days ago
You made an engine with Vulkan or you made something similar to Vulkan??
@BrentFarris 9 days ago
Using Vulkan, I've made engines in C, C++, and Go. It has a pretty nice and straightforward structure once you get a handle on it.
@MrTomyCJ 18 days ago
There is a flaw in the system: I can deduce from the comments that I should reply something supernatural without having watched the entire video. The next time you'll have to provide a function to determine what to comment instead of a phrase, so that the appropriate comment can't be deduced from the comments. You got me to comment anyway though.
@iant9053 18 days ago
Holy. If you had to learn everything from scratch, in what order would you learn your langs? Just starting with C, thx wizard
@BrentFarris 17 days ago
I would learn C if I had to go from scratch. It's just high level enough to do huge projects and just low level enough to teach you how computers work internally. I learned C++ as my first language, but I wish it were C.
@cvabds 19 days ago
How much do you want to create a game engine for TempleOS?
@BrentFarris 19 days ago
Haha, I haven't booted up TempleOS yet. It's still on my bucket list. When I do, it might just happen!
@cvabds 19 days ago
@BrentFarris please don't be restricted to the whole religious thing, use it to the full potential please, 4k high res
@Onyx-it8gk 19 days ago
Neat video! If you have this much programming knowledge and skill, I think you'd really appreciate Vale. It's a new language that takes a novel approach to memory management without a GC. It borrows concepts from many languages like Rust, Cyclone, Pony and Forty2.
@BrentFarris 19 days ago
Thanks! I'll have to check it out, I have a lot of fun trying out different languages. There have been a lot of languages popping up lately. It's so hard to keep up, haha
@Onyx-it8gk 19 days ago
@BrentFarris I know what you mean! I'm sure someone such as yourself has a very long list of things to check out with not enough time in the day!
@tiskanto 12 days ago
This is the "Ben Eater" style
@wakanda6357 17 days ago
What should one do or learn to understand assembly??
@greenrocket23 16 days ago
Well, a pretty good resource for beginners is the MIT OpenCourseWare for the x86_64 architecture
@BrentFarris 15 days ago
Program some small things in 6502 assembly. It is an incredibly small assembly language and will teach you 90% of what you need to know. You can then get a book or read online docs for x86/x64 and ARM instructions. Check out this 6502 tutorial. It comes with an emulator and is a lot of fun: skilldrick.github.io/easy6502/index.html
@alejandroulisessanchezgame6924 18 days ago
Is it possible to develop 3D games with Golang like this, even if it's a GC language?
@BrentFarris 18 days ago
Yes, you can either write it from scratch like I do for fun (see the Kaiju engine GitHub link in the description), or you can load up helper C libraries for OpenGL, SDL, etc., which I've done in the past. You'll find most game engines like Unreal and Unity use an internally built garbage collector, so don't let the GC hold you back from experimenting.
@alejandroulisessanchezgame6924 18 days ago
Thanks, I will try.
@nittani. 18 days ago
What is garbage @BrentFarris
@QW3RTYUU 5 days ago
@nittani. something to be collected it seems
@danielsmith5626 15 days ago
ASMR backend is peak
@harold2718 18 days ago
Instead of transposing B and then essentially doing dot products, you can take a row of B, multiply it by a broadcasted element of A, and then add it into the result. That's more efficient than doing dot products; HADDPS isn't that efficient (essentially equal to 2 shuffles plus ADDPS). Also, even if you do want to transpose, you can do it with 8 shuffles instead of 16 INSERTPSes, similar to how the _MM_TRANSPOSE4_PS macro does it (but you have no access to that so you'd implement it manually).
@fqidz 19 days ago
supernatural season 2 ep 2
@domelessanne6357 18 days ago
wow
@trungthanhbp 8 days ago
niec
@rdubb77 19 days ago
Primary school? Linear algebra is generally a college subject, I didn’t learn matrix multiplication even in high school
@BrentFarris 18 days ago
What? Kids nowadays don't do linear algebra after nap time anymore?
@gbucks5117 15 days ago
When the code is on paper, you know the shit is serious
@Jhat 16 days ago
the real question is... WHAT IS THAT PENCIL HOLDER????
@Kyle-do6nj 17 days ago
All this to ultimately have a 20% efficiency at candy crush...
@sokiuwu 17 days ago
Making assembly 30× faster by writing in binary
@BrentFarris 17 days ago
Don't tempt me with a good time
@opkp 20 days ago
Neat
@spoonikle 19 days ago
Who else is naturally this super?
@Miles-co5xm 17 days ago
Java base classes
@emirsahin4105 9 days ago
madman
@mikejohneviota9293 18 days ago
primary school huh for linear math i feel dumb
@BrentFarris 18 days ago
Me too, I must have missed the linear algebra class they taught at recess...
@blockshift758 17 days ago
Bruh is explaining code on paper
@user-lh3xs9km6z 17 days ago
These are nice results... without a doubt... but at that level of needed optimization, isn't going back to C/C++ better?
@BrentFarris 17 days ago
Actually, there are some highly optimized Go functions that beat their Go assembly counterparts. I'll make a video on this next. I always advocate for people to write in C; I'm biased because it's my main language. But you really do get some amazing benefits in Go that you just don't in C/C++. So it's really up to the taste of the developer. I've written 3D Vulkan game engines in all 3 languages (C, C++, and Go)