Very interesting, I imagine it took a lot of head scratching to really understand what was going on with those experiments. Great video
@leddoo2 жыл бұрын
thanks! yeah, took me about a week to figure out what was going on :D initially, i thought it was just about the number of memory accesses, but things were just not adding up. the experiments from the video are the ones that convinced me it was about dependencies. but yeah, there were more :D
@isotoxal2 жыл бұрын
You are now in the list of my favourite youtubers
@leddoo2 жыл бұрын
damn, big words! thanks man 😎
@JoeHinkle11 Жыл бұрын
I haven't seen many channels focus on programming language development like yours. Amazing job! Also, I love your animations. Really makes it clear what's happening when you explain concepts
@P-39_AiracobraАй бұрын
This is one of the most interesting and educational low-level videos I've ever seen, I love it. I never would have thought about these optimizations on my own.
@SaHaRaSquad Жыл бұрын
I once wrote a small stack-based vm but had two bits left over in the encoded instructions, so I decided to just have two independent stacks and the last two bits determined which stack would be read for arguments and which would receive the result. I didn't do a lot with it so I'm not sure about how this affected performance, but I think it could allow quite some optimizations. Also I saw a talk about a very small chip running Forth code, and it used a stack as well as two registers, so a bit of extra space for temporary values really pays off I guess. Oh, and because I saw your lisp-based syntax: It is surprisingly easy to write a simple Lisp-to-Lua transpiler. Lua lets you use most things as expressions and also supports tail-call optimization, so you can basically map Lisp expressions 1:1 to equivalent Lua code.
@OMFGxIamxAxNinja2 жыл бұрын
Wow, this just popped up in my feed, but I really enjoyed this video! You made it really interesting!
@devtadeo2 жыл бұрын
Me wondering why he uses a knife to point around
@pp_up8 ай бұрын
It's because... it's _pointy_
@JoStro_ Жыл бұрын
fascinating, i've never considered this aspect of performace before.
@randomstuffgamess Жыл бұрын
Awesome video! I really like the 'modern cpus go brr (if you let them)' section. And the animations really helped with clarity. How did you make them? And your explanations are really on point.
@leddoo Жыл бұрын
thanks! the stack & cpu were animated using motion canvas (i think i left links to the components in the description, but they're not updated to the latest version of motion canvas)
@shilangyu Жыл бұрын
Subscribed. High quality content and touching exactly the topics I love.
@AK-vx4dy Жыл бұрын
Nicely explained, good work !! Keep explaining :)
@j-r-hill Жыл бұрын
Strongly recommend looking into work done by Anton Ertl if you're interested in stack machine compiler optimizations
@kevinrobben1999 Жыл бұрын
Learned so much from your videos! Thank you, subbed
@thomasmewily4012 Жыл бұрын
This is exactly the channel I needed : cleans, concises and really good comparison and explanation about Stack VM vs Register VM, thank you ! Just one question : When you have a stack VM, it is like having an infinite number of register. But when you are using a register VM, you usually get a limited number of register. What happen if you use more variable than the number of available register, for a register VM ? And how do the VM handle that, or even your computer if you compile it instead of interpreting it ?
@leddoo Жыл бұрын
the register vm in my scripting language is currently limited to 255 registers (one byte per operand). so the compiler would raise an error (currently just panicks, lol). i may add instructions to use more than 255 registers later, not sure. python has a special instructions for when there are more than 255 local variables (docs.python.org/3/library/dis.html#opcode-EXTENDED_ARG ) in general, compilers tend to have all kinds of limits that you never notice during regular use (like max number of nested blocks, etc).
@thomasmewily4012 Жыл бұрын
@@leddoo Yeah I was thinking about tricky/edges cases Thank for the link btw ^^
@oblivion_2852 Жыл бұрын
Funny enough I came across the max number of arguments in a java function call because I was transpiling from a language that had arbitrary length arguments. It was a really dumb concat function that 700 arguments but that day I learnt that Java only supports 254 arguments (since the class itself is always passed as an argument)
@thomasmewily4012 Жыл бұрын
@oblivion8459 Wow, I wonder in which context you manage to have a concat function with more than 700 arguments ? (Maybe the function call was written by another program) Look like you reach some cool hidden technical details too. From memory in C it is at least 127 arguments.
@thomasmewily4012 Жыл бұрын
@oblivion8459 The translated function can also have a single list or somethings iterable that contains the variadic arguments, but I guess it's just better to don't handle the case and see where the weird code is
@Diamonddrake Жыл бұрын
Very well explained!
@evanfreethy2574 Жыл бұрын
This is my favorite type of video 🎉
@JH-pe3ro8 ай бұрын
There's a strong path-dependency to how we've ended up in "registers are faster", I think(and I was aware of that before the video, though it's a great explanation of why); register architectures are intuitive when computers are viewed in terms of arithmetic processing, and they map well to local variables. Therefore we have hardware that engages with that, provides a lot of enumerated registers, and asks for languages to utilize all of them, and that has a consequence on VM design, since the VM's goal is to remap program description to efficient hardware utilization: if it looks stack-y, then the JIT rewrites it to register-y code. On the other hand, the CPU architecture can avoid this and do something like Minimal 64x4, a non-pipelined hobby computer using discrete logic: it achieves very good performance at a relatively low clock rate by reducing the register set and optimizing the instruction set around zero page memory(a construct which exists on 6502 as well, but is used supplemental to registers), thus algorithm descriptions are smaller and the total sequential work is smaller than on other 8-bits, despite having a lower "MIPS" rating. I believe the distinction between VMs would be relatively less important on Min 64x4 because the stack VM's instruction set would also benefit from the ability to code random access efficiently. I've started to think that local variables are worth questioning, though. They serve a certain "black-boxing" purpose of presenting short routines as a matter of loading up the registers in an initial state, doing the computation, and then writing that back using the stack as a protocol between that routine and the rest of the program. But the usual alternative to a stack protocol is a buffer, and buffers have certain benefits in terms of coordination of resource use. If the buffer were implemented like a ring buffer and the computer optimized around that, it presents ideas like the Mill computer's "belt", an interesting concept, albeit one that sadly still hasn't shipped anything.
@luna010 Жыл бұрын
great video. I'd be curious to see how your scripting language compares to VFX Forth (which is the fastest Forth compiler/interpreter afaik).
@dimitar.bogdanov2 жыл бұрын
Really interesting!
@Richard_Knusprig2 жыл бұрын
I really like your editing style and how you ... aaahm ..., anyway.
@leddoo2 жыл бұрын
asufutimaehaehfutbscueme, anyway
@aredrih6723 Жыл бұрын
I'm more familiar with the JVM and in these stack machine, the locals (which include arguments) and the stack are kept separate. The stack starts out empty and you have to load the value from the locals. A bit as if there was a shelf alongside the stack and instructions to add the value of a local and instructions to save the current value to a local. Basically, these stack machine _have_ register, they just have 2 operation allowed on them. I'm not sure how python does it differently or if I'm misinterpreting the visualisation. So does python really load all the arguments at the bottom of the stack?
@leddoo Жыл бұрын
yes, python puts the locals at the bottom of the stack frame and loads them to the top for operations. (i can recommend messing around with python's `dis` module to see how python implements language constructs in bytecode!) from what i've read on wikipedia, the stack in the JVM is more of an abstraction though. the jit turns it into regular native register code. similar to how WASM is stack based, but wasm runtimes convert the code to native register code before execution.
@aredrih6723 Жыл бұрын
@@leddoo Hum, I did a quick test by making an identity function: def ident(x): return x It compiled to "load_fast" and "return_value". I used the command `python3 -m compileall` to get it compiled, removed the "load_fast" from the compiled function (couldn't found how to load arbitrary byte code) and added a nop (byte code 9; for opts and arg) after the return to keep alignment (no clue how the format works, just searched byte code matching the disassembly and replaced it). Then I tried to load the module. It worked. Then I tried to called the function. It's SIGSEGV. I'll need to dig deeper to find out if my "unconventional" edit are the caused or if the function need the "load_fast" to put a value on the stack. Or I might be completely misunderstanding what you mean by "stack frame". Maybe what you call "stack frame" is what java call " slot" and wasm "local".
@leddoo Жыл бұрын
@@aredrih6723 yeah, load_fast loads the value to the top, return_value returns that value. maybe it crashes, cause python does not expect the parameters to be popped? by stack frame i mean the area of the stack, reserved for the function invocation. (en.wikipedia.org/wiki/Call_stack#Structure ) you know, actually i'm not sure whether parameters are on the stack in python (that's what i do in my VM). so maybe it crashes, cause return_value tries to pop from an empty stack?
@aredrih6723 Жыл бұрын
@@leddoo So, I tried digging a bit and dis.code_info seems to confirm the existence of locals (which would be register only supporting writing to and reading from the stack). I also seem python encode the (I assume maximum) size of the stack and numbers of locals. (also, "def f(a, b) : return 0" is marked as having 2 local and a stack of 1) Now, storing the arguments at the bottom of the stack is a possible implementation but you would either need to copy an element from an absolute index or every loading depends on the size of the current size of the stack which would need to consider the size of the stack in both case. Still, it seems to allow for some neatly optimized code so far so I'm curious how the vm will mature.
@leddoo Жыл бұрын
@@aredrih6723 in my vm, the call instruction currently takes a variable number of register operands & copies those into the newly created stack frame (so params are the bottom `n` registers upon entry, but the callee is free to modify them).
@flareflo362 Жыл бұрын
What are the various profiling tools used here? They look really neat!
@leddoo Жыл бұрын
intel's vtune (on windows, i'm using the msvc integration) and apple's instruments (for the macos stuff)
@cemgecgel4284 Жыл бұрын
Great video!
@otesunki Жыл бұрын
this makes me curious is it faster to push a dummy value like 0 then add the top two elements and copy over the dummy value
@leddoo Жыл бұрын
hmm, not sure what you mean 🤔 my takeaway from that video was: if two "instructions" access the same memory location, they can interfere. (a "read after read" is fine, but "read after write" causes delays)
@Israel2205004 ай бұрын
What program do you use to generate the reports on CPU time and microarchitecture pipeline usage?
@nahblue Жыл бұрын
Nice! Does anyone have a link for the pipeline diagram tool?
@leddoo Жыл бұрын
link in the description :P framework is motion canvas. script is not ported to latest version (of mc) however.
@u9vata Жыл бұрын
I understood this before the vid being hiperf coding guy... But I am still not totally sure how good an optimizer can optimize stack based code to avoid these. I used to like Factor for example and although I never used it for hiperf code like I did with C++ they claim to have stack based good optimizer... I mean for example a stack base language optimizer could spot that your load + dup pattern could be exchanged with the two separate load - I see no issue why the optimizer could not have this rule for example and this is just an example. Generally informative video though. I did not know about the tool that shows that "green / red" thing for pipelining. What is the name of it? Ah never mind... its vtune. I prefer perf and such tools generally.
@treelibrarian7618 Жыл бұрын
pretty sure the reason the "smart" version is slower is the rotation - since that would have to read and write every entry in the stack memory to shuffle them down by one. better would probably be: Load { src: a }, Load { src b }, Store { dst: a }, Add,
@hfspace Жыл бұрын
on the other hand these smart instructions take less place in the cpu even if they are dependent, since they are only occupying one "execution" slot. So in principal you have more capacity to execute other completely independent instructions. This seems to be a trade off here or am i missing something?
@hfspace Жыл бұрын
i guess the product of taken execution slots and the time needed to execute is needed to compare different instructions absolutely.
@blinking_dodo2 жыл бұрын
This is interesting...
@OhadLutzky Жыл бұрын
That is super cool, and seems highly relevant to Java as well, which people like to at least pretend can be performant. (JVM is stack-based, Dalvik is register-based)
@leddoo Жыл бұрын
well kinda :D wasm is also stack based, but close to native in perf. the question is whether the *interpreter* is stack based. java is typically jitted, so it runs natively using the physical registers.
@bernatrosello4375 Жыл бұрын
I don't think the "parallel" vs sequential argument is necessarily right. Even if you're running both versions using more/less variables you have register usage risks, which are going to slow yo down significantly on a segmented data path CPU.
@leddoo Жыл бұрын
hmm, which part of the video are you referring to? and which registers - cpu regs or vm regs?
@elarajade78932 жыл бұрын
First ✌
@nexovec Жыл бұрын
I'm pretty sure not padding every stack frame with 200 bytes helped in making your interpreter faster than python 😂
@themcchuck8400 Жыл бұрын
You really need to work on your audio levels. They're all over the place. Keep the microphone a steady distance from your mouth for the entire session, and keep your mouth pointed at the microphone.
@HXMCPP Жыл бұрын
what do you use ? to benchmark it ? 6:44
@leddoo Жыл бұрын
i use vtune on windows (the msvc extension) and instruments for macos.
@HXMCPP Жыл бұрын
does it work with amd ? @@leddoo
@leddoo Жыл бұрын
@@HXMCPP i don't think so. amd has "amd uprof", but i'm not familiar with it.
@kfsone Жыл бұрын
For some reason people look at dis output and always miss the elephant in the room. 2 0 LOAD_FAST 0 (a) 2 LOAD_FAST 1 (b) 4 BINARY_ADD 6 LOAD_FAST 2 (c) 8 BINARY_ADD 10 LOAD_FAST 3 (d) 12 LOAD_FAST 4 (e) 14 BINARY_ADD 16 BINARY_ADD 18 STORE_FAST 5 (r) You're looking a the center column and not the far right. 2 0 LOAD_FAST 0 (this_couldnt_possibly_be_slower) 2 LOAD_FAST 1 (could_it) 4 BINARY_ADD 6 LOAD_FAST 2 (just_changing_the_variable_names) 8 BINARY_ADD 10 LOAD_FAST 3 (couldnt_make_it_slower) 12 LOAD_FAST 4 (right) 14 BINARY_ADD 16 BINARY_ADD 18 STORE_FAST 5 (i_mean_theyre_just_hash_lookups_arent_they) def f(a, b, c, d, e): r = (a + b) + c + (d + e) %timeit f(007, 14, 21, 27, 351) from os import * from sys import * from collections import * from itertools import * from functools import * def well_what_about_this(this_couldnt_possibly_be_slower, could_it, just_changing_the_variable_names, couldnt_make_it_slower, right, i_mean_theyre_just_hash_lookups_arent_they): i_mean_theyre_just_hash_lookups_arent_they = (this_couldnt_possibly_be_slower + could_it) + just_changing_the_variable_names + (couldnt_make_it_slower + right) %timeit well_what_about_this(007, 14, 21, 27, 351) (bonus points if you figure why the imports)
@kfsone Жыл бұрын
In my testing, "42 + 27 + 351" takes 25ns. calling "adder(42, 27, 351)" takes 75ns. changing adder to "adder(first, second, third)" and the body to just "result = (first + second) + third" adds 1.1-1.6ns to each call. Well, 1.2ns on top of 76? No. 50ns of that is the overhead of PyCall, the actual instructions are only 25 + 1.2ns, or those few extra letters of the names accounted for 5% of the cost for that one line of 2 additions. And *that* is just for top-level local parameters.
@kfsone Жыл бұрын
kzbin.info/www/bejne/rKDFeKSPiNt7ebc
@leddoo Жыл бұрын
i actually have no idea what you’re talking about 🤔 the length of the variable names doesn’t matter at runtime. they’re turned into indices. the last column is just for the programmer. you seem to make some other point about call overhead, which is something that can matter, but in this video, the function call overhead was insignificant. and there wasn’t any python in this vid. i’m just very confused.
@0LoneTech Жыл бұрын
Those indices are the second rightmost column (outside the parenthesis) and the reason the accesses are labeled "fast". The imports certainly increase the odds of hash collisions, but it only affects the function lookup. Long names take (usually insignificantly) more time to parse, hash and intern (all during compile time), but once interned can be matched by id. You might want to rerun those timeit tests a handful of times; in my tests the differences are entirely noise and swing either way.