Browser hacking: Making a dumb loop go fast in the JIT!

8,904 views

Comments: 64

@awesomekling 11 months ago

Things we need to improve next time:
- Increment can simply add to ARG1 instead of masking and re-adding the INT32_TAG (thanks begga9682!)
- Handle negative values correctly in LessThan (thanks LKokos!)

@vytah 11 months ago

If you do the first thing, check that incrementing -1 doesn't carry out of the integer part.
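
The carry concern can be sketched as follows. This is a minimal illustration, assuming a NaN-boxed layout with a tag in the top 16 bits and the int32 payload in the low 32 bits; the tag value `kInt32Tag` and all helper names are hypothetical and are not taken from LibJS.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: tag in bits 48..63, int32 payload in bits 0..31.
// The tag value is an assumption made for this sketch only.
constexpr uint64_t kInt32Tag = 0x7ffaULL << 48;

constexpr uint64_t box_int32(int32_t value)
{
    // The payload is zero-extended, so a negative int32 fills all low 32 bits.
    return kInt32Tag | static_cast<uint32_t>(value);
}

// Naive approach: one 64-bit add on the whole boxed value.
constexpr uint64_t increment_64bit(uint64_t boxed)
{
    return boxed + 1;
}

// Alternative: add within the low 32 bits only and keep the upper bits as they were.
inline uint64_t increment_low_32_bits(uint64_t boxed)
{
    uint32_t payload = static_cast<uint32_t>(boxed) + 1; // wraps modulo 2^32
    return (boxed & 0xffffffff00000000ULL) | payload;
}

inline void check_increment_carry()
{
    // Positive payloads survive the 64-bit add.
    assert(increment_64bit(box_int32(41)) == box_int32(42));
    // -1 is 0xffffffff in the payload, so adding 1 carries into bit 32.
    assert(increment_64bit(box_int32(-1)) != box_int32(0));
    // Confining the add to the low 32 bits gives the expected boxed 0.
    assert(increment_low_32_bits(box_int32(-1)) == box_int32(0));
}
```

Calling `check_increment_carry()` exercises all three cases. Under these assumptions a fast path would also have to bail out when the payload is INT32_MAX, since the incremented result no longer fits in an int32.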

@fredleckie5880 11 months ago

These Yak wallpapers are getting more and more extreme! How about one in a jet 'plane next time?

@paulwratt 11 months ago

Yep, this definitely qualifies as _code therapy_ - I love that this kind of process requires an understanding of: 1) C++ 2) JavaScript 3) JS bytecode engine 4) compiler state machine 5) assembler opcodes. SOS Code Commando :)

@jvcrules 11 months ago

FYI There's an indicator in the middle of the status bar that tells if Copilot is thinking or not :) Helpful when it's unclear it's doing anything.

@awesomekling 11 months ago

Oh! Thank you, I'll have to pay attention to that 😅

@nils-kopal 11 months ago

Hey Andreas, for the int32 check in the LessThan fast path, you can move the second shift after the first int32 test. If the first test fails, there is no need to shift the second number, since you no longer need to test whether it is an int32 🙂

@DanelonNicolas 11 months ago

It's amazing to see the whole refactoring process: how the code grows, the new functionality, and the enhancements. This is the best series so far ❤ Awesome, Kling! Thanks!!

@msclrhd 11 months ago

It would be useful to have tests for the x86_64 assembler that check, for a given instruction, whether the correct bytes are emitted. That way, you can avoid future issues with swapped dst/src bit patterns and other similar codegen bugs.

@awesomekling 11 months ago

Yeah definitely :^)
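
A byte-level test of the kind suggested above could look roughly like this. It is only a sketch: `encode_mov_reg_reg` and the `Reg` enum are hypothetical stand-ins, not the assembler API from the video. The expected bytes follow the standard x86-64 encoding of `MOV r/m64, r64` (opcode 0x89), where the ModRM `reg` field holds the source and the `rm` field holds the destination.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Register numbers as used by the x86-64 encoding (RAX=0, RCX=1, RDX=2, RBX=3, R8=8).
enum class Reg : uint8_t { RAX = 0, RCX = 1, RDX = 2, RBX = 3, R8 = 8 };

// Hypothetical encoder for a 64-bit register-to-register `mov dst, src`.
std::vector<uint8_t> encode_mov_reg_reg(Reg dst, Reg src)
{
    auto d = static_cast<uint8_t>(dst);
    auto s = static_cast<uint8_t>(src);
    // REX prefix: W=1 for 64-bit operands, R extends ModRM.reg (src), B extends ModRM.rm (dst).
    uint8_t rex = 0x48 | ((s >> 3) << 2) | (d >> 3);
    // ModRM: mod=11 (register direct), reg=src, rm=dst.
    uint8_t modrm = 0xC0 | ((s & 7) << 3) | (d & 7);
    return { rex, 0x89, modrm };
}

void test_mov_encoding()
{
    using Bytes = std::vector<uint8_t>;
    // Swapping dst and src must change the ModRM byte.
    assert((encode_mov_reg_reg(Reg::RAX, Reg::RBX) == Bytes { 0x48, 0x89, 0xD8 })); // mov rax, rbx
    assert((encode_mov_reg_reg(Reg::RBX, Reg::RAX) == Bytes { 0x48, 0x89, 0xC3 })); // mov rbx, rax
    // Extended registers move the difference into the REX prefix instead.
    assert((encode_mov_reg_reg(Reg::R8, Reg::RAX) == Bytes { 0x49, 0x89, 0xC0 }));  // mov r8, rax
    assert((encode_mov_reg_reg(Reg::RAX, Reg::R8) == Bytes { 0x4C, 0x89, 0xC0 }));  // mov rax, r8
}
```

Testing each instruction in both operand orders is what catches a swapped dst/src field, because the two encodings differ in exactly one byte.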

@begga9682 11 months ago

20:30 Couldn't you just add 1 to ARG1 in the fast path instead of incrementing GPR0, then OR'ing the tag back into it and mov'ing it?

@awesomekling 11 months ago

Yup! We should totally do that :)

@solcloud 11 months ago

Awesome video and yak roller coaster ❤ Looking forward to the next episode 😍

@amr3162 11 months ago

Thanks for the videos, they're pretty informative and chill!

@awesomekling 11 months ago

Thank you amr3162! I'm glad you enjoy them

@Artentus 11 months ago

38:19 I'm not sure this is actually correct. If the payload int32 is negative, then the sign bit will be located at bit index 31 rather than 63, where the 64-bit compare instruction expects it. So even though the top bits are all the same and the comparison is correct for positive numbers, the comparison will interpret the numbers as unsigned. Since you are targeting x86 you should be able to just use a 32-bit compare instruction instead, which will truncate the contents of the registers if I'm not mistaken.

@awesomekling 11 months ago

Indeed, it’s incorrect as mentioned in the pinned comment :) Thank you for keeping an eye out, we’ll fix it next time!
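
The bug discussed in this thread can be reproduced in a few lines. This is a sketch under the same assumption as the comment: both values carry an identical tag in the upper bits and the int32 payload in the low 32 bits. The tag value and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed tag value for illustration only; what matters is that both operands share it.
constexpr uint64_t kInt32Tag = 0x7ffaULL << 48;

constexpr uint64_t box_int32(int32_t value)
{
    return kInt32Tag | static_cast<uint32_t>(value);
}

// Models a 64-bit signed compare on the full boxed values.
bool less_than_64bit(uint64_t lhs, uint64_t rhs)
{
    return static_cast<int64_t>(lhs) < static_cast<int64_t>(rhs);
}

// Models a 32-bit signed compare on the low halves only.
bool less_than_32bit(uint64_t lhs, uint64_t rhs)
{
    return static_cast<int32_t>(static_cast<uint32_t>(lhs))
        < static_cast<int32_t>(static_cast<uint32_t>(rhs));
}

void check_less_than()
{
    // Both compares agree while the payloads are non-negative.
    assert(less_than_64bit(box_int32(1), box_int32(2)));
    assert(less_than_32bit(box_int32(1), box_int32(2)));
    // -1 has payload 0xffffffff, which the 64-bit compare sees as larger than 0.
    assert(!less_than_64bit(box_int32(-1), box_int32(0)));
    // The 32-bit compare reads bit 31 as the sign bit and gets it right.
    assert(less_than_32bit(box_int32(-1), box_int32(0)));
}
```

With equal upper bits, the 64-bit compare effectively orders the payloads as unsigned numbers, which is why only mixed-sign or negative operands expose the problem.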

@xbzq 11 months ago

The jumps don't support linking before jumping, such as in a backward jump. Should be easy to implement though. Alternatively, it should error if you link before jumping.

@Dayanto 11 months ago

Backwards jumps don't require linking since the address is already known when emitting the jump instruction. It might feel a bit weird though since the current API semantics tend to treat labels as placeholders with a bogus address until you actually link.
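
One way to support both directions is to let a label remember whether it has been bound yet. The sketch below is a hypothetical mini-assembler, not the API used in the video: `Label`, `Assembler`, `bind` and `displacement_at` are names invented for illustration. It emits `jmp rel32` (opcode 0xE9), whose displacement is relative to the end of the jump instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

struct Label {
    std::optional<size_t> target;      // set once the label is bound
    std::vector<size_t> pending_jumps; // end offsets of jumps waiting for the target
};

struct Assembler {
    std::vector<uint8_t> code;

    void nop() { code.push_back(0x90); }

    void jump(Label& label)
    {
        code.push_back(0xE9);
        code.resize(code.size() + 4); // zero-filled placeholder for rel32
        size_t end_of_jump = code.size();
        if (label.target)
            patch(end_of_jump, *label.target);          // backward jump: target already known
        else
            label.pending_jumps.push_back(end_of_jump); // forward jump: resolved in bind()
    }

    void bind(Label& label)
    {
        assert(!label.target && "label bound twice");
        label.target = code.size();
        for (size_t end_of_jump : label.pending_jumps)
            patch(end_of_jump, *label.target);
        label.pending_jumps.clear();
    }

    // Reads back the rel32 of the jump that ends at the given offset.
    int32_t displacement_at(size_t end_of_jump) const
    {
        int32_t rel = 0;
        std::memcpy(&rel, &code[end_of_jump - 4], sizeof(rel));
        return rel;
    }

private:
    void patch(size_t end_of_jump, size_t target)
    {
        auto rel = static_cast<int32_t>(static_cast<int64_t>(target) - static_cast<int64_t>(end_of_jump));
        std::memcpy(&code[end_of_jump - 4], &rel, sizeof(rel)); // host byte order; little-endian on x86
    }
};
```

For example, binding a label, emitting one `nop` and then jumping to the label produces a displacement of -6 (one byte of `nop` plus the five-byte jump), with no separate link step needed.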

@perotubinger 11 months ago

I really enjoy these videos. The approach to the JIT is surprisingly simple and straightforward. I would have thought you'd bring out the mighty LLVM hammer, but that wouldn't fit the Serenity motto of doing everything yourself (which I like). Given that types can change in a JavaScript program or function but seldom do, would compiling the whole function based on the types captured by the first bytecode run(s) result in both faster code and easier codegen? Instead of testing for i32 in each compile_* function, you would have compile_less_than_i32 etc. At the beginning of each function execution you'd have to check whether all types are still the same, and otherwise recompile or fall back to the interpreter.

@awesomekling 11 months ago

We'll start with a simple JIT and see how much performance that buys us. Keeping track of which types are being used would indeed enable more sophisticated optimizations, but that's a huge step up in compiler complexity. We'll see, we'll see.. :)

@perotubinger 11 months ago

@@awesomekling I really like that mindset. About keeping track of types: You implicitly do this already in the code generation. But I tend to imagine things way simpler than they actually are. :-) Looking forward to the next episodes.

@StopBuggingMeGoogleIHateYou 11 months ago

FYI, you did the right thing by doing `add 1` instead of `inc`. The latter can cause pipeline stalls because it doesn't set the carry flag, hence the pipeline will stall while merging its prior value for subsequent instructions that use `cf`. `add 1` doesn't have that problem because it sets all of the common status flags.

@ttrss 11 months ago

where would you want to use inc?

@miko007 11 months ago

@@ttrss for the "increment" bytecode instruction, obviously

@kreuner11 11 months ago

that would be cool if at this point he cared about the speed of the machine code, the other parts of the assembled code are very slow by assembly optimization standards

@ttrss 11 months ago

@@miko007 no, not *obviously*, at least not to me, why would you not also use x86 `add reg, 1` to JIT an increment bytecode instruction? my question is if `inc` has some inefficiencies, then where, if anywhere, would you want to use it?

@Bobbias 11 months ago

@@ttrss While I can't speak for where you might want to use the inc instruction, I'd like to point out that there are many instructions in the x86-64 ISA that are simply outdated and more or less only exist for backwards compatibility.

@misisumegi6138 11 months ago

Hello Andreas, I have an idea that I think would simplify writing this JIT. I'm not sure how it could be implemented, but I feel like some way for the JIT to inline the cxx helper function would be useful. It would significantly reduce the amount of assembly you need to write by hand and would make more complicated fast paths easier to implement.

@StopBuggingMeGoogleIHateYou 11 months ago

... but who is going to write that? By definition, it's the job of a compiler to turn high-level languages into machine code. And the C++ compiler he's using to build the JIT won't do it, because the machine code is being built on the fly and copied into executable memory. So it's Andreas' job to write that. That's what he's doing in this series.

@misisumegi6138 11 months ago

@@StopBuggingMeGoogleIHateYou The idea is to take the already generated code and copy it into the JIT. For example, this LessThan optimization could be written in the cxx helper function, and then the JIT copies that function from the executable into the JIT code and replaces the rets with jumps to the end of it. The hard part is finding the length of the functions generated by the C++ compiler, and I'm not sure if just replacing ret with a jump is enough.

@StopBuggingMeGoogleIHateYou 11 months ago

@@misisumegi6138 here's a good question for you to ponder. The code that the C++ compiler generates is in assembly. The code that the JIT generates is in assembly, although it is much less optimized and overall a lot slower than the compiled C++ code. Why do we get any speedup at all from the JIT? Where is the speedup coming from? When you understand the answers to these questions, you will realize that the thing you are thinking about does not need to be built.

@connorwood95 11 months ago

I may have misunderstood the JIT architecture here, please do correct me if so. AIUI, it outputs one function for every bytecode operation, and those then get chained together at some higher level. Am I right in that being how it works? If so, wouldn't it be more efficient to output a big old string of machine code, one machine code function per JIT function context? That way, each operation will fall straight through to the next, eliminating all the overhead, and instead of a high level loop going "CALL function_for_load_imm; CALL function_for_less_than; CALL function_for_increment;" etc, you'd have "CALL function_top_level", similar to the output a C++ compiler would produce (for example). For longer running things, these overheads will stack up to be a substantial amount of time

@awesomekling 11 months ago

What you are suggesting is actually exactly how it already works! :)

@gianni50725 11 months ago

Yay! More hacking to start the day with

@spaculo 11 months ago

Perhaps you could write the JITed instructions to a file instead of STDOUT? It could be toggled with something like LIBJS_JIT_DUMP and write to different files depending on execution (function) name?

@kaos092 11 months ago

What tool is this that gives you a graph of where every part of the program spends its time?

@Akronymus_ 11 months ago

Is shift_right a sign-preserving shift?
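
The thread does not say which kind of shift `shift_right` emits, so the sketch below only illustrates the distinction being asked about. The function names are hypothetical; on x86 the two behaviours correspond to the `shr` (logical) and `sar` (arithmetic) instructions.

```cpp
#include <cassert>
#include <cstdint>

// Logical shift right: vacated high bits are filled with zeros.
inline uint64_t logical_shift_right(uint64_t value, unsigned count)
{
    return value >> count;
}

// Arithmetic shift right: vacated high bits are copies of the sign bit.
// C++20 guarantees this for signed operands; earlier standards leave it
// implementation-defined, though mainstream compilers behave this way.
inline int64_t arithmetic_shift_right(int64_t value, unsigned count)
{
    return value >> count;
}

inline void check_shifts()
{
    // Extracting a 16-bit tag from the top of a 64-bit value wants a logical shift.
    assert(logical_shift_right(0xffff000000000000ULL, 48) == 0xffffULL);
    // The same bit pattern as -65536, shifted logically, loses its sign.
    assert(logical_shift_right(static_cast<uint64_t>(int64_t { -65536 }), 16) == 0x0000ffffffffffffULL);
    // Shifted arithmetically, the sign is preserved.
    assert(arithmetic_shift_right(-65536, 16) == -1);
}
```

For pulling a tag out of the upper bits and comparing it against a constant, the logical form is the one whose result does not depend on the top bit of the tag.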

@mrmaymanalt 11 months ago

Wow! This JIT is getting impressive. Can't wait for the day when it's fully functional on websites! Btw, I have a question, since I'm not the best at reading C++. Currently, does it run the jitted code only if all the functions were jitted, or can it run successfully jitted functions alongside bytecode functions where the JIT failed?

@StopBuggingMeGoogleIHateYou 11 months ago

It's per-function. Functions for which JIT failed are executed via the bytecode interpreter.

@matthias916 11 months ago

you've probably been asked this before, but, do you plan on, and if so, when will you switch to serenity os and use it to further develop serenity itself?

@mabdinur85 11 months ago

What would an optimization like multiplication or division by powers of 2 look like in the fast path, for the specific cases where the multiplier or divisor is not odd and is a power of 2? The optimization is a left shift (multiplication) or right shift (division) in assembly, and would probably need a few more operations for signed numbers.

I'm not sure it's much of a speedup, though. Checking odd/even is fast, but checking whether a multiplier or divisor is a power of 2 might be just as compute-intensive as the multiplication or division you are trying to optimize if done the brute-force way. It could be fast with a lookup table of multipliers or divisors, or faster yet with constants defined in a header file that lists 2^1...2^N, with N=32 or 64 depending on int size, so at most 64 constants.

Lastly, a multiplier or divisor of 1 probably needs a fast path too, but the bytecode interpreter is probably the better place to handle that. LOL, I'm not even sure multiplying or dividing by 1 needs a fast path at all. It's worth benchmarking how the current setup handles it, because doing no operation and just returning the multiplicand or dividend is the fastest option, so what is happening right now? It also might not be worth optimizing, because checking whether the multiplier, multiplicand or divisor is 1 slows down every optimization path that comes after the check, so it punishes other paths to reward a bad code choice, as it were.

It would be insightful to see how the JIT would handle branching between multiple optimization strategies for a specific operator: the fast path for the general case, the fast path for the specific case, the case that is better handled by the bytecode interpreter, and so on, depending on the optimization strategy.

Thanks for the series, Andreas; it's quite a fun watch to see the birth of a new JIT compiler.

@StopBuggingMeGoogleIHateYou 11 months ago

I'm not sure that kind of optimization makes sense in a JIT compiler. It makes a lot of sense when you know the numeric value ahead of time, in which case you can just generate a more specialized sequence, so it's free at runtime. But if you have to emit branches that execute at runtime to check those kinds of things, it's quickly going to overwhelm any savings you might get from the actual operation. Even one branch is probably too many; more than one, forget it.
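
On the cost of the power-of-two check itself: it does not need a lookup table, since a nonzero value is a power of two exactly when it has a single set bit. The sketch below is plain C++ for unsigned 32-bit values with hypothetical function names; it says nothing about how a JIT would emit this, and signed division would need extra rounding fix-ups that are not shown.

```cpp
#include <cassert>
#include <cstdint>

// Clearing the lowest set bit (n & (n - 1)) leaves zero only for a single-bit value.
inline bool is_power_of_two(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// Position of the single set bit, found with a simple loop.
inline unsigned log2_of_power_of_two(uint32_t n)
{
    assert(is_power_of_two(n));
    unsigned shift = 0;
    while ((n >> shift) != 1)
        ++shift;
    return shift;
}

// Unsigned multiply and divide by a power of two, expressed as shifts.
inline uint32_t multiply_by_power_of_two(uint32_t value, uint32_t multiplier)
{
    return value << log2_of_power_of_two(multiplier);
}

inline uint32_t divide_by_power_of_two(uint32_t value, uint32_t divisor)
{
    return value >> log2_of_power_of_two(divisor);
}
```

As the reply above points out, this pays off mainly when the operand is a constant known at compile time, so the check runs once in the compiler rather than as a branch on every execution.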

@sp000ph 11 months ago

IIRC currently the assembler emits a lot of redundant movs between GPR0 and the accumulator. Would it be possible to just cache the result in rax when storing to the accumulator and only spill it when necessary or when leaving the function?

@awesomekling 11 months ago

Sure, we could probably do something like that. It would definitely add a bit of complexity to the compiler though, so we’d wanna make sure it’s worth it.

@GAoctavio 11 months ago

Does this optimization not cover i = i + 1, i+=1 ? Or even i+=N (constant) and i+=n (variable)?

@MarekKnapek 11 months ago

Well hello Andreas, I have a question about labels in the JIT. The interface is like `auto some_label = m_assembler.make_label();` and then `some_label.link(m_assembler);`. Why does the link member function need an assembler as a parameter? The label could already know the assembler when it was created: because the assembler created it, the assembler could stash a reference to itself in the label at creation time. Then using the link function would be simpler and less cluttered. Is the parameter really needed? Or is there a philosophical rather than a technical reason behind it?

@awesomekling 11 months ago

Sure, I guess we could do that :)

@-lolus- 11 months ago

Where do you get those wallpapers? They are gorgeous

@awesomekling 11 months ago

I've been making them with DALL-E. The prompt for this one was "Photo of a yak, eyes wide and full of thrill, riding a roller coaster. It's upside down in a loop, seated in a cart with its arms up, showcasing the rush of the ride. The background is filled with streaks of light indicating the yak's rapid movement."

@MrSmokedTurkey 11 months ago

What a beautiful prompt lol

@chenhonzhou 11 months ago

👍🏻good

@relakin 11 months ago

Is Jakt still being developed, or has that been abandoned?

@awesomekling 11 months ago

Last commit was 5 days ago :) github.com/SerenityOS/jakt

@aJanuary 11 months ago

I feel like it would be more ergonomic if branch_if_int32 took a “slow case” lambda, made the labels for you, and dealt with all the jumping. It can pass the slow case label to the first lambda to allow it to jump out to the slow case.

@awesomekling 11 months ago

That could be nice if Int32 is the only thing we want a fast path for. Ultimately, we'll want to do more than one fast path for many of the instructions :)

@kreuner11 11 months ago

I hope these wallpapers are not AI-generated

@awesomekling 11 months ago

Of course they are!

@CYXXYC 11 months ago

I think instead of 0x00000000ffffffff you can write 0x0ffffffff

@AntonioNoack 11 months ago

Is jump_if_less_than signed, 32 bit? You thought about it for a while, so I'd guess it is a 64-bit comparison. If it is 64-bit, comparisons between positive and negative numbers won't work correctly: 0xPrefixBits_ffff_ffff < 0xPrefixBits_0000_0000 (-1 < 0) returns false on signed 64-bit numbers, when the result should be true.

@awesomekling 11 months ago

Indeed, I made a mistake there! As mentioned in the pinned comment, it’s something we’ll have to fix next time :)