AcheronVM: 16-bit code on the 6502, taken too far | VCFMW 2019

Views: 16,838

Comments: 59

@beopstek 5 years ago

This is really good, I recognise all the issues you address. It's tempting to use AcheronVM as inspiration for an improved 16-bit virtual CPU in the 8-bit Gigatron TTL computer.

@timsmith2525 5 years ago

15:41 "I'm going to hobby the crap out of this thing." Love it! Fascinating work. Thanks for sharing.

@arnolddalby5552 5 years ago

What you are doing is so important to us assembly language programmers because you're pushing the code outside of the original box. You're amazing. Thank you for your talk.

@jimreynolds2399 4 years ago

Every now and again you come across a really interesting video on KZbin. This is one - this guy has really come up with a very clever VM. This guy should be working for RISC-V or some other CPU manufacturer because this is a really unique thing that he has produced. Even though it's for a very old CPU design, he has retro-fitted it to 16-bit operation. Great work.

@JohnDlugosz 5 years ago

The first thing that comes to mind for me is UCSD p-system. I used it on the Apple II, but it was ported to a large variety of hardware before the rise of microcomputers.

@edminchau811 3 years ago

I keep coming back to this video. There's a lot of efficient coding ideas in here, several of which are brilliant. I'm using some of these ideas for other code I'm writing.

@PeterCCamilleri 5 years ago

This talk was inspiring. I have been studying virtual machines for years and this was by far the most advanced. I am involved in the Commander X16 project, which uses the newer W65C02S processor. It adds some new instructions and addressing modes and a whole heap of speed. I can see at least a few places where it would help. Thank you so much for an excellent talk!

@sietuuba 5 years ago

I was totally thinking of this when I caught up with the status of the X16 (just today). So nice that someone in the project is already aware of this then!

@edminchau811 4 years ago

I was so proud of the 16 bit functions I included in my editor that I just published for the CX16, and then I saw this guy and now, now I'm thinking in a whole new paradigm. Geez, we haven't even scratched the surface of what this processor can do.

@edminchau811 4 years ago

Now that I've dug into his code a bit, I can see that it will work a whole lot better on the 65c02 than on the 6502. The 65c02 has PHX, PHY, PLX, and PLY commands, which simplifies a lot of what he's trying to do.

@snorman1911 3 years ago

Wish you guys would go with the 65816 :)

@PeterCCamilleri 3 years ago

Reading over my old post I must confess that I misspoke. My involvement with the CX16 is limited to being a fan and working on some independent software projects. I hope my error has not caused any confusion.

@jessieoberreuter4755 5 years ago

Very nice!!! I've been working on a multi-platform VM that includes the C64 and considering many of the same issues. Many thanks for sharing your approach! I loved your comment about having the interrupt routine patch the dispatch :). I have the same issue (not wanting context switches to occur mid virtual instruction) and, at present, backwards jumps and slow operations return to a point (slightly before the dispatch) which checks a zp byte to see if a task switch is pending. Looking at my code, it would be just as easy to patch the 'jmp' with an absolute 'bit' and fall-through to the task switch! *mwa*!
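
The idea in the comment above, honouring a task switch only at the dispatch point so it never lands in the middle of a virtual instruction, can be sketched in a few lines. This is a toy model, not code from AcheronVM or from the commenter's VM: the class `ToyVM`, its opcodes, and the `interrupt_at` parameter are all hypothetical names chosen for illustration, and a plain flag check stands in for patching the `jmp`.

```python
# Hypothetical toy VM; names and opcodes are illustrative assumptions only.
class ToyVM:
    def __init__(self, program):
        self.program = program       # list of (opcode, operand) tuples
        self.pc = 0                  # virtual program counter
        self.acc = 0                 # single accumulator
        self.switch_pending = False  # set asynchronously, e.g. by an interrupt
        self.switches = []           # pcs at which a task switch was taken

    def request_switch(self):
        """Called from 'interrupt' context: only sets a flag, never switches."""
        self.switch_pending = True

    def step(self):
        """Execute one whole virtual instruction; return False on halt."""
        op, arg = self.program[self.pc]
        self.pc += 1
        if op == "add":
            self.acc += arg
        elif op == "jmp":
            self.pc = arg
        elif op == "halt":
            return False
        return True

    def run(self, interrupt_at=None):
        steps = 0
        running = True
        while running:
            # Dispatch point: the only place a pending switch is honoured.
            if self.switch_pending:
                self.switch_pending = False
                self.switches.append(self.pc)
            # Simulate an interrupt arriving after the check, i.e. while the
            # instruction below is "in flight".
            if steps == interrupt_at:
                self.request_switch()
            running = self.step()
            steps += 1
        return self.acc


vm = ToyVM([("add", 1), ("add", 2), ("add", 3), ("halt", 0)])
total = vm.run(interrupt_at=1)
```

In this sketch the interrupt arrives while the second `add` is executing, yet the switch is recorded only once that instruction has completed and the loop is back at the dispatch point.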

@HummelJaeger 5 years ago

For the past few years I have been working on a similar 6502 hobby project which uses the same design philosophy and near-identical optimising techniques as you've presented here, such as those for the software stack operations and dispatching. One useful consideration that may save even more RAM is that a software stack arranged for inline operations allows a client application to have structs cleanly and expressively defined in the assembler source code. Such a piece of data could have a list of inline operator return addresses and a list of their arguments (or a reference to the argument list, or even a set of argument lists accompanied by an index to whichever is to be used). The args are pushed to the software stack and the return addresses to the CPU stack (either at the current position or, for recurring use, stored in a safe spot with the Stack Pointer altered). An RTS starts the nested operations.

@3vi1J 5 years ago

One of the most amazing things about this video, besides being insanely interesting to a small cross-section of us, is that it has zero downvotes after 1.5 months. I follow dozens of geeky channels and had assumed there were 17-22 bots that downvote every tech video immediately until now.

@mmille10 4 years ago

My hat's off to David Holz. Good job. For a long time, I've been thinking about, "What is architecture?", in terms of thinking about a VM. I've gotten some hints by looking at a description of an old, obscure machine, and this gave me some more ideas. This is a simple machine model, well-designed to serve programmer needs, that seems to do the job.

@ian_b 3 years ago

Haven't finished watching yet but I have to say rP is a stroke of freakin' genius.

@mmille10 4 years ago

I noticed the process David used for creating this sounds similar to a technique I've heard about for developing a new programming language (I believe Guy Steele talked about this, but I've also seen it discussed in SICP (Abelson and Sussman)). It's the idea of engaging in "wishful thinking," to work on the high levels of the language, looking at what elements are going to be in it, how they're going to look, and how they're going to interact, and then working on the underlying implementation, from the bottom up, eventually finding an intersection between the two. It was gratifying to see that this was applied to developing this framework.

@hicknopunk 2 years ago

The 6502 is my favorite CPU ❤❤

@daschewie 5 years ago

Now we just need a C compiler that can target this VM.

@quincy1048 5 years ago

I thought Sweet16 was good when I saw it versus raw 6502 code... but this is much better... obviously the result of much time and analysis. Now we just need compiler writers for 6502 platforms to embrace this. This kind of reminds me of Atari's Action! compiler, where speed was achieved by embracing what the architecture did well and leaving other things to someone else. I wonder if, in looking into this, you looked at the Action! compiler at all. Good work was done there, but I bet a rework with this at the back end could do better.

@GerardWassink 5 years ago

Wow, I say again: WOW!!! Respect dude!

@francisgeorge7639 5 years ago

My goodness, this takes the Nerdaprise to insanely go where no nerd has gone before. Warp level Nerd, engage!!!!

@jensdroessler3575 5 years ago

Come on! You can get at least 58% more nerd into that statement!

@Ziplock9000 3 years ago

No demos of this running the things you designed it to actually power, like editors, GUIs etc??

@PeterCCamilleri 5 years ago

Amazing video! Amazing design! Any thoughts on the tool set?

@anjinmiura6708 3 years ago

So the 6502 is a slower, 8-bit RISC processor from before there were RISC processors. And at the core of modern processors are RISC processors running microcode to provide the x86/x64 CISC instruction set we see and use. So this guy basically wrote microcode to enable CISC code. I'm only watching this in a casual way and may re-watch it, but as this appears to be C64, I'm hoping it will work with REUs which, in some cases, support a LOT of RAM.

@JanuszKrysztofiak 1 year ago

No, the 6502 is not RISC or even proto-RISC. RISCs rely on a load/store approach and a large register file. The 6502 is nothing like this. It has only 3 registers you can use to hold values/do operations on (A, X and Y), only A being general purpose, whereas X and Y are mostly for indexing. Moreover, you can do some operations directly on memory (such as increment), and it strives for a certain orthogonality in addressing modes (a CISCy feature). The misconception that the 6502 is a proto-RISC comes from the fact that it supports a zero page addressing mode, making people think of the first page as a sort of extension of the registers, but that's wrong. The zero page is no different from the rest of memory in what you can do with it, no extra features; the only thing is that the opcode/address combinations are 1 byte shorter, so it takes fewer cycles to fetch/decode them.
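
To make the "1 byte shorter, fewer cycles" point concrete, here is a small sketch that encodes the same instruction in zero page and absolute form. The opcode bytes and cycle counts are the standard NMOS 6502 values for `LDA` and `INC`; the helper `assemble` and its `force_absolute` parameter are hypothetical names for this illustration, not part of any real assembler.

```python
# (mnemonic, addressing mode) -> (opcode byte, length in bytes, cycles)
ENCODINGS = {
    ("LDA", "zeropage"): (0xA5, 2, 3),
    ("LDA", "absolute"): (0xAD, 3, 4),
    ("INC", "zeropage"): (0xE6, 2, 5),
    ("INC", "absolute"): (0xEE, 3, 6),
}


def assemble(mnemonic, address, force_absolute=False):
    """Encode one instruction, preferring zero page when the address fits in a byte."""
    use_zp = address < 0x100 and not force_absolute
    mode = "zeropage" if use_zp else "absolute"
    opcode, length, cycles = ENCODINGS[(mnemonic, mode)]
    if use_zp:
        operand = bytes([address])
    else:
        operand = bytes([address & 0xFF, address >> 8])  # little-endian address
    encoded = bytes([opcode]) + operand
    assert len(encoded) == length
    return encoded, cycles


zp_bytes, zp_cycles = assemble("LDA", 0x10)
abs_bytes, abs_cycles = assemble("LDA", 0x10, force_absolute=True)
```

Both forms read the same memory location; the zero page form simply drops the high address byte and, with it, one fetch cycle.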

@Codeaholic1 5 years ago

Is it just me or does the "sliding register window" sound just like a call stack?

@tim1724 5 years ago

It's like having a separate stack for arguments, leaving the actual stack just for return addresses. The SPARC architecture used a very similar sliding register window. For an architecture with a big register file this allows you to avoid pushing/popping a lot of stuff on the stack; just leave it in registers and rename them. The SPARC architecture was a little bit different in that only a subset of the registers were part of the sliding register window.

@mmille10 4 years ago

I thought the same thing, but it's not quite that. In a call stack, the result would be pushed onto a cell in the stack, where the old call frame used to be. Here, he mutates an input register, in what could be thought of as the stack frame for the calling function. So, he skips the push (pushing the return value), and then pop (get return value) operations that would take place on a call stack.
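
A rough sketch of the distinction drawn above, where the callee writes its result straight into a register belonging to the caller instead of pushing and popping a return value. This is a toy model under assumptions, not AcheronVM's actual register file or instruction set: `RegisterWindow`, `grow`, `shrink`, and `double_in_place` are hypothetical names, and the window is assumed to grow downward through a flat array.

```python
class RegisterWindow:
    """Toy sliding window over a flat register file (illustrative only)."""

    def __init__(self, size=16):
        self.regs = [0] * size
        self.base = size  # window grows downward, like a stack

    def grow(self, n):
        if self.base - n < 0:
            raise OverflowError("register stack overflow")
        self.base -= n

    def shrink(self, n):
        self.base += n

    def __getitem__(self, i):
        return self.regs[self.base + i]

    def __setitem__(self, i, value):
        self.regs[self.base + i] = value


def double_in_place(win):
    # The callee reserves one local register; the caller's r0 now appears as r1.
    win.grow(1)
    win[0] = win[1]           # local copy of the argument
    win[1] = win[0] + win[0]  # result written directly into the caller's register
    win.shrink(1)             # caller's numbering is restored


win = RegisterWindow()
win.grow(2)  # caller's frame: r0 and r1
win[0] = 21
double_in_place(win)
result = win[0]
```

After the call the caller reads the result from its own `r0`; nothing was pushed or popped to carry the return value across the call.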

@FindecanorNotGmail 3 years ago

The 6502 only really has push/pop and call/return instructions that work on the actual stack but many ALU instructions that read/write to zero-page indexed by X. So, for emulating registers _with_ _the_ _6502_ this is more efficient.

@BruceHoult 4 years ago

Nice. At the start I was worried you've done exactly the same as me, but it turns out we only share some ideas. I'm not immediately sure which is better! It would need some detailed benchmarks. If your dispatch is half of FORTH then I guess you're around 20 cycles .. but plus 11 for "with". My dispatch is 12 cycles (exactly a JSR/RTS pair) and my "with" is 2 cycles. My instructions are all exactly 3 bytes but can be preceded by one or two 2 byte / 2 cycle WITHs. You're using a memory to accumulator architecture while I'm doing effectively 2-address e.g. dst += src where dst and src are arbitrary 16 or 32 bit registers in zero page. I have individual "WITH"s for both src and dst, either or both (or neither) of which can persist from instruction to instruction. Your GROWR is interesting. I use a (modern) standard RISC-style register file in zero page with argument/return registers, temporary registers, and callee-save registers. Leaf functions don't have to do any copying. Functions that call other functions use callee-save registers to hold variables that need to persist over the function call and the function prologue & epilog save and restore those registers to a large stack in high memory -- just as with RISC-V or ARM.

@BruceHoult 4 years ago

Having now read most of your code I have more comments. The bodies of most of our opcodes are basically identical, with the following exceptions: you typically start with get_ra (4 bytes, 9 cycles) or get_ra_y (7 bytes, 14 cycles) in addition to your basic dispatch time. You end with typically "jmp mainloop1" (3 bytes, 3 cycles). I load X and Y using normal 6502 LDX #, LDY # if they are not already set from previous operations (2 bytes, 2 cycles each), then JSR the opcode (3 bytes, 6 cycles). The opcode gets right to work, and ends with RTS (1 byte 6 cycles). Our code operating on the actual registers is identical, right down to using X for dst and Y for src to minimize the use of absolute,Y addressing. In short, you've optimized a little more for size, and I've optimized a little more for speed. For "add" without a new WITH you have 14 cycles dispatch plus 14 cycles get_ra_y plus 3 cycles jmp mainLoopRestoreY (I believe jmp mainloop1 is a bug) plus 6 cycles restoreY plus 10 cycles getting the next opcode into A (assuming no Y overflow) -- that's a total of 47 cycles overhead on a 22 cycle operation compared to inlining it. In equivalent circumstances (X already loaded appropriately) I have 2 cycles LDY #src, 6 cycles JSR, 6 cycles RTS for a total of 14 cycles overhead. However, I used 5 bytes for the 16 bit operation while you used 2 bytes for it. Also of course I have zero overhead switching between 8 bit 6502 code and 16 bit or 32 bit operations. And your opcode library is around 8 or 10 bytes bigger per opcode (plus 2 bytes in the dispatch table). Further notes: your carry stack is clever. I don't handle carry at all as I assume I'm a target for a language such as C which doesn't use carry. I provide both 16 and 32 bit opcodes directly. 
I'm worried you don't check for register stack overflow, but it would be easy to modify grow and shrink to detect this and copy (for example) the oldest half of the regstack to somewhere else and copy the new half over the top of the old half. This should happen very infrequently.
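
The overhead totals quoted in the comment above can be re-added mechanically. The sketch below uses only the cycle figures stated in that comment, not independent measurements; the dictionary names are hypothetical.

```python
# Cycle figures exactly as quoted in the comment, not independently measured.
acheron_overhead = {
    "dispatch": 14,
    "get_ra_y": 14,
    "jmp mainLoopRestoreY": 3,
    "restoreY": 6,
    "fetch next opcode into A": 10,
}
jsr_overhead = {
    "LDY #src": 2,
    "JSR": 6,
    "RTS": 6,
}
operation_cycles = 22  # the 16-bit add itself, as if inlined

acheron_total = sum(acheron_overhead.values())
jsr_total = sum(jsr_overhead.values())

print("interpreted overhead:", acheron_total)  # 47
print("JSR-based overhead:", jsr_total)        # 14
print("total cycles per add:",
      acheron_total + operation_cycles,        # 69
      "vs",
      jsr_total + operation_cycles)            # 36
```

On these figures the interpreted add costs 69 cycles against 36 for the JSR-based form, while the comment puts the code size at 2 bytes against 5, which is the size-versus-speed trade-off it describes.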

@BruceHoult 4 years ago

A post I made in 2019 explaining my system: www.eevblog.com/forum/microcontrollers/father-of-6502-microcontroller-passes-away/msg2845228/#msg2845228 Once again, yours will definitely result in more compact code.

@edminchau811 3 years ago

@BruceHoult I've been basing a virtual machine on Acheron for a different application and will definitely take a look at your system too. I'm using 8 bits to represent a fixed point range from -1 to 1 instead of 16 bit code here, and every speed shortcut I can find helps.
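
The comment does not say which 8-bit layout is used, so the sketch below assumes one common choice: a signed byte with 7 fraction bits, covering -1 up to 127/128. The names `to_fixed`, `to_float`, and `fixed_mul` are hypothetical and only illustrate how such values multiply.

```python
FRACTION_BITS = 7           # assumed layout: 1 sign bit, 7 fraction bits
SCALE = 1 << FRACTION_BITS  # 128


def to_fixed(x):
    """Convert a float to a signed 8-bit fixed-point value, saturating at the ends."""
    return max(-128, min(127, round(x * SCALE)))


def to_float(q):
    """Convert a signed 8-bit fixed-point value back to a float."""
    return q / SCALE


def fixed_mul(a, b):
    """Multiply two fixed-point values; the wide product is shifted back down."""
    product = (a * b) >> FRACTION_BITS   # arithmetic shift, rounds toward -infinity
    return max(-128, min(127, product))  # -1 * -1 saturates just below +1


half = to_fixed(0.5)        # 64
quarter = to_fixed(-0.25)   # -32
product = fixed_mul(half, quarter)
```

The one awkward case in this layout is -1 times -1, whose true result of +1 is not representable, which is why the sketch saturates instead of wrapping.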

@oisnowy5368 5 years ago

Comparing apples to apples? I was expecting an apples to commodores comparison. :P

@JesusisJesus 5 years ago

oiSnowy I was expecting a “Pick 2” joke at “Fast, Small or Powerful “

@xlar54 5 years ago

Ima "hobby the crap out of this thing" too! Excellent. Suggestion: start a facebook group so we can talk with you about how to use it effectively

@NuntiusLegis 2 years ago

People speaking in a derogatory way about C64 BASIC usually forget that it's not CBM BASIC 2.0 but 4.0 including the bug fixes, stripped to the command set of 2.0; is the fastest of all CBM 8-bit BASICs because of being that streamlined; eats less address space because of this compared to more bloated BASICs; has bit-wise logical operators (unlike other BASICs such as Apple II ROM BASIC, which require machine language to manipulate single bits); has one of the best 8-bit BASIC compilers (BASIC BOSS, quite often speeding it up 100 times or more); uses a splendid screen editor in ROM instead of a line editor as on many 8-bit systems, which can be used for programmed direct mode to write self-modifying code easily, remove large chunks of program lines that would be a chore to do manually, etc. But you don't have commands for silly circles or meager beeps, so I guess it's still "crappy".

@JanuszKrysztofiak 1 year ago

Without commands for 'silly circles', you end up peeking and poking like crazy if you go beyond a simple text app.

@NuntiusLegis 1 year ago

@JanuszKrysztofiak No, the C64 has great graphical characters built in, the characters can be modified, and there are hardware sprites - a combination of these features is way more powerful than silly circles.

@JanuszKrysztofiak 1 year ago

@NuntiusLegis I know what it can do, I owned one. I not only played games but did some BASIC and assembly programming on it. You don't understand me. By 'silly circles' I meant any support for graphics routines in BASIC. To do circles or hardware sprites, you need to PEEK, POKE and DATA in the C64's built-in BASIC. It has ZERO support for it on its own. In the end it is often easier to do that in assembly (and that will run much faster, obviously).

@NuntiusLegis 1 year ago

@JanuszKrysztofiak You can load in sprite, character, or screen data directly, without poking. There is a plethora of nice tools to make sprites or custom graphics characters. I wrote a little editor in BASIC to create and save screens made of graphic characters, which later can easily be loaded in by other programs.

@circuitsandcigars1278 5 years ago

Better than a C65

@marcbotnope1728 5 years ago

I found the NEEEEEEEEEEEEEEEEEEEEEEEEEEERRRD part of KZbin... i feel oddly at home

@JesusisJesus 5 years ago

Admit it when you lost it after LOAD “PRESENTATION”,8,1 YEAH me....

@CraigOverend 5 years ago

Source: github.com/AcheronVM/acheronvm

@SyntheToonz 5 years ago

Apple sweet 16?

@brucemcfarling7810 5 years ago

Specifically Steve Wozniak's Sweet16, embedded in the ROM of some Apple II computers. cf. www.6502.org/source/interpreters/sweet16.htm

@jensdroessler3575 5 years ago

Oh god, please get a better sound guy. This is terrible work!

@JimLeonard 5 years ago

What do you find wrong with the audio, and how would you improve it?

@jensdroessler3575 5 years ago

@JimLeonard There is ringing and there is slight feedback. There are also echoes created by the sound from the PA being picked up again by the microphone. I have worked on many events of that or a similar type as a sound technician. I've seen colleagues losing their jobs for such a performance (well, not on the first occurrence). It seems they are using a lapel microphone, which is pretty much wrong as soon as the PA is in the same room and not that far away. The speaker should get a handheld microphone (only if they know how to handle it) or a headset microphone with a cardioid or hyper-cardioid polar pattern.

@JimLeonard 5 years ago

@jensdroessler3575 It is a volunteer-run show that charges no admission fee for the 1200+ people that show up every year. The speakers don't always want to wear head-mounted mics. The talks are held in a room with fixed positions for the talent and the speakers. In future events, I'll suggest running the mics less hot.

@JesusisJesus 5 years ago

As an audio engineer, I have to say... Set the faders at the top. Turn the gain to where it rings. Knock it back a notch. Set fader to ZERO and it’ll never ring. That mic is set. Don’t fk with it.

@JimLeonard 5 years ago

@JesusisJesus Need clarification. By faders, you mean on the board for the input for the mic? By gain, you mean master output fader on the board? And why would any fader be zero?