Surprised no one has made an UTAUloid of the Commodore 64's speech synthesis
@cazb73 1 month ago
This demo, if presented in the year of the C64's release, would have stalled the progress of home computers by a decade. Why 16 bits, if 8 bits are capable of this! (Ideally combined with late Tim Follin music)
@janderogee 3 months ago
Wow, this is awesome! Amazing work! What also amazes me is that this channel, with so much groundbreaking stuff, has only 550 subscribers. Please keep up the good work, it is very inspiring. Is there a place where more info is found on how you all did this? Can't wait to see this used in games and such, I know there are limitations in how this can be used, but the ability to get such high quality sound at such a low bandwidth using a stock C64 is mind boggling!
@AppliedCryogenics 3 months ago
Sounds good! Seems like maybe you did spectrum analysis of the source audio, detecting peaks vs time, then additive synthesis of those peaks using SID voices?
@thealgorithm 3 months ago
Not quite the case. Each layer parameter has its own phase and custom waveform, and there is analysis of pre-post frequencies with an overall frequency sensitivity curve where the specific layer parameters are chosen. The entire thing is also digitally decoded (no SID oscillators used). However, my past "frodigi" series did indeed do it the way you mentioned (and the quality was pretty subpar for many other reasons too, such as no phase data and only a sine waveform).
@acied6200 3 months ago
Question: in the end, does it still play it as a digi over the volume register at d418?
@thealgorithm 3 months ago
Yes, all the decoding is done digitally and the output is sent to d418 (custom setup) to allow roughly 6 bits of resolution.
@acied6200 3 months ago
I have been experimenting with the other technique: playing digis over 1 voice. The quality is way superior to d418 toggling. It's 8-bit out, but since it's controlled by the frequency, you can even play up to 16-bit digis. The only drawback is that it takes more cycles per push. Currently I am on a project to use a 2nd stock C64 connected with a parallel user port cable to have 1 C64 continuously calculate a pseudo-3D road. I could imagine something with this too: let 1 C64 decode and push its values into dd01, so the other C64 can be dedicated to only picking up from dd01 and playing the digi.
@acied6200 3 months ago
@@thealgorithm Or sync 2 c64s and have stereo :-)
@acied6200 3 months ago
@@thealgorithm Had to look it up, the voice playing digi is the sid vicious code as found at codebase.
@thealgorithm 3 months ago
@@acied6200 Yes, it's currently the highest quality method of getting digi output, but it needs to be cycle exact and uses more CPU time, which is why I did not utilise it. Ideally you would need to write triangle or sawtooth, then set the test bit, then write the sample value to the high byte of the frequency register, and then set a silent waveform per sample update so that it will seek to the required value in the next update.
@bufferjoetommas 3 months ago
Somehow I feel "old radio" nostalgia. It wasn't unpleasant to hear at all when compared to the muddiness of a badly ripped MP3 back in the day XD
@dinkc64 4 months ago
Super impressive, wowsers!!!
@AftercastGames 4 months ago
Is there any multiplication going on in the C64 decoder, or is it all addition and/or bit shifting? Is there any floating point math going on?
@thealgorithm 4 months ago
There is no floating point maths or multiplications in the decoder. The core of the decoder is additive synthesis with each layer being added with different volume levels together and a phase accumulator per layer
@AftercastGames 4 months ago
@@thealgorithm How do you do different volume levels on the decoder side without multiplication or bit shifting?
@thealgorithm 4 months ago
@@AftercastGames Tables and indexing :-) 1k of tables for volume (32 levels with 32 values for each). The main core of the decoder reads an indexed channel and points to a volume table which adjusts the value, then accumulates these across the other layers. Outside of the decoder, the parameter updater self-modifies the decoder to point to the new volume tables, waveforms, pitch offsets etc. On a machine with such low CPU power, tables are the key!
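(Editor's aside: a minimal sketch of the table-driven mixing described in the reply above, written in Python rather than 6502 assembly. The 32 x 32 volume table comes from the comment; the `Layer` class, the 8.8 fixed-point phase format, and every name here are assumptions made for illustration, not the author's actual decoder.)

```python
# Illustrative sketch only: names and layouts are assumptions, not the real decoder.
LEVELS = 32   # volume levels, as stated in the comment above
VALUES = 32   # possible input sample values per level, as stated above

# VOLUME_TABLE[level][value] holds the pre-scaled value, so the per-sample
# mixing loop needs only lookups and additions (the multiply happens once,
# when the table is built).
VOLUME_TABLE = [
    [round(value * level / (LEVELS - 1)) for value in range(VALUES)]
    for level in range(LEVELS)
]


class Layer:
    def __init__(self, waveform, step, volume, phase=0):
        self.waveform = waveform  # list of values 0..31; length should divide 256
        self.step = step          # phase increment per output sample (sets the pitch)
        self.volume = volume      # 0..31, selects a row of VOLUME_TABLE
        self.phase = phase        # 16-bit phase accumulator (8.8 fixed point, assumed)


def mix_sample(layers):
    """Return one mixed output sample and advance every layer's phase."""
    total = 0
    for layer in layers:
        # Taking the high byte of the accumulator; on a 6502 this is just
        # reading a different memory location, not a shift.
        index = (layer.phase >> 8) % len(layer.waveform)
        total += VOLUME_TABLE[layer.volume][layer.waveform[index]]
        layer.phase = (layer.phase + layer.step) & 0xFFFF
    return total
```

With a 32-entry ramp as the waveform, `Layer(ramp, step=256, volume=31)` steps through the ramp one entry per output sample, while a layer at volume 0 contributes nothing to the sum.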
@acied6200 4 months ago
Very cool!
@antivanti 4 months ago
Imagine showing this to someone back in 1982. Tho I suppose even tho the C64 can do the decoding there might not have been a mainframe in the world at the time capable of doing the encoding?
@thealgorithm 4 months ago
With phase optimisations and accuracy set to low, I can encode around 1 second's worth of data in a minute (on a single core of a Zen 3 5800H). I would guess a supercomputer of that 1982 era (Cray-1) would take several months to do that.
@tsonez 4 months ago
Impressive! Somehow reminds me of the background music while being on hold on the phone.
@julienbraudel7109 4 months ago
Can't wait to hear 11 kHz...
@NickFellows 4 months ago
Imagining a He-Man game with the actual music.
@glokyful 4 months ago
Bravo!
@-Error99 4 months ago
Cool 👍
@pasromano75 4 months ago
Can you tell me the technical details here?
@thealgorithm 4 months ago
This describes the implementation in this video; there have been additions since then, but I will talk about this particular one. The decoder on the C64 has 8 channels, with each channel having 11-bit frequency resolution and 7-bit volume resolution, and each can play any one of the 8 predefined waveforms (sawtooth, pulse, triangle, noise, sine, res1, res2, trisaw), each having its own phase. The encoder does the heavy computational work, decomposing a chunk (20/40/60/80 ms) into multiple layers of waveform, frequency, amplitude and phase. The decoder mixes all these in real time to recreate the audio on a stock C64 machine.
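(Editor's aside: a rough sketch of what one channel's parameters could look like when packed, using only the bit widths quoted above: 11-bit frequency, 7-bit volume, and 3 bits to select one of 8 waveforms. The field order, the function names, and the idea that every channel is rewritten on every chunk are assumptions for illustration; the phase field is left out because its width is not stated.)

```python
# Hypothetical parameter layout for illustration; the real file format is not described here.
WAVEFORMS = ["sawtooth", "pulse", "triangle", "noise", "sine", "res1", "res2", "trisaw"]
FREQ_BITS = 11
VOLUME_BITS = 7
WAVEFORM_BITS = 3  # 8 waveforms need 3 bits
BITS_PER_CHANNEL = FREQ_BITS + VOLUME_BITS + WAVEFORM_BITS  # 21 bits, phase not included


def pack_channel(freq, volume, waveform):
    """Pack one channel's parameters into a single 21-bit integer."""
    if not 0 <= freq < (1 << FREQ_BITS):
        raise ValueError("freq must fit in 11 bits")
    if not 0 <= volume < (1 << VOLUME_BITS):
        raise ValueError("volume must fit in 7 bits")
    index = WAVEFORMS.index(waveform)  # raises ValueError for unknown names
    return (freq << (VOLUME_BITS + WAVEFORM_BITS)) | (volume << WAVEFORM_BITS) | index


def unpack_channel(word):
    """Inverse of pack_channel: return (freq, volume, waveform name)."""
    index = word & ((1 << WAVEFORM_BITS) - 1)
    volume = (word >> WAVEFORM_BITS) & ((1 << VOLUME_BITS) - 1)
    freq = word >> (VOLUME_BITS + WAVEFORM_BITS)
    return freq, volume, WAVEFORMS[index]


def raw_bytes_per_second(chunk_ms, channels=8):
    """Unpacked parameter rate if every channel is updated on every chunk."""
    updates_per_second = 1000 / chunk_ms
    return channels * BITS_PER_CHANNEL * updates_per_second / 8
```

Under those assumptions, `raw_bytes_per_second(40)` gives 525 bytes per second before any packing and before phase data is added, so longer chunks or fewer updated channels lower the rate accordingly.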
@pasromano75 4 months ago
OK, so, in brief, this is not digi music, but music recreated with the original SID waveforms, is that correct? Very interesting... I don't think this has ever been done before!
@pasromano75 4 months ago
Is the encoding phase done on a regular PC?
@thealgorithm 4 months ago
@@pasromano75 The music is recreated from simple waveforms, yes, and when these are layered together with phase, amplitude and frequency, it recreates the source audio. All waveforms are digitally mixed, hence it can be used on other hardware too.
@thealgorithm 4 months ago
@@pasromano75 Yes
@pasromano75 4 months ago
Is it some kind of digi music?
@tsonez 4 months ago
Great update and what a nice improvement! Keep up the good work. 👍
@FunkyM217 4 months ago
I know someone who'd be very interested in this one...
@parzivalwolfram7084 4 months ago
Damn, this is not bad at all. How long does encoding take on the host machine?
@thealgorithm 4 months ago
It really depends on the parameters and quality settings. The above example took around a minute per second of encode (on a Ryzen 5800H in a single thread). If I want to maximize the quality further, it can take a massive half an hour just to encode 1 second!
@fcycles 4 months ago
Very impressive… the high frequency fidelity you achieved, and as always I enjoy hearing the compressed result… and I am surprised to hear how the brass instruments sound!
@halabardsznept3897 4 months ago
❤
@JimLeonard 4 months ago
Can we hear a voice-only sample next time?
@thealgorithm 4 months ago
There is a voice sample in this video (tatu), unless it's blocked in your region and muted. But to reiterate, this is not a speech encoder (other custom speech codecs will do a lot better for speech only). The whole purpose of this codec is to compress music to a "high" quality relative to its bitrate and to play it back on low-CPU-spec devices. It would be nice to see what other alternatives there are under 400-500 bytes a second playing on 80's retro hardware (without being a garbled mess), but I have not seen any yet (the exception being the amazing EnCodec or similar, but that requires powerful PC hardware and training data, and lots of RAM and CPU for decode).
@JimLeonard 4 months ago
@@thealgorithm By "voice only" I meant normal speech, not singing, no reverb or echo effects. I know you are optimizing for music only, but hearing a speech example of only talking helps evaluate the codec. Speech-only codecs are not appropriate for music, but music-only codecs are usually also appropriate for speech. (Essentially, I'm trying to evaluate if your codec is fixed wide band, fixed narrow band, or adaptive wide band.)
@thealgorithm 4 months ago
@@JimLeonard Here, this is an old example from a few years back, but you should get a rough idea of the type of quality to expect from speech only (at under 300 bytes a sec). It is singing, albeit in a more monotonous voice. Suffice to say, this will give an example of how it will sound quality-wise even if there is just pure monotonous speech: www.dropbox.com/scl/fi/qq3lyxmq863csz5s7vfi4/awd3-diner50hz.mp4?rlkey=agipfhfo3f91onzo3hutpxccq&dl=0 Will encode a speech-only example with no singing and link it to you soon.
@thealgorithm 4 months ago
Totally monotonous speech here (it's actually under 160 bytes a second after packing, as there are gaps between the speech; if including the gaps in between, it's roughly 250 bytes per second): www.dropbox.com/scl/fi/d51bor98i6itgeubtcd3c/speechtesta.avi?rlkey=e0b20kppa2zohtaofz2le9htk&dl=0 The Tom's Diner one does sound considerably better though. By the way, the spectrogram is in real time, so you will have a rough idea of the ranges used (mainly it falls in the mid/low frequencies).
@JimLeonard 4 months ago
@@thealgorithm Thanks! It seems to adapt pretty well, all things considered. At 500 B/s it is still competitive with other low-power speech codecs.
@antivanti 4 months ago
How's the quality at higher bitrates? Also what's the fastest you can stream data from a disk on an unmodified C64 and drive? And when do you hit diminishing returns? Would be fun to see how far the quality can be pushed if one was to dedicate one entire side of a floppy to one song 😊
@thealgorithm 4 months ago
The quality at higher bitrates is considerably higher (high enough to sound nearly the same as the pre-encode source); however, the pack rate in this case is roughly 5:1. (It may be more beneficial to use other codecs such as ADPCM in this case, which would use less CPU time compared to the AWE method of decode.) The fastest IRQ loader at the moment is roughly 7.7k a second on the C64 (however, this is if you give it all the CPU for loading). As the AWE decode uses 90% of CPU time and has non-maskable interrupts for sample playback, this would roughly reduce loading speed to perhaps 600 bytes a second (which is just about enough to stream and play back larger segments of audio).
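(Editor's aside: the streaming estimate above can be checked with a few lines of arithmetic. The 7.7k loader rate, the 90% decoder load, and the 600 bytes per second estimate are from the reply; reading "7.7k" as 7700 bytes, the 500 bytes per second codec rate used in the check, and all function names are assumptions for illustration.)

```python
# Back-of-envelope check; constants marked "assumption" are not from the source.
LOADER_BYTES_PER_SEC = 7700   # "7.7k a second", reading k as 1000 (assumption)
DECODER_CPU_PERCENT = 90      # share of CPU time used by the AWE decode
ESTIMATED_STREAM_RATE = 600   # the author's rough figure, in bytes per second


def naive_loader_rate(full_rate, busy_percent):
    """Loader throughput if it scaled linearly with the CPU time left over."""
    return full_rate * (100 - busy_percent) // 100


def can_stream(stream_rate, codec_rate):
    """True if data arrives at least as fast as the decoder consumes it."""
    return stream_rate >= codec_rate
```

Linear scaling gives `naive_loader_rate(7700, 90)` = 770 bytes per second, and the quoted 600 is a little below that, which fits the mention of interrupts for sample playback taking a further cut. At an assumed codec rate of 500 bytes per second, `can_stream(600, 500)` is true, though with little headroom.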
@dajan777 4 months ago
So you can squeeze a whole 2 minutes of a song into C64 memory at this quality?
@thealgorithm 4 months ago
It would be roughly 1 and a half minutes, as there would be roughly 46k or so of spare RAM left, but of course it can be resequenced for repeating sections and in most cases should be able to fit whole songs (there was a Human League example and a La Isla Bonita example in earlier tests that were all in a single load).
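(Editor's aside: the figures in this reply imply a data rate that can be worked out directly. The 46k of spare RAM and the 1.5 minute estimate are from the reply; reading "46k" as 46 x 1024 bytes and the function names are assumptions for illustration.)

```python
# Worked arithmetic only; "k = 1024" is an assumption about the quoted figure.
SPARE_RAM_BYTES = 46 * 1024


def implied_bytes_per_second(ram_bytes, seconds):
    """Average data rate at which `seconds` of audio exactly fills `ram_bytes`."""
    return ram_bytes / seconds


def seconds_that_fit(ram_bytes, bytes_per_second):
    """Playback time that fits in RAM at a given data rate, without resequencing."""
    return ram_bytes / bytes_per_second
```

Ninety seconds in 46k works out to roughly 523 bytes per second, while a full 2 minutes would need the rate to drop to about 393 bytes per second, or would rely on repeated sections being stored once and resequenced.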
@LPChip 3 months ago
@@thealgorithm What if you add a ram expander, or go the C128 route?
@thealgorithm 3 months ago
@@LPChip Then there are more possibilities. For example, I can write data to the REU at a higher sample rate (and more layers) and/or with more accuracy, then play the decoded sample from the REU at these high sample rates without using much CPU time. For the C128, 2 MHz can be activated in the border areas, which can speed up decode and allow me to have a higher sample rate and/or more layers. However, neither of these is stock (I prefer coding on a stock machine, even though I have coded some EasyFlash cartridge music video demos before).
@nunyabiznas2696 4 months ago
So many years later, still so sick!
@kanalnamn 4 months ago
Cool. :) How long is "long time" to do the encode? Hours per second of content? More? How does it perform on pure speech?
@demoscenes 4 months ago
Incredible! This is the future of waveforms! <3
@fcycles 4 months ago
Well, this is what I was curious about in your last test: do all these samples come from one single .prg file, or is it a montage of samples?
@thealgorithm 4 months ago
All from one prg file
@JimLeonard 4 months ago
The degradation is too high IMO. This might be better for spoken word content without bits wasted on background music.
@thealgorithm 4 months ago
Dedicated speech-only codecs should perform better than this (this encoder is not specifically aimed at speech). The pre-encoded source file does not sound that different to the encoded output (and it's still packing the data at 12:1) even with the additional degradation of phase accuracy; the Freddie Mercury part, for example. The others suffer more when comparing source pre-encode to decoded output (which gets considerably better with the phase set to full resolution). These examples were also mainly at 40 ms (with transient sensitivity set to very low). Bear in mind that the sample rate is pretty low. At some point I will implement the decoder on the Amiga side (this will allow me to have higher sample rates and fewer restrictions on waveform size) and still use the same file size, increasing the compression ratio. There are still a lot of improvements to make and in the pipeline. For example, more updates and fewer layers (and VQ'd volume codebooks) should improve the overall quality (even though the volume tables will then be non-optimal).
@JimLeonard 4 months ago
@@thealgorithm I look forward to your Amiga experiments!
@dwsel 4 months ago
Sounds great and the compression rate 🤯
@zvonimirstrucelj6190 4 months ago
Any chance we'll see a final release soon?
@dwsel 4 months ago
All hail @thealgorithm !!!
@CodexPermutatio 5 months ago
Holy shit!
@julienbraudel7109 5 months ago
Great work.
@MichaelHuth 5 months ago
Impressive, will there be a publication describing the details?
@thealgorithm 5 months ago
At some point, yes. There are quite a lot of things to cover on both the decoder side and the encoder side.
@youtube-stole-handle-att3nd1 5 months ago
@@thealgorithm Can't wait to read about it. Your work keeps pushing the boundaries of 8-bit home computers, and the C64 specifically, not to mention that the results are amazing from the algorithmic viewpoint (no pun intended, at least not by me 😊)
@joecincotta5805 4 months ago
Amazing
@muchkaev 5 months ago
Does it work with voice too?
@thealgorithm 5 months ago
Sure. Look at the earlier examples on my channel, which include various pop songs. Though since then I have improved and refined the encoder by quite a bit.
@muchkaev 5 months ago
@@thealgorithm Yep, I know, and I remember that you had problems with publishing commercial songs on KZbin. Maybe next time make something with a voice from someone less known than Madonna :-)
@dwsel 4 months ago
@@thealgorithm Can we hear something with voice encoded in the new algo?
@thealgorithm 4 months ago
@@dwsel here you are (this is using phase quantisation, but expect around 15% better if this option is not used) kzbin.info/www/bejne/gpislJloateHn9k
@julienbraudel7109 5 months ago
Impressive. Does this mean you could also think of a project to play ProTracker songs, for instance, or 8-channel .MOD? Congratulations on your stunning work.
@thealgorithm 5 months ago
I have some unreleased code that is able to play back 4 channel protracker songs (but also is able to depack each sample on the fly while mixing too) - This would allow large mods to fit in ram at one time. I have an earlier version of this in one of my single file c64 demos kzbin.info/www/bejne/aWPXaKiZpb-dpbc
@julienbraudel7109 5 months ago
I definitely love your work. No .MOD player on the C64? You could do yours. Congratulations again on your work.
@pannonianbrute 5 months ago
Jawdropping!
@plasmaastronaut 5 months ago
The sound effectively switched off after the start, and you have to download it and run it through the Audacity amplifier at x50 to hear it.
@dwsel 5 months ago
I thought I understood what's happening around Frodigi times, but at this point I can't keep up behind your genius
@RudeRud7 6 months ago
👍
@JimLeonard 6 months ago
I think this has more artifacts than the previous experiments. More layers and more volume bit depth sound better in my opinion.
@thealgorithm 5 months ago
Added 5-bit resolution now (without increasing the precalced wavetable size). Quality is better, though it still needs some improvements. I have some ideas on improving it further at the same size. (Then, when happy with it, I will post some higher bitrate versions.) kzbin.info/www/bejne/qabceGpuf92Xnas
@fcycles 6 months ago
Your compression research/projects on the C64 are one of, if not the best things I see! The quality of the sound makes it very interesting to hear and to try to understand the loss of information decided by the encoder. Hearing this, I try to figure out how our/my brain encodes it, and how this information could be visualized... In fact, the higher the compression gets, there is a point where we should be able to bring up a visual display that represents in real time what we hear and that makes sense... By analogy, spoken text can be visualized by displaying the text...
@thealgorithm 6 months ago
Thanks. Yes, this is the idea for this demo (to have lyrics in time with the singing). Though I am also still considering instead encoding at a higher bitrate and having snippets of songs. (By doubling the file size to 400 bytes a second, it sounds nearly as good as the resampled original, at least in the same Madonna example.)
@fcycles 6 months ago
@@thealgorithm I know this idea is far more complicated, but thinking about how we were doing our digimix back in 1991... our music guy was using the original song track and also an instrumental-only version to create a mix of the song. Maybe this kind of idea can be used here, where the song's refrain is with lyrics and high quality, and the other parts are instrumental but still digital, with a higher compression rate?
@thealgorithm 6 months ago
@@fcycles In the TCBI50k single-filer demo released in December 2021 or so, I used a higher quality background layer and a lower quality speech layer. However, I had to specifically choose a track that had many repeating background patterns (and the Rick Astley track was exactly that). It did comprise a suboptimal quality speech layer though. The only issue here is that there are not many tracks that have such repeated sequences of background audio. I did also experiment with separating the bassline, drums etc. and reconstructing them like a MOD track, which may be somewhat promising and something to look into.
@tsonez 6 months ago
Incredible! Is this song a hard or easy case for the encoder? Would it work better for other types of songs or genres?
@thealgorithm 6 months ago
The original song has background chords, drums, bassline (deep bassline) and speech combined, so it is a hard case for the encoder considering the low bitrate used. For more demanding titles, I can increase the layer amount and/or parameter updates, though there is a limit on how many layers I can mix on the C64 decoder end (due to the slow processor).
@tsonez 6 months ago
@@thealgorithm I see. So an easier case would be something with less layering.
@thealgorithm 6 months ago
@@tsonez Less layering is needed for audio that is less complex (e.g. speech); this in turn will use even less space.
@AftercastGames 6 months ago
Unbelievable. I’d love to see more info about the compression and the SID registers involved, if it’s documented anywhere.
@thealgorithm 6 months ago
It uses digital mixing without using the oscillators of the SID. Of course, the thing runs on a stock C64 regardless. A plus point of this is that it can be feasible on other machines too. More information on the technique here: kzbin.info/www/bejne/j6PInIeoeLZljNU
@AftercastGames 6 months ago
@@thealgorithm Gotcha.. So the only register involved is the volume register then?
@thealgorithm 6 months ago
@@AftercastGames Correct. It's output via one register only. This also means that it's not machine specific and can be used on other devices (as it does not rely on SID specifics such as its channels or oscillators).
@SuperWiiBros08 7 months ago
I thought it was N64 but this is amazing too
@Nbrother1607 1 year ago
music too fast video too faster
@rokker333 1 year ago
Please, can someone explain the background techniques a bit? From my point of view this is really impossible on original C64 hardware.
@Ic1621 1 year ago
Maria isn't showing me any mercy! It was such a brutal effect! Congrats to the programmer!