Let’s Create a Speech Synthesizer (C++17) with Finnish Accent!

Views: 98,705

Comments: 297

@Bisqwit 6 years ago

This is LPC (Linear Predictive Coding): yₙ = eₙ − ∑(ₖ₌₁..ₚ) (bₖ yₙ₋ₖ) where
‣ y[] = output signal, e[] = excitation signal (buzz, also called predictor error signal), b[] = the coefficients for the given frame
‣ p = number of coefficients per frame, k = coefficient index, n = output index
Compare with FIR (Finite Impulse Response): yₙ = ∑(ₖ₌₁..ₚ) (bₖ xₙ₋ₖ) where
‣ x[] = input signal
The similarities between the two are striking. FIR is used in applications like low-pass filtering, high-pass filtering, band-pass filtering, band-stop filtering, etc. It is an almost magical type of mathematics that is used to generate these filters. For LPC, there are several different algorithms, many of which are implemented in Praat, the software that I used in this video to create my LPC files.
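
The LPC recurrence above can be sketched in C++ roughly as follows. This is a minimal illustration, not the code from the video: the function name `lpc_synthesize` is made up here, the coefficients b[1..p] are assumed to be stored zero-based, and the output history is assumed to start at zero.

```cpp
#include <cstddef>
#include <vector>

// y[n] = e[n] - sum_{k=1..p} b[k] * y[n-k], treating y[n-k] as 0 when n < k.
// coeff[k-1] holds b[k]. All names are illustrative, not from the video.
std::vector<double> lpc_synthesize(const std::vector<double>& excitation,
                                   const std::vector<double>& coeff)
{
    std::vector<double> y(excitation.size(), 0.0);
    for (std::size_t n = 0; n < excitation.size(); ++n)
    {
        double acc = excitation[n];
        for (std::size_t k = 1; k <= coeff.size() && k <= n; ++k)
            acc -= coeff[k - 1] * y[n - k];   // feedback from earlier outputs
        y[n] = acc;
    }
    return y;
}
```

Feeding a single impulse through a one-coefficient filter with b₁ = −0.5 gives a decaying response in which each output sample is half of the previous one.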

@huyvole9724 6 years ago

I met that formula when I studied the Signals & Systems module (my school calls it Digital Signal Processing).

@BichaelStevens 6 years ago

16-17 minutes in: Please next time lower the audio or give a warning. The popping killed my hearing

@Rennu_the_linux_guy 6 years ago

uhhh

@a1k0n 6 years ago

In fact it's identical to an IIR filter, which has coefficients for both x and y, and your x coefficient is 1 and all your y coefficients are negated.
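
To make that relationship concrete, here is a hedged sketch of a general direct-form IIR filter. The names `iir_filter`, `ff` (coefficients applied to x) and `fb` (coefficients applied to past y) are assumptions for illustration. With `ff = {1}` and `fb` set to the LPC coefficients it reduces to the LPC recurrence, and with an empty `fb` it is a plain FIR filter.

```cpp
#include <cstddef>
#include <vector>

// Direct-form I IIR filter, with the leading feedback coefficient normalised to 1:
//   y[n] = sum_{k>=0} ff[k] * x[n-k]  -  sum_{k>=1} fb[k-1] * y[n-k]
// Samples before the start of the signal are treated as zero.
std::vector<double> iir_filter(const std::vector<double>& ff,
                               const std::vector<double>& fb,
                               const std::vector<double>& x)
{
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n)
    {
        double acc = 0.0;
        for (std::size_t k = 0; k < ff.size() && k <= n; ++k)
            acc += ff[k] * x[n - k];          // FIR (input) part
        for (std::size_t k = 1; k <= fb.size() && k <= n; ++k)
            acc -= fb[k - 1] * y[n - k];      // recursive (output) part
        y[n] = acc;
    }
    return y;
}
```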

@RazorM97 6 years ago

How to stop prison radicalization

@KatzRool 6 years ago

I was going to make a joke about how you already sound like speech synthesis when speaking English, but your English gets better every single video. Keep it up man!

@framegrace1 6 years ago

I think the clicks are because the program is cutting/pasting at random waveform values. This produces discontinuities in the waveform, which generate those clicks. I think the simple way to solve it is to wait until the sample value crosses the zero line before cutting the audio, and then wait for another zero crossing before introducing the next one.
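
The zero-crossing idea could be sketched like this. The helper `next_zero_crossing` is a hypothetical name, and the sketch only shows how to find a cut point in a sample buffer; whether cutting there actually removes the clicks in an LPC synthesizer is a separate question.

```cpp
#include <cstddef>
#include <vector>

// Returns the first index i >= start at which the signal has reached or
// crossed zero: either samples[i] is exactly zero, or its sign differs from
// that of samples[i-1]. Returns samples.size() if there is no such index.
std::size_t next_zero_crossing(const std::vector<float>& samples, std::size_t start)
{
    for (std::size_t i = (start == 0 ? 1 : start); i < samples.size(); ++i)
    {
        const bool was_negative = samples[i - 1] < 0.0f;
        const bool is_negative  = samples[i] < 0.0f;
        if (samples[i] == 0.0f || was_negative != is_negative)
            return i;
    }
    return samples.size();
}
```

A caller would cut the outgoing sound at the returned index and start the incoming sound at one of its own zero crossings.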

@KuraIthys 6 years ago

Interesting theory. That actually matches advice mentioned in the SNES manual in relation to audio samples. What it is trying to say exactly is ambiguous, but it warns against discontinuities in the waveform, which would result in clicking sounds. Of course, given the ADPCM coding, discontinuities on block boundaries would easily result if you're not careful. (since the samples within a block are all expanded using the same parameters, but across block boundaries the parameters change.)

@idk-bv3iw 6 years ago

What about a simple fade-out/fade-in between the samples?
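
A fade-out/fade-in between two sample buffers might look like the sketch below. It assumes a linear fade over a caller-chosen overlap length; `crossfade` is an illustrative name, and other fade curves (for example equal-power) are equally possible.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Joins `a` and `b`, blending the last `overlap` samples of `a` with the
// first `overlap` samples of `b` using a linear fade-out / fade-in.
std::vector<float> crossfade(const std::vector<float>& a,
                             const std::vector<float>& b,
                             std::size_t overlap)
{
    overlap = std::min({overlap, a.size(), b.size()});
    const auto keep = static_cast<std::ptrdiff_t>(a.size() - overlap);
    std::vector<float> out(a.begin(), a.begin() + keep);
    for (std::size_t i = 0; i < overlap; ++i)
    {
        // Fade position strictly between 0 and 1.
        const float t = static_cast<float>(i + 1) / static_cast<float>(overlap + 1);
        out.push_back((1.0f - t) * a[a.size() - overlap + i] + t * b[i]);
    }
    out.insert(out.end(), b.begin() + static_cast<std::ptrdiff_t>(overlap), b.end());
    return out;
}
```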

@TheBcoolGuy 6 years ago

@@idk-bv3iw That's the method used in video editing.

@crimsun7186 6 years ago

You also have to determine a rhythmic pattern dependent on the language and overall delivery, as words are not spoken at a constant pace.

@a1k0n 6 years ago

I don't think that will work, because of all the excitation signal history in the bp[] array. Instantaneously changing the filter coefficients can lead to instability. One thing that might help, or might make it worse (I'm not sure) is to try implementing the transposed version where the bp[] array isn't just past output samples, but partially computed future samples. See the notes here: docs.scipy.org/doc/scipy/reference/generated/scipy.signal.lfilter.html
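
For reference, a transposed (direct form II transposed) all-pole filter could be sketched as follows. This is only an illustration of the structure described in the linked SciPy notes, using the same sign convention as the LPC formula in the pinned comment; the class name `TransposedAllPole` is made up, and, as said above, it is not certain that this helps with the clicks.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// All-pole filter y[n] = x[n] - sum_k a[k] * y[n-k] in transposed form.
// The state z_ holds partially computed future outputs, not past outputs.
class TransposedAllPole
{
public:
    explicit TransposedAllPole(std::vector<double> a)
        : a_(std::move(a)), z_(a_.size(), 0.0) {}

    double step(double x)
    {
        const double y = x + (z_.empty() ? 0.0 : z_[0]);
        for (std::size_t i = 0; i < z_.size(); ++i)
        {
            // z_[i + 1] still holds the value from the previous sample here.
            const double next = (i + 1 < z_.size()) ? z_[i + 1] : 0.0;
            z_[i] = next - a_[i] * y;
        }
        return y;
    }

private:
    std::vector<double> a_;   // a_[k-1] holds a[k]
    std::vector<double> z_;   // one state value per coefficient
};
```

For a fixed coefficient set, this produces the same output as the direct recurrence; the two forms differ only in what the state means when the coefficients change between frames.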

@x0j 6 years ago

This doesn't fool me, I know you have a much more advanced synthesizer that you use for your videos. A nice coverup attempt though

@tomh6339 4 years ago

Dude. I haven't used Praat since University, was hit by waves of nostalgia in the most unexpected place. Your videos are the best, you're quite the renaissance man.

@educate9946 6 years ago

Now I can have Robot Bisqwit wake me up every morning.

@thefoolishgmodcube2644 6 years ago

Imagine having “SHALOM! SHALOM!” as a wake-up alarm

@kkeanie 5 years ago

@David Plays Stuff I really need that. it would stop my depression

@PantsYT 5 years ago

"Hyvää huomenta"

@sindavmi 4 years ago

robot bisqwit is a pleonasm

@greasyfingers9250 6 years ago

"Yes, I use PHP. Because a programming language, that you know is much more efficient than one that you don't know." This is the truest statement I have ever heard.

@greasyfingers9250 6 years ago

@Michael Smith You can debug it line by line with xdebug, but c# or java are usually better for that kind of work.

@Kitulous 6 years ago

in order to debug PHP you have to var_dump every single variable because the stack trace in PHP is a real mess.

@HermanWillems 5 years ago

Short term yes, long term no.

@jlewwis1995 3 years ago

Finally a video that actually shows how to ACTUALLY MAKE a TTS voice from scratch, almost everything online about "how to make a text to speech synthesizer from scratch" is just "use this function to call the os TTS library lul"

@BichaelStevens 6 years ago

We have reached peak AI revolution - machines making machines A voice synth making a voice synth

@akj7 6 years ago

Haha

@huyvole9724 6 years ago

-6.4°C

@Bisqwit 6 years ago

Actually the truth was like -22. I just happened to do the recording a month earlier...

@imlxh7126 1 year ago

Uberduck has a neural-network-based simulation of Microsoft Sam. Talk about overengineered lmao

@metadaat5791 6 years ago

I always liked the implication of GSM using LPC, that technically you're not hearing someone's actual voice, but a reconstruction made of a buzzer with a filter and hisses and pops from filtered noise. So, you're actually listening to a speech synthesizer's reconstruction of the other person's voice! :-) :-)

@chooha 6 years ago

Hi bisqwit I don't know if you realize this but you are an inspiration for many of the viewers here, like a hero. So could you make a video about how you reached this insane level of skill, what your journey was like, and maybe some tips on how one can be as good as you ? Thanks for all the amazing content ^_^

@MissNorington 6 years ago

Really outstanding video! Great work Bisqwit!

@magicstix0r 5 years ago

The input signal can't be a pure sine wave because: 1.) The vocal cords don't emit pure sine waves; they emit something more like a buzz. 2.) A pure sine wave would be almost unaffected by the LPC filters because it's a single frequency. A buzz is extremely rich in harmonics, and the human ear keys off the presence or absence of those harmonics in determining what was said. That's why if you look at voice data in a spectrogram, you tend to see lots of streaks that move together or widen/shrink based on what's being said. In a sort of philosophical explanation, the input signal is "sampling" your LPC filters. A single sine wave would result in sampling just a single data point. You need a lot of sine waves to get enough of a picture of the LPC filter to see what it looks like, which is what your brain is keying on to make sense of your words. Think of it kind of like an image. The sine waves are the pixels that you're building a picture of the LPC filter with. A single sine wave is like a single pixel; it doesn't tell you much. A buzz is loaded with lots of sine waves, so analogously it's loaded with a lot of pixels, so it can give you a better picture of the LPC filter, and thus a better picture of the formant it represents.
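
The "sampling the filter" picture can be made concrete by evaluating the filter's gain at individual frequencies. The sketch below assumes the sign convention of the LPC formula in the pinned comment (y[n] = e[n] − Σ b[k] y[n−k]); `lpc_gain_at` is an illustrative name. A buzz with pitch f₀ probes this gain at f₀, 2f₀, 3f₀ and so on, whereas a sine wave probes it at one frequency only.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Gain of the all-pole filter y[n] = e[n] - sum_k b[k] * y[n-k] at angular
// frequency `omega` (radians per sample):
//   |H| = 1 / |1 + sum_k b[k] * exp(-j * omega * k)|
// coeff[k-1] holds b[k].
double lpc_gain_at(const std::vector<double>& coeff, double omega)
{
    std::complex<double> denom(1.0, 0.0);
    for (std::size_t k = 1; k <= coeff.size(); ++k)
        denom += coeff[k - 1] * std::polar(1.0, -omega * static_cast<double>(k));
    return 1.0 / std::abs(denom);
}
```

Calling it with `omega = 2π · h · f₀ / sample_rate` for h = 1, 2, 3, … lists the gains that the harmonics of a buzz would receive.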

@Bisqwit 5 years ago

Great explanation! Not an ELI5 though :-) But I would have settled for that.

@magicstix0r 5 years ago

The constant clicks and pops are due to discontinuities at the frame boundaries. With an algorithm like this, they usually fix it using overlap-add. The gist of OLA is that your frames overlap and are weighted by a windowing function, then you sum them together where they overlap.
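
A bare-bones overlap-add could look like the following sketch. It assumes equally sized frames, a periodic Hann window, and a hop of half the frame size (so that overlapping windows sum to one); the names `hann_window` and `overlap_add` are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Periodic Hann window. With a hop of size/2, overlapping windows sum to 1.
std::vector<double> hann_window(std::size_t size)
{
    const double pi = std::acos(-1.0);
    std::vector<double> w(size);
    for (std::size_t i = 0; i < size; ++i)
        w[i] = 0.5 - 0.5 * std::cos(2.0 * pi * static_cast<double>(i) / static_cast<double>(size));
    return w;
}

// Places frame f at offset f * hop, weights it by the window, and sums the
// overlapping parts. All frames are assumed to have the same length.
std::vector<double> overlap_add(const std::vector<std::vector<double>>& frames,
                                std::size_t hop)
{
    if (frames.empty()) return {};
    const std::size_t size = frames[0].size();
    const std::vector<double> w = hann_window(size);
    std::vector<double> out(hop * (frames.size() - 1) + size, 0.0);
    for (std::size_t f = 0; f < frames.size(); ++f)
        for (std::size_t i = 0; i < size; ++i)
            out[f * hop + i] += w[i] * frames[f][i];
    return out;
}
```

In an LPC synthesizer, each frame would be rendered slightly longer than the hop with its own coefficient set and then mixed this way, so that no frame starts or ends abruptly.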

@oresteszoupanos 6 years ago

Joel, regarding your question at 8:05, we cannot use a sine wave because it only has audio energy in 1 frequency, whereas to synthesise human speech, we need energies in "all" frequencies, so we can have base pitches and formants happening at the same time. Buzzers have a better spread of frequencies, compared to the more "pure" sine wave. Hope I made sense ^_^

@Bisqwit 6 years ago

Good explanation, but not really an ELI5 :-) I understand the situation as indicated elsewhere in the video, but I was having trouble explaining in layman terms without referring to things like frequency spectrum; I wrote that request for the benefit of audience.

@oresteszoupanos 6 years ago

@@Bisqwit Aha, I'd never heard the term ELI5 (Explain Like I'm 5) before! Here is my second attempt :-) Voice sounds are slightly complicated. Sine wave sounds are simple. Buzzers are super-complicated. We cannot use 1 simple sine wave, filter it, and get a complex voice sound. We have to start with a super-complex buzzer, then filter out some things, to be left with a less-complex voice sound.

@frisosmit8920 6 years ago

That's actually a very good explanation. Your first explanation made me understand it. But then again, I'm not 5 years old.

@noneofyourbeeswax3460 5 years ago

But you could superimpose sine waves to get all the frequencies?

@Bisqwit 5 years ago

Yes, and in fact all waveforms can be represented as a sum of sinewaves. That is what e.g. the Fourier transform is about, or the discrete cosine transform.
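
As a small illustration of superimposing sine waves, the sketch below sums the first few terms of the Fourier series of a sawtooth wave with period 1 and range −1..1. The function name `sawtooth_partial_sum` is made up for this example; the series itself is the standard one, (2/π) Σ (−1)^(k+1) sin(2πkt)/k.

```cpp
#include <cmath>

// Approximates a sawtooth (value 2*t for -0.5 < t < 0.5, period 1) by
// summing its first `harmonics` sine components:
//   saw(t) ~ (2/pi) * sum_{k=1..harmonics} (-1)^(k+1) * sin(2*pi*k*t) / k
double sawtooth_partial_sum(double t, int harmonics)
{
    const double pi = std::acos(-1.0);
    double sum = 0.0;
    for (int k = 1; k <= harmonics; ++k)
    {
        const double sign = (k % 2 == 1) ? 1.0 : -1.0;
        sum += sign * std::sin(2.0 * pi * k * t) / k;
    }
    return (2.0 / pi) * sum;
}
```

With one harmonic the result is a plain sine wave; adding more harmonics brings the sum closer and closer to the sawtooth shape.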

@OverSeasMedia 6 years ago

bisqwit was the inspiration to write my own tools whenever i need one, Great video.

@wallaguest1 6 years ago

i cant understand how you have so much knowledge, its crazy

@DudeWatIsThis 4 years ago

Bisqwit you fucking legend man. This is the way to handle the banter. Throw it straight back at them! Genius stuff. You win again, good sir!

@pixelflow 5 years ago

Finally! A Bisqwit Vocaloid :3

@shivisuper 6 years ago

These videos make me respect you even more. You're very knowledgeable!

@prizmarvalschi1319 4 years ago

This is kinda like how UTAU users create voicebanks. Except we sing in 5-syllable strings for Japanese, sometimes more for other languages. And sometimes they are recorded in three or more pitches.

@noname-rr7hk 2 years ago

I was searching for this video for half a year. Thank you...

@fisu51 5 years ago

Kyllä

@BeeBaux 5 years ago

Great job, bro! Thanks for making complex things easier.

@d3ibit 6 years ago

Joel, always a pleasure to watch a C++ (related in some way) video. Keep the good work!

@MrGoatflakes 5 years ago

6:34 if you say this five times into a mirror at night you will summon a Bisqwit :P

@tomaszx7760 4 years ago

I remember playing with the "Say" speech synthesizer from the Workbench 1.3 OS (on an Amiga 500 computer).

@stennisrl 6 years ago

Wow, what a cool video to wake up to. Excellent work!

@DrSid42 6 years ago

I just had the idea to make my own speech synth. I wondered if there was some nice example that is low-level enough. And guess what: this guy had the same idea just in time to have it done now. Great job!

@miszczklasykuw3025 6 years ago

music in background adds nice atmosphere to video as always x)

@moth.monster 5 years ago

Now we need to record the speech synth speaking and use that to make another synth

@kapiltyagi4639 6 years ago

The solution for the clicking in the sound is simply to fade out the signal at the very end of each sample, because LPC just converts the audio samples into a simple, low-resolution representation of the waveform: just a bunch of float values and a gain.

5 years ago

Super interesting article. Thanks!

@skilz8098 6 years ago

Once again; another great video!

@davidcuny7002 4 years ago

The red lines in Praat indicate formants, not the overtones. The vocal cords produce pulses, which have a fundamental frequency (pitch) as well as overtones (multiples of the pitch). The tongue forms a series of "tubes" in the mouth, which causes the pulses to resonate at frequencies determined by the lengths of those various chambers. The resonating frequencies of these "tubes" are formants, and different mouth shapes create different sets of resonating frequencies.

@RamLaska 5 years ago

I did something like this in the early nineties. I recorded my voice on my Mac SE, and wrote a hypercard stack to play the correct sounds together. It didn't translate English into phonemes, you had to write out your own phonemes, but that wasn't quite so unusual at that time. I also only made one recording per phoneme, because ain't nobody got time to record every possible phoneme pair 😂

@thetastefultoastie6077 6 years ago

I've never seen `++i %= max` before. That's pretty cool. Edit: it seems this only works in C++ but not in C, Java or Javascript

@Bisqwit 6 years ago

In C++, operator++() returns a reference to the object being modified. This is not the case in C. This has nothing to do with C++17 or about sequence points. If the expression was `i++ %= max`, it would be a different story. `++i %= max` is completely unambiguous in its meaning. The reason it does not work in C is because `++i` returns a non-lvalue copy of the variable in C, not a reference to it. (C does not have references.)
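
A tiny C++ illustration of that idiom, wrapped in a helper whose name (`advance_index`) is only for this example:

```cpp
// Advances a ring-buffer index by one and wraps it into [0, size).
// In C++, ++i yields an lvalue referring to i itself, so the compound
// assignment %= operates on i after it has been incremented.
inline unsigned advance_index(unsigned& i, unsigned size)
{
    return ++i %= size;
}
```

Starting from `i = 2` with `size = 4`, the first call leaves `i` at 3 and the second call wraps it to 0.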

@thetastefultoastie6077 6 years ago

@@Bisqwit Thanks for the explanation! I used an online compiler to quickly try all versions of C++ and indeed it worked in all of them.

@Smaxx 6 years ago

@@shaurz I'd just write a tiny inline function with a speaking name instead. ;) Like `incmod(v, m)`

@DrSid42 6 years ago

@@shaurz It seems weird to you because of a different background. Finnish folk have done it like this for centuries.

@noneofyourbeeswax3460 5 years ago

@@DrSid42 I don't think computers have been around for centuries

@yukimoe 6 years ago

So you're basically teaching us how to make Vocaloid-like software? Nice.

@ceablue8037 6 years ago

@jj zun Yesssssssssssssssssssssssss

@mattg5461 5 years ago

Brilliant. I find this video a week after handing in my dissertation on vocal synthesis... This would have changed everything

@Bisqwit 5 years ago

How so?

@mattg5461 5 years ago

There's just a lot of things you've covered in here that I wasn't able to find much concrete information about - things like accents and dialects especially. Lots of things like that which I knew from common sense but couldn't find actual written documentation to back up.

@gero9307 3 years ago

I created a voicebank CVVC and VCV type for utau, and while watching this video I experienced deja vu)

@AT-zr9tv 3 years ago

Your videos are fantastic. This one particularly.

@GibusWearingMann 5 years ago

I'm starting to become curious how to stop prison radicalization.

@DynamicFortitude 6 years ago

8:00 Buzzer cannot be pure sine, because then the filtering of the frequencies would make no sense - there would be only one frequency in buzzer to start with. Buzzer needs to have rich frequency spectrum, but at the same time it needs to be harmonic (i.e. all frequencies are natural multiples of some base frequency = there is a defined pitch). You could use any function in form A*sin(x) + B*sin(2x) + C*sin(3x) +..., but of course the easiest way to produce signal like that is to use 1) square wave, 2) sawtooth wave (as you did), 3) triangle wave, 4) exp(sin(x)), etc.

@Bisqwit 6 years ago

Good explanation, but not an ELI5. I had trouble explaining it in layman terms without invoking mathematics and frequency spectrums... That's why I wrote the annotation.

@DynamicFortitude 6 years ago

@@Bisqwit Vocal cords are the buzzers. Air goes through a buzzer, then through a tube (the vocal tract), which amplifies some frequencies (formants) and dampens others. If the buzzer sound were just one sine wave, the tube could only make it louder or quieter, nothing more. A tube cannot create new frequencies; it acts as a filter only. So the aim of the buzzer is to generate many frequencies, so that the tube (vocal tract) has something to choose from. White noise (during whispering) has all frequencies, so it is OK. Pitched sound is also OK, since it has many sine waves in it, as long as its base frequency is not too high (bass singing is easier to understand than soprano singing!). A high pitch has fewer sine waves in the formant frequency range (~300-3000 Hz). Try changing VoicePitch to ~1046 Hz (soprano's high C), and you won't be able to distinguish the vowels o from u from a, or e from i.
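
The arithmetic behind that last point can be checked with a few lines of C++. The helper name `harmonics_in_band` is hypothetical, and the 300-3000 Hz band is simply the approximate range quoted above.

```cpp
// Counts the harmonics k * pitch_hz (k = 1, 2, 3, ...) that fall inside
// the band [low_hz, high_hz].
int harmonics_in_band(double pitch_hz, double low_hz, double high_hz)
{
    if (pitch_hz <= 0.0) return 0;   // guard against a non-terminating loop
    int count = 0;
    for (int k = 1; k * pitch_hz <= high_hz; ++k)
        if (k * pitch_hz >= low_hz)
            ++count;
    return count;
}
```

At a pitch of 1046 Hz only two harmonics (1046 Hz and 2092 Hz) land between 300 and 3000 Hz, whereas a 100 Hz voice places 28 harmonics (300 Hz through 3000 Hz) in the same band, which gives the vocal tract far more to shape.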

@XTpF4vaQEp 5 years ago

13:15 accidentally used the whisper effect

@farteryhr 5 years ago

virtual singer Bisqwitoid confirmed (slap have you played with UTAU (singing synthesis software) in which it's very easy to make your own voicebank (and get quality high)? looking forward to that soooo much~ it's just wonderful to find another common interest of you and me.. phonology and speech/singing synthesizing! (but yes to get high quality it needs deeper understanding of singing in timing, rhythm, grammar, and much time to fine-tune pitch, volume, breathiness envelopes for songs)

@JokerCat-x2t 7 months ago

I know this is 5 years old, but it's still cool to listen to.

@adam7868 6 years ago

I think I remember asking about this at one point, glad to see a video done on it

@edo9k 5 years ago

I wish I had seen this video when I was researching for the master's degree.

@Bisqwit 5 years ago

What did you write about?

@pedropereirapt 5 years ago

So inspiring! Thanks for this video, you got a new sub!

@themcc1879 6 years ago

Sample voice frame to C code... the Lisp lover in me says you should have used Lisp: code as data and data as code. Either way, this was beyond interesting. I like your accent, but to be honest, everyone who speaks English has an accent. The voice speaking with an accent was definitely something I wasn't expecting this　月曜日。

@codeninja1832 6 years ago

This is interesting as a programmer, as someone who's trying to learn another language (old english, dead language sure, but fun), and as someone who asked you how to trill about a month ago haha. Still can't trill, but I'm on my way.

@Bisqwit 6 years ago

Thanks for posting!

@1st_ProCactus 5 years ago

Awesome !!!

@robertboran6234 6 years ago

Great Project. Thanks for sharing.

@GabrielCrowe 6 years ago

Awesome stuff.

@gandolfphoenix1363 5 years ago

You used the speech synthesizer that you made to give the Tutorial!

@Bisqwit 5 years ago

Yes, I used it in the first few seconds of this video.

@uxxlabrute 6 years ago

Earthbound music in the background FeelsgoodMan

@Catbangin 6 years ago

Cheer bisqwit! Almost near to guitar effects tutorial!

@dgmsstuff 6 years ago

I'm speechless. No pun intended.

@Thebasicmaker 4 years ago

I also made a speech synthesizer using the same procedure, but my language was BASIC! The voice was mine too: I pronounced a word and then cut out the part that I needed. The program just had to load the sounds and play them one after the other to read aloud a phrase that I gave to an INPUT instruction.

@firemaniac10010 5 years ago

I'm guessing the "buzz" can't be a pure sine wave because a pure sine wave has no harmonics; it's a pure tone. In other words, there's nothing to filter out except for one single frequency.

@alexhauptmann298 5 years ago

ELI5 explanation for why you can't use a sine wave: the human voice is essentially a subtractive synthesizer. Most commercial music synthesizers can do some form of this. It's the same sort of "buzzer in a tube" model, except the tube is generally way simpler (unless you're Plogue, but that's another story). The reason a sine wave can't be used is because subtractive synthesis works by taking away frequencies from a harmonically-rich (i.e. complex waveform) sound. Any given wave can be recreated by an arbitrary number of sine waves, but a sine wave can't be broken down into something simpler. So essentially, a sine wave can't be used because it's not enough data. It mathematically cannot be subtracted from any further. This is...more complex than I was intending but oh well lmao

@Bisqwit 5 years ago

Good explanation, but definitely not something that works for five-year-olds :)

@alexhauptmann298 5 years ago

@@Bisqwit Haha, I figured. Is that a QRIO in the thumbnail btw? I wanted one SO BAD as a little kid and was thoroughly impressed with how realistic the synthesized speech sounded. Of course, now I know (from experience, even) that Japanese is a MUCH easier language to synthesize than English. Also while watching your video on Finnish phonetics, I found it interesting how it's sort of similar to Japanese (vowels with singular pronunciation, lengthened vowels and consonants). I wonder if that would make it technically easier to synthesize than English (at least, native-speaker English)...at the very least, it would make the plaintext dictionary rules much easier :P

@Bisqwit 5 years ago

It’s a Nao, not Qrio. And yes, as a Finnish person who knows the basics of Japanese, I find Japanese much easier and familiar in many aspects compared to English.

@oo8dev 6 years ago

Amazing!!

@smkyone 6 years ago

kiitos

@zeppy13131 5 years ago

I can't speak for anyone else, but I was glad when this was Finnished.

@JoLiKMC 6 years ago

I, for one, welcome our new, Finnish robot overlords. _Hail Roboisqwit!_ Seriously, though, this is neat-as-hell. It's also kind of… heartbreaking, in a way. I never considered how speech synthesis works, and now that I know? The magic… is gone. :(

@clearz3600 6 years ago

Interesting as always.

@j5679 1 year ago

Very interesting video. I may have missed it but it seems like you are not incorporating stress accent into your synthesis, right? Algorithmically figuring out where the stress lies may be a bit of a challenge depending on the language (or be downright impossible), but the English Wiktionary actually provides this data and they also offer regular HTML dumps that contain IPA transcriptions. Finnish actually happens to be one of the best covered languages on the English Wiktionary, so if you ever decide to do a v2 of this project, incorporating Wiktionary's IPA data might be an idea. I'm not sure how much you know about phonetics but please be aware that IPA does not fully capture how words are pronounced. Phonemic transcriptions don't capture it by a long shot but even a narrow phonetic transcription can be slightly inaccurate (vowel qualities are a continuum, the different durations are on a continuum etc.). This all is to say that even if you use IPA data, the rest of the pipeline still needs to be tailored to a specific language and can't produce accurate output language-agnostically.

@Bisqwit 1 year ago

From Wikipedia: ”Since stress can be realised through a wide range of phonetic properties, such as loudness, vowel length, and pitch (which are also used for other linguistic functions), it is difficult to define stress solely phonetically.” In the Finnish language (this synth aims to speak like Finnish speakers do), emphasis (stress) is always on the first syllable. In my speech synthesizer, it is realized by using a slightly higher pitch for stressed phonemes.

@Darksoulmaster 6 years ago

Wow, I don't know what you are even talking about, but it's cool.

@Bisqwit 6 years ago

Speech synthesis

@krank3869 5 years ago

I always thought these videos were sped up but then i looked at the clock

@Sturmtreiben 5 years ago

Which graphics software do you use for creating pictures like the one in 3:00? They somehow look really good.

@Bisqwit 5 years ago

Thanks. I use LibreOffice Impress. I also do some postprocessing in kdenlive; basically all _animations_ are done in the video editor.

@Sturmtreiben 5 years ago

Thanks, Joel!

@smallgoodwoodoodaddy 6 years ago

I always liked your accent. So I liked it 👍 :D

@Eodese 1 month ago

1:49 this is exactly the process to make a UTAU voicebank

@Bisqwit 1 month ago

Interesting. Is there a video about that?

@ruadeil_zabelin 6 years ago

Note that std::wstring_convert is deprecated in C++17, so if you want to be standard conforming, you should replace it with something else.

@Bisqwit 6 years ago

Noted. I used it for 1) its brevity and 2) because I couldn’t figure out a concise replacement that is not deprecated.

@ruadeil_zabelin 6 years ago

@@Bisqwit Unfortunately there isn't a standard way anymore. The standards committee has said that they're working on a replacement, but will only re-add it if it's fully compliant with the Unicode standards (apparently this one didn't work in all cases). The only way seems to be to implement it fully yourself (UTF-8 decoding isn't very hard, luckily), or to use a library like iconv or libicu.
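
A hand-rolled UTF-8 decoder along those lines might look like the sketch below. It is a simplified assumption of what such a replacement could be, not the code used in the video: the name `utf8_to_utf32` is made up, malformed or truncated sequences become U+FFFD, and overlong encodings and surrogate code points are not rejected.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into UTF-32 code points.
// Simplified: invalid lead bytes and broken continuation sequences are
// replaced by U+FFFD; overlong forms and surrogates are not checked.
std::u32string utf8_to_utf32(const std::string& in)
{
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); )
    {
        const unsigned char c = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0;
        int extra = 0;   // number of continuation bytes expected
        if      (c < 0x80)           { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else { out.push_back(replacement); continue; }

        bool ok = true;
        for (; extra > 0; --extra)
        {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80)
            {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        out.push_back(ok ? cp : replacement);
    }
    return out;
}
```

For production use, a maintained library such as iconv or ICU remains the safer choice, as noted above.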

@jfkd2812 5 years ago

11:01 Hey, it's imgui! Very nice to use

@Embedonix 6 years ago

+1 for using 'goto' in your code :)

@Armadurapersonal 6 years ago

Perfect for spurdo memes

@ddream296 6 years ago

whoah nice!

@videogamemusicandfunstuff4873 6 years ago

11:01 This program looks really nice. What GUI library did you use?

@Kellykellamster 6 years ago

Looks like imgui to me.

@Bisqwit 6 years ago

Yep, correct. Imgui it is.

@minecrafttheobjectno541 5 years ago

Did I hear a turret say "Weeee" when he said "thumbs up the video"?

@Bleenderhead 6 years ago

I want to hear it sing Space Oddity.

@gazehound 6 years ago

I'm early this time. Awesome video!

@Bisqwit 6 years ago

Thank you!

@victorprokop2240 4 years ago

3:16 Mongolian throat singing!!! lmao

@Bisqwit 4 years ago

I’m not sure if you are mocking, but the principle is actually similar. The purpose is to enunciate different subtones while keeping the primary tone unchanged.

@yohvh 5 years ago

When you find a problem after you played the audio do you just in real time think of a solution and code it right there at that speed?

@aprilliac 6 years ago

Rolling index, why didn't I think of that... Thanks for the excellent video. :)

@Bisqwit 6 years ago

Yeah, a rolling index is a bit neater solution than doing a copy-backwards-by-1 loop after each iteration. On the other hand, the rolling index makes SIMD optimizations impossible, so it’s a tradeoff.
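
The two approaches can be compared side by side in a short sketch. The struct names and the history length `P` are illustrative only; both variants expose the same "sample from k steps ago" view, and they differ in whether the data or the index moves.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr std::size_t P = 4;   // history length, chosen only for illustration

// Variant 1: the newest sample is always at index 0; every push shifts the
// older samples back by one position.
struct ShiftHistory
{
    std::array<float, P> data{};
    void push(float v)
    {
        std::copy_backward(data.begin(), data.end() - 1, data.end());
        data[0] = v;
    }
    float ago(std::size_t k) const { return data[k]; }   // k = 0 is the newest
};

// Variant 2: samples stay where they are; only the write position rolls.
struct RollingHistory
{
    std::array<float, P> data{};
    std::size_t pos = 0;       // slot that holds the newest sample
    void push(float v)
    {
        ++pos %= P;
        data[pos] = v;
    }
    float ago(std::size_t k) const { return data[(pos + P - k) % P]; }
};
```

The shifting variant keeps the history contiguous and in order, which is what makes it friendlier to vectorised inner loops; the rolling variant avoids the per-sample copy at the cost of wrapped indexing.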

@NonTwinBrothers 3 years ago

9:30 holy shit that scared me

@ivanbogdasaebersold4690 5 years ago

This will be my COVAS in Elite Dangerous...

@vegardertilbake1 6 years ago

Ha! This was so much fun!

@jamescumbria4499 5 years ago

Are you going to make this speech synthesizer a TTS voice for Windows?

@Bisqwit 5 years ago

I don’t deal with Windows.

@pencrows 5 years ago

was all the audio in this video speech synthesized edit:its not

@icu7992 5 years ago

why you don't use namespace std?

@Bisqwit 5 years ago

The standard namespace is not a demon to be vanquished with -a magic spell- boilerplate code. It has a purpose.

@alejandroduarte5245 6 years ago

Great video :)

@arcnorj 6 years ago

Can you explain just a bit what you did to generate the LPC sample from David Woods? I guess manually editing the pitch curve with Praat?

@Bisqwit 6 years ago

I dumped the soundtrack of the video into a wav file using MPlayer. Then I opened the soundtrack in Audacity, and cropped it into just those three seconds or so, saved it into a new wav file. (Or maybe I dumped only three seconds from the soundtrack in the first place, using -ss and -endpos options. I don’t remember.) Then I opened the wav file in Praat, and did nothing else but synthesized the LPC from it (Analyze spectrum → To LPC (burg) → Save).