Let’s Create a Speech Synthesizer (C++17) with Finnish Accent!

  Views: 97,389

Bisqwit

5 years ago

In this tool-assisted education video, we create a speech synthesizer in modern C++ - with a Finnish accent. The video deconstructs speech and phonemes and explores Linear Predictive Coding (LPC). The open source programs Praat and Audacity are featured.
All downloadable materials, including source code: github.com/bisqwit/speech_syn...
Become a member: kzbin.infojoin
My links:
Twitter: / realbisqwit
Liberapay: liberapay.com/Bisqwit
Steady: steadyhq.com/en/bisqwit
Patreon: / bisqwit (Other options at bisqwit.iki.fi/donate.html)
Twitch: / realbisqwit
Homepage: iki.fi/bisqwit/
Credits, in order of appearance:
- Music: original composition :: untitled :: Joel Yliluoma
- Music: Duke Nukem 3D :: Shop n Bag :: Lee Jackson
- Music: Earthbound :: Twoson :: Akio Ōmuri and others (converted into MIDI and played through OPL3 emulation through homebrew software)
- Meme: The Fairly OddParents :: If I Had One :: Butch Hartman
- Music: Chrono Trigger :: Ending :: Yasunori Mitsuda (SPC-OPL3 conversion)
- Presentation: Haifa University :: Linear Predictive Coding :: Nimrod Peleg 2009
- Music: original composition :: Space :: Joel Yliluoma
- Music: Final Fantasy V :: Airship :: Nobuo Uematsu (SPC-OPL3 conversion)
- Music: Tales of Phantasia :: Freeze :: Motoi Sakuraba (SPC-OPL3 conversion)
- Music: Aryol :: Warmup :: Kyohei Sada (SPC-OPL3 conversion)
How to stop prison radicalization: • Video
You can contribute subtitles: kzbin.info_vid... or to any of my videos: kzbin.info_cs_...
#speechsynthesis #bisqwit #programming

Comments: 297
@Bisqwit 5 years ago
This is LPC (Linear Predictive Coding):
yₙ = eₙ − ∑(ₖ₌₁..ₚ) (bₖ yₙ₋ₖ)
where
‣ y[] = output signal, e[] = excitation signal (buzz, also called predictor error signal), b[] = the coefficients for the given frame
‣ p = number of coefficients per frame, k = coefficient index, n = output index
Compare with FIR (Finite Impulse Response):
yₙ = ∑(ₖ₌₁..ₚ) (bₖ xₙ₋ₖ)
where
‣ x[] = input signal
The similarities between the two are striking. FIR is used in applications like low-pass filtering, high-pass filtering, band-pass filtering, band-stop filtering, etc. It is an almost magical type of mathematics that is used to generate these filters. For LPC, there are several different algorithms, many of which are implemented in Praat, the software that I used in this video to create my LPC files.
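To make the recurrence above concrete, here is a minimal C++ sketch of the synthesis step for one frame. It is not the code from the video: the function name `lpc_synthesize`, the use of `std::vector`, and the simple shift-based history are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: applies y[n] = e[n] - sum_{k=1..p} b[k] * y[n-k]
// to one frame. coeff[k-1] holds b[k]. history[k-1] holds y[n-k] and is
// updated in place, so the state carries over into the next frame.
std::vector<double> lpc_synthesize(const std::vector<double>& excitation,
                                   const std::vector<double>& coeff,
                                   std::vector<double>& history)
{
    assert(history.size() == coeff.size());
    std::vector<double> out;
    out.reserve(excitation.size());
    for (double e : excitation)
    {
        double y = e;
        for (std::size_t k = 0; k < coeff.size(); ++k)
            y -= coeff[k] * history[k];
        // Shift the history by one sample (simple, not the fastest way).
        for (std::size_t k = coeff.size(); k-- > 1; )
            history[k] = history[k - 1];
        if (!history.empty()) history[0] = y;
        out.push_back(y);
    }
    return out;
}
```

With a single coefficient b₁ = −0.5 and an impulse as excitation, the output decays as 1, 0.5, 0.25, which shows the feedback that distinguishes this from the FIR form.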
@huyvole9724 5 years ago
I met that formula when I took the Signals & Systems module (my school calls it Digital Signal Processing)
@BichaelStevens 5 years ago
16-17 minutes in: Please next time lower the audio or give a warning. The popping killed my hearing
@Rennu_the_linux_guy 5 years ago
uhhh
@a1k0n 5 years ago
In fact it's identical to an IIR filter, which has coefficients for both x and y, and your x coefficient is 1 and all your y coefficients are negated.
@RazorM97 5 years ago
How to stop prison radicalization
@KatzRool 5 years ago
I was going to make a joke about how you already sound like speech synthesis when speaking English, but your English gets better every single video. Keep it up man!
@educate9946 5 years ago
Now I can have Robot Bisqwit wake me up every morning.
@thefoolishgmodcube2644 5 years ago
Imagine having “SHALOM! SHALOM!” as a wake-up alarm
@kkeanie 4 years ago
@David Plays Stuff I really need that. it would stop my depression
@PantsYT 4 years ago
"Hyvää huomenta"
@ktaleentkekma5777 3 years ago
robot bisqwit is a pleonasm
@framegrace1 5 years ago
I think the clicks are because the program is cutting/pasting at random waveform values. This produces discontinuities in the waveform, which generate those clicks. I think the simple way to solve it is to just wait until the value of the sample crosses the 0 line to perform the cut of the audio, and wait again for a 0 crossing to introduce the next one.
@KuraIthys 5 years ago
Interesting theory. That actually matches advice mentioned in the SNES manual in relation to audio samples. What it is trying to say exactly is ambiguous, but it warns against discontinuities in the waveform, which would result in clicking sounds. Of course, given the ADPCM coding, discontinuities on block boundaries would easily result if you're not careful. (since the samples within a block are all expanded using the same parameters, but across block boundaries the parameters change.)
@idk-bv3iw 5 years ago
What about a simple fade-out/fade-in between the samples?
@TheBcoolGuy 5 years ago
@@idk-bv3iw That's the method used in video editing.
@crimsun7186 5 years ago
You also have to determine a rhythmic pattern dependent on the language and overall delivery, as words are not spoken at a constant pace.
@a1k0n 5 years ago
I don't think that will work, because of all the excitation signal history in the bp[] array. Instantaneously changing the filter coefficients can lead to instability. One thing that might help, or might make it worse (I'm not sure) is to try implementing the transposed version where the bp[] array isn't just past output samples, but partially computed future samples. See the notes here: docs.scipy.org/doc/scipy/reference/generated/scipy.signal.lfilter.html
@x0j 5 years ago
This doesn't fool me, I know you have a much more advanced synthesizer that you use for your videos. A nice coverup attempt though
@BichaelStevens 5 years ago
We have reached peak AI revolution - machines making machines A voice synth making a voice synth
@akj7 5 years ago
Haha
@huyvole9724 5 years ago
-6.4°C
@Bisqwit 5 years ago
Actually the truth was like -22. I just happened to do the recording a month earlier...
@imlxh7126 1 year ago
Uberduck has a neural-network-based simulation of Microsoft Sam. Talk about overengineered lmao
@greasyfingers9250 5 years ago
"Yes, I use PHP. Because a programming language, that you know is much more efficient than one that you don't know." This is the truest statement I have ever heard.
@greasyfingers9250 5 years ago
@Michael Smith You can debug it line by line with xdebug, but c# or java are usually better for that kind of work.
@Kitulous 5 years ago
in order to debug PHP you have to var_dump every single variable because the stack trace in PHP is a real mess.
@HermanWillems 4 years ago
Short term yes, long term no.
@magicstix0r 5 years ago
The input signal can't be a pure sine wave because: 1.) The vocal cords don't emit pure sine waves; they emit something more like a buzz. 2.) A pure sine wave would be almost unaffected by the LPC filters because it's a single frequency. A buzz is extremely rich in harmonics, and the human ear keys off the presence or absence of those harmonics in determining what was said. That's why if you look at voice data in a spectrogram, you tend to see lots of streaks that move together or widen/shrink based on what's being said. In a sort of philosophical explanation, the input signal is "sampling" your LPC filters. A single sine wave would result in sampling just a single data point. You need a lot of sine waves to get enough of a picture of the LPC filter to see what it looks like, which is what your brain is keying on to make sense of your words. Think of it kind of like an image. The sine waves are the pixels that you're building a picture of the LPC filter with. A single sine wave is like a single pixel; it doesn't tell you much. A buzz is loaded with lots of sine waves, so analogously it's loaded with a lot of pixels, so it can give you a better picture of the LPC filter, and thus a better picture of the formant it represents.
@Bisqwit 5 years ago
Great explanation! Not an ELI5 though :-) But I would have settled for that.
@metadaat5791 5 years ago
I always liked the implication of GSM using LPC, that technically you're not hearing someone's actual voice, but a reconstruction made of a buzzer with a filter and hisses and pops from filtered noise. So, you're actually listening to a speech synthesizer's reconstruction of the other person's voice! :-) :-)
@tomh6339 4 years ago
Dude. I haven't used Praat since University, was hit by waves of nostalgia in the most unexpected place. Your videos are the best, you're quite the renaissance man.
@chooha 5 years ago
Hi bisqwit I don't know if you realize this but you are an inspiration for many of the viewers here, like a hero. So could you make a video about how you reached this insane level of skill, what your journey was like, and maybe some tips on how one can be as good as you ? Thanks for all the amazing content ^_^
@magicstix0r 5 years ago
The constant clicks and pops are due to discontinuities at the frame boundaries. With an algorithm like this, they usually fix it using overlap-add. The gist of OLA is that your frames overlap and are weighted by a windowing function, then you sum them together where they overlap.
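The overlap idea described above can be sketched in a few lines of C++. This is only an illustration with linear (triangular) weights, not a full overlap-add implementation and not code from the video. The name `append_crossfaded` is made up for this example, and whether it removes the clicks in this particular synthesizer is untested.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: appends `next` to `out`, blending the first
// `overlap` samples of `next` with the last `overlap` samples already in
// `out`. The two weights always sum to 1, so a steady signal stays steady.
void append_crossfaded(std::vector<float>& out,
                       const std::vector<float>& next,
                       std::size_t overlap)
{
    assert(overlap <= out.size() && overlap <= next.size());
    const std::size_t start = out.size() - overlap;
    for (std::size_t i = 0; i < overlap; ++i)
    {
        // Weight of the incoming frame ramps up across the overlap region.
        const float w = float(i + 1) / float(overlap + 1);
        out[start + i] = out[start + i] * (1.0f - w) + next[i] * w;
    }
    // The rest of the new frame is appended unchanged.
    out.insert(out.end(),
               next.begin() + static_cast<std::ptrdiff_t>(overlap),
               next.end());
}
```

A real overlap-add would typically use a smoother window (for example Hann) and synthesize each frame slightly longer than the hop size so that there is material to overlap.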
@MissNorington 5 years ago
Really outstanding video! Great work Bisqwit!
@shivisuper 5 years ago
These videos make me respect you even more. You're very knowledgeable!
@OverSeasMedia 5 years ago
bisqwit was the inspiration to write my own tools whenever i need one, Great video.
@jlewwis1995 2 years ago
Finally a video that actually shows how to ACTUALLY MAKE a TTS voice from scratch, almost everything online about "how to make a text to speech synthesizer from scratch" is just "use this function to call the os TTS library lul"
@d3ibit 5 years ago
Joel, always a pleasure to watch a C++ (related in some way) video. Keep the good work!
@pixelflow 5 years ago
Finally! A Bisqwit Vocaloid :3
@wallaguest1 5 years ago
i cant understand how you have so much knowledge, its crazy
@adam7868 5 years ago
I think I remember asking about this at one point, glad to see a video done on it
@stennisrl 5 years ago
Wow, what a cool video to wake up to. Excellent work!
@BeeBaux 4 years ago
Great job, bro! Thanks for making complex things easier.
@prizmarvalschi1319 3 years ago
This is kinda like how UTAU users create voicebanks. Except we sing in 5-syllable strings for Japanese, sometimes more for other languages. And sometimes they are recorded in three or more pitches.
@miszczklasykuw3025 5 years ago
music in background adds nice atmosphere to video as always x)
@DrSid42 5 years ago
Just had an idea I will make my own speech synth. I wondered if there is some nice example low-level enough. And guess what. This guy had the same idea just in time to have it done now. Great job!
@noname-rr7hk 2 years ago
I was searching for this video for half a year. Thankyou...
5 years ago
Super interesting article. Thanks!
@DudeWatIsThis 3 years ago
Bisqwit you fucking legend man. This is the way to handle the banter. Throw it straight back at them! Genius stuff. You win again, good sir!
@AT-zr9tv 3 years ago
Your videos are fantastic. This one particularly.
@robertboran6234 5 years ago
Great Project. Thanks for sharing.
@esteveslisboeta 4 years ago
So inspiring! Thanks for this video, you got a new sub!
@skilz8098 5 years ago
Once again; another great video!
@kapiltyagi4639 5 years ago
The solution for the clicking in the sound is to simply fade out some of the frequency from the very end of the sample. This is because LPC just converts the audio samples into a simple, low-resolution waveform: a bunch of float values and a gain.
@oresteszoupanos 5 years ago
Joel, regarding your question at 8:05, we cannot use a sine wave because it only has audio energy in 1 frequency, whereas to synthesise human speech, we need energies in "all" frequencies, so we can have base pitches and formants happening at the same time. Buzzers have a better spread of frequencies, compared to the more "pure" sine wave. Hope I made sense ^_^
@Bisqwit 5 years ago
Good explanation, but not really an ELI5 :-) I understand the situation as indicated elsewhere in the video, but I was having trouble explaining in layman terms without referring to things like frequency spectrum; I wrote that request for the benefit of the audience.
@oresteszoupanos 5 years ago
@@Bisqwit Aha, I'd never heard the term ELI5 (Explain Like I'm 5) before! Here is my second attempt :-) Voice sounds are slightly complicated. Sine wave sounds are simple. Buzzers are super-complicated. We cannot use 1 simple sine wave, filter it, and get a complex voice sound. We have to start with a super-complex buzzer, then filter out some things, to be left with a less-complex voice sound.
@frisosmit8920 5 years ago
That's actually a very good explanation. Your first explanation made me understand it. But then again, I'm not 5 years old.
@noneofyourbeeswax3460 4 years ago
But you could superimpose sine waves to get all the frequencies?
@Bisqwit 4 years ago
Yes, and in fact all waveforms can be represented as a sum of sinewaves. That is what e.g. the Fourier transform is about, or the discrete cosine transform.
@RamLaska 5 years ago
I did something like this in the early nineties. I recorded my voice on my Mac SE, and wrote a hypercard stack to play the correct sounds together. It didn't translate English into phonemes, you had to write out your own phonemes, but that wasn't quite so unusual at that time. I also only made one recording per phoneme, because ain't nobody got time to record every possible phoneme pair 😂
@tomaszx7760 3 years ago
I remember playing with the "Say" speech synthesizer from the Workbench 1.3 OS (on the Amiga 500 computer)
@GibusWearingMann 4 years ago
I'm starting to become curious how to stop prison radicalization.
@Catbangin 5 years ago
Cheer bisqwit! Almost near to guitar effects tutorial!
@user-ql1hd2my3y 4 days ago
I know this is 5 years old, but it's still cool to listen to.
@thetastefultoastie6077 5 years ago
I've never seen `++i %= max` before. That's pretty cool. Edit: it seems this only works in C++ but not in C, Java or Javascript
@Bisqwit 5 years ago
In C++, operator++() returns a reference to the object being modified. This is not the case in C. This has nothing to do with C++17 or about sequence points. If the expression was `i++ %= max`, it would be a different story. `++i %= max` is completely unambiguous in its meaning. The reason it does not work in C is because `++i` returns a non-lvalue copy of the variable in C, not a reference to it. (C does not have references.)
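As a small illustration of the point above, the idiom can be wrapped in a function. The name `advance_wrapped` is made up for this sketch; the only thing it relies on is that `++i` is an lvalue in C++, so the `%=` applies to `i` itself.

```cpp
// Hypothetical helper: advances an index by one and wraps it at `size`.
// `++i` yields the variable itself (an lvalue), so `%=` then modifies `i`.
// Assumes size > 0.
inline unsigned advance_wrapped(unsigned& i, unsigned size)
{
    return ++i %= size;
}
```

Starting from `i = 2` with `size = 4`, the first call leaves `i` at 3 and the next call wraps it back to 0.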
@thetastefultoastie6077 5 years ago
@@Bisqwit Thanks for the explanation! I used an online compiler to quickly try all versions of C++ and indeed it worked in all of them.
@Smaxx 5 years ago
@@shaurz I'd just write a tiny inline function with a speaking name instead. ;) Like `incmod(v, m)`
@DrSid42 5 years ago
@@shaurz It seems weird to you because of different background. Finnish folk did it like this for centuries.
@noneofyourbeeswax3460 4 years ago
@@DrSid42 I don't think computers have been around for centuries
@mattg5461 5 years ago
Brilliant. I find this video a week after handing in my dissertation on vocal synthesis... This would have changed everything
@Bisqwit 5 years ago
How so?
@mattg5461 5 years ago
There's just a lot of things you've covered in here that I wasn't able to find much concrete information about - things like accents and dialects especially. Lots of things like that which I knew from common sense but couldn't find actual written documentation to back up.
@GabrielCrowe 5 years ago
Awesome stuff.
@yukimoe 5 years ago
So you're basically teaching us how to make Vocaloid-like software? Nice.
@ceablue8037 5 years ago
@jj zun Yesssssssssssssssssssssssss
@davidcuny7002 3 years ago
The red lines in Praat indicate formants, not the overtones. The vocal cords produce pulses, which have a fundamental frequency (pitch) as well as overtones (multiples of the pitch). The tongue forms a series of "tubes" in the mouth, which causes the pulses to resonate at frequencies proportional to the length of those various chambers. The resonating frequencies of these "tubes" are formants, and different mouth shapes create different sets of resonating frequencies.
@moth.monster 5 years ago
Now we need to record the speech synth speaking and use that to make another synth
@olekolek1000 5 years ago
Amazing!!
@clearz3600 5 years ago
Interesting as always.
@farteryhr 4 years ago
virtual singer Bisqwitoid confirmed (slap have you played with UTAU (singing synthesis software) in which it's very easy to make your own voicebank (and get quality high)? looking forward to that soooo much~ it's just wonderful to find another common interest of you and me.. phonology and speech/singing synthesizing! (but yes to get high quality it needs deeper understanding of singing in timing, rhythm, grammar, and much time to fine-tune pitch, volume, breathiness envelopes for songs)
@uxxlabrute 5 years ago
Earthbound music in the background FeelsgoodMan
@procactus9109 4 years ago
Awesome !!!
@Thebasicmaker 3 years ago
I also made a speech synthesizer using the same procedure, but my language was BASIC! The voice was mine too: I pronounced a word and then cut out the part that I needed, and the program just had to load the sounds and play them one after the other to read out a phrase I gave to an INPUT instruction.
@yohvh 5 years ago
When you find a problem after you played the audio do you just in real time think of a solution and code it right there at that speed?
@MrGoatflakes 5 years ago
6:34 if you say this five times into a mirror at night you will summon a Bisqwit :P
@edo9k 5 years ago
I wish I had seen this video when I was researching for the master's degree.
@Bisqwit 5 years ago
What did you write about?
@akj7 5 years ago
Question: At 9:17, why do you have: constexpr unsigned maxOrder? What is the purpose of the constexpr here? Won't the compiler evaluate what maxOrder is without the constexpr? Why haven't you used const?
@Bisqwit 5 years ago
For integers, there is not much difference between const and constexpr. I just like to document the intention. The primary target audience of source code is people, after all. When I write “constexpr”, I mean “this should be a compile-time constant, and something probably depends on the fact”. Here, MaxOrder _needs_ to be a compile-time constant, because it is used as an array dimension. When I write “const”, I mean “this is immutable; it should be only read, not written to”. For example, the constant “rate” is not intended to be changed, but I don’t necessarily need it to be a compile-time constant, even though it happens to be.
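The distinction described above can be shown with a short sketch. The value 32, the array name `coefficients`, and the function `pick_rate` are hypothetical and chosen only for illustration. They are not taken from the video's source code.

```cpp
#include <array>

// constexpr: guaranteed to be a compile-time constant, so it can be used
// as an array dimension. (32 is a made-up value for this sketch.)
constexpr unsigned MaxOrder = 32;
std::array<float, MaxOrder> coefficients{};

// const: only promises that the value is not modified after initialisation.
// Here it is computed at run time, so it cannot size an array.
unsigned pick_rate(bool high_quality)
{
    const unsigned rate = high_quality ? 48000u : 44100u;
    // std::array<float, rate> buffer;  // would not compile: `rate` is not
    //                                  // a constant expression
    return rate;
}
```

A `const unsigned` initialised from a literal happens to be usable in constant expressions as well, which is why the two keywords behave alike for simple integers. Writing `constexpr` makes that requirement explicit.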
@themcc1879 5 years ago
Sample voice frame to C code... the Lisp lover in me says you should have used Lisp, code as data and data as code. Either way this was beyond interesting. I like your accent but to be honest everyone who speaks English has an accent. The voice speaking with an accent was definitely something I wasn't expecting this 月曜日。
@arcnorj 5 years ago
Can you explain just a bit what you did to generate the LPC sample from David Woods? I guess manually editing the pitch curve with Praat?
@Bisqwit 5 years ago
I dumped the soundtrack of the video into a wav file using MPlayer. Then I opened the soundtrack in Audacity, and cropped it into just those three seconds or so, saved it into a new wav file. (Or maybe I dumped only three seconds from the soundtrack in the first place, using -ss and -endpos options. I don’t remember.) Then I opened the wav file in Praat, and did nothing else but synthesized the LPC from it (Analyze spectrum → To LPC (burg) → Save).
@vegardertilbake1 5 years ago
Ha! This was so much fun!
@gandolfphoenix1363 5 years ago
You used the speech synthesizer that you made to give the Tutorial!
@Bisqwit 5 years ago
Yes, I used it in the first few seconds of this video.
@XTpF4vaQEp 5 years ago
13:15 accidentally used the whisper effect
@aprilliac 5 years ago
Rolling index, why didn't I think of that... Thanks for the excellent video. :)
@Bisqwit 5 years ago
Yeah, a rolling index is a bit neater solution than doing a copy-backwards-by-1 loop after each iteration. On the other hand, the rolling index makes SIMD optimizations impossible, so it’s a tradeoff.
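For readers who have not seen the rolling-index technique, here is a minimal sketch. The type `RollingHistory` and its members are invented for this example and are not the video's code. The point is that only the write position moves, so no samples are copied after each output.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size history of the last N samples.
template <std::size_t N>
struct RollingHistory
{
    std::array<double, N> data{};
    std::size_t head = 0;  // index of the most recently pushed sample

    void push(double v)
    {
        head = (head + N - 1) % N;  // step backwards, wrapping around
        data[head] = v;
    }

    // ago = 0 is the newest sample, ago = 1 the one before it, and so on.
    double operator[](std::size_t ago) const
    {
        return data[(head + ago) % N];
    }
};
```

The price is the modulo on every access, which scatters the reads compared with a contiguous, shifted buffer. That is the tradeoff mentioned above.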
@alejandroduarte5245 5 years ago
Great video :)
@codeninja1832 5 years ago
This is interesting as a programmer, as someone who's trying to learn another language (old english, dead language sure, but fun), and as someone who asked you how to trill about a month ago haha. Still can't trill, but I'm on my way.
@Bisqwit 5 years ago
Thanks for posting!
@ShotgunLlama 2 years ago
In this video in Praat, he uses LPC (Burg). Would this work with LPC using covariance or autocorrelation?
@ddream296 5 years ago
whoah nice!
@smallgoodwoodoodaddy 5 years ago
I always liked your accent. So I liked it 👍 :D
@counterculturecocks 5 years ago
You are amazing.
@gazehound 5 years ago
I'm early this time. Awesome video!
@Bisqwit 5 years ago
Thank you!
@Sturmtreiben 4 years ago
Which graphics software do you use for creating pictures like the one in 3:00? They somehow look really good.
@Bisqwit 4 years ago
Thanks. I use LibreOffice Impress. I also do some postprocessing in kdenlive; basically all _animations_ are done in the video editor.
@Sturmtreiben 4 years ago
Thanks, Joel!
@fisu51 4 years ago
Kyllä
@adraxcz 5 years ago
Hey Bisqwit! May I ask what editing software do you use ? Thanks!
@Bisqwit 5 years ago
For which type of content?
@skilz8098 5 years ago
I'm wondering if the technology that is used to transfer data from Vinyl Record Albums into mp3 files would be of any assistance... Then just filter out the background music until you have pure voice. Then you can have a singing speech synthesizer.
@Bisqwit 5 years ago
As for the first sentence, I fail to see the relevance. As for the second sentence, what kind of solutions do you have for “filtering out background music”? Even on YouTube* it depends on correctly identifying the original recording (with or without lyrics) leaving only the added commentary and sound effects, and even then the resulting audio sounds quite hollow. *) YouTube has a tool that allows video creators to remove a song that infringes copyright, when YouTube has first identified the infringement using ContentID. Often it results in simply muting that region of the video, but sometimes it successfully removes the song leaving only commentary.
@Embedonix 5 years ago
+1 for using 'goto' in your code :)
@smkyone 5 years ago
kiitos
@juniorsilvabroadcast 4 years ago
Bisqwit can you help me with something? I'm looking for an advanced audio clipper created in VST architecture. Some type of clipper that doesn't let out any type of small peaks on the output. 4x oversampling would help a lot but even that is badly implemented in the traditional audio clippers available on the internet in VST architecture. I have an FM Audio Processor made in VST technology using some VST Plug-Ins available on the internet. And the big issue is the clipper. It lets out small peaks that make the processing difficult because I need to implement an ISP Protected limiter on the end. That makes the sound go down when High Frequency material is played.
@Bisqwit 4 years ago
I am not sure what exactly it is you want. It kind of sounds like you want a soft limiter, though. I don’t particularly have experience about VST plugins, aside from trying to install them for use in Audacity, at some point, getting it working, and then at some later point, noticing that the plugins are no longer there and being too indifferent to study further why.
@coolbrotherf127 5 years ago
Did you know this stuff before starting the project, or did you learn as you went? I've wanted to start stuff like this but get overwhelmed by all the stuff I have to learn to finish the project.
@Bisqwit 5 years ago
I studied it while making this project. Reading example code, reading articles that describe how LPC works, exploring outputs, trial and error until I got the first LPC-to-WAV converter working. Some basic principles I had already learned years ago don’t-remember-where. And of course, the principles of phoneme-based speech synthesis were already familiar to me since the 1990s when I studied how Dr. Sbaitso works.
@JakubSkowron 5 years ago
8:00 Buzzer cannot be pure sine, because then the filtering of the frequencies would make no sense - there would be only one frequency in buzzer to start with. Buzzer needs to have rich frequency spectrum, but at the same time it needs to be harmonic (i.e. all frequencies are natural multiples of some base frequency = there is a defined pitch). You could use any function in form A*sin(x) + B*sin(2x) + C*sin(3x) +..., but of course the easiest way to produce signal like that is to use 1) square wave, 2) sawtooth wave (as you did), 3) triangle wave, 4) exp(sin(x)), etc.
@Bisqwit 5 years ago
Good explanation, but not an ELI5. I had trouble explaining it in layman terms without invoking mathematics and frequency spectrums... That's why I wrote the annotation.
@JakubSkowron 5 years ago
@@Bisqwit Vocal cords are the buzzers. Air goes through a buzzer, then through a tube (vocal tract) which amplifies some frequencies (formants), and dampens others. If the buzzer sound were just one sine wave, then the tube would just make it louder or quieter, nothing more. The tube cannot create new frequencies; it acts as a filter only. So the aim for the buzzer is to generate many frequencies, so the tube (vocal tract) has something to choose from. White noise (during whispering) has all frequencies - so it is OK. Pitched sound is also OK, since it has many sine waves in it, as long as its base frequency is not too high (easier to understand bass singing than soprano singing!). High pitch has fewer sine waves in the formants' frequency range (~300-3000Hz). Try changing VoicePitch to ~1046 Hz (soprano's high C), and you won't be able to distinguish vowels o from u from a, or e from i.
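The claim that a sawtooth is a sum A*sin(x) + B*sin(2x) + C*sin(3x) + ... can be checked numerically. The sketch below is an illustration only, with made-up function names. It compares a directly computed sawtooth with its truncated Fourier series, where harmonic k has amplitude proportional to 1/k.

```cpp
#include <cmath>

// Sawtooth in [-1, 1), with `phase` measured in cycles (1.0 = one period).
inline double sawtooth(double phase)
{
    return 2.0 * (phase - std::floor(phase)) - 1.0;
}

// The same waveform approximated by its first `harmonics` sine components:
//   -(2/pi) * sum_{k=1..harmonics} sin(2*pi*k*phase) / k
inline double sawtooth_from_sines(double phase, int harmonics)
{
    const double pi = std::acos(-1.0);
    double sum = 0.0;
    for (int k = 1; k <= harmonics; ++k)
        sum += std::sin(2.0 * pi * k * phase) / k;
    return -2.0 / pi * sum;
}
```

With one harmonic the result is just a scaled sine. As more harmonics are added, the sum approaches the sawtooth, and those extra harmonics are what the filter has available to shape into formants.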
@HerrRussoTragik 5 years ago
Ohhh in the past I've made a pseudo "TTS" using the winmm from windows.h and PlaySound function...
@alexhauptmann298 4 years ago
ELI5 explanation for why you can't use a sine wave: the human voice is essentially a subtractive synthesizer. Most commercial music synthesizers can do some form of this. It's the same sort of "buzzer in a tube" model, except the tube is generally way simpler (unless you're Plogue, but that's another story). The reason a sine wave can't be used is because subtractive synthesis works by taking away frequencies from a harmonically-rich (i.e. complex waveform) sound. Any given wave can be recreated by an arbitrary number of sine waves, but a sine wave can't be broken down into something simpler. So essentially, a sine wave can't be used because it's not enough data. It mathematically cannot be subtracted from any further. This is...more complex than I was intending but oh well lmao
@Bisqwit 4 years ago
Good explanation, but definitely not something that works for five-year-olds :)
@alexhauptmann298 4 years ago
@@Bisqwit Haha, I figured. Is that a QRIO in the thumbnail btw? I wanted one SO BAD as a little kid and was thoroughly impressed with how realistic the synthesized speech sounded. Of course, now I know (from experience, even) that Japanese is a MUCH easier language to synthesize than English. Also while watching your video on Finnish phonetics, I found it interesting how it's sort of similar to Japanese (vowels with singular pronunciation, lengthened vowels and consonants). I wonder if that would make it technically easier to synthesize than English (at least, native-speaker English)...at the very least, it would make the plaintext dictionary rules much easier :P
@Bisqwit 4 years ago
It’s a Nao, not Qrio. And yes, as a Finnish person who knows the basics of Japanese, I find Japanese much easier and familiar in many aspects compared to English.
@gero9307 2 years ago
I created a voicebank CVVC and VCV type for utau, and while watching this video I experienced deja vu)
@MESYETI 4 years ago
Wow! I might try to make one, it seems hard though
@dgmsstuff 5 years ago
I'm speechless. No pun intended.
@krank3869 4 years ago
I always thought these videos were sped up but then i looked at the clock
@Darksoulmaster 5 years ago
Wow, i dont know what are you even talking about, but its cool.
@Bisqwit 5 years ago
Speech synthesis
@videogamemusicandfunstuff4873 5 years ago
11:01 This program looks really nice. What GUI library did you use?
@Kellykellamster 5 years ago
Looks like imgui to me.
@Bisqwit 5 years ago
Yep, correct. Imgui it is.
@jfkd2812 5 years ago
11:01 Hey, it's imgui! Very nice to use
@zeppy13131 5 years ago
I can't speak for anyone else, but I was glad when this was Finnished.
@jamescumbria4499 5 years ago
Are you going to make this speech synthesizer a TTS voice for Windows?
@Bisqwit 5 years ago
I don’t deal with Windows.
@siddharthkalantri5076 5 years ago
Thank you. I always use the DECtalk synth with my screen reader and always search for Hindi language support. Perhaps one day I will get my wish; looking at your Finnish voice synth, it seems possible.
@arcnorj 5 years ago
This might be a silly question, but is the speech from the beginning generated using the speech synthesizer you coded here? It sounds much better there, so I guess not...
@Bisqwit 5 years ago
It is generated by the same synthesizer indeed, but I edited the pitch curve in Praat in postprocess. Also I think I spelled some words a bit differently in order to get the right voice out. I did the same at 19:20 (except for postprocessing), intentionally misspelling “typical” as “typeecal” because the text-to-phonemes table would have generated /tɪpaɪkəl/ (typie-cal) otherwise.
@arcnorj 5 years ago
Wow, it sounds so much better! I guess it's possible to do the same changes in code, although figuring out the pitch curve is probably a problem on its own
@arcnorj 5 years ago
About your edit (pronouncing words differently so they sound right) I was reading just yesterday about TTS engines (which is a huge coincidence) and Mycroft's Mimic engine has a page where people can suggest fixes to words doing exactly that ("eye q" for "IQ" and other stuff)
@Smaxx 5 years ago
Your failing lowercase conversion for umlauts is a pretty nasty trap I fell into in the past as well. It looks like you're doing everything correctly, it should work, yet it somehow doesn't. Unfortunately, it's not as easy as imbuing/passing the correct locale. It might be, but that's not guaranteed. Even when using the UTF-8 locale, you might just walk char by char and for whatever reason ignore UTF-8 sequences… So far for me it always worked when using the wide character version instead (i.e. `wchar_t` over `char` or possibly `uint32_t`), although I've heard even that fails for some. Guess it's not totally unexpected I've heard stuff about dropping the `codecvt` header from the standard… So in your case I'd just do the `std::u32string` conversion first, then unify character casing after that.
@firemaniac10010 5 years ago
I'm guessing the "buzz" can't be a pure sine wave because a pure sine wave has no harmonics; it's a pure tone. In other words, there's nothing to filter out except for one single frequency.
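The guess above can be checked numerically. The following is only an illustrative sketch, not code from the video: it builds a sine wave and a sawtooth "buzz" with the same period and measures one DFT bin of each. The function names and the choice of a sawtooth as the buzz are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const double kPi = std::acos(-1.0);

// Magnitude of a single DFT bin, normalized by the buffer length.
// `bin` counts whole cycles per buffer.
double bin_magnitude(const std::vector<double>& x, int bin)
{
    assert(!x.empty() && bin >= 0);
    double re = 0.0, im = 0.0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        const double a = 2.0 * kPi * bin * n / x.size();
        re += x[n] * std::cos(a);
        im -= x[n] * std::sin(a);
    }
    return std::hypot(re, im) / x.size();
}

// Pure tone: `cycles` periods of a sine, `period` samples each.
std::vector<double> make_sine(int period, int cycles)
{
    std::vector<double> x(static_cast<std::size_t>(period) * cycles);
    for (std::size_t n = 0; n < x.size(); ++n)
        x[n] = std::sin(2.0 * kPi * n / period);
    return x;
}

// Sawtooth buzz with the same period, ramping from -1 towards +1.
std::vector<double> make_saw(int period, int cycles)
{
    std::vector<double> x(static_cast<std::size_t>(period) * cycles);
    for (std::size_t n = 0; n < x.size(); ++n)
        x[n] = 2.0 * (n % period) / period - 1.0;
    return x;
}
```

With `period = 64` and `cycles = 8`, the fundamental sits in bin 8 and the second harmonic in bin 16. The sine has essentially no energy in bin 16, while the sawtooth does, so only the sawtooth gives the filter something to shape.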
@generic_programmer 5 years ago
I like this
@ivanbogdasaebersold4690 5 years ago
This will be my COVAS in Elite Dangerous...
@AlexVasiluta 5 years ago
Nice
@Bleenderhead 5 years ago
I want to hear it sing Space Oddity.
@sebudrsappu6098 3 years ago
Earthbound music in the background :)
@JoLiKMC 5 years ago
I, for one, welcome our new, Finnish robot overlords. _Hail Roboisqwit!_ Seriously, though, this is neat-as-hell. It's also kind of… heartbreaking, in a way. I never considered how speech synthesis works, and now that I know? The magic… is gone. :(
@ruadeil_zabelin 5 years ago
Note that std::wstring_convert is deprecated in C++17, so if you want to be standard conforming, you should replace it with something else.
@Bisqwit 5 years ago
Noted. I used it for 1) its brevity and 2) because I couldn’t figure out a concise replacement that is not deprecated.
@ruadeil_zabelin 5 years ago
@Bisqwit Unfortunately there isn't a standard way anymore. The standards committee has said that they're working on a replacement, but will only re-add it if it's fully compliant with the Unicode standard (apparently this one didn't work in all cases). The only way seems to be to implement it fully yourself (UTF-8 decoding isn't very hard, luckily), or to use a library like iconv or libicu.
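To illustrate the "implement it yourself" option, here is a small hand-written UTF-8 decoder. It is a simplified sketch and not a replacement that the standard or the video provides: the name `utf8_to_u32` is hypothetical, malformed input is replaced with U+FFFD, and overlong encodings and surrogate code points are not rejected, which a fully conforming decoder would have to do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode a UTF-8 byte string into code points.
// Simplification (assumption): any malformed or truncated sequence yields
// U+FFFD; overlong forms and surrogates are NOT checked.
std::u32string utf8_to_u32(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ) {
        const unsigned char b = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0;
        int extra = 0;                       // continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); continue; }   // stray continuation or invalid lead byte

        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}

// Usage example: the two bytes C3 A4 decode to the single code point U+00E4 (ä).
void utf8_example()
{
    assert(utf8_to_u32("\xC3\xA4" "iti") == U"\u00E4iti");
}
```

For anything beyond a hobby project, a library such as iconv or ICU (both mentioned above) is the safer choice, since they handle the validation cases this sketch skips.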
@j5679 4 months ago
Very interesting video. I may have missed it but it seems like you are not incorporating stress accent into your synthesis, right? Algorithmically figuring out where the stress lies may be a bit of a challenge depending on the language (or be downright impossible), but the English Wiktionary actually provides this data and they also offer regular HTML dumps that contain IPA transcriptions. Finnish actually happens to be one of the best covered languages on the English Wiktionary, so if you ever decide to do a v2 of this project, incorporating Wiktionary's IPA data might be an idea. I'm not sure how much you know about phonetics but please be aware that IPA does not fully capture how words are pronounced. Phonemic transcriptions don't capture it by a long shot but even a narrow phonetic transcription can be slightly inaccurate (vowel qualities are a continuum, the different durations are on a continuum etc.). This all is to say that even if you use IPA data, the rest of the pipeline still needs to be tailored to a specific language and can't produce accurate output language-agnostically.
@Bisqwit 4 months ago
From Wikipedia: “Since stress can be realised through a wide range of phonetic properties, such as loudness, vowel length, and pitch (which are also used for other linguistic functions), it is difficult to define stress solely phonetically.” In the Finnish language (this synth aims to speak like Finnish speakers do), emphasis (stress) is always on the first syllable. In my speech synthesizer, it is realized by using a slightly higher pitch for stressed phonemes.
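The rule described in this reply (stress on the first syllable, realized as a slightly higher pitch) could be sketched roughly as follows. This is not the synthesizer's actual code: the function name `word_pitches`, the per-syllable granularity and the 10% boost are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <vector>

// Return one pitch value (in Hz) per syllable of a word.
// Assumption: stress always falls on the first syllable and is expressed
// only as a pitch raise by the hypothetical factor `stress_boost`.
std::vector<double> word_pitches(int syllables, double base_hz,
                                 double stress_boost = 1.1)
{
    assert(syllables >= 0);
    std::vector<double> hz(static_cast<std::size_t>(syllables), base_hz);
    if (!hz.empty())
        hz[0] *= stress_boost;   // only the first syllable is raised
    return hz;
}
```

For a three-syllable word at a base pitch of 100 Hz, this yields roughly 110 Hz for the first syllable and 100 Hz for the remaining two.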