NEW Open Source Model for Emotional Text to Speech

Views: 27,556

Jarods Journey

Days ago

Comments: 180

@NFawc a month ago

Agreed! This sort of TTS power with an Audiobook maker would be epic!

@genshinfinity a month ago

We can use the angry AI voice to shout at robbers or trespassers. Just need to hook it up to a motion sensor / CCTV

@Jarods_Journey a month ago

😂 Brilliant idea

@CheckTheWiki a month ago

Or clone your own angry voice and have the motion sensor tell your pets to get off of somewhere they aren't allowed when you aren't home.

@jessequartey a month ago

These are really brilliant.

@maxlikessnacks123 a month ago

Or.... just record a few lines of you saying "stop it" or "i'll call the cops" and play that when the motion sensor gets activated. No ai required.

@jessequartey a month ago

@maxlikessnacks123 What if you want a context-aware response?

@RobertJene a month ago

1:45 LOL, Melina. One time I downloaded all the Melina voice lines to my phone and listened to them + talked like her for an hour while I got into costume and makeup, then did a parody of her for stingers when I live-streamed Elden Ring

@Jarods_Journey a month ago

Method acting at its finest 😂

@bmatt2626 a month ago

Yeah! This sort of expression in AM is the dream. I've got a whole 3d puppetry / rendering / sfx / scoring pipeline with a voice-shaped hole in it. Thanks for digging into all this.

@namuzed a month ago

lol, I've been in the same boat. xtts v2 hasn't been cutting it due to lack of emotion. I tried pairing it with RVC which gave improvement, but I'm not much of a voice actor. This seems like the key.

@SttravagaNZza a month ago

What are you trying to do? What tools are you using?

@bmatt2626 13 days ago

@SttravagaNZza Film dialogue eventually. I'm currently using Blender, CC4/iClone, Embergen, Liquigen, Resolve/Fusion. Audio side is Reaper, Divisimate, RapidComposer, Vienna Ensemble, various instrument and SFX libraries.

@JonnyCrackers a month ago

The inflection is still all over the place and it places emphasis on the wrong words. Until they can figure out how to make TTS get that right, it's always going to sound weird.

@Jarods_Journey a month ago

It's getting there alright, progress seems to be steady

@n_n_hapi a month ago

Holy smokes, that Melina was amazing

@notnotandrew 28 days ago

Based on only a 4-second audio clip! I think the word "inclined" was the only word where it slightly broke character.

@OnlyJavascript a month ago

Wow, I just subscribed! I've recently started exploring TTS and how it works, and I finally found a guru!

@Jarods_Journey a month ago

Glad to have you :)

@shawn4990 a month ago

Wow... I had mentioned 'emotion' in an earlier post and bam! I had a feeling these models were in the works.

@robertotomas a month ago

I am also interested in training it with foreign languages. Do you know how much RAM it would take? Do the samples have to be well labelled (emotionally)?

@jjjbeastfd a month ago

Please expand on this more and make a website too. I'm very interested in this, for sure! You're the best, Jarod!

@rodrigovieirastudies a month ago

Wow!!! This would be a fabulous addition to the audiobook maker!

@BloodyLiFe255 a month ago

I will use it in my work and most probably make audiobooks with it. When I make one (it will take some time), I'll tag you

@rodrigovieirastudies a month ago

@BloodyLiFe255, thanks! That's very kind.

@Random_person_07 a month ago

Holy crap, the zero-shot is so good, better than GPT-SoVITS and XTTS, and it doesn't even take long to run inference. This is awesome!

@ponywarrioryt 27 days ago

Wow that's insane how good this is for something you can run locally! Would love a webui of this.

@megamayo2500 a month ago

Finally, a competitor to Bark AI. Bark was the only AI app that could do this back then. From what you describe of its architecture, it sounds like this is a standalone build derived from VoiceCraft. It is known to be highly efficient at cloning voices directly from audio. The only catch is that part of the code is Linux-based. I look forward to your WebUI contribution, although I'm curious about how these models can be trained. Overall, amazing.

@GraveUypo a month ago

Being Linux-based is a plus to me.

@tierdropp7544 a month ago

That's literally based

@metalmassacrefilms a month ago

So it wouldn't work in Windows? As long as I can run it in an Ubuntu sort of VM, it's fine for me.

@nodewizard a month ago

Very good quality! Progress being made in TTS.

@joseeduardobolisfortes a month ago

Years ago, I tried to start a project called "Vox Render", to render audio using the same principles used for image rendering: creating the audio from fundamental sinusoids and harmonics formed by the resonant chamber, but I didn't get very far due to hardware limitations at the time and never got around to it. One of the things I intended to do was create a markup language to add accents and emotional tones to the text that was to be rendered; I now realize that this kind of language would be perfect for these text-to-speech tools.
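
For readers unfamiliar with the idea, building audio from a fundamental sinusoid plus harmonics is usually called additive synthesis. Below is a minimal sketch of it, not the commenter's actual "Vox Render" design. The function name, parameters, and harmonic amplitudes are all made up for illustration, and the fixed amplitudes only stand in for the shaping a resonant chamber would do.

```python
import math


def additive_tone(fundamental_hz, harmonic_amplitudes, duration_s=0.5, sample_rate=16000):
    """Sum a fundamental and its integer harmonics into one list of samples.

    harmonic_amplitudes[0] scales the fundamental, [1] the 2nd harmonic, etc.
    (hypothetical interface, for illustration only).
    """
    n_samples = int(duration_s * sample_rate)
    # Normalise by the total amplitude so every sample stays within [-1, 1].
    total = sum(abs(a) for a in harmonic_amplitudes) or 1.0
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        value = sum(
            amp * math.sin(2 * math.pi * fundamental_hz * (k + 1) * t)
            for k, amp in enumerate(harmonic_amplitudes)
        )
        samples.append(value / total)
    return samples


# A 220 Hz tone with progressively weaker 2nd and 3rd harmonics.
tone = additive_tone(220.0, [1.0, 0.5, 0.25])
```

A markup language for accents or emotion, as the comment suggests, could then vary these amplitudes, the pitch, and the timing per word, though that part is not sketched here.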

@TheWebgecko a month ago

Looking forward to when it is eventually able to understand tone in relation to sentence structure

@morpheusnotes a month ago

Holy Moly! This is 11labs quality right there.

@4.0.4 a month ago

Nah it's good but not THAT good... yet.🤞🏻

@wickedjutto a month ago

Absolutely wild! Thanks for sharing!

@Rejekts a month ago

I'm adding this to my voice cloning webui for sure! real nice

@hamburger--fries a month ago

This is very easy to add to any web app or mobile app. I just made a simple app to allow a writer to upload a PDF and the app outputs a .wav file.

@SuperUniqueHandle 27 days ago

This seems like a huge step up in open source TTS!

@sinayagubi8805 a month ago

Please make a tutorial on how to fine-tune it on a language it doesn't support, maybe on Runpod. That would be amazing since that's where the money is. Edit: Subscribed.

@TheZaky80 13 days ago

We would appreciate it if you made another video on training this model in different languages. That would be very helpful

@metalmassacrefilms a month ago

How many varieties of voices are there? And are we allowed to use them commercially? I used to subscribe to Altered, which also had text to speech with emotion, but they are very expensive and the emotions are not as good as this one. They also treated me badly as a customer. I would be glad to change to another TTS solution. Open source or free, as long as I have commercial use.

@dthSinthoras a month ago

Appreciate you covering the other languages thing! :)

@mumarr4690 a month ago

This is exactly what I'm looking for. I'm currently making a tool to create translated dubbing from a video, and right now I use Microsoft TTS, which doesn't have "emotion" but is fast

@fulldivemedia a month ago

Damn that is awesome, if you can make it work for Tortoise and RVC together in the audiobook maker

@nigeldogg a month ago

Future meme: “she said with disgust”

@DanIel-fl1vc a month ago

I use Applio to read text from ebooks. Paste in the parts of the ebook, generate natural sounding speech then convert it using a voice model. The tts is the bottleneck, preventing it from sounding indistinguishable from a real speaker. What's the most natural, convincing tts? Preferably with some UI so it's easy to copy paste an entire book in there and generate audio files that can be voice transferred.

@DataJuggler a month ago

1:21 It sounds good, but it sounds like there is a deliberate pause between words that doesn't sound natural. I think you would have to speed it up some to sound better.

@kthalas 28 days ago

It sounds like Jordan Peterson lol

@ElaraArale a month ago

Man, you rock, thanks for the knowledge, for real.

@romanochelommmiii1526 25 days ago

Wow, amazing bro! Please, a web UI for the audiobooks would be really great. Thanks for your work

@dolboeb-tz4bw a month ago

Can I use a CPU or an AMD graphics card for generation? Could you teach me how to train models for other languages?

@vuquangtruong5950 a month ago

Awesome bro. Good tutorial!

@VaibhavShewale a month ago

ooh man, this is so amazing and clean!

@Katsumi_Maki a month ago

The 0 shot performance omg

@635574 a month ago

Can we use this for game audio, or what is the license? I'm interested in making a game where one of the two MCs is an AI bot.

@TeamDman a month ago

wow that cloning is crazy

@Random_person_07 20 days ago

I think E2 sounds better with emotions and voice. F5 might need more training, but it's still pretty good

@IvarDaigon a month ago

You forgot to mention the best part.. it's only 1.4 GB unquantized, so it's ripe for optimization in on-device use cases.

@Jarods_Journey a month ago

Not only that, it does inference on less than 5 GB of VRAM :)!

@devon9374 21 days ago

@Jarods_Journey *Rubs hands like Birdman* We need an alternative to ElevenLabs

@jonascale a month ago

Dude! How are you getting that to run without using the web interface? I have been trying to figure this out for days, ever since I first saw your video. I would love to integrate this into my own project, but I must be missing something. Any chance you could do a deeper dive on this? This is the lowest latency that I have come across so far, and I think it would be perfect to give my own waifu a decent and responsive voice.

@derjungejesusjunge2047 a month ago

Did you try w-okada's new TTS?

@flykiller a month ago

Did you also check GPT-SoVITS TTS? I just found it after some new TTS comparison posts with F5-TTS appeared on Reddit. It flew under my radar, but the V2 version looks very good in the comparison examples. I couldn't check it out much, but it seems that fine-tuning is very fast and the resulting model output is also small, like RVC models.

@Jarods_Journey a month ago

I'm in the same boat as you, I probably saw the same comparison. I had tried out this repo when it released, but both the demo and results were not impressive. Given the much better demo and display of its capabilities with FT, I'm gonna make my way back around to it

@JieTie 17 days ago

It would be really cool if the software had a feature to change already recorded audio to sound like sample audio :) Do you know any open source AI software that could do that? :) I know there is RVC, but you have to have a trained model first, and that model requires ~15 min of audio.

@nekomiruku a month ago

I just want a voice mod that is a bit higher than my actual voice, because it's hard to keep it at a higher pitch without my throat getting tired

@ephimp3189 a month ago

why did it mess up the "you" tone in "I don't want to hear you" while it was correctly saying "you" in 2 previous segments?

@Jarods_Journey a month ago

Still not perfect in all cases, we leave that up to probability

@AlgorithmInstituteofBR a month ago

Try Pinokio, homie... that command line looks like a nightmare. Salute!

@kait3n10 a month ago

With this tech, can you make Melina sound angry or sad if you just have her normal speech sample?

@Jarods_Journey a month ago

This is something I'm looking into

@4.0.4 a month ago

If making a webui is a lot of work, you can just make a PR for some webui that's already out there

@Jarods_Journey a month ago

I believe fakery has made one, he's pretty active so for inference, I don't have to get anything up and going.

@BloodyLiFe255 a month ago

Amazing man, thank you for sharing

@stevewarby12 25 days ago

Is this to be added to the audiobook as a voice option?

@hiddendrifts a month ago

imagine the day we get an open source version of openai's advanced voice mode. the age of the ai waifu dawns upon us

@keltyll 29 days ago

Next year I bet, let's be patient

@ayanshproplayer5559 a month ago

Not new but i think a emotional voice that can't create positive and push forward to in bad (Not Sleep) time

@silentswitch1309 26 days ago

Hi, I was wondering if you can make a new updated tutorial for the audiobook... there's a lot of scattered info from old to new versions on the channel, and I've just been surfing your channel for the past hour fully confused... I just want to make a customized audiobook😭😭😭

@jeffwads a month ago

By the way, Tortoise TTS always did emotions as well. You just had to put (sad) (happy) in the text to get it. Been around for years now.

@Jarods_Journey a month ago

Only issue is those were suggestions to the model; this is much more controllable. With Tortoise you could also train anger tokens and whisper tokens into it through explicit data labeling. The beauty here is you don't need that explicit labeling to control the output

@azaharia10 25 days ago

That sounds mind-blowingly good. Can I use celebrity models like Ariana Grande or Cat Valentine with F5-TTS?

@HadrianAibe 18 days ago

How big can the ref_audio file be? 10 seconds? 50 seconds?

@jujjuj7676 a month ago

We need whisper voice training!!! Not sad or angry. In normal usage, whispering or speaking quietly is used more than both angry and sad.. it's just more useful for conveying points....

@Jarods_Journey a month ago

I'll have to see what this model's whispering capabilities are

@jujjuj7676 a month ago

@Jarods_Journey Good luck. I have tried, and it's almost impossible, but maybe you've got better ninja skills..👍

@lanaferraii2184 a month ago

I also need a calm relaxing voice, would be highly appreciated if you find one ❤

@Making_Random_Edits 4 days ago

Can this work with text generation? Like, you know, the AI chat characters?

@trashboatex 25 days ago

Damn we just need some ethical voice choices you know, like professional grade voices we can clone and use without running into copyright.

@udayrajpatel9048 a month ago

Fantastic tutorial ❤❤❤❤❤❤

@VikashKumar-t4g5v 27 days ago

Can you suggest any cloud GPU provider that lets us attach our microphones? Please make a video on it: using a realtime voice changer through a cloud GPU, without an RTX card, and with less delay.

@gumvue.studio 18 days ago

How do you use the emotions?

@SavvyStaks a month ago

I can't use it. Please make a detailed video on it. I am trying to run it in the Lightning AI code editor, but it's not working

@Jarods_Journey a month ago

I can probably draft up a quick tutorial on how to use it in the terminal, but no in-depth webui or package

@johnovercash1798 a month ago

Can you make a video that starts from the beginning, not from the middle?

@notnotandrew 28 days ago

If only you could give it a natural language instruction with regard to the tone of voice, or if there was some way to annotate subsections of the provided text with particular emotions, tones, pitches, timbres, timings, etc.

@Happynut72 27 days ago

Is there a way to bring down the VRAM usage? I have 16 GB and it runs at 100%.

@EditorLue a month ago

Bro, make a video on training F5-TTS on custom data. And bro, can you make a video on how to make a unique voice that does not exist by blending multiple voices using AI like VITS and F5-TTS

@dragon3602010 28 days ago

Is it available in other languages like French?

@Entity303GB 28 days ago

Only English and Chinese :((((

@OrsotarBarr a month ago

First voice sounds like Maria from Silent Hill 2

@hdgdhnxbdx1619 a month ago

Wait, was that Badger from Breaking Bad? LMFAO

@mtaliamino7169 a month ago

Hey buddy, how can I contact you?

@svenbjorn9700 a month ago

Are there any GUI solutions for audiobook-length TTS? Something accessible to normies?

@Jarods_Journey a month ago

The project I'm working on segments text files into smaller chunks that can be used to create an audiobook. Segmenting is the only way right now because long context audio generation would take too much compute and too much time
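
The segmenting step described above can be sketched roughly as follows. This is only an illustration of the general idea, not the code from the audiobook project: the function name, the regex-based sentence split, and the 200-character limit are all assumptions made for the example.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut
    mid-sentence, so a chunk can exceed the limit in that case.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = split_into_chunks("One. Two. Three.", max_chars=9)
# chunks == ["One. Two.", "Three."]
```

Each chunk would then be sent to the TTS model separately and the resulting audio clips joined in order. Real book text needs a more careful splitter (abbreviations, quotes, dialogue), which this sketch ignores.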

@svenbjorn9700 a month ago

@Jarods_Journey Looks promising, but requires ultra-not-normie stuff like manually installing and configuring an entire separate alpha-stage GitHub project (a >20min technical tutorial video from you and you're already experienced, literally inconceivable for normies), plus you have a note saying progress was paused a year ago :/ I will pay money ($5-$20) for a one-click installer if you ever complete this. I feel like there's a huge fucking market for this, but all the finished products get greedy and do subscription services so those are the only things that exist.

@EDashMan a month ago

What would you say is the best text to speech model to try out? Traditionally, when using AI models, I've had to use Google text to speech first and then convert speech to speech into the AI model's voice, and then it's not as clear or accurate sounding

@Jarods_Journey a month ago

Mmph, right now, I'd say to try out this model, F5-TTS. It's pretty good for starters. Then it would be XTTS, then StyleTTS 2, then Tortoise

@EDashMan a month ago

@Jarods_Journey What model do you think Character AI uses? I'd love to try the model, it's pretty good at instant voice cloning with only 10 seconds of audio

@vickmackey24 a month ago

Inflection points are wrong. Good, but not as good as OpenAI's advanced voice mode.

@bikgrow a month ago

We want WebUi for this model soon!!

@siddarth26 a month ago

Brother we are waiting for your webui

@gaeonx a month ago

Do you know if there's any way that i can use an ai voice model (pth file) to do something like TTS like this? or if it's even possible??

@Jarods_Journey a month ago

Well, this uses its own model architecture, and its weights are saved in .pt files. Models from other architectures are not compatible even if they have the same file type, due to fundamental differences in the code

@l4l01234 a month ago

Just take the output audio files from this and plug them into RVC (audio-to-audio) to use any of the existing RVC models

@635574 a month ago

But which part of this does OP want? Going from a short sample to a good voice, or the voice itself for other TTS or even a realtime voice changer?

@LOC-Ness a month ago

can this do voice to voice though

@me-cm8or a month ago

Damn that’s so good. Does it only support English?

@me-cm8or a month ago

Just checked their GitHub. Sadly it’s only English/Chinese. To train another language you basically need to have over 10k hours or something 💀

@Jarods_Journey a month ago

Only English and Chinese right now

@bossgd100 a month ago

How fast is it? Can it work in live / streaming mode?

@Jarods_Journey a month ago

Yes, it's pretty fast, 1 second inference for like 20 seconds of audio on my 4090
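
To put that figure in more familiar terms, here is the arithmetic as a small sketch. It uses only the rough numbers quoted above (about 1 second of compute for about 20 seconds of audio on a 4090); the function name is made up, and the result says nothing about first-chunk latency, which is what matters most for true streaming.

```python
def real_time_factor(inference_seconds, audio_seconds):
    """Seconds of compute needed per second of generated audio (lower is faster)."""
    return inference_seconds / audio_seconds


# Rough figures from the comment above: ~1 s of inference for ~20 s of audio.
inference_seconds = 1.0
audio_seconds = 20.0

rtf = real_time_factor(inference_seconds, audio_seconds)
speedup = audio_seconds / inference_seconds

print(f"RTF ~ {rtf:.2f}, about {speedup:.0f}x faster than real time")
```

Anything with an RTF well below 1 can in principle keep up with playback, provided the text is fed in chunks small enough that the first chunk returns quickly.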

@Sitki-w4n 22 days ago

can we have this with webui

@Cloudwalker2k3 a month ago

I may have to ask for a tutorial on installing this.

@steve-g3j6b a month ago

What is the audiobook maker? :D

@Jarods_Journey a month ago

A project I'm working on to make audiobooks lol

@steve-g3j6b a month ago

@Jarods_Journey 🤤🤤

@analia390 a month ago

you, monster! (Ariel)

@HushHunt a month ago

Can I install it on my PC? i3 11th gen, no graphics card

@finalblast3825 a month ago

You can use the CPU, but it will be unbearably slow. I suggest getting any NVIDIA card with at least 8 GB of VRAM for this text to speech to work. I am using a GTX 1080 with 8 gigs and it works great, with only up to 20 seconds of waiting time to generate.

@anagnorisis2024 a month ago

does this work with webui like xtts?

@Jarods_Journey a month ago

It would need to be built by someone, but looks like fakery has a gradio app up for the repo

@onlyyoucanstopevil9024 a month ago

AWESOME 😊😊😊

@PatchworxStudios a month ago

Ok, everything nice and dandy. But AI is evolving too fast to get a user-friendly version or community. I have so many Pythons in my VRAM, it's like I am a reptile zoo. Please ignore all the Llamas. I am tired.

@GettingMiggyWithIt 29 days ago

holy moley

@pragmata7997 a month ago

amazing

@visual_rev6192 29 days ago

The beginning was pretty bad; I thought both the sad and the angry samples were terrible. But hey, we're off to a good start, in the sense that we never had this stuff before. It's still nowhere near as good as the real thing, but it's making good progress. Can't wait for 2030 or even 2027 to see how much better the quality gets

@Jarods_Journey 29 days ago

As they say, it's the worst it'll ever be