Agreed! This sort of TTS power with an Audiobook maker would be epic!
@genshinfinityАй бұрын
We can use the angry AI voice to shout to robbers or trespassers. Just need to hook it to a motion sensor / cctv
@Jarods_JourneyАй бұрын
😂 Brilliant idea
@CheckTheWikiАй бұрын
Or clone your own angry voice and have the motion sensor tell your pets to get off of somewhere they aren't allowed when you aren't home.
@jessequarteyАй бұрын
These are really brilliant.
@maxlikessnacks123Ай бұрын
Or.... just record a few lines of you saying "stop it" or "i'll call the cops" and play that when the motion sensor gets activated. No ai required.
@jessequarteyАй бұрын
@maxlikessnacks123 what if you want a context aware response.
@RobertJeneАй бұрын
1:45 LOL melina. one time I downloaded all the Melina voice lines to my phone and listened to them + talked like her for an hour while I got into costume and makeup, then did a parody of her for stingers when I live-streamed Elden Ring
@Jarods_JourneyАй бұрын
Method acting at its finest 😂
@bmatt2626Ай бұрын
Yeah! This sort of expression in AM is the dream. I've got a whole 3d puppetry / rendering / sfx / scoring pipeline with a voice-shaped hole in it. Thanks for digging into all this.
@namuzedАй бұрын
lol, I've been in the same boat. xtts v2 hasn't been cutting it due to lack of emotion. I tried pairing it with RVC which gave improvement, but I'm not much of a voice actor. This seems like the key.
@SttravagaNZzaАй бұрын
What are you trying to do? What tools are you using?
@bmatt262613 күн бұрын
@@SttravagaNZza Film dialogue eventually. I'm currently using Blender, CC4/iClone, Embergen, Liquigen, Resolve/Fusion. Audio side is Reaper, Divisimate, RapidComposer, Vienna Ensemble, various instrument and SFX libraries.
@JonnyCrackersАй бұрын
The inflection is still all over the place and it places emphasis on the wrong words. Until they can figure out how to make TTS get that right, it's always going to sound weird.
@Jarods_JourneyАй бұрын
It's getting there alright, progress seems to be steady
@n_n_hapiАй бұрын
Holy smokes, that Melina was amazing
@notnotandrew28 күн бұрын
Based on only a 4-second audio clip! I think the word "inclined" was the only word where it slightly broke character.
@OnlyJavascriptАй бұрын
Wow, I just subscribed! I've recently started exploring TTS and how it works, and I finally found a guru!
@Jarods_JourneyАй бұрын
Glad to have you :)
@shawn4990Ай бұрын
Wow... I had mentioned 'emotion' in an earlier post and bam! I had a feeling these models were in the works.
@robertotomasАй бұрын
I am also interested in training it with foreign languages. Do you know how much RAM it would take? Do the samples have to be well labelled (emotionally)?
@jjjbeastfdАй бұрын
Please expand on this more and make a website too. I'm very interested in this, for sure! You're the best, Jarod!
@rodrigovieirastudiesАй бұрын
Wow!!! This would be a fabulous addition to the audiobook maker!
@BloodyLiFe255Ай бұрын
I will use it in my work and most probably make audiobooks out of it. When I make it (will take some time), I'll tag you.
@rodrigovieirastudiesАй бұрын
@@BloodyLiFe255 , thanks! That's very kind.
@Random_person_07Ай бұрын
Holy crap, the zero-shot is so good, better than GPT-SoVITS and XTTS, and it doesn't even take long to run inference. This is awesome!
@ponywarrioryt27 күн бұрын
Wow that's insane how good this is for something you can run locally! Would love a webui of this.
@megamayo2500Ай бұрын
Finally, a competitor to Bark AI. Bark was the only AI app that could do this back then. From what you describe of its architecture, it sounds like a standalone build based on VoiceCraft, which is known to be highly efficient at voice cloning directly from audio. The only catch is that part of the code is Linux-based. I look forward to your WebUI contribution, although I'm curious about how these models can be trained. Overall, amazing.
@GraveUypoАй бұрын
being linux based is a plus to me.
@tierdropp7544Ай бұрын
thats literally based
@metalmassacrefilmsАй бұрын
So it wouldn't work in Windows? As long as I can run in a Ubuntu sort of VM, it's fine for me.
@nodewizardАй бұрын
Very good quality! Progress being made in TTS.
@joseeduardobolisfortesАй бұрын
Years ago, I tried to start a project called "Vox Render", to render audio using the same principles used for image rendering: creating the audio from fundamental sinusoids and harmonics formed by the resonant chamber, but I didn't get very far due to hardware limitations at the time and never got around to it. One of the things I intended to do was create a markup language to add accents and emotional tones to the text that was to be rendered; I now realize that this kind of language would be perfect for these text-to-speech tools.
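A markup language like the one described above could be sketched roughly as follows. This is a minimal illustration only: the `[emotion]...[/emotion]` tag syntax, the `parse_emotion_markup` function, and the `"neutral"` default are all hypothetical and are not part of "Vox Render" or of any TTS tool mentioned in this thread.

```python
import re

# Hypothetical tag syntax: [angry]some text[/angry]. Tags may not be nested.
TAG_PATTERN = re.compile(r"\[(\w+)\](.*?)\[/\1\]", re.DOTALL)


def parse_emotion_markup(text: str, default: str = "neutral") -> list[tuple[str, str]]:
    """Split marked-up text into (emotion, text) segments in reading order."""
    segments: list[tuple[str, str]] = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        # Untagged text before this tag gets the default emotion.
        plain = text[pos:match.start()].strip()
        if plain:
            segments.append((default, plain))
        segments.append((match.group(1), match.group(2).strip()))
        pos = match.end()
    # Untagged text after the last tag.
    tail = text[pos:].strip()
    if tail:
        segments.append((default, tail))
    return segments


script = "Hello there. [angry]Get off the couch![/angry] Good dog."
for emotion, line in parse_emotion_markup(script):
    print(f"{emotion}: {line}")
```

Each segment could then be sent to a TTS model together with a reference clip recorded in the matching emotion; how that hand-off works depends entirely on the tool being used.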
@TheWebgeckoАй бұрын
Looking forward to when it is eventually able to understand tone in relation to sentence structure
@morpheusnotesАй бұрын
Holy Moly! This is 11labs quality right there.
@4.0.4Ай бұрын
Nah it's good but not THAT good... yet.🤞🏻
@wickedjuttoАй бұрын
Absolutely wild! Thanks for sharing!
@RejektsАй бұрын
I'm adding this to my voice cloning webui for sure! real nice
@hamburger--friesАй бұрын
This is very easy to add to any web app or mobile app. I just made a simple app to allow a writer to upload a PDF and the app outputs a .wav file.
@SuperUniqueHandle27 күн бұрын
This seems like a huge step up in open source TTS!
@sinayagubi8805Ай бұрын
Please make a tutorial on how to finetune it on a language it doesn't support, maybe on Runpod. That would be amazing since that's where the money is. Edit: Subscribed.
@TheZaky8013 күн бұрын
We would appreciate it if you made another video on training this model in different languages. This would be very helpful.
@metalmassacrefilmsАй бұрын
How many voices are there? And do we have commercial rights to use them? I used to subscribe to Altered, which also had text-to-speech with emotion, but they are very expensive and the emotions are not as good as this one. They also treated me badly as a customer. I would be glad to change to another TTS solution. Open source or free, as long as I have commercial use.
@dthSinthorasАй бұрын
Appreciate you covering the other languages thing! :)
@mumarr4690Ай бұрын
This is exactly what I'm looking for. I'm currently making a tool to create translated dubbing from a video, and right now I use Microsoft TTS, which is fast but doesn't have "emotion".
@fulldivemediaАй бұрын
Damn, that is awesome if you can make it work with Tortoise and RVC together in the audiobook maker.
@nigeldoggАй бұрын
Future meme: “she said with disgust”
@DanIel-fl1vcАй бұрын
I use Applio to read text from ebooks. Paste in the parts of the ebook, generate natural sounding speech then convert it using a voice model. The tts is the bottleneck, preventing it from sounding indistinguishable from a real speaker. What's the most natural, convincing tts? Preferably with some UI so it's easy to copy paste an entire book in there and generate audio files that can be voice transferred.
@DataJugglerАй бұрын
1:21 It sounds good, but it sounds like there is a deliberate pause between words that doesn't sound natural. I think you would have to speed it up some to sound better.
@kthalas28 күн бұрын
It sounds like Jordan Peterson lol
@ElaraAraleАй бұрын
Man, you rock, thanks for the knowledge, for real.
@romanochelommmiii152625 күн бұрын
wow amazing bro please a web ui for the audio books will really be great thanks for your work
@dolboeb-tz4bwАй бұрын
Can I use a processor for generating or an AMD graphics card? Could you teach me how to train models for other languages?
@vuquangtruong5950Ай бұрын
Awesome bro. Good tutorial!
@VaibhavShewaleАй бұрын
ooh man, this is so amazing and clean!
@Katsumi_MakiАй бұрын
The 0 shot performance omg
@635574Ай бұрын
Can we use this for game audio, or what is the license? I'm interested in making a game where one of two MCs is an AI bot.
@TeamDmanАй бұрын
wow that cloning is crazy
@Random_person_0720 күн бұрын
I think E2 sounds better with emotions and voice. F5 might need more training, but it's still pretty good.
@IvarDaigonАй бұрын
You forgot to mention the best part... it's only 1.4GB unquantized, so it's ripe for optimization in on-device use cases.
@Jarods_JourneyАй бұрын
Not only that, it does inference on less than 5GB of VRAM :)!
@devon937421 күн бұрын
@@Jarods_Journey *Rubs hands like Birdman* We need an alternative to ElevenLabs
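The "1.4GB unquantized" figure above allows a rough back-of-envelope estimate of what quantization might save. This sketch assumes the checkpoint stores only 32-bit float weights and that "GB" means 10^9 bytes; neither assumption is confirmed in the thread, and the helper names are made up for illustration.

```python
GB = 1e9  # assumption: decimal gigabytes


def estimate_param_count(checkpoint_bytes: float, bytes_per_param: int = 4) -> float:
    """Estimate the parameter count from file size, assuming fp32 weights only."""
    return checkpoint_bytes / bytes_per_param


def size_in_gb(param_count: float, bytes_per_param: int) -> float:
    """Size of the weights alone at a given precision."""
    return param_count * bytes_per_param / GB


params = estimate_param_count(1.4 * GB)
print(f"estimated parameters: ~{params / 1e6:.0f}M")
for name, width in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: ~{size_in_gb(params, width):.2f} GB")
```

Under those assumptions the weights would shrink to roughly half at fp16 and a quarter at int8. The VRAM figure quoted above is a separate number, since inference also needs memory for activations and any other components loaded alongside the model.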
@jonascaleАй бұрын
Dude! How are you getting that to run without using the web interface? I have been trying to figure this out for days since I first saw your video. I would love to integrate this into my own project but I must be missing something. Any chance you could do a deeper dive on this? This is the lowest latency that I have come across so far, and I think it would be perfect to give my own waifu a decent and responsive voice.
@derjungejesusjunge2047Ай бұрын
Did you try w-okada's new TTS?
@flykillerАй бұрын
Did you also check GPT-SoVITS TTS? I just found this after some new TTS comparison posts appeared on reddit with F5-TTS. It flew under my radar but V2 version looks very good on comparison examples. Couldn't check it out much but it seems that finetuning is very fast and resulting model output is also small like RVC models.
@Jarods_JourneyАй бұрын
I'm in the same boat as you, I probably saw the same comparison. I had tried out this repo when it released, but both the demo and results were not impressive. Given the much better demo and display of its capabilities with FT, I'm gonna make my way back around to it.
@JieTie17 күн бұрын
It would be really cool if the software had a feature to change already-recorded audio to sound like the sample audio :) Do you know any open-source AI software that could do that? :) I know there is RVC, but you have to have a trained model first, and that model requires ~15 min of audio.
@nekomirukuАй бұрын
I just want a voice mod that is a bit higher than my actual voice, because it's hard to keep a higher pitch without my throat getting tired.
@ephimp3189Ай бұрын
why did it mess up the "you" tone in "I don't want to hear you" while it was correctly saying "you" in 2 previous segments?
@Jarods_JourneyАй бұрын
Still not perfect in all cases, we leave that up to probability
@AlgorithmInstituteofBRАй бұрын
Try Pinokio, homie... that command line looks like a nightmare. Salute!
@kait3n10Ай бұрын
With this tech, can you make Melina sound angry or sad if you just have her normal speech sample?
@Jarods_JourneyАй бұрын
This is something I'm looking into
@4.0.4Ай бұрын
If making a webui is a lot of work, you can just make a PR for some webui that's already out there
@Jarods_JourneyАй бұрын
I believe fakery has made one, he's pretty active so for inference, I don't have to get anything up and going.
@BloodyLiFe255Ай бұрын
Amazing man, thank you for sharing
@stevewarby1225 күн бұрын
Is this to be added to the audiobook as a voice option ??
@hiddendriftsАй бұрын
imagine the day we get an open source version of openai's advanced voice mode. the age of the ai waifu dawns upon us
@keltyll29 күн бұрын
Next year I bet, let's be patient
@ayanshproplayer5559Ай бұрын
Not new but i think a emotional voice that can't create positive and push forward to in bad (Not Sleep) time
@silentswitch130926 күн бұрын
Hi, I was wondering if you can make a new, updated tutorial for the audiobook... there's a lot of scattered info from old to new versions on the channel, and I've just been surfing your channel for the past hour, fully confused... I just want to make a customized audiobook😭😭😭
@jeffwadsАй бұрын
By the way, Tortoise TTS always did emotions as well. You just had to put (sad) (happy) in the text to get it. Been around for years now.
@Jarods_JourneyАй бұрын
Only issue is those were suggestions to the model; this is much more controllable. With Tortoise you could also train anger tokens and whisper tokens into it by explicit data labeling. The beauty here is you don't need that explicit labeling to control the output.
@azaharia1025 күн бұрын
That sounds mind-blowingly good. Can I use celebrity models like an Ariana Grande or Cat Valentine model for F5-TTS?
@HadrianAibe18 күн бұрын
How big can the ref_audio file be? 10 seconds? 50 seconds?
@jujjuj7676Ай бұрын
We need whisper voice training!!! Not sad or angry. In normal usage, whispering or speaking quietly is used more than both angry and sad... it's just more useful. To convey points....
@Jarods_JourneyАй бұрын
I'll have to see what this model's whispering capabilities are
@jujjuj7676Ай бұрын
@@Jarods_Journey Good luck. I have tried, and it's almost impossible, but maybe you've got better ninja skills..👍
@lanaferraii2184Ай бұрын
I also need a calm relaxing voice, would be highly appreciated if you find one ❤
@Making_Random_Edits4 күн бұрын
Can this work on text-generation? Like u know the AI chat characters?
@trashboatex25 күн бұрын
Damn we just need some ethical voice choices you know, like professional grade voices we can clone and use without running into copyright.
@udayrajpatel9048Ай бұрын
Fantastic tutorial ❤❤❤❤❤❤
@VikashKumar-t4g5v27 күн бұрын
Can you suggest any cloud GPU provider that lets us attach our microphones? Please make a video on it: using a realtime voice changer through a cloud GPU, without an RTX card and with less delay.
@gumvue.studio18 күн бұрын
How do you use the emotions?
@SavvyStaksАй бұрын
I can't use it, please make a detailed video on it. I am trying to run it in the Lightning AI code editor but it's not working.
@Jarods_JourneyАй бұрын
I can probably draft up a quick tutorial on how to use it in the terminal, but no in-depth webui or package
@johnovercash1798Ай бұрын
Can you make a video that starts from the beginning, not from the middle?
@notnotandrew28 күн бұрын
If only you could give it a natural language instruction with regard to the tone of voice, or if there was some way to annotate subsections of the provided text with particular emotions, tones, pitches, timbres, timings, etc.
@Happynut7227 күн бұрын
Is there a way to bring down the vram usage. I have 16gb and it runs at 100%.
@EditorLueАй бұрын
Bro, make a video on training F5-TTS on custom data. And can you make a video on how to make a unique voice that does not exist by blending multiple voices using AI like VITS and F5-TTS?
@dragon360201028 күн бұрын
Is it available in other languages like French?
@Entity303GB28 күн бұрын
Only English and Chinese :((((
@OrsotarBarrАй бұрын
First voice sounds like Maria from Silent Hill 2
@hdgdhnxbdx1619Ай бұрын
Wait, was that Badger from Breaking Bad LMFAO
@mtaliamino7169Ай бұрын
Hey buddy how can I contact you ?
@svenbjorn9700Ай бұрын
Are there any GUI solutions for audiobook-length TTS? Something accessible to normies?
@Jarods_JourneyАй бұрын
The project I'm working on segments text files into smaller chunks that can be used to create an audiobook. Segmenting is the only way right now because long context audio generation would take too much compute and too much time
@svenbjorn9700Ай бұрын
@@Jarods_Journey Looks promising, but requires ultra-not-normie stuff like manually installing and configuring an entire separate alpha-stage github project (a >20min technical tutorial video from you and you're already experienced, literally inconceivable for normies), plus you have a note saying progress was paused a year ago :/ I will pay money ($5-$20) for a one-click installer if you ever complete this. I feel like there's a huge fucking market for this, but all the finished products get greedy and do subscription services so those are the only things that exist.
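The segmenting approach mentioned above (splitting a text file into smaller chunks before synthesis) can be sketched as follows. This is a minimal illustration, not the audiobook maker's actual code: the function name `split_into_chunks`, the sentence-boundary regex, and the 200-character limit are all assumptions.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # Either it still fits, or a single over-long sentence
            # becomes its own chunk rather than being cut mid-sentence.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


book = "It was a dark night. The wind howled! Who was there? Nobody answered."
for index, chunk in enumerate(split_into_chunks(book, max_chars=40)):
    print(index, chunk)
```

Each chunk would then be passed to the TTS model separately and the resulting audio clips joined in order. Splitting on sentence boundaries keeps each clip's intonation self-contained, at the cost of losing context across chunks.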
@EDashManАй бұрын
What would you say is the best text-to-speech model to try out? Traditionally, when using AI models, I’ve had to use Google text-to-speech first and then convert speech-to-speech into the AI model's voice, and then it’s not as clear or accurate sounding.
@Jarods_JourneyАй бұрын
Mmph, right now, I'd say to try out this model, F5TTS. It's pretty good for starters. Then would be xtts, then styletts2, then tortoise
@EDashManАй бұрын
@@Jarods_Journey what model do you think character ai uses? I’d love to try the model, it’s pretty good at instant voice clone with only 10sec of audio
@vickmackey24Ай бұрын
Inflection points are wrong. Good, but not as good as OpenAI's advanced voice mode.
@bikgrowАй бұрын
We want WebUi for this model soon!!
@siddarth26Ай бұрын
Brother we are waiting for your webui
@gaeonxАй бұрын
Do you know if there's any way that i can use an ai voice model (pth file) to do something like TTS like this? or if it's even possible??
@Jarods_JourneyАй бұрын
Well, this uses its own model architecture, and its weights are saved in .pt files. Models from other architectures are not compatible even if they have the same file type, due to fundamental differences in the code.
@l4l01234Ай бұрын
Just take the output audio files from this and plug it into RVC (audio-to-audio) to use any of the existing RVC models
@635574Ай бұрын
But which part of this does OP want? the short sample to a good voice or the voice itself for other TTS or even realtime voice changer?
@LOC-NessАй бұрын
can this do voice to voice though
@me-cm8orАй бұрын
Damn, that’s so good. Does it only support English?
@me-cm8orАй бұрын
Just checked their GitHub sadly it’s only English/ Chinese. To train another language you basically need to have over 10k hours or something 💀
@Jarods_JourneyАй бұрын
Only English and Chinese right now
@bossgd100Ай бұрын
How fast is it? Can it work in live / streaming mode?
@Jarods_JourneyАй бұрын
Yes, it's pretty fast, 1 second inference for like 20 seconds of audio on my 4090
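The speed quoted above (about 1 second of inference for about 20 seconds of audio) can be expressed as a real-time factor, which makes it easier to estimate longer jobs. The numbers below are the approximate ones from the comment, measured on a 4090; the function names are illustrative, and the estimate assumes speed scales linearly with audio length, which may not hold exactly.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute needed per second of generated audio (lower is faster)."""
    return inference_seconds / audio_seconds


def estimated_compute_seconds(audio_seconds: float, rtf: float) -> float:
    """Rough compute time for a given amount of audio at a fixed real-time factor."""
    return audio_seconds * rtf


rtf = real_time_factor(inference_seconds=1.0, audio_seconds=20.0)
print(f"real-time factor: {rtf:.2f}")  # values below 1.0 are faster than real time

one_hour = 60 * 60
print(f"one hour of audio: ~{estimated_compute_seconds(one_hour, rtf) / 60:.0f} minutes of compute")
```

A real-time factor well below 1.0 is what makes live or streaming use plausible, although actual latency also depends on how long the model takes to produce the first chunk of audio.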
@Sitki-w4n22 күн бұрын
can we have this with webui
@Cloudwalker2k3Ай бұрын
I may have to ask for a tutorial on install of this.
@steve-g3j6bАй бұрын
What is the audiobook maker :D
@Jarods_JourneyАй бұрын
A project I'm working on to make audiobooks lol
@steve-g3j6bАй бұрын
@@Jarods_Journey 🤤🤤
@analia390Ай бұрын
you, monster! (Ariel)
@HushHuntАй бұрын
Can I install it on my PC? It has an 11th-gen i3 and no graphics card.
@finalblast3825Ай бұрын
You can use CPU but it will be unbearably slow. I suggest obtaining any nvidia card above 8 GB VRAM for this text to speech to work. I am using a GTX 1080 with 8 gigs and it works great with only up to 20 seconds of waiting time to generate.
@anagnorisis2024Ай бұрын
does this work with webui like xtts?
@Jarods_JourneyАй бұрын
It would need to be built by someone, but looks like fakery has a gradio app up for the repo
@onlyyoucanstopevil9024Ай бұрын
AWESOME 😊😊😊
@PatchworxStudiosАй бұрын
OK, everything's nice and dandy. But AI is evolving too fast to get a user-friendly version or community. I have so many Pythons in my VRAM, it's like I'm a reptile zoo. Pls ignore all the Llamas. I am tired.
@GettingMiggyWithIt29 күн бұрын
holy moley
@pragmata7997Ай бұрын
amazing
@visual_rev619229 күн бұрын
The beginning was pretty bad; I thought both the sad and the anger were terrible. But hey, we're off to a good start, in the sense that we never had this stuff before. It's still nowhere near as good as the real stuff, but it's making good progress. Can't wait for 2030 or even 2027 to see how much better the quality gets.
@Jarods_Journey29 күн бұрын
As they say, it's the worst it'll ever be
@edderuizАй бұрын
A model for spanish please ?
@omargoodman2999Ай бұрын
Meh, Evil Neuro still sounds more expressive and natural, I think.
@tylerboy19ypАй бұрын
is it finetuneable?
@Jarods_JourneyАй бұрын
I believe the authors said it's possible. IDK how to go about that rn though, so that's for a future investigation
@abhinavbisht9851Ай бұрын
How do you install it and use it...?
@Jarods_JourneyАй бұрын
You can follow their github to get the requirements, but I won't be creating a tutorial for it for awhile
@abhinavbisht9851Ай бұрын
@@Jarods_Journey yes I did...
@researchandbuild1751Ай бұрын
That reference audio is already pretty bad lol
@GraveUypoАй бұрын
idk it sounds extremely artificial to me [edit] well the elden ring example sounded really good. i guess you need a good sample to get good results
@Jarods_JourneyАй бұрын
I'd say Melina's voice is probably closer to content in the source training (audiobooks and podcasts) which helps a lot.