I think I figured how to clone (almost) any language in Tortoise TTS

4,297 views

Jarods Journey

A day ago

Comments: 54
@puntogcb 8 months ago
I really appreciate all the work you put into this, and that you share it for those of us who want to learn without knowing how to code, as in my case. 🙌 Super grateful to learn from what you do. That "accent" is pure Spain Spanish, and it sounds very good. A Spaniard could confirm it better. 😉 I watch your videos more than once to try to understand the process, do it myself, and teach my machine how to train with my Argentine-accented voice.
@sadelcri 8 months ago
Don't be so sure, it doesn't sound 100% Castilian; with so much seseo it seems to have a Latin American accent.
@rolgnav 8 months ago
Can't wait, I would love to make a model for my language, which hasn't been done yet. Will definitely donate to help the cause once it's ready. Thanks.
@bomar920 8 months ago
Thank you 🙏. I can't wait until you get everything sorted and it becomes easily accessible to us. You are the GOAT 🐐
@Jscott1314 8 months ago
Amazing job man, you're a genius. Love to finally see my language here.
@WorldYuteChronicles 8 months ago
Amazing! Thanks for your work here again. I will probably be using this in future projects. Thanks for providing these dataset tools for Tortoise on your GitHub as well.
@sataz101 8 months ago
Thank you so much Jarod for this invaluable experience and for sharing, especially for a newbie like me 😅❤
@juanpablothummler 7 months ago
Incredible, thank you for sharing.
@ElmorenohWTF 8 months ago
As a native Spanish speaker, this sounds amazing. And if you add RVC to it, it would sound even better. I'm looking forward to being able to train my own model in Spanish with my voice in tortoise-tts. As always, great job.
@felipemartinez4379 8 months ago
How are you going to train it, friend? I'm going to do the same.
@Lv7-L30N 8 months ago
I speak Spanish and Mel's voice sounds really good in Spanish... thank you.
@alialikhanov8383 8 months ago
Hi, Jarod. Can you give me advice on creating a tokenizer for the Uzbek language?
@miyrrecs3024 8 months ago
I'm really surprised by this update working in other languages 👍. I remember a while back I tried TTS with Japanese-to-Japanese generation, but I only got glitches.
@Dj-vt5gr 8 months ago
There is a free ebook text-to-speech utility called Balabolka, which uses speech models via the Microsoft SAPI interface. Is there a way to call Tortoise/RVC voice models through the SAPI interface? That would be truly awesome!
@juanpablothummler 7 months ago
Good question!
@TREXYT 8 months ago
I understand a bit, but after downloading audiobooks, how do you split them into tons of audio files? Do you use a batch file to download tons of audio files? I think the steps aren't well explained. If you can help, I would highly appreciate it, thanks.
@Serve_one_another_with_love 8 months ago
Another great video bro, thank you so much. I just want to know if we can use this on YouTube or Facebook without getting a copyright strike?
@fujihita2500 8 months ago
I used the autoregressive model to generate latent files for a JP voice and it works somewhat when I use it against EN text input. The problem is that the JP samples are missing some of the sounds in EN, so the end result is subpar for certain sounds (such as "bra" being pronounced as either "bara" or "ra"). To cope, I just generate latents off EN samples of a similar speaker, then feed the output through RVC of the JP voice. So yes, I can confirm that RVC doesn't care which language you feed into it, but the TTS model does.
@Jarods_Journey 8 months ago
Yeah, RVC is language agnostic; it just converts pitch. Tortoise still infers a little bit of speech prosody from the conditioning latents, which is most likely affecting the output.
@fujihita2500 8 months ago
@@Jarods_Journey As long as it's English text input, I'm unsure what the difference is between the autoregressive model with latents and a finetuned model. After RVC, they sound very similar to me. The only thing that matters then is the quality of the RVC model, which is a lot smaller to move around and easier to train.
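[Editor's note] For readers who want to try the workflow described in this thread, here is a minimal sketch assuming the stock tortoise-tts Python API (file names and text are placeholders, not from the video): compute conditioning latents from a few reference clips, generate speech from English text, then run the output through the target speaker's RVC model as a separate step.
```python
# Sketch of the latents-then-RVC workflow (stock tortoise-tts API, placeholder files).
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()

# A few clips of an EN speaker similar to the target voice, loaded at 22.05 kHz
voice_samples = [load_audio(p, 22050) for p in ["en_ref_1.wav", "en_ref_2.wav"]]
conditioning_latents = tts.get_conditioning_latents(voice_samples)

# Generate English speech; prosody is inferred partly from the latents
gen = tts.tts_with_preset(
    "This is the English sentence to synthesize.",
    conditioning_latents=conditioning_latents,
    preset="fast",
)
torchaudio.save("tortoise_en_out.wav", gen.squeeze(0).cpu(), 24000)
# tortoise_en_out.wav would then be converted with the JP speaker's RVC model,
# which only transfers timbre/pitch and doesn't care about the language.
```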
@kubananimator 8 months ago
What if the model sometimes distorts hissing sounds? For example, she pronounces "r", "zh", "sh", "shch", "t'" badly.
@Jarods_Journey 8 months ago
I think the root cause is either an insufficient tokenizer or not enough data. The more data you can train it on, the better.
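[Editor's note] One quick way to check the tokenizer hypothesis is to encode a few words containing the problem sounds and look for unknown-token fallbacks. This is a hedged sketch assuming the custom vocabulary is a HuggingFace `tokenizers` JSON file; the path and test words are placeholders.
```python
# Sanity-check that the problem sounds are covered by the custom tokenizer.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("custom_tokenizer.json")

for word in ["жук", "шар", "щука", "речь"]:
    enc = tok.encode(word.lower())
    # Any [UNK] in the output means the tokenizer can't represent that sound,
    # and the model will likely distort it at generation time.
    print(word, enc.tokens)
```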
@9-volt247 8 months ago
Hi, Jarods! I'm 9-Volt! Nintendo's biggest fan!
@glowstorm334 8 months ago
Hi Jarod, I was wondering if you could provide the training scripts you used for preparing the dataset, as well as the one for the BPE tokenizer, and whether the dataset size is the same as before (800 hours of audio). Also, is there a way to contact you? I would like to ask a few questions regarding Tortoise, as I am also an AI engineer in the same domain. Last but most importantly, thank you for creating these videos, as no one but you explains TTS systems, especially Tortoise TTS, in such depth.
@Jarods_Journey 8 months ago
Appreciate it! If you want to get started with looking through my scripts, you can find them here: github.com/JarodMica/tortoise-dataset-tools. Since it's Tortoise specific, I'd actually prefer that this be a discussion you open up here: github.com/JarodMica/ai-voice-cloning/discussions
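[Editor's note] As a rough illustration of what such a tokenizer script does, here is a sketch using the HuggingFace `tokenizers` library. The file names, vocabulary size, and special tokens are assumptions for illustration, not Jarod's exact settings; see tortoise-dataset-tools for his actual code.
```python
# Train a small BPE tokenizer over a transcript corpus (illustrative settings).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=256,                                 # Tortoise uses a small vocabulary
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],  # tokens the Tortoise text frontend expects
)
tokenizer.train(files=["all_transcripts.txt"], trainer=trainer)
tokenizer.save("custom_tokenizer.json")
```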
@iweiteh 7 months ago
Hi Jarod, thanks for this amazing video! I am training a French voice model and came across an interesting observation in the loss function, and I wonder if you have any insights on it. I have about 40 hours of clean audio samples across 10k files with multiple speakers. Each sample is about 10-15 seconds long. When training the model (learning rates of 0.0001 and 0.00001, batch sizes of 32, 64, and 128 all yield similar results), I noticed both loss functions (i.e. loss_mel, loss_text) descended very rapidly and approached 0 after about 20 epochs. When I generate voice samples, the results are not very good. I wonder if the model is overfitting on the audio samples. I am very interested in hearing whether this is common in your experience and how you would approach addressing it. Thanks, and I really appreciate your feedback!
@lio1234234 8 months ago
This is amazing! Do you think we could use similar techniques to improve how the model follows emotive prompting, by including it in the transcriptions used in training? You know how you can put in [angry], etc.? Makes me wonder if we can also train it to handle things within the text, rather than just at the start, just like Bark does but at high quality, like with [cough] and [laugh]?
@--_Angel_-- 8 months ago
Hello! Can you help me create a tokenizer for the Russian language? I have 1,000 hours of audiobooks for it.
@radioketnoi 2 months ago
I don't understand where I went wrong. I'm training the Vietnamese language. I used about 1 hour of my voice for training and created a tokenizer with your Python file for the Vietnamese language "vi". Then I tested it with a sentence that was already in the audio samples. It produced a sound that was my voice; however, the output was meaningless, not Vietnamese at all. Please tell me where I went wrong??
@Conrox 8 months ago
Is the "whisper.json" being used during training? It's located next to the validation and training text files. The speaker I want to train has an accent and doesn't use proper grammar. Whisper made a lot of small mistakes, which I manually corrected in the train & valid text files, but now that I'm done I realized the wrong text is still present in whisper.json as well. I checked a previously trained model folder where I used the original DLAS repo; I couldn't find that whisper.json in the training folder there, but maybe it's just hidden somewhere else, or a temporary file, or is it not needed?
@Jarods_Journey 8 months ago
whisper.json is not used in training; it's only an intermediary file used to store the transcripts from Whisper. Once all the transcriptions are done, it'll create your train.txt from it, and that is the file used in training.
@Conrox 8 months ago
@@Jarods_Journey thank you so much ♥
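[Editor's note] For reference, the train.txt produced at this step is a plain pipe-separated list in the LJSpeech style, one clip per line. The paths and sentences below are made up for illustration:
```text
audio/segment_0001.wav|First transcribed sentence for that clip.
audio/segment_0002.wav|Second transcribed sentence for the next clip.
```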
@adamrastrand9409 8 months ago
Hi, how much training data do you need to train the Tortoise model on, let's say, Swedish or any other language? And what happens if I upload just maybe 30 minutes, 10 minutes, or one hour?
@kaya2995 8 months ago
The quality is dependent on the source material. So feeding it 10 minutes will result in worse quality than 3 hours.
@Jarods_Journey 8 months ago
For training a new language, you need quite a bit more data than for finetuning; I'd say at least 10 hours. What seems to happen when you train with too little is that the model never converges well and is more likely to explode; this is when you see mel loss drop to basically 0 on the graph, causing terrible distortion in the output.
@FloridaFamAdventures 8 months ago
Are you able to share your Spanish dataset and/or your tokenizer for Spanish?
@Jarods_Journey 8 months ago
Won't share the dataset, but you're welcome to curate it yourself. It would be the large playlist of audiobooks on KZbin, around 580 videos. I'll release tokenizers sometime in the future.
@shovonjamali7854 8 months ago
Hi, I'm new to voice cloning. If it's trained on a new language, and after that I want to pass in a new user's audio (3-4 clips, 5-10 seconds long), will it be able to clone it? If so, how?
@Jarods_Journey 8 months ago
Yes, give or take, it should. You can always finetune it on another voice to get it sounding closer, then run it through RVC to get it even closer.
@shovonjamali7854 8 months ago
@@Jarods_Journey Great! Thanks for your valuable comment. 😍
@Random_person_07 8 months ago
XTTS vs. Tortoise: which is better?
@thekinoreview1515 8 months ago
Jarod, can I ask why you favour Tortoise? I started playing with XTTS v2 yesterday and I am blown away. The training and inference seem far, far faster and I'm getting better quality outputs than I did with Tortoise. Specifically, XTTS for cadence/delivery/accent -> RVC for fidelity/final output. I'm able to train a great XTTS model in only 10 minutes with ~15 minutes of data, and inference takes ~1/3 of the output duration. I believe XTTS is also multi-lingual.
@juanjesusligero391 8 months ago
I think it may be because TortoiseTTS is Open Source (Apache license, commercial use allowed), while XTTS v2 is not.
@Jarods_Journey 8 months ago
4 reasons:
1. Familiarity and learning - I am much more familiar with the structure and hierarchy of the AI Voice Cloning repository (Tortoise), which makes modifying and browsing it much easier for me. Since I want to understand how it works, this is one of my main reasons.
2. XTTS (v2) is Tortoise - The LLM behind it is roughly the same architecture, just with a few different parameter values. If XTTS were significantly different, I would put more work into looking at it.
3. Quality and speed - I get solid outputs from both models, so there isn't much of a difference for me. Speed is roughly the same as well if you enable HifiGAN on Tortoise. DeepSpeed should be enabled on both by default as well.
4. Licensing on Tortoise is more permissive than XTTS - This isn't a big factor for me, but I know it is for others, and since there isn't much of a difference between the two, I still choose to work on Tortoise.
The biggest reason for me to switch to XTTS would be the multi-lingual capabilities, but currently I'm learning a lot by figuring out how they did that, so no reason to switch right now!
@thekinoreview1515 8 months ago
@@Jarods_Journey Awesome, thanks for the detailed response. I appreciate it. I am hoping to keep drilling down into this stuff until I find an ideal setup.
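[Editor's note] For anyone comparing speed, the DeepSpeed and caching options mentioned above are exposed as constructor flags in recent tortoise-tts builds. This is a hedged sketch; the exact option names can differ between forks, and the HifiGAN decoder mentioned in the reply is a fork-specific addition not shown here.
```python
# Enable the main inference speedups when constructing the Tortoise API object
# (flag names as in recent neonbjb/tortoise-tts releases; forks may differ).
from tortoise.api import TextToSpeech

tts = TextToSpeech(
    use_deepspeed=True,  # DeepSpeed inference kernels for the autoregressive model
    kv_cache=True,       # cache attention keys/values during AR decoding
    half=True,           # fp16 inference where supported
)
```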
@oguz2241 7 months ago
Anyone want to do a Turkish version of the TTS? Please reach out to me.
@myte1why 8 months ago
Wait a minute, so basically there is an unintended way to train a voice in one language and make it speak another one?
@Jarods_Journey 8 months ago
Not too sure what you meant here, but the voice samples you use during generation don't actually have to match the language that the autoregressive model is trained on.
@myte1why 8 months ago
@@Jarods_Journey Wait, did I go the hard way by training my voice as a model instead of just using masking? 🧐
@watcher_xD 8 months ago
Is there any way to do this without a GPU, using a CPU? 😊
@Jarods_Journey 8 months ago
Unfortunately not, CPUs aren't optimized for training.
@poly06033 8 months ago
Hindi Language ❤