I think I figured how to clone (almost) any language in Tortoise TTS

4,297 views

Jarods Journey

A day ago

Comments: 54
@puntogcb 8 months ago
I really appreciate all the work you put into this, and that you share it for those of us who want to learn without knowing how to code, as in my case. 🙌 Super grateful to learn from what you do. That "accent" is pure Spain Spanish, and it sounds very good. A Spaniard could confirm it better. 😉 I watch your videos more than once to try to understand the process, do it myself, and teach my machine how to train with my Argentine-accented voice.
@sadelcri 8 months ago
Don't be so sure, it doesn't sound 100% Castilian; with so much seseo it seems to have a Latin American accent.
@rolgnav 8 months ago
Can't wait, I would love to make a model for my language, which hasn't been done yet. Will definitely donate to help the cause once it's ready. Thanks.
@bomar920 8 months ago
Thank you 🙏. I can't wait until you get everything sorted and it becomes easily accessible to us. You are the GOAT 🐐
@Jscott1314 8 months ago
Amazing job man, you're a genius. Love to finally see my language here.
@WorldYuteChronicles 8 months ago
Amazing! Thanks for your work here again. I will probably be using this in future projects. Thanks for providing these dataset tools for Tortoise on your GitHub as well.
@sataz101 8 months ago
Thank you so much Jarod for this invaluable experience and for sharing, especially for a newbie like me 😅❤
@juanpablothummler 7 months ago
Incredible, thank you for sharing.
@ElmorenohWTF 8 months ago
As a native Spanish speaker, this sounds amazing. And if you add RVC to it, it would sound even better. I'm looking forward to being able to train my own model in Spanish with my voice in tortoise-tts. As always, great job.
@felipemartinez4379 8 months ago
How are you going to train it, friend? I'm going to do the same.
@Lv7-L30N 8 months ago
I speak Spanish and Mel's voice sounds really good in Spanish... thank you.
@alialikhanov8383 8 months ago
Hi, Jarod. Can you give me advice on creating a tokenizer for the Uzbek language?
@miyrrecs3024 8 months ago
I'm really surprised by this update working in other languages 👍. I remember a while back I tried TTS with Japanese-to-Japanese generation, but I only got glitches.
@Dj-vt5gr 8 months ago
There is a free ebook text-to-speech utility called Balabolka, which uses speech models via the Microsoft SAPI interface. Is there a way to call Tortoise/RVC voice models through the SAPI interface? That would be truly awesome!
@juanpablothummler 7 months ago
Good question!
@TREXYT 8 months ago
I understand a bit, but after downloading audiobooks, how do you split them into tons of audio files? Do you use a batch file to download tons of audio files? I think the steps aren't well explained. If you can help, I would highly appreciate it, thanks.
@Serve_one_another_with_love 8 months ago
Another great video bro, thank you so much. I just want to know if we can use this on YouTube or Facebook without getting a copyright strike?
@fujihita2500 8 months ago
I used the autoregressive model to generate latent files for a JP voice and it works somewhat when I use it against EN text input. The problem is that the JP samples are missing some of the sounds in EN, so the end result is subpar for certain sounds (such as "bra" being pronounced as either "bara" or "ra"). To cope, I just generate latents off EN samples of a similar speaker, then feed the output through RVC of the JP voice. So yes, I can confirm that RVC doesn't care which language you feed into it, but the TTS model does.
@Jarods_Journey 8 months ago
Yeah, RVC is language agnostic; it just converts pitch. Tortoise still infers a little bit of speech prosody from the conditioning latents, which is most likely affecting the output.
@fujihita2500 8 months ago
@@Jarods_Journey As long as it's English text input, I'm unsure what the difference is between the autoregressive model with latents and a finetuned model. After RVC, they sound very similar to me. The only thing that matters then is the quality of the RVC model, which is a lot smaller to move around and easier to train.
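[Editor's note] For readers who want to try the workflow described in this thread, here is a minimal sketch assuming the stock tortoise-tts Python API (file names and text are placeholders, not from the video): compute conditioning latents from a few reference clips, generate speech from English text, then run the output through the target speaker's RVC model as a separate step.
```python
# Sketch of the latents-then-RVC workflow (stock tortoise-tts API, placeholder files).
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()

# A few clips of an EN speaker similar to the target voice, loaded at 22.05 kHz
voice_samples = [load_audio(p, 22050) for p in ["en_ref_1.wav", "en_ref_2.wav"]]
conditioning_latents = tts.get_conditioning_latents(voice_samples)

# Generate English speech; prosody is inferred partly from the latents
gen = tts.tts_with_preset(
    "This is the English sentence to synthesize.",
    conditioning_latents=conditioning_latents,
    preset="fast",
)
torchaudio.save("tortoise_en_out.wav", gen.squeeze(0).cpu(), 24000)
# tortoise_en_out.wav would then be converted with the JP speaker's RVC model,
# which only transfers timbre/pitch and doesn't care about the language.
```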
@kubananimator 8 months ago
What if the model sometimes distorts hissing sounds? For example, she pronounces "r", "zh", "sh", "shch", "t'" badly.
@Jarods_Journey 8 months ago
I think the root cause is either an insufficient tokenizer or not enough data. The more data you can train it on, the better.
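[Editor's note] One quick way to check the tokenizer hypothesis is to encode a few words containing the problem sounds and look for unknown-token fallbacks. This is a hedged sketch assuming the custom vocabulary is a HuggingFace `tokenizers` JSON file; the path and test words are placeholders.
```python
# Sanity-check that the problem sounds are covered by the custom tokenizer.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("custom_tokenizer.json")

for word in ["жук", "шар", "щука", "речь"]:
    enc = tok.encode(word.lower())
    # Any [UNK] in the output means the tokenizer can't represent that sound,
    # and the model will likely distort it at generation time.
    print(word, enc.tokens)
```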
@9-volt247 8 months ago
Hi, Jarods! I'm 9-Volt! Nintendo's biggest fan!
@glowstorm334 8 months ago
Hi Jarod, I was wondering if you could provide the training scripts you used for preparing the dataset, as well as the one for the BPE tokenizer, and whether the dataset size is the same as before (800 hours of audio). Also, is there a way to contact you? I would like to ask a few questions regarding Tortoise, as I am also an AI engineer in the same domain. Last but most importantly, thank you for creating these videos, as no one but you explains TTS systems, especially Tortoise TTS, in such depth.
@Jarods_Journey 8 months ago
Appreciate it! If you want to get started with looking through my scripts, you can find them here: github.com/JarodMica/tortoise-dataset-tools. Since it's Tortoise specific, I'd actually prefer that this be a discussion you open up here: github.com/JarodMica/ai-voice-cloning/discussions
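[Editor's note] As a rough illustration of what such a tokenizer script does, here is a sketch using the HuggingFace `tokenizers` library. The file names, vocabulary size, and special tokens are assumptions for illustration, not Jarod's exact settings; see tortoise-dataset-tools for his actual code.
```python
# Train a small BPE tokenizer over a transcript corpus (illustrative settings).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=256,                                 # Tortoise uses a small vocabulary
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],  # tokens the Tortoise text frontend expects
)
tokenizer.train(files=["all_transcripts.txt"], trainer=trainer)
tokenizer.save("custom_tokenizer.json")
```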
@iweiteh 7 months ago
Hi Jarod, thanks for this amazing video! I am training a French voice model and came across an interesting observation in the loss function, and I wonder if you have any insights on it. I have about 40 hours of clean audio samples across 10k files with multiple speakers. Each sample is about 10-15 seconds long. When training the model (learning rates of 0.0001 and 0.00001, batch sizes of 32, 64, and 128 all yield similar results), I noticed both loss functions (i.e. loss_mel, loss_text) descended very rapidly and approached 0 after about 20 epochs. When I generate voice samples, the results are not very good. I wonder if the model is overfitting on the audio samples. I am very interested in hearing whether this is common in your experience and how you would approach addressing it. Thanks, and I really appreciate your feedback!
@lio1234234 8 months ago
This is amazing! Do you think we could use similar techniques to improve how the model follows emotive prompting, by including it in the transcriptions used in training? You know how you can put in [angry], etc.? Makes me wonder if we can also train it to handle things within the text, rather than just at the start, just like Bark does but at high quality, like with [cough] and [laugh]?
@--_Angel_-- 8 months ago
Hello! Can you help me create a tokenizer for the Russian language? I have 1,000 hours of audiobooks for it.
@radioketnoi 2 months ago
I don't understand where I went wrong. I'm training the Vietnamese language. I used about 1 hour of my voice for training and created a tokenizer with your Python file for the Vietnamese language "vi". Then I tested it with a sentence that was already in the audio samples. It produced a sound that was my voice; however, the output was meaningless, not Vietnamese at all. Please tell me where I went wrong??
@Conrox 8 months ago
Is the "whisper.json" being used during training? It's located next to the validation and training text files. The speaker I want to train has an accent and doesn't use proper grammar. Whisper made a lot of small mistakes, which I manually corrected in the train & valid text files, but now that I'm done I realized the wrong text is still present in whisper.json as well. I checked a previously trained model folder where I used the original DLAS repo; I couldn't find that whisper.json in the training folder there, but maybe it's just hidden somewhere else, or a temporary file, or is it not needed?
@Jarods_Journey 8 months ago
whisper.json is not used in training; it's only an intermediary file used to store the transcripts from Whisper. Once all the transcriptions are done, it'll create your train.txt from it, and that is the file used in training.
@Conrox 8 months ago
@@Jarods_Journey thank you so much ♥
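[Editor's note] For reference, the train.txt produced at this step is a plain pipe-separated list in the LJSpeech style, one clip per line. The paths and sentences below are made up for illustration:
```text
audio/segment_0001.wav|First transcribed sentence for that clip.
audio/segment_0002.wav|Second transcribed sentence for the next clip.
```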
@adamrastrand9409 8 months ago
Hi, how much training data do you need to train the Tortoise model on, let's say, Swedish or any other language? And what happens if I upload just maybe 30 minutes, 10 minutes, or one hour?
@kaya2995 8 months ago
The quality is dependent on the source material. So feeding it 10 minutes will result in worse quality than 3 hours.
@Jarods_Journey 8 months ago
For training a new language, you need quite a bit more data than for finetuning; I'd say at least 10 hours. What seems to happen when you train with too little is that the model never converges well and is more likely to explode; this is when you see mel loss drop to basically 0 on the graph, causing terrible distortion in the output.
@FloridaFamAdventures 8 months ago
Are you able to share your Spanish dataset and/or your tokenizer for Spanish?
@Jarods_Journey 8 months ago
Won't share the dataset, but you're welcome to curate it yourself. It would be the large playlist of audiobooks on KZbin, around 580 videos. I'll release tokenizers sometime in the future.
@shovonjamali7854 8 months ago
Hi, I'm new to voice cloning. If it's trained on a new language, and after that I want to pass in a new user's audio (3-4 clips, 5-10 seconds long), will it be able to clone it? If so, how?
@Jarods_Journey 8 months ago
Yes, give or take, it should. You can always finetune it on another voice to get it sounding closer, then run it through RVC to get it even closer.
@shovonjamali7854 8 months ago
@@Jarods_Journey Great! Thanks for your valuable comment. 😍
@Random_person_07 8 months ago
XTTS vs. Tortoise: which is better?
@thekinoreview1515 8 months ago
Jarod, can I ask why you favour Tortoise? I started playing with XTTS v2 yesterday and I am blown away. The training and inference seem far, far faster and I'm getting better quality outputs than I did with Tortoise. Specifically, XTTS for cadence/delivery/accent -> RVC for fidelity/final output. I'm able to train a great XTTS model in only 10 minutes with ~15 minutes of data, and inference takes ~1/3 of the output duration. I believe XTTS is also multi-lingual.
@juanjesusligero391 8 months ago
I think it may be because TortoiseTTS is Open Source (Apache license, commercial use allowed), while XTTS v2 is not.
@Jarods_Journey 8 months ago
4 reasons:
1. Familiarity and learning - I am much more familiar with the structure and hierarchy of the AI Voice Cloning repository (Tortoise), which makes modifying and browsing it much easier for me. Since I want to understand how it works, this is one of my main reasons.
2. XTTS (v2) is Tortoise - The LLM behind it is roughly the same architecture, just with a few different parameter values. If XTTS were significantly different, I would put more work into looking at it.
3. Quality and speed - I get solid outputs from both models, so there isn't much of a difference for me. Speed is roughly the same as well if you enable HifiGAN on Tortoise. DeepSpeed should be enabled on both by default as well.
4. Licensing on Tortoise is more permissive than XTTS - This isn't a big factor for me, but I know it is for others, and since there isn't much of a difference between the two, I still choose to work on Tortoise.
The biggest reason for me to switch to XTTS would be the multi-lingual capabilities, but currently I'm learning a lot by figuring out how they did that, so no reason to switch right now!
@thekinoreview1515 8 months ago
@@Jarods_Journey Awesome, thanks for the detailed response. I appreciate it. I am hoping to keep drilling down into this stuff until I find an ideal setup.
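[Editor's note] For anyone comparing speed, the DeepSpeed and caching options mentioned above are exposed as constructor flags in recent tortoise-tts builds. This is a hedged sketch; the exact option names can differ between forks, and the HifiGAN decoder mentioned in the reply is a fork-specific addition not shown here.
```python
# Enable the main inference speedups when constructing the Tortoise API object
# (flag names as in recent neonbjb/tortoise-tts releases; forks may differ).
from tortoise.api import TextToSpeech

tts = TextToSpeech(
    use_deepspeed=True,  # DeepSpeed inference kernels for the autoregressive model
    kv_cache=True,       # cache attention keys/values during AR decoding
    half=True,           # fp16 inference where supported
)
```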
@oguz2241 7 months ago
Anyone want to do a Turkish version of the TTS? Please reach out to me.
@myte1why 8 months ago
Wait a minute, so basically there is an unintended way to train a voice in one language and make it speak another one?
@Jarods_Journey 8 months ago
Not too sure what you meant here, but the voice samples you use during generation don't actually have to match the language that the autoregressive model is trained on.
@myte1why 8 months ago
@@Jarods_Journey Wait, did I go the hard way by training my voice as a model instead of just using masking? 🧐
@watcher_xD 8 months ago
Is there any way to do this without a GPU, using a CPU? 😊
@Jarods_Journey 8 months ago
Unfortunately not, CPUs aren't optimized for training.
@poly06033 8 months ago
Hindi Language ❤