I really appreciate all the work you put into this, and that you share it for those of us who want to learn without knowing how to code, like me. 🙌 Super grateful to be learning from what you do. That "accent" is pure Spain Spanish, and it sounds great; a Spaniard could confirm it better than I can. 😉 I watch your videos more than once to try to understand the process, do it myself, and teach my machine to train on my Argentine-accented voice.
@sadelcri · 8 months ago
Don't be so sure, it doesn't sound 100% Castilian; with all that seseo it sounds more like a Latin American accent.
@rolgnav · 8 months ago
Can't wait. I would love to make a model for my language, which hasn't been done yet. I'll definitely donate to help the cause once it's ready. Thanks!
@bomar920 · 8 months ago
Thank you 🙏. I can't wait until you get everything sorted and it becomes easily accessible to us. You are the GOAT 🐐
@Jscott1314 · 8 months ago
Amazing job man, you're a genius. Love finally seeing my language here.
@WorldYuteChronicles · 8 months ago
Amazing! Thanks for your work here again. I will probably be using this in future projects; thanks for providing these dataset tools for Tortoise on your GitHub as well.
@sataz101 · 8 months ago
Thank you so much, Jarod, for this invaluable experience and for sharing, especially for a newbie like me 😅❤
@juanpablothummler · 7 months ago
Incredible, thank you for sharing!
@ElmorenohWTF · 8 months ago
As a native Spanish speaker, this sounds amazing, and if you add RVC to it, it would sound even better. I'm looking forward to being able to train my own model in Spanish with my voice in Tortoise-TTS. As always, great job.
@felipemartinez4379 · 8 months ago
How are you going to train it, friend? I'm going to do the same.
@Lv7-L30N · 8 months ago
I speak Spanish, and Mel's voice sounds really good in Spanish... thank you!
@alialikhanov8383 · 8 months ago
Hi, Jarod. Can you give me advice on creating a tokenizer for the Uzbek language?
@miyrrecs3024 · 8 months ago
I'm really surprised by this update to work in other languages 👍. I remember trying TTS a while back (Japanese-to-Japanese generation), but I only got glitches.
@Dj-vt5gr · 8 months ago
There is an open-source ebook text-to-speech utility called Balabolka, which uses speech models via the Microsoft SAPI interface. Is there a way to call Tortoise/RVC voice models through the SAPI interface? That would be truly awesome!
@juanpablothummler · 7 months ago
Good question!
@TREXYT · 8 months ago
I understand a bit, but after downloading the audiobooks, how do you split them into tons of audio files? Do you use a batch file to download tons of audio files? I don't think the steps are well explained; if you can help, I would highly appreciate it, thanks.
@Serve_one_another_with_love · 8 months ago
Another great video bro, thank you so much. I just want to know if we can use this on YouTube or Facebook without getting a copyright strike?
@fujihita2500 · 8 months ago
I used the autoregressive model to generate latent files for a JP voice, and it works somewhat when I use it against EN text input. The problem is that the JP samples are missing some of the sounds in EN, so the result is subpar for certain sounds (e.g., "bra" being pronounced as either "bara" or "ra"). To cope, I just generate latents off EN samples of a similar speaker, then feed the output through RVC of the JP voice. So yes, I can confirm that RVC doesn't care which language you feed into it, but the TTS model does.
@Jarods_Journey · 8 months ago
Yeah, RVC is language agnostic; it just converts pitch. Tortoise still infers a little bit of speech prosody from the conditioning latents, which is most likely what's affecting the output.
@fujihita2500 · 8 months ago
@@Jarods_Journey As long as it's English text input, I'm unsure what the difference is between the autoregressive model with latents and a fine-tuned model. After RVC, they sound very similar to me. The only thing that matters then is the quality of the RVC model, which is a lot smaller to move around and easier to train.
@kubananimator · 8 months ago
What if the model sometimes distorts sibilant sounds? For example, it pronounces "r", "zh", "sh", "sch", "t'" badly.
@Jarods_Journey · 8 months ago
I think the root cause would be either an insufficient tokenizer or needing more data. The more data you can train it on, the better.
@9-volt247 · 8 months ago
Hi, Jarods! I'm 9-Volt! Nintendo's biggest fan!
@glowstorm334 · 8 months ago
Hi Jarod, I was wondering if you could provide the training scripts you used for preparing the dataset, as well as the one for the BPE tokenizer. Is the dataset size the same as before, 800 hours of audio? Also, is there a way to contact you? I would like to ask a few questions about Tortoise, as I'm also an AI engineer in the same domain. Last but most importantly, thank you for creating these videos; no one but you explains TTS systems, especially Tortoise TTS, in such depth.
@Jarods_Journey · 8 months ago
Appreciate it! If you want to get started looking through my scripts, you can find them here: github.com/JarodMica/tortoise-dataset-tools. Since it's Tortoise-specific, I'd actually prefer that you open a discussion here: github.com/JarodMica/ai-voice-cloning/discussions
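For readers who just want the general shape of the BPE tokenizer step discussed in this thread, here's a minimal sketch using the Hugging Face `tokenizers` library. The inline corpus, vocab size, and special tokens are illustrative assumptions, not taken from the video or Jarod's repos; check tortoise-dataset-tools for the actual settings.

```python
# Minimal BPE tokenizer sketch with the Hugging Face `tokenizers` library.
# Corpus, vocab size, and special tokens are illustrative assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# In practice this would be every transcript line in your dataset's language.
corpus = [
    "hola mundo",
    "buenos dias a todos",
    "el modelo aprende los sonidos del idioma",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=256,  # Tortoise-style models use a small vocab; tune per language
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("custom_tokenizer.json")

# Sanity check: the trained tokenizer should segment in-corpus text cleanly.
encoded = tokenizer.encode("hola mundo")
```

The resulting JSON file is what you'd point the training config at in place of the default English tokenizer.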
@iweiteh · 7 months ago
Hi Jarod, thanks for this amazing video! I am training a French voice model and came across an interesting observation in the loss function; I wonder if you have any insights on it. I have about 40 hours of clean audio samples across 10k files with multiple speakers, each sample about 10-15 seconds long. When training the model (lrs of 0.0001 and 0.00001 and batch sizes of 32, 64, and 128 all yield similar results), I noticed both losses (loss_mel and loss_text) descended very rapidly and approached 0 after about 20 epochs. When I generate voice samples, the results are not very good. I wonder if the model is overfitting on the audio samples. I'm very interested in hearing whether this is common in your experience and how you'd address it. Thanks, and I really appreciate your feedback!
@lio1234234 · 8 months ago
This is amazing! Do you think we could use similar techniques to improve how well the model adheres to emotive prompting, by including it in the transcriptions used in training? You know how you can put in [angry], etc.? It makes me wonder if we can also train it to handle cues within the text, rather than just at the start, like 'bark' does but at high quality, e.g. with [cough] and [laugh]?
@--_Angel_-- · 8 months ago
Hello! Can you help me create a tokenizer for the Russian language? I have 1,000 hours of audiobooks for it.
@radioketnoi · 2 months ago
I don't understand where I went wrong. I'm training the Vietnamese language. I used about 1 hour of my voice for training and created a tokenizer with your Python file for the Vietnamese language ("vi"). Then I tested it with a sentence that was already in the audio samples. It produced a sound that was my voice; however, the output was meaningless, not Vietnamese at all. Please tell me where I went wrong?
@Conrox · 8 months ago
Is the "whisper.json" used when training? It's located next to the validation and training text files. The speaker I want to train has an accent and doesn't use proper grammar, so Whisper made a lot of small mistakes, which I manually corrected in the train and validation text files. Now that I'm done, I realized the wrong text is present in whisper.json as well. I checked a previously trained model folder where I used the original DLAS repo; I couldn't find that whisper.json in the training folder there, but maybe it's just hidden somewhere else, or a temporary file, or is it not needed?
@Jarods_Journey · 8 months ago
Whisper.json is not used in training; it's only an intermediate file used to store the transcripts from Whisper. Once all the transcriptions are done, it creates your train.txt from it, and that's the file used in training.
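As a rough illustration of that intermediate step, a whisper.json-to-train.txt conversion might look like the sketch below. The JSON layout (a mapping of audio path to a record with a "text" field) and the "path|transcript" line format are assumptions modeled on DLAS-style training lists; the actual files in the ai-voice-cloning repo may differ.

```python
# Hypothetical sketch: regenerate a train.txt from a Whisper transcript JSON.
# Assumes {audio_path: {"text": transcript}} input and emits "path|transcript"
# lines; field names and formats are assumptions, not confirmed from the repo.
import json

def whisper_json_to_train(json_path: str, out_path: str) -> int:
    """Write one 'audio_path|transcript' line per entry; return the line count."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    lines = [f"{audio}|{info['text'].strip()}" for audio, info in data.items()]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)
```

One practical upshot of this flow: if you hand-correct transcripts, correcting them in whisper.json and regenerating train.txt means you only fix each mistake once.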
@Conrox · 8 months ago
@@Jarods_Journey Thank you so much ♥
@adamrastrand9409 · 8 months ago
Hi, how much training data do you need to train the Tortoise model on, let's say, Swedish or any other language? And what happens if I upload just 30 minutes, 10 minutes, or one hour?
@kaya2995 · 8 months ago
The quality depends on the source material, so feeding it 10 minutes will result in worse quality than 3 hours.
@Jarods_Journey · 8 months ago
For training a new language, you need quite a bit more data than for fine-tuning; I'd say at least 10 hours. What seems to happen when you train with too little is that the model never converges well and is more likely to explode; this is when you see mel loss drop to basically 0 on the graph, causing terrible distortion in the output.
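The mel-loss collapse described above can be caught programmatically while watching a run. Here's a tiny illustrative helper (not from the video; the window and floor thresholds are arbitrary assumptions) that flags when recent mel-loss values have all dropped near zero:

```python
# Illustrative guard (not from the video): flag a run whose mel loss has
# collapsed toward zero, the failure mode described for too-small datasets.
# `window` and `floor` are arbitrary example thresholds.
def mel_loss_collapsed(mel_losses, window=50, floor=0.01):
    """Return True once the last `window` mel-loss values all sit below `floor`."""
    recent = mel_losses[-window:]
    return len(recent) == window and max(recent) < floor
```

Checking something like this each logging step lets you stop a doomed run early instead of discovering the distortion at inference time.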
@FloridaFamAdventures · 8 months ago
Are you able to share your Spanish dataset and/or your tokenizer for Spanish?
@Jarods_Journey · 8 months ago
I won't share the dataset, but you're welcome to curate it yourself. It would be the large playlist of audiobooks on YouTube, around 580 videos. I'll release tokenizers sometime in the future.
@shovonjamali7854 · 8 months ago
Hi, I'm new to voice cloning. If it's trained on a new language, and after that I want to pass in a new user's audio (3-4 clips, 5-10 seconds long), will it be able to clone it? If so, how?
@Jarods_Journey · 8 months ago
Yes, give or take, it should. You can always fine-tune it on another voice to get it closer sounding, then run it through RVC to get it even closer.
@shovonjamali7854 · 8 months ago
@@Jarods_Journey Great! Thanks for your valuable comment. 😍
@Random_person_07 · 8 months ago
XTTS vs. Tortoise: which is better?
@thekinoreview1515 · 8 months ago
Jarod, can I ask why you favour Tortoise? I started playing with XTTS v2 yesterday and I am blown away. Training and inference seem far, far faster, and I'm getting better-quality outputs than I did with Tortoise; specifically, XTTS for cadence/delivery/accent -> RVC for fidelity/final output. I'm able to train a great XTTS model in only 10 minutes with ~15 minutes of data, and inference takes about 1/3 of the output duration. I believe XTTS is also multilingual.
@juanjesusligero391 · 8 months ago
I think it may be because Tortoise-TTS is open source (Apache license, commercial use allowed), while XTTS v2 is not.
@Jarods_Journey · 8 months ago
4 reasons:
1. Familiarity and learning - I'm much more familiar with the structure and hierarchy of the AI Voice Cloning repository (Tortoise), which makes modification and browsing much easier for me. Since I want to understand how it works, this is one of my main reasons.
2. XTTS (v2) is Tortoise - the LLM behind it is roughly the same architecture, just with a few different parameter values passed in. If XTTS were significantly different, I would be putting more work into looking at it.
3. Quality and speed - I get solid outputs from both models, so there's not much of a difference for me. Speed is roughly the same as well if you enable HifiGAN on Tortoise, and DeepSpeed should be enabled on both by default.
4. Licensing - Tortoise's license is more permissive than XTTS's. This isn't a big factor for me, but I know it is for others, and since there isn't much of a difference between the two, I choose to keep working on Tortoise.
The biggest reason for me to switch to XTTS would be the multilingual capabilities, but currently I'm learning a lot figuring out how they did that, so no reason to switch right now!
@thekinoreview1515 · 8 months ago
@@Jarods_Journey Awesome, thanks for the detailed response; I appreciate it. I'm hoping to keep drilling down into this stuff until I find an ideal setup.
@oguz2241 · 7 months ago
Anyone want to do a Turkish version of TTS? Please reach out to me.
@myte1why · 8 months ago
Wait a minute, so basically there's an unintended way to train a voice in one language and make it speak another one?
@Jarods_Journey · 8 months ago
Not too sure what you meant here, but the voice samples you use during generation don't actually have to match the language the autoregressive model was trained on.
@myte1why · 8 months ago
@@Jarods_Journey Wait, did I go the hard way and train my voice as a model instead of just using masking? 🧐
@watcher_xD · 8 months ago
Is there any way to do it without a GPU, using a CPU? 😊
@Jarods_Journey · 8 months ago
Unfortunately not; CPUs aren't optimized for training.