Training Any Language in AI Voice Cloning - Tortoise TTS

Views: 15,943

Jarods Journey

A day ago

Comments: 65

@bomar920 10 months ago

You're amazing, dude. I've been following you for almost a year and I have learned a lot from your channel. Keep it up :)

@adamrastrand9409 10 months ago

Oh, so if it’s a Latin-alphabet language, for example Swedish, Spanish or German, could I just use the Whisper-transcribed Swedish text to train the model? How would that come out?

@Jarods_Journey 10 months ago

I would say yes, as the text cleaners will normalize the accented characters. However, keeping those characters might be needed for proper accenting, and that is where a custom tokenizer might come in handy.

@ElmorenohWTF 10 months ago

I would say you could try it, but you would run into some problems. For example, I saw a nanonomad video where he trained a tortoise-tts model in Spanish, and although it worked, the accent was not a correct Spanish accent and the model had difficulties with some words. Also, in the Spanish example Jarod shows in this video (I am Spanish), I would say the text has not been tokenized correctly, since certain letters that should not be separated are split apart (although I may be wrong; I do not know exactly how the tokenizer works).

@adamrastrand9409 10 months ago

@@Jarods_Journey What do you mean by accented characters and things like that? I also wonder: will the model learn to speak the language that I’m training it on?
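
For anyone wondering what "normalizing the accented characters" can look like in practice, here is a minimal sketch. It assumes the text cleaner behaves like plain ASCII folding; `strip_accents` is a hypothetical helper written only for illustration, not the function the repository actually uses.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Fold accented letters to their base letters (illustrative cleaner only)."""
    # NFKD splits a letter such as U+010D into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping the combining marks leaves only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["\u010dau", "ma\u00f1ana", "sch\u00f6n", "\u015fu"]:
    print(word, "->", strip_accents(word))
```

After this kind of folding, two words that differ only by an accent become the same input text, which is why a tokenizer that keeps the accented characters may matter for languages where the accent changes the sound.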

@Vantaz 10 months ago

Are you going to share your Japanese models at some point? I am working on a script that uses LLMs to generate sentences, which I turn into infinite comprehensible input: it scrapes Google Images for the words and uses ffmpeg to combine the audio and images into a video that, for every sentence, displays an image representing the words in that sentence.

@Jarods_Journey 10 months ago

I generally don't share my models, so that'll be the same in this case. As for infinite comprehensible input, that is a good one! I'd love to see a demo of that when you complete it.

@lunch69 10 months ago

Will Tortoise be able to work with Cyrillic characters if I make a tokenizer with Cyrillic characters?

@SAnsAN091190 10 months ago

Hi Jarod! I am glad to see your new video! Thank you! In fact, the most interesting thing I wanted to know is how you prepared the dataset for training. I asked about this under the previous video 😅. Well, I hope you will tell us about this soon)))

@SAnsAN091190 10 months ago

Regarding those clumps of red and green that you're talking about at 19:36, I've also come across this. The effect appears only when training is resumed. I noticed that if I saved the results at epoch 10 and training was interrupted at epoch 11, then when resuming from epoch 10, the points from the previous run up to epoch 11 are kept on the graph. While training continues from epoch 10 to epoch 11, each iteration is plotted twice, once from the current run and once from the previous one, which is why they appear as clumps of red and green. (I don't think this affects anything other than the readability of the graph, which becomes difficult to read for the current training period.)

@Jarods_Journey 10 months ago

Thanks! Might make a followup video on how I prepared my dataset here. I still generally follow the same way as I've discussed in other videos too though so if you've seen those, it's not far off.

@SAnsAN091190 10 months ago

@@Jarods_Journey Yes, I certainly stick to the approaches you used earlier, but I'm still not sure about some of the steps I'm doing and how much they may affect the final result.

@juanjesusligero391 10 months ago

Oooh, this is great!! :D I want to try training a Spanish language voice! I'll watch this video asap! (I'm working now XD) Thank you very much for sharing it! :D

@schakuun1995 10 months ago

As always, great video. Thanks!

@Jarods_Journey 10 months ago

Appreciate it :)!

@mitchelljams 10 months ago

Hey Jarod! Been watching all your videos and I think I might have a unique challenge. I’d like to remove a tremor in someone’s voice. Since it’s possible to voice clone in other languages, this doesn’t seem impossible. I’m wondering how you would approach it?

@Oqalualaat 9 months ago

Is that 800h dataset only one speaker? If I'm going to collect that much data, it would take me 100 years to transcribe it manually lol. I have no way to use a transcriber to do it automatically...

@charleswang2515 10 months ago

great video!!!👍

@caesq_r 3 months ago

Hi, after training my model, I try to load its .pth file into Okada's AI voice changer, but it says that the .pth file is missing a "config" parameter or something. How do I fix that!?

@DoofusLoopus 10 months ago

Could you help me, please? I installed TTS recently but my interface looks NOTHING like yours. Yours only has 5 tabs while mine has 15, and the settings don't look alike at all. There's no Generate, History, Utility, Training. It has Generation (Bark), Bark Voice Clone, MusicGen + AudioGen, RVC Beta Demo, Demucs Demo, Seamless M4Tv2 Demo, Magnet, Vocos, Tortoise TTS, Outputs, Favorites, Collections, Voices. It's not RVC, but it's not simply Voice Generation either. What should I download?

@ItsAJB 10 months ago

Hi, how do I fix this? During the training process, cmd always shows me the notification "ai-voice-cloning>pause".

@radioketnoi 4 months ago

I don't understand where I went wrong. I'm training on the Vietnamese language. I used about 1 hour of my voice for training and created a tokenizer with your Python file for the Vietnamese language ("vi"). Then I tested it with a sentence that was already in the audio sample. It produced a sound that was my voice. However, the sound produced was meaningless, not Vietnamese at all. Please tell me where I went wrong.

@TheGamingR a month ago

Did you figure out how to make the tokenizer?

@derBenIsPlaying 8 months ago

There is so much documentation missing on Tortoise. For example, how to install new models. I have a model that is fully trained in Tortoise, but I just can't install it. It's a few files, but no documentation lists where you have to put them. They are for French, but it sounds nothing like French when installed, plus the interface doesn't seem to recognize them properly. And why do I have hundreds of autoregressive models to pick from after training a new model myself? Which one is the correct one to use?

@petals-gg7bc 10 months ago

What does this mean in the RVC client on Colab, please? ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. chex 0.1.85 requires numpy>=1.24.1, but you have numpy 1.23.5 which is incompatible. torchdata 0.7.0 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible. torchtext 0.16.0 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible. torchvision 0.16.0+cu121 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible.

@ChasingStars7111 5 months ago

Yo Jarod, thanks for the guide! Could you please make another guide on using the tokenizer for English voices?

@EfeSteve-on6gd 8 months ago

Hi Jarod, nice channel you got. Can you train a TTS tokenizer that can sing out lyrics of any song? Have you got a video on that? Cheers

@akemixx._0 10 months ago

Well, I have a question. For example, if I trained a model that spoke in English but the text was written in Portuguese, would that work? I would love to know this because I want to dub games using TTS.

@SyntheticVoices 10 months ago

Do you think this could be better at pronunciation than XTTSv2? Making a German model is interesting; I attempted one on Tortoise a few months back but it wasn't great, so I'm not sure if there has been a big change since.

@Jarods_Journey 10 months ago

The pronunciation isn't horrendous for the Japanese one as far as I can tell, but there are a bunch of quirks with it. Seems to work well finetuning on some voices, but not others so I can't really tell if it's going to be overall better than XTTS. However, I can say that it does sound better so far than the finetune that I did for Japanese of XTTS, but that's just my initial impression

@SyntheticVoices 10 months ago

@@Jarods_Journey cool thanks

@WorldYuteChronicles 10 months ago

Great video! Converting to Latin is all you need, really? Even if the language you want to train is tonal and contains a lot of special characters that are part of the International Phonetic Alphabet (like "ɖ, Ƒ, ɣ, ọ, ʋ")? Will that lead to an actual voiced labiodental approximant when "ʋ" from the Latin IPA is written?

@Jarods_Journey 10 months ago

I'm starting to find out that it may not be as easy as that, but I have a follow up video on some things I've been finding out

@WorldYuteChronicles 10 months ago

@@Jarods_Journey Alright. Thank you again. Your work is very valuable!

@allan59796 9 months ago

Jarod, thank you for the great tutorial! Really appreciate it, your content is unique ✨️. I've got a question, by the way, concerning tokenizers. In many Turkic languages, including Turkish, there are letters such as "s" and "ş" that both tokenize to "s". Won't that confuse the model? Since those 2 are different letters that are written and pronounced differently but tokenized into 1 letter, I think there's a chance the model will mispronounce them and get confused because of the tokenizer. What do you think about it? 🤔

@creativenets2 10 months ago

Hey Jarod, can I follow your installation on Mac? I know it says Windows, but I'm wondering if it will also work on Mac.

@ElmorenohWTF 10 months ago

Amazing video! I have a few questions: Approximately how large were the files for the 840 hours of audio you used? Do you know where I could find a tortoise-tts model in Spanish to fine-tune with the voice I want to train? Or maybe I could train my own model in Spanish and then fine-tune it, doing it all inside the free version of Google Colab?

@Jarods_Journey 10 months ago

I used MP3 files at 22050 Hz and it came out to around 19 GB of data. Not too sure about an existing Spanish model; you would probably have to train one up to fine-tune at the moment. As for Colab, the repository isn't set up for it. Some people have gotten it running, though, but I am not too sure.
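
As a rough sanity check on those numbers, the arithmetic below estimates the average MP3 bitrate implied by 840 hours in about 19 GB, and how large the same audio would be as uncompressed WAV. Treating "19 GB" as decimal gigabytes and assuming 16-bit mono WAV are assumptions made for this sketch, not details stated in the thread.

```python
hours = 840
size_bytes = 19e9          # "around 19 GB", taken as decimal gigabytes (assumption)
sample_rate = 22050        # Hz, as stated in the reply
bytes_per_sample = 2       # assumption: 16-bit mono audio

seconds = hours * 3600

# Average bitrate implied by the reported MP3 size.
avg_kbps = size_bytes * 8 / seconds / 1000

# Size of the same duration stored as raw 16-bit mono PCM.
wav_gb = seconds * sample_rate * bytes_per_sample / 1e9

print(f"average MP3 bitrate: {avg_kbps:.1f} kbps")
print(f"same audio as 16-bit mono WAV: {wav_gb:.0f} GB")
```

Under those assumptions the dataset averages roughly 50 kbps, and the uncompressed equivalent would be about 133 GB, so disk space rather than download size is the thing to plan for if you convert to WAV before training.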

@dthSinthoras 10 months ago

Can we get everything together for a German model? I can run the training, but I haven't quite understood where to get the tokenizer from. And I can maybe gather 100 hours of training data, but not 800+. So if anyone wants to collaborate...

@lowskillpanda 4 months ago

How can I voice a large text?

@daibaogoh5487 10 months ago

So how low-spec can you go to use this and RVC (not real time)? Could you do this on a laptop with a 4060 with 8 GB of VRAM?

@danielkuperstein1835 10 months ago

If I upload a model that is trained in Portuguese, is the generated output going to be in Portuguese? Will I be able to use an RVC audio file in English with text in Portuguese?

@3k3k3 10 months ago

Just need that 4070 Super Ti, then I am going in...

@ahmetalpergultekin 5 months ago

I want to hear the voice training of Charlie you have there ahahhaha

@fuzzlehub 10 months ago

The link didn't work in Tortoise. Any solution?

@vitmine 10 months ago

Hi, I have one question: If I have the word čau and the tokenizer says ['ca', 'u'], will it work?

@Jarods_Journey 10 months ago

My intuition says yes as the model will learn to associate ča with ca. Now if there are two versions, č and c in your language, a custom tokenizer with both characters in it might be valuable if you find that it's not doing it correctly.

@vitmine 10 months ago

@@Jarods_Journey Thanks for the reply. When I played cau and čau, they sounded the same. Now I'm going to try training again with my own tokenizer.
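
To illustrate the point about keeping both "č" and "c" in a custom tokenizer, here is a toy character-level vocabulary. It is only a sketch: the tokenizer discussed in the video splits text into multi-character pieces, while `build_char_vocab` and `encode` below are hypothetical helpers that work one character at a time.

```python
import unicodedata


def build_char_vocab(lines):
    """Give every distinct character in the corpus its own integer ID."""
    chars = sorted(
        {ch for line in lines for ch in unicodedata.normalize("NFC", line.lower())}
    )
    # ID 0 is reserved for characters never seen in the corpus.
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, vocab):
    """Turn text into a list of IDs, using 0 for unknown characters."""
    normalized = unicodedata.normalize("NFC", text.lower())
    return [vocab.get(ch, 0) for ch in normalized]


corpus = ["\u010dau", "cau", "co je to"]
vocab = build_char_vocab(corpus)

print(encode("\u010dau", vocab))
print(encode("cau", vocab))
```

Because the two words now start with different IDs, the model at least receives different inputs for them; whether it learns different pronunciations still depends on the training data.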

@rudritarahman9719 5 months ago

Is it possible to train Bengali language with Tortoise TTS?

@siamsurf 7 months ago

13:45 haha, Charlie

@elviskent9104 6 months ago

How long did it take ?

@iweiteh 10 months ago

Hi Jarod, thanks so much for this demo! I learn so much from your videos. Keep up the great work! I followed your tutorial here and managed to train a Spanish model using a multi-speaker dataset. The training job took about 12 hours to complete successfully. After the training job, I tried generating a voice from the fine-tuned model. However, due to the volume of my training data, the generation process failed with an OOM error. The error indicated that it ran out of memory in the compute_latent process. I have about 25 hours of voice data in my training folder. I wonder if you have any suggestions on how to overcome this issue? I am using an A10 GPU with 24 GB of VRAM. Thanks in advance!

@Jarods_Journey 10 months ago

I don't show it in this video (it's in several others), but you usually move all your audio data and training files into a backup folder and run inference from only 2 samples so you don't get OOM. Another thing you can do is make a new voice folder and infer from there.

@iweiteh 10 months ago

Thanks for your guidance! @@Jarods_Journey !
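
The backup-folder approach described above can be sketched as a small script. The folder layout, the audio extensions, and the helper name `keep_few_samples` are all assumptions for illustration; adjust them to wherever your voice folder actually lives. The demo runs on a temporary directory so it does not touch real data.

```python
import shutil
import tempfile
from pathlib import Path

AUDIO_EXTS = {".wav", ".mp3", ".flac"}  # assumed audio formats


def keep_few_samples(voice_dir, backup_dir, keep=2):
    """Move everything except `keep` audio files from voice_dir to backup_dir."""
    voice_dir, backup_dir = Path(voice_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    entries = sorted(voice_dir.iterdir())
    audio = [p for p in entries if p.is_file() and p.suffix.lower() in AUDIO_EXTS]
    kept = set(audio[:keep])
    moved = []
    for path in entries:
        if path not in kept:
            shutil.move(str(path), str(backup_dir / path.name))
            moved.append(path.name)
    return moved


# Demo on throwaway folders standing in for a real voice folder.
with tempfile.TemporaryDirectory() as tmp:
    voice = Path(tmp) / "voice"
    backup = Path(tmp) / "voice_backup"
    voice.mkdir()
    for i in range(5):
        (voice / f"clip_{i}.wav").write_bytes(b"")
    (voice / "train.txt").write_text("placeholder", encoding="utf-8")
    print("moved:", keep_few_samples(voice, backup))
    print("left for inference:", sorted(p.name for p in voice.iterdir()))
```

Moving files instead of deleting them means the full dataset can be restored later if you want to resume training.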

@glowstorm334 10 months ago

Hi, love the video. This is exactly what I was looking for. Could you please provide the training scripts for these, as I also want to train TTS in my native language?

@Jarods_Journey 10 months ago

The training scripts are inside of the AI Voice Training repo, so make sure you understand that process

@glowstorm334 10 months ago

Thanks for the reply. So all I have to do is prepare the dataset as shown in the train.txt file and training folder, and then run Generate Configuration to get a valid training dataset?
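
Before generating the configuration, it can help to check the train.txt file for obvious problems. The sketch below assumes a pipe-delimited layout with one `relative/audio/path|transcription` pair per line; that layout is an assumption here, so compare it against the train.txt your own setup produces. `check_train_file` is a hypothetical helper, not part of the repository.

```python
import tempfile
from pathlib import Path


def check_train_file(train_txt, audio_root):
    """Return (line_number, problem) pairs for an assumed 'path|text' train file."""
    problems = []
    lines = Path(train_txt).read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        path, sep, text = line.partition("|")
        if not sep:
            problems.append((number, "missing '|' separator"))
        elif not text.strip():
            problems.append((number, "empty transcription"))
        elif not (Path(audio_root) / path).is_file():
            problems.append((number, f"audio file not found: {path}"))
    return problems


# Demo on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "audio").mkdir()
    (root / "audio" / "clip_0001.wav").write_bytes(b"")
    (root / "train.txt").write_text(
        "audio/clip_0001.wav|hola mundo\n"
        "audio/clip_0002.wav|falta el archivo\n"
        "audio/clip_0001.wav|\n",
        encoding="utf-8",
    )
    for line_number, problem in check_train_file(root / "train.txt", root):
        print(f"line {line_number}: {problem}")
```

An empty result means every line has a transcription and points at a file that exists, which rules out the most common dataset mistakes before a long training run.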