How to Make the PERFECT Dataset for RVC AI Voice Training

134,226 views

Jarods Journey

1 day ago

Comments: 364
@Jarods_Journey (a year ago)
The end of the video got cut off -_-. I only had like 10 seconds left so when I get the chance, I'm just going to link a shorts so that you guys can see the rest of the video lol
@Jarods_Journey (a year ago)
Finishing the Data Curation Video...
@rytraccount4553 (a year ago)
@Jarods_Journey Your audiosplitter code exports 44.1 kHz audio. How do I make it export 48 kHz? I am losing quality with this code!
@M4rt1nX (a year ago)
Thank you Jarod. If people don't want to use Git, they can just download the zip and unpack it at the preferred location. 😉
@Jarods_Journey (a year ago)
Solid tip, thanks Luz! Totally skipped my mind.
@IIStaffyII (a year ago)
Wow, I am amazed by this channel. A few weeks ago I was searching for Diarization of voices but had no good luck finding a good fit. Not only do you have a very good tutorial, you seem to be knowledgeable and up to date with everything (as up to date as one can when things are moving this quick).
@Jarods_Journey (a year ago)
Too many things, too fast. Appreciate it :D, this is the realm of open source.
@brianlink391 (8 months ago)
@Jarods_Journey Love you, bro! Thanks a ton. I didn't even know this existed!
@ohheyvoid (a year ago)
Just found your channel last night, and your workflows are so clear and to the point. Quickly becoming my go-to for voice2voice workflows. Thank you for your work.
@Jarods_Journey (a year ago)
Appreciate it 🙏
@ZitronenChan (a year ago)
Your channel and the AI Hub have helped me a lot in getting started. I just trained a model with 2 hours of audio from Fauna's last stream in RVC v2 for 1000 epochs and it came out very well.
@Jarods_Journey (a year ago)
Haha awesome, glad to hear!
@paradym777 (a year ago)
Is there a way I can get a copy of it? (>
@VexHood (8 months ago)
How much better is that than 300? Does that prevent static sounds if you don't use pretrained generators?
@keisaboru1155 (a year ago)
How do you combine voices to create a totally unique one?
@cubicstorm81 (3 months ago)
For those receiving an error with the "split_audio" script not creating the .srt file as per the above tutorial, run this in an Anaconda or Python prompt, let it download the required dependencies, and it will work as you need. Thank you for a great tutorial!
@ShiinoAndra (8 months ago)
Just found your channel, and I want to say I'm so deep into the rabbit hole that I instantly recognized all the voices you used for conversion at the start 😂
@OthiOthi (a year ago)
Jarod managed to help me figure out a strange problem that I was not able to figure out at all. He's got my sub. Thanking you kindly!
@joshuashepherd7189 (a year ago)
OMG Jarod! Your video tutorials are getting better and better. I love seeing a new release from you! Thanks for all your hard work!
@TLabsLLC-AI-Development (8 months ago)
Bro. This channel is amazing. I've been around and you are needed by many. Welcome.
@m0nkeyb0i666 (5 months ago)
Copied from the issues section; worked for me. Running split_audio.py threw this error:

Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'D:\ai\programs\audiosplitter_whisper\data\output\1.srt'
  File "D:\ai\programs\audiosplitter_whisper\split_audio.py", line 96, in extract_audio_with_srt
    subs = pysrt.open(srt_file)
  File "D:\ai\programs\audiosplitter_whisper\split_audio.py", line 150, in process_audio_files
    extract_audio_with_srt(audio_file_path, srt_file, speaker_segments_dir)
  File "D:\ai\programs\audiosplitter_whisper\split_audio.py", line 180, in main
    process_audio_files(input_folder, settings)
  File "D:\ai\programs\audiosplitter_whisper\split_audio.py", line 183, in <module>
    main()
FileNotFoundError: [Errno 2] No such file or directory: 'D:\ai\programs\audiosplitter_whisper\data\output\1.srt'

Additionally, the terminal was saying something about not having or not finding cublas64_12 (I can't remember exactly what it said). The error is thrown because the program can't find the srt file, because it can't make the srt file, and that is caused by a mismatch of CUDA versions. Torch (or something) has CUDA 11, but the script (or whatever) needs CUDA 12. I'm not a programmer, I don't know exactly what is what. All I know is that I fixed it. To fix this, do the following:

Download and install CUDA 12: developer.nvidia.com/cuda-12-0-0-download-archive
Navigate to "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.0\bin"
Copy cublas64_12.dll, cublasLt64_12.dll, cudart64_12.dll
Navigate to "...\audiosplitter_whisper\venv\Lib\site-packages\torch\lib"
Paste the dlls into this folder

Now when you run split_audio.py, it will be able to create the srt file, fixing the issue with not being able to find said file.
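A rough Python version of the same DLL copy (paths are examples for a default CUDA 12.0 install and this repo's venv layout; adjust both to your machine - this is just a convenience wrapper around the manual steps above, not part of the repo):

import shutil
from pathlib import Path

# Hypothetical paths - change these to match your CUDA install and repo location.
cuda_bin = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.0\bin")
torch_lib = Path(r"D:\ai\programs\audiosplitter_whisper\venv\Lib\site-packages\torch\lib")

for name in ("cublas64_12.dll", "cublasLt64_12.dll", "cudart64_12.dll"):
    src = cuda_bin / name
    if src.exists():
        shutil.copy2(src, torch_lib / name)  # drop the CUDA 12 runtime DLLs next to torch
        print(f"copied {name}")
    else:
        print(f"missing {name} - check your CUDA 12 install")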
@CaptainCarlossi (5 days ago)
Hello, me again with two small questions: 1. The file format of choice is of course WAV, but what should it be for the best quality? 44.1 kHz or 48 kHz? Mono or stereo? (My recordings are in mono, but I could duplicate the channels and create a "pseudo-stereo track" if that produces better results.) 2. Your Audiosplitter_Whisper is good for my spoken sound files, but what is the best way to split the sung recordings? I think that because of the continuous singing there is not always a silence every 10 seconds (or less). What would you recommend? Or do you know a current, nice how-to that describes everything in detail for achieving the best quality? (These are really my last questions :) )
@smokey4049 (a year ago)
Hey, thanks for your awesome series of tutorials! As someone who is pretty new to this, it really helps out a ton. Would it be possible for you to make a tutorial on how to train an RVC v2 voice with the dataset I just created? Thanks again and keep up the great work!
@Jarods_Journey (a year ago)
Appreciate it! Respective tutorials already exist, so I'd go check those out! kzbin.info/aero/PLknlHTKYxuNshtQQQ0uyfulwfWYRA6TGn
@CaptainCarlossi (5 days ago)
Hello. Thanks for your great videos. One question: I am from Germany and have WAV files spoken or sung in English AND in German. For your tool / whisperx I can handle them separately by changing the language. But my question is about RVC: for training a new model, can I mix those different languages together? I always did that, and now I realize that maybe this wasn't a good idea? Or does that not matter for RVC? Thanks in advance ;)
@Jarods_Journey (5 days ago)
This is fine, RVC doesn't look at text to train - it's strictly extracting features from the audio provided. The only thing is it may sound accented; for example, if I train a model on Japanese audio and use it to convert English speech, it may not sound 100% native English.
@CaptainCarlossi (5 days ago)
@@Jarods_Journey Cool. Thank You for your quick reply ;)
@pilpinpin322 (a year ago)
Thank you so much! It's a clear video and we can see that you know what you're doing! I have a small question regarding the .wav files of the dataset: is it better to encode them in stereo or in mono? Or does it make no difference to the program?
@Jarods_Journey (a year ago)
I don't think it makes a difference, but I read somewhere that it should be done in stereo. I believe it flattens them anyway, so it doesn't really matter after it's been processed.
@pilpinpin322 (a year ago)
@Jarods_Journey Thank you very much! One last question: is it better to segment the sounds into files of 10 seconds each, or to cut them into complete sentences (and therefore have files of very variable duration)? Thanks for your work!
@Jarods_Journey (a year ago)
@pilpinpin322 :), complete sentences work best so you don't get weird clipping, but if you run out of VRAM, you'll need to split into smaller segments.
@pilpinpin322 (a year ago)
@Jarods_Journey Thanks for the fast response - even if there are very small sentences of 1 second, like "Yes, I agree!"?
@Dante02d12 (a year ago)
Hey there! Thank you for all those videos! I hadn't realized UVR5 had advanced options, lol. Hey, I have a question that can look silly but it is serious : is it really required to train for _hundreds_ of epochs? I have had absolutely great results with 50 epochs only. What does more epochs bring exactly? Meanwhile, the issues I have also happen with models trained for hundreds or thousands of epochs, because most of my problems come from the way I clean the audio I want to clone. I also noticed my feminine voices tend to break at growls. Is it required to have growling audio in the database used for training? Or is there a secret sauce to make any voice have growls?
@Jarods_Journey (a year ago)
Appreciate it! A finished epoch indicates that the model has seen every sample once; increasing epochs just repeats this process for X number of epochs. It's all data dependent, as you don't always need more epochs for a good model. As for growls, in general they seem to be harder for the models to infer, and my anecdotal experience is that all models kind of struggle with them. I have yet to try training with growls, but I want to try a similar experiment with laughing, because often laughing just sounds weird 😂
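For a rough sense of what an epoch costs, the step count is just dataset size over batch size (the numbers below are made-up examples, not settings from the video):

# Toy arithmetic: optimizer steps per epoch and total steps for a run.
num_clips = 600     # hypothetical number of segmented training samples
batch_size = 8
epochs = 300

steps_per_epoch = -(-num_clips // batch_size)   # ceiling division -> 75 steps; one epoch = every clip seen once
total_steps = steps_per_epoch * epochs          # 22,500 steps for 300 epochs
print(steps_per_epoch, total_steps)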
@fuuka69420 (a year ago)
Hey, another banger video mate! Do you reckon it's wise to keep the sound of breaths, such as when they inhale or exhale? Or do I only need the parts where the source voice talks or sings? Let me know your thoughts and keep up the cool vids!
@Jarods_Journey (a year ago)
Whatever is included in the split audio should be fine. It may cut out some of the breathing at the end or beginning of a sentence, but everything else in between is fine to keep :)!
@lockdot2 (a year ago)
I am still working on it, I have decided to do this on the worst quad core CPU there is, the 1.3 GHz, with no turbo, 4 core, 4 thread AMD Sempron 3850. I spent a bit over a week getting clean audio to save on the Ultimate Vocal Remover. I am using 12 hours of talking.
@Jarods_Journey (a year ago)
There is probably a way to do this on Colab, but atm Colab is a hassle I don't wanna have to deal with :(. Good luck with it 🫡
@lockdot2 (a year ago)
@@Jarods_Journey Thanks! It's going somewhat smoothly, got 5 errors in the CPU part of Visual Studio Code, but I am just going to pretend they don't exist, and move on with it. Lol.
@whimblaster (a year ago)
Do I need to sing in the audio for the dataset, or is talking enough (like reading something from the web)? Thanks - apart from that, great tutorial. ^^
@ControllerCommand (a year ago)
Your channel is amazing. I was looking for this for a long time.
@matrixxman187 (a year ago)
I have 3 minutes of studio-quality lossless vocals I would like to use to train. Is that sufficient? Additionally, there are some interviews on YouTube of the same artist speaking at length, but I was concerned whether the lower-quality mp3 stuff should be avoided for these purposes. Thanks for your video! Very informative.
@Jarods_Journey (a year ago)
Muffled audio should be excluded but if the voice sounds good enough you can include it. 3 minutes may be okay, but idk, you just gotta try it out mate 🤟. 10 minutes or more is recommended but you can use less sometimes and it'll be fine.
@bruhby6276 (a year ago)
Thanks for your content! Why would I use WhisperX though? Is it just for data management, or does it actually help RVC train?
@Jarods_Journey (a year ago)
For curating better data - by using the subtitle timing, there's less chance of audio samples being empty noise.
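A minimal sketch of that idea - slicing the isolated vocal WAV by the Whisper-generated SRT timings so every exported clip contains transcribed speech. It assumes pysrt and pydub are installed and uses placeholder file names, so treat it as an illustration rather than the repo's actual code:

import pysrt
from pydub import AudioSegment

subs = pysrt.open("vocals.srt")               # subtitle timings produced by whisperx
audio = AudioSegment.from_wav("vocals.wav")   # the isolated vocal track

for i, sub in enumerate(subs):
    start_ms = sub.start.ordinal              # pysrt exposes times in milliseconds via .ordinal
    end_ms = sub.end.ordinal
    clip = audio[start_ms:end_ms]             # keep only the span where speech was transcribed
    clip.export(f"segment_{i:04d}.wav", format="wav")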
@enoticlive9103 (a year ago)
Hi! I'm from another country and I don't really understand English, but this topic is very interesting! How can I teach a model to speak my language better?
@alphaxeu (a year ago)
Ultimate Vocal Remover is struggling with some tracks - I can hear the instrumental in the background with Kim Vocal 1. Is there a model where the vocals come out perfect? Great vid!
@Jarods_Journey (a year ago)
The vocal removers are really good, but they're not 100%, unfortunately. That's very hard to achieve, and I'm sure there are brilliant minds working towards it, but it doesn't exist right now. You may be able to get better results with ensemble mode, but you'll have to research a bit on the best combos: github.com/Anjok07/ultimatevocalremovergui/issues/344
@EclipsixGaming (a month ago)
I recommend using software like Audacity for post-processing the audio - it helps with clarity and with removing random noise.
@dookiepost (5 months ago)
If you get an error when running whisperx, make sure you have version 12 of the NVIDIA CUDA toolkit installed.
@PowerRedBullTypology (11 months ago)
Jarod, do you know if there is software or a website that lets you make a new voice out of other voices? Like blend them into a new voice? Especially RVC-type voices (since I know those best), but I'd be curious about others too.
@RayplayzFN (7 months ago)
This is an error I got: RuntimeError: Library cublas64_12.dll is not found or cannot be loaded
@MrAcapella (7 months ago)
SAME! :(
@Timiny118 (6 months ago)
I had this same error but ended up having the file from a previous installation of alltalk_tts. I'm sure you could find it elsewhere though. I ended up placing it in "audiosplitter_whisper\venv\Lib\site-packages\torch\lib" and everything worked as it did in his video.
@davidmaldonado9254 (a year ago)
Thank you for your amazing videos, they really help me understand how everything works. Just one question: I'm having some problems when running the "split_audio" script - it seems it isn't creating the .srt file of the audio, and when it tries to call the file it runs into an error. Do you know what it could be?
@Jarods_Journey (a year ago)
Whisperx may not be being downloaded correctly. I would try rerunning the setup file again and trying to get this going. One other thing you can do is type and enter whisperx into the console after activating the venv to see if it got installed
@davidmaldonado9254 (a year ago)
@@Jarods_Journey Thanks! I'll try uninstalling everything and installing again because now the set-up is showing error when previously it didn't
@nadaup6023 (a year ago)
@davidmaldonado9254 Did you manage to solve it? I have the same problem.
@Zielloss (a year ago)
Run VS Code as admin.
@el-bicente (a year ago)
I think I had the same problem using the CUDA installation. If your debugger tells you that it can't find the .srt file when running the split_audio script, check your terminal logs. If you have an error like this: "ValueError: Requested float16 compute type, but the target device or backend do not support efficient float16 computation." then it means that your GPU does not support FP16 execution. To fix it, go to line 26 in the split_audio script, which should be: return 'cuda', "float16" - and replace "float16" with "float32" or "int8".
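Put differently, the device/compute-type picker can fall back when FP16 isn't supported. A hedged sketch of what such a selection could look like (not the exact function from the repo):

import torch

def get_device_and_compute_type():
    # Prefer GPU float16, fall back to float32 on older cards, and to CPU int8 without CUDA.
    if torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        # Roughly, compute capability 7.0+ (Volta and newer) handles FP16 efficiently.
        return ("cuda", "float16") if major >= 7 else ("cuda", "float32")
    return ("cpu", "int8")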
@jeyraxel (2 months ago)
I'm from the future: Don't install Python 3.12, use 3.10.
@Antonioriccioff (27 days ago)
True story 🤬
@VegascoinVegas (6 months ago)
Exactly what I needed to know
@MFSCraft (a year ago)
Is there some kind of Vocaloid-like interface so that I have some control over how certain words sound? It would be cool to have a TTS that could run the trained RVC voices.
@Jarods_Journey (a year ago)
ATM, I don't know of any that use RVC voices, though I'm bound to see it happen someday.
@JobzenPadayattil (a year ago)
Hey bruh, I'm getting some errors while converting trained data to output: ffmpeg error + dtype/type error... (ffmpeg is already installed)
@kaant21 (a year ago)
Don't forget to change the execution policy back to default when you are done with this.
@shampun2281 (a year ago)
They have been updated and now it is not possible to sort files by speakers. Can you look at the new version and tell me what can be done? Is it possible to use the old version somehow?
@fountainbird (a year ago)
Thanks for the vid although I'm confused. I understand the UVR step to isolate vocals. I would generally then use that as the dataset. What is the benefit of the next step of splitting the file up? is that all it does? What else is happening that I don't know about? I've generally just used longer clean audio files for training. Thanks for enlightening me :)
@Jarods_Journey (a year ago)
By splitting it, we solve the biggest issue of CUDA running out of memory, as I don't believe RVC splits larger audio files into more digestible chunks. Splitting allows us to control this issue and, additionally, get rid of any silence in the audio samples. Then there's also the fact that you can easily remove any bad data from the audio file that you may not want in the training set. If you're running it just fine with UVR without the out-of-memory issue, you should be good to go, but splitting just gives you a bit more freedom with the data.
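If you want to pre-split long files yourself before training, a silence-based split along these lines also works. pydub's split_on_silence is an assumption here (not necessarily what the repo uses), and the thresholds are guesses to tune by ear:

from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("long_take.wav")      # placeholder file name
chunks = split_on_silence(
    audio,
    min_silence_len=500,                  # ms of quiet that counts as a break
    silence_thresh=audio.dBFS - 16,       # anything 16 dB below the average level is "silence"
    keep_silence=200,                     # keep a little padding so words aren't clipped
)
for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i:03d}.wav", format="wav")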
@MaorStudio (11 months ago)
Thank you so much. King!
@denblindedjaligator5300 (8 months ago)
Just a question: how high is your batch size when you train? Is it something where, if you set it too high, you get an imprecise model? If I have a dataset of one hour, what should my batch size be?
@wugglie (a year ago)
For some reason I keep getting an error where it cannot open the vocals.srt file. Did I miss a step? There is no vocals.srt file generated in the output folder for audiosplitter.
@battletopia (a year ago)
I'm having the same problem. Did you manage to sort this out?
@chaunguyenthanh6664 (5 months ago)
Hi Jarods, can I use large-v3 model instead of large-v2?
@ShelfxYT (a year ago)
Do you have any voice modifications like the ones in the video running in real time, to use in Discord for example, like Voicemod/Clownfish?
@Grom76300 (a year ago)
I thought this included both the separation and training, but all those GB of programs are only for isolating voice, daym !
@KrazyGen (a year ago)
I'm trying to do my own voice and got some decent results, but it can't handle higher pitches. Should I add more samples with my voice in a higher pitch, or give it more samples with my normal voice and train it for longer? I have it trained using the Harvard Sentences from a previous video and I did 300 epochs.
@Jarods_Journey (a year ago)
You can try adding samples of higher pitch, it's mainly going to be good at speaking in the pitch and timbre of the voice you train it with, so if your voice is naturally deeper it's not going to know how to handle that if you try to speak high all of a sudden
@chranman1855 (a year ago)
I'm getting a FileNotFoundError in Visual Studio Code, where it cannot find srt_file. I followed your tutorial step by step, but I'm sure I did something wrong since I don't get the same results when I run the program. Since I have no Python experience, I'm not sure what I did wrong here.
@Jarods_Journey (a year ago)
Some people have reported that it'll work if you try running VS Code in admin mode.
@chranman1855 (a year ago)
@@Jarods_Journey Thank you for responding! I will try that.
@colinosoft (9 months ago)
Maybe it's too late, but I solved it with "pip install -r requirements-cuda.txt" - in my case I have an Nvidia graphics card; if you use CPU then replace it with "requirements-cpu.txt". For some reason there is a missing package that is not installed when running "setup-cuda.py". Always run the command within the virtual environment created previously with "venv".
@handsomebanana4060 (a year ago)
What if my voice doesn't speak any of the default languages? I have found a phoneme-based ASR model that suits me, but how do I use it in your code? Anyway, great tutorial!
@Jarods_Journey (a year ago)
Ah... I haven't dabbled in that area yet and don't know how it works in other non supported languages. I would test it as a command line script first to see if you can get it working that way. I believe the --align_model argument would need to be used
@gamecreator7214 (22 days ago)
The whisper repo no longer has setup-cpu and setup-cuda files. Do I just download later versions, or is there a newer tutorial?
@Metalovania (4 months ago)
Hi! I followed your tutorial and managed to set everything up and run the script without getting any errors, but the problem is that I didn't get the expected amount of segments. I tried the script with three different audios. The first one, of about 4 minutes, got me an output of 35 seconds' worth of segments; the second one, also about 4 minutes, got an output of 1 min 36 sec total; and the third, a bit over 2 minutes, got 55 seconds. Do you know what the issue could be? Also, I tested speaker diarization with another audio but it didn't go very well. It had 4 different speakers, which it separated into only 2, and all 4 speakers were in both folders.
@matthewpaquette (a year ago)
Great tutorial!!
@TheChipMcDonald (a year ago)
1) what/how/can I change this to have multiple data directories (if I want to tweak/add on a later retry, and as a way of keeping things organized). I presume I can make a subdirectory like the "vocal" ones for each unique dataset? 2) can I bypass the audio split step if I've exported my dataset in
@Jarods_Journey (a year ago)
1. Each file you put in the data folder will be exported to its own segmented folder in the output folder. Once finished here, I recommend moving the finished files somewhere else on your PC. 2. Yes, no need. 3. The exported files (segmented pieces) are coded in by me and organized to export to the folder you chose at the start. Means unlimited freedom if you wanted to modify the code. 4. It sort of is a batch process - what additional feature are you looking for? From the question, I'm assuming you just want to choose an input and an output folder, right? Since it makes a folder per file name, I can see this being a bit cumbersome to have to manually move them into one directory, but this is for sorting reasons. A 3060 is good as it can utilize CUDA. Imo, the 3060 gives more flexibility due to its 12 GB VRAM, so it would be the cheaper option to go with compared to, say, a 3070 or 3060 Ti.
@TheChipMcDonald (a year ago)
@@Jarods_Journey 1) ok 2) ok 3) ah; following along without actually doing it makes it easy to discount where you started at, ahrgh, sorry 4) by batch, effectively automating starting Visual Studio, getting to the point where training ui begins... or in essence, an actual app ala UVC that does the environment setup, python behind the scenes. I want to copy my dataset over, then jump to a ui to start training.... and ideally the same ui to manage models, inference. Installing python, visual studio etc. are one time things I don't mind - I'm thankful you've done these tutorials, but the steps, steps, steps, steps, steps just to get to starting training seems automatable? My interest is in music, singing replacement; and what happens by tweaking the dataset, getting to what I hear in my head. Which I want bad enough to jump through hoops (and buy a new pc I previously didn't need, lol) but.... gahhh... it's like being a kid again, configuring AUTOEXEC.BAT and CONFIG.SYS for hours, only to be burned out by the time you get Wolfenstein to run in SVGA with a hand-me-down SoundBlaster 16 card....
@TheChipMcDonald (a year ago)
@Jarods_Journey Thanks
@Jarods_Journey (a year ago)
@@TheChipMcDonald Gotcha! The RVC web-UI is actually pretty close, it's literally just missing the data curation side of things as it comes in a downloadable release too. A few more quality of life things later like file browsers instead of paths, etc. and I think we're looking at a very robust and easy to follow workflow. I'll definitely keep the channel updated WHEN someone comes out with something that has all of the puzzle pieces put together. 🙏
@Skurios18 (7 months ago)
Just a maybe random question: I was having issues installing the audio splitter and I thought it was because I hadn't installed NVIDIA's CUDA toolkit, so I ended up installing it, but it turned out something else was giving me the error. So my question is: should I uninstall this CUDA toolkit? I don't know what it does exactly, or whether it will harm my configuration or GPU in the future.
@rytraccount4553 (a year ago)
The code does not generate an srt file for me from a single WAV, and I get a FileNotFoundError: No such file or directory: 'D:vocalsplittest/data\\output\\song.srt
@rytraccount4553 (a year ago)
Apparently this is an issue with whisperx, as some devices like mine do not support this float type, making this code unusable :(
@lerian7669 (a year ago)
same problem
@smokey4049 (a year ago)
Yeah, have the same problem. Hopefully it will be fixed soon
@Jarods_Journey (a year ago)
Can you try setting it up with setup-cpu? My laptop has an i7-8650U and works with this setup - it switches it over to int8 instead of float16.
@colinosoft (9 months ago)
Maybe it's too late, but I solved it with "pip install -r requirements-cuda.txt" - in my case I have an Nvidia graphics card; if you use CPU then replace it with "requirements-cpu.txt". For some reason there is a missing package that is not installed when running "setup-cuda.py". Always run the command within the virtual environment created previously with "venv".
@nexgen91 (7 months ago)
I have audiosplitter_whisper installed and VS Code opened. Trying to run debugging as per 12:00 in the video, I get the following error: "configuration 'python: file' is missing in 'launch.json'". Any idea what might be going on? BTW: it appears to work if I run "python split_audio.py" in PowerShell.
@michaelcasado (6 months ago)
Everything points to using Windows for these. Or am I missing something? I am on macOS and all the stuff is .bat and .exe files, or Google Colab sandboxes running things. Is there no UI to this date that also runs on macOS? Have I missed it perhaps?
@paulovictor5123 (6 months ago)
Hey Jarods, much appreciation for your tutorials. I'm facing an issue when running split_audio.py. I'm using a Spanish database, followed all your steps and changed conf.yaml to language: "es". But when I run the split_audio.py script, I face this issue:

Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'D:\\Documentos\\VoiceCloning - AudioSplitter\\audiosplitter_whisper\\data\\Vocals\\output\\100_Salmo 53_(Vocals).srt'
  File "D:\Documentos\VoiceCloning - AudioSplitter\audiosplitter_whisper\split_audio.py", line 96, in extract_audio_with_srt
    subs = pysrt.open(srt_file)
  File "D:\Documentos\VoiceCloning - AudioSplitter\audiosplitter_whisper\split_audio.py", line 150, in process_audio_files
    extract_audio_with_srt(audio_file_path, srt_file, speaker_segments_dir)
  File "D:\Documentos\VoiceCloning - AudioSplitter\audiosplitter_whisper\split_audio.py", line 180, in main
    process_audio_files(input_folder, settings)
  File "D:\Documentos\VoiceCloning - AudioSplitter\audiosplitter_whisper\split_audio.py", line 183, in <module>
    main()

Can you help me out?
@ann96662 (5 months ago)
did you fix the issue ??
@sukhpalsukh3511 (a year ago)
Great, thank you for this video!
@xerotivi (a year ago)
Just found your channel and wanted to ask if you know any way to follow these steps on a Mac. As a student, the only computer I have is my MacBook Air M1. I watched your video where you show how to use RVC on Colab, and I want to learn how to create my own dataset and remove vocals from songs.
@Jarods_Journey (a year ago)
You can run this on CPU using setup-cpu, though I haven't tried it myself since I don't have a Mac. You could technically do all this in Colab as well, but you'll have to set that up yourself.
@xerotivi (a year ago)
@Jarods_Journey I will spend some time on it, and if I find a way, I will post it here for others.
@LosantoBeats (a year ago)
Can I use talking + singing audio to create my model or should it be split into two separate models. One for singing voice and one for talking voice. I am having trouble finding clean singing audio for my model and considering using talking audio from like interviews etc.
@Jarods_Journey (a year ago)
You can use both. As long as it's the same voice, it should be fine
@CalmEchos-r6l (9 hours ago)
ERROR: Could not find a version that satisfies the requirement torch==2.0.0+cu118 (from versions: none)
ERROR: No matching distribution found for torch==2.0.0+cu118
@edwincloudusa (a year ago)
Can you make a video on how to keep the emotions from the original source voice? I have everything working beautifully for a clean and perfect voice clone, but my source audio has some strong emotional acting - anger/fear/happiness etc. - that is not represented in the cloned audio. Thanks.
@LosantoBeats (a year ago)
Does it matter if my source audio is chopped up? For example incomplete words/sentences etc..
@Random_person_07 (a year ago)
Just a question: does it remove the background voice of another speaker if there is another speaker talking behind the target speaker?
@Jarods_Journey (a year ago)
Unfortunately it does not - overlapping speech and disentanglement is still a work-in-progress research field.
@Random_person_07 (a year ago)
@Jarods_Journey One last question: what does speaker diarization do? Like cut out each speaker? Nvm, you explained it in the video.
@moddest7123 (a year ago)
Hey Jarod. Slight issue when cloning the audiosplitter_whisper. I don't get the .git file at the top. Just the rest of the files. How do I fix that?
@alexjet5890 (2 months ago)
How do you use the dataset created following this tutorial with AI Voice Cloning 3.0? You don't explain how to use it. Can you make a video?
@aboodghanem1679 (10 months ago)
Hello, I would like your help regarding voice reproduction via Google Colab. Is the data uploaded in WAV mono or stereo format, and is it 16-bit or 24-bit?
@sonofforehead (a year ago)
Hey, at 12:22 I get a similar error, but .\venv\Scripts\activate doesn't seem to fix it - are there any other solutions? It's giving me a "FileNotFoundError", highlighting "subs = pysrt.open(srt_file)". Here's most of the error (there's more, just basically the same thing):

Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'C:Users/myuser/OneDrive/Desktop/deleteme/audiosplitter_whisper/data\\output\\MyDataSet.srt'
  File "C:Users\myuser\OneDrive\Desktop\deleteme\audiosplitter_whisper\split_audio.py", line 101, in extract_audio_with_srt
    subs = pysrt.open(srt_file)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Also, great video so far!
@Jarods_Journey (a year ago)
Something happened when trying to make the srt file - make sure that whisperx downloaded and the setup ran without issue. You may also have to run VS Code in admin mode.
@colinosoft (9 months ago)
Maybe it's too late, but I solved it with "pip install -r requirements-cuda.txt" - in my case I have an Nvidia graphics card; if you use CPU then replace it with "requirements-cpu.txt". For some reason there is a missing package that is not installed when running "setup-cuda.py". Always run the command within the virtual environment created previously with "venv".
@gokulkrish3839 (a year ago)
Do we need a high-end GPU spec to do the things you showed in the video?
@Jarods_Journey (a year ago)
Anything that is an Nvidia 3060 12 GB or above should be fine - even 20-series cards still work too. Anything that is not Nvidia often has issues, so I don't recommend those.
@zafkieldarknesAnimation (a year ago)
Hello, please help me with this error: (Requested float16 compute type, but the target device or backend do not support efficient float16 computation.)
@battletopia (a year ago)
I am having similar issues, did you ever figure it out?
@Arc-Trinity (10 months ago)
ERROR: Error [WinError 2] The system cannot find the file specified while executing command git version
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH? lol
@olaitanluvsojewale (a year ago)
One more question... for now... if that's okay? Say I wanted to be excessive and get the cleanest, most accurate, almost perfect result possible on the first training run, and I had 1 and 1/2 hours or even 2 hours max of audio data, and my PC could probably handle it (for context, I have an NVIDIA GeForce RTX 3060 graphics card and 32 GB of RAM). What is the max number of epochs you'd recommend I train for?
@Jarods_Journey (a year ago)
Dunno, the big answer is "it depends". Just try training for 10 epochs and hear how it sounds. Train for other epoch counts and try those as well. You're looking for the lowest epoch #.
@olaitanluvsojewale (a year ago)
@@Jarods_Journey Oh okay then 🤔 Thank you a lot! I really appreciate you taking the time to answer
@GaypataponALT (10 months ago)
I have 954 audio files in my training folder - is that a bit too much for RVC to train on?
@MadFakto (3 months ago)
Which Video Player do you use?
@nazersonic6938 (a year ago)
Thanks for the helpful video. I have a GTX 1660 Ti with 6 GB VRAM and CUDA says I am out of memory - is there a low-VRAM option like in Stable Diffusion, or am I stuck with using the CPU?
@Jarods_Journey (a year ago)
There are some low VRAM options built into whisperx that have to be passed, you would have to modify the script to do that. I'll get around to adding it when I get the chance
@MinerCold-w1s (a year ago)
Hello @jarod, I got this error while it was creating the output and vocal audio sets:

CUDA is available. Running on GPU.
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.6. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file C:\Users\kit\.cache\torch\whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118. Bad things might happen unless you revert torch to 1.x.
>>Performing transcription...
Traceback (most recent call last):
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\Scripts\whisperx-script.py", line 33, in <module>
    sys.exit(load_entry_point('whisperx==3.1.1', 'console_scripts', 'whisperx')())
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\whisperx\transcribe.py", line 159, in cli
    result = model.transcribe(audio, batch_size=batch_size)
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\whisperx\asr.py", line 288, in transcribe
    for idx, out in enumerate(self.__call__(data(audio, vad_segments), batch_size=batch_size, num_workers=num_workers)):
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\transformers\pipelines\pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\transformers\pipelines\pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\transformers\pipelines\base.py", line 1028, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\whisperx\asr.py", line 228, in _forward
    outputs = self.model.generate_segment_batched(model_inputs['inputs'], self.tokenizer, self.options)
  File "C:\Users\kit\Desktop vc\audiosplitter_whisper\venv\lib\site-packages\whisperx\asr.py", line 138, in generate_segment_batched
    result = self.model.generate(
RuntimeError: CUDA failed with error out of memory
@olaitanluvsojewale (a year ago)
Hello, I got a few questions.. So I have access to 6ch audio with the voice I want to clone, and I'm extracting it all manually using Adobe Audition. 1. Using UVR helps remove any lingering bg noise but sometimes a little noise will remain. It is not that noticeable, so is it okay to have a little noise or will that affect the model? 2. I know to remove long silences, but what about the small gaps between when the character is actually speaking, should I remove that too so it is just a continuous stream of talking with not even 0.5 second breaks? And what about the sounds when a character isn't actually speaking, e.g. growls or hums, or breathy sounds like laughing, that naturally have some silence in there.
@Jarods_Journey (a year ago)
My observation is that a little bit of noise is okay - it shouldn't be that noticeable. I do have one case, though, of a model where the background noise that wasn't removed is audible in the output. Hard to get it perfect. 2. The little gaps are fine; as for growls and whatnot, I'd say cut those out, but I haven't actually tried, so I can't say for certain.
@olaitanluvsojewale (a year ago)
@@Jarods_Journey Thank You!
@lisaree-tn8dm (a year ago)
Hey, is it bad if there are low sounds of people slamming doors or making pop-like noises in the background? (They get loud on purpose every time I sing.) I can't get rid of those, as well as plosives from breathing, but you can still hear my voice :/
@PeteJohnson1471 (a year ago)
Make space cakes and give them to them, start recording an hour later. You should be good for a few hours whilst they are all monging on the sofa ;-) I feel for your situation, that people around you can't be reasonable with you for ten or so minutes. Maybe show them some videos of what you are looking to do, and offer to make them a voice, on the proviso that they shut up for 10 minutes whilst you do yours. Good luck!
@Overneed-Belkan-Witch (a year ago)
Hi Jarods, I'm currently working on a project of making an audiobook using a cloned voice, where I will be the voice. How good will the training be if I have an i5 and a GTX 1060 6 GB? Is this enough?
@Jarods_Journey (a year ago)
That GPU might be rough... You might wanna train on Google Colab. The training quality should be the same, just the training time will be different.
@Overneed-Belkan-Witch (a year ago)
@@Jarods_Journey Thanks for the tips
@youngtrapgod6375 (9 months ago)
Can this be done for so-vits? Because RVC loses the human element in my voice when I try making cover songs.
@supersonicunitedsupersonic8531 (10 months ago)
I have a source track with background noises, and of course I can solve that using UVR5 or other voice-isolation VSTs, but there are also segments with a lot of voice reverb, and when I decrease that reverb it cuts the low-mid frequencies from the voice. What should I do in that situation? Maybe I need to find a reference with good EQ and try to improve the target data using EQ matching?
@Jarods_Journey (10 months ago)
In this case, you're in a tough spot, because if you can't clean the data, it may have some murkiness in the final output. As much as you can, you want to get your audio as clean as possible before training.
@Malkovitz_ (a year ago)
Thanks for the tutorial. Could you please explain how to replace the Whisper model with one that was trained on my native language?
@Malkovitz_ (a year ago)
BTW, I already found the model, but it's still a mystery how to use it with your script.
@Jarods_Journey (a year ago)
Sorry mate, I haven't looked into this area and don't know exactly how to do it either. You have to tell whisperx the location of the alignment model you're using, but that's as far as I know.
@3ool0ne (2 months ago)
Hey, can you do an update on this video? Any new tools and methodologies that replace what is outlined here?
@21f.a.c.e.s (7 months ago)
Unfortunately, I don't see any CUDA setup file in the cloned directory. Any help?
@martin_taavet (a year ago)
hey, will this app work in multiplayer games, or is it client-side only? i tried using it in discord, there seems to be no effect.
@Jarods_Journey (a year ago)
It works, you just need to connect via VB-Audio Cable (tutorial on the channel).
@kratoos0.0 (11 months ago)
When I run the script, this is my error: No module named 'yaml'
@maxikittikat (10 months ago)
I had to manually go through the pain of figuring it out. Basically, make sure you're not in the virtual environment - to make sure, type "deactivate". Then, for everything that isn't installed or says the module name isn't found, find the install command online and add "--use-pep517" after it. So try "pip install PyYAML --use-pep517" for yaml.
@basspig (a year ago)
How important is it to remove silence between the speaker's words or does it matter at all?
@Jarods_Journey (a year ago)
It may help reduce some artifacting, but oftentimes you can leave some silence in there and it'll be fine.
@basspig (a year ago)
@Jarods_Journey Does the artifacting sound like octave-shift/cracking/falsetto effects? That's been a problem with some of the voice models I've made, and some that I've downloaded and tried using.
@nobodywakeup (a year ago)
Dude, can this AI voice be used in OBS or Discord? Please make a tutorial.
@tetragrammaton3 (a year ago)
Watch his other videos, he explains this already.
@denblindedjaligator5300 (8 months ago)
Hi, I have a question about RVC. I am trying to train a model where I have chosen no pitch, and it sounds autotuned. How can I fix it? How does learning rate work? What is batch size?
@Jarods_Journey (8 months ago)
Not too sure about this unfortunately
@jeremybauchet6845 (a year ago)
Hello! I've followed the tutorial closely three times, but I keep getting one error at line 101: "Exception has occurred: FileNotFoundError". It seems to be looking for an srt file? Also, the terminal says "Requested float16 compute type, but the target device or backend do not support efficient float16 computation."
@Jarods_Journey (a year ago)
That means no srt file was generated by whisperx. Try redownloading with setup-cpu.py, as your GPU probably doesn't support float16. That, or in the code you can change it to int8 where there is float16. I'll need to work on a fix for this.
@jeremybauchet6845 (a year ago)
@@Jarods_Journey Thank you ! I'll try so.
@grasshoffers (8 months ago)
I do not think I have CUDA... just CPU, but got this error:
ERROR: Could not find a version that satisfies the requirement torch (from versions: none)
ERROR: No matching distribution found for torch
@caleb8857 (a year ago)
When running it, like at 13:00, it says `failed to align segment ("!!!!!!!!!!"): no characters in this segment found in model dictionary, resorting to original...` multiple times, and once it finished, the folder had no segmented audio and was just empty. How do I fix this?
@Jarods_Journey (a year ago)
I think this is a language issue - if your audio files have multiple languages being used, that causes issues with whisperx, as does an unsupported language. Beyond that, please reference the whisperx GitHub issues page for more details, as I'm not sure what else causes this.
@oliviosih347 (a year ago)
Mine says: Failed to create virtual environment. Error: [Errno 13] Permission denied
@itschepi (a year ago)
did you solve that?
@kaililkendrick783 (8 months ago)
Same error for me. Unable to get past this.
@IPartyWithUrMom (6 months ago)
Fixed it. Uninstall Python and reinstall a version no later than 3.9.
@Englishnamee (a year ago)
Hii! I've been following your videos and it has been totally awesome. However, since my setup is outside my room, I often encounter a lot of background noise that I can't really escape from (family, vehicles passing, fan noise, etc.). I've been looking around the app but I can't seem to find a solution for implementing noise cancellation on my AI voice. Does anyone know a fix for this? (And no, I can't move my setup :c)
@Jarods_Journey (a year ago)
That's a tough situation, you'll either need a better mic that is less sensitive or try to find other audio processing software that can be done before feeding into the AI voice
@victoroam (7 months ago)
11:01 I don't know why, but I keep getting the same error (No module named 'pysrt') even though 'pysrt' is already installed.
@AmanKhan-bj7im (a year ago)
Hi, if my model is undertrained, do I have to train a new model with more voice samples, or can I do something with the current one?
@Jarods_Journey (a year ago)
If you've reached a flat spot and your model still sounds bad, you'll need to add more data
@foxey461 (a year ago)
🔥🔥
@RoxWinted (a year ago)
Hello, I'm asking anyone right now because I got a bit lost. I'm trying to make the AI voice not glitch out whenever I'm doing long vowels, so it doesn't look for all of them at once and sound like a mess. So far I thought you have to train them to sound better, but I think that's not the case. Can someone explain what I have to do to achieve this?
@junofall (a year ago)
How come we have to split the audio data into smaller parts? I just threw a 30 minute audio file at RVC and it handled it no problem.
@Jarods_Journey (a year ago)
Splitting the audio into smaller parts does two things: it allows you to train on the dataset without running out of VRAM, and it lets you clean the silences out of an audio file. What type of file did you feed it - was it a WAV file?
@fountainbird (a year ago)
@Jarods_Journey This was my question as well. I've just always used full-length clean audio files. I usually do a few things in Audacity before training: I'll truncate silence, convert to mono, and de-noise if needed. I'm on a 4070 and haven't run out of memory with hour-long WAV files. As cool as this setup is, I don't think it does much for me personally aside from the UVR process. Am I missing something? Is it just best practice to split up files for use as datasets? Thanks for everything!
@NorasHobbyverse (11 months ago)
Thanks for trying, but this thing has failed for me multiple times and I'm tired of trying to troubleshoot it. Is it that hard to just make an executable for people to use? I don't know jack shit about code and can't fix it when it doesn't do the same thing your computer does, even when following all the steps.
@denblindedjaligator5300 (a year ago)
There are 2 new versions of RVC, and I can now train on my AMD graphics card without running out of GPU memory!
@Jarods_Journey (a year ago)
This is awesome! I saw that they recently added AMD support with DirectML!
@takoz53_osu (a year ago)
How do you know Speaker 0, Segment 0 maps to Speaker 1 Segment 47? If you don't know the exact segment, you'll still have to manually search for it.
@Jarods_Journey (a year ago)
You still have to manually sort and name the speakers based on what you hear for each speaker. It's advanced, but not 100% there yet.
@takoz53_osu (a year ago)
@@Jarods_Journey Thank you for the quick answer and for making it more clear :)