The Secrets Behind Voice Cloning & AI Covers

74,687 views

1 day ago

Comments: 170

@bycloudAI 1 year ago

To plug the sponsor: try everything Brilliant has to offer free for a full 30 days, visit brilliant.org/bycloud . The first 200 of you will get 20% off Brilliant’s annual premium subscription! P.S. Nothing in this video is voiced by a real person. All the voices are fake (except for 12:32 lol) The first 1 min (0:00~0:58) is generated using voice2voice with my real voice as the reference. 0:58~12:47 is generated with the combo which I mentioned in 11:46. From 11:46 till the end is all ElevenLabs Pro Voice Cloning.

@bycloudAI 1 year ago

@@thelegendguyofficial dw the music and the content is not HAHAHA and will probably not be anytime soon here's the music yt link kzbin.info/www/bejne/ZoPcmHVjosajqc0 this person makes banger lofi, go support them

@NevelWong 1 year ago

@@bycloudAI So.... if it's ai generated, it cannot be copyrighted, right? So if I use this copyright-free voice to train a model, and I then use that model to narrate my own videos, that would be legal, right? I am equal parts concerned and titillated.

@jamessharpe2630 1 year ago

@@NevelWong voices in general can't be copyrighted. If it was a slogan (arrangement of sounds) or roar/yell then yeah, copyrightable.

@Mark_Rober 1 year ago

I was thinking to myself every so often 'his voice sounds a bit fake' but I swear it was just because this video was about cloning AI voices and if you had done anything else, like make a minecraft video for example, I wouldn't even have imagined it being AI.

@Deagan 1 year ago

based.

@__aceofspades 1 year ago

I didnt realize this was AI narrated until you said it was... I just assumed the scuff in the audio was due to using a worse mic like from a laptop or some screw up when editing, it sounded off but not AI off. As much as I believe AI is the future, we are clearly going to be in for a very very rough ride from here on out. You'll basically only be able to trust that something was real if you saw it in person, no audio, no pictures, and no video will be trustworthy.

@gh0stpyram1d 1 year ago

fr i had a whole mental picture of how this admin looked and i realize that was a mental picture of a robot lmaoooo

@asdfssdfghgdfy5940 1 year ago

Nah there are relatively simple ways of digitally signing things to prove you said them or filmed them etc. It will become a problem for the masses for sure especially if people keep believing whatever they see on Facebook. It will be easy enough for the more tech savvy peeps, or people who are required to vet things (e.g. Reporters) to work out if they are real or not. Or at least if they have been signed or not.
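As a rough illustration of the signing idea in the comment above, here is a minimal Python sketch. It uses HMAC-SHA256 from the standard library, which is a shared-secret scheme and only a stand-in here: real provenance signing would more likely use a public-key signature so that anyone can verify a clip without holding the secret. The function names, key, and sample bytes are hypothetical.

```python
import hashlib
import hmac


def sign_media(data: bytes, key: bytes) -> str:
    """Return a hex tag that binds the media bytes to the secret key."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify_media(data: bytes, key: bytes, tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = sign_media(data, key)
    return hmac.compare_digest(expected, tag)


if __name__ == "__main__":
    key = b"hypothetical-secret-key"       # assumption: shared out of band
    clip = b"raw bytes of a recording"     # assumption: stand-in for a file
    tag = sign_media(clip, key)
    print(verify_media(clip, key, tag))            # untouched clip
    print(verify_media(clip + b"edit", key, tag))  # tampered clip
```

Any change to the bytes, even a single sample, produces a different tag, which is the property the comment is relying on.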

@quazar-omega 1 year ago

Then the Matrix credits roll in inside your eyes

@Kmrn596 1 year ago

WTF I thought that was your voice. I guess generative AI these days is something else.

@albertsitoe7340 1 year ago

I struggle to understand how society will even function in the next 50 to 100 years

@David.Alberg 1 year ago

@@albertsitoe7340 Bro all the experts struggle to say if society will function in 3-5 years 😂

@Kynatosh 1 year ago

I heard artifacts so I had doubts

@rotors_taker_0h 1 year ago

Nice information dump, good job on collecting all this info. To be honest, this tech is good enough that I wouldn't be surprised if any of your previous videos were voiced by AI too. As a random youtube viewer I have no idea if cartoon cloud's voice is a real person or totally generated anyway.

@phizc 1 year ago

First time viewer here. When this video showed up in my feed, that click-baity title almost made me skip it, but this is definitely the best video about different options for TTS and voice cloning I've seen yet. Well done. I'll definitely stick around and see what other videos you've made.

@trollenz 1 year ago

Everything you always wanted to know about speech synthesis* (*but you've never found). Thanks mate for this masterclass ! ❤

@krystiankrysti1396 1 year ago

"most boring" bit you mention is actually the most useful info in this video, links to websites and what theyre for

@juanjesusligero391 1 year ago

Your videos are the best, seriously! Not only do you keep us in the loop about all the cool AI stuff, but you also manage to make it super entertaining. Big thumbs up, man! :D

@maki_ligon 1 year ago

6:47 nope. It was sovits. They used my weeknd model. Sovits is pretty good at raw studio quality vocals assuming the dataset is good. Which my weeknd model isnt it lol

@krishp1104 1 year ago

wtf this is the first time AI actually fooled me

@ojsef39 1 year ago

i was eating while watching and only noticed it because of the muffle and the red line in my peripheral vision hahaha

@ojsef39 1 year ago

oh damn, i wasn’t at the part where he revealed it yet. im shocked hahah

@handle__ 1 year ago

@@ojsef39 same. When I first saw the comments, before I had reached that part, I thought people meant the red line parts, but then mind blown🤯😮

@wham7125 1 year ago

Definitely not the first time, but you wouldn't know that of course.

@wuy4 1 year ago

That Asmongold cameo lol

@zyxwvutsrqponmlkh 1 year ago

You knocked this one out of the park. A+ video.

@netoeli 1 year ago

The fact that you have to let us know that was not an actual real discord call with asmongold, as if the intelligence in the choice of words did not give it away already

@Askejm 1 year ago

TRUE

@shadowrealms2676 1 year ago

@@Askejm BIG W!

@USBEN. 1 year ago

BRUH made whole video with this, EPIC!

@zjihf 1 year ago

Thank you I ve been searching for this so long

@icedude_907 1 year ago

Thanks so much for this - this is a great place to start for AI voice generation on local machines. I'm eager to experiment on mine

@zenu903 1 year ago

I was actually fooled too and didn't realize it wasn't his voice until he pointed it out. Any imperfection you hear could be confused with his accent anyway and his monotone voice also helps so it makes it extra hard to spot

@dudedude-su7pt 1 year ago

There are thousands of channels like this lol. Most people don't know what voice is robotic or real

@gh0stpyram1d 1 year ago

Goated Ai channel

@marian3248 1 year ago

I was watching this video at 2x speed and got giga fooled by your ai voice, I really couldn't tell this wasn't you.

@jurandfantom 1 year ago

So if I get it right:
1) record voice
2) use whisper to get a transcription (+ some fixes to the text)
3) use a text-to-voice model that is similar to our voice
4) use voice-to-voice (that model needs to be trained on our own voice)
---
- Training of the voice happens once.
- We are doing all of that to make our dialog smoother, but we still fit the voice-over to the video for correct speed and length (not the case when the video is created after the voice).

@FenrirRobu 1 year ago

What's up with skipping like a dozen webuis for audio. Not just for this video but many others on the audio AI also just end up showing some barebones default UI and completely miss the projects that are specifically improving the UI and UX.

@herdenq 10 months ago

Can you suggest some? :)

@FenrirRobu 10 months ago

@@herdenq I have forgotten a few but there's bark infinity, audio webui, tts webui, then for music there's also audiocraft-webui, Audiocraft plus. RVC has some specific additional UIs, there's also the tortoise RVC pipeline but I'm not sure if it's an UI. I watched the video again and I will say that it's well researched but it focuses on teaching about the technology, rather than showing the best ways to use it. If you want to hardcore go on tortoise, mrq might still be the best (although I think already during this video mrq was migrated to mrq's audio tools or something), RVC's original UI has the most buttons and unexplained options. I'm glad he didn't mention coqui because, at least 6 months ago it was just a closed source tortoise clone.

@ceticx 1 year ago

amazing video, if this isn't a 1/10 confetti video just know it deserves to be

@bycloudAI 1 year ago

its a 10/10 bottom feeder lol rip

@pikaa-si9ie 1 year ago

@@bycloudAI I'll give you a like to try to push the algorithm 👍😁😁

@sneedtube 1 year ago

Wew lad, one of the best vids that I watched in months. God-tier quality!

@4.0.4 1 year ago

Might be good to mention you can run Whisper locally to transcribe audio. The large-v2 model is better than whatever KZbin uses, even if slow.

@Askejm 1 year ago

Well its included by default in MRQs tortoise ui and i think RVC uses it too

@ShepoPL 1 year ago

At 0:09 I realized that was AI model of your voice. It's hilarious to listen to AI talking about how great voice deepfake is 😂

@Askejm 1 year ago

well thats funny because the first minute is his real voice

@ShepoPL 1 year ago

@@Askejm You're wrong my guy. Listen carefully when he talks with high pitch and compare it with his other videos where he talks this way. You will hear the slight difference

@Askejm 1 year ago

@@ShepoPL no, he did narrate it normally. the artifacts are probably because we added V2V for it to be consistent with the rest of the video. as this was done with RVC v1, it led to some artifacting despite a ground truth input

@herdenq 10 months ago

@@Askejm He mentions at the end that this is AI

@l.halawani 10 months ago

super interesting, as an AI Product Owner i find your videos invaluable to quickly catching up with all tech at once.

@Siacourage 10 months ago

Best video about AI voice cloning I've found so far on the internet. I'm saving it to revisit later when I have more powerful hardware to run the Tortoise and RVC combo. In the meantime I think Eleven Labs will suit my needs. Thanks for all the great info. Subscribed.

@Beyondarmonia 1 year ago

That "listening to right now" hit me like a freight train. Came to the comments and happy to see everyone else is having a similar reaction.

@JohnDoe-nn5pj 1 year ago

the biggest problem with TTS is that you need to make a transcription file for all your audio files. Tacotron needs 1-3 hrs of transcribed audio, and that can take a very long time to do. RVC and SVC don't need transcripts so it's much easier to make training data.

@Askejm 1 year ago

just use whisper
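To make the "just use whisper" suggestion concrete, here is a hedged Python sketch of turning a folder of clips into a transcript file. The pipe-delimited `clip_id|transcript` layout is an assumption (similar in spirit to common TTS dataset manifests); check what your trainer actually expects. `build_metadata` is a hypothetical helper, and the Whisper part assumes the `openai-whisper` package is installed, which is why it is imported lazily.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def build_metadata(wav_paths: Iterable[str],
                   transcribe: Callable[[str], str]) -> List[str]:
    """Return one 'clip_id|transcript' row per clip that has any speech."""
    rows = []
    for wav in sorted(wav_paths):
        # The pipe is the column delimiter, so it must not appear in the text.
        text = transcribe(wav).replace("|", " ")
        text = " ".join(text.split())  # collapse runs of whitespace
        if text:
            rows.append(f"{Path(wav).stem}|{text}")
    return rows


def whisper_transcriber(model_name: str = "large-v2") -> Callable[[str], str]:
    """Build a transcribe function backed by a locally run Whisper model."""
    import whisper  # assumption: installed via `pip install openai-whisper`

    model = whisper.load_model(model_name)
    return lambda path: model.transcribe(path)["text"]
```

Usage would look like `rows = build_metadata(paths, whisper_transcriber())`, followed by writing `"\n".join(rows)` to a file. The transcripts still deserve a manual pass, since recognition errors end up in the training data.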

@andreya.l.1270 1 year ago

I missed your videos man, good work, keep it up

@sharptrickster 1 year ago

Do we currently have any TTS pipeline with good enough quality for non-english languages?

@Askejm 1 year ago

your best bet is probably 11labs multilingual, which still only supports a handful of languages

@gameb30232 1 year ago

this is so cool i wanted to do this for so long! thank you!

@jan-Juta 1 year ago

Just waiting for Live V2V to become viable in the open source space. Would be insane for tabletop RPGs and VA for solo projects. Live RVC is kinda working, but not very well.

@4.0.4 1 year ago

VA for solo projects doesn't need to be live, why trade quality for speed in that case?

@Kisai_Yuki 1 year ago

It already is. You can use the RVC software to create an ONNX and then take the ONNX to MMVCServerSIO. It will work with very little tweaking. The problem is that RVC is more of an auto-tune. It will not change someone's gender, accent or age. It can only create a voice filter. And what is being passed off as "AI singing cover" is really just laundering someone elses singing through this pitch tuning. So taking one singer and using it to sing a different singer, tuned ON that singer, isn't actually a cover, at least not by what the term "cover" means. But it is useful for creating a character voice. So if one were so inclined, a D&D campaign could be made very interesting by using the RVC to train voices (eg a deeper voice for barbarian troll, and a higher pitch voice for a dwarf or halfling) and the GM could create unique NPC's for characters without having to strain their voice.

@iambinarymind 1 year ago

Fantastic overview. Much thanks, bycloud

@sujimatsubackupaccount194 1 year ago

RVC retains the core trained voice while sounding smooth. so-vits-svc removes most of the trained voice's personality, bases the result more on the voice in the source audio, and weirdly enough makes the voice sound flat. Even for talking, RVC has the better strengths. Though it suffers from sharp note transitions like C2 to C5, which can cause issues.

@stephantual 1 year ago

Exactly. And don't get me started about accents ;) My 'charming' french accent is the bane of these tools.

@GoharioFTW 1 year ago

15:12 Is nobody absolutely terrified of this? We could get to the point that someone could grab a minute of you talking and be able to use it accurately anywhere for anything.

@slime-smp 1 year ago

Can you please make a tutorial on how to do this its very confusing

@pradachan 5 months ago

so, was confused a bit -> means do we have to 1st use tortoise tts (with our voice) & then rvc(with our voice) to achieve the same result?

@absence9443 1 year ago

Beautiful video! Really helpful :)

@shApYT 1 year ago

Watching at 2x completely smooths out any bumps that rvc has. The cadence sounds off after pointing out that it is AI.

@jurandfantom 1 year ago

At last I managed that! Thank You ByCloud !

@BHBalast 1 year ago

Lol, on my smartphone i cant even tell a difference between your Real voice and fake ones!

@krishp1104 1 year ago

At the end he says ALL audio in this video is AI generated

@BHBalast 1 year ago

@@krishp1104 NOT all, there was a Little fragment. :)

@krishp1104 1 year ago

@@BHBalast no literally all audio in the video is AI generated

@BHBalast 1 year ago

@@krishp1104 I Dont get it, in his comment he says one fragment is not.

@flowerpt 1 year ago

Wow, that was dense - awesome!

@mastermohit 1 year ago

I can't wait for asmin to react to this

@akshatgarg6635 1 year ago

Can you please tell how did you train TorToise TTS in your voice. I saw the repo but it is not mentioned how to fine-tune it on your voice

@nunuarthas8680 1 year ago

we're witnessing bycloud turning himself to an ai then he's gonna upload himself to a cloud and live forever

@_Everything_is_Fine_ 1 year ago

are we only limited to voice cloning? is there any voice generator that generates a new voice, like by changing parameters or combining two voices into one new voice?

@beowulf2772 1 year ago

Hey! your videos are very professional and well edited! You deserve this like and comment.

@Crazybark 1 year ago

The tacotron one sounded better than the tortoise one

@DrW1ne 1 year ago

12:32 My mind blew up.

@lucas_zampar 1 year ago

Great video!

@_Sepherial 11 months ago

How do I use a cloned voice to read aloud a pdf file?

@terraguy996 1 year ago

Is this just me or this vid had a different thumbnail?

@Askejm 1 year ago

he switches it a lot after release, as other youtubers often do

@fnytnqsladcgqlefzcqxlzlcgj9220 1 year ago

WOAH i didnt notice it was AI and I work with audio constantly. trippy!

@stephantual 11 months ago

Non ironically still the best primer on the topic - 5 months on! (which is prehistory in AI) - 🤠

@GavrikCat 1 year ago

What about BARK? But I guess it's not so good. Also, what option would be the best in terms of inference speed?

@steve_jabz 1 year ago

Is RVC still better now that so-vits-svc 5.0 is out?

@deepanyai 1 year ago

Love your videos!

@IPutFishInAWashingMachine 1 year ago

I had absolutely no idea that your voice was completely ai generated... WHAT?!?!?!

@herdenq 10 months ago

Definitely not. Just that one section :)

@mineralbunny8736 1 year ago

Ah that “crappy” free KZbin course Harvard let us have 😂 I actually took the Java CS50 class there and it was very good… I like that they record them so you can watch later!

@memegazer 1 year ago

Some of those songs that sound good have a lot of work put into them as well. A lot of post processing as well with other audio tools

@YoIomaster 1 year ago

another great video. keep it up brother! QUESTION: I want to wait until fall, because AMD is gonna enable shader conversion (basically allowing high end consumer cards to use CUDA coded AI tools), before I buy a new gfx card. I really struggle learning new things with my 6gb 1660 Super, but I also don't want to support Nvidia's incredible greed and market manipulation. Would you recommend that I wait and support AMD, or what route would you go? I want to go full audio synth setup and I'm already using Stable Diffusion 1.5

@naeemulhoque1777 1 year ago

this video is gold

@krasen671 1 year ago

Weird, I've always done Eleven Labs + RVC, not Tortoise

@Askejm 1 year ago

well imo 11labs is already good enough quality, its resemblance that it lacks. tortoise solves that, and RVC makes up for the subpar quality

@krasen671 1 year ago

@@Askejm i mean for RVC i just set the index rate up super high and it sounds good enough to be the actual person lol

@Askejm 1 year ago

@@krasen671 well one should be a little cautious with just jamming the index rate up. the rvc v2 is a lot more intrusive tho in my experience while also sounding better, but i feel like the resemblance you can get is just lackluster since youre limited to only 1 minute

@VaibhavShewale 1 year ago

i need this tts cause i need to make videos that are usually long and i have to keep moving, so that means background noise. earlier i used to record room and then start recording, but it used to take me over 2 weeks just to create a 5 min audio and that is too damn long, period. i think i need to do research on all these tools cause i dont have that much money to invest in what any of the companies are offering

@nils900 1 year ago

How well does the TorToiSe + RVC combo work with other languages?

@Toliman. 1 year ago

It would be reliant on the RVC training of phoneme and language salience of the native recording. Accents are naturally difficult. Ie accents and pronunciation is usually not neutral, so if you use a TTS to generate the non-english version, RVC will interpolate the accent and pronunciation based on the native accent it was generated with. So, if you generate an Austrian voice first, then pass it to a Japanese RVC, it will struggle to find matching properties. But, if you use a Japanese speaker to create English phonemes, and the RVC has examples of these equivalent phonemes, it will substitute. The effect is weird, which is why accents are difficult to emulate.

@yuyiko 1 year ago

great video. really love all of this AI content (keywords for youtube ;P )

@OxidoPEZON 1 year ago

Can you have this narrate your weekly AI news videos? I loved that series, and I really would watch them all the same with this voice, I didn't notice until you exposed yourself.

@KASSIND 4 months ago

Can I use it for the Indonesian language?

@monstercameron 1 year ago

what about BARK?

@madcatlady 1 year ago

I have the Bark Webui on my PC and it's a crazy lucky dip what you get, some sing and none sound the same as the previous one

@lll-yq4hu 1 year ago

Great vid

@liam10000888 1 year ago

I really like this type of video from you! The ai news was great, but as a layman it was too scattered

@CrashDeluxe 1 year ago

I'm stupid, I just heard lots of words jumbled together; RVC, VITS, VCS, JBC, RVC, BC?!?

@Bazilisk_AU 1 year ago

OKay... I zoned out playing Genshin with this playing on my second monitor and I hear Asmongold and go "Wait wtf !?" and I went back and rewatched the whole thing for context and HOLY CRAP I DID NOT DOUBT THAT IT WAS YOUR VOICE THE WHOLE TIME ! Man... what a time to be alive. A tad too early to pilot mechs in space... just just in time for AI Waifus and have food delivered to your door while you watch anime, explore the stars with hyper-realistic games and argue with strangers on the other side of the world about made up problems.

@AngryApple 1 year ago

Bark is also very interesting

@KW-jj9uy 1 year ago

I played with a lot of these free tools, and the most difficult part (as usual) is installing them, lol

@homeyworkey 1 year ago

does the voice-to-voice follow the inflections in the original voice? ie if i scream, the generated voice would scream too. even if this is a 10/10 video, its still good. ive been wanting to know how to clone the voice of a younger version of me, and now i know exactly what my options are (i tried researching before, to no avail). thank you ! :DD

@Askejm 1 year ago

yeah. i made bycloud do the crazy frog with V2V and it worked totally fine. the ding worked but surprisingly all the verbal sound effects were cloned too and it genuinely sounded like him

@homeyworkey 1 year ago

@@Askejm oh 100%, this video is extremely convincing. i do have a suspicion though that it is easy as most of his videos his voice is pretty flat (not an insult btw, its calming and i like it) if he had more variance in what voices he made, ie whispering, yelling, singing, speaking fast, speaking slow, even when you speak louder or quieter, its not as a simple as 'lowering the volume', the actual voice changes with it. this would require alot of data and to identify the 'tone' of the original voice recording, so you can interpret it for the fake voice generation. ik its complex but im just wondering where we are at with that sort of stuff.

@Askejm 1 year ago

@@homeyworkey well i found it to work pretty well. also, rvc uses a pretrained model

@l.halawani 10 months ago

[solved] this is a channel by fireship, but completely run by ai

@lauraalvarezgonzalez6184 1 year ago

Thanks!!

@thedementiapodcast 1 year ago

Bar none the best video on the topic. If your mother's tongue is American english, the FLOSS path is the best (use a cloud GPU for speed). But accents are unique to the person (im native french and my english is hit and miss on certain words, which currently no ai can learn, no matter how much data i give it). Even in the best case scenario, it's far from 'perfect' and the affect is overall very flat, as we can hear in this video. But it will get better over time, i'm sure.

@knoopx 1 year ago

us techno bros are not into karaoke xD

@alkeryn1700 1 year ago

is it just me that thought that tacotron sounded a lot better than tortoise ?

@Askejm 1 year ago

I think tortoise sounds better but by far the most noticeable thing is how tacotron has very poor resemblance

@kw4093-v3p 1 year ago

wtf I was actually fooled. I thought this was your real voice

@fueledbylofi7078 1 year ago

Bycloud will soon be THE AI news source as this stuff gets more complicated and controversial and eventually will be completely self sufficient and ran by its own AI models trained on bycloud AI news videos 😶

@samriddhlakhmani284 5 months ago

Thank god, I skipped sleep, to click on this video. Awesome survey

@mlcat 1 year ago

for just tts VITS is one of the best options

@alibahrami6810 1 year ago

What I get from this video is EEC, VTC, CCT, VTC, and HIGHGAN. 😂

@minicup 1 year ago

After about 30 seconds I realised it was AI

@Kisai_Yuki 1 year ago

The information in the video is pretty good, but when people make the claim "you can't tell this isn't a real person" I call that bluff. The reason some voices are "good enough" is simply because people can't tell a bad voice from a bad recording. GIGO. If you have a high quality input, it will produce a high quality output. When your sources are not high quality (which is why the tr*mp/Biden/Obama video sounds completely unrealistic) it's immediately obvious, but likewise, it will not fool someone who knows that person. The vernacular will be wrong, the use of filler words will be wrong, the lack of spacing in the audio will be wrong. Like the three biggest tells/fails for TTS and VC are:
- Output is too clean and consistent (true human voices have variability to their pitch, speed and volume)
- The vocabulary is incorrect for the accent or age given
- The accent is foreign
Like don't get me wrong, I think the advancements in TTS/VC will open doors to allowing people to use their own voice in foreign dubs of material that otherwise would only be available subtitled. However, the writing is on the wall for jack-of-all-trades voice actors, because instead of doing 25% of the voices in a cartoon, the studio will just use "good enough" AI for characters that aren't named.

@Askejm 1 year ago

Im the editor. I listen to byclouds voice for hours. I gotta say, this AI is scaringly close. Most of the time you can't hear it, and I forgot many times it was AI while editing. Only some things it pronounces a bit oddly, and the S sounds off. But I would say, if you know this may be AI you can hear it, but if you're just a casual viewer you wouldn't realize.

@Kisai_Yuki 1 year ago

@@Askejm I'd say that where I could tell was from the 'w' sounds. When I watched a different video I also noticed that bycloud's natural voice has a lower F0. Not every voice is going to be equal. After I watched this video, I went and grabbed the RVC stuff and tried some of my own experiments. As expected, it's extremely GIGO (I've trained TTS voices on VITS before, so I know a lot of its tells.) RVC is basically autotune. If you know what the input is, you can definitely tell. So RVC will not work to change the gender, accent or age of the input. It will however give you a much more consistent F0. So this is immensely useful for people who want to do character voices with their own voice and don't want to damage their voice doing them.

@Askejm 1 year ago

@@Kisai_Yuki I think the real and fake F0 is very close. His recent videos has a higher F0 tbh. Also, is that RVC v2 with pitch guidance disabled?
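For readers unfamiliar with the F0 being compared in this exchange (the fundamental frequency of a voice, roughly its perceived pitch), here is a minimal Python sketch that estimates it from one frame of audio using plain autocorrelation. This is a textbook method chosen for illustration, not the pitch extractor RVC or any other tool in the thread uses, and the function name, search range, and synthetic test tone are all assumptions.

```python
import math


def estimate_f0(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) of one frame via autocorrelation.

    Returns 0.0 when no positive correlation peak is found (e.g. silence).
    """
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    n = len(samples)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, min(max_lag, n - 1) + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


if __name__ == "__main__":
    sr = 16000
    # Synthetic 120 Hz tone standing in for a voiced frame.
    tone = [math.sin(2 * math.pi * 120.0 * i / sr) for i in range(2048)]
    print(round(estimate_f0(tone, sr), 1))
```

Running something like this over frames of a real and a cloned recording and comparing the averages is one rough way to check the "lower F0" impression; real speech is noisier than a sine wave, so dedicated pitch trackers will be more reliable.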

@midvightmirage 1 year ago

the sponsor is not your real voice no way it is

@max_s557 9 months ago

This is the best video ive seen on this topic many thanks brother! I sent you a message on twitter but i couldnt DM because im not verified but i would like you to help me create a pipeline.