MetaVoice 1B - TTS & Voice Cloning

Views: 46,852

Sam Witteveen

Day ago

Comments: 70

@Randomsitos 8 months ago

I like the results from clonemyvoice as well.

@mrquicky 4 months ago

If you're looking for an online service, it seems there are plenty available, like clonemyvoice, to queue your work on. If you are looking to perform the work locally and are tired of cloning repos which do not execute despite their boasts, this is the one for you. I've got it working well on Linux using CUDA with the out-of-the-box settings and the README. Thanks @Sam Witteveen

@davidw8668 9 months ago

The last Fridman was spot-on, I was just waiting for him to say something like: "... existentially speaking, it's in love where we truly find meaning in our universe."

@CodyAvant 9 months ago

“Anthropomorphize”

@davidw8668 9 months ago

@CodyAvant robots?

@shaunjohann 9 months ago

you and Philip over at AI Explained are my favs, thanks for doing what you do in the AI space. i always look forward to what you both create for us to think on

@bobobo1673 9 months ago

Philips?

@BryanChance 8 months ago

OMG this is amazing. I just had my mind blown by Sora, the video generator from OpenAI. Everything is moving so fast it hurts my brain. :-_)

@shobhitagnihotri416 9 months ago

Sam's knowledge of LLMs/AI is on a different level. Thanks

@kate-pt2ny 8 months ago

Great tutorial, thank you for sharing, thank you Sam

@Joooooooooooosh 9 months ago

Sounds like we still have very far to go before open source models catch up to commercial services like playht and ElevenLabs.

@alx8439 9 months ago

A lot of TTS models drop words the same way when you start to play with their parameters. I've tested like half a hundred of them and I have a strong impression they're still quite fragile

@ash3844 4 months ago

After running the example usage cell I'm getting this error: "' returned non-zero exit status 1." Could you please help?

@samwitteveenai 4 months ago

That video is quite old, my guess is they have updated the code or library

@vrrevolution9183 8 months ago

is there somewhere in the code that allows for more than 11 seconds of audio? i tested a few and it stops at 11

@eaglestudio4284 6 months ago

Can this be monetized on YT if I don't pay for a plan?

@goyashy 9 months ago

This is a great demo. Seems like they're doing real good work! I tried turtle and a few other open source models (open voice), but none of them reach the point where they seem likely to be a competitor to Elevenlabs in the future. This one has great potential!

@albertsitoe7340 8 months ago

Nothing is better than Eleven Labs

@davidencarnacion2555 8 months ago

Eleven labs is killin it.

@mastershake2782 9 months ago

I like OpenVoice more than Bark tbh. Curious to see how MetaVoice stacks up.

@shaileshvekariya9059 8 months ago

Can I run this on multiple GPUs?

@lucamatteobarbieri2493 9 months ago

The original *uckerberg voice is more robotic than any cloning attempt 😂

@VaibhavShewale 8 months ago

sounds nice

@sherpya 9 months ago

looks interesting, does it really require a GPU with >=24GB RAM?

@samwitteveenai 9 months ago

no not that much RAM needed

@P-G-77 9 months ago

I just tried the demo... works well...

@ForTheEraOfLove 8 months ago

I can't believe you haven't used that drunk Lex Fridman with your slurring system string to have an interview with you as "Drunk Tech"

@Tarbard 9 months ago

Interesting. I would have liked to hear it clone your voice.

@samwitteveenai 9 months ago

lol good point I didn't try that. I might do that if they release the Fine Tuning code.

@Tarbard 9 months ago

@samwitteveenai The demo link lets you upload a voice clip to clone, is that not enough? I didn't have good results with what I uploaded, but if it's because it needs to be fine-tuned, that would explain why.

@IgnacioMartinez82 9 months ago

Is it possible to use this with home assistant?

@samwitteveenai 9 months ago

yes, but the challenge is getting the TTS closer to real time. I have another vid coming up that is more in line with a home assistant

@kamathsutra 9 months ago

At the beginning of the video I almost thought you were using the model to speak.

@SyedMujtabaHassanRizvi 9 months ago

Make a video on how to train a language model using Direct Preference Optimization

@abdelkaioumbouaicha 9 months ago

📝 Summary of Key Points:

MetaVoice, a startup, has released an open-source text-to-speech (TTS) model called MetaVoice 1B, which is a 1.2 billion parameter model trained on 100,000 hours of speech data. The model claims to have zero-shot cloning capabilities for American and British voices with just 30 seconds of reference audio. It uses Transformers and diffusion techniques in its architecture.

The speaker demonstrates the model's capabilities, mentioning that it performs well for some voices but not consistently for others. Adjusting the temperature and guidance scale can affect the output to sound more like the desired voice.

The model has limitations, such as occasionally dropping words or generating silence. Fine-tuning the model on longer audio samples could be interesting once the code is released.

While the open-source nature of the model is promising, it still has a long way to go to match the performance of proprietary models like Google's SoundStorm and OpenAI's voices.

💡 Additional Insights and Observations:

💬 [Quotable Moments]: "The model performs well for some voices, but not consistently for others."

📊 [Data and Statistics]: The model is trained on 100,000 hours of speech data and has 1.2 billion parameters.

🌐 [References and Sources]: The video encourages viewers to try out the provided notebook and experiment with the model themselves.

📣 Concluding Remarks:

The MetaVoice 1B model, an open-source text-to-speech model, shows promise with its zero-shot cloning capabilities and use of Transformers and diffusion techniques. However, it still has limitations and has yet to match the performance of proprietary models. The video encourages viewers to try out the model and see its capabilities firsthand.

Generated using TalkBud

@codecaine 9 months ago

Nice work

@hqcart1 9 months ago

is the notebook free on an A100?

@samwitteveenai 9 months ago

unfortunately it is not free

@joffreylemery6414 9 months ago

Next video on virtual avatars?

@rubbercable 9 months ago

Is this 'offline' or an 'online service'?

@mastershake2782 9 months ago

Local/Offline

8 months ago

some examples of how it sounds: 5:30 6:21 6:47 7:15 ...

@Darfail a month ago

my guy doing the lord's work god bless

@Macatho 8 months ago

I dunno... Compared to Elevenlabs it's miles away... If you can spot that it's an AI voice within 3 seconds, it's not very good.

@jonathanmckinney5826 9 months ago

Coqui xTTS v2 is still much better and includes cloning and more languages. Bark is not a great comparison as it's one of the worst open source TTS models, they haven't updated the repo in 5 months. It was a great thing when it came out of course.

@samwitteveenai 9 months ago

Interesting I will go back and check the Coqui models out. I was saddened when I heard they were shutting down as they had done a lot of good work in the TTS field

@nyny 9 months ago

unfortunately xTTS isn't open source

@samwitteveenai 9 months ago

So do we know what has happened to that IP now that Coqui has gone under? Who bought it etc?

@nyny 9 months ago

@samwitteveenai They have only said that it retains the same conditions. They are probably shopping it or their investors want to liquidate.

@samwitteveenai 9 months ago

thanks for the update.

@fontenbleau 9 months ago

I would really like audio restoration tools, but development in this area has been stalled for 2 years already. Adobe closed access to its speech restoration model, which was very bad with non-English accents (I've tested it). The only new thing is that this January Intel added its AI audio tools to the popular Audacity editor, but I would call the quality of their models awful, from the music remix model to the noise removal model. It's incredibly unoptimised and takes longer on GPU than on CPU. No one knows when we will get a tool to restore vinyl music; there's no motivation in development to restore old music.

@zyxwvutsrqponmlkh 9 months ago

Make a corpus of degraded vinyl copies and pristine copies of the same content. This is the sort of thing I could crank out in like a week if I had the data already.

@fontenbleau 9 months ago

@zyxwvutsrqponmlkh I don't know why there's such censorship, but I can't write anything to you. The word filter is crazy, or the channel author added every possible word to the filter list. I want to answer you, but can't; only this message gets through the filters. Worse than under communism.

@DocDoc-h4w 9 months ago

Hello Sam, would like to connect with you regarding a collaboration opportunity for a new product that we are building. Let me know the best way to reach you. Thanks.

@samwitteveenai 9 months ago

you can ping me on linkedin

@alanalvarado3862 8 months ago

Doubt it!

@zyxwvutsrqponmlkh 9 months ago

Thanks, I had the misunderstanding that this was Facebook. Honestly, results are quite good, best in class, but it does fail sometimes, producing long breaks or silence without saying all the words. Still, it's enough better than anything else that I think it's worth it.

@teaman7v 9 months ago

Best in class? This is way behind the curve for TTS and voice cloning.

@zyxwvutsrqponmlkh 9 months ago

@teaman7v Bull scat. I've done bark, openvoice, coqui-ai, MockingBird etc. What do you have better that you can run locally? Because I've tried this and I like it.

@Joooooooooooosh 9 months ago

@zyxwvutsrqponmlkh they are all pretty terrible tbh compared to commercial models.

@zyxwvutsrqponmlkh 9 months ago

@Joooooooooooosh Then what fucking good is it? Can't rely on anything commercial; they will change shit on the back end without warning, go out of business, throw up rate limits, and cost a fortune. I want to make full-on audiobooks, I'm talking thousands of hours of content. And I want it to be good, like better than the average audiobook narrator. There is jack shit out there that is open and can be convincing for more than a couple of minutes. This is a good step in the right direction: I can get Whisper to timestamp each syllable and use that data to ensure accuracy and find odd delays and shit to automate a re-generation of that segment. I can't do that with your closed-source garbage services; if I can't make it run on a system I control, it's not worth considering. And you're all "it's not best in class because this other entirely different class exists too" 💩
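
The checking step described above (timestamp the generated audio, then flag segments with odd delays or dropped words so they can be regenerated) could look roughly like the sketch below. It assumes word-level timestamps have already been produced by some speech recognizer and are available as (text, start, end) entries; the `Word` class, the function names, and the 0.8-second threshold are all hypothetical, and no real Whisper or MetaVoice API is called.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with its timing (hypothetical structure)."""
    text: str
    start: float  # seconds
    end: float    # seconds


def find_long_gaps(words, max_gap=0.8):
    """Return (index, gap_seconds) for every pause longer than max_gap.

    The index points at the word that follows the pause.
    """
    gaps = []
    for i in range(1, len(words)):
        gap = words[i].start - words[i - 1].end
        if gap > max_gap:
            gaps.append((i, round(gap, 3)))
    return gaps


def missing_words(expected_text, words):
    """Return expected words that never show up in the transcript."""
    punctuation = ".,!?;:\"'"
    heard = {w.text.lower().strip(punctuation) for w in words}
    return [
        token
        for token in expected_text.lower().split()
        if token.strip(punctuation) not in heard
    ]


def needs_regeneration(expected_text, words, max_gap=0.8):
    """Flag a segment when it has a long pause or a dropped word."""
    return bool(find_long_gaps(words, max_gap)) or bool(
        missing_words(expected_text, words)
    )


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    segment = [
        Word("the", 0.0, 0.2),
        Word("quick", 0.25, 0.6),
        Word("fox", 2.1, 2.5),
    ]
    print(find_long_gaps(segment))
    print(missing_words("The quick brown fox.", segment))
    print(needs_regeneration("The quick brown fox.", segment))
```

A segment flagged this way would simply be sent back through the TTS model; the set-based word check ignores word order and repeats, so it is only a rough first filter.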

@hqcart1 9 months ago

I just tried out the demo and it really sucks: a lot of flickering and voice cuts, you can feel it, and sometimes I didn't understand what he was saying. I'd say this model was released too soon, or it just sucks.

@jeffwads 9 months ago

Meh. Tortoise-TTS is at least as good. People yapping about how it doesn't match up to Elevenlabs just aren't following the guide. Three 10-second WAV samples at the right bit rate are key.