Local Low Latency Speech to Speech - Mistral 7B + OpenVoice / Whisper | Open Source AI

Views: 115,001

1 day ago

Comments: 184

@Canna_Science_and_Technology 8 months ago

Awesome! Time to replace my slow speech-to-speech code that uses OpenAI. Also, I added ElevenLabs for a bit of a comedic touch. Thanks for putting this together.

@mayushi7792 1 month ago

How much did it cost you to integrate ElevenLabs?

@NoLimitYou 7 months ago

Too bad you take open source and make it closed.

@mblend27 7 months ago

Explain?

@NoLimitYou 7 months ago

@mblend27 You take code that is openly available and ask people to become a member to receive the code of what you demo using the open source code. The whole idea of open source is that everyone contributes without putting it behind walls.

@Ms.Robot. 7 months ago

You can in several ways.

@NoLimitYou 7 months ago

You take open source and make something with that and put it behind a wall.

@TheGrobe 7 months ago

@mblend27 You make someone pay to access something on GitHub that you put together from open source components.

@nyny 8 months ago

That's supah cool, I actually built something almost exactly like this yesterday. I get about the same performance. The hard part is figuring out threading/process pools/asyncio to get that latency down. I used small instead of base. I think I get about the same response or better.
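As a rough illustration of the threading approach mentioned in this comment, here is a minimal sketch of a three-stage pipeline where each stage runs in its own thread and hands work to the next through a queue, so transcription of a new chunk can overlap with generation and synthesis of earlier ones. The functions `transcribe`, `generate_reply`, and `synthesize` are hypothetical stand-ins, not the code from the video or from the commenter.

```python
import queue
import threading


def run_stage(work, inbox, outbox):
    """Apply `work` to each item from `inbox` and push results to `outbox`.

    A `None` item is treated as a shutdown sentinel and passed downstream.
    """
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            break
        outbox.put(work(item))


# Hypothetical stand-ins for the real models (speech-to-text, LLM, text-to-speech).
def transcribe(audio_chunk):
    return f"text({audio_chunk})"


def generate_reply(text):
    return f"reply({text})"


def synthesize(reply):
    return f"audio({reply})"


def run_pipeline(audio_chunks):
    """Push chunks through the three stages concurrently and collect the output."""
    q_audio, q_text, q_reply, q_out = (queue.Queue() for _ in range(4))
    stages = [
        threading.Thread(target=run_stage, args=(transcribe, q_audio, q_text)),
        threading.Thread(target=run_stage, args=(generate_reply, q_text, q_reply)),
        threading.Thread(target=run_stage, args=(synthesize, q_reply, q_out)),
    ]
    for stage in stages:
        stage.start()

    for chunk in audio_chunks:
        q_audio.put(chunk)
    q_audio.put(None)  # signal that no more audio is coming

    results = []
    while True:
        item = q_out.get()
        if item is None:
            break
        results.append(item)

    for stage in stages:
        stage.join()
    return results
```

Because each stage is a single thread reading a FIFO queue, output order matches input order. Whether threads, process pools, or asyncio fit best depends on whether the underlying model calls release the GIL, which is not something this sketch settles.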

@Baptiste__1Trocquet 8 months ago

Hi! Very impressive!! Do you have a GitHub where you share your code?

@CognitiveComputations 7 months ago

Can we see your code, please?

@limebulls 7 months ago

I'm interested in it as well.

@williamjustus2654 8 months ago

Some of the best work and fun that I have seen so far. Can't wait to try on my own. Keep up the great work!!

@LFPGaming 8 months ago

Do you know of any offline/local way to do translations? I've been searching but haven't found a way to do local translations of video or audio using large language models.

@deltaxcd 7 months ago

There is a program, "Subtitle Edit", which can do that.

@MelindaGreen 7 months ago

I'm daunted by the idea of setting up these development systems just to use a model. Any chance people can bundle them into one big executable for Windows and iOS? I sure would love to just load-and-go.

@zyxwvutsrqponmlkh 7 months ago

I have tried OpenVoice and Bark, but VITS by far makes the most natural sounding voices.

@ales240 8 months ago

Just subscribed! Can't wait to get my hands on it, looks super cool!

@swannschilling474 8 months ago

I am still using Tortoise, but OpenVoice seems promising! 😊 Thanks for this video!! 🎉🎉🎉

@deeplearningdummy 7 months ago

I've been trying to figure out how to do this. Great job. I want to support your work and get this up and running for myself, but is YouTube membership the only option?

@tommoves3385 8 months ago

Hey Kris - that is awesome. I like it very much. Great that you do this open source stuff. Very cool 😎.

@PhillipThomas87 8 months ago

I mean, this is dependent on your hardware... Are the specs anywhere for this "inference server"?

@ryanjames3907 8 months ago

Very cool, low latency voice. Thanks for sharing. I watch all your videos, and I look forward to the next one.

@DihelsonMendonca 2 months ago

That's wonderful. I wish I had the knowledge to implement that on my LLMs in LM Studio.

@denisblack9897 8 months ago

I've known about this for more than a year now and it still blows my mind. wtf

@SaveTheHuman5 7 months ago

Hello, could you please tell us what your CPU, GPU, RAM, etc. are?

@JohnSmith762A11B 8 months ago

I wonder if you are (or can, if not) caching the processed .mp3 voice model after the speech engine processes it and turns it into partials. That would cut out a lot of latency if it didn't need to process those 20 seconds of recorded voice audio every time. Right now it's pretty fast but the latency still sounds more like they are using walkie talkies than speaking on a phone.
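The caching idea in this comment could be sketched roughly as follows: compute the voice embedding once per reference file, store it on disk keyed by a hash of the file's contents, and reuse it on later runs. This is only an illustration under assumptions. `extract` is a hypothetical callable standing in for whatever the speech engine uses to process the reference audio, and nothing here confirms how the script in the video handles it.

```python
import hashlib
import pickle
from pathlib import Path


def file_digest(path):
    """Return a SHA-256 hex digest of the file contents, used as the cache key."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def cached_embedding(audio_path, extract, cache_dir="embedding_cache"):
    """Return the embedding for `audio_path`, computing it at most once.

    `extract` is a hypothetical function (path -> embedding) supplied by the
    caller; the result is pickled to `cache_dir` so later runs skip the work.
    Only load cache files you wrote yourself, since pickle is not safe for
    untrusted data.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    cache_file = cache / (file_digest(audio_path) + ".pkl")

    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())

    embedding = extract(audio_path)
    cache_file.write_bytes(pickle.dumps(embedding))
    return embedding
```

Keying on the file contents rather than the file name means that swapping in a different reference recording under the same name still triggers a fresh extraction.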

@levieux1137 8 months ago

It could go way further by using the native libs and dropping all the Python-based wrappers that pass data between stages using files and that copy, copy, copy and recopy data all the time. For example, llama.cpp is clearly recognizable in the lower layers; all the tunable parameters match it. I don't know about OpenVoice, for example, but the state the presenter arrived at shows that we're pretty close to reaching a DIY conversational robot, which is pretty cool.

@JohnSmith762A11B 8 months ago

@levieux1137 By native libs, you mean the system TTS on, say, Windows and macOS?

@levieux1137 8 months ago

@JohnSmith762A11B Not necessarily that; I'm speaking about the underlying components that are used here. In fact, if you look, this is essentially Python code built as a wrapper on top of other parts that already run natively. The llama.cpp server, for example, is apparently used here. And once everything is wrapped in layers and layers, it becomes heavy to transport content from one layer to another (particularly when passing via files, but even memcpy is expensive). It might even be that some elements are reloaded from scratch and reinitialized after each sentence. The Python script here appears to be mostly a wrapper around all such components, working like a shell script: recording input from the microphone to a file, then sending it to OpenVoice, then sending that output to a file, then loading another component with that file, etc. That means working with files and heavy initialization at every step. Dropping that whole layer and directly using the native APIs of the various libs and components would be way more efficient. And it's very possible that past a point the author will discover that Python is not needed at all, which could suddenly offer more possibilities for lighter embedded processing.

@zedboiii 4 months ago

that's some Bethesda level of conversation

@yoagcur 8 months ago

Fascinating. Any chance you could upgrade it so that specific voices could be used and a recording made automatically? Could make for some interesting Biden v Trump debates.

@edgarl.mardal8256 2 months ago

I'll buy a patron membership if you set up Rasa with this model. Since she lacks IQ and structure, I would recommend Rasa and using sales technique to make her sound more logical. By that I mean spinning.

@irraz1 5 months ago

wow! I would love to have such an assistant to practice languages. The “python hub” code, do you plan to share it at some point?

@josephtilly258 5 months ago

Really interesting. A lot of it I can't understand because I don't know coding, but speech to speech could be a big thing within a few years.

@TomM-p3o 8 months ago

This is great. But personally I think a speech recognition with push to talk or push to toggle talk is most useful.

@lokiwhacker 7 months ago

Thought this was really cool, love open source. But this really isn't open source if you're hiding it behind a paywall... smh

@codygaudet8071 6 months ago

Just earned yourself a sub sir!

@MiguelCayazaya 2 months ago

Thanks. There are those who go to war and become heroes, and those who don't but still write programs.

@arkdirfe 7 months ago

Interesting, this is similar to a small project I made for myself. But instead of a chatbot conversation, the Whisper output is fed into SAM (yes, the funny robot voice) and sent to an audio output. Basically makes SAM say whatever I say with a slight delay. I'm chopping up the speech into small segments so it can start transcribing while I speak for longer, which introduces occasional weirdness, but I'm fine with that.
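The segment-chopping described in this comment can be sketched as a simple windowing function over a buffer of audio samples. The optional overlap between neighbouring segments is an assumption added here as one common way to soften boundary artifacts such as words cut in half; the commenter does not say whether they use it.

```python
def split_into_segments(samples, segment_len, overlap=0):
    """Split a sequence of audio samples into fixed-length segments.

    Consecutive segments share `overlap` samples, so a word that straddles a
    boundary appears whole in at least one segment. The final segment may be
    shorter than `segment_len`.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")

    step = segment_len - overlap
    segments = []
    for start in range(0, len(samples), step):
        segments.append(samples[start:start + segment_len])
        if start + segment_len >= len(samples):
            break  # this segment already reached the end of the buffer
    return segments
```

With overlap, the same audio is transcribed twice near each boundary, so the transcripts need de-duplicating afterwards. That step is left out of this sketch.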

@squiddymute 8 months ago

no api = pure genius

@DoNotTredOnMe 2 months ago

I'd love to see a video of two AIs conversing with one another.

@matthewfuller9760 5 months ago

I think at even 1/3rd the speed with my RTX Titan it would run just fine for learning a new language. Waiting 3 seconds is perfectly acceptable as a novice language learner.

@kleber1983 6 months ago

Hi, I'd like to know the computer specs required to run your speech-to-speech system. I'm quite interested, but first I need to know if my computer can handle it. Thanks.

@OdikisOdikis 7 months ago

The predefined answer timing is what makes it not a real conversation. It should answer at random timings, the way a human thinks of something and only then answers. Randomizing the timings would create more realistic conversations.

@duffy666 4 months ago

I really like it! Is this already on GitHub for members (could not find it)?

@TheDailyMemesShow 2 months ago

OMG, I just noticed I've watched a gazillion videos of yours. Why haven't I subscribed, though? I swear I thought I had done it before. Something's not adding up here...

@ProjCRys 8 months ago

Nice! I was about to create something like this for myself but I still couldn't use OpenVoice because I keep failing to run it on my venv instead of conda.

@Zvezdan88 8 months ago

How do you even install OpenVoice?

@normanalc 3 months ago

I'd like to get a copy of the script, please. This one is really cool! Thanks for sharing this.

@aladinmovies 7 months ago

Good job. Interesting video

@aestendrela 8 months ago

It would be interesting to make a real-time translator. I think it could be very useful. The language barrier would end.

@deltaxcd 7 months ago

Meta did it already; they created a speech-to-speech translation model.

@fire17102 8 months ago

Would love to see some realtime animations to go with the voice, could be a face, but also can be minimalistic (like the R1 rabbit).

@wurstelei1356 8 months ago

You need a second GPU for this. Let's say you put Stable Diffusion on it. Displaying a robot face with emotions would be nice.

@leucome 8 months ago

Try Amica AI. It has a VRM 3D/VTuber character and multiple options for the voice and the LLM backend.

@fire17102 6 months ago

@leucome Does it work locally in real time?

@fire17102 6 months ago

@wurstelei1356 Again, I think a minimalistic animation would also do the trick, or prerendering the images once and using them in the appropriate sequence in real time.

@leucome 6 months ago

@fire17102 Yes, it can work in real time locally as long as the GPU is fast and has enough VRAM to run the AI + voice. It can also connect to an online service if required. I uploaded a video where I play Minecraft and talk to the AI at the same time, with all the components running on a single GPU.

@LadyTink 7 months ago

Kinda feels like something the "rabbit R1" does with the whole fast speech to speech thing

@weisland2807 7 months ago

Would be funny if you had this in games, like the people on the streets of GTA having convos fueled by something like this. Maybe it's already happening though, I'm not in the know. Awesomesauce!

@kumar.jayanti9700 3 months ago

Hi Kris, where is the GitHub code for this one? I could not locate it in the members' GitHub.

@64jcl 7 months ago

Surely the response time is a function of what rig you are doing this on - an RTX 4080 as you have is no doubt a major contributor here, and I would guess you have a beast of a CPU and high speed memory on a newer motherboard.

@microponics2695 8 months ago

I have the same uncensored model, and when I ask it to list curse words it says it can't do that. ???

@jungen1093 7 months ago

Lmao that’s annoying

@cmcdonough2 4 months ago

This was great 😃👍

@MegaMijit 7 months ago

This is awesome, but the voice could use some fine tuning to sound more realistic.

@ArnaudMEURET 7 months ago

Just to paraphrase your models: “Dude! Are you actually grabbing the gorram scrollbars to scroll down an effing window!? What is this? 1996? Ever heard of a mouse wheel? You know it’s even emulated by double drag on track pads, right?” 🤘

@MrScoffins 7 months ago

So if you disconnect your computer from the Internet, will it still work?

@jephbennett 7 months ago

Yes, this code package is not calling external APIs (which is why the latency is low), so it doesn't need an internet connection. The downside is that it cannot access info outside of its core dataset, so no current events or anything like that.

@darik31 4 months ago

Thanks for sharing this mate! I wonder if the code is available somewhere? If so, could you please provide a link? Thanks

@researchforumonline 7 months ago

wow very cool! Thanks

@alexander191297 7 months ago

I swear on my mother’s grave lol… this AI is hilarious! 😂😂😂

@jacoballessio5706 7 months ago

I wonder if you could directly convert embeddings to speech to skip text inference

@JohnGallie 7 months ago

Is there any way you can give Python 90% of system resources so it would be faster?

@Jesulex82 2 months ago

Is this a model you can download so you can talk with the AI? Can you play ro with it? Does it speak Spanish?

@mastershake2782 8 months ago

I am trying to clone a voice from a reference audio file, but despite following the standard process, the output doesn't seem to change according to the reference. When I change the reference audio to a different file, there's no noticeable change in the voice characteristics of the output. The script successfully extracts the tone color embeddings, but the conversion process doesn't seem to reflect these in the final output. I'm using the demo reference audio provided by OpenVoice (male voice), but the output synthesized speech remains in a female voice, typical of the base speaker model. I've double-checked the script, model checkpoints, and audio file paths, but the issue persists. If anyone has encountered a similar problem or has suggestions on what might be going wrong, I would greatly appreciate your insights. Thank you in advance!

@tag_of_frank 7 months ago

Why LM Studio over OogaBooga? What are the pros/cons of them? I have been using Ooga, but wondering why one might switch.

@乾淨核能 2 months ago

What's the GPU requirement to achieve real-time response? Thank you.

@deltaxcd 7 months ago

I think to decrease latency more you need to make it speak before the AI finishes its sentence. Unfortunately there is no obvious way to feed it a partial prompt, but waiting until it finishes generating the reply takes way too long.
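One way to start speaking before the reply is complete, sketched here under assumptions, is to consume the model's output as a token stream and hand each sentence to the speech engine as soon as it ends. The sketch assumes the LLM server can stream tokens (many local servers can, but that is not confirmed for the setup in the video) and uses a naive punctuation rule to find sentence ends.

```python
import re

# A sentence is considered finished at ., ! or ? followed by whitespace.
# This is a naive rule: abbreviations such as "Dr. " will split early.
SENTENCE_END = re.compile(r"[.!?]+\s+")


def sentences_from_tokens(tokens):
    """Yield complete sentences as soon as they appear in a token stream.

    `tokens` is any iterable of text fragments, for example the pieces a
    streaming LLM response delivers. Leftover text is flushed at the end.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence could be passed straight to text-to-speech while the model keeps generating, so the perceived latency drops to roughly the time needed for the first sentence.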

@SonGoku-pc7jl 8 months ago

Thanks, good project. Can Whisper translate my Spanish to English and back to Spanish directly with little change in the code? And do I need to change something in the TTS as well? Thanks!

@suminlee6576 7 months ago

Do you have a video showing how to do this step by step? I was going to become a paid member, but I couldn't see a how-to video in your members' channel.

@Abhi-l6r1k 11 days ago

Where is the code available? I want to try it on my local machine.

@aboudezoa 7 months ago

Running on a 4080 🤣 Makes sense, the damn thing is very fast.

@sovietlo8136 1 month ago

pewdiepie if he started coding... I'm so into this, I'm still learning about this but I want to make my own local AI to be able to manage my business and make everything easier

@kcnb28 1 month ago

You ever thought about using a virtual assistant?

@mickelodiansurname9578 8 months ago

Can the LLM handle being told in a system prompt that it will be taking in the sentences in small chunks, say cut up into 2-second audio chunks per transcript? Can the Mistral model do that? Anyway, if so, you might even be able to get it to 'butt in' on your prompt. Now that's low latency!

@deltaxcd 7 months ago

No, it can't be told that, but it is not necessary. Just feed it the chunk, and if the user speaks again before it has managed to reply, restart and feed it more.
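The restart-on-interruption idea in this reply could look roughly like the following asyncio sketch: every new transcript chunk cancels the reply that is still being generated and starts a fresh one over everything heard so far. `generate_reply` is a hypothetical stand-in for the slow LLM call, and the `gap` parameter only simulates the pause between incoming chunks.

```python
import asyncio


async def generate_reply(prompt):
    """Hypothetical stand-in for a slow LLM call."""
    await asyncio.sleep(0.05)
    return f"reply to: {prompt}"


async def converse(chunks, gap=0.01):
    """Feed transcript chunks one by one, restarting generation on each new one.

    If a new chunk arrives while the previous reply is still being generated,
    that stale task is cancelled and generation restarts with the longer
    transcript. Only the reply to the full transcript is returned.
    """
    heard = []
    task = None
    for chunk in chunks:
        if task is not None and not task.done():
            task.cancel()  # the user kept talking: drop the stale reply
        heard.append(chunk)
        task = asyncio.create_task(generate_reply(" ".join(heard)))
        await asyncio.sleep(gap)  # simulated wait for the next chunk
    if task is None:
        return ""
    return await task
```

A call such as `asyncio.run(converse(["can you", "tell me", "a joke"]))` returns the reply for the joined transcript. In a real setup the cancelled request would also need to be aborted on the LLM server side, which this sketch does not cover.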

@khajask8113 3 months ago

Does it support Hindi and Telugu?

@mertgundogdu211 5 months ago

How can I try this on my computer?? I couldn't find talk.py in the GitHub code??

@Warz-cx6zk 1 month ago

It's his own code, and you need to become a member and wait for an invite to the GitHub community.

@skullseason1 7 months ago

How can I do this with the Apple M1? This is soooo awesome, I need to figure it out!

@inLofiLife 8 months ago

looks interesting but where is this community link you mentioned? :)

@witext 6 months ago

I look forward to actual speech to speech LLM, not any speech to text translation layers, pure speech in and speech out, it would be revolutionary imo

@JG27Korny 8 months ago

I run the oobabooga silero plus whisper, but those take forever to make voice from text, especially silero.

@musumo1908 8 months ago

Hey, cool… Any way to run this self-hosted for an online speech-to-speech setup? I want to drop this into a chatbot project… What membership level gives access to the code? Thanks.

@NirmalEleQtra 4 months ago

Where can I find the whole GitHub repo?

@smthngsmthngsmthngdarkside 7 months ago

So where's the source code mate? Or is this just a hook for your newsletter marketing and crap website?

@Skystunt123 4 months ago

Just a hook, the code is not shared.

@BrutalStrike2 8 months ago

Jumanji Alan

@TanvirsTechTalk 3 months ago

How did you actually set it up?

@jeffsmith9384 8 months ago

I would like to see how a chat room full of different models would problem solve... ChatGPT + Claude + * 7B + Grok + Bard... all in a room, trying to decide what you should have for lunch

@mickelodiansurname9578 8 months ago

AI: "We got some rich investors on board dude, and they're willing to back us up!" I think this script just announced the games commencing in the 2024 US Election... [not in the US so reaches for popcorn]

@ExploreTogetherYT 7 months ago

How much RAM do you have to run Mistral 7B locally? Using GPU or CPU?

@TheDailyMemesShow 2 months ago

Would this work on the cloud? If so, how?

@_-JR01 7 months ago

Does OpenVoice perform better than Whisper's TTS?

@Nursultan_karazhigit 7 months ago

Thanks. Is the Whisper API free?

@m0nxt3r 4 months ago

it's open source

@Ms.Robot. 7 months ago

❤❤❤🎉 nice

@binthem7997 8 months ago

Great tutorial but I wish you could share gists or share your code

@kritikusi-666 8 months ago

The voices are mehh... cool project tho. You always have some fire content. You could train an LLM just off your content and be set haha.

@jerryqueen6755 6 months ago

How can I install this on my PC? I am a member of the channel

@AllAboutAI 6 months ago

did you get the gh invite?

@jerryqueen6755 6 months ago

@AllAboutAI yes, thanks

@miaohf 5 months ago

@AllAboutAI I am a member of the channel too. How do I get the gh invite?

@Yossisinterests-hq2qq 8 months ago

Hi, I don't have talk.py. Is there another way of running it that I'm missing?

@Warz-cx6zk 1 month ago

It's his own code. You need to become a member of the channel through subscription and wait for the invite to the GitHub community.

@TheRottweiler_Gemii 4 months ago

Has anybody done this and have code or a link they can share, please?

@laalbujhakkar 5 months ago

How is a system that goes out to OpenAI "local"????????

@seRko123 4 months ago

OpenAI Whisper runs locally.

@DihelsonMendonca 2 months ago

Too complex for the average guy. We need a ready LLM with easy voice options on LM Studio.

@ajayjasperj 7 months ago

We can make YouTube content with those conversations between bots 😂❤

@MetaphoricMinds 7 months ago

What GPU are you running?

@AllAboutAI 7 months ago

4080 RTX!

@MetaphoricMinds 7 months ago

Dude just made a JARVIS embryo.

@JohnGallie 7 months ago

you need to get out more man lol. that was toooo much!

@VitorioMiguel 8 months ago

Try faster-whisper. Open source and faster.

@jcolabzzz 8 months ago

Do not make AI lie to your face, man. Thankfully this is local.

@artisalva 7 months ago

Haha, AI conversations could have their own channels.

@Jesulex82 15 days ago

It would be good if you set this up for people like my brother, who is blind... so he could hear what the AI answers him... But well, buddy... who thinks about people like my brother, right?... I recommend you close your eyes for an hour a day... and maybe you'll get a slight idea... And then think... what if I stayed like this forever.