Realtime Local AI Chatbot Demo with GPT-SoVITS and Llama 3

5,387 views

Jarods Journey

days ago

Comments: 91

@mohammadaliabbas3847 17 days ago

Can you share the code?

@OnigoroshiZero 16 days ago

This is great! Currently, running local models for everything is hard for most people because of hardware limitations, and using APIs is not recommended for daily use unless you have a few thousand dollars to spend. But having an AI assistant on your PC will be the best thing, along with real-time game emulation, 2-3 years from now. Especially with agents that can essentially "live" on the desktop using virtual environments backed by Unreal Engine or similar (I watched a proof of concept along those lines a few weeks ago, and it was very fun).

@Admlass 17 days ago

Got my own implementation of realtime GPT-SoVITS with playback too (not the same approach as yours). It takes ~0.3s for 30s of audio on Colab's T4 and ~1s to stream to my computer through a reverse proxy, so it can be very efficient once you optimize everything. This video makes me want to test it with the STT + LLM parts. We really live in the future.
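
For scale, the figures quoted above can be turned into a real-time factor (processing time divided by audio duration). This is only a rough sketch: the function names are made up for illustration, the inputs are the approximate numbers from the comment rather than measurements, and it assumes synthesis and transfer do not overlap.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def end_to_end_latency(processing_seconds: float, transfer_seconds: float) -> float:
    """Total wait before audio arrives, assuming synthesis and transfer run back to back."""
    return processing_seconds + transfer_seconds


# Approximate figures from the comment: ~0.3 s of compute for 30 s of audio,
# plus ~1 s to stream the result through the reverse proxy.
rtf = real_time_factor(0.3, 30.0)
latency = end_to_end_latency(0.3, 1.0)
print(f"real-time factor ~ {rtf:.3f}, end-to-end ~ {latency:.1f} s")
```

A real-time factor well below 1.0 is what makes live playback feasible; with these numbers the network hop, not the synthesis, would dominate the wait.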

@Jarods_Journey 17 days ago

Very interesting! So you made GPT-SoVITS more efficient and improved its base speed? Any pointers to look at? I see there's still a lot of room for improvement. Since you're able to generate AR tokens faster, that's ideal: the VITS decoding process is context-dependent, so more accurate speech might be possible if that's the case 🤔.

@Admlass 16 days ago

@Jarods_Journey Yes, I found the bottleneck was the VITS model, as it doesn't have a batched_decode method, so the processing is sequential by default (there was an unsuccessful attempt commented out in TTS.py). I added my own method: I padded the chunks, since we get sequences of different lengths, and removed the padding in post-processing.
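
The pad-then-trim idea described above might look roughly like this. It is a sketch under assumptions, not the commenter's code: `pad_batch`, `batched_decode`, `PAD_ID`, and `samples_per_token` are hypothetical names, and `fake_decode_batch` merely stands in for a real batched VITS forward pass.

```python
from typing import Callable, List, Sequence, Tuple

PAD_ID = 0  # hypothetical padding token id


def pad_batch(chunks: Sequence[Sequence[int]], pad: int = PAD_ID) -> Tuple[List[List[int]], List[int]]:
    """Right-pad every chunk to the longest length and remember the true lengths."""
    lengths = [len(c) for c in chunks]
    width = max(lengths, default=0)
    padded = [list(c) + [pad] * (width - len(c)) for c in chunks]
    return padded, lengths


def batched_decode(
    chunks: Sequence[Sequence[int]],
    decode_batch: Callable[[List[List[int]]], List[List[int]]],
    samples_per_token: int = 1,
) -> List[List[int]]:
    """Decode all chunks in one call, then cut off the samples produced by padding."""
    padded, lengths = pad_batch(chunks)
    outputs = decode_batch(padded)
    return [out[: n * samples_per_token] for out, n in zip(outputs, lengths)]


def fake_decode_batch(batch: List[List[int]]) -> List[List[int]]:
    """Stand-in decoder: emits two 'samples' per token so trimming is visible."""
    return [[tok for tok in row for _ in range(2)] for row in batch]


decoded = batched_decode([[1, 2, 3], [4]], fake_decode_batch, samples_per_token=2)
print(decoded)
```

With a real model the padded positions would usually also need a mask so they do not influence the valid samples; the sketch only covers the bookkeeping.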

@Jarods_Journey 16 days ago

Hmm, my observation is a little bit different: in my testing so far, the bottleneck of the process is the AR decoder, not VITS. I'm wondering what the difference between our setups is. I'll have to do some testing and see if the AR part can be made faster. Appreciate the input!

@davidc.2525 16 days ago

Wait, those numbers are insane. Are you willing to share your code? I am looking for something that can do ~0.3s for 3s (not 30s) on a MacBook M1. Do you think this is possible with your changes?

@davidc.2525 16 days ago

If I understand correctly, your change speeds up batched processing of longer text but does not affect the base speed for a single text.

@UltraK420 17 days ago

Some French guy was doing this exact thing on Twitch over a year ago with GPT-4. The AI personalities were completely uncensored, and very naughty.

@Jarods_Journey 17 days ago

Well, the sky's the limit with uncensored models 😅

@4.0.4 16 days ago

I imagine he got banned? Haha.

@BackTiVi 16 days ago

@Tenkin42 Aitrepreneur, I think

@UltraK420 15 days ago

@4.0.4 Nope.

@MyutiConx 2 days ago

Hmm the name?

@cxzyKitsuki 11 days ago

So cool, I had been looking for a self-hosted AI assistant for so long. This is awesome!

@VirtusRex48 17 days ago

Super cool, can't wait until you share!

@joshfokis 17 days ago

I have made a few projects like this, one of which I am still working on, but GPT-SoVITS sounds promising. I made a Discord bot for my DnD group to ask questions to, and it speaks back with the answer, in character. I really want to give this one a go. Would love to review this once you publish it, if you do. Great work on your projects. It is always nice to see new updates.

@jdcmsigma47 8 days ago

awesome vid man!!

@tr1pod623 18 days ago

It would be amazing if you could turn this into a full-on project, maybe with a UI of some sort. Imagine you could add RAG over your personal files, or create some sort of long-term memory (as RAG) that the LLM saves (with tools). Or maybe I just need to find out how to use GPT-SoVITS with SillyTavern. GPT-SoVITS is so awesome; I love the expression and the laughs, and it feels super voice-actor-like.

@lokeshart3340 17 days ago

It's amazing, I support you

@4.0.4 16 days ago

I think all of what you're looking for and more exists in SillyTavern. Though 99%+ of existing character cards on characterhub won't have the necessary Live2D/VRM/etc. model for the character. Also, voice isn't part of the card spec. (In other words, you'd have to set it up the way you want yourself, but it's all there for you to do it.)

@lokeshart3340 16 days ago

@4.0.4 is it free?

@KnutNukem 12 days ago

Impressive, well done

@thenextension9160 17 days ago

very cool tech demo. this is the future of content

@gr8tbigtreehugger 17 days ago

Very awesome! On my voice bot, I am doing transcription in smaller chunks in parallel, so the final transcription step is super short, and then I send the appended text to the LLM.
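
One way to picture the chunked approach described above: each audio chunk is handed to a worker as soon as it arrives, so only the last chunk is still pending when the speaker stops. This is a hedged sketch, not the commenter's code; `ChunkedTranscriber` is a hypothetical helper and `fake_transcribe` stands in for a real speech-to-text call.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, List


class ChunkedTranscriber:
    """Transcribes chunks in background threads and joins the text in arrival order."""

    def __init__(self, transcribe: Callable[[Any], str], max_workers: int = 4) -> None:
        self._transcribe = transcribe
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures: List[Future] = []

    def add_chunk(self, chunk: Any) -> None:
        # Submit immediately; earlier chunks finish while the user keeps talking.
        self._futures.append(self._pool.submit(self._transcribe, chunk))

    def finish(self) -> str:
        # Futures are kept in submission order, so the text stays in order too.
        texts = [f.result() for f in self._futures]
        self._pool.shutdown()
        return " ".join(t.strip() for t in texts if t.strip())


def fake_transcribe(chunk: str) -> str:
    """Stand-in for a speech-to-text model: here a 'chunk' is already text."""
    return chunk.upper()


transcriber = ChunkedTranscriber(fake_transcribe)
for piece in ["hello ", "there", "  "]:
    transcriber.add_chunk(piece)
prompt_text = transcriber.finish()
print(prompt_text)
```

The joined string is what would then be sent to the LLM. Splitting audio mid-word can hurt accuracy, so a real setup would likely cut chunks at pauses.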

@cassusgames 17 days ago

Totally unrelated to the actual AI stuff, but did you choose the name Vivy as in Vivy: Fluorite Eye's Song? I enjoyed that anime quite a bit.

@Jarods_Journey 17 days ago

That is indeed the lore for why I named my original project Vivy and continue to do so :)!

@cassusgames 17 days ago

@Jarods_Journey That is awesome and very fitting

@ShiroAisan 15 days ago

holy crap she can laugh and not some creepy robotic hahaa

@sinayagubi8805 17 days ago

Hey! Can you speak some Japanese with it? Also, can we somehow reproduce this? Will you publish it on GitHub?

@Jarods_Journey 17 days ago

Japanese - maybe. Publish? Ye, I'll be publishing it sometime in the near future

@moresignal 17 days ago

Congratulations. I've been excited to see the results of your efforts and this looks brilliant. Is there any reason a similar approach would not work with F5TTS for low latency response?

@Jarods_Journey 17 days ago

Essentially, F5 doesn't produce tokens iteratively; it generates everything in one big chunk. Unless that can be changed, it can't be used to stream audio.
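
The distinction above can be illustrated with a toy comparison. This is only a sketch of the general idea, not either model's actual code: `token_source`, `stream_audio`, and `one_shot_audio` are hypothetical names, and integers stand in for audio tokens.

```python
from typing import Iterable, Iterator, List


def token_source(n: int, produced: List[int]) -> Iterator[int]:
    """Stand-in for a model: emits n tokens and records each one as it is made."""
    for i in range(n):
        produced.append(i)
        yield i


def stream_audio(tokens: Iterable[int], chunk_size: int = 2) -> Iterator[List[int]]:
    """Autoregressive style: hand over a chunk as soon as enough tokens exist."""
    buffer: List[int] = []
    for tok in tokens:
        buffer.append(tok)
        if len(buffer) == chunk_size:
            yield buffer
            buffer = []
    if buffer:  # flush the final partial chunk
        yield buffer


def one_shot_audio(tokens: Iterable[int]) -> List[int]:
    """One-shot style: nothing is playable until the whole utterance is done."""
    return list(tokens)


streamed_log: List[int] = []
first_chunk = next(stream_audio(token_source(6, streamed_log)))
one_shot_log: List[int] = []
everything = one_shot_audio(token_source(6, one_shot_log))
print(len(streamed_log), "tokens generated before the first streamed chunk")
print(len(one_shot_log), "tokens generated before one-shot audio is available")
```

In the streamed case playback can begin after the first chunk, while the one-shot case has to wait for the full utterance, which is why iterative generation matters for low latency.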

@moresignal 17 days ago

Thanks. I was unaware of the difference. I noticed the socket server example breaks the request into 10-second blocks and processes them sequentially.

@Random_person_07 17 days ago

It would be awesome if you could get an emotion text dataset, or a large dataset, and fine-tune the base GPT-SoVITS model to do better with emotions

@cefcephatus 14 days ago

I can't wait to see a sentient AI Vtuber from your project. Maybe a girl who could design her own physical body.

@fxy201 17 days ago

That's so cool! Please make a tutorial 🙏

@yereem 14 days ago

Amazing. I had one working fine with GPT-3.5, using VTube Studio for the model and all. I tried moving to a local Llama 3.1 and adding some memory, and everything started going downhill. I had to take a break because of classes, but I will finally be able to get back at it this winter break. Hopefully I fix that and move it to a Raspberry Pi. I finally got my hands on one of those new AI HATs and I am so excited to test it. I will put it on some screen with a camera somewhere in my place for no reason.

@yereem 6 days ago

Update: the AI HAT is not really useful for LLMs. The LLM can run locally on my Pi, but it's kinda slow, which makes sense considering the specs of a Pi. I am in the process of making it an AI server that I will use with Home Assistant later.

@kurotesuta 11 days ago

Built mine, but decided to try to make it run on an M2 Mac. The output audio doesn't seem to be completely processed by GPT-SoVITS; the voice sounds like a demon

@Sajeas 18 days ago

That's cool. Seeing integration with SillyTavern using the Mistral Large API (it's free) and either GPT-SoVITS or F5 would be nice. I tried to train GPT-SoVITS, but got only a .pth file and not a ckpt file, so if you know of, or could make, a GPT-SoVITS trainer, that would be great.

@lichtundliebe999 17 days ago

When I tested GPT-SoVITS in English, I still heard a Chinese accent. I used the creator's version and tried only with the web UI. Maybe some fine-tuning is still necessary.

@liberdelta 16 days ago

Do you think Whisper is better for Japanese speech-to-text compared to enterprise solutions like AmiVoice?

@4.0.4 16 days ago

Whisper large-v3 does handle Japanese somewhat ok, but on a long video it hallucinates a lot. (Things like, "thanks for watching!")

@FrostDagger 12 days ago

Long-term memory? How much VRAM? Specs?

@Tumhishka 17 days ago

Could you please provide the code? I recently started working on my thesis project (an AI assistant in glasses similar to Xreal (Android)). This code would help me a lot to study STT and to create an animated character.

@zikwin 18 days ago

Wait, how do you make that animated character speak too?

@Jarods_Journey 17 days ago

VTube Studio, using digital cables

@Random_person_07 17 days ago

This is really cool. Hopefully you release a web UI version or something like that

@MyutiConx 2 days ago

Is an RTX 3070 8GB possible, Jarods?

@FromTheWombTotheGrave 17 days ago

Will you drop a full course on making this?

@jimmyjam77 17 days ago

Wow this is cool!

@Tamrinschannel 17 days ago

wow, the responses are quick

@sdaassadd4721 18 days ago

Hello Jarods, I watch some of your vids, and I want to know which TTS is the better choice that I can easily train on a 12GB VRAM GPU for Portuguese voices. There are so many options that I'm getting confused

@Jarods_Journey 17 days ago

All of them can technically do it; some are easier than others. If I were you, I'd start with F5TTS first, as it's the simplest, and 12 GB can hobble along and train on 1-2 hours of data to get a single speaker in the language

@sdaassadd4721 17 days ago

@Jarods_Journey Do any of your videos teach how to train? Or do you have a link I can read? Ty for the recommendation, I will follow!

@Menober 16 days ago

Bro, any chance of TTS running on an AMD GPU? ;(

@soraygoularssm8669 17 days ago

Please share the code for the realtime GPT-SoVITS

@NickyTuan 15 days ago

How do you make a voice like that?

@piplupsuper0 17 days ago

Oh wow, is there a release for this, my man?

@Jarods_Journey 17 days ago

Sometime in the future I'll be throwing the code up! It's very messy rn

@islam_era2035 10 days ago

Please share how to integrate AI into a VTuber character

@kritikusi-666 17 days ago

how did you do this Vtuber thingy? Can you make a tutorial?

@Jarods_Journey 17 days ago

I hope to do a breakdown video in the future, but it's a lot of moving parts

@jimmyjam77 17 days ago

How to customize the voice?

@Jarods_Journey 17 days ago

You'd need to train a GPT-SoVITS model, a whole nother step of the process (or use different reference audio)

@rtyzxc 15 days ago

I feel like you used a bit of Botan. Judging by the laugh.

@TheDailyMemesShow 17 days ago

OMG, she's so cute!

@alpuhagame 16 days ago

So is it the real Kizuna AI? xD We're very close. Just need to teach her to play video games and comment on them.

@lionight.custom5693 17 days ago

How do you make an AI VTuber?

@ninbob4633 17 days ago

damn we're gonna get clones of neuro-sama everywhere

@stilly5016 16 days ago

Please add animation like hand movement to make it more realistic 😊

@TABandiTA 17 days ago

feels like a budget Neuro-sama, but pretty cool regardless.

@mactheo2574 17 days ago

Neuro can't laugh and is pretty monotone. Evil is through an API (the TTS, I know he runs a small LLM locally while Neuro is also local TTS). Lame. Also, it's so stupid how so many people look up to Vedal. He literally never teaches anyone anything about Neuro, while the majority of Neuro relies on OSS...

@Jarods_Journey 17 days ago

I think Vedal has done a fantastic job at building a character and personality, so much so that the work and success are well deserved IMO. I understand his disposition of not wanting to teach the whole world how he does it, because he's got to make a living somehow lol. If I wanted to be an AI vtuber, I'd probably be doing the same thing he's doing tbh

@mactheo2574 17 days ago

@Jarods_Journey You're very gracious and understanding. I still think Vedal should at least give some spotlight to the people who worked hard to implement new tech and optimizations, things that directly benefited Neuro. Right now, every Neuro "upgrade" update has been "I made Neuro smarter", "Neuro now remembers more", "Neuro latency is improved", even the latest "Neuro can now directly play Minecraft". For all of these, I can fairly accurately point to how the "upgrades" are made, such as longer-context Llama 3.1 and Mistral models, flash attention, quantization techniques, KV cache, faster whisper and whisper turbo, the amazing Minecraft project that Emergent Garden made public, etc. Just a simple mention of that amazing work would suffice. It teaches the swarm that Vedal cannot exist in a vacuum and that software dev is fundamentally collaborative. I've been a "fan" of Vedal for more than 2 years, and his behavior continues to be troubling to me.