Human-like ChatGPT Voice is SHOCKINGLY Good & Audiobook Maker Updates

Views: 8,116

Jarods Journey

Comments: 98

@iyxan23 2 months ago

I absolutely did not expect ChatGPT to be able to do an "ara-ara" impression

@Billy4321able 2 months ago

I thought OpenAI might have been exaggerating and cherry-picking the responses, but NO, this is truly REVOLUTIONARY in the AI voice space. You could literally practice your language skills with something like this for hours, it's crazy. And not just learning a language: you could literally learn to voice act in another language with fantasy-style voices. This is crazy! I'm afraid people are going to stop talking to each other if this ever becomes cheap enough to use full time. They sound like way more entertaining speaking partners than anyone I know. It's kinda messed up, honestly. I struggle to see how people are going to adapt to having easy access to AI friends in the future. It only gets worse once they get robot bodies... This is going to be like real-life Chobits.

@Jarods_Journey 2 months ago

It's definitely a game changer, I'm 100% gonna be talking with it more lol. The quality coming out was literally fiction 2 years ago.

@blasandresayalagarcia3472 2 months ago

That's amazing! I'm actually working on a textbook-to-audiobook project too 😂 I was looking to self-host something but didn't find anything, so I started making it.

@BlastHeart96 2 months ago

Wow, AI is terrifying, but amazing at the same time. I wonder if ChatGPT will ever be able to “voice act” audiobooks by reading ahead and analyzing content clues, tones, setting, context, etc., maybe even generating sounds or BGM that fit the situation.

@GraveUypo 2 months ago

This is 100% able to do that. I just want an open model that does this too.

@seifuishiguro 2 months ago

7:01 Exactly, I really wonder where they got all that voice data and how they went about labelling it. Human speech is complicated, with so many different styles of speaking and ways of responding. If only other companies and open-source teams could get their hands on that kind of dataset.

@HogwartsStudy 2 months ago

That's crazy!!! I can't wait for this quality in a local Applio type solution.

@DrewWalton 2 months ago

Minor clarification: Standard voice mode (which has been available for a while) is bog-standard speech-to-text: it transcribes your speech, runs the transcribed text prompt against the model, and converts the text response back to speech. Advanced Voice mode is *not* text-to-speech or speech-to-text *at all*. GPT-4o is natively multi-modal, so what you're actually experiencing is something called "voice-to-voice." That's right, it is, in a very real sense, hearing what you say and how you say it. Per my understanding, the only speech-to-text that really happens is for providing the text transcript of your conversations.

@Jarods_Journey 2 months ago

This is correct. I make the correction a little later in the video, but yes, it encodes and decodes audio natively without needing to pipeline other networks in tandem, which is impressive. I think we'll start seeing this paradigm more and more.

@DrewWalton 2 months ago

@@Jarods_Journey and of course I wrote this comment before getting to that part 😆 But yeah, absolutely. I've been having way too much fun with AV mode. $20 a month for something that's more fun than a barrel o' monkeys and extremely useful to boot? Shut up and take my money.

@leodark_animations2084 2 months ago

@@DrewWalton Regarding how the model works in detail, can you find that info on the OpenAI website or somewhere else? I'm pretty interested in understanding it more.

@johansparr2409 2 months ago

Awesome to hear, also love the audiobook update, looking forward to using it!!😊

@Katsumi_Maki 2 months ago

I-It's not like I want to chat with you or anything, baka! 5:07

@Yaksha_Indra 2 months ago

I was cringing like this en.wikipedia.org/wiki/Tetanus#/media/File:Opisthotonus_in_a_patient_suffering_from_tetanus_-_Painting_by_Sir_Charles_Bell_-_1809.jpg Almost broke my back

@saintkamus14 2 months ago

"How do you even label that?" I've been wondering that myself. What I think is going on is that they have AIs that do the labeling from the raw data (they first become "experts", then they can do the labeling).

@farrael004 2 months ago

They probably didn't have labeled data. It's most likely pretrained on a massive amount of audio from YouTube and other recordings in different languages, then fine-tuned using the voices available.

@Jarods_Journey 2 months ago

I'd have to think the fine-tuning portion had some type of labeling, since it has to understand what whispering is. The pretraining portion was most definitely unsupervised.

@mirek190 2 months ago

@@Jarods_Journey Actually no... It's the same situation with multimodal LLMs... you are not even training the model with pictures... you are just adding a projector to the LLM and... it works just like that... magic

@CosmicTavern 2 months ago

This is INSANE. Need this on local.

@Jarods_Journey 2 months ago

Same 😭

@shreyashmore1824 2 months ago

Yes please Local 🥲

@budbin 2 months ago

2:15 that's crazy

@om4741 2 months ago

Can you tell me the prompt you used in this video?

@nanieve4296 2 months ago

DON'T MIND IF I DO WITH ALL OF THOSE VOICE SETS!

@KnutNukem 2 months ago

Pls add a label function like [SPEAKER NAME] to automatically assign lines to a voice. Ideally, all following lines would be assigned to that speaker after marking it. This could be a toggleable option. Also, let multiple lines be selected via CTRL + CLICK or SHIFT + CLICK so their voice can be set in one go.

@Jarods_Journey 2 months ago

Ooh, this might make for a good alternative tab to set it up for the speaker formats. I'll think about this one
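
A minimal sketch of the [SPEAKER NAME] labeling idea suggested above, assuming a tag sits at the start of a line. This is not Audiobook Maker's actual code: the function name `assign_speakers`, the default "Narrator" voice, and the `sticky` toggle are all hypothetical and only illustrate the request.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical tag format: a line that starts with [SPEAKER NAME].
TAG = re.compile(r"^\[([^\]]+)\]\s*(.*)$")


def assign_speakers(
    lines: Iterable[str],
    default: str = "Narrator",
    sticky: bool = True,
) -> List[Tuple[str, str]]:
    """Return (speaker, text) pairs for every non-empty line.

    With sticky=True, a tag also applies to all following untagged lines.
    With sticky=False, a tag applies only to the line it appears on.
    """
    current = default
    result: List[Tuple[str, str]] = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip blank lines
        speaker = current
        match = TAG.match(text)
        if match:
            speaker = match.group(1).strip()
            text = match.group(2).strip()
            if sticky:
                current = speaker  # remember the speaker for later lines
        if text:  # a tag-only line changes the speaker but emits nothing
            result.append((speaker, text))
    return result


if __name__ == "__main__":
    script = ["Once upon a time.", "[ALICE] Hello there!", "How are you?"]
    for speaker, text in assign_speakers(script):
        print(f"{speaker}: {text}")
```

Making the carry-over behavior a parameter mirrors the "toggleable option" in the request; multi-select in the UI would be a separate concern from this parsing step.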

@darkreader01 2 months ago

It would be crazy if we could generate an audiobook or hear a full storybook in that ChatGPT advanced voice with emotions.

@Jarods_Journey 2 months ago

That day will surely come. The only question is when 😅

@Murderface666 2 months ago

The voice made an error speaking Japanese. "Haisai" is Okinawan for "hello."

@Jarods_Journey 2 months ago

I had never heard haisai until that day lol

@phenix5609 2 months ago

First off, whoaaaa!! The voice gen... it gives me really sad feelings that we can't get this kind of voice locally yet. Second, YouTube knows me too well, it's scary sometimes: I was also thinking about making an audiobook reader. I'm really curious about 2-3 things though, if you are willing to share. Does yours run all locally? If so, which voices do you use, or where did you find them, and what do you use to clone them if you clone voices? What do you think is best right now for this type of stuff fully locally? Last thing: I think I saw 24 GB of VRAM, so you must be running a 3090 or 4090. How much do you think is needed to run this without too much lag or wait time? Can it be done with a 3080 with 10 GB VRAM and 32 GB RAM, or is that too low? I hope you can answer, thx.

@TonyMezaXD 2 months ago

The British voice was the only bad impression. It sounded like an American attempting to do a British accent.

@sownheard 2 months ago

Wait, isn't the voice an American voice pretending to be British?

@spiker.c6058 2 months ago

Yeah, the English voices are American unless they add a proper UK English language setting with proper UK English voices. But why would they do that just to get another English accent?

@TonyMezaXD 2 months ago

@@sownheard If that’s the case then it’s spot on. I just figured since it could switch to other languages easily it should be able to switch English dialects as well.

@TonyMezaXD 2 months ago

@@spiker.c6058 Oh, in that case it was spot on.

@adolphgracius9996 2 months ago

Can it say Yaaamite?

@macmcleod1188 2 months ago

6:00 ... mind blown.

@basspig 2 months ago

Wow, the Japanese-language ChatGPT voice is pretty darn convincing. Although it does sound like a Caucasian person speaking Japanese rather than a Japanese person speaking Japanese. I'll be really impressed when it can do the voice of my favorite anime character.

@나익명 2 months ago

Yeah, the Australian accent also still sounds like an American person

@radianthole 2 months ago

My man got that Nihongo jozu

@Akurallia 2 months ago

DUUUUDE, this is ABSURDLY incredible! Simply magnificent 🤩

@adamrastrand9409 2 months ago

When will it be available in Sweden, in the EU? Would it be like next week, October 5, or more like in six months or so?

@Jarods_Journey 2 months ago

I'm not too certain, you might wanna keep up with OpenAI to see when they announce it for non-US countries

@NFawc 2 months ago

Looks like after running the generation, the voice settings for each line are lost? I.e., after you ran the generation, the line colours (speakers) were all lost (they all changed to grey)? PS: VERY interested in the audiobook maker, especially with a good TTS generator.

@Jarods_Journey 2 months ago

That's just for the generation, to show the lines were complete. If you load the audiobook again, it'll restore the colors and associated speakers.

@NFawc 2 months ago

@@Jarods_Journey Understood. But after a generation, if you then want to regen just a sentence or two again, wouldn't it be good to have colour/settings as before the (full) generation?

@dthSinthoras 2 months ago

What is really missing in Audiobook Maker, and what would make it usable for me, is the possibility to use other languages.

@matty.j_1997 2 months ago

Great demo! How were you able to screen-record the Advanced Voice?

@Jarods_Journey 2 months ago

Using just my phone's native recorder, then I just synced it up in editing

@matty.j_1997 2 months ago

@@Jarods_Journey Yeah but it sounds like the voice was also recorded „officially“

@Jarods_Journey 2 months ago

I guess Samsung is just that good 😅. Just the Samsung screen recorder with media sound enabled

@Snafuuu 2 months ago

"I can't imagine the data needed for this" I can, it's the whole damn internet 💀

@dadadies 2 months ago

What's your audiobook maker? Is it some sort of audio dialog maker with different characters all in one interface? I wonder if it can be adapted for interactive dialog, such as in a game. Especially if someone attaches an LLM to it that can generate dialog based on the player's interactions. You'd have an even more special system (with other people contributing in those other areas).

@Jarods_Journey 2 months ago

It's a tool where you can load up a text file and use Tortoise TTS/StyleTTS to generate audiobooks. Currently adding features to it like different speakers for sentences, etc!

@Barrel_Of_Lube 2 months ago

It pretty much got all the languages (claimed by an OpenAI dev), that's fking crazy

@tylerboy19yp 2 months ago

Hey Jarod, what is the best voice cloning fine-tune TTS model I can run locally right now?

@Jarods_Journey 2 months ago

Gonna be either xtts/tortoise or styletts, and then I'm trying out parlertts, so we'll see how this comes along to see if I can recommend it

@tylerboy19yp 2 months ago

@@Jarods_Journey I tried the xtts base model, but I can't seem to get the dependencies to work with the fine-tune model. Have you heard doppleAI's voice models? They sound really good with only 1-3 minutes of audio.

@Mika43344 2 months ago

Did you hear about ElevenLabs Reader?

@iseahosbourne9064 2 months ago

Hey Jarod, what's the best AI voice cloning tool as of today? RVC, XTTS, Tortoise TTS, etc.?

@Jarods_Journey 2 months ago

Local? Pipe xtts/tortoise into RVC, and it's still very solid. Via corpo? Elevenlabs for sure.

@iseahosbourne9064 2 months ago

@@Jarods_Journey Thanks for the info, Jarod!

@iseahosbourne9064 1 month ago

@@Jarods_Journey Was just thinking, what is the best TTS to train: an XTTS fine-tune or Tortoise? I'm having trouble with XTTS generating consistent speech. Not sure if that's because I trained a model with 21 min of data, doh.

@nomadv7860 2 months ago

Just wanted to point out it’s not text-to-speech, it’s able to hear the actual tone and emotion of your voice

@mal-avcisi9783 2 months ago

This is insane

@bananalord9288 2 months ago

Goddammit, I lost it during the Japanese demonstration XD

@cookiefrnamikaze1674 2 months ago

man you are such a masterpiece for asking the Ara ara

@spiker.c6058 2 months ago

Ara ... ara .... my God !!!!

@WistrelChianti 2 months ago

What a time to be alive!

@BoomBillion 2 months ago

😂 Try an American tourist voice speaking Spanish.

@나익명 2 months ago

Same with Australian accent haha

@Crazy_Truth 2 months ago

Plus plan 😢

@Jarods_Journey 2 months ago

Unfortunately 😅

@xaiyeon_xiuzhen 2 months ago

OMG idc if im cooked that was awesome :D

@Англичанин_Р 2 months ago

As long as you are not a native-level speaker, the robot will sound like a human to you. 😅 How long can you chat to a robot without getting annoyed?

@ariverosmg 2 months ago

You don't need to label things. It "understands" what you mean; it has learned to "reason" slightly, so no human is needed in that loop.

@Jarods_Journey 2 months ago

I firmly believe a little bit of labeling is still needed to get it to understand all the pretraining data. I just think in this case they have trained classifiers that can accurately label specific types of data. No human in the loop here, just classification models and lots of compute 😅

@Airbender131090 2 months ago

This is ridiculous 😮 Imagine this in video game NPCs 😮 That's crazy

@justindressler5992 2 months ago

Is the Japanese accurate and understandable? I would love to learn Japanese through chat.

@TrC450 2 months ago

Yes, at least what was shown here.

@Jarods_Journey 2 months ago

Understandable, absolutely. It is more than accurate enough I'd say, but I'm not at a level high enough to determine whether or not it was using natural expressions. Though, the anime impersonations were pretty good imo

@baltakatei 2 months ago

We're so cooked.

@saint115io 2 months ago

Wait, are you Japanese? 😮😮

@Jarods_Journey 2 months ago

A small bit :)

@dadadies 2 months ago

Her dramatic movie trailer narrator voice skill sucks. It sounds like someone's mom failing to do the task. Maybe if you had asked her to make it sound professional, she would have sounded better.

@Jarods_Journey 2 months ago

It may have been influenced by earlier portions of chat, but I'd agree it wasn't the best as shown here!

@Draggtar 2 months ago

The answer is: GPT chat creators are otakus

@hypersonicmonkeybrains3418 2 months ago

I tried it and the audio quality is atrocious: bitcrushed, compressed, choppy and analog-sounding. Not at all impressed. I would never consider paying a subscription for an AI voice whose main feature is that it can put on accents. I then tried Gemini voice and it's crystal clear... Inflection's Pi AI, crystal clear... what gives?