Human-like ChatGPT Voice is SHOCKINGLY Good & Audiobook Maker Updates

Views: 7,617

Jarods Journey

1 day ago

Comments: 97

@iyxan23 1 month ago

I absolutely did not expect ChatGPT to be able to do an "ara-ara" impression

@Billy4321able 1 month ago

I thought OpenAI might have been exaggerating and cherry-picking the responses but NO this is truly REVOLUTIONARY in the AI voice space. You could literally practice your language skills with something like this for hours it's crazy. Not even just learning a language you could literally learn to voice act in another language with fantasy style voices. This is crazy! I'm afraid people are going to stop talking to each other if this ever becomes cheap enough to use full time. They sound like way more entertaining speaking partners than anyone I know. It's kinda messed up honestly. I struggle to see how people are going to adapt to having easy access to AI friends in the future. It only gets worse once they get robot bodies... This is going to be like real life Chobits.

@Jarods_Journey 1 month ago

It's definitely a game changer, I'm 100% gonna be talking with it more lol. The quality coming out was literally fiction 2 years ago.

@BlastHeart96 1 month ago

Wow, AI is terrifying, but amazing at the same time. I wonder if ChatGPT will ever be able to “voice act” audiobooks by reading ahead and analyzing content clues, tones, setting, context, etc., maybe even generating sounds or BGM that fits the situation.

@GraveUypo 1 month ago

this is 100% able to do that. i just want an open model that does this too.

@Katsumi_Maki 1 month ago

I-It's not like I want to chat with you or anything, baka! 5:07

@Yaksha_Indra 1 month ago

I was cringing like this en.wikipedia.org/wiki/Tetanus#/media/File:Opisthotonus_in_a_patient_suffering_from_tetanus_-_Painting_by_Sir_Charles_Bell_-_1809.jpg Almost broke my back

@HogwartsStudy 1 month ago

That's crazy!!! I can't wait for this quality in a local Applio type solution.

@blasandresayalagarcia3472 1 month ago

That's amazing! I'm actually working on a textbook-to-audiobook project too 😂 I was looking to self-host something but didn't find anything, so I started making it.

@seifuishiguro 1 month ago

7:01 Exactly, I really wonder where they got all that voice data and how they went about labelling it. Human speech is complicated, with so many different styles of speaking and ways of responding. If only other companies and open source teams could get their hands on that kind of dataset.

@phenix5609 1 month ago

First off, whoaaaa!! The voice gen... it makes me really sad that we can't get this kind of voice locally yet. Second, YouTube knows me too well, which is scary sometimes; I was also thinking about making an audiobook reader. I'm really curious about 2-3 things though, if you are willing to share. Does yours run all locally? If so, what voices do you use or where did you find them, and what do you use to clone them if you clone voices? What do you think is best right now for doing this kind of thing fully locally? Last thing: I think I saw 24 GB of VRAM, so you must be running a 3090 or 4090. How much do you think is needed to run this without too much lag or wait time? Can it be done with a 3080 with 10 GB VRAM and 32 GB RAM, or is that too low? I hope you can answer, thx.

@Murderface666 1 month ago

The voice made an error speaking Japanese. "Haisai" is Okinawan for "hello."

@Jarods_Journey 1 month ago

I had never heard haisai until that day lol

@CosmicTavern 1 month ago

This is INSANE. Need this on local.

@Jarods_Journey 1 month ago

Same 😭

@shreyashmore1824 1 month ago

Yes please Local 🥲

@DrewWalton 1 month ago

Minor clarification: Standard voice mode (which has been available for a while) is a bog-standard pipeline: it transcribes your speech to text, runs the transcribed text prompt against the model, and converts the reply text back to speech. Advanced Voice mode is *not* text-to-speech or speech-to-text *at all*. GPT-4o is natively multi-modal, so what you're actually experiencing is something called "voice-to-voice." That's right, it is, in a very real sense, hearing what you say and how you say it. Per my understanding, the only speech-to-text that really happens is for providing the text transcript of your conversations.

@Jarods_Journey 1 month ago

This is correct, I make the correction a little later in the video, but yes, it encodes and decodes audio natively without needing to pipeline other networks in tandem which is impressive - I think we'll start seeing this paradigm more and more

@DrewWalton 1 month ago

@@Jarods_Journey and of course I wrote this comment before getting to that part 😆 But yeah, absolutely. I've been having way too much fun with AV mode. $20 a month for something that's more fun than a barrel o' monkeys and extremely useful to boot? Shut up and take my money.

@leodark_animations2084 1 month ago

@@DrewWalton Can you find details on how the model works on the OpenAI website or somewhere else? I'm pretty interested in understanding it more.
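
For readers following this thread, the cascaded design described for Standard voice mode can be sketched roughly as below. This is only an illustration of the speech-to-text, text model, text-to-speech chain. The class name `CascadedVoicePipeline` and the `fake_*` functions are hypothetical stand-ins so the sketch runs on its own; they are not OpenAI's API, and nothing here reflects how GPT-4o is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadedVoicePipeline:
    """Hypothetical cascade: speech-to-text -> text model -> text-to-speech."""

    speech_to_text: Callable[[bytes], str]
    text_model: Callable[[str], str]
    text_to_speech: Callable[[str], bytes]

    def respond(self, audio_in: bytes) -> bytes:
        # Only the words survive this step; tone and emotion in the
        # speaker's voice never reach the text model.
        transcript = self.speech_to_text(audio_in)
        reply_text = self.text_model(transcript)
        return self.text_to_speech(reply_text)


# Hypothetical stand-ins (assumptions for illustration, not real models).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedVoicePipeline(fake_stt, fake_llm, fake_tts)
reply = pipeline.respond(b"hello")
print(reply)
```

A natively multimodal model, as described above for Advanced Voice mode, would replace all three stages with a single audio-in, audio-out model, which is why it can react to how something is said and not just to the transcript.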

@budbin 1 month ago

2:15 that's crazy

@saintkamus14 1 month ago

"How do you even label that?" I've been wondering that myself. What I think is going on is that they have AIs that do the labeling from the raw data (they first become "experts", then they can do the labeling).

@darkreader01 1 month ago

It would be crazy if we could generate an audiobook or hear a full storybook in that ChatGPT advanced voice with emotions.

@Jarods_Journey 1 month ago

That day will surely come. The only question is when 😅

@johansparr2409 1 month ago

Awesome to hear, also love the audiobook update, looking forward to using it!!😊

@farrael004 1 month ago

They probably didn't have labeled data. It's most likely pretrained on a massive amount of audio from youtube and other recordings from different languages, then fine-tuned using the voices available.

@Jarods_Journey 1 month ago

I'd have to think the fine-tuning portion had some type of labeling since it has to understand what whispering is. Pretraining portion was most definitely unsupervised

@mirek190 1 month ago

@@Jarods_Journey Actually, no... It's the same situation with multimodal LLMs... you are not even training the model with pictures... you are just adding a projector to the LLM and... it works just like that... magic

@basspig 1 month ago

Wow, the Japanese-language ChatGPT voice is pretty darn convincing. Although it does sound like a Caucasian person speaking Japanese rather than a Japanese person speaking Japanese. I'll be really impressed when it can do the voice of my favorite anime character.

@나익명 1 month ago

Yeah the Australian accent also still sounds like an American person

@KnutNukem 1 month ago

Pls add a label function like [SPEAKER NAME] to automatically assign lines to a voice. Ideally, all following lines would be assigned to that speaker after marking it. This could be a toggleable option. Also, let multiple lines be selected via CTRL + CLICK or SHIFT + CLICK so their voice can be set in bulk.

@Jarods_Journey 1 month ago

Ooh, this might make for a good alternative tab to set it up for the speaker formats. I'll think about this one
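
One way the `[SPEAKER NAME]` idea from this thread could work is sketched below, purely as an illustration. It is not code from Audiobook Maker; the function name `assign_speakers`, the `sticky` toggle, and the default "Narrator" speaker are all assumptions made for the example.

```python
import re
from typing import List, Tuple

# Matches a leading tag such as "[Alice]" followed by optional text.
SPEAKER_TAG = re.compile(r"^\[(?P<name>[^\]]+)\]\s*(?P<rest>.*)$")


def assign_speakers(
    lines: List[str], default: str = "Narrator", sticky: bool = True
) -> List[Tuple[str, str]]:
    """Return (speaker, text) pairs for each non-empty line.

    With sticky=True, a tag also applies to every following untagged line.
    With sticky=False, a tag applies only to the text on its own line and
    untagged lines fall back to the default speaker.
    """
    current = default
    result: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = SPEAKER_TAG.match(line)
        if match:
            speaker = match.group("name").strip()
            text = match.group("rest").strip()
            if sticky:
                current = speaker
            if text:
                result.append((speaker, text))
            continue
        result.append((current, line))
    return result


script = ["[Alice] Hello there.", "How are you?", "[Bob]", "Fine, thanks."]
for speaker, text in assign_speakers(script):
    print(f"{speaker}: {text}")
```

The `sticky` flag corresponds to the "toggleable option" in the request: when it is off, only the tagged line itself changes voice.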

@macmcleod1188 1 month ago

6:00 ... mind blown.

@om4741 23 days ago

Can you tell me the prompt you used in this video?

@nanieve4296 1 month ago

DONT MIND IF I DO WITH ALL OF THOSE VOICE SET!

@TonyMezaXD 1 month ago

The British voice was the only bad impression. It sounded like an American attempting to do a British accent.

@sownheard 1 month ago

Wait, isn't the voice an American voice pretending to be British?

@spiker.c6058 1 month ago

Yeah, the English voices are American unless they add a proper UK English language setting with proper UK English voices. But why would they do that just to get another English accent?

@TonyMezaXD 1 month ago

@@sownheard If that’s the case then it’s spot on. I just figured since it could switch to other languages easily it should be able to switch English dialects as well.

@TonyMezaXD 1 month ago

@@spiker.c6058 Oh in that case it was spot on.

@adolphgracius9996 1 month ago

Can it say Yaaamite?

@radianthole 1 month ago

My man got that Nihongo jozu

@matty.j_1997 1 month ago

Great demo! How were you able to screen-record the Advanced Voice?

@Jarods_Journey 1 month ago

Using just my phone's native recorder, then I just synced it up in editing

@matty.j_1997 1 month ago

@@Jarods_Journey Yeah but it sounds like the voice was also recorded „officially“

@Jarods_Journey 1 month ago

I guess Samsung is just that good 😅. Just the Samsung screen recorder with media sound enabled

@dthSinthoras 1 month ago

What is really missing in Audiobook Maker, that would make it usable, would be the possibility to use other languages.

@dadadies 1 month ago

What's your audiobook maker? Is it some sort of audio dialog maker with different characters all in one interface? I wonder if it can be adapted for interactive dialog such as in a game, especially if someone attaches an LLM to it that can generate dialog based on their interactions. You'd have an even more special system (with other people contributing in those other areas).

@Jarods_Journey 1 month ago

It's a tool that you can load up a text file and use tortoise TTS/styletts to generate audiobooks with. Currently adding features to it like different speakers for sentences, etc!

@Barrel_Of_Lube 1 month ago

it pretty much got all the languages (claimed by an openai dev) thats fking crazy

@Akurallia 1 month ago

DUUUDE This is ABSURDLY incredible! Simply magnificent 🤩

@NFawc 1 month ago

Looks like after running the generation, the voice settings for each line are lost? I.e., after you ran the generation, the line colours (speakers) were all lost (they all changed to grey)? P.S. VERY interested in the audiobook maker, especially with a good TTS generator.

@Jarods_Journey 1 month ago

Just for the generation, to show they were complete. If you load the audiobook again, it'll restore the colors and associated speaker

@NFawc 1 month ago

@@Jarods_Journey Understood. But after a generation, if you then want to regen just a sentence or two again, wouldn't it be good to have colour/settings as before the (full) generation?

@adamrastrand9409 1 month ago

When will it be available in Sweden, in the EU? Would it be next week, like October 5, or more like six months from now?

@Jarods_Journey 1 month ago

I'm not too certain, you might wanna keep up with openAI to see when they announce it for non US countries

@Snafuuu 1 month ago

"I can't imagine the data needed for this" I can, it's the whole damn internet 💀

@nomadv7860 1 month ago

Just wanted to point out it’s not text-to-speech, it’s able to hear the actual tone and emotion of your voice

@cookiefrnamikaze1674 1 month ago

man you are such a masterpiece for asking the Ara ara

@bananalord9288 1 month ago

goddammnit I lost it during the japanese demonstration XD

@Mika43344 1 month ago

Did you hear about eleven labs reader?

@iseahosbourne9064 1 month ago

Hey Jarod, what's the best AI voice cloning tool as of today? RVC, XTTS, Tortoise TTS, etc.?

@Jarods_Journey 1 month ago

Local? Pipe xtts/tortoise into RVC, and it's still very solid. Via corpo? Elevenlabs for sure.

@iseahosbourne9064 1 month ago

@@Jarods_Journey Thanks for the info jarod!

@iseahosbourne9064 15 days ago

@@Jarods_Journey Was just thinking, what is the best TTS to train: an XTTS fine-tune, Tortoise? I'm having trouble with XTTS generating consistent speech. Not sure if that's because I trained a model with 21 min of data, doh.

@mal-avcisi9783 1 month ago

This is insane

@tylerboy19yp 1 month ago

hey jarod what is the best current voice cloning fine tune tts model right now i can run locally?

@Jarods_Journey 1 month ago

Gonna be either xtts/tortoise or styletts, and then I'm trying out parlertts, so we'll see how this comes along to see if I can recommend it

@tylerboy19yp 1 month ago

@@Jarods_Journey tried xtts base model i can't seem to get the dependencies to work with the fine tune model, have you heard doppleAI's voice models? they sound really good with only 1-3 minutes of audio

@spiker.c6058 1 month ago

Ara ... ara .... my God !!!!

@WistrelChianti 1 month ago

What a time to be alive!

@Англичанин_Р 1 month ago

As long as you are not the native-level speaker, the robot will sound to you like a human.😅 How long can you chat to a robot without getting annoyed?

@xaiyeon_xiuzhen 1 month ago

OMG idc if im cooked that was awesome :D

@BoomBillion 1 month ago

😂try american tourist voice speaking Spanish.

@나익명 1 month ago

Same with Australian accent haha

@Airbender131090 1 month ago

This is ridiculous 😮 Imagine this in video game NPCs 😮 That's crazy

@ariverosmg 1 month ago

You don't need to label things; it "understands" what you mean. It learned to "reason" slightly, so no human is needed in that loop.

@Jarods_Journey 1 month ago

I firmly believe a little bit of labeling is still needed to get it to understand all the pretraining data; I just think in this case they have trained classifiers that can accurately label specific types of data. No human in the loop here, just classification models and lots of compute 😅

@Crazy_Truth 1 month ago

Plus plan 😢

@Jarods_Journey 1 month ago

Unfortunately 😅

@justindressler5992 1 month ago

Is the Japanese accurate and understandable? I would love to learn Japanese through chat.

@TrC450 1 month ago

Yes, at least what was shown here.

@Jarods_Journey 1 month ago

Understandable, absolutely. It is more than accurate enough I'd say, but I'm not at a level high enough to determine whether or not it was using natural expressions. Though, the anime impersonations were pretty good imo

@baltakatei 1 month ago

We're so cooked.

@saint115io 1 month ago

Wait, are you japanese? 😮😮

@Jarods_Journey 1 month ago

A small bit :)

@Draggtar 1 month ago

The answer is: GPT chat creators are otakus

@hypersonicmonkeybrains3418 1 month ago

I tried it and the audio quality is atrocious: bitcrushed, compressed, choppy and analog sounding. Not at all impressed. I would never consider paying a subscription for an AI voice whose main feature is that it can put on accents. I then tried Gemini voice and it's crystal clear... Inflection Pi AI, crystal clear... what gives?

@dadadies 1 month ago

Her dramatic movie trailer narrator voice skill sucks. It sounds like someone's mom failing to do the task. Maybe if you asked her to make it sound professional i wonder if she would have sounded better.

@Jarods_Journey 1 month ago

It may have been influenced by earlier portions of chat, but I'd agree it wasn't the best as shown here!