I absolutely did not expect ChatGPT to be able to do an "ara-ara" impression
@Billy4321able 2 months ago
I thought OpenAI might have been exaggerating and cherry-picking the responses, but NO, this is truly REVOLUTIONARY in the AI voice space. You could literally practice your language skills with something like this for hours, it's crazy. Not even just learning a language: you could literally learn to voice act in another language with fantasy-style voices. This is crazy! I'm afraid people are going to stop talking to each other if this ever becomes cheap enough to use full time. They sound like way more entertaining speaking partners than anyone I know. It's kinda messed up, honestly. I struggle to see how people are going to adapt to having easy access to AI friends in the future. It only gets worse once they get robot bodies... This is going to be like real-life Chobits.
@Jarods_Journey 2 months ago
It's definitely a game changer, I'm 100% gonna be talking with it more lol. The quality coming out was literally fiction 2 years ago.
@blasandresayalagarcia3472 2 months ago
That's amazing! I'm actually working on a textbook-to-audiobook project too 😂 Was looking to self-host something but didn't find anything, so I started making it.
@BlastHeart96 2 months ago
Wow, AI is terrifying, but amazing at the same time. I wonder if ChatGPT will ever be able to “voice act” audiobooks by reading ahead and analyzing content clues, tones, setting, context, etc., maybe even generating sounds or BGM that fit the situation.
@GraveUypo 2 months ago
this is 100% able to do that. i just want an open model that does this too.
@seifuishiguro 2 months ago
7:01 Exactly, I really wonder where they got all that voice data and how they went about labelling it. Human speech is complicated, with so many different styles of speaking and ways of responding. If only other companies and open source teams could get their hands on that kind of dataset.
@HogwartsStudy 2 months ago
That's crazy!!! I can't wait for this quality in a local Applio type solution.
@DrewWalton 2 months ago
Minor clarification: Standard voice mode (which has been available for a while) uses bog-standard speech-to-text, runs the transcribed text prompt against the model, and then converts the text response back to speech. Advanced Voice mode is *not* text-to-speech or speech-to-text *at all*. GPT-4o is natively multi-modal, so what you're actually experiencing is something called "voice-to-voice." That's right, it is, in a very real sense, hearing what you say and how you say it. Per my understanding, the only speech-to-text that really happens is for providing the text transcript of your conversations.
@Jarods_Journey 2 months ago
This is correct, I make the correction a little later in the video, but yes, it encodes and decodes audio natively without needing to pipeline other networks in tandem which is impressive - I think we'll start seeing this paradigm more and more
@DrewWalton 2 months ago
@@Jarods_Journey and of course I wrote this comment before getting to that part 😆 But yeah, absolutely. I've been having way too much fun with AV mode. $20 a month for something that's more fun than a barrel o' monkeys and extremely useful to boot? Shut up and take my money.
@leodark_animations2084 2 months ago
@@DrewWalton Regarding how the model works in detail, can you find the info on the OpenAI website or somewhere else? I'm pretty interested in understanding it more.
@johansparr2409 2 months ago
Awesome to hear, also love the audiobook update, looking forward to using it!!😊
@Katsumi_Maki 2 months ago
I-It's not like I want to chat with you or anything, baka! 5:07
@Yaksha_Indra 2 months ago
I was cringing like this en.wikipedia.org/wiki/Tetanus#/media/File:Opisthotonus_in_a_patient_suffering_from_tetanus_-_Painting_by_Sir_Charles_Bell_-_1809.jpg Almost broke my back
@saintkamus14 2 months ago
"how do you even label that" Been wondering that myself. What I think is going on is that they have AIs that do the labeling from the raw data (they first become "experts", then they can do the labeling).
@farrael004 2 months ago
They probably didn't have labeled data. It's most likely pretrained on a massive amount of audio from YouTube and other recordings in different languages, then fine-tuned using the voices available.
@Jarods_Journey 2 months ago
I'd have to think the fine-tuning portion had some type of labeling since it has to understand what whispering is. Pretraining portion was most definitely unsupervised
@mirek190 2 months ago
@@Jarods_Journey Actually no... It's the same situation with multimodal LLMs... you are not even training the model with pictures... you are just adding a projector to the LLM and... it works just like that... magic
@CosmicTavern 2 months ago
This is INSANE. Need this on local.
@Jarods_Journey 2 months ago
Same 😭
@shreyashmore1824 2 months ago
Yes please Local 🥲
@budbin 2 months ago
2:15 that's crazy
@om4741 2 months ago
Can you tell me the prompt you used in this video?
@nanieve4296 2 months ago
DON'T MIND IF I DO WITH ALL OF THOSE VOICE SETS!
@KnutNukem 2 months ago
Pls add a label function like [SPEAKER NAME] to automatically assign lines to a voice. Ideally, all following lines would be assigned to that speaker after marking it. This could be a toggleable option. Also let multiple lines be selected and their voice set via CTRL + CLICK or SHIFT + CLICK to mark a bunch.
@Jarods_Journey 2 months ago
Ooh, this might make for a good alternative tab to set it up for the speaker formats. I'll think about this one
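For anyone curious, here is a minimal sketch of how the [SPEAKER NAME] labels requested above could be parsed. It is not Audiobook Maker's actual code: `assign_speakers`, the `sticky` toggle, and the `Narrator` default are hypothetical names made up for illustration, and the tag syntax is assumed to be a bracketed name at the start of a line.

```python
import re
from typing import Iterable, List, Tuple

# Assumed tag syntax: "[SPEAKER NAME]" at the start of a line,
# optionally followed by the spoken text on the same line.
TAG = re.compile(r"^\[([^\[\]]+)\]\s*(.*)$")


def assign_speakers(
    lines: Iterable[str],
    default: str = "Narrator",  # hypothetical fallback speaker
    sticky: bool = True,        # hypothetical toggle from the request
) -> List[Tuple[str, str]]:
    """Return (speaker, text) pairs for every non-empty line.

    With sticky=True a tag applies to its own line and to all following
    untagged lines. With sticky=False a tag applies only to the text on
    its own line, and untagged lines fall back to the default speaker.
    """
    current = default
    result: List[Tuple[str, str]] = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip blank lines
        match = TAG.match(text)
        if match:
            speaker = match.group(1).strip()
            text = match.group(2).strip()
            if sticky:
                current = speaker  # remember the speaker for later lines
            if not text:
                continue  # the tag stood alone on its line
        else:
            speaker = current
        result.append((speaker, text))
    return result


if __name__ == "__main__":
    sample = ["[Alice] Hello there.", "How are you?", "[Bob]", "Fine, thanks."]
    for speaker, text in assign_speakers(sample):
        print(f"{speaker}: {text}")
```

The `sticky` flag mirrors the "toggleable option" idea: turn it off and only explicitly tagged lines change voice.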
@darkreader01 2 months ago
It would be crazy if we could generate an audiobook or hear a full storybook in that ChatGPT advanced voice with emotions.
@Jarods_Journey 2 months ago
That day will surely come. The only question is when 😅
@Murderface666 2 months ago
The voice made an error speaking Japanese. "Haisai" is Okinawan for "hello."
@Jarods_Journey 2 months ago
I had never heard haisai until that day lol
@phenix5609 2 months ago
First off, whoaaaa!! The voice gen... this gives me really sad feelings that we can't get this kind of voice locally yet. Second, YouTube knows me too well, scary sometimes; I was also thinking about making an audiobook reader. I'm really curious about 2-3 things though, if you are willing to share. Does yours run all locally? If so, what voices do you use, or where did you find them, and what do you use to clone them if you clone some voices? What do you think is better as of now for this type of stuff fully locally? Last thing: I think I saw 24 GB of memory, so you must be running a 3090 or 4090. How much do you think is needed to run this without too much lag or wait time? Can it be done with a 3080 with 10 GB VRAM and 32 GB RAM, or is that too low? I hope you can answer me, thx.
@TonyMezaXD 2 months ago
The British voice was the only bad impression. It sounded like an American attempting to do a British accent.
@sownheard 2 months ago
Wait, isn't the voice an American voice pretending to be British?
@spiker.c6058 2 months ago
Yeah, the English voices are American unless they add a proper UK English language setting with proper UK English voices. But why would they do that just to get another English accent?
@TonyMezaXD 2 months ago
@@sownheard If that’s the case then it’s spot on. I just figured since it could switch to other languages easily it should be able to switch English dialects as well.
@TonyMezaXD 2 months ago
@@spiker.c6058 Oh in that case it was spot on.
@adolphgracius9996 2 months ago
Can it say Yaaamite?
@macmcleod1188 2 months ago
6:00 ... mind blown.
@basspig 2 months ago
Wow, the Japanese-language ChatGPT voice is pretty darn convincing. Although it does sound like a Caucasian person speaking Japanese rather than a Japanese person speaking Japanese. I'll be really impressed when it can do the voice of my favorite anime character.
@나익명 2 months ago
Yeah the Australian accent also still sounds like an American person
@radianthole 2 months ago
My man got that Nihongo jozu
@Akurallia 2 months ago
DUUUUDE, this is ABSURDLY incredible! Simply magnificent 🤩
@adamrastrand9409 2 months ago
When will it be available in Sweden, in the EU? Would it be like next week, October 5, or would it be like in six months or so?
@Jarods_Journey 2 months ago
I'm not too certain, you might wanna keep up with OpenAI to see when they announce it for non-US countries.
@NFawc 2 months ago
Looks like after running the generation, the voice settings for each line are lost? I.e., after you ran the generation, the line colours (speakers) were all lost (they all changed to grey)? PS: VERY interested in the audiobook maker, especially with a good TTS generator.
@Jarods_Journey 2 months ago
Just for the generation, to show they were complete. If you load the audiobook again, it'll restore the colors and associated speaker
@NFawc 2 months ago
@@Jarods_Journey Understood. But after a generation, if you then want to regen just a sentence or two again, wouldn't it be good to have colour/settings as before the (full) generation?
@dthSinthoras 2 months ago
What is really missing in Audiobook Maker, that would make it usable, would be the possibility to use other languages.
@matty.j_1997 2 months ago
Great demo! How were you able to screen-record the Advanced Voice?
@Jarods_Journey 2 months ago
Using just my phone's native recorder, then I just synced it up in editing
@matty.j_1997 2 months ago
@@Jarods_Journey Yeah, but it sounds like the voice was also recorded "officially"
@Jarods_Journey 2 months ago
I guess Samsung is just that good 😅. Just the Samsung screen recorder with media sound enabled
@Snafuuu 2 months ago
"I can't imagine the data needed for this" I can, it's the whole damn internet 💀
@dadadies 2 months ago
What's your audiobook maker? Is it some sort of audio dialog maker with different characters all in one interface? I wonder if it can be adapted for interactive dialog, such as in a game. Especially if someone attaches an LLM to it that can generate dialog based on their interactions. You'd have an even more special system (with other people contributing in those other areas).
@Jarods_Journey 2 months ago
It's a tool that you can load up a text file and use tortoise TTS/styletts to generate audiobooks with. Currently adding features to it like different speakers for sentences, etc!
@Barrel_Of_Lube 2 months ago
it pretty much got all the languages (claimed by an openai dev) thats fking crazy
@tylerboy19yp 2 months ago
hey jarod what is the best current voice cloning fine tune tts model right now i can run locally?
@Jarods_Journey 2 months ago
Gonna be either xtts/tortoise or styletts, and then I'm trying out parlertts, so we'll see how this comes along to see if I can recommend it
@tylerboy19yp 2 months ago
@@Jarods_Journey tried xtts base model i can't seem to get the dependencies to work with the fine tune model, have you heard doppleAI's voice models? they sound really good with only 1-3 minutes of audio
@Mika43344 2 months ago
Did you hear about eleven labs reader?
@iseahosbourne9064 2 months ago
Hey jarod, whats the best ai voice cloning tool as of today? RVC, Xtts, tortoise tts etc?
@Jarods_Journey 2 months ago
Local? Pipe xtts/tortoise into RVC, and it's still very solid. Via corpo? Elevenlabs for sure.
@iseahosbourne9064 2 months ago
@@Jarods_Journey Thanks for the info jarod!
@iseahosbourne9064 1 month ago
@@Jarods_Journey Was just thinking, what is the best TTS to train: an XTTS fine-tune or Tortoise? I'm having trouble with XTTS generating consistent speech. Not sure if that's because I trained a model with 21 min of data though.
@nomadv7860 2 months ago
Just wanted to point out it’s not text-to-speech, it’s able to hear the actual tone and emotion of your voice
@mal-avcisi9783 2 months ago
This is insane
@bananalord9288 2 months ago
goddammnit I lost it during the japanese demonstration XD
@cookiefrnamikaze1674 2 months ago
man you are such a masterpiece for asking the Ara ara
@spiker.c6058 2 months ago
Ara ... ara .... my God !!!!
@WistrelChianti 2 months ago
What a time to be alive!
@BoomBillion 2 months ago
😂 Try an American tourist voice speaking Spanish.
@나익명 2 months ago
Same with Australian accent haha
@Crazy_Truth 2 months ago
Plus plan 😢
@Jarods_Journey 2 months ago
Unfortunately 😅
@xaiyeon_xiuzhen 2 months ago
OMG idc if im cooked that was awesome :D
@Англичанин_Р 2 months ago
As long as you are not a native-level speaker, the robot will sound like a human to you. 😅 How long can you chat with a robot without getting annoyed?
@ariverosmg 2 months ago
You don't need to label things, it "understands" what you mean; it learned to "reason" slightly, so no human is needed in that loop.
@Jarods_Journey 2 months ago
I firmly believe a little bit of labeling is still needed to get it to understand all the pretraining data. I just think in this case they have trained classifiers that can accurately label specific types of data. No human in the loop here, just classification models and lots of compute 😅
@Airbender131090 2 months ago
This is ridiculous 😮 Imagine this in video game NPCs 😮 That's crazy
@justindressler5992 2 months ago
Is the Japanese accurate and understandable? I would love to learn Japanese through chat.
@TrC450 2 months ago
Yes, at least what was shown here.
@Jarods_Journey 2 months ago
Understandable, absolutely. It is more than accurate enough I'd say, but I'm not at a level high enough to determine whether or not it was using natural expressions. Though, the anime impersonations were pretty good imo
@baltakatei 2 months ago
We're so cooked.
@saint115io 2 months ago
Wait, are you japanese? 😮😮
@Jarods_Journey 2 months ago
A small bit :)
@dadadies 2 months ago
Her dramatic movie trailer narrator voice skill sucks. It sounds like someone's mom failing to do the task. Maybe if you asked her to make it sound professional i wonder if she would have sounded better.
@Jarods_Journey 2 months ago
It may have been influenced by earlier portions of chat, but I'd agree it wasn't the best as shown here!
@Draggtar 2 months ago
The answer is: GPT chat creators are otakus
@hypersonicmonkeybrains3418 2 months ago
I tried it and the audio quality is atrocious: bitcrushed, compressed, choppy and analog sounding. Not at all impressed. I would never consider paying a subscription for an AI voice whose main feature is that it can put on accents. I then tried Gemini voice and it's crystal clear... Inflection PI AI, crystal clear... what gives?
@FrankHouston-v5e 2 months ago
This is much better than OpenAI's faked voice demo 🧐.