0:58 I went to test it out, and I'm amazed that it understood when I mocked a British accent and said: "can i have a bo'e o' wo'er?" It had some trouble with other words like that, but it's still amazing to see it understood most of them
@NikhilS-1 2 years ago
😂😂boho water
@Str1der1 2 years ago
@@NikhilS-1 bottle of water I believe
@lunachocken 2 years ago
@@Str1der1 NEVER. The Rich tea Biscuits and Tea country requires the "boho water". We SHALL NEVER say the t's Mwah' ha ha, *sips tea* MWAHAHAH.
@xxxsaraHelloxxx 2 years ago
Friend of mine used to call the English "tea bags". AI art looks soulless. Prophet Elon, goodbye world
@christianfoley7441 2 years ago
Whisper shows a really neat emerging rule of thumb in deep learning: if you want to train a model to do a task, pick a training task that is harder than the task you want it to do. In other words, force your model to go beyond what you expect of it. Another great example of this can be seen in state-of-the-art cell segmentation models like CellPose, which tries to predict a depth map of the cells when the central goal is just finding their boundaries. In a way it forces the model to learn a more abstract, heuristic understanding of the first, easier task, which helps prevent overfitting. I like to think it is in the same conceptual vein (although on a far greater scale) as enforcing dropout, where we randomly remove nodes so that models don't learn some convoluted inter-layer correction pattern, but instead a general, more abstract translation mapping.
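For anyone who wants to see what that looks like in practice, here is a minimal PyTorch sketch of a shared encoder with an auxiliary head; the layer sizes, head names, and 0.5 loss weight are all made up for illustration, not taken from CellPose or Whisper:

```python
import torch
import torch.nn as nn

class SegWithAuxDepth(nn.Module):
    """Shared encoder with a main boundary head and a harder auxiliary depth head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.boundary_head = nn.Conv2d(32, 1, 1)  # the task we actually care about
        self.depth_head = nn.Conv2d(32, 1, 1)     # auxiliary task that regularizes the encoder

    def forward(self, x):
        feats = self.encoder(x)
        return self.boundary_head(feats), self.depth_head(feats)

model = SegWithAuxDepth()
x = torch.randn(4, 1, 64, 64)           # toy batch
boundary_gt = torch.rand(4, 1, 64, 64)  # toy targets
depth_gt = torch.rand(4, 1, 64, 64)

pred_b, pred_d = model(x)
# The auxiliary loss pushes the shared encoder toward more general features;
# the 0.5 weight is an arbitrary choice.
loss = (nn.functional.mse_loss(pred_b, boundary_gt)
        + 0.5 * nn.functional.mse_loss(pred_d, depth_gt))
loss.backward()
```

At inference time the depth head can simply be discarded, as several commenters below describe doing with their own auxiliary heads.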
@cmilkau 2 years ago
Funny that I'm most impressed by the failure cases. Even when it fails, it fails really well; the "wrong" guesses are still extremely good guesses.
@carecavoador 2 years ago
The wrong guesses are much better than most of my ESL students would get.
@RavenMobile 2 years ago
And it put the mistakes in red, the AI knew it was likely getting those parts wrong! I'm very impressed.
@andrearruda9005 1 year ago
@@RavenMobile It probably has some kind of per-word confidence score and uses that to pick the color
@cmilkau 2 years ago
YouTube can you please implement this :D Live transcription would be SO good, but even just the better results would be such a quality-of-life improvement.
@Rhyzhang 2 years ago
I think the model is too big for instant live transcription; however, this will be fantastic for captioning
@galbiband 2 years ago
isn't it already implemented?
@jbexta 2 years ago
@@galbiband They have generated captions, but not on all videos. I think some languages aren't detected, and I've found even some English videos don't have them
@cmilkau 2 years ago
@@Rhyzhang wdym too big?
@meihauf 2 years ago
@@cmilkau YouTube already takes an insane amount of computing power to store, retrieve, and distribute the enormous volume of video uploaded to it every minute. Running a transcriber over every video would increase their overhead to unreasonable levels.
@Veptis 2 years ago
I was at a natural language processing conference last month, and one of the talks presented a paper that compared the most common issues of speech recognition systems across German, English and Dutch. They noticed quite a few trends. The main issue is that the training data for these models is very clear, slowly spoken sentences, not how spoken language sounds in the real world. (Even YouTube audio is clearer than the real world.) Especially in conversations, the small pause fillers from the other parties, stuff like "hmm, uhm, mmh, ja...", were missed most often, with different particles between languages. These systems also have to handle errors and corrections, which is rather difficult, especially if you want downstream tasks that start processing before input is finished (for example the retico package for Python). It would be interesting to see how much better this model does on their metrics. Reference: Evaluation of Automatic Speech Recognition for Conversational Speech in Dutch, English and German: What Goes Missing? (Lopez et al., KONVENS 2022)
@rickevans7941 2 years ago
TY for SOURCES AT BOTTOM BLESS YOU!!
@Veptis 2 years ago
@@rickevans7941 been trying to back up my claims recently, hope you found some more reading material
@Gogolian 2 years ago
We need more data, like, maybe recorded in the background by cell phones of people's everyday lives... wait, no...
@00SEVEN28 2 years ago
Toss in some Cockney, and it’ll get bolloxed.
@benjigeezer 2 years ago
do you spit or do you swallow?
@owenspottiswoode5936 2 years ago
I believe there's quite a lot of literature in the field of linguistics that says that humans who learn multiple languages concurrently/in quick succession perform better in each individually than those who study just one, so it doesn't surprise me that much to find the same holds true for machines. Intuitively, I think it makes sense: rather than pattern matching for the grammar and traits of one particular language, you're abstracting the problem to understand the relationship between the word and the underlying concept (signifier and signified, in the parlance of linguistics).
@zerglin9000 2 years ago
Even more likely, in the real world all languages are (loosely) connected. In today's world, the prevalence of foreign loan words is near guaranteed, so having some words that 'don't generally fit the language' is almost expected. Plus, if you consider accents, which are essentially residuals from a foreign language, being able to process both languages would help trace the original sound back to the intended sound.
@galen__ 2 years ago
Reminds me of how video codecs became better at handling film grain vs noise. I believe at least one can now essentially filter film grain before the main transform, retaining a map of the original grain pattern in the encoded data to be restored when decoded by the player.
@lorenzoiotti 2 years ago
Being Italian, I immediately tested it on a video of a local YouTuber with a strong Sardinian accent. I'm impressed by the accuracy even with accents verging on dialect; GPT-4 will be very interesting ;)
@andresshamis4348 2 years ago
I think GPT-4 will be indistinguishable from a live human
@qwerasdfhjkio 2 years ago
As an Italian, that's actually impressive, even I can't understand them sometimes 😂
@McDonaldsCalifornia 2 years ago
Whoa, I didn't even think about that. All of YouTube potentially as a text data set!
@rielaxault 2 years ago
@@andresshamis4348 Doubtful.
@MikeClarkeARVR 2 years ago
How about an Italian speaking Ithkuil? We may all have to learn Ithkuil soon! ;)
@lukask969 2 years ago
As a Ph.D. student, I already used Whisper - one day after the release - for the interviews I had recorded, and I can say it works incredibly well compared to other cloud speech-to-text services (AWS/GCP/Azure tested). It does not have speaker diarization, but it works unbelievably well at ignoring pauses, uhm, hmm, ahms, and background noises. Each sentence is clearly recognized, and you can do the speaker separation very well sentence by sentence. My 20 hrs of interviews - which would be 2-3 weeks of transcription work - were done with my GPU in 5 hrs. Another day for annotating the speakers - easy. Thank god they decided to release this to the public, and you can run it locally, totally GDPR compliant! Used this for German - with very little to no flaws found - and it is even better for English transcription. + It is so easy to use.
@TheRainHarvester 2 years ago
Is a GPU necessary? Or can CPU-only be used?
@lukask969 2 years ago
@@TheRainHarvester As far as I know you can use the device parameter to choose CUDA or CPU. But CPU could be a factor of 20 slower; even a low-end Nvidia graphics card should be faster than any CPU. I haven't tried it though, so idk how well it works. I used the large model for my interviews.
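For reference, a minimal sketch of the Python API being discussed; `load_model` and `transcribe` are the package's documented entry points, while the model size, device choice, and file name here are just example values:

```python
import whisper

# device="cuda" runs on the GPU; use device="cpu" as a (much slower) fallback
model = whisper.load_model("large", device="cuda")

# language is optional; Whisper can also auto-detect it
result = model.transcribe("interview_01.mp3", language="de")
print(result["text"])
```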
@gatenetsetgo....7012 2 years ago
Hey bro, in which area are you doing your PhD? Because I'm thinking of doing one after a long gap in my education, but I need help, bro. My subject is CSE but I have very little knowledge in that area, so where do I start? I'm not getting any help because all my education connections were lost and I want to start over.
@jakcrimson1448 2 years ago
Yes, omg, thank you, I've been looking for explanations on this new model. Great video as always!
@toastrecon 2 years ago
I have a bunch of MP3s that I pulled from audio cassette recordings of family members who have been gone a long time. I've been meaning to transcribe them; it'd be interesting to see how this worked with those.
@RavenMobile 2 years ago
Try it out and let us know -- I am also curious.
@chrisoman87 2 years ago
Even if it gets 95% correct, that's a significant reduction in manual work: just read along and correct a mistake here or there. Great idea.
@sieyk 2 years ago
I just tested this translating some random anime raw and it did a fantastic job, and automatically generated the SRT file with timestamps. Dodgy translations are a thing of the past!
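If anyone wants to try the same thing, a minimal sketch through the Python API (the file name is a placeholder; the CLI equivalent would be something like `whisper episode01.mka --task translate`):

```python
import whisper

model = whisper.load_model("medium")
# task="translate" makes Whisper emit English text regardless of the source language
result = model.transcribe("episode01.mka", task="translate")

# each segment carries start/end times, which is what the SRT output is built from
for seg in result["segments"]:
    print(f'{seg["start"]:7.2f} --> {seg["end"]:7.2f} {seg["text"]}')
```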
@OfficialSombrero 2 years ago
very nice sir
@MikeClarkeARVR 2 years ago
Unless you speak Ithkuil! ;)
@carlosdesantiago1356 1 year ago
Some notes:
[00:00] OpenAI's new Transformer model 'Whisper' is an automatic speech recognition model
- Whisper is fully open sourced for inference and can be downloaded and used
- The model has varying performance and accuracy based on the size
- AI models may overfit to gold standard datasets and underperform in real-world scenarios
- Models trained on highly curated datasets may outperform humans on classification tasks, but struggle with out-of-distribution samples
- Humans have a more generalized approach to problem-solving, which allows them to perform better in real-world scenarios
- Increasing model size may not significantly improve speech recognition performance
- Overfitting may occur with too large of a model
- Data set size is likely to have a greater impact on performance
[03:18] Whisper model performs well on imperfect audio data
- Model tested on various audio qualities and sizes
- Whisper model trained on weakly supervised data with background noise
[06:45] Fine-tuning speech-to-text models can lead to overfitting
- Mixing new data with original data can help with overfitting
- Similar strategy can be applied to image and audio models
[09:54] Using imperfect data can improve model performance
- Training models on imperfect data can be followed by fine-tuning on gold standard data
- Mixing tasks and data to support them can lead to better performance
[13:18] AI-generated content detection is becoming crucial
- As AI-generated content becomes more prevalent, models that can detect whether content was generated by AI will become more important
- Mixing different tasks and task tags can add to the robustness and generalization of GPT-style models
[22:21] Models trained on narrow tasks perform better than those trained on multiple tasks and languages, except for joint models that perform both transcriptions and translations
- Small models benefit from narrow tasks and training on English transcriptions only
- Joint models outperform English-only models in larger experiments, indicating a trend towards more powerful and generalized AI
@MenkoDany 2 years ago
Finally, I can understand the lyrics of death metal recordings!
@sentdex 2 years ago
Oh man, these would be good test cases! Haha
@dmarsub 2 years ago
23:20 Wow, that's fascinating. I wouldn't have expected the general model at this stage to be better than the language-specific one.
@alexandrepv 2 years ago
OpenAI just destroyed a lot of AI startups that are specialised in speech transcription. Well done :)
@throwawayidiot6451 2 years ago
Not just AI startups, a bunch of traditional firms specialized in transcriptions made by humans will also go belly up
@sunrealclothing 2 years ago
Technology will hopefully create new jobs as well.
@SamGarcia 2 years ago
not really, you just integrate it into your AI model.
@alexandrepv 2 years ago
@@sunrealclothing I wasn't being sarcastic, I really want them to go belly up :)
@sunrealclothing 2 years ago
@@alexandrepv seems AI is going to be advancing forward whether we like it or not. And it seems a bit sketchy and unregulated at the moment.
@FinanceLogic 2 years ago
Good job being so knowledgeable about the cutting edge. This is the best, fastest, most precise speech-to-text I've tested, with the best background filtering; it even catches actual whispers. It's amazing. Before anyone says that's obvious: some people have not tried it, of course. Most people. I'm excited to begin watching this video now
@HenryLoenwind 2 years ago
At around 7:00, you're mixing up "clean recordings" with "transcribed recordings". The "gold" input data they are talking about isn't gold because it has good audio quality; it's gold because there's a known-good text transcription of it. The unsupervised learning was about feeding the model plenty of audio without transcriptions, so the model could first learn how human speech sounds without knowing anything about its meaning. Only later was the "gold" data used to teach it how to convert whatever it understood into text. So this model first converts the audio it hears into some kind of "mental picture", and then in a second step converts that "mental picture" into text.
@PoschiUnavailable 2 years ago
It makes a ton of sense to have a unified language like English that is used for further internal text processing. For example: talk in any language to your AI home assistant; internally the spoken words are detected in the language they were spoken in, then translated into English, and the English words are processed to figure out what the user wants the assistant to do. Now hear me out: what if the internal language of that process was not English, but we trained a model on a lot of data in many languages to figure out the best possible unified language, one that carries meaning, intention etc. in the best way possible? This would amount to many words or phrases in any language mapping onto relatively few "meanings", if you think of it in a mind-map way.
@solenoidnull9542 2 years ago
I've always wondered what this optimal 'unified' language would look and sound like, but it certainly exists
@KiraSlith 1 year ago
The jump from Base to Small in this test case is also pretty great. Small seems perfectly serviceable for, say, watching translated streams and live-translating with minimal errors and time per token.
@acrawford01 2 years ago
The part about the quality of the models: AlphaZero, the chess AI, learned by playing billions of games against itself. It wasn't fed any opening data or game databases, and it has amazing results. I know that is very different from this case, but I think training with non-perfect data can be beneficial.
@pile_of_kyle 2 years ago
Isn't that an entirely different type of model, though? AlphaZero used reinforcement learning, whereas this model uses "weakly supervised" learning.
@syrus3k 2 years ago
It needs to know if it's right in order to be able to train: easy with a game, hard with something without hard rules.
@MAlanThomasII 2 years ago
AlphaZero and similar play themselves because there are no "perfect" data sets, just some examples of whatever the current human best is. To become more perfect than the best available examples, it needs to do more than just study the best available examples. That is why it uses a completely different form of learning.
@millionare5446 2 years ago
I think it makes sense for English performance to go up if you train the model on non-English audio. It will be less likely to transcribe a word incorrectly when it knows which words do not exist in the English language. For example, it will not confuse "is" and "es" because it knows the difference between English and Spanish
@pile_of_kyle 2 years ago
I'm confused. If you only trained the model on english audio, how would the model ever know of the existence of "es?" I am still struggling to understand how the generalized model performs better than the English-specialized model, and I think Sentdex is equally perplexed.
@millionare5446 2 years ago
@@pile_of_kyle I'm not sure; having imperfect training data and byte-pair tokenization could probably make the model learn non-English words even if you tried to only train it on English data. This also makes me think of unsupervised training: having multiple contrasting sets of data will make the model very good at performing tasks for the target use case
@alexlong9424 2 years ago
Cool video! I'm still watching but one note: weakly supervised doesn't mean that the training data quality is bad; rather, it means that the labels on the training data aren't necessarily good. It sounds like in this case they labeled their training data (i.e. they transcribed input audio) through less labor-intensive means, like maybe having a less effective model produce transcriptions, as opposed to having grad students produce high-quality transcriptions.
@arjunharikumar7176 2 years ago
This happened to me when training YOLOv5 to identify cracks: the model accuracy improved so much once we trained it to identify human faces as well. I think before adding human faces, my model had probably just learned that if the pixels go from white to black instantly, it's a crack, so it passed the validation set with stunning accuracy without really understanding what a crack is.
@arigato1901 2 years ago
Wow! OpenAI is actually delivering something open?! 😳😁
2 years ago
Having tasks that are considered irrelevant could indeed help the generalization of the embeddings. People often forget that when we use a pre-trained embedding space, it usually corresponds to a certain task. Multi-task training is a way to generalize the embeddings.
@VaibhavShewale 2 years ago
I used this model, and it's just incredible. I tried different types of audio with different noise and it worked every time!
@y.shrestha6936 2 years ago
I have seen this effect in image processing networks too. I was training a cervical cancer image classification network when I had the idea to train a UNet segmentation network and add an additional classification head off the encoder. The result was better performance in the classification problem even though I throw away the segmentation head at inference time.
Oh wow, I haven't seen a video of yours in a couple of years. What a big difference in quality. Awesome job!
@benjaminlynch9958 2 years ago
One addition that would be awesome is the ability to output the text with timestamps for closed-caption generation. But overall this is awesome. Can't wait for the inevitable improvements to personal assistants - Siri, Google, Alexa, etc.
@KoolenDasheppi 2 years ago
It can already output VTT and SRT with timestamps iirc
@pile_of_kyle 2 years ago
I'm not sure what technology YouTube uses, but that is already possible when uploading a video. There's an option to auto-generate timestamps where you can paste in raw text and get a perfectly timestamped output.
@tobiasjennerjahn8659 2 years ago
There are sooo many videos on YouTube that have proper subtitles (not auto-generated ones). These should be super high quality, because whoever uploaded the video has an incentive to make their own subtitles as good as possible. I'm not sure how easy it would be to find and scrape that, but at least for Google this data is readily available. That's hundreds of thousands of hours of properly transcribed audio with varying degrees of audio quality.
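For anyone outside Google wanting a small version of such a dataset, creator-uploaded subtitles can be fetched with yt-dlp; a rough sketch (the URL is a placeholder, and yt-dlp must be installed and on PATH):

```python
import subprocess

# Download only creator-provided English subtitles, skipping the video itself.
subprocess.run([
    "yt-dlp",
    "--write-subs",        # creator-uploaded subs (auto-generated ones need --write-auto-subs)
    "--sub-langs", "en",
    "--skip-download",
    "https://www.youtube.com/watch?v=VIDEO_ID",
], check=True)
```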
@anguswett 2 years ago
Good idea!
@countofst.germain6417 2 years ago
One would assume that they probably used it. I can't imagine where else they would be getting 680,000 hours from. Maybe movies I guess but I'm not sure if they would be allowed to use them.
@tobiasjennerjahn8659 2 years ago
@@countofst.germain6417 Oh, I must've misheard. I thought it was much less. You're probably right then.
@peter9477 2 years ago
@@countofst.germain6417 Interesting question of law. As an AI is not (yet) a "person", feeding the data into it cannot be considered a "performance", and certainly not a public performance. Would love to see this tested in court some day.
@medhurstt 2 years ago
I think effective language translation happens when the model recognises the word in and of itself, but this is only part of the story. It also recognises it in the context of the other words preceding it and, importantly, in the context of the idea being portrayed. Consequently I think training the model with multiple languages helps strengthen the contextual associations (i.e. modelled ideas) without overfitting due to limited English words. So in general, I would think that training models with complementary tasks would benefit the model, but training with completely different tasks would have much less benefit. For example, training the speech-to-text model with speech-to-text data plus Egyptian history data might benefit the speech recognition for Egyptian-history-related transcriptions, but for business-related transcriptions, not so much. My 2c
@tiredko-hi- 2 years ago
Currently learning a third language and can confirm that when you're learning more languages, especially when they are very different, translating between them forces you to get a better understanding of the ideas behind the words you have in all of the languages.
@Boringpenguin 2 years ago
They finally live up to their name🤣 Who could have known
@nilfux 2 years ago
When it becomes sentient you'll know because it'll HATE being called artificial.
@microcolonel 2 years ago
Even more than this, the large model is trained to transcribe spoken French, Japanese, and other languages to English text, and it works remarkably well. I spoke some pretty complex Japanese sentences and got very good English translations out the other end. The large model is *tiny* for its performance; it recognizes a massive vocabulary in Japanese (which I tested the most). One thing that it doesn't do, that would be interesting, is multilingual inference. Currently, if you speak two or more languages in one sample, it will break down.
@stanislasbasket 2 years ago
Very interesting to try turning on the auto-generated subs for this video at 2:50; you can see that Google's algo understood half of it.
@stanislasbasket 2 years ago
Actually, 2:40 is the start
@cmeerdo 2 years ago
I am sitting in a room different from the one you are in now. I am recording the sound of my speaking voice and I am going to play it back into the room again and again until the resonant frequencies of the room reinforce themselves so that any semblance of my speech, with perhaps the exception of rhythm, is destroyed. What you will hear, then, are the natural resonant frequencies of the room articulated by speech. I regard this activity not so much as a demonstration of a physical fact, but more as a way to smooth out any irregularities my speech might have.
@ScottJWaldron 2 years ago
This looks interesting! I'd like something that separates people in a conversation for the transcript. Haven't looked around to see if there is anything currently available with that type of model. Whatever model TikTok is using seems like the best. YouTube's has gotten better, but I tend to correct more when I'm going through auto captions for my videos.
@bernardofrassy 2 years ago
Hey man, your content is awesome, keep it up!
@sentdex 2 years ago
Thanks!
@mytechnotalent 2 years ago
Just incredible how well this model works and how simple it is!
@mikeyjohnson5888 2 years ago
Automatic subtitles for media are something I've wanted for a long time
@cmilkau 2 years ago
Instead of a "clean audio" tag (which would probably work), another idea is to do style transfer. Give it a text, a voice recording, and a target language, and make it translate the text into the target language while applying the voice and style of the speaker. This style might also include how much background noise you want (is the speaker in a public room or a quiet studio) or whether you want deliberate artefacts (is the speaker on a megaphone, or an aged analog recording, or other degradations).
@pilcaroo 2 years ago
Yes. I imagine in the not so far future you could choose what language you want a video to be in, and an algorithm would make the translation, keep the voice and intonation of the speaker, and even correct the lips in the video so they move according to the new audio. And another way to approach getting clean sound from an AI trained on dirty recordings might be to add another layer on top, trained on cleaning dirty recordings. On the one hand, it sounds like making the process longer and more cumbersome, but on the other hand it means you can use a very large data set of audio, and not only professionally recorded audio, which might also be narrow and contain only certain accents etc.
@jamesnewton-thomas5902 2 years ago
The improvement seen from multiple-language training for single-language transcription may be related to the mechanics of human vocalization, which are common across languages, or even to onomatopoeia
@graestarr 2 years ago
bang
@TheGeneticHouse 2 years ago
Descript has perfected the art of cloning you so you are now a TTS! That's not exactly what the program is intended or marketed for; it's more for audio/video editing via text editing after the audio or video is made. Then you can "overdub", which is what the voice you create is called: make an overdub and it replaces that text in the audio and video. But I just use the TTS, it's amazing. It's a secret though lol
@Qstandsforred 2 years ago
I think that even for a speech generator you may want to feed it dirty data as well. No reason not to. It will extend its capabilities. Seems plausible that it would also enhance the quality of clean outputs, or at the very least add more possible variations for clean outputs (such as the type of microphone used).
@Octamed 2 years ago
Use movies with subtitles as a dataset. Run the audio through random filters to simulate far/near/obscured speakers in real settings, add random background noise, etc.
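Something like that augmentation chain is easy to sketch with the audiomentations package; the parameter values below are arbitrary, and the random array just stands in for real film audio:

```python
import numpy as np
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

# Randomly "dirty up" clean audio: noise, speed changes, pitch changes.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.3),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.3),
])

samples = np.random.randn(16000).astype(np.float32)  # stand-in for 1 s of movie audio
noisy = augment(samples=samples, sample_rate=16000)
```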
@CurlyScott89 2 years ago
English has a ton of words borrowed from different languages, so it makes sense that when an AI knows different languages it can make better use of context clues when transcribing
@jammin023 2 years ago
Transcribing different languages and translating between them are quite closely-related tasks, so it's easy to understand how a single model might gain from the commonalities between those tasks by learning all of them concurrently (as long as the model size is not too small). It doesn't necessarily follow that there are similar gains to be had from mixing and matching less closely-related tasks. So for example I think it's unlikely there'd be a benefit to mixing image recognition with speech recognition, for the same reason that we use different specialised areas of the brain for those tasks.
@cmilkau 2 years ago
GPT-3 even generalizes on the tasks. It often can follow natural language instructions (zero-shot task generalization), sometimes with the help of one or few examples in its prompt (few-shot task generalization).
@briceleroy 2 years ago
I miss the old videos, where you'd actually be hands-on putting together an AI-related project
@Lunsterful 2 years ago
It's like 3 lines of code
@crackwitz 2 years ago
This is a news video, not a tutorial video. He gives us varied content. Also: yes, literally a fistful of code to run this thing.
@briceleroy 2 years ago
@@Lunsterful I'm not sure what you meant
@briceleroy 2 years ago
@@crackwitz thanks for the mansplaining
@sentdex 2 years ago
I hear ya. I'm just no longer interested in doing more basics videos. I just keep feeling like "I've already done this" when I try, and at this point I gloss over things I think are obvious, but aren't... etc. The videos here have almost always been based on what I'm actually working on/interested in, and more and more that's become impossible to show step by step in the basics way, just due to overall complexity.
@mikeciul8599 2 years ago
* accidentally creates a new interpretation of Alvin Lucier's "I am sitting in a room" *
@madebyrasa 2 years ago
I am floored. This model worked so well for me.
@jcjensenllc 2 years ago
Would be better to start with some background context: what is Whisper? What is it used for? What are you talking about?
@JamJells 2 years ago
I wanted to make an SRT file for an old movie, but couldn't make out what they were singing in one of its songs. Can I upload the soundtrack and have Whisper translate it?
@Smittel 2 years ago
If this also returns exact timestamps for when things were said, it'll be hella useful, and maybe finally something that can replace, say, YouTube's absolutely dreadful auto captions (or allow creators to quickly and easily add their own captions)
@MyHowHowHow 2 years ago
There is a port of Whisper that has an experimental feature for per-word timestamps. It is called whisper.cpp
@tmattoneill 2 years ago
Woah! It's the Python guy talking about AI Art. How cool is that. Long time, dude!
@outlander234 2 years ago
This is what I am most excited about in the whole AI field... This will eventually lead to a Universal Translator like the one in Star Trek :D
@ericvosselmans5657 2 years ago
Exactly! That translator went from pure sci-fi to being just over the horizon, at least for the 100 or so languages these things are trained on.
@outlander234 2 years ago
@@ericvosselmans5657 It's crazy! I remember thinking when watching Star Trek, how tf does it do that... Yet here we are... Basically anything we can do, AI will do, and at a ridiculously faster rate.
@ericvosselmans5657 2 years ago
@@outlander234 I remember having exactly the same thought!
@The-Dom 2 years ago
Great job of testing and presenting, sir.
@Crayphor 2 years ago
I wonder, what if we used a tag to use Whisper as a discriminator in an audio GAN? That could possibly make a very realistic text-to-speech model if we also used diffusion for the generator.
@NoobHunter65 2 years ago
Weakly supervised data can be built first, and then more accurate data can be added to it to make it better; it will start off as its own baby and then grow over time based on its environment.
@devreactor 2 years ago
Amazing accuracy of the model😍
@danielash1704 2 years ago
489 character with in the program reprogramming it a cursive writing of words to resolution which may fits perfect to realize that the experiencers own closeness to the experiencers
2 years ago
And what about multi-modal training where we also incorporate the "old-fashioned" audiograms and feed these images to the transformer in addition to the speech audio? ;)
@avi7278 2 years ago
Then the AIs that determine if a piece of content was generated by an AI can theoretically be used to train another AI that improves the original AI so that it can no longer be detected - ad infinitum?
@สุนิษาประเสริฐกุล 2 years ago
Thank you for explaining this thoroughly!
@gunterstrubinsky9452 2 years ago
if we order the ebook, will we be able to download the changes/clarifications?
@geifwijfheigvwis 2 years ago
OpenAI is unbelievable; it really has my admiration and the admiration of a lot of people in the field. We definitely need more companies like this moving the boundaries of data science towards a better future. I am definitely mixing this with ChatGPT to get something like Siri XD, just far better
@thesmilegame 2 years ago
Thank you for the upload
@xuepingsong5329 2 years ago
We need this for lectures at uni!!!
@khalidzamzamkz 2 years ago
You asked whether the extra task makes a difference in increasing the performance. My instinct says it definitely does. In this case, the model that translates performs better, because to translate a language you need a better understanding of the "meaning" of the language. This deeper understanding might have helped the model fill in ("predict") words that were not perfectly clear from the audio (especially in the non-gold-standard data). I think one could test this by seeing the effect of using only gold standard data with both of the models. Let's say that using the multi-task model in the paper, with the mixed dataset, resulted in 15% better performance, yet when using only the gold standard data the increase in performance was around 1%. This COULD show that the deeper knowledge was used to predict words instead of actually "hearing" them. Although, that would be difficult to test in this case due to the difference in sizes of the datasets.
@BrandonJacobson 2 years ago
I've been trying to implement this to generate captions for my Python YouTube channel and I can't get the code to work.
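In case it helps, here's a minimal sketch that writes an .srt by hand from `result["segments"]` (file names are placeholders; assumes the openai-whisper package and ffmpeg are installed):

```python
import whisper

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,345."""
    ms = int(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = whisper.load_model("base")
result = model.transcribe("my_video.mp4")

with open("my_video.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")
```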
@thaddeuspellegrini3883 2 years ago
Do you think that a large language translation model (text input, text output, for example) trained on a sufficiently diverse set of human languages (English -> Mandarin, Hindi -> German, French -> Japanese, and every other combination of living languages) would be able to translate a dead language into a living language, since it's a decent assumption that there are commonalities in base structure between that dead language and the living languages of today? Basically what I am wondering is: if you trained a language model to translate N languages into all of the various permutations of those languages, then input some language N+1 as a test case for which the model had not been trained, would the similarities between the N languages and the N+1 language be enough to translate it as well? For example, translating texts written in Egyptian hieroglyphs into English even if we had not trained the model on Egyptian hieroglyphs
@billykotsos4642 1 year ago
This model truly is insane
@mertinan8252 2 years ago
Weakly supervised does not mean lower-quality sound data; it's about whether the data has labels or not. The hard and time-consuming step in data collection is not the quality of the sound, it's the labeling by human annotators.
@garrettjones1161 2 years ago
I’d be interested in having a switch where you can pair this AI with a tool that fixes grammar so it defaults to “sensible” sentences rather than being literal and preserving faults in speech.
@snippletrap 2 years ago
You can absolutely use the same models for text to speech, just like you can unify text to images and images to text. They all live in the same embedding space.
@danielash1704 2 years ago
In 2003 more careful with everything looking at the silenced is a trick question about training the brain and memory of a pump that has long short drivers to realize that the experiencers own closeness to the experiences one from multiple ai in a single asking for a variety of answered which ones difference between the upper and lower levels of A.I Learning is better then to useing just one point of same quests
@janiv3987 2 years ago
Good lighting dude.
@_divya_shakti 1 year ago
Do you have any playlist specifically dedicated to speech deep learning?
@doctorai 2 years ago
How can we do diarization using OpenAI Whisper or any other model, on a mono channel?
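Whisper itself doesn't do diarization, but one common recipe is to pair it with a separate diarization model and match segments by time overlap. A rough sketch assuming pyannote.audio and a Hugging Face access token (the overlap heuristic is my own simplification, and the file name is a placeholder):

```python
import whisper
from pyannote.audio import Pipeline

asr = whisper.load_model("medium")
diar = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                use_auth_token="hf_...")  # placeholder token

result = asr.transcribe("meeting.wav")
diarization = diar("meeting.wav")

# Assign each transcript segment to the speaker whose turn overlaps it most.
for seg in result["segments"]:
    best_speaker, best_overlap = "unknown", 0.0
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
        if overlap > best_overlap:
            best_speaker, best_overlap = speaker, overlap
    print(f'{best_speaker}: {seg["text"].strip()}')
```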
@akauppi2 2 years ago
Interesting, the part at 12:10 where you describe how doing multiple tasks may be beneficial. I think that's what our body (nervous and muscle system) does.
@N3Cr0Ph0b1A 2 years ago
Me translating a 9 minute Kurzgesagt video in 30 seconds: "Here in the Kurzgesagt Labs we only work on the most important scientific problems like what if we nuke stuff? Or how about we make this elephant explode? Or who could forget, look at this thing, it's really big. Continuing this proud tradition, let's explore the scientific mystery of what would happen to you if Earth suddenly turned into gold. The Midaspocalypse, based on the ancient tale of King Midas who was cursed so everything he touched turned into gold. Before we can explore this scenario with science, we'll first define the premise. Midas' curse is a very special phenomenon called magic, which allows us to modify physics. So what happens when Midas touches something and it turns to gold? An atom of gold has 79 protons and 118 neutrons in its nucleus......" How the hell did this model come up with "Midaspocalypse"?!?! What a time to be alive!
@nickvanamburg 2 years ago
my man accidentally recreated "I am sitting in a room"
@TheApoorvagni 2 years ago
The moment you realize you need to like the video: 6:06 "...with enough time and grad students, this could eventually be hundreds of thousands of hours"
@Ez-se2dl 2 years ago
"There's a lot of fat around the lines in this important area. So I'm going to have a talk about my hair conditioning." - The Whisper
@RoyAAD 6 months ago
Did you color the text differences manually? Or is there some library that does that?
@africanelectron751 2 years ago
Finally we are getting to see the old NSA tech
@DanFrederiksen 2 years ago
Can it operate live? Like robot hearing, instead of an audio file.
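Not out of the box (the API works on files or arrays), but you can approximate live use by transcribing short chunks from the microphone. A rough sketch assuming the sounddevice package; the chunk length and model size are arbitrary choices:

```python
import sounddevice as sd
import whisper

model = whisper.load_model("base")
SAMPLE_RATE = 16000   # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 5

while True:
    # Record one chunk from the default microphone, then transcribe it.
    audio = sd.rec(CHUNK_SECONDS * SAMPLE_RATE, samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()
    result = model.transcribe(audio.flatten(), fp16=False)
    print(result["text"].strip())
```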
@wk8219 2 years ago
Great video. Thanks. P.S. Your lipstick game is on point 👍👍😁😁
@crypto_peng 2 years ago
Can you share how you set up your studio? Like the camera and software you use! Thanks
@joelimmanueloelsner9211 2 years ago
OpenAI has an open project - what a concept
@simssim262 2 years ago
From dominating NLP to dominating CV to some extent and now even Audio? Does the transformer architecture have plans to dominate the whole earth?
@jeremyforest8032 2 years ago
Task tags are common in continual learning settings. Interesting that they're moving towards that approach!
@TheCopernicus1 2 years ago
Amazing, my favourite STT model by far!
@bhhbcc4573 2 years ago
The behaviour of the model seems to replicate the observation that multilingual people have an easier time learning new languages and tend to understand their own native tongue better than most.
@chikir9777 2 years ago
Amazing, crack, and the tutorial was perfect. I'm testing it right now
@KennTollens 2 years ago
This would be great for the police interrogation videos, you can barely hear what they say on most. AI is so cool. I can't wait until it reads a book and automatically makes the movie or series.
@redpython99 2 years ago
That sounds kinda dystopian :/
@KennTollens 2 years ago
@@redpython99 I would love a business to be fully automated with AI and just pay people the money it generates. But people are the problem, if you give them money and all the time in the world, they destroy themselves most of the time. Instead of learning new things, going on vacation, improving their health. Instead, they often turn to drugs.
@violent_bebop9687 2 years ago
@@KennTollens that's a very cynical take on human nature. And I completely agree..look at all the homeless addicts right now.
@KennTollens 2 years ago
@@violent_bebop9687 I came to that conclusion after seeing what happens on Indian reservations. They get government money, but so many are drunks and drug addicts.
@mohegyux4072 2 years ago
I'm not an expert by any means, but when I fine-tune or train a model on new data I always mix it with old data. I'd never read that before; it just feels intuitive, so that the model doesn't lose much of its past knowledge. The validation set is also mixed, or I use 2 separate validation sets of old and new data.
@sotonin 2 years ago
Somebody needs to hook this up to language translation and make a service where you feed it audio clips and see translations in your native language.