0:58 I went to test it out, and I'm amazed that it understood when I mocked a British accent and said: "can i have a bo'e o' wo'er?" It had some trouble with other words like that, but it's still amazing to see it understood most of them
@NikhilS-1 2 years ago
😂😂boho water
@Str1der1 2 years ago
@@NikhilS-1 bottle of water I believe
@lunachocken 2 years ago
@@Str1der1 NEVER. The Rich tea Biscuits and Tea country requires the "boho water". We SHALL NEVER say the t's Mwah' ha ha, *sips tea* MWAHAHAH.
@xxxsaraHelloxxx 2 years ago
Friend of mine used to call the English "tea bags". AI art looks soulless. Prophet Elon, goodbye world
@christianfoley7441 2 years ago
Whisper shows a really neat emerging rule of thumb in deep learning: if you want to train a model to do a task, pick a training task that is harder than the task you want it to do. In other words, force your model to go beyond what you expect of it. Another great example of this can be seen in state-of-the-art cell segmentation models like CellPose, which tries to predict a depth map of the cells when the central goal is just finding their boundaries. In a way it forces the model to learn a more abstract, heuristic understanding of the first, easier task, which helps prevent overfitting. I like to think it is in the same conceptual vein (although on a far greater scale) as enforcing dropout, where we randomly remove nodes so that models don't learn some convoluted inter-layer correction pattern, but instead a general, more abstract translation mapping.
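For anyone who wants to see what that looks like in practice, here is a minimal PyTorch sketch of a shared encoder with an auxiliary head; the layer sizes, head names, and 0.5 loss weight are all made up for illustration, not taken from CellPose or Whisper:

```python
import torch
import torch.nn as nn

class SegWithAuxDepth(nn.Module):
    """Shared encoder with a main boundary head and a harder auxiliary depth head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.boundary_head = nn.Conv2d(32, 1, 1)  # the task we actually care about
        self.depth_head = nn.Conv2d(32, 1, 1)     # auxiliary task that regularizes the encoder

    def forward(self, x):
        feats = self.encoder(x)
        return self.boundary_head(feats), self.depth_head(feats)

model = SegWithAuxDepth()
x = torch.randn(4, 1, 64, 64)           # toy batch
boundary_gt = torch.rand(4, 1, 64, 64)  # toy targets
depth_gt = torch.rand(4, 1, 64, 64)

pred_b, pred_d = model(x)
# The auxiliary loss pushes the shared encoder toward more general features;
# the 0.5 weight is an arbitrary choice.
loss = (nn.functional.mse_loss(pred_b, boundary_gt)
        + 0.5 * nn.functional.mse_loss(pred_d, depth_gt))
loss.backward()
```

At inference time the depth head can simply be discarded, as several commenters below describe doing with their own auxiliary heads.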
@cmilkau 2 years ago
Funny that I'm most impressed by the failure cases. Even when it fails, it fails really well; the "wrong" guesses are still extremely good guesses.
@carecavoador 2 years ago
The wrong guesses are much better than most of my ESL students would get.
@RavenMobile 2 years ago
And it put the mistakes in red, the AI knew it was likely getting those parts wrong! I'm very impressed.
@andrearruda9005 1 year ago
@@RavenMobile It probably has some kind of per-word confidence score and uses that to pick the color
@cmilkau 2 years ago
YouTube can you please implement this :D Live transcription would be SO good, but even just the better results would be such a quality-of-life improvement.
@Rhyzhang 2 years ago
I think the model is too big for instant live transcription; however, this will be fantastic for captioning
@galbiband 2 years ago
isn't it already implemented?
@jbexta 2 years ago
@@galbiband They have generated captions, but not on all videos. I think some languages aren't detected, and I've found even some English videos don't have them
@cmilkau 2 years ago
@@Rhyzhang wdym too big?
@meihauf 2 years ago
@@cmilkau YouTube already takes an insane amount of computing power to store, retrieve, and distribute the enormous volume of video uploaded to it every minute. Running a transcriber over every video would increase their overhead to unreasonable levels.
@Veptis 2 years ago
I was at a natural language processing conference last month, and one of the talks presented a paper that compared the most common issues of speech recognition systems across German, English and Dutch. They noticed quite a few trends. The main issue is that the training data for these models is very clear, slowly spoken sentences, not how spoken language sounds in the real world. (Even YouTube audio is clearer than the real world.) Especially in conversations, the small pause fillers from the other parties, stuff like "hmm, uhm, mmh, ja...", were missed most often, with different particles between languages. These systems also have to handle errors and corrections, which is rather difficult, especially if you want downstream tasks that start processing before input is finished (for example the retico package for Python). It would be interesting to see how much better this model does on their metrics. Reference: Evaluation of Automatic Speech Recognition for Conversational Speech in Dutch, English and German: What Goes Missing? (Lopez et al., KONVENS 2022)
@rickevans7941 2 years ago
TY for SOURCES AT BOTTOM BLESS YOU!!
@Veptis 2 years ago
@@rickevans7941 been trying to back up my claims recently, hope you found some more reading material
@Gogolian 2 years ago
We need more data, like, maybe recorded in the background by cell phones of people's everyday lives... wait, no...
@00SEVEN28 2 years ago
Toss in some Cockney, and it’ll get bolloxed.
@benjigeezer 2 years ago
do you spit or do you swallow?
@owenspottiswoode5936 2 years ago
I believe there's quite a lot of literature in the field of linguistics that says that humans who learn multiple languages concurrently/in quick succession perform better in each individually than those who study just one, so it doesn't surprise me that much to find the same holds true for machines. Intuitively, I think it makes sense: rather than pattern matching for the grammar and traits of one particular language, you're abstracting the problem to understand the relationship between the word and the underlying concept (signifier and signified, in the parlance of linguistics).
@zerglin9000 2 years ago
Even more likely, in the real world all languages are (loosely) connected. In today's world, the prevalence of foreign loan words is near guaranteed, so having some words that 'don't generally fit the language' is almost expected. Plus, if you consider accents, which are essentially residuals from a foreign language, being able to process both languages would help trace the original sound back to the intended sound.
@galen__ 2 years ago
Reminds me of how video codecs became better at handling film grain vs noise. I believe at least one can now essentially filter film grain before the main transform, retaining a map of the original grain pattern in the encoded data to be restored when decoded by the player.
@lorenzoiotti 2 years ago
Being Italian, I immediately tested it on a video of a local YouTuber with a strong Sardinian accent. I'm impressed by the accuracy even with accents verging on dialect; GPT-4 will be very interesting ;)
@andresshamis4348 2 years ago
I think GPT-4 will be indistinguishable from a live human
@qwerasdfhjkio 2 years ago
As an Italian, that's actually impressive, even I can't understand them sometimes 😂
@McDonaldsCalifornia 2 years ago
Whoa, I didn't even think about that. All of YouTube potentially as a text data set!
@rielaxault 2 years ago
@@andresshamis4348 Doubtful.
@MikeClarkeARVR 2 years ago
How about an Italian speaking Ithkuil? We may all have to learn Ithkuil soon! ;)
@lukask969 2 years ago
As a Ph.D. student, I already used Whisper - one day after the release - for the interviews I had recorded, and I can say it works incredibly well compared to other cloud speech-to-text services (AWS/GCP/Azure tested). It does not have speaker diarization, but it works unbelievably well at ignoring pauses, uhm, hmm, ahms, and background noises. Each sentence is clearly recognized, and you can do the speaker separation very well sentence by sentence. My 20 hrs of interviews - which would be 2-3 weeks of transcription work - were done with my GPU in 5 hrs. Another day for annotating the speakers - easy. Thank god they decided to release this to the public, and you can run it locally, totally GDPR compliant! Used this for German - with very little to no flaws found - and it is even better for English transcription. + It is so easy to use.
@TheRainHarvester 2 years ago
Is a GPU necessary? Or can CPU-only be used?
@lukask969 2 years ago
@@TheRainHarvester As far as I know you can use the device parameter to choose CUDA or CPU. But CPU could be a factor of 20 slower; even a low-end Nvidia graphics card should be faster than any CPU. I haven't tried it though, so idk how well it works. I used the large model for my interviews.
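For reference, a minimal sketch of the Python API being discussed; `load_model` and `transcribe` are the package's documented entry points, while the model size, device choice, and file name here are just example values:

```python
import whisper

# device="cuda" runs on the GPU; use device="cpu" as a (much slower) fallback
model = whisper.load_model("large", device="cuda")

# language is optional; Whisper can also auto-detect it
result = model.transcribe("interview_01.mp3", language="de")
print(result["text"])
```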
@gatenetsetgo....7012 2 years ago
Hey bro, in which area are you doing your PhD? Because I'm thinking of doing one after a long gap in my education, but I need help, bro. My subject is CSE but I have very little knowledge in that area, so where do I start? I'm not getting any help because all my education connections were lost and I want to start over.
@jakcrimson1448 2 years ago
Yes, omg, thank you, I've been looking for explanations on this new model. Great video as always!
@toastrecon 2 years ago
I have a bunch of MP3s that I pulled from audio cassette recordings of family members who have been gone a long time. I've been meaning to transcribe them; it'd be interesting to see how this worked with those.
@RavenMobile 2 years ago
Try it out and let us know -- I am also curious.
@chrisoman87 2 years ago
Even if it gets 95% correct, that's a significant reduction in manual work: just read along and correct a mistake here or there. Great idea.
@sieyk 2 years ago
I just tested this translating some random anime raw and it did a fantastic job, and automatically generated the SRT file with timestamps. Dodgy translations are a thing of the past!
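If anyone wants to try the same thing, a minimal sketch through the Python API (the file name is a placeholder; the CLI equivalent would be something like `whisper episode01.mka --task translate`):

```python
import whisper

model = whisper.load_model("medium")
# task="translate" makes Whisper emit English text regardless of the source language
result = model.transcribe("episode01.mka", task="translate")

# each segment carries start/end times, which is what the SRT output is built from
for seg in result["segments"]:
    print(f'{seg["start"]:7.2f} --> {seg["end"]:7.2f} {seg["text"]}')
```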
@OfficialSombrero 2 years ago
very nice sir
@MikeClarkeARVR 2 years ago
Unless you speak Ithkuil! ;)
@carlosdesantiago1356 1 year ago
Some notes:
[00:00] OpenAI's new Transformer model 'Whisper' is an automatic speech recognition model
- Whisper is fully open sourced for inference and can be downloaded and used
- The model has varying performance and accuracy based on the size
- AI models may overfit to gold standard datasets and underperform in real-world scenarios
- Models trained on highly curated datasets may outperform humans on classification tasks, but struggle with out-of-distribution samples
- Humans have a more generalized approach to problem-solving, which allows them to perform better in real-world scenarios
- Increasing model size may not significantly improve speech recognition performance
- Overfitting may occur with too large of a model
- Data set size is likely to have a greater impact on performance
[03:18] Whisper model performs well on imperfect audio data
- Model tested on various audio qualities and sizes
- Whisper model trained on weakly supervised data with background noise
[06:45] Fine-tuning speech-to-text models can lead to overfitting
- Mixing new data with original data can help with overfitting
- Similar strategy can be applied to image and audio models
[09:54] Using imperfect data can improve model performance
- Training models on imperfect data can be followed by fine-tuning on gold standard data
- Mixing tasks and data to support them can lead to better performance
[13:18] AI-generated content detection is becoming crucial
- As AI-generated content becomes more prevalent, models that can detect whether content was generated by AI will become more important
- Mixing different tasks and task tags can add to the robustness and generalization of GPT-style models
[22:21] Models trained on narrow tasks perform better than those trained on multiple tasks and languages, except for joint models that perform both transcriptions and translations
- Small models benefit from narrow tasks and training on English transcriptions only
- Joint models outperform English-only models in larger experiments, indicating a trend towards more powerful and generalized AI
@MenkoDany 2 years ago
Finally, I can understand the lyrics of death metal recordings!
@sentdex 2 years ago
Oh man, these would be good test cases! Haha
@dmarsub 2 years ago
23:20 Wow, that's fascinating. I wouldn't have expected the general model at this stage to be better than the language-specific one.
@alexandrepv 2 years ago
OpenAI just destroyed a lot of AI startups that are specialised in speech transcription. Well done :)
@throwawayidiot6451 2 years ago
Not just AI startups, a bunch of traditional firms specialized in transcriptions made by humans will also go belly up
@sunrealclothing 2 years ago
Technology will hopefully create new jobs as well.
@SamGarcia 2 years ago
not really, you just integrate it into your AI model.
@alexandrepv 2 years ago
@@sunrealclothing I wasn't being sarcastic, I really want them to go belly up :)
@sunrealclothing 2 years ago
@@alexandrepv seems AI is going to be advancing forward whether we like it or not. And it seems a bit sketchy and unregulated at the moment.
@FinanceLogic 2 years ago
Good job being so knowledgeable about the cutting edge. This is the best, fastest, most precise speech-to-text I've tested, with the best background filtering; it even catches actual whispers. It's amazing. Before anyone says that's obvious: some people have not tried it, of course. Most people. I'm excited to begin watching this video now
@HenryLoenwind 2 years ago
At around 7:00, you're mixing up "clean recordings" with "transcribed recordings". The "gold" input data they are talking about isn't gold because it has good audio quality; it's gold because there's a known-good text transcription of it. The unsupervised learning was about feeding the model plenty of audio without transcriptions, so the model could first learn how human speech sounds without knowing anything about its meaning. Only later was the "gold" data used to teach it how to convert whatever it understood into text. So this model first converts the audio it hears into some kind of "mental picture", and then in a second step converts that "mental picture" into text.
@PoschiUnavailable 2 years ago
It makes a ton of sense to have a unified language like English that is used for further internal text processing. For example: talk in any language to your AI home assistant; internally the spoken words are detected in the language they were spoken in, then translated into English, and the English words are processed to figure out what the user wants the assistant to do. Now hear me out: what if the internal language of that process was not English, but we trained a model on a lot of data in many languages to figure out the best possible unified language, one that carries meaning, intention etc. in the best way possible? This would amount to many words or phrases in any language mapping onto relatively few "meanings", if you think of it in a mind-map way.
@solenoidnull9542 2 years ago
I've always wondered what this optimal 'unified' language would look and sound like, but it certainly exists
@KiraSlith 1 year ago
The jump from Base to Small in this test case is also pretty great. Small seems perfectly serviceable for, say, watching translated streams and live-translating with minimal errors and time per token.
@acrawford01 2 years ago
The part about the quality of the models: AlphaZero, the chess AI, learned by playing billions of games against itself. It wasn't fed any opening data or game databases, and it has amazing results. I know that is very different from this case, but I think training with non-perfect data can be beneficial.
@pile_of_kyle 2 years ago
Isn't that an entirely different type of model, though? AlphaZero used reinforcement learning, whereas this model uses "weakly supervised" learning.
@syrus3k 2 years ago
It needs to know if it's right in order to be able to train: easy with a game, hard with something without hard rules.
@MAlanThomasII 2 years ago
AlphaZero and similar play themselves because there are no "perfect" data sets, just some examples of whatever the current human best is. To become more perfect than the best available examples, it needs to do more than just study the best available examples. That is why it uses a completely different form of learning.
@millionare5446 2 years ago
I think it makes sense for English performance to go up if you train the model on non-English audio. It will be less likely to transcribe a word incorrectly when it knows which words do not exist in the English language. For example, it will not confuse "is" and "es" because it knows the difference between English and Spanish
@pile_of_kyle 2 years ago
I'm confused. If you only trained the model on english audio, how would the model ever know of the existence of "es?" I am still struggling to understand how the generalized model performs better than the English-specialized model, and I think Sentdex is equally perplexed.
@millionare5446 2 years ago
@@pile_of_kyle I'm not sure; having imperfect training data and byte-pair tokenization could probably make the model learn non-English words even if you tried to only train it on English data. This also makes me think of unsupervised training: having multiple contrasting sets of data will make the model very good at performing tasks for the target use case
@alexlong9424 2 years ago
Cool video! I'm still watching but one note: weakly supervised doesn't mean that the training data quality is bad; rather, it means that the labels on the training data aren't necessarily good. It sounds like in this case they labeled their training data (i.e. they transcribed input audio) through less labor-intensive means, like maybe having a less effective model produce transcriptions, as opposed to having grad students produce high-quality transcriptions.
@arjunharikumar7176 2 years ago
This happened to me when training YOLOv5 to identify cracks: the model accuracy improved so much once we trained it to identify human faces as well. I think before adding human faces, my model had probably just learned that if the pixels go from white to black instantly, it's a crack, so it passed the validation set with stunning accuracy without really understanding what a crack is.
@arigato1901 2 years ago
Wow! OpenAI is actually delivering something open?! 😳😁
2 years ago
Having tasks that are considered irrelevant could indeed help the generalization of the embeddings. People often forget that when we use a pre-trained embedding space, it usually corresponds to a certain task. Multi-task training is a way to generalize the embeddings.
@VaibhavShewale 2 years ago
I used this model, and it's just incredible. I tried different types of audio with different noise and it worked every time!
@y.shrestha6936 2 years ago
I have seen this effect in image processing networks too. I was training a cervical cancer image classification network when I had the idea to train a UNet segmentation network and add an additional classification head off the encoder. The result was better performance in the classification problem even though I throw away the segmentation head at inference time.
Oh wow, I haven't seen a video of yours in a couple of years. What a big difference in quality. Awesome job!
@benjaminlynch9958 2 years ago
One addition that would be awesome is the ability to output the text with timestamps for closed-caption generation. But overall this is awesome. Can't wait for the inevitable improvements to personal assistants - Siri, Google, Alexa, etc.
@KoolenDasheppi 2 years ago
It can already output VTT and SRT with timestamps iirc
@pile_of_kyle 2 years ago
I'm not sure what technology YouTube uses, but that is already possible when uploading a video. There's an option to auto-generate timestamps where you can paste in raw text and get a perfectly timestamped output.
@tobiasjennerjahn8659 2 years ago
There are sooo many videos on YouTube that have proper subtitles (not auto-generated ones). These should be super high quality, because whoever uploaded the video has an incentive to make their own subtitles as good as possible. I'm not sure how easy it would be to find and scrape that, but at least for Google this data is readily available. That's hundreds of thousands of hours of properly transcribed audio with varying degrees of audio quality.
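For anyone outside Google wanting a small version of such a dataset, creator-uploaded subtitles can be fetched with yt-dlp; a rough sketch (the URL is a placeholder, and yt-dlp must be installed and on PATH):

```python
import subprocess

# Download only creator-provided English subtitles, skipping the video itself.
subprocess.run([
    "yt-dlp",
    "--write-subs",        # creator-uploaded subs (auto-generated ones need --write-auto-subs)
    "--sub-langs", "en",
    "--skip-download",
    "https://www.youtube.com/watch?v=VIDEO_ID",
], check=True)
```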
@anguswett 2 years ago
Good idea!
@countofst.germain6417 2 years ago
One would assume that they probably used it. I can't imagine where else they would be getting 680,000 hours from. Maybe movies I guess but I'm not sure if they would be allowed to use them.
@tobiasjennerjahn8659 2 years ago
@@countofst.germain6417 Oh, I must've misheard. I thought it was much less. You're probably right then.
@peter9477 2 years ago
@@countofst.germain6417 Interesting question of law. As an AI is not (yet) a "person", feeding the data into it cannot be considered a "performance", and certainly not a public performance. Would love to see this tested in court some day.
@medhurstt 2 years ago
I think effective language translation happens when the model recognises the word in and of itself, but this is only part of the story. It also recognises it in the context of the other words preceding it and, importantly, in the context of the idea being portrayed. Consequently I think training the model with multiple languages helps strengthen the contextual associations (i.e. modelled ideas) without overfitting due to limited English words. So in general, I would think that training models with complementary tasks would benefit the model, but training with completely different tasks would have much less benefit. For example, training the speech-to-text model with speech-to-text data plus Egyptian history data might benefit the speech recognition for Egyptian-history-related transcriptions, but for business-related transcriptions, not so much. My 2c
@tiredko-hi- 2 years ago
Currently learning a third language and can confirm that when you're learning more languages, especially when they are very different, translating between them forces you to get a better understanding of the ideas behind the words you have in all of the languages.
@Boringpenguin 2 years ago
They finally live up to their name🤣 Who could have known
@nilfux 2 years ago
When it becomes sentient you'll know because it'll HATE being called artificial.
@microcolonel 2 years ago
Even more than this, the large model is trained to transcribe spoken French, Japanese, and other languages to English text, and it works remarkably well. I spoke some pretty complex Japanese sentences and got very good English translations out the other end. The large model is *tiny* for its performance; it recognizes a massive vocabulary in Japanese (which I tested the most). One thing that it doesn't do, that would be interesting, is multilingual inference. Currently, if you speak two or more languages in one sample, it will break down.
@stanislasbasket 2 years ago
Very interesting to try turning on the auto-generated subs for this video at 2:50; you can see that Google's algo understood half of it.
@stanislasbasket 2 years ago
Actually, 2:40 is the start
@cmeerdo 2 years ago
I am sitting in a room different from the one you are in now. I am recording the sound of my speaking voice and I am going to play it back into the room again and again until the resonant frequencies of the room reinforce themselves so that any semblance of my speech, with perhaps the exception of rhythm, is destroyed. What you will hear, then, are the natural resonant frequencies of the room articulated by speech. I regard this activity not so much as a demonstration of a physical fact, but more as a way to smooth out any irregularities my speech might have.
@ScottJWaldron 2 years ago
This looks interesting! I'd like something that separates people in a conversation for the transcript. Haven't looked around to see if there is anything currently available with that type of model. Whatever model TikTok is using seems like the best. YouTube's has gotten better, but I tend to correct more when I'm going through auto captions for my videos.
@bernardofrassy 2 years ago
Hey man, your content is awesome, keep it up!
@sentdex 2 years ago
Thanks!
@mytechnotalent 2 years ago
Just incredible how well this model works and how simple it is!
@mikeyjohnson5888 2 years ago
Automatic subtitles for media are something I've wanted for a long time
@cmilkau 2 years ago
Instead of a "clean audio" tag (which would probably work), another idea is to do style transfer. Give it a text, a voice recording, and a target language, and make it translate the text into the target language while applying the voice and style of the speaker. This style might also include how much background noise you want (is the speaker in a public room or a quiet studio) or whether you want deliberate artefacts (is the speaker on a megaphone, or an aged analog recording, or other degradations).
@pilcaroo 2 years ago
Yes. I imagine in the not so far future you could choose what language you want a video to be in, and an algorithm would make the translation, keep the voice and intonation of the speaker, and even correct the lips in the video so they move according to the new audio. And another way to approach getting clean sound from an AI trained on dirty recordings might be to add another layer on top, trained on cleaning dirty recordings. On the one hand, it sounds like making the process longer and more cumbersome, but on the other hand it means you can use a very large data set of audio, and not only professionally recorded audio, which might also be narrow and contain only certain accents etc.
@jamesnewton-thomas5902 2 years ago
The improvement seen from multiple-language training for single-language transcription may be related to the mechanics of human vocalization, which are common across languages, or even to onomatopoeia
@graestarr 2 years ago
bang
@TheGeneticHouse 2 years ago
Descript has perfected the art of cloning you so you are now a TTS! That's not exactly what the program is intended or marketed for; it's more for audio/video editing via text editing after the audio or video is made. Then you can "overdub", which is what the voice you create is called: make an overdub and it replaces that text in the audio and video. But I just use the TTS, it's amazing. It's a secret though lol
@Qstandsforred 2 years ago
I think that even for a speech generator you may want to feed it dirty data as well. No reason not to. It will extend its capabilities. Seems plausible that it would also enhance the quality of clean outputs, or at the very least add more possible variations for clean outputs (such as the type of microphone used).
@Octamed 2 years ago
Use movies with subtitles as a dataset. Run the audio through random filters to simulate far/near/obscured speakers in real settings, add random background noise, etc.
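Something like that augmentation chain is easy to sketch with the audiomentations package; the parameter values below are arbitrary, and the random array just stands in for real film audio:

```python
import numpy as np
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

# Randomly "dirty up" clean audio: noise, speed changes, pitch changes.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.3),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.3),
])

samples = np.random.randn(16000).astype(np.float32)  # stand-in for 1 s of movie audio
noisy = augment(samples=samples, sample_rate=16000)
```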
@CurlyScott89 2 years ago
English has a ton of words borrowed from different languages, so it makes sense that when an AI knows different languages it can make better use of context clues when transcribing
@jammin023 2 years ago
Transcribing different languages and translating between them are quite closely-related tasks, so it's easy to understand how a single model might gain from the commonalities between those tasks by learning all of them concurrently (as long as the model size is not too small). It doesn't necessarily follow that there are similar gains to be had from mixing and matching less closely-related tasks. So for example I think it's unlikely there'd be a benefit to mixing image recognition with speech recognition, for the same reason that we use different specialised areas of the brain for those tasks.
@cmilkau 2 years ago
GPT-3 even generalizes on the tasks. It often can follow natural language instructions (zero-shot task generalization), sometimes with the help of one or few examples in its prompt (few-shot task generalization).
@briceleroy 2 years ago
I miss the old videos, where you'd actually be hands-on putting together an AI-related project
@Lunsterful 2 years ago
It's like 3 lines of code
@crackwitz 2 years ago
This is a news video, not a tutorial video. He gives us varied content. Also: yes, literally a fistful of code to run this thing.
@briceleroy 2 years ago
@@Lunsterful I'm not sure what you meant
@briceleroy 2 years ago
@@crackwitz thanks for the mansplaining
@sentdex 2 years ago
I hear ya. I'm just no longer interested in doing more basics videos. I just keep feeling like "I've already done this" when I try, and at this point I gloss over things I think are obvious, but aren't... etc. The videos here have almost always been based on what I'm actually working on/interested in, and more and more that's become impossible to show step by step in the basics way, just due to overall complexity.
@mikeciul8599 2 years ago
* accidentally creates a new interpretation of Alvin Lucier's "I am sitting in a room" *
@madebyrasa 2 years ago
I am floored. This model worked so well for me.
@jcjensenllc 2 years ago
Would be better to start with some background context: what is Whisper? What is it used for? What are you talking about?
@JamJells 2 years ago
I wanted to make an SRT file for an old movie, but couldn't make out what they were singing in one of its songs. Can I upload the soundtrack and have Whisper translate it?
@Smittel 2 years ago
If this also returns exact timestamps for when things were said, it'll be hella useful, and maybe finally something that can replace, say, YouTube's absolutely dreadful auto captions (or allow creators to quickly and easily add their own captions)
@MyHowHowHow 2 years ago
There is a port of Whisper that has an experimental feature for per-word timestamps. It is called whisper.cpp
@tmattoneill 2 years ago
Woah! It's the Python guy talking about AI Art. How cool is that. Long time, dude!
@outlander234 2 years ago
This is what I am most excited about in the whole AI field... This will eventually lead to a Universal Translator like the one in Star Trek :D
@ericvosselmans5657 2 years ago
Exactly! That translator went from pure sci-fi to being just over the horizon, at least for the 100 or so languages these things are trained on.
@outlander234 2 years ago
@@ericvosselmans5657 It's crazy! I remember thinking when watching Star Trek, how tf does it do that... Yet here we are... Basically anything we can do, AI will do, and at a ridiculously faster rate.
@ericvosselmans5657 2 years ago
@@outlander234 I remember having exactly the same thought!
@The-Dom 2 years ago
Great job of testing and presenting, sir.
@Crayphor 2 years ago
I wonder, what if we used a tag to use Whisper as a discriminator in an audio GAN? That could possibly make a very realistic text-to-speech model if we also used diffusion for the generator.
@NoobHunter65 2 years ago
Weakly supervised data can be built first, and then more accurate data can be added to it to make it better; it will start off as its own baby and then grow over time based on its environment.
@devreactor 2 years ago
Amazing accuracy of the model😍
@danielash1704 2 years ago
489 character with in the program reprogramming it a cursive writing of words to resolution which may fits perfect to realize that the experiencers own closeness to the experiencers
2 years ago
And what about multi-modal training where we also incorporate the "old-fashioned" audiograms and feed these images to the transformer in addition to the speech audio? ;)
@avi7278 2 years ago
Then the AIs that determine if a piece of content was generated by an AI can theoretically be used to train another AI that improves the original AI so that it can no longer be detected - ad infinitum?
@สุนิษาประเสริฐกุล 2 years ago
Thank you for explaining this thoroughly!
@gunterstrubinsky9452 2 years ago
if we order the ebook, will we be able to download the changes/clarifications?
@geifwijfheigvwis 2 years ago
OpenAI is unbelievable; it really has my admiration and the admiration of a lot of people in the field. We definitely need more companies like this moving the boundaries of data science towards a better future. I am definitely mixing this with ChatGPT to get something like Siri XD, just far better
@thesmilegame 2 years ago
Thank you for the upload
@xuepingsong5329 2 years ago
We need this for lectures at uni!!!
@khalidzamzamkz 2 years ago
You asked whether the extra task makes a difference in increasing the performance. My instinct says it definitely does. In this case, the model that translates performs better, because to translate a language you need a better understanding of the "meaning" of the language. This deeper understanding might have helped the model fill in ("predict") words that were not perfectly clear from the audio (especially in the non-gold-standard data). I think one could test this by seeing the effect of using only gold standard data with both of the models. Let's say that using the multi-task model in the paper, with the mixed dataset, resulted in 15% better performance, yet when using only the gold standard data the increase in performance was around 1%. This COULD show that the deeper knowledge was used to predict words instead of actually "hearing" them. Although, that would be difficult to test in this case due to the difference in sizes of the datasets.
@BrandonJacobson 2 years ago
I've been trying to implement this to generate captions for my Python YouTube channel and I can't get the code to work.
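In case it helps, here's a minimal sketch that writes an .srt by hand from `result["segments"]` (file names are placeholders; assumes the openai-whisper package and ffmpeg are installed):

```python
import whisper

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,345."""
    ms = int(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = whisper.load_model("base")
result = model.transcribe("my_video.mp4")

with open("my_video.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")
```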
@thaddeuspellegrini3883 2 years ago
Do you think that a large language translation model (text input, text output, for example) trained on a sufficiently diverse set of human languages (English -> Mandarin, Hindi -> German, French -> Japanese, and every other combination of living languages) would be able to translate a dead language into a living language, since it's a decent assumption that there are commonalities in base structure between that dead language and the living languages of today? Basically what I am wondering is: if you trained a language model to translate N languages into all of the various permutations of those languages, then input some language N+1 as a test case for which the model had not been trained, would the similarities between the N languages and the N+1 language be enough to translate it as well? For example, translating texts written in Egyptian hieroglyphs into English even if we had not trained the model on Egyptian hieroglyphs
@billykotsos4642 1 year ago
This model truly is insane
@mertinan8252 2 years ago
Weakly supervised does not mean lower-quality sound data; it's about whether the data has labels or not. The hard and time-consuming step in data collection is not the quality of the sound, it's the labeling by human annotators.
@garrettjones1161 2 years ago
I’d be interested in having a switch where you can pair this AI with a tool that fixes grammar so it defaults to “sensible” sentences rather than being literal and preserving faults in speech.
@snippletrap 2 years ago
You can absolutely use the same models for text to speech, just like you can unify text to images and images to text. They all live in the same embedding space.
@danielash1704 2 years ago
In 2003 more careful with everything looking at the silenced is a trick question about training the brain and memory of a pump that has long short drivers to realize that the experiencers own closeness to the experiences one from multiple ai in a single asking for a variety of answered which ones difference between the upper and lower levels of A.I Learning is better then to useing just one point of same quests
@janiv3987 2 years ago
Good lighting dude.
@_divya_shakti 1 year ago
Do you have any playlist specifically dedicated to speech deep learning?
@doctorai 2 years ago
How can we do diarization using OpenAI Whisper or any other model, on a mono channel?
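Whisper itself doesn't do diarization, but one common recipe is to pair it with a separate diarization model and match segments by time overlap. A rough sketch assuming pyannote.audio and a Hugging Face access token (the overlap heuristic is my own simplification, and the file name is a placeholder):

```python
import whisper
from pyannote.audio import Pipeline

asr = whisper.load_model("medium")
diar = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                use_auth_token="hf_...")  # placeholder token

result = asr.transcribe("meeting.wav")
diarization = diar("meeting.wav")

# Assign each transcript segment to the speaker whose turn overlaps it most.
for seg in result["segments"]:
    best_speaker, best_overlap = "unknown", 0.0
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
        if overlap > best_overlap:
            best_speaker, best_overlap = speaker, overlap
    print(f'{best_speaker}: {seg["text"].strip()}')
```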
@akauppi2 2 years ago
Interesting, the part at 12:10 where you describe how doing multiple tasks may be beneficial. I think that's what our body (nervous and muscle system) does.
@N3Cr0Ph0b1A 2 years ago
Me translating a 9 minute Kurzgesagt video in 30 seconds: "Here in the Kurzgesagt Labs we only work on the most important scientific problems like what if we nuke stuff? Or how about we make this elephant explode? Or who could forget, look at this thing, it's really big. Continuing this proud tradition, let's explore the scientific mystery of what would happen to you if Earth suddenly turned into gold. The Midaspocalypse, based on the ancient tale of King Midas who was cursed so everything he touched turned into gold. Before we can explore this scenario with science, we'll first define the premise. Midas' curse is a very special phenomenon called magic, which allows us to modify physics. So what happens when Midas touches something and it turns to gold? An atom of gold has 79 protons and 118 neutrons in its nucleus......" How the hell did this model come up with "Midaspocalypse"?!?! What a time to be alive!
@nickvanamburg 2 years ago
my man accidentally recreated "I am sitting in a room"
@TheApoorvagni 2 years ago
The moment you realize you need to like the video: 6:06 "...with enough time and grad students, this could eventually be hundreds of thousands of hours"
@Ez-se2dl 2 years ago
"There's a lot of fat around the lines in this important area. So I'm going to have a talk about my hair conditioning." - The Whisper
@RoyAAD 6 months ago
Did you color the text differences manually? Or is there some library that does that?
@africanelectron751 2 years ago
Finally we are getting to see the old NSA tech
@DanFrederiksen 2 years ago
Can it operate live? Like robot hearing, instead of an audio file.
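Not out of the box (the API works on files or arrays), but you can approximate live use by transcribing short chunks from the microphone. A rough sketch assuming the sounddevice package; the chunk length and model size are arbitrary choices:

```python
import sounddevice as sd
import whisper

model = whisper.load_model("base")
SAMPLE_RATE = 16000   # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 5

while True:
    # Record one chunk from the default microphone, then transcribe it.
    audio = sd.rec(CHUNK_SECONDS * SAMPLE_RATE, samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()
    result = model.transcribe(audio.flatten(), fp16=False)
    print(result["text"].strip())
```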
@wk8219 2 years ago
Great video. Thanks. P.S. Your lipstick game is on point 👍👍😁😁
@crypto_peng 2 years ago
Can you share how you set up your studio? Like the camera and software you use! Thanks
@joelimmanueloelsner9211 2 years ago
OpenAI has an open project - what a concept
@simssim262 2 years ago
From dominating NLP to dominating CV to some extent and now even Audio? Does the transformer architecture have plans to dominate the whole earth?
@jeremyforest8032 2 years ago
Task tags are common in continual learning settings. Interesting that they're moving towards that approach!
@TheCopernicus1 2 years ago
Amazing, my favourite STT model by far!
@bhhbcc4573 2 years ago
The behaviour of the model seems to replicate the observation that multilingual people have an easier time learning new languages and tend to understand their own native tongue better than most.
@chikir9777 2 years ago
Amazing, crack, and the tutorial was perfect. I'm testing it right now
@KennTollens 2 years ago
This would be great for the police interrogation videos, you can barely hear what they say on most. AI is so cool. I can't wait until it reads a book and automatically makes the movie or series.
@redpython99 2 years ago
That sounds kinda dystopian :/
@KennTollens 2 years ago
@@redpython99 I would love a business to be fully automated with AI and just pay people the money it generates. But people are the problem, if you give them money and all the time in the world, they destroy themselves most of the time. Instead of learning new things, going on vacation, improving their health. Instead, they often turn to drugs.
@violent_bebop9687 2 years ago
@@KennTollens that's a very cynical take on human nature. And I completely agree..look at all the homeless addicts right now.
@KennTollens 2 years ago
@@violent_bebop9687 I came to that conclusion after seeing what happens on Indian reservations. They get government money, but so many are drunks and drug addicts.
@mohegyux4072 2 years ago
I'm not an expert by any means, but when I fine-tune or train a model on new data I always mix it with old data. I'd never read that before; it just feels intuitive, so that the model doesn't lose much of its past knowledge. The validation set is also mixed, or I use 2 separate validation sets of old and new data.
@sotonin 2 years ago
Somebody needs to hook this up to language translation and make a service where you feed it audio clips and see translations in your native language.