OpenAI Whisper: Robust Speech Recognition via Large-Scale Weak Supervision | Paper and Code

34,911 views

Aleksa Gordić - The AI Epiphany

❤️ Become The AI Epiphany Patreon ❤️
/ theaiepiphany
👨‍👩‍👧‍👦 Join our Discord community 👨‍👩‍👧‍👦
/ discord
In this video I cover Whisper, an ASR system from OpenAI's "Robust Speech Recognition via Large-Scale Weak Supervision" paper.
Trained on a huge multilingual, multitask, weakly supervised dataset, it achieves very high effective robustness and accuracy, closing the gap with the human baseline while using only an off-the-shelf transformer.
I walk you through both the paper and the actual code. Let me know whether the code part helped!
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
✅ Paper: cdn.openai.com/papers/whisper...
✅ Code: github.com/openai/whisper
✅ Nice explanation of mel spectrograms: • Mel Spectrograms Expla...
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
00:00:00 Intro
00:02:05 Paper overview
00:07:30 Collecting a large scale weakly supervised dataset
00:13:55 Evaluation metric issues (WER)
00:16:05 Effective robustness
00:18:40 Scaling laws in progress
00:26:30 Decoding is hacky
00:28:30 Code walk-through
00:30:25 Model architecture (diagram vs code)
00:33:30 Transcription task
00:34:10 Loading the audio, mel spectrograms
00:37:50 Language detection
00:45:00 Transcription task continued
00:47:35 Suppressing token logits
00:52:00 Voice activity detection
00:53:35 Decoding and heuristics
01:01:56 Outro
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you,
consider helping me out by supporting me on Patreon!
The AI Epiphany - / theaiepiphany
One-time donation - www.paypal.com/paypalme/theai...
Huge thank you to these AI Epiphany patreons:
Eli Mahler
Petar Veličković
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💼 LinkedIn - / aleksagordic
🐦 Twitter - / gordic_aleksa
👨‍👩‍👧‍👦 Discord - / discord
📺 YouTube - / theaiepiphany
📚 Medium - / gordicaleksa
💻 GitHub - github.com/gordicaleksa
📢 AI Newsletter - aiepiphany.substack.com/
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#whisper #openai #asr

Comments: 58
@TheAIEpiphany 1 year ago
Let me know whether the code part helped! :) Is it adding any value for you guys? Or am I just rambling and it's too hard to follow unless you play with the code yourself? Would really appreciate some feedback!
@xl0xl0xl0 1 year ago
It definitely did! Is the debugger your first choice when it comes to figuring out how some new codebase works, or did you fire it up for the occasion as a demonstration tool?
@TheAIEpiphany 1 year ago
@xl0xl0xl0 thanks!! I have a whole series where I do just that. And as for your question: it depends. If I am playing with something on my own then yes, always! It's by far the best way to understand every single detail of your code.
@Erosis 1 year ago
@TheAIEpiphany I missed that series! I actually struggle with debugging ML code in VS Code, so I'll check it out!
@leobeeson1 1 year ago
This code walkthrough has made this paper walkthrough one of the best I've seen. Thanks for that, and please keep doing it!
@TheAIEpiphany 1 year ago
@leobeeson1 wow nice, thanks for telling me that! If I get more feedback I might keep doing this in every paper walkthrough!
@devhau5 1 year ago
I just found this channel and I’m SO THANKFUL for a great walkthrough and explanation. It’s super fun. This is gold!!! Thanks Aleksa!
@mariatrofimova5512 1 year ago
Thanks for walking through Whisper code together, enjoyed the journey!
@pratikkhedikar6759 1 year ago
Maaan!! What a good video. I was searching for something like this, where even a noob like me can understand the entire paper because you took us through it step by step! I knew this was going to be a great video when you stopped to explain the log-mel spectrogram as well! Thanks Aleksa
@Spockleblupit 1 year ago
Thanks Aleksa! Really appreciate the effort you put into these videos. Quality content, keep it up.
@huonglarne 1 year ago
Thank you so much for doing these videos. You helped me so so so so much.
@FreeSubtitlesAI 1 year ago
Very informative and authoritative, thank you!
@alexgil55ka 1 year ago
This is super cool man! Thanks for diving deep into it
@TheAIEpiphany 1 year ago
Thanks Alex!
@pocco8388 1 year ago
Thanks for making this great video!
@CHENXIN-pn7oh 1 year ago
So well explained! Thx!
@nnpy 1 year ago
Great video!!
@DevasheeshMishra 8 months ago
Loved it.. need more such videos
@DevasheeshMishra 2 months ago
Rewatching the stream
@curryeater259 1 year ago
You are amazing sir!
@vinayakbaddi29 1 year ago
@Aleksa Gordic, thanks for sharing this valuable information. Apart from the AI content, I would love to see how you use VS Code so effectively to move around the code and debug it. I would really appreciate it if you could cover that in a video.
@asceznyk 1 year ago
Hi Aleksa! Great video! I just wanted to know what would the loss function be for the models? Would it be something like cross-entropy? Because the model predicts tokens..
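Assuming the model is trained with standard next-token cross-entropy (the usual choice for a transformer that predicts tokens; the paper remains the authority on the exact objective), the loss can be sketched roughly as below. The function names and toy numbers are illustrative only and are not taken from the Whisper codebase.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_cross_entropy(logits_per_step, target_ids):
    """Mean negative log-likelihood of the target token at each decoding step."""
    assert len(logits_per_step) == len(target_ids)
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


# Toy example: vocabulary of 3 tokens, 2 decoding steps (hypothetical values).
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_targets = [0, 2]
toy_loss = token_cross_entropy(toy_logits, toy_targets)
```

With uniform logits the per-token loss equals log(vocabulary size), which is a handy sanity check for any real implementation.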
@ChuanChihChou 1 year ago
I wonder if we can use the attention map (of how much each audio token contributes to the prediction of each transcript token) to back out timestamps instead?
@tahercoolguy123 1 year ago
Hey, really nice video. Can we fine-tune the Whisper model on our own dataset? If yes, can you show us how?
@MattUebel 1 year ago
This is great, ty!
@TheAIEpiphany 1 year ago
🚀
@amilia4174 1 year ago
I have watched your video and it was great! But I'm not sure whether the translation and transcription tasks share the same decode parameters.
@goryeodynasti3025 10 months ago
@TheAIEpiphany, how do you see the effect of the "best_of" parameter on the quality of the transcription? Any insight would be helpful. Thanks
@JF-vt4ve 1 year ago
impressive work!
@xXMaDGaMeR 1 year ago
Can the model be run locally? How much compute is needed to run this model for inference?
@convolutionalnn2582 1 year ago
Sir, I have read your roadmap to Reinforcement Learning... I want to do research in RL. 1) Should I still follow your roadmap? 2) Do I need to know the whole maths derivation behind supervised, unsupervised, and deep learning algorithms? 3) How can I start doing research in RL as an undergraduate at a non-research institute?
@kshitizkhandelwal9348 1 year ago
Can someone explain how the embeddings are learnt?
@lavkushdas5529 10 months ago
Hey, can you provide the source code that you have written in VS Code?
@iradaaristova8698 7 months ago
Can you make a video on how the decoder works?
@sibadattasasmal4866 1 year ago
Can you put a little emphasis on how the timestamps are generated for transcription?
@deadend6399 1 year ago
Can you do an install video on this?
@petercowling6769 1 year ago
Welsh an outlier. Never would have guessed. Anyway, gotta go, heading out to Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch this afternoon.
@ludvigjoborn7937 1 year ago
One can find in the paper that this is because a lot of Welsh was misclassified as English during the data collection process. Imagine them finding out.
@phongtranhung5635 1 year ago
Hi. I've found your channel and the videos are all totally mind-blowing. I have a question regarding Whisper. I want to return the probability of every transcribed word. I think I have to do something with the update function inside decoding.py. Can you give some help on how to do it? I would really appreciate it!
@huonglarne 1 year ago
You can modify the update function to return the logprobs of all tokens. The max of those logprobs is the selected token's log-probability (when decoding greedily).
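A minimal sketch of that idea, assuming greedy decoding (temperature 0), where the chosen token is simply the one with the highest log-probability. This does not reproduce Whisper's actual update method in decoding.py; the function names are hypothetical, and the logits are passed in as plain lists.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def greedy_step(logits):
    """Pick the most likely token and return (token_id, log-probability)."""
    logprobs = log_softmax(logits)
    token_id = max(range(len(logprobs)), key=logprobs.__getitem__)
    return token_id, logprobs[token_id]


def greedy_token_probabilities(logits_per_step):
    """Return (token_id, probability) for each decoding step."""
    return [(tok, math.exp(lp)) for tok, lp in map(greedy_step, logits_per_step)]
```

Two caveats: with sampling (temperature above 0) the selected token is not necessarily the maximum, so its log-probability has to be looked up by index instead. And because tokens are sub-word pieces, a word-level probability needs the token probabilities of that word combined, for example by multiplying them.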
@kerenstarobinski8564 1 year ago
is it possible to find the timestamps of each transcribed word? Great work!
@MyHowHowHow 1 year ago
Not in OpenAI's version, but a port of it has this feature. It is called whisper.cpp.
@dimorischinyui1875 1 year ago
Hey guys, can anyone please help me with this issue? I am trying to run Whisper on my machine and I am getting this warning in cmd: UserWarning: FP16 is not supported on CPU; using FP32 instead warnings.warn("FP16 is not supported on CPU; using FP32 instead"). I use Windows 10 with an RTX 2060 GPU. It also seems to run on my CPU instead of the NVIDIA GPU. For more detail: I created a Python virtual environment and pip-installed Whisper in that environment.
@MyHowHowHow 1 year ago
Try the parameter --device cuda
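The same choice can be made from Python. The sketch below assumes the openai-whisper and torch packages are installed; the helper names are hypothetical. It picks the GPU when PyTorch can see one and otherwise falls back to CPU with fp16 disabled, which avoids the warning above.

```python
def choose_whisper_options(cuda_available: bool) -> dict:
    """Return device and fp16 settings that match the available hardware."""
    if cuda_available:
        return {"device": "cuda", "fp16": True}
    # FP16 is not supported on CPU, so request FP32 explicitly.
    return {"device": "cpu", "fp16": False}


def transcribe(path: str, model_name: str = "base") -> str:
    """Transcribe an audio file on the GPU if PyTorch can use one."""
    # Imported lazily so the helper above works without these packages.
    import torch
    import whisper

    opts = choose_whisper_options(torch.cuda.is_available())
    model = whisper.load_model(model_name, device=opts["device"])
    result = model.transcribe(path, fp16=opts["fp16"])
    return result["text"]
```

If torch.cuda.is_available() returns False on a machine with an NVIDIA GPU, one possible cause is a CPU-only PyTorch build inside the virtual environment; in that case --device cuda alone will not help until a CUDA-enabled build is installed.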
@MoiezIzmail 1 month ago
It's not Chinese, it's Korean (the script is called Hangul). Thanks for the tutorial!
@HarishPentapalli 1 year ago
Any guesses on the name of company B?
@TheAIEpiphany 1 year ago
Hah, hard to infer without repeating the research
@marearts. 9 months ago
1:30 yes, it's Korean
@SuperMan-rw6iz 1 year ago
That's not even Mandarin... it's Korean BTW 😅
@FinnBrownc 1 year ago
Would be helpful if you could put these models in history a bit. I’m not as familiar with how things were done in the past vs. today SOTA.
@TheAIEpiphany 1 year ago
Thanks for the feedback! You mostly care about transformers here: see the "Attention Is All You Need" paper.