*Abstract*

This video explores the potential of OpenAI's Whisper model for real-time streaming automatic speech recognition (ASR). While Whisper excels in batch ASR, its ability to handle streaming scenarios with low latency is less obvious. The video introduces the open-source whisper-streaming project, which adapts Whisper for streaming applications by processing consecutive audio buffers of increasing size and confirming output tokens using the LocalAgreement algorithm. The video also discusses the limitations of this approach compared to models specifically designed for streaming ASR.

*Summary*

*Introduction (**0:00**)*
* The video investigates whether OpenAI's Whisper model can be used for real-time streaming ASR.
* Whisper is a powerful ASR model trained on a massive multilingual dataset, known for its robustness to noise and accents.

*Batch vs Streaming ASR (**0:35**)*
* Batch ASR processes entire audio recordings at once, while streaming ASR produces output as the speaker talks, with minimal delay.
* Streaming ASR is crucial for applications like live captioning, where real-time transcription is essential.

*Why is Streaming Whisper Difficult? (**1:55**)*
* Whisper is designed to process fixed-length audio segments (30 seconds), making it challenging to handle longer recordings in a streaming fashion.
* Simply splitting audio into chunks can lead to inaccurate word recognition and high latency.

*Whisper-streaming Demo (**2:58**)*
* The video showcases the open-source whisper-streaming project, which enables real-time transcription using Whisper.
* The demo demonstrates the project's ability to transcribe speech with minimal delay and provide timestamps.

*Processing Consecutive Audio Buffers (**3:38**)*
* Whisper-streaming feeds increasingly larger audio chunks into Whisper until an end-of-sentence marker is detected.
* This ensures that Whisper processes complete sentences, leading to better accuracy.

*Confirming Tokens with LocalAgreement (**4:36**)*
* The LocalAgreement algorithm confirms output tokens only after they are generated in two consecutive audio buffers.
* This helps distinguish between confirmed and unconfirmed transcription results, allowing for real-time feedback with potential corrections.

*Prompting Previous Context (**6:05**)*
* Whisper-streaming uses the previous sentence as prompt tokens for the model, providing additional context and improving accuracy.

*Limitations vs Other Streaming ASR Models (**7:01**)*
* Whisper's design isn't optimized for streaming, leading to inefficiencies like repeatedly processing the beginning of long sentences.
* Dedicated streaming ASR models use architectures that allow efficient processing of continuous audio streams with fixed context windows.
* Adapting Whisper for streaming would require modifying its architecture and retraining, which is currently limited by data accessibility.

I used Gemini 1.5 Pro for the summary.
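For a more concrete picture of the confirmation step described under *Confirming Tokens with LocalAgreement (4:36)*, here is a minimal Python sketch of the LocalAgreement-2 idea. The names and structure are illustrative only (not the actual whisper-streaming code): tokens are committed once two consecutive, growing buffers agree on them.

```python
def longest_common_prefix(a, b):
    """Length of the longest shared prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class LocalAgreementConfirmer:
    """Illustrative sketch: confirm tokens once two consecutive hypotheses agree."""

    def __init__(self):
        self.prev_hypothesis = []  # transcript of the previous, shorter buffer
        self.confirmed = []        # tokens already committed to the output

    def update(self, new_hypothesis):
        """Feed the transcript of the latest (larger) audio buffer.

        Returns the newly confirmed tokens; may be empty while the two
        hypotheses still disagree beyond the already-confirmed prefix.
        """
        agreed = longest_common_prefix(self.prev_hypothesis, new_hypothesis)
        newly_confirmed = new_hypothesis[len(self.confirmed):agreed]
        self.confirmed.extend(newly_confirmed)
        self.prev_hypothesis = new_hypothesis
        return newly_confirmed


# Example: the second hypothesis extends the first, so its shared prefix is confirmed.
c = LocalAgreementConfirmer()
print(c.update(["the", "quick", "brown"]))                   # [] -- nothing to compare yet
print(c.update(["the", "quick", "brown", "fox"]))            # ['the', 'quick', 'brown']
print(c.update(["the", "quick", "brown", "fox", "jumps"]))   # ['fox']
```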
@additivealternativesbc 7 days ago
Spectacular
@prajdabre 3 months ago
Hey thanks for covering our work. Really neat explanations!
@nmstoker 8 months ago
Thank you - yet another beautifully explained topic 🙂
@qwerty_and_azerty 8 months ago
What happens if 2 consecutive predictions continue to disagree on a specific word? Do you pick one of the options at random? Or does the sentence starting at that word never become confirmed?
@EfficientNLP 8 months ago
Generally, the predictions change up to a certain point, after which they no longer change based on additional inputs, and then they are confirmed. If this never occurs, then I guess it will need to handle this edge case in some way, such as picking randomly, but this should not happen often.
@atomicfang 2 months ago
really great explanation...
@sethhammock3602 1 month ago
awesome explanation
@yasinsharifbeigy7238 3 months ago
Thank you, I really enjoyed your explanation. Would it be possible to explain and introduce models designed specifically for streaming?
@EfficientNLP 3 months ago
Not sure if this is what you're asking about, but I have a video about Whisper fine-tuning that explains the architecture of the Whisper model as well!
@pedroprobst5230 7 months ago
Thank you. Are you using faster-whisper as your backend? I'm trying to achieve something similar but with whisper.cpp.
@EfficientNLP 7 months ago
This method should work for any backend, but only faster-whisper is supported in the current implementation of whisper-streaming. Some modification will be required to make it work for whisper.cpp.
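For anyone exploring other backends: the key thing the streaming wrapper needs from a backend is a transcript with word-level timestamps, so that confirmed words can be located and the audio buffer trimmed. Below is a rough sketch of that call using faster-whisper (the helper function name is hypothetical; the faster-whisper arguments shown are standard). A whisper.cpp port would need to expose equivalent word timings.

```python
import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

def transcribe_with_word_times(audio: np.ndarray, prompt: str = ""):
    """audio: 16 kHz mono float32 samples. Returns a list of (word, start_s, end_s)."""
    segments, _info = model.transcribe(
        audio,
        language="en",
        initial_prompt=prompt,   # previously confirmed context (cf. the 6:05 section)
        word_timestamps=True,    # needed so a streaming wrapper can trim confirmed audio
    )
    words = []
    for seg in segments:
        for w in seg.words:
            words.append((w.word, w.start, w.end))
    return words
```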
@pedroprobst5230 7 months ago
@@EfficientNLP Interesting; thank you very much.
@Bub_s69 5 months ago
Did you end up figuring it out with whisper.cpp?
@pinkmatter8488 3 months ago
I'm trying to figure out how to take this module and integrate it into a real-time audio pipeline, just like your project. I'm kind of lost right now and would love a bit of feedback on your process.
@EfficientNLP 3 months ago
Sure, if you DM me on LinkedIn, I'd be happy to chat about it.
@pinkmatter8488 3 months ago
@@EfficientNLP Thanks, will do.
@wolpumba4099 8 months ago
I would like something like your Voice Writer, but instead of outputting text it should output speech. It should remove my grammar mistakes and accent but copy my intonation. Do you think this is possible at this time? I can't find good text-to-speech or voice cloning models.
@EfficientNLP 8 months ago
This sounds quite different from what I'm building with Voice Writer. I've not looked at voice cloning models before, so I'm not sure of their feasibility, but it's a good and potentially useful project idea.
@gpminsuk 7 months ago
Thanks for the video! This is a great technique, and I am thinking of using it for our application. I have one question: when words are confirmed, why don't you feed only the remaining audio (excluding the part covered by the confirmed words), with the confirmed text as the initial prompt? Would that be a lot faster when a sentence is really long, or on smaller devices like an SBC?
@EfficientNLP 7 months ago
The main issue is that Whisper is trained on audio that starts at the beginning of a sentence, so feeding it audio that begins in the middle of a sentence would be out of distribution. Your suggestion would be more efficient, but it may lead to a degradation in transcript quality.
@AmirMahmoudi-je2pu 6 months ago
Nice video and great Voice Writer. I have tried implementing it with the Transformers.js package and its Whisper model, but no luck yet since the processing is heavy.
@EfficientNLP 6 months ago
There are a number of things you can do to speed up the Whisper model. Some backends are more optimized depending on your hardware; faster-whisper is a popular one. You can also try smaller models: "base" is a good tradeoff that sacrifices some quality for better performance.
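As a concrete starting point (in Python with faster-whisper rather than Transformers.js, and only a sketch of the suggestions above, not the Voice Writer code): a smaller model, int8 quantization, and greedy decoding.

```python
from faster_whisper import WhisperModel

# "base" instead of "large-v2", with int8 quantization on CPU: often several
# times faster, at some cost in accuracy.
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("speech.wav", beam_size=1)  # beam_size=1 = greedy decoding
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```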
@ouicalsirgiane1610 1 month ago
Where can I find the dataset?
@EfficientNLP 1 month ago
There is no dataset mentioned in this video, but the Whisper paper provides details about how the Whisper model is trained, if that's what you're asking.
@jacemc9852 4 months ago
What latencies can be expected with whisper-streaming? I'd like to know what to expect before going down that route.
@EfficientNLP 4 months ago
Latency depends on various factors such as your hardware, model size, and options like minimum chunk size; the paper reports latency results between 3-6 seconds depending on configuration.
@jacemc9852 3 months ago
@@EfficientNLP Thanks for your reply. I require an order of magnitude less latency for my application, which I currently get with Azure. Whisper ASR is not ready for me to jump in quite yet. Cheers
@johnny1tap 1 month ago
I'm curious: you said simply breaking the audio into 30-second chunks may break a word up, and that Whisper is trained on 30-second chunks. So what is it doing that's special when you just run it normally on a large file, rather than chunking it yourself into 30-second pieces?
@EfficientNLP 1 month ago
Naively splitting a longer recording into fixed 30-second chunks may cut right in the middle of a word. Whisper cannot process segments longer than 30 seconds at a time, but the method described in this video uses some special logic to avoid splitting in the middle of a word.
@johnny1tap 1 month ago
@@EfficientNLP Yeah, I get that. What I'm saying is: why does Whisper work when you just call it on a large file? What is its method for avoiding the word-break problem?
@EfficientNLP 1 month ago
Ah, yes, the problem is similar. The main Whisper model only works on 30 seconds at a time, but many libraries automatically support chunking to process longer files, using various methods to avoid splitting in the middle of words. One difference, though, is that in Whisper streaming, there is an additional requirement of low latency, so the method is not exactly the same.
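To make that concrete, here is a heavily simplified sketch of the seek-based loop that long-form front-ends (including the vanilla openai-whisper CLI) use internally: each window is 30 seconds, but the next window starts where the last complete timestamped segment ended, so words are not cut in half. The `run_whisper_on_window` callable is a stand-in for a single Whisper pass, not a real API; see the openai-whisper `transcribe()` source for the actual logic.

```python
CHUNK_SECONDS = 30
SAMPLE_RATE = 16000
CHUNK_SAMPLES = CHUNK_SECONDS * SAMPLE_RATE

def transcribe_long(audio, run_whisper_on_window):
    """audio: 1-D array of 16 kHz samples.
    run_whisper_on_window(window) -> [(start_s, end_s, text), ...], times relative to the window."""
    seek = 0
    results = []
    while seek < len(audio):
        window = audio[seek:seek + CHUNK_SAMPLES]
        segments = run_whisper_on_window(window)
        if not segments:
            seek += CHUNK_SAMPLES  # nothing recognized in this window; just move on
            continue
        offset_s = seek / SAMPLE_RATE
        results.extend((offset_s + s, offset_s + e, text) for s, e, text in segments)
        # Advance to the end of the last predicted segment rather than a fixed
        # 30 s, so the next window starts at a segment boundary, not mid-word.
        advance = int(segments[-1][1] * SAMPLE_RATE)
        seek += max(advance, 1)  # guard against a zero-length advance
    return results
```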
@johnny1tap 1 month ago
@@EfficientNLP What I'm saying is: when you run Whisper on a big file in a completely vanilla way, like 'cmd:> whisper myAudio.mp3 --model medium', it somehow does it all correctly and avoids the word-break problem. Right? I'm assuming that, and it may be an incorrect assumption. But when you chunk it into 30-second pieces yourself, you will definitely run into the word-break problem. We know it works on 30 seconds under the hood, yet if we chunk manually in a simple way we get word breaks, while running it vanilla as intended, with no extra libraries or anything, causes no problems. Why?
@ouicalsirgiane1610 1 month ago
I need to train the model on other languages, such as Arabic, and show the details of the result.
@EfficientNLP 1 month ago
The Whisper model supports many languages, including Arabic. Additionally, for fine-tuning Whisper, I have a video that is more relevant: kzbin.info/www/bejne/gHnCaGuBorVnkM0
@lydiakrifka-dobes3710 11 days ago
Audio AI specialist needed: a Whisper AI / Google Colab specialist, 22.00-23.00 New York time, paid gig. I hope I can post this here. I desperately need help with a task I waited too long to complete: a 2-minute audio file in several languages must be segmented into words and phonemes. The languages are endangered. Other tools can also be used; tricks and help appreciated. Reposting for a friend. Maybe you know someone.
@EfficientNLP 10 days ago
Interesting problem. You should probably try Whisper to get word segment boundaries if the languages are close to the 100 or so supported by Whisper; otherwise, a different approach is to try forced alignment tools.