0:38 Regarding the mention of DistilBERT and how a chunk got omitted because of it: will there potentially be a way to input new words for zero-shot fine-tuning?

2:25 I noticed that the model decided to omit the word "like". This is very good for voice agents, but there are some use cases where it's necessary to include certain misspoken words (needed for an accurate transcript or additional nuance), so I suggest a threshold slider, perhaps?

Will it ever be able to focus on one speaker if another speaker talks over the first one? Or can it account for the second speaker at the same time? Or will it always assume a single speaker?
@Linguflex 4 days ago
Good observations. At 0:38 it seems the sentence got detected too quickly. It needs some deeper analysis of those edge cases to figure out why the algorithm cuts into the sentence. You can fine-tune Whisper to handle new words and then use the fine-tuned model directly with RealtimeSTT: huggingface.co/blog/fine-tune-whisper

For 2:25, this looks like a Whisper quirk. Whisper tends to remove filler words like "ah" and repetitions. It might improve if I tweak the initial_prompt behavior to use it only for the real-time model while removing it from the final transcription. Whisper tends to lose some accuracy when prompted, but in this case the prompt is necessary for real-time processing. Unfortunately, there's no simple parameter to make Whisper more accurate with filler words. If they're critical, you'd need to switch to a different ASR model; some deliver precise transcriptions even of filler words, though that brings its own set of pros and cons.

Regarding speaker handling, Whisper doesn't support speaker diarization natively; it always assumes a single speaker. Real-time speaker diarization is a whole other rabbit hole. Not impossible, but very complex to pull off.
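If you want to try a fine-tuned model, something like this should work. It's a rough, untested sketch: the checkpoint path is a placeholder, and since RealtimeSTT runs on faster-whisper under the hood, the fine-tuned checkpoint needs to be converted to CTranslate2 format first (e.g. with ct2-transformers-converter).

from RealtimeSTT import AudioToTextRecorder

# Placeholder path: your fine-tuned Whisper checkpoint, already
# converted to CTranslate2 format for faster-whisper.
FINETUNED_MODEL = "path/to/whisper-small-finetuned-ct2"

recorder = AudioToTextRecorder(
    model=FINETUNED_MODEL,           # accurate model for the final transcription
    realtime_model_type="tiny",      # smaller stock model for the live preview
    enable_realtime_transcription=True,
    language="en",
)

def handle_final_text(text):
    print("Final transcript:", text)

# Blocks until a full sentence is detected, then invokes the callback.
while True:
    recorder.text(handle_final_text)

The idea is the same split I use anyway: a small, fast model for the real-time preview and the larger (here: fine-tuned) model for the final pass.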
@jayr7741 2 days ago
Does it support multiple languages?
@Linguflex 2 days ago
Nope, the dataset I used to train the classification model only has English sentences.