Enhanced Speech Endpoint Detection with Fine-Tuned AI

110 views

Linguflex

1 day ago

Comments: 5
@daviruela2055 4 days ago
FYI I am a huge fan of your work. Thank you!
@42ndMoose 4 days ago
At 0:38, regarding the mention of DistilBERT and how it caused a chunk to be omitted: will there potentially be a way to input new words for zero-shot fine-tuning? At 2:25 I noticed that the model decided to omit the word "like". This is very good for voice agents, but there are some use cases where it is necessary to include certain misspoken words (needed for an accurate transcript or additional nuance), so I suggest a threshold slider, perhaps? Will it ever be able to focus on one speaker if another speaker talks over the first one? Or can it take account of the second speaker at the same time? Or will it always assume a single speaker?
@Linguflex 4 days ago
Good observations. At 0:38 it seems like the sentence got detected too quickly. It needs some deeper analysis of those edge cases to figure out why the algorithm cuts into the sentence. You can fine-tune Whisper to handle new words and then use the fine-tuned models directly with RealtimeSTT: huggingface.co/blog/fine-tune-whisper

For 2:25, this looks like a Whisper quirk. Whisper tends to remove filler words like "ah" or repetitions. It might improve if I tweak the initial_prompt behavior to use it only for the real-time model while removing it from the final transcription. Whisper tends to lose some accuracy when prompted, but in this case it's necessary for real-time processing. Unfortunately, there's no simple parameter to make Whisper more accurate with filler words. If that's critical, you'd need to switch to a different ASR model; there are some that deliver precise transcriptions even of filler words, though that brings its own set of pros and cons.

Regarding speaker handling, Whisper doesn't support speaker diarization natively; it always assumes a single speaker. Real-time speaker diarization is a whole other rabbit hole. Not impossible, but very complex to pull off.
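A minimal sketch of what using a fine-tuned Whisper model with RealtimeSTT could look like, assuming the checkpoint has already been converted to the CTranslate2 format that faster-whisper loads; the model path is a placeholder, and the initial_prompt and realtime_model_type parameters should be checked against the installed RealtimeSTT version:

```python
# Sketch only: feeding a fine-tuned Whisper checkpoint to RealtimeSTT.
# Assumes the checkpoint was converted to CTranslate2 format (what
# faster-whisper loads) and that the installed RealtimeSTT version
# exposes the parameters below; the model path is a placeholder.
from RealtimeSTT import AudioToTextRecorder

def on_partial(text):
    # Live transcription updates from the small real-time model.
    print("partial:", text)

if __name__ == "__main__":
    recorder = AudioToTextRecorder(
        model="path/to/fine_tuned_whisper_ct2",  # final-pass model (placeholder path)
        realtime_model_type="tiny.en",           # fast model for live updates
        enable_realtime_transcription=True,
        on_realtime_transcription_update=on_partial,
        initial_prompt=None,                     # leave the final pass unprompted, per the reply's idea
        language="en",
    )
    print("final:", recorder.text())             # blocks until speech ends, returns final transcription
```

Prompting only the real-time pass while leaving the final pass unprompted, as suggested in the reply, would need the library to accept separate prompts for the two models.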
@jayr7741 2 days ago
Does it support multiple languages?
@Linguflex 2 days ago
Nope, the dataset I used to train the classification model only has English sentences.
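A hedged sketch of how a DistilBERT-style sentence-completeness classifier like the one described in the video might be queried; the checkpoint name below is a placeholder rather than the actual model, and supporting other languages would mean retraining on non-English sentences (for example starting from a multilingual DistilBERT base):

```python
# Sketch only: querying a DistilBERT-style "is this sentence finished?"
# classifier via transformers. The model id is a hypothetical placeholder;
# an English-only training set means other languages would need a retrained
# model (e.g. starting from distilbert-base-multilingual-cased).
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-username/sentence-finished-classifier",  # placeholder checkpoint
)

for text in ["So I went to the", "So I went to the store."]:
    result = classifier(text)[0]
    print(f"{text!r} -> {result['label']} (score {result['score']:.2f})")
```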