Very comprehensive overview of speech recognition!
@fbrault7 · 7 years ago
47:58 guy caressing his friend's head
@evil1717 · 7 years ago
lmao
@nutelina · 7 years ago
You have not been paying attention to the talk. -points! ;)
@95guignol · 7 years ago
Andrew Ng first row
@AnirbanKar4294 · 4 years ago
I was about to ask that in the comments
@susmitislam1910 · 3 years ago
He himself gave one of the previous lectures that day, so I would've been surprised if he weren't there.
@giannagiavelli5098 · 7 years ago
At Noonean we use holographic recognition, which takes just a few minutes to train with a 2-GPU, 50-TFLOP Noonean cube and gives us one degree of recognition. We can encode a few hundred features into a single holographic plane (for vision it's 4k x 3k x 200; for speech it's a similar number, but the dimensions are more square). That still eats up 2.5 billion neurons, and for depth in vision we use two, so that's 5 billion of our 8 billion possible neurons. So doing just vision or speech eats up one full cube's processing power today. We use about five machines to have dozens of holographic planes for shapes, texture, hair, furniture, animals, etc., but speech takes just 100 teraflops of work, or two machines: one for about a dozen trained speech holograms and one for the ontic processor. So we might take the word "Hello" and train it with 1000 different native speakers saying it, with positive holographic reinforcement and diminishment. Training takes about five minutes. Then we work on 5000 words for basic English, including as parts of sentence fragments, and we get training done in about a day, though it's actually just a few hours of compute time. The thought of having to use 60-GPU clusters to achieve a week of training time is just ridiculous and backwards. If we had a Noonean supercube of 64 cubes delivering 3 PFLOPS, our training time would be milliseconds. This works for visual or speech features. A second optimization with our ontic (concept/language) processor fixes the porkchop/portshop issues. As the ontic relationships are pre-created, getting a score between options is very fast. However, a 1000-watt desktop machine is still far too big and heat-generating for embedded android brains, so we still struggle, knowing the technology to build full cybertrons is at least 8 years off unless the low-watt synapse-type hardware scales better (maybe the new Google breakaway team can help us get it!). Holographic recognition and its peculiar reinforcement patterns work especially well for vision problems, but they also apply to audio if you think of spatial distortion and binaural classification of sounds. Our hope is that the 10-15 Noonean cubes it would take for vision, speech, and thought will in 8 years become one large desktop machine, and in another few years a small embeddable machine. Our standard Noonean cube, which is not for vision, is a 2k^3, 8-billion-neuron fully interconnected unit. We use both neural Darwinism and dynamic new-synapse association creation on proximal excited areas. So it is more cognitive-science brain-modeling based than machine-learning CNN based.
@unabonger777 · 6 years ago
can it do paragraph breaks?
@taufiquzzamanpeyash6008 · 3 years ago
Decoding Techniques start at 41:34
@deepakbabupr1173 · 5 years ago
A good overview of DL-based speech recognition. It's ironic that the machine transcription of this video keeps decoding "Baidu" as "I do". Where's Google's CTC?
@dianaamiri9520 · 4 years ago
Thank you for sharing this video. It helped me grasp the whole idea quickly.
@nutelina · 7 years ago
Wow, what a great talk. A little light on the math explanations, as is typical of American universities, but great overall. Well done, thank you.
@siddharthkotwal8823 · 7 years ago
Andrew Ng walked in late at 0:13.
@diyuanlu6107 · 7 years ago
At around 33 minutes, the speaker talks about summing over all the possible alignments "c" given one input sequence "x" to get the final P(y|x). How do you get all the possible "c"s? Isn't it the case that at every time step the softmax layer outputs a probability distribution over all possible characters, so after "t" steps you get an output matrix "O" of size t x 27 (26 letters + blank)? If you take argmax(O, axis=1), you only get the single most probable transcription sequence "c". How can you get all the possible "c"s?
@opinoynated · 7 years ago
I think, as stated in the paper, during training you can programmatically generate the possible c's by first enumerating all possible alignments and then generating the possibilities from those alignments (www.cs.toronto.edu/~fritz/absps/RNN13.pdf).
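To make the "sum over all alignments" concrete, here is a minimal brute-force sketch in Python: a toy two-letter alphabet plus blank, made-up softmax probabilities, and exhaustive enumeration of frame-level paths. Real CTC implementations use the forward-backward recursion rather than enumeration; this only illustrates the quantity being summed and why the per-frame argmax path is not the whole story.

```python
import itertools
import numpy as np

BLANK = "_"
alphabet = [BLANK, "a", "b"]

# Toy softmax output for T = 3 frames; columns follow `alphabet` (blank, a, b).
probs = np.array([
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
])

def collapse(chars):
    """CTC collapse: merge repeated characters, then drop blanks."""
    merged = [c for i, c in enumerate(chars) if i == 0 or c != chars[i - 1]]
    return "".join(c for c in merged if c != BLANK)

def ctc_prob(target):
    """P(target | x): sum the probability of every alignment that collapses to `target`."""
    total = 0.0
    for path in itertools.product(range(len(alphabet)), repeat=len(probs)):
        if collapse([alphabet[k] for k in path]) == target:
            total += np.prod([probs[t, k] for t, k in enumerate(path)])
    return total

# The single best-path transcription the question describes (argmax per frame):
greedy = collapse([alphabet[k] for k in probs.argmax(axis=1)])
print(greedy)           # "a"
print(ctc_prob("a"))    # sums alignments like "a__", "_a_", "aa_", "aaa" - not just the argmax path
print(ctc_prob("ab"))   # alignments like "a_b", "aab", "abb" also contribute
```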
@chackothomas8757 · 5 years ago
Are you not using any windowing (Hamming, Hanning, etc.) on the speech frames to smooth them before calculating the spectrogram?
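For reference, a minimal sketch of the per-frame Hamming windowing the question refers to, applied before the FFT. The 25 ms frame and 10 ms hop at 16 kHz are common defaults, not values taken from the talk.

```python
import numpy as np

def spectrogram(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Magnitude spectrogram with a Hamming window applied to each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)           # 160 samples at 16 kHz
    window = np.hamming(frame_len)                   # tapers frame edges to reduce spectral leakage
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)                          # shape: (num_frames, frame_len // 2 + 1)

# Example: one second of a 440 Hz tone at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 201)
```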
@srijonishampriti3473 · 4 years ago
What is the basic difference between Deep Speech and Deep Speech 2 in terms of model architecture?
@806aman · 2 years ago
Hi Lex, can you help me with an English-to-Mandarin dataset?
@deepamohan3157 · 7 years ago
Is the architecture similar for keyword spotting, where the input is a text query? Does this work for Indian languages?
@megadebtification · 7 years ago
Ghost at 47:57-47:58, third row on the left, in front of the mic.
@fanjerry8100 · 7 years ago
So, are the closed captions for this video auto-generated or manually generated?
@fedsummer90 · 5 years ago
Thanks!!!
@dragonnaturallyspeakingsup8959 · 5 years ago
nice...
@kenichimori8533 · 7 years ago
Compophorism0
@arunnambiar2315 · 4 years ago
Speak louder next time
@arjunsinghyadav4273 · 1 year ago
Coming back to this problem today: yes, the LLMs solved it.
me: transcribe this for me please: hhhhhhhheeee lllll ooooo iiiii
ChatGPT: "Hello."
me: transcribe this for me please: primi miniter nerner modi
ChatGPT: "Prime Minister Narendra Modi."