Very comprehensive overview of speech recognition!
@fbrault7 · 7 years ago
47:58 guy caressing his friend's head
@evil1717 · 7 years ago
lmao
@nutelina · 7 years ago
You have not been paying attention to the talk. -points! ;)
@95guignol · 7 years ago
Andrew Ng first row
@AnirbanKar4294 · 4 years ago
I was about to ask that in the comments
@susmitislam1910 · 3 years ago
He himself gave one of the previous lectures that day, so I would've been surprised if he weren't there.
@giannagiavelli5098 · 7 years ago
At Noonean we use holographic recognition, which takes just a few minutes to train with a 2-GPU, 50-TFLOP Noonean cube and gives us one degree of recognition. We can encode a few hundred features into a single holographic plane (for vision it's 4k x 3k x 200; for speech it's a similar number, but the dimensions are more square). That still eats up 2.5 billion neurons, and for depth in vision we use two, so that's 5 billion of our 8 billion possible neurons. So doing just vision or speech eats up one full cube's processing power today. We use about five machines to have dozens of holographic planes for shapes, texture, hair, furniture, animals, etc., but speech takes just 100 teraflops of work, or two machines: one for about a dozen trained speech holograms and one for the ontic processor. So we might take the word "Hello" and train it with 1000 different native speakers saying it, with positive holographic reinforcement and diminishment. Training takes about five minutes. Then we work on 5000 words for basic English, including as parts of sentence fragments, and we get training done in about a day, though it's actually just a few hours of compute time. The thought of having to use 60-GPU clusters to achieve a week of training time is just ridiculous and backwards. If we had a Noonean supercube of 64 cubes delivering 3 PFLOPS, our training time would be milliseconds. This works for visual or speech features. A second optimization with our ontic (concept/language) processor fixes the porkchop/portshop issues. As the ontic relationships are pre-created, getting a score between options is very fast. However, a 1000-watt desktop machine is still far too big and heat-generating for embedded android brains, so we still struggle, knowing the technology to build full cybertrons is at least 8 years off unless the low-watt synapse-type hardware scales better (maybe the new Google breakaway team can help us get it!). Holographic recognition and its peculiar reinforcement patterns work especially well for vision problems, but they also apply to audio if you think of spatial distortion and binaural classification of sounds. Our hope is that the 10-15 Noonean cubes it would take for vision, speech, and thought will in 8 years become one large desktop machine, and in another few years a small embeddable machine. Our standard Noonean cube, which is not for vision, is a 2k^3, 8-billion-neuron fully interconnected unit. We use both neural Darwinism and dynamic new-synapse association creation on proximal excited areas. So it is more cognitive-science brain-modeling based than machine-learning CNN based.
@unabonger777 · 6 years ago
can it do paragraph breaks?
@taufiquzzamanpeyash6008 · 3 years ago
Decoding Techniques start at 41:34
@deepakbabupr1173 · 5 years ago
A good overview of DL-based speech recognition. It's ironic that the machine transcription of this video keeps decoding "Baidu" as "I do". Where's Google's CTC?
@dianaamiri9520 · 4 years ago
Thank you for sharing this video. It helped me grasp the whole idea quickly.
@nutelina · 7 years ago
Wow, what a great talk. A little light on the math explanations, as is typical of American universities, but great overall. Well done, thank you.
@siddharthkotwal8823 · 7 years ago
Andrew Ng walked in late at 0:13.
@diyuanlu6107 · 7 years ago
At around 33 minutes, the speaker talks about summing over all the possible alignments "c" given one input sequence "x" to get the final P(y|x). How do you get all the possible "c"s? Isn't it the case that at every time step the softmax layer outputs a probability distribution over all possible characters, so after "t" steps you get an output matrix "O" of size t x 27 (26 letters + blank)? If you take argmax(O, axis=1), you only get the single most probable transcription sequence "c". How can you get all the possible "c"s?
@opinoynated · 7 years ago
I think, as stated in the paper, during training you can programmatically generate the possible c's by first enumerating all possible alignments and then generating the possibilities from those alignments (www.cs.toronto.edu/~fritz/absps/RNN13.pdf).
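To make the "sum over all alignments" concrete, here is a minimal brute-force sketch in Python: a toy two-letter alphabet plus blank, made-up softmax probabilities, and exhaustive enumeration of frame-level paths. Real CTC implementations use the forward-backward recursion rather than enumeration; this only illustrates the quantity being summed and why the per-frame argmax path is not the whole story.

```python
import itertools
import numpy as np

BLANK = "_"
alphabet = [BLANK, "a", "b"]

# Toy softmax output for T = 3 frames; columns follow `alphabet` (blank, a, b).
probs = np.array([
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
])

def collapse(chars):
    """CTC collapse: merge repeated characters, then drop blanks."""
    merged = [c for i, c in enumerate(chars) if i == 0 or c != chars[i - 1]]
    return "".join(c for c in merged if c != BLANK)

def ctc_prob(target):
    """P(target | x): sum the probability of every alignment that collapses to `target`."""
    total = 0.0
    for path in itertools.product(range(len(alphabet)), repeat=len(probs)):
        if collapse([alphabet[k] for k in path]) == target:
            total += np.prod([probs[t, k] for t, k in enumerate(path)])
    return total

# The single best-path transcription the question describes (argmax per frame):
greedy = collapse([alphabet[k] for k in probs.argmax(axis=1)])
print(greedy)           # "a"
print(ctc_prob("a"))    # sums alignments like "a__", "_a_", "aa_", "aaa" - not just the argmax path
print(ctc_prob("ab"))   # alignments like "a_b", "aab", "abb" also contribute
```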
@chackothomas8757 · 5 years ago
Are you not using any windowing (Hamming, Hanning, etc.) on the speech frames to smooth them before calculating the spectrogram?
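For reference, a minimal sketch of the per-frame Hamming windowing the question refers to, applied before the FFT. The 25 ms frame and 10 ms hop at 16 kHz are common defaults, not values taken from the talk.

```python
import numpy as np

def spectrogram(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Magnitude spectrogram with a Hamming window applied to each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)           # 160 samples at 16 kHz
    window = np.hamming(frame_len)                   # tapers frame edges to reduce spectral leakage
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)                          # shape: (num_frames, frame_len // 2 + 1)

# Example: one second of a 440 Hz tone at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 201)
```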
@srijonishampriti3473 · 4 years ago
What is the basic difference between Deep Speech and Deep Speech 2 in terms of model architecture?
@806aman · 2 years ago
Hi Lex, can you help me with an English-to-Mandarin dataset?
@deepamohan3157 · 7 years ago
Is the architecture similar for keyword spotting, where the input is a text query? Does this work for Indian languages?
@megadebtification · 7 years ago
Ghost at 47:57-47:58, third row on the left, in front of the mic.
@fanjerry8100 · 7 years ago
So, are the closed captions for this video auto-generated or manually generated?
@fedsummer90 · 5 years ago
Thanks!!!
@dragonnaturallyspeakingsup8959 · 5 years ago
nice...
@kenichimori8533 · 7 years ago
Compophorism0
@arunnambiar2315 · 4 years ago
Speak louder next time
@arjunsinghyadav4273 · 1 year ago
Coming back to this problem today: yes, the LLMs solved it.
me: transcribe this for me please: hhhhhhhheeee lllll ooooo iiiii
ChatGPT: "Hello."
me: transcribe this for me please: primi miniter nerner modi
ChatGPT: "Prime Minister Narendra Modi."