1:22 Text to sequence. Step 1: tokenization (lowercase, remove stopwords, fix typos, etc.). Step 2: build a dictionary mapping each word to an index, and convert each text into a sequence of word indices. Step 3: one-hot encoding. Step 4 (3:56): every word sequence should have the same length, so either cut off the text or use zero padding (a preprocessing sketch follows these notes).
4:54 Problem with one-hot encoding: with v unique words, each one-hot vector is v-dimensional, which is too sparse (the RNN parameter dimension corresponds to the input dimension). Solution: word embedding, which maps one-hot vectors to low-dimensional vectors.
5:56 The word embedding P is a parameter matrix learned from training data. Its dimension is v x d, where v is the dimension of the original one-hot vector and d is the embedding dimension (word-vector dimension) decided by the user; d should be tuned by cross-validation.
7:11 Each row of P is a word vector corresponding to a unique word. Plotting the word vectors, words with similar sentiment should cluster together.
8:23 Keras example for word embedding: number of embedding parameters = vocabulary size v x embedding dimension d.
9:37 Keras example for classification based on logistic regression: Flatten flattens the embedding output into a 1-D array; Dense is a fully connected layer with a sigmoid activation (a model sketch follows these notes).
12:18 How to choose the number of epochs: train for 50 epochs to reach a stable validation accuracy.
14:35 Review of the Keras code and its parameter settings.
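A minimal sketch of the four text-to-sequence steps above using the standard Keras preprocessing utilities. This is only an illustration, not the lecture's actual code; the toy corpus and the num_words and maxlen values are assumptions.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy corpus (hypothetical; the lecture uses a real review dataset)
texts = ["I am a boy", "The boy plays a ball", "I play with a boy"]

# Steps 1-2: tokenize (lowercase, split) and build the word -> index dictionary
tokenizer = Tokenizer(num_words=10000, lower=True)   # keep at most the 10k most frequent words
tokenizer.fit_on_texts(texts)
print(tokenizer.word_index)                          # e.g. {'a': 1, 'boy': 2, 'i': 3, ...}

# Step 3: turn each text into a sequence of integer word indices
sequences = tokenizer.texts_to_sequences(texts)

# Step 4 (3:56): align sequences to the same length by truncating or zero-padding
maxlen = 8                                           # cut-off length chosen by the user
padded = pad_sequences(sequences, maxlen=maxlen, padding='pre', truncating='pre')
print(padded.shape)                                  # (3, 8); zeros fill the shorter sequences
```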
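And a minimal sketch of the embedding-plus-logistic-regression model from the 8:23 and 9:37 notes. The vocabulary size, embedding dimension, optimizer, and the commented training call are assumptions, but the Embedding layer's parameter count is exactly v x d as noted above.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

vocabulary = 10000        # v: number of unique words kept by the tokenizer (assumed)
embedding_dim = 8         # d: embedding dimension, a user-chosen hyperparameter (assumed)
maxlen = 8                # word-sequence length after padding / cut-off

model = Sequential([
    Input(shape=(maxlen,)),
    # Embedding parameter count = vocabulary x embedding_dim (v * d), per the 8:23 note
    Embedding(vocabulary, embedding_dim),
    # Flatten the (maxlen, embedding_dim) output into a 1-D array
    Flatten(),
    # A single sigmoid unit = logistic regression on the flattened embeddings
    Dense(1, activation='sigmoid'),
])
model.summary()           # the Embedding layer shows v * d = 80,000 parameters

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
# Hypothetical training call; x_train / y_train would be the padded sequences and labels.
# Per the 12:18 note, 50 epochs are used to reach a stable validation accuracy.
# history = model.fit(x_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
```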
Hello Professor, I have a small question. For the text-to-sequence part around 3:29, the slide heading says one-hot encoding, but if it were really one-hot encoding, the vector produced for each text should have the same dimension as the vocabulary size, yet w0 has only 52 dimensions and w5 only 90, while the vocabulary is at least in the thousands or even tens of thousands. Also, the same position in any two vectors should carry the same meaning, like the nationality categorical feature from your previous lecture: after one-hot encoding, USA = [1,0,0,...,0] and China = [0,1,0,0,0,...,0]. For example (ignoring tense for now):
1. I am a boy
2. the boy play a ball
3. I play with a boy
so the dictionary should be {I:1, am:2, a:3, boy:4, the:5, play:6, ball:7, with:8}, and text to sequence would give:
I am a boy → [1,1,1,1,0,0,0,0]
the boy play a ball → [0,0,1,1,1,1,1,0]
I play with a boy → [1,0,1,1,0,1,0,1]
I am not sure whether my understanding is wrong; please advise.
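For concreteness, a minimal sketch of the multi-hot encoding described in the question above. It only reproduces the question's interpretation (mark which dictionary words appear in each text), not necessarily what the lecture's text-to-sequence step actually outputs.

```python
# Toy dictionary from the question: word -> index (1-based)
dictionary = {"i": 1, "am": 2, "a": 3, "boy": 4, "the": 5, "play": 6, "ball": 7, "with": 8}

def encode(text):
    """Return a multi-hot vector: position k-1 is 1 if the word with index k appears."""
    vec = [0] * len(dictionary)
    for word in text.lower().split():
        if word in dictionary:
            vec[dictionary[word] - 1] = 1
    return vec

print(encode("I am a boy"))           # [1, 1, 1, 1, 0, 0, 0, 0]
print(encode("the boy play a ball"))  # [0, 0, 1, 1, 1, 1, 1, 0]
print(encode("I play with a boy"))    # [1, 0, 1, 1, 0, 1, 0, 1]
```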