Rasa Algorithm Whiteboard - BytePair Embeddings

5,775 views

Rasa

A day ago

BytePair embeddings are a really cool idea. They can be seen as a lightweight variant of FastText: they need less memory because they are more selective about which subtokens they remember. That selectivity also makes them useful in certain scenarios, since they can ignore some subwords entirely. They're also available in 275 languages!
If you want to see the Rasa NLU examples repo, go here:
github.com/Ras...
If you want to see the Whatlies repo for these embeddings, go here:
rasahq.github....
If you want to see the BPEmb repo, go here:
nlp.h-its.org/...

Comments: 14
@faangsde 4 years ago
This channel is a gold mine! Thank you very much for sharing your insights!
@87456100 4 years ago
Great video! I think the video description lacks a word in the sentence "They need way memory..."?
@distrologic2925 1 year ago
they were masking that word to test your listening comprehension
@distrologic2925 1 year ago
How does the algorithm know when to stop merging tokens?
@alanliang9538 1 year ago
Thanks bro, best explanation I can find.
@wibulord926 2 years ago
Thanks for the useful tutorial.
@piyalikarmakar5979 2 years ago
Thanks, sir. One question: what's the difference between byte pair and WordPiece tokenization?
@RasaHQ 2 years ago
(Vincent here) Great question! My impression is that they are very similar in practice, but that the way they merge letters is slightly different. I could be wrong, but I think WordPiece uses a likelihood heuristic while BytePair uses counts.
@shahzadmalik96 4 years ago
First comment
@gorgolyt 3 years ago
congratulations, you win nothing.