What makes LLM tokenizers different from each other? GPT4 vs. FlanT5 vs. Starcoder vs. BERT and more

15,907 views

Jay Alammar

A day ago

Comments: 16
@vanerk_ 8 months ago
Mr. Alammar, your post with the gpt2 explanation is great; I frequently return to it because it is very detailed and visual. A lot of time has passed, and it would be awesome to see the same kind of post explaining more modern LLMs such as llama 2 (for instance). I wish I could read an explanation of the "new" activations, norms, and embeddings used in modern foundation models. Looking forward to such a post!
@manuelkarner8746 1 year ago
Very nice video, thanks. A video on Galactica would be awesome!
@bibekupadhayay4593 1 year ago
@Jay, this is super cool, and exactly what I was waiting for. Thank you so much for this video. Please keep up the good work :)
@HeartWatch93 6 months ago
Such a fascinating topic, thank you!
@Ali_S245 11 months ago
Amazing video! Thanks Jay
@msfasha 11 months ago
Brilliant, unexpected insights!
@ssshukla26 1 year ago
Great video 😊
@kerryxueify 1 year ago
Great video! It would be great if you could also explain how to know whether a token is a name, a date of birth, and so on.
@stephanmarguet 6 months ago
Very nice and helpful. How is ambiguity resolved? How does a tokenizer choose between (toy example) "t abs" vs. "tab s"?
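BPE-style tokenizers resolve this deterministically: the learned merge rules are applied in a fixed priority order (WordPiece instead greedily matches the longest subword first), so a given string always maps to the same token sequence for a given tokenizer. A minimal sketch, assuming the Hugging Face transformers library is available, to inspect the splits a tokenizer actually picks:

# Minimal sketch (assumes the Hugging Face `transformers` package is installed).
# The split is fully determined by the tokenizer's learned rules: the same
# string always produces the same tokens for a given tokenizer.
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")             # BPE: merges applied in learned priority order
bert_tok = AutoTokenizer.from_pretrained("bert-base-cased")  # WordPiece: greedy longest-match-first

for word in ["tabs", "tokenization", "unbelievable"]:
    print(word, gpt2_tok.tokenize(word), bert_tok.tokenize(word))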
@map-creator 1 year ago
Colab link please?
@SatyaRao-fh4ny 9 months ago
I think it is unfortunate that the word 'model' is used so often, everywhere, that it becomes difficult to understand what it means. E.g., is it LLM "tokenizer foo" or LLM "model foo"? Are they the same? Is bert-base-cased a "model" (if so, what does that mean?), or a "tokenizer" that has N tokens in its dictionary? Another point that is a bit fuzzy: a "model" that uses a particular tokenizer must "know" what those tokens are, and must have a corresponding embedding for every token the tokenizer supports. So speaking of tokenizers in isolation, without the downstream "model" (?) that is tied to the tokenizer, is a bit confusing. I am still unclear on the flow of tokenizer->embeddings->output-vector->some-decoder etc...
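One way to see the relationship: a checkpoint name like bert-base-cased refers to both artifacts at once, a tokenizer (the vocabulary and splitting rules) and a model (the weights, whose input embedding table has one row per token ID in that vocabulary). A minimal sketch, assuming the Hugging Face transformers library and PyTorch, tracing text -> tokens -> IDs -> embeddings -> output vectors:

# Minimal sketch (assumes the Hugging Face `transformers` package and PyTorch).
# "bert-base-cased" names a checkpoint that bundles both a tokenizer and a model;
# the model's embedding table has one row per token ID in the tokenizer's vocabulary.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # vocabulary + splitting rules
model = AutoModel.from_pretrained("bert-base-cased")          # weights, including the embedding table

inputs = tokenizer("Tokenizers split text into subwords", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))  # text -> subword tokens

outputs = model(**inputs)
print(tokenizer.vocab_size)                       # N tokens in the tokenizer's dictionary
print(model.get_input_embeddings().weight.shape)  # (vocab_size, hidden_size): one embedding per token ID
print(outputs.last_hidden_state.shape)            # (1, num_tokens, hidden_size): one output vector per token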
@mustafanamliwala7772 1 year ago
Colab link please?
@amortalbeing 7 months ago
Thanks a lot, doctor, but you are a bit too close to the screen. Would you move back a bit? 😅
@AI_ML_DL_LLM 8 months ago
So GPT-4 is the best, right?
@whoami6821 9 months ago
Could you share the notebook link?
@ML-ki6cp 6 months ago
Too close to the screen