how the tokenizer for gpt-4 (tiktoken) works and why it can't reverse strings

2,208 views

Chris Hay

A day ago

chris breaks down the chatgpt (gpt-4) tokenizer (tiktoken) and shows why large language models such as gpt, llama-2 and mistral struggle to reverse words. chris looks at how words, programming languages, natural languages and even morse code are tokenized, and shows how tokenizers tend to be biased towards english and programming languages.
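The reversal problem described above can be sketched with a toy greedy longest-match tokenizer. This is a simplified stand-in for the learned BPE merges tiktoken actually uses, and the vocabulary here is invented for illustration: the point is only that a common word collapses into a couple of subword tokens, while its reversed spelling shatters into many single-character tokens the model has rarely seen together.

```python
def encode(text, vocab):
    """Toy greedy longest-match subword encoding (illustrative only,
    not the real BPE algorithm used by tiktoken's cl100k_base)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary match starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No subword matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary, chosen to cover the forward word.
vocab = {"token", "izer", "re", "verse"}

print(encode("tokenizer", vocab))  # → ['token', 'izer']
print(encode("reznekot", vocab))   # → ['re', 'z', 'n', 'e', 'k', 'o', 't']
```

Because the model consumes `['token', 'izer']` rather than nine letters, it never directly observes the character sequence it would need to emit backwards, which is one intuition for why reversal is hard.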

Comments: 6
@ernestuz
@ernestuz 4 months ago
The funny thing is, the more complete the vocabulary, the less pressure on the upper layers, so it's not only cheaper because of fewer tokens, but also in processing. I wonder if somebody has prepared a semi-handcrafted tokenizer where, let's say, the first 30K tokens come from a dictionary and the rest are generated.
@chrishayuk
@chrishayuk 3 months ago
exactly. tbh, i wouldn't be surprised if someone goes that direction
@ilyanemihin6029
@ilyanemihin6029 6 months ago
Thanks, very interesting information
@chrishayuk
@chrishayuk 6 months ago
glad it was useful
@feniyuli
@feniyuli 5 months ago
It is very helpful to understand how tokenization works. Thanks! Do you think the data we encode using tiktoken will be sent to the AI?
@chrishayuk
@chrishayuk 3 months ago
definitely not, it's all local