Рет қаралды 2,341
chris breaks down the chatgpt (gpt-4) tokenizer and shows why large language models such as gpt, llama-2 and mistral struggle to reverse words. chris looks at how words, programming languages, different languages and even how morse code is tokenized, and shows how tokenizers tend to be biased towards english languages and programming languages,