how the tokenizer for gpt-4 (tiktoken) works and why it can't reverse strings

Views: 2,341

Chris Hay

days ago

chris breaks down the chatgpt (gpt-4) tokenizer and shows why large language models such as gpt, llama-2 and mistral struggle to reverse words. chris looks at how words, programming languages, different natural languages and even morse code are tokenized, and shows how tokenizers tend to be biased towards english and programming languages.
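A rough way to see the reversal problem is the toy sketch below. It is not tiktoken: the vocabulary `TOY_VOCAB` is made up for illustration, and greedy longest-match stands in for real byte pair encoding. It only shows that a word the model sees as two tokens can turn into ten unrelated tokens once it is reversed.

```python
# Toy illustration only: TOY_VOCAB is hypothetical and is NOT the real
# GPT-4 vocabulary; greedy longest-match is a simplification of BPE.
TOY_VOCAB = ["straw", "berry", "the", "s", "t", "r", "a", "w", "b", "e", "y", "h"]


def tokenize(text, vocab):
    """Split text into the longest matching vocabulary entries, left to right."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest entry that matches here; otherwise fall back
        # to the single character at this position.
        match = next((v for v in by_length if text.startswith(v, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens


word = "strawberry"
tokens = tokenize(word, TOY_VOCAB)
reversed_tokens = tokenize(word[::-1], TOY_VOCAB)

print(tokens)           # what the model reads
print(reversed_tokens)  # what the model would have to write
```

The model never sees the individual letters of `straw` or `berry`, yet a correct reversal requires producing them one by one, in an order that shares no tokens with the input.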

Comments: 6

@feniyuli 6 months ago

It is very helpful to understand how the tokenization works. Thanks! Do you think data that we encode using tiktoken will be sent to the AI?

@chrishayuk 5 months ago

definitely not, it's all local

@ilyanemihin6029 8 months ago

Thanks, very interesting information

@chrishayuk 7 months ago

glad it was useful

@ernestuz 6 months ago

The funny thing is that the more complete the vocabulary, the less pressure on the upper layers, so it's cheaper not only because of fewer tokens but also in processing. I wonder if somebody has prepared a semi-handcrafted tokenizer where, let's say, the first 30K tokens come from a dictionary and the rest are generated.

@chrishayuk 5 months ago

exactly. tbh, i wouldn't be surprised if someone goes that direction
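One way the semi-handcrafted idea from this thread could look is sketched below. It is a hypothetical illustration, not an existing tokenizer: `build_hybrid_vocab`, its parameters, and the tiny corpus are all assumptions. The dictionary words are placed first in the vocabulary, and the remaining entries are learned with BPE-style pair merges over the words the dictionary does not cover.

```python
from collections import Counter


def build_hybrid_vocab(dictionary_words, corpus, num_learned):
    """Hypothetical sketch: handcrafted dictionary entries first,
    then tokens learned by BPE-style merges on uncovered words."""
    vocab = list(dict.fromkeys(dictionary_words))  # keep order, drop repeats
    seen = set(vocab)

    # Words outside the dictionary start out as single characters.
    sequences = [list(w) for w in corpus.split() if w not in seen]
    for ch in sorted({c for seq in sequences for c in seq}):
        if ch not in seen:
            vocab.append(ch)
            seen.add(ch)

    for _ in range(num_learned):
        # Count adjacent pairs across all uncovered words.
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + b

        # Replace every occurrence of the chosen pair with the merged token.
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences

        if merged not in seen:
            vocab.append(merged)
            seen.add(merged)
    return vocab


vocab = build_hybrid_vocab(
    dictionary_words=["the", "cat"],
    corpus="the cat tokenizer tokenizer tokens",
    num_learned=2,
)
print(vocab)
```

In a real setting the dictionary part would be the 30K or so entries mentioned above and the merge budget would be far larger; the sketch only shows how the two parts could share one vocabulary.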
