How the Gemma/Gemini Tokenizer Works - Gemma/Gemini vs GPT-4 vs Mistral

  1,360 views

Chris Hay

3 months ago

In this video, we go under the hood of the tokenizer used by Gemini, gemma-7b and gemma-2b. We look at its large vocabulary and the impact that has on the size of the model, and at how Google has prioritized people, places, culture, languages and things over an efficient vocabulary and frequent sub-words. Chris also introduces his new tokenizer benchmark test, dataset and tokenizer visualizer tools.
github
---------------
github.com/chrishayuk/tokeniz...
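
To make the point about vocabulary size and model size concrete, here is a minimal sketch of how a token-embedding table grows with the vocabulary. The vocabulary sizes and the hidden size below are illustrative assumptions, not figures quoted in the video, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a token-embedding table: one vector of length hidden_size per token."""
    return vocab_size * hidden_size


# Assumed, illustrative numbers: a small and a large vocabulary at the same hidden size.
HIDDEN_SIZE = 2048
for label, vocab_size in [("32k vocabulary", 32_000), ("256k vocabulary", 256_000)]:
    params = embedding_params(vocab_size, HIDDEN_SIZE)
    print(f"{label}: {params / 1e6:.1f}M embedding parameters")
```

Under these assumptions, the larger vocabulary costs eight times as many embedding parameters at the same hidden size, which is a noticeable share of a model with only a few billion parameters.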

Comments: 10
@Aberger789 2 months ago
Well, it's 2am, and I can't wait to watch your other videos. I am building some RAG implementations with scientific journals from PDF, and feeling like I'm going in circles. Taking a step back and considering the bigger concepts is helping. Great format for learning, I really appreciate your time!
@chrishayuk 2 months ago
glad you're enjoying it, you might wanna check out my RAG video, and listen to my stoopid poems
@reza2kn 3 months ago
This is wonderful! The dataset alone is super useful to have, and the video walkthrough was really awesome for someone who's just trying to understand what's what here :D Please keep on doing what you're doing! One thing I have been interested in is visualizing the entire vocabulary inside a tokenizer to actually see what's inside, but in an easy-to-explore way. I tried word clouds and they didn't work at all. Do you have any ideas? I'm also super interested in fine-tuning models to teach them another language and in using agents, but not just looking at code for 30 mins. Specific, real-world use cases with applied examples. I think YouTube is really lacking that at the moment. P.S: Cool glasses :)
@chrishayuk 3 months ago
thank you, glad it's useful. you might find my next video on embeddings useful for visualization (no spoilers :). As for fine-tuning, I recently downloaded a lot of English-Welsh translations and was planning to do a video on that. I was going to use llama2-7b as I know it doesn't do Welsh. I might do it with Gemma but I'm not sure if it does Welsh already. Regardless, I'll be doing a language fine-tune video soon
@smithnigelw 3 months ago
Thanks Chris. Very interesting how they have chosen the vocabulary. For representation of programs in Python, how do they tokenise the white-space? I’m looking forward to the video on embedding.
@chrishayuk 3 months ago
it's a similar approach to llama, because not every language separates words using whitespace. I'll maybe cover that in a future video. I will update the programming languages in the dataset, I didn't have time to merge all the other versions back in (where python was covered)
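
As a rough illustration of the "similar approach to llama" mentioned in the reply: Llama's tokenizer is built on SentencePiece, which treats a space as an ordinary symbol by swapping it for the marker "▁" (U+2581) before tokenizing, so text can be restored exactly afterwards. The sketch below shows only that escaping step, under the assumption that the same idea applies here. It is not the actual tokenizer, and `escape_whitespace` and `unescape_whitespace` are hypothetical helper names.

```python
WHITESPACE_MARKER = "\u2581"  # "▁", the symbol SentencePiece substitutes for a space


def escape_whitespace(text: str) -> str:
    """Replace each space with the marker so whitespace becomes part of the token stream."""
    return text.replace(" ", WHITESPACE_MARKER)


def unescape_whitespace(text: str) -> str:
    """Reverse the escaping, restoring the original spaces."""
    return text.replace(WHITESPACE_MARKER, " ")


source = "def f():\n    return 1"
escaped = escape_whitespace(source)
print(escaped)  # the four indentation spaces appear as four markers

# The round trip is lossless, so Python indentation survives.
assert unescape_whitespace(escaped) == source
```

How runs of indentation are then grouped into tokens depends on the learned vocabulary, which this sketch does not model.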
@garyhamilton2104 3 months ago
Commenting cuz I know Chris will give me a heart :)
@chrishayuk 3 months ago
because i love you all
@cybermanaudiobooks3231 3 months ago
Great video. Companion piece to Andrej Karpathy's most recent. Very insightful. Thanks!
@chrishayuk 3 months ago
Thank you, glad it’s useful. This one was a video I’ve been trying to get right for a while