How the Gemma/Gemini Tokenizer Works - Gemma/Gemini vs GPT-4 vs Mistral

Views: 1,842

A day ago

Comments: 10

@Aberger789 10 months ago

Well, it's 2am, and I can't wait to watch your other videos. I am building some RAG implementations with scientific journals from PDF, and feeling like I'm going in circles. Taking a step back and considering the bigger concepts is helping. Great format for learning, I really appreciate your time!

@chrishayuk 10 months ago

Glad you're enjoying it. You might wanna check out my RAG video, and listen to my stoopid poems.

@reza2kn 11 months ago

This is wonderful! The dataset alone is super useful to have, and the video walkthrough was really awesome for someone who's just trying to understand what's what here :D Please keep on doing what you're doing! One thing I have been interested in is visualizing the entire vocabulary inside a tokenizer to actually see what's inside, but in an easy-to-explore way. I tried word clouds and they didn't work at all. Do you have any ideas? I'm also super interested in fine-tuning models to teach them another language and in using agents, but not just looking at code for 30 mins. Specific, real-world use cases with applied examples. I think YouTube is really lacking that at the moment. P.S.: Cool glasses :)

@chrishayuk 11 months ago

Thank you, glad it's useful. You might find my next video on embeddings useful for visualization (no spoilers :). As for fine-tuning, I recently downloaded a lot of English-Welsh translations and was planning to do a video on that. I was going to use llama2-7b, as I know it doesn't do Welsh. I might do it with Gemma, but I'm not sure if it does Welsh already. Regardless, I'll be doing a language fine-tune video soon.

@cybermanaudiobooks3231 11 months ago

Great video. Companion piece to Andrej Karpathy's most recent. Very insightful. Thanks!

@chrishayuk 11 months ago

Thank you, glad it’s useful. This is a video I’ve been trying to get right for a while.

@smithnigelw 11 months ago

Thanks Chris. Very interesting how they have chosen the vocabulary. For representing Python programs, how do they tokenise the whitespace? I’m looking forward to the video on embeddings.

@chrishayuk 11 months ago

It's a similar approach to llama, because not every language separates words using whitespace. I'll maybe cover that in a future video. I will update the programming languages in the dataset; I didn't have time to merge all the other versions back in (where Python was covered).