Build a Custom Transformer Tokenizer - Transformers From Scratch #2

14,061 views

James Briggs

How can we build our own custom transformer models?
Maybe we'd like our model to understand a less common language - how many transformer models out there have been trained on Piemontese or the Nahuatl languages?
In that case, we need to do something different. We need to build our own model - from scratch.
In this video, we'll learn how to use HuggingFace's tokenizers library to build our own custom transformer tokenizer.
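A minimal sketch of the workflow the video covers, using HuggingFace's tokenizers library (the file paths, vocab size, and special tokens below are illustrative assumptions, not the video's exact values):

from tokenizers import ByteLevelBPETokenizer

# plain-text training files - substitute your own corpus
paths = ["data/text_00.txt", "data/text_01.txt"]

# byte-level BPE is the scheme used by RoBERTa-style models
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=30_522,  # assumed size - tune to your corpus
    min_frequency=2,    # ignore very rare merge candidates
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# writes vocab.json and merges.txt (the directory must already exist)
tokenizer.save_model("my_tokenizer")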
Part 1: • How-to Use HuggingFace...
---
Part 3: • Building MLM Training ...
Part 4: • Training and Testing a...
🤖 70% Discount on the NLP With Transformers in Python course:
bit.ly/3DFvvY5
📙 Medium article:
towardsdatasci...
📖 If membership is too expensive - here's a free link:
towardsdatasci...
👾 Discord
/ discord
🕹️ Free AI-Powered Code Refactoring with Sourcery:
sourcery.ai/?u...

Comments: 28
@LokeshSharma-me5pg · 1 year ago
This video is all I need. I was searching for this content the whole day and finally found it. The internet is such a blessing and fascinates me sometimes.
@izmirdatascience6605 · 3 years ago
Thank you! Well explained!
@ajitkumar15 · 2 years ago
Thank you for this video. Just one query: does this only support tokens for Latin-script languages, or can we do it for any other language or script?
@thangphanchau4048 · 1 year ago
Hello, but can I use this tokenizer to train my own XLM-R model?
@hjpark87 · 3 years ago
Thank you for kindly explaining the video. I wonder what program you are using - it seems like you can see several option suggestions while typing. Is that JupyterLab, or something else?
@jamesbriggs · 3 years ago
I'm using VS Code with the Jupyter extension - it's very good, I'd recommend it :)
@JonRicketts-m8k · 1 year ago
I've been able to save the tokenizer locally (merges and vocab files), however when I come to initialise it (tokenizer = RobertaTokenizerFast.from_pretrained(file_path)) I get an OSError, even though both the vocab and merges files are in the directory. Any ideas why this would happen? Otherwise, great set of tutorials :)
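One thing worth checking (a hedged guess, not a confirmed diagnosis): ByteLevelBPETokenizer.save_model writes <prefix>-vocab.json and <prefix>-merges.txt when given a filename prefix, but from_pretrained looks for the exact names vocab.json and merges.txt, and it expects a directory path rather than a file path. A minimal sketch of a round-trip that should load cleanly ('my_tokenizer' is a placeholder directory):

from transformers import RobertaTokenizerFast

# the directory must contain files named exactly vocab.json and merges.txt
tokenizer = RobertaTokenizerFast.from_pretrained("my_tokenizer")
print(tokenizer("hello world").input_ids)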
@hemanthkumar-tj4hs · 2 years ago
How can I get those files from the local disk?
@fernandoandutta · 2 years ago
Hi James. Your tutorials are f... amazing. I tested this over a very small vocabulary (1K only). When I checked merges.txt (see @ 9:55), I found (ra zione), (la zione), (ca zione), and (ta zione) - four separated items which, when joined, would form (razione), (lazione), (cazione), and (tazione). Then I checked inside vocab.json and found (zione), (razione), (lazione), (cazione), and (tazione), each with its own ID. Does that mean merges.txt indicates pieces of strings that are joined to create another, larger string? Thankssss again. Sincerely, F. Andutta.
@jamesbriggs · 2 years ago
Happy you're enjoying them! Yes, that's right. One example: if the model came across a misspelling of a word, like 'atenzione', it would most likely split it into something like ['[UNK]', 'zione'] - that way it at least understands *part* of the word rather than needing to replace the whole word with '[UNK]' :)
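A hedged illustration of that fallback behaviour (the directory name and the exact splits below are assumptions - a trained BPE tokenizer uses whatever merges it actually learned):

from transformers import RobertaTokenizerFast

# placeholder directory holding the trained vocab.json and merges.txt
tokenizer = RobertaTokenizerFast.from_pretrained("my_tokenizer")

# a known word may map to one token, while a misspelling falls back
# to smaller learned pieces such as 'zione'
print(tokenizer.tokenize("attenzione"))  # e.g. ['attenzione']
print(tokenizer.tokenize("atenzione"))   # e.g. ['aten', 'zione']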
@fernandoandutta · 2 years ago
@@jamesbriggs Do you have a suggestion for comparing two documents, A with 100 sentences and B with 50 sentences? Which approach do you believe is fair for comparing two documents with different total numbers of sentences?
@jamesbriggs · 2 years ago
@@fernandoandutta I'm not sure if this is the best approach, but in the past I iteratively compared all sentences, taking the best score for each and then averaging those. I remember experimenting with logic like 'if 90% of best matches score greater than 0.8, then these are similar' too. You could try something like this, although I can't say I know what the best practices are here.
@fernandoandutta · 2 years ago
@@jamesbriggs thanksss heaps.
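A minimal sketch of the best-match-then-average idea James describes, assuming the sentence-transformers library for the sentence embeddings (the model choice is arbitrary):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def doc_similarity(sents_a, sents_b):
    emb_a = model.encode(sents_a, convert_to_tensor=True)
    emb_b = model.encode(sents_b, convert_to_tensor=True)
    # cosine similarity of every sentence in A against every sentence in B
    scores = util.cos_sim(emb_a, emb_b)
    # best match for each sentence in A, averaged over A
    return scores.max(dim=1).values.mean().item()

doc_a = ["Tokenizers split text into subwords.", "BPE merges frequent pairs."]
doc_b = ["Byte-pair encoding merges the most frequent symbol pairs."]
print(doc_similarity(doc_a, doc_b))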
@etherealshift9786 · 1 year ago
Hi James, what BPE should I use for English tokenizing? It seems like following these tokenization steps produces tokens for Latin. Thank you in advance!
@jamesbriggs · 1 year ago
Hi, you should be able to use the same tokenization steps as we did here - what makes you think this tokenizer won't work for English?
@etherealshift9786 · 1 year ago
I'm just looking at the JSON file. Turns out it works for English - sorry for wasting your time 🤝🫣
@jamesbriggs · 1 year ago
@@etherealshift9786 haha no worries, glad you figured it out!
@wilfredomartel7781 · 2 years ago
👏👏👏
@0MVR_0 · 3 years ago
Worth asking if this is truly 'from scratch' with so many imports.
@jamesbriggs · 3 years ago
By 'from scratch' I mean training a custom transformer model rather than using a pre-trained model - does 'from scratch' give the wrong impression?
@0MVR_0 · 3 years ago
@@jamesbriggs Then just say that.
@chitranshtarsoliya7378 · 3 years ago
@@0MVR_0 Bro, I don't know about you, but in ML 'from scratch' doesn't mean you aren't allowed to use libraries or frameworks.
@0MVR_0 · 3 years ago
@@chitranshtarsoliya7378 Nobody has suggested such.
@RajdeepBorgohainRajdeep · 2 years ago
@@chitranshtarsoliya7378 Not at all - if it did, your research could take decades!
@parmeetsingh4580 · 3 years ago
Getting the error below while using RobertaTokenizerFast.from_pretrained('bertius'):
"file bertius/config.json not found. The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'BertTokenizer'. The class this function is called from is 'RobertaTokenizer'."
@oguuzhansahin · 3 years ago
Did you solve the problem? I have the same problem.
@esteban.santiago · 1 year ago
Did you solve it? I have the same issue!
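The thread never resolves this, but here is a hedged sketch of one common fix (assuming the saved vocab.json and merges.txt really are RoBERTa-style BPE files): the class-mismatch warning usually means the checkpoint's tokenizer_config.json records a different tokenizer class, so loading with the intended class and re-saving rewrites the config to match ('bertius' is the directory from the comment):

from transformers import RobertaTokenizerFast

# load with the class that matches the vocab.json/merges.txt files,
# then re-save so tokenizer_config.json records RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("bertius")
tokenizer.save_pretrained("bertius")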