Building MLM Training Input Pipeline - Transformers From Scratch #3

7,714 views

James Briggs

1 day ago

The input pipeline of our training process is the most complex part of the entire transformer build. It consists of taking our raw OSCAR training data, transforming it, and preparing it for Masked-Language Modeling (MLM). Finally, we load our data into a DataLoader, ready for training!
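As a rough outline of the steps the video covers, here is a minimal sketch of the full pipeline (the file paths, the ./filiberto tokenizer directory, and the special-token IDs are assumptions based on the rest of the series, not exact code from the video):

```python
import torch
from pathlib import Path
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('./filiberto')  # tokenizer from part 2

# 1. load raw OSCAR text and tokenize it
lines = Path('oscar_it.txt').read_text(encoding='utf-8').split('\n')
sample = tokenizer(lines, max_length=512, padding='max_length',
                   truncation=True, return_tensors='pt')

# 2. build labels/mask, then randomly mask ~15% of tokens for MLM
labels = sample.input_ids.detach().clone()
mask = sample.attention_mask
input_ids = labels.detach().clone()
rand = torch.rand(input_ids.shape)
mask_arr = (rand < 0.15) & (input_ids > 2)   # skip <s>=0, <pad>=1, </s>=2
input_ids[mask_arr] = 4                      # assumed <mask> ID in the custom tokenizer

# 3. wrap everything in a Dataset and DataLoader
class MLMDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return self.encodings['input_ids'].shape[0]
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.encodings.items()}

dataset = MLMDataset({'input_ids': input_ids, 'attention_mask': mask, 'labels': labels})
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
```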
Part 1: • How-to Use HuggingFace...
Part 2: • Build a Custom Transfo...
---
Part 4: • Training and Testing a...
📙 Medium article:
towardsdatasci...
📖 Free link:
towardsdatasci...
🤖 70% Discount on the NLP With Transformers in Python course:
bit.ly/3DFvvY5
👾 Discord
/ discord
🕹️ Free AI-Powered Code Refactoring with Sourcery:
sourcery.ai/?u...

Comments: 34
@d3v487
@d3v487 3 years ago
Hi James. One small request, if you don't mind: although you have made videos about masked-language modeling, please make one video where you fine-tune an MLM on a simple dataset and use it for a sentence-completion task (filling multiple masks only, not NLG). It would be very helpful for everyone. Love 💕 from India.
@jamesbriggs
@jamesbriggs 3 years ago
Hey Janmejay, that's an awesome idea, I'll add it to the list!
@maryjoycanon5387
@maryjoycanon5387 2 years ago
Hi, James! Thanks for adding this to your playlists. It's great for beginners in transfer learning. I tried running your scripts here and also the ones on Medium, but I encountered the following errors and can't figure out what's wrong.
1. When initializing a DataLoader, this error appears: 'list' object has no attribute 'shape'
2. On this cell:
import torch
labels = torch.tensor([x.ids for x in batch])
mask = torch.tensor([x.attention_mask for x in batch])
-> 'str' object has no attribute 'ids'
Thanks for your help!
@sarahalqaseemi571
@sarahalqaseemi571 2 years ago
Same here.
@kriteshrauniyar3395
@kriteshrauniyar3395 1 year ago
This issue occurs when we set shuffle=True: AttributeError: 'list' object has no attribute 'shape'. Any solution for this? @James
@koushikram4036
@koushikram4036 1 year ago
Any solution for this?
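For anyone hitting the same errors: as Christoph notes further down in this thread, both usually mean the tokenizer returned plain Python lists (or raw strings) instead of tensors because return_tensors='pt' was missing. A minimal sketch of the pattern that avoids them; the sample lines are placeholders:

```python
import torch
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')  # or the custom tokenizer from part 2
lines = ['first piece of text', 'second piece of text']       # placeholder raw text lines

# return_tensors='pt' makes input_ids/attention_mask tensors, not lists
sample = tokenizer(lines, max_length=512, padding='max_length',
                   truncation=True, return_tensors='pt')

labels = sample.input_ids.detach().clone()    # tensors, so .shape works in the DataLoader
mask = sample.attention_mask.detach().clone()
```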
@mwzkhalil
@mwzkhalil 2 years ago
Nice work James, I'll try it on Urdu.
@andrewspanopoulos1115
@andrewspanopoulos1115 2 years ago
Quick question: in the RoBERTa paper, didn't they encode the input using the `FULL-SENTENCES` method? That is, the input to the BERT model becomes the concatenation of many sentences, with a length of at most 512. This makes sense because if every input is only one sentence, then a lot of memory and processing power is wasted on padding. Is there a specific reason you didn't do the same here when creating the inputs?
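For context, a minimal sketch of the FULL-SENTENCES packing idea the commenter describes (this is not what the video does; the separator ID of 2 assumes RoBERTa's </s> token, and sentences is assumed to be a list of token-ID lists):

```python
def pack_full_sentences(sentences, max_len=512, sep_id=2):
    """Greedily concatenate tokenized sentences into sequences of at most max_len."""
    packed, current = [], []
    for sent in sentences:
        sent = sent[:max_len - 1]                    # clip any single over-long sentence
        if len(current) + len(sent) + 1 > max_len:   # +1 for the separator token
            packed.append(current)
            current = []
        current += sent + [sep_id]
    if current:
        packed.append(current)
    return packed
```

Packing this way wastes almost no positions on padding, which is exactly the memory argument the commenter makes.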
@LuisGarcia-bx6uf
@LuisGarcia-bx6uf 2 years ago
Hey! Great videos btw! Did you use all the files split from the corpus dataset? While creating the input_ids, labels, and mask I got a memory error on file 301. Is there any trick, or do you just have lots of RAM? Thank you again :)
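One common workaround for that kind of memory error (a suggestion, not something confirmed in the video) is to tokenize each text file separately and save the resulting tensors to disk as shards, so only one file's tensors sit in RAM at a time; the data/ and encoded/ paths are placeholders:

```python
import torch
from pathlib import Path
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

for i, path in enumerate(sorted(Path('data').glob('text_*.txt'))):
    lines = path.read_text(encoding='utf-8').split('\n')
    sample = tokenizer(lines, max_length=512, padding='max_length',
                       truncation=True, return_tensors='pt')
    # one shard per source file; nothing accumulates across iterations
    torch.save({'input_ids': sample.input_ids,
                'attention_mask': sample.attention_mask},
               f'encoded/shard_{i}.pt')
```

James's reply further down in the comments describes the matching load-from-file Dataset.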
@chitranshtarsoliya7378
@chitranshtarsoliya7378 3 years ago
Bro, can you implement this with TensorFlow?
@jamesbriggs
@jamesbriggs 3 years ago
That would be interesting, I'll add it to the list - it might be some time until I get around to it though.
@christophwindheuser3252
@christophwindheuser3252 3 years ago
Hi James, great video series! I think there is an error at 14:33: mlm(sample.input_ids) does not work, as sample.input_ids is a list and not a tensor, so I get an error message. I fixed that with:
input_ids.append(mlm(torch.tensor(sample.input_ids)))
@christophwindheuser3252
@christophwindheuser3252 3 years ago
Sorry, that was my bad! In the line:
sample = tokenizer(lines, max_length=512, padding='max_length', truncation=True, return_tensors='pt')
I had forgotten the parameter return_tensors='pt', so sample contained lists instead of tensors!
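To make that fix concrete, here is a minimal sketch of the video's random-masking step operating on a tensor (the special-token IDs 0/1/2 and the <mask> ID of 4 are assumptions matching the custom tokenizer trained in part 2; they differ for pretrained checkpoints):

```python
import torch

def mlm(input_ids, mask_token_id=4, mask_prob=0.15):
    """Randomly replace ~15% of non-special tokens with the <mask> token."""
    input_ids = input_ids.clone()
    rand = torch.rand(input_ids.shape)               # requires a tensor, hence return_tensors='pt'
    mask_arr = (rand < mask_prob) & (input_ids > 2)  # skip <s>=0, <pad>=1, </s>=2
    input_ids[mask_arr] = mask_token_id
    return input_ids
```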
@ImranDaybyDay
@ImranDaybyDay 3 years ago
This error comes up in the mlm function: 'builtin_function_or_method' object has no attribute 'shape'. Kindly tell me how to resolve it.
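That particular message usually means a tensor method was referenced without being called, so Python returns the method object itself rather than a tensor; a guess at the likely slip (not confirmed in this thread):

```python
rand = torch.rand(input_ids.clone.shape)    # wrong: .clone is a method object
rand = torch.rand(input_ids.clone().shape)  # right: call it, then take .shape
```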
@virtualvoyagers429
@virtualvoyagers429 2 years ago
Hi @james... thank you so much for this video and for sharing your knowledge with us. This is so helpful. I have a question: since you are encoding and training using a BERT model, the max length is 512, but what if I have input sequences longer than that? I found that there is a model, Longformer, for long sequences... can I use the same method you used here but pass my custom dataset's max length as max_length? Hoping for your reply :)
@jamesbriggs
@jamesbriggs 2 years ago
Happy it helps, you can absolutely do that yes!
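As a hedged sketch of what that swap might look like (the checkpoint name and the 4096-token limit are Longformer's published defaults, not something covered in the video; lines is the same raw-text list as before):

```python
from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096')
# Longformer accepts sequences up to 4096 tokens
sample = tokenizer(lines, max_length=4096, padding='max_length',
                   truncation=True, return_tensors='pt')
```

The rest of the pipeline (masking, Dataset, DataLoader) stays the same; only the tokenizer, model, and max_length change.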
@sarahalqaseemi571
@sarahalqaseemi571 2 years ago
Thanks for sharing this great knowledge. I tried to apply all the steps with 12k files of 10k tokens each and got a CUDA out-of-memory error when I use the GPU, or the process crashes when I use the CPU. Do you have an idea about this error and how to handle this amount of data?
@jamesbriggs
@jamesbriggs 2 years ago
Yes, you can load the data from file: within the dataloader function you can add a load-from-file method (beforehand you will need to save everything to file too), and do something like this: stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel In that example they load from already-tokenized tensors (the .pt files); you can do this by tokenizing beforehand, or alternatively you can tokenize within the __getitem__ method. Dataloaders are super important and I'm planning on covering them better soon :)
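A minimal sketch of the lazy Dataset James describes, assuming one tokenized .pt shard per source file as in the saving sketch earlier in the comments (the field names and the encoded/ path are assumptions):

```python
import torch
from torch.utils.data import Dataset, DataLoader
from pathlib import Path

class LazyMLMDataset(Dataset):
    def __init__(self, shard_dir):
        self.paths = sorted(Path(shard_dir).glob('shard_*.pt'))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # only one shard is read into memory per access
        shard = torch.load(self.paths[idx])
        return shard['input_ids'], shard['attention_mask']

loader = DataLoader(LazyMLMDataset('encoded'), batch_size=1, shuffle=True)
```

Each item here is a whole shard; in practice you would map a global index to a (shard, row) pair so each item is a single sequence, but the shape of the idea is the same.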
@balamuruganm5668
@balamuruganm5668 2 years ago
Hi @james, what about the upcoming part 4 video, the training part?
@jamesbriggs
@jamesbriggs 2 years ago
I did it but forgot to add it to the description, it's there now!
@balamuruganm5668
@balamuruganm5668 2 years ago
@jamesbriggs thanks
@PowerPlaay
@PowerPlaay 2 years ago
Hello James, I am getting the following error when I use input_ds = torch.cat(input_ds): cat(): argument 'tensors' (position 1) must be tuple of Tensors, not Tensor. Please help.
@etherealshift9786
@etherealshift9786 2 years ago
Have you figured out how to fix this?
@PowerPlaay
@PowerPlaay 2 years ago
@etherealshift9786 No, I moved on to a different project, but can you explain it, please?
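For anyone else hitting this: torch.cat expects a list or tuple of tensors, so the error suggests input_ds was already a single stacked tensor (which happens when the tokenizer is called once over all lines with return_tensors='pt'), leaving nothing to concatenate. A small sketch of both cases:

```python
import torch

chunks = [torch.zeros(2, 512), torch.zeros(3, 512)]
combined = torch.cat(chunks)   # fine: a list of tensors -> one (5, 512) tensor

already = torch.zeros(5, 512)
# torch.cat(already)           # TypeError: 'tensors' must be tuple of Tensors, not Tensor
# if it's already one tensor, just use it directly; no concatenation needed
```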
@drym_bar
@drym_bar 3 years ago
Hi! Sorry to disturb you, but I watched your YouTube video about training a RoBERTa model from scratch and I have one question for you: how can I use really large datasets for this task? You used only a 10,000-line batch (10k lines). I have a dataset with 26 GB of text and don't want to use the classical data collators from torch.utils, nor the Trainer. Can you please give me some advice?
@jamesbriggs
@jamesbriggs 3 years ago
hey Rostam, my most recent video on HuggingFace pipelines might help :)
@drym_bar
@drym_bar 3 years ago
@jamesbriggs Thanks for your reply! Did you mean using dataset.map with batched=True and batch_size?
@jamesbriggs
@jamesbriggs 3 years ago
@drym_bar Yes, load_dataset with streaming=True followed by what you said - will that help?
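A minimal sketch of that streaming pattern with the datasets library (the dataset name and config mirror the OSCAR data used earlier in the series, and tokenizer is assumed to be already defined):

```python
from datasets import load_dataset

# streaming=True iterates over the corpus without loading it all into RAM
ds = load_dataset('oscar', 'unshuffled_deduplicated_it',
                  split='train', streaming=True)

def tokenize(batch):
    return tokenizer(batch['text'], max_length=512,
                     padding='max_length', truncation=True)

ds = ds.map(tokenize, batched=True)  # applied lazily as you iterate
```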
@mercymoila5498
@mercymoila5498 3 years ago
When I follow this tutorial, it says 'RobertaTokenizer' object is not callable. How do I fix this?
@qasimlau2897
@qasimlau2897 3 years ago
Hi, you can update to transformers>=3.0.0.
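For context: tokenizers only became directly callable in transformers v3.0.0, so on older versions tokenizer(lines) raises exactly this error. Upgrading resolves it:

```python
# upgrade, then restart the runtime/kernel:
#   pip install -U transformers

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
enc = tokenizer('hello world', return_tensors='pt')  # callable from v3.0.0 onwards
```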
@jaychen1116
@jaychen1116 3 years ago
Hi, James: after executing batch = tokenizer.encode_batch(lines), I got the following error: AttributeError: 'RobertaTokenizer' object has no attribute 'encode_batch'. My version of transformers is 4.6.1. Any thoughts? Thanks
@jamesbriggs
@jamesbriggs 3 years ago
Hi Jay, did I use 'encode_batch' in the video? It should just be batch = tokenizer(lines) - hope that helps :)
@jaychen1116
@jaychen1116 3 years ago
Hi, James: after watching the video, I went on to read your Medium article. 'encode_batch' is used in the article, but not in the video. Thanks for the help.
@jamesbriggs
@jamesbriggs 3 years ago
Ah I see, sorry, my bad - there were some code changes during the article write-up and I missed that one, thanks for letting me know!
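To untangle the mix-up for other readers: encode_batch belongs to the lower-level tokenizers library (used to train the tokenizer in part 2), while transformers tokenizers are simply called. A side-by-side sketch; the vocab/merges file names and the ./filiberto path are assumptions:

```python
# tokenizers library: has encode_batch
from tokenizers import ByteLevelBPETokenizer
bpe = ByteLevelBPETokenizer('vocab.json', 'merges.txt')
encodings = bpe.encode_batch(lines)   # list of Encoding objects with .ids / .attention_mask

# transformers library: the tokenizer itself is callable
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('./filiberto')
batch = tokenizer(lines, max_length=512, padding='max_length',
                  truncation=True, return_tensors='pt')
```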