How to Training Custom Entities into spaCy Models (Named Entity Recognition for DH 09 03) - spaCy 2

12,018 views

Python Tutorials for Digital Humanities

Comments: 61

@python-programming 3 years ago

Check out the Textbook for this series: ner.pythonhumanities.com/intro.html Playlist on NER: kzbin.info/aero/PL2VXyKi-KpYs1bSnT8bfMFyGS-wMcjesM Playlist on spaCy: kzbin.info/aero/PL2VXyKi-KpYvuOdPwXR-FZfmZ0hjoNSUo

@shivadhanush3131 3 years ago

I can see the patience you have in trying to make everyone understand bit by bit. Great, much appreciated.

@python-programming 3 years ago

Thanks!

@kiranmore29 1 year ago

The way you teach is really good. I have been following the videos from the beginning, but can you please share the training data that you used in this video (camp_training_data.json)? It is not present in the given git repo. (Please reply asap.)

@wiktoria5162 9 months ago

Hi @kiranmore29, Did you get it?

@jeremyzhang8603 2 years ago

Sir, this saved me a lot of time! Great videos, thanks!

@python-programming 2 years ago

Awesome!! So happy it helped!

@kehteraho 3 years ago

Hi, thanks a lot for this series, I really appreciate the effort you are making. Could you please share some more insight into how the annotations were done, and whether there is an automated or smarter way of doing the same?

@maliha_abroad 2 years ago

Do you have a git repository where you share the datasets?

@afroman1611 2 months ago

If you no longer have the file, can you please share, or tell us about, the data from which you created the training data?

@mandaravh7541 3 years ago

Can you please share how to train the model in spaCy version 3? This code doesn't work with v3.

@python-programming 3 years ago

Absolutely! It is in the works!

@asfandkhan6206 8 months ago

Could you share your json file (camp_training_data.json) with us??

@esooghazy 3 years ago

Thanks for your great videos! :) I just have a request. Could you please create a video on the changes to custom training in v3, specifically in the training code? I can see from your comments that it has been in the works for 2 months, but it would be highly appreciated if you could just tell us how to fix the nlp.update error, because when I follow the documentation and use Example. bla bla bla, it doesn't work. I also watched all the new videos, but I wasn't able to catch the solution in any of them. Thanks again! :)

@python-programming 3 years ago

Hi! Indeed. I already have a few on this subject, check out the bottom of this playlist: kzbin.info/aero/PL2VXyKi-KpYvuOdPwXR-FZfmZ0hjoNSUo

@StefanoBoyanov 11 months ago

@python-programming Could you share your JSON file (camp_training_data.json) with us? We cannot go any further in the video without it, unless we already have data prepared beforehand, which is not my case. Thank you in advance.

@python-programming 11 months ago

Oh no! I will try and find that file. I'm not sure why it's not in the repository. Sorry about that.

@hichembouricha6328 3 years ago

The code doesn't work anymore with create_pipe and add_label. Can you show us an alternative, please?

@python-programming 3 years ago

Thanks! I made a new video for spaCy 3, but I forgot to change these titles to spaCy 2.
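
For readers hitting the same errors, here is a minimal sketch of what the spaCy 3 version of this training loop can look like. It assumes spaCy 3 is installed; the two training sentences and their offsets are made up for illustration and are not the camp_training_data.json file from the video. In spaCy 3, `nlp.create_pipe` is no longer needed because `nlp.add_pipe` takes the component's string name, and `nlp.update` expects `Example` objects rather than separate lists of texts and annotations.

```python
import random

import spacy
from spacy.training import Example

# Hypothetical training data in the (text, {"entities": [(start, end, label)]}) shape.
TRAIN_DATA = [
    ("He was deported to Auschwitz in 1943.", {"entities": [(19, 28, "CONC_CAMP")]}),
    ("She survived Treblinka.", {"entities": [(13, 22, "CONC_CAMP")]}),
]

nlp = spacy.blank("en")        # blank English pipeline, no pretrained components
ner = nlp.add_pipe("ner")      # spaCy 3: add the component by its string name
ner.add_label("CONC_CAMP")

# spaCy 3: wrap each (text, annotations) pair in an Example object.
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# initialize() sets up the model weights and returns an optimizer.
optimizer = nlp.initialize(lambda: examples)

for epoch in range(5):         # a handful of passes, only to show the loop
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)
```

Two sentences are far too few to learn anything useful; the point is only the shape of the calls. For real projects, the spaCy documentation recommends the config-driven `spacy train` command over a hand-written loop.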

@hichembouricha6328 3 years ago

@python-programming OK, thanks a lot 🙏 I will check them later.

@parthrangarajan3241 3 years ago

Is there a way to annotate custom data other than manually performing such a cumbersome task?

@python-programming 3 years ago

Great question. In other ML applications, you have a lot of liberty to use synthetic data or data augmentation to generate a dataset or rapidly expand an existing small one. Unfortunately, for text there are a lot of issues with these approaches, because it is difficult to produce good synthetic data, and data augmentation methods are language- and domain-specific.

@pablovitale6058 3 years ago

Hi, great series, thank you! What are the usual good practices for NER with custom entities? Do I need to create a new NER pipe from scratch and add my custom labels, or do I only need to add custom labels to a spaCy pretrained model?

@python-programming 3 years ago

This is a great question and one I get a lot. I will make a video explaining this in more detail, but the short answer is that a custom model is better than using a pretrained model. This is because if you try and train new labels to an existing model, you will experience catastrophic forgetting where the pretrained model quickly forgets old training.

@michaelmohen1500 8 months ago

Love these videos! Any chance you could send me the camp_training_data.json file? I'd really like to finish this series of videos.

@aniketchatterjee2440 3 years ago

Hi, I have my training data in ".spacy" format. How do I load it and train on it? Thank you.

@vinsmokearifka 3 years ago

Prof, if I train a custom NER model, is dep_ included? I mean, is there no need to customize dep_ as well? Thank you.

@raymondforce138 1 year ago

Hi, when you train the new NER component, is it initialized based on the transformer or tok2vec? When I try training the ner on its own and append it to the new pipeline, the accuracy is horrible. I see how you do this with internal training, but how do you do with the config system?

@python-programming 1 year ago

This is because you need to make sure the vectors of your NER model align with the vectors of the rest of the spaCy pipeline. In the config file, you can point the vectors to the same pipeline you are using to ensure they align.

@dec13666 3 years ago

So by doing this, won't spaCy's "default" model "forget" the previous entities (i.e. 'DATE', 'PERSON', etc.) and start labeling everything as my own customized entity, 'CONC_CAMP'?

@python-programming 3 years ago

Indeed it will.

@dec13666 3 years ago

@python-programming Wow! That was a quick response! Lol. Seriously, though, what I am doing right now is similar to what you did in this video (in my case, my model tags only "JOB_SKILLS"), and it seemed very straightforward; however, so far it is labeling pretty much every single word as a "JOB_SKILL". I saw at the end of your video that you just pulled a better model out of your sleeve and boom, it worked. But I'd love to know HOW we could get from our first model version to that "best" model. I have already increased the size of my dataset and made sure the number of words is balanced (as that also seems to affect model performance), so those points are already checked. I was checking some blogs to see how I could improve my model (e.g. www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/), and what I have learned is that spaCy apparently forgets older entities, which MIGHT be causing what's going on in my case. In that blog, the author generates additional labeled examples with other entities and ADDS them to the custom dataset, then trains and tests, with quite decent results. Sadly, that blog and others target spaCy v2, and I don't know how to do something similar in spaCy v3. Is there any video in your NLP series worth considering, or some other material to keep in mind? Thank you very much, and keep it up with the innovative material (and the quick responses! ;) ).

@python-programming 3 years ago

Have you seen my free textbook? It answers all these questions from beginning to end, and most of it is now spaCy 3: ner.pythonhumanities.com

@dec13666 3 years ago

@python-programming Thank you very much, Dr. Mattingly, I'll make sure to check it out 😀👍.

@python-programming 3 years ago

No problem! Sorry for my short reply. I am currently cleaning a pool. =)

@FRUXT 2 years ago

I don't understand what we are training here. We give it a list of the camps, so why do we need to train? It just has to spot the names in the text to return the entities...

@python-programming 2 years ago

A list will not work with variant spellings, typos, poor OCR, etc. An ML model can account for those, especially a BERT model or other models that leverage subword embeddings.
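
The limitation of a plain list can be seen in a small sketch. This is an illustration only: the camp list, the sentences, and the `find_camps_exact` helper are hypothetical and not taken from the video. It shows exact lookup finding a correctly spelled name and silently missing a typo.

```python
# Hypothetical gazetteer of camp names; not the list used in the video.
CAMPS = {"Auschwitz", "Treblinka", "Dachau"}


def find_camps_exact(text):
    """Return gazetteer entries that appear verbatim as whitespace tokens."""
    tokens = [tok.strip(".,;:!?") for tok in text.split()]
    return [tok for tok in tokens if tok in CAMPS]


clean = "He was sent to Auschwitz."
noisy = "He was sent to Aushwitz."  # one letter dropped, as in a typo or bad OCR

print(find_camps_exact(clean))  # the correctly spelled name is found
print(find_camps_exact(noisy))  # the misspelled name is not
```

A trained NER model does not look the token up; it scores it using the surrounding context and the word's shape, which is why it has a chance of catching the second sentence.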

@FRUXT 2 years ago

@python-programming Ah OK, I understand. It will catch different typos, but will it also catch other camps that were not in the training data, for example by using the context?

@python-programming 2 years ago

Precisely! This is really useful also because sometimes people have unique names for camps in testimonies that are not in lists.

@FRUXT 2 years ago

@python-programming Thanks for your answer ^^ At my school I have a project on this kind of problem. The thing is, I don't understand how you create your training set. It seems like a hard task, doesn't it?

@python-programming 2 years ago

@FRUXT It can be. A good way to start is to come up with a set of rules to autogenerate a dataset. Use that to train one model. Next, load that model into something like Prodigy from the spaCy team and annotate a gold standard dataset.
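
The first step, rules that autogenerate a dataset, can be sketched with nothing more than a seed list and a regular expression. This is a minimal sketch under stated assumptions: the seed names, the sentences, and the `auto_annotate` helper are hypothetical, and the output uses the `[text, {"entities": [[start, end, label]]}]` shape so it can be written straight to JSON.

```python
import json
import re

# Hypothetical seed list and label; swap in your own domain terms.
CAMP_NAMES = ["Auschwitz", "Treblinka", "Dachau"]
LABEL = "CONC_CAMP"

# One pattern matching any seed name as a whole word (longest names first).
names = sorted(CAMP_NAMES, key=len, reverse=True)
PATTERN = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")


def auto_annotate(sentences):
    """Label every seed-name match with its character offsets."""
    data = []
    for sent in sentences:
        ents = [[m.start(), m.end(), LABEL] for m in PATTERN.finditer(sent)]
        if ents:  # keep only sentences with at least one match
            data.append([sent, {"entities": ents}])
    return data


sentences = [
    "Trains left for Treblinka and Dachau.",
    "The weather was cold.",
    "Auschwitz was liberated in 1945.",
]
training_data = auto_annotate(sentences)
print(json.dumps(training_data, indent=2))
```

Data produced this way inherits the blind spots of the rules, which is why the second step, correcting a model's predictions by hand to build a gold standard set, still matters.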

@vinsmokearifka 3 years ago

Prof, what is the workflow when using another language, not English?

@python-programming 3 years ago

Same workflow, but if your proper nouns decline, you need to think about that. Check out my video on Classical Latin NER. I go through the whole workflow. It is helpful even if you do not know Latin.

@vinsmokearifka 3 years ago

@python-programming Thank you, Prof.

@python-programming 3 years ago

No problem!

@brucechang5068 1 year ago

Hi, what IDE is this?

@python-programming 1 year ago

This was the good ole' days of Atom before Microsoft stopped supporting it. I use VS Code now. It is much better than Atom in many ways, but I still miss that IDE.

@brucechang5068 1 year ago

@python-programming Thank you for the reply. I am working on a project that can be framed as a custom NER task. I have no experience in NLP, and the research environment has limited tools available (luckily we have spaCy). I'm watching your tutorials these days. They are super helpful. Thank you for working on this!

@python-programming 1 year ago

@brucechang5068 I am so happy to hear that! No problem! Glad you are finding them useful! Good luck on your NLP journey!

@ronchristino12 3 years ago

Is the code available in a GitHub repo?

@python-programming 3 years ago

Thanks for reminding me! I just added it to the repo for this series: github.com/wjbmattingly/ner_youtube Also, keep a lookout, because I am currently preparing the Jupyter Notebooks for this series as well: nbviewer.jupyter.org/github/wjbmattingly/holocaust_ner_lessons/tree/main/

@ronchristino12 3 years ago

Also just one more question. The start and end indexes of the entities in the training data, are they token level or character level?

@python-programming 3 years ago

@ronchristino12 Great question! They are the start character and end character of the entity in the string. So no, they do not use the token index of the token in the spaCy doc. I've often wondered why this is the case in spaCy, but I suspect it allows for a universal training data structure across all languages (because not all languages tokenize and index the same way). Does that answer your question?
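
The difference between character offsets and token indexes can be checked with plain string operations. This is a small sketch with a made-up sentence; the whitespace split below stands in for a tokenizer purely for contrast and is not how spaCy tokenizes.

```python
text = "My father was taken to Bergen-Belsen in 1944."
entity = "Bergen-Belsen"

# Character offsets: where the entity starts and ends in the raw string.
start = text.index(entity)
end = start + len(entity)  # the end offset is exclusive, like a Python slice

# The annotation shape used for training data in this series.
annotation = (text, {"entities": [(start, end, "CONC_CAMP")]})

# Slicing the text with the offsets must give back the entity exactly.
recovered = text[start:end]

# For contrast only: the position of the entity among whitespace tokens.
token_index = text.split().index(entity)

print(start, end, recovered, token_index)
```

Slicing the text with the stored offsets, as `recovered` does, is a cheap sanity check worth running over a whole training file, since an offset that is one character off will usually cause the annotation to be rejected or ignored during training.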

@ronchristino12 3 years ago

@python-programming Yeah, it does. Thank you so much.

@python-programming 3 years ago

@ronchristino12 No problem! If you ever have any other questions, feel free to continue leaving comments. It helps me figure out what I need to include/discuss in more detail in my videos.

@ricardocalleja 3 years ago

Everything is fine until line 29. I think the problem is in the nlp.update method: ValueError: too many values to unpack (expected 2) for text, annotations in TRAIN_DATA: nlp.update([text],