Introduction to Named Entity Recognition (NER for DH 01)

Views: 34,743

Python Tutorials for Digital Humanities

A day ago

Comments: 25

@python-programming 3 years ago

Check out the Textbook for this series: ner.pythonhumanities.com/intro.html

@chessketeer 2 years ago

Thank you, Dr. Mattingly, for sharing your knowledge and for doing this in such an unbelievably learner-oriented way.

@python-programming 2 years ago

You are very welcome! Thanks for the comment!

@amandaahringer7466 2 years ago

So happy to have found your channel! Very much looking forward to the rest of the series! Thank you for taking the time to create this!!!

@python-programming 2 years ago

Awesome! So happy to help.

@xirongcui7864 3 years ago

This kind of video is exactly what I need. I am currently working on building a knowledge graph of the thermal power industry. The first problem we face is NER and relationship extraction in a closed domain. It helped me a lot.

@python-programming 3 years ago

Glad it helped!

@nitindabadghav 2 years ago

This playlist is just fantastic. More power to you !! Thank you very much !!!

@python-programming 2 years ago

Glad you like it!

@saadatkamaei8829 3 years ago

Thank you Doctor. This is exactly what I need! 😊

@vincent_hall 7 months ago

Criticism isn't a bad thing, it highlights places for improvement and places where things are already good. Criticism is good.

@agni8840 1 year ago

i love these videos so much, thank you

@python-programming 1 year ago

No problem!

@arungade2 3 years ago

I just started this playlist, but I was wondering if I could learn to extract the entities from song lyrics, maybe in later vids as well.

@python-programming 3 years ago

Yes, you can. If you follow my steps in this series, you will be able to do that.

@nomisjooon9241 4 years ago

Thank you very much for this video! I'm also using spaCy for Information Extraction and it works quite well. However, documents include not only free text but also tables, e.g., in your domain, the number of concentration camps per country. Do you have any suggestions for combining information extracted from free text with cell-specific information from tabular data?

@python-programming 4 years ago

Thanks for the feedback! It is most appreciated. Great question. So, there are a few different routes to solve that problem, and it largely depends on the state of the tables. Have they been OCR'ed? Tesseract and Tabula are both good candidates, depending on the situation. Is it already in an Excel spreadsheet? If so, then using the csv module is the way to go, or XLRS/XLRD. Let me know the state of the data and I can advise more precisely. Again, thanks!
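
For the spreadsheet route mentioned above, a minimal sketch of reading an exported table with Python's built-in `csv` module might look like this. It is only an illustration, not code from the video: the column names and values are made-up placeholders, and the text is inlined with `io.StringIO` just to keep the example self-contained. A real file would typically be opened with `open(path, newline="")` instead.

```python
import csv
import io

# Placeholder table standing in for a spreadsheet exported as CSV.
# Column names and values are hypothetical, not real data.
SAMPLE_CSV = """country,site_count
Country A,3
Country B,5
"""


def read_table(text):
    """Parse CSV text into a list of dicts, one per row, keyed by the header."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


def column_as_int(rows, column):
    """Pull one column out of the parsed rows and convert it to integers."""
    return [int(row[column]) for row in rows]


rows = read_table(SAMPLE_CSV)
counts = column_as_int(rows, "site_count")
```

Each row comes back as a dictionary, so a cell can be looked up by its column name rather than by position.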

@nomisjooon9241 4 years ago

@python-programming Thank you so much for responding so fast! I'm dealing with PDF files that contain test procedures. Per document there are various tests, which are specified in different sections. When I started this project I wasn't aware of how many problems arise with PDFs and that most of them still aren't adequately solved. That's why - for now - I've decided to manually extract the file's different sections for the training. So it's up to me in which file format the free text and the tables of each section are stored. Of course it would be nicer to have a fully automated process starting with the actual PDF file, but right now I'm OK with only showing the general applicability of NER and IE in the industrial context. So far, I've only found use cases within the biomedical and legal AI area. I've already developed my own labels (e.g. test_specification, test_value,...) and trained the model with spaCy. It works quite well for the free text, but I don't even know where to start with the tabular data. The data stored in tables mostly contains additional information which is needed to perform the IE adequately.

@python-programming 4 years ago

Not a problem at all! Always happy to help. Okay, from the sound of things I would go the Tesseract or Tabula-py route. Those are usually the top contenders for this situation. You may have to save each PDF page as an image, then use OpenCV to adjust the brightness/contrast to get better results. I had to do this in my own research on multiple occasions. Once the image is adjusted, Tesseract and Tabula both perform well. But those are definitely the routes to pursue. I've got both slated for video series next year, but maybe I'll move them closer to January.
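
To illustrate what "adjust the brightness/contrast" means in practice, here is a dependency-free sketch of the usual linear adjustment, `new = alpha * old + beta`, on an 8-bit grayscale image stored as nested lists. It is an illustration only, not the author's code, and the sample pixel values and default `alpha`/`beta` are arbitrary assumptions. With OpenCV installed, `cv2.convertScaleAbs(img, alpha=..., beta=...)` applies a comparable scale-and-shift with saturation to a real image array.

```python
def adjust_contrast_brightness(pixels, alpha=1.5, beta=10):
    """Scale each pixel by alpha (contrast), shift by beta (brightness),
    and clamp the result to the valid 8-bit range 0..255."""
    return [
        [max(0, min(255, int(round(alpha * p + beta)))) for p in row]
        for row in pixels
    ]


# Tiny hypothetical grayscale image: 2 rows of 3 pixels.
image = [[0, 50, 100], [150, 200, 250]]
adjusted = adjust_contrast_brightness(image, alpha=1.5, beta=10)
```

Values that would exceed 255 are clipped, which is why very bright regions can lose detail if `alpha` is set too high before running OCR.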

@nomisjooon9241 4 years ago

@python-programming Thank you so much for your suggestions! Let's assume Tesseract and Tabula solve the problem of automatically extracting the tables of a document with several sections about concentration camps in 1. Germany, 2. Poland, 3. Czech Republic. And let's assume that each of these sections contains tables with numbers of people in different cities of the corresponding country. My main issue here would be that I don't know how I can use IE in a way that the algorithm understands which section/country a table refers to, and how to get the content of a specific cell (e.g. Berlin XXX people). I can't find any solutions combining NER and IE with tabular data.

@python-programming 4 years ago

@nomisjooon9241 I understand better now. It will likely need to be a custom solution to your problem, either a custom neural net (a very simple one should do it) to identify the type of data, combined with or in lieu of a rules-based solution. DM me on Twitter with some pics of the PDF structure. I am away from my computer until next week, but I will see if I can write some code for you and help you out.
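
As a rough illustration of what a rules-based solution could look like here (not the code the author offered to write), the sketch below walks a document in reading order, remembers the most recent section heading, and attaches it to every cell of the tables that follow. The block structure, the `CellRecord` type, and the sample rows are all assumptions made for the example. It presumes the headings and tables have already been extracted, and the "XXX" values are placeholders as in the question.

```python
from dataclasses import dataclass


@dataclass
class CellRecord:
    section: str    # heading the table appeared under, e.g. a country
    row_label: str  # first cell of the row, e.g. a city
    column: str     # header of the column the value came from
    value: str      # raw cell content


def link_tables_to_sections(blocks):
    """Attach the most recent heading to every cell of each following table.

    `blocks` is a list of ("heading", text) or ("table", rows) tuples in
    document order; each table's first row is treated as its header.
    """
    records = []
    current_section = None
    for kind, payload in blocks:
        if kind == "heading":
            current_section = payload
        elif kind == "table":
            header, *body = payload
            for row in body:
                # Skip the first column: it holds the row label, not a value.
                for column, value in zip(header[1:], row[1:]):
                    records.append(
                        CellRecord(current_section, row[0], column, value)
                    )
    return records


# Hypothetical document structure with placeholder values.
document = [
    ("heading", "Germany"),
    ("table", [["city", "people"], ["Berlin", "XXX"]]),
    ("heading", "Poland"),
    ("table", [["city", "people"], ["Warsaw", "XXX"]]),
]
records = link_tables_to_sections(document)
```

Each record then carries its section, row label, and column, so a cell such as "Berlin / people" can be matched against entities that NER finds in the free text of the same section.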

@programmingworld9751 2 years ago

Thanks for the nice videos. One request, please. I can't find the complete playlist of all the videos you mentioned above. The playlist for spaCy is huge and I still can't find video 05 of the above series. Could you please point me to the playlist for the above series? Thanks

@python-programming 2 years ago

This has been on my to-do list for a month. I will try to do it first thing tomorrow morning.

@programmingworld9751 2 years ago

@python-programming Thank you so much. Also, please suggest good universities to pursue a PhD in Stats in the UK. I really like your way of teaching and your tutorials. Thank you so much. I am presently working on demand planning using time series. Is there any document/tutorial you can suggest? Thanks