Introduction to Named Entity Recognition (NER for DH 01)

33,199 views

Python Tutorials for Digital Humanities

Comments: 25
@python-programming 3 years ago
Check out the Textbook for this series: ner.pythonhumanities.com/intro.html
@arungade2 3 years ago
I just started this playlist, and I was wondering whether I could learn to extract entities from song lyrics, maybe in later videos as well.
@python-programming 3 years ago
Yes, you can. If you follow my steps in this series, you will be able to do that.
@programmingworld9751 2 years ago
Thanks for the nice videos. One request, please: I can't find the complete playlist of all the videos you mentioned above. The playlist for spaCy is huge, and I still can't find video 05 of the above series. Could you please point me to the playlist for this series? Thanks
@python-programming 2 years ago
This has been on my to-do list for a month. I will try to do it first thing tomorrow morning.
@programmingworld9751 2 years ago
@python-programming Thank you so much. Also, please suggest good universities for pursuing a PhD in statistics in the UK. I really like your way of teaching and your tutorials. I am presently working on demand planning using time series. Is there any document or tutorial you can suggest? Thanks
@chessketeer 1 year ago
Thank you, Dr. Mattingly, for sharing your knowledge and for doing this in such an unbelievably learner-oriented way.
@python-programming 1 year ago
You are very welcome! Thanks for the comment!
@nomisjooon9241 3 years ago
Thank you very much for this video! I'm also using spaCy for information extraction, and it works quite well. However, documents include not only free text but also tables; in your domain, for example, the number of concentration camps per country. Do you have any suggestions for combining information extracted from free text with cell-specific information from tabular data?
@python-programming 3 years ago
Thanks for the feedback! It is most appreciated. Great question. There are a few different routes to solving that problem, and the choice largely depends on the state of the tables. Have they been OCR'ed? Tesseract and Tabula are both good candidates, depending on the situation. Is the data already in an Excel spreadsheet? If so, then the csv module is the way to go, or a spreadsheet reader such as xlrd. Let me know the state of the data and I can advise more precisely. Again, thanks!
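As a rough illustration of the spreadsheet route mentioned in the reply above, here is a minimal sketch using Python's standard csv module. It assumes the table has already been exported from Excel as CSV; the column names (section, item, value) and the sample rows are hypothetical placeholders, not data from the video or the thread.

```python
import csv
import io

# Hypothetical sample standing in for a table exported from a spreadsheet as CSV.
SAMPLE_CSV = """section,item,value
A,first,3
A,second,4
B,first,5
"""


def load_table(csv_text):
    """Parse CSV text into a list of dicts, one dict per data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


def total_by_section(rows):
    """Sum the numeric 'value' column for each distinct 'section'."""
    totals = {}
    for row in rows:
        totals[row["section"]] = totals.get(row["section"], 0) + int(row["value"])
    return totals


rows = load_table(SAMPLE_CSV)
totals = total_by_section(rows)
print(totals)
```

For a real file, `io.StringIO(csv_text)` would be replaced by a file opened with `open(path, newline="")`, which is the form the csv documentation recommends.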
@nomisjooon9241 3 years ago
@python-programming Thank you so much for responding so fast! I'm dealing with PDF files that contain test procedures. Each document contains various tests, which are specified in different sections. When I started this project, I wasn't aware of how many problems arise with PDFs and that most of them still aren't adequately solved. That's why, for now, I've decided to manually extract each file's sections for training. So it's up to me which file format the free text and the tables of each section are stored in. Of course it would be nicer to have a fully automated process starting with the actual PDF file, but right now I'm OK with only showing the general applicability of NER and IE in an industrial context. So far, I've only found use cases in the biomedical and legal AI areas. I've already developed my own labels (e.g. test_specification, test_value, ...) and trained the model with spaCy. It works quite well for the free text, but I don't even know where to start with the tabular data. The data stored in tables mostly contains additional information that is needed to perform the IE adequately.
@python-programming 3 years ago
Not a problem at all! Always happy to help. Okay, from the sound of things I would go the Tesseract or Tabula-py route. Those are usually the top contenders for this situation. You may have to save each PDF page as an image, then use OpenCV to adjust the brightness/contrast to get better results. I had to do this in my own research on multiple occasions. Once the image is adjusted, Tesseract and Tabula both perform well. That is definitely the route to pursue. I've got both slated for video series next year, but maybe I'll move them closer to January.
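The brightness/contrast step described above might look roughly like the sketch below. It is not the author's code: it assumes the PDF page has already been saved as an image file, that the third-party packages opencv-python and pytesseract (plus a Tesseract install) are available, and that a simple linear adjustment is enough. The function names and the default `alpha`/`beta` values are hypothetical and would need tuning per scan.

```python
def scale_pixel(value, alpha, beta):
    """Apply the linear contrast/brightness transform to one 8-bit pixel.

    Mirrors what cv2.convertScaleAbs does per pixel: |alpha * value + beta|,
    clipped so the result never exceeds 255.
    """
    return min(255, abs(round(alpha * value + beta)))


def ocr_page(image_path, alpha=1.5, beta=10):
    """Boost the contrast/brightness of a page image, then run Tesseract on it.

    alpha scales contrast and beta shifts brightness (both are assumed
    starting values). The third-party imports are done lazily so that
    scale_pixel can be used without them installed.
    """
    import cv2
    import pytesseract

    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        # cv2.imread returns None instead of raising when the file is unreadable.
        raise FileNotFoundError(image_path)
    adjusted = cv2.convertScaleAbs(image, alpha=alpha, beta=beta)
    return pytesseract.image_to_string(adjusted)


# Per-pixel effect of the default settings on a mid-grey and a light pixel.
print(scale_pixel(100, 1.5, 10), scale_pixel(200, 1.5, 10))
```

Calling `ocr_page("page_1.png")` on a hypothetical page image would return the recognized text as a string; extracting table structure (the Tabula route) is a separate step not shown here.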
@nomisjooon9241 3 years ago
@python-programming Thank you so much for your suggestions! Let's assume Tesseract and Tabula solve the problem of automatically extracting the tables from a document with several sections about concentration camps in 1. Germany, 2. Poland, 3. Czech Republic. And let's assume that each of these sections contains tables with numbers of people in different cities of the corresponding country. My main issue here would be that I don't know how to use IE in a way that lets the algorithm understand which section/country a table refers to, or how to get the content of a specific cell (e.g. Berlin XXX people). I can't find any solutions combining NER and IE with tabular data.
@python-programming 3 years ago
@nomisjooon9241 I understand better now. Your problem will likely need a custom solution: a custom neural net (a very simple one should do) to identify the type of data, either combined with or in lieu of a rules-based solution. DM me on Twitter with some pics of the PDF structure. I am away from my computer until next week, but I will see if I can write some code for you and help you out.
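One way to picture the rules-based half of that suggestion is sketched below. This is only an assumption about what such a solution could look like, not the code the author offered to write: it presumes each section has already been extracted as a heading plus a table (a list of rows whose first row is the header), and it simply tags every cell with the section it came from. All names and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CellRecord:
    """One table cell, tagged with the section and row it belongs to."""
    section: str
    row_label: str
    column: str
    value: str


def link_cells_to_sections(sections):
    """Flatten per-section tables into records that carry their section heading."""
    records = []
    for section in sections:
        header, *body = section["table"]
        for row in body:
            row_label = row[0]
            # Pair each remaining cell with its column name from the header row.
            for column, value in zip(header[1:], row[1:]):
                records.append(CellRecord(section["heading"], row_label, column, value))
    return records


def lookup(records, section, row_label, column):
    """Return the value of one specific cell, or None if it is not present."""
    for record in records:
        if (record.section, record.row_label, record.column) == (section, row_label, column):
            return record.value
    return None


# Hypothetical extracted sections; headings and values are placeholders.
sections = [
    {"heading": "Section 1", "table": [["item", "count"], ["alpha", "10"], ["beta", "20"]]},
    {"heading": "Section 2", "table": [["item", "count"], ["alpha", "30"]]},
]
records = link_cells_to_sections(sections)
print(lookup(records, "Section 2", "alpha", "count"))
```

Because every record keeps its section heading, entities found by NER in a section's free text can be joined to the table cells of the same section by that heading.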
@nitindabadghav 2 years ago
This playlist is just fantastic. More power to you! Thank you very much!
@python-programming 2 years ago
Glad you like it!
@saadatkamaei8829 2 years ago
Thank you Doctor. This is exactly what I need! 😊
@vincent_hall 4 months ago
Criticism isn't a bad thing, it highlights places for improvement and places where things are already good. Criticism is good.
@xirongcui7864 3 years ago
This kind of video is exactly what I need. I am currently working on building a knowledge graph of the thermal power industry. The first problem we face is NER and relationship extraction in a closed domain. It helped me a lot.
@python-programming 3 years ago
Glad it helped!
@amandaahringer7466 2 years ago
So happy to have found your channel! Very much looking forward to the rest of the series! Thank you for taking the time to create this!!!
@python-programming 2 years ago
Awesome! So happy to help.
@agni8840 1 year ago
I love these videos so much, thank you
@python-programming 1 year ago
No problem!
@bharavi1 1 year ago
Thank you very much