Topic Modeling with LeetTopic - Transformer Topic Modeling that Generates a Bokeh App (EASY!)

2,592 views

Python Tutorials for Digital Humanities

1 day ago

Comments: 21

@python-programming 2 years ago

Repo for this video: github.com/wjbmattingly/leettopic-test

@Kylbigel 2 years ago

Thank you! Took about 30 mins on a 3080Ti and 64GB DDR5 for a 70K+ dataset of job postings. Very interesting results!

@python-programming 2 years ago

So happy to hear that! We are tweaking some things for performance in 0.0.10, but 30 min for 70k is pretty quick. I have nearly identical specs except a 3080 not a 3080ti

@Kylbigel 2 years ago

@python-programming I have a large job posting dataset and can scrape anything on ind33d. I am currently annotating the bullet points and classifying them with a custom spaCy model. Would love to collaborate if you are interested. The work is for public research and for career development and job seekers.

@mrtn5882 2 years ago

I'm very much feeling this and am really looking forward to playing around with LeetTopic! However, if I were you, I'd make another video that explains in layfolks' terms what they can achieve with LeetTopic (i.e. potential use cases, how topic modelling can be used in real-life DH studies, how it can generate additional insight, etc.) and also link it in the description of the video. It might sound trivial, but in conversations with students, these questions keep popping up.

@python-programming 2 years ago

Thanks! This is really helpful feedback. I will make that the video for next week.

@python-programming 2 years ago

Since recording this video, LeetTopic has gone to 0.0.9. You can now control UMAP and HDBSCAN parameters as well as pass language models to the entire pipeline. You can check out these new updates on the GitHub repo: github.com/wjbmattingly/LeetTopic

@juancruzbayonas3320 1 year ago

I have a problem: spacy_model isn't recognized. Is it no longer in use? Error: LeetTopic() got an unexpected keyword argument 'spacy_model'

@MrT12359 2 years ago

I'm excited to give the library a go. It looks a lot more full-featured, with less coding effort, than other approaches I've tried ☺️

@python-programming 2 years ago

Thanks! That was precisely the idea behind it. Transformer-based topic modeling designed for humanists. Over the coming weeks, there will be many more features as well, including the ability to search within the Bokeh application.

@leandromachado7199 2 years ago

I tried to reduce the number of outliers by changing the min_samples parameter in HDBSCAN, setting it to the same value as min_cluster_size, as suggested by Grootendorst. However, in my study the number of outliers remained the same.

@python-programming 2 years ago

Can you tell me a little bit about your data? How many docs are you using? How many topics are created? What max_distance are you using?

@Htyagi1998 2 years ago

Can you please explain max distance, if possible?

@python-programming 2 years ago

Absolutely. The max distance is the furthest distance an outlier can be from the nearest topic center. If an outlier is further than the max distance from every topic center, it will remain an outlier; otherwise, it will be assigned to the nearest topic. Handling outliers by assigning them to the nearest topic was originally introduced in Top2Vec, but the max_distance feature gives users a bit more control over how this process works.

@Kylbigel 2 years ago

@python-programming So if you turn it up to, say, .75, you will have more unique topics and smaller topic sizes?

@python-programming 2 years ago

@Kylbigel Sort of. If you turn it up, you will have more documents in each topic, because more outliers can be assigned to surrounding topics. It may not necessarily make the topics more unique, though.
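
Editor's note: the max_distance behaviour described in this thread can be illustrated with a minimal sketch. This is not LeetTopic's actual code. It assumes that outliers carry the label -1 (the HDBSCAN convention), that a topic center is the mean of the topic's points, and that distance is Euclidean. The function name assign_outliers and the sample data are hypothetical.

```python
import math


def assign_outliers(points, labels, max_distance):
    """Reassign outliers (label -1) to the nearest topic center if it is close enough.

    Hypothetical helper for illustration only; not part of LeetTopic.
    """
    # Group the coordinates of non-outlier points by topic label.
    by_topic = {}
    for point, label in zip(points, labels):
        if label != -1:
            by_topic.setdefault(label, []).append(point)

    # Assumption: a topic center is the mean of the points in that topic.
    centers = {
        label: tuple(sum(coords) / len(members) for coords in zip(*members))
        for label, members in by_topic.items()
    }

    new_labels = []
    for point, label in zip(points, labels):
        if label == -1 and centers:
            # Find the closest topic center (Euclidean distance).
            nearest = min(centers, key=lambda t: math.dist(point, centers[t]))
            # Only reassign when the outlier lies within max_distance of it.
            if math.dist(point, centers[nearest]) <= max_distance:
                label = nearest
        new_labels.append(label)
    return new_labels


# Two small topics plus two outliers: one near topic 0, one far from both.
points = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0), (0.5, 0.0), (2.5, 2.5)]
labels = [0, 0, 1, 1, -1, -1]

result = assign_outliers(points, labels, max_distance=1.0)
print(result)  # [0, 0, 1, 1, 0, -1]
```

Raising max_distance in this sketch lets the far outlier join a topic as well, so topics grow while the number of topics stays the same, which matches the reply above.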

@chrissonntag9 1 year ago

The topics it finds are just numbers. Is there any way to relate these numbers to actual topics in English?

@python-programming 1 year ago

Thanks for the comment! Great question. There are a lot of ways to do this. One way is to use a transformer model to generate a topic for you from the top keywords of the topic.

@TeaDrinkingHacker 1 year ago

Hi WJB, when running this in Google Colab, I'm getting "AttributeError: 'TfidfVectorizer' object has no attribute 'get_feature_names'". I've uninstalled and updated scikit-learn and restarted the runtime, to no avail. Any ideas?

@cristianedelweiss8634 1 year ago

Same error here.
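
Editor's note: a likely cause of this AttributeError, assuming the Colab runtime has scikit-learn 1.2 or newer, is that get_feature_names was deprecated in scikit-learn 1.0 and removed in 1.2 in favour of get_feature_names_out. Updating scikit-learn therefore would not help if the calling code still uses the old method; pinning an older release (for example, scikit-learn below 1.2) or using a version of the calling code that has been updated are the usual workarounds. The sketch below shows a version-tolerant lookup for your own code. The helper name get_vocabulary is hypothetical and is not part of LeetTopic or scikit-learn.

```python
def get_vocabulary(vectorizer):
    """Return the fitted feature names as a list on both old and new scikit-learn.

    Hypothetical helper for illustration; not part of any library.
    """
    # scikit-learn >= 1.0 provides get_feature_names_out().
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    # Older releases only provide get_feature_names() (removed in 1.2).
    return list(vectorizer.get_feature_names())


# Usage with a fitted vectorizer (requires scikit-learn to be installed):
#   from sklearn.feature_extraction.text import TfidfVectorizer
#   vec = TfidfVectorizer().fit(["topic models group documents"])
#   terms = get_vocabulary(vec)
```

Note that this only helps in code you control. If the error is raised from inside an installed package, that package itself has to call the newer method, or scikit-learn has to be pinned to a release that still has the old one.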