Repo for this video: github.com/wjbmattingly/leettopic-test
@Kylbigel 2 years ago
Thank you! Took about 30 mins on a 3080Ti and 64GB DDR5 for a 70K+ dataset of job postings. Very interesting results!
@python-programming 2 years ago
So happy to hear that! We are tweaking some things for performance in 0.0.10, but 30 min for 70k is pretty quick. I have nearly identical specs, except a 3080 rather than a 3080 Ti.
@Kylbigel 2 years ago
@python-programming I have a large job posting dataset and can scrape anything on ind33d. I'm currently annotating the bullet points and classifying them using a custom model via spaCy. Would love to collaborate if you are interested. This is for public research and career development for job seekers.
@mrtn5882 2 years ago
I'm very much feeling this and am really looking forward to playing around with LeetTopic! However, if I were you, I'd make another video that explains in layfolks' terms what they can achieve with LeetTopic -- i.e., potential use cases, how topic modelling can be used in real-life DH studies to generate additional insight, etc. -- and also link it in the description of the video. It might sound trivial, but in conversations with students, these questions keep popping up.
@python-programming 2 years ago
Thanks! This is really helpful feedback. I will make that the video for next week.
@python-programming 2 years ago
Since recording this video, LeetTopic has gone to 0.0.9. You can now control UMAP and HDBSCAN parameters as well as pass language models to the entire pipeline. You can check out these new updates on the GitHub repo: github.com/wjbmattingly/LeetTopic
@juancruzbayonas3320 1 year ago
I have a problem: spacy_model isn't recognized. Is it no longer in use? Error: LeetTopic() got an unexpected keyword argument 'spacy_model'
@MrT12359 2 years ago
I'm excited to give the library a go; it looks a lot more full-featured, with less coding effort, than other approaches I've tried ☺️
@python-programming 2 years ago
Thanks! That was precisely the idea behind it. Transformer-based topic modeling designed for humanists. Over the coming weeks, there will be many more features as well, including the ability to search within the Bokeh application.
@leandromachado7199 2 years ago
I tried to reduce the outliers by changing the min_samples parameter in HDBSCAN, setting it with the same min_cluster_size values as suggested by Grootendorst. However, in my study the number of outliers remained the same.
@python-programming 2 years ago
Can you tell me a little bit about your data? How many docs are you using? How many topics are created? What max_distance are you using?
@Htyagi1998 2 years ago
Can you please explain max distance, if possible?
@python-programming 2 years ago
Absolutely. The max distance is the furthest an outlier can be from the nearest topic center and still be reassigned. If an outlier is further than the max distance from every topic center, it will remain an outlier; otherwise, it will be assigned to the nearest topic. Handling outliers by assigning them to the nearest topic was originally introduced in Top2Vec, but the max_distance feature gives users a bit more control over how this process works.
@Kylbigel 2 years ago
@python-programming So if you turn it up to, say, .75, you will have more unique topics and smaller topic sizes?
@python-programming 2 years ago
@Kylbigel Sort of. If you turn it up, you will have more documents in each topic, because more outliers can be assigned to surrounding topics. It may not necessarily make the topics more unique, though.
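The max_distance behaviour described in this exchange can be sketched in a few lines of plain Python. This is only an illustration, not LeetTopic's actual code: the function name `reassign_outliers` is hypothetical, and it assumes that outliers carry the label -1 (as HDBSCAN marks noise), that a "topic center" is the mean of a topic's points, and that distance is Euclidean.

```python
import math


def reassign_outliers(points, labels, max_distance):
    """Give each outlier (label -1) the label of its nearest topic center,
    but only if that center is within max_distance. Hypothetical helper."""
    # Group the non-outlier points by topic label.
    groups = {}
    for point, label in zip(points, labels):
        if label != -1:
            groups.setdefault(label, []).append(point)

    # Assumption: a topic center is the coordinate-wise mean of its points.
    centers = {
        label: tuple(sum(coords) / len(members) for coords in zip(*members))
        for label, members in groups.items()
    }

    new_labels = []
    for point, label in zip(points, labels):
        if label == -1 and centers:
            nearest = min(centers, key=lambda t: math.dist(point, centers[t]))
            # Reassign only when the nearest center is close enough.
            if math.dist(point, centers[nearest]) <= max_distance:
                label = nearest
        new_labels.append(label)
    return new_labels


points = [(0, 0), (0, 1), (5, 5), (5, 6), (0.2, 0.5), (3, 3)]
labels = [0, 0, 1, 1, -1, -1]
tight = reassign_outliers(points, labels, max_distance=0.5)
loose = reassign_outliers(points, labels, max_distance=4.0)
```

With the small threshold only the outlier sitting next to topic 0 is absorbed; with the larger one the far outlier joins topic 1 as well. Topics grow, but no new topics appear, which matches the reply above.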
@chrissonntag9 1 year ago
The topics found are just numbers. Is there any way to relate these numbers to actual topics in English?
@python-programming 1 year ago
Thanks for the comment! Great question. There are a lot of ways to do this. One way is to use a transformer model to generate a topic for you from the top keywords of the topic.
@TeaDrinkingHacker 1 year ago
Hi WJB, when running this in Google Colab, I'm getting "AttributeError: 'TfidfVectorizer' object has no attribute 'get_feature_names'". I've uninstalled and updated scikit-learn and restarted the runtime, to no avail. Any ideas?
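For context on this error: scikit-learn deprecated `get_feature_names` in version 1.0 and removed it in 1.2 in favour of `get_feature_names_out`, so updating scikit-learn makes the error more likely, not less, for any code that still calls the old method. The thread does not say whether LeetTopic has since been updated. A minimal compatibility sketch for code you control, where `feature_names` is a hypothetical helper and not part of either library, might look like this:

```python
def feature_names(vectorizer):
    """Return the fitted vocabulary terms as a list, on old or new scikit-learn.

    Hypothetical helper: prefers get_feature_names_out (scikit-learn >= 1.0)
    and falls back to the older get_feature_names where it still exists.
    """
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


# Intended usage, assuming `tfidf` is an already-fitted TfidfVectorizer:
# terms = feature_names(tfidf)
```

If the failing call sits inside an installed package rather than your own code, the other option is to pin scikit-learn to a release below 1.2 in the Colab runtime.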