The Best Way to do Topic Modeling in Python - Top2Vec Introduction and Tutorial

Views: 30,481

Python Tutorials for Digital Humanities


Comments: 98

@justinhuang8034 2 years ago

You're killing it lately with these videos. Keep up the great work.

@python-programming 2 years ago

Thanks! That is great to hear. I am trying out a new style this month to see if subscribers like it.

@jesusmtz29 2 years ago

@@python-programming New subscriber here. Love your style of presentation

@Adrian_Marmy 1 year ago

Dude, this video is awesome. Breaking things down seems to be your super power.... 👌

@python-programming 1 year ago

Thanks so much! I always wanted a super power. Since this video came out, I think BERTopic has become a bit better. It has more features and is a bit more accessible to beginners now, too. It also has a thriving community.

@Adrian_Marmy 1 year ago

@@python-programming wow, awesome for you to comment this. I will have a look at it :-)

@dankchan420 2 years ago

I am a new subscriber and this .. was .. simply .. great! I wish there were more Top2Vec videos (ranging from beginner to advanced) . Keep up the excellent work. *hint* *hint* 🙂

@python-programming 2 years ago

Thanks! Great to hear! I will be making more in the future.

@abasisadegh 11 months ago

Thank you very much for this video, man. Is there a way to use pyLDAvis visualizations with top2vec?

@Kylbigel 2 years ago

Exactly what I needed, thank you!

@dankchan420 2 years ago

Can you show how to compare it to LDA with topic information gain or a coherence score? It's something I'm curious to see.

@edadila 2 years ago

I need this too! Thanks for the great video by the way👏

@python-programming 2 years ago

Great idea for a new video! Thanks!
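
For readers who want to try the comparison asked about above, here is a minimal sketch of a UMass-style coherence score. It only needs a topic's top words and a tokenized corpus, so the same function can score topics from LDA or from Top2Vec. The function name `umass_coherence` and the toy corpus are illustrative assumptions, not part of either library.

```python
import math
from itertools import combinations


def umass_coherence(top_words, tokenized_docs):
    """UMass-style coherence for one topic (values closer to 0 are better).

    top_words: words ordered from most to least representative of the topic.
    tokenized_docs: list of token lists, one per document.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    # combinations() keeps input order, so `earlier` always ranks above `later`.
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # word never occurs in the corpus; skip the pair
        score += math.log((doc_freq(earlier, later) + 1) / denom)
    return score


# Toy corpus (made up for illustration).
toy_docs = [["cat", "dog"], ["cat", "dog"], ["cat", "fish"]]
print(umass_coherence(["cat", "dog"], toy_docs))
print(umass_coherence(["cat", "fish"], toy_docs))
```

Averaging this score over all topics of each model gives one number per model to compare, although a single coherence metric is only a rough proxy for topic quality.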

@TheAbdallahk 2 years ago

Wow, this is amazing. Thank you so much!

@python-programming 2 years ago

No problem!! So happy to hear it was useful!

@JayShankarpure 2 years ago

Hi sir, I checked out your NER playlist and had a doubt. How can we calculate the accuracy of an NER model?

@python-programming 2 years ago

Hi. I am glad you are watching the videos. You analyze the Precision, Recall, and F-Score during training, but this only lets you know how the model is performing during the training process. To gather proper metrics, you need to structure a formal test with a held-out set and monitor the results. I have a video on it here: kzbin.info/www/bejne/oWKppaN3edGoqac
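
A minimal sketch of the held-out evaluation described above, assuming entities are compared as exact matches on `(doc_id, start, end, label)` tuples. The function name `entity_prf` and the sample annotations are hypothetical; the linked video may use a different setup.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall and F1 using exact span-and-label matches.

    gold, predicted: iterables of (doc_id, start, end, label) tuples taken
    from a held-out set the model never saw during training.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations: three gold entities, two predictions, one exact match.
gold = [(0, 0, 4, "PER"), (0, 10, 15, "LOC"), (1, 3, 8, "ORG")]
predicted = [(0, 0, 4, "PER"), (1, 3, 8, "PER")]
print(entity_prf(gold, predicted))
```

Exact matching is strict: a prediction with the right span but the wrong label counts as both a false positive and a false negative.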

@JayShankarpure 2 years ago

@@python-programming Got it, thanks sir. Actually, I am making a stock research platform called Shodh, which involves some advanced NLP. I would love to get your guidance on a few of the topics I am working on. Can we connect anytime soon? Thanks

@python-programming 2 years ago

Sure! I do consultation, just fill out the form on my website wjbmattingly.com

@SonnyGeorgeVlogs 2 years ago

Great video. Glad to have stumbled on it.

@python-programming 2 years ago

Thanks!

@sinabaghaei3504 1 year ago

So do you suggest working with Top2Vec rather than LDA? I mean, do you think the manual changes involved in implementing LDA and preprocessing the data are worth it, or should we stick to Top2Vec? By the way, your videos are awesome and I am really interested in going deep into topic modeling.

@python-programming 1 year ago

For most tasks, it makes more sense to use the newer methods applied in Top2Vec, BERTopic, or LeetTopic than to do traditional LDA topic modeling. That said, there are times when LDA may make more sense. It just depends on the problem that you are trying to solve. I have not had to use LDA in a while because the results from transformer-based topic modeling are far superior.

@BispensGipsGebis 2 years ago

You my Sir are awesome

@python-programming 2 years ago

Thanks!

@gwaliwamashaka8724 5 months ago

Awesome, thank you very much.

@sjoerdbraaksma9358 2 years ago

This is such a great find! What I am wondering is: can you train a BERT sentence-transformer on a large set of documents spanning several projects, then have top2vec use these embeddings to make a topic model for each project (so basically, for each subset of the larger corpus)?

@fetchthebattleaxe 2 years ago

Great video! Do you know if top2vec has options for when you have a dataset too large to fit into RAM? I have a dataset that is something like 9gb of text that I've been trying to topic model with different methods, so I'd be curious to try this out. But I probably can't just load the whole thing into a list and pass it in

@python-programming 2 years ago

Thanks! Great question. I have not personally tried it with a dataset that large just yet. What are your computer's specs? Do you have a CUDA-accelerated GPU?

@fetchthebattleaxe 2 years ago

@@python-programming CPU: AMD Ryzen 7 3700X (8 cores), 16 GB RAM. GPU: RTX 2070 Super. The GPU does have CUDA installed and I've used it for deep learning a bit. But the GPU itself only has 8 GB of VRAM and I've run into CUDA memory issues before. Though admittedly I have no idea how memory needs are shared between CPU and GPU. Either way, I'll probably try this library on a random slice of the full data to see if it shows promise. Thanks for drawing my attention to it!
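
One way to take the "random slice" mentioned above without loading the whole corpus is reservoir sampling, which streams through the data once and keeps only `k` items in memory. This is a minimal sketch; the function name `sample_lines` and the one-document-per-line file layout are assumptions, not something the thread specifies.

```python
import random


def sample_lines(lines, k, seed=None):
    """Reservoir sampling: a uniform random sample of k items from an
    iterable of unknown length, holding at most k items in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


# Works on any iterable, including an open file handle read line by line.
print(len(sample_lines(range(1_000_000), 5, seed=42)))
```

With a corpus stored one document per line, passing the open file object as `lines` would yield a sample small enough to hand to a topic model as an ordinary list of strings.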

@kosemekars 2 years ago

Great vid, as always. I'm interested in creating my own WordNet dataset, any ideas where I should start?

@python-programming 2 years ago

You are too kind! Thanks. That is a very interesting question that I have never gotten before. I have never attempted something precisely like that (so take what I say with a grain of salt), but I have worked with similar problems that were very domain-specific. I used a combination of heuristics and FastText embeddings to generate a sort of weakly supervised approach to forming a knowledge tree based on semantic and syntactic meaning. Does this help?

@kosemekars 2 years ago

@@python-programming Thanks for the illuminating answer. Do you think that a graph-based approach (using something like NetworkX) could be helpful? Basically starting from a lexicon or dictionary and mapping the relations.

@python-programming 2 years ago

No problem! Indeed I do. That was actually exactly how I graphed them out. You can also use word vectors and use their similarity to calculate the weights of the edges in the graph. That may help.
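
A minimal sketch of the edge-weighting idea above, using cosine similarity between word vectors as the edge weight. The tiny two-dimensional vectors, the `threshold` parameter, and the function names are all illustrative assumptions; real vectors would come from a trained embedding model such as FastText.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_edges(vectors, threshold=0.5):
    """Return (word_a, word_b, weight) for each pair at or above `threshold`."""
    edges = []
    for (w1, v1), (w2, v2) in combinations(vectors.items(), 2):
        weight = cosine(v1, v2)
        if weight >= threshold:
            edges.append((w1, w2, weight))
    return edges


# Made-up toy vectors standing in for real embeddings.
toy_vectors = {"rex": [1.0, 0.0], "regina": [0.9, 0.1], "mare": [0.0, 1.0]}
for edge in similarity_edges(toy_vectors):
    print(edge)
```

The resulting triples are in the shape that NetworkX's `Graph.add_weighted_edges_from` accepts, so they can be loaded into a graph directly. Comparing every pair is quadratic in vocabulary size, so a large lexicon would need a nearest-neighbour index instead.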

@kosemekars 2 years ago

@@python-programming Very interesting. Thanks.

@python-programming 2 years ago

No problem!

@lukechen8015 3 months ago

Hi there, how would this work if there are multiple topics tagged to one line? Is it all mutually exclusive?

@RedCloudServices 1 year ago

How do you filter stop words, and how does this compare to BERTopic?

@rush19772112 2 years ago

Dr Mattingly, I wish the best for your channel, and CONGRATULATIONS: you've been a GREAT help with your videos in understanding Pandas. Topic modeling is an area of INTEREST to me, especially everything related to the social sciences, and especially LDA. Needless to say how grateful I am to you, because you HELPED ME to UNDERSTAND by showing Pandas step by step in your tutorials. If you could do the same with the Latent Dirichlet Allocation algorithm, that would be marvelous. Even though you do have some code already written in a past video tutorial, I am still not quite there in how to apply it in a project with texts in languages other than English, such as Greek or Hebrew. Looking forward to seeing your video tutorial. Kind wishes, Christos Bardas

@python-programming 2 years ago

Thank you so much for your very kind comment! It means a lot to me. I will see if I can put together a video for topic modeling with non-English texts. Would Latin be alright? I don't have Greek or Hebrew unfortunately.

@rush19772112 2 years ago

@@python-programming Any non-English language would be OK. What I can't work out in the video about LDA is how to transform data: for instance, how to make a dataset of texts (from historical data) in PDF format, tokenize the words, and take all the necessary steps to run the LDA algorithm. But please do everything from scratch as you did in the Pandas series. Your videos in the Pandas series have proved an inspiration for me, so I'm truly grateful, Dr Mattingly! I'm looking forward to an LDA one as well! Kind regards, Christos Bardas

@TC-bv4on 2 years ago

Working on topic modeling for legal opinions. Have you tried BERT?

@python-programming 2 years ago

I have. It works very nicely. There is a library that wraps around Hugging Face BERT models. It is called BERTopic, but top2vec does the same things and a bit more. Just specify the BERT model.

@TC-bv4on 2 years ago

@@python-programming awesome! Thanks. I know there is a Legal BERT that is pretrained on legal materials, so idk if there is a way to specify it. Also hoping to supplement it with a citation network because you really can't understand an opinion without understanding its citations. If you have any ideas I'm all ears! Btw your channel is so needed. Hope it keeps growing while staying helpful and non-youtubey.

@python-programming 2 years ago

@@TC-bv4on you should be able to point to the legal BERT. Thanks so much for your kind words about my channel! If you want to see some legal content, let me know.

@TC-bv4on 2 years ago

I personally would but idk I might be the only one. Law is super far behind as far as technology goes

@juanmanuelaguiar3368 2 years ago

Great video, very clear! Do you know how Top2Vec deals with outliers? there is no 'outlier topic' at the end and all the documents seem to be assigned a topic. (I have BERTopic in mind where there is a -1 topic with the outliers)

@cuneyttyler4922 2 years ago

Nice video. But when I listed the words for each topic, it showed stop words only. Isn't it supposed to remove them in the preprocessing stage?

@patrykkoakowski4357 1 year ago

How did you force the code to run on CPU?

@sarasharick5209 2 years ago

I just started my first data science role and there’s a project coming up with a topic modeling aspect to it. Looking forward to this video.

@python-programming 2 years ago

Awesome! Glad to hear it. I hope it helps out a lot.

@AlexAlexanderIII 2 years ago

Great video.

@python-programming 2 years ago

Thanks!!

@tonyberber 2 years ago

I'm getting this error:
from top2vec import Top2Vec
ImportError: cannot import name 'Top2Vec' from partially initialized module 'top2vec' (most likely due to a circular import)
Any ideas are appreciated. I'm using an M1 Mac.

@malikrumi1206 2 years ago

Do you mean that Top2Vec requires *actual sentences*? What about paragraphs? Paragraphs with more than one topic inside them?

@python-programming 2 years ago

Great question. You can use text of any length, but if you are using BERT, you want to keep it under 512 tokens. (Double check my number.) If your texts have frequently overlapping subjects, you can plot the texts, see where that overlap occurs visually, and assign labels accordingly. Say topic 3 shares features of topics 1 and 2. It would theoretically be plotted between the two, pulled towards the one with which it shares the most overlap. But yes, you can use sentences or paragraphs. Either should be fine.
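
A minimal sketch of one way to keep each text under a token limit like the one mentioned above: split long documents into overlapping windows before embedding. It counts whitespace-separated words, which is only a rough stand-in for a BERT tokenizer's subword tokens, so the default of 400 words is an assumed safety margin rather than an exact limit. The function name `chunk_words` is hypothetical.

```python
def chunk_words(text, max_tokens=400, overlap=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# A made-up 1,000-word document splits into three overlapping chunks.
long_text = " ".join(f"w{i}" for i in range(1000))
print([len(c.split()) for c in chunk_words(long_text)])
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text.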

@malikrumi1206 2 years ago

Great! Thanks.

@python-programming 2 years ago

No problem!

@j0shm0o1 2 years ago

Thanks for a great video! I installed top2vec and tried importing it. I get the following error: No module named 'wordcloud.query_integral_image'. Any ideas?

@python-programming 2 years ago

Thanks! Interesting question. Did you create a new environment? I am wondering if you have an older version of wordcloud installed in your base?

@j0shm0o1 2 years ago

This got resolved when I created a new environment

@python-programming 2 years ago

@@j0shm0o1 excellent!

@AndrewPeverells 2 years ago

Hello Dr Mattingly, great guide as always! I'm in need of help though. I'll post this here, so maybe other people who have the same issue can solve it, but if you prefer I can send you a PM. I'm trying to feed my model 2 kinds of lists:
1. ["arma", "virumque", "cano", "troiae"...]
2. ["arma virumque cano troiae qui primus ab oris..."]
(From what I get in the documentation, the first one should be the way to go, as it processes lists of strings.) When trying to build my model, I get these two types of errors. For the first one:
Exception in thread Thread-171:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/apevere/.local/lib/python3.8/site-packages/gensim/models/word2vec.py", line 1163, in _worker_loop
    tally, raw_tally = self._do_train_job(data_iterable, alpha, thread_private_mem)
  File "/home/apevere/.local/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 424, in _do_train_job
    tally += train_document_dbow(
  File "gensim/models/doc2vec_inner.pyx", line 358, in gensim.models.doc2vec_inner.train_document_dbow
TypeError: Cannot convert list to numpy.ndarray
(and it gets stuck loading). For the second one:
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._kd_tree.BinaryTree.query()
k must be less than or equal to the number of training points
Do you know how to solve it?

@python-programming 2 years ago

Thanks! And great question. I think we can work on it here, that way others can get the benefit of hearing about the issue and potential solutions. First, unlike other topic modeling approaches, with top2vec, you do not need to tokenize your text, so a list of docs (as strings) is what you want to give the model. I have not tried to give it a list of lists, yet, but from what I can see from your first example, you appear to just be giving the model a list of words. In this scenario, you would typically want to give it a list of lists with each sublist being the tokens (words) from each document. Does that make sense? I suspect this is the origin of the error, but I would need to see your code more to address it properly. If you want, DM me on Twitter with a larger snippet and I will respond here with a better answer.

@AndrewPeverells 2 years ago

@@python-programming Thank you for your quick answer! Yes, it does make sense. As with a good share of coding-related problems, it's an issue of data types and how to properly handle them. Now though I'm a bit lost. As a test, I'm trying to feed my model this list of lists:
lst = [["arma", "virumque", "cano", "troiae", "qui", "primus", "ab", "oris"], ["nunc", "est", "bibendum", "nunc", "pede", "libero", "pulsare", "tellus"], ["uiuamus", "mea", "lesbia", "atque", "amemus"]]
(yes, I'm working with Latin!) The error for model = Top2Vec(lst) now is:
ValueError: Documents need to be a list of strings
Isn't it, like you said, a list of lists, with each sublist being strings (the tokens)? Am I missing something terribly basic, because I'm a complete beginner at coding?

@python-programming 2 years ago

@@AndrewPeverells no problem! Happy to help. Can you try and give it a list of sentences rather than a list of lists of tokens and see if that helps? Also can you paste your whole code here so I can see how you are loading your data? Also what OS are you using?
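
A minimal sketch of the suggestion above: if the data is already tokenized, joining each token list back into one string gives the flat list of strings that the "Documents need to be a list of strings" error asks for. The helper name `tokens_to_documents` is hypothetical; the sample data is the list quoted in the thread.

```python
def tokens_to_documents(tokenized_docs):
    """Turn a list of token lists into a list of plain strings."""
    return [" ".join(tokens) for tokens in tokenized_docs]


lst = [
    ["arma", "virumque", "cano", "troiae", "qui", "primus", "ab", "oris"],
    ["nunc", "est", "bibendum", "nunc", "pede", "libero", "pulsare", "tellus"],
    ["uiuamus", "mea", "lesbia", "atque", "amemus"],
]

documents = tokens_to_documents(lst)
for doc in documents:
    print(doc)
```

This only fixes the input type. Three short documents are still a very small corpus for any topic model to train on.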

@AndrewPeverells 2 years ago

@@python-programming Now I tried with a simple list of sentences, and it gave me another error. I'll paste the whole code, although it's very short:
>> from top2vec import Top2Vec
>> lst = ["arma virumque cano troiae qui primus ab oris", "nunc est bibendum, nunc pede libero pulsare tellus", "uiuamus mea lesbia atque amemus"]
>> model = Top2Vec(lst)
Error:
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_1573/2552625371.py in
----> 1 model = Top2Vec(lst)
~/.local/lib/python3.8/site-packages/top2vec/Top2Vec.py in __init__(self, documents, min_count, ngram_vocab, ngram_vocab_args, embedding_model, embedding_model_path, embedding_batch_size, split_documents, document_chunker, chunk_length, max_num_chunks, chunk_overlap_ratio, chunk_len_coverage_ratio, sentencizer, speed, use_corpus_file, document_ids, keep_documents, workers, tokenizer, use_embedding_model_tokenizer, umap_args, hdbscan_args, verbose)
    524 logger.info('Creating joint document/word embedding')
    525 self.embedding_model = 'doc2vec'
--> 526 self.model = Doc2Vec(**doc2vec_args)
    527
    528 self.word_vectors = self.model.wv.get_normed_vectors()
[...]
RuntimeError: you must first build vocabulary before training the model
I'm working in a Jupyter notebook, from an Ubuntu terminal environment for Windows.

@AndrewPeverells 2 years ago

Update: I think I found the issue for this. It's the size of the corpus. If I raise the number of documents (being whole sentences) in my corpus, it stops giving me the error. I went for at least 15 documents. Now it gives me another error though, and I'm quite lost. Code:
>> from top2vec import Top2Vec
>> lst = ["document1", "document2", "document3", ... "document17"]
>> model = Top2Vec(lst)
Error:
~/.local/lib/python3.8/site-packages/top2vec/Top2Vec.py in __init__(self, documents, min_count, ngram_vocab, ngram_vocab_args, embedding_model, embedding_model_path, embedding_batch_size, split_documents, document_chunker, chunk_length, max_num_chunks, chunk_overlap_ratio, chunk_len_coverage_ratio, sentencizer, speed, use_corpus_file, document_ids, keep_documents, workers, tokenizer, use_embedding_model_tokenizer, umap_args, hdbscan_args, verbose)
    682
    683 # create topic vectors
--> 684 self._create_topic_vectors(cluster.labels_)
    685
    686 # deduplicate topics
~/.local/lib/python3.8/site-packages/top2vec/Top2Vec.py in _create_topic_vectors(self, cluster_labels)
    857 unique_labels.remove(-1)
    858 self.topic_vectors = self._l2_normalize(
--> 859 np.vstack([self.document_vectors[np.where(cluster_labels == label)[0]]
    860 .mean(axis=0) for label in unique_labels]))
    861
in vstack(*args, **kwargs)
~/.local/lib/python3.8/site-packages/numpy/core/shape_base.py in vstack(tup)
    280 if not isinstance(arrs, list):
    281 arrs = [arrs]
--> 282 return _nx.concatenate(arrs, 0)
    283
    284
in concatenate(*args, **kwargs)
ValueError: need at least one array to concatenate
I really don't know what this is all about.

@amrmoursi7303 2 years ago

Thanks, I wish the best for your channel, and CONGRATULATIONS. How can we evaluate a topic modeling algorithm like top2vec or BERTopic? Thanks in advance.

@prabhacar 2 years ago

brilliant stuff! thanks! Just a small comment.....i am quite visual and I learn better with pictures...in future if possible please include some visualizations of the topic modeling.

@python-programming 2 years ago

Great idea! Thanks for the feedback!

@wenqianzhou9174 2 years ago

How about BERTopic?

@python-programming 2 years ago

I will do a video on that

@khalifakhalifa610 2 years ago

@@python-programming Can't wait for your BERTopic video. Your style is just amazing, kudos!!!

@jordoobodi 2 years ago

4:20 which is it!? "Each word in that document, type, all the items of that vector, all the documents..."

@speedTurtle 1 year ago

Bro is the NLP Gawd.

@python-programming 1 year ago

Thanks so much!

@boubacarbah1455 2 years ago

Hello, I'm trying to reproduce your exercise, but I got a problem when I tried to import Top2Vec with "from top2vec import Top2Vec". I get this error: no module named "llvmlite.binding.dylib". I could not fix it, so I wonder if you have a solution?

@bben4507 2 years ago

similar here, but I got: OSError: Could not load shared object file: libllvmlite.dylib

@thepresistence5935 2 years ago

Try BERTopic

@saranshtiwari8543 2 months ago

Hey KZbin, Why am I seeing this after 2 years? Recommend videos like these as soon as they get uploaded!

@dynahmhyte 11 months ago

ValueError: Documents need to be a list of strings (I get this when I type model = Top2Vec(docs))

@python-programming 11 months ago

Perhaps a few of your items are NaN values or ints or floats?
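
A minimal sketch for checking the guess above: list every item that is not a string, then keep only the non-empty strings. The function names and the sample list are hypothetical; NaN shows up here as a float, which is how missing text typically appears after loading a spreadsheet or CSV.

```python
def find_non_strings(docs):
    """Return (index, value) pairs for every item that is not a str."""
    return [(i, d) for i, d in enumerate(docs) if not isinstance(d, str)]


def clean_documents(docs):
    """Keep only items that are non-empty strings."""
    return [d for d in docs if isinstance(d, str) and d.strip()]


# Made-up data with a missing value (NaN), an int, and None mixed in.
docs = ["first document", float("nan"), 42, None, "   ", "last document"]

for index, value in find_non_strings(docs):
    print(index, type(value).__name__)

print(clean_documents(docs))
```

Dropping rows silently can hide a data-loading problem, so it is worth looking at what `find_non_strings` reports before cleaning.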

@jubinamarie 1 year ago

This and your other top2vec videos are awesome! This is exactly what I needed. I have a question for you. I use other tools (e.g., Tableau) and would want to export topic data from Jupyter Notebooks to use elsewhere. I figured out how to export the DF to Excel with a column added for the topic numbers, but can't for the life of me figure out how to get columns with the other information, such as the document scores, maybe the top 10 words for each topic. The inability to move all the data out is holding me back. Hope you can help. Thank you!

@python-programming 1 year ago

I ran into these same issues. That's why I created LeetTopic with a colleague. It does a lot of the same things as Top2Vec but returns a df with all the data you want.
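
A minimal sketch of the export asked about above, written with only the standard library. It assumes you already have four parallel per-document sequences: the text, a topic number, a score, and that topic's words. Top2Vec's `get_documents_topics` is believed to return results of roughly this shape, but check the library documentation; the function names, column names, and sample values here are illustrative assumptions.

```python
import csv

FIELDS = ["document", "topic", "score", "top_words"]


def build_topic_rows(documents, topic_nums, topic_scores, topic_words, top_n=10):
    """Combine parallel per-document sequences into one dict per document."""
    rows = []
    for doc, num, score, words in zip(documents, topic_nums, topic_scores, topic_words):
        rows.append({
            "document": doc,
            "topic": int(num),
            "score": float(score),
            "top_words": ", ".join(list(words)[:top_n]),
        })
    return rows


def write_rows(rows, path):
    """Write the rows to a CSV file that Excel or Tableau can open."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Made-up values standing in for real model output.
rows = build_topic_rows(
    ["first text", "second text"],
    [0, 1],
    [0.91, 0.78],
    [["law", "court", "judge"], ["sea", "ship", "port"]],
    top_n=2,
)
print(rows)
```

Calling `write_rows(rows, "topics.csv")` then produces one row per document. The same list of dicts can also be passed to `pandas.DataFrame` if a DataFrame is more convenient.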

@moemarocha3893 2 years ago

Hi! Anyone here having trouble importing Top2Vec due to problems with the NumPy version? I've tried most of the possible solutions I found on Stack Overflow, but nothing works...

@radoslavkoynov322 1 year ago

try a clean environment using venv