You're killing it lately with these videos. Keep up the great work.
@python-programming 2 years ago
Thanks! That is great to hear. I am trying out a new style this month to see if subscribers like it.
@jesusmtz29 2 years ago
@python-programming New subscriber here. Love your style of presentation.
@Adrian_Marmy 1 year ago
Dude, this video is awesome. Breaking things down seems to be your superpower... 👌
@python-programming 1 year ago
Thanks so much! I always wanted a superpower. Since this video came out, I think BERTopic has become a bit better. It has more features and is a bit more accessible to beginners now too. It also has a thriving community.
@Adrian_Marmy 1 year ago
@python-programming Wow, awesome of you to comment this. I will have a look at it :-)
@dankchan420 2 years ago
I am a new subscriber and this .. was .. simply .. great! I wish there were more Top2Vec videos (ranging from beginner to advanced). Keep up the excellent work. *hint* *hint* 🙂
@python-programming 2 years ago
Thanks! Great to hear! I will be making more in the future.
@abasisadegh 11 months ago
Thank you very much for this video, man. Is there a way to use pyLDAvis visualizations with Top2Vec?
@Kylbigel 2 years ago
Exactly what I needed, thank you!
@dankchan420 2 years ago
Can you show how to compare it to LDA with topic information gain or a coherence score? That's something I'm curious to see.
@edadila 2 years ago
I need this too! Thanks for the great video, by the way 👏
@python-programming 2 years ago
Great idea for a new video! Thanks!
@TheAbdallahk 2 years ago
Wow, this is amazing. Thank you so much!
@python-programming 2 years ago
No problem!! So happy to hear it was useful!
@JayShankarpure 2 years ago
Hi sir, I checked out your NER playlist and had a question. How can we calculate the accuracy of an NER model?
@python-programming 2 years ago
Hi. I am glad you are watching the videos. You analyze the precision, recall, and F-score during training, but this only lets you know how the model is performing during the training process. To gather proper metrics, you need to structure a formal test with a held-out set and monitor the results. I have a video on it here: kzbin.info/www/bejne/oWKppaN3edGoqac
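A minimal sketch of the held-out test described in this reply, assuming the gold and predicted entities for each held-out document are available as (start, end, label) tuples. The function name and the sample data are made up for illustration, and only exact span-and-label matches count as correct, which is one common convention rather than the only one.

```python
def evaluate_ner(gold_docs, pred_docs):
    """Micro-averaged precision, recall, and F1 over exact (start, end, label) matches."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)   # predicted and correct
        fp += len(pred_set - gold_set)   # predicted but not in the gold data
        fn += len(gold_set - pred_set)   # in the gold data but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical held-out set: two documents with hand-labeled entities.
gold = [[(0, 5, "PERSON"), (10, 16, "ORG")], [(3, 9, "GPE")]]
# Hypothetical model output for the same two documents (one label is wrong).
pred = [[(0, 5, "PERSON"), (10, 16, "GPE")], [(3, 9, "GPE")]]

precision, recall, f1 = evaluate_ner(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The key point is that the documents in `gold` are never shown to the model during training, so the scores describe how the model behaves on unseen text.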
@JayShankarpure 2 years ago
@python-programming Got it, thanks sir. Actually, I am making a stock research platform called Shodh, which involves some advanced NLP. I would love to get your guidance on a few of the topics I am working on. Can we connect sometime soon? Thanks.
@python-programming 2 years ago
Sure! I do consultation. Just fill out the form on my website, wjbmattingly.com
@SonnyGeorgeVlogs 2 years ago
Great video. Glad to have stumbled on it.
@python-programming 2 years ago
Thanks!
@sinabaghaei3504 1 year ago
So do you suggest working with Top2Vec rather than LDA? I mean, do you think the manual changes involved in implementing LDA and preprocessing the data are worth it, or should we stick to Top2Vec? By the way, your videos are awesome and I am really interested in going deep into topic modeling.
@python-programming 1 year ago
For most tasks, it makes more sense to use the newer methods applied in Top2Vec, BERTopic, or LeetTopic than to do traditional LDA topic modeling. That said, there are times when LDA may make more sense. It just depends on the problem that you are trying to solve. I have not had to use LDA in a while because the results from transformer-based topic modeling are far superior.
@BispensGipsGebis 2 years ago
You, my Sir, are awesome.
@python-programming 2 years ago
Thanks!
@gwaliwamashaka8724 5 months ago
Awesome, thank you very much.
@sjoerdbraaksma9358 2 years ago
This is such a great find! What I am wondering is: can you train a BERT sentence transformer on a large set of documents spanning several projects, then have Top2Vec use these embeddings to make a topic model for each project (so basically, for each subset of the larger corpus)?
@fetchthebattleaxe 2 years ago
Great video! Do you know if Top2Vec has options for when you have a dataset too large to fit into RAM? I have a dataset that is something like 9 GB of text that I've been trying to topic model with different methods, so I'd be curious to try this out. But I probably can't just load the whole thing into a list and pass it in.
@python-programming 2 years ago
Thanks! Great question. I have not personally tried it with a dataset that large just yet. What are your computer's specs? Do you have a CUDA-accelerated GPU?
@fetchthebattleaxe 2 years ago
@python-programming CPU: AMD Ryzen 7 3700X (8 cores), 16 GB RAM. GPU: RTX 2070 Super. The GPU does have CUDA installed and I've used it for deep learning a bit. But the GPU itself only has 8 GB of VRAM and I've run into CUDA memory issues before. Though admittedly I have no idea how memory needs are shared between CPU and GPU. Either way, I'll probably try this library on a random slice of the full data to see if it shows promise. Thanks for drawing my attention to it!
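One way to take the random slice mentioned here without loading all of the text into memory is reservoir sampling. This is a sketch only, assuming the corpus can be read as one document per line; the function name is hypothetical and the generator below stands in for a real file iterator.

```python
import random


def sample_lines(lines, k, seed=None):
    """Reservoir sampling (Algorithm R): keep k items chosen uniformly at random
    from an iterable of unknown length, holding only k items in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Replace an existing item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


# Stand-in for streaming a large file, e.g. iterating over an open file handle.
corpus = (f"document {n}" for n in range(10_000))
sample = sample_lines(corpus, k=100, seed=42)
print(len(sample), sample[:3])
```

The resulting list of strings is small enough to pass to a topic model directly, which makes it a cheap way to check whether the approach shows promise before dealing with the full dataset.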
@kosemekars 2 years ago
Great vid, as always. I'm interested in creating my own WordNet dataset, any ideas where I should start?
@python-programming 2 years ago
You are too kind! Thanks. That is a very interesting question that I have never gotten before. I have never attempted something precisely like that before (so take what I say with a grain of salt), but I have worked with similar problems that were very domain-specific. I used a combination of heuristics and FastText embeddings to generate a sort of weakly supervised approach to forming a knowledge tree based on semantic and syntactic meaning. Does this help?
@kosemekars 2 years ago
@python-programming Thanks for the illuminating answer. Do you think that a graph-based approach (using something like NetworkX) could be helpful? Basically starting from a lexicon or dictionary and mapping the relations.
@python-programming 2 years ago
No problem! Indeed I do. That was actually exactly how I graphed them out. You can also use word vectors and use their similarity to calculate the weights of the edges in the graph. That may help.
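The idea of using vector similarity as edge weights can be sketched in plain Python. The three-dimensional vectors below are invented for illustration (real FastText vectors have hundreds of dimensions), the threshold is an arbitrary assumption, and the graph is a simple nested dictionary rather than a NetworkX object.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(vectors, threshold=0.5):
    """Return {word: {neighbor: weight}} with an edge for every pair of words
    whose cosine similarity is at least `threshold`."""
    graph = {word: {} for word in vectors}
    for a, b in combinations(vectors, 2):
        weight = cosine(vectors[a], vectors[b])
        if weight >= threshold:
            graph[a][b] = weight
            graph[b][a] = weight  # undirected: store the edge in both directions
    return graph


# Toy word vectors, made up for this example.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}
graph = build_graph(vectors, threshold=0.8)
print(graph)
```

The same edges could be loaded into NetworkX with `G.add_edge(a, b, weight=w)` if graph algorithms or plotting are needed.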
@kosemekars 2 years ago
@python-programming Very interesting. Thanks.
@python-programming 2 years ago
No problem!
@lukechen8015 3 months ago
Hi there, how would this work if there are multiple topics tagged to one line? Are they all mutually exclusive?
@RedCloudServices 1 year ago
How do you filter stop words, and how does this compare to BERTopic?
@rush19772112 2 years ago
Dr Mattingly, I wish the best for your channel, and CONGRATULATIONS. You've been a GREAT help with your videos in understanding Pandas. Topic modeling is an area of INTEREST to me, especially everything related to the social sciences, and especially LDA. Looking forward to seeing your video tutorial. Needless to say, I am grateful to you, because you HELPED ME UNDERSTAND by showing step-by-step Pandas tutorials. If you could do the same with the Latent Dirichlet Allocation algorithm, that would be marvelous. Even though you do have some code already written in a past video tutorial, I am still not quite there in how to apply it in a project with texts in languages other than English, such as Greek or Hebrew. Looking forward to seeing your video tutorial. Kind wishes, Christos Bardas
@python-programming 2 years ago
Thank you so much for your very kind comment! It means a lot to me. I will see if I can put together a video for topic modeling with non-English texts. Would Latin be alright? I don't have Greek or Hebrew unfortunately.
@rush19772112 2 years ago
@python-programming Any non-English language would be OK. What I can't work out in the video about LDA is how to transform the data. For instance, how to make a dataset of texts (from historical data) in PDF format, tokenize the words, and take all the necessary steps to run the LDA algorithm, etc. But please make everything from scratch as you did in the Pandas series. Your videos in the Pandas series have proved an inspiration for me, therefore I'm truly grateful, Dr Mattingly! I'm looking forward to an LDA one as well! Kind regards, Christos Bardas
@TC-bv4on 2 years ago
Working on topic modeling for legal opinions. Have you tried BERT?
@python-programming 2 years ago
I have. It works very nicely. There is a library that wraps around Hugging Face BERT models. It is called BERTopic, but Top2Vec does the same things and a bit more. Just specify the BERT model.
@TC-bv4on 2 years ago
@python-programming Awesome! Thanks. I know there is a Legal BERT that is pretrained on legal materials, so idk if there is a way to specify it. Also hoping to supplement it with a citation network, because you really can't understand an opinion without understanding its citations. If you have any ideas I'm all ears! Btw your channel is so needed. Hope it keeps growing while staying helpful and non-youtubey.
@python-programming 2 years ago
@TC-bv4on You should be able to point to the Legal BERT. Thanks so much for your kind words about my channel! If you want to see some legal content, let me know.
@TC-bv4on 2 years ago
I personally would, but idk, I might be the only one. Law is super far behind as far as technology goes.
@juanmanuelaguiar3368 2 years ago
Great video, very clear! Do you know how Top2Vec deals with outliers? There is no 'outlier topic' at the end and all the documents seem to be assigned a topic. (I have BERTopic in mind, where there is a -1 topic with the outliers.)
@cuneyttyler4922 2 years ago
Nice video. But when I listed the words for each topic, it showed stop words only. Isn't it supposed to remove them in the preprocessing stage?
@patrykkoakowski4357 1 year ago
How did you force the code to run on CPU?
@sarasharick5209 2 years ago
I just started my first data science role and there’s a project coming up with a topic modeling aspect to it. Looking forward to this video.
@python-programming 2 years ago
Awesome! Glad to hear it. I hope it helps out a lot.
@AlexAlexanderIII 2 years ago
Great video.
@python-programming 2 years ago
Thanks!!
@tonyberber 2 years ago
I'm getting this error: "from top2vec import Top2Vec ImportError: cannot import name 'Top2Vec' from partially initialized module 'top2vec' (most likely due to a circular import)". Any ideas are appreciated. I'm using an M1 Mac.
@malikrumi1206 2 years ago
Do you mean that Top2Vec requires *actual sentences*? What about paragraphs? Paragraphs with more than one topic inside them?
@python-programming 2 years ago
Great question. You can use text of any length, but if you are using BERT, you want to keep it under 512 tokens. (Double-check my number.) If your texts have frequently overlapping subjects, you can plot the texts, see where that overlap occurs visually, and assign labels accordingly. Say topic 3 shares features of topics 1 and 2. It would theoretically be plotted between the two, with a pull towards the one it shares the most overlap with. But yes, you can use sentences or paragraphs. Either should be fine.
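To stay under a token limit like the one mentioned in this reply, long texts can be split into overlapping windows before they are embedded. The sketch below is an approximation under a stated assumption: it counts whitespace-separated words, while BERT-style tokenizers split words into sub-word pieces and so usually produce more tokens than words. Setting `max_tokens` well below the real limit leaves a safety margin. The function name and default values are hypothetical.

```python
def chunk_text(text, max_tokens=512, overlap=50):
    """Split text into windows of at most `max_tokens` whitespace tokens,
    with `overlap` tokens shared between consecutive windows."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# A made-up 1200-word document.
long_doc = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_text(long_doc, max_tokens=512, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk can then be treated as its own document, which also helps when one long text covers more than one topic.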
@malikrumi1206 2 years ago
Great! Thanks.
@python-programming 2 years ago
No problem!
@j0shm0o1 2 years ago
Thanks for a great video! I installed top2vec and tried importing it. I get the following error: No module named 'wordcloud.query_integral_image'. Any ideas?
@python-programming 2 years ago
Thanks! Interesting question. Did you create a new environment? I am wondering if you have an older version of wordcloud installed in your base?
@j0shm0o1 2 years ago
This got resolved when I created a new environment.
@python-programming 2 years ago
@j0shm0o1 Excellent!
@AndrewPeverells 2 years ago
Hello Dr Mattingly, great guide as always! I'm in need of help though. I'll post this here, so maybe other people who have the same issue can solve it, but if you prefer I can send you a PM.
I'm trying to feed my model 2 kinds of lists:
1. ["arma", "virumque", "cano", "troiae"...]
2. ["arma virumque cano troiae qui primus ab oris..."]
(From what I get in the documentation, the first one should be the way to go, as it processes lists of strings.)
When trying to build my model, I get these two types of errors. For the first one (and it gets stuck loading):
Exception in thread Thread-171:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/apevere/.local/lib/python3.8/site-packages/gensim/models/word2vec.py", line 1163, in _worker_loop
    tally, raw_tally = self._do_train_job(data_iterable, alpha, thread_private_mem)
  File "/home/apevere/.local/lib/python3.8/site-packages/gensim/models/doc2vec.py", line 424, in _do_train_job
    tally += train_document_dbow(
  File "gensim/models/doc2vec_inner.pyx", line 358, in gensim.models.doc2vec_inner.train_document_dbow
TypeError: Cannot convert list to numpy.ndarray
For the second one:
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._kd_tree.BinaryTree.query()
k must be less than or equal to the number of training points
Do you know how to solve it?
@python-programming 2 years ago
Thanks! And great question. I think we can work on it here, that way others can get the benefit of hearing about the issue and potential solutions. First, unlike other topic modeling approaches, with Top2Vec you do not need to tokenize your text, so a list of docs (as strings) is what you want to give the model. I have not tried to give it a list of lists yet, but from what I can see from your first example, you appear to be giving the model just a list of words. In this scenario, you would typically want to give it a list of lists with each sublist being the tokens (words) from each document. Does that make sense? I suspect this is the origin of the error, but I would need to see more of your code to address it properly. If you want, DM me on Twitter with a larger snippet and I will respond here with a better answer.
@AndrewPeverells 2 years ago
@python-programming Thank you for your quick answer! Yes, it does make sense. As with a good share of coding-related problems, it's an issue of data types and how to properly handle them. Now though I'm a bit lost. As a test, I'm trying to feed my model this list: lst = [["arma", "virumque", "cano", "troiae", "qui", "primus", "ab", "oris"], ["nunc", "est", "bibendum", "nunc", "pede", "libero", "pulsare", "tellus"], ["uiuamus", "mea", "lesbia", "atque", "amemus"]] (yes, I'm working with Latin!). The error for model = Top2Vec(lst) now is: "ValueError: Documents need to be a list of strings". Isn't it, like you said, a list of lists, with each sublist being strings (the tokens)? Am I missing something terribly basic, because I'm a complete beginner at coding?
@python-programming 2 years ago
@AndrewPeverells No problem! Happy to help. Can you try giving it a list of sentences rather than a list of lists of tokens and see if that helps? Also, can you paste your whole code here so I can see how you are loading your data? Also, what OS are you using?
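The error message quoted above ("Documents need to be a list of strings") suggests that each document has to be a single string rather than a list of tokens. A small sketch of that conversion, using the Latin token lists from this thread; the helper name `to_documents` is hypothetical.

```python
def to_documents(items):
    """Accept a mix of strings and token lists and return a flat list of
    document strings, joining token lists with single spaces."""
    docs = []
    for item in items:
        if isinstance(item, str):
            docs.append(item)
        else:
            docs.append(" ".join(item))
    return docs


# The tokenized documents from the comment above.
lst = [
    ["arma", "virumque", "cano", "troiae", "qui", "primus", "ab", "oris"],
    ["nunc", "est", "bibendum", "nunc", "pede", "libero", "pulsare", "tellus"],
    ["uiuamus", "mea", "lesbia", "atque", "amemus"],
]
docs = to_documents(lst)
print(docs)
```

This only fixes the input type. As the following replies in this thread show, a corpus of just a few documents can still fail for other reasons.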
@AndrewPeverells 2 years ago
@python-programming Now I tried with a simple list of sentences, and it gave me another error. I'll paste the whole code, although it's very short:
>> from top2vec import Top2Vec
>> lst = ["arma virumque cano troiae qui primus ab oris", "nunc est bibendum, nunc pede libero pulsare tellus", "uiuamus mea lesbia atque amemus"]
>> model = Top2Vec(lst)
Error:
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_1573/2552625371.py in
----> 1 model = Top2Vec(lst)
~/.local/lib/python3.8/site-packages/top2vec/Top2Vec.py in __init__(self, documents, min_count, ngram_vocab, ngram_vocab_args, embedding_model, embedding_model_path, embedding_batch_size, split_documents, document_chunker, chunk_length, max_num_chunks, chunk_overlap_ratio, chunk_len_coverage_ratio, sentencizer, speed, use_corpus_file, document_ids, keep_documents, workers, tokenizer, use_embedding_model_tokenizer, umap_args, hdbscan_args, verbose)
    524 logger.info('Creating joint document/word embedding')
    525 self.embedding_model = 'doc2vec'
--> 526 self.model = Doc2Vec(**doc2vec_args)
    527
    528 self.word_vectors = self.model.wv.get_normed_vectors()
[...]
RuntimeError: you must first build vocabulary before training the model
I'm working in a Jupyter notebook, from an Ubuntu terminal environment for Windows.
@AndrewPeverells 2 years ago
Update: I think I found the issue for this. It's the size of your corpus. If I raise the number of documents (being whole sentences) in my corpus, it stops giving me the error. I went for at least 15 documents. Now it gives me another error though, and I'm quite lost.
Code:
>> from top2vec import Top2Vec
>> lst = ["document1", "document2", "document3", ... "document17"]
>> model = Top2Vec(lst)
Error:
~/.local/lib/python3.8/site-packages/top2vec/Top2Vec.py in __init__(self, documents, min_count, ngram_vocab, ngram_vocab_args, embedding_model, embedding_model_path, embedding_batch_size, split_documents, document_chunker, chunk_length, max_num_chunks, chunk_overlap_ratio, chunk_len_coverage_ratio, sentencizer, speed, use_corpus_file, document_ids, keep_documents, workers, tokenizer, use_embedding_model_tokenizer, umap_args, hdbscan_args, verbose)
    682
    683 # create topic vectors
--> 684 self._create_topic_vectors(cluster.labels_)
    685
    686 # deduplicate topics
~/.local/lib/python3.8/site-packages/top2vec/Top2Vec.py in _create_topic_vectors(self, cluster_labels)
    857 unique_labels.remove(-1)
    858 self.topic_vectors = self._l2_normalize(
--> 859 np.vstack([self.document_vectors[np.where(cluster_labels == label)[0]]
    860 .mean(axis=0) for label in unique_labels]))
    861
in vstack(*args, **kwargs)
~/.local/lib/python3.8/site-packages/numpy/core/shape_base.py in vstack(tup)
    280 if not isinstance(arrs, list):
    281 arrs = [arrs]
--> 282 return _nx.concatenate(arrs, 0)
    283
    284
in concatenate(*args, **kwargs)
ValueError: need at least one array to concatenate
I really don't know what this is all about.
@amrmoursi7303 2 years ago
Thanks, I wish the best for your channel, and CONGRATULATIONS. How can we evaluate a topic modeling algorithm like Top2Vec or BERTopic? Thanks in advance.
@prabhacar 2 years ago
Brilliant stuff! Thanks! Just a small comment: I am quite visual and I learn better with pictures. In future, if possible, please include some visualizations of the topic modeling.
@python-programming 2 years ago
Great idea! Thanks for the feedback!
@wenqianzhou9174 2 years ago
How about BERTopic?
@python-programming 2 years ago
I will do a video on that.
@khalifakhalifa610 2 years ago
@python-programming Can't wait for your BERTopic video. Your style is just amazing, kudos!!!
@jordoobodi 2 years ago
4:20 which is it!? "Each word in that document, type, all th the items of that vector, all the documents.."
@speedTurtle 1 year ago
Bro is the NLP Gawd.
@python-programming 1 year ago
Thanks so much!
@boubacarbah1455 2 years ago
Hello, I'm trying to reproduce your exercise, but I ran into a problem when I tried to import Top2Vec with "from top2vec import Top2Vec". I get this error: no module named "llvmlite.binding.dylib". I could not fix it, so I wonder if you have a solution?
@bben4507 2 years ago
Similar here, but I got: OSError: Could not load shared object file: libllvmlite.dylib
@thepresistence5935 2 years ago
Try BERTopic
@saranshtiwari8543 2 months ago
Hey KZbin, Why am I seeing this after 2 years? Recommend videos like these as soon as they get uploaded!
@dynahmhyte 11 months ago
ValueError: Documents need to be a list of strings (I get this when I type model = Top2Vec(docs))
@python-programming 11 months ago
Perhaps a few of your items are NaN values or ints or floats?
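A quick way to check for the situation suggested in this reply is to separate the real strings from everything else before building the model. This is a sketch with made-up data and a hypothetical helper name. Missing values that come out of a pandas column typically show up as float NaN, which is why they are reported as "float" here.

```python
def split_valid_documents(raw):
    """Return (docs, rejected): docs keeps only non-empty strings, rejected
    lists the (index, type name) of every item that was left out."""
    docs, rejected = [], []
    for index, item in enumerate(raw):
        if isinstance(item, str) and item.strip():
            docs.append(item)
        else:
            rejected.append((index, type(item).__name__))
    return docs, rejected


# Made-up input containing the kinds of values that commonly cause the error.
raw = ["first document", float("nan"), 42, None, "   ", "second document", 3.5]
docs, rejected = split_valid_documents(raw)
print(docs)
print(rejected)
```

If `rejected` is not empty, those positions are the likely cause of the "list of strings" error, and `docs` is the cleaned list to pass on instead.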
@jubinamarie 1 year ago
This and your other Top2Vec videos are awesome! This is exactly what I needed. I have a question for you. I use other tools (e.g., Tableau) and would want to export topic data from Jupyter Notebooks to use elsewhere. I figured out how to export the DF to Excel with a column added for the topic numbers, but can't for the life of me figure out how to get columns with the other information, such as the document scores, or maybe the top 10 words for each topic. The inability to move all the data out is holding me back. Hope you can help. Thank you!
@python-programming 1 year ago
I ran into these same issues. That's why I created LeetTopic with a colleague. It does a lot of the same things as Top2Vec but returns a df with all the data you want.
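For the export question above, one library-agnostic option is to flatten the per-document topic number, score, and top words into rows and write them as CSV, which Tableau and Excel can both read. This sketch does not call Top2Vec or LeetTopic at all: the input lists are invented, and how they are obtained from a trained model is left as an assumption. The function names are hypothetical.

```python
import csv
import io

FIELDS = ["document", "topic", "score", "top_words"]


def build_rows(documents, doc_topics, doc_scores, topic_words, top_n=10):
    """Combine per-document topic assignments with each topic's top words."""
    rows = []
    for text, topic, score in zip(documents, doc_topics, doc_scores):
        rows.append({
            "document": text,
            "topic": topic,
            "score": round(float(score), 4),
            "top_words": ", ".join(topic_words[topic][:top_n]),
        })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text (write this string to a file to export)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up model output: two documents, two topics.
documents = ["the court ruled on the appeal", "the team won the match"]
doc_topics = [0, 1]
doc_scores = [0.8123, 0.6571]
topic_words = {0: ["court", "appeal", "judge"], 1: ["team", "match", "goal"]}

rows = build_rows(documents, doc_topics, doc_scores, topic_words, top_n=2)
csv_text = to_csv(rows)
print(csv_text)
```

The same list of dictionaries can also be handed to `pandas.DataFrame(rows)` if a DataFrame is more convenient than a CSV string.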
@moemarocha3893 2 years ago
Hi! Anyone here having trouble importing Top2Vec due to problems with the NumPy version? I've tried most of the possible solutions I found on Stack Overflow, but nothing works...