I know this video is a year old, but on the dimensionality reduction part: you don't reduce the dimensions just because there is too much information and you want to compress it. It's mainly due to the "curse of dimensionality": as the number of dimensions grows, the distances between data points become more and more alike, so they carry less and less information. Trying to cluster in this high-dimensional space therefore produces arbitrary clusters, because the distances are nearly meaningless.
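A tiny self-contained sketch of this effect (illustrative only, not from the video): for uniformly random points, the relative gap between the nearest and farthest neighbour shrinks as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))               # 500 uniform points in [0, 1]^dim
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast tends toward 0 as dim grows: all distances look alike
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:4d}  relative contrast={contrast:.3f}")
```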
@janspoerer 22 days ago
Thank you for the video! The notebook seems to have been removed from the repository. Is it still available?
@tomwalczak499 2 years ago
Awesome video James! Great idea using a world map to illustrate dimensionality reduction techniques!
@shaheerzaman620 2 years ago
Amazing. It was a lot of fun to go through this!
@jamesbriggs 2 years ago
Awesome to hear
@junchoi5605 1 year ago
Hi James, when doing topic modeling with BERTopic, how do we choose UMAP's n_neighbors and n_components if we don't already have predefined topic labels, like the Reddit data's Sub field for the Selfnotes field?
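For reference, a hedged sketch of how those UMAP parameters get passed into BERTopic; the values shown are common defaults, not tuned answers to the question. Without predefined labels there's no closed-form recipe: you typically try a few settings and inspect the resulting clusters.

```python
from umap import UMAP
from bertopic import BERTopic

umap_model = UMAP(
    n_neighbors=15,    # smaller values emphasise local structure, larger global
    n_components=5,    # dimensionality of the reduced space handed to the clusterer
    min_dist=0.0,
    metric="cosine",
    random_state=42,   # fix for reproducible topics
)
topic_model = BERTopic(umap_model=umap_model)
```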
@WouterSuren 2 years ago
Well-prepared demo and well-crafted video, thanks mate!
@chineduezeofor2481 1 year ago
Excellent video!
@renatorao9211 2 years ago
I found you yesterday looking for an NSWG video and now this. Really nice how relevant your videos are.
@jamesbriggs 2 years ago
NSWG? That's awesome though, glad it's helping!
@renatorao9211 2 years ago
@@jamesbriggs approximate nearest neighbor search with navigable small world graphs, but I think your video was about the randomized hierarchical approach
@jamesbriggs 2 years ago
@@renatorao9211 ah yes, I did cover HNSW, which is hierarchical NSW, and I gave a high-level view of NSW too. Check out the article if it helps, there might be more detail there: pinecone.io/learn/hnsw
@shameekm2146 10 months ago
Hi there, the link to the Colab notebook shows "repository not found". Could you please update the link?
@kantafcb1 2 years ago
Thanks a million James. So clearly explained!
@RajeshGupta-gx3yz 2 years ago
Hi James, thanks for the excellent set of videos. Do you know of any pretrained SentenceTransformer models that can work on longer documents?
@jamesbriggs 2 years ago
Unfortunately I'm not aware of any; most of them tend to be capped at around 512 tokens
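A common workaround, sketched here as a general technique rather than anything from the video: split long documents into chunks that fit the model's limit, embed each chunk, and pool the chunk embeddings. The model choice, chunk size, and mean-pooling are all assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative

def embed_long(text: str, chunk_words: int = 200) -> np.ndarray:
    """Split a long document into word chunks, embed each, and mean-pool."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)] or [""]
    return model.encode(chunks).mean(axis=0)

vector = embed_long("some very long document " * 500)
print(vector.shape)  # (384,) for this model
```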
@alecg223 1 year ago
Try with BERT, the Google NLP model.
@averma12 2 years ago
Hi James, I had an issue running the UMAP code. I fixed it with `pip uninstall umap` followed by `pip install umap-learn`, and then the import that worked was `import umap.umap_ as umap`. Can you confirm whether umap worked for you as written? It didn't for me until I followed those steps.
@jamesbriggs 2 years ago
hey Abhinav, I did `pip install umap-learn` (pip or conda), and for the import I did `from umap import UMAP`. I can confirm that `pip install umap` installs the wrong library, but your import is strange; does `from umap import UMAP` not work for you?
@averma12 2 years ago
@@jamesbriggs yes, it's weird for me. Can you share the versions of the umap and bertopic libraries?
@averma12 2 years ago
@@jamesbriggs I did the same; it didn't work for me. I installed an older version of BERTopic (0.9.4), which pulled in older versions of umap-learn and hdbscan. That seemed to work, and now your code runs with the imports as you wrote them.
@aizasyamimi 2 years ago
Can BERTopic be used for small datasets with fewer than 1000 rows, or short sentences per row? Will it be reliable for topic modelling, then doing UMAP and clustering?
@jamesbriggs 2 years ago
ideally more, but yes, it's possible. Just bear in mind that more (quality) data will tend to result in better clustering etc
@dr.kingschultz 1 year ago
BERTopic isn't working on Windows. Can you please create a video just on installing it? I know it sounds ridiculous, but I've tried everything.
@ernestosantiesteban6333 10 months ago
Wow, your videos always save my schoolwork. If one day I become a millionaire, I will give you part of my company.
@brianferrell9454 2 years ago
Does BERTopic understand the context in which words are used? So in your example, PyTorch can be the most used word in a particular topic, but if the word PyTorch is used differently in different contexts within that topic, does the model pick up on that, since it has the transformers?
@jamesbriggs 2 years ago
Yes, thanks to the transformer part it understands the context, and here we're using sentence transformers too, so we're encoding each sentence (or paragraph) as a whole
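A quick sketch of that context sensitivity (the model name is an assumption; the video may use a different sentence-transformers checkpoint):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "I deposited the cheque at the bank.",
    "The bank approved my loan application.",
    "We sat on the grassy bank of the river.",
]
emb = model.encode(sentences)
# The two finance sentences should score noticeably closer to each other
# than either does to the river sentence, despite all sharing "bank".
print(cosine_similarity(emb).round(2))
```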
@seemarani7314 2 years ago
Thank you very much for the tutorial, sir. How are the parameters tuned in BERTopic? What is the advantage of BERTopic over standard topic models such as LDA and NMF? What is the difference in the number of parameters between those topic models and BERTopic? Please help, sir.
@Dan-vo7hl 2 years ago
I've found one of the most important tuning parameters is the number of topics. One of the advantages of BERTopic over some of the other methods is that it comes up with its own number of topics. However, in some cases you will need to set a minimum topic size to reduce the overall number of topics. There is a simple topic-reduction function in BERTopic, reduce_topics(), which works well if you are OK with a relatively large number of uncategorized documents. The BERTopic website has some detailed tuning info. Compared with other approaches, the big deal is that this works on AI model embeddings rather than simpler, less sophisticated counts and groupings of word occurrences. One of the upsides is that you don't have to do much (if any) data cleaning when you start, whereas LDA and NMF rely on the data being thoroughly pre-processed.
@seemarani7314 2 years ago
@@Dan-vo7hl OK sir, thanks for the kind reply. Sir, can we know how these parameters are handled in BERTopic such that it automatically detects the number of topics without our specifying it?
@Dan-vo7hl 2 years ago
@@seemarani7314 BERTopic uses HDBSCAN to identify clusters and assign documents to them. The HDBSCAN settings min_cluster_size and min_samples control how many clusters are created and are well documented in the HDBSCAN docs. There is a section on tuning in the BERTopic documentation which covers this as well.
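Putting that together, a hedged sketch of wiring a custom HDBSCAN into BERTopic (the parameter values and example corpus are illustrative; the reduce_topics signature shown matches recent BERTopic versions):

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

hdbscan_model = HDBSCAN(
    min_cluster_size=50,   # larger -> fewer, broader topics
    min_samples=10,        # larger -> more documents left as outliers (topic -1)
    metric="euclidean",
    prediction_data=True,
)
topic_model = BERTopic(hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

# Post-hoc merging down to a target topic count, as mentioned above:
topic_model.reduce_topics(docs, nr_topics=20)
```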
@seemarani7314 2 years ago
Thank you so much, sir 🙏
@wasgeht2409 2 years ago
Thanks! I have one question: if I get a cluster 2 with the words year, month, and time, and I want to manually remove the word time from cluster 2 and put it into another cluster, is that possible?
@jamesbriggs 2 years ago
not directly. You could find all of the items in cluster 2 that contain the word "time" and tag them as belonging to cluster 1, then feed that tagged data into a new BERTopic process, using the labels via the HDBSCAN component
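One hedged way to implement that re-tagging idea: recent BERTopic versions support (semi-)supervised fitting via the `y` argument of `fit_transform`, where -1 marks documents whose topic should still be learned. The word-based rule and corpus below are purely illustrative.

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

# Pin any document mentioning "time" to label 1; leave the rest unlabelled (-1)
labels = [1 if "time" in doc.lower() else -1 for doc in docs]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, y=labels)
```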
@wasgeht2409 2 years ago
@@jamesbriggs Thanks... and how could I, for example, remove the word from cluster 2 and add it into cluster 1?
@waleedkhalid1145 2 years ago
Thank you so much. You are amazing at explaining this lecture. It's very understandable. I will recommend this video to all my friends.
@jamesbriggs 2 years ago
thanks!
@BlackLightning0Games 1 year ago
James! Thanks for the tutorial, very helpful. The only thing I don't know how to do is implement this with my own custom dataset. I have my data split into topics, like this: topic 1 = ["sentence 1", "sentence 2", ...], topic 2 = ["sentence 3", "sentence 4", ...], topic 3 = ... How do I turn it into the correct format, like the model.fit_transform(data['selftext']) and data['title'][i] you are using with the pretrained model? If you could explain how to organize my data to make this work, or direct me to a resource, that would be amazing. Subscribed!
@alecg223 1 year ago
There are two options: 1) Do you want subtopics of each topic you already have? Apply the model to each one separately. 2) Do you want automatic topics from all your data? Create a single list with all the elements of each of your lists and apply the model. A pretrained model within BERTopic only means that it helps create the embedding vectors; that step happens before the model enters the clustering phase.
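A minimal sketch of option 2 (the topic lists mirror the question's hypothetical format; a real run needs far more documents than this):

```python
from bertopic import BERTopic

topic_1 = ["sentence 1", "sentence 2"]
topic_2 = ["sentence 3", "sentence 4"]
topic_3 = ["sentence 5", "sentence 6"]

# fit_transform just wants one flat list of strings, like data['selftext'] was
docs = topic_1 + topic_2 + topic_3
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```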
@sanazulfiqar676 2 years ago
Nice explanation. Can BERTopic work well in other languages (Hindi, Urdu) for topic modelling?
@jamesbriggs 2 years ago
it is possible, but it would require a model trained on Hindi or Urdu, or a multilingual model that covers those languages
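For what it's worth, BERTopic exposes a multilingual shortcut that loads a multilingual sentence-transformers model; whether its Hindi and Urdu quality is good enough is worth verifying on your own data.

```python
from bertopic import BERTopic

# language="multilingual" loads a paraphrase-multilingual sentence-transformers
# model covering 50+ languages (Hindi and Urdu among them, per its model card)
topic_model = BERTopic(language="multilingual")
```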
@yucg9684 2 years ago
thank you for this clear explanation!
@jamesbriggs 2 years ago
You're welcome!
@ylazerson 2 years ago
Fantastic lesson; super appreciated!
@dabody123456 2 years ago
Thanks a lot for sharing
@prabhacar 2 years ago
great stuff! thanks!
@wolfers5090 2 years ago
Fluff. Some interesting parts, but I'm still not quite sure what BERTopic is. Poorly introduced; in fact, very much just a mix of code run-throughs. Not coherent, and maybe no actual appreciation of what BERT is. Good presentation skills, though.
@jamesbriggs 2 years ago
Thanks for the feedback. It's a complicated topic; you might find the article more coherent: www.pinecone.io/learn/bertopic
@boubacarbah1455 2 years ago
Hello, I'm trying to practice with BERTopic. I'm really new to it. I was using your code, but when I try to import BERTopic I get this error: 'No module named 'llvmlite.binding.dylib''. I couldn't figure out how to fix it. Do you have any idea?
@jamesbriggs 2 years ago
one possible issue is that you have the wrong version of umap installed. There are two libraries: "umap" (the wrong one) and "umap-learn", so you might need to:
```
pip uninstall umap
pip install umap-learn
```