I know this video is a year old, but on the dimensionality reduction part: you don't reduce the dimensions just because there is too much information and you want to compress it. It's mainly due to the "curse of dimensionality": as the number of dimensions grows, the distances between data points become more and more alike, so they carry less and less information. Trying to cluster in this high-dimensional space therefore produces arbitrary clusters, because the distances are nearly meaningless.
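A tiny self-contained sketch of this effect (illustrative only, not from the video): for uniformly random points, the relative gap between the nearest and farthest neighbour shrinks as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))               # 500 uniform points in [0, 1]^dim
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast tends toward 0 as dim grows: all distances look alike
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:4d}  relative contrast={contrast:.3f}")
```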
@janspoerer 22 days ago
Thank you for the video! The notebook seems to have been removed from the repository. Is it still available?
@tomwalczak499 2 years ago
Awesome video James! Great idea using a world map to illustrate dimensionality reduction techniques!
@shaheerzaman620 2 years ago
Amazing. It was a lot of fun to go through this!
@jamesbriggs 2 years ago
Awesome to hear
@junchoi5605 1 year ago
Hi James, when doing topic modeling with BERTopic, how do we choose UMAP's n_neighbors and n_components if we don't already have predefined topic labels, like the Reddit data's Sub field for the Selfnotes field?
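For reference, a hedged sketch of how those UMAP parameters get passed into BERTopic; the values shown are common defaults, not tuned answers to the question. Without predefined labels there's no closed-form recipe: you typically try a few settings and inspect the resulting clusters.

```python
from umap import UMAP
from bertopic import BERTopic

umap_model = UMAP(
    n_neighbors=15,    # smaller values emphasise local structure, larger global
    n_components=5,    # dimensionality of the reduced space handed to the clusterer
    min_dist=0.0,
    metric="cosine",
    random_state=42,   # fix for reproducible topics
)
topic_model = BERTopic(umap_model=umap_model)
```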
@WouterSuren 2 years ago
Well-prepared demo and well-crafted video, thanks mate!
@chineduezeofor2481 1 year ago
Excellent video!
@renatorao9211 2 years ago
I found you yesterday looking for an NSWG video and now this. Really nice how relevant your videos are.
@jamesbriggs 2 years ago
NSWG? That's awesome though, glad it's helping!
@renatorao9211 2 years ago
@@jamesbriggs approximate nearest neighbor search with navigable small world graphs, but I think your video was about the randomized hierarchical approach
@jamesbriggs 2 years ago
@@renatorao9211 ah yes, I did cover HNSW, which is hierarchical NSW, and I gave a high-level view of NSW too. Check out the article if it helps, there might be more detail there: pinecone.io/learn/hnsw
@shameekm2146 10 months ago
Hi there, the link to the Colab notebook shows "repository not found". Could you please update the link?
@kantafcb1 2 years ago
Thanks a million James. So clearly explained!
@RajeshGupta-gx3yz 2 years ago
Hi James, thanks for the excellent set of videos. Do you know of any pretrained SentenceTransformer models that can work on longer documents?
@jamesbriggs 2 years ago
Unfortunately I'm not aware of any; most of them tend to be capped at around 512 tokens
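A common workaround, sketched here as a general technique rather than anything from the video: split long documents into chunks that fit the model's limit, embed each chunk, and pool the chunk embeddings. The model choice, chunk size, and mean-pooling are all assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative

def embed_long(text: str, chunk_words: int = 200) -> np.ndarray:
    """Split a long document into word chunks, embed each, and mean-pool."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)] or [""]
    return model.encode(chunks).mean(axis=0)

vector = embed_long("some very long document " * 500)
print(vector.shape)  # (384,) for this model
```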
@alecg223 1 year ago
Try with BERT, the Google NLP model.
@averma12 2 years ago
Hi James, I had an issue running the UMAP code. I fixed it with `pip uninstall umap` followed by `pip install umap-learn`, and then the import that worked was `import umap.umap_ as umap`. Can you confirm whether umap worked for you as written? It didn't for me until I followed those steps.
@jamesbriggs 2 years ago
hey Abhinav, I did `pip install umap-learn` (pip or conda), and for the import I did `from umap import UMAP`. I can confirm that `pip install umap` installs the wrong library, but your import is strange; does `from umap import UMAP` not work for you?
@averma12 2 years ago
@@jamesbriggs yes, it's weird for me. Can you share the versions of the umap and bertopic libraries?
@averma12 2 years ago
@@jamesbriggs I did the same; it didn't work for me. I installed an older version of BERTopic (0.9.4), which pulled in older versions of umap-learn and hdbscan. That seemed to work, and now your code runs with the imports as you wrote them.
@aizasyamimi 2 years ago
Can BERTopic be used for small datasets with fewer than 1000 rows, or short sentences per row? Will it be reliable for topic modelling, then doing UMAP and clustering?
@jamesbriggs 2 years ago
ideally more, but yes, it's possible. Just bear in mind that more (quality) data will tend to result in better clustering etc
@dr.kingschultz 1 year ago
BERTopic isn't working on Windows. Can you please create a video just on installing it? I know it sounds ridiculous, but I've tried everything.
@ernestosantiesteban6333 10 months ago
Wow, your videos always save my schoolwork. If one day I become a millionaire, I will give you part of my company.
@brianferrell9454 2 years ago
Does BERTopic understand the context in which words are used? So in your example, PyTorch can be the most used word in a particular topic, but if the word PyTorch is used differently in different contexts within that topic, does the model pick up on that, since it has the transformers?
@jamesbriggs 2 years ago
Yes, thanks to the transformer part it understands the context, and here we're using sentence transformers too, so we're encoding each sentence (or paragraph) as a whole
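A quick sketch of that context sensitivity (the model name is an assumption; the video may use a different sentence-transformers checkpoint):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "I deposited the cheque at the bank.",
    "The bank approved my loan application.",
    "We sat on the grassy bank of the river.",
]
emb = model.encode(sentences)
# The two finance sentences should score noticeably closer to each other
# than either does to the river sentence, despite all sharing "bank".
print(cosine_similarity(emb).round(2))
```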
@seemarani7314 2 years ago
Thank you very much for the tutorial, sir. How are the parameters tuned in BERTopic? What is the advantage of BERTopic over standard topic models such as LDA and NMF? What is the difference in the number of parameters between those topic models and BERTopic? Please help, sir.
@Dan-vo7hl 2 years ago
I've found one of the most important tuning parameters is the number of topics. One of the advantages of BERTopic over some of the other methods is that it comes up with its own number of topics. However, in some cases you will need to set a minimum topic size to reduce the overall number of topics. There is a simple topic-reduction function in BERTopic, reduce_topics(), which works well if you are OK with a relatively large number of uncategorized documents. The BERTopic website has some detailed tuning info. Compared with other approaches, the big deal is that this works on AI model embeddings rather than simpler, less sophisticated counts and groupings of word occurrences. One of the upsides is that you don't have to do much (if any) data cleaning when you start, whereas LDA and NMF rely on the data being thoroughly pre-processed.
@seemarani7314 2 years ago
@@Dan-vo7hl OK sir, thanks for the kind reply. Sir, can we know how these parameters are handled in BERTopic such that it automatically detects the number of topics without our specifying it?
@Dan-vo7hl 2 years ago
@@seemarani7314 BERTopic uses HDBSCAN to identify clusters and assign documents to them. The HDBSCAN settings min_cluster_size and min_samples control how many clusters are created and are well documented in the HDBSCAN docs. There is a section on tuning in the BERTopic documentation which covers this as well.
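Putting that together, a hedged sketch of wiring a custom HDBSCAN into BERTopic (the parameter values and example corpus are illustrative; the reduce_topics signature shown matches recent BERTopic versions):

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

hdbscan_model = HDBSCAN(
    min_cluster_size=50,   # larger -> fewer, broader topics
    min_samples=10,        # larger -> more documents left as outliers (topic -1)
    metric="euclidean",
    prediction_data=True,
)
topic_model = BERTopic(hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

# Post-hoc merging down to a target topic count, as mentioned above:
topic_model.reduce_topics(docs, nr_topics=20)
```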
@seemarani7314 2 years ago
Thank you so much, sir 🙏
@wasgeht2409 2 years ago
Thanks! I have one question: if I get a cluster 2 with the words year, month, and time, and I want to manually remove the word time from cluster 2 and put it into another cluster, is that possible?
@jamesbriggs 2 years ago
not directly. You could find all of the items in cluster 2 that contain the word "time" and tag them as belonging to cluster 1, then feed that tagged data into a new BERTopic process, using the labels via the HDBSCAN component
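One hedged way to implement that re-tagging idea: recent BERTopic versions support (semi-)supervised fitting via the `y` argument of `fit_transform`, where -1 marks documents whose topic should still be learned. The word-based rule and corpus below are purely illustrative.

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

# Pin any document mentioning "time" to label 1; leave the rest unlabelled (-1)
labels = [1 if "time" in doc.lower() else -1 for doc in docs]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, y=labels)
```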
@wasgeht2409 2 years ago
@@jamesbriggs Thanks... and how could I, for example, remove the word from cluster 2 and add it into cluster 1?
@waleedkhalid1145 2 years ago
Thank you so much. You are amazing at explaining this lecture. It's very understandable. I will recommend this video to all my friends.
@jamesbriggs 2 years ago
thanks!
@BlackLightning0Games 1 year ago
James! Thanks for the tutorial, very helpful. The only thing I don't know how to do is implement this with my own custom dataset. I have my data split into topics, like this: topic 1 = ["sentence 1", "sentence 2", ...], topic 2 = ["sentence 3", "sentence 4", ...], topic 3 = ... How do I turn it into the correct format, like the model.fit_transform(data['selftext']) and data['title'][i] you are using with the pretrained model? If you could explain how to organize my data to make this work, or direct me to a resource, that would be amazing. Subscribed!
@alecg223 1 year ago
There are two options: 1) Do you want subtopics of each topic you already have? Apply the model to each one separately. 2) Do you want automatic topics from all your data? Create a single list with all the elements of each of your lists and apply the model. A pretrained model within BERTopic only means that it helps create the embedding vectors; that step happens before the model enters the clustering phase.
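A minimal sketch of option 2 (the topic lists mirror the question's hypothetical format; a real run needs far more documents than this):

```python
from bertopic import BERTopic

topic_1 = ["sentence 1", "sentence 2"]
topic_2 = ["sentence 3", "sentence 4"]
topic_3 = ["sentence 5", "sentence 6"]

# fit_transform just wants one flat list of strings, like data['selftext'] was
docs = topic_1 + topic_2 + topic_3
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```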
@sanazulfiqar676 2 years ago
Nice explanation. Can BERTopic work well in other languages (Hindi, Urdu) for topic modelling?
@jamesbriggs 2 years ago
it is possible, but it would require a model trained on Hindi or Urdu, or a multilingual model that covers those languages
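For what it's worth, BERTopic exposes a multilingual shortcut that loads a multilingual sentence-transformers model; whether its Hindi and Urdu quality is good enough is worth verifying on your own data.

```python
from bertopic import BERTopic

# language="multilingual" loads a paraphrase-multilingual sentence-transformers
# model covering 50+ languages (Hindi and Urdu among them, per its model card)
topic_model = BERTopic(language="multilingual")
```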
@yucg9684 2 years ago
thank you for this clear explanation!
@jamesbriggs 2 years ago
You're welcome!
@ylazerson 2 years ago
Fantastic lesson; super appreciated!
@dabody123456 2 years ago
Thanks a lot for sharing
@prabhacar 2 years ago
great stuff! thanks!
@wolfers5090 2 years ago
Fluff. Some interesting parts, but I'm still not quite sure what BERTopic is. Poorly introduced; in fact, very much just a mix of code run-throughs. Not coherent, and maybe no actual appreciation of what BERT is. Good presentation skills, though.
@jamesbriggs 2 years ago
Thanks for the feedback. It's a complicated topic; you might find the article more coherent: www.pinecone.io/learn/bertopic
@boubacarbah1455 2 years ago
Hello, I'm trying to practice with BERTopic. I'm really new to it. I was using your code, but when I try to import BERTopic I get this error: 'No module named 'llvmlite.binding.dylib''. I couldn't figure out how to fix it. Do you have any idea?
@jamesbriggs 2 years ago
one possible issue is that you have the wrong version of umap installed. There are two libraries: "umap" (the wrong one) and "umap-learn", so you might need to:
```
pip uninstall umap
pip install umap-learn
```