Thanks for sharing! One of the best tutorials I've watched. No fancy slides, just very useful code, line by line.
@rachely767 2 years ago
This was really awesome! I am going to try to implement this for a survey here in New Zealand 🇳🇿
@kasperwelbers 2 years ago
Nice! If you're working with survey data, you might also want to look into structural topic modeling (which has a great R package). That was originally also developed for open-ended survey questions, and allows you to model covariates. doi.org/10.1111/ajps.12103
@slkslk7841 2 years ago
@@kasperwelbers I have a feeling that these are outdated and would not be as good as current libraries like spaCy. Is that right? How good are these compared to phrase extraction using logic/rules on spaCy's POS-tagging + dependency-parser data?
@kasperwelbers 2 years ago
@@slkslk7841 I'd say the goal is quite different. Spacy is a great tool for preprocessing data, which can enhance various types of analysis (including topic modeling). One approach is indeed to use rules on dependency trees, for instance to extract phrases and semantic patterns such as who does what (coincidentally, we developed the rsyntax package for working with dependency trees in R: computationalcommunication.org/ccr/article/view/51/30 ). But if you want to automatically classify documents in latent classes, you'll need some form of dimensionality reduction, and topic modeling is still a great way to do this. That said, it is certainly true that the vanilla LDA approach discussed here is getting older (though as I think I mention in the video, it's still a nice place to start). In terms of more state-of-the-art alternatives, I've seen some topic models that use contextual word embeddings (which you could for instance obtain using spacy).
@slkslk7841 2 years ago
@@kasperwelbers Great. Many thanks for the reply, Kasper! Do you mind elaborating a bit more on the method mentioned in your last sentence? Thanks again.
@kasperwelbers 2 years ago
@@slkslk7841 I'll try, though it's hard to summarize. Classic LDA uses a document term matrix, which tells us nothing about the semantic similarities between words. With just this data, our model doesn't understand that the columns "cat" and "dog" are more similar than "cat" and "airplane". Given enough data, the topic model can learn that "cat" and "dog" are more similar if they often co-occur (perhaps in a "pets" topic). But wouldn't it be nice if we could already infuse our model with some general information about these types of semantic relations between words beforehand? Here's where word embeddings come into play. These are lower dimensional representations of words that are typically learned by training a deep learning model on a huge number of texts. This is great for all sorts of machine learning tasks, because it means our model knows more than just what it can learn from our training data. Even if "dog" and "cat" never co-occur, the word embeddings convey that they are more similar than "cat" and "airplane". One final cool thing to mention is that there are also aligned word embeddings across languages. This way our model even 'knows' which words are similar across languages, which is paving the way for multilingual topic models.
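As a rough base-R sketch of the "cat"/"dog"/"airplane" intuition (the vectors below are made-up toy numbers, not real embeddings, which have hundreds of learned dimensions):

```r
# Toy 3-dimensional 'embeddings' (made-up numbers, purely illustrative)
emb <- rbind(
  cat      = c(0.9, 0.8, 0.1),
  dog      = c(0.8, 0.9, 0.2),
  airplane = c(0.1, 0.2, 0.9)
)

# Cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(emb["cat", ], emb["dog", ])       # high: semantically close
cosine(emb["cat", ], emb["airplane", ])  # low: semantically distant
```

With real pre-trained embeddings the same comparison gives the model this similarity information "for free", before it has seen any of your training data.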
@zoowalk0815 3 years ago
Many thanks for putting this all together. Very helpful.
@rafaanimir8820 1 year ago
Thank you tons, sir, very helpful video!
@gauravsutar9890 4 months ago
Hello, it was good to learn LDA from this video. Could you make a video with a full explanation of structural topic modelling?
@kasperwelbers 3 months ago
Hi @gauravsutar9890, I'm afraid I haven't planned anything of the sort. It's been a while since I used topic modeling (simply because my research took me elsewhere), so I'm not fully up to speed on the current state of the field.
@gauravsutar9890 3 months ago
@@kasperwelbers Oh yes, thank you so much. Actually, I'm going through it, but I'm not able to interpret some of the code in R.
@abhijitthakuria1368 1 year ago
Hi Kasper, nice explanation of topic modeling. I am not able to figure out how to plot latent topics to visualise the evolution of topics year by year.
@leo.7901 3 years ago
Thanks for such good content!!!
@MK-fp6tg 6 months ago
This is a great tutorial. I have a quick question. Which file type do I have to convert my current Excel data set into?
@mehedihasanifti4691 3 years ago
Thank you so much, this helps a lot
@shengmingwong6968 2 years ago
wow! very helpful!
@SamGlover25 3 years ago
Kasper, your videos are so helpful! Do you know of any good videos explaining how to use Wordfish or Latent Semantic Scaling? I'm struggling with those.
@kasperwelbers 3 years ago
Hi Sam! I don't know of any videos, but have you already checked out the Quanteda tutorials on the topic? tutorials.quanteda.io/machine-learning/. Quanteda has great support for word scaling methods, and some of the lead developers contributed greatly to this field. Regarding more background on wordfish and wordscores I recommend looking up some of the work of Ken Benoit. For latent semantic scaling, see Kohei Watanabe's LSX package (github.com/koheiw/LSX ), and his excellent recent paper about this method (linked on the github page).
@mollymurphey4526 4 months ago
How do I add my own csv file as the corpus?
@nejc8316 2 years ago
Hi again, I also have one question: how do I add Slovenian stopwords in R? Do you maybe know? Thank you so much.
@nephastgweiz1022 3 years ago
Great video! Although I found that wearing a wig was a bit over the top.
@kasperwelbers 3 years ago
hahahaha, how did I miss this comment!! I'm afraid it's my actual hair though.
@mariagaetanaagnesitois4844 3 years ago
Nice! Can you try running the other topic modelling method, BTM?
@kasperwelbers 3 years ago
Are you referring to biterm topic modelling? I haven't yet used it, but I know that Jan Wijffels (who also wrote the udpipe package) wrote a package for it. The documentation on the GitHub page makes it look pretty easy to implement: github.com/bnosac/BTM
@juliantorelli4540 5 months ago
Kasper, how would this work for a correlated topic model heat map with topic rows/topic columns?
@kasperwelbers 4 months ago
If I recall correctly, the correlated topic model mostly differs in that it takes the correlations between topics into account when fitting the model. It probably adds a covariance matrix, but there should still be posterior distributions for document-topic and topic-word, so you should still be able to visualize the correlations of topics and documents (or topics with topics) in a heatmap. Though depending on what package you use to compute them, extracting the posteriors might work differently.
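For the heatmap itself, nothing model-specific is needed: once you have a document-topic matrix, correlating its columns gives a topic-by-topic matrix you can pass to heatmap(). A base-R sketch with random stand-in data (in practice the matrix would be the posterior from your fitted model):

```r
set.seed(42)
# Stand-in for the document-topic posterior: 100 documents, 5 topics.
# In practice this comes from your fitted model (e.g. a gamma/theta matrix).
theta <- matrix(runif(100 * 5), nrow = 100,
                dimnames = list(NULL, paste0("topic", 1:5)))
theta <- theta / rowSums(theta)   # rows sum to 1, like topic proportions

topic_cor <- cor(theta)           # 5 x 5 topic-topic correlation matrix

# symm = TRUE keeps row and column order identical for a symmetric matrix
heatmap(topic_cor, symm = TRUE)
```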
@juliantorelli4540 4 months ago
@@kasperwelbers Thank you! I tried this code, it seems to have worked for basic LDA: beta_x %>% arrange(term) %>% select(-term) %>% rename_all(~paste0("topic", .)) } beta_w
@pomme_paille 3 years ago
Thank you for this great tutorial! How do we order the heat map if we aggregate by date instead of by president? It is not sorted, so visually we cannot retrieve information.
@kasperwelbers 3 years ago
Ah right! So the thing with the heatmap is that by default it creates a dendrogram (the tree thing on the top and left) that shows hierarchical clustering. The rows and columns are reordered so similar rows/columns are closer. If you want to use a specific order (like year), you can order the matrix that you pass to the heatmap function and disable the clustering feature. If you look at the documentation of the heatmap function (run ?heatmap) you see that the Rowv and Colv arguments control the dendrogram and re-ordering. You can turn this off by passing the NA value: heatmap(as.matrix(tpp[-1]), Rowv = NA)
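To make that concrete, a small base-R sketch (tpp here is a made-up stand-in for the topic proportions aggregated by year): sort the rows chronologically first, then disable the dendrogram on both axes.

```r
set.seed(1)
# Stand-in for topic proportions aggregated by year (real data would come
# from aggregating the document-topic posterior)
tpp <- data.frame(year = c(2003, 2001, 2002),
                  topic1 = runif(3), topic2 = runif(3), topic3 = runif(3))

tpp <- tpp[order(tpp$year), ]   # sort rows chronologically
m <- as.matrix(tpp[-1])         # drop the year column
rownames(m) <- tpp$year

# Rowv/Colv = NA turn off the clustering and re-ordering,
# so the rows stay in year order
heatmap(m, Rowv = NA, Colv = NA)
```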
@pomme_paille 3 years ago
@@kasperwelbers just saw the answer. Thank you very much, it helps a lot! 🙏🏿🙏🏿
@sainaveenballa3065 2 years ago
For the corpus, can I use a large number of tweets? If so, what should the data type be?
@kasperwelbers 2 years ago
Sure, though of course it depends on how large is large. The main limitation is your computer's memory, and the more text, the longer it will take. If you have lots of data, you might want to limit your vocabulary size by dropping very rare (or very common) words. The data type of the tweets should just be text (character, string). The main topic modeling packages in R (topicmodels, stm) take a quanteda dfm as input, so if you learn how to convert tweets to this dfm you're good to go.
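A minimal sketch of that pipeline, assuming the quanteda package is installed (the example tweets are made up):

```r
library(quanteda)

# Made-up example tweets; in practice this is your character vector of texts
tweets <- c("Topic models are neat #nlp",
            "Cats and dogs co-occur a lot",
            "Another tweet about cats")

corp <- corpus(tweets)                        # character vector -> corpus
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
m    <- dfm(toks)                             # document-feature matrix

# With lots of data, trim rare/common terms to keep the vocabulary small,
# e.g. m <- dfm_trim(m, min_docfreq = 5)

# Convert to the input format the topicmodels package expects
dtm <- convert(m, to = "topicmodels")
# lda <- topicmodels::LDA(dtm, k = 10)        # then fit the model
```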
@briantheworld 1 year ago
Hello! I have a question: is there a way to implement LDA in other languages? I'm trying to apply it to Italian reviews from the web.
@kasperwelbers 1 year ago
Hi Brian! LDA itself does not care about language, because it only looks at word occurrences in documents. Simply put, as long as you can preprocess the text and represent it as a document term matrix, you can apply LDA.
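A sketch of what that preprocessing could look like with quanteda (assuming it is installed; the Italian review texts are made up): only the tokenization and stopword step is language-dependent, after which the document term matrix feeds into LDA as usual.

```r
library(quanteda)

# Made-up Italian reviews, purely illustrative
reviews <- c("Il prodotto è ottimo, lo consiglio",
             "Pessima esperienza, non lo ricomprerei")

toks <- tokens(corpus(reviews), remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("it"))   # Italian stopword list
m <- dfm(toks)

# From here the document term matrix is language-agnostic:
# convert(m, to = "topicmodels") and fit LDA as with English text.
```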
@briantheworld 1 year ago
@@kasperwelbers Thanks a lot for your fast reply. And of course thanks for the high quality content videos.
@nejc8316 2 years ago
Hi Kasper. I find this video very useful. I have one question. In my research, I am analysing comments from social media, but I organised the data as follows and would be happy if you could help. In an Excel document, I have the author of each comment in one column and the content of the comment in the other column. So I have approx. 4,000 rows, each with two columns: one for the author and one for the comment. I originally had these comments in separate documents (each group of comments came from an individual web portal, e.g. Facebook posts, comments under articles, Reddit debates, ...), but I combined them all into these two columns. So now all comments are in one column; that's my corpus. Can I use LDA in R on this data set, or do the comment groups need to be separated into individual documents for the LDA method? I hope my question is clear, thank you so much.
@nejc8316 2 years ago
When I type the first command, I get this message: Error: corpus_reshape() only works on corpus objects. So I guess I did not prepare the data correctly. How is data prepared correctly for LDA?
@abhijitthakuria1368 1 year ago
@@nejc8316 Maybe install.packages("reshape2") will solve the problem
@ronandunne1097 2 years ago
Hi Kasper, when I try using the quanteda package I am getting an error: Error: package or namespace load failed for ‘quanteda’:
@kasperwelbers 2 years ago
Hi Ronan, that type of error could have many reasons. What often helps is updating R to the latest version (I at least remember that R 4.0.0 had some issues with packages that use Rcpp, which quanteda certainly does).
@ronandunne1097 2 years ago
Thanks! Would you recommend the latest R version, 4.0.5?
@kasperwelbers 2 years ago
@@ronandunne1097 In general, just always go for the latest. The issue with Rcpp was solved in 4.0.2 I think.
@lmaok4145 3 years ago
How do I connect with you?
@kasperwelbers 3 years ago
I'm not very well hidden online, but I tend to prefer via my university email (research.vu.nl/en/persons/kasper-welbers)