This is great. Would you have any performance comparisons between SetFit and DeBERTa (say, v3) on NLI tasks? Also, how many examples are needed to fine-tune these models? Thanks
@EkShunya 1 year ago
great work
@manabchetia8382 1 year ago
GREAT VIDEO! Where can I try T-Few?
@andrea-mj9ce 2 years ago
Can SetFit be used for topic modeling (find the topics that a text deals with)?
@tweak3871 2 years ago
In an unsupervised setting like LDA, LSA, etc., not really. But if you have topic classes that you have already identified and want to classify texts into them, then you could do that pretty easily. For example, you could identify some 30 examples that are talking about the Pennsylvania senate race, train SetFit on that small dataset, then run it on a larger stack of news articles. I would expect that to work reasonably well with some iteration on the dataset.
@Hellas11 2 years ago
Hi, may I ask what is the minimum number of texts that should be labelled to get the best performance, let's say when classifying 1 million short texts? Also, may I ask how this method would compare to current topic algorithms, and more specifically BERTopic (using the "all-MiniLM-L6-v2" model)? Would SetFit perform better than BERTopic? Thanks in advance for your time 🙏
@tweak3871 2 years ago
@@Hellas11 I was unfamiliar w/BERTopic before this; I just looked it up and it looks simple enough, so I think I have an intuition on how well it would work. Comparing SetFit vs. BERTopic, honestly it really depends on your use case. SetFit at its core is a few-shot classification system, whereas BERTopic is an unsupervised method that clusters on top of factorized document embeddings produced by BERT. In topic modeling in general, the goal is to find "topics" that describe the dataset well, most commonly either for EDA or to be used as features for another model. Perhaps a more modern use would be document tagging for an app of some kind, but in general I don't see topic modeling done all that much nowadays, as there are better methods to do what you want in NLP. W/SetFit, the idea is to already know what it is that you're looking for in a set of texts, build a small dataset (say 16 labeled examples per class), then train a classifier on that dataset. So "How would they compare?" really depends on your use case, but generally speaking I would say a supervised method like SetFit gives you more control to do whatever it is you're trying to do, especially in NLP.
That aside, "How much data?" I mean, the more the merrier, always. What I would do is probably label in increments of 8 or 16 for each class, so if you have 2 classes, label 32, run it, and see how it's doing. If it's not good enough, label some more, rinse and repeat. In few-shot, the quality of your data & labels matters a lot more though, so when evaluating how the model is doing, try to build an intuition for why the model is getting wrong what it's getting wrong, then find examples that teach the model whatever it's struggling with. Try to give the model "hard" examples, i.e. stuff that the model clearly does not already know, but mix in some easy ones too.
Also try to find examples where it's just barely one class or the other, i.e. challenging borderline examples that are still definitely one class or another. This really helps in few-shot. You can also just write your own examples if you're struggling to find good ones. Also remember to always label a dataset for eval as well.
Finally, vary model size. Start small if you want, but I'd personally jump into Colab w/a standard GPU (you will get a 16 GB V100 or a T4 on a free account). You should be able to train MPNet on any GPU you get on Colab, and pretty quickly at that.
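To make the "label in increments, and always keep an eval set" advice above concrete, here is a minimal sketch in plain Python. It assumes you already have a small pool of hand-labeled `(text, label)` pairs; the function name `few_shot_split` and its parameters are hypothetical and are not part of SetFit.

```python
import random
from collections import defaultdict

def few_shot_split(rows, shots_per_class=16, eval_per_class=16, seed=0):
    """Split (text, label) pairs into a balanced few-shot train set and a
    disjoint held-out eval set. Hypothetical helper, not a SetFit API."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append(text)

    rng = random.Random(seed)
    train, held_out = [], []
    for label in sorted(by_label):
        texts = list(by_label[label])
        rng.shuffle(texts)
        # The first `shots_per_class` texts are used for training ...
        train += [(t, label) for t in texts[:shots_per_class]]
        # ... and the next `eval_per_class` are kept aside for evaluation.
        stop = shots_per_class + eval_per_class
        held_out += [(t, label) for t in texts[shots_per_class:stop]]
    return train, held_out

# Toy labeled pool: 40 examples for each of 2 classes (made-up data).
pool = [(f"positive text {i}", 1) for i in range(40)]
pool += [(f"negative text {i}", 0) for i in range(40)]

train_rows, eval_rows = few_shot_split(pool, shots_per_class=16, eval_per_class=8)
```

With 2 classes and 16 shots per class this gives the 32 training examples mentioned above. If eval results are not good enough, raise `shots_per_class` by 8 or 16 after labeling more data and run it again.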
@mikael_aldo 1 year ago
Can this be used for a regression task, e.g. comparing answers and calculating a score based on their similarity?
@_luca_marinelli 1 year ago
SetFit is based on categorical labels, but you can just quantize the regression target into classes.
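A rough sketch of what quantizing a regression target into classes could look like, using only the Python standard library. The bin edges, the 0 to 1 score range, and the helper names `quantize` and `bin_midpoints` are assumptions made for illustration; nothing here comes from SetFit itself.

```python
from bisect import bisect_right

def quantize(score, edges):
    """Map a continuous score to a class index: the number of bin edges
    that are less than or equal to the score."""
    return bisect_right(edges, score)

def bin_midpoints(edges, low, high):
    """Midpoint of each bin, used to turn a predicted class back into a score."""
    bounds = [low, *edges, high]
    return [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]

# Assumed setup: similarity scores in [0, 1], cut into 4 equal-width bins.
EDGES = [0.25, 0.5, 0.75]

scores = [0.1, 0.3, 0.5, 0.9]
labels = [quantize(s, EDGES) for s in scores]    # class labels for training
midpoints = bin_midpoints(EDGES, low=0.0, high=1.0)
approx_scores = [midpoints[c] for c in labels]   # coarse scores recovered from classes
```

The classifier is then trained on `labels` instead of the raw scores. The trade-off is resolution: more bins give finer scores but leave fewer labeled examples per class.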
@rajibahsan6292 2 years ago
Hi, if I want to train the model on my own dataset, how do I prepare the dataset? I am passing the train and eval data as dictionaries, but it's not able to read the column names. How do I prepare my own data to train this model?
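One possible way to prepare the data, sketched under a few assumptions: as far as I know, the SetFit trainer expects a Hugging Face `datasets.Dataset` (not a plain dict) whose columns are named `text` and `label`, unless a column mapping is supplied. The helper `to_columns` and the example column names `review` and `sentiment` are hypothetical.

```python
def to_columns(rows, text_key, label_key):
    """Convert a list of row dicts into a column-oriented dict with the
    column names "text" and "label". Hypothetical helper for illustration."""
    return {
        "text": [str(row[text_key]) for row in rows],
        "label": [int(row[label_key]) for row in rows],
    }

# Toy rows with custom column names (made-up data).
rows = [
    {"review": "loved it", "sentiment": 1},
    {"review": "terrible", "sentiment": 0},
]

train_columns = to_columns(rows, text_key="review", label_key="sentiment")

# With the `datasets` library installed, the dict can then be wrapped:
#   from datasets import Dataset
#   train_dataset = Dataset.from_dict(train_columns)
# and `train_dataset` passed to the trainer in place of the raw dictionary.
```

The same conversion would be applied to the eval data, so that both datasets share identical column names.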
@jacehua7334 2 years ago
How is "num examples = 640" calculated?
@X1011 1 year ago
pineapple and pizza are too distant in flavor space 😋
@CppExpedition 1 year ago
The Hugging Face 'emoticon' is so annoying in an otherwise excellent presentation. It adds noise to the communication.