Is GPL the Future of Sentence Transformers? | Generative Pseudo-Labeling Deep Dive

8,693 views

James Briggs


Day ago

Comments: 28
@mayank_khurana
@mayank_khurana Year ago
Hello James. Wonderful explanation. I was wondering, what are the latest techniques that have come out since? What comes after GPL & TSDAE?
@shobhitrajgautam
@shobhitrajgautam Year ago
How do you use TSDAE + GPL? I understand both approaches, but how do you combine them?
@e_hossam96
@e_hossam96 Year ago
This is really wonderful and great. Thanks for this awesome series 👏👏👏
@LauraneCastiaux
@LauraneCastiaux 2 years ago
Thank you very much for this high quality explanation!
@irfanahmad-ti7vl
@irfanahmad-ti7vl Year ago
Hello James, it's superb. Can you please explain whether GPL can be used for hyperspectral image classification with transformers? If you have done this, could you please share the link? Thanks in advance.
@arvinflores5316
@arvinflores5316 2 years ago
If I wanted to do TSDAE and GPL together, is there a specific order? Like pretraining with TSDAE first, then fine-tuning with GPL?
@jamesbriggs
@jamesbriggs 2 years ago
Yes, exactly: pretrain with TSDAE, then fine-tune with GPL.
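To make that two-stage order concrete, here is a minimal sketch in Python. The TSDAE stage follows the sentence-transformers denoising-autoencoder recipe; the GPL stage assumes the `gpl` library, and the `gpl.train` parameter names and model checkpoints shown are taken from its README, so treat this as an illustrative sketch rather than a tested recipe.

```python
def tsdae_then_gpl(corpus_path: str, out_dir: str):
    """Stage 1: TSDAE pretraining; Stage 2: GPL fine-tuning from that checkpoint."""
    # --- Stage 1: TSDAE pretraining with sentence-transformers ---
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, models, datasets, losses

    sentences = [line.strip() for line in open(corpus_path) if line.strip()]
    word_emb = models.Transformer("bert-base-uncased")
    pooling = models.Pooling(word_emb.get_word_embedding_dimension(), "cls")
    model = SentenceTransformer(modules=[word_emb, pooling])

    # TSDAE: corrupt each sentence (token deletion) and train to reconstruct it
    data = datasets.DenoisingAutoEncoderDataset(sentences)
    loader = DataLoader(data, batch_size=8, shuffle=True, drop_last=True)
    loss = losses.DenoisingAutoEncoderLoss(
        model, decoder_name_or_path="bert-base-uncased", tie_encoder_decoder=True
    )
    model.fit(train_objectives=[(loader, loss)], epochs=1, show_progress_bar=True)
    model.save(f"{out_dir}/tsdae")

    # --- Stage 2: GPL fine-tuning, starting from the TSDAE weights ---
    import gpl  # pip install gpl

    gpl.train(
        path_to_generated_data=f"{out_dir}/generated",
        base_ckpt=f"{out_dir}/tsdae",  # the TSDAE checkpoint, not a vanilla model
        generator="BeIR/query-gen-msmarco-t5-base-v1",
        retrievers=["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"],
        cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
        batch_size_gpl=32,
        gpl_steps=140_000,
        output_dir=f"{out_dir}/gpl",
    )
```

The key line is `base_ckpt`: pointing GPL at the saved TSDAE model is what makes it a fine-tune of the pretrained weights rather than a fresh start.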
@arvinflores5316
@arvinflores5316 2 years ago
@@jamesbriggs Thanks! I tried it with TSDAE and GPL, and it's giving me good results. Is there any newer domain adaptation/unsupervised training you've been looking at aside from GPL? Curious to know more about the field!
@jamesbriggs
@jamesbriggs 2 years ago
GPL is the latest that I've worked with, but I've been focusing on other things in the few months since this, so I may have missed something!
@mayank_khurana
@mayank_khurana Year ago
@@arvinflores5316 Would it be possible to share your codebase or an example of how you approach the above method? It would be really helpful for my college project :)
@ylazerson
@ylazerson 2 years ago
fantastic video!
@chenlin7535
@chenlin7535 Year ago
Hi James! Wonderful video! May I ask what software you used to record it? The round camera is adorable.
@jamesbriggs
@jamesbriggs Year ago
Thanks! I record the screen with OBS, then I have a separate camera for my face, and I put them together with either Premiere Pro or DaVinci Resolve (both can do the same).
@chenlin7535
@chenlin7535 Year ago
@@jamesbriggs Thanks! Very helpful!
@tinyentropy
@tinyentropy 2 years ago
I don't understand the key element that makes this work. What component can actually distinguish hard from trivial negative passages, and why is it able to do so? In other words, what is the difference between the cross-encoder and the bi-encoder in terms of data/text understanding? Only the loss function? Or do they have different access to knowledge?
@jamesbriggs
@jamesbriggs 2 years ago
Good question! Cross-encoders are more powerful models that are able to adapt to new domains much better than bi-encoder models. The only problem is that they're slow when you want to compare many items, which is why we use bi-encoders for that task. So we just use the cross-encoder's better performance/adaptability to create the labels for a dataset that can be used to train the bi-encoder.
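That labeling step can be sketched with plain numbers: the cross-encoder scores each (query, positive, negative) triple, the score margin becomes the pseudo-label, and the bi-encoder is trained to reproduce that margin with its own similarities (GPL's MarginMSE objective). All scores below are made up for illustration; in practice they come from the cross-encoder and bi-encoder models.

```python
def margin_mse_loss(ce_margins, be_margins):
    # Mean squared error between the teacher (cross-encoder) margins
    # and the student (bi-encoder) margins.
    assert len(ce_margins) == len(be_margins)
    return sum((c - b) ** 2 for c, b in zip(ce_margins, be_margins)) / len(ce_margins)

# Hypothetical scores for three (query, positive, negative) triples.
ce_pos = [9.1, 7.4, 8.0]   # cross-encoder score(query, positive passage)
ce_neg = [2.3, 5.0, 1.1]   # cross-encoder score(query, negative passage)
ce_margins = [p - n for p, n in zip(ce_pos, ce_neg)]  # the pseudo-labels

be_pos = [0.82, 0.55, 0.74]  # bi-encoder cosine sim(query, positive)
be_neg = [0.30, 0.48, 0.12]  # bi-encoder cosine sim(query, negative)
be_margins = [p - n for p, n in zip(be_pos, be_neg)]

loss = margin_mse_loss(ce_margins, be_margins)  # what GPL minimises
```

Note the second triple: a small cross-encoder margin (7.4 vs 5.0) is exactly the "hard negative" case, and the margin label carries that information to the bi-encoder instead of a blunt 1/0 relevance label.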
@tinyentropy
@tinyentropy 2 years ago
@@jamesbriggs Thanks 😊 But a few more follow-up questions: 1.) Is it a form of distillation, then? 2.) In your video you said that most of the negative examples for the COVID-19 passages (shown on the example slide) are relatively easy to identify as true or false negatives, whereas a single one of the passages is deemed to be hard. I think I got lost here. I am missing something that explains why this distinction should be easy for any sentence transformer model. To me it sounds like the initial problem we are trying to solve in the first place, not an already existing part of the solution. Maybe you explained that elsewhere?
@murphp151
@murphp151 2 years ago
Do you assume that sim(q, P+) is going to be greater than sim(q, P-)? Is this always the case? If it weren't, would you be training the model to return the wrong value?
@ax5344
@ax5344 2 years ago
The pretrained model is from MS MARCO, and it's further trained on query-answer type data. Does that mean the final model is most suitable for semantic retrieval tasks? If so, is it because of the data (MS MARCO, query-answer) or the method (GPL)? If I want to train a domain-adapted language model for a classification task, is GPL still the best unsupervised method?
@jamesbriggs
@jamesbriggs 2 years ago
GPL is specific to semantic retrieval, deduplication, etc. I'm not sure it would be possible to apply this as training for classification tasks, other than as the embedding model in a topic modeling pipeline.
@ax5344
@ax5344 2 years ago
@@jamesbriggs Thanks so much for the illustration. That helps clarify a doubt I had: some SBERT models on Hugging Face say they are good for semantic search, while others say theirs are good for classification. Do you know of any pretraining method that's good for classification? Could you do videos on that too?
@MaltheHave
@MaltheHave 2 years ago
Hi James! You mentioned a project you did involving Dhivehi and using TSDAE. Given that it's hard to find a doc2query and cross-encoder model in Dhivehi, could you have used GPL as well?
@jamesbriggs
@jamesbriggs 2 years ago
We would need the two models; the difficult part would be the doc2query model. A cross-encoder might be more doable, as they don't tend to require too much data. I'd probably try to find translation pairs for (English, Dhivehi), though, and see if multilingual knowledge distillation for one or more of these models could be an option. When I have time I'd love to do more on the Dhivehi project. It was very interesting, and the guy I was working with (Ashraq) is still keen to do more.
@MaltheHave
@MaltheHave 2 years ago
Awesome. The reason I'm asking is that I'm trying to use GPL on the Danish language (it's hard to find pretrained models compared to English). I thought about manually writing queries for passages and training a doc2query model on those, but I wouldn't know how to custom-train a doc2query model. It would be very informative to see a series about doing NLP tasks in less-spoken languages like Dhivehi.
@bastahous
@bastahous 2 years ago
Hello James. Thank you for the invaluable documentation you have been providing on all these state-of-the-art approaches. I have an additional question which maybe I've missed, but you don't seem to address: how do you split the unstructured documents in the first place? There are some pre-trained sentence parsers out there, but when it comes to paragraphs or sets of sentences that make sense together, I'm not sure what the best approach is. Splitting on doesn't work for me
@jamesbriggs
@jamesbriggs 2 years ago
Preprocessing is often pretty difficult. I've only used rule-based logic for this in the past, e.g. split on " " until length is 400 < x < 600 characters. Someone recently mentioned using a sentence tokenizer from NLTK to help create the splits (but I haven't used that).
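That rule-based logic can be sketched as a small greedy splitter. The thresholds and the crude `". "` sentence split below are illustrative assumptions; `nltk.tokenize.sent_tokenize` could replace the splitting step for better sentence boundaries.

```python
def split_passages(text, min_len=400, max_len=600):
    """Greedily merge sentence-ish units into passages of roughly
    min_len..max_len characters (the rule-based approach described above)."""
    # Crude sentence split; nltk.tokenize.sent_tokenize is a drop-in alternative.
    units = [u.strip() for u in text.split(". ") if u.strip()]
    passages, current = [], ""
    for unit in units:
        candidate = f"{current}. {unit}" if current else unit
        if len(candidate) <= max_len:
            current = candidate          # still under the cap: keep merging
        else:
            if current:
                passages.append(current)  # flush what we have, start fresh
            current = unit
        if len(current) >= min_len:
            passages.append(current)      # big enough: emit as a passage
            current = ""
    if current:
        passages.append(current)          # trailing remainder
    return passages

# Example: 20 sentences of 99 characters each merge into 5 passages.
text = ". ".join(["x" * 99] * 20)
passages = split_passages(text)
```

A single unit longer than `max_len` would be emitted as-is here; a production version would need an extra rule (hard character cut or word-level split) for that case.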
@malikrumi1206
@malikrumi1206 Year ago
Man, can you *please* sit further back from the camera in your videos? I feel like you are sitting in my lap.
@dandanny1081
@dandanny1081 Year ago
It may be a good explanation, but I can't open this video at work with your face covering my screen. It's too weird!