Hello James. Wonderful explanation. I was wondering, what are the latest techniques that have come out since? Or what's after GPL & TSDAE?
@shobhitrajgautam 1 year ago
How to use TSDAE + GPL? I understand both approaches, but how do I combine them?
@e_hossam96 1 year ago
This is really wonderful and great. Thanks for this awesome series 👏👏👏
@LauraneCastiaux 2 years ago
Thank you very much for this high quality explanation!
@irfanahmad-ti7vl 1 year ago
Hello James, it's superb. Can you please explain whether this GPL approach can be used for hyperspectral image classification using transformers? If you have used it for that, could you please share the link? Thanks in advance.
@arvinflores5316 2 years ago
If I wanted to do TSDAE and GPL together, is there a specific order? Like pretraining with TSDAE first, then fine-tuning with GPL?
@jamesbriggs 2 years ago
Yes, exactly - pretrain with TSDAE, then fine-tune with GPL.
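For anyone wanting to reproduce this order, here is a minimal sketch assuming the `sentence-transformers` and `gpl` libraries; the model names are the libraries' defaults, and all paths and hyperparameters are illustrative rather than the exact setup from the video:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

# 1) TSDAE pretraining from a plain transformer checkpoint
word_emb = models.Transformer("bert-base-uncased")
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_emb, pooling])

# unlabeled in-domain sentences (placeholder examples)
sentences = ["first in-domain sentence", "second in-domain sentence"]
dataset = datasets.DenoisingAutoEncoderDataset(sentences)
loader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)
loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    weight_decay=0,
)
model.save("output/tsdae-pretrained")

# 2) GPL fine-tuning starting from the TSDAE checkpoint;
# `path_to_generated_data` must contain a BeIR-format corpus.jsonl
import gpl

gpl.train(
    path_to_generated_data="generated",
    base_ckpt="output/tsdae-pretrained",  # the TSDAE model from step 1
    batch_size_gpl=32,
    gpl_steps=140_000,
    output_dir="output/tsdae-gpl",
    generator="BeIR/query-gen-msmarco-t5-base-v1",
    retrievers=["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"],
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
    do_evaluation=False,
)
```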
@arvinflores5316 2 years ago
@jamesbriggs Thanks, I tried it with TSDAE and GPL! It's giving me good results. Is there any newer domain adaptation/unsupervised training you've been looking at aside from GPL? Curious to know more about the field!
@jamesbriggs 2 years ago
GPL is the latest that I've worked with, but I've been focusing on other things in the months since this, so I may have missed something!
@mayank_khurana 1 year ago
@arvinflores5316 Would it be possible to share your codebase or an example of how you approached the above method? It would be really helpful for my college project :)
@ylazerson 2 years ago
fantastic video!
@chenlin7535 1 year ago
Hi James! Wonderful video! May I ask what software you used to record it? The round camera is adorable.
@jamesbriggs 1 year ago
Thanks! I record the screen with OBS and have a separate camera for my face, then I put the two together in either Premiere Pro or DaVinci Resolve (both can do the same job).
@chenlin7535 1 year ago
@@jamesbriggs Thanks! Very helpful!
@tinyentropy 2 years ago
I don't understand the key element that makes this work. What component can actually distinguish hard from trivial negative passages, and why is it able to do so? In other words, what is the difference between the cross-encoder and the bi-encoder in terms of data/text understanding? Only the loss function? Or do they have different access to knowledge?
@jamesbriggs 2 years ago
Good question! Cross-encoders are more powerful models that adapt to new domains much better than bi-encoders. The only problem is that they're slow when you want to compare many items, which is why we use bi-encoders for that task. So we use the cross-encoder's better performance/adaptability to create the labels for a dataset that can then be used to train the bi-encoder.
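In code, that pseudo-labeling step looks roughly like the sketch below, assuming `sentence-transformers`; the query/passage strings are made-up placeholders and the model choices are illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, SentenceTransformer, InputExample, losses

# the cross-encoder scores each (query, passage) pair jointly
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how does covid-19 spread?"  # placeholder text
pos = "The virus spreads mainly through respiratory droplets."
neg = "Flu vaccines are updated every year."

# the margin between the positive and negative scores becomes the soft label
score_pos, score_neg = cross_encoder.predict([(query, pos), (query, neg)])
margin = float(score_pos - score_neg)

# the bi-encoder is then trained so that sim(q, pos) - sim(q, neg)
# matches the cross-encoder's margin (MarginMSE loss)
bi_encoder = SentenceTransformer("msmarco-distilbert-base-v3")
examples = [InputExample(texts=[query, pos, neg], label=margin)]
loader = DataLoader(examples, batch_size=1)
loss = losses.MarginMSELoss(model=bi_encoder)

bi_encoder.fit(train_objectives=[(loader, loss)], epochs=1, show_progress_bar=False)
```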
@tinyentropy 2 years ago
@jamesbriggs Thanks 😊 But a few more follow-up questions: 1) Is it a form of distillation, then? 2) In your video you said that most of the negative examples for the COVID-19 passages (shown on the example slide) are relatively easy to identify as true or false negatives, whereas a single one of the passages is deemed to be hard. I think I got lost here. I am missing whatever explains why this distinction should be easy for any sentence transformer model. To me it sounds like the initial problem we are trying to solve in the first place, not an already existing part of the solution. Maybe you explained that elsewhere?
@murphp151 2 years ago
Do you assume that sim(q, P+) is going to be greater than sim(q, P-)? Is this always the case? If it wasn't, would you be training the model to return the wrong value?
@ax5344 2 years ago
The pretrained model is from msmarco, and it's further trained on query-answer type data. Does that mean the final model is most suitable for semantic retrieval tasks? If so, is it because of the data (msmarco, query-answer) or the method (GPL)? If I want to train a domain-adapted language model for a classification task, will GPL still be the best unsupervised method?
@jamesbriggs 2 years ago
GPL is specific to semantic retrieval, deduplication, etc. I'm not sure if it would be possible to apply this as training for classification tasks, other than as the embedding model in a topic modeling pipeline.
@ax5344 2 years ago
@jamesbriggs Thanks so much for the illustration. That helps clarify a doubt I had: some SBERT models on Hugging Face say they are good for semantic search, while others say they are good for classification. Do you know of any pre-training method that is good for classification? Could you do videos on that too?
@MaltheHave 2 years ago
Hi James! You mentioned a project you did involving Dhivehi and using TSDAE. Given that it's hard to find doc2query and cross-encoder models in Dhivehi, could you have used GPL as well?
@jamesbriggs 2 years ago
We would need the two models. The difficult part would be the doc2query model; a cross-encoder might be more doable, as they don't tend to require too much data. I'd probably try to find (English, Dhivehi) translation pairs, though, and see if multilingual knowledge distillation for one or more of these models could be an option. When I have time I'd love to do more on the Dhivehi project - it was very interesting, and the guy I was working with (Ashraq) is still keen to do more.
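The multilingual knowledge distillation idea can be sketched as below, assuming `sentence-transformers`; the teacher/student model names follow that library's multilingual training example, and the translation pair is a placeholder, not real Dhivehi data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# monolingual teacher with good English embeddings
teacher = SentenceTransformer("paraphrase-distilroberta-base-v2")
# multilingual student to be aligned with the teacher's vector space
student = SentenceTransformer("xlm-roberta-base")

# (English, Dhivehi) translation pairs - placeholder text
pairs = [("Hello, how are you?", "<Dhivehi translation here>")]

examples = []
for en, dv in pairs:
    target = teacher.encode(en)  # teacher embedding of the English side
    # the student learns to map BOTH sentences onto that same vector
    examples.append(InputExample(texts=[en, dv], label=target))

loader = DataLoader(examples, batch_size=1)
loss = losses.MSELoss(model=student)
student.fit(train_objectives=[(loader, loss)], epochs=1, show_progress_bar=False)
```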
@MaltheHave 2 years ago
Awesome. The reason I'm asking is that I'm trying to use GPL on Danish (it's hard to find pre-trained models compared to English). I thought about manually writing queries for passages and training a doc2query model on those, but I wouldn't know how to custom-train a doc2query model. It would be very informative to see a series about doing NLP tasks in less-spoken languages like Dhivehi.
@bastahous 2 years ago
Hello James. Thank you for the invaluable documentation you have been providing on all these state-of-the-art approaches. I have an additional question which maybe I've missed, but you don't seem to address: how do you split the unstructured documents in the first place? There are some pre-trained sentence parsers out there, but when it comes to paragraphs or sets of sentences that make sense together, I'm not sure what the best approach is. Splitting on doesn't work for me.
@jamesbriggs 2 years ago
Preprocessing is often pretty difficult. I've only used rule-based logic for this in the past, e.g. split on " " until the length is 400 < x < 600 characters, etc. Someone recently mentioned using a sentence tokenizer from NLTK to help create the splits (but I haven't used that).
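A minimal sketch of that NLTK-based idea; the ~400-600 character target follows the rule of thumb above, and the function name and sample text are illustrative:

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # sentence tokenizer models

def split_document(text: str, target_chars: int = 500) -> list[str]:
    """Greedily pack sentences into chunks of roughly `target_chars` characters."""
    chunks, current = [], ""
    for sent in sent_tokenize(text):
        current = (current + " " + sent).strip()
        if len(current) >= target_chars:
            chunks.append(current)
            current = ""
    if current:  # keep any trailing partial chunk
        chunks.append(current)
    return chunks

doc = "First sentence of the document. Second sentence. " * 20  # placeholder text
passages = split_document(doc)
```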
@malikrumi1206 1 year ago
Man, can you *please* sit further back from the camera in your videos? I feel like you are sitting in my lap.
@dandanny1081 1 year ago
It may be a good explanation, but I can't open this video at work with your face covering my screen - it's too weird!