Text Embeddings, Classification, and Semantic Search (w/ Python Code)

  39,864 views

Shaw Talebi

1 day ago

👉 Need help with AI? Reach out: shawhintalebi.com/
In this video, I introduce text embeddings and describe how we can use them for two simple yet high-value use cases: text classification and semantic search.
More Resources:
👉 Series Playlist: • Large Language Models ...
🎥 RAG: • How to Improve LLMs wi...
📰 Read more: medium.com/towards-data-scien...
💻 GitHub: github.com/ShawhinT/KZbin-B...
[1] • What Are Word and Sent...
[2] R. Patil, S. Boit, V. Gudivada and J. Nandigam, “A Survey of Text Representation and Embedding Techniques in NLP,” in IEEE Access, vol. 11, pp. 36120-36146, 2023, doi: 10.1109/ACCESS.2023.3266377.
[3] owasp.org/www-project-top-10-...
--
Book a call: calendly.com/shawhintalebi
Socials
/ shawhin
/ shawhintalebi
/ shawhint
/ shawhintalebi
The Data Entrepreneurs
🎥 YouTube: / @thedataentrepreneurs
👉 Discord: / discord
📰 Medium: / the-data
📅 Events: lu.ma/tde
🗞️ Newsletter: the-data-entrepreneurs.ck.pag...
Support ❤️
www.buymeacoffee.com/shawhint
Intro - 0:00
Problem: Text isn't computable - 0:42
Text Embeddings - 1:42
Why should I care? - 3:15
Use Case 1: Text Classification - 5:49
Use Case 2: Semantic Search - 12:40
Free gift for watching - 23:50
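The two use cases in the chapters above can be sketched in a few lines of Python. This is a toy illustration, not the video's code: the three-number lists are made-up stand-ins for the high-dimensional vectors a real embedding model (e.g. Sentence Transformers or the OpenAI embeddings API) would produce, and the document texts are hypothetical.

```python
import math

# Toy "embeddings": hand-made 3-number lists standing in for real
# embedding vectors from a model.
docs = {
    "resume: machine learning researcher": [0.80, 0.20, 0.10],
    "resume: oil painting portfolio":      [0.10, 0.80, 0.30],
    "resume: backend web developer":       [0.60, 0.10, 0.70],
}

def cosine(a, b):
    """Cosine similarity: the standard way to compare embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Semantic search: rank documents by similarity to the (pretend) query vector.
query_vec = [0.75, 0.25, 0.10]  # stand-in embedding of "ML experience"
ranked = sorted(docs, key=lambda d: cosine(docs[d], query_vec), reverse=True)
print(ranked[0])  # the ML resume ranks first
```

Text classification works on the same vectors: fit any ordinary classifier on them, or assign each document the label of its most similar labeled example.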

Comments: 53
@ShawhinTalebi 2 months ago
Check out more videos in this series 👇
👉 Series Playlist: kzbin.info/aero/PLz-ep5RbHosU2hnz5ejezwaYpdMutMVB0
🎥 RAG: kzbin.info/www/bejne/j53daGpvhNVshtk
--
📰 Read more: medium.com/towards-data-science/text-embeddings-classification-and-semantic-search-8291746220be?sk=03e4e68a420373a3525de8721f57c570
💻 GitHub: github.com/ShawhinT/KZbin-Blog/tree/main/LLMs/text-embeddings
Resources
[1] kzbin.info/www/bejne/d2mrdoN4mbCJg6Msi=PA4kCnfgd3nx24LR
[2] R. Patil, S. Boit, V. Gudivada and J. Nandigam, "A Survey of Text Representation and Embedding Techniques in NLP," in IEEE Access, vol. 11, pp. 36120-36146, 2023, doi: 10.1109/ACCESS.2023.3266377.
[3] owasp.org/www-project-top-10-for-large-language-model-applications/
@ccapp3389 2 months ago
Love that you’re bringing real knowledge, insights and code here! So many AI YouTubers are just clickbaiting their way through the hype cycle by reading the same SHOCKING news as everyone else.
@tylerpoore97 1 month ago
I mean, the guy clickbaited the thumbnail. Also, this is insanely old news at this point (if considered news at all). The video content was on point, but we shouldn't be promoting clickbait methods.
@ccapp3389 1 month ago
I clicked this video for technical explanations and code, not news. There are plenty of dudes reading off the same SHOCKING news across AI YouTube. I got exactly what I wanted from this video and feel like the title was clear.
@krishnavamsiyerrapatruni5385 15 days ago
I have learnt so much by watching the entire series. Thank you so much Shaw! I think this is one of the best playlists out there for anyone looking to get into the field of LLMs and GenAI.
@ShawhinTalebi 12 days ago
Great to hear! Feel free to share any suggestions for future content :)
@BrandonFoltz 1 month ago
Great video. The practical use cases for embeddings themselves are undervalued IMHO and this video is fantastic for showing ways to use embeddings. Even if you use OpenAI embeddings, they are dirt cheap, and can provide fantastic vectors for further analysis, manipulation, and comparison.
@ShawhinTalebi 1 month ago
Thanks Brandon! I completely agree. Agents are great, but they seem to overshadow all the relatively simple text embedding-based applications.
@PRColacino 1 month ago
Congrats man! Keep going with more real examples and code sharing.
@obaydmir8353 1 month ago
Clear and understandable explanation of these concepts. Thanks, I really enjoyed it!
@aldotanca9430 1 month ago
Exceptionally clear as always!
@ShawhinTalebi 1 month ago
Thanks Aldo :)
@ethanlazuk 1 month ago
SEO here, enjoyed your examples of semantic search and explanation of hybrid search. Great vid and easy to follow. Will explore your channel. Cheers!
@pramodkumarsola 1 month ago
You are the real guy to subscribe to and learn from.
@ifycadeau 2 months ago
Wow! Thank you for breaking this down; I've been trying to figure it out!
@ShawhinTalebi 2 months ago
Glad to help!
@blackswann9555 1 month ago
Excellent work sir! ❤
@dr.aravindacvnmamit3770 2 months ago
Excellent!
@avi7278 1 month ago
Great format, subbed
@AlexandreMarr-uq8pw 8 hours ago
Can only two classes be used? If I have lots of types, for example product categories, can it still be applied?
@mr.daniish 1 month ago
Love you shaw!
@ShawhinTalebi 1 month ago
❤️
@cinematicsounds 1 month ago
Thank you, very good information. I will try to make a database for audio sound effects (text to audio) using vector databases.
@sherpya 2 months ago
Is it possible to extract software names from the query with a text classifier and apply only, e.g., Apache Airflow to keyword search? Also, what DB do you suggest? Is Postgres with vector support good?
@ShawhinTalebi 2 months ago
Good question. While I haven't seen a text classifier used for keyword search, that could be a clever way to implement it. There are several DBs to choose from these days. I'd say go with what makes sense with the existing data infrastructure. If starting from scratch, Elasticsearch or Pinecone might be good jumping-off points.
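The hybrid idea discussed in this thread can be sketched as follows. Everything here is hypothetical (made-up documents, a `tool` tag standing in for what a classifier would extract from the query, and toy 2-D vectors in place of real embeddings): filter by the extracted keyword first, then rank the survivors semantically.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical index: each doc carries a software tag (what a text
# classifier might extract from the query) plus a toy embedding.
index = [
    {"text": "Scheduling DAGs in Airflow", "tool": "airflow", "vec": [0.9, 0.1]},
    {"text": "Airflow sensor deadlocks",   "tool": "airflow", "vec": [0.2, 0.9]},
    {"text": "Spark shuffle tuning",       "tool": "spark",   "vec": [0.8, 0.3]},
]

def hybrid_search(query_vec, tool):
    # Step 1: keyword/metadata filter on the extracted software name.
    candidates = [d for d in index if d["tool"] == tool]
    # Step 2: semantic ranking over the filtered candidates only.
    return sorted(candidates, key=lambda d: cosine(d["vec"], query_vec), reverse=True)

results = hybrid_search([0.85, 0.2], "airflow")
```

The Spark document never reaches the semantic stage, which is the point of the keyword pre-filter.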
@aldotanca9430 1 month ago
LanceDB is also quite good.
@tamilinfomite 1 month ago
Hi Shawhin, thanks. I ran into a problem. I tried to use the sentence-transformers model by installing it. It always gives an error: no file found 'config_sentence_transformers.json' in the .cache/huggingface/... folder. Your help is appreciated.
@ShawhinTalebi 1 month ago
Not sure what the issue could be. Did you install all the requirements from the GitHub repo? github.com/ShawhinT/KZbin-Blog/tree/main/LLMs/text-embeddings
@pepeballesteros9488 1 month ago
Many thanks for the video Shaw, great content! One simple question: when using OpenAI's embedding model, each resume is represented by an embedding vector. Is this embedding computed as the average of all word vectors?
@ShawhinTalebi 1 month ago
Great question! Embedding models do not operate on specific words, but rather on the text as a whole. This is valuable because the meaning of specific words is driven by the context they appear in.
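A toy way to see why whole-text embeddings matter (made-up 2-D word vectors, not a real model): simply averaging per-word vectors is order-blind, so sentences with opposite meanings can collapse to the same vector.

```python
# Made-up 2-D word vectors; a real model would learn these from data.
word_vecs = {"man": [1.0, 0.0], "bites": [0.0, 1.0], "dog": [1.0, 1.0]}

def average_embedding(text):
    """Order-blind baseline: average the word vectors of a sentence."""
    vecs = [word_vecs[w] for w in text.split()]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

a = average_embedding("man bites dog")
b = average_embedding("dog bites man")
print(a == b)  # True: identical vectors despite opposite meanings
```

A context-aware sentence embedding model, by contrast, would assign these two sentences different vectors.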
@KrisTC 1 month ago
I have watched most of the videos in this series and found them really helpful. Something I am looking for that I haven't seen you cover yet is more guidance on preparing data for either RAG or fine-tuning. I am sure you have practical tips you can give. I have a large old codebase; we have loads of documentation and tutorials etc., but it is a lot for someone to pick up. This new world of GPTs seems perfect for building an assistant. I will be able to work through it OK, but I suspect there will be a load of learned best practices or pitfalls to avoid that are a bit more subtle. For example, I am looking through our support emails / tickets; lots of them start with "please send logs" :) and after a load of back and forth we have the info. This is much like a conversation with ChatGPT. For fine-tuning, is it best to fine-tune on a whole thread, or on each chunk of the conversation?
@ShawhinTalebi 1 month ago
Great suggestion! I plan to do a series on data engineering, and this would be a great thing to incorporate into it. For your use case, the best choice would depend on what you want the assistant to do. For instance, if you want the assistant to mimic the support rep, then you'd likely want to use each message in the thread with its appropriate context (i.e. the preceding messages).
@KrisTC 1 month ago
@@ShawhinTalebi thanks for the tip. That’s what I ended up doing. I haven't actually tried fine-tuning yet; I just finished my data prep. Looking forward to your next series 😊
@eliskucevic340 1 month ago
I've been using embeddings for a while, but I find that agents can call specialized tools, which can be very useful depending on the application.
@ShawhinTalebi 1 month ago
Thanks for sharing your insight! Indeed, agents and embeddings solve different problems. However, some agent use cases could be reconfigured to be solved with text embeddings + a human in the loop.
@alroygama6166 9 days ago
Can I use these embeddings with BERT-based models instead?
@ShawhinTalebi 5 days ago
Yes! In fact, Sentence Transformers has a few BERT-based embedding models: sbert.net/docs/pretrained_models.html
@Whysicist 1 month ago
LDA (Latent Dirichlet Allocation) is kinda trivial these days… the MATLAB Text Analytics Toolbox works great on PDFs with bi-grams, a la bag-of-n-grams. Cool… thanks…
@avi7278 1 month ago
finally someone who speaks with their hands more than I do, lol...
@ShawhinTalebi 1 month ago
😂😂.. 👋 👍
@toddai2721 20 days ago
I call him the hand whisperer.... but really loud.
@chamaljayasinghe4210 5 days ago
✌✌🧑‍💻🧑‍💻
@skarloti 1 month ago
This is not always a good solution if we have multilingual text. I see LLMs with 1M token/character contexts; they offer other solutions with functions and external API calls.
@ShawhinTalebi 1 month ago
I'm curious about this. I've seen embedding models that can handle multiple languages, so I'd expect them to work pretty well. Can you shed any more light on this?
@tylerpoore97 1 month ago
Soo, unlike your thumbnail, this has nothing to do with agents... Why mention them?
@ShawhinTalebi 1 month ago
Thumbnail is "Forget AI agents... use this instead". I explain this a bit @3:15.
@user-gy1pl9ri1k 1 month ago
Are you Persian?
@ShawhinTalebi 1 month ago
Yes :)
@cirtey29 1 month ago
By the end of next year, all the drawbacks of LLMs will be erased.
@ShawhinTalebi 26 days ago
I hope so!
@user-yu2wr5qf7g 24 days ago
cool...
@bentobin9606 1 month ago
Is text embedding the same as the text tokenization done in training?
@ShawhinTalebi 1 month ago
Good question! These are different things. Tokenization is the process of taking some text and deriving a vocabulary from which the original text can be generated, where each element in the vocabulary is assigned a unique integer value. Text embeddings, on the other hand, take tokens and translate them into meaningful (numerical) representations. I talk a little more about tokenization here: kzbin.info/www/bejne/mavZh5yYd5efiKMsi=FwqmkB9Ltyq45n0w&t=348
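The distinction can be sketched with a toy vocabulary and lookup table (made-up values; a real tokenizer's vocabulary and a real embedding matrix are derived/learned from data):

```python
# Tokenization: text -> integer ids via a vocabulary (reversible).
vocab = {"the": 0, "cat": 1, "sat": 2}

def tokenize(text):
    return [vocab[word] for word in text.split()]

# Embedding: ids -> dense vectors via a (here, made-up) lookup table.
embedding_table = {0: [0.1, 0.3], 1: [0.9, 0.2], 2: [0.4, 0.8]}

def embed(token_ids):
    return [embedding_table[i] for i in token_ids]

token_ids = tokenize("the cat sat")   # just identifiers: [0, 1, 2]
vectors = embed(token_ids)            # meaningful numeric representations
```

The ids carry no meaning on their own; only the embedding step maps them into a space where similarity is meaningful.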