How to save money with Gemini Context Caching

7,166 views

Sam Witteveen

1 day ago

Comments: 52
@Dillonvu • 2 months ago
Very excited it's for Flash too! This'll help a lot at work for certain features!
@deeplearning7097 • 2 months ago
Brilliant Sam, as always, thank you very much.
@SwapperTheFirst • 2 months ago
I can immediately see how to use this for quite cheap similarity search. Assuming you have 1M strings to match, you can put all of them into the context window and then ask the model each time to find a similar string. It will be quite slow, but with caching (storing the tokens in RAM or on SSD) it won't be expensive. This doesn't scale to 1B strings, though, and a RAG approach is also not possible/feasible.

Sam, maybe you have some advice on how to solve similarity search at scale? At a smaller scale you can solve this character-wise, using rapidfuzz or dedupe, but how do you solve it at scale? This business problem is known as "entity matching" or "fuzzy entity matching": for example, you want to match "Microsoft corp" to "Microsoft corporation" to "MSFT", and you also want to cluster similar strings under the same unique umbrella ("Microsoft corporation" in the example). You could use regular vector search, but the problem is the clustering: how do you "shuffle" through 1B rows to create a reliable index and then keep it updatable in real time?
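As a concrete starting point, here is a minimal sketch of the caching side of this idea using the google-generativeai Python SDK. The entity list, model version, and TTL are illustrative assumptions, not from the video, and a real list would need to clear the API's minimum cacheable token count:

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")

# Hypothetical: one big block of entities, cached once so each
# query only pays the (cheaper) cached rate for these tokens.
entity_list = "\n".join(["Microsoft corporation", "MSFT", "Apple Inc"])

cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",  # caching requires a pinned model version
    display_name="entity-catalog",
    system_instruction=(
        "You are an entity-matching assistant. Given a query string, "
        "return the closest entity from the cached list."
    ),
    contents=[entity_list],
    ttl=datetime.timedelta(hours=1),
)

# Build a model bound to the cached context and query it.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Which entity matches 'Microsoft corp'?")
print(response.text)
```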
@samwitteveenai • 2 months ago
Very interesting comment. AFAIK the most commonly used models for entity matching (balancing accuracy, cost and efficiency) are encoder-based BERT/RoBERTa-style models. LLMs can certainly do it; they just end up being slow. It would be interesting to see if you could do it with a long-context model, perhaps storing the list of entities in the prompt and only having it give the new ones as output. The challenge is it would still be way too slow for anything real time. This is an interesting challenge; let me think about it a bit more and look for a dataset to test on.
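For reference, a minimal sketch of the encoder-based route Sam mentions, using sentence-transformers. The model choice, tiny catalog, and query are assumptions for illustration:

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence encoder works; all-MiniLM-L6-v2 is a small, fast default.
model = SentenceTransformer("all-MiniLM-L6-v2")

catalog = ["Microsoft Corporation", "Apple Inc", "Alphabet Inc"]
catalog_emb = model.encode(catalog, convert_to_tensor=True, normalize_embeddings=True)

# Embed the query and score it against every catalog entry.
query_emb = model.encode("Microsoft corp", convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(query_emb, catalog_emb)[0]

best = int(scores.argmax())
print(catalog[best], float(scores[best]))  # expected: "Microsoft Corporation"
```

At 1B rows you would swap the brute-force scoring for an approximate nearest-neighbor index, but the embedding step stays the same.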
@JayanaKalansuriya • 2 months ago
Hi, if possible I would love to get some advice from you guys! I have a master catalog of over 5,000 products with images and product text, and I want to build a similarity-matching solution: if I upload an image and there's a similar product in the master catalog, I want to find it. Basically we need to do image similarity mapping, and the most similar image to the input image should be shown. How can I work on this? Any advice would be much appreciated!
@SwapperTheFirst • 2 months ago
@JayanaKalansuriya With a scale of 5K or so you don't need anything complex. Just grab a multimodal (visual is enough) embedding model and then set up a vector search. It is just a regular RAG app with a small twist: you create embeddings for the images as well, and you also serve images in the RAG response. Or use a managed solution from Google Cloud.
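A minimal sketch of that idea with a CLIP model via sentence-transformers. The model name, file paths, and brute-force search are assumptions; at 5K items you don't even need a vector database:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP embeds images and text into the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

# Hypothetical catalog of product images.
catalog_paths = ["shoe_001.jpg", "shoe_002.jpg", "bag_001.jpg"]
catalog_emb = model.encode(
    [Image.open(p) for p in catalog_paths], convert_to_tensor=True
)

# Embed the uploaded image and find the closest catalog item.
query_emb = model.encode(Image.open("query.jpg"), convert_to_tensor=True)
scores = util.cos_sim(query_emb, catalog_emb)[0]
best = int(scores.argmax())
print(f"Closest match: {catalog_paths[best]} (score {float(scores[best]):.3f})")
```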
@SwapperTheFirst • 2 months ago
@samwitteveenai Thanks a lot, Sam. You're right about BERT/RoBERTa; these are used in spaCy's "transformer" models for named-entity extraction. And spaCy always uses the latest and greatest, unlike NLTK, which is very conservative and carries lots of legacy code for compatibility.
@eddiehaug • 2 months ago
@JayanaKalansuriya Not sure what your specific use case is, but you might want to take a look at Google's Recommendations AI, and there's a Ranking API as well.
@danangjeffry • 2 months ago
Very useful and easy to understand. Thank you!
@rluijk • 2 months ago
Thanks! Nice explainer. Will integrate this part in my setup!
@GamingClubGermany • 1 month ago
First off, thanks a lot for the video! But why is your voice/the noise so wobbly? Are you using a TTS model or something like that? Update: OK, I don't know what you use, but it's pretty awesome! Do you mind sharing info on what voice "thing" you use?
@leslysandra • 23 days ago
thank you for sharing :D
@gen_ai_explorer • 2 months ago
How does this benefit us? We can store the information in a vector DB and use only the relevant chunks at a time, right? How does Google's caching help?
@matty-oz6yd • 2 months ago
Any idea how this works under the hood? I am trying to work out whether to use an index of relevant context or the context-cache feature. It seems like the details are a closely guarded secret, which means the only way for me to decide between the two is to test both. The use cases seem very similar.
Option 1 - Give Google a bunch of context, hope that it's good, and then run queries against it.
Option 2 - Index my context and add information as needed using RAG.
The RAG approach uses more tokens, but at least I know how it works, so I can set my expectations. The Google approach would be cheaper, but I don't know how the context has been processed, so I can't intentionally format my data for optimal performance.
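A rough back-of-envelope sketch of the cost side of that trade-off. All prices and counts below are assumptions for illustration; check the current Gemini pricing page before relying on any of this:

```python
# Hypothetical, illustrative numbers only.
CONTEXT_TOKENS = 500_000            # size of the shared document/context
QUERIES_PER_HOUR = 100
INPUT_PRICE = 0.35 / 1_000_000      # $/token, normal input (assumed)
CACHED_PRICE = 0.0875 / 1_000_000   # $/token, cached input (assumed discount)
STORAGE_PER_HOUR = 1.00 / 1_000_000 * CONTEXT_TOKENS  # cache storage (assumed)

# Without caching: pay the full input price for the context on every query.
no_cache = CONTEXT_TOKENS * INPUT_PRICE * QUERIES_PER_HOUR

# With caching: pay the discounted rate per query, plus hourly storage.
with_cache = CONTEXT_TOKENS * CACHED_PRICE * QUERIES_PER_HOUR + STORAGE_PER_HOUR

print(f"no cache:   ${no_cache:.2f}/hour")    # -> $17.50/hour
print(f"with cache: ${with_cache:.2f}/hour")  # -> $4.88/hour
```

The break-even point depends on query volume: at low volume the hourly storage fee dominates, and plain RAG over a small retrieved subset may well be cheaper.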
@Maisonier • 2 months ago
I'd love to know how to do that with openwebui and a local model on a single GPU. Do we need to use FAISS, or which RAG setup?
@IdPreferNot1 • 2 months ago
A great video example of this would be demonstrating processing a repo or API docs to help with writing code where the libraries have changed significantly since the cutoff date. I still can't believe that GPT-4o can't get the endpoints and structure right for its own current OpenAI API when you ask it to build code that works with GPT.
@KishanLal-s4k • 1 month ago
Is there a way to update the content of the cache? As far as I can see, updates are limited to the TTL; I'm unable to update the actual content of the cache.
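That matches my reading of the docs: the cached content itself is immutable, and only the expiry can be changed. A minimal sketch, assuming the google-generativeai SDK and a hypothetical cache ID; to "update" the content you recreate the cache:

```python
import datetime
from google.generativeai import caching

# Look up an existing cache by its resource name (hypothetical ID).
cache = caching.CachedContent.get(name="cachedContents/your-cache-id")

# Only the expiry is mutable; this extends the cache's lifetime.
cache.update(ttl=datetime.timedelta(hours=2))

# To change the content, delete and recreate the cache:
cache.delete()
# new_cache = caching.CachedContent.create(...)  # with the new contents
```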
@gemini_537 • 2 months ago
This is super useful! ❤
@ylazerson • 2 months ago
great video - thanks!
@miriamramstudio3982 • 2 months ago
Very useful. Thanks
@RD-learning-today • 2 months ago
How do I use it in Vertex AI?
@johnrperry5897 • 1 month ago
Wait what is cayche
@SrikanthCSE-mi9jm • 2 months ago
How do I use it with LangChain?
@samwitteveenai • 2 months ago
I am not sure if they support this yet or not. My guess is the Google LangChain package and the Vertex LangChain package will have to add support for it.
@darshank8748 • 2 months ago
Great video!
@RD-learning-today • 2 months ago
Can I use it with Vertex AI?
@TheRcfrias • 2 months ago
I thought this video was about caching client-side to avoid passing around huge payloads for function calling and so on 😢
@guanjwcn • 2 months ago
Thank you, Sam!! Does Llama 3 have this too?
@samwitteveenai • 2 months ago
I think if you served it with vLLM you could do it, since vLLM has prefix caching.
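A minimal sketch of what that looks like in vLLM; the model name is an assumption. Automatic prefix caching reuses KV-cache blocks for the shared prefix across requests, which is the open-source analogue of Gemini's context caching:

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching lets vLLM reuse KV-cache blocks for a shared
# prompt prefix, so the long context is only computed once.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model
    enable_prefix_caching=True,
)

long_context = "<your large shared document here>"
params = SamplingParams(max_tokens=128)

# Both prompts share the same long prefix; after the first request,
# later ones hit the cached KV blocks instead of re-processing it.
for question in ["Summarize the document.", "List the key entities."]:
    out = llm.generate([long_context + "\n\nQ: " + question], params)
    print(out[0].outputs[0].text)
```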
@WillJohnston-wg9ew • 2 months ago
Any thoughts on how this would apply to real-time video? I am trying to create something that does real-time video sentiment analysis.
@samwitteveenai • 2 months ago
Real time probably wouldn't work just yet. There are some hacks/techniques they use to do it in real time with Project Astra etc., but I am not sure when that will be available to us externally.
@MistikBBQ • 2 months ago
Any way of doing this locally with something like Ollama? This would actually be amazing to use in some local/edge cases.
@samwitteveenai • 2 months ago
You could do it with a local model via vLLM. AFAIK it's currently not possible in Ollama, but they could certainly add it.
@MistikBBQ • 2 months ago
@samwitteveenai Thanks a lot for the reply!
@mrnakomoto7241 • 2 months ago
Aussie accent, living in the USA, what's going on?
@samwitteveenai • 2 months ago
Actually back living in Singapore for the time being 😀
@mrnakomoto7241 • 2 months ago
@samwitteveenai Out of curiosity, do you own a house in every country you go to?
@micbab-vg2mu • 2 months ago
Great :)
@ahmaddajani3639 • 2 months ago
Why use context like this instead of using a vector store and chunking the content?
@SwapperTheFirst • 2 months ago
You will not get this with a vector store and RAG. Here you have all the content available to the model.
@eddiehaug • 2 months ago
Because depending on the use case, you may want to use one technique vs the other. Adding all the info as context to an LLM is not the same as using RAG, where your results may vary greatly depending on the chunk size, the ranking engine, etc.
@ahmaddajani3639 • 2 months ago
@eddiehaug Yes, correct, it depends on the use case, but if you want to save money on question answering, RAG is better.
@eddiehaug • 2 months ago
@ahmaddajani3639 - yes, agree 👍
@vicovico • 2 months ago
What's going on with the pronunciation of "cached"?
@ScottVanKirk • 2 months ago
It is pronounced "kash". The e is vestigial, like our appendix 😁
@jamiek2039 • 2 months ago
😂
@matthewwalker7063 • 2 months ago
Engagement baiting
@samwitteveenai • 2 months ago
lol I was waiting for someone to say something 😀
@ariganeri • 2 months ago
It's what Aussies call English.
@otty4000 • 2 months ago
Functionally, isn't this quite similar to NotebookLM?
@samwitteveenai • 2 months ago
No, this is more than just uploading the docs/video etc.; it means having a lot of the values precomputed in the model.