One of the most popular benchmarks for long context LLM retrieval is @GregKamradt's Needle in A Haystack: a fact (the "needle") is injected into a long context (the "haystack", e.g., Paul Graham essays) and the LLM is asked a question about that fact. But it only evaluates retrieval of a single needle.
RAG typically requires retrieval of, and reasoning over, multiple chunks. To use long context LLMs in place of RAG systems, it is critical to understand how well they can retrieve multiple facts and reason over them. We recently updated Greg's repo to support multiple needles and to use LangSmith for evaluation and logging.
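To make the multi-needle setup concrete, here is a minimal sketch (not the repo's actual implementation) of the core idea: several needle facts are placed at evenly spaced depths within a haystack of filler text, producing the context the LLM is later questioned about. The example facts and filler are hypothetical.

```python
# Sketch of multi-needle insertion: place each needle at an evenly
# spaced depth in the haystack so placement effects can be studied.

def insert_needles(haystack: str, needles: list[str]) -> str:
    """Insert each needle at depth (i+1)/(n+1) of the haystack."""
    text = haystack
    # Insert from the deepest position backwards so earlier (shallower)
    # insertion offsets are not shifted by later ones.
    for i in range(len(needles) - 1, -1, -1):
        depth = (i + 1) / (len(needles) + 1)   # e.g. 1/4, 2/4, 3/4 for 3 needles
        pos = int(len(haystack) * depth)
        text = text[:pos] + " " + needles[i] + " " + text[pos:]
    return text

# Hypothetical example data
haystack = "Some filler sentence from an essay. " * 50
needles = [
    "Figs are one secret ingredient.",
    "Prosciutto is another secret ingredient.",
    "Goat cheese is the third secret ingredient.",
]

context = insert_needles(haystack, needles)
```

The resulting `context` string, plus a question covering all of the facts, is what gets sent to the model in a single turn.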
Here, we show how to perform multi-needle evaluations and present our results from testing. We tested GPT-4-128k on retrieval of 1, 3, and 10 needles in a single turn across 1k to 120k token context windows. Performance degrades as you ask the LLM to retrieve more facts, as the context window increases, for facts placed toward the beginning of the context, and when the LLM has to reason over the retrieved facts.
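One simple way to quantify "how many facts were retrieved" is to check the model's answer for a keyword per needle; this is a hedged sketch of such a scorer, not the grading logic the repo or LangSmith actually uses. The keywords and answers below are hypothetical.

```python
def retrieval_score(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected needle keywords found (case-insensitively) in the answer."""
    a = answer.lower()
    return sum(k.lower() in a for k in expected_keywords) / len(expected_keywords)

# Hypothetical model answer that recovers all three needle facts.
score = retrieval_score(
    "The secret ingredients are figs, prosciutto, and goat cheese.",
    ["figs", "prosciutto", "goat cheese"],
)  # → 1.0
```

A score below 1.0 indicates the model dropped one or more needles, which is the degradation pattern described above.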
All code is open source:
github.com/gka...
Visualization notebook:
github.com/gka...
All runs can be seen here with public LangSmith traces:
github.com/gka...