One of the most popular benchmarks for long context LLM retrieval is @GregKamradt's Needle in A Haystack: a fact (the "needle") is injected into a long context (the "haystack", e.g., Paul Graham essays) and the LLM is asked a question about that fact. But it only evaluates retrieval of a single needle.
RAG typically requires retrieval of, and reasoning over, multiple chunks. To use long context LLMs in place of RAG systems, it is critical to understand how well they can retrieve multiple facts and reason over them. We recently updated Greg's repo to support multiple needles and to use LangSmith for evaluation and logging.
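To make the multi-needle setup concrete, here is a minimal sketch (not the repo's actual implementation) of the core idea: several needle facts are placed at evenly spaced depths within a haystack of filler text, producing the context the LLM is later questioned about. The example facts and filler are hypothetical.

```python
# Sketch of multi-needle insertion: place each needle at an evenly
# spaced depth in the haystack so placement effects can be studied.

def insert_needles(haystack: str, needles: list[str]) -> str:
    """Insert each needle at depth (i+1)/(n+1) of the haystack."""
    text = haystack
    # Insert from the deepest position backwards so earlier (shallower)
    # insertion offsets are not shifted by later ones.
    for i in range(len(needles) - 1, -1, -1):
        depth = (i + 1) / (len(needles) + 1)   # e.g. 1/4, 2/4, 3/4 for 3 needles
        pos = int(len(haystack) * depth)
        text = text[:pos] + " " + needles[i] + " " + text[pos:]
    return text

# Hypothetical example data
haystack = "Some filler sentence from an essay. " * 50
needles = [
    "Figs are one secret ingredient.",
    "Prosciutto is another secret ingredient.",
    "Goat cheese is the third secret ingredient.",
]

context = insert_needles(haystack, needles)
```

The resulting `context` string, plus a question covering all of the facts, is what gets sent to the model in a single turn.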
Here, we show how to perform multi-needle evaluations and present our results from testing. We tested GPT-4-128k on retrieval of 1, 3, and 10 needles in a single turn across 1k to 120k token context windows. Performance degrades as you ask the LLM to retrieve more facts, as the context window increases, for facts placed toward the beginning of the context, and when the LLM has to reason over the retrieved facts.
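One simple way to quantify "how many facts were retrieved" is to check the model's answer for a keyword per needle; this is a hedged sketch of such a scorer, not the grading logic the repo or LangSmith actually uses. The keywords and answers below are hypothetical.

```python
def retrieval_score(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected needle keywords found (case-insensitively) in the answer."""
    a = answer.lower()
    return sum(k.lower() in a for k in expected_keywords) / len(expected_keywords)

# Hypothetical model answer that recovers all three needle facts.
score = retrieval_score(
    "The secret ingredients are figs, prosciutto, and goat cheese.",
    ["figs", "prosciutto", "goat cheese"],
)  # → 1.0
```

A score below 1.0 indicates the model dropped one or more needles, which is the degradation pattern described above.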
All code is open source:
github.com/gka...
Visualization notebook:
github.com/gka...
All runs can be seen here with public LangSmith traces:
github.com/gka...