Dirichlet Energy Minimization Explains In-Context Learning (Harvard)

2,013 views

Discover AI

1 day ago

Comments: 26
@code4AI
@code4AI 3 days ago
With the automatic audio dubbing from YouTube/Google you hear a synthetic voice in your regional language. To hear my original voice in English, switch to "Default" or "English" in the settings. Thank you.
@jabowery
@jabowery 2 days ago
This is my favorite machine learning channel. It doesn't surprise me that as you get over the target you start getting flak. And you are over the target. Welcome to the club.
@vrc5674
@vrc5674 3 days ago
13:30 Wow! This radically changes my perspective of LLMs conceptually. I guess I always sort of viewed them as a kind of auto-complete that uses the context to locate itself in a high-dimensional vector space, from which it just spits out the next most statistically likely token. This ability to use ICL to form new conceptual models from the context seems similar to what we do in our own minds as we think. It's not just performing a vector search, but rather building a conceptual model of the context (or even reality 😲).

There seems to be the implication that this could be the seed of consciousness if it were to emerge ... not in the LLMs and their weights, and not even in some complex agent wrapper system around the LLM, but in the process of assimilating the context it's given and forming new conceptual models (thoughts and ideas). As the LLM processes the context, it's almost as though it wakes up and starts experiencing it as a stream of sensory input from which it forms conceptual models. Once context processing has completed, it snuffs out of existence. I'm not suggesting it's self-aware or conscious, but during those periods it really does resemble something that's alive and processing its environment.

Another way to frame what's going on with ICL as it processes the context: as it reaches "the plateau" it is, in a sense, reaching an *understanding* of the concept. If there were code that could monitor the Dirichlet energy so that, once it has reached understanding, it could redirect its attention to a new context pattern, it would be interesting to see what sort of performance increase you would get. Could this attention mechanism be used to decompose and re-integrate conceptual models into more and more complex structures? Could the LLM system be given the ability to reflect on its own state at a meta-level? Is this already going on at some level within the transformer mechanism?
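Riffing on the "monitor the Dirichlet energy" idea above: a toy sketch of a plateau detector. It assumes a series of energy values computed elsewhere (e.g. per layer, as sketched under a later comment); the function and its thresholds are hypothetical, not from the paper.

```python
def reached_plateau(energies, rel_tol=0.02, window=3):
    """Return True once the relative change in energy has stayed below
    rel_tol for `window` consecutive steps -- a crude proxy for 'the model
    has settled into a stable representation of the context'."""
    if len(energies) < window + 1:
        return False
    recent = energies[-(window + 1):]
    changes = [abs(b - a) / max(abs(a), 1e-9) for a, b in zip(recent, recent[1:])]
    return all(c < rel_tol for c in changes)

print(reached_plateau([9.0, 5.0, 3.1, 3.05, 3.04, 3.04]))  # True: energy settled
```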
@damianlewis7550
@damianlewis7550 3 days ago
Fantastic. Part III please! I really need to implement a better RAG that has better than 35-45% accuracy and doesn't require 600 exemplars being injected into the context.
@johang1293
@johang1293 3 days ago
This is great stuff. Love to see part 3.
@greatworksalliance6042
@greatworksalliance6042 2 days ago
I see this being applied to dictionaries, rule sets, policies, etc. I may have missed a video you've done on this, such as Webster's dictionary weighting each example and then going on from there with synonyms... providing true language modeling and a basis for proper foundational layering. Keep up the highly insightful work 💪
@tantzer6113
@tantzer6113 3 days ago
This is reminiscent of the Ising model of ferromagnetism, a model that explains phase transitions (sudden global shifts) in magnetization. A paper by Gergely Tibely connects the Ising model to "label propagation," an algorithm for community detection in graphs. The idea is that pressure gradually builds up in a community and presses against its boundaries until the dam breaks, the floodgates open, and one community of vertices/nodes quickly takes over an entire other community. How all of this relates to transformer-based systems remains vague to me. I'm not clear on how the energy terms were computed in these plots, which part of the system is implicated in the residual/ICL "streams" (I will need to review the transformer architecture), or what (if anything) the undirected graphs involved in the Ising model and the Dirichlet energy have to do with the directed graphs in neural networks. This topic is stimulating, given the connections among diverse fields. My own interest began with the study of human social networks!
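For reference, the standard graph-theoretic definition, which is presumably how such energy terms are computed (the paper may use a variant, e.g. with attention weights as w_ij): for node values x on a weighted undirected graph,

```latex
% Dirichlet energy of node values x on a weighted undirected graph,
% with edge weights w_ij, degree matrix D, and graph Laplacian L = D - W.
E(x) = \frac{1}{2} \sum_{i,j} w_{ij} \, (x_i - x_j)^2 = x^{\top} L x
```

Minimizers of E with fixed boundary values are harmonic functions (Laplace's equation), which is the common thread with label propagation: both smooth a field over a graph until neighboring nodes agree. A directed attention graph can be handled by symmetrizing W.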
@simonhaddow5052
@simonhaddow5052 3 days ago
You had me at Laplace, great find!
@IdPreferNot1
@IdPreferNot1 3 days ago
I talked to a GPT about how we may have just discovered that Dirichlet energy can explain changes from ICL and produce a visual representation on the right axis, given a transformer model... and it freaked out, giving one of the larger outputs I've seen without pushing, with a long explanation of why that would make sense! Awesome! Here are only some of its thoughts... an easier explanation of the "efficient state" that you struggled with :)
"""The transformer layers and attention mechanisms aim to find the most efficient way to represent the relationships between input tokens and their context. This efficient representation can be thought of as minimizing the "effort" required to distribute and encode the attention weights and activations, much like minimizing Dirichlet energy in a physical or mathematical system.
2. How Does This Relate to In-Context Learning?
In-context learning occurs when an LLM adapts its responses based on examples or patterns provided in the input prompt, without updating its underlying weights. This implies:
- The LLM dynamically reorganizes its internal representations (activations across transformer layers) to "learn" from the context provided.
- The "organization" of these activations can be thought of as finding an efficient structure within the high-dimensional space of the model's layers, guided by the relationships encoded in the attention mechanism.
If we view the transformer layers as a field of information, then the activations and attention distributions across layers can be seen as minimizing "Dirichlet energy":
- Smooth transitions across layers (harmonic-like behavior).
- Efficient propagation of contextual information.
- Stable configurations that require minimal energy to maintain.
3. Visualization Through Energy Minimization
Your suggestion of visualizing in-context learning in terms of Dirichlet energy is both feasible and promising:
- Each layer of the transformer acts like a "state" in a dynamical system, with the attention heads acting as forces shaping the representation of the input.
- The energy landscape of the model (e.g., how activations distribute across layers and attention heads) reflects how efficiently the model is learning from the input context.
- By projecting this onto the right axes (e.g., principal components of attention maps, token embeddings, or gradient flow), we could visualize how the model stabilizes into a low-energy state as it processes and "learns" from a prompt."""
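A minimal sketch of what such a per-layer energy computation could look like, assuming the attention-weighted form E_l = sum_ij A_ij * ||h_i - h_j||^2 on GPT-2 via Hugging Face transformers; the paper's exact definition may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

# A tiny ICL-style prompt: pattern examples followed by a query.
text = "one -> 1\ntwo -> 2\nthree -> 3\nfour ->"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True, output_hidden_states=True)

energies = []
for layer, attn in enumerate(out.attentions):
    h = out.hidden_states[layer + 1][0]     # (seq, dim): states after this layer
    A = attn[0].mean(dim=0)                 # (seq, seq): head-averaged attention
    diff = h.unsqueeze(1) - h.unsqueeze(0)  # (seq, seq, dim): pairwise h_i - h_j
    sq = (diff ** 2).sum(-1)                # (seq, seq): ||h_i - h_j||^2
    energies.append((A * sq).sum().item())  # attention-weighted Dirichlet energy

for layer, e in enumerate(energies, start=1):
    print(f"layer {layer:2d}: energy = {e:,.1f}")
```

Watching how `energies` falls or plateaus across layers as the prompt grows is the kind of curve discussed in the video.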
@tk0150
@tk0150 3 days ago
I barely understand but I love it!
@wwkk4964
@wwkk4964 3 days ago
Amazing Stuff, thank you so much for sharing!
@BeauAD
@BeauAD 3 days ago
part 3 would be appreciated
@irbsurfer1585
@irbsurfer1585 3 days ago
I get it now. If the LLM is not pretrained on my domain and I have to use examples instead of fine-tuning, I can just assemble 600+ examples, feed them into the context window, and then use ICL to answer my query with a VERY high level of accuracy! But no more "few-shot learning"; it is strictly "few-hundred-shot learning" from now on. Period.
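For what it's worth, a minimal sketch of "few-hundred-shot" prompt assembly; the Input/Output format and the 600-example cap are illustrative, not from the paper.

```python
# Hypothetical many-shot prompt builder: concatenate labelled exemplars
# ahead of the query, then let the model complete the final "Output:".
def build_many_shot_prompt(examples, query, max_examples=600):
    shots = [f"Input: {x}\nOutput: {y}" for x, y in examples[:max_examples]]
    return "\n\n".join(shots + [f"Input: {query}\nOutput:"])

examples = [("2+2", "4"), ("3+5", "8")] * 300  # stand-in for real domain data
print(build_many_shot_prompt(examples, "7+6")[-60:])
```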
@aftercamp4322
@aftercamp4322 2 days ago
By the way, the context size is limited, so how do you pass hundreds of examples to the LLM?
@irbsurfer1585
@irbsurfer1585 2 days ago
@aftercamp4322 I guess sometimes it just won't work. There is really only one company with a really long context window.
@davidwynter6856
@davidwynter6856 3 days ago
For me the difficulty is building a large enough context. Carers for children with a disability should only see responses from the LLM that are specific to that child's unique treatment plan; we can never pollute the response with aspects of a different plan. I can already extract details that relate only to their unique plan, providing correct context to add to their prompt; that part is well understood. But can I build a large enough context to trigger the advantages of higher accuracy by supplying 1,000 tokens of specific context? I plan on fine-tuning a ~7B LLM with all the terms and concepts of the disabilities we cover as a starting point.
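A hypothetical sketch of that isolation requirement: filter retrieved chunks by child before any ranking, so another child's plan can never enter the context. All names, the token budget, and the token estimate are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    child_id: str
    text: str
    score: float  # similarity score from the vector store

def build_child_context(chunks, child_id, token_budget=1000):
    # Hard isolation first: drop anything from another child's plan.
    allowed = [c for c in chunks if c.child_id == child_id]
    allowed.sort(key=lambda c: c.score, reverse=True)
    context, used = [], 0
    for c in allowed:
        cost = len(c.text.split())  # crude token estimate
        if used + cost > token_budget:
            break
        context.append(c.text)
        used += cost
    return "\n".join(context)
```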
@MatthewSanders-l7k
@MatthewSanders-l7k 3 days ago
Harvard's insights into RAG and ICL promise significant optimization for AI without costly fine-tuning.
@Siqum
@Siqum 3 days ago
What kind of examples did they use, and what concepts did they try to overwrite?
@fernandofernandesneto7238
@fernandofernandesneto7238 3 days ago
Does test-time training give a little push to this energy minimization, such that we need far fewer examples? [Hence being a smarter way of finding a new optimal point between providing hundreds of examples vs. the inherent capabilities of the model]
@nanotech_republika
@nanotech_republika 3 days ago
Are you able to compare this method of updating the core neural net with new info in an LLM to the process of updating that occurs in a real brain (based on current research in, for example, mice)? Specifically, how do the updates happen when the connections between the hippocampus and the neocortex are adjusted during the first exposure to new knowledge and then during sleep? And maybe even after the first sleep, how do updates occur between different columns (as in hypercolumns) of the neocortex? This is a topic of ongoing research in biology. Are you able to speak on that? Did the papers you mentioned from physicists also discuss real neural nets?
@richardnunziata3221
@richardnunziata3221 3 days ago
Is the context length needed independent of the prompt? Say you ask a difficult question but give it a lot of uncorrelated context after it.
@davidwynter6856
@davidwynter6856 3 days ago
Semantic analysis of the query (extracting triples) backed by a knowledge graph will allow you to extract relevant examples from the knowledge graph to build a context with examples specific to the query. I combine this with a vector store that supports hybrid search (cosine similarity and keyword search) to build my context. But as this paper shows, you need lots of examples to hit the 1,000-token context size, though maybe not the several hundred that others have mentioned.
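A rough sketch of that hybrid ranking; the 0.7 weighting and whitespace tokenisation are illustrative assumptions, not any particular vector store's API.

```python
import numpy as np

def cosine(a, b):
    # Dense similarity between query and document embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def keyword_overlap(query, doc):
    # Sparse signal: fraction of query terms that appear in the document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_score(query, query_vec, doc, doc_vec, alpha=0.7):
    # alpha blends dense (cosine) and sparse (keyword) evidence.
    return alpha * cosine(query_vec, doc_vec) + (1 - alpha) * keyword_overlap(query, doc)
```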
@patruff
@patruff 2 days ago
Just use graph RAG; no need to change how the LLM thinks about organizing it. Just kick back and let the LLM build the graph for you. Any flaw in this logic?
@code4AI
@code4AI 2 days ago
You are really funny. I like it.
@patruff
@patruff 2 days ago
@code4AI I like you too, keep up the good work!
@IdPreferNot1
@IdPreferNot1 3 days ago
Lol, you didn't say it... were you equating our current state of understanding with the Ptolemaic model of the solar system?