Great video, I enjoyed this one. I like to think of human cognitive architecture as an is/if loop. Essentially, it's answering the question - How many times do I repeat an observation before it becomes true? There's a great example in Kahneman's Thinking, Fast and Slow of a trainee and a veteran firefighter and their differing approaches to assessing danger. The veteran (system 1) has a solid "what is" model of reality based on years of experience (observations). When the veteran sees a specific danger they always know how to respond. The trainee (system 2) assesses the situation differently. They move methodically, step by step, through all the possibilities to find the right course of action. They're constantly asking - what if it's this, or that? Now consider what happens to the veteran if they make a mistake. Presumably, they will be sent back to training to refresh the relevant knowledge whose absence caused the mistake. The veteran has become a trainee again. The next time they encounter the same situation they will cautiously ask - what if? The constant back and forth between system 1 and 2 seems to act as a kind of error correction for observations. Any thoughts? Do these ideas fit within your model? Can you recommend any further reading I might find interesting? Thanks. P.S. If you're interested, I have a few thoughts about three distinct processes system 2 uses to interrogate system 1's model of reality. Happy to discuss.
@MikeGashler 1 year ago
I definitely agree that human cognition separates system 1 and system 2 thinking. But I'm not yet settled on the best way to implement similar functionality in machines. I would probably characterize most LLMs as entirely system 1 thinkers because they use pretrained weights and don't really iterate or do any kind of experimental (what if) thinking. So I suppose the very simple cognitive architecture I describe in the middle of my video (around 13:18) is also an entirely system 1 thinker. After that point, I spent the rest of the video describing how the brain is believed to modularize cognitive abilities, and I proposed that a similar modularization of that LLM-based cognitive architecture might enable it to do real-time learning. But I did not go as far as suggesting any mechanism for separating system 1 from system 2 learning. I suppose that is probably the next evolutionary step for this architecture. But it is not obvious to me how to do that. So if you have ideas about how that should be implemented, I'd love to contemplate how we might integrate them with my architecture, or alternatively how we might modify my architecture to accommodate them.
@htidtricky1295 1 year ago
@MikeGashler Imagine teaching a hypothetical AGI how to shoot a basketball. You've given it access to Wikipedia and the first relevant page it finds is classical mechanics. This concept goes in system 1; these are the known knowns it will use to solve the task. The AGI measures the distance to the hoop, the mass of the ball, and so on, and calculates the force needed to shoot the hoop... It misses the first shot. System 2 needs to correct this error. The first step is to explore known unknowns. I like to think of these as rounding errors. System 2 expands a known concept in system 1 and explores it in greater detail. If the AGI calculated the force needed to throw the ball to within two decimal places, why not ten, or one hundred? The predictive model failed and needs finer-grained detail. The AGI shoots....misses the second shot. Step two - unknown knowns. These are concepts it's aware of but hasn't recognised as relevant for the task. At the moment system 1 contains classical mechanics but hasn't considered air resistance. The previous attempts assumed a vacuum. System 2 connects two or more known concepts in a novel manner to solve the task. The third shot....hits the rim. Step three - unknown unknowns. This one is going to be difficult to describe without inventing a new branch of science. These are concepts we know exist but don't yet understand. They fill a gap in our knowledge and often have the prefix "dark" attached to them, e.g. dark matter, dark energy, etc. A new concept that fills a hole in our observations. For argument's sake, let's imagine the Wikipedia page for dark matter didn't exist but the extra pull from gravity was responsible for our AGI missing the shot. The AGI writes a new Wikipedia page for dark matter and moves it to system 1. The final shot....score! I hope that kind of made sense.
These are all just the broad brush strokes that describe the functions of system 2 - expanding a concept in more detail, connecting two or more concepts in a novel manner, creating a new concept to fit observations. The back and forth between system 1 and 2 is a balance that minimises maximum regret. A final point about system 1: it's not just a fixed "what is" predictive model of known knowns. System 1 uses the same functions as system 2 in reverse. It's trying to simplify the model until it fails, compressing information and using simpler heuristics to lower the processing burden. Why do I need three concepts if I can solve the task with one? The veteran is trying to become a trainee and the trainee is trying to become a veteran. For what it's worth, I have no formal background in any of these related fields; I'm just a well-read dilettante with a casual interest in AI safety, but I hope you can find something interesting here and I welcome any feedback. You can reason most of this from epistemology and asking - How many times do I repeat an observation before it becomes true? Thanks for reading. tricky
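To make step two of the basketball example concrete, here is a minimal sketch of a shot that is solved in a vacuum and then replayed with air resistance. Everything numeric in it is an assumption chosen for illustration (a 4.6 m horizontal distance, a hoop 1.0 m above the release point, a 52 degree launch angle, and a rough quadratic drag constant of 0.02 per metre), and the function names are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def vacuum_launch_speed(distance, height, angle_deg):
    """Launch speed that hits a target `distance` away and `height` up, ignoring air."""
    theta = math.radians(angle_deg)
    denom = 2.0 * math.cos(theta) ** 2 * (distance * math.tan(theta) - height)
    if denom <= 0.0:
        raise ValueError("target is not reachable at this launch angle")
    return math.sqrt(G * distance ** 2 / denom)


def height_at_distance(speed, angle_deg, distance, drag=0.0, dt=1e-4):
    """Step the ball forward until it has travelled `distance` horizontally.

    `drag` is an assumed constant k in the acceleration -k * |v| * v
    (quadratic drag per unit mass); drag=0.0 reproduces the vacuum model.
    Returns the ball's height when it reaches the target distance.
    """
    theta = math.radians(angle_deg)
    vx, vy = speed * math.cos(theta), speed * math.sin(theta)
    x = y = 0.0
    while x < distance and y > -100.0:  # second test guards against never arriving
        v = math.hypot(vx, vy)
        vx += -drag * v * vx * dt
        vy += (-G - drag * v * vy) * dt
        x += vx * dt
        y += vy * dt
    return y


# "Known knowns": solve the shot with classical mechanics in a vacuum.
speed = vacuum_launch_speed(distance=4.6, height=1.0, angle_deg=52.0)
y_vacuum = height_at_distance(speed, 52.0, 4.6)

# "Unknown knowns": the same shot once air resistance is connected to the model.
y_drag = height_at_distance(speed, 52.0, 4.6, drag=0.02)

print(f"launch speed: {speed:.2f} m/s")
print(f"height at hoop, vacuum model: {y_vacuum:.2f} m")
print(f"height at hoop, with drag:    {y_drag:.2f} m")
```

Under these assumptions the vacuum solution arrives at hoop height, while the same launch speed with drag arrives lower, which is the kind of miss that extra decimal places alone would not fix.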
@lemurpotatoes7988 1 year ago
How would you distinguish your proposal from Hinton's capsule networks or China's Wu Dao (Mixture of Experts)?
@MikeGashler 1 year ago
Good question. The similarity is that all three of these models are ensembles of simpler components. They all derive their high-level capabilities by delegating to smaller units that specialize in particular areas of computation. The difference is that capsule networks make predictions in a single forward pass, Wu Dao is for predicting sequences, and cognitive architectures are intended to run indefinitely. Essentially, I'm proposing that a cognitive architecture could be as simple as an LLM in a loop. Of the three models, capsule networks are the most general. They could be used for almost any purpose. LLMs (including Wu Dao) are only suitable for sequence prediction. They are usually pretrained on very large corpora of data. Consequently, they are not very amenable to learning from new experiences. The focus of my model is on making a cognitive architecture that possesses the general intelligence of LLMs, but is also capable of learning from its daily experiences. Another big difference is that capsule networks and Wu Dao are both well-tested. By contrast, my proposal is still just an untested idea.
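As a rough illustration of what "an LLM in a loop" could mean structurally, here is a minimal sketch. It is not the architecture from the video: `toy_model` is a hypothetical stand-in for a real language model, and the bounded `context` buffer is an assumed way of feeding each output back in alongside new observations.

```python
from collections import deque


def toy_model(context):
    """Hypothetical stand-in for an LLM: reports on the context it was given."""
    latest = context[-1]
    return f"seen {len(context)} items; latest={latest}"


def run_loop(model, observations, context_size=4):
    """Call the model once per observation, feeding its outputs back as context."""
    context = deque(maxlen=context_size)  # oldest entries fall off automatically
    outputs = []
    for obs in observations:
        context.append(obs)
        out = model(list(context))
        context.append(out)  # the model's own output becomes part of its next input
        outputs.append(out)
    return outputs


outputs = run_loop(toy_model, ["a", "b", "c"])
for line in outputs:
    print(line)
```

The contrast with a single forward pass is the feedback edge: each call sees earlier outputs. Note that nothing in this sketch updates any weights, so it does not address learning from experience.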
@lemurpotatoes7988 1 year ago
@MikeGashler If the idea is to get in-context learning talking to a network's weights somehow, you might like reading about Imitation Learning from Language Feedback.
@lemurpotatoes7988 1 year ago
@MikeGashler I don't understand what induces the base level models to specialize in your setup. Capsule networks and MoEs both have to solve a routing problem. Maybe attention can fill a similar role in a less rigid way? I usually think about the thalamus as performing routing in the MoE sense, but probably that's too modular.
@MikeGashler 1 year ago
@lemurpotatoes7988 My video didn't really address how my proposed model would be trained. That's partly because I'm still thinking about it. But the training details are certainly essential for making it work, so I'll attempt to answer as best I can. To address your question about specialization, let's first consider the simpler case of a regular feed-forward convolutional neural network. If we train such a model by gradient descent, while applying L1 regularization to the weights, then units in hidden layers are known to specialize for recognizing particular features. I believe the reason this occurs is that specialization is the only solution that reduces pressure from both signals (the one that minimizes global error and the one that promotes local sparsity). So I imagine the key to promoting specialization in my model will be to create similar conditions. Specifically, I think (1) the error signal applied to train my "cortex" models needs to be proportional to the product of the forward activation pressure coming from my "hippocampus" model and the error propagated back from my "thalamus" model. And I think (2) there also needs to be some kind of pressure that promotes sparsity. If these conditions are met during training, then I would expect specialization to emerge. I think an important difference between my proposed architecture and a CNN is that all units in the hidden layers of the CNN are activated with each forward pass. By contrast, it seems wrong to me to activate all of the "cortex" models in my architecture for a single iteration. So my current thinking is to activate them probabilistically. Specifically, I'm thinking the "hippocampus" model should output two vectors: (1) some kind of softmax that will direct how I probabilistically choose which "cortex" model to activate, and (2) the positional encoding, or reference frame values, that will be fed into that model.
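The probabilistic activation described above might look something like the following sketch. It covers only the forward selection step, not training. The gate logits, the reference-frame vector, and the three toy "cortex" functions are all hypothetical placeholders, since the reply leaves the actual models unspecified.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def choose_module(gate_logits, rng):
    """Sample one module index with probability given by softmax(gate_logits)."""
    probs = softmax(gate_logits)
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


def step(gate_logits, frame, modules, rng):
    """Activate a single sampled module on the reference-frame values."""
    index, _ = choose_module(gate_logits, rng)
    return index, modules[index](frame)


# Hypothetical stand-ins for three "cortex" models.
modules = [
    lambda f: [2.0 * x for x in f],
    lambda f: [x + 1.0 for x in f],
    lambda f: [-x for x in f],
]

rng = random.Random(0)          # fixed seed so the run is repeatable
gate_logits = [2.0, 0.5, -1.0]  # assumed output (1) of the "hippocampus" model
frame = [0.1, 0.2]              # assumed output (2): reference-frame values

counts = [0, 0, 0]
for _ in range(3000):
    index, _ = step(gate_logits, frame, modules, rng)
    counts[index] += 1
print(counts)
```

Only one module runs per iteration, so most modules are idle on any given step. Over many iterations the selection counts follow the softmax probabilities.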
I admit I am slightly uncertain about how well backpropagation really works when hidden units are activated probabilistically rather than continuously, but I am optimistic that it may work because I can think of a few models that work somewhat similarly. For example, neural networks seem to be quite robust when they are trained with dropout. Additionally, I suspect that activating only a subset of the cortex models may have the implicit side-effect of promoting sparsity. So it is my current opinion that explicit routing is not really necessary. I think specialization will emerge as long as I create training conditions that are known to promote it. But again, I really should test this before I start making too many explicit claims. So far, this is all just hope and conjecture.
@lemurpotatoes7988 1 year ago
@MikeGashler I like the idea that sparsification and routing are doing similar things! I haven't heard people attribute layerwise specialization in CNNs to L1 (or L2) regularization before. What you're saying reminds me a little bit of claims made by Information Bottleneck theory. It also reminds me of Neural Darwinism's claim that biological neurons face selection pressures to be useful. The claim I've seen is that CNNs have a "texture bias" and neglect shape information because they don't understand part-whole relationships, supposedly because something about pooling is bad. Other people say that the texture bias is dataset-dependent, and I don't know who to believe. It's interesting to think about the granularity or resolution of sparsification schemes. Weight penalization is the most detailed or highest-resolution approach. Neuron-level dropout would be another layer up. And gating mechanisms, probabilistic or not, would be very high level. Thinking about sparsification resolution is kind of similar to questions about what optimal masking schemes should look like for sequence completion training. For music generation, for example, you can mask one out of every X notes, but you can also mask one measure or line out of every X measures or lines. For poem generation, you can mask Y% of words, or you can mask strategic syllables in every line. My personal, completely unsupported opinion is that bad masking is why AI-generated music doesn't sound good yet. I think we should be doing some kind of smart, bottom-up masking scheme somehow. It reminds me of how naive cross validation goes wrong for spatial statistics and instead coarse, blocked cross validation needs to be used. I think naive masking, naive cross validation, and naive high-resolution sparsification are all making the same mistake in some way that I don't know enough math to describe.
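For anyone who wants to see the two masking granularities side by side, here is a small sketch contrasting per-note masking with whole-measure masking. The note values, the measure length of 4, and the function names are assumptions for illustration only; it makes no claim about which scheme trains better.

```python
import random


def random_mask(length, every, rng):
    """Mask each position independently, about one position in `every`."""
    return [rng.random() < 1.0 / every for _ in range(length)]


def block_mask(length, block, every, rng):
    """Mask whole contiguous blocks, about one block in `every`."""
    mask = []
    for start in range(0, length, block):
        hide = rng.random() < 1.0 / every
        mask.extend([hide] * min(block, length - start))
    return mask


def apply_mask(sequence, mask, token="_"):
    """Replace masked items with a placeholder token."""
    return [token if hidden else item for item, hidden in zip(sequence, mask)]


rng = random.Random(0)
notes = list(range(60, 76))  # 16 placeholder pitches, read as 4 measures of 4 notes

note_level = apply_mask(notes, random_mask(len(notes), every=4, rng=rng))
measure_level = apply_mask(notes, block_mask(len(notes), block=4, every=4, rng=rng))

print(note_level)
print(measure_level)
```

With per-note masking, a hidden note usually still has visible neighbours on both sides, so it can often be filled in by local interpolation. With per-measure masking, the whole local neighbourhood is hidden at once, which is the same motivation behind blocked cross validation for spatial data.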