Hella New AI Papers - Aug 9, 2024

2,317 views

Tunadorable

A day ago

Comments: 36
@GNARGNARHEAD 2 months ago
first paper out of the gate sounds like a winner 🤯
@jakeaustria5445 2 months ago
Yo, I got addicted to your channel. I kinda binge-watched your latest vids. I just grab what I can and then guess at the concepts of those I do not fully understand.
@kevon217 2 months ago
Damn, just downloaded like half that list. Love the curation you do.
@Tunadorable 2 months ago
hahaha glad to be of service. careful about downloading more than you can read per week
@sniperhawk6969 2 months ago
The Apple Intelligence paper isn't too interesting overall, but have a look at section 5.1: something about adapting to the task at hand on the fly using LoRA. I don't know of other literature related to this, but it sounds pretty interesting to me.
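For context on what that section describes, here's a minimal sketch of the general technique of swapping LoRA adapters per task at inference time. It illustrates the idea only, not Apple's actual implementation; the class, the `rank` value, the shapes, and the `load_adapter` helper are all invented for the example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a swappable low-rank (LoRA) adapter."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        # low-rank factors: delta_W = B @ A, initialized so delta_W = 0
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base output plus the low-rank update x A^T B^T
        return self.base(x) + x @ self.A.T @ self.B.T

    def load_adapter(self, adapter: dict) -> None:
        # swap in a per-task adapter on the fly, no base-model reload needed
        self.A.data.copy_(adapter["A"])
        self.B.data.copy_(adapter["B"])

# hypothetical usage: one frozen base model, different adapters per task
layer = LoRALinear(nn.Linear(512, 512))
summarize_adapter = {"A": torch.randn(8, 512) * 0.01, "B": torch.randn(512, 8) * 0.01}
layer.load_adapter(summarize_adapter)
out = layer(torch.randn(1, 512))
```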
@Jayc5001 2 months ago
Very nice review. That first paper got my attention!
@TendoNin64 2 months ago
Really like the first paper shown. It's interesting how introducing self-modeling has the consequence of also simplifying the network. I mean, it makes sense that the model would want to be simpler in order to optimally compute itself. I do wonder what effect the self-modeling has besides that, though: is the primary effect the simplification of the network during training, or does the auxiliary task of predicting internal states assist with the primary task in a meaningful way? Judging from the paper, it seems accuracy on the task actually drops slightly (although MNIST is such a simple classification example that I'm not sure that says anything about performance anyway). Really interested to hear more about this strategy in larger models.
@u2b83 2 months ago
One consequence of self-modeling is loss of high-resolution features (in the generative weather model domain). I found that having two networks, one for self-modeling and another for "filling in the details" (with awareness of durable self-modeling features), scored better than either approach individually. Both the network and the sub-network (a specific range of feature channels) were optimized simultaneously.
@TendoNin64 2 months ago
@@u2b83 Thank you for this comment. I felt like the reduction in complexity would result in a less detailed answer from the model, but glad someone else has tested it. The benefits of a simpler model architecture are there, of course, but I wonder if it's possible for self-modeling to occur without the reduction in complexity, and what effect self-modeling would have on the final answer in that instance. Although it seems this is basically achieved with two networks instead of one, as you describe.
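For readers following this thread, here is a minimal sketch of a self-modeling auxiliary task along the lines being discussed: a classifier whose extra head regresses the network's own hidden activations. It's a guess at the general setup, not the paper's exact architecture; the layer sizes, the auxiliary weight, and the choice to self-predict a single hidden layer are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfModelingNet(nn.Module):
    """MNIST-style classifier with an auxiliary head that predicts
    the network's own hidden activations."""
    def __init__(self, in_dim: int = 784, hidden: int = 256, n_classes: int = 10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, n_classes)
        self.self_model = nn.Linear(hidden, hidden)  # predicts h from h

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), self.self_model(h), h

def loss_fn(logits, h_pred, h, y, aux_weight: float = 0.1):
    # Primary task loss plus the self-prediction loss. Because the target h
    # is not detached, the network can also reduce the loss by making its own
    # activations easier to predict -- one story for the simplification effect.
    return F.cross_entropy(logits, y) + aux_weight * F.mse_loss(h_pred, h)

# hypothetical training step
net = SelfModelingNet()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
logits, h_pred, h = net(x)
loss = loss_fn(logits, h_pred, h, y)
loss.backward()
```

Note the design choice flagged in the loss: leaving the target activations attached to the graph means the network can lower the auxiliary loss by making its activations more predictable, which is one plausible mechanism for the simplification effect discussed above.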
@andrewsilber 2 months ago
Seems like we could use synthetic data for the blind vision model problem. They could use Unreal or Unity, armed with a huge pile of game-dev-artist-created models and shaders, to set up millions of permutations of complex scenes from different angles, plus labels we could piece together as we're assembling the scene, and train on that. I have to assume Musk & Co are doing that sort of thing for their robot training.
@tensiondriven 2 months ago
Skimming abstracts, I love it! Have some engagement.
@drdca8263 2 months ago
11:10: Oh, cool, this sounds similar to something I was daydreaming about (except I was imagining clusters of a handful of tokens not necessarily matching sentence boundaries, and I was imagining doing this recursively). Like, I imagine this is: have an autoencoder that goes from a not-too-long sequence of tokens to a single higher-level token, and then the decoder part predicts the individual tokens given the previous higher-level tokens, the current higher-level token, and the base-level tokens already produced corresponding to the current higher-level token? I suppose their tokens encoding entire sentences can't be using a fixed discrete set of tokens for the higher-level tokens, so I guess they just have those be continuous?

(Aside: hm, if you used a standard decoder-only LLM, but instead of selecting a token with the probabilities it assigns, just took the average of the embedding vectors for each of those tokens, and let that iterate a dozen times, and then switched to picking specific tokens again, I wonder what kind of garbage output that would produce? That thought probably seems pretty unrelated. It came to mind because I was thinking about how, when the "tokens" produced as outputs are continuous, you don't get a probability distribution, so the only way to mix between options is to mix the actual options, rather than take a probability mix of options.)

Another idea I had in relation to this was that maybe the encoding for a cluster of tokens could have two parts: one which is only used when decoding to try to get the particular tokens back, and one which is used for that but also used when predicting the next higher-level token. The idea being that this might encourage it to separate the parts that matter significantly later in the text from irrelevant accidents of phrasing. Perhaps somewhat of a semantics-vs-phrasing distinction... but probably not quite, because the phrasing at one part probably helps predict the phrasing at a later point, due to stuff like different writing styles, etc., so probably not a clean split.
@u2b83 2 months ago
I suspect that the lower computational effort of sentence-token granularity is why LLMs don't use it and instead use character-level granularity. Wolfram's notion of computational irreducibility applies to LLMs in that you can't take a shortcut by skipping network iterations. Example 1: an ODE solver running at coarse grain sucks. Example 2: stable diffusion in fewer steps goes out of distribution and output suffers. I suspect the same thing with LLMs when each character token is an iteration of an ODE-like system: the more fine-grained iterations you do, the better the results, as they converge more precisely.

The idea that LLMs operate in a way that's similar to solving an ODE (ordinary differential equation) system, where each token-generation step refines the output, is a compelling analogy. The finer the granularity (e.g., character-level vs. sentence-level), the more steps or iterations are needed to converge to a precise result.

Computational irreducibility in LLMs:

Granularity and iteration: By operating at a finer granularity, such as character level, LLMs may require more iterations to generate coherent text. However, this also allows them to capture subtler nuances and dependencies that might be missed at a coarser level, like sentence-level tokens. This could be why LLMs tend to avoid sentence-level tokens despite the potential computational savings: they need the finer control to ensure the quality and coherence of the output.

Irreducibility: Wolfram's notion of computational irreducibility suggests that certain processes can't be shortcut; each step in the computation is necessary to achieve the final result. This aligns with the idea that skipping iterations (or using fewer iterations, like in stable diffusion) can lead to outputs that fall out of distribution, thereby degrading quality.

Applications to LLMs:

Iterative refinement: The process of refining the output at each step in an LLM is akin to an iterative solver working towards a solution. More iterations allow the model to adjust its output more precisely, improving coherence, accuracy, and alignment with context. This is similar to how an ODE solver benefits from finer time steps to accurately track the behavior of a system.

Balancing granularity and efficiency: The challenge for LLMs is balancing the need for fine-grained iteration with computational efficiency. While character-level granularity allows for high precision, it also increases computational cost. Finding the right balance, or developing new techniques that can capture the benefits of fine granularity with fewer iterations, could be key to advancing LLM performance.
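A tiny numerical demo of the "ODE solver running at coarse grain sucks" point above: forward Euler on dy/dt = -2y, whose exact value at t = 1 is e^(-2) ≈ 0.1353. The equation and step counts here are arbitrary choices for the demo, not anything from the papers.

```python
import math

def euler(f, y0: float, t_end: float, n_steps: int) -> float:
    """Forward Euler integration; finer steps track the true solution better."""
    y, t = y0, 0.0
    dt = t_end / n_steps
    for _ in range(n_steps):
        y += dt * f(t, y)
        t += dt
    return y

f = lambda t, y: -2.0 * y  # dy/dt = -2y, exact solution: y(1) = exp(-2)
exact = math.exp(-2)
for n in (2, 8, 64, 1024):
    approx = euler(f, 1.0, 1.0, n)
    print(f"{n:5d} steps -> {approx:.4f} (error {abs(approx - exact):.4f})")
```

Halving the step size roughly halves the error here; the comment's analogy is that each generated token plays the role of a solver step, so coarser steps risk drifting off the trajectory the fine-grained process would follow.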
@drdca8263 2 months ago
@@u2b83 Well, the paper (I ended up skimming through it a few times) *does* also predict individual tokens. It is just that it uses a smaller network to do so. (Also! My impression is that they actually take any one of various existing LLM models and graft their sentence-token stuff onto it, so that they feed their sentence-token vector into part of a network which was only trained to handle vectors for words, and it *works*?! If I'm understanding what they are saying correctly, and if what they are saying is true, that seems pretty wild to me!)

Their autoencoder consists of an encoder transformer (and then adding up each of the vectors produced and applying layer norm), and as the decoder they use both causal self-attention and cross-attention to the single vector representing the entire sentence. (It also, in the version I saw on arXiv at least, still has a lot of typos in it: they repeatedly have "sentience" when they intend "sentence". I guess that's it being a pre-print and not the final version.)

Anyway, my point is: they *are* predicting individual tokens autoregressively, except that they are predicting them only after predicting a vector which is supposed to encode the entire sentence. OK, so why wouldn't computational irreducibility be a problem for this (after all, they claim that this method gets them *improvements* in perplexity, not just not-much-worse along with faster results)? They make the argument that the semantic content of the sentence is decided first, and that it is only afterwards encoded into a sentence. If this is true, then they would just be doing something analogous to the process that originally produced the text, and so there's no "trying to fast-forward" issue.

Though, speaking of, I'm personally not sure that the "principle of computational irreducibility" is all that well-defined? Or like, what precisely is it saying that wasn't well-known? For a given computable sequence, and a given model of computation, there is some limit on how fast the sequence can be computed, but, depending on things like where one puts the quantifiers, that is either obvious, or is a theorem proved a sorta long-ish time ago (relatively speaking) (and also what one would expect to be true). Some computational tasks, computed one iterative way, can have a lot of what is in between be skipped. Instead of computing the n-th term of the Fibonacci sequence by simply iterating through all the terms before the n-th one, one can do it by repeated squaring. Why does this not go against the principle of computational irreducibility? Well, because the task "isn't irreducible", I suppose. So, what, we call a computation "irreducible" if there's no computation that does the same task significantly faster? Idk.
@u2b83 2 months ago
@@drdca8263 The sentence-granularity approach might actually be computationally reducible, as the authors of the paper argue: by predicting the "semantic content" of the sentence first, and then encoding that into individual tokens, they might be following a process that's closer to how the text was originally produced. If that's the case, then this method wouldn't be violating computational irreducibility; it might actually be leveraging a more efficient pathway that's still faithful to the underlying complexity of language generation. The key idea is that for some tasks there's no shortcut, no way to compute the outcome faster than by going through every necessary step. However, as you pointed out with the Fibonacci sequence, not all tasks are irreducible. Some computations can be optimized, like using matrix exponentiation to compute Fibonacci numbers faster than simple iteration.
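Going off the architecture description in the comments above (sum-pooled encoder outputs with a layer norm; a decoder using causal self-attention plus cross-attention to the single sentence vector), here is a rough sketch of what such a sentence autoencoder might look like. All dimensions, layer counts, and names are invented; this is a reading of the comments, not the paper's code.

```python
import torch
import torch.nn as nn

class SentenceAutoencoder(nn.Module):
    """Encode a token sequence into one continuous 'sentence token',
    then decode the individual tokens back out autoregressively."""
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.norm = nn.LayerNorm(d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def encode(self, tokens):
        # sum the encoder outputs into a single vector, then layer-norm,
        # per the description in the thread above
        h = self.encoder(self.embed(tokens))
        return self.norm(h.sum(dim=1, keepdim=True))  # (batch, 1, d_model)

    def decode(self, sent_vec, prev_tokens):
        # causal self-attention over tokens produced so far,
        # cross-attention to the single sentence vector
        tgt = self.embed(prev_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory=sent_vec, tgt_mask=mask)
        return self.lm_head(out)

tokens = torch.randint(0, 32000, (1, 12))     # one hypothetical 12-token sentence
ae = SentenceAutoencoder()
sent_vec = ae.encode(tokens)                  # continuous sentence-level "token"
logits = ae.decode(sent_vec, tokens[:, :-1])  # teacher-forced reconstruction
```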
@superfliping 2 months ago
Do you ever take this information and rework it into multi-dimensional frameworks when you come across new information? I watch your videos, find the original source, and interpret it into my system's AI frameworks in many different formats and sources of data. Was just wondering if anyone else does that? 😊
@Tunadorable 2 months ago
could you rephrase the question? can’t tell if you’re asking if i store this information in a notes app that has some ai integration or in some kind of inception style high dimensional portal
@superfliping 2 months ago
@@Tunadorable Asking if you personally try to mix information you read and make new ideas?
@Tunadorable 2 months ago
@@superfliping oooooh yes that's most of my time
@superfliping 2 months ago
@Tunadorable Do you ever see the formulas you create on your computer show up in online papers you read weeks after you have written them? It happens to me weekly. Some of the context has been modified, but the concepts are still mimicking my framework.
@Tunadorable 2 months ago
yes totally. you’ve gotta start coding and writing papers as fast as possible if you haven’t already, it’s about being first to the punch these days. also kinda hard to do when ppl in academia can team up but that’s why i’m working on a codebase that makes it really easy to do language modeling architecture experiments and iterate on them quickly
@NextGenart99 2 months ago
Subscribe
@TDVL 2 months ago
Child-language use: blind children?
@Tunadorable 2 months ago
they still get proprioception, taste, smell, tone in the words they hear, and a continual context/story for the things they hear rather than random fragmented toneless internet documents full of html and other garbage
@TDVL 2 months ago
@@Tunadorable Surely, with text-only input. But also, I believe the biggest difference is that all networks are trained pretty much one-shot with a huge amount of data, while living brains are typically trained with a fairly limited dataset but iteratively, interactively, and recursively, which by itself creates multiple layers of understanding at different levels of detail. They include real-world tests of the knowledge, including edge and fail cases, and there's an ongoing context of the discovery, testing, and rewriting process itself, which may well be used as a reference to understand all other brains much better than LLMs ever will.
@Tunadorable 2 months ago
yes fs, the RL component is huge
@catman8770 2 months ago
Great video, would be better without the awful haircut though
@Tunadorable 2 months ago
lmaoooo