Comments
@Shooter_Mc_Evan 1 day ago
Impressive. I had to use ChatGPT to dumb this way way down for me.
@wiktorm9858 1 day ago
What is this thing? I need a comment.
@josephvanname3377 1 day ago
It is probably the Okubo algebra, but I need to study this more.
@JustMeJustADude 2 days ago
What is an LSRDR?
@josephvanname3377 2 days ago
Suppose that K is the field of real numbers (for simplicity). Let (A_1,...,A_r) be n by n matrices over the field K, and let d be less than n. The goal is to find d by d matrices (X_1,...,X_r) that best approximate the matrices A_1,...,A_r. We do this by generalizing the dot product and cosine similarity to tuples of matrices. We define the L_2-spectral radius similarity between (A_1,...,A_r) and (X_1,...,X_r) to be rho(kron(A_1,X_1)+...+kron(A_r,X_r)) / [rho(kron(A_1,A_1)+...+kron(A_r,A_r))^(1/2) * rho(kron(X_1,X_1)+...+kron(X_r,X_r))^(1/2)], where kron denotes the tensor product (or Kronecker product) and rho denotes the spectral radius. The motivation for this expression is that rho(kron(A_1,X_1)+...+kron(A_r,X_r)) is a sort of dot product, while the entire similarity is a generalization of the cosine similarity. We say that (X_1,...,X_r) is an LSRDR (L_2-spectral radius dimensionality reduction) of (A_1,...,A_r) if (X_1,...,X_r) has locally maximal L_2-spectral radius similarity with (A_1,...,A_r) among all tuples of r many d by d matrices. I came up with the notion of an LSRDR to investigate cryptographic functions in order to make more secure cryptocurrency mining algorithms.
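Here is a minimal Julia sketch of this similarity (the matrix sizes and random test data are illustrative, and the spectral radius is computed naively from the eigenvalues):

```julia
using LinearAlgebra

# Spectral radius: largest modulus of an eigenvalue.
specrad(M) = maximum(abs.(eigvals(M)))

# L_2-spectral radius similarity between (A_1,...,A_r) and (X_1,...,X_r),
# following the formula in the comment above.
function l2_spectral_radius_similarity(As, Xs)
    num  = specrad(sum(kron(A, X) for (A, X) in zip(As, Xs)))
    denA = sqrt(specrad(sum(kron(A, A) for A in As)))
    denX = sqrt(specrad(sum(kron(X, X) for X in Xs)))
    return num / (denA * denX)
end

# Illustrative sizes: r = 3 matrices, n = 4, d = 2.
As = [randn(4, 4) for _ in 1:3]
Xs = [randn(2, 2) for _ in 1:3]
println(l2_spectral_radius_similarity(As, Xs))
```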
@BadChess56 8 days ago
Fascinating
@Bencurlis 8 days ago
I bet it is possible to have even higher similarity for the same loss if you replace the weight reset mechanism with a regularization term that forces the weights to stay close to the init weights. Perhaps by directly regularizing in terms of cosine similarity.
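A minimal Julia sketch of the kind of regularizer being suggested here (the function names, the coefficient λ, and the way the terms would enter the objective are illustrative assumptions, not anything from the video):

```julia
using LinearAlgebra

# Hypothetical penalty/reward terms keeping the weights w near the initial weights w0.
l2_to_init(w, w0; λ = 1e-3) = λ * sum(abs2, w .- w0)                      # squared L2 distance to init
cos_to_init(w, w0)          = dot(vec(w), vec(w0)) / (norm(w) * norm(w0)) # cosine similarity with init

# One would then maximize, e.g., fitness(w) - l2_to_init(w, w0)
# or fitness(w) + μ * cos_to_init(w, w0), instead of periodically resetting the weights.
```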
@josephvanname3377 8 days ago
That is a reasonable prediction. I will have to do the experiment to determine whether this is the case, especially since such an experiment is really easy.
@Bencurlis 13 days ago
Complete forgetting is an interesting new concept; it is fascinating that differently initialized networks would converge in behavior not only in what they remember, but also in what they do not. I hear your point about complete forgetting being good for removing pseudorandom information, but I think this is a bad solution to that problem: we should want our models to learn things and not forget them if there is no reason to change the mapping (no distributional shift). Catastrophic forgetting does not appear to be caused by limited capacity in practice. By the way, could it be that pseudorandom information is equivalent or very similar to feature contamination? (abs/2406.03345)
@josephvanname3377 11 days ago
Catastrophic forgetting is only a problem in some circumstances where one is retraining the machine learning model. If one keeps the training data after training the model, or if the training data is otherwise readily available (using generative AI, from the internet, or from direct computation or observations), then catastrophic forgetting seems like a non-issue. Sometimes we would like to retrain the network, and sometimes we would like the network to forget bad behavior or information that the network is not supposed to know. In this case, catastrophic forgetting is a good thing and complete forgetting is even better. Half of my videos have been about my AI algorithms that exhibit complete forgetting, but most of those are about a topological version of complete forgetting, while this visualization is about a uniform version of complete forgetting. If we are going to build more networks that do not forget much after retraining, then those networks will likely be much less interpretable than AI algorithms with complete forgetting. Complete forgetting is a highly mathematical property of only a few AI algorithms, and such mathematical behavior is helpful for AI interpretability and safety.

I do not believe that the pseudorandom information is equivalent to feature contamination. I have trained AI models with complete forgetting, and these AI models learn uncorrelated features for a simple reason: I designed the fitness functions for these AI models so that they not only learn how the output relates to the input, but also learn to recognize the input all by itself; I could easily modify these algorithms so that they do not try to figure out any outputs that correspond to the inputs but instead just learn to recognize the inputs. If one looks at the fitness function for some of these AI models, one would see that the AI models are performing tasks other than simply trying to produce the correct output for a particular input, and in order to maximize fitness, these AI models do not optimize their ability to predict the correct outputs. As a result, these AI models learn plenty of uncorrelated features. With that being said, I would like to develop more AI algorithms with complete forgetting that more accurately produce the correct output for a given input (and, more importantly, are equipped to handle more difficult tasks), so one will have to ask me how this relates to feature contamination when I develop those algorithms; more research is needed.

I edited the description to state that catastrophic forgetting is not caused simply by limited capacity and that it occurs even in networks with more than enough capacity to learn.
@Bencurlis 11 days ago
@@josephvanname3377 The phenomenon of CF prevents us from training our deep learning models on non-IID data streams, but I believe this is the only possible paradigm in the long term, as we already produce way more data than we can store. Yes, we sometimes want the model to unlearn some information it learned previously, but in that case there is the distributional shift I mentioned; in general we just want the model to learn some mapping, assuming it does not change much over time. If I understand correctly what you said about feature contamination, you said that your models are not generally trained to learn a precise explicit mapping, but the fitness function still implies they will produce some output for a given input, right? I am not sure that this really implies that feature contamination cannot exist in this case. The features being uncorrelated is precisely one of the properties of feature contamination; the only remaining question is whether these pseudo-random features you identified are learned because of the asymmetry caused by non-linear activations during training, or whether they pre-exist in the network and simply remain in it during training, as you seem to imply in previous videos. In the former case, both phenomena should be one and the same, but not in the latter case.
@josephvanname3377 11 days ago
@@Bencurlis In the case where the AI model is trained with a fitness function that approximates or at least incorporates the explicit mapping, the model will naturally want to learn things about the data regardless of whether they are correlated with the label or are even helpful in determining the label. This means that in my AI models that exhibit complete forgetting, the fitness function naturally gravitates towards feature contamination. But in that case, since the fitness function is just an approximation to the determination of the label, I am unsure whether we can still call this 'feature contamination'.

To test whether feature contamination is a result of the random/pseudorandom initialization, one may want to perform an experiment in which the network has a non-random initialization. To do this, one can use a network with layers of the form SkipConnection(Dense(mn,mn,activation),+) where the initialization consists of zeros. In this case, the network begins as the identity function, and all forms of feature contamination arise during the training instead of the initialization (even with zero initialization, the network will still learn pseudorandom information, though, since it is influenced by things such as the learning rate, the order in which the network is fed data, and the amount of momentum, and this pseudorandom information is similar to a random/pseudorandom initialization).

It seems like the solution to catastrophic forgetting is not to use fitness/loss functions that are resistant to catastrophic forgetting, since there are other techniques to deal with this issue (though some forgetting-resistant fitness/loss functions may be useful). In some cases, one can treat catastrophic forgetting as a non-issue; for example, with recommender systems, it may be better to learn people's current preferences and current trends than to remember historical preferences and trends. In the case where the history is important, one may want to keep some training examples in long-term storage and use other techniques to minimize forgetting besides simply using loss/fitness functions that remember everything.
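A minimal Flux.jl sketch of the zero-initialized residual layer described above (assuming a recent Flux version; the width 28*28, the relu activation, and the batch size are illustrative choices, not taken from the video): with all-zero weights the Dense layer outputs zero, so the SkipConnection starts out as the identity map.

```julia
using Flux

n = 28 * 28                      # illustrative layer width
block = SkipConnection(Dense(n, n, relu; init = Flux.zeros32), +)

x = randn(Float32, n, 16)        # a batch of 16 inputs
@assert block(x) ≈ x             # at initialization the block is the identity
```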
@JustMeJustADude 16 days ago
Are there any visual patterns that you yourself find interesting or remarkable?
@josephvanname3377 16 days ago
On the left side, the entries vary independently of each other, so there is nothing remarkable when we just look at the left matrix. But the right matrix is more remarkable, especially when compared to the left side.

1. It is quite obvious that the right matrix evolves at a faster rate than the left matrix.
2. Much (but not all) of the chaos in the right matrix is due to column swaps. I will make another visualization without the column swaps (to do this, I need to generalize the singular value decomposition so that the entries on the diagonal are unordered) so that we can see what that is like; making the visualization without column swaps will also give people a better sense of the stability of the singular value decomposition.
3. The amount of chaos in each of the columns varies with time. If we take a particular column, then that column is sometimes inactive, but at other times it is highly active. This indicates that the instability stems from the interactions that singular vectors have with each other.
4. Columns tend to interact with their neighbors; we do not see columns that are far from each other interacting. This means that if singular vectors have nearby singular values, then those vectors will blend and mix with each other.

These patterns become more apparent as one becomes familiar with the singular value decomposition and related results such as the polar decomposition and the spectral theorem for normal (or at least Hermitian) operators. But the matrices on the left side are random matrices, and in each frame the matrix on the right side is a random orthogonal matrix, so since we are dealing with random matrices that move in random directions (with some momentum for smoothness), one should not look too deeply for patterns.

Added later: It turns out that, except for the column swaps, the orthogonal vectors vary smoothly over time. This visualization does not show that phenomenon very well, but I will make an upcoming visualization which clearly shows it.
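As a small illustrative Julia experiment (not the code behind the visualization), one can see why nearby singular values make the singular vectors unstable: when two singular values nearly coincide, a tiny perturbation can rotate, mix, and even swap the corresponding columns of U.

```julia
using LinearAlgebra

U0 = Matrix(qr(randn(3, 3)).Q)
V0 = Matrix(qr(randn(3, 3)).Q)
A  = U0 * Diagonal([1.0, 0.9999, 0.1]) * V0'    # first two singular values nearly equal

U1 = svd(A).U
U2 = svd(A + 1e-3 * randn(3, 3)).U

# The top-left 2x2 block of U1'U2 is typically far from ±identity (the two
# nearly-degenerate singular vectors mix), while the last column barely moves.
display(round.(U1' * U2, digits = 3))
```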
@josephvanname3377 19 days ago
I have not met too many complex matrices whose spectrum looks like Oumuamua, especially when they are so far from being Hermitian. The circular law states that such spectra should look like disks.
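For reference, the circular law is easy to check numerically: for an n by n matrix with iid standard Gaussian entries scaled by 1/sqrt(n), the eigenvalues approximately fill the unit disk (an illustrative sketch, not the video's code).

```julia
using LinearAlgebra

n = 500
λ = eigvals(randn(n, n) / sqrt(n))
println(maximum(abs.(λ)))   # close to 1: the spectrum roughly fills the unit disk
```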
@isaigm 20 days ago
I found such a subset, but its length is 72. My idea is simple: I start with 1, then I try to insert the next number, 2, only if doing so creates no arithmetic progression of length 4 (it doesn't), then I do some brute force for bigger numbers, and so on. But I definitely don't understand how to find this subset of 77 elements with gradient descent, which as far as I know is used to optimise differentiable functions. I need to read more :p
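Here is a minimal Julia sketch of the greedy scan described in this comment (the range 1:240 comes from this thread; the exact size of the resulting set may depend on the precise rule used): scan 1..N and keep a number only if it does not complete a 4-term arithmetic progression with numbers already kept.

```julia
# Keep n only if no d > 0 gives n-3d, n-2d, n-d already in the set.
function greedy_ap4_free(N)
    S = Int[]
    for n in 1:N
        ok = true
        for d in 1:div(n - 1, 3)
            if (n - d) in S && (n - 2d) in S && (n - 3d) in S
                ok = false
                break
            end
        end
        ok && push!(S, n)
    end
    return S
end

S = greedy_ap4_free(240)
println(length(S))   # the commenter reports a set of size 72 from this kind of greedy scan
```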
@josephvanname3377 20 days ago
Your greedy algorithm is one simple way of producing a set with no arithmetic progression which cannot be extended to a larger set with no arithmetic progression, and due to the simplicity of your algorithm, your set is probably more interpretable than most locally maximal 4-term-progression-free sets. Since you apparently went in order from 1 to 240, it looks like your set will be imbalanced in the sense that its mean will be statistically significantly less than 120.5. I have made other visualizations (with descriptions of how they work) of algorithms that solve discrete optimization problems, including the clique problem and generalizations such as the problem of partitioning a graph into cliques (I made a couple of different algorithms for this). So it is feasible to use gradient descent to solve a discrete optimization problem. One can also use gradient descent to train neural networks to make discrete choices when playing combinatorial games, as with the heavily publicized AlphaGo.
@isaigm 20 days ago
@@josephvanname3377 thank you for your answer Dr, i'll investigate more about discrete optimization
@ismbks 21 days ago
1 minute ago is crazy
@josephvanname3377 21 days ago
It is not too hard to catch me posting content 1 minute ago because I regularly post content here.
@josephvanname3377 21 days ago
Reversible computation is the future.
@PS0DEK 24 days ago
1:57 Amazing swirl man gj
@josephvanname3377 24 days ago
Thanks. The swirly was part of the reason I did not smoothen the visualization. And then the swirly ate everything.
@Bencurlis 28 days ago
I have some trouble understanding the regularization function. It seems a bit more complicated than L2 regularization; what is the rationale behind it?
@josephvanname3377 27 days ago
Let's assume we are working in the field of complex numbers. We have some room to modify the regularization function without compromising the important characteristics of the fitness function. A lot of the justification for the choice of regularization boils down to the performance of the fitness function in experiments, but there is some theory that explains the choice of regularization and helps us predict the behavior of similar forms of regularization.

The regularization function g(A)=norm(\sum_{k=1}^r A_k adjoint(A_k))^2 becomes clearer with a coordinate-free formulation of the fitness function. To do this, we can reformulate a tuple of matrices (A_1,...,A_r) as a linear operator A=kron(A_1,e_1)+...+kron(A_r,e_r) where e_1,...,e_r is an orthonormal basis for some inner product space W. In this case, \sum_{k=1}^r A_k adjoint(A_k)=A adjoint(A), and \sum_{k=1}^r adjoint(A_k) A_k is the partial trace (where we trace out W) of adjoint(A) A. Taking a Schatten or operator norm of A adjoint(A) for a linear operator A is a natural thing to do. Let ||*||_p denote the Schatten p-norm. Then ||A adjoint(A)||_p=||A||_{2p}^2.

We get even more motivation for the expression norm(\sum_{k=1}^r A_k adjoint(A_k))^2 from quantum information theory. Define a completely positive superoperator E=\Phi(A_1,...,A_r) by E(X)=\sum_{j=1}^r A_j X adjoint(A_j) (without coordinates, E(X) is just the partial trace of A X adjoint(A)). We would ideally like our 'norm' of (A_1,...,A_r) to be invariant under unitary transformations in the sense that if U is unitary and B_j=\sum_{k}U(j,k)A_k for all j, then the 'norm' of (B_1,...,B_r) should be the same as the 'norm' of (A_1,...,A_r). But it is a well-known fact from quantum information theory that \Phi(A_1,...,A_r)=\Phi(B_1,...,B_r) if and only if we can find a unitary operator U where B_j=\sum_{k}U(j,k)A_k for all j. This means that in order to find a norm on (A_1,...,A_r), we should find a norm on the completely positive superoperator \Phi(A_1,...,A_r). Now \Phi(A_1,...,A_r)(I)=\sum_{j=1}^r A_j adjoint(A_j), and \sum_{j=1}^r adjoint(A_j) A_j=adjoint(\Phi(A_1,...,A_r))(I), and Tr(\Phi(A_1,...,A_r)(X))=dot(X,\sum_{k=1}^r adjoint(A_k) A_k) where 'dot' refers to the Frobenius inner product. This means that the expressions \sum_{k=1}^r A_k adjoint(A_k) and \sum_{k=1}^r adjoint(A_k) A_k can easily be formulated in terms of the completely positive superoperator \Phi(A_1,...,A_r), so any unitarily invariant norm of \sum_{k=1}^r A_k adjoint(A_k) or \sum_{k=1}^r adjoint(A_k) A_k may be used for regularization.

Depending on which norm we regularize, the operator \Phi(A_1,...,A_r) will look more or less like a quantum channel or the adjoint of a quantum channel. For example, if we regularize \sum_{k=1}^r adjoint(A_k)A_k using the spectral norm (or a Schatten p-norm for large p), then \Phi(A_1,...,A_r) will be a quantum channel (or almost a quantum channel), and if \Phi(A_1,...,A_r) is always a quantum channel, then the fitness function has a clear interpretation in terms of quantum information/computation.
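A small numerical companion to the comment above (illustrative sizes, and the Frobenius norm is just one concrete choice of unitarily invariant norm): it evaluates the regularizer g and checks that \Phi(A_1,...,A_r) applied to the identity recovers \sum_k A_k adjoint(A_k).

```julia
using LinearAlgebra

# The completely positive superoperator Φ(A_1,...,A_r): X ↦ Σ_j A_j X adjoint(A_j).
phi(As, X) = sum(A * X * A' for A in As)

# Regularizer g(A_1,...,A_r) = ||Σ_k A_k adjoint(A_k)||^2, here with the Frobenius
# (Schatten 2) norm as one concrete unitarily invariant norm.
reg(As) = norm(sum(A * A' for A in As))^2

As = [randn(4, 4) for _ in 1:3]
println(reg(As))
# Φ applied to the identity equals Σ_k A_k adjoint(A_k):
println(norm(phi(As, Matrix(1.0I, 4, 4)) - sum(A * A' for A in As)))   # ≈ 0
```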
@aaravyadav9906 1 month ago
What level of math would I need to know in order to understand what this is trying to do? I have only taken first-year lin alg and multi var calc.
@josephvanname3377 1 month ago
More math is better here. The fitness function here is a linear-algebraic function, so one needs to go through a couple of good courses on linear algebra covering topics such as vector spaces, inner product spaces, the trace, eigenvalues, the Jordan decomposition, the spectral theorem and singular value decomposition, and matrix norms (such as Schatten norms). Functional analysis and Hilbert spaces would also be helpful. For gradient ascent, multivariable calculus is good, but instead of dealing with just 3 variables, one needs to deal with many variables and even matrices, so one should go over some matrix calculus to see what the gradient actually is. To have a better idea about why the fitness function has only a few local maxima (and I do not have a complete understanding of why this is the case) and to get a better understanding of the problem, one should go over some quantum information theory (quantum channels in particular). After all, the tuple of Hermitian matrices (A_1,...,A_r) is best understood as a real completely positive superoperator, which is essentially a real quantum channel (and here it does not make much of a difference whether we use the real or complex numbers).

Added later: It takes a bit of experience training these fitness functions to determine what conditions are needed for the fitness function to apparently have one or a few local maxima, so for this, one should train many fitness functions (for this experience, it is better to train many small models than a few large ones, since it is the number of models that gives the experience rather than their size). My visualizations should give one an indication of some conditions under which fitness functions tend to apparently have just one local optimum.
@QW3RTYUU 1 month ago
like a little dance
@josephvanname3377 1 month ago
The spectrum bounces that way because the Jacobian is nearly positive semidefinite.
@ianweckhorst3200 1 month ago
I have no idea what is happening but it’s a fun pattern
@josephvanname3377 1 month ago
We found the zero of the vector field when the dots overlapped and the fitness went to 1. Some people get some satisfaction from these visualizations simply since they make a visible pattern, but other people read the description to go a little bit deeper into the mathematics.
@randomnessslayer 1 month ago
What a fascinating character you are, Joseph Van Name, quite the bread-crumb trail
@josephvanname3377 1 month ago
???
@PS0DEK 1 month ago
@@josephvanname3377 If you can upload these, you should know it already? How interesting! You can weave a tapestry of intrigue with every step and epoch; one can't help but follow the threads and hope to converge in the global minima. Does it make any sense now? We acknowledge your talent. You probably navigate life like a chessboard, Joseph Van Name: each move calculated, each piece strategically placed. Like a riddle that begs to be solved, leaving those around you captivated. Even if we don't comment often, that doesn't mean that we are not there. We still follow the bread-crumb trail that you are leaving behind. What else can I say? Just keep going. Do not stop, please. Do not dare to be bothered; so, the Totient Function that is sacred... I doubt it'll be useful in regard to these, though, since it is not related to this video in particular... Just take it as you want. I mean, how are we supposed to answer? We don't have any clue how to answer the "???"; it is a rather undefined question! Just... just take the thumbs up, anyway, and keep moving on! :)
@josephvanname3377 1 month ago
@@PS0DEK Thanks. I was unsure of what the commenter meant since the phrase 'bread-crumb trail' is not used very often, and it is often difficult to figure out the tone of what is being said in this format.
@uzairname 1 month ago
I wonder if the functions look similar to each other outside of the domain of the plot on the right
@josephvanname3377 1 month ago
Outside the domain, the functions hardly resemble each other. This means that neural networks cannot be used in the obvious way for analytic continuation.
@albitross1992 1 month ago
In all seriousness this is very cool. And I mean it when I say science is sexy.
@josephvanname3377 1 month ago
Then share this video with all your friends.
@albitross1992 1 month ago
I need everyone to understand that this is extremely sexy.
@josephvanname3377 1 month ago
Then tell all your friends that this is an instance where gradient ascent apparently produces just one local maximum; in layman's terms, you train the model twice and you get the same thing. I should probably read more about Morse theory to learn more about Morse functions with a single local maximum and local minimum, since this can tell me something about the (differential) topology of the domain. I know that every manifold has a Morse function with a single local maximum and a single local minimum, though. I should also try to find other critical points of the fitness function.
@albitross1992 1 month ago
@@josephvanname3377 oh, GOD 😩
@josephvanname3377 1 month ago
@@albitross1992 That means that I will need to skim through more differential topology.
@mmujtabahamid 1 month ago
How big are the referenced matrices?
@josephvanname3377 1 month ago
The population consists of 100 many 28 by 28 matrices. We can see that this is the case by counting the number of entries per row.