Wow!! I mean, how do you keep up and sustain this rate of reading and producing these videos, man! I came across your channel just about a week and a half ago (while searching for explanations of DETR) and since then have had a tough time just sifting through your very interesting videos, and picked up quite a bit along the way! You are indeed a role model! :) Thanks a lot for what you are doing! _/\_
@herp_derpingson · 4 years ago
12:18 It's interesting how the graphs are sigmoid-shaped. I would expect the graph to start out flat-ish because of redundant connections and then fall linearly, but it seems to flatten out near the end; basically, it retains accuracy even as most of the parameters get pruned. It would be interesting to see what happens if we start adding parameters back and training this near-critical model again. Would it trace the same path back upward? Or would it do better? Or worse?

38:38 Intuitively, this is equivalent to asking a node, "How much of the final output do you contribute to?" However, since we are taking absolute values, say there is a node computing 1x + 9999 that passes an activation of +9999 to the next node, which computes 1x - 9999. This score would rate both nodes highly, yet in reality they always negate each other and never contribute to the final output. Then again, checking interactions between neurons is practically intractable.

I really liked the data-free approach in this paper. I think it will inspire more papers to try similar stuff. Good stuff.

43:30 IDK. A network initialized with all weights equal to 1 would make this algorithm go crazy with excitement. Let's see whether a new family of network initialization policies emerges.
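To make the cancellation point above concrete, here is a tiny sketch (the two-node path and its numbers are hypothetical, taken from the comment): two huge weights whose effects always cancel still look "important" to a purely magnitude-based score.

```python
# Hypothetical two-node path whose large terms always cancel,
# yet whose parameters look important to a magnitude-based score.
w1, b1 = 1.0, 9999.0    # first node:  a1 = 1*x + 9999
w2, b2 = 1.0, -9999.0   # second node: out = 1*a1 - 9999

def forward(x):
    a1 = w1 * x + b1
    return w2 * a1 + b2   # equals x for every input: the +9999/-9999 cancel

print(forward(5.0))       # 5.0 -- the huge terms never reach the output

# An absolute-value-based saliency rates b1 and b2 as highly important,
# even though their joint contribution to the output is zero.
scores = {name: abs(v) for name, v in {"w1": w1, "b1": b1, "w2": w2, "b2": b2}.items()}
print(scores)             # {'w1': 1.0, 'b1': 9999.0, 'w2': 1.0, 'b2': 9999.0}
```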
@YannicKilcher · 4 years ago
- Yeah, indeed. I think there are dozens of questions like this that could all turn out either way, and nobody knows.
- Maybe here one can count on the Gaussian initializations to basically never make this happen, because it would be quite the coincidence.
- I guess we're both waiting for the first generation of "adversarial initializations" to screw with the pruning methods :D
@ProfessionalTycoons · 4 years ago
Amazing! The secrets are slowly revealing themselves.
@stephenhough5763 · 4 years ago
Amazing video presenting an amazing paper. AFAIK sparsity currently doesn't bring much real-world performance gain on most hardware (GPUs and TPUs), but that should start changing soon. 10x more layers with 90% of the weights pruned using SynFlow should greatly outperform while having a similar (final) parameter count.
@raghavjajodia1 · 4 years ago
Wow, a video on this paper is out already? That was quick! Really well explained, keep it up 👍
@visualizedata6691 · 4 years ago
Simply superb. Great work!
@siyn007 · 4 years ago
What if you added a constraint to the pruning that keeps at least a certain percentage (say 10%) of the connections in each layer, to prevent layer collapse? Edit: here's the answer at 33:50.
@alabrrmrbmmr · 4 years ago
Exceptional work once again!
@mrityunjoypanday227 · 4 years ago
This could help identify motifs for Synthetic Petri Dish.
@vijayabhaskar-j · 4 years ago
Yannic, you need to tell us how you manage your time and how you manage to put out a video daily. For me, just reading a paper takes 2-3 days to understand it fully, let alone explain it to others.
@rbain16 · 4 years ago
He's been publishing papers in ML since at least 2016, so he's had a bunch of practice. Plus he's lightning Yannic :D
@YannicKilcher · 4 years ago
Once you get a bit of practice, it gets easier :)
@herp_derpingson · 4 years ago
@YannicKilcher Building the cache, so to speak :)
@Phobos11 · 4 years ago
Awesome! I was waiting for today’s paper 🤣
@LouisChiaki · 4 years ago
Can we just add the new loss function to the original loss function of the model and train the original network with this pruning cost (and the one-shot data) included? Like, pruning the model while training it.
@YannicKilcher · 4 years ago
Yea I guess that's a possibility
@wesleyolis · 4 years ago
I fully agree with your statement about models being inflated; compression will still have its place. An unavoidable follow-up thought: why not inject additional layers and weights into the weight matrix as required, increasing the spatial resolution / search space so the model can capture more detail where higher accuracy is needed?

The other question is the best initialization pattern for the injected perturbation weights. I don't believe random weights are right; one would want an even distribution of perturbations, giving backpropagation the best possible chance to refine the weights based on the forward pass. If we move to inflationary models, the perturbations should also be injected more intelligently, not randomly.

One could imagine an iterative algorithm that builds the structure more intelligently by swapping out sections of weights for different weight structures, each representing a different kind of mathematical relationship, something like a^b or x^2 + y^2 captured via a series expansion. At one point I searched the internet for how to construct a set of weights that models x^b in a neural network and didn't find it; I only worked it out recently, by the looks of it, from a probability book. If we inflate models with sections of weights that resemble different mathematical relationships, we would also gain more insight into the mathematics going on, since we could keep a parallel mathematical construction for the weight matrix (for as long as it remains that way).

The next step, I guess, is improving how these matrices are abstracted for hardware computation: lots of empty space and missing weights, so one would want the ability to restructure the weight matrix without accuracy loss, merging and splitting layers so that dense blocks of weights remain for hardware computation. As for the perturbation pattern itself, I think a matrix with incremental symmetry around the upper and lower bounds and the midpoint, where values increase in small equal steps, should give better patterns than weights jumping around randomly.
@YannicKilcher · 4 years ago
That all seems predicated on these mathematical functions actually being the best kind of representations for these kinds of problems, which is a very strong hypothesis
@EditorsCanPlay · 4 years ago
Duude, do you ever take a break? Haha, love it though!
@DeepGamingAI · 4 years ago
SNIP SNAP SNIP SNAP SNIP SNAP! Do you know the toll that iterative pruning has on a neural network?
@YannicKilcher · 4 years ago
Must be horrible :D
@shubhvachher4833 · 4 years ago
Priceless comment.
@AsmageddonPrince · 2 years ago
I wonder if, instead of the all-ones datapoint, you could use a normalized average of all your training datapoints.
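A rough sketch of that idea, assuming a standard PyTorch DataLoader yielding (input, label) batches; this is the commenter's variation, not what the paper does, and the normalization choice here is just one plausible option.

```python
import torch

@torch.no_grad()
def normalized_mean_datapoint(loader):
    # Average of all training datapoints, rescaled to roughly unit range,
    # intended to be fed through the abs-valued network instead of torch.ones(...).
    total, count = None, 0
    for x, _ in loader:
        batch_sum = x.sum(dim=0)
        total = batch_sum if total is None else total + batch_sum
        count += x.shape[0]
    mean = total / count
    return mean / (mean.abs().max() + 1e-12)

# probe = normalized_mean_datapoint(train_loader).unsqueeze(0)
# ...then use `probe` in place of torch.ones(1, *input_shape) in the scoring pass.
```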
@Kerrosene · 4 years ago
I was wondering why they use the Hadamard product instead of the dR/d(theta) score alone as a metric to evaluate a parameter's effect on the loss. I understand that this new score won't obey the conservation theorem, but if the prime issue was avoiding layer collapse, could we just drop the conservation part and use this score in a way that prevents layer collapse, e.g. with a contingency in the algorithm that avoids it, maybe using a local masking technique (which I know is subpar in performance)? Has this been done? Any thoughts?
@YannicKilcher · 4 years ago
True. I think the parameter magnitude itself might carry some meaningful information. Like, when it's close to 0, a large gradient on it shouldn't "mean" as much. But that's just my intuition
@Tehom1 · 4 years ago
The problem looks so much like a max-flow or min-cost flow problem.
@MarkusBreitenbach · 4 years ago
How can this work for architectures like ResNet, which have bypass connections around layers, without looking at the data? They show results for ResNet in the paper, but somehow that doesn't make sense to me. Anybody know what I am missing?
@bluel1ng · 4 years ago
You might take a look at their code at github.com/ganguli-lab/Synaptic-Flow. Why do you think that using the training data would be required for dealing with the shortcut connections?
@YannicKilcher · 4 years ago
I think the same analysis still applies, though you're right the interesting part in ResNets is their skip connection, so technically they never have to deal with layer collapse.
@zhaotan6163 · 4 years ago
It works for images with a CNN as the first layer. How about an MLP with different features as inputs? It seems problematic for the first layer, i.e. feature selection, since the method never sees the data and has no idea which features are more important.
@YannicKilcher · 4 years ago
I guess we'll have to try. But you'd leave all neurons there, just prune the connections.
@billymonday8388 · 2 years ago
The algorithm essentially improves the gradient of a network to make it train better. It does not solve everything.
@RohitKumarSingh25 · 4 years ago
Yannic, thanks for making such videos, it really helps a lot. :D I wanted to know: these pruning techniques are not going to improve the FLOPs of my model, right, because we are just masking the weights in order to prune? Or is there another way to reduce FLOPs?
@YannicKilcher · 4 years ago
Yes, for now that's the case. But you could apply the same methods to induce block-sparsity, which you could then leverage to get to faster neural networks.
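A small illustration of why plain masking keeps the FLOPs unchanged on dense hardware, while block sparsity lets whole tiles be skipped. The sizes, keep ratio, and 64x64 block size are made up for the example.

```python
import torch

W = torch.randn(512, 512)
x = torch.randn(512)

# Unstructured pruning as usually implemented: a 0/1 mask over the weights.
mask = (torch.rand_like(W) > 0.9).float()   # keep ~10% of the weights
y_dense = (W * mask) @ x                    # still a full dense matmul: same FLOPs

# Block sparsity: prune whole tiles, so pruned blocks can simply be skipped.
B = 64                                      # illustrative block size
keep_block = torch.rand(512 // B, 512 // B) > 0.9
y_block = torch.zeros(512)
for i in range(512 // B):
    for j in range(512 // B):
        if keep_block[i, j]:                # skipped blocks cost nothing
            y_block[i*B:(i+1)*B] += W[i*B:(i+1)*B, j*B:(j+1)*B] @ x[j*B:(j+1)*B]
```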
@RohitKumarSingh25 · 4 years ago
@YannicKilcher I'll look into it.
@jonassekamane · 4 years ago
So -- if I understood this correctly -- you would in principle be able to 1) take a huge model (which normally requires an entire datacenter to train), 2) prune it down to some reasonable size -- and presumably prune it on a relatively small computer, since the method does not use any data in the pruning process -- and 3) finally train the smaller pruned model to high accuracy (or SOTA given the network size), presumably also on a relatively small computer.
@jwstolk · 4 years ago
I think that would be correct if training on a CPU. I don't know how current GPUs handle pruned networks or how much it benefits them. GPUs may need some additional hardware features to really benefit from using a pruned network.
@jonassekamane · 4 years ago
This method applied in reverse could also be quite interesting, i.e. for model search. Assuming the accuracy of a pruned network is reflective of the accuracy of the full network, you could use SynFlow to train and test various pruned models before scaling up the best-performing model and training that... but yes, new hardware might need to be developed.
@bluel1ng · 4 years ago
Nearly identical accuracy with 1% or even 0.1% of the weights at initialization? That is fascinating. What seems a bit mind-bending to me is that this pruning can be done data-independently, only by feeding 1s through the network. Crazy - maybe the future is poised to be sparse, and fully-connected initialization becomes a thing of the past. ;-) If layer collapse (aka layer-dependent average synaptic saliency score magnitude) is the problem: why not perform pruning layer-wise in general? How would the baseline methods perform if the pruning selection were done for each layer individually instead of sorting the scores for the network globally?
@YannicKilcher · 4 years ago
In the paper they claim that layer-wise pruning gives much worse results.
@bluel1ng · 4 years ago
@YannicKilcher I see, they reference "What is the State of Neural Network Pruning?" arxiv.org/abs/2003.03033 ... Maybe layer-wise (or fan-in/fan-out dependent) normalization of the saliency scores could be a way to compensate for the magnitude differences. ;-) Btw, the "linearization" trick they use for ReLUs (W.abs() and then passing a vector of ones) is nice ... for other activation functions this will probably require a bit more work.
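To make the global-versus-layer-wise distinction in this thread concrete, here is a rough sketch assuming a dict of per-parameter saliency score tensors (function and variable names are illustrative, not from the paper's code):

```python
import torch

def global_masks(scores, keep_fraction):
    # One threshold over all scores in the network (what SynFlow and most
    # baselines do): layers with uniformly small scores can lose everything.
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(keep_fraction * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    return {name: (s >= threshold).float() for name, s in scores.items()}

def layerwise_masks(scores, keep_fraction):
    # One threshold per layer: every layer keeps the same fraction, which
    # rules out layer collapse but ignores how layers compare to each other
    # (reported to perform worse overall).
    masks = {}
    for name, s in scores.items():
        k = max(1, int(keep_fraction * s.numel()))
        threshold = torch.topk(s.flatten(), k).values.min()
        masks[name] = (s >= threshold).float()
    return masks
```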
@hungryskelly · 4 years ago
Phenomenal. Would you be able to step through the code of one of these papers?
@YannicKilcher · 4 years ago
Sure, but this one is just like 3 lines, have a look.
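For reference, a minimal sketch of that core loop in PyTorch (not the authors' exact code; it ignores details like BatchNorm handling, so see the linked repo for the real implementation): take the absolute value of every parameter, push a ones input through, score each parameter by |theta * dR/dtheta|, and repeat score-and-prune over several rounds.

```python
import torch

def synflow_scores(model, input_shape):
    # Replace parameters by their absolute values (remember signs to restore).
    signs = {n: torch.sign(p.data) for n, p in model.named_parameters()}
    for n, p in model.named_parameters():
        p.data.abs_()
    model.zero_grad()
    device = next(model.parameters()).device
    R = model(torch.ones(1, *input_shape, device=device)).sum()  # data-free pass
    R.backward()
    scores = {n: (p.data * p.grad).abs() for n, p in model.named_parameters()}
    for n, p in model.named_parameters():                        # restore signs
        p.data *= signs[n]
    return scores

def iterative_synflow_prune(model, input_shape, final_keep, rounds=100):
    # Exponential schedule: prune a little, re-score, repeat.
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for r in range(1, rounds + 1):
        for n, p in model.named_parameters():   # apply current mask
            p.data *= masks[n]
        scores = synflow_scores(model, input_shape)
        keep = final_keep ** (r / rounds)
        flat = torch.cat([s.flatten() for s in scores.values()])
        k = max(1, int(keep * flat.numel()))
        threshold = torch.topk(flat, k).values.min()
        masks = {n: (s >= threshold).float() for n, s in scores.items()}
    return masks
```

Usage would look something like masks = iterative_synflow_prune(net, (3, 32, 32), final_keep=0.01), with the resulting masks then applied to the weights during training.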
@hungryskelly · 4 years ago
@YannicKilcher Fair point. Would look forward to that kind of thing on other papers. Thanks for the incredibly insightful content!
@robbiero368 · 4 years ago
Is it possible to iteratively grow the network rather than pruning it, or does that collapse into essentially the same thing?
@robbiero368 · 4 years ago
Oh, just heard your similar comments right at the end of the video. Cool.
@blanamaxima · 4 years ago
Not sure what this thing learns, the dataset or the architecture...
@kpbrl · 4 years ago
Great video once again! Just one question: do you have a goal of making at least one video a day? I found this channel while searching whether anyone had had the idea of making videos of "reading a paper". Now I have another idea. I'll implement it soon and share it here. :)
@YannicKilcher · 4 years ago
I try to make one every day, but I'll probably fail at some point
@Zantorc · 4 years ago
I wonder what pruning method the human brain uses. At birth, the number of synapses per neuron is 2,500, and it grows to 15,000 by about age 2. From then on they get pruned, mostly between ages 2-10 but continuing at a slower rate until the late 20s. The adult brain retains only 50% of the synapses it had as a 2-year-old.
@NicheAsQuiche · 4 years ago
This has been the most interesting part of the lottery ticket thing to me - it's amazing how many parallels there are between biological neurons and artificial ones. I think the lottery ticket hypothesis paper found good performance between 50% and 70% pruning.
@bluel1ng · 4 years ago
I guess "what fires together wires together" is also a good intuition in the reverse sense for pruning. Like muscles, the body will likely also try to optimize the brain based on usage / functional relevance. But there is definitely some stability in the system, otherwise we would quickly lose all memories that are not recalled frequently. ;-)
@kDrewAn · 4 years ago
Do you have a PayPal? I don't have much, but I at least want to buy you a cup of coffee.
@YannicKilcher · 4 years ago
Thank you very much :) but I'm horribly over-caffeinated already :D
@kDrewAn · 4 years ago
Nice
@sansdomicileconnu · 4 years ago
This is the Pareto principle.
@GoriIIaTactics · 4 years ago
This sounds like it's trying to solve a minor problem in a really convoluted way.
@jerryb2735 · 4 years ago
This paper contains no new or deep ideas. They do use data when pruning the network: it is the data on which the network was trained. Moreover, the lottery ticket hypothesis is trivial; once stated rigorously, it takes less than four lines to prove.
@YannicKilcher · 4 years ago
Enlighten us, please, and prove it in four lines :D
@jerryb2735 · 4 years ago
@YannicKilcher Sure, send me the statement of the hypothesis with the definitions of all technical terms used in it.
@YannicKilcher · 4 years ago
@jerryb2735 No, you define and prove it. You claim to be able to do both, so go ahead.