Wow!! I mean, how do you keep up and sustain this rate of reading and producing these videos, man! I came across your channel just about a week and a half ago (while searching for explanations of DETR) and since then have had a tough time just sifting through your very interesting videos, and picked up quite a bit along the way! You are indeed a role model! :) Thanks a lot for what you are doing! _/\_
@herp_derpingson · 4 years ago
12:18 It's interesting how the graphs are sigmoid-shaped. I would expect the graph to start out flat-ish because of redundant connections and then fall linearly, but it seems to flatten out near the end; basically, it retains accuracy even as most of the parameters get pruned. It would be interesting to see what happens if we start adding parameters back and training this near-critical model again. Would it trace the same path back upward? Or would it do better? Or worse?

38:38 Intuitively, this is equivalent to asking a node, "How much of the final output do you contribute to?" However, since we are taking absolute values, say there is a node computing 1x + 9999 that passes an activation of +9999 to the next node, which computes 1x - 9999. This score would rate both nodes highly, yet in reality they always negate each other and never contribute to the final output. Then again, checking interactions between neurons is practically intractable.

I really liked the data-free approach in this paper. I think it will inspire more papers to try similar stuff. Good stuff.

43:30 IDK. A network initialized with all weights equal to 1 would make this algorithm go crazy with excitement. Let's see whether a new family of network initialization policies emerges.
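To make the cancellation point above concrete, here is a tiny sketch (the two-node path and its numbers are hypothetical, taken from the comment): two huge weights whose effects always cancel still look "important" to a purely magnitude-based score.

```python
# Hypothetical two-node path whose large terms always cancel,
# yet whose parameters look important to a magnitude-based score.
w1, b1 = 1.0, 9999.0    # first node:  a1 = 1*x + 9999
w2, b2 = 1.0, -9999.0   # second node: out = 1*a1 - 9999

def forward(x):
    a1 = w1 * x + b1
    return w2 * a1 + b2   # equals x for every input: the +9999/-9999 cancel

print(forward(5.0))       # 5.0 -- the huge terms never reach the output

# An absolute-value-based saliency rates b1 and b2 as highly important,
# even though their joint contribution to the output is zero.
scores = {name: abs(v) for name, v in {"w1": w1, "b1": b1, "w2": w2, "b2": b2}.items()}
print(scores)             # {'w1': 1.0, 'b1': 9999.0, 'w2': 1.0, 'b2': 9999.0}
```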
@YannicKilcher · 4 years ago
- Yeah, indeed. I think there are dozens of questions like this that could all turn out either way, and nobody knows.
- Maybe here one can count on the Gaussian initializations to basically never make this happen, because it would be quite the coincidence.
- I guess we're both waiting for the first generation of "adversarial initializations" to screw with the pruning methods :D
@ProfessionalTycoons · 4 years ago
Amazing! The secrets are slowly revealing themselves.
@stephenhough5763 · 4 years ago
Amazing video presenting an amazing paper. AFAIK sparsity currently doesn't bring much real-world performance gain on most hardware (GPUs and TPUs), but that should start changing soon. 10x more layers with 90% of the weights pruned using SynFlow should greatly outperform while having a similar (final) parameter count.
@raghavjajodia1 · 4 years ago
Wow, a video on this paper is out already? That was quick! Really well explained, keep it up 👍
@visualizedata6691 · 4 years ago
Simply superb. Great work!
@siyn007 · 4 years ago
What if you added a constraint to the pruning that keeps at least a certain percentage (say 10%) of the connections in each layer, to prevent layer collapse? Edit: here's the answer at 33:50.
@alabrrmrbmmr · 4 years ago
Exceptional work once again!
@mrityunjoypanday227 · 4 years ago
This could help identify motifs for Synthetic Petri Dish.
@vijayabhaskar-j · 4 years ago
Yannic, you need to tell us how you manage your time and how you manage to put out a video daily. For me, just reading a paper takes 2-3 days to understand it fully, let alone explain it to others.
@rbain16 · 4 years ago
He's been publishing papers in ML since at least 2016, so he's had a bunch of practice. Plus he's lightning Yannic :D
@YannicKilcher · 4 years ago
Once you get a bit of practice, it gets easier :)
@herp_derpingson · 4 years ago
@YannicKilcher Building the cache, so to speak :)
@Phobos11 · 4 years ago
Awesome! I was waiting for today’s paper 🤣
@LouisChiaki · 4 years ago
Can we just add the new loss function to the original loss function of the model and train the original network with this pruning cost (and the one-shot data) included? Like, pruning the model while training it.
@YannicKilcher · 4 years ago
Yea I guess that's a possibility
@wesleyolis · 4 years ago
I fully agree with your statement about models being inflated; compression will still have its place. An unavoidable follow-up thought: why not inject additional layers and weights into the weight matrix as required, increasing the spatial resolution / search space so the model can capture more detail where higher accuracy is needed?

The other question is the best initialization pattern for the injected perturbation weights. I don't believe random weights are right; one would want an even distribution of perturbations, giving backpropagation the best possible chance to refine the weights based on the forward pass. If we move to inflationary models, the perturbations should also be injected more intelligently, not randomly.

One could imagine an iterative algorithm that builds the structure more intelligently by swapping out sections of weights for different weight structures, each representing a different kind of mathematical relationship, something like a^b or x^2 + y^2 captured via a series expansion. At one point I searched the internet for how to construct a set of weights that models x^b in a neural network and didn't find it; I only worked it out recently, by the looks of it, from a probability book. If we inflate models with sections of weights that resemble different mathematical relationships, we would also gain more insight into the mathematics going on, since we could keep a parallel mathematical construction for the weight matrix (for as long as it remains that way).

The next step, I guess, is improving how these matrices are abstracted for hardware computation: lots of empty space and missing weights, so one would want the ability to restructure the weight matrix without accuracy loss, merging and splitting layers so that dense blocks of weights remain for hardware computation. As for the perturbation pattern itself, I think a matrix with incremental symmetry around the upper and lower bounds and the midpoint, where values increase in small equal steps, should give better patterns than weights jumping around randomly.
@YannicKilcher · 4 years ago
That all seems predicated on these mathematical functions actually being the best kind of representations for these kinds of problems, which is a very strong hypothesis
@EditorsCanPlay · 4 years ago
Duude, do you ever take a break? Haha, love it though!
@DeepGamingAI · 4 years ago
SNIP SNAP SNIP SNAP SNIP SNAP! Do you know the toll that iterative pruning has on a neural network?
@YannicKilcher · 4 years ago
Must be horrible :D
@shubhvachher4833 · 4 years ago
Priceless comment.
@AsmageddonPrince · 2 years ago
I wonder if, instead of the all-ones datapoint, you could use a normalized average of all your training datapoints.
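A rough sketch of that idea, assuming a standard PyTorch DataLoader yielding (input, label) batches; this is the commenter's variation, not what the paper does, and the normalization choice here is just one plausible option.

```python
import torch

@torch.no_grad()
def normalized_mean_datapoint(loader):
    # Average of all training datapoints, rescaled to roughly unit range,
    # intended to be fed through the abs-valued network instead of torch.ones(...).
    total, count = None, 0
    for x, _ in loader:
        batch_sum = x.sum(dim=0)
        total = batch_sum if total is None else total + batch_sum
        count += x.shape[0]
    mean = total / count
    return mean / (mean.abs().max() + 1e-12)

# probe = normalized_mean_datapoint(train_loader).unsqueeze(0)
# ...then use `probe` in place of torch.ones(1, *input_shape) in the scoring pass.
```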
@Kerrosene · 4 years ago
I was wondering why they use the Hadamard product instead of the dR/d(theta) score alone as a metric to evaluate a parameter's effect on the loss. I understand that this new score won't obey the conservation theorem, but if the prime issue was avoiding layer collapse, could we just drop the conservation part and use this score in a way that prevents layer collapse, e.g. with a contingency in the algorithm that avoids it, maybe using a local masking technique (which I know is subpar in performance)? Has this been done? Any thoughts?
@YannicKilcher · 4 years ago
True. I think the parameter magnitude itself might carry some meaningful information. Like, when it's close to 0, a large gradient on it shouldn't "mean" as much. But that's just my intuition
@Tehom1 · 4 years ago
The problem looks so much like a max-flow or min-cost flow problem.
@MarkusBreitenbach · 4 years ago
How can this work for architectures like ResNet, which have bypass connections around layers, without looking at the data? They show results for ResNet in the paper, but somehow that doesn't make sense to me. Anybody know what I am missing?
@bluel1ng · 4 years ago
You might take a look at their code at github.com/ganguli-lab/Synaptic-Flow. Why do you think that using the training data would be required for dealing with the shortcut connections?
@YannicKilcher · 4 years ago
I think the same analysis still applies, though you're right the interesting part in ResNets is their skip connection, so technically they never have to deal with layer collapse.
@zhaotan6163 · 4 years ago
It works for images with a CNN as the first layer. How about an MLP with different features as inputs? It seems problematic for the first layer, i.e. feature selection, since the method never sees the data and has no idea which features are more important.
@YannicKilcher · 4 years ago
I guess we'll have to try. But you'd leave all neurons there, just prune the connections.
@billymonday8388 · 2 years ago
The algorithm essentially improves the gradient of a network to make it train better. It does not solve everything.
@RohitKumarSingh25 · 4 years ago
Yannic, thanks for making such videos, it really helps a lot. :D I wanted to know: these pruning techniques are not going to improve the FLOPs of my model, right, because we are just masking the weights in order to prune? Or is there another way to reduce FLOPs?
@YannicKilcher · 4 years ago
Yes, for now that's the case. But you could apply the same methods to induce block-sparsity, which you could then leverage to get to faster neural networks.
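A small illustration of why plain masking keeps the FLOPs unchanged on dense hardware, while block sparsity lets whole tiles be skipped. The sizes, keep ratio, and 64x64 block size are made up for the example.

```python
import torch

W = torch.randn(512, 512)
x = torch.randn(512)

# Unstructured pruning as usually implemented: a 0/1 mask over the weights.
mask = (torch.rand_like(W) > 0.9).float()   # keep ~10% of the weights
y_dense = (W * mask) @ x                    # still a full dense matmul: same FLOPs

# Block sparsity: prune whole tiles, so pruned blocks can simply be skipped.
B = 64                                      # illustrative block size
keep_block = torch.rand(512 // B, 512 // B) > 0.9
y_block = torch.zeros(512)
for i in range(512 // B):
    for j in range(512 // B):
        if keep_block[i, j]:                # skipped blocks cost nothing
            y_block[i*B:(i+1)*B] += W[i*B:(i+1)*B, j*B:(j+1)*B] @ x[j*B:(j+1)*B]
```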
@RohitKumarSingh25 · 4 years ago
@YannicKilcher I'll look into it.
@jonassekamane · 4 years ago
So -- if I understood this correctly -- you would in principle be able to 1) take a huge model (which normally requires an entire datacenter to train), 2) prune it down to some reasonable size -- and presumably prune it on a relatively small computer, since the method does not use any data in the pruning process -- and 3) finally train the smaller pruned model to high accuracy (or SOTA given the network size), presumably also on a relatively small computer.
@jwstolk · 4 years ago
I think that would be correct if training on a CPU. I don't know how current GPUs handle pruned networks or how much it benefits them. GPUs may need some additional hardware features to really benefit from using a pruned network.
@jonassekamane · 4 years ago
This method applied in reverse could also be quite interesting, i.e. for model search. Assuming the accuracy of a pruned network is reflective of the accuracy of the full network, you could use SynFlow to train and test various pruned models before scaling up the best-performing model and training that... but yes, new hardware might need to be developed.
@bluel1ng · 4 years ago
Nearly identical accuracy with 1% or even 0.1% of the weights at initialization? That is fascinating. What seems a bit mind-bending to me is that this pruning can be done data-independently, only by feeding 1s through the network. Crazy - maybe the future is poised to be sparse, and fully-connected initialization becomes a thing of the past. ;-) If layer collapse (aka layer-dependent average synaptic saliency score magnitude) is the problem: why not perform pruning layer-wise in general? How would the baseline methods perform if the pruning selection were done for each layer individually instead of sorting the scores for the network globally?
@YannicKilcher · 4 years ago
In the paper they claim that layer-wise pruning gives much worse results.
@bluel1ng · 4 years ago
@YannicKilcher I see, they reference "What is the State of Neural Network Pruning?" arxiv.org/abs/2003.03033 ... Maybe layer-wise (or fan-in/fan-out dependent) normalization of the saliency scores could be a way to compensate for the magnitude differences. ;-) Btw, the "linearization" trick they use for ReLUs (W.abs() and then passing a vector of ones) is nice ... for other activation functions this will probably require a bit more work.
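To make the global-versus-layer-wise distinction in this thread concrete, here is a rough sketch assuming a dict of per-parameter saliency score tensors (function and variable names are illustrative, not from the paper's code):

```python
import torch

def global_masks(scores, keep_fraction):
    # One threshold over all scores in the network (what SynFlow and most
    # baselines do): layers with uniformly small scores can lose everything.
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(keep_fraction * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    return {name: (s >= threshold).float() for name, s in scores.items()}

def layerwise_masks(scores, keep_fraction):
    # One threshold per layer: every layer keeps the same fraction, which
    # rules out layer collapse but ignores how layers compare to each other
    # (reported to perform worse overall).
    masks = {}
    for name, s in scores.items():
        k = max(1, int(keep_fraction * s.numel()))
        threshold = torch.topk(s.flatten(), k).values.min()
        masks[name] = (s >= threshold).float()
    return masks
```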
@hungryskelly · 4 years ago
Phenomenal. Would you be able to step through the code of one of these papers?
@YannicKilcher · 4 years ago
Sure, but this one is just like 3 lines, have a look.
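For reference, a minimal sketch of that core loop in PyTorch (not the authors' exact code; it ignores details like BatchNorm handling, so see the linked repo for the real implementation): take the absolute value of every parameter, push a ones input through, score each parameter by |theta * dR/dtheta|, and repeat score-and-prune over several rounds.

```python
import torch

def synflow_scores(model, input_shape):
    # Replace parameters by their absolute values (remember signs to restore).
    signs = {n: torch.sign(p.data) for n, p in model.named_parameters()}
    for n, p in model.named_parameters():
        p.data.abs_()
    model.zero_grad()
    device = next(model.parameters()).device
    R = model(torch.ones(1, *input_shape, device=device)).sum()  # data-free pass
    R.backward()
    scores = {n: (p.data * p.grad).abs() for n, p in model.named_parameters()}
    for n, p in model.named_parameters():                        # restore signs
        p.data *= signs[n]
    return scores

def iterative_synflow_prune(model, input_shape, final_keep, rounds=100):
    # Exponential schedule: prune a little, re-score, repeat.
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for r in range(1, rounds + 1):
        for n, p in model.named_parameters():   # apply current mask
            p.data *= masks[n]
        scores = synflow_scores(model, input_shape)
        keep = final_keep ** (r / rounds)
        flat = torch.cat([s.flatten() for s in scores.values()])
        k = max(1, int(keep * flat.numel()))
        threshold = torch.topk(flat, k).values.min()
        masks = {n: (s >= threshold).float() for n, s in scores.items()}
    return masks
```

Usage would look something like masks = iterative_synflow_prune(net, (3, 32, 32), final_keep=0.01), with the resulting masks then applied to the weights during training.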
@hungryskelly · 4 years ago
@YannicKilcher Fair point. Would look forward to that kind of thing on other papers. Thanks for the incredibly insightful content!
@robbiero368 · 4 years ago
Is it possible to iteratively grow the network rather than pruning it, or does that collapse into essentially the same thing?
@robbiero368 · 4 years ago
Oh, just heard your similar comments right at the end of the video. Cool.
@blanamaxima · 4 years ago
Not sure what this thing learns, the dataset or the architecture...
@kpbrl · 4 years ago
Great video once again! Just one question: do you have a goal of making at least one video a day? I found this channel while searching whether anyone had had the idea of making videos of "reading a paper". Now I have another idea. I'll implement it soon and share it here. :)
@YannicKilcher · 4 years ago
I try to make one every day, but I'll probably fail at some point
@Zantorc · 4 years ago
I wonder what pruning method the human brain uses. At birth, the number of synapses per neuron is 2,500, and it grows to 15,000 by about age 2. From then on they get pruned, mostly between ages 2-10 but continuing at a slower rate until the late 20s. The adult brain retains only 50% of the synapses it had as a 2-year-old.
@NicheAsQuiche · 4 years ago
This has been the most interesting part of the lottery ticket thing to me - it's amazing how many parallels there are between biological neurons and artificial ones. I think the lottery ticket hypothesis paper found good performance between 50% and 70% pruning.
@bluel1ng · 4 years ago
I guess "what fires together wires together" is also a good intuition in the reverse sense for pruning. Like muscles, the body will likely also try to optimize the brain based on usage / functional relevance. But there is definitely some stability in the system, otherwise we would quickly lose all memories that are not recalled frequently. ;-)
@kDrewAn · 4 years ago
Do you have a PayPal? I don't have much, but I at least want to buy you a cup of coffee.
@YannicKilcher · 4 years ago
Thank you very much :) but I'm horribly over-caffeinated already :D
@kDrewAn · 4 years ago
Nice
@sansdomicileconnu · 4 years ago
This is the Pareto principle.
@GoriIIaTactics · 4 years ago
This sounds like it's trying to solve a minor problem in a really convoluted way.
@jerryb2735 · 4 years ago
This paper contains no new or deep ideas. They do use data when pruning the network: it is the data on which the network was trained. Moreover, the lottery ticket hypothesis is trivial; once stated rigorously, it takes less than four lines to prove.
@YannicKilcher · 4 years ago
Enlighten us, please, and prove it in four lines :D
@jerryb2735 · 4 years ago
@YannicKilcher Sure, send me the statement of the hypothesis with the definitions of all technical terms used in it.
@YannicKilcher · 4 years ago
@jerryb2735 No, you define and prove it. You claim to be able to do both, so go ahead.