Good day Mr. Hoefler, a very good and extensive overview of this huge topic. Thanks a lot.
@oscarsandoval9870 2 years ago
Excellent review of the state of the art, well explained and concise, thank you Torsten!
@maryamsamami6974 7 months ago
Dear Mr. Hoefler! Thank you for this useful video. May I ask if you could please share the slides from the talk with me?
@ayushchaturvedi5203 3 years ago
Where can I find the slides for this presentation?
@hoaxuan7074 3 years ago
I see Numenta have a sparse neural network that uses top-k magnitude selection after the dot products. High magnitude means a small angle between the input and weight vectors, which is one of two factors in noise sensitivity. The variance equation for linear combinations of random variables applies to dot products: with independent noise of variance σ² on the inputs, the output variance is σ² times the sum of the squared weights. If you want the value 1 out of a dot product and there is equal noise on all inputs, you can make one input 1 and one weight 1. That cuts out most of the noise. Or you can make all the inputs 1 and all the weights 1/d (d = dimensions); then the noise is averaged, which is better. In both cases the angle between the input vector and the weight vector is zero. If you increase the angle toward 90 degrees, the length of the weight vector must increase to keep getting 1 out of the dot product, and the output gets very noisy.
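A minimal numerical sketch of that variance argument (my own, not from the talk), assuming independent Gaussian noise with equal standard deviation on every input; d, sigma, and the trial count are illustrative:

```python
# Var(sum_i w_i * x_i) = sigma^2 * sum_i w_i^2 when the input noises are
# independent with equal variance sigma^2.
import numpy as np

rng = np.random.default_rng(0)
d = 256            # dimensions
sigma = 0.1        # per-input noise standard deviation
trials = 100_000

# Case 1: one input 1, one weight 1 -> output variance ~ sigma^2
w_single = np.zeros(d); w_single[0] = 1.0
x_single = np.zeros(d); x_single[0] = 1.0

# Case 2: all inputs 1, all weights 1/d -> output variance ~ sigma^2 / d
w_avg = np.full(d, 1.0 / d)
x_avg = np.ones(d)

noise = rng.normal(0.0, sigma, size=(trials, d))
out_single = (x_single + noise) @ w_single   # noisy inputs, clean weights
out_avg = (x_avg + noise) @ w_avg

print("single:   mean %.3f, var %.5f (predicted %.5f)"
      % (out_single.mean(), out_single.var(), sigma**2))
print("averaged: mean %.3f, var %.6f (predicted %.6f)"
      % (out_avg.mean(), out_avg.var(), sigma**2 / d))
```

Both cases produce a mean of 1, but the averaged weights shrink the output variance by a factor of d.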
@hoaxuan7074 3 years ago
You know that ReLU is a switch: f(x)=x is connect, f(x)=0 is disconnect. A light switch in your house is binary on/off, yet it connects or disconnects a continuously variable AC voltage signal. A dot product of several (connected) dot products can be simplified back to a single dot product, so once all the switch states in a ReLU neural network become known, the net actually collapses to a simple matrix. There may be some ways to compress or sparsify neural networks using that insight. Also, I like to remind people that the variance equation for linear combinations of random variables applies to dot products (noise sensitivity).
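A minimal sketch (my own, not from the talk) of that collapse: for a fixed input, record each ReLU's on/off switch state and fold it into the weight matrices, so the whole net reduces to one matrix for that input. The layer widths are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [8, 16, 16, 4]                       # illustrative layer widths
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    """Ordinary forward pass: ReLU after every layer except the last."""
    for W in Ws[:-1]:
        x = np.maximum(W @ x, 0.0)
    return Ws[-1] @ x

def collapsed_matrix(x):
    """Compose the layers with each ReLU replaced by its 0/1 switch state."""
    M = np.eye(sizes[0])
    a = x
    for W in Ws[:-1]:
        z = W @ a
        s = (z > 0).astype(float)            # switch states for this input
        M = (s[:, None] * W) @ M             # fold diag(s) @ W into M
        a = z * s
    return Ws[-1] @ M

x = rng.standard_normal(sizes[0])
print(np.allclose(forward(x), collapsed_matrix(x) @ x))   # True
```

The collapsed matrix is only valid for inputs that produce the same switch pattern, which is why the network as a whole is still nonlinear.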
@hoaxuan7074 3 years ago
See "Ankit Patel Breaking Bad" on YT and "AI462 Neural Networks" on Google. I think pruning with retraining can allow the net to explore more than the set of statistical solutions (where each dot product is a statistical summary measure and filter responding to the broad statistical activation patterns in the prior layer).
@hoaxuan7074 3 years ago
Fast Transform fixed-filter-bank neural networks based on the fast (Walsh) Hadamard transform would certainly be limited by data movement because they are so fast. However, high speed means they are suitable for use with federated evolution algorithms. E.g. each CPU is given the full neural model and part of the training data. Each CPU is sent the same short sparse list of mutations to make to the model and returns the cost for its part of the training data. The costs are summed, and if there is an improvement an accept message is sent to each CPU, else a reject message. Very little data moves around. If the training set is 1 million images and you have 1000 CPU cores (250 Raspberry Pi 4s), then each core only has to process 1000 images. The hardware would cost less than 1 or 2 high-end GPUs. The Continuous Gray Code Optimization algorithm is a good choice for the evolution-based training. A rough sketch of the accept/reject protocol is below.
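A minimal single-process sketch of that protocol (my own reading of the comment, with the workers simulated in one loop and simple Gaussian mutations standing in for Continuous Gray Code Optimization); the toy least-squares cost and all sizes are illustrative:

```python
# Every worker holds the full model and one data shard; the same sparse
# mutation list is broadcast, partial costs are summed, and the mutation is
# kept only if the total cost improves.
import numpy as np

rng = np.random.default_rng(2)
n_params, n_workers, n_samples = 64, 4, 1000

X = rng.standard_normal((n_samples, n_params))
y = X @ rng.standard_normal(n_params)                 # toy regression target
shards = np.array_split(np.arange(n_samples), n_workers)

model = np.zeros(n_params)                            # full model on every worker

def shard_cost(params, idx):
    """Cost a single worker would report for its shard of the training data."""
    r = X[idx] @ params - y[idx]
    return float(r @ r)

best_cost = sum(shard_cost(model, idx) for idx in shards)

for step in range(2000):
    # Broadcast the same short sparse mutation list to every worker.
    k = 3
    positions = rng.choice(n_params, size=k, replace=False)
    deltas = rng.normal(0.0, 0.1, size=k)
    candidate = model.copy()
    candidate[positions] += deltas

    # Each worker returns only a scalar cost; very little data moves.
    total = sum(shard_cost(candidate, idx) for idx in shards)
    if total < best_cost:                             # "accept" message
        model, best_cost = candidate, total
    # else: "reject" message, every worker discards the mutation

print("final cost:", best_cost)
```

Per step, only the sparse mutation list goes out and one scalar per worker comes back, which is the point of the scheme.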
@hitmanonstadia1784 2 years ago
Nice slides! However, the speaker talks too fast, like a rapper; it leaves me with a painful headache after the talk. :(