These long videos are really growing on me. Not just introducing me to papers that I am not familiar with, but also the additional insight of your perspectives. Thank you.
@YeshwanthReddy 4 years ago
At 21:55 I think you should draw horizontal lines (not vertical) to discard architectures since score is on the y-axis. No?
@herp_derpingson 4 years ago
We don't really know what the cutoff score should be. Maybe we can have the top N highest scoring models.
@tonyrobinson1349 4 years ago
Ah, I was about to say the same! It's confusing when the cause has been plotted on the y-axis and the effect on the x-axis.
@pladselsker8340 4 years ago
Isn't the score measurement on the left evaluated from actual training? I think he meant to discard architectures before even training them, which I think means that you have to select a vertical threshold on the validation accuracy, like he did.
@Deez-Master 4 years ago
Funny tho, Yes you can discard any crap with bad validation accuracy... if only there were some way to predict that without having to train and validate it ... :0
@hocusbogus7930 4 years ago
@@pladselsker8340 Nope, 'score' is what is determined by eqn (2) for an untrained network, while validation accuracy is for a trained network. Before training, one could calculate 'score' for each network, and they would look like dots plotted on a vertical line. Then, discard all networks below a certain score -- by drawing a horizontal line that cuts this vertical line -- and only train the networks that lie above it.
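The filtering step described above can be sketched roughly as follows. This is a minimal sketch, assuming every candidate architecture already has a training-free score; the helper names `keep_above` and `keep_top_n` and the example scores are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def keep_above(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return architectures whose untrained score clears the horizontal cutoff."""
    return [name for name, s in scores.items() if s >= threshold]


def keep_top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Alternative when no cutoff is known: keep the n highest-scoring architectures."""
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:n]


# Hypothetical scores, computed before any training.
scores = {"arch_a": 0.91, "arch_b": 0.40, "arch_c": 0.75, "arch_d": 0.10}
to_train = keep_above(scores, threshold=0.5)  # only these would be trained
top_two = keep_top_n(scores, n=2)
```

The top-N variant sidesteps the question of where to put the cutoff, at the price of fixing the training budget in advance.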
@JeroenMW2 3 years ago
Thank you so much, you save 10s of thousands of people hours of work. The impact of your work is immense even if you don't get hundreds of thousands of views. Please never stop, you're amazing!
@Notshife 4 years ago
Yet another area I am most interested to hear about. Thank you
@marohs5606 4 years ago
Thank you for the efforts.. highly appreciated!!
@Miestro85h 4 years ago
If it's true, then pre-filtering (or rejection sampling) based on this average score is also a cheap speedup tool for any neural architecture search algorithm.
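One way to picture this pre-filtering is a thin wrapper around whatever sampler a search algorithm already uses. The sketch below is an assumption-laden illustration: `prefiltered`, `toy_sample`, and `toy_score` are hypothetical stand-ins, and the toy score is not the score from the paper.

```python
import random
from typing import Callable, Iterator, TypeVar

Arch = TypeVar("Arch")


def prefiltered(
    sample: Callable[[], Arch],
    score: Callable[[Arch], float],
    min_score: float,
    max_tries: int = 1000,
) -> Iterator[Arch]:
    """Yield sampled architectures whose training-free score is at least min_score."""
    for _ in range(max_tries):
        arch = sample()
        if score(arch) >= min_score:  # cheap check, no training involved
            yield arch


rng = random.Random(0)


def toy_sample() -> int:
    """Stand-in for an architecture sampler: an integer 'architecture id'."""
    return rng.randint(0, 9)


def toy_score(arch: int) -> float:
    """Stand-in for the untrained score, scaled to [0, 1]."""
    return arch / 9.0


accepted = list(prefiltered(toy_sample, toy_score, min_score=0.5, max_tries=50))
```

Only the accepted candidates would be handed on to the expensive part of the search.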
@herp_derpingson 4 years ago
It is basically an "anti-lottery ticket hypothesis". At 33:00: for the RL-based search models, I think we would still need some negative samples. Otherwise the RL model would keep suggesting bad models for the sake of exploration. Nice paper, easy to implement. Will definitely use this trick.
@CosmiaNebula 4 years ago
my 6-word slogan for this paper: Neural architecture physiognomy! And it works!
@eddtsoi 4 years ago
amazing, always like your inspiring interpretation
@TimeofMinecraftMods 4 years ago
I think that a lot of the lack of performance compared to other techniques can be explained by the way the NAS-Bench-201 benchmark is constructed. We only have 15,625 different architectures: enough for the sample-efficient "train until you're done" NAS systems, but searching without training may just need significantly different architectures. This would also explain why the metric "spreads out" on the more complex tasks: there just aren't enough ways in which the NAS-Bench-201 networks differ to make a meaningful difference that can be observed just by looking at the initial state. Maybe one could combine this approach with something like NEAT to generate a population of architectures and score them pretty much instantly using this. This would allow the system to get away from the "resnet-likes" that make up NAS-Bench-201.
@albertwang5974 4 years ago
The role of a nonlinear function in a neural network can be compared to the if...else statement in a traditional programming language. LSTM, GRU, and attention can be treated the same way: they provide switch control capability.
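The analogy can be made concrete with a toy example. This is only an illustration of the comparison above, not a claim about how any framework implements these operations; `relu` and `gated` are hypothetical helper names.

```python
import math


def relu(x: float) -> float:
    """ReLU written as an explicit branch."""
    if x > 0:
        return x
    else:
        return 0.0


def gated(value: float, gate_logit: float) -> float:
    """A sigmoid gate acts as a soft switch: near 0 it blocks, near 1 it passes."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return gate * value


blocked = gated(5.0, -20.0)  # gate is almost closed
passed = gated(5.0, 20.0)    # gate is almost open
```

Unlike a hard if...else, the gate is differentiable, so the "switch position" itself can be learned.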
@lucha6262 4 years ago
This is very exciting and as others mention in the comments a super speedy tool to discard faulty architectures, thanks for the video!
@Mahyaalizadeh1995 11 months ago
Thank you for sharing this. It's very interesting. I learned a lot.
@lugae4619 11 months ago
Great video! Thanks for your personal interpretation too; it helps to think things through. I would argue, though, that the interpretation of the pytorchcv plot (at 25:40) is wrong (admittedly, I don't know if it's your interpretation or the authors', since they seem to have removed this part from their most recent version). It looks like they're showing that their metric gives high scores to methods that we know do well. That is, architectures that humans have found to perform well achieve a high NASWOT score (or whatever they call it).
@dipsyteletubbie802 2 years ago
Thank you for the very clear explanation!
@gauravkoradiya1236 4 years ago
Thanks for sharing. Commendable work.
@sweatobertrinderknecht3480 4 years ago
your thumbnails are getting better
@Kram1032 4 years ago
I wonder if this scoring could be improved simply by exchanging regular correlation with distance correlation, since that will also capture non-linearities. It might make a difference in particular in those networks where currently the score no longer tells you much.
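To show what distance correlation picks up that ordinary (Pearson) correlation misses, here is a minimal NumPy sketch of the sample distance correlation for two 1-D samples. The function name `distance_correlation` is a hypothetical helper, and how it would be plugged into the paper's score is left open.

```python
import numpy as np


def distance_correlation(x, y) -> float:
    """Sample distance correlation between two 1-D samples of equal length."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    a = np.abs(x - x.T)  # pairwise distances within x
    b = np.abs(y - y.T)  # pairwise distances within y
    # Double-centre each distance matrix (subtract row/column means, add grand mean).
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0.0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


# A purely nonlinear dependence: y is a deterministic function of x.
x = np.linspace(-1.0, 1.0, 201)
y_quadratic = x ** 2
pearson = float(np.corrcoef(x, y_quadratic)[0, 1])  # vanishes by symmetry
dcor = distance_correlation(x, y_quadratic)         # clearly above zero
```

The pairwise distance matrices make this O(N^2) in the number of samples, which is worth keeping in mind if it were applied to large batches.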
@herp_derpingson 4 years ago
I just read about distance correlation. It makes sense.
@astroganov 4 years ago
You have a misunderstanding at 21:42 about the axes. The covariation score is on the vertical axis, and validation accuracy (after training) is on the horizontal axis. So, if you wish to use the described method to filter out "bad" architectures quickly, you should cut by score (draw a horizontal line at some threshold level) instead of drawing a vertical line as at 22:21. That actually means this method is even further from being precise by itself...
@franciskusxaveriuserick7608 3 years ago
Unrelated dumb comment, but that annotated NAS-Bench-201 diagram at 19:52 roughly looks like a map of Switzerland. That said, I am doing research on network compression and this is a really cool idea. I would like to see more studies relating these parameters, scores, and inference speed, so that we can also optimize NAS to find the "smallest" network, or whichever one gives the best inference speed on embedded systems while still giving reasonably good accuracy.
@bluel1ng 4 years ago
I somehow like these one-step methods. What I do not directly understand is how this method can predict the generalization capabilities of a network-architecture (e.g. validation set accuracy) from the linear map histogram of one mini-batch.
@tonyrobinson1349 4 years ago
The idea is interesting, thanks for making it so accessible. The big question is: is it useful? I read fig 3 differently from you and so I come to a different conclusion. You think that this method weeds out most (90%) of the bad architectures; I think it weeds out very few. If we had an uninformative method then the correlation line would be vertical in fig 3, that is, the score would not be correlated with accuracy. By eye I integrate vertically to get the distribution of scores that would happen with a useless method. I then do the same again for (say) the top 10% of scores. Scatter plots are terrible at showing density, but it looks to me that all the probability mass is at the top of the plot, so the distributions would be very similar. The authors could have done these basic stats.
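The check proposed above could be run along these lines. This is a hedged sketch on synthetic data only: `compare_top_fraction` is a hypothetical helper, and the numbers below are randomly generated for illustration, not results from the paper or from NAS-Bench-201.

```python
import numpy as np


def compare_top_fraction(scores, accuracies, fraction=0.1, quantiles=(0.1, 0.5, 0.9)):
    """Accuracy quantiles for all architectures vs. the top `fraction` by score."""
    scores = np.asarray(scores, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    k = max(1, int(round(fraction * len(scores))))
    top = np.argsort(scores)[-k:]  # indices of the k highest scores
    return {
        "all": np.quantile(accuracies, quantiles),
        "top": np.quantile(accuracies[top], quantiles),
    }


# Synthetic stand-ins: one informative score and one useless score.
rng = np.random.default_rng(0)
accuracies = rng.uniform(0.5, 0.95, size=1000)
informative = compare_top_fraction(accuracies, accuracies)           # score == accuracy
uninformative = compare_top_fraction(rng.normal(size=1000), accuracies)
```

If the "top" quantiles barely move relative to "all", the score is doing little filtering, which is the scenario the comment worries about.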
@ameetrahane1445 4 years ago
I think the idea has merit, at least intuitively. I'd like your input on why it wouldn't work.
@tonyrobinson1349 4 years ago
@@ameetrahane1445 I agree, I do think it's an interesting idea. Please note that my long comment was on fig 3 only, it's easy to have some really bad architectures out of all the 15625 combinations and I believe these are given too much weight by the use of the scatter diagram (which doesn't show density well). Thus I was commenting on specifics, not making a general statement that "it wouldn't work".
@YannicKilcher 4 years ago
True, you have a good point. Maybe it would be worth investigating this quantitatively.
@julespoon2884 4 years ago
I'm kind of disappointed they did not show the scores for the well-performing trained networks after showing that initialisation affects the score significantly. If trained networks tend to have a greater correlation between their score and accuracy, perhaps this method can be made useful by somehow mitigating the randomness from initialisation. Perhaps by training the network for a few rounds on random data to maximise the score, and using that? A tangent: if random data does not affect the score, and the score is correlated with accuracy, what if a network is trained on random data to maximise the score? Would that necessarily increase the accuracy on the actual data? This is one reason I'm skeptical of this method: I don't think the score is a good indication of the accuracy, as it does not seem to account for the training data much.
@annasuperjump 4 years ago
Hi @Tony Robinson, I do not quite understand this sentence in your comment " If we had an uninformative method then the correlation line would be vertical in fig 3, that is the score is not correlated with accuracy.", could you please clarify more? thanks
@dermitdembrot3091 4 years ago
Is J of shape NxD or DxN (where D is the dimensionality of x)? The shape of JJ^T would be NxN and DxD respectively in these two cases. Clearly the first makes more sense in context but the J_i,n in the second line below (1) seems to indicate otherwise.
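The two shape conventions in question can be checked with a small NumPy sketch. The sizes and the random matrix are placeholders standing in for real per-example gradients; which convention the paper actually intends is not settled here.

```python
import numpy as np

N, D = 4, 6                   # assumed batch size and flattened input dimension
rng = np.random.default_rng(0)
J = rng.normal(size=(N, D))   # row n stands in for the gradient of f(x_n) w.r.t. x_n

gram_over_data = J @ J.T      # shape (N, N): compares data points with each other
gram_over_inputs = J.T @ J    # shape (D, D): compares input dimensions with each other
```

With J stored as N x D, the N x N product is the one whose entries relate pairs of data points, which matches the reading that "makes more sense in context".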
@CosmiaNebula 4 years ago
Alternative 6-word slogan: First, sanity check neural architecture expressivity!
@hannesstark5024 4 years ago
Very interesting video once again! I have to say I thought I was going to like the "historical papers", but I have to admit that I found the present word2vec and GAN videos boring and did not finish them. Just wanted to leave that feedback.
@drozen214 3 years ago
I don’t know if you’re aware, but the paper seems to have been edited/updated since you made this video with different graphs, showing correlation matrices instead of histograms, and a different equation for computing the score. Is this common for papers to be changed after publishing? Do you know if the new equation is mathematically equivalent and preferred because it’s easier to calculate? Or is it just a different score that measures approximately the same thing?
@bryce-bryce 4 years ago
What if you first train for let's say 5 epochs and then compute the score?
@dermitdembrot3091 4 years ago
I think this goes in the direction of something people do, where computing power is saved by only doing as few updates as necessary to see whether the architecture is good/ bad. If it's good after five steps you might decide to continue for another bit since good models are harder to tell apart than bad models. The paper in this video seems to have found a better predictor of performance at convergence than is the score after five steps.
@bryce-bryce 4 years ago
@@dermitdembrot3091 I know. I was wondering if the scores get more accurate / reliable when training the network for a few epochs and then looking at the correlations. Because if I understand it correctly, the network is just initialized and the correlations are based on the random weights. I just find it hard to understand why correlations of random weights are a good indicator of the final prediction. But I did not read the paper, just watched the video, so maybe I did not fully understand the idea.
@dermitdembrot3091 4 years ago
Oh sorry I completely misread your comment. Kind of assumed you meant accuracy score. I agree that it's worth investigating whether the "gradient correlation" score improves the evaluation. Quite possibly the authors tried and didn't see an improvement.
@blanamaxima 4 years ago
I guess that if the loss landscape is similar to a random spin-glass Hamiltonian, as Yann was saying, then it makes some sense to have some spreading in the orientation of the linearization... To some extent it is sad that we are basically saying that a convex function has to be discarded from the beginning :) I am curious to see some changes in the loss function as well.
@arnoldchen1108 4 years ago
Could you please point out the reference for Yann's point? Thank you!!
@blanamaxima 4 years ago
@@arnoldchen1108 It is a pretty old paper by current standards: arxiv.org/abs/1412.0233
@arnoldchen1108 4 years ago
@@blanamaxima Definitely an interesting read though. Thanks a lot!
@deepblender 4 years ago
Great video as always! Did you activate ads? I don't mind them at all! I am only asking because you recently mentioned, you didn't plan to enable them.
@YannicKilcher 4 years ago
Yes, I announced that in the latest channel update :)
@siyn007 4 years ago
I wonder if there's a way to use that score as the reward for the RL algorithms instead of the accuracy. I think that would reduce the computation time without necessarily dropping performance much, but I might be horribly wrong haha
@sheggle 4 years ago
Thought so too, should at least be interesting to find out!
@yashrunwal5111 3 years ago
@Yannick Kilcher. Great job. Thank you! Can you explain EfficientNet/EfficientDet paper?
@norik1616 4 years ago
🤔 Add this before HyperBand in keras-tuner, and add Bayesian optimization after HyperBand:
1. Search without training to get rid of 50-80% of the really bad architectures
2. HyperBand to quickly abandon poor architectures
3. Bayesian optimization (with all of the HyperBand runs as input) for the fine search
@111dimka111 4 years ago
Great paper and great review! I wonder if we can replace the gradient w.r.t. input with gradient w.r.t. weights. The updated score can be related to NTK and to the metric over the function space. Would such change produce a better expressiveness score? Insights anyone?
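For reference, the swap suggested above can be written down as follows. This is a sketch for a scalar-output network $f(x;\theta)$ with $D$ input dimensions and $P$ parameters; the notation is assumed here and is not taken from the paper. The second line is the standard empirical NTK Gram matrix on a batch of $N$ inputs.

```latex
\[
  J^{(x)}_{n} = \frac{\partial f(x_n;\theta)}{\partial x_n} \in \mathbb{R}^{D},
  \qquad
  J^{(\theta)}_{n} = \frac{\partial f(x_n;\theta)}{\partial \theta} \in \mathbb{R}^{P},
\]
\[
  \hat{\Theta}_{nm} = \big\langle J^{(\theta)}_{n},\, J^{(\theta)}_{m} \big\rangle,
  \qquad n, m = 1, \dots, N .
\]
```

Both choices yield an $N \times N$ matrix over the batch, so a score built on one could in principle be recomputed on the other; whether that improves the ranking is an open question in this thread.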
@YannicKilcher 4 years ago
Interesting idea. I know too little about NTK to have an informed opinion :)
@norik1616 4 years ago
I think the correlation could be better after training for a few batches on the hard tasks (ImageNet). The lottery ticket also had a similar problem with harder tasks and needed a bit of training. Does that make sense?
@YannicKilcher 4 years ago
Yes, totally.
@utku_yucel 4 years ago
Thanks!
@robbiero368 4 years ago
Am I right in thinking that people always initialize networks with random weights (or weights from a previous training)? Has anyone done any work looking at what happens if you use some sort of "less" random values as initial ones? Is all randomness created equal, so to speak, and is it important to be completely random in your starting point? What happens if you initialize with a regular pattern: does it fail to train at all?
@YannicKilcher 4 years ago
There is work in this area, but I think without significant improvement over random init.
@YeshwanthReddy 4 years ago
I can't believe they've trained ImageNet 50000 times 😝
@jasdeepsinghgrover2470 4 years ago
Have observed this in some places. At least in simple datasets, too much non-linear behaviour also allows over-fitting which might cause unexpected behaviours for high scores.
@ЗакировМарат-в5щ 4 years ago
I think that this article implies some kind of contradiction if we look at it in the context of manifold mixup. In that article they claimed (if I am right) that they reduced the number of meaningful eigenvalues, making the manifold itself more linear-like; here I am hearing exactly the opposite.
@rishabhmanishsahlot129 4 years ago
Has anyone used this? Does it actually work? Please let me know.
@vsiegel 4 years ago
So we use an AI to build another AI... why does that feel so spooky...
@DasGrosseFressen 4 years ago
Why do guys in ML love to rename already established concepts?... Why "linear map of data" instead of simply the first-order Taylor expansion?
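For readers comparing the two phrasings, the first-order Taylor expansion of a network $f$ around an input $x_0$ is written below, with $J_f(x_0)$ denoting the Jacobian of $f$ with respect to its input (notation assumed here). For networks built only from piecewise-linear pieces such as ReLU, $f$ is itself piecewise linear, so the approximation holds with equality for every $x$ in the same linear region as $x_0$.

```latex
\[
  f(x) \;\approx\; f(x_0) + J_f(x_0)\,(x - x_0),
  \qquad
  J_f(x_0) = \left.\frac{\partial f}{\partial x}\right|_{x = x_0} .
\]
```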