Thank you! I'm wondering whether this limits the search space to operations that can be combined. For example, global average pooling could not be one of the candidate operations alongside convolutions that roughly preserve the spatial dimensions, right? My other worry is theoretical. By updating the alphas, aren't we modifying the loss surface of the network's parameters? Couldn't that cause training to fail to converge? Hopefully you have time to answer. Thank you again.
@arentsteen5452 · 6 months ago
Very interesting. Thank you!
@Ssc2969 · 1 year ago
Thank you so much for this lecture. I just wanted to clarify one question. If I am understanding correctly, each operation in DARTS has two sets of parameters: alpha (the operation strength or operation coefficient) and the actual weights (w) of the operation (which are probably matrices for a CNN). Initially, all these operation weights (w) and alphas are random and fixed at the start of the process; we are NOT manually intervening on them. The question is then: in contrast to usual CNN architectures, does DARTS assign a new parameter, alpha, on top of the actual weights of these operations, and are these alphas all the same at the start of the process?

As I understand it from the DARTS paper, the set of all alphas is a vector, and the alpha associated with each operation is a scalar. Nodes are states or latent representations, and edges carry operation choices. Each intermediate node (j) then receives mixed operations, i.e. the sum over operations of [the softmax score based on that operation's alpha] multiplied by [the feature map obtained by applying that operation to x], where x is the feature map from the predecessor node (i). The operations in question are the choices on the edge connecting node (j) to its predecessor (i).

Now, DARTS minimizes the training loss by updating the weights of the operations, i.e. it optimizes the actual weights (matrices) of the operations by gradient descent. Using those weights, it then minimizes the validation loss by gradient descent too, and in this case it optimizes the alpha of each operation choice (i.e. the scalars that represent strength and are used in the softmax). So the alpha vector (the set of all individual alphas) is what changes here, I hope? The alphas we get at the end of the process are compared, and the operation with the highest value is chosen. Thus DARTS jointly optimizes the weights (via the training loss) and the alphas, i.e. the operation coefficients or strengths (via the validation loss).

Questions:
0) Is my understanding above correct?
1) What is the relation between alpha (operation strength) and the actual weights (the weights of the operation, e.g. the weight of a conv filter, which is a matrix that is optimized during training)?
2) How are the alphas initialized at the beginning? Are these scalars randomly assigned, or are they all the same at the beginning? We are not manually assigning weights to these operations, right?
3) The authors also assume that each cell has 2 input nodes and one output node. While there can be a single output node, in my implementation I used more than 2 input nodes, or even just 1. Since equation (1) of the DARTS paper just sums the operations from each predecessor node, I don't think this loses generality. Please let me know your thoughts.

I would be very grateful if you could clarify this. Lectures on NAS are very rare, so there are few chances to ask questions like this. Thanks a lot for uploading your videos. It helps!
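For reference, the mixed operation and the two-level optimization described in this comment can be written roughly as follows. This is a sketch in the commenter's convention, where alpha denotes the pre-softmax strength of each operation; the symbols $\mathcal{O}$ (the set of candidate operations on an edge), $\bar{o}$ (the mixed operation) and $w^{*}$ (the weights that minimize the training loss for a given alpha) are notation assumed here and may differ from the lecture slides.

```latex
% Mixed operation on edge (i, j): softmax over the alphas of the candidate operations
\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}}
  \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)} \, o(x)

% Intermediate node j sums the mixed operations over all of its predecessors i
x^{(j)} = \sum_{i < j} \bar{o}^{(i,j)}\big(x^{(i)}\big)

% Alphas are optimized on the validation loss, weights on the training loss
\min_{\alpha} \; \mathcal{L}_{\mathrm{val}}\big(w^{*}(\alpha), \alpha\big)
\quad \text{s.t.} \quad
w^{*}(\alpha) = \arg\min_{w} \; \mathcal{L}_{\mathrm{train}}(w, \alpha)
```

The sum over predecessors in the second line places no restriction on how many input nodes a cell has, which is the point raised in question 3.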
@aixplained4763 · 1 year ago
Great questions!
0) Yes, very good!
1) The weights, by and large, determine the behavior of the neural network. They compute the next hidden state (embedding) from the received input. Assume we go from the input X to the first hidden layer, and that we have 3 operations to choose from. Each of these 3 operations computes its own hidden embedding. The alphas are used to mix/weigh these embeddings into a single hidden state for the next layer. The alphas are basically the output of the softmax applied to the operation strengths of these 3 operations. In the lecture, the operation strengths are unfortunately called w, but these are not the same as the weights/matrices of the neural network!
2) The alphas are usually initialized so that the operations are mixed uniformly. A commonly used default is 0 for all operation strengths, so that applying a softmax over them to compute the alphas gives a uniform distribution. For example, if we have 4 operations and initialize each operation strength to 0, the alphas (obtained by taking a softmax over the operation strengths) would all be 0.25.
3) I agree with you. You could use any number of inputs per node, and the DARTS algorithm can still be applied without any problem.
I hope this helps! Best of luck with your application :)
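To make points 1) and 2) concrete, here is a minimal pure-Python sketch. It follows the terminology of this reply (operation strengths before the softmax, alphas after it); the function names and the toy embeddings are made up for illustration and are not taken from the lecture or from any DARTS implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_output(strengths, op_outputs):
    """Mix the embeddings produced by each candidate operation.

    strengths:  one scalar operation strength per candidate operation
    op_outputs: one embedding (list of floats) per candidate operation
    """
    alphas = softmax(strengths)
    dim = len(op_outputs[0])
    return [
        sum(alpha * out[k] for alpha, out in zip(alphas, op_outputs))
        for k in range(dim)
    ]


# Four candidate operations, all strengths initialized to 0 (assumed default).
strengths = [0.0, 0.0, 0.0, 0.0]
alphas = softmax(strengths)  # uniform mixing: each alpha is 0.25

# Toy embeddings standing in for the outputs of the four operations.
op_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mixed = mixed_output(strengths, op_outputs)  # element-wise average here
```

With all strengths equal, the mixed state is simply the average of the four embeddings. Once the strengths are updated on the validation loss, the mix shifts toward the operations with larger strengths.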
@aixplained4763 · 1 year ago
Happy that it was helpful! Unfortunately, I do not have LinkedIn, otherwise I would have loved to connect. I can give you my email though, if you want :)
@Ssc2969 · 1 year ago
@aixplained4763 Thank you very much, Professor. That sounds great! I can drop you an email then. Thanks a lot.
@MonkkSoori · 6 months ago
Does it make sense to speak of using regularization like L1 in the DARTS equations to make sure that some of the connections to some of the considered components of the model architecture (the "architecture coefficients" you mentioned) are reduced to zero to end up with a discrete architecture instead of a soft one?
@aixplained4763 · 6 months ago
It is definitely possible to use L1 regularization, although it should be noted that this will not immediately lead to a discrete architecture. So we will still need to retrain the model after having selected the discrete components. We can thus ask ourselves whether we gain anything from the L1 regularization if we select the top-K architectural components and have to perform retraining afterward anyway.
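A minimal sketch of the top-K selection step mentioned above, assuming the learned operation strengths on one edge are available as a plain dictionary. The operation names and the numbers are invented for illustration; they are not results from any experiment.

```python
def top_k_operations(strengths, k):
    """Return the names of the k operations with the largest strengths."""
    ranked = sorted(strengths.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical learned strengths on a single edge (illustrative values only).
edge_strengths = {
    "conv_3x3": 1.2,
    "conv_5x5": 0.4,
    "max_pool": -0.3,
    "skip": 0.05,
}

selected = top_k_operations(edge_strengths, k=1)
```

None of the strengths in this toy example is exactly zero, so the mixture is still soft before selection. Whether or not a sparsity penalty was used during the search, the selected discrete architecture is then retrained.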