CS885 Lecture 8a: Multi-armed bandits

21,971 views

Pascal Poupart

Comments: 17
@rafaelmosca6243 · 5 years ago
Excellent Lectures! Everything crystal clear.
@cbandle5050 · 1 year ago
Suboptimal action != worst action (unless there are only two actions). This leads to several issues with the proof at 44:20:
1) The 2nd bullet does not follow from the 1st. R(a) >= R(a') for all a' means that action a is actually optimal, which contradicts bullet 1, where action a is defined as suboptimal.
2) Assuming that action a is suboptimal only implies that there *exists* an a' such that R(a') > R(a). We cannot deduce universal quantification, as it is possible that R(b) < R(a) < R(a') (here action a is suboptimal but not the worst).
3) We also cannot fix this by assuming, for the proof by contradiction, that we converged to the worst arm a, since the negation of "converging to the best arm" is "not converging, or converging to an arm which is not the best".
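Restating the quantifier point above in the lecture's R(a) notation (the action-set symbol \mathcal{A} is an assumption here, not taken from the slides):

```latex
\begin{align*}
a \text{ suboptimal} &\iff \exists\, a' \in \mathcal{A}:\; R(a') > R(a)\\
a \text{ worst}      &\iff \forall\, a' \in \mathcal{A}:\; R(a') \ge R(a)
\end{align*}
% The two coincide only when |\mathcal{A}| = 2; with three arms,
% R(b) < R(a) < R(a') makes a suboptimal but not the worst.
```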
@user-co6pu8zv3v · 2 years ago
Excellent lecture!
@shivangitomar5557 · 1 year ago
Amazing lecture! Thank you!!
@Bryan-bh7cy · 1 year ago
well damn it's really good
@umarkhan-hu7yt · 1 year ago
Please suggest some books to go more in depth on your lectures. I am following Reinforcement Learning by Richard Sutton.
@ryanjadidi8622 · 4 years ago
At timestamp 44:20, shouldn't it be "assumption that a' is suboptimal" rather than a?
@andresacostaescobar · 5 years ago
Hi. Where can I find a proof of the statement that P(a_t != a*) ~ epsilon, and of the fact that epsilon = 1/t leads to logarithmic regret?
@NilavraPathak · 4 years ago
@andresacostaescobar The sum of the inverses of t is a log series as it tends to infinity. The easiest explanation is that if you integrate 1/t, you get log t.
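A worked version of this integral argument, with Δ standing for the worst per-step regret (a symbol assumed here, not used in the thread):

```latex
\sum_{t=1}^{T}\frac{1}{t} \;\approx\; \int_{1}^{T}\frac{dt}{t} \;=\; \ln T,
\qquad\text{hence}\qquad
\mathbb{E}\bigl[\mathrm{Regret}(T)\bigr] \;\lesssim\;
\Delta \sum_{t=1}^{T}\varepsilon_t \;=\; \Delta\sum_{t=1}^{T}\frac{1}{t} \;=\; O(\log T).
```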
@sudhas_world · 4 years ago
To prove that P(a_t != a*) ~ epsilon, you just need to note that the epsilon-greedy algorithm explores with probability epsilon at each iteration. Hence, on average, the probability that the arm chosen at any time step is not the optimal arm is roughly epsilon.
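A minimal sketch of the selection rule this comment describes (the function name and the uniform-explore choice are illustrative, not the lecture's code):

```python
import random

def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore uniformly with prob. epsilon, else pick the greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

# On an exploration step the arm is uniform over K arms, so the chance of
# missing a* is epsilon * (K - 1) / K ~ epsilon, as the comment argues
# (ignoring whatever error the greedy step itself makes early on).
```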
@Cindy-md1dm · 4 years ago
The purpose of epsilon is to increase exploration. If we try to reduce the probability of choosing a suboptimal arm by reducing epsilon, it will go back to the greedy strategy. Then what's the point?
@m-aun · 4 years ago
You explore at the start to find the best action. Once that is estimated, you choose the best action (exploit) and not suboptimal actions. You want your regret to be low.
@jocarrero · 4 years ago
True, as epsilon decreases you become greedy, but before reaching that point you have explored a lot! So the theoretical result is telling you: (a) decrease the rate of exploration over time, (b) but not too fast, i.e. slower than 1/t. This type of result is typical in the theory of stochastic processes. It is related to the way stochastic processes are constructed. The classical book by Loève, vol. 2, has a beautiful chapter using this type of condition to prove that a "good process" can be constructed on a Kolmogorov space.
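To illustrate the trade-off being discussed, here is a minimal simulation sketch, not from the lecture: Bernoulli arms with illustrative means, comparing a fixed epsilon against a c/t decay schedule (the constant c is a free parameter; decaying too aggressively can lock onto a bad arm before the estimates are reliable, which is the commenter's "not too fast" point).

```python
import random

def eps_greedy_regret(means, T, eps_schedule, seed=0):
    """Run epsilon-greedy on Bernoulli arms and return total expected regret."""
    rng = random.Random(seed)
    counts = [0] * len(means)      # pulls per arm
    q = [0.0] * len(means)         # empirical mean reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, T + 1):
        if rng.random() < eps_schedule(t):
            a = rng.randrange(len(means))                  # explore
        else:
            a = max(range(len(means)), key=q.__getitem__)  # exploit
        reward = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]                # incremental mean
        regret += best - means[a]                          # expected per-step regret
    return regret

means = [0.3, 0.5, 0.7]  # illustrative arm means
for T in (1_000, 10_000, 100_000):
    fixed = eps_greedy_regret(means, T, lambda t: 0.1)                 # grows linearly in T
    decay = eps_greedy_regret(means, T, lambda t: min(1.0, 5.0 / t))   # log-like if estimates converge
    print(f"T={T}: fixed-eps regret {fixed:.1f}, decayed-eps regret {decay:.1f}")
```

With the fixed schedule, the exploration steps alone contribute regret proportional to 0.1 * T, so it grows linearly, matching the thread's point about why epsilon must decay.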
@jocarrero · 4 years ago
Just one more comment on the rate at which epsilon decreases. Look at Kolmogorov's 0-1 Law. This law tells you what happens if you keep sampling independently from a distribution.
@Lcipolina · 3 years ago
@jocarrero Which book?
@hsujerry7231 · 1 year ago
The greedy strategy also gets regret O(n) if rewards are bounded.
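The bound referred to here is the trivial one that holds for any strategy once per-step regret is bounded (say rewards in [0, 1]):

```latex
\mathrm{Regret}(n) \;=\; \sum_{t=1}^{n}\bigl(R(a^\ast) - R(a_t)\bigr)
\;\le\; n \cdot \max_{a}\bigl(R(a^\ast) - R(a)\bigr) \;=\; O(n).
```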
@PeekPost · 5 years ago
amazing
Related videos

CS885 Lecture 8b: Bayesian and Contextual Bandits
1:17:00
Pascal Poupart
13K views
The Contextual Bandits Problem
54:29
Simons Institute
23K views
Bandit Algorithms - 1
1:34:05
ICTP Quantitative Life Sciences
9K views
The Contextual Bandits Problem: A New, Fast, and Simple Algorithm
1:00:56
Microsoft Research
13K views
Multi-Armed Bandit : Data Science Concepts
11:44
ritvikmath
89K views
Multi-Armed Bandits: A Cartoon Introduction - DCBA #1
13:59
Academic Gamer
42K views
Machine learning - Bayesian optimization and multi-armed bandits
1:20:30
Nando de Freitas
130K views
Multi-Armed Bandits 1 - Algorithms
13:35
Cynthia Rudin
9K views
Multi-Armed Bandits and A/B Testing
19:01
Jay Feng
6K views
Contextual Bandits
12:32
Reinforcement Learning
12K views