Finding Policies Two - Georgia Tech - Machine Learning

Views: 12,057

Watch on Udacity: www.udacity.co...
Check out the full Advanced Operating Systems course for free at: www.udacity.co...
Georgia Tech online Master's program: www.udacity.co...

Comments: 10

@chaitanyasharma6270 2 years ago

I liked the analogies you were using previously. I think of this as telling someone "higher" or "lower" when you ask them to guess a number: they eventually converge.

@bharadwajadi18 7 years ago

Great video. Thanks for making this video :)

@ZohreYahyaee 9 months ago

I have a question: what is U(s)? In the previous videos, utility was defined over a sequence of states as U(s0, s1, s2, ...) = sum of R(si), with or without the discount factor gamma. Since R(si) is defined in the model, U(si) is also defined, so we don't need to choose it arbitrarily.
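For readers with the same question, the two quantities can be written side by side. This is a sketch in the usual MDP notation, assuming the rewards R, transition model T, and discount factor gamma used in this thread; the symbols \pi^* (the optimal policy) and \hat{U}_0 (the starting estimate) are introduced here for illustration and do not appear in the comments.

```latex
% Utility of one particular sequence of states (fixed once R and \gamma are given):
U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t)

% Utility of a single state: the expected utility of the sequences that
% start in s when the agent behaves optimally afterwards:
U(s) = \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; \pi^{*},\ s_0 = s \right]

% U(s) satisfies the Bellman equation, which has no closed-form solution
% because of the max:
U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')

% Only the starting estimate is arbitrary; each update refines it:
\hat{U}_{t+1}(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, \hat{U}_{t}(s')
```

Read this way, U(s) is indeed fixed by the model, as the comment says. What gets picked arbitrarily is the initial estimate \hat{U}_0 that the iteration starts from.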

@MinhVu-fo6hd 7 years ago

I think it helps that the incorrect initial guess is discounted when it is added to the true reward. This moves the next guess in the correct direction, so the estimates converge to the true utility faster.
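The point about the discounted guess can be checked numerically. The following is a minimal sketch, not the course's code: the three-state MDP, its rewards, and the function names are hypothetical and chosen only for illustration. It runs the same update from two very different initial guesses and shows that both end up at the same utilities.

```python
def bellman_update(U, T, R, gamma):
    """One sweep: U'(s) = R(s) + gamma * max_a sum_s' T[s][a][s'] * U[s']."""
    return [
        R[s] + gamma * max(
            sum(p * U[s2] for s2, p in enumerate(T[s][a]))
            for a in range(len(T[s]))
        )
        for s in range(len(R))
    ]


def value_iteration(U0, T, R, gamma, sweeps=200):
    """Apply the update repeatedly, starting from the guess U0."""
    U = list(U0)
    for _ in range(sweeps):
        U = bellman_update(U, T, R, gamma)
    return U


# Hypothetical 3-state chain; T[s][a] lists the probability of each next state.
# Action 0 stays put; action 1 advances with probability 0.8. State 2 is absorbing.
T = [
    [[1.0, 0.0, 0.0], [0.2, 0.8, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.2, 0.8]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
R = [-0.04, -0.04, 1.0]
gamma = 0.9

guess_a = [0.0, 0.0, 0.0]
guess_b = [50.0, -30.0, 7.0]   # deliberately far from the truth

U_a = value_iteration(guess_a, T, R, gamma)
U_b = value_iteration(guess_b, T, R, gamma)
print(U_a)
print(U_b)
```

The rewards enter each update at full strength while the previous estimate is multiplied by gamma, so the largest disagreement between any two estimates shrinks by at least a factor of gamma per sweep. That is why the starting guess stops mattering.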

@oldcowbb 2 years ago

They mentioned that, actually.

@bikcrum 2 years ago

Is solving a non-linear equation iteratively somehow related to the Newton-Raphson method? (Take an initial guess and converge recursively.)

@IndrajitRajtilak 6 years ago

Why isn't U(S) equal to R(S) + max(U(S')), where max(U(S')) is the maximum utility among its neighbouring states? Each immediate neighbour already captures its own immediate neighbours. Why have a summation there? Why not define it as the reward in the current state plus the maximum utility of the next state? What am I missing?

@Arik1989 6 years ago

If still relevant: as I understand it, in the general case the model (T) is stochastic, so when deciding to take action (a), you have to consider that there are several states (s') that you can end up in. In the grid example, if you pick action (a) to go up, you have a chance of going up, left, or right, so you have to calculate the EXPECTED VALUE: the probability of reaching each possible state, T(s, a, s'), which is up (0.8), left (0.1), right (0.1), multiplied by the respective utility U(s') and summed over those states.

@mcab2222 3 years ago

Also, you are missing the discount (decay) factor. Just adding the next state's utility would neglect the discounting.
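The two replies above can be made concrete with a small calculation. This is a hedged sketch: the neighbour utilities, the reward of -0.04, and gamma = 0.9 are made-up numbers, and only the 0.8 / 0.1 / 0.1 split comes from the grid example mentioned in the thread. It compares the proposed "reward plus best neighbour" rule with a backup that takes the expectation and applies the discount.

```python
def expected_utility(outcomes, U):
    """Sum of T(s, a, s') * U(s') over the states one action can lead to."""
    return sum(p * U[s2] for s2, p in outcomes.items())


def bellman_backup(reward, actions, U, gamma):
    """R(s) + gamma * max over actions of the expected next-state utility."""
    return reward + gamma * max(expected_utility(o, U) for o in actions.values())


# Hypothetical current utility estimates of the neighbouring cells.
U = {"above": 1.0, "left": -1.0, "right": 0.5}

# Only the "up" action is modelled here, to keep the arithmetic visible:
# it goes up with probability 0.8 and slips left or right with 0.1 each.
actions = {"up": {"above": 0.8, "left": 0.1, "right": 0.1}}

reward, gamma = -0.04, 0.9

# Proposed rule: assumes the best neighbour is reached for certain, undiscounted.
naive = reward + max(U.values())

# Expectation over outcomes, then discount.
backed_up = bellman_backup(reward, actions, U, gamma)

print(naive, backed_up)
```

With these numbers the naive rule gives about 0.96, while the expected, discounted backup gives about 0.635 (0.9 times 0.75, minus 0.04). The gap comes from the 20% chance of slipping sideways and from the discount on future utility.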