Does the overheated state at 1:20:00 correspond to the pit in the grid world? This is confusing me because in the grid world there was an exit action we had to take to get the negative reward, but here Prof. Klein mentions that you receive the reward when you transition into the state.
@nate7368 · 3 years ago
Yes, they're both terminal states. There are different ways to construct the MDP. In Grid World it was arbitrarily decided that you have to leave the terminal state to get the +1/-1 reward. They could instead have decided to give the reward for transitioning into the state.
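To make the two conventions concrete, here is a small Python sketch of both reward functions. The grid coordinates, state names, and function names are made up for illustration; they are not taken from the course code or the lecture slides.

```python
# Two equivalent ways to attach the terminal +1/-1 reward in a grid-world MDP.
# Layout and names are hypothetical placeholders for illustration only.

GOAL, PIT = (3, 2), (3, 1)   # hypothetical terminal squares
DONE = "DONE"                # absorbing state after the episode ends

def reward_on_exit(s, a, s_next):
    """Grid World convention: the reward is paid when you take the
    'exit' action from a terminal square."""
    if a == "exit" and s == GOAL:
        return +1.0
    if a == "exit" and s == PIT:
        return -1.0
    return 0.0

def reward_on_entry(s, a, s_next):
    """Alternative convention (as described for the overheated state):
    the reward is paid on the transition *into* the terminal square."""
    if s_next == GOAL:
        return +1.0
    if s_next == PIT:
        return -1.0
    return 0.0
```

Either choice defines a valid MDP; the values just shift by one time step around the terminal states.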
@quiteSimple24 · 6 years ago
Question: 1:12:08 Why is the optimal policy (arrow) next to the fire pit north? I think it should be west. If you choose north, there is a chance of falling into the fire pit and no chance of getting a diamond, so the sum of discounted rewards is less than zero. If you choose west, it is always zero. I also wonder why the arrow in the second state of the top row heads east. If it is a matter of tie-breaking, then why do the other states head north?
@shivendraiitkgp · 5 years ago
Did you ever get an answer to these doubts? I have the same ones. I have one more doubt: why does the other green state have V_2(s) = 0.72? I thought it should be 0.8*1 + 0.1*0 + 0.1*(0.8*0 + 0.1*0 + 0.1*(-1)) = 0.79
@AkshayAradhya · 5 years ago
I had the same question. I think the optimal policy next to the fire pit should be WEST and not NORTH
@akshara08 · 5 years ago
I think it should be north because we are calculating values for just 2 time steps; the expected discounted reward for going north is > 0, since it also accounts for the probability of moving in the intended direction.
@rogertrullo8272 · 4 years ago
The optimal policy at that specific time step suggests that the best action is to go north (the value is 0.72, which is greater than zero). This value already takes into account the chance of failing. The same goes for the second state in the top row. The other states are pointing north, but they could be pointing anywhere because their values are all the same (zero); note, however, the bottom-right state, which could point anywhere except north, because that value is -1, which is less than zero. This is clearer in the next video, in the part called policy extraction (extracting actions from values).
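For anyone trying to reproduce this, here is a minimal sketch of the one-step lookahead the reply describes (policy extraction). The transition model `T`, reward function `R`, and the value dictionary `V` are hypothetical placeholders, not the course's actual Gridworld code:

```python
def extract_policy(states, actions, T, R, V, gamma=0.9):
    """One-step lookahead: in each state, pick the action whose
    expected discounted value under the current V is largest."""
    policy = {}
    for s in states:
        def q_value(a):
            # Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V(s')]
            return sum(p * (R(s, a, s_next) + gamma * V[s_next])
                       for s_next, p in T(s, a))
        policy[s] = max(actions(s), key=q_value)  # ties broken arbitrarily
    return policy
```

Because ties are broken arbitrarily, states whose Q-values are all zero can legitimately point in any direction, which is exactly the situation described above.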
@LoveIsTheCure001 · 4 years ago
@@shivendraiitkgp You forgot to multiply by the discount factor of 0.9, so 0.8*0.9 = 0.72
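Written out as the full one-step Bellman backup (noise 0.2, discount 0.9, living reward 0, as in the lecture's grid), the arithmetic is below; the variable names are mine, chosen only for this check:

```python
gamma, noise = 0.9, 0.2
p_intended, p_slip = 1 - noise, noise / 2

# V_1 after the first iteration: only the square that exits to the diamond
# has value 1; the neighbours of the state in question are still 0.
v1_target, v1_side1, v1_side2 = 1.0, 0.0, 0.0

# One Bellman backup with living reward 0:
v2 = (p_intended * (0 + gamma * v1_target)
      + p_slip   * (0 + gamma * v1_side1)
      + p_slip   * (0 + gamma * v1_side2))
print(round(v2, 2))  # 0.72
```

The 0.79 in the question comes from applying 0.8*1 without the discount factor; with γ = 0.9 the intended-direction term is 0.8*0.9 = 0.72.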
@jonsnow9246 · 3 years ago
Why doesn't expectimax work? 1:07:00
@alexandrebrownAI · 1 year ago
See 1:06:35. Essentially, it's because the search tree becomes very deep when in fact there are only 3 states, repeated over and over again; the tree also goes on forever. Using expectimax would be doing "hard work" instead of "smart work"; it is not the appropriate algorithm for such cases. What is mentioned at 1:07:24 is that if you try to use expectimax with tricks such as caching and limiting the depth, then you actually end up close to the value iteration algorithm, which is the algorithm better suited to these situations. So expectimax alone is not the best choice, and one should consider more appropriate techniques like value iteration. Hope this helps.
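For concreteness, here is a minimal value iteration sketch over a small racing-style MDP like the one in the lecture (cool/warm/overheated, slow/fast). The transition probabilities and rewards below are my reconstruction, so double-check them against the slides before relying on the exact numbers:

```python
# Minimal value iteration sketch. The MDP numbers are a reconstruction
# of the lecture's racing example, not copied from the slides.
MDP = {
    "cool": {
        "slow": [(1.0, "cool", +1)],
        "fast": [(0.5, "cool", +2), (0.5, "warm", +2)],
    },
    "warm": {
        "slow": [(0.5, "cool", +1), (0.5, "warm", +1)],
        "fast": [(1.0, "overheated", -10)],
    },
    "overheated": {},  # terminal: no actions, value stays 0
}

def value_iteration(mdp, gamma=1.0, iterations=10):
    V = {s: 0.0 for s in mdp}
    for _ in range(iterations):
        V_new = {}
        for s, actions in mdp.items():
            if not actions:              # terminal state
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
        V = V_new                        # synchronous update, as on the slide
    return V

print(value_iteration(MDP, gamma=1.0, iterations=2))
# with these numbers: {'cool': 3.5, 'warm': 2.5, 'overheated': 0.0}
```

Unlike expanding the expectimax tree, each iteration touches every state exactly once, so the repeated states and the infinite depth are no longer a problem.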