For those unaware: Pieter Abbeel is a legend. Show some gratitude. ✌️
@yihongliu7326 · a year ago
Thank you, Pieter. More than a year ago, when I first learned RL, I searched for something like this online but couldn't find it. This is very helpful and amazing :)
@hassaannaeem4374 · 2 years ago
Awesome series. Thanks, Pieter.
@dermitdembrot3091 · 3 years ago
How widely is this "backprop through Q" used in continuous control? Does e.g. SAC use it?
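For anyone curious what "backprop through Q" looks like in code, here is a minimal PyTorch-style sketch of the deterministic-actor version (DDPG/TD3 flavor); the actor, critic, and optimizer objects are assumed to exist. SAC applies the same idea with a reparameterized stochastic policy plus an entropy bonus:

import torch

def actor_update(actor, critic, actor_optimizer, states):
    # Actions come from the actor, so gradients flow from Q back into
    # the actor's parameters through the action argument.
    actions = actor(states)
    actor_loss = -critic(states, actions).mean()  # maximize Q => minimize -Q

    actor_optimizer.zero_grad()
    actor_loss.backward()  # chain rule through a = pi(s), i.e. "backprop through Q"
    actor_optimizer.step()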
@sourjyasarkar · 3 years ago
In the learning step of the Soft Actor-Critic algorithm, shouldn't it be Q_{\phi}(s_t, a_t) and \pi_{\theta}(a_t | s_t), assuming the parameters of the Q-value network are \phi and those of the policy network are \theta?
@dermitdembrot3091 · 3 years ago
Yes, it should, to keep the slides consistent, but it's no big issue. Maybe the papers use the Greek letters differently.
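For reference, with that convention the SAC policy-improvement objective would read as follows (a sketch in the standard notation, assuming \alpha is the entropy temperature and D the replay buffer):

J_\pi(\theta) = \mathbb{E}_{s_t \sim D,\; a_t \sim \pi_\theta} \left[ \alpha \log \pi_\theta(a_t \mid s_t) - Q_\phi(s_t, a_t) \right]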
@bashkirovsergey · 2 years ago
Hello! At 3:38 I can't understand Q(t) = r(t) + gamma * Q(t+1). Shouldn't it be Q(t) = r(t) + gamma * V(t+1)? I'd appreciate any hint! Thank you!
@ioannis.g.tzolas · 2 years ago
My understanding is that the action value equals the state value because the policy is deterministic.
@rezamaz8975 · a year ago
In general, the value function tells us how good a state is, and here it seems to be equal to the Q function.
@ky8920 · a year ago
I think it uses a single sample to estimate the expectation (weighted sum) over Q_{t+1}.
@aakashrana3036 · a year ago
I think the action is taken from the policy; have a look at the DDPG algorithm pseudocode for more clarity (see the sketch below).
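To make the thread above concrete, here is a minimal PyTorch-style sketch of the DDPG target computation, assuming target networks target_actor and target_critic exist. Because the next action comes from the deterministic policy, Q(s', mu(s')) plays the role of V(s'), which is why the slide can write r(t) + gamma * Q(t+1):

import torch

def ddpg_targets(target_actor, target_critic, rewards, next_states, dones, gamma=0.99):
    # With a deterministic policy mu, V(s') = Q(s', mu(s')), so the bootstrap
    # target r + gamma * Q(s', mu(s')) equals r + gamma * V(s').
    with torch.no_grad():
        next_actions = target_actor(next_states)
        next_q = target_critic(next_states, next_actions)
        # dones is a 0/1 float tensor marking terminal transitions
        targets = rewards + gamma * (1.0 - dones) * next_q
    return targets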
@arizmohammadi5354 · a year ago
I appreciate your efforts
@nguyenmanh466 · 4 months ago
10:00 that flexing :"D
@levizhou6726 · 2 years ago
Thank you, sir.
@arisioz · 2 years ago
Compared to the other banger lectures, this one was analyzed very briefly :(