[Seminar] Imitation learning and GAIL

157 views

강형엽 IIIXR LAB

A day ago

Comments: 2
@IIIXRLab · 3 months ago
Read the overall evaluation carefully, and respond with how you plan to address each point moving forward.

[Overall]
Good structure and flow.
Lack of explanation of complex concepts: the explanations of difficult concepts were too superficial, which made it seem like you might not fully understand them. I suggest you revisit these sections and consider whether the audience would truly comprehend your explanations.
Code section too simple: I expected the code section to cover the actual code and its flow comprehensively. However, you focused only on a narrow aspect and directed us to GitHub for the rest, which was disappointing. Please create a supplementary PowerPoint that explains the code flow in detail.

[Question]
On slide 9, you referred to the entropy as the "policy's probability distribution," but how does log π(a|s) function as a distribution? Also, remember to italicize terms like action a and state s.
On slide 11, you mentioned that GAIL is similar to GAN and included a good illustration for GAN. Wouldn't it be more effective to create a similar illustration for GAIL to aid understanding?
On slide 12, there is a reward function shown at the bottom left, which is then connected with an arrow to the reward function c(s,a). What does this signify? The meaning is unclear.
On slide 13, you used the term "similar wrt ψ" and included a diagram with "how?" written next to it. I am not sure what this diagram represents. Additionally, on slide 12 there was an arrow pointing to the reward function, but it is gone here. Why?
On slide 15, why is the statement "changing the problem to finding the policy through the occupancy measure" significant?
On slide 16, why is the "convex conjugate" suddenly introduced? What is it, how is it used, and why is it important? This needs a more detailed explanation.
On slide 18, why does "cost far from 0 = low penalty, close to 0 = high penalty" make sense? Where is this method commonly used?
On slide 19, despite including the pseudocode and putting effort into it, the explanation is lacking, and I do not understand it clearly. Could you provide a better explanation, particularly for lines 4 and 5 of the pseudocode?
@황주영-j7y · 2 months ago
Thank you for the feedback; I have supplemented the missing explanations below, and I will put the parts that are difficult to express in text into the supplementary PPT.

On slide 9: There were some unclear points in the description of entropy, which I will correct as follows. π(a|s) is the probability distribution over actions a taken in state s, and in reinforcement learning it is called the policy. Entropy is, in general, a number that expresses the degree of uncertainty of a distribution; in reinforcement learning, a high-entropy policy tries a variety of actions in each state, which can be described as exploring the environment.

On slide 10: I will add a page covering the code.

On slide 12: The upper line is the reinforcement-learning step, and the lower line is the inverse-RL step. The diagram means that the reward function estimated from the expert's trajectories is used as the reward function of the reinforcement-learning step (the cost function c(s,a) here), and the two steps are then repeated.

On slide 13: Existing inverse-RL methods had to go through all six steps shown on that page. GAIL tries to skip these intermediate steps and recover the optimal policy directly from the expert's trajectories. To do this, it proposes making the policy we are looking for similar to the expert's, where "similar" is measured with respect to ψ. Here ψ is the function described on slide 18; the GAIL authors chose a ψ whose resulting objective has the same form as the GAN objective (GAN being an already validated method).

On slide 16: The bottom formula is the inverse-RL formulation described earlier. The convex conjugate is used to rewrite it in the same form as the objective claimed by GAIL.

On slide 18: The difference ρ_π − ρ_{π_E} between the occupancy measure we are optimizing and the expert's is taken as the input of ψ (more precisely, of its conjugate ψ*); the cost function is set so that the closer this difference is to zero, the lower the penalty.

On slide 19: Line 4 is the inverse-RL update and line 5 is the reinforcement-learning policy update. I added the algorithm to show that, unlike earlier methods, the two steps are updated together in the same loop. Line 4 uses the objective derived on slide 18, and line 5 is a TRPO step; other algorithms such as PPO can be used for line 5 instead.
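To make the slide 9 and slide 16 answers concrete, the relevant definitions can be written out as follows. This is my own summary sketch in the notation of the GAIL paper (Ho and Ermon, 2016); the discount factors inside the causal entropy and the occupancy measure are omitted for brevity.

H(\pi) = \mathbb{E}_{\pi}\big[-\log \pi(a \mid s)\big]
\psi^{*}(x) = \sup_{c}\big(x^{\top} c - \psi(c)\big)
\mathrm{RL} \circ \mathrm{IRL}_{\psi}(\pi_E) = \arg\min_{\pi} \; -H(\pi) + \psi^{*}(\rho_{\pi} - \rho_{\pi_E})

The first line is the entropy bonus from slide 9: it is large when π(·|s) spreads probability over many actions, i.e. when the policy explores. The second line is the standard definition of the convex conjugate asked about for slide 16. The third line is the paper's key result and also addresses the slide 15 question: running RL on the cost recovered by ψ-regularized IRL is equivalent to a single optimization that pushes the policy's occupancy measure ρ_π toward the expert's ρ_{π_E}, which is why the problem becomes "finding the policy through the occupancy measure."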
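For the slide 19 pseudocode, here is a minimal sketch of what one iteration of lines 4 and 5 does, written in Python/PyTorch. Everything here is a placeholder for illustration, not the seminar's actual code: policy, discriminator, the two batches, the optimizers, and the method policy.log_prob_and_entropy are names I am assuming, and the TRPO update of line 5 is simplified to a plain policy-gradient step.

import torch
import torch.nn.functional as F


def gail_iteration(policy, discriminator, policy_batch, expert_batch,
                   d_optimizer, p_optimizer, entropy_coef=1e-3):
    # One GAIL iteration. Line 4 of the slide 19 pseudocode = discriminator
    # (inverse-RL) update; line 5 = policy (RL) update. All objects passed in
    # are placeholders assumed for this sketch.

    # --- Line 4: discriminator update -------------------------------------
    # D(s, a) outputs a probability in (0, 1); it is trained toward 1 on
    # state-action pairs generated by the current policy and toward 0 on
    # the expert's pairs.
    d_pi = discriminator(policy_batch["states"], policy_batch["actions"])
    d_exp = discriminator(expert_batch["states"], expert_batch["actions"])
    d_loss = (F.binary_cross_entropy(d_pi, torch.ones_like(d_pi)) +
              F.binary_cross_entropy(d_exp, torch.zeros_like(d_exp)))
    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # --- Line 5: policy update ---------------------------------------------
    # The learned cost is log D(s, a): near 0 when the pair looks
    # policy-generated, strongly negative when it looks expert-like
    # (compare the slide 18 remark about costs near vs. far from 0).
    with torch.no_grad():
        cost = torch.log(discriminator(policy_batch["states"],
                                       policy_batch["actions"]) + 1e-8)
    # policy.log_prob_and_entropy is a placeholder returning log pi(a|s)
    # and the entropy of pi(.|s) for the sampled pairs.
    log_prob, entropy = policy.log_prob_and_entropy(policy_batch["states"],
                                                    policy_batch["actions"])
    # REINFORCE-style surrogate: descending this loss lowers the probability
    # of high-cost actions; the entropy term is the H(pi) bonus from slide 9.
    p_loss = (log_prob * cost).mean() - entropy_coef * entropy.mean()
    p_optimizer.zero_grad()
    p_loss.backward()
    p_optimizer.step()

In the real algorithm, line 5 would be a TRPO (or PPO) step on the reward −log D(s, a) with the entropy bonus, but the division of labour is the same: line 4 makes the discriminator better at telling policy samples from expert samples, and line 5 moves the policy toward state-action pairs the discriminator scores as expert-like.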