Please read the overall evaluation carefully and respond with how you plan to address each point going forward.

[Overall]
Good structure and flow.
Lack of explanation of complex concepts: the explanations of difficult concepts were too superficial, which made it seem like you might not fully understand them. I suggest you revisit these sections and consider whether the audience would truly comprehend your explanations.
Code section too simple: I expected the code section to cover the actual code and its flow comprehensively. However, you focused on only a narrow aspect and directed us to GitHub for the rest, which was disappointing. Please create a supplementary PowerPoint that explains the code flow in detail.

[Question]
On slide 9, you referred to the entropy as the "policy's probability distribution," but how does log π(a|s) function as a distribution? Also, remember to italicize terms like action a and state s.
On slide 11, you mentioned that GAIL is similar to GAN and included a good illustration for GAN. Wouldn't it be more effective to create a similar illustration for GAIL to aid understanding?
On slide 12, there is a reward function shown at the bottom left, which is then connected with an arrow to the reward function c(s,a). What does this signify? The meaning is unclear.
On slide 13, you used the term "similar wrt ψ" and included a diagram with "how?" written next to it. I'm not sure what this diagram represents. Additionally, on slide 12 there was an arrow pointing to the reward function, but it's gone here. Why?
On slide 15, why is the statement "changing the problem to finding the policy through the occupancy measure" significant?
On slide 16, why is the "convex conjugate" suddenly introduced? What is it, how is it used, and why is it important? This needs a more detailed explanation.
On slide 18, why does "cost is far from 0 = low penalty, close to 0 = high penalty" make sense? Where is this method commonly used?
On slide 19, despite including the pseudocode and putting effort into it, the explanation is lacking and I don't understand it clearly. Could you provide a better explanation, particularly for lines 4 and 5 of the pseudocode?
@황주영-j7y, 2 months ago
Thank you for the feedback; I have supplemented the missing explanations below, and I will add the parts that are hard to express in text to the PPT.

On slide 9: There were some unclear points in my description of entropy, which I will correct as follows. π(a|s) is the probability distribution over actions a given a state s, and in reinforcement learning it is called the policy. Entropy is, in general, a number that expresses the degree of uncertainty of a distribution; in reinforcement learning it measures how spread out the policy's action choices are in each state, so a policy with higher entropy tries a wider variety of actions, in other words, it explores the environment more. (A small numeric sketch follows at the end of this reply.)

On slide 10: I will add this together with the code pages.

On slide 12: The upper arrow is the reinforcement learning step and the lower arrow is the inverse RL step. The arrow means that the reward function estimated from the expert's trajectories (the cost function here) is plugged back in as the reward function of the reinforcement learning step, and the two steps are then repeated.

On slide 13: Existing inverse RL methods had to go through all six steps shown on that page. GAIL tries to skip over these intermediate steps and recover the optimal policy directly from the expert's trajectories. To do this, the paper proposes making the policy we are looking for similar, with respect to ψ, to the policy that generated the expert trajectories. Here ψ is the function described on slide 18; the GAIL authors chose a ψ whose resulting objective has the same form as the GAN objective (GAN being an already validated method).

On slide 16: The bottom formula is the inverse RL formulation described earlier. The point is that, by applying the convex conjugate, it can be rewritten in exactly the form claimed by GAIL. (The definition of the convex conjugate and the resulting objective are sketched after this reply.)

On slide 18: ψ (more precisely, its convex conjugate ψ*) takes as input the difference ρ_π − ρ_{π_E} between the occupancy measures of the learned policy and of the expert, which is what we are trying to optimize. The cost function is set up so that the closer this difference is to zero, the lower the penalty.

On slide 19: Line 4 is the inverse RL update (the discriminator) and line 5 is the reinforcement learning policy update. I included the algorithm to show that, unlike earlier methods, the two steps are updated alternately within the same loop. Line 4 uses the objective derived on slide 18, and line 5 is a TRPO step; other algorithms such as PPO can also be used in line 5. (A small training-loop sketch is included at the end of this reply.)
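Since the slide-9 question was how log π(a|s) can be a distribution, a small numeric sketch may help: π(·|s) is the distribution, and the entropy is the expectation of −log π(a|s) under it. This is a minimal illustration only; the probability values below are made up.

```python
import numpy as np

def policy_entropy(probs):
    """Entropy H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s) for one state."""
    probs = np.asarray(probs, dtype=np.float64)
    return -np.sum(probs * np.log(probs + 1e-12))

# Near-deterministic policy: almost always picks action 0 -> low entropy.
print(policy_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.17 nats

# Uniform policy over 4 actions: maximally uncertain -> high entropy.
print(policy_entropy([0.25, 0.25, 0.25, 0.25]))  # log(4) ~ 1.39 nats
```

So log π(a|s) by itself is not a distribution; π(·|s) is, and the entropy term λH(π) in the GAIL objective rewards policies that keep this quantity high, i.e., that keep exploring.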
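For the slide-16 question, it may help to state the definition explicitly. The convex conjugate is a standard tool from convex analysis, and, as far as I understand the paper, applying it to the regularizer ψ is what collapses the two-step "IRL then RL" procedure into a single occupancy-measure-matching objective (λ is the entropy weight):

```latex
\[
\psi^{*}(x) \;=\; \sup_{c}\; x^{\top}c \;-\; \psi(c),
\qquad\text{and the GAIL objective becomes}\qquad
\min_{\pi}\; \psi^{*}\!\bigl(\rho_{\pi}-\rho_{\pi_E}\bigr)\;-\;\lambda H(\pi).
\]
```

This also explains why slide 15 matters: once the problem is phrased in terms of occupancy measures, matching ρ_π to ρ_{π_E} replaces explicitly recovering a reward function.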
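Finally, for lines 4 and 5 of the slide-19 pseudocode, here is a minimal PyTorch-style sketch of the alternating update. The network names, toy dimensions, and random stand-in batches are my own, and a plain REINFORCE-style gradient step stands in for the TRPO update of line 5, so this illustrates the flow rather than the actual repository code.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

obs_dim, act_dim = 4, 2  # toy dimensions, chosen only for illustration

# Discriminator D(s,a): pushed toward 1 on policy pairs and 0 on expert pairs
# (logits here; sigmoid applied when the cost is computed below).
D = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
# Toy Gaussian policy pi(a|s): the network outputs the mean, std is fixed.
pi_mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

d_opt = torch.optim.Adam(D.parameters(), lr=3e-4)
pi_opt = torch.optim.Adam(pi_mean.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

# Stand-in batches; in the real code these come from the expert dataset and rollouts.
expert_sa = torch.randn(64, obs_dim + act_dim)
states = torch.randn(64, obs_dim)

for it in range(3):
    # Sample actions from the current policy for this batch of states.
    dist = Normal(pi_mean(states), 0.5)
    actions = dist.sample()
    policy_sa = torch.cat([states, actions], dim=1)

    # --- pseudocode line 4: discriminator / inverse-RL update ---
    d_loss = bce(D(policy_sa), torch.ones(64, 1)) + bce(D(expert_sa), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- pseudocode line 5: policy / RL update ---
    # Cost = log D(s,a): low when the discriminator thinks the pair looks expert-like.
    with torch.no_grad():
        cost = torch.log(torch.sigmoid(D(policy_sa)) + 1e-8)
    # REINFORCE-style surrogate standing in for TRPO; an entropy bonus for
    # -lambda * H(pi) could be added to this loss as well.
    pi_loss = (dist.log_prob(actions).sum(dim=1, keepdim=True) * cost).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```

Line 4 trains the discriminator to separate policy pairs from expert pairs, and line 5 then lowers the probability of actions the discriminator can still tell apart from the expert's, which is the alternating update I wanted the algorithm slide to convey.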