Improving Intrinsic Exploration with Language Abstractions (Machine Learning Paper Explained)

9,390 views

Yannic Kilcher

1 day ago

#reinforcementlearning #ai #explained
Exploration is one of the oldest challenges for Reinforcement Learning algorithms, with no clear solution to date. Especially in environments with sparse rewards, agents struggle to decide which parts of the environment to explore further. Providing intrinsic motivation in the form of a pseudo-reward is a common way to overcome this challenge, but it often relies on hand-crafted heuristics and can lead to deceptive dead-ends. This paper proposes to use language descriptions of encountered states as a measure of novelty. In two procedurally generated environment suites, the authors demonstrate the usefulness of language, which is inherently concise and abstract and therefore lends itself well to this task.
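To make the idea concrete, here is a minimal sketch (not the paper's actual method, which extends AMIGo and NovelD as discussed below) of an intrinsic bonus computed from language annotations rather than raw states; the annotation source and the reward scale are placeholder assumptions.

```python
from collections import defaultdict

class LanguageNoveltyBonus:
    """Toy count-based intrinsic reward over language descriptions of states.

    Instead of counting raw grid/pixel states, we count each state's
    natural-language annotation, so states that "mean the same thing"
    share a count and low-level variation is abstracted away.
    """

    def __init__(self, scale: float = 1.0):
        self.scale = scale
        self.counts = defaultdict(int)

    def bonus(self, description: str) -> float:
        # `description` is assumed to come from an environment annotator,
        # e.g. MiniGrid/MiniHack messages or a captioning model.
        self.counts[description] += 1
        # Classic 1/sqrt(N) count-based bonus: rare descriptions pay more.
        return self.scale / (self.counts[description] ** 0.5)

# Usage: add the bonus to the sparse extrinsic reward at every step.
novelty = LanguageNoveltyBonus(scale=0.1)
total_reward = 0.0 + novelty.bonus("you see a red key on the floor")
```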
OUTLINE:
0:00 - Intro
1:10 - Paper Overview: Language for exploration
5:40 - The MiniGrid & MiniHack environments
7:00 - Annotating states with language
9:05 - Baseline algorithm: AMIGo
12:20 - Adding language to AMIGo
22:55 - Baseline algorithm: NovelD and Random Network Distillation
29:45 - Adding language to NovelD
31:50 - Aren't we just using extra data?
34:55 - Investigating the experimental results
40:45 - Final comments
Paper: arxiv.org/abs/2202.08938
Abstract:
Reinforcement learning (RL) agents are particularly hard to train when rewards are sparse. One common solution is to use intrinsic rewards to encourage agents to explore their environment. However, recent intrinsic exploration methods often use state-based novelty measures which reward low-level exploration and may not scale to domains requiring more abstract skills. Instead, we explore natural language as a general medium for highlighting relevant abstractions in an environment. Unlike previous work, we evaluate whether language can improve over existing exploration methods by directly extending (and comparing to) competitive intrinsic exploration baselines: AMIGo (Campero et al., 2021) and NovelD (Zhang et al., 2021). These language-based variants outperform their non-linguistic forms by 45-85% across 13 challenging tasks from the MiniGrid and MiniHack environment suites.
Authors: Jesse Mu, Victor Zhong, Roberta Raileanu, Minqi Jiang, Noah Goodman, Tim Rocktäschel, Edward Grefenstette
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
LinkedIn: / ykilcher
BiliBili: space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 19
@xDMrGarrison 2 years ago
I could follow this video despite being a total ML noob. Thanks :) The paper was interesting too.
@alan2here 2 years ago
The teacher/student part reminds me of adversarial neural networks.
@alan2here 2 years ago
GPT-3, as epic as it is, does have a tendency to be like "1, 2, 3, 5, 6, 7, 8, 9" ... "there, I counted up to 8". Maybe a language model could be bolstered with a logic solver (Boolean algebra, perhaps) to help it with non-intuitive stuff. It would seem especially useful here.
@SimonJackson13 2 years ago
The RND one might perform better if the map-state novelty were used to train a post-state vector mapping that unifies different lingual factorizations of similar world-map states.
@oncedidactic 2 years ago
This is a cool idea!
@alan2here 2 years ago
BABA IS YOU, please :) Let's see it complete the hellishly hard "Baba Is You"; I'd be really interested to see whether AI beats humans at it. It would be a very insightful challenge, especially with a language model.
@altus1226 2 years ago
The title of this video made me think of a 3D video game with entity-AI NNs that view the world as an ASCII text game like Zork, after it has been turned into text by some other NN. Is the sense the title gave me accurate? I am not sure yet lol
@cennywenner516 2 years ago
Super exciting approach that I think should be exploited further in RL, especially once it can rely only on pre-trained models of manageable size. Frankly, I also think it would usually be fair game to assume a text description of the overall goal of the game, similar to the paragraph a human player would get. It would be interesting to know how this would affect convergence when used as an additional input for the student and/or teacher. E.g., the Montezuma's Revenge Wikipedia page contains "The objective is to score points by gathering jewels and killing enemies along the way."

It is probably obvious, but wouldn't this approach be much more powerful if the goal were evaluated as a property of the state rather than as the single generated description of the state? I.e., image + statement -> confidence. Each state may have multiple valid descriptions: be further up than the starting position, be at (5,6), be at the red key, get an item for getting past the red door, ... By requiring identical descriptions, it seems difficult to get the signal, to pre-train across games, or to succeed with open-ended generators like CLIP. The teacher could additionally output a confidence threshold and so learn suitable degrees of 'similarity', going from general to specific statements.

I suppose one challenge is then the larger action space of the teacher, but wouldn't it do with hill climbing, gradients, sampling, or complementing with a description generator? It is not clear that a game-specific condition generator is necessary if it is additionally trained to output/accept descriptions which are true for some but not all game states. The grounding and goal generator anyhow seem like a sample-and-accept setup, just missing the threshold parameter. E.g., discriminate on "triangle next to a key" rather than "image of an old-school game".
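A quick sketch of the image + statement -> confidence idea from this comment (purely illustrative; the embedding source, the similarity measure, and the threshold are assumptions, and the paper itself matches generated descriptions exactly rather than scoring them):

```python
import torch
import torch.nn.functional as F

def goal_satisfied(state_embedding: torch.Tensor,
                   goal_embedding: torch.Tensor,
                   threshold: float) -> bool:
    """Treat a goal description as a property of the state.

    Confidence here is just cosine similarity between a state embedding and
    a text embedding (e.g. from a CLIP-style model); the teacher could also
    emit the threshold, so it can move from general to specific statements.
    """
    confidence = F.cosine_similarity(state_embedding, goal_embedding, dim=-1)
    return bool((confidence >= threshold).item())

# A state can satisfy several descriptions at once ("be at the red key",
# "be further up than the start"), so each candidate goal is checked separately.
```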
@oncedidactic 2 years ago
18:10 Doesn't the "first description" keep shifting? Note the part: "t' is the minimum t where L =/= 0." I think maybe what they're saying is: do the first step up to time t, and then as t increases the "loss floor" increases, invalidating the old goal and making the next goal the new "first description". So it's providing a curriculum to the grounding network this way.

This seems to explain why to have the grounding network vs. just making that part of the teacher, if I'm understanding this correctly. You get to "curricularize" the goal-validity aspect for free because of the time-based loss which keeps ratcheting up.

This seems a bit of a hack in the "building in priors" sense: the model takes as given that in these environments you can only progress by doing a sequence of actions correctly, and it will (obviously) take more and more time to do these actions in sequence. That might seem painfully obvious, but leveraging it so explicitly for training seems brittle. It wasn't the research focus, however, so I don't want to be too critical. And the "do more stuff as time goes on" prior pairs nicely with the novelty reward. The time-ratcheted loss A+ / B- combined with the exploration reward is like a mode selector between "do a step" (which may be required to unlock more state space) and "gather more info".

Any plans for an RND video? That was new to me, sounds cool and I want to hear all the caveats haha.
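Reading the quoted condition the same way, a toy sketch of selecting the "first description" as the grounding target might look like this (assuming the environment returns an empty string when nothing noteworthy happens; the trajectory format is a placeholder, not the paper's data structure):

```python
def first_description(trajectory):
    """Return (t', description) for the earliest step whose language
    annotation is non-empty, i.e. the minimum t with L != 0 discussed above.

    `trajectory` is assumed to be a list of (state, description) pairs;
    if no step is annotated, there is no grounding target for this episode.
    """
    for t, (_, description) in enumerate(trajectory):
        if description:          # L != 0: something noteworthy happened
            return t, description
    return None
```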
@herp_derpingson 2 years ago
21:00 Where does the teacher get the natural language from?

21:51 It should be less, I think. Otherwise, if a student took infinite steps to reach a goal, the teacher would still be rewarded.

28:00 The RND algorithm is awesome. I was looking for a type of neural network which can tell if it has seen an image before. This might be it.

38:20 If entire sentences become one token, then how is it natural language anymore?

If we can make reinforcement learning agents do stuff by instructing them in natural language, then I think we have already reached sci-fi-level AI.
@Idiomatick 2 years ago
21:00 - Much of the text you see is part of the game environment (NetHack), which was originally a text-based game. But human labellers also described scenes.

21:51 - Nope. The goal is to stay in the zone of proximal development, so it should be more than t but less than infinite/fail.

38:20 - It isn't. The final product did not tokenize like that. That was an ablation experiment to see how much worse the system performs if you remove the natural-language elements.
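For reference, a minimal sketch of the teacher-reward scheme as described in the video and in this reply (the constants and the exact handling of failures are assumptions, not the paper's values):

```python
def teacher_reward(reached_goal: bool, steps_taken: int, t_star: int,
                   alpha: float = 1.0, beta: float = 1.0) -> float:
    """Teacher reward for a proposed goal, per the 'zone of proximal
    development' idea: the goal should be hard enough that the student
    needs at least t* steps, but still achievable within the episode.
    """
    if reached_goal and steps_taken >= t_star:
        return alpha      # challenging but solvable goal
    return -beta          # too easy (reached quickly) or never reached

# t* is typically increased during training so the teacher keeps proposing
# progressively harder goals.
```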
@twobob 2 years ago
In the "NovelD" television example Why wouldn't the agent just learn to spin on the spot though, must be the most effective transition path. Just a thought. real new new novelty is not the same as old new novelty... (I guess the generalisation of the type of novelty might help here)
@twobob 2 years ago
ah RND, aight
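Since both threads touch on NovelD and RND, here is a compact sketch of the two pieces as explained in the video (network sizes and the alpha weight are placeholders); the clipping and the episodic first-visit factor are what keep spinning on the spot from paying off:

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation: the prediction error of a trained network
    against a fixed random target acts as a 'have I seen this before?' score."""

    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad = False  # target stays random and fixed

    def novelty(self, obs: torch.Tensor) -> torch.Tensor:
        # High error = the predictor has rarely (or never) seen this input.
        return ((self.predictor(obs) - self.target(obs)) ** 2).mean(dim=-1)

def noveld_bonus(rnd: RND, obs, next_obs, first_visit_in_episode, alpha=0.5):
    """NovelD-style bonus: reward crossing the boundary from familiar to
    unfamiliar states, clipped at zero, and only on the first visit within
    the episode, so shuttling back and forth (or spinning) earns nothing."""
    with torch.no_grad():
        diff = rnd.novelty(next_obs) - alpha * rnd.novelty(obs)
    return torch.clamp(diff, min=0.0) * first_visit_in_episode
```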
@Adhil_parammel 2 years ago
Connect MuZero, the environment, and GPT-3.
@googleyoutubechannel8554 1 year ago
I hope you subtly replaced the green screen with a CGI green screen...
@Idiomatick 2 years ago
"Does semantics matter" .... semantics is plural, so it should be "Do semantics matter".
@NoNameAtAll2 2 years ago
April Fools exploration paper: recognise real videos from April Fools ones with ML.
@siegfriedkettlitz6529 2 years ago
April Fools is like a GAN in real life.