Social Learning
Created: Feb 9, 2021 9:36 AM
Social Learning / Social Cognition
- Learning from other agents
- Not the same as multi-agent learning:
- Multi-Agent Learning: learning to collaborate/coordinate with other agents to solve a particular problem
- Relevant concepts in social cognition:
- Theory of Mind: generally modeled as $\langle \text{Beliefs}, \text{Desires}, \text{Intentions} \rangle$
- Coordination
Learning Social Learning
Kamal Ndousse, Douglas Eck, Sergey Levine, Natasha Jaques
Main Contributions
- Analysis of why social learning does not emerge in standard model-free agents
- Introduction of a novel model-based auxiliary loss that helps the agent perform social learning, i.e., learn by observing the actions of the 'expert' agents present in the environment
- A new environment that promotes social learning
- Zero-shot transfer results showing that agents which use social learning generalize better to new environments
Precursor
- Imitation Learning (Behavior Cloning, etc.): can break if the agent encounters states/trajectories that were not present in the training demonstrations
- Inverse Reinforcement Learning: reverse-engineering the reward function by observing the actions of other agents
- Other attempts at social learning: in most of these approaches, the learning agent has access to the internal dynamics of the expert agent
NOTE: This work focuses mainly on multi-agent POMDPs

- $\mathcal{I}$ - Function that maps the environment state to each agent's individual (observed) state
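
For reference, a generic multi-agent POMDP tuple consistent with the notation used here (the paper's exact definition may differ slightly):

$$\mathcal{M} = \langle \mathcal{S}, \{\mathcal{A}^i\}_{i=1}^{N}, \mathcal{T}, \{\mathcal{R}^i\}_{i=1}^{N}, \mathcal{I}, \gamma \rangle$$

with joint transition function $\mathcal{T}(s_{t+1} \mid s_t, a^1_t, \dots, a^N_t)$, per-agent rewards $\mathcal{R}^i$, and $\mathcal{I}$ mapping the environment state to each agent's individual state.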
Why is social learning difficult?

- $\tilde{s}$ is the demonstration state, which is difficult to reach through random exploration
- E.g., if the novice agent uses policy-gradient-based learning:

- The novice agent receives no reward even when $\tilde{s}$ is reached ⇒ $\mathcal{R}_t = 0$ ⇒ no gradient update (a short sketch of this follows below)
- Similarly, all Q-values remain zero in the case of Q-learning

- Access to the expert agent's policy?
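
To make the "no update" point concrete, a one-line sketch of the vanilla policy-gradient estimator (generic REINFORCE form, not the paper's exact notation): if the reward is zero everywhere along the trajectory, the return and hence the gradient vanish.

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big], \qquad G_t = \sum_{t' \ge t} \gamma^{t'-t}\, \mathcal{R}_{t'} = 0 \;\Rightarrow\; \nabla_\theta J(\theta) = 0$$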
Method
- Introduce a model-based auxiliary objective in addition to the model-free reward
- Motivation: the problem could be solved if the novice agent had access to the expert agent's policy
- The expert agent's policy is implicitly part of the state transition function $\mathcal{T}(s_t, a^N, a^E)$, since the next state also depends on the expert's action
- The auxiliary predictive loss is the mean absolute error (MAE) between the predicted and actual next state

- The approach is called SociAPL (Auxiliary Predictive Loss)
- Important advantage over other similar methods: the experts can simply exist in the environment, minding their own business, and the novice will still learn from them
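
A minimal sketch of what such an auxiliary predictive head could look like on top of an actor-critic agent, assuming a PyTorch-style setup; the class and argument names (`NoviceAgent`, `enc_dim`, `aux_coef`) and the choice of an encoded next observation as the prediction target are illustrative assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoviceAgent(nn.Module):
    """Actor-critic with an auxiliary next-state prediction loss (sketch)."""

    def __init__(self, obs_dim, n_actions, enc_dim=64, aux_coef=0.1):
        super().__init__()
        self.n_actions = n_actions
        self.encoder = nn.Sequential(nn.Linear(obs_dim, enc_dim), nn.ReLU())
        self.policy_head = nn.Linear(enc_dim, n_actions)   # model-free actor
        self.value_head = nn.Linear(enc_dim, 1)            # model-free critic
        # Model-based head: predicts the encoding of the next observation.
        # The next state also depends on the expert's action, so this head can
        # only do well if the agent implicitly models the expert's behaviour.
        self.dynamics_head = nn.Linear(enc_dim + n_actions, enc_dim)
        self.aux_coef = aux_coef

    def loss(self, obs, actions, returns, next_obs):
        z = self.encoder(obs)
        dist = torch.distributions.Categorical(logits=self.policy_head(z))
        value = self.value_head(z).squeeze(-1)

        # Standard model-free actor-critic terms
        advantage = returns - value
        pg_loss = -(dist.log_prob(actions) * advantage.detach()).mean()
        value_loss = advantage.pow(2).mean()

        # Auxiliary predictive loss: MAE between predicted and actual next-state encoding
        a_onehot = F.one_hot(actions, num_classes=self.n_actions).float()
        pred_next = self.dynamics_head(torch.cat([z, a_onehot], dim=-1))
        with torch.no_grad():
            target_next = self.encoder(next_obs)
        aux_loss = F.l1_loss(pred_next, target_next)

        return pg_loss + 0.5 * value_loss + self.aux_coef * aux_loss
```

The key design point is that predicting the next state well requires accounting for what the expert will do, which is what pushes the novice to attend to the expert's behaviour.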


Social Learning Environment
- The environment is designed so that individual exploration is harder than social learning
Goal Cycle

- 3 goal tiles where the agent gets a positive reward
- The agent must traverse the three goal tiles in a specific order
- Penalty for visiting a goal tile out of order
- The color of the agent changes as it collects more reward: blue → expert, red → novice. The color resets to red on incurring a penalty

- $c$ - Prestige. This rule induces a prestige cue that the agents can attend to (a toy sketch of the reward rule follows below)
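
A toy sketch of the Goal Cycle reward rule described above; the reward/penalty magnitudes and the prestige update are illustrative assumptions, not the values used in the paper:

```python
# Toy sketch of the Goal Cycle reward rule; magnitudes and the prestige
# update below are assumptions for illustration only.
class GoalCycleReward:
    def __init__(self, goal_order):
        self.goal_order = list(goal_order)  # order in which the 3 goal tiles must be visited
        self.next_idx = 0                   # index of the tile to visit next
        self.prestige = 0.0                 # c: prestige cue shown via the agent's colour

    def step(self, tile):
        if tile not in self.goal_order:
            return 0.0                      # ordinary tile: no reward, no penalty
        if tile == self.goal_order[self.next_idx]:
            self.next_idx = (self.next_idx + 1) % len(self.goal_order)
            self.prestige += 1.0            # colour shifts red -> blue as reward accrues
            return 1.0
        # Goal tile visited out of order: penalty, colour/prestige reset to "novice" red
        self.next_idx = 0
        self.prestige = 0.0
        return -1.0
```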
- Experts are trained using a curriculum, since individual exploration in the Goal Cycle environment is difficult

- Learning from sub-optimal experts

Transfer Learning on other environments
- 4-Goal Cycle environment

- 4 Rooms
- Reach the goal within a limited time


Discussion
- Training in the presence of different expert agents following different strategies
- Model-based vs. model-free trade-off
Too many cooks: Bayesian inference for coordinating multi-agent collaboration
Rose E. Wang, Sarah A. Wu, James A. Evans, Joshua B. Tenenbaum, David C. Parkes, Max Kleiman-Weiner
Main Contributions
- Introducing a new approach for decentralized multi-agent coordination
- The approach trains agents that can coordinate in three distinct scenarios:
- Divide and conquer: agents should work in parallel when sub-tasks can be efficiently carried out individually
- Cooperation: agents should work together on the same sub-task when most efficient or necessary
- Spatio-temporal movement: agents should avoid getting in each other’s way at any time
- The approach allows agents to predict the intentions of other agents
Difference from Previous Works
- Previous works attempting to do similar things require pre-training
- Prior work is often limited to modeling other agents at the level of low-level actions rather than their intentions
Task Description
- Environment used : Overcooked
- Multi-agent MDPs :

- Tasks form a partially ordered set (hierarchical structure)

- Each task has a pre-condition that must be satisfied for the task to be relevant at a given point in time
- Effect of partial ordering: there are multiple possible ways of allocating agents to the sub-tasks
- Each sub-task can be represented as $Merge(X, Y)$, i.e., bring $X$ and $Y$ to the same location (a small representation sketch follows below)
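
A small sketch of how the sub-task / pre-condition structure could be represented; the recipe, predicate names, and the `relevant_tasks` helper are hypothetical, not the paper's API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Merge:
    """Sub-task Merge(X, Y): bring X and Y to the same location."""
    x: str
    y: str
    preconditions: tuple = ()   # predicates that must already hold


def relevant_tasks(tasks, satisfied):
    """Sub-tasks whose pre-conditions are all satisfied in the current state."""
    return [t for t in tasks if all(p in satisfied for p in t.preconditions)]


# Simplified salad recipe: chop the tomato first, then plate the chopped tomato
tasks = [
    Merge("Tomato", "Knife"),
    Merge("ChoppedTomato", "Plate", preconditions=("ChoppedTomato",)),
]
print(relevant_tasks(tasks, satisfied={"Tomato", "Knife", "Plate"}))
# -> only Merge("Tomato", "Knife") is relevant until the tomato is chopped
```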
Environments
- Goal: Cook a recipe in the shortest time possible
- Kitchen Configurations:


Method
- Based on Bayesian Theory of Mind
- The approach is called Bayesian Delegation
- Makes probabilistic inferences (beliefs, i.e., theory of mind) about the sub-tasks other agents are working on
- $\textbf{ta}$ - Set of all possible task-allocation permutations, e.g., in the case of two agents

- This posterior over the permutations is updated after every time step and then used for further planning



- $Q_{\mathcal{T}_i}^* (s, a_i)$ - Expected future reward of action $a_i$ toward the completion of sub-task $\mathcal{T}_i$ for agent $i$
- $\beta$ → degree to which the agent believes that others are acting optimally
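
A minimal sketch of the belief update over task allocations described above, assuming each agent is modeled as Boltzmann-rational (softmax with rationality $\beta$) over its sub-task Q-values; `q_values[ta][i]` is a hypothetical lookup of $Q^*_{\mathcal{T}_i}(s, \cdot)$ for agent $i$ under allocation $ta$:

```python
import numpy as np


def boltzmann_likelihood(q_row, action, beta):
    """P(action | s, assigned sub-task) under a softmax(beta * Q) policy."""
    p = np.exp(beta * (q_row - q_row.max()))
    return p[action] / p.sum()


def update_beliefs(beliefs, q_values, observed_actions, beta):
    """One Bayesian update of P(ta) after observing every agent's action.

    beliefs:          dict {ta: prior probability}
    q_values:         dict {ta: [Q-value array over actions, one per agent i]}
    observed_actions: list of action indices, one per agent
    """
    posterior = {}
    for ta, prior in beliefs.items():
        likelihood = 1.0
        for i, a in enumerate(observed_actions):
            likelihood *= boltzmann_likelihood(q_values[ta][i], a, beta)
        posterior[ta] = prior * likelihood
    z = sum(posterior.values())
    return {ta: p / z for ta, p in posterior.items()}
```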
Deciding the prior $P(ta)$
- 0 for all $ta$ that contain sub-tasks with unsatisfied pre-conditions
- For the other $ta$:

- $V_{\mathcal{T}}(s)$ - estimated value of the current state under sub-task $\mathcal{T}$
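
One possible way to instantiate this prior, assuming the remaining probability mass is a softmax over summed sub-task values; the exact normalisation used in the paper may differ, so treat this purely as an assumption:

```python
import numpy as np


def task_allocation_prior(allocations, summed_values, valid):
    """P(ta): zero for allocations with unsatisfied pre-conditions, otherwise a
    softmax over summed_values[ta] ~ sum of V_T(s) over the sub-tasks in ta."""
    scores = np.array([summed_values[ta] if valid[ta] else -np.inf
                       for ta in allocations])
    weights = np.exp(scores - np.max(scores[np.isfinite(scores)]))
    weights[~np.isfinite(scores)] = 0.0   # invalid allocations get zero mass
    return {ta: w / weights.sum() for ta, w in zip(allocations, weights)}
```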
Planning