
By Kimberly Cook | Jul 24, 2018

#4 Research Paper Explained
DeepMind and several universities have published many end-to-end reinforcement learning papers aimed at problems that can be solved by a single agent. End-to-end RL algorithms learn both feature representation and decision making in a single network, taking pixels as input and producing controls as output.

The real world contains problems that need multiple individuals acting independently yet collaborating towards a single goal. From playing games like football or basketball to landing a rocket on the moon, a team of individuals works together, following a strategy, to complete the task faster and more safely by reducing the risk of failure. This paper's approach can be applied to many such real-life tasks, so let's break down the paper to understand its solution.

DeepMind has built an end-to-end, population-based RL algorithm that tackles this problem successfully using a two-tier optimisation process: individuals act and learn independently of each other in a team-based 3D multi-agent environment (Capture the Flag), while working together strategically to achieve a single goal.

This multi-agent setting suffers from the high complexity of the learning problem that arises from the concurrent adaptation of the other learning agents in the environment.

The game Capture the Flag has all the traits of the problem above:

1. A 3D first-person-view multi-agent game (the first-person view also makes it transferable to robotics).

2. Agents are unaware of each other's decisions while playing in the same environment as opponents or teammates.

3. A strategy-based game that demands higher cognitive skills.

Also, the indoor- and outdoor-themed maps are randomly generated for every game. Two opposing teams, each consisting of multiple individual players, compete to capture each other's flags by strategically navigating, tagging, and evading opponents. The team with the greatest number of flag captures after five minutes wins the game.

Ad-hoc teams
Training fixed teams of agents on a fixed map reduces the diversity in the training data. To develop more generalised policies and agents capable of acquiring generalised skills, the paper instead devises an algorithm and training procedure that enables agents to acquire policies robust to the variability of maps, number of players, and choice of teammates - a paradigm closely related to ad-hoc team play.

The final win/loss is a delayed episodic signal received from the environment, making it difficult to optimise the thousands of actions performed by an agent on the basis of a single binary signal at the end of a five-minute game.

This makes it hard to distinguish the actions that were effective in winning the game from the ones that did not help.

The paper addresses this by increasing the frequency of rewards in the game: more frequent internal rewards can be assigned on the basis of the actions the agent actually performs.

The memory and long-term temporal reasoning requirements of high-level strategic CTF play are met by introducing an agent architecture that features a multi-timescale representation, reminiscent of what has been observed in primate cerebral cortex, and an external working memory module, broadly inspired by human episodic memory.

These three innovations are integrated within a scalable, massively distributed and asynchronous computational framework.

In the game, the agent receives raw RGB pixel input x_t from its first-person perspective at timestep t, produces a control action a_t, and receives game points ρ_t, which are used to train the agent's policy.

The goal of reinforcement learning here is to find a policy that maximises the expected cumulative γ-discounted reward over a CTF game with T time steps.
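
To make the objective concrete, here is a minimal Python sketch of a γ-discounted return; the reward values and episode length are invented for illustration, with only the final step carrying the win/loss signal, mirroring the sparse-reward problem described above.

```python
# Minimal sketch of a gamma-discounted return; the reward values and the
# episode length are invented for illustration, not taken from the paper.
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for a single episode."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# A long episode whose only reward is the terminal win signal: the signal is
# heavily discounted, illustrating why a single end-of-game outcome is a weak
# learning signal for the thousands of actions that preceded it.
sparse_episode = [0.0] * 999 + [1.0]
print(discounted_return(sparse_episode))   # about 0.99**999, a tiny number
```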

The policy π is parameterised by a multi-timescale recurrent neural network with external memory.

The agent's architecture constructs a temporally hierarchical representation space and uses a recurrent latent variable model to promote the use of memory and temporally coherent action sequences.
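
As a rough illustration of the multi-timescale idea (not the paper's exact architecture; the layer sizes, update period, and class name here are invented), a slow recurrent core can be ticked every few steps while a fast core runs every step conditioned on the slow state:

```python
import torch
import torch.nn as nn

class TwoTimescaleCore(nn.Module):
    """Toy two-timescale recurrent core: a slow LSTM updated every few steps
    and a fast LSTM updated every step, conditioned on the slow state."""
    def __init__(self, obs_dim=64, hidden=128, slow_period=4):
        super().__init__()
        self.slow_period = slow_period
        self.slow = nn.LSTMCell(obs_dim, hidden)
        self.fast = nn.LSTMCell(obs_dim + hidden, hidden)

    def forward(self, obs_seq):                 # obs_seq: [T, batch, obs_dim]
        batch = obs_seq.shape[1]
        hs = torch.zeros(batch, self.slow.hidden_size)
        cs = torch.zeros(batch, self.slow.hidden_size)
        hf = torch.zeros(batch, self.fast.hidden_size)
        cf = torch.zeros(batch, self.fast.hidden_size)
        features = []
        for t in range(obs_seq.shape[0]):
            obs = obs_seq[t]
            if t % self.slow_period == 0:       # slow core ticks less often
                hs, cs = self.slow(obs, (hs, cs))
            hf, cf = self.fast(torch.cat([obs, hs], dim=-1), (hf, cf))
            features.append(hf)                 # per-step features for policy/value heads
        return torch.stack(features)

features = TwoTimescaleCore()(torch.randn(20, 2, 64))   # shape [20, 2, 128]
```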


Probability of Winning
For ad-hoc teams, an agent's policy should maximise the probability of winning for its team, made up of the agent itself and its teammates' policies, for a total of N players in the game.

The winning operator returns 1 if the agent's team wins, 0 if it loses, and breaks ties randomly; ω denotes the specific map on which the game is played.

For The Win agents - Now that we are using more frequent internal rewards r_t, we can operationalise the idea of each agent having a denser reward function by specifying r_t = w(ρ_t) based on the available game point signals ρ_t (points are registered for events such as capturing a flag) and allowing the agent to learn the transformation w, such that policy optimisation on the internal rewards r_t optimises the policy For The Win, giving us the FTW agent.
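
A hedged sketch of this idea in Python; the event names and weight values below are invented placeholders, and in the paper the transformation w is itself learned (evolved by the outer optimisation described next) rather than hand-set:

```python
import numpy as np

# Invented game-point events for illustration; the paper uses its own set of
# CTF point signals, and the weights w are learned rather than hand-tuned.
GAME_EVENTS = ["captured flag", "picked up flag", "tagged opponent", "was tagged"]

def internal_reward(rho_t, w):
    """r_t = w(rho_t): map the vector of game-point events at time t to a scalar reward."""
    return float(np.dot(w, rho_t))

w = np.array([1.0, 0.3, 0.5, -0.2])     # per-agent reward weights (placeholder values)
rho_t = np.array([0, 1, 0, 0])          # the agent just picked up the flag
print(internal_reward(rho_t, w))        # 0.3 -- a dense signal long before the game ends
```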

Traditional methods do not support training thousands of multi-agent games at such a scale and become unstable. The paper tackles this with the following components:

Scalability - A population of P different agents is trained in parallel; the diversity this introduces amongst players stabilises training (54).

Matchmaking - To improve the agents' skills, teammates and opponents are sampled from the population. The agents for a training game are chosen using a stochastic matchmaking scheme m_p that biases co-players to be of similar skill to player p, keeping the outcome of each game uncertain.

Agent skill level - Agents' skill levels are estimated online by calculating an Elo score (15) based on the outcomes of training games.
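
For reference, the standard two-player Elo update looks as follows; the paper estimates Elo-style ratings online for agents and teams drawn from the population, which this minimal two-player sketch does not capture.

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard two-player Elo update; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    change = k * (score_a - expected_a)
    return rating_a + change, rating_b - change

# The favourite gains little for beating a weaker opponent, and loses a lot otherwise.
print(elo_update(1200, 1000, score_a=1))   # (~1207.7, ~992.3)
print(elo_update(1200, 1000, score_a=0))   # (~1175.7, ~1024.3)
```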

Meta-optimisation - This is the use of one optimisation process to train other optimisers. The paper uses the population to meta-optimise the internal rewards and hyperparameters of the RL process itself, which can be seen as a two-tier RL optimisation problem. The inner optimisation (J_inner) is solved by RL and maximises J_inner, the agent's expected future discounted internal rewards. The outer optimisation (J_outer) is solved with Population Based Training (PBT) and is maximised with respect to the internal reward schemes w_p and hyperparameters φ_p, with the inner optimisation providing the meta transition dynamics.

PBT is an online evolutionary process which adapts internal rewards and hyperparameters and performs model selection by replacing under-performing agents with mutated versions of better agents.
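
Below is a minimal sketch of PBT's exploit/explore step, assuming a dict-based agent representation; the Elo-gap threshold, mutation factors, and field names are invented for illustration, and the actual procedure in the paper differs in its details.

```python
import copy
import random

def pbt_step(population, elo, elo_gap=150, mutate_prob=0.25):
    """One exploit/explore pass over a population of agents (illustrative only)."""
    for p in range(len(population)):
        q = random.randrange(len(population))
        if elo[q] - elo[p] > elo_gap:                        # p is clearly under-performing vs q
            population[p] = copy.deepcopy(population[q])     # exploit: copy the better agent
            if random.random() < mutate_prob:                # explore: perturb inherited settings
                agent = population[p]
                agent["learning_rate"] *= random.choice([0.8, 1.2])
                agent["reward_weights"] = [w * random.choice([0.9, 1.1])
                                           for w in agent["reward_weights"]]

population = [{"learning_rate": 1e-4, "reward_weights": [1.0, 0.3, 0.5, -0.2]}
              for _ in range(4)]
elo = [1000, 1210, 980, 1100]
pbt_step(population, elo)
```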

This joint optimisation of agent policies exploits the potential of combining learning and evolution, together maximising the ad-hoc team win probability defined above.


Assessment during Training | Tournament
To assess the generalisation performance of agents during training, a tournament was conducted on procedurally generated maps, with ad-hoc matches involving three types of players:

  1. Ablated versions of FTW.
  2. Quake III Arena scripted bots.
  3. Human participants with first-person game experience.

Results
1. FTW clearly exceeded the win rate of humans on maps that neither agents nor humans had seen previously, i.e. zero-shot generalisation, with a team of two humans on average capturing 16 fewer flags per game than a team of two FTW agents.

2. Human-agent vs. agent-agent - Only as part of a human-agent team did we observe a human winning over an agent-agent team (5% win probability).

3. Pro gamers vs. FTW - Even after twelve hours of practice, the human game testers were only able to win 25% of games against the agent team.

4. Tagging accuracy - FTW agents achieved 80% tagging accuracy while humans lagged at only 48%. The agents still won even after their tagging accuracy was artificially reduced to match human accuracy.

Humans' superior observation and control resolution helped them surpass FTW agents at successful long-range tagging: 17% for humans versus 0.5% for agents.

But FTW again surpassed humans in short-range tagging, with a reaction time of 258 ms compared to 559 ms for humans.

Firing Neurons in the Network!!! | Knowledge Representation
To investigate how the network has learned skills with such a rich, high-level representation, the network was asked about past, present, or future states of the game. For example:

Q: Do I have the Flag? (Present)
Q: Did I see my teammate recently? (Past)
Q: Will I be in the opponent's base soon? (Future)

Similarly, a total of 200 binary questions based on game features were asked to probe the internal representation of the network.

Results
According to the authors, the agent has knowledge of a given feature if logistic regression on the internal state of the agent accurately models the feature. Interestingly, the FTW agent's representation was found to encode features related to the past particularly well: for example, the FTW agent was able to classify the state "both flags are stray" (flags dropped, not at a base) with 91% AUC-ROC (area under the receiver operating characteristic curve), compared to 70% for the self-play baseline.
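
A sketch of this probing methodology on synthetic data (the state dimension, feature name, and data here are invented stand-ins; on random data the score will sit near 0.5, unlike the real result):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
internal_states = rng.normal(size=(5000, 256))       # stand-in for agent hidden activations
both_flags_stray = rng.integers(0, 2, size=5000)     # stand-in binary label for the probed feature

# Fit a linear probe from internal state to the feature and report AUC-ROC,
# mirroring the "the agent knows a feature if logistic regression can decode it" test.
probe = LogisticRegression(max_iter=1000).fit(internal_states[:4000], both_flags_stray[:4000])
scores = probe.predict_proba(internal_states[4000:])[:, 1]
print("AUC-ROC:", roc_auc_score(both_flags_stray[4000:], scores))
```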

I encourage you to look at the paper to see more detailed stats.


Visualisation
There are many more visualisations in the paper; most of them need little explanation, so I encourage you to look through them.



Conclusion
In this paper, artificial agents using only pixels and game points as input learn to play highly competitively in a rich multi-agent environment. This was achieved by combining a number of innovations in agent training - population-based training of agents, internal reward optimisation, and temporally hierarchical RL - together with scalable computational architectures.

The techniques in this paper can be applied to other problems around you whose solutions involve memory and temporally extended reasoning. So, I encourage you to read the paper, have fun understanding methods that have emerged at the edge of our knowledge of machine learning, and push the boundaries by implementing the paper and solving real-world problems.

Source: HOB