How to not get robbed – Use Deep Reinforcement Learning

Namaste! I spent quite a lot of time in 2018 teaching and applying Deep Learning, training and deploying Deep Neural Networks. I have lost count of the people who went through my trainings and trained one Neural Network after another. Currently, I am working on a plan for 2018: teaching and applying Deep Reinforcement Learning!

In the past, I already touched on Deep Reinforcement Learning briefly in my article about Deep Reinforcement Learning and Doom. I am reading a couple of books and learning from two online courses, plus a lot of research. I strongly believe that Deep Learning will soon be a commodity and that Deep Reinforcement Learning will be the bomb. Why do I think so? Because I saw it live in action. And it is marvellous to see an Artificial Intelligence learn to solve problems by solving problems! You just lay down a couple of rules and watch!

But let me quickly explain what I am talking about. While Deep Learning is all about Neural Networks with many, many layers, Deep Reinforcement Learning is all about using Deep Neural Networks in a Reinforcement Learning setting. This means that the Neural Network is kind of left alone: nobody tells it the correct answers. The good thing is that it learns desired behaviours through an action-reward system. Basically, if the AI does something great, it gets a reward. Learning means adapting the policy, the rule for choosing an action in a given situation, in order to maximise the reward. And this is basically how biological agents – including you and me – learn things. Isn’t that exciting?

Gambling is a sin. So let’s do it!

Slot-machines are interesting devices. When I was a kid, my grandparents were owners of a local pub. In Germany, pubs usually have slot-machines or something similar. Today, I am certain of one thing: the machines’ goal is to steal your money. That is why slot-machines are often called “one-armed bandits”. The name is close to the truth!

Let us do a tiny little exercise in this tutorial. Let us prevent ourselves from getting robbed. This is a kind of Hello, World! for Deep Reinforcement Learning. And yes, we are very brave: we face multiple slot-machines simultaneously. This is why the problem is called a multi-arm-bandit, more commonly written multi-armed bandit!

The scenario is very simple. We assume that we have a fixed number of one-armed-bandits, which means multiple opponents. The problem we want to solve is finding a strategy that maximises our overall reward. And we know that under specific circumstances some bandits might yield a higher reward. Which circumstances and which bandits those are, we do not know. But we are going to find out!

The first thing we definitely need is a simulation. Have a look at the multi-arm-bandit:

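import random

import numpy as np


class MultiArmBandit:
    """Simulates several slot-machines whose payout depends on a hidden state."""

    def __init__(self, arms):
        self.arms = arms
        # Row = state, column = arm, entry = success probability of that arm in that state.
        self.bandit_matrix = np.random.rand(self.arms, self.arms)
        self.update_state()

    def get_state(self):
        return self.state

    def update_state(self):
        # The next state is drawn uniformly at random.
        self.state = np.random.randint(0, self.arms)

    def pull_arm_and_get_reward(self, arm):
        probability = self.bandit_matrix[self.get_state()][arm]
        # Run `arms` independent trials; each success adds one point to the reward.
        reward = 0
        for _ in range(self.arms):
            if random.random() < probability:
                reward += 1

        self.update_state()
        return reward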

This is a tiny little simulation of an arbitrary number of one-armed-bandits. Let us dig deeper.
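Before digging deeper, it helps to see what the simulation implies. Each pull runs as many independent trials as there are arms, and each trial succeeds with the probability stored in the bandit matrix, so the reward follows a binomial distribution. The following is only a sketch under assumptions: the matrix is a seeded stand-in for `bandit_matrix`, and the names `expected_rewards`, `best_arms` and `oracle_mean_reward` are made up for illustration.

```python
import numpy as np

# Stand-in for MultiArmBandit.bandit_matrix: one row per state, one column per arm.
# The seed and the variable names below are illustrative assumptions.
rng = np.random.default_rng(0)
arms = 10
bandit_matrix = rng.random((arms, arms))

# A pull runs `arms` independent trials that each succeed with probability p,
# so the reward is binomially distributed with mean arms * p.
expected_rewards = arms * bandit_matrix

# The best arm for each state is the column with the highest probability in that row.
best_arms = expected_rewards.argmax(axis=1)

# Mean reward of a policy that always pulls the best arm (states are drawn uniformly).
oracle_mean_reward = expected_rewards.max(axis=1).mean()

print("Best arm per state:", best_arms)
print("Upper bound on the mean reward:", round(oracle_mean_reward, 2))
```

An agent that knew the matrix could simply look up the best arm for every state. Our agent does not know it and has to discover these arms by playing.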

Now it is time to think about Intelligent Agents.

One interesting thing: we are in the domain of Intelligent Agents. And a fun fact: I did my PhD in Multi-Agent Systems. That’s why I am super excited about the task at hand!

In our case, the environment is the multi-arm-bandit itself. It has an internal state which can be perceived (get_state). It has an interface for interacting with it (pull_arm_and_get_reward). And our agent is definitely the AI that we are about to train.

Before training, let us see how the environment works, especially how it evolves over time. A simple loop is enough. First, this is how you can instantiate a 10-arm-bandit in code:

arms = 10
environment = MultiArmBandit(arms)

Very straightforward. Letting the environment evolve over time is just calling the action method. This is what our AI is going to do in the near future. Here is a code-snippet that runs the environment for a fixed number of rounds:

for round_index in range(10):
    print("Round", round_index)
    
    print("  State is", environment.get_state())

    random_arm = np.random.randint(0, arms)
    print("  Pulling arm", random_arm)
    
    reward = environment.pull_arm_and_get_reward(random_arm)
    print("  Reward is", reward)
    
    print("  New state is", environment.get_state())
    print("")

And this is the output:

Round 0
  State is 6
  Pulling arm 4
  Reward is 10
  New state is 3

Round 1
  State is 3
  Pulling arm 4
  Reward is 5
  New state is 2

Round 2
  State is 2
  Pulling arm 8
  Reward is 6
  New state is 6

Round 3
  State is 6
  Pulling arm 1
  Reward is 2
  New state is 7

Round 4
  State is 7
  Pulling arm 1
  Reward is 1
  New state is 6

...
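Pulling arms at random gives us a baseline to beat. The sketch below estimates the mean reward of random play without the class itself. It assumes a seeded stand-in matrix and draws the rewards with NumPy's binomial sampler, which follows the same distribution as the trial loop in `pull_arm_and_get_reward`. The names `plays` and `random_arms` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
arms = 10
bandit_matrix = rng.random((arms, arms))  # stand-in for environment.bandit_matrix

plays = 10_000
states = rng.integers(0, arms, size=plays)       # uniformly random states
random_arms = rng.integers(0, arms, size=plays)  # uniformly random actions

# Look up the success probability of every (state, arm) pair that was played.
probabilities = bandit_matrix[states, random_arms]

# One binomial draw per play replaces the explicit loop over `arms` trials.
rewards = rng.binomial(arms, probabilities)

print("Mean reward of random play:", rewards.mean())
```

Because the matrix entries are uniform between 0 and 1, random play should land at roughly 5 points per pull. A trained agent should do clearly better than that.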

To a certain extent, creating and/or interfacing the environment is the hardest part of Deep Reinforcement Learning. This is not me saying that it is difficult. This is me saying: If the environment is up and running, training the agent can be quite straightforward.

What kinds of environments are there, you may ask? Everything ranging from simple simulations to more complex ones, including computer games. And at the end of the day even running systems like the cooling in a data-center. And let me tell you one thing: there are a lot of environments readily available!

Now the question is… Keeping in mind the above code… How can an agent learn the best policy in order to maximise the rewards over time?

Let us create an agent and watch it learn.

The core of our solution will be a Deep Neural Network. Its main task will be to perceive the current state of the multi-arm-bandit and predict the reward of each arm, which tells us the best action to perform next. This translates to: which arm to pull next.

Our Neural Network architecture is almost trivial. We are going to use Keras:

from keras import models, layers, optimizers

# Training parameters.
epochs = 5000
learning_rate = 1e-2

# Network parameters.
input_size = arms
hidden_size = 100
output_size = arms

model = models.Sequential()
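model.add(layers.Dense(hidden_size, activation="relu", input_shape=(input_size,)))
# Only the first layer needs input_shape; Keras infers it for the following layers.
model.add(layers.Dense(output_size, activation="relu"))

model.compile(
    # Recent Keras versions call this argument learning_rate; older ones used lr.
    optimizer=optimizers.Adam(learning_rate=learning_rate),
    loss="mse"
)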

You are looking at a Deep Neural Network with a 10-unit-wide input-layer, a 100-unit-wide hidden-layer, and a 10-unit-wide output-layer. Its sole purpose: mapping a state to a predicted reward for each arm, so that we can pick the action that optimises the reward. We are going to train it over 5000 epochs, where each epoch is a single play, using the Adam-optimizer and the MSE as loss. This is a very simple and very effective approach!

Almost there… The next thing would be to actually train the Neural Network. Have a look at the code before we examine it:

import progressbar

import utils  # the article's own helper module, provides one_hot and softmax

# History of the last 50 rewards, initialised with 5.
running_mean_update_frequency = 50

reward_history = np.full(running_mean_update_frequency, 5.0)

plot_epochs = []
plot_running_means = []

bar = progressbar.ProgressBar(max_value = epochs)
for epoch in range(epochs):
    bar.update(epoch)
    
    # Get the current state and predict on it.
    current_state = environment.get_state()
    current_state = utils.one_hot(arms, current_state)
    y_pred = model.predict(current_state.reshape(1, arms))

    # Sample an arm from the softmax over the predicted rewards and pull it.
    av_softmax = utils.softmax(y_pred[0], tau=2.0)
    av_softmax /= av_softmax.sum()
    choice = np.random.choice(arms, p=av_softmax)
    current_reward = environment.pull_arm_and_get_reward(choice)

    # Build the target: the prediction with the pulled arm's value replaced by the reward.
    one_hot_reward = y_pred.copy()
    one_hot_reward[0, choice] = current_reward
    model.train_on_batch(current_state.reshape(1, arms), one_hot_reward)
    
    # Monitor running mean.
    if epoch % running_mean_update_frequency == 0:
        plot_epochs.append(epoch)
        running_mean = np.average(reward_history)
        plot_running_means.append(running_mean)
        reward_history[:] = 0
    reward_history[epoch % running_mean_update_frequency] = current_reward
    
bar.finish()

The first step is getting the current state of the environment and letting the Network predict the optimal action. At the beginning, the Network is not trained at all, so it might just predict a very random next action. This is going to change over time, as the Network gets better and better.

In the second step, the action is performed. This yields a reward, which is our prime indicator for how good the agent is. Again, the whole endeavour is reward-driven. This often translates to a goal-directed approach. This is what differentiates Deep Reinforcement Learning from Deep Learning, which is a data-driven approach.
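The training code relies on two helpers, `utils.one_hot` and `utils.softmax`, whose implementations are not shown in this article. The following is a minimal sketch of what they might look like, assuming the argument order used in the training loop. It is not necessarily the code from the repository.

```python
import numpy as np


def one_hot(size, index):
    """Return a vector of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    vector = np.zeros(size)
    vector[index] = 1.0
    return vector


def softmax(values, tau=1.0):
    """Turn predicted rewards into probabilities; a higher tau means more exploration."""
    scaled = np.asarray(values, dtype=np.float64) / tau
    scaled -= scaled.max()  # subtract the maximum for numerical stability
    exponentials = np.exp(scaled)
    return exponentials / exponentials.sum()


predicted_rewards = np.array([1.0, 5.0, 9.0])
print(one_hot(3, 1))
print(softmax(predicted_rewards, tau=2.0))  # softer distribution, more exploration
print(softmax(predicted_rewards, tau=0.5))  # sharper distribution, more exploitation
```

Sampling from the softmax instead of always taking the highest prediction keeps the agent exploring: arms that look bad early on still get pulled occasionally. Casting to 64-bit floats keeps the probabilities summing to one closely enough for `np.random.choice`.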

In the third step, the reward is fed back into the Neural Network by performing a single training-step. This is one gradient-descent step, here taken with the Adam-optimizer. And this is the point in time where the Neural Network slowly learns which arm to pull under which circumstances. By playing the game!
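Why does copying the prediction and overwriting a single entry work as a training target? A small NumPy sketch makes it visible. The numbers are invented for illustration, and a 4-arm bandit is assumed to keep the output short.

```python
import numpy as np

# Hypothetical prediction for a 4-arm bandit; in the article this comes from model.predict.
y_pred = np.array([[2.0, 6.5, 4.0, 1.0]])
choice = 2          # arm that was pulled
current_reward = 7  # reward that the environment returned

# Copy the prediction and overwrite only the entry of the pulled arm.
target = y_pred.copy()
target[0, choice] = current_reward

# The squared error is zero everywhere except at the pulled arm,
# so the MSE loss only pushes that one output towards the observed reward.
squared_error = (target - y_pred) ** 2
print(squared_error)
print("MSE:", squared_error.mean())
```

The outputs for the arms that were not pulled produce no error. We have learned nothing new about them in this play, so the training-step leaves them alone as far as the loss is concerned.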

It is very important to monitor how good the agent is.

Above I did not say anything about the reward-history. Have a second look. We monitor the strength of our agent by gathering a running mean. We use a sampling-frequency of 50 epochs and calculate the mean-reward over that sequence of time-steps. Plotting the collected means yields some great insights:
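The same windowed means can also be computed after the fact if you simply store every reward. Here is a sketch of that idea with simulated rewards standing in for the real ones. The names `window`, `block_means` and `block_starts` are assumptions for illustration.

```python
import numpy as np

# Simulated rewards stand in for the values collected during training (illustrative only).
rng = np.random.default_rng(2)
rewards = rng.integers(0, 11, size=500)

window = 50  # plays the same role as running_mean_update_frequency

# Reshape into rows of `window` plays and average each row.
block_means = rewards.reshape(-1, window).mean(axis=1)
block_starts = np.arange(0, rewards.size, window)

print(block_starts)
print(block_means)
```

Note one difference: the training loop records each mean at the start of the following window, and its very first value is the initial filler of 5. This variant needs the number of plays to be a multiple of the window size.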

import matplotlib.pyplot as plt

plt.xlabel("Plays")
plt.ylabel("Mean reward")
plt.plot(plot_epochs, plot_running_means)
plt.show()
plt.close()

We get this:

[Plot: mean reward over the number of plays]

What does that mean? Well… Our agent gets better and better over time by playing the multi-arm-bandit game. After training, our Neural Network has an average-reward of around 7.5 points out of a possible 10 per pull. This is great, isn’t it?

We beat the game. A summary.

Yes, we are done. We beat the game. And we trained an AI. Well, basically the AI trained itself. We just provided an environment and a means to allow the agent to evaluate its policy: a reward. Our agent was a Deep Neural Network. And this is Deep Reinforcement Learning in a nutshell!

And finally: Here is the GitHub-repository with the code.

Thanks a lot for reading! And if you are interested in learning about and applying Deep Reinforcement Learning, please drop me a line anytime!

A quick about me. I am a computer scientist with a love for art, music and yoga. I am an Artificial Intelligence expert with a focus on Deep Learning. As a freelancer, I offer training, mentoring and prototyping. If you are interested in working with me, let me know. My email-address is tristan@ai-guru.de - I am looking forward to talking to you!

Postscriptum.

This work is heavily inspired by the upcoming book Deep Reinforcement Learning in Action. Support the author by participating in the early-access-program!