How to Make Data-Driven Decisions with Contextual Bandits: The Case for Bayesian Inference


In 2018, there is a huge amount of pressure in virtually every industry to make data-driven decisions. In the era of powerful machine learning algorithms, this is more than a good idea; this is survival.

But how do we apply “predictive analytics” to make decisions? So what if we can predict customer churn or the likelihood of conversion? Does a good prediction really bring value to your organization?


It’s Time for Data Science to Grow Up

Supervised learning has done wonders, but it’s fundamentally limited. The direct optimization of decision-making — most broadly, the field of reinforcement learning (RL)— is mature enough for the big leagues.

We’ve seen some huge breakthroughs in RL, but a reliable general solution is still lacking.

What is well understood, and effectively solved, is the special case of RL known as the contextual bandits problem.

What is contextual bandits? It’s a lot like RL; you have a set of decisions to make and the task is to learn the best choice to optimize some reward. The only difference is that the contextual bandits formulation has no state.

John Schulman is a leading researcher in the field of deep RL at OpenAI, and his direct advice at this 2017 talk to anybody looking to apply deep RL in a practical setting is to frame it as a contextual bandits problem.

“Often you can just approximate [your problem] as a contextual bandits problem where there is no state. There’s a better theoretical understanding of contextual bandits problems.”- John Schulman, OpenAI


Deep Contextual Bandits is Here — And it Works

Perhaps eclipsed by headlines about deep General RL, recent advances in the study of contextual bandits problems have gone largely unnoticed. There are some amazing results in deep RL and it makes for excellent clickbait, but in the wild, we data scientists have a professional responsibility to pay attention to relevant research.

Particularly important is the 2018 Google Brain paper Deep Bayesian Bandits Showdown. This paper isn’t exactly innovative, but it’s a milestone. The authors take a look at some of the most promising recent solutions and compare them on a variety of different tasks.

The main takeaway? Most of these methods work well across a diverse set of problems.

Explore vs. Exploit


All RL problems face a fundamental paradox: do we try new things to find better solutions, or do we simply repeat the behavior that we know works well? This classic conundrum is known as the explore/exploit trade-off

Most research in contextual bandits focuses on solving this conundrum in a general way. Every problem is different; can we create an algorithm that works well everywhere?

A simple and effective solution is known as the epsilon-greedy policy. Epsilon is a hyperparameter; with epsilon=.01, we exploit 99% of the time, and explore 1% of the time. This is simple and effective. We need only choose an epsilon that works well, and we can find a solution while maximizing performance along the way.

The problem with epsilon-greedy is that it requires tuning. Do we start with a high epsilon and lower it over time? Do we pick an epsilon and leave it?

The real risk with fiddling here is that while we tune, real money is being lost due to sub-optimal decision-making.


Thompson Sampling

The real solution is to find an algorithm that self-regulates the explore-exploit problem. It turns out that a Bayesian formulation gives us a reliable way to do this.

Thompson sampling is an idea that dates back to 1933. It gives us a way to explore intelligently. The main idea with Thompson sampling is:

“When in doubt: Explore!” — Michael Klear, trying to explain Thompson sampling.

The simpler-case of contextual bandits, known as the multi-arm bandit problem, is easily solved using Thompson sampling.

The idea is simple: using our prior understanding of the expected reward distribution behind each action, let’s draw samples and pick the argmax as our selected action. That’s the Thompson sampling policy: both greedy and stochastic.

The result of this policy is profound: when our model is confident about a decision, it will consistently repeat that behavior. When it is less confident, it will explore other options with a higher likelihood.

It works well for multi-arm bandits. The tricky part is to apply it to the contextual case.

Bayesian Regression for Thompson Sampling

When we introduce context to the multi-arm bandit problem, we can no longer simply learn a reward distribution for every action. We need to condition this distribution on a set of features, or a context.

Bayesian linear regression gives us a nifty way to do just that. Rather than simply fitting a linear regression model, Bayesian linear regression allows us to place uncertainty on our predictions. More specifically, it allows us to condition the expected distribution of some outcome (say, our reward) on some set of features. This distribution can be sampled from, opening the door to Thompson sampling in the contextual bandits case.

Bayesian regression reward optimization — Maximum Reward is the known-best policy, and cumulative reward is the trained policy. Bayesian LR + Thompson Sampling implicitly optimizes explore-exploit behavior. (Each decision and timestep is the x-axis.)

Bayesian regression reward optimization — Maximum Reward is the known-best policy, and cumulative reward is the trained policy. Bayesian LR + Thompson Sampling implicitly optimizes explore-exploit behavior. (Each decision and timestep is the x-axis.)

What About Nonlinear Functions?

Linear models are effective but limited to approximately linear signals given independent features. What if we need to model nonlinear functions?

One completely valid approach is to engineer a set of roughly independent features from our raw features. You can do that.

But feature engineering is old-school. In 2018, we have neural networks to do this for us.

Enter Deep Learning


The 2018 Contextual Bandits Showdown paper explores a clever adaptation of the Bayesian linear regression solution. Simply called the Neural Linear algorithm, Google Brain researchers apply a neural network to learn a set of features to feed in to a Bayesian linear regression model. 

The result? A model with empirically superior performance across a variety of tasks. It’s simple, computationally efficient, and the proof is in the pudding.

“…making decisions according to a Bayesian linear regression on the representation provided by the last layer of a deep network offers a robust and easy-to-tune approach.” — Riquelme et al., Google Brain

You Can Apply State-of-the-art Research

This algorithm is generally applicable and simple at heart. The researchers even open-sourced their code for public use.

But it can be intimidating to directly apply bleeding-edge research.

I lead a training session for data scientists at Launchpad.AI, as a part of an international campaign to foster data science talent entering the field.

It’s easy enough to teach about gradient boosting and random forests, because the open-source software is so user-friendly. A budding data scientist need only import some class, and she can begin experimenting immediately.

So when I stand up in front of a group of data-scientists-in-training and teach them about deep bayesian contextual bandits, they’re eager to try it.

The only problem? They can’t just import it from scitkit-learn. The only available option, hand-coding implementations from scratch, isn’t always feasible when you have deadlines to meet.

Space Bandits is Born

I decided to take Google Brain’s open source code and package it up so my trainees could use it.

I’m not ambitious; I took the simplest model (Bayesian Linear) and the best model (Neural Linear) from the Tensorflow open-source code, optimized it for use at-scale, and uploaded it to PyPI.

It just needed a name. I had many ideas, but something about my love for astronomy and the ring to the silly name “Space Bandits” called out to me. Space Bandits was born.

The current version, v0.0.9, is an alpha release, but it works. A toy problem shows the library’s models can learn both online and from historic data.

This is another advantage of Google Brain’s Bayesian approach: direct access to the decision-making policy is not required for learning. This means that records from arbitrary campaigns can be used for optimization.

Space Bandits is able to learn a “soft” decision boundary with Thompson sampling and a Bayesian linear model — only observing rewards given actions


A Note on Reward Design

“I’ve taken to imagining deep RL as a demon that’s deliberately misinterpreting your reward and actively searching for the laziest possible local optima.” — Alex Irpan, Google, Deep Reinforcement Learning Doesn’t Work Yet

Alex perhaps puts it best, but anybody with experience in machine learning should not be surprised by this.

RL algorithms, including contextual bandits, do not readily generalize. They only do so when there is no alternative.

In the supervised setting, this idea manifests itself in overfitting. When simply memorizing the training data is an option, why generalize?

The same problem is not so clear in RL. When your model finds a way to optimize rewards, is it overfitting? This will only be so if your reward function is flawed.

This piece should be easy, so don’t overthink it. In real life, reward is (usually) money. If your model finds a “trick” to optimize reward, as long as that reward is profit, you should be happy.

A Space Bandits Neural Linear model learns nonlinear decision boundaries. Thompson sampling encourages more exploration closer to “true” decision boundaries, and optimal choices in regions with higher certainty.

A Space Bandits Neural Linear model learns nonlinear decision boundaries. Thompson sampling encourages more exploration closer to “true” decision boundaries, and optimal choices in regions with higher certainty.

Deploy a Deep Bayesian Contextual Bandits Model

You’re out of excuses. The algorithm has been shown to work on a broad range of problems in practice. You can impress your colleagues with decision optimization based on bleeding-edge research. Perhaps best of all, you can pip install an open-source python package that does all the hard work for you.

So what are you waiting for?

Michael Klear