Predict Uncertainty with Monte Carlo Simulations

How can you not only predict the future but also quantify how certain you are about your estimates? We’ll look at a simple way to model uncertainty as it flows through events, decisions and time.

Prerequisites

  • General understanding of statistics
    You should be familiar with concepts like mean, standard deviation, and the Gaussian distribution.
  • Basic Python Skills
  • Basic Jupyter Notebook Skills
    Try the Jupyter Intro inside the DataBriefing VM.
  • A Python data environment (Jupyter, NumPy, pandas, etc.)
    You can use the DataBriefing Vagrant VM and our Setup and Quickstart guide.

Motivation & Goals

In this article we’ll look at how we can use the Monte Carlo method to simulate processes that contain a flow of probabilities. If the process is simple we may be able to calculate the probability of some event analytically. But what if a decision you might take in one month depends on a combination of a few uncertain events? Your decision then in turn triggers uncertain events, and so on, until you arrive at a point in time where the sequence of events and decisions has led to a profit - or a loss. To calculate the probability of making a profit in such a case we need a new set of tools.

In a moment we’ll run two Monte Carlo analyses using NumPy: first to simulate a robot that doesn’t know exactly how far it travels in which direction, and then to predict a company’s cash flow. Once we’ve finished the simulations we’ll look at ways to interpret and visualize the results.

What is Monte Carlo?

Processes involving events that occur with certain probabilities can become so complex that solving them analytically is impractical or even impossible. Monte Carlo methods sidestep this analytical complexity by simulating individual experiments, randomness included, over and over again. In the end you get the results of tens or hundreds of thousands of distinct experiments, and these concrete results can easily be turned into interpretable statistics.
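
To make the idea concrete before we dive into the real examples: the classic toy illustration is estimating π by sampling random points in the unit square and counting how many land inside the quarter circle. A minimal sketch:

import numpy as np

# Sample 1 million random points in the unit square.
points = np.random.rand(int(1e6), 2)
# A point lies inside the quarter circle if x^2 + y^2 <= 1.
inside = (points ** 2).sum(axis=1) <= 1
# The fraction of hits approximates pi/4, the area of the quarter circle.
print(4 * inside.mean())  # ~3.14

No formula for π appears anywhere; the estimate emerges purely from counting the outcomes of random experiments, which is exactly the Monte Carlo idea.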

Monte Carlo methods were conceived in the Manhattan Project, partly inspired by the new computational power that was available back then. For the first time in history it was possible to run a huge number of calculations in a short time, which is what made Monte Carlo feasible in the first place. Today it’s often used in physics, mathematics and computer science problems - which is another way of saying it’s used everywhere. As David Silver once said in one of his lectures, "I only get this one stream of data that we call life." A Monte Carlo simulation gets thousands of possible streams of data - much like in one of the Black Mirror episodes from season 4. (No spoilers, so no more details.)

Using NumPy for Monte Carlo

Before diving into real simulations let’s try out ways to speed up our code so our simulations run fast. This will allow you to iterate quickly on your models. In the first code snippet we calculate the mean of 10 million random numbers.

import numpy as np

sum_x = 0
for i in range(int(1e7)):
    x = np.random.rand()
    sum_x += x
print(sum_x / 1e7)  # ~0.5

This code loops 10 million times. Inside the loop it generates a random number using NumPy, then uses plain Python to sum the numbers up and to divide the sum by 10 million, giving the expected mean of 0.5. Running this code takes about 6 to 7 seconds on my laptop - quite a lot, given the simplicity of the calculations.
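
If you want to reproduce the timings on your own machine, time.perf_counter is a simple way to do it (your numbers will of course differ):

import time
import numpy as np

start = time.perf_counter()
sum_x = 0
for i in range(int(1e7)):
    sum_x += np.random.rand()
print(sum_x / 1e7)                            # ~0.5
print(time.perf_counter() - start, 'seconds')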

Fortunately there is a quicker and more elegant way: Using NumPy arrays.

x = np.random.rand(int(1e7))  # x is an array containing 10 million random numbers
print(np.mean(x))  # ~0.5

We can tell np.random.rand() to populate a 10-million-element array with random numbers instead of generating a single one. The resulting NumPy array has a method to calculate the mean. This version takes about 0.16 seconds and is only two lines of code.

NumPy achieves this speedup with its efficient ndarray data structure, which holds elements of a single type only and is implemented in C. Take a look at the NumPy docs to see all methods of ndarray.
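
You can see both properties directly: every element shares one dtype, and arithmetic on the whole array runs element-wise in compiled code rather than in a Python loop.

import numpy as np

x = np.random.rand(5)
print(x.dtype)    # float64 - the single type shared by all elements
print(x * 2 + 1)  # element-wise arithmetic, no explicit Python loop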

Another important method for vectorized implementations is np.where(). This replaces if-clauses inside a loop.

Turning this, which runs in about 6 seconds:

sum_x = 0
for i in range(int(1e7)):
    if np.random.rand() > 0.9:
        sum_x += 1
print(sum_x / 1e7)  # ~0.1

into this, which runs in about 0.25 seconds:

x = np.random.rand(int(1e7))
condition = x > 0.9  # boolean ndarray containing 10 million entries
results = np.where(condition, 1, 0)
print(np.mean(results))  # ~0.1
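
As an aside, since True counts as 1 and False as 0, np.where isn’t even necessary in this particular case - taking the mean of the boolean array gives the same result:

print(np.mean(x > 0.9))  # ~0.1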

With these two code snippets you know the basics of vectorized implementations and are ready for some real simulations.

Two Examples

We’ll look at two examples. One is taken from robotics and is visually quite intuitive: We want to estimate the uncertainty of a robot’s position after it moves across a room.

In the second example we’ll try to predict whether a company will run out of money over the next ten years.

Feel free to follow one or both examples.

Uncertain Locations of a Robot

You can find the complete code of this example at https://github.com/databriefing/article-notebooks/blob/master/monte-carlo-simulations/robot.ipynb

[Figure: robot's path going in an S]

The premise of this example is a robot driving around a room. We can tell it to move a certain distance and to steer in a certain direction, but when we tell it to go 10 units, the distance and direction actually travelled will deviate. So for each simulated step we draw the actual distance and angle from a normal distribution, using a vectorized implementation:

def go_step(distance_mean, distance_std, angle_mean, angle_std):
    """Calculates deltas in the x and y direction."""
    distances = np.random.normal(distance_mean, distance_std, size=N_SIMULATIONS)
    angles = np.random.normal(angle_mean, angle_std, size=N_SIMULATIONS)
    # Convert the polar coordinates (distance, angle) to cartesian deltas.
    dx = distances * np.cos(np.deg2rad(angles))
    dy = distances * np.sin(np.deg2rad(angles))
    return dx, dy

It’s a good idea to put the actual sampling operations into a function that we can call repeatedly. Then we only have to define the starting positions in x and y, which themselves carry a small uncertainty, and a matrix to store all positions of all robot simulations at every step.

N_SIMULATIONS = 10000  # number of simulated robots; any large value works
x = np.random.normal(60, .3, size=N_SIMULATIONS)
y = np.random.normal(0, .3, size=N_SIMULATIONS)
N_STEPS = 19
points = np.zeros((N_STEPS, N_SIMULATIONS, 2))
# Define coordinates at step 0
points[0,:,0] = x
points[0,:,1] = y

Now we run the simulation for the remaining 18 steps. The most complicated part is that we want the robot to first turn left and then turn right. Finally we use a helper function from the notebook to plot all points.

for i in range(1, N_STEPS):
    # Calculate absolute heading of current step
    heading = 20 * i
    # Turn right after step 9
    if i > 9:
        heading = 360 - heading
    # Go one step of distance 10 in the calculated direction.
    dx, dy = go_step(10, 0.1, heading, 10)
    x, y = x + dx, y + dy
    points[i,:,0] = x
    points[i,:,1] = y

plot_points(points)

We also pick 10 different simulation runs and plot their complete paths, to visualize the different routes a single robot could take - after all, we really have simulated paths at the level of individual robots. The longer the robots travel, the more their paths diverge from each other, which is exactly what we expected.
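
plot_points comes from the notebook; a minimal version of the per-robot path plot could look like this (assuming points is the (N_STEPS, N_SIMULATIONS, 2) matrix from above):

import matplotlib.pyplot as plt

# Pick 10 random simulation runs and draw each robot's complete path.
sample = np.random.choice(N_SIMULATIONS, size=10, replace=False)
for s in sample:
    plt.plot(points[:, s, 0], points[:, s, 1], alpha=0.6)
plt.xlabel('x')
plt.ylabel('y')
plt.show()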

The example notebook also contains code for a more complicated robot model. As long as the robot goes straight it can measure the distance and angle travelled quite accurately, but in a curve the heading measurement gets more error-prone due to random drift on unknown ground. We can see this nicely in the following plot.

[Figure: robot's path going straight and in a curve]
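
The exact model is in the notebook; as a rough sketch of the idea, one could widen the heading noise only while the robot is turning (the extra parameters and the turning flag here are illustrative, not taken from the notebook):

def go_step_with_drift(distance_mean, distance_std, angle_mean,
                       angle_std_straight, angle_std_turning, turning):
    """Like go_step, but with larger heading noise while turning."""
    angle_std = angle_std_turning if turning else angle_std_straight
    distances = np.random.normal(distance_mean, distance_std, size=N_SIMULATIONS)
    angles = np.random.normal(angle_mean, angle_std, size=N_SIMULATIONS)
    dx = distances * np.cos(np.deg2rad(angles))
    dy = distances * np.sin(np.deg2rad(angles))
    return dx, dy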

Predicting Cash Flow

You can find the complete code of this example at https://github.com/databriefing/article-notebooks/blob/master/monte-carlo-simulations/company_cashflow.ipynb

In this example we want to calculate the risk for a company to run out of cash. First we make a few assumptions to build a simple model of a company.

  • Revenue growth year over year is 5% with a standard deviation of 5 percentage points
  • We have an internal profit margin target of 7%
  • If we miss this target we reduce expenses accordingly
    • We might not be able to reduce fixed expenses immediately, so with 25% probability expenses remain unchanged
  • If we exceed the targeted profit margin we increase expenses
    • Same here: with 25% probability expenses don’t change, because we might not be able to e.g. hire more people immediately
  • Our single biggest customer generates between 15 and 40% of revenue
  • Yearly chance of losing the biggest customer is 10%

Now let’s put all these assumptions into code. Notice how easy it is to sample huge amounts of data points from these distributions with NumPy.

N_YEARS = 10
N_SIMULATIONS = 10000  # number of simulation runs; any large value works

revenue_growth = np.random.normal(1.05, 0.05, size=(N_YEARS, N_SIMULATIONS))
can_change_expenses = np.random.rand(N_YEARS, N_SIMULATIONS) < 0.75  # boolean
biggest_customer_share = np.random.uniform(0.15, 0.4, size=(N_YEARS, N_SIMULATIONS))
biggest_customer_leaves = np.random.rand(N_YEARS, N_SIMULATIONS) < 0.1  # boolean

np.random.rand(N_YEARS, N_SIMULATIONS) < 0.75 is a quick way of getting a boolean array where 75% of the elements are true.
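
You can verify this quickly:

mask = np.random.rand(int(1e6)) < 0.75
print(mask.dtype)   # bool
print(mask.mean())  # ~0.75 - the fraction of True entries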

We initialize container matrices to store all our calculated values.

revenues = np.zeros((N_YEARS, N_SIMULATIONS))
profits = np.zeros((N_YEARS, N_SIMULATIONS))
expenses = np.zeros((N_YEARS, N_SIMULATIONS))
cash = np.zeros((N_YEARS, N_SIMULATIONS))

And finally we loop through the years, after defining a few starting constants (the concrete numbers below are just for illustration; the notebook sets its own). The loop is very fast because we’ve vectorized the calculations across the simulations and only loop over the 10 years; the other way round - looping over simulations and vectorizing over years - would be much slower.

# Starting values and target margin (concrete numbers are illustrative)
REVENUE_START, EXPENSES_START, CASH_START = 100.0, 93.0, 20.0
PROFITMARGIN_TARGET = 0.07

for i in range(N_YEARS):
    if i == 0:
        # first year
        prev_revenue = REVENUE_START * np.ones(N_SIMULATIONS)
        prev_expenses = EXPENSES_START * np.ones(N_SIMULATIONS)
        prev_cash = CASH_START * np.ones(N_SIMULATIONS)
    else:
        prev_revenue = revenues[i-1,:]
        prev_expenses = expenses[i-1,:]
        prev_cash = cash[i-1,:]
    # Expenses that would hit the profit margin target, based on last year's revenue
    prev_expenses_target = prev_revenue * (1 - PROFITMARGIN_TARGET)
    revenues[i,:] = prev_revenue * revenue_growth[i,:]
    expenses[i,:] = np.where(can_change_expenses[i,:], prev_expenses_target, prev_expenses)
    # If the biggest customer leaves, their share of revenue disappears.
    revenue_from_biggest_customer = biggest_customer_share[i,:] * revenues[i,:]
    revenues[i,:] = np.where(
        biggest_customer_leaves[i,:],
        revenues[i,:] - revenue_from_biggest_customer,
        revenues[i,:]
    )
    profits[i,:] = revenues[i,:] - expenses[i,:]
    cash[i,:] = prev_cash + profits[i,:]
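
With all simulations in one matrix, follow-up questions become one-liners. For example, the share of simulated futures in which the company is short of cash at some point - a statistic the article doesn’t compute, but one that falls out of the same data:

# Fraction of simulations where cash dips below zero in at least one year.
p_ruin = np.mean((cash < 0).any(axis=0))
print('Probability of running out of cash at some point: {:.1%}'.format(p_ruin))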

One way to visualize the results is to plot each year’s confidence intervals. The function np.percentile comes in quite handy for this.

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 7))
plt.fill_between(np.arange(N_YEARS), *np.percentile(cash, [2.5, 97.5], axis=1), facecolor='b', alpha=0.1, label='95%')
plt.fill_between(np.arange(N_YEARS), *np.percentile(cash, [15, 85], axis=1), facecolor='r', alpha=0.1, label='70%')
plt.plot(np.percentile(cash, 50, axis=1), c='k')
plt.legend()
plt.title('Confidence intervals of cash amount')
plt.xlabel('Year')
plt.ylabel('Cash')

[Figure: confidence intervals of future cash predictions across ten years]

As you can already see, within the 70% interval we barely make it, and within the 95% interval we may have negative cash for a long time. Let’s get some concrete numbers. I like to think about three probabilities: 51%, 95% and 99%. The question we’re asking is: "What amount of money will we have at least, with a probability of X%?"

for i in [51, 95, 99]:
    print('With {}% probability cash is always more than: {: 3.2f}'.format(
            i, np.percentile(cash, 100-i, axis=1).min()))

Giving us:

  • With 51% probability cash is always more than: 12.15
  • With 95% probability cash is always more than: -11.00
  • With 99% probability cash is always more than: -28.04

The beauty of this approach is that it doesn’t tell you what to do. I like to expect the 51% or 60% value, to prepare for the 95% value, and to at least think about the 99% outcome. Other people may be more risk-averse or more risk-seeking and will look at other probabilities.

In this case, for me it would mean talking to several banks about an overdraft facility early on.

Where to go from here

Optimize input values

What if you have constraints? For example, if your cash may never fall below a certain amount, you might have to tweak input values until the constraint holds in enough of the simulated futures.
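
A brute-force way to do this is to re-run the simulation for a grid of candidate inputs and keep the cheapest one that satisfies the constraint. A sketch, where run_simulation is a hypothetical wrapper around the yearly loop above that returns the cash matrix:

def minimum_starting_cash(candidates, floor=0.0, confidence=0.95):
    """Smallest starting cash whose worst-year percentile stays above floor."""
    for cash_start in sorted(candidates):
        cash = run_simulation(cash_start=cash_start)  # hypothetical wrapper
        worst_year = np.percentile(cash, 100 * (1 - confidence), axis=1).min()
        if worst_year > floor:
            return cash_start
    return None

print(minimum_starting_cash(candidates=range(10, 100, 10)))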

Sensitivity Analysis

The other question is which of your parameters changes the end result most strongly. You can conduct an automated sensitivity analysis or systematically try out different input parameters, as sketched below. In the cash-flow example the dominant inputs seem to be the revenue share of the biggest customer and the likelihood of losing them, so these two aspects might be effective areas to focus resources on.
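
A simple manual version of this: vary one input at a time and watch how the statistic you care about moves (again using the hypothetical run_simulation wrapper from above):

for p_leave in [0.05, 0.1, 0.2]:
    cash = run_simulation(p_biggest_customer_leaves=p_leave)  # hypothetical wrapper
    worst_case = np.percentile(cash, 5, axis=1).min()
    print('P(losing customer)={:.0%} -> cash exceeded with 95% probability: {:.2f}'.format(
        p_leave, worst_case))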

In the robot example the question would be whether you should make the angle measurements more accurate by adding more sensors, and how that would affect the certainty of the robot’s position after driving around for a while.

Monte Carlo analysis is only ever as good as the models it’s based on but it can help you make uncertainty quantifiable.

What's next?

You could try a Monte Carlo analysis on one of your own problems. How would you build optimization or sensitivity analysis?

