Random prediction

fastai

Published

September 23, 2020

Random prediction

When we try to build a prediction algorithm, it is a useful practice to first try out a baseline algorithm and see how they perform. Here we will describe a case of random prediction.

(Inspiration and examples are from Jason Brownlee’s Machine Learning Algorithms from Scratch book.)

import pandas as pd
import altair as alt
import numpy as np

from vega_datasets import data

np.random.seed(42)

For the examples we will use the vega ‘volcano’ dataset. The width and height values are constant, so we work only with the values column.

volcano = data('volcano')
volcano.head()

	width	height	values
0	87	61	103
1	87	61	104
2	87	61	104
3	87	61	105
4	87	61	105

The random prediction algorithm

takes a training and a test set
generates the prediction by selecting random elements from the training set

We split the dataset to training and test sets, by taking the first 2/3 and last 1/3 of the data respectively.

train, test = volcano.iloc[: volcano.shape[0] * 2 // 3, :], volcano.iloc[volcano.shape[0] * 2 // 3 :, :]

A small check that the split did not leave out an entry or did not result in an overlap.

assert train.index[-1] + 1 == test.index[0]

We plot the training and the test sets, respectively.

alt.Chart(train.reset_index()).mark_line().encode(alt.Y('values'), alt.X('index:Q'), alt.Tooltip('values')).interactive().properties(title='Train data')

alt.Chart(test.reset_index()).mark_line().encode(alt.Y('values'), alt.X('index:Q'), alt.Tooltip('values')).interactive().properties(title='Test data')

First, we generate the random predictions with replacement. That is, the same values can occur more than once.

predictions = np.random.choice(train['values'], size=test.shape[0], replace=True)
predictions

array([166, 179,  96, ..., 105, 157,  95])

We calculate the error terms

errors = test['values'] - predictions
errors

3538   -16
3539   -29
3540    54
3541    54
3542    56
        ..
5302    -8
5303     2
5304    -7
5305   -60
5306     2
Name: values, Length: 1769, dtype: int64

We calculate the root mean squared error of the predictions.

def calculate_rmse(observed, predicted):
    return np.sqrt(sum((observed - predicted) ** 2)/len(observed))

The RMSE is around 34.59 which stands for about 1.34 STD.

rmse = calculate_rmse(test['values'], predictions)
rmse

34.93380641982711

For reference, if we would simply use the simple mean, we would get an RMSE of 18.42,

calculate_rmse(test['values'], test['values'].mean())

18.420942507684327

Let’s plot the results

to_compare = pd.concat(
    [
        test['values'].rename('observed').reset_index(drop=True),
        pd.Series(predictions).rename('predictions').reset_index(drop=True)
    ], axis=1
).stack().reset_index().rename(columns={'level_1': 'status', 'level_0': 'index', 0: 'values'})

to_compare

	index	status	values
0	0	observed	150
1	0	predictions	166
2	1	observed	150
3	1	predictions	179
4	2	observed	150
...	...	...	...
3533	1766	predictions	105
3534	1767	observed	97
3535	1767	predictions	157
3536	1768	observed	97
3537	1768	predictions	95

3538 rows × 3 columns

alt.Chart(to_compare).mark_line().encode(
    alt.X('index:Q'), alt.Y('values'), alt.Color('status'), alt.OpacityValue(0.7)
).properties(width=1200, title='Predictions with replacement')

We put the steps into a function and rerun the prediction without replacement.

def predict_randomly(train, test, replace=True):
    predictions = np.random.choice(train, size=test.shape[0], replace=True)
    errors = test - predictions
    rmse = calculate_rmse(test, predictions)

    return predictions, rmse

predictions, rmse = predict_randomly(train['values'], test['values'], replace=False)
rmse

35.90392934559341

Finally, we also put the plotting steps into a function

def plot_predictions(observed, predicted):

    to_compare = pd.concat(
        [observed.rename('observed').reset_index(drop=True), pd.Series(predicted).rename('predictions').reset_index(drop=True)], axis=1
    ).stack().reset_index().rename(columns={'level_1': 'status', 'level_0': 'index', 0: 'values'})

    chart = alt.Chart(to_compare).mark_line().encode(
        alt.X('index:Q'), alt.Y('values'), alt.Color('status'), alt.OpacityValue(0.7)
    ).properties(width=1200)

    chart.display()

The predictions without replacement

plot_predictions(test['values'], predictions)