import pandas as pd
import altair as alt
import numpy as np
from vega_datasets import dataRandom prediction
Random prediction
When we try to build a prediction algorithm, it is a useful practice to first try out a baseline algorithm and see how they perform. Here we will describe a case of random prediction.
(Inspiration and examples are from Jason Brownlee’s Machine Learning Algorithms from Scratch book.)
np.random.seed(42)For the examples we will use the vega ‘volcano’ dataset. The width and height values are constant, so we work only with the values column.
volcano = data('volcano')
volcano.head()| width | height | values | |
|---|---|---|---|
| 0 | 87 | 61 | 103 |
| 1 | 87 | 61 | 104 |
| 2 | 87 | 61 | 104 |
| 3 | 87 | 61 | 105 |
| 4 | 87 | 61 | 105 |
The random prediction algorithm
- takes a training and a test set
- generates the prediction by selecting random elements from the training set
We split the dataset to training and test sets, by taking the first 2/3 and last 1/3 of the data respectively.
train, test = volcano.iloc[: volcano.shape[0] * 2 // 3, :], volcano.iloc[volcano.shape[0] * 2 // 3 :, :]A small check that the split did not leave out an entry or did not result in an overlap.
assert train.index[-1] + 1 == test.index[0]We plot the training and the test sets, respectively.
alt.Chart(train.reset_index()).mark_line().encode(alt.Y('values'), alt.X('index:Q'), alt.Tooltip('values')).interactive().properties(title='Train data')alt.Chart(test.reset_index()).mark_line().encode(alt.Y('values'), alt.X('index:Q'), alt.Tooltip('values')).interactive().properties(title='Test data')First, we generate the random predictions with replacement. That is, the same values can occur more than once.
predictions = np.random.choice(train['values'], size=test.shape[0], replace=True)
predictionsarray([166, 179, 96, ..., 105, 157, 95])
We calculate the error terms
errors = test['values'] - predictions
errors3538 -16
3539 -29
3540 54
3541 54
3542 56
..
5302 -8
5303 2
5304 -7
5305 -60
5306 2
Name: values, Length: 1769, dtype: int64
We calculate the root mean squared error of the predictions.
def calculate_rmse(observed, predicted):
return np.sqrt(sum((observed - predicted) ** 2)/len(observed))The RMSE is around 34.59 which stands for about 1.34 STD.
rmse = calculate_rmse(test['values'], predictions)
rmse34.93380641982711
For reference, if we would simply use the simple mean, we would get an RMSE of 18.42,
calculate_rmse(test['values'], test['values'].mean())18.420942507684327
Let’s plot the results
to_compare = pd.concat(
[
test['values'].rename('observed').reset_index(drop=True),
pd.Series(predictions).rename('predictions').reset_index(drop=True)
], axis=1
).stack().reset_index().rename(columns={'level_1': 'status', 'level_0': 'index', 0: 'values'})
to_compare| index | status | values | |
|---|---|---|---|
| 0 | 0 | observed | 150 |
| 1 | 0 | predictions | 166 |
| 2 | 1 | observed | 150 |
| 3 | 1 | predictions | 179 |
| 4 | 2 | observed | 150 |
| ... | ... | ... | ... |
| 3533 | 1766 | predictions | 105 |
| 3534 | 1767 | observed | 97 |
| 3535 | 1767 | predictions | 157 |
| 3536 | 1768 | observed | 97 |
| 3537 | 1768 | predictions | 95 |
3538 rows × 3 columns
alt.Chart(to_compare).mark_line().encode(
alt.X('index:Q'), alt.Y('values'), alt.Color('status'), alt.OpacityValue(0.7)
).properties(width=1200, title='Predictions with replacement')We put the steps into a function and rerun the prediction without replacement.
def predict_randomly(train, test, replace=True):
predictions = np.random.choice(train, size=test.shape[0], replace=True)
errors = test - predictions
rmse = calculate_rmse(test, predictions)
return predictions, rmsepredictions, rmse = predict_randomly(train['values'], test['values'], replace=False)
rmse35.90392934559341
Finally, we also put the plotting steps into a function
def plot_predictions(observed, predicted):
to_compare = pd.concat(
[observed.rename('observed').reset_index(drop=True), pd.Series(predicted).rename('predictions').reset_index(drop=True)], axis=1
).stack().reset_index().rename(columns={'level_1': 'status', 'level_0': 'index', 0: 'values'})
chart = alt.Chart(to_compare).mark_line().encode(
alt.X('index:Q'), alt.Y('values'), alt.Color('status'), alt.OpacityValue(0.7)
).properties(width=1200)
chart.display()The predictions without replacement
plot_predictions(test['values'], predictions)