import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]=os.path.expanduser("~/.credentials/Notebook bigquery-c422e406404b.json")Linear regression ‘from scratch’ to predict Ethereum gas prices from transaction values
Linear regression ‘from scratch’ to predict Ethereum gas prices from transaction values
In this post I go through the main steps of how to calculate a simple univariate linear regression model.
For the example I try to predict Ethereum daily average gas prices from daily average transaction values. I pull the data from the public Google data base with BigQuery.
Code and inspiration are based on Jason Brownlee’s “Machine Learning Algorithms from Scratch” book
Libraries and data load
from google.cloud import bigquery
client = bigquery.Client()import altair as alt
alt.data_transformers.disable_max_rows()DataTransformerRegistry.enable('default')
query ="""
SELECT
EXTRACT(DATE FROM block_timestamp) AS date,
AVG(value) AS value,
AVG(gas_price) AS gas_price,
FROM `bigquery-public-data.ethereum_blockchain.transactions`
WHERE
EXTRACT(YEAR FROM block_timestamp) = 2019
GROUP BY date
ORDER BY date
"""values = client.query(query).to_dataframe(dtypes={'value': float, 'gas_price': float}, date_as_object=False)
values.head()| date | value | gas_price | |
|---|---|---|---|
| 0 | 2019-01-01 | 3.719103e+18 | 1.431514e+10 |
| 1 | 2019-01-02 | 4.649915e+18 | 1.349952e+10 |
| 2 | 2019-01-03 | 4.188781e+18 | 1.269504e+10 |
| 3 | 2019-01-04 | 6.958368e+18 | 1.418197e+10 |
| 4 | 2019-01-05 | 8.167590e+18 | 2.410475e+10 |
Calculating linear regression
Linear regression makes predictions with the help of linear coefficient. For an univariate case, this is expressed as:
$ = b_0 + b_1 x $
Coefficients
We can estimate the \(b_0\) and \(b_1\) coeffients in the following ways:
$ b_1 = { _{i=1}^{n} (x_i - {x})^2 } $
$ b_0 = {y} - b_1 {x} $
Where
- \(x\): the predictor variable
- \(y\): the variable to predict
- $ {x} $ and \(\bar{y}\) are their respective means
Covariance and variance
We can also can get the \(b_1\) coefficient from the variance and covariance:
- Covariance: $ (x, y) = { n } $
- Variance: $ ^2 = { n } $
From these, we can get
$ b_1 = { ^2 } $
Calculation steps
Accordingly, we need to calculate the following metrics:
- Variable means for both \(x\) and \(y\)
- Their deviations from the mean
- Covariance of \(x\) and \(y\)
- Variance of \(x\)
- The \(b_1\) coeffcient
- The \(b_0\) coefficient
- \(\hat{y}\), that is, the predictions
Means
values.mean()value 3.173648e+18
gas_price 1.616874e+10
dtype: float64
Deviations from the mean
def deviation(array):
return array - array.mean()deviation(values['value'])0 5.454550e+17
1 1.476267e+18
2 1.015132e+18
3 3.784720e+18
4 4.993942e+18
...
360 -1.004019e+18
361 -1.259703e+18
362 -1.050454e+18
363 -6.660476e+17
364 5.699485e+17
Name: value, Length: 365, dtype: float64
Covariance
def covariance(arr1, arr2):
return (deviation(arr1) * deviation(arr2)).sum() / len(arr1)covariance(values['value'], values['gas_price'])2.8148882775011114e+27
Variance
def variance(arr):
return (deviation(arr) ** 2).sum() / len(arr)variance(values['value'])1.3829873312251886e+36
The \(b_1\) coefficient
def b1(arr1, arr2):
return covariance(arr1, arr2) / variance(arr1)b1(values['value'], values['gas_price'])2.035368086132359e-09
The \(b_0\) coefficient
def b0(arr1, arr2):
return arr2.mean() - b1(arr1, arr2) * arr1.mean()b_0 = b0(values['value'], values['gas_price'])
b_09709198058.151459
Predictions
values['predictions'] = b_0 + values['value'] * b_1values['predictions']0 1.727894e+10
1 1.917349e+10
2 1.823491e+10
3 2.387204e+10
4 2.633325e+10
...
360 1.412519e+10
361 1.360478e+10
362 1.403068e+10
363 1.481309e+10
364 1.732880e+10
Name: predictions, Length: 365, dtype: float64
values.head()| date | value | gas_price | predictions | |
|---|---|---|---|---|
| 0 | 2019-01-01 | 3.719103e+18 | 1.431514e+10 | 1.727894e+10 |
| 1 | 2019-01-02 | 4.649915e+18 | 1.349952e+10 | 1.917349e+10 |
| 2 | 2019-01-03 | 4.188781e+18 | 1.269504e+10 | 1.823491e+10 |
| 3 | 2019-01-04 | 6.958368e+18 | 1.418197e+10 | 2.387204e+10 |
| 4 | 2019-01-05 | 8.167590e+18 | 2.410475e+10 | 2.633325e+10 |
Plotting the results
First we transform the data into long format
to_plot = values.melt(id_vars=['date'], value_vars=['gas_price', 'predictions'], var_name='status')
to_plot.head()| date | status | value | |
|---|---|---|---|
| 0 | 2019-01-01 | gas_price | 1.431514e+10 |
| 1 | 2019-01-02 | gas_price | 1.349952e+10 |
| 2 | 2019-01-03 | gas_price | 1.269504e+10 |
| 3 | 2019-01-04 | gas_price | 1.418197e+10 |
| 4 | 2019-01-05 | gas_price | 2.410475e+10 |
Then, we plot the results.
chart = alt.Chart().mark_line().encode(
alt.X('date:T', axis=alt.Axis(format=("%x"), labelAngle=270)),
alt.Y('value:Q', axis=alt.Axis(format=",.2e"), scale=alt.Scale(type='log')), color=alt.Color('status:N')
).properties(width=600)
label = alt.selection_single(
encodings=['x'],
on='mouseover',
nearest=True,
empty='none'
)
alt.layer(
chart,
chart.mark_circle().encode(
opacity=alt.condition(label, alt.value(1), alt.value(0))
).add_selection(label),
alt.Chart().mark_rule(color='darkgray').encode(
x='date:T'
).transform_filter(label),
chart.mark_text(align='left', dx=5, dy=-5, stroke='black', strokeWidth=0.5).encode(
text=alt.Text('value:Q', format=',.2e')
).transform_filter(label),
chart.mark_text(align='left', dx=5, dy=-5).encode(
text=alt.Text('value:Q', format=',.2e')
).transform_filter(label),
data=to_plot
).properties(title='Gas prices and their predictions (log scale)', width=600, height=400)It is a bit hard to assess the performance of the results just by looking at them, but at least it seems to generate values within the same ballpark. We could generate metrics as RMSE for reference.
Obvious improvements - remove the outliers - include past values