Linear regression ‘from scratch’ to predict Ethereum gas prices from transaction values

fastai

Published

September 27, 2020

Linear regression ‘from scratch’ to predict Ethereum gas prices from transaction values

In this post I go through the main steps of how to calculate a simple univariate linear regression model.

For the example I try to predict Ethereum daily average gas prices from daily average transaction values. I pull the data from the public Google data base with BigQuery.

Code and inspiration are based on Jason Brownlee’s “Machine Learning Algorithms from Scratch” book

Libraries and data load

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]=os.path.expanduser("~/.credentials/Notebook bigquery-c422e406404b.json")

from google.cloud import bigquery
client = bigquery.Client()

import altair as alt
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

query ="""
SELECT
    EXTRACT(DATE FROM block_timestamp) AS date,
    AVG(value) AS value,
    AVG(gas_price) AS gas_price,    
FROM `bigquery-public-data.ethereum_blockchain.transactions`
WHERE
    EXTRACT(YEAR FROM block_timestamp) = 2019
GROUP BY date
ORDER BY date
"""

values = client.query(query).to_dataframe(dtypes={'value': float, 'gas_price': float}, date_as_object=False)
values.head()

	date	value	gas_price
0	2019-01-01	3.719103e+18	1.431514e+10
1	2019-01-02	4.649915e+18	1.349952e+10
2	2019-01-03	4.188781e+18	1.269504e+10
3	2019-01-04	6.958368e+18	1.418197e+10
4	2019-01-05	8.167590e+18	2.410475e+10

Calculating linear regression

Linear regression makes predictions with the help of linear coefficient. For an univariate case, this is expressed as:

$ = b_0 + b_1 x $

Coefficients

We can estimate the $b_0$ and $b_1$ coeffients in the following ways:

$ b_1 = { _{i=1}^{n} (x_i - {x})^2 } $

$ b_0 = {y} - b_1 {x} $

Where

$x$: the predictor variable
$y$: the variable to predict
$ {x} $ and $\bar{y}$ are their respective means

Covariance and variance

We can also can get the $b_1$ coefficient from the variance and covariance:

Covariance: $ (x, y) = { n } $
Variance: $ ^2 = { n } $

From these, we can get

$ b_1 = { ^2 } $

Calculation steps

Accordingly, we need to calculate the following metrics:

Variable means for both $x$ and $y$
Their deviations from the mean
Covariance of $x$ and $y$
Variance of $x$
The $b_1$ coeffcient
The $b_0$ coefficient
$\hat{y}$, that is, the predictions

Means

values.mean()

value        3.173648e+18
gas_price    1.616874e+10
dtype: float64

Deviations from the mean

def deviation(array):
    return array - array.mean()

deviation(values['value'])

0      5.454550e+17
1      1.476267e+18
2      1.015132e+18
3      3.784720e+18
4      4.993942e+18
           ...     
360   -1.004019e+18
361   -1.259703e+18
362   -1.050454e+18
363   -6.660476e+17
364    5.699485e+17
Name: value, Length: 365, dtype: float64

Covariance

def covariance(arr1, arr2):
    return (deviation(arr1) * deviation(arr2)).sum() / len(arr1)

covariance(values['value'], values['gas_price'])

2.8148882775011114e+27

Variance

def variance(arr):
    return (deviation(arr) ** 2).sum() / len(arr)

variance(values['value'])

1.3829873312251886e+36

The $b_1$ coefficient

def b1(arr1, arr2):
    return covariance(arr1, arr2) / variance(arr1)

b1(values['value'], values['gas_price'])

2.035368086132359e-09

The $b_0$ coefficient

def b0(arr1, arr2):
    return arr2.mean() - b1(arr1, arr2) * arr1.mean()

b_0 = b0(values['value'], values['gas_price'])
b_0

9709198058.151459

Predictions

values['predictions'] = b_0 + values['value'] * b_1

values['predictions']

0      1.727894e+10
1      1.917349e+10
2      1.823491e+10
3      2.387204e+10
4      2.633325e+10
           ...     
360    1.412519e+10
361    1.360478e+10
362    1.403068e+10
363    1.481309e+10
364    1.732880e+10
Name: predictions, Length: 365, dtype: float64

values.head()

	date	value	gas_price	predictions
0	2019-01-01	3.719103e+18	1.431514e+10	1.727894e+10
1	2019-01-02	4.649915e+18	1.349952e+10	1.917349e+10
2	2019-01-03	4.188781e+18	1.269504e+10	1.823491e+10
3	2019-01-04	6.958368e+18	1.418197e+10	2.387204e+10
4	2019-01-05	8.167590e+18	2.410475e+10	2.633325e+10

Plotting the results

First we transform the data into long format

to_plot = values.melt(id_vars=['date'], value_vars=['gas_price', 'predictions'], var_name='status')
to_plot.head()

	date	status	value
0	2019-01-01	gas_price	1.431514e+10
1	2019-01-02	gas_price	1.349952e+10
2	2019-01-03	gas_price	1.269504e+10
3	2019-01-04	gas_price	1.418197e+10
4	2019-01-05	gas_price	2.410475e+10

Then, we plot the results.

chart = alt.Chart().mark_line().encode(
    alt.X('date:T', axis=alt.Axis(format=("%x"), labelAngle=270)),
    alt.Y('value:Q', axis=alt.Axis(format=",.2e"), scale=alt.Scale(type='log')), color=alt.Color('status:N')
).properties(width=600)

label = alt.selection_single(
    encodings=['x'],
    on='mouseover',
    nearest=True,
    empty='none'
)


alt.layer(
    chart,


    chart.mark_circle().encode(
        opacity=alt.condition(label, alt.value(1), alt.value(0))
    ).add_selection(label),
    
    alt.Chart().mark_rule(color='darkgray').encode(
        x='date:T'
    ).transform_filter(label),
    
    chart.mark_text(align='left', dx=5, dy=-5, stroke='black', strokeWidth=0.5).encode(
        text=alt.Text('value:Q', format=',.2e')
    ).transform_filter(label),

    chart.mark_text(align='left', dx=5, dy=-5).encode(
        text=alt.Text('value:Q', format=',.2e')
    ).transform_filter(label),
    data=to_plot

).properties(title='Gas prices and their predictions (log scale)', width=600, height=400)

It is a bit hard to assess the performance of the results just by looking at them, but at least it seems to generate values within the same ballpark. We could generate metrics as RMSE for reference.

Obvious improvements - remove the outliers - include past values

Linear regression ‘from scratch’ to predict Ethereum gas prices from transaction values

Linear regression ‘from scratch’ to predict Ethereum gas prices from transaction values

Libraries and data load

Calculating linear regression

Coefficients

Covariance and variance

Calculation steps

Means

Deviations from the mean

Covariance

Variance

The \(b_1\) coefficient

The \(b_0\) coefficient

Predictions

Plotting the results