import altair as alt
from vega_datasets import data
How does data standardization work?
With standardization, we transform the data so that its mean becomes 0 and its standard deviation becomes 1.
For this demo, we use the sp500 data set from vega_datasets, which holds the S&P 500 price index.
sp500 = data('sp500')
sp500.head()
| | date | price |
|---|---|---|
| 0 | 2000-01-01 | 1394.46 |
| 1 | 2000-02-01 | 1366.42 |
| 2 | 2000-03-01 | 1498.58 |
| 3 | 2000-04-01 | 1452.43 |
| 4 | 2000-05-01 | 1420.60 |
alt.Chart(sp500).mark_line(point=True, strokeWidth=1).encode(
    alt.X('date:T'), alt.Y('price'), alt.Tooltip('price')
).configure_point(size=10)
The original distribution of prices.
alt.Chart(sp500).mark_bar().encode(
    x=alt.X('price:Q', bin=alt.BinParams(maxbins=20)),
    y=alt.Y('count()'), tooltip=alt.Tooltip(['count()'])
)
We get each standardized value by subtracting the mean from the value and dividing the result by the standard deviation.
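$ z = \frac{x - \mu}{\sigma} $
Here $x$ is a price, $\mu$ is the mean of the prices, $\sigma$ is their standard deviation, and $z$ is the standardized value.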
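To make the calculation concrete, here is a minimal sketch that applies it by hand to a tiny series. The values in `toy` are hypothetical and are not taken from the sp500 data set; the sketch assumes pandas' default sample standard deviation (`ddof=1`), which is also what the cells in this post use.

```python
import pandas as pd

# Hypothetical toy values, chosen only to keep the arithmetic easy to follow.
toy = pd.Series([2.0, 4.0, 6.0, 8.0])

mean = toy.mean()  # 5.0
std = toy.std()    # sample standard deviation (ddof=1), the pandas default

# Subtract the mean from every value, then divide by the standard deviation.
z = (toy - mean) / std

print(z.round(4).tolist())
print(f"mean: {z.mean():.2f}, std: {z.std():.2f}")
```

Values below the mean end up negative, values above it positive, and the result has mean 0 and standard deviation 1.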
The mean and standard deviation of the prices (pandas computes the sample standard deviation by default):
print(f"""
mean: {sp500['price'].mean():.2f}
std: {sp500['price'].std():.2f}
""")
mean: 1184.43
std: 195.41
We standardize the data set.
standardized_prices = sp500.copy()
standardized_prices['price'] = (sp500['price'] - sp500['price'].mean()) / sp500['price'].std()
standardized_prices
| | date | price |
|---|---|---|
| 0 | 2000-01-01 | 1.074812 |
| 1 | 2000-02-01 | 0.931317 |
| 2 | 2000-03-01 | 1.607646 |
| 3 | 2000-04-01 | 1.371473 |
| 4 | 2000-05-01 | 1.208583 |
| ... | ... | ... |
| 118 | 2009-11-01 | -0.454452 |
| 119 | 2009-12-01 | -0.354814 |
| 120 | 2010-01-01 | -0.565809 |
| 121 | 2010-02-01 | -0.409111 |
| 122 | 2010-03-01 | -0.225086 |
123 rows × 2 columns
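The same step can be wrapped in a small reusable function. This is only a sketch: the name `standardize` and its `ddof` parameter are illustrative and not part of the original notebook. It makes explicit one choice that is easy to miss: pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), while `numpy.std()` defaults to the population standard deviation (`ddof=0`), so the two give slightly different standardized values.

```python
import numpy as np
import pandas as pd


def standardize(values: pd.Series, ddof: int = 1) -> pd.Series:
    """Return (values - mean) / std (hypothetical helper, not from the notebook)."""
    std = values.std(ddof=ddof)
    if std == 0:
        raise ValueError("cannot standardize a constant series (std is 0)")
    return (values - values.mean()) / std


# The first five prices shown in sp500.head(). Standardizing only these five
# rows gives different numbers than the table above, because the mean and
# standard deviation are computed from five values instead of all 123.
prices = pd.Series([1394.46, 1366.42, 1498.58, 1452.43, 1420.60])

sample_z = standardize(prices)              # ddof=1, pandas' default
population_z = standardize(prices, ddof=0)  # ddof=0, numpy's default

# The two results differ only by the constant factor sqrt(n / (n - 1)).
n = len(prices)
print(np.allclose(population_z, sample_z * np.sqrt(n / (n - 1))))
```

For a series as long as the sp500 data the difference between the two conventions is small, but it is worth being consistent when comparing results across libraries.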
alt.Chart(standardized_prices).mark_bar().encode(
    x=alt.X('price:Q', bin=alt.BinParams(maxbins=20)),
    y=alt.Y('count()'), tooltip=alt.Tooltip(['count()'])
)
The new mean and standard deviation:
print(f"""
mean: {standardized_prices['price'].mean():.2f}
std: {standardized_prices['price'].std():.2f}
""")
mean: -0.00
std: 1.00
Plotted against the dates, the standardized series has the same shape as before, but its values are now centered around 0.
alt.Chart(standardized_prices).mark_line(point=True, strokeWidth=1).encode(
    alt.X('date:T'), alt.Y('price'), alt.Tooltip('price')
).configure_point(size=10)
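The line chart keeps its shape because standardization is a linear transformation: every value is shifted by the same amount and scaled by the same positive factor. The sketch below checks this on the first five prices from the table at the top of the post. The variable names are illustrative, and the checks are a quick sanity test under that assumption, not a formal proof.

```python
import pandas as pd

# First five prices from sp500.head(), used here only as a small example.
original = pd.Series([1394.46, 1366.42, 1498.58, 1452.43, 1420.60])
standardized = (original - original.mean()) / original.std()

# The ordering of the points is unchanged: each value keeps its rank.
same_order = original.rank().equals(standardized.rank())

# The relationship between the two series is perfectly linear,
# so their Pearson correlation is 1 (up to floating-point error).
correlation = original.corr(standardized)

print(same_order, round(correlation, 6))
```

Only the axis values change; peaks, dips, and their relative sizes stay where they were.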