import altair as alt
# alt.renderers.enable('default')
alt.renderers
RendererRegistry(active='default', registered=['colab', 'default', 'html', 'json', 'jupyterlab', 'kaggle', 'mimetype', 'notebook', 'nteract', 'png', 'svg', 'zeppelin'])
September 21, 2020
Normalization is a common technique in machine learning for rescaling features of different magnitudes to a common range between 0 and 1.
Here we demonstrate how this is done with pandas and altair.
Original inspiration: [Jason Brownlee: Machine Learning Algorithms from Scratch](https://machinelearningmastery.com/machine-learning-algorithms-from-scratch/)
We use the Gapminder health and income dataset.
| | country | income | health | population |
|---|---|---|---|---|
| 0 | Afghanistan | 1925 | 57.63 | 32526562 |
| 1 | Albania | 10620 | 76.00 | 2896679 |
| 2 | Algeria | 13434 | 76.50 | 39666519 |
| 3 | Andorra | 46577 | 84.10 | 70473 |
| 4 | Angola | 7615 | 61.00 | 25021974 |
income_domain = [health_income['income'].min(), health_income['income'].max()]
health_domain = [health_income['health'].min(), health_income['health'].max()]
alt.Chart(health_income).mark_point().encode(
    alt.X('income:Q', scale=alt.Scale(domain=income_domain)),
    alt.Y('health:Q', scale=alt.Scale(domain=health_domain)),
    alt.Size('population:Q'),
    alt.Tooltip('country:N')
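).properties(height=600, width=800)
The process: subtract the column minimum from every value, then divide the result by the column's range (max - min).
$ normalized = \frac{value - min}{max - min} $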
The first step ensures that the smallest value will become 0. Dividing the reduced values by the range ‘compresses’ the values so the new maximum becomes 1.
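The two steps can be traced by hand on a few income values. This is a minimal sketch, assuming a pandas `Series` that holds only three incomes copied from the tables in this post (the dataset's minimum, its maximum, and Afghanistan); the variable names are illustrative.

```python
import pandas as pd

# Three income values copied from the tables in this post (not the full dataset).
income = pd.Series(
    [1925, 599, 132877],
    index=["Afghanistan", "Central African Republic", "Qatar"],
    name="income",
)

# Step 1: subtract the minimum so the smallest value becomes 0.
shifted = income - income.min()

# Step 2: divide by the range (max - min) so the largest value becomes 1.
value_range = income.max() - income.min()
normalized = shifted / value_range

print(normalized.round(6))
```

Afghanistan's income of 1925 becomes 1326 / 132278, roughly 0.010024, which agrees with the normalized table further down.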
The original minimum and maximum values
| | country | income | health | population |
|---|---|---|---|---|
| 32 | Central African Republic | 599 | 53.8 | 4900274 |
| 93 | Lesotho | 2598 | 48.5 | 2135022 |
| 105 | Marshall Islands | 3661 | 65.1 | 52993 |
income 599.0
health 48.5
population 52993.0
dtype: float64
| | country | income | health | population |
|---|---|---|---|---|
| 134 | Qatar | 132877 | 82.0 | 2235355 |
| 3 | Andorra | 46577 | 84.1 | 70473 |
| 35 | China | 13334 | 76.9 | 1376048943 |
income 1.328770e+05
health 8.410000e+01
population 1.376049e+09
dtype: float64
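The cells that produced the outputs above are not shown. One plausible way to get them with pandas is `min()`/`max()` for the values and `idxmin()`/`idxmax()` for the rows that hold them. The sketch below assumes a small `sample` frame built from rows quoted in this post, and assumes that `quantitative_columns` lists the three numeric columns.

```python
import pandas as pd

# Rows copied from the tables in this post; the full dataset has 187 rows.
sample = pd.DataFrame(
    {
        "country": ["Afghanistan", "Andorra", "Central African Republic",
                    "China", "Lesotho", "Marshall Islands", "Qatar"],
        "income": [1925, 46577, 599, 13334, 2598, 3661, 132877],
        "health": [57.63, 84.1, 53.8, 76.9, 48.5, 65.1, 82.0],
        "population": [32526562, 70473, 4900274, 1376048943,
                       2135022, 52993, 2235355],
    }
)
# Assumption: the quantitative columns are the three numeric ones.
quantitative_columns = ["income", "health", "population"]

# Column-wise minimum and maximum values.
col_min = sample[quantitative_columns].min()
col_max = sample[quantitative_columns].max()

# Rows in which each column reaches its minimum / maximum.
min_rows = sample.loc[sample[quantitative_columns].idxmin().tolist()]
max_rows = sample.loc[sample[quantitative_columns].idxmax().tolist()]

print(min_rows)
print(max_rows)
```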
Difference of values from the column minimum
| | income | health | population |
|---|---|---|---|
| 0 | 1326.0 | 9.13 | 32473569.0 |
| 1 | 10021.0 | 27.50 | 2843686.0 |
| 2 | 12835.0 | 28.00 | 39613526.0 |
| 3 | 45978.0 | 35.60 | 17480.0 |
| 4 | 7016.0 | 12.50 | 24968981.0 |
| ... | ... | ... | ... |
| 182 | 5024.0 | 28.00 | 93394608.0 |
| 183 | 3720.0 | 26.70 | 4615473.0 |
| 184 | 3288.0 | 19.10 | 26779222.0 |
| 185 | 3435.0 | 10.46 | 16158774.0 |
| 186 | 1202.0 | 11.51 | 15549758.0 |
187 rows × 3 columns
Value ranges: the difference between the maximum and the minimum
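Both intermediate quantities come from simple column-wise arithmetic: pandas aligns the per-column minimum with the frame's columns when subtracting. The following is a sketch on a handful of rows copied from this post, with illustrative variable names; it is not the original notebook cell.

```python
import pandas as pd

# A few rows copied from this post, indexed by country for readability.
sample = pd.DataFrame(
    {
        "country": ["Afghanistan", "Central African Republic", "Lesotho",
                    "Marshall Islands", "Qatar", "Andorra", "China"],
        "income": [1925, 599, 2598, 3661, 132877, 46577, 13334],
        "health": [57.63, 53.8, 48.5, 65.1, 82.0, 84.1, 76.9],
        "population": [32526562, 4900274, 2135022, 52993,
                       2235355, 70473, 1376048943],
    }
).set_index("country")

# Difference of every value from its column minimum.
diff_from_min = sample - sample.min()

# Range of each column: maximum minus minimum.
value_range = sample.max() - sample.min()

print(diff_from_min.loc["Afghanistan"])
print(value_range)
```

For Afghanistan this reproduces the first row of the difference table above (1326.0, 9.13, 32473569.0).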
Let’s normalize the dataset
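The definition of `normalize_dataset` is not shown in this post. A minimal sketch of what such a helper could look like follows, assuming it takes a DataFrame plus a list of column names and returns a copy with those columns min-max scaled; the body is a guess at the intent, not the author's code.

```python
import pandas as pd


def normalize_dataset(dataset, columns):
    """Return a copy of `dataset` with `columns` min-max scaled to [0, 1]."""
    normalized = dataset.copy()
    col_min = dataset[columns].min()
    value_range = dataset[columns].max() - col_min
    # Assumes max > min in every column; a constant column would divide by zero.
    normalized[columns] = (dataset[columns] - col_min) / value_range
    return normalized


# Usage on a few rows copied from this post (not the full 187-row dataset).
sample = pd.DataFrame(
    {
        "country": ["Afghanistan", "Central African Republic", "Lesotho",
                    "Marshall Islands", "Qatar", "Andorra", "China"],
        "income": [1925, 599, 2598, 3661, 132877, 46577, 13334],
        "health": [57.63, 53.8, 48.5, 65.1, 82.0, 84.1, 76.9],
        "population": [32526562, 4900274, 2135022, 52993,
                       2235355, 70473, 1376048943],
    }
)
result = normalize_dataset(sample, ["income", "health", "population"])
print(result.round(6))
```

Returning a copy leaves the original frame untouched, so the raw and normalized data can be plotted side by side.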
normalized_health_income = normalize_dataset(health_income, quantitative_columns)
normalized_health_income
| | country | income | health | population |
|---|---|---|---|---|
| 0 | Afghanistan | 0.010024 | 0.256461 | 0.023600 |
| 1 | Albania | 0.075757 | 0.772472 | 0.002067 |
| 2 | Algeria | 0.097030 | 0.786517 | 0.028789 |
| 3 | Andorra | 0.347586 | 1.000000 | 0.000013 |
| 4 | Angola | 0.053040 | 0.351124 | 0.018146 |
| ... | ... | ... | ... | ... |
| 182 | Vietnam | 0.037981 | 0.786517 | 0.067874 |
| 183 | West Bank and Gaza | 0.028123 | 0.750000 | 0.003354 |
| 184 | Yemen | 0.024857 | 0.536517 | 0.019462 |
| 185 | Zambia | 0.025968 | 0.293820 | 0.011743 |
| 186 | Zimbabwe | 0.009087 | 0.323315 | 0.011301 |
187 rows × 4 columns
The new minimum and maximum values
| | country | income | health | population |
|---|---|---|---|---|
| 32 | Central African Republic | 0.000000 | 0.148876 | 0.003523 |
| 93 | Lesotho | 0.015112 | 0.000000 | 0.001513 |
| 105 | Marshall Islands | 0.023148 | 0.466292 | 0.000000 |
| | country | income | health | population |
|---|---|---|---|---|
| 134 | Qatar | 1.000000 | 0.941011 | 0.001586 |
| 3 | Andorra | 0.347586 | 1.000000 | 0.000013 |
| 35 | China | 0.096275 | 0.797753 | 1.000000 |
Plotting the normalized data gives the same picture as before, but with the income, health, and population scales all normalized to the [0, 1] range.