# Standardizing Data

This chapter is all about standardizing data. A model often makes assumptions about the distribution or scale of your features; standardization transforms your data to fit those assumptions and can improve the algorithm's performance. This is a summary of the lecture "Preprocessing for Machine Learning in Python," via DataCamp.

- Standardizing Data
- Log normalization
- Scaling data for feature comparison
- Standardized data and modeling

```
import pandas as pd
import numpy as np
```

## Standardizing Data

- Standardization
  - Preprocessing method used to transform continuous data to make it look normally distributed
  - Many scikit-learn models assume normally distributed data
  - Two common techniques: log normalization and feature scaling

- When to standardize: models
  - Model in linear space
  - Dataset features have high variance
  - Dataset features are continuous and on different scales
  - Linearity assumptions
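
The points above can be sketched on made-up data: two continuous features on very different scales are brought to a common scale with `StandardScaler`. The data here is synthetic, purely for illustration; the wine dataset is used in the exercises below.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data: two continuous features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(1000, 300, size=100),  # large-scale, high-variance feature
    rng.normal(1, 0.1, size=100),     # small-scale feature
])

ss = StandardScaler()
X_scaled = ss.fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

A distance-based model like k-nearest neighbors would otherwise be dominated by the large-scale feature when computing distances.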

### Modeling without normalizing

Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the wine dataset. One of the columns, `Proline`, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.

The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (`knn`) as well as the `X` and `y` sets you need to fit and score on.

```
wine = pd.read_csv('./dataset/wine_types.csv')
wine.head()
```

```
X = wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]
y = wine['Type']
```

```
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)
# Score the model on the test data
print(knn.score(X_test, y_test))
```

```
wine.describe()
```

The `Proline` column has an extremely high variance.

```
print(wine['Proline'].var())
# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])
# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())
```
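
One caveat worth noting: `np.log` is undefined at zero and for negative values, so it only works here because every `Proline` value is positive. If a feature can contain zeros, `np.log1p` (which computes log(1 + x)) is a common alternative. A minimal sketch on made-up values:

```python
import numpy as np

# Hypothetical skewed values including a zero, where np.log would give -inf
values = np.array([0.0, 10.0, 100.0, 1000.0])

# log1p computes log(1 + x), so it is well-defined at zero
logged = np.log1p(values)
print(logged)
```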

### Scaling data - investigating columns

We want to use the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the `wine` dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model. Use `describe()` to return descriptive statistics and compare the scale of the data in these columns.

```
wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe()
```

```
from sklearn.preprocessing import StandardScaler
# Create the scaler
ss = StandardScaler()
# Take a subset of the DataFrame you want to scale
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]
print(wine_subset.iloc[:3])
# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)
print(wine_subset_scaled[:3])
```
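
Under the hood, `StandardScaler` subtracts each column's mean and divides by its standard deviation (the population standard deviation, i.e. `ddof=0`). You can verify this by hand; the matrix below is a small made-up stand-in for the wine subset so the sketch runs without the CSV file:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up 3x3 matrix standing in for the Ash / Alcalinity of ash / Magnesium subset
X = np.array([[2.1, 15.0, 110.0],
              [2.4, 18.0,  95.0],
              [2.0, 21.0, 120.0]])

ss = StandardScaler()
X_scaled = ss.fit_transform(X)

# Reproduce the transform manually: (x - column mean) / column std
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_scaled, manual))  # True
```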

### KNN on non-scaled data

Let's first take a look at the accuracy of a k-nearest neighbors model on the `wine` dataset without standardizing the data. The `knn` model as well as the `X` and `y` data and label sets have been created already. Most of this process of creating models in scikit-learn should look familiar to you.

```
wine = pd.read_csv('./dataset/wine_types.csv')
X = wine.drop('Type', axis=1)
y = wine['Type']
knn = KNeighborsClassifier()
```

```
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)
# Score the model on the test data
print(knn.score(X_test, y_test))
```

```
knn = KNeighborsClassifier()
# Create the scaling method
ss = StandardScaler()
# Apply the scaling method to the dataset used for modeling
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train, y_train)
# Score the model on the test data
print(knn.score(X_test, y_test))
```
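
One caveat with the cell above: the scaler is fit on the full dataset *before* the train/test split, which leaks test-set statistics into training. A common fix is to fit the scaler on the training split only, for example via a `Pipeline`. A sketch using synthetic data (so it runs without the wine CSV):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features and labels
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training data only,
# then applies the same transform to the test data at score time
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

With only 178 rows in the wine dataset the difference from leakage is usually small, but the pipeline pattern is the safer habit.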