How to use Python to normalize categorical data
To normalize categorical data in Python, that is, to convert categories into numeric values that a model can consume, you can use one of three techniques: label encoding, one-hot encoding, or dummy encoding. Each suits different kinds of data and models. The examples in this answer use two popular Python libraries: pandas and scikit-learn.
1. Label Encoding
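Label encoding converts each category to an integer value. It is useful when the categorical data has some order or hierarchy, but only if the integers follow that order. Note that scikit-learn's LabelEncoder numbers the labels in alphabetical order (here High=0, Low=1, Medium=2) and is designed for encoding target labels, so a meaningful order has to be specified explicitly, for example with OrdinalEncoder's categories parameter or an ordered pandas Categorical.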
Example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Sample dataset
data = {'Category': ['Low', 'Medium', 'High', 'Medium', 'Low']}
df = pd.DataFrame(data)
# Apply Label Encoding
# LabelEncoder assigns integers alphabetically: High=0, Low=1, Medium=2
label_encoder = LabelEncoder()
df['Encoded_Category'] = label_encoder.fit_transform(df['Category'])
print(df)
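If the integers should follow the logical order Low < Medium < High, the order has to be passed in explicitly. The following is a minimal sketch using scikit-learn's OrdinalEncoder; the variable name `order` and the choice of Low as the lowest level are assumptions made for illustration, not part of the original example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same sample dataset as above
df = pd.DataFrame({'Category': ['Low', 'Medium', 'High', 'Medium', 'Low']})

# Assumed logical order, from lowest to highest
order = ['Low', 'Medium', 'High']

# OrdinalEncoder expects 2D input and one list of categories per column
ordinal_encoder = OrdinalEncoder(categories=[order])
encoded = ordinal_encoder.fit_transform(df[['Category']])

# Flatten the (n, 1) float array into an integer column
df['Encoded_Category'] = encoded.ravel().astype(int)
print(df)
```

With this mapping, Low becomes 0, Medium becomes 1, and High becomes 2, so numeric comparisons between encoded values agree with the intended ranking.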
2. One-Hot Encoding
One-hot encoding is the process of converting each category into a binary vector, where the length of the vector is equal to the number of unique categories. It is useful when the categorical data has no order or hierarchy.
Example:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Sample dataset
data = {'Category': ['A', 'B', 'C', 'A', 'B']}
df = pd.DataFrame(data)
# Apply One-Hot Encoding
# sparse_output replaced the older `sparse` argument in scikit-learn 1.2
one_hot_encoder = OneHotEncoder(sparse_output=False)
encoded_data = one_hot_encoder.fit_transform(df[['Category']])
encoded_df = pd.DataFrame(encoded_data, columns=one_hot_encoder.get_feature_names_out(['Category']))
print(encoded_df)
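In practice the encoder is usually fitted on training data and then applied to new data, which may contain categories that were not seen during fitting. The sketch below shows one way to handle that with `handle_unknown='ignore'`; the `train` and `new` frames and the unseen category 'D' are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and new data; 'D' does not appear in training
train = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B']})
new = pd.DataFrame({'Category': ['B', 'D']})

# handle_unknown='ignore' encodes unseen categories as all zeros
# instead of raising an error at transform time
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['Category']])

# The default output is a sparse matrix; convert it to a dense array
encoded_new = encoder.transform(new[['Category']]).toarray()
print(encoded_new)
```

The row for 'B' has a single 1 in the second position, while the row for 'D' is all zeros because no column exists for it.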
3. Dummy Encoding
Dummy encoding is similar to one-hot encoding but creates one fewer column for each categorical variable to avoid multicollinearity. The dropped category becomes the baseline and is represented by zeros in all remaining columns. It is helpful for linear regression and other algorithms sensitive to multicollinearity.
Example:
import pandas as pd
# Sample dataset
data = {'Category': ['A', 'B', 'C', 'A', 'B']}
df = pd.DataFrame(data)
# Apply Dummy Encoding
# dtype=int yields 0/1 columns; pandas 2.0 and later default to True/False
dummy_df = pd.get_dummies(df, columns=['Category'], drop_first=True, dtype=int)
print(dummy_df)
In Summary
- 💻 To normalize categorical data in Python, three techniques can be used: label encoding, one-hot encoding, and dummy encoding.
- 🔢 Label encoding converts each category to an integer value and is appropriate when the categorical data has some order or hierarchy, provided the integers are assigned in that order.
- 📊 One-hot encoding converts each category into a binary vector and is appropriate when the categorical data has no order or hierarchy.
- 📉 Dummy encoding is similar to one-hot encoding but creates one fewer column for each categorical variable to avoid multicollinearity.
- 🧪 Choose the encoding technique based on the dataset and the machine learning model you plan to use.