Kaggle Winning Solutions   |   How to win a kaggle solution


Label Encoding

When a categorical variable is stored with the Object data type, it must be converted into numerical values before it can be fed into most models. There are several ways to do this. One of them is label encoding, which assigns an integer to each category and maps the column accordingly.

Below is an example where x2, an animal name, is a categorical feature.

In [14]:
import numpy as np
import pandas as pd

# create a synthetic X and y where x1 is numerical feature and x2 is a categorical feature
X = pd.DataFrame({'x1': np.random.random(5), 'x2': ['cat', 'cat', 'dog', 'cat', 'dog']})
y = pd.Series(np.random.randint(2, size=5)) # binary target variable y
print(X.head(), '\n')
print(y.head(), '\n')
         x1   x2
0  0.972804  cat
1  0.781573  cat
2  0.469742  dog
3  0.183671  cat
4  0.861439  dog 

0    1
1    0
2    0
3    0
4    1
dtype: int32 

Before we feed X into a model, we need to convert x2 to numerical values.

In [15]:
# unique values in feature x2 are "cat" and "dog"
categories = list(X['x2'].unique())
print(categories)

# we want to create a dictionary to map cat to 0 and dog to 1
map_dict = {'cat': 0, 'dog': 1}

# then we use the dictionary map_dict to convert the Object column to integer codes
X['x2'] = X['x2'].map(map_dict)

# now x2 has numerical values and is ready to be fed into a model
print(X.head())
['cat', 'dog']
         x1  x2
0  0.972804   0
1  0.781573   0
2  0.469742   1
3  0.183671   0
4  0.861439   1
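If you would rather not type the dictionary by hand, you can build it from the unique values with a dict comprehension. This is a small sketch (not from the original notebook); note that the codes follow the order of first appearance in the column rather than alphabetical order:

```python
import pandas as pd

X = pd.DataFrame({'x2': ['cat', 'cat', 'dog', 'cat', 'dog']})

# build the mapping automatically from the unique categories
map_dict = {cat: i for i, cat in enumerate(X['x2'].unique())}
print(map_dict)  # {'cat': 0, 'dog': 1}

X['x2'] = X['x2'].map(map_dict)
print(X['x2'].tolist())  # [0, 0, 1, 0, 1]
```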

If your features have many categories and you cannot create a map_dict manually, there are libraries that do it for you, e.g. the LabelEncoder class from sklearn.preprocessing.

In [18]:
from sklearn.preprocessing import LabelEncoder

# create a synthetic X and y where x1 is numerical feature and x2 is a categorical feature
X = pd.DataFrame({'x1': np.random.random(5), 'x2': ['cat', 'cat', 'dog', 'cat', 'dog']})
y = pd.Series(np.random.randint(2, size=5)) # binary target variable y
print(X.head(), '\n')

# make an encoder object
encoder = LabelEncoder()

# fit and transform feature x2
X['x2'] = encoder.fit_transform(X['x2'])

# the output is similar to what we did manually before
print(X.head())
         x1   x2
0  0.694026  cat
1  0.789591  cat
2  0.740815  dog
3  0.017707  cat
4  0.500869  dog 

         x1  x2
0  0.694026   0
1  0.789591   0
2  0.740815   1
3  0.017707   0
4  0.500869   1
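A convenience of LabelEncoder over a manual dictionary is that it can also decode: the fitted classes_ attribute holds the sorted unique labels (index = assigned code), and inverse_transform recovers the original strings. A quick sketch:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(['cat', 'cat', 'dog'])

# classes_ holds the sorted unique labels; the index is the assigned code
print(encoder.classes_)  # ['cat' 'dog']

# inverse_transform maps integer codes back to the original strings
print(encoder.inverse_transform(codes))  # ['cat' 'cat' 'dog']
```

Keep in mind that LabelEncoder assigns codes in sorted label order, so the codes may differ from a hand-built map_dict that follows order of appearance.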

Note that if you have two separate data sets (X_train and X_test), you need to fit the encoder on all of the data; otherwise the test set X_test may contain categories that were not in the train set X_train, and you will get an error. See the following example:

In [19]:
X_train = pd.DataFrame({'x1': np.random.random(5), 'x2': ['cat', 'cat', 'dog', 'cat', 'dog']})
X_test = pd.DataFrame({'x1': np.random.random(5), 'x2': ['cat', 'cat', 'dog', 'rat', 'dog']})


# make an encoder object
encoder = LabelEncoder()

# fit and transform feature x2
X_train['x2'] = encoder.fit_transform(X_train['x2'])
X_test['x2'] = encoder.transform(X_test['x2'])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
c:\users\appdata\local\programs\python\python37\lib\site-packages\sklearn\preprocessing\label.py in _encode_python(values, uniques, encode)
     63         try:
---> 64             encoded = np.array([table[v] for v in values])
     65         except KeyError as e:

c:\users\appdata\local\programs\python\python37\lib\site-packages\sklearn\preprocessing\label.py in <listcomp>(.0)
     63         try:
---> 64             encoded = np.array([table[v] for v in values])
     65         except KeyError as e:

KeyError: 'rat'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-19-553068c4514a> in <module>
      8 # fit and transform feature x2
      9 X_train['x2'] = encoder.fit_transform(X_train['x2'])
---> 10 X_test['x2'] = encoder.transform(X_test['x2'])

c:\users\appdata\local\programs\python\python37\lib\site-packages\sklearn\preprocessing\label.py in transform(self, y)
    255             return np.array([])
    256 
--> 257         _, y = _encode(y, uniques=self.classes_, encode=True)
    258         return y
    259 

c:\users\appdata\local\programs\python\python37\lib\site-packages\sklearn\preprocessing\label.py in _encode(values, uniques, encode)
    103     if values.dtype == object:
    104         try:
--> 105             res = _encode_python(values, uniques, encode)
    106         except TypeError:
    107             raise TypeError("argument must be a string or number")

c:\users\appdata\local\programs\python\python37\lib\site-packages\sklearn\preprocessing\label.py in _encode_python(values, uniques, encode)
     65         except KeyError as e:
     66             raise ValueError("y contains previously unseen labels: %s"
---> 67                              % str(e))
     68         return uniques, encoded
     69     else:

ValueError: y contains previously unseen labels: 'rat'

To resolve this issue, we first concatenate X_train and X_test and fit the encoder on the combined column before transforming each set separately. You can wrap everything in a loop over all of your categorical features:

In [22]:
X_train = pd.DataFrame({'x1': np.random.random(5), 'x2': ['cat', 'cat', 'dog', 'cat', 'dog']})
X_test = pd.DataFrame({'x1': np.random.random(5), 'x2': ['cat', 'cat', 'dog', 'rat', 'dog']})

categorical_features = ['x2']

# make an encoder object
encoder = LabelEncoder()

# fit on the combined column, then transform train and test separately
for col in categorical_features:
    encoder.fit(pd.concat([X_train[col], X_test[col]], axis=0, sort=False))
    X_train[col] = encoder.transform(X_train[col])
    X_test[col] = encoder.transform(X_test[col])
    
print(X_train.head(), '\n')
print(X_test.head(), '\n')
         x1  x2
0  0.625876   0
1  0.490142   0
2  0.027797   1
3  0.241695   0
4  0.550751   1 

         x1  x2
0  0.054033   0
1  0.687631   0
2  0.388670   1
3  0.495681   2
4  0.836783   1 

Now we have successfully mapped "cat" to 0, "dog" to 1, and "rat" to 2.
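Concatenating train and test works in a Kaggle setting where both files are available up front, but in production the test data may not exist at fit time. One common alternative (a sketch, not part of the original notebook) is to learn the mapping from the training data only and send any unseen category to a sentinel value such as -1:

```python
import pandas as pd

X_train = pd.DataFrame({'x2': ['cat', 'cat', 'dog']})
X_test = pd.DataFrame({'x2': ['cat', 'rat', 'dog']})

# learn the mapping from the training data only
map_dict = {cat: i for i, cat in enumerate(X_train['x2'].unique())}

X_train['x2'] = X_train['x2'].map(map_dict)

# unseen categories ('rat') become NaN; replace them with the sentinel -1
X_test['x2'] = X_test['x2'].map(map_dict).fillna(-1).astype(int)
print(X_test['x2'].tolist())  # [0, -1, 1]
```

Tree-based models often handle such a sentinel gracefully, since -1 is just another split value.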