
Target variable

The type of the target variable is a key factor in how you formulate a machine learning problem. The target variable can take one of the following forms:

  • Binary
  • Multiclass (categorical)
  • Multiclass (ordinal)
  • Continuous

Binary and continuous target variables map directly onto binary classifiers and regression techniques, but things become trickier with multiclass target variables. Let's go through them one by one.

Multiclass (categorical)

Consider a categorical target variable that can take one of the values ['cat', 'dog', 'rat']. There are two ways you can formulate this.

  1. You can define this problem as three separate binary classification problems: one model to predict whether or not the target is a cat, another model to predict whether or not the target is a dog, and a third model to predict whether or not the target is a rat. Finally, classify each instance according to the highest probability among the three models.
  2. As opposed to the first approach, you can train a single model and use a multiclass approach from the beginning by using a softmax (softargmax) function. In this approach the probabilities of the different classes sum to one, as the sketch below illustrates.
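
As a quick illustration of the second formulation, here is a minimal NumPy sketch of a softmax (the scores are hypothetical, not produced by any model) showing how arbitrary scores become class probabilities that sum to one:

In [ ]:
import numpy as np

def softmax(scores):
    # shift by the max score for numerical stability before exponentiating
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for ['cat', 'dog', 'rat']
probs = softmax(scores)
print(probs)        # roughly [0.66 0.24 0.10]
print(probs.sum())  # 1.0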

Example

In [66]:
# make a synthetic dataset
from sklearn.datasets import make_classification

X, y = make_classification(n_informative=5, n_classes=3)

# by default we get 100 instances with 20 features, labeled with three classes
print(X.shape, y.shape)
print(y[:15])
(100, 20) (100,)
[0 2 2 1 0 1 2 0 1 2 0 2 2 2 1]

Approach 1

In [67]:
# first we need to one-hot-encode our target variable
y_0 = (y==0).astype(int) # this new target variable is whether or not target is equal to 0
print(y_0[:15]) # compare the result with the original "y" printed earlier

y_1 = (y==1).astype(int) # this new target variable is whether or not target is equal to 1
print(y_1[:15]) # compare the result with the original "y" printed earlier

y_2 = (y==2).astype(int) # this new target variable is whether or not target is equal to 2
print(y_2[:15]) # compare the result with the original "y" printed earlier
[1 0 0 0 1 0 0 1 0 0 1 0 0 0 0]
[0 0 0 1 0 1 0 0 1 0 0 0 0 0 1]
[0 1 1 0 0 0 1 0 0 1 0 1 1 1 0]
In [ ]:
from catboost import CatBoostClassifier
import numpy as np
from sklearn.metrics import accuracy_score

probabilities = np.zeros((20, 3)) # initialize a probability array of shape (20, 3): 20 validation instances by 3 classes
for target_var in [0, 1, 2]:
    new_y = (y==target_var).astype(int)
    clf = CatBoostClassifier(iterations=500, verbose=False)
    clf.fit(X[:80], new_y[:80])
    preds = clf.predict(X[80:])
    print(f'accuracy for target variable {target_var} is ', accuracy_score(new_y[80:], preds))
    probabilities[:, target_var] = clf.predict_proba(X[80:])[:, 1]
In [ ]:
# now compare the true labels
print('true labels')
print(y[80:][:5])

# versus the probabilities
print('\nprobabilities from different classifiers')
print(probabilities[:5].T)

# versus predictions based on probabilities
print('\nfinal result')
print(list(np.argmax(probabilities[:5], axis=1)))

# we can use numpy argmax() to predict the labels
print('\nThe multiclass accuracy score is ', accuracy_score(y[80:], np.argmax(probabilities, axis=1)))
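
As an aside, scikit-learn can automate the one-vs-rest loop above. A minimal sketch using OneVsRestClassifier, with a logistic regression base estimator chosen here just for brevity (any classifier exposing predict_proba would do):

In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# fits one binary classifier per class and picks the class with the highest score
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X[:80], y[:80])
print('OneVsRest accuracy is', accuracy_score(y[80:], ovr.predict(X[80:])))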

Approach 2

In [ ]:
clf = CatBoostClassifier(loss_function='MultiClass', iterations=500, verbose=False)
clf.fit(X[:80], y[:80])
preds = clf.predict(X[80:]).flatten() # flatten because CatBoost returns a column vector for MultiClass
print('The multiclass accuracy score is', accuracy_score(y[80:], preds))
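
To see the softmax behaviour mentioned earlier in practice, we can inspect the model's predicted class probabilities; each row sums to one:

In [ ]:
proba = clf.predict_proba(X[80:])
print(proba[:3])             # one row of class probabilities per validation instance
print(proba[:3].sum(axis=1)) # each row sums to one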

In this example, the second approach happened to yield higher accuracy, but that is not a general conclusion: the final accuracy depends on many other factors. The example is only intended to illustrate how to implement the two approaches.

Multiclass (ordinal)

Consider a multiclass target variable that represents the rating of a website and can take the values [1, 2, 3, 4, 5]. There are two ways you can formulate this:

  1. Formulate it similarly to approach 2 of the categorical multiclass problem described earlier
  2. Formulate it as a regression problem and then discretize the response

Example

Let's work with the same dataset from the previous example, treating its labels [0, 1, 2] as if they were ordered ratings.

Approach 1

In [ ]:
# identical to approach 2 of the categorical case: a single multiclass model
clf = CatBoostClassifier(loss_function='MultiClass', iterations=500, verbose=False)
clf.fit(X[:80], y[:80])
preds = clf.predict(X[80:]).flatten() # flatten because CatBoost returns a column vector for MultiClass
print('The multiclass accuracy score is', accuracy_score(y[80:], preds))

Approach 2

This time we will train a regressor instead of a multiclass classifier.

In [ ]:
from catboost import CatBoostRegressor
clf = CatBoostRegressor(loss_function='RMSE', iterations=500, verbose=False)
clf.fit(X[:80], y[:80])
preds = clf.predict(X[80:])
In [ ]:
# let's take a look at the predictions
import pandas as pd
import matplotlib.pyplot as plt
pd.Series(preds).plot(kind='hist')

The predictions are continuous, but we need three classes, so we have to discretize the outputs. For this purpose we need to define thresholds, and the choice of thresholds will affect the outcome.

In [ ]:
def discretize_output(preds, thresholds):
    # map each continuous prediction to a class label based on two thresholds
    new_preds = np.zeros(len(preds))
    for i, pred in enumerate(preds):
        if pred < thresholds[0]:
            new_preds[i] = 0
        elif pred < thresholds[1]:
            new_preds[i] = 1
        else:
            new_preds[i] = 2
    return new_preds

thresholds = [0.75, 1.25]
new_preds = discretize_output(preds, thresholds)
print(f'The accuracy score using thresholds {thresholds} is', accuracy_score(y[80:], new_preds))

thresholds = [0.1, 1.0]
new_preds = discretize_output(preds, thresholds)
print(f'The accuracy score using thresholds {thresholds} is', accuracy_score(y[80:], new_preds))
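
As a side note, the same mapping can be expressed in one line with np.digitize, which returns the index of the interval each prediction falls into; this sketch should match discretize_output above:

In [ ]:
# np.digitize maps pred < 0.75 -> 0, 0.75 <= pred < 1.25 -> 1, and 1.25 <= pred -> 2
new_preds = np.digitize(preds, [0.75, 1.25])
print('The accuracy score using np.digitize is', accuracy_score(y[80:], new_preds))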

OptimizedDiscretizer

Since the output depends on the thresholds, we should optimize them. For that purpose we will use the following class:

In [ ]:
from scipy import optimize

class OptimizedDiscretizer:
    def __init__(self, eval_metric, initial_coefs, discretizer_function, maximize=True):
        self.eval_metric = eval_metric
        self.coefs = initial_coefs
        self.maximize = maximize
        self.predict = discretizer_function

    def _eval_metric(self, coef):
        X_p = self.predict(self.X, coef)
        metric_value = self.eval_metric(self.y, X_p)
        if self.maximize:
            metric_value *= -1 # scipy minimizes, so to maximize a metric we minimize its negative

        return metric_value

    def fit(self, X, y):
        self.X = X
        self.y = y
        # Nelder-Mead is derivative-free; the metric is a step function of the
        # thresholds, so gradient-based methods would never move from the start
        self.coefs = optimize.minimize(self._eval_metric, self.coefs,
                                       method='nelder-mead',
                                       options={'disp': True})['x']

discretizer = OptimizedDiscretizer(eval_metric=accuracy_score,
                                   initial_coefs=[0.5, 1],
                                   maximize=True,
                                   discretizer_function=discretize_output)

discretizer.fit(preds, y[80:])

new_coefs = discretizer.coefs
new_preds = discretizer.predict(preds, new_coefs)

print('The optimized thresholds are', new_coefs)
print('The accuracy based on optimized thresholds is', accuracy_score(y[80:], new_preds))
In [ ]:
# now we have discretized outputs
pd.Series(new_preds).plot(kind='hist')

Conclusion

The target variable to some extent determines how we formulate our problem. The choice is clear for binary classification and regression, but for multiclass problems it becomes a design choice. It depends on the application, but usually we treat a categorical multiclass problem as a single multiclass model (approach 2) and an ordinal multiclass problem as a regression problem (approach 2).