Model selection: CatBoost tips


CatBoost is a gradient-boosted decision tree library developed by Yandex researchers and engineers. It works very well on tabular data, often even with default parameters.

Categorical features

CatBoost supports both numerical and categorical features. It converts categorical features, and combinations of them, into new numeric features. By default, in most modes, it one-hot encodes categorical features that have only a small number of distinct values. You do not have to encode categorical columns yourself: they can hold integer or string values (so label encoding is optional), but not floats. Simply make a list of the categorical columns and pass it as the cat_features argument to the fit() method of your CatBoost model. Below is an example.

In [14]:
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier

estimator = CatBoostClassifier(iterations=1000)

# create a synthetic X and y, where x1 is a numerical feature and x2 is a categorical feature
X = pd.DataFrame({'x1': np.random.random(9), 'x2': np.random.randint(2, size=9)})
y = pd.Series(np.random.randint(2, size=9))  # binary target variable y
print(X.head(), '\n')
print(y.head(), '\n')

# list all categorical features
categorical_features = ['x2']

# pass the list to the fit() method of CatBoostClassifier or CatBoostRegressor;
# verbose=500 logs progress every 500 iterations
estimator.fit(X, y, cat_features=categorical_features, verbose=500)
         x1  x2
0  0.055489   1
1  0.185918   1
2  0.445831   1
3  0.226276   0
4  0.196047   1 

0    1
1    1
2    0
3    1
4    1
dtype: int32 

Learning rate set to 0.004417
0:	learn: 0.6885883	total: 12.7ms	remaining: 12.7s
500:	learn: 0.1360443	total: 5.04s	remaining: 5.02s
999:	learn: 0.0635484	total: 9.96s	remaining: 0us
<catboost.core.CatBoostClassifier at 0x204e5a1e248>

Missing values

CatBoost can handle missing values automatically. How they are processed depends on the feature type and the selected package. For categorical features, CatBoost does not process missing values in any specific way. For numerical features, it by default treats missing values as the minimum value (less than all other values) for the feature. The processing mode, which is the nan_mode parameter in the Python package, can also be set to "Forbidden", in which case missing values are not supported and their presence is interpreted as an error, or to "Max", in which case missing values are treated as the maximum value (greater than all other values) for the feature.