
Feature selection » permutation importance


Permutation importance is one of the most effective methods for feature selection. Libraries such as eli5 implement it out of the box. Nonetheless, it is a very straightforward procedure and can be done manually in a few simple steps:

1- Have a train set and a test set, both with labels.

2- Train a model on the train set and evaluate its performance on the test set as a baseline.

3- Loop over your features; each time, shuffle the test samples in that one feature only and keep the rest unchanged. In each iteration:

3-1 After the feature of interest has been shuffled, make predictions on the modified test set and evaluate your model's performance with the same metric.

3-2 Compare the metric with the baseline obtained in step 2.

3-3 If the score stayed constant or increased, that particular feature adds no information to your model. In other words, it is noise: after you shuffled it randomly, the outcome stayed the same or even improved.

4- Keep track of the difference between the baseline and the new evaluation metric for each feature.

5- Sort the differences and drop whatever portion suits you: maybe all of them? the worst 20%? the bottom 100? It's your decision. (A condensed sketch of steps 2-4 follows this list.)
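
Condensed, steps 2 through 4 amount to a short loop. Here is a minimal sketch, assuming a fitted scikit-learn-style classifier `model` and accuracy as the metric (both are placeholders, not part of the full example below):

In [ ]:
# minimal permutation-importance loop (model, X_test, y_test assumed to exist)
import numpy as np
from sklearn.metrics import accuracy_score

baseline = accuracy_score(y_test, model.predict(X_test))    # step 2

deltas = {}
for col in X_test.columns:                                  # step 3
    saved = X_test[col].copy()
    X_test[col] = np.random.permutation(saved.values)       # shuffle one feature only
    score = accuracy_score(y_test, model.predict(X_test))   # step 3-1
    deltas[col] = score - baseline                          # steps 3-2 and 4
    X_test[col] = saved                                     # restore the column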

Below is a fuller, end-to-end example:

In [ ]:
# permutation importance
# In this example the model is LightGBM and the evaluation metric is the F1 score
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm

# X is your instance space
# y is your label space
print('splitting...')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2019)

tr_data = lgb.Dataset(X_train, label=y_train)
vl_data = lgb.Dataset(X_test, label=y_test)

# params holds your LightGBM parameters; a minimal binary-classification example:
params = {'objective': 'binary', 'metric': 'binary_logloss'}

print('training model on the entire dataset')
estimator = lgb.train(
    params,
    tr_data,
    valid_sets=[tr_data, vl_data],
    callbacks=[lgb.log_evaluation(1000)],  # replaces the deprecated verbose_eval argument
)

y_preds = np.round(estimator.predict(X_test))  # probabilities -> 0/1 labels
baseline = f1_score(y_test, y_preds)
print('baseline f1_score:', baseline)


perm = {}
for i, col in enumerate(tqdm(X.columns)):
    # shuffle one feature only, keep the rest intact
    saved = X_test[col].copy()
    X_test[col] = np.random.permutation(saved.values)

    try:
        y_preds = np.round(estimator.predict(X_test))
        # negative delta = score dropped = the feature carries information
        perm[col] = f1_score(y_test, y_preds) - baseline
        print(i, col, '\t', perm[col])

    except ValueError:
        print(i, col, '\t', 'failed to evaluate...')

    # restore the original column before moving on
    X_test[col] = saved

    
# save the per-feature score deltas, most negative (most informative) first
(pd.DataFrame(sorted(perm.items(), key=lambda kv: kv[1]), columns=['feature', 'delta_f1'])
   .to_csv('permutation_importance_of_variables.csv', index=False))
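
Step 5 then boils down to reading these deltas back and dropping the features you consider noise. A short follow-up sketch; the `delta >= 0` cutoff mirrors step 3-3 but is only one possible choice:

In [ ]:
# drop every feature whose shuffled score stayed flat or improved (delta >= 0),
# i.e. the ones that behaved like noise in step 3-3
noise_features = [col for col, delta in perm.items() if delta >= 0]
print('dropping', len(noise_features), 'features:', noise_features)
X_reduced = X.drop(columns=noise_features)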

If you would rather use an off-the-shelf implementation, both eli5 (PermutationImportance) and scikit-learn (sklearn.inspection.permutation_importance) provide one.
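
For instance, scikit-learn's version looks like this. A minimal sketch; the LGBMClassifier wrapper is an assumption here, used so that the model exposes the scikit-learn estimator API the function expects:

In [ ]:
# library route: scikit-learn's built-in permutation importance
from lightgbm import LGBMClassifier
from sklearn.inspection import permutation_importance

model = LGBMClassifier().fit(X_train, y_train)   # sklearn-compatible LightGBM wrapper
result = permutation_importance(model, X_test, y_test,
                                scoring='f1', n_repeats=5, random_state=2019)

# importances_mean averages the score drop over n_repeats shuffles per feature
for col, imp in sorted(zip(X_test.columns, result.importances_mean), key=lambda kv: kv[1]):
    print(col, imp)

Note that the sign convention is flipped relative to the manual loop above: scikit-learn reports the baseline score minus the permuted score, so larger values mean more important features.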