Kaggle Winning Solutions   |   How to Win a Kaggle Competition

The entire pipeline

  1. Buy time with utility functions. Although it seems trivial at first, it is important to save time on the most repetitive parts of your code. Done poorly, each run only takes a bit longer, but you pay that cost every single time you commit your script; by the end of the competition, loading data properly will buy you a lot of accumulated time and let you run more experiments. pandas loads every column that contains NaN values as float64, and most of the time you don't need float64. Downcast the dtypes, then pickle the frame or save it as NumPy arrays.
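A minimal sketch of this idea, using pandas' built-in downcasting and pickling; the helper name `downcast_and_cache` is my own, not from the original:

```python
import pandas as pd

def downcast_and_cache(df: pd.DataFrame, path: str) -> pd.DataFrame:
    """Downcast numeric columns and cache the frame so later runs skip the slow CSV parse."""
    for col in df.select_dtypes(include="float").columns:
        # float64 -> float32 where values allow it
        df[col] = pd.to_numeric(df[col], downcast="float")
    for col in df.select_dtypes(include="integer").columns:
        # int64 -> smallest integer dtype that fits
        df[col] = pd.to_numeric(df[col], downcast="integer")
    # reloading a pickle is much faster than re-parsing the CSV every run
    df.to_pickle(path)
    return df
```

On later runs you would call `pd.read_pickle(path)` instead of `pd.read_csv(...)`.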
  2. Exploratory Data Analysis (EDA)
  3. Model selection and baseline model Select a model based on the nature of the problem: Deep Neural Networks usually perform better on unstructured data such as images and text, while Gradient Boosting Decision Tree algorithms such as LightGBM, XGBoost and CatBoost perform better on tabular data. There have been exceptions to this rule, too. Start with a simple baseline model; you will need it to evaluate your new features, so make sure it has enough capacity to learn them. Hyperparameter optimization is not required at this stage, and overfitting doesn't matter yet.
  4. Validation Strategy Before you have a good validation strategy, feature engineering doesn't make any sense: you need a reliable validation strategy so you can trust the validation score and evaluate your progress against it. If you are dealing with big data (relative to your memory and compute power), you don't need a multi-fold cross-validation strategy; it is time-consuming and might prevent you from exploring enough hypotheses. Common options:
    • Train test split
    • TimeSeries split
    • GroupedKFold cross validation
    • KFold cross validation
    • Evaluation metrics
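The split strategies above map directly onto scikit-learn splitters; a small sketch with toy data (the group structure here is invented for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
groups = np.repeat(np.arange(5), 4)  # e.g. 5 customers, 4 rows each

# KFold: fine when rows are independent and identically distributed
kf_folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

# GroupKFold: all rows of a group stay on one side of the split,
# preventing leakage when the same entity appears in train and validation
gkf_folds = list(GroupKFold(n_splits=5).split(X, groups=groups))

# TimeSeriesSplit: each validation fold comes strictly after its training data,
# so the model is never evaluated on the past
ts_folds = list(TimeSeriesSplit(n_splits=4).split(X))
```

Picking the splitter that mirrors how the test set was drawn is what makes the validation score trustworthy.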
  5. Problem-specific items
    • Tabular data
      • Feature Engineering
        • Categorical Features
        • Aggregations / Group Statistics
        • Frequency Encoding
        • NaN Processing
        • Combining / Transforming / Interaction
        • Outlier Removal / Relax / Smooth
        • Normalize / Standardize
      • Feature selection
    • Image data
    • Text data
    • TimeSeries data
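Two of the tabular feature-engineering items above, frequency encoding and group statistics, can be sketched in a few lines of pandas; the column names are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "a", "b", "b", "b", "c"],
    "amount": [10.0, 12.0, 5.0, 7.0, 6.0, 40.0],
})

# Frequency encoding: replace a category by how often it occurs overall
df["user_freq"] = df["user"].map(df["user"].value_counts(normalize=True))

# Aggregations / group statistics: per-user mean and std of a numeric column,
# merged back so every row carries its group's statistics
agg = df.groupby("user")["amount"].agg(["mean", "std"]).add_prefix("amount_")
df = df.merge(agg, left_on="user", right_index=True, how="left")
```

Both transforms must be fit on training data only (or inside each CV fold) to avoid leaking target or test information.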
  6. Some other topics
    • Data augmentation
    • Dealing with imbalanced datasets
    • Pseudo labeling
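Of the topics above, pseudo labeling is the least self-explanatory, so here is a minimal sketch: train on labeled data, adopt the model's confident predictions on unlabeled data as labels, and retrain. The 0.95 threshold and the logistic-regression model are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
X_train, y_train, X_test = X[:600], y[:600], X[600:]  # X_test plays the unlabeled set

# 1) Fit on the labeled data only
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 2) Keep only the unlabeled rows the model is confident about
proba = model.predict_proba(X_test).max(axis=1)
confident = proba > 0.95
X_pseudo = X_test[confident]
y_pseudo = model.predict(X_test)[confident]

# 3) Retrain on labeled + pseudo-labeled rows
model = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_train, X_pseudo]), np.concatenate([y_train, y_pseudo])
)
```

The confidence threshold is the key knob: too low and you train on noise, too high and you gain almost no extra data.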
  7. Blending Don't blend until the last days of the competition; until then, spend the time improving your individual models and features.
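The simplest form of blending is a weighted average of model predictions; a sketch with made-up prediction vectors (in practice these would be out-of-fold or test-set predictions, and the weight would be tuned on validation scores):

```python
import numpy as np

# Predicted probabilities from two hypothetical models; shapes must match
pred_gbdt = np.array([0.2, 0.8, 0.6, 0.1])
pred_nn = np.array([0.3, 0.7, 0.5, 0.2])

# Weighted average blend; w = 0.6 is an arbitrary illustrative weight
w = 0.6
blend = w * pred_gbdt + (1 - w) * pred_nn
```

Blends help most when the underlying models make different kinds of errors, e.g. a GBDT and a neural network.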