How to (almost) win Kaggle competitions

Yanir Seroussi

yanirseroussi.com | @yanirseroussi | linkedin.com/in/yanirseroussi

Note: This talk is also available as a blog post.

Bio

Software engineering/computer science background

  • BSc CompSci Technion
  • Intel, Qualcomm, Google

Conversion to data science

  • PhD Monash: text mining and user modelling
  • Giveable: data scientist, recommender systems, and a bunch of other things
  • Next Commerce: head of data science, recommender systems, even more other things

Recently joined the big scary world as an independent consultant/entrepreneur: leads and invitations to connect are welcome

Overview

  • Preliminaries: on Kaggle, data science, and my experience
  • Ten tips
  • General advice and ramblings
  • Question time (but feel free to interrupt anytime)

What's Kaggle?

What's a data scientist?

Someone who sits in the middle of this continuum

Source: Ofer Mendelevitch, How to build a Hadoop data science team

Note: can replace applied scientist with data analyst, and research scientist with statistician

Where do you sit?

Why should data scientists kaggle?

Isn't it just free work?

Great training lab

No cheating: can't pick a friendly baseline (unlike academia)

No maintenance: write throwaway code (unlike industry)

Reputation building

Nerdy fun!

My Kaggle experience

Would do more, but it's addictive and hard to timebox

Learned a few things in the process...

Tip 1: RTFM

  • Understand the competition timeline
  • Tick all the required boxes, even nonexistent ones
  • Submit using the correct format, reproduce benchmarks
  • Know the measure and data

Tip 2: Know your measure

  • Understand how the measure works
  • Use a suitable optimisation approach
    • Often easy to achieve
    • Can make a huge difference

Example: Hackathon MAE versus MSE
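
A minimal sketch of why this matters, using a toy numeric target (an assumption, not the hackathon data): the constant prediction that minimises MAE is the median, while the one that minimises MSE is the mean, so optimising for the wrong measure can hurt badly when the target is skewed.

    import numpy as np

    # Toy target with an outlier; just an illustration, not the competition data.
    y = np.array([1.0, 1.0, 2.0, 2.0, 50.0])

    def mae(pred):
        return np.abs(y - pred).mean()

    def mse(pred):
        return ((y - pred) ** 2).mean()

    mean_pred = y.mean()        # the constant that minimises MSE
    median_pred = np.median(y)  # the constant that minimises MAE

    print('mean  ', mean_pred, 'MAE:', mae(mean_pred), 'MSE:', mse(mean_pred))
    print('median', median_pred, 'MAE:', mae(median_pred), 'MSE:', mse(median_pred))

    # Many libraries let you optimise the measure directly, e.g. scikit-learn's
    # GradientBoostingRegressor(loss='absolute_error') for MAE in recent versions.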

Tip 3: Know your data

Overspecialisation is a good thing

Examples:

  • Hackathon: how was the data obtained?
  • Multi-label Greek: connected components
  • Arabic writers: histograms

Beyond Kaggle:

Custom solutions win, the world needs data scientists!*

* Until we are replaced by robots

Tip 4: What before how

Know what you want to model before figuring out how to model it

Example: John's Yandex visualisations

Generally applicable for people coming from either side of the data science continuum

Become one with the data
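
A minimal sketch of what "becoming one with the data" can look like in practice, assuming a generic train.csv with a target column (both names are placeholders): summarise and plot the thing you're predicting before deciding how to model it.

    import pandas as pd
    import matplotlib.pyplot as plt

    train = pd.read_csv('train.csv')        # placeholder file name
    print(train.describe(include='all'))    # ranges, missing values, cardinalities

    train['target'].hist(bins=50)           # placeholder target column
    plt.title('What does the target actually look like?')
    plt.show()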

Tip 5: Do local validation

Faster and more reliable than relying on the leaderboard

Recommendations:

  • Mimic the competition setup
  • Prefer a single split to cross-validation (a sketch follows below):
    • Faster
    • Cross validation may be unsuitable (e.g., time series)
    • Public leaderboard is extra validation
  • Make exceptions for small data or when there's no time

Analogy for software engineers:

  • Development: local validation
  • Staging: public leaderboard
  • Production: private leaderboard
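
A minimal sketch of such a local validation setup, assuming a time-ordered competition where the test period follows the training period (the column names and the 80/20 split are placeholders):

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error

    train = pd.read_csv('train.csv').sort_values('date')  # placeholder columns
    n_fit = int(len(train) * 0.8)                          # hold out the last ~20% of the timeline
    fit_part, val_part = train.iloc[:n_fit], train.iloc[n_fit:]

    features = [c for c in train.columns if c not in ('date', 'target')]
    model = GradientBoostingRegressor().fit(fit_part[features], fit_part['target'])
    preds = model.predict(val_part[features])
    print('Local MAE:', mean_absolute_error(val_part['target'], preds))

    # If the local score and the public leaderboard disagree wildly, the split
    # probably doesn't mimic the competition setup closely enough.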

Tip 6: Make fewer submissions

(But not too few)

  • Look better
  • Avoid overfitting the leaderboard
  • Don't join bidding wars and give away your competitive advantage
  • Use local validation to reduce the need for many submissions

Tip 7: Do your research

  • For any given problem, it's likely there are people dedicating their lives to its solution
  • Deeper knowledge and understanding is a sure reward

Worked well for me:

  • Arabic writers: histogram kernels (see the sketch after this list)
  • Multi-label Greek: ECC/PCC
  • Bulldozers: stochastic GBM sklearn bug
  • Yandex: LambdaMART
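
As an illustration of the histogram kernels item above, here is a minimal sketch of a histogram intersection kernel plugged into scikit-learn's SVC as a custom kernel; the data is synthetic, and this is not necessarily the exact setup used in the competition.

    import numpy as np
    from sklearn.svm import SVC

    def histogram_intersection(X, Y):
        # Gram matrix with entries K(x, y) = sum_i min(x_i, y_i).
        return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

    rng = np.random.default_rng(0)
    X = rng.random((200, 30))            # pretend each row is a feature histogram
    X /= X.sum(axis=1, keepdims=True)    # normalise the histograms
    y = (X[:, 0] > X[:, 1]).astype(int)  # toy labels

    clf = SVC(kernel=histogram_intersection).fit(X, y)
    print('Training accuracy:', clf.score(X, y))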

Tip 8: Apply the basics rigorously

  • Obscure methods are awesome, but often the basics will get you very far
  • Common algorithms have good implementations
  • Running a method without minimal tuning is worse than not running it at all

Example: In defense of one-vs-all classification
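
A minimal sketch of "the basics with minimal tuning", assuming a generic multi-class problem (the digits dataset here is just a stand-in): scaled features, one-vs-rest logistic regression, and a small cross-validated grid over the regularisation strength.

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_digits(return_X_y=True)
    pipeline = make_pipeline(StandardScaler(),
                             OneVsRestClassifier(LogisticRegression(max_iter=1000)))
    search = GridSearchCV(pipeline,
                          {'onevsrestclassifier__estimator__C': [0.01, 0.1, 1, 10]},
                          cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)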

Tip 9: The forum is your friend

  • Subscribe to receive important notifications
  • Understand shared code, but don't rely on it
  • Try to figure out what your competitors are doing
  • Learn from post-competition summaries

Tip 10: Ensemble all the things

  • Not to be confused with ensemble methods
  • Almost no competition is won by a single model
  • Works well with independent models – merge teams

Basic algorithm:

  1. Try many things
  2. Ensemble the things that work well
  3. Repeat 1 & 2 until you run out of time
  4. Almost win
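
A minimal sketch of step 2, assuming a few reasonably independent regressors and a local validation split (the models, weights, and data are placeholders): average their predictions and keep the blend only if local validation confirms the gain.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
    X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

    models = [Ridge(),
              RandomForestRegressor(random_state=0),
              GradientBoostingRegressor(random_state=0)]
    preds = [m.fit(X_fit, y_fit).predict(X_val) for m in models]

    for m, p in zip(models, preds):
        print(type(m).__name__, mean_absolute_error(y_val, p))

    blend = np.mean(preds, axis=0)   # equal weights; tune the weights on local validation
    print('Blend', mean_absolute_error(y_val, blend))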

Other resources

How to get started?

Tips are useless if not applied

Software engineers: learn predictive modelling

Analysts: learn how to program in Python

Data scientists: you have no excuse


Go forth and Kaggle

Questions?