Demystifying Data: An Introduction to Data Science

Yanir Seroussi

yanirseroussi.com | @yanirseroussi | linkedin.com/in/yanirseroussi

Note: Check out First Steps in Data Science for advice on how to get started in the field

Bio

Software engineering/computer science background

  • BSc CompSci Technion
  • Intel, Qualcomm, Google

Conversion to data science

  • PhD Monash: text mining and user modelling
  • Giveable: data scientist, recommender systems, and a bunch of other things
  • Next Commerce: head of data science, recommender systems, even more other things

Recently joined the big scary world as an independent consultant/entrepreneur

Overview

  • Defining data science
  • General approach to problems
  • Key terms and tools
  • Case study: Bandcamp recommender system
  • Case study: Surviving the Titanic

What's a data scientist?

"Data Scientist: The sexiest job of the 21st century"

- Harvard Business Review

"I keep saying the sexy job in the next ten years will be statisticians. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s?
The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades."

- Hal Varian, Chief Economist at Google, The McKinsey Quarterly

True, but sexy may not be the right word...

What's a data scientist?

Someone who sits in the middle of this continuum

Where do you sit?

What can data science do for you?

Data science problems

"We have all this data, what can we do with it?"

"I want my data thing to be better"

"Here's money and data, please generate more money and data"

Somewhat better-defined data science problems

  • Build a model to predict sales of a marketing campaign
  • Create a system that runs campaigns that automatically adapt to customer feedback
  • Improve user satisfaction with search engine results
  • Predict market response to large trades
  • Detect whale calls from underwater recordings to prevent collisions

Steps to solving data science problems

  1. Figure out what the problem is
  2. Find out how the solution will be measured
  3. "Solve" problem
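
The measurement step can be made concrete before any modelling starts. A minimal sketch, assuming the agreed measure is plain accuracy on held-out examples; the labels and the "model" predictions below are invented for illustration. It compares a candidate against a trivial majority-class baseline, since a model that cannot beat the baseline has not solved anything yet.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, n):
    """Predict the most common training label for each of n instances."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n


# Hypothetical labels, made up for this sketch (1 = bought, 0 = did not buy).
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
y_test = [0, 1, 0, 0, 1]
model_predictions = [0, 1, 0, 1, 1]  # stand-in for a real model's output

baseline_predictions = majority_baseline(y_train, len(y_test))
baseline_accuracy = accuracy(y_test, baseline_predictions)
model_accuracy = accuracy(y_test, model_predictions)

print("baseline accuracy:", baseline_accuracy)
print("model accuracy:   ", model_accuracy)
```

In a real project the measure would come from the business problem (revenue, clicks, collisions avoided), and accuracy would only be a proxy for it.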

Solve problem?

Exploring the sklearn map

  • Base assumption: you have data and a rough idea of what you want to find
  • All methods require features as input
  • Four main areas:
    • Supervised classification
    • Supervised regression
    • Unsupervised clustering
    • Dimensionality reduction

Supervised classification

Definition

Supervised: we have labelled data to train on

Classification: need to predict/infer the class/category of each instance

Supervised classification

Methods and examples

Classifying iris species using k-nearest-neighbours
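
A minimal sketch of this example, assuming scikit-learn is installed and using its bundled iris dataset. The 70/30 split, `n_neighbors=5` and the random seed are arbitrary choices for illustration, not settings taken from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Hold out 30% of the labelled flowers to measure how well the model generalises.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Each test flower gets the majority label of its 5 nearest training flowers.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

test_accuracy = knn.score(X_test, y_test)
first_prediction = iris.target_names[knn.predict(X_test[:1])[0]]

print(f"held-out accuracy: {test_accuracy:.2f}")
print("predicted species of first test flower:", first_prediction)
```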

Classifying email as spam/non-spam using support vector machines

Source: Wikipedia
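
A rough sketch of the spam example, assuming scikit-learn is installed. The eight messages are invented for illustration and are far too few to train a useful filter; the TF-IDF features and linear kernel are common defaults, not choices stated in the talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up training messages with hand-assigned labels.
messages = [
    "win a free prize now",
    "claim your free money today",
    "cheap pills, limited offer",
    "you have won a lottery prize",
    "meeting moved to tomorrow morning",
    "can you review my pull request",
    "lunch at noon on friday",
    "notes from the project meeting",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Turn each message into word-weight features, then fit a linear SVM on them.
spam_filter = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
spam_filter.fit(messages, labels)

new_messages = ["free prize waiting, claim now", "see you at the meeting tomorrow"]
predictions = spam_filter.predict(new_messages)

for message, label in zip(new_messages, predictions):
    print(f"{label}: {message}")
```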

Supervised regression

Definition

Supervised: we have labelled data to train on

Regression: need to predict/infer a numeric quantity for each instance

Supervised regression

Methods and examples

Linear regression to predict diabetes progression
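
A minimal sketch of this example, assuming scikit-learn is installed and using its bundled diabetes dataset. The 75/25 split and the choice of error measures are illustrative assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()

X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=0
)

# Fit one weight per feature plus an intercept.
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on patients the model has not seen.
predictions = model.predict(X_test)
rmse = mean_squared_error(y_test, predictions) ** 0.5
r2 = r2_score(y_test, predictions)

print(f"root mean squared error: {rmse:.1f}")
print(f"R^2 on held-out data:    {r2:.2f}")
```

Unlike classification, the prediction for each patient is a number, so the error measures are distances between predicted and actual values rather than counts of correct labels.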

Predicting rent with decision trees

Source: BigML

Unsupervised clustering

Definition

Unsupervised: no labelled training data

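Clustering: group together similar instances; assignments can be hard (each instance belongs to exactly one cluster) or soft (each instance has a degree of membership in every cluster)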
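
The hard/soft distinction can be sketched as follows, assuming scikit-learn is installed. The data is synthetic (`make_blobs`), and k-means and a Gaussian mixture are used here only as familiar representatives of each kind of clustering.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic, unlabelled 2-D points drawn around three centres.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=0)

# Hard clustering: every point is assigned to exactly one cluster.
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: every point gets a membership probability per cluster.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)

print("hard label of first point:      ", hard_labels[0])
print("soft memberships of first point:", soft_memberships[0].round(3))
```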

Unsupervised clustering

Methods and examples

Discovering topics in texts with latent Dirichlet allocation

Segmenting images with spectral clustering

Dimensionality reduction

Definition

Dimensionality: number of features describing each instance

Reduction: decreasing the number of features by selection or transformation
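
Both routes can be sketched on the four-feature iris dataset, assuming scikit-learn is installed. `SelectKBest` with an ANOVA F-score stands in for selection (it uses the labels, so it is a supervised choice) and PCA stands in for transformation; both are illustrative picks rather than the only options.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target  # 150 flowers, 4 features each

# Selection: keep the 2 original features that best separate the classes.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept_features = [
    name for name, keep in zip(iris.feature_names, selector.get_support()) if keep
]

# Transformation: build 2 new features as linear combinations of all 4.
pca = PCA(n_components=2)
X_projected = pca.fit_transform(X)
variance_kept = pca.explained_variance_ratio_.sum()

print("selected features:", kept_features)
print(f"variance kept by 2 principal components: {variance_kept:.2f}")
```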

Dimensionality reduction

Methods and examples

Decomposing faces with principal component analysis

Discovering movie themes with matrix factorisation

Source: IEEE Spectrum
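
A toy sketch of the idea, assuming NumPy is installed. The 4x4 rating matrix is invented and fully observed; real recommender data is mostly missing entries, which a plain truncated SVD like this one does not handle.

```python
import numpy as np

# Made-up ratings: rows are users, columns are movies, values are 1-5 stars.
ratings = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

n_factors = 2  # number of latent "themes" to keep

# Factorise ratings into user factors times movie factors via truncated SVD.
U, singular_values, Vt = np.linalg.svd(ratings, full_matrices=False)
user_factors = U[:, :n_factors] * singular_values[:n_factors]
movie_factors = Vt[:n_factors, :]

# The product of the two small matrices approximates the original ratings.
approximation = user_factors @ movie_factors
reconstruction_error = np.linalg.norm(ratings - approximation)

print("movie factors (one column per movie):")
print(movie_factors.round(2))
print(f"reconstruction error: {reconstruction_error:.2f}")
```

Movies with similar columns in `movie_factors` are liked by the same users, which is what lets the factors be read as themes.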

Beyond the sklearn map...

  • Preprocessing
  • Visualisation
  • Language processing: parse trees, part-of-speech tagging
  • Generation: written language, speech, movement
  • Active learning
  • And much more...

Thinking and putting everything together

In practice

Why are data scientists needed?

  • Reduce and rephrase problems
  • Choose the right tools
  • Combine several tools
  • Handle data drift
  • Understand business needs
  • Make it work in production

Case studies

Case study #1: BCRecommender

BCRecommender

  • BC = Bandcamp – an awesome music publishing platform
  • My pet project
  • Addressing Bandcamp discovery problems
  • Some classic recommender system issues

Recommender systems

Some characteristics

  • The key word is system
  • Classic data science problem
  • Key revenue generator for companies such as Amazon, Netflix, Spotify
  • End-user exposure means UI/UX is important
  • Many conflicting requirements
  • Measurement is hard

BCRecommender requirements

  1. Help me find music I like
  2. See 1

As usual, the client is useless at generating actionable requirements, but some things are clear

  • UI/UX is critical: easy to play music, intuitive navigation, mobile-friendly, etc.
  • Personalisation is important
  • Context and mood should be taken into account
  • Oh, and it should be cheap

BCRecommender architecture

More details on my blog

BCRecommender algorithms

AKA the fun part

  • Classic collaborative filtering (only ratings): poor results
  • Clustering based on ratings & tags: much better
  • Personalisation is not enough:
    • Similar music discovery – we can't know your mood
    • Cluster-based discovery – surprisingly useful content generation game
  • Can do better with more data, users and time
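
To make the tag-based idea concrete, here is a hypothetical sketch of similar-music discovery using Jaccard similarity between tag sets. This is not BCRecommender's actual implementation: the album names, tags, and function names are all invented for illustration.

```python
def jaccard(tags_a, tags_b):
    """Share of tags two albums have in common, from 0.0 to 1.0."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0


def most_similar(album, catalogue, top_n=2):
    """Rank the other albums in the catalogue by tag overlap with `album`."""
    scores = [
        (other, jaccard(catalogue[album], tags))
        for other, tags in catalogue.items()
        if other != album
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up catalogue: album name -> set of tags.
catalogue = {
    "album_a": {"ambient", "electronic", "drone"},
    "album_b": {"ambient", "electronic", "downtempo"},
    "album_c": {"punk", "garage", "lo-fi"},
    "album_d": {"electronic", "techno"},
}

recommendations = most_similar("album_a", catalogue)
print(recommendations)
```

A pairwise similarity like this could also feed a clustering step, which is closer to the cluster-based discovery mentioned above.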

BCRecommender demo

Case study #2: Surviving the Titanic

Surviving the Titanic

  • Goal: "predict" who would survive the Titanic
  • Good toy problem for beginners
  • Hosted on Kaggle

What is Kaggle?

Source: Jessi Reel

Data science with a simple spreadsheet

Where to from here?

  • Complement your skills:
    • Software engineers: learn predictive modelling
    • Analysts: learn how to program in Python
  • Read/watch free online resources
  • Get your hands dirty on Kaggle
  • Come to the data science meetup
  • Do the full-length course

Questions?