Overview

Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University
Introduction to data science

Some themes

  • Data science is a mixture of disciplines, centered on data.
  • Not all data scientists have the same skills,
  • and data science folks perform a variety of tasks,
  • so many opinions exist as to what data science is.
  • Teams are important.
  • Data science has been around longer than the name.

    What is data science?

  • Data science takes on a variety of shapes and forms.
  • This requires a hybrid skill set.
  • Practitioners have varying levels of skills in different areas...
  • ...but generally some broad knowledge of the field.
  • Let's look at a few common positions in the industry.

    A look at some roles


    A look at their pay scales

    So, what is data science?

    • It's kind of hard to find concise and general responses, so I'll try:
      • Computationally-enabled, domain-dependent science centered around data management, analysis, and products.

  • What separates this from definitions of related disciplines, like computer and information science, mathematics, statistics, business analytics, or software engineering?
  • Required reading: What is data science?

    So, what is data science?

    • Here's Wikipedia's (as of 10/21/2016):
      • Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).

    So, what is data science?

    • Jonathan Sedar (a DS consultant) de-emphasizes the field's independence:
      • The term 'data science' is a useful shortcut to describe the recent confluence and evolution of several previously distinct disciplines, made possible by an increasing availability of data and sophistication of high quality open source software, decreasing costs of hardware and data processing, intense academic research and massive commercial and industrial interests.

    Required reading: What is data science?

    So, what is data science?

    • Josh Wills, Senior Director of Data Science at Cloudera:
      • Someone better at statistics than any software engineer, and someone better at software engineering than any statistician.

    Required reading: Josh Wills section of the Data Analytics Handbook, Part I (pg. 8)

    So, what is a data scientist?

    • O'Neil and Schutt's academic practitioner:
      • ...an academic data scientist is a scientist, trained in anything from social science to biology, who works with large amounts of data, and must grapple with computational problems posed by the structure, size, messiness, and the complexity and nature of the data, while simultaneously solving a real-world problem.

    So, what is a data scientist?

    • O'Neil and Schutt's industry practitioner:
      • A chief data scientist should be setting the data strategy of the company, which involves a variety of things: setting everything up from the engineering and infrastructure for collecting data and logging, to privacy concerns, to deciding what data will be user-facing, how data is going to be used to make decisions, and how it’s going to be built back into the product. She should manage a team of engineers, scientists, and analysts and should communicate with leadership across the company, including the CEO, CTO, and product leadership. She’ll also be concerned with patenting innovative solutions and setting research goals.

    Required reading: Pages 1–16 of O'Neil and Schutt's book.

    The link points to a sampler of the book that covers pages 1–16. The whole book is not a required text, but it is a good resource for those of you who want to go on in the field.

    So, what is a data scientist?

  • The important take-away is probably 'a mixture of skills.'
  • In a way, data scientists 'glue' disciplines together.
  • This is a good way to think about DS programming.
  • While new algorithms are always needed for DS advancement
    (and some data scientists are heavy on algorithm development),
  • DS programmers usually 'glue' things together.
  • This is the central idea behind the term 'pipeline.'
  • Let's look at some representations of the DS skill mixture.
  • Required reading: Conway's Venn diagram.

    Required reading: A fourth bubble?
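
    The 'glue' idea behind a pipeline can be sketched in a few lines of Python. This is only an illustration under assumptions: the stage names (load, parse, summarize), the pipeline helper, and the sample records are all hypothetical and are not taken from the course materials.

```python
from functools import reduce
from typing import Callable, Iterable


def load(lines: Iterable[str]) -> list[str]:
    """Management: strip whitespace and drop empty records."""
    return [line.strip() for line in lines if line.strip()]


def parse(records: list[str]) -> list[float]:
    """Management: convert text records to numbers, skipping bad ones."""
    values = []
    for record in records:
        try:
            values.append(float(record))
        except ValueError:
            continue  # messy data: ignore records that are not numeric
    return values


def summarize(values: list[float]) -> dict:
    """Analysis: compute simple summary statistics."""
    mean = sum(values) / len(values) if values else float("nan")
    return {"n": len(values), "mean": mean}


def pipeline(*stages: Callable) -> Callable:
    """Glue stages together so each stage's output feeds the next."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


# Hypothetical raw input: a few text records, some of them unusable.
run = pipeline(load, parse, summarize)
result = run(["72", " 68 ", "", "n/a", "75"])
print(result)
```

    Each stage is ordinary, existing functionality; the programming work is mostly in connecting the stages so that data flows from raw records to a product.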

    On "the data science spectrum"

  • A good way to characterize the whole field is as a spectrum.
  • There are a number of specialties that are "on the spectrum,"
  • and depending on individuals' particular DS "profiles,"
  • responsibilities and pay grades will vary.
  • The best approach as a student is to identify your interests,
  • and to understand where they will lead you in the job market.
  • Required reading: Job titles and pay

    Moving over to some history

    • Many core concepts are attributable to J.W. Tukey,
      in The Future of Data Analysis (read the introduction):
      • "For a long time I thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and doubt... I have come to feel that my central interest is in data analysis... Data analysis, and the parts of statistics which adhere to it, must... take on the characteristics of science rather than those of mathematics... data analysis is intrinsically an empirical science... How vital and how important... is the rise of the stored-program electronic computer? In many instances the answer may surprise many by being 'important but not vital,' although in others there is no doubt but what the computer has been 'vital.'"

    Milestones in formality

    Data scientists love (using) infographics

    More information on the history of DS

    So, what is data science good for?

    • Examples of data science products:
      • effective web search
      • recommender systems
      • election prediction
      • detecting disease outbreaks
      • targeted electronic advertisement
      • image recognition
      • fraud and risk detection
      • price comparison tools
      • weather forecasting
      • dynamic pricing

    And where does data come from?

  • Let's group data sources according to active agency.
  • Note: when we think about examples, we'll see some overlap.
    • A few clear sources are:
      • human: solicited, behavioral, social, and physical
      • machine: surveillance, sensory, communicative
      • nature: behavioral, social, and physical

    Human-generated data

  • Some human-generated data is traditional.
  • Solicited (survey) data is particular to humans and is tried and true.
    • There's also the physical data, e.g.,
      • heartrate, body chemistry, genome,
      • height, weight, eye color, etc.
  • And now there are electronic ways for recording most of these.

    Human-generated data

  • What's really new are the social and behavioral sources.
    • Some online behavioral sources include:
      • every click, visit, form filled,
      • basket loaded, transaction processed
      • search, filter, ad impression,
      • fast-forward, pause, and rewind.


    • Some online social sources include:
      • Facebook, Twitter, Instagram, LinkedIn, YouTube, etc.,
      • where text, image, video, and audio data are produced.
      • They all allow for directed conversations, which form networks!

    Machine-generated data

    • From one perspective, machines produce nearly all modern data:
      • Sensors are becoming ever more ubiquitous.
      • They measure meteorological and seismic information,
      • and monitor video and sound with cameras and microphones.
      • On people, there are watches, phones, and even pacemakers.
      • In homes, there are fire and entry alarms.
      • Lights, thermostats, washers, and dryers control themselves,
      • meaning these (and self-driving cars) can be hacked!

    Required reading:
    The internet of things (IoT)

    The internet of things

    • Let's list some core concepts of the IoT:
      • machine-to-machine (M2M) interaction,
      • activity independent of nature or humans,
      • local sensing, and
      • remote communication.

    What's human, natural, or machine?

  • What part of heart rates and seismic events is machine-generated?
  • Just because my smart watch is in the IoT
  • doesn't mean my heart rate is machine-generated data, right?
    • It's the independent and coordinated machine actions, e.g.,
      • the closure of an unattended parking garage at capacity,
      • an automatic defibrillation from a sensed cardiac episode,
      • the automatic pre-heating of your oven to 375 degrees
      • when it realizes the microwave was set to defrost a turkey,
      • the dynamic pricing of Uber cabs in a surge, or
      • the automatic shut-down of a power plant in an earthquake.
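
    The parking garage example above can be sketched in code to show where the machine-generated data actually arises. This is a toy model under assumptions: the Garage class, its fields, and the event names are hypothetical and do not describe any real system.

```python
from dataclasses import dataclass, field


@dataclass
class Garage:
    """Toy model of an unattended garage that closes itself at capacity."""
    capacity: int
    occupied: int = 0
    gate_open: bool = True
    log: list = field(default_factory=list)

    def sense(self, event: str) -> None:
        """Local sensing: an entry or exit sensor reports a car."""
        if event == "entry" and self.gate_open:
            self.occupied += 1
        elif event == "exit" and self.occupied > 0:
            self.occupied -= 1
        self._act()

    def _act(self) -> None:
        """Machine action: no human decides to close or reopen the gate."""
        should_open = self.occupied < self.capacity
        if should_open != self.gate_open:
            self.gate_open = should_open
            action = "gate_opened" if should_open else "gate_closed"
            # This record is the machine-generated data point.
            self.log.append((action, self.occupied))


garage = Garage(capacity=2)
for event in ["entry", "entry", "entry", "exit"]:
    garage.sense(event)
print(garage.log)
```

    The cars entering and leaving are human behavior, but the log of gate closures and reopenings exists only because the machine acted on its own.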

    Recap

  • Data science is a mixture of disciplines, centered on data.
  • Not all data scientists have the same skills,
  • and data science folks perform a variety of tasks,
  • so many opinions exist as to what data science is.
  • Teams are important.
  • Data science has been around longer than the name.