Description

Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University
Introduction to data science

Some themes

  • EDA should always come first in a data science project,
  • allowing first for assessment of data integrity,
  • but also for modeling and hypothesis generation.
  • Data should be characterized from every angle.
  • EDA on its own does not constitute a complete study.
  • EDA helps to set the direction of a project.
  • Exploratory data analysis

  • Data description falls under the umbrella of EDA,
  • which, after pre-processing, is a preliminary analysis
  • that is an opportunity to check and summarize data.
  • Attention is given to checking the integrity of data
  • and making modeling decisions for the next stage of a project.
  • If data are in any way unfit for further analysis,
  • EDA provides the direction to go back and continue munging,
  • and if interesting patterns are found through summarization,
  • they can direct hypotheses and modeling decisions.
  • Getting the big picture

  • An important EDA outcome is overview understanding,
  • which can be accomplished through a number of mechanisms.
  • If data are human readable (text), or structured (tables),
  • viewing a sample of records can indicate sparsity or domain,
  • but when data are Big this is not always effective,
  • which is where summarization and visualization come in.
  • Summarization often covers counting, centrality, and variation,
  • and is closely tied to data visualization
  • which is the focus of Ch. 10; a quick pandas sketch follows.
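
    For tabular data, a few lines of pandas can produce this kind of
    overview. A minimal sketch follows; the file name "records.csv" and
    its contents are hypothetical placeholders:

      import pandas as pd

      df = pd.read_csv("records.csv")  # hypothetical input file

      print(df.head())        # view a small sample of records
      print(df.dtypes)        # check each column's inferred type
      print(df.isna().sum())  # missing values per column (sparsity)
      print(df.describe())    # counts, centrality, spread, percentiles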
  • Performing EDA

  • We've read a little about this already from John Tukey.
  • Tukey also wrote a whole book: Exploratory Data Analysis,
  • inspiring the creation of statistical programming languages.
  • Without these, EDA in data science would not be possible,
  • given so much data and so many ways to represent it.
  • Examples include R, Python, Matlab, and Mathematica,
  • which have large communities of scientific developers
  • who make reusable tools for data processing and visualization.
  • Some languages are open source, some require licenses,
  • and all have their strengths and weaknesses.
  • Required reading: EDA

    Commonality and centrality

  • A good first step is to see if one value represents many,
  • and ask "which values appear most frequently?"
  • The most common element in some data is called a "mode."
  • When data are numeric, centrality, or the "middle," is relevant.
  • The middle is not always well-defined,
  • and it can be measured through a variety of mechanisms.
  • One way of measuring the middle of data is with the median,
  • which is the value that has 50% of the data smaller,
  • and is just one of many "percentiles,"
  • e.g., the 35th percentile has 35% of the data smaller, etcetera (sketched below).
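
    As a quick illustration, the mode, median, and percentiles can be
    computed directly; the data values below are made up:

      import statistics
      import numpy as np

      data = [2, 3, 3, 5, 7, 7, 7, 11, 13, 17]

      print(statistics.mode(data))    # 7: the most frequent value (the mode)
      print(statistics.median(data))  # 7.0: half of the data fall below
      print(np.percentile(data, 35))  # 5.3: the 35th percentile (interpolated)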
  • Means

    • Means (averages) also measure centrality:
      • The arithmetic mean (expected value) is most common:
        $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$,
      • the geometric mean is for products:
        $\left(\prod_{i=1}^{N} x_i\right)^{1/N}$,
      • and the harmonic mean is for rates:
        $N/\sum_{i=1}^{N}(1/x_i)$.
    Required reading: finding the middle

    Spread

  • It's also important to describe how numeric data are dispersed,
  • i.e., are data bunched up or spread apart?
  • Measures of spread describe how varied data are.
  • Observing the different percentiles can indicate spread,
  • which is one way box plots (a visualization) accomplish this.
  • Spread is also commonly measured by variance,
  • and its square root, the standard deviation,
  • while box-plot "whiskers" are conventionally set by the interquartile range (IQR).
  • Variance

  • Variance is computed as the arithmetic average of
  • squared differences from the arithmetic mean:
    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$
  • Squaring keeps things positive, but expresses the wrong units.
  • To undo the squaring, we can take the square root,
  • which is called the standard deviation (SD); a short sketch follows.
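
    A small sketch of population variance and SD, using made-up values:

      import statistics

      data = [2, 4, 4, 4, 5, 5, 7, 9]  # the mean is 5

      print(statistics.pvariance(data))  # 4.0: mean squared difference from the mean
      print(statistics.pstdev(data))     # 2.0: square root restores the original units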
  • Box and whiskers

  • The box-and-whisker plot is a simple data visualization
  • that describes both centrality and spread for numeric data.
  • The "box" shows 25th and 75th percentiles (bottom and top),
  • in addition to the 50th percentile (median) through the middle.
  • "Whiskers" extend from the box to extreme points,
  • often out to the farthest points within 1.5 IQRs of the box (sketched below).
  • Points beyond whiskers are considered extreme, or, "outliers."
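
    A minimal box-and-whisker sketch with matplotlib, using randomly
    generated data for illustration:

      import numpy as np
      import matplotlib.pyplot as plt

      rng = np.random.default_rng(0)
      data = rng.normal(loc=0.0, scale=1.0, size=200)

      plt.boxplot(data)  # box: 25th/75th percentiles; line: median;
                         # whiskers: farthest points within 1.5 IQR (the default)
      plt.ylabel("value")
      plt.show()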
  • Extreme variation

  • Human heights, weights, etc., are "normal" (bell-shaped).
  • Normal variation is not very "extreme."
  • When data are normal, the mean and median are very similar.
  • The median and harmonic mean are robust to extreme values (see the sketch after the list below).

    • Some example sources of extreme data:
      • social networks
      • city sizes
      • stock prices
      • seismic events
      • weather events
      • celestial objects
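
    A sketch contrasting the mean and median on bell-shaped versus
    heavy-tailed data; the distributions below are illustrative:

      import numpy as np

      rng = np.random.default_rng(1)
      normal = rng.normal(loc=170, scale=10, size=10_000)     # e.g., heights
      skewed = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)

      print(np.mean(normal), np.median(normal))  # nearly identical
      print(np.mean(skewed), np.median(skewed))  # mean pulled far above the median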

    Co-variation

  • Quantiles and deviations express 1-dimensional spread.
  • It can be important to describe how data vary together,
  • e.g., does height correspond to weight?
  • Scatter plots only exhibit variation in 2 or 3 dimensions,
  • so pair-wise comparisons are a common visual approach.
  • This is where methods for "correlation" come into play,
  • which quantify the variation one variable explains in another.
  • Common methods include Pearson's and Spearman's,
  • which hinge on different assumptions,
  • though higher values generally mean "more related" (sketched below).
  • Required reading: Correlation, causation?
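
    Both correlation methods are available in scipy; the height/weight
    pairs below are fabricated for illustration:

      from scipy import stats

      heights = [150, 160, 165, 172, 180, 188]
      weights = [52, 60, 63, 70, 80, 95]

      r, _ = stats.pearsonr(heights, weights)     # linear association
      rho, _ = stats.spearmanr(heights, weights)  # monotonic (rank) association

      print(r, rho)  # both near 1 here: taller tends to mean heavier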

    Hypothesis generation

  • Where should a project go after EDA?
  • A central goal of EDA is data characterization.
  • A good outcome of EDA could be testable questions,
  • but a necessary outcome might be removal of some data,
  • or a return to a pre-processing stage.

    • Some testable questions might be:
      • Do one or more variables predict others?
      • Does some behavior generate outliers?
      • Do sales of an item trend with some particular marketing?
  • Can future participation be predicted?
      • Is growth exponential or unsustainable? Will it crash?

    Recap

  • EDA should always come first in a data science project,
  • allowing first for assessment of data integrity,
  • but also for modeling and hypothesis generation.
  • Data should be characterized from every angle.
  • EDA on its own does not constitute a complete study.
  • EDA helps to set the direction of a project.

    • Next time: Depiction
      • What are some common visualization types?
      • How can visualizations be interpreted?
      • What are best practices with visualization?