Description
Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University
Some themes
EDA should always come first in a data science project,
allowing first for assessment of data integrity,
but also for modeling and hypothesis generation.
Data should be characterized from every angle.
EDA on its own does not constitute a complete study.
EDA helps to set the direction of a project.
Exploratory data analysis
Data description falls under the umbrella of EDA,
which, after pre-processing, is a preliminary analysis
that is an opportunity to check and summarize data.
Attention is given to checking the integrity of data
and making modeling decisions for the next stage of a project.
If data are in any way unfit for further analysis,
EDA provides the direction to go back and continue munging,
and if interesting patterns are found through summarization,
they can direct hypotheses and modeling decisions.
Getting the big picture
An important EDA outcome is overview understanding,
which can be accomplished through a number of mechanisms.
If data are human readable (text) or structured (tables),
viewing a sample of records can indicate sparsity or domain,
but when data are Big this is not always effective,
which is where summarization and visualization come in.
Summarization often covers counting, centrality, and variation,
and is closely tied to data visualization,
which is the focus of Ch. 10.
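For instance, a minimal pandas sketch of this kind of overview (the file name records.csv is a placeholder, not a real data set):

```python
import pandas as pd

# Hypothetical tabular data set; the file name is an assumption
df = pd.read_csv("records.csv")

print(df.head())      # view a small sample of records
df.info()             # column types and missing-value counts (an integrity check)
print(df.describe())  # counts, centrality (mean), and variation (std, quartiles)
```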
Performing EDA
We've read a little about this already from John Tukey.
Tukey also wrote a whole book: Exploratory Data Analysis,
inspiring the creation of statistical programming languages.
Without these, EDA in data science would not be possible
with so much data and so many ways to represent it.
Examples include R, Python, Matlab, and Mathematica,
which have large communities of scientific developers
who make reusable tools for data processing and visualization.
Some languages are open source, some require licenses,
and all have their strengths and weaknesses.
Commonality and centrality
A good first step is to see if one value represents many,
and ask "which values appear most frequently?"
The most frequently occurring value in the data is called the "mode."
When data are numeric, centrality, or the "middle," becomes relevant.
The middle is not always well-defined,
and it can be measured through a variety of mechanisms.
One way of measuring the middle of data is with the median,
which is the value that has 50% of the data smaller,
and is just one of many "percentiles,"
e.g., the 35th percentile has 35% of the data smaller, and so on.
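A minimal sketch of these summaries using Python's standard-library statistics module, with made-up sample values:

```python
import statistics

# Hypothetical numeric sample
data = [2, 3, 3, 3, 5, 7, 8, 9, 11, 13, 21]

print(statistics.mode(data))    # the most frequent value ("mode"): 3
print(statistics.median(data))  # the 50th percentile: 7

# Approximate 35th percentile: cut point 35 of 100 quantile divisions (Python 3.8+)
print(statistics.quantiles(data, n=100)[34])
```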
Means
- Means (averages) also measure centrality:
  - The arithmetic mean (expected value) is most common: (x_1 + x_2 + ... + x_n) / n
  - The geometric mean is for products: (x_1 * x_2 * ... * x_n)^(1/n)
  - The harmonic mean is for rates: n / (1/x_1 + 1/x_2 + ... + 1/x_n)
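A minimal sketch of the three means using Python's standard-library statistics module (the sample values are made up; geometric_mean requires Python 3.8+):

```python
import statistics

# Hypothetical positive sample values (e.g., speeds, so rates make sense)
data = [3.5, 4.0, 4.2, 5.1, 12.0]

print(statistics.mean(data))            # arithmetic mean: sum divided by count
print(statistics.geometric_mean(data))  # nth root of the product
print(statistics.harmonic_mean(data))   # n divided by the sum of reciprocals
```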
Spread
It's also important to describe how numeric data are dispersed,
i.e., are data bunched up or spread apart?
Measures of spread describe how varied data are.
Observing the different percentiles can indicate spread,
which is one way box plots (visualization) accomplish this.
Spread is also commonly measured by variance
and its square root, the standard deviation,
while box plots typically set their "whiskers" using the interquartile range (IQR).
Variance
Variance is computed as the arithmetic average of
squared differences from the arithmetic mean m:
((x_1 - m)^2 + (x_2 - m)^2 + ... + (x_n - m)^2) / n
Squaring keeps things positive, but expresses the wrong units.
To undo squaring, we can take the square root
which is called the standard deviation (SD).
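A minimal sketch of this computation, checked against Python's standard-library statistics module (the sample values are made up):

```python
import statistics

# Hypothetical numeric sample
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)  # average squared difference
sd = variance ** 0.5                                       # square root undoes the squaring

# The standard library's population variance/SD agree with the direct computation
assert abs(variance - statistics.pvariance(data)) < 1e-9
assert abs(sd - statistics.pstdev(data)) < 1e-9
print(variance, sd)  # 4.0 2.0
```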
Box and whiskers
Box-and-whisker plots are a simple data visualization
that describe both centrality and spread for numeric data.
The "box" shows the 25th and 75th percentiles (bottom and top),
in addition to the 50th percentile (median) through the middle.
"Whiskers" extend from the box to extreme points,
often out to the farthest points within 1.5 interquartile ranges (IQRs) of the box.
Points beyond the whiskers are considered extreme, or "outliers."
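A minimal matplotlib sketch of such a plot, with made-up values; matplotlib's default whisker rule is 1.5 x IQR:

```python
import matplotlib.pyplot as plt

# Hypothetical numeric sample with one extreme value
values = [4.1, 4.5, 4.8, 5.0, 5.2, 5.5, 5.9, 6.1, 6.4, 12.3]

fig, ax = plt.subplots()
ax.boxplot(values)  # box = quartiles, whiskers = 1.5 * IQR by default, outliers as points
ax.set_ylabel("value")
ax.set_title("Box-and-whisker plot of a hypothetical sample")
plt.show()
```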
Extreme variation
Human heights, weights, etc., are approximately "normal" (bell-shaped).
Normal variation is not very "extreme."
When data are normal, the mean and median are very similar.
The median and harmonic mean are robust to extremeness
(a comparison sketch follows the list below).
- Some example sources of extreme data:
  - social networks
  - city sizes
  - stock prices
  - seismic events
  - weather events
  - celestial objects
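A minimal sketch of that robustness, using made-up "city size" values with one extreme point:

```python
import statistics

# Hypothetical city sizes (thousands of residents): mostly small, one huge
sizes = [12, 18, 25, 31, 40, 55, 8_400]

print(statistics.mean(sizes))           # ~1225.9, dragged upward by the single large city
print(statistics.median(sizes))         # 31, barely affected by the extreme value
print(statistics.harmonic_mean(sizes))  # ~27.5, also robust to large outliers
```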
Co-variation
Quantiles and deviations express 1-dimensional spread.
It can also be important to describe how data vary together,
e.g., does height correspond to weight?
Scatter plots only exhibit variation in 2 or 3 dimensions,
so pair-wise comparisons are a common visual approach.
This is where methods for "correlation" come into play,
which quantify the variation one variable explains in another.
Common methods include Pearson's and Spearman's,
which hinge on different assumptions,
though higher values generally mean "more related."
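A minimal sketch using SciPy's pearsonr and spearmanr, with made-up height/weight pairs:

```python
from scipy import stats

# Hypothetical paired measurements, e.g., heights (cm) and weights (kg)
heights = [155, 160, 165, 170, 175, 180, 185, 190]
weights = [52, 58, 63, 66, 72, 79, 84, 95]

pearson_r, _ = stats.pearsonr(heights, weights)    # linear association
spearman_r, _ = stats.spearmanr(heights, weights)  # rank-based (monotonic) association

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_r:.2f}")
```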
Hypothesis generation
Where should a project go after EDA?
A central goal of EDA is data characterization.
A good outcome of EDA could be testable questions,
but a necessary outcome might be removal of some data,
or having to go back to a pre-processing stage.
- Some testable questions might be:
  - Do one or more variables predict others?
  - Does some behavior generate outliers?
  - Do sales of an item trend with some particular marketing?
  - Can future participation be predicted?
  - Is growth exponential or unsustainable? Will it crash?
Recap
EDA should always come first in a data science project,
allowing first for assessment of data integrity,
but also for modeling and hypothesis generation.
Data should be characterized from every angle.
EDA on its own does not constitute a complete study.
EDA helps to set the direction of a project.
- Next time: Depiction
  - What are some common visualization types?
  - How can visualizations be interpreted?
  - What are best practices with visualization?