Chapter 10

Depiction

Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University

Introduction to data science

Some themes

Depicting data visually fulfills a variety of purposes,

from data exploration and intuition development,

to descriptive storytelling through infographics,

to expressions of analyses and interpretations of results,

and even web-based, interactive product development.

Regardless of purpose, there are key qualities for success,

such as clarity, simplicity, and truth of representation.

Exploration and visualization

While much of EDA is accomplished through summarization,

such as objective measures of centrality and spread,

it is also helpful to depict data visually.

We've introduced box plots, which show these summaries,

but visualization in EDA lets one see all data at once

and form intuitive assessments about important patterns.

Here, we'll explore some common visualization types,

and walk through best practices and interpretations.

Histograms

A histogram exhibits the spread of a single data dimension.

There are discrete bins into which the data points fall,

which are tallied and shown as bars,

with heights as counts (frequencies) or probabilities.

Here, bin width/spacing is the variable parameter;

e.g., there can be equal-width and -probability bins.

Important: when bars depict probability density, areas, not heights, sum to 1!
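
The binning choices above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch: the function names, the random sample, and the choice of 10 bins are assumptions made here, and the quantile-based edges assume the data values are distinct (ties could produce zero-width bins).

```python
import random

def equal_width_edges(data, k):
    """k bins of identical width spanning the data range."""
    lo, hi = min(data), max(data)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k)] + [hi]

def equal_probability_edges(data, k):
    """k bins holding roughly the same number of points (sample quantiles)."""
    s = sorted(data)
    return [s[0]] + [s[(i * len(s)) // k] for i in range(1, k)] + [s[-1]]

def density_heights(data, edges):
    """Bar heights scaled so height * width is the fraction of points in a bin."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Bins are half-open [left, right); the last bin keeps its right edge.
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                counts[i] += 1
                break
    return [c / (len(data) * (edges[i + 1] - edges[i]))
            for i, c in enumerate(counts)]

def total_area(heights, edges):
    """Sum of bar areas (height times bin width)."""
    return sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))

random.seed(0)  # hypothetical sample data, for illustration only
sample = [random.gauss(0, 1) for _ in range(1000)]

width_edges = equal_width_edges(sample, 10)
prob_edges = equal_probability_edges(sample, 10)
area_equal_width = total_area(density_heights(sample, width_edges), width_edges)
area_equal_prob = total_area(density_heights(sample, prob_edges), prob_edges)
```

With either binning, the bar areas add up to 1 (up to floating-point error), while the bar heights alone generally do not.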

Kernel density estimates (KDEs) smooth histograms

and algorithmically estimate underlying distributions.
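
As a rough illustration of kernel smoothing, the sketch below places a Gaussian bump on every data point and averages them. The Gaussian kernel, the bandwidth of 0.5, and the toy data are assumptions chosen here for illustration; in practice the bandwidth strongly shapes the estimate.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates density at x using Gaussian kernels."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each data point contributes a bell curve centered on itself.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

toy_data = [1.0, 1.5, 2.0, 4.0, 4.2]  # hypothetical values
kde = gaussian_kde(toy_data, bandwidth=0.5)

# Approximate the area under the estimate with a simple Riemann sum.
grid_step = 0.01
grid = [-5 + i * grid_step for i in range(1501)]  # covers -5 to 10
kde_area = sum(kde(x) for x in grid) * grid_step
```

Like a density-scaled histogram, the estimate integrates to about 1, and it is higher near clusters of points (around 1.5 here) than in the gaps between them.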

Scatter plots

Scatter plots exhibit covariation between data simply,

by placing separate data dimensions on separate axes.

Points are commonly tweaked for shape and color

as a means for expressing multiple scatters together.

Point size can express a third, "intensity" dimension.

While depth and perspective can be used for true 3-d scatters,

anything more is beyond physical perception.

So for more comparisons, it's common to make more plots,

e.g., with an array of pairwise-comparison scatters,

but mind the axis ranges, which can visually inflate or deflate relationships.
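
One way to keep pairwise panels comparable is to fix each dimension's axis range once and reuse it in every panel where that dimension appears. The sketch below only builds the panel specifications and draws nothing; the column names, values, and the 5% padding are hypothetical.

```python
from itertools import combinations

columns = {  # hypothetical data, one list per dimension
    "height": [150, 160, 170, 180],
    "weight": [50, 62, 70, 85],
    "age": [20, 35, 50, 65],
}

def padded_range(values, pad=0.05):
    """Axis limits slightly wider than the data so edge points stay visible."""
    lo, hi = min(values), max(values)
    margin = (hi - lo) * pad
    return (lo - margin, hi + margin)

# One fixed range per dimension, shared by every panel that uses it.
limits = {name: padded_range(vals) for name, vals in columns.items()}

# One panel per unordered pair of dimensions.
panels = [
    {"x": a, "y": b, "xlim": limits[a], "ylim": limits[b]}
    for a, b in combinations(columns, 2)
]
```

Three dimensions yield three panels, and "height" is drawn on the same scale wherever it appears.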

Density

An important scatter plot observable is density,

which is straightforward when there are relatively few points,

but plotting many same-color points obscures density.

Once again, color comes in handy here,

where, e.g., color gradients indicate more/less dense regions.

However, this makes density another dimension of data,

which must be determined algorithmically.

Common tools are once again binning and kernel smoothing,

but beware, these rely on the same assumptions as histograms.
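
A minimal sketch of the binning approach: count the points in each square grid cell, then give every point the count of its cell as a value to color by. The cell size and the points are hypothetical, and the result depends on the cell size just as a histogram depends on bin width.

```python
import math
from collections import Counter

def cell_of(point, cell):
    """Grid cell (column, row) containing a 2-d point."""
    x, y = point
    return (math.floor(x / cell), math.floor(y / cell))

def point_densities(points, cell):
    """For each point, the number of points sharing its grid cell."""
    counts = Counter(cell_of(p, cell) for p in points)
    return [counts[cell_of(p, cell)] for p in points]

points = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (1.5, 0.2)]  # hypothetical
densities = point_densities(points, cell=1.0)
```

Here the first three points share a cell and would receive the "dense" end of a color gradient, while the fourth sits alone.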

Line plots

When one of two dimensions is ordered, line plots can be useful.

E.g., stock prices, daily temperatures, traffic density, etc.

Don't forget, all these do is connect the dots,

so points in between shouldn't generally be assumed,

though in exploration, these help highlight trends.

If a jagged line plot distracts from the underlying shape,

a variety of smoothing techniques can come in handy,

e.g., moving averages are quite straightforward,

but the different methods all have their assumptions,

so smoothing should be taken with a grain of salt.
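
A moving average can be sketched as below. This version is a trailing average (each value is averaged with the ones before it), which is one assumption among several possible; a centered window would shift the smoothed curve differently. The function name and sample values are illustrative.

```python
def moving_average(values, window):
    """Trailing moving average; positions without a full window are None."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just slid out of the window.
            running -= values[i - window]
        smoothed.append(running / window if i >= window - 1 else None)
    return smoothed

series = [1, 2, 3, 4, 5]  # hypothetical daily values
smoothed = moving_average(series, window=3)
```

The first `window - 1` entries are left empty rather than guessed, which is itself a choice about how to treat the edges of the series.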

Line plots

A line chart in yellow with a 30-day moving average in black.

Maps

With geospatial data, maps make for exciting visualizations.

Points can be scattered atop political, natural, and other features.

"Choropleths" shade polygons according to intensities,

and so are just polygon-binned scatter plots exhibiting density.
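
The shading step of a choropleth can be sketched as assigning each polygon's intensity to one of a few shade classes. Equal-width classes are an assumption made here (quantile classes are another common choice), and the region names and values are hypothetical.

```python
def shade_classes(intensities, n_classes):
    """Map each region's intensity to a shade index from 0 (lightest) upward."""
    lo = min(intensities.values())
    hi = max(intensities.values())
    width = (hi - lo) / n_classes
    classes = {}
    for region, value in intensities.items():
        index = int((value - lo) / width) if width > 0 else 0
        # The maximum value lands exactly on the top edge; keep it in the top class.
        classes[region] = min(index, n_classes - 1)
    return classes

intensities = {"A": 10, "B": 20, "C": 55, "D": 100}  # hypothetical regions
shades = shade_classes(intensities, n_classes=3)
```

As with histograms, the class boundaries are a parameter: regions "A" and "B" share a shade here, but might not under a different number of classes.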

Don't forget: maps rely on projections—the earth is round!

Some projections keep polygon area true to geography,

while others focus on simplicity, making all bodies visible.
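
To make the trade-off concrete, the sketch below compares two standard cylindrical projections of a unit sphere: Mercator, which stretches high latitudes, and the Lambert cylindrical equal-area projection. These two were picked here purely for illustration; the helper `band_height` is a hypothetical name.

```python
import math

def mercator(lon_deg, lat_deg):
    """Mercator projection on a unit sphere (not area-preserving)."""
    lam, phi = math.radians(lon_deg), math.radians(lat_deg)
    return lam, math.log(math.tan(math.pi / 4 + phi / 2))

def equal_area_cylindrical(lon_deg, lat_deg):
    """Lambert cylindrical equal-area projection on a unit sphere."""
    lam, phi = math.radians(lon_deg), math.radians(lat_deg)
    return lam, math.sin(phi)

def band_height(project, lat_lo, lat_hi):
    """Vertical extent on the map of the latitude band [lat_lo, lat_hi]."""
    return project(0, lat_hi)[1] - project(0, lat_lo)[1]

# Both projections give every latitude band the same map width,
# so a band's height on the map is proportional to its drawn area.
equator_mercator = band_height(mercator, 0, 10)
polar_mercator = band_height(mercator, 60, 70)
equator_equal_area = band_height(equal_area_cylindrical, 0, 10)
polar_equal_area = band_height(equal_area_cylindrical, 60, 70)
```

On the globe, the band from 60 to 70 degrees covers less area than the band from 0 to 10 degrees. The equal-area projection draws it smaller accordingly, while Mercator draws it larger.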

"Cartograms" are maps that distort areas intentionally,

representing data intensities through polygon size.

Required readings: Map projections

> Some common map projections explained
> What your favorite projection says about you
> An interactive graphic comparing projections

Infographics

Occasionally, EDA itself can result in a product.

Infographics present summary information quickly and clearly,

often cartoonifying visual information for easy consumption,

and combining different visualization types into one.

We've seen a number of these already!

The goal: bingeably wrap descriptive observations together.

A diverse few examples:

One of the earliest infographics

How many data dimensions are depicted here?

Interactivity

EDA in large or complex data sets can be challenging.

It's great to be able to represent lots of data,

but can one digest 1,000 scatter plots?

This is where sharable interactivity comes into play,

made possible recently with advances in technology.

Web-based apps allow viewers to traverse many plots,

but to truly "explore" some data, movement is necessary.

This can be as simple as a video from successive plots,

or as complex as the panning and zooming in Google Maps.

Paramount to interactivity is intuitive navigation.

HTML and interactivity

HTML's malleability makes it an ideal host for interactivity.

With much web-based programming performed in JavaScript,

an extensive library for interactivity called d3.js was developed.

Other languages have similar functionality,

e.g., Python has Bokeh, and R has Shiny,

but these are less developed and/or not free to host!

However, d3.js requires HTML, CSS, and JavaScript skills,

so while the learning curve can be somewhat steeper,

learning it leads to proficiency in web product development,

which is an extremely marketable skill in data science.

Required derping with d3.js

Some examples on the d3.js gallery (explore this):
A changing Voronoi tessellation (midpoint boundaries).
A click-to-zoom map.
An interactive map depicting US airline flights.
NFL prediction (2015) from Nate Silver's FiveThirtyEight blog.

Some external examples applying d3.js:
koalastothemax is reminiscent of mitosis!
Facebook's initial public offering, via the New York Times.
An interactive exploration of the MBTA (Boston's subway).
Vax!, a game about epidemic prevention.

What's the point?

With so many visualizations easily accessible,

it's important to maintain focus on goals for depiction.

Consistent values include clarity, trueness, and simplicity,

but often there is a message from analysis to display,

so visualization can be essential to convey interpretation,

as is the case with presentation in storytelling.

So if an outcome is a business decision,

significant emphasis should be placed on interpretation,

which must be carried forward to recipients.

Required readings: Storytelling

> The essential data science skill everyone needs
> How to tell a story with data
> When beautiful metrics can't beat words
> Why data storytelling is so important—and why we're so bad at it

Product development

A common goal in data science is product development,

and customers often experience data visually.

So, instead of communicating hypotheses and interpretations,

visualization can take on roles like navigation and uptake.

This again emphasizes simplicity and clarity,

but likewise a need for natural interaction,

i.e., user uptake may improve if instructions are unnecessary.

We'll touch on these topics more in Chapter 13,

when we consider generally the role of design in data science.

Recap

Depicting data visually fulfills a variety of purposes,

from data exploration and intuition development,

to descriptive storytelling through infographics,

to expressions of analyses and interpretations of results,

and even web-based, interactive product development.

Regardless of purpose, there are key qualities for success,

such as clarity, simplicity, and truth of representation.

Next time: Modeling
- Does data follow a pattern or emerge from a process?
- What is hypothesis testing?
- How do math, stats, algorithms, and machine learning all fit?