Chapter 11

Modeling

Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University

Introduction to data science

Some themes

Modeling is heavy on mathematics and statistics.

Model selection is guided by many factors,

and is preceded by exploratory data analysis.

Theoretical relevance can make a model highly valuable,

but sometimes it's all about performance and product quality,

balancing bias and variance as best possible.

What is modeling?

The real world is a messy place,

but this is where data comes from.

While data might be generated by a "true" mechanism,

it may not be known, and may be obscured by interference and noise.

Some modeling seeks to simulate real world phenomena,

but for us modeling will exist in the context of explaining data,

and "fitting it" to identify and predict real-world processes.

The role of mathematics

Mathematics is integral to modeling,

as it provides tools and frameworks for abstraction.

While models might have physical or social foundations,

these must be expressed mathematically to be applied.

A mathematical model might use simple, high school algebra,

or any part of calculus or matrix algebra as a framework.

Calculus

Calculus is all about rates of change and accumulation.

This becomes very important when approaching optimization,

e.g., when error is minimized for a parameter's estimation.

However, calculus is a continuous (smooth) science

and measurements and computations are always discrete,

so the day-to-day value for DS is largely with intuition

and model development or analysis, instead of application.
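
As a minimal sketch of how rates of change drive optimization, the example below estimates a single parameter m by minimizing the sum of squared errors on a small made-up data set. The data values, step size, and function names are illustrative assumptions, not part of the course material.

```python
# Illustrative data (an assumption for this sketch).
data = [2.0, 4.0, 4.0, 5.0, 10.0]

def squared_error(m):
    """Sum of squared errors when m is used as the estimate for every point."""
    return sum((x - m) ** 2 for x in data)

def error_slope(m):
    """Derivative of squared_error with respect to m."""
    return sum(-2.0 * (x - m) for x in data)

m = 0.0       # starting guess
rate = 0.05   # step size, chosen by hand for this data
for _ in range(200):
    m -= rate * error_slope(m)  # step downhill along the slope

print(m, sum(data) / len(data))
```

Setting the slope to zero by hand gives the arithmetic mean of the data, so the iterative estimate and the mean should agree closely here.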

Differential calculus is about rates (slopes)

The derivative f′(x) of a curve at a point is the slope of the line tangent to that curve at that point. This slope is determined by considering the limiting value of the slopes of secant lines.
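
The limiting process described above can be sketched numerically. The curve f(x) = x^2 and the point x = 3 are illustrative choices; the exact tangent slope there is 6.

```python
def secant_slope(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

def f(x):
    return x ** 2  # illustrative curve; its exact derivative is 2x

steps = (1.0, 0.1, 0.01, 0.001)
slopes = [secant_slope(f, 3.0, h) for h in steps]

# As h shrinks, the secant slopes approach the tangent slope at x = 3.
for h, slope in zip(steps, slopes):
    print(h, slope)
```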

Integral calculus is about accumulation

Integration can be thought of as measuring the area under a curve, defined by f(x), between two points (here a and b).
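
Accumulation can be approximated by adding up the areas of many thin rectangles. The sketch below uses a midpoint sum; the curve f(x) = x^2 and the endpoints a = 0 and b = 3 are illustrative assumptions, for which the exact area is 9.

```python
def midpoint_area(f, a, b, n):
    """Approximate the area under f between a and b using n rectangles."""
    width = (b - a) / n
    heights = (f(a + (i + 0.5) * width) for i in range(n))
    return sum(heights) * width

def f(x):
    return x ** 2  # illustrative curve

area = midpoint_area(f, 0.0, 3.0, 1000)
print(area)  # close to the exact area, 9
```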

Required readings

If you have never seen calculus, please read sections 2.2 and 2.4 from Wikipedia's article on the topic, but do not worry if you don't understand the algebraic details, i.e., just focus on gaining an intuition for rates and accumulation.

Linear algebra

Linear algebra is all about equations and solutions,

with restrictions on the types of operations considered,

i.e., only multiplication and addition (making things linear),

and extreme generality for many dimensions.

For data representation this is an important framework,

e.g., raster images are matrices of color intensities.

In algorithms, linear algebra plays a central role, too,

e.g., Google's search algorithm is a matrix multiplication!

PageRank is matrix multiplication
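
A simplified sketch of the idea, not Google's production algorithm: ranks are updated by repeatedly multiplying a link matrix into a rank vector. The three-page web, the damping value of 0.85, and all names below are assumptions made for illustration.

```python
# Hypothetical web: each page lists the pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = sorted(links)
n = len(pages)
index = {page: i for i, page in enumerate(pages)}

# M[i][j] is the chance of moving to page i when leaving page j.
M = [[0.0] * n for _ in range(n)]
for source, targets in links.items():
    for target in targets:
        M[index[target]][index[source]] = 1.0 / len(targets)

def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]

damping = 0.85        # commonly quoted value, assumed here
rank = [1.0 / n] * n  # start with equal rank everywhere
for _ in range(100):
    walked = mat_vec(M, rank)
    rank = [(1.0 - damping) / n + damping * w for w in walked]

for page in pages:
    print(page, round(rank[index[page]], 3))
```

In this toy web, page C is linked to by both other pages, so it should end up ranked above page B, which has only one incoming link.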

Required videos

If you've never had linear algebra, please watch the first four videos in this series: the essence of linear algebra, and once again, don't worry so much about the details, but focus on gaining an intuition for the nature of the subject.

Calculus vs. linear algebra

Calculus usually gets extreme emphasis in math education.

However, there is contention over which is more important.

This is true, perhaps even more so in data science,

where many attest to linear algebra's usefulness over calculus.

Is this true, or simply backlash against historical emphasis?

Whether either is intuitively grounding or explicitly useful,

neither branch of math should really be left out entirely,

i.e., a data scientist is best off understanding both.

Required reading: Calculus is so last century.

Probability and statistics

Probability and statistics are focused on quantifying chance.

Probabilistic models can be theorized to understand data,

and statistics are computed from data to hypothesize models.

Probability and statistics are largely inseparable,

and they use both calculus and linear algebra for modeling.

Historically, a lot of statistics was developed assuming normality,

i.e., for physical, height-weight, bell-shaped distributions,

but in fact many real-world distributions are much wilder,

like market movements, community sizes, and disaster scales.

It's important to have a broad view of statistical concepts.

Required reading

If you've never had any probability and/or statistics, please read this post.

Data science vs. statistics

Historically, statistics is thought of as the science of data.

Does that mean data science is overriding statistics?

No! Once again, data science is all about a collision of skills.

Statisticians can be data scientists (and vice-versa),

and data science can't exist without statistics, but

this is not true for CS, or management and database systems.

Contentions around disciplinary encroachment are unfortunate,

and can be met with reverence and respect,

which starts with understanding what others do.

Required reading: Statistics vs. data science

Regression

Regression is all about fitting shapes to data,

e.g., fitting a line to a cloud of points,

or a distributional model to empirical results.

Two main components of running a regression are

parameterizations and objective functions.

Parameterizations describe model specifics, e.g., slope,

and objective functions describe how well a model fits.
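
As a minimal sketch of both components, the example below fits a line to a small cloud of points. The parameterization is a slope and an intercept, and the objective function is the sum of squared errors, minimized here with the standard closed-form least-squares solution. The data points and function names are made up for illustration.

```python
# Illustrative points (an assumption for this sketch).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(xs, ys):
    """Return the (slope, intercept) that minimize the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sum_squared_error(slope, intercept):
    """Objective function: how poorly a given line fits the points."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))

slope, intercept = fit_line(xs, ys)
print(slope, intercept, sum_squared_error(slope, intercept))
```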

Objective functions

An objective function is a quantity that modeling optimizes,

e.g., sum of squares error, precision, or recall.

Objective functions can either be maximized or minimized,

and sometimes there can be more than one.
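
Precision and recall are two objectives that are often tracked together. A minimal sketch, assuming binary labels where 1 marks the positive class; the label lists are made up for illustration.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for binary labels, with 1 as the positive class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    # Guard against dividing by zero when nothing is predicted or nothing is positive.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 1, 1, 0]
precision, recall = precision_recall(actual, predicted)
print(precision, recall)
```

Here three of the five positive predictions are correct, and three of the four true positives are found, so the two objectives disagree about how good the predictions are.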

Convex optimization is a special case for objective functions

where there is a guaranteed "best case."

Unfortunately, not all algorithms result in convex optimization,

and non-convex results must be accepted with lots of tuning
(and a grain of salt).
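
To illustrate why non-convex results deserve caution, the sketch below runs the same downhill procedure from two starting points on a curve with two valleys. The curve, step size, and starting points are assumptions chosen for illustration.

```python
def f(x):
    """A non-convex objective with two separate valleys."""
    return x ** 4 - 3 * x ** 2 + x

def slope(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def descend(x, rate=0.01, steps=2000):
    """Follow the slope downhill from a starting point."""
    for _ in range(steps):
        x -= rate * slope(x)
    return x

left = descend(-2.0)   # settles in the deeper valley
right = descend(2.0)   # settles in the shallower valley
print(left, f(left))
print(right, f(right))
```

Both runs stop where the slope is zero, but only one finds the lower value, so the answer depends on where the search started.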

Hypothesis testing

Hypothesis testing is described well by the name.

There are two outcomes—null and alternative hypotheses.

A null hypothesis is the assumed default outcome,

e.g., drug X had no effect on the patient's condition,

and the alternative is the opposite,

e.g., drug X had an effect on the patient's condition.

For acceptance/rejection, a test statistic is measured,

in the context of an assumed probability distribution.

With probability known, one checks a "significance level,"

which, if crossed, results in rejection of the null (otherwise, acceptance).
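
A minimal sketch of this procedure, assuming a coin-flipping experiment rather than a drug trial: the null hypothesis is that the coin is fair, the test statistic is the number of heads, and the assumed distribution is binomial. The counts and the 0.05 level are illustrative.

```python
from math import comb

def binomial_p_value(heads, flips):
    """Two-sided p-value under the null hypothesis of a fair coin.

    Adds up the probability of every outcome at least as far from
    flips / 2 as the observed number of heads.
    """
    observed_gap = abs(heads - flips / 2)
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_gap
    )
    return extreme / 2 ** flips

p_value = binomial_p_value(16, 20)  # 16 heads in 20 flips
significance_level = 0.05
reject_null = p_value < significance_level
print(p_value, reject_null)
```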

p-values and hacking

p-values describe how likely equally or more-extreme test statistics are under the null hypothesis.

They are interpreted as measures of "statistical significance."

Generally, p-values above 0.05 or 0.1 are unacceptable,

and scientific standards place great weight on the p-value.

This means some researchers will ignore research with high p-values,

or use computational ease to slice data for significance,

meaning results go unreported or are tweaked for significance.

These practices are called data dredging, or p-hacking.
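
A small simulation can show why slicing data for significance is risky. In the sketch below every "slice" is pure noise, so the null hypothesis is true each time, yet some slices still fall under the 0.05 level. The sample sizes, the seed, and the use of a z-test with a known standard deviation of 1 are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the sketch is repeatable

def z_test_p_value(sample):
    """Two-sided p-value for the null 'the true mean is 0', assuming a known standard deviation of 1."""
    z = mean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

n_slices = 1000
p_values = [
    z_test_p_value([random.gauss(0, 1) for _ in range(30)])
    for _ in range(n_slices)
]
false_positives = sum(p < 0.05 for p in p_values)
print(false_positives, "of", n_slices, "noise-only slices look significant")
```

Roughly one slice in twenty is expected to look significant by chance alone, which is why reporting only the slices that "worked" is misleading.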

Required reading: p-hacking

Machine learning

Machine learning is a branch of artificial intelligence.

Focused on data analysis, it is similar to statistics,

but relies on programs that learn their rules from data.

Generally, there are supervised and unsupervised algorithms.

Supervised algorithms learn from ground-truth, labeled data,

while unsupervised algorithms just find patterns in unlabeled data.

Machine learning may or may not use statistics,

and is most focused on algorithm outcome quality,

as opposed to theoretical soundness and variable relationships.
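
The supervised and unsupervised settings can be contrasted with two tiny sketches on one-dimensional data: a nearest-neighbor classifier that uses labels, and a two-group clustering routine that does not. The data, labels, and function names are assumptions made for illustration.

```python
def nearest_neighbor_label(labeled, x):
    """Supervised: copy the label of the closest labeled example."""
    value, label = min(labeled, key=lambda pair: abs(pair[0] - x))
    return label

def two_means(values, iterations=10):
    """Unsupervised: split values into two groups around two moving centers."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the average of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

labeled = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]
prediction = nearest_neighbor_label(labeled, 7.0)

centers, groups = two_means([1.0, 1.5, 2.0, 8.0, 9.0, 10.0])
print(prediction, centers, groups)
```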

Required readings

1) A visual introduction to machine learning
2) Statistics vs. machine learning

Model selection

There are a lot of different kinds of models out there!

Choosing the right model starts with knowing what exists.

Additionally, exploratory data analysis should be a guide.

So, selection can occur early on after exploration,

or, it can be a part of a reported analysis,

i.e., from 7 models choose the best-performing for a product.

However, performance should not be the only consideration.

Some other selection factors:
- What model is a good theoretical match for the data?
- How efficiently or quickly does a model run?
- How domain-portable is a model?
- How difficult is a model to implement?
- Will the model scale across multiple machines?
- How transparent are a model's inner workings?
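
Selecting by performance alone can be sketched as scoring each candidate on held-out data and keeping the best. Only two hypothetical candidates are compared here, a constant model and a line, and the data are made up; the other factors listed above are not captured by this score.

```python
def fit_constant(xs, ys):
    """Candidate 1: always predict the average training value."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Candidate 2: least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
held_x = [7.0, 8.0]   # held-out points, never used for fitting
held_y = [6.9, 8.2]

candidates = {"constant": fit_constant, "line": fit_line}
scores = {
    name: mean_squared_error(fit(train_x, train_y), held_x, held_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)
print(scores, best)
```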

Required reading: DS model selection

The bias-variance tradeoff

Modeling error can be broken down into three parts:
- bias error, due to the assumptions made in a model;
- variance error, due to sensitivity of a model to its training data;
- and irreducible error, due to factors unknown, often external.
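
For squared error, the three parts listed above can be written as a standard decomposition. The notation is an assumption introduced here: y = f(x) + ε with a "true" function f, noise ε of mean zero and variance σ² that is independent of the training data, and a fitted model f̂ whose expectation is taken over random training sets.

```latex
% Assumes y = f(x) + \varepsilon, E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2,
% with \varepsilon independent of the training set used to fit \hat{f}.
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible}}
```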

Irreducible error can result from data collection, formation, etc.,

but bias and variance are somewhat controllable.

These two types of error are linked and must be balanced;

when bias error is up, variance error is often down (and vice versa).

True bias and variance decompositions of error are hard to find,

since they would require knowing a "true" target function,

but cross-validated training hints at variance error for a model,

leaving bias and irreducible errors mixed, but approachable.
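
A rough sketch of cross-validated training: the data are split into folds, a model is fit with one fold held out, and the held-out error is recorded for each fold. How much the per-fold errors vary gives only a hint about sensitivity to training data, not a true variance decomposition. The line model, the data, and the fixed "noise" values are assumptions made for illustration.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Least-squares line, returned as a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def cross_validated_errors(xs, ys, folds):
    """Held-out mean squared error for each fold."""
    errors = []
    for fold in range(folds):
        test = [i for i in range(len(xs)) if i % folds == fold]
        train = [i for i in range(len(xs)) if i % folds != fold]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mean((model(xs[i]) - ys[i]) ** 2 for i in test))
    return errors

# Illustrative data: a line plus small, hand-picked disturbances.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, 0.2, -0.2]
xs = [float(i) for i in range(12)]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

errors = cross_validated_errors(xs, ys, folds=3)
print(errors, mean(errors), pstdev(errors))
```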

Required reading: Bias-variance tradeoff

Are models necessary?

A common aphorism in statistics is "all models are wrong."

With data, this is kinda tautological, as noise is ever present.

But some have posited that Big Data makes models obsolete!

I.e., with enough data, models are not necessary for prediction,

e.g., Google doesn't need to know about culture or conventions

in order to be able to serve advertisements—just data!

Yeah, but there are different kinds of models,

e.g., the Internet as a click-through network is an applied model,

and this is what makes Google's PageRank the best at search!

So, with a grain of salt: Big Data might obviate some models,

but more likely, open up opportunities for new alternatives.

Required reading: The end of theory

Recap