Overview
Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University
Some themes
Data science is a mixture of disciplines, centered on data.
Not all data scientists have the same skills,
and data science folks perform a variety of tasks,
so many opinions exist as to what data science is.
Teams are important.
Data science has been around longer than the name.
What is data science?
Data science takes on a variety of shapes and forms.
This requires a hybrid skill set.
Practitioners have varying levels of skills in different areas...
...but generally some broad knowledge of the field.
Let's look at a few common positions in the industry.
A look at some roles
A look at their pay scales
So, what is data science?
- It's kind of hard to find concise and general responses, so I'll try:
Computationally-enabled, domain-dependent science centered around data management, analysis, and products.
What separates this from definitions of related disciplines,
like computer and information science, mathematics, statistics, business analytics, or software engineering?
So, what is data science?
- Here's Wikipedia's (as of 10/21/2016):
Data science is an interdisciplinary field about
processes and systems to extract knowledge or insights from data in various forms,
either structured or unstructured,
which is a continuation of some of the data analysis fields
such as statistics, machine learning, data mining, and predictive
analytics, similar to Knowledge Discovery in Databases (KDD).
So, what is data science?
- Jonathan Sedar (a DS consultant) de-emphasizes the field's independence:
The term 'data science' is a useful shortcut to
describe the recent confluence and evolution of several previously distinct disciplines,
made possible by an increasing availability of data and sophistication of high quality open source software,
decreasing costs of hardware and data processing,
intense academic research and massive commercial and industrial interests.
So, what is data science?
- Josh Wills, Senior Director of Data Science at Cloudera:
Someone better at statistics than any software engineer, and someone better at software engineering than any statistician.
So, what is a data scientist?
- O'Neil and Schutt's academic practitioner:
...an academic data scientist is a scientist,
trained in anything from social science to biology,
who works with large amounts of data,
and must grapple with computational problems posed by the
structure, size, messiness, and the complexity and nature of the data,
while simultaneously solving a real-world problem.
So, what is a data scientist?
- O'Neil and Schutt's industry practitioner:
A chief data scientist should be setting the data strategy of the company,
which involves a variety of things:
setting everything up from the engineering and infrastructure
for collecting data and logging, to privacy concerns,
to deciding what data will be user-facing,
how data is going to be used to make decisions,
and how it’s going to be built back into the product.
She should manage a team of engineers, scientists, and analysts and
should communicate with leadership across the company, including the CEO, CTO, and product leadership.
She’ll also be concerned with patenting innovative solutions and setting research goals.
The link points to a sampler of the book that covers pages 1–16.
The whole book is not a required text, but it is a good resource
for those of you who want to go on in the field.
So, what is a data scientist?
The important take-away is probably 'a mixture of skills.'
In a way, data scientists 'glue' disciplines together.
This is a good way to think about DS programming.
While new algorithms are always needed for DS advancement
(and some data scientists are heavy on algorithm development),
DS programmers usually 'glue' things together.
This is the central idea behind the term 'pipeline.'
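To make the 'glue' idea concrete, here is a minimal sketch of a pipeline in Python. The stage names (load, clean, analyze), the pipeline helper, and the sample text are all hypothetical, chosen only for illustration. The point is that each stage is small and the DS programmer's job is mostly to connect them.

```python
from collections import Counter
from functools import reduce


def load(raw_text):
    """Data management: split raw text into one record per line."""
    return raw_text.splitlines()


def clean(records):
    """Cleaning: trim whitespace, lowercase, and drop blank records."""
    return [r.strip().lower() for r in records if r.strip()]


def analyze(records):
    """Analysis: count how often each record occurs."""
    return Counter(records)


def pipeline(*stages):
    """Glue stages together so each stage's output feeds the next one."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


# Hypothetical sample data: messy labels, one per line.
count_labels = pipeline(load, clean, analyze)
result = count_labels("Spam\nham\n\n SPAM \nham\nspam")
print(result.most_common(1))  # [('spam', 3)]
```

Swapping in a different stage (say, a different cleaning rule) does not require touching the others, which is much of the appeal of thinking in pipelines.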
Let's look at some representations of the DS skill mixture.
On "the data science spectrum"
A good way to characterize the whole field is as a spectrum.
There are a number of specialties that are "on the spectrum,"
and depending on individuals' particular DS "profiles,"
responsibilities and pay grades will vary.
The best approach as a student is to identify your interests,
and to understand where they will lead you in the job market.
Moving over to some history
- Many core concepts are attributable to J.W. Tukey,
in The Future of Data Analysis (read the introduction):
"For a long time I thought I was a statistician,
interested in inferences from the particular to the general.
But as I have watched mathematical statistics evolve,
I have had cause to wonder and doubt...
I have come to feel that my central interest is in data analysis...
Data analysis, and the parts of statistics which adhere to it, must...
take on the characteristics of science rather than those of mathematics...
data analysis is intrinsically an empirical science...
How vital and how important...
is the rise of the stored-program electronic computer?
In many instances the answer may surprise many by being 'important but not vital,'
although in others there is no doubt but what the computer has been 'vital.'"
Milestones in formality
- A timeline of some highlights:
Data scientists love (using) infographics
More information on the history of DS
So, what is data science good for?
- Examples of data science products:
- effective web search
- recommender systems
- election prediction
- detecting disease outbreaks
- targeted electronic advertisement
- image recognition
- fraud and risk detection
- price comparison tools
- weather forecasting
- dynamic pricing
And where does data come from?
Let's group data sources according to active agency.
Note: when we think about examples, we'll see some overlap.
- A few clear sources are:
- human: solicited, behavioral, social, and physical
- machine: surveillance, sensory, communicative
- nature: behavioral, social, and physical
Human-generated data
Some human-generated data is traditional.
Solicited (survey) data is particular to humans and is tried and true.
- There's also the physical data, e.g.,
- heartrate, body chemistry, genome,
- height, weight, eye color, etc.
And now there are electronic ways for recording most of these.
Human-generated data
What's really new are the social and behavioral sources.
- Some online behavioral sources include:
- every click, visit, form filled,
- basket loaded, transaction processed
- search, filter, ad impression,
- fast-forward, pause, and rewind.
- Some online social sources include:
- Facebook, Twitter, Instagram, LinkedIn, YouTube, etc.,
- where text, image, video, and audio data are produced.
- they all allow for directed conversations—networks!
Machine-generated data
- From one perspective, machines produce nearly all modern data:
- Sensors are becoming ever more ubiquitous.
- They measure meteorological and seismic information,
- and monitor video and sound with cameras and microphones.
- On people, there are watches, phones, and even pacemakers.
- In homes, there are fire and entry alarms.
- Lights, thermostats, washers, and dryers control themselves,
- meaning these (and self-driving cars) can be hacked!
The internet of things
- Let's list some core concepts of the IoT:
- machine to machine (M2M) interaction,
- nature or human-independent activity,
- local sensing, and
- remote communication.
What's human, natural, or machine?
What part of heart rates and seismic events is machine?
Just because my smart watch is in the IoT
doesn't mean my heart rate is machine-generated data, right?
- It's the independent and coordinated machine actions, e.g.,
- the closure of an unattended parking garage at capacity,
- an automatic defibrillation from a sensed cardiac episode,
- the automatic pre-heating of your oven to 375 degrees
- when it realizes the microwave was set to defrost a turkey,
- the dynamic pricing of Uber cabs in a surge, or
- the automatic shut-down of a power plant in an earthquake.
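One way to picture an independent, coordinated machine action is as a rule that fires when one device hears about another device's event. The sketch below plays out the oven-and-microwave example in Python. The Hub and Oven classes, the "microwave" topic, and the message fields are hypothetical stand-ins, not any real IoT API; an actual deployment would use a network protocol between the devices.

```python
class Oven:
    """Hypothetical smart oven that can be told to preheat."""

    def __init__(self):
        self.target_f = None  # no target temperature until preheated

    def preheat(self, degrees_f):
        self.target_f = degrees_f


class Hub:
    """Hypothetical in-memory stand-in for M2M communication."""

    def __init__(self):
        self.handlers = {}

    def subscribe(self, topic, handler):
        self.handlers.setdefault(topic, []).append(handler)

    def publish(self, topic, message):
        # Deliver the message to every handler registered for the topic.
        for handler in self.handlers.get(topic, []):
            handler(message)


hub = Hub()
oven = Oven()


def on_microwave_event(message):
    # The rule: a turkey being defrosted implies the oven is needed soon.
    if message.get("mode") == "defrost" and message.get("item") == "turkey":
        oven.preheat(375)


hub.subscribe("microwave", on_microwave_event)

# The microwave senses locally and communicates; no human tells the oven.
hub.publish("microwave", {"mode": "defrost", "item": "turkey"})
print(oven.target_f)  # 375
```

The data recorded here (the preheat decision) originates with the machines' coordination, which is what distinguishes it from a heart rate that a watch merely measures.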
Recap
Data science is a mixture of disciplines, centered on data.
Not all data scientists have the same skills,
and data science folks perform a variety of tasks,
so many opinions exist as to what data science is.
Teams are important.
Data science has been around longer than the name.