Chapter 6

Acquisition

Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University

Introduction to data science

Some themes

Data comes from closed and open sources,

but if it is on the internet you can probably get it.

Data can usually be acquired with creativity,

and is often produced with unforeseen ethical implications.

Take care with data licenses and read the fine print.

Where to find data?

Let's consider how data might be collected:
- surveying with direct questions,
- sensing phenomena as they occur, and
- sampling controlled materials.

Surveys

When surveying, researchers can ask specific questions.

Respondent-recorded surveys can be written or electronic,

and interviewed surveys can be face to face or over the phone.

Challenges include determining the right number of respondents,

and whether they accurately represent the target population.
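
As a rough illustration of "the right number of respondents," here is a minimal sketch that assumes the survey estimates a simple proportion (a yes/no question) from a simple random sample. The formula (Cochran's, with a finite population correction) and the function name `sample_size` are assumptions added for illustration, not something these slides prescribe.

```python
import math

def sample_size(margin_of_error, confidence_z=1.96, proportion=0.5, population=None):
    """Respondents needed to estimate a proportion within a margin of error.

    confidence_z=1.96 corresponds to roughly 95% confidence.
    proportion=0.5 is the most conservative (largest-sample) guess.
    """
    # Cochran's formula for a very large population.
    n0 = (confidence_z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    if population is not None:
        # Finite population correction: small populations need fewer respondents.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(sample_size(0.05))                   # large population
print(sample_size(0.05, population=2000))  # population of 2,000
```

Under these assumptions, a 5% margin of error needs about 385 respondents for a large population and about 323 for a population of 2,000. Note that this only addresses how many respondents to collect, not whether they represent the target population.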

In addition, question formation requires extreme care.

For example, questions might be
- objective or subjective,
- closed-ended or open-ended,
- respondent leading, or
- loaded with implicit assumptions.

Interviewer-led surveys

On phone or in person—interviews and consultations

While these surveys offer lots of interviewer control,

they are hindered by cost, scalability and interviewer ability.

Some benefits:
- can capture emotions, behaviors, and physical cues
- offer accurate screening
- interviewer has control and can maintain focus

Some drawbacks:
- expensive to run
- slow to establish
- difficult to scale
- subject to interviewer biases

Distributed surveys

Whether distributed as a hard copy or an electronic form,

these types often have a more rigid structure, like

the census, product ratings, Mechanical Turk, or SurveyMonkey.

Some benefits:
- scalable (especially electronic)
- fast completion (especially electronic)
- anonymity for respondents
- affordable to operate

Some drawbacks:
- lack of control (especially electronic)
- format restrictions
- loss of data quality
- respondents may push through for profit

Required reading: Super Turkers

Sampling

Let's think of sampling as a punctuated collection process,

where an object is removed from its source for measurement.

Some examples:
- bone for radiocarbon dating
- rock for mass spectrometry compositions
- blood for typing and chemical analysis
- hair follicles for gene sequencing
- ice cores for climate records

Sensors

Let's think of sensors as passive data collectors

that capture data in live, operational systems.

Here, the great challenge is setting up a sensor system,

and often, the value of data collected is found secondarily.

I.e., a system must exist before its data can be accessed.

Let's list a few more sources:
- thermometers, barometers, and altimeters on balloons
- microphones and cameras in buildings
- voltmeters in grids
- call logging on cell phones
- blog posts on social media
- heart rate monitors on watches
- conversations with "Hello Barbie"

Required reading: Hello Barbie

Data access, privacy, and sharing

Depending on the source, access to data is highly variable.

If you collect the data, is it yours to keep and share?

What about if you produce the data?

While academic human subjects research is more controlled,

through Institutional Review Board (IRB) certification,

data generated through commercial activity is more variable,

with consent and rights waived through terms of service.

Required reading: Privacy and information

Controlled data and APIs

So much electronic data is commercially collected or generated.

If you're the company, then great! You have access.

If you're interested, but not the company, there are hoops.

In fact, companies often release small amounts of data,

oftentimes through an application programming interface (API).

What's an API (thanks, Wikipedia)?
- In computer programming, an application programming interface (API) is a set of subroutine definitions, protocols, and tools for building software and applications.

APIs

That was a technical definition.

APIs provide data access for programming development.

Companies may want you to use their data
(if only in a limited way).

E.g., Twitter data applications engage their community.

For development, their API provides public access to about 1% of the data.

How do you get access to more data? Money!

Well... Twitter also has data grants, but

to access samples, one writes code to make "API requests."

APIs also aid programming, like OpenCL for GPU processing.
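
To make "API requests" concrete, here is a minimal sketch using only Python's standard library. The endpoint `https://api.example.com/v1/posts`, the `q` and `count` parameters, the bearer-token header, and the `posts`/`text` response fields are all hypothetical; a real API (Twitter's included) defines its own URLs, authentication, and response format, so check its documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/v1/posts"

def build_request(query, count=10, api_key="YOUR_KEY"):
    """Build an HTTP GET request with query parameters and an auth header."""
    params = urlencode({"q": query, "count": count})
    headers = {"Authorization": f"Bearer {api_key}"}
    return Request(f"{BASE_URL}?{params}", headers=headers)

def parse_posts(body):
    """Pull the text of each post out of a JSON response body."""
    payload = json.loads(body)
    return [post["text"] for post in payload.get("posts", [])]

def fetch_posts(query, count=10, api_key="YOUR_KEY"):
    """Send the request over the network and return the parsed posts."""
    with urlopen(build_request(query, count, api_key), timeout=10) as response:
        return parse_posts(response.read().decode("utf-8"))
```

Most APIs follow this shape: a URL plus parameters goes out, structured data (often JSON) comes back, and a key identifies who is asking so that rate limits and access tiers can be enforced.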

Scraping electronic data

This probably falls into our data-sampling category.

Lots of data is just sitting, artifactually, on the Internet,

e.g., on any website hosting photos, videos, or text.

There, photos and videos are files in a nearby directory,

and scraping is the act of downloading the HTML and files.

This is all your browser does when you visit a site,

but beware of policies and terms of service.

Some data scraping utilities/methods:
- Python: Beautiful Soup (module)
- R: rvest (package)
- Command line: "wget" (Linux command)
- GUI: point and click (human operation)
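
To show what scraping involves once the HTML has been downloaded, here is a minimal sketch that lists the links and image files a page points to. It uses Python's built-in `html.parser` in place of Beautiful Soup so that it runs without extra installs; the class name `AssetCollector`, the sample page, and the `example.com` address are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AssetCollector(HTMLParser):
    """Collect absolute URLs of links (<a href>) and images (<img src>)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative paths against the page's own address.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))

def collect_assets(html, base_url):
    """Parse an HTML string and return (links, images)."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    return collector.links, collector.images

# A tiny stand-in for a downloaded page.
page = """<html><body>
<a href="/about.html">About</a>
<img src="photos/cat.jpg">
</body></html>"""

links, images = collect_assets(page, "https://example.com/gallery/")
print(links)
print(images)
```

A full scraper would then download each collected URL, which is where a site's policies and terms of service come into play.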

Required readings

I Don't Need No Stinking API
Web scraping: legal or illegal?

Libraries

Libraries are not just about books.

Collections include data sets, and affiliates gain access.

This also includes institutional access to private websites.

For example, Drexel has access to the Associated Press images.

Learn how to use the library website.

If something is not there, but could be, then ask a librarian!

Open data access

Not all data is restricted.

Publicly-accessible data is called "open data."

E.g., Wikipedia, Project Gutenberg, or the Internet Archive.

This concept is closely related to open-source software,

which has led to large scientific and commercial growth

through the impact of programming languages and modules.

What data should be open and available?

While PM voting records should probably be open,

personal data, like medical records, should probably be closed.

Required reading: Arguments for and against open data

Data processing challenges

Open data has big potential for advancing work on hard problems.

Data is often opened to cultivate research around a problem.

Workshops sometimes hold "shared tasks" and release data.

Every year, Yelp has an open Dataset Challenge.

Kaggle is a website devoted to data challenges.
- It is a great place to get into data science,
- where companies release data, and post a reward.
- Sometimes the reward is big money,
- and other times it is an awesome job!

Recap

Data comes from closed and open sources,

but if it is on the internet you can probably get it.

Data can usually be acquired with creativity,

and is often produced with unforeseen ethical implications.

Take care with data licenses and read the fine print.

Next time: Pre-processing
- What is data munging?
- What makes data high quality?
- How can data be cleaned?