"Big Data"
Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University
Some common themes
"Big" data is not just about data being large.
"Big" data is a blessing and a curse.
It's best to work with data that matches needs!
What is "Big data?"
"Big data" is hyped-up term.
- Really, what's going on is:
  - potential insights are a big deal, but
  - effective utilization poses a big problem.
Really, the question is: what makes data "big"?
The three Vs of big data
Data can be "big" for a number reasons.
- In 2001, Doug Laney succinctly described big data by 3 Vs:
-
Volume: The overall size of data
-
Velocity: The rate at which new data emerges
-
Variety: The differences in forms of data
Each V presents different promises and challenges.
The 3+ Vs of big data
Each of the three Vs describes something intrinsic about data.
Over time, we discovered other words that start with V.
- Some other Vs:
  - Veracity: The uncertainties of data constitution
  - Value: The usefulness of data
  - Validity: The quality or trueness of data
  - Variability: The changing nature of data
  - Visualization: The visually-descriptive power of data
  - Vagueness: Confusion over the meaning of big data
  - Vocabulary: Structured metadata that provide context
Are any of the Vs redundant?
Are all of these intrinsic to data?
Maybe 4 Vs of big data?
- Some thoughts:
  - The Vs are only a cute mnemonic for description.
  - Any "keepers" should be distinct and should be data-intrinsic descriptions.
  - This is probably why Veracity has had some staying power.
We'll stick to 4 Vs (including Veracity), and go into some detail.
Volume
Probably the most obvious type of bigness,
Volume refers to the measurable size of data.
E.g., the Library of Congress has all of the public tweets,
which can be used for whole-population research,
but it struggles even to store them all.
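A back-of-envelope sketch of that storage burden (both figures below are illustrative assumptions, not published numbers):

# Rough volume estimate; both figures are assumptions for illustration
tweets_per_day = 500_000_000   # the rate cited on the Velocity slide
bytes_per_tweet = 2_500        # assumed average record size (JSON + metadata)
daily_tb = tweets_per_day * bytes_per_tweet / 1e12
print(f"~{daily_tb:.2f} TB/day, ~{daily_tb * 365:,.0f} TB/year")
# -> ~1.25 TB/day, ~456 TB/year: beyond a single commodity drive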
- At what size might data be big?
  - One computer/drive/connection can't process/store/send it all?
  - Coverage approaches the whole population?
Velocity
Not speed but velocity, i.e., both sending and receiving.
This has to do with rates.
E.g., Twitter produces ~500,000,000 Tweets/day,
which offers current insight into world events,
but keeping up with the feed requires massive infrastructure.
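A quick rate calculation (a daily average only; real feeds are bursty):

# Average tweet rate implied by the daily figure above
tweets_per_day = 500_000_000
seconds_per_day = 24 * 60 * 60          # 86,400
print(f"~{tweets_per_day / seconds_per_day:,.0f} tweets/second")
# -> ~5,787 tweets/second on average, with much higher peaks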
- At what rate might data be big?
  - Equipment can only process 1 in 10 records in real time?
  - An analysis informs of something before the news?
All of the Tweets for US
In 2010, the US Library of Congress (LoC) signed an agreement with Twitter.
The LoC would house and make available all of the tweets.
While this is hard, there are companies that already do this.
E.g., Gnip is the official delivery agent for the LoC.
What has happened since then?
Variety
We've already seen there are many data types.
Each type of data has its own processing and storage needs.
E.g., memos, x-rays, and notes in electronic medical records (EMRs)
make complete profiles through the integration of multiple formats,
but require the integration of specialized processing techniques.
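A minimal sketch of what "a different model per type" can look like; the record types and handlers here are hypothetical stand-ins:

# Hypothetical dispatch of heterogeneous EMR records to type-specific handlers
def process_memo(record):
    return {"kind": "memo", "tokens": record["text"].split()}

def process_xray(record):
    return {"kind": "xray", "n_bytes": len(record["image_bytes"])}

HANDLERS = {"memo": process_memo, "xray": process_xray}

def process(record):
    # Every new format adds a handler; integration cost grows with variety
    return HANDLERS[record["type"]](record)

records = [
    {"type": "memo", "text": "patient follow-up scheduled"},
    {"type": "xray", "image_bytes": b"\x89PNG..."},
]
print([process(r) for r in records])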
- At what heterogeneity might data be big?
  - An analysis combines a different model for 10 types of data?
  - A comprehensive analysis identifies a combined effect?
Veracity
Veracity refers to data integrity, or consistency.
E.g., an EMR may use 50 different reference words for one condition.
Here, flexibility can ensure data is always entered, but
might leave a condition's documentation unprocessed.
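One common response is to normalize free-text variants to a canonical term; the synonym table below is a made-up illustration:

# Hypothetical synonym table mapping EMR variants to one canonical condition
CANONICAL = {
    "mi": "myocardial infarction",
    "acute mi": "myocardial infarction",
    "heart attack": "myocardial infarction",
}

def normalize(term):
    # Unmapped variants fall through as None, i.e., documentation left unprocessed
    return CANONICAL.get(term.strip().lower())

print(normalize("Heart Attack"))   # -> myocardial infarction
print(normalize("cardiac event"))  # -> None (a variant the table misses)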
- At what state of disorder might data be big?
  - Data are completely unstructured?
  - Data capture insight from all levels of participation?
Recap
"Big" data is not just about data being large.
"Big" data is a blessing and a curse.
It's best to work with data that matches needs!