"Big Data"

Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University
Introduction to data science

Some common themes

  • "Big" data is not just about data being large.
  • "Big" data is a blessing and a curse.
  • It's best to work with data that matches needs!

What is "Big data?"

  • "Big data" is a hyped-up term.

    • Really, what's going on is:
      • potential insights are a big deal, but
      • effective utilization poses a big problem.


  • Really, the question is what makes data "big?"
  • Required reading:
    Who coined the term "Big data?"

The three Vs of Big data

  • Data can be "big" for a number of reasons.

    • In 2001, Doug Laney succinctly described big data by 3 Vs:
      • Volume: The overall size of data
      • Velocity: The rate at which new data emerges
      • Variety: The differences in forms of data


  • Each V presents different promises and challenges.
  • Required reading:
    The original 3 Vs of big data

The 3+ Vs of big data

  • Each of the three Vs describes something intrinsic about data.
  • Over time, we discovered other words that start with V.

    • Some other Vs:
      • Veracity: The uncertainties of data constitution
      • Value: The usefulness of data
      • Validity: The quality or trueness of data
      • Variability: The changing nature of data
      • Visualization: The visually-descriptive power of data
      • Vagueness: Confusion over the meaning of big data
      • Vocabulary: Structural metadata that provide context


  • Are any of the Vs redundant?
  • Are all of these intrinsic to data?
  • Required reading:
    Avoiding that "wanna-V" confusion

Maybe 4 Vs of big data?

    • Some thoughts:
      • The Vs are only a cute mnemonic for description.
      • Any "keepers" should be distinct
      • and should be data-intrinsic descriptions.
      • This is probably why Veracity has had some staying power.


  • We'll stick to 4 Vs (including Veracity), and go into some detail.

Volume

  • Probably the most obvious type of bigness,
  • Volume refers to the measurable size of data.
  • E.g., the Library of Congress has all of the public tweets,
  • which can be used for whole-population research,
  • but it struggles even to store them all.

    • At what size might data be big?
      • One computer/drive/connection can't process/store/send it all?
      • Coverage approaches the whole population?
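
The "one drive can't store it all" test above can be made concrete with a little arithmetic. The following is a minimal sketch, not a measurement: the average record size and the drive capacity are hypothetical values chosen only for illustration, and the record count reuses the ~500,000,000 Tweets/day figure quoted in these slides.

```python
import math

def drives_needed(total_bytes: int, drive_capacity_bytes: int) -> int:
    """Smallest number of drives whose combined capacity holds the data."""
    return math.ceil(total_bytes / drive_capacity_bytes)

def is_big_by_volume(total_bytes: int, drive_capacity_bytes: int) -> bool:
    """Data are 'big' in this sense when a single drive cannot hold them."""
    return drives_needed(total_bytes, drive_capacity_bytes) > 1

# Hypothetical values, assumed only for illustration.
AVG_RECORD_BYTES = 2_000        # assumed average size of one stored record
DRIVE_BYTES = 4 * 10**12        # assumed capacity of one drive (4 TB)

# One year of records at the ~500,000,000 per day quoted in the slides.
RECORDS = 500_000_000 * 365

total = AVG_RECORD_BYTES * RECORDS
print(f"total size: {total / 10**12:.0f} TB")
print(f"drives needed: {drives_needed(total, DRIVE_BYTES)}")
print(f"big by volume: {is_big_by_volume(total, DRIVE_BYTES)}")
```

Changing either assumed value changes the answer, which is the point: whether data are "big" by volume depends on the equipment at hand, not on a fixed threshold.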

Velocity

  • Not just speed: velocity has direction, i.e., data are both sent and received.
  • This has to do with rates.
  • E.g., Twitter produces ~500,000,000 Tweets/day,
  • which offers current insight into world events,
  • but keeping up with the feed requires massive infrastructure.

    • At what rate might data be big?
      • Equipment can only process 1 in 10 records in real time.
      • An analysis informs of something before the news.
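
To get a feel for the rates above, the daily figure can be converted into a per-second average. This is a rough sketch: it uses the ~500,000,000 Tweets/day quoted above, assumes a uniform arrival rate (real traffic is bursty), and the function names are made up for illustration.

```python
TWEETS_PER_DAY = 500_000_000      # figure quoted in the slides
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400

def per_second(records_per_day: float) -> float:
    """Convert a daily record count into an average per-second rate."""
    return records_per_day / SECONDS_PER_DAY

def backlog_after(seconds: float, arrival_rate: float,
                  capacity_fraction: float) -> float:
    """Records left unprocessed after `seconds` when only
    `capacity_fraction` of arrivals can be handled in real time."""
    return arrival_rate * (1.0 - capacity_fraction) * seconds

rate = per_second(TWEETS_PER_DAY)
print(f"average rate: {rate:,.0f} Tweets/second")

# Equipment that keeps up with only 1 in 10 records falls behind quickly.
hour_backlog = backlog_after(3600, rate, 0.1)
print(f"backlog after one hour: {hour_backlog:,.0f} Tweets")
```

The average works out to a little under 6,000 Tweets per second, so a system that handles one record in ten leaves millions unprocessed every hour.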

All of the Tweets for US

  • In 2010, the US Library of Congress (LoC) signed an agreement with Twitter.
  • The LoC would house and make available all of the tweets.
  • While this is hard, there are companies that already do this.
  • E.g., Gnip is the official delivery agent for the LoC.
  • What has happened since then?

Variety

  • We've already seen there are many data types.
  • Each type of data has its own processing and storage needs.
  • E.g., memos, x-rays, and notes in electronic medical records (EMRs)
  • make complete profiles through integration of multiple formats,
  • but require the integration of specialized processing techniques.

    • At what heterogeneity might data be big?
      • An analysis combines a different model for each of 10 types of data.
      • A comprehensive analysis identifies a combined effect.
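
The idea that each type of data needs its own processing, yet all of it feeds one profile, can be sketched as a lookup from record type to handler. This is an illustrative toy, not an EMR system: the record types, the handlers, and the summaries they compute are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

def summarize_text(text: str) -> Dict[str, Any]:
    """Text-like records (memos, notes): count the words."""
    return {"words": len(text.split())}

def summarize_image(pixels: List[List[int]]) -> Dict[str, Any]:
    """Image-like records (x-rays): average the pixel intensities."""
    flat = [value for row in pixels for value in row]
    return {"mean_intensity": sum(flat) / len(flat)}

# Hypothetical registry: one specialized handler per record type.
HANDLERS: Dict[str, Callable[[Any], Dict[str, Any]]] = {
    "memo": summarize_text,
    "note": summarize_text,
    "x-ray": summarize_image,
}

def build_profile(records: List[Tuple[str, Any]]) -> Dict[str, Any]:
    """Combine per-type summaries into one profile and
    list the record types that have no handler yet."""
    profile: Dict[str, Any] = {"summaries": [], "unhandled": []}
    for kind, payload in records:
        handler = HANDLERS.get(kind)
        if handler is None:
            profile["unhandled"].append(kind)
        else:
            profile["summaries"].append({"kind": kind, **handler(payload)})
    return profile

patient = [
    ("memo", "Patient reports mild chest pain"),
    ("x-ray", [[0, 255], [255, 0]]),
    ("fax", b"scanned referral"),   # no handler registered for this type
]
profile = build_profile(patient)
print(profile)
```

Every new format means another handler to write and maintain, which is where variety starts to make data "big".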

Veracity

  • Veracity refers to data integrity, or consistency.
  • E.g., 50 different reference words for a condition in an EMR.
  • Here, flexibility can ensure data is always entered, but
  • might leave a condition's documentation unprocessed.

    • At what state of disorder might data be big?
      • Data are completely unstructured.
      • Data capture insight from all levels of participation.
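
The EMR example above (many different words for one condition) can be sketched as a normalization step. This is a minimal sketch under stated assumptions: the synonym table is a hypothetical toy with three entries, whereas a real system would draw on a curated clinical vocabulary.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical synonym table mapping free-text terms to one canonical name.
CANONICAL: Dict[str, str] = {
    "myocardial infarction": "myocardial infarction",
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
}

def normalize(term: str) -> Optional[str]:
    """Lower-case the term, collapse whitespace, and look it up.
    Returns None when the term is not in the table."""
    key = " ".join(term.lower().split())
    return CANONICAL.get(key)

def tally(entries: List[str]) -> Tuple[Counter, List[str]]:
    """Count entries per canonical condition and collect
    the entries that could not be matched."""
    counts: Counter = Counter()
    unmatched: List[str] = []
    for entry in entries:
        canonical = normalize(entry)
        if canonical is None:
            unmatched.append(entry)
        else:
            counts[canonical] += 1
    return counts, unmatched

entries = ["Heart attack", "MI", "myocardial  infarction", "cardiac event"]
counts, unmatched = tally(entries)
print(dict(counts))
print(unmatched)
```

The unmatched list is the veracity problem in miniature: flexible entry meant the record was written down, but it stays unprocessed until someone extends the table.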

Recap

  • "Big" data is not just about data being large.
  • "Big" data is a blessing and a curse.
  • It's best to work with data that matches needs!