Pre-processing

Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University
Introduction to data science

Some themes

  • Acquired data varies greatly in quality.
  • Cleaning/pre-processing data takes a lot of time.
  • Pre-processing tasks exist throughout pipelines.
  • Always save random number seeds for reproducibility!

    Curation, in general

  • Curation is a broad and important part of data science,
  • and generally refers to data management.
  • Curation tasks appear throughout the data science lifecycle.

    • A few examples:
      • At collection, where/how should data be stored?
      • Before analyzing, is data clean or complete?
      • When exploring, should data be re-organized?
      • While analyzing, should some data be withheld?
      • Upon deployment, how should user interaction records be stored?
      • How should results be stored for visualization?

    Required reading: The hard part

    Data intake

  • It's good to have an idea of size before downloading (see the sketch after this list).
  • Don't neglect considering Big Data constraints (the Vs).
  • If data is streaming, benchmark and plan ahead.
  • In addition, vet an early sample
  • and make plans to clean data on intake.
  • It can be considerably easier to "munge" on the front end.
  • Required reading: What is data munging/wrangling?
  • Required reading: Being a "data janitor"
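
    A minimal sketch of sizing up and sampling a source before a full download,
    using the requests library; the URL here is a hypothetical placeholder.

    import requests

    DATA_URL = "https://example.org/some_dataset.csv"  # hypothetical source

    # A HEAD request returns only headers, so the reported size can be
    # checked before committing to the full download.
    head = requests.head(DATA_URL, allow_redirects=True)
    size_bytes = int(head.headers.get("Content-Length", 0))
    print(f"Reported size: {size_bytes / 1e6:.1f} MB")

    # Vet an early sample on intake: stream just the first few lines.
    with requests.get(DATA_URL, stream=True) as response:
        for i, line in enumerate(response.iter_lines()):
            print(line)  # eyeball delimiters, headers, and formatting
            if i >= 4:
                break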

    Understanding what you've got

  • As we saw, data acquisition is its own art and process.
  • Once you have data, it's important to look closely at it.
  • Depending on the source, data can be clean or messy.
  • E.g., a csv of numeric values might be "clean",
  • while inconsistent, scraped html might be "dirty".
  • Anything in between can happen, e.g., NAs in a csv,
  • or a formatting error that throws off the number of columns (see the sketch after this list).
  • Really, though, beauty lies in the eye of the beholder...
  • ...and clean vs. messy can have more to do with expectations.
  • Handling these considerations is often called pre-processing.
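
    A minimal sketch of a first look at a freshly acquired csv with pandas;
    the file name is a hypothetical placeholder, and on_bad_lines assumes
    pandas 1.3 or later.

    import pandas as pd

    # Rows with the wrong number of fields (e.g., from a stray delimiter)
    # are reported rather than silently mangled.
    df = pd.read_csv("acquired_data.csv", on_bad_lines="warn")

    print(df.shape)         # rows x columns
    print(df.dtypes)        # did numeric columns parse as numbers?
    print(df.isna().sum())  # NAs lurking in an otherwise "clean" csv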

    Quality

  • On the other hand, data quality can be very objective.
  • Hand-coded data is prone to human error.
  • E.g., a one digit error in a product number at register.
  • Humans can also algorithmically induce errors via scripting,
  • e.g., every returned value is incorrect because an index is off by 1.
  • Sometimes an algorithm returns terrible collection results,
  • as can happen in optical character recognition (text scanning).
  • Avoiding/rectifying issues like these is paramount in curation.

    Preparation

  • Pre-processing isn't just about correction or cleaning data.
  • Oftentimes, data needs to be modified from a raw form.
  • This is often the case with text-based machine learning,
  • where a transcript might need to be organized as discourse.
  • Beyond this, many models use the "bag-of-words" framework,
  • which represents texts by counts of words.
  • Text preparation goes hand in hand with regular expressions,
  • which establish word-defining patterns
  • and are central to document "cleaning" tasks,
  • like removing headers and markup from html (see the sketch after this list).
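
    A minimal bag-of-words sketch: one (assumed) regular expression defines
    words, another strips html markup, and texts are represented by counts
    of words.

    import re
    from collections import Counter

    WORD = re.compile(r"[a-z']+")  # one possible word-defining pattern

    def bag_of_words(text):
        """Lowercase, apply the word pattern, and count occurrences."""
        return Counter(WORD.findall(text.lower()))

    doc = "<p>The cat sat on the mat.</p>"
    cleaned = re.sub(r"<[^>]+>", " ", doc)  # "cleaning": drop html markup
    print(bag_of_words(cleaned))            # Counter({'the': 2, 'cat': 1, ...})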

    Reduction

  • This relates to some aspects of data cleansing
  • insofar as unnecessary data, like markup in text, is removed.
  • However, logistics might make collecting all data impossible,
  • as is the case with the Kepler satellite photography,
  • but an analysis may also not benefit from including all data,
  • and reduction might include smoothing or down-sampling (see the sketch after this list).
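
    A minimal sketch of two such reductions, down-sampling (keeping every
    k-th observation) and smoothing (a moving average), on stand-in data.

    import numpy as np

    signal = np.random.default_rng(42).normal(size=1000)  # stand-in data

    downsampled = signal[::10]  # keep every 10th observation

    window = 5                  # moving-average smoothing
    kernel = np.ones(window) / window
    smoothed = np.convolve(signal, kernel, mode="valid")

    print(signal.shape, downsampled.shape, smoothed.shape)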

    Sampling and randomness

  • So, reduction might include decreasing resolution,
  • or feature selection, e.g., choosing content words for analysis,
  • but if there is just plain more data than is necessary,
  • or if data must be split for training and validation,
  • random sampling becomes an important pre-processing task.
  • Science should always be reproducible, randomness included,
  • which is where pseudorandom numbers come into play.
  • So, when sampling data, always save random number "seeds,"
  • which identify a particular randomization for reproducibility (see the sketch after this list).
  • Required reading: Random numbers
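
    A minimal sketch of a reproducible train/validation split with numpy:
    recording the seed identifies the particular randomization, so the
    split can be re-created exactly.

    import numpy as np

    SEED = 20240131  # save/record this seed!
    rng = np.random.default_rng(SEED)

    data = np.arange(100)  # stand-in records
    shuffled = rng.permutation(data)

    cut = int(0.8 * len(shuffled))
    train, validation = shuffled[:cut], shuffled[cut:]

    # Re-seeding reproduces the identical randomization.
    assert np.array_equal(np.random.default_rng(SEED).permutation(data), shuffled)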

    Feature selection

  • In machine learning, "features" are predictive variables.
  • Selection can be a choice of columns from a table,
  • or considerably more complex with unstructured data,
  • involving extraction of domain-relevant terms from texts,
  • or, with images, determining which features distinguish cats vs. dogs.
  • Even if "tail shape" is identified as a salient feature,
  • it becomes necessary to extract sections that represent tails.
  • This, and feature selection in general, is not an easy task!
  • Some research programs are devoted to feature selection,
  • and attention should be given to it when building models (see the sketch after the readings below).
  • Required readings

    An introduction to feature selection
    Camouflage from face detection
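
    A minimal sketch of feature selection on structured data: each column
    is scored against the target and the k best are kept, here with
    scikit-learn; the toy columns and values are hypothetical.

    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_classif

    df = pd.DataFrame({
        "tail_length": [3.1, 2.9, 8.0, 7.5],
        "ear_shape":   [0.1, 0.2, 0.9, 1.0],
        "noise":       [0.2, 0.9, 0.4, 0.7],
        "label":       ["cat", "cat", "dog", "dog"],
    })

    X, y = df.drop(columns="label"), df["label"]
    selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
    print(list(X.columns[selector.get_support()]))  # two highest-scoring features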

    Integration

  • Sometimes multiple data sources are necessary for an analysis.
  • Integration can be straightforward with structured data,
  • where record values are joined according to keys,
  • but oftentimes this may not be the case.
  • E.g., how can one link images by location names, like Paris
  • to geospatial shapefiles for an analysis?
  • Python has a package that determines lat/lons (geocoder),
  • so after characterizing an integration need,
  • it's always best to see if someone else has already solved the task (see the sketch after this list)!
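
    A minimal sketch of integration by key with pandas: image records
    carrying place names are joined to a small coordinate table, which in
    practice might come from a geocoding package or a shapefile's attribute
    table (both assumptions here).

    import pandas as pd

    images = pd.DataFrame({
        "image_id": [1, 2, 3],
        "place":    ["Paris", "London", "Paris"],
    })
    coordinates = pd.DataFrame({
        "place": ["Paris", "London"],
        "lat":   [48.8566, 51.5074],
        "lon":   [2.3522, -0.1278],
    })

    # Join record values according to the shared key ("place").
    linked = images.merge(coordinates, on="place", how="left")
    print(linked)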

    Discretization

  • This is about grouping data meaningfully,
  • like into the bins of a histogram (see the sketch after this list),
  • but grouping most often involves some choice.
  • So, care should be taken in how grouping is executed,
  • whether by tuning the bin sizes/numbers for a histogram
  • or carefully choosing algorithms for network clustering.
  • Commonly, clustering algorithms depend on initial states,
  • whereupon randomization once again plays a role with seeds
  • and "bootstrapping" multiple runs for regularity.
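
    A minimal sketch of discretization by binning: the same stand-in data
    is grouped with different bin counts, a choice worth tuning and
    recording.

    import numpy as np

    rng = np.random.default_rng(7)  # seeded for reproducibility
    values = rng.normal(loc=0.0, scale=1.0, size=500)

    for bins in (5, 20, 100):       # same data, different groupings
        counts, edges = np.histogram(values, bins=bins)
        print(bins, counts[:5], "...")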

    Recap

  • Acquired data varies greatly in quality.
  • Cleaning/pre-processing data takes a lot of time.
  • Pre-processing tasks exist throughout pipelines.
  • Always save random number seeds for reproducibility!

    • Next time: Storage
      • structured data types
      • determining how to store data
      • data retrieval systems