Life cycle

Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University
Introduction to data science

Some themes

  • There is no set methodology for data science success,
  • but there are still best practices that can be followed.
  • Regardless of results, double-check work,
  • and be certain of anything published, and of who it impacts.

    The data science life cycle

  • Now that we have some grounding in data science,
  • we'll focus a bit on how it is practiced.
  • Like other disciplines, there is a process,
  • but depending on the domain, we'll see it's variable.

    Practicing data science

    • Some rules of thumb to overcome common challenges:
      • Don't just assume; let the data speak.
      • Avoid ad-hoc explanations of data patterns.
      • Focus on communication for broad audiences.
      • Plan for the unexpected amidst noisy input data.
      • Beware the transition from prototype to product.
      • Take time to understand statistical procedures.

    Facebook is going to die?

    • One example fail:
      • Princeton researchers discovered Google Trends.
      • They saw patterns for the search terms "MySpace" and "Facebook."
      • They applied an infection model to the search-term data
        (a minimal sketch of such a fit follows this list).
      • Conclusion: MySpace failed, and both fit the model,
      • so Princeton predicted Facebook would collapse by 2018.
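
    To make the cautionary tale concrete, here is a minimal sketch, with synthetic weekly data standing in for real Google Trends numbers, of fitting a basic SIR "infection" curve and extrapolating it. This illustrates the general technique only; it is not the Princeton authors' actual model or code.

      import numpy as np
      from scipy.integrate import odeint
      from scipy.optimize import curve_fit

      def sir_active(t, beta, gamma, i0):
          """Fraction of 'infected' (active) users over time under a basic SIR model."""
          def deriv(y, t):
              s, i = y
              return [-beta * s * i, beta * s * i - gamma * i]
          return odeint(deriv, [1.0 - i0, i0], t)[:, 1]

      # Synthetic stand-in for weekly search interest, rescaled to [0, 1].
      t = np.arange(0, 300, dtype=float)
      rng = np.random.default_rng(0)
      interest = sir_active(t, 0.08, 0.02, 0.01) + rng.normal(0, 0.01, t.size)

      # Fit the curve to the observed window...
      params, _ = curve_fit(sir_active, t, interest,
                            p0=[0.05, 0.01, 0.01], bounds=(0, [1.0, 1.0, 0.5]))
      print("fitted (adoption, abandonment, initial):", np.round(params, 3))

      # ...and extrapolate well past it: exactly the step behind the "collapse
      # by 2018" headline, and exactly where this kind of analysis goes wrong.
      forecast = sir_active(np.arange(0, 600, dtype=float), *params)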

    No, Princeton will disappear!

  • There are many more profitability indicators than search terms.
  • The Princeton research was spurious and splashy.
    • Facebook eventually struck back!
      • Applying the great theory of correlation => causation,
      • Facebook concluded Princeton would disappear!

    There should still be a scientific method

    Naturally, this is close to CRISP-DM

    Differing scientific methodologies

  • Data science projects have differing needs,
  • and these are, of course, only guides.
  • Ultimately, goals, customers, and data specifics set directions,
  • and the context of the work shapes time spent in different areas,
  • e.g., a commercial data product vs. an academic development.

    Similar tasks in a "work flow"

    Required readings

    > Differences for Building data products
    > CRISP-DM
    > The Data science project lifecycle
    > A data science work flow

    Next: we'll go into some lifecycle specifics

    Acquisition

  • Traditionally, questions come first, with no data to start,
  • but data often comes first, and scientists are forced to ask:

    "What questions can be approached with the data on hand?"

  • Regardless of the context, acquiring data is necessary.
  • If a project is company-led, data may be internal
    (making acquisition easy).
  • Even if internal data is present, external data can be necessary.
  • It's always important to get imaginative about what data can be,
  • especially when data is unstructured! (A minimal acquisition sketch follows.)
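
    A minimal acquisition sketch, assuming one structured and one semi-structured external source; the URLs and field names below are placeholders rather than real endpoints, so treat this as the shape of the step, not a working download.

      import pandas as pd
      import requests

      # Structured, external data: a tabular download straight into a DataFrame.
      trips = pd.read_csv("https://example.org/data/trips.csv")        # placeholder URL

      # Semi-structured, external data: a JSON API pulled with requests.
      resp = requests.get("https://example.org/api/v1/stations",       # placeholder URL
                          params={"city": "Philadelphia"}, timeout=30)
      resp.raise_for_status()
      stations = pd.DataFrame(resp.json())

      # Unstructured data (free text, images, logs) usually needs its own parsing
      # step before it can sit in a table like the two sources above.
      print(trips.shape, stations.shape)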

    Preparation

  • A quote from a co-coiner of the term "data scientist:"

    The hardest part of data science is getting good, clean data. Cleaning data is often 80% of the work.

    • Frequently, data does not come in a convenient form:
      • it can be structured or unstructured,
      • may be factored with dependent records,
      • may be in the wrong units, rife with NAs,
      • or even just spread across multiple sources.
  • Structuring, consolidation, removing NAs, and joining rows
  • are all part of the data preparation process.
  • But preparation is not just about cleaning data:
  • e.g., it also includes building a network model from tweets
    (a minimal preparation sketch follows).
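
    A minimal preparation sketch over two made-up sources (all column names and values are hypothetical): fix units, handle NAs, consolidate with a join, and, as one non-cleaning example, turn raw tweets into a small retweet network.

      import numpy as np
      import pandas as pd
      import networkx as nx

      orders = pd.DataFrame({
          "order_id": [1, 2, 3, 4],
          "weight_lb": [2.0, np.nan, 5.5, 1.1],       # wrong units for this pipeline
          "customer": ["a", "b", "b", np.nan],
      })
      customers = pd.DataFrame({"customer": ["a", "b"], "region": ["east", "west"]})

      orders["weight_kg"] = (orders["weight_lb"] * 0.4536).round(2)    # convert units
      orders = orders.drop(columns=["weight_lb"])
      orders["weight_kg"] = orders["weight_kg"].fillna(orders["weight_kg"].median())
      orders = orders.dropna(subset=["customer"])     # drop records we cannot link

      clean = orders.merge(customers, on="customer", how="left")       # consolidate
      print(clean)

      # Preparation beyond cleaning: a (hypothetical) retweet network from raw tweets.
      tweets = [{"user": "u1", "retweet_of": "u2"}, {"user": "u3", "retweet_of": "u2"}]
      G = nx.DiGraph()
      G.add_edges_from((t["user"], t["retweet_of"]) for t in tweets)
      print(G.number_of_nodes(), "users,", G.number_of_edges(), "retweet edges")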

    Modeling and hypothesis

  • These are very traditional steps.
  • However, for data science there is an important caveat.
  • Data often comes first, so it is paramount to "explore."
  • This is called "exploratory data analysis" (EDA).

    • Some rules of thumb (a minimal EDA sketch follows this list):
      • explore first with descriptive analyses and figures
      • characterize data from every angle
      • decide what questions might be answerable
      • review models that can confirm/deny hypotheses
      • choose models that are practical to implement
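
    A minimal EDA sketch over a small synthetic table (standing in for real project data): describe every column, check balance and missingness, and plot before committing to any hypothesis or model.

      import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt

      rng = np.random.default_rng(1)
      df = pd.DataFrame({
          "weight_kg": rng.lognormal(mean=1.0, sigma=0.4, size=200),
          "region": rng.choice(["east", "west"], size=200),
      })

      print(df.describe(include="all"))     # descriptive stats for every column
      print(df["region"].value_counts())    # categorical balance
      print(df.isna().mean())               # missingness, per column

      df["weight_kg"].hist(bins=20)         # a first figure, before any modeling
      plt.xlabel("weight (kg)")
      plt.savefig("eda_weight_hist.png")    # keep figures; don't rely on memory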

    Evaluation and interpretation

  • In this step a model is applied or an analysis is run.
  • This might take several attempts before completion.
  • Interpretation should never be taken lightly.

    • Some good practices (a minimal sketch follows this list):
      • test initially on a small slice of data you know inside and out,
      • and make lots of sanity checks—is it working?
      • always save output and visualize separately
      • focus closely on model tuning and interpretation
      • does the output appear as expected?
      • if yes, then double-check that all code is correct
      • if no, then still double-check for broken code
      • but really focus hard on interpretation,
      • because unexpected results can be the most impactful!
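
    A minimal sanity-check sketch for this step, assuming a scikit-learn-style fit/predict model; the DummyClassifier and the toy table below are stand-ins for the real model and data.

      import pandas as pd
      from sklearn.dummy import DummyClassifier
      from sklearn.model_selection import train_test_split

      # Toy, fully understood data: small enough to check every row by hand.
      df = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})
      print(df.head(10))

      X_train, X_test, y_train, y_test = train_test_split(
          df[["x"]], df["y"], test_size=0.3, random_state=0)

      model = DummyClassifier(strategy="most_frequent")   # stand-in for the real model
      model.fit(X_train, y_train)
      preds = model.predict(X_test)

      # Sanity checks: right shape, legal values, and a baseline comparison.
      assert len(preds) == len(X_test)
      assert set(preds) <= {0, 1}
      print("accuracy:", (preds == y_test.to_numpy()).mean())

      # Always save output, then inspect and visualize it separately from the run.
      pd.DataFrame({"y_true": y_test.to_numpy(), "y_pred": preds}).to_csv(
          "eval_output.csv", index=False)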

    Deployment

  • Deployment is especially relevant in product development.
  • This stage lets an intended audience explore a product.
  • A quote from the same co-coiner:

    Unfortunately, the best way to test data products is in production.

    • Some good practices (a minimal sketch follows this list):
      • start with a pilot program
      • set low expectations for users
      • start with limited functionality
      • clean and simple pays off at the start
      • documentation can save a lot of headaches
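
    One hedged illustration of "pilot program, limited functionality": a hypothetical feature-flag gate. The user names and feature names are made up, and a real deployment would likely use a proper flagging or rollout service instead of a dict.

      # Hypothetical pilot gate: a small, known user group and a short feature list.
      PILOT_USERS = {"analyst_a", "analyst_b"}

      FEATURES = {
          "core_dashboard": True,       # clean and simple, on from day one
          "advanced_forecast": False,   # held back until the basics prove stable
      }

      def is_enabled(feature: str, user: str) -> bool:
          """Only pilot users see anything, and only the features switched on."""
          return user in PILOT_USERS and FEATURES.get(feature, False)

      print(is_enabled("core_dashboard", "analyst_a"))     # True
      print(is_enabled("advanced_forecast", "analyst_a"))  # False
      print(is_enabled("core_dashboard", "someone_else"))  # False: not in the pilot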

    Operations

  • Operations are an essential counterpart to deployment.
  • Here, a product is managed through regular use.

    • Some good practices (a minimal sketch follows this list):
      • make sure users can report problems
      • make sure someone is regularly checking reports
      • focus on minimizing reported product friction
      • if a product breaks, go back and adjust the pipeline
      • update documentation to reflect product changes
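
    A minimal operations sketch, assuming a hypothetical pipeline function; the log file name and record fields are placeholders. The point is simply that failures become reports that someone can check regularly.

      import logging

      logging.basicConfig(
          filename="product_issues.log",    # a place someone actually checks regularly
          level=logging.INFO,
          format="%(asctime)s %(levelname)s %(message)s",
      )

      def run_pipeline(record: dict) -> dict:
          """Placeholder pipeline step; real logic would live here."""
          try:
              return {"id": record["id"], "score": record["value"] * 2}
          except Exception:
              # A user-facing failure becomes a report operations can act on.
              logging.exception("pipeline failed for record %r", record)
              raise

      print(run_pipeline({"id": 1, "value": 3}))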

    Optimization

  • Optimization is likely more sporadic (once again, mostly relevant for products).
  • It may occur as a result of user-base growth,
  • or perhaps a change in technology, knowledge, or competition.

    • Some good practices:
      • keep track of the market and competition
      • stay current on methods and technology
      • monitor growing user bases and their impact on the system
      • stay current with design
      • always look for efficiencies
      • don't leave around unused features

    Recap

  • There is no set methodology for data science success,
  • but there are still best practices that can be followed.
  • Regardless of results, double-check work,
  • and be certain of anything published, and of who it impacts.

    • Up next, all about data:
      • Overview of types of data.
      • What makes different types different?
      • How organization defines data.