Life cycle
Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University
Some themes
There is no set methodology for data science success,
but there are still best practices that can be followed.
Regardless of results, double-check work,
be certain of anything published, and be aware of who it impacts.
The data science life cycle
Now that we have some grounding in data science,
we'll focus a bit on how it is practiced.
Like other disciplines, data science follows a process,
but as we'll see, it varies by domain.
Practicing data science
- Some rules of thumb to overcome common challenges:
- Don't just assume, let the data speak.
- Avoid ad-hoc explanations of data patterns.
- Focus on communication for broad audiences.
- Plan for the unexpected amidst noisy input data.
- Beware the transition from prototype to product.
- Take time to understand statistical procedures.
Facebook is going to die?
- One example fail:
- Princeton researchers examined Google Trends data.
- They saw patterns for the terms: "MySpace" and "Facebook."
- They applied an infection model to the search-term data (sketched below).
- Conclusion: MySpace failed, and both fit the model,
- so Princeton predicted Facebook would collapse by 2018.
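As a rough, hedged illustration of the idea (not the Princeton team's actual code or data): fit a simple rise-and-fall "infection" curve to invented weekly search-interest values and extrapolate it forward. Every number below is made up for illustration; the technical point is that the extrapolation step is where the prediction becomes fragile.

```python
# A minimal sketch, assuming numpy/scipy; data and parameters are invented.
import numpy as np
from scipy.optimize import curve_fit

def infection_curve(t, amplitude, rise, decay):
    """Difference of exponentials: grows, peaks, then declines toward zero."""
    return amplitude * (np.exp(-t / decay) - np.exp(-t / rise))

# Hypothetical weekly search interest (0-100 scale), peaking and then fading.
weeks = np.arange(20)
interest = np.array([1, 39, 63, 77, 83, 85, 83, 80, 75, 70,
                     65, 59, 54, 49, 44, 40, 36, 33, 29, 26], dtype=float)

params, _ = curve_fit(infection_curve, weeks, interest, p0=(220.0, 3.0, 9.0))

# Extrapolating a fitted curve far beyond the observed window is exactly
# where the trouble starts: a good in-sample fit says little about the future.
future = np.arange(40)
projection = infection_curve(future, *params)
print("projected interest at week 39:", round(float(projection[-1]), 1))
```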
No, Princeton will disappear!
There are many more profitability indicators than search terms.
The Princeton research was spurious and splashy.
- Facebook eventually struck back!
- Applying the great theory of correlation => causation,
- Facebook concluded Princeton would disappear!
There should still be a scientific method
Naturally, this is close to CRISP-DM
Differing scientific methodologies
Data science projects have differing needs,
and these are, of course, only guides.
Ultimately, goals, customers, and data specifics set directions,
and the context of the work shapes time spent in different areas,
e.g., a commercial data product vs. an academic research project.
Similar tasks in a "work flow"
Acquisition
Traditionally questions come first with no data to start,
but data often comes first, and scientists are forced to ask:
"What questions can be approached with the data on hand?"
Regardless of the context, acquiring data is necessary.
If a project is company led, data may be internal
(making acquisition easy).
Even if internal data is present, external data can be necessary.
It's always important to get imaginative about what data can be,
especially when data is unstructured!
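A minimal acquisition sketch covering both common situations (internal data already on hand vs. an external source), assuming pandas and requests; the file path and URL below are hypothetical placeholders, not real sources.

```python
# A minimal sketch; the path and URL are hypothetical placeholders.
import pandas as pd
import requests

# Internal data: often a file or database export that already exists in-house.
internal = pd.read_csv("data/internal_sales.csv")  # hypothetical path

# External data: e.g., a public JSON API that supplements the internal records.
response = requests.get("https://example.com/api/regions.json", timeout=10)  # hypothetical URL
response.raise_for_status()
external = pd.DataFrame(response.json())

print(internal.shape, external.shape)
```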
Preparation
A quote from a co-coiner of the term "data scientist":
The hardest part of data science is getting good, clean data. Cleaning data is often 80% of the work.
- Frequently, data does not come in a convenient form:
- it can be structured or unstructured,
- may be factored with dependent (non-independent) records,
- may be in the wrong units, rife with NAs,
- or even just spread across multiple sources.
Structuring, consolidating, removing NAs, and joining records
are all part of the data preparation process.
But it's not just about cleaning data,
e.g., this includes building a network model from tweets.
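A minimal pandas sketch of the cleaning described above: dropping NAs, converting units, de-duplicating, and consolidating two sources. The file names and columns are invented for illustration.

```python
# A minimal preparation sketch; file names and columns are hypothetical.
import pandas as pd

orders = pd.read_csv("data/orders.csv")        # e.g., order_id, customer_id, weight_lb
customers = pd.read_csv("data/customers.csv")  # e.g., customer_id, region

# Drop records with missing keys, convert units, and remove duplicates.
orders = orders.dropna(subset=["customer_id"])
orders["weight_kg"] = orders["weight_lb"] * 0.453592
orders = orders.drop_duplicates(subset=["order_id"])

# Consolidate the sources into one analysis-ready table.
clean = orders.merge(customers, on="customer_id", how="left")
print(clean.isna().sum())  # remaining NAs to handle explicitly
```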
Modeling and hypothesis
These are very traditional steps.
However, for data science there is an important caveat.
Data often comes first, so it is paramount to "explore."
This is called "exploratory data analysis" (EDA).
- Some rules of thumb (illustrated in the sketch after this list):
- explore first with descriptive analyses and figures
- characterize data from every angle
- decide what questions might be answerable
- review models that can confirm/deny hypotheses
- choose models that are practical to implement
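A minimal EDA sketch along the lines of the first two rules of thumb, run on a tiny invented table so it is self-contained; in practice the input would be the cleaned data from the preparation step.

```python
# A minimal EDA sketch on a tiny invented table.
import pandas as pd
import matplotlib.pyplot as plt

clean = pd.DataFrame({
    "region": ["north", "south", "south", "north", "east", "east"],
    "weight_kg": [1.2, 3.4, 2.8, 0.9, 5.1, 4.4],
})

print(clean.describe(include="all"))                 # descriptive summary of every column
print(clean["region"].value_counts())                # categorical balance
print(clean.groupby("region")["weight_kg"].mean())   # one simple cross-cut

clean["weight_kg"].hist(bins=5)                      # distribution, outliers, skew
plt.xlabel("weight_kg")
plt.savefig("eda_weight_hist.png")                   # keep figures around for later review
```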
Evaluation and interpretation
In this step a model is applied or an analysis is run.
This might take several attempts before completion.
Interpretation should never be taken lightly.
- Some good practices (see the sketch after this list):
- test initially on a little data you know inside and out,
- and make lots of sanity checks—is it working?
- always save output and visualize separately
- focus closely on model tuning and interpretation
- does the output appear as expected?
- if yes, then double-check that all code is correct
- if no, then still double-check for broken code
- but really focus hard on interpretation,
- because unexpected results can be the most impactful!
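A minimal sketch of "test on a little data first, sanity-check, and save output for separate review," using scikit-learn's bundled iris data; the model choice here is only an assumption for illustration, not a recommendation.

```python
# A minimal evaluation sketch: fit on a small, well-understood slice first,
# sanity-check the output, and save it so it can be inspected separately.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_small, y_small = X[::5], y[::5]            # a small slice you can inspect by hand

model = LogisticRegression(max_iter=1000).fit(X_small, y_small)
preds = model.predict(X_small)

# Sanity checks: right shape, valid label set, not obviously broken.
assert preds.shape == y_small.shape
assert set(preds) <= set(y_small)
print("in-sample accuracy on the small slice:", (preds == y_small).mean())

np.savetxt("predictions_small.csv", preds, fmt="%d")  # save output for separate visualization
```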
Deployment
Deployment is especially relevant in product development.
This stage lets an intended audience explore a product.
A quote from the same co-coiner:
Unfortunately, the best way to test data products is in production.
- Some good practices (a minimal example follows):
- start with a pilot program
- set low expectations for users
- start with limited functionality
- clean and simple pays off at the start
- documentation can save a lot of headaches
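A minimal pilot-deployment sketch with deliberately limited functionality, assuming Flask; the endpoint name, payload shape, and scoring rule are hypothetical placeholders, not a real product.

```python
# A minimal pilot sketch (Flask); endpoint, payload, and rule are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    # Keep the pilot simple: one input, one output, clear error messages.
    if "weight_kg" not in payload:
        return jsonify({"error": "missing field: weight_kg"}), 400
    score = 1.0 if payload["weight_kg"] > 2.0 else 0.0  # placeholder rule, not a real model
    return jsonify({"score": score})

if __name__ == "__main__":
    app.run(port=5000)  # pilot-scale only; not a production server
```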
Operations
Operations are an essential counterpart to deployment.
Here, a product is managed through regular use.
- Some good practices (sketched after this list):
- make sure users can report problems
- make sure someone is regularly checking reports
- focus on minimizing reported product friction
- if a product breaks, go back and adjust the pipeline
- update documentation to reflect product changes
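A minimal sketch of the reporting idea above: append user problem reports to a simple log and summarize them for a regular review. The log format and helper names are invented for illustration.

```python
# A minimal operations sketch; the log format and helpers are hypothetical.
import json
from collections import Counter
from pathlib import Path

REPORTS = Path("problem_reports.jsonl")

def record_report(user, message, component):
    """Append one user-reported problem as a JSON line."""
    with REPORTS.open("a") as f:
        f.write(json.dumps({"user": user, "message": message,
                            "component": component}) + "\n")

def summarize_reports():
    """Count reports per component so recurring friction stands out."""
    if not REPORTS.exists():
        return Counter()
    with REPORTS.open() as f:
        return Counter(json.loads(line)["component"] for line in f if line.strip())

record_report("alice", "export button does nothing", "export")
print(summarize_reports().most_common())
```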
Optimization
Optimization tends to be more sporadic (again, mostly relevant for products).
It may occur as a result of user-base growth,
or perhaps a change in technology, knowledge, or competition.
- Some good practices:
- keep track of the market and competition
- stay current on methods and technology
- monitor growing user bases and their impact on the system
- stay current with design
- always look for efficiencies
- don't leave around unused features
Recap
There is no set methodology for data science success,
but there are still best practices that can be followed.
Regardless of results, double-check work,
be certain of anything published, and be aware of who it impacts.
- Up next, all about data:
- Overview of types of data.
- What makes different types different?
- How organization defines data.