Most articles about how to "complete" a data science task discuss how to write an algorithm to solve a problem, for example how to classify a text document or forecast financial data. Learning how to do these things can be vital knowledge for a data scientist if it falls within their remit. However, the task is just a small part of the process of completing a data science project in the real world. Even if you can code up the perfect solution to a multi-class text classification problem, is it actually valuable to the business? What's the current solution to the problem and what benchmark do you have to surpass so that the user can trust its output? When the algorithm is up and running in production, are you getting feedback to understand whether it is continually producing usable results?
In this post, I want to set out some guidelines on developing and carrying out effective and sustainable data science projects. This guide is a list that I came up with after making several mistakes in my own data science projects as well as seeing others make their own. Some of this won't apply to all data scientists because not all of it will fall into their remit. However, I've been in a team where we didn't have the luxury of a dedicated business analyst, product manager or even a data science manager. It meant that I had to take on some of the responsibilities of these roles myself and often didn't do a great job of it. But it was a valuable learning experience and here are some of the things that I've learned.
Special mention: Regardless of whether you agree with my waffling or not, you should check out the video in this article on how to do stakeholder-driven data science by Max Shron, the head of data science at Warby Parker. It's amazing and sets a great bar for doing good data science projects in the real world.
What questions should be addressed for a project to be considered successful?
When coming up with a solution to a problem I find it useful to picture what success looks like (here we're assuming that we already know what the problem is, but don't underestimate how hard it can be to identify a problem that your team is currently set up to solve). This helps me to develop strategies to get to the end goal. In this case, I want to write down a set of questions that I can answer immediately if the project is successful. Here they are:
Why are you doing the project? I.e. what value does the project bring and how does it contribute to the wider data science team goals?
Who are the main stakeholders of the project?
What is the current solution to the problem?
Is there a simple and effective solution to the problem that can be performed quickly?
Have you made an effort to involve the right people with enough notice and information?
Have you sense-checked your solution with someone else?
Have you made an effort to ensure that the code is robust?
Have you made an effort to make sure that the project can be easily understood and handed over to someone else?
How are you validating your model in production?
How are you gathering feedback from the solution?
In my experience, if these questions can be answered adequately then the project is likely to be successful. This may not always be the case and this list may be far from exhaustive depending on the project, but it's at least a good starting point.
Kate Strachnyi has a set of 20 Questions to Ask Prior to Starting Data Analysis in a much shorter article if you decide you would rather not make your way through my mammoth brain fart (I wouldn't blame you).
The steps below help in addressing each of these questions.
5 step guideline for data science projects
Step 1: Get an initial evaluation of the potential value of the project
Why do it? It helps you prioritize projects. You should be able to adequately explain why one project should be completed before another. It also allows you to understand how the project aligns with the goals of the team and the company. In addition, it provides some guidance on which metric the model should optimize.
What does this involve? A rough quantification of the benefits, e.g. money saved, revenue increased, or time spent on manual labor reduced. The argument against this is that it's hard to do and not always quantifiable. My response is: if you or your stakeholders can't figure out the value of the project then why are you allowing yourself or your stakeholders to waste your time? The value doesn't have to be perfect, just a ballpark estimate. This step also involves determining who the main stakeholders are.
What happens if it's not done? We may spend ages doing a project that no one benefits from. One example of a project I saw that could've done with better scoping was where the data science team was tasked with identifying a list of people who were most likely to benefit from being contacted by our marketing team. A model was built but we decided to spend a couple of months improving it. Despite the new model giving better results, the team had to adjust their threshold because the business didn't care about the scores that were generated for each customer. Instead, they wanted to make sure they were contacting a fixed number of people. So it's arguable that the time spent improving the model was pointless, and we would've known this if we had scoped the project better with the stakeholders. (An argument could be made that the fixed number of people contacted is actually a better segment due to a better model, but this wasn't measured so we don't know whether this is the case.)
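The distinction in the anecdote above, between selecting everyone over a score threshold and selecting a fixed number of people, can be made concrete with a small sketch. The customer ids, scores, and function names below are all hypothetical and only illustrate the idea: when the business wants a fixed number of contacts, only the ranking of customers matters, not the score values themselves.

```python
from typing import Dict, List


def select_by_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return every customer whose score is at or above the threshold."""
    return sorted(c for c, s in scores.items() if s >= threshold)


def select_top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Return the n highest-scoring customers (ties broken by customer id)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [customer for customer, _ in ranked[:n]]


# Hypothetical scores from an old and an "improved" model for the same customers.
old_scores = {"a": 0.55, "b": 0.40, "c": 0.62, "d": 0.30}
new_scores = {"a": 0.80, "b": 0.55, "c": 0.90, "d": 0.20}

# A fixed threshold picks a different number of people under each model...
by_threshold_old = select_by_threshold(old_scores, 0.5)
by_threshold_new = select_by_threshold(new_scores, 0.5)

# ...but a fixed contact budget of 2 picks the same people in this toy example.
top_two_old = select_top_n(old_scores, 2)
top_two_new = select_top_n(new_scores, 2)
```

In this made-up case the new model changes who clears the threshold but not who lands in the top two, which is the kind of thing worth checking with stakeholders before spending months on model improvements.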
What is the outcome of this step? A rough quantitative estimate of the value of the project accompanied by a brief paragraph giving more context (an executive summary of the project). It's important to note that, depending on the company and the project, just the perceived value of having a data model can be good enough for the business to deem the project a success, in which case a quantitative estimate isn't necessary. But that's more about company politics and can only happen for so long before you need a hard number to show your team's worth.
Useful resources: This article titled "A Simple way to Model ROI of any new Feature" helps a lot. It gives a simple formula: Expected ROI = (reach of users * new or incremental usage * value to business) - development cost. Other useful reads are "Prioritizing data science work" and "Product and Prioritisation".
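As a quick illustration, the formula quoted above can be written as a one-line function. The parameter names and the example numbers are my own assumptions (the article only gives the formula), and all inputs are ballpark estimates by design.

```python
def expected_roi(
    reach: float,
    incremental_usage: float,
    value_per_use: float,
    development_cost: float,
) -> float:
    """Expected ROI = (reach * incremental usage * value to business) - development cost."""
    return reach * incremental_usage * value_per_use - development_cost


# Hypothetical ballpark inputs: 10,000 users reached, 0.25 extra uses per user,
# each use worth 4.0 (in whatever currency you use), and a build cost of 6,000.
estimate = expected_roi(
    reach=10_000,
    incremental_usage=0.25,
    value_per_use=4.0,
    development_cost=6_000,
)
print(f"Estimated ROI: {estimate:,.0f}")
```

Running the same function for two candidate projects gives a rough but explicit basis for deciding which one to do first.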
Step 2: Determine current approach/Create baseline model
Why do it? The current approach gives us a benchmark to target. All useful models should beat the current approach if there is one. If there is no current solution to the problem then you should develop a baseline model. The baseline model is essentially the solution to the problem without machine learning. It's likely that a complex solution will only provide incremental value, so you'll need to evaluate whether it's actually worth building anything more complex.
What does this involve? Speak to stakeholders to determine what they currently do and how successful it is. It's likely they don't measure their success rate, so it's something that you'll have to estimate or calculate yourself. Building a baseline model should not involve any complex or involved methods. It should be fairly quick and rudimentary, probably using simple counting methods.
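To give a feel for how rudimentary a counting-based baseline can be, here is a minimal sketch for a classification problem: always predict the most common label seen in the training data. The labels and function names are invented for illustration, and this is only one possible baseline, not a recommendation for any specific project.

```python
from collections import Counter
from typing import List, Sequence


def majority_class(train_labels: Sequence[str]) -> str:
    """Return the most frequent label in the training data."""
    label, _ = Counter(train_labels).most_common(1)[0]
    return label


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for a tiny text classification task.
train_labels = ["ham", "ham", "spam", "ham"]
test_labels = ["ham", "spam", "ham", "ham"]

baseline_label = majority_class(train_labels)
baseline_predictions: List[str] = [baseline_label] * len(test_labels)
baseline_accuracy = accuracy(test_labels, baseline_predictions)
```

Any more complex model then has a concrete number to beat, and the size of the gap helps you judge whether the extra complexity is worth it.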
What is the outcome of this phase? A baseline evaluation number of the performance required to be successful/useful for stakeholders. An assessment of whether a complex model is worth building.
What happens if not done: You could waste time building a complex model that, at best, probably wasn't worth the time spent getting the additional accuracy, or at worst, doesn't even best the current approach. This was something that was missed when we built our recommendation engine. We didn't check that the algorithm was better than a sensible baseline (recommending the most popular content). It could've been that the recommendation algorithm didn't provide enough value to warrant doing it when we did.
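For the recommendation example, a "most popular content" baseline and a simple way to score it might look like the sketch below. The data layout, the hit-rate metric, and all names are assumptions made for illustration; they are not how our engine was built or evaluated.

```python
from collections import Counter
from typing import Dict, List, Set


def most_popular_items(history: Dict[str, List[str]], k: int) -> List[str]:
    """Return the k most frequently consumed items across all users."""
    counts = Counter(item for items in history.values() for item in items)
    # Sort by count (descending), then item id, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:k]]


def hit_rate(recommended: Dict[str, List[str]], held_out: Dict[str, Set[str]]) -> float:
    """Fraction of users with at least one recommended item in their held-out set."""
    users = [user for user, items in held_out.items() if items]
    if not users:
        return 0.0
    hits = sum(1 for user in users if set(recommended.get(user, [])) & held_out[user])
    return hits / len(users)


# Hypothetical viewing history and held-out (future) views per user.
history = {"u1": ["x", "y"], "u2": ["x"], "u3": ["y", "z"]}
held_out = {"u1": {"z"}, "u2": {"y"}, "u3": {"x"}}

top_items = most_popular_items(history, k=2)
baseline_recs = {user: top_items for user in history}
baseline_hit_rate = hit_rate(baseline_recs, held_out)
```

Scoring the real recommendation algorithm with the same `hit_rate` function on the same held-out data would show directly whether it adds anything over the baseline.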
Resources to help: The articles titled "Create a Common-Sense Baseline First" and "Always start with a stupid model, no exceptions." are good reads that emphasize this point.
Step 3: Have a "Team" discussion
Why do it? At this point you've come to the conclusion that this project is worth doing (step 1) and success is feasible (step 2), so it's time to speak to the people involved in making the project successful. Engineers and other data scientists are obvious candidates. Afterwards you should be clearer about what code you should write, what data you need, what you should test, what performance measure to use and what model approaches you should try. It's easy to think you know what you need on your own, but having discussions with others can often help highlight things you've missed or things that could be improved. Don't underestimate the importance of having people with different viewpoints contribute to the discussion.
What does it involve? Speak to at least one other data scientist and show them the results you've obtained so far. Perhaps they have ideas about how you can improve on your idea. It's vital you do this before you start on the model because you'll be less likely to change your model once it's written. Also the data scientist you speak to might be the one doing your code review and so it'll help them with context. Speak to the engineer that will be involved in productionising your work. They'll likely need to know what to expect and may have suggestions that will make productionising code much easier.
What is the outcome of this phase? Nothing concrete! Just something to ensure quality is as good as it can be the first time around. Ensure that the relevant people are aware and on board with the project.
What happens if not done: Best case: you've managed to think about and avoid all pitfalls on your own. However, it's more likely that you haven't thought about everything and there'll be important things that you've missed. Typical problems here include unmanageable transfer and handling of storage files when the model is moved into production, or model output that misses the mark and isn't in the most useful form. This was the case with one of the models I produced. I wrote code that made multiple API calls, many of which were unnecessary. It was fine on a small dataset which I ran locally, but the servers struggled with the load in production. It wasn't fixed until I spoke to an engineer who helped me diagnose the problem.
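One common source of unnecessary API calls is requesting the same thing repeatedly. I'm not claiming that was the exact cause in the story above, but as an illustrative sketch, caching results with `functools.lru_cache` from the standard library removes duplicate requests. The `fetch_profile` function below is a hypothetical stand-in for a real remote call.

```python
from functools import lru_cache
from typing import List

# Records every simulated remote request so we can count them.
api_calls: List[str] = []


def fetch_profile(customer_id: str) -> str:
    """Hypothetical stand-in for a slow remote API call."""
    api_calls.append(customer_id)
    return f"profile-for-{customer_id}"


@lru_cache(maxsize=None)
def fetch_profile_cached(customer_id: str) -> str:
    """Same call, but repeated ids are served from an in-memory cache."""
    return fetch_profile(customer_id)


# Five rows but only two distinct customers.
rows = ["a", "b", "a", "a", "b"]
profiles = [fetch_profile_cached(customer_id) for customer_id in rows]
number_of_requests = len(api_calls)
```

On a small local dataset the difference is invisible, which is exactly why this kind of issue tends to surface only in production. An engineer who knows the production load can often spot it in a five-minute conversation.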
Step 4: Model Development
Why do it? This is the model that we ultimately use to solve the problem.
What does it involve? The difference from most tutorials lies in what's involved: it's not only about creating a model. There are countless articles about how to write machine learning algorithms to solve specific problems so I won't explain this here. Instead, it's important to emphasize some steps that should be carried out to produce high-quality production code. In the development process, you should be doing regular code reviews. Remember that you are unlikely to be the only one who sees the code, and you are not the only one invested in the success of the project, so there should be good code documentation. This is vital for the longevity of the project. There will almost certainly be bugs and unexpected inputs in production, so you can mitigate these issues by testing your code to improve its robustness. This includes unit testing, integration testing, system testing and user acceptance testing (UAT). The specifics of how to make your code production-ready may vary from one team to the next, but other things that will help are: working in an isolated environment (virtual environments or Docker containers), using logging to write log files, and using configuration files so the configuration is separate from the main code.
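As a rough sketch of two of the habits mentioned above (logging, and keeping configuration separate from the main code), the example below reads settings from a JSON string and logs what it is doing using Python's standard `logging` and `json` modules. The setting names, the `score_to_label` rule and the logger name are made up for illustration. In a real project the JSON would live in its own file, and `logging.basicConfig` could be given a `filename` to write log files.

```python
import json
import logging
from typing import Dict, List

logger = logging.getLogger("model_pipeline")  # hypothetical logger name

# Defaults live in one place and can be overridden by the config.
DEFAULT_CONFIG = {"threshold": 0.5, "log_level": "INFO"}


def load_config(config_text: str) -> Dict:
    """Merge JSON settings over the defaults."""
    config = dict(DEFAULT_CONFIG)
    config.update(json.loads(config_text))
    return config


def score_to_label(score: float, threshold: float) -> str:
    """Toy decision rule standing in for real model output handling."""
    return "contact" if score >= threshold else "skip"


def run(config_text: str, scores: List[float]) -> List[str]:
    config = load_config(config_text)
    level = getattr(logging, str(config["log_level"]).upper(), logging.INFO)
    logging.basicConfig(level=level, format="%(asctime)s %(levelname)s %(message)s")
    logger.info("Scoring %d records with threshold %.2f", len(scores), config["threshold"])
    return [score_to_label(score, config["threshold"]) for score in scores]


labels = run('{"threshold": 0.7}', [0.6, 0.8])
```

Changing the threshold now means editing configuration rather than code, and the log line leaves a record of which settings each run actually used.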
What is the outcome of this phase? A shared (GitHub) repository with the required files and a working model that solves the problem defined in the project.
What happens if not done: The model has to be completed otherwise the problem will not have been solved. If your code isn't tested there'll be mistakes with the logic that may not be noticed until production. If the code isn't reviewed by someone else or it's not documented, it'll be difficult for other people to take over when you inevitably leave the company or are on annual leave. Some of these issues would crop up consistently on projects that I'd done previously that weren't robust. I was still fixing bugs on a project that I was involved with 9 months after it was "completed" because the code was not robust. This eats into time that you could be using to do other valuable things and it causes lots of frustration for everyone involved. Make sure you spend the extra time required to make the code robust during the development period because it will save you time in the long run.
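To show how little is needed to start testing, here is a minimal sketch of a small preprocessing function with a few unit tests. The function `min_max_scale` is an invented example. The tests are plain functions with `assert` statements, which is the shape pytest discovers and runs when they live in a file named like `test_*.py`.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Scale values to the range [0, 1], handling empty and constant inputs."""
    if not values:
        return []
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Avoid dividing by zero when every value is identical.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


def test_scales_to_unit_interval():
    assert min_max_scale([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_empty_input_returns_empty_list():
    assert min_max_scale([]) == []


def test_constant_input_does_not_divide_by_zero():
    assert min_max_scale([3, 3, 3]) == [0.0, 0.0, 0.0]
```

The empty and constant cases are exactly the kind of unexpected input that tends to show up in production first, so they are worth writing down as tests early.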
Resources: There are loads but these are some of the ones that I've read that I really like:
At the very top of the list here is an article called "How to write a production-level code in Data Science?" It covers pretty much everything that I can think of. If you're a data scientist building production-level code then you should read it.
Code reviews: Code reviewing data science work, and an article about how to "do code reviewing by yourself", i.e. writing code in such a way that it has effectively been reviewed as you go (this is not a substitute for doing actual code reviews, it's only here to help you think about how to write good code).
Code documentation: A good guide on how to document code properly. I also really like NumPy-style docstrings.
Code testing: A guide on how to write unit tests for machine learning code. This is a good guide for writing unit tests using the pytest library for Python. I used this to help me write my first set of tests for one of my projects. The same company also have an article on mocking data for tests. And here's another guide on pytest and mocking.
Step 5: Model monitoring and feedback
Why do it? This is to ensure that our product is working as intended in production. The outputs of the solution should be stable and reliable. We should be the first to know if something is wrong. Is model performance lower than expected? Are the data formatted differently to the training data? Are the data incorrect? Knowing this saves us a lot of time manually checking outputs and going through the code to ensure things are working as expected. This is especially the case when the stakeholders begin questioning our data. We're in the business of providing value to the company so we should also be measuring the impact that our solutions have. Is it working? Does it need tweaking? How much money are we generating? How frequently is the solution being used? These are the numbers that we can report to the executives to show the value that data science is contributing to the business.
What does it involve? This involves a period of time (perhaps a couple of weeks) after the model has been put into production to manually and proactively check that everything is working. It also involves automating the monitoring process, perhaps by creating a dashboard, automated email alerts and/or an anomaly detection system. The stakeholders may need to do some monitoring as well, so the monitoring solution may need to be tweaked to be user-friendly for non-technical colleagues. For feedback purposes, this can involve discussing with the stakeholder how you'll receive feedback. Will it be qualitative or quantitative? You could write something into the product that logs usage so that you don't have to explicitly ask the stakeholder about their usage habits.
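An automated check on incoming data does not have to be sophisticated to be useful. The sketch below flags missing columns and columns with too many nulls in a batch of records. The column names, the 10% tolerance and the function name are assumptions for illustration, and in practice the returned messages would feed whatever alerting you use (email, dashboard, chat).

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def check_batch(
    rows: List[Row],
    expected_columns: List[str],
    max_null_fraction: float = 0.1,
) -> List[str]:
    """Return human-readable alerts for a batch of input rows (empty list = OK)."""
    if not rows:
        return ["batch is empty"]
    alerts: List[str] = []
    for column in expected_columns:
        missing = sum(1 for row in rows if column not in row)
        if missing:
            alerts.append(f"column '{column}' missing from {missing} row(s)")
            continue
        nulls = sum(1 for row in rows if row[column] is None)
        null_fraction = nulls / len(rows)
        if null_fraction > max_null_fraction:
            alerts.append(f"column '{column}' is {null_fraction:.0%} null")
    return alerts


# Hypothetical batch: 'age' is half null and 'region' has disappeared entirely.
batch = [{"age": 30.0, "spend": 10.0}, {"age": None, "spend": 12.0}]
alerts = check_batch(batch, expected_columns=["age", "spend", "region"])
```

Because the alerts are plain sentences, they can be shown as-is to non-technical stakeholders who are helping with monitoring.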
What is the outcome of this phase? Methods and products to ensure that the model is working properly and providing qualitative or quantitative feedback on the usage and value of the solution. It may also include some form of notification/alert if something goes wrong.
What happens if not done: If our products aren't monitored properly, we risk business stakeholders losing trust in the things we produce when they break. It could also cost the business money and our team a lot of time trying to fix them. One example of this that I've faced is when the data in the analytics tool we built started giving wildly incorrect figures. It turned out that the provider of the raw data had messed up. But importantly, it was our stakeholders who picked up on the problem before our data team did. Robust testing (as described in the model development step above) and automated alerting should've picked this up but we didn't have that in place. When the data finally came back, the stakeholders thought that the data were still incorrect. Our data team spent 2 weeks checking the data only to conclude that there was nothing wrong with it! That's 2 weeks in which we could have been providing value to the business. Automated monitoring could've reduced 2 weeks to 2 minutes! Additionally, if we're not getting feedback then we have no idea how useful our projects are or whether they're still being used. One example was a meeting with the marketing team in which we had to ask the question "are you using the model?". We should never have to ask that question because 1) we should have scoped the project properly during steps 1 and 2 so that we know the value and are certain that our model improves upon the current solution/baseline model, and 2) we should've done the monitoring as part of the project, not in an unrelated meeting. This shows that we were not measuring the impact our models were having.
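As a sketch of the kind of automated alert that could have caught "wildly incorrect figures" before the stakeholders did, the function below compares a new daily figure against recent history using a simple z-score. The threshold of 3 standard deviations and the example numbers are assumptions, and a z-score is only one basic choice of anomaly check, not what our team eventually used.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], new_value: float, z_threshold: float = 3.0) -> bool:
    """Flag new_value if it is more than z_threshold standard deviations from the mean."""
    if len(history) < 2:
        # Not enough history to estimate spread, so don't raise an alert.
        return False
    average = mean(history)
    spread = stdev(history)
    if spread == 0:
        return new_value != average
    return abs(new_value - average) / spread > z_threshold


# Hypothetical daily totals from the analytics tool.
recent_totals = [100.0, 102.0, 98.0, 101.0, 99.0]

normal_day = is_anomalous(recent_totals, 100.5)
broken_feed = is_anomalous(recent_totals, 250.0)
```

Wired into a scheduled job that emails the data team when the check fires, something this simple turns a stakeholder complaint into an alert that the team sees first.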