Data science on the job is different from what you see in tutorials. I wish someone had told me the things below when I was an aspiring data scientist. But some lessons you only learn on the front line of data science, by working on real projects.
Regardless, this will be a useful read for aspiring and junior data scientists. My aim with this article is to help you to see a more comprehensive picture of what data science is — beyond learning Python, SQL, statistics and a few fancy analytics and machine learning methods.
So here it is:
Data science best practices, common mistakes and basic mindset questions — that I’ve collected in the last 9 years!
Note: this article is also available in video format.
#1 The ultimate goal of a data science project is the impact it creates
It doesn’t matter which machine learning model you use. And it doesn’t matter how efficient your code is — or how sophisticated your mathematical formula becomes:
Your data project will fail if it doesn’t create impact.
We can talk about business impact, social impact, even cultural impact… but in one way or another, your data science project has to foster change if you want to make it useful.
So before starting any of your data projects, ask these:
- What will be different when this project is done?
- What impact will it create?
If the answer is “nothing” or “I don’t know” then probably it’s a data project no one needs.
At the end of the day your data science project is worth as much as the impact it creates.
#2 The output of a data science project is usually a better decision
Impact is important but it’s usually a long-term effect of your project.
The short-term effect is the better decision it enables.
And that’s why, in data science, setting a hypothesis is so important.
If someone asks you to create an analysis around a specific question like:
“What’s the new vs. returning user % in my app?”
Always ask them to come up with a proper hypothesis first!
Ask follow-up questions like these:
- “What will you do if it turns out it’s below 20%?”
- “And what will you do if it’s above 20%?”
If they have answers, great!
If they don’t, point out that your analysis will make sense only if they know what actions they’ll take once they get their answers.
This was just the simplest example, of course…
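To make the idea concrete, here is a minimal Python sketch of an analysis whose outcome is tied to a decision before it runs. The 20% threshold comes from the questions above; the sample data, the function names and the two actions are hypothetical placeholders, not a recommendation.

```python
from collections import Counter

# Hypothetical sample: one (user_id, is_new) record per active user.
active_users = [
    ("u1", True), ("u2", False), ("u3", False), ("u4", False), ("u5", True),
    ("u6", False), ("u7", False), ("u8", False), ("u9", False), ("u10", False),
]

# The 20% threshold is taken from the example questions above.
# The two actions are made-up stand-ins for what a stakeholder might answer.
THRESHOLD = 0.20
ACTION_IF_BELOW = "invest in acquisition campaigns"
ACTION_IF_AT_OR_ABOVE = "focus on retention features"


def new_user_share(users):
    """Return the fraction of active users who are new."""
    counts = Counter(is_new for _, is_new in users)
    return counts[True] / len(users)


def decide(share, threshold=THRESHOLD):
    """Map the metric to the action agreed on before the analysis started."""
    return ACTION_IF_BELOW if share < threshold else ACTION_IF_AT_OR_ABOVE


share = new_user_share(active_users)
print(f"New users: {share:.0%} -> {decide(share)}")
```

If nobody can fill in the two action strings, that is a good sign the analysis is not worth running yet.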
#3 Coding is a tool
To build a house, you have to know how to use a hammer. And to do data science, you have to know how to code.
But when your house is built, no one will care about the hammer.
They’ll care about the house.
Similarly in data science, your code doesn’t have to be a work of art. What you build with it should be.
This is probably the most common misconception I see among aspiring and junior data scientists: they focus on the process and lose sight of the results. In real-life projects both are important, so try to find the right balance and remember: coding is just a tool.
#4 Skills are more important than “hard knowledge”
When it comes to coding, for instance: It’s more important to understand how to “talk to a computer” than knowing every tiny detail of Python’s syntax. Using data languages like Python (or SQL or R) is an essential part of the day to day job — but they can (and will) change over time.
So, for instance, to become the best Python programmer in the long term, you have to understand the logic of Python rather than memorize all the tiny syntax nuances.
That’s why I say that in data science skills are more important than “hard knowledge.”
Python was just an example, of course. This principle applies to statistics, business and all other aspects of data projects as well.
#5 The simpler the better
Junior and aspiring data scientists tend to overcomplicate things by using fancy statistical models and algorithms.
But in many projects, simpler models do the job as well as complex ones, or even better.
I can’t give you a formal proof, but simpler solutions are easier to explain, debug and maintain. After doing data science for years, it just feels more natural to go with them.
Don’t believe it? Well, I didn’t believe my mentor, either.
I guess it takes time to understand it — but: in data science simpler is usually better.
#6 Every data science project should be started with data discovery
Data is a jungle. It’s wild and raw — and you go there to find hidden treasures.
And it’s always easier if you create a map first. Data discovery helps you do that.
First, simply browse your raw data — scroll through rows and columns in your data tables. Try to understand the data structure — what’s where and why.
- Do basic segmentations!
- Create distribution charts!
- Plot the general trends that your data shows!
This data discovery part can take hours or days, and it will seem pointless at first.
But it’s an investment that pays off later in the project.
The more deeply you understand your dataset, the more clearly you’ll see the potential hidden relationships in it.
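The three discovery steps above (segmentation, distribution, trend) can be sketched with nothing but the Python standard library. The rows, column meanings and bin width below are invented for illustration; in a real project you would load your own table, most likely with a dataframe library.

```python
from collections import Counter, defaultdict
from statistics import mean, median

# Hypothetical raw rows: (date, country, order_value). Replace with your own data.
rows = [
    ("2024-01-01", "US", 20.0), ("2024-01-01", "DE", 35.0),
    ("2024-01-02", "US", 22.0), ("2024-01-02", "US", 250.0),
    ("2024-01-03", "DE", 30.0), ("2024-01-03", "HU", 18.0),
    ("2024-01-04", "US", 25.0), ("2024-01-04", "HU", 21.0),
]

# 1) Basic segmentation: row count and average order value per country.
by_country = defaultdict(list)
for _, country, value in rows:
    by_country[country].append(value)
segments = {c: (len(v), round(mean(v), 2)) for c, v in by_country.items()}

# 2) Distribution: bucket order values into bins of width 50.
BIN_WIDTH = 50
histogram = Counter(int(value // BIN_WIDTH) * BIN_WIDTH for _, _, value in rows)

# 3) Trend: total order value per day, in date order.
daily_totals = defaultdict(float)
for date, _, value in rows:
    daily_totals[date] += value
trend = sorted(daily_totals.items())

values = [value for _, _, value in rows]
print("segments:", segments)
print("histogram:", dict(sorted(histogram.items())))
print("trend:", trend)
print("mean vs. median:", round(mean(values), 2), median(values))
```

Even on this toy table, the histogram and the gap between mean and median point at a single unusually large order, which is exactly the kind of thing you want to know before any modeling starts.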
#7 Choosing the right project is more important than choosing the right model
In poker, they say that if you want to win, choosing the right table to sit at may be more important than perfectly playing your hand.
In some aspects, data science is similar.
Most data science projects fail way before the first line of code is written. And this goes back to my very first point: impact.
You should always work on the data project that has the greatest potential impact.
Note the word potential.
Of course, you won’t always know which project is the best to start with, but you can almost always tell the difference between potentially good and potentially bad data science project ideas.
Just say no to the bad data science projects!
Which leads us to the next point:
#8 Saying NO is key
As I said:
Say no to bad data project ideas!
In my experience, bad data science project ideas usually don’t come from data scientists themselves.
Quite often they come from an external source…
- Your (non-data-scientist) colleague is curious about something…
- Or you’ve seen a great (“inspiring”) case study at a conference…
- Or you want to try out that new machine learning model you’ve just learned about in your latest online course…
Now, whatever external influence tries to push you to do data science projects that are probably useless in the long-term, remember that you don’t do projects to satisfy your professional curiosity. (That’s what your data science hobby projects are for!)
So say no to potentially bad ideas!
Also remember that most non-data-scientist colleagues (even your boss) don’t know which data science projects are important and which aren’t. They will ask what they are personally interested in — but if you know that answering that question is a waste of your time, it’s your responsibility to turn them down.
I know… Saying no, especially to your manager, seems hard. Well, believe me, with a little bit of practice you’ll master this skill in no time.
Point is: dare to say no and don’t waste your time and energy by working on bad data science projects!
#9 You’ll need a single source of truth
If you use more than one tool for data collection and/or analytics, you’ll see discrepancies between what they show. That’s only natural. In all the years of my data career, I have never seen two data tools show the exact same results. There can be multiple reasons for that:
- how exactly the tools collect the data points
- how they define different metrics
- how they attribute different conversions (e.g. when something leads back to multiple sources)
Seeing different numbers in different tools for the very same metrics is very disturbing for a data scientist. (Well, it’s disturbing for everyone!)
And that might paralyze you.
So my recommendation:
Pick ONE data source you TRUST and go with that.
This problem becomes much smaller when you (or your company) build your own custom data collection solution, your own custom analytics dashboards and your own custom data science projects from scratch. Using third-party tools and templates always makes things a little bit black-boxish… You want to avoid that.
But whatever you choose: choose one data source you trust and keep it as your single source of truth!
Note: oh and don’t forget to maintain that tool, so it won’t show broken data!
#10 “All models are wrong, but some are useful” (G. Box)
That’s an often cited and very true aphorism… especially in data science! (It’s usually attributed to the statistician George Box.)
But it’s a good reminder that as a data scientist you’ll never create a 100% accurate prediction or machine learning model.
So don’t expect that… In fact, expect the opposite!
In data science, there’s always a chance that you and your numbers will be wrong.
That’s part of the game.
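Even if your A/B test reaches a 99% confidence level, which we tend to interpret as “bulletproof”, there’s still a chance that you are wrong: when there is no real difference between the variants, about 1 out of 100 such tests will report one anyway. And a 99% accurate model still gets 1 out of 100 predictions wrong, right?
Did you ever consider that if you settle for a 95% confidence level, roughly 1 out of 20 tests with no real effect behind them will still look like a win, so you’ll be wrong while thinking you are right?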
Be aware of false positives and false negatives and keep in mind that data science is not the magic pill — it’s just another great tool that helps you get closer to the truth.
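To see the 1-in-20 effect at a 95% confidence level in action, here is a small simulation sketch. It runs many A/A tests, where both variants have the same true conversion rate, and counts how often a pooled two-proportion z-test still calls the difference significant. The sample sizes, the 10% conversion rate and the choice of test are my assumptions for illustration; your own A/B testing setup may use a different method.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no conversions (or only conversions) in both groups
    z = (conv_a / n_a - conv_b / n_b) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_tests=1000, n_users=500, true_rate=0.10,
                        alpha=0.05, seed=42):
    """Run A/A tests (identical variants) and count 'significant' results."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        conv_a = sum(rng.random() < true_rate for _ in range(n_users))
        conv_b = sum(rng.random() < true_rate for _ in range(n_users))
        if two_proportion_p_value(conv_a, n_users, conv_b, n_users) < alpha:
            false_positives += 1
    return false_positives / n_tests


rate = false_positive_rate()
print(f"'Significant' results with no real difference: {rate:.1%}")
```

The printed share should land somewhere near 5%. Every one of those "wins" is a false positive, because by construction there was nothing to find.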
#11 Data can be wrong.
So yes, the hard truth is: data can be (and will be) wrong sometimes.
In my career I’ve seen multiple reasons. Here are a few common ones:
- Human errors: by far the most common issue when working with data. I’ll just mention one common human error: biases. If you’ve heard about bias types like selection bias, survivorship bias or cause-effect bias… these all apply to data science, as well. (And if you haven’t, here’s my tutorial article about statistical bias types.)
- Statistical errors. Statistics is a game of probabilities… And yeah, sometimes your data project ends up on the wrong side of statistics. Again, if you are a very thorough data scientist, you can minimize false positives and false negatives but you can’t completely eliminate them.
- Technical errors. Yes, automations, data collection and data processing scripts fail from time to time. If you are lucky, you’ll catch the bugs in time. If you aren’t, well, maybe your data science project will be based on incomplete data… which is very dangerous.
Keep in mind that data can be wrong! Most of the errors can be avoided by being cautious, careful and smart.
The rest should be taken into account as a calculated risk of working with data.
#12 Stupid people make stupid decisions.
Despite all your efforts (education, workshops, 1-on-1s, etc.) some people just won’t understand your data projects. That’s fine as long as they don’t want to work with your data.
The problem starts when non-data-scientists don’t understand data, but still want to use it.
Oh boy, I won’t be too specific on this one because I don’t want to hurt anyone’s feelings.
So let me put it this way:
It has happened to me too often that someone looked at my charts and read into them whatever they wanted. That leads back to the previous point about human errors, more specifically to confirmation bias. (Confirmation bias is when someone favors or interprets information in a way that supports their existing beliefs.) That’s super dangerous.
Long story short: Make sure actual decisions won’t be made by these people! And quite frankly, try to avoid these people (in the workplace I mean) as they’ll drag down your motivation about data science.
But okay, enough negativity. Let’s go to the next point!
#13 The Pareto principle applies to data science, too.
If you haven’t heard of the Pareto principle (a.k.a. the 80/20 principle), it’s a rule of thumb that states that 80% of your results come from only 20% of your efforts.
In data science, the Pareto principle quite often shows up in an even more extreme way:
~95% of the things you’ll find in your analyses will be useless or evident…
But that tiny remaining ~5% (or less) might be real game changers.
So don’t worry if your very first data project doesn’t change the world immediately.
Keep digging and you’ll find gold sooner or later!
#14 Never draw conclusions from one analysis!
Statistics is about finding a delicate balance between too much and too little information.
The human brain can’t process millions and billions of data points, so a data scientist’s job is to compress those into a few numbers and/or charts. Thus statistics is an abstraction of reality. Because of that it will never show you the full picture.
But the more analyses you create on a given subject, the more aspects you’ll see and the closer you’ll get to the truth.
So always run multiple analyses and research things from multiple angles before you settle on a conclusion.
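Here is a tiny sketch of why a single summary number is not enough. The two samples are made up for illustration (think of session lengths in minutes). They have exactly the same mean, and only a second and third look reveals how different they are.

```python
from statistics import mean, median, pstdev

# Two made-up samples of session lengths in minutes, for illustration only.
steady_users = [9, 10, 10, 10, 11]
mixed_users = [1, 1, 1, 1, 46]

for name, sample in [("steady", steady_users), ("mixed", mixed_users)]:
    print(
        f"{name}: mean={mean(sample)}, median={median(sample)}, "
        f"std={pstdev(sample):.1f}, max={max(sample)}"
    )
```

An analysis that stops at the mean would call these two groups identical. The median, the spread and the maximum each add an angle that the mean alone hides.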
#15 Don’t zoom in too much.
One of the most common symptoms of someone new to being data-driven, especially among professionals with a business background, is zooming in too much on small time frames.
“Oh, conversion rates went down yesterday.”
“Wow, our production rate is up 50% in the last hour.”
Observations like these are exciting — but useless.
Worse than useless: distracting.
You won’t see real patterns on smaller scales (a few hours, a day, a few days, etc.). You’ll only see them on bigger ones (weeks, months, years, etc.).
Note: of course, the scale depends on the project. If you are a data scientist in a quantum physics laboratory, maybe your normal scale is nanoseconds, so you’ll see real patterns on the second and minute scale already.
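A quick simulation sketch shows how misleading small time frames can be. It assumes, purely for illustration, a constant true conversion rate of 5% and 400 visitors per day, so every day-to-day change in the output is noise by construction.

```python
import random
from statistics import mean

# Assumptions for illustration: the true conversion rate never changes
# and traffic is flat, so any movement below is pure random noise.
TRUE_RATE = 0.05
VISITORS_PER_DAY = 400
rng = random.Random(7)

daily_rates = []
for _ in range(28):  # four weeks of data
    conversions = sum(rng.random() < TRUE_RATE for _ in range(VISITORS_PER_DAY))
    daily_rates.append(conversions / VISITORS_PER_DAY)

# Aggregate the same data into 7-day averages.
weekly_rates = [mean(daily_rates[i:i + 7]) for i in range(0, 28, 7)]


def spread(rates):
    """Gap between the best and the worst period, relative to the average."""
    return (max(rates) - min(rates)) / mean(rates)


print(f"daily spread:  {spread(daily_rates):.0%}")
print(f"weekly spread: {spread(weekly_rates):.0%}")
```

The daily numbers swing far more than the weekly ones even though nothing real changed, which is why "conversion rates went down yesterday" is rarely worth a meeting.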
#16 Data science is a long-term game.
You can’t change the world in one day.
When working in data science, real impact will be seen only after months or even years. Along the way, there’ll be a lot of small and big wins but also failures and dead-ends. What matters is the cumulative result of these.
So never forget: data science is a long-term game.
#17 A data scientist is a pioneer.
As I said before:
Data is a jungle. It’s wild and raw — and you go there to find hidden treasures.
And the data scientist is the pioneer who leads the expedition.
And as a pioneer, you’ll have to question the status quo and you’ll have to move people out of their comfort zones.
Sometimes, you’ll also have to move out of your comfort zone.
Look, not everyone in an organization likes changes. Not everyone will like your data projects, either…
Yes, data science can inspire groundbreaking ideas. But for some people new things are scary.
But for data scientists: if it’s scary, it’s good!
#18 Never stop learning new things.
There’s no such thing as a 100% data-informed business. There are always new things to discover, understand and improve.
And it’s true on a personal level, too: you can always learn new things, such as new languages, algorithms, tools, approaches, methods, etc.
So whether we are talking about a data team at a company or about you as an individual data professional:
You should never stop learning new things!
You should never stop running new analyses!
You should never stop improving your skills, yourself… and of course your data scripts, either. 😉
Conclusion – Data Science on the job…
I hope that these 18 points helped you see a more comprehensive picture of data science, and more specifically of what data science looks like on the job.
Of course these are just the best practices, common mistakes and basic mindset questions — based on what I’ve experienced on the job… So if you are a practicing data scientist, feel free to share your point of view in the comments section! And if you are an aspiring data scientist, don’t hesitate to add your comments or questions, either!
- If you want to learn more about how to become a data scientist, take my 50-minute video course: How to Become a Data Scientist. (It’s free!)
- Also check out my 6-week online course: The Junior Data Scientist’s First Month video course.