In this article I’d like to answer a very simple question — that I get a lot. It’s this:
What is data science?
The question is simple, sure… But the answer is rather complex.
There are no generally accepted definitions to answer what data science is. So, in this article, I’ll show you many aspects and layers… So read it — and by the end, you’ll have a much, much clearer mental picture of this field! Promise!
What is data science? An everyday example.
Let’s start with a broad definition.
If I had to say extremely simply what data science is, I’d say this:
You have a large amount of data and you’re trying to extract something smart and useful from it.
That’s abstract, I know… Maybe a bit of an oversimplification, too.
So here’s an everyday example to help you understand it.
Note: It’ll be an everyday example intentionally, but read it carefully and you’ll see the business parallels, too!
Okay, let’s see it!
I’m sure you have seen smart watches — or maybe you use one, too. These smart gadgets can measure your sleep quality, how much you walk, your heart rate, etc.
Let’s take sleep quality, for instance!
If you check every single day how you slept the night before, that’s one data point per day. Let’s say that you enjoyed excellent sleep last night: you slept 8 hours, you didn’t move too much, you didn’t have short awakenings, etc. That’s a data point. The day after, you slept slightly worse: only 7 hours. That’s another data point.
By collecting these data points for a whole month, you can start to draw trends from them. Maybe, on the weekends, you sleep better and longer. Maybe if you go to bed earlier, your sleep quality is better. Or you recognize that you have short awakenings around 2 am every night…
By collecting the data for a year, you can create more complex analyses. You can learn what the best time is for you to go to bed and wake up. You can identify the more stressful parts of the year (when you worked too much and slept too little). What’s more, you might be able to predict these stressful periods and prepare for them!
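To make the "weekends vs. weekdays" trend a bit more tangible, here’s a minimal Python sketch. The sleep log is made up for illustration (two weeks of hours slept, starting on a Monday), and it isn’t tied to how any particular smart watch exports its data.

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical sleep log: hours slept on 14 consecutive nights.
first_night = date(2024, 1, 1)  # this date is a Monday
hours_slept = [7.0, 6.5, 7.0, 6.5, 7.0, 8.5, 9.0,
               6.5, 7.0, 7.5, 6.5, 7.0, 9.0, 8.5]

weekday_hours, weekend_hours = [], []
for offset, hours in enumerate(hours_slept):
    night = first_night + timedelta(days=offset)
    # date.weekday() returns 5 for Saturday and 6 for Sunday.
    if night.weekday() >= 5:
        weekend_hours.append(hours)
    else:
        weekday_hours.append(hours)

weekday_avg = mean(weekday_hours)
weekend_avg = mean(weekend_hours)
print(f"weekday average: {weekday_avg:.2f} hours")
print(f"weekend average: {weekend_avg:.2f} hours")
```

With only 14 nights this is more of a hint than a trend. The same grouping applied to a month or a year of data is what gives you something you can trust.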
We are getting closer and closer to data science… Let’s go even deeper!
Not just trends — correlations, too!
If you have enough data, you can discover not only trends but correlations, too!
You can check out, for instance, how your sleep quality is affected by how much exercise you got in the given week. (Walking, running, biking, swimming, etc. These can also be measured by smart watches.) A simple correlation would be to see this: on the days you took more than 5,000 steps, your sleep quality was excellent. This is more than an analysis… This can be the basis of an action plan: let’s walk at least 5,000 steps every day!
Note: I have to mention, though, that in real life a data scientist does much more research before reaching a conclusion like this one. A correlation alone doesn’t prove that the walking caused the better sleep.
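Here’s a small Python sketch of what checking such a correlation could look like. The step counts and sleep scores are invented for the example, and the Pearson correlation coefficient is just one common way to measure how strongly two series move together (a real project would use far more days and more careful methods).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: close to +1 when both series rise together,
    close to 0 when there is no linear relationship, close to -1 when
    one rises as the other falls."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one week of daily steps and sleep scores (0-100).
daily_steps = [2000, 3500, 5200, 6100, 4000, 7500, 8000]
sleep_score = [60, 65, 80, 85, 70, 90, 88]

r = pearson(daily_steps, sleep_score)

# The simpler comparison from the text: days above vs. below 5,000 steps.
pairs = list(zip(daily_steps, sleep_score))
active_days = [score for steps, score in pairs if steps > 5000]
quiet_days = [score for steps, score in pairs if steps <= 5000]
active_avg = sum(active_days) / len(active_days)
quiet_avg = sum(quiet_days) / len(quiet_days)

print(f"correlation between steps and sleep score: {r:.2f}")
print(f"average sleep score on 5,000+ step days: {active_avg:.2f}")
print(f"average sleep score on other days: {quiet_avg:.2f}")
```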
And there are even more levels.
Just imagine the data that the producer of this smart watch can collect. In theory (let’s not consider legal aspects for now) they could see all the data of all their customers. And with that, they can produce analyses of their data that you – a single customer – could never even imagine.
Can the symptoms of depression be mitigated by walking 3,000 steps a day for most people? Are people really healthier in some countries than in others? Can weather conditions strongly impact certain social movements? And there are many, many more interesting questions. And these companies might already have enough data points to research these.
Note 1: Most of them are probably doing research around these questions already.
Note 2: Let’s not talk about the legal and ethical aspects of these things here. While these are incredibly interesting and important questions, they’re a whole article by themselves.
So as you see: the more data (and the more detailed data) you have in a data science project, the more complex, exciting and useful analyses and predictions you can create.
In essence, this is what data science is about.
Except that none of this is limited to smart watches and individuals. It can be done with many other tools that produce and collect data, in many other fields of life…
Data science is growing rapidly in these fields…
Data science, of course, conquered the world of online businesses first.
Why online businesses? Because that’s the #1 place where you can collect data about every single movement of a user. (Some companies, of course, abused this opportunity. But again: we won’t dive into the legal and ethical aspects in this article.)
Also, parts of data science have been present in different social sciences for decades!
And in the last few years, it’s started to gain a foothold in fields like:
- and more…
Okay, so far I’ve written about how data science can be useful.
Let’s talk about what skills and tools you need to do data science!
What is data science? From the aspect of the skills you need to do it…
If you have ever read my blog, I’m sure you’ve seen this Venn diagram:
I show it quite often — and it’s really important.
It says that if you want to be a data scientist, you have to be good at three things:
- coding
- statistics
- business knowledge
Why are they so important?
1. Coding
Coding is unavoidable, because that’s the tool you need to work with your data. It’s like the piano for the pianist, the brush for the painter, or the pen for the poet. If you want to make your ideas come true, you have to know and use your tools like a professional. (The most popular data science languages are: SQL, Python and bash. I write about all of them on my blog. You can also get access to free cheat sheets and video courses by joining the Data36 Inner Circle.)
2. Statistics
Statistics is the actual science of your data science projects. After all, data is about numbers. And when you work with numbers, you should be confident with mathematical and statistical concepts, right?
I know that many people are afraid of statistics (or even hate it). But statistics is neither boring nor extremely difficult. It just has bad marketing. 🙂 To become a data scientist, you have to be familiar with statistical concepts like: statistical averages, statistical biases, correlation analysis, probability theory, functions, machine learning algorithms (of course), and so on…
3. Business knowledge
The third topic is business knowledge. This is a soft factor. For example, let’s say that you are working for a bank as a Data Analyst. You can be the best coder and the best statistician, but if you don’t understand the business concept behind interest rates or how mortgages work, you will never be able to deliver a meaningful data analysis. I wrote more about the business aspect of data science in this article: Data Science for Business.
So data science is an intersection of three things: statistics, coding and business.
What about all the buzzwords? Machine Learning, Artificial Intelligence, Deep Learning, Predictive Analytics…
I wish I had a dollar for every time mainstream media (e.g. news portals, magazines, even conferences) misinterpret the different data-science-related terms.
- What is machine learning?
- What is artificial intelligence?
- What is deep learning?
- What is predictive analytics?
- What is data analysis?
Well, in everyday use they are buzzwords. 🙂
But they have real meanings — and a certain place within the field of data science, too. So it’s time to clarify what means what.
What is data analysis?
Usually, you will use your data for 3 major things in your data science projects:
- Data analysis (e.g. reporting, optimization, etc.)
- Predictive analytics (predicting the future)
- Building a data-based product (e.g. a self-teaching chatbot, a recommendation system, etc.)
The term data analysis refers to the most conventional way of using your data. You run analyses to understand what happened in the past and where you are now. Let’s say you have this chart outlining the first 16 months of your product sales:
What is predictive analytics?
Now, predictive analytics refers to projects where you use the same historical data that you see above… but this time you try to predict the future. So you’ll answer the “what will happen” question. Let’s use the same dataset (blue line) — to estimate how your product sales will do through the 20th month (red line):
That’s a prediction.
However, it’s not really accurate, is it?
Is this model better:
Or is there an even better one? Any of these maybe?
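One common way to decide which model is "better" is to hide the most recent months from the model, fit it on the rest, and see how well it predicts the months it never saw. Below is a minimal Python sketch of that idea. The sales numbers are made up (they are not the data behind the charts), and `fit_polynomial`, `predict` and `mean_abs_error` are helper names I introduced just for this example.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ...] for c0 + c1*x + c2*x**2 + ...
    """
    n = degree + 1
    # Augmented matrix [A | b] of the normal equations.
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        known = sum(rows[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (rows[i][n] - known) / rows[i][i]
    return coeffs


def predict(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))


def mean_abs_error(coeffs, xs, ys):
    return sum(abs(predict(coeffs, x) - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up monthly sales for 16 months, deliberately curved and noise-free.
months = list(range(1, 17))
sales = [100 + 5 * m + 2 * m * m for m in months]

# Fit on the first 12 months, keep the last 4 to check the forecasts.
train_x, train_y = months[:12], sales[:12]
test_x, test_y = months[12:], sales[12:]

linear = fit_polynomial(train_x, train_y, degree=1)
quadratic = fit_polynomial(train_x, train_y, degree=2)

linear_error = mean_abs_error(linear, test_x, test_y)
quadratic_error = mean_abs_error(quadratic, test_x, test_y)

print(f"linear model, held-out error:    {linear_error:.1f}")
print(f"quadratic model, held-out error: {quadratic_error:.1f}")
print(f"quadratic forecast for month 20: {predict(quadratic, 20):.0f}")
```

On this artificial dataset the curved model wins easily, because I generated the data from a curve. Real sales data is noisy, and a more flexible model can just as easily end up worse on the held-out months, which is exactly why you check.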
When you ask the “what is data science” question, most data scientists would say that – at least – this is where the science part of it starts.
But this is really just the tip of the iceberg…
What is machine learning?
When a computer fits the lines in the examples above, it tries to find a mathematical formula (red line) that describes the relationship between the real-life data points (blue line) well enough, even though those points have natural variance anyway.
Now you might ask: how the heck can a computer find that mathematical formula?
By using Machine Learning.
Machine Learning is the general name for all the methods by which your computer fine-tunes a statistical model and finds the best fit for your dataset. And the blue-line-red-line example is only one of many. There are tons of machine learning methods for all the different typical data science problems. This “model fitting” machine learning method is called regression – or more precisely: linear and polynomial regression. But there are classification problems (popular machine learning algorithms: decision tree, random forest, logistic regression, etc.), clustering tasks (popular machine learning algorithms: K-Means Clustering, DBSCAN, etc.) and many more.
I won’t go into detail here — but I will write more about these on Data36. So stay tuned!
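Just to give a feel for what "the computer fine-tunes a model" can mean, here’s a minimal Python sketch that fits a straight line with gradient descent: it starts from a bad guess and nudges the slope and intercept a little at every step to reduce the squared error. This is one of several possible fitting techniques, not necessarily what any given library does internally. The data points, the learning rate and the step count are all assumptions I picked for the illustration.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = slope * x + intercept by minimizing the mean squared error."""
    slope, intercept = 0.0, 0.0  # deliberately bad starting guess
    n = len(xs)
    for _ in range(steps):
        # How far off is the current line at each data point?
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error w.r.t. slope and intercept.
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        # Move both parameters a small step "downhill".
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Made-up data points that roughly follow a straight line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line_gradient_descent(xs, ys)
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
```

The learning rate matters: too large and the updates overshoot and blow up, too small and the fit takes forever. The values above happen to work for this tiny dataset.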
What is Deep Learning?
Actually, I’d like to talk about one particular machine learning method.
It’s called deep learning and it’s gotten very popular in the last few years… but many still don’t know what it is and what it is good for.
Deep learning is nothing but one specific machine learning method. As I mentioned, there are a lot of machine learning models and all of them are good for solving different data science problems. Deep Learning is only one of them, and it has recently been widely used for image and voice recognition projects. The way it works is quite interesting, by the way. It takes input values and turns them into output values by passing them through many layers, each of which automatically learns which patterns in its inputs matter. The design is loosely inspired by how neurons in the human brain connect to each other. (More about deep learning in another article.)
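To show what "passing values through layers" looks like, here’s a tiny Python sketch of a network with one hidden layer. It only does the forward pass (input in, output out). The weights are hypothetical numbers I typed in by hand; in a real deep learning model there are many more layers and the weights are learned from data during training.

```python
def relu(values):
    """Activation function: negative values become 0, the rest pass through."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of all
    inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical, hand-picked weights: 2 inputs -> 3 hidden units -> 1 output.
hidden_weights = [[0.5, -1.0], [1.0, 1.0], [-0.5, 0.5]]
hidden_biases = [0.0, -1.0, 0.5]
output_weights = [[1.0, 2.0, -1.0]]
output_biases = [0.5]


def forward(inputs):
    hidden = relu(dense(inputs, hidden_weights, hidden_biases))
    return dense(hidden, output_weights, output_biases)


print(forward([2.0, 1.0]))
```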
Note: The best explanation of deep learning that I’ve heard so far was by Andrej Karpathy, Director of AI at Tesla. In his presentation, he introduced how Tesla cars learn to drive. He also explained the general concept of deep learning — and he showed how they are using it. It’s part of a bigger presentation, and you can find the full video here — Andrej’s talk starts at 1:52:05 and ends at 2:24:55.
What is Artificial Intelligence?
Well, I wrote a long paragraph about incompetent wannabe data professionals, clickbait journalists and ignorant managers (who read articles from those clickbait journalists)… and of course, companies who try to market their simple data-based products with the “AI” tag (that recently sells everything)… But I just deleted it because I don’t want to offend anyone. 🙂
It’s enough if you know that AI doesn’t exist — yet. And if humanity does ever create one, it won’t happen in the next few years. Right now, there is no computer that would be capable even of imitating creativity, intuitions, ambitions, inspiration or anything else that makes us human.
Sure, there are very advanced bots – like the one that Google presented in mid-2018. (Check out the video here.) But if you think about it – it’s nothing but a combination of an advanced chatbot, an advanced voice recognition software (like the one that you have in your smartphone) and an advanced speaking engine.
Note: Plus, you have to know that most of these bots work only in very narrow situations. As soon as they fall off their script they are useless. Also, show me a bot that has its own ambition to learn Chinese or Spanish because it feels that it will be important for its career… 😉 Right? Today’s “AI” is not even close to real human intelligence.
So what is data science?
I hope that the above helped you to clarify what data science is.
As I said, there are no generally accepted definitions, but I hope that:
- the everyday example (with the smart watch)
- showing the major components of data science (coding, statistics and business)
- and explaining the main concepts (data analysis, predictive analytics, machine learning, artificial intelligence, deep learning)
…will help you to see data science in context.
There are many layers of it and I tried to show you as much as I can in this article.
Before wrapping this up, let me answer two frequently asked questions that I get from aspiring data scientists on a daily basis.
Frequently asked data science question #1:
Data Analyst or Data Scientist?
First of all – you have to know that many people misuse the word “data science.” Especially some HR people who don’t exactly know what data science is but have to create job descriptions for it. (Sorry guys, that’s the truth. By the way, if you’re reading this, you’re not the ones I’m talking about… ;-)) For this reason, you can find a wide variety of job descriptions (and online articles, too) under the umbrella of “data science” and “data scientist”.
Anyway, again, there is no clear definition of how data science and data analytics are different. But in real life projects, you might see these patterns:
- When companies are hiring a Data Analyst, they are usually looking for a person who will be working on research projects, on optimization and on reporting. This person will help the company to understand their customer base and flag possible issues and future opportunities. She will work more on descriptive analytics projects (so answering the “what happened in the past? where are we now?” questions) — and less on predictive analytics and machine learning projects.
- When a company is looking for a Data Scientist, it usually wants someone on board who’s good at predictive analytics and who has experience with machine learning and similar advanced methodologies. This knowledge can be useful for managing risk, for building recommendation systems, for optimising resources, for face recognition, for building chatbots, smart automations and many, many more things – depending on the profile of the given company.
As I see it, Data Analytics is usually mentioned as the “conservative” part of the data projects and it has a big effect on the business side – while Data Science is more progressive and can even have an effect on the product itself…
Note that – in my opinion at least – both of these roles are equally important and valuable.
And, by the way, many professionals have started to call both positions data scientists, which I can relate to, as well.
The point is: the line is blurry, nothing is set in stone — so don’t worry about it too much. Regardless of whether you apply for a Data Analyst or Data Scientist position, make sure you read the job description and if it’s unclear, simply ask the employer what kind of projects you would work on… Easy as that.
Frequently asked data science question #2:
Do you need a university degree or any kind of certification to become a data scientist?
I have already covered this topic in my How to become a data scientist free online course. This is a frequent question, so I also published the video about it on YouTube.
Here it is… but if you can’t watch it right now, just skip it and read the quick summary below.
Comparing university education vs. self learning:
- University takes a lot of time and it’s expensive, too. In exchange, you’ll have a great network (students and professors). And by the end of your studies, you will have an in-depth understanding of data science. Which is great because your fundamentals will be rock-solid – but these will also contain a lot of theory that you will never actually use.
- The self-taught path is faster (~6-12 months) and not so expensive. You’ll have to learn many things by yourself, of course. You will have very practical knowledge — and you will have hands-on experience with real-life problems. But on the other hand, you might have a less solid background because you missed a lot of the theory.
The bottom line is:
Today, fewer and fewer companies actually care about your university (or college) degree. They need people who can get the job done. If you can gather that knowledge by self-teaching and you can prove that during the hiring process (and of course, you can actually do the job during your probation period), nobody will ask for a degree. Of course, you might want to have some certifications on your CV (maybe a few online courses you took) and hands-on experience, too (e.g. hobby projects) — that validate your knowledge.
If this sounds too good to be true, here’s the thing: I don’t have a college degree, either, but I’ve still been hired for many data science positions and projects.
So what is data science?
The answer has many layers — and in this article I showed you the most important ones.
I hope you enjoyed it — and if you want to learn more, join the Data36 Inner Circle!
- If you want to learn more about how to become a data scientist, take my 50-minute video course: How to Become a Data Scientist. (It’s free!)
- Also check out my 6-week online course: The Junior Data Scientist’s First Month video course.