A question that I get often is this:

“Should an aspiring data scientist learn more about big data (e.g. Apache Hadoop or Spark) to get a job?”

I’ll keep this short and sweet. The answer is: no, generally speaking, learning about big data is not needed to get a junior data scientist position.

And I have three simple reasons for that.

Note: this article is also available in video and podcast format.

#1 Big data is the field of data engineers

Remember this chart that I’ve shown about different data science positions?

[Figure: data science roles. Blue bubbles: a data project step by step. Green rectangles: the different positions associated with these steps.]

As you can see, the different steps of a data science project require different levels of programming and engineering skills.

If we follow the generally used naming conventions, a data scientist (especially a junior data scientist) will mostly work on the tasks from the right side of this chart. That’s data cleaning, analytics, visualization and helping with decision making.

And the tasks on the left side are more the field of data engineers. These tasks are: setting up data collection, figuring out the best way to store the data… and generally speaking, building the data infrastructure. The problems that big data solutions like Apache Hadoop or Spark solve are data infrastructure problems, too.

It means that for you as an aspiring data scientist, big data won’t be a priority for now, unless you would, in fact, rather become a data engineer. (Which is a pretty cool position, too, but it’s a more technical job.)

Note: just a quick comment here. You should know that in the data field the naming of the different positions is pretty… messy. (Thanks, dear creative HR specialists!) There are data engineers, big data specialists, machine learning ninjas, AI gurus… so it’s pretty hard to set a strict rule about who we can call a data scientist and who we can’t. But I see a pattern in the people who have asked the original question. So if you follow the most common naming conventions for data positions and you ask whether you should learn about big data to get a junior data scientist position, the answer will be no.

#2 You can learn big data on the job

Let’s get back to the original question:

“Should I learn about big data to get a junior data scientist position?” 

Asking this implies that you already have some basic knowledge on the topic. And that might be more than enough to get your first position.

Data science is an ever-changing field, and nobody expects you to know everything in it. If you have solid foundations in Python, SQL, statistics and business thinking, you can learn the rest later, on the job.

I mean, it’s totally possible that you’ll start working for a company where you’ll have to use Apache Spark after all, for instance. But if so, don’t worry: you can learn that on the job, and the data engineers at the company will help you get up to speed. And I can also guarantee that if you were able to learn coding in Python and SQL, you’ll be able to learn a big data tool, too.

#3 You can take advantage of your Python, SQL and pandas knowledge in Apache Spark

Even if you have to work with a big data framework like Apache Spark on the job, it’s good to know that it has a Python API (PySpark), a module called Spark SQL, ways to work with pandas dataframes, and many other useful layers…

[Figure: Spark SQL!]

This means that, sure, to be able to use Apache Spark, for instance, you have to learn a bit more. But you will be able to take advantage of your preexisting Python, SQL and pandas knowledge in big data systems, too.
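To make this a bit more concrete, here is a minimal sketch of how the same plain SQL query could run both locally and through Spark SQL. Everything in it is an assumption for illustration: the `purchases` table, the sample rows and the function names are hypothetical. The first function uses only Python's built-in `sqlite3` module. The second one assumes that `pyspark` and a Java runtime are installed, so it is only defined here and not called.

```python
import sqlite3

# Hypothetical sample data: (user_id, country, amount)
PURCHASES = [
    (1, "HU", 20.0),
    (2, "US", 35.0),
    (3, "HU", 15.0),
    (4, "DE", 10.0),
    (5, "US", 5.0),
]

# Plain SQL: nothing in this query is specific to one engine.
QUERY = """
SELECT country, SUM(amount) AS revenue
FROM purchases
GROUP BY country
ORDER BY revenue DESC, country
"""


def revenue_with_sqlite(rows):
    """Run QUERY on an in-memory SQLite database (standard library only)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE purchases (user_id INTEGER, country TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


def revenue_with_spark(rows):
    """Run the same QUERY through Spark SQL.

    Assumption: pyspark and a Java runtime are installed on this machine.
    """
    # Imported lazily so the rest of the script runs without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("sql-demo").getOrCreate()
    try:
        df = spark.createDataFrame(rows, ["user_id", "country", "amount"])
        # Register the dataframe under the table name the query expects.
        df.createOrReplaceTempView("purchases")
        return [(r["country"], r["revenue"]) for r in spark.sql(QUERY).collect()]
    finally:
        spark.stop()


sqlite_result = revenue_with_sqlite(PURCHASES)
print(sqlite_result)
```

The point of the sketch is that the query text itself stays the same. What you learn on the job is mostly the setup around it: creating a session, loading the data and registering it as a table.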


So should you learn more about big data for a junior data scientist role?

Well, in an ideal world, of course, you’d learn everything before you apply to your first job. But in the real world that would take long, long years. You don’t have that much time. So it’s best if you focus on the fundamental skills for now: mastering SQL, Python, bash, the basics of statistics and business thinking.

And if you have more time, you can go into the different specializations like big data… But if we are being realistic here, that’ll most likely happen on the job. And that’s just fine.

Tomi Mester