“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…” – Dan Ariely

I’m sure you’ve heard the saying above before. It’s half funny, but half true as well. Nowadays everyone is talking about Big Data, and the reason is not the spread of the technology itself, but that journalists and marketing people have realized they can sell almost anything with these two words. Meanwhile, the real data professionals rarely write easy-to-follow articles on the topic, as they are more interested in the technology.

The aim of this article is to fill this gap. What is Big Data? Comprehensibly, without the marketing b*llsh*t.

Small Data

Let’s start with: what is small data? The very basics of data analysis stretch back to the census, where thousands of years ago commissioners asked people all kinds of questions. The next evolutionary step was the questionnaire, which to this day is a frequently used method in market research. What’s the issue with surveys? Partly the sampling: even the most precise, clear-cut sampling can be wrong. How could the responses of 2,000 people represent the thoughts of one million people? Of course there are sound statistical methods for this, but the chance of error is still there. The other issue is the quality of the responses. People lie, and often they don’t even know that they are lying. If I asked you what your favorite color is, today you might say red… then a week later you realize that all your T-shirts are yellow, and you become uncertain. But by then you have already given your answer, and the big businessmen have already made decisions based on it.
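To put a rough number on the sampling error, here is a minimal Python sketch. It rests on assumptions that are not part of the survey example itself: a simple random sample, honest answers, a 95% confidence level (z = 1.96) and the worst-case answer split of 50/50. The function name and its parameters are made up for illustration.

```python
import math

def margin_of_error(sample_size, population_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion measured on a simple random sample.

    Assumptions (illustrative): z=1.96 corresponds to 95% confidence,
    proportion=0.5 is the worst case for the error.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    # Finite population correction: shrinks the error slightly when the
    # sample is a noticeable share of the whole population.
    correction = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * correction

moe = margin_of_error(sample_size=2000, population_size=1_000_000)
print(f"Margin of error: about ±{moe * 100:.1f} percentage points")
# prints: Margin of error: about ±2.2 percentage points
```

So under these ideal conditions 2,000 answers can say a lot about one million people. The catch is that the formula only covers random sampling error; it does nothing about a skewed sample or about people giving wrong answers.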

These are the typical small data issues.

Almost Big Data

It’s these kinds of problems that Big Data thinking answers. With a Big Data mindset, we don’t ask people; we just observe their behavior. This way they can’t lie to us (or to themselves). Besides that, we are not observing just 2,000 people, but all of them.

Obviously the easiest way to carry this out is in IT and related fields, where each click and movement of the mouse produces new data.

It’s through this logic that well-known and widely used tools like Google Analytics, CrazyEgg and Mixpanel were born.

Although these tools store all user and visitor behavior, this is still not considered Big Data: technically we are still talking about a limited quantity of data, and the data set can’t be handled very flexibly (e.g. you can only create predetermined reports, and you can’t combine two of them)… But then what’s considered Big?

Big Data

One of the most important trends of the past years (decades, even) has been the constant and significant decrease in the price of data storage. We have reached the stage where storing information is so cheap that we save everything and delete nothing. And this is the key to Big Data! We store everything we can, going back many years, without deleting anything. Usually we don’t store it in Google Analytics-type programs, but in our own data tables (e.g. SQL) or in logs (e.g. csv, txt).
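As an illustration of the “save everything, delete nothing” idea, here is a minimal Python sketch of an append-only csv log. The column names, the event names and the helper functions are all hypothetical; a real product would pick its own, and would probably write to a proper log server rather than a local file.

```python
import csv
import os
import tempfile
from collections import Counter
from datetime import datetime, timezone

# Hypothetical column layout, chosen only for this example.
FIELDS = ["timestamp", "user_id", "event", "target"]

def log_event(path, user_id, event, target):
    """Append one observed action to the log; earlier rows are never changed or deleted."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # write the header once, for a fresh file
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user_id": user_id,
            "event": event,
            "target": target,
        })

def count_events(path):
    """Read the whole log back and count how often each event type occurred."""
    with open(path, newline="", encoding="utf-8") as f:
        return Counter(row["event"] for row in csv.DictReader(f))

# Demo in a throwaway directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "events.csv")
    log_event(log_path, "u1", "click", "signup_button")
    log_event(log_path, "u2", "click", "pricing_link")
    log_event(log_path, "u1", "pageview", "/home")
    print(count_events(log_path))
```

Because the log holds raw events rather than a fixed report, any question can be asked of it later, including ones nobody thought of when the data was collected.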

Sooner or later we create databases so huge that a single computer struggles to store them. We obviously won’t even try to open a one-terabyte data set in Excel or SPSS. But even a normal SQL query can take many hours or even days to run. Simply put, whatever tool we try (R, Python, etc.), we find that it has reached its maximum computing capacity and can’t process the data in a reasonable time.

That’s when the Big Data technologies come into play. Their main concept is that not just one, but dozens or even hundreds of computers work with our data. Often these clusters scale easily and almost endlessly: the more data we have, the more resources we can involve in the processing. This way we can analyze our data in a reasonable time again. But interconnecting many computers and making them work on one script at the same time requires new infrastructure and new technology. This is how the wide range of Big Data technologies was born, bringing forth tools like Hadoop, YARN, Spark, Pig and numerous others.
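The “many computers, one script” idea can be sketched in a few lines of Python. This is only a toy under loud assumptions: threads on a single machine stand in for the worker computers, the data is a handful of made-up rows, and all function names are invented for the example. Real frameworks such as Hadoop or Spark also take care of moving data between machines and recovering from failures, which this sketch does not attempt.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(rows, n_chunks):
    """Deal the rows out into n_chunks roughly equal parts, one per worker."""
    return [rows[i::n_chunks] for i in range(n_chunks)]

def count_chunk(chunk):
    """The 'map' step: a worker counts events in its own chunk only."""
    return Counter(event for _user, event in chunk)

def merge_counts(partials):
    """The 'reduce' step: combine the partial counts into one final answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def distributed_count(rows, n_workers=4):
    chunks = split_into_chunks(rows, n_workers)
    # Threads are a stand-in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    return merge_counts(partials)

# Made-up (user_id, event) rows for illustration.
rows = [("u1", "click"), ("u2", "click"), ("u1", "purchase"),
        ("u3", "click"), ("u2", "pageview")]
print(distributed_count(rows))
```

The point of the design is that no worker ever needs to see the whole data set. That is why adding more machines keeps the run time reasonable as the data grows.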

The Big Data evolution within the company

Let’s see how the Big Data evolution works in the case of an online startup:

1. In the beginning, the company doesn’t have a data analyst, but they don’t want to fly blind. So they set up Google Analytics, Mixpanel and CrazyEgg, and observe their data.

2. They have their first 10,000 users. Management realizes that Mixpanel and CrazyEgg are starting to become expensive, and they don’t even show detailed enough reports anyway. So they start to build their own SQL tables and create text or csv logs. A data guy analyzes these with various SQL, Python or R scripts.

3. The number of users keeps growing, and the analytics team starts to complain that the analysis scripts still haven’t finished after 10-20 minutes. Then, when run times reach several hours, they realize they need Big Data technology and begin to google what Hadoop is… :-)
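To make step 2 a bit more concrete, here is a minimal sketch of the “own SQL table plus a Python script” setup, using Python’s built-in sqlite3 module with an in-memory database. The table name, the columns and the sample rows are assumptions made up for the example, not a recommended schema.

```python
import sqlite3

# In-memory database so the example needs no files or server.
conn = sqlite3.connect(":memory:")

# Hypothetical table layout for raw user events.
conn.execute("CREATE TABLE events (user_id TEXT, event TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", "pageview", "2024-01-01"),
        ("u1", "click", "2024-01-01"),
        ("u1", "purchase", "2024-01-02"),
        ("u2", "pageview", "2024-01-02"),
        ("u3", "pageview", "2024-01-03"),
        ("u3", "click", "2024-01-03"),
    ],
)

def events_per_user(connection):
    """A custom report that a predetermined dashboard might not offer."""
    query = """
        SELECT user_id, COUNT(*) AS n_events
        FROM events
        GROUP BY user_id
        ORDER BY n_events DESC, user_id
    """
    return connection.execute(query).fetchall()

print(events_per_user(conn))
# prints: [('u1', 3), ('u3', 2), ('u2', 1)]
```

With a few thousand users a query like this returns instantly. Step 3 begins when the same query, over years of stored events, no longer finishes in a reasonable time on one machine.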

I hope this short summary helps clear things up a bit about the Big Data myth. I will be continuing this topic with a new article soon.

And if you want to be notified first about new content on the data36 blog (like articles, videos, handbooks, etc.), sign up for the Newsletter!

Tomi Mester