How data collection works

The first step for every data project is data collection, aka getting the actual raw data.

There are two ways to do this:

A) You can pick one or more “smart tools” to use. These services will collect the data for you automatically. You only need to copy-paste a code-snippet into your website and you are ready to go. (E.g. Google Analytics, Hotjar, Google Optimize, CrazyEgg, Mixpanel, etc.)

B) You can collect the data yourself. (E.g. via a JavaScript code snippet that sends data to a .csv plain text file on your server.) It’s a bit more difficult to implement since it requires some coding skills. But in the long term this solution is much more profitable than version A. Why? For several reasons that I’ve already written about in this article. But I’ll summarize them here:

  • You have your own data, instead of depending on 3rd party services.
  • You have one unified data warehouse. No need for integrations, API hacks, and so on.
  • You can trust your data 100%.
  • Your monthly data server costs are significantly lower than the fees you would pay for 3rd party tools.
  • There are no limitations on how to use your data or how to connect different data points. (E.g. you can’t use your data in Google Analytics to set up prediction models, but you can do it if you have your own database.)

Whichever way you choose, it’s worth understanding how data collection works in general.

Whether you do it yourself or use a 3rd party tool, very similar things are happening in the background!

How does data collection work?

Let’s go with the simplest example!
You are running a business data science project: you have a website and you would like to measure every click on this site (to create a website heatmap, a click map or anything).

First, implement an invisible “tracking script” on every clickable element on your site! As a consequence, from now on when one of your website visitors clicks on a specific element (let’s say a link to another subpage), the click makes 2 things happen:

  1. The visitor goes to the page she clicked (obviously)
  2. The tracking script sends a small data package to your data warehouse.

[Figure: tracking scripts send usage data to your data logs]

As simple as that.
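
To make step 2 a bit more concrete, here is a minimal Python sketch of what the receiving side could do with each data package: turn it into one line and append it to a file. The function name `log_event`, the file name `events.csv` and the timestamp format are assumptions made for illustration, not the setup of any particular tool.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("events.csv")  # hypothetical file on the data server


def log_event(event, detail, log_path=LOG_PATH, now=None):
    """Append one event as one semicolon-separated line and return the row."""
    now = now or datetime.now(timezone.utc)
    row = [now.strftime("%Y-%m-%d %H:%M:%S"), event, detail]
    # Open in append mode so earlier events are never overwritten.
    with log_path.open("a", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter=";").writerow(row)
    return row
```

Calling `log_event("click", "signup_button")` would add one line to the file. Appending (instead of rewriting) is the design choice that keeps each event cheap to record.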

You could track every action (let’s call it an “event”) on your website (or in your mobile app): page views, feature usage, etc. You can even track and collect data about mouse movements, if it’s needed. (Though usually it isn’t.)

A more general illustration to help you imagine what’s happening here:

[Figure: tracking scripts send data from the front-end to production and data servers]

How to store the collected data

When the data package hits your data warehouse, it can be stored in different formats.

For startups the best format is the plain text format as it is very flexible. You can imagine this as a simple txt, csv or tsv file with text in it. Many companies follow this model.

But it’s also worth mentioning that many bigger (and older) companies, e.g. multinational companies, like to collect their data in SQL databases (or in other structured formats). There are several other ways, though.

Let’s look at the simplest and most common solution: plain text format.

Remember that each event (e.g. a click on your website) produces one line of data via your previously implemented tracking script. This line goes into a file on your data server. We call this file a “log.” You can have more than one log, but almost all of them will have the same format, which looks like this:

[Figure: sample log (email addresses removed)]

Look messy?

Maybe at first, but go through it column by column! (This is a .csv file; in this one the field separator is the semicolon.)

  1. the date and the time: when the event happened
  2. the event itself (in this case: “click”)
  3. the specifics of the event, e.g. exactly which button was clicked

These are the basics that every data log should contain.
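
Reading a log back is the reverse operation. Here is a minimal sketch in Python, assuming a semicolon-separated file whose first three columns are the ones listed above. The sample lines and the field names `timestamp`, `event` and `detail` are invented for illustration; they are not taken from the sample log.

```python
import csv
import io
from datetime import datetime

# Invented sample lines; a real log would be read from a file instead.
SAMPLE_LOG = """2018-01-01 12:00:00;click;signup_button
2018-01-01 12:00:05;pageview;/pricing
"""


def parse_log(text):
    """Yield one dict per log line with the three basic columns."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    for row in reader:
        if not row:
            continue  # skip empty lines
        timestamp, event, detail = row[:3]
        yield {
            "timestamp": datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
            "event": event,
            "detail": detail,
        }


events = list(parse_log(SAMPLE_LOG))
print(events[0]["event"], events[0]["detail"])  # click signup_button
```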

But there are many more possible dimensions to add. Just a few examples:

  1. visitor’s unique ID
  2. visitor’s email address
  3. visitor segment (if logged in, and if you have pre-defined segments)
  4. visitor’s operating system
  5. last payment
  6. visitor’s device
  7. acquisition channel (source, medium, etc.)
  8. previous site
  9. etc…
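
In a plain text log, adding dimensions simply means adding columns to each line. A sketch under assumptions: the column names below are hypothetical picks from the list above, and an event that lacks a value (for example, an unknown acquisition source) gets an empty field.

```python
import csv
import io

# Hypothetical column order: the three basic fields, then extra dimensions.
FIELDS = ["timestamp", "event", "detail", "visitor_id", "device", "source"]


def to_log_line(record):
    """Serialize one event dict to a semicolon-separated line.

    Missing dimensions become empty fields; unknown keys raise ValueError.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter=";", restval="")
    writer.writerow(record)
    return buffer.getvalue().rstrip("\r\n")


line = to_log_line({
    "timestamp": "2018-01-01 12:00:00",
    "event": "click",
    "detail": "signup_button",
    "visitor_id": "u123",
    "device": "mobile",
})
print(line)  # 2018-01-01 12:00:00;click;signup_button;u123;mobile;
```

Keeping the column order fixed in one place means every line in the log has the same layout, even when some dimensions are missing.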

What kind of data should you collect?

As you can see, you can collect and store an infinite amount of data. Infinite vertically (the number of different events you can log) as well as horizontally (the number of dimensions you can collect about one event in one line).

This raises the obvious question: what should you collect and what shouldn’t you?

The principle here is very simple: collect everything you can. Every click, every pageview, every feature usage, everything.

It’s interesting to note that most startups that follow this collect-everything principle are actually using less than 10% of their data. 90% is not even touched by analysts! Then you would ask: so why do they collect everything?

And the answer is: because you can never know when you might need that data in the future. Let’s say you want to change a 3-year-old feature of your product, and you don’t want to mess up anything. Before the change, you will spend some time to understand the exact role of that 3-year-old feature. And for that you will need to analyze your data retrospectively. But you can do that only if you started to collect the data 3 years ago.
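
Here is what such a retrospective look could be in practice: counting how often the old feature was used per month, straight from the log. This is only a sketch; the log lines and the feature name `export_button` are made up, and the three-column, semicolon-separated layout is assumed.

```python
from collections import Counter

# Made-up log lines in the assumed "timestamp;event;detail" layout.
LOG_LINES = [
    "2015-03-02 09:15:00;click;export_button",
    "2015-03-20 17:40:12;click;signup_button",
    "2015-04-01 08:05:30;click;export_button",
    "2015-04-11 13:22:45;click;export_button",
]


def monthly_usage(lines, feature):
    """Count events per month (YYYY-MM) whose detail column equals `feature`."""
    counts = Counter()
    for line in lines:
        timestamp, _event, detail = line.split(";")[:3]
        if detail == feature:
            counts[timestamp[:7]] += 1  # first 7 characters are "YYYY-MM"
    return dict(counts)


usage = monthly_usage(LOG_LINES, "export_button")
print(usage)  # {'2015-03': 1, '2015-04': 2}
```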

That’s the ultimate reason for collecting all possible data.

What kind of data should you not collect?

There are very obvious limitations, of course. The price of storing data is not one of those. Storing data (in the cloud at least) is very cheap today.

The real limitations are:

  • engineering time: The web developers need to spend time to implement the tracking scripts. And if you have a really complex data warehouse, then you will need a full-time person to build and maintain the data infrastructure. So if your developers spend more time on collecting data than on actual production, then maybe you are collecting too much data.
  • common sense: yes, you can overload your database — if you log every mouse movement of every user every millisecond. You should not do that.
  • forgot-to-think-about-it: in most cases, the main reason why people don’t collect particular data points is simple. They forget that they should be collected. It happens, don’t worry. If you want to avoid it, I suggest setting up a workshop in which you sit together and talk through why, how and what data to collect. I write about these kinds of workshops in a “story” article.
  • legal questions: It differs country to country, so I recommend consulting with a legal professional in your country. (Update in 2018: mind GDPR if you have EU users.)
  • And one more comment here. Some countries have strict legal restrictions about data collection, some others don’t. But regardless of the laws: always consider ethics. Never collect data from your users that you wouldn’t want collected about you.
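
Coming back to the common-sense point above: if you do want mouse-movement data, one option is to throttle it, so that only one event per time window reaches the log. A rough sketch; the class name `Throttle` and the half-second window are assumptions, not a recommendation for any specific value.

```python
class Throttle:
    """Let an event through at most once per `min_interval` seconds."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # timestamp of the last event that was let through

    def allow(self, timestamp):
        if self._last is None or timestamp - self._last >= self.min_interval:
            self._last = timestamp
            return True
        return False


throttle = Throttle(min_interval=0.5)
# Hypothetical mouse-move timestamps, in seconds.
moves = [0.00, 0.01, 0.02, 0.50, 0.75, 1.10]
kept = [t for t in moves if throttle.allow(t)]
print(kept)  # [0.0, 0.5, 1.1]
```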

Conclusion

This is how data collection works. Google Analytics, Mixpanel, CrazyEgg and your DIY data warehouse are all based on these principles. Of course there are small differences, which I’ll describe later, but for now you can be sure that you understand what happens in the background, and you can be more confident acting on your data!

Cheers,
Tomi Mester

9 Comments

  1. Hey, I totally agree with what you describe in the article. But logging data is just one part of an analytics-driven company. Building reports, graphs, etc. is not easy and takes time to implement. I’m working on a startup called OpenInbound. Our primary goal is to simplify contact-based logging and help people with tools to really work with data (e.g. Marketing Automation, CRM).

    One more: Always use multiple tracking tools. Google Analytics is a must. We have good experience with Hotjar and OpenInbound.

    • hey Lukas,

      thanks for the addition.
      Yes, absolutely agree – multiple tracking is important, also Google Analytics is a must.

      Haven’t heard about OpenInbound, but will take a look at that!

      Cheers,
      Tomi

  2. Hi, Tomi!
    Interesting post (as are most of your posts!). It would be interesting to hear a bit about the role of cookies in making this data more useful as well as a brief introduction into how to implement them.
    Thanks!
    Ben

  3. I’m really learning a lot from your posts.
    Easy to read and good examples!

  4. Hey Tomi,

    Thanks for some useful insights here!

    I want to know more about the ‘Tracking Scripts’, how to code them & how to sort data at the backend.

    Since I’m not a tech guy, it would be really helpful for me.

    Thanks!

    • hey Raman,

      thanks for the comment!
      I’ll put that article idea on my list, and sooner or later it will come to the blog! : )

      Tomi

  5. Great article!

    I found it very helpful for writing an essay about data collection for class.
    Thanks!
