The first step of every data project is data collection, in other words: getting the actual raw data. There are two ways to do this:
A) You can pick one or more of the “smart tools”. These software services collect the data for you automatically. You only need to copy-paste a code snippet into your website and you are ready to go. (E.g. Google Analytics, Hotjar, CrazyEgg, Mixpanel, etc.)
B) You can collect the data yourself and store it in your own data warehouse. The advantages of this approach:
- You own your data; you don’t depend on third-party solutions.
- You have one unified data warehouse. No need for integrations, API hacks, and so on.
- You can trust your data, because you know exactly how it was collected.
- Your monthly data server costs are usually significantly lower than the fees of the smart tools.
- There are no limitations on how you use your data or how you connect different data points. (E.g. you can’t use your data in Google Analytics to set up prediction models, but you can if you have your own database.)
Whichever way you choose, it’s worth understanding how data collection works in general.
Whether you do it yourself or use a smart tool, very similar things happen in the background!
How does data collection work?
Let’s go with the simplest example:
You have a website and you would like to measure every click on it (to create a website heatmap, a click map or anything else).
First, attach an invisible “tracking script” to every clickable element on your site. From now on, when a visitor clicks on a specific element (let’s say a link to another subpage), the click triggers two things:
- The visitor goes to the page she or he clicked on (obviously).
- The tracking script sends a small data package to your data warehouse.
As simple as that.
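To make the “small data package” a bit more tangible, here is a minimal sketch of what such a package could contain. Real tracking scripts run in the browser as JavaScript; Python is used here only to show the shape of the data. The function name `build_event`, the field names and the example values are my assumptions, not a fixed standard.

```python
import json
from datetime import datetime, timezone


def build_event(event_type: str, target: str, when: datetime) -> str:
    """Return the small data package a tracking script might send, as JSON."""
    package = {
        "timestamp": when.isoformat(),  # when the event happened
        "event": event_type,            # e.g. "click"
        "target": target,               # which element was clicked (hypothetical name)
    }
    return json.dumps(package)


# Hypothetical example: a visitor clicks a link to the pricing subpage.
clicked_at = datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc)
payload = build_event("click", "link_to_pricing", clicked_at)
print(payload)
```

In a real setup this package would be sent over the network to your data server; the sketch stops at building it.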
You can track every other action (let’s call it an “event”) on your website or in your mobile app the same way: pageviews, feature usage, etc. You can even track and collect data about mouse movements, if needed. (Though usually it isn’t.)
A more general illustration to help you imagine what’s happening here:
How to store the collected data?
When the data package hits your data warehouse, it can be stored in different formats. For startups, plain text is usually the best format, as it is very flexible. You can imagine this as a simple txt, csv or tsv file with text in it.
For bigger and more structured companies, SQL or another structured format could work. There are several other ways to store data, though.
Let’s go with the simplest and most common solution here: plain text format.
Remember that for each event (e.g. a click on your website), your previously implemented tracking script generates one line of data. This line goes into a file on your data server. We call this file a “log”. You can have more than one log; however, almost all of them will have the same format. This:
Looks messy? Maybe at first, but go through it column by column! (This is a .csv file; here the field separator is the semicolon.)
- the date and the time: when the event happened
- the event itself (in this case: “click”)
- the specifics of the event, e.g. exactly which button was clicked
These are the basic data points that every data log should contain.
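As a rough sketch of how one event becomes one line in a semicolon-separated log, the snippet below uses Python’s built-in `csv` module. The helper name `append_event` and the example events are made up for illustration, and `io.StringIO` stands in for a real log file on your data server.

```python
import csv
import io


def append_event(log, timestamp: str, event: str, detail: str) -> None:
    """Append one event as one semicolon-separated line to an open log."""
    writer = csv.writer(log, delimiter=";", lineterminator="\n")
    writer.writerow([timestamp, event, detail])


# io.StringIO stands in for a real log file; the events are invented examples.
log = io.StringIO()
append_event(log, "2024-01-15 10:30:00", "click", "signup_button")
append_event(log, "2024-01-15 10:30:07", "click", "link_to_pricing")
print(log.getvalue())
```

Each call adds exactly one line with the three basic columns: date and time, the event, and the specifics of the event.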
But there are many more possible dimensions to add. Just a few examples:
- visitor’s unique ID
- visitor’s email address
- visitor segment (if logged in, and if you have pre-defined segments)
- visitor’s operating system
- last payment
- visitor’s device
- acquisition channel (source, medium, etc.)
- previous site
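Extra dimensions simply become extra columns in the same line. Here is a small sketch of reading such a log back, assuming a column order that I made up (`FIELDS`) and two invented sample lines; your own log would use whatever columns you decided to collect.

```python
import csv
import io

# Hypothetical column names; a real log would use whatever you agreed on.
FIELDS = ["timestamp", "event", "detail", "visitor_id", "device", "source"]

# Invented sample lines, for illustration only.
sample_log = io.StringIO(
    "2024-01-15 10:30:00;click;signup_button;u1001;mobile;newsletter\n"
    "2024-01-15 10:31:12;pageview;/pricing;u1002;desktop;organic\n"
)

# DictReader maps every column to its name, so each line becomes a dictionary.
reader = csv.DictReader(sample_log, fieldnames=FIELDS, delimiter=";")
events = list(reader)

for e in events:
    print(e["visitor_id"], e["event"], e["device"])
```

Naming the columns once and reading each line as a dictionary keeps the analysis code readable when you add more dimensions later.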
What kind of data can/should you collect?
As you see, you can collect and store a practically infinite amount of data. Infinite vertically (the number of different events you can log) as well as horizontally (the number of dimensions you can attach to one event in one line). This raises the obvious question: what should you collect and what shouldn’t you?
The principle here is very simple: collect everything you can. Every click, every pageview, every feature usage, everything.
Interestingly, most startups that follow this collect-everything principle actually use less than 10% of their data. The other 90% is never even touched by analysts! So you might ask: why do they collect everything, then?
The answer is: because you can never know when you will need that data in the future. Let’s say you want to change a 3-year-old feature of your product, and you don’t want to mess anything up. Before the change, you will spend some time understanding the exact role of that 3-year-old feature. For that, you will need to analyze your data retrospectively. But you can only do that if you started to collect the data 3 years ago. That’s the ultimate reason for collecting every possible piece of data.
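A retrospective analysis like this can start very small. The sketch below counts how often a feature was used in each year, based on a plain text log. The log lines, the feature name `export_pdf` and the function `usage_per_year` are all invented for illustration.

```python
from collections import Counter

# Invented log lines, for illustration only.
log_lines = [
    "2021-03-02 09:15:00;feature_usage;export_pdf",
    "2022-07-19 14:02:31;feature_usage;export_pdf",
    "2022-11-05 18:45:10;click;signup_button",
    "2023-01-23 11:20:05;feature_usage;export_pdf",
    "2023-06-30 08:00:00;feature_usage;export_pdf",
]


def usage_per_year(lines, feature):
    """Count how many times `feature` was used in each calendar year."""
    counts = Counter()
    for line in lines:
        timestamp, event, detail = line.split(";")
        if event == "feature_usage" and detail == feature:
            counts[timestamp[:4]] += 1  # first four characters are the year
    return dict(counts)


print(usage_per_year(log_lines, "export_pdf"))
```

The point is that this question can only be answered if the old lines exist in the log in the first place.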
What kind of data should you not collect?
There are some obvious limitations, of course. The price of storing data is not one of them: storing data (in the cloud at least) is very cheap today.
The real limitations are:
- developer time: Your web developers need to spend time implementing the tracking scripts. And if you have a really complex data warehouse, you will need someone full-time who builds and maintains the data infrastructure. So if your developers spend more time collecting data than developing the actual product, then maybe you are collecting too much data.
- common sense: Yes, you can overload your database if you log every mouse movement of every user in every millisecond. You should not do that.
- forgot-to-think-about-it: In most cases, the main reason why people don’t collect particular data points is simple: they forget that they should be collected. It happens, don’t worry. If you want to avoid it, I suggest setting up a workshop in the company, where you sit together and talk through why, how and what data to collect. I write about this kind of workshop in a “story” article.
- legal questions: The rules differ from country to country, so I recommend consulting a legal professional in your country.
And one more comment here. Some countries have strict legal restrictions on data collection, others don’t. But regardless of the laws: always consider the ethical norms. Never collect data about your users that you wouldn’t be happy to have collected about you. 🙂
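Going back to the “common sense” limitation for a moment: one simple way to avoid logging every single mouse movement is to keep only one event per time interval. This is just a sketch of that idea; the function name `throttle`, the 250 ms gap and the generated mouse positions are my assumptions.

```python
def throttle(events, min_gap_ms):
    """Keep an event only if at least `min_gap_ms` passed since the last kept one.

    `events` is a list of (time_in_ms, x, y) tuples, sorted by time.
    """
    kept = []
    last_kept = None
    for t_ms, x, y in events:
        if last_kept is None or t_ms - last_kept >= min_gap_ms:
            kept.append((t_ms, x, y))
            last_kept = t_ms
    return kept


# Generated mouse positions for illustration: one movement every 10 ms for a second.
moves = [(t, t, t) for t in range(0, 1000, 10)]
sampled = throttle(moves, 250)
print(len(moves), "->", len(sampled))
```

You lose some detail this way, but the log stays at a size that your database can handle.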
This is how data collection works. Google Analytics, Mixpanel, CrazyEgg and your DIY data warehouse are all based on these principles. Of course there are small differences, which I’ll describe in specific chapters, but for now you can be sure that you understand what happens in the background, and you can be more confident acting on your data!