The first step for every data project is data collection, aka getting the actual raw data.
There are two ways to do this:
A) You can pick one or more “smart tools” to use. These services will collect the data for you automatically. You only need to copy-paste a code-snippet into your website and you are ready to go. (E.g. Google Analytics, Hotjar, Google Optimize, CrazyEgg, Mixpanel, etc.)
B) You can collect the data yourself, with your own tracking scripts and your own data warehouse. This takes more work, but it has clear advantages:
- You have your own data, instead of depending on 3rd party services.
- You have one unified data warehouse. No need for integrations, API hacks, and so on.
- You can trust your data 100%.
- You spend a significantly lower monthly fee on data server costs than on 3rd party tools.
- There are no limitations on how to use your data or how to connect different data points. (E.g. you can’t use your data in Google Analytics to set up prediction models, but you can do it if you have your own database.)
Whichever way you choose, it’s worth understanding how data collection works in general.
Whether you do it yourself or use a 3rd party tool, very similar things are happening in the background!
How does data collection work?
Let’s go with the simplest example!
You are running a business data science project: you have a website and you would like to measure every click on this site (to create a website heatmap, a click map or anything).
First, implement an invisible “tracking script” on every clickable element of your site! From then on, when one of your website visitors clicks on a specific element (let’s say a link to another subpage), the click makes 2 things happen:
- The visitor goes to the page she clicked on (obviously).
- The tracking script sends a small data package to your data warehouse.
As simple as that.
You could track every action (let’s call it an “event”) on your website (or in your mobile app): page views, feature usage, etc. You can even track and collect data about mouse movements, if needed. (Though usually it isn’t.)
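To make the “small data package” more concrete, here is a minimal Python sketch of what one tracked event might contain. The field names (`timestamp`, `event`, `detail`) and the `build_event` helper are assumptions made for illustration, not the format of any particular tool. A real tracking script would usually run in the visitor's browser as JavaScript.

```python
import json
from datetime import datetime, timezone


def build_event(event, detail, when=None):
    """Return one tracked event as a plain dict (hypothetical field names)."""
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp": when.strftime("%Y-%m-%d %H:%M:%S"),
        "event": event,
        "detail": detail,
    }


# Example: a visitor clicks a (made-up) link to the pricing page.
clicked_at = datetime(2017, 1, 1, 12, 30, 5, tzinfo=timezone.utc)
package = build_event("click", "link_to_pricing_page", clicked_at)

# The package is serialized before it is sent to the data warehouse.
payload = json.dumps(package)
print(payload)
```

Whatever the exact fields are, the idea is the same: one event becomes one small, self-describing record.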
A more general illustration to help you imagine what’s happening here:
How to store the collected data
When the data package hits your data warehouse, it can be stored in different formats.
For startups the best format is the plain text format as it is very flexible. You can imagine this as a simple txt, csv or tsv file with text in it. Many companies follow this model.
But it’s also worth mentioning that many bigger (and older) companies, e.g. multinational companies, like to collect their data in SQL databases (or in other structured formats). There are several other ways, though.
Let’s look at the simplest and most common solution: plain text format.
Remember that for each event (e.g. a click on your website), your previously implemented tracking script generates one line of data. This line goes into a file on your data server. We call this file a “log.” You can have more than one log, but almost all of them will have the same format, which looks like this:
It may look confusing at first, but go through it column by column! (This is a .csv file; here the field separator is the semicolon.)
- the date and the time: when the event happened
- the event itself (in this case: “click”)
- the specifics of the event, e.g. exactly which button was clicked
These are the basic data that every data log should contain.
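A minimal Python sketch of such a log, assuming the three basic columns above in this order and a semicolon as the separator. The helper names (`append_event`, `read_events`) and the sample values are made up for illustration, and an in-memory file stands in for the log file on the data server.

```python
import csv
import io

FIELDS = ["timestamp", "event", "detail"]  # assumed column order


def append_event(log_file, timestamp, event, detail):
    """Write one event as one semicolon-separated line."""
    csv.writer(log_file, delimiter=";", lineterminator="\n").writerow(
        [timestamp, event, detail]
    )


def read_events(log_file):
    """Parse every log line back into a dict keyed by FIELDS."""
    reader = csv.reader(log_file, delimiter=";")
    return [dict(zip(FIELDS, row)) for row in reader]


# An in-memory file stands in for the log on the data server.
log = io.StringIO()
append_event(log, "2017-01-01 12:30:05", "click", "signup_button")
append_event(log, "2017-01-01 12:31:10", "click", "pricing_link")

log.seek(0)
events = read_events(log)
print(events[0]["detail"])  # signup_button
```

Because every line is plain text, the same log can be read by a spreadsheet, a command line tool or a script like this one.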
But there are many more possible dimensions to add. Just a few examples:
- visitor’s unique ID
- visitor’s email address
- visitor segment (if logged in, and if you have pre-defined segments)
- visitor’s operating system
- last payment
- visitor’s device
- acquisition channel (source, medium, etc.)
- previous site
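Extra dimensions simply become extra columns in each line. The sketch below assumes two of them, a visitor ID and a device, and uses made-up log lines. It shows how the added columns let you break clicks down by device.

```python
import csv
import io
from collections import Counter

# Made-up log lines with two extra dimensions: visitor ID and device.
RAW_LOG = """\
2017-01-01 12:30:05;click;signup_button;user_1;mobile
2017-01-01 12:31:10;click;pricing_link;user_2;desktop
2017-01-01 12:32:45;click;signup_button;user_3;mobile
"""

# Assumed column order; a real log defines its own.
FIELDS = ["timestamp", "event", "detail", "user_id", "device"]

reader = csv.DictReader(io.StringIO(RAW_LOG), fieldnames=FIELDS, delimiter=";")

# Count clicks per (button, device) pair.
clicks = Counter((row["detail"], row["device"]) for row in reader)

print(clicks[("signup_button", "mobile")])  # 2
```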
What kind of data should you collect?
As you can see, you can collect and store an infinite amount of data. Infinite vertically (the number of different events you can log) as well as horizontally (the number of dimensions you can collect about one event in one line).
This raises the obvious question: what should you collect and what shouldn’t you?
The principle here is very simple: collect everything you can. Every click, every pageview, every feature usage, everything.
It’s interesting to note that most startups that follow this collect-everything principle actually use less than 10% of their data. 90% is not even touched by analysts! Then you would ask: so why do they collect everything?
And the answer is: because you can never know when you might need that data in the future. Let’s say you want to change a 3-year-old feature of your product, and you don’t want to mess anything up. Before the change, you will spend some time understanding the exact role of that 3-year-old feature. And for that you will need to analyze your data retrospectively. But you can do that only if you started collecting the data 3 years ago.
That’s the ultimate reason for collecting all possible data.
What kind of data should you not collect?
There are very obvious limitations, of course. The price of storing data is not one of those. Storing data (in the cloud at least) is very cheap today.
The real limitations are:
- engineering time: The web developers need to spend time implementing the tracking scripts. And if you have a really complex data warehouse, then you will need a full-time person to build and maintain the data infrastructure. So if your developers spend more time on collecting data than on actual production, then maybe you are collecting too much data.
- common sense: yes, you can overload your database — if you log every mouse movement of every user every millisecond. You should not do that.
- forgot-to-think-about-it: in most cases, the main reason why people don’t collect particular data points is simple. They forget that it should be collected. It happens, don’t worry. If you want to avoid it, I suggest setting up a workshop in which you sit together and talk through why, how and what data to collect. I write about this kind of workshop in a “story” article.
- legal questions: It differs country to country, so I recommend consulting with a legal professional in your country. (Update in 2018: mind GDPR if you have EU users.)
- And one more comment here. Some countries have strict legal restrictions about data collection, some others don’t. But regardless of the laws: always consider ethics. Never collect data from your users that you wouldn’t want collected about you.
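The “common sense” limit above is easy to check with some back-of-the-envelope arithmetic. Every number in this Python sketch is an assumption chosen for illustration (line size, number of visitors, event rates), not a measurement.

```python
# All inputs are illustrative assumptions, not measurements.
BYTES_PER_LINE = 50          # rough size of one log line
ACTIVE_USERS = 1_000         # visitors on the site at the same time
SECONDS_PER_HOUR = 3_600


def bytes_per_hour(events_per_second_per_user):
    """Raw log volume generated in one hour."""
    return (
        events_per_second_per_user
        * ACTIVE_USERS
        * SECONDS_PER_HOUR
        * BYTES_PER_LINE
    )


every_millisecond = bytes_per_hour(1_000)  # one mouse position per ms
occasional_clicks = bytes_per_hour(0.1)    # one click every 10 seconds

print(f"mouse tracking: {every_millisecond / 1e9:.0f} GB per hour")
print(f"click tracking: {occasional_clicks / 1e6:.0f} MB per hour")
```

Under these assumptions, millisecond-level mouse tracking produces about ten thousand times more data than click tracking, which is why it is rarely worth it.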
This is how data collection works. Google Analytics, Mixpanel, CrazyEgg or your DIY data warehouse are all based on these principles. Of course there are small differences, which I’ll describe later, but for now you can be sure that you understand what happens in the background, and you can be more confident acting on your data!