When and why to build your own data tools?

For the record: I’m a big supporter of third-party data services (e.g. Google Analytics, Hotjar, Crazyegg, Optimizely, Mixpanel). I like them because they are easy to use and easy to set up.

But there are cases when startups, e-commerce companies and other online businesses reach a size where they outgrow these services, and more advanced data tools (e.g. SQL, Python, R, bash) are needed!

By reading this article, you will understand which kinds of companies need to build their own data tools, why, and when!

When third-party data tools are good enough

When you’re kicking off your business, you have neither the time to create proper data analyses nor the money to hire a data analyst. To be fair, in these first few months you most probably won’t need either.

However, if you are smart and careful enough, you’ll at least think about collecting data for future research. For that, setting up Google Analytics, Hotjar and the other “point-and-click” services seems to be just the perfect solution. As these tracking tools only require adding a small code snippet to your website’s header, you don’t need to spend much time or developer resources on them. Copy-paste the tracking code, finalize some settings (e.g. set up goals in Google Analytics, start polls in Hotjar, launch heatmaps in Crazyegg) and you’re done. Doable in 2-3 hours tops.

Data36.com’s Hotjar code. I’ve copy-pasted this and set up the whole tracking in 5 mins.

Then, when you start to grow, you can start analyzing this data. I won’t go into detail about why this is needed. If you are reading this blog, I’m pretty sure you know its importance:
No matter how fast you are – without data, the faster you grow, the higher the chance that you will hit the wall.

After a while you’ll be opening your smart data tools on a daily basis, you’ll upgrade them to Pro versions, and so on. But sooner or later (usually after 2-3 years) you will realize that these services are not scaling with your company anymore. You’ll have 3 major problems:

  1. You can’t connect all the dots.
  2. You can’t make predictions.
  3. You can’t fully trust your data.

And these are exactly the 3 problems that you can fix if you build your own data infrastructure.

But what does “own data infrastructure” mean?

This can be split into 2 parts:

  1. Collecting the Data
  2. Using your Data (for KPIs, analyses or predictions)

First you need to implement your own tracking scripts. These won’t send your data to Hotjar, for instance, but to your own data warehouse, usually into SQL tables or plain-text files (.csv, .tsv, etc.), or both. (Read more here: Data collection.) There are many more technical solutions, but to keep it simple for now, I won’t list the rest.
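
To make the plain-text option a bit more concrete, here is a minimal Python sketch of a tracking function that appends one event per line to a .tsv file. The function names (`log_event`, `read_events`), the file name and the column order are assumptions made up for this illustration. A real tracking script would depend on your own stack.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone


def log_event(path, user_id, event, detail=""):
    """Append one interaction event as a tab-separated line (hypothetical format)."""
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerow([timestamp, user_id, event, detail])


def read_events(path):
    """Read the log back as a list of [timestamp, user_id, event, detail] rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))


# Example usage with a throwaway file
log_path = os.path.join(tempfile.mkdtemp(), "events.tsv")
log_event(log_path, 1, "page_view", "/landing")
log_event(log_path, 1, "click", "signup_button")
events = read_events(log_path)
```

Every value comes back as text when the file is read, so the user_id `1` is stored and returned as `"1"`. Appending lines keeps the raw log simple and lets you load it into SQL tables later.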

Then you can analyze your data with SQL, Python, R or bash scripts instead of the graphical interface of Google Analytics or other tools. If you want to try these data coding languages and learn them, I wrote an article about that as well: Data Coding 101 – How to install Python, SQL, R and Bash?

“But wait! That sounds too difficult and tech-heavy! Why would I do that?” you could ask. So let me answer that by getting back to the main reasons for building and using your own data tools!

Reason #1: Having your own data. Connecting the dots.

The first big problem with third-party tools like Google Analytics is that they work as a black box. This means that you don’t own your data and you can’t use it for everything you want. This is not an issue as long as you only want to check simple reports, like how many people scrolled down to the bottom of your landing page, or how many sessions you had from google/organic in the last month.

But if you want to combine these metrics, things can become tricky. E.g.:
“What was the bounce rate and the time spent on page for each of my A/B test buckets?”

Of course, you can solve the smaller problems by using integrations, APIs or some hacks. (Note: in my own experience, this can be a real pain in the neck once you start to integrate more than 2 tools.) E.g. for the specific question above, you can connect GA to Optimizely.

But as you do more and more advanced analyses, you will reach the point where you understand:

Every third-party service is created to measure a specific part of your product. That’s their power and that’s their limitation at the same time. Even if you manage to connect them, you will never be able to see the full picture. They don’t enable you to connect the dots!

A simple SQL star schema, where you can connect all the dots based on the user_id
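
As a rough illustration of connecting the dots on user_id, here is a small Python sketch using the built-in sqlite3 module. It answers the earlier question about bounce rate and time spent per A/B test bucket. The table names (`ab_test`, `sessions`), the columns, the sample rows and the definition of a bounce as a single-page session are all assumptions made up for this example.

```python
import sqlite3

# In-memory database with two hypothetical tables that share user_id
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ab_test (user_id INTEGER PRIMARY KEY, bucket TEXT);
CREATE TABLE sessions (user_id INTEGER, pages_viewed INTEGER, seconds_on_page INTEGER);
INSERT INTO ab_test VALUES (1, 'A'), (2, 'A'), (3, 'B'), (4, 'B');
INSERT INTO sessions VALUES (1, 1, 10), (2, 3, 50), (3, 1, 20), (4, 1, 40);
""")

# Bounce = a session with exactly one page view (an assumed definition)
query = """
SELECT a.bucket,
       AVG(CASE WHEN s.pages_viewed = 1 THEN 1.0 ELSE 0.0 END) AS bounce_rate,
       AVG(s.seconds_on_page) AS avg_seconds
FROM sessions s
JOIN ab_test a ON a.user_id = s.user_id
GROUP BY a.bucket
ORDER BY a.bucket;
"""
results = conn.execute(query).fetchall()
# results == [('A', 0.5, 30.0), ('B', 1.0, 30.0)]
```

Because both tables carry the same user_id, one join is enough to combine a metric from one source with a dimension from another.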

And eventually this will lead to more and more poorly answered, or even unanswered, questions. In a sector as competitive as online business, this can be fatal.

Reason #2: Predictions

Part of the “not-having-your-own-data” issue is that you can’t use your data for predictions either. Predictive analytics is an iterative process, where you need clear and transparent data tables with a large number of variables. To create a meaningful prediction, you need to be flexible with your data, and third-party tools are not flexible at all. I guess this is the main reason why I’ve never seen an analyst create useful predictions from Mixpanel, Kissmetrics or similar tools.

Reason #3: Trust your data

“Why doesn’t Mixpanel show the same numbers as Optimizely?”
“Why are Adwords conversions different from GA conversions?”
“How come Crazyegg shows a 30% bounce rate and Mixpanel shows 50%?”

3 different tools show 3 different results for the same metrics. By PappG.

In the last few years I’ve been working and consulting with quite a few startups and e-commerce businesses. The questions above tend to come up from time to time. The answers depend on the specific problem. Some examples:

“This tool defines that metric differently than the other one.”
“This tool has changed its conversion tracking recently.”
“This tool uses sampling.”
“The tool has been set up improperly.”

Either way the ultimate answer is:

The single source of truth will always be in your own data tools. With your own definitions, with your own tracking, and without sampling. (Note that this also means that if you have your own data tools, it’s easier to debug the third-party ones when needed.)

Reason #4: Much more detail

Building your own solution will give you the ability to log everything. Every click, every page view, every parameter. If you are using Google Analytics, you are compromising by not having email addresses connected to activity data. If you are using Mixpanel, you are compromising on exactly which data points to collect (if you collect everything, you will reach their quota limit).

If you build your own data service, you don’t have to compromise at all. You can have data as detailed as you want. And you can use and analyze it anytime.

Con #1: Simplicity

However, building your own data infrastructure is not a black-or-white decision. There is one big counterargument: simplicity. Using third-party tools like Google Analytics is incredibly easy, both the implementation part and the analysis part.
My rule here is: simple tools for simple questions – advanced tools for complex questions.

data analysis in Google Analytics vs. Bash

Going with the above-mentioned examples: as long as your data analysis ends at checking the number of sessions per acquisition channel, you won’t need to spend time or money on setting up your own data tools. There are businesses (e.g. sole-trader e-commerce businesses) where Google Analytics will cover the data needs forever! And that is cool!

But once you are out of the “simple questions” phase, make sure you start to build advanced tools.

Hiring questions

Different tools need different skill sets.
For using Google Analytics, Hotjar or Optimizely, you have to hire a digital marketer or a digital analyst. (Or you can do it yourself, if you feel like it and have the time.)

For building data collection scripts, SQL tables, Python scripts and the rest, you need to hire a data analyst with engineering skills or a data scientist.

It’s hard to give exact numbers, but if you look around on websites like Indeed or Glassdoor, you will see that data analyst/scientist salaries are ~20% higher than digital marketer or digital analyst salaries. Obviously this can differ by country, by market, by company, by the exact role, etc.

Anyway, regardless of whether you hire a digital analyst or a data scientist, hopefully he/she will create much more value for your company than he/she costs.

Is the price of the data tools a factor?

For sure. But you’d be surprised!
A small calculation for a SaaS startup: let’s say you have 5,000 users, 500 daily active users and 1,000 daily new visitors. In this case you will pay:

Optimizely: ~$400/month (link: https://www.optimizely.com/plans/)
Mixpanel: ~$150/month (link: https://mixpanel.com/pricing/)
Crazyegg: ~$50/month (link: https://www.crazyegg.com/pricing/)
Hotjar: ~$30/month (link: https://www.hotjar.com/pricing)
Google Analytics: free

Altogether: $630/month.

Just to compare:
For the same number of users and visitors you can collect, store and process all your data on a data server for ~$100/month. On top of that, Python, R, SQL, bash and most of the related tools are free.

It means that even at that size, your own tools will be cheaper, besides letting you create better analyses. And in the long term, the more you scale, the bigger this difference will be.
Note that this saving will most probably be spent on salaries (see above). That’s the only reason why I’m not counting pricing as Reason #5.
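
The comparison above as a quick Python check. All prices are the approximate monthly figures quoted in this article, not current vendor pricing, and the variable names are made up for the illustration.

```python
# Approximate monthly prices in USD, as quoted in the article
third_party = {
    "Optimizely": 400,
    "Mixpanel": 150,
    "Crazyegg": 50,
    "Hotjar": 30,
    "Google Analytics": 0,
}
own_server = 100  # the article's estimate for a self-managed data server

third_party_total = sum(third_party.values())    # 630
monthly_saving = third_party_total - own_server  # 530
yearly_saving = monthly_saving * 12              # 6360
```

This only covers tooling costs. Salaries are left out on purpose, as explained above.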

When to build your own data tools?

I guess you get the point now! You have to take your first steps toward advanced data tools (SQL, Python, R, bash, etc.) when you have outgrown the basic third-party tools.

But when is that exactly?
In my experience, the best possible moment to hire a data scientist/analyst and start building your data infrastructure is when your company has between 15 and 30 employees. Obviously, this is a rough average, but usually this is the time:

  • when you’ve filled the must-have roles (engineers, designers, marketers) and can start to get smarter and optimize your online business (with data people)
  • when you’ve reached a reasonable audience size (users and/or visitors)
  • when the resistance to data is still not too big at your company

However, if you have some engineering resources, then I recommend logging interaction data, at least in plain-text format, from day zero. Believe me, 3 years later you will be very thankful to yourself for not letting this information get lost today. I also suggest creating daily copies of your transaction/production data somewhere, for the same reasons.
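
A daily copy can be as simple as saving a dated file. Below is a minimal Python sketch, assuming your transaction data has already been exported to a file. The function name `daily_snapshot`, the file names and the folder layout are hypothetical, and a live database would need its own dump or backup tool instead of a plain file copy.

```python
import os
import shutil
import tempfile
from datetime import date


def daily_snapshot(source_path, backup_dir, day=None):
    """Copy source_path into backup_dir with the date added to the file name."""
    day = day or date.today()
    os.makedirs(backup_dir, exist_ok=True)
    base, ext = os.path.splitext(os.path.basename(source_path))
    target = os.path.join(backup_dir, f"{base}_{day.isoformat()}{ext}")
    shutil.copy2(source_path, target)  # copy2 also keeps file timestamps
    return target


# Example usage with throwaway files
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "transactions.csv")
with open(source, "w", encoding="utf-8") as f:
    f.write("order_id,user_id,amount\n1,42,19.99\n")

snapshot = daily_snapshot(source, os.path.join(work_dir, "backups"), date(2024, 1, 15))
```

Running something like this once a day from a scheduler gives you one file per day, so you can later rebuild how the data looked at any point in time.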

CONCLUSION

I hope this article gave you a good overview of when and why to build your own data infrastructure. Use third-party data services, but don’t get stuck with them. Once it’s needed, don’t be afraid to start building your own data tools, and create a better, more detailed, more flexible and more analyzable data service for yourself than any third party can offer!

If you want to be notified first about new content on Data36 (like articles, videos, handbooks, etc.), sign up for my Newsletter!

Cheers,
Tomi Mester

4 Comments

  1. Jesper Petersson

    Great article!

  2. Great article!

    You mentioned star schema which is a great data modelling technique. Also, you mentioned R, SQL, Python, and bash which are great data manipulation and analysis tools.
    Could you please let us know what techniques can be used to actually capture the data? For example, do we need to create REST API for this task or something else?

    • hey Kunal,

      thanks and to answer the question:
      1.) To be honest, I’m not the best person to answer. I usually do data collection in close collaboration with (website, app, etc.) developers. They are the people who actually implement the tracking scripts. From the data side, my responsibility is to make my data server able to pick up the data that the developers send there. I wrote a high-level article about that: https://data36.com/data-collection/
      2.) As for the exact techniques, the best I can say is that you should use a native solution. E.g. if you have a Java-based web application, most probably your tracking code should be in Java. But, for instance, I wrote a chatbot in Python+Flask not so long ago, and in that case my tracking script was in Python. If you build your website with Django+Python, then you can find some native solution for that as well…
      When it comes to simple HTML, I’m not sure what the fanciest solution is today, but I’m almost sure that on the front-end you should use some JavaScript solution that communicates with your data server… There’s an old solution called AJAX that I used a few times back in the day (a PHP script picked up the data on the data server side).

      Hope this helped! 😉

      Cheers,
      Tomi
