Predictive analytics is not a new or very complicated piece of science. As I mentioned before (Reporting, Optimizing, Predicting – 3 things that you can do with your data), it’s easy for anyone to understand at least the essence of it.

Imagine that you are in the grocery shop. You are done and ready to pay. But which line do you choose? Your brain starts to run a built-in “predictive algorithm” with these parameters:

  1. Expected output: the fastest line, the one you need to pick
  2. Target variable: time spent in the line (the less the better), so you can leave the shop and go home as fast as possible
  3. Historical data: your past experience from previous shopping sessions
  4. Predictors (or “features”): length of the line, number of items in the baskets, average age of the people in the line, etc.
  5. The predictive model: the one that your smart brain picks

Basically, computers do the exact same thing when they do predictive analytics (or even machine learning). They copy how our brain works. Obviously, computers are more pragmatic. They use well-defined mathematical and statistical methods and much more data. (Sometimes even big data. The real big data. Not the one that media folks use all the time to get clicks on their articles. ;-)) And eventually they can give back more accurate results. By the end of these two articles (Predictive Analytics 101 Part 1 & Part 2) you will learn how predictive analytics works, which methods you can use and how computers can be so accurate. Don’t worry, this is a 101 article; you will understand it without a PhD in mathematics too.

Note: if you are looking for a more general data science introduction, check out the data analytics basics first!

Why does every business need Predictive Analytics? A possible answer is: Customer Lifetime Value.

Though it’s not very difficult to understand, predictive analytics is certainly not the first step you take when you set up the data-driven side of your startup or e-commerce business. You start with KPIs and data research. In a little while you will reach a point where you need to understand an important metric of your online business: the Customer Lifetime Value. In the Practical Data Dictionary I’ve already introduced a very simple way to calculate CLTV. That was:

CLTV = ARPU * (1 + (RP%) + (RP%)² + (RP%)³ + (RP%)⁴ + …)

(ARPU: Average Revenue Per User
RP%: Repeat Purchase % or Recurring Payment %)
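
A quick way to play with this formula: the sum in the brackets is a geometric series, so as long as RP% is below 100% it converges to 1 / (1 - RP%). The Python sketch below is only an illustration. The function names and the example numbers (an ARPU of 50 and an RP% of 30%) are made up, not taken from the Practical Data Dictionary.

```python
def cltv_series(arpu, repeat_rate, terms=50):
    """Sum the first `terms` elements of ARPU * (1 + RP + RP^2 + ...)."""
    return arpu * sum(repeat_rate ** k for k in range(terms))


def cltv_closed_form(arpu, repeat_rate):
    """Limit of the same series: ARPU / (1 - RP), valid only for 0 <= RP < 1."""
    if not 0 <= repeat_rate < 1:
        raise ValueError("repeat_rate must be in [0, 1)")
    return arpu / (1 - repeat_rate)


# Hypothetical numbers, for illustration only.
example_arpu = 50.0
example_repeat_rate = 0.30

print(round(cltv_series(example_arpu, example_repeat_rate), 2))       # 71.43
print(round(cltv_closed_form(example_arpu, example_repeat_rate), 2))  # 71.43
```

The two functions agree because the terms shrink very quickly: with a 30% repeat rate, everything after the first handful of purchases barely moves the total.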

Customer LifeTime Value Calculation

Download the full 54 pages of Practical Data Dictionary PDF for free.

It’s a good start, but I’d raise an argument with Past Me. I wrote:
“In this formula, we are underestimating the CLTV. When calculating the CLTV, I would advise underestimating it – if we are thinking in terms of money, it’s better to be pleasantly surprised rather than disappointed!”

That’s not quite true, past Tomi. For instance, if you underestimate the Customer Lifetime Value, you will underestimate your projected marketing/sales budget as well, and with that your CPC/PPC limits and your overall Customer Acquisition Costs. You will spend less. That means you will grow slower. That means you’ll lose potential users. And if you are surrounded by competitors, this could easily cost you your business.

Of course, this is too dramatic. In 95% of cases you can use the Practical Data Dictionary formula just fine and you will be a very happy business owner with a nice profit at the end of the year.
But you would be even happier if your business grew faster, right? To get there, you can neither underestimate nor overestimate your CLTV. You need to know it as accurately as possible. This is one important point where predictive analytics can come into play in your online business.

Note: There are many other ways to use predictions at startups/e-commerce businesses. You can predict and prevent churn, you can predict the workload of your support organization, you can predict the traffic on your servers, etc…

How does predictive analytics work in real life?

Step 1 – Select the target variable!

In my opening grocery example, the metric we wanted to predict was the time spent in each line. If a computer had done this prediction, we would have gotten back an exact time value for each line. In this case the question was “how much (time)” and the answer was a numeric value (the fancy word for that: continuous target variable).

There are other cases where the question is not “how much”, but “which one”. E.g. you go to a shop and you can choose between black, white, red, etc. T-shirts. The computer will try to predict which one you will choose, and maybe recommend you something. It does this based on your historical decisions. In this case the predicted value is not a number, but the name of a group/category (“black T-shirt”). This is a so-called “categorical target variable”, the result of a “discrete choice”.

So if you predict something it’s usually:

A) a numeric value (aka. continuous target variable), that answers the question “how much” or
B) a categorical value (aka. categorical target variable or discrete choice), that answers the question “which one”

These will become important when you choose the prediction model.
Anyhow: at this point your focus is on selecting your target variable. This takes business considerations as much as statistical ones.

Note: actually there are more possible types of target variables, but as this is a 101 article, let’s go with these most commonly used two.
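
To make the “how much” vs. “which one” difference tangible, here is a tiny Python sketch. Everything in it is made up for illustration: the two small histories, and the two naive baselines (predict the historical average for a continuous target, predict the most frequent category for a categorical one). Real prediction models are smarter than this, but the type of their output is the same.

```python
from collections import Counter
from statistics import mean

# Made-up history: minutes spent waiting in a checkout line (continuous target).
past_wait_minutes = [4.0, 6.5, 5.0, 7.5, 6.0]

# Made-up history: T-shirt colors one shopper bought (categorical target).
past_tshirt_colors = ["black", "white", "black", "red", "black"]


def predict_how_much(history):
    """Naive baseline for a continuous target: predict the historical average."""
    return mean(history)


def predict_which_one(history):
    """Naive baseline for a categorical target: predict the most frequent category."""
    return Counter(history).most_common(1)[0][0]


print(predict_how_much(past_wait_minutes))    # a number
print(predict_which_one(past_tshirt_colors))  # a category name
```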

Step 2 – Get your historical data set!

If you have done the data collection right from the very beginning of your business, then this should not be an issue. Remember the “collect-everything-you-can” principle. When it comes to predictions, it’s extremely handy that you logged everything, because now you can try out lots of predictors/features in your analysis. Most of them won’t play a significant role in your model. But some of them will, and you won’t know which ones until you test them.

It’s also worth mentioning that in 99.9% of cases your data won’t be in the right format. So at this step you also need to spend time cleaning and formatting your data. But this part is very case-specific, so I leave this task to you.

The concept of overfitting

The idea behind predictive analytics is to “train” your model on historical data and apply this model to future data.

As Istvan Nagy-Racz, Co-Founder of, Radoop and DMLab (all 3 are successful companies working on Big Data, Predictive Analytics and Machine Learning stuff) said:

“Predictive Analytics is nothing else, but assuming that the same thing will happen in the future, that happened in the past.”

Let’s take an example. You have dots on your screen, blue ones and red ones. The screen has been generated by a ruleset that you don’t know, but you are trying to figure it out. You see some kind of correlation between the dots’ position on the screen and their color.

Overfitting example (source: Wikipedia with modification)

A new dot shows up on the screen. You don’t know the color, only the position of it. Try to guess the color!

Overfitting example (source: Wikipedia with modification)

Of course, if the dot is in the upper right corner, you will say it’s most probably blue. (dot B)
And if it’s in the bottom left corner, you will say it’s most probably red. (dot A)

That’s what a computer would say too, but it works with a mathematical model, not with gut feelings. The computer tries to come up with a curve that splits the screen. One side is blue, the other side is red. But what does the exact curve look like?

Overfitting example (source: Wikipedia)

There are several solutions. In the example above, the black and green curves are two of those. Which model is the most accurate? You would say the green one, right? It has 0% error and 100% accuracy. Unfortunately, there is a high chance that you are wrong.

The green-line prediction model fits the noise as well, which is why its accuracy is 100% in this case. However, if you regenerate the whole screen, it’s very likely that you will get a similar screen, but with different random errors. You will see that the green-line model’s accuracy will be much worse in this new case (let’s say 70%).

The black-line model has only 90% accuracy, but it doesn’t take the noise into consideration. It’s more general, so its accuracy will be around 90% again if you regenerate the screen with different random errors.

That’s what we call the overfitting issue in predictive analytics.
Here’s another great example of overfitting:


Overfitting example (source: Wikipedia)

Right? The black line looks like a better model for nice predictions in the future; the blue one looks overfitted.

Both cases showed that the more general the model, the better. But that’s the theory. In real life you can never know. That’s why you need as a next step…
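
If you want to see the overfitting effect with your own eyes, here is a small Python sketch of the dots-on-the-screen example. It rests on several assumptions of mine, none of which come from the figures: the hidden rule is “blue if x + y > 1”, 20% of the dots get a flipped color as noise, the “green curve” is played by a model that simply copies the color of the nearest known dot, and the “black curve” is a straight boundary whose position is picked on the training dots.

```python
import random


def make_screen(n_dots, noise, rng):
    """Generate dots on a unit square.

    Assumed hidden rule: a dot is blue if x + y > 1, otherwise red.
    Each color is then flipped with probability `noise` (the random errors).
    """
    dots = []
    for _ in range(n_dots):
        x, y = rng.random(), rng.random()
        color = "blue" if x + y > 1 else "red"
        if rng.random() < noise:
            color = "red" if color == "blue" else "blue"
        dots.append((x, y, color))
    return dots


def memorizing_model(train):
    """'Green curve': copy the color of the nearest training dot (1-nearest neighbor)."""
    def predict(x, y):
        nearest = min(train, key=lambda d: (d[0] - x) ** 2 + (d[1] - y) ** 2)
        return nearest[2]
    return predict


def simple_model(train):
    """'Black curve': a straight boundary x + y > t, with t picked on the training dots."""
    def train_hits(t):
        return sum(("blue" if x + y > t else "red") == c for x, y, c in train)
    best_t = max((i / 50 for i in range(101)), key=train_hits)
    return lambda x, y: "blue" if x + y > best_t else "red"


def accuracy(predict, dots):
    """Share of dots whose color the model guesses correctly."""
    return sum(predict(x, y) == c for x, y, c in dots) / len(dots)


rng = random.Random(42)
old_screen = make_screen(300, noise=0.2, rng=rng)   # the screen we learn from
new_screen = make_screen(1000, noise=0.2, rng=rng)  # the regenerated screen

green = memorizing_model(old_screen)
black = simple_model(old_screen)

for name, model in (("memorizing", green), ("simple", black)):
    print(name, accuracy(model, old_screen), accuracy(model, new_screen))
```

The memorizing model scores 100% on the old screen by construction, because every dot is its own nearest neighbor. The interesting part is the second number in each row: how much of that accuracy survives on a screen the model has never seen.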

Step 3 – Split your data!

We usually split our historical data into 2 sets:

  1. Training set: the dataset that we use to teach our model.
  2. Test set: the dataset that we use to validate our model before using it on real-life future data.

The split has to be done with random selection, so the sets will be homogeneous. It’s obvious, but worth mentioning, that the bigger the historical data set is, the better the randomization and the prediction will be.

But what’s the right split? 50%-50%? 80%-20%? 20%-80%? 70%-30%?
Well, that could be another whole blog article. There are so many methods and opinions.

Most people, at least the ones I know, focus more on the training part, so they assign 70% of the data to the training set and 30% to the test set. This is called the holdout method.

the holdout method 70%-30%
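
A random 70%-30% holdout split takes only a few lines. The sketch below uses plain Python; the function name, the fixed seed and the 100 row ids standing in for a historical data set are my own placeholders.

```python
import random


def holdout_split(rows, train_share=0.7, seed=0):
    """Shuffle a copy of `rows` and cut it into (training set, test set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for a historical data set: 100 row ids.
historical_rows = list(range(100))
training_set, test_set = holdout_split(historical_rows)

print(len(training_set), len(test_set))  # 70 30
```

Shuffling before cutting is what makes the selection random. Without it, the test set would simply be the last 30% of your rows, which are often the most recent ones.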

Some others make 3 sets: training, fine-tuning (often called validation) and test sets. They train the model with the training set, fine-tune it with the fine-tuning set and eventually validate it with the test set.
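
The three-set version works the same way, just with two cuts instead of one. In this sketch the 60%-20%-20% shares are only an assumption for the example, as are the function name and the seed.

```python
import random


def three_way_split(rows, shares=(0.6, 0.2, 0.2), seed=0):
    """Shuffle `rows` and cut them into training, fine-tuning and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    first_cut = round(len(shuffled) * shares[0])
    second_cut = first_cut + round(len(shuffled) * shares[1])
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


train_rows, tuning_rows, test_rows = three_way_split(range(100))
print(len(train_rows), len(tuning_rows), len(test_rows))  # 60 20 20
```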

What I like the most is a method called Monte Carlo cross-validation, and not only because of the name. In this process you repeatedly select a random 20% (or any X%) of your data. You select 20%, use it for any of the training/validation/testing methods, then drop it. Then you select another random 20%. The selections are independent of each other in every round. This means the same data points can be used several times. The advantage is that you can run as many rounds as you like, so the estimate of your model’s accuracy gets more reliable round by round.

Monte Carlo cross-validation (20%)
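
Here is one way Monte Carlo cross-validation could look in Python. It is a sketch of the common “repeated random sub-sampling” variant: in every round a fresh random 20% is held out as the test set and the model is trained on the rest. The toy “model” (the training mean), the error measure (mean absolute error) and the made-up waiting times are all my assumptions, chosen only to keep the example short.

```python
import random
from statistics import mean


def monte_carlo_cv(rows, fit, error, rounds=100, test_share=0.2, seed=0):
    """Repeated random sub-sampling: each round draws a fresh random test set.

    Rounds are independent, so the same row can land in the test set many times.
    Returns the list of per-round test errors.
    """
    rng = random.Random(seed)
    n_test = max(1, round(len(rows) * test_share))
    errors = []
    for _ in range(rounds):
        shuffled = rng.sample(rows, len(rows))  # a fresh random order each round
        test, train = shuffled[:n_test], shuffled[n_test:]
        model = fit(train)
        errors.append(error(model, test))
    return errors


# Toy stand-ins (assumptions): the "model" is just the training mean,
# and the error is the mean absolute error on the held-out rows.
def fit_mean(train):
    return mean(train)


def mean_abs_error(model, test):
    return mean(abs(value - model) for value in test)


data_rng = random.Random(1)
waiting_times = [data_rng.gauss(5.0, 1.5) for _ in range(200)]  # made-up minutes

round_errors = monte_carlo_cv(waiting_times, fit_mean, mean_abs_error)
print(len(round_errors), round(mean(round_errors), 2))
```

A single round can be lucky or unlucky depending on which rows end up in the test set. Averaging over many independent rounds smooths that out, which is the whole point of repeating the split.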

So all in all:
1. Train the model! (And I’ll dig into the details in Part 2 of Predictive Analytics 101.)
2. Validate it on the test set.
And if the training set and test set give back roughly the same error % and the accuracy is high enough, you have every reason to be happy.

To be continued…

UPDATE! Here’s Part 2: LINK!
I will continue from here next week. The next steps will be:
Step 4 – Pick the right prediction model and the right features!
Step 5 – How to validate your model?
Step 6 – Implement!
+1 – when predictive analytics fails…

Plus I’ll add some personal thoughts about the relation between big data, predictive analytics and machine learning too.

Tomi Mester