In Part 1 I introduced the main concept of Predictive Analytics and wrote about how predictions are useful for every online business. We went through the different target variables, the overfitting issue and the question of data splitting (training and test sets).
By the end of this article (combined with the previous one) you will have a great overview of how Predictive Analytics works in real life! And don’t worry, this is still a 101 article: you will understand it without a PhD in mathematics, too.
Let’s continue with:
Step 4 – Pick the right prediction model and the right input values!
This is the heart of Predictive Analytics. Creating the right model with the right predictors will take most of your time and energy. It needs as much experience as creativity. And there is never one exact or best solution. It’s an iterative process and you will need to optimize your prediction model over and over.
There are many, many methods, differing mostly in the math behind them, so I’m going to highlight only two of them here to explain how the prediction itself works. If you want to go deeper, you should invest a little more time by reading some books about it, taking an online course or sneaking into a Machine Learning university class.
Anyway, as you’ve read in Part 1, there are two common cases in Predictive Analytics.
- Answering the question “how much” with a continuous target variable (number)
- Answering the question “which one” (a.k.a. discrete choice) with a categorical target variable
The answer to the first question is given by “regression” and to the second one by “classification”.
(A small reminder: we are calling predictors the variables we are using as an input for our model. They are also known as features or input variables.)
If you’ve ever used the Trend line function in Excel on a Scatter Plot, then congrats! You have already applied Simple Linear Regression! In this model there is only one predictor and only one target variable (that’s why it’s called “simple”). The trend line tries to describe the relationship between the two.
Let’s say we are running an online blogging service and we want to understand the relationship between the time bloggers spend in the text editor and the number of Facebook shares their articles later receive. (Because we suspect that the more effort the writer puts into the article, the better the audience’s reaction will be.) First we get the historical data (step 2), then we split it into training and test sets (step 3). We continue with the training set. Plot it on a scatter plot:
Then fit a trend line on it.
In the pictures above we see a strong correlation between the two variables! (But remember: correlation doesn’t necessarily imply causation!) We can even go further! If someone creates a new article, we can predict the expected number of Facebook shares on that article. Of course, there is a chance that we are wrong, though we can calculate the reliability of the model too, and I’ll get back to that soon.
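For the curious, a simple linear regression like this can be computed in a few lines of code. Here is a minimal Python sketch; the editor times and share counts are made up for illustration:

```python
# A minimal simple linear regression: fit y = slope * x + intercept
# using ordinary least squares. All numbers are invented example data.

def fit_line(x, y):
    """Fit a straight line to the points with ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: minutes spent in the editor vs. Facebook shares
minutes = [30, 60, 90, 120, 150]
shares = [12, 25, 31, 48, 55]

slope, intercept = fit_line(minutes, shares)

# Predict the shares for a new article written in 100 minutes:
predicted = slope * 100 + intercept
```

This is exactly what Excel’s trend line does behind the scenes: once you have the slope and intercept, predicting for a new article is just one multiplication and one addition.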
This method is called Simple Linear Regression because it is the simplest way to use regression. There are two ways to make this model more sophisticated:
A) Using not just a simple straight line as the fitting curve.
B) Using more than one input variable to predict the target variable.
In both cases we can expect a more accurate prediction.
Imagine a situation where your scatter plot looks like this.
You can try to fit a straight line to it, but a curve (in this case an exponential curve) will describe the relationship between the predictor and the predicted value much better.
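If you suspect an exponential relationship like this, one common trick is to fit a straight line to the logarithm of the target value, which corresponds to fitting the curve y = a · e^(b·x). A minimal sketch, again with invented data points:

```python
import math

# Hypothetical data that grows roughly like e^x:
x = [1, 2, 3, 4, 5]
y = [2.7, 7.4, 20.1, 54.6, 148.4]

# Fit log(y) = b*x + log(a) with ordinary least squares.
# That is the same as fitting the curve y = a * e^(b*x).
log_y = [math.log(v) for v in y]
n = len(x)
mean_x = sum(x) / n
mean_ly = sum(log_y) / n
b = (sum((xi - mean_x) * (li - mean_ly) for xi, li in zip(x, log_y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = math.exp(mean_ly - b * mean_x)

# Predict a y value for a new x with the fitted curve:
prediction = a * math.exp(b * 6)
```

Since the example data was built from e^x, the fitted parameters come out close to a ≈ 1 and b ≈ 1.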
And in real life, in most cases, you won’t use only one input value. E.g. if you want to be very precise about the predicted number of Facebook shares in the example above, it’s not enough to look only at the text-editor time. You should consider bringing a few more parameters into the formula, like:
- Number of the blogger’s followers
- The quality of the content (eg. number of filler words)
- The success of previous articles of the blogger
Let’s pick only two for now: time spent in the editor and number of followers. In this case your scatter plot won’t be a 2D graph anymore, but a 3D one. And you need to assign a Y value not to one X value, but to two.
(Imagine a 3D chart with these axes: time spent in editor, number of followers, and number of Facebook shares.)
Based on your historical data, you have to predict a Y value for every possible combination of X1 and X2. Compared to the 2D case it’s a little more difficult to imagine, but you can still do it by basically giving an extra dimension to everything. Your scatter plot will sit in a 3D space and your curve turns into a surface. Something like this. (But hopefully with more data points.)
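To make this concrete, here is a sketch of fitting such a surface (a multiple linear regression with two predictors) using NumPy’s least-squares solver. All the numbers below are made up for illustration; the example data was constructed to follow shares = 0.4 · minutes + 0.05 · followers + 2 exactly, so the solver recovers those weights:

```python
import numpy as np

# Hypothetical training data: each row is (editor minutes, follower count)
X = np.array([[30, 100],
              [60, 500],
              [90, 200],
              [120, 800],
              [150, 400]], dtype=float)
# Facebook shares, generated from 0.4*minutes + 0.05*followers + 2:
y = np.array([19.0, 51.0, 48.0, 90.0, 82.0])

# Add a column of ones so the model also learns an intercept:
X1 = np.column_stack([X, np.ones(len(X))])

# Solve for the coefficients that minimize the squared error
# (this defines the fitted "surface" in 3D):
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
w_minutes, w_followers, intercept = coef

# Predict shares for a 100-minute article by a blogger with 600 followers:
prediction = w_minutes * 100 + w_followers * 600 + intercept
```

Adding a third or fourth predictor works exactly the same way: one more column in `X`, one more weight in `coef`, one more dimension you no longer have to visualize yourself.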
Note: when you have multiple input values, there is a high chance that different predictors will have different weights in the formula. Figuring these out is also a big and difficult part of the job. But as this is a 101 article, let’s not go into those details.
Every time you add a new feature to the formula, you should think of it as an extra dimension. Don’t worry, you don’t need to imagine 4D, 5D or 6D spaces… computers are meant to imagine those for you.
What you need to take away from here is that when it comes to regression, you try to fit some mathematical model to your data and use that model for your predictions.
Note: there are some painful tasks you can’t avoid during a predictive analytics project. One of these is “feature reduction”, where you remove the highly correlated or redundant variables. Another one is “feature standardization”, where you re-scale the different variables to make them easier to handle.
Regression gives the answer to “how much”; classification gives the answer to “which one”. There are many classification models; an easy-to-understand one is the Decision Tree. I’m going to walk you through the main concept of this method!
Let’s say you are a SaaS company and you have 1,000,000 active users. You want to predict for each of them whether they will use your product tomorrow or not.
- Your target variable is categorical. In this example it’s binary (yes or no).
- You have input values. Some of them are categorical (e.g. paying user or not), some of them are numerical (e.g. days since registration).
- Based on your historical data, you can assign every categorical feature a probability value. (E.g. if the user is a paying user, then there is a 75% chance that they will use your product tomorrow and a 25% chance that they won’t.)
- If you have a numerical predictor (e.g. days since registration), you try to turn it into a categorical variable and then assign it a probability value. (E.g. if days since registration is lower than 100, the chance of return is 60%; if it’s higher than 100, then 40%.)
- The tricky part is that if you combine these predictors with each other, the associated probability values will change. That’s why you need the tree format, where you build a system purely from if-then statements. It will look something like this:
- Usually every additional level of the tree improves accuracy, up to a point: too many levels lead to overfitting. The goal is to build the tree that predicts the most accurate result in the simplest way.
- If you are done and you want to apply your model to predict whether a user comes back tomorrow, you simply “put” the user into this tree and run them through the if-then statements. The result is a “yes” or a “no” with X% (preferably a high) probability.
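Since the tree is built purely from if-then statements, you can sketch the example above directly in code. The thresholds and probabilities below are the hypothetical ones from the bullet points, not numbers learned from real data:

```python
def predict_returns_tomorrow(paying_user, days_since_registration):
    """A tiny hand-built decision tree.

    Returns a (prediction, probability) pair. All branch probabilities
    are the made-up example values from the text.
    """
    if paying_user:
        # Paying users: hypothetically, 75% of them came back the next day
        return ("yes", 0.75)
    else:
        # Free users: split on the hypothetical 100-day threshold
        if days_since_registration < 100:
            return ("yes", 0.60)
        else:
            return ("no", 0.60)

# "Put" a few users into the tree:
predict_returns_tomorrow(True, 10)     # ("yes", 0.75)
predict_returns_tomorrow(False, 50)    # ("yes", 0.6)
predict_returns_tomorrow(False, 300)   # ("no", 0.6)
```

A real decision tree algorithm does the same thing at scale: it picks the splits and thresholds automatically from the historical data instead of you hard-coding them.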
Note: my description may be oversimplified, so here’s another one! The best interpretation of the Decision Tree that I’ve seen so far is the r2d3.us project. If you have an extra 10 minutes, scroll through their beautiful infographic-like learning material!
Step 5 – How to validate your prediction model?
In step 4 we worked with the training set to train our model! Now it’s time to use the other slice of our data to validate the model. Let’s go with the test set.
The exact process differs between prediction methods, but the point is always the same. Measure the accuracy on the test set and compare it to the accuracy on the training set. (Remember the overfitting issue!) The closer these two are to each other, and the lower the error percentage is, the better we are doing.
Sticking with the two predictive analytics methods I described:
- For linear regression, the most well-known way to measure accuracy is the R-squared value. It basically tells you how well your curve fits your data, and it’s calculated based on the distance between the dots and the curve. However, many professionals claim that the R-squared value is not enough in itself. If you want to learn more about their arguments, google some of these: “analysis of residuals”, “confidence intervals”, “Akaike information criterion”, etc. (Comments are highly welcome on this topic!)
- For decision trees the case is even simpler. What you need to do is check reality against your prediction. If you predicted yes and it’s a yes in real life, then you are good. If you predicted yes and it’s a no, you are not. Obviously, you won’t have 100% accuracy here either. But that’s okay. To make this more visual, we usually put this information into a confusion matrix.
When the model correctly predicts a yes or a no, we call these values true positives and true negatives. And when the model predicts wrong, we call them false positives and false negatives.
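The R-squared value mentioned in the regression bullet above can be computed by hand, too. A minimal sketch with invented actual and predicted values:

```python
def r_squared(y_actual, y_predicted):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_actual) / len(y_actual)
    # Squared distance between each dot and the fitted curve:
    ss_res = sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted))
    # Squared distance between each dot and the plain mean:
    ss_tot = sum((a - mean_y) ** 2 for a in y_actual)
    return 1 - ss_res / ss_tot

# Hypothetical test-set share counts vs. the model's predictions:
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 39]
score = r_squared(actual, predicted)   # close to 1 -> the curve fits well
```

An R² of 1 would mean a perfect fit; a value near 0 means the model explains hardly anything beyond the plain average.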
Accuracy = (True positive + True negative) / (True Positive + True Negative + False Positive + False Negative)
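Counting the four cells of the confusion matrix and plugging them into this formula might look like this; the yes/no lists are invented for illustration:

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a binary prediction."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    return tp, tn, fp, fn

# Hypothetical test set: did the user really come back vs. what we predicted
actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 4 correct out of 6
```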
Note: if you want to learn how to measure the accuracy of a decision tree more precisely, google “precision and recall” or “F1 score”.
Step 6 – Implement
Yay! You are done! It’s time to start using your model to predict the future with your real life data!
A side note here: even if your prediction model brings pretty good results during the validation process, you should always follow up and check it against present and future data. There is a chance that you missed something during the setup that hits back when you implement your model for real.
It’s also worth mentioning that even the best prediction models start to make more and more mistakes over time. Simply because life produces changes that you can never be fully prepared for.
All in all
To summarize the two articles, a predictive analytics project looks like this:
- Select the target variable!
- Get the historical data!
- Split the data into training and test sets!
- Experiment with prediction models and predictors! Pick the most accurate model!
- Prepare the data!
- Train the model on the training set!
- Validate it on the test set!
- Implement it! And don’t forget to follow-up on it from time to time!
When Predictive Analytics fails
Yes, Predictive Analytics can and will fail. There are two main concerns you should think about.
The first one is an issue that Oedipus had as well. No, not the ‘marrying his own mother’ one, but the other: the self-fulfilling prophecy. I’m going to use the online blog-engine example from above. Let’s say you clearly see from your predictions that your article will be shared more on Facebook if you spend more time in the editor. Knowing this, you can draw some false conclusions. E.g. you type only one sentence, then leave the editor open on your computer to push up your time-spent-in-editor value. You click publish and you are surprised when you realize that nobody shared it, even though you “spent” 6 hours in the editor…
We can all see where the issue is. A high time-spent-in-editor value is not the cause but a symptom of a high-quality article. The real cause is the care of the author, which shows up in a high time-spent-in-editor value as well as in a high number of FB shares. If you think this is a trivial issue, I can tell you, people have almost died from similar mistakes.
The other classic problem is obsolescence. Even the best models fail after a few years. There are things you simply can’t predict. How would you have predicted in 2006 that by 2016 half of your web traffic would be mobile traffic? Impossible, right? Predictive models will fail by definition after a while, because you can’t prepare them for unexpected things. But that’s a known limitation. Predictive models are like cars: you have to maintain them from time to time, and sometimes replace them.
Big Data vs. Predictive Analytics vs. Machine Learning
Just one more note. I recently realized that “Big Data”, “Predictive Analytics” and “Machine Learning” are used as synonyms for each other, though they are not. Big Data is a technology for processing large amounts of data (I detailed the big data topic here: The Great Big Data Misunderstanding). It’s not necessarily used for Predictive Analytics or for Machine Learning. It can be used for simple data analyses or text processing too.
I guess if you’ve read this far, you know what Predictive Analytics is. So the only remaining question is what Machine Learning is. In our context you can think of Machine Learning as “improved Predictive Analytics”, where the computer automatically fine-tunes the prediction formula by learning from the difference between the predicted values and reality.
According to Istvan Nagy, you can also think of Predictive Analytics as a business goal. In this context, Machine Learning is the approach to reach that goal. And the different models are the different tools.
Of course it’s much more than that, but then again, this is a 101 article and I don’t want to open Pandora’s box here. 🙂
Is Predictive Analytics easy? To understand: yes. To apply in real life: well, I agree, it can be difficult.
My aim with these two articles (Part 1 is here, if you haven’t read it) was to highlight that it’s not rocket science. Everybody can understand the concept. And if you do, it’s only a question of motivation (and time, of course) before you learn the specifics and apply them in real life! Good luck, and thank you for reading!
And if you want to be notified first about new content on Data36 (like articles, videos, handbooks, etc.), sign up for my Newsletter!