Prediction
In this tutorial we will learn how to use the PREDICT function to predict the quality of red wine using Infer's special flavour of SQL, SQL-inf! 🍷
What is Prediction?
Let's learn by example! Pretend you own a vineyard, and have done so for many years. As a diligent winemaker, you take a number of scientific measurements of the wine you produce:
- the alcohol % 🍸
- the pH level 🧫
- the amount of sulphates 🌋
- and many other things...
You also taste the wine, and give it a quality score - a score from 1 to 10.
You suspect that some of these measurements (also referred to as features in the machine-learning world, or independent variables in statistics) influence the quality of your wine (also known as your dependent variable, outcome, or target).
You want to use all this data you've painstakingly measured and turn it into the best actionable insight of all: making good wine better. 😋 You might even have some intuition that some of these features are more important than others, but you don't know for certain. So how can we get to these precious insights? Prediction!
Machine learning techniques can automatically discover the complex relationships in your data, doing the hard work, so you don't have to. Learning these relationships allows you to take your measurements and predict the quality of the wine without even tasting it. Woah, the power of prediction! 🪄
Using SQL-inf for Predictions
SQL-inf takes these complicated machine-learning techniques and simplifies them into one line of SQL. Let's see it in action.
First we upload our dataset to the Infer platform, which we've called vino_veritas. The CSV file can be found here.
We can preview the dataset using Infer's dataset viewer, as shown in the video below.
Next, we run our one-liner:
SELECT * FROM vino_veritas PREDICT(quality, model='reg')
and we get a result - a table and a graph! Cool. It's that easy to make your first machine-learning-based predictive model!
Breaking Down the Query 🪓
Ok, a lot was just packed into that one-liner. Let's break down the query we just ran into steps:
SELECT * FROM vino_veritas
- This part of the command selects all of the columns in the table vino_veritas. If we ran this command by itself, we would just return the input dataset.
PREDICT(quality, model='reg')
- At the end of the query, we have PREDICT(quality, model='reg'). A few things are happening here:
  - We are deciding which kind of SQL-inf function to use, i.e. PREDICT. This means we would like to build a machine-learning model for prediction.
  - Our target is the quality of the wine, so we put that in our PREDICT function as the thing we'd like to predict.
  - We are using a regression (reg for short) model. Read the section on Classification and Regression to understand this a bit better!
  - We have selected all columns (using the SQL wildcard *) as inputs into the machine-learning model. This is often a good starting point.
Sometimes not all columns are relevant, so we can be more selective with our inputs to the model:
SELECT pH, alcohol, sulphates, inf.* FROM vino_veritas PREDICT(quality, model='reg')
In this example, we use only pH, alcohol, and sulphates as our inputs to the model.
We also use a special SQL-inf wildcard, inf.*.
This keyword will return all special columns that are computed with SQL-inf functions.
Under the hood, Infer writes any newly generated information, like model outputs and metrics, into a table called inf.
By doing this, we allow the user to easily access as much or as little new information as they'd like.
Adding Finesse with Optional Arguments
SQL-inf functions accept optional arguments which change how the function operates. Let's take a look at another example.
SELECT pH, alcohol, sulphates, inf.prediction FROM vino_veritas PREDICT(quality, model='clf')
Here we have set the optional argument model to clf, short for classification, which changes the type of model we use, and hence the kinds of outputs we generate.
Specifying the model isn't always necessary. By default, PREDICT will attempt to find the correct model type depending on the selected column.
So, when we write PREDICT(quality), this is equivalent to writing PREDICT(quality, model='clf') for categorical data or PREDICT(quality, model='reg') for continuous data.
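To make the idea of picking a model type from the target column concrete, here is a minimal Python sketch. It is purely illustrative: the function name guess_model_type and the rule it applies (text or boolean values mean classification, other numbers mean regression) are assumptions made for this example, not a description of how Infer actually decides.

```python
def guess_model_type(values):
    """Hypothetical rule of thumb: return 'clf' for categorical-looking
    data and 'reg' for continuous-looking data."""
    non_null = [v for v in values if v is not None]
    # Text labels and True/False flags look like categories.
    if any(isinstance(v, (str, bool)) for v in non_null):
        return "clf"
    # Everything else is numeric here, so treat it as a continuous quantity.
    return "reg"


wine_colour = ["red", "white", "red"]   # made-up categorical column
quality_scores = [5, 6, 7, 5]           # made-up numeric column

colour_model = guess_model_type(wine_colour)
quality_model = guess_model_type(quality_scores)
```

Under this assumed rule, a numeric score column like quality is treated as regression. If you would rather treat it as a set of classes, you can always set model='clf' explicitly.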
Classification and Regression
What do 'clf' and 'reg' refer to?
Classification ('clf') is prediction of a category or class.
Examples could include gender, categories of wine quality (good quality, bad quality), animal species, eye color, yes/no, etc.
The output of classification models is a probability for each class, e.g. probability_Cat, probability_Dog. The class probabilities always sum to 1, i.e. probability_Dog + probability_Cat = 1.0.
For common binary problems (0/1, True/False, Yes/No), only probability is returned, representing the probability of the positive case (1, True, Yes). For multiple classes, probability represents the probability of the predicted class.
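The relationship between per-class probabilities and the single probability value can be sketched in a few lines of Python. This is only an illustration: the raw scores are made up, and the softmax function used to turn them into probabilities is a common textbook choice, not necessarily what Infer does internally. Only the column names probability_Cat and probability_Dog come from the text above.

```python
import math


def class_probabilities(scores):
    """Softmax: turn raw per-class scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {f"probability_{label}": e / total for label, e in exps.items()}


scores = {"Cat": 2.0, "Dog": 0.5}        # made-up raw model scores
probs = class_probabilities(scores)

# The predicted class is the one with the highest probability, and
# `probability` is the probability of that predicted class.
predicted = max(probs, key=probs.get)
probability = probs[predicted]
```

Whatever scores you start from, the values in probs add up to 1, which is the property described above.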
Regression ('reg') is prediction of a continuous quantity. Examples could include height, the value of a stock one month from now, or a quality score out of 10. The output of regression models is the same as the quantity you are trying to estimate, i.e. if you were predicting height (cm), the output would be the predicted height (cm).
By choosing model='reg' for predicting the quality of the wine, we treat the quality score as a continuous quantity. This means the predictions will generally not be whole numbers, even though the scores in the data are (5, 6, 7...). Instead, predictions will be values like 5.6, 7.2, etc.
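To see why a regression model produces in-between values even when every score in the data is a whole number, here is a small Python sketch. It fits a plain straight line by least squares to a handful of made-up measurements. This is a deliberately simple stand-in for illustration, not the model Infer uses, and the numbers are not taken from the vino_veritas dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]  # made-up alcohol %
quality = [5, 5, 6, 6, 7]                # made-up whole-number scores

slope, intercept = fit_line(alcohol, quality)

# Predict the quality of a new wine with 11.5% alcohol.
prediction = slope * 11.5 + intercept    # about 6.05, not a whole number
```

The model has only ever seen the scores 5, 6 and 7, yet its prediction lands between them, because it treats quality as a point on a continuous scale.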
Want to know more about the predictive model?
XGBoost is the default algorithm used in PREDICT, as it is widely regarded as a strong baseline model.
In the future, we plan to expand to other machine-learning algorithms.
If you'd like to learn more about the XGBoost algorithm, we highly recommend checking out our How Infer Works article:
https://docs.getinfer.io/docs/reference/how%20Infer%20works/predict_explain
Getting Insights
We have a predictive model, and predictions from that model. How can we get insights using these new tools?
Auto-visualisation with Infer
The Infer platform will automatically begin to visualise the most important relationships in your machine-learning model. The visualisation shown will depend on the specific command you run, your input data, and your predictive model. Basically, we do the hard work of figuring out the best way to represent your data and key insights, so you don't have to.
For PREDICT, we return scatter plots for continuous quantities and box plots for categorical quantities.
The first plot always shows the most important input to the model on the x-axis and the predictions themselves on the y-axis.
In our wine example, alcohol % is the most important feature for predicting the quality of the wine, so we show that first:
With these kinds of visualisations, it's possible to better understand the relationship between the most important features (e.g. alcohol) and your target prediction (e.g. quality of wine). It's quite clear from above that the model learned that a higher alcohol percentage is associated with higher scores. Maybe if we want better quality wine, we just need more alcohol! 🍷
You can check out how other variables affect the prediction of wine quality by clicking on the column names. Have a play around and see what patterns you can see!
Explainability
Our auto-visualisation tool is powered by something called "Explainable AI" (XAI, for short). It's possible to access our Explainable AI algorithms directly to find even more insights about your data.
Please continue to the next chapter to learn all about explainability within the Infer platform.