# Prediction

In this tutorial we will learn how to use the `PREDICT` function to predict the quality of red wine using Infer's special flavour of SQL, SQL-inf! 🍷

# What is Prediction?

Let's learn by example! Pretend you own a vineyard, and have done so for many years. As a diligent winemaker, you take a number of scientific measurements of the wine you produce:

- the alcohol % 🍸
- the pH level 🧫
- the amount of sulphates 🌋
- and many other things...

You also taste the wine and give it a *quality score* - a score from 1 to 10.
You suspect that some of these measurements (also referred to as `features` in the machine-learning world, or `independent variables` in statistics) influence the quality of your wine (also known as your `dependent variable`, `outcome`, or `target`).

You want to use all this data you've painstakingly measured and turn it into the best actionable insight of all: making good wine better. 😋 You might even have some intuition that some of these features are more important than others, but you don't know for certain. So how can we get to these precious insights? Prediction!

Machine learning techniques can automatically discover the complex relationships in your data, doing the hard work, so you don't have to.
Learning these relationships allows you to take your measurements and **predict** the quality of the wine *without even tasting it*. Woah, the power of prediction! 🪄

# Using SQL-inf for Predictions

SQL-inf takes these complicated machine-learning techniques and simplifies them into one line of SQL. Let's see it in action.

First, we upload our dataset, which we've called `vino_veritas`, to the Infer platform. The CSV file can be found here.
We can preview the dataset using Infer's dataset viewer, as shown in the video below.

Next, we run our one-liner:

`SELECT * FROM vino_veritas PREDICT(quality, model='reg')`

and we get a result - a table and a graph! Cool. It's that easy to make your first machine-learning-based predictive model!

# Breaking Down the Query 🪓

Ok, a lot was just packed into that one-liner. Let's break down the query we just ran into steps:

`SELECT * FROM vino_veritas`

This part of the command selects all of the columns in the table `vino_veritas`. If we ran this command by itself, we would just return the input dataset.

`PREDICT(quality, model='reg')`

At the end of the query, we have `PREDICT(quality, model='reg')`. A few things are happening here:

- We are deciding which kind of SQL-inf function to use, i.e. `PREDICT`. This means we would like to build a machine-learning model for prediction.
- Our target is the `quality` of the wine, so we put that in our `PREDICT` function as the thing we'd like to predict.
- We are using a `regression` (`reg` for short) model. Read the section on `Classification and Regression` to understand this a bit better!
- We have selected all columns (using the SQL wildcard `*`) as inputs into the machine-learning model. This is often a good starting point.

Sometimes not all columns are relevant, so we can be more selective with our inputs to the model:

`SELECT pH, alcohol, sulphates, inf.* FROM vino_veritas PREDICT(quality, model='reg')`

In this example, we use only `pH`, `alcohol`, and `sulphates` as our inputs to the model.
We also use a special SQL-inf wildcard, `inf.*`.
This keyword will return all special columns that are computed with SQL-inf functions.

Under the hood, Infer writes any newly generated information, like model outputs and metrics, into a table called `inf`.
By doing this, we allow the user to easily access as much or as little new information as they'd like.

# Adding Finesse with Optional Arguments

SQL-inf functions accept optional arguments which change how the function operates. Let's take a look at another example.

`SELECT pH, alcohol, sulphates, inf.prediction FROM vino_veritas PREDICT(quality, model='clf')`

Here we have added an optional argument `model` and set it to `clf`, short for `classification`, which will change the type of model we use, and hence the kinds of outputs we generate.

Specifying the `model` isn't always necessary. By default, `PREDICT` will attempt to find the correct model type depending on the selected column.
So, when we write `PREDICT(quality)`, this is equivalent to writing `PREDICT(quality, model='clf')` for categorical data or `PREDICT(quality, model='reg')` for continuous data.
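As a rough illustration of that default, here is a small Python sketch of how a model type might be chosen from a column's values. This is a hypothetical example for intuition only, not Infer's actual logic:

```python
# Hypothetical sketch of choosing a default model type from a column.
# Illustrative only - not Infer's actual implementation.

def default_model_type(values):
    """Return 'clf' for categorical-looking columns, 'reg' for continuous ones."""
    if any(isinstance(v, str) for v in values):
        return "clf"  # text labels are treated as categories
    # Numeric columns with many distinct values look continuous
    return "reg" if len(set(values)) > 10 else "clf"

print(default_model_type(["good", "bad", "good"]))  # clf
print(default_model_type([round(9 + i * 0.3, 1) for i in range(20)]))  # reg
```

The threshold of 10 distinct values is an arbitrary choice for the sketch; a real system would use the column's declared type and richer heuristics.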

# Classification and Regression

What do `'clf'` and `'reg'` refer to?

**Classification** (`'clf'`) is prediction of a *category* or *class*.
Examples could include gender, categories of wine quality (good quality, bad quality), animal species, eye color, yes/no, etc.
The output of classification models is a probability for each class, e.g. `probability_Cat`, `probability_Dog`. The class probabilities always sum to 1, i.e. `probability_Dog + probability_Cat = 1.0`.
For common binary problems (0/1, True/False, Yes/No), only `probability` is returned, representing the probability of the positive case (1, True, Yes). For multiple classes, `probability` represents the probability of the predicted class.
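To make those outputs concrete, here is a small Python sketch (illustrative only, not Infer's code) showing that class probabilities sum to 1 and how `probability` relates to the predicted class:

```python
# Illustration of classification outputs: probabilities sum to 1, and
# `probability` reports the predicted (highest-probability) class.
# Not Infer's actual code - just the concept.

def summarise(probs):
    """probs: dict mapping class name -> probability."""
    assert abs(sum(probs.values()) - 1.0) < 1e-9  # probabilities sum to 1
    predicted = max(probs, key=probs.get)
    return predicted, probs[predicted]

# Binary example: probability_Dog + probability_Cat = 1.0
print(summarise({"Dog": 0.7, "Cat": 0.3}))  # ('Dog', 0.7)

# Multiclass example: `probability` is that of the predicted class
print(summarise({"red": 0.5, "white": 0.3, "rose": 0.2}))  # ('red', 0.5)
```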

**Regression** (`'reg'`) is prediction of a *continuous quantity*. Examples could include height, the value of a stock one month from now, or a quality score out of 10. The output of regression models is in the same units as the quantity you are trying to estimate, i.e. if you were predicting height (cm), the output would be the predicted height (cm).

By choosing `model='reg'` for predicting the quality of the wine, we treat the quality score as a continuous quantity. This means the output will not be an integer, even though the labels are integers (5, 6, 7...). Instead, predictions will be values like 5.6, 7.2, etc.
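To see why regression yields non-integer predictions, here is a minimal least-squares sketch in Python. The alcohol and quality values below are made up for illustration; they are not from the real dataset:

```python
# Minimal regression sketch (illustrative only): fitting a line to
# alcohol % vs quality shows why 'reg' predictions are continuous,
# even though the quality labels themselves are integers.

alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]  # made-up measurements
quality = [5, 5, 6, 6, 7]               # integer labels

n = len(alcohol)
mean_x = sum(alcohol) / n
mean_y = sum(quality) / n

# Ordinary least-squares slope and intercept
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(alcohol, quality)) \
        / sum((x - mean_x) ** 2 for x in alcohol)
intercept = mean_y - slope * mean_x

# Predict quality for a wine at 11.0% alcohol
prediction = intercept + slope * 11.0
print(round(prediction, 2))  # 6.12 - a continuous value, not an integer
```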

## Want to know more about the predictive model?

The XGBoost (`XGB`) algorithm is the default algorithm used in `PREDICT`, as it is a famously strong baseline model.
In the future, we plan to expand to other machine-learning algorithms.

If you'd like to learn more about the XGBoost algorithm, we highly recommend checking out our How Infer Works article:

https://docs.getinfer.io/docs/reference/how%20Infer%20works/predict_explain

# Getting Insights

We have a predictive model, and predictions from that model. How can we get insights using these new tools?

# Auto-visualisation with Infer

The Infer platform will automatically begin to visualise the most important relationships in your machine-learning model. The visualisation shown will depend on the specific command you run, your input data, and your predictive model. Basically, we do the hard work of figuring out the best way to represent your data and key insights, so you don't have to.

For `PREDICT`, we return scatter plots for continuous quantities and box plots for categorical quantities.
The first plot always shows the most important input to the model on the x-axis and the predictions themselves on the y-axis.

In our wine example, alcohol % is the most important feature for predicting the quality of the wine, so we show that first:

With these kinds of visualisations, it's possible to better understand the relationship between the most important features (e.g. alcohol) and your target prediction (e.g. quality of wine). It's quite clear from above that the machine learned that a higher alcohol percentage is associated with higher scores. Maybe if we want better quality wine, we just need more alcohol! 🍷

You can check out how other variables affect the prediction of wine quality by clicking on the column names. Have a play around and see what patterns you can find!

# Explainability

Our auto-visualisation tool is powered by something called "Explainable AI" (XAI, for short). It's possible to access our Explainable AI algorithms directly to find even more insights about your data.

Please continue to the next chapter to learn all about explainability within the Infer platform.