CLUSTER
The CLUSTER command clusters the rows in the input data table into groups of similar rows.
The CLUSTER command takes no required arguments, since by default it uses all columns of the input data.
This can be adjusted by using the ignore option to exclude certain columns from the clustering.
The output is a new field, cluster_id, which holds the cluster label for each row in the data.
We also return columns (prefixed probabilities_) giving the probability of each data point belonging to a specific cluster.
You can read more about how CLUSTER works and how to get the best out of it in the tutorial on Clusterings.
Syntax
CLUSTER([min_cluster_size=<min_cluster_size>, ignore=<column_names>])
Options
ignore can be used to specify columns (as a comma-separated list) that are returned by the SELECT statement but which you want CLUSTER to ignore.
min_cluster_size can be used to specify the minimum size of a cluster.
Returns
Appends a new column named cluster_id to the input dataset. It holds an integer value that describes, for each row, which cluster (or grouping) that row belongs to.
Some points may be considered to lie outside any grouping; these are sometimes called noise and are given the cluster_id -1.
One column per cluster, prefixed with probability_, is also appended to the input dataset, giving the probability of each row belonging to that cluster.
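As a rough illustration of how the returned columns might be consumed downstream, the Python sketch below counts rows per cluster and sets aside the noise rows (cluster_id -1). It assumes the query result has been exported as a list of dictionaries, one per row; that export format, the sample values, the customer_id column, and the numeric suffixes on the probability columns are all hypothetical and are not part of CLUSTER itself.

```python
from collections import Counter

NOISE_ID = -1  # per the Returns section, noise rows get cluster_id -1


def summarize_clusters(rows):
    """Count rows per cluster and report the number of noise rows separately.

    `rows` is a list of dicts, one per output row, each with an integer
    `cluster_id` key (a hypothetical export format, not part of CLUSTER).
    """
    counts = Counter(row["cluster_id"] for row in rows)
    noise = counts.pop(NOISE_ID, 0)  # remove noise from the per-cluster counts
    return dict(counts), noise


def probability_columns(row, prefix="probability_"):
    """Return the per-cluster probability fields of one row, found by prefix."""
    return {key: value for key, value in row.items() if key.startswith(prefix)}


# Made-up sample rows; the probability column suffixes are assumed.
rows = [
    {"customer_id": 1, "cluster_id": 0, "probability_0": 0.9, "probability_1": 0.1},
    {"customer_id": 2, "cluster_id": 0, "probability_0": 0.8, "probability_1": 0.2},
    {"customer_id": 3, "cluster_id": 1, "probability_0": 0.3, "probability_1": 0.7},
    {"customer_id": 4, "cluster_id": -1, "probability_0": 0.0, "probability_1": 0.0},
]

sizes, noise = summarize_clusters(rows)
print(sizes)  # {0: 2, 1: 1}
print(noise)  # 1
```

Keeping the noise count separate avoids treating -1 as if it were a real cluster. The prefix is a parameter so it can be changed to match the column names that actually appear in your output.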
Examples
Clusters all the rows in the customer table where churn is TRUE.
SELECT * FROM customer WHERE churn=TRUE CLUSTER()