CLUSTER

The CLUSTER command clusters the rows in the input data table into groups of rows that are similar.

The CLUSTER command takes no positional arguments, because by default it uses every column of the input data. Use the ignore option to exclude specific columns from clustering.

The output is a new field, cluster_id, which holds the cluster label for each row in the data. Columns giving the probability of each data point belonging to a specific cluster (probabilities_) are also returned.

You can read more about how CLUSTER works and how to get the best out of it in the tutorial on Clusterings.

Syntax

CLUSTER([min_cluster_size=<min_cluster_size>, ignore=<column_names>])

Options

  • ignore specifies columns (as a comma-separated list) that are returned by the SELECT statement but that CLUSTER should ignore.
  • min_cluster_size specifies the minimum size of a cluster.

Returns

Appends a new integer column named cluster_id to the input dataset. For each row, it identifies the cluster, or grouping, that the row belongs to.

Some points may be considered to lie outside any grouping; these are sometimes called noise. They are given the cluster_id -1.

One further column is appended to the input dataset for each cluster, with its name prefixed with probability_.

Examples

Clusters all the rows in the customer table where churn is TRUE.

SELECT * FROM customer WHERE churn=TRUE CLUSTER()
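
As a further sketch, both options can be combined in one call, following the syntax given above. The query below clusters the same rows while ignoring an identifier column and requiring a minimum cluster size of 50. The customer_id column and the value 50 are assumptions chosen for illustration; neither is taken from the customer table itself.

```sql
-- customer_id is a hypothetical identifier column; 50 is an illustrative minimum cluster size
SELECT * FROM customer WHERE churn=TRUE CLUSTER(min_cluster_size=50, ignore=customer_id)
```

Ignoring identifier-like columns is usually sensible because they carry no information about how similar two rows are.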