The CLUSTER command clusters the rows in the input data table into groups of rows that are similar.

The CLUSTER command takes no required inputs, since by default it uses all of the input data. Use the ignore option to exclude specific columns from clustering.

The output is a new field, cluster_id, which holds the cluster label for each row in the data. Columns are also returned with the probabilities (probabilities_) of each data point belonging to each cluster.

You can read more about how CLUSTER works and how to get the best out of it in the tutorial on Clusterings.


CLUSTER([, min_cluster_size=<min_cluster_size>, ignore=<column_names>])


  • ignore specifies columns (as a comma-separated list) that are returned by the SELECT statement but that CLUSTER should ignore.
  • min_cluster_size specifies the minimum size of a cluster.


Appends a new column named cluster_id to the input dataset. It holds an integer value that describes, for each row, which cluster, or grouping, that row belongs to.

Some points may be considered to lie outside any grouping; these are sometimes called noise. They are given the cluster_id -1.

A column prefixed with probability_ is also appended to the input dataset for each cluster.
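Because noise rows share the label -1, it usually helps to separate them before summarising the clusters. The following is a minimal Python sketch of that post-processing step, not part of the CLUSTER command itself. The rows, the customer field, and the helper names split_noise and cluster_sizes are all hypothetical and stand in for whatever result set you export.

```python
from collections import Counter

# Label that CLUSTER assigns to rows outside every grouping (noise).
NOISE = -1

# Hypothetical result rows: only cluster_id comes from the description above,
# the "customer" field and all values are made up for illustration.
rows = [
    {"customer": "a", "cluster_id": 0},
    {"customer": "b", "cluster_id": 1},
    {"customer": "c", "cluster_id": 0},
    {"customer": "d", "cluster_id": -1},
]


def split_noise(rows):
    """Return (clustered_rows, noise_rows) based on the cluster_id label."""
    clustered = [r for r in rows if r["cluster_id"] != NOISE]
    noise = [r for r in rows if r["cluster_id"] == NOISE]
    return clustered, noise


def cluster_sizes(rows):
    """Count rows per cluster, leaving noise rows out of the tally."""
    return Counter(r["cluster_id"] for r in rows if r["cluster_id"] != NOISE)


clustered, noise = split_noise(rows)
sizes = cluster_sizes(rows)
print("cluster sizes:", dict(sizes))
print("noise rows:", len(noise))
```

Keeping noise rows out of the size counts avoids treating -1 as if it were a real cluster.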


Clusters all the rows in the customer table where churn is TRUE.