CLUSTER

The CLUSTER command clusters the rows in the input data table into groups of rows that are similar.

The CLUSTER command has no required inputs, since it can use all input data. This can be adjusted by using the ignore option to ignore certain columns when clustering, or target to favour clusters to form around a certain behaviour.

The output is a new field cluster_id, which outputs the label for the cluster of each row in the data.

Syntax

CLUSTER([, min_cluster_size=<min_cluster_size>, ignore=<column_names>, target=<column_name>])

Options

ignore can be used to specify columns (as a comma separated list) returned by the SELECT statement but which you want the CLUSTER to ignore.
min_cluster_size can be used to specify the minimum size of a cluster.
target can be used to specify a column name around which clusters will be encouraged to form. This can be useful for scenarios where a specific attribute is of greater importance and should serve as a reference point for clustering.

Returns

Appends a new column to the input dataset named cluster_id which has an integer value and describes for each row what cluster, or grouping, that row belongs to.

Some points may be considered outside a grouping, sometimes called noise. These are given the cluster_id -1.

Examples

Basic Example

SELECT * FROM customer CLUSTER()

In this basic example, all columns in the customer table will be used for clustering without any specific conditions.

Using the `target` Option

SELECT * FROM customer CLUSTER(target='age')

In this example, the target='age' option encourages the clusters to form around the age column, making it the central feature for similarity assessment.

Using the `ignore` Option

SELECT * FROM customer CLUSTER(ignore='age,gender')

Here, the ignore='age,gender' option means that the age and gender columns will be ignored during the clustering process, even though they are part of the original SELECT * FROM customer statement.

Using the `min_cluster_size` Option

SELECT * FROM customer CLUSTER(min_cluster_size=5)

In this example, the min_cluster_size=5 option ensures that any cluster formed must have at least 5 data points. Clusters with fewer than 5 data points would be considered as noise and their cluster_id will be set to -1.

Combining Multiple Options

SELECT * FROM customer CLUSTER(target='income', ignore='gender', min_cluster_size=10)

Here, the target='income', ignore='gender', and min_cluster_size=10 options are combined to form clusters around the income column, ignore the gender column, and ensure that each cluster has at least 10 data points.