CLUSTER
The CLUSTER
command clusters the rows in the input data table into groups of rows that are similar.
The CLUSTER
command has no required inputs, since it can use all input data.
This can be adjusted by using the ignore
option to ignore certain columns when clustering, or target
to favour clusters to form around a certain behaviour.
The output is a new field cluster_id
, which outputs the label for the cluster of each row in the data.
Syntax
CLUSTER([, min_cluster_size=<min_cluster_size>, ignore=<column_names>, target=<column_name>])
Options
ignore
can be used to specify columns (as a comma separated list) returned by theSELECT
statement but which you want theCLUSTER
to ignore.min_cluster_size
can be used to specify the minimum size of a cluster.target
can be used to specify a column name around which clusters will be encouraged to form. This can be useful for scenarios where a specific attribute is of greater importance and should serve as a reference point for clustering.
Returns
Appends a new column to the input dataset named cluster_id
which has an integer value and describes for each row
what cluster, or grouping, that row belongs to.
Some points may be considered outside a grouping, sometimes called noise. These are given the cluster_id
-1.
Examples
Basic Example
SELECT * FROM customer CLUSTER()
In this basic example, all columns in the customer
table will be used for clustering without any specific conditions.
Using the target
Option
SELECT * FROM customer CLUSTER(target='age')
In this example, the target='age'
option encourages the clusters to form around the age
column, making it the central feature for similarity assessment.
Using the ignore
Option
SELECT * FROM customer CLUSTER(ignore='age,gender')
Here, the ignore='age,gender'
option means that the age
and gender
columns will be ignored during the clustering process, even though they are part of the original SELECT * FROM customer
statement.
Using the min_cluster_size
Option
SELECT * FROM customer CLUSTER(min_cluster_size=5)
In this example, the min_cluster_size=5
option ensures that any cluster formed must have at least 5 data points. Clusters with fewer than 5 data points would be considered as noise and their cluster_id
will be set to -1.
Combining Multiple Options
SELECT * FROM customer CLUSTER(target='income', ignore='gender', min_cluster_size=10)
Here, the target='income'
, ignore='gender'
, and min_cluster_size=10
options are combined to form clusters around the income
column, ignore the gender
column, and ensure that each cluster has at least 10 data points.