CLUSTER command clusters the rows in the input data table into groups of rows that are similar.
CLUSTER command has no required inputs, since it can use all input data.
This can be adjusted by using the
ignore option to ignore certain columns when clustering, or
target to favour clusters to form around a certain behaviour.
The output is a new field
cluster_id, which outputs the label for the cluster of each row in the data.
You can read more about how
CLUSTER works and how to get the best out of it in the tutorial
CLUSTER([, min_cluster_size=<min_cluster_size>, ignore=<column_names>, target=<column_name>])
ignorecan be used to specify columns (as a comma separated list) returned by the
SELECTstatement but which you want the
min_cluster_sizecan be used to specify the minimum size of a cluster.
targetcan be used to specify a column name around which clusters will be encouraged to form. This can be useful for scenarios where a specific attribute is of greater importance and should serve as a reference point for clustering.
Appends a new column to the input dataset named
cluster_id which has an integer value and describes for each row
what cluster, or grouping, that row belongs to.
Some points may be considered outside a grouping, sometimes called noise. These are given the
SELECT * FROM customer CLUSTER()
In this basic example, all columns in the
customer table will be used for clustering without any specific conditions.
SELECT * FROM customer CLUSTER(target='age')
In this example, the
target='age' option encourages the clusters to form around the
age column, making it the central feature for similarity assessment.
SELECT * FROM customer CLUSTER(ignore='age,gender')
ignore='age,gender' option means that the
gender columns will be ignored during the clustering process, even though they are part of the original
SELECT * FROM customer statement.
SELECT * FROM customer CLUSTER(min_cluster_size=5)
In this example, the
min_cluster_size=5 option ensures that any cluster formed must have at least 5 data points. Clusters with fewer than 5 data points would be considered as noise and their
cluster_id will be set to -1.
Combining Multiple Options
SELECT * FROM customer CLUSTER(target='income', ignore='gender', min_cluster_size=10)
min_cluster_size=10 options are combined to form clusters around the
income column, ignore the
gender column, and ensure that each cluster has at least 10 data points.