Clustering
Hello again! This is the second part of a tutorial on analysing usage behaviour of credit card holders. 💳 If you haven't completed the first part yet, we highly recommend you do that first. If you have, great! Let's continue.
In this tutorial we will learn how to use the CLUSTER
function to create customer segmentation - groups of users with similar behaviours.
What is Clustering?​
Clustering is a type of analysis that allows us to place similar data objects (users, companies, etc) into groups or clusters. This might be similar behaviours (e.g. power users vs casual users), properties (e.g. demographics), or any other kind of data (e.g. reviews of products, preferred content).
As we discussed in the similarity tutorial, the idea of similarity is fuzzy. We won't cover this discussion again, so please feel free to re-read before moving on if you need a refresher!
After we have defined what 'similar' means (which features we care about, e.g. behaviours, properties, or a mixture), we can begin to form our clusters.
There are many algorithms to perform clustering analysis. At Infer, we use a two-step process combining a dimensionality reduction technique (UMAP) and a 'hierarchical' clustering method (HDBSCAN). This method provides increasingly more detailed sets of clusters by splitting bigger clusters into smaller ones. This allows us to tune how detailed we want the clustering to be... more on this later!
A nice real-life example of hierarchical clustering can be found in evolutionary biology. All animals share common ancestors, some closer in time to others. Orangutans and gorillas were one and the same ~15-20 million years ago (the hominids), before diverging into different species. If you keep moving up the chain of evolutionary biology of gorillas, you'll find common ancestors with monkeys, lemurs, squirrels, rats, rabbits, and eventually dinosaurs! 🦖
In much the same way, we can define user behaviour in the same way - maybe we have a 'Daily Active User' cluster. This cluster can then split into 2 more clusters 'Night Time Users' and 'Day Time Users', and so on.
Using SQL-inf for Clustering​
Note that as we are currently in a beta launch, our Infer commands have been limited to using 10,000 randomly sampled rows instead of the entire dataset. This limitation will be lifted post-beta. This limitation does not apply to the DBT integration.
SQL-inf takes these state-of-the-art clustering techniques and turns them into a simple one-liner.
We use the same dataset as in the Similarity tutorial ("Credit Card Dataset for Clustering"), with table name credit_card
.
Next, we run our one-liner. Note that the CLUSTER
function does not require any input params, as it has no 'target', unlike in PREDICT
.
SELECT * FROM credit_card CLUSTER
and we get a clustering result!
We now see a new column, cluster_id
.
cluster_id
is just an identifier for the topic, with no special meaning to the order.
A cluster ID of -1
means that data point belongs to no clusters.
Auto-visualisation with Infer​
The Infer platform will automatically begin to visualise the Clustering Analysis for you.
This is currently a bit limited for beta and we're working on better ways to interact with clusters.
Deeper Analysis​
Filtering​
We can find deeper insights by probing the data via SQL-inf directly.
For example, if we want to look at all data points in a specific cluster we can append a
WHERE
statement to filter by e.g. cluster_id=2
.
SELECT * FROM credit_card CLUSTER where cluster_id=2
It's a little difficult to understand what's going on from just looking at the data. Let's see if we can do better...
Combining with EXPLAIN
​
We can explain all of our clusters by using our EXPLAIN
modifier command EXPLAIN(PREDICT(cluster_id))
:
SELECT * FROM credit_card EXPLAIN(CLUSTER)
From this we can see that our 5 clusters are explained to different degrees with different features. This is good! We want our clusters to have different behaviours.
We can also see that all of the clusters are predicted with high accuracy, so we can feel comfortable using the insights derived by these clusters.
From a glance, it seems the clusters are mostly defined by:
- Cluster 0: Balance, Balance frequency
- Cluster 1: Cash advances
- Cluster 2: One-off purchases, one-off purchase frequencies
- Cluster 3: Even mixture of many variables
- Cluster 4: Tenure
Let's zoom in on what's going on using PREDICT
:
SELECT * FROM (SELECT credit_card.*, inf.cluster_id FROM credit_card CLUSTER) PREDICT(cluster_id)
and we can see how one-off purchases are related to cluster 2, by creating a plot of probability_2
vs one_off_purchases
:
It seems like the more one-off purchases a user has, the more likely they are to belong to this cluster! We can probably dub this the 'one-off customer' cluster or segment. Neat!
Feel free to explore the clustering to see what is explaining the other segments!
Clustering Level​
If you find you are getting too many clusters, or not enough, you can change the minimum acceptable size of a cluster
with min_cluster_size
. By default, this is set to be 2% of the size of your dataset.
SELECT * FROM credit_card CLUSTER(min_cluster_size=5)
Wrapping Up Clustering​
By using our Clustering Analysis method we found that:
- There are 5 distinct high level clusters or segments that describe different behaviours in credit card usage.
- These 5 clusters could be well explained by the data.
- We explored cluster 2, which we found relates to users that do a lot of one-off purchasing.
We could use these insights to help build out a customer segmentation program for marketing, or for generally discovering usage patterns we didn't know about!
That's it for our clustering analysis tutorial. Now, go out and analyse your own data using these new tools and skills!