Similarities

In this tutorial we will learn how to use the SIMILAR_TO function to find the customers that are most similar to each other, by analysing the usage behaviour of credit card holders. 💳

[Image: Similarity]

What is Similarity?

This is a surprisingly difficult question!

Broadly, two things are similar when they share the same features, characteristics, or behaviours, and are dissimilar when they do not.

Similarity is a spectrum. All green apples are similar to each other, but may differ in size, shade of green, sweetness, etc. 🍏 Red apples may be a different colour, but share many of the same properties as green apples. 🍎 What about blue oranges, how similar are they to green apples? Comparing apples to oranges is... difficult. 🤔

To answer this question, we need to define a quantitative view of similarity: let's say a score of 1.0 means two objects are identical, and 0.0 means they are complete opposites. Green and red apples get a high similarity, say 0.9. Green apples and blue oranges get a score of 0.3.

[Image: a similarity function scoring pairs of objects]

A similarity function (sim) takes in a pair of objects and determines how similar they are. Similar objects get a high score (green apples vs red apples) while dissimilar objects get a low score (green apples vs blue oranges).

This example gives us some initial intuition, but there are a few more things to think about.

If you want to understand similarity on a deeper level, see the next section. If not, you can skip it: Infer will do the difficult maths so you don't have to.

[Optional] Understanding Similarity

There are two elements to calculating the similarity score:

  1. The objects you are comparing, and their corresponding features (colour, size, sweetness)
  2. The similarity function (sim) itself.

Input Features

When we are comparing objects, we have to ask: what do we really care about comparing? Let's return to our red apples, green apples, and blue oranges.

Let's say we only really care about finding fruits that are the same colour. In this case, green apples, red apples, and blue oranges are all different! We might expect our similarity score to be 0.0. This is quite different to our previous suggestion that green apples and red apples are similar!

Which characteristics you choose can therefore vastly change your similarity scores.

Suppose we treat our colours as categorically different from each other. We can represent them mathematically using a technique known as 'one-hot encoding', which works something like this:

  • We have 3 colours, so let's make a three-dimensional vector (aka a list of numbers).
  • We say our first column represents green, second is red, and third is blue.
  • In each column, we say that a 1 indicates the colour is present and a 0 indicates it is not.
  • We can only put a 1 in one of these columns, because it isn't possible to be both red and green.

Our one-hot-encoded colours are shown below.

[Image: one-hot-encoded colour vectors]
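
To make this concrete, here's a minimal sketch in plain Python (just for illustration; Infer does all of this for you):

colours = ["green", "red", "blue"]

def one_hot(colour):
    # Build a vector with a 1 in the matching column and 0 everywhere else.
    return [1 if c == colour else 0 for c in colours]

print(one_hot("green"))  # [1, 0, 0]
print(one_hot("red"))    # [0, 1, 0]
print(one_hot("blue"))   # [0, 0, 1]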

Similarity Function

Now that we have represented our colours mathematically, we can apply a similarity function to them and determine how similar they are!

A commonly used similarity function is cosine similarity (closely related to cosine distance, which is simply one minus the similarity). If you can remember your trigonometry (I won't be mad if you don't!), the cosine of 90 degrees is equal to 0.0, and the cosine of 0 degrees is equal to 1.0.

So, if two colours (e.g. red compared to green) are completely dissimilar (similarity score = 0.0), that's like saying the colours are at 90 degrees to each other (cos(90) = 0.0). In machine learning, we say these vectors are orthogonal (at right angles to each other).

Likewise, identical colours (e.g. red compared to red) should have a similarity of 1.0. This is the same as saying that the colours are at 0 degrees relative to each other (cos(0) = 1.0).

In fact, if we apply the cosine similarity function to our three one-hot-encoded vectors representing colours, these are exactly the results we get:

  • red-red, green-green, blue-blue -> similarity score of 1.0.
  • red-green, red-blue, and green-blue -> similarity score of 0.0.
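
If you'd like to verify these scores yourself, here's a minimal sketch in plain Python. Cosine similarity is the dot product of two vectors divided by the product of their lengths:

import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

green, red, blue = [1, 0, 0], [0, 1, 0], [0, 0, 1]

print(cosine_similarity(red, red))    # 1.0 -- identical vectors
print(cosine_similarity(red, green))  # 0.0 -- orthogonal vectors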

Encodings

In the real world, we have shades of red, green, and blue. We even have multiple ways of encoding these colours: RGB, HSL, HSV, and many more.

Instead of one-hot encoding our colours, where each colour is a separate category, we can treat the colours as existing on a spectrum, like RGB:

[Image: colours encoded as RGB-style spectrum vectors]

Now, when we apply our cosine similarity function to these vectors, what will happen?

  • red-red, green-green, blue-blue -> similarity score of 1.0.
  • red-green, red-blue, and green-blue -> similarity scores of 0.16, 0.13, and 0.04 respectively.

Our similarity measure has become a bit more nuanced, with values between 0 and 1. We can now see that our shade of red is closest to green, and most dissimilar to blue.
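
Here's the same cosine_similarity helper from the sketch above, applied to some hypothetical shade vectors. The values below are made up, so the exact scores will differ from the illustration, but as above, this shade of red scores closer to green than to blue:

# Hypothetical [R, G, B]-style vectors for particular shades (made-up values).
# Reuses cosine_similarity from the previous sketch.
shade_red   = [0.9, 0.2, 0.1]
shade_green = [0.2, 0.8, 0.2]
shade_blue  = [0.1, 0.2, 0.9]

print(cosine_similarity(shade_red, shade_green))  # strictly between 0 and 1
print(cosine_similarity(shade_red, shade_blue))   # lower: these shades overlap less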

We can go one step further with encodings, and learn them directly using machine-learning techniques. In particular, Deep Learning allows us to learn very accurate encodings of many kinds of data, including images, text, and even video. This allows us to apply our similarity functions to find similar images or sentences with high accuracy.
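
As a taste of what learned encodings look like in practice, here is a sketch using the open-source sentence-transformers library. This is our own illustration, not part of Infer; the library and model name are choices we made for the example:

# Sketch: learned sentence encodings plus cosine similarity.
# Assumes `pip install sentence-transformers`; the model below is one
# popular choice, picked for illustration only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Green apples are crisp and sour.",
    "Red apples are sweet and crunchy.",
    "The invoice is due at the end of the month.",
]

# Each sentence becomes a learned vector (an embedding).
embeddings = model.encode(sentences)

# The two apple sentences should score much higher against each other
# than either does against the unrelated invoice sentence.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))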

These examples demonstrate how different kinds of encodings can impact your similarity measure... we said it was complicated!

Thankfully, at Infer we take care of picking the best encodings for you automatically, so you can get on with finding business insights, not doing hardcore maths.

Using SQL-inf for Similarity

SQL-inf takes these complicated similarity functions and simplifies them into one line of SQL. Let's see it in action.

First we upload our dataset ("Credit Card Dataset for Clustering"), which we've called credit_card, to the Infer platform. The CSV file can be found here. We can preview the dataset using Infer's dataset viewer, as shown below.

[Video: previewing the credit_card dataset in Infer's dataset viewer]

Next, we run our one-liner:

SELECT * FROM credit_card SIMILAR_TO(CUST_ID=C10001) ORDER BY similarity DESC

and we get a result! The SIMILAR_TO function takes in a column name and a value. It checks that the column and value identify a unique row, and then, bam! Similarity scores.

[Image: results table with the new similarity column]

We now see there is one new column, similarity, which we have ordered by to show the rows most similar to the selected ID. Of course, the top result is the input ID, because it is identical to itself. The second result is then the most similar row that is not identical. By eye, we can see that these two rows are indeed quite similar. Success!

You can now use SIMILAR_TO to find the most similar products, customers, users... anything at all!

Clustering

Similarity is great when you want to find the most or least similar products/customers/users. But similarity is also the basis for finding patterns in user behaviours or customer segmentations, which can be super useful for understanding and summarising your market audience.

In order to group similar users / customers / products together, we need to use Clustering algorithms. What are those? Continue to the next chapter to learn all about clustering within the Infer platform!