Skip to main content

Churn Analysis

The purpose of churn analysis is to analyse and measure the rate at which customers are leaving your product. Depending on your product, the behavior that you want to analyse can take many shapes.

For example churn can be defined as:

  • a customer terminates their contract with you
  • a user deletes their account with you
  • a user uninstall your app from their devices
  • a customer does not return to your website or location

To analyse churn you first need to define and capture the behavior that feels most appropriate for your product. You then need to decide on the features that you feel would be most appropriate to analyse to understand the churn.

Once you have collected the data there are a few ways that you can go about analysing it to better understand churn.

  • Predicting the probability of churn for customers that are still active
  • Understanding what drives churn by explaining the prediction
  • Predicting potential losses from churn in the short term or by future periods
  • Understanding what drives churn by different data cuts, like country or demographics
  • Predicting potential losses by different data cuts, like country or demographics

In the below example we will use a demo data set to explain and describe these methods in more detail.

Example Data

In the following examples we will use data from the Kaggle Dataset Bank Customer Churn Dataset , which describes the customers of a bank, where some have churned and others are still active users.

For each user we have data on demographics and some high level bank details.

You can download the raw dataset that fits with below queries directly here.

Predicting churn

In our first examples we will predict the probability of churn for all points in the data set. We will then sort the output by the likelihood of churn.

The query predicts the probability of churn by learning from already churned customers; if customers haven’t churned yet, we can predict how likely they are to churn.

We input all columns from the data set Bank_Customer_Churn_Prediction but filter out the customer_id column from the prediction, using the ignore flag, since we do not believe it will contribute to the prediction but only add noise.

SELECT Bank_Customer_Churn_Prediction.*, prediction 
FROM Bank_Customer_Churn_Prediction
PREDICT(churn, ignore=customer_id) ORDER BY prediction DESC

The output column prediction is an output of the PREDICT method, the value being the predicted outcome; 1 if churn is predicted and 0 otherwise. The column probability is an output of the PREDICT method, giving the probability of the predicted outcome being positive (i.e. churn = 1).

Understanding churn

The next step of the analysis is to understand what drives the prediction to better understand what can be done to prevent churn or what signs to look out for.

To do this we extend the previous analysis by wrapping the prediction with an EXPLAIN. For each input column this will output a value between 0 and 1.0 indicating how influential that column is to the prediction of churn.

It uses all columns from the input dataset *, except 'customer_id', to learn about 'churn' and returns how important each column is to the prediction of churn.

SELECT * FROM Bank_Customer_Churn_Prediction 
EXPLAIN(PREDICT(churn, ignore=customer_id))

In this example we find that age is the main driver of churn.

Most likely high value customers to churn

Now that we understand how to predict churn and what drives it, let's look at who is most likely to churn that would significantly impact our business.

To do this we filter our prediction, firstly by who has more than 85% likely to churn and secondly by who has more than 100k in the account and hasn't churned yet.

Note that the inner SELECT statement uses all the data points, churned or not, to learn what and predict the probability of churn. We then in the outer statement filter out the accounts that are yet to churn and has a high balance.

In total, the query applies 3 filters to the previous "Predict churn" query:

  • Filter 1 - High probability of churn (>75%);
  • Filter 2 - High value customer with 100k+ in their account;
  • Filter 3 - Customers that haven’t yet churned
SELECT customer_id, balance FROM 
(SELECT * FROM Bank_Customer_Churn_Prediction PREDICT(churn, ignore=customer_id)
WHERE probability > 0.85 and prediction=1)
WHERE balance > 100000 and churn=0

Expected loss of balance per customer

Now that we understand how to predict churn we can define the notion of "expected loss" for a customer as the probability of them churning times their balance.

This gives an indication of the balance at risk for all accounts, not just high value or high churn probability ones.

We compute this by an estimate of the likely loss of balance of customers who haven't yet churned but are predicted to churn. We use balance times probability to calculate expected loss, rename the output to match and order by size of the expected loss.

SELECT customer_id, balance*probability AS expected_loss FROM 
(SELECT * FROM Bank_Customer_Churn_Prediction PREDICT(churn, ignore=customer_id))
WHERE churn=0 ORDER BY expected_loss DESC

Expected average loss of balance per country

Once we have defined the notion of "expected loss" on an account basis, we can aggregate these across accounts and get different views on the total or average expected loss by subsets of the data. This gives us an insight into which countries are most exposed.

In this case we compute the average expected loss per account by country. By averaging the expected loss we can compare across countries more easily. We could also have summed and gotten a view of the total expected loss for each country.

SELECT country, AVG(balance*probability) AS expected_loss FROM 
(SELECT * FROM Bank_Customer_Churn_Prediction PREDICT(churn, ignore=customer_id))
WHERE churn=0 GROUP BY country