Skip to main content

Examples of using SQL-inf with DBT

There are many ways in which you can use SQL-inf within your dbt pipelines to power ML analytics in your data stack. To illustrate some of the use cases we have build a demo dbt project showcasing a few of the core SQL-inf commands in a complex, real-life setting, namely an e-commerce marketplace.

To learn about potential use cases have a look at our Use Cases page.

For more details about SQL-inf itself, checkout out the SQL-inf Syntax reference.

All of our examples are contained in a single project, which is available in our dbt-infer-examples repo on Github.

Structuring your SQL-inf models

We use the convention that if a model uses an SQL-inf command we prefix the model name with the command it uses. For example, we have a model named predict_customer_ltv, which uses the PREDICT SQL-inf comamnd to predict the customer ltv.

As usual when developing with DBT, you should only build models for the things you want to persist. When defining intermediary datasets it is often better to use CTEs that aren't materialised in the data layer. Often when building models using SQL-inf you find yourself defining datasets that are only used as inputs to SQL-inf commands and nowhere else. In these cases we define the inputs as a CTE define within the model and we like to use the convention that if the model using an SQL-inf command is called my_model we call the CTE for the input data my_model_input.

To illustrate what that might look like, here is a simple example of a predictive model, which we call predict_customer_churn.

models/predict_customer_churn.sql

WITH predict_customer_churn_input AS (
SELECT (CASE WHEN end_date is NULL THEN True ELSE False THEN) as churn,
age,
gender,
location,
balance
FROM customers
)

SELECT * FROM predict_customer_churn_input PREDICT(churn)

Here we engineer a feature called churn, based on whether an end_date has been set for a given customer, which we then use as the target to predict the probability of customer churn.

Example Data

Throughout these examples we use the same data set.

It is based on two Kaggle datasets of customer and seller data from Olist, a brazilian marketplace integrator - meaning they help smaller sellers connect and sell their products on larger marketplaces. The combined dataset contains 11 tables containing seller data, customer and purchase data as well as reviews.

You can find the datasets here:

Example SQL-inf Models

Predictions and Explainability

Analysing Seller Conversion: predict_seller_conversion.sql | explain_seller_conversion.sql

Example illustrates how to define a model for predicting a binary event, like conversion. For each seller a probability of churn is predicted and returned.

The explain model then calculates and saves the drives of the conversion predict. These values can be used to monitor the dynamics of conversion.

The same pattern can be used for predicting any other binary event - like churn, feature usage etc

Analysing Seller LTV: predict_seller_ltv.sql | explain_seller_ltv.sql

Similar to the previous example, this example illustrates how to define a model for predicting a continuous variable, like LTV. For each seller a predicted LTV is returned. In this example, we NULL the LTV value of any sellers that signed up in the last 180 days, since we do not believe their LTV should impact our prediction - given how early they are in their customer journeys.

For the inputs to the prediction we focus on "demographic" data, like location, business type, acquisition channel etc

Like before we pair the prediction with an explain model for understanding what drives the LTV prediction.

The same pattern can be used for predicting any other continuous variable - like sales, spend etc

Analysing Customer LTV: predict_customer_ltv.sql | explain_customer_ltv.sql

Another example illustrating how to define a model for predicting and explaining a continuous variable, like LTV.

In this case we focus on product behaviors, instead of demographic features, to predict the LTV - like number of items bought, number of reviews posted, average number of payment installments etc

Similarity and Segmentations

Finding Sellers most similar to highest LTV: find_most_similar_sellers.sql

Finds the most similar sellers to the seller with the highest LTV. This can easily be adapted to finding customers most similar to a particularly valuable or interesting customers. Similarly, with finding most similar leads, users, products and so on.

Segmenting Products and Customers: segment_customers_sports_leisure.sql | segment_products_sports_leisure.sql

Two examples of how to use clustering, CLUSTER, to cluster and segment customers and products by how similar they are. In these examples we focus on the "sport and leisure" category to analyse the different segments of products and customers within this category.

The same technique can be used to analyse segments of sellers, users, leads and so on.

Text Analysis

Analysing Reviews: translate_reviews_sports_leisure.sql | sentiment_reviews_sports_leisure.sql | topics_reviews_sports_leisure.sql

Three examples of text analysis using SQL-inf. In all three examples we analyse reviews of "sports and leisure" products.

The first example shows how to translate text. In this case we translate portuguese reviews into english to then enable the next two models to be applied.

The second example analyses the sentiment of the translated text. This can be used to understand the sentiment of reviews of particular products or to group products by the sentiment of their reviews.

Finally, the third example analyses the topics of the translated text. Each review is then assigned a topic based on the analysis.