
Generate API key

First, sign in to Infer to generate the API key you will need to set up the dbt-infer adapter.

  • Go to the "API & Integrations" section of the "Settings" page.
  • In the "API token" section, generate the token and copy it to the clipboard.

Install the dbt-infer adapter

The dbt-infer adapter is distributed via PyPI and installed with pip.

To install the latest dbt-infer package, run the following in the same environment where you run dbt.

pip install dbt-infer

Versioning of dbt-infer follows the standard dbt versioning scheme, meaning that if you are using dbt 1.2 the corresponding dbt-infer version will be 1.2.x, where x is the latest minor version number.

If you haven't already, you also need to install the adapter package for your underlying data warehouse. For example, if your data warehouse is BigQuery, you also need the dbt-bigquery package, e.g.

pip install dbt-bigquery

Configure dbt-infer

To set up dbt-infer, you need to add a target to your profiles.yml file. A dbt target contains information about your connection to the warehouse; see the dbt documentation on profiles for more info.

The dbt-infer adapter requires five fields to be filled out in a new target in your profiles.yml file:

  • url: the URL of the Infer API
  • username: your Infer API username
  • apikey: your Infer API key
  • data_config: the config for your underlying data warehouse, e.g. BigQuery or Snowflake
  • type: the name of the adapter. Must be infer.

The schema to define a target using dbt-infer looks like this:

<profile-name>:
  target: <target-name>
  outputs:
    <target-name>:
      url: <infer-api-endpoint>
      username: <infer-api-username>
      apikey: <infer-apikey>
      type: infer
      data_config:
        <here goes your normal data warehouse config>

For example, if you are using BigQuery as your underlying data warehouse the target might look like this:

my-bigquery-profile:
  target: infer-bigquery
  outputs:
    infer-bigquery:
      apikey: 1234567890abcdef
      username: [email protected]
      url: https://app.getinfer.io
      type: infer
      data_config:
        dataset: my_dataset
        job_execution_timeout_seconds: 300
        job_retries: 1
        keyfile: bq-user-creds.json
        location: EU
        method: service-account
        priority: interactive
        project: my-big-query-project
        threads: 1
        type: bigquery

All Done! Time for ML Analytics!

That's it! You can now use SQL-inf in your dbt models.


Using SQL-inf

You do not need to change anything in your existing dbt models when switching to SQL-inf: they will all work the same as before, but you now have the ability to use SQL-inf commands as native SQL functions.

To get you started, here is a brief example of what such a model might look like. You can find more examples in our GitHub repo, which collects all of the examples in one place. Feel free to clone this repo to get started from scratch, or follow the examples below.

In this simple example, we show how to use an existing user_features model to predict churn by predicting the column has_churned in a new model, predict_user_churn.

predict_user_churn.sql
{{
    config(
        materialized = "table"
    )
}}

with predict_user_churn_input as (
    select * from {{ ref('user_features') }}
)

SELECT * FROM predict_user_churn_input PREDICT(has_churned, ignore=user_id)

As usual, the result of this is written into a new table, called predict_user_churn.

Note that we ignore user_id in the prediction. This is because the user_id should not influence our prediction of churn, so we exclude it. We also follow the convention of pulling together the inputs for our prediction in a CTE, named predict_user_churn_input. This is not required (you could do this within the final SELECT statement), but using CTEs is good practice, as it makes the model easier to understand and edit.
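Because the predictions are persisted as a regular table, downstream dbt models can consume them like any other relation. As a minimal, hypothetical sketch (the downstream model name is an assumption for illustration, not part of the example above):

churned_users_report.sql
-- Hypothetical downstream model consuming the persisted predictions.
-- The exact prediction columns returned by PREDICT depend on SQL-inf,
-- so this sketch simply selects everything.
select * from {{ ref('predict_user_churn') }}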


How it works

The dbt-infer adapter splits the execution of your models into two steps:

  • Any models that do not use SQL-inf commands will be executed as normal within your data warehouse.

  • Models that take advantage of SQL-inf will be executed in 5 steps:

    • The model query will be parsed by Infer. This will tell dbt-infer what data to fetch from your data warehouse.
    • The necessary data will be fetched from your data warehouse.
    • The SQL-inf commands will be executed within Infer.
    • The results will be saved to a temporary table in your data warehouse.
    • The remaining part of the dbt model will be executed as usual within your data warehouse.

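To make the split concrete, here is a hypothetical model, with illustrative table and column names, annotated with where each step happens under this execution model:

predict_signup_conversion.sql
-- Hypothetical sketch; signup_features, converted and signup_id are illustrative names.
with predict_signup_conversion_input as (
    -- Steps 1-2: Infer parses the model and fetches only this data from your
    -- data warehouse, using your existing warehouse connection.
    select * from {{ ref('signup_features') }}
)

-- Steps 3-5: the SQL-inf command itself runs inside Infer, its results are
-- written to a temporary table in your warehouse, and the rest of the model
-- is then executed in the warehouse as usual.
select * from predict_signup_conversion_input PREDICT(converted, ignore=signup_id)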

Security

The execution model of SQL-inf means that:

  • Infer does not have access to your data warehouse, since it uses dbt and your existing data warehouse connection for that
  • Infer does not store any of your data beyond the lifetime of an SQL-inf transaction

Furthermore, all data handled by Infer is encrypted and secured to the highest standards.

You can read more about the security of Infer here.


Limitations

  • dbt-infer only supports SQL-inf commands in models that are materialized as tables, i.e. you cannot use SQL-inf commands in a model that is materialized as a view.
  • dbt-infer currently does not work with dbt Cloud.
  • SQL-inf cannot be used within CTEs. Instead, materialise the SQL-inf result as the output of its own model and then use that model in downstream models, as shown in the sketch below.
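A minimal sketch of that pattern, using hypothetical model and column names: the SQL-inf command lives in its own table-materialized model, and downstream models then ref() its output instead of wrapping PREDICT in a CTE.

predict_orders_returned.sql
-- Hypothetical upstream model, materialized as a table.
{{
    config(
        materialized = "table"
    )
}}

with predict_orders_returned_input as (
    select * from {{ ref('order_features') }}
)

select * from predict_orders_returned_input PREDICT(was_returned, ignore=order_id)

returned_orders_summary.sql
-- Hypothetical downstream model referencing the persisted predictions.
select * from {{ ref('predict_orders_returned') }}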

Examples of using SQL-inf with DBT

There are many ways in which you can use SQL-inf within your dbt pipelines to power ML analytics in your data stack. To illustrate some of the use cases, we have built a demo dbt project showcasing a few of the core SQL-inf commands in a complex, real-life setting, namely an e-commerce marketplace.

To learn about potential use cases have a look at our Use Cases page.

For more details about SQL-inf itself, check out the SQL-inf Syntax reference.

All of our examples are contained in a single project, which is available in our dbt-infer-examples repo on GitHub.

Structuring your SQL-inf models

We use the convention that if a model uses an SQL-inf command, we prefix the model name with the command it uses. For example, we have a model named predict_customer_ltv, which uses the PREDICT SQL-inf command to predict the customer LTV.

As usual when developing with dbt, you should only build models for the things you want to persist. When defining intermediary datasets, it is often better to use CTEs that aren't materialised in the data layer. When building models using SQL-inf, you often find yourself defining datasets that are only used as inputs to SQL-inf commands and nowhere else. In these cases we define the inputs as a CTE within the model, and we use the convention that if the model using an SQL-inf command is called my_model, we call the CTE for the input data my_model_input.

To illustrate what that might look like, here is a simple example of a predictive model, which we call predict_customer_churn.

models/predict_customer_churn.sql

WITH predict_customer_churn_input AS (
    SELECT (CASE WHEN end_date IS NULL THEN True ELSE False END) AS churn,
           age,
           gender,
           location,
           balance
    FROM customers
)

SELECT * FROM predict_customer_churn_input PREDICT(churn)

Here we engineer a feature called churn, based on whether an end_date has been set for a given customer, which we then use as the target to predict the probability of customer churn.

Example Data

Throughout these examples we use the same data set.

It is based on two Kaggle datasets of customer and seller data from Olist, a Brazilian marketplace integrator, meaning they help smaller sellers connect and sell their products on larger marketplaces. The combined dataset contains 11 tables covering seller data, customer and purchase data, as well as reviews.

You can find the datasets here:

Example SQL-inf Models

Predictions and Explainability

Analysing Seller Conversion: predict_seller_conversion.sql | explain_seller_conversion.sql

This example illustrates how to define a model for predicting a binary event, like conversion. For each seller, a probability of conversion is predicted and returned.

The explain model then calculates and saves the drivers of the conversion prediction. These values can be used to monitor the dynamics of conversion.

The same pattern can be used for predicting any other binary event, like churn, feature usage, etc.

Analysing Seller LTV: predict_seller_ltv.sql | explain_seller_ltv.sql

Similar to the previous example, this example illustrates how to define a model for predicting a continuous variable, like LTV. For each seller a predicted LTV is returned. In this example, we set the LTV value to NULL for any sellers that signed up in the last 180 days, since we do not believe their LTV should influence our prediction, given how early they are in their customer journeys.

For the inputs to the prediction we focus on "demographic" data, like location, business type, acquisition channel, etc.

Like before, we pair the prediction with an explain model for understanding what drives the LTV prediction.

The same pattern can be used for predicting any other continuous variable, like sales, spend, etc.
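A rough sketch of how that target could be constructed (the input model and column names are assumptions for illustration, not the actual repo model):

predict_seller_ltv.sql
-- Hypothetical sketch; seller_features and the column names are illustrative,
-- and the date arithmetic syntax depends on your warehouse.
with predict_seller_ltv_input as (
    select
        -- Null the LTV of sellers that signed up in the last 180 days so their
        -- early numbers do not influence the prediction target.
        case
            when signup_date > current_date - 180 then null
            else ltv
        end as ltv,
        location,
        business_type,
        acquisition_channel
    from {{ ref('seller_features') }}
)

select * from predict_seller_ltv_input PREDICT(ltv)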

Analysing Customer LTV: predict_customer_ltv.sql | explain_customer_ltv.sql

Another example illustrating how to define a model for predicting and explaining a continuous variable, like LTV.

In this case we focus on product behaviours, instead of demographic features, to predict the LTV, such as the number of items bought, the number of reviews posted, the average number of payment installments, etc.

Similarity and Segmentations

Finding Sellers most similar to highest LTV: find_most_similar_sellers.sql

Finds the most similar sellers to the seller with the highest LTV. This can easily be adapted to finding customers most similar to a particularly valuable or interesting customer, or to finding the most similar leads, users, products, and so on.

Segmenting Products and Customers: segment_customers_sports_leisure.sql | segment_products_sports_leisure.sql

Two examples of how to use clustering, CLUSTER, to cluster and segment customers and products by how similar they are. In these examples we focus on the "sport and leisure" category to analyse the different segments of products and customers within this category.

The same technique can be used to analyse segments of sellers, users, leads and so on.
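As a rough sketch of what such a segmentation model might look like (the input model, column names, and the exact CLUSTER parameters are assumptions; check the SQL-inf Syntax reference for the precise signature):

segment_customers_sports_leisure.sql
-- Hypothetical sketch, not the actual repo model.
{{
    config(
        materialized = "table"
    )
}}

with segment_customers_sports_leisure_input as (
    -- Assumed input model with per-customer behavioural features.
    select * from {{ ref('customer_features_sports_leisure') }}
)

-- Assumption: CLUSTER follows the same trailing-command pattern as PREDICT;
-- see the SQL-inf Syntax reference for its exact parameters.
select * from segment_customers_sports_leisure_input CLUSTER(ignore=customer_id)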

Text Analysis

Analysing Reviews: translate_reviews_sports_leisure.sql | sentiment_reviews_sports_leisure.sql | topics_reviews_sports_leisure.sql

Three examples of text analysis using SQL-inf. In all three examples we analyse reviews of "sports and leisure" products.

The first example shows how to translate text. In this case we translate Portuguese reviews into English so that the next two models can be applied.

The second example analyses the sentiment of the translated text. This can be used to understand the sentiment of reviews of particular products or to group products by the sentiment of their reviews.

Finally, the third example analyses the topics of the translated text. Each review is then assigned a topic based on the analysis.