dbt
Generate API key
The first step is to sign in to Infer and generate your API key, which is needed to set up the dbt-infer adapter.
- Go to the "API & Integrations" section of the "Settings" page of your account.
- In the "API token" section, generate the token and copy it to the clipboard.
Install the dbt-infer adapter
The dbt-infer adapter is maintained via PyPI and installed with pip.
To install the latest dbt-infer package, simply run the following within the same shell as you run dbt.
pip install dbt-infer
Versioning of dbt-infer follows the standard dbt versioning scheme, meaning that if you are using dbt 1.2, the corresponding dbt-infer version will be named 1.2.x, where x is the latest minor version number.
If you haven't already, you need to have installed the adapter package for your underlying data warehouse.
For example, if your data warehouse is BigQuery, then you also need to have installed the appropriate dbt-bigquery package, e.g.
pip install dbt-bigquery
Configure dbt-infer
In order to set up dbt-infer, we need to add a target to your profiles.yml file. A dbt target contains information about your connection to the warehouse; see here for more info.
The dbt-infer adapter requires five fields to be filled out in a new target in your profiles.yml file:
- url: the URL of the Infer API
- username: your Infer API username
- apikey: your Infer API key
- data_config: the config for your underlying data warehouse, e.g. BigQuery or Snowflake
- type: the name of the adapter. Must be infer.
The schema to define a target using dbt-infer looks like this:
<profile-name>:
  target: <target-name>
  outputs:
    <target-name>:
      url: <infer-api-endpoint>
      username: <infer-api-username>
      apikey: <infer-apikey>
      type: infer
      data_config:
        <here goes your normal data warehouse config>
For example, if you are using BigQuery as your underlying data warehouse, the target might look like this:
my-bigquery-profile:
  target: infer-bigquery
  outputs:
    infer-bigquery:
      apikey: 1234567890abcdef
      username: [email protected]
      url: https://app.getinfer.io
      type: infer
      data_config:
        dataset: my_dataset
        job_execution_timeout_seconds: 300
        job_retries: 1
        keyfile: bq-user-creds.json
        location: EU
        method: service-account
        priority: interactive
        project: my-big-query-project
        threads: 1
        type: bigquery
All Done! Time for ML Analytics!
That's it! You can now use SQL-inf in your dbt models.
Using SQL-inf
You do not need to change anything in your existing dbt models when switching to SQL-inf – they will all work the same as before – but you now have the ability to use SQL-inf commands as native SQL functions.
To get you started, we will give a brief example of what such a model might look like. You can find more examples on our GitHub repo, with all of the examples in one place. Feel free to clone this repo to get started from scratch! Otherwise, follow our examples below.
In our simple example, we will show how to use a previous user_features model to predict churn by predicting the column has_churned in our new model predict_user_churn.
{{
    config(
        materialized = "table"
    )
}}

with predict_user_churn_input as (
    select * from {{ ref('user_features') }}
)
SELECT * FROM predict_user_churn_input PREDICT(has_churned, ignore=user_id)
As usual, the result of this is written into a new table, called predict_user_churn.
Note that we ignore user_id in the prediction.
This is because we believe the user_id should not influence our prediction of churn, so we remove it.
We also use the convention of pulling together the inputs for our prediction in a CTE, named predict_user_churn_input.
This is not required; you could do this within the final SELECT statement, but using CTEs is good practice as it makes the model easier to understand and edit.
How it works
The dbt-infer adapter splits the execution of your models into two paths:
Any models that do not use SQL-inf commands will be executed as normal within your data warehouse.
Models that take advantage of SQL-inf will be executed in 5 steps:
- The model query will be parsed by Infer. This will tell dbt-infer what data to fetch from your data warehouse.
- The necessary data will be fetched from your data warehouse.
- The SQL-inf commands will be executed within Infer.
- The results will be saved to a temporary table in your data warehouse.
- The remaining part of the dbt model will be executed as usual within your data warehouse.
The 5 steps of the SQL-inf execution can be visualised in the following way:
Security
The execution model of SQL-inf means that:
- Infer does not have access to your data warehouse, since it uses dbt and your existing data warehouse connection for that
- Infer does not store any of your data beyond the lifetime of an SQL-inf transaction
Furthermore, all data handled by Infer is encrypted and secured to the highest standards.
You can read more about the security of Infer here.
Limitations
- dbt-infer only supports SQL-inf commands in models that are table materialized, i.e. you cannot use SQL-inf commands in a model that is materialized as a view.
- dbt-infer currently does not work with dbt Cloud.
- SQL-inf cannot be used within CTEs; instead, materialise them into tables as the output of a model and then use that in downstream models, as sketched below.
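To make the CTE limitation concrete, here is a minimal sketch of the two-model pattern, reusing the predict_user_churn model from above as the SQL-inf model and adding a hypothetical downstream model named churn_report:

-- models/predict_user_churn.sql: the SQL-inf model, materialised as a table
{{ config(materialized = "table") }}

with predict_user_churn_input as (
    select * from {{ ref('user_features') }}
)
select * from predict_user_churn_input PREDICT(has_churned, ignore=user_id)

-- models/churn_report.sql: a hypothetical downstream model containing plain SQL only,
-- executed entirely within your data warehouse against the materialised prediction table
select *
from {{ ref('predict_user_churn') }}

The downstream model can filter, join or aggregate the prediction table like any other dbt model, since by that point the SQL-inf results already live in your warehouse.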
Examples of using SQL-inf with DBT
There are many ways in which you can use SQL-inf within your dbt pipelines to power ML analytics in your data stack. To illustrate some of the use cases, we have built a demo dbt project showcasing a few of the core SQL-inf commands in a complex, real-life setting, namely an e-commerce marketplace.
To learn about potential use cases have a look at our Use Cases page.
For more details about SQL-inf itself, check out the SQL-inf Syntax reference.
All of our examples are contained in a single project, which is available in our dbt-infer-examples repo on Github.
Structuring your SQL-inf models
We use the convention that if a model uses an SQL-inf command, we prefix the model name with the command it uses.
For example, we have a model named predict_customer_ltv, which uses the PREDICT SQL-inf command to predict the customer LTV.
As usual when developing with dbt, you should only build models for the things you want to persist.
When defining intermediary datasets it is often better to use CTEs that aren't materialised in the data layer.
Often when building models using SQL-inf you find yourself defining datasets that are only used as inputs to SQL-inf commands and nowhere else.
In these cases we define the inputs as a CTE defined within the model, and we like to use the convention that if the model using an SQL-inf command is called my_model, we call the CTE for the input data my_model_input.
To illustrate what that might look like, here is a simple example of a predictive model, which we call predict_customer_churn.
WITH predict_customer_churn_input AS (
    SELECT (CASE WHEN end_date IS NULL THEN True ELSE False END) AS churn,
        age,
        gender,
        location,
        balance
    FROM customers
)
SELECT * FROM predict_customer_churn_input PREDICT(churn)
Here we engineer a feature called churn, based on whether an end_date has been set for a given customer, which we then use as the target to predict the probability of customer churn.
Example Data
Throughout these examples we use the same data set.
It is based on two Kaggle datasets of customer and seller data from Olist, a Brazilian marketplace integrator, meaning they help smaller sellers connect and sell their products on larger marketplaces. The combined dataset contains 11 tables covering seller data, customer and purchase data, as well as reviews.
You can find the datasets here:
Example SQL-inf Models
Predictions and Explainability
Analysing Seller Conversion: predict_seller_conversion.sql | explain_seller_conversion.sql
This example illustrates how to define a model for predicting a binary event, like conversion. For each seller, a probability of conversion is predicted and returned.
The explain model then calculates and saves the drivers of the conversion prediction.
These values can be used to monitor the dynamics of conversion.
The same pattern can be used for predicting any other binary event, like churn, feature usage, etc.
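The prediction side of this pattern follows the same PREDICT shape as the earlier churn example. As a minimal sketch, assuming a hypothetical upstream seller_features model with a boolean converted column (the column and model names in the example repo may differ):

{{ config(materialized = "table") }}

with predict_seller_conversion_input as (
    select * from {{ ref('seller_features') }}
)
select * from predict_seller_conversion_input PREDICT(converted, ignore=seller_id)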
Analysing Seller LTV: predict_seller_ltv.sql | explain_seller_ltv.sql
Similar to the previous example, this example illustrates how to define a model for predicting a continuous variable, like LTV.
For each seller a predicted LTV is returned.
In this example, we NULL the LTV value of any sellers that signed up in the last 180 days, since we do not believe their LTV should impact our prediction, given how early they are in their customer journeys.
For the inputs to the prediction we focus on "demographic" data, like location, business type, acquisition channel, etc.
Like before, we pair the prediction with an explain model for understanding what drives the LTV prediction.
The same pattern can be used for predicting any other continuous variable, like sales, spend, etc.
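As a minimal sketch of the NULLing pattern described above, assuming a hypothetical seller_features model with signup_date and ltv columns (the exact column names and date arithmetic depend on your warehouse):

{{ config(materialized = "table") }}

with predict_seller_ltv_input as (
    select
        -- NULL the target for sellers that signed up within the last 180 days,
        -- so their early LTV does not influence the model while they still receive a prediction
        (case when signup_date > current_date - 180 then null else ltv end) as ltv,
        location,
        business_type,
        acquisition_channel
    from {{ ref('seller_features') }}
)
select * from predict_seller_ltv_input PREDICT(ltv)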
Analysing Customer LTV: predict_customer_ltv.sql | explain_customer_ltv.sql
Another example illustrating how to define a model for predicting and explaining a continuous variable, like LTV.
In this case we focus on product behaviours, like the number of items bought, the number of reviews posted, the average number of payment installments, etc., instead of demographic features, to predict the LTV.
Similarity and Segmentations
Finding Sellers most similar to highest LTV: find_most_similar_sellers.sql
Finds the sellers most similar to the seller with the highest LTV. This can easily be adapted to finding customers most similar to a particularly valuable or interesting customer. The same applies to finding the most similar leads, users, products, and so on.
Segmenting Products and Customers: segment_customers_sports_leisure.sql | segment_products_sports_leisure.sql
Two examples of how to use clustering, via CLUSTER, to cluster and segment customers and products by how similar they are.
In these examples we focus on the "sports and leisure" category to analyse the different segments of products and customers within this category.
The same technique can be used to analyse segments of sellers, users, leads and so on.
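As a minimal sketch, assuming a hypothetical customer_features model with a product category column to filter on; the CLUSTER call is shown without arguments here, so check the SQL-inf Syntax reference for the parameters it accepts (for example, the number of clusters):

{{ config(materialized = "table") }}

with segment_customers_sports_leisure_input as (
    select *
    from {{ ref('customer_features') }}
    where product_category = 'sports_leisure'
)
select * from segment_customers_sports_leisure_input CLUSTER()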
Text Analysis
Analysing Reviews: translate_reviews_sports_leisure.sql | sentiment_reviews_sports_leisure.sql | topics_reviews_sports_leisure.sql
Three examples of text analysis using SQL-inf. In all three examples we analyse reviews of "sports and leisure" products.
The first example shows how to translate text. In this case we translate Portuguese reviews into English to then enable the next two models to be applied.
The second example analyses the sentiment of the translated text. This can be used to understand the sentiment of reviews of particular products or to group products by the sentiment of their reviews.
Finally, the third example analyses the topics of the translated text. Each review is then assigned a topic based on the analysis.
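As a rough sketch of how the sentiment model might be structured, assuming a hypothetical translated review column and a SENTIMENT command that takes the text column as its argument, analogous to PREDICT; the actual command names and signatures are documented in the SQL-inf Syntax reference:

{{ config(materialized = "table") }}

with sentiment_reviews_sports_leisure_input as (
    -- the review_id and translated review text column names here are assumed
    select review_id, review_text_en
    from {{ ref('translate_reviews_sports_leisure') }}
)
select * from sentiment_reviews_sports_leisure_input SENTIMENT(review_text_en)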