# Understanding dbt fundamentals
I finally have a reason to learn dbt properly. Here is my summary of the dbt Fundamentals course.
In this post, I summarise what I've learned from the dbt Fundamentals course created and shared for free by dbt Labs.
But first, perhaps some context is in order.
I was seeking a move upwards in my career. I told my manager, who was incredibly supportive (thank you, Natan!). I didn't know it then, but the timing was perfect, because a role was opening up for a new engineering manager.
But, you know, startups. One moment they need a new engineering manager, the next moment they need someone to lead a special project for ensuring we get our act together with our data warehouse so we can understand our business well.
So I found myself in that position: here's a big project, go figure it out, and you'll probably get a move upwards in your career.
This big project was all about data pipelines and SQL and dashboards, but I'd never led a data-related project of any scale before...
I knew, though, that I had a few colleagues – software engineers, not data engineers, surprisingly – who had a lot of experience with data. I could always ask them for help.
Except... that's not my style. I think it's irresponsible to be a dilettante. You have to learn the fundamentals and try to piece things together for yourself before you ask basic questions. Even Albert Einstein knew nothing about physics at one point.
And so, with a big project to deliver, I asked myself: what do I not understand? Here's the list:
- Financial domain concepts
- Risk domain concepts
- Operational workflow
- Our data pipeline, and where best to implement data transformations for the whole business – sales, marketing, finance, ops, risk, tech – to use
That last point about the data pipeline led me to dbt, which stands for data build tool.
I knew by this point in my career that dbt had become the de facto standard in the tech industry for codifying and building data models that are exposed to the rest of the business. I just didn't really know how to use it, other than what I could intuit from my software engineering knowledge. I could only rely on that for so long until I realised I was asking the same old questions again and again (a reliable signal that a learning intervention was due).
So, here are my main learnings from completing the dbt Fundamentals course. I combine them with existing knowledge of data pipelines that I've gained elsewhere to make this post a richer reference.
## Types of tables
I thought there were a lot of types of tables, but the course helped me realise that there are only two types that matter:
- source tables
- tables exposed in the data mart
Source tables contain raw data. You refer to them with the `source` macro, like so:

```sql
SELECT *
FROM {{ source('app_db', 'orders') }};
```
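For the `source` macro to resolve, the source itself needs to be declared in a YML file in your project. Here's a minimal sketch; the schema name is an assumption made to match the snippet above:

```yml
version: 2

sources:
  - name: app_db        # the first argument to source()
    schema: app_db      # assumption: raw tables land in a schema of the same name
    tables:
      - name: orders    # the second argument to source()
```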
Source tables are almost always normalised. That means data isn't repeated, and associations between tables are made using IDs. This also means that these tables are usually quite thin (not many columns) and long (loads of entries).
Tables exposed in the data mart contain transformed data, modelling business concepts like customers, orders, and so on. These are usually de-normalised, because we need them to have a lot of additional joined data for business users and data analysts to make use of immediately. This means that these tables are usually fat (loads of columns).
Typically, mart tables are prefixed with `dim_` or `fct_` or `agg_`.
- dim = dimension, referring to users, organisations, shopping carts, and generally any entity that is usually described as "who" or "what"
- fct = fact, referring to events, actions, or results from a process, like payments, emails sent, or user accounts created
- agg = ... frankly, unsure, as this wasn't covered. But I've seen this before in production at a few companies. Likely some aggregation on top of `dim_` or `fct_` tables would be my guess
That's it. That's the whole thing with dimensional modelling.
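To make this concrete, here's a sketch of what a fact model might look like (the model and column names are hypothetical): a `fct_orders` table that denormalises orders with their payment totals, using the `ref` macro to depend on upstream models:

```sql
-- Hypothetical mart model: fct_orders.sql
-- One row per order, denormalised with payment totals
SELECT
    o.order_id,
    o.customer_id,
    SUM(p.amount) AS total_amount
FROM {{ ref('stg_orders') }} AS o
LEFT JOIN {{ ref('stg_payments') }} AS p
    ON p.order_id = o.order_id
GROUP BY
    o.order_id,
    o.customer_id
```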
The other kinds of prefixes that I'd come across are for modularisation and composability. Tables with prefixes like `stg_` and `int_` are just conventions used by the dbt community for incrementally grouping, filtering, and transforming source tables into exposed mart tables.
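As a sketch, a staging model following that convention (the name `stg_app_db__orders` and the column names are hypothetical) might do nothing more than rename and cast columns from a source table:

```sql
-- Hypothetical staging model: stg_app_db__orders.sql
-- Renames and casts columns; no business logic yet
SELECT
    id                        AS order_id,
    user_id                   AS customer_id,
    CAST(created_at AS date)  AS ordered_date,
    status
FROM {{ source('app_db', 'orders') }}
```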
## What's the data mart? Where does it live?
Speaking of mart... what is it?
It's a conceptual thing. A data mart is what dbt folks use to refer to the place you go to grab ready-to-use data about the business (my definition). It's any number of `dim_`, `fct_`, and potentially `agg_` tables, ready to use.
The folks at dbt give a great analogy with flour (source) and cupcakes (mart):

> Just as eating raw flour isn’t that appetizing, neither is deriving insights from raw data since it rarely has a nice structure that makes it poised for analytics. There’s some considerable work that’s needed to organize data and make it usable for business users.
>
> This is where dimensional modeling comes into play; it’s a method that can help data folks create meaningful entities (cupcakes and cookies) to live inside their data mart (your glass display) and eventually use for business intelligence purposes (eating said cookies).
And where does the data mart live? Likely as part of your data warehouse.
So the data warehouse has a mix of source tables and mart tables. Why? Because dbt operates on the ELT paradigm, not ETL.
## ELT? Not ETL?
This is a helpful distinction to know, so let's talk about it.
The traditional paradigm for processing data until it becomes fit for consumption by business users and data analysts is ETL:
- Extract = take data from sources (e.g. your application database tables, Salesforce, etc.)
- Transform = change the data format, clean it up, filter out what you don't need, etc.
- Load = putting transformed data into the warehouse for querying
There is one big problem with this paradigm that became more obvious when cloud warehouses with flexibly scalable computational power and cheap storage came into existence: the process results in rigidity.
With ETL, to have something exposed, you're going to need to get a data engineer in a meeting, explain your requirements, and they go make changes in code for the Transform step. Cumbersome, slow, painful, costly.
But since products like Google's BigQuery came about, which is effectively data warehouse software as a service, a new paradigm became possible. That's the ELT paradigm, built on cheap storage and pay-as-you-use and flexibly scalable computing power.
In ELT, the Transform step is the last. E and L are done first, which is effectively saying "we take stuff from 10 different sources and load them into the warehouse, raw." Products like Fivetran automate this E and L step, freeing up time from data engineers (maybe you don't even need to hire them anymore... many startups don't).
What's left is the T, and dbt provides the toolset for making transformations pleasant and maintainable.
So, dbt – the "data build tool" – effectively covers the whole T step in the ELT paradigm.
(See this useful video of the Fivetran founder explaining ETL vs ELT. I went back to watch it a few times over the years to really understand it.)
## Most useful commands
There are two:

1. `dbt build`
2. `dbt run` + `dbt test`

The first runs both `dbt run` and `dbt test` (i.e. option 2) but with one crucial difference: it runs them in DAG order.
DAG = directed acyclic graph. It's a technically-accurate but convoluted way of referring to the flow of data from source to exposed tables.
`dbt run` will generate the models and fill them in with data from your sources.
`dbt test` will run all tests against the generated models.
If you run `dbt run` first, then run `dbt test`, what's effectively happening is this, in order:
- Generate models and fill in data
- Test against generated models and data
But because models are built upon previous tables in a lineage graph, this can be problematic. If a previous table generated had some issues, then downstream tables would, of course, have issues too.
What we usually want is to build confidence in our upstream tables and then move to downstream tables. That's why `dbt build` is better. Instead of running in this order:
- generate 1st model
- generate 2nd model that depends on 1st model
- test 1st model
- test 2nd model
It runs in this order:
- generate 1st model
- test 1st model
- generate 2nd model that depends on 1st model
- test 2nd model
The benefit to us as developers is that we know exactly where the problem lies, as far upstream as possible. It's the difference between an obscure error message and a clear one with a well-defined stack trace.
## Tests: Generic vs Singular
Generic tests are super simple to add against models in dbt. You just specify them in the model's YML file under the `tests` property:
```yml
version: 2

models:
  - name: stg_sheets_gmv_forecast__goals
    columns:
      - name: sales_channel
        description: Sales channel for the forecast
        tests:
          - not_null
          - accepted_values:
              values:
                - offline
                - ecommerce
```
With that, whenever you run `dbt test`, dbt will run a generic query against the table to ensure that the test assertion passes. It's applying software engineering to data.
Using the snippet above as an example, we're asserting that the `sales_channel` column in the `stg_sheets_gmv_forecast__goals` table should always satisfy `not_null`.
What dbt will do under the hood is write a simple – what some may call "generic" – SQL select statement to check for entries where the `sales_channel` value is null.
If any row returns, the test fails. If no row is returned, the test passes. I love how simple this is.
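Roughly, the compiled query looks like this (a sketch, not the exact SQL dbt generates, which also resolves the full schema-qualified table name):

```sql
-- Any rows returned here are test failures
SELECT sales_channel
FROM stg_sheets_gmv_forecast__goals
WHERE sales_channel IS NULL
```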
There are four generic tests provided by dbt:
- `not_null` = should never be null
- `unique` = should be unique across the whole table
- `accepted_values` = should only have a value from a predetermined list
- `relationships` = should have a corresponding associated entry in another table
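For instance, a `relationships` test looks like this in the model's YML (the model and column names here are hypothetical):

```yml
version: 2

models:
  - name: stg_orders          # hypothetical model
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```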
That's it with generic tests. Powerful and simple. Beautiful too, because they live in the model's YML, describing its schema exactly where you'd want to read about it.
Then there's the second type of test – "singular" tests.
From what I understand, the main difference between singular and generic tests in dbt is that you write your own SQL instead of letting dbt write it for you under the hood.
Say you want to ensure that every order has a positive or zero total payment amount in the `stg_payments` table. You write this SQL as a test:

```sql
SELECT
    order_id,
    SUM(amount) AS total_amount
FROM {{ ref('stg_payments') }}
GROUP BY order_id
HAVING SUM(amount) < 0;
```
Then dbt will execute this as part of `dbt test` and check that it doesn't return any rows. Singular tests are just `.sql` files saved in your project's `tests` directory; dbt picks them up automatically.
## Documenting data with dbt
One of the biggest problems with data in any organisation is the lack of shared meaning in column names and cell values.
What does "return_pending" mean in a `dim_orders` table?
What does the column `customer_id` mean in the same table? Why is there another, similarly named column `end_customer_id` in the same table?
To solve part of this problem, you can codify the meaning behind each column and its possible values with a `description` property in the model's YML file.
Here is an example:
```yml
version: 2

models:
  - name: dim_customers
    description: One record per customer
    columns:
      - name: customer_id
        description: Primary key
```
You can do this for every single column you have in every single table that dbt will generate.
The cherry on top is that dbt has a built-in feature for generating a documentation site that's friendly to relatively non-technical users. There, users of your data mart can clearly see the description of each column.
For situations where you want to, say, explain what each `status` value means, or where you need to be verbose, "doc blocks" allow you to describe a field with a markdown file. The markdown file looks like this:
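This is the standard dbt doc-block syntax; the block name `orders_status` matches the reference further down, while the statuses themselves are illustrative:

```md
{% docs orders_status %}
One of the following values:

- placed: order received, not yet shipped
- shipped: order on its way to the customer
- returned: order returned by the customer
{% enddocs %}
```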
And to refer to it, you use curly braces and the `doc(...)` macro in the model's YML file:
```yml
version: 2

models:
  - name: fct_orders
    description: This table has basic information about orders, as well as some derived facts based on payments
    columns:
      - name: status
        description: '{{ doc("orders_status") }}'
```
Hopefully you get the point that dbt is very useful and that their own dbt Fundamentals course is well worth your time.
I feel much more prepared now to deliver the big data project!