When is a left join not a left join?
Introduction
One of the most common issues I see, even amongst experienced analysts and engineers, is the misuse of left joins. This post deals with the proper usage of join conditions vs. where constraints in left joins. I’ve seen numerous cases where getting this wrong has cost companies thousands of dollars.
Creating sample data
First, let’s create some sample data. We will create a fact table called f_transactions and a dimension table called d_customer. If you aren’t familiar with fact vs. dimension tables, I recommend reading more about it here.
The d_customer table uses a SCDT2 (slowly changing dimension type 2) which just means that the fields can change over time. When this happens, new records are added with the new values. The effective date structure indicates which version of each record is active at any given time. Read more about slowly changing dimensions here.
Now that we have our table structure created, we can go ahead and add some sample data
Querying the data
Great! We have our table structure and our data. Let’s start some analysis!
Let’s start by querying the transaction and customer data to see what they look like.
ID | PRODUCT_ID | CUSTOMER_ID | TRANSACTION_AMOUNT | TRANSACTION_TIMESTAMP |
---|---|---|---|---|
101 | 1 | 201 | 20.34 | 2024-04-09 01:38:51 |
102 | 2 | 201 | 12.34 | 2024-04-07 01:38:51 |
103 | 3 | 202 | 34.97 | 2024-04-09 01:28:51 |
104 | 6 | 203 | 99.65 | 2024-04-09 01:28:51 |
105 | 2 | 204 | 25.36 | 2024-04-09 00:46:51 |
ID | CUSTOMER_NAME | STATE_OF_RESIDENCE | ZIPCODE | EFFECTIVE_START_DATE | EFFECTIVE_END_DATE | CURR_IN |
---|---|---|---|---|---|---|
201 | Jacob Ferraiolo | TX | 75324 | 2019-01-15 | 2022-01-01 | 0 |
201 | Jacob Ferraiolo | VA | 23112 | 2022-01-02 | 9999-01-01 | 1 |
202 | Luke McDonald | CA | 85532 | 2020-01-01 | 2023-10-31 | 0 |
202 | Lucas McDonald | CA | 32654 | 2023-11-01 | 9999-01-01 | 1 |
203 | Melody Cherry | CA | 25483 | 2016-01-01 | 9999-01-01 | 1 |
204 | Richard Folkes | CA | 32654 | 2018-01-01 | 2020-01-01 | 0 |
We have 5 transactions for a total of $192.66. Let’s try and get aggregate transaction amount by state of residence of the customer. If we look at d_customer, we can see that there is no currently active customer_id = 204. This could be a data quality issue or some legitimate business reason. We don’t want to entirely exclude this analysis for our report, so let’s bucket it under ‘missing’ for now.
Results:
State_of_Residence | Total_Transaction_Amount |
---|---|
CA | 134.62 |
VA | 32.68 |
The California and Virginia numbers look correct, but where did the transaction with missing customer details go?
If we revisit our code, we can see that we are filtering on curr_in. So, it will only return join results where the curr_in field is true – this does not include results where curr_in is null. So, we have essentially converted our LEFT JOIN into an inner join excluding all transactions that don’t have a matching active customer record.
So, how can we include the transaction with missing customer details?
If we move the CURR_IN filter to the LEFT JOIN condition instead of the WHERE constraint, we will only be joining on d_customer records where curr_in is true and keeping all transactions.
State_of_Residence | Total_Transaction_Amount |
---|---|
CA | 134.62 |
Missing | 25.36 |
VA | 32.68 |
Conclusion
In this post we have demonstrated a common issue when writing LEFT JOIN queries. We have shown that, when you have a LEFT JOIN with a WHERE constraint on a right-hand table, you are effectively writing an INNER JOIN.