When is a left join not a left join?
Introduction
One of the most common issues I see, even amongst experienced analysts and engineers, is the misuse of left joins. This post deals with the proper usage of join conditions vs. where constraints in left joins. I’ve seen numerous cases where getting this wrong has cost companies thousands of dollars.
Creating sample data
First, let’s create some sample data. We will create a fact table called f_transactions and a dimension table called d_customer. If you aren’t familiar with fact vs. dimension tables, I recommend reading more about it here.
The d_customer table uses a SCDT2 (slowly changing dimension type 2) which just means that the fields can change over time. When this happens, new records are added with the new values. The effective date structure indicates which version of each record is active at any given time. Read more about slowly changing dimensions here.
-- Table to store transaction data
CREATE TABLE F_TRANSACTION (
ID INTEGER, -- Unique identifier for the transaction
PRODUCT_ID INTEGER, -- Identifier for the product involved in the transaction
CUSTOMER_ID INTEGER, -- Identifier for the customer involved in the transaction
TRANSACTION_AMOUNT NUMERIC, -- The amount of the transaction
TRANSACTION_TIMESTAMP DATETIME -- The timestamp of the transaction
);
-- Table to store customer details
CREATE TABLE D_CUSTOMER (
ID INTEGER, -- Unique identifier for the customer
CUSTOMER_NAME VARCHAR(80), -- Name of the customer
STATE_OF_RESIDENCE CHAR(2), -- State code where the customer resides
ZIPCODE INTEGER, -- ZIP code of the customer's residence
EFFECTIVE_START_DATE DATE, -- Start date for the customer record's validity
EFFECTIVE_END_DATE DATE, -- End date for the customer record's validity
CURR_IN BOOLEAN -- Flag indicating if the customer is currently active
);
Now that we have our table structure created, we can go ahead and add some sample data
INSERT INTO F_TRANSACTION (ID, PRODUCT_ID, CUSTOMER_ID, TRANSACTION_AMOUNT, TRANSACTION_TIMESTAMP ) VALUES
(101, 1, 201, 20.34, CURRENT_TIMESTAMP),
(102, 2, 201, 12.34, datetime(CURRENT_TIMESTAMP, '-2 Day')),
(103, 3, 202, 34.97, datetime(CURRENT_TIMESTAMP, '-10 minutes')),
(104, 6, 203, 99.65, datetime(CURRENT_TIMESTAMP, '-10 minutes')),
(105, 2, 204, 25.36, datetime(CURRENT_TIMESTAMP, '-52 minutes')
);
INSERT INTO D_CUSTOMER (ID, CUSTOMER_NAME, STATE_OF_RESIDENCE, ZIPCODE, EFFECTIVE_START_DATE, EFFECTIVE_END_DATE, CURR_IN) VALUES
(201, 'Jacob Ferraiolo', 'TX', 75324, '2019-01-15', '2022-01-01', FALSE),
(201, 'Jacob Ferraiolo', 'VA', 23112, '2022-01-02', '9999-01-01', TRUE),
(202, 'Luke McDonald', 'CA', 85532, '2020-01-01', '2023-10-31', FALSE),
(202, 'Lucas McDonald', 'CA', 32654, '2023-11-01', '9999-01-01', TRUE),
(203, 'Melody Cherry', 'CA', 25483, '2016-01-01', '9999-01-01', TRUE),
(204, 'Richard Folkes', 'CA', 32654, '2018-01-01', '2020-01-01', FALSE)
;
Querying the data
Great! We have our table structure and our data. Let’s start some analysis!
Let’s start by querying the transaction and customer data to see what they look like.
ID | PRODUCT_ID | CUSTOMER_ID | TRANSACTION_AMOUNT | TRANSACTION_TIMESTAMP |
---|---|---|---|---|
101 | 1 | 201 | 20.34 | 2024-04-09 01:38:51 |
102 | 2 | 201 | 12.34 | 2024-04-07 01:38:51 |
103 | 3 | 202 | 34.97 | 2024-04-09 01:28:51 |
104 | 6 | 203 | 99.65 | 2024-04-09 01:28:51 |
105 | 2 | 204 | 25.36 | 2024-04-09 00:46:51 |
ID | CUSTOMER_NAME | STATE_OF_RESIDENCE | ZIPCODE | EFFECTIVE_START_DATE | EFFECTIVE_END_DATE | CURR_IN |
---|---|---|---|---|---|---|
201 | Jacob Ferraiolo | TX | 75324 | 2019-01-15 | 2022-01-01 | 0 |
201 | Jacob Ferraiolo | VA | 23112 | 2022-01-02 | 9999-01-01 | 1 |
202 | Luke McDonald | CA | 85532 | 2020-01-01 | 2023-10-31 | 0 |
202 | Lucas McDonald | CA | 32654 | 2023-11-01 | 9999-01-01 | 1 |
203 | Melody Cherry | CA | 25483 | 2016-01-01 | 9999-01-01 | 1 |
204 | Richard Folkes | CA | 32654 | 2018-01-01 | 2020-01-01 | 0 |
We have 5 transactions for a total of $192.66. Let’s try and get aggregate transaction amount by state of residence of the customer. If we look at d_customer, we can see that there is no currently active customer_id = 204. This could be a data quality issue or some legitimate business reason. We don’t want to entirely exclude this analysis for our report, so let’s bucket it under ‘missing’ for now.
SELECT COALESCE(C.STATE_OF_RESIDENCE, 'Missing'), SUM(TRANSACTION_AMOUNT)
FROM F_TRANSACTION T
LEFT JOIN D_CUSTOMER C
ON T.CUSTOMER_ID = C.ID
WHERE CURR_IN
GROUP BY COALESCE(C.STATE_OF_RESIDENCE, 'Missing');
Results:
State_of_Residence | Total_Transaction_Amount |
---|---|
CA | 134.62 |
VA | 32.68 |
The California and Virginia numbers look correct, but where did the transaction with missing customer details go?
If we revisit our code, we can see that we are filtering on curr_in. So, it will only return join results where the curr_in field is true – this does not include results where curr_in is null. So, we have essentially converted our LEFT JOIN into an inner join excluding all transactions that don’t have a matching active customer record.
So, how can we include the transaction with missing customer details?
SELECT COALESCE(C.STATE_OF_RESIDENCE, 'Missing') state_of_residence, SUM(TRANSACTION_AMOUNT) total_transaction_amount
FROM F_TRANSACTION T
LEFT JOIN D_CUSTOMER C
ON T.CUSTOMER_ID = C.ID
AND CURR_IN
GROUP BY COALESCE(C.STATE_OF_RESIDENCE, 'Missing');
If we move the CURR_IN filter to the LEFT JOIN condition instead of the WHERE constraint, we will only be joining on d_customer records where curr_in is true and keeping all transactions.
State_of_Residence | Total_Transaction_Amount |
---|---|
CA | 134.62 |
Missing | 25.36 |
VA | 32.68 |
Conclusion
In this post we have demonstrated a common issue when writing LEFT JOIN queries. We have shown that, when you have a LEFT JOIN with a WHERE constraint on a right-hand table, you are effectively writing an INNER JOIN.