When is a left join not a left join? - Jacob Ferraiolo's Blog

Introduction

One of the most common issues I see, even amongst experienced analysts and engineers, is the misuse of left joins. This post deals with the proper usage of join conditions vs. where constraints in left joins. I’ve seen numerous cases where getting this wrong has cost companies thousands of dollars.

Creating sample data

First, let’s create some sample data. We will create a fact table called f_transactions and a dimension table called d_customer. If you aren’t familiar with fact vs. dimension tables, I recommend reading more about it here.

The d_customer table uses a SCDT2 (slowly changing dimension type 2) which just means that the fields can change over time. When this happens, new records are added with the new values. The effective date structure indicates which version of each record is active at any given time. Read more about slowly changing dimensions here.

-- Table to store transaction data
CREATE TABLE F_TRANSACTION (
  ID INTEGER,  -- Unique identifier for the transaction
  PRODUCT_ID INTEGER,  -- Identifier for the product involved in the transaction
  CUSTOMER_ID INTEGER,  -- Identifier for the customer involved in the transaction
  TRANSACTION_AMOUNT NUMERIC,  -- The amount of the transaction
  TRANSACTION_TIMESTAMP DATETIME  -- The timestamp of the transaction
);

-- Table to store customer details
CREATE TABLE D_CUSTOMER (
  ID INTEGER,  -- Unique identifier for the customer
  CUSTOMER_NAME VARCHAR(80),  -- Name of the customer
  STATE_OF_RESIDENCE CHAR(2),  -- State code where the customer resides
  ZIPCODE INTEGER,  -- ZIP code of the customer's residence
  EFFECTIVE_START_DATE DATE,  -- Start date for the customer record's validity
  EFFECTIVE_END_DATE DATE,  -- End date for the customer record's validity
  CURR_IN BOOLEAN  -- Flag indicating if the customer is currently active
);

Now that we have our table structure created, we can go ahead and add some sample data

  INSERT INTO F_TRANSACTION (ID, PRODUCT_ID, CUSTOMER_ID, TRANSACTION_AMOUNT, TRANSACTION_TIMESTAMP ) VALUES 
(101, 1, 201, 20.34, CURRENT_TIMESTAMP), 
(102, 2, 201, 12.34, datetime(CURRENT_TIMESTAMP, '-2 Day')),
(103, 3, 202, 34.97, datetime(CURRENT_TIMESTAMP, '-10 minutes')),
(104, 6, 203, 99.65, datetime(CURRENT_TIMESTAMP, '-10 minutes')),
(105, 2, 204, 25.36, datetime(CURRENT_TIMESTAMP, '-52 minutes')
);

INSERT INTO D_CUSTOMER (ID, CUSTOMER_NAME, STATE_OF_RESIDENCE, ZIPCODE, EFFECTIVE_START_DATE, EFFECTIVE_END_DATE, CURR_IN) VALUES 
(201, 'Jacob Ferraiolo', 'TX', 75324, '2019-01-15', '2022-01-01', FALSE), 
(201, 'Jacob Ferraiolo', 'VA', 23112, '2022-01-02', '9999-01-01', TRUE), 
(202, 'Luke McDonald', 'CA', 85532, '2020-01-01', '2023-10-31', FALSE),
(202, 'Lucas McDonald', 'CA', 32654, '2023-11-01', '9999-01-01', TRUE),
(203, 'Melody Cherry', 'CA', 25483, '2016-01-01', '9999-01-01', TRUE),
(204, 'Richard Folkes', 'CA', 32654, '2018-01-01', '2020-01-01', FALSE)
;

Querying the data

Great! We have our table structure and our data. Let’s start some analysis!

Let’s start by querying the transaction and customer data to see what they look like.

ID	PRODUCT_ID	CUSTOMER_ID	TRANSACTION_AMOUNT	TRANSACTION_TIMESTAMP
101	1	201	20.34	2024-04-09 01:38:51
102	2	201	12.34	2024-04-07 01:38:51
103	3	202	34.97	2024-04-09 01:28:51
104	6	203	99.65	2024-04-09 01:28:51
105	2	204	25.36	2024-04-09 00:46:51

ID	CUSTOMER_NAME	STATE_OF_RESIDENCE	ZIPCODE	EFFECTIVE_START_DATE	EFFECTIVE_END_DATE	CURR_IN
201	Jacob Ferraiolo	TX	75324	2019-01-15	2022-01-01	0
201	Jacob Ferraiolo	VA	23112	2022-01-02	9999-01-01	1
202	Luke McDonald	CA	85532	2020-01-01	2023-10-31	0
202	Lucas McDonald	CA	32654	2023-11-01	9999-01-01	1
203	Melody Cherry	CA	25483	2016-01-01	9999-01-01	1
204	Richard Folkes	CA	32654	2018-01-01	2020-01-01	0

We have 5 transactions for a total of $192.66. Let’s try and get aggregate transaction amount by state of residence of the customer. If we look at d_customer, we can see that there is no currently active customer_id = 204. This could be a data quality issue or some legitimate business reason. We don’t want to entirely exclude this analysis for our report, so let’s bucket it under ‘missing’ for now.

SELECT COALESCE(C.STATE_OF_RESIDENCE, 'Missing'), SUM(TRANSACTION_AMOUNT)
FROM F_TRANSACTION T
LEFT JOIN D_CUSTOMER C
  ON T.CUSTOMER_ID = C.ID
 WHERE CURR_IN
 GROUP BY COALESCE(C.STATE_OF_RESIDENCE, 'Missing');

Results:

State_of_Residence	Total_Transaction_Amount
CA	134.62
VA	32.68

The California and Virginia numbers look correct, but where did the transaction with missing customer details go?

If we revisit our code, we can see that we are filtering on curr_in. So, it will only return join results where the curr_in field is true – this does not include results where curr_in is null. So, we have essentially converted our LEFT JOIN into an inner join excluding all transactions that don’t have a matching active customer record.

So, how can we include the transaction with missing customer details?

SELECT COALESCE(C.STATE_OF_RESIDENCE, 'Missing') state_of_residence, SUM(TRANSACTION_AMOUNT) total_transaction_amount
FROM F_TRANSACTION T
LEFT JOIN D_CUSTOMER C
  ON T.CUSTOMER_ID = C.ID
  AND  CURR_IN
GROUP BY COALESCE(C.STATE_OF_RESIDENCE, 'Missing');

If we move the CURR_IN filter to the LEFT JOIN condition instead of the WHERE constraint, we will only be joining on d_customer records where curr_in is true and keeping all transactions.

State_of_Residence	Total_Transaction_Amount
CA	134.62
Missing	25.36
VA	32.68

Conclusion

In this post we have demonstrated a common issue when writing LEFT JOIN queries. We have shown that, when you have a LEFT JOIN with a WHERE constraint on a right-hand table, you are effectively writing an INNER JOIN.