Not in vs. Outer join Performance
I was running an SQL query today and it was sooooo slow. So slow, in fact, that it never returned. I asked the DBA, Reed, who built the table what might be up, and he informed me that it was not indexed. And proceeded to show me some cool stuff I could do to actually get my query to return. In the end, it was a comparison between the “not in” operator and a “left join”.
My original query was thus, names changed to protect the innocent:
select count(*) from temp_legacy_attachments i where i.person_id not in ( select m.legacy_person_id from new_attachment a , person p where a.person_id = p.id);
I was trying to query the temp_legacy_attachments to get all rows that didn’t have a record in the new_attachments table. It never returned, and so Reed told me to give this one a try:
select count(*) from temp_legacy_attachments i left join ( select m.legacy_person_id from new_attachment a , person p where a.person_id = p.id) ea on ea.legacy_person_id = i.person_id where ea.legacy_person_id is null ;
So, instead of using “not in” a set, I select all the legacy rows, then outer join to the new_attachment rows and filter where a column on the new attachment set is null (it’s the smaller/less-available set).
I thought it was pretty sweet. No magic bullet, though, as Reed tells me that there is a fair amount of debate over the performance difference between the two methods. You just have to try it and find out. For me, in this case, the outer join was more awesome.
Another savvy DBA, Bill, has graced us with another method yet:
select sum(cnt) from ( select p.legacy_person_id , count(*) cnt from person p join temp_legacy_attachments i on i.person_id = p.legacy_person_id where not exists (select null from new_attachments a where a.person_id = p.id and a.created_by = 'LEGACY_MIGRATION') group by p.legacy_person_id)