Why RIGHT SEMI JOIN Can Be Slower Than LEFT SEMI JOIN in Velox
· 5 min read
TL;DR
A query rewrite from A LEFT SEMI JOIN B to B RIGHT SEMI JOIN A can be much
slower in Velox, even though the two forms are semantically equivalent.
At first glance, this rewrite looks attractive when A is much smaller than
B: building on A should reduce build-side work and often helps regular hash
joins. In one real query, we made exactly this change expecting a speedup, but
instead observed roughly a 10x regression.
The root cause is execution asymmetry:
- RIGHT SEMI needs to mark build-side rows as matched.
- That marking (
setProbedFlag) is random-memory-write heavy. - With duplicate or skewed keys, redundant marking work grows quickly.
A targeted optimization for RIGHT SEMI FILTER without extra filter moves marking into probe-time hit traversal and adds early stop logic for duplicate chains, significantly reducing redundant work.
