Why NULLIF Must Be a Special Form

· 4 min read
Masha Basmanova
Software Engineer @ Meta
Peter Enescu
Software Engineer @ Meta
Bikram Singh
Software Engineer @ Meta

What is NULLIF?

NULLIF(a, b) is a SQL standard function that returns NULL if a equals b, otherwise returns a. It's commonly used to avoid division by zero (x / NULLIF(y, 0)) or to convert sentinel values to NULLs.
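The semantics can be sketched in a few lines of Python (a simplification that ignores SQL's three-valued logic; `None` stands in for NULL):

```python
def nullif(a, b):
    """Return NULL (None) when a equals b, otherwise return a."""
    return None if a == b else a

def safe_divide(x, y):
    """The classic x / NULLIF(y, 0) pattern: NULL instead of a division error."""
    d = nullif(y, 0)
    return None if d is None else x / d

print(nullif("N/A", "N/A"))   # None — sentinel value converted to NULL
print(safe_divide(10, 0))     # None — no division-by-zero error
print(safe_divide(10, 4))     # 2.5
```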

The Naive Implementation

The textbook rewrite is simple:

NULLIF(a, b) → IF(a = b, NULL, a)

This is exactly what Presto's SQL-to-plan translator does when running on Velox (Prestissimo). It works correctly for deterministic expressions. But there's a subtle, critical bug.

The Bug

Consider this query:

SELECT nullif(rand() < 0.5, true), count(*)
FROM unnest(sequence(1, 10))
GROUP BY 1;

nullif(x, true) should return NULL when x is true and FALSE when x is false. TRUE should never appear in the output.

But with the IF rewrite:

IF(rand() < 0.5 = true, NULL, rand() < 0.5)

The expression rand() < 0.5 appears twice in the expression tree. Velox evaluates each occurrence independently. When rand() returns different values for the two evaluations, you get impossible results. For example: the first rand() returns 0.7, so the condition 0.7 < 0.5 = true is FALSE — we take the else branch. But the second rand() returns 0.3, so the else value 0.3 < 0.5 is TRUE. The result is TRUE, which nullif(x, true) should never produce.
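The failure mode is easy to reproduce outside the engine. A small Python simulation of the rewrite (not Velox code) evaluates `rand() < 0.5` independently in the condition and in the else branch, just as the rewritten expression tree does:

```python
import random

def broken_nullif_rewrite():
    """Simulates IF(rand() < 0.5 = true, NULL, rand() < 0.5)."""
    if (random.random() < 0.5) == True:   # first, independent evaluation
        return None
    return random.random() < 0.5          # second, independent evaluation

# Over many rows, True shows up in the output — impossible under
# correct nullif(x, true) semantics.
results = {broken_nullif_rewrite() for _ in range(1000)}
```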

Actual results from Prestissimo:

 _col0 | _col1
-------+-------
 NULL  |     4
 true  |     4   ← should never happen
 false |     2

Why This Happens in Prestissimo

Velox has a common subexpression elimination (CSE) mechanism that can detect when the same expression appears multiple times and evaluate it only once. However, CSE correctly skips non-deterministic expressions — evaluating rand() once and reusing the result would change the semantics of queries that intentionally use multiple independent random values. So when Presto's planner duplicates rand() < 0.5 into both the condition and the else branch of IF, Velox evaluates each occurrence independently, producing different random values.
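This behavior can be illustrated with a minimal CSE sketch (names and structure are hypothetical, not Velox internals): deterministic expressions are cached by key, while non-deterministic ones are re-evaluated at every occurrence:

```python
import random

class Expr:
    def __init__(self, key, fn, deterministic=True):
        self.key, self.fn, self.deterministic = key, fn, deterministic

def evaluate_all(exprs):
    cache, out = {}, []
    for e in exprs:
        if e.deterministic and e.key in cache:
            out.append(cache[e.key])        # CSE: reuse the earlier result
        else:
            v = e.fn()                      # non-deterministic: always re-evaluate
            if e.deterministic:
                cache[e.key] = v
            out.append(v)
    return out

calls = {"n": 0}
def counted():
    calls["n"] += 1
    return 7

# The deterministic expression is evaluated once despite appearing twice...
evaluate_all([Expr("f(x)", counted), Expr("f(x)", counted)])
print(calls["n"])   # 1

# ...but two occurrences of rand() correctly stay independent.
r = evaluate_all([Expr("rand()", random.random, deterministic=False),
                  Expr("rand()", random.random, deterministic=False)])
```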

The root cause is that Velox doesn't have a native NULLIF implementation, forcing Prestissimo to use the IF rewrite. Adding NULLIF to Velox would eliminate the need for the rewrite entirely.

Why Not a Simple Rewrite or a Regular Function?

The IF rewrite is broken for non-deterministic expressions, as shown above. What about making NULLIF a regular function instead? A regular function receives pre-evaluated arguments, so there's no duplication problem.

But NULLIF has a type complication: the types of a and b may differ. Per the SQL spec, the comparison uses the common supertype (equal(cast(a as common_type), cast(b as common_type))), but NULLIF must return a's original value in its original type, not a cast-back version. For example, NULLIF(tinyint_col, bigint_col) casts both to BIGINT for comparison, but returns the original TINYINT value — not the BIGINT value cast back to TINYINT. A regular function would receive arguments already cast to a common type, losing a's original value.

One possible workaround: rewrite NULLIF(a, b) as cast(nullif_function(cast(a as common_type), cast(b as common_type)) as typeof(a)), where nullif_function is a regular same-typed function. The cast back to typeof(a) is lossless for integer, decimal, and varchar coercions since the value originated from the narrower type. However, when the common type is real or double (e.g., NULLIF(bigint_col, 1.5)), the round-trip can lose precision. In practice, comparing floating point values for equality is unreliable anyway, so rejecting NULLIF with a float/double common type at planning time may be acceptable.
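The lossy case is concrete: a 64-bit integer above 2^53 cannot be represented exactly as a double, so the cast-back changes the value. A quick check in Python, whose floats are IEEE 754 doubles:

```python
big = 2**53 + 1                  # 9007199254740993, a valid BIGINT
as_double = float(big)           # nearest representable double is 2**53
round_trip = int(as_double)

print(round_trip == big)         # False — the round-trip lost the value
print(round_trip)                # 9007199254740992

# Values up to 2**53 survive the round-trip exactly.
print(int(float(2**53)) == 2**53)   # True
```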

If exact value preservation is required in all cases, a special form is needed — a built-in expression type (like IF, COALESCE, TRY) that controls its own evaluation and type handling. Velox already has several special forms for similar reasons, where a construct's semantics cannot be expressed through simple function calls or rewrites.
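A special form fixes the bug by owning evaluation. In a toy expression interpreter (hypothetical names, not Velox classes), `NullIf` evaluates its first argument exactly once and reuses that single value for both the comparison and the result:

```python
import random

class Const:
    def __init__(self, v): self.v = v
    def eval(self): return self.v

class RandLessThanHalf:
    def eval(self): return random.random() < 0.5   # non-deterministic

class NullIf:
    """Special form: controls evaluation instead of receiving pre-evaluated args."""
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self):
        a = self.a.eval()                 # evaluated exactly once
        return None if a == self.b.eval() else a

# nullif(rand() < 0.5, true) can now only produce NULL or false, never true.
results = {NullIf(RandLessThanHalf(), Const(True)).eval() for _ in range(1000)}
print(True not in results)   # True
```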

See #17110 for more details and discussion.

The General Lesson

Any time a SQL construct needs to use the same expression value in multiple places, it cannot be safely rewritten into a tree that duplicates that expression. This applies whenever the expression is non-deterministic (rand(), uuid()) or has side effects.

NULLIF is not the only construct with this problem. The same class of bug affects CASE expressions with non-deterministic operands (see #17115).

The SQL standard defines NULLIF, COALESCE, and CASE with single-evaluation semantics. When implementing a SQL engine, these semantics must be preserved through special forms that control evaluation order and multiplicity.

From flaky Axiom CI to a Velox bug fix: a cross-repo debugging story

· 9 min read
Masha Basmanova
Software Engineer @ Meta

TL;DR

When adding macOS CI to Axiom, set operation tests kept failing intermittently — but only in macOS debug CI. Linux CI (debug and release) passed consistently. Local runs always passed. The root cause turned out to be a bug in Velox — a dependency managed as a Git submodule. This post describes the process of debugging a CI-only failure when the bug lives in a different repository.

Adaptive Per-Function CPU Time Tracking

· 6 min read
Rajeev Singh
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

Context

Velox evaluates SQL expressions as trees of functions. A query like if(array_gte(a, b), multiply(x, y), 0) compiles into a tree where each node processes an entire vector of rows at a time. When a query runs slowly, the first question usually is: which function is consuming the most CPU? Is it the expensive array comparison, or the cheap arithmetic called millions of times? This problem is even more prominent in use cases like training data loading, when very long and deeply nested expression trees are common, and jobs may run for many hours, or days; in such cases, the CPU usage of even seemingly short-lived functions may add up to substantial overhead. Without a detailed per-function CPU usage breakdown, you may be left guessing — or worse, optimizing the wrong thing.

Accelerating Unicode string processing with SIMD in Velox

· 8 min read
Ping Liu
Software Engineer
Yuhta
Software Engineer @ Meta
Masha Basmanova
Software Engineer @ Meta

TL;DR

We optimized two Unicode string helpers — cappedLengthUnicode and cappedByteLengthUnicode — by replacing byte-by-byte utf8proc_char_length calls with a SIMD-based scanning loop. The new implementation processes register-width blocks at a time: pure-ASCII blocks skip in one step, while mixed blocks use bitmask arithmetic to count character starts. Both helpers now share a single parameterized template, eliminating code duplication.

On a comprehensive benchmark matrix covering string lengths from 4 to 1024 bytes and ASCII ratios from 0% to 100%, we measured 2–15× speedups across most configurations, with no regressions on Unicode-heavy inputs. The optimization benefits all callers of these helpers, including the Iceberg truncate transform and various string functions.
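The idea behind the SIMD loop can be sketched in scalar Python: every UTF-8 byte except a continuation byte (`10xxxxxx`) starts a character, so a pure-ASCII block contributes its full width in one step, while a mixed block counts its non-continuation bytes (an illustrative sketch, not the Velox implementation):

```python
BLOCK = 8   # stands in for the SIMD register width

def count_char_starts(data: bytes) -> int:
    n, i = 0, 0
    while i + BLOCK <= len(data):
        block = data[i:i + BLOCK]
        if max(block) < 0x80:
            n += BLOCK                    # pure-ASCII block: skip in one step
        else:
            # count bytes that are not UTF-8 continuation bytes (10xxxxxx)
            n += sum((b & 0xC0) != 0x80 for b in block)
        i += BLOCK
    n += sum((b & 0xC0) != 0x80 for b in data[i:])   # tail bytes
    return n

print(count_char_starts("hello, world".encode()))   # 12
print(count_char_starts("héllo".encode()))          # 5 characters in 6 bytes
```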

The hidden traps of regex in LIKE and split

· 6 min read
Masha Basmanova
Software Engineer @ Meta

SQL functions sometimes use regular expressions under the hood in ways that surprise users. Two common examples are the LIKE operator and Spark's split function.

In Presto, split takes a literal string delimiter and regexp_split is a separate function for regex-based splitting. Spark's split, however, always treats the delimiter as a regular expression.

Both LIKE and Spark's split can silently produce wrong results and waste CPU when used with column values instead of constants. Understanding why this happens helps write faster, more correct queries — and helps engine developers make better design choices.
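The trap is easy to demonstrate with Python's `re` module, which mirrors Spark's treat-the-delimiter-as-a-regex behavior:

```python
import re

# The user means "split on a literal dot", but an unescaped "." is a
# regex wildcard that matches every character:
print(re.split(".", "a.b.c"))     # ['', '', '', '', '', '']
print(re.split(r"\.", "a.b.c"))   # ['a', 'b', 'c'] — escaped, as intended
```

When the delimiter comes from a column instead of a constant, it cannot be validated, escaped, or compiled once up front, which is where both the silent wrong results and the wasted CPU come from.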

velox::StringView API Changes and Best Practices

· 5 min read
Pedro Pedreira
Software Engineer @ Meta

Context

Strings are ubiquitously used in large-scale analytic query processing. From storing identifiers, names, labels, or structured data (like json/xml), to simply descriptive text, like a product description or the contents of this very blog post, there is hardly a SQL query that does not require the manipulation of string data.

This post describes in more detail how Velox handles columns of strings, the low-level C++ APIs involved and some recent changes made to them, and presents best practices for string usage throughout Velox's codebase.

Task Barrier: Efficient Task Reuse and Streaming Checkpoints in Velox

· 4 min read
Xiaoxuan Meng
Software Engineer @ Meta
Yuhta
Software Engineer @ Meta
Masha Basmanova
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

TL;DR

Velox Task Barriers provide a synchronization mechanism that not only enables efficient task reuse, important for workloads such as AI training data loading, but also delivers the strict sequencing and checkpointing semantics required for streaming workloads.

By injecting a barrier split, users guarantee that no subsequent data is processed until the entire DAG is flushed and the synchronization signal is unblocked. This capability serves two critical patterns:

  1. Task Reuse: Eliminates the overhead of repeated task initialization and teardown by safely reconfiguring warm tasks for new queries. This is a recurring pattern in AI training data loading workloads.

  2. Streaming Processing: Enables continuous data handling with consistent checkpoints, allowing stateful operators to maintain context across batches without service interruption.

See the Task Barrier Developer Guide for implementation details.

Why Sort is row-based in Velox — A Quantitative Assessment

· 8 min read
Meng Duan (macduan)
Software Engineer @ ByteDance
Xiaoxuan Meng
Software Engineer @ Meta

TL;DR

Velox is a fully vectorized execution engine[1]. Its internal columnar memory layout enhances cache locality, exposes more inter-instruction parallelism to CPUs, and enables the use of SIMD instructions, significantly accelerating large-scale query processing.

However, some operators in Velox utilize a hybrid layout, where datasets can be temporarily converted to a row-oriented format. The OrderBy operator is one example, where our implementation first materializes the input vectors into rows, containing both sort keys and payload columns, sorts them, and converts the rows back to vectors.

In this article, we explain the rationale behind this design decision and provide experimental evidence for it. We show a prototype of a hybrid sorting strategy that materializes only the sort-key columns, reducing the overhead of materializing payload columns. Contrary to expectations, the end-to-end performance did not improve — in fact, it was even slower. We present the two variants and discuss why one is counter-intuitively faster than the other.
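The row-materialization approach taken by the OrderBy operator can be sketched in Python: columns are zipped into rows (sort keys plus payload), the rows are sorted once, and the result is unzipped back into columns (a structural sketch, not Velox's actual row container):

```python
# columnar input: one list per column
key_col     = [3, 1, 2]
payload_col = ["c", "a", "b"]

# 1. materialize the columns into rows
rows = list(zip(key_col, payload_col))

# 2. sort the rows by the key
rows.sort(key=lambda r: r[0])

# 3. convert the rows back into columns
key_col, payload_col = (list(c) for c in zip(*rows))

print(key_col)       # [1, 2, 3]
print(payload_col)   # ['a', 'b', 'c']
```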

Multi-Round Lazy Start Merge

· 6 min read
Meng Duan (macduan)
Software Engineer @ ByteDance
Xiaoxuan Meng
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

Background

Efficiently merging sorted data partitions at scale is crucial for a variety of training data preparation workloads, especially for Generative Recommenders (GRs), a new paradigm introduced in the paper Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. A key requirement is to merge training data across partitions — for example, merging hourly partitions into daily ones — while ensuring that all rows sharing the same primary key are stored consecutively. Training data is typically partitioned and bucketed by primary key, with rows sharing the same key stored consecutively, so merging across partitions essentially becomes a multi-way merge problem.

Normally, Apache Spark can be used for this sort-merge requirement — for example, via CLUSTER BY. However, training datasets for a single job can often reach the PB scale, which in turn generates shuffle data at PB scale. Because we typically apply bucketing and ordering by key when preparing training data in production, Spark can eliminate the shuffle when merging training data from multiple hourly partitions. However, each Spark task can only read the files planned from the various partitions within a split sequentially, placing them into the sorter and spilling as needed. Only after all files have been read does Spark perform a sort-merge of the spilled files. This process produces a large number of small spill files, which further degrades efficiency.
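The multi-way merge itself is the classic heap-based algorithm. With sorted, key-bucketed partitions, a sketch using Python's `heapq.merge` shows how rows sharing a primary key end up consecutive without any shuffle:

```python
import heapq

# three sorted hourly partitions; each row is (primary_key, payload)
hour1 = [(1, "h1-a"), (3, "h1-b")]
hour2 = [(1, "h2-a"), (2, "h2-b")]
hour3 = [(2, "h3-a"), (3, "h3-b")]

# heap-based k-way merge: O(N log k), no shuffle required
daily = list(heapq.merge(hour1, hour2, hour3, key=lambda row: row[0]))
print([k for k, _ in daily])   # [1, 1, 2, 2, 3, 3] — equal keys are consecutive
```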