Skip to main content

29 posts tagged with "tech-blog"

View All Tags

From copyBits to SIMD: Accelerating Parquet DELTA Decoding in Velox

· 7 min read
Shaojie Li
Software Engineer
Masha Basmanova
Software Engineer @ Meta
Ping Liu
Software Engineer

TL;DR

Velox's Parquet DELTA_BINARY_PACKED decoders exhibit approximately 8x higher CPU cost relative to the PLAIN and dictionary-encoded paths. On a TPC-H Q12 scan over a DELTA-encoded lineitem table, decoding accounts for approximately 5.0 s of CPU time, compared with 0.6 s for an equivalent dataset stored under PLAIN or dictionary encoding.

To address this performance disparity, two optimizations are contributed: first, replacing a general-purpose bit-copy routine with a single unaligned memory load, eliminating per-value function-call overhead; second, introducing a batched decode path that leverages SIMD bit-unpack kernels to amortize instruction costs across multiple values.

Together, these optimizations reduce Q12 scan CPU time from 5.0 s to 1.1 s (a 4.5x improvement) and end-to-end wall-clock time from 1.71 s to 0.59 s (a 2.9x speedup).

Background: DELTA encoding

DELTA_BINARY_PACKED is Parquet's delta encoding for integers. Rather than storing values, it stores the differences between consecutive values, then bit-packs them. Because real integer columns are often sorted or slowly varying (IDs, timestamps, dates, dictionary keys), the deltas are small and pack into very few bits, so DELTA files are compact. The encoding is also reused internally: DELTA_LENGTH_BYTE_ARRAY decodes string lengths through one delta decoder, and DELTA_BYTE_ARRAY decodes prefix and suffix lengths through two, so the integer decoder sits on the string read path too.

Problem

Velox's Parquet writer can opt into DELTA (WriterOptions::encoding = arrow::Encoding::kDeltaBinaryPacked, off by default). Re-encoding TPC-H lineitem as DELTA and running Q12 at SF10, the lineitem scan that costs ~0.6 s of CPU on PLAIN/dictionary data cost ~5.0 s on DELTA. Profiling pinned ~87% of the scan CPU on DELTA decoding itself, making it the clear optimization target.

Following the profile: a general-purpose bit-copy

The profile pointed straight at DeltaBpDecoder::readLong, and specifically at the call it used to extract each bit-packed delta:

bits::copyBits(
reinterpret_cast<const uint64_t*>(bufferStart_),
consumedBits, // source bit offset
reinterpret_cast<uint64_t*>(&value),
0, // destination bit offset, always 0
deltaBitWidth_); // at most 64

bits::copyBits is a fully general bit-range memmove: arbitrary source and destination offsets, arbitrary length, looping byte by byte. The destination offset is always 0 and the width is at most 64, so the call reduces to a single unaligned 64-bit load and shift (bits::detail::loadBits), plus a low-bits mask.

readLong runs once per value, so collapsing a byte-loop to one load yields a significant improvement: on its own it took Q12 scan CPU from 5.0 s to 2.8 s and wall time 1.71 s to 1.02 s.

Still 4.5× off: the unpack was still scalar

The single-load fix improved throughput, but DELTA scan CPU was still ~4.5× the PLAIN/dictionary baseline. Two costs remained: the bit-unpack still handled one value at a time, and the read path called the column visitor's process() once per row, paying per-row dispatch even when no filter was active. Both required separate solutions.

Scalar per-value decoding vs batched SIMD decoding of a miniblock

Vectorizing the unpack

Every value in a miniblock shares the same bit width, which makes the unpack a tight, predictable loop, a natural SIMD target. The solution is decodeMiniBlockSimd, a kernel templated on a compile-time bitWidth (shifts and masks become constants) and dispatched at runtime by an index-sequence fold. The strategy depends on the width:

bit widthstrategy
0constant-delta: prefix-sum of min_delta only, no loads
1–164 values per iteration from one unaligned 64-bit load (4 × bw ≤ 64)
17–322 values per iteration via a __uint128_t window (lowers to an SHRD funnel shift on x86_64)
33–64scalar fallback (a single value can straddle > 64 bits after byte-misalignment; rare in practice)

The prefix sum is fused into the unpack: each kernel keeps a running cumulative and writes finished values straight out, so there's no second pass to turn deltas into values. Per the Parquet spec the arithmetic is unsigned mod 2⁶⁴, which makes overflow well-defined and lets the compiler vectorize freely.

Bit-unpack strategies: 4-value 64-bit load for widths 1-16, 128-bit funnel shift for widths 17-32

Once per chunk invocation

For the dispatch cost, ColumnVisitor::processRun(), a bulk variant of per-row process(), hands a decoded chunk to the visitor at once instead of one row at a time. For unfiltered reads the decode writes straight into the output with no per-row work; for filtered reads a deterministic comparison runs in SIMD batches.

Sparse (filtered) reads decode a batch of 1024 values into a stack buffer and index into it by row. DELTA's prefix-sum chain forces every physical value to be decoded anyway (random-access to value N requires computing 0…N), so one sequential, cache-friendly pass followed by array indexing outperforms per-row seeking.

The string decoder

Profiling the DELTA_BYTE_ARRAY path surfaced three steady-state costs in DeltaByteArrayDecoder, all removed:

  • Devirtualized readString() by removing the virtual base class.
  • Replaced make_unique child decoders (allocated per page) with std::optional + reset(), eliminating per-page-boundary allocations.
  • Replaced the std::string "last value" with std::vector<char> + length, dropping the std::string::_M_replace overhead (~6.5% of the string path) from the prefix-share copy.

These improvements were measured on a DELTA_BYTE_ARRAY-encoded column and are not reflected in the Q12 integer results below.

Where it landed

TPC-H Q12, SF10, lineitem re-encoded as DELTA_BINARY_PACKED, num_drivers=4 --num_repeats=5 --warmup_after_clear=true, 5-run median:

ConfigurationWalllineitem scan CPU
PLAIN / dictionary (no DELTA)0.479 s0.614 s
DELTA, original (copyBits)1.71 s5.04 s
DELTA, after the single-load fix1.02 s2.76 s
DELTA, after SIMD batched decode0.594 s1.11 s

The first fix is a 1.7× wall improvement; the batched SIMD path reduces wall time by another 42% and scan CPU by 60% on top. The DELTA-vs-PLAIN scan-CPU gap shrank from ~8× to ~1.8×. The PLAIN/dictionary path was re-measured throughout and stayed within noise; this work is confined to the DELTA decoders.

Takeaways

  • Let the profile pick the target. Comparing DELTA against the PLAIN/dictionary scan on the same data made both the size of the gap and how much was decoding unambiguous.
  • General-purpose primitives carry a tax. bits::copyBits is correct for any offset and length; where the offset is 0 and the width ≤ 64 it collapses to one load. Specializing the hot call site was the biggest win per line changed.
  • Attack each layer of overhead separately. Batching the visitor call removed per-row dispatch; SIMD kernels removed per-value unpack cost; fusing the prefix sum into the unpack removed a whole pass.

Acknowledgements

Thanks to Masha Basmanova and Ping Liu for the detailed reviews that shaped the safety invariants and pushed for comprehensive bit-width test coverage. In particular, Masha's early performance question on #17633 directly motivated the SIMD batched decode work in #17728 that brought the gap down to ~1.8x.

FlatMapVector Adoption for Scaling High-Performance AI/ML Data Pre-Processing

· 7 min read
Peter Enescu
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta
Kk Pulla
Software Engineer @ Meta
Yuhta
Software Engineer @ Meta
Kevin Wilfong
Software Engineer @ Meta

Context

At Meta, features used for AI use cases are largely combined and stored within warehouse tables as map columns because frequent access to and manipulation of these features can scale poorly if modelled as top-level columns, which would result in extremely wide tables and frequent schema changes. Thus, to provide maximum flexibility, features are modeled as maps.

In a traditional columnar layout, map columns are typically represented in-memory by a few data streams. The diagram below illustrates an example dataset. Two main buffers or streams are allocated for map keys and values. Additional buffers are used for null flags and map offsets or lengths (note that map keys are non-nullable):

Nimble Cluster Index: Efficient Indexed Lookups on Columnar Data

· 9 min read
Xiaoxuan Meng
Software Engineer @ Meta
Jialiang Tan
Software Engineer @ Meta
Zac Wen
Software Engineer @ Meta
Zhenyuan Zhao
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta
Masha Basmanova
Software Engineer @ Meta

Introduction

Analytical data lakes excel at full-table scans but struggle with point lookups. Key-value stores handle point lookups efficiently but cannot serve analytical queries. What if a single file format could serve both workloads?

Nimble's Cluster Index bridges this gap. It is a lightweight, hierarchical index structure embedded directly inside Nimble columnar files. It enables O(log n) point lookups and range scans on sorted data — without a separate index file, without an external service, and without sacrificing Nimble's columnar scan performance.

We have integrated the cluster index with Presto for analytical index joins and are actively integrating with ZippyDB for prefix key scans — both powered by the same underlying index structure, served through Velox.

Axiom: Composable Query Engines Built on Velox

· 8 min read
Masha Basmanova
Software Engineer @ Meta
Amit Dutta
Software Engineer @ Meta

Introduction

Axiom is a C++ library for building fully composable, high-performance query engines, built on top of Velox. Think of it as Lego for query processing — the pieces are compatible, but don't restrict how you put them together. Configure each layer, swap in your own components.

Today, users face significant friction moving between interactive queries, batch processing, streaming, and AI training data preparation — different engines, different semantics, different quirks. With Axiom, we can build all of these on a shared foundation, delivering consistent semantics across deployment modes.

From flaky Axiom CI to a Velox bug fix: a cross-repo debugging story

· 9 min read
Masha Basmanova
Software Engineer @ Meta

TL;DR

When adding macOS CI to Axiom, set operation tests kept failing intermittently — but only in macOS debug CI. Linux CI (debug and release) passed consistently. Local runs always passed. The root cause turned out to be a bug in Velox — a dependency managed as a Git submodule. This post describes the process of debugging a CI-only failure when the bug lives in a different repository.

Adaptive Per-Function CPU Time Tracking

· 6 min read
Rajeev Singh
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

Context

Velox evaluates SQL expressions as trees of functions. A query like if(array_gte(a, b), multiply(x, y), 0) compiles into a tree where each node processes an entire vector of rows at a time. When a query runs slowly, the first question usually is: which function is consuming the most CPU? Is it the expensive array comparison, or the cheap arithmetic called millions of times? This problem is even more prominent in use cases like training data loading, when very long and deeply nested expression trees are common, and jobs may run for many hours, or days; in such cases, the CPU usage of even seemingly short-lived functions may add up to substantial overhead. Without a detailed per-function CPU usage breakdown, you may be left guessing — or worse, optimizing the wrong thing.

Accelerating Unicode string processing with SIMD in Velox

· 8 min read
Ping Liu
Software Engineer
Yuhta
Software Engineer @ Meta
Masha Basmanova
Software Engineer @ Meta

TL;DR

We optimized two Unicode string helpers — cappedLengthUnicode and cappedByteLengthUnicode — by replacing byte-by-byte utf8proc_char_length calls with a SIMD-based scanning loop. The new implementation processes register-width blocks at a time: pure-ASCII blocks skip in one step, while mixed blocks use bitmask arithmetic to count character starts. Both helpers now share a single parameterized template, eliminating code duplication.

On a comprehensive benchmark matrix covering string lengths from 4 to 1024 bytes and ASCII ratios from 0% to 100%, we measured 2–15× speedups across most configurations, with no regressions on Unicode-heavy inputs. The optimization benefits all callers of these helpers, including the Iceberg truncate transform and various string functions.

The hidden traps of regex in LIKE and split

· 7 min read
Masha Basmanova
Software Engineer @ Meta

SQL functions sometimes use regular expressions under the hood in ways that surprise users. Two common examples are the LIKE operator and Spark's split function.

In Presto, split takes a literal string delimiter and regexp_split is a separate function for regex-based splitting. Spark's split, however, always treats the delimiter as a regular expression.

Both LIKE and Spark's split can silently produce wrong results and waste CPU when used with column values instead of constants. Understanding why this happens helps write faster, more correct queries — and helps engine developers make better design choices.

velox::StringView API Changes and Best Practices

· 5 min read
Pedro Pedreira
Software Engineer @ Meta

Context

Strings are ubiquitously used in large-scale analytic query processing. From storing identifiers, names, labels, or structured data (like json/xml), to simply descriptive text, like a product description or the contents of this very blog post, there is hardly a SQL query that does not require the manipulation of string data.

This post describes in more detail how Velox handles columns of strings, the low-level C++ APIs involved and some recent changes made to them, and presents best practices for string usage throughout Velox's codebase.