
27 posts tagged with "tech-blog"


Nimble Cluster Index: Efficient Indexed Lookups on Columnar Data

· 9 min read
Xiaoxuan Meng
Software Engineer @ Meta
Jialiang Tan
Software Engineer @ Meta
Zac Wen
Software Engineer @ Meta
Zhenyuan Zhao
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta
Masha Basmanova
Software Engineer @ Meta

Introduction

Analytical data lakes excel at full-table scans but struggle with point lookups. Key-value stores handle point lookups efficiently but cannot serve analytical queries. What if a single file format could serve both workloads?

Nimble's Cluster Index bridges this gap. It is a lightweight, hierarchical index structure embedded directly inside Nimble columnar files. It enables O(log n) point lookups and range scans on sorted data — without a separate index file, without an external service, and without sacrificing Nimble's columnar scan performance.

We have integrated the cluster index with Presto for analytical index joins and are actively integrating with ZippyDB for prefix key scans — both powered by the same underlying index structure, served through Velox.

Motivation

Consider a common pattern in data warehousing: a large fact table needs to be joined with a dimension table on a key column. Traditional hash joins materialize both sides into memory. For selective lookups — where the probe side touches only a fraction of the build side — this is wasteful.

Index joins solve this: for each probe key, look up only the matching rows in the build table, skipping everything else. But this requires an efficient index on the build-side file.

Similarly, in ML training data serving, workloads need prefix key scans with column projection — e.g., fetching a user's event history while reading only the features a model needs. Without an index, this requires scanning entire stripes — unacceptable at serving latency.

The cluster index addresses both use cases with a single, unified design.

File Layout

The cluster index is stored as a FlatBuffer-serialized optional section in the Nimble file footer. No separate index file is needed — the index lives alongside the data it references.

Nimble file layout showing the cluster index as an optional section in the footer, with its internal hierarchical structure of partitions and key stream chunks.

Hierarchical Structure

The index is organized as a three-level hierarchy designed to minimize I/O:

Level 1: Root ClusterIndex. The root is a FlatBuffer table embedded in the file footer's optional sections. It contains the index column names, sort orders, and an array of partition_keys — the boundary keys for each index group. Since this metadata is loaded when the file is opened, the first level of binary search requires zero additional I/O.

Level 2: Index Groups. Each index group corresponds 1:1 to a stripe group and contains metadata for the chunks within it: chunk_keys (boundary keys per chunk), chunk_offsets, and chunk_rows (accumulated row counts). Index groups are written inline after each stripe group and loaded on demand — only the group(s) that could contain the lookup key are read from disk.

Level 3: Key Stream Chunks. The actual encoded keys live in the key stream, divided into fixed-size chunks. Each chunk stores keys using prefix encoding (with a configurable restart interval) for compact representation. Only the specific chunk(s) identified by the group-level binary search are read and decoded.
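
The prefix encoding used at level 3 can be sketched as follows. This is an illustrative Python model, not Nimble's actual chunk wire format: every `restart_interval`-th key is stored in full, and the keys in between store only the length of the prefix they share with the previous key plus the differing suffix.

```python
def prefix_encode(keys, restart_interval=16):
    """Encode sorted byte-string keys with shared-prefix compression.

    Every `restart_interval`-th key is stored in full (a "restart point");
    other keys store only (shared_prefix_len, suffix) relative to the
    previous key, so decoding can begin at any restart point.
    """
    out = []
    prev = b""
    for i, key in enumerate(keys):
        if i % restart_interval == 0:
            out.append((0, key))          # restart point: full key
        else:
            shared = 0
            limit = min(len(prev), len(key))
            while shared < limit and prev[shared] == key[shared]:
                shared += 1
            out.append((shared, key[shared:]))
        prev = key
    return out

def prefix_decode(entries):
    """Reconstruct the original keys from (shared_len, suffix) entries."""
    keys, prev = [], b""
    for shared, suffix in entries:
        prev = prev[:shared] + suffix
        keys.append(prev)
    return keys
```

A smaller restart interval makes random seeks within a chunk cheaper (less to replay from the last restart point) at the cost of weaker compression, which is why the interval is configurable.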

Key Encoding

Multi-column index keys are serialized into a single byte-comparable binary string using Velox's KeyEncoder. The sort order (ascending or descending) for each column is baked into the encoding, so a simple byte comparison on the encoded key correctly handles multi-column composite keys with mixed sort orders.
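
A simplified model of byte-comparable encoding is shown below. This is not Velox's actual KeyEncoder format (which also handles nulls, floats, and variable-length values); the sketch assumes unsigned 64-bit integer columns only, to show how mixed sort orders can be baked into the bytes.

```python
import struct

def encode_key(values, orders):
    """Encode a composite key of unsigned 64-bit ints into one
    byte-comparable string. orders[i] is 'asc' or 'desc'.

    Big-endian packing makes byte order match numeric order for each
    column; inverting the bytes flips the order for descending columns,
    so a plain bytes comparison handles mixed sort orders.
    """
    parts = []
    for v, order in zip(values, orders):
        b = struct.pack(">Q", v)              # big-endian preserves order
        if order == "desc":
            b = bytes(x ^ 0xFF for x in b)    # bitwise invert reverses order
        parts.append(b)
    return b"".join(parts)
```

With this property, every comparison in the index (group search, chunk search, in-chunk search) is a single memcmp-style byte comparison regardless of how many columns the key has.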

Lookup Flow

A lookup proceeds through the hierarchy:

  1. Binary search on partition_keys — identifies which index group(s) could contain the target key. This is pure in-memory computation — no I/O.
  2. Load index group metadata — reads the matching group's FlatBuffer from disk (cached for repeated lookups).
  3. Binary search on chunk_keys — narrows down to specific chunk(s) within the group. Again, in-memory only.
  4. Seek within chunk — reads and decodes the key stream chunk, then binary searches the materialized keys to find the exact row range.

The result is a set of file-level row ranges — contiguous for both point lookups and range scans since the data is sorted by the index key.

The key design insight is that the common case — eliminating entire index groups — requires no disk I/O at all. Only when a group is a candidate does the system incur reads, and even then, only the specific chunk within that group is loaded.
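
The four steps above can be sketched as a pair of in-memory binary searches followed by a chunk-level search. This is an illustrative Python model: `load_group` and `load_chunk` are hypothetical stand-ins for the two disk reads, not Nimble APIs, and the metadata layout is simplified.

```python
import bisect

def cluster_index_lookup(target, partition_keys, load_group, load_chunk):
    """Resolve an encoded key to a (start_row, end_row) range, or None.

    partition_keys: in-memory upper-boundary key per index group (sorted).
    load_group(g):  reads group g's metadata -> (chunk_keys, chunk_rows),
                    where chunk_keys[c] is the last key of chunk c and
                    chunk_rows[c] is the row count accumulated before it.
    load_chunk(g, c): reads and decodes one key stream chunk into a
                    sorted list of keys. Only these two calls touch disk.
    """
    # Step 1: in-memory binary search over group boundary keys (no I/O).
    g = bisect.bisect_left(partition_keys, target)
    if g == len(partition_keys):
        return None                      # key beyond the last group
    # Step 2: load the candidate group's metadata (first disk read).
    chunk_keys, chunk_rows = load_group(g)
    # Step 3: in-memory binary search over chunk boundary keys.
    c = bisect.bisect_left(chunk_keys, target)
    if c == len(chunk_keys):
        return None
    # Step 4: read the chunk and locate the exact row range (second read).
    keys = load_chunk(g, c)
    lo = bisect.bisect_left(keys, target)
    hi = bisect.bisect_right(keys, target)
    if lo == hi:
        return None                      # key absent from the file
    return (chunk_rows[c] + lo, chunk_rows[c] + hi)
```

Note that a miss can be detected after step 1 or step 3 without any I/O, and a hit costs at most the two reads described above.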

Beyond ClusterIndex: Hash and Sorted Indices

While the cluster index is optimized for sorted data, Nimble also supports two additional index types for unsorted data:

HashIndex — O(1) amortized point lookups using a hash table with optional Bloom filter. The Bloom filter follows the Parquet split-block design (256-bit blocks, 8 hash probes per block, xxHash64) for cache-friendly probabilistic filtering before accessing the hash buckets.
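
The split-block probe can be sketched as follows. This uses the salt constants from the Parquet specification but Python's built-in hash() in place of xxHash64, and the class and method names are illustrative, not Nimble's.

```python
# Parquet-style split-block Bloom filter: each key maps to one 256-bit
# block (eight 32-bit words); a per-word salt multiply selects one bit
# in each word, so a probe touches a single cache-line-sized block.
SALT = [0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
        0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31]

class SplitBlockBloom:
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.blocks = [[0] * 8 for _ in range(num_blocks)]

    def _mask(self, h32):
        # (h32 * salt) mod 2^32: the top 5 bits pick a bit in [0, 32).
        return [1 << (((h32 * SALT[i]) & 0xFFFFFFFF) >> 27) for i in range(8)]

    def insert(self, key):
        h = hash(key) & 0xFFFFFFFFFFFFFFFF
        block = self.blocks[(h >> 32) % self.num_blocks]
        for i, m in enumerate(self._mask(h & 0xFFFFFFFF)):
            block[i] |= m

    def may_contain(self, key):
        h = hash(key) & 0xFFFFFFFFFFFFFFFF
        block = self.blocks[(h >> 32) % self.num_blocks]
        return all(block[i] & m
                   for i, m in enumerate(self._mask(h & 0xFFFFFFFF)))
```

Because each probe reads exactly one block, the filter answers "definitely absent" for most misses before any hash bucket is touched.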

SortedIndex — A secondary index that maintains a separately sorted copy of the keys alongside their original row IDs. Each entry in the sorted key stream is a composite encoded_key || fixed_width_row_id, enabling binary search while carrying the row ID inline. This supports both point lookups and range scans on unsorted data.
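
The composite-entry layout can be modeled as below. This is a minimal sketch assuming fixed-width encoded keys; `build_sorted_index` and `lookup` are illustrative names, not Nimble APIs.

```python
import bisect
import struct

def build_sorted_index(rows):
    """rows: iterable of (encoded_key, row_id). Each entry is the
    composite encoded_key || fixed-width row id (8-byte big-endian),
    so sorting the entries orders by key first, then by row id."""
    return sorted(key + struct.pack(">Q", rid) for key, rid in rows)

def lookup(entries, key):
    """Binary-search the sorted entries and return the row ids of all
    entries whose key equals `key` (assumes fixed-width keys)."""
    lo = bisect.bisect_left(entries, key)              # key sorts before key||rid
    hi = bisect.bisect_right(entries, key + b"\xff" * 8)  # max row id for this key
    return [struct.unpack(">Q", e[-8:])[0] for e in entries[lo:hi]]
```

Carrying the row id inline means a successful search already yields the row position, with no second lookup from key to row.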

All three index types implement a unified IndexLookup interface with PointLookup and RangeScan modes, and are discovered through the DenseIndexRegistry, which transparently selects the best available index for a given set of columns.

Integration with Presto

We integrated the Nimble cluster index with Presto's native execution engine (Prestissimo) through Velox's IndexLookupJoin operator.

Integration architecture showing the Presto path (left) and ZippyDB path (right), both sharing the same ClusterIndex core.

The Execution Path

The Presto integration flows through several layers:

Presto Coordinator (Java) — The query optimizer recognizes index join opportunities and generates an IndexJoinNode in the query plan, with planner rules for lookup variable extraction, non-equal join condition support, and index join session properties.

Velox IndexLookupJoin — The execution operator receives probe-side input batches and performs index lookups against the cluster indexed table. It projects lookup columns from each probe batch, detects and excludes null keys, and calls into the connector's IndexSource for the actual lookup. For left joins, unmatched probe rows are emitted with null values for the lookup columns. The operator pipelines multiple probe batches to overlap lookup I/O with processing.

HiveIndexSource — Implements Velox's connector::IndexSource interface for Hive tables. The coordinator retrieves index metadata from the Hive metastore (declared via the indexed_by DDL table property) and passes it to the worker through the table handle.

SelectiveNimbleIndexReader — The Nimble-specific implementation of Velox's IndexReader interface. It converts index bounds into encoded key bounds, calls ClusterIndex::lookup() to resolve file-level row ranges, then maps those ranges to stripes and iterates through them using the selective column reader with precise seekTo() operations.

A notable design detail in the reader is how it handles overlapping row ranges from multiple lookup requests. When no additional filter is present, it merges overlapping ranges for efficient bulk reads. When a filter is present, it splits ranges into non-overlapping segments so that filtered output can be correctly attributed back to the originating request.
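
The two strategies can be sketched over half-open [start, end) ranges. This is an illustrative model; the actual reader operates on Velox row ranges, and the function names here are mine.

```python
def merge_ranges(ranges):
    """No filter: coalesce overlapping [start, end) ranges into the
    fewest possible bulk reads."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)   # extend previous range
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

def split_ranges(ranges):
    """With a filter: cut at every range boundary so each output segment
    is covered by a known set of requests, and filtered rows can be
    attributed back to the request(s) that asked for them."""
    points = sorted({p for r in ranges for p in r})
    segments = []
    for lo, hi in zip(points, points[1:]):
        owners = [i for i, (s, e) in enumerate(ranges) if s <= lo and hi <= e]
        if owners:
            segments.append(((lo, hi), owners))
    return segments
```

Merging minimizes I/O when rows only need to be read; splitting trades a few extra segments for unambiguous per-request attribution once a filter can drop rows.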

Usage

With the integration in place, users can create Nimble tables with a cluster index and query them with index joins:

-- Create a table with cluster index on the key column
CREATE TABLE dimension_table (
    key_col BIGINT,
    value_col VARCHAR,
    ...
) WITH (
    format = 'NIMBLE',
    indexed_by = '["key_col"]'
);

-- The optimizer automatically chooses index join when beneficial
SELECT f.*, d.value_col
FROM fact_table f
JOIN dimension_table d ON f.key_col = d.key_col
WHERE f.date = '2026-04-27';

Integration with ZippyDB

We are actively integrating the cluster index with ZippyDB to serve point lookup workloads. The integration uses a fundamentally different strategy than the Presto path, optimized for serving latency rather than analytical throughput.

Server-Side Zero-Decode Projection

The key innovation in the ZippyDB path is the NimbleIndexProjector, which implements a zero-decode passthrough strategy. Instead of decoding Nimble-encoded data into Velox vectors on the server (as the Presto path does), it extracts the raw encoded stream bytes and sends them directly to clients via the Nimble serializer — without decoding or re-encoding. The client deserializes and decodes locally. This design shifts the CPU cost of decoding from the centralized lookup service to the distributed clients — critical for a serving system where the server is the bottleneck.

The projector uses the same ClusterIndex::lookup() for key-to-row-range resolution, but then takes a completely different serialization path. It builds per-stripe results by copying raw stream bytes directly from the tablet's encoded streams, with no intermediate decode/re-encode step.

Design Tradeoffs

| Aspect          | Presto Path            | ZippyDB Path       |
| --------------- | ---------------------- | ------------------ |
| Decode location | Server (Velox vectors) | Client (raw bytes) |
| Output format   | RowVector              | Nimble serializer  |
| Filter support  | Full pushdown          | Key-only           |
| Optimized for   | Throughput             | Latency            |
| CPU bottleneck  | Distributed workers    | Centralized server |

Performance Characteristics

The hierarchical design provides several performance properties:

Minimal I/O amplification. A point lookup on a file with millions of rows typically requires 2 disk reads: one for the index group metadata and one for the target key stream chunk. The root-level group search is free — it's already in memory from the file open.

Bounded memory overhead. Only the root FlatBuffer is kept in memory permanently. Index group metadata and key stream chunks are loaded on demand and can be cached or evicted independently.

Write-time cost is low. The index is built incrementally during writes — keys are encoded as a byte stream and chunked as rows are flushed. The root-level partition metadata is assembled at file close. There is no post-write index building step.

Composability with columnar scans. The cluster index produces row ranges, not decoded values. These row ranges feed directly into the selective column reader's seekTo() mechanism, which uses the separate ChunkIndex (a position-to-offset mapping per stream) to skip efficiently within stripes. The index lookup and the columnar read are cleanly separated — the index narrows the search space, and the columnar reader handles the actual data extraction.
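
The position-to-offset mapping can be modeled with a small sketch. The function name `seek_to` and the table layout are hypothetical; the post does not describe the ChunkIndex wire format.

```python
import bisect

def seek_to(row, chunk_first_rows, chunk_offsets):
    """Map a file-level row number to (byte_offset, rows_to_skip).

    chunk_first_rows[i] is the first row covered by chunk i of a stream;
    chunk_offsets[i] is that chunk's byte offset. The reader can then
    jump straight to the chunk and skip only within it, instead of
    decoding the stream from the start of the stripe.
    """
    i = bisect.bisect_right(chunk_first_rows, row) - 1
    return chunk_offsets[i], row - chunk_first_rows[i]
```

This is what makes the index and the columnar reader composable: the index emits row numbers, and each stream independently translates them into minimal reads.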

What's Next

We are continuing to expand the cluster index ecosystem:

  • ZippyDB production integration — completing the end-to-end integration with ZippyDB's serving layer.
  • Adaptive index selection — automatically choosing between cluster, hash, and sorted indices based on query patterns and data characteristics.
  • Vector search index — exploring embedding FAISS-based approximate nearest neighbor indices inside Nimble files, co-locating vector search with columnar data so one indexed file serves analytical queries, training pipelines, and retrieval workloads.

Acknowledgements

Thanks to the Velox, Nimble, and Presto teams at Meta for the foundational infrastructure and integration work that makes this possible.

Axiom: Composable Query Engines Built on Velox

· 8 min read
Masha Basmanova
Software Engineer @ Meta
Amit Dutta
Software Engineer @ Meta

Introduction

Axiom is a C++ library for building fully composable, high-performance query engines, built on top of Velox. Think of it as Lego for query processing — the pieces are compatible, but don't restrict how you put them together. Configure each layer, swap in your own components.

Today, users face significant friction moving between interactive queries, batch processing, streaming, and AI training data preparation — different engines, different semantics, different quirks. With Axiom, we can build all of these on a shared foundation, delivering consistent semantics across deployment modes.

From flaky Axiom CI to a Velox bug fix: a cross-repo debugging story

· 9 min read
Masha Basmanova
Software Engineer @ Meta

TL;DR

When adding macOS CI to Axiom, set operation tests kept failing intermittently — but only in macOS debug CI. Linux CI (debug and release) passed consistently. Local runs always passed. The root cause turned out to be a bug in Velox — a dependency managed as a Git submodule. This post describes the process of debugging a CI-only failure when the bug lives in a different repository.

Adaptive Per-Function CPU Time Tracking

· 6 min read
Rajeev Singh
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

Context

Velox evaluates SQL expressions as trees of functions. A query like if(array_gte(a, b), multiply(x, y), 0) compiles into a tree where each node processes an entire vector of rows at a time. When a query runs slowly, the first question usually is: which function is consuming the most CPU? Is it the expensive array comparison, or the cheap arithmetic called millions of times? This problem is even more prominent in use cases like training data loading, when very long and deeply nested expression trees are common, and jobs may run for many hours, or days; in such cases, the CPU usage of even seemingly short-lived functions may add up to substantial overhead. Without a detailed per-function CPU usage breakdown, you may be left guessing — or worse, optimizing the wrong thing.

Accelerating Unicode string processing with SIMD in Velox

· 8 min read
Ping Liu
Software Engineer
Yuhta
Software Engineer @ Meta
Masha Basmanova
Software Engineer @ Meta

TL;DR

We optimized two Unicode string helpers — cappedLengthUnicode and cappedByteLengthUnicode — by replacing byte-by-byte utf8proc_char_length calls with a SIMD-based scanning loop. The new implementation processes register-width blocks at a time: pure-ASCII blocks skip in one step, while mixed blocks use bitmask arithmetic to count character starts. Both helpers now share a single parameterized template, eliminating code duplication.

On a comprehensive benchmark matrix covering string lengths from 4 to 1024 bytes and ASCII ratios from 0% to 100%, we measured 2–15× speedups across most configurations, with no regressions on Unicode-heavy inputs. The optimization benefits all callers of these helpers, including the Iceberg truncate transform and various string functions.
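
The character-start counting trick can be illustrated in scalar Python. The real implementation operates on SIMD registers; here a "block" is modeled as a byte slice, and the function name is illustrative. The key fact is that a byte starts a UTF-8 character iff it is not a continuation byte (0b10xxxxxx).

```python
def capped_length_utf8(data: bytes, cap: int, block: int = 16) -> int:
    """Count UTF-8 characters in `data`, capped at `cap`.

    Models the SIMD approach: process fixed-size blocks; a pure-ASCII
    block contributes its full size in one step, while a mixed block
    counts the bytes that are not continuation bytes (character starts).
    """
    count = 0
    for i in range(0, len(data), block):
        chunk = data[i:i + block]
        if all(b < 0x80 for b in chunk):
            count += len(chunk)                        # pure ASCII fast path
        else:
            count += sum((b & 0xC0) != 0x80 for b in chunk)  # char starts
        if count >= cap:
            return cap
    return min(count, cap)
```

In the vectorized version, both branches become a couple of register-wide comparisons plus a population count, which is where the reported speedups come from.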

The hidden traps of regex in LIKE and split

· 7 min read
Masha Basmanova
Software Engineer @ Meta

SQL functions sometimes use regular expressions under the hood in ways that surprise users. Two common examples are the LIKE operator and Spark's split function.

In Presto, split takes a literal string delimiter and regexp_split is a separate function for regex-based splitting. Spark's split, however, always treats the delimiter as a regular expression.

Both LIKE and Spark's split can silently produce wrong results and waste CPU when used with column values instead of constants. Understanding why this happens helps write faster, more correct queries — and helps engine developers make better design choices.
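
The trap is easy to demonstrate with any regex-based split; Python's re.split stands in for Spark's split here:

```python
import re

# A user expects to split an IP-like string on literal dots, but in a
# regex-based split "." matches ANY character, so every character is a
# delimiter and the result is a list of empty strings.
assert re.split(".", "192.168.0.1") == [""] * 12

# Escaping the delimiter restores the intended literal split.
assert re.split(r"\.", "192.168.0.1") == ["192", "168", "0", "1"]
```

When the delimiter is a column value rather than a constant, the same hazard applies to every row, and the engine may also pay to compile a fresh regex per row.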

velox::StringView API Changes and Best Practices

· 5 min read
Pedro Pedreira
Software Engineer @ Meta

Context

Strings are ubiquitously used in large-scale analytic query processing. From storing identifiers, names, labels, or structured data (like json/xml), to simply descriptive text, like a product description or the contents of this very blog post, there is hardly a SQL query that does not require the manipulation of string data.

This post describes in more detail how Velox handles columns of strings, the low-level C++ APIs involved and some recent changes made to them, and presents best practices for string usage throughout Velox's codebase.

Task Barrier: Efficient Task Reuse and Streaming Checkpoints in Velox

· 4 min read
Xiaoxuan Meng
Software Engineer @ Meta
Yuhta
Software Engineer @ Meta
Masha Basmanova
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

TL;DR

Velox Task Barriers provide a synchronization mechanism that not only enables efficient task reuse, important for workloads such as AI training data loading, but also delivers the strict sequencing and checkpointing semantics required for streaming workloads.

By injecting a barrier split, users guarantee that no subsequent data is processed until the entire DAG is flushed and the synchronization signal is unblocked. This capability serves two critical patterns:

  1. Task Reuse: Eliminates the overhead of repeated task initialization and teardown by safely reconfiguring warm tasks for new queries. This is a recurring pattern in AI training data loading workloads.

  2. Streaming Processing: Enables continuous data handling with consistent checkpoints, allowing stateful operators to maintain context across batches without service interruption.

See the Task Barrier Developer Guide for implementation details.

Why Sort is row-based in Velox — A Quantitative Assessment

· 8 min read
Meng Duan (macduan)
Software Engineer @ ByteDance
Xiaoxuan Meng
Software Engineer @ Meta

TL;DR

Velox is a fully vectorized execution engine[1]. Its internal columnar memory layout enhances cache locality, exposes more inter-instruction parallelism to CPUs, and enables the use of SIMD instructions, significantly accelerating large-scale query processing.

However, some operators in Velox utilize a hybrid layout, where datasets can be temporarily converted to a row-oriented format. The OrderBy operator is one example: our implementation first materializes the input vectors into rows containing both sort keys and payload columns, sorts them, and converts the rows back to vectors.

In this article, we explain the rationale behind this design decision and provide experimental evidence supporting it. We show a prototype of a hybrid sorting strategy that materializes only the sort-key columns, reducing the overhead of materializing payload columns. Contrary to expectations, the end-to-end performance did not improve; in fact, it was even slower in some cases. We present the two variants and discuss why one is counter-intuitively faster than the other.