32 posts tagged with "tech-blog"

Accelerating Unicode string processing with SIMD in Velox

March 13, 2026 · 8 min read

Ping Liu

Software Engineer

Yuhta

Software Engineer @ Meta

Masha Basmanova

Software Engineer @ Meta

TL;DR

We optimized two Unicode string helpers — cappedLengthUnicode and cappedByteLengthUnicode — by replacing byte-by-byte utf8proc_char_length calls with a SIMD-based scanning loop. The new implementation processes register-width blocks at a time: pure-ASCII blocks skip in one step, while mixed blocks use bitmask arithmetic to count character starts. Both helpers now share a single parameterized template, eliminating code duplication.
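
The block-scanning idea can be illustrated with a minimal sketch. It is not the Velox implementation: it uses plain 64-bit words (SWAR) as a portable stand-in for SIMD registers, omits the cap handling of the real helpers, and assumes valid UTF-8 input. The name countUtf8Chars is hypothetical.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not a Velox API): counts UTF-8 characters in
// [data, data + size), assuming the input is valid UTF-8.
inline std::size_t countUtf8Chars(const char* data, std::size_t size) {
  // Bit 7 of every byte in a 64-bit word.
  constexpr std::uint64_t kHighBits = 0x8080808080808080ULL;
  std::size_t count = 0;
  std::size_t i = 0;
  for (; i + 8 <= size; i += 8) {
    std::uint64_t word;
    std::memcpy(&word, data + i, sizeof(word));
    if ((word & kHighBits) == 0) {
      // Pure-ASCII block: every byte is a character start.
      count += 8;
      continue;
    }
    // Continuation bytes look like 10xxxxxx: bit 7 set, bit 6 clear.
    // Shifting left by one moves each byte's bit 6 into its bit 7 slot.
    const std::uint64_t continuation = word & ~(word << 1) & kHighBits;
    // Character starts = bytes in the block minus continuation bytes.
    count += 8 - std::bitset<64>(continuation).count();
  }
  // Scalar tail for the last size % 8 bytes.
  for (; i < size; ++i) {
    count += (static_cast<unsigned char>(data[i]) & 0xC0) != 0x80;
  }
  return count;
}
```

A real SIMD version would presumably do the same two things per register-width block: test whether any high bit is set, and otherwise derive a bitmask of non-continuation bytes and popcount it.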

On a comprehensive benchmark matrix covering string lengths from 4 to 1024 bytes and ASCII ratios from 0% to 100%, we measured 2–15× speedups across most configurations, with no regressions on Unicode-heavy inputs. The optimization benefits all callers of these helpers, including the Iceberg truncate transform and various string functions.

The hidden traps of regex in LIKE and split

March 7, 2026 · 7 min read

Masha Basmanova

Software Engineer @ Meta

SQL functions sometimes use regular expressions under the hood in ways that surprise users. Two common examples are the LIKE operator and Spark's split function.

In Presto, split takes a literal string delimiter and regexp_split is a separate function for regex-based splitting. Spark's split, however, always treats the delimiter as a regular expression.

Both LIKE and Spark's split can silently produce wrong results and waste CPU when used with column values instead of constants. Understanding why this happens helps write faster, more correct queries — and helps engine developers make better design choices.
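
The difference between a literal delimiter and a regex delimiter is easy to reproduce outside any SQL engine. The sketch below uses the C++ standard library only; splitLiteral and splitRegex are hypothetical helpers for illustration, not Presto, Spark, or Velox functions, and std::regex semantics are not guaranteed to match either engine's regex dialect.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits on the delimiter taken literally.
std::vector<std::string> splitLiteral(
    const std::string& input, const std::string& delim) {
  std::vector<std::string> parts;
  if (delim.empty()) {
    parts.push_back(input);
    return parts;
  }
  std::size_t start = 0;
  std::size_t pos;
  while ((pos = input.find(delim, start)) != std::string::npos) {
    parts.push_back(input.substr(start, pos - start));
    start = pos + delim.size();
  }
  parts.push_back(input.substr(start));
  return parts;
}

// Hypothetical helper: splits on the delimiter interpreted as a regex.
// The pattern is compiled on every call, which is the extra CPU cost
// paid per row when the delimiter comes from a column, not a constant.
std::vector<std::string> splitRegex(
    const std::string& input, const std::string& pattern) {
  const std::regex re(pattern);
  // Submatch index -1 selects the text between matches.
  return std::vector<std::string>(
      std::sregex_token_iterator(input.begin(), input.end(), re, -1),
      std::sregex_token_iterator());
}
```

With these helpers, splitLiteral("a.b.c", ".") returns the three parts a, b, and c, while splitRegex("a.b.c", ".") does not, because an unescaped dot matches every character. Escaping the delimiter, as in splitRegex("a.b.c", "\\."), restores the literal behavior.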

velox::StringView API Changes and Best Practices

February 10, 2026 · 5 min read

Pedro Pedreira

Software Engineer @ Meta

Context

Strings are ubiquitous in large-scale analytic query processing. From identifiers, names, labels, and structured data (like JSON/XML) to plain descriptive text, such as a product description or the contents of this very blog post, there is hardly a SQL query that does not manipulate string data.

This post describes in more detail how Velox handles columns of strings, the low-level C++ APIs involved and some recent changes made to them, and presents best practices for string usage throughout Velox's codebase.

Task Barrier: Efficient Task Reuse and Streaming Checkpoints in Velox

January 6, 2026 · 4 min read

Xiaoxuan Meng

Software Engineer @ Meta

Yuhta

Software Engineer @ Meta

Masha Basmanova

Software Engineer @ Meta

Pedro Pedreira

Software Engineer @ Meta

TL;DR

Velox Task Barriers provide a synchronization mechanism that not only enables efficient task reuse, important for workloads such as AI training data loading, but also delivers the strict sequencing and checkpointing semantics required for streaming workloads.

By injecting a barrier split, users guarantee that no subsequent data is processed until the entire DAG is flushed and the synchronization signal is unblocked. This capability serves two critical patterns:

Task Reuse: Eliminates the overhead of repeated task initialization and teardown by safely reconfiguring warm tasks for new queries. This is a recurring pattern in AI training data loading workloads.
Streaming Processing: Enables continuous data handling with consistent checkpoints, allowing stateful operators to maintain context across batches without service interruption.

See the Task Barrier Developer Guide for implementation details.

Why Sort is row-based in Velox — A Quantitative Assessment

December 24, 2025 · 8 min read

Meng Duan (macduan)

Software Engineer @ ByteDance

Xiaoxuan Meng

Software Engineer @ Meta

TL;DR

Velox is a fully vectorized execution engine[1]. Its internal columnar memory layout enhances cache locality, exposes more inter-instruction parallelism to CPUs, and enables the use of SIMD instructions, significantly accelerating large-scale query processing.

However, some operators in Velox utilize a hybrid layout, where datasets can be temporarily converted to a row-oriented format. The OrderBy operator is one example, where our implementation first materializes the input vectors into rows, containing both sort keys and payload columns, sorts them, and converts the rows back to vectors.

In this article, we explain the rationale behind this design decision and provide experimental evidence for its implementation. We show a prototype of a hybrid sorting strategy that materializes only the sort-key columns, reducing the overhead of materializing payload columns. Contrary to expectations, the end-to-end performance did not improve—in fact, it was even up to 3× slower. We present the two variants and discuss why one is counter-intuitively faster than the other.

Multi-Round Lazy Start Merge

November 9, 2025 · 6 min read

Meng Duan (macduan)

Software Engineer @ ByteDance

Xiaoxuan Meng

Software Engineer @ Meta

Pedro Pedreira

Software Engineer @ Meta

Background

Efficiently merging sorted data partitions at scale is crucial for a variety of training data preparation workloads, especially for Generative Recommenders (GRs), a new paradigm introduced in the paper Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. A key requirement is to merge training data across partitions (for example, merging hourly partitions into daily ones) while ensuring that all rows sharing the same primary key are stored consecutively. Training data is typically partitioned and bucketed by primary key, with rows sharing the same key stored consecutively, so merging across partitions essentially becomes a multi-way merge problem.
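
For readers unfamiliar with the term, a multi-way merge can be sketched as the textbook min-heap merge below. This is an in-memory illustration only, not the multi-round lazy-start merge described in the post; the Row type and the function name mergeSortedPartitions are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical row: (primary key, payload).
using Row = std::pair<int, int>;

// Merges partitions that are each already sorted by key. Rows with the
// same key end up adjacent in the output.
std::vector<Row> mergeSortedPartitions(
    const std::vector<std::vector<Row>>& partitions) {
  // Heap entry: (key, partition index, offset within that partition).
  using Entry = std::tuple<int, std::size_t, std::size_t>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

  // Seed the heap with the first row of every non-empty partition.
  for (std::size_t p = 0; p < partitions.size(); ++p) {
    if (!partitions[p].empty()) {
      heap.emplace(partitions[p][0].first, p, std::size_t{0});
    }
  }

  std::vector<Row> merged;
  while (!heap.empty()) {
    const Entry top = heap.top();
    heap.pop();
    const std::size_t p = std::get<1>(top);
    const std::size_t offset = std::get<2>(top);
    merged.push_back(partitions[p][offset]);
    // Advance the cursor of the partition we just consumed from.
    if (offset + 1 < partitions[p].size()) {
      heap.emplace(partitions[p][offset + 1].first, p, offset + 1);
    }
  }
  return merged;
}
```

Each output row costs one heap pop and at most one push, so the merge is O(N log K) for N rows across K partitions, and only one cursor per partition has to be live at a time.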

Normally, Apache Spark can be used for this sort-merge requirement, for example via CLUSTER BY. However, training datasets for a single job can often reach the PB scale, which in turn generates shuffle data at PB scale. Because we typically apply bucketing and ordering by key when preparing training data in production, Spark can eliminate the shuffle when merging training data from multiple hourly partitions. Even so, each Spark task can only read the files planned from various partitions within a split sequentially, placing them into the sorter and spilling as needed. Only after all files have been read does Spark perform a sort-merge of the spilled files. This process produces a large number of small spill files, which further degrades efficiency.

Enabling Shared Library Builds in Velox

November 1, 2025 · 8 min read

Jacob Wujciak-Jens

Software Engineer @ Voltron Data

In this post, I’ll share how we unblocked shared library builds in Velox, the challenges we encountered with our large CMake build system, and the creative solution that let us move forward without disrupting contributors or downstream users.

Extending Velox - GPU Acceleration with cuDF

July 11, 2025 · 4 min read

Gregory Kimball

Software Engineer @ NVIDIA

TL;DR

This post describes the design principles and software components for extending Velox with hardware acceleration libraries like NVIDIA's cuDF. Velox provides a flexible execution model for hardware accelerators, and cuDF's data structures and algorithms align well with core components in Velox.

SEGFAULT due to Dependency Update

July 7, 2025 · 4 min read

Deepak Majeti

Software Engineer @ IBM

Christian Zentgraf

Software Engineer @ IBM

Background

Velox depends on several libraries, including open-source libraries from Meta such as Folly and Facebook Thrift. These libraries are in active development and also depend on each other, so they all have to be updated to the same version at the same time.

Updating these dependencies typically involves modifying the Velox code to align with any public API or semantic changes in these dependencies. However, a recent upgrade of Folly and Facebook Thrift to version v2025.04.28.00 caused a SEGFAULT in only one Velox unit test, velox_functions_remote_client_test.

A Velox Primer, Part 3

May 12, 2025 · 10 min read

Orri Erling

Software Engineer @ Meta

Pedro Pedreira

Software Engineer @ Meta

At the end of the previous article, we were halfway through running our first distributed query:

SELECT l_partkey, count(*) FROM lineitem GROUP BY l_partkey;

We discussed how a query starts, how tasks are set up, and the interactions between plans, operators, and drivers. We have also presented how the first stage of the query is executed, from table scan to partitioned output - or the producer side of the shuffle.

In this article, we will discuss the second query stage, or the consumer side of the shuffle.