
Adaptive Per-Function CPU Time Tracking

· 6 min read
Rajeev Singh
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

Context

Velox evaluates SQL expressions as trees of functions. A query like if(array_gte(a, b), multiply(x, y), 0) compiles into a tree where each node processes an entire vector of rows at a time. When a query runs slowly, the first question is usually: which function is consuming the most CPU? Is it the expensive array comparison, or the cheap arithmetic called millions of times? This problem is even more prominent in use cases like training data loading, where very long, deeply nested expression trees are common and jobs may run for hours or days; in such cases, the CPU usage of even seemingly short-lived functions can add up to substantial overhead. Without a detailed per-function CPU usage breakdown, you may be left guessing, or worse, optimizing the wrong thing.

Velox supports CPU usage tracking through the expression.track_cpu_usage query config. When enabled, every function invocation is wrapped in a CpuWallTimer — an RAII guard that makes the clock_gettime() system call at entry and exit, recording the CPU and wall time spent in each function. This gives precise, per-function cost attribution across the entire expression tree.
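As a rough sketch, such an RAII guard might look like the following. The names and structure here are illustrative, not Velox's actual CpuWallTimer implementation; the key idea is simply two clock_gettime() calls at construction and two at destruction:

```cpp
#include <cstdint>
#include <ctime>

// Accumulated timing for one function in the expression tree
// (illustrative struct, not Velox's actual type).
struct CpuWallTiming {
  uint64_t cpuNanos{0};
  uint64_t wallNanos{0};
};

// RAII guard: records CPU and wall time between construction and
// destruction, adding the deltas to the referenced CpuWallTiming.
class ScopedCpuWallTimer {
 public:
  explicit ScopedCpuWallTimer(CpuWallTiming& timing) : timing_(timing) {
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpuStart_);
    clock_gettime(CLOCK_MONOTONIC, &wallStart_);
  }

  ~ScopedCpuWallTimer() {
    timespec cpuEnd, wallEnd;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpuEnd);
    clock_gettime(CLOCK_MONOTONIC, &wallEnd);
    timing_.cpuNanos += diffNanos(cpuStart_, cpuEnd);
    timing_.wallNanos += diffNanos(wallStart_, wallEnd);
  }

 private:
  // Assumes b >= a, which holds for monotonic and thread-CPU clocks.
  static uint64_t diffNanos(const timespec& a, const timespec& b) {
    return static_cast<uint64_t>(
        (b.tv_sec - a.tv_sec) * 1'000'000'000LL + (b.tv_nsec - a.tv_nsec));
  }

  CpuWallTiming& timing_;
  timespec cpuStart_;
  timespec wallStart_;
};
```

Wrapping a function invocation is then just a matter of constructing the guard on the stack for the duration of the call.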

The Overhead Problem

That precision comes at a price. Each CpuWallTimer invocation performs a heap allocation, two clock_gettime() syscalls, and a heap deallocation: roughly 5μs of fixed overhead, depending on the platform. For a function like array_gte(), which spends milliseconds per vector iterating over array elements, 5μs is negligible noise. But for a function like multiply, whose per-row work is a single instruction, the timer takes longer than the work itself:

| Vector Size | No Tracking | Full Tracking | Overhead |
| --- | --- | --- | --- |
| 100 rows | 2.13ms | 5.89ms | 2.8x |
| 1,000 rows | 300μs | 668μs | 2.2x |
| 10,000 rows | 193μs | 233μs | 1.2x |

Benchmark for multiply(a, b) on double columns. Times are per 10K iterations.

This overhead forced production systems to disable per-function CPU tracking entirely, losing visibility into granular CPU consumption. Selective per-function tracking (expression.track_cpu_usage_for_functions) was added to Velox earlier to let operators opt in specific functions, but even for tracked functions the overhead remained; you merely chose which functions to pay the cost for.

Why Uniform Sampling Falls Short

The obvious next step is to sample: measure every Nth vector instead of every one, and extrapolate. A uniform rate applied to all functions would be simple — one counter, one config key. But a single rate cannot serve both cheap and expensive functions.

Consider array_gte, where the timer overhead is less than 0.1% of execution time. Sampling in this case throws away measurement precision for no benefit — the function can afford full tracking. Now consider multiply on small vectors, where the timer is 2500% of the function cost. Even a 1/1000 rate may not be aggressive enough. A rate that works for one function is either inaccurate or insufficient for another.

Adaptive Per-Function Sampling

When a query starts running, adaptive sampling does not yet know which functions are cheap and which are expensive. So it watches.

The first vector through each function is a warmup — caches and branch predictors settle, and the system measures the cost of the CpuWallTimer itself by creating and destroying a dummy one. Over the next 5 vectors, each function runs with a lightweight stopwatch instead, quietly accumulating how long it actually takes without any timer overhead.

By the sixth vector, the system has what it needs: the ratio of timer overhead to function cost. If the ratio is within a configurable threshold (default 1%), the function switches to full tracking: every future vector gets a CpuWallTimer, with zero information loss. If the timer dominates, the function switches to sampling at a rate that bounds overhead to the threshold:

| Function | Per-Vector Cost | Timer Overhead | Overhead Ratio | Decision |
| --- | --- | --- | --- | --- |
| multiply(a,b), 100 rows | ~0.2μs | ~5μs | 2500% | Sample 1/2500 |
| multiply(a,b), 10K rows | ~20μs | ~5μs | 25% | Sample 1/25 |
| array_gte(a,b), 100 rows | ~3ms | ~5μs | 0.17% | Always track |
| array_gte(a,b), 10K rows | ~12ms | ~5μs | 0.04% | Always track |

Expensive functions get exact tracking. Cheap functions sample aggressively. No per-function configuration needed — just a single "max acceptable overhead" knob.
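The decision rule above can be sketched as a small helper. The names and signature are hypothetical, not Velox's actual code, but plugging in the costs from the table reproduces its decisions:

```cpp
#include <cmath>
#include <cstdint>

// Outcome of calibration for one function (illustrative struct).
struct SamplingDecision {
  bool alwaysTrack;     // true: attach a timer to every vector
  uint64_t sampleEvery; // otherwise: time 1 out of every N vectors
};

// Given the measured per-vector cost of a function and the fixed cost of
// a CpuWallTimer, pick a tracking mode that bounds amortized overhead to
// maxOverheadPct (e.g. 1.0 for 1%). Hypothetical helper, not Velox's API.
SamplingDecision decide(
    double perVectorCostNs,
    double timerOverheadNs,
    double maxOverheadPct) {
  double overheadRatioPct = 100.0 * timerOverheadNs / perVectorCostNs;
  if (overheadRatioPct <= maxOverheadPct) {
    // Timer is cheap relative to the work: full tracking, no info loss.
    return {true, 1};
  }
  // Sample just often enough to keep amortized overhead at the threshold.
  return {
      false,
      static_cast<uint64_t>(std::ceil(overheadRatioPct / maxOverheadPct))};
}
```

For example, with a 0.2μs function cost and a 5μs timer, the overhead ratio is 2500%, so a 1% threshold yields a 1/2500 sampling rate, matching the first row of the table.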

When stats are read, sampled functions have their CPU and wall time scaled up by the ratio of total vectors to timed vectors. To handle short queries where the sampling counter might never fire, the first post-calibration vector is always timed, guaranteeing at least one measurement to extrapolate from.
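A minimal sketch of that scaling step (illustrative helper, not Velox's actual stats code):

```cpp
#include <cstdint>

// Extrapolate total time for a sampled function: measured time from the
// timed subset of vectors is scaled by totalVectors / timedVectors.
// timedVectors >= 1 is guaranteed because the first post-calibration
// vector is always timed. Hypothetical helper for illustration.
uint64_t extrapolateNanos(
    uint64_t measuredNanos,
    uint64_t totalVectors,
    uint64_t timedVectors) {
  // Note: a production version would guard against overflow for very
  // large measured times and vector counts.
  return measuredNanos * totalVectors / timedVectors;
}
```

For instance, if 4 out of 1,000 vectors were timed at 5μs total, the reported estimate would be 5μs × 1000 / 4 = 1.25ms.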

Results

So does it work? We benchmarked adaptive sampling against both no tracking and full tracking, across cheap and expensive functions, at different vector sizes, and with different adaptive rates (1% and 0.5%). Results are presented below:

| Benchmark | No Tracking | Full Tracking | Adaptive 1% | Adaptive 0.5% |
| --- | --- | --- | --- | --- |
| multiply 100 rows | 2.13ms | 5.86ms (36%) | 2.19ms (97%) | 2.16ms (98%) |
| multiply 1K rows | 299μs | 671μs (44%) | 304μs (98%) | 302μs (99%) |
| multiply 10K rows | 193μs | 224μs (86%) | 165μs (117%) | 193μs (100%) |
| array_gte 100 rows | 188ms | 193ms (98%) | 189ms (100%) | 187ms (100%) |
| array_gte 1K rows | 32ms | 32ms (100%) | 32ms (100%) | 32ms (100%) |
| array_gte 10K rows | 11ms | 11ms (100%) | 11ms (100%) | 11ms (100%) |

(Percentages are relative to "No Tracking" — higher is better)

For multiply on 100-row vectors, overhead drops from 64% to under 3%. For array_gte, overhead was already negligible and remains so; the function is fully tracked, with effectively no sampling.

The natural concern with sampling is accuracy. We ran a correctness benchmark comparing the extrapolated CPU time from adaptive sampling against the true value from full tracking over 1,000 vectors:

| Function | 100 rows | 1,000 rows | 10,000 rows |
| --- | --- | --- | --- |
| multiply | 108.8% | 102.4% | 103.7% |
| array_gte | 99.8% | 100.3% | 99.6% |

Estimates land within 1–9% of the true value. For expensive functions that end up in always-track mode, accuracy is exact, since every vector is timed.

Configuration

Two config properties control adaptive sampling:

| Config | Type | Default | Description |
| --- | --- | --- | --- |
| expression.adaptive_cpu_sampling | bool | false | Enable adaptive per-function sampling |
| expression.adaptive_cpu_sampling_max_overhead_pct | double | 1.0 | Target max overhead percentage |

Adaptive sampling interacts cleanly with existing tracking modes. expression.track_cpu_usage (global tracking for all functions) takes highest priority and always wins. expression.track_cpu_usage_for_functions (selective per-function tracking) overrides adaptive for named functions. Adaptive sampling applies to everything else.
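That priority order can be expressed as a small resolution function. The enum and helper below are hypothetical, written only to make the precedence explicit; the real logic lives inside Velox's expression evaluation setup:

```cpp
#include <string>
#include <unordered_set>

// Tracking mode chosen for one function (illustrative enum).
enum class TrackingMode { kNone, kFull, kAdaptive };

// Resolve the tracking mode for a function from the three config knobs,
// in priority order. Hypothetical helper, not Velox's actual code.
TrackingMode resolveMode(
    bool trackCpuUsage, // expression.track_cpu_usage
    const std::unordered_set<std::string>&
        trackedFunctions,       // expression.track_cpu_usage_for_functions
    bool adaptiveEnabled,       // expression.adaptive_cpu_sampling
    const std::string& functionName) {
  if (trackCpuUsage) {
    return TrackingMode::kFull; // global tracking always wins
  }
  if (trackedFunctions.count(functionName) > 0) {
    return TrackingMode::kFull; // explicitly opted-in functions
  }
  return adaptiveEnabled ? TrackingMode::kAdaptive : TrackingMode::kNone;
}
```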

Takeaway

Adaptive per-function CPU time tracking eliminates the tradeoff between observability and performance. Every function gets the best tracking mode for its cost profile — cheap functions run at near-native speed with minimal overhead, expensive functions get exact tracking at zero additional cost. The system is self-tuning and requires no per-function configuration.