The hidden traps of regex in LIKE and split
SQL functions sometimes use regular expressions under the hood in ways that surprise users. Two common examples are the LIKE operator and Spark's split function.
In Presto,
split
takes a literal string delimiter and
regexp_split
is a separate function for regex-based splitting. Spark's split, however,
always treats the delimiter as a regular expression.
Both LIKE and Spark's split can silently produce wrong results and waste CPU when used with column values instead of constants. Understanding why this happens helps write faster, more correct queries — and helps engine developers make better design choices.
LIKE is not contains
A very common query pattern is to check whether one string contains another:
SELECT * FROM t WHERE name LIKE '%' || search_term || '%'
This looks intuitive: wrap search_term in % wildcards and you get a
"contains" check. But LIKE is not the same as substring matching.
LIKE treats _ as a single-character wildcard and % as a multi-character
wildcard. If search_term comes from a column and contains these characters,
the results are silently wrong:
SELECT url,
url LIKE '%' || search_term || '%' AS like_result,
strpos(url, search_term) > 0 AS contains_result
FROM (VALUES
('https://site.com/home'),
('https://site.com/user_profile'),
('https://site.com/username')
) AS t(url)
CROSS JOIN (VALUES ('user_')) AS s(search_term);
url | like_result | contains_result
-------------------------------+-------------+----------------
https://site.com/home | false | false
https://site.com/user_profile | true | true
https://site.com/username | true | false
LIKE '%user_%' matches 'https://site.com/username' because _ is a
wildcard that matches any single character — in this case, n. But
strpos(url, 'user_') > 0 treats _ as a literal underscore and correctly
reports that 'https://site.com/username' does not contain the substring
'user_'.
When the pattern is a constant, this distinction is visible and intentional.
But when users write x LIKE '%' || y || '%' where y is a column, the
values of y may contain _ or % characters — and they will be silently
interpreted as wildcards, producing wrong results.
Spark's split treats delimiters as regex
In Presto, the split function takes a literal string delimiter, while regexp_split is a separate function for regex-based splitting. This distinction makes the intent clear.
Spark's split function, however, always treats the delimiter as a regular expression. Users rarely realize this, and a common pattern is to split a string using a value from another column:
select split(dir_path, location_path)[1] as partition_name from t
Here, a table stores Hive partition metadata: dir_path is the full partition
path (e.g., /data/warehouse/db.name/table/ds=2024-01-01) and location_path
is the table path (e.g., /data/warehouse/db.name/table). The user wants to
strip the table path prefix to get the partition name.
This works for simple paths. But location_path is interpreted as a regular
expression, not a literal string. If it contains . — as in db.name — the
. matches any character, not a literal dot. Characters like (, ),
[, +, *, ?, and $ would also cause wrong results or errors.
A correct alternative that also executes faster uses simple string operations:
IF(starts_with(dir_path, location_path),
substr(dir_path, length(location_path) + 2)) as partition_name
This is a bit more verbose than split(dir_path, location_path)[1], but it is
correct for all inputs and avoids regex compilation entirely.
Performance trap
Beyond correctness, there is a performance problem. Both LIKE and Spark's split use RE2 as the regex engine. RE2 is fast and safe, but compiling a regular expression can take up to 200x more CPU time than evaluating it.
When the pattern or delimiter is a constant, the regex is compiled once and
reused for every row. The cost is negligible. But when the pattern comes from
a column, a new regex may need to be compiled for every distinct value. A table
with thousands of distinct location_path values means thousands of regex
compilations — each one expensive and none of them necessary.
Velox limits the number of compiled regular expressions per function instance per thread of execution via the expression.max_compiled_regexes configuration property (default: 100). When this limit is reached, the query fails with an error.
Tempting but wrong fix
When users hit this limit, the natural reaction is to ask the engine developers to raise or eliminate the cap. A recent pull request proposed replacing the fixed-size cache with an evicting cache: when the limit is reached, the oldest compiled regex is evicted to make room for the new one.
This sounds reasonable, and the motivation is understandable — users migrating from Spark don't want to rewrite working queries. But it makes things worse:
- It hides the correctness bug. The query no longer fails, so users never discover that their LIKE pattern or split delimiter is being interpreted as a regex and producing wrong results for inputs with special characters.
- It makes the performance problem worse. With thousands of distinct patterns, the cache churns constantly — evicting one compiled regex only to compile another. The query runs, but dramatically slower than necessary, and the user has no indication why. In shared multi-tenant clusters, a single slow query like this can consume excessive CPU and affect other users' workloads.
The error is a feature, not a bug. It is an early warning that catches misuse before it leads to silently wrong results in production and prevents a single query from wasting shared cluster resources.
Right fix
For users: replace LIKE with literal string operations when checking for
substrings. Use strpos(x, y) > 0 or contains(x, y) instead of
x LIKE '%' || y || '%'. For Spark's split with literal delimiters, use
substr or other string functions that don't involve regex.
For engine developers: optimize the functions to avoid regex when it isn't
needed. Velox's LIKE implementation already does this. As described in
James Xu's
earlier blog post, the engine analyzes each pattern
and uses fast paths — prefix match, suffix match, substring search — whenever
the pattern contains only regular characters and _ wildcards. For simple patterns, this gives up to 750x speedup over regex.
Regex is compiled only for patterns that truly require it, and these optimized
patterns are not counted toward the compiled regex limit.
The same approach should be applied to Spark's split function. The engine can
check whether the delimiter contains any regex metacharacters. If it doesn't,
a simple string search can be used instead of compiling a regex. This would
make queries like split(dir_path, location_path) both fast and correct —
without users needing to change anything and without removing the safety net
for cases that genuinely require regex.
Takeaways
LIKEis notcontains. The_and%wildcards can silently corrupt results when the pattern comes from a column.- Spark's
splittreats delimiters as regex. Characters like.in column values are interpreted as regex metacharacters, not literal characters. Presto avoids this by separatingsplit(literal) andregexp_split(regex). - When a query hits the compiled regex limit, the right response is to fix the query, not to raise the limit.
- Engine developers should optimize functions to avoid regex when the input is a plain string, rather than making it easier to misuse regex at scale.
