Skip to main content

REGEXP

Description

~ Performs a regular expression match on the string str, returning true if the match succeeds, otherwise false. pattern is the regular expression pattern. It should be noted that when handling character set matching, Utf-8 standard character classes should be used. This ensures that functions can correctly identify and process various characters from different languages.

If the pattern is not allowed regexp regular,throw error;

Default supported character match classes : https://github.com/google/re2/wiki/Syntax

Doris supports enabling more advanced regular expression features, such as look-around zero-width assertions, through the session variable enable_extended_regex (default is false).

Supported character matching types when the session variable enable_extended_regex is set to true: https://www.boost.org/doc/libs/latest/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

Note: After enabling this variable, performance will only be affected when the regular expression contains advanced syntax (such as look-around). Therefore, for better performance, it is recommended to optimize your regular expressions as much as possible and avoid using such zero-width assertions.

Syntax

REGEXP(<str>, <pattern>)

Parameters

ParameterDescription
<str>String type. Represents the string to be matched against the regular expression, which can be a column in a table or a literal string.
<pattern>String type. The regular expression pattern used to match against the string . Regular expressions provide a powerful way to define complex search patterns, including character classes, quantifiers, and anchors.

Return Value

The REGEXP function returns a BOOLEAN value. If the string matches the regular expression pattern , the function returns true (represented as 1 in SQL); if not, it returns false (represented as 0 in SQL).

Default Behavior:

Default SettingBehavior
. matches newline. can match \n (newline) by default.
Case-sensitiveMatching is case-sensitive.
^/$ match full string boundaries^ matches only the start of the string, $ matches only the end, not line starts/ends.
Greedy quantifiers*, +, etc. match as much as possible by default.
UTF-8Strings are processed as UTF-8.

Pattern Modifiers:

You can override the default behavior by prefixing the pattern with (?flags). Multiple modifiers can be combined, e.g., (?im); a - prefix disables the corresponding option, e.g., (?-s).

Pattern modifiers only take effect when using the default regex engine. If enable_extended_regex=true is enabled while using zero-width assertions (e.g., (?<=...), (?=...)), the query will be handled by the Boost.Regex engine, and modifier behavior may not work as expected. It is recommended not to mix them.

FlagMeaning
(?i)Case-insensitive matching
(?-i)Case-sensitive (default)
(?s). matches newline (enabled by default)
(?-s). does not match newline
(?m)Multiline mode: ^ matches start of each line, $ matches end of each line
(?-m)Single-line mode: ^/$ match full string boundaries (default)
(?U)Non-greedy quantifiers: *, +, etc. match as little as possible
(?-U)Greedy quantifiers (default): *, +, etc. match as much as possible

Examples

CREATE TABLE test ( k1 VARCHAR(255) ) properties("replication_num"="1")

INSERT INTO test (k1) VALUES ('billie eillish'), ('It\'s ok'), ('billie jean'), ('hello world');
--- Find all data in the k1 column starting with 'billie'
SELECT k1 FROM test WHERE k1 REGEXP '^billie'
--------------

+----------------+
| k1 |
+----------------+
| billie eillish |
| billie jean |
+----------------+
2 rows in set (0.02 sec)

--- Find data in the k1 column ending with 'ok':
SELECT k1 FROM test WHERE k1 REGEXP 'ok$'
--------------

+---------+
| k1 |
+---------+
| It's ok |
+---------+
1 row in set (0.03 sec)

Chinese Character Example

mysql> select regexp('这是一段中文 This is a passage in English 1234567', '\\p{Han}');
+-----------------------------------------------------------------------------+
| ('这是一段中文 This is a passage in English 1234567' regexp '\p{Han}') |
+-----------------------------------------------------------------------------+
| 1 |
+-----------------------------------------------------------------------------+

Insertion and Testing for Simple String Matching

CREATE TABLE test_regexp (
id INT,
name VARCHAR(255)
) PROPERTIES("replication_num"="1");

INSERT INTO test_regexp (id, name) VALUES
(1, 'Alice'),
(2, 'Bob'),
(3, 'Charlie'),
(4, 'David');

-- Find all names starting with 'A'
SELECT id, name FROM test_regexp WHERE name REGEXP '^A';
+------+-------+
| id | name |
+------+-------+
| 1 | Alice |
+------+-------+

Special Character Matching Test

-- Insert names with special characters
INSERT INTO test_regexp (id, name) VALUES
(5, 'Anna-Maria'),
(6, 'John_Doe');

-- Find names containing the '-' character
SELECT id, name FROM test_regexp WHERE name REGEXP '-';
+------+------------+
| id | name |
+------+------------+
| 5 | Anna-Maria |
+------+------------+

Test for Ending String Matching

-- Find names ending with 'e'
SELECT id, name FROM test_regexp WHERE name REGEXP 'e$';
+------+---------+
| id | name |
+------+---------+
| 1 | Alice |
| 3 | Charlie |
+------+---------+

Emoji test

SELECT 'Hello' REGEXP '😀'; 
+-----------------------+
| 'Hello' REGEXP '😀' |
+-----------------------+
| 0 |
+-----------------------+

'str' is NULL,return NULL

mysql> SELECT REGEXP(NULL, '^billie');
+-------------------------+
| REGEXP(NULL, '^billie') |
+-------------------------+
| NULL |
+-------------------------+

'pattern' is NULL, return NULL

mysql> SELECT REGEXP('billie eillish', NULL);
+--------------------------------+
| REGEXP('billie eillish', NULL) |
+--------------------------------+
| NULL |
+--------------------------------+

All parameters are NULL,return NULL

mysql> SELECT REGEXP(NULL, NULL);
+--------------------+
| REGEXP(NULL, NULL) |
+--------------------+
| NULL |
+--------------------+

If the pattern is not allowed regexp regular,throw error;

SELECT REGEXP('Hello, World!', '([a-z');
ERROR 1105 (HY000): errCode = 2, detailMessage = (10.16.10.2)[INTERNAL_ERROR]Invalid regex expression: ([a-z

Advanced regexp

SELECT REGEXP('Apache/Doris', '([a-zA-Z_+-]+(?:\/[a-zA-Z_0-9+-]+)*)(?=s|$)');
-- ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INTERNAL_ERROR]Invalid regex expression: ([a-zA-Z_+-]+(?:/[a-zA-Z_0-9+-]+)*)(?=s|$). Error: invalid perl operator: (?=

SET enable_extended_regex = true;
SELECT REGEXP('Apache/Doris', '([a-zA-Z_+-]+(?:\/[a-zA-Z_0-9+-]+)*)(?=s|$)');
+-----------------------------------------------------------------------+
| REGEXP('Apache/Doris', '([a-zA-Z_+-]+(?:\/[a-zA-Z_0-9+-]+)*)(?=s|$)') |
+-----------------------------------------------------------------------+
| 1 |
+-----------------------------------------------------------------------+

Pattern Modifiers

Case-insensitive matching: (?i) makes the match ignore case

SELECT REGEXP('Hello World', 'hello') AS case_sensitive, REGEXP('Hello World', '(?i)hello') AS case_insensitive;
+----------------+------------------+
| case_sensitive | case_insensitive |
+----------------+------------------+
| 0 | 1 |
+----------------+------------------+

. matches newline by default; use (?-s) to prevent . from matching newline

SELECT REGEXP('foo\nbar', '^.+$') AS dot_match_nl, REGEXP('foo\nbar', '(?-s)^.+$') AS dot_not_match_nl;
+--------------+------------------+
| dot_match_nl | dot_not_match_nl |
+--------------+------------------+
| 1 | 0 |
+--------------+------------------+

Multiline mode: (?m) makes ^ and $ match start/end of each line

SELECT REGEXP('foo\nbar', '^bar') AS single_line, REGEXP('foo\nbar', '(?m)^bar') AS multi_line;
+-------------+------------+
| single_line | multi_line |
+-------------+------------+
| 0 | 1 |
+-------------+------------+

Greedy vs non-greedy: (?U) makes quantifiers match as little as possible

SELECT REGEXP_EXTRACT('aXbXc', '(a.*X)', 1) AS greedy, REGEXP_EXTRACT('aXbXc', '(?U)(a.*X)', 1) AS non_greedy;
+--------+------------+
| greedy | non_greedy |
+--------+------------+
| aXbX | aX |
+--------+------------+