UNICODE_NORMALIZE

Description

Normalizes the input string according to the specified Unicode normalization mode.

Unicode normalization is the process of converting equivalent Unicode character sequences into a unified form. For example, the character "é" can be represented by a single code point (U+00E9) or by "e" + a combining acute accent (U+0065 + U+0301). Normalization ensures that these equivalent representations are handled uniformly.
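
To make the two representations concrete, the following sketch inspects the code points outside the database, using Python's standard `unicodedata` module. It illustrates the behavior defined by the Unicode standard, not this function's implementation; the helper name `code_points` is hypothetical.

```python
import unicodedata

def code_points(s: str) -> list[str]:
    """Return the code points of s in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in s]

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(code_points(composed))    # ['U+00E9']
print(code_points(decomposed))  # ['U+0065', 'U+0301']

# NFC composes and NFD decomposes, so both inputs converge on one form.
print(code_points(unicodedata.normalize("NFC", decomposed)))  # ['U+00E9']
print(code_points(unicodedata.normalize("NFD", composed)))    # ['U+0065', 'U+0301']
```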

Syntax

UNICODE_NORMALIZE(<str>, <mode>)

Parameters

Parameter | Description
<str> | The input string to be normalized. Type: VARCHAR
<mode> | The normalization mode. Must be a constant string (case-insensitive). Supported modes:
- NFC: Canonical Decomposition, followed by Canonical Composition
- NFD: Canonical Decomposition
- NFKC: Compatibility Decomposition, followed by Canonical Composition
- NFKD: Compatibility Decomposition
- NFKC_CF: NFKC followed by Case Folding
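
The differences between the modes are easiest to see side by side. The sketch below uses Python's standard `unicodedata.normalize`, which implements the four standard forms; it assumes that UNICODE_NORMALIZE follows the same Unicode definitions. Python has no built-in NFKC_CF, so that mode is left out. The sample string is chosen purely for illustration.

```python
import unicodedata

# Sample: ligature ﬁ (U+FB01) + precomposed é (U+00E9) + circled digit ① (U+2460)
sample = "\ufb01\u00e9\u2460"

results = {form: unicodedata.normalize(form, sample)
           for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, text in results.items():
    # Show the form name, the number of code points, and the escaped result.
    print(f"{form:5} len={len(text)} {ascii(text)}")
```

The canonical forms (NFC, NFD) keep the ligature and the circled digit and only compose or decompose the accent. The compatibility forms (NFKC, NFKD) additionally replace the ligature with "fi" and the circled digit with "1".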

Return Value

Returns a VARCHAR containing the normalized form of the input string.

Examples

  1. Difference between NFC and NFD (composed vs decomposed characters)
-- 'Café' with é in composed form (U+00E9): NFC keeps it as one character, NFD decomposes it into e + combining acute accent
SELECT length(unicode_normalize('Café', 'NFC')) AS nfc_len, length(unicode_normalize('Café', 'NFD')) AS nfd_len;
+---------+---------+
| nfc_len | nfd_len |
+---------+---------+
|       4 |       5 |
+---------+---------+
  2. NFKC_CF for case folding
SELECT unicode_normalize('ABC 123', 'nfkc_cf') AS result;
+---------+
| result  |
+---------+
| abc 123 |
+---------+
  3. NFKC handling fullwidth characters (compatibility decomposition)
-- Fullwidth digits '１２３' will be converted to halfwidth '123'
SELECT unicode_normalize('１２３ＡＢＣ', 'NFKC') AS result;
+--------+
| result |
+--------+
| 123ABC |
+--------+
  4. NFKD handling special symbols (compatibility decomposition)
-- ℃ (degree Celsius symbol) will be decomposed to °C
SELECT unicode_normalize('25℃', 'NFKD') AS result;
+--------+
| result |
+--------+
| 25°C   |
+--------+
  5. Handling circled numbers
-- ① ② ③ circled numbers will be converted to regular digits
SELECT unicode_normalize('①②③', 'NFKC') AS result;
+--------+
| result |
+--------+
| 123    |
+--------+
  6. Comparing different modes on the same string
-- The ligature ﬁ (U+FB01) is preserved by NFC but decomposed into "fi" by NFKC
SELECT
unicode_normalize('ﬁ', 'NFC') AS nfc_result,
unicode_normalize('ﬁ', 'NFKC') AS nfkc_result;
+------------+-------------+
| nfc_result | nfkc_result |
+------------+-------------+
| ﬁ          | fi          |
+------------+-------------+
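  7. String equality comparison scenario
-- Use normalization to compare visually identical but differently encoded strings:
-- the first literal uses precomposed é (U+00E9), the second uses e + combining acute accent (U+0065 U+0301)
SELECT unicode_normalize('café', 'NFC') = unicode_normalize('café', 'NFC') AS is_equal;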
+----------+
| is_equal |
+----------+
|        1 |
+----------+
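
The same comparison can also be done on the application side, before data reaches the database. A minimal sketch in Python, assuming the standard `unicodedata` module; `same_text` is a hypothetical helper name, and NFC is an arbitrary default (any single form works as long as both sides use it).

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "caf\u00e9"      # é as U+00E9
decomposed = "cafe\u0301"   # e + combining acute accent U+0301

print(composed == decomposed)           # False: different code point sequences
print(same_text(composed, decomposed))  # True: equal after NFC
```

Choosing a compatibility form widens what counts as equal: under NFKC the ligature "ﬁ" matches "fi", whereas under NFC it does not.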