normalize_string

normalize_string(s, normalform::Symbol)

Normalize the string s according to one of the four "normal forms" of the Unicode standard: normalform can be :NFC, :NFD, :NFKC, or :NFKD. Normal forms C (canonical composition) and D (canonical decomposition) convert different visually identical representations of the same abstract string into a single canonical form, with form C being more compact. Normal forms KC and KD additionally canonicalize "compatibility equivalents": they convert characters that are abstractly similar but visually distinct into a single canonical choice (e.g. they expand ligatures into the individual characters), with form KC being more compact.
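
The difference between the canonical and compatibility forms is easiest to see on a pair of strings that render identically. The following is a minimal sketch, assuming a Julia release that still exports normalize_string from Base. The small shim at the top is an addition for illustration, not part of the documented API: it forwards to Unicode.normalize on later releases where the function was renamed.

```julia
# Assumption: `normalize_string` is exported from Base, as documented here.
# Illustrative shim: on releases where the name was removed, forward to
# `Unicode.normalize`, which accepts the same arguments.
if !isdefined(Base, :normalize_string)
    using Unicode
    normalize_string(args...; kw...) = Unicode.normalize(args...; kw...)
end

composed   = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"    # "e" followed by a combining acute accent

# The two strings look the same but compare unequal before normalization.
println(composed == decomposed)                           # false

# Form C composes and form D decomposes; either one makes them comparable.
println(normalize_string(decomposed, :NFC) == composed)   # true
println(normalize_string(composed, :NFD) == decomposed)   # true

# Form C is the more compact of the two (character counts).
println(length(normalize_string(decomposed, :NFC)))       # 1
println(length(normalize_string(composed, :NFD)))         # 2

# Only the compatibility forms expand the "ﬁ" ligature (U+FB01).
println(normalize_string("\ufb01", :NFC) == "\ufb01")     # true
println(normalize_string("\ufb01", :NFKC) == "fi")        # true
```

Normalizing both sides of a comparison to the same form, as above, is the usual way to compare strings that may come from different input sources.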

Alternatively, finer control and additional transformations may be obtained by calling normalize_string(s; keywords...), where any number of the following boolean keyword options (which all default to false except for compose) are specified:

compose=false: do not perform canonical composition
decompose=true: do canonical decomposition instead of canonical composition (compose=true is ignored if present)
compat=true: compatibility equivalents are canonicalized
casefold=true: perform Unicode case folding, e.g. for case-insensitive string comparison
newline2lf=true, newline2ls=true, or newline2ps=true: convert various newline sequences (LF, CRLF, CR, NEL) into a linefeed (LF), line-separation (LS), or paragraph-separation (PS) character, respectively
stripmark=true: strip diacritical marks (e.g. accents)
stripignore=true: strip Unicode's "default ignorable" characters (e.g. the soft hyphen or the left-to-right mark)
stripcc=true: strip control characters; horizontal tabs and form feeds are converted to spaces; newlines are also converted to spaces unless a newline-conversion flag was specified
rejectna=true: throw an error if unassigned code points are found
stable=true: enforce Unicode Versioning Stability

For example, NFKC corresponds to the options compose=true, compat=true, stable=true.
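
The keyword form can be sketched as follows. This is an illustrative example rather than an official one: the helper name comparison_key is hypothetical, and the shim at the top is an assumption added so the sketch also runs on later releases where the function is available as Unicode.normalize.

```julia
# Assumption: `normalize_string` is exported from Base, as documented here.
# Illustrative shim: forward to `Unicode.normalize` where the name was removed.
if !isdefined(Base, :normalize_string)
    using Unicode
    normalize_string(args...; kw...) = Unicode.normalize(args...; kw...)
end

# The symbol form and the equivalent keyword form give the same result.
s = "\ufb01ne"    # starts with the "ﬁ" ligature (U+FB01)
by_symbol   = normalize_string(s, :NFKC)
by_keywords = normalize_string(s; compose=true, compat=true, stable=true)
println(by_symbol == by_keywords)    # true
println(by_symbol == "fine")         # true

# Hypothetical helper: a key for case- and accent-insensitive comparison,
# built from the casefold and stripmark options.
comparison_key(str) = normalize_string(str; casefold=true, stripmark=true)
println(comparison_key("Résumé") == comparison_key("RESUME"))    # true

# Convert a CRLF line ending to a single linefeed.
println(normalize_string("a\r\nb"; newline2lf=true) == "a\nb")   # true
```

Options can be combined freely in one call, which avoids chaining several normalization passes over the same string.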
