Unicode

Functions for Unicode strings.

List of functions

Unicode::IsUtf(String) -> Bool

Checks whether a string is a valid UTF-8 sequence. For example, the string "\xF0" isn't a valid UTF-8 sequence, but the string "\xF0\x9F\x90\xB1" correctly describes a UTF-8 cat emoji.
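
To make the validity rule concrete, here is a minimal Python sketch (not YQL; the helper name is_valid_utf8 is hypothetical) that checks the same two byte strings by attempting a strict UTF-8 decode. It illustrates the concept only and is not how the UDF is implemented.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "\xF0" announces a 4-byte sequence but the continuation bytes are missing.
print(is_valid_utf8(b"\xF0"))
# The full 4-byte sequence encodes the single code point U+1F431.
print(is_valid_utf8(b"\xF0\x9F\x90\xB1"))
```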

Unicode::GetLength(Utf8{Flags:AutoMap}) -> Uint64

Returns the length of a UTF-8 string in Unicode code points. Surrogate pairs are counted as one character.

  SELECT Unicode::GetLength("жніўня"); -- 6
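
The difference between code points, bytes, and UTF-16 code units can be illustrated with a short Python sketch (not YQL). Python's len() on a str also counts code points, so it matches the result above; the variable names are illustrative.

```python
# The word from the example, written with escapes so each code point is explicit.
word = "\u0436\u043d\u0456\u045e\u043d\u044f"
cat = "\U0001F431"  # a code point outside the Basic Multilingual Plane

word_code_points = len(word)                  # code points
word_utf8_bytes = len(word.encode("utf-8"))   # 2 bytes per letter here
cat_code_points = len(cat)                    # one code point
cat_utf16_units = len(cat.encode("utf-16-le")) // 2  # a surrogate pair in UTF-16

print(word_code_points, word_utf8_bytes)  # 6 12
print(cat_code_points, cat_utf16_units)   # 1 2
```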

Unicode::Find(string:Utf8{Flags:AutoMap}, subString:Utf8, [pos:Uint64?]) -> Uint64?
Unicode::RFind(string:Utf8{Flags:AutoMap}, subString:Utf8, [pos:Uint64?]) -> Uint64?

Finds the first (RFind: the last) occurrence of a substring in a string, starting from position pos. Returns the position of the first character of the found substring, or Null if the substring is not found.

  SELECT Unicode::Find("aaa", "bb"); -- Null
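
The following Python sketch (not YQL) shows the idea of searching by character position instead of byte offset. It assumes positions are zero-based code point indices, which the text above does not state explicitly. The helpers find and rfind are hypothetical, and rfind omits pos because its exact meaning for a reverse search is not described here.

```python
from typing import Optional


def find(s: str, sub: str, pos: int = 0) -> Optional[int]:
    """First occurrence of sub at or after pos, or None if absent."""
    i = s.find(sub, pos)
    return None if i == -1 else i


def rfind(s: str, sub: str) -> Optional[int]:
    """Last occurrence of sub, or None if absent."""
    i = s.rfind(sub)
    return None if i == -1 else i


text = "абвабв"
print(find("aaa", "bb"))   # None, as in the example above
print(find(text, "в"))     # 2 (a byte-based search would report 4)
print(find(text, "в", 3))  # 5
print(rfind(text, "в"))    # 5
```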

Unicode::Substring(string:Utf8{Flags:AutoMap}, from:Uint64?, len:Uint64?) -> Utf8

Returns the substring that starts at position from and is len characters long. If the len argument is omitted, the substring extends to the end of the source string.

If from exceeds the length of the original string, an empty string "" is returned.

  SELECT Unicode::Substring("0123456789abcdefghij", 10); -- "abcdefghij"
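
The same behavior can be sketched with Python slicing (not YQL). The helper substring is hypothetical; it mirrors the zero-based position implied by the example above and the empty result when from is past the end of the string.

```python
from typing import Optional


def substring(s: str, start: int, length: Optional[int] = None) -> str:
    """Slice by code point position; an omitted length means 'to the end'."""
    if length is None:
        return s[start:]
    return s[start:start + length]


print(substring("0123456789abcdefghij", 10))     # abcdefghij
print(substring("0123456789abcdefghij", 10, 3))  # abc
print(repr(substring("abc", 5)))                 # ''
```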

The Unicode::Normalize... functions convert the passed UTF-8 string to a normalization form:
- Unicode::Normalize(Utf8{Flags:AutoMap}) -> Utf8 -- NFC
- Unicode::NormalizeNFD(Utf8{Flags:AutoMap}) -> Utf8
- Unicode::NormalizeNFC(Utf8{Flags:AutoMap}) -> Utf8
- Unicode::NormalizeNFKD(Utf8{Flags:AutoMap}) -> Utf8
- Unicode::NormalizeNFKC(Utf8{Flags:AutoMap}) -> Utf8
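
How the four forms differ can be seen in a short Python sketch that uses the standard unicodedata module (not YQL). It demonstrates the normalization forms themselves, which are defined by the Unicode standard. That the UDF output matches Python's output for every input is an assumption.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# NFD splits the precomposed letter; NFC puts it back together.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)
print(len(nfd), len(nfc))  # 2 1

# The compatibility forms (NFKD, NFKC) also replace compatibility
# characters such as the "fi" ligature with their plain equivalents.
ligature = "\ufb01"
ligature_nfc = unicodedata.normalize("NFC", ligature)    # unchanged
ligature_nfkc = unicodedata.normalize("NFKC", ligature)  # "fi"
print(len(ligature_nfc), ligature_nfkc)  # 1 fi
```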

Unicode::Translit(string:Utf8{Flags:AutoMap}, [lang:String?]) -> Utf8

Transliterates into Latin letters those words of the passed string that consist entirely of characters from the alphabet of the language given in the second argument. If no language is specified, the words are transliterated from Russian. Available languages: "kaz", "rus", "tur", and "ukr".

  SELECT Unicode::Translit("Тот уголок земли, где я провел"); -- "Tot ugolok zemli, gde ya provel"

Unicode::LevensteinDistance(stringA:Utf8{Flags:AutoMap}, stringB:Utf8{Flags:AutoMap}) -> Uint64

Calculates the Levenshtein distance for the passed strings.
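
For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. Below is a minimal Python sketch of the classic dynamic-programming computation over code points (not YQL, and not necessarily how the UDF is implemented; the function name levenshtein is illustrative).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, computed row by row."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("кот", "кит"))         # 1
```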

Unicode::Fold(Utf8{Flags:AutoMap}, [ Language:String?, DoLowerCase:Bool?, DoRenyxa:Bool?, DoSimpleCyr:Bool?, FillOffset:Bool? ]) -> Utf8

Performs case folding on the passed string.

Parameters:

- Language is set according to the same rules as in Unicode::Translit().
- DoLowerCase converts the string to lowercase letters; defaults to true.
- DoRenyxa converts diacritical characters to similar Latin characters; defaults to true.
- DoSimpleCyr converts diacritical Cyrillic characters to similar Latin characters; defaults to true.
- FillOffset is not used.

  SELECT Unicode::Fold("Kongreßstraße", false AS DoSimpleCyr, false AS DoRenyxa); -- "kongressstrasse"
  SELECT Unicode::Fold("ҫурт"); -- "сурт"
  SELECT Unicode::Fold("Eylül", "Turkish" AS Language); -- "eylul"

Unicode::ReplaceAll(input:Utf8{Flags:AutoMap}, find:Utf8, replacement:Utf8) -> Utf8
Unicode::ReplaceFirst(input:Utf8{Flags:AutoMap}, find:Utf8, replacement:Utf8) -> Utf8
Unicode::ReplaceLast(input:Utf8{Flags:AutoMap}, find:Utf8, replacement:Utf8) -> Utf8

Replaces all/first/last occurrences of the find string in the input with replacement.

Unicode::RemoveAll(input:Utf8{Flags:AutoMap}, symbols:Utf8) -> Utf8
Unicode::RemoveFirst(input:Utf8{Flags:AutoMap}, symbols:Utf8) -> Utf8
Unicode::RemoveLast(input:Utf8{Flags:AutoMap}, symbols:Utf8) -> Utf8

Deletes all/first/last occurrences of characters in the symbols set from the input. The second argument is interpreted as an unordered set of characters to be removed.

  SELECT Unicode::ReplaceLast("absence", "enc", ""); -- "abse"
  SELECT Unicode::RemoveAll("abandon", "an"); -- "bdo"
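
The difference between the two families (Replace... matches the whole find string, Remove... treats symbols as a set of individual characters) can be sketched in Python (not YQL). The helpers replace_last and remove_all are hypothetical and reproduce only the two examples above; edge cases such as an empty find string are not covered.

```python
def replace_last(s: str, find: str, replacement: str) -> str:
    """Replace only the last occurrence of the whole string find."""
    i = s.rfind(find)
    if i == -1:
        return s
    return s[:i] + replacement + s[i + len(find):]


def remove_all(s: str, symbols: str) -> str:
    """Drop every character of s that belongs to the set of symbols."""
    drop = set(symbols)
    return "".join(ch for ch in s if ch not in drop)


print(replace_last("absence", "enc", ""))  # abse
print(remove_all("abandon", "an"))         # bdo
print(remove_all("abandon", "na"))         # bdo (the order of symbols is irrelevant)
```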

Unicode::ToCodePointList(Utf8{Flags:AutoMap}) -> List<Uint32>

Splits a string into a sequence of Unicode code points.

Unicode::FromCodePointList(List<Uint32>{Flags:AutoMap}) -> Utf8

Builds a Unicode string from a list of code points.

  SELECT Unicode::ToCodePointList("Щавель"); -- [1065, 1072, 1074, 1077, 1083, 1100]
  SELECT Unicode::FromCodePointList(AsList(99,111,100,101,32,112,111,105,110,116,115,32,99,111,110,118,101,114,116,101,114)); -- "code points converter"
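
The round trip between a string and its code points can be reproduced in Python with ord and chr (not YQL; the helper names are illustrative).

```python
from typing import List


def to_code_points(s: str) -> List[int]:
    """One integer per Unicode code point of s."""
    return [ord(ch) for ch in s]


def from_code_points(code_points: List[int]) -> str:
    """Inverse of to_code_points."""
    return "".join(chr(cp) for cp in code_points)


points = to_code_points("Щавель")
print(points)                    # [1065, 1072, 1074, 1077, 1083, 1100]
print(from_code_points(points))  # Щавель
```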

Unicode::Reverse(Utf8{Flags:AutoMap}) -> Utf8

Reverses a string.

Unicode::ToLower(Utf8{Flags:AutoMap}) -> Utf8
Unicode::ToUpper(Utf8{Flags:AutoMap}) -> Utf8
Unicode::ToTitle(Utf8{Flags:AutoMap}) -> Utf8

Converts a string to lower, UPPER, or Title case, respectively.

Unicode::SplitToList( string:Utf8?, separator:Utf8, [ DelimeterString:Bool?, SkipEmpty:Bool?, Limit:Uint64? ]) -> List<Utf8>

Splits a string into substrings by separator.

Arguments:

- string - the source string
- separator - the separator

Parameters:

- DelimeterString:Bool? - whether to treat the separator as a whole string (true, the default) or as a set of characters, any of which splits the string (false)
- SkipEmpty:Bool? - whether to skip empty strings in the result; false by default
- Limit:Uint64? - limits the number of fetched components (unlimited by default); if the limit is exceeded, the raw suffix of the source string is returned in the last item

Unicode::JoinFromList(List<Utf8>{Flags:AutoMap}, separator:Utf8) -> Utf8

Concatenates a list of strings via a separator into a single string.

  SELECT Unicode::SplitToList("One, two, three, four, five", ", ", 2 AS Limit); -- ["One", "two", "three, four, five"]
  SELECT Unicode::JoinFromList(["One", "two", "three", "four", "five"], ";"); -- "One;two;three;four;five"
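
As a rough analogy in Python (not YQL), str.split with maxsplit reproduces the Limit example above: two splits are made and the untouched remainder becomes the last item. Treating Limit as the number of splits is an assumption based only on that example. The "any of" mode and SkipEmpty are approximated here with a character class and a filter.

```python
import re

text = "One, two, three, four, five"

# Whole-string separator with a limit of 2 splits.
limited = text.split(", ", 2)
print(limited)  # ['One', 'two', 'three, four, five']

# "Any of" mode: every "," or " " splits on its own, which yields empty items.
any_of = re.split("[, ]", "a, b")
print(any_of)  # ['a', '', 'b']

# Skipping empty items, as SkipEmpty would.
non_empty = [part for part in any_of if part]
print(non_empty)  # ['a', 'b']

# Joining is the inverse operation.
joined = ";".join(["One", "two", "three", "four", "five"])
print(joined)  # One;two;three;four;five
```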

Unicode::ToUint64(string:Utf8{Flags:AutoMap}, [prefix:Uint16?]) -> Uint64

Converts a string to a number.

The optional second argument sets the numeral base. The default value, 0, means the base is detected from the string's prefix: 0x (or 0X) for base 16, 0 for base 8, and base 10 otherwise. A - sign before the number is interpreted as in C unsigned arithmetic; for example, -0x1 yields UI64_MAX. If the string contains invalid characters or the number is outside the ui64 range, the function terminates with an error.

Unicode::TryToUint64(string:Utf8{Flags:AutoMap}, [prefix:Uint16?]) -> Uint64?

Similar to the Unicode::ToUint64() function, except that it returns NULL instead of an error.

  SELECT Unicode::ToUint64("77741"); -- 77741
  SELECT Unicode::ToUint64("-77741"); -- 18446744073709473875
  SELECT Unicode::TryToUint64("asdh831"); -- Null
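
The prefix detection and the unsigned wrap-around for negative input can be sketched in Python (not YQL). The helpers to_uint64 and try_to_uint64 are hypothetical. The sketch assumes that the magnitude must itself fit in 64 bits before negation and that an explicit base does not accept a prefix; neither detail is stated above.

```python
from typing import Optional

UI64_MAX = 2**64 - 1
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_uint64(s: str, base: int = 0) -> int:
    """Parse s as an unsigned 64-bit integer; raise on bad input."""
    negative = s.startswith("-")
    text = s[1:] if negative else s
    if base == 0:  # detect the base from the prefix
        if text[:2] in ("0x", "0X"):
            base, text = 16, text[2:]
        elif len(text) > 1 and text[0] == "0":
            base, text = 8, text[1:]
        else:
            base = 10
    if not text:
        raise ValueError("no digits to parse")
    value = 0
    for ch in text.lower():
        digit = DIGITS.find(ch)
        if digit == -1 or digit >= base:
            raise ValueError(f"invalid character {ch!r} for base {base}")
        value = value * base + digit
    if value > UI64_MAX:  # assumption: the magnitude is range-checked
        raise OverflowError("value does not fit in 64 bits")
    # Negation wraps modulo 2**64, as in C unsigned arithmetic.
    return (-value) % 2**64 if negative else value


def try_to_uint64(s: str, base: int = 0) -> Optional[int]:
    """Same as to_uint64, but return None instead of raising."""
    try:
        return to_uint64(s, base)
    except (ValueError, OverflowError):
        return None


print(to_uint64("77741"))        # 77741
print(to_uint64("-77741"))       # 18446744073709473875
print(try_to_uint64("asdh831"))  # None
```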
