# Unicode

Functions for Unicode strings.

## List of functions

* `Unicode::IsUtf(String) -> Bool`

  Checks whether a string is a valid UTF-8 sequence. For example, the string `"\xF0"` isn't a valid UTF-8 sequence, but the string `"\xF0\x9F\x90\xB1"` correctly encodes the cat emoji in UTF-8.

* `Unicode::GetLength(Utf8{Flags:AutoMap}) -> Uint64`

  Returns the length of a UTF-8 string in Unicode code points. Surrogate pairs are counted as one character.

  ```yql
  SELECT Unicode::GetLength("жніўня");    -- 6
  ```

* `Unicode::Find(string:Utf8{Flags:AutoMap}, subString:Utf8, [pos:Uint64?]) -> Uint64?`
* `Unicode::RFind(string:Utf8{Flags:AutoMap}, subString:Utf8, [pos:Uint64?]) -> Uint64?`

  Finds the first (for `RFind`, the last) occurrence of a substring in a string, starting from the `pos` position. Returns the position of the first character of the found substring. If the substring is not found, returns Null.

  ```yql
  SELECT Unicode::Find("aaa", "bb");    -- Null
  ```

* `Unicode::Substring(string:Utf8{Flags:AutoMap}, from:Uint64?, len:Uint64?) -> Utf8`

  Returns the substring of `string` that starts at position `from` and is `len` characters long. If the `len` argument is omitted, the substring runs to the end of the source string. If `from` exceeds the length of the original string, an empty string `""` is returned.
  ```yql
  SELECT Unicode::Substring("0123456789abcdefghij", 10);    -- "abcdefghij"
  ```

* The `Unicode::Normalize...` functions convert the passed UTF-8 string to a [normalization form](https://unicode.org/reports/tr15/#Norm_Forms):

  * `Unicode::Normalize(Utf8{Flags:AutoMap}) -> Utf8` -- NFC
  * `Unicode::NormalizeNFD(Utf8{Flags:AutoMap}) -> Utf8`
  * `Unicode::NormalizeNFC(Utf8{Flags:AutoMap}) -> Utf8`
  * `Unicode::NormalizeNFKD(Utf8{Flags:AutoMap}) -> Utf8`
  * `Unicode::NormalizeNFKC(Utf8{Flags:AutoMap}) -> Utf8`

* `Unicode::Translit(string:Utf8{Flags:AutoMap}, [lang:String?]) -> Utf8`

  Transliterates into Latin letters those words of the passed string that consist entirely of characters from the alphabet of the language given in the second argument. If no language is specified, the words are transliterated from Russian. Available languages: "kaz", "rus", "tur", and "ukr".

  ```yql
  SELECT Unicode::Translit("Тот уголок земли, где я провел");    -- "Tot ugolok zemli, gde ya provel"
  ```

* `Unicode::LevensteinDistance(stringA:Utf8{Flags:AutoMap}, stringB:Utf8{Flags:AutoMap}) -> Uint64`

  Calculates the Levenshtein distance between the passed strings.

* `Unicode::Fold(Utf8{Flags:AutoMap}, [ Language:String?, DoLowerCase:Bool?, DoRenyxa:Bool?, DoSimpleCyr:Bool?, FillOffset:Bool? ]) -> Utf8`

  Performs [case folding](https://www.w3.org/TR/charmod-norm/#definitionCaseFolding) on the passed string.

  Parameters:

  - `Language` is set according to the same rules as in `Unicode::Translit()`.
  - `DoLowerCase` converts the string to lowercase letters, defaults to `true`.
  - `DoRenyxa` converts diacritical characters to similar Latin characters, defaults to `true`.
  - `DoSimpleCyr` converts diacritical Cyrillic characters to similar Latin characters, defaults to `true`.
  - `FillOffset` is not used.
  ```yql
  SELECT Unicode::Fold("Kongreßstraße", false AS DoSimpleCyr, false AS DoRenyxa);    -- "kongressstrasse"
  SELECT Unicode::Fold("ҫурт");    -- "сурт"
  SELECT Unicode::Fold("Eylül", "Turkish" AS Language);    -- "eylul"
  ```

* `Unicode::ReplaceAll(input:Utf8{Flags:AutoMap}, find:Utf8, replacement:Utf8) -> Utf8`
* `Unicode::ReplaceFirst(input:Utf8{Flags:AutoMap}, find:Utf8, replacement:Utf8) -> Utf8`
* `Unicode::ReplaceLast(input:Utf8{Flags:AutoMap}, find:Utf8, replacement:Utf8) -> Utf8`

  Replaces all occurrences, the first occurrence, or the last occurrence of the `find` string in `input` with `replacement`.

* `Unicode::RemoveAll(input:Utf8{Flags:AutoMap}, symbols:Utf8) -> Utf8`
* `Unicode::RemoveFirst(input:Utf8{Flags:AutoMap}, symbols:Utf8) -> Utf8`
* `Unicode::RemoveLast(input:Utf8{Flags:AutoMap}, symbols:Utf8) -> Utf8`

  Deletes all occurrences, the first occurrence, or the last occurrence of any character from the `symbols` set in `input`. The second argument is interpreted as an unordered set of characters to be removed.

  ```yql
  SELECT Unicode::ReplaceLast("absence", "enc", "");    -- "abse"
  SELECT Unicode::RemoveAll("abandon", "an");    -- "bdo"
  ```

* `Unicode::ToCodePointList(Utf8{Flags:AutoMap}) -> List`

  Splits a string into a sequence of Unicode code points.

* `Unicode::FromCodePointList(List{Flags:AutoMap}) -> Utf8`

  Builds a Unicode string from a list of code points.

  ```yql
  SELECT Unicode::ToCodePointList("Щавель");    -- [1065, 1072, 1074, 1077, 1083, 1100]
  SELECT Unicode::FromCodePointList(AsList(99,111,100,101,32,112,111,105,110,116,115,32,99,111,110,118,101,114,116,101,114));    -- "code points converter"
  ```

* `Unicode::Reverse(Utf8{Flags:AutoMap}) -> Utf8`

  Reverses a string.

* `Unicode::ToLower(Utf8{Flags:AutoMap}) -> Utf8`
* `Unicode::ToUpper(Utf8{Flags:AutoMap}) -> Utf8`
* `Unicode::ToTitle(Utf8{Flags:AutoMap}) -> Utf8`

  Converts a string to lower, UPPER, or Title case, respectively.

* `Unicode::SplitToList( string:Utf8?, separator:Utf8, [ DelimeterString:Bool?, SkipEmpty:Bool?, Limit:Uint64? ]) -> List`

  Splits a string into substrings by separator.
  - `string`: the source string.
  - `separator`: the separator.

  Parameters:

  - `DelimeterString:Bool?`: whether the separator is treated as a whole string (`true`, the default) or as a set of characters, any of which separates (`false`).
  - `SkipEmpty:Bool?`: whether to skip empty strings in the result, `false` by default.
  - `Limit:Uint64?`: limits the number of extracted components (unlimited by default). If the limit is exceeded, the unsplit remainder of the source string is returned as the last item.

* `Unicode::JoinFromList(List{Flags:AutoMap}, separator:Utf8) -> Utf8`

  Concatenates a list of strings into a single string, inserting `separator` between the items.

  ```yql
  SELECT Unicode::SplitToList("One, two, three, four, five", ", ", 2 AS Limit);    -- ["One", "two", "three, four, five"]
  SELECT Unicode::JoinFromList(["One", "two", "three", "four", "five"], ";");    -- "One;two;three;four;five"
  ```

* `Unicode::ToUint64(string:Utf8{Flags:AutoMap}, [prefix:Uint16?]) -> Uint64`

  Converts a string to a number. The optional second argument sets the numeral base. The default is 0, which means the base is detected from the prefix: `0x` (`0X`) selects base 16, `0` selects base 8, and any other input is parsed as base 10. A `-` sign before the number is interpreted as in C unsigned arithmetic. For example, `-0x1` -> UI64_MAX. If the string contains invalid characters or the number does not fit into ui64, the function terminates with an error.

* `Unicode::TryToUint64(string:Utf8{Flags:AutoMap}, [prefix:Uint16?]) -> Uint64?`

  Similar to the `Unicode::ToUint64()` function, except that it returns `NULL` instead of an error.

  ```yql
  SELECT Unicode::ToUint64("77741");    -- 77741
  SELECT Unicode::ToUint64("-77741");    -- 18446744073709473875
  SELECT Unicode::TryToUint64("asdh831");    -- Null
  ```
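
The prefix detection in `Unicode::ToUint64()` and the out-of-range behavior of `Unicode::TryToUint64()` can be illustrated with a few more queries. This is a minimal sketch: the input literals are chosen only for illustration, and the results in the comments are derived from the rules stated above rather than copied from a verified run.

```yql
-- Assumption: results follow the prefix and overflow rules described above.
SELECT Unicode::ToUint64("0x1F");    -- "0x" prefix, parsed as base 16: 31
SELECT Unicode::ToUint64("017");    -- leading "0", parsed as base 8: 15
SELECT Unicode::ToUint64("-0x1");    -- wraps as in C unsigned arithmetic: 18446744073709551615 (UI64_MAX)
SELECT Unicode::TryToUint64("18446744073709551616");    -- one past UI64_MAX, does not fit into ui64: Null
```

When the input comes from untrusted or free-form text, `Unicode::TryToUint64()` is likely the safer choice, because a single malformed value yields `NULL` instead of failing the whole query.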