Diffstat (limited to 'docs/handbook/tomlplusplus/unicode-handling.md')
| -rw-r--r-- | docs/handbook/tomlplusplus/unicode-handling.md | 335 |
1 files changed, 335 insertions, 0 deletions
diff --git a/docs/handbook/tomlplusplus/unicode-handling.md b/docs/handbook/tomlplusplus/unicode-handling.md
new file mode 100644
index 0000000000..6cafb3deff
--- /dev/null
+++ b/docs/handbook/tomlplusplus/unicode-handling.md
@@ -0,0 +1,335 @@

# toml++: Unicode Handling

## Overview

toml++ fully handles UTF-8 encoded input and output as required by the TOML specification. All TOML documents must be valid UTF-8, and the library validates, decodes, and encodes Unicode throughout parsing and formatting.

Core Unicode utilities are in `include/toml++/impl/unicode.hpp`, with auto-generated lookup tables in `unicode_autogenerated.hpp`.

---

## UTF-8 Input Requirements

The parser expects all input to be valid UTF-8:

- **BOM handling**: A leading UTF-8 BOM (`0xEF 0xBB 0xBF`) is silently stripped before parsing begins
- **Validation**: Invalid byte sequences (overlong encodings, surrogate code points, truncated sequences) produce parse errors
- **Multi-byte characters**: Fully supported in string values, comments, and bare keys (where permitted by TOML)

```cpp
// UTF-8 content works naturally
auto tbl = toml::parse(R"(
    greeting = "Hello, 世界!"
    emoji = "😀"
    name = "Ñoño"
)");
```

---

## Character Classification

The library classifies Unicode code points for parsing with functions in `unicode.hpp`:

### `is_string_delimiter()`

Identifies characters that can start/end strings: `"` (U+0022) and `'` (U+0027).

### `is_ascii_letter()`

Matches `[A-Za-z]`. Used in bare key validation and other ASCII-specific checks.

### `is_ascii_whitespace()`

Space (U+0020) and tab (U+0009).

### `is_ascii_line_break()`

LF (U+000A) and CR (U+000D).

### `is_bare_key_character()`

Characters permitted in TOML bare keys: `[A-Za-z0-9_-]`, plus Unicode letters/digits when `TOML_UNRELEASED_FEATURES` is enabled.

### `is_control_character()`

Control characters (U+0000–U+001F, U+007F) excluding tab.
These are forbidden in basic strings and must be escaped.

### `is_non_ascii_letter()`

Unicode letter code points outside ASCII, taken from the auto-generated tables in `unicode_autogenerated.hpp`. Used for extended bare key support in unreleased TOML features.

### `is_non_ascii_number()`

Unicode digit code points outside ASCII (e.g., Arabic-Indic digits).

### `is_non_ascii_whitespace()`

Unicode whitespace beyond ASCII space/tab.

---

## Escape Sequences in Strings

TOML basic strings (`"..."` and `"""..."""`) support escape sequences. The parser decodes these into their UTF-8 representations:

| Escape | Meaning | Code Point |
|--------|---------|------------|
| `\b` | Backspace | U+0008 |
| `\t` | Tab | U+0009 |
| `\n` | Line Feed | U+000A |
| `\f` | Form Feed | U+000C |
| `\r` | Carriage Return | U+000D |
| `\"` | Quote | U+0022 |
| `\\` | Backslash | U+005C |
| `\uXXXX` | Unicode (4 hex digits) | U+0000–U+FFFF |
| `\UXXXXXXXX` | Unicode (8 hex digits) | U+00000000–U+0010FFFF |

### `control_char_escapes` Table

The formatter uses a lookup table for serializing control characters back to escape sequences:

```cpp
// In impl namespace:
inline constexpr const char* control_char_escapes[] = {
    "\\u0000", "\\u0001", "\\u0002", "\\u0003",
    "\\u0004", "\\u0005", "\\u0006", "\\u0007",
    "\\b",     "\\t",     "\\n",     "\\u000B",
    "\\f",     "\\r",     "\\u000E", "\\u000F",
    "\\u0010", "\\u0011", "\\u0012", "\\u0013",
    "\\u0014", "\\u0015", "\\u0016", "\\u0017",
    "\\u0018", "\\u0019", "\\u001A", "\\u001B",
    "\\u001C", "\\u001D", "\\u001E", "\\u001F",
};
```

---

## Unicode Escape Decoding

The parser processes `\uXXXX` and `\UXXXXXXXX` escapes:

1. Reads 4 or 8 hexadecimal digits
2. Validates the code point:
   - Must not be a surrogate (U+D800–U+DFFF)
   - Must not exceed U+10FFFF
   - Must not be a non-character (U+FDD0–U+FDEF, U+xFFFE–U+xFFFF)
3.
   Encodes to UTF-8 bytes (1–4 bytes depending on code point range)

```toml
# Valid Unicode escapes
escape_a = "\u0041"          # "A"
escape_heart = "\u2764"      # "❤"
escape_emoji = "\U0001F600"  # "😀"
```

```cpp
using namespace std::string_view_literals; // needed for the ""sv suffix

auto tbl = toml::parse(R"(
    a = "\u0041"
    heart = "\u2764"
)");

std::cout << tbl["a"].value_or(""sv) << "\n";     // A
std::cout << tbl["heart"].value_or(""sv) << "\n"; // ❤
```

---

## UTF-8 Encoding in Output

When formatting, the behavior depends on `format_flags::allow_unicode_strings`:

### With `allow_unicode_strings` (default for TOML and YAML formatters)

Non-ASCII characters pass through unescaped:

```cpp
auto tbl = toml::table{ { "name", "日本語" } };
std::cout << tbl << "\n";
// name = "日本語"
```

### Without `allow_unicode_strings`

Non-ASCII characters are escaped to `\uXXXX` / `\UXXXXXXXX`:

```cpp
auto tbl = toml::table{ { "name", "日本語" } };
auto fmt = toml::toml_formatter{
    tbl,
    toml::format_flags::indentation // no allow_unicode_strings
};
std::cout << fmt << "\n";
// name = "\u65E5\u672C\u8A9E"
```

---

## char8_t Support (C++20)

When compiling with C++20, `char8_t` overloads are available for parsing:

```cpp
auto tbl = toml::parse(u8R"(
    greeting = "Hello, 世界"
)"sv);
```

The `char8_t` strings are internally treated as UTF-8 byte sequences. `std::u8string_view` is accepted by `parse()`.

### `source_path` as u8string

```cpp
auto tbl = toml::parse(doc, u8"config.toml"sv);
```

---

## Windows Compatibility (`TOML_ENABLE_WINDOWS_COMPAT`)

When enabled (default on Windows), additional conversion overloads exist:

- `parse_file(std::wstring_view)`: converts a wide file path to UTF-8
- `value<std::wstring>()`: converts the stored UTF-8 string to a wide string
- String comparison with `wchar_t*` / `std::wstring_view`

The conversions use the Windows API (`MultiByteToWideChar` / `WideCharToMultiByte`) internally.
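
On Windows, `wchar_t` holds UTF-16 code units, so the wide-string overloads listed above come down to transcoding between UTF-8 and UTF-16. The sketch below illustrates only the UTF-8 to UTF-16 direction in portable C++. It is not toml++ code and does not claim to mirror how `MultiByteToWideChar` works internally; `utf8_to_utf16` is a hypothetical helper name introduced for this illustration.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (an assumption for illustration, not part of toml++):
// decodes UTF-8 and re-encodes it as UTF-16 code units.
// Returns std::nullopt if the input is not well-formed UTF-8.
inline std::optional<std::u16string> utf8_to_utf16(std::string_view in)
{
    // Smallest code point that legitimately needs a sequence of each length;
    // anything below it is an overlong encoding.
    static constexpr char32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const auto b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;

        // The lead byte determines the sequence length and the first payload bits.
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else return std::nullopt; // stray continuation byte or invalid lead byte

        if (i + len > in.size())
            return std::nullopt; // truncated sequence

        // Each continuation byte must look like 10xxxxxx and adds 6 payload bits.
        for (std::size_t k = 1; k < len; ++k)
        {
            const auto b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                return std::nullopt;
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong encodings, surrogates, and values above U+10FFFF.
        if (cp < min_for_len[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return std::nullopt;

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp)); // one code unit
        }
        else
        {
            const char32_t v = cp - 0x10000; // 20 bits split across a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

In this sketch a three-byte sequence such as `"\xE6\x97\xA5"` (U+65E5) becomes a single UTF-16 code unit, while a four-byte sequence such as U+1F600 becomes the surrogate pair `0xD83D 0xDE00`. Returning `std::optional` keeps malformed input distinguishable from an empty string; real code on Windows would more likely call the platform API directly.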

---

## Bare Key Unicode Rules

Per TOML v1.0.0, bare keys are limited to ASCII letters, digits, hyphen, and underscore:

```toml
valid-key = "value"
valid_key_2 = "value"
# 日本語 = "value"   # NOT valid as bare key in TOML v1.0
```

Non-ASCII keys must be quoted:

```toml
"日本語" = "value"   # valid as quoted key
```

### Unreleased Features

With `TOML_UNRELEASED_FEATURES=1`, the parser accepts Unicode letters and digits in bare keys as proposed for future TOML versions:

```toml
# Only with TOML_UNRELEASED_FEATURES=1:
日本語 = "value"   # bare key with Unicode letters
```

The `is_non_ascii_letter()` and `is_non_ascii_number()` functions from `unicode_autogenerated.hpp` provide the code point tables for this classification.

---

## Auto-Generated Unicode Tables

`include/toml++/impl/unicode_autogenerated.hpp` contains machine-generated lookup tables derived from the Unicode Character Database. These tables classify code points by category:

- **Letter** categories: Lu, Ll, Lt, Lm, Lo
- **Number** categories: Nd, Nl
- **Combining marks**: Mn, Mc
- **Connector punctuation**: Pc

The tables use range-based compression for efficiency:

```cpp
// Simplified representation:
struct code_point_range
{
    char32_t first;
    char32_t last;
};

// Function uses binary search over sorted ranges
bool is_non_ascii_letter(char32_t cp) noexcept;
```

---

## String Handling in Formatters

Each formatter handles strings slightly differently:

### TOML Formatter

- Defaults to basic strings with escaping: `"hello\nworld"`
- Uses literal strings when `allow_literal_strings` is set and the string has no single quotes: `'no escapes needed'`
- Uses multi-line strings when `allow_multi_line_strings` is set and the string contains newlines
- Preserves Unicode with `allow_unicode_strings` (default on)

### JSON Formatter

- Always uses double-quoted strings
- Escapes all required JSON characters
- Does not use literal or
multi-line strings
- Unicode behavior follows the `allow_unicode_strings` flag

### YAML Formatter

- Uses double-quoted strings
- `allow_unicode_strings` is on by default
- Escapes control characters

---

## Complete Example

```cpp
#include <toml++/toml.hpp>
#include <iostream>
#include <string_view>

using namespace std::string_view_literals; // needed for the ""sv suffix

int main()
{
    // Parse document with Unicode content
    auto config = toml::parse(R"(
        title = "日本語テスト"
        greeting = "Hello, 世界! 😀"
        escaped = "\u0048\u0065\u006C\u006C\u006F"
        path = "C:\\Users\\名前\\config"

        [metadata]
        "quoted.key" = "value"
        author = "José García"
    )");

    // Read values: Unicode is preserved
    auto title = config["title"].value_or(""sv);
    std::cout << "Title: " << title << "\n";
    // Title: 日本語テスト

    auto greeting = config["greeting"].value_or(""sv);
    std::cout << "Greeting: " << greeting << "\n";
    // Greeting: Hello, 世界! 😀

    // Escaped values are decoded
    auto escaped = config["escaped"].value_or(""sv);
    std::cout << "Escaped: " << escaped << "\n";
    // Escaped: Hello

    // Serialize back: Unicode preserved by default
    std::cout << "\n=== TOML (Unicode) ===\n";
    std::cout << config << "\n";

    // Serialize with Unicode escaping
    std::cout << "\n=== TOML (Escaped) ===\n";
    std::cout << toml::toml_formatter{
        config,
        toml::format_flags::indentation // no allow_unicode_strings
    } << "\n";

    return 0;
}
```

---

## Related Documentation

- [parsing.md](parsing.md): Parser UTF-8 input handling
- [formatting.md](formatting.md): Unicode output control via format_flags
- [values.md](values.md): String value type
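
---

The `escaped` value in the complete example decodes to `Hello` through the steps listed under Unicode Escape Decoding: read the hex digits, validate the code point, encode it as UTF-8. A minimal standalone sketch of those steps follows. It is an illustration under stated assumptions, not the toml++ parser: `decode_unicode_escape` is a hypothetical name, it takes only the hex digits after `\u` or `\U`, and it models just the surrogate and range checks (the non-character check mentioned above is left out for brevity).

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (an assumption for illustration, not part of toml++):
// takes the 4 or 8 hex digits that follow \u or \U and returns the UTF-8
// encoding of that code point, or std::nullopt if the escape is invalid.
inline std::optional<std::string> decode_unicode_escape(std::string_view hex)
{
    // Step 1: exactly 4 or 8 hexadecimal digits.
    if (hex.size() != 4 && hex.size() != 8)
        return std::nullopt;

    char32_t cp = 0;
    for (const char c : hex)
    {
        char32_t digit = 0;
        if (c >= '0' && c <= '9')      digit = static_cast<char32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<char32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') digit = static_cast<char32_t>(c - 'A' + 10);
        else return std::nullopt; // not a hex digit
        cp = (cp << 4) | digit;
    }

    // Step 2: the value must be in range and must not be a surrogate.
    if (cp > 0x10FFFF)
        return std::nullopt;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return std::nullopt;

    // Step 3: encode as 1 to 4 UTF-8 bytes depending on the code point range.
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, `decode_unicode_escape("2764")` produces the three bytes `E2 9D A4`, and `decode_unicode_escape("D800")` returns `std::nullopt` because surrogates are rejected. Taking only the digits keeps the example focused on validation and encoding; a real parser would also track source positions for error reporting.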
