# toml++: Unicode Handling

## Overview

toml++ fully handles UTF-8 encoded input and output as required by the TOML specification. All TOML documents must be valid UTF-8, and the library validates, decodes, and encodes Unicode throughout parsing and formatting.

Core Unicode utilities live in `include/toml++/impl/unicode.hpp`, with auto-generated lookup tables in `unicode_autogenerated.hpp`.

---

## UTF-8 Input Requirements

The parser expects all input to be valid UTF-8:

- **BOM handling**: A leading UTF-8 BOM (`0xEF 0xBB 0xBF`) is silently stripped before parsing begins
- **Validation**: Invalid byte sequences (overlong encodings, surrogate code points, truncated sequences) produce parse errors
- **Multi-byte characters**: Fully supported in string values, comments, and bare keys (where permitted by TOML)

```cpp
// UTF-8 content works naturally
auto tbl = toml::parse(R"(
    greeting = "Hello, 世界!"
    emoji = "🎉"
    name = "Ñoño"
)");
```

---

## Character Classification

The library classifies Unicode code points for parsing with functions in `unicode.hpp`:

### `is_string_delimiter()`

Identifies characters that can start/end strings: `"` (U+0022) and `'` (U+0027).

### `is_ascii_letter()`

`[A-Za-z]`. Used in bare key validation and other ASCII-specific checks.

### `is_ascii_whitespace()`

Space (U+0020) and tab (U+0009).

### `is_ascii_line_break()`

LF (U+000A) and CR (U+000D).

### `is_bare_key_character()`

Characters permitted in TOML bare keys: `[A-Za-z0-9_-]`, plus Unicode letters/digits when `TOML_UNRELEASED_FEATURES` is enabled.

### `is_control_character()`

Control characters (U+0000–U+001F, U+007F) excluding tab. These are forbidden in basic strings and must be escaped.

### `is_non_ascii_letter()`

Unicode letter code points outside ASCII, taken from the auto-generated tables in `unicode_autogenerated.hpp`. Used for extended bare key support in unreleased TOML features.

### `is_non_ascii_number()`

Unicode digit code points outside ASCII (e.g., Arabic-Indic digits).

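
As a rough illustration of how a check like `is_non_ascii_number()` could work, the sketch below runs a binary search over a tiny, hand-written table of sorted, inclusive code point ranges. Everything in it is an assumption made for illustration: `code_point_range`, `digit_ranges_sketch`, and `is_non_ascii_number_sketch` are hypothetical names, the table lists only four decimal-digit blocks, and the library's real generated code is far larger and may be organised differently.

```cpp
#include <cstddef>

// Hypothetical stand-in for one entry of a generated table (inclusive range).
struct code_point_range
{
    char32_t first;
    char32_t last;
};

// A few decimal-digit blocks, sorted by `first` and non-overlapping.
// Illustrative only; this is not the library's table.
constexpr code_point_range digit_ranges_sketch[] = {
    { U'\u0660', U'\u0669' }, // Arabic-Indic digits
    { U'\u06F0', U'\u06F9' }, // Extended Arabic-Indic digits
    { U'\u0966', U'\u096F' }, // Devanagari digits
    { U'\uFF10', U'\uFF19' }, // Fullwidth digits
};

// Binary search: narrow [lo, hi) until a range containing cp is found.
constexpr bool is_non_ascii_number_sketch(char32_t cp) noexcept
{
    std::size_t lo = 0;
    std::size_t hi = sizeof(digit_ranges_sketch) / sizeof(digit_ranges_sketch[0]);
    while (lo < hi)
    {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (cp < digit_ranges_sketch[mid].first)
            hi = mid;     // cp lies before this range
        else if (cp > digit_ranges_sketch[mid].last)
            lo = mid + 1; // cp lies after this range
        else
            return true;  // first <= cp <= last
    }
    return false;
}

static_assert(is_non_ascii_number_sketch(U'\u0663'), "Arabic-Indic digit three");
static_assert(!is_non_ascii_number_sketch(U'7'), "ASCII digits are not in the table");
```

Storing ranges rather than individual code points keeps the table small, and keeping them sorted is what allows the logarithmic lookup.
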
### `is_non_ascii_whitespace()`

Unicode whitespace beyond ASCII space/tab.

---

## Escape Sequences in Strings

TOML basic strings (`"..."` and `"""..."""`) support escape sequences. The parser decodes these into their UTF-8 representations:

| Escape | Meaning | Code Point |
|--------|---------|------------|
| `\b` | Backspace | U+0008 |
| `\t` | Tab | U+0009 |
| `\n` | Line Feed | U+000A |
| `\f` | Form Feed | U+000C |
| `\r` | Carriage Return | U+000D |
| `\"` | Quote | U+0022 |
| `\\` | Backslash | U+005C |
| `\uXXXX` | Unicode (4 hex digits) | U+0000–U+FFFF |
| `\UXXXXXXXX` | Unicode (8 hex digits) | U+00000000–U+0010FFFF |

### `control_char_escapes` Table

The formatter uses a lookup table for serializing control characters back to escape sequences:

```cpp
// In impl namespace:
inline constexpr const char* control_char_escapes[] = {
    "\\u0000", "\\u0001", "\\u0002", "\\u0003",
    "\\u0004", "\\u0005", "\\u0006", "\\u0007",
    "\\b",     "\\t",     "\\n",     "\\u000B",
    "\\f",     "\\r",     "\\u000E", "\\u000F",
    "\\u0010", "\\u0011", "\\u0012", "\\u0013",
    "\\u0014", "\\u0015", "\\u0016", "\\u0017",
    "\\u0018", "\\u0019", "\\u001A", "\\u001B",
    "\\u001C", "\\u001D", "\\u001E", "\\u001F",
};
```

---

## Unicode Escape Decoding

The parser processes `\uXXXX` and `\UXXXXXXXX` escapes:

1. Reads 4 or 8 hexadecimal digits
2. Validates the code point:
   - Must not be a surrogate (U+D800–U+DFFF)
   - Must not exceed U+10FFFF
   - Must not be a non-character (U+FDD0–U+FDEF, U+xFFFE–U+xFFFF)
3. Encodes to UTF-8 bytes (1–4 bytes depending on code point range)

```toml
# Valid Unicode escapes
escape_a = "\u0041"         # "A"
escape_heart = "\u2764"     # "❤"
escape_emoji = "\U0001F600" # "😀"
```

```cpp
auto tbl = toml::parse(R"(
    a = "\u0041"
    heart = "\u2764"
)");

std::cout << tbl["a"].value_or(""sv) << "\n";     // A
std::cout << tbl["heart"].value_or(""sv) << "\n"; // ❤
```

---

## UTF-8 Encoding in Output

When formatting, the behavior depends on `format_flags::allow_unicode_strings`:

### With `allow_unicode_strings` (default for TOML and YAML formatters)

Non-ASCII characters pass through unescaped:

```cpp
auto tbl = toml::table{ { "name", "日本語" } };
std::cout << tbl << "\n";
// name = "日本語"
```

### Without `allow_unicode_strings`

Non-ASCII characters are escaped to `\uXXXX` / `\UXXXXXXXX`:

```cpp
auto tbl = toml::table{ { "name", "日本語" } };
auto fmt = toml::toml_formatter{
    tbl,
    toml::format_flags::indentation // no allow_unicode_strings
};
std::cout << fmt << "\n";
// name = "\u65E5\u672C\u8A9E"
```

---

## char8_t Support (C++20)

When compiling with C++20, `char8_t` overloads are available for parsing:

```cpp
auto tbl = toml::parse(u8R"(
    greeting = "Hello, 世界"
)"sv);
```

`char8_t` strings are treated internally as UTF-8 byte sequences, and `parse()` accepts `std::u8string_view`.

### `source_path` as u8string

```cpp
auto tbl = toml::parse(doc, u8"config.toml"sv);
```

---

## Windows Compatibility (`TOML_ENABLE_WINDOWS_COMPAT`)

When enabled (default on Windows), additional conversion overloads exist:

- `parse_file(std::wstring_view)`: converts a wide file path to UTF-8
- `value<std::wstring>()`: converts the stored UTF-8 string to a wide string
- String comparison with `wchar_t*` / `std::wstring_view`

The conversions use the Windows API (`MultiByteToWideChar` / `WideCharToMultiByte`) internally.

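
To give a feel for what a UTF-8 to wide-string conversion involves, here is a portable sketch that decodes UTF-8 into UTF-16 code units by hand. It is not the library's code, which calls the Windows API as noted above. `utf8_to_utf16_sketch` is a hypothetical name, and the function assumes well-formed input: it does not reject overlong encodings, stray continuation bytes, or encoded surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper: decodes well-formed UTF-8 into UTF-16 code units.
inline std::u16string utf8_to_utf16_sketch(std::string_view utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const auto lead = static_cast<unsigned char>(utf8[i]);

        // The lead byte gives the sequence length and the first payload bits.
        std::size_t len = 1;
        std::uint32_t cp = lead;
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07u; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0Fu; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1Fu; }

        assert(i + len <= utf8.size() && "truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3Fu);
        i += len;

        if (cp < 0x10000u)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000u;
            out.push_back(static_cast<char16_t>(0xD800u + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00u + (cp & 0x3FFu)));
        }
    }
    return out;
}
```

For example, the three bytes `E6 97 A5` decode to the single code unit U+65E5, while a four-byte sequence such as `F0 9F 98 80` (U+1F600) produces two code units.
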
---

## Bare Key Unicode Rules

Per TOML v1.0.0, bare keys are limited to ASCII letters, digits, hyphen, and underscore:

```toml
valid-key = "value"
valid_key_2 = "value"
# 日本語 = "value"  # NOT valid as bare key in TOML v1.0
```

Non-ASCII keys must be quoted:

```toml
"日本語" = "value"  # valid as quoted key
```

### Unreleased Features

With `TOML_UNRELEASED_FEATURES=1`, the parser accepts Unicode letters and digits in bare keys as proposed for future TOML versions:

```toml
# Only with TOML_UNRELEASED_FEATURES=1:
日本語 = "value"  # bare key with Unicode letters
```

The `is_non_ascii_letter()` and `is_non_ascii_number()` functions from `unicode_autogenerated.hpp` provide the code point tables for this classification.

---

## Auto-Generated Unicode Tables

`include/toml++/impl/unicode_autogenerated.hpp` contains machine-generated lookup tables derived from the Unicode Character Database. These tables classify code points by category:

- **Letter** categories: Lu, Ll, Lt, Lm, Lo
- **Number** categories: Nd, Nl
- **Combining marks**: Mn, Mc
- **Connector punctuation**: Pc

The tables use range-based compression for efficiency:

```cpp
// Simplified representation:
struct code_point_range
{
    char32_t first;
    char32_t last;
};

// Function uses binary search over sorted ranges
bool is_non_ascii_letter(char32_t cp) noexcept;
```

---

## String Handling in Formatters

Each formatter handles strings slightly differently:

### TOML Formatter

- Defaults to basic strings with escaping: `"hello\nworld"`
- Uses literal strings when `allow_literal_strings` is set and the string has no single quotes: `'no escapes needed'`
- Uses multi-line strings when `allow_multi_line_strings` is set and the string contains newlines
- Preserves Unicode with `allow_unicode_strings` (default on)

### JSON Formatter

- Always uses double-quoted strings
- Escapes all required JSON characters
- Does not use literal or multi-line strings
- Unicode behavior follows the `allow_unicode_strings` flag

### YAML Formatter

- Uses double-quoted strings
- `allow_unicode_strings` is on by default
- Escapes control characters

---

## Complete Example

```cpp
#include <toml++/toml.hpp>
#include <iostream>

using namespace std::string_view_literals; // for the ""sv literals below

int main()
{
    // Parse document with Unicode content
    auto config = toml::parse(R"(
        title = "日本語テスト"
        greeting = "Hello, 世界! 🌍"
        escaped = "\u0048\u0065\u006C\u006C\u006F"
        path = "C:\\Users\\名前\\config"

        [metadata]
        "quoted.key" = "value"
        author = "José García"
    )");

    // Read values: Unicode is preserved
    auto title = config["title"].value_or(""sv);
    std::cout << "Title: " << title << "\n";
    // Title: 日本語テスト

    auto greeting = config["greeting"].value_or(""sv);
    std::cout << "Greeting: " << greeting << "\n";
    // Greeting: Hello, 世界! 🌍

    // Escaped values are decoded
    auto escaped = config["escaped"].value_or(""sv);
    std::cout << "Escaped: " << escaped << "\n";
    // Escaped: Hello

    // Serialize back: Unicode preserved by default
    std::cout << "\n=== TOML (Unicode) ===\n";
    std::cout << config << "\n";

    // Serialize with Unicode escaping
    std::cout << "\n=== TOML (Escaped) ===\n";
    std::cout << toml::toml_formatter{
        config,
        toml::format_flags::indentation // no allow_unicode_strings
    } << "\n";

    return 0;
}
```

---

## Related Documentation

- [parsing.md](parsing.md): Parser UTF-8 input handling
- [formatting.md](formatting.md): Unicode output control via format_flags
- [values.md](values.md): String value type