diff --git a/docs/handbook/tomlplusplus/unicode-handling.md b/docs/handbook/tomlplusplus/unicode-handling.md
new file mode 100644
index 0000000000..6cafb3deff
--- /dev/null
+++ b/docs/handbook/tomlplusplus/unicode-handling.md
@@ -0,0 +1,335 @@
+# toml++ - Unicode Handling
+
+## Overview
+
+toml++ fully handles UTF-8 encoded input and output as required by the TOML specification. All TOML documents must be valid UTF-8, and the library validates, decodes, and encodes Unicode throughout parsing and formatting.
+
+Core Unicode utilities are in `include/toml++/impl/unicode.hpp` with auto-generated lookup tables in `unicode_autogenerated.hpp`.
+
+---
+
+## UTF-8 Input Requirements
+
+The parser expects all input to be valid UTF-8:
+
+- **BOM handling**: A leading UTF-8 BOM (`0xEF 0xBB 0xBF`) is silently stripped before parsing begins
+- **Validation**: Invalid byte sequences (overlong encodings, surrogate code points, truncated sequences) produce parse errors
+- **Multi-byte characters**: Fully supported in string values, comments, and bare keys (where permitted by TOML)
+
+```cpp
+// UTF-8 content works naturally
+auto tbl = toml::parse(R"(
+ greeting = "Hello, 世界!"
+ emoji = "🎉"
+ name = "Ñoño"
+)");
+```
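+
+The BOM rule above can be sketched in a few lines. This is a minimal illustration, not the library's actual code: `strip_utf8_bom` is a hypothetical helper name, and the only fact taken from the text is the three-byte sequence `0xEF 0xBB 0xBF`.
+
+```cpp
+#include <string_view>
+
+// Hypothetical helper (not part of toml++): returns the document without a
+// leading UTF-8 BOM. Input that does not start with the full BOM is unchanged.
+inline std::string_view strip_utf8_bom(std::string_view doc) noexcept
+{
+    constexpr std::string_view bom = "\xEF\xBB\xBF";
+    if (doc.substr(0, bom.size()) == bom)
+        doc.remove_prefix(bom.size());
+    return doc;
+}
+```
+
+A partial BOM (for example only `0xEF 0xBB`) is left in place by this sketch, so a later validation step would still see it as ordinary input bytes.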
+
+---
+
+## Character Classification
+
+The library classifies Unicode code points for parsing with functions in `unicode.hpp`:
+
+### `is_string_delimiter()`
+
+Identifies characters that can start/end strings: `"` (U+0022) and `'` (U+0027).
+
+### `is_ascii_letter()`
+
+`[A-Za-z]`. Used in bare key validation and other ASCII-specific checks.
+
+### `is_ascii_whitespace()`
+
+Space (U+0020) and tab (U+0009).
+
+### `is_ascii_line_break()`
+
+LF (U+000A) and CR (U+000D).
+
+### `is_bare_key_character()`
+
+Characters permitted in TOML bare keys: `[A-Za-z0-9_-]`, plus Unicode letters and digits when `TOML_UNRELEASED_FEATURES` is enabled.
+
+### `is_control_character()`
+
+Control characters (U+0000–U+001F, U+007F) excluding tab. These are forbidden in basic strings and must be escaped.
+
+### `is_non_ascii_letter()`
+
+Unicode letter code points outside ASCII, looked up in the auto-generated tables in `unicode_autogenerated.hpp`. Used for extended bare key support in unreleased TOML features.
+
+### `is_non_ascii_number()`
+
+Unicode digit code points outside ASCII (e.g., Arabic-Indic digits).
+
+### `is_non_ascii_whitespace()`
+
+Unicode whitespace beyond ASCII space/tab.
+
+---
+
+## Escape Sequences in Strings
+
+TOML basic strings (`"..."` and `"""..."""`) support escape sequences. The parser decodes these into their UTF-8 representations:
+
+| Escape | Meaning | Code Point |
+|--------|---------|------------|
+| `\b` | Backspace | U+0008 |
+| `\t` | Tab | U+0009 |
+| `\n` | Line Feed | U+000A |
+| `\f` | Form Feed | U+000C |
+| `\r` | Carriage Return | U+000D |
+| `\"` | Quote | U+0022 |
+| `\\` | Backslash | U+005C |
+| `\uXXXX` | Unicode (4 hex digits) | U+0000–U+FFFF |
+| `\UXXXXXXXX` | Unicode (8 hex digits) | U+00000000–U+0010FFFF |
+
+### `control_char_escapes` Table
+
+The formatter uses a lookup table for serializing control characters back to escape sequences:
+
+```cpp
+// In impl namespace:
+inline constexpr const char* control_char_escapes[] = {
+ "\\u0000", "\\u0001", "\\u0002", "\\u0003",
+ "\\u0004", "\\u0005", "\\u0006", "\\u0007",
+ "\\b", "\\t", "\\n", "\\u000B",
+ "\\f", "\\r", "\\u000E", "\\u000F",
+ "\\u0010", "\\u0011", "\\u0012", "\\u0013",
+ "\\u0014", "\\u0015", "\\u0016", "\\u0017",
+ "\\u0018", "\\u0019", "\\u001A", "\\u001B",
+ "\\u001C", "\\u001D", "\\u001E", "\\u001F",
+};
+```
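+
+As a rough illustration of how such a table might be used, the sketch below indexes it by the control character's value while escaping a string. `escape_basic_string` is a hypothetical helper, not a toml++ function, and its handling of U+007F, quotes, and backslashes is an assumption based on the escape table earlier in this section.
+
+```cpp
+#include <string>
+#include <string_view>
+
+// Copy of the table above: entry i is the escape for control character i.
+inline constexpr const char* control_char_escapes[] = {
+    "\\u0000", "\\u0001", "\\u0002", "\\u0003", "\\u0004", "\\u0005", "\\u0006", "\\u0007",
+    "\\b",     "\\t",     "\\n",     "\\u000B", "\\f",     "\\r",     "\\u000E", "\\u000F",
+    "\\u0010", "\\u0011", "\\u0012", "\\u0013", "\\u0014", "\\u0015", "\\u0016", "\\u0017",
+    "\\u0018", "\\u0019", "\\u001A", "\\u001B", "\\u001C", "\\u001D", "\\u001E", "\\u001F",
+};
+
+// Hypothetical helper: escapes the body of a basic string (without the
+// surrounding quotes). Bytes >= 0x80 are passed through untouched.
+inline std::string escape_basic_string(std::string_view in)
+{
+    std::string out;
+    for (const char c : in)
+    {
+        const auto u = static_cast<unsigned char>(c);
+        if (u < 0x20)
+            out += control_char_escapes[u]; // table lookup by value
+        else if (u == 0x7F)
+            out += "\\u007F"; // DEL is outside the 32-entry table
+        else if (c == '"')
+            out += "\\\"";
+        else if (c == '\\')
+            out += "\\\\";
+        else
+            out += c;
+    }
+    return out;
+}
+```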
+
+---
+
+## Unicode Escape Decoding
+
+The parser processes `\uXXXX` and `\UXXXXXXXX` escapes:
+
+1. Reads 4 or 8 hexadecimal digits
+2. Validates the code point:
+ - Must not be a surrogate (U+D800–U+DFFF)
+ - Must not exceed U+10FFFF
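+ - Must be a Unicode scalar value, as the TOML specification requires; non-characters (U+FDD0–U+FDEF, U+xFFFE–U+xFFFF) are scalar values and are not excluded by that rule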
+3. Encodes to UTF-8 bytes (1–4 bytes depending on code point range)
+
+```toml
+# Valid Unicode escapes
+escape_a = "\u0041" # "A"
+escape_heart = "\u2764" # "❤"
+escape_emoji = "\U0001F600" # "😀"
+```
+
+```cpp
+auto tbl = toml::parse(R"(
+ a = "\u0041"
+ heart = "\u2764"
+)");
+
+std::cout << tbl["a"].value_or(""sv) << "\n"; // A
+std::cout << tbl["heart"].value_or(""sv) << "\n"; // ❤
+```
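+
+The validate-then-encode steps listed above can be sketched as a standalone function. This is an illustration under stated assumptions rather than the parser's real implementation: `encode_utf8` is a hypothetical name, and it checks only the surrogate and U+10FFFF limits before producing 1 to 4 bytes.
+
+```cpp
+#include <optional>
+#include <string>
+
+// Hypothetical helper: encodes one code point as UTF-8. Returns std::nullopt
+// for surrogates (U+D800..U+DFFF) and for values above U+10FFFF.
+inline std::optional<std::string> encode_utf8(char32_t cp)
+{
+    if (cp >= 0xD800 && cp <= 0xDFFF)
+        return std::nullopt;
+    if (cp > 0x10FFFF)
+        return std::nullopt;
+
+    std::string out;
+    if (cp < 0x80) // 1 byte: 0xxxxxxx
+    {
+        out += static_cast<char>(cp);
+    }
+    else if (cp < 0x800) // 2 bytes: 110xxxxx 10xxxxxx
+    {
+        out += static_cast<char>(0xC0 | (cp >> 6));
+        out += static_cast<char>(0x80 | (cp & 0x3F));
+    }
+    else if (cp < 0x10000) // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
+    {
+        out += static_cast<char>(0xE0 | (cp >> 12));
+        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
+        out += static_cast<char>(0x80 | (cp & 0x3F));
+    }
+    else // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+    {
+        out += static_cast<char>(0xF0 | (cp >> 18));
+        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
+        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
+        out += static_cast<char>(0x80 | (cp & 0x3F));
+    }
+    return out;
+}
+```
+
+With this sketch, `encode_utf8(0x41)` yields the single byte `A`, while `encode_utf8(0xD800)` yields no value, which a parser would report as an error.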
+
+---
+
+## UTF-8 Encoding in Output
+
+When formatting, the behavior depends on `format_flags::allow_unicode_strings`:
+
+### With `allow_unicode_strings` (default for TOML and YAML formatters)
+
+Non-ASCII characters pass through unescaped:
+
+```cpp
+auto tbl = toml::table{ { "name", "日本語" } };
+std::cout << tbl << "\n";
+// name = "日本語"
+```
+
+### Without `allow_unicode_strings`
+
+Non-ASCII characters are escaped to `\uXXXX` / `\UXXXXXXXX`:
+
+```cpp
+auto tbl = toml::table{ { "name", "日本語" } };
+auto fmt = toml::toml_formatter{
+ tbl,
+ toml::format_flags::indentation // no allow_unicode_strings
+};
+std::cout << fmt << "\n";
+// name = "\u65E5\u672C\u8A9E"
+```
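+
+To make the escaped form concrete, here is a small sketch of the transformation itself: decode each multi-byte UTF-8 sequence and print it as `\uXXXX` or `\UXXXXXXXX`. `escape_non_ascii` is a hypothetical helper, not the formatter's code, and it assumes the input is already valid UTF-8 (which holds for strings the parser has accepted).
+
+```cpp
+#include <cstddef>
+#include <cstdio>
+#include <string>
+#include <string_view>
+
+// Hypothetical helper: replaces every non-ASCII code point in valid UTF-8
+// input with a \uXXXX (up to U+FFFF) or \UXXXXXXXX escape.
+inline std::string escape_non_ascii(std::string_view utf8)
+{
+    std::string out;
+    for (std::size_t i = 0; i < utf8.size();)
+    {
+        const auto lead = static_cast<unsigned char>(utf8[i]);
+        if (lead < 0x80) // ASCII passes through unchanged
+        {
+            out += utf8[i++];
+            continue;
+        }
+
+        // The lead byte gives the sequence length and the first payload bits.
+        const std::size_t len = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2;
+        char32_t cp = lead & (0xFF >> (len + 1));
+        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
+            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
+        i += len;
+
+        char buf[11]; // "\U" + 8 hex digits + terminator
+        if (cp <= 0xFFFF)
+            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
+        else
+            std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
+        out += buf;
+    }
+    return out;
+}
+```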
+
+---
+
+## char8_t Support (C++20)
+
+When compiling with C++20, `char8_t` overloads are available for parsing:
+
+```cpp
+auto tbl = toml::parse(u8R"(
+ greeting = "Hello, 世界"
+)"sv);
+```
+
+The `char8_t` strings are internally treated as UTF-8 byte sequences. `std::u8string_view` is accepted by `parse()`.
+
+### `source_path` as u8string
+
+```cpp
+auto tbl = toml::parse(doc, u8"config.toml"sv);
+```
+
+---
+
+## Windows Compatibility (`TOML_ENABLE_WINDOWS_COMPAT`)
+
+When enabled (default on Windows), additional conversion overloads exist:
+
+- `parse_file(std::wstring_view)`: converts wide file path to UTF-8
+- `value<std::wstring>()`: converts stored UTF-8 string to wide string
+- String comparison with `wchar_t*` / `std::wstring_view`
+
+The conversions use Windows API (`MultiByteToWideChar` / `WideCharToMultiByte`) internally.
+
+---
+
+## Bare Key Unicode Rules
+
+Per TOML v1.0.0, bare keys are limited to ASCII letters, digits, hyphen, and underscore:
+
+```toml
+valid-key = "value"
+valid_key_2 = "value"
+# 日本語 = "value" # NOT valid as bare key in TOML v1.0
+```
+
+Non-ASCII keys must be quoted:
+
+```toml
+"日本語" = "value" # valid as quoted key
+```
+
+### Unreleased Features
+
+With `TOML_UNRELEASED_FEATURES=1`, the parser accepts Unicode letters and digits in bare keys as proposed for future TOML versions:
+
+```toml
+# Only with TOML_UNRELEASED_FEATURES=1:
+日本語 = "value" # bare key with Unicode letters
+```
+
+The `is_non_ascii_letter()` and `is_non_ascii_number()` functions from `unicode_autogenerated.hpp` provide the code point tables for this classification.
+
+---
+
+## Auto-Generated Unicode Tables
+
+`include/toml++/impl/unicode_autogenerated.hpp` contains machine-generated lookup tables derived from the Unicode Character Database. These tables classify code points by category:
+
+- **Letter** categories: Lu, Ll, Lt, Lm, Lo
+- **Number** categories: Nd, Nl
+- **Combining marks**: Mn, Mc
+- **Connector punctuation**: Pc
+
+The tables use range-based compression for efficiency:
+
+```cpp
+// Simplified representation:
+struct code_point_range
+{
+ char32_t first;
+ char32_t last;
+};
+
+// Function uses binary search over sorted ranges
+bool is_non_ascii_letter(char32_t cp) noexcept;
+```
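+
+The range lookup described above can be sketched with `std::lower_bound`. The four ranges below are a tiny illustrative sample chosen for this example, not the generated tables, and `in_sample_ranges` is a hypothetical name.
+
+```cpp
+#include <algorithm>
+#include <iterator>
+
+struct code_point_range
+{
+    char32_t first;
+    char32_t last;
+};
+
+// Illustrative sample only, sorted by `first` and non-overlapping.
+inline constexpr code_point_range sample_letter_ranges[] = {
+    { 0x00C0, 0x00D6 }, // Latin-1 capital letters before U+00D7
+    { 0x00D8, 0x00F6 }, // Latin-1 letters between U+00D7 and U+00F7
+    { 0x0391, 0x03A1 }, // Greek capital Alpha..Rho
+    { 0x4E00, 0x9FFF }, // CJK Unified Ideographs block
+};
+
+// Binary search: find the first range whose upper bound is not below cp,
+// then check that cp is not below that range's lower bound.
+inline bool in_sample_ranges(char32_t cp) noexcept
+{
+    const auto last = std::end(sample_letter_ranges);
+    const auto it   = std::lower_bound(std::begin(sample_letter_ranges), last, cp,
+        [](const code_point_range& r, char32_t value) { return r.last < value; });
+    return it != last && it->first <= cp;
+}
+```
+
+Storing inclusive ranges keeps the table small because long runs of consecutive letters collapse into one entry, and the sorted order makes each lookup logarithmic in the number of ranges.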
+
+---
+
+## String Handling in Formatters
+
+Each formatter handles strings slightly differently:
+
+### TOML Formatter
+
+- Defaults to basic strings with escaping: `"hello\nworld"`
+- Uses literal strings when `allow_literal_strings` is set and string has no single quotes: `'no escapes needed'`
+- Uses multi-line strings when `allow_multi_line_strings` is set and string contains newlines
+- Preserves Unicode with `allow_unicode_strings` (default on)
+
+### JSON Formatter
+
+- Always uses double-quoted strings
+- Escapes all required JSON characters
+- Does not use literal or multi-line strings
+- Unicode behavior follows `allow_unicode_strings` flag
+
+### YAML Formatter
+
+- Uses double-quoted strings
+- `allow_unicode_strings` is on by default
+- Escapes control characters
+
+---
+
+## Complete Example
+
+```cpp
+#include <toml++/toml.hpp>
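+#include <iostream>
+#include <string_view>
+
+using namespace std::string_view_literals; // needed for the ""sv literals below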
+
+int main()
+{
+ // Parse document with Unicode content
+ auto config = toml::parse(R"(
+ title = "日本語テスト"
+ greeting = "Hello, 世界! 🌍"
+ escaped = "\u0048\u0065\u006C\u006C\u006F"
+ path = "C:\\Users\\名前\\config"
+
+ [metadata]
+ "quoted.key" = "value"
+ author = "José García"
+ )");
+
+ // Read values: Unicode is preserved
+ auto title = config["title"].value_or(""sv);
+ std::cout << "Title: " << title << "\n";
+ // Title: 日本語テスト
+
+ auto greeting = config["greeting"].value_or(""sv);
+ std::cout << "Greeting: " << greeting << "\n";
+ // Greeting: Hello, 世界! 🌍
+
+ // Escaped values are decoded
+ auto escaped = config["escaped"].value_or(""sv);
+ std::cout << "Escaped: " << escaped << "\n";
+ // Escaped: Hello
+
+ // Serialize back: Unicode preserved by default
+ std::cout << "\n=== TOML (Unicode) ===\n";
+ std::cout << config << "\n";
+
+ // Serialize with Unicode escaping
+ std::cout << "\n=== TOML (Escaped) ===\n";
+ std::cout << toml::toml_formatter{
+ config,
+ toml::format_flags::indentation // no allow_unicode_strings
+ } << "\n";
+
+ return 0;
+}
+```
+
+---
+
+## Related Documentation
+
+- [parsing.md](parsing.md): Parser UTF-8 input handling
+- [formatting.md](formatting.md): Unicode output control via format_flags
+- [values.md](values.md): String value type