# toml++ — Unicode Handling

## Overview

toml++ reads and writes UTF-8, as the TOML specification requires. Every TOML document must be valid UTF-8; the library validates and decodes Unicode while parsing and encodes it while formatting.

Core Unicode utilities are in `include/toml++/impl/unicode.hpp` with auto-generated lookup tables in `unicode_autogenerated.hpp`.

---

## UTF-8 Input Requirements

The parser expects all input to be valid UTF-8:

- **BOM handling**: A leading UTF-8 BOM (`0xEF 0xBB 0xBF`) is silently stripped before parsing begins
- **Validation**: Invalid byte sequences (overlong encodings, surrogate code points, truncated sequences) produce parse errors
- **Multi-byte characters**: Fully supported in string values, comments, and bare keys (where permitted by TOML)

```cpp
// UTF-8 content works naturally
auto tbl = toml::parse(R"(
    greeting = "Hello, 世界!"
    emoji = "🎉"
    name = "Ñoño"
)");
```
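
The BOM and validation rules above can be sketched in standalone C++. This is a minimal illustration rather than the toml++ implementation: `strip_utf8_bom` and `is_valid_utf8` are hypothetical names, and the real parser decodes incrementally while it reads instead of making a separate validation pass.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not a toml++ function: drops a leading UTF-8 BOM if present.
constexpr std::string_view strip_utf8_bom(std::string_view s)
{
    if (s.substr(0, 3) == "\xEF\xBB\xBF")
        s.remove_prefix(3);
    return s;
}

// Hypothetical helper: true if `s` is well-formed UTF-8.
constexpr bool is_valid_utf8(std::string_view s) noexcept
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const auto lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0; // total bytes in this sequence
        char32_t cp = 0;     // decoded code point
        char32_t min = 0;    // smallest value that genuinely needs `len` bytes

        if (lead < 0x80)                { len = 1; cp = lead;        min = 0x00; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false; // continuation byte in lead position, or 0xF8..0xFF

        if (s.size() - i < len) return false; // truncated sequence

        for (std::size_t k = 1; k < len; ++k)
        {
            const auto next = static_cast<unsigned char>(s[i + k]);
            if ((next & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (next & 0x3Fu);
        }

        if (cp < min) return false;                     // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate code point
        if (cp > 0x10FFFF) return false;                // beyond the Unicode range
        i += len;
    }
    return true;
}
```

Both helpers are `constexpr`, so they can be checked at compile time, for example `static_assert(!is_valid_utf8("\xC0\x80"));` for an overlong encoding of U+0000.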

---

## Character Classification

The library classifies Unicode code points for parsing with functions in `unicode.hpp`:

### `is_string_delimiter()`

Identifies characters that can start/end strings: `"` (U+0022) and `'` (U+0027).

### `is_ascii_letter()`

`[A-Za-z]` — used in bare key validation and other ASCII-specific checks.

### `is_ascii_whitespace()`

Space (U+0020) and tab (U+0009).

### `is_ascii_line_break()`

LF (U+000A) and CR (U+000D).

### `is_bare_key_character()`

Characters permitted in TOML bare keys: `[A-Za-z0-9_-]`, plus Unicode letters and digits when `TOML_UNRELEASED_FEATURES` is enabled.

### `is_control_character()`

Control characters (U+0000–U+001F, U+007F) excluding tab. These are forbidden in basic strings and must be escaped.

### `is_non_ascii_letter()`

Unicode letter code points outside ASCII — from auto-generated tables in `unicode_autogenerated.hpp`. Used for extended bare key support in unreleased TOML features.

### `is_non_ascii_number()`

Unicode digit code points outside ASCII (e.g., Arabic-Indic digits).

### `is_non_ascii_whitespace()`

Unicode whitespace beyond ASCII space/tab.
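
To make the ASCII predicates concrete, here is a minimal sketch of how they could be written from the descriptions above. The `sketch` namespace and the function bodies are assumptions for illustration; the actual definitions in `unicode.hpp` may differ in signature and detail.

```cpp
namespace sketch // hypothetical stand-ins, not the toml++ definitions
{
    // '"' (U+0022) and '\'' (U+0027)
    constexpr bool is_string_delimiter(char32_t c) noexcept
    {
        return c == U'"' || c == U'\'';
    }

    // [A-Za-z]
    constexpr bool is_ascii_letter(char32_t c) noexcept
    {
        return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
    }

    // Space (U+0020) and tab (U+0009)
    constexpr bool is_ascii_whitespace(char32_t c) noexcept
    {
        return c == U' ' || c == U'\t';
    }

    // LF (U+000A) and CR (U+000D)
    constexpr bool is_ascii_line_break(char32_t c) noexcept
    {
        return c == U'\n' || c == U'\r';
    }
}
```

Taking `char32_t` keeps each predicate usable on decoded code points, so a non-ASCII value such as U+00D1 simply falls outside every range.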

---

## Escape Sequences in Strings

TOML basic strings (`"..."` and `"""..."""`) support escape sequences. The parser decodes these into their UTF-8 representations:

| Escape | Meaning | Code Point |
|--------|---------|------------|
| `\b` | Backspace | U+0008 |
| `\t` | Tab | U+0009 |
| `\n` | Line Feed | U+000A |
| `\f` | Form Feed | U+000C |
| `\r` | Carriage Return | U+000D |
| `\"` | Quote | U+0022 |
| `\\` | Backslash | U+005C |
| `\uXXXX` | Unicode (4 hex digits) | U+0000–U+FFFF, excluding surrogates |
| `\UXXXXXXXX` | Unicode (8 hex digits) | U+0000–U+10FFFF, excluding surrogates |

### `control_char_escapes` Table

The formatter uses a lookup table for serializing control characters back to escape sequences:

```cpp
// In impl namespace:
inline constexpr const char* control_char_escapes[] = {
    "\\u0000", "\\u0001", "\\u0002", "\\u0003",
    "\\u0004", "\\u0005", "\\u0006", "\\u0007",
    "\\b",     "\\t",     "\\n",     "\\u000B",
    "\\f",     "\\r",     "\\u000E", "\\u000F",
    "\\u0010", "\\u0011", "\\u0012", "\\u0013",
    "\\u0014", "\\u0015", "\\u0016", "\\u0017",
    "\\u0018", "\\u0019", "\\u001A", "\\u001B",
    "\\u001C", "\\u001D", "\\u001E", "\\u001F",
};
```
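
As an illustration of how such a table can drive escaping, the sketch below indexes it by byte value. `escape_for_control` and `escape_basic_string` are hypothetical helpers, not toml++ functions, and the sketch ignores format flags (for instance, whether real tabs may be emitted unescaped).

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Same contents as the table above, repeated so the sketch compiles on its own.
inline constexpr const char* control_char_escapes[] = {
    "\\u0000", "\\u0001", "\\u0002", "\\u0003",
    "\\u0004", "\\u0005", "\\u0006", "\\u0007",
    "\\b",     "\\t",     "\\n",     "\\u000B",
    "\\f",     "\\r",     "\\u000E", "\\u000F",
    "\\u0010", "\\u0011", "\\u0012", "\\u0013",
    "\\u0014", "\\u0015", "\\u0016", "\\u0017",
    "\\u0018", "\\u0019", "\\u001A", "\\u001B",
    "\\u001C", "\\u001D", "\\u001E", "\\u001F",
};

// Hypothetical helper: escape sequence for a C0 control character (U+0000 to U+001F).
inline const char* escape_for_control(char c)
{
    const auto index = static_cast<unsigned char>(c);
    assert(index < 0x20 && "only C0 control characters have a table entry");
    return control_char_escapes[index];
}

// Hypothetical helper: escapes a string for use inside a TOML basic string.
inline std::string escape_basic_string(std::string_view s)
{
    std::string out;
    out.reserve(s.size());
    for (const char c : s)
    {
        const auto byte = static_cast<unsigned char>(c);
        if (byte < 0x20)       out += escape_for_control(c);
        else if (byte == 0x7F) out += "\\u007F"; // DEL is not covered by the table
        else if (c == '"')     out += "\\\"";
        else if (c == '\\')    out += "\\\\";
        else                   out += c;         // includes UTF-8 bytes >= 0x80
    }
    return out;
}
```

Because the table is indexed directly by code point, the lookup is a single array access with no branching per control character.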

---

## Unicode Escape Decoding

The parser processes `\uXXXX` and `\UXXXXXXXX` escapes:

1. Reads 4 or 8 hexadecimal digits
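2. Validates that the value is a Unicode scalar value, as the TOML specification requires:
   - Must not be a surrogate (U+D800–U+DFFF)
   - Must not exceed U+10FFFF
   - Noncharacters (U+FDD0–U+FDEF, U+xFFFE–U+xFFFF) are scalar values, so the specification does not exclude them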
3. Encodes to UTF-8 bytes (1–4 bytes depending on code point range)

```toml
# Valid Unicode escapes
escape_a = "\u0041"          # "A"
escape_heart = "\u2764"      # "❤"
escape_emoji = "\U0001F600"  # "😀"
```

```cpp
using namespace std::string_view_literals; // enables the ""sv literal used below

auto tbl = toml::parse(R"(
    a = "\u0041"
    heart = "\u2764"
)");

std::cout << tbl["a"].value_or(""sv) << "\n";     // A
std::cout << tbl["heart"].value_or(""sv) << "\n";  // ❤
```
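
The decoding steps can be sketched as a standalone function. This is an assumption-laden illustration, not the parser's code: `decode_unicode_escape` and `append_utf8` are hypothetical names, the input is just the hex digits after `\u` or `\U`, and only the surrogate and range checks are applied.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: appends the UTF-8 encoding of a Unicode scalar value.
inline void append_utf8(std::string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: `hex` holds the 4 or 8 digits that follow \u or \U.
// Returns the UTF-8 bytes, or std::nullopt if the escape is invalid.
inline std::optional<std::string> decode_unicode_escape(std::string_view hex)
{
    if (hex.size() != 4 && hex.size() != 8)
        return std::nullopt;

    char32_t cp = 0;
    for (const char c : hex) // step 1: read the hexadecimal digits
    {
        unsigned digit = 0;
        if (c >= '0' && c <= '9')      digit = static_cast<unsigned>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<unsigned>(c - 'a') + 10u;
        else if (c >= 'A' && c <= 'F') digit = static_cast<unsigned>(c - 'A') + 10u;
        else return std::nullopt;
        cp = (cp << 4) | digit;
    }

    // step 2: reject surrogates and out-of-range values
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return std::nullopt;

    std::string out; // step 3: encode as 1 to 4 UTF-8 bytes
    append_utf8(out, cp);
    return out;
}
```

With this sketch, `decode_unicode_escape("0041")` yields `"A"`, while `decode_unicode_escape("D800")` yields `std::nullopt` because U+D800 is a surrogate.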

---

## UTF-8 Encoding in Output

When formatting, the behavior depends on `format_flags::allow_unicode_strings`:

### With `allow_unicode_strings` (default for TOML and YAML formatters)

Non-ASCII characters pass through unescaped:

```cpp
auto tbl = toml::table{ { "name", "日本語" } };
std::cout << tbl << "\n";
// name = "日本語"
```

### Without `allow_unicode_strings`

Non-ASCII characters are escaped to `\uXXXX` / `\UXXXXXXXX`:

```cpp
auto tbl = toml::table{ { "name", "日本語" } };
auto fmt = toml::toml_formatter{
    tbl,
    toml::format_flags::indentation  // no allow_unicode_strings
};
std::cout << fmt << "\n";
// name = "\u65E5\u672C\u8A9E"
```
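
The escaped output shown above can be reproduced by decoding each UTF-8 sequence and printing its code point in hexadecimal. The following is a rough sketch with hypothetical helpers (`append_hex`, `escape_non_ascii`); it assumes well-formed input and uppercase hex digits as in the example output, and is not taken from the formatter's source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: appends `digits` uppercase hex digits of `cp`.
inline void append_hex(std::string& out, char32_t cp, int digits)
{
    constexpr char hex[] = "0123456789ABCDEF";
    for (int shift = (digits - 1) * 4; shift >= 0; shift -= 4)
        out += hex[(cp >> shift) & 0xF];
}

// Hypothetical helper: replaces every non-ASCII code point with a \u or \U escape.
// Precondition: `utf8` is well-formed UTF-8.
inline std::string escape_non_ascii(std::string_view utf8)
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const auto lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) // ASCII passes through unchanged
        {
            out += utf8[i++];
            continue;
        }

        // The lead byte encodes the sequence length: 110xxxxx, 1110xxxx or 11110xxx.
        const std::size_t len = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2;
        assert(i + len <= utf8.size() && "input must be well-formed UTF-8");

        // Keep the payload bits of the lead byte, then fold in 6 bits per continuation byte.
        char32_t cp = lead & (0xFFu >> (len + 1));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3Fu);

        if (cp <= 0xFFFF)
        {
            out += "\\u"; // four digits cover the Basic Multilingual Plane
            append_hex(out, cp, 4);
        }
        else
        {
            out += "\\U"; // eight digits for supplementary planes
            append_hex(out, cp, 8);
        }
        i += len;
    }
    return out;
}
```

Code points up to U+FFFF take the short `\uXXXX` form; anything above needs `\UXXXXXXXX`, which is why an emoji such as U+1F600 becomes `\U0001F600`.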

---

## char8_t Support (C++20)

When compiling with C++20, `char8_t` overloads are available for parsing:

```cpp
using namespace std::string_view_literals; // enables the ""sv literal used below

auto tbl = toml::parse(u8R"(
    greeting = "Hello, 世界"
)"sv);
```

The `char8_t` strings are internally treated as UTF-8 byte sequences. `std::u8string_view` is accepted by `parse()`.

### `source_path` as u8string

```cpp
auto tbl = toml::parse(doc, u8"config.toml"sv);
```

---

## Windows Compatibility (`TOML_ENABLE_WINDOWS_COMPAT`)

When enabled (default on Windows), additional conversion overloads exist:

- `parse_file(std::wstring_view)` — converts wide file path to UTF-8
- `value<std::wstring>()` — converts stored UTF-8 string to wide string
- String comparison with `wchar_t*` / `std::wstring_view`

The conversions use Windows API (`MultiByteToWideChar` / `WideCharToMultiByte`) internally.

---

## Bare Key Unicode Rules

Per TOML v1.0.0, bare keys are limited to ASCII letters, digits, hyphen, and underscore:

```toml
valid-key = "value"
valid_key_2 = "value"
# 日本語 = "value"   # NOT valid as bare key in TOML v1.0
```

Non-ASCII keys must be quoted:

```toml
"日本語" = "value"   # valid as quoted key
```
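
When generating TOML by hand, the v1.0.0 rule reduces to a byte-level check. The sketch below uses a hypothetical `needs_quoting` helper; it covers only the released TOML v1.0.0 rule and does not model the unreleased Unicode bare-key extension.

```cpp
#include <string_view>

// Hypothetical helper: true if `key` cannot be written as a TOML v1.0.0 bare key.
constexpr bool needs_quoting(std::string_view key) noexcept
{
    if (key.empty())
        return true; // bare keys must be non-empty

    for (const char c : key)
    {
        const bool bare = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                       || (c >= '0' && c <= '9') || c == '-' || c == '_';
        if (!bare)
            return true; // also catches every byte of a multi-byte UTF-8 sequence
    }
    return false;
}
```

Every byte of a multi-byte UTF-8 sequence has its high bit set, so non-ASCII keys fail the check without any decoding. Dotted names such as `quoted.key` also need quotes if the dot is meant literally.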

### Unreleased Features

With `TOML_UNRELEASED_FEATURES=1`, the parser accepts Unicode letters and digits in bare keys as proposed for future TOML versions:

```toml
# Only with TOML_UNRELEASED_FEATURES=1:
日本語 = "value"     # bare key with Unicode letters
```

The `is_non_ascii_letter()` and `is_non_ascii_number()` functions from `unicode_autogenerated.hpp` provide the code point tables for this classification.

---

## Auto-Generated Unicode Tables

`include/toml++/impl/unicode_autogenerated.hpp` contains machine-generated lookup tables derived from the Unicode Character Database. These tables classify code points by category:

- **Letter** categories: Lu, Ll, Lt, Lm, Lo
- **Number** categories: Nd, Nl
- **Combining marks**: Mn, Mc
- **Connector punctuation**: Pc

The tables use range-based compression for efficiency:

```cpp
// Simplified representation:
struct code_point_range
{
    char32_t first;
    char32_t last;
};

// Function uses binary search over sorted ranges
bool is_non_ascii_letter(char32_t cp) noexcept;
```
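
A lookup over such ranges might look like the following sketch. The table here is a tiny hand-picked subset chosen for illustration, not the generated data, and `is_letter_sketch` is a hypothetical name; the generated header may organize its checks differently.

```cpp
#include <cstddef>

struct code_point_range
{
    char32_t first;
    char32_t last;
};

// Illustrative subset, sorted by `first` and non-overlapping; NOT the generated toml++ data.
inline constexpr code_point_range letter_ranges[] = {
    { 0x00C0, 0x00D6 }, // Latin-1 letters U+00C0..U+00D6
    { 0x00D8, 0x00F6 }, // Latin-1 letters U+00D8..U+00F6
    { 0x0391, 0x03A1 }, // Greek capitals U+0391..U+03A1
    { 0x4E00, 0x9FFF }, // CJK Unified Ideographs block
};

// Binary search: each step discards the half of the table that cannot contain `cp`.
constexpr bool is_letter_sketch(char32_t cp) noexcept
{
    std::size_t lo = 0;
    std::size_t hi = sizeof(letter_ranges) / sizeof(letter_ranges[0]);
    while (lo < hi)
    {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (cp < letter_ranges[mid].first)
            hi = mid;     // cp lies before this range
        else if (cp > letter_ranges[mid].last)
            lo = mid + 1; // cp lies after this range
        else
            return true;  // first <= cp <= last
    }
    return false;
}
```

Storing `[first, last]` pairs keeps the table small because assigned letters cluster into long contiguous runs, and the search cost grows with the logarithm of the number of ranges rather than the number of code points.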

---

## String Handling in Formatters

Each formatter handles strings slightly differently:

### TOML Formatter

- Defaults to basic strings with escaping: `"hello\nworld"`
- Uses literal strings when `allow_literal_strings` is set and the string contains no single quotes: `'no escapes needed'`
- Uses multi-line strings when `allow_multi_line_strings` is set and the string contains newlines
- Preserves Unicode with `allow_unicode_strings` (default on)

### JSON Formatter

- Always uses double-quoted strings
- Escapes all required JSON characters
- Does not use literal or multi-line strings
- Unicode behavior follows `allow_unicode_strings` flag

### YAML Formatter

- Uses double-quoted strings
- `allow_unicode_strings` is on by default
- Escapes control characters

---

## Complete Example

```cpp
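#include <toml++/toml.hpp>
#include <iostream>
#include <string_view>

using namespace std::string_view_literals; // enables the ""sv literals used below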

int main()
{
    // Parse document with Unicode content
    auto config = toml::parse(R"(
        title = "日本語テスト"
        greeting = "Hello, 世界! 🌍"
        escaped = "\u0048\u0065\u006C\u006C\u006F"
        path = "C:\\Users\\名前\\config"

        [metadata]
        "quoted.key" = "value"
        author = "José García"
    )");

    // Read values — Unicode is preserved
    auto title = config["title"].value_or(""sv);
    std::cout << "Title: " << title << "\n";
    // Title: 日本語テスト

    auto greeting = config["greeting"].value_or(""sv);
    std::cout << "Greeting: " << greeting << "\n";
    // Greeting: Hello, 世界! 🌍

    // Escaped values are decoded
    auto escaped = config["escaped"].value_or(""sv);
    std::cout << "Escaped: " << escaped << "\n";
    // Escaped: Hello

    // Serialize back — Unicode preserved by default
    std::cout << "\n=== TOML (Unicode) ===\n";
    std::cout << config << "\n";

    // Serialize with Unicode escaping
    std::cout << "\n=== TOML (Escaped) ===\n";
    std::cout << toml::toml_formatter{
        config,
        toml::format_flags::indentation  // no allow_unicode_strings
    } << "\n";

    return 0;
}
```

---

## Related Documentation

- [parsing.md](parsing.md) — Parser UTF-8 input handling
- [formatting.md](formatting.md) — Unicode output control via format_flags
- [values.md](values.md) — String value type