1 files changed, 340 insertions, 0 deletions
diff --git a/docs/handbook/cmark/utf8-handling.md b/docs/handbook/cmark/utf8-handling.md
new file mode 100644
index 0000000000..c5bde6a320
--- /dev/null
+++ b/docs/handbook/cmark/utf8-handling.md
@@ -0,0 +1,340 @@
+# cmark — UTF-8 Handling
+
+## Overview
+
+The UTF-8 module (`utf8.c`, `utf8.h`) provides Unicode support for cmark: encoding, decoding, validation, iteration, case folding, and character classification. It incorporates data from `utf8proc` for case folding and character properties.
+
+## UTF-8 Encoding Fundamentals
+
+The module handles all four UTF-8 byte patterns:
+
+| Codepoint Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
+|----------------|--------|--------|--------|--------|
+| U+0000–U+007F | 0xxxxxxx | | | |
+| U+0080–U+07FF | 110xxxxx | 10xxxxxx | | |
+| U+0800–U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
+| U+10000–U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
+
+## Byte Classification Table
+
+```c
+static const uint8_t utf8proc_utf8class[256] = {
+  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x00-0x0F
+  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x10-0x1F
+  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x20-0x2F
+  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x30-0x3F
+  // ...
+  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  // 0x80-0x8F (continuation)
+  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  // 0x90-0x9F
+  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  // 0xA0-0xAF
+  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  // 0xB0-0xBF
+  0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  // 0xC0-0xCF
+  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  // 0xD0-0xDF
+  3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,  // 0xE0-0xEF
+  4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  // 0xF0-0xFF
+};
+```
+
+The table maps each byte value to the length of the UTF-8 sequence it starts:
+- `1` → ASCII single-byte character
+- `2` → Two-byte sequence lead byte (0xC2-0xDF)
+- `3` → Three-byte sequence lead byte (0xE0-0xEF)
+- `4` → Four-byte sequence lead byte (0xF0-0xF4)
+- `0` → Continuation byte (0x80-0xBF) or invalid lead byte (0xC0-0xC1, 0xF5-0xFF)
+
+Note: 0xC0 and 0xC1 are marked as `0` (invalid) because they would encode codepoints < 0x80, which is an overlong encoding.
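+
+The table values follow directly from the bit patterns in the encoding table. The sketch below is a hypothetical helper, `utf8_class_of()`, which is not part of cmark; it recomputes the same classes with range checks, which may help when checking the table by hand.
+
+```c
+#include <stdint.h>
+
+/* Hypothetical helper (not in cmark): derives the class of a byte from
+ * its bit pattern instead of looking it up in the 256-entry table. */
+static int utf8_class_of(uint8_t b) {
+  if (b < 0x80) return 1; /* 0xxxxxxx: ASCII */
+  if (b < 0xC2) return 0; /* 10xxxxxx continuation, or overlong lead 0xC0/0xC1 */
+  if (b < 0xE0) return 2; /* 110xxxxx */
+  if (b < 0xF0) return 3; /* 1110xxxx */
+  if (b < 0xF5) return 4; /* 11110xxx, capped at 0xF4 (U+10FFFF) */
+  return 0;               /* 0xF5-0xFF never start a valid sequence */
+}
+```
+
+A table lookup avoids the branch chain, which is presumably why the module stores the result for all 256 byte values.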
+
+## UTF-8 Encoding
+
+```c
+void cmark_utf8proc_encode_char(int32_t uc, cmark_strbuf *buf) {
+  uint8_t dst[4];
+  bufsize_t len = 0;
+
+  assert(uc >= 0);
+
+  if (uc < 0x80) {
+    dst[0] = (uint8_t)(uc);
+    len = 1;
+  } else if (uc < 0x800) {
+    dst[0] = (uint8_t)(0xC0 + (uc >> 6));
+    dst[1] = 0x80 + (uc & 0x3F);
+    len = 2;
+  } else if (uc == 0xFFFF) {
+    // Invalid codepoint — encode replacement char
+    dst[0] = 0xEF; dst[1] = 0xBF; dst[2] = 0xBD;
+    len = 3;
+  } else if (uc == 0xFFFE) {
+    // Invalid codepoint — encode replacement char
+    dst[0] = 0xEF; dst[1] = 0xBF; dst[2] = 0xBD;
+    len = 3;
+  } else if (uc < 0x10000) {
+    dst[0] = (uint8_t)(0xE0 + (uc >> 12));
+    dst[1] = 0x80 + ((uc >> 6) & 0x3F);
+    dst[2] = 0x80 + (uc & 0x3F);
+    len = 3;
+  } else if (uc < 0x110000) {
+    dst[0] = (uint8_t)(0xF0 + (uc >> 18));
+    dst[1] = 0x80 + ((uc >> 12) & 0x3F);
+    dst[2] = 0x80 + ((uc >> 6) & 0x3F);
+    dst[3] = 0x80 + (uc & 0x3F);
+    len = 4;
+  } else {
+    // Out of range — encode replacement char U+FFFD
+    dst[0] = 0xEF; dst[1] = 0xBF; dst[2] = 0xBD;
+    len = 3;
+  }
+
+  cmark_strbuf_put(buf, dst, len);
+}
+```
+
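+Encodes a single Unicode codepoint as UTF-8 and appends it to a `cmark_strbuf`. The noncharacters U+FFFE and U+FFFF, and any value above U+10FFFF, are replaced with U+FFFD (the replacement character). As shown, the encoder does not reject surrogates (U+D800-U+DFFF); they take the three-byte branch.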
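+
+To make the bit arithmetic concrete, here is a minimal standalone sketch of the same scheme. `encode_utf8()` is a hypothetical name, and it writes into a caller-supplied array rather than a `cmark_strbuf` so that it compiles without cmark.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Hypothetical standalone encoder (not cmark's API): writes the UTF-8
+ * form of cp into out[0..3] and returns the number of bytes written. */
+static size_t encode_utf8(uint32_t cp, uint8_t out[4]) {
+  /* Same substitution as described above: map rejected values to U+FFFD. */
+  if (cp == 0xFFFE || cp == 0xFFFF || cp > 0x10FFFF)
+    cp = 0xFFFD;
+
+  if (cp < 0x80) {
+    out[0] = (uint8_t)cp;
+    return 1;
+  }
+  if (cp < 0x800) {
+    out[0] = (uint8_t)(0xC0 | (cp >> 6));
+    out[1] = (uint8_t)(0x80 | (cp & 0x3F));
+    return 2;
+  }
+  if (cp < 0x10000) {
+    out[0] = (uint8_t)(0xE0 | (cp >> 12));
+    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
+    out[2] = (uint8_t)(0x80 | (cp & 0x3F));
+    return 3;
+  }
+  out[0] = (uint8_t)(0xF0 | (cp >> 18));
+  out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
+  out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
+  out[3] = (uint8_t)(0x80 | (cp & 0x3F));
+  return 4;
+}
+```
+
+For example, U+20AC (the euro sign) encodes to `E2 82 AC`, and U+1F600 encodes to `F0 9F 98 80`.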
+
+## UTF-8 Validation and Iteration
+
+```c
+void cmark_utf8proc_check(cmark_strbuf *dest, const uint8_t *line,
+                          bufsize_t size) {
+  bufsize_t i = 0;
+
+  while (i < size) {
+    bufsize_t byte_length = utf8proc_utf8class[line[i]];
+    int32_t codepoint = -1;
+
+    if (byte_length == 0) {
+      // Invalid lead byte — replace
+      cmark_utf8proc_encode_char(0xFFFD, dest);
+      i++;
+      continue;
+    }
+
+    // Check we have enough bytes
+    if (i + byte_length > size) {
+      // Truncated sequence — replace
+      cmark_utf8proc_encode_char(0xFFFD, dest);
+      i++;
+      continue;
+    }
+
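+    // Decode and validate; codepoint stays -1 if any check fails
+    switch (byte_length) {
+    case 1:
+      codepoint = line[i];
+      break;
+    case 2:
+      // Continuation byte must match 10xxxxxx
+      if ((line[i+1] & 0xC0) != 0x80) break;
+      codepoint = ((line[i] & 0x1F) << 6) | (line[i+1] & 0x3F);
+      break;
+    case 3:
+      if ((line[i+1] & 0xC0) != 0x80 || (line[i+2] & 0xC0) != 0x80) break;
+      // Reject overlongs (0xE0 lead) and surrogates U+D800-U+DFFF (0xED lead)
+      if (line[i] == 0xE0 && line[i+1] < 0xA0) break;
+      if (line[i] == 0xED && line[i+1] >= 0xA0) break;
+      codepoint = ((line[i] & 0x0F) << 12) |
+                  ((line[i+1] & 0x3F) << 6) |
+                  (line[i+2] & 0x3F);
+      break;
+    case 4:
+      if ((line[i+1] & 0xC0) != 0x80 || (line[i+2] & 0xC0) != 0x80 ||
+          (line[i+3] & 0xC0) != 0x80) break;
+      // Reject overlongs (0xF0 lead) and codepoints above U+10FFFF (0xF4 lead)
+      if (line[i] == 0xF0 && line[i+1] < 0x90) break;
+      if (line[i] == 0xF4 && line[i+1] >= 0x90) break;
+      codepoint = ((line[i] & 0x07) << 18) |
+                  ((line[i+1] & 0x3F) << 12) |
+                  ((line[i+2] & 0x3F) << 6) |
+                  (line[i+3] & 0x3F);
+      break;
+    }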
+
+    if (codepoint < 0) {
+      cmark_utf8proc_encode_char(0xFFFD, dest);
+      i++;
+    } else {
+      cmark_utf8proc_encode_char(codepoint, dest);
+      i += byte_length;
+    }
+  }
+}
+```
+
+This function validates UTF-8 and replaces invalid sequences with U+FFFD. It enforces:
+- No invalid lead bytes
+- No truncated sequences
+- No invalid continuation bytes
+- No overlong encodings
+- No surrogate codepoints (U+D800-U+DFFF)
+- No codepoints above U+10FFFF
+
+### Validation Rules (RFC 3629)
+
+For 3-byte sequences:
+```c
+// Reject overlongs: first byte 0xE0 requires second byte >= 0xA0
+if (line[i] == 0xE0 && line[i+1] < 0xA0) { /* overlong */ }
+// Reject surrogates: first byte 0xED requires second byte < 0xA0
+if (line[i] == 0xED && line[i+1] >= 0xA0) { /* surrogate */ }
+```
+
+For 4-byte sequences:
+```c
+// Reject overlongs: first byte 0xF0 requires second byte >= 0x90
+if (line[i] == 0xF0 && line[i+1] < 0x90) { /* overlong */ }
+// Reject codepoints > U+10FFFF: first byte 0xF4 requires second byte < 0x90
+if (line[i] == 0xF4 && line[i+1] >= 0x90) { /* out of range */ }
+```
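+
+The four second-byte checks above, combined with the lead-byte and continuation-byte checks, are enough to decide whether a sequence is well formed. The following is a sketch under that assumption; `utf8_valid_len()` is a hypothetical helper, not a cmark function.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Hypothetical helper (not in cmark): returns the length (1-4) of the
+ * well-formed UTF-8 sequence at s, or 0 if the bytes are not valid. */
+static int utf8_valid_len(const uint8_t *s, size_t n) {
+  int len, k;
+
+  if (n == 0) return 0;
+  if (s[0] < 0x80) return 1;
+  if (s[0] >= 0xC2 && s[0] <= 0xDF) len = 2;
+  else if (s[0] >= 0xE0 && s[0] <= 0xEF) len = 3;
+  else if (s[0] >= 0xF0 && s[0] <= 0xF4) len = 4;
+  else return 0;                 /* continuation byte or invalid lead */
+
+  if ((size_t)len > n) return 0; /* truncated sequence */
+  for (k = 1; k < len; k++)
+    if ((s[k] & 0xC0) != 0x80) return 0; /* not 10xxxxxx */
+
+  if (s[0] == 0xE0 && s[1] < 0xA0) return 0;  /* overlong 3-byte form */
+  if (s[0] == 0xED && s[1] >= 0xA0) return 0; /* surrogate U+D800-U+DFFF */
+  if (s[0] == 0xF0 && s[1] < 0x90) return 0;  /* overlong 4-byte form */
+  if (s[0] == 0xF4 && s[1] >= 0x90) return 0; /* above U+10FFFF */
+  return len;
+}
+```
+
+With this helper, `E2 82 AC` is accepted with length 3, while `E0 80 80` (overlong), `ED A0 80` (surrogate), and `F4 90 80 80` (out of range) are all rejected.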
+
+## UTF-8 Iterator
+
+```c
+void cmark_utf8proc_iterate(const uint8_t *str, bufsize_t str_len,
+                            int32_t *dst) {
+  *dst = -1;
+  if (str_len <= 0) return;
+
+  uint8_t length = utf8proc_utf8class[str[0]];
+  if (!length) return;
+  if (str_len >= length) {
+    switch (length) {
+    case 1:
+      *dst = str[0];
+      break;
+    case 2:
+      *dst = ((int32_t)(str[0] & 0x1F) << 6) | (str[1] & 0x3F);
+      break;
+    case 3:
+      *dst = ((int32_t)(str[0] & 0x0F) << 12) |
+             ((int32_t)(str[1] & 0x3F) << 6) |
+             (str[2] & 0x3F);
+      // Reject surrogates:
+      if (*dst >= 0xD800 && *dst < 0xE000) *dst = -1;
+      break;
+    case 4:
+      *dst = ((int32_t)(str[0] & 0x07) << 18) |
+             ((int32_t)(str[1] & 0x3F) << 12) |
+             ((int32_t)(str[2] & 0x3F) << 6) |
+             (str[3] & 0x3F);
+      if (*dst > 0x10FFFF) *dst = -1;
+      break;
+    }
+  }
+}
+```
+
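+Decodes a single UTF-8 codepoint from the start of a byte string. Sets `*dst` to -1 on error: empty input, an invalid lead byte, a truncated sequence, a surrogate, or a value above U+10FFFF. As excerpted here, it does not re-check continuation bytes or overlong forms.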
+
+## Case Folding
+
+```c
+void cmark_utf8proc_case_fold(cmark_strbuf *dest, const uint8_t *str,
+                              bufsize_t len) {
+  int32_t c;
+  bufsize_t i = 0;
+
+  while (i < len) {
+    bufsize_t char_len = cmark_utf8proc_charlen(str + i, len - i);
+    if (char_len < 0) {
+      cmark_utf8proc_encode_char(0xFFFD, dest);
+      i += 1;
+      continue;
+    }
+    cmark_utf8proc_iterate(str + i, char_len, &c);
+    if (c >= 0) {
+      // Look up case fold mapping
+      const int32_t *fold = cmark_utf8proc_case_fold_info(c);
+      if (fold) {
+        // Some characters fold to multiple codepoints
+        while (*fold >= 0) {
+          cmark_utf8proc_encode_char(*fold, dest);
+          fold++;
+        }
+      } else {
+        cmark_utf8proc_encode_char(c, dest);
+      }
+    }
+    i += char_len;
+  }
+}
+```
+
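+Performs Unicode case folding rather than lowercasing. Folding maps every case variant of a character to a single canonical form, which may expand to several codepoints, so that two strings can be compared without regard to case. Used for normalizing link reference labels.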
+
+### Case Fold Lookup
+
+```c
+static const int32_t *cmark_utf8proc_case_fold_info(int32_t c);
+```
+
+Uses a sorted table `cf_table` and binary search to find case fold mappings. Each entry maps a codepoint to one or more replacement codepoints (some characters fold to multiple characters, e.g., `ß` → `ss`).
+
+The table uses sentinel value `-1` to terminate multi-codepoint sequences.
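+
+A lookup of this kind can be sketched as follows. The entry layout, the four-row `demo_table`, and the name `demo_fold_info()` are all assumptions made for illustration; the real `cf_table` may be laid out differently and is far larger.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Assumed entry layout: a source codepoint plus its folded sequence,
+ * terminated by -1. */
+typedef struct {
+  int32_t from;
+  int32_t to[3];
+} fold_entry;
+
+/* Tiny illustrative table, sorted by `from` so binary search works. */
+static const fold_entry demo_table[] = {
+  {0x0041, {0x0061, -1, -1}},     /* LATIN CAPITAL A -> a */
+  {0x00DF, {0x0073, 0x0073, -1}}, /* LATIN SMALL SHARP S -> s s */
+  {0x0130, {0x0069, 0x0307, -1}}, /* CAPITAL I WITH DOT -> i + U+0307 */
+  {0x03A3, {0x03C3, -1, -1}},     /* GREEK CAPITAL SIGMA -> small sigma */
+};
+
+/* Returns the -1-terminated folded sequence for c, or NULL when c has
+ * no entry and therefore folds to itself. */
+static const int32_t *demo_fold_info(int32_t c) {
+  size_t lo = 0;
+  size_t hi = sizeof(demo_table) / sizeof(demo_table[0]);
+
+  while (lo < hi) {
+    size_t mid = lo + (hi - lo) / 2;
+    if (demo_table[mid].from == c) return demo_table[mid].to;
+    if (demo_table[mid].from < c) lo = mid + 1;
+    else hi = mid;
+  }
+  return NULL;
+}
+```
+
+A `NULL` result corresponds to the `else` branch in `cmark_utf8proc_case_fold()`, where the codepoint is copied through unchanged.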
+
+## Character Classification
+
+### `cmark_utf8proc_is_space()`
+
+```c
+int cmark_utf8proc_is_space(int32_t c) {
+  // ASCII spaces
+  if (c < 0x80) {
+    return (c == 9 || c == 10 || c == 12 || c == 13 || c == 32);
+  }
+  // Unicode Zs category
+  return (c == 0xa0 || c == 0x1680 ||
+          (c >= 0x2000 && c <= 0x200a) ||
+          c == 0x202f || c == 0x205f || c == 0x3000);
+}
+```
+
+Matches ASCII whitespace (HT, LF, FF, CR, SP) and Unicode Zs (space separator) characters including:
+- U+00A0 (NBSP)
+- U+1680 (Ogham space)
+- U+2000-U+200A (various typographic spaces)
+- U+202F (narrow NBSP)
+- U+205F (medium mathematical space)
+- U+3000 (ideographic space)
+
+### `cmark_utf8proc_is_punctuation()`
+
+```c
+int cmark_utf8proc_is_punctuation(int32_t c) {
+  // ASCII punctuation ranges
+  if (c < 128) {
+    return (c >= 33 && c <= 47) ||
+           (c >= 58 && c <= 64) ||
+           (c >= 91 && c <= 96) ||
+           (c >= 123 && c <= 126);
+  }
+  // Unicode Pc, Pd, Pe, Pf, Pi, Po, Ps categories
+  // Uses a table-driven approach for Unicode punctuation
+}
+```
+
+Returns true for ASCII punctuation (`!`, `"`, `#`, `$`, `%`, `&`, `'`, `(`, `)`, `*`, `+`, `,`, `-`, `.`, `/`, `:`, `;`, `<`, `=`, `>`, `?`, `@`, `[`, `\`, `]`, `^`, `_`, `` ` ``, `{`, `|`, `}`, `~`) and for Unicode punctuation (general categories Pc, Pd, Pe, Pf, Pi, Po, and Ps).
+
+These classification functions are critical for inline parsing, specifically for delimiter run classification — determining whether a `*` or `_` run is left-flanking or right-flanking depends on whether adjacent characters are spaces or punctuation.
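+
+The flanking test can be sketched as below, following the CommonMark definition of left-flanking and right-flanking delimiter runs. The ASCII-only `is_space()` and `is_punct()` are stand-ins for the two cmark functions so that the example compiles on its own, and the function names are hypothetical.
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+/* ASCII-only stand-ins for cmark_utf8proc_is_space() and
+ * cmark_utf8proc_is_punctuation(). */
+static bool is_space(int32_t c) {
+  return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
+}
+
+static bool is_punct(int32_t c) {
+  return (c >= 33 && c <= 47) || (c >= 58 && c <= 64) ||
+         (c >= 91 && c <= 96) || (c >= 123 && c <= 126);
+}
+
+/* before/after are the codepoints on either side of the delimiter run.
+ * Pass '\n' at the start or end of the text, which counts as whitespace. */
+static bool left_flanking(int32_t before, int32_t after) {
+  return !is_space(after) &&
+         (!is_punct(after) || is_space(before) || is_punct(before));
+}
+
+static bool right_flanking(int32_t before, int32_t after) {
+  return !is_space(before) &&
+         (!is_punct(before) || is_space(after) || is_punct(after));
+}
+```
+
+In `*foo` the run is left-flanking only; in `foo*` it is right-flanking only; in `a * b` it is neither, so the `*` stays literal.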
+
+## Helper Functions
+
+### `cmark_utf8proc_charlen()`
+
+```c
+static CMARK_INLINE bufsize_t cmark_utf8proc_charlen(const uint8_t *str,
+                                                      bufsize_t str_len) {
+  bufsize_t length = utf8proc_utf8class[str[0]];
+  if (!length) return -1;               // invalid lead byte (-length would be 0)
+  if (str_len < length) return -length; // truncated sequence
+  return length;
+}
+```
+
+Returns the byte length of the UTF-8 character at the given position. Returns negative on error (invalid byte or truncated).
+
+## Usage in cmark
+
+1. **Input validation**: `cmark_utf8proc_check()` is called on input to replace invalid UTF-8 with U+FFFD
+2. **Reference normalization**: `cmark_utf8proc_case_fold()` is used by `normalize_reference()` in `references.c` for case-insensitive reference label matching
+3. **Delimiter classification**: `cmark_utf8proc_is_space()` and `cmark_utf8proc_is_punctuation()` are used in `inlines.c` for the left-flanking/right-flanking delimiter run rules
+4. **Entity decoding**: `cmark_utf8proc_encode_char()` is used when decoding HTML entities and numeric character references to produce their UTF-8 representation
+5. **Renderer output**: `cmark_render_code_point()` in `render.c` calls `cmark_utf8proc_encode_char()` for multi-byte character output
+
+## Cross-References
+
+- [utf8.c](../../cmark/src/utf8.c) — Implementation
+- [utf8.h](../../cmark/src/utf8.h) — Public interface
+- [inline-parsing.md](inline-parsing.md) — Uses character classification for delimiter rules
+- [reference-system.md](reference-system.md) — Uses case folding for label normalization