author: Mehmet Samet Duman <yongdohyun@projecttick.org>, 2026-04-05 17:37:54 +0300
committer: Mehmet Samet Duman <yongdohyun@projecttick.org>, 2026-04-05 17:37:54 +0300
commit: 32f5f761bc8e960293b4f4feaf973dd0da26d0f8
tree: 8d0436fdd093d5255c3b75e45f9741882b22e2e4 /docs/handbook/cmark
parent: 64f4ddfa97c19f371fe1847b20bd26803f0a25d5
NOISSUE Project Tick Handbook is Released!
Assisted-by: Claude:Opus-4.6-High
Signed-off-by: Mehmet Samet Duman <yongdohyun@projecttick.org>
Diffstat (limited to 'docs/handbook/cmark')
-rw-r--r-- docs/handbook/cmark/architecture.md | 283
-rw-r--r-- docs/handbook/cmark/ast-node-system.md | 383
-rw-r--r-- docs/handbook/cmark/block-parsing.md | 310
-rw-r--r-- docs/handbook/cmark/building.md | 268
-rw-r--r-- docs/handbook/cmark/cli-usage.md | 249
-rw-r--r-- docs/handbook/cmark/code-style.md | 293
-rw-r--r-- docs/handbook/cmark/commonmark-renderer.md | 344
-rw-r--r-- docs/handbook/cmark/html-renderer.md | 258
-rw-r--r-- docs/handbook/cmark/inline-parsing.md | 317
-rw-r--r-- docs/handbook/cmark/iterator-system.md | 267
-rw-r--r-- docs/handbook/cmark/latex-renderer.md | 320
-rw-r--r-- docs/handbook/cmark/man-renderer.md | 272
-rw-r--r-- docs/handbook/cmark/memory-management.md | 351
-rw-r--r-- docs/handbook/cmark/overview.md | 256
-rw-r--r-- docs/handbook/cmark/public-api.md | 637
-rw-r--r-- docs/handbook/cmark/reference-system.md | 307
-rw-r--r-- docs/handbook/cmark/render-framework.md | 294
-rw-r--r-- docs/handbook/cmark/scanner-system.md | 223
-rw-r--r-- docs/handbook/cmark/testing.md | 281
-rw-r--r-- docs/handbook/cmark/utf8-handling.md | 340
-rw-r--r-- docs/handbook/cmark/xml-renderer.md | 291
21 files changed, 6544 insertions, 0 deletions
diff --git a/docs/handbook/cmark/architecture.md b/docs/handbook/cmark/architecture.md
new file mode 100644
index 0000000000..e35bd2e578
--- /dev/null
+++ b/docs/handbook/cmark/architecture.md
@@ -0,0 +1,283 @@
+# cmark — Architecture
+
+## High-Level Design
+
+cmark implements a two-phase parsing pipeline that converts CommonMark Markdown into an Abstract Syntax Tree (AST), which can then be rendered into multiple output formats. The design separates concerns cleanly: block-level structure is identified first, then inline content is parsed within the appropriate blocks.
+
+```
+Input Text (UTF-8)
+ │
+ ▼
+┌──────────────────┐
+│ S_parser_feed │ Split input into lines (blocks.c)
+│ │ Handle UTF-8 BOM, CR/LF normalization
+└────────┬───────────┘
+ │
+ ▼
+┌──────────────────┐
+│ S_process_line │ Line-by-line block structure analysis (blocks.c)
+│ │ Open/close containers, detect leaf blocks
+└────────┬───────────┘
+ │
+ ▼
+┌──────────────────┐
+│ finalize_document│ Close all open blocks (blocks.c)
+│ │ Resolve reference link definitions
+└────────┬───────────┘
+ │
+ ▼
+┌──────────────────┐
+│ process_inlines │ Parse inline content in paragraphs/headings (blocks.c → inlines.c)
+│ │ Delimiter stack algorithm for emphasis
+│ │ Bracket stack for links/images
+└────────┬───────────┘
+ │
+ ▼
+┌──────────────────┐
+│ AST (cmark_node tree) │
+└────────┬───────────┘
+ │
+ ▼
+┌──────────────────┐
+│ Renderer │ Iterator-driven traversal
+│ (html/xml/ │ Enter/Exit events per node
+│ latex/man/cm) │
+└──────────────────┘
+ │
+ ▼
+ Output String
+```
+
+## Module Dependency Graph
+
+The internal header dependencies reveal the layered architecture:
+
+```
+cmark.h (public API — types, enums, function declarations)
+ ├── cmark_export.h (generated — DLL export macros)
+ └── cmark_version.h (generated — version constants)
+
+node.h (internal — struct cmark_node)
+ ├── cmark.h
+ └── buffer.h
+
+parser.h (internal — struct cmark_parser)
+ ├── references.h
+ ├── node.h
+ └── buffer.h
+
+iterator.h (internal — struct cmark_iter)
+ └── cmark.h
+
+render.h (internal — struct cmark_renderer)
+ └── buffer.h
+
+buffer.h (internal — cmark_strbuf)
+ └── cmark.h
+
+chunk.h (internal — cmark_chunk)
+ ├── cmark.h
+ ├── buffer.h
+ └── cmark_ctype.h
+
+references.h (internal — cmark_reference_map)
+ └── chunk.h
+
+inlines.h (internal — inline parsing API)
+ ├── chunk.h
+ └── references.h
+
+scanners.h (internal — scanner function declarations)
+ ├── cmark.h
+ └── chunk.h
+
+houdini.h (internal — HTML/URL escaping)
+ └── buffer.h
+
+cmark_ctype.h (internal — locale-independent char classification)
+ (no cmark dependencies)
+
+utf8.h (internal — UTF-8 processing)
+ └── buffer.h
+```
+
+## Phase 1: Block Structure (blocks.c)
+
+Block parsing operates on a state machine maintained in the `cmark_parser` struct (defined in `parser.h`):
+
+```c
+struct cmark_parser {
+ struct cmark_mem *mem; // Memory allocator
+ struct cmark_reference_map *refmap; // Link reference definitions
+ struct cmark_node *root; // Document root node
+ struct cmark_node *current; // Deepest open block
+ int line_number; // Current line being processed
+ bufsize_t offset; // Byte position in current line
+ bufsize_t column; // Virtual column (tabs expanded)
+ bufsize_t first_nonspace; // Position of first non-whitespace
+ bufsize_t first_nonspace_column; // Column of first non-whitespace
+ bufsize_t thematic_break_kill_pos; // Optimization for thematic break scanning
+ int indent; // Indentation level (first_nonspace_column - column)
+ bool blank; // Whether current line is blank
+ bool partially_consumed_tab; // Tab only partially used for indentation
+ cmark_strbuf curline; // Current line being processed
+ bufsize_t last_line_length; // Length of previous line (for end_column)
+ cmark_strbuf linebuf; // Buffer for accumulating partial lines across feeds
+ cmark_strbuf content; // Accumulated content for the current open block
+ int options; // Option flags
+ bool last_buffer_ended_with_cr; // For CR/LF handling across buffer boundaries
+ unsigned int total_size; // Total bytes fed (for reference expansion limiting)
+};
+```
+
+### Line Processing Flow
+
+For each line, `S_process_line()` does the following:
+
+1. **Increment line number**, store current line in `parser->curline`.
+2. **Check open blocks** (`check_open_blocks()`): Walk through the tree from root to the deepest open node. For each open container node, try to match the expected line prefix:
+ - Block quote: expect `>` (optionally preceded by up to 3 spaces)
+ - List item: expect indentation of at least `marker_offset + padding`
+ - Code block (fenced): check for closing fence or skip fence offset spaces
+ - Code block (indented): expect 4+ spaces of indentation
+ - HTML block: check type-specific continuation rules
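+3. **Try new container starts**: After the continuation prefixes are consumed, check whether the rest of the line starts a new container (block quote, list item). This check runs whether or not every open block matched; unmatched blocks are closed later, unless the line turns out to be a lazy paragraph continuation.
+4. **Try new leaf blocks**: If the rest of the line does not start a container, check whether it starts a leaf block: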
+ - ATX heading (lines starting with 1-6 `#` characters)
+ - Setext heading (underlines of `=` or `-` following a paragraph)
+ - Thematic break (3+ `*`, `-`, or `_` on a line by themselves)
+ - Fenced code block (3+ backticks or tildes)
+ - HTML block (7 different start patterns)
+ - Indented code block (4+ spaces of indentation)
+5. **Add line content**: For blocks that accept lines (paragraph, heading, code block), append the line content to `parser->content`.
+6. **Handle lazy continuation**: Paragraphs support lazy continuation where a non-blank line can continue a paragraph even without matching container prefixes.
+
+### Finalization
+
+When a block is closed (either explicitly or because a new block replaces it), `finalize()` is called:
+
+- **Paragraphs**: Reference link definitions at the start are extracted and stored in `parser->refmap`. If only references remain, the paragraph node is deleted.
+- **Code blocks (fenced)**: The first line becomes the info string; remaining content becomes the code body.
+- **Code blocks (indented)**: Trailing blank lines are removed.
+- **Lists**: Tight/loose status is determined by checking for blank lines between items and their children.
+
+## Phase 2: Inline Parsing (inlines.c)
+
+After all block structure is finalized, `process_inlines()` walks the AST with an iterator and calls `cmark_parse_inlines()` for every node whose type `contains_inlines()` — specifically, `CMARK_NODE_PARAGRAPH` and `CMARK_NODE_HEADING`.
+
+The inline parser uses a `subject` struct that tracks:
+
+```c
+typedef struct {
+ cmark_mem *mem;
+ cmark_chunk input; // The text to parse
+ unsigned flags; // Skip flags for HTML constructs
+ int line; // Source line number
+ bufsize_t pos; // Current position in input
+ int block_offset; // Column offset of containing block
+ int column_offset; // Adjustment for multi-line inlines
+ cmark_reference_map *refmap; // Reference definitions
+ delimiter *last_delim; // Top of delimiter stack
+ bracket *last_bracket; // Top of bracket stack
+ bufsize_t backticks[MAXBACKTICKS + 1]; // Cache of backtick positions
+ bool scanned_for_backticks; // Whether full backtick scan done
+ bool no_link_openers; // Optimization flag
+} subject;
+```
+
+### Delimiter Stack Algorithm
+
+Emphasis (`*`, `_`) and smart quotes (`'`, `"`) use a delimiter stack. When a run of delimiter characters is found:
+
+1. `scan_delims()` determines whether the run can open and/or close emphasis, based on Unicode-aware flanking rules (checking whether surrounding characters are spaces or punctuation using `cmark_utf8proc_is_space()` and `cmark_utf8proc_is_punctuation_or_symbol()`).
+2. The delimiter is pushed onto the stack as a `delimiter` struct.
+3. When a closing delimiter is found, the stack is scanned backwards for a matching opener, and `S_insert_emph()` creates `CMARK_NODE_EMPH` or `CMARK_NODE_STRONG` nodes.
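
The flanking test in step 1 can be sketched in isolation. The following is a minimal, ASCII-only illustration of the CommonMark flanking rules, not cmark's `scan_delims()`: the names `delim_run` and `classify_run` are hypothetical, and `isspace()`/`ispunct()` stand in for the Unicode-aware checks that cmark performs.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch; not cmark's `delimiter` struct. */
typedef struct {
  size_t length;  /* number of delimiter characters in the run */
  bool can_open;  /* run may open emphasis */
  bool can_close; /* run may close emphasis */
} delim_run;

/* Classify the run of '*' or '_' that starts at text[pos].
 * Assumes pos is the first character of the run. ASCII only. */
delim_run classify_run(const char *text, size_t pos) {
  size_t len = strlen(text);
  assert(pos < len);
  char c = text[pos];
  assert(c == '*' || c == '_');

  size_t end = pos;
  while (end < len && text[end] == c) end++;

  /* The start and end of the input count as whitespace. */
  unsigned char before = pos == 0 ? '\n' : (unsigned char)text[pos - 1];
  unsigned char after = end == len ? '\n' : (unsigned char)text[end];
  bool sp_before = isspace(before) != 0, sp_after = isspace(after) != 0;
  bool pu_before = ispunct(before) != 0, pu_after = ispunct(after) != 0;

  bool left = !sp_after && (!pu_after || sp_before || pu_before);
  bool right = !sp_before && (!pu_before || sp_after || pu_after);

  delim_run run;
  run.length = end - pos;
  if (c == '_') {
    /* Underscores get the stricter rule so intraword '_' stays literal. */
    run.can_open = left && (!right || pu_before);
    run.can_close = right && (!left || pu_after);
  } else {
    run.can_open = left;
    run.can_close = right;
  }
  return run;
}
```

Under these assumptions, the `_` in `snake_case` can neither open nor close emphasis, while the first `*` in `*foo*` can only open and the last can only close.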
+
+### Bracket Stack Algorithm
+
+Links and images use a separate bracket stack:
+
+1. `[` pushes a bracket entry; `![` pushes one marked as `image = true`.
+2. When `]` is encountered, the bracket stack is searched for a matching opener.
+3. If found, the parser looks for `(url "title")` or `[ref]` after the `]`.
+4. For reference-style links, `cmark_reference_lookup()` is called against the parser's `refmap`.
+
+## Phase 3: AST Rendering
+
+All renderers traverse the AST using the iterator system. There are two rendering architectures:
+
+### Direct Renderers (no framework)
+- **HTML** (`html.c`): Uses `cmark_strbuf` directly. The `S_render_node()` function handles enter/exit events in a large switch statement. HTML escaping is done via `houdini_escape_html()`.
+- **XML** (`xml.c`): Similar direct approach with XML-specific escaping and indentation tracking.
+
+### Framework Renderers (via render.c)
+- **LaTeX** (`latex.c`), **man** (`man.c`), **CommonMark** (`commonmark.c`): These use the `cmark_render()` generic framework, which provides:
+ - Line wrapping at a configurable width
+ - Prefix management for indented output (block quotes, list items)
+ - Breakpoint tracking for intelligent line breaking
+ - Escape dispatch via function pointers (`outc`)
+
+The framework signature:
+
+```c
+char *cmark_render(cmark_node *root, int options, int width,
+ void (*outc)(cmark_renderer *, cmark_escaping, int32_t, unsigned char),
+ int (*render_node)(cmark_renderer *, cmark_node *,
+ cmark_event_type, int));
+```
+
+Each format-specific renderer supplies its own `outc` (character-level escaping) and `render_node` (node-level output) callback functions.
+
+## Key Design Decisions
+
+### Owning vs. Non-Owning Strings
+
+cmark uses two string types:
+
+- **`cmark_strbuf`** (buffer.h): Owning, growable byte buffer. Used for accumulating output and parser state. Memory is managed via the `cmark_mem` allocator.
+- **`cmark_chunk`** (chunk.h): Non-owning slice (pointer + length). Used for referencing substrings of the input during parsing without copying.
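
The difference between the two kinds of string can be illustrated with a small standalone sketch. The types `slice` and `growbuf` below are hypothetical stand-ins, not the real `cmark_chunk` and `cmark_strbuf`; they use plain `realloc()` instead of a `cmark_mem` allocator and only show the ownership split.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Non-owning view: points into memory that someone else owns. */
typedef struct {
  const unsigned char *data;
  size_t len;
} slice;

/* Owning buffer: allocates, grows, and must be released by its owner. */
typedef struct {
  unsigned char *ptr;
  size_t size; /* bytes in use, excluding the terminating NUL */
  size_t cap;  /* bytes allocated */
} growbuf;

/* Create a view of len bytes of s starting at start. Nothing is copied. */
slice slice_of(const char *s, size_t start, size_t len) {
  slice v = {(const unsigned char *)s + start, len};
  return v;
}

/* Append len bytes to the buffer, growing it as needed. */
void growbuf_put(growbuf *b, const unsigned char *data, size_t len) {
  if (b->size + len + 1 > b->cap) {
    size_t cap = b->cap ? b->cap : 16;
    while (cap < b->size + len + 1) cap *= 2;
    unsigned char *p = realloc(b->ptr, cap);
    if (p == NULL) abort(); /* abort on allocation failure */
    b->ptr = p;
    b->cap = cap;
  }
  memcpy(b->ptr + b->size, data, len);
  b->size += len;
  b->ptr[b->size] = '\0';
}

/* Release the buffer's memory and reset it to the empty state. */
void growbuf_free(growbuf *b) {
  free(b->ptr);
  b->ptr = NULL;
  b->size = b->cap = 0;
}
```

A slice stays valid only as long as the memory it points into, which is why a parser can hand out slices of the input cheaply but has to copy into an owning buffer anything that must outlive that input.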
+
+### Node Memory Layout
+
+Every `cmark_node` uses a discriminated union (`node->as`) to store type-specific data without separate allocations:
+
+```c
+union {
+ cmark_list list; // list marker, start, tight, delimiter
+ cmark_code code; // info string, fence char/length/offset
+ cmark_heading heading; // level, setext flag, internal_offset
+ cmark_link link; // url, title
+ cmark_custom custom; // on_enter, on_exit
+ int html_block_type; // HTML block type (1-7)
+} as;
+```
+
+### Open Block Tracking
+
+During block parsing, open blocks are tracked via the `CMARK_NODE__OPEN` flag in `node->flags`. The parser maintains a `current` pointer to the deepest open block. When new blocks are created, they're added as children of the appropriate open container. When blocks are finalized (closed), control returns to the parent.
+
+### Reference Expansion Limiting
+
+To prevent superlinear growth from adversarial reference definitions, `parser->total_size` tracks total bytes fed. After finalization, `parser->refmap->max_ref_size` is set to `MAX(total_size, 100000)`, and each reference lookup deducts the reference's size from the available budget.
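
The budget logic can be sketched as follows. This is an illustration of the idea described above, not the code in `references.c`; `ref_budget`, `ref_budget_init`, `ref_budget_charge`, and `MIN_REF_BUDGET` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the bookkeeping kept in the reference map. */
typedef struct {
  size_t max_ref_size; /* total expansion allowed for this document */
  size_t ref_size;     /* expansion consumed so far */
} ref_budget;

#define MIN_REF_BUDGET 100000u /* small documents still get this much */

/* The budget is the larger of the input size and the fixed minimum. */
void ref_budget_init(ref_budget *b, size_t total_input_size) {
  b->max_ref_size =
      total_input_size > MIN_REF_BUDGET ? total_input_size : MIN_REF_BUDGET;
  b->ref_size = 0;
}

/* Charge one lookup of a reference that expands to `size` bytes.
 * Returns false, leaving the budget untouched, once it is exhausted. */
bool ref_budget_charge(ref_budget *b, size_t size) {
  if (size > b->max_ref_size - b->ref_size) return false;
  b->ref_size += size;
  return true;
}
```

Because the budget scales with the input, the total output contributed by reference expansion stays linear in the document size, however often a large definition is referenced.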
+
+## Error Handling
+
+cmark follows a defensive programming model:
+- NULL checks on all public API entry points (return 0 or NULL for invalid arguments)
+- `assert()` for internal invariants (active unless the build defines `NDEBUG`; the separate `CMARK_DEBUG_NODES` define enables additional node-tree consistency checks)
+- Abort-on-allocation-failure in the default memory allocator
+- No exceptions (pure C99)
+- Invalid UTF-8 sequences are replaced with U+FFFD (when `CMARK_OPT_VALIDATE_UTF8` is set)
+
+## Thread Safety
+
+cmark is **not** thread-safe for concurrent access to the same parser or node tree. However, separate parser instances and separate node trees can be used in parallel from different threads, as there is no global mutable state (the `DEFAULT_MEM_ALLOCATOR` is read-only after initialization).
+
+## Cross-References
+
+- [block-parsing.md](block-parsing.md) — Detailed block-level parsing logic
+- [inline-parsing.md](inline-parsing.md) — Delimiter and bracket stack algorithms
+- [ast-node-system.md](ast-node-system.md) — Node struct internals
+- [render-framework.md](render-framework.md) — Generic render engine
+- [memory-management.md](memory-management.md) — Allocator and buffer details
+- [iterator-system.md](iterator-system.md) — AST traversal mechanics
diff --git a/docs/handbook/cmark/ast-node-system.md b/docs/handbook/cmark/ast-node-system.md
new file mode 100644
index 0000000000..3d25415eda
--- /dev/null
+++ b/docs/handbook/cmark/ast-node-system.md
@@ -0,0 +1,383 @@
+# cmark — AST Node System
+
+## Overview
+
+The AST (Abstract Syntax Tree) node system is defined across `node.h` (internal struct definitions) and `node.c` (node creation, destruction, accessor functions, and tree manipulation). Every element in a parsed CommonMark document is represented as a `cmark_node`. Nodes form a tree via parent/child/sibling pointers, with type-specific data stored in a discriminated union.
+
+## The `cmark_node` Struct
+
+Defined in `node.h`, this is the central data structure of the entire library:
+
+```c
+struct cmark_node {
+ cmark_mem *mem; // Memory allocator used for this node
+
+ struct cmark_node *next; // Next sibling
+ struct cmark_node *prev; // Previous sibling
+ struct cmark_node *parent; // Parent node
+ struct cmark_node *first_child; // First child
+ struct cmark_node *last_child; // Last child
+
+ void *user_data; // Arbitrary user-attached data
+
+ unsigned char *data; // String content (for text, code, HTML)
+ bufsize_t len; // Length of data
+
+ int start_line; // Source position: starting line (1-based)
+ int start_column; // Source position: starting column (1-based)
+ int end_line; // Source position: ending line
+ int end_column; // Source position: ending column
+ uint16_t type; // Node type (cmark_node_type enum value)
+ uint16_t flags; // Internal flags (open, last-line-blank, etc.)
+
+ union {
+ cmark_list list; // List-specific data
+ cmark_code code; // Code block-specific data
+ cmark_heading heading; // Heading-specific data
+ cmark_link link; // Link/image-specific data
+ cmark_custom custom; // Custom block/inline data
+ int html_block_type; // HTML block type (1-7)
+ } as;
+};
+```
+
+The union `as` means each node only occupies memory for one type-specific payload, keeping the struct compact. The largest union member determines the size of the union, and therefore how much of every node is reserved for type-specific data.
+
+## Type-Specific Structs
+
+### `cmark_list` — List Properties
+
+```c
+typedef struct {
+ int marker_offset; // Indentation of list marker from left margin
+ int padding; // Total indentation (marker + content offset)
+ int start; // Starting number for ordered lists (0 for bullet)
+ unsigned char list_type; // CMARK_BULLET_LIST or CMARK_ORDERED_LIST
+ unsigned char delimiter; // CMARK_PERIOD_DELIM, CMARK_PAREN_DELIM, or CMARK_NO_DELIM
+ unsigned char bullet_char;// '*', '-', or '+' for bullet lists
+ bool tight; // Whether the list is tight (no blank lines between items)
+} cmark_list;
+```
+
+`marker_offset` and `padding` are used during block parsing to track indentation levels for list continuation. The `tight` flag is determined during block finalization by checking whether blank lines appear between list items or their children.
+
+### `cmark_code` — Code Block Properties
+
+```c
+typedef struct {
+ unsigned char *info; // Info string (language hint, e.g., "python")
+ uint8_t fence_length; // Length of opening fence (3+ backticks or tildes)
+ uint8_t fence_offset; // Indentation of fence from left margin
+ unsigned char fence_char; // '`' or '~'
+ int8_t fenced; // Whether this is a fenced code block (vs. indented)
+} cmark_code;
+```
+
+For indented code blocks, `fenced` is 0, and `info`, `fence_length`, `fence_char`, and `fence_offset` are unused. For fenced code blocks, `info` is extracted from the first line of the opening fence and stored as a separately allocated string.
+
+### `cmark_heading` — Heading Properties
+
+```c
+typedef struct {
+ int internal_offset; // Internal offset within the heading content
+ int8_t level; // Heading level (1-6)
+ bool setext; // Whether this is a setext-style heading (underlined)
+} cmark_heading;
+```
+
+ATX headings (`# Heading`) have `setext = false`. Setext headings (underlined with `=` or `-`) have `setext = true`. Both styles use the same `level` field, which defaults to 1 when a heading node is created.
+
+### `cmark_link` — Link and Image Properties
+
+```c
+typedef struct {
+ unsigned char *url; // Destination URL (separately allocated)
+ unsigned char *title; // Optional title text (separately allocated)
+} cmark_link;
+```
+
+Both `url` and `title` are separately allocated strings that must be freed when the node is destroyed. This struct is used for both `CMARK_NODE_LINK` and `CMARK_NODE_IMAGE`.
+
+### `cmark_custom` — Custom Block/Inline Properties
+
+```c
+typedef struct {
+ unsigned char *on_enter; // Literal text rendered when entering the node
+ unsigned char *on_exit; // Literal text rendered when leaving the node
+} cmark_custom;
+```
+
+Custom nodes allow embedding arbitrary content in the AST for extensions. Both strings are separately allocated.
+
+## Internal Flags
+
+The `flags` field uses bit flags defined in the `cmark_node__internal_flags` enum:
+
+```c
+enum cmark_node__internal_flags {
+ CMARK_NODE__OPEN = (1 << 0), // Block is still open (accepting content)
+ CMARK_NODE__LAST_LINE_BLANK = (1 << 1), // Last line of this block was blank
+ CMARK_NODE__LAST_LINE_CHECKED = (1 << 2), // blank-line status has been computed
+ CMARK_NODE__LIST_LAST_LINE_BLANK = (1 << 3), // (unused/reserved)
+};
+```
+
+- **`CMARK_NODE__OPEN`**: Set when a block is created during parsing. Cleared by `finalize()` when the block is closed. The parser's `current` pointer always points to a node with this flag set.
+- **`CMARK_NODE__LAST_LINE_BLANK`**: Set/cleared by `S_set_last_line_blank()` in `blocks.c` to track whether the most recent line added to this block was blank. Used for determining list tightness.
+- **`CMARK_NODE__LAST_LINE_CHECKED`**: Prevents redundant traversal when checking `S_ends_with_blank_line()`, which recursively descends into list items.
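
The set, clear, and test operations on such a flag word follow the usual bit-mask pattern. The sketch below is a standalone illustration with hypothetical names (`mini_node`, `FLAG_*`, and the three helpers); it mirrors the layout above but is not the code in `blocks.c`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values that mirror the layout of the enum above. */
enum {
  FLAG_OPEN = (1 << 0),
  FLAG_LAST_LINE_BLANK = (1 << 1),
  FLAG_LAST_LINE_CHECKED = (1 << 2)
};

/* Minimal stand-in for a node: only the flag word matters here. */
typedef struct {
  uint16_t flags;
} mini_node;

/* Set or clear one bit without disturbing the others. */
void set_last_line_blank(mini_node *n, bool is_blank) {
  if (is_blank)
    n->flags |= FLAG_LAST_LINE_BLANK;
  else
    n->flags &= (uint16_t)~FLAG_LAST_LINE_BLANK;
}

/* Test a single bit. */
bool is_open(const mini_node *n) { return (n->flags & FLAG_OPEN) != 0; }

/* Clear the open bit, as happens when a block is finalized. */
void close_block(mini_node *n) { n->flags &= (uint16_t)~FLAG_OPEN; }
```

Packing the flags into one 16-bit word keeps them next to the 16-bit `type` field, so the per-node bookkeeping costs no extra space.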
+
+## Node Creation
+
+### `cmark_node_new_with_mem()`
+
+The primary creation function (in `node.c`):
+
+```c
+cmark_node *cmark_node_new_with_mem(cmark_node_type type, cmark_mem *mem) {
+ cmark_node *node = (cmark_node *)mem->calloc(1, sizeof(*node));
+ node->mem = mem;
+ node->type = (uint16_t)type;
+
+ switch (node->type) {
+ case CMARK_NODE_HEADING:
+ node->as.heading.level = 1;
+ break;
+ case CMARK_NODE_LIST: {
+ cmark_list *list = &node->as.list;
+ list->list_type = CMARK_BULLET_LIST;
+ list->start = 0;
+ list->tight = false;
+ break;
+ }
+ default:
+ break;
+ }
+
+ return node;
+}
+```
+
+The `calloc()` zeroes all fields, so pointers start as NULL and numeric fields as 0. Only heading and list nodes need explicit default initialization.
+
+### `make_block()` — Parser-Internal Creation
+
+During block parsing, `make_block()` in `blocks.c` creates nodes with source position and the `CMARK_NODE__OPEN` flag:
+
+```c
+static cmark_node *make_block(cmark_mem *mem, cmark_node_type tag,
+ int start_line, int start_column) {
+ cmark_node *e;
+ e = (cmark_node *)mem->calloc(1, sizeof(*e));
+ e->mem = mem;
+ e->type = (uint16_t)tag;
+ e->flags = CMARK_NODE__OPEN;
+ e->start_line = start_line;
+ e->start_column = start_column;
+ e->end_line = start_line;
+ return e;
+}
+```
+
+### Inline Node Creation
+
+The inline parser in `inlines.c` uses two factory functions:
+
+```c
+// Create an inline with string content (text, code, HTML)
+static inline cmark_node *make_literal(subject *subj, cmark_node_type t,
+ int start_column, int end_column) {
+ cmark_node *e = (cmark_node *)subj->mem->calloc(1, sizeof(*e));
+ e->mem = subj->mem;
+ e->type = (uint16_t)t;
+ e->start_line = e->end_line = subj->line;
+ e->start_column = start_column + 1 + subj->column_offset + subj->block_offset;
+ e->end_column = end_column + 1 + subj->column_offset + subj->block_offset;
+ return e;
+}
+
+// Create an inline with no value (emphasis, strong, etc.)
+static inline cmark_node *make_simple(cmark_mem *mem, cmark_node_type t) {
+ cmark_node *e = (cmark_node *)mem->calloc(1, sizeof(*e));
+ e->mem = mem;
+ e->type = t;
+ return e;
+}
+```
+
+## Node Destruction
+
+### `S_free_nodes()` — Iterative Subtree Freeing
+
+The `S_free_nodes()` function in `node.c` avoids recursion by splicing children into a flat linked list:
+
+```c
+static void S_free_nodes(cmark_node *e) {
+ cmark_mem *mem = e->mem;
+ cmark_node *next;
+ while (e != NULL) {
+ switch (e->type) {
+ case CMARK_NODE_CODE_BLOCK:
+ mem->free(e->data);
+ mem->free(e->as.code.info);
+ break;
+ case CMARK_NODE_TEXT:
+ case CMARK_NODE_HTML_INLINE:
+ case CMARK_NODE_CODE:
+ case CMARK_NODE_HTML_BLOCK:
+ mem->free(e->data);
+ break;
+ case CMARK_NODE_LINK:
+ case CMARK_NODE_IMAGE:
+ mem->free(e->as.link.url);
+ mem->free(e->as.link.title);
+ break;
+ case CMARK_NODE_CUSTOM_BLOCK:
+ case CMARK_NODE_CUSTOM_INLINE:
+ mem->free(e->as.custom.on_enter);
+ mem->free(e->as.custom.on_exit);
+ break;
+ default:
+ break;
+ }
+ if (e->last_child) {
+ // Splice children into list for flat iteration
+ e->last_child->next = e->next;
+ e->next = e->first_child;
+ }
+ next = e->next;
+ mem->free(e);
+ e = next;
+ }
+}
+```
+
+This splicing technique converts the tree into a flat list, allowing O(n) iterative freeing without a recursion stack. For each node with children, the children are prepended to the remaining list by connecting `last_child->next` to `e->next` and `e->next` to `first_child`.
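
The splicing step is easier to follow on a stripped-down tree. The sketch below uses a hypothetical `tnode` type with only the three pointers the technique needs, and returns the number of nodes freed so the traversal can be checked; it assumes the root has already been unlinked from any siblings.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical minimal tree node: just the links the technique uses. */
typedef struct tnode {
  struct tnode *next;
  struct tnode *first_child;
  struct tnode *last_child;
} tnode;

/* Allocate a zeroed node, aborting on allocation failure. */
tnode *tnode_new(void) {
  tnode *n = calloc(1, sizeof(*n));
  if (n == NULL) abort();
  return n;
}

/* Append child as the last child of parent. */
void tnode_append(tnode *parent, tnode *child) {
  if (parent->last_child)
    parent->last_child->next = child;
  else
    parent->first_child = child;
  parent->last_child = child;
}

/* Free e and everything below it without recursion.
 * Returns the number of nodes freed. */
size_t tnode_free_all(tnode *e) {
  size_t freed = 0;
  while (e != NULL) {
    if (e->last_child) {
      /* Splice the child list in front of the remaining work list. */
      e->last_child->next = e->next;
      e->next = e->first_child;
    }
    tnode *next = e->next;
    free(e);
    freed++;
    e = next;
  }
  return freed;
}
```

Each node is visited exactly once, and the only extra state is the `next` pointer that is about to be discarded anyway, so deeply nested documents cannot overflow the call stack during cleanup.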
+
+## Containership Rules
+
+The `S_can_contain()` function in `node.c` enforces which node types can contain which children:
+
+```c
+static bool S_can_contain(cmark_node *node, cmark_node *child) {
+ // Ancestor loop detection
+ if (child->first_child != NULL) {
+ cmark_node *cur = node->parent;
+ while (cur != NULL) {
+ if (cur == child) return false;
+ cur = cur->parent;
+ }
+ }
+
+ // Documents cannot be children
+ if (child->type == CMARK_NODE_DOCUMENT) return false;
+
+ switch (node->type) {
+ case CMARK_NODE_DOCUMENT:
+ case CMARK_NODE_BLOCK_QUOTE:
+ case CMARK_NODE_ITEM:
+ return cmark_node_is_block(child) && child->type != CMARK_NODE_ITEM;
+
+ case CMARK_NODE_LIST:
+ return child->type == CMARK_NODE_ITEM;
+
+ case CMARK_NODE_CUSTOM_BLOCK:
+ return true; // Custom blocks can contain anything
+
+ case CMARK_NODE_PARAGRAPH:
+ case CMARK_NODE_HEADING:
+ case CMARK_NODE_EMPH:
+ case CMARK_NODE_STRONG:
+ case CMARK_NODE_LINK:
+ case CMARK_NODE_IMAGE:
+ case CMARK_NODE_CUSTOM_INLINE:
+ return cmark_node_is_inline(child);
+
+ default:
+ break;
+ }
+ return false;
+}
+```
+
+Key rules:
+- **Document, block quote, list item**: Can contain any block except items
+- **List**: Can only contain items
+- **Custom block**: Can contain any node except a document, subject to the ancestor-loop check
+- **Paragraph, heading, emphasis, strong, link, image, custom inline**: Can only contain inline nodes
+- **Leaf blocks** (thematic break, code block, HTML block): Cannot contain anything
+
+## Tree Manipulation
+
+### Unlinking
+
+The internal `S_node_unlink()` function detaches a node from its parent and siblings:
+
+```c
+static void S_node_unlink(cmark_node *node) {
+ if (node->prev) {
+ node->prev->next = node->next;
+ }
+ if (node->next) {
+ node->next->prev = node->prev;
+ }
+ // Update parent's first_child / last_child pointers
+ if (node->parent) {
+ if (node->parent->first_child == node)
+ node->parent->first_child = node->next;
+ if (node->parent->last_child == node)
+ node->parent->last_child = node->prev;
+ }
+ node->next = NULL;
+ node->prev = NULL;
+ node->parent = NULL;
+}
+```
+
+### String Setting Helper
+
+The `cmark_set_cstr()` function manages string assignment with proper memory handling:
+
+```c
+static bufsize_t cmark_set_cstr(cmark_mem *mem, unsigned char **dst,
+ const char *src) {
+ unsigned char *old = *dst;
+ bufsize_t len;
+ if (src && src[0]) {
+ len = (bufsize_t)strlen(src);
+ *dst = (unsigned char *)mem->realloc(NULL, len + 1);
+ memcpy(*dst, src, len + 1);
+ } else {
+ len = 0;
+ *dst = NULL;
+ }
+ if (old) {
+ mem->free(old);
+ }
+ return len;
+}
+```
+
+This function allocates a new copy of the source string, assigns it, then frees the old value — ensuring no memory leaks even when overwriting existing data.
+
+## Node Data Storage Pattern
+
+Nodes store their content in one of three patterns, depending on type:
+
+1. **Direct storage** (`data` + `len`): Used by `CMARK_NODE_TEXT`, `CMARK_NODE_CODE`, `CMARK_NODE_CODE_BLOCK`, `CMARK_NODE_HTML_BLOCK`, and `CMARK_NODE_HTML_INLINE`. The `data` field points to a separately allocated buffer containing the text content.
+
+2. **Union storage** (`as.*`): Used by lists, code blocks (for the info string), headings, links/images, and custom nodes. These store structured data rather than raw text.
+
+3. **Hybrid**: `CMARK_NODE_CODE_BLOCK` uses both — `data` for the code content and `as.code.info` for the info string.
+
+## The `cmark_node_check()` Function
+
+For debug builds, `cmark_node_check()` validates the structural integrity of the tree. It checks that parent/child/sibling pointers are consistent and that the tree forms a valid structure. It returns the number of errors found and prints details to the provided `FILE*`.
+
+## Cross-References
+
+- [node.h](../../../cmark/src/node.h) — Struct definitions
+- [node.c](../../../cmark/src/node.c) — Implementation
+- [iterator-system.md](iterator-system.md) — How nodes are traversed
+- [block-parsing.md](block-parsing.md) — How block nodes are created during parsing
+- [inline-parsing.md](inline-parsing.md) — How inline nodes are created
+- [memory-management.md](memory-management.md) — Allocator integration
diff --git a/docs/handbook/cmark/block-parsing.md b/docs/handbook/cmark/block-parsing.md
new file mode 100644
index 0000000000..2c9efecd50
--- /dev/null
+++ b/docs/handbook/cmark/block-parsing.md
@@ -0,0 +1,310 @@
+# cmark — Block Parsing
+
+## Overview
+
+Block parsing is Phase 1 of cmark's two-phase parsing pipeline. Implemented in `blocks.c`, it processes the input line-by-line, identifying block-level document structure: paragraphs, headings, code blocks, block quotes, lists, thematic breaks, and HTML blocks. The result is a tree of `cmark_node` block nodes with accumulated text content. Inline parsing occurs in Phase 2.
+
+The algorithm follows the CommonMark specification's description at `http://spec.commonmark.org/0.24/#phase-1-block-structure`.
+
+## Key Constants
+
+```c
+#define CODE_INDENT 4 // Spaces required for indented code block
+#define TAB_STOP 4 // Tab stop width for column calculation
+```
+
+## Parser State
+
+The parser state is maintained in the `cmark_parser` struct (from `parser.h`). During line processing, these fields track the current position:
+
+- `offset` — byte position in the current line
+- `column` — virtual column number (tabs expanded to `TAB_STOP` boundaries)
+- `first_nonspace` — byte position of first non-whitespace character
+- `first_nonspace_column` — column of first non-whitespace character
+- `indent` — the difference `first_nonspace_column - column`, representing effective indentation
+- `blank` — whether the line is blank (only whitespace before line end)
+- `partially_consumed_tab` — set when a tab is only partially used for indentation
+
+## Input Feeding: `S_parser_feed()`
+
+The entry point for input is `S_parser_feed()`, which splits raw input into lines:
+
+```c
+static void S_parser_feed(cmark_parser *parser, const unsigned char *buffer,
+ size_t len, bool eof);
+```
+
+### Line Splitting Logic
+
+The function scans for line-ending characters (`\n`, `\r`) and processes complete lines via `S_process_line()`. Partial lines are accumulated in `parser->linebuf`.
+
+Key handling:
+1. **UTF-8 BOM**: Skipped if found at the start of the first line (3-byte sequence `0xEF 0xBB 0xBF`).
+2. **CR/LF across buffer boundaries**: If the previous buffer ended with `\r` and the next starts with `\n`, the `\n` is skipped.
+3. **NUL bytes**: Replaced with the UTF-8 replacement character (U+FFFD, `0xEF 0xBF 0xBD`).
+4. **Total size tracking**: `parser->total_size` accumulates bytes fed, capped at `UINT_MAX`, used later for reference expansion limiting.
+
+### Line Termination
+
+Each line is terminated at `\n`, `\r`, or `\r\n`. The line content passed to `S_process_line()` does NOT include the line-ending characters themselves.
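
A minimal sketch of this splitting rule, assuming the whole input is available in one buffer, is shown below. The names `line_span` and `split_lines` are hypothetical; unlike `S_parser_feed()`, the sketch does not handle a `\r\n` pair split across two feeds, BOMs, or NUL bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one line: offset and length, line ending excluded. */
typedef struct {
  size_t start;
  size_t len;
} line_span;

/* Split buf into lines terminated by "\n", "\r", or "\r\n".
 * Stores up to max_out spans in out and returns the number of lines found. */
size_t split_lines(const char *buf, size_t len, line_span *out,
                   size_t max_out) {
  size_t count = 0;
  size_t start = 0;
  while (start < len) {
    size_t eol = start;
    while (eol < len && buf[eol] != '\n' && buf[eol] != '\r') eol++;
    if (count < max_out) {
      out[count].start = start;
      out[count].len = eol - start;
    }
    count++;
    /* Consume exactly one line ending: "\r\n", "\r", or "\n". */
    if (eol < len && buf[eol] == '\r') eol++;
    if (eol < len && buf[eol] == '\n') eol++;
    start = eol;
  }
  return count;
}
```

Treating `\r\n` as a single terminator is what keeps Windows-style input from producing an empty line after every real one.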
+
+## Line Processing: `S_process_line()`
+
+The main per-line processing function. For each line, it:
+
+1. Stores the line in `parser->curline`
+2. Creates a `cmark_chunk` wrapper for the line data
+3. Increments `parser->line_number`
+4. Calls `check_open_blocks()` to match existing containers
+5. Attempts to open new containers and leaf blocks
+6. Adds line content to the appropriate block
+
+### Step 1: Check Open Blocks
+
+```c
+static cmark_node *check_open_blocks(cmark_parser *parser, cmark_chunk *input,
+ bool *all_matched);
+```
+
+Starting from the document root, this walks through the tree of open blocks (following `last_child` pointers). For each open container, it tries to match the expected line prefix.
+
+The matching rules for each container type:
+
+#### Block Quote
+```c
+static bool parse_block_quote_prefix(cmark_parser *parser, cmark_chunk *input);
+```
+Expects `>` preceded by up to 3 spaces of indentation. After matching the `>`, optionally consumes one space or tab after it.
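
The prefix rule can be sketched on a plain byte string. `match_block_quote_prefix` below is a hypothetical helper, not cmark's `parse_block_quote_prefix()`: it counts bytes rather than columns, accepts only spaces as indentation, and consumes a following tab whole, whereas cmark tracks partially consumed tabs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if line starts with a block quote marker.
 * On success, *consumed is the number of bytes that belong to the prefix. */
bool match_block_quote_prefix(const char *line, size_t len, size_t *consumed) {
  size_t i = 0;
  /* Up to 3 spaces of indentation are allowed before the marker. */
  while (i < len && i < 3 && line[i] == ' ') i++;
  if (i >= len || line[i] != '>') return false;
  i++;
  /* One optional space or tab after the marker belongs to the prefix. */
  if (i < len && (line[i] == ' ' || line[i] == '\t')) i++;
  *consumed = i;
  return true;
}
```

With four leading spaces the marker is no longer recognized, because that much indentation is the territory of indented code blocks.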
+
+#### List Item
+```c
+static bool parse_node_item_prefix(cmark_parser *parser, cmark_chunk *input,
+ cmark_node *container);
+```
+Expects indentation of at least `marker_offset + padding` columns. If the line is blank and the item already has at least one child, the item also continues; a blank line directly after a bare list marker ends the item instead.
+
+#### Fenced Code Block
+```c
+static bool parse_code_block_prefix(cmark_parser *parser, cmark_chunk *input,
+ cmark_node *container, bool *should_continue);
+```
+For fenced code blocks: checks if the line is a closing fence (same fence char, length >= opening fence length, preceded by up to 3 spaces). If it is, the block is finalized. Otherwise, skips up to `fence_offset` spaces and continues.
+
+For indented code blocks: requires 4+ spaces of indentation, or a blank line.
+
+#### HTML Block
+```c
+static bool parse_html_block_prefix(cmark_parser *parser, cmark_node *container);
+```
+HTML block types 1-5 accept blank lines (continue until end condition is met). Types 6-7 do NOT accept blank lines.
+
+### Step 2: New Container Starts
+
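+Once the existing containers have been checked, the parser examines the remaining portion of the line for new container starts. This check runs whether or not every open block matched; the `all_matched` flag is passed along so that later decisions can tell if unmatched open blocks remain: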
+
+- **Block quote**: Line starts with `>` (preceded by up to 3 spaces)
+- **List item**: Line starts with a list marker (bullet character or ordered number + delimiter)
+
+### Step 3: New Leaf Blocks
+
+The parser checks for new leaf block starts using scanner functions:
+
+- **ATX heading**: `scan_atx_heading_start()` — lines starting with 1-6 `#` characters
+- **Fenced code block**: `scan_open_code_fence()` — 3+ backticks or tildes
+- **HTML block**: `scan_html_block_start()` and `scan_html_block_start_7()` — 7 different HTML start patterns
+- **Setext heading**: `scan_setext_heading_line()` — underlines of `=` or `-` (only when following a paragraph)
+- **Thematic break**: `S_scan_thematic_break()` — 3+ `*`, `-`, or `_` characters
+
+### Step 4: Content Accumulation
+
+For blocks that accept lines (`accepts_lines()` returns true for paragraphs, headings, and code blocks), the line content is appended to `parser->content` via `add_line()`:
+
+```c
+static void add_line(cmark_chunk *ch, cmark_parser *parser) {
+ int chars_to_tab;
+ int i;
+ if (parser->partially_consumed_tab) {
+ parser->offset += 1; // skip over tab
+ chars_to_tab = TAB_STOP - (parser->column % TAB_STOP);
+ for (i = 0; i < chars_to_tab; i++) {
+ cmark_strbuf_putc(&parser->content, ' ');
+ }
+ }
+ cmark_strbuf_put(&parser->content, ch->data + parser->offset,
+ ch->len - parser->offset);
+}
+```
+
+When a tab is only partially consumed (e.g., the tab represents 4 columns but only 1 was needed for indentation), the remaining columns are emitted as spaces.
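The arithmetic can be checked in isolation. The sketch below assumes a tab stop of 4, as in the example above, and wraps the expression from `add_line()` in a hypothetical standalone function.

```c
#include <assert.h>

#define TAB_STOP 4

/* Number of columns from `column` up to the next tab stop. When a tab has
 * been partially consumed, `column` is the position reached so far, so the
 * result is the number of spaces still to emit. */
static int chars_to_tab(int column) {
  assert(column >= 0);
  return TAB_STOP - (column % TAB_STOP);
}
```

For a tab starting at column 0 with one column already used as indentation, `chars_to_tab(1)` yields 3, so three spaces are written before the rest of the line.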
+
+## Adding Child Blocks
+
+```c
+static cmark_node *add_child(cmark_parser *parser, cmark_node *parent,
+ cmark_node_type block_type, int start_column);
+```
+
+When a new block is detected, `add_child()` creates it:
+
+1. If the parent can't contain the new block type (checked via `can_contain()`), the parent is finalized and the function moves up the tree until it finds a suitable ancestor.
+2. A new node is created with `make_block()` (which sets `CMARK_NODE__OPEN`).
+3. The node is linked as the last child of the parent.
+
+### Container Acceptance Rules
+
+```c
+static inline bool can_contain(cmark_node_type parent_type,
+ cmark_node_type child_type) {
+ return (parent_type == CMARK_NODE_DOCUMENT ||
+ parent_type == CMARK_NODE_BLOCK_QUOTE ||
+ parent_type == CMARK_NODE_ITEM ||
+ (parent_type == CMARK_NODE_LIST && child_type == CMARK_NODE_ITEM));
+}
+```
+
+Only documents, block quotes, list items, and lists (for items only) can contain other blocks.
+
+## List Item Parsing
+
+```c
+static bufsize_t parse_list_marker(cmark_mem *mem, cmark_chunk *input,
+ bufsize_t pos, bool interrupts_paragraph,
+ cmark_list **dataptr);
+```
+
+This function detects list markers:
+
+**Bullet markers**: `*`, `-`, or `+` followed by whitespace.
+
+**Ordered markers**: Up to 9 digits followed by `.` or `)` and whitespace. The 9-digit limit prevents integer overflow (max value ~999,999,999 fits in a 32-bit int).
+
+**Paragraph interruption rules**: When `interrupts_paragraph` is true (the marker would interrupt a preceding paragraph):
+- Bullet markers require non-blank content after them
+- Ordered markers must start at 1
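The ordered-marker rules can be sketched as below. `ordered_marker_len` is a hypothetical helper, not the signature of `parse_list_marker()`. It assumes the line excludes its line ending, so end of input after the delimiter counts as whitespace, and it does not model the bullet rules.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: returns the byte length of an ordered list marker
 * (digits plus delimiter) at the start of `line`, or 0 if there is none.
 * On success the start number is stored in *start. */
static size_t ordered_marker_len(const char *line, size_t len,
                                 bool interrupts_paragraph, int *start) {
  size_t i = 0;
  int value = 0;
  assert(line != NULL && start != NULL);
  /* At most 9 digits, so the value cannot overflow a 32-bit int. */
  while (i < len && i < 9 && isdigit((unsigned char)line[i])) {
    value = value * 10 + (line[i] - '0');
    i++;
  }
  if (i == 0 || i >= len)
    return 0;
  if (line[i] != '.' && line[i] != ')')
    return 0;
  i++;
  /* The delimiter must be followed by whitespace or the end of the line. */
  if (i < len && line[i] != ' ' && line[i] != '\t')
    return 0;
  /* A list that interrupts a paragraph must start at 1. */
  if (interrupts_paragraph && value != 1)
    return 0;
  *start = value;
  return i;
}
```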
+
+### List Matching
+
+```c
+static int lists_match(cmark_list *list_data, cmark_list *item_data) {
+ return (list_data->list_type == item_data->list_type &&
+ list_data->delimiter == item_data->delimiter &&
+ list_data->bullet_char == item_data->bullet_char);
+}
+```
+
+Two list items belong to the same list only if they share the same list type, delimiter style, and bullet character. This means `- item` and `* item` create separate lists.
+
+## Offset Advancement
+
+```c
+static void S_advance_offset(cmark_parser *parser, cmark_chunk *input,
+ bufsize_t count, bool columns);
+```
+
+This function advances `parser->offset` and `parser->column`. The `columns` parameter determines whether `count` measures bytes or virtual columns. Tab expansion is handled here:
+- When counting columns and a tab appears, `chars_to_tab = TAB_STOP - (column % TAB_STOP)` determines how many columns the tab represents
+- If only part of the tab is consumed (advancing fewer columns than the tab provides), `parser->partially_consumed_tab` is set
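A reduced model of the column-counting case is sketched below. The `toy_pos` struct and `advance_columns` function are hypothetical stand-ins for the relevant parser fields and cover only `columns == true`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TAB_STOP 4

/* Hypothetical stand-in for the parser fields involved. */
typedef struct {
  size_t offset;               /* byte position in the line */
  int column;                  /* virtual column */
  bool partially_consumed_tab; /* set when a tab is only partly used */
} toy_pos;

/* Advance by `count` virtual columns, expanding tabs to the next tab stop. */
static void advance_columns(toy_pos *p, const char *line, size_t len,
                            int count) {
  assert(p != NULL && line != NULL);
  while (count > 0 && p->offset < len) {
    if (line[p->offset] == '\t') {
      int chars_to_tab = TAB_STOP - (p->column % TAB_STOP);
      int advance = chars_to_tab < count ? chars_to_tab : count;
      /* The tab is partially consumed if it spans more columns than needed. */
      p->partially_consumed_tab = chars_to_tab > count;
      p->column += advance;
      if (!p->partially_consumed_tab)
        p->offset++; /* step past the tab only once it is fully used */
      count -= advance;
    } else {
      p->partially_consumed_tab = false;
      p->offset++;
      p->column++;
      count--;
    }
  }
}
```

Advancing one column into a line that starts with a tab leaves `offset` at 0 and sets the flag; advancing the remaining three columns clears the flag and moves past the tab.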
+
+## Finding First Non-Space
+
+```c
+static void S_find_first_nonspace(cmark_parser *parser, cmark_chunk *input);
+```
+
+Scans from `parser->offset` forward, setting:
+- `parser->first_nonspace` — byte position
+- `parser->first_nonspace_column` — column of first non-whitespace
+- `parser->indent` — `first_nonspace_column - column`
+- `parser->blank` — whether the line is blank
+
+This function is idempotent — it won't re-scan if `first_nonspace > offset`.
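The four values can be computed as in the sketch below. It returns them in a hypothetical `toy_scan` struct instead of writing to parser fields, assumes a tab stop of 4, and leaves out the re-scan guard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TAB_STOP 4

/* Hypothetical result type mirroring the parser fields listed above. */
typedef struct {
  size_t first_nonspace;
  int first_nonspace_column;
  int indent;
  bool blank;
} toy_scan;

/* Scan forward from (offset, column) over spaces and tabs. */
static toy_scan find_first_nonspace(const char *line, size_t len,
                                    size_t offset, int column) {
  toy_scan r;
  size_t i = offset;
  int col = column;
  assert(line != NULL && offset <= len);
  while (i < len) {
    if (line[i] == ' ')
      col += 1;
    else if (line[i] == '\t')
      col += TAB_STOP - (col % TAB_STOP); /* jump to the next tab stop */
    else
      break;
    i++;
  }
  r.first_nonspace = i;
  r.first_nonspace_column = col;
  r.indent = col - column;
  /* Blank: nothing but whitespace up to the end or a line ending. */
  r.blank = (i == len) || line[i] == '\n' || line[i] == '\r';
  return r;
}
```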
+
+## Thematic Break Detection
+
+```c
+static int S_scan_thematic_break(cmark_parser *parser, cmark_chunk *input,
+ bufsize_t offset);
+```
+
+Checks for 3 or more `*`, `_`, or `-` characters (optionally separated by spaces/tabs) on a line by themselves. Uses `parser->thematic_break_kill_pos` as an optimization to avoid re-scanning positions that already failed.
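A simplified version of the scan is sketched below. It is not the cmark function: `is_thematic_break` is a hypothetical name, the line is assumed to start at its first non-space character and to exclude the line ending, and the `thematic_break_kill_pos` optimization is left out. Per the CommonMark rule, all marker characters must be the same.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: true if `line` consists of 3 or more copies of one
 * of '*', '_' or '-', optionally separated by spaces or tabs. */
static bool is_thematic_break(const char *line, size_t len) {
  size_t i;
  size_t count = 0;
  char c;
  assert(line != NULL);
  if (len == 0)
    return false;
  c = line[0];
  if (c != '*' && c != '_' && c != '-')
    return false;
  for (i = 0; i < len; i++) {
    if (line[i] == c)
      count++;
    else if (line[i] != ' ' && line[i] != '\t')
      return false; /* any other character disqualifies the line */
  }
  return count >= 3;
}
```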
+
+## ATX Heading Trailing Hash Removal
+
+```c
+static void chop_trailing_hashtags(cmark_chunk *ch);
+```
+
+After an ATX heading line is identified, trailing `#` characters are removed from the content if they're preceded by a space. This implements the CommonMark rule that `## Heading ##` renders as "Heading" without trailing `#` marks.
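The rule can be sketched on a plain buffer as follows. Both helper names are hypothetical, and the functions return a new length instead of modifying a `cmark_chunk`.

```c
#include <assert.h>
#include <stddef.h>

/* Length of s[0..len) with trailing spaces and tabs removed. */
static size_t rtrim_len(const char *s, size_t len) {
  while (len > 0 && (s[len - 1] == ' ' || s[len - 1] == '\t'))
    len--;
  return len;
}

/* Hypothetical sketch: returns the length of s after removing a trailing
 * run of '#' characters, but only when that run is preceded by a space or
 * tab. Otherwise only trailing whitespace is trimmed. */
static size_t chop_trailing_hashes(const char *s, size_t len) {
  size_t n;
  assert(s != NULL);
  len = rtrim_len(s, len);
  n = len;
  while (n > 0 && s[n - 1] == '#')
    n--;
  if (n != len && n > 0 && (s[n - 1] == ' ' || s[n - 1] == '\t'))
    return rtrim_len(s, n);
  return len;
}
```

With the line `## Heading ##` the result is 10, which keeps `## Heading`. A line such as `## C#` is left alone because its final `#` is not preceded by whitespace.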
+
+## Block Finalization
+
+```c
+static cmark_node *finalize(cmark_parser *parser, cmark_node *b);
+```
+
+When a block is closed (no longer accepting content), `finalize()` processes its accumulated content:
+
+### Paragraph Finalization
+Reference link definitions at the start are extracted:
+```c
+static bool resolve_reference_link_definitions(cmark_parser *parser);
+```
+This repeatedly calls `cmark_parse_reference_inline()` from `inlines.c` to parse reference definitions like `[label]: url "title"`. If the paragraph becomes empty after extracting all references, the paragraph node is deleted.
+
+### Code Block Finalization
+- **Fenced**: The first line becomes the info string (after HTML unescaping and trimming). Remaining content becomes the code body.
+- **Indented**: Trailing blank lines are removed, and a final newline is appended.
+
+### Heading and HTML Block Finalization
+Content is simply detached from the parser's content buffer and stored in `data`.
+
+### List Finalization
+Determines tight/loose status by checking:
+1. Non-final, non-empty list items ending with a blank line → loose
+2. Children of list items that end with blank lines (checked recursively via `S_ends_with_blank_line()`) → loose
+3. Otherwise → tight
+
+## Document Finalization
+
+```c
+static cmark_node *finalize_document(cmark_parser *parser);
+```
+
+Called by `cmark_parser_finish()`:
+
+1. All open blocks are finalized by walking from `parser->current` up to `parser->root`.
+2. The root document is finalized.
+3. Reference expansion limit is set: `refmap->max_ref_size = MAX(parser->total_size, 100000)`.
+4. `process_inlines()` is called, which uses an iterator to find all nodes that contain inlines (paragraphs and headings) and calls `cmark_parse_inlines()` on each.
+5. After inline parsing, the content buffer of each processed node is freed.
+
+## Inline Content Detection
+
+```c
+static inline bool contains_inlines(cmark_node_type block_type) {
+ return (block_type == CMARK_NODE_PARAGRAPH ||
+ block_type == CMARK_NODE_HEADING);
+}
+```
+
+Only paragraphs and headings have their string content parsed for inline elements. Code blocks, HTML blocks, and other leaf nodes preserve their content as-is.
+
+## Lazy Continuation Lines
+
+The CommonMark spec defines "lazy continuation lines" — lines that continue a paragraph without matching all container prefixes. For example:
+
+```markdown
+> This is a block quote
+with a lazy continuation line
+```
+
+The second line doesn't start with `>` but still belongs to the paragraph inside the block quote. The parser handles this by checking whether the line could be added to an existing open paragraph rather than closing and starting a new one.
+
+## Cross-References
+
+- [parser.h](../../../cmark/src/parser.h) — Parser struct definition
+- [blocks.c](../../../cmark/src/blocks.c) — Full implementation
+- [inline-parsing.md](inline-parsing.md) — Phase 2 parsing
+- [scanner-system.md](scanner-system.md) — Scanner functions used for block detection
+- [reference-system.md](reference-system.md) — How reference definitions are extracted
+- [ast-node-system.md](ast-node-system.md) — Node creation and tree structure
diff --git a/docs/handbook/cmark/building.md b/docs/handbook/cmark/building.md
new file mode 100644
index 0000000000..56272af2be
--- /dev/null
+++ b/docs/handbook/cmark/building.md
@@ -0,0 +1,268 @@
+# cmark — Building
+
+## Build System Overview
+
+cmark uses CMake (minimum version 3.14) as its build system. The top-level `CMakeLists.txt` defines the project as C/CXX with version 0.31.2. It configures C99 standard without extensions, sets up export header generation, CTest integration, and subdirectory targets for the library, CLI tool, tests, man pages, and fuzz harness.
+
+## Prerequisites
+
+- A C99-compliant compiler (GCC, Clang, MSVC)
+- CMake 3.14 or later
+- POSIX environment (for man page generation; skipped on Windows)
+- Optional: re2c (only needed if modifying `scanners.re`)
+- Optional: Python 3 (for running spec tests)
+
+## Basic Build Steps
+
+```bash
+# Out-of-source build (required — in-source builds are explicitly blocked)
+mkdir build && cd build
+cmake ..
+make
+```
+
+The CMakeLists.txt enforces out-of-source builds with:
+
+```cmake
+if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_BINARY_DIR}")
+ message(FATAL_ERROR "Do not build in-source.\nPlease remove CMakeCache.txt and the CMakeFiles/ directory.\nThen: mkdir build ; cd build ; cmake .. ; make")
+endif()
+```
+
+## CMake Configuration Options
+
+### Library Type
+
+```cmake
+option(BUILD_SHARED_LIBS "Build the CMark library as shared" OFF)
+```
+
+By default, cmark builds as a **static library**. Set `-DBUILD_SHARED_LIBS=ON` for a shared library. When building as static, the compile definition `CMARK_STATIC_DEFINE` is automatically set.
+
+**Legacy options** (deprecated but still functional for backwards compatibility):
+- `CMARK_SHARED` — replaced by `BUILD_SHARED_LIBS`
+- `CMARK_STATIC` — replaced by `BUILD_SHARED_LIBS` (inverted logic)
+
+Both emit `AUTHOR_WARNING` messages advising migration to the standard CMake variable.
+
+### Fuzzing Support
+
+```cmake
+option(CMARK_LIB_FUZZER "Build libFuzzer fuzzing harness" OFF)
+```
+
+When enabled, targets matching `fuzz` get `-fsanitize=fuzzer`, while all other targets get `-fsanitize=fuzzer-no-link`.
+
+### Build Types
+
+The project supports these build types via `CMAKE_BUILD_TYPE`:
+
+| Type | Description |
+|------|-------------|
+| `Release` | Default. Optimized build |
+| `Debug` | Adds `-DCMARK_DEBUG_NODES` for node integrity checking via `assert()` |
+| `Profile` | Adds `-pg` for profiling with gprof |
+| `Asan` | Address sanitizer (loads `FindAsan` module) |
+| `Ubsan` | Adds `-fsanitize=undefined` for undefined behavior sanitizer |
+
+Debug builds automatically add node structure checking:
+
+```cmake
+add_compile_options($<$<CONFIG:Debug>:-DCMARK_DEBUG_NODES>)
+```
+
+## Compiler Flags
+
+The `cmark_add_compile_options()` function applies compiler warnings per-target (not globally), so cmark can be used as a subdirectory in projects with other languages:
+
+**GCC/Clang:**
+```
+-Wall -Wextra -pedantic -Wstrict-prototypes (C only)
+```
+
+**MSVC:**
+```
+-D_CRT_SECURE_NO_WARNINGS
+```
+
+Visibility is set globally to hidden, with explicit export via the generated `cmark_export.h`:
+
+```cmake
+set(CMAKE_C_VISIBILITY_PRESET hidden)
+set(CMAKE_VISIBILITY_INLINES_HIDDEN 1)
+```
+
+## Library Target: `cmark`
+
+Defined in `src/CMakeLists.txt`, the `cmark` library target includes these source files:
+
+```cmake
+add_library(cmark
+ blocks.c buffer.c cmark.c cmark_ctype.c
+ commonmark.c houdini_href_e.c houdini_html_e.c houdini_html_u.c
+ html.c inlines.c iterator.c latex.c
+ man.c node.c references.c render.c
+ scanners.c scanners.re utf8.c xml.c)
+```
+
+Target properties:
+```cmake
+set_target_properties(cmark PROPERTIES
+ OUTPUT_NAME "cmark"
+ PDB_NAME libcmark # Avoid PDB name clash with executable
+ POSITION_INDEPENDENT_CODE YES
+ SOVERSION ${PROJECT_VERSION} # Includes minor + patch in soname
+ VERSION ${PROJECT_VERSION})
+```
+
+The library exposes headers via its interface include directories:
+```cmake
+target_include_directories(cmark INTERFACE
+ $<INSTALL_INTERFACE:include>
+ $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}>
+ $<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}>)
+```
+
+The export header is generated automatically:
+```cmake
+generate_export_header(cmark BASE_NAME ${PROJECT_NAME})
+```
+
+This produces `cmark_export.h` containing `CMARK_EXPORT` macros that resolve to `__declspec(dllexport/dllimport)` on Windows or `__attribute__((visibility("default")))` on Unix.
+
+## Executable Target: `cmark_exe`
+
+```cmake
+add_executable(cmark_exe main.c)
+set_target_properties(cmark_exe PROPERTIES
+ OUTPUT_NAME "cmark"
+ INSTALL_RPATH "${Base_rpath}")
+target_link_libraries(cmark_exe PRIVATE cmark)
+```
+
+The executable has the same output name as the library (`cmark`), but the PDB names differ to avoid conflicts on Windows.
+
+## Generated Files
+
+Two files are generated at configure time:
+
+### `cmark_version.h`
+
+Generated from `cmark_version.h.in`:
+```cmake
+configure_file(cmark_version.h.in ${CMAKE_CURRENT_BINARY_DIR}/cmark_version.h)
+```
+
+Contains `CMARK_VERSION` (integer) and `CMARK_VERSION_STRING` (string) macros.
+
+### `libcmark.pc`
+
+Generated from `libcmark.pc.in` for pkg-config integration:
+```cmake
+configure_file(libcmark.pc.in ${CMAKE_CURRENT_BINARY_DIR}/libcmark.pc @ONLY)
+```
+
+## Test Infrastructure
+
+Tests are enabled via CMake's standard `BUILD_TESTING` option (defaults to ON):
+
+```cmake
+if(BUILD_TESTING)
+ add_subdirectory(api_test)
+ add_subdirectory(test testdir)
+endif()
+```
+
+### API Tests (`api_test/`)
+
+C-level API tests that exercise the public API functions directly — node creation, manipulation, parsing, rendering.
+
+### Spec Tests (`test/`)
+
+CommonMark specification conformance tests. These parse expected input/output pairs from the CommonMark spec and verify cmark produces the correct output.
+
+## RPATH Configuration
+
+For shared library builds, the install RPATH is set to the library directory:
+
+```cmake
+if(BUILD_SHARED_LIBS)
+ set(p "${CMAKE_INSTALL_FULL_LIBDIR}")
+ list(FIND CMAKE_PLATFORM_IMPLICIT_LINK_DIRECTORIES "${p}" i)
+ if("${i}" STREQUAL "-1")
+ set(Base_rpath "${p}")
+ endif()
+endif()
+set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)
+```
+
+This ensures the executable can find the shared library at runtime without requiring `LD_LIBRARY_PATH`.
+
+## Man Page Generation
+
+Man pages are built on non-Windows platforms:
+
+```cmake
+if(NOT CMAKE_SYSTEM_NAME STREQUAL Windows)
+ add_subdirectory(man)
+endif()
+```
+
+## Building for Fuzzing
+
+To build the libFuzzer harness:
+
+```bash
+mkdir build-fuzz && cd build-fuzz
+cmake -DCMARK_LIB_FUZZER=ON -DCMAKE_C_COMPILER=clang ..
+make
+```
+
+The fuzz targets are in the `fuzz/` subdirectory.
+
+## Platform-Specific Notes
+
+### OpenBSD
+
+The CLI tool uses `pledge(2)` on OpenBSD 6.0+ for sandboxing:
+```c
+#if defined(__OpenBSD__)
+# include <sys/param.h>
+# if OpenBSD >= 201605
+# define USE_PLEDGE
+# include <unistd.h>
+# endif
+#endif
+```
+
+The pledge sequence is:
+1. Before parsing: `pledge("stdio rpath", NULL)` — allows reading files
+2. After parsing, before rendering: `pledge("stdio", NULL)` — drops file read access
+
+### Windows
+
+On Windows (non-Cygwin), binary mode is set for stdin/stdout to prevent CR/LF translation:
+```c
+#if defined(_WIN32) && !defined(__CYGWIN__)
+ _setmode(_fileno(stdin), _O_BINARY);
+ _setmode(_fileno(stdout), _O_BINARY);
+#endif
+```
+
+## Scanner Regeneration
+
+The `scanners.c` file is generated from `scanners.re` using re2c. To regenerate:
+
+```bash
+re2c --case-insensitive -b -i --no-generation-date -8 \
+ -o scanners.c scanners.re
+```
+
+The generated file is checked into the repository, so re2c is not required for normal builds.
+
+## Cross-References
+
+- [cli-usage.md](cli-usage.md) — Command-line tool details and options
+- [testing.md](testing.md) — Test framework details
+- [code-style.md](code-style.md) — Coding conventions
+- [scanner-system.md](scanner-system.md) — Scanner generation details
diff --git a/docs/handbook/cmark/cli-usage.md b/docs/handbook/cmark/cli-usage.md
new file mode 100644
index 0000000000..d77c3b8fa9
--- /dev/null
+++ b/docs/handbook/cmark/cli-usage.md
@@ -0,0 +1,249 @@
+# cmark — CLI Usage
+
+## Overview
+
+The `cmark` command-line tool (`main.c`) reads CommonMark input from files or stdin and renders it to one of five output formats. It serves as both a reference implementation and a practical conversion tool.
+
+## Entry Point
+
+```c
+int main(int argc, char *argv[]);
+```
+
+## Output Formats
+
+```c
+typedef enum {
+ FORMAT_NONE,
+ FORMAT_HTML,
+ FORMAT_XML,
+ FORMAT_MAN,
+ FORMAT_COMMONMARK,
+ FORMAT_LATEX,
+} writer_format;
+```
+
+Default: `FORMAT_HTML`.
+
+## Command-Line Options
+
+| Option | Long Form | Description |
+|--------|-----------|-------------|
+| `-t FORMAT` | `--to FORMAT` | Output format: `html`, `xml`, `man`, `commonmark`, `latex` |
+| | `--width N` | Wrapping width (0 = no wrapping; default 0). Only affects `commonmark`, `man`, `latex` |
+| | `--sourcepos` | Include source position information |
+| | `--hardbreaks` | Render soft breaks as hard breaks |
+| | `--nobreaks` | Render soft breaks as spaces |
+| | `--unsafe` | Allow raw HTML and dangerous URLs |
+| | `--smart` | Enable smart punctuation (curly quotes, em/en dashes, ellipses) |
+| | `--validate-utf8` | Validate and clean UTF-8 input |
+| `-h` | `--help` | Print usage information |
+| | `--version` | Print version string |
+
+## Option Parsing
+
+```c
+for (i = 1; i < argc; i++) {
+ if (strcmp(argv[i], "--version") == 0) {
+ printf("cmark %s", cmark_version_string());
+ printf(" - CommonMark converter\n(C) 2014-2016 John MacFarlane\n");
+ exit(0);
+ } else if (strcmp(argv[i], "--sourcepos") == 0) {
+ options |= CMARK_OPT_SOURCEPOS;
+ } else if (strcmp(argv[i], "--hardbreaks") == 0) {
+ options |= CMARK_OPT_HARDBREAKS;
+ } else if (strcmp(argv[i], "--nobreaks") == 0) {
+ options |= CMARK_OPT_NOBREAKS;
+ } else if (strcmp(argv[i], "--smart") == 0) {
+ options |= CMARK_OPT_SMART;
+ } else if (strcmp(argv[i], "--unsafe") == 0) {
+ options |= CMARK_OPT_UNSAFE;
+ } else if (strcmp(argv[i], "--validate-utf8") == 0) {
+ options |= CMARK_OPT_VALIDATE_UTF8;
+ } else if ((strcmp(argv[i], "--to") == 0 || strcmp(argv[i], "-t") == 0) &&
+ i + 1 < argc) {
+ i++;
+ if (strcmp(argv[i], "man") == 0) writer = FORMAT_MAN;
+ else if (strcmp(argv[i], "html") == 0) writer = FORMAT_HTML;
+ else if (strcmp(argv[i], "xml") == 0) writer = FORMAT_XML;
+ else if (strcmp(argv[i], "commonmark") == 0) writer = FORMAT_COMMONMARK;
+ else if (strcmp(argv[i], "latex") == 0) writer = FORMAT_LATEX;
+ else {
+ fprintf(stderr, "Unknown format %s\n", argv[i]);
+ exit(1);
+ }
+ } else if (strcmp(argv[i], "--width") == 0 && i + 1 < argc) {
+ i++;
+ width = atoi(argv[i]);
+ } else if (strcmp(argv[i], "-h") == 0 || strcmp(argv[i], "--help") == 0) {
+ print_usage();
+ exit(0);
+ } else if (*argv[i] == '-') {
+ print_usage();
+ exit(1);
+ } else {
+ // Treat as filename
+ files[numfps++] = i;
+ }
+}
+```
+
+## Input Handling
+
+### File Input
+
+```c
+for (i = 0; i < numfps; i++) {
+ fp = fopen(argv[files[i]], "rb");
+ if (fp == NULL) {
+ fprintf(stderr, "Error opening file %s: %s\n", argv[files[i]], strerror(errno));
+ exit(1);
+ }
+ // Read in chunks and feed to parser
+ while ((bytes = fread(buffer, 1, sizeof(buffer), fp)) > 0) {
+ cmark_parser_feed(parser, buffer, bytes);
+ if (bytes < sizeof(buffer)) break;
+ }
+ fclose(fp);
+}
+```
+
+Files are opened in binary mode (`"rb"`) and read in chunks of `BUFFER_SIZE` (4096 bytes). Each chunk is fed to the streaming parser via `cmark_parser_feed()`.
+
+### Stdin Input
+
+```c
+if (numfps == 0) {
+ // Read from stdin
+ while ((bytes = fread(buffer, 1, sizeof(buffer), stdin)) > 0) {
+ cmark_parser_feed(parser, buffer, bytes);
+ if (bytes < sizeof(buffer)) break;
+ }
+}
+```
+
+When no files are specified, input is read from stdin.
+
+### Windows Binary Mode
+
+```c
+#if defined(_WIN32) && !defined(__CYGWIN__)
+_setmode(_fileno(stdin), _O_BINARY);
+_setmode(_fileno(stdout), _O_BINARY);
+#endif
+```
+
+On Windows, stdin and stdout are set to binary mode to prevent CR/LF translation.
+
+## Rendering
+
+```c
+document = cmark_parser_finish(parser);
+cmark_parser_free(parser);
+
+// Render based on format and write the result to stdout.
+// print_document() returns void, so there is no result to capture here.
+print_document(document, writer, width, options);
+```
+
+### `print_document()`
+
+```c
+static void print_document(cmark_node *document, writer_format writer,
+ int width, int options) {
+ char *result;
+ switch (writer) {
+ case FORMAT_HTML:
+ result = cmark_render_html(document, options);
+ break;
+ case FORMAT_XML:
+ result = cmark_render_xml(document, options);
+ break;
+ case FORMAT_MAN:
+ result = cmark_render_man(document, options, width);
+ break;
+ case FORMAT_COMMONMARK:
+ result = cmark_render_commonmark(document, options, width);
+ break;
+ case FORMAT_LATEX:
+ result = cmark_render_latex(document, options, width);
+ break;
+ default:
+ fprintf(stderr, "Unknown format %d\n", writer);
+ exit(1);
+ }
+ printf("%s", result);
+ document->mem->free(result);
+}
+```
+
+The rendered result is written to stdout and then freed.
+
+### Cleanup
+
+```c
+cmark_node_free(document);
+```
+
+The AST is freed after rendering.
+
+## OpenBSD Security
+
+```c
+#ifdef __OpenBSD__
+ if (pledge("stdio rpath", NULL) != 0) {
+ perror("pledge");
+ return 1;
+ }
+#endif
+```
+
+On OpenBSD, the program restricts itself to `stdio` and `rpath` (read-only file access) via `pledge()`. This prevents the cmark binary from performing any operations beyond reading files and writing to stdout/stderr.
+
+## Usage Examples
+
+```bash
+# Convert Markdown to HTML
+cmark input.md
+
+# Convert with smart punctuation
+cmark --smart input.md
+
+# Convert to man page with 72-column wrapping
+cmark -t man --width 72 input.md
+
+# Convert to LaTeX
+cmark -t latex input.md
+
+# Round-trip through CommonMark
+cmark -t commonmark input.md
+
+# Include source positions in output
+cmark --sourcepos input.md
+
+# Allow raw HTML passthrough
+cmark --unsafe input.md
+
+# Read from stdin
+echo "# Hello" | cmark
+
+# Validate UTF-8 input
+cmark --validate-utf8 input.md
+
+# Print version
+cmark --version
+```
+
+## Exit Codes
+
+- `0` — Success
+- `1` — Error (unknown option, file open failure, unknown format)
+
+## Cross-References
+
+- [main.c](../../../cmark/src/main.c) — Full implementation
+- [public-api.md](public-api.md) — The C API functions called by main
+- [html-renderer.md](html-renderer.md) — `cmark_render_html()`
+- [xml-renderer.md](xml-renderer.md) — `cmark_render_xml()`
+- [latex-renderer.md](latex-renderer.md) — `cmark_render_latex()`
+- [man-renderer.md](man-renderer.md) — `cmark_render_man()`
+- [commonmark-renderer.md](commonmark-renderer.md) — `cmark_render_commonmark()`
diff --git a/docs/handbook/cmark/code-style.md b/docs/handbook/cmark/code-style.md
new file mode 100644
index 0000000000..0ac2af2def
--- /dev/null
+++ b/docs/handbook/cmark/code-style.md
@@ -0,0 +1,293 @@
+# cmark — Code Style and Conventions
+
+## Overview
+
+This document describes the coding conventions and patterns used throughout the cmark codebase. Understanding these conventions makes the source code easier to navigate.
+
+## Naming Conventions
+
+### Public API Functions
+
+All public functions use the `cmark_` prefix:
+```c
+cmark_node *cmark_node_new(cmark_node_type type);
+cmark_parser *cmark_parser_new(int options);
+char *cmark_render_html(cmark_node *root, int options);
+```
+
+### Internal (Static) Functions
+
+File-local static functions use the `S_` prefix:
+```c
+static void S_render_node(cmark_node *node, cmark_event_type ev_type,
+ struct render_state *state, int options);
+static cmark_node *S_node_new(cmark_node_type type, cmark_mem *mem);
+static void S_free_nodes(cmark_node *e);
+static bool S_is_leaf(cmark_node *node);
+static int S_get_enumlevel(cmark_node *node);
+```
+
+This convention makes it immediately clear whether a function has file-local scope.
+
+### Internal (Non-Static) Functions
+
+Functions that are internal to the library but shared across translation units use:
+- `cmark_` prefix (same as public) — declared in private headers (e.g., `parser.h`, `node.h`)
+- No `S_` prefix
+
+Examples:
+```c
+// In inlines.h (private header):
+void cmark_parse_inlines(cmark_mem *mem, cmark_node *parent,
+                         cmark_reference_map *refmap, int options);
+bufsize_t cmark_parse_reference_inline(cmark_mem *mem, cmark_chunk *input,
+                                       cmark_reference_map *refmap);
+```
+
+### Struct Members
+
+No prefix convention — struct members use plain names:
+```c
+struct cmark_node {
+ cmark_mem *mem;
+ cmark_node *next;
+ cmark_node *prev;
+ cmark_node *parent;
+ cmark_node *first_child;
+ cmark_node *last_child;
+ // ...
+};
+```
+
+### Type Names
+
+Typedefs use the `cmark_` prefix:
+```c
+typedef struct cmark_node cmark_node;
+typedef struct cmark_parser cmark_parser;
+typedef struct cmark_iter cmark_iter;
+typedef int32_t bufsize_t; // Exception: no cmark_ prefix
+```
+
+### Enum Values
+
+Enum constants use the `CMARK_` prefix with UPPER_CASE:
+```c
+typedef enum {
+ CMARK_NODE_NONE,
+ CMARK_NODE_DOCUMENT,
+ CMARK_NODE_BLOCK_QUOTE,
+ // ...
+} cmark_node_type;
+```
+
+### Preprocessor Macros
+
+Macros use UPPER_CASE, sometimes with `CMARK_` prefix:
+```c
+#define CMARK_OPT_SOURCEPOS (1 << 1)
+#define CMARK_BUF_INIT(mem) { mem, cmark_strbuf__initbuf, 0, 0 }
+#define MAX_LINK_LABEL_LENGTH 999
+#define CODE_INDENT 4
+```
+
+## Error Handling Patterns
+
+### Allocation Failure
+
+The default allocator (`xcalloc`, `xrealloc`) aborts on failure:
+```c
+static void *xcalloc(size_t nmemb, size_t size) {
+ void *ptr = calloc(nmemb, size);
+ if (!ptr) abort();
+ return ptr;
+}
+```
+
+Functions that allocate never return NULL — they either succeed or terminate. This eliminates NULL-check boilerplate throughout the codebase.
+
+### Invalid Input
+
+Functions that receive invalid arguments typically:
+1. Return 0/false/NULL for queries
+2. Do nothing for mutations
+3. Never crash
+
+Example from `node.c`:
+```c
+int cmark_node_set_heading_level(cmark_node *node, int level) {
+ if (node == NULL || node->type != CMARK_NODE_HEADING) return 0;
+ if (level < 1 || level > 6) return 0;
+ node->as.heading.level = level;
+ return 1;
+}
+```
+
+### Return Conventions
+
+- **1/0 for success/failure**: Setter functions return 1 on success, 0 on failure
+- **NULL for not found**: Lookup functions return NULL when the item doesn't exist
+- **Assertion for invariants**: Internal invariants use `assert()`:
+ ```c
+ assert(parser->root->type == CMARK_NODE_DOCUMENT);
+ ```
+
+## Header Guard Style
+
+```c
+#ifndef CMARK_NODE_H
+#define CMARK_NODE_H
+// ...
+#endif
+```
+
+Guards use `CMARK_` prefix + uppercase filename + `_H`.
+
+## Include Patterns
+
+### Public Headers
+```c
+#include "cmark.h" // Always first — provides all public types
+```
+
+### Private Headers
+```c
+#include "node.h" // Internal node definitions
+#include "parser.h" // Parser internals
+#include "buffer.h" // cmark_strbuf
+#include "chunk.h" // cmark_chunk
+#include "references.h" // Reference map
+#include "utf8.h" // UTF-8 utilities
+#include "scanners.h" // re2c-generated scanners
+```
+
+### System Headers
+```c
+#include <stdlib.h>
+#include <string.h>
+#include <assert.h>
+#include <stdio.h>
+```
+
+## Inline Functions
+
+The `CMARK_INLINE` macro abstracts compiler-specific inline syntax:
+```c
+#ifdef _MSC_VER
+#define CMARK_INLINE __forceinline
+#else
+#define CMARK_INLINE __inline__
+#endif
+```
+
+Used for small, hot-path functions in headers:
+```c
+static CMARK_INLINE void cmark_chunk_free(cmark_mem *mem, cmark_chunk *c) { ... }
+static CMARK_INLINE cmark_chunk cmark_chunk_dup(...) { ... }
+```
+
+## Memory Ownership Patterns
+
+### Owning vs Non-Owning
+
+The `cmark_chunk` type makes ownership explicit:
+- `alloc > 0` → the chunk owns the memory and must free it
+- `alloc == 0` → the chunk borrows memory from elsewhere
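The ownership convention can be illustrated with a toy type. `toy_chunk` and its helpers are hypothetical and use `malloc`/`free` directly instead of a `cmark_mem` allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical chunk type following the convention described above. */
typedef struct {
  const unsigned char *data;
  size_t len;
  size_t alloc; /* > 0: chunk owns data; 0: chunk borrows data */
} toy_chunk;

/* Borrow: point at existing memory without taking ownership. */
static toy_chunk toy_chunk_borrow(const char *s) {
  toy_chunk c;
  assert(s != NULL);
  c.data = (const unsigned char *)s;
  c.len = strlen(s);
  c.alloc = 0;
  return c;
}

/* Own: copy the string into freshly allocated memory. */
static toy_chunk toy_chunk_copy(const char *s) {
  toy_chunk c;
  size_t len;
  unsigned char *buf;
  assert(s != NULL);
  len = strlen(s);
  buf = malloc(len + 1);
  if (!buf)
    abort(); /* same abort-on-failure policy as the default allocator */
  memcpy(buf, s, len + 1);
  c.data = buf;
  c.len = len;
  c.alloc = len + 1;
  return c;
}

/* Free only what is owned, then reset every field. */
static void toy_chunk_free(toy_chunk *c) {
  if (c->alloc)
    free((void *)c->data);
  c->data = NULL;
  c->alloc = 0;
  c->len = 0;
}
```

Calling `toy_chunk_free()` on a borrowed chunk is safe because the `alloc` check skips the `free()` call.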
+
+### Transfer of Ownership
+
+`cmark_strbuf_detach()` transfers ownership from a strbuf to the caller:
+```c
+unsigned char *data = cmark_strbuf_detach(&buf);
+// Caller now owns 'data' and must free it
+```
+
+### Consistent Cleanup
+
+Free functions null out pointers after freeing:
+```c
+static CMARK_INLINE void cmark_chunk_free(cmark_mem *mem, cmark_chunk *c) {
+ if (c->alloc)
+ mem->free((void *)c->data);
+ c->data = NULL; // NULL after free
+ c->alloc = 0;
+ c->len = 0;
+}
+```
+
+## Iterative vs Recursive Patterns
+
+The codebase avoids recursion for tree operations to prevent stack overflow on deeply nested input:
+
+### Iterative Tree Destruction
+`S_free_nodes()` uses sibling-list splicing instead of recursion:
+```c
+// Splice children into sibling chain
+if (e->first_child) {
+ cmark_node *last = e->last_child;
+ last->next = e->next;
+ e->next = e->first_child;
+}
+```
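To show how the splice lets a single loop free a whole tree, here is a self-contained sketch on a toy node type. `toy_node`, `toy_new`, and `toy_free` are hypothetical; the real function also releases per-node data, which is omitted here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node with only the links needed for the splice. */
typedef struct toy_node {
  struct toy_node *next;
  struct toy_node *first_child;
  struct toy_node *last_child;
} toy_node;

/* Allocate a node and append it to `parent` if one is given. */
static toy_node *toy_new(toy_node *parent) {
  toy_node *n = calloc(1, sizeof(*n));
  if (!n)
    abort();
  if (parent) {
    if (parent->last_child)
      parent->last_child->next = n;
    else
      parent->first_child = n;
    parent->last_child = n;
  }
  return n;
}

/* Free the tree rooted at `e` without recursion. Returns the node count. */
static int toy_free(toy_node *e) {
  int freed = 0;
  /* The sketch assumes the root has no siblings. */
  assert(e == NULL || e->next == NULL);
  while (e != NULL) {
    toy_node *next;
    if (e->first_child) {
      /* Splice the children in front of the remaining siblings. */
      e->last_child->next = e->next;
      e->next = e->first_child;
    }
    next = e->next;
    free(e);
    freed++;
    e = next;
  }
  return freed;
}
```

Because children are moved into the sibling chain before their parent is freed, stack usage stays constant no matter how deeply the input nests.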
+
+### Iterator-Based Traversal
+All rendering uses `cmark_iter` instead of recursive `render_children()`:
+```c
+while ((ev_type = cmark_iter_next(iter)) != CMARK_EVENT_DONE) {
+ cur = cmark_iter_get_node(iter);
+ S_render_node(cur, ev_type, &state, options);
+}
+```
+
+## Type Size Definitions
+
+```c
+typedef int32_t bufsize_t;
+```
+
+Buffer sizes use `int32_t` (not `size_t`) to:
+1. Allow negative values for error signaling
+2. Keep node structs compact (32-bit vs 64-bit on LP64)
+3. Limit maximum allocation to 2GB (adequate for text processing)
+
+## Bitmask Patterns
+
+Option flags use single-bit constants:
+```c
+#define CMARK_OPT_SOURCEPOS (1 << 1)
+#define CMARK_OPT_HARDBREAKS (1 << 2)
+#define CMARK_OPT_UNSAFE (1 << 17)
+#define CMARK_OPT_NOBREAKS (1 << 4)
+#define CMARK_OPT_VALIDATE_UTF8 (1 << 9)
+#define CMARK_OPT_SMART (1 << 10)
+```
+
+Tested with bitwise AND:
+```c
+if (options & CMARK_OPT_SOURCEPOS) { ... }
+```
+
+Combined with bitwise OR:
+```c
+int options = CMARK_OPT_SOURCEPOS | CMARK_OPT_SMART;
+```
+
+## Leaf Mask Pattern
+
+`S_is_leaf()` in `iterator.c` uses a bitmask for O(1) node-type classification:
+```c
+static const int S_leaf_mask =
+ (1 << CMARK_NODE_HTML_BLOCK) | (1 << CMARK_NODE_THEMATIC_BREAK) |
+ (1 << CMARK_NODE_CODE_BLOCK) | (1 << CMARK_NODE_TEXT) | ...;
+
+static bool S_is_leaf(cmark_node *node) {
+ return ((1 << node->type) & S_leaf_mask) != 0;
+}
+```
+
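+This is a compact alternative to a switch statement for a simple boolean classification, and it compiles to a shift and a mask. It relies on every node type value being smaller than the bit width of `int`.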
+
+## Cross-References
+
+- [architecture.md](architecture.md) — Design decisions
+- [memory-management.md](memory-management.md) — Allocator patterns
+- [public-api.md](public-api.md) — Public API naming
diff --git a/docs/handbook/cmark/commonmark-renderer.md b/docs/handbook/cmark/commonmark-renderer.md
new file mode 100644
index 0000000000..01ffb3a987
--- /dev/null
+++ b/docs/handbook/cmark/commonmark-renderer.md
@@ -0,0 +1,344 @@
+# cmark — CommonMark Renderer
+
+## Overview
+
+The CommonMark renderer (`commonmark.c`) converts a `cmark_node` AST back into CommonMark-formatted Markdown text. This is significantly more complex than the other renderers because it must reproduce syntactically valid Markdown that, when re-parsed, produces an equivalent AST. It uses the generic render framework from `render.c`.
+
+## Entry Point
+
+```c
+char *cmark_render_commonmark(cmark_node *root, int options, int width);
+```
+
+- `root` — AST root node
+- `options` — Option flags
+- `width` — Target line width for wrapping; 0 disables wrapping
+
+## Character Escaping (`outc`)
+
+CommonMark escaping is the most involved of all the renderers. Besides `LITERAL` (no escaping), three escaping modes exist:
+
+### NORMAL Mode
+
+Characters that could be interpreted as Markdown syntax must be backslash-escaped. Characters that trigger escaping:
+
+```c
+case '*':
+case '#':
+case '(':
+case ')':
+case '[':
+case ']':
+case '<':
+case '>':
+case '!':
+case '\\':
+ // Backslash-escaped: \*, \#, \(, etc.
+```
+
+Additionally:
+- `.` and `)` — only escaped at line start (after a digit), to prevent triggering ordered list syntax
+- `-`, `+`, `=`, `_` — only escaped at line start, to prevent thematic breaks, bullet lists, or setext headings
+- `~` — only escaped at line start
+- `&` — escaped to prevent entity references
+- `'`, `"` — escaped for smart punctuation
+
+For whitespace handling:
+- NBSP (`\xA0`) is emitted unchanged, as the literal non-breaking space character
+- Tab → space (tabs cannot be reliably round-tripped)
+
+### URL Mode
+
+URLs in parenthesized `()` format need minimal escaping: `(` and `)` are backslash-escaped, and whitespace is percent-encoded (a space becomes `%20`).
+
+### TITLE Mode
+
+For link titles, only the title delimiter character is escaped. The renderer currently always uses `"` as the title delimiter, so `"` is backslash-escaped within titles.
+
+## Backtick Sequence Analysis
+
+Two helper functions determine how to format inline code spans:
+
+### `longest_backtick_sequence()`
+
+```c
+static int longest_backtick_sequence(const char *code) {
+ int longest = 0;
+ int current = 0;
+ size_t i = 0;
+ size_t code_len = strlen(code);
+ while (i <= code_len) {
+ if (code[i] == '`') {
+ current++;
+ } else {
+ if (current > longest)
+ longest = current;
+ current = 0;
+ }
+ i++;
+ }
+ return longest;
+}
+```
+
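+Finds the maximum run of consecutive backticks within a code string. The loop condition `i <= code_len` is deliberate: visiting the terminating NUL flushes a run that ends at the very end of the string.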
+
+### `shortest_unused_backtick_sequence()`
+
+```c
+static int shortest_unused_backtick_sequence(const char *code) {
+ int32_t used = 1; // Bit n set: a run of n backticks occurs (n = 1-31); bit 0 is a sentinel
+ int current = 0;
+ // ... scan for runs, set bits in 'used'
+ int i = 0;
+ while (i < 32 && (used & 1)) { // cap at 32 so the search always terminates
+ used >>= 1;
+ i++;
+ }
+ return i; // index of the lowest clear bit
+}
+```
+
+Determines the shortest backtick sequence (1-32) that does NOT appear in the code content. This ensures the code delimiter won't conflict with backticks inside the code.
+
+This uses a bitmask: bit n of the 32-bit integer `used` is set when a run of exactly n backticks appears, and bit 0 is preset so that the search starts at length 1. After scanning, the index of the lowest clear bit is the shortest unused length.
+
+## Autolink Detection
+
+```c
+static bool is_autolink(cmark_node *node) {
+ const char *title;
+ const char *url;
+ // ...
+ if (node->first_child->type != CMARK_NODE_TEXT) return false;
+ url = (char *)node->as.link.url;
+ title = (char *)node->as.link.title;
+ if (title && title[0]) return false; // Autolinks have no title
+ if (url &&
+ (strncmp(url, "http://", 7) == 0 || strncmp(url, "https://", 8) == 0 ||
+ strncmp(url, "mailto:", 7) == 0) &&
+ strcmp(url, (char *)node->first_child->data) == 0)
+ return true;
+ return false;
+}
+```
+
+A link is an autolink if:
+1. Its first child is a text node
+2. No title
+3. URL starts with `http://`, `https://`, or `mailto:`
+4. The text exactly matches the URL
+
+## Node Rendering (`S_render_node`)
+
+### Block Nodes
+
+#### Document
+No output.
+
+#### Block Quote
+```
+ENTER: Sets prefix to "> " for first line and "> " for continuations
+EXIT: Restores prefix, adds blank line
+```
+
+The prefix mechanism is central to CommonMark rendering. When entering a block quote:
+```c
+cmark_strbuf_puts(renderer->prefix, "> ");
+```
+
+All content within the block quote is prefixed with `"> "` on each line.
+
+#### List
+```
+ENTER: Records tight/loose status, records bullet character
+EXIT: Restores prefix, adds blank line
+```
+
+The renderer stores whether the list is tight to control inter-item blank lines.
+
+#### Item
+```
+ENTER: Computes marker and indentation prefix
+EXIT: Restores prefix
+```
+
+**Bullet items:** Rendered with a `-` marker. (`cmark_node_get_list_delim()` applies only to ordered lists, where it distinguishes `.` from `)`.) The prefix is set to the matching indentation:
+
+```c
+// For a bullet item:
+"- " on the first line
+"  " on continuation lines (indentation matches marker width)
+```
+
+**Ordered items:** Number is computed by counting previous siblings:
+```c
+list_number = cmark_node_get_list_start(node->parent);
+tmp = node;
+while (tmp->prev) {
+ tmp = tmp->prev;
+ list_number++;
+}
+```
+
+Format: `"N. "` or `"N) "` depending on delimiter type. Continuation indent matches the marker width.
+
+For tight lists, items don't emit blank lines between them.
+
+#### Heading
+**ATX headings** (levels 1-6):
+```
+### Content\n
+```
+
+The number of `#` characters matches the heading level. A newline follows the heading content.
+
+**Setext headings:**
+Not used. The renderer always emits ATX headings, including for levels 1 and 2 and regardless of `width`.
+
+#### Code Block
+The renderer determines whether to use fenced or indented code:
+
+**Fenced code blocks:**
+````
+```[info]
+content
+```
+````
+
+The fence character is `` ` ``. The fence length is max(3, longest_backtick_in_content + 1).
+
+If the code has an info string, fenced blocks are always used (indented blocks cannot carry info strings).
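+
+The fence-length rule above can be sketched as follows. This is a standalone illustration under the assumption that the content is a NUL-terminated string; `demo_longest_backtick_run` and `demo_fence_length` are hypothetical names, not functions from `commonmark.c`.
+
+```c
+#include <string.h>
+
+/* Length of the longest run of consecutive backticks in `code`. */
+int demo_longest_backtick_run(const char *code) {
+  int longest = 0, current = 0;
+  size_t len = strlen(code);
+  /* i == len visits the NUL terminator, which flushes a trailing run. */
+  for (size_t i = 0; i <= len; i++) {
+    if (code[i] == '`') {
+      current++;
+    } else {
+      if (current > longest) longest = current;
+      current = 0;
+    }
+  }
+  return longest;
+}
+
+/* Fence length: one more than the longest run, but never fewer than 3. */
+int demo_fence_length(const char *code) {
+  int n = demo_longest_backtick_run(code) + 1;
+  return n < 3 ? 3 : n;
+}
+```
+
+Choosing a fence one backtick longer than any run in the content guarantees that no line inside the block can close the fence early.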
+
+**Indented code blocks:**
+If there's no info string and `width == 0`, the renderer uses 4-space indentation by setting the prefix to `"    "`.
+
+#### HTML Block
+Content is output LITERALLY (no escaping):
+```c
+cmark_render_ascii(renderer, (char *)node->data);
+```
+
+This preserves raw HTML exactly.
+
+#### Thematic Break
+```
+-----\n
+```
+
+Uses `-----` (five hyphens).
+
+#### Paragraph
+```
+ENTER: (nothing for tight, blank line for normal)
+EXIT: \n (newline after content)
+```
+
+In tight lists, paragraphs don't add blank lines before/after.
+
+### Inline Nodes
+
+#### Text
+Output with NORMAL escaping (all Markdown-significant characters escaped).
+
+#### Soft Break
+Depends on options:
+- `CMARK_OPT_HARDBREAKS`: `\\\n` (backslash line break)
+- `CMARK_OPT_NOBREAKS`: space
+- Default: newline
+
+#### Line Break
+```
+\\\n
+```
+
+Backslash followed by newline.
+
+#### Code (inline)
+The renderer selects delimiters using `shortest_unused_backtick_sequence()`:
+
+```c
+int numticks = shortest_unused_backtick_sequence(code);
+// output numticks backticks
+// if code starts or ends with backtick, add space padding
+// output literal code
+// output numticks backticks
+```
+
+If the code content starts or ends with a backtick, spaces are added inside the delimiters to prevent ambiguity:
+```
+`` `code` ``
+```
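+
+The delimiter selection and padding described above can be sketched as a standalone function. This is an illustration only: `demo_shortest_unused_run` and `demo_render_code_span` are hypothetical names, and the fixed-size output buffer is an assumption made to keep the example self-contained.
+
+```c
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+
+/* Smallest run length (1..32) that never occurs in `code`. */
+int demo_shortest_unused_run(const char *code) {
+  uint32_t used = 1; /* bit n set: a run of n backticks occurs; bit 0 is a sentinel */
+  int current = 0;
+  size_t len = strlen(code);
+  for (size_t i = 0; i <= len; i++) { /* the NUL flushes a trailing run */
+    if (code[i] == '`') {
+      current++;
+    } else {
+      if (current > 0 && current < 32) used |= (UINT32_C(1) << current);
+      current = 0;
+    }
+  }
+  int n = 0;
+  while (n < 32 && (used & 1)) { /* index of the lowest clear bit */
+    used >>= 1;
+    n++;
+  }
+  return n;
+}
+
+/* Writes the code span into `out`; returns 0 on success, -1 if it does not fit. */
+int demo_render_code_span(const char *code, char *out, size_t out_size) {
+  int ticks = demo_shortest_unused_run(code);
+  size_t len = strlen(code);
+  /* Pad with spaces when the content starts or ends with a backtick. */
+  int pad = len > 0 && (code[0] == '`' || code[len - 1] == '`');
+  char fence[33];
+  memset(fence, '`', (size_t)ticks);
+  fence[ticks] = '\0';
+  int written = snprintf(out, out_size, "%s%s%s%s%s", fence, pad ? " " : "",
+                         code, pad ? " " : "", fence);
+  return (written >= 0 && (size_t)written < out_size) ? 0 : -1;
+}
+```
+
+For the content `` `code` `` the only run length in use is 1, so the sketch picks two backticks and adds the padding spaces, matching the example above.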
+
+#### Emphasis
+```
+ENTER: * or _ (delimiter character)
+EXIT: * or _ (matching delimiter)
+```
+
+The delimiter selection depends on what characters appear in the content. If the content contains `*`, `_` is preferred (and vice versa). The `emph_delim` variable tracks the chosen delimiter.
+
+#### Strong
+```
+ENTER: ** or __
+EXIT: ** or __
+```
+
+Same delimiter selection logic as emphasis.
+
+#### Link
+**Autolinks:**
+```
+<URL>
+```
+
+**Normal links:**
+```
+ENTER: [
+EXIT: ](URL "TITLE") or ](URL) if no title
+```
+
+The URL is output with URL escaping, the title with TITLE escaping.
+
+#### Image
+```
+ENTER: ![
+EXIT: ](URL "TITLE") or ](URL) if no title
+```
+
+Same as links but with `!` prefix.
+
+#### HTML Inline
+Output literally (no escaping).
+
+## Prefix Management
+
+The CommonMark renderer makes extensive use of the prefix system from `render.c`. Each line of output is prefixed with accumulated prefix strings from container nodes. For example, a list item inside a block quote:
+
+```
+> - Item text
+>   continuation
+```
+
+The prefix stack would be:
+1. `"> "` from the block quote
+2. `"  "` (continuation indent) from the list item
+
+The `cmark_renderer` struct maintains `prefix` and `begin_content` fields to handle this.
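+
+A minimal sketch of how an accumulated prefix produces the output above. The `demo_prefix` type and its functions are hypothetical stand-ins for the renderer's prefix buffer, and the fixed 64-byte capacity is an assumption for brevity.
+
+```c
+#include <stdio.h>
+#include <string.h>
+
+/* Hypothetical stand-in for the renderer's accumulated prefix buffer. */
+typedef struct {
+  char buf[64];
+  size_t len;
+} demo_prefix;
+
+/* Appends `s` to the prefix when a container is entered. */
+void demo_prefix_push(demo_prefix *p, const char *s) {
+  size_t n = strlen(s);
+  if (p->len + n < sizeof p->buf) {
+    memcpy(p->buf + p->len, s, n + 1);
+    p->len += n;
+  }
+}
+
+/* Drops the last `n` bytes when the container is exited. */
+void demo_prefix_pop(demo_prefix *p, size_t n) {
+  p->len = n > p->len ? 0 : p->len - n;
+  p->buf[p->len] = '\0';
+}
+
+/* Writes one output line: the accumulated prefix followed by `text`. */
+void demo_emit_line(const demo_prefix *p, const char *text, char *out,
+                    size_t out_size) {
+  snprintf(out, out_size, "%s%s\n", p->buf, text);
+}
+```
+
+Entering the block quote pushes `"> "`, entering the list item pushes its continuation indent, and each exit pops the same number of bytes, so nested containers compose without knowing about each other.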
+
+## Round-Trip Fidelity
+
+The CommonMark renderer aims for round-trip fidelity: parsing the output should produce an AST equivalent to the input. This is not always perfectly achievable:
+
+1. **Whitespace normalization**: Some whitespace differences (e.g., number of blank lines) are lost.
+2. **Reference links**: Inline link syntax is always used; reference-style links are not preserved.
+3. **ATX vs setext**: Always uses ATX headings.
+4. **Indented vs fenced**: Logic selects one based on info string presence and width setting.
+5. **Emphasis delimiter**: May differ from the original (`*` vs `_`).
+
+## Cross-References
+
+- [commonmark.c](../../cmark/src/commonmark.c) — Full implementation
+- [render-framework.md](render-framework.md) — Generic render framework
+- [public-api.md](public-api.md) — `cmark_render_commonmark()` API docs
+- [scanner-system.md](scanner-system.md) — Scanners used for autolink detection
diff --git a/docs/handbook/cmark/html-renderer.md b/docs/handbook/cmark/html-renderer.md
new file mode 100644
index 0000000000..98406c300c
--- /dev/null
+++ b/docs/handbook/cmark/html-renderer.md
@@ -0,0 +1,258 @@
+# cmark — HTML Renderer
+
+## Overview
+
+The HTML renderer (`html.c`) converts a `cmark_node` AST into an HTML string. Unlike the LaTeX, man, and CommonMark renderers, it does NOT use the generic render framework from `render.c`. Instead, it writes directly to a `cmark_strbuf` buffer, giving it full control over output formatting.
+
+## Entry Point
+
+```c
+char *cmark_render_html(cmark_node *root, int options);
+```
+
+Creates an iterator over the AST, processes each node via `S_render_node()`, and returns the resulting HTML string. The caller is responsible for freeing the returned buffer.
+
+### Implementation
+
+```c
+char *cmark_render_html(cmark_node *root, int options) {
+ char *result;
+ cmark_strbuf html = CMARK_BUF_INIT(root->mem);
+ cmark_event_type ev_type;
+ cmark_node *cur;
+ struct render_state state = {&html, NULL};
+ cmark_iter *iter = cmark_iter_new(root);
+
+ while ((ev_type = cmark_iter_next(iter)) != CMARK_EVENT_DONE) {
+ cur = cmark_iter_get_node(iter);
+ S_render_node(cur, ev_type, &state, options);
+ }
+ result = (char *)cmark_strbuf_detach(&html);
+ cmark_iter_free(iter);
+ return result;
+}
+```
+
+## Render State
+
+```c
+struct render_state {
+ cmark_strbuf *html; // Output buffer
+ cmark_node *plain; // Non-NULL when rendering image alt text (plain text mode)
+};
+```
+
+The `plain` field is used for image alt text rendering. When entering an image node, `state->plain` is set to the image node. While `plain` is non-NULL, only text content is emitted (HTML tags are suppressed) — this ensures the `alt` attribute contains only plain text, not nested HTML. When the iterator exits the image node (`state->plain == node`), plain mode is cleared.
+
+## HTML Escaping
+
+```c
+static void escape_html(cmark_strbuf *dest, const unsigned char *source,
+ bufsize_t length) {
+ houdini_escape_html(dest, source, length, 0);
+}
+```
+
+Characters `<`, `>`, `&`, `"` are converted to their HTML entity equivalents. The `0` argument means "not secure mode" (no additional escaping).
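+
+The four replacements can be sketched as a standalone function. This is not houdini's implementation: `demo_escape_html` is a hypothetical name, and writing into a caller-supplied fixed buffer is an assumption made to avoid depending on `cmark_strbuf`.
+
+```c
+#include <stddef.h>
+
+/* Escapes `src` into `out`. Returns the number of bytes written
+   (excluding the NUL), or -1 if the result does not fit. */
+long demo_escape_html(const char *src, char *out, size_t out_size) {
+  size_t w = 0;
+  for (size_t i = 0; src[i] != '\0'; i++) {
+    const char *rep = NULL;
+    switch (src[i]) {
+    case '<': rep = "&lt;"; break;
+    case '>': rep = "&gt;"; break;
+    case '&': rep = "&amp;"; break;
+    case '"': rep = "&quot;"; break;
+    default: break;
+    }
+    if (rep) {
+      for (size_t k = 0; rep[k] != '\0'; k++) {
+        if (w + 1 >= out_size) return -1; /* keep room for the NUL */
+        out[w++] = rep[k];
+      }
+    } else {
+      if (w + 1 >= out_size) return -1;
+      out[w++] = src[i];
+    }
+  }
+  if (out_size == 0) return -1;
+  out[w] = '\0';
+  return (long)w;
+}
+```
+
+Escaping `&` along with the angle brackets matters: without it, text that already contains something like `&lt;` would be indistinguishable from an escaped `<` in the output.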
+
+## Source Position Attributes
+
+```c
+static void S_render_sourcepos(cmark_node *node, cmark_strbuf *html, int options) {
+ char buffer[BUFFER_SIZE];
+ if (CMARK_OPT_SOURCEPOS & options) {
+ snprintf(buffer, BUFFER_SIZE, " data-sourcepos=\"%d:%d-%d:%d\"",
+ cmark_node_get_start_line(node), cmark_node_get_start_column(node),
+ cmark_node_get_end_line(node), cmark_node_get_end_column(node));
+ cmark_strbuf_puts(html, buffer);
+ }
+}
+```
+
+When `CMARK_OPT_SOURCEPOS` is set, all block-level elements receive a `data-sourcepos` attribute with format `"startline:startcol-endline:endcol"`.
+
+## Newline Helper
+
+```c
+static inline void cr(cmark_strbuf *html) {
+ if (html->size && html->ptr[html->size - 1] != '\n')
+ cmark_strbuf_putc(html, '\n');
+}
+```
+
+Ensures the output ends with a newline without adding redundant ones.
+
+## Node Rendering Logic
+
+The `S_render_node()` function handles each node type in a large switch statement. The `entering` boolean indicates whether this is a `CMARK_EVENT_ENTER` or a `CMARK_EVENT_EXIT` event.
+
+### Block Nodes
+
+#### Document
+No output — the document node is purely structural.
+
+#### Block Quote
+```
+ENTER: \n<blockquote[sourcepos]>\n
+EXIT: \n</blockquote>\n
+```
+
+#### List
+```
+ENTER (bullet): \n<ul[sourcepos]>\n
+ENTER (ordered): \n<ol[sourcepos]>\n (or <ol start="N"> if start != 1)
+EXIT: </ul>\n or </ol>\n
+```
+
+#### Item
+```
+ENTER: \n<li[sourcepos]>
+EXIT: </li>\n
+```
+
+#### Heading
+```
+ENTER: \n<hN[sourcepos]> (where N = heading level)
+EXIT: </hN>\n
+```
+
+The heading level is injected into character arrays:
+```c
+char start_heading[] = "<h0";
+start_heading[2] = (char)('0' + node->as.heading.level);
+```
+
+#### Code Block
+Always a leaf node (single event). Output:
+```html
+<pre[sourcepos]><code>ESCAPED CONTENT</code></pre>\n
+```
+
+If the code block has an info string, a `class` attribute is added:
+```html
+<pre[sourcepos]><code class="language-INFO">ESCAPED CONTENT</code></pre>\n
+```
+
+The `"language-"` prefix is only added if the info string doesn't already start with `"language-"`.
+
+#### HTML Block
+When `CMARK_OPT_UNSAFE` is set, raw HTML is output verbatim. Otherwise, it's replaced with:
+```html
+<!-- raw HTML omitted -->
+```
+
+#### Thematic Break
+```html
+<hr[sourcepos] />\n
+```
+
+#### Paragraph
+The paragraph respects tight list context. The renderer checks if the paragraph's grandparent is a list with `tight = true`:
+
+```c
+parent = cmark_node_parent(node);
+grandparent = cmark_node_parent(parent);
+if (grandparent != NULL && grandparent->type == CMARK_NODE_LIST) {
+ tight = grandparent->as.list.tight;
+} else {
+ tight = false;
+}
+```
+
+In tight lists, the `<p>` tags are suppressed — content flows directly without wrapping.
+
+#### Custom Block
+On enter, outputs the `on_enter` text literally. On exit, outputs `on_exit`.
+
+### Inline Nodes
+
+#### Text
+```c
+escape_html(html, node->data, node->len);
+```
+
+All text content is HTML-escaped.
+
+#### Line Break
+```html
+<br />\n
+```
+
+#### Soft Break
+Behavior depends on options:
+- `CMARK_OPT_HARDBREAKS`: `<br />\n`
+- `CMARK_OPT_NOBREAKS`: single space
+- Default: `\n`
+
+#### Code (inline)
+```html
+<code>ESCAPED CONTENT</code>
+```
+
+#### HTML Inline
+Same as HTML block: verbatim with `CMARK_OPT_UNSAFE`, otherwise `<!-- raw HTML omitted -->`.
+
+#### Emphasis
+```
+ENTER: <em>
+EXIT: </em>
+```
+
+#### Strong
+```
+ENTER: <strong>
+EXIT: </strong>
+```
+
+#### Link
+```
+ENTER: <a href="ESCAPED_URL"[ title="ESCAPED_TITLE"]>
+EXIT: </a>
+```
+
+URL safety: If `CMARK_OPT_UNSAFE` is NOT set, the URL is checked against `_scan_dangerous_url()`. Dangerous URLs (`javascript:`, `vbscript:`, `file:`, certain `data:` schemes) produce an empty `href`.
+
+URL escaping uses `houdini_escape_href()` which percent-encodes special characters. Title escaping uses `escape_html()`.
+
+#### Image
+```
+ENTER: <img src="ESCAPED_URL" alt="
+ (enters plain text mode — state->plain = node)
+EXIT: "[ title="ESCAPED_TITLE"] />
+```
+
+During plain text mode (between enter and exit), only text content, code content, and HTML inline content are output (HTML-escaped), and breaks are rendered as spaces.
+
+#### Custom Inline
+On enter, outputs `on_enter` literally. On exit, outputs `on_exit`.
+
+## URL Safety
+
+Links and images check URL safety unless `CMARK_OPT_UNSAFE` is set:
+
+```c
+if (node->as.link.url && ((options & CMARK_OPT_UNSAFE) ||
+ !(_scan_dangerous_url(node->as.link.url)))) {
+ houdini_escape_href(html, node->as.link.url,
+ (bufsize_t)strlen((char *)node->as.link.url));
+}
+```
+
+The `_scan_dangerous_url()` scanner (from `scanners.c`) matches schemes: `javascript:`, `vbscript:`, `file:`, and `data:` (except for safe image MIME types: `image/png`, `image/gif`, `image/jpeg`, `image/webp`).
+
+## Differences from Framework Renderers
+
+The HTML renderer differs from the render-framework-based renderers in several ways:
+
+1. **No line wrapping**: HTML output has no configurable width or word-wrap logic.
+2. **No prefix management**: Block quotes and lists don't use prefix strings for indentation — they use HTML tags.
+3. **Direct buffer writes**: All output goes directly to a `cmark_strbuf`, with no escaping dispatch function.
+4. **No `width` parameter**: `cmark_render_html()` takes only `root` and `options`.
+
+## Cross-References
+
+- [html.c](../../cmark/src/html.c) — Full implementation
+- [render-framework.md](render-framework.md) — The alternative render architecture used by other renderers
+- [iterator-system.md](iterator-system.md) — How the AST is traversed
+- [scanner-system.md](scanner-system.md) — `_scan_dangerous_url()` for URL safety
+- [public-api.md](public-api.md) — `cmark_render_html()` API documentation
diff --git a/docs/handbook/cmark/inline-parsing.md b/docs/handbook/cmark/inline-parsing.md
new file mode 100644
index 0000000000..4485017305
--- /dev/null
+++ b/docs/handbook/cmark/inline-parsing.md
@@ -0,0 +1,317 @@
+# cmark — Inline Parsing
+
+## Overview
+
+Inline parsing is Phase 2 of cmark's pipeline. Implemented in `inlines.c`, it processes the text content of paragraph and heading nodes, recognizing emphasis (`*`, `_`), code spans (`` ` ``), links (`[text](url)`), images (`![alt](url)`), autolinks (`<url>`), raw HTML inline, hard line breaks, soft line breaks, and smart punctuation.
+
+The entry point is `cmark_parse_inlines()`, called from `process_inlines()` in `blocks.c` after all block structure has been finalized.
+
+## The `subject` Struct
+
+All inline parsing state is tracked in the `subject` struct:
+
+```c
+typedef struct {
+ cmark_mem *mem; // Memory allocator
+ cmark_chunk input; // The text being parsed
+ unsigned flags; // Skip flags for HTML constructs
+ int line; // Source line number
+ bufsize_t pos; // Current byte position in input
+ int block_offset; // Column offset of the containing block
+ int column_offset; // Adjustment for multi-line source position tracking
+ cmark_reference_map *refmap; // Link reference definitions
+ delimiter *last_delim; // Top of delimiter stack (linked list, newest first)
+ bracket *last_bracket; // Top of bracket stack (linked list, newest first)
+ bufsize_t backticks[MAXBACKTICKS + 1]; // Cached positions of backtick sequences
+ bool scanned_for_backticks; // Whether the full input has been scanned for backticks
+ bool no_link_openers; // Optimization: set when no link openers remain
+} subject;
+```
+
+`MAXBACKTICKS` is defined as 1000. The `backticks` array caches the positions of backtick sequences of each length, enabling O(1) lookup once the input has been fully scanned.
+
+### Skip Flags
+
+The `flags` field uses bit flags to track which HTML constructs have been confirmed absent:
+
+```c
+#define FLAG_SKIP_HTML_CDATA (1u << 0)
+#define FLAG_SKIP_HTML_DECLARATION (1u << 1)
+#define FLAG_SKIP_HTML_PI (1u << 2)
+#define FLAG_SKIP_HTML_COMMENT (1u << 3)
+```
+
+Once a scan for a particular HTML construct fails, the flag is set to avoid rescanning.
+
+## The Delimiter Stack
+
+Emphasis and smart punctuation use a delimiter stack. Each entry is:
+
+```c
+typedef struct delimiter {
+ struct delimiter *previous; // Link to older delimiter
+ struct delimiter *next; // Link to newer delimiter (towards top)
+ cmark_node *inl_text; // The text node created for this delimiter run
+ bufsize_t position; // Position in the input
+ bufsize_t length; // Number of delimiter characters remaining
+ unsigned char delim_char; // '*', '_', '\'', or '"'
+ bool can_open; // Whether this run can open emphasis
+ bool can_close; // Whether this run can close emphasis
+} delimiter;
+```
+
+The stack is a doubly-linked list with `last_delim` pointing to the newest entry.
+
+## The Bracket Stack
+
+Links and images use a separate bracket stack:
+
+```c
+typedef struct bracket {
+ struct bracket *previous; // Link to older bracket
+ cmark_node *inl_text; // The text node for '[' or '!['
+ bufsize_t position; // Position in the input
+ bool image; // Whether this is an image opener '!['
+ bool active; // Can still match (set to false when deactivated)
+ bool bracket_after; // Whether a '[' appeared after this bracket
+} bracket;
+```
+
+Brackets are deactivated (set `active = false`) when:
+- A matching `]` fails to produce a valid link (the opener is deactivated to prevent infinite loops)
+- An inner link is formed (outer brackets are deactivated per spec)
+
+## Emphasis Flanking Rules: `scan_delims()`
+
+```c
+static int scan_delims(subject *subj, unsigned char c, bool *can_open,
+ bool *can_close);
+```
+
+This function determines whether a run of `*`, `_`, `'`, or `"` characters can open and/or close emphasis, following the CommonMark spec's Unicode-aware flanking rules:
+
+1. The function looks at the character **before** the run and the character **after** the run.
+2. It uses `cmark_utf8proc_iterate()` to decode the surrounding characters as full Unicode code points.
+3. It classifies them using `cmark_utf8proc_is_space()` and `cmark_utf8proc_is_punctuation_or_symbol()`.
+
+The flanking rules:
+- **Left-flanking**: numdelims > 0, character after is not a space, AND (character after is not punctuation OR character before is a space or punctuation)
+- **Right-flanking**: numdelims > 0, character before is not a space, AND (character before is not punctuation OR character after is a space or punctuation)
+
+For `*`: `can_open = left_flanking`, `can_close = right_flanking`
+
+For `_`:
+```c
+*can_open = left_flanking &&
+ (!right_flanking || cmark_utf8proc_is_punctuation_or_symbol(before_char));
+*can_close = right_flanking &&
+ (!left_flanking || cmark_utf8proc_is_punctuation_or_symbol(after_char));
+```
+
+For `'` and `"` (smart punctuation):
+```c
+*can_open = left_flanking &&
+ (!right_flanking || before_char == '(' || before_char == '[') &&
+ before_char != ']' && before_char != ')';
+*can_close = right_flanking;
+```
+
+The function advances `subj->pos` past the delimiter run and returns the number of delimiter characters consumed. For quotes, only 1 delimiter is consumed regardless of how many appear.
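+
+The flanking rules for `*` and `_` can be sketched in an ASCII-only form. This is a simplification under stated assumptions: cmark decodes full Unicode code points, whereas the hypothetical `demo_scan_delims` below classifies single bytes with `<ctype.h>`, treats the start and end of the string as a line ending, and omits the quote cases.
+
+```c
+#include <ctype.h>
+#include <stdbool.h>
+#include <string.h>
+
+/* ASCII-only stand-ins for cmark's Unicode-aware classifiers. */
+static bool demo_is_space(int c) {
+  return c == ' ' || c == '\t' || c == '\n' || c == '\r';
+}
+static bool demo_is_punct(int c) { return c < 128 && ispunct(c); }
+
+/* Classifies the delimiter run starting at text[pos]; returns its length. */
+int demo_scan_delims(const char *text, size_t pos, bool *can_open,
+                     bool *can_close) {
+  char c = text[pos];
+  size_t end = pos;
+  while (text[end] != '\0' && text[end] == c) end++;
+  /* The string boundaries count as line endings, i.e. whitespace. */
+  int before = pos == 0 ? '\n' : (unsigned char)text[pos - 1];
+  int after = text[end] == '\0' ? '\n' : (unsigned char)text[end];
+  bool left = end > pos && !demo_is_space(after) &&
+              (!demo_is_punct(after) || demo_is_space(before) ||
+               demo_is_punct(before));
+  bool right = end > pos && !demo_is_space(before) &&
+               (!demo_is_punct(before) || demo_is_space(after) ||
+                demo_is_punct(after));
+  if (c == '_') {
+    /* Underscores inside a word can neither open nor close. */
+    *can_open = left && (!right || demo_is_punct(before));
+    *can_close = right && (!left || demo_is_punct(after));
+  } else {
+    *can_open = left;
+    *can_close = right;
+  }
+  return (int)(end - pos);
+}
+```
+
+With this sketch the underscore in `foo_bar` is both left- and right-flanking, so it can neither open nor close, which is why intraword underscores stay literal while intraword asterisks do not.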
+
+## Emphasis Resolution: `S_insert_emph()`
+
+```c
+static delimiter *S_insert_emph(subject *subj, delimiter *opener,
+ delimiter *closer);
+```
+
+When a closing delimiter is found that matches an opener on the stack, this function creates emphasis nodes:
+
+1. If both the opener and the closer have at least 2 delimiter characters remaining, create a `CMARK_NODE_STRONG` node (consuming 2 characters from each).
+2. Otherwise, create a `CMARK_NODE_EMPH` node (consuming 1 character from each).
+3. All inline nodes between the opener and closer are moved to become children of the new emphasis node.
+4. Any delimiters between the opener and closer are removed from the stack.
+5. If the opener is exhausted (`length == 0`), it's removed from the stack.
+6. If the closer is exhausted, it's removed too; otherwise, processing continues.
+
+## Code Span Parsing: `handle_backticks()`
+
+```c
+static cmark_node *handle_backticks(subject *subj, int options);
+```
+
+When a backtick is encountered:
+
+1. `take_while(subj, isbacktick)` consumes the opening backtick run and records its length.
+2. `scan_to_closing_backticks()` searches forward for a matching backtick run of the same length.
+
+The scanning function uses the `subj->backticks[]` array to cache positions of backtick sequences. If `subj->scanned_for_backticks` is true and the cached position for the needed length is behind the current position, it immediately returns 0 (no match).
+
+If no closing backticks are found, the opening run is emitted as literal text. If they are found, the content between the two runs is extracted and normalized via `S_normalize_code()`:
+
+```c
+static void S_normalize_code(cmark_strbuf *s) {
+ // 1. Convert \r\n and \r to spaces
+ // 2. Convert \n to spaces
+ // 3. If content begins and ends with a space and contains non-space chars,
+ // strip one leading and one trailing space
+}
+```
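+
+The three steps in the comments above can be sketched as an in-place pass over a plain C string. This is an illustration, not the library routine: `demo_normalize_code` is a hypothetical name, and it operates on a NUL-terminated `char` buffer instead of a `cmark_strbuf`.
+
+```c
+#include <stdbool.h>
+#include <string.h>
+
+/* Normalizes code-span content in place. */
+void demo_normalize_code(char *s) {
+  size_t r, w = 0;
+  bool contains_nonspace = false;
+  for (r = 0; s[r] != '\0'; r++) {
+    if (s[r] == '\r') {
+      /* A lone \r becomes a space; for \r\n the \n branch writes it. */
+      if (s[r + 1] != '\n') s[w++] = ' ';
+    } else if (s[r] == '\n') {
+      s[w++] = ' ';
+    } else {
+      if (s[r] != ' ') contains_nonspace = true;
+      s[w++] = s[r];
+    }
+  }
+  s[w] = '\0';
+  /* Strip one space from each end unless the content is all spaces. */
+  if (contains_nonspace && w >= 2 && s[0] == ' ' && s[w - 1] == ' ') {
+    memmove(s, s + 1, w - 2);
+    s[w - 2] = '\0';
+  }
+}
+```
+
+The single-space strip is what lets an author write a code span whose content begins or ends with a backtick: the padding spaces are removed again on parsing.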
+
+## Link Parsing
+
+When `]` is encountered after an opener on the bracket stack:
+
+### Inline Links: `[text](url "title")`
+
+The parser looks for `(` immediately after `]`, then:
+1. Skips optional whitespace
+2. Tries to parse a link destination (URL)
+3. Skips optional whitespace
+4. Optionally parses a link title (in single quotes, double quotes, or parentheses)
+5. Expects `)`
+
+### Reference Links: `[text][ref]` or `[text][]` or `[text]`
+
+If the inline link syntax doesn't match, the parser tries:
+1. `[text][ref]` — explicit reference
+2. `[text][]` — collapsed reference (label = text)
+3. `[text]` — shortcut reference (label = text)
+
+Reference lookup uses `cmark_reference_lookup()` against the parser's `refmap`.
+
+### URL Cleaning
+
+```c
+unsigned char *cmark_clean_url(cmark_mem *mem, cmark_chunk *url);
+```
+
+Trims the URL, unescapes HTML entities, and handles angle-bracket-delimited URLs.
+
+### Autolinks
+
+```c
+static inline cmark_node *make_autolink(subject *subj, int start_column,
+ int end_column, cmark_chunk url,
+ int is_email);
+```
+
+Autolinks (`<http://example.com>` or `<user@example.com>`) are detected via the `scan_autolink_uri()` and `scan_autolink_email()` scanner functions. Email autolinks have `mailto:` prepended to the URL automatically:
+
+```c
+static unsigned char *cmark_clean_autolink(cmark_mem *mem, cmark_chunk *url,
+ int is_email) {
+ cmark_strbuf buf = CMARK_BUF_INIT(mem);
+ cmark_chunk_trim(url);
+ if (is_email)
+ cmark_strbuf_puts(&buf, "mailto:");
+ houdini_unescape_html_f(&buf, url->data, url->len);
+ return cmark_strbuf_detach(&buf);
+}
+```
+
+## Smart Punctuation
+
+When `CMARK_OPT_SMART` is enabled, the inline parser transforms:
+
+```c
+static const char *EMDASH = "\xE2\x80\x94"; // —
+static const char *ENDASH = "\xE2\x80\x93"; // –
+static const char *ELLIPSES = "\xE2\x80\xA6"; // …
+static const char *LEFTDOUBLEQUOTE = "\xE2\x80\x9C"; // “
+static const char *RIGHTDOUBLEQUOTE = "\xE2\x80\x9D"; // ”
+static const char *LEFTSINGLEQUOTE = "\xE2\x80\x98"; // ‘
+static const char *RIGHTSINGLEQUOTE = "\xE2\x80\x99"; // ’
+```
+
+- `---` becomes em dash (—)
+- `--` becomes en dash (–)
+- `...` becomes ellipsis (…)
+- `'` and `"` are converted to curly quotes using the delimiter stack (open/close logic)
+
+## Hard and Soft Line Breaks
+
+- **Hard line break**: Two or more spaces before a line ending, or a backslash before a line ending. Creates a `CMARK_NODE_LINEBREAK` node.
+- **Soft line break**: A line ending not preceded by spaces or backslash. Creates a `CMARK_NODE_SOFTBREAK` node.
+
+## Special Character Dispatch
+
+```c
+static bufsize_t subject_find_special_char(subject *subj, int options);
+```
+
+This function scans forward from `subj->pos` looking for the next special character that needs inline processing. Special characters include:
+- Line endings (`\r`, `\n`)
+- Backtick (`` ` ``)
+- Backslash (`\`)
+- Ampersand (`&`)
+- Less-than (`<`)
+- Open bracket (`[`)
+- Close bracket (`]`)
+- Exclamation mark (`!`)
+- Emphasis characters (`*`, `_`)
+
+Any text between special characters is collected as a `CMARK_NODE_TEXT` node.
+
+## Source Position Tracking
+
+```c
+static void adjust_subj_node_newlines(subject *subj, cmark_node *node,
+ int matchlen, int extra, int options);
+```
+
+When `CMARK_OPT_SOURCEPOS` is enabled, this function adjusts source positions for multi-line inline constructs. It counts newlines in the just-matched span and updates:
+- `subj->line` — incremented by the number of newlines
+- `node->end_line` — adjusted for multi-line spans
+- `node->end_column` — set to characters after the last newline
+- `subj->column_offset` — adjusted for correct subsequent position calculations
+
+## Inline Node Factory Functions
+
+The inline parser uses efficient factory functions:
+
+```c
+// Macros for simple nodes
+#define make_linebreak(mem) make_simple(mem, CMARK_NODE_LINEBREAK)
+#define make_softbreak(mem) make_simple(mem, CMARK_NODE_SOFTBREAK)
+#define make_emph(mem) make_simple(mem, CMARK_NODE_EMPH)
+#define make_strong(mem) make_simple(mem, CMARK_NODE_STRONG)
+```
+
+```c
+// Fast child appending (bypasses S_can_contain validation)
+static void append_child(cmark_node *node, cmark_node *child) {
+ cmark_node *old_last_child = node->last_child;
+ child->next = NULL;
+ child->prev = old_last_child;
+ child->parent = node;
+ node->last_child = child;
+ if (old_last_child) {
+ old_last_child->next = child;
+ } else {
+ node->first_child = child;
+ }
+}
+```
+
+This `append_child()` is a simplified version of the public `cmark_node_append_child()`, skipping containership validation since the inline parser always produces valid structures.
+
+## The Main Parse Loop
+
+```c
+void cmark_parse_inlines(cmark_mem *mem, cmark_node *parent,
+ cmark_reference_map *refmap, int options);
+```
+
+This function initializes a `subject` from the parent node's `data` field, then repeatedly calls `parse_inline()` until the input is exhausted. Each call to `parse_inline()` finds the next special character, emits any preceding text as a `CMARK_NODE_TEXT`, and dispatches to the appropriate handler.
+
+After all characters are processed, the delimiter stack is processed to resolve any remaining emphasis, and then cleaned up.
+
+## Cross-References
+
+- [inlines.c](../../cmark/src/inlines.c) — Full implementation
+- [inlines.h](../../cmark/src/inlines.h) — Internal API declarations
+- [block-parsing.md](block-parsing.md) — Phase 1 that produces the input for inline parsing
+- [reference-system.md](reference-system.md) — How link references are stored and looked up
+- [scanner-system.md](scanner-system.md) — Scanner functions for HTML tags, autolinks, etc.
+- [utf8-handling.md](utf8-handling.md) — Unicode character classification for flanking rules
diff --git a/docs/handbook/cmark/iterator-system.md b/docs/handbook/cmark/iterator-system.md
new file mode 100644
index 0000000000..3cdcfda66e
--- /dev/null
+++ b/docs/handbook/cmark/iterator-system.md
@@ -0,0 +1,267 @@
+# cmark — Iterator System
+
+## Overview
+
+The iterator system (`iterator.c`, `iterator.h`) provides depth-first traversal of the AST using an event-based model. Each container node is visited twice: once on `CMARK_EVENT_ENTER` (before its children) and once on `CMARK_EVENT_EXIT` (after its children). Leaf nodes are visited once, with `CMARK_EVENT_ENTER` only; they never produce an exit event.
+
+All renderers (HTML, XML, LaTeX, man, CommonMark) use the iterator as their traversal mechanism.
+
+## Public API
+
+```c
+cmark_iter *cmark_iter_new(cmark_node *root);
+void cmark_iter_free(cmark_iter *iter);
+cmark_event_type cmark_iter_next(cmark_iter *iter);
+cmark_node *cmark_iter_get_node(cmark_iter *iter);
+cmark_event_type cmark_iter_get_event_type(cmark_iter *iter);
+cmark_node *cmark_iter_get_root(cmark_iter *iter);
+void cmark_iter_reset(cmark_iter *iter, cmark_node *current, cmark_event_type event_type);
+```
+
+## Iterator State
+
+```c
+struct cmark_iter {
+ cmark_mem *mem;
+ cmark_node *root;
+ cmark_node *cur;
+ cmark_event_type ev_type;
+};
+```
+
+The iterator stores:
+- `root` — The subtree root (traversal boundary)
+- `cur` — Current node
+- `ev_type` — Current event (`CMARK_EVENT_ENTER`, `CMARK_EVENT_EXIT`, `CMARK_EVENT_DONE`, or `CMARK_EVENT_NONE`)
+
+## Event Types
+
+```c
+typedef enum {
+ CMARK_EVENT_NONE, // Initial state
+ CMARK_EVENT_DONE, // Traversal complete (exited root)
+ CMARK_EVENT_ENTER, // Entering a node (pre-children)
+ CMARK_EVENT_EXIT, // Exiting a node (post-children)
+} cmark_event_type;
+```
+
+## Leaf Node Detection
+
+```c
+static const int S_leaf_mask =
+ (1 << CMARK_NODE_HTML_BLOCK) | (1 << CMARK_NODE_THEMATIC_BREAK) |
+ (1 << CMARK_NODE_CODE_BLOCK) | (1 << CMARK_NODE_TEXT) |
+ (1 << CMARK_NODE_SOFTBREAK) | (1 << CMARK_NODE_LINEBREAK) |
+ (1 << CMARK_NODE_CODE) | (1 << CMARK_NODE_HTML_INLINE);
+
+static bool S_is_leaf(cmark_node *node) {
+ return ((1 << node->type) & S_leaf_mask) != 0;
+}
+```
+
+Leaf nodes are determined by a bitmask — not by checking whether `first_child` is NULL. This means an emphasis node with no children is still treated as a container (it receives separate enter and exit events).
+
+The leaf node types are:
+- **Block leaves**: `HTML_BLOCK`, `THEMATIC_BREAK`, `CODE_BLOCK`
+- **Inline leaves**: `TEXT`, `SOFTBREAK`, `LINEBREAK`, `CODE`, `HTML_INLINE`
+
+## Traversal Algorithm
+
+`cmark_iter_next()` implements the state machine:
+
+```c
+cmark_event_type cmark_iter_next(cmark_iter *iter) {
+ cmark_event_type ev_type = iter->ev_type;
+ cmark_node *cur = iter->cur;
+
+ if (ev_type == CMARK_EVENT_DONE) {
+ return CMARK_EVENT_DONE;
+ }
+
+ // For initial state, start with ENTER on root
+ if (ev_type == CMARK_EVENT_NONE) {
+ iter->ev_type = CMARK_EVENT_ENTER;
+ return iter->ev_type;
+ }
+
+ if (ev_type == CMARK_EVENT_ENTER && !S_is_leaf(cur)) {
+ // Container node being entered — descend to first child if it exists
+ if (cur->first_child) {
+ iter->ev_type = CMARK_EVENT_ENTER;
+ iter->cur = cur->first_child;
+ } else {
+ // Empty container — immediately exit
+ iter->ev_type = CMARK_EVENT_EXIT;
+ }
+ } else if (cur == iter->root) {
+ // Exiting root (or leaf at root) — done
+ iter->ev_type = CMARK_EVENT_DONE;
+ iter->cur = NULL;
+ } else if (cur->next) {
+ // Move to next sibling
+ iter->ev_type = CMARK_EVENT_ENTER;
+ iter->cur = cur->next;
+ } else if (cur->parent) {
+ // No more siblings — exit parent
+ iter->ev_type = CMARK_EVENT_EXIT;
+ iter->cur = cur->parent;
+ } else {
+ // Orphan node — done
+ assert(false);
+ iter->ev_type = CMARK_EVENT_DONE;
+ iter->cur = NULL;
+ }
+
+ return iter->ev_type;
+}
+```
+
+### State Transition Summary
+
+| Current State | Condition | Next State |
+|--------------|-----------|------------|
+| `NONE` | (initial) | `ENTER(root)` |
+| `ENTER(container)` | has children | `ENTER(first_child)` |
+| `ENTER(container)` | no children | `EXIT(container)` |
+| `ENTER(leaf)` or `EXIT(node)` | node == root | `DONE` |
+| `ENTER(leaf)` or `EXIT(node)` | has next sibling | `ENTER(next)` |
+| `ENTER(leaf)` or `EXIT(node)` | has parent | `EXIT(parent)` |
+| `DONE` | (terminal) | `DONE` |
+
+### Traversal Order Example
+
+For a document with a paragraph containing "Hello *world*":
+
+```
+Document
+└── Paragraph
+    ├── Text("Hello ")
+    └── Emph
+        └── Text("world")
+```
+
+Event sequence:
+1. `ENTER(Document)`
+2. `ENTER(Paragraph)`
+3. `ENTER(Text "Hello ")` (leaf: no `EXIT` event follows)
+4. `ENTER(Emph)`
+5. `ENTER(Text "world")` (leaf: no `EXIT` event follows)
+6. `EXIT(Emph)`
+7. `EXIT(Paragraph)`
+8. `EXIT(Document)`
+9. `DONE`
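+
+The sequence above can be reproduced with a small stand-alone model of the state machine. This is a sketch, not the cmark source: `toy_node`, `toy_iter`, `toy_next`, and `toy_trace` are hypothetical names, only `Text` is treated as a leaf, and the tree is built by hand rather than parsed.
+
+```c
+#include <assert.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <string.h>
+
+/* Hypothetical toy types; these are not the cmark definitions. */
+typedef enum { EV_NONE, EV_DONE, EV_ENTER, EV_EXIT } toy_event;
+typedef enum { T_DOCUMENT, T_PARAGRAPH, T_EMPH, T_TEXT } toy_type;
+
+typedef struct toy_node {
+    toy_type type;
+    struct toy_node *parent, *next, *first_child;
+} toy_node;
+
+typedef struct {
+    toy_node *root, *cur;
+    toy_event ev;
+} toy_iter;
+
+static bool toy_is_leaf(const toy_node *n) { return n->type == T_TEXT; }
+
+/* One step of the state machine, following the transition table above. */
+static toy_event toy_next(toy_iter *it) {
+    toy_node *cur = it->cur;
+    if (it->ev == EV_DONE) {
+        return EV_DONE;
+    }
+    if (it->ev == EV_NONE) {
+        it->ev = EV_ENTER;                 /* ENTER(root) */
+    } else if (it->ev == EV_ENTER && !toy_is_leaf(cur)) {
+        if (cur->first_child) {
+            it->cur = cur->first_child;    /* event stays ENTER */
+        } else {
+            it->ev = EV_EXIT;              /* empty container */
+        }
+    } else if (cur == it->root) {
+        it->ev = EV_DONE;
+        it->cur = NULL;
+    } else if (cur->next) {
+        it->ev = EV_ENTER;
+        it->cur = cur->next;
+    } else {
+        assert(cur->parent != NULL);
+        it->ev = EV_EXIT;
+        it->cur = cur->parent;
+    }
+    return it->ev;
+}
+
+/* Walks Document > Paragraph > [Text, Emph > [Text]] and records one
+ * letter per event: uppercase for ENTER, lowercase for EXIT. */
+static void toy_trace(char *out, size_t cap) {
+    static const char enter[] = "DPET", leave[] = "dpet";
+    toy_node doc = {T_DOCUMENT, NULL, NULL, NULL};
+    toy_node para = {T_PARAGRAPH, &doc, NULL, NULL};
+    toy_node hello = {T_TEXT, &para, NULL, NULL};
+    toy_node emph = {T_EMPH, &para, NULL, NULL};
+    toy_node world = {T_TEXT, &emph, NULL, NULL};
+    toy_iter it = {&doc, &doc, EV_NONE};
+    toy_event ev;
+    size_t n = 0;
+
+    assert(cap > 0);
+    doc.first_child = &para;
+    para.first_child = &hello;
+    hello.next = &emph;
+    emph.first_child = &world;
+
+    while ((ev = toy_next(&it)) != EV_DONE && n + 1 < cap) {
+        out[n++] = (ev == EV_ENTER ? enter : leave)[it.cur->type];
+    }
+    out[n] = '\0';
+}
+```
+
+Calling `toy_trace()` with a 16-byte buffer yields the string `DPTETepd`: eight events, with each `T` (text leaf) followed directly by the next sibling or by the parent's exit, never by its own exit.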
+
+## Iterator Reset
+
+```c
+void cmark_iter_reset(cmark_iter *iter, cmark_node *current,
+ cmark_event_type event_type) {
+ iter->cur = current;
+ iter->ev_type = event_type;
+}
+```
+
+Allows repositioning the iterator to any node and event type. This is used by renderers to skip subtrees — e.g., when the HTML renderer processes an image node, it may skip children after extracting alt text.
+
+## Text Node Consolidation
+
+```c
+void cmark_consolidate_text_nodes(cmark_node *root) {
+ if (root == NULL) return;
+ cmark_iter *iter = cmark_iter_new(root);
+ cmark_strbuf buf = CMARK_BUF_INIT(iter->mem);
+ cmark_event_type ev_type;
+ cmark_node *cur, *tmp, *next;
+
+ while ((ev_type = cmark_iter_next(iter)) != CMARK_EVENT_DONE) {
+ cur = cmark_iter_get_node(iter);
+ if (ev_type == CMARK_EVENT_ENTER && cur->type == CMARK_NODE_TEXT &&
+ cur->next && cur->next->type == CMARK_NODE_TEXT) {
+ // Merge consecutive text nodes
+ cmark_strbuf_clear(&buf);
+ cmark_strbuf_put(&buf, cur->data, cur->len);
+ tmp = cur->next;
+ while (tmp && tmp->type == CMARK_NODE_TEXT) {
+ cmark_iter_reset(iter, tmp, CMARK_EVENT_ENTER);
+ cmark_strbuf_put(&buf, tmp->data, tmp->len);
+ cur->end_column = tmp->end_column;
+ next = tmp->next;
+ cmark_node_free(tmp);
+ tmp = next;
+ }
+ // Replace cur's data with merged content
+ cmark_chunk_free(iter->mem, &cur->as.literal);
+ cmark_strbuf_trim(&buf);
+ // ... set cur->data and cur->len
+ }
+ }
+ cmark_strbuf_free(&buf);
+ cmark_iter_free(iter);
+}
+```
+
+This function merges adjacent text nodes into a single text node. Adjacent text nodes can arise from inline parsing (e.g., when backslash escapes split text). The function:
+
+1. Finds consecutive text node runs
+2. Concatenates their content into a buffer
+3. Updates the first node's content and end position
+4. Frees the subsequent nodes
+5. Uses `cmark_iter_reset()` to skip freed nodes
+
+## Usage Patterns
+
+### Standard Rendering Loop
+
+```c
+cmark_iter *iter = cmark_iter_new(root);
+while ((ev_type = cmark_iter_next(iter)) != CMARK_EVENT_DONE) {
+ cur = cmark_iter_get_node(iter);
+ S_render_node(cur, ev_type, &state, options);
+}
+cmark_iter_free(iter);
+```
+
+### Skipping Children
+
+To skip rendering of a node's children (e.g., for image alt text in HTML):
+```c
+if (ev_type == CMARK_EVENT_ENTER) {
+ cmark_iter_reset(iter, node, CMARK_EVENT_EXIT);
+}
+```
+
+This jumps directly to the exit event, bypassing all children.
+
+### Safe Node Removal
+
+The iterator handles node removal between calls. Since `cmark_iter_next()` always follows `next` and `parent` pointers from the current position, removing the current node is safe as long as:
+1. The node's `next` and `parent` pointers remain valid
+2. The iterator is reset to skip the removed node's children
+
+## Thread Safety
+
+A `cmark_iter` is not thread-safe: a single iterator must not be advanced from several threads without external synchronization. Each iterator keeps its own position and only reads the tree, so separate iterators may walk the same AST concurrently, provided no thread modifies the AST while they run.
+
+## Memory
+
+The iterator allocates a `cmark_iter` struct using the root node's memory allocator:
+```c
+cmark_iter *cmark_iter_new(cmark_node *root) {
+ cmark_mem *mem = root->mem;
+ cmark_iter *iter = (cmark_iter *)mem->calloc(1, sizeof(cmark_iter));
+ iter->mem = mem;
+ iter->root = root;
+ iter->cur = root;
+ iter->ev_type = CMARK_EVENT_NONE;
+ return iter;
+}
+```
+
+## Cross-References
+
+- [iterator.c](../../cmark/src/iterator.c) — Iterator implementation
+- [iterator.h](../../cmark/src/iterator.h) — Iterator struct definition
+- [ast-node-system.md](ast-node-system.md) — The nodes being traversed
+- [html-renderer.md](html-renderer.md) — Example of iterator-driven rendering
+- [render-framework.md](render-framework.md) — Framework that wraps iterator use
diff --git a/docs/handbook/cmark/latex-renderer.md b/docs/handbook/cmark/latex-renderer.md
new file mode 100644
index 0000000000..d7a492d580
--- /dev/null
+++ b/docs/handbook/cmark/latex-renderer.md
@@ -0,0 +1,320 @@
+# cmark — LaTeX Renderer
+
+## Overview
+
+The LaTeX renderer (`latex.c`) converts a `cmark_node` AST into LaTeX source, suitable for compilation with `pdflatex`, `xelatex`, or `lualatex`. It uses the generic render framework from `render.c`, operating through a per-character output callback (`outc`) and a per-node render callback (`S_render_node`).
+
+## Entry Point
+
+```c
+char *cmark_render_latex(cmark_node *root, int options, int width);
+```
+
+- `root` — AST root node
+- `options` — Option flags (`CMARK_OPT_SOURCEPOS`, `CMARK_OPT_HARDBREAKS`, `CMARK_OPT_NOBREAKS`, `CMARK_OPT_UNSAFE`)
+- `width` — Target line width for hard-wrapping; 0 disables wrapping
+
+## Character Escaping (`outc`)
+
+The `outc` function handles per-character output decisions. It is the most complex part of the LaTeX renderer, with different behavior for three escaping modes:
+
+```c
+static void outc(cmark_renderer *renderer, cmark_escaping escape,
+ int32_t c, unsigned char nextc);
+```
+
+### LITERAL Mode
+Pass-through: all characters are output unchanged.
+
+### NORMAL Mode
+Extensive special-character handling:
+
+| Character | LaTeX Output | Purpose |
+|-----------|-------------|---------|
+| `$` | `\$` | Math mode delimiter |
+| `%` | `\%` | Comment character |
+| `&` | `\&` | Table column separator |
+| `_` | `\_` | Subscript operator |
+| `#` | `\#` | Parameter reference |
+| `^` | `\^{}` | Superscript operator |
+| `{` | `\{` | Group open |
+| `}` | `\}` | Group close |
+| `~` | `\textasciitilde{}` | Non-breaking space |
+| `[` | `{[}` | Optional argument bracket |
+| `]` | `{]}` | Optional argument bracket |
+| `\` | `\textbackslash{}` | Escape character |
+| `\|` | `\textbar{}` | Pipe |
+| `'` | `\textquotesingle{}` | Straight single quote |
+| `"` | `\textquotedbl{}` | Straight double quote |
+| `` ` `` | `\textasciigrave{}` | Backtick |
+| U+00A0 (NBSP) | `~` | LaTeX non-breaking space |
+| U+2014 (—) | `---` | Em dash |
+| U+2013 (–) | `--` | En dash |
+| U+2018 (‘) | `` ` `` | Left single quote |
+| U+2019 (’) | `'` | Right single quote |
+| U+201C (“) | ` `` ` | Left double quote |
+| U+201D (”) | `''` | Right double quote |
+
+### URL Mode
+Only these characters are escaped:
+- `$` → `\$`
+- `%` → `\%`
+- `&` → `\&`
+- `_` → `\_`
+- `#` → `\#`
+- `{` → `\{`
+- `}` → `\}`
+
+All other characters pass through unchanged.
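+
+A reduced version of the NORMAL-mode mapping can be sketched as a plain string function. This is not the `outc` callback from `latex.c`: `toy_latex_replacement` and `toy_latex_escape` are hypothetical helpers, they cover only a handful of ASCII rows from the table above, and they write into a caller-supplied buffer instead of a `cmark_renderer`.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+#include <string.h>
+
+/* Replacement for one ASCII character, or NULL if the character passes
+ * through unchanged. Only a subset of the NORMAL-mode table. */
+static const char *toy_latex_replacement(char c) {
+    switch (c) {
+    case '$': return "\\$";
+    case '%': return "\\%";
+    case '&': return "\\&";
+    case '_': return "\\_";
+    case '#': return "\\#";
+    case '{': return "\\{";
+    case '}': return "\\}";
+    case '^': return "\\^{}";
+    case '~': return "\\textasciitilde{}";
+    case '\\': return "\\textbackslash{}";
+    default: return NULL;
+    }
+}
+
+/* Escapes src into dst (capacity cap, terminator included).
+ * Returns the output length, or (size_t)-1 if dst is too small. */
+static size_t toy_latex_escape(const char *src, char *dst, size_t cap) {
+    size_t n = 0;
+    assert(src != NULL && dst != NULL);
+    for (; *src; src++) {
+        const char *rep = toy_latex_replacement(*src);
+        size_t len = rep ? strlen(rep) : 1;
+        if (n + len >= cap) {
+            return (size_t)-1;   /* no room for text plus terminator */
+        }
+        if (rep) {
+            memcpy(dst + n, rep, len);
+        } else {
+            dst[n] = *src;
+        }
+        n += len;
+    }
+    if (n >= cap) {
+        return (size_t)-1;       /* only reachable when cap == 0 */
+    }
+    dst[n] = '\0';
+    return n;
+}
+```
+
+For the input `50% of $x_1` this produces `50\% of \$x\_1`. A URL-mode variant would keep only the seven cases listed under "URL Mode" and let every other character through.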
+
+## Link Type Classification
+
+The renderer classifies links into five categories:
+
+```c
+typedef enum {
+ NO_LINK,
+ URL_AUTOLINK,
+ EMAIL_AUTOLINK,
+ NORMAL_LINK,
+ INTERNAL_LINK,
+} link_type;
+```
+
+### `get_link_type()`
+
+```c
+static link_type get_link_type(cmark_node *node) {
+ // 1. "mailto:" links where text matches url
+ // 2. "http[s]:" links where text matches url (with or without protocol)
+ // 3. Links starting with '#' → INTERNAL_LINK
+ // 4. Everything else → NORMAL_LINK
+}
+```
+
+Detection logic:
+1. **URL_AUTOLINK**: The `url` starts with `http://` or `https://`, the link has exactly one text child, and that child's content matches the URL (or matches the URL minus the protocol prefix).
+2. **EMAIL_AUTOLINK**: The `url` starts with `mailto:`, the link has exactly one text child, and that child's content matches the URL after `mailto:`.
+3. **INTERNAL_LINK**: The `url` starts with `#`.
+4. **NORMAL_LINK**: Everything else.
+
+## Enumeration Level
+
+For nested ordered lists, the renderer selects the appropriate LaTeX counter style:
+
+```c
+static int S_get_enumlevel(cmark_node *node) {
+ int enumlevel = 0;
+ cmark_node *tmp = node;
+ while (tmp) {
+ if (tmp->type == CMARK_NODE_LIST &&
+ cmark_node_get_list_type(tmp) == CMARK_ORDERED_LIST) {
+ enumlevel++;
+ }
+ tmp = tmp->parent;
+ }
+ return enumlevel;
+}
+```
+
+This walks up the tree, counting ordered list ancestors. LaTeX ordered lists cycle through: `enumi` (arabic), `enumii` (alpha), `enumiii` (roman), `enumiv` (Alpha).
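+
+The ancestor walk can be tried on a hand-built parent chain. The sketch below is illustrative only: `toy_node`, `toy_kind`, `toy_enumlevel`, and `toy_counter_name` are hypothetical names, and the counter table simply restates the four names listed above.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+#include <string.h>
+
+typedef enum { TOY_OTHER, TOY_BULLET_LIST, TOY_ORDERED_LIST } toy_kind;
+
+typedef struct toy_node {
+    toy_kind kind;
+    struct toy_node *parent;
+} toy_node;
+
+/* Counts ordered-list nodes from `node` up to the root, inclusive. */
+static int toy_enumlevel(const toy_node *node) {
+    int level = 0;
+    for (; node != NULL; node = node->parent) {
+        if (node->kind == TOY_ORDERED_LIST) {
+            level++;
+        }
+    }
+    return level;
+}
+
+/* Counter name for a level, or NULL outside the four levels above. */
+static const char *toy_counter_name(int level) {
+    static const char *const names[] = {"enumi", "enumii", "enumiii", "enumiv"};
+    return (level >= 1 && level <= 4) ? names[level - 1] : NULL;
+}
+```
+
+Bullet lists between two ordered lists do not affect the result: an ordered list nested inside a bullet list inside another ordered list is still at level 2 and would use `enumii`.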
+
+## Node Rendering (`S_render_node`)
+
+### Block Nodes
+
+#### Document
+No output.
+
+#### Block Quote
+```
+ENTER: \begin{quote}\n
+EXIT: \end{quote}\n
+```
+
+#### List
+```
+ENTER (bullet): \begin{itemize}\n
+ENTER (ordered): \begin{enumerate}\n
+ \def\labelenumI{COUNTER}\n (if start != 1)
+ \setcounter{enumI}{START-1}\n
+EXIT: \end{itemize}\n or \end{enumerate}\n
+```
+
+The counter is formatted based on enumeration level:
+- Level 1: `\arabic{enumi}.`
+- Level 2: `\alph{enumii}.` (surrounded by `(`)
+- Level 3: `\roman{enumiii}.`
+- Level 4: `\Alph{enumiv}.`
+
+Period delimiters use `.`, parenthesis delimiters use `)`.
+
+#### Item
+```
+ENTER: \item{} (empty braces prevent ligatures with following content)
+EXIT: \n
+```
+
+#### Heading
+```
+ENTER: \section{ or \subsection{ or \subsubsection{ or \paragraph{ or \subparagraph{
+EXIT: }\n
+```
+
+Mapping: level 1 → `\section`, level 2 → `\subsection`, level 3 → `\subsubsection`, level 4 → `\paragraph`, level 5 → `\subparagraph`.
+
+#### Code Block
+```latex
+\begin{verbatim}
+LITERAL CONTENT
+\end{verbatim}
+```
+
+The content is output in `LITERAL` escape mode (no character escaping). Info strings are ignored.
+
+#### HTML Block
+```
+ENTER: % raw HTML omitted\n (as a LaTeX comment)
+```
+
+Raw HTML is always omitted in LaTeX output, regardless of `CMARK_OPT_UNSAFE`.
+
+#### Thematic Break
+```
+\begin{center}\rule{0.5\linewidth}{\linethickness}\end{center}\n
+```
+
+#### Paragraph
+Same tight-list check as the HTML renderer:
+```c
+parent = cmark_node_parent(node);
+grandparent = cmark_node_parent(parent);
+tight = (grandparent && grandparent->type == CMARK_NODE_LIST) ?
+ grandparent->as.list.tight : false;
+```
+- Normal: newline before and after
+- Tight: no leading/trailing blank lines
+
+### Inline Nodes
+
+#### Text
+Output with NORMAL escaping.
+
+#### Soft Break
+Depends on options:
+- `CMARK_OPT_HARDBREAKS`: `\\` followed by a newline (written as `"\\\\"` in the C source)
+- `CMARK_OPT_NOBREAKS`: space
+- Default: newline
+
+#### Line Break
+```
+\\\n
+```
+
+#### Code (inline)
+```
+\texttt{ESCAPED CONTENT}
+```
+
+Special handling: Code content is output character-by-character with inline-code escaping. Special characters (`\`, `{`, `}`, `$`, `%`, `&`, `_`, `#`, `^`, `~`) are escaped.
+
+#### Emphasis
+```
+ENTER: \emph{
+EXIT: }
+```
+
+#### Strong
+```
+ENTER: \textbf{
+EXIT: }
+```
+
+#### Link
+Rendering depends on link type:
+
+**NORMAL_LINK:**
+```
+ENTER: \href{URL}{
+EXIT: }
+```
+
+**URL_AUTOLINK:**
+```
+ENTER: \url{URL}
+(children are skipped — no EXIT rendering needed)
+```
+
+**EMAIL_AUTOLINK:**
+```
+ENTER: \href{URL}{\nolinkurl{
+EXIT: }}
+```
+
+**INTERNAL_LINK:**
+```
+ENTER: (nothing — rendered as plain text)
+EXIT: (~\ref{LABEL})
+```
+
+Where `LABEL` is the URL with the leading `#` stripped.
+
+**NO_LINK:**
+No output.
+
+#### Image
+```
+ENTER: \protect\includegraphics{URL}
+```
+
+Image children (alt text) are skipped. If `CMARK_OPT_UNSAFE` is not set and the URL matches `_scan_dangerous_url()`, the URL is omitted.
+
+#### HTML Inline
+```
+% raw HTML omitted
+```
+
+Always omitted, regardless of `CMARK_OPT_UNSAFE`.
+
+## Source Position Comments
+
+When `CMARK_OPT_SOURCEPOS` is set, the renderer adds LaTeX comments before block elements:
+
+```c
+snprintf(buffer, BUFFER_SIZE, "%% %d:%d-%d:%d\n",
+ cmark_node_get_start_line(node), cmark_node_get_start_column(node),
+ cmark_node_get_end_line(node), cmark_node_get_end_column(node));
+```
+
+## Example Output
+
+Markdown input:
+```markdown
+# Hello World
+
+A paragraph with *emphasis* and **bold**.
+
+- Item 1
+- Item 2
+```
+
+LaTeX output:
+```latex
+\section{Hello World}
+
+A paragraph with \emph{emphasis} and \textbf{bold}.
+
+\begin{itemize}
+\item{}Item 1
+
+\item{}Item 2
+
+\end{itemize}
+```
+
+## Cross-References
+
+- [latex.c](../../cmark/src/latex.c) — Full implementation
+- [render-framework.md](render-framework.md) — Generic render framework (`cmark_render()`, `cmark_renderer`)
+- [public-api.md](public-api.md) — `cmark_render_latex()` API docs
+- [html-renderer.md](html-renderer.md) — Contrast with direct buffer renderer
diff --git a/docs/handbook/cmark/man-renderer.md b/docs/handbook/cmark/man-renderer.md
new file mode 100644
index 0000000000..cae1c6dbf3
--- /dev/null
+++ b/docs/handbook/cmark/man-renderer.md
@@ -0,0 +1,272 @@
+# cmark — Man Page Renderer
+
+## Overview
+
+The man page renderer (`man.c`) converts a `cmark_node` AST into roff/troff format suitable for the Unix `man` page system. It uses the generic render framework from `render.c`.
+
+## Entry Point
+
+```c
+char *cmark_render_man(cmark_node *root, int options, int width);
+```
+
+- `root` — AST root node
+- `options` — Option flags (`CMARK_OPT_HARDBREAKS`, `CMARK_OPT_NOBREAKS`, `CMARK_OPT_SOURCEPOS`, `CMARK_OPT_UNSAFE`)
+- `width` — Target line width for wrapping; 0 disables wrapping
+
+## Character Escaping (`S_outc`)
+
+The man page escaping is simpler than LaTeX. The `S_outc` function handles:
+
+```c
+static void S_outc(cmark_renderer *renderer, cmark_escaping escape,
+ int32_t c, unsigned char nextc) {
+ if (escape == LITERAL) {
+ cmark_render_code_point(renderer, c);
+ return;
+ }
+ switch (c) {
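+  case 46: // '.', escaped only at line start
+    if (renderer->begin_line) {
+      cmark_render_ascii(renderer, "\\&.");
+    } else {
+      cmark_render_code_point(renderer, c);
+    }
+    break;
+  case 39: // '\'', escaped only at line start
+    if (renderer->begin_line) {
+      cmark_render_ascii(renderer, "\\&'");
+    } else {
+      cmark_render_code_point(renderer, c);
+    }
+    break;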
+ case 45: // '-'
+ cmark_render_ascii(renderer, "\\-");
+ break;
+ case 92: // '\\'
+ cmark_render_ascii(renderer, "\\e");
+ break;
+ case 8216: // ' (left single quote)
+ cmark_render_ascii(renderer, "\\[oq]");
+ break;
+ case 8217: // ' (right single quote)
+ cmark_render_ascii(renderer, "\\[cq]");
+ break;
+ case 8220: // " (left double quote)
+ cmark_render_ascii(renderer, "\\[lq]");
+ break;
+ case 8221: // " (right double quote)
+ cmark_render_ascii(renderer, "\\[rq]");
+ break;
+ case 8212: // — (em dash)
+ cmark_render_ascii(renderer, "\\[em]");
+ break;
+ case 8211: // – (en dash)
+ cmark_render_ascii(renderer, "\\[en]");
+ break;
+ default:
+ cmark_render_code_point(renderer, c);
+ }
+}
+```
+
+### Line-Start Protection
+
+The `.` and `'` characters are only escaped when they appear at the beginning of a line, since roff interprets them as macro/command prefixes. The check:
+
+```c
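+case 46: // '.'
+case 39: // '\''
+  if (renderer->begin_line) {
+    // Emit the \& prefix, then the character itself
+    cmark_render_ascii(renderer, c == 46 ? "\\&." : "\\&'");
+  } else {
+    cmark_render_code_point(renderer, c);
+  }
+  break;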
+
+The `\&` escape (written `"\\&"` in the C source) is a zero-width, non-printing character that prevents roff from treating the following character as a command prefix.
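+
+The effect of the line-start rule is easiest to see on a whole string. The following is a minimal sketch under stated assumptions: `toy_man_escape` is a hypothetical helper, its local `begin_line` flag stands in for `renderer->begin_line`, and only the four ASCII cases from `S_outc` are handled.
+
+```c
+#include <assert.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <string.h>
+
+/* Escapes src for roff into dst (capacity cap, terminator included).
+ * '.' and '\'' get a "\&" prefix only at the start of a line; '-' and
+ * '\\' are always escaped. Returns false if dst is too small. */
+static bool toy_man_escape(const char *src, char *dst, size_t cap) {
+    bool begin_line = true;   /* stand-in for renderer->begin_line */
+    size_t n = 0;
+    assert(src != NULL && dst != NULL);
+    for (; *src; src++) {
+        char one[2] = {*src, '\0'};
+        const char *rep = one;          /* default: copy unchanged */
+        size_t len;
+        if ((*src == '.' || *src == '\'') && begin_line) {
+            rep = (*src == '.') ? "\\&." : "\\&'";
+        } else if (*src == '-') {
+            rep = "\\-";
+        } else if (*src == '\\') {
+            rep = "\\e";
+        }
+        len = strlen(rep);
+        if (n + len >= cap) {
+            return false;
+        }
+        memcpy(dst + n, rep, len);
+        n += len;
+        begin_line = (*src == '\n');    /* next char starts a line? */
+    }
+    if (n >= cap) {
+        return false;
+    }
+    dst[n] = '\0';
+    return true;
+}
+```
+
+A leading `.profile` becomes `\&.profile`, while the dot in `a.b` is copied as is because it does not start a line.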
+
+## Block Number Tracking
+
+The renderer tracks nesting with a `block_number` variable, which it uses to generate matching `.RS`/`.RE` (indent start/end) pairs. The variable is incremented when entering list items and block quotes and decremented on exit, so it reflects the indentation level of nested content.
+
+## Node Rendering (`S_render_node`)
+
+### Block Nodes
+
+#### Document
+No output on enter or exit.
+
+#### Block Quote
+```
+ENTER: .RS\n
+EXIT: .RE\n
+```
+
+`.RS` pushes relative indentation, `.RE` pops it.
+
+#### List
+On exit, adds a blank output line (`cr()`) to separate from following content.
+
+#### Item
+```
+ENTER (bullet): .IP \(bu 2\n
+ENTER (ordered): .IP "N." 4\n (where N = list start + sibling count)
+EXIT: (cr if not last item)
+```
+
+The ordered item number is calculated by counting previous siblings:
+```c
+int list_number = cmark_node_get_list_start(node->parent);
+tmp = node;
+while (tmp->prev) {
+ tmp = tmp->prev;
+ list_number++;
+}
+```
+
+`.IP` sets an indented paragraph with a tag (bullet or number) and indentation width.
+
+#### Heading
+```
+ENTER (level 1): .SH\n (section heading)
+ENTER (level 2): .SS\n (subsection heading)
+ENTER (other): .PP\n\fB (paragraph, start bold)
+EXIT (other): \fR\n (end bold)
+```
+
+Level 1 and 2 headings use dedicated roff macros. Level 3+ are rendered as bold paragraphs.
+
+#### Code Block
+```
+.IP\n.nf\n\f[C]\n
+LITERAL CONTENT
+\f[]\n.fi\n
+```
+
+- `.nf` — no-fill (preformatted)
+- `\f[C]` — switch to constant-width font
+- `\f[]` — restore previous font
+- `.fi` — return to fill mode
+
+#### HTML Block
+```
+(nothing)
+```
+Raw HTML blocks are silently omitted in man output.
+
+#### Thematic Break
+There is no native roff thematic break. The renderer outputs nothing for this node type.
+
+#### Paragraph
+Same tight-list check as other renderers:
+```c
+tight = (grandparent && grandparent->type == CMARK_NODE_LIST) ?
+ grandparent->as.list.tight : false;
+```
+- Normal: `.PP\n` before content
+- Tight: no `.PP` prefix
+
+### Inline Nodes
+
+#### Text
+Output with NORMAL escaping.
+
+#### Soft Break
+Depends on options:
+- `CMARK_OPT_HARDBREAKS`: `.PD 0\n.P\n.PD\n`
+- `CMARK_OPT_NOBREAKS`: space
+- Default: newline
+
+The hardbreak sequence `.PD 0\n.P\n.PD\n` is a man page idiom that:
+1. Sets paragraph distance to 0 (`.PD 0`)
+2. Starts a new paragraph (`.P`)
+3. Restores default paragraph distance (`.PD`)
+
+#### Line Break
+Same as hardbreak:
+```
+.PD 0\n.P\n.PD\n
+```
+
+#### Code (inline)
+```
+\f[C]ESCAPED CONTENT\f[]
+```
+
+Font switch to `C` (constant-width), then restore.
+
+#### Emphasis
+```
+ENTER: \f[I] (italic font)
+EXIT: \f[] (restore font)
+```
+
+#### Strong
+```
+ENTER: \f[B] (bold font)
+EXIT: \f[] (restore font)
+```
+
+#### Link
+Links render their text content normally. On exit:
+```
+(ESCAPED_URL)
+```
+
+If the link URL is the same as the text content (autolink), the URL suffix is suppressed.
+
+#### Image
+```
+ENTER: [IMAGE:
+EXIT: ]
+```
+
+Images have no roff equivalent, so they're rendered as bracketed alt text.
+
+#### HTML Inline
+Silently omitted.
+
+## Source Position
+
+When `CMARK_OPT_SOURCEPOS` is set, man output includes roff comments:
+```
+.\" sourcepos: LINE:COL-LINE:COL
+```
+
+(The `.\"` prefix is the roff comment syntax.)
+
+## Example Output
+
+Markdown input:
+```markdown
+# My Tool
+
+A description with *emphasis*.
+
+## Options
+
+- `--flag` — Does something
+- `--other` — Does another thing
+```
+
+Man output:
+```roff
+.SH
+My Tool
+.PP
+A description with \f[I]emphasis\f[].
+.SS
+Options
+.IP \(bu 2
+\f[C]\-\-flag\f[] \[em] Does something
+.IP \(bu 2
+\f[C]\-\-other\f[] \[em] Does another thing
+```
+
+## Limitations
+
+1. **No heading levels > 2**: Levels 3+ are rendered as bold paragraphs, losing semantic heading structure.
+2. **No images**: Only alt text is shown in brackets.
+3. **No raw HTML**: Silently dropped.
+4. **No thematic breaks**: No visual separator is output.
+5. **No tables**: Not part of core CommonMark, but if extensions add them, the man renderer has no support.
+
+## Cross-References
+
+- [man.c](../../cmark/src/man.c) — Full implementation
+- [render-framework.md](render-framework.md) — Generic render framework
+- [public-api.md](public-api.md) — `cmark_render_man()` API docs
+- [latex-renderer.md](latex-renderer.md) — Another framework-based renderer
diff --git a/docs/handbook/cmark/memory-management.md b/docs/handbook/cmark/memory-management.md
new file mode 100644
index 0000000000..dbc0046cb9
--- /dev/null
+++ b/docs/handbook/cmark/memory-management.md
@@ -0,0 +1,351 @@
+# cmark — Memory Management
+
+## Overview
+
+cmark's memory management is built around three concepts:
+1. **Pluggable allocator** (`cmark_mem`) — a function-pointer table for calloc/realloc/free
+2. **Owning buffer** (`cmark_strbuf`) — a growable byte buffer that owns its memory
+3. **Non-owning slice** (`cmark_chunk`) — a view into either a `cmark_strbuf` or external memory
+
+## Pluggable Allocator
+
+### `cmark_mem` Structure
+
+```c
+typedef struct cmark_mem {
+ void *(*calloc)(size_t, size_t);
+ void *(*realloc)(void *, size_t);
+ void (*free)(void *);
+} cmark_mem;
+```
+
+All allocation throughout cmark respects this interface. Every node, buffer, parser, and iterator receives a `cmark_mem *` and uses it for all allocations.
+
+### Default Allocator
+
+```c
+static void *xcalloc(size_t nmemb, size_t size) {
+ void *ptr = calloc(nmemb, size);
+ if (!ptr) abort();
+ return ptr;
+}
+
+static void *xrealloc(void *ptr, size_t size) {
+ void *new_ptr = realloc(ptr, size);
+ if (!new_ptr) abort();
+ return new_ptr;
+}
+
+cmark_mem DEFAULT_MEM_ALLOCATOR = {xcalloc, xrealloc, free};
+```
+
+The default allocator wraps standard `calloc`/`realloc`/`free`, adding `abort()` on allocation failure. This means cmark never returns NULL from allocations — it terminates on out-of-memory.
+
+### Getting the Default Allocator
+
+```c
+cmark_mem *cmark_get_default_mem_allocator(void) {
+ return &DEFAULT_MEM_ALLOCATOR;
+}
+```
+
+### Custom Allocator Usage
+
+Users can provide custom allocators (arena allocators, debug allocators, etc.) via:
+
+```c
+cmark_parser *cmark_parser_new_with_mem(int options, cmark_mem *mem);
+cmark_node *cmark_node_new_with_mem(cmark_node_type type, cmark_mem *mem);
+```
+
+The allocator propagates: nodes created by the parser inherit the parser's allocator. Iterators use the root node's allocator.
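+
+As an illustration of the debug-allocator case, the sketch below counts live blocks. It is written under assumptions: `toy_mem` is a local copy of the three-function shape shown above so that the example compiles without `cmark.h`, and `counting_calloc`, `counting_realloc`, `counting_free`, and `live_blocks` are hypothetical names.
+
+```c
+#include <assert.h>
+#include <stdlib.h>
+
+/* Local copy of the allocator shape; with cmark available, the real
+ * cmark_mem type would be used instead. */
+typedef struct toy_mem {
+    void *(*calloc)(size_t, size_t);
+    void *(*realloc)(void *, size_t);
+    void (*free)(void *);
+} toy_mem;
+
+/* The callbacks take no context pointer, so state lives in a global. */
+static long live_blocks = 0;
+
+static void *counting_calloc(size_t nmemb, size_t size) {
+    void *p = calloc(nmemb, size);
+    if (!p) {
+        abort();              /* same policy as the default allocator */
+    }
+    live_blocks++;
+    return p;
+}
+
+static void *counting_realloc(void *ptr, size_t size) {
+    void *p = realloc(ptr, size);
+    if (!p) {
+        abort();
+    }
+    if (!ptr) {
+        live_blocks++;        /* realloc(NULL, n) is a fresh allocation */
+    }
+    return p;
+}
+
+static void counting_free(void *ptr) {
+    if (ptr) {
+        live_blocks--;
+    }
+    free(ptr);
+}
+
+static toy_mem counting_mem = {counting_calloc, counting_realloc, counting_free};
+```
+
+With the real library, a table of this shape typed as `cmark_mem` could be passed to `cmark_parser_new_with_mem()`; once the parser and the AST have both been freed, a nonzero `live_blocks` would point to a leak.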
+
+## Growable Buffer (`cmark_strbuf`)
+
+### Structure
+
+```c
+struct cmark_strbuf {
+ cmark_mem *mem;
+ unsigned char *ptr;
+ bufsize_t asize; // allocated size
+ bufsize_t size; // used size (excluding NUL terminator)
+};
+```
+
+### Initialization
+
+```c
+#define CMARK_BUF_INIT(mem) { mem, cmark_strbuf__initbuf, 0, 0 }
+```
+
+`cmark_strbuf__initbuf` is a static empty buffer that avoids allocating for empty strings:
+```c
+unsigned char cmark_strbuf__initbuf[1] = {0};
+```
+
+This means that freshly initialized and empty buffers point to a shared static empty string rather than NULL, which eliminates NULL checks throughout the code.
+
+### Growth Strategy
+
+```c
+void cmark_strbuf_grow(cmark_strbuf *buf, bufsize_t target_size) {
+ // Minimum allocation of 8 bytes
+ bufsize_t new_size = 8;
+ // Start from the current capacity (if any) and double until >= target
+ if (buf->asize) {
+ new_size = buf->asize;
+ }
+ while (new_size < target_size) {
+ new_size *= 2;
+ }
+ // Allocate
+ if (buf->ptr == cmark_strbuf__initbuf) {
+ buf->ptr = (unsigned char *)buf->mem->calloc(new_size, 1);
+ } else {
+ buf->ptr = (unsigned char *)buf->mem->realloc(buf->ptr, new_size);
+ }
+ buf->asize = new_size;
+}
+```
+
+The growth strategy doubles the capacity each time, ensuring amortized O(1) appends. Minimum capacity is 8 bytes.
+
+When the buffer transitions from the shared static init to a real allocation, `calloc` is used (zero-initialized). Subsequent growths use `realloc`.
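+
+The doubling rule described above can be checked with plain arithmetic. This sketch models capacities only; `toy_next_capacity` and `toy_count_growths` are hypothetical helpers, and the extra byte a real buffer needs for its NUL terminator is ignored.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+
+/* Capacity the described strategy would pick: start from the current
+ * capacity (or 8 if there is none) and double until the target fits. */
+static size_t toy_next_capacity(size_t current, size_t target) {
+    size_t cap = current ? current : 8;
+    while (cap < target) {
+        assert(cap <= (size_t)-1 / 2);   /* guard against overflow */
+        cap *= 2;
+    }
+    return cap;
+}
+
+/* Number of (re)allocations needed to append n bytes one at a time. */
+static int toy_count_growths(size_t n) {
+    size_t cap = 0, size;
+    int growths = 0;
+    for (size = 1; size <= n; size++) {
+        if (size > cap) {
+            cap = toy_next_capacity(cap, size);
+            growths++;
+        }
+    }
+    return growths;
+}
+```
+
+Appending 1000 bytes one at a time triggers 8 allocations in this model (capacities 8, 16, ..., 1024), which is the logarithmic growth count behind the amortized O(1) claim.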
+
+### Key Operations
+
+```c
+// Appending
+void cmark_strbuf_put(cmark_strbuf *buf, const unsigned char *data, bufsize_t len);
+void cmark_strbuf_puts(cmark_strbuf *buf, const char *string);
+void cmark_strbuf_putc(cmark_strbuf *buf, int c);
+
+// Printf-style
+void cmark_strbuf_printf(cmark_strbuf *buf, const char *fmt, ...);
+void cmark_strbuf_vprintf(cmark_strbuf *buf, const char *fmt, va_list ap);
+
+// Manipulation
+void cmark_strbuf_clear(cmark_strbuf *buf); // Reset size to 0, keep allocation
+void cmark_strbuf_set(cmark_strbuf *buf, const unsigned char *data, bufsize_t len);
+void cmark_strbuf_sets(cmark_strbuf *buf, const char *string);
+void cmark_strbuf_copy_cstr(char *data, bufsize_t datasize, const cmark_strbuf *buf);
+void cmark_strbuf_swap(cmark_strbuf *a, cmark_strbuf *b);
+
+// Whitespace
+void cmark_strbuf_trim(cmark_strbuf *buf); // Trim leading and trailing whitespace
+void cmark_strbuf_normalize_whitespace(cmark_strbuf *buf); // Collapse runs to single space
+void cmark_strbuf_unescape(cmark_strbuf *buf); // Process backslash escapes
+
+// Lifecycle
+unsigned char *cmark_strbuf_detach(cmark_strbuf *buf); // Return ptr, reset buf to init
+void cmark_strbuf_free(cmark_strbuf *buf); // Free memory, reset to init
+```
+
+### `cmark_strbuf_detach()`
+
+```c
+unsigned char *cmark_strbuf_detach(cmark_strbuf *buf) {
+ unsigned char *data = buf->ptr;
+ if (buf->asize == 0) {
+ // Never allocated — return a new empty string
+ data = (unsigned char *)buf->mem->calloc(1, 1);
+ }
+ // Reset buffer to initial state
+ buf->ptr = cmark_strbuf__initbuf;
+ buf->asize = 0;
+ buf->size = 0;
+ return data;
+}
+```
+
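+Transfers ownership of the buffer's memory to the caller and resets the buffer to the empty init state. The caller must release the returned pointer with the `free` function of the same `cmark_mem` allocator the buffer was created with (plain `free()` is correct only when the default allocator is in use).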
+
+### Whitespace Normalization
+
+```c
+void cmark_strbuf_normalize_whitespace(cmark_strbuf *s) {
+ bool last_char_was_space = false;
+ bufsize_t r, w;
+ for (r = 0, w = 0; r < s->size; r++) {
+ if (cmark_isspace(s->ptr[r])) {
+ if (!last_char_was_space) {
+ s->ptr[w++] = ' ';
+ last_char_was_space = true;
+ }
+ } else {
+ s->ptr[w++] = s->ptr[r];
+ last_char_was_space = false;
+ }
+ }
+ cmark_strbuf_truncate(s, w);
+}
+```
+
+Collapses consecutive whitespace into a single space. Uses an in-place read/write cursor technique.
+
+### Backslash Unescape
+
+```c
+void cmark_strbuf_unescape(cmark_strbuf *buf) {
+ bufsize_t r, w;
+ for (r = 0, w = 0; r < buf->size; r++) {
+ if (buf->ptr[r] == '\\' && cmark_ispunct(buf->ptr[r + 1]))
+ r++;
+ buf->ptr[w++] = buf->ptr[r];
+ }
+ cmark_strbuf_truncate(buf, w);
+}
+```
+
+Removes backslash escapes before punctuation characters, in-place.
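+
+Both functions rely on the same read/write cursor pattern, which can be exercised on ordinary C strings. The sketch below is an approximation: `toy_normalize_ws` and `toy_unescape` are hypothetical names, they work on NUL-terminated strings rather than `cmark_strbuf`, and they use `isspace`/`ispunct` from `<ctype.h>` as stand-ins for `cmark_isspace`/`cmark_ispunct`, whose character sets may differ.
+
+```c
+#include <assert.h>
+#include <ctype.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <string.h>
+
+/* Collapses each whitespace run to one space, in place.
+ * Returns the new length. */
+static size_t toy_normalize_ws(char *s) {
+    bool last_space = false;
+    size_t r, w = 0;
+    for (r = 0; s[r] != '\0'; r++) {
+        if (isspace((unsigned char)s[r])) {
+            if (!last_space) {
+                s[w++] = ' ';
+            }
+            last_space = true;
+        } else {
+            s[w++] = s[r];
+            last_space = false;
+        }
+    }
+    s[w] = '\0';
+    return w;
+}
+
+/* Drops a backslash when punctuation follows it, in place.
+ * Returns the new length. */
+static size_t toy_unescape(char *s) {
+    size_t r, w = 0;
+    for (r = 0; s[r] != '\0'; r++) {
+        /* s[r + 1] is at worst the terminator, which is not punctuation */
+        if (s[r] == '\\' && ispunct((unsigned char)s[r + 1])) {
+            r++;
+        }
+        s[w++] = s[r];
+    }
+    s[w] = '\0';
+    return w;
+}
+```
+
+Because the write cursor never overtakes the read cursor, neither function needs a second buffer. Note that a backslash before a letter, as in `\x`, is kept.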
+
+## Non-Owning Slice (`cmark_chunk`)
+
+### Structure
+
+```c
+typedef struct {
+ const unsigned char *data;
+ bufsize_t len;
+ bufsize_t alloc; // 0 if non-owning, > 0 if owning
+} cmark_chunk;
+```
+
+A `cmark_chunk` is either:
+- **Non-owning** (`alloc == 0`): Points into someone else's memory (e.g., the parser's input buffer)
+- **Owning** (`alloc > 0`): Owns its `data` pointer and must free it
+
+### Key Operations
+
+```c
+// Constructors: buf_detach returns an owning chunk; literal and dup return non-owning views
+static CMARK_INLINE cmark_chunk cmark_chunk_buf_detach(cmark_strbuf *buf);
+static CMARK_INLINE cmark_chunk cmark_chunk_literal(const char *data);
+static CMARK_INLINE cmark_chunk cmark_chunk_dup(const cmark_chunk *ch, bufsize_t pos, bufsize_t len);
+
+// Free (only if owning)
+static CMARK_INLINE void cmark_chunk_free(cmark_mem *mem, cmark_chunk *c) {
+ if (c->alloc)
+ mem->free((void *)c->data);
+ c->data = NULL;
+ c->alloc = 0;
+ c->len = 0;
+}
+```
+
+### Ownership Transfer
+
+`cmark_chunk_buf_detach()` takes ownership of a `cmark_strbuf`'s memory:
+
+```c
+static CMARK_INLINE cmark_chunk cmark_chunk_buf_detach(cmark_strbuf *buf) {
+ cmark_chunk c;
+ c.len = buf->size;
+ c.data = cmark_strbuf_detach(buf);
+ c.alloc = 1; // Now owns the data
+ return c;
+}
+```
+
+### Non-Owning References
+
+`cmark_chunk_dup()` creates a non-owning view into existing memory:
+
+```c
+static CMARK_INLINE cmark_chunk cmark_chunk_dup(const cmark_chunk *ch,
+ bufsize_t pos, bufsize_t len) {
+ cmark_chunk c = {ch->data + pos, len, 0}; // alloc = 0: non-owning
+ return c;
+}
+```
+
+This is used extensively during parsing to avoid copying strings. For example, text node content during inline parsing initially points into the parser's line buffer. Only when the node outlives the parse does the data need to be copied.
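+
+The owning/non-owning distinction can be modeled in a few lines. This is a sketch, not the code in `chunk.h`: `toy_chunk`, `toy_chunk_view`, `toy_chunk_own`, and `toy_chunk_free` are hypothetical names, and the copy uses `malloc` directly instead of a `cmark_mem` allocator.
+
+```c
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+/* Miniature of the chunk idea: alloc != 0 means "owns data". */
+typedef struct {
+    const unsigned char *data;
+    size_t len;
+    int alloc;
+} toy_chunk;
+
+/* Non-owning view of len bytes starting at pos inside another chunk. */
+static toy_chunk toy_chunk_view(const toy_chunk *ch, size_t pos, size_t len) {
+    toy_chunk c;
+    assert(pos + len <= ch->len);
+    c.data = ch->data + pos;
+    c.len = len;
+    c.alloc = 0;
+    return c;
+}
+
+/* Owning copy, for data that must outlive the buffer it points into. */
+static toy_chunk toy_chunk_own(const toy_chunk *ch) {
+    toy_chunk c;
+    unsigned char *copy = malloc(ch->len + 1);
+    if (!copy) {
+        abort();
+    }
+    if (ch->len) {
+        memcpy(copy, ch->data, ch->len);
+    }
+    copy[ch->len] = '\0';
+    c.data = copy;
+    c.len = ch->len;
+    c.alloc = 1;
+    return c;
+}
+
+/* Frees only when owning, then resets the chunk.
+ * Returns 1 if memory was actually released. */
+static int toy_chunk_free(toy_chunk *c) {
+    int released = (c->alloc != 0);
+    if (released) {
+        free((void *)c->data);
+    }
+    c->data = NULL;
+    c->len = 0;
+    c->alloc = 0;
+    return released;
+}
+```
+
+Freeing a view is a no-op, and freeing an owning chunk twice is harmless because the first call clears `alloc`. That is the double-free protection the `alloc` field provides.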
+
+## Node Memory Management
+
+### Node Allocation
+
+```c
+static cmark_node *S_node_new(cmark_node_type type, cmark_mem *mem) {
+ cmark_node *node = (cmark_node *)mem->calloc(1, sizeof(*node));
+ cmark_strbuf_init(mem, &node->content, 0);
+ node->type = (uint16_t)type;
+ node->mem = mem;
+ return node;
+}
+```
+
+Nodes are zero-initialized via `calloc`. The `mem` pointer is stored on the node for later freeing.
+
+### Node Deallocation
+
+```c
+static void S_free_nodes(cmark_node *e) {
+ cmark_node *next;
+ while (e != NULL) {
+ // Free type-specific data
+ switch (e->type) {
+ case CMARK_NODE_CODE_BLOCK:
+ cmark_chunk_free(e->mem, &e->as.code.info);
+ cmark_chunk_free(e->mem, &e->as.literal);
+ break;
+ case CMARK_NODE_LINK:
+ case CMARK_NODE_IMAGE:
+ e->mem->free(e->as.link.url);
+ e->mem->free(e->as.link.title);
+ break;
+ // ... other types
+ }
+ // Splice children into the free list
+ if (e->first_child) {
+ cmark_node *last = e->last_child;
+ last->next = e->next;
+ e->next = e->first_child;
+ }
+ // Advance and free
+ next = e->next;
+ e->mem->free(e);
+ e = next;
+ }
+}
+```
+
+This is an iterative (non-recursive) destructor that avoids stack overflow on deeply nested ASTs. The key technique is **sibling-list splicing**: a node's children are spliced into the sibling chain immediately after the node itself, which turns the tree traversal into a linear list traversal.
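+
+The splicing step can be isolated from the type-specific cleanup. The sketch below is a reduced model: `toy_node`, `toy_new`, `toy_append_child`, `toy_free_nodes`, and `toy_free_deep_chain` are hypothetical names, nodes carry no payload, and plain `calloc`/`free` replace the `cmark_mem` callbacks.
+
+```c
+#include <assert.h>
+#include <stdlib.h>
+
+typedef struct toy_node {
+    struct toy_node *next, *first_child, *last_child;
+} toy_node;
+
+static toy_node *toy_new(void) {
+    toy_node *n = calloc(1, sizeof *n);
+    if (!n) {
+        abort();
+    }
+    return n;
+}
+
+static void toy_append_child(toy_node *parent, toy_node *child) {
+    if (parent->last_child) {
+        parent->last_child->next = child;
+    } else {
+        parent->first_child = child;
+    }
+    parent->last_child = child;
+}
+
+/* Frees e, its following siblings, and all descendants without
+ * recursion. Returns the number of nodes freed. */
+static size_t toy_free_nodes(toy_node *e) {
+    size_t freed = 0;
+    while (e != NULL) {
+        toy_node *next;
+        if (e->first_child) {
+            /* Splice the child list in right after e. */
+            e->last_child->next = e->next;
+            e->next = e->first_child;
+        }
+        next = e->next;
+        free(e);
+        freed++;
+        e = next;
+    }
+    return freed;
+}
+
+/* Builds `depth` nodes, each the only child of the previous one, and
+ * frees the whole chain. */
+static size_t toy_free_deep_chain(size_t depth) {
+    toy_node *root, *cur;
+    size_t i;
+    assert(depth > 0);
+    root = cur = toy_new();
+    for (i = 1; i < depth; i++) {
+        toy_node *child = toy_new();
+        toy_append_child(cur, child);
+        cur = child;
+    }
+    return toy_free_nodes(root);
+}
+```
+
+A chain 100000 levels deep is freed with constant stack usage, whereas a recursive destructor would need one stack frame per level.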
+
+### What Gets Freed Per Node Type
+
+| Node Type | Freed Data |
+|-----------|-----------|
+| `CODE_BLOCK` | `as.code.info` chunk, `as.literal` chunk |
+| `TEXT`, `HTML_BLOCK`, `HTML_INLINE`, `CODE` | `as.literal` chunk |
+| `LINK`, `IMAGE` | `as.link.url`, `as.link.title` |
+| `CUSTOM_BLOCK`, `CUSTOM_INLINE` | `as.custom.on_enter`, `as.custom.on_exit` |
+| `HEADING` | `as.heading.setext_content` (if chunk) |
+| All nodes | `content` strbuf |
+
+## Parser Memory
+
+The parser allocates:
+- A `cmark_parser` struct
+- A `cmark_strbuf` for the current line (`linebuf`)
+- A `cmark_strbuf` for collected content (`content`)
+- A `cmark_reference_map` for link references
+- Individual `cmark_node` objects for the AST
+
+When `cmark_parser_free()` is called, only the parser's own resources are freed — the AST is NOT freed (the user owns it). To free the AST, call `cmark_node_free()` on the root.
+
+## Memory Safety Patterns
+
+1. **No NULL returns**: The default allocator aborts on failure. User allocators should do the same or handle errors externally.
+2. **Init buffers**: `cmark_strbuf__initbuf` prevents NULL pointer dereferences on empty buffers.
+3. **Owning vs non-owning**: The `cmark_chunk.alloc` field prevents double-frees and ensures non-owning references are not freed.
+4. **Iterative destruction**: `S_free_nodes()` avoids stack overflow on deep trees.
+
+## Cross-References
+
+- [buffer.c](../../cmark/src/buffer.c), [buffer.h](../../cmark/src/buffer.h) — `cmark_strbuf` implementation
+- [chunk.h](../../cmark/src/chunk.h) — `cmark_chunk` definition
+- [cmark.c](../../cmark/src/cmark.c) — Default allocator, `cmark_get_default_mem_allocator()`
+- [node.c](../../cmark/src/node.c) — Node allocation and deallocation
+- [ast-node-system.md](ast-node-system.md) — Node structure and lifecycle
diff --git a/docs/handbook/cmark/overview.md b/docs/handbook/cmark/overview.md
new file mode 100644
index 0000000000..4fc95bdad7
--- /dev/null
+++ b/docs/handbook/cmark/overview.md
@@ -0,0 +1,256 @@
+# cmark — Overview
+
+## What Is cmark?
+
+cmark is a C library and command-line tool for parsing and rendering CommonMark (standardized Markdown). Written in C99, it implements a two-phase parsing architecture — block structure recognition followed by inline content parsing — producing an Abstract Syntax Tree (AST) that can be traversed, manipulated, and rendered into multiple output formats.
+
+**Language:** C (C99)
+**Build System:** CMake (minimum version 3.14)
+**Project Version:** 0.31.2
+**License:** BSD-2-Clause
+**Authors:** John MacFarlane, Vicent Marti, Kārlis Gaņģis, Nick Wellnhofer
+
+## Core Architecture Summary
+
+cmark's processing pipeline follows this sequence:
+
+1. **Input** — UTF-8 text is fed to the parser, either all at once or incrementally via a streaming API.
+2. **Block Parsing** (`blocks.c`) — The input is scanned line-by-line to identify block-level structures (paragraphs, headings, code blocks, lists, block quotes, thematic breaks, HTML blocks).
+3. **Inline Parsing** (`inlines.c`) — Within paragraph and heading blocks, inline elements are parsed (emphasis, links, images, code spans, HTML inline, line breaks).
+4. **AST Construction** — A tree of `cmark_node` structures is built, with each node representing a document element.
+5. **Rendering** — The AST is traversed using an iterator and rendered to one of five output formats: HTML, XML, LaTeX, man (groff), or CommonMark.
+
+## Source File Map
+
+The `cmark/src/` directory contains the following source files, organized by responsibility:
+
+### Public API
+| File | Purpose |
+|------|---------|
+| `cmark.h` | Public API header — all exported types, enums, and function declarations |
+| `cmark.c` | Core glue — `cmark_markdown_to_html()`, default memory allocator, version info |
+| `main.c` | CLI entry point — argument parsing, file I/O, format dispatch |
+
+### AST Node System
+| File | Purpose |
+|------|---------|
+| `node.h` | Internal node struct definition, type-specific unions (`cmark_list`, `cmark_code`, `cmark_heading`, `cmark_link`, `cmark_custom`), internal flags |
+| `node.c` | Node creation/destruction, accessor functions, tree manipulation (insert, append, unlink, replace) |
+
+### Parsing
+| File | Purpose |
+|------|---------|
+| `parser.h` | Internal `cmark_parser` struct definition (parser state: line number, offset, column, indent, reference map) |
+| `blocks.c` | Block-level parsing — line-by-line analysis, open/close block logic, list item detection, finalization |
+| `inlines.c` | Inline-level parsing — emphasis/strong via delimiter stack, backtick code spans, links/images via bracket stack, autolinks, HTML inline |
+| `inlines.h` | Internal API: `cmark_parse_inlines()`, `cmark_parse_reference_inline()`, `cmark_clean_url()`, `cmark_clean_title()` |
+
+### Traversal
+| File | Purpose |
+|------|---------|
+| `iterator.h` | Internal `cmark_iter` struct with `cmark_iter_state` (current + next event/node pairs) |
+| `iterator.c` | Iterator implementation — `cmark_iter_new()`, `cmark_iter_next()`, `cmark_iter_reset()`, `cmark_consolidate_text_nodes()` |
+
+### Renderers
+| File | Purpose |
+|------|---------|
+| `render.h` | `cmark_renderer` struct, `cmark_escaping` enum (`LITERAL`, `NORMAL`, `TITLE`, `URL`) |
+| `render.c` | Generic render framework — line wrapping, prefix management, `cmark_render()` dispatch loop |
+| `html.c` | HTML renderer — `cmark_render_html()`, direct strbuf-based output, no render framework |
+| `xml.c` | XML renderer — `cmark_render_xml()`, direct strbuf-based output with CommonMark DTD |
+| `latex.c` | LaTeX renderer — `cmark_render_latex()`, uses render framework |
+| `man.c` | groff man renderer — `cmark_render_man()`, uses render framework |
+| `commonmark.c` | CommonMark renderer — `cmark_render_commonmark()`, uses render framework |
+
+### Text Processing and Utilities
+| File | Purpose |
+|------|---------|
+| `buffer.h` / `buffer.c` | `cmark_strbuf` — growable byte buffer with amortized O(1) append |
+| `chunk.h` | `cmark_chunk` — lightweight non-owning string slice (pointer + length) |
+| `utf8.h` / `utf8.c` | UTF-8 iteration, validation, encoding, case folding, Unicode property queries |
+| `references.h` / `references.c` | Link reference definition storage and lookup (sorted array with binary search) |
+| `scanners.h` / `scanners.c` | re2c-generated scanner functions for recognizing Markdown syntax patterns |
+| `scanners.re` | re2c source for scanner generation |
+| `cmark_ctype.h` / `cmark_ctype.c` | Locale-independent `cmark_isspace()`, `cmark_ispunct()`, `cmark_isdigit()`, `cmark_isalpha()` |
+| `houdini.h` | HTML/URL escaping and unescaping API |
+| `houdini_html_e.c` | HTML entity escaping |
+| `houdini_html_u.c` | HTML entity unescaping |
+| `houdini_href_e.c` | URL/href percent-encoding |
+| `entities.inc` | HTML entity name-to-codepoint lookup table |
+| `case_fold.inc` | Unicode case folding table for reference normalization |
+
+## The Simple Interface
+
+The simplest way to use cmark is a single function call defined in `cmark.c`:
+
+```c
+char *cmark_markdown_to_html(const char *text, size_t len, int options);
+```
+
+Internally, this calls `cmark_parse_document()` to build the AST, then `cmark_render_html()` to produce the output, and finally frees the document node. The caller is responsible for freeing the returned string.
+
+The implementation in `cmark.c`:
+
+```c
+char *cmark_markdown_to_html(const char *text, size_t len, int options) {
+  cmark_node *doc;
+  char *result;
+
+  doc = cmark_parse_document(text, len, options);
+  result = cmark_render_html(doc, options);
+  cmark_node_free(doc);
+
+  return result;
+}
+```
+
+## The Streaming Interface
+
+For large documents or streaming input, cmark provides an incremental parsing API:
+
+```c
+// Assumes <stdio.h>, <stdlib.h>, and <cmark.h> are included and that
+// fp is a FILE * already opened for reading.
+char buffer[4096];
+size_t bytes;
+cmark_parser *parser = cmark_parser_new(CMARK_OPT_DEFAULT);
+
+// Feed chunks of data as they arrive
+while ((bytes = fread(buffer, 1, sizeof(buffer), fp)) > 0) {
+  cmark_parser_feed(parser, buffer, bytes);
+}
+
+// Finalize and get the AST
+cmark_node *document = cmark_parser_finish(parser);
+cmark_parser_free(parser);
+
+// Render to any format
+char *html = cmark_render_html(document, CMARK_OPT_DEFAULT);
+char *xml = cmark_render_xml(document, CMARK_OPT_DEFAULT);
+char *man = cmark_render_man(document, CMARK_OPT_DEFAULT, 72);
+char *tex = cmark_render_latex(document, CMARK_OPT_DEFAULT, 80);
+char *cm = cmark_render_commonmark(document, CMARK_OPT_DEFAULT, 0);
+
+// Cleanup: each rendered string is owned by the caller
+free(html);
+free(xml);
+free(man);
+free(tex);
+free(cm);
+cmark_node_free(document);
+```
+
+The parser accumulates input in an internal line buffer (`parser->linebuf`) and processes complete lines as they become available. The `S_parser_feed()` function in `blocks.c` scans for line-ending characters (`\n`, `\r`) and dispatches each complete line to `S_process_line()`.
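+
+The line buffering described above can be illustrated with a toy feeder that splits input into lines across arbitrary chunk boundaries. This is a simplified sketch, not the logic of `S_parser_feed()`: `toy_feeder`, `toy_feed`, and `toy_finish` are hypothetical names, and the sketch only counts lines instead of storing and processing them.
+
+```c
+#include <stddef.h>
+
+/* Hypothetical feeder state carried between calls to toy_feed(). */
+typedef struct {
+  size_t lines;    /* complete lines seen so far */
+  size_t pending;  /* bytes in the current unterminated line */
+  int last_was_cr; /* previous byte was '\r' (possibly in an earlier chunk) */
+} toy_feeder;
+
+/* Consume one chunk; "\n", "\r", and "\r\n" each end exactly one line. */
+static void toy_feed(toy_feeder *f, const char *buf, size_t len) {
+  for (size_t i = 0; i < len; i++) {
+    char c = buf[i];
+    if (c == '\n' && f->last_was_cr) {
+      /* Second half of "\r\n": the line was already counted at '\r'. */
+      f->last_was_cr = 0;
+      continue;
+    }
+    f->last_was_cr = (c == '\r');
+    if (c == '\n' || c == '\r') {
+      f->lines++;
+      f->pending = 0;
+    } else {
+      f->pending++;
+    }
+  }
+}
+
+/* Flush a trailing line that has no terminator and return the line count. */
+static size_t toy_finish(toy_feeder *f) {
+  if (f->pending > 0) {
+    f->lines++;
+    f->pending = 0;
+  }
+  return f->lines;
+}
+```
+
+The `last_was_cr` flag is what lets a `\r\n` pair that straddles two chunks count as a single line ending.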
+
+## Node Type Taxonomy
+
+cmark defines 20 node types in the `cmark_node_type` enum (10 block and 10 inline), plus `CMARK_NODE_NONE`, which signals an error:
+
+### Block Nodes (container and leaf)
+| Enum Value | Type String | Container? | Accepts Lines? | Contains Inlines? |
+|-----------|-------------|------------|---------------|-------------------|
+| `CMARK_NODE_DOCUMENT` | "document" | Yes | No | No |
+| `CMARK_NODE_BLOCK_QUOTE` | "block_quote" | Yes | No | No |
+| `CMARK_NODE_LIST` | "list" | Yes (items only) | No | No |
+| `CMARK_NODE_ITEM` | "item" | Yes | No | No |
+| `CMARK_NODE_CODE_BLOCK` | "code_block" | No (leaf) | Yes | No |
+| `CMARK_NODE_HTML_BLOCK` | "html_block" | No (leaf) | No | No |
+| `CMARK_NODE_CUSTOM_BLOCK` | "custom_block" | Yes | No | No |
+| `CMARK_NODE_PARAGRAPH` | "paragraph" | No | Yes | Yes |
+| `CMARK_NODE_HEADING` | "heading" | No | Yes | Yes |
+| `CMARK_NODE_THEMATIC_BREAK` | "thematic_break" | No (leaf) | No | No |
+
+### Inline Nodes
+| Enum Value | Type String | Leaf? |
+|-----------|-------------|-------|
+| `CMARK_NODE_TEXT` | "text" | Yes |
+| `CMARK_NODE_SOFTBREAK` | "softbreak" | Yes |
+| `CMARK_NODE_LINEBREAK` | "linebreak" | Yes |
+| `CMARK_NODE_CODE` | "code" | Yes |
+| `CMARK_NODE_HTML_INLINE` | "html_inline" | Yes |
+| `CMARK_NODE_CUSTOM_INLINE` | "custom_inline" | No |
+| `CMARK_NODE_EMPH` | "emph" | No |
+| `CMARK_NODE_STRONG` | "strong" | No |
+| `CMARK_NODE_LINK` | "link" | No |
+| `CMARK_NODE_IMAGE` | "image" | No |
+
+Range sentinels are also defined for classification:
+- `CMARK_NODE_FIRST_BLOCK = CMARK_NODE_DOCUMENT`
+- `CMARK_NODE_LAST_BLOCK = CMARK_NODE_THEMATIC_BREAK`
+- `CMARK_NODE_FIRST_INLINE = CMARK_NODE_TEXT`
+- `CMARK_NODE_LAST_INLINE = CMARK_NODE_IMAGE`
+
+## Option Flags
+
+Options are passed as a bitmask integer to parsing and rendering functions:
+
+```c
+#define CMARK_OPT_DEFAULT 0
+#define CMARK_OPT_SOURCEPOS (1 << 1) // Add data-sourcepos attributes
+#define CMARK_OPT_HARDBREAKS (1 << 2) // Render softbreaks as hard breaks
+#define CMARK_OPT_SAFE (1 << 3) // Legacy (now default behavior)
+#define CMARK_OPT_NOBREAKS (1 << 4) // Render softbreaks as spaces
+#define CMARK_OPT_NORMALIZE (1 << 8) // Legacy (no effect)
+#define CMARK_OPT_VALIDATE_UTF8 (1 << 9) // Validate UTF-8 input
+#define CMARK_OPT_SMART (1 << 10) // Smart quotes and dashes
+#define CMARK_OPT_UNSAFE (1 << 17) // Allow raw HTML and dangerous URLs
+```
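+
+Since the flags are single bits, they can be combined with bitwise OR and tested with bitwise AND. The sketch below copies a few flag values from the listing so that it compiles without the library; the helpers `toy_has_option` and `toy_clear_option` are hypothetical and not part of cmark.
+
+```c
+#include <stdbool.h>
+
+/* Values copied from the listing above; with the real library,
+ * include <cmark.h> instead of redefining them. */
+#define CMARK_OPT_DEFAULT 0
+#define CMARK_OPT_SOURCEPOS (1 << 1)
+#define CMARK_OPT_SMART (1 << 10)
+#define CMARK_OPT_UNSAFE (1 << 17)
+
+/* True if every bit of flag is set in options. */
+static bool toy_has_option(int options, int flag) {
+  return (options & flag) == flag && flag != 0;
+}
+
+/* Return options with the bits of flag cleared. */
+static int toy_clear_option(int options, int flag) {
+  return options & ~flag;
+}
+```
+
+A caller would typically build the mask once, for example `CMARK_OPT_SOURCEPOS | CMARK_OPT_SMART`, and pass the same value to both the parsing and the rendering call.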
+
+## Memory Management Model
+
+cmark uses a pluggable memory allocator defined by the `cmark_mem` struct:
+
+```c
+typedef struct cmark_mem {
+ void *(*calloc)(size_t, size_t);
+ void *(*realloc)(void *, size_t);
+ void (*free)(void *);
+} cmark_mem;
+```
+
+The default allocator in `cmark.c` wraps standard `calloc`/`realloc`/`free` with abort-on-NULL safety checks (`xcalloc`, `xrealloc`). Every node stores a pointer to the allocator it was created with (`node->mem`), ensuring consistent allocation/deallocation throughout the tree.
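+
+A custom allocator with the same three-function shape might look like the following. This is a sketch under stated assumptions: `toy_mem` mirrors the layout of `cmark_mem` so the example compiles without the library, and the counting logic and all `toy_` names are illustrative only.
+
+```c
+#include <stdio.h>
+#include <stdlib.h>
+
+/* Same shape as cmark_mem, redeclared so the sketch compiles on its own. */
+typedef struct toy_mem {
+  void *(*calloc)(size_t, size_t);
+  void *(*realloc)(void *, size_t);
+  void (*free)(void *);
+} toy_mem;
+
+static size_t toy_live_blocks = 0; /* allocations not yet freed */
+
+/* Abort instead of returning NULL, like the default allocator. */
+static void *toy_calloc(size_t nmemb, size_t size) {
+  void *p = calloc(nmemb, size);
+  if (p == NULL) {
+    fprintf(stderr, "toy_calloc: out of memory\n");
+    abort();
+  }
+  toy_live_blocks++;
+  return p;
+}
+
+/* Callers are expected to pass a non-zero size. */
+static void *toy_realloc(void *ptr, size_t size) {
+  void *p = realloc(ptr, size);
+  if (p == NULL) {
+    fprintf(stderr, "toy_realloc: out of memory\n");
+    abort();
+  }
+  if (ptr == NULL)
+    toy_live_blocks++; /* realloc(NULL, n) acts as a fresh allocation */
+  return p;
+}
+
+static void toy_free(void *ptr) {
+  if (ptr != NULL)
+    toy_live_blocks--;
+  free(ptr);
+}
+
+static toy_mem toy_allocator = {toy_calloc, toy_realloc, toy_free};
+```
+
+With the real library, a `cmark_mem` initialized the same way could be handed to `cmark_parser_new_with_mem()`; a counter such as `toy_live_blocks` then gives a cheap leak check once the parser and tree have been freed.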
+
+## Version Information
+
+Runtime version queries:
+
+```c
+int cmark_version(void); // Returns CMARK_VERSION as integer (0xMMmmpp)
+const char *cmark_version_string(void); // Returns CMARK_VERSION_STRING
+```
+
+The version is encoded as a 24-bit integer where bits 16–23 are major, 8–15 are minor, and 0–7 are patch. For example, `0x001F02` represents version 0.31.2.
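+
+The packing scheme can be checked with a few shifts and masks. The helper names below (`toy_version`, `toy_unpack_version`, `toy_pack_version`) are hypothetical and exist only for this illustration.
+
+```c
+/* Hypothetical holder for the three version components. */
+typedef struct {
+  int major;
+  int minor;
+  int patch;
+} toy_version;
+
+/* Split a packed 0xMMmmpp value into its components. */
+static toy_version toy_unpack_version(int packed) {
+  toy_version v;
+  v.major = (packed >> 16) & 0xFF;
+  v.minor = (packed >> 8) & 0xFF;
+  v.patch = packed & 0xFF;
+  return v;
+}
+
+/* Inverse of toy_unpack_version(); each component must fit in 8 bits. */
+static int toy_pack_version(int major, int minor, int patch) {
+  return (major << 16) | (minor << 8) | patch;
+}
+```
+
+Because the components occupy fixed bit ranges, packed versions compare correctly as plain integers, so a check such as `cmark_version() >= 0x001F00` is a reasonable way to require at least 0.31.0.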
+
+## Backwards Compatibility Aliases
+
+For code written against older cmark API versions, these aliases are provided:
+
+```c
+#define CMARK_NODE_HEADER CMARK_NODE_HEADING
+#define CMARK_NODE_HRULE CMARK_NODE_THEMATIC_BREAK
+#define CMARK_NODE_HTML CMARK_NODE_HTML_BLOCK
+#define CMARK_NODE_INLINE_HTML CMARK_NODE_HTML_INLINE
+```
+
+Short-name aliases (without the `CMARK_` prefix) are also available unless `CMARK_NO_SHORT_NAMES` is defined:
+
+```c
+#define NODE_DOCUMENT CMARK_NODE_DOCUMENT
+#define NODE_PARAGRAPH CMARK_NODE_PARAGRAPH
+#define BULLET_LIST CMARK_BULLET_LIST
+// ... and many more
+```
+
+## Cross-References
+
+- [architecture.md](architecture.md) — Detailed two-phase parsing pipeline, module dependency graph
+- [public-api.md](public-api.md) — Complete public API reference with all function signatures
+- [ast-node-system.md](ast-node-system.md) — Internal `cmark_node` struct, type-specific unions, tree operations
+- [block-parsing.md](block-parsing.md) — `blocks.c` line-by-line analysis, open block tracking, finalization
+- [inline-parsing.md](inline-parsing.md) — `inlines.c` delimiter algorithm, bracket stack, backtick scanning
+- [iterator-system.md](iterator-system.md) — AST traversal with enter/exit events
+- [html-renderer.md](html-renderer.md) — HTML output with escaping and source position
+- [xml-renderer.md](xml-renderer.md) — XML output with CommonMark DTD
+- [latex-renderer.md](latex-renderer.md) — LaTeX output via render framework
+- [man-renderer.md](man-renderer.md) — groff man page output
+- [commonmark-renderer.md](commonmark-renderer.md) — Round-trip CommonMark output
+- [render-framework.md](render-framework.md) — Shared `cmark_render()` engine for text-based renderers
+- [memory-management.md](memory-management.md) — Allocator model, buffer growth, node freeing
+- [utf8-handling.md](utf8-handling.md) — UTF-8 validation, iteration, case folding
+- [reference-system.md](reference-system.md) — Link reference definitions storage and resolution
+- [scanner-system.md](scanner-system.md) — re2c-generated pattern matching
+- [building.md](building.md) — CMake build configuration and options
+- [cli-usage.md](cli-usage.md) — Command-line tool usage
+- [testing.md](testing.md) — Test infrastructure (spec tests, API tests, fuzzing)
+- [code-style.md](code-style.md) — Coding conventions and naming patterns
diff --git a/docs/handbook/cmark/public-api.md b/docs/handbook/cmark/public-api.md
new file mode 100644
index 0000000000..7168282e23
--- /dev/null
+++ b/docs/handbook/cmark/public-api.md
@@ -0,0 +1,637 @@
+# cmark — Public API Reference
+
+## Header: `cmark.h`
+
+All public API functions, types, and constants are declared in `cmark.h`. Functions marked with `CMARK_EXPORT` are exported from the shared library. The header is usable from C++ via `extern "C"` guards.
+
+---
+
+## Type Definitions
+
+### Node Types
+
+```c
+typedef enum {
+ /* Error status */
+ CMARK_NODE_NONE,
+
+ /* Block nodes */
+ CMARK_NODE_DOCUMENT,
+ CMARK_NODE_BLOCK_QUOTE,
+ CMARK_NODE_LIST,
+ CMARK_NODE_ITEM,
+ CMARK_NODE_CODE_BLOCK,
+ CMARK_NODE_HTML_BLOCK,
+ CMARK_NODE_CUSTOM_BLOCK,
+ CMARK_NODE_PARAGRAPH,
+ CMARK_NODE_HEADING,
+ CMARK_NODE_THEMATIC_BREAK,
+
+ /* Range sentinels */
+ CMARK_NODE_FIRST_BLOCK = CMARK_NODE_DOCUMENT,
+ CMARK_NODE_LAST_BLOCK = CMARK_NODE_THEMATIC_BREAK,
+
+ /* Inline nodes */
+ CMARK_NODE_TEXT,
+ CMARK_NODE_SOFTBREAK,
+ CMARK_NODE_LINEBREAK,
+ CMARK_NODE_CODE,
+ CMARK_NODE_HTML_INLINE,
+ CMARK_NODE_CUSTOM_INLINE,
+ CMARK_NODE_EMPH,
+ CMARK_NODE_STRONG,
+ CMARK_NODE_LINK,
+ CMARK_NODE_IMAGE,
+
+ CMARK_NODE_FIRST_INLINE = CMARK_NODE_TEXT,
+ CMARK_NODE_LAST_INLINE = CMARK_NODE_IMAGE
+} cmark_node_type;
+```
+
+### List Types
+
+```c
+typedef enum {
+ CMARK_NO_LIST,
+ CMARK_BULLET_LIST,
+ CMARK_ORDERED_LIST
+} cmark_list_type;
+```
+
+### Delimiter Types
+
+```c
+typedef enum {
+ CMARK_NO_DELIM,
+ CMARK_PERIOD_DELIM,
+ CMARK_PAREN_DELIM
+} cmark_delim_type;
+```
+
+### Event Types (for iterator)
+
+```c
+typedef enum {
+ CMARK_EVENT_NONE,
+ CMARK_EVENT_DONE,
+ CMARK_EVENT_ENTER,
+ CMARK_EVENT_EXIT
+} cmark_event_type;
+```
+
+### Opaque Types
+
+```c
+typedef struct cmark_node cmark_node;
+typedef struct cmark_parser cmark_parser;
+typedef struct cmark_iter cmark_iter;
+```
+
+### Memory Allocator
+
+```c
+typedef struct cmark_mem {
+ void *(*calloc)(size_t, size_t);
+ void *(*realloc)(void *, size_t);
+ void (*free)(void *);
+} cmark_mem;
+```
+
+---
+
+## Simple Interface
+
+### `cmark_markdown_to_html`
+
+```c
+CMARK_EXPORT
+char *cmark_markdown_to_html(const char *text, size_t len, int options);
+```
+
+Converts CommonMark text to HTML in a single call. The input `text` must be UTF-8 encoded. The returned string is null-terminated and allocated via the default allocator; the caller must free it with `free()`.
+
+**Implementation** (in `cmark.c`): Calls `cmark_parse_document()`, then `cmark_render_html()`, then `cmark_node_free()`.
+
+---
+
+## Node Classification
+
+### `cmark_node_is_block`
+
+```c
+CMARK_EXPORT bool cmark_node_is_block(cmark_node *node);
+```
+
+Returns `true` if `node->type` is between `CMARK_NODE_FIRST_BLOCK` and `CMARK_NODE_LAST_BLOCK` inclusive. Returns `false` for NULL.
+
+### `cmark_node_is_inline`
+
+```c
+CMARK_EXPORT bool cmark_node_is_inline(cmark_node *node);
+```
+
+Returns `true` if `node->type` is between `CMARK_NODE_FIRST_INLINE` and `CMARK_NODE_LAST_INLINE` inclusive. Returns `false` for NULL.
+
+### `cmark_node_is_leaf`
+
+```c
+CMARK_EXPORT bool cmark_node_is_leaf(cmark_node *node);
+```
+
+Returns `true` for node types that cannot have children:
+- `CMARK_NODE_THEMATIC_BREAK`
+- `CMARK_NODE_CODE_BLOCK`
+- `CMARK_NODE_TEXT`
+- `CMARK_NODE_SOFTBREAK`
+- `CMARK_NODE_LINEBREAK`
+- `CMARK_NODE_CODE`
+- `CMARK_NODE_HTML_INLINE`
+
+Note: `CMARK_NODE_HTML_BLOCK` is **not** classified as a leaf by `cmark_node_is_leaf()`, though the iterator treats it as one (see `S_leaf_mask` in `iterator.c`).
+
+---
+
+## Node Creation and Destruction
+
+### `cmark_node_new`
+
+```c
+CMARK_EXPORT cmark_node *cmark_node_new(cmark_node_type type);
+```
+
+Creates a new node of the given type using the default memory allocator. For `CMARK_NODE_HEADING`, the level defaults to 1. For `CMARK_NODE_LIST`, the list type defaults to `CMARK_BULLET_LIST` with `start = 0` and `tight = false`.
+
+### `cmark_node_new_with_mem`
+
+```c
+CMARK_EXPORT cmark_node *cmark_node_new_with_mem(cmark_node_type type, cmark_mem *mem);
+```
+
+Same as `cmark_node_new` but uses the specified memory allocator. All nodes in a single tree must use the same allocator.
+
+### `cmark_node_free`
+
+```c
+CMARK_EXPORT void cmark_node_free(cmark_node *node);
+```
+
+Frees the node and all its descendants. The node is first unlinked from its siblings/parent. The internal `S_free_nodes()` function iterates the subtree (splicing children into a flat list for iterative freeing) and releases type-specific memory:
+- `CMARK_NODE_CODE_BLOCK`: frees `data` and `as.code.info`
+- `CMARK_NODE_TEXT`, `CMARK_NODE_HTML_INLINE`, `CMARK_NODE_CODE`, `CMARK_NODE_HTML_BLOCK`: frees `data`
+- `CMARK_NODE_LINK`, `CMARK_NODE_IMAGE`: frees `as.link.url` and `as.link.title`
+- `CMARK_NODE_CUSTOM_BLOCK`, `CMARK_NODE_CUSTOM_INLINE`: frees `as.custom.on_enter` and `as.custom.on_exit`
+
+---
+
+## Tree Traversal
+
+### `cmark_node_next`
+
+```c
+CMARK_EXPORT cmark_node *cmark_node_next(cmark_node *node);
+```
+
+Returns the next sibling, or NULL.
+
+### `cmark_node_previous`
+
+```c
+CMARK_EXPORT cmark_node *cmark_node_previous(cmark_node *node);
+```
+
+Returns the previous sibling, or NULL.
+
+### `cmark_node_parent`
+
+```c
+CMARK_EXPORT cmark_node *cmark_node_parent(cmark_node *node);
+```
+
+Returns the parent node, or NULL.
+
+### `cmark_node_first_child`
+
+```c
+CMARK_EXPORT cmark_node *cmark_node_first_child(cmark_node *node);
+```
+
+Returns the first child, or NULL.
+
+### `cmark_node_last_child`
+
+```c
+CMARK_EXPORT cmark_node *cmark_node_last_child(cmark_node *node);
+```
+
+Returns the last child, or NULL.
+
+---
+
+## Iterator API
+
+### `cmark_iter_new`
+
+```c
+CMARK_EXPORT cmark_iter *cmark_iter_new(cmark_node *root);
+```
+
+Creates a new iterator starting at `root`. Returns NULL if `root` is NULL. The iterator begins in a pre-first state (`CMARK_EVENT_NONE`); the first call to `cmark_iter_next()` returns `CMARK_EVENT_ENTER` for the root.
+
+### `cmark_iter_free`
+
+```c
+CMARK_EXPORT void cmark_iter_free(cmark_iter *iter);
+```
+
+Frees the iterator. Does not free any nodes.
+
+### `cmark_iter_next`
+
+```c
+CMARK_EXPORT cmark_event_type cmark_iter_next(cmark_iter *iter);
+```
+
+Advances to the next node and returns the event type:
+- `CMARK_EVENT_ENTER` — entering a node (for non-leaf nodes, children follow)
+- `CMARK_EVENT_EXIT` — leaving a node (all children have been visited)
+- `CMARK_EVENT_DONE` — iteration complete (returned to root)
+
+Leaf nodes only generate `ENTER` events, never `EXIT`.
+
+### `cmark_iter_get_node`
+
+```c
+CMARK_EXPORT cmark_node *cmark_iter_get_node(cmark_iter *iter);
+```
+
+Returns the current node.
+
+### `cmark_iter_get_event_type`
+
+```c
+CMARK_EXPORT cmark_event_type cmark_iter_get_event_type(cmark_iter *iter);
+```
+
+Returns the current event type.
+
+### `cmark_iter_get_root`
+
+```c
+CMARK_EXPORT cmark_node *cmark_iter_get_root(cmark_iter *iter);
+```
+
+Returns the root node of the iteration.
+
+### `cmark_iter_reset`
+
+```c
+CMARK_EXPORT void cmark_iter_reset(cmark_iter *iter, cmark_node *current,
+ cmark_event_type event_type);
+```
+
+Resets the iterator position. The node must be a descendant of the root (or the root itself).
+
+---
+
+## Node Accessors
+
+### User Data
+
+```c
+CMARK_EXPORT void *cmark_node_get_user_data(cmark_node *node);
+CMARK_EXPORT int cmark_node_set_user_data(cmark_node *node, void *user_data);
+```
+
+Get/set arbitrary user data pointer. Returns 0 on failure, 1 on success. cmark does not manage the lifecycle of user data.
+
+### Type Information
+
+```c
+CMARK_EXPORT cmark_node_type cmark_node_get_type(cmark_node *node);
+CMARK_EXPORT const char *cmark_node_get_type_string(cmark_node *node);
+```
+
+`cmark_node_get_type_string()` returns strings like `"document"`, `"paragraph"`, `"heading"`, `"text"`, `"emph"`, `"strong"`, `"link"`, `"image"`, etc. Returns `"<unknown>"` for unrecognized types.
+
+### String Content
+
+```c
+CMARK_EXPORT const char *cmark_node_get_literal(cmark_node *node);
+CMARK_EXPORT int cmark_node_set_literal(cmark_node *node, const char *content);
+```
+
+Works for `CMARK_NODE_HTML_BLOCK`, `CMARK_NODE_TEXT`, `CMARK_NODE_HTML_INLINE`, `CMARK_NODE_CODE`, and `CMARK_NODE_CODE_BLOCK`. Returns NULL / 0 for other types.
+
+### Heading Level
+
+```c
+CMARK_EXPORT int cmark_node_get_heading_level(cmark_node *node);
+CMARK_EXPORT int cmark_node_set_heading_level(cmark_node *node, int level);
+```
+
+Only works for `CMARK_NODE_HEADING`. Level must be 1–6. Returns 0 on error.
+
+### List Properties
+
+```c
+CMARK_EXPORT cmark_list_type cmark_node_get_list_type(cmark_node *node);
+CMARK_EXPORT int cmark_node_set_list_type(cmark_node *node, cmark_list_type type);
+CMARK_EXPORT cmark_delim_type cmark_node_get_list_delim(cmark_node *node);
+CMARK_EXPORT int cmark_node_set_list_delim(cmark_node *node, cmark_delim_type delim);
+CMARK_EXPORT int cmark_node_get_list_start(cmark_node *node);
+CMARK_EXPORT int cmark_node_set_list_start(cmark_node *node, int start);
+CMARK_EXPORT int cmark_node_get_list_tight(cmark_node *node);
+CMARK_EXPORT int cmark_node_set_list_tight(cmark_node *node, int tight);
+```
+
+All list accessors only work for `CMARK_NODE_LIST`. `set_list_start` rejects negative values. `set_list_tight` interprets `tight == 1` as true.
+
+### Code Block Info
+
+```c
+CMARK_EXPORT const char *cmark_node_get_fence_info(cmark_node *node);
+CMARK_EXPORT int cmark_node_set_fence_info(cmark_node *node, const char *info);
+```
+
+The info string from a fenced code block (e.g., `"python"` from ` ```python `). Only works for `CMARK_NODE_CODE_BLOCK`.
+
+### Link/Image Properties
+
+```c
+CMARK_EXPORT const char *cmark_node_get_url(cmark_node *node);
+CMARK_EXPORT int cmark_node_set_url(cmark_node *node, const char *url);
+CMARK_EXPORT const char *cmark_node_get_title(cmark_node *node);
+CMARK_EXPORT int cmark_node_set_title(cmark_node *node, const char *title);
+```
+
+Only work for `CMARK_NODE_LINK` and `CMARK_NODE_IMAGE`. Return NULL / 0 for other types.
+
+### Custom Block/Inline
+
+```c
+CMARK_EXPORT const char *cmark_node_get_on_enter(cmark_node *node);
+CMARK_EXPORT int cmark_node_set_on_enter(cmark_node *node, const char *on_enter);
+CMARK_EXPORT const char *cmark_node_get_on_exit(cmark_node *node);
+CMARK_EXPORT int cmark_node_set_on_exit(cmark_node *node, const char *on_exit);
+```
+
+Only work for `CMARK_NODE_CUSTOM_BLOCK` and `CMARK_NODE_CUSTOM_INLINE`.
+
+### Source Position
+
+```c
+CMARK_EXPORT int cmark_node_get_start_line(cmark_node *node);
+CMARK_EXPORT int cmark_node_get_start_column(cmark_node *node);
+CMARK_EXPORT int cmark_node_get_end_line(cmark_node *node);
+CMARK_EXPORT int cmark_node_get_end_column(cmark_node *node);
+```
+
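+Line and column numbers are 1-based. Source positions are recorded while the tree is built, regardless of options; `CMARK_OPT_SOURCEPOS` is a rendering option that only controls whether the HTML and XML renderers emit them.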
+
+---
+
+## Tree Manipulation
+
+### `cmark_node_unlink`
+
+```c
+CMARK_EXPORT void cmark_node_unlink(cmark_node *node);
+```
+
+Removes `node` from the tree (detaching from parent and siblings) without freeing its memory.
+
+### `cmark_node_insert_before`
+
+```c
+CMARK_EXPORT int cmark_node_insert_before(cmark_node *node, cmark_node *sibling);
+```
+
+Inserts `sibling` before `node`. Validates that the parent can contain the sibling (via `S_can_contain()`). Returns 1 on success, 0 on failure.
+
+### `cmark_node_insert_after`
+
+```c
+CMARK_EXPORT int cmark_node_insert_after(cmark_node *node, cmark_node *sibling);
+```
+
+Inserts `sibling` after `node`. Returns 1 on success, 0 on failure.
+
+### `cmark_node_replace`
+
+```c
+CMARK_EXPORT int cmark_node_replace(cmark_node *oldnode, cmark_node *newnode);
+```
+
+Replaces `oldnode` with `newnode` in the tree. The old node is unlinked but not freed.
+
+### `cmark_node_prepend_child`
+
+```c
+CMARK_EXPORT int cmark_node_prepend_child(cmark_node *node, cmark_node *child);
+```
+
+Adds `child` as the first child of `node`. Validates containership.
+
+### `cmark_node_append_child`
+
+```c
+CMARK_EXPORT int cmark_node_append_child(cmark_node *node, cmark_node *child);
+```
+
+Adds `child` as the last child of `node`. Validates containership.
+
+### `cmark_consolidate_text_nodes`
+
+```c
+CMARK_EXPORT void cmark_consolidate_text_nodes(cmark_node *root);
+```
+
+Merges adjacent `CMARK_NODE_TEXT` children into single text nodes throughout the subtree. Uses an iterator to find consecutive text nodes and concatenates their data via `cmark_strbuf`.
+
+---
+
+## Parsing Functions
+
+### `cmark_parser_new`
+
+```c
+CMARK_EXPORT cmark_parser *cmark_parser_new(int options);
+```
+
+Creates a parser with the default memory allocator and a new document root.
+
+### `cmark_parser_new_with_mem`
+
+```c
+CMARK_EXPORT cmark_parser *cmark_parser_new_with_mem(int options, cmark_mem *mem);
+```
+
+Creates a parser with the specified allocator.
+
+### `cmark_parser_new_with_mem_into_root`
+
+```c
+CMARK_EXPORT cmark_parser *cmark_parser_new_with_mem_into_root(
+ int options, cmark_mem *mem, cmark_node *root);
+```
+
+Creates a parser that appends parsed content to an existing root node. Useful for assembling a single document from multiple parsed fragments.
+
+### `cmark_parser_free`
+
+```c
+CMARK_EXPORT void cmark_parser_free(cmark_parser *parser);
+```
+
+Frees the parser and its internal buffers. Does NOT free the parsed document tree.
+
+### `cmark_parser_feed`
+
+```c
+CMARK_EXPORT void cmark_parser_feed(cmark_parser *parser, const char *buffer, size_t len);
+```
+
+Feeds a chunk of input data to the parser. Can be called multiple times for streaming input.
+
+### `cmark_parser_finish`
+
+```c
+CMARK_EXPORT cmark_node *cmark_parser_finish(cmark_parser *parser);
+```
+
+Finalizes parsing and returns the document root. Must be called after all input has been fed. Triggers `finalize_document()` which closes all open blocks and runs inline parsing.
+
+### `cmark_parse_document`
+
+```c
+CMARK_EXPORT cmark_node *cmark_parse_document(const char *buffer, size_t len, int options);
+```
+
+Convenience function equivalent to: create parser → feed entire buffer → finish → free parser. Returns the document root.
+
+### `cmark_parse_file`
+
+```c
+CMARK_EXPORT cmark_node *cmark_parse_file(FILE *f, int options);
+```
+
+Reads from a `FILE*` in 4096-byte chunks and parses incrementally.
+
+---
+
+## Rendering Functions
+
+### `cmark_render_html`
+
+```c
+CMARK_EXPORT char *cmark_render_html(cmark_node *root, int options);
+```
+
+Renders to HTML. Caller must free returned string.
+
+### `cmark_render_xml`
+
+```c
+CMARK_EXPORT char *cmark_render_xml(cmark_node *root, int options);
+```
+
+Renders to XML with CommonMark DTD. Includes `<?xml version="1.0" encoding="UTF-8"?>` header.
+
+### `cmark_render_man`
+
+```c
+CMARK_EXPORT char *cmark_render_man(cmark_node *root, int options, int width);
+```
+
+Renders to groff man page format. `width` controls line wrapping (0 = no wrap).
+
+### `cmark_render_commonmark`
+
+```c
+CMARK_EXPORT char *cmark_render_commonmark(cmark_node *root, int options, int width);
+```
+
+Renders back to CommonMark format. `width` controls line wrapping.
+
+### `cmark_render_latex`
+
+```c
+CMARK_EXPORT char *cmark_render_latex(cmark_node *root, int options, int width);
+```
+
+Renders to LaTeX. `width` controls line wrapping.
+
+---
+
+## Option Constants
+
+### Rendering Options
+
+```c
+#define CMARK_OPT_DEFAULT 0 // No special options
+#define CMARK_OPT_SOURCEPOS (1 << 1) // data-sourcepos attributes (HTML), sourcepos attributes (XML)
+#define CMARK_OPT_HARDBREAKS (1 << 2) // Render softbreaks as <br /> or \\
+#define CMARK_OPT_SAFE (1 << 3) // Legacy — safe mode is now default
+#define CMARK_OPT_UNSAFE (1 << 17) // Render raw HTML and dangerous URLs
+#define CMARK_OPT_NOBREAKS (1 << 4) // Render softbreaks as spaces
+```
+
+### Parsing Options
+
+```c
+#define CMARK_OPT_NORMALIZE (1 << 8) // Legacy — no effect
+#define CMARK_OPT_VALIDATE_UTF8 (1 << 9) // Replace invalid UTF-8 with U+FFFD
+#define CMARK_OPT_SMART (1 << 10) // Smart quotes and dashes
+```
+
+---
+
+## Memory Allocator
+
+### `cmark_get_default_mem_allocator`
+
+```c
+CMARK_EXPORT cmark_mem *cmark_get_default_mem_allocator(void);
+```
+
+Returns a pointer to the default allocator (`DEFAULT_MEM_ALLOCATOR` in `cmark.c`), which wraps `calloc` and `realloc` with abort-on-failure guards (`xcalloc`, `xrealloc`) and uses the standard `free`.
+
+---
+
+## Version API
+
+### `cmark_version`
+
+```c
+CMARK_EXPORT int cmark_version(void);
+```
+
+Returns the version as a packed integer: `(major << 16) | (minor << 8) | patch`.
+
+### `cmark_version_string`
+
+```c
+CMARK_EXPORT const char *cmark_version_string(void);
+```
+
+Returns the version as a human-readable string (e.g., `"0.31.2"`).
+
+---
+
+## Node Integrity Checking
+
+```c
+CMARK_EXPORT int cmark_node_check(cmark_node *node, FILE *out);
+```
+
+Validates the structural integrity of the node tree, printing errors to `out`. Returns the number of errors found. Available in all builds but primarily useful in debug builds.
+
+---
+
+## Cross-References
+
+- [ast-node-system.md](ast-node-system.md) — Internal struct definitions behind these opaque types
+- [iterator-system.md](iterator-system.md) — Detailed iterator mechanics
+- [memory-management.md](memory-management.md) — Allocator details and buffer management
+- [block-parsing.md](block-parsing.md) — How `cmark_parser_feed` and `cmark_parser_finish` work internally
+- [html-renderer.md](html-renderer.md) — How `cmark_render_html` generates output
diff --git a/docs/handbook/cmark/reference-system.md b/docs/handbook/cmark/reference-system.md
new file mode 100644
index 0000000000..0e63b5c796
--- /dev/null
+++ b/docs/handbook/cmark/reference-system.md
@@ -0,0 +1,307 @@
+# cmark — Reference System
+
+## Overview
+
+The reference system (`references.c`, `references.h`) manages link reference definitions — the `[label]: URL "title"` constructs in CommonMark. During block parsing, reference definitions are extracted and stored. During inline parsing, reference links (`[text][label]` and `[text]`) look up these stored definitions.
+
+## Data Structures
+
+### Reference Entry
+
+```c
+typedef struct cmark_reference {
+ struct cmark_reference *next; // Unused — leftover from old linked-list design
+ unsigned char *url;
+ unsigned char *title;
+ unsigned char *label;
+ unsigned int age; // Insertion order (for stable sorting)
+ unsigned int size; // Length of the label string
+} cmark_reference;
+```
+
+Each reference stores:
+- `label` — The normalized reference label (case-folded, whitespace-collapsed)
+- `url` — The destination URL
+- `title` — Optional title string (may be NULL)
+- `age` — Monotonically increasing counter for insertion order
+- `size` — Byte length of the label
+
+### Reference Map
+
+```c
+struct cmark_reference_map {
+ cmark_mem *mem;
+ cmark_reference **refs; // Sorted array of reference pointers
+ unsigned int size; // Number of entries
+ unsigned int ref_size; // Cumulative size of all labels + URLs + titles
+ unsigned int max_ref_size; // Maximum allowed ref_size (anti-DoS limit)
+ cmark_reference *last; // Most recently added reference
+ unsigned int asize; // Allocated capacity of refs array
+};
+```
+
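+The map stores references in a plain array and uses a **sorted array with binary search** for lookup, not a hash table. Insertion is an amortized O(1) append; the array is sorted and deduplicated once, lazily, on the first lookup, after which each lookup is O(log n).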
+
+### Anti-DoS Limiting
+
+The `ref_size` and `max_ref_size` fields prevent pathological inputs from causing excessive memory usage:
+
+```c
+unsigned int max_ref_size; // Set to 100 * input length at parser init
+unsigned int ref_size; // Sum of all label + url + title lengths
+```
+
+When `ref_size` exceeds `max_ref_size`, new reference additions are silently rejected. This prevents quadratic memory blowup from inputs with many reference definitions.
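
The accounting can be pictured with a small standalone sketch. The `ref_budget` type and `ref_budget_charge` function are hypothetical names invented for this illustration and are not part of cmark; the sketch only mirrors the check described above, where a definition is refused once the running total already exceeds the budget.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of the map's size accounting; not cmark's API. */
typedef struct {
  size_t ref_size;     /* bytes accepted so far */
  size_t max_ref_size; /* budget, e.g. a multiple of the input length */
} ref_budget;

/* Returns true and charges the budget if a definition may be stored.
 * A definition is refused only once ref_size already exceeds the limit. */
static bool ref_budget_charge(ref_budget *b, size_t label_len, size_t url_len,
                              size_t title_len) {
  assert(b != NULL);
  if (b->ref_size > b->max_ref_size)
    return false; /* silently reject */
  b->ref_size += label_len + url_len + title_len;
  return true;
}
```

Because the comparison runs before the new entry is charged, the total in this sketch can overshoot the limit by one definition before further additions are refused.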
+
+## Label Normalization
+
+```c
+static unsigned char *normalize_reference(cmark_mem *mem,
+ cmark_chunk *ref) {
+ cmark_strbuf normalized = CMARK_BUF_INIT(mem);
+
+ if (ref == NULL) return NULL;
+
+ if (ref->len == 0) return NULL;
+
+ cmark_utf8proc_case_fold(&normalized, ref->data, ref->len);
+ cmark_strbuf_trim(&normalized);
+ cmark_strbuf_normalize_whitespace(&normalized);
+
+ return cmark_strbuf_detach(&normalized);
+}
+```
+
+The normalization process (per CommonMark spec):
+1. **Case fold** — Uses Unicode case folding (not simple lowercasing), via `cmark_utf8proc_case_fold()`
+2. **Trim** — Remove leading and trailing whitespace
+3. **Collapse whitespace** — Replace runs of whitespace with a single space
+
+This means `[Foo Bar]`, `[FOO BAR]`, and `[foo bar]` all normalize to the same label.
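
As a rough, ASCII-only illustration of these three steps, the sketch below lower-cases, trims, and collapses whitespace in a single pass. `normalize_label_ascii` is a hypothetical helper, not a cmark function, and `tolower()` is only a stand-in for the Unicode case folding that cmark performs.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of cmark: ASCII-only label normalization.
 * Writes the result to out (capacity cap, including the NUL) and returns
 * its length, or 0 if the label is empty or does not fit. */
static size_t normalize_label_ascii(const char *in, char *out, size_t cap) {
  size_t n = 0;
  int pending_space = 0;

  assert(in != NULL && out != NULL && cap > 0);
  for (; *in; in++) {
    unsigned char c = (unsigned char)*in;
    if (isspace(c)) {
      /* Remember a separator only after some content has been emitted:
       * this trims leading whitespace and collapses internal runs. */
      pending_space = (n > 0);
      continue;
    }
    if (n + (size_t)pending_space + 1 >= cap) { /* no room for char + NUL */
      out[0] = '\0';
      return 0;
    }
    if (pending_space) {
      out[n++] = ' ';
      pending_space = 0;
    }
    out[n++] = (char)tolower(c);
  }
  out[n] = '\0'; /* a trailing pending space is dropped, i.e. trimmed */
  return n;
}
```

With this sketch, `"  Foo \t BAR\n"` and `"foo bar"` both produce `foo bar`, which is why differently cased and spaced labels resolve to the same definition.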
+
+## Reference Creation
+
+```c
+static void cmark_reference_create(cmark_reference_map *map,
+ cmark_chunk *label,
+ cmark_chunk *url,
+ cmark_chunk *title) {
+ cmark_reference *ref;
+ unsigned char *reflabel = normalize_reference(map->mem, label);
+
+ if (reflabel == NULL) return;
+
+ // Anti-DoS: check cumulative size limit
+ if (map->ref_size > map->max_ref_size) {
+ map->mem->free(reflabel);
+ return;
+ }
+
+ ref = (cmark_reference *)map->mem->calloc(1, sizeof(*ref));
+ ref->label = reflabel;
+ ref->url = cmark_clean_url(map->mem, url);
+ ref->title = cmark_clean_title(map->mem, title);
+ ref->age = map->size;
+ ref->size = (unsigned int)strlen((char *)reflabel);
+
+ // Track cumulative size
+ map->ref_size += ref->size;
+ if (ref->url) map->ref_size += (unsigned int)strlen((char *)ref->url);
+ if (ref->title) map->ref_size += (unsigned int)strlen((char *)ref->title);
+
+ // Add to array
+ if (map->size >= map->asize) {
+ // Grow array (double capacity)
+ map->asize = map->asize ? 2 * map->asize : 8;
+ map->refs = (cmark_reference **)map->mem->realloc(
+ map->refs, map->asize * sizeof(cmark_reference *));
+ }
+ map->refs[map->size] = ref;
+ map->size++;
+ map->last = ref;
+}
+```
+
+References are appended to the array in insertion order. The array is NOT kept sorted during insertion — it's sorted once at lookup time (lazily).
+
+## Reference Lookup
+
+```c
+cmark_reference *cmark_reference_lookup(cmark_reference_map *map,
+ cmark_chunk *label) {
+ if (label->len < 1 || label->len > MAX_LINK_LABEL_LENGTH) return NULL;
+ if (map == NULL || map->size == 0) return NULL;
+
+ unsigned char *norm = normalize_reference(map->mem, label);
+ if (norm == NULL) return NULL;
+
+ // Sort on first lookup
+ if (!map->sorted) {
+ qsort(map->refs, map->size, sizeof(cmark_reference *), refcmp);
+ // Remove duplicates (keep first occurrence)
+ // ...
+ map->sorted = true;
+ }
+
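+ // Binary search. The key is the normalized label string itself, so the
+ // comparator must compare a label against an entry; refcmp expects two
+ // cmark_reference ** and cannot be reused here. Upstream cmark names this
+ // label-vs-entry comparator refsearch.
+ cmark_reference **found = (cmark_reference **)bsearch(
+ norm, map->refs, map->size, sizeof(cmark_reference *), refsearch);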
+
+ map->mem->free(norm);
+ return found ? *found : NULL;
+}
+```
+
+### Lazy Sorting
+
+The reference map is NOT sorted during insertion. On the first call to `cmark_reference_lookup()`, the array is sorted using `qsort()` with a comparison function:
+
+```c
+static int refcmp(const void *a, const void *b) {
+ const cmark_reference *refa = *(const cmark_reference **)a;
+ const cmark_reference *refb = *(const cmark_reference **)b;
+ int cmp = strcmp((char *)refa->label, (char *)refb->label);
+ if (cmp != 0) return cmp;
+ // Tie-break by age (earlier wins)
+ if (refa->age < refb->age) return -1;
+ if (refa->age > refb->age) return 1;
+ return 0;
+}
+```
+
+When labels collide (same normalized label), the first definition wins (lowest `age`).
+
+After sorting, duplicates are removed — entries with the same label as the preceding entry are freed:
+```c
+unsigned int write = 0;
+for (unsigned int read = 0; read < map->size; read++) {
+ if (write > 0 &&
+ strcmp((char *)map->refs[write-1]->label,
+ (char *)map->refs[read]->label) == 0) {
+ // Duplicate — free it
+ cmark_reference_free(map->mem, map->refs[read]);
+ } else {
+ map->refs[write++] = map->refs[read];
+ }
+}
+map->size = write;
+```
+
+### Binary Search
+
+After sorting and deduplication, lookups use standard `bsearch()`, giving O(log n) lookup time.
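
The sort, deduplicate, and search sequence can be tried in isolation with the standalone sketch below. `ref_entry`, `entry_cmp`, `label_vs_entry`, `sort_and_dedup`, and `lookup` are hypothetical names for this illustration; unlike the real map, the sketch does not own or free its entries.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {     /* hypothetical stand-in for cmark_reference */
  const char *label; /* assumed to be normalized already */
  const char *url;
  unsigned int age;  /* insertion order */
} ref_entry;

/* Order by label, then by age so the earliest definition sorts first. */
static int entry_cmp(const void *a, const void *b) {
  const ref_entry *x = *(ref_entry *const *)a;
  const ref_entry *y = *(ref_entry *const *)b;
  int cmp = strcmp(x->label, y->label);
  if (cmp != 0) return cmp;
  return (x->age > y->age) - (x->age < y->age);
}

/* bsearch comparator: the key is a plain label string, not an entry. */
static int label_vs_entry(const void *key, const void *elem) {
  const ref_entry *e = *(ref_entry *const *)elem;
  return strcmp((const char *)key, e->label);
}

/* Sorts refs in place, drops later duplicates, returns the new count. */
static size_t sort_and_dedup(ref_entry **refs, size_t n) {
  size_t w = 0;
  qsort(refs, n, sizeof *refs, entry_cmp);
  for (size_t r = 0; r < n; r++) {
    if (w > 0 && strcmp(refs[w - 1]->label, refs[r]->label) == 0)
      continue; /* duplicate label: the earlier definition already won */
    refs[w++] = refs[r];
  }
  return w;
}

static const ref_entry *lookup(ref_entry **refs, size_t n, const char *label) {
  assert(label != NULL);
  ref_entry **hit = bsearch(label, refs, n, sizeof *refs, label_vs_entry);
  return hit ? *hit : NULL;
}
```

Sorting on `(label, age)` is what makes "first definition wins" fall out of a simple adjacent-duplicate pass: within each group of equal labels, the survivor is always the entry with the lowest `age`.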
+
+## URL and Title Cleaning
+
+When creating references, URLs and titles are cleaned:
+
+### `cmark_clean_url()`
+```c
+unsigned char *cmark_clean_url(cmark_mem *mem, cmark_chunk *url);
+```
+- Removes surrounding `<` and `>` if present (angle-bracket URLs)
+- Unescapes backslash escapes
+- Decodes entity references
+- Percent-encodes non-URL-safe characters via `houdini_escape_href()`
+
+### `cmark_clean_title()`
+```c
+unsigned char *cmark_clean_title(cmark_mem *mem, cmark_chunk *title);
+```
+- Strips the first and last character (the delimiters: `"..."`, `'...'`, or `(...)`)
+- Unescapes backslash escapes
+- Decodes entity references
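
The first two title-cleaning steps can be sketched in a few lines of standalone C. `clean_title_sketch` is a hypothetical helper written for this illustration, not cmark's implementation, and it deliberately omits entity decoding.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: copies a raw title such as "(a \) b)" into out
 * without its delimiters and with backslash escapes resolved. Entity
 * decoding is omitted. out must hold at least strlen(raw) + 1 bytes.
 * Returns the length of the cleaned title. */
static size_t clean_title_sketch(const char *raw, char *out) {
  size_t len, n = 0;

  assert(raw != NULL && out != NULL);
  len = strlen(raw);
  /* Index 0 and index len - 1 are the delimiters and are skipped. */
  for (size_t i = 1; i + 1 < len; i++) {
    /* A backslash escapes only ASCII punctuation; otherwise it is literal.
     * The i + 2 < len guard keeps the closing delimiter out of reach. */
    if (raw[i] == '\\' && i + 2 < len && ispunct((unsigned char)raw[i + 1]))
      i++;
    out[n++] = raw[i];
  }
  out[n] = '\0';
  return n;
}
```

Under these assumptions a raw title of `"a \"b\""` (including its outer quotes) becomes `a "b"`, while `(x\y)` keeps its backslash because `y` is not punctuation.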
+
+## Integration with Parser
+
+### Extraction during Block Parsing
+
+Reference definitions are extracted when paragraphs are finalized:
+
+```c
+// In blocks.c, during paragraph finalization:
+while (cmark_parse_reference_inline(parser->mem, &node_content,
+ parser->refmap)) {
+ // Keep parsing references from the start of the paragraph
+}
+```
+
+### `cmark_parse_reference_inline()`
+
+```c
+int cmark_parse_reference_inline(cmark_mem *mem, cmark_strbuf *input,
+ cmark_reference_map *refmap) {
+ // Parse: [label]: destination "title"
+ // Returns 1 if a reference was found and consumed, 0 otherwise
+ subject subj;
+ // ... initialize subject on the input buffer
+ // Parse label
+ cmark_chunk lab = cmark_chunk_literal("");
+ cmark_chunk url = cmark_chunk_literal("");
+ cmark_chunk title = cmark_chunk_literal("");
+
+ if (!link_label(&subj, &lab) || lab.len == 0) return 0;
+ if (peek_char(&subj) != ':') return 0;
+ advance(&subj);
+ spnl(&subj); // skip spaces and up to one newline
+ if (!manual_scan_link_url(&subj, &url)) return 0;
+ // ... parse optional title
+ // ... validate: rest of line must be blank
+ cmark_reference_create(refmap, &lab, &url, &title);
+ // Remove consumed bytes from input
+ return 1;
+}
+```
+
+The parser repeatedly calls this function on paragraph content. Each successful parse removes the reference definition from the paragraph. If the entire paragraph consists of reference definitions, the paragraph node is removed from the AST.
+
+### Lookup during Inline Parsing
+
+In `inlines.c`, when a potential reference link is found:
+
+```c
+cmark_reference *ref = cmark_reference_lookup(subj->refmap, &raw_label);
+if (ref) {
+ // Create link node with ref->url and ref->title
+}
+```
+
+## Label Length Limit
+
+```c
+#define MAX_LINK_LABEL_LENGTH 999
+```
+
+Reference labels longer than 999 characters are rejected, per the CommonMark spec.
+
+## Map Lifecycle
+
+```c
+cmark_reference_map *cmark_reference_map_new(cmark_mem *mem);
+void cmark_reference_map_free(cmark_reference_map *map);
+```
+
+The map is created during parser initialization and freed when the parser is freed. The AST's reference links have already been resolved and store their own copies of URL and title — the reference map is not needed after parsing.
+
+### Cleanup
+
+```c
+void cmark_reference_map_free(cmark_reference_map *map) {
+ if (map == NULL) return;
+ for (unsigned int i = 0; i < map->size; i++) {
+ cmark_reference_free(map->mem, map->refs[i]);
+ }
+ map->mem->free(map->refs);
+ map->mem->free(map);
+}
+```
+
+Each reference and its strings (label, url, title) are freed, then the array and map struct are freed.
+
+## Cross-References
+
+- [references.c](../../cmark/src/references.c) — Implementation
+- [references.h](../../cmark/src/references.h) — Data structures
+- [block-parsing.md](block-parsing.md) — Reference extraction during paragraph finalization
+- [inline-parsing.md](inline-parsing.md) — Reference lookup during link resolution
+- [utf8-handling.md](utf8-handling.md) — Case folding used in label normalization
diff --git a/docs/handbook/cmark/render-framework.md b/docs/handbook/cmark/render-framework.md
new file mode 100644
index 0000000000..065b9c878f
--- /dev/null
+++ b/docs/handbook/cmark/render-framework.md
@@ -0,0 +1,294 @@
+# cmark — Render Framework
+
+## Overview
+
+The render framework (`render.c`, `render.h`) provides a generic rendering infrastructure used by three of the five renderers: LaTeX, man, and CommonMark. It handles line wrapping, prefix management, and character-level output dispatch. The HTML and XML renderers bypass this framework and write directly to buffers.
+
+## The `cmark_renderer` Structure
+
+```c
+struct cmark_renderer {
+ cmark_mem *mem;
+ cmark_strbuf *buffer; // Output buffer
+ cmark_strbuf *prefix; // Current line prefix (e.g., "> " for blockquotes)
+ int column; // Current column position (for wrapping)
+ int width; // Target width (0 = no wrapping)
+ int need_cr; // Pending newlines count
+ bufsize_t last_breakable; // Position of last breakable point in buffer
+ bool begin_line; // True if at the start of a line
+ bool begin_content; // True if no content has been output on current line (after prefix)
+ bool no_linebreaks; // Suppress newlines (for rendering within attributes)
+ bool in_tight_list_item; // Currently inside a tight list item
+ void (*outc)(cmark_renderer *, cmark_escaping, int32_t, unsigned char);
+ // Per-character output callback
+ int32_t (*render_node)(cmark_renderer *, cmark_node *, cmark_event_type, int);
+ // Per-node render callback
+};
+```
+
+### Key Fields
+
+- **`column`** — Tracks horizontal position for word-wrap decisions.
+- **`width`** — If > 0, enables automatic line wrapping at word boundaries.
+- **`prefix`** — Accumulated prefix string. For nested block quotes and list items, prefixes stack (e.g., `"> - "` for a list item inside a block quote).
+- **`last_breakable`** — Buffer position of the last whitespace where a line break could be inserted. Used for retroactive line wrapping.
+- **`begin_line`** — True immediately after a newline. Used by renderers to decide whether to escape line-start characters.
+- **`begin_content`** — True until the first non-prefix content on a line. Distinguished from `begin_line` because the prefix itself isn't "content".
+- **`no_linebreaks`** — When true, newlines are converted to spaces. Used when rendering content inside constructs that can't contain literal newlines.
+
+## Entry Point
+
+```c
+char *cmark_render(cmark_mem *mem, cmark_node *root, int options, int width,
+ void (*outc)(cmark_renderer *, cmark_escaping, int32_t, unsigned char),
+ int32_t (*render_node)(cmark_renderer *, cmark_node *,
+ cmark_event_type, int)) {
+ cmark_renderer renderer = {
+ mem,
+ &buf, // buffer
+ &pref, // prefix
+ 0, // column
+ width, // width
+ 0, // need_cr
+ 0, // last_breakable
+ true, // begin_line
+ true, // begin_content
+ false, // no_linebreaks
+ false, // in_tight_list_item
+ outc, // outc
+ render_node // render_node
+ };
+ // ... iterate AST, call render_node for each event
+ return (char *)cmark_strbuf_detach(&buf);
+}
+```
+
+The framework creates a `cmark_renderer`, iterates over the AST using `cmark_iter`, and calls the provided `render_node` function for each event. The `outc` callback handles per-character output with escaping decisions.
+
+## Escaping Modes
+
+```c
+typedef enum {
+ LITERAL, // No escaping — output characters as-is
+ NORMAL, // Full escaping for prose text
+ TITLE, // Escaping for link titles
+ URL, // Escaping for URLs
+} cmark_escaping;
+```
+
+Each renderer's `outc` function switches on this enum to determine how to handle special characters.
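
To make the dispatch concrete, here is a minimal standalone sketch of an `outc`-style function for an imaginary output format that backslash-escapes a few characters and percent-encodes URLs. The `escaping_mode` enum mirrors `cmark_escaping`, but `sketch_outc`, `sketch_render`, and the specific escaping rules are assumptions made for this example rather than the behavior of any real cmark renderer.

```c
#include <assert.h>
#include <string.h>

typedef enum { LITERAL, NORMAL, TITLE, URL } escaping_mode;

/* Hypothetical per-character callback: writes the rendering of c to out
 * (at most 3 bytes, no NUL) and returns the number of bytes written. */
static size_t sketch_outc(char *out, escaping_mode mode, unsigned char c) {
  static const char hex[] = "0123456789ABCDEF";
  int backslash = 0;

  switch (mode) {
  case LITERAL: /* pass everything through untouched */
    break;
  case NORMAL:  /* prose: protect characters the format treats as markup */
    backslash = (c == '*' || c == '_' || c == '\\');
    break;
  case TITLE:   /* titles live inside double quotes */
    backslash = (c == '"' || c == '\\');
    break;
  case URL:     /* URLs: percent-encode controls, spaces, angle brackets */
    if (c <= 0x20 || c == '<' || c == '>') {
      out[0] = '%';
      out[1] = hex[c >> 4];
      out[2] = hex[c & 0x0F];
      return 3;
    }
    break;
  }
  if (backslash) *out++ = '\\';
  *out = (char)c;
  return backslash ? 2u : 1u;
}

/* Renders a whole string through sketch_outc into out (capacity cap). */
static void sketch_render(const char *s, escaping_mode mode, char *out,
                          size_t cap) {
  size_t len = strlen(s);
  assert(cap >= 3 * len + 1); /* worst case: every byte percent-encoded */
  for (size_t i = 0; i < len; i++)
    out += sketch_outc(out, mode, (unsigned char)s[i]);
  *out = '\0';
}
```

The same input therefore renders differently depending on where it appears: `a*b` stays as is under `LITERAL` but gains a backslash under `NORMAL`, and a space only becomes `%20` under `URL`.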
+
+## Output Functions
+
+### `cmark_render_code_point()`
+
+```c
+void cmark_render_code_point(cmark_renderer *renderer, int32_t c) {
+ cmark_utf8proc_encode_char(c, renderer->buffer);
+ renderer->column += 1;
+}
+```
+
+Low-level: encodes a single Unicode codepoint as UTF-8 into the buffer and advances the column counter.
+
+### `cmark_render_ascii()`
+
+```c
+void cmark_render_ascii(cmark_renderer *renderer, const char *s) {
+ int len = (int)strlen(s);
+ cmark_strbuf_puts(renderer->buffer, s);
+ renderer->column += len;
+}
+```
+
+Outputs an ASCII string and advances the column counter. Used for fixed escape sequences like `\&`, `\textbf{`, etc.
+
+### `S_out()` — Main Output Dispatcher
+
+```c
+static CMARK_INLINE void S_out(cmark_renderer *renderer, const char *source,
+ bool wrap, cmark_escaping escape) {
+ int length = (int)strlen(source);
+ unsigned char nextc;
+ int32_t c;
+ int i = 0;
+ int len;
+ cmark_chunk remainder = cmark_chunk_literal("");
+ int k = renderer->buffer->size - 1;
+
+ wrap = wrap && !renderer->no_linebreaks;
+
+ if (renderer->need_cr) {
+ // Output pending newlines
+ while (renderer->need_cr > 0) {
+ S_cr(renderer);
+ renderer->need_cr--;
+ }
+ }
+
+ while (i < length) {
+ if (renderer->begin_line) {
+ // Output prefix at start of each line
+ cmark_strbuf_puts(renderer->buffer, (char *)renderer->prefix->ptr);
+ renderer->column = renderer->prefix->size;
+ renderer->begin_line = false;
+ renderer->begin_content = true;
+ }
+
+ len = cmark_utf8proc_charlen((uint8_t *)source + i, length - i);
+ if (len == -1) { // Invalid UTF-8
+ // ... handle error
+ }
+
+ cmark_utf8proc_iterate((uint8_t *)source + i, len, &c);
+ nextc = source[i + len]; // Lookahead byte passed to outc ('\0' at end of input)
+
+ if (c == 10) {
+ // Newline
+ cmark_strbuf_putc(renderer->buffer, '\n');
+ renderer->column = 0;
+ renderer->begin_line = true;
+ renderer->begin_content = true;
+ renderer->last_breakable = 0;
+ } else if (wrap) {
+ if (c == 32 && renderer->column > renderer->width / 2) {
+ // Space past half-width — mark as potential break point
+ renderer->last_breakable = renderer->buffer->size;
+ cmark_render_code_point(renderer, c);
+ } else if (renderer->column > renderer->width &&
+ renderer->last_breakable > 0) {
+ // Past target width with a break point — retroactively break
+ // Replace the space at last_breakable with newline + prefix
+ // ...
+ } else {
+ renderer->outc(renderer, escape, c, nextc);
+ }
+ } else {
+ renderer->outc(renderer, escape, c, nextc);
+ }
+
+ if (c != 10) {
+ renderer->begin_content = false;
+ }
+ i += len;
+ }
+}
+```
+
+This is the core output function. It:
+1. Handles deferred newlines (`need_cr`)
+2. Outputs line prefixes at the start of each line
+3. Tracks column position
+4. Implements word wrapping via retroactive line breaks
+5. Delegates character-level escaping to `renderer->outc()`
+
+### Line Wrapping Algorithm
+
+The wrapping algorithm uses a **retroactive break** strategy:
+
+1. As text flows through `S_out()`, spaces past the half-width mark are recorded as potential break points (`last_breakable`).
+2. When the column exceeds `width`, the buffer is split at `last_breakable`:
+ - Everything after the break point is saved in `remainder`
+ - A newline and the current prefix are inserted at the break point
+ - The remainder is reappended
+
+This avoids lookahead: the renderer doesn't need to know the length of upcoming content to decide where to break.
+
+```c
+// Retroactive line break:
+remainder = cmark_chunk_dup(&renderer->buffer->..., last_breakable, ...);
+cmark_strbuf_truncate(renderer->buffer, last_breakable);
+cmark_strbuf_putc(renderer->buffer, '\n');
+cmark_strbuf_puts(renderer->buffer, (char *)renderer->prefix->ptr);
+cmark_strbuf_put(renderer->buffer, remainder.data, remainder.len);
+renderer->column = renderer->prefix->size + cmark_chunk_len(&remainder);
+renderer->last_breakable = 0;
+renderer->begin_line = false;
+renderer->begin_content = false;
+```
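
The retroactive strategy can be reproduced in a self-contained sketch. Everything here (`wrap_state`, `wrap_putc`, `wrap_puts`, the fixed 256-byte buffer) is hypothetical and simplified: every space is treated as a break candidate, the first line gets no prefix, and one byte counts as one column.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical miniature of the renderer state; not cmark's struct. */
typedef struct {
  char buf[256];
  size_t size;           /* bytes used in buf */
  size_t column;         /* characters on the current line */
  size_t width;          /* target line width */
  size_t last_breakable; /* index in buf of the last space, 0 = none */
  const char *prefix;    /* re-emitted after every inserted newline */
} wrap_state;

static void wrap_putc(wrap_state *w, char c) {
  size_t plen = strlen(w->prefix);
  assert(w->size + plen + 2 <= sizeof w->buf); /* sketch: fixed capacity */

  if (c == ' ')
    w->last_breakable = w->size; /* remember the space, decide later */
  w->buf[w->size++] = c;
  w->column++;

  if (w->column > w->width && w->last_breakable > 0) {
    /* Retroactive break: the space becomes '\n', the prefix is inserted
     * after it, and the text after the space is shifted to make room. */
    size_t tail_start = w->last_breakable + 1;
    size_t tail = w->size - tail_start;
    memmove(w->buf + tail_start + plen, w->buf + tail_start, tail);
    w->buf[w->last_breakable] = '\n';
    memcpy(w->buf + tail_start, w->prefix, plen);
    w->size += plen;
    w->column = plen + tail; /* the tail now starts the new line */
    w->last_breakable = 0;
  }
  w->buf[w->size] = '\0';
}

static void wrap_puts(wrap_state *w, const char *s) {
  while (*s) wrap_putc(w, *s++);
}
```

Feeding `ab cdefgh` through a state with `width = 6` and prefix `"> "` yields `ab`, a newline, then `> cdefgh`: the break is only decided once `f` pushes the column past the width, and a word longer than the width is left unbroken because no break point is available.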
+
+## Convenience Functions
+
+### `CR()`
+
+```c
+#define CR() renderer->need_cr = 1
+```
+
+Requests a newline before the next content output. Because the macro assigns `need_cr` rather than incrementing it, repeated `CR()` calls don't stack: only one newline is inserted.
+
+### `BLANKLINE()`
+
+```c
+#define BLANKLINE() renderer->need_cr = 2
+```
+
+Requests a blank line (two newlines) before the next content output.
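
The deferred-newline idea behind both macros is sketched below as standalone C. `cr_state`, `request_cr`, `request_blankline`, and `emit` are hypothetical names; the sketch also adds a never-downgrade guard, so a newline request cannot cancel an earlier blank-line request. That guard is an assumption of this example and is not shown in the macros above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical miniature: an output buffer plus a pending-newline count. */
typedef struct {
  char buf[64];
  size_t size;
  int need_cr; /* 0 = nothing pending, 1 = newline, 2 = blank line */
} cr_state;

/* Requests only ever raise need_cr, so they neither stack nor downgrade. */
static void request_cr(cr_state *s)        { if (s->need_cr < 1) s->need_cr = 1; }
static void request_blankline(cr_state *s) { if (s->need_cr < 2) s->need_cr = 2; }

/* Pending newlines are written only when real content arrives. */
static void emit(cr_state *s, const char *text) {
  size_t len = strlen(text);
  if (s->size == 0)
    s->need_cr = 0; /* no leading newlines at the start of the output */
  assert(s->size + (size_t)s->need_cr + len + 1 <= sizeof s->buf);
  while (s->need_cr > 0) {
    s->buf[s->size++] = '\n';
    s->need_cr--;
  }
  memcpy(s->buf + s->size, text, len + 1); /* copies the NUL as well */
  s->size += len;
}
```

Deferring the newline until the next `emit` means a renderer can request separation on both leaving one block and entering the next without producing doubled blank lines.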
+
+### `OUT()`
+
+```c
+#define OUT(s, wrap, escaping) (S_out(renderer, s, wrap, escaping))
+```
+
+### `LIT()`
+
+```c
+#define LIT(s) (S_out(renderer, s, false, LITERAL))
+```
+
+Output literal text (no escaping, no wrapping).
+
+### `NOBREAKS()`
+
+```c
+#define NOBREAKS(s) \
+ do { renderer->no_linebreaks = true; OUT(s, false, NORMAL); renderer->no_linebreaks = false; } while(0)
+```
+
+Output text with normal escaping but with newlines suppressed (converted to spaces).
+
+## Prefix Management
+
+Prefixes are used for block-level indentation. The renderer maintains a `cmark_strbuf` prefix that is output at the start of each line.
+
+### Usage Pattern
+
+```c
+// In commonmark.c, entering a block quote:
+cmark_strbuf_puts(renderer->prefix, "> ");
+// ... render children ...
+// On exit:
+cmark_strbuf_truncate(renderer->prefix, original_prefix_len);
+```
+
+Renderers save the prefix length before modifying it and restore it on exit. This creates a stack-like behavior for nested containers.
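
The save-and-truncate pattern works like a stack without needing one. The sketch below uses a hypothetical fixed-size `prefix_buf` in place of `cmark_strbuf`; `prefix_push` and `prefix_restore` are illustrative names, not cmark functions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the renderer's prefix strbuf. */
typedef struct {
  char text[64];
  size_t len;
} prefix_buf;

/* Appends s and returns the previous length so the caller can restore it. */
static size_t prefix_push(prefix_buf *p, const char *s) {
  size_t saved = p->len, n = strlen(s);
  assert(saved + n < sizeof p->text);
  memcpy(p->text + saved, s, n + 1); /* copies the NUL as well */
  p->len = saved + n;
  return saved;
}

/* Truncates back to a length returned by an earlier prefix_push. */
static void prefix_restore(prefix_buf *p, size_t saved) {
  assert(saved <= p->len);
  p->len = saved;
  p->text[saved] = '\0';
}
```

Entering a block quote and then a list item pushes `"> "` and `"- "` in turn; restoring the saved lengths in reverse order on exit unwinds the prefix to `"> "` and then to the empty string.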
+
+## Framework vs Direct Rendering
+
+| Feature | Framework (render.c) | Direct (html.c, xml.c) |
+|---------|---------------------|----------------------|
+| Line wrapping | Yes (`width` parameter) | No |
+| Prefix management | Yes (automatic) | No (uses HTML tags) |
+| Per-char escaping | Via `outc` callback | Via `escape_html()` helper |
+| Column tracking | Yes | No |
+| Break points | Retroactive insertion | N/A |
+| `cmark_escaping` enum | Yes | No |
+
+## Which Renderers Use the Framework
+
+| Renderer | Uses Framework | Why/Why Not |
+|----------|---------------|-------------|
+| LaTeX (`latex.c`) | Yes | Needs wrapping for structured text |
+| man (`man.c`) | Yes | Needs wrapping for terminal display |
+| CommonMark (`commonmark.c`) | Yes | Needs wrapping and prefix management |
+| HTML (`html.c`) | No | HTML handles layout via browser |
+| XML (`xml.c`) | No | XML output is structural, not visual |
+
+## Cross-References
+
+- [render.c](../../cmark/src/render.c) — Framework implementation
+- [render.h](../../cmark/src/render.h) — `cmark_renderer` struct and `cmark_escaping` enum
+- [latex-renderer.md](latex-renderer.md) — LaTeX `outc` and `S_render_node`
+- [man-renderer.md](man-renderer.md) — Man `S_outc` and `S_render_node`
+- [commonmark-renderer.md](commonmark-renderer.md) — CommonMark `outc` and `S_render_node`
+- [html-renderer.md](html-renderer.md) — Direct renderer (no framework)
diff --git a/docs/handbook/cmark/scanner-system.md b/docs/handbook/cmark/scanner-system.md
new file mode 100644
index 0000000000..79adf03798
--- /dev/null
+++ b/docs/handbook/cmark/scanner-system.md
@@ -0,0 +1,223 @@
+# cmark — Scanner System
+
+## Overview
+
+The scanner system (`scanners.h`, `scanners.re`, `scanners.c`) provides fast pattern-matching functions used throughout cmark's block and inline parsers. The scanners are generated from re2c specifications and compiled into optimized C `switch`/`if` automata. Each scanner recognizes a fixed regular pattern anchored at a given position: no backtracking, and no captures beyond the match length.
+
+## Architecture
+
+### Source Files
+
+- `scanners.re` — re2c source with pattern specifications
+- `scanners.c` — Generated C code (committed to the repository, regenerated manually)
+- `scanners.h` — Public declarations (macro wrappers and function prototypes)
+
+### Generation
+
+Scanners are regenerated from re2c source via:
+```bash
+re2c --case-insensitive -b -i --no-generation-date -8 -o scanners.c scanners.re
+```
+
+Flags:
+- `--case-insensitive` — Case-insensitive matching
+- `-b` — Use bit vectors for character classes
+- `-i` — Omit `#line` directives from the generated file (`--no-debug-info`)
+- `--no-generation-date` — Reproducible output
+- `-8` — Treat the input as UTF-8
+
+The generated code consists of state machines implemented as nested `switch`/`if` blocks with direct character comparisons, plus the small bit-vector lookup tables that `-b` emits. There are no regular expression objects and nothing is interpreted at run time: the patterns are compiled directly into C control flow.
+
+## Scanner Interface
+
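+### The `_scan_at` Wrapper
+
+```c
+bufsize_t _scan_at(bufsize_t (*scanner)(const unsigned char *),
+ cmark_chunk *c, bufsize_t offset);
+```
+
+In upstream cmark, `_scan_at` is a small function rather than a macro. It temporarily NUL-terminates the chunk, calls the scanner at `c->data + offset`, and then restores the overwritten byte. It returns `0` if the chunk has no data or `offset` lies past its end.
+
+Each generated scanner takes a pointer to NUL-terminated input and is paired with a convenience macro:
+```c
+bufsize_t _scan_PATTERN(const unsigned char *p);
+#define scan_PATTERN(c, n) _scan_at(&_scan_PATTERN, c, n)
+```
+
+Macro parameters:
+- `c` — Pointer to the `cmark_chunk` holding the input
+- `n` — Starting position within the chunk
+
+Return value:
+- Length of the match (in bytes) if successful
+- `0` if no match at the given position
+
+### Common Pattern
+
+```c
+// In blocks.c:
+matched = scan_thematic_break(&input, first_nonspace);
+
+// In inlines.c:
+matched = scan_autolink_uri(&subj->input, subj->pos);
+```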
+
+## Scanner Functions
+
+### Block Structure Scanners
+
+| Scanner | Purpose | Used In |
+|---------|---------|---------|
+| `scan_thematic_break` | Matches `***`, `---`, `___` (with optional spaces) | `blocks.c` |
+| `scan_atx_heading_start` | Matches `#{1,6}` followed by space or EOL | `blocks.c` |
+| `scan_setext_heading_line` | Matches `=+` or `-+` at line start | `blocks.c` |
+| `scan_open_code_fence` | Matches `` ``` `` or `~~~` (3+ fence chars) | `blocks.c` |
+| `scan_close_code_fence` | Matches closing fence (≥ opening length) | `blocks.c` |
+| `scan_html_block_start` | Matches HTML block type 1-6 openers | `blocks.c` |
+| `scan_html_block_start_7` | Matches HTML block type 7 openers | `blocks.c` |
+| `scan_html_block_end_1` | Matches `</script>`, `</pre>`, `</textarea>`, `</style>` | `blocks.c` |
+| `scan_html_block_end_2` | Matches `-->` | `blocks.c` |
+| `scan_html_block_end_3` | Matches `?>` | `blocks.c` |
+| `scan_html_block_end_4` | Matches `>` | `blocks.c` |
+| `scan_html_block_end_5` | Matches `]]>` | `blocks.c` |
+
+### Inline Scanners
+
+| Scanner | Purpose | Used In |
+|---------|---------|---------|
+| `scan_autolink_uri` | Matches URI autolinks `<scheme:path>` | `inlines.c` |
+| `scan_autolink_email` | Matches email autolinks `<user@host>` | `inlines.c` |
+| `scan_html_tag` | Matches inline HTML tags (open, close, comment, PI, CDATA, declaration) | `inlines.c` |
+| `scan_entity` | Matches HTML entities (`&amp;`, `&#123;`, `&#x1F;`) | `inlines.c` |
+| `scan_dangerous_url` | Matches `javascript:`, `vbscript:`, `file:`, `data:` URLs | `html.c` |
+| `scan_spacechars` | Matches runs of spaces and tabs | `inlines.c` |
+
+### Link/Reference Scanners
+
+| Scanner | Purpose | Used In |
+|---------|---------|---------|
+| `scan_link_url` | Matches link destinations (parenthesized or bare) | `inlines.c` |
+| `scan_link_title` | Matches quoted link titles | `inlines.c` |
+
+## Scanner Patterns (from `scanners.re`)
+
+### Thematic Break
+```
+thematic_break = (('*' [ \t]*){3,} | ('-' [ \t]*){3,} | ('_' [ \t]*){3,}) [ \t]* [\n]
+```
+Three or more `*`, `-`, or `_` characters, optionally separated by spaces/tabs.
+
+### ATX Heading
+```
+atx_heading_start = [#]{1,6} ([ \t]+ | [\n])
+```
+1-6 `#` characters followed by space/tab or newline.
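
For comparison, a hand-written equivalent of this pattern fits in a dozen lines. The function below is a sketch with a hypothetical name and an explicit length parameter so that it can be bounds-checked on its own; it is not the re2c-generated code, but it follows the same contract of returning the match length or `0`.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written sketch of atx_heading_start: 1-6 '#' characters followed
 * by a run of spaces/tabs or by a newline. Returns the number of bytes
 * matched starting at offset, or 0 if the pattern does not match there. */
static size_t sketch_scan_atx_heading_start(const unsigned char *s, size_t len,
                                            size_t offset) {
  size_t i = offset, hashes = 0;

  assert(s != NULL && offset <= len);
  while (i < len && s[i] == '#') {
    i++;
    hashes++;
  }
  if (hashes < 1 || hashes > 6)
    return 0;
  if (i < len && s[i] == '\n')
    return i + 1 - offset; /* empty heading: hashes then newline */
  if (i >= len || (s[i] != ' ' && s[i] != '\t'))
    return 0;              /* e.g. "#hashtag" is not a heading */
  while (i < len && (s[i] == ' ' || s[i] == '\t'))
    i++;
  return i - offset;       /* hashes plus the whitespace run */
}
```

On `## Title` the sketch returns 3 (two hashes and one space), which tells the block parser where the heading content begins.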
+
+### Code Fence
+```
+open_code_fence = [`]{3,} [^`\n]* [\n] | [~]{3,} [^\n]* [\n]
+```
+Three or more backticks followed by an info string that contains no backtick, or three or more tildes.
+
+### HTML Block Start (Types 1-7)
+
+The CommonMark spec defines 7 types of HTML blocks, each matched by different scanners:
+
+1. `<script>`, `<pre>`, `<textarea>`, `<style>` (case-insensitive)
+2. `<!--`
+3. `<?`
+4. `<!` followed by an ASCII letter (declaration)
+5. `<![CDATA[`
+6. HTML tags from a specific set (e.g., `<div>`, `<table>`, `<h1>`, etc.)
+7. Complete open/close tags (not `<script>`, `<pre>`, `<textarea>`, `<style>`)
+
+### Autolink URI
+```
+autolink_uri = '<' scheme ':' [^\x00-\x20<>]* '>'
+scheme = [A-Za-z][A-Za-z0-9+.\-]{1,31}
+```
+
+### Autolink Email
+```
+autolink_email = '<' [A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+ '@'
+ [A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?
+ ('.' [A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)* '>'
+```
+
+### HTML Entity
+```
+entity = '&' ('#' ('x'|'X') [0-9a-fA-F]{1,6} | '#' [0-9]{1,7} | [A-Za-z][A-Za-z0-9]{1,31}) ';'
+```
+
+### Dangerous URL
+```
+dangerous_url = ('javascript' | 'vbscript' | 'file' | 'data'
+ (not followed by image MIME types)) ':'
+```
+
+Data URLs are allowed if followed by `image/png`, `image/gif`, `image/jpeg`, or `image/webp`.
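
The allow-then-deny ordering described here can be sketched without re2c. `has_prefix_ci` and `sketch_is_dangerous_url` are hypothetical helpers written for this illustration; the real scanner returns a match length rather than a boolean.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* ASCII case-insensitive prefix test. Stops at the first mismatch, so it
 * never reads past the end of a string shorter than the prefix. */
static bool has_prefix_ci(const char *s, const char *prefix) {
  for (; *prefix; s++, prefix++)
    if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
      return false;
  return true;
}

/* Sketch of the dangerous-URL rule: image data URLs are allowed first,
 * then the four risky schemes are rejected. */
static bool sketch_is_dangerous_url(const char *url) {
  static const char *const safe_data[] = {
      "data:image/png", "data:image/gif", "data:image/jpeg", "data:image/webp"};
  static const char *const dangerous[] = {
      "javascript:", "vbscript:", "file:", "data:"};

  assert(url != NULL);
  for (size_t i = 0; i < sizeof safe_data / sizeof safe_data[0]; i++)
    if (has_prefix_ci(url, safe_data[i]))
      return false;
  for (size_t i = 0; i < sizeof dangerous / sizeof dangerous[0]; i++)
    if (has_prefix_ci(url, dangerous[i]))
      return true;
  return false;
}
```

Checking the safe image types before the generic `data:` scheme is what lets `data:image/png;base64,...` through while `data:text/html,...` is still flagged.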
+
+### HTML Tag
+```
+html_tag = open_tag | close_tag | html_comment | processing_instruction | declaration | cdata
+open_tag = '<' tag_name attribute* '/' ? '>'
+close_tag = '</' tag_name [ \t]* '>'
+html_comment = '<!--' ...
+processing_instruction = '<?' ...
+declaration = '<!' [A-Z]+ ...
+cdata = '<![CDATA[' ...
+```
+
+## Generated Code Structure
+
+The generated `scanners.c` contains functions like:
+
+```c
+bufsize_t _scan_thematic_break(const unsigned char *p) {
+ const unsigned char *marker = NULL;
+ const unsigned char *start = p;
+ // ... re2c-generated state machine
+ // Returns (bufsize_t)(p - start) on match, 0 on failure
+}
+```
+
+Each function is a self-contained state machine that:
+1. Starts at `p`, which the `_scan_at` wrapper has already advanced to the requested offset
+2. Walks forward byte-by-byte through the pattern
+3. Returns the match length or 0
+
+The generated code is typically hundreds of lines per scanner function, with deeply nested `if`/`switch` chains for the character transitions.
+
+## Performance Characteristics
+
+- **O(n)** in the length of the match — each scanner reads input exactly once
+- **No backtracking** — re2c generates DFA-based scanners
+- **No allocation** — scanners work on existing buffers, no heap allocation
+- **Branch prediction friendly** — the common case (no match) typically hits the first branch
+
+## Usage Example
+
+A typical block-parsing sequence using scanners:
+
+```c
+// Check if line starts a thematic break
+if (!indented &&
+ (input.data[first_nonspace] == '*' ||
+ input.data[first_nonspace] == '-' ||
+ input.data[first_nonspace] == '_')) {
+ matched = _scan_at(&_scan_thematic_break, &input, first_nonspace);
+ if (matched) {
+ // Create thematic break node
+ }
+}
+```
+
+The manual character check before calling the scanner is an optimization — it avoids the function call overhead when the first character can't possibly start the pattern.
+
+## Cross-References
+
+- [scanners.h](../../cmark/src/scanners.h) — Scanner declarations and `_scan_at` macro
+- [scanners.re](../../cmark/src/scanners.re) — re2c source (if available)
+- [block-parsing.md](block-parsing.md) — Block-level scanner usage
+- [inline-parsing.md](inline-parsing.md) — Inline scanner usage
+- [html-renderer.md](html-renderer.md) — `scan_dangerous_url()` for URL safety
diff --git a/docs/handbook/cmark/testing.md b/docs/handbook/cmark/testing.md
new file mode 100644
index 0000000000..8797629e48
--- /dev/null
+++ b/docs/handbook/cmark/testing.md
@@ -0,0 +1,281 @@
+# cmark — Testing
+
+## Overview
+
+cmark has a multi-layered testing infrastructure: C API unit tests, spec conformance tests, pathological input tests, fuzz testing, and memory sanitizers. The build system integrates all of these through CMake and CTest.
+
+## Test Infrastructure (CMakeLists.txt)
+
+### API Tests
+
+```cmake
+add_executable(api_test api_test/main.c)
+target_link_libraries(api_test libcmark_static)
+add_test(NAME api_test COMMAND api_test)
+```
+
+The API test executable links against the static library and tests the public C API directly.
+
+### Spec Tests
+
+```cmake
+add_test(NAME spec_test
+ COMMAND ${PYTHON_EXECUTABLE} test/spec_tests.py
+ --spec test/spec.txt
+ --program $<TARGET_FILE:cmark>)
+```
+
+Spec tests run the `cmark` binary against the CommonMark specification. The Python script `spec_tests.py` parses the spec file, extracts input/output examples, runs `cmark` on each input, and compares the output.
+
+### Pathological Tests
+
+```cmake
+add_test(NAME pathological_test
+ COMMAND ${PYTHON_EXECUTABLE} test/pathological_tests.py
+ --program $<TARGET_FILE:cmark>)
+```
+
+These tests verify that cmark handles pathological inputs (deeply nested structures, long runs of special characters) without excessive time or memory usage.
+
+### Smart Punctuation Tests
+
+```cmake
+add_test(NAME smart_punct_test
+ COMMAND ${PYTHON_EXECUTABLE} test/spec_tests.py
+ --spec test/smart_punct.txt
+ --program "$<TARGET_FILE:cmark> --smart")
+```
+
+Tests for the `CMARK_OPT_SMART` option (curly quotes, em/en dashes, ellipses). The `cmark` binary only applies smart punctuation when invoked with `--smart`, so the flag is passed as part of the program string.
+
+### Roundtrip Tests
+
+```cmake
+add_test(NAME roundtrip_test
+ COMMAND ${PYTHON_EXECUTABLE} test/roundtrip_tests.py
+ --spec test/spec.txt
+ --program $<TARGET_FILE:cmark>)
+```
+
+Roundtrip tests verify that `cmark -t commonmark | cmark -t html` produces the same HTML as direct `cmark -t html`.
+
+### Entity Tests
+
+```cmake
+add_test(NAME entity_test
+ COMMAND ${PYTHON_EXECUTABLE} test/spec_tests.py
+ --spec test/entity.txt
+ --program $<TARGET_FILE:cmark>)
+```
+
+Tests HTML entity handling.
+
+### Regression Tests
+
+```cmake
+add_test(NAME regression_test
+ COMMAND ${PYTHON_EXECUTABLE} test/spec_tests.py
+ --spec test/regression.txt
+ --program $<TARGET_FILE:cmark>)
+```
+
+Regression tests cover previously discovered bugs.
+
+## API Test (`api_test/main.c`)
+
+The API test file is a single C source file with test functions covering every public API function. Test patterns used:
+
+### Test Macros
+
+```c
+// Each macro is wrapped in do { ... } while (0) so that it expands to a
+// single statement and stays safe inside an unbraced if/else.
+#define OK(test, msg) \
+ do { \
+ if (test) { passes++; } \
+ else { failures++; fprintf(stderr, "FAIL: %s\n %s\n", __func__, msg); } \
+ } while (0)
+
+#define INT_EQ(actual, expected, msg) \
+ do { \
+ if ((actual) == (expected)) { passes++; } \
+ else { failures++; fprintf(stderr, "FAIL: %s\n Expected %d got %d: %s\n", \
+ __func__, expected, actual, msg); } \
+ } while (0)
+
+#define STR_EQ(actual, expected, msg) \
+ do { \
+ if (strcmp(actual, expected) == 0) { passes++; } \
+ else { failures++; fprintf(stderr, "FAIL: %s\n Expected \"%s\" got \"%s\": %s\n", \
+ __func__, expected, actual, msg); } \
+ } while (0)
+```
+
+### Test Categories
+
+1. **Version tests**: Verify `cmark_version()` and `cmark_version_string()` return correct values
+2. **Constructor tests**: `cmark_node_new()` for each node type
+3. **Accessor tests**: Get/set for heading level, list type, list tight, content, etc.
+4. **Tree manipulation tests**: `cmark_node_append_child()`, `cmark_node_insert_before()`, etc.
+5. **Parser tests**: `cmark_parse_document()`, streaming `cmark_parser_feed()` + `cmark_parser_finish()`
+6. **Renderer tests**: Verify HTML, XML, man, LaTeX, CommonMark output for known inputs
+7. **Iterator tests**: `cmark_iter_new()`, traversal order, `cmark_iter_reset()`
+8. **Memory tests**: Custom allocator, `cmark_node_free()`, no leaks
+
+### Example Test Function
+
+```c
+static void test_md_to_html(const char *markdown, const char *expected_html,
+ const char *msg) {
+ char *html = cmark_markdown_to_html(markdown, strlen(markdown),
+ CMARK_OPT_DEFAULT);
+ STR_EQ(html, expected_html, msg);
+ free(html);
+}
+```
+
+## Spec Test Format
+
+The spec file (`test/spec.txt`) uses a specific format:
+
+~~~
+```````````````````````````````` example
+Markdown input here
+.
+<p>Expected HTML output here</p>
+````````````````````````````````
+~~~
+
+Each example opens with a fence of 32 backticks followed by the word `example` and closes with a fence of 32 backticks. A line containing only `.` separates the Markdown input from the expected HTML output.
+
+The Python test runner (`test/spec_tests.py`):
+1. Parses the spec file to extract examples
+2. For each example, runs the `cmark` binary with the input
+3. Compares the actual output with the expected output
+4. Reports pass/fail for each example
+
+## Pathological Input Tests
+
+The pathological test file (`test/pathological_tests.py`) generates adversarial inputs designed to trigger worst-case behavior:
+
+- Deeply nested block quotes (`> > > > > ...`)
+- Deeply nested lists
+- Long runs of backticks
+- Many consecutive closing brackets `]]]]]...]`
+- Long emphasis delimiter runs `***...***`
+- Repeated link definitions
+
+Each test verifies that cmark completes within a reasonable time bound (not quadratic or exponential).
+
+## Fuzzing
+
+### LibFuzzer
+
+```cmake
+if(CMARK_LIB_FUZZER)
+ add_executable(cmark_fuzz fuzz/cmark_fuzz.c)
+ target_link_libraries(cmark_fuzz libcmark_static)
+ target_compile_options(cmark_fuzz PRIVATE -fsanitize=fuzzer)
+ target_link_options(cmark_fuzz PRIVATE -fsanitize=fuzzer)
+endif()
+```
+
+The fuzzer entry point (`fuzz/cmark_fuzz.c`) implements:
+```c
+int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
+ // Parse data as CommonMark
+ // Render to all formats
+ // Free everything
+ return 0;
+}
+```
+
+This subjects the parser and every renderer to coverage-guided, mutated input rather than purely random bytes.
+
+### Building with Fuzzing
+
+```bash
+cmake -DCMARK_LIB_FUZZER=ON \
+ -DCMAKE_C_COMPILER=clang \
+ -DCMAKE_C_FLAGS="-fsanitize=fuzzer-no-link,address" \
+ ..
+make
+./cmark_fuzz corpus/
+```
+
+## Sanitizer Builds
+
+### Address Sanitizer (ASan)
+
+```bash
+cmake -DCMAKE_BUILD_TYPE=Asan ..
+```
+
+Sets flags: `-fsanitize=address -fno-omit-frame-pointer`
+
+Detects:
+- Buffer overflows (stack, heap, global)
+- Use-after-free
+- Double-free
+- Memory leaks (LSAN)
+
+### Undefined Behavior Sanitizer (UBSan)
+
+```bash
+cmake -DCMAKE_BUILD_TYPE=Ubsan ..
+```
+
+Sets flags: `-fsanitize=undefined`
+
+Detects:
+- Signed integer overflow
+- Null pointer dereference
+- Misaligned access
+- Invalid shift
+- Out-of-bounds array access
+
+## Running Tests
+
+### Full Test Suite
+
+```bash
+mkdir build && cd build
+cmake ..
+make
+ctest
+```
+
+### Verbose Output
+
+```bash
+ctest --verbose
+```
+
+### Single Test
+
+```bash
+ctest -R api_test
+ctest -R spec_test
+```
+
+### With ASan
+
+```bash
+mkdir build-asan && cd build-asan
+cmake -DCMAKE_BUILD_TYPE=Asan ..
+make
+ctest
+```
+
+## Test Data Files
+
+| File | Purpose |
+|------|---------|
+| `test/spec.txt` | CommonMark specification with examples |
+| `test/smart_punct.txt` | Smart punctuation examples |
+| `test/entity.txt` | HTML entity test cases |
+| `test/regression.txt` | Regression test cases |
+| `test/spec_tests.py` | Spec test runner script |
+| `test/pathological_tests.py` | Pathological input tests |
+| `test/roundtrip_tests.py` | CommonMark roundtrip tests |
+| `api_test/main.c` | C API unit tests |
+| `fuzz/cmark_fuzz.c` | LibFuzzer entry point |
+
+## Cross-References
+
+- [building.md](building.md) — Build configurations including test builds
+- [public-api.md](public-api.md) — API functions tested by `api_test`
+- [cli-usage.md](cli-usage.md) — The `cmark` binary tested by spec tests
diff --git a/docs/handbook/cmark/utf8-handling.md b/docs/handbook/cmark/utf8-handling.md
new file mode 100644
index 0000000000..c5bde6a320
--- /dev/null
+++ b/docs/handbook/cmark/utf8-handling.md
@@ -0,0 +1,340 @@
+# cmark — UTF-8 Handling
+
+## Overview
+
+The UTF-8 module (`utf8.c`, `utf8.h`) provides Unicode support for cmark: encoding, decoding, validation, iteration, case folding, and character classification. It incorporates data from `utf8proc` for case folding and character properties.
+
+## UTF-8 Encoding Fundamentals
+
+The module handles all four UTF-8 byte patterns:
+
+| Codepoint Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
+|----------------|--------|--------|--------|--------|
+| U+0000–U+007F | 0xxxxxxx | | | |
+| U+0080–U+07FF | 110xxxxx | 10xxxxxx | | |
+| U+0800–U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
+| U+10000–U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
+
+## Byte Classification Table
+
+```c
+static const uint8_t utf8proc_utf8class[256] = {
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x00-0x0F
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x10-0x1F
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x20-0x2F
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x30-0x3F
+ // ...
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x80-0x8F (continuation)
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x90-0x9F
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xA0-0xAF
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xB0-0xBF
+ 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // 0xC0-0xCF
+ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // 0xD0-0xDF
+ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // 0xE0-0xEF
+ 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xF0-0xFF
+};
+```
+
+The lookup table maps each possible lead byte to the length of the UTF-8 sequence it starts:
+- `1` → ASCII single-byte character
+- `2` → Two-byte sequence lead byte (0xC2-0xDF)
+- `3` → Three-byte sequence lead byte (0xE0-0xEF)
+- `4` → Four-byte sequence lead byte (0xF0-0xF4)
+- `0` → Continuation byte (0x80-0xBF) or invalid lead byte (0xC0-0xC1, 0xF5-0xFF)
+
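+Note: 0xC0 and 0xC1 are marked as `0` (invalid) because any two-byte sequence they start would encode a codepoint below U+0080, which is an overlong encoding. 0xF5-0xFF are marked as `0` because they would start sequences above U+10FFFF.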
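+
The same classification can be written as plain range checks, which makes the boundaries listed above easier to verify by eye. This is a minimal sketch, not code from `utf8.c`; the helper name `lead_byte_class` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper (not part of cmark): returns the same class values
   as the lookup table, computed with range checks instead of a table. */
static int lead_byte_class(uint8_t b) {
  if (b < 0x80) return 1;               /* ASCII, single byte */
  if (b >= 0xC2 && b <= 0xDF) return 2; /* two-byte lead */
  if (b >= 0xE0 && b <= 0xEF) return 3; /* three-byte lead */
  if (b >= 0xF0 && b <= 0xF4) return 4; /* four-byte lead */
  return 0; /* continuation byte, 0xC0/0xC1, or 0xF5-0xFF */
}
```

A table lookup avoids the branches, which is presumably why the module uses one; the range form is mainly useful for checking the table against the definition.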
+
+## UTF-8 Encoding
+
+```c
+void cmark_utf8proc_encode_char(int32_t uc, cmark_strbuf *buf) {
+ uint8_t dst[4];
+ bufsize_t len = 0;
+
+ assert(uc >= 0);
+
+ if (uc < 0x80) {
+ dst[0] = (uint8_t)(uc);
+ len = 1;
+ } else if (uc < 0x800) {
+ dst[0] = (uint8_t)(0xC0 + (uc >> 6));
+ dst[1] = 0x80 + (uc & 0x3F);
+ len = 2;
+ } else if (uc == 0xFFFF) {
+ // Invalid codepoint — encode replacement char
+ dst[0] = 0xEF; dst[1] = 0xBF; dst[2] = 0xBD;
+ len = 3;
+ } else if (uc == 0xFFFE) {
+ // Invalid codepoint — encode replacement char
+ dst[0] = 0xEF; dst[1] = 0xBF; dst[2] = 0xBD;
+ len = 3;
+ } else if (uc < 0x10000) {
+ dst[0] = (uint8_t)(0xE0 + (uc >> 12));
+ dst[1] = 0x80 + ((uc >> 6) & 0x3F);
+ dst[2] = 0x80 + (uc & 0x3F);
+ len = 3;
+ } else if (uc < 0x110000) {
+ dst[0] = (uint8_t)(0xF0 + (uc >> 18));
+ dst[1] = 0x80 + ((uc >> 12) & 0x3F);
+ dst[2] = 0x80 + ((uc >> 6) & 0x3F);
+ dst[3] = 0x80 + (uc & 0x3F);
+ len = 4;
+ } else {
+ // Out of range — encode replacement char U+FFFD
+ dst[0] = 0xEF; dst[1] = 0xBF; dst[2] = 0xBD;
+ len = 3;
+ }
+
+ cmark_strbuf_put(buf, dst, len);
+}
+```
+
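+Encodes a single Unicode codepoint as UTF-8 and appends it to a `cmark_strbuf`. The noncharacters U+FFFE and U+FFFF, and any value above U+10FFFF, are replaced with U+FFFD (the replacement character, bytes `EF BF BD`). Surrogate values (U+D800-U+DFFF) are not filtered by this function.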
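+
To try the bit arithmetic outside of cmark, the following sketch writes into a plain byte array instead of a `cmark_strbuf`. The function name `encode_utf8` is hypothetical, and unlike the version above it maps negative input to U+FFFD instead of asserting.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-alone variant: writes the encoding of uc into dst
   (which must hold at least 4 bytes) and returns the number of bytes. */
static size_t encode_utf8(int32_t uc, uint8_t dst[4]) {
  /* Same replacement policy as above, plus negative values. */
  if (uc < 0 || uc > 0x10FFFF || uc == 0xFFFE || uc == 0xFFFF)
    uc = 0xFFFD;

  if (uc < 0x80) {
    dst[0] = (uint8_t)uc;
    return 1;
  }
  if (uc < 0x800) {
    dst[0] = (uint8_t)(0xC0 | (uc >> 6));
    dst[1] = (uint8_t)(0x80 | (uc & 0x3F));
    return 2;
  }
  if (uc < 0x10000) {
    dst[0] = (uint8_t)(0xE0 | (uc >> 12));
    dst[1] = (uint8_t)(0x80 | ((uc >> 6) & 0x3F));
    dst[2] = (uint8_t)(0x80 | (uc & 0x3F));
    return 3;
  }
  dst[0] = (uint8_t)(0xF0 | (uc >> 18));
  dst[1] = (uint8_t)(0x80 | ((uc >> 12) & 0x3F));
  dst[2] = (uint8_t)(0x80 | ((uc >> 6) & 0x3F));
  dst[3] = (uint8_t)(0x80 | (uc & 0x3F));
  return 4;
}
```

For example, U+20AC (the euro sign) yields the three bytes `E2 82 AC`.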
+
+## UTF-8 Validation and Iteration
+
+```c
+void cmark_utf8proc_check(cmark_strbuf *dest, const uint8_t *line,
+ bufsize_t size) {
+ bufsize_t i = 0;
+
+ while (i < size) {
+ bufsize_t byte_length = utf8proc_utf8class[line[i]];
+ int32_t codepoint = -1;
+
+ if (byte_length == 0) {
+ // Invalid lead byte — replace
+ cmark_utf8proc_encode_char(0xFFFD, dest);
+ i++;
+ continue;
+ }
+
+ // Check we have enough bytes
+ if (i + byte_length > size) {
+ // Truncated sequence — replace
+ cmark_utf8proc_encode_char(0xFFFD, dest);
+ i++;
+ continue;
+ }
+
+ // Decode; any malformed sequence leaves codepoint at -1
+ switch (byte_length) {
+ case 1:
+ codepoint = line[i];
+ break;
+ case 2:
+ // Continuation byte must match 10xxxxxx
+ if ((line[i+1] & 0xC0) == 0x80)
+ codepoint = ((line[i] & 0x1F) << 6) | (line[i+1] & 0x3F);
+ break;
+ case 3:
+ if ((line[i+1] & 0xC0) == 0x80 && (line[i+2] & 0xC0) == 0x80) {
+ codepoint = ((line[i] & 0x0F) << 12) |
+ ((line[i+1] & 0x3F) << 6) |
+ (line[i+2] & 0x3F);
+ // Reject overlongs (< U+0800) and surrogates (U+D800-U+DFFF)
+ if (codepoint < 0x800 || (codepoint >= 0xD800 && codepoint <= 0xDFFF))
+ codepoint = -1;
+ }
+ break;
+ case 4:
+ if ((line[i+1] & 0xC0) == 0x80 && (line[i+2] & 0xC0) == 0x80 &&
+ (line[i+3] & 0xC0) == 0x80) {
+ codepoint = ((line[i] & 0x07) << 18) |
+ ((line[i+1] & 0x3F) << 12) |
+ ((line[i+2] & 0x3F) << 6) |
+ (line[i+3] & 0x3F);
+ // Reject overlongs (< U+10000) and values above U+10FFFF
+ if (codepoint < 0x10000 || codepoint > 0x10FFFF)
+ codepoint = -1;
+ }
+ break;
+ }
+
+ if (codepoint < 0) {
+ cmark_utf8proc_encode_char(0xFFFD, dest);
+ i++;
+ } else {
+ cmark_utf8proc_encode_char(codepoint, dest);
+ i += byte_length;
+ }
+ }
+}
+```
+
+This function validates UTF-8 and replaces invalid sequences with U+FFFD. It enforces:
+- No invalid lead bytes
+- No truncated sequences
+- No invalid continuation bytes
+- No overlong encodings
+- No surrogate codepoints (U+D800-U+DFFF)
+- No codepoints above U+10FFFF
+
+### Validation Rules (RFC 3629)
+
+For 3-byte sequences:
+```c
+// Reject overlongs: first byte 0xE0 requires second byte >= 0xA0
+if (line[i] == 0xE0 && line[i+1] < 0xA0) { /* overlong */ }
+// Reject surrogates: first byte 0xED requires second byte < 0xA0
+if (line[i] == 0xED && line[i+1] >= 0xA0) { /* surrogate */ }
+```
+
+For 4-byte sequences:
+```c
+// Reject overlongs: first byte 0xF0 requires second byte >= 0x90
+if (line[i] == 0xF0 && line[i+1] < 0x90) { /* overlong */ }
+// Reject codepoints > U+10FFFF: first byte 0xF4 requires second byte < 0x90
+if (line[i] == 0xF4 && line[i+1] >= 0x90) { /* out of range */ }
+```
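+
The second-byte rules above can be folded into one check by giving each lead byte an allowed range for the byte that follows it. The sketch below is a stand-alone illustration under that assumption; `valid_sequence_length` is a hypothetical name and is not part of `utf8.c`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker: returns the length (1-4) of the well-formed
   UTF-8 sequence starting at s, or 0 if it is malformed or truncated. */
static int valid_sequence_length(const uint8_t *s, size_t n) {
  uint8_t lo = 0x80, hi = 0xBF; /* allowed range for the second byte */
  int len, i;

  if (n == 0) return 0;
  if (s[0] < 0x80) return 1;

  if (s[0] >= 0xC2 && s[0] <= 0xDF) {
    len = 2;
  } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
    len = 3;
    if (s[0] == 0xE0) lo = 0xA0; /* below this would be overlong */
    if (s[0] == 0xED) hi = 0x9F; /* above this would be a surrogate */
  } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
    len = 4;
    if (s[0] == 0xF0) lo = 0x90; /* below this would be overlong */
    if (s[0] == 0xF4) hi = 0x8F; /* above this exceeds U+10FFFF */
  } else {
    return 0; /* continuation byte or invalid lead byte */
  }

  if (n < (size_t)len) return 0;             /* truncated */
  if (s[1] < lo || s[1] > hi) return 0;      /* second-byte rule */
  for (i = 2; i < len; i++)
    if ((s[i] & 0xC0) != 0x80) return 0;     /* remaining continuations */
  return len;
}
```

Checking the second byte's range is equivalent to decoding first and then comparing the codepoint against the overlong, surrogate, and maximum bounds; it just rejects the sequence before any decoding happens.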
+
+## UTF-8 Iterator
+
+```c
+void cmark_utf8proc_iterate(const uint8_t *str, bufsize_t str_len,
+ int32_t *dst) {
+ *dst = -1;
+ if (str_len <= 0) return;
+
+ uint8_t length = utf8proc_utf8class[str[0]];
+ if (!length) return;
+ if (str_len >= length) {
+ switch (length) {
+ case 1:
+ *dst = str[0];
+ break;
+ case 2:
+ *dst = ((int32_t)(str[0] & 0x1F) << 6) | (str[1] & 0x3F);
+ break;
+ case 3:
+ *dst = ((int32_t)(str[0] & 0x0F) << 12) |
+ ((int32_t)(str[1] & 0x3F) << 6) |
+ (str[2] & 0x3F);
+ // Reject surrogates:
+ if (*dst >= 0xD800 && *dst < 0xE000) *dst = -1;
+ break;
+ case 4:
+ *dst = ((int32_t)(str[0] & 0x07) << 18) |
+ ((int32_t)(str[1] & 0x3F) << 12) |
+ ((int32_t)(str[2] & 0x3F) << 6) |
+ (str[3] & 0x3F);
+ if (*dst > 0x10FFFF) *dst = -1;
+ break;
+ }
+ }
+}
+```
+
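+Decodes a single UTF-8 codepoint from a byte string. Sets `*dst` to -1 on error. As shown, it rejects surrogates and values above U+10FFFF but does not check continuation bytes or overlong forms, so it is only safe on input that has already passed through `cmark_utf8proc_check()`.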
+
+## Case Folding
+
+```c
+void cmark_utf8proc_case_fold(cmark_strbuf *dest, const uint8_t *str,
+ bufsize_t len) {
+ int32_t c;
+ bufsize_t i = 0;
+
+ while (i < len) {
+ bufsize_t char_len = cmark_utf8proc_charlen(str + i, len - i);
+ if (char_len < 0) {
+ cmark_utf8proc_encode_char(0xFFFD, dest);
+ i += 1;
+ continue;
+ }
+ cmark_utf8proc_iterate(str + i, char_len, &c);
+ if (c >= 0) {
+ // Look up case fold mapping
+ const int32_t *fold = cmark_utf8proc_case_fold_info(c);
+ if (fold) {
+ // Some characters fold to multiple codepoints
+ while (*fold >= 0) {
+ cmark_utf8proc_encode_char(*fold, dest);
+ fold++;
+ }
+ } else {
+ cmark_utf8proc_encode_char(c, dest);
+ }
+ }
+ i += char_len;
+ }
+}
+```
+
+Performs Unicode case folding (not lowercasing — case folding is more aggressive and designed for case-insensitive comparison). Used for normalizing link reference labels.
+
+### Case Fold Lookup
+
+```c
+static const int32_t *cmark_utf8proc_case_fold_info(int32_t c);
+```
+
+Uses a sorted table `cf_table` and binary search to find case fold mappings. Each entry maps a codepoint to one or more replacement codepoints (some characters fold to multiple characters, e.g., `ß` → `ss`).
+
+The table uses sentinel value `-1` to terminate multi-codepoint sequences.
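+
A sorted table with `-1`-terminated rows and a binary search can be sketched with the standard `bsearch()`. Everything here is illustrative: the three-entry `fold_table`, the `fold_entry` layout, and the name `fold_lookup` are assumptions, and the real table's layout may differ.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical row layout: source codepoint, then the folded sequence
   terminated by -1. */
typedef struct {
  int32_t from;
  int32_t to[3];
} fold_entry;

/* Hypothetical three-entry table, sorted by 'from'. */
static const fold_entry fold_table[] = {
  {0x0041, {0x0061, -1, -1}},     /* A -> a */
  {0x00DF, {0x0073, 0x0073, -1}}, /* sharp s -> ss */
  {0x0130, {0x0069, 0x0307, -1}}, /* I with dot above -> i + combining dot */
};

static int compare_entry(const void *key, const void *elem) {
  int32_t c = *(const int32_t *)key;
  int32_t from = ((const fold_entry *)elem)->from;
  return (c > from) - (c < from);
}

/* Returns the -1-terminated fold sequence, or NULL if c has no entry
   (meaning it folds to itself). */
static const int32_t *fold_lookup(int32_t c) {
  const fold_entry *e =
      bsearch(&c, fold_table, sizeof fold_table / sizeof fold_table[0],
              sizeof fold_table[0], compare_entry);
  return e ? e->to : NULL;
}
```

A caller walks the returned sequence until it reaches `-1`, exactly as the `while (*fold >= 0)` loop in `cmark_utf8proc_case_fold()` does.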
+
+## Character Classification
+
+### `cmark_utf8proc_is_space()`
+
+```c
+int cmark_utf8proc_is_space(int32_t c) {
+ // ASCII spaces
+ if (c < 0x80) {
+ return (c == 9 || c == 10 || c == 12 || c == 13 || c == 32);
+ }
+ // Unicode Zs category
+ return (c == 0xa0 || c == 0x1680 ||
+ (c >= 0x2000 && c <= 0x200a) ||
+ c == 0x202f || c == 0x205f || c == 0x3000);
+}
+```
+
+Matches ASCII whitespace (HT, LF, FF, CR, SP) and Unicode Zs (space separator) characters including:
+- U+00A0 (NBSP)
+- U+1680 (Ogham space)
+- U+2000-U+200A (various typographic spaces)
+- U+202F (narrow NBSP)
+- U+205F (medium mathematical space)
+- U+3000 (ideographic space)
+
+### `cmark_utf8proc_is_punctuation()`
+
+```c
+int cmark_utf8proc_is_punctuation(int32_t c) {
+ // ASCII punctuation ranges
+ if (c < 128) {
+ return (c >= 33 && c <= 47) ||
+ (c >= 58 && c <= 64) ||
+ (c >= 91 && c <= 96) ||
+ (c >= 123 && c <= 126);
+ }
+ // Unicode Pc, Pd, Pe, Pf, Pi, Po, Ps categories
+ // Uses a table-driven approach for Unicode punctuation
+}
+```
+
+Returns true for ASCII punctuation (`!`, `"`, `#`, `$`, `%`, `&`, `'`, `(`, `)`, `*`, `+`, `,`, `-`, `.`, `/`, `:`, `;`, `<`, `=`, `>`, `?`, `@`, `[`, `\`, `]`, `^`, `_`, `` ` ``, `{`, `|`, `}`, `~`) and Unicode punctuation (categories Pc through Ps).
+
+These classification functions are critical for inline parsing, specifically for delimiter run classification — determining whether a `*` or `_` run is left-flanking or right-flanking depends on whether adjacent characters are spaces or punctuation.
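+
The flanking rules from the CommonMark specification can be expressed directly in terms of the two classifiers. The sketch below uses ASCII-only stand-ins named `is_space` and `is_punct` (an assumption made to keep it short); `left_flanking` and `right_flanking` are hypothetical names, not functions from `inlines.c`.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASCII-only stand-ins for the classifiers above (assumption: the
   Unicode ranges are omitted for brevity). */
static bool is_space(int32_t c) {
  return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

static bool is_punct(int32_t c) {
  return (c >= 33 && c <= 47) || (c >= 58 && c <= 64) ||
         (c >= 91 && c <= 96) || (c >= 123 && c <= 126);
}

/* 'before' and 'after' are the codepoints adjacent to the delimiter run.
   The start and end of a line count as whitespace, so pass '\n' there. */
static bool left_flanking(int32_t before, int32_t after) {
  return !is_space(after) &&
         (!is_punct(after) || is_space(before) || is_punct(before));
}

static bool right_flanking(int32_t before, int32_t after) {
  return !is_space(before) &&
         (!is_punct(before) || is_space(after) || is_punct(after));
}
```

For the run in `a *b`, `left_flanking(' ', 'b')` is true and `right_flanking(' ', 'b')` is false, so the run can only open emphasis.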
+
+## Helper Functions
+
+### `cmark_utf8proc_charlen()`
+
+```c
+static CMARK_INLINE bufsize_t cmark_utf8proc_charlen(const uint8_t *str,
+ bufsize_t str_len) {
+ bufsize_t length = utf8proc_utf8class[str[0]];
+ if (!length) return -1; // invalid lead byte or stray continuation byte
+ if (str_len < length) return -length; // truncated sequence
+ return length;
+}
+```
+
+Returns the byte length of the UTF-8 character at the given position. Returns negative on error (invalid byte or truncated).
+
+## Usage in cmark
+
+1. **Input validation**: `cmark_utf8proc_check()` is called on input to replace invalid UTF-8 with U+FFFD
+2. **Reference normalization**: `cmark_utf8proc_case_fold()` is used by `normalize_reference()` in `references.c` for case-insensitive reference label matching
+3. **Delimiter classification**: `cmark_utf8proc_is_space()` and `cmark_utf8proc_is_punctuation()` are used in `inlines.c` for the left-flanking/right-flanking delimiter run rules
+4. **Entity decoding**: `cmark_utf8proc_encode_char()` is used when decoding HTML entities and numeric character references to produce their UTF-8 representation
+5. **Renderer output**: `cmark_render_code_point()` in `render.c` calls `cmark_utf8proc_encode_char()` for multi-byte character output
+
+## Cross-References
+
+- [utf8.c](../../cmark/src/utf8.c) — Implementation
+- [utf8.h](../../cmark/src/utf8.h) — Public interface
+- [inline-parsing.md](inline-parsing.md) — Uses character classification for delimiter rules
+- [reference-system.md](reference-system.md) — Uses case folding for label normalization
diff --git a/docs/handbook/cmark/xml-renderer.md b/docs/handbook/cmark/xml-renderer.md
new file mode 100644
index 0000000000..83218c7ef2
--- /dev/null
+++ b/docs/handbook/cmark/xml-renderer.md
@@ -0,0 +1,291 @@
+# cmark — XML Renderer
+
+## Overview
+
+The XML renderer (`xml.c`) produces an XML representation of the AST. Like the HTML renderer, it writes directly to a `cmark_strbuf` buffer rather than using the generic render framework. The output conforms to the CommonMark DTD.
+
+## Entry Point
+
+```c
+char *cmark_render_xml(cmark_node *root, int options);
+```
+
+Returns a complete XML document string. The caller must free the result.
+
+### Implementation
+
+```c
+char *cmark_render_xml(cmark_node *root, int options) {
+ char *result;
+ cmark_strbuf xml = CMARK_BUF_INIT(root->mem);
+ cmark_event_type ev_type;
+ cmark_node *cur;
+ struct render_state state = {&xml, 0};
+ cmark_iter *iter = cmark_iter_new(root);
+
+ cmark_strbuf_puts(&xml,
+ "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
+ "<!DOCTYPE document SYSTEM \"CommonMark.dtd\">\n");
+
+ // optionally: <?xml-model href="CommonMark.rnc" ...?>
+ while ((ev_type = cmark_iter_next(iter)) != CMARK_EVENT_DONE) {
+ cur = cmark_iter_get_node(iter);
+ S_render_node(cur, ev_type, &state, options);
+ }
+ result = (char *)cmark_strbuf_detach(&xml);
+ cmark_iter_free(iter);
+ return result;
+}
+```
+
+## Render State
+
+```c
+struct render_state {
+ cmark_strbuf *xml; // Output buffer
+ int indent; // Current indentation level (number of spaces)
+};
+```
+
+The `indent` state tracks nesting depth, incremented by 2 for each container node entered.
+
+## XML Escaping
+
+```c
+static CMARK_INLINE void escape_xml(cmark_strbuf *dest, const unsigned char *source,
+ bufsize_t length) {
+ houdini_escape_html0(dest, source, length, 0);
+}
+```
+
+Escapes `<`, `>`, `&`, and `"` to their XML entity equivalents.
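+
The effect of the escaping can be reproduced with a small stand-alone function. This is a sketch of the four substitutions only, not the `houdini_escape_html0()` implementation; the name `escape_xml_text` and the fixed-buffer interface are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: escapes the NUL-terminated string src into dst,
   whose capacity cap includes the terminating NUL.
   Returns 0 on success, -1 if dst is too small. */
static int escape_xml_text(const char *src, char *dst, size_t cap) {
  size_t used = 0;

  if (cap == 0) return -1;
  for (; *src; src++) {
    char single[2] = {*src, '\0'}; /* unescaped character as a string */
    const char *rep;
    size_t n;

    switch (*src) {
    case '<': rep = "&lt;"; break;
    case '>': rep = "&gt;"; break;
    case '&': rep = "&amp;"; break;
    case '"': rep = "&quot;"; break;
    default: rep = single; break;
    }
    n = strlen(rep);
    if (used + n + 1 > cap) return -1; /* keep room for the NUL */
    memcpy(dst + used, rep, n);
    used += n;
  }
  dst[used] = '\0';
  return 0;
}
```

Escaping `"` matters because the same routine is used for attribute values such as `destination` and `title`, which the renderer wraps in double quotes.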
+
+## Indentation
+
+```c
+static void indent(struct render_state *state) {
+ int i;
+ for (i = 0; i < state->indent; i++) {
+ cmark_strbuf_putc(state->xml, ' ');
+ }
+}
+```
+
+Each level of nesting adds 2 spaces of indentation.
+
+## Source Position Attributes
+
+```c
+static void S_render_sourcepos(cmark_node *node, cmark_strbuf *xml, int options) {
+ char buffer[BUFFER_SIZE];
+ if (CMARK_OPT_SOURCEPOS & options) {
+ snprintf(buffer, BUFFER_SIZE, " sourcepos=\"%d:%d-%d:%d\"",
+ cmark_node_get_start_line(node), cmark_node_get_start_column(node),
+ cmark_node_get_end_line(node), cmark_node_get_end_column(node));
+ cmark_strbuf_puts(xml, buffer);
+ }
+}
+```
+
+When `CMARK_OPT_SOURCEPOS` is active, XML elements receive `sourcepos="line:col-line:col"` attributes.
+
+## Node Type Name Table
+
+```c
+static const char *S_type_string(cmark_node *node) {
+ if (node->extension && node->extension->xml_tag_name_func) {
+ return node->extension->xml_tag_name_func(node->extension, node);
+ }
+ switch (node->type) {
+ case CMARK_NODE_DOCUMENT: return "document";
+ case CMARK_NODE_BLOCK_QUOTE: return "block_quote";
+ case CMARK_NODE_LIST: return "list";
+ case CMARK_NODE_ITEM: return "item";
+ case CMARK_NODE_CODE_BLOCK: return "code_block";
+ case CMARK_NODE_HTML_BLOCK: return "html_block";
+ case CMARK_NODE_CUSTOM_BLOCK: return "custom_block";
+ case CMARK_NODE_PARAGRAPH: return "paragraph";
+ case CMARK_NODE_HEADING: return "heading";
+ case CMARK_NODE_THEMATIC_BREAK: return "thematic_break";
+ case CMARK_NODE_TEXT: return "text";
+ case CMARK_NODE_SOFTBREAK: return "softbreak";
+ case CMARK_NODE_LINEBREAK: return "linebreak";
+ case CMARK_NODE_CODE: return "code";
+ case CMARK_NODE_HTML_INLINE: return "html_inline";
+ case CMARK_NODE_CUSTOM_INLINE: return "custom_inline";
+ case CMARK_NODE_EMPH: return "emph";
+ case CMARK_NODE_STRONG: return "strong";
+ case CMARK_NODE_LINK: return "link";
+ case CMARK_NODE_IMAGE: return "image";
+ case CMARK_NODE_NONE: return "NONE";
+ }
+ return "<unknown>";
+}
+```
+
+Each node type has a fixed XML tag name. Extensions can override this via `xml_tag_name_func`.
+
+## Node Rendering Logic
+
+### Leaf Nodes vs Container Nodes
+
+The XML renderer distinguishes between leaf (literal) nodes and container nodes:
+
+**Leaf nodes** (single event — `CMARK_EVENT_ENTER` only):
+- `CMARK_NODE_CODE_BLOCK`, `CMARK_NODE_HTML_BLOCK`, `CMARK_NODE_THEMATIC_BREAK`
+- `CMARK_NODE_TEXT`, `CMARK_NODE_SOFTBREAK`, `CMARK_NODE_LINEBREAK`
+- `CMARK_NODE_CODE`, `CMARK_NODE_HTML_INLINE`
+
+**Container nodes** (paired enter/exit events):
+- `CMARK_NODE_DOCUMENT`, `CMARK_NODE_BLOCK_QUOTE`, `CMARK_NODE_LIST`, `CMARK_NODE_ITEM`
+- `CMARK_NODE_PARAGRAPH`, `CMARK_NODE_HEADING`
+- `CMARK_NODE_EMPH`, `CMARK_NODE_STRONG`, `CMARK_NODE_LINK`, `CMARK_NODE_IMAGE`
+- `CMARK_NODE_CUSTOM_BLOCK`, `CMARK_NODE_CUSTOM_INLINE`
+
+### Leaf Node Rendering
+
+Literal nodes that contain text are rendered as:
+```xml
+ <tag_name>ESCAPED TEXT</tag_name>
+```
+
+For example, a text node with content "Hello & goodbye" becomes:
+```xml
+ <text>Hello &amp; goodbye</text>
+```
+
+Nodes without text content (thematic_break, softbreak, linebreak) are rendered as self-closing:
+```xml
+ <thematic_break />
+```
+
+### Container Node Rendering (Enter)
+
+On enter, the renderer outputs:
+```xml
+ <tag_name[sourcepos][ type-specific attributes]>
+```
+
+And increments the indent level by 2.
+
+#### Type-Specific Attributes on Enter
+
+**List attributes:**
+```c
+cmark_strbuf_printf(xml, " type=\"%s\" tight=\"%s\"",
+ cmark_node_get_list_type(node) == CMARK_BULLET_LIST
+ ? "bullet" : "ordered",
+ cmark_node_get_list_tight(node) ? "true" : "false");
+// For ordered lists only:
+int start = cmark_node_get_list_start(node);
+if (start != 1) {
+ snprintf(buffer, BUFFER_SIZE, " start=\"%d\"", start);
+}
+cmark_strbuf_printf(xml, " delimiter=\"%s\"",
+ cmark_node_get_list_delim(node) == CMARK_PAREN_DELIM
+ ? "paren" : "period");
+```
+
+**Heading attributes:**
+```c
+snprintf(buffer, BUFFER_SIZE, " level=\"%d\"", node->as.heading.level);
+```
+
+**Code block attributes:**
+```c
+if (node->as.code.info) {
+ cmark_strbuf_puts(xml, " info=\"");
+ escape_xml(xml, node->as.code.info, (bufsize_t)strlen((char *)node->as.code.info));
+ cmark_strbuf_putc(xml, '"');
+}
+```
+
+**Link/Image attributes:**
+```c
+cmark_strbuf_puts(xml, " destination=\"");
+escape_xml(xml, node->as.link.url, (bufsize_t)strlen((char *)node->as.link.url));
+cmark_strbuf_putc(xml, '"');
+cmark_strbuf_puts(xml, " title=\"");
+escape_xml(xml, node->as.link.title, (bufsize_t)strlen((char *)node->as.link.title));
+cmark_strbuf_putc(xml, '"');
+```
+
+**Custom block/inline attributes:**
+```c
+cmark_strbuf_puts(xml, " on_enter=\"");
+escape_xml(xml, node->as.custom.on_enter, ...);
+cmark_strbuf_puts(xml, "\" on_exit=\"");
+escape_xml(xml, node->as.custom.on_exit, ...);
+```
+
+### Container Node Rendering (Exit)
+
+On exit, the indent level is decremented by 2, and the closing tag is output:
+```xml
+ </tag_name>
+```
+
+### Extension Support
+
+Extensions can add additional XML attributes via:
+```c
+if (node->extension && node->extension->xml_attr_func) {
+ node->extension->xml_attr_func(node->extension, node, xml);
+}
+```
+
+## Example Output
+
+Given this Markdown:
+
+```markdown
+# Hello
+
+A paragraph with *emphasis* and a [link](http://example.com "title").
+```
+
+The XML output (with `CMARK_OPT_SOURCEPOS`):
+
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE document SYSTEM "CommonMark.dtd">
+<document sourcepos="1:1-3:69" xmlns="http://commonmark.org/xml/1.0">
+ <heading sourcepos="1:1-1:7" level="1">
+ <text>Hello</text>
+ </heading>
+ <paragraph sourcepos="3:1-3:69">
+ <text>A paragraph with </text>
+ <emph>
+ <text>emphasis</text>
+ </emph>
+ <text> and a </text>
+ <link destination="http://example.com" title="title">
+ <text>link</text>
+ </link>
+ <text>.</text>
+ </paragraph>
+</document>
+```
+
+## CommonMark DTD
+
+The output references `CommonMark.dtd`, the DTD that defines:
+- Document element as the root
+- All CommonMark block and inline element types
+- Required attributes for lists, headings, links, images, and code blocks
+- Entity definitions for the markup model
+
+## Differences from HTML Renderer
+
+1. **Full AST preservation**: XML represents the complete AST structure, including node types that HTML merges or loses (e.g., softbreak, custom blocks/inlines).
+2. **Indentation tracking**: XML output is pretty-printed with nesting-based indentation.
+3. **No tight list logic**: The `tight` attribute is stored as metadata, but does not affect paragraph rendering — paragraphs always appear as `<paragraph>` elements.
+4. **No URL safety**: URLs are output as-is (escaped for XML), no `_scan_dangerous_url()` check.
+5. **No plain text mode**: Image children are rendered structurally, not flattened to alt text.
+
+## Cross-References
+
+- [xml.c](../../cmark/src/xml.c) — Full implementation
+- [html-renderer.md](html-renderer.md) — HTML renderer comparison
+- [iterator-system.md](iterator-system.md) — Traversal mechanism used
+- [public-api.md](public-api.md) — `cmark_render_xml()` API docs