diff options
Diffstat (limited to 'docs/handbook/cmark/ast-node-system.md')
| -rw-r--r-- | docs/handbook/cmark/ast-node-system.md | 383 |
1 files changed, 383 insertions, 0 deletions
diff --git a/docs/handbook/cmark/ast-node-system.md b/docs/handbook/cmark/ast-node-system.md new file mode 100644 index 0000000000..3d25415eda --- /dev/null +++ b/docs/handbook/cmark/ast-node-system.md @@ -0,0 +1,383 @@ +# cmark — AST Node System + +## Overview + +The AST (Abstract Syntax Tree) node system is defined across `node.h` (internal struct definitions) and `node.c` (node creation, destruction, accessor functions, and tree manipulation). Every element in a parsed CommonMark document is represented as a `cmark_node`. Nodes form a tree via parent/child/sibling pointers, with type-specific data stored in a discriminated union. + +## The `cmark_node` Struct + +Defined in `node.h`, this is the central data structure of the entire library: + +```c +struct cmark_node { + cmark_mem *mem; // Memory allocator used for this node + + struct cmark_node *next; // Next sibling + struct cmark_node *prev; // Previous sibling + struct cmark_node *parent; // Parent node + struct cmark_node *first_child; // First child + struct cmark_node *last_child; // Last child + + void *user_data; // Arbitrary user-attached data + + unsigned char *data; // String content (for text, code, HTML) + bufsize_t len; // Length of data + + int start_line; // Source position: starting line (1-based) + int start_column; // Source position: starting column (1-based) + int end_line; // Source position: ending line + int end_column; // Source position: ending column + uint16_t type; // Node type (cmark_node_type enum value) + uint16_t flags; // Internal flags (open, last-line-blank, etc.) + + union { + cmark_list list; // List-specific data + cmark_code code; // Code block-specific data + cmark_heading heading; // Heading-specific data + cmark_link link; // Link/image-specific data + cmark_custom custom; // Custom block/inline data + int html_block_type; // HTML block type (1-7) + } as; +}; +``` + +The union `as` means each node only occupies memory for one type-specific payload, keeping the struct compact. The largest union member determines the node's size. + +## Type-Specific Structs + +### `cmark_list` — List Properties + +```c +typedef struct { + int marker_offset; // Indentation of list marker from left margin + int padding; // Total indentation (marker + content offset) + int start; // Starting number for ordered lists (0 for bullet) + unsigned char list_type; // CMARK_BULLET_LIST or CMARK_ORDERED_LIST + unsigned char delimiter; // CMARK_PERIOD_DELIM, CMARK_PAREN_DELIM, or CMARK_NO_DELIM + unsigned char bullet_char;// '*', '-', or '+' for bullet lists + bool tight; // Whether the list is tight (no blank lines between items) +} cmark_list; +``` + +`marker_offset` and `padding` are used during block parsing to track indentation levels for list continuation. The `tight` flag is determined during block finalization by checking whether blank lines appear between list items or their children. + +### `cmark_code` — Code Block Properties + +```c +typedef struct { + unsigned char *info; // Info string (language hint, e.g., "python") + uint8_t fence_length; // Length of opening fence (3+ backticks or tildes) + uint8_t fence_offset; // Indentation of fence from left margin + unsigned char fence_char; // '`' or '~' + int8_t fenced; // Whether this is a fenced code block (vs. indented) +} cmark_code; +``` + +For indented code blocks, `fenced` is 0, and `info`, `fence_length`, `fence_char`, and `fence_offset` are unused. For fenced code blocks, `info` is extracted from the first line of the opening fence and stored as a separately allocated string. + +### `cmark_heading` — Heading Properties + +```c +typedef struct { + int internal_offset; // Internal offset within the heading content + int8_t level; // Heading level (1-6) + bool setext; // Whether this is a setext-style heading (underlined) +} cmark_heading; +``` + +ATX headings (`# Heading`) have `setext = false`. Setext headings (underlined with `=` or `-`) have `setext = true`. The `level` field is shared and defaults to 1 when a heading node is created. + +### `cmark_link` — Link and Image Properties + +```c +typedef struct { + unsigned char *url; // Destination URL (separately allocated) + unsigned char *title; // Optional title text (separately allocated) +} cmark_link; +``` + +Both `url` and `title` are separately allocated strings that must be freed when the node is destroyed. This struct is used for both `CMARK_NODE_LINK` and `CMARK_NODE_IMAGE`. + +### `cmark_custom` — Custom Block/Inline Properties + +```c +typedef struct { + unsigned char *on_enter; // Literal text rendered when entering the node + unsigned char *on_exit; // Literal text rendered when leaving the node +} cmark_custom; +``` + +Custom nodes allow embedding arbitrary content in the AST for extensions. Both strings are separately allocated. + +## Internal Flags + +The `flags` field uses bit flags defined in the `cmark_node__internal_flags` enum: + +```c +enum cmark_node__internal_flags { + CMARK_NODE__OPEN = (1 << 0), // Block is still open (accepting content) + CMARK_NODE__LAST_LINE_BLANK = (1 << 1), // Last line of this block was blank + CMARK_NODE__LAST_LINE_CHECKED = (1 << 2), // blank-line status has been computed + CMARK_NODE__LIST_LAST_LINE_BLANK = (1 << 3), // (unused/reserved) +}; +``` + +- **`CMARK_NODE__OPEN`**: Set when a block is created during parsing. Cleared by `finalize()` when the block is closed. The parser's `current` pointer always points to a node with this flag set. +- **`CMARK_NODE__LAST_LINE_BLANK`**: Set/cleared by `S_set_last_line_blank()` in `blocks.c` to track whether the most recent line added to this block was blank. Used for determining list tightness. +- **`CMARK_NODE__LAST_LINE_CHECKED`**: Prevents redundant traversal when checking `S_ends_with_blank_line()`, which recursively descends into list items. + +## Node Creation + +### `cmark_node_new_with_mem()` + +The primary creation function (in `node.c`): + +```c +cmark_node *cmark_node_new_with_mem(cmark_node_type type, cmark_mem *mem) { + cmark_node *node = (cmark_node *)mem->calloc(1, sizeof(*node)); + node->mem = mem; + node->type = (uint16_t)type; + + switch (node->type) { + case CMARK_NODE_HEADING: + node->as.heading.level = 1; + break; + case CMARK_NODE_LIST: { + cmark_list *list = &node->as.list; + list->list_type = CMARK_BULLET_LIST; + list->start = 0; + list->tight = false; + break; + } + default: + break; + } + + return node; +} +``` + +The `calloc()` zeroes all fields, so pointers start as NULL and numeric fields as 0. Only heading and list nodes need explicit default initialization. + +### `make_block()` — Parser-Internal Creation + +During block parsing, `make_block()` in `blocks.c` creates nodes with source position and the `CMARK_NODE__OPEN` flag: + +```c +static cmark_node *make_block(cmark_mem *mem, cmark_node_type tag, + int start_line, int start_column) { + cmark_node *e; + e = (cmark_node *)mem->calloc(1, sizeof(*e)); + e->mem = mem; + e->type = (uint16_t)tag; + e->flags = CMARK_NODE__OPEN; + e->start_line = start_line; + e->start_column = start_column; + e->end_line = start_line; + return e; +} +``` + +### Inline Node Creation + +The inline parser in `inlines.c` uses two factory functions: + +```c +// Create an inline with string content (text, code, HTML) +static inline cmark_node *make_literal(subject *subj, cmark_node_type t, + int start_column, int end_column) { + cmark_node *e = (cmark_node *)subj->mem->calloc(1, sizeof(*e)); + e->mem = subj->mem; + e->type = (uint16_t)t; + e->start_line = e->end_line = subj->line; + e->start_column = start_column + 1 + subj->column_offset + subj->block_offset; + e->end_column = end_column + 1 + subj->column_offset + subj->block_offset; + return e; +} + +// Create an inline with no value (emphasis, strong, etc.) +static inline cmark_node *make_simple(cmark_mem *mem, cmark_node_type t) { + cmark_node *e = (cmark_node *)mem->calloc(1, sizeof(*e)); + e->mem = mem; + e->type = t; + return e; +} +``` + +## Node Destruction + +### `S_free_nodes()` — Iterative Subtree Freeing + +The `S_free_nodes()` function in `node.c` avoids recursion by splicing children into a flat linked list: + +```c +static void S_free_nodes(cmark_node *e) { + cmark_mem *mem = e->mem; + cmark_node *next; + while (e != NULL) { + switch (e->type) { + case CMARK_NODE_CODE_BLOCK: + mem->free(e->data); + mem->free(e->as.code.info); + break; + case CMARK_NODE_TEXT: + case CMARK_NODE_HTML_INLINE: + case CMARK_NODE_CODE: + case CMARK_NODE_HTML_BLOCK: + mem->free(e->data); + break; + case CMARK_NODE_LINK: + case CMARK_NODE_IMAGE: + mem->free(e->as.link.url); + mem->free(e->as.link.title); + break; + case CMARK_NODE_CUSTOM_BLOCK: + case CMARK_NODE_CUSTOM_INLINE: + mem->free(e->as.custom.on_enter); + mem->free(e->as.custom.on_exit); + break; + default: + break; + } + if (e->last_child) { + // Splice children into list for flat iteration + e->last_child->next = e->next; + e->next = e->first_child; + } + next = e->next; + mem->free(e); + e = next; + } +} +``` + +This splicing technique converts the tree into a flat list, allowing O(n) iterative freeing without a recursion stack. For each node with children, the children are prepended to the remaining list by connecting `last_child->next` to `e->next` and `e->next` to `first_child`. + +## Containership Rules + +The `S_can_contain()` function in `node.c` enforces which node types can contain which children: + +```c +static bool S_can_contain(cmark_node *node, cmark_node *child) { + // Ancestor loop detection + if (child->first_child != NULL) { + cmark_node *cur = node->parent; + while (cur != NULL) { + if (cur == child) return false; + cur = cur->parent; + } + } + + // Documents cannot be children + if (child->type == CMARK_NODE_DOCUMENT) return false; + + switch (node->type) { + case CMARK_NODE_DOCUMENT: + case CMARK_NODE_BLOCK_QUOTE: + case CMARK_NODE_ITEM: + return cmark_node_is_block(child) && child->type != CMARK_NODE_ITEM; + + case CMARK_NODE_LIST: + return child->type == CMARK_NODE_ITEM; + + case CMARK_NODE_CUSTOM_BLOCK: + return true; // Custom blocks can contain anything + + case CMARK_NODE_PARAGRAPH: + case CMARK_NODE_HEADING: + case CMARK_NODE_EMPH: + case CMARK_NODE_STRONG: + case CMARK_NODE_LINK: + case CMARK_NODE_IMAGE: + case CMARK_NODE_CUSTOM_INLINE: + return cmark_node_is_inline(child); + + default: + break; + } + return false; +} +``` + +Key rules: +- **Document, block quote, list item**: Can contain any block except items +- **List**: Can only contain items +- **Custom block**: Can contain anything (no restrictions) +- **Paragraph, heading, emphasis, strong, link, image, custom inline**: Can only contain inline nodes +- **Leaf blocks** (thematic break, code block, HTML block): Cannot contain anything + +## Tree Manipulation + +### Unlinking + +The internal `S_node_unlink()` function detaches a node from its parent and siblings: + +```c +static void S_node_unlink(cmark_node *node) { + if (node->prev) { + node->prev->next = node->next; + } + if (node->next) { + node->next->prev = node->prev; + } + // Update parent's first_child / last_child pointers + if (node->parent) { + if (node->parent->first_child == node) + node->parent->first_child = node->next; + if (node->parent->last_child == node) + node->parent->last_child = node->prev; + } + node->next = NULL; + node->prev = NULL; + node->parent = NULL; +} +``` + +### String Setting Helper + +The `cmark_set_cstr()` function manages string assignment with proper memory handling: + +```c +static bufsize_t cmark_set_cstr(cmark_mem *mem, unsigned char **dst, + const char *src) { + unsigned char *old = *dst; + bufsize_t len; + if (src && src[0]) { + len = (bufsize_t)strlen(src); + *dst = (unsigned char *)mem->realloc(NULL, len + 1); + memcpy(*dst, src, len + 1); + } else { + len = 0; + *dst = NULL; + } + if (old) { + mem->free(old); + } + return len; +} +``` + +This function allocates a new copy of the source string, assigns it, then frees the old value — ensuring no memory leaks even when overwriting existing data. + +## Node Data Storage Pattern + +Nodes store their text content in two ways depending on type: + +1. **Direct storage** (`data` + `len`): Used by `CMARK_NODE_TEXT`, `CMARK_NODE_CODE`, `CMARK_NODE_CODE_BLOCK`, `CMARK_NODE_HTML_BLOCK`, and `CMARK_NODE_HTML_INLINE`. The `data` field points to a separately allocated buffer containing the text content. + +2. **Union storage** (`as.*`): Used by lists, code blocks (for the info string), headings, links/images, and custom nodes. These store structured data rather than raw text. + +3. **Hybrid**: `CMARK_NODE_CODE_BLOCK` uses both — `data` for the code content and `as.code.info` for the info string. + +## The `cmark_node_check()` Function + +For debug builds, `cmark_node_check()` validates the structural integrity of the tree. It checks that parent/child/sibling pointers are consistent and that the tree forms a valid structure. It returns the number of errors found and prints details to the provided `FILE*`. + +## Cross-References + +- [node.h](../../../cmark/src/node.h) — Struct definitions +- [node.c](../../../cmark/src/node.c) — Implementation +- [iterator-system.md](iterator-system.md) — How nodes are traversed +- [block-parsing.md](block-parsing.md) — How block nodes are created during parsing +- [inline-parsing.md](inline-parsing.md) — How inline nodes are created +- [memory-management.md](memory-management.md) — Allocator integration |
