summaryrefslogtreecommitdiff
path: root/docs/handbook/cmark/ast-node-system.md
diff options
context:
space:
mode:
Diffstat (limited to 'docs/handbook/cmark/ast-node-system.md')
-rw-r--r--docs/handbook/cmark/ast-node-system.md383
1 files changed, 383 insertions, 0 deletions
diff --git a/docs/handbook/cmark/ast-node-system.md b/docs/handbook/cmark/ast-node-system.md
new file mode 100644
index 0000000000..3d25415eda
--- /dev/null
+++ b/docs/handbook/cmark/ast-node-system.md
@@ -0,0 +1,383 @@
+# cmark — AST Node System
+
+## Overview
+
+The AST (Abstract Syntax Tree) node system is defined across `node.h` (internal struct definitions) and `node.c` (node creation, destruction, accessor functions, and tree manipulation). Every element in a parsed CommonMark document is represented as a `cmark_node`. Nodes form a tree via parent/child/sibling pointers, with type-specific data stored in a discriminated union.
+
+## The `cmark_node` Struct
+
+Defined in `node.h`, this is the central data structure of the entire library:
+
+```c
+struct cmark_node {
+ cmark_mem *mem; // Memory allocator used for this node
+
+ struct cmark_node *next; // Next sibling
+ struct cmark_node *prev; // Previous sibling
+ struct cmark_node *parent; // Parent node
+ struct cmark_node *first_child; // First child
+ struct cmark_node *last_child; // Last child
+
+ void *user_data; // Arbitrary user-attached data
+
+ unsigned char *data; // String content (for text, code, HTML)
+ bufsize_t len; // Length of data
+
+ int start_line; // Source position: starting line (1-based)
+ int start_column; // Source position: starting column (1-based)
+ int end_line; // Source position: ending line
+ int end_column; // Source position: ending column
+ uint16_t type; // Node type (cmark_node_type enum value)
+ uint16_t flags; // Internal flags (open, last-line-blank, etc.)
+
+ union {
+ cmark_list list; // List-specific data
+ cmark_code code; // Code block-specific data
+ cmark_heading heading; // Heading-specific data
+ cmark_link link; // Link/image-specific data
+ cmark_custom custom; // Custom block/inline data
+ int html_block_type; // HTML block type (1-7)
+ } as;
+};
+```
+
+The union `as` means each node only occupies memory for one type-specific payload, keeping the struct compact. The largest union member determines the node's size.
+
+## Type-Specific Structs
+
+### `cmark_list` — List Properties
+
+```c
+typedef struct {
+ int marker_offset; // Indentation of list marker from left margin
+ int padding; // Total indentation (marker + content offset)
+ int start; // Starting number for ordered lists (0 for bullet)
+ unsigned char list_type; // CMARK_BULLET_LIST or CMARK_ORDERED_LIST
+ unsigned char delimiter; // CMARK_PERIOD_DELIM, CMARK_PAREN_DELIM, or CMARK_NO_DELIM
+ unsigned char bullet_char;// '*', '-', or '+' for bullet lists
+ bool tight; // Whether the list is tight (no blank lines between items)
+} cmark_list;
+```
+
+`marker_offset` and `padding` are used during block parsing to track indentation levels for list continuation. The `tight` flag is determined during block finalization by checking whether blank lines appear between list items or their children.
+
+### `cmark_code` — Code Block Properties
+
+```c
+typedef struct {
+ unsigned char *info; // Info string (language hint, e.g., "python")
+ uint8_t fence_length; // Length of opening fence (3+ backticks or tildes)
+ uint8_t fence_offset; // Indentation of fence from left margin
+ unsigned char fence_char; // '`' or '~'
+ int8_t fenced; // Whether this is a fenced code block (vs. indented)
+} cmark_code;
+```
+
+For indented code blocks, `fenced` is 0, and `info`, `fence_length`, `fence_char`, and `fence_offset` are unused. For fenced code blocks, `info` is extracted from the first line of the opening fence and stored as a separately allocated string.
+
+### `cmark_heading` — Heading Properties
+
+```c
+typedef struct {
+ int internal_offset; // Internal offset within the heading content
+ int8_t level; // Heading level (1-6)
+ bool setext; // Whether this is a setext-style heading (underlined)
+} cmark_heading;
+```
+
+ATX headings (`# Heading`) have `setext = false`. Setext headings (underlined with `=` or `-`) have `setext = true`. The `level` field is shared and defaults to 1 when a heading node is created.
+
+### `cmark_link` — Link and Image Properties
+
+```c
+typedef struct {
+ unsigned char *url; // Destination URL (separately allocated)
+ unsigned char *title; // Optional title text (separately allocated)
+} cmark_link;
+```
+
+Both `url` and `title` are separately allocated strings that must be freed when the node is destroyed. This struct is used for both `CMARK_NODE_LINK` and `CMARK_NODE_IMAGE`.
+
+### `cmark_custom` — Custom Block/Inline Properties
+
+```c
+typedef struct {
+ unsigned char *on_enter; // Literal text rendered when entering the node
+ unsigned char *on_exit; // Literal text rendered when leaving the node
+} cmark_custom;
+```
+
+Custom nodes allow embedding arbitrary content in the AST for extensions. Both strings are separately allocated.
+
+## Internal Flags
+
+The `flags` field uses bit flags defined in the `cmark_node__internal_flags` enum:
+
+```c
+enum cmark_node__internal_flags {
+ CMARK_NODE__OPEN = (1 << 0), // Block is still open (accepting content)
+ CMARK_NODE__LAST_LINE_BLANK = (1 << 1), // Last line of this block was blank
+ CMARK_NODE__LAST_LINE_CHECKED = (1 << 2), // blank-line status has been computed
+ CMARK_NODE__LIST_LAST_LINE_BLANK = (1 << 3), // (unused/reserved)
+};
+```
+
+- **`CMARK_NODE__OPEN`**: Set when a block is created during parsing. Cleared by `finalize()` when the block is closed. The parser's `current` pointer always points to a node with this flag set.
+- **`CMARK_NODE__LAST_LINE_BLANK`**: Set/cleared by `S_set_last_line_blank()` in `blocks.c` to track whether the most recent line added to this block was blank. Used for determining list tightness.
+- **`CMARK_NODE__LAST_LINE_CHECKED`**: Prevents redundant traversal when checking `S_ends_with_blank_line()`, which recursively descends into list items.
+
+## Node Creation
+
+### `cmark_node_new_with_mem()`
+
+The primary creation function (in `node.c`):
+
+```c
+cmark_node *cmark_node_new_with_mem(cmark_node_type type, cmark_mem *mem) {
+ cmark_node *node = (cmark_node *)mem->calloc(1, sizeof(*node));
+ node->mem = mem;
+ node->type = (uint16_t)type;
+
+ switch (node->type) {
+ case CMARK_NODE_HEADING:
+ node->as.heading.level = 1;
+ break;
+ case CMARK_NODE_LIST: {
+ cmark_list *list = &node->as.list;
+ list->list_type = CMARK_BULLET_LIST;
+ list->start = 0;
+ list->tight = false;
+ break;
+ }
+ default:
+ break;
+ }
+
+ return node;
+}
+```
+
+The `calloc()` zeroes all fields, so pointers start as NULL and numeric fields as 0. Only heading and list nodes need explicit default initialization.
+
+### `make_block()` — Parser-Internal Creation
+
+During block parsing, `make_block()` in `blocks.c` creates nodes with source position and the `CMARK_NODE__OPEN` flag:
+
+```c
+static cmark_node *make_block(cmark_mem *mem, cmark_node_type tag,
+ int start_line, int start_column) {
+ cmark_node *e;
+ e = (cmark_node *)mem->calloc(1, sizeof(*e));
+ e->mem = mem;
+ e->type = (uint16_t)tag;
+ e->flags = CMARK_NODE__OPEN;
+ e->start_line = start_line;
+ e->start_column = start_column;
+ e->end_line = start_line;
+ return e;
+}
+```
+
+### Inline Node Creation
+
+The inline parser in `inlines.c` uses two factory functions:
+
+```c
+// Create an inline with string content (text, code, HTML)
+static inline cmark_node *make_literal(subject *subj, cmark_node_type t,
+ int start_column, int end_column) {
+ cmark_node *e = (cmark_node *)subj->mem->calloc(1, sizeof(*e));
+ e->mem = subj->mem;
+ e->type = (uint16_t)t;
+ e->start_line = e->end_line = subj->line;
+ e->start_column = start_column + 1 + subj->column_offset + subj->block_offset;
+ e->end_column = end_column + 1 + subj->column_offset + subj->block_offset;
+ return e;
+}
+
+// Create an inline with no value (emphasis, strong, etc.)
+static inline cmark_node *make_simple(cmark_mem *mem, cmark_node_type t) {
+ cmark_node *e = (cmark_node *)mem->calloc(1, sizeof(*e));
+ e->mem = mem;
+ e->type = t;
+ return e;
+}
+```
+
+## Node Destruction
+
+### `S_free_nodes()` — Iterative Subtree Freeing
+
+The `S_free_nodes()` function in `node.c` avoids recursion by splicing children into a flat linked list:
+
+```c
+static void S_free_nodes(cmark_node *e) {
+ cmark_mem *mem = e->mem;
+ cmark_node *next;
+ while (e != NULL) {
+ switch (e->type) {
+ case CMARK_NODE_CODE_BLOCK:
+ mem->free(e->data);
+ mem->free(e->as.code.info);
+ break;
+ case CMARK_NODE_TEXT:
+ case CMARK_NODE_HTML_INLINE:
+ case CMARK_NODE_CODE:
+ case CMARK_NODE_HTML_BLOCK:
+ mem->free(e->data);
+ break;
+ case CMARK_NODE_LINK:
+ case CMARK_NODE_IMAGE:
+ mem->free(e->as.link.url);
+ mem->free(e->as.link.title);
+ break;
+ case CMARK_NODE_CUSTOM_BLOCK:
+ case CMARK_NODE_CUSTOM_INLINE:
+ mem->free(e->as.custom.on_enter);
+ mem->free(e->as.custom.on_exit);
+ break;
+ default:
+ break;
+ }
+ if (e->last_child) {
+ // Splice children into list for flat iteration
+ e->last_child->next = e->next;
+ e->next = e->first_child;
+ }
+ next = e->next;
+ mem->free(e);
+ e = next;
+ }
+}
+```
+
+This splicing technique converts the tree into a flat list, allowing O(n) iterative freeing without a recursion stack. For each node with children, the children are prepended to the remaining list by connecting `last_child->next` to `e->next` and `e->next` to `first_child`.
+
+## Containership Rules
+
+The `S_can_contain()` function in `node.c` enforces which node types can contain which children:
+
+```c
+static bool S_can_contain(cmark_node *node, cmark_node *child) {
+ // Ancestor loop detection
+ if (child->first_child != NULL) {
+ cmark_node *cur = node->parent;
+ while (cur != NULL) {
+ if (cur == child) return false;
+ cur = cur->parent;
+ }
+ }
+
+ // Documents cannot be children
+ if (child->type == CMARK_NODE_DOCUMENT) return false;
+
+ switch (node->type) {
+ case CMARK_NODE_DOCUMENT:
+ case CMARK_NODE_BLOCK_QUOTE:
+ case CMARK_NODE_ITEM:
+ return cmark_node_is_block(child) && child->type != CMARK_NODE_ITEM;
+
+ case CMARK_NODE_LIST:
+ return child->type == CMARK_NODE_ITEM;
+
+ case CMARK_NODE_CUSTOM_BLOCK:
+ return true; // Custom blocks can contain anything
+
+ case CMARK_NODE_PARAGRAPH:
+ case CMARK_NODE_HEADING:
+ case CMARK_NODE_EMPH:
+ case CMARK_NODE_STRONG:
+ case CMARK_NODE_LINK:
+ case CMARK_NODE_IMAGE:
+ case CMARK_NODE_CUSTOM_INLINE:
+ return cmark_node_is_inline(child);
+
+ default:
+ break;
+ }
+ return false;
+}
+```
+
+Key rules:
+- **Document, block quote, list item**: Can contain any block except items
+- **List**: Can only contain items
+- **Custom block**: Can contain anything (no restrictions)
+- **Paragraph, heading, emphasis, strong, link, image, custom inline**: Can only contain inline nodes
+- **Leaf blocks** (thematic break, code block, HTML block): Cannot contain anything
+
+## Tree Manipulation
+
+### Unlinking
+
+The internal `S_node_unlink()` function detaches a node from its parent and siblings:
+
+```c
+static void S_node_unlink(cmark_node *node) {
+ if (node->prev) {
+ node->prev->next = node->next;
+ }
+ if (node->next) {
+ node->next->prev = node->prev;
+ }
+ // Update parent's first_child / last_child pointers
+ if (node->parent) {
+ if (node->parent->first_child == node)
+ node->parent->first_child = node->next;
+ if (node->parent->last_child == node)
+ node->parent->last_child = node->prev;
+ }
+ node->next = NULL;
+ node->prev = NULL;
+ node->parent = NULL;
+}
+```
+
+### String Setting Helper
+
+The `cmark_set_cstr()` function manages string assignment with proper memory handling:
+
+```c
+static bufsize_t cmark_set_cstr(cmark_mem *mem, unsigned char **dst,
+ const char *src) {
+ unsigned char *old = *dst;
+ bufsize_t len;
+ if (src && src[0]) {
+ len = (bufsize_t)strlen(src);
+ *dst = (unsigned char *)mem->realloc(NULL, len + 1);
+ memcpy(*dst, src, len + 1);
+ } else {
+ len = 0;
+ *dst = NULL;
+ }
+ if (old) {
+ mem->free(old);
+ }
+ return len;
+}
+```
+
+This function allocates a new copy of the source string, assigns it, then frees the old value — ensuring no memory leaks even when overwriting existing data.
+
+## Node Data Storage Pattern
+
+Nodes store their text content in two ways depending on type:
+
+1. **Direct storage** (`data` + `len`): Used by `CMARK_NODE_TEXT`, `CMARK_NODE_CODE`, `CMARK_NODE_CODE_BLOCK`, `CMARK_NODE_HTML_BLOCK`, and `CMARK_NODE_HTML_INLINE`. The `data` field points to a separately allocated buffer containing the text content.
+
+2. **Union storage** (`as.*`): Used by lists, code blocks (for the info string), headings, links/images, and custom nodes. These store structured data rather than raw text.
+
+3. **Hybrid**: `CMARK_NODE_CODE_BLOCK` uses both — `data` for the code content and `as.code.info` for the info string.
+
+## The `cmark_node_check()` Function
+
+For debug builds, `cmark_node_check()` validates the structural integrity of the tree. It checks that parent/child/sibling pointers are consistent and that the tree forms a valid structure. It returns the number of errors found and prints details to the provided `FILE*`.
+
+## Cross-References
+
+- [node.h](../../../cmark/src/node.h) — Struct definitions
+- [node.c](../../../cmark/src/node.c) — Implementation
+- [iterator-system.md](iterator-system.md) — How nodes are traversed
+- [block-parsing.md](block-parsing.md) — How block nodes are created during parsing
+- [inline-parsing.md](inline-parsing.md) — How inline nodes are created
+- [memory-management.md](memory-management.md) — Allocator integration