# cmark — Architecture

## High-Level Design

cmark implements a two-phase parsing pipeline that converts CommonMark Markdown into an Abstract Syntax Tree (AST), which can then be rendered into multiple output formats. The design separates concerns cleanly: block-level structure is identified first, then inline content is parsed within the appropriate blocks.

```
Input Text (UTF-8)
            │
            ▼
┌───────────────────────┐
│     S_parser_feed     │  Split input into lines (blocks.c)
│                       │  Handle UTF-8 BOM, CR/LF normalization
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│    S_process_line     │  Line-by-line block structure analysis (blocks.c)
│                       │  Open/close containers, detect leaf blocks
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│   finalize_document   │  Close all open blocks (blocks.c)
│                       │  Resolve reference link definitions
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│    process_inlines    │  Parse inline content in paragraphs/headings (blocks.c → inlines.c)
│                       │  Delimiter stack algorithm for emphasis
│                       │  Bracket stack for links/images
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ AST (cmark_node tree) │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│       Renderer        │  Iterator-driven traversal
│      (html/xml/       │  Enter/Exit events per node
│     latex/man/cm)     │
└───────────┬───────────┘
            │
            ▼
      Output String
```

## Module Dependency Graph

The internal header dependencies reveal the layered architecture:

```
cmark.h (public API — types, enums, function declarations)
  ├── cmark_export.h (generated — DLL export macros)
  └── cmark_version.h (generated — version constants)

node.h (internal — struct cmark_node)
  ├── cmark.h
  └── buffer.h

parser.h (internal — struct cmark_parser)
  ├── references.h
  ├── node.h
  └── buffer.h

iterator.h (internal — struct cmark_iter)
  └── cmark.h

render.h (internal — struct cmark_renderer)
  └── buffer.h

buffer.h (internal — cmark_strbuf)
  └── cmark.h

chunk.h (internal — cmark_chunk)
  ├── cmark.h
  ├── buffer.h
  └── cmark_ctype.h

references.h (internal — cmark_reference_map)
  └── chunk.h

inlines.h (internal — inline parsing API)
  ├── chunk.h
  └── references.h

scanners.h (internal — scanner function declarations)
  ├── cmark.h
  └── chunk.h

houdini.h (internal — HTML/URL escaping)
  └── buffer.h

cmark_ctype.h (internal — locale-independent char classification)
  (no cmark dependencies)

utf8.h (internal — UTF-8 processing)
  └── buffer.h
```

## Phase 1: Block Structure (blocks.c)

Block parsing operates on a state machine maintained in the `cmark_parser` struct (defined in `parser.h`):

```c
struct cmark_parser {
  struct cmark_mem *mem;                // Memory allocator
  struct cmark_reference_map *refmap;   // Link reference definitions
  struct cmark_node *root;              // Document root node
  struct cmark_node *current;           // Deepest open block
  int line_number;                      // Current line being processed
  bufsize_t offset;                     // Byte position in current line
  bufsize_t column;                     // Virtual column (tabs expanded)
  bufsize_t first_nonspace;             // Position of first non-whitespace
  bufsize_t first_nonspace_column;      // Column of first non-whitespace
  bufsize_t thematic_break_kill_pos;    // Optimization for thematic break scanning
  int indent;                           // Indentation level (first_nonspace_column - column)
  bool blank;                           // Whether current line is blank
  bool partially_consumed_tab;          // Tab only partially used for indentation
  cmark_strbuf curline;                 // Current line being processed
  bufsize_t last_line_length;           // Length of previous line (for end_column)
  cmark_strbuf linebuf;                 // Buffer for accumulating partial lines across feeds
  cmark_strbuf content;                 // Accumulated content for the current open block
  int options;                          // Option flags
  bool last_buffer_ended_with_cr;       // For CR/LF handling across buffer boundaries
  unsigned int total_size;              // Total bytes fed (for reference expansion limiting)
};
```
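
The `last_buffer_ended_with_cr` field is needed because input can arrive in arbitrary chunks, so a CR/LF pair may straddle two feeds. The following is a minimal, self-contained sketch of that idea and not cmark's actual code: `line_counter` and `line_counter_feed` are hypothetical names, and the sketch only counts line endings rather than assembling lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for the relevant parser state. */
typedef struct {
  size_t line_endings;     /* line terminators seen so far */
  bool last_ended_with_cr; /* previous chunk ended in '\r' */
} line_counter;

/* Feed one chunk; "\r\n", "\r" and "\n" each count as one line ending. */
void line_counter_feed(line_counter *lc, const char *buf, size_t len) {
  size_t i = 0;
  assert(lc != NULL && (buf != NULL || len == 0));
  if (len == 0)
    return; /* empty chunk: keep the pending-CR state untouched */
  /* A leading '\n' completes a "\r\n" that was split across chunks. */
  if (lc->last_ended_with_cr && buf[0] == '\n')
    i = 1;
  lc->last_ended_with_cr = false;
  for (; i < len; i++) {
    if (buf[i] == '\r') {
      lc->line_endings++;
      if (i + 1 == len)
        lc->last_ended_with_cr = true; /* the '\n' may arrive next time */
      else if (buf[i + 1] == '\n')
        i++; /* CRLF inside a single chunk */
    } else if (buf[i] == '\n') {
      lc->line_endings++;
    }
  }
}
```

Feeding `"a\r"` followed by `"\nb\n"` yields two line endings, the same as feeding `"a\r\nb\n"` in one call. Without the flag, the split CR/LF would be counted twice.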

### Line Processing Flow

For each line, `S_process_line()` does the following:

1. **Increment line number**, store current line in `parser->curline`.
2. **Check open blocks** (`check_open_blocks()`): Walk through the tree from root to the deepest open node. For each open container node, try to match the expected line prefix:
   - Block quote: expect `>` (optionally preceded by up to 3 spaces)
   - List item: expect indentation matching `marker_offset + padding`
   - Code block (fenced): check for closing fence or skip fence offset spaces
   - Code block (indented): expect 4+ spaces of indentation
   - HTML block: check type-specific continuation rules
3. **Try new container starts**: Starting from the last matched container, check whether the rest of the line opens a new container (block quote, list item), repeating until no further container start is found.
4. **Try new leaf blocks**: If the line doesn't continue an existing block or start a new container, check for:
   - ATX heading (lines starting with 1-6 `#` characters)
   - Setext heading (underlines of `=` or `-` following a paragraph)
   - Thematic break (3+ `*`, `-`, or `_` on a line by themselves)
   - Fenced code block (3+ backticks or tildes)
   - HTML block (7 different start patterns)
   - Indented code block (4+ spaces of indentation)
5. **Add line content**: For blocks that accept lines (paragraph, heading, code block), append the line content to `parser->content`.
6. **Handle lazy continuation**: Paragraphs support lazy continuation where a non-blank line can continue a paragraph even without matching container prefixes.
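
As a rough illustration of steps 2 and 4, the sketch below matches a block quote prefix and detects an ATX heading on a single line. It is a simplified, self-contained approximation rather than cmark's scanner code: both function names are hypothetical, and tabs after `>` as well as closing `#` sequences are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes consumed by a block quote marker, or 0 if the line has none.
   Simplification: only a space (not a tab) is accepted after '>'. */
size_t match_block_quote_prefix(const char *line) {
  size_t i = 0;
  assert(line != NULL);
  while (i < 3 && line[i] == ' ')
    i++; /* up to 3 spaces of indentation */
  if (line[i] != '>')
    return 0;
  i++;
  if (line[i] == ' ')
    i++; /* one optional space belongs to the marker */
  return i;
}

/* ATX heading level (1-6), or 0 if the line is not an ATX heading. */
int atx_heading_level(const char *line) {
  size_t i = 0;
  int level = 0;
  assert(line != NULL);
  while (i < 3 && line[i] == ' ')
    i++;
  while (line[i] == '#') {
    level++;
    i++;
  }
  if (level < 1 || level > 6)
    return 0;
  /* The '#' run must be followed by whitespace or the end of the line. */
  switch (line[i]) {
  case ' ': case '\t': case '\n': case '\r': case '\0':
    return level;
  default:
    return 0;
  }
}
```

A line processor in this style would first consume container prefixes with functions like `match_block_quote_prefix()`, then test the remainder of the line for leaf starts such as `atx_heading_level()`.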

### Finalization

When a block is closed (either explicitly or because a new block replaces it), `finalize()` is called:

- **Paragraphs**: Reference link definitions at the start are extracted and stored in `parser->refmap`. If only references remain, the paragraph node is deleted.
- **Code blocks (fenced)**: The first line becomes the info string; remaining content becomes the code body.
- **Code blocks (indented)**: Trailing blank lines are removed.
- **Lists**: Tight/loose status is determined by checking for blank lines between items and their children.
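
The indented code block rule can be sketched as a pure function over the accumulated content. This is an assumption-laden illustration, not the routine in `blocks.c`: `trim_trailing_blank_lines` is a hypothetical name, and it returns the new length instead of modifying a `cmark_strbuf`.

```c
#include <assert.h>
#include <stddef.h>

static int is_blank_char(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Returns the length of `s` once trailing blank lines are dropped.
   The last non-blank line keeps its own line ending, if it has one. */
size_t trim_trailing_blank_lines(const char *s, size_t len) {
  size_t last, end;
  assert(s != NULL || len == 0);
  /* `last` ends up one past the final non-whitespace byte. */
  last = len;
  while (last > 0 && is_blank_char(s[last - 1]))
    last--;
  if (last == 0)
    return 0; /* nothing but blank lines */
  /* Keep the rest of that line, up to and including its '\n'. */
  end = last;
  while (end < len && s[end] != '\n')
    end++;
  return end < len ? end + 1 : len;
}
```

For the content `"code\n\n  \n"` the function returns 5, which keeps `"code\n"` and discards the two blank lines after it.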

## Phase 2: Inline Parsing (inlines.c)

After all block structure is finalized, `process_inlines()` walks the AST with an iterator and calls `cmark_parse_inlines()` on every node whose type satisfies `contains_inlines()`, that is, `CMARK_NODE_PARAGRAPH` and `CMARK_NODE_HEADING`.

The inline parser uses a `subject` struct that tracks:

```c
typedef struct {
  cmark_mem *mem;
  cmark_chunk input;              // The text to parse
  unsigned flags;                 // Skip flags for HTML constructs
  int line;                       // Source line number
  bufsize_t pos;                  // Current position in input
  int block_offset;               // Column offset of containing block
  int column_offset;              // Adjustment for multi-line inlines
  cmark_reference_map *refmap;    // Reference definitions
  delimiter *last_delim;          // Top of delimiter stack
  bracket *last_bracket;          // Top of bracket stack
  bufsize_t backticks[MAXBACKTICKS + 1];  // Cache of backtick positions
  bool scanned_for_backticks;     // Whether full backtick scan done
  bool no_link_openers;           // Optimization flag
} subject;
```

### Delimiter Stack Algorithm

Emphasis (`*`, `_`) and, when `CMARK_OPT_SMART` is set, smart quotes (`'`, `"`) use a delimiter stack. When a run of delimiter characters is found:

1. `scan_delims()` determines whether the run can open and/or close emphasis, based on Unicode-aware flanking rules (checking whether surrounding characters are spaces or punctuation using `cmark_utf8proc_is_space()` and `cmark_utf8proc_is_punctuation_or_symbol()`).
2. The delimiter is pushed onto the stack as a `delimiter` struct.
3. When a closing delimiter is found, the stack is scanned backwards for a matching opener, and `S_insert_emph()` creates `CMARK_NODE_EMPH` or `CMARK_NODE_STRONG` nodes.
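
The flanking test in step 1 can be illustrated with an ASCII-only sketch. It follows the left-flanking and right-flanking rules of the CommonMark specification but is not cmark's `scan_delims()`: the name `scan_delims_ascii`, the `delim_info` struct, and the use of `<ctype.h>` in place of the Unicode-aware helpers are all assumptions made for brevity.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
  bool can_open;
  bool can_close;
  size_t length; /* number of delimiter characters in the run */
} delim_info;

/* Classify the run of '*' or '_' that starts at s[pos]. */
delim_info scan_delims_ascii(const char *s, size_t pos) {
  delim_info d = {false, false, 0};
  size_t len = strlen(s);
  assert(pos < len && (s[pos] == '*' || s[pos] == '_'));
  char c = s[pos];
  size_t end = pos;
  while (end < len && s[end] == c)
    end++;
  /* The start and end of the text count as whitespace. */
  unsigned char before = pos > 0 ? (unsigned char)s[pos - 1] : ' ';
  unsigned char after = end < len ? (unsigned char)s[end] : ' ';
  bool left = !isspace(after) &&
              (!ispunct(after) || isspace(before) || ispunct(before));
  bool right = !isspace(before) &&
               (!ispunct(before) || isspace(after) || ispunct(after));
  if (c == '_') {
    /* Underscores are stricter, so intraword '_' never forms emphasis. */
    d.can_open = left && (!right || ispunct(before));
    d.can_close = right && (!left || ispunct(after));
  } else {
    d.can_open = left;
    d.can_close = right;
  }
  d.length = end - pos;
  return d;
}
```

With this sketch, the first `**` in `**bold**` can only open, the second can only close, and the underscore in `snake_case` can do neither.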

### Bracket Stack Algorithm

Links and images use a separate bracket stack:

1. `[` pushes a bracket entry; `![` pushes one marked as `image = true`.
2. When `]` is encountered, the bracket stack is searched for a matching opener.
3. If found, the parser looks for `(url "title")` or `[ref]` after the `]`.
4. For reference-style links, `cmark_reference_lookup()` is called against the parser's `refmap`.
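
Steps 1 and 2 can be sketched with a small fixed-size stack. This is a toy model under stated assumptions, not the `bracket` struct from `inlines.c`: `first_matched_bracket` and `MAX_DEPTH` are hypothetical, backslash escapes are ignored, and the sketch stops at the first `]` that has an opener.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 32 /* arbitrary limit for this sketch */

typedef struct {
  size_t position; /* index of the '[' in the input */
  bool image;      /* true when the opener was "![" */
} bracket;

/* Finds the opener matching the first closable ']' in `s`.
   Returns false if no ']' has a matching opener. */
bool first_matched_bracket(const char *s, bracket *out) {
  bracket stack[MAX_DEPTH];
  size_t depth = 0;
  assert(s != NULL && out != NULL);
  for (size_t i = 0; s[i] != '\0'; i++) {
    if (s[i] == '[') {
      if (depth == MAX_DEPTH)
        return false; /* sketch only: give up on deep nesting */
      stack[depth].position = i;
      stack[depth].image = (i > 0 && s[i - 1] == '!');
      depth++;
    } else if (s[i] == ']') {
      if (depth == 0)
        continue; /* unmatched closer is literal text */
      *out = stack[depth - 1]; /* top of stack is the nearest opener */
      return true;
    }
  }
  return false;
}
```

In `a [b ![c] d]` the first `]` pairs with the image opener whose `[` sits at index 6. A real parser would then inspect what follows the `]` before deciding whether a link or image is actually formed.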

## Phase 3: AST Rendering

All renderers traverse the AST using the iterator system. There are two rendering architectures:

### Direct Renderers (no framework)
- **HTML** (`html.c`): Uses `cmark_strbuf` directly. The `S_render_node()` function handles enter/exit events in a large switch statement. HTML escaping is done via `houdini_escape_html()`.
- **XML** (`xml.c`): Similar direct approach with XML-specific escaping and indentation tracking.
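
The enter/exit event model behind the direct renderers can be sketched on a toy tree. Everything here is hypothetical (`tnode`, `out_buf`, `render_event`, `walk`). The sketch recurses instead of using an iterator, writes into a fixed buffer, and performs no HTML escaping, so it only illustrates the event pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { EV_ENTER, EV_EXIT } event_type;
typedef enum { N_DOCUMENT, N_PARAGRAPH, N_EMPH, N_TEXT } node_type;

typedef struct tnode {
  node_type type;
  const char *literal; /* only used by N_TEXT */
  struct tnode *first_child;
  struct tnode *next;
} tnode;

typedef struct {
  char buf[256];
  size_t len;
} out_buf;

void out_puts(out_buf *o, const char *s) {
  size_t n = strlen(s);
  assert(o->len + n < sizeof(o->buf)); /* sketch: fixed capacity */
  memcpy(o->buf + o->len, s, n + 1);
  o->len += n;
}

/* One switch handles both the enter and the exit event of each type. */
void render_event(out_buf *o, const tnode *n, event_type ev) {
  bool entering = (ev == EV_ENTER);
  switch (n->type) {
  case N_DOCUMENT:  break;
  case N_PARAGRAPH: out_puts(o, entering ? "<p>" : "</p>\n"); break;
  case N_EMPH:      out_puts(o, entering ? "<em>" : "</em>"); break;
  case N_TEXT:      if (entering) out_puts(o, n->literal); break;
  }
}

/* Depth-first walk: enter, children, then exit for non-leaf types. */
void walk(out_buf *o, const tnode *n) {
  render_event(o, n, EV_ENTER);
  for (const tnode *c = n->first_child; c != NULL; c = c->next)
    walk(o, c);
  if (n->type != N_TEXT)
    render_event(o, n, EV_EXIT);
}
```

For a document holding one paragraph with the text `Hi ` followed by an emphasized `there`, `walk()` produces `<p>Hi <em>there</em></p>` plus a trailing newline.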

### Framework Renderers (via render.c)
- **LaTeX** (`latex.c`), **man** (`man.c`), **CommonMark** (`commonmark.c`): These use the `cmark_render()` generic framework, which provides:
  - Line wrapping at a configurable width
  - Prefix management for indented output (block quotes, list items)
  - Breakpoint tracking for intelligent line breaking
  - Escape dispatch via function pointers (`outc`)

The framework signature:

```c
char *cmark_render(cmark_node *root, int options, int width,
                   void (*outc)(cmark_renderer *, cmark_escaping, int32_t, unsigned char),
                   int (*render_node)(cmark_renderer *, cmark_node *,
                                      cmark_event_type, int));
```

Each format-specific renderer supplies its own `outc` (character-level escaping) and `render_node` (node-level output) callback functions.
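
The callback split can be pictured with a reduced model in which the shared loop knows nothing about escaping. This sketch does not reproduce the real `cmark_render()` signature: `mini_renderer`, `outc_fn`, and both escaping functions are hypothetical, and the escape tables are illustrative, not cmark's.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
  char buf[128];
  size_t len;
} mini_renderer;

/* Character-level output callback, reduced to a single byte argument. */
typedef void (*outc_fn)(mini_renderer *r, unsigned char c);

void mini_put(mini_renderer *r, const char *s) {
  size_t n = strlen(s);
  assert(r->len + n < sizeof(r->buf)); /* sketch: fixed capacity */
  memcpy(r->buf + r->len, s, n + 1);
  r->len += n;
}

static void put_char(mini_renderer *r, unsigned char c) {
  char one[2] = {(char)c, '\0'};
  mini_put(r, one);
}

/* Illustrative LaTeX-style escaping (not cmark's full table). */
void latex_like_outc(mini_renderer *r, unsigned char c) {
  switch (c) {
  case '%': mini_put(r, "\\%"); break;
  case '&': mini_put(r, "\\&"); break;
  case '#': mini_put(r, "\\#"); break;
  default:  put_char(r, c); break;
  }
}

/* Illustrative man-page-style escaping (not cmark's full table). */
void man_like_outc(mini_renderer *r, unsigned char c) {
  switch (c) {
  case '-':  mini_put(r, "\\-"); break;
  case '\\': mini_put(r, "\\e"); break;
  default:   put_char(r, c); break;
  }
}

/* Format-independent loop: every byte goes through the callback. */
void mini_render_text(mini_renderer *r, const char *text, outc_fn outc) {
  for (; *text != '\0'; text++)
    outc(r, (unsigned char)*text);
}
```

Passing `latex_like_outc` or `man_like_outc` to the same `mini_render_text()` call changes only how individual characters are emitted, which is the division of labour the framework relies on.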

## Key Design Decisions

### Owning vs. Non-Owning Strings

cmark uses two string types:

- **`cmark_strbuf`** (buffer.h): Owning, growable byte buffer. Used for accumulating output and parser state. Memory is managed via the `cmark_mem` allocator.
- **`cmark_chunk`** (chunk.h): Non-owning slice (pointer + length). Used for referencing substrings of the input during parsing without copying.
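
The difference between the two string types can be shown with a small sketch. The `slice` type and its functions are hypothetical stand-ins, not the `cmark_chunk` or `cmark_strbuf` APIs, and plain `malloc` is used where cmark would go through its `cmark_mem` allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Non-owning view: points into memory that someone else owns. */
typedef struct {
  const char *data;
  size_t len;
} slice;

slice slice_from_cstr(const char *s) {
  slice r;
  assert(s != NULL);
  r.data = s;
  r.len = strlen(s);
  return r;
}

/* Trimming only moves the pointer and shrinks the length; no copy. */
slice slice_trim(slice s) {
  while (s.len > 0 && (s.data[0] == ' ' || s.data[0] == '\t')) {
    s.data++;
    s.len--;
  }
  while (s.len > 0 &&
         (s.data[s.len - 1] == ' ' || s.data[s.len - 1] == '\t'))
    s.len--;
  return s;
}

/* Owning copy: the caller must free() the result. NULL on failure. */
char *slice_to_owned(slice s) {
  char *copy = malloc(s.len + 1);
  if (copy == NULL)
    return NULL;
  memcpy(copy, s.data, s.len);
  copy[s.len] = '\0';
  return copy;
}
```

A slice stays valid only while the underlying input is alive, which is why data that must outlive the input has to be copied into an owning buffer.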

### Node Memory Layout

Every `cmark_node` uses a discriminated union (`node->as`) to store type-specific data without separate allocations:

```c
union {
  cmark_list list;       // list marker, start, tight, delimiter
  cmark_code code;       // info string, fence char/length/offset
  cmark_heading heading; // level, setext flag, internal_offset
  cmark_link link;       // url, title
  cmark_custom custom;   // on_enter, on_exit
  int html_block_type;   // HTML block type (1-7)
} as;
```
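
A reduced model of this layout shows how the node type acts as the discriminant for the union. The `mini_*` names below are hypothetical and the structs carry only a couple of fields each. The accessor returns 0 for a missing or mismatched node, in line with the convention described under Error Handling.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
  MINI_NODE_PARAGRAPH,
  MINI_NODE_HEADING,
  MINI_NODE_LIST
} mini_node_type;

typedef struct { int level; int setext; } mini_heading;
typedef struct { int start; int tight; } mini_list;

typedef struct {
  mini_node_type type; /* discriminant: says which member is valid */
  union {
    mini_heading heading;
    mini_list list;
  } as;
} mini_node;

/* Reads the union only after checking the discriminant. */
int mini_node_get_heading_level(const mini_node *node) {
  if (node == NULL || node->type != MINI_NODE_HEADING)
    return 0;
  return node->as.heading.level;
}
```

Because the payload lives inside the node itself, creating a heading or list node costs one allocation rather than two.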

### Open Block Tracking

During block parsing, open blocks are tracked via the `CMARK_NODE__OPEN` flag in `node->flags`. The parser maintains a `current` pointer to the deepest open block. When new blocks are created, they're added as children of the appropriate open container. When blocks are finalized (closed), control returns to the parent.

### Reference Expansion Limiting

To prevent superlinear growth from adversarial reference definitions, `parser->total_size` tracks total bytes fed. After finalization, `parser->refmap->max_ref_size` is set to `MAX(total_size, 100000)`, and each reference lookup deducts the reference's size from the available budget.
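
The budget arithmetic can be sketched as follows. The `ref_budget` type and its functions are hypothetical. Only the `MAX(total_size, 100000)` rule and the idea of deducting each looked-up reference's size are taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
  size_t ref_size;     /* bytes already spent on expanded references */
  size_t max_ref_size; /* total budget for the document */
} ref_budget;

/* The budget is the larger of the input size and a fixed floor. */
void ref_budget_init(ref_budget *b, size_t total_size) {
  assert(b != NULL);
  b->max_ref_size = total_size > 100000 ? total_size : 100000;
  b->ref_size = 0;
}

/* Charges one lookup; returns false once the budget would be exceeded,
   in which case the reference is treated as not found. */
bool ref_budget_charge(ref_budget *b, size_t size) {
  assert(b != NULL);
  if (size > b->max_ref_size - b->ref_size)
    return false;
  b->ref_size += size;
  return true;
}
```

Comparing against the remaining budget (`max_ref_size - ref_size`) rather than adding first avoids overflow, and it ties the total expanded size to the input size no matter how often a large definition is referenced.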

## Error Handling

cmark follows a defensive programming model:
- NULL checks on all public API entry points (return 0 or NULL for invalid arguments)
- `assert()` for internal invariants (compiled out when `NDEBUG` is defined); building with `-DCMARK_DEBUG_NODES` additionally enables node-tree consistency checks
- Abort-on-allocation-failure in the default memory allocator
- No exceptions (pure C99)
- Invalid UTF-8 sequences are replaced with U+FFFD (when `CMARK_OPT_VALIDATE_UTF8` is set)

## Thread Safety

cmark is **not** thread-safe for concurrent access to the same parser or node tree. However, separate parser instances and separate node trees can be used in parallel from different threads, as there is no global mutable state (the `DEFAULT_MEM_ALLOCATOR` is read-only after initialization).

## Cross-References

- [block-parsing.md](block-parsing.md) — Detailed block-level parsing logic
- [inline-parsing.md](inline-parsing.md) — Delimiter and bracket stack algorithms
- [ast-node-system.md](ast-node-system.md) — Node struct internals
- [render-framework.md](render-framework.md) — Generic render engine
- [memory-management.md](memory-management.md) — Allocator and buffer details
- [iterator-system.md](iterator-system.md) — AST traversal mechanics