# cmark — Overview
## What Is cmark?
cmark is a C library and command-line tool for parsing and rendering CommonMark (standardized Markdown). Written in C99, it implements a two-phase parsing architecture — block structure recognition followed by inline content parsing — producing an Abstract Syntax Tree (AST) that can be traversed, manipulated, and rendered into multiple output formats.
**Language:** C (C99)
**Build System:** CMake (minimum version 3.14)
**Project Version:** 0.31.2
**License:** BSD-2-Clause
**Authors:** John MacFarlane, Vicent Marti, Kārlis Gaņģis, Nick Wellnhofer
## Core Architecture Summary
cmark's processing pipeline follows this sequence:
1. **Input** — UTF-8 text is fed to the parser, either all at once or incrementally via a streaming API.
2. **Block Parsing** (`blocks.c`) — The input is scanned line-by-line to identify block-level structures (paragraphs, headings, code blocks, lists, block quotes, thematic breaks, HTML blocks).
3. **Inline Parsing** (`inlines.c`) — Within paragraph and heading blocks, inline elements are parsed (emphasis, links, images, code spans, HTML inline, line breaks).
4. **AST Construction** — A tree of `cmark_node` structures is built, with each node representing a document element.
5. **Rendering** — The AST is traversed using an iterator and rendered to one of five output formats: HTML, XML, LaTeX, man (groff), or CommonMark.
## Source File Map
The `cmark/src/` directory contains the following source files, organized by responsibility:
### Public API
| File | Purpose |
|------|---------|
| `cmark.h` | Public API header — all exported types, enums, and function declarations |
| `cmark.c` | Core glue — `cmark_markdown_to_html()`, default memory allocator, version info |
| `main.c` | CLI entry point — argument parsing, file I/O, format dispatch |
### AST Node System
| File | Purpose |
|------|---------|
| `node.h` | Internal node struct definition, type-specific unions (`cmark_list`, `cmark_code`, `cmark_heading`, `cmark_link`, `cmark_custom`), internal flags |
| `node.c` | Node creation/destruction, accessor functions, tree manipulation (insert, append, unlink, replace) |
### Parsing
| File | Purpose |
|------|---------|
| `parser.h` | Internal `cmark_parser` struct definition (parser state: line number, offset, column, indent, reference map) |
| `blocks.c` | Block-level parsing — line-by-line analysis, open/close block logic, list item detection, finalization |
| `inlines.c` | Inline-level parsing — emphasis/strong via delimiter stack, backtick code spans, links/images via bracket stack, autolinks, HTML inline |
| `inlines.h` | Internal API: `cmark_parse_inlines()`, `cmark_parse_reference_inline()`, `cmark_clean_url()`, `cmark_clean_title()` |
### Traversal
| File | Purpose |
|------|---------|
| `iterator.h` | Internal `cmark_iter` struct with `cmark_iter_state` (current + next event/node pairs) |
| `iterator.c` | Iterator implementation — `cmark_iter_new()`, `cmark_iter_next()`, `cmark_iter_reset()`, `cmark_consolidate_text_nodes()` |
### Renderers
| File | Purpose |
|------|---------|
| `render.h` | `cmark_renderer` struct, `cmark_escaping` enum (`LITERAL`, `NORMAL`, `TITLE`, `URL`) |
| `render.c` | Generic render framework — line wrapping, prefix management, `cmark_render()` dispatch loop |
| `html.c` | HTML renderer — `cmark_render_html()`, direct strbuf-based output, no render framework |
| `xml.c` | XML renderer — `cmark_render_xml()`, direct strbuf-based output with CommonMark DTD |
| `latex.c` | LaTeX renderer — `cmark_render_latex()`, uses render framework |
| `man.c` | groff man renderer — `cmark_render_man()`, uses render framework |
| `commonmark.c` | CommonMark renderer — `cmark_render_commonmark()`, uses render framework |
### Text Processing and Utilities
| File | Purpose |
|------|---------|
| `buffer.h` / `buffer.c` | `cmark_strbuf` — growable byte buffer with amortized O(1) append |
| `chunk.h` | `cmark_chunk` — lightweight non-owning string slice (pointer + length) |
| `utf8.h` / `utf8.c` | UTF-8 iteration, validation, encoding, case folding, Unicode property queries |
| `references.h` / `references.c` | Link reference definition storage and lookup (sorted array with binary search) |
| `scanners.h` / `scanners.c` | re2c-generated scanner functions for recognizing Markdown syntax patterns |
| `scanners.re` | re2c source for scanner generation |
| `cmark_ctype.h` / `cmark_ctype.c` | Locale-independent `cmark_isspace()`, `cmark_ispunct()`, `cmark_isdigit()`, `cmark_isalpha()` |
| `houdini.h` | HTML/URL escaping and unescaping API |
| `houdini_html_e.c` | HTML entity escaping |
| `houdini_html_u.c` | HTML entity unescaping |
| `houdini_href_e.c` | URL/href percent-encoding |
| `entities.inc` | HTML entity name-to-codepoint lookup table |
| `case_fold.inc` | Unicode case folding table for reference normalization |
## The Simple Interface
The simplest way to use cmark is a single function call defined in `cmark.c`:
```c
char *cmark_markdown_to_html(const char *text, size_t len, int options);
```
Internally, this calls `cmark_parse_document()` to build the AST, then `cmark_render_html()` to produce the output, and finally frees the document node. The caller is responsible for freeing the returned string.
The implementation in `cmark.c`:
```c
char *cmark_markdown_to_html(const char *text, size_t len, int options) {
  cmark_node *doc;
  char *result;

  doc = cmark_parse_document(text, len, options);
  result = cmark_render_html(doc, options);
  cmark_node_free(doc);

  return result;
}
```
## The Streaming Interface
For large documents or streaming input, cmark provides an incremental parsing API:
```c
// `fp` is assumed to be a FILE * already opened for reading
char buffer[4096];
size_t bytes;
cmark_parser *parser = cmark_parser_new(CMARK_OPT_DEFAULT);

// Feed chunks of data as they arrive
while ((bytes = fread(buffer, 1, sizeof(buffer), fp)) > 0) {
  cmark_parser_feed(parser, buffer, bytes);
}

// Finalize and get the AST
cmark_node *document = cmark_parser_finish(parser);
cmark_parser_free(parser);

// Render to any format (the third argument, where present, is the wrap width)
char *html = cmark_render_html(document, CMARK_OPT_DEFAULT);
char *xml = cmark_render_xml(document, CMARK_OPT_DEFAULT);
char *man = cmark_render_man(document, CMARK_OPT_DEFAULT, 72);
char *tex = cmark_render_latex(document, CMARK_OPT_DEFAULT, 80);
char *cm = cmark_render_commonmark(document, CMARK_OPT_DEFAULT, 0);

// Cleanup: the caller owns every rendered string (free() matches the default allocator)
free(html);
free(xml);
free(man);
free(tex);
free(cm);
cmark_node_free(document);
```
The parser accumulates input in an internal line buffer (`parser->linebuf`) and processes complete lines as they become available. The `S_parser_feed()` function in `blocks.c` scans for line-ending characters (`\n`, `\r`) and dispatches each complete line to `S_process_line()`.
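The buffering behaviour described above can be illustrated without cmark itself. The following is a simplified sketch, not the code in `blocks.c`: `demo_feeder`, `demo_feed()`, and `demo_process_line()` are hypothetical names, the fixed 256-byte buffer is an assumption made to keep the example short, and overlong lines are simply truncated.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the parser's line buffer; not cmark's struct. */
typedef struct {
  char pending[256];   /* bytes of the current, still incomplete line */
  size_t pending_len;
  int last_was_cr;     /* set when the previous byte was '\r' */
  int lines_seen;      /* number of complete lines dispatched so far */
  char last_line[256]; /* copy of the most recent complete line */
} demo_feeder;

/* Stand-in for per-line processing: record the line and reset the buffer. */
static void demo_process_line(demo_feeder *f) {
  memcpy(f->last_line, f->pending, f->pending_len);
  f->last_line[f->pending_len] = '\0';
  f->lines_seen++;
  f->pending_len = 0;
}

/* Accept an arbitrary chunk; dispatch a line whenever '\n' or '\r' is seen. */
static void demo_feed(demo_feeder *f, const char *data, size_t len) {
  for (size_t i = 0; i < len; i++) {
    char c = data[i];
    if (f->last_was_cr) {
      f->last_was_cr = 0;
      if (c == '\n')
        continue; /* second half of "\r\n", possibly split across chunks */
    }
    if (c == '\n' || c == '\r') {
      demo_process_line(f);
      f->last_was_cr = (c == '\r');
    } else if (f->pending_len + 1 < sizeof(f->pending)) {
      f->pending[f->pending_len++] = c; /* bytes beyond the buffer are dropped */
    }
  }
}
```

Feeding `"# Ti"` and then `"tle\r"` to a zero-initialised `demo_feeder` yields one complete line, `# Title`, only after the second call. This is the property that lets a caller split input at arbitrary byte boundaries.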
## Node Type Taxonomy
cmark defines 21 values in the `cmark_node_type` enum: the 20 node types tabulated below plus `CMARK_NODE_NONE`, which serves as an error status:
### Block Nodes (container and leaf)
| Enum Value | Type String | Container? | Accepts Lines? | Contains Inlines? |
|-----------|-------------|------------|---------------|-------------------|
| `CMARK_NODE_DOCUMENT` | "document" | Yes | No | No |
| `CMARK_NODE_BLOCK_QUOTE` | "block_quote" | Yes | No | No |
| `CMARK_NODE_LIST` | "list" | Yes (items only) | No | No |
| `CMARK_NODE_ITEM` | "item" | Yes | No | No |
| `CMARK_NODE_CODE_BLOCK` | "code_block" | No (leaf) | Yes | No |
| `CMARK_NODE_HTML_BLOCK` | "html_block" | No (leaf) | No | No |
| `CMARK_NODE_CUSTOM_BLOCK` | "custom_block" | Yes | No | No |
| `CMARK_NODE_PARAGRAPH` | "paragraph" | No | Yes | Yes |
| `CMARK_NODE_HEADING` | "heading" | No | Yes | Yes |
| `CMARK_NODE_THEMATIC_BREAK` | "thematic_break" | No (leaf) | No | No |
### Inline Nodes
| Enum Value | Type String | Leaf? |
|-----------|-------------|-------|
| `CMARK_NODE_TEXT` | "text" | Yes |
| `CMARK_NODE_SOFTBREAK` | "softbreak" | Yes |
| `CMARK_NODE_LINEBREAK` | "linebreak" | Yes |
| `CMARK_NODE_CODE` | "code" | Yes |
| `CMARK_NODE_HTML_INLINE` | "html_inline" | Yes |
| `CMARK_NODE_CUSTOM_INLINE` | "custom_inline" | No |
| `CMARK_NODE_EMPH` | "emph" | No |
| `CMARK_NODE_STRONG` | "strong" | No |
| `CMARK_NODE_LINK` | "link" | No |
| `CMARK_NODE_IMAGE` | "image" | No |
Range sentinels are also defined for classification:
- `CMARK_NODE_FIRST_BLOCK = CMARK_NODE_DOCUMENT`
- `CMARK_NODE_LAST_BLOCK = CMARK_NODE_THEMATIC_BREAK`
- `CMARK_NODE_FIRST_INLINE = CMARK_NODE_TEXT`
- `CMARK_NODE_LAST_INLINE = CMARK_NODE_IMAGE`
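
Because block and inline types occupy contiguous ranges, classifying a node takes two comparisons against the sentinels. A minimal sketch, using a hypothetical `demo_node_type` enum that only mirrors the ordering of the tables above (the real definitions live in `cmark.h`) and hypothetical helpers `demo_is_block()` and `demo_is_inline()`:

```c
#include <stdbool.h>

/* Hypothetical miniature of the enum layout. The ordering follows the tables
   above; the leading NONE value is included so that 0 is neither range. */
typedef enum {
  DEMO_NODE_NONE,
  /* Block nodes */
  DEMO_NODE_DOCUMENT,
  DEMO_NODE_BLOCK_QUOTE,
  DEMO_NODE_LIST,
  DEMO_NODE_ITEM,
  DEMO_NODE_CODE_BLOCK,
  DEMO_NODE_HTML_BLOCK,
  DEMO_NODE_CUSTOM_BLOCK,
  DEMO_NODE_PARAGRAPH,
  DEMO_NODE_HEADING,
  DEMO_NODE_THEMATIC_BREAK,
  /* Inline nodes */
  DEMO_NODE_TEXT,
  DEMO_NODE_SOFTBREAK,
  DEMO_NODE_LINEBREAK,
  DEMO_NODE_CODE,
  DEMO_NODE_HTML_INLINE,
  DEMO_NODE_CUSTOM_INLINE,
  DEMO_NODE_EMPH,
  DEMO_NODE_STRONG,
  DEMO_NODE_LINK,
  DEMO_NODE_IMAGE,
  /* Range sentinels alias the first and last member of each group */
  DEMO_NODE_FIRST_BLOCK = DEMO_NODE_DOCUMENT,
  DEMO_NODE_LAST_BLOCK = DEMO_NODE_THEMATIC_BREAK,
  DEMO_NODE_FIRST_INLINE = DEMO_NODE_TEXT,
  DEMO_NODE_LAST_INLINE = DEMO_NODE_IMAGE
} demo_node_type;

/* True when `t` falls inside the block range. */
static bool demo_is_block(demo_node_type t) {
  return t >= DEMO_NODE_FIRST_BLOCK && t <= DEMO_NODE_LAST_BLOCK;
}

/* True when `t` falls inside the inline range. */
static bool demo_is_inline(demo_node_type t) {
  return t >= DEMO_NODE_FIRST_INLINE && t <= DEMO_NODE_LAST_INLINE;
}
```

The range check keeps working if a type is added inside a group, provided the sentinels are updated along with it.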
## Option Flags
Options are passed as a bitmask integer to parsing and rendering functions:
```c
#define CMARK_OPT_DEFAULT 0
#define CMARK_OPT_SOURCEPOS (1 << 1) // Add data-sourcepos attributes
#define CMARK_OPT_HARDBREAKS (1 << 2) // Render softbreaks as hard breaks
#define CMARK_OPT_SAFE (1 << 3) // Legacy (now default behavior)
#define CMARK_OPT_NOBREAKS (1 << 4) // Render softbreaks as spaces
#define CMARK_OPT_NORMALIZE (1 << 8) // Legacy (no effect)
#define CMARK_OPT_VALIDATE_UTF8 (1 << 9) // Validate UTF-8 input
#define CMARK_OPT_SMART (1 << 10) // Smart quotes and dashes
#define CMARK_OPT_UNSAFE (1 << 17) // Allow raw HTML and dangerous URLs
```
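Since each option occupies its own bit, flags combine with bitwise OR and are tested with bitwise AND. The sketch below copies a few values from the listing above so that it builds without `cmark.h`; `demo_build_options()` and `demo_has_option()` are hypothetical helpers, not part of the cmark API.

```c
#include <stdbool.h>

/* Values copied from the listing above so the sketch builds without cmark.h. */
#define CMARK_OPT_DEFAULT 0
#define CMARK_OPT_SOURCEPOS (1 << 1)
#define CMARK_OPT_HARDBREAKS (1 << 2)
#define CMARK_OPT_SMART (1 << 10)

/* Hypothetical helper: true when every bit of `flag` is set in `options`. */
static bool demo_has_option(int options, int flag) {
  return (options & flag) == flag;
}

/* Hypothetical helper: assemble a mask from individual switches. */
static int demo_build_options(bool sourcepos, bool smart) {
  int options = CMARK_OPT_DEFAULT;
  if (sourcepos)
    options |= CMARK_OPT_SOURCEPOS;
  if (smart)
    options |= CMARK_OPT_SMART;
  return options;
}
```

A caller would typically build the mask once and pass the same value to both the parsing and the rendering call, as `cmark_markdown_to_html()` does with its `options` argument.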
## Memory Management Model
cmark uses a pluggable memory allocator defined by the `cmark_mem` struct:
```c
typedef struct cmark_mem {
  void *(*calloc)(size_t, size_t);
  void *(*realloc)(void *, size_t);
  void (*free)(void *);
} cmark_mem;
```
The default allocator in `cmark.c` wraps standard `calloc`/`realloc`/`free` with abort-on-NULL safety checks (`xcalloc`, `xrealloc`). Every node stores a pointer to the allocator it was created with (`node->mem`), ensuring consistent allocation/deallocation throughout the tree.
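To show what a custom allocator of this shape might look like, here is a sketch that aborts on failure and also counts live allocations. It is an illustration under assumptions, not cmark's implementation: `demo_mem` is a local mirror of the struct above so the example builds without `cmark.h`, and `demo_xcalloc`, `demo_xrealloc`, `demo_xfree`, and `demo_live_allocations` are hypothetical names.

```c
#include <stdio.h>
#include <stdlib.h>

/* Local mirror of the struct shown above so the sketch builds without cmark.h. */
typedef struct demo_mem {
  void *(*calloc)(size_t, size_t);
  void *(*realloc)(void *, size_t);
  void (*free)(void *);
} demo_mem;

/* Number of blocks handed out and not yet freed. */
static size_t demo_live_allocations = 0;

/* Zeroed allocation that aborts instead of returning NULL. */
static void *demo_xcalloc(size_t nmem, size_t size) {
  void *ptr = calloc(nmem, size);
  if (!ptr) {
    fprintf(stderr, "[demo] calloc returned null pointer, aborting\n");
    abort();
  }
  demo_live_allocations++;
  return ptr;
}

/* Resize that aborts on failure; assumes size > 0. */
static void *demo_xrealloc(void *ptr, size_t size) {
  void *new_ptr = realloc(ptr, size);
  if (!new_ptr) {
    fprintf(stderr, "[demo] realloc returned null pointer, aborting\n");
    abort();
  }
  if (!ptr)
    demo_live_allocations++; /* realloc(NULL, n) creates a new block */
  return new_ptr;
}

/* Release a block and update the live count. */
static void demo_xfree(void *ptr) {
  if (ptr)
    demo_live_allocations--;
  free(ptr);
}

static demo_mem DEMO_ALLOCATOR = {demo_xcalloc, demo_xrealloc, demo_xfree};
```

With the real library, a struct of the same shape (typed as `cmark_mem`) would be passed to `cmark_parser_new_with_mem()`. A counter like this one is a cheap way to check that every node and rendered string has been released.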
## Version Information
Runtime version queries:
```c
int cmark_version(void); // Returns CMARK_VERSION as integer (0xMMmmpp)
const char *cmark_version_string(void); // Returns CMARK_VERSION_STRING
```
The version is encoded as a 24-bit integer where bits 16–23 are major, 8–15 are minor, and 0–7 are patch. For example, `0x001F02` represents version 0.31.2.
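The packing can be checked with a few shifts and masks. The helpers below are hypothetical (cmark exposes only the packed integer and the version string); they show how the three fields could be extracted from a `cmark_version()` result or reassembled for a comparison.

```c
/* Hypothetical helpers for the 0xMMmmpp layout described above. */

/* Bits 16-23: major version. */
static int demo_version_major(int v) { return (v >> 16) & 0xFF; }

/* Bits 8-15: minor version. */
static int demo_version_minor(int v) { return (v >> 8) & 0xFF; }

/* Bits 0-7: patch level. */
static int demo_version_patch(int v) { return v & 0xFF; }

/* Pack three components (each assumed to be in 0..255) into one integer. */
static int demo_version_pack(int major, int minor, int patch) {
  return (major << 16) | (minor << 8) | patch;
}
```

Because the fields are ordered from most to least significant, packed versions compare correctly with ordinary integer comparison, for example `cmark_version() >= demo_version_pack(0, 31, 0)`.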
## Backwards Compatibility Aliases
For code written against older cmark API versions, these aliases are provided:
```c
#define CMARK_NODE_HEADER CMARK_NODE_HEADING
#define CMARK_NODE_HRULE CMARK_NODE_THEMATIC_BREAK
#define CMARK_NODE_HTML CMARK_NODE_HTML_BLOCK
#define CMARK_NODE_INLINE_HTML CMARK_NODE_HTML_INLINE
```
Short-name aliases (without the `CMARK_` prefix) are also available unless `CMARK_NO_SHORT_NAMES` is defined:
```c
#define NODE_DOCUMENT CMARK_NODE_DOCUMENT
#define NODE_PARAGRAPH CMARK_NODE_PARAGRAPH
#define BULLET_LIST CMARK_BULLET_LIST
// ... and many more
```
## Cross-References
- [architecture.md](architecture.md) — Detailed two-phase parsing pipeline, module dependency graph
- [public-api.md](public-api.md) — Complete public API reference with all function signatures
- [ast-node-system.md](ast-node-system.md) — Internal `cmark_node` struct, type-specific unions, tree operations
- [block-parsing.md](block-parsing.md) — `blocks.c` line-by-line analysis, open block tracking, finalization
- [inline-parsing.md](inline-parsing.md) — `inlines.c` delimiter algorithm, bracket stack, backtick scanning
- [iterator-system.md](iterator-system.md) — AST traversal with enter/exit events
- [html-renderer.md](html-renderer.md) — HTML output with escaping and source position
- [xml-renderer.md](xml-renderer.md) — XML output with CommonMark DTD
- [latex-renderer.md](latex-renderer.md) — LaTeX output via render framework
- [man-renderer.md](man-renderer.md) — groff man page output
- [commonmark-renderer.md](commonmark-renderer.md) — Round-trip CommonMark output
- [render-framework.md](render-framework.md) — Shared `cmark_render()` engine for text-based renderers
- [memory-management.md](memory-management.md) — Allocator model, buffer growth, node freeing
- [utf8-handling.md](utf8-handling.md) — UTF-8 validation, iteration, case folding
- [reference-system.md](reference-system.md) — Link reference definitions storage and resolution
- [scanner-system.md](scanner-system.md) — re2c-generated pattern matching
- [building.md](building.md) — CMake build configuration and options
- [cli-usage.md](cli-usage.md) — Command-line tool usage
- [testing.md](testing.md) — Test infrastructure (spec tests, API tests, fuzzing)
- [code-style.md](code-style.md) — Coding conventions and naming patterns