docs/handbook/cmark/inline-parsing.md



# cmark — Inline Parsing

## Overview

Inline parsing is Phase 2 of cmark's pipeline. Implemented in `inlines.c`, it processes the text content of paragraph and heading nodes, recognizing emphasis (`*`, `_`), code spans (`` ` ``), links (`[text](url)`), images (`![alt](url)`), autolinks (`<url>`), raw inline HTML, hard and soft line breaks, and (optionally) smart punctuation.

The entry point is `cmark_parse_inlines()`, called from `process_inlines()` in `blocks.c` after all block structure has been finalized.

## The `subject` Struct

All inline parsing state is tracked in the `subject` struct:

```c
typedef struct {
  cmark_mem *mem;                          // Memory allocator
  cmark_chunk input;                       // The text being parsed
  unsigned flags;                          // Skip flags for HTML constructs
  int line;                                // Source line number
  bufsize_t pos;                           // Current byte position in input
  int block_offset;                        // Column offset of the containing block
  int column_offset;                       // Adjustment for multi-line source position tracking
  cmark_reference_map *refmap;             // Link reference definitions
  delimiter *last_delim;                   // Top of delimiter stack (linked list, newest first)
  bracket *last_bracket;                   // Top of bracket stack (linked list, newest first)
  bufsize_t backticks[MAXBACKTICKS + 1];   // Cached positions of backtick sequences
  bool scanned_for_backticks;              // Whether the full input has been scanned for backticks
  bool no_link_openers;                    // Optimization: set when no link openers remain
} subject;
```

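`MAXBACKTICKS` is defined as 1000. For each run length, the `backticks` array caches the position of the most recently seen backtick run of that length. Once the input has been fully scanned, the parser can decide in O(1) whether a closing run of a given length still lies ahead.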

### Skip Flags

The `flags` field uses bit flags to track which HTML constructs have been confirmed absent:

```c
#define FLAG_SKIP_HTML_CDATA        (1u << 0)
#define FLAG_SKIP_HTML_DECLARATION  (1u << 1)
#define FLAG_SKIP_HTML_PI           (1u << 2)
#define FLAG_SKIP_HTML_COMMENT      (1u << 3)
```

Once a scan for a particular HTML construct fails, the flag is set to avoid rescanning.
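The skip-flag pattern can be sketched in isolation. In the following minimal example, `skip_subject` and `skip_has_comment_close()` are hypothetical stand-ins written for illustration, not cmark code; only the flag macro mirrors the definition above. The sketch assumes positions only move forward, so a failed search from one position stays valid for every later position.

```c
#include <stdbool.h>
#include <string.h>

#define FLAG_SKIP_HTML_COMMENT (1u << 3)

/* Hypothetical, stripped-down stand-in for cmark's subject. */
typedef struct {
  const char *input;
  unsigned flags;
  int scans_performed; /* instrumentation for this example only */
} skip_subject;

/* Returns true if a comment terminator "-->" exists at or after pos.
 * After the first failed search, the flag short-circuits later calls. */
static bool skip_has_comment_close(skip_subject *subj, size_t pos) {
  if (subj->flags & FLAG_SKIP_HTML_COMMENT)
    return false; /* already known to be absent */
  subj->scans_performed++;
  if (strstr(subj->input + pos, "-->") != NULL)
    return true;
  subj->flags |= FLAG_SKIP_HTML_COMMENT; /* remember the failure */
  return false;
}
```

With an input that contains several `<!--` openers but no terminator, only the first call pays for a full scan; every later call returns immediately.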

## The Delimiter Stack

Emphasis and smart punctuation use a delimiter stack. Each entry is:

```c
typedef struct delimiter {
  struct delimiter *previous;    // Link to older delimiter
  struct delimiter *next;        // Link to newer delimiter (towards top)
  cmark_node *inl_text;          // The text node created for this delimiter run
  bufsize_t position;            // Position in the input
  bufsize_t length;              // Number of delimiter characters remaining
  unsigned char delim_char;      // '*', '_', '\'', or '"'
  bool can_open;                 // Whether this run can open emphasis
  bool can_close;                // Whether this run can close emphasis
} delimiter;
```

The stack is a doubly-linked list with `last_delim` pointing to the newest entry.
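To make the list discipline concrete, here is a minimal sketch of pushing and removing entries. The `toy_delim` type and both helpers are hypothetical reductions (only the two links and the delimiter character are kept), and `*top` plays the role of `subj->last_delim`. The sketch uses `calloc`/`free` rather than cmark's allocator.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reduced delimiter: only the list links and the character. */
typedef struct toy_delim {
  struct toy_delim *previous; /* older entry */
  struct toy_delim *next;     /* newer entry */
  unsigned char delim_char;
} toy_delim;

/* Push a new entry on top of the stack; returns NULL on allocation failure. */
static toy_delim *toy_push(toy_delim **top, unsigned char c) {
  toy_delim *d = calloc(1, sizeof(*d));
  if (d == NULL)
    return NULL;
  d->delim_char = c;
  d->previous = *top;
  if (*top != NULL)
    (*top)->next = d;
  *top = d;
  return d;
}

/* Unlink and free an entry from anywhere in the list. */
static void toy_remove(toy_delim **top, toy_delim *d) {
  if (d == NULL)
    return;
  if (d->next == NULL)
    *top = d->previous; /* d was the newest entry */
  else
    d->next->previous = d->previous;
  if (d->previous != NULL)
    d->previous->next = d->next;
  free(d);
}
```

The `next` link is what makes removal from the middle cheap, which matters because emphasis resolution discards delimiters that sit between a matched opener and closer.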

## The Bracket Stack

Links and images use a separate bracket stack:

```c
typedef struct bracket {
  struct bracket *previous;      // Link to older bracket
  cmark_node *inl_text;          // The text node for '[' or '!['
  bufsize_t position;            // Position in the input
  bool image;                    // Whether this is an image opener '!['
  bool active;                   // Can still match (set to false when deactivated)
  bool bracket_after;            // Whether a '[' appeared after this bracket
} bracket;
```

Brackets are deactivated (`active` set to `false`) when:
- A matching `]` fails to produce a valid link (the opener is deactivated to prevent infinite loops)
- An inner link is formed (earlier link openers are deactivated, because the spec does not allow links to contain other links)

## Emphasis Flanking Rules: `scan_delims()`

```c
static int scan_delims(subject *subj, unsigned char c, bool *can_open,
                       bool *can_close);
```

This function determines whether a run of `*`, `_`, `'`, or `"` characters can open and/or close emphasis, following the CommonMark spec's Unicode-aware flanking rules:

1. The function looks at the character **before** the run and the character **after** the run.
2. It uses `cmark_utf8proc_iterate()` to decode the surrounding characters as full Unicode code points.
3. It classifies them using `cmark_utf8proc_is_space()` and `cmark_utf8proc_is_punctuation_or_symbol()`.

The flanking rules:
- **Left-flanking**: numdelims > 0, character after is not a space, AND (character after is not punctuation OR character before is a space or punctuation)
- **Right-flanking**: numdelims > 0, character before is not a space, AND (character before is not punctuation OR character after is a space or punctuation)

For `*`: `can_open = left_flanking`, `can_close = right_flanking`

For `_`:
```c
*can_open = left_flanking &&
            (!right_flanking || cmark_utf8proc_is_punctuation_or_symbol(before_char));
*can_close = right_flanking &&
             (!left_flanking || cmark_utf8proc_is_punctuation_or_symbol(after_char));
```

For `'` and `"` (smart punctuation):
```c
*can_open = left_flanking &&
     (!right_flanking || before_char == '(' || before_char == '[') &&
     before_char != ']' && before_char != ')';
*can_close = right_flanking;
```

The function advances `subj->pos` past the delimiter run and returns the number of delimiter characters consumed. For quotes, only 1 delimiter is consumed regardless of how many appear.
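The rules for `*` and `_` can be tried out with an ASCII-only approximation. This is a sketch under stated assumptions: `flank_classify()` and `flank_result` are hypothetical names, `isspace()`/`ispunct()` stand in for the Unicode-aware classifiers, the delimiter run is assumed to be non-empty, and the caller is assumed to pass `'\n'` for the character before the start or after the end of the input.

```c
#include <ctype.h>
#include <stdbool.h>

typedef struct {
  bool can_open;
  bool can_close;
} flank_result;

/* ASCII-only sketch of the flanking rules for a non-empty run of c.
 * before/after are the characters on either side of the run. */
static flank_result flank_classify(unsigned char c, unsigned char before,
                                   unsigned char after) {
  bool space_before = isspace(before) != 0;
  bool space_after = isspace(after) != 0;
  bool punct_before = ispunct(before) != 0;
  bool punct_after = ispunct(after) != 0;

  bool left_flanking =
      !space_after && (!punct_after || space_before || punct_before);
  bool right_flanking =
      !space_before && (!punct_before || space_after || punct_after);

  flank_result r;
  if (c == '_') {
    /* Underscore is stricter: no intraword emphasis. */
    r.can_open = left_flanking && (!right_flanking || punct_before);
    r.can_close = right_flanking && (!left_flanking || punct_after);
  } else {
    r.can_open = left_flanking;
    r.can_close = right_flanking;
  }
  return r;
}
```

For the first `_` in `foo_bar_` (between `o` and `b`) the run is both left- and right-flanking, so it can neither open nor close; a `*` in the same position can do both.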

## Emphasis Resolution: `S_insert_emph()`

```c
static delimiter *S_insert_emph(subject *subj, delimiter *opener,
                                delimiter *closer);
```

When a closing delimiter is found that matches an opener on the stack, this function creates emphasis nodes:

1. If both the opener and the closer have length >= 2, create a `CMARK_NODE_STRONG` node (consuming 2 characters from each).
2. Otherwise, create a `CMARK_NODE_EMPH` node (consuming 1 character from each).
3. All inline nodes between the opener and closer are moved to become children of the new emphasis node.
4. Any delimiters between the opener and closer are removed from the stack.
5. If the opener is exhausted (`length == 0`), it's removed from the stack.
6. If the closer is exhausted, it's removed too; otherwise, processing continues.
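The strong-versus-regular decision in steps 1 and 2 reduces to a small piece of arithmetic on the two run lengths. The helpers below are hypothetical and model only the length bookkeeping, not node creation or stack cleanup.

```c
/* Hypothetical helper: number of delimiter characters one match consumes
 * from each side. 2 corresponds to a strong node, 1 to a regular
 * emphasis node. */
static int emph_use_delims(int opener_len, int closer_len) {
  return (opener_len >= 2 && closer_len >= 2) ? 2 : 1;
}

/* Applies one match to both run lengths and returns the amount consumed. */
static int emph_match_once(int *opener_len, int *closer_len) {
  int n = emph_use_delims(*opener_len, *closer_len);
  *opener_len -= n;
  *closer_len -= n;
  return n;
}
```

For `***foo***` the first match consumes 2 from each side (strong) and leaves 1 and 1; the second match consumes the rest (regular emphasis), which exhausts both runs.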

## Code Span Parsing: `handle_backticks()`

```c
static cmark_node *handle_backticks(subject *subj, int options);
```

When a backtick is encountered:

1. `take_while(subj, isbacktick)` consumes the opening backtick run and records its length.
2. `scan_to_closing_backticks()` searches forward for a matching backtick run of the same length.

The scanning function uses the `subj->backticks[]` array to cache positions of backtick sequences. If `subj->scanned_for_backticks` is true and the cached position for the needed length is behind the current position, it immediately returns 0 (no match).
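The caching behaviour can be sketched on a plain string. Everything here is an illustrative assumption rather than cmark's code: the `tick_subject` type, the `tick_scan_to_closing()` name, and the small `TICK_MAX` limit (cmark uses `MAXBACKTICKS`). Unlike the real function, the sketch does not advance `pos` while it scans.

```c
#include <stdbool.h>
#include <stddef.h>

#define TICK_MAX 8 /* small limit for the example */

typedef struct {
  const char *input;
  size_t len;
  size_t pos;                     /* just past the opening run */
  size_t backticks[TICK_MAX + 1]; /* start of the last seen run, per length */
  bool scanned_for_backticks;
} tick_subject;

/* Returns the position just past a closing run of open_len backticks,
 * or 0 if there is none. */
static size_t tick_scan_to_closing(tick_subject *s, size_t open_len) {
  if (open_len > TICK_MAX)
    return 0;
  if (s->scanned_for_backticks && s->backticks[open_len] <= s->pos)
    return 0; /* the cache already proves no closer lies ahead */

  size_t p = s->pos;
  while (p < s->len) {
    while (p < s->len && s->input[p] != '`')
      p++; /* skip non-backtick bytes */
    size_t start = p;
    while (p < s->len && s->input[p] == '`')
      p++; /* measure the backtick run */
    size_t run = p - start;
    if (run == 0)
      break; /* reached the end of the input */
    if (run <= TICK_MAX)
      s->backticks[run] = start; /* remember where this length occurred */
    if (run == open_len)
      return p;
  }
  s->scanned_for_backticks = true; /* whole input has now been seen */
  return 0;
}
```

After one failed full scan, later lookups for any length are answered from the array without touching the input again.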

If no closing backticks are found, the opening run is emitted as literal text. If a closing run is found, the content between the two runs is extracted and normalized via `S_normalize_code()`:

```c
static void S_normalize_code(cmark_strbuf *s) {
  // 1. Convert each \r\n pair or lone \r to a single space
  // 2. Convert each \n to a space
  // 3. If content begins and ends with a space and contains non-space chars,
  //    strip one leading and one trailing space
}
```
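The three normalization steps can be written out against a plain character buffer. This is a sketch, not the body of `S_normalize_code()`: the function name `code_normalize()` is hypothetical, it works on a `char` array instead of a `cmark_strbuf`, and it assumes the buffer has room for `len + 1` bytes so that a terminating NUL can be written.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Normalizes code span content in place and returns the new length.
 * buf must have capacity for at least len + 1 bytes. */
static size_t code_normalize(char *buf, size_t len) {
  size_t r, w = 0;
  bool contains_nonspace = false;

  for (r = 0; r < len; r++) {
    char ch = buf[r];
    if (ch == '\r') {
      /* In "\r\n" the '\r' is dropped; the '\n' yields the space. */
      if (r + 1 >= len || buf[r + 1] != '\n')
        buf[w++] = ' ';
    } else if (ch == '\n') {
      buf[w++] = ' ';
    } else {
      buf[w++] = ch;
      if (ch != ' ')
        contains_nonspace = true;
    }
  }

  /* Strip one leading and one trailing space, unless all spaces. */
  if (contains_nonspace && w >= 2 && buf[0] == ' ' && buf[w - 1] == ' ') {
    memmove(buf, buf + 1, w - 2);
    w -= 2;
  }
  buf[w] = '\0';
  return w;
}
```

The write index never overtakes the read index, so the conversion is safe to do in place.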

## Link Parsing

When `]` is encountered after an opener on the bracket stack:

### Inline Links: `[text](url "title")`

The parser looks for `(` immediately after `]`, then:
1. Skips optional whitespace
2. Tries to parse a link destination (URL)
3. Skips optional whitespace
4. Optionally parses a link title (in single quotes, double quotes, or parentheses)
5. Expects `)`

### Reference Links: `[text][ref]` or `[text][]` or `[text]`

If the inline link syntax doesn't match, the parser tries:
1. `[text][ref]` — explicit reference
2. `[text][]` — collapsed reference (label = text)
3. `[text]` — shortcut reference (label = text)

Reference lookup uses `cmark_reference_lookup()` against the parser's `refmap`.

### URL Cleaning

```c
unsigned char *cmark_clean_url(cmark_mem *mem, cmark_chunk *url);
```

This function trims the URL, unescapes HTML entities, and handles angle-bracket-delimited URLs.

### Autolinks

```c
static inline cmark_node *make_autolink(subject *subj, int start_column,
                                        int end_column, cmark_chunk url,
                                        int is_email);
```

Autolinks (`<http://example.com>` or `<user@example.com>`) are detected via the `scan_autolink_uri()` and `scan_autolink_email()` scanner functions. Email autolinks have `mailto:` prepended to the URL automatically:

```c
static unsigned char *cmark_clean_autolink(cmark_mem *mem, cmark_chunk *url,
                                           int is_email) {
  cmark_strbuf buf = CMARK_BUF_INIT(mem);
  cmark_chunk_trim(url);
  if (is_email)
    cmark_strbuf_puts(&buf, "mailto:");
  houdini_unescape_html_f(&buf, url->data, url->len);
  return cmark_strbuf_detach(&buf);
}
```

## Smart Punctuation

When `CMARK_OPT_SMART` is enabled, the inline parser replaces plain ASCII punctuation with the following UTF-8 sequences:

```c
static const char *EMDASH = "\xE2\x80\x94";           // —
static const char *ENDASH = "\xE2\x80\x93";           // –
static const char *ELLIPSES = "\xE2\x80\xA6";         // …
static const char *LEFTDOUBLEQUOTE = "\xE2\x80\x9C";  // “
static const char *RIGHTDOUBLEQUOTE = "\xE2\x80\x9D"; // ”
static const char *LEFTSINGLEQUOTE = "\xE2\x80\x98";  // ‘
static const char *RIGHTSINGLEQUOTE = "\xE2\x80\x99"; // ’
```

- `---` becomes em dash (—)
- `--` becomes en dash (–)
- `...` becomes ellipsis (…)
- `'` and `"` are converted to curly quotes using the delimiter stack (open/close logic)
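The dash and ellipsis substitutions can be illustrated with a standalone string filter. This is a simplified sketch and not how cmark structures the work: `smart_convert()` is a hypothetical name, quotes are left untouched because they need the delimiter stack, and hyphen runs longer than three are split greedily, which is an assumption of this example and may differ from cmark's handling.

```c
#include <stddef.h>
#include <string.h>

static const char *SMART_EMDASH = "\xE2\x80\x94";
static const char *SMART_ENDASH = "\xE2\x80\x93";
static const char *SMART_ELLIPSIS = "\xE2\x80\xA6";

/* Copies in to out, replacing "---", "--" and "...".
 * cap is the capacity of out including the terminating NUL.
 * Returns 0 on success, -1 if out is too small. */
static int smart_convert(const char *in, char *out, size_t cap) {
  size_t w = 0;
  if (cap == 0)
    return -1;
  while (*in != '\0') {
    const char *rep = NULL;
    size_t consumed = 1;
    if (strncmp(in, "---", 3) == 0) {
      rep = SMART_EMDASH;
      consumed = 3;
    } else if (strncmp(in, "--", 2) == 0) {
      rep = SMART_ENDASH;
      consumed = 2;
    } else if (strncmp(in, "...", 3) == 0) {
      rep = SMART_ELLIPSIS;
      consumed = 3;
    }
    size_t n = rep ? strlen(rep) : 1;
    if (w + n + 1 > cap)
      return -1; /* keep room for the NUL */
    if (rep)
      memcpy(out + w, rep, n);
    else
      out[w] = *in;
    w += n;
    in += consumed;
  }
  out[w] = '\0';
  return 0;
}
```

Checking for `---` before `--` is what keeps a triple hyphen from being read as an en dash followed by a stray hyphen.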

## Hard and Soft Line Breaks

- **Hard line break**: Two or more spaces before a line ending, or a backslash before a line ending. Creates a `CMARK_NODE_LINEBREAK` node.
- **Soft line break**: A line ending not preceded by spaces or backslash. Creates a `CMARK_NODE_SOFTBREAK` node.
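The distinction can be expressed as a check on the bytes just before the line ending. The sketch below uses hypothetical names (`break_kind`, `break_classify()`), assumes `\n` line endings only, and ignores the case where the backslash is itself escaped; it classifies a break rather than reproducing cmark's handlers.

```c
#include <stddef.h>

typedef enum { BREAK_SOFT, BREAK_HARD } break_kind;

/* nl is the index of a '\n' in text; returns the kind of break it forms. */
static break_kind break_classify(const char *text, size_t nl) {
  if (nl >= 1 && text[nl - 1] == '\\')
    return BREAK_HARD; /* backslash before the line ending */
  if (nl >= 2 && text[nl - 1] == ' ' && text[nl - 2] == ' ')
    return BREAK_HARD; /* two or more trailing spaces */
  return BREAK_SOFT;
}
```

A single trailing space is not enough: `"foo \nbar"` still yields a soft break.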

## Special Character Dispatch

```c
static bufsize_t subject_find_special_char(subject *subj, int options);
```

This function scans forward from `subj->pos` looking for the next special character that needs inline processing. Special characters include:
- Line endings (`\r`, `\n`)
- Backtick (`` ` ``)
- Backslash (`\`)
- Ampersand (`&`)
- Less-than (`<`)
- Open bracket (`[`)
- Close bracket (`]`)
- Exclamation mark (`!`)
- Emphasis characters (`*`, `_`)
- With `CMARK_OPT_SMART` enabled: quotes (`'`, `"`), hyphen (`-`), and period (`.`)

Any text between special characters is collected as a `CMARK_NODE_TEXT` node.
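A scan of this kind is commonly implemented as a 256-entry lookup table indexed by byte value. The following sketch shows the idea for the characters listed above; `special_find()` is a hypothetical name, the smart-punctuation characters are left out, and the lazily built table is a simplification chosen for brevity.

```c
#include <stddef.h>

/* Returns the index of the next special byte at or after pos,
 * or len if there is none. */
static size_t special_find(const char *input, size_t len, size_t pos) {
  static unsigned char special[256];
  static int initialized = 0;
  if (!initialized) {
    const char *chars = "\r\n`\\&<[]!*_";
    for (const char *p = chars; *p != '\0'; p++)
      special[(unsigned char)*p] = 1;
    initialized = 1;
  }
  while (pos < len && !special[(unsigned char)input[pos]])
    pos++;
  return pos;
}
```

The bytes between the starting position and the returned index form the plain text run that becomes a text node.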

## Source Position Tracking

```c
static void adjust_subj_node_newlines(subject *subj, cmark_node *node,
                                     int matchlen, int extra, int options);
```

When `CMARK_OPT_SOURCEPOS` is enabled, this function adjusts source positions for multi-line inline constructs. It counts newlines in the just-matched span and updates:
- `subj->line` — incremented by the number of newlines
- `node->end_line` — adjusted for multi-line spans
- `node->end_column` — set to characters after the last newline
- `subj->column_offset` — adjusted for correct subsequent position calculations
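The core of the adjustment is counting newlines in the matched span and measuring how many bytes follow the last one. The sketch below is a reduced model: `srcpos_state` merges the subject's line counter and the node's end position into one hypothetical struct, and the `column_offset` update is omitted.

```c
#include <stddef.h>

/* Hypothetical merged view of subj->line and the node's end position. */
typedef struct {
  int line;
  int end_line;
  int end_column;
} srcpos_state;

/* Counts '\n' bytes in input[from, from + len). When at least one is
 * found, stores the number of bytes after the last one. */
static int srcpos_count_newlines(const char *input, size_t from, size_t len,
                                 int *since_newline) {
  int newlines = 0;
  int since = 0;
  for (size_t i = 0; i < len; i++) {
    if (input[from + i] == '\n') {
      newlines++;
      since = 0;
    } else {
      since++;
    }
  }
  if (newlines > 0)
    *since_newline = since;
  return newlines;
}

/* Applies the line and column adjustments for a just-matched span. */
static void srcpos_adjust(srcpos_state *st, const char *input, size_t from,
                          size_t len) {
  int since = 0;
  int newlines = srcpos_count_newlines(input, from, len, &since);
  if (newlines > 0) {
    st->line += newlines;
    st->end_line += newlines;
    st->end_column = since;
  }
}
```

Spans without a newline leave the state untouched, so single-line constructs keep the positions they were created with.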

## Inline Node Factory Functions

The inline parser creates and links nodes through lightweight internal helpers instead of the public node API:

```c
// Macros for simple nodes
#define make_linebreak(mem) make_simple(mem, CMARK_NODE_LINEBREAK)
#define make_softbreak(mem) make_simple(mem, CMARK_NODE_SOFTBREAK)
#define make_emph(mem) make_simple(mem, CMARK_NODE_EMPH)
#define make_strong(mem) make_simple(mem, CMARK_NODE_STRONG)
```

```c
// Fast child appending (bypasses S_can_contain validation)
static void append_child(cmark_node *node, cmark_node *child) {
  cmark_node *old_last_child = node->last_child;
  child->next = NULL;
  child->prev = old_last_child;
  child->parent = node;
  node->last_child = child;
  if (old_last_child) {
    old_last_child->next = child;
  } else {
    node->first_child = child;
  }
}
```

This `append_child()` is a simplified version of the public `cmark_node_append_child()`, skipping containership validation since the inline parser always produces valid structures.

## The Main Parse Loop

```c
void cmark_parse_inlines(cmark_mem *mem, cmark_node *parent,
                         cmark_reference_map *refmap, int options);
```

This function initializes a `subject` from the parent node's `data` field, then repeatedly calls `parse_inline()` until the input is exhausted. Each call to `parse_inline()` finds the next special character, emits any preceding text as a `CMARK_NODE_TEXT`, and dispatches to the appropriate handler.

After the input is exhausted, the remaining delimiter stack is processed to resolve any outstanding emphasis, and the leftover stack entries are then freed.

## Cross-References

- [inlines.c](../../cmark/src/inlines.c) — Full implementation
- [inlines.h](../../cmark/src/inlines.h) — Internal API declarations
- [block-parsing.md](block-parsing.md) — Phase 1 that produces the input for inline parsing
- [reference-system.md](reference-system.md) — How link references are stored and looked up
- [scanner-system.md](scanner-system.md) — Scanner functions for HTML tags, autolinks, etc.
- [utf8-handling.md](utf8-handling.md) — Unicode character classification for flanking rules