summaryrefslogtreecommitdiff
path: root/docs/handbook/cmark/inline-parsing.md
diff options
context:
space:
mode:
Diffstat (limited to 'docs/handbook/cmark/inline-parsing.md')
-rw-r--r--docs/handbook/cmark/inline-parsing.md317
1 files changed, 317 insertions, 0 deletions
diff --git a/docs/handbook/cmark/inline-parsing.md b/docs/handbook/cmark/inline-parsing.md
new file mode 100644
index 0000000000..4485017305
--- /dev/null
+++ b/docs/handbook/cmark/inline-parsing.md
@@ -0,0 +1,317 @@
+# cmark — Inline Parsing
+
+## Overview
+
+Inline parsing is Phase 2 of cmark's pipeline. Implemented in `inlines.c`, it processes the text content of paragraph and heading nodes, recognizing emphasis (`*`, `_`), code spans (`` ` ``), links (`[text](url)`), images (`![alt](url)`), autolinks (`<url>`), raw HTML inline, hard line breaks, soft line breaks, and smart punctuation.
+
+The entry point is `cmark_parse_inlines()`, called from `process_inlines()` in `blocks.c` after all block structure has been finalized.
+
+## The `subject` Struct
+
+All inline parsing state is tracked in the `subject` struct:
+
+```c
+typedef struct {
+ cmark_mem *mem; // Memory allocator
+ cmark_chunk input; // The text being parsed
+ unsigned flags; // Skip flags for HTML constructs
+ int line; // Source line number
+ bufsize_t pos; // Current byte position in input
+ int block_offset; // Column offset of the containing block
+ int column_offset; // Adjustment for multi-line source position tracking
+ cmark_reference_map *refmap; // Link reference definitions
+ delimiter *last_delim; // Top of delimiter stack (linked list, newest first)
+ bracket *last_bracket; // Top of bracket stack (linked list, newest first)
+ bufsize_t backticks[MAXBACKTICKS + 1]; // Cached positions of backtick sequences
+ bool scanned_for_backticks; // Whether the full input has been scanned for backticks
+ bool no_link_openers; // Optimization: set when no link openers remain
+} subject;
+```
+
+`MAXBACKTICKS` is defined as 1000. The `backticks` array caches the positions of backtick sequences of each length, enabling O(1) lookup once the input has been fully scanned.
+
+### Skip Flags
+
+The `flags` field uses bit flags to track which HTML constructs have been confirmed absent:
+
+```c
+#define FLAG_SKIP_HTML_CDATA (1u << 0)
+#define FLAG_SKIP_HTML_DECLARATION (1u << 1)
+#define FLAG_SKIP_HTML_PI (1u << 2)
+#define FLAG_SKIP_HTML_COMMENT (1u << 3)
+```
+
+Once a scan for a particular HTML construct fails, the flag is set to avoid rescanning.
+
+## The Delimiter Stack
+
+Emphasis and smart punctuation use a delimiter stack. Each entry is:
+
+```c
+typedef struct delimiter {
+ struct delimiter *previous; // Link to older delimiter
+ struct delimiter *next; // Link to newer delimiter (towards top)
+ cmark_node *inl_text; // The text node created for this delimiter run
+ bufsize_t position; // Position in the input
+ bufsize_t length; // Number of delimiter characters remaining
+ unsigned char delim_char; // '*', '_', '\'', or '"'
+ bool can_open; // Whether this run can open emphasis
+ bool can_close; // Whether this run can close emphasis
+} delimiter;
+```
+
+The stack is a doubly-linked list with `last_delim` pointing to the newest entry.
+
+## The Bracket Stack
+
+Links and images use a separate bracket stack:
+
+```c
+typedef struct bracket {
+ struct bracket *previous; // Link to older bracket
+ cmark_node *inl_text; // The text node for '[' or '!['
+ bufsize_t position; // Position in the input
+ bool image; // Whether this is an image opener '!['
+ bool active; // Can still match (set to false when deactivated)
+ bool bracket_after; // Whether a '[' appeared after this bracket
+} bracket;
+```
+
+Brackets are deactivated (set `active = false`) when:
+- A matching `]` fails to produce a valid link (the opener is deactivated to prevent infinite loops)
+- An inner link is formed (outer brackets are deactivated per spec)
+
+## Emphasis Flanking Rules: `scan_delims()`
+
+```c
+static int scan_delims(subject *subj, unsigned char c, bool *can_open,
+ bool *can_close);
+```
+
+This function determines whether a run of `*`, `_`, `'`, or `"` characters can open and/or close emphasis, following the CommonMark spec's Unicode-aware flanking rules:
+
+1. The function looks at the character **before** the run and the character **after** the run.
+2. It uses `cmark_utf8proc_iterate()` to decode the surrounding characters as full Unicode code points.
+3. It classifies them using `cmark_utf8proc_is_space()` and `cmark_utf8proc_is_punctuation_or_symbol()`.
+
+The flanking rules:
+- **Left-flanking**: numdelims > 0, character after is not a space, AND (character after is not punctuation OR character before is a space or punctuation)
+- **Right-flanking**: numdelims > 0, character before is not a space, AND (character before is not punctuation OR character after is a space or punctuation)
+
+For `*`: `can_open = left_flanking`, `can_close = right_flanking`
+
+For `_`:
+```c
+*can_open = left_flanking &&
+ (!right_flanking || cmark_utf8proc_is_punctuation_or_symbol(before_char));
+*can_close = right_flanking &&
+ (!left_flanking || cmark_utf8proc_is_punctuation_or_symbol(after_char));
+```
+
+For `'` and `"` (smart punctuation):
+```c
+*can_open = left_flanking &&
+ (!right_flanking || before_char == '(' || before_char == '[') &&
+ before_char != ']' && before_char != ')';
+*can_close = right_flanking;
+```
+
+The function advances `subj->pos` past the delimiter run and returns the number of delimiter characters consumed. For quotes, only 1 delimiter is consumed regardless of how many appear.
+
+## Emphasis Resolution: `S_insert_emph()`
+
+```c
+static delimiter *S_insert_emph(subject *subj, delimiter *opener,
+ delimiter *closer);
+```
+
+When a closing delimiter is found that matches an opener on the stack, this function creates emphasis nodes:
+
+1. If the opener and closer have combined length >= 2 AND both have individual length >= 2, create a `CMARK_NODE_STRONG` node (consuming 2 characters from each).
+2. Otherwise, create a `CMARK_NODE_EMPH` node (consuming 1 character from each).
+3. All inline nodes between the opener and closer are moved to become children of the new emphasis node.
+4. Any delimiters between the opener and closer are removed from the stack.
+5. If the opener is exhausted (`length == 0`), it's removed from the stack.
+6. If the closer is exhausted, it's removed too; otherwise, processing continues.
+
+## Code Span Parsing: `handle_backticks()`
+
+```c
+static cmark_node *handle_backticks(subject *subj, int options);
+```
+
+When a backtick is encountered:
+
+1. `take_while(subj, isbacktick)` consumes the opening backtick run and records its length.
+2. `scan_to_closing_backticks()` searches forward for a matching backtick run of the same length.
+
+The scanning function uses the `subj->backticks[]` array to cache positions of backtick sequences. If `subj->scanned_for_backticks` is true and the cached position for the needed length is behind the current position, it immediately returns 0 (no match).
+
+If no closing backticks are found, the opening run is emitted as literal text. If found, the content between is extracted, normalized via `S_normalize_code()`:
+
+```c
+static void S_normalize_code(cmark_strbuf *s) {
+ // 1. Convert \r\n and \r to spaces
+ // 2. Convert \n to spaces
+ // 3. If content begins and ends with a space and contains non-space chars,
+ // strip one leading and one trailing space
+}
+```
+
+## Link Parsing
+
+When `]` is encountered after an opener on the bracket stack:
+
+### Inline Links: `[text](url "title")`
+
+The parser looks for `(` immediately after `]`, then:
+1. Skips optional whitespace
+2. Tries to parse a link destination (URL)
+3. Skips optional whitespace
+4. Optionally parses a link title (in single quotes, double quotes, or parentheses)
+5. Expects `)`
+
+### Reference Links: `[text][ref]` or `[text][]` or `[text]`
+
+If the inline link syntax doesn't match, the parser tries:
+1. `[text][ref]` — explicit reference
+2. `[text][]` — collapsed reference (label = text)
+3. `[text]` — shortcut reference (label = text)
+
+Reference lookup uses `cmark_reference_lookup()` against the parser's `refmap`.
+
+### URL Cleaning
+
+```c
+unsigned char *cmark_clean_url(cmark_mem *mem, cmark_chunk *url);
+```
+
+Trims the URL, unescapes HTML entities, and handles angle-bracket-delimited URLs.
+
+### Autolinks
+
+```c
+static inline cmark_node *make_autolink(subject *subj, int start_column,
+ int end_column, cmark_chunk url,
+ int is_email);
+```
+
+Autolinks (`<http://example.com>` or `<user@example.com>`) are detected via the `scan_autolink_uri()` and `scan_autolink_email()` scanner functions. Email autolinks have `mailto:` prepended to the URL automatically:
+
+```c
+static unsigned char *cmark_clean_autolink(cmark_mem *mem, cmark_chunk *url,
+ int is_email) {
+ cmark_strbuf buf = CMARK_BUF_INIT(mem);
+ cmark_chunk_trim(url);
+ if (is_email)
+ cmark_strbuf_puts(&buf, "mailto:");
+ houdini_unescape_html_f(&buf, url->data, url->len);
+ return cmark_strbuf_detach(&buf);
+}
+```
+
+## Smart Punctuation
+
+When `CMARK_OPT_SMART` is enabled, the inline parser transforms:
+
+```c
+static const char *EMDASH = "\xE2\x80\x94"; // —
+static const char *ENDASH = "\xE2\x80\x93"; // –
+static const char *ELLIPSES = "\xE2\x80\xA6"; // …
+static const char *LEFTDOUBLEQUOTE = "\xE2\x80\x9C"; // "
+static const char *RIGHTDOUBLEQUOTE = "\xE2\x80\x9D"; // "
+static const char *LEFTSINGLEQUOTE = "\xE2\x80\x98"; // '
+static const char *RIGHTSINGLEQUOTE = "\xE2\x80\x99"; // '
+```
+
+- `---` becomes em dash (—)
+- `--` becomes en dash (–)
+- `...` becomes ellipsis (…)
+- `'` and `"` are converted to curly quotes using the delimiter stack (open/close logic)
+
+## Hard and Soft Line Breaks
+
+- **Hard line break**: Two or more spaces before a line ending, or a backslash before a line ending. Creates a `CMARK_NODE_LINEBREAK` node.
+- **Soft line break**: A line ending not preceded by spaces or backslash. Creates a `CMARK_NODE_SOFTBREAK` node.
+
+## Special Character Dispatch
+
+```c
+static bufsize_t subject_find_special_char(subject *subj, int options);
+```
+
+This function scans forward from `subj->pos` looking for the next special character that needs inline processing. Special characters include:
+- Line endings (`\r`, `\n`)
+- Backtick (`` ` ``)
+- Backslash (`\`)
+- Ampersand (`&`)
+- Less-than (`<`)
+- Open bracket (`[`)
+- Close bracket (`]`)
+- Exclamation mark (`!`)
+- Emphasis characters (`*`, `_`)
+
+Any text between special characters is collected as a `CMARK_NODE_TEXT` node.
+
+## Source Position Tracking
+
+```c
+static void adjust_subj_node_newlines(subject *subj, cmark_node *node,
+ int matchlen, int extra, int options);
+```
+
+When `CMARK_OPT_SOURCEPOS` is enabled, this function adjusts source positions for multi-line inline constructs. It counts newlines in the just-matched span and updates:
+- `subj->line` — incremented by the number of newlines
+- `node->end_line` — adjusted for multi-line spans
+- `node->end_column` — set to characters after the last newline
+- `subj->column_offset` — adjusted for correct subsequent position calculations
+
+## Inline Node Factory Functions
+
+The inline parser uses efficient factory functions:
+
+```c
+// Macros for simple nodes
+#define make_linebreak(mem) make_simple(mem, CMARK_NODE_LINEBREAK)
+#define make_softbreak(mem) make_simple(mem, CMARK_NODE_SOFTBREAK)
+#define make_emph(mem) make_simple(mem, CMARK_NODE_EMPH)
+#define make_strong(mem) make_simple(mem, CMARK_NODE_STRONG)
+```
+
+```c
+// Fast child appending (bypasses S_can_contain validation)
+static void append_child(cmark_node *node, cmark_node *child) {
+ cmark_node *old_last_child = node->last_child;
+ child->next = NULL;
+ child->prev = old_last_child;
+ child->parent = node;
+ node->last_child = child;
+ if (old_last_child) {
+ old_last_child->next = child;
+ } else {
+ node->first_child = child;
+ }
+}
+```
+
+This `append_child()` is a simplified version of the public `cmark_node_append_child()`, skipping containership validation since the inline parser always produces valid structures.
+
+## The Main Parse Loop
+
+```c
+void cmark_parse_inlines(cmark_mem *mem, cmark_node *parent,
+ cmark_reference_map *refmap, int options);
+```
+
+This function initializes a `subject` from the parent node's `data` field, then repeatedly calls `parse_inline()` until the input is exhausted. Each call to `parse_inline()` finds the next special character, emits any preceding text as a `CMARK_NODE_TEXT`, and dispatches to the appropriate handler.
+
+After all characters are processed, the delimiter stack is processed to resolve any remaining emphasis, and then cleaned up.
+
+## Cross-References
+
+- [inlines.c](../../cmark/src/inlines.c) — Full implementation
+- [inlines.h](../../cmark/src/inlines.h) — Internal API declarations
+- [block-parsing.md](block-parsing.md) — Phase 1 that produces the input for inline parsing
+- [reference-system.md](reference-system.md) — How link references are stored and looked up
+- [scanner-system.md](scanner-system.md) — Scanner functions for HTML tags, autolinks, etc.
+- [utf8-handling.md](utf8-handling.md) — Unicode character classification for flanking rules