# cmark — Scanner System

## Overview

The scanner system (`scanners.h`, `scanners.re`, `scanners.c`) provides fast pattern-matching functions used throughout cmark's block and inline parsers. The scanners are generated from re2c specifications and compiled into optimized C state machines. Each scanner performs a simple match anchored at a given position: there is no backtracking, and nothing is captured beyond the match length.

## Architecture

### Source Files

- `scanners.re` — re2c source with pattern specifications
- `scanners.c` — Generated C code (committed to the repository, regenerated manually)
- `scanners.h` — Public declarations (macro wrappers and function prototypes)

### Generation

Scanners are regenerated from the re2c source via:

```bash
re2c --case-insensitive -b -i --no-generation-date --8bit -o scanners.c scanners.re
```

Flags:

- `--case-insensitive` — Case-insensitive matching
- `-b` — Use bit vectors for character classes (this also makes re2c emit nested `if` statements instead of `switch`)
- `-i` — Omit `#line` directives from the generated code
- `--no-generation-date` — Reproducible output
- `--8bit` — 8-bit character width

The generated code consists of state machines implemented as nested `switch`/`if` blocks with direct character comparisons. There are no regular expression structs and no DFA tables: the patterns are compiled directly into C control flow.
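To make "compiled directly into C control flow" concrete, here is a minimal hand-written sketch in the same spirit. It is not the re2c output: the function name `sketch_atx_heading_start`, the simplified pattern (one to six `#` followed by spaces/tabs or a newline), and the assumption of NUL-terminated input are all illustrative choices made for this example.

```c
#include <stddef.h>

/* Hypothetical illustration of a direct-coded matcher; NOT generated by re2c.
 * Pattern (simplified): one to six '#', then one or more spaces/tabs or a
 * single newline. Assumes p points at NUL-terminated input.
 * Returns the match length in bytes, or 0 if there is no match at p. */
static size_t sketch_atx_heading_start(const unsigned char *p) {
  const unsigned char *start = p;
  int hashes = 0;

  /* State 1: consume between one and six '#' characters. */
  while (*p == '#' && hashes < 6) {
    p++;
    hashes++;
  }
  if (hashes == 0)
    return 0;

  /* State 2: a newline directly after the '#' run completes the match. */
  if (*p == '\n')
    return (size_t)(p + 1 - start);

  /* State 3: otherwise at least one space or tab is required. */
  if (*p != ' ' && *p != '\t')
    return 0;
  while (*p == ' ' || *p == '\t')
    p++;

  return (size_t)(p - start);
}
```

Each state is ordinary branching on the current byte, so no pattern object is built or interpreted at run time. Real re2c output expresses the same idea with generated labels and jumps rather than hand-written loops.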
## Scanner Interface

### The `_scan_at` Wrapper

```c
#define _scan_at(scanner, s, p) scanner(s->input.data, s->input.len, p)
```

All scanner functions share the signature:

```c
bufsize_t scan_PATTERN(const unsigned char *s, bufsize_t len, bufsize_t offset);
```

Parameters:

- `s` — Input byte string
- `len` — Total length of `s`
- `offset` — Starting position within `s`

Return value:

- Length of the match (in bytes) if successful
- `0` if no match at the given position

### Common Pattern

```c
// In blocks.c:
matched = _scan_at(&scan_thematic_break, &input, first_nonspace);

// In inlines.c:
matched = _scan_at(&scan_autolink_uri, subj, subj->pos);
```

## Scanner Functions

### Block Structure Scanners

| Scanner | Purpose | Used In |
|---------|---------|---------|
| `scan_thematic_break` | Matches `***`, `---`, `___` (with optional spaces) | `blocks.c` |
| `scan_atx_heading_start` | Matches `#{1,6}` followed by space or EOL | `blocks.c` |
| `scan_setext_heading_line` | Matches `=+` or `-+` at line start | `blocks.c` |
| `scan_open_code_fence` | Matches `` ``` `` or `~~~` (3+ fence chars) | `blocks.c` |
| `scan_close_code_fence` | Matches closing fence (≥ opening length) | `blocks.c` |
| `scan_html_block_start` | Matches HTML block type 1-6 openers | `blocks.c` |
| `scan_html_block_start_7` | Matches HTML block type 7 openers | `blocks.c` |
| `scan_html_block_end_1` | Matches `</script>`, `</pre>`, `</style>` | `blocks.c` |
| `scan_html_block_end_2` | Matches `-->` | `blocks.c` |
| `scan_html_block_end_3` | Matches `?>` | `blocks.c` |
| `scan_html_block_end_4` | Matches `>` | `blocks.c` |
| `scan_html_block_end_5` | Matches `]]>` | `blocks.c` |

### Inline Scanners

| Scanner | Purpose | Used In |
|---------|---------|---------|
| `scan_autolink_uri` | Matches URI autolinks | `inlines.c` |
| `scan_autolink_email` | Matches email autolinks | `inlines.c` |
| `scan_html_tag` | Matches inline HTML tags (open, close, comment, PI, CDATA, declaration) | `inlines.c` |
| `scan_entity` | Matches HTML entities in named, decimal, and hexadecimal form (e.g. `&amp;`, `&#123;`, `&#x7B;`) | `inlines.c` |
| `scan_dangerous_url` | Matches `javascript:`, `vbscript:`, `file:`, `data:` URLs | `html.c` |
| `scan_spacechars` | Matches runs of spaces and tabs | `inlines.c` |

### Link/Reference Scanners

| Scanner | Purpose | Used In |
|---------|---------|---------|
| `scan_link_url` | Matches link destinations (parenthesized or bare) | `inlines.c` |
| `scan_link_title` | Matches `"..."`, `'...'`, or `(...)` titles | `inlines.c` |

## Scanner Patterns (from `scanners.re`)

### Thematic Break

```
thematic_break = (('*' [ \t]*){3,} | ('-' [ \t]*){3,} | ('_' [ \t]*){3,}) [ \t]* [\n]
```

Three or more of the same character (`*`, `-`, or `_`), optionally separated by spaces/tabs, followed by the end of the line.

### ATX Heading

```
atx_heading_start = '#'{1,6} ([ \t]+ | [\n])
```

One to six `#` characters followed by one or more spaces/tabs, or by a newline.

### Code Fence

```
open_code_fence = '`'{3,} [^`\n]* [\n] | '~'{3,} [^\n]* [\n]
```

Three or more backticks, with no further backtick in the rest of the line (the info string), or three or more tildes followed by any text up to the newline.

### HTML Block Start (Types 1-7)

The CommonMark spec defines 7 types of HTML blocks, each matched by different scanners:

1. `