summaryrefslogtreecommitdiff
path: root/docs/handbook/cmark/overview.md
diff options
context:
space:
mode:
Diffstat (limited to 'docs/handbook/cmark/overview.md')
-rw-r--r--docs/handbook/cmark/overview.md256
1 files changed, 256 insertions, 0 deletions
diff --git a/docs/handbook/cmark/overview.md b/docs/handbook/cmark/overview.md
new file mode 100644
index 0000000000..4fc95bdad7
--- /dev/null
+++ b/docs/handbook/cmark/overview.md
@@ -0,0 +1,256 @@
+# cmark — Overview
+
+## What Is cmark?
+
+cmark is a C library and command-line tool for parsing and rendering CommonMark (standardized Markdown). Written in C99, it implements a two-phase parsing architecture — block structure recognition followed by inline content parsing — producing an Abstract Syntax Tree (AST) that can be traversed, manipulated, and rendered into multiple output formats.
+
+**Language:** C (C99)
+**Build System:** CMake (minimum version 3.14)
+**Project Version:** 0.31.2
+**License:** BSD-2-Clause
+**Authors:** John MacFarlane, Vicent Marti, Kārlis Gaņģis, Nick Wellnhofer
+
+## Core Architecture Summary
+
+cmark's processing pipeline follows this sequence:
+
+1. **Input** — UTF-8 text is fed to the parser, either all at once or incrementally via a streaming API.
+2. **Block Parsing** (`blocks.c`) — The input is scanned line-by-line to identify block-level structures (paragraphs, headings, code blocks, lists, block quotes, thematic breaks, HTML blocks).
+3. **Inline Parsing** (`inlines.c`) — Within paragraph and heading blocks, inline elements are parsed (emphasis, links, images, code spans, HTML inline, line breaks).
+4. **AST Construction** — A tree of `cmark_node` structures is built, with each node representing a document element.
+5. **Rendering** — The AST is traversed using an iterator and rendered to one of five output formats: HTML, XML, LaTeX, man (groff), or CommonMark.
+
+## Source File Map
+
+The `cmark/src/` directory contains the following source files, organized by responsibility:
+
+### Public API
+| File | Purpose |
+|------|---------|
+| `cmark.h` | Public API header — all exported types, enums, and function declarations |
+| `cmark.c` | Core glue — `cmark_markdown_to_html()`, default memory allocator, version info |
+| `main.c` | CLI entry point — argument parsing, file I/O, format dispatch |
+
+### AST Node System
+| File | Purpose |
+|------|---------|
+| `node.h` | Internal node struct definition, type-specific unions (`cmark_list`, `cmark_code`, `cmark_heading`, `cmark_link`, `cmark_custom`), internal flags |
+| `node.c` | Node creation/destruction, accessor functions, tree manipulation (insert, append, unlink, replace) |
+
+### Parsing
+| File | Purpose |
+|------|---------|
+| `parser.h` | Internal `cmark_parser` struct definition (parser state: line number, offset, column, indent, reference map) |
+| `blocks.c` | Block-level parsing — line-by-line analysis, open/close block logic, list item detection, finalization |
+| `inlines.c` | Inline-level parsing — emphasis/strong via delimiter stack, backtick code spans, links/images via bracket stack, autolinks, HTML inline |
+| `inlines.h` | Internal API: `cmark_parse_inlines()`, `cmark_parse_reference_inline()`, `cmark_clean_url()`, `cmark_clean_title()` |
+
+### Traversal
+| File | Purpose |
+|------|---------|
+| `iterator.h` | Internal `cmark_iter` struct with `cmark_iter_state` (current + next event/node pairs) |
+| `iterator.c` | Iterator implementation — `cmark_iter_new()`, `cmark_iter_next()`, `cmark_iter_reset()`, `cmark_consolidate_text_nodes()` |
+
+### Renderers
+| File | Purpose |
+|------|---------|
+| `render.h` | `cmark_renderer` struct, `cmark_escaping` enum (`LITERAL`, `NORMAL`, `TITLE`, `URL`) |
+| `render.c` | Generic render framework — line wrapping, prefix management, `cmark_render()` dispatch loop |
+| `html.c` | HTML renderer — `cmark_render_html()`, direct strbuf-based output, no render framework |
+| `xml.c` | XML renderer — `cmark_render_xml()`, direct strbuf-based output with CommonMark DTD |
+| `latex.c` | LaTeX renderer — `cmark_render_latex()`, uses render framework |
+| `man.c` | groff man renderer — `cmark_render_man()`, uses render framework |
+| `commonmark.c` | CommonMark renderer — `cmark_render_commonmark()`, uses render framework |
+
+### Text Processing and Utilities
+| File | Purpose |
+|------|---------|
+| `buffer.h` / `buffer.c` | `cmark_strbuf` — growable byte buffer with amortized O(1) append |
+| `chunk.h` | `cmark_chunk` — lightweight non-owning string slice (pointer + length) |
+| `utf8.h` / `utf8.c` | UTF-8 iteration, validation, encoding, case folding, Unicode property queries |
+| `references.h` / `references.c` | Link reference definition storage and lookup (sorted array with binary search) |
+| `scanners.h` / `scanners.c` | re2c-generated scanner functions for recognizing Markdown syntax patterns |
+| `scanners.re` | re2c source for scanner generation |
+| `cmark_ctype.h` / `cmark_ctype.c` | Locale-independent `cmark_isspace()`, `cmark_ispunct()`, `cmark_isdigit()`, `cmark_isalpha()` |
+| `houdini.h` | HTML/URL escaping and unescaping API |
+| `houdini_html_e.c` | HTML entity escaping |
+| `houdini_html_u.c` | HTML entity unescaping |
+| `houdini_href_e.c` | URL/href percent-encoding |
+| `entities.inc` | HTML entity name-to-codepoint lookup table |
+| `case_fold.inc` | Unicode case folding table for reference normalization |
+
+## The Simple Interface
+
+The simplest way to use cmark is a single function call defined in `cmark.c`:
+
+```c
+char *cmark_markdown_to_html(const char *text, size_t len, int options);
+```
+
+Internally, this calls `cmark_parse_document()` to build the AST, then `cmark_render_html()` to produce the output, and finally frees the document node. The caller is responsible for freeing the returned string.
+
+The implementation in `cmark.c`:
+
+```c
+char *cmark_markdown_to_html(const char *text, size_t len, int options) {
+ cmark_node *doc;
+ char *result;
+
+ doc = cmark_parse_document(text, len, options);
+ result = cmark_render_html(doc, options);
+ cmark_node_free(doc);
+
+ return result;
+}
+```
+
+## The Streaming Interface
+
+For large documents or streaming input, cmark provides an incremental parsing API:
+
+```c
+cmark_parser *parser = cmark_parser_new(CMARK_OPT_DEFAULT);
+
+// Feed chunks of data as they arrive
+while ((bytes = fread(buffer, 1, sizeof(buffer), fp)) > 0) {
+ cmark_parser_feed(parser, buffer, bytes);
+}
+
+// Finalize and get the AST
+cmark_node *document = cmark_parser_finish(parser);
+cmark_parser_free(parser);
+
+// Render to any format
+char *html = cmark_render_html(document, CMARK_OPT_DEFAULT);
+char *xml = cmark_render_xml(document, CMARK_OPT_DEFAULT);
+char *man = cmark_render_man(document, CMARK_OPT_DEFAULT, 72);
+char *tex = cmark_render_latex(document, CMARK_OPT_DEFAULT, 80);
+char *cm = cmark_render_commonmark(document, CMARK_OPT_DEFAULT, 0);
+
+// Cleanup
+cmark_node_free(document);
+```
+
+The parser accumulates input in an internal line buffer (`parser->linebuf`) and processes complete lines as they become available. The `S_parser_feed()` function in `blocks.c` scans for line-ending characters (`\n`, `\r`) and dispatches each complete line to `S_process_line()`.
+
+## Node Type Taxonomy
+
+cmark defines 21 node types in the `cmark_node_type` enum:
+
+### Block Nodes (container and leaf)
+| Enum Value | Type String | Container? | Accepts Lines? | Contains Inlines? |
+|-----------|-------------|------------|---------------|-------------------|
+| `CMARK_NODE_DOCUMENT` | "document" | Yes | No | No |
+| `CMARK_NODE_BLOCK_QUOTE` | "block_quote" | Yes | No | No |
+| `CMARK_NODE_LIST` | "list" | Yes (items only) | No | No |
+| `CMARK_NODE_ITEM` | "item" | Yes | No | No |
+| `CMARK_NODE_CODE_BLOCK` | "code_block" | No (leaf) | Yes | No |
+| `CMARK_NODE_HTML_BLOCK` | "html_block" | No (leaf) | No | No |
+| `CMARK_NODE_CUSTOM_BLOCK` | "custom_block" | Yes | No | No |
+| `CMARK_NODE_PARAGRAPH` | "paragraph" | No | Yes | Yes |
+| `CMARK_NODE_HEADING` | "heading" | No | Yes | Yes |
+| `CMARK_NODE_THEMATIC_BREAK` | "thematic_break" | No (leaf) | No | No |
+
+### Inline Nodes
+| Enum Value | Type String | Leaf? |
+|-----------|-------------|-------|
+| `CMARK_NODE_TEXT` | "text" | Yes |
+| `CMARK_NODE_SOFTBREAK` | "softbreak" | Yes |
+| `CMARK_NODE_LINEBREAK` | "linebreak" | Yes |
+| `CMARK_NODE_CODE` | "code" | Yes |
+| `CMARK_NODE_HTML_INLINE` | "html_inline" | Yes |
+| `CMARK_NODE_CUSTOM_INLINE` | "custom_inline" | No |
+| `CMARK_NODE_EMPH` | "emph" | No |
+| `CMARK_NODE_STRONG` | "strong" | No |
+| `CMARK_NODE_LINK` | "link" | No |
+| `CMARK_NODE_IMAGE` | "image" | No |
+
+Range sentinels are also defined for classification:
+- `CMARK_NODE_FIRST_BLOCK = CMARK_NODE_DOCUMENT`
+- `CMARK_NODE_LAST_BLOCK = CMARK_NODE_THEMATIC_BREAK`
+- `CMARK_NODE_FIRST_INLINE = CMARK_NODE_TEXT`
+- `CMARK_NODE_LAST_INLINE = CMARK_NODE_IMAGE`
+
+## Option Flags
+
+Options are passed as a bitmask integer to parsing and rendering functions:
+
+```c
+#define CMARK_OPT_DEFAULT 0
+#define CMARK_OPT_SOURCEPOS (1 << 1) // Add data-sourcepos attributes
+#define CMARK_OPT_HARDBREAKS (1 << 2) // Render softbreaks as hard breaks
+#define CMARK_OPT_SAFE (1 << 3) // Legacy (now default behavior)
+#define CMARK_OPT_NOBREAKS (1 << 4) // Render softbreaks as spaces
+#define CMARK_OPT_NORMALIZE (1 << 8) // Legacy (no effect)
+#define CMARK_OPT_VALIDATE_UTF8 (1 << 9) // Validate UTF-8 input
+#define CMARK_OPT_SMART (1 << 10) // Smart quotes and dashes
+#define CMARK_OPT_UNSAFE (1 << 17) // Allow raw HTML and dangerous URLs
+```
+
+## Memory Management Model
+
+cmark uses a pluggable memory allocator defined by the `cmark_mem` struct:
+
+```c
+typedef struct cmark_mem {
+ void *(*calloc)(size_t, size_t);
+ void *(*realloc)(void *, size_t);
+ void (*free)(void *);
+} cmark_mem;
+```
+
+The default allocator in `cmark.c` wraps standard `calloc`/`realloc`/`free` with abort-on-NULL safety checks (`xcalloc`, `xrealloc`). Every node stores a pointer to the allocator it was created with (`node->mem`), ensuring consistent allocation/deallocation throughout the tree.
+
+## Version Information
+
+Runtime version queries:
+
+```c
+int cmark_version(void); // Returns CMARK_VERSION as integer (0xMMmmpp)
+const char *cmark_version_string(void); // Returns CMARK_VERSION_STRING
+```
+
+The version is encoded as a 24-bit integer where bits 16–23 are major, 8–15 are minor, and 0–7 are patch. For example, `0x001F02` represents version 0.31.2.
+
+## Backwards Compatibility Aliases
+
+For code written against older cmark API versions, these aliases are provided:
+
+```c
+#define CMARK_NODE_HEADER CMARK_NODE_HEADING
+#define CMARK_NODE_HRULE CMARK_NODE_THEMATIC_BREAK
+#define CMARK_NODE_HTML CMARK_NODE_HTML_BLOCK
+#define CMARK_NODE_INLINE_HTML CMARK_NODE_HTML_INLINE
+```
+
+Short-name aliases (without the `CMARK_` prefix) are also available unless `CMARK_NO_SHORT_NAMES` is defined:
+
+```c
+#define NODE_DOCUMENT CMARK_NODE_DOCUMENT
+#define NODE_PARAGRAPH CMARK_NODE_PARAGRAPH
+#define BULLET_LIST CMARK_BULLET_LIST
+// ... and many more
+```
+
+## Cross-References
+
+- [architecture.md](architecture.md) — Detailed two-phase parsing pipeline, module dependency graph
+- [public-api.md](public-api.md) — Complete public API reference with all function signatures
+- [ast-node-system.md](ast-node-system.md) — Internal `cmark_node` struct, type-specific unions, tree operations
+- [block-parsing.md](block-parsing.md) — `blocks.c` line-by-line analysis, open block tracking, finalization
+- [inline-parsing.md](inline-parsing.md) — `inlines.c` delimiter algorithm, bracket stack, backtick scanning
+- [iterator-system.md](iterator-system.md) — AST traversal with enter/exit events
+- [html-renderer.md](html-renderer.md) — HTML output with escaping and source position
+- [xml-renderer.md](xml-renderer.md) — XML output with CommonMark DTD
+- [latex-renderer.md](latex-renderer.md) — LaTeX output via render framework
+- [man-renderer.md](man-renderer.md) — groff man page output
+- [commonmark-renderer.md](commonmark-renderer.md) — Round-trip CommonMark output
+- [render-framework.md](render-framework.md) — Shared `cmark_render()` engine for text-based renderers
+- [memory-management.md](memory-management.md) — Allocator model, buffer growth, node freeing
+- [utf8-handling.md](utf8-handling.md) — UTF-8 validation, iteration, case folding
+- [reference-system.md](reference-system.md) — Link reference definitions storage and resolution
+- [scanner-system.md](scanner-system.md) — re2c-generated pattern matching
+- [building.md](building.md) — CMake build configuration and options
+- [cli-usage.md](cli-usage.md) — Command-line tool usage
+- [testing.md](testing.md) — Test infrastructure (spec tests, API tests, fuzzing)
+- [code-style.md](code-style.md) — Coding conventions and naming patterns