# cmark — XML Renderer

## Overview

The XML renderer (`xml.c`) produces an XML representation of the AST. Like the HTML renderer, it writes directly to a `cmark_strbuf` buffer rather than using the generic render framework. The output conforms to the CommonMark DTD.

## Entry Point

```c
char *cmark_render_xml(cmark_node *root, int options);
```

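Returns the complete XML document as a newly allocated string. The buffer is allocated through the root node's memory allocator (`root->mem`, see the implementation below), so the caller must release it with the matching free function, which is plain `free()` when the default allocator is in use.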

### Implementation

```c
char *cmark_render_xml(cmark_node *root, int options) {
  char *result;
  cmark_strbuf xml = CMARK_BUF_INIT(root->mem);
  cmark_event_type ev_type;
  cmark_node *cur;
  struct render_state state = {&xml, 0};
  cmark_iter *iter = cmark_iter_new(root);

  cmark_strbuf_puts(&xml,
                     "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                     "<!DOCTYPE document SYSTEM \"CommonMark.dtd\">\n");

  // optionally: <?xml-model href="CommonMark.rnc" ...?>
  while ((ev_type = cmark_iter_next(iter)) != CMARK_EVENT_DONE) {
    cur = cmark_iter_get_node(iter);
    S_render_node(cur, ev_type, &state, options);
  }
  result = (char *)cmark_strbuf_detach(&xml);
  cmark_iter_free(iter);
  return result;
}
```

## Render State

```c
struct render_state {
  cmark_strbuf *xml;    // Output buffer
  int indent;           // Current indentation level (number of spaces)
};
```

The `indent` field holds the current indentation in spaces. It grows by 2 when a container node is entered and shrinks by 2 when that node is exited.

## XML Escaping

```c
static CMARK_INLINE void escape_xml(cmark_strbuf *dest, const unsigned char *source,
                                    bufsize_t length) {
  houdini_escape_html0(dest, source, length, 0);
}
```

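Escapes `<`, `>`, `&`, and `"` as `&lt;`, `&gt;`, `&amp;`, and `&quot;`. The work is delegated to the same houdini helper that backs HTML escaping; the final argument is that helper's `secure` flag, passed as 0 here.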
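To make the escaping rule concrete, here is a minimal standalone sketch of the same four replacements. It is not the houdini implementation: `xml_escape_sketch` is a hypothetical name, and it writes into a caller-supplied buffer instead of a `cmark_strbuf`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of cmark): writes an XML-escaped copy of
 * `src` into `dst`. Returns the number of bytes written, excluding the
 * terminator, or -1 if `dst` is too small. */
static int xml_escape_sketch(char *dst, size_t dst_size, const char *src) {
  size_t used = 0;
  if (dst_size == 0) {
    return -1; /* no room even for the terminator */
  }
  for (; *src != '\0'; src++) {
    char single[2] = {*src, '\0'};
    const char *rep = single; /* default: copy the character unchanged */
    switch (*src) {
    case '<': rep = "&lt;"; break;
    case '>': rep = "&gt;"; break;
    case '&': rep = "&amp;"; break;
    case '"': rep = "&quot;"; break;
    default: break;
    }
    size_t n = strlen(rep);
    if (used + n + 1 > dst_size) {
      return -1; /* replacement plus terminator would overflow */
    }
    memcpy(dst + used, rep, n);
    used += n;
  }
  dst[used] = '\0';
  return (int)used;
}
```

Calling `xml_escape_sketch(buf, sizeof buf, "Hello & goodbye")` leaves `Hello &amp; goodbye` in `buf`.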

## Indentation

```c
static void indent(struct render_state *state) {
  int i;
  for (i = 0; i < state->indent; i++) {
    cmark_strbuf_putc(state->xml, ' ');
  }
}
```

`indent()` writes `state->indent` spaces before a tag, so each level of nesting shifts the output 2 spaces to the right.

## Source Position Attributes

```c
static void S_render_sourcepos(cmark_node *node, cmark_strbuf *xml, int options) {
  char buffer[BUFFER_SIZE];
  if (CMARK_OPT_SOURCEPOS & options) {
    snprintf(buffer, BUFFER_SIZE, " sourcepos=\"%d:%d-%d:%d\"",
             cmark_node_get_start_line(node), cmark_node_get_start_column(node),
             cmark_node_get_end_line(node), cmark_node_get_end_column(node));
    cmark_strbuf_puts(xml, buffer);
  }
}
```

When `CMARK_OPT_SOURCEPOS` is active, XML elements receive `sourcepos="line:col-line:col"` attributes.
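The attribute format can be tried in isolation. The sketch below assumes nothing about cmark beyond the format string shown above; `format_sourcepos` is a hypothetical helper that takes the four positions as plain integers.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical standalone helper: formats the sourcepos attribute into
 * `buf`. Returns 1 if the whole attribute fit, 0 if it was truncated or
 * formatting failed. */
static int format_sourcepos(char *buf, size_t size, int start_line,
                            int start_col, int end_line, int end_col) {
  int n = snprintf(buf, size, " sourcepos=\"%d:%d-%d:%d\"", start_line,
                   start_col, end_line, end_col);
  /* snprintf reports the length it wanted to write; compare with size. */
  return n >= 0 && (size_t)n < size;
}
```

For a heading that spans columns 1 to 7 of line 1, this yields ` sourcepos="1:1-1:7"`, with the leading space separating it from the tag name.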

## Node Type Name Table

```c
static const char *S_type_string(cmark_node *node) {
  if (node->extension && node->extension->xml_tag_name_func) {
    return node->extension->xml_tag_name_func(node->extension, node);
  }
  switch (node->type) {
  case CMARK_NODE_DOCUMENT:       return "document";
  case CMARK_NODE_BLOCK_QUOTE:    return "block_quote";
  case CMARK_NODE_LIST:           return "list";
  case CMARK_NODE_ITEM:           return "item";
  case CMARK_NODE_CODE_BLOCK:     return "code_block";
  case CMARK_NODE_HTML_BLOCK:     return "html_block";
  case CMARK_NODE_CUSTOM_BLOCK:   return "custom_block";
  case CMARK_NODE_PARAGRAPH:      return "paragraph";
  case CMARK_NODE_HEADING:        return "heading";
  case CMARK_NODE_THEMATIC_BREAK: return "thematic_break";
  case CMARK_NODE_TEXT:           return "text";
  case CMARK_NODE_SOFTBREAK:     return "softbreak";
  case CMARK_NODE_LINEBREAK:     return "linebreak";
  case CMARK_NODE_CODE:          return "code";
  case CMARK_NODE_HTML_INLINE:   return "html_inline";
  case CMARK_NODE_CUSTOM_INLINE: return "custom_inline";
  case CMARK_NODE_EMPH:          return "emph";
  case CMARK_NODE_STRONG:        return "strong";
  case CMARK_NODE_LINK:          return "link";
  case CMARK_NODE_IMAGE:         return "image";
  case CMARK_NODE_NONE:          return "NONE";
  }
  return "<unknown>";
}
```

Each node type has a fixed XML tag name. Extensions can override this via `xml_tag_name_func`.
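The override-then-fallback lookup can be sketched without the cmark headers. Everything prefixed `sk_` below is a hypothetical stand-in: the real node and extension structs carry far more state, and the real callback also receives the extension pointer.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for cmark's node and extension types. */
typedef enum { SK_NODE_PARAGRAPH, SK_NODE_TEXT, SK_NODE_EMPH } sk_node_type;

struct sk_node;
typedef const char *(*sk_tag_name_func)(const struct sk_node *node);

typedef struct sk_node {
  sk_node_type type;
  sk_tag_name_func tag_name_override; /* NULL when no extension owns the node */
} sk_node;

/* An extension-supplied name wins; otherwise use the fixed table. */
static const char *sk_type_string(const sk_node *node) {
  if (node->tag_name_override != NULL) {
    return node->tag_name_override(node);
  }
  switch (node->type) {
  case SK_NODE_PARAGRAPH: return "paragraph";
  case SK_NODE_TEXT:      return "text";
  case SK_NODE_EMPH:      return "emph";
  }
  return "<unknown>";
}

/* Example override, such as an extension might supply for its own node. */
static const char *sk_strikethrough_name(const sk_node *node) {
  (void)node; /* the name does not depend on node contents here */
  return "strikethrough";
}
```

A node with `tag_name_override` set to `sk_strikethrough_name` reports `strikethrough` regardless of its `type`, while a plain text node falls through to `text`.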

## Node Rendering Logic

### Leaf Nodes vs Container Nodes

The XML renderer distinguishes between leaf (literal) nodes and container nodes:

**Leaf nodes** (single event — `CMARK_EVENT_ENTER` only):
- `CMARK_NODE_CODE_BLOCK`, `CMARK_NODE_HTML_BLOCK`, `CMARK_NODE_THEMATIC_BREAK`
- `CMARK_NODE_TEXT`, `CMARK_NODE_SOFTBREAK`, `CMARK_NODE_LINEBREAK`
- `CMARK_NODE_CODE`, `CMARK_NODE_HTML_INLINE`

**Container nodes** (paired enter/exit events):
- `CMARK_NODE_DOCUMENT`, `CMARK_NODE_BLOCK_QUOTE`, `CMARK_NODE_LIST`, `CMARK_NODE_ITEM`
- `CMARK_NODE_PARAGRAPH`, `CMARK_NODE_HEADING`
- `CMARK_NODE_EMPH`, `CMARK_NODE_STRONG`, `CMARK_NODE_LINK`, `CMARK_NODE_IMAGE`
- `CMARK_NODE_CUSTOM_BLOCK`, `CMARK_NODE_CUSTOM_INLINE`

### Leaf Node Rendering

Literal nodes that contain text are rendered as:
```xml
  <tag_name>ESCAPED TEXT</tag_name>
```

For example, a text node with content "Hello & goodbye" becomes:
```xml
  <text>Hello &amp; goodbye</text>
```

Nodes without text content (thematic_break, softbreak, linebreak) are rendered as self-closing:
```xml
  <thematic_break />
```
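The two leaf shapes can be sketched as one small function. This is an illustration only: `render_leaf_sketch` is a hypothetical name, it expects text that has already been XML-escaped, and it writes to a plain character buffer.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes one leaf element at the given indent.
 * `escaped` must already be XML-escaped; pass NULL for nodes that carry no
 * text, which produces the self-closing form. Returns snprintf's result. */
static int render_leaf_sketch(char *out, size_t size, int indent,
                              const char *tag, const char *escaped) {
  if (escaped != NULL) {
    /* "%*s" with an empty string emits `indent` spaces. */
    return snprintf(out, size, "%*s<%s>%s</%s>\n", indent, "", tag, escaped,
                    tag);
  }
  return snprintf(out, size, "%*s<%s />\n", indent, "", tag);
}
```

With an indent of 2, a `text` tag and `Hello &amp; goodbye` produce the text example above, and a `thematic_break` tag with `NULL` produces the self-closing example.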

### Container Node Rendering (Enter)

On enter, the renderer outputs:
```xml
  <tag_name[sourcepos][ type-specific attributes]>
```

The renderer then increases the indent level by 2, so child nodes are written one step deeper.

#### Type-Specific Attributes on Enter

**List attributes:**
```c
cmark_strbuf_printf(xml, " type=\"%s\" tight=\"%s\"",
                    cmark_node_get_list_type(node) == CMARK_BULLET_LIST
                        ? "bullet" : "ordered",
                    cmark_node_get_list_tight(node) ? "true" : "false");
// For ordered lists only:
if (cmark_node_get_list_type(node) == CMARK_ORDERED_LIST) {
  int start = cmark_node_get_list_start(node);
  if (start != 1) {
    snprintf(buffer, BUFFER_SIZE, " start=\"%d\"", start);
    cmark_strbuf_puts(xml, buffer);  // append the formatted attribute
  }
  cmark_strbuf_printf(xml, " delimiter=\"%s\"",
                      cmark_node_get_list_delim(node) == CMARK_PAREN_DELIM
                          ? "paren" : "period");
}
```
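As a rough illustration of the resulting attribute strings, the sketch below reproduces the same decisions with plain values in place of node accessors. The `sk_` enums and `format_list_attrs` are hypothetical; the attribute names and their order follow the snippet above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for cmark's list type and delimiter enums. */
typedef enum { SK_BULLET_LIST, SK_ORDERED_LIST } sk_list_type;
typedef enum { SK_PERIOD_DELIM, SK_PAREN_DELIM } sk_delim_type;

/* Formats the list attributes into `out`; returns snprintf's result. */
static int format_list_attrs(char *out, size_t size, sk_list_type type,
                             bool tight, int start, sk_delim_type delim) {
  char start_attr[32] = ""; /* stays empty when start is the default 1 */
  if (type == SK_BULLET_LIST) {
    /* Bullet lists carry only type and tight. */
    return snprintf(out, size, " type=\"bullet\" tight=\"%s\"",
                    tight ? "true" : "false");
  }
  if (start != 1) {
    snprintf(start_attr, sizeof start_attr, " start=\"%d\"", start);
  }
  return snprintf(out, size,
                  " type=\"ordered\" tight=\"%s\"%s delimiter=\"%s\"",
                  tight ? "true" : "false", start_attr,
                  delim == SK_PAREN_DELIM ? "paren" : "period");
}
```

A loose ordered list written as `3)` would come out as ` type="ordered" tight="false" start="3" delimiter="paren"`.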

**Heading attributes:**
```c
snprintf(buffer, BUFFER_SIZE, " level=\"%d\"", node->as.heading.level);
cmark_strbuf_puts(xml, buffer);  // append the formatted attribute
```

**Code block attributes:**
```c
if (node->as.code.info) {
  cmark_strbuf_puts(xml, " info=\"");
  escape_xml(xml, node->as.code.info, (bufsize_t)strlen((char *)node->as.code.info));
  cmark_strbuf_putc(xml, '"');
}
```

**Link/Image attributes:**
```c
cmark_strbuf_puts(xml, " destination=\"");
escape_xml(xml, node->as.link.url, (bufsize_t)strlen((char *)node->as.link.url));
cmark_strbuf_putc(xml, '"');
cmark_strbuf_puts(xml, " title=\"");
escape_xml(xml, node->as.link.title, (bufsize_t)strlen((char *)node->as.link.title));
cmark_strbuf_putc(xml, '"');
```

**Custom block/inline attributes:**
```c
cmark_strbuf_puts(xml, " on_enter=\"");
escape_xml(xml, node->as.custom.on_enter, ...);
cmark_strbuf_puts(xml, "\" on_exit=\"");
escape_xml(xml, node->as.custom.on_exit, ...);
cmark_strbuf_putc(xml, '"');  // close the on_exit attribute value
```

### Container Node Rendering (Exit)

On exit, the indent level is decremented by 2, and the closing tag is output:
```xml
  </tag_name>
```
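The enter/exit pairing and the indent bookkeeping can be sketched together. This is a simplified model, not the code in `xml.c`: the `sk_` names are hypothetical, attributes are omitted, and output goes to a fixed-size buffer.

```c
#include <stddef.h>
#include <string.h>

typedef enum { SK_EVENT_ENTER, SK_EVENT_EXIT } sk_event;

struct sk_state {
  char out[256]; /* fixed-size output buffer, for this sketch only */
  int indent;    /* current indentation in spaces */
};

/* Appends `s`, silently truncating if the buffer is full. */
static void sk_append(struct sk_state *st, const char *s) {
  size_t used = strlen(st->out);
  strncat(st->out, s, sizeof st->out - used - 1);
}

static void sk_indent(struct sk_state *st) {
  for (int i = 0; i < st->indent; i++) {
    sk_append(st, " ");
  }
}

/* Enter: write the opening tag, then indent deeper.
 * Exit: step back out first, then write the closing tag. */
static void sk_render_container(struct sk_state *st, const char *tag,
                                sk_event ev) {
  if (ev == SK_EVENT_ENTER) {
    sk_indent(st);
    sk_append(st, "<");
    sk_append(st, tag);
    sk_append(st, ">\n");
    st->indent += 2;
  } else {
    st->indent -= 2;
    sk_indent(st);
    sk_append(st, "</");
    sk_append(st, tag);
    sk_append(st, ">\n");
  }
}
```

Decrementing before writing the closing tag is what lines `</tag_name>` up with its opening tag. Feeding enter/exit events for a `document` that contains one `paragraph` into a zero-initialized `sk_state` gives four lines, with the paragraph tags indented by 2 spaces and `indent` back at 0 afterwards.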

### Extension Support

Extensions can add additional XML attributes via:
```c
if (node->extension && node->extension->xml_attr_func) {
  node->extension->xml_attr_func(node->extension, node, xml);
}
```

## Example Output

Given this Markdown:

```markdown
# Hello

A paragraph with *emphasis* and a [link](http://example.com "title").
```

The XML output (with `CMARK_OPT_SOURCEPOS`):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document sourcepos="1:1-3:69" xmlns="http://commonmark.org/xml/1.0">
  <heading sourcepos="1:1-1:7" level="1">
    <text>Hello</text>
  </heading>
  <paragraph sourcepos="3:1-3:69">
    <text>A paragraph with </text>
    <emph>
      <text>emphasis</text>
    </emph>
    <text> and a </text>
    <link destination="http://example.com" title="title">
      <text>link</text>
    </link>
    <text>.</text>
  </paragraph>
</document>
```

## CommonMark DTD

The output references `CommonMark.dtd`, the DTD that defines:
- Document element as the root
- All CommonMark block and inline element types
- Required attributes for lists, headings, links, images, and code blocks
- Parameter entities that group the block and inline element names for reuse in content models

## Differences from HTML Renderer

1. **Full AST preservation**: XML represents the complete AST structure, including node types that HTML merges or loses (e.g., softbreak, custom blocks/inlines).
2. **Indentation tracking**: XML output is pretty-printed with nesting-based indentation.
3. **No tight list logic**: The `tight` attribute is emitted as metadata only. It does not change how paragraphs are rendered, so they always appear as `<paragraph>` elements.
4. **No URL safety check**: URLs are escaped for XML but otherwise written unchanged; the renderer does not call `_scan_dangerous_url()`.
5. **No plain text mode**: Image children are rendered structurally, not flattened to alt text.

## Cross-References

- [xml.c](../../cmark/src/xml.c) — Full implementation
- [html-renderer.md](html-renderer.md) — HTML renderer comparison
- [iterator-system.md](iterator-system.md) — Traversal mechanism used
- [public-api.md](public-api.md) — `cmark_render_xml()` API docs