summaryrefslogtreecommitdiff
path: root/docs/handbook/cmark/iterator-system.md
blob: 3cdcfda66eba712e3bfe86b8d2b9395f8135320a (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
# cmark — Iterator System

## Overview

The iterator system (`iterator.c`, `iterator.h`) provides depth-first traversal of the AST using an event-based model. Each node is visited twice: once on `CMARK_EVENT_ENTER` (before children) and once on `CMARK_EVENT_EXIT` (after children). Leaf nodes receive both events in immediate succession.

All renderers (HTML, XML, LaTeX, man, CommonMark) use the iterator as their traversal mechanism.

## Public API

```c
cmark_iter *cmark_iter_new(cmark_node *root);
void cmark_iter_free(cmark_iter *iter);
cmark_event_type cmark_iter_next(cmark_iter *iter);
cmark_node *cmark_iter_get_node(cmark_iter *iter);
cmark_event_type cmark_iter_get_event_type(cmark_iter *iter);
cmark_node *cmark_iter_get_root(cmark_iter *iter);
void cmark_iter_reset(cmark_iter *iter, cmark_node *current, cmark_event_type event_type);
```

## Iterator State

```c
struct cmark_iter {
  cmark_mem *mem;
  cmark_node *root;
  cmark_node *cur;
  cmark_event_type ev_type;
};
```

The iterator stores:
- `root` — The subtree root (traversal boundary)
- `cur` — Current node
- `ev_type` — Current event (`CMARK_EVENT_ENTER`, `CMARK_EVENT_EXIT`, `CMARK_EVENT_DONE`, or `CMARK_EVENT_NONE`)

## Event Types

```c
typedef enum {
  CMARK_EVENT_NONE,     // Initial state
  CMARK_EVENT_DONE,     // Traversal complete (exited root)
  CMARK_EVENT_ENTER,    // Entering a node (pre-children)
  CMARK_EVENT_EXIT,     // Exiting a node (post-children)
} cmark_event_type;
```

## Leaf Node Detection

```c
static const int S_leaf_mask =
    (1 << CMARK_NODE_HTML_BLOCK) | (1 << CMARK_NODE_THEMATIC_BREAK) |
    (1 << CMARK_NODE_CODE_BLOCK) | (1 << CMARK_NODE_TEXT) |
    (1 << CMARK_NODE_SOFTBREAK)  | (1 << CMARK_NODE_LINEBREAK) |
    (1 << CMARK_NODE_CODE)       | (1 << CMARK_NODE_HTML_INLINE);

static bool S_is_leaf(cmark_node *node) {
  return ((1 << node->type) & S_leaf_mask) != 0;
}
```

Leaf nodes are determined by a bitmask — not by checking whether `first_child` is NULL. This means an emphasis node with no children is still treated as a container (it receives separate enter and exit events).

The leaf node types are:
- **Block leaves**: `HTML_BLOCK`, `THEMATIC_BREAK`, `CODE_BLOCK`
- **Inline leaves**: `TEXT`, `SOFTBREAK`, `LINEBREAK`, `CODE`, `HTML_INLINE`

## Traversal Algorithm

`cmark_iter_next()` implements the state machine:

```c
cmark_event_type cmark_iter_next(cmark_iter *iter) {
  cmark_event_type ev_type = iter->ev_type;
  cmark_node *cur = iter->cur;

  if (ev_type == CMARK_EVENT_DONE) {
    return CMARK_EVENT_DONE;
  }

  // For initial state, start with ENTER on root
  if (ev_type == CMARK_EVENT_NONE) {
    iter->ev_type = CMARK_EVENT_ENTER;
    return iter->ev_type;
  }

  if (ev_type == CMARK_EVENT_ENTER && !S_is_leaf(cur)) {
    // Container node being entered — descend to first child if it exists
    if (cur->first_child) {
      iter->ev_type = CMARK_EVENT_ENTER;
      iter->cur = cur->first_child;
    } else {
      // Empty container — immediately exit
      iter->ev_type = CMARK_EVENT_EXIT;
    }
  } else if (cur == iter->root) {
    // Exiting root (or leaf at root) — done
    iter->ev_type = CMARK_EVENT_DONE;
    iter->cur = NULL;
  } else if (cur->next) {
    // Move to next sibling
    iter->ev_type = CMARK_EVENT_ENTER;
    iter->cur = cur->next;
  } else if (cur->parent) {
    // No more siblings — exit parent
    iter->ev_type = CMARK_EVENT_EXIT;
    iter->cur = cur->parent;
  } else {
    // Orphan node — done
    assert(false);
    iter->ev_type = CMARK_EVENT_DONE;
    iter->cur = NULL;
  }

  return iter->ev_type;
}
```

### State Transition Summary

| Current State | Condition | Next State |
|--------------|-----------|------------|
| `NONE` | (initial) | `ENTER(root)` |
| `ENTER(container)` | has children | `ENTER(first_child)` |
| `ENTER(container)` | no children | `EXIT(container)` |
| `ENTER(leaf)` or `EXIT(node)` | node == root | `DONE` |
| `ENTER(leaf)` or `EXIT(node)` | has next sibling | `ENTER(next)` |
| `ENTER(leaf)` or `EXIT(node)` | has parent | `EXIT(parent)` |
| `DONE` | (terminal) | `DONE` |

### Traversal Order Example

For a document with a paragraph containing "Hello *world*":

```
Document
└── Paragraph
    ├── Text("Hello ")
    ├── Emph
    │   └── Text("world")
    └── (end)
```

Event sequence:
1. `ENTER(Document)`
2. `ENTER(Paragraph)`
3. `ENTER(Text "Hello ")` — leaf, immediate transition
4. `ENTER(Emph)`
5. `ENTER(Text "world")` — leaf, immediate transition
6. `EXIT(Emph)`
7. `EXIT(Paragraph)`
8. `EXIT(Document)`
9. `DONE`

## Iterator Reset

```c
void cmark_iter_reset(cmark_iter *iter, cmark_node *current,
                      cmark_event_type event_type) {
  iter->cur = current;
  iter->ev_type = event_type;
}
```

Allows repositioning the iterator to any node and event type. This is used by renderers to skip subtrees — e.g., when the HTML renderer processes an image node, it may skip children after extracting alt text.

## Text Node Consolidation

```c
void cmark_consolidate_text_nodes(cmark_node *root) {
  if (root == NULL) return;
  cmark_iter *iter = cmark_iter_new(root);
  cmark_strbuf buf = CMARK_BUF_INIT(iter->mem);
  cmark_event_type ev_type;
  cmark_node *cur, *tmp, *next;

  while ((ev_type = cmark_iter_next(iter)) != CMARK_EVENT_DONE) {
    cur = cmark_iter_get_node(iter);
    if (ev_type == CMARK_EVENT_ENTER && cur->type == CMARK_NODE_TEXT &&
        cur->next && cur->next->type == CMARK_NODE_TEXT) {
      // Merge consecutive text nodes
      cmark_strbuf_clear(&buf);
      cmark_strbuf_put(&buf, cur->data, cur->len);
      tmp = cur->next;
      while (tmp && tmp->type == CMARK_NODE_TEXT) {
        cmark_iter_reset(iter, tmp, CMARK_EVENT_ENTER);
        cmark_strbuf_put(&buf, tmp->data, tmp->len);
        cur->end_column = tmp->end_column;
        next = tmp->next;
        cmark_node_free(tmp);
        tmp = next;
      }
      // Replace cur's data with merged content
      cmark_chunk_free(iter->mem, &cur->as.literal);
      cmark_strbuf_trim(&buf);
      // ... set cur->data and cur->len
    }
  }
  cmark_strbuf_free(&buf);
  cmark_iter_free(iter);
}
```

This function merges adjacent text nodes into a single text node. Adjacent text nodes can arise from inline parsing (e.g., when backslash escapes split text). The function:

1. Finds consecutive text node runs
2. Concatenates their content into a buffer
3. Updates the first node's content and end position
4. Frees the subsequent nodes
5. Uses `cmark_iter_reset()` to skip freed nodes

## Usage Patterns

### Standard Rendering Loop

```c
cmark_iter *iter = cmark_iter_new(root);
while ((ev_type = cmark_iter_next(iter)) != CMARK_EVENT_DONE) {
  cur = cmark_iter_get_node(iter);
  S_render_node(cur, ev_type, &state, options);
}
cmark_iter_free(iter);
```

### Skipping Children

To skip rendering of a node's children (e.g., for image alt text in HTML):
```c
if (ev_type == CMARK_EVENT_ENTER) {
  cmark_iter_reset(iter, node, CMARK_EVENT_EXIT);
}
```

This jumps directly to the exit event, bypassing all children.

### Safe Node Removal

The iterator handles node removal between calls. Since `cmark_iter_next()` always follows `next` and `parent` pointers from the current position, removing the current node is safe as long as:
1. The node's `next` and `parent` pointers remain valid
2. The iterator is reset to skip the removed node's children

## Thread Safety

Iterators are NOT thread-safe. A single AST must not be iterated concurrently without external synchronization. However, since iterators only read the AST (never modify it), multiple read-only iterators on the same AST are safe if no modifications occur.

## Memory

The iterator allocates a `cmark_iter` struct using the root node's memory allocator:
```c
cmark_iter *cmark_iter_new(cmark_node *root) {
  cmark_mem *mem = root->mem;
  cmark_iter *iter = (cmark_iter *)mem->calloc(1, sizeof(cmark_iter));
  iter->mem = mem;
  iter->root = root;
  iter->cur = root;
  iter->ev_type = CMARK_EVENT_NONE;
  return iter;
}
```

## Cross-References

- [iterator.c](../../cmark/src/iterator.c) — Iterator implementation
- [iterator.h](../../cmark/src/iterator.h) — Iterator struct definition
- [ast-node-system.md](ast-node-system.md) — The nodes being traversed
- [html-renderer.md](html-renderer.md) — Example of iterator-driven rendering
- [render-framework.md](render-framework.md) — Framework that wraps iterator use