# cmark — LaTeX Renderer

## Overview

The LaTeX renderer (`latex.c`) converts a `cmark_node` AST into LaTeX source, suitable for compilation with `pdflatex`, `xelatex`, or `lualatex`. It uses the generic render framework from `render.c`, operating through a per-character output callback (`outc`) and a per-node render callback (`S_render_node`).

## Entry Point

```c
char *cmark_render_latex(cmark_node *root, int options, int width);
```

- `root` — AST root node
- `options` — Option flags (`CMARK_OPT_SOURCEPOS`, `CMARK_OPT_HARDBREAKS`, `CMARK_OPT_NOBREAKS`, `CMARK_OPT_UNSAFE`)
- `width` — Target line width for hard-wrapping; 0 disables wrapping

## Character Escaping (`outc`)

The `outc` function handles per-character output decisions. It is the most complex part of the LaTeX renderer, with different behavior for three escaping modes:

```c
static void outc(cmark_renderer *renderer, cmark_escaping escape,
                 int32_t c, unsigned char nextc);
```

### LITERAL Mode
Pass-through: all characters are output unchanged.

### NORMAL Mode
Extensive special-character handling:

| Character | LaTeX Output | Purpose |
|-----------|-------------|---------|
| `$` | `\$` | Math mode delimiter |
| `%` | `\%` | Comment character |
| `&` | `\&` | Table column separator |
| `_` | `\_` | Subscript operator |
| `#` | `\#` | Parameter reference |
| `^` | `\^{}` | Superscript operator |
| `{` | `\{` | Group open |
| `}` | `\}` | Group close |
| `~` | `\textasciitilde{}` | Non-breaking space |
| `[` | `{[}` | Optional argument bracket |
| `]` | `{]}` | Optional argument bracket |
| `\` | `\textbackslash{}` | Escape character |
| `|` | `\textbar{}` | Pipe |
| `'` | `\textquotesingle{}` | Straight single quote |
| `"` | `\textquotedbl{}` | Straight double quote |
| `` ` `` | `\textasciigrave{}` | Backtick |
| U+00A0 (NBSP) | `~` | LaTeX non-breaking space |
| U+2014 (—) | `---` | Em dash |
| U+2013 (–) | `--` | En dash |
| U+2018 (‘) | `` ` `` | Left single quote |
| U+2019 (’) | `'` | Right single quote |
| U+201C (“) | ` `` ` | Left double quote |
| U+201D (”) | `''` | Right double quote |
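
As a rough illustration of the ASCII rows in the table above, the sketch below maps each character to its replacement and escapes a whole string into a caller-supplied buffer. The names `latex_escape_ascii` and `latex_escape_normal` are hypothetical and are not taken from `latex.c`; the real `outc` emits output through the `cmark_renderer` callbacks and also handles the non-ASCII rows, which this sketch leaves out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the LaTeX replacement for an ASCII
 * character listed in the NORMAL-mode table, or NULL if the character
 * passes through unchanged. */
static const char *latex_escape_ascii(char c) {
  switch (c) {
  case '$':  return "\\$";
  case '%':  return "\\%";
  case '&':  return "\\&";
  case '_':  return "\\_";
  case '#':  return "\\#";
  case '^':  return "\\^{}";
  case '{':  return "\\{";
  case '}':  return "\\}";
  case '~':  return "\\textasciitilde{}";
  case '[':  return "{[}";
  case ']':  return "{]}";
  case '\\': return "\\textbackslash{}";
  case '|':  return "\\textbar{}";
  case '\'': return "\\textquotesingle{}";
  case '"':  return "\\textquotedbl{}";
  case '`':  return "\\textasciigrave{}";
  default:   return NULL;
  }
}

/* Escapes `in` into `out`, whose capacity `cap` includes the terminator.
 * Returns the number of bytes written (excluding the terminator), or
 * (size_t)-1 if `out` is too small. */
static size_t latex_escape_normal(const char *in, char *out, size_t cap) {
  size_t n = 0;
  for (; *in; in++) {
    const char *rep = latex_escape_ascii(*in);
    size_t len = rep ? strlen(rep) : 1;
    if (n + len + 1 > cap)
      return (size_t)-1;
    if (rep)
      memcpy(out + n, rep, len);
    else
      out[n] = *in;
    n += len;
  }
  if (cap == 0)
    return (size_t)-1;
  out[n] = '\0';
  return n;
}
```

With this helper, the input `50% of $x_1` becomes `50\% of \$x\_1`, and `a[0]` becomes `a{[}0{]}`.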

### URL Mode
Only these characters are escaped:
- `$` → `\$`
- `%` → `\%`
- `&` → `\&`
- `_` → `\_`
- `#` → `\#`
- `{` → `\{`
- `}` → `\}`

All other characters pass through unchanged.
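
The URL-mode rule can be sketched in the same spirit: prefix a backslash to the seven listed characters and copy everything else verbatim. `url_needs_backslash` and `latex_escape_url` are hypothetical names used only for this illustration, and the sketch covers only the behavior stated in the list above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true for the characters that the URL-mode list
 * escapes with a leading backslash. */
static int url_needs_backslash(char c) {
  return c != '\0' && strchr("$%&_#{}", c) != NULL;
}

/* Escapes `in` into `out`, whose capacity `cap` includes the terminator.
 * Returns the number of bytes written (excluding the terminator), or
 * (size_t)-1 if `out` is too small. */
static size_t latex_escape_url(const char *in, char *out, size_t cap) {
  size_t n = 0;
  for (; *in; in++) {
    size_t len = url_needs_backslash(*in) ? 2 : 1;
    if (n + len + 1 > cap)
      return (size_t)-1;
    if (len == 2)
      out[n++] = '\\';
    out[n++] = *in;
  }
  if (cap == 0)
    return (size_t)-1;
  out[n] = '\0';
  return n;
}
```

For instance, `https://example.com/a_b?x=1&y=2#top` becomes `https://example.com/a\_b?x=1\&y=2\#top`, while characters such as `~`, `[`, and `]` are left alone.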

## Link Type Classification

The renderer classifies links into five categories:

```c
typedef enum {
  NO_LINK,
  URL_AUTOLINK,
  EMAIL_AUTOLINK,
  NORMAL_LINK,
  INTERNAL_LINK,
} link_type;
```

### `get_link_type()`

```c
static link_type get_link_type(cmark_node *node) {
  // 1. "mailto:" links where text matches url
  // 2. "http[s]:" links where text matches url (with or without protocol)
  // 3. Links starting with '#' → INTERNAL_LINK
  // 4. Everything else → NORMAL_LINK
}
```

Detection logic:
1. **URL_AUTOLINK**: The `url` starts with `http://` or `https://`, the link has exactly one text child, and that child's content matches the URL (or matches the URL minus the protocol prefix).
2. **EMAIL_AUTOLINK**: The `url` starts with `mailto:`, the link has exactly one text child, and that child's content matches the URL after `mailto:`.
3. **INTERNAL_LINK**: The `url` starts with `#`.
4. **NORMAL_LINK**: Everything else.
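
The detection rules above can be sketched on plain strings. This is a simplified model, not the code in `latex.c`: `classify_link` and `has_prefix` are hypothetical names, `text` stands for the content of the link's single text child (or `NULL` when the link does not have exactly one text child), and the "protocol prefix" is assumed to be `http://` or `https://`. The sketch never returns `NO_LINK`, since the rules above do not say when that value applies.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
  NO_LINK,
  URL_AUTOLINK,
  EMAIL_AUTOLINK,
  NORMAL_LINK,
  INTERNAL_LINK
} link_type;

/* Hypothetical helper: true if `s` begins with `prefix`. */
static int has_prefix(const char *s, const char *prefix) {
  return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Simplified classifier working on strings instead of cmark nodes.
 * `url` must be non-NULL; `text` may be NULL (see lead-in). */
static link_type classify_link(const char *url, const char *text) {
  const char *rest = NULL;

  if (has_prefix(url, "mailto:")) {
    /* Email autolink: the text equals the address after "mailto:". */
    if (text && strcmp(text, url + 7) == 0)
      return EMAIL_AUTOLINK;
    return NORMAL_LINK;
  }

  if (has_prefix(url, "https://"))
    rest = url + 8;
  else if (has_prefix(url, "http://"))
    rest = url + 7;

  if (rest) {
    /* URL autolink: the text equals the URL, with or without protocol. */
    if (text && (strcmp(text, url) == 0 || strcmp(text, rest) == 0))
      return URL_AUTOLINK;
    return NORMAL_LINK;
  }

  if (url[0] == '#')
    return INTERNAL_LINK;

  return NORMAL_LINK;
}
```

Under these assumptions, `classify_link("https://example.com", "example.com")` yields `URL_AUTOLINK`, while the same URL with the text `click here` yields `NORMAL_LINK`.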

## Enumeration Level

For nested ordered lists, the renderer selects the appropriate LaTeX counter style:

```c
static int S_get_enumlevel(cmark_node *node) {
  int enumlevel = 0;
  cmark_node *tmp = node;
  while (tmp) {
    if (tmp->type == CMARK_NODE_LIST &&
        cmark_node_get_list_type(tmp) == CMARK_ORDERED_LIST) {
      enumlevel++;
    }
    tmp = tmp->parent;
  }
  return enumlevel;
}
```

This walks up the tree from `node` (inclusive), counting ordered-list ancestors. LaTeX's `enumerate` supports four nesting levels, each with its own counter and default numbering style: `enumi` (arabic), `enumii` (alpha), `enumiii` (roman), `enumiv` (Alpha).
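
To see the counting in isolation, here is a small sketch on a stand-in tree type. `sk_node`, `sk_kind`, `sk_enumlevel`, and `sk_counter_name` are hypothetical names invented for this example; they only mimic the parent-pointer walk of `S_get_enumlevel` and the level-to-counter mapping described above.

```c
#include <stddef.h>

/* Stand-in node type for this sketch (not the real cmark_node). */
typedef enum { SK_OTHER, SK_BULLET_LIST, SK_ORDERED_LIST } sk_kind;

typedef struct sk_node {
  sk_kind kind;
  struct sk_node *parent;
} sk_node;

/* Counts ordered lists on the path from `node` up to the root,
 * including `node` itself. */
static int sk_enumlevel(const sk_node *node) {
  int level = 0;
  for (; node; node = node->parent) {
    if (node->kind == SK_ORDERED_LIST)
      level++;
  }
  return level;
}

/* Maps a nesting level to the LaTeX counter name, or NULL when the
 * level is outside the four levels LaTeX provides. */
static const char *sk_counter_name(int level) {
  static const char *const names[] = {"enumi", "enumii", "enumiii", "enumiv"};
  return (level >= 1 && level <= 4) ? names[level - 1] : NULL;
}
```

Bullet lists between two ordered lists do not affect the result: an ordered list nested inside a bullet list inside an ordered list is still at level 2 and uses `enumii`.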

## Node Rendering (`S_render_node`)

### Block Nodes

#### Document
No output.

#### Block Quote
```
ENTER: \begin{quote}\n
EXIT:  \end{quote}\n
```

#### List
```
ENTER (bullet):  \begin{itemize}\n
ENTER (ordered): \begin{enumerate}\n
                 \def\labelenumI{COUNTER}\n  (if start != 1)
                 \setcounter{enumI}{START-1}\n
EXIT:            \end{itemize}\n or \end{enumerate}\n
```

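The `COUNTER` label is built from the enumeration level (`enumI` above stands for that level's counter name):
- Level 1: `\arabic{enumi}.`
- Level 2: `\alph{enumii}.` (surrounded by parentheses)
- Level 3: `\roman{enumiii}.`
- Level 4: `\Alph{enumiv}.`

The trailing delimiter follows the list's delimiter type: `.` for period-delimited lists, `)` for parenthesis-delimited lists.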

#### Item
```
ENTER: \item{}   (empty braces prevent ligatures with following content)
EXIT:  \n
```

#### Heading
```
ENTER: \section{  or  \subsection{  or  \subsubsection{  or  \paragraph{  or  \subparagraph{
EXIT:  }\n
```

Mapping: level 1 → `\section`, level 2 → `\subsection`, level 3 → `\subsubsection`, level 4 → `\paragraph`, level 5 → `\subparagraph`.
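
The mapping amounts to a small lookup, sketched below. `heading_command` is a hypothetical helper name; the mapping above covers only levels 1 to 5, so the sketch returns `NULL` for any other level rather than guessing what the renderer does.

```c
#include <stddef.h>

/* Returns the LaTeX sectioning command for a heading level, or NULL
 * for levels outside the 1-5 range covered by the mapping. */
static const char *heading_command(int level) {
  switch (level) {
  case 1:  return "\\section";
  case 2:  return "\\subsection";
  case 3:  return "\\subsubsection";
  case 4:  return "\\paragraph";
  case 5:  return "\\subparagraph";
  default: return NULL;
  }
}
```

A renderer built on this would emit the returned command followed by `{` on entry and `}` plus a newline on exit.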

#### Code Block
```latex
\begin{verbatim}
LITERAL CONTENT
\end{verbatim}
```

The content is output in `LITERAL` escape mode (no character escaping). Info strings are ignored.

#### HTML Block
```
ENTER: % raw HTML omitted\n  (as a LaTeX comment)
```

Raw HTML is always omitted in LaTeX output, regardless of `CMARK_OPT_UNSAFE`.

#### Thematic Break
```
\begin{center}\rule{0.5\linewidth}{\linethickness}\end{center}\n
```

#### Paragraph
Same tight-list check as the HTML renderer:
```c
parent = cmark_node_parent(node);
grandparent = cmark_node_parent(parent);
tight = (grandparent && grandparent->type == CMARK_NODE_LIST) ?
        grandparent->as.list.tight : false;
```
- Normal: newline before and after
- Tight: no leading/trailing blank lines

### Inline Nodes

#### Text
Output with NORMAL escaping.

#### Soft Break
Depends on options:
- `CMARK_OPT_HARDBREAKS`: `\\` followed by a newline
- `CMARK_OPT_NOBREAKS`: space
- Default: newline
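
The option handling can be sketched as a small selector. The flag constants below are stand-ins defined for this example (real code would test `CMARK_OPT_HARDBREAKS` and `CMARK_OPT_NOBREAKS` from `cmark.h`), `softbreak_output` is a hypothetical name, and checking the hard-break flag first when both are set is an assumption based on the order of the list above. Note that the C literal `"\\\\\n"` denotes two backslashes followed by a newline.

```c
/* Stand-in option flags for this sketch only; the values are arbitrary
 * and not those defined in cmark.h. */
enum { SK_OPT_HARDBREAKS = 1 << 0, SK_OPT_NOBREAKS = 1 << 1 };

/* Returns the text emitted for a soft break under the given options. */
static const char *softbreak_output(int options) {
  if (options & SK_OPT_HARDBREAKS)
    return "\\\\\n"; /* LaTeX line break (two backslashes), then newline */
  if (options & SK_OPT_NOBREAKS)
    return " ";      /* join the lines with a single space */
  return "\n";       /* default: keep the source line break */
}
```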

#### Line Break
```
\\\n
```

#### Code (inline)
```
\texttt{ESCAPED CONTENT}
```

Special handling: Code content is output character-by-character with inline-code escaping. Special characters (`\`, `{`, `}`, `$`, `%`, `&`, `_`, `#`, `^`, `~`) are escaped.

#### Emphasis
```
ENTER: \emph{
EXIT:  }
```

#### Strong
```
ENTER: \textbf{
EXIT:  }
```

#### Link
Rendering depends on link type:

**NORMAL_LINK:**
```
ENTER: \href{URL}{
EXIT:  }
```

**URL_AUTOLINK:**
```
ENTER: \url{URL}
(children are skipped — no EXIT rendering needed)
```

**EMAIL_AUTOLINK:**
```
ENTER: \href{URL}{\nolinkurl{
EXIT:  }}
```

**INTERNAL_LINK:**
```
ENTER: (nothing — rendered as plain text)
EXIT:  (~\ref{LABEL})
```

Where `LABEL` is the URL with the leading `#` stripped.

**NO_LINK:**
No output.

#### Image
```
ENTER: \protect\includegraphics{URL}
```

Image children (alt text) are skipped. If `CMARK_OPT_UNSAFE` is not set and the URL matches `_scan_dangerous_url()`, the URL is omitted.

#### HTML Inline
```
% raw HTML omitted
```

Always omitted, regardless of `CMARK_OPT_UNSAFE`.

## Source Position Comments

When `CMARK_OPT_SOURCEPOS` is set, the renderer adds LaTeX comments before block elements:

```c
snprintf(buffer, BUFFER_SIZE, "%% %d:%d-%d:%d\n",
         cmark_node_get_start_line(node), cmark_node_get_start_column(node),
         cmark_node_get_end_line(node), cmark_node_get_end_column(node));
```
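
To make the resulting comment format concrete, the sketch below wraps the same format string in a standalone function that takes the four positions as plain integers. `format_sourcepos` is a hypothetical name; in the renderer the values come from the `cmark_node_get_*` accessors shown above.

```c
#include <stdio.h>

/* Writes a source-position comment of the form
 * "% startline:startcol-endline:endcol\n" into `buf`.
 * Returns snprintf's result: the length the full comment needs. */
static int format_sourcepos(char *buf, size_t cap, int start_line,
                            int start_col, int end_line, int end_col) {
  /* "%%" produces the single '%' that starts a LaTeX comment. */
  return snprintf(buf, cap, "%% %d:%d-%d:%d\n", start_line, start_col,
                  end_line, end_col);
}
```

A block spanning line 3, column 1 through line 5, column 12 would therefore be preceded by the line `% 3:1-5:12`.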

## Example Output

Markdown input:
```markdown
# Hello World

A paragraph with *emphasis* and **bold**.

- Item 1
- Item 2
```

LaTeX output:
```latex
\section{Hello World}

A paragraph with \emph{emphasis} and \textbf{bold}.

\begin{itemize}
\item{}Item 1

\item{}Item 2

\end{itemize}
```

## Cross-References

- [latex.c](../../cmark/src/latex.c) — Full implementation
- [render-framework.md](render-framework.md) — Generic render framework (`cmark_render()`, `cmark_renderer`)
- [public-api.md](public-api.md) — `cmark_render_latex()` API docs
- [html-renderer.md](html-renderer.md) — Contrast with direct buffer renderer