summaryrefslogtreecommitdiff
path: root/docs/handbook/json4cpp/parsing-internals.md
blob: ecbc946dee55081a4c35e7c52fa73ba616693241 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
# json4cpp — Parsing Internals

## Parser Architecture

The parsing pipeline consists of three stages:

```
Input → InputAdapter → Lexer → Parser → JSON value
                                  ↓
                           SAX Handler
```

1. **Input adapters** normalize various input sources into a uniform byte stream
2. **Lexer** tokenizes the byte stream into JSON tokens
3. **Parser** implements a recursive descent parser driven by SAX events

## Input Adapters

Defined in `include/nlohmann/detail/input/input_adapters.hpp`.

### Adapter Hierarchy

```cpp
// File input
class file_input_adapter {
    std::FILE* m_file;
    std::char_traits<char>::int_type get_character();
};

// Stream input
class input_stream_adapter {
    std::istream* is;
    std::streambuf* sb;
    std::char_traits<char>::int_type get_character();
};

// Iterator-based input
template<typename IteratorType>
class iterator_input_adapter {
    IteratorType current;
    IteratorType end;
    std::char_traits<char>::int_type get_character();
};
```

All adapters expose a `get_character()` method that returns the next byte
or `std::char_traits<char>::eof()` at end of input.

### `input_adapter()` Factory

The free function `input_adapter()` selects the appropriate adapter:

```cpp
// From string/string_view
auto adapter = input_adapter(std::string("{}"));

// From iterators
auto adapter = input_adapter(vec.begin(), vec.end());

// From stream
auto adapter = input_adapter(std::cin);
```

### Span Input Adapter

For contiguous memory (C++17):

```cpp
template<typename CharT>
class contiguous_bytes_input_adapter {
    const CharT* current;
    const CharT* end;
};
```

This is the fastest adapter since it reads directly from memory without
virtual dispatch.

## Lexer

Defined in `include/nlohmann/detail/input/lexer.hpp`. The lexer
(scanner/tokenizer) converts a byte stream into a sequence of tokens.

### Token Types

```cpp
enum class token_type
{
    uninitialized,       ///< indicating the scanner is uninitialized
    literal_true,        ///< the 'true' literal
    literal_false,       ///< the 'false' literal
    literal_null,        ///< the 'null' literal
    value_string,        ///< a string (includes the quotes)
    value_unsigned,      ///< an unsigned integer
    value_integer,       ///< a signed integer
    value_float,         ///< a floating-point number
    begin_array,         ///< the character '['
    begin_object,        ///< the character '{'
    end_array,           ///< the character ']'
    end_object,          ///< the character '}'
    name_separator,      ///< the character ':'
    value_separator,     ///< the character ','
    parse_error,         ///< indicating a parse error
    end_of_input         ///< indicating the end of the input buffer
};
```

### Lexer Class

```cpp
template<typename BasicJsonType, typename InputAdapterType>
class lexer : public lexer_base<BasicJsonType>
{
public:
    using number_integer_t  = typename BasicJsonType::number_integer_t;
    using number_unsigned_t = typename BasicJsonType::number_unsigned_t;
    using number_float_t    = typename BasicJsonType::number_float_t;
    using string_t          = typename BasicJsonType::string_t;

    // Main scanning entry point
    token_type scan();

    // Access scanned values
    constexpr number_integer_t  get_number_integer()  const noexcept;
    constexpr number_unsigned_t get_number_unsigned() const noexcept;
    constexpr number_float_t    get_number_float()    const noexcept;
    string_t& get_string();

    // Error information
    constexpr position_t get_position() const noexcept;
    std::string get_token_string() const;
    const std::string& get_error_message() const noexcept;

private:
    InputAdapterType ia;           // input source
    char_int_type current;         // current character
    bool next_unget = false;       // lookahead flag
    position_t position {};        // line/column tracking
    std::vector<char_type> token_string {};  // raw token for error messages
    string_t token_buffer {};      // decoded string value
    // Number storage (only one is valid at a time)
    number_integer_t  value_integer  = 0;
    number_unsigned_t value_unsigned = 0;
    number_float_t    value_float    = 0;
};
```

### Position Tracking

```cpp
struct position_t
{
    std::size_t chars_read_total = 0;       // total characters read
    std::size_t chars_read_current_line = 0; // characters on current line
    std::size_t lines_read = 0;              // lines read (newline count)
};
```

### String Scanning

The `scan_string()` method handles:
- Regular characters
- Escape sequences: `\"`, `\\`, `\/`, `\b`, `\f`, `\n`, `\r`, `\t`
- Unicode escapes: `\uXXXX` (including surrogate pairs for `\uD800`–`\uDBFF` + `\uDC00`–`\uDFFF`)
- UTF-8 validation using a state machine

### Number Scanning

The `scan_number()` method determines the number type:

1. Parse sign (optional `-`)
2. Parse integer part
3. If `.` follows → parse fractional part → `value_float`
4. If `e`/`E` follows → parse exponent → `value_float`
5. Otherwise, try to fit into `number_integer_t` or `number_unsigned_t`

The method first accumulates the raw characters, then converts:
- Integers: `std::strtoull` / `std::strtoll`
- Floats: `std::strtod`

### Comment Scanning

When `ignore_comments` is enabled:

```cpp
bool scan_comment() {
    // After seeing '/', check next char:
    // '/' → scan to end of line (C++ comment)
    // '*' → scan to '*/' (C comment)
}
```

## Parser

Defined in `include/nlohmann/detail/input/parser.hpp`. Implements a
**recursive descent** parser that generates SAX events.

### Parser Class

```cpp
template<typename BasicJsonType, typename InputAdapterType>
class parser
{
public:
    using number_integer_t  = typename BasicJsonType::number_integer_t;
    using number_unsigned_t = typename BasicJsonType::number_unsigned_t;
    using number_float_t    = typename BasicJsonType::number_float_t;
    using string_t          = typename BasicJsonType::string_t;
    using lexer_t           = lexer<BasicJsonType, InputAdapterType>;

    parser(InputAdapterType&& adapter,
           const parser_callback_t<BasicJsonType> cb = nullptr,
           const bool allow_exceptions_ = true,
           const bool skip_comments = false,
           const bool ignore_trailing_commas_ = false);

    void parse(const bool strict, BasicJsonType& result);
    bool accept(const bool strict = true);

    template<typename SAX>
    bool sax_parse(SAX* sax, const bool strict = true);

private:
    template<typename SAX>
    bool sax_parse_internal(SAX* sax);

    lexer_t m_lexer;
    token_type last_token = token_type::uninitialized;
    bool allow_exceptions;
    bool ignore_trailing_commas;
};
```

### Recursive Descent Grammar

The parser implements the JSON grammar:

```
json      → value
value     → object | array | string | number | "true" | "false" | "null"
object    → '{' (pair (',' pair)* ','?)? '}'
pair      → string ':' value
array     → '[' (value (',' value)* ','?)? ']'
```

The trailing comma handling is optional (controlled by
`ignore_trailing_commas`).

### SAX-Driven Parsing

The parser calls SAX handler methods as it encounters JSON structure:

```cpp
template<typename SAX>
bool sax_parse_internal(SAX* sax)
{
    switch (last_token) {
        case token_type::begin_object:
            // 1. sax->start_object(...)
            // 2. For each key-value:
            //    a. sax->key(string)
            //    b. recurse into sax_parse_internal for value
            // 3. sax->end_object()
            break;
        case token_type::begin_array:
            // 1. sax->start_array(...)
            // 2. For each element: recurse into sax_parse_internal
            // 3. sax->end_array()
            break;
        case token_type::value_string:
            return sax->string(m_lexer.get_string());
        case token_type::value_unsigned:
            return sax->number_unsigned(m_lexer.get_number_unsigned());
        case token_type::value_integer:
            return sax->number_integer(m_lexer.get_number_integer());
        case token_type::value_float:
            // Check for NaN: store as null
            return sax->number_float(m_lexer.get_number_float(), ...);
        case token_type::literal_true:
            return sax->boolean(true);
        case token_type::literal_false:
            return sax->boolean(false);
        case token_type::literal_null:
            return sax->null();
        default:
            return sax->parse_error(...);
    }
}
```

### DOM Construction

Two SAX handlers build the DOM tree:

#### `json_sax_dom_parser`

Standard DOM builder. Each SAX event creates or appends to the JSON tree:

```cpp
template<typename BasicJsonType>
class json_sax_dom_parser
{
    BasicJsonType& root;
    std::vector<BasicJsonType*> ref_stack;  // stack of parent nodes
    BasicJsonType* object_element = nullptr;
    bool errored = false;
    bool allow_exceptions;

    bool null();
    bool boolean(bool val);
    bool number_integer(number_integer_t val);
    bool number_unsigned(number_unsigned_t val);
    bool number_float(number_float_t val, const string_t& s);
    bool string(string_t& val);
    bool binary(binary_t& val);
    bool start_object(std::size_t elements);
    bool end_object();
    bool start_array(std::size_t elements);
    bool end_array();
    bool key(string_t& val);
    bool parse_error(std::size_t position, const std::string& last_token,
                     const detail::exception& ex);
};
```

The `ref_stack` tracks the current nesting path. On `start_object()` /
`start_array()`, a new container is pushed. On `end_object()` /
`end_array()`, the stack is popped.

#### `json_sax_dom_callback_parser`

Extends the DOM builder with callback support. When the callback returns
`false`, the value is discarded:

```cpp
template<typename BasicJsonType>
class json_sax_dom_callback_parser
{
    BasicJsonType& root;
    std::vector<BasicJsonType*> ref_stack;
    std::vector<bool> keep_stack;   // tracks which values to keep
    std::vector<bool> key_keep_stack;
    BasicJsonType* object_element = nullptr;
    BasicJsonType discarded = BasicJsonType::value_t::discarded;
    parser_callback_t<BasicJsonType> callback;
    bool errored = false;
    bool allow_exceptions;
};
```

## `accept()` Method

The `accept()` method checks validity without building a DOM:

```cpp
bool accept(const bool strict = true);
```

Internally it uses `json_sax_acceptor` — a SAX handler where all methods
return `true` (accepting everything) and `parse_error()` returns `false`:

```cpp
template<typename BasicJsonType>
struct json_sax_acceptor
{
    bool null() { return true; }
    bool boolean(bool) { return true; }
    bool number_integer(number_integer_t) { return true; }
    // ... all return true ...
    bool parse_error(...) { return false; }
};
```

## `sax_parse()` — Static SAX Entry Point

```cpp
template<typename InputType, typename SAX>
static bool sax_parse(InputType&& i, SAX* sax,
                      input_format_t format = input_format_t::json,
                      const bool strict = true,
                      const bool ignore_comments = false,
                      const bool ignore_trailing_commas = false);
```

The `input_format_t` enum selects the parser:

```cpp
enum class input_format_t {
    json,
    cbor,
    msgpack,
    ubjson,
    bson,
    bjdata
};
```

For `json`, the text parser is used. For binary formats, the
`binary_reader` is used (which also generates SAX events).

## Error Reporting

### Parse Error Format

```
[json.exception.parse_error.101] parse error at line 3, column 5:
syntax error while parsing object key - unexpected end of input;
expected string literal
```

The error message includes:
- Exception ID (e.g., 101)
- Position (line and column, or byte offset)
- Description of what was expected vs. what was found
- The last token read (for context)

### Error IDs

| ID | Condition |
|---|---|
| 101 | Unexpected token |
| 102 | `\u` escape with invalid hex digits |
| 103 | Invalid UTF-8 surrogate pair |
| 104 | JSON Patch: invalid patch document |
| 105 | JSON Patch: missing required field |
| 106 | Invalid number format |
| 107 | Invalid JSON Pointer syntax |
| 108 | Invalid Unicode code point |
| 109 | Invalid UTF-8 byte sequence |
| 110 | Unrecognized CBOR/MessagePack/UBJSON/BSON marker |
| 112 | Parse error in BSON |
| 113 | Parse error in UBJSON |
| 114 | Parse error in BJData |
| 115 | Parse error due to incomplete binary data |

### Diagnostic Positions

When `JSON_DIAGNOSTIC_POSITIONS` is enabled at compile time, the library
tracks byte positions for each value. Error messages then include
`start_position` and `end_position` for the offending value:

```cpp
#define JSON_DIAGNOSTICS 1
#define JSON_DIAGNOSTIC_POSITIONS 1
```

## Parser Callback Events

The parser callback receives events defined by `parse_event_t`:

```cpp
enum class parse_event_t : std::uint8_t
{
    object_start,   // '{' read
    object_end,     // '}' read
    array_start,    // '[' read
    array_end,      // ']' read
    key,            // object key read
    value           // value read
};
```

Callback invocation points in the parser:
1. `object_start` — after `{` is consumed, before any key
2. `key` — after a key string is consumed, `parsed` = the key string
3. `value` — after any value is fully parsed, `parsed` = the value
4. `object_end` — after `}` is consumed, `parsed` = the complete object
5. `array_start` — after `[` is consumed, before any element
6. `array_end` — after `]` is consumed, `parsed` = the complete array

### Callback Return Value

- `true` → keep the value
- `false` → discard (replace with `discarded`)

For container events (`object_start`, `array_start`), returning `false`
skips the **entire** container and all its contents.

## Performance Characteristics

| Stage | Complexity | Dominant Cost |
|---|---|---|
| Input adapter | O(n) | Single pass over input |
| Lexer | O(n) | Character-by-character scan, string copy |
| Parser | O(n) | Recursive descent, SAX event dispatch |
| DOM construction | O(n) | Memory allocation for containers |

The overall parsing complexity is O(n) in the input size. Memory usage is
proportional to the nesting depth (parser stack) plus the size of the
resulting DOM (heap allocations for strings, arrays, objects).

For large inputs where the full DOM is not needed, using the SAX interface
directly avoids DOM construction overhead entirely.