# json4cpp — Parsing Internals

## Parser Architecture

The parsing pipeline consists of three stages:

```
Input → InputAdapter → Lexer → Parser → JSON value
                                  ↓
                           SAX Handler
```

1. **Input adapters** normalize various input sources into a uniform byte stream
2. **Lexer** tokenizes the byte stream into JSON tokens
3. **Parser** implements a recursive descent parser driven by SAX events

## Input Adapters

Defined in `include/nlohmann/detail/input/input_adapters.hpp`.

### Adapter Hierarchy

```cpp
// File input
class file_input_adapter {
    std::FILE* m_file;
    std::char_traits<char>::int_type get_character();
};

// Stream input
class input_stream_adapter {
    std::istream* is;
    std::streambuf* sb;
    std::char_traits<char>::int_type get_character();
};

// Iterator-based input
template<typename IteratorType>
class iterator_input_adapter {
    IteratorType current;
    IteratorType end;
    std::char_traits<char>::int_type get_character();
};
```

All adapters expose a `get_character()` method that returns the next byte
or `std::char_traits<char>::eof()` at end of input.

### `input_adapter()` Factory

The free function `input_adapter()` selects the appropriate adapter:

```cpp
// From string/string_view
auto adapter = input_adapter(std::string("{}"));

// From iterators
auto adapter = input_adapter(vec.begin(), vec.end());

// From stream
auto adapter = input_adapter(std::cin);
```

### Span Input Adapter

For contiguous memory (C++17):

```cpp
template<typename CharT>
class contiguous_bytes_input_adapter {
    const CharT* current;
    const CharT* end;
};
```

This is the fastest adapter since it reads directly from memory without
virtual dispatch.

## Lexer

Defined in `include/nlohmann/detail/input/lexer.hpp`. The lexer
(scanner/tokenizer) converts a byte stream into a sequence of tokens.

### Token Types

```cpp
enum class token_type
{
    uninitialized,       ///< indicating the scanner is uninitialized
    literal_true,        ///< the 'true' literal
    literal_false,       ///< the 'false' literal
    literal_null,        ///< the 'null' literal
    value_string,        ///< a string (includes the quotes)
    value_unsigned,      ///< an unsigned integer
    value_integer,       ///< a signed integer
    value_float,         ///< a floating-point number
    begin_array,         ///< the character '['
    begin_object,        ///< the character '{'
    end_array,           ///< the character ']'
    end_object,          ///< the character '}'
    name_separator,      ///< the character ':'
    value_separator,     ///< the character ','
    parse_error,         ///< indicating a parse error
    end_of_input         ///< indicating the end of the input buffer
};
```

### Lexer Class

```cpp
template<typename BasicJsonType, typename InputAdapterType>
class lexer : public lexer_base<BasicJsonType>
{
public:
    using number_integer_t  = typename BasicJsonType::number_integer_t;
    using number_unsigned_t = typename BasicJsonType::number_unsigned_t;
    using number_float_t    = typename BasicJsonType::number_float_t;
    using string_t          = typename BasicJsonType::string_t;

    // Main scanning entry point
    token_type scan();

    // Access scanned values
    constexpr number_integer_t  get_number_integer()  const noexcept;
    constexpr number_unsigned_t get_number_unsigned() const noexcept;
    constexpr number_float_t    get_number_float()    const noexcept;
    string_t& get_string();

    // Error information
    constexpr position_t get_position() const noexcept;
    std::string get_token_string() const;
    const std::string& get_error_message() const noexcept;

private:
    InputAdapterType ia;           // input source
    char_int_type current;         // current character
    bool next_unget = false;       // lookahead flag
    position_t position {};        // line/column tracking
    std::vector<char_type> token_string {};  // raw token for error messages
    string_t token_buffer {};      // decoded string value
    // Number storage (only one is valid at a time)
    number_integer_t  value_integer  = 0;
    number_unsigned_t value_unsigned = 0;
    number_float_t    value_float    = 0;
};
```

### Position Tracking

```cpp
struct position_t
{
    std::size_t chars_read_total = 0;       // total characters read
    std::size_t chars_read_current_line = 0; // characters on current line
    std::size_t lines_read = 0;              // lines read (newline count)
};
```

### String Scanning

The `scan_string()` method handles:
- Regular characters
- Escape sequences: `\"`, `\\`, `\/`, `\b`, `\f`, `\n`, `\r`, `\t`
- Unicode escapes: `\uXXXX` (including surrogate pairs for `\uD800`–`\uDBFF` + `\uDC00`–`\uDFFF`)
- UTF-8 validation using a state machine

### Number Scanning

The `scan_number()` method determines the number type:

1. Parse sign (optional `-`)
2. Parse integer part
3. If `.` follows → parse fractional part → `value_float`
4. If `e`/`E` follows → parse exponent → `value_float`
5. Otherwise, try to fit into `number_integer_t` or `number_unsigned_t`

The method first accumulates the raw characters, then converts:
- Integers: `std::strtoull` / `std::strtoll`
- Floats: `std::strtod`

### Comment Scanning

When `ignore_comments` is enabled:

```cpp
bool scan_comment() {
    // After seeing '/', check next char:
    // '/' → scan to end of line (C++ comment)
    // '*' → scan to '*/' (C comment)
}
```

## Parser

Defined in `include/nlohmann/detail/input/parser.hpp`. Implements a
**recursive descent** parser that generates SAX events.

### Parser Class

```cpp
template<typename BasicJsonType, typename InputAdapterType>
class parser
{
public:
    using number_integer_t  = typename BasicJsonType::number_integer_t;
    using number_unsigned_t = typename BasicJsonType::number_unsigned_t;
    using number_float_t    = typename BasicJsonType::number_float_t;
    using string_t          = typename BasicJsonType::string_t;
    using lexer_t           = lexer<BasicJsonType, InputAdapterType>;

    parser(InputAdapterType&& adapter,
           const parser_callback_t<BasicJsonType> cb = nullptr,
           const bool allow_exceptions_ = true,
           const bool skip_comments = false,
           const bool ignore_trailing_commas_ = false);

    void parse(const bool strict, BasicJsonType& result);
    bool accept(const bool strict = true);

    template<typename SAX>
    bool sax_parse(SAX* sax, const bool strict = true);

private:
    template<typename SAX>
    bool sax_parse_internal(SAX* sax);

    lexer_t m_lexer;
    token_type last_token = token_type::uninitialized;
    bool allow_exceptions;
    bool ignore_trailing_commas;
};
```

### Recursive Descent Grammar

The parser implements the JSON grammar:

```
json      → value
value     → object | array | string | number | "true" | "false" | "null"
object    → '{' (pair (',' pair)* ','?)? '}'
pair      → string ':' value
array     → '[' (value (',' value)* ','?)? ']'
```

The trailing comma handling is optional (controlled by
`ignore_trailing_commas`).

### SAX-Driven Parsing

The parser calls SAX handler methods as it encounters JSON structure:

```cpp
template<typename SAX>
bool sax_parse_internal(SAX* sax)
{
    switch (last_token) {
        case token_type::begin_object:
            // 1. sax->start_object(...)
            // 2. For each key-value:
            //    a. sax->key(string)
            //    b. recurse into sax_parse_internal for value
            // 3. sax->end_object()
            break;
        case token_type::begin_array:
            // 1. sax->start_array(...)
            // 2. For each element: recurse into sax_parse_internal
            // 3. sax->end_array()
            break;
        case token_type::value_string:
            return sax->string(m_lexer.get_string());
        case token_type::value_unsigned:
            return sax->number_unsigned(m_lexer.get_number_unsigned());
        case token_type::value_integer:
            return sax->number_integer(m_lexer.get_number_integer());
        case token_type::value_float:
            // Check for NaN: store as null
            return sax->number_float(m_lexer.get_number_float(), ...);
        case token_type::literal_true:
            return sax->boolean(true);
        case token_type::literal_false:
            return sax->boolean(false);
        case token_type::literal_null:
            return sax->null();
        default:
            return sax->parse_error(...);
    }
}
```

### DOM Construction

Two SAX handlers build the DOM tree:

#### `json_sax_dom_parser`

Standard DOM builder. Each SAX event creates or appends to the JSON tree:

```cpp
template<typename BasicJsonType>
class json_sax_dom_parser
{
    BasicJsonType& root;
    std::vector<BasicJsonType*> ref_stack;  // stack of parent nodes
    BasicJsonType* object_element = nullptr;
    bool errored = false;
    bool allow_exceptions;

    bool null();
    bool boolean(bool val);
    bool number_integer(number_integer_t val);
    bool number_unsigned(number_unsigned_t val);
    bool number_float(number_float_t val, const string_t& s);
    bool string(string_t& val);
    bool binary(binary_t& val);
    bool start_object(std::size_t elements);
    bool end_object();
    bool start_array(std::size_t elements);
    bool end_array();
    bool key(string_t& val);
    bool parse_error(std::size_t position, const std::string& last_token,
                     const detail::exception& ex);
};
```

The `ref_stack` tracks the current nesting path. On `start_object()` /
`start_array()`, a new container is pushed. On `end_object()` /
`end_array()`, the stack is popped.

#### `json_sax_dom_callback_parser`

Extends the DOM builder with callback support. When the callback returns
`false`, the value is discarded:

```cpp
template<typename BasicJsonType>
class json_sax_dom_callback_parser
{
    BasicJsonType& root;
    std::vector<BasicJsonType*> ref_stack;
    std::vector<bool> keep_stack;   // tracks which values to keep
    std::vector<bool> key_keep_stack;
    BasicJsonType* object_element = nullptr;
    BasicJsonType discarded = BasicJsonType::value_t::discarded;
    parser_callback_t<BasicJsonType> callback;
    bool errored = false;
    bool allow_exceptions;
};
```

## `accept()` Method

The `accept()` method checks validity without building a DOM:

```cpp
bool accept(const bool strict = true);
```

Internally it uses `json_sax_acceptor` — a SAX handler where all methods
return `true` (accepting everything) and `parse_error()` returns `false`:

```cpp
template<typename BasicJsonType>
struct json_sax_acceptor
{
    bool null() { return true; }
    bool boolean(bool) { return true; }
    bool number_integer(number_integer_t) { return true; }
    // ... all return true ...
    bool parse_error(...) { return false; }
};
```

## `sax_parse()` — Static SAX Entry Point

```cpp
template<typename InputType, typename SAX>
static bool sax_parse(InputType&& i, SAX* sax,
                      input_format_t format = input_format_t::json,
                      const bool strict = true,
                      const bool ignore_comments = false,
                      const bool ignore_trailing_commas = false);
```

The `input_format_t` enum selects the parser:

```cpp
enum class input_format_t {
    json,
    cbor,
    msgpack,
    ubjson,
    bson,
    bjdata
};
```

For `json`, the text parser is used. For binary formats, the
`binary_reader` is used (which also generates SAX events).

## Error Reporting

### Parse Error Format

```
[json.exception.parse_error.101] parse error at line 3, column 5:
syntax error while parsing object key - unexpected end of input;
expected string literal
```

The error message includes:
- Exception ID (e.g., 101)
- Position (line and column, or byte offset)
- Description of what was expected vs. what was found
- The last token read (for context)

### Error IDs

| ID | Condition |
|---|---|
| 101 | Unexpected token |
| 102 | `\u` escape with invalid hex digits |
| 103 | Invalid UTF-8 surrogate pair |
| 104 | JSON Patch: invalid patch document |
| 105 | JSON Patch: missing required field |
| 106 | Invalid number format |
| 107 | Invalid JSON Pointer syntax |
| 108 | Invalid Unicode code point |
| 109 | Invalid UTF-8 byte sequence |
| 110 | Unrecognized CBOR/MessagePack/UBJSON/BSON marker |
| 112 | Parse error in BSON |
| 113 | Parse error in UBJSON |
| 114 | Parse error in BJData |
| 115 | Parse error due to incomplete binary data |

### Diagnostic Positions

When `JSON_DIAGNOSTIC_POSITIONS` is enabled at compile time, the library
tracks byte positions for each value. Error messages then include
`start_position` and `end_position` for the offending value:

```cpp
#define JSON_DIAGNOSTICS 1
#define JSON_DIAGNOSTIC_POSITIONS 1
```

## Parser Callback Events

The parser callback receives events defined by `parse_event_t`:

```cpp
enum class parse_event_t : std::uint8_t
{
    object_start,   // '{' read
    object_end,     // '}' read
    array_start,    // '[' read
    array_end,      // ']' read
    key,            // object key read
    value           // value read
};
```

Callback invocation points in the parser:
1. `object_start` — after `{` is consumed, before any key
2. `key` — after a key string is consumed, `parsed` = the key string
3. `value` — after any value is fully parsed, `parsed` = the value
4. `object_end` — after `}` is consumed, `parsed` = the complete object
5. `array_start` — after `[` is consumed, before any element
6. `array_end` — after `]` is consumed, `parsed` = the complete array

### Callback Return Value

- `true` → keep the value
- `false` → discard (replace with `discarded`)

For container events (`object_start`, `array_start`), returning `false`
skips the **entire** container and all its contents.

## Performance Characteristics

| Stage | Complexity | Dominant Cost |
|---|---|---|
| Input adapter | O(n) | Single pass over input |
| Lexer | O(n) | Character-by-character scan, string copy |
| Parser | O(n) | Recursive descent, SAX event dispatch |
| DOM construction | O(n) | Memory allocation for containers |

The overall parsing complexity is O(n) in the input size. Memory usage is
proportional to the nesting depth (parser stack) plus the size of the
resulting DOM (heap allocations for strings, arrays, objects).

For large inputs where the full DOM is not needed, using the SAX interface
directly avoids DOM construction overhead entirely.