1 files changed, 493 insertions, 0 deletions
diff --git a/docs/handbook/json4cpp/parsing-internals.md b/docs/handbook/json4cpp/parsing-internals.md
new file mode 100644
index 0000000000..ecbc946dee
--- /dev/null
+++ b/docs/handbook/json4cpp/parsing-internals.md
@@ -0,0 +1,493 @@
+# json4cpp — Parsing Internals
+
+## Parser Architecture
+
+The parsing pipeline consists of three stages:
+
+```
+Input → InputAdapter → Lexer → Parser → JSON value
+                                  ↓
+                           SAX Handler
+```
+
+1. **Input adapters** normalize various input sources into a uniform byte stream
+2. **Lexer** tokenizes the byte stream into JSON tokens
+3. **Parser** consumes tokens, checks them against the JSON grammar, and emits SAX events
+
+## Input Adapters
+
+Defined in `include/nlohmann/detail/input/input_adapters.hpp`.
+
+### Adapter Hierarchy
+
+```cpp
+// File input
+class file_input_adapter {
+    std::FILE* m_file;
+    std::char_traits<char>::int_type get_character();
+};
+
+// Stream input
+class input_stream_adapter {
+    std::istream* is;
+    std::streambuf* sb;
+    std::char_traits<char>::int_type get_character();
+};
+
+// Iterator-based input
+template<typename IteratorType>
+class iterator_input_adapter {
+    IteratorType current;
+    IteratorType end;
+    std::char_traits<char>::int_type get_character();
+};
+```
+
+All adapters expose a `get_character()` method that returns the next byte
+or `std::char_traits<char>::eof()` at end of input.
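+
+The contract can be illustrated with a small stand-alone adapter. This is
+a sketch rather than the library's code: `toy_string_adapter` and
+`count_bytes` are hypothetical names, and only the `get_character()`
+convention is taken from the description above.
+
+```cpp
+#include <cstddef>
+#include <string>
+
+// Hypothetical adapter over a std::string; it mirrors only the
+// get_character() contract, not the real adapter classes.
+class toy_string_adapter {
+public:
+    using int_type = std::char_traits<char>::int_type;
+
+    // The string must outlive the adapter: only pointers are stored.
+    explicit toy_string_adapter(const std::string& s)
+        : current(s.data()), end(s.data() + s.size()) {}
+
+    // Returns the next byte as a non-negative int_type, or eof() when exhausted.
+    int_type get_character() {
+        if (current == end) {
+            return std::char_traits<char>::eof();
+        }
+        return std::char_traits<char>::to_int_type(*current++);
+    }
+
+private:
+    const char* current;
+    const char* end;
+};
+
+// Drains any adapter with this interface and counts the bytes it produced.
+template <typename Adapter>
+std::size_t count_bytes(Adapter adapter) {
+    std::size_t n = 0;
+    while (adapter.get_character() != std::char_traits<char>::eof()) {
+        ++n;
+    }
+    return n;
+}
+```
+
+Returning `int_type` instead of `char` is what lets a `0xFF` byte be told
+apart from the end-of-input marker.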
+
+### `input_adapter()` Factory
+
+The free function `input_adapter()` selects the appropriate adapter:
+
+```cpp
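+// input_adapter() lives in namespace nlohmann::detail
+using nlohmann::detail::input_adapter;
+
+// From a string / string_view (the string must outlive the adapter)
+const std::string text = "{}";
+auto from_string = input_adapter(text);
+
+// From an iterator pair (vec is, for example, a std::vector<char>)
+auto from_range = input_adapter(vec.begin(), vec.end());
+
+// From a stream
+auto from_stream = input_adapter(std::cin);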
+```
+
+### Span Input Adapter
+
+For contiguous memory (C++17):
+
+```cpp
+template<typename CharT>
+class contiguous_bytes_input_adapter {
+    const CharT* current;
+    const CharT* end;
+};
+```
+
+This is typically the cheapest adapter: it reads straight from memory
+through raw pointers, with no stream or `FILE*` call per character.
+
+## Lexer
+
+Defined in `include/nlohmann/detail/input/lexer.hpp`. The lexer
+(scanner/tokenizer) converts a byte stream into a sequence of tokens.
+
+### Token Types
+
+```cpp
+enum class token_type
+{
+    uninitialized,       ///< indicating the scanner is uninitialized
+    literal_true,        ///< the 'true' literal
+    literal_false,       ///< the 'false' literal
+    literal_null,        ///< the 'null' literal
+    value_string,        ///< a string (get_string() returns the decoded value, without quotes)
+    value_unsigned,      ///< an unsigned integer
+    value_integer,       ///< a signed integer
+    value_float,         ///< a floating-point number
+    begin_array,         ///< the character '['
+    begin_object,        ///< the character '{'
+    end_array,           ///< the character ']'
+    end_object,          ///< the character '}'
+    name_separator,      ///< the character ':'
+    value_separator,     ///< the character ','
+    parse_error,         ///< indicating a parse error
+    end_of_input,        ///< indicating the end of the input buffer
+    literal_or_value     ///< a literal or the beginning of a value (diagnostics only)
+};
+```
+
+### Lexer Class
+
+```cpp
+template<typename BasicJsonType, typename InputAdapterType>
+class lexer : public lexer_base<BasicJsonType>
+{
+public:
+    using number_integer_t  = typename BasicJsonType::number_integer_t;
+    using number_unsigned_t = typename BasicJsonType::number_unsigned_t;
+    using number_float_t    = typename BasicJsonType::number_float_t;
+    using string_t          = typename BasicJsonType::string_t;
+
+    // Main scanning entry point
+    token_type scan();
+
+    // Access scanned values
+    constexpr number_integer_t  get_number_integer()  const noexcept;
+    constexpr number_unsigned_t get_number_unsigned() const noexcept;
+    constexpr number_float_t    get_number_float()    const noexcept;
+    string_t& get_string();
+
+    // Error information
+    constexpr position_t get_position() const noexcept;
+    std::string get_token_string() const;
+    constexpr const char* get_error_message() const noexcept;
+
+private:
+    InputAdapterType ia;           // input source
+    char_int_type current;         // current character
+    bool next_unget = false;       // lookahead flag
+    position_t position {};        // line/column tracking
+    std::vector<char_type> token_string {};  // raw token for error messages
+    string_t token_buffer {};      // decoded string value
+    // Number storage (only one is valid at a time)
+    number_integer_t  value_integer  = 0;
+    number_unsigned_t value_unsigned = 0;
+    number_float_t    value_float    = 0;
+};
+```
+
+### Position Tracking
+
+```cpp
+struct position_t
+{
+    std::size_t chars_read_total = 0;       // total characters read
+    std::size_t chars_read_current_line = 0; // characters on current line
+    std::size_t lines_read = 0;              // lines read (newline count)
+};
+```
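+
+The counters can be kept up to date with a few increments per character.
+The sketch below assumes that update rule and a 1-based line number in
+messages; `advance`, `position_after`, and `line_number` are hypothetical
+helpers, not library functions.
+
+```cpp
+#include <cstddef>
+#include <string>
+
+struct position_t {
+    std::size_t chars_read_total = 0;
+    std::size_t chars_read_current_line = 0;
+    std::size_t lines_read = 0;
+};
+
+// Hypothetical helper: account for one consumed character.
+inline void advance(position_t& pos, char c) {
+    ++pos.chars_read_total;
+    ++pos.chars_read_current_line;
+    if (c == '\n') {
+        ++pos.lines_read;
+        pos.chars_read_current_line = 0;  // column restarts after a newline
+    }
+}
+
+// Hypothetical helper: position after consuming the first n characters.
+inline position_t position_after(const std::string& text, std::size_t n) {
+    position_t pos;
+    for (std::size_t i = 0; i < n && i < text.size(); ++i) {
+        advance(pos, text[i]);
+    }
+    return pos;
+}
+
+// 1-based line number, assuming messages count lines from 1.
+inline std::size_t line_number(const position_t& pos) {
+    return pos.lines_read + 1;
+}
+```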
+
+### String Scanning
+
+The `scan_string()` method handles:
+- Regular characters
+- Escape sequences: `\"`, `\\`, `\/`, `\b`, `\f`, `\n`, `\r`, `\t`
+- Unicode escapes: `\uXXXX` (including surrogate pairs for `\uD800`–`\uDBFF` + `\uDC00`–`\uDFFF`)
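+- UTF-8 validation: each multi-byte sequence is checked against the permitted byte ranges
+- Rejection of unescaped control characters (U+0000–U+001F)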
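+
+The surrogate-pair case is the least obvious of these. The following
+sketch shows the standard arithmetic for combining a pair into one code
+point and encoding it as UTF-8; `combine_surrogates` and `to_utf8` are
+hypothetical names chosen for illustration, not functions of the lexer.
+
+```cpp
+#include <cassert>
+#include <cstdint>
+#include <string>
+
+// Combines a high (\uD800-\uDBFF) and a low (\uDC00-\uDFFF) surrogate.
+inline std::uint32_t combine_surrogates(std::uint32_t high, std::uint32_t low) {
+    assert(high >= 0xD800 && high <= 0xDBFF);
+    assert(low >= 0xDC00 && low <= 0xDFFF);
+    return 0x10000u + ((high - 0xD800u) << 10) + (low - 0xDC00u);
+}
+
+// Encodes a code point (at most 0x10FFFF) as a UTF-8 byte sequence.
+inline std::string to_utf8(std::uint32_t cp) {
+    std::string out;
+    if (cp < 0x80) {
+        out += static_cast<char>(cp);
+    } else if (cp < 0x800) {
+        out += static_cast<char>(0xC0 | (cp >> 6));
+        out += static_cast<char>(0x80 | (cp & 0x3F));
+    } else if (cp < 0x10000) {
+        out += static_cast<char>(0xE0 | (cp >> 12));
+        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
+        out += static_cast<char>(0x80 | (cp & 0x3F));
+    } else {
+        out += static_cast<char>(0xF0 | (cp >> 18));
+        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
+        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
+        out += static_cast<char>(0x80 | (cp & 0x3F));
+    }
+    return out;
+}
+```
+
+For example, the escape `\uD83D\uDE00` combines to U+1F600, which is
+stored in the token buffer as the four bytes `F0 9F 98 80`.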
+
+### Number Scanning
+
+The `scan_number()` method determines the number type:
+
+1. Parse sign (optional `-`)
+2. Parse integer part
+3. If `.` follows → parse fractional part → `value_float`
+4. If `e`/`E` follows → parse exponent → `value_float`
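+5. Otherwise the token is an integer: `value_unsigned` without a sign, `value_integer` with a leading `-`; a value that overflows the integer type falls back to `value_float`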
+
+The method first accumulates the raw characters, then converts:
+- Integers: `std::strtoull` / `std::strtoll`
+- Floats: `std::strtod`
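+
+The conversion step can be sketched as follows. This is a simplified
+illustration that assumes the token has already been validated as a JSON
+number and that integers which do not fit are reparsed as floats;
+`number_kind`, `scanned_number`, and `convert_number` are hypothetical
+names, and fixed `long long` / `double` types stand in for the
+configurable number types.
+
+```cpp
+#include <cerrno>
+#include <cstdlib>
+#include <string>
+
+enum class number_kind { unsigned_integer, signed_integer, floating_point };
+
+struct scanned_number {
+    number_kind kind = number_kind::floating_point;
+    unsigned long long u = 0;
+    long long i = 0;
+    double f = 0.0;
+};
+
+// Hypothetical helper: converts an already validated JSON number token.
+inline scanned_number convert_number(const std::string& token) {
+    scanned_number out;
+    const bool is_float = token.find_first_of(".eE") != std::string::npos;
+    char* endptr = nullptr;
+
+    if (!is_float) {
+        errno = 0;
+        if (token[0] == '-') {
+            const long long v = std::strtoll(token.c_str(), &endptr, 10);
+            if (errno == 0) {  // fits into the signed type
+                out.kind = number_kind::signed_integer;
+                out.i = v;
+                return out;
+            }
+        } else {
+            const unsigned long long v = std::strtoull(token.c_str(), &endptr, 10);
+            if (errno == 0) {  // fits into the unsigned type
+                out.kind = number_kind::unsigned_integer;
+                out.u = v;
+                return out;
+            }
+        }
+    }
+
+    // Fractions, exponents, and integers that overflowed end up here.
+    out.kind = number_kind::floating_point;
+    out.f = std::strtod(token.c_str(), &endptr);
+    return out;
+}
+```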
+
+### Comment Scanning
+
+When `ignore_comments` is enabled:
+
+```cpp
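+// Called after a '/' has been read; returns false on a malformed comment.
+bool scan_comment();
+// next char '/' → skip to the end of the line (C++-style comment)
+// next char '*' → skip to the closing "*/" (C-style comment); error if input ends first
+// anything else → error: '/' must be followed by '/' or '*'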
+```
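+
+A stand-alone version of this logic over a `std::string` might look like
+the sketch below. `skip_comment` is a hypothetical stand-in that works on
+an index instead of the lexer's input adapter, so it illustrates the
+branching only.
+
+```cpp
+#include <cstddef>
+#include <string>
+
+// Hypothetical stand-in for scan_comment(): `pos` indexes the character
+// right after the leading '/'. On success `pos` ends up just past the comment.
+inline bool skip_comment(const std::string& in, std::size_t& pos) {
+    if (pos >= in.size()) {
+        return false;  // lone '/' at end of input
+    }
+    if (in[pos] == '/') {  // single-line comment
+        while (pos < in.size() && in[pos] != '\n' && in[pos] != '\r') {
+            ++pos;
+        }
+        return true;  // a line break or the end of input terminates it
+    }
+    if (in[pos] == '*') {  // multi-line comment
+        const std::size_t close = in.find("*/", pos + 1);
+        if (close == std::string::npos) {
+            return false;  // missing closing "*/"
+        }
+        pos = close + 2;
+        return true;
+    }
+    return false;  // '/' not followed by '/' or '*'
+}
+```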
+
+## Parser
+
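+Defined in `include/nlohmann/detail/input/parser.hpp`. The parser consumes
+tokens from the lexer and generates SAX events. Nesting is tracked on an
+explicit stack inside `sax_parse_internal()` rather than through recursive
+calls.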
+
+### Parser Class
+
+```cpp
+template<typename BasicJsonType, typename InputAdapterType>
+class parser
+{
+public:
+    using number_integer_t  = typename BasicJsonType::number_integer_t;
+    using number_unsigned_t = typename BasicJsonType::number_unsigned_t;
+    using number_float_t    = typename BasicJsonType::number_float_t;
+    using string_t          = typename BasicJsonType::string_t;
+    using lexer_t           = lexer<BasicJsonType, InputAdapterType>;
+
+    parser(InputAdapterType&& adapter,
+           const parser_callback_t<BasicJsonType> cb = nullptr,
+           const bool allow_exceptions_ = true,
+           const bool skip_comments = false,
+           const bool ignore_trailing_commas_ = false);
+
+    void parse(const bool strict, BasicJsonType& result);
+    bool accept(const bool strict = true);
+
+    template<typename SAX>
+    bool sax_parse(SAX* sax, const bool strict = true);
+
+private:
+    template<typename SAX>
+    bool sax_parse_internal(SAX* sax);
+
+    lexer_t m_lexer;
+    token_type last_token = token_type::uninitialized;
+    bool allow_exceptions;
+    bool ignore_trailing_commas;
+};
+```
+
+### Recursive Descent Grammar
+
+The parser implements the JSON grammar:
+
+```
+json      → value
+value     → object | array | string | number | "true" | "false" | "null"
+object    → '{' (pair (',' pair)* ','?)? '}'
+pair      → string ':' value
+array     → '[' (value (',' value)* ','?)? ']'
+```
+
+The optional trailing comma (`','?`) is accepted only when
+`ignore_trailing_commas` is set; otherwise it is a parse error.
+
+### SAX-Driven Parsing
+
+The parser calls SAX handler methods as it encounters JSON structure:
+
+```cpp
+template<typename SAX>
+bool sax_parse_internal(SAX* sax)
+{
+    switch (last_token) {
+        case token_type::begin_object:
+            // 1. sax->start_object(...)
+            // 2. For each key-value:
+            //    a. sax->key(string)
+            //    b. parse the value (next loop iteration, not a recursive call)
+            // 3. sax->end_object()
+            break;
+        case token_type::begin_array:
+            // 1. sax->start_array(...)
+            // 2. For each element: parse the value in the same loop
+            // 3. sax->end_array()
+            break;
+        case token_type::value_string:
+            return sax->string(m_lexer.get_string());
+        case token_type::value_unsigned:
+            return sax->number_unsigned(m_lexer.get_number_unsigned());
+        case token_type::value_integer:
+            return sax->number_integer(m_lexer.get_number_integer());
+        case token_type::value_float:
+            // Non-finite result (overflow): reported as out_of_range.406 via sax->parse_error
+            return sax->number_float(m_lexer.get_number_float(), ...);
+        case token_type::literal_true:
+            return sax->boolean(true);
+        case token_type::literal_false:
+            return sax->boolean(false);
+        case token_type::literal_null:
+            return sax->null();
+        default:
+            return sax->parse_error(...);
+    }
+}
+```
+
+### DOM Construction
+
+Two SAX handlers build the DOM tree:
+
+#### `json_sax_dom_parser`
+
+Standard DOM builder. Each SAX event creates or appends to the JSON tree:
+
+```cpp
+template<typename BasicJsonType>
+class json_sax_dom_parser
+{
+    BasicJsonType& root;
+    std::vector<BasicJsonType*> ref_stack;  // stack of parent nodes
+    BasicJsonType* object_element = nullptr;
+    bool errored = false;
+    bool allow_exceptions;
+
+    bool null();
+    bool boolean(bool val);
+    bool number_integer(number_integer_t val);
+    bool number_unsigned(number_unsigned_t val);
+    bool number_float(number_float_t val, const string_t& s);
+    bool string(string_t& val);
+    bool binary(binary_t& val);
+    bool start_object(std::size_t elements);
+    bool end_object();
+    bool start_array(std::size_t elements);
+    bool end_array();
+    bool key(string_t& val);
+    bool parse_error(std::size_t position, const std::string& last_token,
+                     const detail::exception& ex);
+};
+```
+
+The `ref_stack` tracks the current nesting path. On `start_object()` /
+`start_array()`, a new container is pushed. On `end_object()` /
+`end_array()`, the stack is popped.
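+
+The stack discipline can be shown with a deliberately tiny builder that
+understands only arrays and strings. Everything in this sketch
+(`toy_node`, `toy_builder`, `add`) is hypothetical and far simpler than
+`json_sax_dom_parser`; it only demonstrates how a stack of parent
+pointers routes each new value to the right container.
+
+```cpp
+#include <cstddef>
+#include <string>
+#include <utility>
+#include <vector>
+
+// Hypothetical miniature value: a leaf string or an array of nodes.
+struct toy_node {
+    bool is_array = false;
+    std::string leaf;
+    std::vector<toy_node> items;
+};
+
+class toy_builder {
+public:
+    explicit toy_builder(toy_node& r) : root(r) {}
+
+    bool start_array() {
+        toy_node* node = add(toy_node{});
+        node->is_array = true;
+        ref_stack.push_back(node);  // the new container becomes current
+        return true;
+    }
+    bool end_array() {
+        ref_stack.pop_back();       // return to the parent container
+        return true;
+    }
+    bool string(const std::string& s) {
+        add(toy_node{})->leaf = s;
+        return true;
+    }
+    std::size_t depth() const { return ref_stack.size(); }
+
+private:
+    // Stores a value at the root or appends it to the current container.
+    toy_node* add(toy_node v) {
+        if (ref_stack.empty()) {
+            root = std::move(v);
+            return &root;
+        }
+        ref_stack.back()->items.push_back(std::move(v));
+        return &ref_stack.back()->items.back();
+    }
+
+    toy_node& root;
+    std::vector<toy_node*> ref_stack;  // path from the root to the current container
+};
+```
+
+The pointers on the stack stay valid because values are only ever
+appended to the innermost open container, never to one of its ancestors.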
+
+#### `json_sax_dom_callback_parser`
+
+A separate handler with the same interface that additionally invokes the
+user's parser callback. When the callback returns `false`, the value is
+discarded:
+
+```cpp
+template<typename BasicJsonType>
+class json_sax_dom_callback_parser
+{
+    BasicJsonType& root;
+    std::vector<BasicJsonType*> ref_stack;
+    std::vector<bool> keep_stack;   // tracks which values to keep
+    std::vector<bool> key_keep_stack;
+    BasicJsonType* object_element = nullptr;
+    BasicJsonType discarded = BasicJsonType::value_t::discarded;
+    parser_callback_t<BasicJsonType> callback;
+    bool errored = false;
+    bool allow_exceptions;
+};
+```
+
+## `accept()` Method
+
+The `accept()` method checks validity without building a DOM:
+
+```cpp
+bool accept(const bool strict = true);
+```
+
+Internally it uses `json_sax_acceptor`, a SAX handler whose value and
+structure methods all return `true` and whose `parse_error()` returns
+`false`:
+
+```cpp
+template<typename BasicJsonType>
+struct json_sax_acceptor
+{
+    bool null() { return true; }
+    bool boolean(bool) { return true; }
+    bool number_integer(number_integer_t) { return true; }
+    // ... all return true ...
+    bool parse_error(...) { return false; }
+};
+```
+
+## `sax_parse()` — Static SAX Entry Point
+
+```cpp
+template<typename InputType, typename SAX>
+static bool sax_parse(InputType&& i, SAX* sax,
+                      input_format_t format = input_format_t::json,
+                      const bool strict = true,
+                      const bool ignore_comments = false,
+                      const bool ignore_trailing_commas = false);
+```
+
+The `input_format_t` enum selects the parser:
+
+```cpp
+enum class input_format_t {
+    json,
+    cbor,
+    msgpack,
+    ubjson,
+    bson,
+    bjdata
+};
+```
+
+For `json`, the text parser is used. For binary formats, the
+`binary_reader` is used (which also generates SAX events).
+
+## Error Reporting
+
+### Parse Error Format
+
+```
+[json.exception.parse_error.101] parse error at line 3, column 5:
+syntax error while parsing object key - unexpected end of input;
+expected string literal
+```
+
+The error message includes:
+- Exception ID (e.g., 101)
+- Position (line and column, or byte offset)
+- Description of what was expected vs. what was found
+- The last token read (for context)
+
+### Error IDs
+
+| ID | Condition |
+|---|---|
+| 101 | Unexpected token |
+| 102 | `\u` escape forms an incomplete or invalid surrogate pair |
+| 103 | Code point above U+10FFFF |
+| 104 | JSON Patch: document is not an array of objects |
+| 105 | JSON Patch: invalid operation (missing or unknown member) |
+| 106 | JSON Pointer: array index with a leading zero |
+| 107 | JSON Pointer: non-empty pointer does not begin with `/` |
+| 108 | JSON Pointer: `~` not followed by `0` or `1` |
+| 109 | JSON Pointer: array index is not a number |
+| 110 | Binary format: unexpected end of input |
+| 112 | Binary format: unsupported or unknown byte |
+| 113 | Binary format: map key is not a string |
+| 114 | BSON: unsupported record type |
+| 115 | UBJSON: high-precision number could not be parsed |
+
+### Diagnostic Positions
+
+When `JSON_DIAGNOSTIC_POSITIONS` is enabled at compile time, the library
+tracks byte positions for each value. Error messages then include
+`start_position` and `end_position` for the offending value:
+
+```cpp
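+// Define both macros before the first include of the library header
+#define JSON_DIAGNOSTICS 1
+#define JSON_DIAGNOSTIC_POSITIONS 1
+#include <nlohmann/json.hpp>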
+```
+
+## Parser Callback Events
+
+The parser callback has the signature
+`bool(int depth, parse_event_t event, BasicJsonType& parsed)` and receives
+events defined by `parse_event_t`:
+
+```cpp
+enum class parse_event_t : std::uint8_t
+{
+    object_start,   // '{' read
+    object_end,     // '}' read
+    array_start,    // '[' read
+    array_end,      // ']' read
+    key,            // object key read
+    value           // value read
+};
+```
+
+Callback invocation points in the parser:
+1. `object_start` — after `{` is consumed, before any key
+2. `key` — after a key string is consumed, `parsed` = the key string
+3. `value` — after any value is fully parsed, `parsed` = the value
+4. `object_end` — after `}` is consumed, `parsed` = the complete object
+5. `array_start` — after `[` is consumed, before any element
+6. `array_end` — after `]` is consumed, `parsed` = the complete array
+
+### Callback Return Value
+
+- `true` → keep the value
+- `false` → discard (replace with `discarded`)
+
+For container events (`object_start`, `array_start`), returning `false`
+skips the **entire** container and all its contents.
+
+## Performance Characteristics
+
+| Stage | Complexity | Dominant Cost |
+|---|---|---|
+| Input adapter | O(n) | Single pass over input |
+| Lexer | O(n) | Character-by-character scan, string copy |
+| Parser | O(n) | Token dispatch, SAX event dispatch |
+| DOM construction | O(n) | Memory allocation for containers |
+
+The overall parsing complexity is O(n) in the input size. Memory usage is
+proportional to the nesting depth (parser stack) plus the size of the
+resulting DOM (heap allocations for strings, arrays, objects).
+
+For large inputs where the full DOM is not needed, using the SAX interface
+directly avoids DOM construction overhead entirely.