XML to JSON Conversion with xmltodict

Supply chain ingestion still runs on XML. Advanced Shipping Notices (ASNs), purchase-order acknowledgments, carrier status events, and multi-tier inventory manifests arrive as deeply nested XML long after the rest of the procurement stack has standardized on JSON. The engineering challenge is not “parse the file”, since Python ships an XML parser; it is producing a deterministic, JSON-shaped dictionary that survives schema drift, single-element collections, namespace reuse, and 500 MB payloads without crashing the run or silently dropping line items. Get the conversion wrong and the corruption is invisible: same-named nodes from different namespaces merged under one key or filed under a key nothing reads, a one-item shipment that parses as a dict instead of a list, an OrderedDict that a downstream broker cannot serialize.

This page covers when to reach for xmltodict over a full DOM or event parser, how to wire it into a production reconciliation pipeline as the decode stage of Ingestion & Parsing Workflows for Supply Chain Data, and how to recover the feeds that fail. xmltodict is a thin, deterministic bridge over Python’s Expat C extension: it turns hierarchical XML into plain dictionaries without the memory overhead of a retained document tree, which is exactly the contract the downstream Schema Validation Using Pydantic layer expects.

Core Concept & Decision Criteria

xmltodict.parse() walks the document once and builds a dictionary whose keys are tag names, whose attributes become prefixed keys (`@` by default), and whose text content collapses into a #text entry when an element has both attributes and a body. Because it leans on Expat, a non-validating, event-driven (SAX-style) parser, it never materializes a navigable DOM, so peak memory tracks the size of the output dict, not the document plus a parse tree. That makes it the right default for converting well-formed supplier feeds into JSON. It is the wrong tool when you need XPath queries, XSD validation, or surgical in-place edits; reach for lxml there.
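As a minimal sketch of that mapping, the snippet below parses a small hand-made document with xmltodict's default settings. The `Shipment`, `Carrier`, and `Weight` names are illustrative assumptions, not a real supplier schema.

```python
import xmltodict

# Hypothetical payload: an attribute on the root, one element that has both
# an attribute and a text body, and one plain leaf element.
SAMPLE = (
    '<Shipment id="S1">'
    '<Carrier scac="ABCD">Acme Freight</Carrier>'
    "<Weight>12.5</Weight>"
    "</Shipment>"
)

doc = xmltodict.parse(SAMPLE)
shipment = doc["Shipment"]

print(shipment["@id"])               # attribute -> "@"-prefixed key
print(shipment["Carrier"]["#text"])  # text next to an attribute -> "#text"
print(shipment["Weight"])            # plain leaf -> a bare string, not a float
```

Every leaf value comes back as a string, so numeric and date typing is left to the validation layer.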

Two signals decide which parser to use: what you need to do with the data after parsing, and payload size. If you need the whole document as JSON and it fits in memory, parse it whole. If you only need repeated leaf records (every LineItem, every ShipmentDetail) and the document is large, use the streaming callback mode covered below, which never holds more than one record at a time. If you need to validate against a schema or run XPath, the conversion library is the wrong layer entirely.

| Dimension | `xmltodict.parse` (whole) | `xmltodict` streaming (`item_callback`) | `lxml.etree.iterparse` | DOM (`minidom` / `ElementTree`) |
| --- | --- | --- | --- | --- |
| Output shape | JSON-ready `dict` | One `dict` per record | `Element` objects | `Element` tree |
| Peak memory | ~output dict size | Constant (one record) | Constant + manual `clear()` | Full tree, highest |
| Best for | Feeds that fit in RAM | Multi-GB manifests | XPath on huge files | Small docs needing edits |
| XPath / XSD | No | No | Yes | Limited |
| Namespace control | `process_namespaces` | `process_namespaces` | Full QName control | Full |
| Right tool when | JSON is the target | JSON target, huge feed | Query/validate at scale | Mutate a small tree |
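The decision criteria above can be sketched as a small routing function. Everything here is a hypothetical illustration: the `ParseStrategy` enum, the `choose_strategy` name, and the 200 MB default cutover (taken from the low end of the cutover range this page recommends) are assumptions to calibrate against your own worker memory.

```python
from enum import Enum


class ParseStrategy(Enum):
    WHOLE = "xmltodict.parse on the whole document"
    STREAM = "xmltodict.parse with item_depth/item_callback"
    LXML = "lxml for XPath, XSD validation, or in-place edits"


# Assumed cutover; tune per worker memory budget.
STREAM_CUTOVER_BYTES = 200 * 1024 * 1024


def choose_strategy(
    size_bytes: int,
    *,
    needs_xpath_or_xsd: bool = False,
    records_only: bool = False,
    cutover_bytes: int = STREAM_CUTOVER_BYTES,
) -> ParseStrategy:
    """Pick a parser from what happens after parsing, then from payload size."""
    if needs_xpath_or_xsd:
        return ParseStrategy.LXML      # a conversion library is the wrong layer
    if records_only and size_bytes >= cutover_bytes:
        return ParseStrategy.STREAM    # large feed, only leaf records needed
    return ParseStrategy.WHOLE         # fits in memory, JSON is the target


print(choose_strategy(5 * 1024 * 1024))
print(choose_strategy(2 * 1024**3, records_only=True))
```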

For the bulk of procurement ingestion — turning a supplier’s XML into a typed Python object — the first two columns are what you want. This hierarchical normalization is the opposite of the columnar path in Parsing CSV and Excel Feeds with Pandas, where schema alignment is a flat dtype cast rather than a tree walk; XML forces you to decide, per element, whether a node is a scalar, an object, or a collection.

Implementation

The converter below is the decode stage: it takes the payload text, parses with the three production flags that matter (process_namespaces, force_list, and a normalized attribute/CDATA shape), and round-trips through json so any OrderedDict becomes a plain dict before it leaves the function. It logs structured failures instead of raising, so one malformed ASN quarantines itself rather than aborting the batch.

PYTHON

import json
import logging
import xml.parsers.expat
from pathlib import Path
from typing import Any, Dict, List, Optional

import xmltodict

logger = logging.getLogger("supply_chain.ingest.xml")

# Elements that are single in some payloads and repeated in others. Forcing them
# to lists guarantees a stable iterable shape so downstream joins never hit a
# TypeError on a one-item shipment.
DEFAULT_FORCE_LIST: tuple[str, ...] = ("LineItem", "ShipmentDetail", "PackReference")


def safe_xml_to_dict(
    xml_content: str,
    *,
    force_list: Optional[List[str]] = None,
    namespaces: Optional[Dict[str, str]] = None,
) -> Optional[Dict[str, Any]]:
    """Convert one supplier XML payload to a JSON-ready dict, or None on failure.

    Returning None (and logging a structured reason) lets the caller route the
    payload to a dead-letter queue without crashing the surrounding batch.
    """
    try:
        parsed = xmltodict.parse(
            xml_content,
            process_namespaces=True,
            namespaces=namespaces or {},   # collapse known URIs to short prefixes
            attr_prefix="",                # attributes become plain keys
            cdata_key="#text",             # mixed-content body lands here
            force_list=force_list or list(DEFAULT_FORCE_LIST),
        )
    except xml.parsers.expat.ExpatError as exc:
        logger.error(
            "Malformed XML at line %d col %d: %s",
            exc.lineno, exc.offset,
            xml.parsers.expat.ErrorString(exc.code),  # readable reason, not the bare numeric code
            extra={"error_type": "ExpatError"},
        )
        return None
    except UnicodeDecodeError as exc:
        logger.error(
            "Encoding mismatch in supplier payload: %s",
            exc.reason, extra={"error_type": "UnicodeDecodeError"},
        )
        return None

    # Round-trip strips OrderedDict and any custom types so the result is safe to
    # hand to Kafka/RabbitMQ or to a Pydantic model without serializer surprises.
    canonical: Dict[str, Any] = json.loads(json.dumps(parsed))
    logger.info("Parsed XML payload: %d top-level keys", len(canonical))
    return canonical


def load_and_convert(xml_path: Path, **kwargs: Any) -> Optional[Dict[str, Any]]:
    """Read a file as UTF-8 and convert it, isolating I/O errors from parse errors."""
    try:
        content = xml_path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        logger.error("Cannot read %s: %s", xml_path, exc)
        return None
    return safe_xml_to_dict(content, **kwargs)

Three flags carry the weight. process_namespaces=True is non-negotiable when ingesting EDI-adjacent XML where suppliers reuse generic tags like <Item> at different hierarchy levels under different namespace URIs. Without it, keys carry whatever prefix each supplier happened to choose, so the same element can surface under a different key from one payload to the next, and same-named tags bound to different default namespaces become indistinguishable (as siblings they merge into one list). Either way line items go missing or get conflated with no error. force_list pins single-element collections to arrays so a one-line ASN and a fifty-line ASN have the same shape, which is what lets the validation layer iterate unconditionally. attr_prefix="" plus cdata_key="#text" produce the flattest, most JSON-idiomatic output so the model layer is not littered with @-prefixed keys.
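To make the attribute flag concrete, here is a small comparison of the default shape against the flattened one. The `Carrier` element is a made-up example, not taken from a real feed.

```python
import xmltodict

# Hypothetical element carrying both an attribute and a text body.
SAMPLE = '<Carrier scac="ABCD">Acme Freight</Carrier>'

default_shape = xmltodict.parse(SAMPLE)
flat_shape = xmltodict.parse(SAMPLE, attr_prefix="", cdata_key="#text")

print(default_shape)  # attribute key keeps the "@" prefix
print(flat_shape)     # attribute key is the bare attribute name
```

One caveat worth checking against your own feeds: with an empty prefix, an attribute and a child element that share a name land on the same key, so the flat shape is only safe where the vendor's schema keeps those names distinct.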

Configuration & Threshold Calibration

The defaults suit small, hand-written XML; production supplier feeds need explicit calibration per trading-partner tier. The streaming variant below is the large-feed path: item_depth tells xmltodict at which nesting depth a finished element counts as one record, and item_callback fires once per record, so peak memory stays near a single line item regardless of total file size.

PYTHON

import logging
from typing import Any, Dict

import xmltodict

logger = logging.getLogger("supply_chain.ingest.xml.stream")


def route_record_to_staging(record: Dict[str, Any]) -> None:
    """Push one parsed line item to a staging table / queue. Replace with real I/O."""
    ...


def on_line_item(_path: list[tuple], record: Dict[str, Any]) -> bool:
    # Runs per record without buffering the document. Returning False makes
    # xmltodict raise ParsingInterrupted and stops the whole stream, so a
    # routing failure is logged and skipped here instead of aborting.
    try:
        route_record_to_staging(record)
    except Exception:  # never let one bad record kill the stream
        logger.exception("Failed to route record at %s", _path)
    return True


def stream_large_manifest(path: str, item_depth: int = 2) -> None:
    """Constant-memory parse of a multi-GB manifest via Expat callbacks."""
    with open(path, "rb") as fh:        # binary mode: let Expat sniff the encoding
        xmltodict.parse(
            fh,
            item_depth=item_depth,
            item_callback=on_line_item,
            process_namespaces=True,
        )

The streaming path holds memory roughly constant: peak usage approximates the largest single record rather than the whole file. For a manifest of $n$ records, where $\lvert r_i \rvert$ is the in-memory size of record $i$, the in-flight footprint is

$$
M_{\text{peak}} \approx \max_{i \le n} \lvert r_i \rvert \ll \sum_{i=1}^{n} \lvert r_i \rvert
$$

That inequality is the entire reason to stream: the right-hand side is the whole-document cost that OOMs the worker.

| Parameter | Recommended value | Rationale |
| --- | --- | --- |
| `process_namespaces` | `True` (always) | Keys resolve by namespace URI instead of by supplier-chosen prefix, so generic tags reused across namespaces stay distinguishable. |
| `namespaces` | `{uri: short_prefix}` map | Collapse verbose URIs to stable short keys so downstream models are not URI-coupled. |
| `force_list` | Tuple of every repeatable record tag | Guarantees a one-item shipment and a many-item shipment share a shape. |
| `attr_prefix` | `""` | Flattens attributes into idiomatic JSON keys; drop the `@` noise. |
| `cdata_key` | `"#text"` | Predictable home for mixed-content text bodies. |
| `item_depth` | Nesting level of the record element | Set to where `LineItem`/`ShipmentDetail` sits; depth 0 means whole-document. |
| Whole-vs-stream cutover | ~200–500 MB per file | Above this, switch to `item_callback`; below it, parse whole for simplicity. |

Maintain force_list and the namespaces map as per-vendor config, never as module constants — tier-1 suppliers and small portals namespace and repeat tags differently, and a hard-coded list silently mis-shapes a feed you onboarded last week. The same per-vendor philosophy governs tolerance and matching downstream, mirroring how Setting Quantity and Price Tolerance Windows keeps thresholds in configuration rather than code.
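One way to keep those settings out of code is a small per-vendor config object. The sketch below is hypothetical throughout: `VendorXmlConfig`, `VENDOR_CONFIGS`, `kwargs_for`, the supplier IDs, and the namespace URI are illustrative names, and a real registry would more likely be loaded from a config file or table.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class VendorXmlConfig:
    """Hypothetical per-vendor parse settings."""
    supplier_id: str
    force_list: Tuple[str, ...] = ()
    namespaces: Dict[str, str] = field(default_factory=dict)

    def parse_kwargs(self) -> Dict[str, Any]:
        """Keyword arguments shaped for a converter such as safe_xml_to_dict."""
        return {
            "force_list": list(self.force_list),
            "namespaces": dict(self.namespaces),
        }


# Stand-in registry; in practice load this from YAML/JSON or a config table.
VENDOR_CONFIGS = {
    "acme": VendorXmlConfig(
        "acme",
        force_list=("LineItem", "PackReference"),
        namespaces={"urn:example:acme:asn": "asn"},
    ),
    "globex": VendorXmlConfig("globex", force_list=("ShipmentDetail",)),
}


def kwargs_for(supplier_id: str) -> Dict[str, Any]:
    """Fail loudly for an unregistered vendor rather than guessing a shape."""
    try:
        return VENDOR_CONFIGS[supplier_id].parse_kwargs()
    except KeyError:
        raise LookupError(f"No XML parse config for supplier {supplier_id!r}") from None


print(kwargs_for("acme"))
```

Raising on an unknown supplier is a deliberate choice in this sketch: a feed parsed with someone else's force_list would be mis-shaped without any visible error.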

Orchestration & Integration

In the wider pipeline this converter is the decode stage, sitting between raw acquisition and validation. Upstream, bytes arrive from an SFTP drop or an API poll and land unmodified in an immutable landing zone keyed by content hash. The decode stage reads those bytes, produces a JSON-ready dict, and hands it to the contract layer — passing the output through Schema Validation Using Pydantic catches type drift, missing mandatory fields like PurchaseOrderNumber or CarrierSCAC, and bad date formats before they reach the inventory ledger. Validated records then flow to canonical staging, where the matching engine in Matching & Reconciliation Algorithms consumes them.

Idempotency is enforced at the staging boundary, not here: derive a deterministic key from supplier_id + document type + a content hash of the source bytes, and upsert on it, so a redelivered ASN collapses onto the same row instead of double-counting inventory. Because the decode stage is pure (bytes in, dict out, no side effects), it is freely replayable from the landing zone — a reprocess of yesterday’s feeds is a no-op against already-staged keys. When acquisition itself is the bottleneck, the high-throughput fetch layer in Async Batch Processing for High-Volume Feeds feeds this decode stage in bounded concurrent batches.
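A minimal sketch of such a key, assuming SHA-256 for the content hash and a unit-separator character between fields (both are choices made here for illustration; `staging_key` is a hypothetical helper name):

```python
import hashlib


def staging_key(supplier_id: str, document_type: str, source_bytes: bytes) -> str:
    """Deterministic upsert key: same supplier, type, and bytes give the same key."""
    content_hash = hashlib.sha256(source_bytes).hexdigest()
    # A separator that cannot appear in the IDs avoids ambiguous concatenations
    # such as ("ab", "c") versus ("a", "bc").
    material = "\x1f".join((supplier_id, document_type, content_hash))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


payload = b"<ASN><LineItem><Sku>A</Sku></LineItem></ASN>"
first_delivery = staging_key("acme", "ASN", payload)
redelivery = staging_key("acme", "ASN", payload)
changed = staging_key("acme", "ASN", payload + b"\n")

print(first_delivery == redelivery)  # a redelivered ASN maps to the same row
print(first_delivery == changed)     # any byte-level change yields a new key
```

Because the hash is taken over the raw source bytes, a semantically identical document that was re-serialized with different whitespace counts as new content; whether that is desirable depends on how your suppliers redeliver.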

Two integration concerns are XML-specific. First, timestamps: ASN and delivery-window elements arrive in supplier-local time with inconsistent offsets, so the decode output must be normalized before comparison — applying Timezone Normalization for Global Supply Chains prevents off-by-one-day discrepancies in receipt matching. Second, legacy EDI: suppliers frequently embed proprietary X12/EDIFACT segments inside XML wrappers, producing deeply nested hybrids that break naive XPath and tag assumptions; the dedicated namespace-stripping and segment-flattening techniques for those payloads live in Converting Legacy EDI XML to Structured JSON.
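The off-by-one-day risk is easy to reproduce. The sketch below assumes the supplier sends ISO 8601 timestamps with an explicit offset; `to_utc` is a hypothetical helper, and real feeds with missing or non-standard offsets need the fuller treatment in the timezone page.

```python
from datetime import datetime, timezone


def to_utc(iso_timestamp: str) -> datetime:
    """Convert an offset-aware ISO 8601 timestamp to UTC."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # Refuse to guess: a naive timestamp has no recoverable offset.
        raise ValueError(f"timestamp has no UTC offset: {iso_timestamp!r}")
    return parsed.astimezone(timezone.utc)


supplier_local = "2024-03-01T23:30:00-08:00"  # late evening at the supplier
normalized = to_utc(supplier_local)

print(supplier_local[:10])            # local calendar date
print(normalized.date().isoformat())  # UTC calendar date, one day later
```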

Debugging & Pipeline Recovery

When a payload fails to convert, safe_xml_to_dict returns None and emits a structured log line — the run continues, and the failed payload is routed to a dead-letter queue (DLQ) for triage and replay. A minimal audit record per failure is {ingest_id, supplier_id, document_type, error_type, line, offset, source_hash, ts_utc}. The error_type field is what lets you triage at a glance:

- ExpatError with a repeated line/offset points at a structurally broken document: an unclosed tag, an illegal character, or a truncated transfer. Inspect the bytes at that offset; a truncation usually means the upstream SFTP/HTTP transfer failed, so the fix is re-acquisition, not re-parsing.
- UnicodeDecodeError means the declared encoding lies: a supplier stamped encoding="utf-8" on a Latin-1 or UTF-16 file. Read the file in binary mode and let Expat sniff the BOM, or pin the real encoding per vendor.
- A record present in the XML but missing from where you look in the dict is the namespace signature: with process_namespaces off or the namespaces map incomplete, the element landed under a prefixed or URI-qualified key that nothing reads, or it merged into a list with same-named tags from another namespace. Diff the source tag count against the parsed list length.
- A TypeError downstream during iteration is the force_list gap: a tag was single in this payload and got parsed as a dict. Add it to the vendor’s force_list.
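The audit record described above could be built along these lines. This is a sketch under assumptions: the `ParseFailure` dataclass and `failure_from_expat` helper are hypothetical names, SHA-256 is assumed for `source_hash`, and the IDs are placeholders.

```python
import hashlib
import json
import xml.parsers.expat
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ParseFailure:
    """One dead-letter audit row; fields mirror the minimal record above."""
    ingest_id: str
    supplier_id: str
    document_type: str
    error_type: str
    line: int
    offset: int
    source_hash: str
    ts_utc: str


def failure_from_expat(exc, *, ingest_id, supplier_id, document_type, source_bytes):
    """Turn an ExpatError plus ingest context into an audit record."""
    return ParseFailure(
        ingest_id=ingest_id,
        supplier_id=supplier_id,
        document_type=document_type,
        error_type="ExpatError",
        line=exc.lineno,
        offset=exc.offset,
        source_hash=hashlib.sha256(source_bytes).hexdigest(),
        ts_utc=datetime.now(timezone.utc).isoformat(),
    )


BROKEN = b"<ASN><LineItem></ASN>"  # closing tag does not match the open element
try:
    xml.parsers.expat.ParserCreate().Parse(BROKEN, True)
except xml.parsers.expat.ExpatError as exc:
    failure = failure_from_expat(
        exc,
        ingest_id="ing-0001",
        supplier_id="acme",
        document_type="ASN",
        source_bytes=BROKEN,
    )
    print(json.dumps(asdict(failure), sort_keys=True))
```

Storing the hash of the source bytes ties each DLQ entry back to the exact landing-zone object it should be replayed from.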

Monitor three signals in production: DLQ depth per supplier (a spike isolates a broken trading partner), parse latency per MB (a climb flags a feed drifting toward the streaming cutover), and a parsed-record-count-versus-source-tag-count ratio (a divergence catches silently dropped or mis-keyed namespaced records before they reach the ledger). Keep the original bytes in the landing zone so every DLQ entry is replayable byte-for-byte once the per-vendor config is corrected.
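The third signal can be approximated with the standard library alone. In this sketch, `count_source_tags` counts elements by local name straight from the source bytes and `count_parsed_records` counts what the parsed dict holds under the keys your code actually reads; both function names, the sample document, and the hand-written parsed dict are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from io import BytesIO


def count_source_tags(source_bytes: bytes, local_name: str) -> int:
    """Count elements with this local name, whatever their namespace."""
    count = 0
    for _event, elem in ET.iterparse(BytesIO(source_bytes), events=("end",)):
        # ElementTree spells namespaced tags as "{uri}local".
        if elem.tag.rsplit("}", 1)[-1] == local_name:
            count += 1
        elem.clear()  # keep memory flat on large documents
    return count


def count_parsed_records(node, expected_key: str) -> int:
    """Count records stored under exactly the key downstream code reads."""
    if isinstance(node, list):
        return sum(count_parsed_records(child, expected_key) for child in node)
    if not isinstance(node, dict):
        return 0
    total = 0
    for key, value in node.items():
        if key == expected_key:
            total += len(value) if isinstance(value, list) else 1
        total += count_parsed_records(value, expected_key)
    return total


SOURCE = (
    b'<ASN xmlns:a="urn:example:a" xmlns:b="urn:example:b">'
    b"<a:LineItem/><a:LineItem/><b:LineItem/>"
    b"</ASN>"
)
# Hand-written stand-in for a parsed payload in which the third item
# ended up under a key the pipeline does not read.
PARSED = {"ASN": {"a:LineItem": [None, None], "b:LineItem": None}}

source_count = count_source_tags(SOURCE, "LineItem")
parsed_count = count_parsed_records(PARSED, "a:LineItem")
print(source_count, parsed_count, parsed_count / source_count)
```

A ratio below 1.0 is the alert condition: the source contains records that the keys you read do not account for.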

FAQ

Why is a `LineItem` sometimes a dict and sometimes a list?

xmltodict has no schema, so it infers shape from the instance: one <LineItem> becomes a dict, two or more become a list. That instance-dependent shape is the single most common source of TypeError in XML pipelines. Pin every repeatable element with force_list=["LineItem", ...] so a one-line and a fifty-line ASN always present the same iterable, and your downstream code never branches on type.
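The shape difference is easy to see on two tiny hand-made payloads (the `ASN`, `LineItem`, and `Sku` names are illustrative only):

```python
import xmltodict

ONE = "<ASN><LineItem><Sku>A</Sku></LineItem></ASN>"
MANY = (
    "<ASN>"
    "<LineItem><Sku>A</Sku></LineItem>"
    "<LineItem><Sku>B</Sku></LineItem>"
    "</ASN>"
)
REPEATABLE = ("LineItem",)

# Without force_list the shape depends on how many items this payload had.
inferred_one = xmltodict.parse(ONE)["ASN"]["LineItem"]
inferred_many = xmltodict.parse(MANY)["ASN"]["LineItem"]

# With force_list both payloads present a list.
pinned_one = xmltodict.parse(ONE, force_list=REPEATABLE)["ASN"]["LineItem"]
pinned_many = xmltodict.parse(MANY, force_list=REPEATABLE)["ASN"]["LineItem"]

print(type(inferred_one).__name__, type(inferred_many).__name__)
print([item["Sku"] for item in pinned_one])
print([item["Sku"] for item in pinned_many])
```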

My XML has elements, but they vanish from the output dict. What happened?

That is a namespace problem. When suppliers reuse a generic tag like <Item> at different levels under different namespace URIs and you parse without process_namespaces=True, xmltodict keys each element by its tag exactly as written, prefix included. The element you are looking for ends up under a key such as ns2:Item that your code never reads, or it merges into a list with same-named tags from another namespace, and no error is raised. Turn on process_namespaces, supply a namespaces map to collapse the URIs to readable prefixes, and diff source tag counts against parsed list lengths to confirm nothing dropped.

Should I use `xmltodict` or `lxml` for supplier feeds?

Use xmltodict when the goal is JSON — turning a well-formed feed into a dict for validation and matching — because it is deterministic, dependency-light, and memory-cheap. Switch to lxml when you need XPath queries, XSD validation, or in-place tree edits, none of which a conversion library provides. The two coexist: many pipelines validate structure with lxml/XSD at the edge, then convert with xmltodict for the JSON-native stages.

How do I convert a 2 GB manifest without exhausting memory?

Do not call parse() on the whole file. Use the streaming API: open the file in binary mode and pass item_depth set to the nesting level of the record element plus an item_callback, which fires once per record so peak memory stays near a single line item instead of the whole document. Route each record to staging inside the callback and return True to continue or False to abort.
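The continue/abort contract can be sketched on an in-memory document standing in for a large file. `stream_records` and its `limit` parameter are hypothetical; the point is that a callback returning False ends the parse by raising `xmltodict.ParsingInterrupted`, which the caller has to catch.

```python
import io

import xmltodict

# Small stand-in for a multi-GB manifest; records sit at nesting depth 2.
MANIFEST = (
    b"<Manifest>"
    b"<LineItem><Sku>A</Sku></LineItem>"
    b"<LineItem><Sku>B</Sku></LineItem>"
    b"<LineItem><Sku>C</Sku></LineItem>"
    b"</Manifest>"
)


def stream_records(xml_bytes, limit=None):
    """Collect records one at a time, optionally stopping after `limit`."""
    seen = []

    def on_record(_path, record):
        seen.append(record)
        # True keeps the parse going; False asks xmltodict to stop.
        return limit is None or len(seen) < limit

    try:
        xmltodict.parse(io.BytesIO(xml_bytes), item_depth=2, item_callback=on_record)
    except xmltodict.ParsingInterrupted:
        pass  # raised when the callback returns a falsy value
    return seen


all_records = stream_records(MANIFEST)
first_two = stream_records(MANIFEST, limit=2)
print([r["Sku"] for r in all_records])
print([r["Sku"] for r in first_two])
```

In a real pipeline the callback would push each record to staging instead of appending to a list, since accumulating records defeats the constant-memory goal.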

Why round-trip the result through `json.dumps`/`json.loads`?

Depending on the release, xmltodict returns OrderedDict instances rather than plain dicts (and can surface other non-primitive types). Some message brokers, caches, and serializers choke on those. The json.loads(json.dumps(parsed)) round-trip flattens everything to plain dict, list, str, and None values, so the payload survives a hop through Kafka, RabbitMQ, or a Pydantic model without serializer surprises.
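The effect of the round-trip can be shown without xmltodict at all. The nested `OrderedDict` below is a hand-built stand-in for parser output, with illustrative keys:

```python
import json
from collections import OrderedDict

# Stand-in for parser output: nested OrderedDicts, string leaves, one empty element.
parsed_like = OrderedDict(
    [
        (
            "ASN",
            OrderedDict(
                [
                    ("LineItem", [OrderedDict([("Sku", "A"), ("Qty", "3")])]),
                    ("Note", None),
                ]
            ),
        )
    ]
)

canonical = json.loads(json.dumps(parsed_like))

print(type(parsed_like["ASN"]).__name__)   # OrderedDict before the round-trip
print(type(canonical["ASN"]).__name__)     # plain dict after it
print(canonical["ASN"]["LineItem"][0])     # values are unchanged strings
```

The round-trip changes container types only: quantities stay strings and empty elements stay None, so typing still belongs to the validation layer.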
