XML to JSON Conversion with xmltodict
Supply chain data ingestion relies heavily on XML for Advanced Shipping Notices (ASN), purchase order acknowledgments, and multi-tier inventory manifests. While modern procurement APIs and reconciliation engines favor JSON, legacy supplier integrations, WMS exports, and ERP middleware still default to XML. Converting these feeds reliably is a foundational step in Ingestion & Parsing Workflows for Supply Chain Data. The xmltodict library provides a lightweight bridge between hierarchical XML and JSON-compatible Python dictionaries. A plain parse builds the whole result in memory, but an optional streaming mode processes documents item by item, and both modes preserve the structural fidelity required for inventory matching.
Core Conversion Mechanics
Unlike row-oriented formats that map directly to tabular structures, XML requires explicit tree traversal. When parsing supplier manifests, xmltodict maps nested elements to Python dictionaries and stores attributes under keys prefixed with @ by default. However, production reconciliation pipelines must address three common failure modes: namespace collisions, single-element list coercion, and malformed encoding.
import json
from pathlib import Path
from typing import Dict, List, Optional

import xmltodict


def parse_xml_to_dict(xml_path: Path, force_list: Optional[List[str]] = None) -> Dict:
    with open(xml_path, "r", encoding="utf-8") as f:
        return xmltodict.parse(
            f.read(),
            # Expand namespace prefixes to their full URIs in dictionary keys.
            process_namespaces=True,
            # Tags that must always parse as lists, even when only one is present.
            force_list=force_list or ["LineItem", "ShipmentDetail"],
        )
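The process_namespaces=True flag matters when ingesting EDI-adjacent XML where suppliers reuse generic tags like <Item> under different namespaces. With the flag set, xmltodict expands each prefix to its full namespace URI, so keys stay stable even when two suppliers bind the same namespace to different prefixes. Without it, keys carry the literal prefix from the document (for example ns1:Item), and lookups written against one supplier's prefix fail on another's. Because expanded keys include the URI, names passed to force_list must match the expanded form for namespaced elements; the namespaces mapping parameter can shorten or drop known URIs. For feeds where single-item collections must be normalized to arrays, force_list guarantees a list even when only one element is present. Without it, a lone <LineItem> parses as a dictionary while repeated ones parse as a list, and code that iterates over line items either raises a TypeError or silently loops over dictionary keys. This hierarchical normalization contrasts sharply with columnar ingestion strategies, such as those detailed in Parsing CSV and Excel Feeds with Pandas, where schema alignment typically relies on explicit dtype casting rather than tree traversal.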
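When the repeating tag names are not known ahead of time, a post-parse guard can complement force_list. The following is a minimal sketch, not part of xmltodict: ensure_list is a hypothetical helper, and the two dictionaries are hand-written to mimic parser output for a one-item and a two-item shipment.

```python
from typing import Any, List


def ensure_list(value: Any) -> List[Any]:
    """Normalize a parsed XML child to a list (hypothetical helper)."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Hand-written dictionaries shaped like parser output without force_list:
# a lone <LineItem> arrives as a dict, repeated ones arrive as a list.
single = {"Shipment": {"LineItem": {"SKU": "A-100", "Qty": "4"}}}
multiple = {
    "Shipment": {
        "LineItem": [
            {"SKU": "A-100", "Qty": "4"},
            {"SKU": "B-200", "Qty": "1"},
        ]
    }
}

for doc in (single, multiple):
    items = ensure_list(doc["Shipment"].get("LineItem"))
    print([item["SKU"] for item in items])
```

Both documents can then be iterated with the same loop, regardless of how many line items the supplier sent.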
Production Exception Handling & Memory Management
Supply chain feeds are notoriously inconsistent. A single malformed ASN can halt an entire reconciliation batch. Wrapping xmltodict in defensive parsing routines prevents pipeline crashes while preserving audit trails.
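import json
import logging
import xml.parsers.expat
from typing import Dict, List, Optional

import xmltodict

logger = logging.getLogger(__name__)


def safe_xml_to_json(xml_content: str, force_list: Optional[List[str]] = None) -> Optional[Dict]:
    try:
        parsed = xmltodict.parse(
            xml_content,
            process_namespaces=True,
            # An empty prefix puts attributes and child elements in one key space,
            # so an attribute and an element with the same name can collide.
            attr_prefix="",
            cdata_key="#text",
            force_list=force_list or ["LineItem", "ShipmentDetail"],
        )
        # Round-trip through JSON to confirm the result is serializable.
        return json.loads(json.dumps(parsed))
    except xml.parsers.expat.ExpatError as e:
        # ExpatError exposes code, lineno, and offset (column); it has no msg attribute.
        logger.error(
            "Malformed XML at line %d, column %d: %s",
            e.lineno, e.offset, xml.parsers.expat.ErrorString(e.code),
        )
        return None
    except UnicodeDecodeError as e:
        logger.error("Encoding mismatch in supplier payload: %s", e.reason)
        return None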
The json.loads(json.dumps(parsed)) round-trip confirms that the parsed structure serializes cleanly and, on older xmltodict releases that return OrderedDict, converts the result to plain dict and list objects. This keeps the output compatible with downstream message brokers like Kafka or RabbitMQ. Under the hood, xmltodict delegates parsing to Python’s Expat C extension, a non-validating, event-driven parser. Expat itself never builds a document tree, but a plain xmltodict.parse call still accumulates the complete dictionary in memory, so very large payloads (around 500MB and above) need the streaming mode covered under Streaming Large Manifests. Refer to the official Python xml.parsers.expat documentation for low-level buffer tuning and character encoding specifications.
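To make the audit-trail fields concrete, the sketch below calls the stdlib Expat bindings directly on a deliberately malformed payload and reads the position data that ExpatError exposes. BROKEN_ASN and describe_parse_failure are illustrative names, not part of any library.

```python
import xml.parsers.expat
from xml.parsers.expat import ErrorString, ExpatError

# Deliberately malformed payload: <Qty> is never closed.
BROKEN_ASN = "<ASN>\n  <LineItem><Qty>4</LineItem>\n</ASN>"


def describe_parse_failure(payload: str):
    """Return (line, column, message) for malformed XML, or None if it parses."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(payload, True)
    except ExpatError as exc:
        # lineno is 1-based; offset is the column within that line.
        return exc.lineno, exc.offset, ErrorString(exc.code)
    return None


failure = describe_parse_failure(BROKEN_ASN)
print(failure)
```

Logging the line, column, and Expat message together gives support teams enough detail to point a supplier at the exact defect in their feed.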
Post-Conversion Validation & EDI Integration
Raw dictionary output requires strict typing before entering downstream procurement systems. Passing the parsed output through Schema Validation Using Pydantic catches type drift, missing mandatory fields like PurchaseOrderNumber or CarrierSCAC, and invalid date formats before they propagate to inventory ledgers.
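The checks described here can be sketched without extra dependencies. The example below uses a stdlib dataclass as a stand-in rather than Pydantic itself, so it illustrates the kind of validation involved, not the Pydantic API. AsnHeader, validate_header, and the ShipDate field are hypothetical; only PurchaseOrderNumber and CarrierSCAC come from the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Dict

REQUIRED_FIELDS = ("PurchaseOrderNumber", "CarrierSCAC", "ShipDate")


@dataclass(frozen=True)
class AsnHeader:
    purchase_order_number: str
    carrier_scac: str
    ship_date: date


def validate_header(raw: Dict[str, Any]) -> AsnHeader:
    """Raise ValueError on missing fields or malformed values."""
    missing = [name for name in REQUIRED_FIELDS if not raw.get(name)]
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    scac = str(raw["CarrierSCAC"]).strip().upper()
    # SCAC codes are two to four letters.
    if not (2 <= len(scac) <= 4 and scac.isalpha()):
        raise ValueError(f"invalid CarrierSCAC: {scac!r}")
    # date.fromisoformat raises ValueError for non-ISO strings.
    ship_date = date.fromisoformat(str(raw["ShipDate"]))
    return AsnHeader(str(raw["PurchaseOrderNumber"]), scac, ship_date)


header = validate_header(
    {"PurchaseOrderNumber": "PO-1001", "CarrierSCAC": "abcd", "ShipDate": "2024-03-15"}
)
print(header)
```

A Pydantic model would express the same constraints declaratively and report every failing field at once instead of stopping at the first.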
For organizations migrating from legacy X12/EDIFACT standards, handling mixed EDI/XML payloads requires specialized namespace stripping and segment mapping. Suppliers often embed proprietary EDI segments within XML wrappers, creating deeply nested structures that break standard XPath queries. Advanced namespace resolution and tag flattening techniques are required to normalize these hybrid payloads, as outlined in Converting Legacy EDI XML to Structured JSON.
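One possible shape for the namespace-stripping step is sketched below. It assumes keys in the form produced with process_namespaces=True and the default separator (namespace URI, a colon, then the local name), and it raises an error instead of silently merging two elements that share a local name. strip_namespaces and the urn:example URIs are hypothetical.

```python
from typing import Any


def strip_namespaces(node: Any) -> Any:
    """Recursively drop 'namespace-uri:' prefixes from dictionary keys."""
    if isinstance(node, list):
        return [strip_namespaces(child) for child in node]
    if not isinstance(node, dict):
        return node
    flattened = {}
    for key, value in node.items():
        # Preserve the attribute marker while stripping the namespace part.
        marker = "@" if key.startswith("@") else ""
        local = marker + key[len(marker):].rsplit(":", 1)[-1]
        if local in flattened:
            raise ValueError(f"namespace stripping would merge duplicate key {local!r}")
        flattened[local] = strip_namespaces(value)
    return flattened


# Hand-written input mimicking a hybrid payload with two namespaces.
hybrid = {
    "urn:example:asn:Shipment": {
        "@urn:example:asn:id": "S1",
        "urn:example:asn:LineItem": [
            {"urn:example:edi:Segment": "N1", "#text": "raw"},
        ],
    }
}
print(strip_namespaces(hybrid))
```

Failing loudly on a collision is a deliberate choice in this sketch: if two namespaces define the same local name, flattening them into one key would lose data that reconciliation depends on.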
Streaming Large Manifests
When processing multi-gigabyte inventory manifests, loading the entire document into memory remains a bottleneck. xmltodict supports streaming via the item_depth and item_callback parameters: each time the parser finishes an element at the specified nesting depth, it passes that element to the callback instead of adding it to the result. The root element is depth 1, so item_depth=2 streams the root's direct children.
import xmltodict
from typing import Dict


def validate_and_route_to_erp(item: Dict) -> None:
    # Replace with concrete reconciliation logic: push to a queue, write to a
    # staging table, or hand off to a downstream pipeline.
    ...


def process_line_item(_path, item: Dict) -> bool:
    # Execute per-line reconciliation logic without buffering the full document.
    validate_and_route_to_erp(item)
    return True  # Return False to abort streaming.


with open("manifest.xml", "rb") as f:
    xmltodict.parse(
        f,
        item_depth=2,
        item_callback=process_line_item,
        process_namespaces=True,
    )
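This pattern keeps memory use bounded by the largest single item at the chosen depth rather than by the size of the whole feed, aligning with high-throughput ETL architectures. If the callback returns a false value, xmltodict stops parsing and raises ParsingInterrupted, so callers that abort deliberately should catch that exception. For comprehensive configuration options and callback return semantics, consult the xmltodict official repository documentation.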
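Routing items one at a time can be chatty for queues or staging tables, so a callback may buffer a small batch before handing it off. The sketch below is one way to do that under the (path, item) callback signature used above; make_batching_callback and its sink argument are hypothetical, and the loop simulates the parser's calls so the example runs without an XML file.

```python
from typing import Any, Callable, Dict, List, Tuple

Batch = List[Dict[str, Any]]


def make_batching_callback(
    batch_size: int, sink: Callable[[Batch], None]
) -> Tuple[Callable[[Any, Dict[str, Any]], bool], Callable[[], None]]:
    """Build an (item_callback, flush) pair that hands items to sink in batches."""
    pending: Batch = []

    def flush() -> None:
        # Deliver whatever is buffered, then reset the buffer.
        if pending:
            sink(list(pending))
            pending.clear()

    def on_item(_path: Any, item: Dict[str, Any]) -> bool:
        pending.append(item)
        if len(pending) >= batch_size:
            flush()
        return True  # Keep streaming.

    return on_item, flush


# Simulate the parser invoking the callback for three line items.
received: List[Batch] = []
on_item, flush = make_batching_callback(2, received.append)
for sku in ("A-100", "B-200", "C-300"):
    on_item([("Manifest", None), ("LineItem", None)], {"SKU": sku})
flush()  # Deliver the final partial batch after parsing ends.
print([len(batch) for batch in received])  # → [2, 1]
```

In a real run, on_item would be passed as item_callback and flush called once after xmltodict.parse returns. Memory then scales with the batch size instead of a single item.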
Deterministic XML-to-JSON conversion eliminates structural ambiguity in procurement data pipelines. By enforcing namespace isolation, standardizing list coercion, and implementing streaming fallbacks, engineering teams can reliably ingest legacy supplier feeds into modern JSON-native reconciliation engines.