Converting Legacy EDI XML to Structured JSON

When a trading partner still transmits EDI as XML — X12 segments wrapped in vendor tags, bloated namespace declarations, attributes carrying the keys you actually need, and order-line hierarchies nested four levels deep — a naive xmltodict.parse() either crashes the run or, worse, silently produces a dict whose shape changes from file to file. This page is the concrete decode-and-validate procedure that sits beneath the strategy in XML to JSON Conversion with xmltodict: profile the document, pin the parser flags that make output deterministic, coerce types with a typed contract, stream the multi-gigabyte files, and quarantine the malformed ASNs without aborting the batch.

Operational Trigger Signals

Reach for this dedicated EDI-XML decode path — rather than a one-line xmltodict.parse() or a generic DOM walk — only when the feed actually shows these measurable signals across consecutive runs:

Collection arity flips between files. A tag such as LineItem or ShipmentDetail arrives as a single child in some payloads and a repeated list in others, so downstream code intermittently raises TypeError: string indices must be integers when it iterates what it expected to be a list.
Attribute-borne keys. The values you must join on (ItemID, UOM, ShipTo, segment qualifiers) live in XML attributes, not element bodies, so a default parse buries them under @-prefixed keys the matching engine never looks at.
Namespace reuse and prefix drift. The same logical element appears as {http://vendorA/schema}LineItem in one feed and an unprefixed LineItem in another, fracturing the dict keyspace.
Encoding that is not UTF-8. The declared encoding is ISO-8859-1 or Windows-1252, or there is no declaration at all, and decoding the payload as UTF-8 raises UnicodeDecodeError on the first accented supplier name.
Payloads above ~200 MB. A multi-year ASN or inventory-snapshot manifest will not fit in RAM as a single output dict, and an in-memory parse OOM-kills the container before it reaches validation.
No schema guarantees from the partner. Quantities arrive as "1,250", currencies as "$3.40", dates in three formats, and mandatory fields go missing without warning — so the decoded dict must pass through type coercion before anything trusts it.

If none of these hold — a small, well-formed, UTF-8, stable-arity document — the whole-document converter on the parent XML to JSON Conversion with xmltodict page is sufficient and this procedure is overkill.

Step-by-Step Implementation

Build the converter in five ordered, independently testable stages: profile, parse, validate, stream, recover. Each stage hands a cleaner contract to the next. The profiler output (force_list plus schema) configures the parse and validate stages; payload size routes a document down either the whole-document or the streaming branch; and validation splits every record between the typed-JSON sink and the dead-letter queue.

Step 1 — Profile the raw payload before writing conversion logic. Legacy EDI XML mixes X12 segments with proprietary tags and leans on attributes for keys. Run a lightweight profiler against a representative sample to map repeating nodes, attribute keys, and namespace prefixes — the output dictates your force_list set and your schema requirements.

PYTHON

import logging
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Set

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("supply_chain.ingest.edi_xml")


def profile_xml_structure(xml_path: str) -> dict[str, object]:
    """Map tag frequency and attribute keys so force_list and schema can be derived."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    tag_counts: Counter[str] = Counter(elem.tag for elem in root.iter())
    attr_keys: Set[str] = set()
    for elem in root.iter():
        attr_keys.update(elem.attrib.keys())

    # Any tag seen more than once is a candidate collection for force_list.
    repeating = sorted(tag for tag, n in tag_counts.items() if n > 1)
    logger.info("profile root=%s unique_tags=%d repeating=%s", root.tag, len(tag_counts), repeating)
    logger.info("profile attribute_keys=%s", sorted(attr_keys))
    return {"root": root.tag, "repeating": repeating, "attribute_keys": sorted(attr_keys)}

Pay close attention to tags that appear exactly once versus those that repeat across order lines: misclassifying a singleton as a collection (or the reverse) is the single most common cause of downstream array-iteration failures.
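A document-wide tag count cannot tell a tag that repeats under one parent from a tag that appears once under each of many parents. The sketch below is one way to narrow the profile to sibling-level repetition. The helper names `local_name` and `sibling_repeats` and the sample document are illustrative assumptions, and the result is only as good as the sample: a tag that happens to be single in every sampled file will still be missed.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Drop ElementTree's '{uri}' namespace prefix, leaving the local tag name."""
    return tag.rsplit("}", 1)[-1]


def sibling_repeats(xml_text: str) -> tuple[str, ...]:
    """Return local tag names that occur more than once under at least one parent."""
    root = ET.fromstring(xml_text)
    repeated: set[str] = set()
    for parent in root.iter():
        counts = Counter(local_name(child.tag) for child in parent)
        repeated.update(tag for tag, n in counts.items() if n > 1)
    return tuple(sorted(repeated))


# Hypothetical sample: Carrier is single under each Shipment, LineItem repeats.
sample = """
<ASN xmlns="http://vendorA/schema">
  <Shipment><LineItem ItemID="A1"/><LineItem ItemID="B2"/><Carrier>UPS</Carrier></Shipment>
  <Shipment><LineItem ItemID="C3"/><Carrier>DHL</Carrier></Shipment>
</ASN>
"""
print(sibling_repeats(sample))
```

Here `Carrier` occurs twice in the document but never twice under the same parent, so it stays out of the candidate set, while `LineItem` and `Shipment` are reported.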

Step 2 — Pin the deterministic xmltodict flags. The default behavior introduces structural ambiguity. The three flags that matter for supply chain feeds are force_list (stable arity), an explicit attribute/CDATA shape (keys reachable), and process_namespaces (one canonical keyspace). Feed the force_list tuple from the profiler output in Step 1.

PYTHON

from typing import Optional

import xmltodict


def parse_edi_xml(
    xml_bytes: bytes,
    force_list: tuple[str, ...],
    namespaces: Optional[dict[str, Optional[str]]] = None,
) -> dict:
    """Decode EDI XML into a deterministic, JSON-shaped dict."""
    parsed = xmltodict.parse(
        xml_bytes,
        attr_prefix="",            # Lift attribute keys (ItemID, UOM) to the element level
        cdata_key="#text",         # Preserve mixed-content text nodes under a stable key
        force_list=force_list,     # Guarantee list arity for repeated records
        process_namespaces=namespaces is not None,  # Resolve prefixes to URIs only when a map is given
        namespaces=namespaces,     # e.g. {"http://vendor/schema": None} drops that URI from every key
        dict_constructor=dict,     # Plain dict -> deterministic key ordering, JSON-serializable
    )
    logger.info("parsed root_keys=%s", list(parsed.keys()))
    return parsed

force_list is load-bearing: without it a one-item shipment parses as a dict and a two-item shipment as a list, and any for item in payload["LineItem"]: toggles between iterating dict keys (strings) and iterating records. attr_prefix="" keeps attribute-borne keys addressable instead of hiding them behind @.

Step 3 — Coerce types and validate against a typed contract. A raw dict has no type safety: quantities are strings, currencies carry symbols, and dates vary. Enforce a Pydantic model, as described in Schema Validation Using Pydantic, so malformed currency strings, bad dates, and missing mandatory fields fail loudly here rather than corrupting the warehouse silently. See Validating Supplier Data Payloads with Pydantic Models for the full contract pattern.

PYTHON

from datetime import datetime
from decimal import Decimal
from typing import List, Optional

from pydantic import BaseModel, field_validator


class LineItem(BaseModel):
    sku: str
    quantity: int
    unit_cost: Decimal
    uom: str
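    ship_date: Optional[datetime] = None

    @field_validator("quantity", mode="before")
    @classmethod
    def coerce_quantity(cls, v: object) -> object:
        # Partners send "1,250"; drop thousands separators so int parsing succeeds.
        return v.replace(",", "").strip() if isinstance(v, str) else v
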
    @field_validator("unit_cost", mode="before")
    @classmethod
    def coerce_cost(cls, v: object) -> Decimal:
        # Partners send "$3.40" / "1,250.00"; strip symbols before Decimal parses.
        if isinstance(v, str):
            return Decimal(v.replace("$", "").replace(",", "").strip())
        return Decimal(str(v))


class ASNPayload(BaseModel):
    header_id: str
    supplier_code: str
    line_items: List[LineItem]


def validate_payload(parsed: dict) -> ASNPayload:
    """Flatten the decoded tree onto the typed contract and validate."""
    header = parsed.get("ASNHeader", {})
    clean = {
        "header_id": header.get("ID"),
        "supplier_code": header.get("SupplierCode"),
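        # Assumes each LineItem dict already uses the model's field names; rename
        # partner-specific keys to sku/quantity/unit_cost/uom before validating.
        "line_items": parsed.get("LineItem", []),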
    }
    validated = ASNPayload.model_validate(clean)
    logger.info("validated header=%s lines=%d", validated.header_id, len(validated.line_items))
    return validated
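The same coercion rules can be exercised without Pydantic, which is handy for unit-testing them in isolation. The following is a minimal standard-library sketch. The function names are hypothetical, and the three entries in `DATE_FORMATS` are assumed placeholders, since the formats a given partner actually sends are not specified here.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed formats for illustration; replace with the ones the partner really sends.
DATE_FORMATS = ("%Y-%m-%d", "%Y%m%d", "%m/%d/%Y")


def coerce_quantity(raw: str) -> int:
    """Turn '1,250' into 1250; raises ValueError on non-integral input."""
    return int(raw.replace(",", "").strip())


def coerce_money(raw: str) -> Decimal:
    """Turn '$3.40' or '1,250.00' into a Decimal; raises ValueError if unparseable."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"not a monetary amount: {raw!r}") from exc


def coerce_date(raw: str) -> datetime:
    """Try each known format in order; raises ValueError if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Each helper raises `ValueError` instead of returning a default, so a bad field surfaces as a rejected record rather than a silently wrong value.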

Step 4 — Stream the multi-gigabyte feeds. Parsing a multi-GB ASN whole will OOM-kill the worker. Use event-driven iterparse() and drop each LineItem subtree at its closing tag. The standard library exposes no parent links, so keep a reference to the root and remove each processed child to release the references iterparse would otherwise accumulate. Peak memory stays bounded near

$$M_{\text{peak}} \approx M_{\text{root}} + b \cdot \bar{s}_{\text{item}}$$

where $b$ is batch_size and $\bar{s}_{\text{item}}$ is the average decoded size of one record — independent of total file size.
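As a worked illustration, take the default batch size $b = 5000$ and assume an average decoded record of 2 KB. The 2 KB figure is a hypothetical value chosen for the arithmetic, not a measurement.

```latex
b \cdot \bar{s}_{\text{item}} = 5000 \times 2\,\text{KB} = 10\,\text{MB},
\qquad
M_{\text{peak}} \approx M_{\text{root}} + 10\,\text{MB}
```

Under that assumption the buffered batch costs about 10 MB whether the manifest is 200 MB or 20 GB. Doubling `batch_size` doubles that term and nothing else.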

PYTHON

import json
import xml.etree.ElementTree as ET


def stream_edi_xml_to_json(xml_path: str, output_path: str, batch_size: int = 5000) -> None:
    """Stream-convert a large EDI manifest to a JSON array, holding one batch at a time."""
    context = ET.iterparse(xml_path, events=("start", "end"))
    _, root = next(context)  # consume the root's start event so we can clear it later
    batch: list[dict] = []
    first = True

    with open(output_path, "w", encoding="utf-8") as f:
        f.write("[")
        for event, elem in context:
            if event != "end" or elem.tag != "LineItem":
                continue
            # Merge attribute-borne keys (ItemID, UOM) with child-element text.
            batch.append({**elem.attrib, **{child.tag: child.text for child in elem}})
            elem.clear()
            # Drop the processed child off the root so accumulated references are freed.
            if elem in list(root):
                root.remove(elem)
            if len(batch) >= batch_size:
                for item in batch:
                    f.write("" if first else ",")
                    f.write(json.dumps(item))
                    first = False
                logger.info("flushed batch size=%d", len(batch))
                batch.clear()
        for item in batch:  # final partial batch
            f.write("" if first else ",")
            f.write(json.dumps(item))
            first = False
        f.write("]")

This caps RAM at roughly a fixed ceiling regardless of input size, making it viable for containerized ETL runners with strict resource quotas. When the same documents arrive in bulk, pull and decode them through the concurrency model in Async Batch Processing for High-Volume Feeds, offloading the blocking parse to an executor so it never starves the event loop.
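Offloading the blocking parse to an executor might look like the sketch below. `blocking_convert` is a hypothetical stand-in for a real converter such as the streaming function above, and `convert_many` and the worker count are likewise assumptions. A thread pool is used here for simplicity; a process pool may suit CPU-bound parsing better.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def blocking_convert(xml_path: str) -> str:
    """Placeholder for a blocking conversion; returns the output path it would write."""
    return xml_path.replace(".xml", ".json")


async def convert_many(paths: list[str], max_workers: int = 4) -> list[str]:
    """Run one blocking conversion per path on worker threads, off the event loop."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        tasks = [loop.run_in_executor(pool, blocking_convert, p) for p in paths]
        # gather() returns results in the same order as the input paths.
        return await asyncio.gather(*tasks)


results = asyncio.run(convert_many(["a.xml", "b.xml"]))
print(results)
```

The event loop stays free to schedule downloads or other I/O while the worker threads run the conversions.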

Step 5 — Quarantine failures, never halt the batch. Wrap each document so one corrupt ASN routes itself to a dead-letter queue while the rest of the batch completes (detailed in Debugging & Recovery below).
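One possible shape for that wrapper is sketched here with the standard library only. The audit record uses the field names listed in Debugging & Recovery. The input document shape (`payload`, `supplier_id`, `source_uri`, `transaction_set`), the function name `process_batch`, and the in-memory list standing in for a real dead-letter queue are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import Any, Callable


def process_batch(
    batch_id: str,
    documents: list[dict[str, Any]],
    decode: Callable[[bytes], Any],
    attempt: int = 1,
) -> tuple[list[Any], list[dict[str, Any]]]:
    """Decode every document; failures become audit records instead of aborting the loop."""
    staged: list[Any] = []
    dead_letters: list[dict[str, Any]] = []
    for doc in documents:
        try:
            staged.append(decode(doc["payload"]))
        except Exception as exc:  # broad on purpose in this sketch; narrow it in production
            dead_letters.append({
                "batch_id": batch_id,
                "supplier_id": doc.get("supplier_id"),
                "source_uri": doc.get("source_uri"),
                "transaction_set": doc.get("transaction_set"),
                "error_type": type(exc).__name__,
                "element_path": None,  # unknown here; fill in when the decoder reports it
                "attempt": attempt,
                "ts_utc": datetime.now(timezone.utc).isoformat(),
            })
    return staged, dead_letters


# Hypothetical batch: the second payload has a mismatched closing tag.
docs = [
    {"supplier_id": "S1", "source_uri": "a.xml", "transaction_set": "856", "payload": b"<ASN><LineItem/></ASN>"},
    {"supplier_id": "S2", "source_uri": "b.xml", "transaction_set": "856", "payload": b"<ASN><LineItem></ASN>"},
    {"supplier_id": "S3", "source_uri": "c.xml", "transaction_set": "856", "payload": b"<ASN/>"},
]
staged, dead = process_batch("batch-001", docs, ET.fromstring)
```

The malformed second document lands in `dead` with its `error_type`, and the third document is still decoded.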

The EDI document type changes which header and line tags you map, so derive the force_list set and the Pydantic model per transaction set rather than hard-coding one:

| EDI transaction set | XML purpose | Key repeating nodes | Critical attribute keys |
|---|---|---|---|
| 856 (ASN) | Advance ship notice | `LineItem`, `ShipmentDetail` | `ItemID`, `UOM`, `ShipTo`, `Qty` |
| 850 (PO) | Purchase order | `PO1Loop`, `LineItem` | `BuyerPartNumber`, `OrderQty` |
| 855 (PO ack) | PO acknowledgment | `AckLine`, `ScheduleLine` | `AckStatus`, `PromiseDate` |
| 810 (Invoice) | Invoice | `InvoiceLine`, `ChargeDetail` | `LineAmount`, `TaxRate` |

For the cross-walk from these segments onto your internal tables, follow How to Map EDI 810 Invoices to Internal PO Schemas.
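The per-transaction-set derivation can be kept in a small registry so the parser never receives a hard-coded tuple. The sketch below copies the repeating-node column of the table above verbatim. The names `FORCE_LIST_BY_TRANSACTION_SET` and `force_list_for` are hypothetical, and a real deployment would likely merge these defaults with what the Step 1 profiler reports for each partner.

```python
# Repeating nodes per EDI transaction set, taken from the table above.
FORCE_LIST_BY_TRANSACTION_SET: dict[str, tuple[str, ...]] = {
    "856": ("LineItem", "ShipmentDetail"),
    "850": ("PO1Loop", "LineItem"),
    "855": ("AckLine", "ScheduleLine"),
    "810": ("InvoiceLine", "ChargeDetail"),
}


def force_list_for(transaction_set: str) -> tuple[str, ...]:
    """Look up the force_list tuple; fail loudly for an unregistered transaction set."""
    try:
        return FORCE_LIST_BY_TRANSACTION_SET[transaction_set]
    except KeyError:
        raise ValueError(
            f"no force_list registered for transaction set {transaction_set!r}"
        ) from None
```

Raising on an unknown set is a deliberate choice in this sketch: falling back to an empty tuple would silently reintroduce the arity flip.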

Configuration Reference

These parameters drive the converter above. Tier the force_list set and the source encoding per trading partner from a config table rather than hard-coding constants.

| Parameter | Accepted values | Default | Notes |
|---|---|---|---|
| `force_list` | tuple of tag names | `()` | Must include every repeated record tag from the Step 1 profile; missing one breaks list iteration |
| `attr_prefix` | string | `"@"` | Set to `""` so attribute-borne keys (`ItemID`, `UOM`) land at the element level |
| `cdata_key` | string | `"#text"` | Key under which mixed-content text collapses; keep stable for downstream access |
| `process_namespaces` | `True` / `False` | `False` | `False` leaves prefixes exactly as written in the document; set `True` and pass a `namespaces` map (URI to `None`) to collapse vendor namespaces into one keyspace |
| `encoding` | `utf-8` / `latin-1` / detected | `utf-8` | Fall back to `latin-1` or `chardet` detection on `UnicodeDecodeError` from legacy feeds |
| `batch_size` | 1000–10000 | 5000 | Records buffered before flush in streaming mode; bounds peak memory $b \cdot \bar{s}_{\text{item}}$ |
| `stream_threshold` | bytes | 200 MB | Switch from whole-document parse to `iterparse` above this size |
| `dict_constructor` | `dict` / `OrderedDict` | `dict` | Plain `dict` is JSON-serializable and deterministically ordered on modern Python |
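A per-partner config table could be modelled as below. This is a sketch under assumptions: the `PartnerConfig` dataclass, the `PARTNERS` mapping, the `ACME` entry, and `choose_branch` are all hypothetical names, and 200 MB is taken as 200,000,000 bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartnerConfig:
    """Converter settings for one trading partner; defaults mirror the table above."""
    force_list: tuple[str, ...] = ()
    encoding: str = "utf-8"
    batch_size: int = 5000
    stream_threshold: int = 200_000_000  # 200 MB, read as decimal megabytes


# Hypothetical partner entry; real values come from the Step 1 profile.
PARTNERS: dict[str, PartnerConfig] = {
    "ACME": PartnerConfig(force_list=("LineItem", "ShipmentDetail"), encoding="latin-1"),
}


def choose_branch(size_bytes: int, cfg: PartnerConfig) -> str:
    """Route to the streaming branch only when the payload exceeds the threshold."""
    return "stream" if size_bytes > cfg.stream_threshold else "whole"
```

In use, the caller would pass something like `os.path.getsize(path)` as `size_bytes` and dispatch to the whole-document parser or the streaming converter based on the returned label.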

Debugging & Recovery

Production feeds break daily. Triage by the failure signal rather than re-running blindly, and route every document that cannot be decoded or validated to a dead-letter queue keyed by batch_id so the run stays auditable and replayable. A sufficient audit record per failed document is {batch_id, supplier_id, source_uri, transaction_set, error_type, element_path, attempt, ts_utc}. The error_type field drives triage at a glance:

UnicodeDecodeError (encoding mismatch). Symptom: parse fails on the first accented byte. Cause: legacy ISO-8859-1 / Windows-1252 payload decoded as UTF-8. Fix: retry with encoding="latin-1" or detect with chardet before handing bytes to xmltodict; a wave of these from one partner means their export setting changed.
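ExpatError (malformed markup). Symptom: xml.parsers.expat.ExpatError: not well-formed mid-document. Cause: unescaped < or & inside a text node. Fix: escape bare ampersands with re.sub(r"&(?!#?\w+;)", "&amp;", raw_xml) before parsing; the lookahead leaves named and numeric entities such as &amp; and &#233; untouched. A stray < needs a vendor-specific rule, so if the error recurs on the same element_path, push a sanitizer rule for that vendor.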
TypeError on iteration (arity flip). Symptom: string indices must be integers downstream. Cause: a repeated tag was missing from force_list, so a one-item collection parsed as a dict. Fix: add the tag to force_list, or guard reads with a defensive coercion helper:

PYTHON

def ensure_list(data: dict, key: str) -> list:
    """Return a stable list for a key whose arity is not guaranteed by force_list."""
    val = data.get(key)
    if val is None:
        return []
    return val if isinstance(val, list) else [val]

pydantic.ValidationError (contract breach). Symptom: validation rejects a record. Cause: malformed currency, unparseable date, or a missing mandatory field. Fix: do not patch in place — serialize the failed payload with its traceback to the DLQ and replay after the partner is notified. A spike concentrated on one supplier_id signals an upstream export change, not random corruption.
Monitoring to confirm the fix. Track decode_success_rate per partner, dlq_depth by error_type, and peak_rss on streaming runs. A healthy converter holds success rate above 99%, keeps peak_rss flat across file sizes, and never lets a single corrupt ASN halt the batch. Because failed documents are keyed by batch_id, a failed run replays from its last checkpoint without re-pulling already-staged records.
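The first two fixes in the triage list can be combined into a small pre-parse step. The sketch below uses only the standard library. The function names are hypothetical, the fallback order (UTF-8, then Windows-1252, then Latin-1) is an assumption that should be checked against each partner's real export setting, and the ampersand rule will not catch text such as `AT&T;` that merely looks like an entity.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does not start a named (&amp;) or numeric (&#233;) entity.
BARE_AMPERSAND = re.compile(r"&(?!#?\w+;)")


def decode_with_fallback(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding_used), trying UTF-8 before the legacy code pages."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so this last resort cannot raise.
    return raw.decode("latin-1"), "latin-1"


def escape_bare_ampersands(text: str) -> str:
    """Escape stray '&' characters so the parser accepts the text node."""
    return BARE_AMPERSAND.sub("&amp;", text)


# Hypothetical legacy payload: Windows-1252 bytes with an unescaped ampersand.
raw = "<Supplier>Café & Fils</Supplier>".encode("cp1252")
text, used = decode_with_fallback(raw)
root = ET.fromstring(escape_bare_ampersands(text))
print(used, root.text)
```

Logging `encoding_used` alongside the audit record makes a partner-wide export change visible as a shift in that field.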

FAQ

Why does a one-item shipment crash my downstream loop when ten-item shipments work?

Without force_list, xmltodict represents a single repeated element as a dict and multiple as a list, so the arity changes per file. Your for item in payload["LineItem"]: then iterates dict keys (strings) on the one-item case and raises TypeError the moment it indexes them. Add every repeated record tag — LineItem, ShipmentDetail, PackReference — to the force_list tuple, and guard any unprofiled tag with an ensure_list helper so arity is always stable.

Should I strip namespaces or preserve them?

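For the common case where each logical element maps to one tag, collapse everything to a single canonical keyspace. Note that process_namespaces=False does not strip anything: xmltodict leaves each prefix exactly as written (for example ns0:LineItem), so keys still drift when a partner renames a prefix. Set process_namespaces=True and pass a namespaces map that sends each vendor URI to None, which drops the URI from the key and leaves the bare local name. Preserve namespaces only when two different namespaces genuinely reuse the same local tag name with different meanings; then keep process_namespaces=True, map each URI to a distinct short prefix, and address the prefixed names explicitly so the two are not silently merged.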

How do I convert a multi-gigabyte ASN without running out of memory?

Do not parse it whole. Use xml.etree.ElementTree.iterparse() with ("start", "end") events, capture the root element from the first start event, and after appending each LineItem to your batch call elem.clear() and remove it from the root so the references iterparse accumulates are released. Flush the batch to a JSON array in fixed-size chunks. Peak memory then tracks batch_size × average_record_size, not the file size, which keeps the converter inside a containerized runner’s quota regardless of how large the manifest grows.
