Schema Validation Using Pydantic

Reliable supply chain reconciliation depends on deterministic data contracts. Raw ingestion moves bytes into memory, but a row can parse cleanly while carrying a negative lead time, a malformed vendor code, or a quantity that violates a receiving tolerance. Schema validation is the boundary that turns a structurally-parsed dictionary into a record the matching and inventory-balancing stages are allowed to trust. The engineering decision this page addresses is narrow and consequential: where in the pipeline do you enforce the contract, how strict do you make coercion, and how do you fail a record without stalling the batch around it?

Within the broader Ingestion & Parsing Workflows for Supply Chain Data, Pydantic operates as the explicit contract layer that sits between raw feed consumption and the reconciliation engine. Unlike ad-hoc type checks, isinstance guards, or dictionary lookups scattered through business logic, Pydantic V2 gives you declarative model definitions, runtime coercion with predictable rules, and a structured error surface that scales across high-volume procurement, logistics, and warehouse management feeds. The patterns below are implementation-ready: domain models that mirror operational entities, a strict-mode policy you can defend, field-level error extraction wired to a dead-letter queue, and the calibration knobs that keep false-positive rejections out of your exception backlog.

Core Concept & Decision Criteria

A validation contract answers three questions for every incoming record: is the shape correct (all mandatory fields present, no unexpected extras), is the type correct (and may it be coerced), and is the record semantically valid against business rules (over-receipt tolerance, state transitions, referential sanity). Parsing libraries deliberately stop at shape. They will hand you a dictionary with a string where an integer belongs and a date that is three days in the future, and consider their job done. Pydantic is where the remaining two questions get answered.

The first decision signal is strict versus lax coercion. Legacy EDI translators and supplier portals overwhelmingly emit numerics as strings: "1200.00" for a quantity, "00045" for a line sequence. Strict mode rejects these outright; lax mode (strict=False) coerces them into native types before the constraint checks run. For ingestion you almost always want lax coercion at the field boundary paired with strict semantic validators, so trivial formatting differences pass while genuine rule violations fail loudly. The second signal is where validation runs: validate as soon as the data is structurally complete, that is, the moment a row from a tabular feed (see Parsing CSV and Excel Feeds with Pandas) becomes a dict, or a nested payload (see XML to JSON Conversion with xmltodict) is flattened, so a bad record is quarantined before it can pollute downstream joins.
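As a rough illustration of the strict-versus-lax signal, the sketch below validates the same string-encoded quantity under both settings. The model names LaxQuantity and StrictQuantity are hypothetical and exist only for this comparison; the configuration keys and Field constraint are the ones discussed above.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class LaxQuantity(BaseModel):
    # Hypothetical model: lax coercion, as recommended for EDI/portal feeds.
    model_config = ConfigDict(strict=False)
    qty: float = Field(ge=0)


class StrictQuantity(BaseModel):
    # Hypothetical model: strict typing, for feeds that are already typed.
    model_config = ConfigDict(strict=True)
    qty: float = Field(ge=0)


# Lax mode: the string numeric is coerced to a float before ge=0 is checked.
lax = LaxQuantity(qty="1200.00")

# Lax mode still enforces the constraint once the value has been coerced.
lax_error_types = []
try:
    LaxQuantity(qty="-5")
except ValidationError as exc:
    lax_error_types = [e["type"] for e in exc.errors()]

# Strict mode: the string is rejected as a type mismatch.
strict_error_types = []
try:
    StrictQuantity(qty="1200.00")
except ValidationError as exc:
    strict_error_types = [e["type"] for e in exc.errors()]

print(lax.qty, lax_error_types, strict_error_types)
```

The lax model accepts the formatting difference but still fails the negative quantity, which is the "lax at the boundary, strict on semantics" combination described above.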

The table below contrasts the validation styles you will choose between per feed and per field. Treat the “When to use” column as the policy your model configuration encodes.

Validation style	Mechanism	Behaviour on `"1200.00"` → `float`	When to use
Strict typing	`model_config = {"strict": True}`	Rejected — type mismatch	Internal services emitting already-typed JSON
Lax coercion	`model_config = {"strict": False}`	Coerced to `1200.0`	Legacy EDI / portal feeds with string numerics
Field constraint	`Field(ge=0, pattern=...)`	Coerced, then range/pattern checked	Bounded quantities, formatted identifiers
Pre-normalization	`Annotated[T, BeforeValidator(fn)]`	Cleaned, then validated	SKU/vendor canonicalization before checks
Cross-field rule	`@model_validator(mode="after")`	Whole-record business rule	Over-receipt tolerance, state-to-quantity logic

Two distinctions deserve emphasis. A BeforeValidator runs before type coercion and field constraints, which is exactly where input normalization belongs — strip control characters, canonicalize a SKU, fold a vendor alias — so the constraint checks see a clean value. A model_validator(mode="after") runs once the whole record is typed and is the only correct place for rules that span fields, because it can see qty_ordered and qty_fulfilled together. Putting a cross-field rule in a single-field validator is the most common source of order-dependent bugs.

Implementation

Pydantic V2 models must mirror the operational entities your reconciliation pipeline consumes. For procurement and logistics that means explicit typing for purchase order identifiers, constrained numeric tolerances, timezone-aware delivery windows, and enumerated fulfillment states. The model below is the contract a single procurement line item must satisfy before it is eligible for three-way matching. Structured logging at the boundary gives the recovery section its audit fields.

PYTHON

import logging
import re
from datetime import datetime
from enum import Enum
from typing import Annotated, Optional

from pydantic import BaseModel, BeforeValidator, Field, model_validator

logger = logging.getLogger("validation.procurement")


class UnitOfMeasure(str, Enum):
    EACH = "EA"
    CASE = "CS"
    PALLET = "PLT"
    KILOGRAM = "KG"


class FulfillmentState(str, Enum):
    PENDING = "PENDING"
    PARTIAL_SHIP = "PARTIAL"
    COMPLETE = "COMPLETE"
    CANCELLED = "CANCELLED"


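def canonicalize_sku(raw: object) -> object:
    """Normalize a SKU before length/pattern checks run."""
    # Pass non-strings through untouched so Pydantic reports an ordinary
    # type error instead of an AttributeError escaping from .strip().
    if not isinstance(raw, str):
        return raw
    return re.sub(r"[\s\-_\.]+", "", raw.strip().upper())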


# In Pydantic v2, BeforeValidator is attached via typing.Annotated, not as a
# field kwarg. It runs BEFORE coercion and Field constraints, so the value is
# already canonical when the length/pattern checks evaluate it.
NormalizedSKU = Annotated[str, BeforeValidator(canonicalize_sku)]


class ProcurementLineItem(BaseModel):
    # Lax coercion at the field boundary (string EDI numerics are accepted),
    # paired with strict semantic validators below.
    model_config = {"strict": False, "validate_default": True, "extra": "forbid"}

    po_id: str = Field(min_length=4, max_length=18, pattern=r"^[A-Z0-9\-]+$")
    line_sequence: int = Field(gt=0, le=999)
    sku: NormalizedSKU = Field(min_length=5, max_length=32)
    qty_ordered: float = Field(ge=0.0)
    qty_fulfilled: float = Field(ge=0.0)
    uom: UnitOfMeasure
    state: FulfillmentState = FulfillmentState.PENDING
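    # Plain datetime also accepts naive values; pydantic.AwareDatetime would
    # reject inputs that lack a UTC offset.
    promised_date: datetime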
    vendor_code: str = Field(pattern=r"^VND-\d{5}$")
    dc_location: Optional[str] = None

    @model_validator(mode="after")
    def enforce_fulfillment_logic(self) -> "ProcurementLineItem":
        # Cross-field rules belong here: the whole record is typed and visible.
        if self.qty_fulfilled > self.qty_ordered * 1.05:
            raise ValueError("Over-fulfillment exceeds 5% tolerance threshold")
        if self.state is FulfillmentState.COMPLETE and self.qty_fulfilled < self.qty_ordered:
            raise ValueError("Complete state requires full quantity match")
        return self

The strict: False configuration permits safe type coercion, converting string-encoded numerals from legacy EDI or supplier portals into native Python types without raising premature exceptions, while extra: "forbid" makes an unexpected column a hard error rather than a silent drop — critical when a supplier appends an undocumented field that should trigger an onboarding review. The @model_validator hook intercepts domain-specific violations before they contaminate reconciliation logic, so business rules like over-receipt tolerances and state transitions are enforced at instantiation. Configuration details for strict mode and validator execution order are documented in the official Pydantic V2 documentation.

The model is only half the contract; the other half is how you apply it. Never call Model(**row) bare in a loop — a single raised ValidationError aborts the batch. Wrap each record so failures become data, not control flow:

PYTHON

from typing import Any, Dict, List, Tuple

from pydantic import ValidationError


def validate_batch(
    rows: List[Dict[str, Any]], feed_name: str
) -> Tuple[List[ProcurementLineItem], List[Dict[str, Any]]]:
    """Validate a batch, returning (accepted_records, dlq_entries).

    Each DLQ entry carries the structured error list so triage never needs
    to re-run the parse to locate the break.
    """
    accepted: List[ProcurementLineItem] = []
    quarantined: List[Dict[str, Any]] = []

    for idx, row in enumerate(rows):
        try:
            accepted.append(ProcurementLineItem(**row))
        except ValidationError as exc:
            # exc.errors() -> list of {loc, type, msg, input}
            quarantined.append(
                {
                    "feed": feed_name,
                    "row_index": idx,
                    "po_id": row.get("po_id"),
                    "errors": exc.errors(include_url=False),
                    "raw": row,
                }
            )
            logger.warning(
                "validation_failed feed=%s row=%d po=%s errors=%d",
                feed_name, idx, row.get("po_id"), len(exc.errors()),
            )

    logger.info(
        "batch_validated feed=%s accepted=%d quarantined=%d",
        feed_name, len(accepted), len(quarantined),
    )
    return accepted, quarantined

When a record violates the schema, Pydantic raises a ValidationError whose errors() method returns a list of dictionaries, each with loc (the field path as a tuple), type (the failure category, e.g. greater_than_equal), msg (a human-readable description), and input (the offending value). That structure is what makes automated quarantine routing possible instead of silent corruption — the loc and type are enough to tag the failure, alert the right team, and build a discrepancy report without a human re-reading the file.
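To make the error surface concrete, here is a minimal sketch that collects errors() for two failing rows and serializes them for a queue. The Receipt model and error_summary helper are hypothetical. One assumption worth flagging: the sketch passes include_context=False because, for a ValueError raised in a validator, the ctx entry holds the exception object itself, which json.dumps cannot serialize.

```python
import json

from pydantic import BaseModel, Field, ValidationError, model_validator


class Receipt(BaseModel):
    # Hypothetical two-field model used only to trigger representative errors.
    qty_ordered: float = Field(ge=0)
    qty_fulfilled: float = Field(ge=0)

    @model_validator(mode="after")
    def within_tolerance(self) -> "Receipt":
        if self.qty_fulfilled > self.qty_ordered * 1.05:
            raise ValueError("Over-fulfillment exceeds 5% tolerance threshold")
        return self


def error_summary(row: dict) -> list:
    """Return JSON-safe error entries for a row, or [] if it validates."""
    try:
        Receipt(**row)
        return []
    except ValidationError as exc:
        # Drop 'url' and 'ctx' so the entries contain only plain data.
        return exc.errors(include_url=False, include_context=False)


# Field-level failures: one constraint break and one unparseable number.
field_errors = error_summary({"qty_ordered": "-1", "qty_fulfilled": "abc"})

# Record-level failure: both fields are valid, the cross-field rule is not.
rule_errors = error_summary({"qty_ordered": "100", "qty_fulfilled": "120"})

# Tuples in 'loc' become JSON arrays when serialized.
payload = json.dumps({"errors": field_errors + rule_errors})
print(payload)
```

Note that the cross-field failure carries an empty loc, since it belongs to the record as a whole rather than to one field; routing logic that assumes loc always names a field should account for that.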

Configuration & Threshold Calibration

The validation contract is a configuration surface, and it should be vendor-tier specific rather than global. A strategic supplier with a hand-maintained ERP export needs a generous tolerance and lax coercion; a high-volume commodity partner emitting clean, schema-versioned JSON should run closer to strict mode so drift surfaces immediately. Pin these per trading partner in a feed registry rather than hard-coding one policy for every source.

Parameter	Recommended default	Tier override range	Rationale
`strict`	`False`	`True` for typed internal feeds	Coerce string EDI numerics; tighten only where the source is already typed
`extra`	`forbid`	`ignore` for chatty legacy feeds	Surface undocumented columns as onboarding signals, not silent drops
Over-receipt tolerance	`1.05` (5%)	`1.00`–`1.20`	Bulk commodities flex up; serialized/high-value parts stay tight
`promised_date` skew	reject if `> now + 365d`	30d–730d	Catches century-typo dates from bad exporters; epoch-zero dates lie in the past and need a separate lower bound
`validate_default`	`True`	`False` for trusted defaults	Ensures enum/default values are themselves contract-valid
SKU length band	`5`–`32`	per-catalog map	Prevents truncated or padded identifiers from joining wrong

The over-receipt tolerance is the parameter most worth calibrating deliberately. Expressed as a multiplier, the rule the validator enforces is that the received quantity must not exceed the ordered quantity scaled by the tolerance factor:

$$q_{\text{fulfilled}} \le q_{\text{ordered}} \times (1 + \tau)$$

where $\tau$ is the allowed over-receipt fraction (0.05 for the 5% default). Set $\tau$ from the receiving agreement, not from a round number: weigh-scale commodities legitimately drift a few percent per pallet, while serialized electronics should reject anything over the ordered count. When the quantity and price bands the validator enforces must agree with the match engine, align them with Setting Quantity and Price Tolerance Windows so a record that passes validation is not re-flagged as a tolerance break during reconciliation. Never widen a tolerance to clear a backlog — that converts a data-quality alert into silent acceptance of bad receipts.
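One possible way to keep $\tau$ out of the model body is to pass it through Pydantic's validation context, so the same model can serve several vendor tiers. This is a sketch under assumptions: ReceiptLine, the "tau" context key, and the accepts helper are hypothetical, and a real feed registry would supply the value.

```python
from pydantic import BaseModel, Field, ValidationError, ValidationInfo, model_validator

DEFAULT_TAU = 0.05  # the 5% default discussed above


class ReceiptLine(BaseModel):
    # Hypothetical model: only the two quantities the tolerance rule needs.
    qty_ordered: float = Field(ge=0)
    qty_fulfilled: float = Field(ge=0)

    @model_validator(mode="after")
    def within_tolerance(self, info: ValidationInfo) -> "ReceiptLine":
        # info.context is None unless the caller passes context=...
        tau = (info.context or {}).get("tau", DEFAULT_TAU)
        # Enforces q_fulfilled <= q_ordered * (1 + tau).
        if self.qty_fulfilled > self.qty_ordered * (1 + tau):
            raise ValueError(f"Over-fulfillment exceeds {tau:.0%} tolerance")
        return self


def accepts(row: dict, tau: float) -> bool:
    """Validate a row under a tier-specific tolerance fraction."""
    try:
        ReceiptLine.model_validate(row, context={"tau": tau})
        return True
    except ValidationError:
        return False


row = {"qty_ordered": "100", "qty_fulfilled": "110"}
print(accepts(row, tau=0.05))  # tight tier
print(accepts(row, tau=0.20))  # bulk-commodity tier
```

Because the comparison uses floating-point multiplication, a receipt sitting exactly on the boundary may land on either side; if exact boundary behaviour matters, Decimal fields are one option to consider.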

Pre-validation data hygiene further reduces exception rates. Applying deterministic cleaning routines — stripping control characters, normalizing whitespace, folding vendor aliases, standardizing currency tokens — inside BeforeValidator functions before instantiation minimizes false positives, so a ValidationError reflects a genuine business-rule violation rather than a trivial formatting artifact. Timezone-bearing fields such as promised_date should be normalized to a single canonical zone in step with Timezone Normalization for Global Supply Chains, which prevents off-by-one-day discrepancies when a delivery window is compared across regions.
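The timezone point can be sketched as a reusable annotated type. The example assumes UTC as the canonical zone, which the text above does not prescribe; UtcDatetime, to_utc, and DeliveryWindow are hypothetical names. AwareDatetime is Pydantic's type for datetimes that must carry an offset.

```python
from datetime import datetime, timezone
from typing import Annotated

from pydantic import AfterValidator, AwareDatetime, BaseModel, ValidationError


def to_utc(value: datetime) -> datetime:
    """Convert an offset-aware datetime to UTC (assumed canonical zone)."""
    return value.astimezone(timezone.utc)


# Require an explicit offset first, then normalize the instant to UTC.
UtcDatetime = Annotated[AwareDatetime, AfterValidator(to_utc)]


class DeliveryWindow(BaseModel):
    promised_date: UtcDatetime


# 23:30 at UTC-05:00 on 1 March is already 2 March in UTC.
window = DeliveryWindow(promised_date="2024-03-01T23:30:00-05:00")

# A naive timestamp has no offset, so it cannot be normalized safely.
try:
    DeliveryWindow(promised_date="2024-03-01T23:30:00")
    naive_rejected = False
except ValidationError:
    naive_rejected = True

print(window.promised_date.isoformat(), naive_rejected)
```

The example date shows the off-by-one-day effect directly: the calendar date differs depending on the zone in which the instant is read, so comparisons should happen only after normalization.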

Orchestration & Integration

The contract layer sits immediately after parsing and immediately before the reconciliation engine, and it must behave predictably under retry. Upstream, the parser hands it a list of structurally-clean dictionaries; downstream, the engine receives only typed, contract-bound records while every rejection lands in a tagged exception table. Validation is a pure function of its input — the same row always yields the same accept/reject decision — which is the precondition for idempotent replay. Derive the idempotency key for a record from its content (feed name plus PO id plus line sequence plus content hash) and persist it before the upsert, so a redelivered feed re-validates to the same outcome rather than double-writing into reconciliation.
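The content-derived key described above might look like the following sketch. The hashing scheme (SHA-256 over canonical JSON) and the pipe-delimited key layout are assumptions chosen for illustration; content_hash and idempotency_key are hypothetical helpers.

```python
import hashlib
import json
from typing import Any, Dict


def content_hash(record: Dict[str, Any]) -> str:
    """Hash a record's content independently of key order."""
    # sort_keys gives a canonical form; default=str covers values such as
    # datetimes that json cannot encode natively.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def idempotency_key(feed_name: str, record: Dict[str, Any]) -> str:
    """Combine feed name, PO id, line sequence, and content hash."""
    parts = [
        feed_name,
        str(record.get("po_id")),
        str(record.get("line_sequence")),
        content_hash(record),
    ]
    return "|".join(parts)


first = {"po_id": "PO-1001", "line_sequence": 1, "qty_fulfilled": "40"}
redelivered = {"qty_fulfilled": "40", "line_sequence": 1, "po_id": "PO-1001"}
corrected = {"po_id": "PO-1001", "line_sequence": 1, "qty_fulfilled": "45"}

print(idempotency_key("asn_feed", first) == idempotency_key("asn_feed", redelivered))
print(idempotency_key("asn_feed", first) == idempotency_key("asn_feed", corrected))
```

A redelivered copy of the same record maps to the same key, while a corrected record with the same business key produces a new one, so the upsert can tell a replay from an amendment.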

Validation receives input from two parsing paths and feeds one consumer. Tabular feeds arrive as DataFrame rows from Parsing CSV and Excel Feeds with Pandas; convert the frame to a list of row dictionaries with df.to_dict("records") and validate per record so one bad row is isolated rather than failing the frame. Missing cells come through as NaN rather than None, so map them to None first if a field is optional. Hierarchical ASN and despatch-advice documents arrive as nested dictionaries from XML to JSON Conversion with xmltodict; Pydantic’s nested-model support validates the parent-and-lines structure in one pass, catching missing mandatory segments and malformed enumerations before they reach the engine. The canonical field names the model expects should match the mapping documented in EDI 810 vs 850 Schema Mapping so an invoice and its originating order validate against compatible contracts. Where feed volume rather than record complexity is the constraint, the per-record validation call becomes the unit of work fanned out under the concurrency model in Async Batch Processing for High-Volume Feeds.

For payloads that carry procurement-sensitive fields, the contract layer is also a natural enforcement point for field-level access rules described in Data Security Boundaries for Procurement Systems — validate and redact in the same pass so restricted attributes never reach a downstream store that should not hold them.

Debugging & Pipeline Recovery

When a record fails validation the goal is a self-clearing exception queue, not a manual scavenger hunt. The structured errors() list is what makes that possible — route every failure to a dead-letter queue (DLQ) carrying the full context, then tag it so root-cause analytics can spot systemic supplier drift before it snowballs.

DLQ payload contract. Each entry stores the feed name, the row index or document id, the business key (po_id/line_sequence), the full errors() list (loc, type, msg, input), and the raw record. Without the loc pointer an analyst has to re-run the validation by hand to find the broken field.
Failure-reason taxonomy. Map Pydantic error type codes onto a small operational taxonomy: MISSING_FIELD (missing), TYPE_INVALID (int_parsing, float_parsing, datetime_parsing), OUT_OF_RANGE (greater_than, greater_than_equal, less_than, less_than_equal), PATTERN_INVALID (string_pattern_mismatch), ENUM_INVALID (enum), EXTRA_FIELD (extra_forbidden), and BUSINESS_RULE (value_error, the type Pydantic assigns when a validator such as model_validator raises ValueError). This single tag turns a flat queue into a triage dashboard and tells you whether the fix belongs to onboarding, the supplier, or the schema.
Audit log fields. Emit feed_name, content_hash, record_count, accepted_count, quarantined_count, model_version, and validated_at for every batch — write them to append-only storage so SOX and internal audit reviews can replay any acceptance decision.
Monitoring signals & alert thresholds. Track the failure-reason distribution per feed. A climbing EXTRA_FIELD or MISSING_FIELD rate almost always means a supplier changed an export template; a spike in ENUM_INVALID points at a new status code your model has not yet learned; a rising BUSINESS_RULE rate is a genuine operational problem (systematic over-receipts), not a parsing bug. Alert the feed-onboarding team rather than loosening a constraint to mask the symptom.
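The taxonomy above reduces to a lookup table. The sketch below works on plain error dictionaries of the shape errors() returns, so it has no Pydantic dependency. REASON_BY_TYPE, classify, and tag_dlq_entry are hypothetical names; the table also includes a few sibling codes (greater_than, less_than_equal, float_parsing) on the assumption that the line-item model's gt/le constraints and float fields can emit them, and it maps value_error to BUSINESS_RULE, which is approximate because any validator that raises ValueError yields that type.

```python
from typing import Any, Dict

# Pydantic error 'type' code -> operational failure reason.
REASON_BY_TYPE: Dict[str, str] = {
    "missing": "MISSING_FIELD",
    "int_parsing": "TYPE_INVALID",
    "float_parsing": "TYPE_INVALID",
    "datetime_parsing": "TYPE_INVALID",
    "greater_than": "OUT_OF_RANGE",
    "greater_than_equal": "OUT_OF_RANGE",
    "less_than": "OUT_OF_RANGE",
    "less_than_equal": "OUT_OF_RANGE",
    "string_pattern_mismatch": "PATTERN_INVALID",
    "enum": "ENUM_INVALID",
    "extra_forbidden": "EXTRA_FIELD",
    "value_error": "BUSINESS_RULE",
}


def classify(error: Dict[str, Any]) -> str:
    """Map one errors() entry to a reason; unknown codes stay visible."""
    return REASON_BY_TYPE.get(error.get("type", ""), "UNCLASSIFIED")


def tag_dlq_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a DLQ entry with a sorted, de-duplicated reason list."""
    reasons = sorted({classify(e) for e in entry.get("errors", [])})
    return {**entry, "reasons": reasons}


entry = {
    "feed": "asn_feed",
    "row_index": 7,
    "errors": [
        {"loc": ("vendor_code",), "type": "string_pattern_mismatch"},
        {"loc": ("line_sequence",), "type": "less_than_equal"},
        {"loc": (), "type": "value_error"},
    ],
}
print(tag_dlq_entry(entry)["reasons"])
```

Keeping an explicit UNCLASSIFIED bucket means a new error code shows up on the dashboard instead of being silently folded into an existing reason.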

Retry only transient causes. A ValidationError is deterministic — re-running it produces the identical failure, so retrying a malformed record just burns the queue. Reserve backoff retries for the I/O around validation (a registry lookup, a downstream write) and fail fast on contract violations, quarantining the record and emitting the telemetry above so the triage dashboard drives the fix. Versioning the model (model_version in the audit row) lets you correlate a spike in rejections with a contract change you shipped, which is the fastest way to distinguish “the supplier broke” from “we tightened the schema.”
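A minimal sketch of that retry policy follows, assuming ConnectionError and TimeoutError stand in for the transient I/O failures of a real registry or store. The with_backoff helper and its parameters are hypothetical. Any other exception, including Pydantic's ValidationError (a ValueError subclass in V2), propagates on the first attempt.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumed stand-ins for transient I/O failures around validation.
TRANSIENT = (ConnectionError, TimeoutError)


def with_backoff(
    op: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op, retrying only transient errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return op()
        except TRANSIENT:
            # Out of attempts: surface the transient error to the caller.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # the loop always returns or raises


# Usage: a lookup that fails twice, then succeeds on the third attempt.
state = {"calls": 0}


def flaky_lookup() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("registry unavailable")
    return "tier-A"


print(with_backoff(flaky_lookup, sleep=lambda seconds: None))
```

Injecting the sleep function keeps the helper testable without real delays; the deterministic failures never enter the retry loop at all.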

FAQ

When should I use strict mode instead of lax coercion?

Use lax coercion (strict=False) at the field boundary for any feed that originates from EDI translators or supplier portals, because those sources emit numerics and booleans as strings and strict mode would reject every record. Reserve strict=True for internal services that already emit correctly-typed JSON, where a string where an integer belongs is a real upstream bug you want surfaced immediately. The robust default for ingestion is lax field coercion paired with strict semantic validators, so formatting differences pass while genuine rule violations fail.

Why does my cross-field rule fire before all the fields are set?

Because it is almost certainly written as a single-field validator. Field validators run one field at a time in declaration order, so fields declared later have not been validated yet, and fields that failed validation are missing altogether. Any rule that compares two or more fields, like over-receipt tolerance comparing qty_fulfilled against qty_ordered, must live in a @model_validator(mode="after"), which runs once after the entire record is typed and populated. That is the only place where every field is guaranteed visible.
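The ordering problem can be observed directly. In this sketch a field validator records which sibling fields it can see through info.data, while the cross-field rule sits in an after-mode model validator. LineQuantities and peek_at_siblings are hypothetical names used only for the demonstration.

```python
from pydantic import BaseModel, ValidationInfo, field_validator, model_validator

visible_to_field_validator = []


class LineQuantities(BaseModel):
    qty_fulfilled: float  # declared first, so validated first
    qty_ordered: float

    @field_validator("qty_fulfilled")
    @classmethod
    def peek_at_siblings(cls, value: float, info: ValidationInfo) -> float:
        # info.data holds only the fields validated so far.
        visible_to_field_validator.append(sorted(info.data))
        return value

    @model_validator(mode="after")
    def check_tolerance(self) -> "LineQuantities":
        # Both quantities are typed and present by the time this runs.
        if self.qty_fulfilled > self.qty_ordered * 1.05:
            raise ValueError("Over-fulfillment exceeds 5% tolerance threshold")
        return self


record = LineQuantities(qty_fulfilled="100", qty_ordered="100")
print(visible_to_field_validator)
```

When the field validator runs, qty_ordered is not yet in info.data, so a tolerance check written there would either raise a KeyError or silently depend on the order in which the fields happen to be declared.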

How do I stop one bad record from killing the whole batch?

Never instantiate models bare inside a loop. Wrap each Model(**row) call in a try/except ValidationError, append successes to an accepted list and failures (with exc.errors()) to a quarantine list, and return both. A single raised ValidationError only aborts the batch when it propagates; catching it per record turns a fatal exception into a routable DLQ entry and keeps the valid records flowing.

My SKUs fail the length check even though they look correct. What is wrong?

The constraint is evaluating the raw value before normalization. Attach a BeforeValidator via Annotated that canonicalizes the SKU (strip whitespace and separators such as hyphens, and fold to upper case) so the length and pattern checks see the cleaned form. BeforeValidator runs ahead of coercion and field constraints, which is exactly why input normalization belongs there rather than in an after-validator.

How do I validate nested ASN or despatch-advice payloads from XML?

Model the hierarchy as nested Pydantic models, a header model with a typed list[LineItem] field, and validate the whole structure in one call after parsing the document with xmltodict. Pydantic recurses into the nested models, so a missing mandatory segment or a malformed line enumeration produces a precise loc path like ("lines", 3, "uom"), telling you exactly which line failed and on which field.
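A minimal sketch of the nested pattern, using a hand-written dictionary in place of real xmltodict output. AsnDocument, AsnLine, and the field names are hypothetical and not taken from any particular ASN standard.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel, Field, ValidationError


class UnitOfMeasure(str, Enum):
    EACH = "EA"
    CASE = "CS"


class AsnLine(BaseModel):
    line_sequence: int = Field(gt=0)
    sku: str
    uom: UnitOfMeasure


class AsnDocument(BaseModel):
    asn_id: str
    # At least one line is mandatory; each element is validated as an AsnLine.
    lines: List[AsnLine] = Field(min_length=1)


# Stand-in for a parsed document; the second line carries an unknown UOM.
payload = {
    "asn_id": "ASN-1",
    "lines": [
        {"line_sequence": "1", "sku": "ABC123", "uom": "EA"},
        {"line_sequence": "2", "sku": "DEF456", "uom": "BOX"},
    ],
}

try:
    AsnDocument.model_validate(payload)
    failures = []
except ValidationError as exc:
    # Each loc is a path: field name, list index, nested field name.
    failures = [(e["loc"], e["type"]) for e in exc.errors()]

print(failures)
```

One thing to check on real input: xmltodict returns a single dictionary rather than a list when a repeated element occurs only once, so a one-line document would fail the list check unless the parser is told to force a list (its force_list argument) or the payload is normalized before validation.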
