Schema Validation Using Pydantic
Reliable supply chain reconciliation depends on deterministic data contracts. Raw ingestion moves bytes into memory, but schema validation enforces the structural and semantic guarantees required for downstream matching, exception routing, and inventory balancing. Within the broader Ingestion & Parsing Workflows for Supply Chain Data, Pydantic operates as the explicit contract layer that sits between raw feed consumption and reconciliation engines. Unlike ad-hoc type checks or dictionary lookups, Pydantic provides declarative model definitions built from type hints, runtime validation and coercion, and structured error extraction that scale across high-volume procurement, logistics, and warehouse management feeds.
Defining Domain-Specific Models
Pydantic V2 models must mirror the operational entities your reconciliation pipeline consumes. For procurement and logistics, this means explicit typing for purchase order identifiers, constrained numeric tolerances, timezone-aware delivery windows, and enumerated fulfillment states.
from datetime import datetime
from enum import Enum
from typing import Annotated, Optional
import re

from pydantic import BaseModel, Field, BeforeValidator, model_validator


class UnitOfMeasure(str, Enum):
    EACH = "EA"
    CASE = "CS"
    PALLET = "PLT"
    KILOGRAM = "KG"


class FulfillmentState(str, Enum):
    PENDING = "PENDING"
    PARTIAL_SHIP = "PARTIAL"
    COMPLETE = "COMPLETE"
    CANCELLED = "CANCELLED"


def canonicalize_sku(raw: str) -> str:
    return re.sub(r"[\s\-_\.]+", "", raw.strip().upper())


# In Pydantic v2, BeforeValidator is attached via typing.Annotated, not as a
# field kwarg. Combining it with Field constraints lets us normalize the value
# before the length/pattern checks run.
NormalizedSKU = Annotated[str, BeforeValidator(canonicalize_sku)]


class ProcurementLineItem(BaseModel):
    model_config = {"strict": False, "validate_default": True}

    po_id: str = Field(min_length=4, max_length=18, pattern=r"^[A-Z0-9\-]+$")
    line_sequence: int = Field(gt=0, le=999)
    sku: NormalizedSKU = Field(min_length=5, max_length=32)
    qty_ordered: float = Field(ge=0.0)
    qty_fulfilled: float = Field(ge=0.0)
    uom: UnitOfMeasure
    state: FulfillmentState = FulfillmentState.PENDING
    promised_date: datetime
    vendor_code: str = Field(pattern=r"^VND-\d{5}$")
    dc_location: Optional[str] = None

    @model_validator(mode="after")
    def enforce_fulfillment_logic(self) -> "ProcurementLineItem":
        if self.qty_fulfilled > self.qty_ordered * 1.05:
            raise ValueError("Over-fulfillment exceeds 5% tolerance threshold")
        if self.state == FulfillmentState.COMPLETE and self.qty_fulfilled < self.qty_ordered:
            raise ValueError("Complete state requires full quantity match")
        return self
The strict: False configuration keeps Pydantic in its default lax mode, which permits safe type coercion: string-encoded numerals from legacy EDI or supplier portals are converted into native Python types without raising premature exceptions. The @model_validator hook, declared with mode="after", runs once field-level validation has succeeded and intercepts domain-specific violations before they contaminate reconciliation logic, ensuring that business rules like over-receipt tolerances and consistency between fulfillment state and quantities are enforced at instantiation. Configuration details for strict mode and validator execution order are documented in the official Pydantic V2 documentation.
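To make the coercion behaviour concrete, the following minimal sketch contrasts lax and strict mode on the same string-encoded input. The `LaxLine` and `StrictLine` models are hypothetical, trimmed-down stand-ins for the line item model and are not part of the pipeline described here.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class LaxLine(BaseModel):
    # Lax mode (the default): numeric strings are coerced to int/float.
    model_config = ConfigDict(strict=False)

    line_sequence: int = Field(gt=0, le=999)
    qty_ordered: float = Field(ge=0.0)


class StrictLine(LaxLine):
    # Same fields, but strict mode rejects string-encoded numerals.
    model_config = ConfigDict(strict=True)


# Values as they might arrive from a text-based feed: everything is a string.
raw = {"line_sequence": "7", "qty_ordered": "120.50"}

lax = LaxLine.model_validate(raw)
print(lax.line_sequence, lax.qty_ordered)  # 7 120.5

try:
    StrictLine.model_validate(raw)
    strict_error_types = []
except ValidationError as exc:
    # One error per field, because neither string is accepted as a number.
    strict_error_types = [e["type"] for e in exc.errors()]
print(strict_error_types)
```

Lax mode is usually the pragmatic choice for text-based feeds, since every value arrives as a string; strict mode may suit feeds that already carry native types, such as JSON produced by an internal service.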
Positioning Validation Against Raw Parsing
Parsing libraries excel at structural extraction but intentionally defer semantic validation. When extracting tabular data from spreadsheets or comma-delimited exports, tools like pandas efficiently handle delimiter detection, header alignment, and chunked I/O. However, structural extraction does not guarantee data integrity. A CSV row may parse successfully while containing negative lead times, malformed vendor codes, or out-of-range quantities. This is where Parsing CSV and Excel Feeds with Pandas transitions into schema validation. Once rows are materialized as dictionaries, they must be passed through Pydantic models to enforce type boundaries and business constraints.
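The hand-off from parsed rows to validated records can be sketched as follows. To stay dependency-light, this example uses the standard library's `csv.DictReader` as a stand-in for a pandas export; `DataFrame.to_dict(orient="records")` produces a similar list of row dictionaries, although pandas will already have inferred some types. The `FeedRow` model, its field names, and the sample data are assumptions made for illustration.

```python
import csv
import io

from pydantic import BaseModel, Field, ValidationError


class FeedRow(BaseModel):
    # Hypothetical, trimmed-down row contract for illustration.
    po_id: str = Field(min_length=4, max_length=18, pattern=r"^[A-Z0-9\-]+$")
    qty_ordered: float = Field(ge=0.0)
    lead_time_days: int = Field(ge=0)
    vendor_code: str = Field(pattern=r"^VND-\d{5}$")


CSV_TEXT = """po_id,qty_ordered,lead_time_days,vendor_code
PO-1001,40,5,VND-00012
PO-1002,12,-3,VND-00012
PO-1003,8,2,VENDOR-7
"""

# Every row parses structurally, but two of them violate the contract.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

valid, rejected = [], []
for row in rows:
    try:
        valid.append(FeedRow.model_validate(row))
    except ValidationError as exc:
        # Record which fields failed so the row can be reviewed later.
        rejected.append((row["po_id"], [e["loc"][0] for e in exc.errors()]))

print([r.po_id for r in valid])  # ['PO-1001']
print(rejected)  # [('PO-1002', ['lead_time_days']), ('PO-1003', ['vendor_code'])]
```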
Similarly, hierarchical formats like ASN (Advance Shipping Notice) files often arrive as XML. Converting these structures to JSON via XML to JSON Conversion with xmltodict turns nested elements into serializable dictionaries, but the leaf values typically remain untyped strings, so the resulting payload still lacks type safety. Pydantic bridges this gap by applying explicit schema contracts to the parsed output, catching missing mandatory fields, invalid enumerations, and malformed timestamps before they reach the reconciliation engine. Timestamps should carry an explicit UTC offset in ISO 8601 form to prevent cross-region reconciliation drift.
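As a rough illustration, nested models can validate a dictionary shaped like converted XML, and Pydantic's `AwareDatetime` type can reject timestamps that lack a UTC offset. The `AsnHeader` and `AsnItem` models and the payload below are hypothetical; a real ASN schema will have different element names, and the dictionary is written by hand here rather than produced by xmltodict.

```python
from datetime import timedelta

from pydantic import AwareDatetime, BaseModel, Field, ValidationError


class AsnItem(BaseModel):
    sku: str = Field(min_length=5, max_length=32)
    qty_shipped: float = Field(ge=0.0)


class AsnHeader(BaseModel):
    shipment_id: str
    # AwareDatetime rejects timestamps that carry no UTC offset.
    shipped_at: AwareDatetime
    items: list[AsnItem] = Field(min_length=1)


# Hypothetical payload shaped like XML-to-dict output: nested dicts,
# with every leaf value still a string.
parsed = {
    "shipment_id": "SHP-88231",
    "shipped_at": "2024-03-14T08:30:00+01:00",
    "items": [{"sku": "AB12345", "qty_shipped": "24"}],
}

asn = AsnHeader.model_validate(parsed)
print(asn.shipped_at.utcoffset() == timedelta(hours=1))  # True

# The same payload with a naive timestamp fails validation.
naive = {**parsed, "shipped_at": "2024-03-14T08:30:00"}
try:
    AsnHeader.model_validate(naive)
    naive_errors = []
except ValidationError as exc:
    naive_errors = [(e["loc"], e["type"]) for e in exc.errors()]
print(naive_errors)
```

One practical caveat when the input really does come from xmltodict: an element that appears only once is returned as a single dictionary rather than a one-item list unless the `force_list` argument is used, so list-typed fields such as `items` may need that option or a normalizing validator.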
Structured Error Handling & Pipeline Integration
Production pipelines require deterministic failure modes. When a record violates the schema, Pydantic raises a ValidationError containing a structured errors() list. Each error dictionary includes loc (field path), type (validation failure category), and msg (human-readable description). This structure enables automated quarantine routing rather than silent data corruption.
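The shape of that error list is easiest to see on a small example. The sketch below uses a hypothetical two-field `Receipt` model purely to trigger failures, then reduces each error dictionary to the `loc`, `type`, and `msg` keys described above.

```python
from pydantic import BaseModel, Field, ValidationError


class Receipt(BaseModel):
    # Hypothetical two-field model, used only to trigger errors.
    vendor_code: str = Field(pattern=r"^VND-\d{5}$")
    qty_fulfilled: float = Field(ge=0.0)


try:
    Receipt.model_validate({"vendor_code": "VND-12", "qty_fulfilled": -4})
    summary = []
except ValidationError as exc:
    # errors() returns a list of dicts; loc is a tuple path into the input.
    summary = [
        {
            "field": ".".join(str(part) for part in e["loc"]),
            "type": e["type"],
            "msg": e["msg"],
        }
        for e in exc.errors()
    ]

for entry in summary:
    print(entry)
```

Because `type` is a stable machine-readable code (here `string_pattern_mismatch` and `greater_than_equal`), it is generally a better routing key than the human-readable `msg`.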
For high-throughput ingestion, wrap model instantiation in a try/except block that serializes validation failures into a dead-letter queue or exception table. This approach is critical when Validating Supplier Data Payloads with Pydantic Models, as supplier feeds frequently drift from published specifications. By capturing the exact field and violation type, engineering teams can generate automated discrepancy reports and trigger corrective workflows without halting the entire batch.
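One possible shape for that wrapper is sketched below. The `SupplierLine` model and the `validate_batch` helper are hypothetical names, and an in-memory list of JSON strings stands in for whatever dead-letter queue or exception table the pipeline actually uses.

```python
import json
from typing import Any

from pydantic import BaseModel, Field, ValidationError


class SupplierLine(BaseModel):
    # Hypothetical contract; field names are illustrative.
    po_id: str = Field(min_length=4, max_length=18)
    qty_ordered: float = Field(ge=0.0)


def validate_batch(records: list[dict[str, Any]]):
    """Split a batch into validated models and dead-letter entries."""
    accepted: list[SupplierLine] = []
    dead_letters: list[str] = []
    for index, record in enumerate(records):
        try:
            accepted.append(SupplierLine.model_validate(record))
        except ValidationError as exc:
            # Keep the raw record next to its errors so the failure can be
            # replayed; default=str guards against non-JSON values.
            entry = {
                "row": index,
                "record": record,
                "errors": exc.errors(include_url=False),
            }
            dead_letters.append(json.dumps(entry, default=str))
    return accepted, dead_letters


batch = [
    {"po_id": "PO-2001", "qty_ordered": "15"},
    {"po_id": "PO-2002", "qty_ordered": "fifteen"},
    {"qty_ordered": 3},
]
accepted, dead_letters = validate_batch(batch)
print(len(accepted), len(dead_letters))  # 1 2
```

Catching the exception per record, rather than around the whole loop, is what lets one malformed row be quarantined while the rest of the batch proceeds.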
Pre-validation data hygiene further reduces exception rates. Applying deterministic cleaning routines—such as stripping control characters, normalizing whitespace, and standardizing currency formats—before Pydantic instantiation minimizes false positives. This practice aligns with Sanitizing Supplier Master Data Before Ingestion, ensuring that validation failures reflect genuine business rule violations rather than trivial formatting artifacts.
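A minimal sketch of such cleaning routines, using only the standard library, might look like this. The helper names are hypothetical, and `parse_amount` assumes US-style amounts (comma as thousands separator, dot as decimal point); feeds that use other conventions would need a different rule.

```python
import re
import unicodedata
from decimal import Decimal


def clean_text(value: str) -> str:
    """Collapse whitespace runs and drop control/format characters."""
    # Turn tabs, newlines, and repeated spaces into single spaces first,
    # so that removing control characters does not glue words together.
    spaced = re.sub(r"\s+", " ", value)
    # Unicode categories starting with "C" cover control and format chars.
    visible = "".join(ch for ch in spaced if unicodedata.category(ch)[0] != "C")
    return visible.strip()


def parse_amount(value: str) -> Decimal:
    """Parse a US-style amount such as "$1,234.50" (assumed format)."""
    digits = re.sub(r"[^\d.\-]", "", clean_text(value))
    return Decimal(digits)


def sanitize_record(record: dict) -> dict:
    """Apply clean_text to every string value; leave other types as-is."""
    return {
        key: clean_text(val) if isinstance(val, str) else val
        for key, val in record.items()
    }


dirty = {"vendor_name": "  Acme\x00 Corp\t\tLtd\u200b ", "line_sequence": 3}
print(sanitize_record(dirty))  # {'vendor_name': 'Acme Corp Ltd', 'line_sequence': 3}
print(parse_amount(" $1,234.50 "))  # 1234.50
```

The sanitized dictionary can then be passed to `model_validate`, so that any remaining errors point at the data rather than at its formatting.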
Operationalizing the Contract Layer
Schema validation using Pydantic transforms raw, unstructured supply chain feeds into deterministic, contract-bound records. By combining V2’s performance optimizations with explicit business rule enforcement, data engineering teams can isolate structural drift, route exceptions accurately, and maintain reconciliation integrity across high-volume procurement and logistics pipelines.