Step-by-Step Guide to Parsing Large CSV Feeds in Pandas

When a supplier portal, warehouse management system, or 3PL partner pushes a multi-gigabyte CSV export, a single pd.read_csv() call routinely exhausts worker memory, trips a scheduler timeout, or silently coerces the SKU and purchase-order columns that every downstream join depends on. This page addresses one precise scenario: the inventory or PO feed is too large to load whole, and you need a deterministic, resumable procedure that parses it inside a bounded memory window while preserving an audit trail. It is the deep-dive companion to the general parser-selection contract in the parent Parsing CSV and Excel Feeds with Pandas reference, which sits inside the broader Ingestion & Parsing Workflows for Supply Chain Data architecture.

Operational Trigger Signals

Reach for the chunked, checkpointed procedure below — rather than a plain whole-file read — when your ingestion logs show any of these measurable conditions across consecutive runs:

File size exceeds ~25% of the worker memory budget. A 2 GB CSV can materialize to 6–10 GB as a DataFrame once pandas stores narrow integers as int64 and strings as object, so any file larger than a quarter of an 8 GB pod’s memory limit is an OOM risk.
Peak RSS climbs above 70% of the allocated container limit during a single read, visible as MemoryError tracebacks or kernel OOM-killer entries in the pod event log.
The read blocks the orchestrator past its task timeout — a synchronous whole-file parse stalls the event loop and trips the scheduler’s per-task SLA before the file finishes.
dtype inference drift appears between runs: the same feed parses qty_ordered as int64 one day and float64 the next because a blank appeared in the column, breaking arithmetic and joins.
Row volume passes ~1–2 million lines, where the inference cost of scanning the whole file for column types becomes a measurable share of total parse time.
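
The size-based trigger above can be turned into a pre-flight check. This is a minimal sketch, not part of the original procedure: the function name `needs_chunked_read`, the `ratio` parameter, and the byte counts in the demo are illustrative assumptions.

```python
import os
import tempfile


def needs_chunked_read(path: str, memory_budget_bytes: int, ratio: float = 0.25) -> bool:
    """Return True when the file on disk exceeds `ratio` of the worker memory budget."""
    return os.path.getsize(path) > memory_budget_bytes * ratio


# Demo with a throwaway 1,000-byte file standing in for a feed export.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"x" * 1_000)
    demo_path = tmp.name

small_budget = needs_chunked_read(demo_path, memory_budget_bytes=2_000)  # 1,000 > 500
large_budget = needs_chunked_read(demo_path, memory_budget_bytes=8_000)  # 1,000 < 2,000
os.remove(demo_path)
```

On-disk size is only a proxy for in-memory size, so a check like this is best paired with the RSS and timeout signals from the ingestion logs.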

When file volume (thousands of small exports per hour) is the constraint rather than single-file size, the fan-out concurrency model in Async Batch Processing for High-Volume Feeds is the better tool; the procedure here is for one oversized file at a time.

Step-by-Step Implementation

Work through the steps in order. Each one removes a specific failure mode, and together they turn an unbounded read into a resumable, memory-stable pipeline.

Step 1 — Pin an explicit dtype map before ingestion

Pandas defaults to object for mixed-type columns and infers numeric types dynamically, which both inflates memory 3–5x and causes silent coercion (1000 becoming 1000.0, leading zeros stripped from SKUs). Declare a strict schema dictionary and pass it straight to the parser so type resolution is fixed before the first byte is scanned.

PYTHON

import logging
from typing import Final

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("ingestion.large_csv")

# Nullable dtypes (Int64, string) keep blanks as <NA> instead of forcing float coercion.
SUPPLY_CHAIN_SCHEMA: Final[dict[str, str]] = {
    "po_number": "string",
    "sku_id": "string",
    "warehouse_code": "category",   # low-cardinality -> category saves memory
    "qty_ordered": "Int64",         # nullable int prevents float coercion on blanks
    "unit_cost_usd": "float32",     # float32 halves memory vs the float64 default
    "status": "category",
}

Do not put timestamp columns in the dtype map: read_csv does not accept "datetime64[ns]" in the dtype argument and raises TypeError. Load date columns as strings (or omit them from the map) and convert after the read, which also isolates unparseable rows as NaT instead of aborting the chunk. The same explicit-coercion rule is documented for the general case in the parent Parsing CSV and Excel Feeds with Pandas guide, including legacy ERP exports that embed currency symbols or locale-specific decimal separators.
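
To see what the pinned schema buys, here is a small sketch run against an in-memory sample. The two-row CSV and the names `SAMPLE`, `demo_schema`, and `demo` are made up for illustration; the explicit `%Y-%m-%d` format is an assumption about the sample, not about any real feed.

```python
import io

import pandas as pd

# Hypothetical two-row feed: one SKU with leading zeros, one blank quantity.
SAMPLE = io.StringIO(
    "po_number,sku_id,qty_ordered,ship_date\n"
    "PO-0001,00123,10,2024-01-05\n"
    "PO-0002,00456,,2024-01-06\n"
)

demo_schema = {"po_number": "string", "sku_id": "string", "qty_ordered": "Int64"}
demo = pd.read_csv(SAMPLE, dtype=demo_schema)  # ship_date is left out of the map

# Convert the date column after the read; bad values would become NaT.
demo["ship_date"] = pd.to_datetime(demo["ship_date"], format="%Y-%m-%d", errors="coerce")

print(demo.dtypes)
print(demo)
```

With the map in place the SKU keeps its leading zeros and the blank quantity stays `<NA>` in an `Int64` column rather than turning the column into floats.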

Step 2 — Stream the file through a bounded-memory chunk iterator

Pass chunksize so pandas yields fixed-row DataFrames instead of one monolith. Peak DataFrame memory is roughly rows × columns × average-cell-bytes, so size each chunk to keep that product well under the pod’s working-set budget.

PYTHON

CHUNK_SIZE: Final[int] = 150_000

def iter_feed_chunks(path: str) -> "pd.io.parsers.TextFileReader":
    """Yield memory-bounded chunks of a large supply chain CSV."""
    return pd.read_csv(
        path,
        dtype=SUPPLY_CHAIN_SCHEMA,
        chunksize=CHUNK_SIZE,
        low_memory=False,        # suppress per-chunk dtype-inference warnings
        on_bad_lines="warn",     # skip ragged rows instead of aborting the batch
        encoding="utf-8-sig",    # strip the Windows/ERP byte-order mark
    )

for chunk_idx, chunk in enumerate(iter_feed_chunks("supplier_inventory_feed.csv")):
    used_mb = chunk.memory_usage(deep=True).sum() / 1_000_000
    logger.info("chunk=%d rows=%d mem_mb=%.1f", chunk_idx, len(chunk), used_mb)

Monitor chunk.memory_usage(deep=True).sum() per iteration and lower CHUNK_SIZE until peak RSS stabilizes below 70% of allocated worker memory. Consult the official pandas.read_csv documentation for delimiter-sniffing and quoting behavior.
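
Rather than guessing the average cell size, one option is to measure it on a small sample and derive a starting chunk size from that. The sketch below assumes a hypothetical helper `rows_for_budget` and a synthetic two-column feed; the 50 MB per-chunk budget is an arbitrary example value.

```python
import io

import pandas as pd


def rows_for_budget(sample: pd.DataFrame, chunk_budget_bytes: int) -> int:
    """Estimate how many rows fit in the budget, based on a measured sample."""
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
    return max(1, int(chunk_budget_bytes // bytes_per_row))


# Synthetic feed used only for the demonstration.
csv_text = "sku_id,qty_ordered\n" + "".join(f"{i:05d},{i}\n" for i in range(1_000))
sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"sku_id": "string", "qty_ordered": "Int64"},
    nrows=500,  # read only the head of the file
)
suggested = rows_for_budget(sample, chunk_budget_bytes=50_000_000)
print(f"suggested chunksize: {suggested}")
```

The estimate only covers the parsed chunk itself; transforms that create copies need extra headroom, so treat the result as an upper bound to tune downward.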

Step 3 — Coerce timestamps and apply vectorized business rules

Inside the loop, convert deferred date columns and apply row-level logic with vectorized NumPy-backed operations. Never call iterrows() or row-wise apply() here: vectorized expressions run in compiled code and are typically orders of magnitude faster than a Python-level call per row.

PYTHON

def transform_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Coerce types and apply vectorized supply chain rules to one chunk."""
    chunk["ship_date"] = pd.to_datetime(
        chunk["ship_date"], errors="coerce", format="ISO8601"  # "ISO8601" needs pandas >= 2.0
    )  # unparseable dates become NaT, not a raised exception
    unparsed = int(chunk["ship_date"].isna().sum())
    if unparsed:
        logger.warning("unparsed_ship_dates count=%d", unparsed)

    chunk["net_value"] = chunk["qty_ordered"] * chunk["unit_cost_usd"]
    chunk = chunk[chunk["qty_ordered"] > 0]              # vectorized filter
    chunk = chunk.dropna(subset=["po_number", "sku_id"])  # keys must be present
    return chunk
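
Note that the `isna()` count in the function above also includes dates that were blank in the source. If the audit trail needs to separate blanks from malformed values, one approach is to compare against the raw column. This is a sketch under assumptions: `count_unparseable` is a hypothetical helper and the `%Y-%m-%d` format is chosen only for the sample data.

```python
import pandas as pd


def count_unparseable(raw: pd.Series, fmt: str = "%Y-%m-%d") -> int:
    """Count values that were present in the source but failed date parsing."""
    parsed = pd.to_datetime(raw, format=fmt, errors="coerce")
    malformed = parsed.isna() & raw.notna()  # NaT after parsing, but not blank before
    return int(malformed.sum())


# One valid date, one junk string, one blank, one impossible calendar date.
raw_dates = pd.Series(["2024-01-05", "not-a-date", None, "2024-02-30"])
malformed_count = count_unparseable(raw_dates)
print(malformed_count)
```

Only the junk string and the impossible date are counted; the blank is left for whatever null-handling rule the feed contract defines.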

Step 4 — Carry cross-chunk state explicitly

Chunking bounds memory but fractures global state. Cumulative PO totals, running balances, and cross-chunk deduplication must live in containers declared outside the loop, because each chunk only sees its own rows.

PYTHON

seen_skus: set[str] = set()
running_po_totals: dict[str, float] = {}

def accumulate_state(chunk: pd.DataFrame) -> pd.DataFrame:
    """Drop SKUs repeated within or across chunks, then accumulate PO totals."""
    chunk = chunk.drop_duplicates(subset="sku_id")   # repeats inside this chunk
    chunk = chunk[~chunk["sku_id"].isin(seen_skus)]  # repeats from earlier chunks
    seen_skus.update(chunk["sku_id"].unique())

    po_agg = chunk.groupby("po_number")["net_value"].sum()
    for po, val in po_agg.items():
        running_po_totals[po] = running_po_totals.get(po, 0.0) + float(val)
    return chunk

Step 5 — Serialize to Parquet and checkpoint for resume-from-failure

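Hold nothing in RAM longer than one chunk: write each validated chunk to a columnar file and persist a checkpoint so a mid-file crash resumes from the last committed batch instead of redoing the transform and write work for every chunk before it. The loop below still parses the already-committed chunks to reach the resume point, but it discards them immediately. Parquet preserves the typed schema, shrinks the storage footprint relative to CSV (often by 60–80%), and enables predicate pushdown downstream.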

PYTHON

import json
from pathlib import Path

OUTPUT_DIR = Path("/data/processed_inventory")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
CHECKPOINT_FILE = OUTPUT_DIR / "pipeline_state.json"

def resume_index() -> int:
    """Return the next chunk index to process, honoring any prior checkpoint."""
    if CHECKPOINT_FILE.exists():
        return json.loads(CHECKPOINT_FILE.read_text())["last_chunk_index"] + 1
    return 0

def run_pipeline(path: str) -> None:
    start = resume_index()
    for chunk_idx, chunk in enumerate(iter_feed_chunks(path)):
        if chunk_idx < start:
            continue  # already committed on a previous run
        chunk = accumulate_state(transform_chunk(chunk))

        chunk.to_parquet(
            OUTPUT_DIR / f"chunk_{chunk_idx:05d}.parquet",
            engine="pyarrow",
            compression="snappy",
            index=False,
        )
        CHECKPOINT_FILE.write_text(json.dumps({"last_chunk_index": chunk_idx}))
        logger.info("committed_chunk=%d rows=%d", chunk_idx, len(chunk))

See the official pandas Parquet I/O guide for partitioning and schema-evolution handling. Once chunks are typed and persisted, they still lack semantic guarantees — a row can parse cleanly yet carry a negative lead time — so the next stage is the contract enforcement described in Schema Validation Using Pydantic. Parsed quantity and price precision must survive into the match engine, so align the numeric dtypes above with the bands in Setting Quantity and Price Tolerance Windows.
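
One gap in the Step 5 loop is that `write_text` overwrites the checkpoint in place, so a crash in the middle of the write could leave truncated JSON behind. A common remedy is to write to a temporary file and rename it over the target. The sketch below assumes a hypothetical helper `write_checkpoint_atomic` and uses a temporary directory in place of the real output path.

```python
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint_atomic(checkpoint: Path, chunk_idx: int) -> None:
    """Write the checkpoint to a sibling temp file, then swap it into place."""
    tmp_path = checkpoint.with_suffix(".tmp")
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump({"last_chunk_index": chunk_idx}, fh)
        fh.flush()
        os.fsync(fh.fileno())          # push the bytes to disk before renaming
    os.replace(tmp_path, checkpoint)   # rename replaces the old file in one step


# Demo in a throwaway directory.
demo_dir = Path(tempfile.mkdtemp())
demo_checkpoint = demo_dir / "pipeline_state.json"
write_checkpoint_atomic(demo_checkpoint, 7)
restored = json.loads(demo_checkpoint.read_text())["last_chunk_index"]
leftover_tmp = demo_checkpoint.with_suffix(".tmp").exists()
```

The rename is only atomic when the temp file and the checkpoint live on the same filesystem, which is why the temp file is created next to the target.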

Configuration Reference

These are the parameters that decide whether the procedure stays inside its memory budget. Pin them per feed rather than globally; a wide freight manifest needs a smaller chunksize than a narrow status feed.

Parameter	Recommended default	Accepted range	Rationale
`chunksize`	`150_000`	`10_000`–`500_000`	Bounds peak RSS; lower it for wide rows or small pods
`dtype` (ids)	`string`	`string` / `category`	Preserves leading zeros on SKU and PO keys
`dtype` (counts)	`Int64`	`Int64` / `Int32`	Nullable int keeps blanks as `<NA>`, not `float`
`dtype` (money)	`float32`	`float32` / `float64`	`float32` halves memory but keeps only about 7 significant digits; use `float64` for larger amounts, and note that neither stores decimal cents exactly
`low_memory`	`False`	`True` / `False`	`False` stops repeated per-chunk inference warnings
`on_bad_lines`	`warn`	`warn` / `error` / `skip`	`warn` skips ragged rows and logs them; `error` hard-fails; `skip` drops them silently
`encoding`	`utf-8-sig`	`utf-8-sig` / `latin-1`	Strips BOM; `latin-1` for legacy EU exports
`compression` (Parquet)	`snappy`	`snappy` / `zstd`	`snappy` is fast; `zstd` compresses tighter for cold storage

Never widen on_bad_lines to skip or relax the dtype map just to clear a backlog — that converts a data-quality alert into silent data loss.
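
Pinning parameters per feed can be as simple as a small validated config object. The following is one possible shape, not a prescribed one: `FeedConfig`, `FEEDS`, and the two chunk sizes are hypothetical, while the accepted ranges mirror the table above.

```python
from dataclasses import dataclass

ALLOWED_BAD_LINES = {"warn", "error", "skip"}


@dataclass(frozen=True)
class FeedConfig:
    """Per-feed parser settings, validated against the accepted ranges."""
    name: str
    chunksize: int = 150_000
    on_bad_lines: str = "warn"
    encoding: str = "utf-8-sig"

    def __post_init__(self) -> None:
        if not 10_000 <= self.chunksize <= 500_000:
            raise ValueError(f"{self.name}: chunksize {self.chunksize} outside 10_000-500_000")
        if self.on_bad_lines not in ALLOWED_BAD_LINES:
            raise ValueError(f"{self.name}: unknown on_bad_lines {self.on_bad_lines!r}")

    def read_csv_kwargs(self) -> dict:
        """Keyword arguments intended to be splatted into pd.read_csv."""
        return {
            "chunksize": self.chunksize,
            "on_bad_lines": self.on_bad_lines,
            "encoding": self.encoding,
            "low_memory": False,
        }


# Hypothetical feeds: a wide manifest gets smaller chunks than a narrow status feed.
FEEDS = {
    "freight_manifest": FeedConfig("freight_manifest", chunksize=50_000),
    "status_feed": FeedConfig("status_feed", chunksize=300_000),
}
```

Validating at construction time means an out-of-range value fails when the config is loaded, not hours into a run.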

Debugging & Recovery

When a large-feed parse stalls or produces silent row loss, isolate the failure vector deterministically and route the offending payload to a dead-letter queue (DLQ) rather than re-running the whole file by hand.

Identify malformed row boundaries. If on_bad_lines="warn" floods the log, extract the offending line range with a byte-offset slice and validate it against the RFC 4180 CSV spec. Unescaped quotes inside free-text notes or address columns are the usual culprit.
Verify dtype coercion drift. Run chunk.dtypes right after ingestion and diff against SUPPLY_CHAIN_SCHEMA. If object appears where Int64 is expected, inspect for hidden whitespace with chunk[col].str.contains(r"\s", regex=True).any(), strip, and recast before any arithmetic.
Isolate memory leaks. If RSS climbs monotonically across chunks, you are accumulating references — flush each chunk to Parquet inside the loop instead of appending to an in-RAM list, and snapshot allocations at chunk boundaries with tracemalloc.
Handle BOM and encoding shifts. Vendor exports mix UTF-8 BOMs with Windows-1252. Force encoding="utf-8-sig"; if UnicodeDecodeError persists, fall back to encoding_errors="replace" and log the byte offset for manual vendor correction.
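Catch quoting failures near chunk edges. pandas cuts chunks on parsed record boundaries, so a correctly quoted multi-line field is never split by chunksize; an unbalanced quote, however, can swallow the rows that follow it and surface as an error far from the real defect. low_memory has no effect on quoting, so if these errors persist, normalize line endings upstream or switch to a streaming parser that handles fragmented records natively.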
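
The dtype drift check described in this list can be sketched as a small diff function. `dtype_drift` is a hypothetical helper, and the two-row frame is fabricated to show whitespace forcing a quantity column to object.

```python
import pandas as pd


def dtype_drift(chunk: pd.DataFrame, schema: dict) -> dict:
    """Return {column: (expected, actual)} for each column that deviates from the schema."""
    drift = {}
    for col, expected in schema.items():
        actual = str(chunk[col].dtype) if col in chunk.columns else "<missing>"
        if actual != expected:
            drift[col] = (expected, actual)
    return drift


expected_schema = {"sku_id": "string", "qty_ordered": "Int64"}

frame = pd.DataFrame({
    "sku_id": pd.array(["00123", "00456"], dtype="string"),
    "qty_ordered": pd.Series([" 10", "20 "], dtype=object),  # stray whitespace
})
report = dtype_drift(frame, expected_schema)

# Strip the whitespace and recast before any arithmetic, then re-check.
frame["qty_ordered"] = pd.to_numeric(frame["qty_ordered"].str.strip()).astype("Int64")
fixed_report = dtype_drift(frame, expected_schema)
```

Comparing dtype names as strings keeps the check simple, at the cost of treating equivalent spellings of a dtype as different; that is usually acceptable when the schema map is the single source of truth.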

Tag every quarantined file with one of OOM_ABORT, ENCODING_ERROR, DTYPE_DRIFT, RAGGED_ROW, or UNPARSED_DATES, and emit audit fields — file_name, content_hash, chunk_index, rows_committed, rows_skipped, peak_rss_mb, and ingested_at — to append-only storage so a SOX or internal audit review can replay any ingestion decision. The checkpoint in Step 5 is what makes recovery cheap: re-running the pipeline skips committed chunks and reprocesses only the failed tail.
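
As an illustration of the audit fields listed above, the sketch below builds one JSON line per ingestion decision. `AuditRecord` and `build_audit_line` are hypothetical names, and hashing an in-memory payload is a simplification; a real multi-gigabyte file would be hashed in streamed blocks.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

FAILURE_TAGS = {"OOM_ABORT", "ENCODING_ERROR", "DTYPE_DRIFT", "RAGGED_ROW", "UNPARSED_DATES"}


@dataclass(frozen=True)
class AuditRecord:
    file_name: str
    content_hash: str
    chunk_index: int
    rows_committed: int
    rows_skipped: int
    peak_rss_mb: float
    ingested_at: str
    failure_tag: Optional[str] = None


def build_audit_line(file_name: str, payload: bytes, chunk_index: int,
                     rows_committed: int, rows_skipped: int, peak_rss_mb: float,
                     failure_tag: Optional[str] = None) -> str:
    """Serialize one audit record as a JSON line for append-only storage."""
    if failure_tag is not None and failure_tag not in FAILURE_TAGS:
        raise ValueError(f"unknown failure tag: {failure_tag}")
    record = AuditRecord(
        file_name=file_name,
        content_hash=hashlib.sha256(payload).hexdigest(),
        chunk_index=chunk_index,
        rows_committed=rows_committed,
        rows_skipped=rows_skipped,
        peak_rss_mb=peak_rss_mb,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        failure_tag=failure_tag,
    )
    return json.dumps(asdict(record), sort_keys=True)


audit_line = build_audit_line(
    "supplier_inventory_feed.csv", b"sku_id,qty\n00123,10\n",
    chunk_index=3, rows_committed=149_200, rows_skipped=800,
    peak_rss_mb=512.5, failure_tag="RAGGED_ROW",
)
```

Rejecting unknown tags at write time keeps the quarantine taxonomy closed, so later audit queries can group by tag without cleanup.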

FAQ

How do I choose the right chunksize?

Start from row width, not habit. Estimate peak chunk memory as rows × columns × average-cell-bytes and target a value that keeps a single chunk well under your container’s working-set limit — 150_000 is a safe default for a typical 20–30 column inventory feed. Watch chunk.memory_usage(deep=True).sum() for the first few chunks and halve chunksize if peak RSS approaches 70% of the pod limit.
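
The halving rule can be written down as a tiny policy function. This is a sketch with assumed names: `next_chunksize` is hypothetical, the 70% threshold comes from the guidance above, and the 10,000-row floor mirrors the lower bound in the configuration table.

```python
def next_chunksize(current: int, peak_rss_bytes: int, limit_bytes: int,
                   threshold: float = 0.70, floor: int = 10_000) -> int:
    """Halve the chunk size when peak RSS reaches the threshold share of the pod limit."""
    if peak_rss_bytes >= limit_bytes * threshold:
        return max(floor, current // 2)
    return current


GIB = 1024 ** 3
halved = next_chunksize(150_000, peak_rss_bytes=6 * GIB, limit_bytes=8 * GIB)     # 75% of limit
unchanged = next_chunksize(150_000, peak_rss_bytes=4 * GIB, limit_bytes=8 * GIB)  # 50% of limit
floored = next_chunksize(15_000, peak_rss_bytes=7 * GIB, limit_bytes=8 * GIB)     # cannot go below floor
```

How `peak_rss_bytes` is measured is left open here; it could come from container metrics or a process-level probe, depending on the runtime.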

Why are my SKU and PO numbers losing leading zeros across chunks?

Because pandas infers those columns as integers when no explicit dtype is given. Force them to string (or category) in SUPPLY_CHAIN_SCHEMA and pass the map to every read_csv call. Any column that participates in a join or match key must never rely on inference.

My pipeline crashed halfway through a 2 GB file. Do I have to start over?

No — that is exactly what the checkpoint prevents. The Step 5 procedure writes last_chunk_index after each committed Parquet file, so re-running calls resume_index(), skips every already-committed chunk, and resumes at the first unprocessed one. Keep the cross-chunk state (seen_skus, running_po_totals) derivable from committed output if you need it to survive a restart, or persist it alongside the checkpoint.
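
One way to persist the cross-chunk state alongside the checkpoint is to store all three values in the same JSON document. The helpers `save_state` and `load_state` below are hypothetical, and the sketch assumes the SKU set is small enough to serialize on every commit; a very large set would call for a different store.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(path: Path, last_chunk_index: int,
               seen_skus: set, running_po_totals: dict) -> None:
    """Persist the checkpoint and cross-chunk state together in one file."""
    payload = {
        "last_chunk_index": last_chunk_index,
        "seen_skus": sorted(seen_skus),          # sets are not JSON-serializable
        "running_po_totals": running_po_totals,
    }
    tmp_path = path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(payload))
    os.replace(tmp_path, path)                   # swap in the complete file


def load_state(path: Path) -> tuple:
    """Return (last_chunk_index, seen_skus, running_po_totals); -1 means no prior run."""
    if not path.exists():
        return -1, set(), {}
    payload = json.loads(path.read_text())
    return payload["last_chunk_index"], set(payload["seen_skus"]), payload["running_po_totals"]


# Round-trip demo in a throwaway directory.
state_file = Path(tempfile.mkdtemp()) / "pipeline_state.json"
fresh = load_state(state_file)
save_state(state_file, 4, {"00123", "00456"}, {"PO-0001": 250.0})
resumed_index, resumed_skus, resumed_totals = load_state(state_file)
```

Because the state and the chunk index are written in one file, a restart can never observe totals that are ahead of or behind the last committed chunk.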
