Skip to content

Performance

pyreps is designed for high performance and low memory consumption.

Streaming Pipeline

The entire pipeline is lazy — data flows record by record without accumulating in memory:

graph LR
    A["Adapter<br/><i>yield record</i>"] --> B["Mapping<br/><i>yield row</i>"]
    B --> C["Renderer<br/><i>write row</i>"]

    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#e8f5e9

Each component is a Python generator. Data enters, is processed, and leaves — with no intermediate lists.

Benchmarks

Results with 6 columns, declarative types enabled:

Format Records Time Peak RAM File rows/s
CSV 10K 0.05s 51.11 MB 0.63 MB 194K
CSV 100K 0.50s 51.11 MB 6.67 MB 201K
CSV 500K 2.39s 51.11 MB 34.9 MB 209K
XLSX 10K 0.13s 51.11 MB 0.34 MB 76K
XLSX 100K 0.90s 51.11 MB 3.25 MB 111K
XLSX 500K 4.37s 51.11 MB 16.0 MB 114K
PDF 10K 1.74s 51.11 MB 1.01 MB 5K

Stable Memory (CSV/XLSX)

CSV and XLSX maintain stable memory usage (~51MB process baseline) regardless of the data volume.

PDF: Memory O(chunk_size)

The PDF uses streaming by 200-row chunks (configurable). Peak RAM is proportional to chunk_size × n_columns. See Formats → PDF for details.

Performance Stack

Component Library Language Why
JSON parsing orjson Rust ~6x faster than json stdlib
XLSX writing rustpy-xlsxwriter Rust Native writing, accepts generators
XLSX widths ZIP streaming Python Patching in 64KB chunks, no DOM
CSV csv stdlib C Native module, as fast as possible
PDF reportlab Python + C C core, industry standard

Optimization Tips

Use generators as data source

# ❌ Materializes everything before starting
data = [row for row in fetch_all_rows()]
generate_report(data_source=data, ...)

# ✅ Streaming — constant memory
def stream():
    for page in paginate():
        yield from page
generate_report(data_source=stream(), ...)

Prefer CSV/XLSX for large volumes

PDF processes data in 200-row chunks (configurable via metadata["pdf"]["chunk_size"]), keeping memory proportional to the chunk size — not to the total records. Even so, the speed (~165 rows/s) is much lower than CSV/XLSX. For datasets above 50K rows, prefer CSV or XLSX.

XLSX — manual mode for maximum speed

The auto/mixed mode calculates widths during streaming (minimal overhead). If you don't need automatic width:

metadata={"xlsx": {"width_mode": "manual", "default_width": 15.0}}

Reproducing Benchmarks

uv run python benchmarks/bench_performance.py