Output formats

What lands on disk, in which format, and how to load it. Plus the manifest sidecar and the optional companion files.


CSV (default)

from plotsim import create_from_yaml, generate_tables, write_tables

cfg = create_from_yaml("my_config.yaml")
tables = generate_tables(cfg)
write_tables(tables, cfg)

Default output in ./output/:

output/
├── dim_date.csv
├── dim_customer.csv
├── fct_engagement.csv
├── fct_mrr.csv
├── evt_login.csv
├── config.yaml
├── validation_report.txt
└── manifest.json

File format conventions:

  • UTF-8 encoding
  • Float format %.6g (6 significant digits, mixed scientific / fixed)
  • pd.NA and NaN written as empty strings
  • Strings quoted only when they contain commas, newlines, or quotes (CSV QUOTE_MINIMAL style — numeric cells are unquoted)
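The %.6g float convention can be previewed in plain Python — this is illustrative formatting, not plotsim code:

```python
# %.6g keeps 6 significant digits and switches to scientific notation
# when the exponent falls outside the fixed-point range.
values = [1234567.0, 0.000123456789, 3.14159265]
formatted = ["%.6g" % v for v in values]
# large magnitudes go scientific, small ones stay fixed-point
```

`%g` uses scientific notation when the exponent is below -4 or at least the precision (here 6), which is why the first value renders as `1.23457e+06`.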

To load with pandas:

import pandas as pd
df = pd.read_csv("output/fct_engagement.csv")

Parquet

Set the output format on the config:

output_format: parquet

Then run as normal — every table writes as .parquet instead of .csv.

Parquet writes require pyarrow:

pip install plotsim[parquet]
# or
pip install pyarrow

When to use Parquet:

  • Files are 5–10× smaller than CSV on the bundled templates (typed columns + Snappy compression)
  • Column dtypes round-trip exactly — no string-vs-int ambiguity
  • Faster to load into DuckDB / pandas / polars at scale
  • Streaming write per archetype group keeps memory bounded for very large fact tables

When CSV is fine:

  • Smaller datasets (< 1M rows) where the file-size delta doesn't matter
  • Tooling that doesn't speak Parquet (some shell pipelines, some legacy loaders)
  • Eyeballing data in a text editor

To load Parquet with pandas / polars:

import pandas as pd
df = pd.read_parquet("output/fct_engagement.parquet")

# or
import polars as pl
df = pl.read_parquet("output/fct_engagement.parquet")

What write_tables produces

File                                       Always written?  Description
<table>.csv / .parquet                     yes              One file per generated table
config.yaml                                yes              Round-trippable copy of the config used for generation
validation_report.txt                      yes              Human-readable list of FK / PK / spine / null-policy issues
manifest.json                              conditional      Ground-truth signal layer (see below)
<fact>_train.<ext> / <fact>_holdout.<ext>  conditional      Train/holdout split when holdout is configured
_entity_features.<ext>                     conditional      Flat per-entity feature table when entity_features is enabled

config.yaml

A complete, round-trippable copy of the config used for this run. Pass it to create_from_yaml(...) and you regenerate the same dataset under the same plotsim version.

The copy includes engine-derived defaults the original input may have omitted — useful when you want to see exactly what plotsim filled in for you.

validation_report.txt

Human-readable validation summary. Header carries error / warning counts and overall VALID / INVALID status. Body lists each issue with check name, table, message, and detail block.

Plotsim Validation Report
==========================
Generated: deterministic (config-sha256[:16]=a1b2c3d4...)
Errors: 0 | Warnings: 1 | Total: 1
Status: VALID

[WARN ] empty_event_tables (evt_churn) — 0 rows generated; threshold may be too aggressive
        threshold: above 0.95

Status: VALID requires zero errors. Warnings don't block — they inform.
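A pipeline can gate on the report without parsing the full body. A minimal sketch, assuming only the `Status:` header line shown in the sample above:

```python
# Parse the validation report header to decide whether to proceed.
# The "Status:" line format follows the sample report; treat it as
# the only contract this sketch relies on.
report = """\
Plotsim Validation Report
==========================
Errors: 0 | Warnings: 1 | Total: 1
Status: VALID
"""

def report_is_valid(text: str) -> bool:
    for line in text.splitlines():
        if line.startswith("Status:"):
            return line.split(":", 1)[1].strip() == "VALID"
    return False  # no Status line: fail closed

ok = report_is_valid(report)
```

In a real run you would read `output/validation_report.txt` instead of the inline string.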


The manifest

manifest.json is the ground-truth sidecar. It captures the signal layer — the inputs an ML pipeline would predict against, rather than re-derive from noisy fact-table cells.

import json
from pathlib import Path

manifest = json.loads(Path("output/manifest.json").read_text())

# Entity → archetype label
labels = {a["entity"]: a["archetype"] for a in manifest["archetype_assignments"]}

# Trajectory position at every period for sampled entities
positions = manifest["trajectory_samples"]
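For a quick sanity check on the labels, the class balance can be tallied. The assignment record shape mirrors the snippet above; the entity ids and archetype names here are hypothetical:

```python
from collections import Counter

# Hypothetical archetype_assignments records, shaped like the manifest's.
assignments = [
    {"entity": "c1", "archetype": "steady"},
    {"entity": "c2", "archetype": "churner"},
    {"entity": "c3", "archetype": "steady"},
]

labels = {a["entity"]: a["archetype"] for a in assignments}
balance = Counter(labels.values())  # class balance per archetype
```

Counter gives the class balance a downstream classifier would train against.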

The manifest is byte-deterministic — same (config, seed) produces the same JSON. Full field reference in manifest-reference.md.

To opt out of manifest emission, set manifest: { include: false } in the config. The file is then never written.


Holdout split (optional)

When you declare a holdout block, plotsim writes two extra files for every per-entity-per-period fact table:

holdout:
  target: mrr
  periods: 3
  min_training_periods: 6

output/
├── fct_engagement.csv
├── fct_engagement_train.csv      # periods [0, n - 3)
├── fct_engagement_holdout.csv    # periods [n - 3, n)
├── fct_mrr.csv
├── fct_mrr_train.csv
└── fct_mrr_holdout.csv

The unsplit fact table is still written. Dim, bridge, and event tables are not split — they're not period-indexed in a way that slices cleanly.

The manifest's holdout block records target_metric, holdout_periods, and the resolved cutoff_period_index so a downstream consumer can re-derive the split without re-reading the config.
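Re-deriving the split from the manifest is a simple period-index filter. A sketch with hypothetical rows — the column name `period_index` and the exact manifest field spellings are assumptions based on the fields listed above:

```python
# Manifest holdout block (field names as listed above; values hypothetical).
holdout_info = {
    "target_metric": "mrr",
    "holdout_periods": 3,
    "cutoff_period_index": 9,
}

# Hypothetical per-entity-per-period fact rows spanning 12 periods.
rows = [{"customer_id": 1, "period_index": p} for p in range(12)]

cutoff = holdout_info["cutoff_period_index"]
train = [r for r in rows if r["period_index"] < cutoff]
hold = [r for r in rows if r["period_index"] >= cutoff]
```

With 12 periods and a cutoff of 9, this reproduces the `[0, n - 3)` / `[n - 3, n)` split shown in the tree above.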


Per-entity features (optional)

When entity_features: true, plotsim writes one extra file: _entity_features.csv (or .parquet).

entity_features: true

One row per entity. For every numeric metric the engine wrote to a fact table, six aggregate columns are computed per entity:

customer_id,
engagement_mean, engagement_std, engagement_slope,
engagement_first, engagement_last, engagement_peak_period,
mrr_mean, mrr_std, mrr_slope, mrr_first, mrr_last, mrr_peak_period,
archetype, final_trajectory_position

The archetype and final_trajectory_position columns are ground-truth labels pulled from the manifest. They give a downstream classifier the answer key to learn against.

When holdout is also enabled, aggregation is restricted to the training window and the target metric's six aggregate columns are dropped to prevent label leakage.
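The six aggregates can be computed by hand for one entity's series. A sketch with hypothetical values — plotsim's exact definitions (e.g. sample vs. population std, tie-breaking for the peak) are assumptions:

```python
# One entity's metric series across 5 periods (hypothetical values).
series = [10.0, 12.0, 15.0, 14.0, 20.0]
n = len(series)

mean = sum(series) / n
std = (sum((x - mean) ** 2 for x in series) / n) ** 0.5  # population std (assumed)

# slope: ordinary least squares against period index 0..n-1
t_mean = (n - 1) / 2
slope = (
    sum((t - t_mean) * (x - mean) for t, x in enumerate(series))
    / sum((t - t_mean) ** 2 for t in range(n))
)

first, last = series[0], series[-1]
peak_period = max(range(n), key=series.__getitem__)  # index of the max value
```

These map onto the `<metric>_mean`, `_std`, `_slope`, `_first`, `_last`, and `_peak_period` columns in the header sample above.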


Output directory and overrides

Default location is ./output/ (relative to the working directory). Override via:

write_tables(tables, cfg, output_dir="path/to/somewhere")

If the directory doesn't exist, plotsim creates it. Existing files at the same paths are overwritten — there's no append, no timestamped subdirectories. Run twice and the second run replaces the first.

For hosted deployments where you want to constrain output to a sandbox root:

write_tables(tables, cfg, output_dir="user_request_dir", base_dir="/sandbox")

Absolute-path overrides and .. traversal are rejected when base_dir is set.
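The containment behavior can be sketched with pathlib — an illustrative check in the spirit of `base_dir`, not plotsim's actual implementation:

```python
from pathlib import Path

def resolve_under(base_dir: str, requested: str) -> Path:
    """Resolve `requested` and refuse anything outside `base_dir`."""
    base = Path(base_dir).resolve()
    target = (base / requested).resolve()
    if not target.is_relative_to(base):
        raise ValueError(f"{requested!r} escapes {base_dir!r}")
    return target

safe = resolve_under("/sandbox", "user_request_dir")  # stays inside: accepted

rejected_traversal = False
try:
    resolve_under("/sandbox", "../outside")  # .. traversal: rejected
except ValueError:
    rejected_traversal = True
```

An absolute `requested` path is rejected the same way: joining an absolute path onto `base` replaces it entirely, so the resolved target falls outside the sandbox root.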


Putting it together

from plotsim import (
    create_from_yaml,
    generate_tables_with_state,
    build_manifest,
    write_tables,
)

cfg = create_from_yaml("my_config.yaml")

# Generate tables and the trajectory state alongside
tables, state = generate_tables_with_state(cfg)

# Build the manifest from the state
manifest = build_manifest(
    cfg, state.trajectories, tables,
    scd_state=state.scd, bridge_state=state.bridges,
)

# Write everything
out_path = write_tables(tables, cfg, manifest=manifest)
print(f"Wrote to {out_path}")

Or the one-liner version (no manifest):

from plotsim import create_from_yaml, generate_tables, write_tables

cfg = create_from_yaml("my_config.yaml")
write_tables(generate_tables(cfg), cfg)