Config Reference

Every input field accepted by create() / create_from_yaml(). Source of truth is the code; this page is the field map.

For column types see column-types.md.


Builder input shape

about: <one-line description>
unit: <singular noun>
window: { start, end, every }
metrics: [ ... ]
segments: [ ... ]
connections: [ ... ]
lifecycle: { track, stages, enforce_order, downgrade_delay }
dimensions: [ ... ]
facts: [ ... ]
events: [ ... ]
seasonality: [ ... ]
bridges: [ ... ]
quality: [ ... ]
holdout: { target, periods, min_training_periods }
entity_features: true | false | { metrics, include_labels }
noise: <preset_name> | { gaussian_sigma, outlier_rate, mcar_rate }
output: csv | parquet | { format, directory }
locale: <faker locale or list of locales>
seed: <int>

Required keys: about, unit, window, metrics (at least one), segments (at least one). Everything else is optional. The same shape is accepted from both create(**kwargs) and create_from_yaml(path).


Top-level fields

about

Type str
Required yes
Constraints non-empty

One-line description of what the dataset represents. Surfaces in config.domain.description.

about: "Subscription customers churning across 24 months"

unit

Type str
Required yes
Constraints non-empty; lowercase singular noun recommended

The thing each entity represents. Used to name the auto-generated entity dim table (dim_<unit>) and to label the entity FK column.

unit: customer    # auto-generates dim_customer, customer_id
unit: employee    # auto-generates dim_employee, employee_id

window

Type object or 2- or 3-tuple
Required yes

Time span and granularity. Three accepted shapes:

# Object form
window:
  start: 2024-01
  end: 2024-12
  every: monthly

# Two-tuple (default granularity = monthly)
window: ["2024-01", "2024-12"]

# Three-tuple
window: ["2024-01", "2024-12", "monthly"]

Field Type Default Constraints
start str required YYYY-MM or YYYY-MM-DD
end str required same format as start
every "daily" / "weekly" / "monthly" "monthly"

YAML's relaxed scalar parser turns 2024-01 into a date object; the builder coerces it back to a string before validation, so both quoted and unquoted forms work.
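As an illustration only, the three accepted window shapes and the date-to-string coercion could be normalized along these lines. `normalize_window` is a hypothetical helper written for this page, not the builder's actual function.

```python
from datetime import date

def normalize_window(window):
    """Hypothetical sketch: fold the three window shapes into one dict."""
    if isinstance(window, dict):
        start, end = window["start"], window["end"]
        every = window.get("every", "monthly")
    elif isinstance(window, (list, tuple)) and len(window) in (2, 3):
        start, end = window[0], window[1]
        # The two-tuple form falls back to the documented default.
        every = window[2] if len(window) == 3 else "monthly"
    else:
        raise ValueError("window must be an object, a 2-tuple, or a 3-tuple")
    if every not in ("daily", "weekly", "monthly"):
        raise ValueError(f"unknown granularity: {every!r}")

    def as_text(value):
        # A YAML loader may hand back a date object for an unquoted scalar.
        return value.isoformat() if isinstance(value, date) else str(value)

    return {"start": as_text(start), "end": as_text(end), "every": every}
```

The sketch does not validate the YYYY-MM / YYYY-MM-DD format; it only shows how the shapes collapse to one canonical form.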

seed

Type int
Required no
Default drawn from secrets.randbelow(2**32)
Constraints 0 ≤ seed ≤ 2**32 - 1

Pin this for reproducible output. Same (config, seed) always produces byte-identical files. When omitted, the builder draws a fresh seed from the system CSPRNG.
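A minimal sketch of the seed rule, assuming a hypothetical `resolve_seed` helper (the name is not from the library). Only `secrets.randbelow(2**32)` and the bounds come from the table above.

```python
import secrets

SEED_MAX = 2**32 - 1  # upper bound from the Constraints row

def resolve_seed(seed=None):
    """Hypothetical sketch: return the pinned seed or draw a fresh one."""
    if seed is None:
        # randbelow(2**32) yields an int in [0, 2**32 - 1] from the system CSPRNG.
        return secrets.randbelow(2**32)
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an int")
    if not 0 <= seed <= SEED_MAX:
        raise ValueError("seed must satisfy 0 <= seed <= 2**32 - 1")
    return seed
```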


metrics

Array of metric declarations. At least one required, max 50.

metrics:
  - name: engagement
    type: score
    polarity: positive

  - name: mrr
    type: amount
    polarity: positive
    range: [10, 5000]

  - name: churn_risk
    type: score
    polarity: negative
    follows: engagement
    delay: 2
    seasonal_sensitivity: 0.5

Metric fields

Field Type Required Default Notes
name str yes Alphanumeric / underscore only
type enum yes score, amount, count, index
polarity enum yes positive (high position → high value) or negative (high position → low value)
label str no None Display label; defaults to name
range [float, float] conditional None Required for amount and index; forbidden for count
follows str no None Name of another metric this one lags behind. Must pair with delay
delay int no None Lag in periods. Must be ≥ 1 and pair with follows
seasonal_sensitivity float no 1.0 Per-metric multiplier on global seasonality. 0.0 immune; -0.5 halves and inverts

Metric types

Type Distribution Range Use for
score beta(2, 5) implicit [0, 1] Health scores, engagement indices, satisfaction
count poisson(λ=5) non-negative integers Logins, transactions, ticket counts
amount lognorm or beta (auto-picked) required Money, weights, durations
index normal required Bounded indicators where mean matters

For amount, the builder picks lognorm when min == 0 or max / min ≥ 10, else beta. The index distribution is centered on the range midpoint with sigma chosen to keep ~99.7% of draws inside the range.
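The selection rule can be sketched as below. Both function names are hypothetical, and the sigma formula is an assumption derived from the three-sigma rule (±3σ covering the range keeps about 99.7% of normal draws inside it); the engine's exact computation may differ.

```python
def pick_amount_distribution(lo, hi):
    """Hypothetical sketch of the amount rule: lognorm for wide or zero-based ranges."""
    if lo == 0 or hi / lo >= 10:
        return "lognorm"
    return "beta"

def index_params(lo, hi):
    """Assumed parameterization: mean at the midpoint, range = mean ± 3 sigma."""
    mean = (lo + hi) / 2
    sigma = (hi - lo) / 6
    return mean, sigma
```

With the `mrr` example above, `range: [10, 5000]` gives a ratio of 500, so the sketch selects lognorm.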

Causal lag (follows / delay)

follows: <other_metric> and delay: <int> declare that this metric trails the named driver by delay periods. The two must appear together or not at all. A metric cannot follow itself, and the chain must be acyclic — both are checked at construction time.
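The pairing, self-reference, and acyclicity rules above could be checked roughly as follows. This is a sketch over plain dicts with a hypothetical `check_lag_chain` function, not the library's validator.

```python
def check_lag_chain(metrics):
    """Hypothetical sketch: validate follows/delay pairs and reject cycles.

    metrics is a list of dicts with a "name" key and optional "follows" / "delay".
    Returns the mapping {metric: driver} when everything is valid.
    """
    follows = {}
    for m in metrics:
        has_follows = m.get("follows") is not None
        has_delay = m.get("delay") is not None
        if has_follows != has_delay:
            raise ValueError(f"{m['name']}: follows and delay must appear together")
        if not has_follows:
            continue
        if m["delay"] < 1:
            raise ValueError(f"{m['name']}: delay must be >= 1")
        if m["follows"] == m["name"]:
            raise ValueError(f"{m['name']}: a metric cannot follow itself")
        follows[m["name"]] = m["follows"]

    # Each metric has at most one driver, so a cycle shows up as a revisit
    # while walking the chain from any starting metric.
    for start in follows:
        seen = {start}
        current = follows[start]
        while current in follows:
            if current in seen:
                raise ValueError(f"cyclic follows chain through {current}")
            seen.add(current)
            current = follows[current]
    return follows
```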


segments

Array of cohort declarations — each segment is a count of entities all sharing one archetype. At least one required.

segments:
  - name: growers
    count: 30
    archetype: growth

  - name: decliners
    count: 20
    archetype: decline
    baseline:
      mrr: high
      engagement: mid
    attributes:
      industry: ["tech", "retail"]

  - name: hybrids
    count: 25
    archetype: "growth > decline @ 0.6"
    seasonal_sensitivity: 0.0

Segment fields

Field Type Required Default Notes
name str yes Alphanumeric / underscore
count int yes 3 ≤ count ≤ 5000 per segment
archetype str yes Shape word or composition DSL — see below
label str no None Display label
attributes dict[str, str | list[str]] no {} Per-segment static attributes; doubles as the source for pool.{attr} columns
baseline dict[str, str] no {} Per-metric value-range narrowing — high / mid / low
seasonal_sensitivity float no 1.0 Per-segment multiplier on global seasonality

Archetype DSL

Six base shapes, listed in BASELINE_RECIPES / SHAPE_RECIPES:

Shape Behavior
growth Sigmoid rise from low to high
decline Exponential decay from high to low
seasonal Oscillating around 0.5
flat Constant around 0.15
spike_then_crash Sigmoid rise → step drop → low plateau
accelerating Compounding growth with acceleration

Shapes compose with two operators:

  • Sequence > — chain shapes in order. Default split is even. growth > decline is half growth, half decline.
  • Anchor @ — explicit transition period. growth > decline @ 8 spends periods 0–7 on growth then transitions to decline at period 8. With N shapes, supply N-1 @ clauses (one between every pair).

Examples: growth, growth > decline, flat > growth > seasonal @ 4 @ 12, growth > spike_then_crash @ 6. See user-guide/archetypes.md for the full DSL.
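To make the grammar concrete, here is a small parsing sketch for the two operators. `parse_archetype` is a hypothetical function, and anchors are read as floats purely as an assumption so that both `@ 8` and `@ 0.6` parse; the real parser is described in user-guide/archetypes.md.

```python
BASE_SHAPES = {
    "growth", "decline", "seasonal", "flat", "spike_then_crash", "accelerating",
}

def parse_archetype(expr):
    """Hypothetical sketch: split 'a > b > c @ x @ y' into shapes and anchors."""
    head, *anchor_tokens = expr.split("@")
    shapes = [token.strip() for token in head.split(">")]
    for shape in shapes:
        if shape not in BASE_SHAPES:
            raise ValueError(f"unknown shape: {shape!r}")
    # float() tolerates surrounding whitespace and rejects empty tokens.
    anchors = [float(token) for token in anchor_tokens]
    # With N shapes, anchors are either omitted (even split) or N-1 long.
    if anchors and len(anchors) != len(shapes) - 1:
        raise ValueError("expected one @ clause between every pair of shapes")
    return shapes, anchors
```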

Baseline vocabulary

Three words that narrow the metric's value range to a third of its declared band:

Word Range fraction
high upper third — (2/3, 1) of [min, max]
mid middle third — (1/3, 2/3)
low lower third — (0, 1/3)

Useful for "this segment runs hot" / "this segment runs cold" without authoring a full archetype variant.
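The narrowing is simple arithmetic on the declared range. A sketch, using a hypothetical `narrow_range` helper:

```python
BASELINE_THIRDS = {
    "low": (0.0, 1 / 3),
    "mid": (1 / 3, 2 / 3),
    "high": (2 / 3, 1.0),
}

def narrow_range(lo, hi, word):
    """Hypothetical sketch: map a baseline word onto a third of [lo, hi]."""
    start_fraction, end_fraction = BASELINE_THIRDS[word]
    span = hi - lo
    return lo + start_fraction * span, lo + end_fraction * span
```

For `mrr` with `range: [10, 5000]`, `high` would narrow the band to roughly 3337 through 5000 under this reading.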


connections

Array of correlation pairs. Optional. Each entry has three slots — left metric, relationship-or-coefficient, right metric — and accepts six shorthand forms:

connections:
  # Word form
  - "mrr driven_by engagement"                        # 3-token string
  - ["churn_risk", "inverts", "mrr"]                  # tuple
  - {metric_a: "support_tickets", relationship: "related", metric_b: "churn_risk"}

  # Numeric form (any coefficient in [-1.0, 1.0])
  - "engagement 0.42 retention"                       # numeric middle token
  - ["mrr", -0.31, "support_tickets"]                  # numeric in tuple
  - {metric_a: "nps", coefficient: 0.18, metric_b: "feature_adoption"}

The middle slot is parsed as a number when it tokenizes to a float; otherwise it's looked up against the relationship vocabulary. Each canonical entry sets exactly one of relationship / coefficient — passing both raises at construction time, since the word already implies a coefficient.

Relationship vocabulary

Nine words spanning -0.75 to +0.75:

Word Coefficient
mirrors +0.75
driven_by +0.55
related +0.40
hints_at +0.20
independent 0.00
hints_against -0.20
resists -0.40
opposes -0.55
inverts -0.75

The numeric form accepts any value in [-1.0, 1.0] — useful when you've calibrated the coefficient from a real dataset and the nine-word vocabulary doesn't land on the right magnitude. Coefficients of exactly 0.0 are dropped (treated as "independent") with a warning, matching the engine's redundant-pair contract.
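The six shorthand forms reduce to one (left, coefficient, right) triple. The sketch below shows one way to do that with a hypothetical `parse_connection` function; it omits the zero-coefficient warning and the check that both endpoints are declared metrics.

```python
RELATIONSHIPS = {
    "mirrors": 0.75, "driven_by": 0.55, "related": 0.40, "hints_at": 0.20,
    "independent": 0.00, "hints_against": -0.20, "resists": -0.40,
    "opposes": -0.55, "inverts": -0.75,
}

def parse_connection(entry):
    """Hypothetical sketch: normalize any shorthand to (left, coefficient, right)."""
    if isinstance(entry, str):
        parts = entry.split()
        if len(parts) != 3:
            raise ValueError("string form needs exactly three tokens")
        left, middle, right = parts
    elif isinstance(entry, (list, tuple)):
        if len(entry) != 3:
            raise ValueError("tuple form needs exactly three items")
        left, middle, right = entry
    elif isinstance(entry, dict):
        has_word = "relationship" in entry
        has_number = "coefficient" in entry
        if has_word == has_number:
            raise ValueError("set exactly one of relationship / coefficient")
        left, right = entry["metric_a"], entry["metric_b"]
        middle = entry["relationship"] if has_word else entry["coefficient"]
    else:
        raise TypeError("unsupported connection entry")

    try:
        coefficient = float(middle)  # numeric middle slot
    except (TypeError, ValueError):
        if middle not in RELATIONSHIPS:
            raise ValueError(f"unknown relationship: {middle!r}")
        coefficient = RELATIONSHIPS[middle]  # word middle slot
    if not -1.0 <= coefficient <= 1.0:
        raise ValueError("coefficient must lie in [-1.0, 1.0]")
    if left == right:
        raise ValueError("self-pairs are rejected")
    return left, coefficient, right
```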

Both endpoints must reference declared metrics. Self-pairs and connections on metrics named in lifecycle.track are rejected at construction time.

If your declared correlation matrix is not positive semi-definite, the engine projects it to the nearest valid matrix using Higham's algorithm and records the adjustment in the manifest under correlation_adjustments.


lifecycle

Optional ladder of named thresholds against a chosen metric. When set, the engine emits a stage column on the relevant fact table.

lifecycle:
  track: engagement
  stages:
    - { onboarding: 0.0 }
    - { active: 0.3 }
    - { at_risk: 0.6 }
    - { churned: 0.9 }
  enforce_order: false        # default — stateless free-mode
  downgrade_delay: null       # ignored when enforce_order is false

Stage entries accept four shapes: {onboarding: 0.0}, (onboarding, 0.0), {name: onboarding, threshold: 0.0}, or canonical form. Each threshold must be in [0, 1]; thresholds must be strictly ascending; stage names must be unique.

Field Type Required Default Notes
track str yes Must be a declared metric
stages array yes At least 2 entries
enforce_order bool no false When false, every period independently picks the highest threshold the realised value satisfies — stateless free-mode. When true, the cursor advances only and an entity can't jump back on a transient dip — a monotonic stage walk
downgrade_delay int or null no null Hysteresis under enforce_order: true. The cursor steps back once the entity has sat below the demote threshold for downgrade_delay consecutive periods. null keeps strict monotonicity. Range 1–120

The keyword lifecycle is canonical; stages is also accepted as the outer block name as an alias (a back-compat path for the early-spec keyword — both forms parse identically).
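The difference between the two enforce_order modes in the field table can be sketched as follows. Both functions are hypothetical, "satisfies" is assumed to mean value >= threshold, and the downgrade_delay hysteresis is left out because its demote threshold is not specified on this page.

```python
from bisect import bisect_right

STAGES = [("onboarding", 0.0), ("active", 0.3), ("at_risk", 0.6), ("churned", 0.9)]

def stage_index(value, stages):
    """Index of the highest threshold that value meets (assumed >= comparison)."""
    thresholds = [threshold for _, threshold in stages]
    return max(bisect_right(thresholds, value) - 1, 0)

def free_mode(values, stages=STAGES):
    """enforce_order: false. Every period picks its stage independently."""
    return [stages[stage_index(v, stages)][0] for v in values]

def monotonic_walk(values, stages=STAGES):
    """enforce_order: true with downgrade_delay: null. The cursor never steps back."""
    cursor = 0
    walked = []
    for v in values:
        cursor = max(cursor, stage_index(v, stages))
        walked.append(stages[cursor][0])
    return walked
```

On the series `[0.1, 0.35, 0.2, 0.65]`, free mode drops back to `onboarding` in the third period, while the monotonic walk holds `active` through the dip.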


seasonality

Optional global seasonal effects, each spanning a set of calendar months.

seasonality:
  - { months: [11, 12], strength: 0.30 }   # +30% in Nov-Dec
  - { months: [6, 7, 8], strength: -0.10 } # -10% in summer

Field Type Required Notes
months tuple of int yes Values in 1..12, unique within one effect, max 12 entries
strength float yes Multiplier added to 1.0 at each named month

Multiple effects may overlap — strengths sum at each period. The summed effect is then multiplied by per-metric seasonal_sensitivity and per-segment seasonal_sensitivity before being applied to the metric's distribution center.

The empty default [] produces output byte-identical to runs without a seasonality block.
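Read literally, the combination rule is a sum of strengths scaled by the two sensitivities and added to 1.0. A sketch with a hypothetical `seasonal_multiplier` helper follows; how the engine applies the result to the distribution center is not shown.

```python
def seasonal_multiplier(month, effects, metric_sensitivity=1.0, segment_sensitivity=1.0):
    """Hypothetical sketch: combined seasonal multiplier for one calendar month."""
    # Overlapping effects sum their strengths for the month.
    summed = sum(e["strength"] for e in effects if month in e["months"])
    # Both sensitivities scale the summed effect before it is added to 1.0.
    return 1.0 + summed * metric_sensitivity * segment_sensitivity

EFFECTS = [
    {"months": [11, 12], "strength": 0.30},
    {"months": [6, 7, 8], "strength": -0.10},
]
```

With these effects, December yields about 1.30 at full sensitivity, about 1.15 at a metric sensitivity of 0.5, and about 0.85 at -0.5 (halved and inverted). An empty effects list always yields 1.0.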


Schema overrides — dimensions, facts, events

When you want named columns that aren't auto-generated. Each entry uses the same shape:

dimensions:
  - name: dim_customer
    columns:
      - { name: customer_id, type: id }
      - { name: signup_date, type: date }
      - { name: industry,    type: pool.industry }

facts:
  - name: fct_engagement
    metrics: [engagement, mrr]
    columns:
      - { name: customer_id, type: ref.dim_customer }
      - { name: date_key,    type: ref.dim_date }

events:
  - name: evt_login
    trigger: proportional
    driver: engagement
    scale: 5
    columns:
      - { name: customer_id, type: ref.dim_customer }
      - { name: timestamp,   type: timestamp }

See column-types.md for every supported type.

Dim fields

Field Type Required Default Notes
name str yes Table name; conventionally dim_<thing>
columns array yes At least one column
per "period" / "unit" no None Cardinality hint — one row per period or per unit
reference bool no false Pure lookup table (no per-entity / per-period rows)
count int no 1 Sub-entity multiplier (e.g. dim_user with count=3 produces 3 users per customer)

reference: true and per are mutually exclusive.

Fact fields

Field Type Required Default Notes
name str yes Conventionally fct_<thing>
columns array yes At least one column
metrics array of str no [] Metric names whose metric.{name} columns are added automatically

Event fields

Field Type Required Default Notes
name str yes Conventionally evt_<thing>
columns array yes At least one column
trigger "proportional" / "threshold" yes How row count is determined
driver str proportional only Metric whose value drives row count
scale float proportional only ≥ 0. Rows per entity per period = metric_value × scale
metric str threshold only Metric to watch
above float threshold only Fire when value crosses above this
below float threshold only Fire when value crosses below this
for_periods (alias for) int no None Hold the threshold for N periods before firing

above and below are mutually exclusive on a single event.


bridges

Many-to-many associations between two dimension tables.

How to enable

Append to any config that has (or auto-generates) at least two dim tables. Replace dim_a / dim_b with the names of two distinct dims already in your config — dim_date and dim_{unit} are auto-generated and always valid targets.

bridges:
  - name: a_b
    left: dim_a
    right: dim_b
    cardinality: [1, 3]

Detailed example

bridges:
  - name: customer_subscription
    left: dim_customer
    right: dim_subscription
    cardinality: [1, 3]
    driver: mrr
    columns:
      - { name: weight, type: metric.mrr }

Field Type Required Default Notes
name str yes Alphanumeric / underscore
left str yes Dim table name; auto-dims (dim_date, dim_<unit>) are valid
right str yes Same; must differ from left
cardinality [int, int] yes Inclusive [min, max] second-dim entries per left entity
driver str no None Optional metric — non-null biases sampling toward trajectory position
columns array no [] Up to 20 bridge-row columns (metric.{name}, static.{value}, faker.{kind} only)

Limit: 20 bridges per config.


quality

Post-generation data corruption — null injection, duplicates, type mismatches, late arrivals, schema drift.

How to enable

Append to any config. Replace <fact> with one of your fact-table names and <metric_col> with a column on that fact. Mutually exclusive with holdout and entity_features: a config that sets quality cannot also set either of those blocks.

quality:
  - { table: <fact>, issue: null_injection, rate: 0.02, column: <metric_col> }
  - { table: <fact>, issue: duplicate_rows, rate: 0.01 }

Detailed example

quality:
  - { table: fct_engagement, issue: null_injection,  rate: 0.02, column: engagement }
  - { table: fct_engagement, issue: duplicate_rows, rate: 0.01 }

Field Type Required Default Notes
table str yes Target table
issue enum yes null_injection, duplicate_rows, type_mismatch, late_arrival, schema_drift
rate float yes 0.0 ≤ rate ≤ 1.0
column str conditional None Required for null_injection, type_mismatch, schema_drift; optional otherwise
seed_offset int no 0 Sub-seed offset to vary which rows are corrupted under the same config seed

Limit: 50 quality issues per config. The clean copy of the data is preserved in memory; the manifest's quality_injections list records exactly which rows / columns / clean values were corrupted so a downstream consumer can recover ground truth.

quality is mutually exclusive with entity_features and with holdout — both rules raise at config load.


holdout

Temporal train/holdout split for ML target workflows.

How to enable

Append to any config. Replace <metric> with any numeric metric emitted on a per-entity-per-period fact table. The minimum is two lines (target + periods); min_training_periods defaults to 3. Requires quality: [] — the splits work on the clean tables.

holdout:
  target: <metric>
  periods: 3

Detailed example

holdout:
  target: mrr
  periods: 3
  min_training_periods: 6

Field Type Required Default Notes
target str yes Metric you intend to predict
periods int yes Trailing periods reserved for evaluation. 1 ≤ periods ≤ 10000
min_training_periods int no 3 Floor on n_periods - periods; rejected at load if violated

When set, every per-entity-per-period fact table writes two extra files: <fact>_train.<ext> ([0, n - periods)) and <fact>_holdout.<ext> ([n - periods, n)). The unsplit fact is also written. Dim, bridge, and event tables are not split.
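The half-open intervals above translate directly into period-index ranges. A sketch with a hypothetical `split_periods` helper, where `n_periods` stands for the number of periods in the window:

```python
def split_periods(n_periods, periods, min_training_periods=3):
    """Hypothetical sketch: period indices for the train and holdout files."""
    if not 1 <= periods <= 10000:
        raise ValueError("periods must satisfy 1 <= periods <= 10000")
    cut = n_periods - periods
    if cut < min_training_periods:
        raise ValueError("too few training periods left after the holdout")
    # Train covers [0, n - periods); holdout covers [n - periods, n).
    return range(0, cut), range(cut, n_periods)
```

For a 24-period window with `periods: 3`, training covers indices 0 through 20 and the holdout covers 21 through 23.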

When entity features are also enabled, aggregation restricts to the training window and the target metric's six aggregate columns are dropped to prevent label leakage.


entity_features

Per-entity flat feature table emission.

How to enable

Append one line to any config. Defaults to "every numeric metric emitted on a fact table, with archetype + final-trajectory labels." Requires quality: [] and manifest.include: true (the default).

entity_features: true

Detailed example

# Narrow the metric set or strip labels for unsupervised pipelines
entity_features:
  metrics: [engagement, mrr]
  include_labels: true

Field Type Required Default Notes
metrics array of str no [] (every numeric metric on a fact table) Each name must reference a numeric fact metric. Max 50
include_labels bool no true Emits archetype and final_trajectory_position columns

For every selected metric, six aggregate columns are added per entity: <m>_mean, <m>_std, <m>_slope, <m>_first, <m>_last, <m>_peak_period. See build_entity_features.
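For intuition, the six aggregates for one entity and one metric could be computed as below. This is not build_entity_features itself: `entity_aggregates` is a hypothetical function, and it assumes a sample standard deviation, an ordinary least-squares slope against the period index, and the index of the maximum value as the peak period.

```python
from statistics import fmean, stdev

def entity_aggregates(name, series):
    """Hypothetical sketch: six per-entity aggregates for one metric series."""
    if not series:
        raise ValueError("series must contain at least one period")
    n = len(series)
    periods = range(n)
    x_mean = (n - 1) / 2
    y_mean = fmean(series)
    # Least-squares slope of value against period index.
    sxx = sum((x - x_mean) ** 2 for x in periods)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(periods, series))
    return {
        f"{name}_mean": y_mean,
        f"{name}_std": stdev(series) if n > 1 else 0.0,
        f"{name}_slope": sxy / sxx if n > 1 else 0.0,
        f"{name}_first": series[0],
        f"{name}_last": series[-1],
        f"{name}_peak_period": max(periods, key=lambda i: series[i]),
    }
```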

Pre-conditions enforced at load: manifest.include must be True; quality.quality_issues must be empty.


noise

Distributional noise applied on top of the trajectory-driven distribution center. Three independent dials, all defaulting to zero (no noise — the default produces clean output identical to pre-noise baselines).

# Preset shorthand
noise: realistic

# Detailed
noise:
  gaussian_sigma: 0.05
  outlier_rate: 0.02
  mcar_rate: 0.01

Field Type Default Range Effect
gaussian_sigma float 0.0 0.0–5.0 Multiplicative log-normal jitter on each draw: value *= exp(N(0, σ²)). Bigger σ = wider spread
outlier_rate float 0.0 0.0–1.0 Probability per cell of replacing the value with a 3-σ tail draw
mcar_rate float 0.0 0.0–1.0 Probability per cell of dropping the value to NaN (missing-completely-at-random)

Four named presets accept the lower-case canonical name OR a friendly alias — pick whichever reads naturally:

Preset gaussian_sigma outlier_rate mcar_rate Aliases
perfectly_clean (default — same as omitting noise) 0.00 0.00 0.000 clean
slightly_messy 0.03 0.01 0.005
realistic 0.05 0.02 0.010 messy
dirty 0.10 0.05 0.030 very_messy

The same constants are exported from plotsim for engine-direct mutation: PERFECTLY_CLEAN, SLIGHTLY_MESSY, REALISTIC, DIRTY.

noise is independent of the quality block — noise perturbs metric values during generation (correlations and trajectory still hold); quality corrupts the output table after generation.
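Two of the three dials are simple enough to sketch per cell. `apply_noise` is a hypothetical function using the standard-library RNG; the outlier dial is omitted because the exact form of the 3-σ tail draw is not specified here.

```python
import math
import random

def apply_noise(value, rng, gaussian_sigma=0.0, mcar_rate=0.0):
    """Hypothetical sketch: jitter one cell, then maybe blank it out."""
    if gaussian_sigma > 0:
        # Multiplicative log-normal jitter: value *= exp(N(0, sigma^2)).
        value *= math.exp(rng.gauss(0.0, gaussian_sigma))
    if rng.random() < mcar_rate:
        # Missing completely at random: the cell becomes NaN.
        return math.nan
    return value

rng = random.Random(7)  # seeded for a repeatable illustration
noisy = [apply_noise(100.0, rng, gaussian_sigma=0.05, mcar_rate=0.01) for _ in range(5)]
```

With both dials at zero the value passes through unchanged, which matches the "no noise" default described above.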


output

Output-format selector and target directory.

# Word shorthand (uses default directory ./output)
output: parquet

# Detailed
output:
  format: parquet
  directory: ./fixtures

Field Type Default Notes
format "csv" / "parquet" "csv" CSV is the default; parquet requires pip install plotsim[parquet] (pyarrow). Same engine path, ~5–10× smaller on disk
directory str "output" Where write_tables writes. Override at call time with write_tables(..., output_dir=...)

When format: parquet and pyarrow is missing, write_tables raises ImportError naming the install command, failing fast at the write call rather than mid-iteration. See Output formats for the full rundown of column dtypes, dim_date typing, and downstream loader notes.


locale

Faker locale (or list of locales) threaded to every faker.* column.

locale: en_GB                   # single locale
locale: [en_US, ja_JP, de_DE]   # multi-locale mix

Type Default Notes
str or list[str] "en_US" Any locale supported by your installed faker package. Lists round-robin across providers — useful when seeded fixtures should look multinational

Locale only affects faker.* columns; static.*, metric.*, and pool.* columns are unaffected.


Engine-direct fields

A handful of PlotsimConfig fields are not surfaced in the builder YAML above. They live on the engine config — set them with load_config()/dump_config() round-trips, or by passing them to a hand-authored engine-direct YAML.

compensate_correlations

Type bool
Default (engine-direct) False
Default (builder) True

When True, the engine pre-compensates the trajectory-driven mean shift so the realized Pearson correlations land closer to the declared connections coefficients on configs with strong archetype mixes. Records each adjustment in manifest.correlation_compensations. The builder layer sets True explicitly because connections is a table-wide intent contract; engine-direct configs default to False to preserve byte-identical output for pre-M120 YAML on disk.

generation_mode

Type "serial" / "vectorized" / "auto"
Default (engine-direct) "serial"
Default (builder) "auto"

"vectorized" batches all entities in an archetype group through one copula draw — large speedups on configs above ~5,000 entities, identical results modulo the deliberate copula bypass-fallback contract. "auto" picks per archetype group by entity count; create() / create_from_yaml() set "auto" explicitly. Manifest records the mode and any bypass-fallback counts under bypass_fallback_counts.

Per-entity overrides — cross_dim_fks and inflection_month

Both fields live on individual Entity objects (the resolved counterpart to a builder segment). They steer per-entity behavior that doesn't belong at the segment level:

from plotsim.config import EntityOverrides
cfg.entities[0].cross_dim_fks = {"plan_id": "plan_enterprise"}
cfg.entities[0].overrides = EntityOverrides(inflection_month=4)

Field Type Default Purpose
cross_dim_fks dict[str, str] {} Pin specific FK column values to specific PKs in another dim — e.g. bind expansion-champion accounts to a specific plan row. Bypasses the column's distribution for that entity
overrides.inflection_month int or None None Shift the archetype's curve segments so its canonical inflection lands on this period index. Per-entity narrative timing (e.g. "this account turned around in March")

manifest

The manifest emission config. Defaults to include: true, trajectory_sample_rate: 1.0 — every run lands a manifest.json next to the table files. Set manifest: {include: false} for microbenchmarks or sandboxed CI runs that don't need the ground-truth payload. See Manifest reference.

Per-archetype overrides — curve_segments and metric_overrides

Two mechanisms let an archetype diverge from the global metric defaults.

Archetype.curve_segments — per-archetype list of CurveSegment entries defining the full [0.0, 1.0] trajectory shape. Segments must cover the range without gaps or overlaps (validated at config load). Every metric reads its position from this curve; there is no per-metric curve override.

Archetype.metric_overrides: dict[str, MetricOverride] keyed by metric name. Each entry can override distribution, params, or value_range for that metric only when sampled for entities of this archetype. polarity and causal_lag are never overridable.

value_range overrides must be a subset of the global range — overrides narrow, never expand. Subset enforcement runs at config load.

Resolution order: for each (entity, metric) draw, the engine looks up archetype.metric_overrides[metric.name]. If present, listed fields replace the global Metric fields; unset fields fall through to the global metric. Partial overrides compose cleanly via model_copy(update=…).
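The fall-through behavior can be illustrated with standard-library dataclasses standing in for the Pydantic models. The `Metric` and `MetricOverride` classes below are simplified stand-ins defined for this sketch, and `dataclasses.replace` plays the role of `model_copy(update=…)`; the subset check mirrors the value_range rule above.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Metric:  # simplified stand-in, not the engine's model
    name: str
    distribution: str
    value_range: Optional[tuple] = None

@dataclass(frozen=True)
class MetricOverride:  # simplified stand-in; None means "not overridden"
    distribution: Optional[str] = None
    value_range: Optional[tuple] = None

def resolve_metric(metric, overrides):
    """Sketch: apply an archetype's override for this metric, if any."""
    override = overrides.get(metric.name)
    if override is None:
        return metric
    # Only fields the override actually sets replace the global values.
    updates = {k: v for k, v in vars(override).items() if v is not None}
    narrowed = updates.get("value_range")
    if narrowed is not None and metric.value_range is not None:
        lo, hi = metric.value_range
        if not lo <= narrowed[0] <= narrowed[1] <= hi:
            raise ValueError("value_range overrides may narrow, never expand")
    return replace(metric, **updates)
```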

The builder API surfaces metric_overrides.value_range only; distribution and params overrides require an engine-direct config.


Limits and performance gates

Every config is checked against per-field caps and a global cell-count budget at load time. The bounds are intentionally conservative — well above any realistic dashboard dataset, well below the point where a single laptop run becomes painful.

Limit Cap Behavior on breach
metrics count 50 Pydantic rejects at load
Per-segment count 5000 Pydantic rejects at load
Total entities (Σ segments.count) 100,000 Custom validator rejects at load
quality issues 50 Pydantic rejects at load
bridges count 20 Pydantic rejects at load
Per-bridge columns 20 Pydantic rejects at load
seasonality effects 12 Pydantic rejects at load
Causal lag delay 1–10000 periods Pydantic rejects at load

Cell-count budget

The cell count (Σ segments.count × n_periods) drives a tiered budget. The thresholds protect against runaway configs while keeping big datasets a real feature for users who deliberately want them.

Cell count Behavior
≤ 500,000 Silent (just the always-printed summary line)
> 500,000 Stderr advisory recommending output.format: parquet and generation_mode: auto
> soft budget (default 2,000,000) ValueError at load with instructions to opt in
> soft budget, opt-in given Stderr large-dataset notice, generation proceeds
> 50,000,000 Hard ceiling — ValueError regardless of opt-in

Two ways to opt into above-soft-budget runs:

  1. CLI flag: --allow-large-dataset on plotsim run, plotsim validate, or plotsim info.
  2. Environment variable: PLOTSIM_ALLOW_LARGE_DATASET=1 for library callers and CI scripts.

Two ways to change the soft-budget threshold itself:

  • PLOTSIM_CELL_BUDGET=N — set the soft cap to N cells.
  • PLOTSIM_CELL_BUDGET=0 — disable the soft cap entirely (only the 50,000,000-cell hard ceiling still applies).

The hard ceiling is non-configurable. Configs above 50,000,000 cells should be split or chunked rather than coerced through a single run.
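Taken together, the tiers and environment variables amount to a small decision function. The sketch below uses a hypothetical `check_cell_budget` function that returns a label instead of printing to stderr; only the thresholds and variable names come from this section.

```python
import os

ADVISORY_CELLS = 500_000
DEFAULT_SOFT_BUDGET = 2_000_000
HARD_CEILING = 50_000_000

def check_cell_budget(total_entities, n_periods, env=None):
    """Hypothetical sketch: classify a config against the tiered budget."""
    env = os.environ if env is None else env
    cells = total_entities * n_periods
    if cells > HARD_CEILING:
        raise ValueError("above the hard ceiling; split or chunk the config")
    # PLOTSIM_CELL_BUDGET=0 disables the soft cap entirely.
    soft_budget = int(env.get("PLOTSIM_CELL_BUDGET", DEFAULT_SOFT_BUDGET))
    opted_in = env.get("PLOTSIM_ALLOW_LARGE_DATASET") == "1"
    if soft_budget > 0 and cells > soft_budget:
        if not opted_in:
            raise ValueError("above the soft budget; opt in to proceed")
        return "large-dataset notice"
    if cells > ADVISORY_CELLS:
        return "advisory"
    return "silent"
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.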

A summary line is always printed to stderr at load time so the projected cell count and peak memory estimate are visible even on runs well below the threshold:

Config summary: 80 entities × 24 periods = 1,920 cells, 4 metrics, 6 tables. Estimated peak memory: ~100 MB.