ML training data & evaluation¶
Use plotsim for ML training data, feature-engineering practice, and evaluation under known-truth conditions. Trajectory positions and archetype labels are exposed as ground truth via the manifest.
Why plotsim for ML work¶
Most synthetic data tools generate columns that don't tell a story — revenue, engagement, and churn move independently, and any model trained on them learns noise. Real data has coherence: the same underlying behavioral state shapes every metric on the row.
plotsim builds that coherence in. Every metric value at every
(entity, period) cell is derived from the same trajectory position.
The trajectory itself is exposed as ground truth — so you can train a
classifier on the noisy fact tables and score it against the answer
key.
The companion notebooks are ds_use_cases.ipynb (end-to-end ML workflows) and ml_readiness.ipynb (features, holdout, sklearn).
Quick start — two paths to the same dataset¶
The bundled saas template is the fastest way to get a coherent,
multi-metric dataset with archetype ground truth.
The saas_template.py
companion shows the same template authored as a create(**kwargs)
call — every YAML field maps 1-1 to a Python keyword.
Training-ready feature table¶
Set entity_features: true and a flat one-row-per-entity DataFrame
is written alongside the fact tables. Six aggregate columns per
metric, plus archetype + final-trajectory ground truth.
import pandas as pd
features = pd.read_csv("data/_entity_features.csv")
# Columns: <metric>_mean, <metric>_std, <metric>_slope,
# <metric>_first, <metric>_last, <metric>_peak_period,
# archetype, final_trajectory_position
X = features.drop(columns=["archetype", "final_trajectory_position", "company_id"])
y = features["archetype"]
The detailed shape (metrics: filter, include_labels: false for
unsupervised pipelines) is documented in
config-reference.md §entity_features.
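To make the six aggregate columns concrete, here is a sketch that recomputes them from a toy fact table with plain pandas. The column values and the least-squares slope fit are illustrative assumptions; the generator's own aggregation may differ in detail.

```python
import numpy as np
import pandas as pd

# Toy per-entity-per-period fact table (illustrative values).
facts = pd.DataFrame({
    "company_id": ["a", "a", "a", "b", "b", "b"],
    "period_index": [0, 1, 2, 0, 1, 2],
    "mrr": [100.0, 120.0, 150.0, 90.0, 80.0, 60.0],
})

def aggregate(group: pd.DataFrame) -> pd.Series:
    v = group.sort_values("period_index")["mrr"].to_numpy()
    slope = np.polyfit(np.arange(len(v)), v, 1)[0]  # least-squares trend
    return pd.Series({
        "mrr_mean": v.mean(),
        "mrr_std": v.std(ddof=1),
        "mrr_slope": slope,
        "mrr_first": v[0],
        "mrr_last": v[-1],
        "mrr_peak_period": int(v.argmax()),
    })

features = facts.groupby("company_id")[["period_index", "mrr"]].apply(aggregate)
print(features)
```

One row per entity, six columns per metric, exactly the shape `_entity_features.csv` describes above.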
Temporal train/holdout split¶
Reserve trailing periods for evaluation. Every per-entity-per-period
fact table writes <fact>_train.<ext> and <fact>_holdout.<ext>
companions. Aggregations on entity_features restrict to the
training window automatically — no leakage.
The target metric's six aggregate columns are dropped from
_entity_features.csv to prevent label leakage. The cutoff period
index is recorded at manifest.holdout.cutoff_period_index for
downstream code that needs the boundary.
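A sketch of what the config additions might look like in YAML. The field names inside the holdout block are assumed for illustration; check config-reference.md §holdout for the real schema.

```yaml
# Illustrative only — consult config-reference.md for exact field names.
entity_features: true
holdout:
  periods: 6          # assumed: reserve the trailing 6 periods
  target_metric: mrr  # assumed: its aggregates are dropped from features
```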
Lagged features via follows / delay¶
Cross-metric causal lag is part of the metric declaration, with no
separate post-processing step. In the bundled example,
support_tickets trails engagement by 2 periods.
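A sketch of that declaration using the follows / delay keys named above; the surrounding metric fields are assumed for illustration.

```yaml
metrics:
  - name: engagement
    type: score
  - name: support_tickets
    type: count
    polarity: negative
    follows: engagement  # lag source
    delay: 2             # trails by 2 periods
```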
A follows graph cannot loop back on itself — the engine validates
the chain at config load. See
Causal lag.
Custom correlation magnitudes¶
Calibrate a coefficient to whatever your domain measurement suggests rather than picking from the nine-word vocabulary; three shorthand forms are accepted.
Useful when a real dataset gave you a measured correlation (r=0.42)
that doesn't match any of mirrors/driven_by/etc. Coefficients are
clamped to [-1, 1]; the engine projects non-PSD matrices to the
nearest valid one and records the adjustment in the manifest.
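One possible shape for a numeric-coefficient connection, sketched under assumed field names (the vocabulary words mirrors / driven_by are from the doc; the connection block's keys here are illustrative):

```yaml
metrics:
  - name: engagement
    type: score
  - name: mrr
    type: amount
    connections:
      - to: engagement
        coefficient: 0.42  # assumed key; clamped to [-1, 1] per the note above
```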
Time-to-event metrics — Weibull¶
For survival-style metrics (session duration, days-to-renewal,
contract length), the engine accepts the Weibull distribution
directly via load_config(). The shape parameter controls the
tail; trajectory still scales the realised value, so a growth
archetype with weibull produces lengthening sessions over time,
not just larger ones.
# engine-direct YAML (load with plotsim.load_config — not the builder)
metrics:
  - name: session_duration_days
    distribution: weibull
    params: { shape: 1.5 }  # >1 → lengthening tail; <1 → aging out
    polarity: positive
    value_range: { min: 1.0, max: 365.0 }
The builder metric types (score, amount, count, index)
cover most cases; weibull is for the time-to-event shape
specifically and uses the engine-direct path. See
metrics-and-connections.md §weibull.
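The effect of the shape parameter is easy to see with NumPy's standard Weibull sampler; this is a standalone illustration of the distribution itself, not plotsim API.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# shape > 1: durations cluster near the mode, tail lengthens gradually.
long_tail = rng.weibull(1.5, n)
# shape < 1: heavy early drop-off ("aging out") — many very short durations.
early_exit = rng.weibull(0.8, n)

# Fraction of draws that end almost immediately (below 0.1):
print((long_tail < 0.1).mean())   # small for shape 1.5
print((early_exit < 0.1).mean())  # several times larger for shape 0.8
```

The contrast in early-exit mass is what the `>1 → lengthening tail; <1 → aging out` comment in the YAML above refers to.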
SCD Type 2 dimensions for time-aware joins¶
Slowly-changing dimensions are first-class — declare a scd
column on a per-entity dim and the engine emits one row per band
crossing. Each fact row joins to the active version via an
auto-appended dim_row_id column.
manifest.scd_events records every transition with old/new label,
old/new dim_row_id, trigger period, and trigger position —
perfect ground truth for tenure-aware feature engineering. See
Schema guide §SCD2.
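A minimal sketch of tenure-aware feature engineering from scd_events. The entity names, values, and per-event key names are made up for illustration; see the Schema guide for the real manifest layout.

```python
# Illustrative scd_events records (key names assumed).
scd_events = [
    {"entity": "acme", "old_label": "smb", "new_label": "mid_market",
     "trigger_period": 5},
    {"entity": "acme", "old_label": "mid_market", "new_label": "enterprise",
     "trigger_period": 11},
    {"entity": "globex", "old_label": "smb", "new_label": "mid_market",
     "trigger_period": 8},
]

# Two tenure-aware features: transition count and period of last change.
transitions: dict[str, int] = {}
last_change: dict[str, int] = {}
for ev in scd_events:
    transitions[ev["entity"]] = transitions.get(ev["entity"], 0) + 1
    last_change[ev["entity"]] = max(
        last_change.get(ev["entity"], -1), ev["trigger_period"]
    )

print(transitions)  # {'acme': 2, 'globex': 1}
print(last_change)  # {'acme': 11, 'globex': 8}
```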
Bridge tables for relationship features¶
Many-to-many associations between two dimensions land as a bridge
table. manifest.bridge_associations records every
(left_entity, right_entities) tuple — the answer key for
relationship-graph features.
driver: mrr biases sampling toward the entity's trajectory
position — high-MRR companies hold more subscriptions. See
bridges_and_advanced.ipynb.
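A small sketch of turning bridge_associations into a relationship-graph feature. Entity and plan names are made up; the (left_entity, right_entities) shape follows the description above.

```python
# Illustrative bridge_associations records from the manifest.
bridge_associations = [
    {"left_entity": "acme", "right_entities": ["plan_pro", "plan_addon_1"]},
    {"left_entity": "globex", "right_entities": ["plan_basic"]},
]

# Degree feature: number of associated subscriptions per company.
degree = {
    a["left_entity"]: len(a["right_entities"]) for a in bridge_associations
}
print(degree)  # {'acme': 2, 'globex': 1}
```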
Seasonality + per-metric sensitivity¶
Calendar-month effects layer on top of the trajectory. Each metric
declares its own seasonal_sensitivity: 0.0 is immune, 1.0 is the
default, and -0.5 halves and inverts the effect.
seasonality:
  - { months: [11, 12], strength: 0.30 }    # Q4 lift
  - { months: [6, 7, 8], strength: -0.10 }  # summer dip
metrics:
  - { name: mrr, type: amount, polarity: positive,
      range: [10, 5000], seasonal_sensitivity: 1.0 }
  - { name: support_tickets, type: count, polarity: negative,
      seasonal_sensitivity: 0.0 }  # immune
cfg = create(
    # ...
    seasonality=[
        {"months": [11, 12], "strength": 0.30},
        {"months": [6, 7, 8], "strength": -0.10},
    ],
    metrics=[
        {"name": "mrr", "type": "amount", "polarity": "positive",
         "range": [10, 5000], "seasonal_sensitivity": 1.0},
        {"name": "support_tickets", "type": "count", "polarity": "negative",
         "seasonal_sensitivity": 0.0},
    ],
)
Per-segment sensitivity is also available — seasonal_sensitivity:
1.5 on a B2C cohort, 0.0 on a B2B cohort. See
Seasonality for the full
multiplication formula.
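The layering above can be sketched as a simple multiplicative adjustment. The formula here is an assumed illustration; the Seasonality page documents the engine's actual one.

```python
def seasonal_value(base: float, strength: float, sensitivity: float) -> float:
    # Assumed form: the month's strength is scaled by the metric's
    # sensitivity and applied multiplicatively to the trajectory value.
    return base * (1.0 + strength * sensitivity)

# November mrr with full sensitivity: a 30% lift.
print(seasonal_value(1000.0, 0.30, 1.0))
# support_tickets is immune (sensitivity 0.0): unchanged.
print(seasonal_value(40.0, 0.30, 0.0))
# sensitivity -0.5: the Q4 lift is halved and inverted into a 15% dip.
print(seasonal_value(1000.0, 0.30, -0.5))
```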
Reading ground truth from the manifest¶
The manifest is the ML answer key — every entity's archetype, every event firing, every band transition, every adjustment the engine made (Higham PSD projection, correlation compensation, copula bypass fallback).
import json
from pathlib import Path

mf = json.loads(Path("./data/manifest.json").read_text(encoding="utf-8"))

# Entity → archetype label
labels = {a["entity"]: a["archetype"] for a in mf["archetype_assignments"]}

# Trajectory position at every period for sampled entities
positions = {
    (s["entity"], s["period_index"]): s["position"]
    for s in mf["trajectory_samples"]
}
Train a classifier on the fact-table aggregates, predict on the
holdout entities, and compare against labels for accuracy, F1, or a
confusion matrix. The manifest gives you that answer key for free —
something most synthetic data tools lack entirely.
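A minimal scoring sketch against the manifest labels. Entity names and predictions are made up; in practice labels comes from the manifest dict built above and predictions from your trained model.

```python
# Ground-truth archetype labels (illustrative) and hypothetical
# classifier predictions on holdout entities.
labels = {"acme": "steady", "globex": "churn_risk", "initech": "growth"}
predictions = {"acme": "steady", "globex": "growth", "initech": "growth"}

correct = sum(predictions[e] == labels[e] for e in predictions)
accuracy = correct / len(predictions)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.67
```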
See Manifest reference for every section.
Trace one weird cell — trace_metric_cell¶
When a particular (entity, period, metric) value looks
surprising, trace_metric_cell reconstructs every step of the
pipeline that produced it: trajectory position → distribution
center → correlated draw → noise → clamp.
from plotsim.inspect import trace_metric_cell

result = trace_metric_cell(
    cfg, seed=cfg.seed,
    entity_name="promising_client_0001",
    period_index=8,
    metric_name="mrr",
)
print(result.trajectory_position, result.distribution_center,
      result.correlated_draw, result.noised_value, result.realized_cell)
Useful for debugging unexpected outliers or validating that a
custom archetype curve actually produced the shape you intended.
See trace_metric_cell.
A complete ML-ready config¶
The bundled saas template, extended with entity_features,
holdout, and quality: [] (both entity_features and holdout
require an empty quality list), is ML-ready out of the box:
plotsim template saas -o saas_ml.yaml
# Edit saas_ml.yaml: append entity_features: true and a holdout block
plotsim run saas_ml.yaml -o ./data --validate
Then load _entity_features.csv for tabular ML, the per-fact
_train / _holdout files for time-series ML, and
manifest.json for ground truth. ml_readiness.ipynb walks this
recipe end-to-end against a sklearn classifier and a per-account
linear-regression forecast.
See also¶
- ml_readiness.ipynb — end-to-end ML pipeline with plotsim fixtures
- designing_archetypes.ipynb — the archetype DSL in depth
- seasonality_and_correlations.ipynb — full seasonality + connection-grammar walk-through
- Manifest reference — every ground-truth field
- Config field reference §holdout / §entity_features — the ML-aware config blocks
- Engine-direct fields — weibull distribution, cross_dim_fks, inflection_month
- API reference §build_entity_features / §trace_metric_cell — programmatic access to features and per-cell pipeline traces