ML readiness — entity features and holdout splits¶
A working ML loop needs three things:
- An entity-feature matrix — one row per entity, with aggregated metric columns and the segment label.
- A train/holdout split — the last N periods reserved as a holdout, the rest used for fitting.
- Known ground truth — the manifest names every entity's archetype, so cluster recovery and label-prediction accuracy are checkable end-to-end.
This notebook generates a dataset that exercises all three, then trains a scikit-learn classifier on the result.
Generate with holdout and entity_features enabled¶
import json
import numpy as np
import pandas as pd
from pathlib import Path
from plotsim import (
create, generate_tables_with_state, build_manifest, write_tables,
)
cfg = create(
about="ML-ready fixture",
unit="account",
window=("2023-01", "2024-12", "monthly"),
metrics=[
{"name": "engagement", "type": "score", "polarity": "positive"},
{"name": "spend", "type": "amount", "polarity": "positive",
"range": [10, 500]},
{"name": "feature_use", "type": "score", "polarity": "positive"},
{"name": "support_load", "type": "score", "polarity": "negative"},
],
connections=[
"engagement mirrors feature_use",
"engagement opposes support_load",
],
segments=[
{"name": "stars", "count": 40, "archetype": "growth",
"attributes": {"label": "stars"}},
{"name": "fading", "count": 40, "archetype": "decline",
"attributes": {"label": "fading"}},
{"name": "steady", "count": 40, "archetype": "flat",
"attributes": {"label": "steady"}},
],
holdout={"target": "engagement", "periods": 4},
entity_features=True,
)
tables, state = generate_tables_with_state(cfg, np.random.default_rng(cfg.seed))
manifest = build_manifest(
cfg, state.trajectories, tables,
scd_state=state.scd, bridge_state=state.bridges,
)
out = Path("./out_ml")
write_tables(tables, cfg, output_dir=out, manifest=manifest)
sorted(p.name for p in out.glob("*"))
Entity-features file — one row per account¶
_entity_features.csv has summary columns per metric (mean, std, last) plus the segment label. With holdout enabled, aggregations are computed on training periods only and the target metric's aggregates are dropped to prevent leakage.
features = pd.read_csv(out / "_entity_features.csv")
print(f"Shape: {features.shape}")
print(f"\nColumns:\n {list(features.columns)}")
features.head()
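The training-period-only aggregation and the target-column drop can be sketched independently of the library. Everything below is a toy stand-in: the data, the `HOLDOUT_PERIODS`/`TARGET` names, and the exact column naming (`<metric>_mean` etc.) are assumptions for illustration, not plotsim's actual internals.

```python
import pandas as pd

# Toy long-format fact table, two accounts over six periods.
# Hypothetical stand-in for fct_account; column names are assumptions.
fact = pd.DataFrame({
    "account_id": ["a1"] * 6 + ["a2"] * 6,
    "period_index": list(range(6)) * 2,
    "engagement": [10, 12, 14, 16, 18, 20, 30, 28, 26, 24, 22, 20],
    "spend":      [100, 110, 120, 130, 140, 150, 90, 85, 80, 75, 70, 65],
})

HOLDOUT_PERIODS = 2          # assumed: last 2 periods are held out
TARGET = "engagement"        # assumed: the holdout target metric

cutoff = fact["period_index"].max() - HOLDOUT_PERIODS
train_only = fact[fact["period_index"] <= cutoff]

# Aggregate each metric over training periods only: mean, std, last.
agg = train_only.sort_values("period_index").groupby("account_id").agg(
    engagement_mean=("engagement", "mean"),
    engagement_std=("engagement", "std"),
    engagement_last=("engagement", "last"),
    spend_mean=("spend", "mean"),
    spend_std=("spend", "std"),
    spend_last=("spend", "last"),
).reset_index()

# Drop the target metric's aggregates so the feature matrix cannot leak
# the very values the holdout periods are meant to test.
features = agg.drop(columns=[c for c in agg.columns if c.startswith(f"{TARGET}_")])
print(features)
```

The key property is that both the row filter and the column drop happen before any model sees the data.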
Holdout splits — temporal cutoff¶
Each fact table is split into <name>_train.csv and <name>_holdout.csv. The cutoff is recorded on the manifest under holdout.cutoff_period_index.
train = pd.read_csv(out / "fct_account_train.csv")
holdout = pd.read_csv(out / "fct_account_holdout.csv")
print(f"Train rows: {len(train):>5} date_keys {train['date_key'].min()}..{train['date_key'].max()}")
print(f"Holdout rows: {len(holdout):>5} date_keys {holdout['date_key'].min()}..{holdout['date_key'].max()}")
manifest_disk = json.loads((out / "manifest.json").read_text(encoding="utf-8"))
print(f"\nmanifest.holdout: {manifest_disk['holdout']}")
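A temporal cutoff split of this shape can be sketched in a few lines. The frame below is synthetic, and the inclusive-cutoff convention (training periods `<= cutoff_period_index`) is an assumption about the semantics, not a statement of how plotsim defines it.

```python
import pandas as pd

# Hypothetical single-account fact table with an explicit period index.
fact = pd.DataFrame({
    "account_id": ["a1"] * 8,
    "period_index": range(8),
    "engagement": range(10, 18),
})

holdout_periods = 3
cutoff_period_index = fact["period_index"].max() - holdout_periods  # assumed inclusive

train = fact[fact["period_index"] <= cutoff_period_index]
holdout = fact[fact["period_index"] > cutoff_period_index]

# The split must be exhaustive, non-overlapping, and strictly ordered:
# every training period precedes every holdout period.
assert len(train) + len(holdout) == len(fact)
assert train["period_index"].max() < holdout["period_index"].min()
print(f"train 0..{train['period_index'].max()}, "
      f"holdout {holdout['period_index'].min()}..{holdout['period_index'].max()}")
```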
Recovering the configured archetype with sklearn¶
The features are intentionally easy to separate — three archetypes with different shapes — so a simple linear classifier should hit high accuracy. This is the "did the engine actually produce what I asked for?" test.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
label_col = "archetype" if "archetype" in features.columns else "label"
X = features.drop(columns=["account_id", label_col]).select_dtypes(include="number")
y = features[label_col]
X_tr, X_te, y_tr, y_te = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y,
)
clf = LogisticRegression(max_iter=2000)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
Predicting the held-out periods from the trained periods¶
For a regression-style check: fit a per-account linear model on the training-period engagement, predict the holdout periods, and compare to the recorded values.
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
date_index = (
tables["dim_date"][["date_key", "period_index"]]
.set_index("date_key")["period_index"]
)
train_p = train.assign(period_index=train["date_key"].map(date_index))
holdout_p = holdout.assign(period_index=holdout["date_key"].map(date_index))
picks = train_p["account_id"].drop_duplicates().head(3)
fig, ax = plt.subplots(figsize=(10, 4))
for aid in picks:
a_tr = train_p[train_p["account_id"] == aid].sort_values("period_index")
a_ho = holdout_p[holdout_p["account_id"] == aid].sort_values("period_index")
model = LinearRegression().fit(a_tr[["period_index"]], a_tr["engagement"])
pred = model.predict(a_ho[["period_index"]])
ax.plot(a_tr["period_index"], a_tr["engagement"], "o-", label=f"{aid} train")
ax.plot(a_ho["period_index"], a_ho["engagement"], "o", label=f"{aid} actual")
ax.plot(a_ho["period_index"], pred, "x--", label=f"{aid} predicted")
ax.axvline(train_p["period_index"].max() + 0.5, ls=":", color="grey",
label="holdout cutoff")
ax.set_title("Holdout-period prediction — linear fit on train, eval on holdout")
ax.set_xlabel("Period index"); ax.set_ylabel("engagement")
ax.legend(fontsize=7, ncol=3)
plt.tight_layout(); plt.show()
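The other end-to-end check named at the top — cluster recovery — can be sketched the same way: cluster the feature matrix without labels and score agreement with the configured archetypes. The data below is a synthetic stand-in for the entity-feature matrix (three well-separated groups of 40, mirroring growth / decline / flat); the adjusted Rand index is used because it is invariant to cluster-id permutation, so no label mapping is needed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: three archetypes with well-separated 2-D feature means.
means = np.array([[5.0, -5.0], [-5.0, 5.0], [0.0, 0.0]])
X = np.vstack([rng.normal(m, 1.0, size=(40, 2)) for m in means])
true_labels = np.repeat([0, 1, 2], 40)

# Unsupervised recovery: cluster, then score agreement with the ground truth.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    StandardScaler().fit_transform(X)
)
ari = adjusted_rand_score(true_labels, km.labels_)
print(f"adjusted Rand index: {ari:.3f}")
```

With separation this clean the ARI should sit near 1.0; on real generated features a high-but-imperfect score is the expected signature of a correctly built fixture.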
Where to next¶
- Bridges and advanced — bridges_and_advanced.ipynb covers richer feature surfaces (bridge cardinality, sub-entity rollups).
- DS use cases — ds_use_cases.ipynb extends this notebook into controlled experiments and feature-engineering validation.
- Config fields — docs/site/config-reference.md §holdout and §entity_features document every field.