ML readiness — entity features and holdout splits¶
A working ML loop needs three things:
- An entity-feature matrix — one row per entity, with aggregated metric columns and the segment label.
- A train/holdout split — the last N periods reserved as a holdout, the rest used for fitting.
- Known ground truth — the manifest names every entity's archetype, so cluster recovery and label-prediction accuracy are checkable end-to-end.
This notebook generates a dataset that exercises all three, then trains a scikit-learn classifier on the result.
Generate with holdout and entity_features enabled¶
import json
import numpy as np
import pandas as pd
from pathlib import Path
from plotsim import (
create, generate_tables_with_state, build_manifest, write_tables,
)
cfg = create(
about="ML-ready fixture",
unit="account",
window=("2023-01", "2024-12", "monthly"),
metrics=[
{"name": "engagement", "type": "score", "polarity": "positive"},
{"name": "spend", "type": "amount", "polarity": "positive",
"range": [10, 500]},
{"name": "feature_use", "type": "score", "polarity": "positive"},
{"name": "support_load", "type": "score", "polarity": "negative"},
],
connections=[
"engagement mirrors feature_use",
"engagement opposes support_load",
],
segments=[
{"name": "stars", "count": 40, "archetype": "growth",
"attributes": {"label": "stars"}},
{"name": "fading", "count": 40, "archetype": "decline",
"attributes": {"label": "fading"}},
{"name": "steady", "count": 40, "archetype": "flat",
"attributes": {"label": "steady"}},
],
holdout={"target": "engagement", "periods": 4},
entity_features=True,
)
tables, state = generate_tables_with_state(cfg, np.random.default_rng(cfg.seed))
manifest = build_manifest(
cfg, state.trajectories, tables,
scd_state=state.scd, bridge_state=state.bridges,
)
out = Path("./out_ml")
write_tables(tables, cfg, output_dir=out, manifest=manifest)
sorted(p.name for p in out.glob("*"))
Entity-features file — one row per account¶
_entity_features.csv has summary columns per metric (mean, std, last) plus the segment label. With holdout enabled, aggregations are computed on training periods only and the target metric's aggregates are dropped to prevent leakage.
features = pd.read_csv(out / "_entity_features.csv")
print(f"Shape: {features.shape}")
print(f"\nColumns:\n {list(features.columns)}")
features.head()
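The training-period-only aggregation and the target-column drop can be sketched independently of the library. Everything below is a toy stand-in: the data, the `HOLDOUT_PERIODS`/`TARGET` names, and the exact column naming (`<metric>_mean` etc.) are assumptions for illustration, not plotsim's actual internals.

```python
import pandas as pd

# Toy long-format fact table, two accounts over six periods.
# Hypothetical stand-in for fct_account; column names are assumptions.
fact = pd.DataFrame({
    "account_id": ["a1"] * 6 + ["a2"] * 6,
    "period_index": list(range(6)) * 2,
    "engagement": [10, 12, 14, 16, 18, 20, 30, 28, 26, 24, 22, 20],
    "spend":      [100, 110, 120, 130, 140, 150, 90, 85, 80, 75, 70, 65],
})

HOLDOUT_PERIODS = 2          # assumed: last 2 periods are held out
TARGET = "engagement"        # assumed: the holdout target metric

cutoff = fact["period_index"].max() - HOLDOUT_PERIODS
train_only = fact[fact["period_index"] <= cutoff]

# Aggregate each metric over training periods only: mean, std, last.
agg = train_only.sort_values("period_index").groupby("account_id").agg(
    engagement_mean=("engagement", "mean"),
    engagement_std=("engagement", "std"),
    engagement_last=("engagement", "last"),
    spend_mean=("spend", "mean"),
    spend_std=("spend", "std"),
    spend_last=("spend", "last"),
).reset_index()

# Drop the target metric's aggregates so the feature matrix cannot leak
# the very values the holdout periods are meant to test.
features = agg.drop(columns=[c for c in agg.columns if c.startswith(f"{TARGET}_")])
print(features)
```

The key property is that both the row filter and the column drop happen before any model sees the data.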
Holdout splits — temporal cutoff¶
Each fact table is split into <name>_train.csv and <name>_holdout.csv. The cutoff is recorded on the manifest under holdout.cutoff_period_index.
train = pd.read_csv(out / "fct_account_train.csv")
holdout = pd.read_csv(out / "fct_account_holdout.csv")
print(f"Train rows: {len(train):>5} date_keys {train['date_key'].min()}..{train['date_key'].max()}")
print(f"Holdout rows: {len(holdout):>5} date_keys {holdout['date_key'].min()}..{holdout['date_key'].max()}")
manifest_disk = json.loads((out / "manifest.json").read_text(encoding="utf-8"))
print(f"\nmanifest.holdout: {manifest_disk['holdout']}")
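A temporal cutoff split of this shape can be sketched in a few lines. The frame below is synthetic, and the inclusive-cutoff convention (training periods `<= cutoff_period_index`) is an assumption about the semantics, not a statement of how plotsim defines it.

```python
import pandas as pd

# Hypothetical single-account fact table with an explicit period index.
fact = pd.DataFrame({
    "account_id": ["a1"] * 8,
    "period_index": range(8),
    "engagement": range(10, 18),
})

holdout_periods = 3
cutoff_period_index = fact["period_index"].max() - holdout_periods  # assumed inclusive

train = fact[fact["period_index"] <= cutoff_period_index]
holdout = fact[fact["period_index"] > cutoff_period_index]

# The split must be exhaustive, non-overlapping, and strictly ordered:
# every training period precedes every holdout period.
assert len(train) + len(holdout) == len(fact)
assert train["period_index"].max() < holdout["period_index"].min()
print(f"train 0..{train['period_index'].max()}, "
      f"holdout {holdout['period_index'].min()}..{holdout['period_index'].max()}")
```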
Recovering the configured archetype with sklearn¶
The features are intentionally easy to separate — three archetypes with different shapes — so a simple linear classifier should hit high accuracy. This is the "did the engine actually produce what I asked for?" test.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
label_col = "archetype" if "archetype" in features.columns else "label"
X = features.drop(columns=["account_id", label_col]).select_dtypes(include="number")
y = features[label_col]
X_tr, X_te, y_tr, y_te = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y,
)
clf = LogisticRegression(max_iter=2000)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
Predicting the held-out periods from the trained periods¶
For a regression-style check: fit a per-account linear model on the training-period engagement, predict the holdout periods, and compare to the recorded values.
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
date_index = (
tables["dim_date"][["date_key", "period_index"]]
.set_index("date_key")["period_index"]
)
train_p = train.assign(period_index=train["date_key"].map(date_index))
holdout_p = holdout.assign(period_index=holdout["date_key"].map(date_index))
picks = train_p["account_id"].drop_duplicates().head(3)
fig, ax = plt.subplots(figsize=(10, 4))
for aid in picks:
a_tr = train_p[train_p["account_id"] == aid].sort_values("period_index")
a_ho = holdout_p[holdout_p["account_id"] == aid].sort_values("period_index")
model = LinearRegression().fit(a_tr[["period_index"]], a_tr["engagement"])
pred = model.predict(a_ho[["period_index"]])
ax.plot(a_tr["period_index"], a_tr["engagement"], "o-", label=f"{aid} train")
ax.plot(a_ho["period_index"], a_ho["engagement"], "o", label=f"{aid} actual")
ax.plot(a_ho["period_index"], pred, "x--", label=f"{aid} predicted")
ax.axvline(train_p["period_index"].max() + 0.5, ls=":", color="grey",
label="holdout cutoff")
ax.set_title("Holdout-period prediction — linear fit on train, eval on holdout")
ax.set_xlabel("Period index"); ax.set_ylabel("engagement")
ax.legend(fontsize=7, ncol=3)
plt.tight_layout(); plt.show()
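The other end-to-end check named at the top — cluster recovery — can be sketched the same way: cluster the feature matrix without labels and score agreement with the configured archetypes. The data below is a synthetic stand-in for the entity-feature matrix (three well-separated groups of 40, mirroring growth / decline / flat); the adjusted Rand index is used because it is invariant to cluster-id permutation, so no label mapping is needed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: three archetypes with well-separated 2-D feature means.
means = np.array([[5.0, -5.0], [-5.0, 5.0], [0.0, 0.0]])
X = np.vstack([rng.normal(m, 1.0, size=(40, 2)) for m in means])
true_labels = np.repeat([0, 1, 2], 40)

# Unsupervised recovery: cluster, then score agreement with the ground truth.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    StandardScaler().fit_transform(X)
)
ari = adjusted_rand_score(true_labels, km.labels_)
print(f"adjusted Rand index: {ari:.3f}")
```

With separation this clean the ARI should sit near 1.0; on real generated features a high-but-imperfect score is the expected signature of a correctly built fixture.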
Where to next¶
- Bridges and advanced — bridges_and_advanced.ipynb covers richer feature surfaces (bridge cardinality, sub-entity rollups).
- DS use cases — ds_use_cases.ipynb extends this notebook into controlled experiments and feature-engineering validation.
- Config fields — docs/site/config-reference.md §holdout and §entity_features document every field.