Schema and dimensions¶
By default plotsim auto-generates a minimal star schema: a date dimension, a unit dimension, and one fact table per group of metrics. That's enough for most cases. When you need control over column names, references between tables, slowly-changing-dimension history, or events, you declare an explicit schema.
This notebook shows both, side by side, then walks through every column type the builder accepts.
Auto-schema¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotsim import create, generate_tables
auto = create(
about="Auto-schema demo",
unit="user",
window=("2024-01", "2024-12", "monthly"),
metrics=[
{"name": "logins", "type": "count", "polarity": "positive"},
{"name": "spend", "type": "amount", "polarity": "positive", "range": [0, 200]},
],
segments=[
{"name": "active", "count": 10, "archetype": "growth",
"attributes": {"tier": "active"}},
{"name": "lapsed", "count": 10, "archetype": "decline",
"attributes": {"tier": "lapsed"}},
],
)
tables = generate_tables(auto, np.random.default_rng(auto.seed))
print("Tables generated:")
for name, df in tables.items():
print(f" {name:<14} {len(df):>4} rows × {df.shape[1]} cols — {list(df.columns)}")
The auto-schema produced three tables: a per-period date dim, a per-entity user dim (with the segment-attribute tier baked in), and a single fact table holding both metrics. That's it.
Explicit schema — column type vocabulary¶
Inside dimensions / facts / events, every column has a type. The vocabulary is plain language:
| Type | Where it pulls from |
|---|---|
id |
auto-generated primary key |
ref.<table> |
foreign key to a dim's primary key |
metric.<name> |
a declared metric's value at this row's period |
faker.<which> |
text from faker (e.g. faker.company, faker.name) |
static.<value> |
the same literal in every row |
segment.count |
the entity's segment size |
pool.<attr> |
the entity's segment-attribute value (M122) |
timestamp |
a datetime within the row's period |
flag |
a boolean derived from an event trigger |
date, int, string, float |
dim_date conventional dtypes |
bucket |
categorical label binned from trajectory position |
scd |
slowly-changing dimension — needs tracks / tiers / at |
explicit = create(
about="Explicit schema demo",
unit="user",
window=("2024-01", "2024-12", "monthly"),
metrics=[
{"name": "engagement", "type": "score", "polarity": "positive"},
{"name": "spend", "type": "amount", "polarity": "positive", "range": [0, 200]},
],
segments=[
{"name": "active", "count": 10, "archetype": "growth",
"attributes": {"tier": "active"}},
{"name": "lapsed", "count": 10, "archetype": "decline",
"attributes": {"tier": "lapsed"}},
],
dimensions=[
{"name": "dim_date", "per": "period", "columns": [
{"name": "date_key", "type": "id"},
{"name": "date", "type": "date"},
{"name": "year", "type": "int"},
{"name": "month", "type": "int"},
]},
{"name": "dim_user", "per": "unit", "columns": [
{"name": "user_id", "type": "id"},
{"name": "user_name", "type": "faker.name"},
{"name": "tier", "type": "pool.tier"},
{"name": "cohort_size", "type": "segment.count"},
{"name": "plan_tier", "type": "scd",
"tracks": "engagement", "tiers": ["free", "pro", "enterprise"],
"at": [0.35, 0.7]},
]},
{"name": "dim_plan", "reference": True, "columns": [
{"name": "plan_id", "type": "id"},
{"name": "plan_name", "type": "static.standard"},
]},
],
facts=[
{"name": "fct_activity", "metrics": ["engagement", "spend"], "columns": [
{"name": "date_key", "type": "ref.dim_date"},
{"name": "user_id", "type": "ref.dim_user"},
{"name": "plan_id", "type": "ref.dim_plan"},
{"name": "engagement", "type": "metric.engagement"},
{"name": "spend", "type": "metric.spend"},
{"name": "engagement_band", "type": "bucket",
"labels": ["cold", "warm", "hot"]},
]},
],
events=[
{"name": "evt_session", "trigger": "proportional",
"driver": "engagement", "scale": 6.0,
"columns": [
{"name": "event_id", "type": "id"},
{"name": "date_key", "type": "ref.dim_date"},
{"name": "user_id", "type": "ref.dim_user"},
{"name": "event_ts", "type": "timestamp"},
]},
{"name": "evt_upgrade", "trigger": "threshold",
"metric": "engagement", "above": 0.6, "for_periods": 2,
"columns": [
{"name": "event_id", "type": "id"},
{"name": "date_key", "type": "ref.dim_date"},
{"name": "user_id", "type": "ref.dim_user"},
{"name": "upgrade_flag", "type": "flag"},
]},
],
)
tables = generate_tables(explicit, np.random.default_rng(explicit.seed))
for name, df in tables.items():
print(f" {name:<16} {len(df):>5} rows × {df.shape[1]} cols")
Inspecting each piece of the schema¶
A few peeks: the SCD dim, the bucket column, the reference dim, and the two event tables.
# 1. SCD dim — multiple rows per user as the tracked metric crosses thresholds.
scd = tables["dim_user"]
print(f"dim_user has {len(scd)} rows for {scd['user_id'].nunique()} users — "
f"the extra rows are SCD history.")
scd[scd["user_id"] == scd["user_id"].iloc[0]][
["user_id", "tier", "plan_tier", "valid_from", "valid_to"]
]
# 2. Bucket column — categorical bins of trajectory position.
band_counts = tables["fct_activity"]["engagement_band"].value_counts()
band_counts.plot(kind="bar", figsize=(6, 3),
title="engagement_band distribution across all rows")
plt.tight_layout(); plt.show()
# 3. Reference dim — static, no entity / no period axis.
print("dim_plan:")
print(tables["dim_plan"])
# 4. Event tables — the two triggers produce different shapes.
print(f"\nevt_session: {len(tables['evt_session'])} proportional rows "
f"(driver=engagement × scale 6.0)")
print(tables["evt_session"].head(3))
print(f"\nevt_upgrade: {len(tables['evt_upgrade'])} threshold-fired rows "
f"(engagement > 0.6 for 2 periods)")
print(tables["evt_upgrade"].head(3))
Where to next¶
- Bridges and advanced —
bridges_and_advanced.ipynbcovers M:N bridge tables, sub-entity dims (dim_user.count), andpool.{attr}in depth. - Quality —
data_quality.ipynbshows how to corrupt the explicit schema you just built (NULLs, duplicates, type mismatches) for pipeline-robustness testing. - Column types —
docs/site/column-types.mdcatalogs every column type and its required sub-fields.