Schema and dimensions¶

By default plotsim auto-generates a minimal star schema: a date dimension, a unit dimension, and one fact table per group of metrics. That's enough for most cases. When you need control over column names, references between tables, slowly-changing-dimension history, or events, you declare an explicit schema.

This notebook shows both, side by side, then walks through every column type the builder accepts.

Auto-schema¶

In [ ]:

Copied!





import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotsim import create, generate_tables

auto = create(
    about="Auto-schema demo",
    unit="user",
    window=("2024-01", "2024-12", "monthly"),
    metrics=[
        {"name": "logins",   "type": "count",  "polarity": "positive"},
        {"name": "spend",    "type": "amount", "polarity": "positive", "range": [0, 200]},
    ],
    segments=[
        {"name": "active",   "count": 10, "archetype": "growth",
         "attributes": {"tier": "active"}},
        {"name": "lapsed",   "count": 10, "archetype": "decline",
         "attributes": {"tier": "lapsed"}},
    ],
)
tables = generate_tables(auto, np.random.default_rng(auto.seed))
print("Tables generated:")
for name, df in tables.items():
    print(f"  {name:<14} {len(df):>4} rows × {df.shape[1]} cols — {list(df.columns)}")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotsim import create, generate_tables

auto = create(
    about="Auto-schema demo",
    unit="user",
    window=("2024-01", "2024-12", "monthly"),
    metrics=[
        {"name": "logins",   "type": "count",  "polarity": "positive"},
        {"name": "spend",    "type": "amount", "polarity": "positive", "range": [0, 200]},
    ],
    segments=[
        {"name": "active",   "count": 10, "archetype": "growth",
         "attributes": {"tier": "active"}},
        {"name": "lapsed",   "count": 10, "archetype": "decline",
         "attributes": {"tier": "lapsed"}},
    ],
)
tables = generate_tables(auto, np.random.default_rng(auto.seed))
print("Tables generated:")
for name, df in tables.items():
    print(f"  {name:<14} {len(df):>4} rows × {df.shape[1]} cols — {list(df.columns)}")

The auto-schema produced three tables: a per-period date dim, a per-entity user dim (with the segment-attribute tier baked in), and a single fact table holding both metrics. That's it.

Explicit schema — column type vocabulary¶

Inside dimensions / facts / events, every column has a type. The vocabulary is plain language:

Type	Where it pulls from
`id`	auto-generated primary key
`ref.<table>`	foreign key to a dim's primary key
`metric.<name>`	a declared metric's value at this row's period
`faker.<which>`	text from `faker` (e.g. `faker.company`, `faker.name`)
`static.<value>`	the same literal in every row
`segment.count`	the entity's segment size
`pool.<attr>`	the entity's segment-attribute value (M122)
`timestamp`	a datetime within the row's period
`flag`	a boolean derived from an event trigger
`date`, `int`, `string`, `float`	dim_date conventional dtypes
`bucket`	categorical label binned from trajectory position
`scd`	slowly-changing dimension — needs `tracks` / `tiers` / `at`

In [ ]:

Copied!





explicit = create(
    about="Explicit schema demo",
    unit="user",
    window=("2024-01", "2024-12", "monthly"),
    metrics=[
        {"name": "engagement", "type": "score",  "polarity": "positive"},
        {"name": "spend",      "type": "amount", "polarity": "positive", "range": [0, 200]},
    ],
    segments=[
        {"name": "active", "count": 10, "archetype": "growth",
         "attributes": {"tier": "active"}},
        {"name": "lapsed", "count": 10, "archetype": "decline",
         "attributes": {"tier": "lapsed"}},
    ],
    dimensions=[
        {"name": "dim_date", "per": "period", "columns": [
            {"name": "date_key", "type": "id"},
            {"name": "date",     "type": "date"},
            {"name": "year",     "type": "int"},
            {"name": "month",    "type": "int"},
        ]},
        {"name": "dim_user", "per": "unit", "columns": [
            {"name": "user_id",     "type": "id"},
            {"name": "user_name",   "type": "faker.name"},
            {"name": "tier",        "type": "pool.tier"},
            {"name": "cohort_size", "type": "segment.count"},
            {"name": "plan_tier",   "type": "scd",
             "tracks": "engagement", "tiers": ["free", "pro", "enterprise"],
             "at": [0.35, 0.7]},
        ]},
        {"name": "dim_plan", "reference": True, "columns": [
            {"name": "plan_id",   "type": "id"},
            {"name": "plan_name", "type": "static.standard"},
        ]},
    ],
    facts=[
        {"name": "fct_activity", "metrics": ["engagement", "spend"], "columns": [
            {"name": "date_key",   "type": "ref.dim_date"},
            {"name": "user_id",    "type": "ref.dim_user"},
            {"name": "plan_id",    "type": "ref.dim_plan"},
            {"name": "engagement", "type": "metric.engagement"},
            {"name": "spend",      "type": "metric.spend"},
            {"name": "engagement_band", "type": "bucket",
             "labels": ["cold", "warm", "hot"]},
        ]},
    ],
    events=[
        {"name": "evt_session", "trigger": "proportional",
         "driver": "engagement", "scale": 6.0,
         "columns": [
             {"name": "event_id", "type": "id"},
             {"name": "date_key", "type": "ref.dim_date"},
             {"name": "user_id",  "type": "ref.dim_user"},
             {"name": "event_ts", "type": "timestamp"},
         ]},
        {"name": "evt_upgrade", "trigger": "threshold",
         "metric": "engagement", "above": 0.6, "for_periods": 2,
         "columns": [
             {"name": "event_id",   "type": "id"},
             {"name": "date_key",   "type": "ref.dim_date"},
             {"name": "user_id",    "type": "ref.dim_user"},
             {"name": "upgrade_flag", "type": "flag"},
         ]},
    ],
)
tables = generate_tables(explicit, np.random.default_rng(explicit.seed))
for name, df in tables.items():
    print(f"  {name:<16} {len(df):>5} rows × {df.shape[1]} cols")
explicit = create(
    about="Explicit schema demo",
    unit="user",
    window=("2024-01", "2024-12", "monthly"),
    metrics=[
        {"name": "engagement", "type": "score",  "polarity": "positive"},
        {"name": "spend",      "type": "amount", "polarity": "positive", "range": [0, 200]},
    ],
    segments=[
        {"name": "active", "count": 10, "archetype": "growth",
         "attributes": {"tier": "active"}},
        {"name": "lapsed", "count": 10, "archetype": "decline",
         "attributes": {"tier": "lapsed"}},
    ],
    dimensions=[
        {"name": "dim_date", "per": "period", "columns": [
            {"name": "date_key", "type": "id"},
            {"name": "date",     "type": "date"},
            {"name": "year",     "type": "int"},
            {"name": "month",    "type": "int"},
        ]},
        {"name": "dim_user", "per": "unit", "columns": [
            {"name": "user_id",     "type": "id"},
            {"name": "user_name",   "type": "faker.name"},
            {"name": "tier",        "type": "pool.tier"},
            {"name": "cohort_size", "type": "segment.count"},
            {"name": "plan_tier",   "type": "scd",
             "tracks": "engagement", "tiers": ["free", "pro", "enterprise"],
             "at": [0.35, 0.7]},
        ]},
        {"name": "dim_plan", "reference": True, "columns": [
            {"name": "plan_id",   "type": "id"},
            {"name": "plan_name", "type": "static.standard"},
        ]},
    ],
    facts=[
        {"name": "fct_activity", "metrics": ["engagement", "spend"], "columns": [
            {"name": "date_key",   "type": "ref.dim_date"},
            {"name": "user_id",    "type": "ref.dim_user"},
            {"name": "plan_id",    "type": "ref.dim_plan"},
            {"name": "engagement", "type": "metric.engagement"},
            {"name": "spend",      "type": "metric.spend"},
            {"name": "engagement_band", "type": "bucket",
             "labels": ["cold", "warm", "hot"]},
        ]},
    ],
    events=[
        {"name": "evt_session", "trigger": "proportional",
         "driver": "engagement", "scale": 6.0,
         "columns": [
             {"name": "event_id", "type": "id"},
             {"name": "date_key", "type": "ref.dim_date"},
             {"name": "user_id",  "type": "ref.dim_user"},
             {"name": "event_ts", "type": "timestamp"},
         ]},
        {"name": "evt_upgrade", "trigger": "threshold",
         "metric": "engagement", "above": 0.6, "for_periods": 2,
         "columns": [
             {"name": "event_id",   "type": "id"},
             {"name": "date_key",   "type": "ref.dim_date"},
             {"name": "user_id",    "type": "ref.dim_user"},
             {"name": "upgrade_flag", "type": "flag"},
         ]},
    ],
)
tables = generate_tables(explicit, np.random.default_rng(explicit.seed))
for name, df in tables.items():
    print(f"  {name:<16} {len(df):>5} rows × {df.shape[1]} cols")

Inspecting each piece of the schema¶

A few peeks: the SCD dim, the bucket column, the reference dim, and the two event tables.

In [ ]:

Copied!





# 1. SCD dim — multiple rows per user as the tracked metric crosses thresholds.
scd = tables["dim_user"]
print(f"dim_user has {len(scd)} rows for {scd['user_id'].nunique()} users — "
      f"the extra rows are SCD history.")
scd[scd["user_id"] == scd["user_id"].iloc[0]][
    ["user_id", "tier", "plan_tier", "valid_from", "valid_to"]
]
# 1. SCD dim — multiple rows per user as the tracked metric crosses thresholds.
scd = tables["dim_user"]
print(f"dim_user has {len(scd)} rows for {scd['user_id'].nunique()} users — "
      f"the extra rows are SCD history.")
scd[scd["user_id"] == scd["user_id"].iloc[0]][
    ["user_id", "tier", "plan_tier", "valid_from", "valid_to"]
]

In [ ]:

Copied!





# 2. Bucket column — categorical bins of trajectory position.
band_counts = tables["fct_activity"]["engagement_band"].value_counts()
band_counts.plot(kind="bar", figsize=(6, 3),
                 title="engagement_band distribution across all rows")
plt.tight_layout(); plt.show()
# 2. Bucket column — categorical bins of trajectory position.
band_counts = tables["fct_activity"]["engagement_band"].value_counts()
band_counts.plot(kind="bar", figsize=(6, 3),
                 title="engagement_band distribution across all rows")
plt.tight_layout(); plt.show()

In [ ]:

Copied!





# 3. Reference dim — static, no entity / no period axis.
print("dim_plan:")
print(tables["dim_plan"])

# 4. Event tables — the two triggers produce different shapes.
print(f"\nevt_session: {len(tables['evt_session'])} proportional rows "
      f"(driver=engagement × scale 6.0)")
print(tables["evt_session"].head(3))

print(f"\nevt_upgrade: {len(tables['evt_upgrade'])} threshold-fired rows "
      f"(engagement > 0.6 for 2 periods)")
print(tables["evt_upgrade"].head(3))
# 3. Reference dim — static, no entity / no period axis.
print("dim_plan:")
print(tables["dim_plan"])

# 4. Event tables — the two triggers produce different shapes.
print(f"\nevt_session: {len(tables['evt_session'])} proportional rows "
      f"(driver=engagement × scale 6.0)")
print(tables["evt_session"].head(3))

print(f"\nevt_upgrade: {len(tables['evt_upgrade'])} threshold-fired rows "
      f"(engagement > 0.6 for 2 periods)")
print(tables["evt_upgrade"].head(3))

Where to next¶

Bridges and advanced — bridges_and_advanced.ipynb covers M:N bridge tables, sub-entity dims (dim_user.count), and pool.{attr} in depth.
Quality — data_quality.ipynb shows how to corrupt the explicit schema you just built (NULLs, duplicates, type mismatches) for pipeline-robustness testing.
Column types — docs/site/column-types.md catalogs every column type and its required sub-fields.