Skip to content

Schema guide

When to let plotsim auto-build your tables and when to take over. How to design dimensions, facts, events, bridges, and SCD columns.


Auto-schema vs explicit schema

Two ways to declare a config:

Auto-schema — the default

If you skip the dimensions / facts / events blocks entirely, plotsim emits a sensible default star schema:

  • dim_date — date spine for the configured window
  • dim_<unit> — one row per entity
  • fct_<metric> — one fact table per metric, with FKs to dim_<unit> and dim_date
about: Subscription customers
unit: customer
window: ["2024-01", "2024-12", "monthly"]

metrics:
  - { name: engagement, type: score,  polarity: positive }
  - { name: mrr,        type: amount, polarity: positive, range: [10, 5000] }

segments:
  - { name: growers,   count: 30, archetype: growth }
  - { name: decliners, count: 20, archetype: decline }

Run that and you get:

dim_date.csv
dim_customer.csv
fct_engagement.csv
fct_mrr.csv

No schema declaration needed. This is the fastest path from prompt to data.

Explicit schema

When you want named columns, multiple metrics in one fact table, custom event tables, or bridge tables, declare the schema yourself:

dimensions:
  - name: dim_customer
    columns:
      - { name: customer_id,   type: id }
      - { name: company_name,  type: faker.company }
      - { name: industry,      type: pool.industry }

facts:
  - name: fct_engagement
    metrics: [engagement, mrr]
    columns:
      - { name: customer_id, type: ref.dim_customer }
      - { name: date_key,    type: ref.dim_date }

events:
  - name: evt_login
    trigger: proportional
    driver: engagement
    scale: 5
    columns:
      - { name: customer_id, type: ref.dim_customer }
      - { name: timestamp,   type: timestamp }

dim_date is auto-generated even when you supply explicit dims; you don't have to declare it (though you can to add custom columns).

When you declare any schema block, plotsim uses only what you declared — there's no merging of auto-defaults into your declarations.


Designing dimensions

Dim tables hold static (or slowly-changing) attributes. There are three kinds:

Kind When Cardinality
Per-entity dim One row per entity count(entities)
Reference dim Lookup table (industries, plans, departments) Whatever you put in it
Sub-entity dim Multiple rows per entity (e.g. users per company) count(entities) × count

Per-entity dim

The default. One row per entity, keyed off the entity's PK.

dimensions:
  - name: dim_customer
    columns:
      - { name: customer_id,   type: id }
      - { name: company_name,  type: faker.company }
      - { name: signup_date,   type: faker.date_between }
      - { name: cohort_size,   type: segment.count }
      - { name: industry,      type: pool.industry }

pool.{attr} columns pull from per-segment attributes — see Column types.

Reference dim

Set reference: true for static lookup tables.

dimensions:
  - name: dim_plan
    reference: true
    columns:
      - { name: plan_id,    type: id }
      - { name: plan_name,  type: static.starter }
      - { name: monthly_fee, type: static.49 }

Reference dims aren't tied to entities or periods. They're typically small (3–10 rows) and exist to enable FK joins from facts.

Sub-entity dim

Set count: <int> for sub-entity expansion. Each entity produces count rows.

dimensions:
  - name: dim_user
    count: 5     # 5 users per customer
    columns:
      - { name: user_id,     type: id }
      - { name: customer_id, type: ref.dim_customer }
      - { name: full_name,   type: faker.name }

Useful for one-to-many entity hierarchies — companies with multiple users, departments with multiple employees, schools with multiple students.


Designing facts

Facts are the temporal heart of the dataset. One row per entity per period (or whatever grain you configure).

facts:
  - name: fct_engagement
    metrics: [engagement, mrr, support_tickets]
    columns:
      - { name: customer_id, type: ref.dim_customer }
      - { name: date_key,    type: ref.dim_date }

Two ways to put metrics on a fact table:

The metrics: array is the simplest path. Each name auto-adds a column with type: metric.<name> and the right dtype. The example above produces five columns: customer_id, date_key, engagement, mrr, support_tickets.

Explicit metric.X columns

When you want to control the column name independently of the metric name (or skip the metric in the schema even though it's declared), use metric.{name} in the columns array directly:

facts:
  - name: fct_revenue
    columns:
      - { name: customer_id,   type: ref.dim_customer }
      - { name: date_key,      type: ref.dim_date }
      - { name: monthly_revenue, type: metric.mrr }

Mix and match — you can use both shorthand (metrics: [engagement]) and explicit columns ({ name: foo, type: metric.mrr }) on the same fact table.


Designing events

Event tables have variable row counts per (entity, period). Two trigger types:

Proportional events

Number of rows per entity per period scales with a metric value: rows = metric_value × scale.

events:
  - name: evt_login
    trigger: proportional
    driver: engagement
    scale: 5     # peak ~5 logins per period at engagement = 1.0
    columns:
      - { name: customer_id, type: ref.dim_customer }
      - { name: timestamp,   type: timestamp }

Use for activity that recurs whenever the entity is "active" — logins, transactions, page views.

Threshold events

A row fires when a metric crosses a threshold and stays past it for N consecutive periods.

events:
  - name: evt_churn
    trigger: threshold
    metric: churn_risk
    above: 0.8
    for_periods: 2
    columns:
      - { name: customer_id, type: ref.dim_customer }
      - { name: date_key,    type: ref.dim_date }
      - { name: churn_flag,  type: flag }

Use for state-change moments — churn, conversion, escalation, qualification. The flag column reads True when the threshold fires.

above and below are mutually exclusive; pick one.


Designing bridges

Bridge tables capture many-to-many relationships between two dim tables.

bridges:
  - name: customer_subscription
    left: dim_customer
    right: dim_subscription
    cardinality: [1, 3]
    driver: mrr
    columns:
      - { name: weight, type: metric.mrr }
Field Notes
left / right Both must be dim tables (auto or declared); must differ
cardinality [min, max] second-dim entries per left entity
driver Optional metric — biases sampling toward trajectory position
columns Up to 20 bridge-row columns (metric.{name}, static.{value}, faker.{kind} only)

Bridges produce one row per (left_entity, right_entity) association. With cardinality: [1, 3] and 100 left entities, expect roughly 200 bridge rows.

When driver is set, entities with high trajectory position get more associations than entities with low position. When omitted, sampling is uniform.


SCD Type 2 columns

For dim columns whose value changes as an entity's behavior crosses configured thresholds, use the scd column type:

dimensions:
  - name: dim_customer
    columns:
      - { name: customer_id, type: id }
      - name: plan_tier
        type: scd
        tracks: mrr
        tiers: [free, starter, pro, enterprise]
        at: [0.0, 0.25, 0.50, 0.75]

The dim is expanded into multiple rows per entity — one for each band crossing — with valid_from / valid_to semantics. Fact rows automatically reference the active version through a dim_row_id column the engine appends.

Use SCD when:

  • The attribute genuinely changes over time (plan tier, role, status)
  • Downstream queries need point-in-time joins
  • You want history, not just the current state

For static attributes that don't change (industry, signup date), use a plain column. SCD adds row count and join complexity — only opt in when you need the history.


What plotsim won't let you do

A handful of structural rules the builder enforces at construction:

  • Self-bridgesleft == right is rejected. To model intra-dim relationships (manager / report, parent / child), use a sub-entity dim instead.
  • pool.{attr} on facts/events/reference dims — pool columns are only valid on per-entity dim columns. Per-segment attributes are entity-level facts, not row-level facts.
  • Bare dtype words on non-dim_date tablestype: int is only valid on dim_date. Other tables need a source-bearing type (metric.X, faker.X, static.X, ref.X, etc.).
  • flag on proportional events — flags are only valid on threshold events; proportional events don't have a threshold to flag.