# Manifest Reference
The manifest is a JSON sidecar (`manifest.json`) written next to the generated tables. It captures the signal layer — archetype labels, trajectory positions, event firings, SCD transitions, bridge associations, and reproducibility metadata — that a downstream ML pipeline can use as ground truth without re-deriving it from noisy cell values.

See `build_manifest` for the programmatic builder. The companion docs are `config-reference.md` (the `manifest` config block) and `api-reference.md`.
## When the manifest is written

`write_tables` writes `manifest.json` when both:

- `config.manifest.include` is True (the default), and
- a `manifest` argument was passed to `write_tables`.

The function-level CLI does this automatically. Programmatic callers must build the manifest first via `build_manifest(...)` and pass it through:
```python
from plotsim import generate_tables_with_state, build_manifest, write_tables

tables, state = generate_tables_with_state(cfg)
manifest = build_manifest(
    cfg, state.trajectories, tables,
    scd_state=state.scd, bridge_state=state.bridges,
)
write_tables(tables, cfg, manifest=manifest)
```
The JSON serialization is byte-deterministic: the same (config, seed) pair produces a byte-identical `manifest.json`. Encoding is UTF-8 with `sort_keys=True`, `indent=2`, and a trailing newline.
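A minimal sketch (not the library's actual writer) of how those serialization settings make key insertion order irrelevant to the bytes on disk:

```python
import json

def serialize(manifest: dict) -> bytes:
    # Trailing newline and UTF-8 encoding match the documented format.
    return (json.dumps(manifest, sort_keys=True, indent=2) + "\n").encode("utf-8")

a = serialize({"seed": 42, "schema_version": "1.0"})
b = serialize({"schema_version": "1.0", "seed": 42})
assert a == b  # byte-identical regardless of construction order
```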
## Top-level fields

```json
{
  "schema_version": "1.0",
  "seed": 42,
  "config_sha256": "<64-char hex>",
  "archetype_assignments": [...],
  "trajectory_samples": [...],
  "event_firings": [...],
  "scd_events": [...],
  "bridge_associations": [...],
  "quality_injections": [...],
  "holdout": {...} | null,
  "correlation_adjustments": [...] | null,
  "correlation_compensations": [...] | null,
  "bypass_fallback_counts": {...} | null,
  "vectorized_threshold_used": 50 | null
}
```
| Field | Type | Description |
|---|---|---|
| `schema_version` | str | Wire-shape version. Currently `"1.0"`. Bumped when the manifest gains a non-additive change |
| `seed` | int | The seed used for generation — `config.seed` |
| `config_sha256` | str | Full SHA-256 hex of the JSON-serialized config. Detects config drift between generation and consumption |
| `archetype_assignments` | array | One entry per entity; see below |
| `trajectory_samples` | array | Per-period position cells for a sampled subset of entities |
| `event_firings` | array | Which periods each entity fired in for each event table |
| `scd_events` | array | SCD Type 2 band crossings (empty when no SCD columns are configured) |
| `bridge_associations` | array | Per-bridge M:N association ground truth (empty when no bridges are configured) |
| `quality_injections` | array | Per-issue ground truth — corrupted rows and clean values (empty when `quality.quality_issues` is empty) |
| `holdout` | object or null | Train/holdout split metadata. `null` when `holdout.enabled` is False |
| `correlation_adjustments` | array or null | Higham nearest-PD projections. `null` when the user matrix was already PD |
| `correlation_compensations` | array or null | Trajectory-aware compensation records. `null` when `compensate_correlations` is False or the metric cap was exceeded |
| `bypass_fallback_counts` | object or null | Per-archetype count of cells that fell back to the scalar copula path. `null` in serial mode |
| `vectorized_threshold_used` | int or null | The auto-mode entity-count threshold at generation time. `null` for pre-M121b manifests on disk |
## archetype_assignments
One ground-truth label per entity — the archetype the engine drove their trajectory from.
```json
{
  "archetype_assignments": [
    { "entity": "growers_001", "archetype": "growth" },
    { "entity": "decliners_002", "archetype": "decline" }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `entity` | str | Entity name (matches `config.entities[i].name`) |
| `archetype` | str | Archetype name (matches `config.archetypes[i].name`) |
Sorted by entity for a stable diff under the same config.

Use case — train a classifier on the fact-table aggregates and score it against these labels. The archetype is the latent class label your model is trying to recover; this list is the answer key.
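For instance, scoring model output against the answer key; the entity names, archetypes, and predictions here are illustrative:

```python
# Hypothetical manifest fragment; real entries come from manifest.json.
assignments = [
    {"entity": "growers_001", "archetype": "growth"},
    {"entity": "decliners_002", "archetype": "decline"},
]
truth = {a["entity"]: a["archetype"] for a in assignments}

# Hypothetical classifier output, keyed by entity name.
predictions = {"growers_001": "growth", "decliners_002": "growth"}

accuracy = sum(truth[e] == p for e, p in predictions.items()) / len(truth)
assert accuracy == 0.5  # one of two labels recovered
```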
## trajectory_samples
Per-period trajectory positions for a sampled subset of entities.
```json
{
  "trajectory_samples": [
    { "entity": "growers_001", "period_index": 0, "position": 0.05 },
    { "entity": "growers_001", "period_index": 1, "position": 0.08 },
    { "entity": "growers_001", "period_index": 2, "position": 0.13 }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `entity` | str | Entity name |
| `period_index` | int | Zero-based period index. 0 is the first period of `time_window` |
| `position` | float | Position in [0, 1] |
Position is the noise-free, distribution-free behavioral state the engine derived every metric from. It's not present in the fact table — the fact table holds realized values shaped by polarity, distribution, correlation, and noise.
Sampled subset — controlled by `config.manifest.trajectory_sample_rate` (default 1.0, meaning every entity). The selection is the first `ceil(n_entities × sample_rate)` entities under sorted-name order, so it stays stable regardless of seed. Set this below 1.0 for very large configs where the per-period tape would dominate manifest size.
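The selection rule can be sketched as follows (a re-derivation of the documented behavior, not the library's internal code):

```python
import math

def sampled_entities(names, sample_rate):
    # First ceil(n_entities * sample_rate) names under sorted-name order,
    # so the subset is stable across seeds.
    n = math.ceil(len(names) * sample_rate)
    return sorted(names)[:n]

assert sampled_entities(["c", "a", "d", "b"], 0.5) == ["a", "b"]
assert sampled_entities(["c", "a"], 1.0) == ["a", "c"]
```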
Use case — verify the trajectory-first invariant from the manifest: combine with `trace_metric_cell` to confirm position → realized cell for any entity in the sample.
## event_firings
For each event table, which periods each entity fired in.
```json
{
  "event_firings": [
    {
      "entity": "growers_001",
      "table": "evt_login",
      "period_indices": [0, 1, 2, 3, 5, 7]
    }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `entity` | str | Entity name |
| `table` | str | Event-table name |
| `period_indices` | array of int | Sorted ascending; the periods this entity contributed at least one row in |
Empty `period_indices` are kept rather than omitted, so a downstream consumer can iterate the full entity × event-table matrix without fallback logic.
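A sketch of iterating that dense matrix; the entities and table name are illustrative:

```python
# Hypothetical manifest["event_firings"] fragment; note the empty list
# is present rather than omitted.
firings = [
    {"entity": "growers_001", "table": "evt_login", "period_indices": [0, 1, 3]},
    {"entity": "decliners_002", "table": "evt_login", "period_indices": []},
]

# Dense (entity, table) -> set-of-periods lookup; no fallback logic needed.
fired = {(f["entity"], f["table"]): set(f["period_indices"]) for f in firings}

assert fired[("decliners_002", "evt_login")] == set()
assert 3 in fired[("growers_001", "evt_login")]
```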
Both threshold and proportional events surface here. The manifest records observed firings, not the configured triggers.
## scd_events
SCD Type 2 band crossings — emitted only for transitions, not the initial band.
```json
{
  "scd_events": [
    {
      "dim_table": "dim_customer",
      "entity": "growers_001",
      "period_index": 5,
      "old_label": "starter",
      "new_label": "pro",
      "old_dim_row_id": 12,
      "new_dim_row_id": 13,
      "trigger_metric": "fct_engagement.mrr",
      "trigger_position": 0.52
    }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `dim_table` | str | The dim table the SCD column lives on |
| `entity` | str | Entity name |
| `period_index` | int | Period the crossing happened in |
| `old_label` | str | Band the entity was in before |
| `new_label` | str | Band the entity advanced to |
| `old_dim_row_id` | int | Surrogate row ID of the closing version |
| `new_dim_row_id` | int | Surrogate row ID of the opening version |
| `trigger_metric` | str | The metric whose threshold was crossed (`<fact_table>.<metric>`) |
| `trigger_position` | float | Trajectory position at the crossing period |
Empty when no `scd` columns are configured. Sorted by dim table, then entity, then period for stable ordering.

Use case — join against `trajectory_samples` to recover the exact position that triggered each band change.
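A sketch of that join, on hypothetical fragments of the two arrays:

```python
scd_events = [
    {"entity": "growers_001", "period_index": 5, "new_label": "pro"},
]
trajectory_samples = [
    {"entity": "growers_001", "period_index": 5, "position": 0.52},
]

pos = {(s["entity"], s["period_index"]): s["position"] for s in trajectory_samples}

# Entities outside the sampled subset simply won't join.
crossings = [
    (e["entity"], e["new_label"], pos[(e["entity"], e["period_index"])])
    for e in scd_events
    if (e["entity"], e["period_index"]) in pos
]
assert crossings == [("growers_001", "pro", 0.52)]
```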
## bridge_associations
Many-to-many associations recorded as ground truth.
```json
{
  "bridge_associations": [
    {
      "bridge": "customer_subscription",
      "entity": "growers_001",
      "targets": ["sub_007", "sub_023", "sub_041"],
      "cardinality": 3
    }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `bridge` | str | Bridge-table name |
| `entity` | str | First-dim entity name |
| `targets` | array | Second-dim FK values (PKs for non-SCD dims; `dim_row_id` for SCD dims) |
| `cardinality` | int | `len(targets)`. Surfaced separately so consumers can aggregate without iterating each tuple |
Empty when no bridges are configured. Sorted by bridge name, then
entity name.
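The separate cardinality field lets a consumer aggregate without touching the targets tuples; for example, on hypothetical data:

```python
from collections import Counter

associations = [
    {"bridge": "customer_subscription", "entity": "growers_001", "cardinality": 3},
    {"bridge": "customer_subscription", "entity": "decliners_002", "cardinality": 1},
    {"bridge": "customer_subscription", "entity": "stables_003", "cardinality": 3},
]

# Distribution of association counts, without iterating any targets list.
dist = Counter(a["cardinality"] for a in associations)
assert dist == {3: 2, 1: 1}
```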
## quality_injections
Ground truth for post-generation data corruption.
```json
{
  "quality_injections": [
    {
      "issue_index": 0,
      "issue_type": "null_injection",
      "table": "fct_engagement",
      "column": "engagement",
      "row_indices": [3, 17, 42],
      "clean_values": [0.42, 0.71, 0.18]
    },
    {
      "issue_index": 1,
      "issue_type": "duplicate_rows",
      "table": "fct_engagement",
      "column": "_rows",
      "row_indices": [8, 19],
      "clean_values": []
    }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `issue_index` | int | Position in `config.quality.quality_issues` — distinguishes multiple issues |
| `issue_type` | str | `null_injection`, `duplicate_rows`, `type_mismatch`, `late_arrival`, or `schema_drift` |
| `table` | str | Target table |
| `column` | str | Target column. For row-level issues this is a sentinel — `_rows` for duplicates, `_arrival_period` for late arrivals |
| `row_indices` | array of int | Row positions in the corrupted DataFrame — the rows that were affected |
| `clean_values` | array | Original values at those rows. Empty for `duplicate_rows` and `late_arrival` (the corruption is row-level, not per-cell) |
Empty when `config.quality.quality_issues` is empty.
Use case — recover the clean dataset from the corrupted output without re-running generation, or train a model that explicitly handles the corruption pattern.
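A pure-Python sketch of the clean-dataset recovery for one `null_injection` issue; the column values here are hypothetical, and the same write-back works positionally on a DataFrame column:

```python
# Corrupted column as read from the output, plus the matching injection record.
column = [0.9, None, 0.4, None]
injection = {
    "issue_type": "null_injection",
    "row_indices": [1, 3],
    "clean_values": [0.7, 0.2],
}

# Write the recorded clean values back over the corrupted row positions.
restored = list(column)
for row, value in zip(injection["row_indices"], injection["clean_values"]):
    restored[row] = value

assert restored == [0.9, 0.7, 0.4, 0.2]
```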
## holdout

Train/holdout split metadata. Present only when `config.holdout.enabled` is True; `null` otherwise.
| Field | Type | Description |
|---|---|---|
| `target_metric` | str | Mirror of `config.holdout.target` |
| `holdout_periods` | int | Mirror of `config.holdout.periods` |
| `cutoff_period_index` | int | The resolved boundary — `n_periods - holdout_periods`. Periods `[0, cutoff)` are training; `[cutoff, n_periods)` are holdout |
Use case — slice an unsplit fact table or its derivative on the same axis without recomputing `period_count` from `time_window`.
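A sketch of slicing per-period data on the recorded boundary; the period count and holdout block are illustrative:

```python
# Hypothetical manifest["holdout"] block.
holdout = {"target_metric": "mrr", "holdout_periods": 3, "cutoff_period_index": 9}

periods = list(range(12))  # n_periods = 12 in this sketch
cutoff = holdout["cutoff_period_index"]
train = [p for p in periods if p < cutoff]
held_out = [p for p in periods if p >= cutoff]

assert len(held_out) == holdout["holdout_periods"]
```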
## correlation_adjustments
Pairs whose configured correlation was projected to a nearby PD value because the user-declared matrix wasn't positive semi-definite.
```json
{
  "correlation_adjustments": [
    {
      "metric_a": "engagement",
      "metric_b": "support_tickets",
      "requested": -0.75,
      "achieved": -0.68,
      "adjustment": 0.07
    }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `metric_a` / `metric_b` | str | The pair |
| `requested` | float | Coefficient declared in the config |
| `achieved` | float | Value at the same `(i, j)` cell after Higham projection |
| `adjustment` | float | `abs(requested - achieved)` |
`null` when the user-declared matrix was already PD (the common case) or when no correlations were configured. Pairs whose adjustment falls below the numerical noise floor (~1e-12) are dropped, so an empty array distinguishes "all pairs were tolerance-clean" from `null` ("no projection needed").
Use case — flag configs whose declared correlations couldn't be delivered exactly, and decide whether to relax the matrix or accept the projected value.
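A sketch of that check; the tolerance value is an application-side assumption, not part of the library:

```python
# Hypothetical adjustments array (None would mean no projection was needed).
adjustments = [
    {"metric_a": "engagement", "metric_b": "support_tickets", "adjustment": 0.07},
    {"metric_a": "engagement", "metric_b": "mrr", "adjustment": 0.001},
]

TOLERANCE = 0.05  # assumed application-specific threshold
flagged = [
    (a["metric_a"], a["metric_b"])
    for a in adjustments
    if a["adjustment"] > TOLERANCE
]
assert flagged == [("engagement", "support_tickets")]
```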
## correlation_compensations

Pairs the engine pre-compensated for trajectory-induced covariance — recorded only when `compensate_correlations` is True.
```json
{
  "correlation_compensations": [
    {
      "metric_a": "engagement",
      "metric_b": "mrr",
      "user_target": 0.55,
      "trajectory_contribution": 0.32,
      "compensated_target": 0.23,
      "achievable": 0.23,
      "infeasible": false,
      "adjustment": 0.32
    }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `metric_a` / `metric_b` | str | The pair |
| `user_target` | float | Coefficient declared in the config's `connections` block |
| `trajectory_contribution` | float | Within-archetype-weighted Pearson the trajectory's centers induce, in [-1, 1] |
| `compensated_target` | float | Pre-clamp `user_target - trajectory_contribution`. May fall outside [-1, 1] |
| `achievable` | float | `compensated_target` clamped to [-1, 1]. The value the copula actually targets |
| `infeasible` | bool | True when `compensated_target` fell outside [-1, 1]. The realized table-wide Pearson lands at `user_target` ± something < \|user_target\| for these |
| `adjustment` | float | `abs(user_target - achievable)` |
`null` when:

- `compensate_correlations` is False, or
- the config has no `correlations`/`connections` block, or
- the metric count exceeded the cap (20) and the engine fell back to the direct-copula path.
Distinct from `correlation_adjustments`: that records "your matrix wasn't PD, we projected"; this records "your target was compensated for the trajectory's structural contribution before reaching the copula." Both can populate on a single run.
Use case — sort by `adjustment` to find pairs whose realized correlation drifts most from the configured target. Pairs flagged `infeasible: true` can never reach the user target on the current config — relax the trajectory mix or lower the magnitude.
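A sketch on a hypothetical compensations array:

```python
comps = [
    {"metric_a": "engagement", "metric_b": "mrr",
     "adjustment": 0.32, "infeasible": False},
    {"metric_a": "engagement", "metric_b": "churn",
     "adjustment": 0.05, "infeasible": True},
]

# Largest drift first; infeasible pairs can never hit the user target.
worst_first = sorted(comps, key=lambda c: c["adjustment"], reverse=True)
infeasible = [(c["metric_a"], c["metric_b"]) for c in comps if c["infeasible"]]

assert worst_first[0]["adjustment"] == 0.32
assert infeasible == [("engagement", "churn")]
```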
## bypass_fallback_counts
Per-archetype count of cells that triggered the per-row scalar fallback in vectorized generation mode.
| Form | Meaning |
|---|---|
| `null` | Serial mode — bypass was never measured |
| `{}` | Vectorized ran with zero bypass cells (the production-shape case) |
| `{name: count, ...}` | Vectorized hit the scalar fallback for `count` cells under archetype `name` |
A non-zero count means vectorized mode wasn't fully effective for that archetype on this config. Surfaces "vectorized isn't faster than serial here" investigations directly.
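The three forms can be told apart with a small check (a sketch, not a library helper):

```python
def describe_bypass(counts):
    if counts is None:
        return "serial"            # bypass was never measured
    if not counts:
        return "fully-vectorized"  # zero bypass cells
    return "partial-fallback"      # some cells hit the scalar path

assert describe_bypass(None) == "serial"
assert describe_bypass({}) == "fully-vectorized"
assert describe_bypass({"growth": 12}) == "partial-fallback"
```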
## vectorized_threshold_used
The value of the auto-mode entity-count threshold at generation time.
| Form | Meaning |
|---|---|
| int | Recorded threshold (currently 50) |
| `null` | Pre-M121b manifest on disk |
Recorded so old manifests stay reproducible if the constant changes in a later release — comparing this against the current threshold lets a consumer detect that a re-run would land in a different `generation_mode`.
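A sketch of that comparison; `CURRENT_THRESHOLD` and `mode_would_change` are illustrative names standing in for whatever the installed release uses:

```python
CURRENT_THRESHOLD = 50  # assumption: the installed release's auto-mode constant

def mode_would_change(manifest: dict, n_entities: int) -> bool:
    recorded = manifest.get("vectorized_threshold_used")
    if recorded is None:  # pre-M121b manifest: nothing to compare against
        return False
    # Auto mode flips when the entity count sits between the two thresholds.
    return (n_entities >= recorded) != (n_entities >= CURRENT_THRESHOLD)

assert mode_would_change({"vectorized_threshold_used": 100}, 75) is True
assert mode_would_change({"vectorized_threshold_used": 50}, 75) is False
```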
## Reading the manifest in Python

```python
import json
from pathlib import Path

manifest = json.loads(Path("output/manifest.json").read_text())

# Build the entity → archetype lookup
labels = {a["entity"]: a["archetype"] for a in manifest["archetype_assignments"]}

# Reconstruct an entity's trajectory tape
positions = sorted(
    (s["period_index"], s["position"])
    for s in manifest["trajectory_samples"]
    if s["entity"] == "growers_001"
)

# Detect quality corruption on a column
nullified_rows = [
    inj["row_indices"]
    for inj in manifest["quality_injections"]
    if inj["issue_type"] == "null_injection"
    and inj["table"] == "fct_engagement"
    and inj["column"] == "engagement"
]
```
pydantic users can validate the on-disk JSON against the typed manifest model directly:

```python
from pathlib import Path

from plotsim import ManifestSchema

manifest = ManifestSchema.model_validate_json(Path("output/manifest.json").read_text())
```
The model has `extra="forbid"`, so a malformed or out-of-version manifest fails loudly during validation rather than silently dropping unknown fields.