
Manifest Reference

The manifest is a JSON sidecar (manifest.json) written next to the generated tables. It captures the signal layer — archetype labels, trajectory positions, event firings, SCD transitions, bridge associations, and reproducibility metadata — that a downstream ML pipeline can use as ground truth without re-deriving it from noisy cell values.

See build_manifest for the programmatic builder. The companion docs are config-reference.md (the manifest config block) and api-reference.md.


When the manifest is written

write_tables writes manifest.json when both:

  1. config.manifest.include is True (default), and
  2. A manifest argument was passed to write_tables.

The function-level CLI does this automatically. Programmatic callers must build the manifest first via build_manifest(...) and pass it through:

from plotsim import generate_tables_with_state, build_manifest, write_tables

tables, state = generate_tables_with_state(cfg)
manifest = build_manifest(
    cfg, state.trajectories, tables,
    scd_state=state.scd, bridge_state=state.bridges,
)
write_tables(tables, cfg, manifest=manifest)

The JSON serialization is byte-deterministic — same (config, seed) produces a byte-identical manifest.json. Encoding: UTF-8, sort_keys=True, indent=2, trailing newline.
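The stated encoding rules can be reproduced outside the library. A minimal sketch (the real writer lives inside write_tables; dump_manifest here is illustrative only):

```python
import json

def dump_manifest(manifest: dict) -> bytes:
    # sort_keys + fixed indent + trailing newline => byte-identical output
    # for the same logical content, regardless of in-memory key order
    return (json.dumps(manifest, sort_keys=True, indent=2) + "\n").encode("utf-8")

a = dump_manifest({"seed": 42, "schema_version": "1.0"})
b = dump_manifest({"schema_version": "1.0", "seed": 42})
assert a == b  # insertion order does not affect the bytes
```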


Top-level fields

{
  "schema_version": "1.0",
  "seed": 42,
  "config_sha256": "<64-char hex>",
  "archetype_assignments": [...],
  "trajectory_samples": [...],
  "event_firings": [...],
  "scd_events": [...],
  "bridge_associations": [...],
  "quality_injections": [...],
  "holdout": {...} | null,
  "correlation_adjustments": [...] | null,
  "correlation_compensations": [...] | null,
  "bypass_fallback_counts": {...} | null,
  "vectorized_threshold_used": 50 | null
}

| Field | Type | Description |
| --- | --- | --- |
| schema_version | str | Wire-shape version. Currently "1.0". Bumped when the manifest gains a non-additive change |
| seed | int | The seed used for generation — config.seed |
| config_sha256 | str | Full SHA-256 hex of the JSON-serialized config. Detects config drift between generation and consumption |
| archetype_assignments | array | One entry per entity; see below |
| trajectory_samples | array | Per-period position cells for a sampled subset of entities |
| event_firings | array | Which periods each entity fired in for each event table |
| scd_events | array | SCD Type 2 band crossings (empty when no SCD columns are configured) |
| bridge_associations | array | Per-bridge M:N association ground truth (empty when no bridges are configured) |
| quality_injections | array | Per-issue ground truth — corrupted rows and clean values (empty when quality.quality_issues is empty) |
| holdout | object or null | Train/holdout split metadata. null when holdout.enabled is False |
| correlation_adjustments | array or null | Higham nearest-PD projections. null when the user matrix was already PD |
| correlation_compensations | array or null | Trajectory-aware compensation records. null when compensate_correlations is False or the metric cap was exceeded |
| bypass_fallback_counts | object or null | Per-archetype count of cells that fell back to the scalar copula path. null in serial mode |
| vectorized_threshold_used | int or null | The auto-mode entity-count threshold at generation time. null for pre-M121b manifests on disk |

archetype_assignments

One ground-truth label per entity — the archetype the engine drove their trajectory from.

{
  "archetype_assignments": [
    { "entity": "growers_001",   "archetype": "growth" },
    { "entity": "decliners_002", "archetype": "decline" }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| entity | str | Entity name (matches config.entities[i].name) |
| archetype | str | Archetype name (matches config.archetypes[i].name) |

Sorted by entity so diffs stay stable under the same config.

Use case — train a classifier on the fact-table aggregates and score it against this column. The archetype is the latent class label your model is trying to recover; this list is the answer key.


trajectory_samples

Per-period trajectory positions for a sampled subset of entities.

{
  "trajectory_samples": [
    { "entity": "growers_001", "period_index": 0,  "position": 0.05 },
    { "entity": "growers_001", "period_index": 1,  "position": 0.08 },
    { "entity": "growers_001", "period_index": 2,  "position": 0.13 }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| entity | str | Entity name |
| period_index | int | Zero-based period index. 0 is the first period of time_window |
| position | float | Position in [0, 1] |

Position is the noise-free, distribution-free behavioral state the engine derived every metric from. It's not present in the fact table — the fact table holds realized values shaped by polarity, distribution, correlation, and noise.

Sampled subset — controlled by config.manifest.trajectory_sample_rate (default 1.0, meaning every entity). The selection is the first ceil(n_entities × sample_rate) entities under sorted-name order, so it stays stable regardless of seed. Set this below 1.0 for very large configs where the per-period tape would dominate manifest size.
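The selection rule is simple enough to reproduce outside the engine. A sketch, assuming only the entity names and the configured rate as inputs:

```python
import math

def sampled_entities(names: list[str], sample_rate: float) -> list[str]:
    # First ceil(n * rate) names under sorted-name order, independent of seed
    k = math.ceil(len(names) * sample_rate)
    return sorted(names)[:k]

sampled_entities(["growers_001", "steady_003", "decliners_002"], 0.5)
# -> ["decliners_002", "growers_001"]
```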

Use case — verify the trajectory-first invariant from the manifest: combine with trace_metric_cell to confirm position → realized cell for any entity in the sample.


event_firings

For each event table, which periods each entity fired in.

{
  "event_firings": [
    {
      "entity": "growers_001",
      "table": "evt_login",
      "period_indices": [0, 1, 2, 3, 5, 7]
    }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| entity | str | Entity name |
| table | str | Event-table name |
| period_indices | array of int | Sorted ascending; the periods this entity contributed at least one row in |

Empty period_indices are kept rather than omitted, so a downstream consumer can iterate the full entity × event-table matrix without fallback logic.

Both threshold and proportional events surface here. The manifest records observed firings, not the configured triggers.
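Because empty entries are kept, the full entity × event-table matrix can be pivoted straight from this list. A minimal sketch over the structure shown above:

```python
def firing_matrix(event_firings: list[dict]) -> dict:
    # (entity, table) -> sorted period indices; an empty list means
    # "never fired", not "missing from the manifest"
    return {(f["entity"], f["table"]): f["period_indices"] for f in event_firings}

firings = [
    {"entity": "growers_001", "table": "evt_login", "period_indices": [0, 1, 2]},
    {"entity": "growers_001", "table": "evt_churn", "period_indices": []},
]
matrix = firing_matrix(firings)
matrix[("growers_001", "evt_churn")]  # -> []
```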


scd_events

SCD Type 2 band crossings — emitted only for transitions, not the initial band.

{
  "scd_events": [
    {
      "dim_table": "dim_customer",
      "entity": "growers_001",
      "period_index": 5,
      "old_label": "starter",
      "new_label": "pro",
      "old_dim_row_id": 12,
      "new_dim_row_id": 13,
      "trigger_metric": "fct_engagement.mrr",
      "trigger_position": 0.52
    }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| dim_table | str | The dim table the SCD column lives on |
| entity | str | Entity name |
| period_index | int | Period in which the crossing happened |
| old_label | str | Band the entity was in before |
| new_label | str | Band the entity advanced to |
| old_dim_row_id | int | Surrogate row ID of the closing version |
| new_dim_row_id | int | Surrogate row ID of the opening version |
| trigger_metric | str | The metric whose threshold was crossed (<fact_table>.<metric>) |
| trigger_position | float | Trajectory position at the crossing period |

Empty when no scd columns are configured. Sorted by dim table, then entity, then period for stable ordering.

Use case — join against trajectory_samples to recover the exact position that triggered each band change.
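The join suggested above can be sketched directly over the two arrays (field names as documented; falling back to the recorded trigger_position covers entities outside the trajectory sample):

```python
def crossings_with_positions(scd_events: list[dict], trajectory_samples: list[dict]) -> list[dict]:
    # Index the sampled tape by (entity, period), then attach a position to
    # each band crossing; use the recorded trigger_position when the entity
    # was not in the sampled subset
    tape = {(s["entity"], s["period_index"]): s["position"] for s in trajectory_samples}
    return [
        dict(e, position=tape.get((e["entity"], e["period_index"]), e["trigger_position"]))
        for e in scd_events
    ]

events = [{"entity": "growers_001", "period_index": 5, "old_label": "starter",
           "new_label": "pro", "trigger_position": 0.52}]
samples = [{"entity": "growers_001", "period_index": 5, "position": 0.52}]
crossings_with_positions(events, samples)[0]["position"]  # -> 0.52
```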


bridge_associations

Many-to-many associations recorded as ground truth.

{
  "bridge_associations": [
    {
      "bridge": "customer_subscription",
      "entity": "growers_001",
      "targets": ["sub_007", "sub_023", "sub_041"],
      "cardinality": 3
    }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| bridge | str | Bridge-table name |
| entity | str | First-dim entity name |
| targets | array | Second-dim FK values (PKs for non-SCD dims; dim_row_id for SCD dims) |
| cardinality | int | len(targets). Surfaced separately so consumers can aggregate without iterating each tuple |

Empty when no bridges are configured. Sorted by bridge name, then entity name.
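Since cardinality is surfaced per tuple, the M:N shape of a bridge can be summarized in one pass. A sketch:

```python
from collections import Counter

def cardinality_histogram(bridge_associations: list[dict], bridge: str) -> Counter:
    # Distribution of targets-per-entity for one bridge table
    return Counter(a["cardinality"] for a in bridge_associations if a["bridge"] == bridge)

assocs = [
    {"bridge": "customer_subscription", "entity": "growers_001",
     "targets": ["sub_007", "sub_023", "sub_041"], "cardinality": 3},
    {"bridge": "customer_subscription", "entity": "decliners_002",
     "targets": ["sub_007"], "cardinality": 1},
]
cardinality_histogram(assocs, "customer_subscription")  # -> Counter({3: 1, 1: 1})
```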


quality_injections

Ground truth for post-generation data corruption.

{
  "quality_injections": [
    {
      "issue_index": 0,
      "issue_type": "null_injection",
      "table": "fct_engagement",
      "column": "engagement",
      "row_indices": [3, 17, 42],
      "clean_values": [0.42, 0.71, 0.18]
    },
    {
      "issue_index": 1,
      "issue_type": "duplicate_rows",
      "table": "fct_engagement",
      "column": "_rows",
      "row_indices": [8, 19],
      "clean_values": []
    }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| issue_index | int | Position in config.quality.quality_issues — distinguishes multiple issues |
| issue_type | str | null_injection, duplicate_rows, type_mismatch, late_arrival, or schema_drift |
| table | str | Target table |
| column | str | Target column. For row-level issues this is a sentinel — _rows for duplicates, _arrival_period for late arrivals |
| row_indices | array of int | Row positions in the corrupted DataFrame — the rows that were affected |
| clean_values | array | Original values at those rows. Empty for duplicate_rows and late_arrival (the corruption is row-level, not per-cell) |

Empty when config.quality.quality_issues is empty.

Use case — recover the clean dataset from the corrupted output without re-running generation, or train a model that explicitly handles the corruption pattern.
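Recovering clean cell values can be sketched without touching the engine. Shown here over a plain list standing in for a DataFrame column, for illustration:

```python
def restore_column(values: list, injection: dict) -> list:
    # Write the recorded clean values back over the corrupted row positions
    restored = list(values)
    for idx, clean in zip(injection["row_indices"], injection["clean_values"]):
        restored[idx] = clean
    return restored

corrupted = [0.42, None, 0.55, None]
injection = {"issue_type": "null_injection", "row_indices": [1, 3],
             "clean_values": [0.71, 0.18]}
restore_column(corrupted, injection)  # -> [0.42, 0.71, 0.55, 0.18]
```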


holdout

Train/holdout split metadata. Present only when config.holdout.enabled is True; null otherwise.

{
  "holdout": {
    "target_metric": "mrr",
    "holdout_periods": 3,
    "cutoff_period_index": 21
  }
}

| Field | Type | Description |
| --- | --- | --- |
| target_metric | str | Mirror of config.holdout.target |
| holdout_periods | int | Mirror of config.holdout.periods |
| cutoff_period_index | int | The resolved boundary — n_periods - holdout_periods. Periods [0, cutoff) are training; [cutoff, n_periods) are holdout |

Use case — slice an unsplit fact table or its derivative on the same axis without recomputing period_count from time_window.
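Slicing on the recorded boundary can be sketched as follows; the period_index key is a stand-in for however periods are identified in your table, not a documented column name:

```python
def split_on_cutoff(rows: list[dict], cutoff: int, period_key: str = "period_index"):
    # Periods [0, cutoff) are training; [cutoff, n_periods) are holdout
    train = [r for r in rows if r[period_key] < cutoff]
    holdout = [r for r in rows if r[period_key] >= cutoff]
    return train, holdout

rows = [{"period_index": p, "mrr": 100 + p} for p in range(24)]
train, holdout = split_on_cutoff(rows, cutoff=21)
len(train), len(holdout)  # -> (21, 3)
```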


correlation_adjustments

Pairs whose configured correlation was projected to a nearby PD value because the user-declared matrix wasn't positive definite.

{
  "correlation_adjustments": [
    {
      "metric_a": "engagement",
      "metric_b": "support_tickets",
      "requested": -0.75,
      "achieved":  -0.68,
      "adjustment": 0.07
    }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| metric_a / metric_b | str | The pair |
| requested | float | Coefficient declared in the config |
| achieved | float | Value at the same (i, j) cell after Higham projection |
| adjustment | float | abs(requested - achieved) |

null when the user-declared matrix was already PD (the common case) or when no correlations were configured. Pairs whose adjustment falls below the numerical noise floor (~1e-12) are dropped, so an empty array distinguishes "all pairs were tolerance-clean" from null ("no projection needed").

Use case — flag configs whose declared correlations couldn't be delivered exactly, and decide whether to relax the matrix or accept the projected value.
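The three-way distinction (null vs empty vs populated) can be handled explicitly on the consumer side. A sketch:

```python
def adjustment_report(manifest: dict) -> str:
    adjustments = manifest.get("correlation_adjustments")
    if adjustments is None:
        return "no projection needed"            # matrix already PD, or no correlations
    if not adjustments:
        return "projected; all pairs tolerance-clean"
    worst = max(adjustments, key=lambda a: a["adjustment"])
    return (f"projected; worst pair {worst['metric_a']}/{worst['metric_b']} "
            f"off by {worst['adjustment']:.2f}")

adjustment_report({"correlation_adjustments": None})  # -> "no projection needed"
```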


correlation_compensations

Pairs the engine pre-compensated for trajectory-induced covariance — recorded only when compensate_correlations is True.

{
  "correlation_compensations": [
    {
      "metric_a": "engagement",
      "metric_b": "mrr",
      "user_target": 0.55,
      "trajectory_contribution": 0.32,
      "compensated_target": 0.23,
      "achievable": 0.23,
      "infeasible": false,
      "adjustment": 0.32
    }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| metric_a / metric_b | str | The pair |
| user_target | float | Coefficient declared in the config's connections block |
| trajectory_contribution | float | Within-archetype-weighted Pearson the trajectory's centers induce, in [-1, 1] |
| compensated_target | float | Pre-clamp user_target - trajectory_contribution. May fall outside [-1, 1] |
| achievable | float | compensated_target clamped to [-1, 1]. The value the copula actually targets |
| infeasible | bool | True when compensated_target fell outside [-1, 1]; the realized table-wide Pearson cannot reach user_target for these pairs |
| adjustment | float | abs(user_target - achievable) |

null when:

  • compensate_correlations is False, or
  • the config has no correlations / connections block, or
  • the metric count exceeded the cap (20) and the engine fell back to the direct-copula path.

Distinct from correlation_adjustments: that records "your matrix wasn't PD, we projected"; this records "your target was compensated for the trajectory's structural contribution before reaching the copula." Both can populate on a single run.

Use case — sort by adjustment to find pairs whose realized correlation drifts most from the configured target. Pairs flagged infeasible: true can never reach the user target on the current config — relax the trajectory mix or lower the magnitude.
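The triage described above takes a few lines. A sketch (the metric names in the sample data are illustrative):

```python
def drift_ranking(compensations: list[dict]) -> list[tuple]:
    # Largest drift first; infeasible pairs can never reach user_target
    ranked = sorted(compensations, key=lambda c: c["adjustment"], reverse=True)
    return [(c["metric_a"], c["metric_b"], c["adjustment"], c["infeasible"]) for c in ranked]

comps = [
    {"metric_a": "engagement", "metric_b": "mrr", "adjustment": 0.32, "infeasible": False},
    {"metric_a": "mrr", "metric_b": "churn_risk", "adjustment": 0.90, "infeasible": True},
]
drift_ranking(comps)[0]  # -> ("mrr", "churn_risk", 0.9, True)
```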


bypass_fallback_counts

Per-archetype count of cells that triggered the per-row scalar fallback in vectorized generation mode.

{
  "bypass_fallback_counts": {
    "growth": 0,
    "decline": 12,
    "spike_then_crash": 47
  }
}

| Form | Meaning |
| --- | --- |
| null | Serial mode — bypass was never measured |
| {} | Vectorized ran with zero bypass cells (the production-shape case) |
| {name: count, ...} | Vectorized hit the scalar fallback for count cells under archetype name |

A non-zero count means vectorized mode wasn't fully effective for that archetype on this config, and is a direct starting point for "why isn't vectorized faster than serial here" investigations.
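Interpreting the three forms (tolerating zero-count entries like the growth: 0 above) can be sketched as:

```python
def bypass_summary(counts) -> str:
    if counts is None:
        return "serial mode (bypass never measured)"
    # Entries with a zero count carry no fallback work
    offenders = {name: n for name, n in counts.items() if n > 0}
    if not offenders:
        return "fully vectorized"
    return f"{sum(offenders.values())} fallback cells across {len(offenders)} archetype(s)"

bypass_summary({"growth": 0, "decline": 12, "spike_then_crash": 47})
# -> "59 fallback cells across 2 archetype(s)"
```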


vectorized_threshold_used

The value of the auto-mode entity-count threshold at generation time.


| Form | Meaning |
| --- | --- |
| int | Recorded threshold (currently 50) |
| null | Pre-M121b manifest on disk |

Recorded so old manifests stay reproducible if the constant changes in a later release — comparing this against the current threshold lets a consumer detect that a re-run would land in a different generation_mode.
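A consumer-side drift check can be sketched like this; CURRENT_THRESHOLD is a stand-in for whatever the library's constant is at read time, not an exported name:

```python
CURRENT_THRESHOLD = 50  # assumed to mirror the library's auto-mode constant

def rerun_mode_differs(manifest: dict, n_entities: int) -> bool:
    # True when the entity count falls on different sides of the recorded
    # and current thresholds, i.e. a re-run would pick a different mode
    recorded = manifest.get("vectorized_threshold_used")
    if recorded is None:
        return False  # pre-M121b manifest: nothing to compare against
    return (n_entities >= recorded) != (n_entities >= CURRENT_THRESHOLD)

rerun_mode_differs({"vectorized_threshold_used": 30}, n_entities=40)  # -> True
```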


Reading the manifest in Python

import json
from pathlib import Path

manifest = json.loads(Path("output/manifest.json").read_text())

# Build the entity → archetype lookup
labels = {a["entity"]: a["archetype"] for a in manifest["archetype_assignments"]}

# Reconstruct an entity's trajectory tape
positions = sorted(
    (s["period_index"], s["position"])
    for s in manifest["trajectory_samples"]
    if s["entity"] == "growers_001"
)

# Detect quality corruption on a column
nullified_rows = [
    inj["row_indices"]
    for inj in manifest["quality_injections"]
    if inj["issue_type"] == "null_injection"
    and inj["table"] == "fct_engagement"
    and inj["column"] == "engagement"
]

pydantic users can validate the on-disk JSON against the typed manifest model directly:

from pathlib import Path
from plotsim import ManifestSchema

manifest = ManifestSchema.model_validate_json(Path("output/manifest.json").read_text())

The model has extra="forbid", so a malformed or out-of-version manifest fails loudly during validation rather than silently dropping unknown fields.