# Manifest Reference
The manifest is a JSON sidecar (`manifest.json`) written next to the generated tables. It captures the signal layer — archetype labels, trajectory positions, event firings, SCD transitions, bridge associations, and reproducibility metadata — that a downstream ML pipeline can use as ground truth without re-deriving it from noisy cell values.

See `build_manifest` for the programmatic builder. The companion docs are `config-reference.md` (the `manifest` config block) and `api-reference.md`.
## When the manifest is written

`write_tables` writes `manifest.json` when both:

- `config.manifest.include` is True (the default), and
- a `manifest` argument was passed to `write_tables`.

The function-level CLI does this automatically. Programmatic callers must build the manifest first via `build_manifest(...)` and pass it through:
```python
from plotsim import generate_tables_with_state, build_manifest, write_tables

tables, state = generate_tables_with_state(cfg)
manifest = build_manifest(
    cfg, state.trajectories, tables,
    scd_state=state.scd, bridge_state=state.bridges,
)
write_tables(tables, cfg, manifest=manifest)
```
The JSON serialization is byte-deterministic: the same (config, seed) pair produces a byte-identical `manifest.json`. Encoding is UTF-8 with `sort_keys=True`, `indent=2`, and a trailing newline.
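A minimal sketch (not the library's actual writer) of how those serialization settings make key insertion order irrelevant to the bytes on disk:

```python
import json

def serialize(manifest: dict) -> bytes:
    # Trailing newline and UTF-8 encoding match the documented format.
    return (json.dumps(manifest, sort_keys=True, indent=2) + "\n").encode("utf-8")

a = serialize({"seed": 42, "schema_version": "1.0"})
b = serialize({"schema_version": "1.0", "seed": 42})
assert a == b  # byte-identical regardless of construction order
```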
## Top-level fields

```json
{
  "schema_version": "1.0",
  "seed": 42,
  "config_sha256": "<64-char hex>",
  "archetype_assignments": [...],
  "trajectory_samples": [...],
  "event_firings": [...],
  "scd_events": [...],
  "bridge_associations": [...],
  "quality_injections": [...],
  "holdout": {...} | null,
  "correlation_adjustments": [...] | null,
  "correlation_compensations": [...] | null,
  "bypass_fallback_counts": {...} | null,
  "vectorized_threshold_used": 50 | null
}
```
| Field | Type | Description |
|---|---|---|
| `schema_version` | str | Wire-shape version. Currently `"1.0"`. Bumped when the manifest gains a non-additive change |
| `seed` | int | The seed used for generation — `config.seed` |
| `config_sha256` | str | Full SHA-256 hex of the JSON-serialized config. Detects config drift between generation and consumption |
| `archetype_assignments` | array | One entry per entity; see below |
| `trajectory_samples` | array | Per-period position cells for a sampled subset of entities |
| `event_firings` | array | Which periods each entity fired in for each event table |
| `scd_events` | array | SCD Type 2 band crossings (empty when no SCD columns are configured) |
| `bridge_associations` | array | Per-bridge M:N association ground truth (empty when no bridges are configured) |
| `quality_injections` | array | Per-issue ground truth — corrupted rows and clean values (empty when `quality.quality_issues` is empty) |
| `holdout` | object or null | Train/holdout split metadata. `null` when `holdout.enabled` is False |
| `correlation_adjustments` | array or null | Higham nearest-PD projections. `null` when the user matrix was already PD |
| `correlation_compensations` | array or null | Trajectory-aware compensation records. `null` when `compensate_correlations` is False or the metric cap was exceeded |
| `bypass_fallback_counts` | object or null | Per-archetype count of cells that fell back to the scalar copula path. `null` in serial mode |
| `vectorized_threshold_used` | int or null | The auto-mode entity-count threshold at generation time. `null` for pre-M121b manifests on disk |
## archetype_assignments
One ground-truth label per entity — the archetype the engine drove their trajectory from.
```json
{
  "archetype_assignments": [
    { "entity": "growers_001", "archetype": "growth" },
    { "entity": "decliners_002", "archetype": "decline" }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `entity` | str | Entity name (matches `config.entities[i].name`) |
| `archetype` | str | Archetype name (matches `config.archetypes[i].name`) |
Sorted by entity for a stable diff under the same config.

Use case — train a classifier on the fact-table aggregates and score it against these labels. The archetype is the latent class label your model is trying to recover; this list is the answer key.
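For instance, scoring model output against the answer key; the entity names, archetypes, and predictions here are illustrative:

```python
# Hypothetical manifest fragment; real entries come from manifest.json.
assignments = [
    {"entity": "growers_001", "archetype": "growth"},
    {"entity": "decliners_002", "archetype": "decline"},
]
truth = {a["entity"]: a["archetype"] for a in assignments}

# Hypothetical classifier output, keyed by entity name.
predictions = {"growers_001": "growth", "decliners_002": "growth"}

accuracy = sum(truth[e] == p for e, p in predictions.items()) / len(truth)
assert accuracy == 0.5  # one of two labels recovered
```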
## trajectory_samples
Per-period trajectory positions for a sampled subset of entities.
```json
{
  "trajectory_samples": [
    { "entity": "growers_001", "period_index": 0, "position": 0.05 },
    { "entity": "growers_001", "period_index": 1, "position": 0.08 },
    { "entity": "growers_001", "period_index": 2, "position": 0.13 }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `entity` | str | Entity name |
| `period_index` | int | Zero-based period index. 0 is the first period of `time_window` |
| `position` | float | Position in [0, 1] |
Position is the noise-free, distribution-free behavioral state the engine derived every metric from. It's not present in the fact table — the fact table holds realized values shaped by polarity, distribution, correlation, and noise.
Sampled subset — controlled by `config.manifest.trajectory_sample_rate` (default 1.0, meaning every entity). The selection is the first `ceil(n_entities × sample_rate)` entities under sorted-name order, so it stays stable regardless of seed. Set this below 1.0 for very large configs where the per-period tape would dominate manifest size.
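The selection rule can be sketched as follows (a re-derivation of the documented behavior, not the library's internal code):

```python
import math

def sampled_entities(names, sample_rate):
    # First ceil(n_entities * sample_rate) names under sorted-name order,
    # so the subset is stable across seeds.
    n = math.ceil(len(names) * sample_rate)
    return sorted(names)[:n]

assert sampled_entities(["c", "a", "d", "b"], 0.5) == ["a", "b"]
assert sampled_entities(["c", "a"], 1.0) == ["a", "c"]
```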
Use case — verify the trajectory-first invariant from the manifest: combine with `trace_metric_cell` to confirm position → realized cell for any entity in the sample.
## event_firings
For each event table, which periods each entity fired in.
```json
{
  "event_firings": [
    {
      "entity": "growers_001",
      "table": "evt_login",
      "period_indices": [0, 1, 2, 3, 5, 7]
    }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `entity` | str | Entity name |
| `table` | str | Event-table name |
| `period_indices` | array of int | Sorted ascending; the periods this entity contributed at least one row in |
Empty `period_indices` are kept rather than omitted, so a downstream consumer can iterate the full entity × event-table matrix without fallback logic.
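A sketch of iterating that dense matrix; the entities and table name are illustrative:

```python
# Hypothetical manifest["event_firings"] fragment; note the empty list
# is present rather than omitted.
firings = [
    {"entity": "growers_001", "table": "evt_login", "period_indices": [0, 1, 3]},
    {"entity": "decliners_002", "table": "evt_login", "period_indices": []},
]

# Dense (entity, table) -> set-of-periods lookup; no fallback logic needed.
fired = {(f["entity"], f["table"]): set(f["period_indices"]) for f in firings}

assert fired[("decliners_002", "evt_login")] == set()
assert 3 in fired[("growers_001", "evt_login")]
```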
Both threshold and proportional events surface here. The manifest records observed firings, not the configured triggers.
## scd_events
SCD Type 2 band crossings — emitted only for transitions, not the initial band.
```json
{
  "scd_events": [
    {
      "dim_table": "dim_customer",
      "entity": "growers_001",
      "period_index": 5,
      "old_label": "starter",
      "new_label": "pro",
      "old_dim_row_id": 12,
      "new_dim_row_id": 13,
      "trigger_metric": "fct_engagement.mrr",
      "trigger_position": 0.52
    }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `dim_table` | str | The dim table the SCD column lives on |
| `entity` | str | Entity name |
| `period_index` | int | Period the crossing happened in |
| `old_label` | str | Band the entity was in before |
| `new_label` | str | Band the entity advanced to |
| `old_dim_row_id` | int | Surrogate row ID of the closing version |
| `new_dim_row_id` | int | Surrogate row ID of the opening version |
| `trigger_metric` | str | The metric whose threshold was crossed (`<fact_table>.<metric>`) |
| `trigger_position` | float | Trajectory position at the crossing period |
Empty when no `scd` columns are configured. Sorted by dim table, then entity, then period for stable ordering.

Use case — join against `trajectory_samples` to recover the exact position that triggered each band change.
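A sketch of that join, on hypothetical fragments of the two arrays:

```python
scd_events = [
    {"entity": "growers_001", "period_index": 5, "new_label": "pro"},
]
trajectory_samples = [
    {"entity": "growers_001", "period_index": 5, "position": 0.52},
]

pos = {(s["entity"], s["period_index"]): s["position"] for s in trajectory_samples}

# Entities outside the sampled subset simply won't join.
crossings = [
    (e["entity"], e["new_label"], pos[(e["entity"], e["period_index"])])
    for e in scd_events
    if (e["entity"], e["period_index"]) in pos
]
assert crossings == [("growers_001", "pro", 0.52)]
```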
## bridge_associations
Many-to-many associations recorded as ground truth.
```json
{
  "bridge_associations": [
    {
      "bridge": "customer_subscription",
      "entity": "growers_001",
      "targets": ["sub_007", "sub_023", "sub_041"],
      "cardinality": 3
    }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `bridge` | str | Bridge-table name |
| `entity` | str | First-dim entity name |
| `targets` | array | Second-dim FK values (PKs for non-SCD dims; `dim_row_id` for SCD dims) |
| `cardinality` | int | `len(targets)`. Surfaced separately so consumers can aggregate without iterating each tuple |
Empty when no bridges are configured. Sorted by bridge name, then
entity name.
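The separate cardinality field lets a consumer aggregate without touching the targets tuples; for example, on hypothetical data:

```python
from collections import Counter

associations = [
    {"bridge": "customer_subscription", "entity": "growers_001", "cardinality": 3},
    {"bridge": "customer_subscription", "entity": "decliners_002", "cardinality": 1},
    {"bridge": "customer_subscription", "entity": "stables_003", "cardinality": 3},
]

# Distribution of association counts, without iterating any targets list.
dist = Counter(a["cardinality"] for a in associations)
assert dist == {3: 2, 1: 1}
```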
## quality_injections
Ground truth for post-generation data corruption.
```json
{
  "quality_injections": [
    {
      "issue_index": 0,
      "issue_type": "null_injection",
      "table": "fct_engagement",
      "column": "engagement",
      "row_indices": [3, 17, 42],
      "clean_values": [0.42, 0.71, 0.18]
    },
    {
      "issue_index": 1,
      "issue_type": "duplicate_rows",
      "table": "fct_engagement",
      "column": "_rows",
      "row_indices": [8, 19],
      "clean_values": []
    }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `issue_index` | int | Position in `config.quality.quality_issues` — distinguishes multiple issues |
| `issue_type` | str | `null_injection`, `duplicate_rows`, `type_mismatch`, `late_arrival`, or `schema_drift` |
| `table` | str | Target table |
| `column` | str | Target column. For row-level issues this is a sentinel — `_rows` for duplicates, `_arrival_period` for late arrivals |
| `row_indices` | array of int | Row positions in the corrupted DataFrame — the rows that were affected |
| `clean_values` | array | Original values at those rows. Empty for `duplicate_rows` and `late_arrival` (the corruption is row-level, not per-cell) |
Empty when `config.quality.quality_issues` is empty.
Use case — recover the clean dataset from the corrupted output without re-running generation, or train a model that explicitly handles the corruption pattern.
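A pure-Python sketch of the clean-dataset recovery for one `null_injection` issue; the column values here are hypothetical, and the same write-back works positionally on a DataFrame column:

```python
# Corrupted column as read from the output, plus the matching injection record.
column = [0.9, None, 0.4, None]
injection = {
    "issue_type": "null_injection",
    "row_indices": [1, 3],
    "clean_values": [0.7, 0.2],
}

# Write the recorded clean values back over the corrupted row positions.
restored = list(column)
for row, value in zip(injection["row_indices"], injection["clean_values"]):
    restored[row] = value

assert restored == [0.9, 0.7, 0.4, 0.2]
```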
## holdout

Train/holdout split metadata. Present only when `config.holdout.enabled` is True; `null` otherwise.
| Field | Type | Description |
|---|---|---|
| `target_metric` | str | Mirror of `config.holdout.target` |
| `holdout_periods` | int | Mirror of `config.holdout.periods` |
| `cutoff_period_index` | int | The resolved boundary — `n_periods - holdout_periods`. Periods `[0, cutoff)` are training; `[cutoff, n_periods)` are holdout |
Use case — slice an unsplit fact table or its derivative on the same axis without recomputing `period_count` from `time_window`.
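A sketch of slicing per-period data on the recorded boundary; the period count and holdout block are illustrative:

```python
# Hypothetical manifest["holdout"] block.
holdout = {"target_metric": "mrr", "holdout_periods": 3, "cutoff_period_index": 9}

periods = list(range(12))  # n_periods = 12 in this sketch
cutoff = holdout["cutoff_period_index"]
train = [p for p in periods if p < cutoff]
held_out = [p for p in periods if p >= cutoff]

assert len(held_out) == holdout["holdout_periods"]
```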
## correlation_adjustments
Pairs whose configured correlation was projected to a nearby PD value because the user-declared matrix wasn't positive semi-definite.
```json
{
  "correlation_adjustments": [
    {
      "metric_a": "engagement",
      "metric_b": "support_tickets",
      "requested": -0.75,
      "achieved": -0.68,
      "adjustment": 0.07
    }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `metric_a` / `metric_b` | str | The pair |
| `requested` | float | Coefficient declared in the config |
| `achieved` | float | Value at the same `(i, j)` cell after Higham projection |
| `adjustment` | float | `abs(requested - achieved)` |
`null` when the user-declared matrix was already PD (the common case) or when no correlations were configured. Pairs whose adjustment falls below the numerical noise floor (~1e-12) are dropped, so an empty array distinguishes "all pairs were tolerance-clean" from `null` ("no projection needed").
Use case — flag configs whose declared correlations couldn't be delivered exactly, and decide whether to relax the matrix or accept the projected value.
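A sketch of that check; the tolerance value is an application-side assumption, not part of the library:

```python
# Hypothetical adjustments array (None would mean no projection was needed).
adjustments = [
    {"metric_a": "engagement", "metric_b": "support_tickets", "adjustment": 0.07},
    {"metric_a": "engagement", "metric_b": "mrr", "adjustment": 0.001},
]

TOLERANCE = 0.05  # assumed application-specific threshold
flagged = [
    (a["metric_a"], a["metric_b"])
    for a in adjustments
    if a["adjustment"] > TOLERANCE
]
assert flagged == [("engagement", "support_tickets")]
```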
## correlation_compensations

Pairs the engine pre-compensated for trajectory-induced covariance — recorded only when `compensate_correlations` is True.
```json
{
  "correlation_compensations": [
    {
      "metric_a": "engagement",
      "metric_b": "mrr",
      "user_target": 0.55,
      "trajectory_contribution": 0.32,
      "compensated_target": 0.23,
      "achievable": 0.23,
      "infeasible": false,
      "adjustment": 0.32
    }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| `metric_a` / `metric_b` | str | The pair |
| `user_target` | float | Coefficient declared in the config's `connections` block |
| `trajectory_contribution` | float | Within-archetype-weighted Pearson the trajectory's centers induce, in [-1, 1] |
| `compensated_target` | float | Pre-clamp `user_target - trajectory_contribution`. May fall outside [-1, 1] |
| `achievable` | float | `compensated_target` clamped to [-1, 1]. The value the copula actually targets |
| `infeasible` | bool | True when `compensated_target` fell outside [-1, 1]. The realized table-wide Pearson lands at `user_target` ± something < \|user_target\| for these |
| `adjustment` | float | `abs(user_target - achievable)` |
`null` when:

- `compensate_correlations` is False, or
- the config has no `correlations`/`connections` block, or
- the metric count exceeded the cap (20) and the engine fell back to the direct-copula path.
Distinct from `correlation_adjustments`: that records "your matrix wasn't PD, we projected"; this records "your target was compensated for the trajectory's structural contribution before reaching the copula." Both can populate on a single run.
Use case — sort by `adjustment` to find pairs whose realized correlation drifts most from the configured target. Pairs flagged `infeasible: true` can never reach the user target on the current config — relax the trajectory mix or lower the magnitude.
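A sketch on a hypothetical compensations array:

```python
comps = [
    {"metric_a": "engagement", "metric_b": "mrr",
     "adjustment": 0.32, "infeasible": False},
    {"metric_a": "engagement", "metric_b": "churn",
     "adjustment": 0.05, "infeasible": True},
]

# Largest drift first; infeasible pairs can never hit the user target.
worst_first = sorted(comps, key=lambda c: c["adjustment"], reverse=True)
infeasible = [(c["metric_a"], c["metric_b"]) for c in comps if c["infeasible"]]

assert worst_first[0]["adjustment"] == 0.32
assert infeasible == [("engagement", "churn")]
```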
## bypass_fallback_counts
Per-archetype count of cells that triggered the per-row scalar fallback in vectorized generation mode.
| Form | Meaning |
|---|---|
| `null` | Serial mode — bypass was never measured |
| `{}` | Vectorized ran with zero bypass cells (the production-shape case) |
| `{name: count, ...}` | Vectorized hit the scalar fallback for `count` cells under archetype `name` |
A non-zero count means vectorized mode wasn't fully effective for that archetype on this config. Surfaces "vectorized isn't faster than serial here" investigations directly.
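The three forms can be told apart with a small check (a sketch, not a library helper):

```python
def describe_bypass(counts):
    if counts is None:
        return "serial"            # bypass was never measured
    if not counts:
        return "fully-vectorized"  # zero bypass cells
    return "partial-fallback"      # some cells hit the scalar path

assert describe_bypass(None) == "serial"
assert describe_bypass({}) == "fully-vectorized"
assert describe_bypass({"growth": 12}) == "partial-fallback"
```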
## vectorized_threshold_used
The value of the auto-mode entity-count threshold at generation time.
| Form | Meaning |
|---|---|
| int | Recorded threshold (currently 50) |
| `null` | Pre-M121b manifest on disk |
Recorded so old manifests stay reproducible if the constant changes in a later release — comparing this against the current threshold lets a consumer detect that a re-run would land in a different `generation_mode`.
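A sketch of that comparison; `CURRENT_THRESHOLD` and `mode_would_change` are illustrative names standing in for whatever the installed release uses:

```python
CURRENT_THRESHOLD = 50  # assumption: the installed release's auto-mode constant

def mode_would_change(manifest: dict, n_entities: int) -> bool:
    recorded = manifest.get("vectorized_threshold_used")
    if recorded is None:  # pre-M121b manifest: nothing to compare against
        return False
    # Auto mode flips when the entity count sits between the two thresholds.
    return (n_entities >= recorded) != (n_entities >= CURRENT_THRESHOLD)

assert mode_would_change({"vectorized_threshold_used": 100}, 75) is True
assert mode_would_change({"vectorized_threshold_used": 50}, 75) is False
```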
## Reading the manifest in Python

```python
import json
from pathlib import Path

manifest = json.loads(Path("output/manifest.json").read_text())

# Build the entity → archetype lookup
labels = {a["entity"]: a["archetype"] for a in manifest["archetype_assignments"]}

# Reconstruct an entity's trajectory tape
positions = sorted(
    (s["period_index"], s["position"])
    for s in manifest["trajectory_samples"]
    if s["entity"] == "growers_001"
)

# Detect quality corruption on a column
nullified_rows = [
    inj["row_indices"]
    for inj in manifest["quality_injections"]
    if inj["issue_type"] == "null_injection"
    and inj["table"] == "fct_engagement"
    and inj["column"] == "engagement"
]
```
pydantic users can validate the on-disk JSON against the typed manifest model directly:

```python
from pathlib import Path

from plotsim import ManifestSchema

manifest = ManifestSchema.model_validate_json(Path("output/manifest.json").read_text())
```
The model has `extra="forbid"`, so a malformed or out-of-version manifest fails loudly during validation rather than silently dropping unknown fields.