
plotsim

Generate multi-table synthetic datasets where the metrics tell a story — not random noise.

Most synthetic data tools generate columns independently. Revenue is random. Engagement is random. Churn is random. The numbers fill a schema, but they don't behave like real data — because in real data, these things move together.

plotsim generates relational test data with shape: every entity follows a behavioral trajectory, and every metric — across every table, every foreign key, every time period — is derived from the same trajectory position. When engagement rises, revenue follows. When it declines, churn fires.


See it

The same SaaS schema, generated two ways. One company, twelve months.

Every column is independent. The numbers don't agree with each other.

month    engagement  mrr     tickets  churn_risk
2024-01  0.842       $483    7        0.611
2024-02  0.117       $4,201  0        0.043
2024-03  0.674       $1,089  11       0.892
2024-04  0.298       $112    2        0.355
2024-05  0.951       $7,733  4        0.018
2024-06  0.024       $964    9        0.477
2024-07  0.560       $2,154  1        0.802
2024-08  0.405       $328    6        0.220
2024-09  0.789       $617    0        0.998
2024-10  0.131       $5,440  8        0.156
2024-11  0.847       $192    3        0.501
2024-12  0.334       $3,876  12       0.063

In May, engagement is 0.951 and churn risk is 0.018. In September, engagement is still 0.789, yet churn risk is 0.998, the highest in the table. There is no story here, only filled fields.

Same schema, generated by plotsim run saas. One real dim_company row, 12 of its 24 monthly rows from fct_engagement, fct_revenue, fct_support_tickets.

month    engagement  mrr     tickets  churn_risk
2024-01  0.587       $1,191  0        0.261
2024-02  0.807       $1,265  1        0.189
2024-03  1.000       $3,532  2        0.129
2024-04  0.593       $818    0        0.171
2024-05  0.904       $3,567  2        0.237
2024-06  0.956       $4,264  1        0.257
2024-07  1.000       $302    2        0.000
2024-08  0.917       $1,507  0        0.000
2024-09  1.000       $890    1        0.000
2024-10  0.783       $512    1        0.264
2024-11  0.956       $837    0        0.000
2024-12  0.827       $351    1        0.248

Engagement stays high, between 0.59 and 1.00. MRR rises and falls with it through June. Support tickets never exceed two a month, and churn risk never exceeds 0.27. All four columns read from the same underlying trajectory position, not from four independent random generators.

The contrast is the entire product.
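
To make the contrast concrete, here is a minimal sketch of the idea in plain Python. It is not plotsim's engine: the function names, formulas, value ranges, and noise level are all assumptions chosen for illustration. The only point it demonstrates is the difference between drawing each column independently and deriving every column from one shared trajectory position.

```python
import random


def independent_row(rng):
    """Baseline: each metric is drawn on its own, so the columns cannot agree."""
    return {
        "engagement": rng.random(),
        "mrr": round(rng.uniform(100, 8000)),
        "tickets": rng.randint(0, 12),
        "churn_risk": rng.random(),
    }


def coupled_row(position, rng, noise=0.05):
    """Illustrative only: every metric is a function of one position in [0, 1]."""

    def jitter(value):
        # Small bounded noise, clamped so the value stays in [0, 1].
        return min(1.0, max(0.0, value + rng.uniform(-noise, noise)))

    return {
        "engagement": jitter(position),
        "mrr": round(500 + 4000 * jitter(position)),    # rises with position
        "tickets": round(6 * jitter(1.0 - position)),   # falls as position rises
        "churn_risk": 0.5 * jitter(1.0 - position),     # falls as position rises
    }


rng = random.Random(42)
positions = [i / 11 for i in range(12)]  # hypothetical rising trajectory, 12 months
rows = [coupled_row(p, rng) for p in positions]

for month, row in enumerate(rows, start=1):
    print(month, row)
```

Because each coupled row starts from the same position, a rising trajectory lifts engagement and MRR together while tickets and churn risk fall. The independent rows have no such link.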


Install

pip install plotsim

Requires Python 3.10+. Zero network calls at generation time. All bundled templates work offline.


Run a template

plotsim template saas -o config.yaml
plotsim run config.yaml -o ./output

Or from Python — the builder API is the front door:

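from plotsim import create_from_yaml, generate_tables, write_tables

# Load the config produced by `plotsim template saas -o config.yaml`
cfg = create_from_yaml("config.yaml")
# Generate every table described by the config
tables = generate_tables(cfg)
# Write the generated tables out
write_tables(tables, cfg)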

Either flow produces a complete star schema:

output/
├── dim_date.csv                # complete date spine
├── dim_company.csv             # entity attributes (with SCD2 plan_tier)
├── dim_user.csv                # sub-entity attributes
├── dim_plan.csv                # reference lookup
├── fct_engagement.csv          # entity × period metrics
├── fct_revenue.csv             # entity × period metrics
├── fct_support_tickets.csv     # entity × period metrics
├── evt_login.csv               # proportional events
├── evt_churn.csv               # threshold-triggered events
├── config.yaml                 # frozen copy of the input config
└── validation_report.txt       # FK + PK + spine integrity checks
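
The foreign-key check in the validation report can also be reproduced by hand. The sketch below uses only the standard library and is not plotsim's own validator. The helper name `orphan_keys` is hypothetical, and so is the `company_id` column name in the commented usage line, since the actual key columns are not shown here.

```python
import csv


def orphan_keys(fact_path, dim_path, key):
    """Return fact-table key values that have no matching row in the dimension."""
    with open(dim_path, newline="") as f:
        dim_keys = {row[key] for row in csv.DictReader(f)}
    with open(fact_path, newline="") as f:
        fact_keys = {row[key] for row in csv.DictReader(f)}
    return sorted(fact_keys - dim_keys)


# Hypothetical usage; the key column name is an assumption:
# orphan_keys("output/fct_revenue.csv", "output/dim_company.csv", "company_id")
```

An empty list means every fact row points at an existing dimension row for that key.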

If a company's engagement trajectory declines, its login events decrease in evt_login.csv and churn events appear in evt_churn.csv — because both tables read from the same underlying trajectory.
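
The two event types named in the tree above, proportional and threshold-triggered, can be sketched as follows. This is an illustration under stated assumptions, not plotsim's implementation: the function names, the `max_logins` scale, and the `threshold` value are all hypothetical.

```python
def login_count(position, max_logins=200):
    """Proportional event: volume scales with the trajectory position."""
    return round(max_logins * position)


def churn_periods(positions, threshold=0.2):
    """Threshold-triggered event: fire once, in the first period below threshold."""
    for period, position in enumerate(positions):
        if position < threshold:
            return [period]
    return []


declining = [0.9, 0.7, 0.5, 0.3, 0.15, 0.05]  # hypothetical declining trajectory
logins = [login_count(p) for p in declining]
churn = churn_periods(declining)

print(logins)
print(churn)
```

Both outputs are computed from the same list of positions, so login volume shrinks in the periods leading up to the churn event instead of varying independently of it.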


What plotsim is not

  • Not an ML model trained on real data. plotsim takes a YAML spec; it does not learn from samples.
  • Not an LLM-driven generator. The engine is deterministic. Same config + same seed = byte-identical output.
  • Not a Faker replacement. Faker fills cells; plotsim composes coherent multi-table datasets where the cells agree.
  • Not a privacy tool. Faker output looks realistic but is not anonymized. Treat string columns (names, companies) as synthetic, not as safe to publish.
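
The determinism claim above (same config plus same seed gives byte-identical output) is easy to check independently. The sketch below is one way to do it with the standard library. The helper name `directory_digest` and the directory names in the usage note are assumptions, not part of plotsim.

```python
import hashlib
from pathlib import Path


def directory_digest(root):
    """Hash every file under root (relative path plus bytes) into one SHA-256 digest."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between the file name and its contents
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

To use it, run `plotsim run config.yaml -o ./run_a` and then the same command with `-o ./run_b`, and compare `directory_digest("./run_a")` with `directory_digest("./run_b")`. Equal digests mean the two runs match byte for byte.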