TabDM

TabDM is a small Python library for generating synthetic mixed-type tabular data with diffusion in a transformed feature space.

It is designed for tabular datasets with numeric, categorical, boolean, ordinal, count, and positive continuous columns. The public API focuses on two workflows:

- Fit a model and generate synthetic rows in one call.
- Fit once, then generate repeatedly with optional target or subgroup controls.

TabDM also includes evaluation helpers for schema compatibility, distribution fidelity, downstream utility, validity checks, and privacy-screening metrics.

For exact function signatures, parameter semantics, and report shapes, see docs/API_REFERENCE.md.

Install

From PyPI:

pip install tabdm

For evaluation helpers:

pip install "tabdm[eval]"

For local development:

pip install -e ".[dev]"

Quick Start

import pandas as pd

from tabdm import generate_synthetic_data

real = pd.DataFrame(
    {
        "age": [21.0, 35.0, 44.0, 28.0, 31.0, 39.0],
        "job": ["admin", "tech", "admin", "services", "tech", "admin"],
        "owns_house": [True, False, True, False, True, False],
        "balance": [1000.0, 250.0, 1900.0, 750.0, 1200.0, 400.0],
    }
)

synthetic = generate_synthetic_data(
    real,
    num_rows=100,
    discrete_columns=["job", "owns_house"],
    epochs=50,
    timesteps=64,
    sample_steps=16,
    random_state=42,
)

generate_synthetic_data returns a pandas.DataFrame with the same column order as the training dataframe. Numeric outputs are clipped to the training range, count columns are rounded, and discrete columns are decoded to values observed during fitting.
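The output guarantees described above (numeric values inside the training range, discrete values limited to those seen during fitting) can be spot-checked without TabDM. The sketch below is a pure-Python illustration only; the helper names `out_of_range` and `unseen_values` are hypothetical and are not part of the TabDM API.

```python
def out_of_range(train_values, synthetic_values):
    """Return synthetic values that fall outside the training min/max."""
    low, high = min(train_values), max(train_values)
    return [v for v in synthetic_values if v < low or v > high]


def unseen_values(train_values, synthetic_values):
    """Return synthetic categories that never appeared in the training column."""
    seen = set(train_values)
    return sorted({v for v in synthetic_values if v not in seen})


train_balance = [1000.0, 250.0, 1900.0, 750.0, 1200.0, 400.0]
train_job = ["admin", "tech", "admin", "services", "tech", "admin"]

# Hand-written stand-ins for generated columns, used only to exercise the checks.
print(out_of_range(train_balance, [300.0, 1899.5, 2000.0]))  # [2000.0]
print(unseen_values(train_job, ["tech", "sales", "admin"]))   # ['sales']
```

For output that honours the guarantees, both checks should return empty lists.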

Fit Once, Generate Many Times

Use fit_tabdm when you want to reuse a fitted model.

from tabdm import fit_tabdm

model = fit_tabdm(
    real,
    discrete_columns=["job", "owns_house"],
    epochs=50,
    timesteps=64,
    sample_steps=16,
    random_state=42,
)

synthetic_a = model.generate(100, random_state=1)
synthetic_b = model.generate(100, random_state=2)

Passing the same random_state to generate produces the same sampled rows for the same fitted model.

Conditional Generation

TabDM can treat target, sensitive, or explicitly named columns as conditioning columns. Condition columns are not generated by the diffusion model. They are provided by the caller or sampled from the training rows, then recombined with generated feature columns.

real = pd.DataFrame(
    {
        "age": [21, 35, 44, 28, 31, 39],
        "job": ["admin", "tech", "admin", "services", "tech", "admin"],
        "sex": ["f", "m", "f", "m", "f", "m"],
        "default": ["yes", "no", "yes", "no", "yes", "no"],
    }
)

synthetic = generate_synthetic_data(
    real,
    num_rows=200,
    discrete_columns=["job", "sex", "default"],
    target_column="default",
    sensitive_columns=["sex"],
    conditions={"default": "yes"},
    condition_strategy="prior",
    epochs=50,
    random_state=42,
)

Conditioning controls:

| Argument | Meaning |
| --- | --- |
| `target_column` | Downstream label to hold fixed or sample separately. |
| `sensitive_columns` | Subgroup columns to preserve or control. |
| `condition_on` | Additional columns to use as generation conditions. |
| `conditions` | Fixed values or row-wise values to use at generation time. |
| `condition_strategy` | How unspecified condition columns are sampled: `prior` or `balanced`. |

conditions can be:

- None: sample all condition columns from the training condition rows
- a mapping of scalar values: fix those columns and sample the remaining condition columns from matching training rows
- a mapping of sequences: provide row-wise condition values for all condition columns
- a one-row dataframe: repeat the row for every generated sample
- a num_rows-row dataframe: use row-wise condition values directly
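To make the scalar-mapping case and the two strategies concrete, here is a minimal pure-Python sketch of how condition rows could be drawn. It is an illustration of the behavior described above, not TabDM's implementation; `sample_condition_rows` and its arguments are hypothetical names, and the way `balanced` deduplicates rows is an assumption.

```python
import random


def sample_condition_rows(train_rows, num_rows, fixed=None, strategy="prior", seed=None):
    """Hypothetical helper: pick `num_rows` condition rows from training rows."""
    rng = random.Random(seed)
    fixed = fixed or {}
    # Keep only training rows that agree with every fixed scalar value.
    pool = [row for row in train_rows if all(row[k] == v for k, v in fixed.items())]
    if not pool:
        raise ValueError("no training rows match the fixed conditions")
    if strategy == "balanced":
        # Deduplicate so every distinct combination is equally likely.
        unique = sorted({tuple(sorted(row.items())) for row in pool})
        pool = [dict(items) for items in unique]
    elif strategy != "prior":
        raise ValueError(f"unknown strategy: {strategy!r}")
    # With "prior", duplicates stay in the pool, so frequent rows are drawn more often.
    return [dict(rng.choice(pool)) for _ in range(num_rows)]


train_conditions = [
    {"sex": "f", "default": "yes"},
    {"sex": "m", "default": "no"},
    {"sex": "f", "default": "yes"},
    {"sex": "m", "default": "no"},
    {"sex": "f", "default": "yes"},
    {"sex": "m", "default": "no"},
]

rows = sample_condition_rows(train_conditions, 4, fixed={"default": "yes"}, seed=0)
print(rows)
```

In this toy data every row with default "yes" also has sex "f", so fixing the target constrains the sampled subgroup as well.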

Column Metadata

TabDM can use metadata for columns whose dtype alone does not determine how they should be encoded, such as ordinal, count, and positive continuous columns.

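# Illustrative dataframe whose columns match the metadata below.
schools = pd.DataFrame(
    {
        "district": ["north", "south", "north", "east", "south", "east"],
        "grade_band": ["low", "mid", "high", "mid", "low", "high"],
        "incidents": [0, 3, 1, 2, 0, 5],
        "tuition": [1200.0, 950.0, 2100.0, 1800.0, 1000.0, 2500.0],
    }
)

metadata = {
    "grade_band": {"type": "ordinal", "order": ["low", "mid", "high"]},
    "incidents": {"type": "count"},
    "tuition": {"type": "positive_continuous"},
}

synthetic = generate_synthetic_data(
    schools,
    discrete_columns=["district"],
    column_metadata=metadata,
    random_state=42,
)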

Supported metadata types:

| Type | Transform behavior | Inverse behavior |
| --- | --- | --- |
| `ordinal` | Encoded as an ordered scalar using the supplied or inferred order. | Rounded to the nearest ordinal level. |
| `count` | Encoded with `log1p` after clipping to non-negative values. | Decoded with `expm1`, clipped to the training range, and rounded. |
| `positive_continuous` | Encoded with `log1p` after clipping to non-negative values. | Decoded with `expm1` and clipped to the training range. |
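The transforms in the table can be sketched with the standard library. This is an illustration under stated assumptions, not TabDM's code: the function names are hypothetical, and `decode_ordinal` assumes that ordinal levels are encoded as their index in `order`, which the table does not specify.

```python
import math


def encode_count(value):
    """Clip to non-negative, then compress with log1p."""
    return math.log1p(max(value, 0.0))


def decode_positive(encoded, train_min, train_max):
    """Invert log1p with expm1 and clip to the training range."""
    raw = math.expm1(encoded)
    return min(max(raw, train_min), train_max)


def decode_count(encoded, train_min, train_max):
    """Same as decode_positive, followed by rounding to an integer."""
    return int(round(decode_positive(encoded, train_min, train_max)))


def decode_ordinal(encoded, order):
    """Assumes levels are encoded as their index in `order` (0, 1, 2, ...)."""
    index = min(max(int(round(encoded)), 0), len(order) - 1)
    return order[index]


print(decode_count(encode_count(7), 0, 10))         # 7
print(decode_count(encode_count(50), 0, 10))        # 10 (clipped to the training maximum)
print(encode_count(-3))                             # 0.0 (negative input clipped to zero)
print(decode_ordinal(1.4, ["low", "mid", "high"]))  # mid
```

Working in `log1p` space compresses heavy right tails, and clipping after `expm1` keeps decoded values inside the range seen during training.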

Object, string, categorical, and boolean columns are inferred as discrete when discrete_columns is not supplied. Numeric columns are treated as continuous unless column_metadata assigns them another type.

Generation Parameters

generate_synthetic_data exposes the fitting and sampling controls directly.

| Parameter | Default | Description |
| --- | --- | --- |
| `dataframe` | required | Training dataframe. Must contain at least one row. |
| `num_rows` | `len(dataframe)` | Number of synthetic rows. Must be positive when provided. |
| `discrete_columns` | inferred | Categorical columns. Accepts column names. |
| `column_metadata` | `None` | Metadata for ordinal, count, or positive continuous columns. |
| `target_column` | `None` | Column to condition on rather than generate. |
| `sensitive_columns` | `None` | Additional condition columns, usually subgroup attributes. |
| `condition_on` | `None` | Other condition columns. |
| `conditions` | `None` | Fixed or row-wise generation conditions. |
| `condition_strategy` | `"prior"` | `prior` samples training condition rows; `balanced` samples unique condition rows uniformly. |
| `hidden_dims` | `(256, 256)` | MLP denoiser hidden layer sizes. |
| `time_embedding_dim` | `64` | Sinusoidal timestep embedding size. |
| `timesteps` | `96` | Number of training diffusion timesteps. |
| `sample_steps` | `24` | Number of deterministic reverse steps used during sampling. |
| `epochs` | `120` | Training epochs. |
| `batch_size` | `512` | Training and sampling batch size. |
| `learning_rate` | `1e-3` | AdamW learning rate. |
| `weight_decay` | `1e-6` | AdamW weight decay. |
| `beta_start` | `1e-4` | First value in the linear noise schedule. |
| `beta_end` | `0.02` | Last value in the linear noise schedule. |
| `dropout` | `0.0` | Dropout inside the denoiser MLP. |
| `discrete_loss_weight` | `2.0` | Multiplier for one-hot categorical spans in the training loss. |
| `prediction_clip` | `1.5` | Clamp applied to predicted transformed features. |
| `grad_clip_norm` | `1.0` | Gradient norm clipping threshold. Use `0` to disable. |
| `device` | `"cpu"` | `"cpu"` or a CUDA device string. Falls back to CPU if CUDA is unavailable. |
| `random_state` | `None` | Seeds Python, NumPy, and Torch during fitting, and seeds sampling noise when generating. |
| `verbose` | `False` | Print training loss periodically. |
| `return_model` | `False` | Return `SyntheticDataResult` with the fitted model and metadata. |
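To show how `timesteps`, `beta_start`, `beta_end`, and `sample_steps` relate, the sketch below builds a linear noise schedule with the default values from the table and picks a shorter set of sampling steps. It is a pure-Python illustration, not TabDM's implementation: the function names are hypothetical, and the even striding in `strided_steps` is an assumption about how a 24-step sampler might subsample 96 training timesteps.

```python
def linear_betas(timesteps=96, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise variances from beta_start to beta_end."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_signal(betas):
    """Running product of (1 - beta): the share of signal variance left at each step."""
    kept, out = 1.0, []
    for beta in betas:
        kept *= 1.0 - beta
        out.append(kept)
    return out


def strided_steps(timesteps=96, sample_steps=24):
    """Assumed striding: evenly spaced timesteps, visited from noisiest to cleanest."""
    if sample_steps == 1:
        return [timesteps - 1]
    stride = (timesteps - 1) / (sample_steps - 1)
    return [round(i * stride) for i in reversed(range(sample_steps))]


betas = linear_betas()
alpha_bars = cumulative_signal(betas)
steps = strided_steps()

print(len(betas), betas[0], round(betas[-1], 6))
print(round(alpha_bars[-1], 3))
print(steps[0], steps[-1], len(steps))
```

Raising `beta_end` or `timesteps` lowers the final cumulative signal, meaning training sees noisier inputs at the last timestep; lowering `sample_steps` trades sampling quality for speed.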

Lower-Level Model API

from tabdm import TabDM, TabDMConfig

model = TabDM(
    TabDMConfig(
        hidden_dims=(256, 256),
        time_embedding_dim=64,
        timesteps=64,
        sample_steps=16,
        epochs=50,
        batch_size=512,
        random_state=42,
    )
)

# `real` is the Quick Start dataframe, which has "job" and "owns_house" columns.
model.fit(real, discrete_columns=["job", "owns_house"])
synthetic = model.sample(100, random_state=42)

Use TabDM directly if you want to hold a model object, inspect fit_history_, or call sample repeatedly.

Evaluation

Install the optional evaluation dependencies first:

pip install "tabdm[eval]"

Then call evaluate_synthetic.

from tabdm import evaluate_synthetic

# `real` and `synthetic` are the Conditional Generation dataframes, which include "default".
report = evaluate_synthetic(
    real=real,
    synthetic=synthetic,
    target_column="default",
    random_state=42,
)

print(report["schema"])
print(report["distribution"])
print(report["validity"])
print(report["utility"])
print(report["trust"])

evaluate_synthetic can compute:

| Group | Included by default | Description |
| --- | --- | --- |
| `schema` | yes | Column presence, column order compatibility, and dtype mismatches. |
| `distribution` | yes | Categorical total variation distance, numeric KS distance, and numeric correlation delta. |
| `validity` | yes | Numeric bound violations and unseen categorical values. |
| `utility` | yes when `target_column` is provided | Train-on-synthetic, test-on-real downstream utility. |
| `trust` | yes | Exact row matches and nearest-neighbor privacy-screening metrics. |
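For readers unfamiliar with the two distribution metrics named in the table, the following pure-Python sketch computes a categorical total variation distance and a two-sample KS distance from their textbook definitions. The names `tvd_sketch` and `ks_sketch` are hypothetical and the code is not taken from TabDM; the library's own helpers may differ in details such as missing-value handling.

```python
from bisect import bisect_right
from collections import Counter


def tvd_sketch(real, synthetic):
    """Half the summed absolute difference between category frequencies."""
    real_freq, syn_freq = Counter(real), Counter(synthetic)
    categories = set(real_freq) | set(syn_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real) - syn_freq[c] / len(synthetic)) for c in categories
    )


def ks_sketch(real, synthetic):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    real_sorted, syn_sorted = sorted(real), sorted(synthetic)
    gap = 0.0
    for x in set(real_sorted) | set(syn_sorted):
        real_cdf = bisect_right(real_sorted, x) / len(real_sorted)
        syn_cdf = bisect_right(syn_sorted, x) / len(syn_sorted)
        gap = max(gap, abs(real_cdf - syn_cdf))
    return gap


print(tvd_sketch(["a", "a", "b", "b"], ["a", "a", "a", "b"]))  # 0.25
print(ks_sketch([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]))    # 0.0
print(ks_sketch([1.0, 2.0], [3.0, 4.0]))                        # 1.0
```

Both metrics range from 0 (identical distributions) to 1 (no overlap), so lower values indicate closer fidelity.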

You can select metric groups explicitly:

report = evaluate_synthetic(
    real=real,
    synthetic=synthetic,
    target_column="default",
    metrics=("schema", "distribution", "validity"),
    include_trust=False,
)

Task type is inferred as classification for object, string, categorical, boolean, and low-cardinality integer targets. Floating numeric targets are treated as regression. Override with task_type="classification" or task_type="regression" when needed.
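The inference rule above can be approximated on plain Python values as follows. This is a sketch under assumptions: `infer_task_type_sketch` is a hypothetical name, the `max_int_classes=20` cutoff for "low-cardinality" is invented for illustration, and treating high-cardinality integers as regression is an assumption rather than documented behavior.

```python
def infer_task_type_sketch(values, max_int_classes=20):
    """Classify a target column as 'classification' or 'regression'."""
    if any(isinstance(v, float) for v in values):
        return "regression"
    all_ints = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
    if all_ints and len(set(values)) > max_int_classes:
        # Assumption: many distinct integers behave like a numeric target.
        return "regression"
    return "classification"


print(infer_task_type_sketch(["yes", "no", "yes"]))  # classification
print(infer_task_type_sketch([True, False, True]))   # classification
print(infer_task_type_sketch([0, 1, 2, 1, 0]))       # classification
print(infer_task_type_sketch([0.5, 1.25, 3.0]))      # regression
print(infer_task_type_sketch(list(range(1000))))     # regression
```

When a heuristic like this could guess wrong, for example integer class codes with many levels, passing task_type explicitly avoids the ambiguity.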

Evaluation Helpers

The public evaluation helpers are:

- evaluate_synthetic
- evaluate_utility
- schema_report
- distribution_report
- validity_report
- trust_report
- exact_match_rate
- nearest_neighbor_privacy
- categorical_tvd
- numeric_ks
- numeric_correlation_delta
- infer_task_type

Privacy-screening metrics are diagnostics only. They do not prove anonymization, differential privacy, or legal compliance.
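As a rough picture of what such screening measures, the sketch below computes an exact-match rate and nearest-real-row distances on small numeric rows. It is a pure-Python illustration with hypothetical function names, not TabDM's implementation; a real check would typically scale numeric columns and encode categorical ones before measuring distance.

```python
import math


def exact_match_rate_sketch(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are identical to some real row."""
    real_set = {tuple(row) for row in real_rows}
    return sum(tuple(row) in real_set for row in synthetic_rows) / len(synthetic_rows)


def nearest_real_distances(real_points, synthetic_points):
    """Euclidean distance from each synthetic point to its closest real point."""
    return [min(math.dist(s, r) for r in real_points) for s in synthetic_points]


# Hand-written (age, balance) rows used only to exercise the functions.
real_rows = [(21.0, 1000.0), (35.0, 250.0), (44.0, 1900.0)]
synthetic_rows = [(21.0, 1000.0), (30.0, 250.0), (40.0, 1900.0), (44.0, 1903.0)]

print(exact_match_rate_sketch(real_rows, synthetic_rows))  # 0.25
print(nearest_real_distances(real_rows, synthetic_rows))   # [0.0, 5.0, 4.0, 3.0]
```

A high exact-match rate or many near-zero distances suggests that the generator may be copying training rows, which is a signal to investigate rather than a formal privacy result.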

Public API

- tabdm.TabDM
- tabdm.TabDMConfig
- tabdm.DataTransformer
- tabdm.SyntheticDataResult
- tabdm.fit_tabdm
- tabdm.generate_synthetic_data
- tabdm.infer_discrete_columns
- tabdm.evaluate_synthetic
- tabdm.evaluate_utility
- tabdm.trust_report

Testing

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 avoids unrelated third-party pytest plugin startup issues in environments with many globally installed plugins.

License

TabDM is distributed under the Apache License 2.0.

Contact

For project questions, contact:

Sam Urmian: sam.urmian@uib.no
Mohammad Khalil: mohammad.khalil@uib.no

Scope

This package ships the core generation and evaluation APIs.

TabDM is an alpha research/development package. Always evaluate generated data for the intended dataset, task, and privacy posture before use.
