mlgorithm/tabdm

TabDM

TabDM is a small Python library for generating synthetic mixed-type tabular data with diffusion in a transformed feature space.

It is designed for tabular datasets with numeric, categorical, boolean, ordinal, count, and positive continuous columns. The public API focuses on two workflows:

  • fit a model and generate synthetic rows in one call
  • fit once, then generate repeatedly with optional target or subgroup controls

TabDM also includes evaluation helpers for schema compatibility, distribution fidelity, downstream utility, validity checks, and privacy-screening metrics.

For exact function signatures, parameter semantics, and report shapes, see docs/API_REFERENCE.md.

Install

From PyPI:

pip install tabdm

For evaluation helpers:

pip install "tabdm[eval]"

For local development:

pip install -e ".[dev]"

Quick Start

import pandas as pd

from tabdm import generate_synthetic_data

real = pd.DataFrame(
    {
        "age": [21.0, 35.0, 44.0, 28.0, 31.0, 39.0],
        "job": ["admin", "tech", "admin", "services", "tech", "admin"],
        "owns_house": [True, False, True, False, True, False],
        "balance": [1000.0, 250.0, 1900.0, 750.0, 1200.0, 400.0],
    }
)

synthetic = generate_synthetic_data(
    real,
    num_rows=100,
    discrete_columns=["job", "owns_house"],
    epochs=50,
    timesteps=64,
    sample_steps=16,
    random_state=42,
)

generate_synthetic_data returns a pandas.DataFrame with the same column order as the training dataframe. Numeric outputs are clipped to the training range, count columns are rounded, and discrete columns are decoded to values observed during fitting.

Fit Once, Generate Many Times

Use fit_tabdm when you want to reuse a fitted model.

from tabdm import fit_tabdm

model = fit_tabdm(
    real,
    discrete_columns=["job", "owns_house"],
    epochs=50,
    timesteps=64,
    sample_steps=16,
    random_state=42,
)

synthetic_a = model.generate(100, random_state=1)
synthetic_b = model.generate(100, random_state=2)

Passing the same random_state to generate produces the same sampled rows for the same fitted model.

Conditional Generation

TabDM can treat target, sensitive, or explicitly named columns as conditioning columns. Condition columns are not generated by the diffusion model. They are provided by the caller or sampled from the training rows, then recombined with generated feature columns.

real = pd.DataFrame(
    {
        "age": [21, 35, 44, 28, 31, 39],
        "job": ["admin", "tech", "admin", "services", "tech", "admin"],
        "sex": ["f", "m", "f", "m", "f", "m"],
        "default": ["yes", "no", "yes", "no", "yes", "no"],
    }
)

synthetic = generate_synthetic_data(
    real,
    num_rows=200,
    discrete_columns=["job", "sex", "default"],
    target_column="default",
    sensitive_columns=["sex"],
    conditions={"default": "yes"},
    condition_strategy="prior",
    epochs=50,
    random_state=42,
)

Conditioning controls:

| Argument | Meaning |
| --- | --- |
| target_column | Downstream label to hold fixed or sample separately. |
| sensitive_columns | Subgroup columns to preserve or control. |
| condition_on | Additional columns to use as generation conditions. |
| conditions | Fixed values or row-wise values to use at generation time. |
| condition_strategy | How unspecified condition columns are sampled: prior or balanced. |
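The difference between the two condition_strategy values can be illustrated with a minimal, standard-library sketch. It does not use TabDM; the helper sample_condition_rows is hypothetical, and it assumes that prior draws from the training condition rows as they occur while balanced draws uniformly from the distinct condition rows.

```python
import random
from collections import Counter


def sample_condition_rows(training_rows, num_rows, strategy="prior", seed=None):
    """Hypothetical helper: draw condition rows under a 'prior' or 'balanced' rule."""
    rng = random.Random(seed)
    if strategy == "prior":
        # Draw from the rows as observed, so frequent combinations stay frequent.
        pool = list(training_rows)
    elif strategy == "balanced":
        # Draw from the distinct combinations, so each one is equally likely.
        pool = sorted(set(training_rows))
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [rng.choice(pool) for _ in range(num_rows)]


# Condition rows are (sex, default) pairs; ("f", "yes") dominates the training data.
training = [("f", "yes")] * 8 + [("m", "no")] * 2

prior_counts = Counter(sample_condition_rows(training, 10_000, "prior", seed=0))
balanced_counts = Counter(sample_condition_rows(training, 10_000, "balanced", seed=0))

print(prior_counts)     # roughly 80% ("f", "yes")
print(balanced_counts)  # roughly 50% each
```

Under these assumptions, balanced is the option to reach for when a rare subgroup should be represented as often as a common one in the synthetic output.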

conditions can be:

  • None: sample all condition columns from the training condition rows
  • a mapping of scalar values: fix those columns and sample the remaining condition columns from matching training rows
  • a mapping of sequences: provide row-wise condition values for all condition columns
  • a one-row dataframe: repeat the row for every generated sample
  • a num_rows-row dataframe: use row-wise condition values directly
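The scalar-mapping case above (fix some columns, sample the rest from matching training rows) can be sketched in plain Python. This is an illustration under assumptions, not TabDM's code: resolve_scalar_conditions is a hypothetical name, rows are represented as dicts, and raising an error when nothing matches is a choice made here rather than documented behavior.

```python
import random


def resolve_scalar_conditions(training_rows, conditions, num_rows, seed=None):
    """Hypothetical helper: fix the given columns, fill the rest from matching rows."""
    rng = random.Random(seed)
    # Keep only training condition rows that agree with every fixed value.
    matching = [
        row
        for row in training_rows
        if all(row[col] == value for col, value in conditions.items())
    ]
    if not matching:
        # Assumption: with no matching row there is nothing to sample from.
        raise ValueError(f"no training rows match {conditions!r}")
    # Each generated sample gets a copy of one matching condition row.
    return [dict(rng.choice(matching)) for _ in range(num_rows)]


training = [
    {"sex": "f", "default": "yes"},
    {"sex": "m", "default": "no"},
    {"sex": "f", "default": "yes"},
    {"sex": "m", "default": "yes"},
]

resolved = resolve_scalar_conditions(training, {"default": "yes"}, num_rows=5, seed=0)
print(resolved)
```

Every resolved row has default fixed to "yes", while sex varies according to the training rows that carry that label.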

Column Metadata

TabDM can use metadata for columns whose dtype alone is not enough.

metadata = {
    "grade_band": {"type": "ordinal", "order": ["low", "mid", "high"]},
    "incidents": {"type": "count"},
    "tuition": {"type": "positive_continuous"},
}

# Here `real` is assumed to contain the grade_band, incidents, tuition, and district columns.
synthetic = generate_synthetic_data(
    real,
    discrete_columns=["district"],
    column_metadata=metadata,
    random_state=42,
)

Supported metadata types:

| Type | Transform behavior | Inverse behavior |
| --- | --- | --- |
| ordinal | Encoded as an ordered scalar using the supplied or inferred order. | Rounded to the nearest ordinal level. |
| count | Encoded with log1p after clipping to non-negative values. | Decoded with expm1, clipped to the training range, and rounded. |
| positive_continuous | Encoded with log1p after clipping to non-negative values. | Decoded with expm1 and clipped to the training range. |
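The count round trip described in the table can be sketched with the standard library alone. The function names encode_count and decode_count are hypothetical and the code is independent of TabDM; it only follows the documented order of operations (clip, log1p, then expm1, clip to the training range, round).

```python
import math


def encode_count(value):
    # Clip to non-negative, then compress the scale with log1p.
    return math.log1p(max(value, 0.0))


def decode_count(encoded, train_min, train_max):
    # Invert with expm1, clip to the training range, then round to an integer.
    raw = math.expm1(encoded)
    clipped = min(max(raw, train_min), train_max)
    return int(round(clipped))


incidents = [0, 3, 12, 40]
lo, hi = min(incidents), max(incidents)

encoded = [encode_count(v) for v in incidents]
round_trip = [decode_count(e, lo, hi) for e in encoded]

# A value pushed above the training maximum in transformed space is clipped back.
noisy = decode_count(encoded[-1] + 0.5, lo, hi)

print(round_trip)  # [0, 3, 12, 40]
print(noisy)       # 40
```

A positive_continuous column would follow the same path without the final rounding step.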

Object, string, categorical, and boolean columns are inferred as discrete when discrete_columns is not supplied. Numeric columns are treated as continuous unless they are given a type in column_metadata.

Generation Parameters

generate_synthetic_data exposes the fitting and sampling controls directly.

| Parameter | Default | Description |
| --- | --- | --- |
| dataframe | required | Training dataframe. Must contain at least one row. |
| num_rows | len(dataframe) | Number of synthetic rows. Must be positive when provided. |
| discrete_columns | inferred | Categorical columns. Accepts column names. |
| column_metadata | None | Metadata for ordinal, count, or positive continuous columns. |
| target_column | None | Column to condition on rather than generate. |
| sensitive_columns | None | Additional condition columns, usually subgroup attributes. |
| condition_on | None | Other condition columns. |
| conditions | None | Fixed or row-wise generation conditions. |
| condition_strategy | "prior" | prior samples training condition rows; balanced samples unique condition rows uniformly. |
| hidden_dims | (256, 256) | MLP denoiser hidden layer sizes. |
| time_embedding_dim | 64 | Sinusoidal timestep embedding size. |
| timesteps | 96 | Number of training diffusion timesteps. |
| sample_steps | 24 | Number of deterministic reverse steps used during sampling. |
| epochs | 120 | Training epochs. |
| batch_size | 512 | Training and sampling batch size. |
| learning_rate | 1e-3 | AdamW learning rate. |
| weight_decay | 1e-6 | AdamW weight decay. |
| beta_start | 1e-4 | First value in the linear noise schedule. |
| beta_end | 0.02 | Last value in the linear noise schedule. |
| dropout | 0.0 | Dropout inside the denoiser MLP. |
| discrete_loss_weight | 2.0 | Multiplier for one-hot categorical spans in the training loss. |
| prediction_clip | 1.5 | Clamp applied to predicted transformed features. |
| grad_clip_norm | 1.0 | Gradient norm clipping threshold. Use 0 to disable. |
| device | "cpu" | "cpu" or a CUDA device string. Falls back to CPU if CUDA is unavailable. |
| random_state | None | Seeds Python, NumPy, and Torch during fitting, and seeds sampling noise when generating. |
| verbose | False | Print training loss periodically. |
| return_model | False | Return SyntheticDataResult with the fitted model and metadata. |
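To make beta_start, beta_end, and timesteps more concrete, here is a minimal sketch of a linear noise schedule in the standard DDPM style. The table only states that the schedule is linear between the two values; the cumulative-product signal and noise scales shown here are the usual construction and are an assumption about TabDM's internals. The function names are hypothetical.

```python
import math


def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced betas from beta_start to beta_end (inclusive)."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alpha_bar(betas):
    """Running product of (1 - beta): the fraction of signal variance kept at each step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


betas = linear_beta_schedule(96)  # 96 is the documented default for timesteps
alpha_bars = cumulative_alpha_bar(betas)

# Assumed DDPM-style noising: x_t = signal_scale[t] * x_0 + noise_scale[t] * eps
signal_scale = [math.sqrt(a) for a in alpha_bars]
noise_scale = [math.sqrt(1.0 - a) for a in alpha_bars]

print(f"first step keeps {signal_scale[0]:.4f} of the signal")
print(f"last step keeps  {signal_scale[-1]:.4f} of the signal")
```

With only 96 steps and beta_end=0.02, a schedule built this way leaves a noticeable share of the signal at the final timestep, which is worth keeping in mind when changing timesteps or beta_end.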

Lower-Level Model API

from tabdm import TabDM, TabDMConfig

model = TabDM(
    TabDMConfig(
        hidden_dims=(256, 256),
        time_embedding_dim=64,
        timesteps=64,
        sample_steps=16,
        epochs=50,
        batch_size=512,
        random_state=42,
    )
)

model.fit(real, discrete_columns=["job", "owns_house"])
synthetic = model.sample(100, random_state=42)

Use TabDM directly if you want to hold a model object, inspect fit_history_, or call sample repeatedly.

Evaluation

Install the optional evaluation dependencies first:

pip install "tabdm[eval]"

Then call evaluate_synthetic.

from tabdm import evaluate_synthetic

report = evaluate_synthetic(
    real=real,
    synthetic=synthetic,
    target_column="default",
    random_state=42,
)

print(report["schema"])
print(report["distribution"])
print(report["validity"])
print(report["utility"])
print(report["trust"])

evaluate_synthetic can compute:

| Group | Included by default | Description |
| --- | --- | --- |
| schema | yes | Column presence, column order compatibility, and dtype mismatches. |
| distribution | yes | Categorical total variation distance, numeric KS distance, and numeric correlation delta. |
| validity | yes | Numeric bound violations and unseen categorical values. |
| utility | yes when target_column is provided | Train-on-synthetic, test-on-real downstream utility. |
| trust | yes | Exact row matches and nearest-neighbor privacy-screening metrics. |
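For readers unfamiliar with the two distribution distances named in the table, the following standard-library sketch computes them from their textbook definitions. The functions carry a _sketch suffix because they are illustrations, not TabDM's helpers, and the library's versions may differ in details such as handling of missing values.

```python
from collections import Counter


def categorical_tvd_sketch(real_values, synthetic_values):
    """Total variation distance: half the summed absolute frequency differences."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synthetic_values))
        for c in categories
    )


def numeric_ks_sketch(real_values, synthetic_values):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""

    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    # The gap can only change at observed values, so checking those is enough.
    points = sorted(set(real_values) | set(synthetic_values))
    return max(abs(ecdf(real_values, x) - ecdf(synthetic_values, x)) for x in points)


real_jobs = ["admin", "tech", "admin", "services"]
synth_jobs = ["admin", "tech", "tech", "tech"]
tvd = categorical_tvd_sketch(real_jobs, synth_jobs)

real_age = [21.0, 28.0, 35.0, 44.0]
synth_age = [30.0, 36.0, 40.0, 45.0]
ks = numeric_ks_sketch(real_age, synth_age)

print(tvd)  # 0.5
print(ks)   # 0.5
```

Both values lie in [0, 1], where 0 means the real and synthetic columns have identical empirical distributions.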

You can select metric groups explicitly:

report = evaluate_synthetic(
    real=real,
    synthetic=synthetic,
    target_column="default",
    metrics=("schema", "distribution", "validity"),
    include_trust=False,
)

Task type is inferred as classification for object, string, categorical, boolean, and low-cardinality integer targets. Floating numeric targets are treated as regression. Override with task_type="classification" or task_type="regression" when needed.
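The inference rule above can be approximated in plain Python as follows. This is a rough sketch, not the library's infer_task_type: it inspects Python value types instead of pandas dtypes, the max_integer_classes threshold of 20 is a made-up placeholder for "low-cardinality", and treating high-cardinality integers as regression is an assumption.

```python
def infer_task_type_sketch(target_values, max_integer_classes=20):
    """Hypothetical approximation of the classification/regression rule."""
    values = list(target_values)
    # Strings and booleans are label-like, so they map to classification.
    if all(isinstance(v, (str, bool)) for v in values):
        return "classification"
    # Integers count as class labels only when there are few distinct values
    # (the threshold is an assumption, not a documented TabDM default).
    if all(isinstance(v, int) for v in values):
        if len(set(values)) <= max_integer_classes:
            return "classification"
        return "regression"
    # Anything containing floats is treated as a regression target.
    return "regression"


print(infer_task_type_sketch(["yes", "no", "yes"]))  # classification
print(infer_task_type_sketch([0, 1, 1, 0]))          # classification
print(infer_task_type_sketch([0.5, 1.25, 3.0]))      # regression
```

When a target sits near the boundary, for example an integer score with many levels, passing task_type explicitly avoids relying on the heuristic.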

Evaluation Helpers

The public evaluation helpers are:

  • evaluate_synthetic
  • evaluate_utility
  • schema_report
  • distribution_report
  • validity_report
  • trust_report
  • exact_match_rate
  • nearest_neighbor_privacy
  • categorical_tvd
  • numeric_ks
  • numeric_correlation_delta
  • infer_task_type

Privacy-screening metrics are diagnostics only. They do not prove anonymization, differential privacy, or legal compliance.
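As an illustration of what screening metrics of this kind measure, the sketch below computes an exact-match rate and nearest-real-record distances with the standard library. The function names are hypothetical and the code is not TabDM's implementation; in particular, it assumes rows are already numeric tuples for the distance step and uses plain Euclidean distance.

```python
import math


def exact_match_rate_sketch(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are identical to some real row."""
    real_set = set(real_rows)
    return sum(row in real_set for row in synthetic_rows) / len(synthetic_rows)


def nearest_real_distances(real_points, synthetic_points):
    """Euclidean distance from each synthetic point to its closest real point."""
    return [min(math.dist(s, r) for r in real_points) for s in synthetic_points]


real_rows = [(21.0, "admin"), (35.0, "tech"), (44.0, "admin")]
synthetic_rows = [(21.0, "admin"), (30.0, "tech"), (44.0, "services"), (35.0, "tech")]
match_rate = exact_match_rate_sketch(real_rows, synthetic_rows)

real_points = [(0.0, 0.0), (3.0, 4.0)]
synthetic_points = [(0.0, 0.0), (3.0, 0.0)]
distances = nearest_real_distances(real_points, synthetic_points)

print(match_rate)  # 0.5
print(distances)   # [0.0, 3.0]
```

A high match rate or many near-zero distances suggests the generator may be copying training rows, which is a signal to investigate rather than a formal privacy measurement.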

Public API

  • tabdm.TabDM
  • tabdm.TabDMConfig
  • tabdm.DataTransformer
  • tabdm.SyntheticDataResult
  • tabdm.fit_tabdm
  • tabdm.generate_synthetic_data
  • tabdm.infer_discrete_columns
  • tabdm.evaluate_synthetic
  • tabdm.evaluate_utility
  • tabdm.trust_report

Testing

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q

PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 avoids unrelated third-party pytest plugin startup issues in environments with many globally installed plugins.

License

TabDM is distributed under the Apache License 2.0.

Contact

For project questions, contact:

Scope

This package ships the core generation and evaluation APIs.

TabDM is an alpha research/development package. Always evaluate generated data for the intended dataset, task, and privacy posture before use.
