.. _generators:
Generators
==========
`pysdg` supports the following generators with their source links:
"replica/seq"
*************
**Name:** Sequential Decision Trees.
**Official Website:** `Aetion `_ (previously `Replica Analytics `_)
**Reference Publication:** :cite:p:`ElEmam2020`
**Licensing:** Proprietary. A license must be obtained from `Aetion `_ (formerly known as Replica).
**Overview:** Similar to the chaining method used for multi-label classification problems, sequential decision trees (SEQ) generate synthetic data one variable at a time, using conditional trees fitted in sequence :cite:p:`Hothorn2006`, :cite:p:`Read2009`. SEQ has been commonly used for data synthesis in the healthcare and social science domains :cite:p:`Quintana2020`.
"synthcity/bayesian_network"
****************************
**Name:** Bayesian Network.
**Official Website:** `Bayesian Network `_
**Reference Publication:** :cite:p:`Ankan2015`
**Licensing:** `Apache License 2.0 (BN) `_
**Overview:** Bayesian Networks (BN) are models based on directed acyclic graphs (DAGs) whose nodes represent the random variables and whose arcs represent the dependencies among these variables. Constructing the BN model takes two steps: first finding the network topology, then estimating the parameters [14]. Starting from a random initial network structure, the Hill Climb heuristic search is used to find the structure. The conditional probability distributions are then estimated using the maximum a posteriori estimator [15]. Once the network structure and the parameters are estimated, synthetic records are generated by sampling the nodes with no incoming arcs from their marginal distributions and then predicting the remaining connected variables using the estimated parameters [16].
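The generation step described above can be illustrated with ancestral sampling over a small hand-written network. This is a minimal sketch, not the ``synthcity`` implementation: the variable names, the graph, and the probability tables are hypothetical, and structure learning and parameter estimation are assumed to have already happened.

.. code-block:: python

   import random
   from graphlib import TopologicalSorter

   # Hypothetical network: age_group -> disease <- smoker.
   # Each node maps to the list of its parents.
   parents = {"age_group": [], "smoker": [], "disease": ["age_group", "smoker"]}

   # Hypothetical conditional probability tables, keyed by the tuple of parent values.
   cpts = {
       "age_group": {(): {"young": 0.6, "old": 0.4}},
       "smoker": {(): {"yes": 0.3, "no": 0.7}},
       "disease": {
           ("young", "yes"): {"yes": 0.2, "no": 0.8},
           ("young", "no"): {"yes": 0.05, "no": 0.95},
           ("old", "yes"): {"yes": 0.5, "no": 0.5},
           ("old", "no"): {"yes": 0.2, "no": 0.8},
       },
   }

   def sample_row(parents, cpts, rng):
       """Draw one synthetic record, visiting every node after its parents."""
       row = {}
       for node in TopologicalSorter(parents).static_order():
           context = tuple(row[p] for p in parents[node])
           dist = cpts[node][context]
           row[node] = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
       return row

   rng = random.Random(0)
   synthetic = [sample_row(parents, cpts, rng) for _ in range(5)]

Root nodes have an empty parent tuple, so they are drawn from their marginal distributions, and every other node is drawn conditionally on values that were already sampled.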
"synthcity/ctgan"
*****************
**Name:** Conditional Tabular Generative Adversarial Network.
**Official Website:** `CTGAN `_
**Reference Publication:** :cite:p:`xu2019modelingtabulardatausing`
**Licensing:** `Apache License 2.0 (CTGAN) `_
**Overview:** A basic generative adversarial network (GAN) consists of two artificial neural networks (ANNs), a generator and a discriminator [17]. The generator and the discriminator play a min-max game. The input to the generator is noise, while its output is synthetic data. The discriminator has two inputs: the real training data and the synthetic data generated by the generator. The output of the discriminator indicates whether its input is real or synthetic. The generator is trained to *trick* the discriminator by generating samples that look real. On the other hand, the discriminator is trained to maximize its discriminatory capability.
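The min-max game is commonly written as the value function below, following the original GAN formulation. The symbols are introduced here for illustration and do not appear elsewhere in this page: :math:`D` is the discriminator, :math:`G` the generator, :math:`p_{\text{data}}` the distribution of the real training data, and :math:`p_z` the noise distribution.

.. math::

   \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr] + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]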
Among all the variations of GAN architectures, the conditional tabular GAN (CTGAN) is often used in tabular data synthesis [18]. CTGAN builds on conditional GANs by addressing the multimodal distributions of continuous variables and the highly imbalanced categorical variables [16]. CTGAN solves the first problem by proposing a per-mode normalization technique. For the second problem, each category of a categorical variable serves as the condition passed to the GAN.
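The per-mode normalization can be sketched as follows. This is a simplified illustration under stated assumptions: the two-mode mixture and its parameters are hypothetical and assumed to be already fitted, and the mode is chosen by highest density to keep the example deterministic, whereas CTGAN samples the mode from the mode probabilities.

.. code-block:: python

   import math

   # Hypothetical, already-fitted mixture for one continuous column:
   # (weight, mean, standard deviation) per mode.
   modes = [(0.7, 30.0, 5.0), (0.3, 70.0, 10.0)]

   def gaussian_pdf(x, mean, std):
       """Density of a univariate Gaussian at x."""
       return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

   def encode(x, modes):
       """Return (one-hot mode indicator, value normalized within that mode)."""
       densities = [w * gaussian_pdf(x, m, s) for w, m, s in modes]
       # Assumption: pick the most likely mode instead of sampling one.
       k = max(range(len(modes)), key=lambda i: densities[i])
       _, mean, std = modes[k]
       one_hot = [1 if i == k else 0 for i in range(len(modes))]
       alpha = (x - mean) / (4 * std)
       return one_hot, alpha

   def decode(one_hot, alpha, modes):
       """Invert encode(): map the normalized value back to the original scale."""
       _, mean, std = modes[one_hot.index(1)]
       return alpha * 4 * std + mean

   one_hot, alpha = encode(75.0, modes)
   restored = decode(one_hot, alpha, modes)

Each continuous value is thus represented by a mode indicator plus a scalar that is small relative to its own mode, which is easier for the network to model than a single globally scaled value.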
"synthcity/tvae"
****************
**Name:** Tabular Variational Autoencoder.
**Official Website:** `TVAE `_
**Reference Publication:** :cite:p:`xu2019modelingtabulardatausing`
**Licensing:** `Apache License 2.0 (TVAE) `_
**Overview:** Variational autoencoders (VAE) use ANNs and involve two steps (encoding and decoding) to generate new samples [19]. First, an encoder compresses the input data into a lower-dimensional latent space, in which the data points are represented by distributions. The second step is a decoding process, in which new data samples are reconstructed as output from the latent space. The neural network is optimized by minimizing the reconstruction loss between the output and the input, together with a regularization term that keeps the latent distributions close to a prior. In TVAE, the generator directly models the distribution of mixed-type tabular data, which includes both continuous and discrete variables. The model learns to represent continuous variables using Gaussian distributions, while categorical variables are handled using softmax outputs. This allows TVAE to generate realistic synthetic tabular data that preserves the statistical properties of the original dataset.
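The two ingredients of a VAE training step can be sketched without a neural network. This is a minimal illustration, not the TVAE implementation: the encoder and decoder outputs below are hypothetical hard-coded values, and a squared error stands in for TVAE's per-column Gaussian and softmax reconstruction terms.

.. code-block:: python

   import math
   import random

   def reparameterize(mu, log_var, rng):
       """Draw z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
       return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

   def kl_to_standard_normal(mu, log_var):
       """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
       return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

   def squared_error(x, x_hat):
       """Stand-in reconstruction loss between an input and its reconstruction."""
       return sum((a - b) ** 2 for a, b in zip(x, x_hat))

   rng = random.Random(0)
   mu, log_var = [0.5, -0.2], [-1.0, -0.5]  # hypothetical encoder output for one record
   z = reparameterize(mu, log_var, rng)     # latent sample that a decoder would consume
   x, x_hat = [1.0, 0.0], [0.9, 0.1]        # hypothetical input and reconstruction
   loss = squared_error(x, x_hat) + kl_to_standard_normal(mu, log_var)

At generation time the encoder is not needed: latent vectors are drawn from the prior and passed through the decoder.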
"synthcity/rtvae"
*****************
**Name:** Robust Tabular Variational Autoencoder.
**Official Website:** `RTVAE `_
**Reference Publication:** :cite:p:`akrami2020robustvariationalautoencodertabular`
**Licensing:** `Apache License 2.0 (RTVAE) `_
**Overview:** RTVAE is an extension of the standard TVAE designed to handle outliers in tabular data by incorporating β-divergence into the VAE framework. Unlike TVAE, whose reconstruction loss is based on the traditional KL-divergence, RTVAE replaces this with β-divergence, making the model more resilient to anomalies and contaminated datasets. This modification reduces the sensitivity of the reconstruction loss to extreme values, which can otherwise disproportionately affect training. RTVAE is reported to outperform the standard TVAE when the data contains outliers :cite:p:`akrami2020robustvariationalautoencodertabular`.
"synthcity/arf"
***************
**Name:** Adversarial Random Forests.
**Official Website:** `ARF `_
**Reference Publication:** :cite:p:`watson2023adversarialrandomforestsdensity`
**Licensing:** `Apache License 2.0 (ARF) `_
**Overview:** Inspired by Generative Adversarial Networks (GANs), Adversarial Random Forests (ARFs) employ a recursive process where trees iteratively learn the structural properties of data by alternating between rounds of data generation and discrimination. This allows the model to gradually refine its understanding of the data distribution. Unlike classic tree-based models, ARFs provide smooth density estimations and can generate fully synthetic data, making them highly effective for tasks such as data augmentation and imputation.
"synthcity/nflow"
*****************
**Name:** Neural Spline Flows.
**Official Website:** `NFlow `_
**Reference Publication:** :cite:p:`durkan2019neuralsplineflows`
**Licensing:** `Apache License 2.0 (NFLOW) `_
**Overview:** Neural Spline Flows (NFLOW) are a type of normalizing flow model designed to enhance the flexibility of transformations used in generative models and density estimation. NFLOW utilizes monotonic rational-quadratic splines to implement invertible transformations, offering a significant improvement over traditional affine or additive transformations typically used in flow-based models.
The key advantage of NFLOW lies in its ability to model complex, multi-modal distributions by allowing smooth, non-linear deformations of data while maintaining analytic invertibility. This is achieved by defining transformations using a series of spline segments, ensuring that the inverse and Jacobian determinant calculations remain efficient and exact. By incorporating spline-based transformations, NFLOW bridges the performance gap between autoregressive flows and coupling-based flows, resulting in improved density estimation and generative modeling performance for high-dimensional data, such as images and tabular datasets.
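The role of the inverse and the Jacobian determinant follows from the change-of-variables formula that all normalizing flows rely on. The notation is introduced here for illustration: :math:`f` is the invertible transformation mapping data :math:`x` to the base variable :math:`z = f(x)`, and :math:`p_Z` is the base density (typically a standard normal).

.. math::

   \log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|

Training maximizes this log-likelihood, and sampling draws :math:`z \sim p_Z` and applies :math:`f^{-1}`, which is why both directions must be cheap to evaluate.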
"yandex/tabddpm"
****************
**Name:** Tabular Denoising Diffusion Probabilistic Model.
**Official Website:** `TabDDPM `_
**Reference Publication:** :cite:p:`kotelnikov_tabddpm_2023`
**Licensing:** `MIT License (TABDDPM) `_
**Overview:** Tabular Denoising Diffusion Probabilistic Model (TabDDPM) is a generative model designed to produce high-quality synthetic tabular data by leveraging diffusion models, which iteratively corrupt and denoise data to approximate complex distributions through a Markov chain process. The forward Markov process gradually adds noise to the data, transforming it into a Gaussian or categorical noise distribution, while the reverse Markov process, learned by a neural network, progressively denoises the data to reconstruct the original distribution. Unlike traditional GANs or VAEs, TabDDPM handles heterogeneous tabular datasets with mixed numerical and categorical features by applying Gaussian diffusion to continuous variables and multinomial diffusion to categorical ones. This flexibility allows TabDDPM to outperform other generative models in capturing feature correlations and preserving data privacy. Its ability to generate realistic synthetic data makes it useful for data augmentation, imbalanced datasets, and privacy-preserving applications.
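The forward (noising) process for continuous features can be sketched in closed form. This is a minimal illustration of Gaussian diffusion only: the linear schedule, its endpoints, and the number of steps are illustrative assumptions rather than TabDDPM's configuration, and the multinomial diffusion for categorical features and the learned reverse process are omitted.

.. code-block:: python

   import math
   import random

   def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
       """Per-step noise variances, increasing linearly (illustrative values)."""
       step = (beta_end - beta_start) / (num_steps - 1)
       return [beta_start + i * step for i in range(num_steps)]

   def cumulative_alphas(betas):
       """alpha_bar_t = product of (1 - beta_s) for s up to t."""
       out, prod = [], 1.0
       for b in betas:
           prod *= 1.0 - b
           out.append(prod)
       return out

   def q_sample(x0, t, alpha_bars, rng):
       """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
       a = alpha_bars[t]
       return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]

   alpha_bars = cumulative_alphas(linear_betas(1000))
   rng = random.Random(0)
   x0 = [0.3, -1.2, 0.8]                      # hypothetical standardized numeric features
   x_t = q_sample(x0, 999, alpha_bars, rng)   # almost pure Gaussian noise at the last step

As ``alpha_bars`` decays toward zero, the signal term vanishes and ``x_t`` approaches a standard normal sample, which is the starting point for the reverse process.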
.. note::

   The `yandex/tabddpm` model requires a GPU to perform properly. Ensure that your environment has access to a compatible GPU to achieve optimal performance.
"amazon/tabsyn"
****************
**Name:** Mixed-Type Tabular Data Model (comprises transformer-based VAE and diffusion models).
**Official Website:** `TabSyn `_
**Reference Publication:** :cite:p:`zhang2023mixed`
**Licensing:** `Apache License 2.0 (TABSYN) `_
**Overview:** TabSyn is a generative model for synthesizing tabular data by combining a Variational Autoencoder (VAE) with score-based diffusion models in the latent space. Unlike TabDDPM, which applies diffusion directly to raw tabular data, TabSyn first encodes the data into a continuous latent space using a transformer-based VAE, enabling it to handle mixed numerical and categorical features more effectively. This approach simplifies the diffusion process, reduces the need for separate handling of data types, and improves generation quality. TabSyn also employs a linear noise schedule, allowing it to generate high-fidelity data with fewer reverse steps (under 20), making it faster and more efficient than TabDDPM, which typically requires more steps for similar performance. By operating in the latent space, TabSyn outperforms TabDDPM in capturing complex column dependencies and producing more accurate synthetic data.
.. note::

   The `amazon/tabsyn` model requires a GPU to perform properly. Ensure that your environment has access to a compatible GPU to achieve optimal performance.