Changelog
This file documents the notable changes to this project. Only selected versions are released.
[2.7.2] - 2025-06-12
Fixed: Automatic deletion of the user’s work directory if the model is not saved.
Fixed: Incorrect location of the temporary folder for yandex/tabddpm when generating from a saved model.
[2.7.1] - 2025-05-21
Fixed: Error thrown if the quasi_idxs key is missing from the input json info file.
Fixed: Incorrect number of rows generated in yandex/tabddpm (#71).
Changed: Eliminated dependency on tomli_i.
Fixed: pysdg_vault_path does not work in gen.gen() for yandex/tabddpm (#70).
[2.7.0] - 2025-05-12
Warning: This release is NOT compatible with previous versions due to module restructuring and function renaming.
Changed: Removed all metrics.
Changed: Restructured package.
[2.6.0b0] - 2025-05-06
Added: Amazon Tabsyn transformer/diffusion-based generator amazon/tabsyn.
Corrected: Checking by bool in the Generator.gen method is replaced by checking by column name if _missing is included.
[2.5.0rc1] - 2025-04-02
Fixed: Hamming distance between NaNs in calc_membership_vuln.
Added: A boolean argument na_is_match in calc_membership_vuln to treat compared NAs as either a match or a mismatch when calculating the Hamming distance.
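The effect of a flag like na_is_match can be illustrated with a minimal sketch. This is not the pysdg implementation: the function hamming_distance, its signature, and the default value of the flag are assumptions made for illustration, and only the argument name na_is_match comes from the entry above.

```python
import math


def _is_na(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def hamming_distance(rec_a, rec_b, na_is_match=True):
    """Count positions where two equal-length records differ (hypothetical helper).

    When both values at a position are missing, the position counts as a
    match if na_is_match is True and as a mismatch otherwise. A missing
    value compared with a present value always counts as a mismatch.
    """
    if len(rec_a) != len(rec_b):
        raise ValueError("records must have the same length")
    distance = 0
    for a, b in zip(rec_a, rec_b):
        a_na, b_na = _is_na(a), _is_na(b)
        if a_na and b_na:
            distance += 0 if na_is_match else 1
        elif a_na or b_na or a != b:
            distance += 1
    return distance


real = ["F", 34.0, float("nan"), "ON"]
synth = ["F", 35.0, float("nan"), "ON"]
print(hamming_distance(real, synth, na_is_match=True))   # 1
print(hamming_distance(real, synth, na_is_match=False))  # 2
```

The explicit NaN check is needed because NaN never compares equal to itself in Python, so a plain equality test would always treat two missing values as a mismatch.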
[2.5.0rc0] - 2025-04-02
Warning: This release is NOT compatible with previous versions due to module restructuring and function renaming.
Changed: Module names have been updated to align with the new naming conventions.
Deprecated: calc_membership_risk. Use calc_membership_vuln.
Deprecated: calc_attribution_risk. Use calc_attribute_vuln_replica.
Deprecated: calc_inference_risk. Use calc_attribution_vuln.
Fixed: Bug in copula metadata generation.
Changed: Improved documentation and added references.
[2.4.4b4] - 2025-03-27 (RELEASED - BETA)
Fixed: Error thrown by calc_univar_hellinger_distance when encoded columns of real and synth data do not match.
Fixed: Median calculation in calc_univar_hellinger_distance when NA is encountered.
Fixed: Multivariate Hellinger distance datetime handling.
Added: Inference risk privacy metric.
Added: Vulnerability-utility metric.
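For readers unfamiliar with the metric named in the fixes above, the following is a minimal sketch of the textbook Hellinger distance between two categorical columns. It is not the code behind calc_univar_hellinger_distance: the function name hellinger_distance is hypothetical, and how pysdg encodes columns, handles NA, or aggregates across variables is not shown here.

```python
import math
from collections import Counter


def hellinger_distance(real_values, synth_values):
    """Hellinger distance between the empirical distributions of two columns.

    Uses the standard definition H(P, Q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2),
    which ranges from 0 (identical distributions) to 1 (disjoint support).
    """
    n_real, n_synth = len(real_values), len(synth_values)
    if n_real == 0 or n_synth == 0:
        raise ValueError("both columns must be non-empty")
    real_counts = Counter(real_values)
    synth_counts = Counter(synth_values)
    total = 0.0
    # Iterate over every category seen in either column; Counter returns 0
    # for a category that is absent from one side.
    for category in set(real_counts) | set(synth_counts):
        p = real_counts[category] / n_real
        q = synth_counts[category] / n_synth
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total / 2.0)


print(hellinger_distance(["a", "b", "a"], ["a", "b", "a"]))  # 0.0
print(hellinger_distance(["a", "a"], ["b", "b"]))            # 1.0
```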
[2.4.4b3] - 2025-03-06
Added: Default option to infer datatypes in addition to explicitly passing the JSON file for the input dataset in the Generator class.
Added: Function to calculate multivariate Hellinger distance calc_multivar_hellinger_distance.
Deprecated: Renamed the function calc_mmbrshp_risk to calc_membership_risk.
Deprecated: Renamed the function calc_univariate_hlngr_distance to calc_univar_hellinger_distance.
Fixed: Issue with “Failed to remove ‘None’” in the unload method.
Added: Support for specifying generator names in the format source/generator (e.g., synthcity/ctgan). The old format remains functional.
Added: Introduced the gen_params attribute in Generator to allow users to define generator hyperparameters in the format used by the generator’s source, replacing the synthcity_params attribute, which will eventually be deprecated.
Changed: Redefined estimate agreement to indicate whether the synthetic data estimate falls within the confidence interval (CI) of the real data.
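The redefined estimate agreement can be pictured as a simple interval check. The sketch below is illustrative only: the function estimate_agreement and its parameters are hypothetical names, not part of the pysdg API, and treating the CI bounds as inclusive is an assumption.

```python
def estimate_agreement(synth_estimate, real_ci_lower, real_ci_upper):
    """Return True if the synthetic estimate lies within the real-data CI.

    Bounds are treated as inclusive (an assumption of this sketch).
    """
    if real_ci_lower > real_ci_upper:
        raise ValueError("lower CI bound must not exceed the upper bound")
    return real_ci_lower <= synth_estimate <= real_ci_upper


# Hypothetical numbers: a real-data coefficient with 95% CI (0.12, 0.48).
print(estimate_agreement(0.30, 0.12, 0.48))  # True
print(estimate_agreement(0.55, 0.12, 0.48))  # False
```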
[2.4.4b2] - 2025-02-07 (RELEASED - BETA)
Fixed: The calculation of directional decision agreement.
Deprecated: The compare_estimates function now returns a dictionary instead of an array. The array return is deprecated.
[2.4.3b0] - 2025-02-07 (RELEASED - BETA)
Fixed: Handling of incorrect quasi_vars in the membership disclosure function.
[2.4.2b] - 2025-02-05
Added: Option to pass data info as json/dict to the membership disclosure function.
Fixed: Ensured SDV library is installed as a dependency for the diffusion model.
[2.4.0b] - 2025-01-30
Fixed: Corrected membership disclosure algorithm. Note: The previous implementation resulted in inflated membership disclosure numbers, which means that if the disclosure value was low then the true value was definitely low.
[2.4.0a] - 2025-01-28
Added:
Yandex diffusion model yandex_tabddpm.
Possibility to save selected models, pickle artifacts, and retrieve them.
Python and R tutorials for calculating membership disclosure.
Dummy generator.
Capability to log to a file for testing.
Tutorial for Median Hellinger distance.
Changed: Changed workspace naming by including the Process ID.
Fixed: Cleared log handlers before starting pysdg.
[2.3.0] - 2024-12-02 (RELEASED - STABLE)
Added: restore_col_names method in the Generator class to retrieve the encoded dataframe with original column names.
[2.3.0rc0] - 2024-11-23
Added:
Bayesian optimization feature for CTGAN, along with a tutorial.
Option to log output to a file.
Option to specify the maximum number of cores.
Detection of duplicate index entries in the json input file for the Generator.
Fixed: Improved performance for handling erratic values in datasets.
Changed: Replaced no_obsvs with num_rows and no_synths with num_synths in the Generator.gen method.
[2.2.0b] - 2024-09-26
Added:
Standalone function to compute membership disclosure risk.
Function to remove unnecessary global variables.
Deprecated: The class MmbrshpRsk may be deprecated in an upcoming release.
Changed: Improved code readability.
[2.1.6rc0] - 2024-09-06
Added: Verified availability of the Replica library.
[2.1.6b] - 2024-08-30
Added: inspect_data function to assist the json file creator in locating discrepancies.
Fixed: Misinterpretation of high cardinality categorical variables in Replica.
[2.1.5b] - 2024-08-20
Fixed:
Mismatch in shape error in synthcity_ctgan for unbalanced categorical variables.
Removed ‘synthcity_goggle’ generator.
Changed: Replaced “soul” by “real” and “ghost” by “synth” in all Generator attributes.
[2.1.4b] - 2024-08-20
Fixed: Error loading the .env file for Replica generator.
[2.1.3b] - 2024-08-16
Fixed:
Error that forced all logical columns to equate to True.
Incorrect identification of missing values as erratic.
[2.1.2b] - 2024-08-16
Fixed:
Datetime type discrepancies in soul and ghost.
Occasional invalid output for categorical variables in Replica.
Logical error in previous erratic release.
Changed: Dropped json_type option in Generator.load to enforce identical data types.
[2.1.1b] - 2024-08-09
Added: Option to delete the Replica working folder by setting sweep_replica_jobs to True in get_replica_risk.
[2.1.0b] - 2024-08-08
Added:
Four more Synthcity generators: “synthcity_nflow”, “synthcity_rtvae”, “synthcity_goggle”, “synthcity_arf”.
do_sweep_replica function to delete Replica working folders.
Generator.sweep_replica attribute set to True by default.
Generator.replica_ids attribute to retrieve generated replica jobs.
[2.0.8b] - 2024-08-06
Fixed: Data type discrepancies between soul and ghosts for Replica.
Changed: Set the default for enforcing json types in Generator.load method to True.
[2.0.7b] - 2024-08-02
Fixed:
Discrepancy between soul and ghost if a special missing value is defined in the json file and the Replica generator is used.
Datetime processing.
NaT incompatibility with R.
Added: suppress_errors attribute in the Generator class to treat erratic entries in numeric variables as missing values.
Changed: Suppressed warnings.
[2.0.6b] - 2024-07-31
Fixed: Error when passing a combination of dataframe and json path to the Generator.load method.
[2.0.5b] - 2024-07-30
Fixed:
Replica data type error in encoded ghosts.
Updated R installation documentation.
Added: R usage documentation.
[2.0.4b] - 2024-07-29
Fixed: Replica data type discrepancy between soul and ghosts.
[2.0.3b] - 2024-07-29
Fixed: Replica encoding issue.
[2.0.2b] - 2024-07-26
Added:
Allowed user to input either a path to the csv file or a Pandas dataframe in Generator.load method.
Allowed user to input either a path to the json file or a dictionary in Generator.load method.
[2.0.1b] - 2024-07-24
Added: Option in the Generator.load method to enforce the user-defined datatypes given in the json on both the soul and its ghosts. The default is to use the pandas reading defaults.
[2.0.0b] - 2024-07-24
Added: Initial release of pysdg v2 forked from v1.1.14b and based on python 3.10.
Changed:
Eliminated the naive_read function and incorporated it into Generator.load method.
Unify_na in v2 takes place during loading and not unloading.
Generator.train method takes a training subset as an option.