Changelog

This file documents the notable changes to this project. Only selected versions are released.

[2.7.2] - 2025-06-12

  • Fixed: Automatic deletion of the user’s work directory when the model is not saved.

  • Fixed: Incorrect location of the temporary folder for yandex/tabddpm when generating from a saved model.

[2.7.1] - 2025-05-21

  • Fixed: Error thrown when the quasi_idxs key is missing from the input json info file.

  • Fixed: Incorrect number of rows generated in yandex/tabddpm (#71).

  • Changed: Eliminated dependency on tomli_i

  • Fixed: pysdg_vault_path does not work in gen.gen() for yandex/tabddpm (#70).

[2.7.0] - 2025-05-12

  • Warning: This release is NOT compatible with previous versions due to module restructuring and function renaming.

  • Changed: Removed all metrics.

  • Changed: Restructured package.

[2.6.0b0] - 2025-05-06

  • Added: Amazon Tabsyn transformer/diffusion-based generator amazon/tabsyn.

  • Corrected: When _missing is included, the Generator.gen method now checks by column name instead of by bool.

[2.5.0rc1] - 2025-04-02

  • Fixed: Hamming distance between NaNs in calc_membership_vuln.

  • Added: A boolean argument na_is_match in calc_membership_vuln that controls whether compared NAs are treated as a match or a mismatch when calculating the Hamming distance.
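
The effect of na_is_match can be illustrated with a minimal sketch. This is not the calc_membership_vuln implementation: hamming_distance and _is_na are hypothetical helpers, and it is an assumption here that a NA compared with a non-missing value always counts as a mismatch. In Python, NaN != NaN, so a plain equality check counts two NaNs as different unless they are handled explicitly.

```python
import math


def _is_na(value):
    """Return True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def hamming_distance(row_a, row_b, na_is_match=True):
    """Count the positions at which two equal-length records differ."""
    if len(row_a) != len(row_b):
        raise ValueError("records must have the same length")
    distance = 0
    for a, b in zip(row_a, row_b):
        if _is_na(a) and _is_na(b):
            # Both values are missing: match or mismatch, depending on the flag.
            distance += 0 if na_is_match else 1
        elif _is_na(a) or _is_na(b) or a != b:
            # Assumption: a NA against a real value is always a mismatch.
            distance += 1
    return distance


real = [34, float("nan"), "F"]
synth = [34, float("nan"), "M"]
dist_match = hamming_distance(real, synth, na_is_match=True)
dist_mismatch = hamming_distance(real, synth, na_is_match=False)
```

With na_is_match=True only the third position differs; with na_is_match=False the pair of NaNs adds one more mismatch.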

[2.5.0rc0] - 2025-04-02

  • Warning: This release is NOT compatible with previous versions due to module restructuring and function renaming.

  • Changed: Module names have been updated to align with the new naming conventions.

  • Deprecated: calc_membership_risk. Use calc_membership_vuln.

  • Deprecated: calc_attribution_risk. Use calc_attribute_vuln_replica.

  • Deprecated: calc_inference_risk. Use calc_attribution_vuln.

  • Fixed: Bug in copula metadata generation.

  • Changed: Improved documentation and added references.

[2.4.4b4] - 2025-03-27 (RELEASED - BETA)

  • Fixed: Error thrown by calc_univar_hellinger_distance when encoded columns of real and synth data do not match.

  • Fixed: Median calculation in calc_univar_hellinger_distance when NA is encountered.

  • Fixed: Multivariate Hellinger distance datetime handling.

  • Added: Inference risk privacy metric.

  • Added: Vulnerability-utility metric.
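
The first fix above concerns real and synthetic columns whose encoded categories do not match. A minimal sketch of one way to handle that case when computing a univariate Hellinger distance is shown below. It is not the calc_univar_hellinger_distance implementation; hellinger_distance is a hypothetical function, and taking the union of categories with zero probability for absent ones is an assumption.

```python
import math
from collections import Counter


def hellinger_distance(real_values, synth_values):
    """Hellinger distance between two categorical samples (hypothetical sketch)."""
    if not real_values or not synth_values:
        raise ValueError("both samples must be non-empty")
    real_counts = Counter(real_values)
    synth_counts = Counter(synth_values)
    # Use the union of categories so a level seen in only one sample
    # contributes probability 0 in the other instead of raising an error.
    categories = set(real_counts) | set(synth_counts)
    total = 0.0
    for category in categories:
        p = real_counts.get(category, 0) / len(real_values)
        q = synth_counts.get(category, 0) / len(synth_values)
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total / 2)


same = hellinger_distance(["a", "b", "b"], ["b", "a", "b"])
disjoint = hellinger_distance(["a", "a"], ["b", "b"])
```

Identical distributions give a distance of 0, and samples with no category in common give the maximum distance of 1.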

[2.4.4b3] - 2025-03-06

  • Added: Default option to infer datatypes in addition to explicitly passing the JSON file for the input dataset in the Generator class.

  • Added: Function to calculate multivariate Hellinger distance calc_multivar_hellinger_distance.

  • Deprecated: Renamed the function calc_mmbrshp_risk to calc_membership_risk.

  • Deprecated: Renamed the function calc_univariate_hlngr_distance to calc_univar_hellinger_distance.

  • Fixed: Issue with “Failed to remove ‘None’” in the unload method.

  • Added: Support for specifying generator names in the format source/generator (e.g., synthcity/ctgan). The old format remains functional.

  • Added: The gen_params attribute in Generator, which lets users define generator hyperparameters in the format used by the generator’s source. It replaces the synthcity_params attribute, which will eventually be deprecated.

  • Changed: Redefined estimate agreement to indicate whether the synthetic data estimate falls within the confidence interval (CI) of the real data.
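
The redefined estimate agreement can be sketched as a simple interval check. The function name estimate_agreement and its parameters are hypothetical, and treating the interval bounds as inclusive is an assumption; the package's own function may differ.

```python
def estimate_agreement(synth_estimate, real_ci_lower, real_ci_upper):
    """Return True if the synthetic estimate lies within the real-data CI."""
    if real_ci_lower > real_ci_upper:
        raise ValueError("lower CI bound must not exceed the upper bound")
    # Assumption: the CI bounds themselves count as agreement.
    return real_ci_lower <= synth_estimate <= real_ci_upper


inside = estimate_agreement(0.42, 0.35, 0.50)
outside = estimate_agreement(0.60, 0.35, 0.50)
```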

[2.4.4b2] - 2025-02-07 (RELEASED - BETA)

  • Fixed: The calculation of directional decision agreement.

  • Deprecated: The array return value of the compare_estimates function. The function now returns a dictionary instead.

[2.4.3b0] - 2025-02-07 (RELEASED - BETA)

  • Fixed: Handling incorrect quasi_vars in the membership disclosure function.

[2.4.2b] - 2025-02-05

  • Added: Option to pass data info as json/dict to membership disclosure function.

  • Fixed: Ensured the SDV library is installed as a dependency for the diffusion model.

[2.4.0b] - 2025-01-30

  • Fixed: Corrected the membership disclosure algorithm. Note: The previous implementation inflated the membership disclosure numbers, so a low reported disclosure value means that the true value was also low.

[2.4.0a] - 2025-01-28

  • Added:

    • Yandex diffusion model yandex_tabddpm.

    • Possibility to save selected models, pickle artifacts, and retrieve them.

    • Python and R tutorials for calculating membership disclosure.

    • Dummy generator.

    • Capability to log to a file for testing.

    • Tutorial for Median Hellinger distance.

  • Changed: Workspace names now include the process ID.

  • Fixed: Cleared log handlers before starting pysdg.
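
The last two entries above can be sketched with the standard library. This is an illustration only: make_workspace_name, reset_logger, the "pysdg_ws" prefix, and the "pysdg_demo" logger name are hypothetical and do not reflect the package's actual naming scheme or logger setup.

```python
import logging
import os


def make_workspace_name(prefix="pysdg_ws"):
    """Build a workspace name that embeds the current process ID."""
    # Including the PID keeps concurrent processes from sharing a folder name.
    return f"{prefix}_{os.getpid()}"


def reset_logger(name="pysdg_demo"):
    """Remove and close every handler attached to the named logger."""
    logger = logging.getLogger(name)
    # Iterate over a copy because removeHandler mutates logger.handlers.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    return logger


workspace = make_workspace_name()
demo_logger = logging.getLogger("pysdg_demo")
demo_logger.addHandler(logging.NullHandler())
demo_logger.addHandler(logging.NullHandler())
demo_logger = reset_logger("pysdg_demo")
```

Clearing handlers before a run avoids duplicated log lines when the same logger is configured more than once in a session.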

[2.3.0] - 2024-12-02 (RELEASED - STABLE)

  • Added: restore_col_names method in Generator class to retrieve the encoded dataframe with original column names.

[2.3.0rc0] - 2024-11-23

  • Added:

    • Bayesian optimization feature for CTGAN, along with a tutorial.

    • Option to log output to a file.

    • Option to specify the maximum number of cores.

    • Detection of duplicate index entries in the json input file for the Generator.

  • Fixed: Improved performance for handling erratic values in datasets.

  • Changed: Replaced no_obsvs with num_rows and no_synths with num_synths in Generator.gen method.

[2.2.0b] - 2024-09-26

  • Added:

    • Standalone function to compute membership disclosure risk.

    • Function to remove unnecessary global variables.

  • Deprecated: The class MmbrshpRsk may be deprecated in an upcoming release.

  • Changed: Improved code readability.

[2.1.6rc0] - 2024-09-06

  • Added: Verified availability of the Replica library.

[2.1.6b] - 2024-08-30

  • Added: inspect_data function to help the author of the json file locate discrepancies.

  • Fixed: Misinterpretation of high cardinality categorical variables in Replica.

[2.1.5b] - 2024-08-20

  • Fixed:

    • Mismatch in shape error in synthcity_ctgan for unbalanced categorical variables.

    • Removed ‘synthcity_goggle’ generator.

  • Changed: Replaced “soul” by “real” and “ghost” by “synth” in all Generator attributes.

[2.1.4b] - 2024-08-20

  • Fixed: Error loading the .env file for Replica generator.

[2.1.3b] - 2024-08-16

  • Fixed:

    • Error that forced all logical columns to equate to True.

    • Incorrect identification of missing values as erratic.

[2.1.2b] - 2024-08-16

  • Fixed:

    • Datetime type discrepancies in soul and ghost.

    • Occasional invalid output for categorical variables in Replica.

    • Logical error in previous erratic release.

  • Changed: Dropped json_type option in Generator.load to enforce identical data types.

[2.1.1b] - 2024-08-09

  • Added: Option to delete Replica working folder by setting sweep_replica_jobs to True in get_replica_risk.

[2.1.0b] - 2024-08-08

  • Added:

    • Four more Synthcity generators: “synthcity_nflow”, “synthcity_rtvae”, “synthcity_goggle”, “synthcity_arf”.

    • do_sweep_replica function to delete Replica working folders.

    • Generator.sweep_replica attribute set to True by default.

    • Generator.replica_ids attribute to retrieve generated replica jobs.

[2.0.8b] - 2024-08-06

  • Fixed: Data type discrepancies between soul and ghosts for Replica.

  • Changed: Set the default for enforcing json types in Generator.load method to True.

[2.0.7b] - 2024-08-02

  • Fixed:

    • Discrepancy between soul and ghost when a special missing value is defined in the json file and the Replica generator is used.

    • Datetime processing.

    • NaT incompatibility with R.

  • Added: suppress_errors attribute in the Generator class to treat erratic entries in numeric variables as missing values.

  • Changed: Suppressed warnings.

[2.0.6b] - 2024-07-31

  • Fixed: Error when passing a combination of a dataframe and a json path to the Generator.load method.

[2.0.5b] - 2024-07-30

  • Fixed:

    • Replica data type error in encoded ghosts.

    • Updated R installation documentation.

  • Added: R usage documentation.

[2.0.4b] - 2024-07-29

  • Fixed: Replica data type discrepancy between soul and ghosts.

[2.0.3b] - 2024-07-29

  • Fixed: Replica encoding issue.

[2.0.2b] - 2024-07-26

  • Added:

    • Generator.load method accepts either a path to the csv file or a pandas dataframe.

    • Generator.load method accepts either a path to the json file or a dictionary.

[2.0.1b] - 2024-07-24

  • Added: Option in the Generator.load method to enforce the user-defined datatypes given in the json on both the soul and its ghosts. The default is to use pandas reading defaults.

[2.0.0b] - 2024-07-24

  • Added: Initial release of pysdg v2 forked from v1.1.14b and based on python 3.10.

  • Changed:

    • Eliminated the naive_read function and incorporated it into Generator.load method.

    • Unify_na in v2 takes place during loading and not unloading.

    • Generator.train method takes a training subset as an option.