# Changelog

This file documents the notable changes to this project. Only selected versions are released.

## [2.7.2] - 2025-06-12

- **Fixed** Automatic deletion of user's work directory if the model is not saved.
- **Fixed** Incorrect location of temporary folder for `yandex/tabddpm` when generating from a saved model.

## [2.7.1] - 2025-05-21

- **Fixed** Error thrown if `quasi_idxs` key is missing from the input `json` info file.
- **Fixed** Incorrect number of rows generated in `yandex/tabddpm` (#71).
- **Changed** Eliminated dependency on `tomli_i`.
- **Fixed** `pysdg_vault_path` does not work in `gen.gen()` for `yandex/tabddpm` (#70).

## [2.7.0] - 2025-05-12

- **Warning:** This release is **NOT** compatible with previous versions due to module restructuring and function renaming.
- **Changed** Removed all metrics.
- **Changed** Restructured package.

## [2.6.0b0] - 2025-05-06

- **Added:** Amazon Tabsyn transformer/diffusion-based generator `amazon/tabsyn`.
- **Corrected:** Checking by `bool` in `Generator.gen` method is replaced by checking by column name if `_missing` is included.

## [2.5.0rc1] - 2025-04-02

- **Fixed:** Hamming distance between NaNs in `calc_membership_vuln`.
- **Added:** A boolean argument `na_is_match` in `calc_membership_vuln` to treat compared NAs as either a match or a mismatch when calculating the Hamming distance.

## [2.5.0rc0] - 2025-04-02

- **Warning:** This release is **NOT** compatible with previous versions due to module restructuring and function renaming.
- **Changed:** Module names have been updated to align with the new naming conventions.
- **Deprecated:** `calc_membership_risk`. Use `calc_membership_vuln`.
- **Deprecated:** `calc_attribution_risk`. Use `calc_attribute_vuln_replica`.
- **Deprecated:** `calc_inference_risk`. Use `calc_attribution_vuln`.
- **Fixed:** Bug in copula metadata generation.
- **Changed:** Improved documentation and added references.

## [2.4.4b4] - 2025-03-27 (**RELEASED - BETA**)

- **Fixed:** Error thrown by `calc_univar_hellinger_distance` when encoded columns of real and synth data do not match.
- **Fixed:** Median calculation in `calc_univar_hellinger_distance` when NA is encountered.
- **Fixed:** Multivariate Hellinger distance datetime handling.
- **Added:** Inference risk privacy metric.
- **Added:** Vulnerability-utility metric.

## [2.4.4b3] - 2025-03-06

- **Added:** Default option to infer datatypes in addition to explicitly passing the JSON file for the input dataset in the `Generator` class.
- **Added:** Function to calculate multivariate Hellinger distance `calc_multivar_hellinger_distance`.
- **Deprecated:** Renamed the function `calc_mmbrshp_risk` to `calc_membership_risk`.
- **Deprecated:** Renamed the function `calc_univariate_hlngr_distance` to `calc_univar_hellinger_distance`.
- **Fixed:** Issue with "Failed to remove 'None'" in the `unload` method.
- **Added:** Support for specifying generator names in the format `source/generator` (e.g., `synthcity/ctgan`). The old format remains functional.
- **Added:** Introduced the `gen_params` attribute in `Generator` to allow users to define generator hyperparameters in the format used by the generator's source, replacing the `synthcity_params` attribute, which will eventually be deprecated.
- **Changed:** Redefined estimate agreement to indicate whether the synthetic data estimate falls within the confidence interval (CI) of the real data.

## [2.4.4b2] - 2025-02-07 (**RELEASED - BETA**)

- **Fixed:** The calculation of directional decision agreement.
- **Deprecated:** The `compare_estimates` function no longer returns an array. It now returns a dictionary instead. The array return is deprecated.

## [2.4.3b0] - 2025-02-07 (**RELEASED - BETA**)

- **Fixed:** Handling incorrect `quasi_vars` in the membership disclosure function.

## [2.4.2b] - 2025-02-05

- **Added:** Option to pass data info as `json/dict` to membership disclosure function.
- **Fixed:** Ensured SDV library is installed as a dependency for the diffusion model.

## [2.4.0b] - 2025-01-30

- **Fixed:** Corrected membership disclosure algorithm. **Note: The previous implementation resulted in inflated membership disclosure numbers, which means that if the disclosure value was low then the true value was definitely low.**

## [2.4.0a] - 2025-01-28

- **Added:**
  - Yandex diffusion model `yandex_tabddpm`.
  - Possibility to save selected models, pickle artifacts, and retrieve them.
  - Python and R tutorials for calculating membership disclosure.
  - Dummy generator.
  - Capability to log to a file for testing.
  - Tutorial for Median Hellinger distance.
- **Changed:** Changed workspace naming by including the Process ID.
- **Fixed:** Cleared log handlers before starting pysdg.

## [2.3.0] - 2024-12-02 (**RELEASED - STABLE**)

- **Added:** `restore_col_names` method in `Generator` class to retrieve the encoded dataframe with original column names.

## [2.3.0rc0] - 2024-11-23

- **Added:**
  - Bayesian optimization feature for CTGAN, along with a tutorial.
  - Option to log output to a file.
  - Option to specify the maximum number of cores.
  - Detection of duplicate index entries in the json input file for the Generator.
- **Fixed:** Improved performance for handling erratic values in datasets.
- **Changed:** Replaced `no_obsvs` with `num_rows` and `no_synths` with `num_synths` in `Generator.gen` method.

## [2.2.0b] - 2024-09-26

- **Added:**
  - Standalone function to compute membership disclosure risk.
  - Function to remove unnecessary global variables.
- **Deprecated:** The class `MmbrshpRsk` may be deprecated in an upcoming release.
- **Changed:** Improved code readability.

## [2.1.6rc0] - 2024-09-06

- **Added:** Verified availability of the Replica library.

## [2.1.6b] - 2024-08-30

- **Added:** `inspect_data` function to assist json file creator in locating discrepancies.
- **Fixed:** Misinterpretation of high cardinality categorical variables in Replica.

## [2.1.5b] - 2024-08-20

- **Fixed:**
  - Mismatch in shape error in synthcity_ctgan for unbalanced categorical variables.
  - Removed 'synthcity_goggle' generator.
- **Changed:** Replaced "soul" by "real" and "ghost" by "synth" in all Generator attributes.

## [2.1.4b] - 2024-08-20

- **Fixed:** Error loading the .env file for Replica generator.

## [2.1.3b] - 2024-08-16

- **Fixed:**
  - Error that forced all logical columns to equate to True.
  - Incorrect identification of missing values as erratic.

## [2.1.2b] - 2024-08-16

- **Fixed:**
  - Datetime type discrepancies in soul and ghost.
  - Occasional invalid output for categorical variables in Replica.
  - Logical error in previous erratic release.
- **Changed:** Dropped json_type option in Generator.load to enforce identical data types.

## [2.1.1b] - 2024-08-09

- **Added:** Option to delete Replica working folder by setting `sweep_replica_jobs` to True in `get_replica_risk`.

## [2.1.0b] - 2024-08-08

- **Added:**
  - Four more Synthcity generators: "synthcity_nflow", "synthcity_rtvae", "synthcity_goggle", "synthcity_arf".
  - `do_sweep_replica` function to delete Replica working folders.
  - `Generator.sweep_replica` attribute set to True by default.
  - `Generator.replica_ids` attribute to retrieve generated replica jobs.

## [2.0.8b] - 2024-08-06

- **Fixed:** Fixed data type discrepancies between soul and ghosts for Replica.
- **Changed:** Set the default for enforcing json types in Generator.load method to True.

## [2.0.7b] - 2024-08-02

- **Fixed:**
  - Fixed a discrepancy between soul and ghost if a special missing value is defined in the json file and Replica generator is used.
  - Fixed datetime processing.
  - Fixed NaT incompatibility with R.
- **Added:** Added suppress_errors attribute to the Generator class to deal with erratic entries in numeric variables as missing values.
- **Changed:** Suppressed warnings.

## [2.0.6b] - 2024-07-31

- **Fixed:** Fixed error when passing a combination of dataframe and json path to Generator.load method.

## [2.0.5b] - 2024-07-30

- **Fixed:**
  - Fixed Replica data type error in encoded ghosts.
  - Updated R installation documentation.
- **Added:** Added R usage documentation.

## [2.0.4b] - 2024-07-29

- **Fixed:** Fixed Replica data type discrepancy between soul and ghosts.

## [2.0.3b] - 2024-07-29

- **Fixed:** Fixed Replica encoding issue.

## [2.0.2b] - 2024-07-26

- **Added:**
  - Allowed user to input either a path to the csv file or a Pandas dataframe in Generator.load method.
  - Allowed user to input either a path to the json file or a dictionary in Generator.load method.

## [2.0.1b] - 2024-07-24

- **Added:** In Generator.load method, added the option of enforcing user-defined datatype as given in the json to both the soul and its ghosts. The default is to use pandas reading defaults.

## [2.0.0b] - 2024-07-24

- **Added:** Initial release of pysdg v2 forked from v1.1.14b and based on python 3.10.
- **Changed:**
  - Eliminated the naive_read function and incorporated it into Generator.load method.
  - Unify_na in v2 takes place during loading and not unloading.
  - Generator.train method takes a training subset as an option.