6. API reference

class pysdg.gen.Generator(gen_name: str = 'dummy', num_cores: int | None = None, work_dir: str | None = None, save_model=False, log_to_file: bool = False, credentials: Mapping[str, Any] | None = None)

A class for synthetic data generation.

Parameters:

gen_name – The name of the generator. The default is ‘dummy’, which is used either to get the encoded version of the loaded dataframe using the ‘load’ method or to use a pretrained model from a pysdg vault zip file using the ‘gen’ method. If the generator is ‘dummy’, the rest of the arguments are ignored. The other available generators are:
'replica/seq' (-) – Replica sequential generator (a license is required to use this generator).
'synthcity/bayesian_network' (-) – SynthCity Bayesian network generator.
'synthcity/ctgan' (-) – SynthCity CTGAN generator.
'synthcity/nflow' (-) – SynthCity Normalizing Flow generator.
'synthcity/rtvae' (-) – SynthCity RTVAE generator.
'synthcity/tvae' (-) – SynthCity TVAE generator.
'synthcity/arf' (-) – SynthCity ARF generator.
'yandex/tabddpm' (-) – Yandex TabDDPM generator.
'amazon/tabsyn' (-) – Amazon TabSyn generator.
num_cores – The maximum number of cores. If None, all available cores will be used. You may need to set this to a specific number for some generators, such as the Bayesian network generator in SynthCity, where exploiting all available cores may cause memory issues.
work_dir – User-defined work directory to which some processes may save temporary files and directories. Also, if ‘save_model’ is set to True, the model artifact will be saved to this directory. If not provided, a directory whose name starts with ‘pysdg’ is created in the current working directory.
save_model – Whether to save the model artifact. If True, the model artifact is saved to work_dir. Default is False.
log_to_file – Whether to write a log file. If True, the pysdg.log file is saved to work_dir. Default is False.
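
The constructor arguments described above can be assembled and checked before a generator is created. The following is a minimal sketch, not part of pysdg: the helper `build_generator_kwargs` and its `max_cores` argument are hypothetical names introduced for illustration, and the list of names is copied from the parameter description above. It caps `num_cores` at a user-chosen limit, which may help with generators that run into memory issues when all cores are used.

```python
import os

# Generator names copied from the parameter description above.
DOCUMENTED_GENERATORS = (
    "dummy",
    "replica/seq",
    "synthcity/bayesian_network",
    "synthcity/ctgan",
    "synthcity/nflow",
    "synthcity/rtvae",
    "synthcity/tvae",
    "synthcity/arf",
    "yandex/tabddpm",
    "amazon/tabsyn",
)


def build_generator_kwargs(gen_name="dummy", max_cores=None, work_dir=None,
                           save_model=False, log_to_file=False):
    """Hypothetical helper: validate and assemble Generator keyword arguments."""
    if gen_name not in DOCUMENTED_GENERATORS:
        raise ValueError(f"Unknown generator name: {gen_name!r}")
    num_cores = None  # None means all available cores will be used
    if max_cores is not None:
        available = os.cpu_count() or 1
        num_cores = max(1, min(max_cores, available))
    return {
        "gen_name": gen_name,
        "num_cores": num_cores,
        "work_dir": work_dir,
        "save_model": save_model,
        "log_to_file": log_to_file,
    }


kwargs = build_generator_kwargs("synthcity/bayesian_network", max_cores=2)
# With pysdg installed, the arguments would be passed on as:
#     from pysdg.gen import Generator
#     gen = Generator(**kwargs)
```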

load(raw_data: str | DataFrame, raw_info: str | dict | None = None) → DataFrame

Safely loads the input dataframe and prepares it for training the generative model.

Parameters:

raw_data (str or pd.DataFrame) –
- If a string, it should be the full path to the input real dataset CSV file, including its extension.
- If a pandas DataFrame, it should contain the real data.
raw_info (str or dict) –
- If a string, it should be the full path to the json file describing the real data including its extension.
- If a dictionary, it should be the dictionary describing the real data.

Returns:

The loaded real data with all missing values properly processed.
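
As an illustration of the string form of `raw_data`, the sketch below writes a small CSV file with the standard library and builds the full path, extension included. The column names, the file name, and the helpers `write_example_csv` and `load_real_data` are hypothetical. The wrapper only forwards the documented argument names to `load`, so it works with any object exposing that method. Whether this particular file loads cleanly without a `raw_info` description is an assumption.

```python
import csv
import os
import tempfile


def write_example_csv(directory):
    """Write a tiny illustrative CSV and return its full path, extension included."""
    path = os.path.join(directory, "real.csv")
    rows = [
        {"age": "34", "sex": "F"},
        {"age": "", "sex": "M"},  # empty cell: a missing value in the CSV
        {"age": "51", "sex": "F"},
    ]
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["age", "sex"])
        writer.writeheader()
        writer.writerows(rows)
    return os.path.abspath(path)


def load_real_data(generator, raw_data, raw_info=None):
    """Call load with the documented argument names and return its result."""
    return generator.load(raw_data=raw_data, raw_info=raw_info)


csv_path = write_example_csv(tempfile.mkdtemp())
# With pysdg installed:
#     from pysdg.gen import Generator
#     real = load_real_data(Generator(), csv_path)
```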

restore_col_names(enc_df: DataFrame) → DataFrame

Curates the column names of the encoded dataframe.

Parameters: enc_df – The encoded dataframe.
Returns: The encoded dataframe with curated column names.

unload() → list

Safely unloads the generated encoded synthetic dataset versions (synths).

Returns: List of ‘synths’. All ‘synths’ have the same number of records and matching variable types of ‘real’.

train(input_data: DataFrame | None = None) → None

Trains the generator using the encoded real or the input dataset. To avoid training errors, make sure to use the load method first.

Parameters: input_data – A pandas dataframe that is used to train the model. If passed, it should be a subset of the encoded real dataset.

gen(num_rows: int | None = None, num_synths: int | None = None, pysdg_vault_path: str | None = None) → list

Generates multiple synthetic datasets (synths) from the trained generative model.

Parameters:

num_rows – The target number of records (observations) in each output synthetic dataset.
num_synths – The target number of synthetic versions (synths) to generate, where each ‘synth’ has num_rows records.
pysdg_vault_path – The path to the pysdg vault zip file. If provided, it will load the model from this path instead of using the trained model in memory.
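
The methods above suggest a load, train, gen, unload sequence. The sketch below expresses that sequence as a function over any object exposing those four methods. The ordering of `gen` before `unload` is inferred from the method descriptions, not stated outright. `RecordingGenerator` is a hypothetical stand-in that only records calls so the sketch runs without pysdg; it does not reproduce pysdg behaviour.

```python
def generate_synths(generator, raw_data, raw_info=None, num_rows=None, num_synths=None):
    """Run load -> train -> gen -> unload on any object exposing those methods."""
    real = generator.load(raw_data, raw_info)
    generator.train()
    generator.gen(num_rows=num_rows, num_synths=num_synths)
    synths = generator.unload()
    return real, synths


class RecordingGenerator:
    """Hypothetical stand-in for pysdg.gen.Generator; it only records the calls."""

    def __init__(self):
        self.calls = []
        self._synths = []

    def load(self, raw_data, raw_info=None):
        self.calls.append("load")
        return raw_data

    def train(self, input_data=None):
        self.calls.append("train")

    def gen(self, num_rows=None, num_synths=None, pysdg_vault_path=None):
        self.calls.append("gen")
        self._synths = [f"synth_{i}" for i in range(num_synths or 1)]
        return self._synths

    def unload(self):
        self.calls.append("unload")
        return self._synths


stub = RecordingGenerator()
real, synths = generate_synths(stub, "real.csv", num_rows=100, num_synths=3)
print(stub.calls)   # ['load', 'train', 'gen', 'unload']
print(len(synths))  # 3
```

With pysdg installed, a `Generator` instance would be passed in place of the stand-in.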

class pysdg.optimize.BayesianOptimizationRoutine(gen: Generator, eval_function: Callable, holdout_df: DataFrame | None = None, objective: str = 'maximize', n_trials: int = 10, study_name: str = 'my_study', dump_sqlite: bool = False, dump_csv: bool = False)

A class to perform Bayesian optimization for various synthetic data generators.

PARAMETER_RANGES

A dictionary containing parameter ranges for different generators.

Type: dict

gen

The generator object to be optimized.

Type: Generator

eval_function

The evaluation function to assess generator performance.

Type: Callable

holdout_df

A DataFrame for holdout validation, if any.

Type: pd.DataFrame | None

objective

The optimization objective, either “maximize” or “minimize”.

Type: str

n_trials

The number of optimization trials to run.

Type: int

study_name

The name of the Optuna study.

Type: str

dump_sqlite

Whether to dump the study results to an SQLite database.

Type: bool

dump_csv

Whether to dump the study results to a CSV file.

Type: bool

best_gen

The best generator found during optimization.

Type: Generator | None

gen_name

The name of the generator being optimized.

Type: str

study

The Optuna study object.

Type: optuna.study.Study

black_box_function

The black-box function for optimization.

Type: Callable

Initialize BayesianOptimizationRoutine.

Parameters:

gen (Generator) – The generator object to be optimized.
eval_function (Callable) – The evaluation function to assess generator performance.
holdout_df (pd.DataFrame | None, optional) – A DataFrame for holdout validation, if any. Defaults to None.
objective (str, optional) – The optimization objective, either “maximize” or “minimize”. Defaults to “maximize”.
n_trials (int, optional) – The number of optimization trials to run. Defaults to 10.
study_name (str, optional) – The name of the Optuna study. Defaults to “my_study”.
dump_sqlite (bool, optional) – Whether to dump the study results to an SQLite database. Defaults to False.
dump_csv (bool, optional) – Whether to dump the study results to a CSV file. Defaults to False.
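
The constructor arguments listed above can be validated before the routine is built. This is a minimal sketch under stated assumptions: `build_routine_kwargs` and `placeholder_eval` are hypothetical names, and the argument list that pysdg passes to `eval_function` is not specified in this reference, so the placeholder accepts anything and returns a constant score.

```python
VALID_OBJECTIVES = ("maximize", "minimize")


def build_routine_kwargs(gen, eval_function, holdout_df=None, objective="maximize",
                         n_trials=10, study_name="my_study",
                         dump_sqlite=False, dump_csv=False):
    """Hypothetical helper: check the documented arguments and return them as a dict."""
    if not callable(eval_function):
        raise TypeError("eval_function must be callable")
    if objective not in VALID_OBJECTIVES:
        raise ValueError(f"objective must be one of {VALID_OBJECTIVES}")
    if not isinstance(n_trials, int) or n_trials < 1:
        raise ValueError("n_trials must be a positive integer")
    return {
        "gen": gen,
        "eval_function": eval_function,
        "holdout_df": holdout_df,
        "objective": objective,
        "n_trials": n_trials,
        "study_name": study_name,
        "dump_sqlite": dump_sqlite,
        "dump_csv": dump_csv,
    }


def placeholder_eval(*args, **kwargs):
    """Placeholder score; the real signature expected by pysdg is an assumption."""
    return 0.0


# `gen=None` is a stand-in here; a Generator instance would be passed in practice.
routine_kwargs = build_routine_kwargs(gen=None, eval_function=placeholder_eval,
                                      objective="minimize", n_trials=5)
# With pysdg installed:
#     from pysdg.optimize import BayesianOptimizationRoutine
#     routine = BayesianOptimizationRoutine(**routine_kwargs)
```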

generate_params(trial: Trial, generator_name: str): Generate parameters for the generator using the trial object.

generic_black_box_function(trial: Trial, gen: Generator): Run one training iteration of the generator and evaluate the performance.

get_optimization_results(): Returns the optimization results as a DataFrame.

retrain_generator(params): Retrain the best generator using the best parameters.