6. API reference
- class pysdg.gen.Generator(gen_name: str = 'dummy', num_cores: int | None = None, work_dir: str | None = None, save_model=False, log_to_file: bool = False, credentials: Mapping[str, Any] | None = None)
A class for synthetic data generation.
- Parameters:
gen_name – The name of the generator. The default is ‘dummy’, which is used either to get the encoded version of the loaded dataframe with the ‘load’ method or to use a pretrained model from a pysdg vault zip file with the ‘gen’ method. If the generator is ‘dummy’, the remaining arguments are ignored. The other available generators are:
'replica/seq' – Replica sequential generator (a license is required to use this generator).
'synthcity/bayesian_network' – SynthCity Bayesian network generator.
'synthcity/ctgan' – SynthCity CTGAN generator.
'synthcity/nflow' – SynthCity Normalizing Flow generator.
'synthcity/rtvae' – SynthCity RTVAE generator.
'synthcity/tvae' – SynthCity TVAE generator.
'synthcity/arf' – SynthCity ARF generator.
'yandex/tabddpm' – Yandex TabDDPM generator.
'amazon/tabsyn' – Amazon TabSyn generator.
num_cores – The maximum number of cores to use. If None, all available cores are used. Some generators, such as the SynthCity Bayesian network generator, may run into memory issues when all available cores are used; set a specific number in that case.
work_dir – User-defined work directory to which some processes may save temporary files and directories. If ‘save_model’ is True, the model artifact is also saved to this directory. If not provided, a directory whose name starts with ‘pysdg’ is created in the current working directory.
save_model – Whether to save the model artifact. If True, the model artifact is saved to work_dir. Default is False.
log_to_file – Whether to save the log file. If True, the pysdg.log file is saved to work_dir. Default is False.
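The constructor arguments above can be checked before a Generator is created. The following is a minimal sketch and not part of pysdg: the helper build_generator_kwargs and the constant KNOWN_GENERATORS are hypothetical names, and the set only copies the identifiers listed under gen_name. The credentials argument is left out because it is not described in this section.

```python
from typing import Any, Dict, Optional

# Generator identifiers copied from the gen_name description above.
KNOWN_GENERATORS = frozenset({
    "dummy",
    "replica/seq",
    "synthcity/bayesian_network",
    "synthcity/ctgan",
    "synthcity/nflow",
    "synthcity/rtvae",
    "synthcity/tvae",
    "synthcity/arf",
    "yandex/tabddpm",
    "amazon/tabsyn",
})


def build_generator_kwargs(
    gen_name: str = "dummy",
    num_cores: Optional[int] = None,
    work_dir: Optional[str] = None,
    save_model: bool = False,
    log_to_file: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: validate the arguments and return them as a dict."""
    if gen_name not in KNOWN_GENERATORS:
        raise ValueError(f"unknown generator name: {gen_name!r}")
    if num_cores is not None and num_cores < 1:
        raise ValueError("num_cores must be a positive integer or None")
    return {
        "gen_name": gen_name,
        "num_cores": num_cores,
        "work_dir": work_dir,
        "save_model": save_model,
        "log_to_file": log_to_file,
    }


# Cap the core count for the Bayesian network generator, as advised above.
kwargs = build_generator_kwargs("synthcity/bayesian_network", num_cores=4)

# With pysdg installed, the result could be passed on like this:
#   from pysdg.gen import Generator
#   gen = Generator(**kwargs)
```

Validating the name up front gives an early, readable error for a misspelled identifier instead of a failure inside the library.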
- load(raw_data: str | DataFrame, raw_info: str | dict | None = None) -> DataFrame
Safely loads the input dataframe and prepares it for training the generative model.
- Parameters:
raw_data (str or pd.DataFrame) –
If a string, it should be the full path to the input real dataset CSV file, including the extension.
If a pandas DataFrame, it should contain the real data.
raw_info (str or dict) –
If a string, it should be the full path to the JSON file describing the real data, including the extension.
If a dictionary, it should be the dictionary describing the real data.
- Returns:
The loaded real data with all missing values properly processed.
- restore_col_names(enc_df: DataFrame) -> DataFrame
Curates the column names of the encoded dataframe.
- Parameters:
enc_df – The encoded dataframe.
- Returns:
The encoded dataframe with curated column names.
- unload() -> list
Safely unloads the generated encoded synthetic dataset versions (synths).
- Returns:
List of ‘synths’. All ‘synths’ have the same number of records, and their variable types match those of ‘real’.
- train(input_data: DataFrame | None = None) -> None
Trains the generator on the encoded real dataset or, if provided, on the input dataset. To avoid training errors, call the load method first.
- Parameters:
input_data – A pandas dataframe that is used to train the model. If passed, it should be a subset of the encoded real dataset.
- gen(num_rows: int | None = None, num_synths: int | None = None, pysdg_vault_path: str | None = None) -> list
Generates multiple synthetic datasets (synths) from the trained generative model.
- Parameters:
num_rows – The target number of records (observations) in the output synthetic data.
num_synths – The number of synthetic versions (synths) to generate, where each ‘synth’ has the same target number of records (num_rows).
pysdg_vault_path – The path to the pysdg vault zip file. If provided, the model is loaded from this path instead of using the trained model in memory.
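The methods above suggest a call order of load, train, gen, then unload. The following is a minimal sketch of that order under stated assumptions: run_sdg_workflow is a hypothetical helper, and StubGenerator is a stand-in that only mimics the documented method signatures so the example runs without pysdg installed. It does not reproduce any real generator behaviour.

```python
from typing import Any, List, Optional


def run_sdg_workflow(
    gen: Any,
    raw_data: Any,
    raw_info: Any = None,
    num_rows: Optional[int] = None,
    num_synths: Optional[int] = None,
) -> List[Any]:
    """Hypothetical helper: call load, train, gen and unload in that order."""
    gen.load(raw_data, raw_info)  # load and encode the real data first
    gen.train()                   # train on the encoded real data
    gen.gen(num_rows=num_rows, num_synths=num_synths)  # sample the synths
    return gen.unload()           # collect the generated synths


class StubGenerator:
    """Stand-in for pysdg.gen.Generator so the sketch runs without pysdg."""

    def __init__(self) -> None:
        self.calls: List[str] = []
        self._real: List[Any] = []
        self._synths: List[List[Any]] = []

    def load(self, raw_data, raw_info=None):
        self.calls.append("load")
        self._real = list(raw_data)
        return self._real

    def train(self, input_data=None):
        self.calls.append("train")

    def gen(self, num_rows=None, num_synths=None, pysdg_vault_path=None):
        self.calls.append("gen")
        # Stub-only defaults; pysdg's behaviour for None is not documented here.
        rows = len(self._real) if num_rows is None else num_rows
        count = 1 if num_synths is None else num_synths
        self._synths = [[None] * rows for _ in range(count)]
        return self._synths

    def unload(self):
        self.calls.append("unload")
        return self._synths


stub = StubGenerator()
synths = run_sdg_workflow(stub, raw_data=[1, 2, 3], num_rows=5, num_synths=2)
print(stub.calls, len(synths), len(synths[0]))
```

With pysdg installed, a real Generator instance would take the place of the stub, with raw_data and raw_info given as paths or in-memory objects as described under load.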
- class pysdg.optimize.BayesianOptimizationRoutine(gen: Generator, eval_function: Callable, holdout_df: DataFrame | None = None, objective: str = 'maximize', n_trials: int = 10, study_name: str = 'my_study', dump_sqlite: bool = False, dump_csv: bool = False)
A class to perform Bayesian optimization for various synthetic data generators.
- PARAMETER_RANGES
A dictionary containing parameter ranges for different generators.
- Type:
dict
- eval_function
The evaluation function to assess generator performance.
- Type:
Callable
- holdout_df
A DataFrame for holdout validation, if any.
- Type:
pd.DataFrame | None
- objective
The optimization objective, either “maximize” or “minimize”.
- Type:
str
- n_trials
The number of optimization trials to run.
- Type:
int
- study_name
The name of the Optuna study.
- Type:
str
- dump_sqlite
Whether to dump the study results to an SQLite database.
- Type:
bool
- dump_csv
Whether to dump the study results to a CSV file.
- Type:
bool
- gen_name
The name of the generator being optimized.
- Type:
str
- study
The Optuna study object.
- Type:
optuna.study.Study
- black_box_function
The black-box function for optimization.
- Type:
Callable
Initialize BayesianOptimizationRoutine.
- Parameters:
gen (Generator) – The generator object to be optimized.
eval_function (Callable) – The evaluation function to assess generator performance.
holdout_df (pd.DataFrame | None, optional) – A DataFrame for holdout validation, if any. Defaults to None.
objective (str, optional) – The optimization objective, either “maximize” or “minimize”. Defaults to “maximize”.
n_trials (int, optional) – The number of optimization trials to run. Defaults to 10.
study_name (str, optional) – The name of the Optuna study. Defaults to “my_study”.
dump_sqlite (bool, optional) – Whether to dump the study results to an SQLite database. Defaults to False.
dump_csv (bool, optional) – Whether to dump the study results to a CSV file. Defaults to False.
- generate_params(trial: Trial, generator_name: str)
Generate parameters for the generator using the trial object.
- generic_black_box_function(trial: Trial, gen: Generator)
Run one training iteration of the generator and evaluate the performance.
- get_optimization_results()
Returns the optimization results as a DataFrame.
- retrain_generator(params)
Retrain the best generator using the best parameters.
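The roles of eval_function, objective and n_trials can be illustrated with a small search loop. The following sketch deliberately swaps in plain random search for Bayesian optimization so that it needs only the standard library; it does not use Optuna or pysdg. The function random_search, the param_ranges argument and toy_score are hypothetical names, and eval_function here receives the sampled parameters directly, which is an assumption made for illustration. The signature that pysdg expects for eval_function is not stated in this section.

```python
import random
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def random_search(
    eval_function: Callable[[Params], float],
    param_ranges: Dict[str, Tuple[float, float]],
    objective: str = "maximize",
    n_trials: int = 10,
    seed: int = 0,
) -> Tuple[Params, List[Tuple[Params, float]]]:
    """Random-search stand-in: sample, score, and keep the best parameters."""
    if objective not in ("maximize", "minimize"):
        raise ValueError("objective must be 'maximize' or 'minimize'")
    rng = random.Random(seed)
    history: List[Tuple[Params, float]] = []
    for _ in range(n_trials):
        # Draw one value per parameter from its (low, high) range.
        params = {
            name: rng.uniform(low, high)
            for name, (low, high) in param_ranges.items()
        }
        history.append((params, eval_function(params)))
    # The objective decides whether the highest or lowest score wins.
    pick = max if objective == "maximize" else min
    best_params, _ = pick(history, key=lambda item: item[1])
    return best_params, history


def toy_score(params: Params) -> float:
    """Toy evaluation: the score peaks when x equals 2."""
    return -(params["x"] - 2.0) ** 2


best, history = random_search(
    toy_score, {"x": (0.0, 4.0)}, objective="maximize", n_trials=20
)
```

A Bayesian optimizer differs from this loop in that it uses the scores of earlier trials to choose the next parameters, rather than sampling each trial independently.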