3. Usage

This guide outlines how to use pysdg. The first section details the setup of the json file required for generating synthetic data. The second section covers the supported generative models. Subsequent sections provide examples demonstrating how to use pysdg in both Python and R. For the list of supported synthetic data generators, refer to Generators. For a full list of functions and classes, refer to the API reference. If you are ready to get started, and after installing pysdg as described in Installation, you can explore the tutorial files available in the package’s GitHub repository under the tutorials folder.

3.1. General setup

  1. Your tabular real dataset need to be provided as a .csv file. You are encouraged to provide a json file describing the dataset variables, although it is optional. Below is an example of typical json file:

    {
        "ds_name": ["your_data_name"],
        "nct_nos":[],
        "id_idx":[],
        "cat_idxs":[0,1,2,3,4,5,6,7,8,10,12,13],
        "cnt_idxs":[9,14],
        "dscrt_idxs":[11,15],
        "datetime_idxs":[],
        "miss_vals":["nan", "<NA>","NaT","NaN","NA",""],
        "h0_value":[0],
        "quasi_idxs":[6,7],
    }
    

Note

Ensure that the json file accurately represents the variables in your dataset. All values in the json file should be provided as lists, even if they contain only one element. The example indexes shown should be replaced with the appropriate indexes for your dataset. Column indices shall start with 0, so the first column (variable) in the dataset has index 0, the second column has index 1, and so forth. Check API reference for more details on the json file.

The full description of the keys in the above json file is given below:

  • ds_name: The name of the input real dataset that is typically representing a clinical trial.

  • nct_nos: The NCT numbers of the clinical trial (optional).

  • id_idx: The index of ID column (i.e. variable) in the dataset, if any.

  • cat_idxs: Indices of the categorical variables. These include both numeric and alphanumeric variables. Binary variables shall be listed here as well.

  • cnt_idxs: Indices of the continuous variables e.g. float numbers.

  • dscrt_idxs: Indices of the discrete variables e.g. integer numbers.

  • datetime_idxs: Indices of date or date/time variables e.g. The value 13-Jan-23 is an example of a date/time entry.

  • miss_vals: Values used for missing data.

  • h0_value: Keep always 0.

  • quasi_idxs: Indices for quasi identifiers that are used in the calculations of attribution risk disclosure, if needed. The risk calculation requires at least two variables to be listed.

Note

Please make sure to include “nan”, “NA” and “” in the list of missing values.. You can add other simple keys:value pairs to the json file without causing any error in pysdg package.

  1. In order to use replica/seq generator, the user needs to provide its access credentials in an .env file. The file should be saved in the root of users’s project. This is necessary for the project to properly call the replica/seq generator. The following variables shall be set in the .env file:

    REPLICA_ID=<your_replica_user_id>
    REPLICA_PWD=<your_replica_user_password>
    REPLICA_HOST=<replica_host_ip_address>
    REPLICA_PORT=<replica_hots_port_number>
    

3.2. Generating synthetic datasets in Python

The example below illustrates the basic functionality of pysdg in Python. For advanced usage, please refer to the tutorial files.

First activate your environment using:

conda activate <your_virtual_environment_name>

Here’s an example demonstrating how to use pysdg in Python:

from pysdg.synth.generate import Generator # Import pysdg generate module

if __name__ == "__main__":

    raw_data_path='tutorials/raw_data.csv' # Replace the tutorial dataset path with your own.
    raw_info_path='tutorials/raw_info.json' # Replace the tutorial dataset info path with your own.
    gen_name="synthcity/bayesian_network" # Choose any of the supported generators.

    ##### Generate Synthetic Data
    gen=Generator(gen_name) # Construct the generator object
    real=gen.load(raw_data_path,raw_info_path) # Load raw data and its info.
    gen.train() # Train the selected generative model.
    gen.gen(num_rows=len(real), num_synths=2) # Generate a list of multiple synthetic datasets.
    synths=gen.unload() # Unload the generated synthetic datasets.
    synth[0].to_csv('synthetic_data.csv', index=False) # If needed, save the first synthetic dataset. Replace teh file name with your own.

Note

The above code snippets are illustrative and may require modifications to work with your own data.

3.3. Generating synthetic datasets in R

The example below illustrates the basic functionality of pysdg in R. For advanced usage, please refer to the tutorial files.

First activate your environment in R markdown notebook (Rmd) using:

reticulate::use_condaenv(<your_virtual_environment_name>, required = TRUE)

Here’s an example demonstrating how to use pysdg in R:

synth.generate <- reticulate::import("pysdg.synth.generate") # Import pysdg generate module

raw_data_path <- "tutorials/raw_data.csv" # Replace the tutorial dataset path with your own.
raw_info_path <- "tutorials/raw_info.json" # Replace the tutorial dataset info path with your own.
gen_name <- "synthcity/bayesian_network"  # Choose any of the supported generators.

##### Generate Synthetic Data
gen <- synth.generate$Generator(gen_name) # Construct the generator object
real <- gen$load( data_path, info_path) # Load raw data and its info.
gen$train() # Train the selected generative model.
gen$gen(num_rows=nrow(soul), num_synths=2) # Generate a list of multiple synthetic datasets.
synths <- gen$unload() # Unload the generated synthetic datasets.
print(head(synths[[1]])) # If needed, save the first synthetic dataset. Replace the file name with your own.

Note

The above code snippets are illustrative and may require modifications to work with your own data.