.. _usage: Usage ====== This guide outlines how to use `pysdg`. The first section details the setup of the `json` file required for generating synthetic data. The second section covers the supported generative models. Subsequent sections provide examples demonstrating how to use `pysdg` in both `Python` and `R`. For the list of supported synthetic data generators, refer to :ref:`generators`. For a full list of functions and classes, refer to the :ref:`api_ref`. If you are ready to get started, and after installing `pysdg` as described in :ref:`installation`, you can explore the tutorial files available in the package's GitHub repository under the `tutorials` folder. General setup ************* 1. Your tabular real dataset need to be provided as a `.csv` file. You are encouraged to provide a `json` file describing the dataset variables, although it is optional. Below is an example of typical `json` file:: { "ds_name": ["your_data_name"], "nct_nos":[], "id_idx":[], "cat_idxs":[0,1,2,3,4,5,6,7,8,10,12,13], "cnt_idxs":[9,14], "dscrt_idxs":[11,15], "datetime_idxs":[], "miss_vals":["nan", "","NaT","NaN","NA",""], "h0_value":[0], "quasi_idxs":[6,7], } .. note:: Ensure that the `json` file accurately represents the variables in your dataset. All values in the `json` file should be provided as *lists*, even if they contain only one element. *The example indexes shown should be replaced with the appropriate indexes for your dataset*. Column indices shall start with 0, so the first column (variable) in the dataset has index `0`, the second column has index `1`, and so forth. Check :ref:`api_ref` for more details on the `json` file. The full description of the keys in the above `json` file is given below: * ds_name: The name of the input real dataset that is typically representing a clinical trial. * nct_nos: The NCT numbers of the clinical trial (optional). * id_idx: The index of ID column (i.e. variable) in the dataset, if any. * cat_idxs: Indices of the `categorical` variables. **These include both numeric and alphanumeric variables**. Binary variables shall be listed here as well. * cnt_idxs: Indices of the `continuous` variables e.g. float numbers. * dscrt_idxs: Indices of the `discrete` variables e.g. integer numbers. * datetime_idxs: Indices of `date` or `date/time` variables e.g. The value 13-Jan-23 is an example of a date/time entry. * miss_vals: Values used for missing data. * h0_value: Keep always 0. * quasi_idxs: Indices for `quasi identifiers` that are used in the calculations of attribution risk disclosure, if needed. The risk calculation requires at least two variables to be listed. .. note:: Please make sure to include "nan", "NA" and "" in the list of missing values.. You can add other simple keys:value pairs to the json file without causing any error in `pysdg` package. 1. In order to use `replica/seq` generator, the user needs to provide its access credentials in an `.env` file. The file should be saved in the root of users's project. This is necessary for the project to properly call the `replica/seq` generator. The following variables shall be set in the `.env` file:: REPLICA_ID= REPLICA_PWD= REPLICA_HOST= REPLICA_PORT= Generating synthetic datasets in Python *************************************** The example below illustrates the basic functionality of `pysdg` in Python. For advanced usage, please refer to the tutorial files. First activate your environment using:: conda activate Here's an example demonstrating how to use `pysdg` in `Python`: .. code-block:: python from pysdg.synth.generate import Generator # Import pysdg generate module if __name__ == "__main__": raw_data_path='tutorials/raw_data.csv' # Replace the tutorial dataset path with your own. raw_info_path='tutorials/raw_info.json' # Replace the tutorial dataset info path with your own. gen_name="synthcity/bayesian_network" # Choose any of the supported generators. ##### Generate Synthetic Data gen=Generator(gen_name) # Construct the generator object real=gen.load(raw_data_path,raw_info_path) # Load raw data and its info. gen.train() # Train the selected generative model. gen.gen(num_rows=len(real), num_synths=2) # Generate a list of multiple synthetic datasets. synths=gen.unload() # Unload the generated synthetic datasets. synth[0].to_csv('synthetic_data.csv', index=False) # If needed, save the first synthetic dataset. Replace teh file name with your own. .. note:: The above code snippets are illustrative and may require modifications to work with your own data. Generating synthetic datasets in R ********************************** The example below illustrates the basic functionality of `pysdg` in R. For advanced usage, please refer to the tutorial files. First activate your environment in R markdown notebook (Rmd) using:: reticulate::use_condaenv(, required = TRUE) Here's an example demonstrating how to use `pysdg` in `R`: .. code-block:: R synth.generate <- reticulate::import("pysdg.synth.generate") # Import pysdg generate module raw_data_path <- "tutorials/raw_data.csv" # Replace the tutorial dataset path with your own. raw_info_path <- "tutorials/raw_info.json" # Replace the tutorial dataset info path with your own. gen_name <- "synthcity/bayesian_network" # Choose any of the supported generators. ##### Generate Synthetic Data gen <- synth.generate$Generator(gen_name) # Construct the generator object real <- gen$load( data_path, info_path) # Load raw data and its info. gen$train() # Train the selected generative model. gen$gen(num_rows=nrow(soul), num_synths=2) # Generate a list of multiple synthetic datasets. synths <- gen$unload() # Unload the generated synthetic datasets. print(head(synths[[1]])) # If needed, save the first synthetic dataset. Replace the file name with your own. .. note:: The above code snippets are illustrative and may require modifications to work with your own data. .. .. bibliography:: references.bib .. :style: plain