Model specifications¶
The directory src.model_specs contains JSON files with model specifications. They are used across different parts of the code to specify the simulations and calculations and to keep the plotting uniform across modules. I split the specifications across several files so that changing one part of them does not require rerunning the whole code in waf.
Overview for JSON files¶
Each JSON file defines a dictionary in Python. Below I give a short description of each JSON file (also referred to as a dictionary) and its keys. The default values are all in line with the descriptions in the final term paper and are hence omitted here. There is no JSON file that describes the data generating processes, as those are fixed in the DataSimulation class anyway. Their order is also fixed by the structure of the paper.
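As a minimal sketch of how such a file becomes a dictionary, the snippet below writes a placeholder boston_settings.json to a temporary directory and reads it back. The helper name load_spec and all values are assumptions made for illustration; they are not the project's defaults or its actual loading code.

```python
import json
import tempfile
from pathlib import Path


def load_spec(path):
    """Read one JSON specification file into a Python dictionary."""
    with open(path, encoding="utf-8") as json_file:
        return json.load(json_file)


# Placeholder content: the keys follow the description of boston_settings.json,
# the values are illustrative only and NOT the defaults from the term paper.
example_spec = {
    "ratio_test": 0.25,
    "ratio_train": 0.75,
    "random_seed_split": 0,
    "random_seed_fit": 1,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    spec_path = Path(tmp_dir) / "boston_settings.json"
    spec_path.write_text(json.dumps(example_spec, indent=4), encoding="utf-8")
    boston_settings = load_spec(spec_path)

# Keys are then accessed like in any other dictionary.
print(boston_settings["ratio_test"])
```

Because every file maps to its own dictionary, a module only needs to load the files whose keys it actually uses.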
boston_settings.json¶
The dictionary defines the simulation set-up that is specific to the Boston simulation.
Keys¶
- ratio_test: float
- Ratio for the test sample
- ratio_train: float
- Counterpart to ratio_test
- random_seed_split: int
- Defines the RandomState for the test_train_split
- random_seed_fit: int
- Random seed for the fitting procedure
convergence_settings.json¶
The dictionary defines the simulation set-up that is specific to the convergence of the Bagging Algorithm.
Keys¶
- max_bootstrap: int
- Maximum number of bootstraps in the range to be considered
- min_bootstrap: int
- Minimum number of bootstraps in the range to be considered
- steps_bootstrap: int
- Steps in the range between min_bootstrap and max_bootstrap
- converged_bootstrap: int
- A large value of bootstrap iterations to visualize the convergence
finite_sample_settings.json¶
The dictionary defines the simulation set-up for the finite sample case.
Keys¶
- n_repeat: int
- Number of Monte Carlo repetitions
- n_list: list
- List with the sample sizes to be considered
- mu: int, float
- True mean of the population
- sigma: int, float
- Standard deviation
- b_iterations: int
- Number of bootstrap iterations
- x_gridpoints: int
- Number of gridpoints
- x_min: int
- Minimum gridpoint
- x_max: int
- Maximal gridpoint
- random_seed: int
- Random seed for the simulation
general_settings.json¶
The dictionary is shared across various simulations and defines the overall simulation set-up.
Keys¶
- n_repeat: int
- Number of Monte Carlo repetitions
- n_test_train: list
- List with the test and train sizes
- noise: int, float
- Standard deviation of the error term for the data generating process
- b_iterations: int
- Number of bootstrap iterations
- min_split_tree: int
- Governs the tree depth. Lower values imply more complex Regression Trees
- random_seeds: list
- List of random seeds used. Note: I don’t reseed but define different RandomState instances with those.
- bagging_ratio: constant at 1
- Subsampling ratio for bagging. Do not change!
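The note on random_seeds (separate generator instances instead of reseeding) can be illustrated with a small sketch. To keep it dependency-free, the standard library's random.Random stands in here for NumPy's RandomState; the helper name make_generators and the seed values are hypothetical.

```python
import random


def make_generators(random_seeds):
    """Create one independent generator per seed instead of reseeding a global one."""
    return [random.Random(seed) for seed in random_seeds]


# Illustrative seeds, not the values from general_settings.json.
random_seeds = [10, 20, 30]

# Each generator carries its own state, so rebuilding the generators from the
# same seeds reproduces the same draws without touching any global state.
first_run = [rng.random() for rng in make_generators(random_seeds)]
second_run = [rng.random() for rng in make_generators(random_seeds)]
```

Keeping one instance per seed means each part of a simulation can draw from its own stream, and changing the number of draws in one part does not shift the random numbers seen by another.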
normal_splits_settings.json¶
The dictionary defines the calculation set-up that is specific to the stump predictor simulation.
Keys¶
- c_gridpoints: int
- Number of gridpoints for c
- c_min: int
- Minimum gridpoint
- c_max: int
- Maximal gridpoint
- a_array: dictionary
- Consists of keys that define the subsampling ratios I want to consider. The value of the first key has to be equal to 1. The values of the other keys are defined as lists, where list[0] is the numerator and list[1] is the denominator of the subsampling fraction.
- gamma: float
- Rate of convergence of the estimator
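The layout of a_array described above can be sketched as follows. The key names, the fractions, and the helper subsampling_ratios are assumptions made for illustration; only the structure (first value equal to 1, remaining values as [numerator, denominator] lists) is taken from the description.

```python
from fractions import Fraction

# Hypothetical a_array: key names and fractions are illustrative only.
a_array = {
    "first_a": 1,
    "second_a": [2, 3],
    "third_a": [1, 2],
}


def subsampling_ratios(a_array):
    """Turn the a_array dictionary into exact subsampling fractions."""
    ratios = {}
    for key, value in a_array.items():
        if isinstance(value, list):
            # list[0] is the numerator, list[1] the denominator.
            numerator, denominator = value
            ratios[key] = Fraction(numerator, denominator)
        else:
            ratios[key] = Fraction(value)
    # The value of the first key has to be equal to 1.
    first_value = next(iter(ratios.values()))
    if first_value != 1:
        raise ValueError("The value of the first key in a_array must be 1.")
    return ratios


ratios = subsampling_ratios(a_array)
ratios_as_float = {key: float(value) for key, value in ratios.items()}
```

Storing numerator and denominator separately keeps fractions such as 2/3 exact in the JSON file, which a single decimal number could not do.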
settings_plotting.json¶
The dictionary contains all plotting specifications that are shared across various modules.
Keys¶
- style: string
- Matplotlib style that is used for all plots
- figsize: list
- List that defines the figure sizes
- figsize_theory: list
- List that defines the figure sizes in the theory part
- colors: dictionary
- Dictionary for uniform colors across figures
- ls: dictionary
- Dictionary for uniform line style across figures
subagging_settings.json¶
The dictionary defines the simulation set-up that is specific to the subagging simulation.
Keys¶
- n_ratios: int
- Number of subsampling ratios to be considered
- max_ratio: int, float
- Maximal subsampling ratio
- min_ratio: int, float
- Minimal subsampling ratio
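One way the three keys could translate into a grid of subsampling ratios is sketched below. It assumes that the ratios are evenly spaced between min_ratio and max_ratio and that they lie in (0, 1]; neither assumption is confirmed by the description above, and the helper ratio_grid and the values are hypothetical.

```python
def ratio_grid(min_ratio, max_ratio, n_ratios):
    """Return n_ratios evenly spaced subsampling ratios from min_ratio to max_ratio."""
    if n_ratios < 2:
        raise ValueError("n_ratios must be at least 2 to span a range.")
    # Assumption: a subsampling ratio is a share of the sample, so 0 < ratio <= 1.
    if not 0 < min_ratio <= max_ratio <= 1:
        raise ValueError("Expected 0 < min_ratio <= max_ratio <= 1.")
    step = (max_ratio - min_ratio) / (n_ratios - 1)
    return [min_ratio + i * step for i in range(n_ratios)]


# Illustrative values, not the defaults from subagging_settings.json.
ratios = ratio_grid(min_ratio=0.1, max_ratio=1.0, n_ratios=4)
```

With max_ratio set to 1, the last grid point corresponds to the bagging_ratio of 1 from general_settings.json.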