Main Calculations and Simulations¶
Documentation of the code in src.analysis. This is the core of the project and contains all fundamental calculations and simulations of the final paper.
Theoretical Calculations¶
Documentation of the code in src.analysis.theory_simulation. All modules listed below are used for the theoretical simulations and calculations related to bagging and subagging.
Simulating the convergence for the finite sample case¶
The finite sample simulation can be found under src.analysis.theory_simulation.calc_finite_sample.
A module to calculate the results for the introductory example in Subsection 3.2 of the paper without the dynamic environment of x.
Without choosing a dynamic environment for x, the estimator developed by [3] and illustrated in our paper stabilizes by the (weak) Law of Large Numbers. We simulate this here for a range of sample sizes for a given mean and variance, assuming that Y follows a Gaussian distribution.
-
bagged_indicator
(x_value, sample, b_iterations=50)[source]¶ The bagged indicator function as described in subsection 3.2.
- x_value: int, float
- The value of x to be considered.
- sample: numpy array of shape = [sample_size]
- The sample on which we bootstrap the mean.
- b_iterations: int, optional (Default=50)
- The number of bootstrap iterations to construct the predictor.
Returns the value of the bagged predictor.
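As an illustration, a minimal numpy sketch of such a bagged indicator is given below. It mirrors the documented signature but is not the project's implementation; resampling with replacement to the original sample size is stated here as an assumption about the bootstrap scheme:

    import numpy as np

    def bagged_indicator_sketch(x_value, sample, b_iterations=50, seed=0):
        """Sketch: average the plug-in indicator 1{mean(Y*) <= x_value} over bootstrap samples."""
        rng = np.random.RandomState(seed)
        predictions = np.empty(b_iterations)
        for b in range(b_iterations):
            # Draw a bootstrap sample of the same size with replacement.
            bootstrap_sample = rng.choice(sample, size=sample.shape[0], replace=True)
            # Evaluate the indicator at the bootstrap mean.
            predictions[b] = 1.0 if bootstrap_sample.mean() <= x_value else 0.0
        # The bagged predictor is the average over all bootstrap indicators.
        return predictions.mean()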
-
indicator
(x_value, y_bar)[source]¶ An indicator function that returns 1 if the threshold y_bar is smaller than or equal to the value x_value.
- x_value: int, float
- The value of x to be considered.
- y_bar: int, float
- The value of y_bar to be considered, i.e. the threshold.
-
simulate_finite_sample
(settings)[source]¶ Performs the simulation of the MSE for the bagged and unbagged predictor for a range of sample sizes, which are specified by the settings dictionary. The procedure is described in greater detail in the Appendix Part B.2 of the paper.
- settings: Dictionary as described in Model specifications
- The dictionary that defines the simulation set-up for the finite sample case.
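A hedged sketch of the finite sample exercise is shown below. It calls the documented bagged_indicator and measures the squared error against the stable limit 1{mu <= x_value}; the concrete parameter values and the choice of this target are illustrative assumptions, not the keys of the actual settings dictionary:

    import numpy as np
    from src.analysis.theory_simulation.calc_finite_sample import bagged_indicator

    # Illustrative values only; the real set-up is read from the settings dictionary.
    mu, sigma, x_value = 0.0, 1.0, 0.5
    sample_sizes = [10, 50, 100, 500]
    n_repetitions = 200

    rng = np.random.RandomState(42)
    target = 1.0 if mu <= x_value else 0.0  # the limit the predictor stabilizes to

    for n in sample_sizes:
        squared_errors = np.empty(n_repetitions)
        for r in range(n_repetitions):
            sample = rng.normal(loc=mu, scale=sigma, size=n)
            squared_errors[r] = (bagged_indicator(x_value, sample) - target) ** 2
        print(n, squared_errors.mean())  # the simulated MSE shrinks with the sample size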
Calculations for the introductory example¶
The calculations for the toy example can be found under src.analysis.theory_simulation.calc_toy_example.
A module to calculate the results for the introductory example in Subsection 3.2 of the paper for the dynamic environment of x.
Given the choice of the appropriate environment of x, the estimator does not stabilize even asymptotically, and we can illustrate the effects of bagging on it.
-
calculate_bias_bagged
(c_value)[source]¶ Calculate the squared bias for the bagged predictor given the grid point c_value.
- c_value: float, int
- The gridpoint to be considered.
-
calculate_toy_example
(settings)[source]¶ Calculate the Bias and the Variance for the bagged and the unbagged predictor based on the calculation settings defined in settings.
- settings: Dictionary as described in Model specifications
- The dictionary defines the calculation set-up that is specific to the introductory simulation.
Returns the calculated values as a dictionary.
-
calculate_var_bagged
(c_value)[source]¶ Calculate the variance for the bagged predictor given the grid point c_value.
- c_value: float, int
- The gridpoint to be considered.
-
calculate_var_unbagged
(c_value)[source]¶ Calculate the variance for the unbagged predictor given the grid point c_value.
- c_value: float, int
- The gridpoint to be considered.
-
convolution_cdf_df
(c_value)[source]¶ Calculate the convolution as defined by [3] and as used in the introductory example of our paper for the c.d.f. of the standard normal distribution and the standard normal density at the gridpoint c_value over the real number line.
- c_value: float, int
- The gridpoint to be considered.
-
convolution_cdf_squared_df
(c_value)[source]¶ Calculate the convolution as defined by [3] and as used in the introductory example of our paper for the squared c.d.f. of the standard normal distribution and the standard normal density at the gridpoint c_value over the real number line.
- c_value: float, int
- The gridpoint to be considered.
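Both convolutions can be checked numerically. The sketch below assumes the textbook form of the convolution, the integral of Phi(c - z) * phi(z) (and of the squared c.d.f.) over the real line, evaluated with scipy; it is meant as an illustration next to the module's own implementation:

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    def convolution_cdf_df_sketch(c_value):
        # Convolution of the standard normal c.d.f. with the standard normal density at c.
        integrand = lambda z: norm.cdf(c_value - z) * norm.pdf(z)
        return quad(integrand, -np.inf, np.inf)[0]

    def convolution_cdf_squared_df_sketch(c_value):
        # The same convolution for the squared c.d.f. of the standard normal.
        integrand = lambda z: norm.cdf(c_value - z) ** 2 * norm.pdf(z)
        return quad(integrand, -np.inf, np.inf)[0]

    # Sanity check: the first convolution equals Phi(c / sqrt(2)) in closed form.
    print(convolution_cdf_df_sketch(1.0), norm.cdf(1.0 / np.sqrt(2)))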
Calculations for stump predictors using subagging¶
The calculations for the subagging of stump predictors can be found under src.analysis.theory_simulation.calc_normal_splits.
A module to calculate the results for the general stump predictor in Subsection 4.2 (Theorem 4.1) of the paper for the dynamic environment of x.
Replacing the bootstrap procedure with a subsampling scheme, we calculate upper bounds for the Variance and the Bias of stump predictors, as presented in Subsection 4.2 and following the framework developed by [3].
-
bias_normal_splits
(c_value, a_value, gamma)[source]¶ Calculates the squared bias for stump predictors as defined in the paper in Theorem 4.1.
- c_value: int, float
- The gridpoint to be considered.
- a_value: float
- The subsampling fraction.
- gamma: float
- The rate of convergence of the estimator.
Returns the squared bias.
-
calculate_normal_splits
(settings)[source]¶ Calculate the Bias and the Variance for the case of subagging based on the calculation settings defined in settings.
- settings: Dictionary as described in Model specifications
- The dictionary defines the calculation set-up that is specific to the stump predictor simulation.
Returns the calculated values as a dictionary.
-
variance_normal_splits
(c_value, a_value, gamma)[source]¶ Calculates the variance for stump predictors as defined in the paper in Theorem 4.1.
- c_value: int, float
- The gridpoint to be considered.
- a_value: float
- The subsampling fraction.
- gamma: float
- The rate of convergence of the estimator.
Returns the variance.
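A short usage sketch of these functions is given below; the subsampling fraction, the rate of convergence and the grid of c values are illustrative choices, whereas in the project they come from the settings dictionary:

    import numpy as np
    from src.analysis.theory_simulation.calc_normal_splits import (
        bias_normal_splits,
        variance_normal_splits,
    )

    a_value = 0.5                   # illustrative subsampling fraction
    gamma = 0.5                     # illustrative rate of convergence
    c_grid = np.linspace(-3, 3, 7)  # illustrative grid of c values

    for c_value in c_grid:
        # Squared bias and variance of the subagged stump predictor at this grid point.
        bias_squared = bias_normal_splits(c_value, a_value, gamma)
        variance = variance_normal_splits(c_value, a_value, gamma)
        print(round(float(c_value), 2), bias_squared, variance)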
Main Simulations¶
Documentation of the code in src.analysis.main_simulation.
All modules listed below use the MonteCarloSimulation
class in src.analysis.montecarlosimulation.
I define the simulation set-up and the data generating process as a class instance.
Using the functions of the class, I then analyze changes in the bagging parameters for an otherwise constant
simulation set-up.
For more details regarding the general simulation set-up see Main Algorithms and Model code.
The Case of Subagging¶
The module with the simulations for subagging can be found under src.analysis.main_simulation.calc_simulation_subagging.
This module simulates the dependence of the subagging results on the subsampling fraction and sets it in relation to bagging.
For this we use the MonteCarloSimulation
Class described in Main Algorithms and Model code
in the simulate_bagging_subagging() function and return the results as a
dictionary.
The intuition of the simulation set-up from Main Algorithms and Model code and
design_choice also carries over to this module.
-
simulate_bagging_subagging
(general_settings, subagging_settings, model)[source]¶ A function that simulates the subsampling ratio dependency of the Subagging Algorithm.
- general_settings: Dictionary as described in Model specifications
- The dictionary is shared across various simulations and defines the overall simulation set-up.
- subagging_settings: Dictionary as described in Model specifications
- The dictionary defines the simulation set-up that is specific to the subagging simulation.
- model: String that defines the data generating process to be considered.
- The options are ‘friedman’, ‘linear’ and ‘indicator’; the choice is usually passed as the first system argument.
- Returns a tuple of the simulation results:
- tuple[0]: numpy array of shape = [4]
- The array consists of the MSPE decomposition for the Bagging Algorithm.
- tuple[1]: numpy array of shape = [n_ratios, 4], where n_ratios is the number of subsampling ratios to be considered, as defined by keys in subagging_settings.
- The array consists of the MSPE decomposition for each of those subsampling fractions.
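The project runs this comparison through the MonteCarloSimulation class. Purely to illustrate the bagging-versus-subagging idea in a self-contained way, the sketch below uses scikit-learn's BaggingRegressor on the Friedman #1 process (bootstrap resampling for bagging, subsampling without replacement for subagging); it is not the project's code and the parameter values are arbitrary:

    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import BaggingRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    x_matrix, y_vector = make_friedman1(n_samples=500, noise=1.0, random_state=0)
    x_train, x_test, y_train, y_test = train_test_split(
        x_matrix, y_vector, test_size=0.25, random_state=0
    )

    # Bagging uses bootstrap samples of full size; subagging draws a fraction a without replacement.
    for label, bootstrap, ratio in [("bagging", True, 1.0),
                                    ("subagging a=0.5", False, 0.5),
                                    ("subagging a=0.8", False, 0.8)]:
        ensemble = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                    max_samples=ratio, bootstrap=bootstrap,
                                    random_state=0)
        ensemble.fit(x_train, y_train)
        print(label, round(mean_squared_error(y_test, ensemble.predict(x_test)), 3))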
Varying the Number of Bootstrap Iterations¶
The module regarding the simulations of the convergence of bagging towards a stable value can be found under src.analysis.main_simulation.calc_simulation_convergence.
This module simulates the convergence of bagging towards a stable value as seen in Subsection 5.4 of the final paper.
For this we use the MonteCarloSimulation
class described in Main Algorithms and Model code
in the simulate_convergence() function and return the results as a dictionary.
The intuition of the simulation set-up from Main Algorithms and Model code and
design_choice also carries over to this module.
-
simulate_convergence
(general_settings, convergence_settings, model)[source]¶ A function that simulates the convergence of the Bagging Algorithm.
- general_settings: Dictionary as described in Model specifications
- The dictionary is shared across various simulations and defines the overall simulation set-up.
- convergence_settings: Dictionary as described in Model specifications
- The dictionary defines the simulation set-up that is specific to the convergence of the Bagging Algorithm.
- model: String that defines the data generating process to be considered.
- The options are ‘friedman’, ‘linear’ and ‘indicator’; the choice is usually passed as the first system argument.
- Returns a tuple of the simulation results:
- tuple[0]: numpy array of shape = [len(n_bootstraps_array), 4], where n_bootstraps_array is the array of bootstrap iterations to be considered, as defined by keys in convergence_settings.
- The array consists of the MSPE decomposition for each of those bootstrap iterations.
- tuple[1]: numpy array of shape = [4]
- The MSPE decomposition for a larger number of bootstrap iterations.
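Again only as a self-contained illustration of the underlying idea, and not as the project's simulation, the convergence towards a stable prediction error can be eyeballed by increasing the number of bootstrap replicates in scikit-learn's BaggingRegressor; all values are arbitrary:

    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import BaggingRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    x_matrix, y_vector = make_friedman1(n_samples=500, noise=1.0, random_state=1)
    x_train, x_test, y_train, y_test = train_test_split(
        x_matrix, y_vector, test_size=0.25, random_state=1
    )

    # The test MSPE stabilizes once the number of bootstrap iterations is large enough.
    for n_bootstraps in [1, 5, 10, 25, 50, 100]:
        ensemble = BaggingRegressor(DecisionTreeRegressor(), n_estimators=n_bootstraps,
                                    random_state=1)
        ensemble.fit(x_train, y_train)
        print(n_bootstraps, round(mean_squared_error(y_test, ensemble.predict(x_test)), 3))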
Varying the Complexity of the Regression Trees¶
The module regarding the simulations of the model complexity can be found under src.analysis.main_simulation.calc_simulation_tree_depth.
This module simulates the variations in the model complexity governed by the Tree depth for the Bagging Algorithm.
For this we use the MonteCarloSimulation
Class described in Main Algorithms and Model code
in the simulate_tree_depth() function and return the results as a dictionary.
The intuition of the simulation set-up from Main Algorithms and Model code and
design_choice also carries over to this module.
-
simulate_tree_depth
(general_settings, tree_depth_settings, model)[source]¶ A function that simulates the variations in tree depth and its effect on the MSPE decomposition for the Bagging Algorithm.
- general_settings: Dictionary as described in Model specifications
- The dictionary is shared across various simulations and defines the overall simulation set-up.
- tree_depth_settings: Dictionary as described in Model specifications
- The dictionary defines the simulation set-up that is specific to the tree depth simulation.
- model: String that defines the data generating process to be considered.
- The options are ‘friedman’, ‘linear’ and ‘indicator’; the choice is usually passed as the first system argument.
- Returns a tuple of the simulation results:
- tuple[0]: numpy array of shape = [min_split_array.size, 4], where min_split_array is the array of minimal split values we want to consider, as defined by keys in tree_depth_settings.
- The array consists of the MSPE decomposition for each of those minimal split values for the Bagging Algorithm.
- tuple[1]: numpy array of shape = [min_split_array.size, 4], with the same structure as tuple[0].
- The array consists of the MSPE decomposition for each of those minimal split values for the unbagged Tree.
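As with the previous sketches, the effect of tree complexity can be illustrated outside the project's MonteCarloSimulation class by varying the minimal number of samples required to split a node; the values are arbitrary and the snippet is not the project's code:

    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import BaggingRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    x_matrix, y_vector = make_friedman1(n_samples=500, noise=1.0, random_state=2)
    x_train, x_test, y_train, y_test = train_test_split(
        x_matrix, y_vector, test_size=0.25, random_state=2
    )

    # Larger minimal split values mean shallower, less complex trees.
    for min_split in [2, 10, 50, 100]:
        tree = DecisionTreeRegressor(min_samples_split=min_split, random_state=2)
        bagged = BaggingRegressor(DecisionTreeRegressor(min_samples_split=min_split),
                                  n_estimators=50, random_state=2)
        tree.fit(x_train, y_train)
        bagged.fit(x_train, y_train)
        print(min_split,
              round(mean_squared_error(y_test, tree.predict(x_test)), 3),
              round(mean_squared_error(y_test, bagged.predict(x_test)), 3))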
Real Data Simulations using the Boston Housing Data¶
Documentation of the code in src.analysis.real_data_simulation.
Following the simulation set-up by [1], we show that the method also works when applied to
real data.
As Bagging applied to Regression Trees is mostly used for prediction purposes, we pick a
classical prediction problem data set, namely the Boston Housing Data Set.
It was obtained from the scikit-learn library sklearn.datasets.
The module with the real data simulations can be found under src.analysis.real_data_simulation.calc_boston.
This module simulates the MSPE for bagging and subagging for the Boston Housing data set. The simulation set-up is the following:
- For each simulation iteration, follow this procedure:
- Randomly divide the data set into a training and a test set
- Fit the predictor (Tree, Bagging, Subagging) to the training set
- Using this new predictor, make a prediction on the current test set and save the predicted values
- Compute the average prediction error of the current test set and save the value
- Compute the MSPE as the mean of average prediction errors of each iteration
For this we use the BaggingTree
Class described in Main Algorithms and Model code in the
simulate_bagging() and simulate_subagging() functions and write the
results as a dictionary.
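The sketch below mirrors this procedure in a hedged, self-contained way with scikit-learn. It is not the project's implementation, and the data loading step is left out because load_boston has been removed from recent scikit-learn releases; x_matrix and y_vector are assumed to hold the Boston covariates and house prices:

    import numpy as np
    from sklearn.ensemble import BaggingRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    def simulate_mspe_sketch(x_matrix, y_vector, n_iterations=100, ratio_test=0.25, seed=0):
        """Illustrative version of the documented loop, not the project's simulate_bagging()."""
        random_state = np.random.RandomState(seed)
        errors = np.empty(n_iterations)
        for i in range(n_iterations):
            # 1. Randomly divide the data set into a training and a test set.
            x_train, x_test, y_train, y_test = train_test_split(
                x_matrix, y_vector, test_size=ratio_test, random_state=random_state
            )
            # 2. Fit the predictor (here: bagged trees) to the training set.
            predictor = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50)
            predictor.fit(x_train, y_train)
            # 3./4. Predict on the current test set and save the average prediction error.
            errors[i] = mean_squared_error(y_test, predictor.predict(x_test))
        # 5. The MSPE is the mean of the per-iteration average prediction errors.
        return errors.mean()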
-
simulate_bagging
(x_matrix, y_vector, general_settings, boston_settings)[source]¶ A function that simulates the MSPE for the Bagging Algorithm using the Boston Housing data.
- x_matrix: numpy-array with shape = [n_size, n_features] (Default: None)
- The covariate matrix x_matrix with sample size n_size and n_features covariates.
- y_vector: numpy-array with shape = [n_size] (Default: None)
- The vector of the dependent variable y_vector with sample size n_size.
- general_settings: Dictionary as described in Model specifications
- The dictionary is shared across various simulations and defines the overall simulation set-up.
- boston_settings: Dictionary as described in Model specifications
- The dictionary defines the simulation set-up that is specific to the boston simulation.
Returns the simulated MSPE.
-
simulate_subagging
(x_matrix, y_vector, general_settings, subagging_settings, boston_settings)[source]¶ A function that simulates the MSPE for the Subagging Algorithm over a range of subsampling fractions.
- x_matrix: numpy-array with shape = [n_size, n_features] (Default: None)
- The covariate matrix x_matrix with sample size n_size and n_features covariates.
- y_vector: numpy-array with shape = [n_size] (Default: None)
- The vector of the dependent variable y_vector with sample size n_size.
- general_settings: Dictionary as described in Model specifications
- The dictionary is shared across various simulations and defines the overall simulation set-up.
- subagging_settings: Dictionary as described in Model specifications
- The dictionary defines the simulation set-up that is specific to subagging simulations.
- boston_settings: Dictionary as described in Model specifications
- The dictionary defines the simulation set-up that is specific to the boston simulation.
Returns a numpy array with the simulated MSPE for each subsampling fraction.
-
split_fit_predict_bagging
(x_matrix, y_vector, ratio_test, random_state, bagging_object)[source]¶ A function that splits the data consisting of x_matrix and y_vector into a new test and training sample, fits the Bagging Algorithm on the training sample and makes a prediction on the test sample. Finally, the MSPE is computed.
- x_matrix: numpy-array with shape = [n_size, n_features] (Default: None)
- The covariate matrix x_matrix with sample size n_size and n_features covariates.
- y_vector: numpy-array with shape = [n_size] (Default: None)
- The vector of the dependent variable y_vector with sample size n_size.
- ratio_test: float
- The ratio of the data used for the test sample.
- random_state: Numpy RandomState container
- The RandomState instance to perform the train/test split.
- bagging_object: Instance of the BaggingTree class
- The bagging instance used to fit the algorithm to the newly split data.
Returns the MSPE for one iteration.