Main Algorithms and Model code¶
The directory src.model_code contains source files for the Bagging Algorithm, the Data Generating Process and a module that is used to perform the Monte Carlo Simulations.
The BaggingTree class¶
The BaggingTree class implements the Bagging Algorithm applied to Regression Trees. For the Regression Trees themselves, the implementation of sklearn is used; they were not reimplemented within this algorithm.
Note that this implementation of the Bagging Algorithm does not run in parallel, even though Bagging can be considered embarrassingly parallel, as already noted by [1] in his pioneering paper.
A runtime analysis using the Python package cProfile showed that if I parallelize the algorithm with the Python package joblib, the run time is higher than in the current sequential version.
One reason for this is that I restrict the number of bootstrap iterations to 50 and the sample size to 500 in the majority of the paper.
The overhead created by launching and managing multiple threads is higher than the actual runtime gain from processing the bagging operations separately when there are only comparably few bagging iterations.
Only when I increase the number of bootstrap iterations considerably (e.g. to 500 iterations) does a parallel execution become profitable in terms of runtime.
Hence, for the parameter choices I consider in this paper, a parallel execution turned out not to be profitable.
Note however that given a different parameter set (more bootstrap iterations/larger sample size), parallelizing the bagging algorithm is likely to be desirable; a possible approach is sketched below.
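For reference, a minimal sketch of how the bootstrap loop could be parallelized with joblib. The helper fit_one_tree and the toy data are hypothetical illustrations, not part of the actual BaggingTree implementation:

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.tree import DecisionTreeRegressor

    def fit_one_tree(x_matrix, y_vector, seed):
        # Hypothetical helper: draw one bootstrap sample and fit a single tree.
        rng = np.random.RandomState(seed)
        idx = rng.randint(0, x_matrix.shape[0], size=x_matrix.shape[0])
        return DecisionTreeRegressor().fit(x_matrix[idx], y_vector[idx])

    # Illustrative toy data standing in for a training sample.
    rng = np.random.RandomState(0)
    x_matrix = rng.uniform(size=(500, 10))
    y_vector = rng.normal(size=500)

    # One job per bootstrap iteration. For b_iterations=50 and n=500 the
    # scheduling overhead tends to dominate the per-tree fitting time.
    trees = Parallel(n_jobs=-1)(
        delayed(fit_one_tree)(x_matrix, y_vector, seed) for seed in range(50)
    )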
The BaggingTree class can be found under src.model_code.baggingtree.
This module implements the Bagging Algorithm used for the main simulations of this paper. To use it, you first define a class instance that specifies the parameters for the algorithm. Then use the fit() function to fit the algorithm to a training sample. Predictions on a new sample can be made using the predict() function. A short usage sketch follows the class reference below.
- class BaggingTree(random_seed=None, ratio=1.0, bootstrap=True, b_iterations=50, min_split_tree=2)[source]¶ A class that implements the Bagging Algorithm applied to Regression Trees. For the Regression Trees we use the implementation of scikit-learn.
- random_seed: int or None, optional (Default: None)
random_seed is used to specify the RandomState for numpy.random. It is shared across all functions of the class and needs to be specified like a usual random seed, as it is passed to numpy.random.
IMPORTANT: This random seed is fixed for a specific instance, as it specifies a new RandomState for all numpy functions used in this class. As a result, this random_seed is not overwritten by numpy random seeds defined outside of the specific class instance. The reason for this is that it makes reproducibility easier across different simulations and modules. Note however that the downside is that we have to specify a different random seed for each class instance; it is not possible to set a single random seed at the beginning of the whole simulation, as each class instance defines its own RandomState.
For further information on this, see Overview and explanations for different design choices.
- ratio: float, optional (Default=1.0)
The fraction of observations used for the subsampling procedure. Each sample we draw for the algorithm will be of size math.ceil(n_observations * self.ratio).
Needs to be greater than 0 and at most 1.0.
In accordance with the theoretical treatment in the paper, one would want to choose ratio<1.0 for bootstrap=False (Subagging) and ratio=1.0 for bootstrap=True (Bagging).
- min_split_tree: int, optional (Default=2)
The minimal number of observations a node of the Regression Trees must contain to be considered for a split. Use this to control the complexity of the Regression Tree.
Needs to be greater than 1.
- b_iterations: int, optional (Default=50)
The number of bootstrap iterations used to construct the bagging/subagging predictor.
Needs to be greater than 0.
- bootstrap: bool, optional (Default=True)
Specify whether to use the standard bootstrap (Bagging) or the m out of n bootstrap (Subagging).
Default=True implies that we use Bagging.
- fit(x_matrix, y_vector)[source]¶ Fit the Bagging Algorithm anew to a sample (usually the training sample) that consists of the covariate matrix x_matrix and the vector of the dependent variable y_vector.
- x_matrix: numpy-array with shape = [n_size, n_features] (Default: None)
- The covariate matrix x_matrix with sample size n_size and n_features covariates.
- y_vector: numpy-array with shape = [n_size,] (Default: None)
- The vector of the dependent variable y_vector with sample size n_size.
- predict(x_matrix)[source]¶ Make a prediction with a trained class instance (using the fit() function first) on a new covariate matrix x_matrix (test sample).
- x_matrix: numpy-array with shape = [n_size, n_features] (Default: None)
- The covariate matrix x_matrix of the new test sample with size n_size and n_features covariates.
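A minimal usage sketch following the interface documented above; the toy data is purely illustrative (in the project, the samples come from the DataSimulation class):

    import numpy as np
    from src.model_code.baggingtree import BaggingTree

    # Illustrative toy data standing in for training and test samples.
    rng = np.random.RandomState(1)
    x_train = rng.uniform(size=(500, 10))
    y_train = rng.normal(size=500)
    x_test = rng.uniform(size=(100, 10))

    # Bagging: full-size bootstrap samples (ratio=1.0, bootstrap=True).
    bagged_tree = BaggingTree(
        random_seed=2, ratio=1.0, bootstrap=True, b_iterations=50, min_split_tree=2
    )
    bagged_tree.fit(x_train, y_train)
    predictions = bagged_tree.predict(x_test)  # numpy array with shape = [100,]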
The DataSimulation class¶
The DataSimulation class can be found under src.model_code.datasimulation.
This module implements different data generating processes within the DataSimulation class. In order to make the results for different functional forms of f(x_matrix) comparable, we define the attributes of the data simulation within a class. All functions to which we apply the Bagging Algorithm then share the same noise, size and random_seed. This is helpful as I want to compare the effectiveness of the Bagging Algorithm among different functional forms while keeping attributes like the sample size or the noise constant across different regression functions.
- class DataSimulation(random_seed=None, n_size=500, noise=1.0, without_error=False)[source]¶ A class that collects different data generating processes.
- random_seed: int or None, optional (Default: None)
random_seed is used to specify the RandomState for numpy.random. It is shared across all functions of the class and needs to be specified like a usual random seed, as it is passed to numpy.random.
IMPORTANT: This random seed is fixed for a specific instance, as it specifies a new RandomState for all numpy functions used in this class. As a result, this random_seed is not overwritten by numpy random seeds defined outside of the specific class instance. The reason for this is that it makes reproducibility easier across different simulations and modules. Note however that the downside is that we have to specify a different random seed for each class instance; it is not possible to set a single random seed at the beginning of the whole simulation, as each class instance defines its own RandomState.
For further information on this, see Overview and explanations for different design choices.
- n_size: int, optional (Default=500)
The sample size used when calling one of the data generating functions.
Needs to be greater than 0.
- noise: int or float, optional (Default=1.0)
- The variance of the error term that is used for the data generating processes. The default of noise = 1.0 indicates that we draw an error term that is standard normally distributed.
- without_error: bool, optional (Default=False)
Specify whether the data should be generated without the error term added.
Default=False implies that the data is created with an error term. Change this option to True if you want to create a test sample for which you draw a new error term in each simulation iteration.
- friedman_1_model()[source]¶ Returns the covariate matrix x_matrix (shape = [n_size, 10]) and the target variable y_vector (shape = [n_size]) of the Friedman #1 Model by [5] as numpy arrays for the values specified in the class instance. Note that x6 to x10 do not contribute to y_vector and can be considered as ‘noise’ variables.
The full functional form is given in the paper and in [5].
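For orientation, the standard Friedman #1 regression function from [5] is y = 10 sin(pi * x1 * x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e, with the ten covariates drawn independently from the uniform distribution on [0, 1]; the implementation here is assumed to follow this standard specification.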
- indicator_model()[source]¶ Returns the covariate matrix x_matrix (shape = [n_size, 5]) and the target variable y_vector (shape = [n_size]) of the M3 Model from [4] as numpy arrays for the values specified in the class instance.
Note that this data generating process was not used in the final paper, but offers an interesting comparison for the reader and was thus added later to the appendix.
See [4] for the exact functional form.
- linear_model()[source]¶ Returns the covariate matrix x_matrix (shape = [n_size, 10]) and the target variable y_vector (shape = [n_size]) of the linear model from [6] as numpy arrays for the values specified in the class instance. Note that x6 to x10 do not contribute to y_vector and can be considered as ‘noise’ variables.
The full functional form is given in the paper and in [6].
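A short usage sketch for the class, assuming only the interface documented above:

    from src.model_code.datasimulation import DataSimulation

    # One instance fixes seed, sample size and noise for all functional forms,
    # which keeps the draws comparable across regression functions.
    data_process = DataSimulation(random_seed=3, n_size=500, noise=1.0, without_error=False)

    x_friedman, y_friedman = data_process.friedman_1_model()  # shapes: [500, 10], [500]
    x_linear, y_linear = data_process.linear_model()          # shapes: [500, 10], [500]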
The MonteCarloSimulation class¶
The MonteCarloSimulation class implements the Monte Carlo simulation for a given set of parameters as it was used in the simulation part of the paper.
We picked this simulation procedure as we wanted to emphasize the decomposition of the mean squared prediction error at a new input point into the Bias and the Variance, but also the irreducible Noise term.
It is used in the calculation modules in src.analysis.main_simulation, where I consider different parameter variations for the
Bagging Algorithm to observe changes in the MSPE, Bias and the Variance.
The parameters that are specific to the data generating process are defined in the class instance.
Parameters for the Bagging Algorithm are defined in the functions.
The MonteCarloSimulation class can be found under src.model_code.montecarlosimulation.
This module performs the simulations of the MSPE and its decomposition into squared bias, variance and noise for the Bagging Algorithm as described in the paper.
In all simulations we use the following procedure:
- Generate a test sample, without an error term, according to the data generating process of interest. This sample is kept constant for the whole simulation study; all predictions are made on it.
- For each simulation iteration we follow this procedure:
- Draw new error terms for the test sample.
- Draw a new training sample with regressors and error terms.
- Fit the predictor (Tree, Bagging, Subagging) to the generated training data.
- Using this new predictor, make a prediction on the fixed test sample and save the predicted values.
- We compute the MSPE, squared bias and variance for the given predictor at the input point x_matrix = x0, with x0 being the test sample generated in the first step.
Within one class instance we define all parameters that are relevant to the data generating process (DGP) and the simulation set-up. Parameters that are specific to the Bagging Algorithm are defined in the functions. The idea is to define one class instance and then loop over different bagging parameters for the same instance, keeping the DGP and the simulation set-up constant. The function calc_mse() computes the MSPE and its decomposition for one specification, and calc_mse_all_ratios() for a range of subsampling ratios.
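For intuition, a minimal sketch of the final decomposition step. The array names are hypothetical stand-ins for the simulation output, and the actual module may compute the MSPE directly from the noisy test draws rather than via the identity used here:

    import numpy as np

    rng = np.random.RandomState(4)
    n_repeat, n_test = 100, 500
    noise_variance = 1.0  # variance of the error term (the irreducible part)

    # Hypothetical stand-ins for the simulation output:
    f_true = rng.uniform(size=n_test)  # error-free targets f(x0) at the fixed test sample
    predictions = f_true + rng.normal(0, 0.5, size=(n_repeat, n_test))  # one row per iteration

    avg_prediction = predictions.mean(axis=0)                # E[f_hat(x0)] per test point
    squared_bias = ((f_true - avg_prediction) ** 2).mean()   # averaged over the test sample
    variance = predictions.var(axis=0).mean()
    mspe = squared_bias + variance + noise_variance          # decomposition identity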
- class MonteCarloSimulation(n_repeat=100, noise=1.0, data_process='friedman', n_test_train=(500, 500), random_seeds=(None, None, None, None))[source]¶ A class that implements a Monte Carlo Simulation for the Bagging Algorithm.
- random_seeds: tuple or list of size 4 consisting of int or None, optional (Default: (None, None, None, None))
Specify the random seeds that will be used for the simulation study. We have to use different random seeds, as we define a different RandomState instance for each part of the simulation.
random_seeds[0]: Defines the RandomState for the noise term draw
random_seeds[1]: Defines the RandomState for the BaggingTree class
random_seeds[2]: Defines the RandomState for the training sample draws
random_seeds[3]: Defines the RandomState for the test sample draw
All random seeds need to be different from one another, and each must be specified like a usual random seed, as it is passed to numpy.random.
IMPORTANT: These random seeds are fixed for a specific instance, as each specifies a new RandomState for the numpy functions used in this class. As a result, they are not overwritten by numpy random seeds defined outside of the specific class instance. The reason for this is that it makes reproducibility easier across different simulations and modules. Note however that the downside is that we have to specify different random seeds for each class instance; it is not possible to set a single random seed at the beginning of the whole simulation, as each class instance defines its own RandomState.
For further information on this, see Overview and explanations for different design choices.
- noise: int or float, optional (Default=1.0)
The standard deviation of the error term that is used for the data generating processes. The default of noise = 1.0 indicates that we draw an error term that is standard normally distributed.
Needs to be greater than zero.
Note that we cannot draw noise=0, as this would not be in line with the simulation setup that we have chosen.
- n_test_train: tuple or list of size 2 with int, optional (Default=(500, 500))
Specify the sample size of the test sample and the training samples.
n_test_train[0]: Defines the size for the test sample
n_test_train[1]: Defines the size for the training samples
Both need to be greater than zero.
- data_process: string, optional (Default=’friedman’)
- Defines which data generating process we use. The options are ‘friedman’, ‘linear’ and ‘indicator’.
- n_repeat: int, optional (Default=100)
The number of Monte Carlo repetitions to use for the simulation.
Needs to be greater than zero.
- calc_mse(ratio=1.0, bootstrap=True, min_split_tree=2, b_iterations=50)[source]¶ A function to simulate the MSPE decomposition for one specific specification of the Bagging Algorithm applied to Regression Trees. The simulation set-up and the data generating process are given by the respective class instance. We want to compare the output of this function with respect to variations in the Bagging parameters.
- Returns a numpy array of size 4 with the MSPE decomposition:
- array[0]: Simulated MSPE
- array[1]: Simulated squared bias
- array[2]: Simulated variance
- array[3]: Simulated noise
- ratio: float, optional (Default=1.0)
- The fraction of observations used for each sample in the simulation procedure. Each sample we draw for the Bagging Algorithm will be of size math.ceil(n_observations * self.ratio).
- min_split_tree: int, optional (Default=2)
The minimal number of observations a node of the Regression Trees used in the simulation must contain to be considered for a split. Use this to control the complexity of the Regression Tree.
Needs to be greater than 1.
- b_iterations: int, optional (Default=50)
- The number of bootstrap iterations used to construct the bagging/subagging predictor in the simulation.
- bootstrap: bool, optional (Default=True)
Specify whether to use the standard bootstrap (Bagging) or the m out of n bootstrap (Subagging).
Default=True implies that we use Bagging.
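A sketch of the intended workflow, keeping one class instance fixed and looping over Bagging parameters; the seed values are illustrative only:

    from src.model_code.montecarlosimulation import MonteCarloSimulation

    simulation = MonteCarloSimulation(
        n_repeat=100,
        noise=1.0,
        data_process='friedman',
        n_test_train=(500, 500),
        random_seeds=(10, 11, 12, 13),
    )

    # DGP and simulation set-up stay constant; only the Bagging parameter varies.
    for b_iterations in (10, 50, 100):
        mspe, squared_bias, variance, noise = simulation.calc_mse(
            ratio=1.0, bootstrap=True, min_split_tree=2, b_iterations=b_iterations
        )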
- calc_mse_all_ratios(n_ratios=10, min_ratio=0.1, max_ratio=1.0, min_split_tree=2, b_iterations=50)[source]¶ A function to simulate the MSPE decomposition for a range of subsampling fractions for one specification of the Subagging Algorithm applied to Regression Trees. The simulation set-up and the data generating process are given by the respective class instance. We want to compare the output of this function with respect to variations in the Bagging parameters and the variation between the subsampling fractions. Note that by default we use subsampling instead of the standard bootstrap.
The range of subsampling ratios is created by np.linspace(min_ratio, max_ratio, n_ratios).
- Returns a numpy array of shape = [n_ratios, 4] with the MSPE decomposition for the n_ratios different subsampling ratios:
- array[:,0]: Simulated MSPE for all subsampling ratios
- array[:,1]: Simulated squared bias for all subsampling ratios
- array[:,2]: Simulated variance for all subsampling ratios
- array[:,3]: Simulated noise for all subsampling ratios
- n_ratios: int, optional (Default=10)
The number of subsampling fractions we want to consider for the simulation.
Needs to be greater than 1.
- min_ratio: float, optional (Default=0.1)
The minimal subsampling fraction to be considered.
Needs to be between zero and one and smaller than max_ratio.
- max_ratio: float, optional (Default=1.0)
The maximal subsampling fraction to be considered.
Needs to be between zero and one and larger than min_ratio.
- min_split_tree: int, optional (Default=2)
The minimal number of observations a node of the Regression Trees used in the simulation must contain to be considered for a split. Use this to control the complexity of the Regression Tree.
Needs to be greater than 1.
- b_iterations: int, optional (Default=50)
The number of bootstrap iterations used to construct the subagging predictor in the simulation.
Must be greater than zero.
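Analogously, a short sketch for the subagging grid, again with an illustrative class instance:

    from src.model_code.montecarlosimulation import MonteCarloSimulation

    simulation = MonteCarloSimulation(n_repeat=100, data_process='friedman')

    # Subsampling fractions np.linspace(0.1, 1.0, 10); subsampling is the default here.
    results = simulation.calc_mse_all_ratios(
        n_ratios=10, min_ratio=0.1, max_ratio=1.0, min_split_tree=2, b_iterations=50
    )
    mspe_by_ratio = results[:, 0]          # simulated MSPE, one entry per ratio
    squared_bias_by_ratio = results[:, 1]  # simulated squared bias per ratio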