CADStat: Statistical Tools for Causal Analysis
Classification and regression trees
Recursive partitioning is a method for analyzing the relationship between a response and a collection of explanatory variables. Recursive partitioning algorithms repeatedly split the data into groups such that cases within a group are as similar as possible and the differences between groups are as large as possible. The split points are often called nodes, and a collection of splits is referred to as a tree.
The main advantage of classification and regression trees is that they make no assumptions about the form of the underlying relationship between the response/dependent variable and the explanatory/independent variables. Additionally, there are no assumptions about the underlying distribution of the data or of the residuals from the model fit. The focus of most tree modeling is prediction. If the response/dependent variable is continuous, the analysis is considered a regression problem; if the response/dependent variable is categorical, it is considered a classification problem.
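The distinction can be illustrated directly in R with the rpart package, which CADStat uses for this tool (see the model output below). This is a minimal sketch: temp.avg and the explanatory variables come from the example data set used later on this page, while habitat.class is a hypothetical categorical response included only for contrast.

library(rpart)

# Continuous response -> regression tree ("anova" method)
reg.fit <- rpart(temp.avg ~ lat + area + elev.ut, data = mergedData,
                 method = "anova")

# Categorical response -> classification tree ("class" method);
# habitat.class is a hypothetical factor variable, shown for contrast
cls.fit <- rpart(habitat.class ~ lat + area + elev.ut, data = mergedData,
                 method = "class")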
Select Analysis Tools -> Tree Regression from the menus. A dialog box will open. Select the data set of interest from the pull-down menu, or browse for a tab-delimited text file. The Data Subsetting tab can be used to select a subset of the data file by choosing a variable from the pull-down menu and then selecting the levels of that variable to include. You can hold down the <CTRL> key to add several different levels.
Select a Dependent variable (the variable you wish to predict).
Select all Independent variables you wish to include in the model (the variables used to predict the dependent variable). You may hold down the <CTRL> key to select several variables. Note: the dependent variable appears in the list of possible independent variables, but it should not be selected as an independent variable, or a degenerate regression will be produced.
You can further control the tree regression through the three analysis options; they are explained briefly in the example below.
The output is a graph of the tree structure, labeled with the predicted value of the response at each terminal node.
First launch the tree regression tool by selecting Analysis Tools -> Tree Regression.
Once this option is selected, a new window will open with the available options for the tree regression analysis.
For this example, select mergedData as your active data set (see the help page on Loading and merging data for information on loading CADStat example data). Then, select average stream temperature (temp.avg) as the response/dependent variable. Hold down the <CTRL> key and select latitude, log catchment area, and elevation (lat, area, elev.ut) as explanatory variables.
The Min Split option specifies the minimum number of cases that must be present in a node before a split is attempted; as this number decreases, splits become more likely. The Min Bucket option specifies the minimum number of observations that must exist in any terminal node. The CP option is a complexity parameter: as its value increases, a larger effective difference between groups is required before a split occurs. In an ANOVA context, CP corresponds to the increase in R2 needed to perform a split. With large data sets, increasing the CP value decreases the computing time needed. In this example, it is left at the default 0.01.
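For reference, these dialog settings correspond roughly to the following rpart call in R; this is a sketch, assuming mergedData has been loaded into the R session and that Min Split and Min Bucket are left at rpart's own defaults (20 and 7).

library(rpart)

# Regression tree for average stream temperature; the control arguments
# mirror the three dialog options (Min Split, Min Bucket, CP)
fit <- rpart(temp.avg ~ lat + area + elev.ut,
             data = mergedData,
             minsplit = 20,    # Min Split: minimum cases in a node before splitting
             minbucket = 7,    # Min Bucket: minimum cases in any terminal node
             cp = 0.01)        # CP: minimum improvement required for a split
print(fit)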
The text output sent to the CADStat console window gives the details of the model fit:
Call:
rpart(formula = as.formula(formula), data = my.data, minsplit = minsplit,
minbucket = minbucket, cp = cp)
n=125 (123 observations deleted due to missingness)
CP nsplit rel error xerror xstd
1 0.35284747 0 1.0000000 1.0117763 0.11690690
2 0.06556603 1 0.6471525 0.7750921 0.09581347
3 0.05290123 3 0.5160205 0.8017522 0.10312354
4 0.02119298 4 0.4631192 0.7935807 0.10263968
5 0.01976403 5 0.4419263 0.8081870 0.10968343
6 0.01109040 6 0.4221622 0.8148225 0.11010401
7 0.01000000 7 0.4110718 0.8056429 0.10706531
A set of models is presented, each with an increasing number of splits to the data. For each model, the relative error is equal to (1 – explained variability), which is analogous to 1 – R2 for an ANOVA model. Here, the results for seven models are shown, with the number of splits ranging from 0 to 7. Relative errors are scaled so that the error for the root node (no splits) equals 1; all subsequent errors in this column are expressed relative to that value, so the models can be compared directly. The xerror column gives a cross-validation estimate of the error, and xstd gives its corresponding standard error.
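In R, this table can be reproduced and used to prune the tree back to the complexity with the lowest cross-validation error; a sketch, assuming fit is the rpart object from the call above:

printcp(fit)   # print the CP table shown above
plotcp(fit)    # plot cross-validated error against tree size

# Choose the CP value with the smallest cross-validation error (xerror)
# and prune the tree back to that complexity
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned.fit <- prune(fit, cp = best.cp)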
The process by which the tree is estimated is detailed in the rest of the output.
Node number 1: 125 observations, complexity param=0.3528475
mean=14.5336, MSE=10.21594
left son=2 (52 obs) right son=3 (73 obs)
Primary splits:
elev.ut < 1589.999 to the right, improve=0.35284750, (0 missing)
area < 3.313698 to the left, improve=0.17794890, (0 missing)
lat < 45.297 to the right, improve=0.03301643, (0 missing)
Surrogate splits:
area < 3.218943 to the left, agree=0.624, adj=0.096, (0 split)
lat < 43.86934 to the left, agree=0.608, adj=0.058, (0 split)
For the first split, three candidate splitting variables were considered (elev.ut, area, and lat). Splitting on elev.ut improved the model the most, so it was selected. Surrogate splits are alternative splits that closely mimic the primary split; they are used to send a case down the tree when its value of the primary splitting variable is missing. Each subsequent node is handled in the same fashion (results not shown).
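In R, this node-by-node detail is produced by summary(); a minimal sketch, again assuming the fit object from above:

# Print the full node-by-node detail, including the primary and
# surrogate splits considered at each node
summary(fit)

# Restrict the printout to nodes whose complexity exceeds a cutoff
summary(fit, cp = 0.05)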
At the end of the model output is a summary of the final tree...
1) root 125 1276.99300 14.53360
2) elev.ut>=1589.999 52 324.19350 12.28407
4) elev.ut>=2330.5 27 143.52950 11.18730
8) lat>=44.59658 7 11.08354 9.55306 *
9) lat< 44.59658 20 107.20750 11.75929 *
5) elev.ut< 2330.5 25 113.10950 13.46857
10) lat>=43.57856 18 64.06762 12.99921 *
11) lat< 43.57856 7 34.87951 14.67551 *
3) elev.ut< 1589.999 73 502.21550 16.13601
6) lat>=45.29272 18 102.41810 14.27064 *
7) lat< 45.29272 55 316.66640 16.74649
14) area< 2.94866 19 81.30672 15.04211 *
15) area>=2.94866 36 151.03590 17.64603
30) elev.ut>=808.5004 17 58.33334 16.72940 *
31) elev.ut< 808.5004 19 65.63930 18.46617 *
...but these results are best viewed graphically, as shown in the plot window:
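In R, a comparable plot of the final tree can be drawn directly from the fitted object; a minimal sketch:

# Draw the final tree; uniform spacing between levels, with a margin so
# the node labels are not clipped at the plot edges
plot(fit, uniform = TRUE, margin = 0.1)

# Label the splits and terminal nodes; use.n = TRUE adds the number of
# observations in each terminal node
text(fit, use.n = TRUE, cex = 0.8)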