Lasso Regression in R Called by F#

A lasso regression analysis was conducted to identify a subset of variables from a pool of 6 quantitative predictor variables that best predicted a quantitative response variable measuring the number of people employed. Quantitative predictor variables include Gross National Product (GNP), GNP implicit price deflator (1954=100), number of unemployed, number of people in the armed forces, ‘noninstitutionalized’ population ≥ 14 years of age, and the year (time).

Because of the small size of the data set (N=16), data were not split into training and test sets.

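Of the 6 predictor variables, only 2 were retained in the selected model. During the estimation process, year and GNP were most strongly associated with number of people employed. The final model accounted for 97.4% of the variance in the response variable. The R-squared of 0.995 reported in the output below is for the last step of the lasso path, where all 6 predictors are in the model (1 - 0.836/185.009).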

Figure 1. Change in the lasso coefficients at each step

Source code in F#:

#load "packages/FsLab/FsLab.fsx"

open RDotNet
open RProvider
open RProvider.lars
open RProvider.datasets
open RProvider.graphics
open Deedle

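// Load R's built-in longley data set as a Deedle frame
let longley : Frame = R.longley.GetValue()
// Response: number of people employed
let longY = longley?Employed
// Predictors: the 6 remaining columns, passed to R as a matrix
let longX = R.as_matrix(longley.Columns.[ ["GNP.deflator"; "GNP"; "Unemployed"; "Armed.Forces"; "Population"; "Year"] ])
// Fit the coefficient path with lars (type = "lasso" is the lars default)
let fit = R.lars(x=longX, y=longY)
// Summarize the fit (Df, Rss and Cp per step) and plot the coefficient paths
R.summary fit
R.plot fit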

Which created the following output:

Call:
lars::lars(x = fsr_9628_42, y = fsr_9628_43)
R-squared: 0.995 
Sequence of LASSO moves:
     GNP Unemployed Armed.Forces Year GNP Population GNP.deflator GNP GNP.deflator GNP.deflator
Var    2          3            4    6  -2          5            1   2           -1            1
Step   1          2            3    4   5          6            7   8            9           10

LARS/LASSO
Call: lars::lars(x = fsr_9628_42, y = fsr_9628_43)
   Df     Rss        Cp
0   1 185.009 1976.7120
1   2   6.642   59.4712
2   3   3.883   31.7832
3   4   3.468   29.3165
4   5   1.563   10.8183
5   4   1.339    6.4068
6   5   1.024    5.0186
7   6   0.998    6.7388
8   7   0.907    7.7615
9   6   0.847    5.1128
10  7   0.836    7.0000
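
The Rss column above can be turned into an R-squared value for each step, which shows how quickly the fit improves as variables enter. The snippet below is a minimal sketch in plain F# (no R call); it assumes that step 0 is the intercept-only model, so that its Rss equals the total sum of squares. The names rss, totalSumOfSquares and rSquared are illustrative only.

```fsharp
// Residual sum of squares per step, copied from the Rss column of the summary output.
let rss =
    [ 185.009; 6.642; 3.883; 3.468; 1.563; 1.339; 1.024; 0.998; 0.907; 0.847; 0.836 ]

// Assumption: step 0 is the intercept-only model, so its Rss is the total sum of squares.
let totalSumOfSquares = List.head rss

// R-squared at each step: share of the variance explained relative to step 0.
let rSquared = rss |> List.map (fun r -> 1.0 - r / totalSumOfSquares)

rSquared
|> List.iteri (fun step r2 -> printfn "step %2d  R-squared = %.3f" step r2)
```

The value for step 10 rounds to 0.995, matching the R-squared line printed by lars, while step 1 (GNP only) already explains about 96% of the variance.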

Random Forest in R Called by F#

Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable. The response variable was the presence of a post-surgical deformation (kyphosis). The following explanatory variables were included as possible contributors: age in months (Age), number of vertebrae involved (Number), and the topmost vertebra operated on (Start).

The accuracy of the random forest was 80% (out-of-bag error rate of 19.75%). Growing multiple trees rather than a single tree added little to the overall accuracy of the model, suggesting that interpretation of a single decision tree may be appropriate.

Source code in F#:

#I "../packages/RProvider.1.1.15"
#load @"..\packages\RProvider.1.1.15\RProvider.fsx"

open RDotNet
open RProvider
open RProvider.rpart
open RProvider.randomForest

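// kyphoX holds the predictors (Age, Number, Start) and kyphoY the Kyphosis factor
// from the rpart kyphosis data set; their definitions are not part of this listing.
let fit = namedParams [ "x", box kyphoX; "y", box kyphoY ] |> R.randomForest
// Print the forest summary: OOB error estimate and confusion matrix
R.print fit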

Which created the following output:

               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 19.75%
Confusion matrix:
        absent present class.error
absent      60       4   0.0625000
present     12       5   0.7058824
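
The error rates in this output follow directly from the confusion matrix, where rows are the observed class and columns the predicted class. The snippet below is a small sketch in plain F# that recomputes them from the four counts; the variable names are illustrative and not part of randomForest.

```fsharp
// Out-of-bag confusion matrix from the output above:
// rows are the observed class, columns are the predicted class.
let absentAsAbsent, absentAsPresent = 60, 4
let presentAsAbsent, presentAsPresent = 12, 5

let total = absentAsAbsent + absentAsPresent + presentAsAbsent + presentAsPresent
let misclassified = absentAsPresent + presentAsAbsent

// Overall OOB error rate and the corresponding accuracy.
let oobErrorRate = float misclassified / float total
let accuracy = 1.0 - oobErrorRate

// Per-class error: misclassified cases divided by the row total.
let absentClassError = float absentAsPresent / float (absentAsAbsent + absentAsPresent)
let presentClassError = float presentAsAbsent / float (presentAsAbsent + presentAsPresent)

printfn "OOB error rate:        %.2f%%" (100.0 * oobErrorRate)
printfn "Accuracy:              %.2f%%" (100.0 * accuracy)
printfn "class.error (absent):  %.7f" absentClassError
printfn "class.error (present): %.7f" presentClassError
```

The per-class errors show that the 80% accuracy is driven by the majority class: most patients without kyphosis are classified correctly, while about 71% of the patients with kyphosis are missed.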

Decision Tree in R Called by F#

Figure 1. Classification tree (N=81) to predict a type of deformation (kyphosis) after surgery (target variable)

Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested. For the present analysis, the Gini index “goodness of split” criterion was used to grow the tree (the rpart default for classification, since the call below does not request entropy), and a cost complexity algorithm was used for pruning the full tree into a final subtree.

The response variable was the presence of a post-surgical deformation (kyphosis). The following explanatory variables were included as possible contributors to the classification tree model: age in months (Age), number of vertebrae involved (Number), and the topmost vertebra operated on (Start).

The Start variable was the first variable to separate the sample into two subgroups. Patients with a Start value less than 8.5 were more likely to have developed kyphosis than patients not meeting this cutoff (57.9% vs. 9.7%).
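
To see why this split is attractive, the impurity of the two subgroups can be compared with that of the whole sample. The snippet below is a minimal sketch in plain F# of two common “goodness of split” measures for a two-class node, the Gini index and entropy; the function names gini and entropy are illustrative and not rpart's API. The proportions are the ones quoted above, plus the overall rate of 17 patients with kyphosis out of 81.

```fsharp
// Gini index of a two-class node in which a share p of the cases has kyphosis.
let gini (p: float) = 2.0 * p * (1.0 - p)

// Entropy (in bits) of the same node; 0 * log 0 is taken as 0.
let entropy (p: float) =
    [ p; 1.0 - p ]
    |> List.sumBy (fun q -> if q > 0.0 then -q * System.Math.Log(q, 2.0) else 0.0)

// Share of patients with kyphosis in each node.
let nodes =
    [ ("all patients", 17.0 / 81.0) // 17 of the 81 patients had kyphosis
      ("Start < 8.5", 0.579)
      ("Start >= 8.5", 0.097) ]

for (name, p) in nodes do
    printfn "%-14s gini = %.3f  entropy = %.3f" name (gini p) (entropy p)
```

Both measures are 0 for a pure node and largest when the two classes are equally frequent, so the Start >= 8.5 subgroup comes out much purer than the sample as a whole, while the Start < 8.5 subgroup remains mixed.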

Of the patients with a Start value greater than or equal to 8.5, a further subdivision was made with the Start variable. Patients with a Start value greater than or equal to 14.5 were less likely to have experienced kyphosis. Patients with a Start value less than 14.5 were more likely to have experienced kyphosis. The total model classified 84% of the sample correctly: 88% of kyphosis-affected patients (sensitivity) and 83% of non-affected patients (specificity).

Source code in F#:

#I "../packages/RProvider.1.1.15"
#load @"..\packages\RProvider.1.1.15\RProvider.fsx"

open RDotNet
open RProvider
open RProvider.graphics
open RProvider.rpart

let kypho = R.kyphosis.AsDataFrame()
let fit = R.rpart(formula="Kyphosis ~ Age + Number + Start", method="class", data=kypho)
namedParams [ "x", box fit; "uniform", box true; "main", box "Classification Tree for Kyphosis" ] |> R.plot
namedParams [ "x", box fit; "use.n", box true; "all", box true; "cex", box 0.8 ] |> R.text