regtools

Norm Matloff

regtools

Novel tools tools for linear, nonlinear and nonparametric regression.

These tools are associated with my book, From Linear Models to Machine Learning: Statistical Regression and Classification, N. Matloff, CRC, 2017 (recipient of the Technometrics Eric Ziegel Award for Best Book Reviewed in 2017).

The tools are useful in general, independently of the book.

NOTE: See also our qeML package – Quick and Easy Machine Learning.

FEATURES:

EXAMPLE: PARAMETRIC MODEL FIT ASSESSMENT

The fit assessment techniques in regtools gauge the fit of parametric models by comparing to nonparametric ones. Since the latter are free of model bias, they are very useful in assessing the parametric models.

Let’s take a look at the included dataset prgeng, some Census data for California engineers and programmers in the year 2000. The response variable in this example is wage income, and the predictors are age, gender, number of weeks worked, and dummy variables for MS and PhD degrees. You can read the details of the data by typing

> ?prgeng

One of the package’s graphical functions for model fit assessment plots the parametric (e.g. lm()) values against the nonparametric fit via k-NN. Let’s try this on the Census data.

The package includes three versions of the dataset: The original; a version with categorical variables in dummy form; and a version with categorical variables in R factor form. Since the k-NN routines require dummies, we’ll use that first version, peDumms.

We need to generate the parametric and nonparametric fits, then call parvsnonparplot():

data(peDumms)
pe1 <- peDumms[c('age','educ.14','educ.16','sex.1','wageinc','wkswrkd')]
lmout <- lm(wageinc ~ .,data=pe1)
xd <- preprocessx(pe1[,-5],10)  # prep for k-NN, k <= 10
knnout <- knnest(pe1$wageinc,xd,10)
parvsnonparplot(lmout,knnout)

result

We see above how the k-NN code is used. We first call preprocessx() to determine the nearest neighbors of each data point. Here k is 10, so we can later compute various k-NN fits for k anywhere from 1 to 10. The actual fit is done by knnest(). Then parvsnonparplot() plots the linear model fit against the nonparametric one.. Again, since the latter is model-free, it serves as a good assessment of the fit of the linear model.

There is quite a bit suggested in this picture:

Let’s now check the classical assumption of homoscedasticity, meaning that the conditional variance of Y given X is constant. The function nonparvarplot() plots the estimated conditional variance against the estimated conditional mean, both computed nonparametrically:

result

Though we ran the plot thinking of the homoscedasticity assumption, this is much more remarkable, confirming that there are interesting subpopulations within this data. These may correspond to different occupations, something to be investigated.

The package includes various other graphical diagnostic functions.

By the way, violation of the homoscedasticity assumption won’t invalidate the estimates in our linear model. They still will be statistically consistent. But the standard errors we compute, and thus the statistical inference we perform, will be affected. This is correctible using the Eicker-White procedure, which for linear models is available in the car and sandwich packagers. Our package here also extends this to nonlinear parametric models, in our function nlshc() (the validity of this extension is shown in the book).

EXAMPLE; OVA VS. AVA IN MULTICLASS PROBLEMS

A very popular prediction method in 2-class problems is to use logistic (logit) regression. In analyzing click-through patterns of Web users, for instance, we have 2 classes, Click and Nonclick. We might fit a logistic model for Click, given user Web history, demographics and so on. Note that logit actually models probabilities, e.g. the probability of Click given the predictor variables.

But the situation is much less simple in multiclass settings. Suppose our application is recognition of hand-written digits (a famous machine learning example). The predictor variables are pixel patterns in images. There are two schools of thought on this:

Many in the machine learning literature recommend AVA over OVA, on the grounds that there might be linear separability (in the statistical sense) in pairs but not otherwise. My book counters by noting that such a situation could be remedied under OVA by adding quadratic terms to the logit models.

At any rate, the regtools package gives you a choice, OVA or AVA, for both parametric and nonparametric methods. For example, avalogtrn() and avalogpred() do training and prediction operations for logit with AVA.

Let’s look at an example, again using the Census data from above. We’ll predict occupation from age, sex, education (MS, PhD, other) wage income and weeks worked.

data(peFactors) 
pef <- peFactors 
pef1 <- pef[,c('age','educ','sex','wageinc','wkswrkd','occ')] 
# "Y" must be in last column, class ID 0,1,2,...; convert from factor
pef1$occ <- as.numeric(pef1$occ) 
pef1$occ <- pef1$occ - 1
pef2 <- pef1 
# create the education, gender dummy varibles
pef2$ms <- as.integer(pef2$educ == 14) 
pef2$phd <- as.integer(pef2$educ == 16) 
pef2$educ <- NULL 
pef2$sex <- as.integer(pef2$sex == 1) 
pef2 <- pef2[,c(1,2,3,4,6,7,5)] 
ovaout <- ovalogtrn(6,pef2) 
# estimated coefficients, one set ofr each of the 6 classes
ovaout  
# prints
                        0             1             2
(Intercept) -9.411834e-01 -6.381329e-01 -2.579483e-01
xage         9.090437e-03 -3.302790e-03 -2.205695e-02
xsex        -5.187912e-01 -1.122531e-02 -9.802006e-03
xwageinc    -6.741141e-06 -4.609168e-06  5.132813e-06
xwkswrkd     5.058947e-03 -2.247113e-03  2.623924e-04
xms         -5.201286e-01 -4.272846e-01  5.280520e-01
xphd        -3.302821e-01 -8.035287e-01  3.531951e-01
                        3             4             5
(Intercept) -3.370758e+00 -3.322356e+00 -4.456788e+00
xage        -2.193359e-03 -1.206640e-02  3.323948e-02
xsex        -7.856923e-01  5.173516e-01  1.175657e+00
xwageinc    -4.076872e-06  2.033175e-06  1.831774e-06
xwkswrkd     1.311084e-02  5.517912e-04  2.794453e-03
xms         -1.797544e-01  9.947253e-02  2.705293e-01
xphd        -3.883463e-01  4.967115e-01  4.633907e-01
# predict the occupation of a woman, age 35, no MS/PhD, inc 60000, 52
# weeks worked
ovalogpred(ovaout,matrix(c(35,0,60000,52,0,0),nrow=1))
# outputs class 2, Census occupation code 102
[1] 2

With the optional argument probs=TRUE, the call to ovalogpred() will also return the conditional probabilities of the classes, given the predictor values, in the R attribute ‘probs’.

Here is the AVA version:

avaout <- avalogtrn(6,pef2) 
avaout
# prints
                      1,2           1,3           1,4           1,5
(Intercept) -1.914000e-01 -4.457460e-01  2.086223e+00  2.182711e+00
xijage       8.551176e-03  2.199740e-02  1.017490e-02  1.772913e-02
xijsex      -3.643608e-01 -3.758687e-01  3.804932e-01 -8.982992e-01
xijwageinc  -1.207755e-06 -9.679473e-06 -6.967489e-07 -4.273828e-06
xijwkswrkd   4.517229e-03  4.395890e-03 -9.535784e-03 -1.543710e-03
xijms       -9.460392e-02 -7.509925e-01 -2.702961e-01 -5.466462e-01
xijphd       3.983077e-01 -5.389224e-01  7.503942e-02 -7.424787e-01
                      1,6           2,3           2,4           2,5
(Intercept)  3.115845e+00 -2.834012e-01  2.276943e+00  2.280739e+00
xijage      -2.139193e-02  1.466992e-02  1.950032e-03  1.084527e-02
xijsex      -1.458056e+00  3.720012e-03  7.569766e-01 -5.130827e-01
xijwageinc  -5.424842e-06 -9.709168e-06 -1.838009e-07 -4.908563e-06
xijwkswrkd  -2.526987e-03  9.884673e-04 -1.382032e-02 -3.290367e-03
xijms       -6.399600e-01 -6.710261e-01 -1.448368e-01 -4.818512e-01
xijphd      -6.404008e-01 -9.576587e-01 -2.988396e-01 -1.174245e+00
                      2,6           3,4           3,5           3,6
(Intercept)  3.172786e+00  2.619465e+00  2.516647e+00  3.486811e+00
xijage      -2.908482e-02 -1.312368e-02 -3.051624e-03 -4.236516e-02
xijsex      -1.052226e+00  7.455830e-01 -5.051875e-01 -1.010688e+00
xijwageinc  -5.336828e-06  1.157401e-05  1.131685e-06  1.329288e-06
xijwkswrkd  -3.792371e-03 -1.804920e-02  5.606399e-04 -3.217069e-03
xijms       -5.987265e-01  4.873494e-01  2.227347e-01  5.247488e-02
xijphd      -1.140915e+00  6.522510e-01 -2.470988e-01 -1.971213e-01
                      4,5           4,6           5,6
(Intercept) -9.998252e-02  6.822355e-01  9.537969e-01
xijage       1.055143e-02 -2.273444e-02 -3.906653e-02
xijsex      -1.248663e+00 -1.702186e+00 -4.195561e-01
xijwageinc  -4.986472e-06 -7.237963e-06  6.807733e-07
xijwkswrkd   1.070949e-02  8.097722e-03 -5.808361e-03
xijms       -1.911361e-01 -3.957808e-01 -1.919405e-01
xijphd      -8.398231e-01 -8.940497e-01 -2.745368e-02
# predict the occupation of a woman, age 35, no MS/PhD, inc 60000, 52
# weeks worked
avalogpred(6,ovaout,matrix(c(35,0,60000,52,0,0),nrow=1))
# outputs class 2, Census occupation code 102

EXAMPLE: ADJUSTMENT OF CLASS PROBABILITIES IN CLASSIFICATION PROBLEMS

The LetterRecognition dataset in the mlbench package lists various geometric measurements of capital English letters, thus another image recognition problem. One problem is that the frequencies of the letters in the dataset are not similar to those in actual English texts. The correct frequencies are given in the ltrfreqs dataset included here in the regtools package.

In order to adjust the analysis accordingly, the ovalogtrn() function has an optional truepriors argument. For the letters example, we could set this argument to ltrfreqs. (The term priors here does refer to a subjective Bayesian analysis. It is merely a standard term for the class probabilities.)

MULTICLASS CLASSIFICATION WITH k-NN

In addition to use in linear regression graphical diagnostics, k-NN can be very effective as a nonparametric regression/machine learning tool. I would recommend it in cases in which the number of predictors is moderate and there are nonmonotonic relations. (See also our polyreg package.) Let’s continue the above example on predicting occupation, using k-NN.

The three components of k-NN analysis in regtools are:

  1. preprocessx(): This finds the sets of nearest neighbors in the training set, for all values of k up to a user-specified maximum. This facilitates the user’s trying various values of k.

  2. knnest(): This fits the regression model.

  3. knnpred(): This does prediction on the user’s desired set of points of new cases.

Since k-NN involves finding distances between points, our data must be numeric, not factors. This means that in pef2, we’ll need to replace the occ column by a matrix of dummy variables. Utilities in the regtools package make this convenient:

occDumms <- factorToDummies(as.factor(pef2$occ),'occ',omitLast=FALSE)
pef3 <- cbind(pef2[,-7],occDumms)

Note that in cases in which “Y” is multivariate, knnest() requires it in multivariate form. Here “Y” is 6-variate, so we’ve set the last 6 columns of pef3 to the corresponding dummies.

Many popular regression packages, e.g. lars for the LASSO, require data in numeric form, so the regtools’ conversion utilities are quite handy.

Now fit the regression model:

kout <- knnest(pef3[, -(1:6)],xd,10) 

One of the components of kout is the matrix of fitted values:

> head(kout$regest) 
     occ.0 occ.1 occ.2 occ.3 occ.4 occ.5
[1,]   0.2   0.4   0.2     0   0.0   0.2
[2,]   0.2   0.5   0.2     0   0.0   0.1
[3,]   0.5   0.1   0.3     0   0.1   0.0
[4,]   0.3   0.4   0.1     0   0.0   0.2
[5,]   1.0   0.0   0.0     0   0.0   0.0
[6,]   0.2   0.4   0.2     0   0.0   0.2

So for example the conditional probability of Occupation 4 for the third observation is 0.1.

Now let’s do the same prediction as above:

> predict(kout,matrix(c(35,0,60000,52,0,0),nrow=1),TRUE)
occ.0 occ.1 occ.2 occ.3 occ.4 occ.5 
  0.1   0.4   0.5   0.0   0.0   0.0 

These are conditional probabilities. The most likely one is Occupation 2.

The TRUE argument was to specify that we need to scale the new cases in the same way the original data were scaled.

By default, our k-NN routines find the mean Y in the neighborhood. Another option is to do local linear smoothing. Among other things, this may remedy aliasing at the edges of the data. This should be done with a value of k much larger than the number of predictor variables.

EXAMPLE: RECTANGULARIZATION OF TIME SERIES

This allows use of ordinary tools like lm() for prediction in time series data. Since the goal here is prediction rather than inference, an informal model can be quite effective, as well as convenient.

The basic idea is that x[i] is predicted by x[i-lg], x[i-lg+1], x[i-lg+2], i… x[i-1], where lg is the lag.

xy <- TStoX(Nile,5)
head(xy)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1120 1160  963 1210 1160 1160
# [2,] 1160  963 1210 1160 1160  813
# [3,]  963 1210 1160 1160  813 1230
# [4,] 1210 1160 1160  813 1230 1370
# [5,] 1160 1160  813 1230 1370 1140
# [6,] 1160  813 1230 1370 1140  995
head(Nile,36)
#  [1] 1120 1160  963 1210 1160 1160  813 1230 1370 1140  995  935 1110  994 1020
# [16]  960 1180  799  958 1140 1100 1210 1150 1250 1260 1220 1030 1100  774  840
# [31]  874  694  940  833  701  916

Try lm():

lmout <- lm(xy[,6] ~ xy[,1:5])
lmout
...
Coefficients:
Coefficients:
(Intercept)   xy[, 1:5]1   xy[, 1:5]2   xy[, 1:5]3   xy[, 1:5]4   xy[, 1:5]5  
  307.84354      0.08833     -0.02009      0.08385      0.13171      0.37160  

Predict the 101st observation:

cfs <- coef(lmout)
cfs %*% c(1,Nile[96:100])
#          [,1]
# [1,] 784.4925
  翻译: