Multiple Linear Regression Analysis

THE REGRESSION MODEL

In a regression problem the researcher postulates a certain relationship between a random variable y (the realizations of which are subject to some form of disturbance) on the one side and a number of variables x1,...,xp (which are without, or at least almost without, disturbances) on the other side. This relationship is expressed by a mathematical formula, which is called the (linear) regression model, for instance:

    y = a0 + a1 * x1 + ... + ap * xp + e    (1)

in which a0,...,ap represent unknown regression coefficients (parameters) which are to be estimated and e represents the disturbance. If a constant term is present in the model formula (in (1) the a0), the model is said to be an 'intercept model'; if no constant term is present, the model is called a 'no-intercept model'.

The variables x1,...,xp and the variable y can also represent (other) transformed variables. The researcher might have reasons to believe (from background information concerning the experiment) that transformations are necessary, for instance:
1) to obtain normally distributed disturbances,
2) to obtain a greater homogeneity of the variances of the disturbances,
3) to linearize non-linear regression models (if possible).
The transformed regression model can be written as:

    G(y) = a0 + a1 * F1(x1,...,xm) + ... + ap * Fp(x1,...,xm) + e    (2)

in which G, F1,...,Fp represent the transformations, a0,...,ap represent the parameters to be estimated, y represents the dependent variable, x1,...,xm represent the independent variables, and e represents the disturbance.

The choice of a transformation by means of 'trial and error' is rather time consuming and costly. The location parameter in particular causes difficulties: it is not unusual that Log(x) yields no improvement, whereas Log(c+x) gives better results for a particular choice of c. Because this holds for almost any transformation of some importance, we must actually solve a nonlinear adjustment problem in each case. Often, though, a simple form of the transformation is suggested by the researcher, who is better acquainted with the peculiarities of the experiment.

LEAST SQUARES

Regression analysis consists in fact of the adjustment of a hyperplane of the required dimension to the data. The fitting is done with the method of least squares, which means that the sum of the squares of the differences between the observed values of y and the estimated values of the expectation of y is minimized. This sum of squares is also called the residual sum of squares. In matrix notation the regression model can be written as:

    Y = Xa + e    (3)

in which Y is a (n*1) random vector of observations, X is a (n*p) matrix of known (fixed) values, a is a (p*1) vector of (unknown) parameters, and e is a (n*1) random vector of disturbances. It is supposed that E(e) = 0 and var(e) = I * sigma^2, in which I is the unit matrix, thus:

    E(Y) = Xa    (4)

The sum of squares of the differences between the observed values of Y and the estimated values of the expectation of Y thus equals:

    (Y-Xa)'(Y-Xa) = Y'Y - 2a'X'Y + a'X'Xa    (5)

(because a'X'Y is a scalar and therefore equal to Y'Xa). Choosing as least squares estimator b that value of a which minimizes (5) involves differentiating with respect to the elements of a and equating the result to zero: -2X'Y + 2X'Xb = 0, thus:

    X'Y = X'Xb    (6)

This system is called the normal equations. If the rank of X equals p, X'X is nonsingular and the inverse of X'X exists. In that case the solution of the normal equations can be written as:

    b = inv(X'X)X'Y    (7)

Observe that p <= n must hold in order that the rank of X can be p at all. Therefore at least as many observations must be made as there are parameters in the model. Also observe that E(b) = inv(X'X)X'E(Y) = a, thus b is an unbiased estimator of a.

The least squares estimator has the following properties:
1. It is an estimator which minimizes the sum of squares of deviations, irrespective of any distribution properties of the disturbances. The assumption that the disturbances are normally distributed is, of course, necessary for tests which depend on this assumption, such as t- or F-tests, or for obtaining confidence intervals based on t- or F-distributions.
2. According to the Gauss-Markov theorem, the elements of b are unbiased estimators with minimum variance among all unbiased estimators that are linear functions of the Y's, again irrespective of the distribution properties of the disturbances.
3. If the disturbances are mutually independent and normally distributed (with E(e) = 0 and var(e) = I * sigma^2), then b is also the maximum likelihood estimator.

The variance-covariance matrix of b is:

    var(b) = inv(X'X) * sigma^2    (8)

The variances are the diagonal and the covariances the off-diagonal elements. An unbiased estimator for sigma^2 is given by:

    s^2 = (Y'Y - b'X'Y) / (n-p)    (9)

The square root of this estimator is frequently called the 'standard error of estimate'. In the printed output of the program it is indicated more properly as 'standard deviation of the error term'. Let vij be the element in the i-th row and j-th column of inv(X'X); then sdi = s * Sqrt(vii) estimates the standard deviation of bi, and cij = vij / Sqrt(vii * vjj) gives the correlation coefficient between bi and bj, for i = 1,...,p and j = 1,...,p. Thus:

    vii = (sdi / s)^2    (10)

and

    vij = cij * Sqrt(vii * vjj) = cij * (sdi * sdj) / s^2    (11)

A frequently used statistical measure for evaluating regression models is the multiple correlation coefficient R, which is defined in the intercept model as the square root of the proportion of the corrected total sum of squares accounted for by the model. If the correction for means is denoted by n*u^2, with u = Sum(i,1,n,yi)/n, then R can be defined by:

    R^2 = (b'X'Y - n*u^2)/(Y'Y - n*u^2) = 1 - (Y'Y - b'X'Y)/(Y'Y - n*u^2)    (12)

However, we must divide Y'Y - b'X'Y by n-p, not by n, to obtain an unbiased estimator of sigma^2; moreover it is customary to divide Y'Y - n*u^2 by n-1, not by n. If we adopt both modifications we obtain the adjusted multiple correlation coefficient, which can thus be defined as:

    adj(R)^2 = 1 - (n-1)/(n-p) * (Y'Y - b'X'Y)/(Y'Y - n*u^2)    (13)

In the no-intercept model the correction for means is ignored, giving as definition of R^2: b'X'Y/Y'Y = 1 - (Y'Y - b'X'Y)/Y'Y, while adj(R)^2 is defined correspondingly as: 1 - n/(n-p) * (Y'Y - b'X'Y)/Y'Y. R^2 itself is often called the 'proportion of variation explained'.
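The computations in (7)-(11) can be summarized in a few lines of code. The following is only a minimal sketch using numpy; the data, the two regressor columns and all variable names are invented for the illustration, and a full-rank X is assumed.

    # Sketch of the least squares computations (7)-(11); the data are made up.
    import numpy as np

    # n observations, p columns in X (first column of ones: intercept model)
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    Y  = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.8])
    X  = np.column_stack([np.ones_like(x1), x1, x2])
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)           # inv(X'X), elements vij
    b = XtX_inv @ X.T @ Y                      # (7)  b = inv(X'X)X'Y

    s2 = (Y @ Y - b @ X.T @ Y) / (n - p)       # (9)  unbiased estimator of sigma^2
    s = np.sqrt(s2)                            # 'standard deviation of the error term'

    var_b = XtX_inv * s2                       # (8)  with sigma^2 replaced by s^2
    sd_b = np.sqrt(np.diag(var_b))             # sdi = s * Sqrt(vii)
    corr_b = var_b / np.outer(sd_b, sd_b)      # cij = vij / Sqrt(vii * vjj)

    print("b   =", b)
    print("s   =", s)
    print("sd  =", sd_b)
    print("corr(b) =\n", corr_b)

inv(X'X) is formed explicitly here only because its elements vij are needed for (10) and (11); if only b is required, solving the normal equations (6) directly (for instance with np.linalg.solve) is numerically preferable.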
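The definitions of R^2 and adj(R)^2 for the intercept and the no-intercept model can be illustrated in the same way. This is a sketch, again assuming numpy; the data are invented for the example.

    # Sketch of the multiple correlation coefficient (12)-(13) and its
    # no-intercept counterpart; data and names are hypothetical.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    Y = np.array([2.9, 5.1, 7.0, 9.2, 10.8, 13.1])
    n = len(Y)

    def r_squared(X, Y, intercept):
        n, p = X.shape
        b = np.linalg.solve(X.T @ X, X.T @ Y)          # least squares estimates
        rss = Y @ Y - b @ X.T @ Y                      # residual sum of squares
        if intercept:
            tot = Y @ Y - n * Y.mean() ** 2            # corrected total SS: Y'Y - n*u^2
            adj = 1.0 - (n - 1) / (n - p) * rss / tot  # (13)
        else:
            tot = Y @ Y                                # uncorrected total SS
            adj = 1.0 - n / (n - p) * rss / tot
        return 1.0 - rss / tot, adj                    # R^2 as in (12), and adj(R)^2

    X_int = np.column_stack([np.ones(n), x])           # intercept model
    X_noi = x.reshape(-1, 1)                           # no-intercept model
    print(r_squared(X_int, Y, intercept=True))
    print(r_squared(X_noi, Y, intercept=False))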
WEIGHTED LEAST SQUARES

It sometimes happens that some of the observations for the dependent variable are 'less reliable' than others. This usually means that the variances of the observations are not all equal; in other words the matrix V = var(e) is not of the form I * sigma^2, but is diagonal with unequal diagonal elements. The basic idea for solving this problem is to transform Y to other variables which do appear to satisfy the usual tentative model assumptions, and then to apply the usual (unweighted) analysis to the variables so obtained. The estimates can then be re-expressed in terms of the original variables Y.

Let the original regression model be Y = Xa + e, with E(e) = 0 and var(e) = V * sigma^2, with V diagonal with unequal diagonal elements, and let P = inv(V). Premultiplying the original regression model by Q = Sqrt(P) gives as transformed regression model:

    QY = QXa + Qe    (14)

with E(Qe) = 0 and var(Qe) = I * sigma^2. The normal equations then become:

    (QX)'QY = (QX)'QXb    (15)

giving as solution, if the indicated inverse matrix exists:

    b = inv((QX)'QX)(QX)'QY = inv(X'PX)X'PY    (16)

with variance-covariance matrix:

    var(b) = inv(X'PX) * sigma^2    (17)

In practical situations it is often difficult to obtain specific information on the form of V at first. For this reason it is sometimes necessary to make the (known to be erroneous) assumption V = I and then to attempt to discover something about the form of V by examining the residuals from the regression analysis.
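A minimal sketch of the weighted analysis (14)-(17), assuming numpy and a known diagonal V; the weights and data are invented for the example. It computes b both via the transformed model (14)-(15) and via the equivalent closed form (16).

    # Sketch of weighted least squares with a diagonal V; data are hypothetical.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 3.9, 6.2, 7.8, 10.3])
    X = np.column_stack([np.ones_like(x), x])
    n, p = X.shape

    v = np.array([1.0, 1.0, 2.0, 4.0, 4.0])   # diagonal of V (relative variances)
    P = np.diag(1.0 / v)                      # P = inv(V)
    Q = np.sqrt(P)                            # Q = Sqrt(P), so var(Qe) = I * sigma^2

    # Either transform and use the unweighted formulas (14)-(15) ...
    QX, QY = Q @ X, Q @ Y
    b = np.linalg.solve(QX.T @ QX, QX.T @ QY)

    # ... or use the equivalent weighted form (16)-(17) directly.
    XtPX_inv = np.linalg.inv(X.T @ P @ X)
    b_alt = XtPX_inv @ X.T @ P @ Y            # (16) b = inv(X'PX)X'PY
    s2 = ((QY - QX @ b) @ (QY - QX @ b)) / (n - p)
    var_b = XtPX_inv * s2                     # (17) with sigma^2 replaced by s^2

    print(b, b_alt, sep="\n")
    print(var_b)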
RESIDUAL ANALYSIS

The vector of residuals D is defined as the difference between the vector of observations Y and the vector of fitted values Z, obtained by using the regression equation Z = Xb. So D = Y - Z, or di = yi - zi for i = 1,...,n. If the model is correct, the residual mean square MSE = s^2 estimates sigma^2, and the estimated standard deviation of the fitted value zi at xi = (xi1,...,xip)' is:

    sd(zi) = s * Sqrt(xi'inv(X'X)xi)    (18)

which can be used to construct a confidence interval for the expected value of yi, E(yi), at xi = (xi1,...,xip)', or to construct a prediction interval for the mean of h new observations at this point. In the first case the confidence interval is:

    zi +- t(n-p,1-alpha/2) * s * Sqrt(xi'inv(X'X)xi)    (19)

and in the second case the prediction interval is:

    zi +- t(n-p,1-alpha/2) * s * Sqrt(1/h + xi'inv(X'X)xi)    (20)

Researchers often divide the residuals di by s, resulting in the standardized residuals, which can be examined to see whether they make it appear that the assumption ei/sigma ~ N(0,1) is violated. It might be expected that roughly 95% of the di/s lie between the limits (-2, 2). However, the variances of the residuals are not constant but a function of the X matrix (see (18)), which suggests as standardization:

    ti = di / (s * Sqrt(1 - xi'inv(X'X)xi))    (21)

giving the studentized residual. The maximum studentized residual can be used in a test for detecting outliers, as follows: let t^2 = max(ti^2); then

    min(1, n * (1 - Fisher(1, n-p-1, t^2*(n-p-1)/(n-p-t^2))))

in which Fisher(f1, f2, x) denotes the cumulative F distribution function with f1 and f2 degrees of freedom evaluated at x, is an upper bound for the right tail probability of the largest absolute studentized residual.
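The intervals (18)-(20) can be computed as in the following sketch, assuming numpy and scipy; the data, the choice of alpha, the number of new observations h and the design point are all invented for the example.

    # Sketch of the interval computations (18)-(20) at one design point.
    import numpy as np
    from scipy.stats import t

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
    Y  = np.array([2.8, 5.2, 6.9, 9.1, 11.2, 12.8, 15.3])
    X  = np.column_stack([np.ones_like(x1), x1])
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y
    Z = X @ b                                   # fitted values
    D = Y - Z                                   # residuals
    s = np.sqrt((D @ D) / (n - p))

    alpha, h = 0.05, 3
    i = 2                                       # design point xi = X[i]
    xi = X[i]
    hii = xi @ XtX_inv @ xi                     # xi'inv(X'X)xi
    sd_zi = s * np.sqrt(hii)                    # (18)
    tval = t.ppf(1 - alpha / 2, n - p)

    ci = (Z[i] - tval * sd_zi, Z[i] + tval * sd_zi)              # (19)
    pi = (Z[i] - tval * s * np.sqrt(1 / h + hii),
          Z[i] + tval * s * np.sqrt(1 / h + hii))                # (20)
    print(ci, pi)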
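The studentized residuals (21) and the upper bound for the tail probability of the largest absolute studentized residual can be obtained as in the sketch below, assuming numpy and scipy, with Fisher(...) taken to be the cumulative F distribution function; the data are invented and contain one deliberately deviating observation.

    # Sketch of the studentized residuals (21) and the outlier bound.
    import numpy as np
    from scipy.stats import f

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    Y  = np.array([3.0, 4.8, 7.1, 9.0, 11.2, 12.9, 18.5, 17.1])  # one suspect value
    X  = np.column_stack([np.ones_like(x1), x1])
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y
    D = Y - X @ b
    s = np.sqrt((D @ D) / (n - p))

    H = X @ XtX_inv @ X.T                       # hat matrix; diag(H)i = xi'inv(X'X)xi
    ti = D / (s * np.sqrt(1.0 - np.diag(H)))    # (21) studentized residuals

    t2 = np.max(ti ** 2)
    fstat = t2 * (n - p - 1) / (n - p - t2)
    upper = min(1.0, n * (1.0 - f.cdf(fstat, 1, n - p - 1)))
    print(ti)
    print("upper bound for tail probability:", upper)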