Multiple Linear Regression Analysis

THE REGRESSION MODEL

In a regression problem the researcher postulates a certain relationship between a random variable y (the realizations of which are subject to some form of disturbance) on the one side and a number of variables x1,...,xp (which are without, or at least almost without, disturbances) on the other side. This relationship is expressed by a mathematical formula, which is called the (linear) regression model, for instance:

    y = a0 + a1 * x1 + ... + ap * xp + e    (1)

in which a0,...,ap represent unknown regression coefficients (parameters) which are to be estimated and e represents the disturbance. If a constant term is present in the model formula (in (1) the a0), the model is said to be an 'intercept model'; if no constant term is present, the model is called a 'no-intercept model'.

The variables x1,...,xp and the variable y can also represent (other) transformed variables. The researcher might have reasons to believe (from background information concerning the experiment) that transformations are necessary, for instance:
1) to obtain normally distributed disturbances,
2) to obtain a greater homogeneity of the variances of the disturbances,
3) to linearize non-linear regression models (if possible).
The transformed regression model can be written as:

    G(y) = a0 + a1 * F1(x1,...,xm) + ... + ap * Fp(x1,...,xm) + e    (2)

in which G, F1,...,Fp represent the transformations, a0,...,ap represent the parameters to be estimated, y represents the dependent variable, x1,...,xm represent the independent variables, and e represents the disturbance.

The choice of a transformation by means of 'trial and error' is rather time consuming and costly. The location parameter in particular causes difficulties: it is not unusual that Log(x) yields no improvement, whereas Log(c+x) gives better results for a particular choice of c. Because this holds for almost any transformation of some importance, we must actually solve a nonlinear adjustment problem in each case. Often, though, a simple form of the transformation is suggested by the researcher, who is better acquainted with the peculiarities of the experiment.

LEAST SQUARES

Regression analysis consists in fact of the adjustment of a hyperplane of the required dimension to the data. The fitting is done with the method of least squares, which means that the sum of the squares of the differences between the observed values of y and the estimated values of the expectation of y is minimized. This sum of squares is also called the residual sum of squares. In matrix notation the regression model can be written as:

    Y = Xa + e    (3)

in which Y is a (n*1) random vector of observations, X is a (n*p) matrix of known (fixed) values, a is a (p*1) vector of (unknown) parameters, and e is a (n*1) random vector of disturbances. It is supposed that E(e) = 0 and var(e) = I * sigma^2, in which I is the unit matrix, thus:

    E(Y) = Xa    (4)

The sum of squares of the differences between the observed values of Y and the estimated values of the expectation of Y thus equals:

    (Y-Xa)'(Y-Xa) = Y'Y - 2a'X'Y + a'X'Xa    (5)

(because a'X'Y is a scalar and therefore equal to Y'Xa). Choosing as least squares estimator b that value of a which minimizes (5) involves differentiating with respect to the elements of a and equating the result to zero: -2X'Y + 2X'Xb = 0, thus:

    X'Y = X'Xb    (6)

This system is called the normal equations. If the rank of X equals p, X'X is nonsingular and the inverse of X'X exists. In that case the solution of the normal equations can be written as:

    b = inv(X'X)X'Y    (7)

Observe that p <= n must hold in order that the rank of X can be p at all. Therefore at least as many observations must be made as there are parameters in the model. Also observe that E(b) = inv(X'X)X'E(Y) = a, thus b is an unbiased estimator of a.

The least squares estimator has the following properties:
1. It is an estimator which minimizes the sum of squares of deviations, irrespective of any distribution properties of the disturbances. The assumption that the disturbances are normally distributed is, of course, necessary for tests which depend on this assumption, such as t- or F-tests, or for obtaining confidence intervals based on t- or F-distributions.
2. According to the Gauss-Markov theorem, the elements of b are unbiased estimators with minimum variance among all unbiased estimators that are linear functions of the Y's, again irrespective of the distribution properties of the disturbances.
3. If the disturbances are mutually independent and normally distributed (with E(e) = 0 and var(e) = I * sigma^2), then b is also the maximum likelihood estimator.

The variance-covariance matrix of b is:

    var(b) = inv(X'X) * sigma^2    (8)

The variances are the diagonal and the covariances the off-diagonal elements. An unbiased estimator for sigma^2 is given by:

    s^2 = (Y'Y - b'X'Y) / (n-p)    (9)

The square root of this estimator is frequently called the 'standard error of estimate'. In the printed output of the program it is indicated more properly as 'standard deviation of the error term'. Let vij be the element in the i-th row and j-th column of inv(X'X); then sdi = s * Sqrt(vii) estimates the standard deviation of bi, and cij = vij / Sqrt(vii * vjj) gives the correlation coefficient between bi and bj, for i = 1,...,p and j = 1,...,p. Thus:

    vii = (sdi / s)^2    (10)

and

    vij = cij * Sqrt(vii * vjj) = cij * (sdi * sdj) / s^2    (11)

A frequently used statistical measure for evaluating regression models is the multiple correlation coefficient R, which is defined in the intercept model as the square root of the proportion of the corrected total sum of squares accounted for by the model. If the correction for means is denoted by n*u^2, with u = Sum(i,1,n,yi)/n, then R can be defined by:

    R^2 = (b'X'Y - n*u^2)/(Y'Y - n*u^2) = 1 - (Y'Y - b'X'Y)/(Y'Y - n*u^2)    (12)

However, we must divide Y'Y - b'X'Y by n-p, not by n, to obtain an unbiased estimator of sigma^2; moreover it is customary to divide Y'Y - n*u^2 by n-1, not by n. If we adopt both modifications we obtain the adjusted multiple correlation coefficient, which can thus be defined as:

    adj(R)^2 = 1 - (n-1)/(n-p) * (Y'Y - b'X'Y)/(Y'Y - n*u^2)    (13)

In the no-intercept model the correction for means is ignored, giving as definition of R^2: b'X'Y/Y'Y = 1 - (Y'Y - b'X'Y)/Y'Y, while adj(R)^2 is defined correspondingly as: 1 - n/(n-p) * (Y'Y - b'X'Y)/Y'Y. R^2 itself is often called the 'proportion of variation explained'.
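The computations in (7)-(11) can be summarized in a few lines of code. The following is only a minimal sketch using numpy; the data, the two regressor columns and all variable names are invented for the illustration, and a full-rank X is assumed.

    # Sketch of the least squares computations (7)-(11); the data are made up.
    import numpy as np

    # n observations, p columns in X (first column of ones: intercept model)
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    Y  = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.8])
    X  = np.column_stack([np.ones_like(x1), x1, x2])
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)           # inv(X'X), elements vij
    b = XtX_inv @ X.T @ Y                      # (7)  b = inv(X'X)X'Y

    s2 = (Y @ Y - b @ X.T @ Y) / (n - p)       # (9)  unbiased estimator of sigma^2
    s = np.sqrt(s2)                            # 'standard deviation of the error term'

    var_b = XtX_inv * s2                       # (8)  with sigma^2 replaced by s^2
    sd_b = np.sqrt(np.diag(var_b))             # sdi = s * Sqrt(vii)
    corr_b = var_b / np.outer(sd_b, sd_b)      # cij = vij / Sqrt(vii * vjj)

    print("b   =", b)
    print("s   =", s)
    print("sd  =", sd_b)
    print("corr(b) =\n", corr_b)

inv(X'X) is formed explicitly here only because its elements vij are needed for (10) and (11); if only b is required, solving the normal equations (6) directly (for instance with np.linalg.solve) is numerically preferable.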
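The definitions of R^2 and adj(R)^2 for the intercept and the no-intercept model can be illustrated in the same way. This is a sketch, again assuming numpy; the data are invented for the example.

    # Sketch of the multiple correlation coefficient (12)-(13) and its
    # no-intercept counterpart; data and names are hypothetical.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    Y = np.array([2.9, 5.1, 7.0, 9.2, 10.8, 13.1])
    n = len(Y)

    def r_squared(X, Y, intercept):
        n, p = X.shape
        b = np.linalg.solve(X.T @ X, X.T @ Y)          # least squares estimates
        rss = Y @ Y - b @ X.T @ Y                      # residual sum of squares
        if intercept:
            tot = Y @ Y - n * Y.mean() ** 2            # corrected total SS: Y'Y - n*u^2
            adj = 1.0 - (n - 1) / (n - p) * rss / tot  # (13)
        else:
            tot = Y @ Y                                # uncorrected total SS
            adj = 1.0 - n / (n - p) * rss / tot
        return 1.0 - rss / tot, adj                    # R^2 as in (12), and adj(R)^2

    X_int = np.column_stack([np.ones(n), x])           # intercept model
    X_noi = x.reshape(-1, 1)                           # no-intercept model
    print(r_squared(X_int, Y, intercept=True))
    print(r_squared(X_noi, Y, intercept=False))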
WEIGHTED LEAST SQUARES

It sometimes happens that some of the observations for the dependent variable are 'less reliable' than others. This usually means that the variances of the observations are not all equal; in other words the matrix V = var(e) is not of the form I * sigma^2, but is diagonal with unequal diagonal elements. The basic idea for solving this problem is to transform Y to other variables which do appear to satisfy the usual tentative model assumptions, and then to apply the usual (unweighted) analysis to the variables so obtained. The estimates can then be re-expressed in terms of the original variables Y.

Let the original regression model be Y = Xa + e, with E(e) = 0 and var(e) = V * sigma^2, with V diagonal with unequal diagonal elements, and let P = inv(V). Premultiplying the original regression model by Q = Sqrt(P) gives as transformed regression model:

    QY = QXa + Qe    (14)

with E(Qe) = 0 and var(Qe) = I * sigma^2. The normal equations then become:

    (QX)'QY = (QX)'QXb    (15)

giving as solution, if the indicated inverse matrix exists:

    b = inv((QX)'QX)(QX)'QY = inv(X'PX)X'PY    (16)

with variance-covariance matrix:

    var(b) = inv(X'PX) * sigma^2    (17)

In practical situations it is often difficult to obtain specific information on the form of V at first. For this reason it is sometimes necessary to make the (known to be erroneous) assumption V = I and then to attempt to discover something about the form of V by examining the residuals from the regression analysis.
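A minimal sketch of the weighted analysis (14)-(17), assuming numpy and a known diagonal V; the weights and data are invented for the example. It computes b both via the transformed model (14)-(15) and via the equivalent closed form (16).

    # Sketch of weighted least squares with a diagonal V; data are hypothetical.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 3.9, 6.2, 7.8, 10.3])
    X = np.column_stack([np.ones_like(x), x])
    n, p = X.shape

    v = np.array([1.0, 1.0, 2.0, 4.0, 4.0])   # diagonal of V (relative variances)
    P = np.diag(1.0 / v)                      # P = inv(V)
    Q = np.sqrt(P)                            # Q = Sqrt(P), so var(Qe) = I * sigma^2

    # Either transform and use the unweighted formulas (14)-(15) ...
    QX, QY = Q @ X, Q @ Y
    b = np.linalg.solve(QX.T @ QX, QX.T @ QY)

    # ... or use the equivalent weighted form (16)-(17) directly.
    XtPX_inv = np.linalg.inv(X.T @ P @ X)
    b_alt = XtPX_inv @ X.T @ P @ Y            # (16) b = inv(X'PX)X'PY
    s2 = ((QY - QX @ b) @ (QY - QX @ b)) / (n - p)
    var_b = XtPX_inv * s2                     # (17) with sigma^2 replaced by s^2

    print(b, b_alt, sep="\n")
    print(var_b)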
RESIDUAL ANALYSIS

The vector of residuals D is defined as the difference between the vector of observations Y and the vector of fitted values Z, obtained by using the regression equation Z = Xb. So D = Y - Z, or di = yi - zi for i = 1,...,n. If the model is correct, the residual mean square MSE = s^2 estimates sigma^2, and the estimated standard deviation of the fitted value zi at xi = (xi1,...,xip)' is:

    sd(zi) = s * Sqrt(xi'inv(X'X)xi)    (18)

which can be used to construct a confidence interval for the expected value of yi, E(yi), at xi = (xi1,...,xip)', or to construct a prediction interval for the mean of h new observations at this point. In the first case the confidence interval is:

    zi +- t(n-p,1-alpha/2) * s * Sqrt(xi'inv(X'X)xi)    (19)

and in the second case the prediction interval is:

    zi +- t(n-p,1-alpha/2) * s * Sqrt(1/h + xi'inv(X'X)xi)    (20)

Researchers often divide the residuals di by s, resulting in the standardized residuals, which can be examined to see whether they make it appear that the assumption ei/sigma ~ N(0,1) is violated. It might be expected that roughly 95% of the di/s lie between the limits (-2, 2). However, the variances of the residuals are not constant but a function of the X matrix (see (18)), which suggests as standardization:

    ti = di / (s * Sqrt(1 - xi'inv(X'X)xi))    (21)

giving the studentized residual. The maximum studentized residual can be used in a test for detecting outliers, as follows: let t^2 = max(ti^2); then

    min(1, n * (1 - Fisher(1, n-p-1, t^2*(n-p-1)/(n-p-t^2))))

in which Fisher(f1, f2, x) denotes the cumulative F distribution function with f1 and f2 degrees of freedom evaluated at x, is an upper bound for the right tail probability of the largest absolute studentized residual.
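The intervals (18)-(20) can be computed as in the following sketch, assuming numpy and scipy; the data, the choice of alpha, the number of new observations h and the design point are all invented for the example.

    # Sketch of the interval computations (18)-(20) at one design point.
    import numpy as np
    from scipy.stats import t

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
    Y  = np.array([2.8, 5.2, 6.9, 9.1, 11.2, 12.8, 15.3])
    X  = np.column_stack([np.ones_like(x1), x1])
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y
    Z = X @ b                                   # fitted values
    D = Y - Z                                   # residuals
    s = np.sqrt((D @ D) / (n - p))

    alpha, h = 0.05, 3
    i = 2                                       # design point xi = X[i]
    xi = X[i]
    hii = xi @ XtX_inv @ xi                     # xi'inv(X'X)xi
    sd_zi = s * np.sqrt(hii)                    # (18)
    tval = t.ppf(1 - alpha / 2, n - p)

    ci = (Z[i] - tval * sd_zi, Z[i] + tval * sd_zi)              # (19)
    pi = (Z[i] - tval * s * np.sqrt(1 / h + hii),
          Z[i] + tval * s * np.sqrt(1 / h + hii))                # (20)
    print(ci, pi)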
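The studentized residuals (21) and the upper bound for the tail probability of the largest absolute studentized residual can be obtained as in the sketch below, assuming numpy and scipy, with Fisher(...) taken to be the cumulative F distribution function; the data are invented and contain one deliberately deviating observation.

    # Sketch of the studentized residuals (21) and the outlier bound.
    import numpy as np
    from scipy.stats import f

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    Y  = np.array([3.0, 4.8, 7.1, 9.0, 11.2, 12.9, 18.5, 17.1])  # one suspect value
    X  = np.column_stack([np.ones_like(x1), x1])
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y
    D = Y - X @ b
    s = np.sqrt((D @ D) / (n - p))

    H = X @ XtX_inv @ X.T                       # hat matrix; diag(H)i = xi'inv(X'X)xi
    ti = D / (s * np.sqrt(1.0 - np.diag(H)))    # (21) studentized residuals

    t2 = np.max(ti ** 2)
    fstat = t2 * (n - p - 1) / (n - p - t2)
    upper = min(1.0, n * (1.0 - f.cdf(fstat, 1, n - p - 1)))
    print(ti)
    print("upper bound for tail probability:", upper)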