Econometrics Dr. Robert Jantzen

A Brief Introduction to Multiple Regression


I.  Purpose
II.  True and Estimated Models
III.  Coefficient Interpretation
IV.  Goodness of Fit
V.  F Test on the Overall Regression
VI.  T Test on a Single Coefficient
VII.  Standardized Coefficients
VIII.  "Special" Variables
IX.  Other Concerns


          This guide provides a brief description of the basic multiple regression model.  The model described below includes only two explanatory variables, but the discussion is readily generalizable to models that contain more explainers.


 
I.  Purpose

       Multiple regression analysis can be used to analyze the behavior of a particular dependent variable of interest.  Given a list of possible explanatory factors, regression analysis can estimate the independent effect each explanatory variable has on the dependent variable of interest.  Multiple regression can also be used to predict how large/small the dependent variable will be, given differing values for the explanatory variables.




 
II.  The True and Estimated Models

      The "true" relationship between a particular dependent variable (Y) and two explanatory variables (X1 and X2) that exists in the population can be described by the following multiple regression equation:

      $$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i$$

      where Yi represents the values of the dependent variable for each case in the population being studied, and X1i and X2i represent the individual values of the two explanatory variables for each case in the population. The βs represent the regression coefficients, which quantify the effect each explainer has on the dependent variable in the population being studied.   Lastly, the random "error" term (ei) allows for differing values of the dependent variable, even when the explanatory variables have the same values.

        Ordinarily data on the dependent variable and the explainers are unavailable for everyone in the population, hence the population regression coefficients are unknown.   However, given a representative sample, multiple regression analysis can estimate the following model:
 

      $$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$$

      where X1i and X2i represent the values of the two explanatory variables for every case in the sample and the bs represent the estimates of the true population regression coefficients.  Given specific values for the bs and X1i and X2i, a "predicted" value for the dependent variable (Ŷi) can be calculated for each observation in the sample.   Prediction errors for each observation in the sample can also be calculated by subtracting the predicted value for the dependent variable from the actual value (Yi - Ŷi).
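As a rough illustration of the estimation step, the sketch below fits the two-explainer model by least squares and then computes the predicted values and prediction errors. The data are invented for illustration, and the use of NumPy's `np.linalg.lstsq` is an assumption of this sketch rather than the method of any particular statistics program.

```python
import numpy as np

# Hypothetical sample of 8 cases (values invented for illustration).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y = np.array([5.1, 6.8, 11.2, 12.9, 17.1, 18.8, 23.2, 24.7])

# Design matrix: a column of ones for the constant, then X1 and X2.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates b = (b0, b1, b2).
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b        # predicted value for each observation
errors = y - y_hat   # prediction errors: actual minus predicted

print("estimated coefficients:", b)
print("prediction errors:", errors)
```

Because the model includes a constant, the least-squares prediction errors sum to zero (up to rounding), which is a quick check that the fit ran as intended.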




 
  III.  Coefficient Interpretation

      In general, each estimated coefficient (except the constant b0) gauges how many units the dependent variable will change if the corresponding explanatory variable changes by one unit, holding the other explanatory variables constant.  Hence, it is important to bear in mind what units the dependent and explanatory variables are measured in.  For example, consider the case where the first explanatory variable (X1) is measured in millions of dollars while the second (X2) is measured in percentage points, and the dependent variable (Y) is measured in metric tons.  The coefficient on X1 will show how many metric tons the dependent variable will change if X1 changes by one million dollars (= 1 unit of X1).  The coefficient on X2, in contrast, will show how many metric tons Y will change if X2 changes by one percentage point (= 1 unit of X2). 

        The estimated constant term (b0) shows what the dependent variable will average if all of the explanatory variables have zero values.  Since in most circumstances it is unreasonable to expect that all of the explainers will have zero values, estimates of the constant's size are ordinarily of little practical interest.




 
IV.  Goodness of Fit

        The degree to which the explanatory variables explain the behavior of the dependent variable can be assessed with two measures, namely the R squared (R2) and the % standard error of the regression (%SER).

        The R squared (R2) measures the proportion of the variation in the dependent variable that is explained by the behavior of the explanatory variables, and will range between 0 and 1.  An R2 value of .8 indicates that 80% of the variation in the dependent variable is explained by variation in the explanatory variables (and 20% is unaccounted for).  It may be noted that an allied measure, namely the adjusted R2, is the preferred measure of a model's explanatory power if the number of explanatory variables is large and the sample size is small.   Both R2 and adjusted R2 values are reported by regression programs.

         The % standard error of the regression (%SER) provides an alternative measure of goodness of fit.   The %SER can be calculated by dividing the standard error of the regression (also called the standard deviation of the error terms) by the mean of the dependent variable.  The %SER measures how large the standard deviation (SD) of the prediction errors (Yi - Ŷi) is relative to the mean of the dependent variable (Ȳ).    For example, a %SER value of .1 indicates that the SD of the prediction errors is .1 or 10% of the mean value for the dependent variable.  Because the errors are assumed to be normally distributed, a %SER of .1 indicates that roughly 68% of the errors are, in absolute value, ≤ 10% of the size of the mean of the dependent variable (and roughly 95% of the errors are ≤ 20% of the size of the mean).   %SERs of ≤ .1 generally indicate that the model has superior predictive ability. 
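The fit measures above can be sketched in a few lines. The function name `goodness_of_fit` and the data are hypothetical, and the sketch assumes the predictions come from a least-squares fit that includes a constant, with k counting every estimated coefficient (constant included).

```python
from math import sqrt

def goodness_of_fit(y, y_hat, k):
    """Return (R2, adjusted R2, %SER) for actual values y, predictions y_hat,
    and k estimated coefficients (including the constant)."""
    n = len(y)
    y_mean = sum(y) / n
    sse = sum((a - p) ** 2 for a, p in zip(y, y_hat))  # unexplained variation
    sst = sum((a - y_mean) ** 2 for a in y)            # total variation
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    ser = sqrt(sse / (n - k))                          # standard error of the regression
    return r2, adj_r2, ser / y_mean

# Hypothetical actual and predicted values (invented for illustration).
y = [10.0, 12.0, 14.0, 16.0, 18.0]
y_hat = [10.5, 11.5, 14.0, 16.5, 17.5]

r2, adj_r2, pct_ser = goodness_of_fit(y, y_hat, k=3)
print(r2, adj_r2, pct_ser)
```

With so few cases relative to the number of coefficients, the adjusted R2 falls noticeably below the plain R2, which is the situation in which the adjusted measure is preferred.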




 
V.  F Test on the Overall Regression

        The F test on the overall regression assesses whether all of the regression coefficients (except the constant) in the "true" model describing the underlying population are equal to zero.  Like other statistical inference tests, this F test involves four steps, namely stating the hypotheses, calculating a sample F statistic, finding a critical F value, and then comparing the sample F to the critical F.  The null and alternative hypotheses in the two explanatory variable regression model above are:

        Ho:  β1 = β2 = 0
        Ha:  Ho is false.

        The sample F statistic has k-1 and n-k degrees of freedom, where k is the number of estimated coefficients (including the constant) and n is the sample size.  The sample F is equal to:

        $$F = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}$$
 

         The critical F statistic can be found in a table of F values for a particular significance level, with k-1 being the degrees of freedom (dfs) in the numerator and n-k the dfs in the denominator.  The significance level refers to the probability of rejecting the null hypothesis when the null is actually true.

         The decision on whether to reject the null hypothesis is similar to other tests of statistical inference, namely: 
          (a)  If the sample F is ≥ the critical F, reject the Ho (indicating that the explanatory variables' regression coefficients in the underlying population are not all zeros) 
          (b)  If the sample F is < the critical F, we can't reject the Ho (indicating that there is no evidence that the explanatory variables' regression coefficients in the underlying population are not all zeros). 
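The four steps can be sketched as follows. The sketch assumes the standard R2-based form of the overall F statistic, with k counting the constant; the function name `overall_f` and the regression results are hypothetical. The critical value is the one printed in standard F tables for 2 and 17 degrees of freedom at the 5% significance level.

```python
def overall_f(r2, n, k):
    """Sample F for Ho: all slope coefficients are zero.
    k counts every estimated coefficient, including the constant."""
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))

# Hypothetical regression results: R2 = 0.60, n = 20 cases, k = 3 coefficients.
f_sample = overall_f(0.60, n=20, k=3)

# Critical F for (k-1, n-k) = (2, 17) degrees of freedom at the 5% level,
# taken from a standard F table.
f_critical = 3.59

reject_h0 = f_sample >= f_critical
print(round(f_sample, 2), reject_h0)
```

Here the sample F of 12.75 exceeds the critical value, so the null hypothesis that both slope coefficients are zero would be rejected at the 5% level.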




 
VI.  T Test on a Single Coefficient

       The t test on a single regression coefficient assesses whether a population regression coefficient is =, ≠, ≤ or ≥ a particular number.  T tests are conducted for each estimated regression coefficient and typically use a reference value of zero when the researcher does not have a prior expectation about what value the population coefficient should be.  Like other statistical inference tests, a four-step process is used to conduct a t test, including stating the hypotheses, calculating a sample t value, finding a critical t value and then comparing the sample t to the critical t.

           To test hypotheses about X1's regression coefficient (β1), the following pairs of null and alternative hypotheses could be assessed with the t test:

         Ho: β1 = # (usually 0)          Ho: β1 ≤ # (usually 0)          Ho: β1 ≥ # (usually 0)
         Ha: β1 ≠ #                      Ha: β1 > #                      Ha: β1 < #

         The sample t statistic has n-k (sample size minus # estimated coefficients) degrees of freedom and can be calculated as:

         $$t = \frac{b - b_{null}}{S_b}$$
 

       where b is the estimated coefficient, bnull is the value contained in the null hypothesis and Sb is the standard error of the estimated coefficient.   By default, when regression programs report the estimated coefficients, they also provide  sample t values that assume that the null hypothesis values are zeros thereby making calculation of the sample t values usually unnecessary. 

          The critical t statistic can be found in a table of t values for a particular significance level, with n-k degrees of freedom.  If the alternative hypothesis is two sided, then two-tailed t values should be utilized; if it is one sided, one-tailed t values should be utilized.  The significance level refers to the probability of rejecting the null hypothesis when the null is actually true.

          The decision on whether to reject the null hypothesis is similar to other t tests of statistical inference, namely: 
          (a)  If the |sample t| is ≥ the critical t, reject the Ho. 
          (b)  If the |sample t| is < the critical t, we can't reject the Ho (indicating that there is no evidence refuting the null hypothesis).
          Remember that for one-tailed tests, an additional requirement for rejecting the null hypothesis is that the estimated coefficient lies on the side of # specified by the alternative hypothesis (for example, the estimate must exceed # when the alternative hypothesis states that the coefficient is > #).
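A minimal sketch of the decision rule follows. The estimated coefficient, its standard error, and the sample size are hypothetical; the critical values are those printed in standard t tables for 20 degrees of freedom at the 5% significance level.

```python
def sample_t(b, b_null, s_b):
    """Sample t: (estimate minus null value) divided by the standard error."""
    return (b - b_null) / s_b

# Hypothetical estimate: b1 = 2.5 with standard error 1.0,
# from a regression with n = 23 cases and k = 3 coefficients (20 df).
t1 = sample_t(2.5, 0.0, 1.0)

t_critical_two_tailed = 2.086  # 5% level, 20 df, two-tailed
t_critical_one_tailed = 1.725  # 5% level, 20 df, one-tailed

# Two-sided alternative (coefficient differs from 0).
reject_two_sided = abs(t1) >= t_critical_two_tailed

# One-sided alternative (coefficient is greater than 0): the estimate must
# also lie on the side named in the alternative hypothesis.
reject_one_sided = (t1 > 0) and abs(t1) >= t_critical_one_tailed

print(t1, reject_two_sided, reject_one_sided)
```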




 
VII.  Standardized Coefficients

        Because the magnitude of the estimated coefficients reflects the magnitudes of the explanatory variables, we cannot determine which explanatory variables have the greatest influence on the dependent variable simply by examining the size of the estimated coefficients.  To assess which explainers have the greatest influence on the dependent variable, standardized coefficients (b*) must be calculated.  Standardized coefficients show how many standard deviations the dependent variable will change if the explanatory variable changes by one standard deviation.  Larger standardized coefficients indicate more influence, smaller ones less. 

           Standardized coefficients (bi*) for each explanatory variable can be calculated as follows:

           $$b_i^* = b_i \, \frac{S_{x_i}}{S_y}$$

          where bi is the estimated regression coefficient, Sxi is the standard deviation of the explanatory variable, and Sy is the standard deviation of the dependent variable.
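The calculation can be sketched with the standard library alone. The data and the two estimated coefficients are invented for illustration; the point of the sketch is that the explainer with the smaller raw coefficient can still have the larger standardized coefficient once its scale is taken into account.

```python
from statistics import stdev

def standardized_coefficient(b_i, x_values, y_values):
    """b_i* = b_i * (SD of the explainer) / (SD of the dependent variable)."""
    return b_i * stdev(x_values) / stdev(y_values)

# Hypothetical sample (invented): X2 is measured on a scale 1000 times larger than X1.
y = [10.0, 30.0, 20.0, 50.0, 40.0]
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1000.0, 3000.0, 2000.0, 5000.0, 4000.0]

# Hypothetical estimated coefficients for X1 and X2.
b1, b2 = 4.0, 0.005

b1_std = standardized_coefficient(b1, x1, y)
b2_std = standardized_coefficient(b2, x2, y)
print(b1_std, b2_std)
```

Although b2 is far smaller than b1, its standardized value (0.5) is larger than that of b1 (0.4), so X2 would be judged the more influential explainer in this made-up example.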




 
VIII.  "Special" Variables

        This section describes a short list of "special" variables that are frequently included in regression models.  They include dummy variables, quadratic and reciprocal variables, and logged variables. 

        Dummy Explanatory Variables

        Frequently analysts are interested in identifying what influence a categorical variable might have on the behavior of a dependent variable.  Some examples of categorical variables that are likely to influence a variety of dependent variables include gender (male/female), supply source (company A vs. company B), type of training (CAI vs. seminar), etc.  One way of incorporating categorical variables into a regression model is to convert each categorical variable into a (1,0) dummy variable.  For example, to estimate gender effects, a dummy variable called FEMALE could be created, where every case corresponding to a woman would be coded as a 1 and every man would be coded as a zero.  The FEMALE dummy variable could then be included as an explanatory variable in a regression. 

        Because dummy variables are essentially on/off switches indicating whether a particular case has the categorical trait, the estimated regression coefficients on the dummy variable measure how many units the dependent variable will change if the dummy variable takes the value of 1.  Hence the estimated coefficient on the FEMALE dummy variable discussed above would show how much the dependent variable would change if a particular case was a woman (vs. a man).  T tests can also be used to assess whether dummy variable coefficients in the underlying population take on particular values.
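The on/off interpretation can be checked with a small sketch. The data are invented, and the regression deliberately includes only a constant and the FEMALE dummy; in that special case the dummy's coefficient equals the difference between the two group means. With other explainers in the model, the coefficient would instead measure the difference holding those explainers constant.

```python
import numpy as np

# Hypothetical sample (invented): FEMALE = 1 for women, 0 for men.
female = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
y = np.array([52.0, 40.0, 48.0, 44.0, 50.0, 42.0])

# Regress Y on a constant and the dummy variable.
X = np.column_stack([np.ones_like(female), female])
(b0, b_female), *_ = np.linalg.lstsq(X, y, rcond=None)

mean_men = y[female == 0].mean()
mean_women = y[female == 1].mean()

print(b0, b_female)            # constant = mean for men, dummy = gap
print(mean_women - mean_men)   # same gap, computed directly
```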

        Quadratic and Reciprocal Variables

        The multiple regression model assumes that the effect of changes in an explanatory variable on the dependent variable remains constant, i.e., if the estimated coefficient on X1 is 5 the model posits that every time X1 increases by 1 unit, Y will increase by 5 units.  However, oftentimes a nonlinear  relationship should be expected between a dependent variable and an explainer.  One way to estimate inherently nonlinear relationships using multiple regression is to include quadratic (squared) terms of the explanatory variables in the regression.   Inclusion of quadratic terms allows the behavior of the dependent variable to be modeled as a quadratic function of the explainer, rather than as a simple linear function.

        Another way to estimate nonlinear relationships between a dependent variable and an explainer is to include the reciprocal of the explainer in the regression (rather than the actual values). 
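As a sketch of the quadratic approach, the example below adds a squared term as an extra explainer and then computes how the effect of a one-unit change in X varies with the level of X. The data are invented and lie exactly on a known quadratic so that the recovered coefficients are easy to verify; the helper `marginal_effect` is a hypothetical name.

```python
import numpy as np

# Hypothetical data lying exactly on Y = 1 + 4*X - 0.5*X^2 (invented).
x = np.arange(0.0, 7.0)
y = 1.0 + 4.0 * x - 0.5 * x ** 2

# Explainers: a constant, X, and X squared.
# (A reciprocal specification would instead use 1.0 / x as a column,
#  which requires every x value to be nonzero.)
X = np.column_stack([np.ones_like(x), x, x ** 2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = b

def marginal_effect(x_value):
    """Slope of the fitted quadratic at x_value: b1 + 2*b2*x."""
    return b1 + 2.0 * b2 * x_value

print(b)
print(marginal_effect(1.0), marginal_effect(5.0))
```

Unlike the constant effect in the purely linear model, the effect of X here is positive at low values of X and negative at high values.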

        Logged Variables

        Regressions can also estimate models that include the logged values of variables (if all data values are > 0).  Logged values are typically included to model nonlinear relationships.

        If only the dependent variable is logged (i.e., the log of Y is used as the dependent variable in an equation including nonlogged X values), then the estimated coefficient of each explainer measures the approximate proportionate change in Y if the explainer changes by 1 unit.  Hence, if X1's coefficient is .1 in a regression with the log of Y, it indicates that every one unit increase in X1 will generate roughly a .1 proportionate change in Y (about a 10% increase).  Note that the relationship between Y and X1 is inherently nonlinear because as X1 increases incrementally, the same 10% increase in Y translates into larger and larger absolute changes in Y.  Logging the dependent variable is common practice in estimating growth models.

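         If only the explanatory variable is logged (i.e., the log of X1 is used to explain the behavior of Y), then the estimated coefficient on the logged explainer measures the change in Y if the log of the explainer changes by 1 unit.  With natural logs, a 1% increase in the explainer raises its log by about .01, so the coefficient divided by 100 approximates the change in Y for a 1% change in the explainer.  Hence, if the coefficient on the log (X1) variable is 25, it indicates that every time X1 increases by 1%, Y will increase by about .25, and every time X1 doubles, Y will increase by 25 x ln(2), or about 17.3.   This relationship is also inherently nonlinear, because the effect of a one unit change in X1 on Y diminishes as X1 increases. 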

         If both the dependent variable and the explanatory variables are logged, then the estimated coefficient of each explainer is equal to the percentage change in Y divided by the percentage change in the explainer. That is, if the coefficient on logX1 in a double logged model is .1, it indicates that the percentage change in Y will always be 1/10th of the percentage change in X1.  Hence a 10% change in X1 will generate a 1% change in Y.  Coefficients generated from double-logged models are often termed elasticities. 
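The three interpretations can be checked numerically. The sketch below assumes natural logs and reuses the coefficient values from the examples in this section (.1 and 25); it computes the exact changes those coefficients imply, which are close to, but not identical with, the rule-of-thumb percentages. Note that with natural logs a doubling of X1 raises log(X1) by ln(2), about .69, rather than by a full unit.

```python
from math import exp, log

# Logged dependent variable: log(Y) = a + 0.1*X1.
# A one-unit rise in X1 multiplies Y by exp(0.1).
pct_change_log_y = (exp(0.1) - 1) * 100

# Logged explainer: Y = a + 25*log(X1).
change_y_one_pct = 25 * log(1.01)    # X1 rises by 1%
change_y_doubling = 25 * log(2.0)    # X1 doubles

# Both logged: log(Y) = a + 0.1*log(X1), so 0.1 is the elasticity.
# A 10% rise in X1 multiplies Y by 1.10 ** 0.1.
pct_change_loglog = (1.10 ** 0.1 - 1) * 100

print(round(pct_change_log_y, 2))    # close to the 10% rule of thumb
print(round(change_y_one_pct, 3))    # close to 25/100
print(round(change_y_doubling, 2))
print(round(pct_change_loglog, 2))   # close to the 1% rule of thumb
```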




 
IX.  Other Concerns

         The default "formulas" used by statistics programs to estimate regression statistics are termed the "ordinary least squares (OLS)" formulas, or just classical regression.  In order for OLS formulas to generate appropriate estimates, the data and model being analyzed must conform to a specified set of assumptions.  If the assumptions are "violated" then the results generated might be unreliable.  This section describes two common problems that frequently arise, their methods of detection and their cures.

            A.  Error terms with unequal variances ( = heteroskedasticity)

          If the error terms are distributed with constant variance, they are said to be homoskedastic. Frequently, however, in cross-sectional studies (studies that compare individuals, firms, etc. at a single point in time) the size of the error terms will be influenced by the size of the explanatory variables.  If the error terms vary systematically in size, they are said to be heteroskedastic.  The use of classical (OLS) regression with heteroskedastic errors, while yielding unbiased estimates of the regression coefficients, will generate biased estimates of the standard errors of the estimated coefficients, making t-tests on the coefficients unreliable.
     A cursory examination of plots of the estimated error terms against the regression explainers can suggest whether the errors have constant variance (i.e., are homoskedastic).  A formal test for homoskedasticity can be conducted using the Breusch-Pagan (B-P) statistic, which is distributed as chi-squared with degrees of freedom equal to the number of explanatory variables used in the test (k-1 when all of the explainers are included).  If the sample's B-P statistic is >= the critical chi square value, the null hypothesis that the errors are homoskedastic should be rejected. 
       Most basic statistics programs (like Excel/PHStat/SPSS) do not estimate the B-P statistic.  The EALimdep program, however, calculates the sample B-P statistic and also generates corrected coefficient standard errors for reliable t-tests.
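For readers without access to a program that reports the statistic, the sketch below computes one common variant by hand. It is an assumption of this sketch that the Koenker ("studentized") form is acceptable: the squared OLS prediction errors are regressed on the explainers, and the statistic is n times the R2 of that auxiliary regression. The value may differ from the B-P statistic a particular program reports. The data and the function name `breusch_pagan_lm` are hypothetical.

```python
import numpy as np

def breusch_pagan_lm(X, y):
    """Koenker-style B-P statistic: n * R2 from regressing the squared OLS
    prediction errors on X. X must include a column of ones."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ b) ** 2                      # squared prediction errors
    g, *_ = np.linalg.lstsq(X, e2, rcond=None)
    fitted = X @ g
    r2_aux = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    df = X.shape[1] - 1                        # explainers, excluding the constant
    return len(y) * r2_aux, df

# Hypothetical data (invented): deviations from the line grow with x.
x = np.arange(1.0, 11.0)
signs = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
y = 2.0 + 3.0 * x + 0.5 * x * signs
X = np.column_stack([np.ones_like(x), x])

lm, df = breusch_pagan_lm(X, y)
chi2_critical = 3.84   # 5% level, 1 degree of freedom, from a chi-square table
print(lm, df, lm >= chi2_critical)
```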

   B.  Error terms that are not independent ( = serial correlation)

       Classical (OLS) regression assumes that the size of each error term is not influenced by the size of other error terms (the error terms are independent of each other).  This assumption is very likely to be violated in time-series studies (studies that examine the behavior of a dependent variable across time).   In time-series studies, each period's error term is likely to be correlated to the previous period's error, a process termed serial correlation (or autocorrelation) of the error terms.  Serial (auto) correlation arises because if the regression model starts to overpredict values for the dependent variable, it is likely to do so for several time periods in a row. Similarly, underpredictions are also likely to occur for contiguous time periods.
        If the error terms are serially correlated, the use of classical (OLS) regression is unwarranted.  First, the coefficient estimators are not efficient, yielding sample coefficients that will vary excessively from the true parameters.  Second, the standard errors of the estimated coefficients will typically be biased downward (too small), generating excessively large sample t values (leading to unwarranted rejections of coefficient null hypotheses).  Third, the standard error of the regression will also typically be biased downward, overstating the predictive power of the regression.
        A cursory examination of a plot of the error terms over time can indicate whether the errors follow patterns over time.  The Durbin-Watson (D-W) test formally tests whether the null hypothesis that the error terms are not serially correlated should be rejected.  The D-W test involves comparing the sample's D-W statistic, which is usually reported with the ordinary least squares regression results, to a table containing two critical values.  If the sample D-W statistic is less than the lower Dl critical value, the null hypothesis is rejected, indicating that the error terms are (positively) serially correlated.  If the sample D-W statistic is greater than the upper Du critical value, the null hypothesis is not rejected, indicating there is insufficient evidence to conclude that the errors are correlated.  If Dl <= sample D-W <= Du, the test is inconclusive.
       Most basic statistics programs (like Excel/PHStat/SPSS) will estimate the D-W statistic, but will not estimate "corrected" results if the error terms are serially correlated.  The EALimdep program, however, generates the appropriate results in the presence of serial correlation.
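The D-W statistic itself is simple to compute from the prediction errors: it is the sum of squared period-to-period changes in the errors divided by the sum of the squared errors. The sketch below uses two invented error series to show how the statistic behaves; values near 2 suggest no first-order serial correlation, values well below 2 suggest positive serial correlation, and values well above 2 suggest negative serial correlation. The Dl and Du critical values must still be read from a D-W table.

```python
def durbin_watson(errors):
    """Sum of squared successive differences divided by sum of squared errors."""
    numerator = sum((errors[t] - errors[t - 1]) ** 2 for t in range(1, len(errors)))
    denominator = sum(e ** 2 for e in errors)
    return numerator / denominator

# Hypothetical error series (invented for illustration).
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # runs of over/underprediction
alternating = [1.0, -1.0, 1.0, -1.0]             # sign flips every period

dw_persistent = durbin_watson(persistent)
dw_alternating = durbin_watson(alternating)
print(dw_persistent, dw_alternating)
```

The persistent series, which mimics overpredictions followed by underpredictions in contiguous periods, yields a statistic well below 2, while the alternating series yields one well above 2.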
