| Econometrics | Dr. Robert Jantzen |
| III. Coefficient Interpretation
In general, each estimated coefficient (except the constant b0) gauges how many units the dependent variable will change if the corresponding explanatory variable changes by one unit, holding the other explanatory variables constant. Hence, it is important to bear in mind what units the dependent and explanatory variables are measured in. For example, consider the case where the first explanatory variable (X1) is measured in millions of dollars while the second (X2) is measured in percentage points, and the dependent variable (Y) is measured in metric tons. The coefficient on X1 will show how many metric tons the dependent variable will change if X1 changes by one million dollars (= 1 unit of X1). The coefficient on X2, in contrast, will show how many metric tons Y will change if X2 changes by one percentage point (= 1 unit of X2). The estimated constant term (b0) shows what the dependent variable will average if all of the explanatory variables have zero values. Since in most circumstances it is unreasonable to expect that all of the explainers will have zero values, estimates of the constant's size are ordinarily of little import. |
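The unit interpretation above can be illustrated with a minimal Python sketch. The coefficient values (10, 2.5 and -0.5) and the function name predict_y are hypothetical, chosen only for illustration; they are not estimates from any data set in these notes.

```python
# Hypothetical estimates for Y = b0 + b1*X1 + b2*X2 (illustrative values only).
B0 = 10.0   # constant: average Y (metric tons) when X1 = X2 = 0
B1 = 2.5    # metric tons of Y per one million dollars of X1
B2 = -0.5   # metric tons of Y per one percentage point of X2


def predict_y(x1_millions, x2_pct_points):
    """Predicted Y in metric tons for given values of X1 and X2."""
    return B0 + B1 * x1_millions + B2 * x2_pct_points


base = predict_y(4.0, 6.0)           # 10 + 10 - 3 = 17.0
x1_plus_one = predict_y(5.0, 6.0)    # X1 up by one unit, X2 held fixed
x2_plus_one = predict_y(4.0, 7.0)    # X2 up by one unit, X1 held fixed

print(x1_plus_one - base)    # 2.5  (the coefficient on X1)
print(x2_plus_one - base)    # -0.5 (the coefficient on X2)
print(predict_y(0.0, 0.0))   # 10.0 (the constant)
```

Changing one explainer by one unit while holding the other fixed moves the prediction by exactly that explainer's coefficient, in the units of Y.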
| IV. Goodness of Fit
The degree to which the explanatory variables explain the behavior of the dependent variable can be assessed with two measures, namely the R squared (R2) and the % standard error of the regression (%SER). The R squared (R2) measures the proportion of the variation in the dependent variable that is explained by the behavior of the explanatory variables, and will range between 0 and 1. An R2 value of .8 indicates that 80% of the variation in the dependent variable is explained by variation in the explanatory variables (and 20% is unaccounted for). It may be noted that an allied measure, namely the adjusted R2, is the preferred measure of a model's explanatory power if the number of explanatory variables is large and the sample size is small. Both R2 and adjusted R2 values are reported by regression programs.
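As a rough illustration of the two R2 measures, the sketch below computes them by hand. The toy data and function names are assumptions made for this example. The adjusted R2 formula used, 1 - (1 - R2)(n - 1)/(n - k) with k counting all estimated coefficients including the constant, is the standard one and is assumed to match what regression programs report.

```python
def r_squared(actual, fitted):
    """Share of the variation in the dependent variable explained by the model."""
    mean_y = sum(actual) / len(actual)
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    ss_resid = sum((y - f) ** 2 for y, f in zip(actual, fitted))
    return 1.0 - ss_resid / ss_total


def adjusted_r_squared(r2, n, k):
    """R squared penalized for the number of estimated coefficients k."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


# Toy data (hypothetical). The line 2.2 + 0.6*X is the least-squares fit here.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.2 + 0.6 * x for x in xs]

r2 = r_squared(ys, fitted)
adj_r2 = adjusted_r_squared(r2, n=len(ys), k=2)   # constant + one slope

print(round(r2, 3))      # 0.6
print(round(adj_r2, 3))  # 0.467
```

With only five observations the adjustment is sizable, which is why the adjusted measure matters most in small samples.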
The % standard error of the regression (%SER) provides an alternative measure
of goodness of fit. The %SER can be calculated by dividing
the standard error of the regression (also called the standard deviation
of the error terms) by the mean of the dependent variable. The %SER
measures how large the standard deviation (SD) of the prediction errors
( |
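The %SER calculation described above can be sketched as follows. The data are hypothetical, and the standard error of the regression is assumed to be the usual square root of the sum of squared residuals divided by n-k.

```python
import math


def percent_ser(actual, fitted, k):
    """Standard error of the regression as a percentage of the mean of Y."""
    n = len(actual)
    ss_resid = sum((y - f) ** 2 for y, f in zip(actual, fitted))
    ser = math.sqrt(ss_resid / (n - k))   # standard error of the regression
    mean_y = sum(actual) / n
    return 100.0 * ser / mean_y


# Hypothetical actual and fitted values from a model with k = 2 coefficients.
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]

pct_ser = percent_ser(ys, fitted, k=2)
print(round(pct_ser, 2))   # 22.36
```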
| V. F Test on the Overall Regression
The F test on the overall regression assesses whether all of the regression coefficients (except the constant) in the "true" model describing the underlying population are equal to zero. Like other statistical inference tests, this F test involves four steps, namely stating the hypotheses, calculating a sample F statistic, finding a critical F value, and then comparing the sample F to the critical F. The null and alternative hypotheses in the two explanatory variable regression model above are:
Ho: b1 = b2 = 0
Ha: at least one of b1 and b2 is not equal to 0
The sample F statistic has k-1, n-k degrees of freedom where k is the number of estimated coefficients and n is the sample size. The sample F is equal to:
F = [R2 / (k-1)] / [(1 - R2) / (n-k)]
The critical F statistic can be found in a table of F values for a particular significance level, with k-1 being the degrees of freedom (dfs) in the numerator and n-k the dfs in the denominator. The significance level refers to the probability of rejecting the null hypothesis when the null is actually true.
The decision on whether to accept or reject the null hypothesis is similar to other tests of statistical inference, namely: reject Ho if the sample F is greater than the critical F; otherwise, do not reject Ho.
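A minimal Python sketch of the F test steps follows. It assumes the sample F is computed from R2 in the standard way, with k-1 and n-k degrees of freedom. The R2, n and k values are hypothetical, and the critical value 3.35 is the 5% entry for 2 and 27 dfs read from a standard F table.

```python
def overall_f_statistic(r2, n, k):
    """Sample F with (k-1, n-k) degrees of freedom, computed from R squared."""
    return (r2 / (k - 1)) / ((1.0 - r2) / (n - k))


R2 = 0.8                 # hypothetical R squared
N = 30                   # hypothetical sample size
K = 3                    # constant + two slope coefficients
CRITICAL_F_5PCT = 3.35   # F table, 5% level, 2 and 27 degrees of freedom

sample_f = overall_f_statistic(R2, N, K)
reject_null = sample_f > CRITICAL_F_5PCT   # True means reject Ho: b1 = b2 = 0

print(round(sample_f, 2))   # 54.0
print(reject_null)          # True
```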
|
| VI. T Test on a Single Coefficient
The t test on a single regression coefficient assesses whether a population regression coefficient is =, ≠, ≤ or ≥ a particular number. T tests are conducted for each estimated regression coefficient and typically use a reference value of zero when the researcher does not have a prior expectation about what value the population coefficient should be. Like other statistical inference tests, a four step process is used to conduct a t test, including stating the hypotheses, calculating a sample t value, finding a critical t value and then comparing the sample t to the critical t. To test hypotheses about X1's regression coefficient (b1), the following pairs of null and alternative hypotheses could be assessed with the t test:
Ho: b1 = # (usually 0), Ha: b1 ≠ #
Ho: b1 ≤ # (usually 0), Ha: b1 > #
Ho: b1 ≥ # (usually 0), Ha: b1 < #
The sample t statistic has n-k (sample size minus # estimated coefficients) degrees of freedom and can be calculated as:
t = (b - bnull) / Sb
where b is the estimated coefficient, bnull is the value contained in the null hypothesis and Sb is the standard error of the estimated coefficient. By default, when regression programs report the estimated coefficients, they also provide sample t values that assume that the null hypothesis values are zeros, thereby making calculation of the sample t values usually unnecessary. The critical t statistic can be found in a table of t values for a particular significance level, with n-k degrees of freedom. If the alternative hypothesis is two sided, then two-tailed t values should be utilized; if it is one sided, one-tailed t values should be utilized. The significance level refers to the probability of rejecting the null hypothesis when the null is actually true.
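The decision on whether to accept or reject the null hypothesis is similar to other t tests of statistical inference, namely: for a two-sided alternative, reject Ho if the absolute value of the sample t exceeds the critical t; for a one-sided alternative, reject Ho if the sample t lies beyond the critical t in the direction of the alternative hypothesis. Otherwise, do not reject Ho.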
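The t test steps can be sketched in Python as below. The coefficient and standard error are hypothetical, the function names are illustrative, and the critical values (2.052 two-tailed and 1.703 one-tailed, 5% level, 27 dfs) are read from a standard t table.

```python
def sample_t(b, b_null, s_b):
    """Sample t: distance of the estimate from the null value in standard errors."""
    return (b - b_null) / s_b


def reject_null(t_value, critical_t, alternative="two-sided"):
    """Compare the sample t to a positive critical t taken from a t table."""
    if alternative == "two-sided":   # Ha: b != b_null
        return abs(t_value) > critical_t
    if alternative == "greater":     # Ha: b > b_null
        return t_value > critical_t
    if alternative == "less":        # Ha: b < b_null
        return t_value < -critical_t
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")


CRITICAL_T_TWO_TAILED = 2.052   # t table, 5% level, 27 degrees of freedom
CRITICAL_T_ONE_TAILED = 1.703   # t table, 5% level, 27 degrees of freedom

t_b1 = sample_t(b=0.9, b_null=0.0, s_b=0.5)   # hypothetical estimate, t = 1.8

print(reject_null(t_b1, CRITICAL_T_TWO_TAILED, "two-sided"))  # False
print(reject_null(t_b1, CRITICAL_T_ONE_TAILED, "greater"))    # True
```

The same sample t can fail a two-sided test yet pass a one-sided one, which is why the form of the alternative hypothesis has to be settled before consulting the table.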
|
| VII. Standardized Coefficients
Because the magnitude of the estimated coefficients reflects the magnitudes of the explanatory variables, we cannot determine which explanatory variables have the greatest influence on the dependent variable simply by examining the size of the estimated coefficients. To assess which explainers have the greatest influence on the dependent variable, standardized coefficients (b*) must be calculated. Standardized coefficients show how many standard deviations the dependent variable will change if the explanatory variable changes by one standard deviation. Larger standardized coefficients (in absolute value) indicate more influence, smaller ones less. Standardized coefficients (bi*) for each explanatory variable can be calculated as follows:
bi* = bi × (standard deviation of Xi / standard deviation of Y) |
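A small Python sketch of the standardization follows. It assumes the usual definition, in which the raw coefficient is rescaled by the ratio of the explainer's standard deviation to the dependent variable's standard deviation. The data and raw coefficients are hypothetical and are not estimated from one another.

```python
from statistics import stdev


def standardized_coefficient(b, x_values, y_values):
    """Raw coefficient rescaled to standard deviation units of X and Y."""
    return b * stdev(x_values) / stdev(y_values)


# Hypothetical data: X1 in millions of dollars, X2 in percentage points.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [10.0, 30.0, 20.0, 50.0, 40.0]
y = [10.0, 24.0, 22.0, 41.0, 40.0]

b1, b2 = 4.0, 0.5   # hypothetical raw coefficients

b1_star = standardized_coefficient(b1, x1, y)
b2_star = standardized_coefficient(b2, x2, y)

# The raw coefficient on X1 is eight times larger, yet X2 has more influence.
print(b1 > b2)             # True
print(b2_star > b1_star)   # True
```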
| VIII. "Special" Variables
This section describes a short list of "special" variables that are frequently included in regression models. They include dummy variables, quadratic and reciprocal variables, and logged variables.
Dummy Explanatory Variables
Frequently analysts are interested in identifying what influence a categorical variable might have on the behavior of a dependent variable. Some examples of categorical variables that are likely to influence a variety of dependent variables include gender (male/female), supply source (company A vs. company B), type of training (CAI vs. seminar), etc. One way of incorporating categorical variables into a regression model is to convert each categorical variable into a (1,0) dummy variable. For example, to estimate gender effects, a dummy variable called FEMALE could be created, where every case corresponding to a woman would be coded as a 1 and every man would be coded as a zero. The FEMALE dummy variable could then be included as an explanatory variable in a regression.
Because dummy variables are essentially on/off switches indicating whether a particular case has the categorical trait, the estimated regression coefficient on a dummy variable measures how many units the dependent variable will change if the dummy variable takes the value of 1 rather than 0. Hence the estimated coefficient on the FEMALE dummy variable discussed above would show how much the dependent variable would change if a particular case was a woman (vs. a man). T tests can also be used to assess whether dummy variable coefficients in the underlying population take on particular values.
Quadratic and Reciprocal Variables
The multiple regression model assumes that the effect of changes in an explanatory variable on the dependent variable remains constant, i.e., if the estimated coefficient on X1 is 5 the model posits that every time X1 increases by 1 unit, Y will increase by 5 units. However, oftentimes a nonlinear relationship should be expected between a dependent variable and an explainer. One way to estimate inherently nonlinear relationships using multiple regression is to include quadratic (squared) terms of the explanatory variables in the regression. Inclusion of quadratic terms allows the behavior of the dependent variable to be modeled as a quadratic function of the explainer, rather than as a simple linear function. Another way to estimate nonlinear relationships between a dependent variable and an explainer is to include the reciprocal of the explainer in the regression (rather than the actual values).
Logged Variables
Regressions can also estimate models that include the logged values of variables (if all data values are > 0). Logged values are typically included to model nonlinear relationships. The interpretations below assume natural logs.
If only the dependent variable is logged (i.e., the log of Y is used as the dependent variable in an equation including nonlogged X values), then the estimated coefficient of each explainer measures the proportionate change in Y if the explainer changes by 1 unit. Hence, if X1's coefficient is .1 in a regression with the log of Y, it indicates that every one unit change in X1 will generate a + .1 proportionate unit change in Y (which is approximately a 10% increase; the exact figure is e^.1 - 1, or about 10.5%). Note that the relationship between Y and X1 is inherently nonlinear because as X1 increases incrementally, the same 10% increase in Y translates into larger and larger absolute changes in Y. Logging the dependent variable is common practice in estimating growth models.
If only the explanatory variable is logged (i.e., the log of X1 is used to explain the behavior of Y), then the estimated coefficient on the logged explainer measures the change in Y if the log of the explainer increases by 1 unit, so a 1% change in the explainer changes Y by approximately the coefficient divided by 100. Hence, if the coefficient on the log(X1) variable is 25, it indicates that every 1% increase in X1 will increase Y by about .25, and that every time X1 doubles, Y will increase by 25 × ln(2), or about 17.3. This relationship is also inherently nonlinear, because the effect of a one unit change in X1 on Y diminishes as X1 increases.
If both the dependent variable and the explanatory variables are logged, then the estimated coefficient of each explainer is equal to the percentage change in Y divided by the percentage change in the explainer. That is, if the coefficient on logX1 in a double logged model is .1, it indicates that the percentage change in Y will be 1/10th of the percentage change in X1 (for small changes). Hence a 10% change in X1 will generate roughly a 1% change in Y. Coefficients generated from double-logged models are often termed elasticities. |
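The three logged-variable cases can be checked numerically with a short Python sketch. It assumes natural logs throughout, and the function names are illustrative rather than taken from any statistics package. Each function returns the exact effect implied by the model, which can be set against the rule-of-thumb readings of the coefficients.

```python
import math


def pct_change_y_log_dependent(b, delta_x=1.0):
    """log(Y) on X: exact % change in Y when X rises by delta_x units."""
    return (math.exp(b * delta_x) - 1.0) * 100.0


def change_y_log_explainer(b, pct_change_x):
    """Y on log(X): change in Y (in units) when X rises by pct_change_x percent."""
    return b * math.log(1.0 + pct_change_x / 100.0)


def pct_change_y_double_log(b, pct_change_x):
    """log(Y) on log(X): exact % change in Y when X rises by pct_change_x percent."""
    return ((1.0 + pct_change_x / 100.0) ** b - 1.0) * 100.0


print(round(pct_change_y_log_dependent(0.1), 2))       # 10.52 (rule of thumb: 10%)
print(round(change_y_log_explainer(25.0, 1.0), 3))     # 0.249 (rule of thumb: 0.25)
print(round(change_y_log_explainer(25.0, 100.0), 2))   # 17.33 (X1 doubles)
print(round(pct_change_y_double_log(0.1, 10.0), 2))    # 0.96  (rule of thumb: 1%)
```

The rule-of-thumb readings are close for small changes and drift from the exact figures as the change in the explainer grows.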
| IX. Other Concerns
The default "formulas" used by statistics programs to estimate regression statistics are termed the "ordinary least squares (OLS)" formulas, or just classical regression. In order for OLS formulas to generate appropriate estimates, the data and model being analyzed must conform to a specified set of assumptions. If the assumptions are "violated" then the results generated might be unreliable. This section describes two common problems that frequently arise, their methods of detection and their cures.
A. Error terms with unequal variances (= heteroskedasticity)
If the error terms are distributed with constant variance, they are said
to be homoskedastic. Frequently, however, in cross-sectional
studies (studies that compare individuals, firms, etc. at a single
point in time) the size of the error terms will be influenced by the size
of the explanatory variables. If the error terms vary systematically
in size, they are said to be heteroskedastic. The use of classical
(OLS) regression with heteroskedastic errors, while yielding unbiased
estimates of the regression coefficients, will generate biased estimates
of the standard errors of the estimated coefficients, making t-tests
on the coefficients unreliable.
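One rough way to see whether error size grows with an explainer is to compare the average squared residual in the high-X half of the sample with that in the low-X half. The sketch below is an informal check in the spirit of the Goldfeld-Quandt test, which is not described in these notes. The residuals are hypothetical and no formal critical value is applied.

```python
def residual_variance_ratio(x_values, residuals):
    """Mean squared residual in the high-X half divided by the low-X half."""
    pairs = sorted(zip(x_values, residuals))   # order cases by the explainer
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[-half:]]

    def mean_sq(errs):
        return sum(e * e for e in errs) / len(errs)

    return mean_sq(high) / mean_sq(low)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Hypothetical residuals whose size grows with X (heteroskedastic pattern).
growing = [0.5, -0.5, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0]
# Hypothetical residuals of constant size (homoskedastic pattern).
steady = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(residual_variance_ratio(xs, growing))   # 16.0
print(residual_variance_ratio(xs, steady))    # 1.0
```

A ratio far from 1 hints that the error variance is not constant; a formal test would compare the ratio with a critical F value.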
B. Error terms that are not independent ( = serial correlation)
Classical (OLS) regression assumes that the size of each error term is
not influenced by the size of other error terms (the error terms are independent
of each other). This assumption is very likely to be violated in
time-series
studies (studies that examine the behavior of a dependent variable
across time). In time-series studies, each period's
error term is likely to be correlated with the previous period's error, a
process termed serial correlation (or autocorrelation) of
the error terms. Serial (auto) correlation arises because
if the regression model starts to overpredict values for the dependent
variable, it is likely to do so for several time periods in a row. Similarly,
underpredictions are also likely to occur for contiguous time periods.
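The runs of over- and underprediction described above can be summarized with the Durbin-Watson statistic. It is named here as an assumption, since these notes do not specify a detection method. Values near 2 suggest no first-order serial correlation, values toward 0 suggest positive serial correlation, and values toward 4 suggest negative serial correlation. The residual series below are hypothetical.

```python
def durbin_watson(residuals):
    """Sum of squared changes in the residuals over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals in time order: a run of overpredictions, then a run
# of underpredictions.
runs = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
# Hypothetical residuals that flip sign every period.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(durbin_watson(runs))          # 0.5 (well below 2: positive serial correlation)
print(durbin_watson(alternating))   # 3.5 (well above 2: negative serial correlation)
```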
|