Definition, Causes, Consequences and Solutions for Autocorrelation, Heteroscedasticity, Multicollinearity and Specification error.
Autocorrelation
Autocorrelation refers to the degree of correlation between the
values of the same variables across different observations in the data. The concept of autocorrelation is most often discussed in the
context of time series data in
which observations occur at different points in time (e.g., air temperature
measured on different days of the month). For example, one might expect the air
temperature on the 1st day of the month to be more similar to the temperature
on the 2nd day compared to the 31st day. If the temperature values that
occurred closer together in time are, in fact, more similar than the
temperature values that occurred farther apart in time, the data would be autocorrelated.
However,
autocorrelation can also occur in cross-sectional data when the observations
are related in some other way. In a survey, for instance, one might
expect people from nearby geographic locations to provide more similar answers
to each other than people who are more geographically distant. Similarly,
students from the same class might perform more similarly to each other than
students from different classes. Thus, autocorrelation can occur if
observations are dependent in aspects other than time. Autocorrelation
can cause problems in conventional analyses (such as ordinary least squares
regression) that assume independence of observations.
In
a regression analysis, autocorrelation of the regression residuals can also
occur if the model is incorrectly specified. For example, if you are
attempting to model a simple linear relationship but the observed relationship
is non-linear (i.e., it follows a curved or U-shaped function), then the
residuals will be autocorrelated.
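As a rough illustration of this point, here is a minimal sketch in plain NumPy (the data are simulated and every name is illustrative): it fits a straight line to a U-shaped relationship and checks the lag-1 correlation of the residuals.

```python
# Minimal sketch: residuals from a straight-line fit to a U-shaped relationship
# end up autocorrelated. Simulated data; nothing here comes from a real dataset.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = x**2 + rng.normal(scale=0.5, size=x.size)   # true relationship is quadratic

# Fit the misspecified model y = a + b*x by ordinary least squares.
b, a = np.polyfit(x, y, deg=1)
resid = y - (a + b * x)

# Lag-1 autocorrelation of the residuals; values far from 0 signal autocorrelation.
r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(f"lag-1 residual autocorrelation: {r1:.2f}")  # strongly positive here
```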
Causes of Autocorrelation
One of the causes of autocorrelation is the omission of variables from the model. When an important independent variable is omitted from a model, its effect on the dependent variable becomes part of the error term. Hence, if the omitted variable has a positive or negative correlation with the dependent variable, it is likely to cause error terms that are positively or negatively correlated. There are a number of specification-error tests for detecting omitted variables in a regression analysis, but one rarely knows the best test to use. One study used a bootstrapping experiment, judged against the properties that estimators should possess if they are to be accepted as good and satisfactory estimates of the population parameters; the models investigated in the bootstrapping experiment consist of two autocorrelation models with autocorrelation levels ρ = 0.5 and 0.9. A bootstrap simulation approach was used to generate data for each of the models at sample sizes (n) of 20, 30, 50, and 80, each with 100 replications (r). For the models considered, the experiment reveals that the estimated β's were seriously affected by autocorrelation, which may be due to omitted variables, as the autocorrelation level varies across the models (i.e., it produces biased and inefficient estimators).
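The sketch below is not that bootstrapping experiment; it is a minimal simulation in the same spirit (AR(1) errors at ρ = 0.5 and 0.9, the same sample sizes, 100 replications) showing the spread of the OLS slope growing with ρ, i.e. the loss of precision described above. The bias found in the original experiment is attributed there to omitted variables, which this toy setup deliberately does not include.

```python
# Minimal sketch (simulated data, illustrative parameter values): how AR(1)
# autocorrelation in the errors inflates the spread of the OLS slope estimate.
import numpy as np

rng = np.random.default_rng(1)

def ols_slopes(rho, n, reps=100):
    """OLS slope of y = 2 + 3x + u with AR(1) errors u_t = rho*u_{t-1} + e_t."""
    slopes = []
    for _ in range(reps):
        x = rng.normal(size=n)
        e = rng.normal(size=n)
        u = np.zeros(n)
        for t in range(1, n):
            u[t] = rho * u[t - 1] + e[t]
        y = 2 + 3 * x + u
        slopes.append(np.polyfit(x, y, 1)[0])   # slope from a straight-line fit
    return np.array(slopes)

for rho in (0.5, 0.9):
    for n in (20, 30, 50, 80):
        s = ols_slopes(rho, n)
        print(f"rho={rho}, n={n}: mean beta-hat={s.mean():.2f}, sd={s.std():.2f}")
```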
The second cause of autocorrelation is misspecification. Misspecification of the form of the relationship can also introduce autocorrelation into the data; it is usually assumed that the form of the relationship between the study and explanatory variables is linear. (The difference between the observed and true values of a variable is called measurement error, or errors in variables.) In economic applications, the presence of autocorrelation in the residuals is often the result of a misspecification of the underlying model. One important form of misspecification occurs when the true model is dynamic and the investigator wrongly assumes that it is static. The estimated residuals from such a regression equation are then likely to show some degree of autocorrelation. In such cases, the Durbin–Watson statistic gives ample warning that something is wrong. But, although the signal is right, the reason for it might not have anything to do with autocorrelation; it might simply mean that some dynamic effect has been neglected.
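For reference, the Durbin–Watson statistic mentioned above can be computed directly from the residuals; a minimal NumPy sketch, assuming `resid` holds the OLS residuals in time order:

```python
# Minimal sketch of the Durbin-Watson statistic from ordered residuals.
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
    Values near 2 indicate little first-order autocorrelation;
    values well below 2 suggest positive autocorrelation."""
    diff = np.diff(resid)
    return np.sum(diff**2) / np.sum(resid**2)
```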
The third cause is systematic error in measurement. As stated by Ku (1969), “systematic error is a fixed deviation that is inherent in each and every measurement.” Hence, measurements can be corrected for the systematic error if its magnitude and direction are known. Complex devices make it difficult to predict their accuracy; leaks and variations of temperature and pressure also influence accuracy. Large-volume flow-type tests suffer from sudden variation of species composition due to their larger residence times. All of these are sources of error. The mechanical design parameters and dimensions of experimental systems immensely affect the accuracy of measurements, so careful analysis of their design, and innovative improvements to increase their accuracy, are vital for measurement with better accuracy.
However, systematic error can also be caused by an imperfection in the equipment being used or by mistakes the individual makes while taking the measurement. A balance incorrectly calibrated would result in a systematic error; consistently reading a buret wrong would also result in a systematic error.
Consequences of Autocorrelation
The presence of autocorrelation in the errors has several effects on the ordinary least-squares regression procedure. These are summarized as follows:
· Ordinary least-squares regression coefficients are still unbiased.
· OLS regression coefficients are no longer efficient, i.e., they are no longer minimum-variance estimates. We say that these estimates are inefficient.
· The residual mean square MS_res may seriously underestimate σ². Consequently, the standard errors of the regression coefficients may be too small. Thus, confidence intervals are shorter than they really should be, and tests of hypotheses on individual regression coefficients may indicate that one or more regressors contribute significantly to the model when they really do not. Generally, underestimating σ² gives the researcher a false impression of accuracy.
· The confidence intervals and tests of hypotheses based on the t and F distributions are no longer appropriate.
Remedies of Autocorrelation.
When temporal autocorrelation is found to be present in the dataset, one of the first remedial measures should be to investigate the omission of one or more key explanatory variables, especially variables that are related to time. If adding such a variable does not reduce or eliminate the temporal autocorrelation of the error terms, a differencing procedure should be applied to all temporal independent variables in the dataset to convert them into their differenced values, and the regression model rerun with the intercept deleted from the model. If this remedy does not help in eliminating temporal autocorrelation, certain transformations can be performed on all variables to account for the AR term. These transformations perform repeated iterative steps to minimize the sum of squared errors in the regression model; examples are the Cochrane–Orcutt procedure and the Hildreth–Lu procedure. More advanced methods, such as Fourier series analysis and spectral analysis, can also be used for large datasets.
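A minimal sketch of the Cochrane–Orcutt idea for a single-regressor model, assuming `x` and `y` are NumPy arrays ordered in time; the function name and the fixed iteration count are choices made here for illustration, not part of any standard library:

```python
# Minimal Cochrane-Orcutt sketch for y = a + b*x with AR(1) errors.
import numpy as np

def cochrane_orcutt(x, y, n_iter=20):
    b, a = np.polyfit(x, y, 1)                              # initial OLS fit
    for _ in range(n_iter):
        e = y - (a + b * x)                                 # current residuals
        rho = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)  # estimate AR(1) coefficient
        # Quasi-difference the data and re-estimate by OLS.
        y_star = y[1:] - rho * y[:-1]
        x_star = x[1:] - rho * x[:-1]
        b, a_star = np.polyfit(x_star, y_star, 1)
        a = a_star / (1 - rho)                              # recover the original intercept
    return a, b, rho
```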
Heteroscedasticity (also spelled heteroskedasticity) refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it. In statistics, heteroskedasticity occurs when the standard deviations of a variable, monitored over a specific amount of time, are non-constant. With heteroskedasticity, the tell-tale sign upon visual inspection of the residual errors is that they tend to fan out over time.
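A minimal simulated example of that fan-out pattern (all numbers here are made up for illustration): the error spread grows with x, and so does the spread of the residuals.

```python
# Minimal sketch: residual spread grows with x when the error variance depends on x.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 200)
y = 1 + 2 * x + rng.normal(scale=0.5 * x)       # error scale grows with x

b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)

# Compare residual spread in the lower and upper halves of x.
lo, hi = resid[: x.size // 2], resid[x.size // 2 :]
print(f"residual sd, small x: {lo.std():.2f}; large x: {hi.std():.2f}")
```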
Causes of Heteroskedasticity:
Heteroscedasticity is
likely to be a problem when the values of the variables in the regression
equation vary substantially in different observations. If the true relationship
is given by y = α + βx + u, it may well be the case that the variations in the
omitted variables and the measurement errors that are jointly responsible for
the disturbance term will be relatively small when y and x are small and large
when they are large, economic variables tending to move in size together.
Here are some of the causes:
1. Following the error-learning models, as people learn, their errors of behavior become smaller over time, so σ² is expected to decrease. For example, the number of typing errors made in a given time period falls as the hours put into typing practice increase.
2. As income grows, people have more discretionary income, and hence σ² is likely to increase with income.
3. As data-collection techniques improve, σ² is likely to decrease.
4. Heteroscedasticity can also arise as a result of the presence of outliers. The inclusion or exclusion of such observations, especially when the sample size is small, can substantially alter the results of regression analysis.
5. Heteroscedasticity also arises when the CLRM (classical linear regression model) assumption that the regression model is correctly specified is violated.
6. Skewness in the distribution of one or more regressors included in the model is another source of heteroscedasticity.
7. Incorrect data transformation or an incorrect functional form (linear versus log-linear model) is also a source of heteroscedasticity.
Consequences of Heteroscedasticity.
The OLS estimators and regression predictions based on them remain unbiased and consistent. However, the OLS estimators are no longer BLUE (Best Linear Unbiased Estimators) because they are no longer efficient, so the regression predictions will be inefficient too. Because the usual estimator of the covariance matrix of the estimated regression coefficients is inconsistent, the tests of hypotheses (t-test, F-test) are no longer valid.
Remedies of Heteroscedasticity.
As we have seen, heteroscedasticity does not destroy the unbiasedness and consistency properties of the OLS estimators, but they are no longer efficient, not even asymptotically (i.e., in large samples). This lack of efficiency makes the usual hypothesis-testing procedure of dubious value; therefore, remedial measures may be called for. There are two approaches to a remedy: when the error variances σᵢ² are known and when they are not known. When σᵢ² is known, the most straightforward method of correcting heteroscedasticity is by means of weighted least squares, for the estimators thus obtained are BLUE. The procedure of transforming the original variables in such a way that the transformed variables satisfy the assumptions of the classical model, and then applying OLS to them, is known as the method of generalized least squares.
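A minimal sketch of the weighted least squares idea under the common textbook assumption that the error variance is proportional to x²; the function name and the assumed variance structure are illustrative choices, not a standard routine:

```python
# Minimal weighted least squares sketch, assuming Var(u) is proportional to x^2.
# Dividing the model y = a + b*x + u through by x gives constant-variance errors,
# after which ordinary least squares can be applied to the transformed data.
import numpy as np

def wls_var_prop_to_x2(x, y):
    """Estimate a and b in y = a + b*x + u assuming Var(u) ∝ x^2."""
    y_star = y / x                 # transformed response: y/x = a*(1/x) + b + u/x
    z = 1.0 / x                    # transformed regressor
    slope, intercept = np.polyfit(z, y_star, 1)
    a_hat, b_hat = slope, intercept   # slope on 1/x estimates a; intercept estimates b
    return a_hat, b_hat
```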
Multicollinearity. In statistics, multicollinearity (also collinearity) is a phenomenon in which
one predictor variable in
a multiple regression model
can be linearly predicted from the others with a substantial degree of
accuracy. In this situation the coefficient
estimates of the multiple regression may change
erratically in response to small changes in the model or the data.
Multicollinearity does not reduce the predictive power or reliability of the
model as a whole, at least within the sample data set; it only affects calculations
regarding individual
predictors. That is, a multivariate regression model
with collinear predictors can indicate how well the entire bundle of predictors
predicts the outcome variable,
but it may not give valid results about any individual predictor, or about
which predictors are redundant with respect to others.
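One common way to quantify this is the variance inflation factor (VIF): each predictor is regressed on the remaining predictors, and VIF_j = 1 / (1 − R_j²). A minimal NumPy sketch (the function name and the rule-of-thumb threshold mentioned in the comment are conventions assumed here, not part of any particular package):

```python
# Minimal VIF sketch. X is assumed to be an (n, k) NumPy array of predictor columns.
import numpy as np

def vif(X):
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])   # add an intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        fitted = A @ coef
        r2 = 1 - np.sum((target - fitted) ** 2) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)   # values well above ~10 are a common warning sign
```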
Causes of multicollinearity include:
· Insufficient data. In some cases, collecting more data can resolve the issue.
· Dummy variables may be incorrectly used. For example, the researcher may fail to exclude one category, or add a dummy variable for every category (e.g., spring, summer, autumn, winter).
· Including a variable in the regression that is actually a combination of two other variables. For example, including “total investment income” when total investment income = income from stocks and bonds + income from savings interest.
· Including two identical (or almost identical) variables. For example, weight in pounds and weight in kilos, or investment income and savings/bond income.
Consequences of multicollinearity.
One consequence of a high degree of multicollinearity is that, even if the matrix XᵀX is invertible, a computer algorithm may be unsuccessful in obtaining an approximate inverse, and if it does obtain one it may be numerically inaccurate. But even in the presence of an accurate XᵀX matrix, the following consequences arise. In the presence of multicollinearity, the estimate of one variable's impact on the dependent variable while controlling for the others tends to be less precise than if predictors were uncorrelated with one another. The usual interpretation of a regression coefficient is that it provides an estimate of the effect of a one-unit change in an independent variable, X1, holding the other variables constant. If X1 is highly correlated with another independent variable, X2, in the given data set, then we have a set of observations for which X1 and X2 have a particular linear stochastic relationship.
We don't have a set of observations for which all
changes in X1 are independent of changes in X2, so we have an imprecise estimate of
the effect of independent changes in X1.
In some sense, the collinear variables contain the same
information about the dependent variable. If nominally "different"
measures actually quantify the same phenomenon then they are redundant.
Alternatively, if the variables are accorded different names and perhaps employ
different numeric measurement scales but are highly correlated with each other,
then they suffer from redundancy.
One of the features of
multicollinearity is that the standard errors of the affected coefficients tend
to be large. In that case, the test of the hypothesis that the coefficient is equal to zero may lead to a failure to reject a false null hypothesis of no effect of the explanatory variable (a type II error).
Another issue with multicollinearity
is that small changes to the input data can lead to large changes in the model,
even resulting in changes of sign of parameter estimates.
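A minimal simulated illustration of that instability, using two nearly identical predictors (all numbers here are made up): a tiny perturbation of y noticeably moves the individual coefficients, while their sum stays roughly stable.

```python
# Minimal sketch of coefficient instability under near-perfect collinearity.
import numpy as np

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)         # almost a copy of x1
y = 1 + x1 + x2 + rng.normal(scale=0.5, size=n)  # true coefficients: 1, 1, 1

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
b_pert = np.linalg.lstsq(X, y + rng.normal(scale=0.05, size=n), rcond=None)[0]
print("original  coefficients:", np.round(b, 2))
print("perturbed coefficients:", np.round(b_pert, 2))
```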
A principal danger of such data
redundancy is that of overfitting in regression analysis models. The best
regression models are those in which the predictor variables each correlate
highly with the dependent (outcome) variable but correlate at most only
minimally with each other. Such a model is often called "low noise"
and will be statistically robust (that is, it will predict reliably across
numerous samples of variable sets drawn from the same statistical population).
So long as the underlying
specification is correct, multicollinearity does not actually bias results; it
just produces large standard errors in the related
independent variables. More importantly, the usual use of regression is to take
coefficients from the model and then apply them to other data. Since
multicollinearity causes imprecise estimates of coefficient values, the
resulting out-of-sample predictions will also be imprecise. And if the pattern
of multicollinearity in the new data differs from that in the data that was fitted,
such extrapolation may introduce large errors in the predictions.
Remedies of Multicollinearity.
1. Make sure you have not fallen into
the dummy variable trap; including a
dummy variable for every category (e.g., summer, autumn, winter, and spring)
and including a constant term in the regression together guarantee perfect
multicollinearity.
2. Try seeing what happens if you use
independent subsets of your data for estimation and apply those estimates to
the whole data set. Theoretically you should obtain somewhat higher variance
from the smaller datasets used for estimation, but the expectation of the
coefficient values should be the same. Naturally, the observed coefficient
values will vary, but look at how much they vary.
3. Leave the model as is, despite
multicollinearity. The presence of multicollinearity doesn't affect the
efficiency of extrapolating the fitted model to new data provided that the predictor
variables follow the same pattern of multicollinearity in the new data as in
the data on which the regression model is based.
4. Drop one of the variables. An
explanatory variable may be dropped to produce a model with significant
coefficients. However, you lose information (because you've dropped a
variable). Omission of a relevant variable results in biased coefficient
estimates for the remaining explanatory variables that are correlated with the
dropped variable.
5. Obtain more data, if possible. This is the preferred solution. More data can produce more precise parameter estimates (with lower standard errors), as seen from the formula for the variance of the estimate of a regression coefficient in terms of the sample size and the degree of multicollinearity (the variance inflation factor), shown below.
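For reference, the standard form of that formula for the j-th OLS slope is written out here in LaTeX:

```latex
% Sampling variance of the j-th slope: it grows with the error variance and the
% VIF, and shrinks as the spread of x_j (and hence the sample size) grows.
\[
  \operatorname{Var}(\hat{\beta}_j)
    = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}
      \cdot \frac{1}{1 - R_j^2},
  \qquad
  \text{VIF}_j = \frac{1}{1 - R_j^2},
\]
% where R_j^2 is the R^2 from regressing the j-th predictor on the other predictors.
```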
Specification errors.
In
the context of a statistical model, specification
error means that at least one of the key features or
assumptions of the model is incorrect. In consequence, estimation of the model
may yield results that are incorrect or misleading. Specification error can
occur with any sort of statistical model, although some models and estimation methods
are much less affected by it than others. Estimation methods that are
unaffected by certain types of specification error are often said to be robust. For example, the sample median is a much
more robust measure of central tendency than the sample mean because it is unaffected by the presence
of extreme observations in the sample.
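A tiny numerical illustration of that robustness claim (the numbers are arbitrary): one extreme observation moves the sample mean a lot but leaves the sample median essentially unchanged.

```python
# Minimal sketch: effect of a single outlier on the mean versus the median.
import numpy as np

data = np.array([4.8, 5.1, 5.0, 4.9, 5.2])
with_outlier = np.append(data, 50.0)

print(f"mean:   {data.mean():.2f} -> {with_outlier.mean():.2f}")
print(f"median: {np.median(data):.2f} -> {np.median(with_outlier):.2f}")
```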
Causes
of specification errors:
Specification error occurs when the
functional form or the choice of independent variables poorly
represent relevant aspects of the true data-generating process. In
particular, bias (the expected
value of the difference of an estimated parameter and
the true underlying value) occurs if an independent variable is correlated with
the errors inherent in the underlying process. There are several different
possible causes of specification error; some are listed below.
· An inappropriate functional form could be employed.
· A variable omitted from the model may have a relationship with both the dependent variable and one or more of the independent variables (causing omitted-variable bias).
· An irrelevant variable may be included in the model (although this does not create bias, it involves overfitting and so can lead to poor predictive performance).
· The dependent variable may be part of a system of simultaneous equations (giving simultaneity bias).
Additionally, measurement errors may affect the
independent variables: while this is not a specification error, it can create
statistical bias.
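A minimal simulation of the omitted-variable case listed above (simulated data; the coefficient values 2 and 3 and the 0.8 correlation are arbitrary choices): z affects y and is correlated with x, so leaving z out biases the estimated coefficient on x, and the bias does not shrink as the sample grows.

```python
# Minimal sketch of omitted-variable bias and its persistence with sample size.
import numpy as np

rng = np.random.default_rng(4)
for n in (100, 10_000):
    x = rng.normal(size=n)
    z = 0.8 * x + rng.normal(size=n)          # z is correlated with x
    y = 1 + 2 * x + 3 * z + rng.normal(size=n)

    full = np.linalg.lstsq(np.column_stack([np.ones(n), x, z]), y, rcond=None)[0]
    short = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0]
    print(f"n={n}: coef on x with z = {full[1]:.2f}, without z = {short[1]:.2f}")
```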
Note that all models will have some
specification error. Indeed, in statistics there is a common aphorism that
"all models are wrong". In the words
of Burnham & Anderson, "Modeling is an art as well as a science
and is directed toward finding a good approximating model ... as the basis for
statistical inference"
Consequences of specification error.
1. If
the omitted variables are correlated with the variables included in the model,
the coefficients of the estimated model are biased.
This
bias does not disappear as the sample size gets larger (i.e., the estimated
coefficients of the misspecified model are also inconsistent).
2. Even
if the incorrectly excluded variables are not correlated with the variables
included in the model, the intercept of the estimated model is biased.
3. The
disturbance variance is incorrectly estimated.
4. The
variances of the estimated coefficients of the misspecified model are biased.
5. In
consequence, the usual confidence intervals and hypothesis-testing procedures
become suspect, leading to misleading conclusions about the statistical
significance of the estimated parameters.
6.
Furthermore, forecasts based on the incorrect model and the forecast confidence
intervals based on it will be unreliable.