Definition, Causes, Consequences and Solutions for Autocorrelation, Heteroscedasticity, Multicollinearity and Specification Error
Autocorrelation
Autocorrelation refers to the degree of correlation between the values of the same variable across different observations in the data.  The concept of autocorrelation is most often discussed in the context of time series data, in which observations occur at different points in time (e.g., air temperature measured on different days of the month). For example, one might expect the air temperature on the 1st day of the month to be more similar to the temperature on the 2nd day than to the temperature on the 31st day.  If the temperature values that occurred closer together in time are, in fact, more similar than the temperature values that occurred farther apart in time, the data are autocorrelated.
However, autocorrelation can also occur in cross-sectional data when the observations are related in some other way.  In a survey, for instance, one might expect people from nearby geographic locations to provide more similar answers to each other than people who are more geographically distant.  Similarly, students from the same class might perform more similarly to each other than students from different classes.  Thus, autocorrelation can occur if observations are dependent in aspects other than time.  Autocorrelation can cause problems in conventional analyses (such as ordinary least squares regression) that assume independence of observations.
In a regression analysis, autocorrelation of the regression residuals can also occur if the model is incorrectly specified.  For example, if you are attempting to model a simple linear relationship but the observed relationship is non-linear (i.e., it follows a curved or U-shaped function), then the residuals will be autocorrelated.
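This kind of residual autocorrelation is easy to see directly. The following sketch is only an illustration with made-up data (it assumes the Python numpy and statsmodels libraries): a straight line is fitted to a U-shaped relationship, and the lag-1 correlation of the residuals and the Durbin-Watson statistic both signal autocorrelation.

```python
# A minimal sketch (hypothetical data): fitting a straight line to a
# U-shaped relationship leaves autocorrelated residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = x**2 + rng.normal(scale=0.5, size=x.size)     # true relationship is quadratic

linear_fit = sm.OLS(y, sm.add_constant(x)).fit()  # misspecified linear model
resid = linear_fit.resid

# Lag-1 autocorrelation of the residuals and the Durbin-Watson statistic;
# DW values far below 2 indicate positive autocorrelation.
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print("lag-1 residual correlation:", round(lag1, 3))
print("Durbin-Watson:", round(durbin_watson(resid), 3))
```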
Autocorrelation has several causes.
One of the causes of autocorrelation is omitted variables. When an important independent variable is omitted from a model, its effect on the dependent variable becomes part of the error term. Hence, if the omitted variable is positively or negatively correlated with the dependent variable, it is likely to produce error terms that are positively or negatively correlated. Although there are a number of specification tests for detecting omitted variables in a regression analysis, one rarely knows the best test to use. One such study uses a bootstrapping experiment, together with the properties that estimators should possess if they are to be accepted as good and satisfactory estimates of the population parameters. The models investigated in the bootstrapping experiment consist of two autocorrelated models with autocorrelation levels ρ = 0.5 and ρ = 0.9. A bootstrap simulation approach was used to generate data for each model at sample sizes (n) of 20, 30, 50 and 80, each with 100 replications (r). For the models considered, the experiment reveals that the estimated β's were seriously affected by autocorrelation, which may be due to omitted variables, as the autocorrelation level varies across the models (i.e., it produces biased and inefficient estimators).
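A small simulation in this spirit can be sketched as follows. The code below is an illustrative sketch rather than the original experiment (it assumes numpy and statsmodels and invents its own data-generating process): it draws AR(1) errors with ρ = 0.5 and ρ = 0.9 at several sample sizes and reports the mean and spread of the OLS slope estimates across 100 replications.

```python
# Illustrative sketch of the kind of experiment described above:
# OLS slope estimates under AR(1) errors with rho = 0.5 and 0.9.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
true_beta = 2.0

def ar1_errors(n, rho):
    """Generate AR(1) disturbances u_t = rho * u_{t-1} + e_t."""
    e = rng.normal(size=n)
    u = np.zeros(n)
    u[0] = e[0]
    for t in range(1, n):
        u[t] = rho * u[t - 1] + e[t]
    return u

for rho in (0.5, 0.9):
    for n in (20, 30, 50, 80):
        betas = []
        for _ in range(100):                      # 100 replications
            x = rng.normal(size=n)
            y = 1.0 + true_beta * x + ar1_errors(n, rho)
            betas.append(sm.OLS(y, sm.add_constant(x)).fit().params[1])
        print(f"rho={rho}, n={n}: mean beta={np.mean(betas):.3f}, "
              f"sd={np.std(betas):.3f}")
```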
The second cause of autocorrelation is misspecification. Misspecifying the form of the relationship can also introduce autocorrelation into the data: it is usually assumed that the relationship between the study variable and the explanatory variables is linear. (The difference between the observed and true values of a variable is called measurement error, or errors in variables.) In economic applications, the presence of autocorrelation in the residuals is often the result of a misspecification of the underlying model. One important form of misspecification occurs when the true model is dynamic and the investigator wrongly assumes that it is static. The estimated residuals from such a regression equation are then likely to show some degree of autocorrelation. In such cases, the Durbin–Watson statistic gives ample warning that something is wrong; but although the signal is right, the reason for it might have nothing to do with autocorrelation. It might simply mean that some dynamic effect has been neglected.

The third cause is systematic error in measurement. As stated by Ku (1969), “systematic error is a fixed deviation that is inherent in each and every measurement.” Hence, the measurements can be corrected for the systematic error if its magnitude and direction are known. Complex devices make it difficult to predict their accuracy; leaks and variations of temperature and pressure also influence accuracy. Large-volume flow-type tests suffer from sudden variation of species composition due to their longer residence times. All of these are sources of error. The mechanical design parameters and dimensions of experimental systems strongly affect the accuracy of measurements, so careful analysis of their design and innovative improvements are vital for more accurate measurement.
Systematic error can also be caused by an imperfection in the equipment being used or by mistakes the individual makes while taking the measurement. A balance that is incorrectly calibrated would result in a systematic error, as would consistently reading the buret wrong.

Consequences of Autocorrelation
Presence of autocorrelation in the errors has several effects on the ordinary least-squares regression procedures. These are summarized as follows:
·         Ordinary least-squares regression coefficients are still unbiased.
·         OLS regression coefficients are no longer efficient, i.e., they are no longer minimum variance estimates. We say that these estimates are inefficient.
·         The residual mean square MSres may seriously underestimate σ². Consequently, the standard errors of the regression coefficients may be too small. Thus, confidence intervals are shorter than they really should be, and tests of hypotheses on individual regression coefficients may indicate that one or more regressors contribute significantly to the model when they really do not (the sketch after this list illustrates this). Generally, underestimating σ² gives the researcher a false impression of accuracy.
·         The confidence intervals and tests of hypotheses based on the t and F distributions are no longer appropriate.
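To illustrate the underestimated standard errors, the hedged sketch below (simulated data; assumes numpy and statsmodels) compares the usual OLS standard error of a slope with a heteroskedasticity-and-autocorrelation-consistent (Newey–West) standard error when the disturbances follow an AR(1) process.

```python
# Sketch: conventional OLS standard errors vs. Newey-West (HAC) standard
# errors when the disturbances are positively autocorrelated (AR(1)).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, rho = 200, 0.8

# AR(1) disturbances and a slowly moving regressor
e = rng.normal(size=n)
u = np.zeros(n)
u[0] = e[0]
for t in range(1, n):
    u[t] = rho * u[t - 1] + e[t]
x = np.cumsum(rng.normal(size=n)) / 10      # trending regressor
y = 1.0 + 0.5 * x + u

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()                             # usual OLS errors
hac_fit = sm.OLS(y, X).fit(cov_type="HAC",
                           cov_kwds={"maxlags": 8})      # Newey-West errors

print("OLS slope SE :", round(ols_fit.bse[1], 4))
print("HAC slope SE :", round(hac_fit.bse[1], 4))        # typically larger
```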
Remedies of Autocorrelation.
When temporal autocorrelation is determined to be present in the dataset, one of the first remedial measures should be to investigate the omission of one or more key explanatory variables, especially variables that are related to time. If such a variable does not help to reduce or eliminate temporal autocorrelation of the error terms, then a differencing procedure can be applied to all temporal independent variables in the dataset to convert them into their differenced values, and the regression model rerun with the intercept deleted from the model. If this remedy does not eliminate temporal autocorrelation, then certain transformations of all variables can be performed for the AR term. These transformations carry out repeated iterative steps to minimize the sum of squared errors in the regression model; examples are the Cochrane–Orcutt procedure and the Hildreth–Lu procedure. More advanced methods, such as Fourier series analysis and spectral analysis, can also be used for large datasets.
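As a hedged illustration of such an iterative transformation, statsmodels' GLSAR class performs a feasible GLS iteration in the spirit of the Cochrane–Orcutt procedure. The sketch below reuses simulated AR(1) data and is not a definitive implementation of either named procedure.

```python
# Sketch: iterative feasible GLS (Cochrane-Orcutt-style) with statsmodels GLSAR.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, rho = 100, 0.7
e = rng.normal(size=n)
u = np.zeros(n)
u[0] = e[0]
for t in range(1, n):
    u[t] = rho * u[t - 1] + e[t]            # AR(1) disturbances
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + u

X = sm.add_constant(x)
model = sm.GLSAR(y, X, rho=1)               # AR(1) error structure
result = model.iterative_fit(maxiter=10)    # re-estimate rho and beta iteratively
print("estimated rho :", np.round(model.rho, 3))
print("coefficients  :", np.round(result.params, 3))
```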
Heteroscedasticity (also spelled heteroskedasticity) refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it. In statistics, heteroskedasticity occurs when the variance of the error terms, observed over a specific period of time or across observations, is non-constant. With heteroskedasticity, the tell-tale sign upon visual inspection of the residual errors is that they tend to fan out over time.
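The fanning-out pattern can also be tested formally. The sketch below (simulated data in which the error spread grows with the regressor; assumes numpy and statsmodels) applies the Breusch–Pagan test to the OLS residuals.

```python
# Sketch: simulated heteroskedastic errors and a Breusch-Pagan test.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(1, 10, size=n)
u = rng.normal(scale=0.3 * x)               # error spread grows with x
y = 2.0 + 1.5 * x + u

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan LM p-value:", round(lm_pvalue, 4))  # small => heteroskedasticity
```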

Causes of Heteroskedasticity:

Heteroscedasticity is likely to be a problem when the values of the variables in the regression equation vary substantially across observations. If the true relationship is given by y = α + βx + u, it may well be the case that the variations in the omitted variables and the measurement errors that jointly make up the disturbance term are relatively small when y and x are small and large when they are large, since economic variables tend to move in size together.
1.      Following error-learning models, as people learn, their errors of behaviour become smaller over time, so σ² is expected to decrease. For example, the number of typing errors made in a given time period falls as the hours put into typing practice increase.
2.      As income grows, people have more discretionary income and hence σ² is likely to increase with income.
3.      As data-collecting techniques improve, σ² is likely to decrease.
4.      Heteroscedasticity can also arise as a result of the presence of outliers. The inclusion or exclusion of such observations, especially when the sample size is small, can substantially alter the results of regression analysis.
5.      Heteroscedasticity can also arise when the assumption of the CLRM (classical linear regression model) that the regression model is correctly specified is violated.
6.      Skewness in the distribution of one or more regressors included in the model is another source of heteroscedasticity.
7.      Incorrect data transformation and incorrect functional form (linear versus log-linear model) are also sources of heteroscedasticity.

Consequences of Heteroscedasticity.

The OLS estimators, and regression predictions based on them, remain unbiased and consistent. However, the OLS estimators are no longer BLUE (Best Linear Unbiased Estimators) because they are no longer efficient, so the regression predictions will be inefficient too. Because the covariance matrix of the estimated regression coefficients is estimated inconsistently, the usual tests of hypotheses (t-test, F-test) are no longer valid.
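One common response to the invalid standard errors is heteroskedasticity-robust (White-type) covariance estimation. The hedged sketch below reuses the simulated data pattern above and statsmodels' HC-type covariance options to compare the usual and robust standard errors.

```python
# Sketch: ordinary vs. heteroskedasticity-robust (White, HC1) standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(1, 10, size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.3 * x)   # error variance grows with x

X = sm.add_constant(x)
usual = sm.OLS(y, X).fit()                      # assumes constant error variance
robust = sm.OLS(y, X).fit(cov_type="HC1")       # White-type robust covariance

print("usual slope SE :", round(usual.bse[1], 4))
print("robust slope SE:", round(robust.bse[1], 4))
```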

Remedies of Heteroscedasticity.
As we have seen, heteroscedasticity does not destroy the unbiasedness and consistency properties of the OLS estimators, but they are no longer efficient, not even asymptotically (i.e., in large samples). This lack of efficiency makes the usual hypothesis-testing procedures of dubious value, so remedial measures may be called for. There are two approaches to remedies: when σ² is known and when σ² is not known. When σ² is known, the most straightforward method of correcting heteroscedasticity is weighted least squares, for the estimators thus obtained are BLUE. The procedure of transforming the original variables in such a way that the transformed variables satisfy the assumptions of the classical model, and then applying OLS to them, is known as the method of generalized least squares.
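A hedged sketch of weighted least squares follows, under the illustrative assumption (made here for the example, not taken from the text above) that the error variance is proportional to x², so that the weights are 1/x²; it uses statsmodels' WLS.

```python
# Sketch: weighted least squares when the error variance is (assumed)
# proportional to x**2, i.e. Var(u_i) = sigma^2 * x_i**2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(1, 10, size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.3 * x)

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()   # weights = inverse variance share

print("OLS slope, SE:", round(ols_fit.params[1], 3), round(ols_fit.bse[1], 4))
print("WLS slope, SE:", round(wls_fit.params[1], 3), round(wls_fit.bse[1], 4))
```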
Multicollinearity
In statistics, multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multivariate regression model with collinear predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.
Causes of multicollinearity include:
·         Insufficient data. In some cases, collecting more data can resolve the issue.
·         Dummy variables may be incorrectly used. For example, the researcher may fail to exclude one category, or add a dummy variable for every category (e.g. spring, summer, autumn, winter).
·         Including a variable in the regression that is actually a combination of two other variables. For example, including “total investment income” when total investment income = income from stocks and bonds + income from savings interest.
·         Including two identical (or almost identical) variables. For example, weight in pounds and weight in kilos, or investment income and savings/bond income.
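The degree to which one predictor can be predicted from the others is commonly summarized by the variance inflation factor, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The sketch below uses made-up data in which a "total" column is essentially the sum of two other columns, echoing the investment-income example above; it assumes numpy, pandas and statsmodels.

```python
# Sketch: variance inflation factors for a deliberately collinear design.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 200
stocks = rng.normal(50, 10, size=n)
savings = rng.normal(20, 5, size=n)
total = stocks + savings + rng.normal(scale=0.1, size=n)  # near-exact combination

X = pd.DataFrame({"stocks": stocks, "savings": savings, "total": total})
X = sm.add_constant(X)

# VIF_j = 1 / (1 - R_j^2); values far above 10 signal severe collinearity.
for j, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, j), 1))
```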


Consequences of multicollinearity.

One consequence of a high degree of multicollinearity is that, even if the matrix XᵀX is invertible, a computer algorithm may be unsuccessful in obtaining an approximate inverse, and if it does obtain one it may be numerically inaccurate. But even when XᵀX is computed accurately, the following consequences arise. In the presence of multicollinearity, the estimate of one variable's impact on the dependent variable Y while controlling for the others tends to be less precise than if the predictors were uncorrelated with one another. The usual interpretation of a regression coefficient is that it provides an estimate of the effect of a one-unit change in an independent variable, X1, holding the other variables constant. If X1 is highly correlated with another independent variable, X2, in the given data set, then we have a set of observations for which X1 and X2 have a particular linear stochastic relationship.
We do not have a set of observations for which all changes in X1 are independent of changes in X2, so we have an imprecise estimate of the effect of independent changes in X1. In some sense, the collinear variables contain the same information about the dependent variable. If nominally "different" measures actually quantify the same phenomenon then they are redundant. Alternatively, if the variables are given different names and perhaps use different numeric measurement scales but are highly correlated with each other, then they suffer from redundancy.
One of the features of multicollinearity is that the standard errors of the affected coefficients tend to be large. In that case, the test of the hypothesis that a coefficient is equal to zero may lead to a failure to reject a false null hypothesis of no effect of the explanatory variable, a type II error.
Another issue with multicollinearity is that small changes to the input data can lead to large changes in the model, even resulting in changes of sign of the parameter estimates.
A principal danger of such data redundancy is that of overfitting in regression analysis models. The best regression models are those in which the predictor variables each correlate highly with the dependent (outcome) variable but correlate at most only minimally with each other. Such a model is often called "low noise" and will be statistically robust (that is, it will predict reliably across numerous samples of variable sets drawn from the same statistical population).
So long as the underlying specification is correct, multicollinearity does not actually bias results; it just produces large standard errors in the related independent variables. More importantly, the usual use of regression is to take coefficients from the model and then apply them to other data. Since multicollinearity causes imprecise estimates of coefficient values, the resulting out-of-sample predictions will also be imprecise. And if the pattern of multicollinearity in the new data differs from that in the data that was fitted, such extrapolation may introduce large errors in the predictions.
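A hedged sketch of these effects (illustrative data; assumes numpy and statsmodels): two nearly identical regressors produce inflated standard errors on the individual coefficients, and a modest perturbation of the data can shift those coefficients noticeably even though prediction is barely affected.

```python
# Sketch: inflated standard errors and coefficient instability under collinearity.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)         # almost identical to x1
y = 1.0 + 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
print("coefficients:", np.round(fit.params, 2))
print("std errors  :", np.round(fit.bse, 2))      # SEs on x1, x2 are very large

# A modest perturbation of y can shift the individual coefficients by amounts
# comparable to the coefficients themselves, while the fitted values change little.
fit2 = sm.OLS(y + rng.normal(scale=0.2, size=n), X).fit()
print("perturbed   :", np.round(fit2.params, 2))
```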
Remedies of Multicollinearity.
1.     Make sure you have not fallen into the dummy variable trap; including a dummy variable for every category (e.g., summer, autumn, winter, and spring) and including a constant term in the regression together guarantee perfect multicollinearity.
2.     Try seeing what happens if you use independent subsets of your data for estimation and apply those estimates to the whole data set. Theoretically you should obtain somewhat higher variance from the smaller datasets used for estimation, but the expectation of the coefficient values should be the same. Naturally, the observed coefficient values will vary, but look at how much they vary.
3.     Leave the model as is, despite multicollinearity. The presence of multicollinearity doesn't affect the efficiency of extrapolating the fitted model to new data provided that the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based.
4.     Drop one of the variables. An explanatory variable may be dropped to produce a model with significant coefficients. However, you lose information (because you've dropped a variable). Omission of a relevant variable results in biased coefficient estimates for the remaining explanatory variables that are correlated with the dropped variable.
5.     Obtain more data, if possible. This is the preferred solution. More data can produce more precise parameter estimates (with lower standard errors), as seen from the formula in variance inflation factor for the variance of the estimate of a regression coefficient in terms of the sample size and the degree of multicollinearity.
Specification errors.
In the context of a statistical model, specification error means that at least one of the key features or assumptions of the model is incorrect. In consequence, estimation of the model may yield results that are incorrect or misleading. Specification error can occur with any sort of statistical model, although some models and estimation methods are much less affected by it than others. Estimation methods that are unaffected by certain types of specification error are often said to be robust. For example, the sample median is a much more robust measure of central tendency than the sample mean because it is unaffected by the presence of extreme observations in the sample.
Causes of specification errors:
Specification error occurs when the functional form or the choice of independent variables poorly represent relevant aspects of the true data-generating process. In particular, bias (the expected value of the difference of an estimated parameter and the true underlying value) occurs if an independent variable is correlated with the errors inherent in the underlying process. There are several different possible causes of specification error; some are listed below.
·         An inappropriate functional form could be employed (a diagnostic check for this is sketched after this list).
·         A variable omitted from the model may have a relationship with both the dependent variable and one or more of the independent variables (causing omitted-variable bias).
·         An irrelevant variable may be included in the model (although this does not create bias, it involves overfitting and so can lead to poor predictive performance).
·         The dependent variable may be part of a system of simultaneous equations (giving simultaneity bias).
Additionally, measurement errors may affect the independent variables: while this is not a specification error, it can create statistical bias.
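One common check for an inappropriate functional form is Ramsey's RESET test, which asks whether powers of the fitted values add explanatory power. The hedged sketch below uses simulated data in which the true relationship is quadratic but a linear model is fitted; it assumes a reasonably recent statsmodels version, which provides linear_reset.

```python
# Sketch: Ramsey RESET test for functional-form misspecification.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

rng = np.random.default_rng(9)
n = 200
x = rng.uniform(0, 5, size=n)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(size=n)    # true model is quadratic

linear_fit = sm.OLS(y, sm.add_constant(x)).fit()       # misspecified linear model
reset = linear_reset(linear_fit, power=2, use_f=True)  # add squared fitted values
print("RESET p-value:", round(float(reset.pvalue), 4)) # small => misspecification
```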
Note that all models will have some specification error. Indeed, in statistics there is a common aphorism that "all models are wrong". In the words of Burnham & Anderson, "Modeling is an art as well as a science and is directed toward finding a good approximating model ... as the basis for statistical inference"
Consequences of specification error.
1. If the omitted variables are correlated with the variables included in the model, the coefficients of the estimated model are biased.
This bias does not disappear as the sample size gets larger (i.e., the estimated coefficients of the misspecified model are also inconsistent).
2. Even if the incorrectly excluded variables are not correlated with the variables included in the model, the intercept of the estimated model is biased.
3. The disturbance variance is incorrectly estimated.
4. The variances of the estimated coefficients of the misspecified model are biased.
5. In consequence, the usual confidence intervals and hypothesis-testing procedures become suspect, leading to misleading conclusions about the statistical significance of the estimated parameters.
6. Furthermore, forecasts based on the incorrect model and the forecast confidence intervals based on it will be unreliable.
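A hedged simulation of point 1 (illustrative values; assumes numpy and statsmodels): when a relevant regressor x2 that is correlated with x1 is omitted, the estimated coefficient on x1 is biased, and the bias does not shrink as the sample size grows.

```python
# Sketch: omitted-variable bias. True model uses x1 and x2; x2 is omitted.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)

def slope_on_x1(n):
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)     # x2 correlated with x1
    y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)
    short_fit = sm.OLS(y, sm.add_constant(x1)).fit()  # x2 wrongly omitted
    return short_fit.params[1]

for n in (100, 1000, 10000):
    print(f"n={n:>6}: estimated coefficient on x1 = {slope_on_x1(n):.3f} "
          "(true value 2.0)")
```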






















