The Heteroskedasticity Tests Implementation for Linear Regression Model Using MATLAB

The article discusses the problem of heteroskedasticity, which can arise when estimating high-dimensional econometric models, and ways to overcome it. Heteroskedasticity distorts the estimate of the true standard deviation of the prediction errors, which can either widen or narrow the confidence interval. We present the principles of implementing the most common tests used to detect heteroskedasticity in linear regression models and compare their sensitivity. One of the achievements of this paper is that real empirical data are used to test for heteroskedasticity. The aim of the article is to propose a MATLAB implementation of a number of tests used for checking heteroskedasticity in multifactor regression models. For this purpose we modified several open algorithms that implement known heteroskedasticity tests. Experimental studies to validate the proposed programs were carried out for various linear regression models. The models used for comparison are models of the Department of Higher Mathematics and Mathematical Methods in Economy of Simon Kuznets Kharkiv National University of Economics and econometric models recently published in leading journals.


Introduction
In econometrics, a linear regression model is often used to describe different processes and phenomena. Using matrix notation, the linear regression model can be written as:
$$Y = XB + \varepsilon,$$
where $Y$ and $\varepsilon$ are $n \times 1$ matrices, $X$ is $n \times (m+1)$, and $B$ is $(m+1) \times 1$; $n$ is the number of measurements (sample size); $m$ is the number of independent variables in the regression model. An error term is introduced in a regression model because the model does not fully represent the actual relationship between the variables of the model. As a result of this incomplete relationship, there are differences between the observed responses (values of the variable being predicted) in the given dataset and those predicted by a linear function of a set of explanatory variables. The error term is the amount by which the equation may differ from the measurements. In other words, it is 'white noise'.
As a rule, a linear regression model is built by the method of ordinary least squares (OLS). This method for estimating the unknown parameters is based on minimizing the sum of the squares of the model errors. The estimators of the model parameters determined by OLS are known as best linear unbiased estimators (BLUE). The variances of the model parameters are determined by:
$$S_{b_j}^2 = \hat{\sigma}_{\varepsilon}^2 \left[(X^T X)^{-1}\right]_{jj},$$
where $\hat{\sigma}_{\varepsilon}^2$ is the estimate of the error variance. The application of OLS requires a number of conditions to be met [1-3]. Only if these conditions are met will the estimates calculated by such a model be unbiased, efficient and consistent. These conditions are formulated in the Gauss-Markov theorem.
According to this theorem, there are four principal assumptions that justify the use of linear regression models for research and prediction. One of them is the homoskedasticity (constant variance) of the errors with respect to any independent variable.
Homoskedasticity is the assumption that the errors have a constant variance, $\operatorname{var}(\varepsilon) = \text{const}$, that does not depend on the explanatory variables. The error $\varepsilon$ is a random variable distributed according to the normal law: $\varepsilon \sim N(0, \sigma^2)$, where the mathematical expectation of the error term is zero and the variance $\sigma^2$ is constant. Failure to comply with this requirement leads to biased standard errors of the estimates obtained using such a regression model. Thus, the authors of [4] indicate that estimation uncertainty may increase dramatically in the presence of conditional heteroskedasticity.
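The OLS computation described above can be sketched in a few lines of MATLAB. This is a minimal illustration on made-up data, not the authors' program; the data values and all variable names (`x1`, `x2`, `seB`, and so on) are assumptions introduced for the example.

```matlab
% Minimal OLS sketch on made-up data (illustrative values, not from the article).
x1 = [1; 2; 3; 4; 5; 6; 7; 8];
x2 = [2; 1; 4; 3; 6; 5; 8; 7];
y  = [3.1; 3.9; 7.2; 7.8; 11.1; 11.9; 15.2; 15.8];

n = numel(y);                      % sample size
X = [ones(n, 1), x1, x2];          % design matrix, n-by-(m+1); first column is the intercept
m = size(X, 2) - 1;                % number of independent variables

B   = X \ y;                       % OLS estimates, equivalent to (X'X)^(-1) X'y
res = y - X * B;                   % residuals (estimated errors)
s2  = (res' * res) / (n - m - 1);  % unbiased estimate of the error variance
covB = s2 * inv(X' * X);           % covariance of the estimates under homoskedasticity
seB  = sqrt(diag(covB));           % standard errors S_bj
tB   = B ./ seB;                   % t-statistics of the coefficients

disp([B, seB, tB]);                % columns: estimate, standard error, t-statistic
```

The backslash operator is used instead of forming the inverse explicitly for the estimates; the inverse appears only where the covariance matrix itself is needed.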
The requirement of homoskedasticity also exists in the construction of the econometric model using the maximum likelihood method [5-7].
When the scatter of the errors varies with the value of one or more of the independent variables, the error terms are heteroskedastic. In this case the distribution law of the errors remains normal with a mathematical expectation equal to zero, but the variance of the errors is a function of the values of the independent variables: $\varepsilon \sim N(0, \sigma^2(x))$, where $\sigma^2(x)$ is a function that describes the change in the variance of the errors as a function of the values of the independent variables.
Heteroskedasticity makes it difficult to gauge the true standard deviation of the forecast errors. The OLS estimates are no longer BLUE. Thus, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. In particular, heteroskedasticity does not allow us to use equation 3 for the computation of $S_{b_j}$, since it assumes a uniform dispersion of the errors. Under heteroskedasticity, the variance of the OLS estimator $\hat{B}$ of $B$ is
$$\operatorname{var}(\hat{B}) = \sigma^2 (X^T X)^{-1} X^T \Omega X (X^T X)^{-1}, \qquad (4)$$
where $\sigma^2 \Omega$ is the covariance matrix of the errors, whose diagonal elements are the error variances. Under homoskedasticity, $\Omega = I$. Equation 4 is correct if there is no autocorrelation.
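To make the role of $\Omega$ concrete, the following MATLAB sketch compares the classical covariance of the OLS estimates with the general formula. It first checks that the general formula with $\Omega = I$ reproduces the classical one. It then replaces the unknown $\sigma^2\Omega$ by the diagonal matrix of squared residuals, which is White's heteroskedasticity-consistent (HC0) estimator. That substitution is an assumption made for illustration and is not taken from the article, and the data are made up.

```matlab
% Sketch: covariance of OLS estimates with and without the homoskedasticity assumption.
% Made-up data whose scatter grows with x (illustrative values only).
x = (1:10)';
y = [2.1; 3.8; 6.3; 7.5; 10.9; 11.2; 15.4; 14.1; 20.3; 17.6];

n   = numel(y);
X   = [ones(n, 1), x];
B   = X \ y;                              % OLS estimates
res = y - X * B;                          % residuals
s2  = (res' * res) / (n - size(X, 2));    % estimate of the error variance

XtXi = inv(X' * X);

% Classical formula, valid when Omega = I
covClassic = s2 * XtXi;

% General formula with Omega = I: must coincide with the classical one
Omega    = eye(n);
covCheck = s2 * XtXi * (X' * Omega * X) * XtXi;

% Assumption: sigma^2 * Omega is replaced by diag(res.^2) (White's HC0 estimator)
covWhite = XtXi * (X' * diag(res .^ 2) * X) * XtXi;

disp(sqrt(diag(covClassic))');            % classical standard errors
disp(sqrt(diag(covWhite))');              % heteroskedasticity-robust standard errors
```

A noticeable gap between the two sets of standard errors is an informal hint that the classical formula may be unreliable for the data at hand.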
For these reasons, all the conclusions obtained on the basis of the corresponding $t$-statistics and $F$-statistics, as well as interval estimates, will be unreliable.
A unified approach to the detection of heteroskedasticity is lacking. To solve this problem, a large number of different tests and criteria have been developed: the Spearman rank correlation test, the Park test, the Glejser test, the Goldfeld-Quandt test, the Breusch-Pagan test, Levene's test, the White test, and so on.
Applying any of the above tests by manual calculation is very difficult, and for a large set of initial data it is practically impossible.
When working with economic data, a researcher can face two main problems with the available software. First, the software packages are quite expensive, and the price of the product may be an insurmountable barrier for a young researcher. Second, the developer never provides the source code, considering it unnecessary for an ordinary user. Therefore, the built-in algorithms for detecting and eliminating heteroskedasticity cannot be modified.
Another drawback of these software products is their outdated conceptual approach to econometric methods, which are constantly being improved.
For example, the software products SPP and MICROSTAT calculate the coefficient of multiple correlation as the square root of the coefficient of determination. STATGRAPHICS calculates it as the square root of the adjusted coefficient of determination [13]. In theory, however, the coefficient of multiple correlation is estimated using elements of the correlation matrix [2].
Another important aspect that should be taken into account is the existence of different algorithms to identify heteroskedasticity and the specific problem of division by zero [14].
The ideal option would be to create one's own software product that takes the research tasks into account.
However, to write such a program, the economist has to be an expert in algorithmic programming, which is rarely the case.
In this article, we carry out a comparative analysis of the tests most often used to detect heteroskedasticity [1, 2, 14] and give their source code. Access to the program code allows the researcher to modify the program in accordance with the objectives of the study.

Analysis of the literature and formulation of the problem
Before starting the construction of the regression model, it is necessary to verify whether the conditions of the Gauss-Markov theorem are fulfilled.
Informatica 42 (2018) 545-553
One of the main methods of preliminary research on heteroskedasticity is a visual analysis of the residual plot. On such plots, the scatter of the points can vary depending on the values of the independent variables [14,15].
To detect heteroskedasticity, quantitative tests such as the White test, the Goldfeld-Quandt test, the Breusch-Pagan test, the Park test, the Glejser test and the Spearman test are used [15-17]. Unlike the other tests, the Spearman rank correlation test is a nonparametric statistical test for the heteroskedasticity of random errors in the econometric model. The test algorithm is described in detail in [18,19]. However, it is still not implemented in the software products that are used to build multiple regression models [20-27].
In this paper we examined the software packages most commonly used in economic activity, which contain tests for heteroskedasticity [15,28]. Indeed, these software products do not contain the Spearman rank correlation test.
The most widely used test for detecting heteroskedasticity is the Park test [20,21]. However, the Park test assumes that the variance of the model residuals is described by a functional dependence of a certain type. It was noted in [24,25] that this can lead to unfounded conclusions. Therefore, the authors propose to consider the Park test together with other tests.
As far as we know, a software implementation of the Park test for multifactor models also does not exist [28].
Another test that the authors of the article implemented in the MATLAB environment is the Goldfeld-Quandt test. This test for heteroskedasticity of random errors is used when there is reason to believe that the standard deviation of the errors is proportional to some variable.
The test statistic follows the Fisher ($F$) distribution [18,27]. The Goldfeld-Quandt test can also be used if intergroup heteroskedasticity is assumed, when the variance of the errors takes, for example, only two possible values. In this case, a dedicated software implementation of the test is needed, since commercial application software has not taken this possibility into account [25,28].
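As a hedged illustration of how the Goldfeld-Quandt test is commonly carried out (it is not the authors' Gold_Quan.m script), the MATLAB sketch below orders the observations by the suspected variable, omits some central observations, fits OLS separately to the two outer subsamples and forms the ratio of their residual variances. The data, the number of omitted observations `c` and all variable names are assumptions. The p-value is obtained from the regularized incomplete beta function `betainc`, so no toolbox is required.

```matlab
% Goldfeld-Quandt sketch on made-up data (illustrative, not the article's Gold_Quan.m).
n = 20;
x = (1:n)';
y = 2 + 0.5 * x + 0.1 * x .* cos(2 * x);   % deterministic "noise" whose spread grows with x

[xs, idx] = sort(x);           % order the observations by the suspected variable
ys = y(idx);

c  = 4;                        % number of central observations to omit (assumption)
n1 = (n - c) / 2;              % size of each subsample
k  = 2;                        % number of estimated parameters (intercept and slope)

X1 = [ones(n1, 1), xs(1:n1)];           y1 = ys(1:n1);
X2 = [ones(n1, 1), xs(end-n1+1:end)];   y2 = ys(end-n1+1:end);

r1 = y1 - X1 * (X1 \ y1);      % residuals of the low-x subsample
r2 = y2 - X2 * (X2 \ y2);      % residuals of the high-x subsample
RSS1 = r1' * r1;
RSS2 = r2' * r2;

df = n1 - k;                   % degrees of freedom of each subsample
F  = (RSS2 / df) / (RSS1 / df);         % Goldfeld-Quandt statistic

% Upper-tail p-value of F(df, df) via the regularized incomplete beta function
pValue = 1 - betainc(df * F / (df * F + df), df / 2, df / 2);

fprintf('F = %.3f, p = %.4f\n', F, pValue);
```

A small p-value indicates that the error variance in the high-x subsample is significantly larger than in the low-x subsample.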
In scientific articles on the problem of detecting heteroskedasticity, the Breusch-Pagan test is often considered [10,29]. We have also studied this test, but it is beyond the scope of this article.
Analysis of the literature shows that all tests for detecting heteroskedasticity are difficult to apply manually and require the development of special software. In turn, econometric software does not contain built-in functions for heteroskedasticity testing with open source code.
That is why the authors of this article attempted to implement the above tests for heteroskedasticity in the construction of multifactor econometric models in the MATLAB software environment.
It should be noted that MATLAB does not contain a ready-made software implementation for verifying homoskedasticity. We chose it as the programming environment. Other programming environments can also be used for this purpose, for example, the free software environment R.
The authors chose MATLAB for the following reasons. First, MATLAB is used as a high-level programming language for writing scripts (Spearman.m, Parks.m and Gold_Quan.m). Second, MATLAB includes built-in functions for constructing regression models (Econometrics Toolbox), which spared the authors from programming the standard functions of regression analysis. Third, the authors worked with data structures based on matrices.

Aims and objectives of the study
The purpose of the article is to present functions to check for heteroskedasticity in multifactor regression models. The implementation is made in MATLAB.
To achieve this purpose, it is necessary to solve a number of problems. Namely:
• writing the program code in the MATLAB programming environment;
• planning and execution of computer calculations;
• completion of programs;
• analysis and interpretation of results;
• comparison with the results of calculations using software products of leading companies.

Spearman's rank correlation test for multiple regression models
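The use of Spearman's test assumes that the variance of the model errors will increase (or decrease) with increasing values of the independent variable. This means that the absolute values of the errors $|\varepsilon_i|$ and the values of the independent variable $x_i$ will be correlated.
To check whether heteroskedasticity is statistically significant, Spearman's test involves the following stages:
1) Estimation of the parameters of the econometric model by OLS: $\hat{y}_i = b_0 + b_1 x_{i1} + \dots + b_m x_{im}$, where $\hat{y}_i$ is the predicted response in accordance with the model when the independent variables are $(x_{i1}; x_{i2}; \dots; x_{im})$;
2) Calculation of the model errors as the difference between the empirical and the calculated value of the dependent variable: $\varepsilon_i = y_i - \hat{y}_i$, where $y_i$ is the value of the dependent variable in the $i$th experiment;
3) Ranking of the values of the independent variable $x_i$ and of the absolute values of the errors $|\varepsilon_i|$;
4) Calculation of the Spearman rank correlation coefficient: $r_{x\varepsilon} = 1 - \dfrac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the two rankings;
5) Testing of the significance of $r_{x\varepsilon}$ using the $t$-statistic: $t = \dfrac{r_{x\varepsilon} \sqrt{n-2}}{\sqrt{1 - r_{x\varepsilon}^2}}$. The critical value of the Student distribution with $n - 2$ degrees of freedom at the significance level $\alpha$ is found. Then the calculated value is compared with the critical one.
If the absolute value of the $t$-statistic is greater than the critical value, heteroskedasticity is statistically significant. Here $\alpha$ is the significance level chosen to test the null hypothesis $H_0: \rho_{x\varepsilon} = 0$. Otherwise, the null hypothesis is not rejected.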
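The stages of the Spearman test can be sketched in MATLAB as follows. This is an illustrative script on made-up data, not the authors' Spearman.m. The ranks are computed with a double sort, which assumes there are no tied values, and the two-sided p-value of the $t$-statistic is computed with the base function `betainc` rather than a toolbox function. All variable names are assumptions.

```matlab
% Spearman rank correlation test sketch (illustrative data; not the article's Spearman.m).
n  = 12;
x1 = (1:n)';
x2 = [5; 3; 8; 1; 12; 7; 2; 10; 4; 11; 6; 9];
y  = 1 + 2 * x1 + 0.5 * x2 + 0.2 * x1 .* cos(2 * x1);   % spread grows with x1

Xv = [x1, x2];                        % explanatory variables
X  = [ones(n, 1), Xv];
absRes = abs(y - X * (X \ y));        % stages 1-2: OLS fit and absolute errors

% Stage 3: ranks via double sort (assumes no tied values)
[~, ord] = sort(absRes);
rankRes = zeros(n, 1);
rankRes(ord) = (1:n)';

m = size(Xv, 2);
r = zeros(m, 1); t = zeros(m, 1); p = zeros(m, 1);
for j = 1:m
    [~, ord] = sort(Xv(:, j));
    rankX = zeros(n, 1);
    rankX(ord) = (1:n)';
    d = rankX - rankRes;                             % rank differences
    r(j) = 1 - 6 * sum(d .^ 2) / (n * (n^2 - 1));    % stage 4: Spearman coefficient
    t(j) = r(j) * sqrt(n - 2) / sqrt(1 - r(j)^2);    % stage 5: t-statistic
    % two-sided p-value of Student's t with n-2 degrees of freedom
    p(j) = betainc((n - 2) / (n - 2 + t(j)^2), (n - 2) / 2, 0.5);
end

disp([r, t, p]);                      % one row per explanatory variable
```

For a multiple regression model the test is repeated for each explanatory variable, which is why the coefficient, the statistic and the p-value are stored as vectors.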

Park's test for multiple regression models
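R. Park proposed a test for heteroskedasticity that is based on a formal dependence. Namely, it assumes that the error variance may be proportional to some power of an independent variable $x_j$ in the multiple regression model.
The variance of the errors $\sigma_{\varepsilon_i}^2$ is assumed to be a function of the explanatory variable $x_j$, and for its description Park proposed the dependence: $\sigma_{\varepsilon_i}^2 = \sigma^2 x_{ij}^{\beta} e^{v_i}$. Taking logarithms, we obtain the following relation: $\ln \sigma_{\varepsilon_i}^2 = \ln \sigma^2 + \beta \ln x_{ij} + v_i$. Since the error variances are unknown, they are replaced by the squared errors of the model estimated by OLS, and the test proceeds as follows:
1) Estimation of the parameters of the econometric model by OLS;
2) Calculation of the model errors $\varepsilon_i$ and of the values $\ln \varepsilon_i^2$;
3) Building the regression model: $\ln \varepsilon_i^2 = \alpha + \beta \ln x_{ij} + v_i$, where $\alpha = \ln \sigma^2$. For the case of multiple regression, this dependence is constructed for each explanatory variable;
4) Verification of the statistical significance of the coefficient $\beta$ on the basis of the $t$-statistic: $t = \hat{\beta} / S_{\hat{\beta}}$.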
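A minimal MATLAB sketch of the Park test is given below. It is not the authors' Parks.m; the data and all variable names are made up for illustration. The auxiliary regression of $\ln \varepsilon_i^2$ on $\ln x_{ij}$ is fitted for each explanatory variable in turn. The sketch assumes that all values of the explanatory variables are positive and that no residual is exactly zero, since otherwise the logarithm is undefined.

```matlab
% Park test sketch (illustrative data; not the article's Parks.m).
n  = 12;
x1 = (1:n)';
x2 = [5; 3; 8; 1; 12; 7; 2; 10; 4; 11; 6; 9];
y  = 1 + 2 * x1 + 0.5 * x2 + 0.2 * x1 .* cos(2 * x1);

Xv = [x1, x2];                        % explanatory variables, must be positive for the log
X  = [ones(n, 1), Xv];
res = y - X * (X \ y);                % OLS residuals of the original model
lnRes2 = log(res .^ 2);               % dependent variable of the auxiliary regression

m = size(Xv, 2);
beta = zeros(m, 1); t = zeros(m, 1); p = zeros(m, 1);
for j = 1:m
    Z = [ones(n, 1), log(Xv(:, j))];  % auxiliary regressors: intercept and ln(x_j)
    g = Z \ lnRes2;                   % g(1) estimates alpha, g(2) estimates beta
    u = lnRes2 - Z * g;               % residuals of the auxiliary regression
    s2 = (u' * u) / (n - 2);          % their variance estimate
    covG = s2 * inv(Z' * Z);          % covariance of the auxiliary estimates
    beta(j) = g(2);
    t(j) = g(2) / sqrt(covG(2, 2));   % t-statistic of beta
    % two-sided p-value of Student's t with n-2 degrees of freedom
    p(j) = betainc((n - 2) / (n - 2 + t(j)^2), (n - 2) / 2, 0.5);
end

disp([beta, t, p]);                   % one row per explanatory variable
```

A statistically significant $\beta$ for some $x_j$ suggests that the error variance depends on that variable.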

Results of numerical experiments
The problem of detecting heteroskedasticity in various multifactor econometric models was considered.
For the numerical simulation experiments we used both the models of the Department of Higher Mathematics, Economic and Mathematical Methods of KhNEU [30-33] and econometric models recently published in leading journals [34-36].
To check for heteroskedasticity, we used real data. This is one of the advantages of this paper. However, it is also possible to use data obtained by Monte Carlo simulation [6, 7, 37-39].
Numerical experiments were performed on an AMD Athlon 64 3200+ with 1.5 GB RAM and an NVIDIA GeForce GTX 560 2 GB graphics accelerator, using NVIDIA CUDA 4.2 technology.
Let us look at a concrete example of what happens to an econometric model if heteroskedasticity is not taken into account.
It should be emphasized that the presence or absence of heteroskedasticity in the initial data is determined automatically by using the check box.
For this we used the code:
Thus, the above procedure makes it possible to eliminate heteroskedasticity. In this case, the resulting models will be able to reflect reality adequately. Table 1 shows the results of numerical experiments testing the programs presented in this article on various multifactor models.
As can be seen from Table 1, software products developed by us using MATLAB can be proposed both for constructing multifactor econometric models, and for investigating the latter for the presence of heteroskedasticity.
In doing so, we used new numerical algorithms, developed on the basis of well-known tests of heteroskedasticity detection.
Open source code allows the researcher to use this software to solve their own problems.
[Residuals plot: X1, X2, X3, X4, X5]

Conclusion and future work
The article examined one of the key problems of regression analysis, which consists in verifying that the residuals of the model satisfy the requirement of homoskedasticity. To this end we used various statistical tests. Analysis of the literature and our own studies confirm the complexity of applying the existing tests for detecting heteroskedasticity by manual calculation. Therefore, we gave our own MATLAB implementation of tests used for detecting heteroskedasticity.
This problem was successfully solved, as shown by the results of the numerical experiments presented in the article. We provide all the software products we have created with open source code, which enables each researcher to customize the programs to their own problems.
In conclusion, we want to note that the work presented in this article is ongoing, with the final purpose of creating complete and effective software for detecting heteroskedasticity in regression models.
A further development is the creation of a complete econometric toolbox in MATLAB.