34 Box-Jenkins Methodology
Dr. Harmanpreet Singh Kapoor
Module 40: Box-Jenkins Methodology
- Learning Objectives
- Introduction
- Box-Jenkins Method
- Box-Jenkins Step-By-Step Approach
- Forecasting
- Short Review Through Flowchart
- Summary
- Suggested Readings
- Learning Objectives
The objective of this module is to describe a procedure for fitting a time series model to stationary or non-stationary univariate time series and for forecasting from it. The procedure is the well-known and widely used Box-Jenkins methodology of time series analysis. The methodology is described in this module and illustrated with an example.
- Introduction
The Box-Jenkins methodology is a procedure for analysing and forecasting time series data with a standard statistical model such as the Autoregressive Moving Average (ARMA) model or the Autoregressive Integrated Moving Average (ARIMA) model. It was proposed in 1970 by George Box and Gwilym Jenkins in their book, “Time Series Analysis: Forecasting and Control”, as a way to identify, estimate, diagnose and forecast with such a model for a particular time series. Textbooks refer to it as the “Box-Jenkins methodology” after its two developers. The Box-Jenkins method is widely used to find the best model for a time series. In this module, we will define the Box-Jenkins method and give tips for applying it to time series data.
- Box-Jenkins Method
The Box-Jenkins method is based on the assumption that a time series process can be approximated by an ARMA stochastic time series model if the series is stationary, and by an ARIMA stochastic time series model if the series is non-stationary. It is a three-stage method aimed at selecting a parsimonious ARMA or ARIMA model for estimating and forecasting a univariate time series. Because the stages are repeated until a satisfactory model is found, the Box-Jenkins method is referred to as an iterative procedure for selecting a stochastic model.
The three stages are:
- Identification
- Parameter Estimation
- Diagnostic checking
If the identified model passes the diagnostic tests, it is ready to be used for forecasting. If it does not, the diagnostic tests indicate how the model ought to be modified, and a new cycle of identification, estimation and diagnostic checking is performed. The cycle continues until the identified model passes the diagnostic tests, which makes the method an iterative process. Let us discuss these steps of the Box-Jenkins methodology.
3.1) Identification
The identification step is used to check whether the given time series is stationary or non-stationary and to find the orders of the autoregressive and moving average terms. Accordingly, identification is divided into two tasks:
- To check whether the time series is stationary or non-stationary
If the time series is stationary, the next step is to identify the orders of the standard statistical ARMA model. If the time series is non-stationary, convert it into a stationary series by removing the sources of non-stationarity. Many techniques for this are available in the literature, and some of them were discussed in the module “Introduction to Non-Stationary Time Series Model”. Let us review them here for better understanding.
If the time series shows a deterministic trend, the trend can be removed by the least squares trend method.
If the time series shows seasonal variation, it can be removed by seasonal differencing.
If the time series shows periodic variation, it can be removed by the method of moving averages.
One of the most widely used methods of transforming a non-stationary series into a stationary one is differencing: one takes differences of the process until the differenced process becomes stationary. A unit root test can be used at each stage to check whether further differencing is needed.
- To identify the orders of the standard stochastic time series model (ARMA/ARIMA)
Once stationarity has been addressed, the next step is to identify the orders of the Autoregressive (AR) and Moving Average (MA) terms. The orders can be identified from the correlograms of the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF). Most real-life time series are non-stationary, so ARIMA is the model most often fitted.
An ARIMA (p, d, q) model is completely identified by the choice of non-negative integer values for the parameters p, d and q.
The value of p can be found from the ACF and PACF. If the PACF cuts off to 0 after a certain lag while the ACF decays exponentially towards 0, the lag at which the PACF cuts off is the order of the autoregressive term, i.e. p.
The value of q can likewise be found from the ACF and PACF. If the ACF cuts off to 0 after a certain lag while the PACF decays exponentially towards 0, the lag at which the ACF cuts off is the order of the moving average term, i.e. q.
The parameter d is the number of times the time series is differenced. The following principles can be used to choose an appropriate value of d. A time series can be modelled by a stationary ARMA model (d = 0) if its ACF decays rapidly to 0 as the lag increases; a slowly decaying ACF indicates that differencing is needed.
The sample variance can also be used to identify d: the value of d for which the differenced series has the smallest sample variance is chosen.
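The cut-off rules for p and q can be checked on simulated data where the true order is known. The following is only a sketch: the coefficients 0.7 and 0.6, the series length and the seed are arbitrary choices made for illustration, not values from this module, and 2/sqrt(n) is used as the usual approximate significance bound.

```r
# Illustration with simulated data; all numerical settings are arbitrary.
set.seed(123)
n <- 500
ar1 <- arima.sim(model = list(ar = 0.7), n = n)  # AR(1) process
ma1 <- arima.sim(model = list(ma = 0.6), n = n)  # MA(1) process

# Sample PACF of the AR(1) series and sample ACF of the MA(1) series
ar1_pacf <- pacf(ar1, lag.max = 10, plot = FALSE)$acf[, 1, 1]
ma1_acf  <- acf(ma1, lag.max = 10, plot = FALSE)$acf[-1, 1, 1]  # drop lag 0

bound <- 2 / sqrt(n)  # approximate significance bound

# Lags whose sample values lie outside the bound
which(abs(ar1_pacf) > bound)  # lag 1 is expected to stand out for AR(1)
which(abs(ma1_acf)  > bound)  # lag 1 is expected to stand out for MA(1)
```

Because of sampling variation, an occasional higher lag may also fall outside the bound, so the correlogram should be read as a whole rather than lag by lag.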
Example 1
Let us consider an example of a monthly RPI time series. Its ACF decayed slowly to 0, and the sample variances of the series and of its first and second differences are
$\hat{\sigma}^2(x_t) = 0.698; \quad \hat{\sigma}^2(\nabla x_t) = 0.597; \quad \hat{\sigma}^2(\nabla^2 x_t) = 0.706$
One can see that the sample variance is smallest after the first difference, so d = 1 for this time series.
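The same comparison of sample variances can be sketched in R. The RPI data of Example 1 are not available here, so a simulated random walk is used as a stand-in; the seed and the series length are arbitrary assumptions.

```r
# Simulated random walk as a stand-in series; not the RPI data of Example 1.
set.seed(42)
x <- cumsum(rnorm(300))

# Sample variance of the series and of its first and second differences
v <- c(d0 = var(x),
       d1 = var(diff(x)),
       d2 = var(diff(x, differences = 2)))
print(v)

# Candidate d: the order of differencing with the smallest sample variance
d_hat <- as.integer(which.min(v)) - 1L
d_hat
```

For a random walk one difference is enough, so the first difference would be expected to give the smallest variance; differencing a second time inflates the variance again.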
3.2) Parameter estimation
After choosing the values of p and q, the next step is to estimate the model parameters. Several estimation methods are available in the literature, among them the method of moments, the maximum likelihood method and the least squares method. The least squares method is widely used because it is simple and involves easy calculations, although its estimates can be biased.
The error terms of the model are assumed to be normally distributed.
Note: The least squares and maximum likelihood estimates of the parameters are equal when the error terms are normally distributed.
For AR(1) and MA(1) models it is easy to estimate the parameter by hand using the least squares method. When the number of parameters is large, say 20 or 30, as in AR(20) or MA(30), estimation by hand becomes very difficult, and the same is true of ARIMA models. Statistical software is therefore needed to estimate the parameters easily and accurately for long series. Many statistical packages are available for time series problems, among them EViews, SAS, Python, R and SPSS.
An example will be solved using R to make the topic clearer. It will also help you to understand how estimation is carried out by computer in practice, where real-life data often cover a long time span.
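As a small illustration of least squares estimation, consider a zero-mean AR(1) model written as x_t = phi * x_{t-1} + e_t. This notation, the coefficient 0.6 and the simulated data are assumptions made for the sketch. The least squares estimate minimises the sum of squared errors and has a closed form, which can be compared with the maximum likelihood estimate returned by `arima()`.

```r
# Simulated zero-mean AR(1) series; the coefficient 0.6 is arbitrary.
set.seed(1)
x <- as.numeric(arima.sim(model = list(ar = 0.6), n = 400))

# Least squares: minimise sum((x_t - phi * x_{t-1})^2) over phi
x_lag <- x[-length(x)]  # x_1, ..., x_{n-1}
x_cur <- x[-1]          # x_2, ..., x_n
phi_ls <- sum(x_cur * x_lag) / sum(x_lag^2)

# Maximum likelihood estimate from arima() for comparison
fit_ml <- arima(x, order = c(1, 0, 0), include.mean = FALSE, method = "ML")
phi_ml <- unname(coef(fit_ml)["ar1"])

c(least_squares = phi_ls, maximum_likelihood = phi_ml)
```

With a few hundred observations the two estimates would be expected to be close to each other, in line with the note above.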
3.3) Diagnostic checking
After identification and estimation of the parameters, the next step is diagnostic checking. Diagnostic checking is used to assess how well the parsimonious ARMA model fits the underlying time series process. If the parsimonious ARMA (p, q) model is a good approximation to the underlying process, the residuals $\hat{e}_t$ will form a good approximation to a white noise process. If they do not, the model is revised and the procedure is repeated until the residuals are a good approximation to white noise. Several diagnostic tests are available in the literature; two of the most useful are:
- Inspection of the graph of $\hat{e}_t$
To check whether the residual process is a good approximation to a white noise process, plot $\{\hat{e}_t\}$ against t. This plot should not show any pattern. If visual inspection of the residual plot reveals a pattern, the model fits poorly. In particular, if the average level of the residuals, or the magnitude of their fluctuations about 0, changes over time, this should be taken to mean that the ARMA model is inadequate and that the whole procedure needs to be repeated.
- Inspection of the sample ACF of $\{\hat{e}_t\}$
The terms ACF and PACF are by now familiar. The sample ACF and PACF can be computed for the sequence of residuals just as for the original series. If too many of these values fall outside the range $\pm 2/\sqrt{N}$, where N is the number of observations, one can conclude that the fitted model does not have enough parameters. The model is then not appropriate, and the whole process is repeated to obtain an appropriate one.
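The residual ACF check can be sketched in R as follows, using the usual approximate bound of 2/sqrt(N). The series is simulated and the fitted order is an assumption made for illustration. The Ljung-Box test at the end is not discussed in this module; it is included only as a commonly used companion check.

```r
# Simulated stand-in series and a fitted AR(1) model; settings are arbitrary.
set.seed(7)
x <- arima.sim(model = list(ar = 0.5), n = 300)
fit <- arima(x, order = c(1, 0, 0))
res <- residuals(fit)

# Sample ACF of the residuals (lag 0 dropped) against the bound 2 / sqrt(N)
r <- acf(res, lag.max = 20, plot = FALSE)$acf[-1, 1, 1]
bound <- 2 / sqrt(length(res))
n_outside <- sum(abs(r) > bound)
n_outside  # roughly 1 in 20 values may fall outside by chance alone

# Ljung-Box test; fitdf is the number of estimated ARMA coefficients
lb <- Box.test(res, lag = 20, type = "Ljung-Box", fitdf = 1)
lb$p.value  # a small p-value suggests the residuals are not white noise
```

Calling `plot(res)` on the same residuals gives the graph described in the first check.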
- Box-Jenkins Step-By-Step Approach
The Box-Jenkins approach involves the following steps:
Step 1: Calculate the ACF and PACF of the data and check whether the series is stationary. If the series is stationary, go to step 3; if not, go to step 2.
Step 2: Take the logarithm and the first difference of the raw data and calculate the ACF and PACF of the first logarithmic differenced series.
Step 3: Examine the graphs of the ACF and PACF and determine which models would be good starting points.
Step 4: Estimate the parameters of the models.
Step 5: For each of these estimated models:
(a) Check whether the parameter of the longest lag is significant. If not, the model probably has too many parameters, and the order of p and/or q should be decreased.
(b) Check the ACF and PACF of the errors. If the model has enough parameters, none of the error ACFs and PACFs will be significant.
(c) Check the AIC and BIC, together with the adjusted R², of the estimated models to detect which model is the most parsimonious.
Step 6: If changes to the original model are needed, go to step 4.
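The estimation and comparison steps above can be sketched in R by fitting a few candidate orders and tabulating their AIC and BIC. The simulated series and the list of candidate orders are assumptions chosen for illustration; in practice the candidates would come from the ACF and PACF inspection.

```r
# Simulated ARIMA(0,1,1)-type series; coefficient and length are arbitrary.
set.seed(2024)
y <- cumsum(as.numeric(arima.sim(model = list(ma = 0.5), n = 300)))

# Hypothetical candidate orders (p, d, q) to compare
candidates <- list(c(0, 1, 1), c(1, 1, 0), c(1, 1, 1), c(2, 1, 1))

# Fit each candidate and collect its information criteria
results <- do.call(rbind, lapply(candidates, function(ord) {
  fit <- arima(y, order = ord, method = "ML")
  data.frame(model = sprintf("ARIMA(%d,%d,%d)", ord[1], ord[2], ord[3]),
             AIC = AIC(fit), BIC = BIC(fit))
}))

results[order(results$AIC), ]  # smaller AIC and BIC are preferred
```

When two models have similar AIC values, the one with fewer parameters is usually the more parsimonious choice.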
- Forecasting
Once the fitted model has passed diagnostic checking, the next step is to forecast future values from the estimated time series model, whether ARMA or ARIMA. The model estimated by the Box-Jenkins method will mostly be an ARMA model. With the Box-Jenkins approach, forecasting is relatively straightforward.
Example 4:
Consider an example of tractor sales data shared by Power Horse’s MIS team. The aim is to forecast tractor sales for the next 3 years using time series ARIMA models.
Solution:
Here, we use R to forecast tractor sales for the next 3 years.
The steps to forecast tractor sales are as follows:
Step 1: Identification
In the identification step, first check whether the process is stationary or non-stationary. One can plot the data as a time series using this code in R.
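# Read the sales data; replace the placeholder with the path to the CSV file
data <- read.csv("[location of data]")
# Convert the second column to a monthly time series starting in January 2003
data <- ts(data[, 2], start = c(2003, 1), frequency = 12)
plot(data, xlab = "Years", ylab = "Tractor Sales")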
From Figure 2, one can see that the data show an upward trend in tractor sales, i.e. the tractor sales series is non-stationary. The next step is to make the series stationary in the mean by taking first-order differences. Whether the differenced series is stationary can then be checked with a unit root test or by plotting it. This code plots the differenced tractor sales data.
plot(diff(data), ylab = "Differenced Tractor Sales")
Differenced data to make data stationary on mean (remove trend)
From Figure 6, one can see that the PACF value is 0 at the start, so p = 0, and that the ACF decays to zero after one lag, so q = 1; with one round of differencing, d = 1. The method of determining these parameters was discussed in the module “Introduction to Non-Stationary Time Series”. There we saw that the order of an AR model is read from the PACF and the order of an MA model from the ACF. Since ARMA and ARIMA models combine AR and MA components, the ACF and PACF should be considered together when determining their order. The AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) can also be used to check the order of the ARIMA model, as discussed in the same module. We now have to estimate the parameters using these values of p, d and q, which will lead to the best fit of the ARIMA model to the data.
Step 2: Estimation of the parameters
After choosing the values of p, d and q, the next step is to estimate the parameters of the model.
The code for this is:
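library(forecast)  # provides auto.arima()
# Fit an ARIMA model to the log10-transformed series; the order is selected automatically
ARIMAfit <- auto.arima(log10(data), approximation = FALSE, trace = FALSE)
summary(ARIMAfit)  # coefficient estimates with AIC and BIC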
The values of the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) can also be used to choose the order of the best-fitting model. The ARIMA model fitted in R is displayed in Table 1.
Here, we discuss only the best parsimonious time series model, i.e. ARIMA (0, 1, 1), with its coefficient estimates and AIC and BIC values. Several models can usually be fitted to the same time series, and one has to select the best among them. Goodness of fit alone is not a good criterion for this choice, because it does not account for the number of parameters; the AIC and BIC are better suited. The AIC measures how much information is lost when the model is fitted to the data, and the BIC is a similar measure that penalizes additional parameters more heavily. For the best model, both the AIC and the BIC should be as small as possible.
The forecast values of tractor sales are shown in blue in the output. The range of expected error (i.e. 2 times the standard deviation) is shown by orange lines on either side of the blue forecast line.
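The forecasting step itself can be sketched with base R. Since the tractor sales file is not available here, the built-in `AirPassengers` series is used purely as a stand-in, and the ARIMA (0, 1, 1) order named in the text is imposed directly rather than selected by `auto.arima()`. Seasonal terms are left out to match that order, so the forecasts will not reproduce seasonal swings.

```r
# Stand-in monthly series on the log10 scale; not the tractor sales data.
y <- log10(AirPassengers)
fit <- arima(y, order = c(0, 1, 1))  # order taken from the text

h <- 36                              # 3 years of monthly forecasts
fc <- predict(fit, n.ahead = h)      # point forecasts and standard errors

# Back-transform to the original scale with approximate 2-standard-error limits
point <- 10^fc$pred
lower <- 10^(fc$pred - 2 * fc$se)
upper <- 10^(fc$pred + 2 * fc$se)

head(cbind(point, lower, upper))
```

The limits widen as the forecast horizon grows, which is one reason long-horizon forecasts should be revisited as new data arrive.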
- Summary
The major assumption is that the underlying patterns in the time series will continue as captured by the model. A short-term forecast, say for a couple of business quarters or a year, can usually be made with reasonable accuracy. A long-term model like the one above needs to be re-evaluated at regular intervals (say every 6 months), so that new information that becomes available with the passage of time is incorporated into the model. The Box-Jenkins methodology is a sound procedure for fitting a model to an underlying time series and forecasting future values with a standard statistical model such as ARMA or ARIMA. The R code given above can be applied to other time series data. R is open-source software and is easy to install and use. The setup for R can be found through a search engine or downloaded from the link given below:
https://cran.r-project.org/bin/windows/base/
This module will help you to apply the Box-Jenkins methodology to real-life data. It shows the importance of time series models and how to find an appropriate model for data that will be used for forecasting.
- Suggested Readings:
- Agung, I. G. N., Time Series Data Analysis Using Eviews, John Wiley & Sons, Asia, 2009.
- Box, G. E. P., G. M. Jenkins and G. C. Reinsel, Time Series Analysis: Forecasting and Control, 3rd Edition, Prentice-Hall International, UK, 1994.
- Hamilton, J. D., Time Series Analysis, Princeton University Press, 1994.
- Lütkepohl, H. and M. Krätzig, Applied Time Series Econometrics, Cambridge University Press, UK, 2004.
- Tsay, R. S., Time Series and Forecasting: Brief History and Future Research, Journal of the American Statistical Association, Vol.95, pp. 638-643, 2000.