
Prof Rahul Bhattacharya


 

 

 

1 The Genesis of Nonparametric Inference

 

Although statistical methods have been used since ancient times, the discipline of nonparametric statistics developed only in the earlier decades of the twentieth century. Savage (1953) considered the year 1936 as the inception of the subject of nonparametric statistics, owing to the publication of the work on rank correlation by Hotelling and Pabst (1936). However, in a review work, Scheffé (1943) indicated the presence of the sign test in Fisher's seminal book-length treatment Statistical Methods for Research Workers in 1925. Further works bearing on the development of nonparametric statistics include contributions by Friedman (1937), Kendall (1938), and Smirnov (1939). But the most significant contribution in this context is due to Wilcoxon (1945), who developed rank-based methods for comparing unknown distributions. It was perhaps the earliest attempt to derive a statistical procedure, parallel to the two-sample t test, without any distributional assumption. This particular work played the key role in accelerating the development of rank-based statistical procedures in the 1950s and 1960s. In further works, Pitman (1948), Hodges and Lehmann (1956), and Chernoff and Savage (1958) investigated the efficiency of rank-based tests and found promising results relative to parametric competitors. These works popularized the use of nonparametric statistical procedures among research workers and attracted practitioners from applied fields.

 

For a brief account of the developments up to recent times in the field of nonparametric statistics, we refer the interested reader to a special issue of the journal Statistical Science, which gives exposure to a wide variety of topics. These include articles on comparing variances and other dispersion measures by Boos and Brownie (2004), density estimation by Sheather (2004), quantile-quantile (QQ) plots by Marden (2004), spatial statistics by Chang (2004), reliability methods by Hollander and Peña (2004) and permutation tests by Ernst (2004), among others.

 

2 Parametric Procedures: The Traditional Practice and Its Limitations

 

Most traditional statistical tools are based on a parametric assumption: the data at hand can be thought of as generated by some well-known distribution such as the normal, exponential or Poisson. The parameter(s) of the distribution are assumed unknown, and in a parametric inference problem we try to learn about the unknown parameter(s) through point estimation, hypothesis testing or confidence interval estimation. Most of the time, the normal distribution is used as the underlying population. The assumption of normality is often justified by the Central Limit Theorem, which ensures approximate normality of certain statistics for large enough sample sizes. Other distributions are important in different fields of application. For example, lifetimes of physical systems are often characterized by exponential, Weibull or Gamma distributions; in such cases the interest lies in the expected failure times, and the mentioned distributions are used to characterize the lifetime distribution. The lifetime distribution is also of interest in certain medical trials, where the goal is to give an idea about the length of life after certain treatments; the exponential, lognormal or Weibull distributions are often used to model the underlying lifetime distribution so as to enable appropriate inferential procedures. Again, in the analysis of economic data, the Pareto or lognormal distributions are appropriate to capture the features of an income distribution.

 

However, in practice, most experiments are complex in nature, affected by a number of factors, and hence the generated data might not be identified with a well-known distribution. Although for large data sets a normality assumption can be made for the analysis, often the amount of data is small and hence the analysis requires the assumption of an appropriate distribution. In fact, there is no exploratory method that can definitively identify the appropriate underlying distribution or differentiate among the available choices. Therefore, a traditional statistician has to assume a distribution that captures only certain features of the data, without any guarantee of the quality of the resulting inference.

 

3 What is Nonparametric Statistics Today?

 

3.1 Nonparametric: The Phrase

 

The specific word "nonparametric" has its root in the work of Jacob Wolfowitz (1942), where he said: "We shall refer to this situation, where a distribution is completely determined by the knowledge of its finite parameter set, as the parametric case, and denote the opposite case, where the functional forms of the distributions are unknown, as the non-parametric case."

 

Therefore, parametric statistics is based on known distributional forms with unknown parameters, and nonparametric statistics was defined in the opposite way. Randles, Hettmansperger and Casella (2004) echoed the same notion in the statement: "Nonparametric statistics can and should be broadly defined to include all methodology that does not use a model based on a single parametric family."

 

The limitations of traditional methods motivated the development of further statistical techniques which can be applied regardless of the true distribution of the data. These techniques are the well-known nonparametric and distribution-free methods.

 

3.2 Nonparametric & Distribution-Free Procedures

 

Although the terms nonparametric and distribution-free are often used interchangeably, they actually indicate statistical procedures that are used in the absence of any assumption about the underlying distribution. In the words of Bradley (1968):

 

The terms nonparametric and distribution-free are not synonymous… Popular usage, however, has equated the terms… Roughly speaking, a nonparametric test is one which makes no hypothesis about the value of a parameter in a statistical density function, whereas a distribution-free test is one which makes no assumptions about the precise form of the sampled population.

 

We start with the definition of these two important concepts and identify the differences subsequently.

 

Distribution-free statistic: Let $X_i,\ i = 1, 2, \ldots, n$ be $n$ random variables with joint distribution $F$, where $F \in \mathcal{F}$, a class of joint distributions. Then a statistic $T = T(X_1, \ldots, X_n)$ is said to be distribution free over $\mathcal{F}$ if the distribution of $T$ remains the same for every possible $F \in \mathcal{F}$.

 

For example, if $\mathcal{F}$ is the class of joint distributions of $n$ iid normal variables with unknown mean $\mu$ and unit variance, then the statistic $Z = \sum_{i=1}^{n}(X_i - \bar{X})^2$ follows a $\chi^2_{n-1}$ distribution whatever the value of $\mu$ may be. Hence $Z$ is distribution free over $\mathcal{F}$.

 

However, in the above example, the distribution-free statistic is actually ancillary in parametric terminology; that is, its distribution does not depend on the indexing parameter of the joint distribution. But it is not a nonparametric statistic, since its distribution varies across different classes of joint distributions. For instance, the distribution of $Z$ above is no longer $\chi^2$ when the joint distribution is different from normal.
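As an illustration (a minimal R sketch under the assumptions of the above example; it is our own addition, not part of the original module), simulation shows that the distribution of $Z$ is unchanged across values of $\mu$ within the normal family but changes once the parent is no longer normal:

# Z = sum((X - mean(X))^2) for n iid N(mu, 1) observations follows a
# chi-squared distribution with n - 1 df, whatever the value of mu;
# the same statistic behaves differently under a uniform parent.
set.seed(1)
n <- 10; B <- 10000
z.mu0  <- replicate(B, {x <- rnorm(n, mean = 0); sum((x - mean(x))^2)})
z.mu5  <- replicate(B, {x <- rnorm(n, mean = 5); sum((x - mean(x))^2)})
z.unif <- replicate(B, {x <- runif(n, -2, 2);    sum((x - mean(x))^2)})
# The first two averages agree with E[chi-squared(9)] = 9; the third does not.
round(c(mu0 = mean(z.mu0), mu5 = mean(z.mu5), unif = mean(z.unif)), 2)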

 

Nonparametric distribution-free statistic: A statistic $T = T(X_1, \ldots, X_n)$ is nonparametric distribution free over $\mathcal{F}$ if the distribution of $T$ does not depend on any $F \in \mathcal{F}$.

 

For example, if $F$ is the joint distribution of $n$ iid continuous random variables symmetric about the origin, then the statistic $Z = \sum_{i=1}^{n} I(X_i > 0)$ has a binomial distribution with parameters $n$ and success probability $\theta_F = P_F(X_1 > 0)$. Now, due to symmetry about the origin, $\theta_F = \frac{1}{2}$, and hence the distribution of $Z$ remains the same (i.e. Binomial$(n, \frac{1}{2})$) whatever the distribution $F$ may be. Thus $Z$ is nonparametric distribution free.
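To see this empirically, the following minimal R sketch (our own illustration, not part of the original text) verifies that $Z$ has the same Binomial$(n, \frac{1}{2})$ distribution for two very different symmetric parents:

# Z = number of positive observations: Binomial(n, 1/2) for any parent
# distribution that is continuous and symmetric about 0.
set.seed(1)
n <- 10; B <- 10000
z.normal <- replicate(B, sum(rnorm(n) > 0))    # normal parent
z.cauchy <- replicate(B, sum(rcauchy(n) > 0))  # heavy-tailed Cauchy parent
# Empirical frequencies agree with the exact Binomial(n, 1/2) probabilities.
round(rbind(normal   = table(factor(z.normal, levels = 0:n)) / B,
            cauchy   = table(factor(z.cauchy, levels = 0:n)) / B,
            binomial = dbinom(0:n, n, 0.5)), 3)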

 

However, for the sake of simplicity, we use the term nonparametric to refer to statistical procedures based on a nonparametric distribution-free statistic.

 

4 Advantages, Disadvantages & Recommendations

 

The main problem with parametric inference is that, if the assumed distribution is not correct, all the effort might go in vain and the best (i.e. most efficient) procedure can become the worst. Nonparametric procedures, in contrast, enjoy the following advantages:

 

1. Nonparametric procedures are based on fewer assumptions about the underlying populations of the data. Therefore, these procedures can accommodate the analysis of non-normal data.

 

2. These techniques are easy to understand and are often easier to apply. Thus, unlike parametric procedures, calculations involving complex distributions can be avoided.

 

3. Although nonparametric procedures discard a portion of the information in the data, the loss in efficiency compared to competitors based on normal parents is only nominal. Moreover, such procedures tend to be more efficient when the underlying population deviates from normality.

 

4. Nonparametric methods are robust in the sense that they tolerate departures from assumptions. That is, if the distributional assumption is perfect, one can adopt the usual parametric procedures; but if it is not, nonparametric methods are the best alternative. However, nonparametric methods are not always the most desirable procedures, as:

 

1. Most nonparametric methods use only the ranks or signs of the observations, discarding the other features of the data. This can make such procedures less efficient.

 

2. Nonparametric methods are usually not as efficient as their parametric counterparts when the parametric assumptions are met.

 

Although nonparametric methods require fewer assumptions, they are used rarely in practice because they use only limited information contained in the data. Consequently, nonparametric methods are primarily recommended in situations where the assumptions are grossly violated (e.g. when severe skewness is present in the data).

 

5 Software-Based Learning

 

With the advent of high-speed computing facilities and software such as R, nonparametric techniques are growing at a fast pace. We emphasise the use of R for learning purposes. R is an open-source statistical software environment, and users can obtain it free of charge through the Comprehensive R Archive Network (CRAN). Most importantly, if the required statistical methodology is not available within R, one can also write code for the purpose. Consequently, for data-based problems, we not only give the final result but also provide the R commands or code together with the output.
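For instance, the documentation and worked examples for any built-in procedure are available directly from the R console:

?t.test          # help page for Student's t test, used in the next section
example(t.test)  # runs the worked examples shipped with R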

 

6 An Exploratory Example from a Real Field

 

Consider the following problem:

 

Suppose the average weekly sales of a new cell phone, collected from a particular shop for 11 consecutive weeks, are as below:

 

52 39 49 33 58 61 44 221 133 289 211

 

Last year the average weekly sales were 125 units. Is the evidence sufficient to conclude that this year's sales exceed last year's?

 

The usual statistical procedure in this context is a single-sample t test. Specifically, if $\mu$ denotes the true mean weekly sales, then we are interested in testing $H_0: \mu = 125$ against $H_1: \mu > 125$. We run the test in R.

 

data <- c(52, 39, 49, 33, 58, 61, 44, 221, 133, 289, 211)

 

t.test(data, mu = 125, alternative = "greater")

 

        One Sample t-test

data:  data
t = -0.6138, df = 10, p-value = 0.7235
alternative hypothesis: true mean is greater than 125

 

Thus we get the p-value as 0.7235 and the value of the t statistic as -0.6138 with 10 degrees of freedom. The tabulated value of the t statistic with 10 degrees of freedom at the 5% level of significance is 1.81; since the observed value falls below it, we fail to reject the null hypothesis at the 5% level. Therefore, the evidence is not enough to reject the null hypothesis.
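Incidentally, the tabulated critical value quoted above can itself be recovered in R instead of from tables:

qt(0.95, df = 10)   # upper 5% point of the t distribution with 10 df; about 1.81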

 

Consequently, the evidence is not sufficient to conclude that there is an increase in the true mean weekly sales. But these results are not final, because we have not yet checked the assumptions required to perform a t test. Specifically, results from t tests are valid provided

 

1. Observations are drawn from a normal parent population, or

2. The sample size is sufficiently large (say, at least 30).

 

In our example we have only 11 observations, and hence the t test is valid only if we can show that the underlying distribution is normal.

 

The simplest descriptive method to check the shape of the data is to plot a histogram. If the resulting histogram is symmetric and bell-shaped, the underlying distribution is taken as normal. For the above data we plot the histogram below:
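The figure can be reproduced with a one-line R command (the title and axis label below are our own choice):

hist(data, main = "Histogram of weekly sales", xlab = "Average weekly sales")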

 

 

Clearly, the histogram of the sales data is far from symmetric. This indicates the non-normal nature of the underlying distribution of the sales data.

 

However, for confirmation, we further use a normal Q-Q plot. The normal Q-Q plot is a simple and efficient exploratory technique, where the quantiles of the observed standardised data are plotted against the corresponding quantiles of the standard normal distribution. If the underlying distribution is standard normal, the points should lie close to a straight line. The normal Q-Q plot for the standardised version of the sales data is given below.
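The plot can be produced in R as follows (a sketch of the commands behind the figure; the module shows only the resulting plot):

std.sales <- scale(data)   # standardise the sales data
qqnorm(std.sales)          # sample quantiles against standard normal quantiles
qqline(std.sales)          # reference line through the first and third quartiles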

 

We find from the plot that the observed quantiles are too far from those of a standard normal distribution, and hence the assumption of normality is not reasonable. Thus the t test is not trustworthy for the sales data set.

 

This particular example exhibits the limitations of a parametric procedure in small samples, especially when the underlying distributional assumption is not satisfied. The above data do not satisfy the assumptions required to perform a t test. As a result, the application of the t test here is forced, and hence the results of such a test are inconclusive even with a high p-value.

 

 

Thus there are situations in which parametric assumptions are violated and alternative (i.e. nonparametric) methods of analysis become the most appropriate and meaningful.
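For the sales data, one simple nonparametric alternative, built directly on the distribution-free sign statistic of Section 3.2, is a sign test of the median; the following R sketch (our own illustration, developed formally in later modules) shows the idea:

# Sign test of H0: median = 125 against H1: median > 125; requires only
# that the underlying distribution is continuous, not that it is normal.
z <- sum(data > 125)   # number of weeks with sales above 125
binom.test(z, n = length(data), p = 0.5, alternative = "greater")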
