12 Two Sample Procedures : Mann Whitney Test I

Mr Taranga Mukherjee

1 Motivation

We start with an example with data from two independent populations. Consider the lifetimes (in 1000 hours) of bulbs of two different brands (say, Old and New).

It is then natural to ask which brand performs better. The usual practice is to use the two-sample t test, which relies on a normality assumption. Since the test needs normality, we first perform some exploratory data analysis. We provide a normal Q-Q plot and a boxplot for each set of data.

The Q-Q plot for the Old brand data shows non-normality, whereas that for the New brand data suggests that the underlying distribution might be normal. Thus the t test is not appropriate for this data. The boxplot, in turn, shows a significant difference between the locations.

Since deciding on an appropriate distribution is subjective, applying a parametric test is not reasonable here, and we need alternative procedures to test the hypothesis.

2 The hypothesis

Suppose $X_i,\ i = 1, 2, \ldots, n$ and $Y_j,\ j = 1, 2, \ldots, m$ are independent observations from distributions $F$ and $G$ respectively. Then the null hypothesis can be expressed as

$H_0 : F(x) = G(x)$ for all $x$.

The most general two-sided alternative is $H_a : F(x) \neq G(x)$ for some $x$, and the one-sided alternatives are $H_a : F(x) \ge G(x)$ for all $x$ with strict inequality for some $x$, or $H_a : F(x) \le G(x)$ for all $x$ with strict inequality for some $x$. Note that $F(x) \ge G(x)$ (or $F(x) \le G(x)$) for all $x$, with strict inequality for some $x$, implies that $Y$ is stochastically larger (or smaller) than $X$.
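The link between the ordering of the distribution functions and stochastic ordering can be spelled out in one line. The following is a short sketch for the case $F(x) \ge G(x)$, writing $P$ for probability (a symbol not used in the text above):

```latex
F(x) \ge G(x)\ \ \forall x
\iff P(X \le x) \ge P(Y \le x)\ \ \forall x
\iff P(Y > x) \ge P(X > x)\ \ \forall x .
```

So $Y$ exceeds any given threshold at least as often as $X$ does, which is what "stochastically larger" means. Reversing the inequalities gives the other one-sided case.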

2.1 How to set the alternative?

Consider the data example. The objective is to know which brand gives the longer lifetime, that is, for which brand the lifetime is expected to be higher. Suppose $X$ ($Y$) is the lifetime variable for Old (New) brand bulbs. Then our interest is to know whether the New brand bulbs are better, that is, whether $Y \overset{st}{>} X$ ($Y$ is stochastically larger than $X$). Thus for the given data the appropriate alternative is $H_a : F(x) \ge G(x)$ for all $x$ with strict inequality for some $x$. The other alternatives are set in the same way, depending on the need of the situation.

2.2 Hypotheses under the location model

Suppose the alternative of interest is simply a difference in location (e.g. a difference in average lifetime). Then we assume $G(x) = F(x - \theta)\ \forall x$, that is, the two populations differ only in location. Then $F(x) = G(x)\ \forall x \iff \theta = 0$ and $F(x) \ge G(x)\ \forall x \iff \theta \ge 0$ (likewise, $F(x) \le G(x)\ \forall x \iff \theta \le 0$). Thus the testing problem reduces to $H_0 : \theta = 0$ against the corresponding alternatives on $\theta$. Note that under a location model $\theta > 0$ ($\theta < 0$) $\Rightarrow$ the second population is shifted to the right (or left) of the first population. Now, for a clear view of the location model, assume that $\frac{dF(x)}{dx} = f(x)$ exists for all $x$. Then the nature of $f(x - \theta)$ for different $\theta$, that is, under different hypotheses, can be traced graphically as below.
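The relation between the sign of the shift and the ordering of the two distribution functions can be checked numerically. The following is a minimal sketch, assuming purely for illustration that $F$ is the standard normal distribution function; the names `normal_cdf`, `F`, `G`, `grid` and `dominates` are hypothetical helpers, not part of the text.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Distribution function of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def F(x):
    # Assumed first-population CDF: standard normal (illustrative choice only).
    return normal_cdf(x)

def G(x, theta):
    # Location model: G(x) = F(x - theta).
    return F(x - theta)

# Grid of x values from -4 to 4 in steps of 0.1.
grid = [i / 10.0 for i in range(-40, 41)]

def dominates(theta):
    """True if F(x) >= G(x) at every grid point."""
    return all(F(x) >= G(x, theta) for x in grid)

shift_right = dominates(0.5)    # theta > 0: F(x) >= G(x) on the whole grid
shift_left = dominates(-0.5)    # theta < 0: the inequality fails
no_shift = all(F(x) == G(x, 0.0) for x in grid)  # theta = 0: F and G coincide

print(shift_right, shift_left, no_shift)
```

A grid check of this kind only illustrates the equivalence for one choice of $F$; it is not a proof.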

3 Mann-Whitney Statistic & properties

Suppose the X observations and Y observations are mixed together and ordered according to their magnitudes. The Mann-Whitney statistic is based on the positions of the Y observations in this combined ordering.

Hence $U$ is a distribution-free statistic under $H_0$, and tests based on $U$ are therefore exactly nonparametric.
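The definition of $U$ is not reproduced in this section, so the sketch below assumes the usual pair-counting form: $U$ is the number of pairs $(X_i, Y_j)$ with $Y_j > X_i$. This agrees with the extreme values quoted in Section 4. It also shows the equivalent rank form, $U = W_Y - m(m+1)/2$, where $W_Y$ is the sum of the ranks of the Y observations in the combined sample. The function names and the numbers are made up for illustration and are not the bulb data; ties are assumed absent.

```python
def mann_whitney_u(x, y):
    """Count the pairs (x_i, y_j) with y_j > x_i (assumes no ties)."""
    return sum(1 for xi in x for yj in y if yj > xi)

def u_from_ranks(x, y):
    """Same statistic from the ranks of the Y's in the combined sample."""
    combined = sorted(x + y)
    # Rank 1 goes to the smallest observation (valid only without ties).
    rank = {value: r for r, value in enumerate(combined, start=1)}
    m = len(y)
    w_y = sum(rank[value] for value in y)   # rank sum of the Y observations
    return w_y - m * (m + 1) // 2

# Hypothetical toy samples: n = 3 X's and m = 4 Y's.
x_toy = [1.2, 2.5, 3.1]
y_toy = [2.0, 3.6, 4.4, 5.0]

u_pairs = mann_whitney_u(x_toy, y_toy)
u_ranks = u_from_ranks(x_toy, y_toy)
print(u_pairs, u_ranks)   # both give 10, out of a maximum of mn = 12
```

The pair-counting form makes the meaning of $U$ transparent, while the rank form shows that $U$ depends on the data only through the positions of the Y's in the combined ordering.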

4 U as a measure of degree of separation

First of all, we need to know the range of $U$. Consider two extreme cases, namely, all the X’s are larger than the Y’s and all the Y’s are larger than the X’s. In the former situation $U$ attains its minimum, i.e. 0, and in the latter its maximum, i.e. $mn$. For other configurations $0 < U < mn$, and hence $0 \le U \le mn$. $U$ actually measures the degree of separation of the populations. Naturally, $U$ takes the extreme values if the two populations are completely separated. With well-mixed observations, $U$ takes intermediate values. The following figures will be helpful to understand this. From Figure 4, it is easy to observe that for completely separated observations $U$ is either highest or lowest, and that $U$ takes intermediate values depending on how the observations from the two populations are mixed.
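The three situations described above can be reproduced with small made-up samples. The following is a minimal sketch, assuming $U$ counts the pairs $(X_i, Y_j)$ with $Y_j > X_i$; the helper `count_u` and all numbers are hypothetical, with $n = 3$ and $m = 4$ so that $mn = 12$.

```python
def count_u(x, y):
    """Number of pairs (x_i, y_j) with y_j > x_i (assumes no ties)."""
    return sum(1 for xi in x for yj in y if yj > xi)

# All Y's larger than all X's: complete separation, U = mn.
u_max = count_u([1, 2, 3], [4, 5, 6, 7])

# All X's larger than all Y's: complete separation the other way, U = 0.
u_min = count_u([5, 6, 7], [1, 2, 3, 4])

# Alternating (well-mixed) observations: U lands in the middle of the range.
u_mixed = count_u([2, 4, 6], [1, 3, 5, 7])

print(u_min, u_mixed, u_max)   # 0 6 12
```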

4.1 Null Distribution of U
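The derivation for this subsection is not included here. As a stand-in illustration, the sketch below enumerates the exact null distribution for small samples. It relies on the standard fact that, under $H_0$ with continuous $F$, every choice of $m$ positions for the Y's among the $n + m$ ordered observations is equally likely. It also assumes the rank form $U = W_Y - m(m+1)/2$, with $W_Y$ the rank sum of the Y's. The function name `null_distribution_u` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def null_distribution_u(n, m):
    """Exact P(U = u) under H0 for n X's and m Y's, assuming no ties."""
    total_size = n + m
    counts = Counter()
    # Each set of m ranks for the Y's is equally likely under H0.
    for y_ranks in combinations(range(1, total_size + 1), m):
        u = sum(y_ranks) - m * (m + 1) // 2
        counts[u] += 1
    n_arrangements = sum(counts.values())
    return {u: Fraction(c, n_arrangements) for u, c in sorted(counts.items())}

dist = null_distribution_u(2, 2)
for u, p in dist.items():
    print(u, p)
```

The enumeration uses only $n$ and $m$, never $F$, which is the sense in which $U$ is distribution-free. Brute-force enumeration is practical only for small samples, since the number of arrangements grows as $\binom{n+m}{m}$.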
