5 Normal distribution, characteristics and its uses in Geographical studies
Prof. Aslam Mahmood
(1) E-Text
Frequency Distribution Curves
One of the first steps after collecting data from the field is to tabulate it in a frequency distribution table. Such a table converts the data into a systematic and manageable form by arranging it in order and removing the minor differences while retaining the major ones. The pattern in a frequency distribution table reveals underlying features of the distribution that are otherwise not clearly visible. The pattern can be highlighted further by depicting the table as a histogram. When the midpoints of the tops of the histogram bars are joined we get a frequency polygon, and a smoothed form of this polygon gives a curve known as a “frequency distribution curve”, as shown below in Figure 1.
Thus a large body of data can be represented by a compact frequency curve. The shape of a frequency curve reflects the process that gave rise to it. There are a few common types of curve: symmetrical, asymmetrical, positively skewed (elongated tail on the right), negatively skewed (elongated tail on the left), J-shaped, U-shaped, etc. A different process can be attributed to each of these shapes. For example, inequality in the distribution of income or of the size of land holdings generates a positively skewed curve with very small frequencies at the higher values of income or land. Similarly, a reverse J-shaped curve shows extreme inequality in the distribution. It is asymptotic to both the X-axis and the Y-axis: a very large population has income near zero (close to the Y-axis), and a very small population has extremely high income (far along the X-axis). Sometimes we may also have a U-shaped curve, showing the effect of a force which operates strongly in the beginning, weakens gradually up to a certain point, and then strengthens again. The effect of age on mortality and the average cost of production of a unit in a factory are examples of such processes.
Figure 1: A Frequency Distribution Curve
Normal Distribution Curve
Of all the types of curves found in the social and natural sciences, a particular bell-shaped symmetrical curve known as the “Normal Curve” plays a very important role in statistical analysis. In geography also the normal distribution has been used in a number of ways. The normal distribution can be obtained from the Binomial distribution with some modifications.
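If the assumptions of the Binomial Distribution are modified as below:
Variable ( X ) is a continuous variable.
Neither P nor Q is very small.
n is sufficiently large.
The Binomial Distribution converges to the Normal Distribution Function, given by:
$$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where μ is the mean and σ is the standard deviation of X.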
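The convergence described above can be checked numerically. The following is a minimal sketch, not part of the original text: the values n = 100 and P = 0.5 are assumptions chosen only for illustration, and the function names are hypothetical. It compares the binomial probabilities with a normal curve that has the same mean (nP) and standard deviation (the square root of nPQ).

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n binomial trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_pdf(x, mean, sd):
    """Height of the normal curve with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))

# Assumed illustration values: n is large and neither P nor Q is very small.
n, p = 100, 0.5
mean = n * p
sd = math.sqrt(n * p * (1 - p))

# Largest difference between the two curves over all possible values of k.
max_gap = max(abs(binomial_pmf(k, n, p) - normal_pdf(k, mean, sd)) for k in range(n + 1))

for k in (40, 45, 50, 55, 60):
    print(k, round(binomial_pmf(k, n, p), 4), round(normal_pdf(k, mean, sd), 4))
print("largest gap:", max_gap)
```

Repeating the comparison with a small n, or with P close to 0 or 1, should show a visibly poorer agreement, which is why the conditions listed above are needed.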
The normal distribution, as its name suggests, is normal in the sense that it consists of values which are neither predominantly very large nor very small; there is a balanced combination of the two. The shape of a normal curve is given below in Figure 2. In a normal distribution a large number of values cluster around a central value. This is a property of almost all variables generated by a normal process, which revolves around the middle values and neither favours nor discriminates against any value. Rainfall, agricultural productivity, family size, size of rural settlements, consumption of food per head, age of individuals, height of students in a class, marks obtained by students and the average life of an electric bulb are examples where the frequency distribution curve will be normal or nearly normal.
Area under normal curve
Probabilities for a normal distribution with any mean and standard deviation can be calculated for different values of the variable X. For any set of values of a variable X following a normal distribution, we find its mean and standard deviation and substitute them in the equation given above. The probability density at any value of X can then be found by evaluating the equation at that value, and the probability that X lies in an interval is the area under the curve over that interval. The process is further simplified by converting the variable into a standardized form as:
$$Z = \frac{X - \mu}{\sigma}$$
where μ and σ are the mean and standard deviation of X.
After transforming the values of X into Z, the equation of the probability distribution becomes:
$$P(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}$$
Note:
Half of the values of Z will be negative (less than mean) and half will be positive (above mean).
Z is a continuous variable.
Range of Z is −∞ to +∞ (minus infinity to plus infinity).
The probability curve of the normal distribution will be symmetrical about Z = 0.
The sum of the probabilities of Z values from −∞ to +∞ is given by the total area under the normal probability curve, which is 1.00.
The probability of a value of Z between “a” and “b” is the difference between the area under the normal curve up to Z = “b” and the area up to Z = “a”.
The sum of the probabilities from Z = 0 to +∞ is one half, i.e. 0.5000. One half of the curve is the mirror image of the other half.
If we know the probability of Z lying between the two limits of a class, multiplying it by the total number of observations gives the expected number of observations falling in that class.
Probabilities of the standard normal variate Z for all values up to two decimal places have been worked out and are tabulated as given below. In view of the above properties of the normal curve, it is sufficient to tabulate one half of the values instead of the full range. By taking the modulus of Z we can find the area under the normal curve on either side of the mean.
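The tabulated areas can also be computed directly. The sketch below is an illustration rather than the source of the printed table: it uses the error function from Python's standard library, and the function name `area_from_mean` is hypothetical. Taking the modulus of Z inside the function mirrors the use of a half table for both sides of the curve.

```python
import math

def area_from_mean(z):
    """Area under the standard normal curve between Z = 0 and Z = |z|."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

# One row of the table: Z = 0.50, 0.51, ..., 0.59, rounded to four places.
row = [round(area_from_mean(0.5 + j / 100), 4) for j in range(10)]
print(row)

# The same area is returned for a negative Z and its positive counterpart.
print(area_from_mean(-0.57) == area_from_mean(0.57))
```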
Example
Annual rainfall of an area is found to be normally distributed with mean = 40 cm and standard deviation = 35 cm.
Find the probability of annual rainfall between 20 and 30 cm.
Both limits lie below the mean, so their Z values are negative; because the curve is symmetrical we work with their moduli.
Rainfall = 20 cm gives Z = |(20 – 40)/35| = 0.5714, and
rainfall = 30 cm gives Z = |(30 – 40)/35| = 0.2857.
We can find the area under the normal curve from Z = 0 to Z = 0.5714 as shown below:
The first column of the table gives Z to the first decimal place. For the second decimal place we look at the head of the table. The intersection of the row 0.5 and the column 0.07 gives the area from Z = 0 to Z = 0.57, which is 0.2157. We want the value for Z = 0.5714, i.e. for an additional Z of 0.0014. The additional area can be obtained through interpolation, in the following manner:
Value of area for Z = 0.57 is 0.2157
Value of area for Z = 0.58 is 0.2190
For a difference of Z = 0.01 the difference in area is 0.0033
So for a difference in Z of 0.0014 the difference in area will be (0.0033/0.01) × 0.0014 = 0.000462
Thus the area under the normal curve from the mean to Z = 0.5714 is 0.2157 + 0.000462 = 0.2162
Similarly we can find the area under the normal curve between the mean (Z = 0) and a rainfall of 30 cm.
For rainfall = 30 cm the Z value is |(30 – 40)/35| = 0.2857
Area under the normal curve from Z = 0 to Z = 0.2857 = 0.1125.
Thus the probability of rainfall between 20 and 30 cm is the difference between the areas under the normal curve for the two limits, which is (0.2162 – 0.1125) = 0.1037.
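The table lookup and interpolation can be cross-checked without a table. The following is a minimal sketch using `statistics.NormalDist` from Python's standard library (available from Python 3.8); the variable names are illustrative. It works with the cumulative area up to each rainfall value, so no modulus is needed.

```python
from statistics import NormalDist

# Rainfall distribution from the example: mean 40 cm, standard deviation 35 cm.
rainfall = NormalDist(mu=40, sigma=35)

# Probability of rainfall between 20 and 30 cm:
# area up to 30 cm minus area up to 20 cm.
p = rainfall.cdf(30) - rainfall.cdf(20)
print(round(p, 4))
```

The result should be about 0.104, i.e. roughly one year in ten can be expected to have rainfall between 20 and 30 cm.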
Properties of Normal Distribution
As shown above the “Normal Distribution Curve” is symmetric with the following important properties:
1. The normal distribution function gives a bell-shaped curve symmetrical about the mean, and its mean, median and mode coincide.
2. It is a unimodal distribution.
3. From the lowest value up to (Mean – Standard Deviation) the curve is concave upward, i.e. the frequencies increase at an increasing rate. At this point the curvature changes and the curve becomes concave downward, i.e. the frequencies increase at a decreasing rate. Such a point is known as a point of inflexion.
4. At the mean itself the curve reaches its peak: the frequencies stop increasing and start decreasing. This is a turning point rather than a point of inflexion, since the curvature does not change there.
5. The second point of inflexion occurs at (Mean + Standard Deviation). Here the curvature changes back from concave downward to concave upward, and from this point onwards the frequencies decline at a decreasing rate.
6. In the range of Mean – S.D. to Mean + S.D., 68.27 % of the observations will fall,
7. In the range of Mean – 2 Standard Deviations (M – 2SD) to Mean + 2 Standard Deviations (M + 2SD), 95.45 % of the observations will fall, and finally
8. In the range of Mean – 3 Standard Deviations (M – 3SD) to Mean + 3 Standard Deviations (M + 3SD), 99.73 % of the observations will fall.
9. Theoretically the range of values of a normal distribution is from minus infinity to plus infinity.
Not every symmetric curve is normal; only those symmetric curves which follow the above properties are normal. A normal curve, however, is necessarily symmetric, as shown below.
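The percentages within one, two and three standard deviations, and the position of the points of inflexion, can be verified with a short calculation. This is a sketch under the assumption of an exactly normal variable; the function names are hypothetical. The curvature is taken from the second derivative of the standard normal curve, which is proportional to (z² − 1)·exp(−z²/2).

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

def curvature(z):
    """Second derivative of exp(-z^2 / 2): positive means concave upward."""
    return (z * z - 1) * math.exp(-z * z / 2)

for k in (1, 2, 3):
    print(f"Mean ± {k} S.D.: {100 * share_within(k):.2f} %")

# The sign of the curvature changes only at Z = -1 and Z = +1,
# i.e. at Mean - S.D. and Mean + S.D.
for z in (-1.5, -0.5, 0.0, 0.5, 1.5):
    print(z, "concave upward" if curvature(z) > 0 else "concave downward")
```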
Example
Using the points of inflexion, data are often classified into four categories for making a choropleth map. To explain this, let us take a hypothetical example of 104 agricultural plots of equal size, for which the production of a crop in '000 kg per hectare was found as given below.
If we convert the data into a histogram we get the shape given in Figure 3.
Figure 3: Frequency distribution of agricultural production ('000 kg/ha)
The curve, though not exactly normal, can be taken as approximately normal. The mean and the standard deviation of agricultural productivity were found to be 52.47 and 14.88 thousand kg/ha respectively. Using the two points of inflexion (Mean – S.D. and Mean + S.D.) together with the mean as class limits, we can divide the data into four meaningful categories for making a choropleth map:
Category | Values ('000 kg/ha) | Number of farms | % Farms |
Below Mean – S.D. | Below 37.59 | 16 | 15 |
Mean – S.D. to Mean | 37.59 to 52.47 | 38 | 37 |
Mean to Mean + S.D. | 52.47 to 67.35 | 30 | 29 |
Above Mean + S.D. | Above 67.35 | 20 | 19 |
Total | | 104 | 100 |
These can be described as: (a) below the first point of inflexion, (b) from the first point of inflexion to below the mean, (c) from the mean to below the second point of inflexion and (d) above the second point of inflexion.
We also note that the percentage distribution of farms in the different categories is very close to that of a normal distribution. The percentage of values between (M – SD) and (M + SD) is found to be about 66 %, as against the theoretical value of 68.27 %. The difference is not found to be significant.
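The class limits and percentages of this example can be reproduced in a few lines. The sketch below uses only the mean, the standard deviation and the four farm counts quoted in the example; the `category` helper and its labels are hypothetical and only show how a single value would be assigned to a map class.

```python
mean, sd = 52.47, 14.88      # from the example, in '000 kg/ha
counts = [16, 38, 30, 20]    # farms in the four categories, in table order

# Class limits: first point of inflexion, mean, second point of inflexion.
limits = [round(mean - sd, 2), mean, round(mean + sd, 2)]
labels = [
    "Below Mean - S.D.",
    "Mean - S.D. to Mean",
    "Mean to Mean + S.D.",
    "Above Mean + S.D.",
]

def category(value):
    """Return the map category for one productivity value."""
    for limit, label in zip(limits, labels):
        if value < limit:
            return label
    return labels[-1]

total = sum(counts)
shares = [round(100 * c / total, 1) for c in counts]
within_one_sd = round(100 * (counts[1] + counts[2]) / total, 1)

print(limits)
print(shares)
print("within Mean ± S.D.:", within_one_sd, "%")
print(category(40.0))
```

Computed from the counts the middle two classes hold a little over 65 % of the farms; adding the rounded percentages in the table gives the 66 % quoted in the text.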
Test of Normalcy of a Distribution
Many techniques of analysis are based on the assumption that the data to be analysed are normally distributed. Researchers seldom take the pains to verify this assumption before applying these techniques and simply assume that the data are normal. A careful researcher, however, should verify it before making use of any technique for which normalcy is an essential assumption. Normalcy of any data can be tested with the help of two main techniques.
Graphically.
There are normal probability papers on which the X-axis is arithmetic and the Y-axis gives the cumulative probabilities of a normal distribution. If a given distribution is converted into its cumulative frequency distribution (either the “more than” type or the “less than” type) and plotted on normal probability paper, the cumulative frequency curve of data following the normal distribution will be a straight line, or will closely follow one.
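Where probability paper is not at hand, the same check can be made numerically: each cumulative proportion is converted into the Z value that a normal distribution would give it, and these Z values are set against the class limits. The sketch below applies this to the "less than" cumulative frequencies of the agricultural productivity example; it is an illustration using `statistics.NormalDist` (Python 3.8 or later), not the procedure of the original text.

```python
from statistics import NormalDist

upper_limits = [37.59, 52.47, 67.35]   # class limits from the productivity example
cumulative = [16, 54, 84]              # farms below each limit: 16, 16+38, 16+38+30
total = 104

# Z value corresponding to each cumulative proportion under a normal curve.
z = [NormalDist().inv_cdf(c / total) for c in cumulative]

# For normal data the points (limit, Z) lie on a straight line,
# so the slopes between successive points should be nearly equal.
slopes = [
    (z[i + 1] - z[i]) / (upper_limits[i + 1] - upper_limits[i])
    for i in range(len(z) - 1)
]

for x, zi in zip(upper_limits, z):
    print(x, round(zi, 3))
print("slopes:", [round(s, 4) for s in slopes])
```

With only three points this is a rough check; a finer frequency table would give more points and a clearer picture of any departure from a straight line.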
Statistically
Another way of testing normalcy is through the test of goodness of fit. Goodness of fit is not specific to the normal distribution; it can be applied to any distribution. The basic approach of the test of “Goodness of Fit” is to generate frequencies for a distribution according to its theory and compare these expected values with the actual frequencies to verify the theory. Here the given frequencies of a distribution are compared with the frequencies generated under the assumption of normalcy. For making the comparison we use the “Chi-squared” test.
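As an illustration, the Chi-squared comparison can be sketched for the four categories of the agricultural productivity example. The degrees of freedom used here (number of classes minus one, minus two for the mean and standard deviation estimated from the data) and the 5 % critical value of 3.841 for one degree of freedom are standard choices, but they are assumptions of this sketch and not stated in the text above.

```python
from statistics import NormalDist

observed = [16, 38, 30, 20]   # farms in the four categories of the example
total = sum(observed)

# Probabilities of the four classes under a normal curve:
# below -1 S.D., -1 S.D. to mean, mean to +1 S.D., above +1 S.D.
std = NormalDist()
probs = [
    std.cdf(-1),
    std.cdf(0) - std.cdf(-1),
    std.cdf(1) - std.cdf(0),
    1 - std.cdf(1),
]
expected = [total * p for p in probs]

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumed: classes - 1 - 2 estimated parameters = 1 degree of freedom.
degrees_of_freedom = len(observed) - 1 - 2
CRITICAL_5_PERCENT_DF1 = 3.841

print("expected:", [round(e, 1) for e in expected])
print("chi-squared:", round(chi_squared, 3))
print("normalcy rejected at 5 %:", chi_squared > CRITICAL_5_PERCENT_DF1)
```

A statistic below the critical value means the observed frequencies do not differ significantly from those expected under normalcy.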
Standardized Normal Variable
One of the serious problems in geographical research arises when we deal with multivariate data consisting of several variables. In a multivariate analysis we have to combine several variables in different ways. Such an exercise poses two serious problems which cannot be neglected. The first and foremost is the problem of the units of measurement of the variables. In a set of several variables, if one is in kg, another in km and a third in Rs., they cannot be added, subtracted or subjected to many other mathematical operations. The other problem is the scale on which they vary. Some variables vary within a narrow range while others show very large variations. For example, if one variable is income in rupees, varying in thousands, and another is longevity, which varies only by a few years, the variables cannot be treated on a par.
Uses of Normal Distribution in Geography
The theory of statistics depends heavily on the normal distribution. Most tests of significance, such as the “t-test”, “F-test” and “Z-test”, are based on the assumption that the variable is normally distributed. Apart from this general importance in the theory of statistics, the normal distribution has some particular uses in geography. For example, it is very helpful in cartography. While preparing choropleth maps of a normally distributed variable, geographers often classify the data into the four categories described above and map them. Another place where the normal distribution has contributed to geographical research is in preparing a composite index of development from a few indicators of development. In such cases the original values of the different variables are first converted into Z-scores (which are pure numbers with no unit of measurement) and the scores are then added with appropriate weights to give a composite index.
Standardized Normal Distribution
A normal variable with a given mean and standard deviation can be converted into a standardized variable Z, with mean zero and standard deviation unity, as:
$$Z = \frac{X - \text{Mean}}{\text{S.D.}}$$
In this transformation the mean of a variable is subtracted from each of its values and the difference is divided by its standard deviation.
We can convert any number of variables into standardized variables. The values of Z are known as standardized scores. Whatever the values of the original variables, their standardized scores will all have zero mean and unit standard deviation.
Example
Three variables X1, X2 and X3, for which the values are given below, can be converted into standardized Z-scores as shown below.
After computing the mean and standard deviation of each variable, we subtract the mean from the individual values of that variable. Some of these differences will be negative and others positive. We then divide each difference by the respective standard deviation to get the standard scores, as given below:
We note that the standardized variables Z1, Z2 and Z3 have zero mean and standard deviations very close to one (we do not get S.D. = 1 exactly because of rounding error; we would have got S.D. = 1 had we rounded after four or five decimal places). We also note that, unlike the X values, the spreads of the Z-scores are not very different from one another.
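The standardization steps can be sketched as follows. The values of X1, X2 and X3 below are made up for illustration and are not the data of the example above; the variable names, the use of the population standard deviation (`pstdev`) and the equal weights in the composite index are likewise assumptions of this sketch.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Subtract the mean from each value and divide by the standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical values for five areas; the units differ from variable to variable.
data = {
    "X1 (income, Rs.)": [12000, 15000, 9000, 20000, 14000],
    "X2 (longevity, years)": [62, 65, 60, 70, 63],
    "X3 (literacy, %)": [55, 70, 48, 82, 65],
}

standardized = {name: z_scores(values) for name, values in data.items()}

for name, z in standardized.items():
    print(name, [round(v, 2) for v in z],
          "mean:", round(mean(z), 6), "S.D.:", round(pstdev(z), 6))

# Composite index: weighted sum of the Z-scores of each area (equal weights assumed).
weights = [1 / 3, 1 / 3, 1 / 3]
n_areas = 5
composite = [
    sum(w * z[i] for w, z in zip(weights, standardized.values()))
    for i in range(n_areas)
]
print("composite index:", [round(c, 2) for c in composite])
```

Because the Z-scores are computed without intermediate rounding, their standard deviations come out as one up to floating-point precision, and the three variables can be added even though their original units differ.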