3 Data Management: Tabulation and Frequency Curves
Dr. Sushil Dalal
(1) E-Contents
In any exercise on data collection and analysis, after converting socially meaningful concepts into numerical “data” and collecting information about them either through census enumeration or through sample surveys, the next step is to put the mass of collected data into a systematic and manageable form. We cannot refer to the raw data in a text or report, as it is too lengthy and is found in a random order. Therefore, the data set needs to be transformed into a systematic and manageable form.
Tabulation of the raw data into a concise form is, therefore, an important step in most statistical analyses. Tabulation serves the dual purpose of putting the data in a systematic as well as a manageable form: it not only condenses the data but also arranges it in a systematic order.
Tabulation
After the data is arranged in a frequency table, it becomes much easier to handle for further statistical analysis, and it can also be easily referred to anywhere in the text. For this purpose the raw data can be transformed into grouped as well as ungrouped “Frequency Distribution Tables”.
Frequency Distribution Table
When we collect data from the field, it is not found in any order. We cannot refer to it anywhere in the text, as it carries little meaning in that form. Through tabulation we can make it more meaningful and also handier to manage. The raw data is converted into small groups, and the number of observations falling in each group is recorded. Observations falling in a group are considered similar. By classifying the data into groups, tabulation removes the minor differences in the data and retains the major differences. A frequency distribution table has two columns. The first column gives the range of the group, known as the class, and the second gives the frequency of each class, i.e. the number of observations falling in that class.
There are two types of frequency distributions:
(a) Ungrouped and
(b) Grouped.
Ungrouped Frequency Distribution Table
In an ungrouped frequency distribution each class consists of a single fixed value of the variable. It is used for data which is discontinuous (discrete) by nature and cannot occur in fractions, such as the size of a family, the number of schools, or the number of floods in a river in a year. The range of discontinuous data is, generally, not very large. An ungrouped frequency distribution table may look like the one given below:
Size of family (X) | Number of families (f) |
1 | 2 |
2 | 14 |
3 | 22 |
4 | 24 |
5 | 18 |
6 | 14 |
7 | 6 |
Total | 100 |
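As a minimal sketch of how such a table is built, the Python snippet below counts how often each family size occurs in a list of raw observations. The ten values in `family_sizes` are hypothetical and are not the 100 families tabulated above.

```python
from collections import Counter

# Hypothetical raw observations: size of each surveyed family.
family_sizes = [4, 3, 5, 4, 2, 3, 4, 6, 3, 5]

# Count how many families have each size.
frequency = Counter(family_sizes)

print("Size of family (X) | Number of families (f)")
for size in sorted(frequency):
    print(f"{size} | {frequency[size]}")
print(f"Total | {sum(frequency.values())}")
```

Because each class is a single value, no decision about class limits is needed; the distinct values themselves form the first column.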
Grouped Frequency Distribution Table
Most of the time we have to handle data which is continuous by nature, such as rainfall, agricultural production, income etc. Such data also occurs in fractions. The range of continuous data is also large. In such cases, instead of single fixed values of the variable, the values are grouped into ranges known as classes, and the number of observations falling in each class, known as its frequency, is tabulated. A hypothetical grouped frequency distribution table of the daily rainfall of an area over 90 days of a season may look like the one given below:
Distribution of Daily Rainfall of 90 Days of an Area (in mm)
Daily rainfall (in mm) | Number of days (f) |
20-30 | 5 |
30-40 | 6 |
40-50 | 11 |
50-60 | 18 |
60-70 | 19 |
70-80 | 15 |
80-90 | 13 |
90-100 | 2 |
100-110 | 1 |
Total | 90 |
In the above frequency table, the values of the variable are tabulated in small groups of values, known as classes. Every class has two values known as class limits: a lower class limit and an upper class limit. The difference between the upper limit and the lower limit of a class is known as the class interval. In the present case the first class has a lower limit of 20.0 mm, an upper limit of 30.0 mm and a class interval of 10.0 mm. In the second class the lower limit is 30.0 mm and the upper limit is 40.0 mm, and so on. All the class intervals of the above frequency distribution table are equal. We notice that the upper limit of every class becomes the lower limit of the next class, so a value equal to it should not be counted in two places. The convention is that any value less than the upper limit is included in the class itself, while a value equal to the upper limit goes to the next class, where it is the lower limit. So in every class the lower limit is included in the class but the upper limit is not.
In a grouped frequency distribution table the number of classes and the class intervals are very important and are related to each other. If the class intervals are large, the number of classes will be small. On the contrary, if the class intervals are small, the number of classes will increase.
A good frequency distribution table maintains a balance between the two. A very large number of classes loses the advantage of summarising the data. A very small number of classes, such as 2, 3 or 4, results in a significant loss of information.
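The convention on class limits can be sketched in a few lines of Python. The function name `class_of` is our own, and the constants simply restate the rainfall classes tabulated above.

```python
# Class limits of the rainfall table above (in mm); each class interval is 10 mm.
LOWER = 20
WIDTH = 10
NUM_CLASSES = 9  # classes 20-30 up to 100-110


def class_of(value):
    """Return the (lower, upper) limits of the class containing value, or None."""
    if value < LOWER or value >= LOWER + WIDTH * NUM_CLASSES:
        return None  # outside the tabulated range
    index = int((value - LOWER) // WIDTH)
    lower = LOWER + index * WIDTH
    return (lower, lower + WIDTH)


for rainfall in (29.9, 30.0, 45.2):
    print(rainfall, "->", class_of(rainfall))
```

A day with exactly 30.0 mm of rain is placed in the class 30-40 and not in 20-30, so no observation is counted twice.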
There are suggestions regarding the number of classes. One such suggestion, traditionally cited in textbooks, is Sturges' rule, under which the number of classes of a frequency distribution table, k, is determined from the number of observations, N, by the formula:
k = 1 + 3.322 log10 N, which is seldom followed in practice.
Even when the calculated class interval does not come out as a round number, class intervals that are multiples of five or ten are preferred for practical reasons.
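For reference, the rule can be evaluated in a few lines. The sketch below assumes the common form of Sturges' rule, k = 1 + 3.322 log10 N, and rounds the result up to a whole number of classes; the function name `sturges_classes` is our own.

```python
import math


def sturges_classes(n):
    """Suggested number of classes for n observations (Sturges' rule, rounded up)."""
    return math.ceil(1 + 3.322 * math.log10(n))


for n in (90, 100, 1000):
    print(n, "observations ->", sturges_classes(n), "classes")
```

The result is only a starting point; as noted above, the final choice usually favours convenient, rounded class intervals.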
Unequal Class Intervals
The difference between the upper limit and the lower limit of a class is known as the class interval, which may or may not be equal for all the classes. Class intervals are commonly of equal size. In some cases, however, equal class intervals are not suitable. For example, the tabulation of urban settlements in India, whose size varies from below 5,000 population to 12,442,000 (Mumbai, the highest in 2011), uses unequal class intervals because of the range of variation in the data. For a range of 12,437,000, equal class intervals of 5,000 each would require 12,437,000/5,000 = 2,488 classes (after rounding up). This is as cumbersome as the data itself and offers no simplification in handling and interpretation. On the other hand, if we take a few wide classes with an interval of 1,000,000 population each, we lose most of the detail, as the very first class, from below 5,000 to 1,000,000 (towns below million-city size), will contain 7,882 of the total 7,935 towns in India in 2011 (99.3%). This is as bad as having no information.
In such cases where the range of the data is too large, for example the population of towns, the income of individuals in a society, or land holdings among farmers, we are forced to use unequal class intervals, arranged so that the class intervals are small at the lower values and become larger and larger as we proceed to the higher values. This reflects the fact that small differences cannot be ignored at the lower end, but the same differences are not equally important at higher values, where only larger differences matter. Thus the Census of India classifies towns using unequal class intervals as given below:
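The arithmetic behind both extremes can be checked with a short Python sketch. All figures are the ones quoted in the paragraph above; the variable names are our own.

```python
import math

smallest, largest = 5_000, 12_442_000  # town populations quoted in the text
data_range = largest - smallest

# Number of equal classes needed with a class interval of 5,000.
classes_needed = math.ceil(data_range / 5_000)

# Percentage of towns that fall below one million population.
share_first_class = 7_882 / 7_935 * 100

print("Range:", data_range)
print("Classes of interval 5,000:", classes_needed)
print("Share of towns in the first wide class (%):", round(share_first_class, 1))
```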
Size Class Distribution of Towns, India, 2011
Size class of towns | Population class interval (2011) | Number of towns |
Class VI towns | Below 5,000 | 499 |
Class V towns | 5,000-10,000 | 2188 |
Class IV towns | 10,000-20,000 | 2238 |
Class III towns | 20,000-50,000 | 1912 |
Class II towns | 50,000-100,000 | 600 |
Class I towns | 100,000 and above | 496 |
Total | | 7933 |
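Assigning an observation to one of several unequal classes is a simple lookup on the class limits. The following Python sketch uses the standard `bisect` module; the function name `town_class` is our own, and we assume the same boundary convention as for equal classes, i.e. a population exactly equal to a limit goes to the higher class.

```python
from bisect import bisect_right

# Lower limits of Class V to Class I; populations below 5,000 are Class VI.
LIMITS = [5_000, 10_000, 20_000, 50_000, 100_000]
LABELS = ["Class VI", "Class V", "Class IV", "Class III", "Class II", "Class I"]


def town_class(population):
    """Return the size class of a town for a given population."""
    # bisect_right counts how many limits are less than or equal to population.
    return LABELS[bisect_right(LIMITS, population)]


for population in (4_999, 5_000, 75_000, 12_442_000):
    print(population, "->", town_class(population))
```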
Whether class intervals are equal or unequal, their choice is crucial. For equal class intervals one only has to decide on the number of classes; the range of the data divided by the number of classes then determines the class interval. Researchers often alter it marginally to suit their convenience. For example, if the calculated class interval comes to 19.73, one can change it to 20.0 for ease of computation and interpretation. There are no hard and fast rules regarding the number of classes. The guiding principle is that there should be neither too many nor too few. Commonly their number lies between about 9 or 10 and 12 to 15.
For unequal class intervals, the number of classes is generally small, as each class represents a category of the data and there should not be a large number of categories, to avoid confusion. For example, in the census classification of the towns of India, the class intervals correspond to six well-recognized classes of towns. What is more important in such cases is the researcher's understanding of how to convert the data into meaningful categories.
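The calculation for equal class intervals can be sketched as follows. The minimum and maximum used here are hypothetical values chosen to reproduce the figure of 19.73, and rounding up to a multiple of five is one possible choice of convenient interval, not a fixed rule.

```python
import math


def class_interval(minimum, maximum, num_classes, step=5):
    """Return the raw interval (range / classes) and a convenient rounded one."""
    raw = (maximum - minimum) / num_classes
    # Round up to the next multiple of step so the classes still cover the range.
    rounded = math.ceil(raw / step) * step
    return raw, rounded


# Hypothetical data running from 2.7 to 200.0, tabulated in 10 classes.
raw, rounded = class_interval(2.7, 200.0, 10)
print(f"calculated interval: {raw:.2f}, interval used: {rounded}")
```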
Example
The following example shows the conversion of a small set of raw data into a frequency distribution table and then into a “Histogram”. It uses a hypothetical set of data on the production of wheat in 100 plots of equal size, one hectare each, in an area, given in the table below.
Production of wheat in quintals (100 kg) per plot of one hectare
20.3 | 20.2 | 19.8 | 20.1 | 21.0 | 20.9 | 20.2 | 19.9 | 19.6 | 19.2 |
20.3 | 21.1 | 19.7 | 19.1 | 18.3 | 18.1 | 17.9 | 20.7 | 20.0 | 19.4 |
18.3 | 18.0 | 17.0 | 17.2 | 22.3 | 20.7 | 21.3 | 18.9 | 19.7 | 21.0 |
21.1 | 19.8 | 18.5 | 18.2 | 22.1 | 21.1 | 18.1 | 19.3 | 19.9 | 19.7 |
18.8 | 18.9 | 16.9 | 20.1 | 20.3 | 18.1 | 17.6 | 19.4 | 20.3 | 21.1 |
20.2 | 22.1 | 18.7 | 19.5 | 20.1 | 23.0 | 22.9 | 22.8 | 22.8 | 22.5 |
20.9 | 20.4 | 20.1 | 20.6 | 20.9 | 18.0 | 20.3 | 18.1 | 19.7 | 18.2 |
18.3 | 17.1 | 20.2 | 23.0 | 20.1 | 18.9 | 18.3 | 21.2 | 17.3 | 17.6 |
19.3 | 19.0 | 21.3 | 22.1 | 19.9 | 18.8 | 21.1 | 23.1 | 23.6 | 23.1 |
20.1 | 19.8 | 19.7 | 18.3 | 17.1 | 18.3 | 19.0 | 20.1 | 20.1 | 18.9 |
The range of the data is quite small: the maximum value is 23.6 and the minimum is 16.9, so the range is 23.6 - 16.9 = 6.7. If we choose 10 classes, every class would have an interval of 0.67 quintal (67 kg) per hectare. An interval of 0.67 is not as conveniently understood as 1 quintal, which is also close to it. Secondly, 10 classes appear to be rather many, as there are only 100 plots. Thus a class interval of 1 quintal (100 kg), which is easily understood, is chosen; it gives eight classes, which is adequate for the purpose of making a histogram.
Starting with a lower class limit of 16.0, so that the minimum value of 16.9 lies in the first class, we form the classes as given in the frequency table below:
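Frequency Table
Production of Wheat in Quintals (100 kg) in 100 Plots of Size One Hectare
Production (quintals) | Number of plots (f) |
16.0-17.0 | 1 |
17.0-18.0 | 8 |
18.0-19.0 | 22 |
19.0-20.0 | 21 |
20.0-21.0 | 25 |
21.0-22.0 | 10 |
22.0-23.0 | 8 |
23.0-24.0 | 5 |
Total | 100 |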
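The tabulation can be reproduced with a short Python sketch. The list `wheat` holds the 100 raw values given above; since the class interval is 1 quintal and the classes start at whole numbers, the lower class limit of a value is simply its integer part.

```python
import math
from collections import Counter

# Raw data: production of wheat (quintals) in 100 one-hectare plots.
wheat = [
    20.3, 20.2, 19.8, 20.1, 21.0, 20.9, 20.2, 19.9, 19.6, 19.2,
    20.3, 21.1, 19.7, 19.1, 18.3, 18.1, 17.9, 20.7, 20.0, 19.4,
    18.3, 18.0, 17.0, 17.2, 22.3, 20.7, 21.3, 18.9, 19.7, 21.0,
    21.1, 19.8, 18.5, 18.2, 22.1, 21.1, 18.1, 19.3, 19.9, 19.7,
    18.8, 18.9, 16.9, 20.1, 20.3, 18.1, 17.6, 19.4, 20.3, 21.1,
    20.2, 22.1, 18.7, 19.5, 20.1, 23.0, 22.9, 22.8, 22.8, 22.5,
    20.9, 20.4, 20.1, 20.6, 20.9, 18.0, 20.3, 18.1, 19.7, 18.2,
    18.3, 17.1, 20.2, 23.0, 20.1, 18.9, 18.3, 21.2, 17.3, 17.6,
    19.3, 19.0, 21.3, 22.1, 19.9, 18.8, 21.1, 23.1, 23.6, 23.1,
    20.1, 19.8, 19.7, 18.3, 17.1, 18.3, 19.0, 20.1, 20.1, 18.9,
]

# The lower limit is included in a class, the upper limit is not.
counts = Counter(math.floor(value) for value in wheat)

print("Production (quintals) | Number of plots (f)")
for lower in sorted(counts):
    print(f"{lower}.0-{lower + 1}.0 | {counts[lower]}")
print(f"Total | {sum(counts.values())}")
```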
Histogram
Distribution of Equal Class Intervals
A frequency distribution table arranges the data in an ordered form, which helps us understand the distributional properties of the data much better than the raw data does. For example, after transferring the data into a frequency distribution, we can easily see how many observations are found in the middle of the values and how many on either side of it. We can also see the inequalities in the distribution and other socially important characteristics of the data. These characteristics become more visible if we plot the distribution of the data as a “Histogram”.
A histogram is a set of rectangles whose bases are equal to the class intervals of the corresponding frequency distribution and whose heights are equal to the corresponding class frequencies.
Taking the wheat production data of 100 plots of one hectare each, as given in the above table, we prepared the “Histogram” shown in the figure given below. The first rectangle has a base of 16.0-17.0, the second rectangle has a base of 17.0-18.0, and so on until the last rectangle, whose base is the class interval of the last class, 23.0-24.0. The height of the first rectangle is equal to the frequency of the first class, i.e. 1, the height of the second rectangle is equal to the frequency of the second class, which is 8, and so on until the last class, with a height of 5.
A histogram can also be converted into a “Frequency Polygon” by joining the middle points of the upper sides of the bars. To show the pattern of change as a gradual process, the polygon can be further converted into a smooth curve, known as a “Frequency Distribution Curve” or simply a frequency curve. Such a frequency curve for the data on the production of wheat is also shown below along with the histogram.
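As a rough text-only sketch, the histogram and the vertices of the frequency polygon can be produced as follows. The frequencies are those obtained by tabulating the 100 wheat values with a class interval of 1 quintal; the horizontal bars of `#` characters are our own stand-in for the rectangles of a drawn histogram.

```python
# Lower class limits and frequencies of the wheat data (class interval 1 quintal).
lower_limits = [16, 17, 18, 19, 20, 21, 22, 23]
frequencies = [1, 8, 22, 21, 25, 10, 8, 5]
width = 1

# Text histogram: the length of each bar equals the class frequency.
for lower, f in zip(lower_limits, frequencies):
    print(f"{lower}-{lower + width} | {'#' * f} {f}")

# Vertices of the frequency polygon: (class mid-point, frequency).
polygon = [(lower + width / 2, f) for lower, f in zip(lower_limits, frequencies)]
print(polygon)
```

Joining the points in `polygon` with straight lines gives the frequency polygon; smoothing that line gives the frequency curve.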
Distribution of Un-equal Class Intervals
In the above histogram the heights of the rectangles are in proportion to the frequency of each class, as the class intervals are all equal. Thus the area of each rectangle is also in proportion to the number of observations (frequency) in the class. In a distribution with equal class intervals the frequencies need not be divided by the class interval, since all intervals are equal. However, if the class intervals are not equal, we have to take the height of each rectangle equal to the frequency density of the class. The frequency density of a class is obtained by dividing its frequency by its class interval.
Example
Consider the income distribution of 400 persons of a locality. Since there are large variations in their incomes, the distribution is given in unequal class intervals.
Income (Rs.) | Number of persons |
0-500 | 200 |
500-1000 | 50 |
1000-2000 | 40 |
2000-5000 | 60 |
5000-10000 | 50 |
Total | 400 |
A histogram without frequency density will give a distorted image. Thus, before making a histogram we have to find out the frequency density for each class as shown below.
Income (Rs.) | Number of persons (frequency) | Class interval (Rs.) | Class interval in units of Rs. 500 | Frequency density |
(1) | (2) | (3) | (4) | (5) |
0-500 | 200 | 500 | 1 | 200 |
500-1000 | 50 | 500 | 1 | 50 |
1000-2000 | 40 | 1000 | 2 | 20 |
2000-5000 | 60 | 3000 | 6 | 10 |
5000-10000 | 50 | 5000 | 10 | 5 |
Total | 400 | | | |
Now we can prepare a histogram. The first class is 0-500 with a frequency of 200. As it has the smallest class interval, Rs. 500, its frequency is not divided, and the class interval of Rs. 500 is taken as the standard unit. All other class intervals are expressed in units of this standard. The second class also has an interval of Rs. 500, so it is equivalent to one unit. The third class interval is Rs. 1000, twice as large as the standard. The fourth class interval is Rs. 3000, six times as large, and the last class has an interval of Rs. 5000, ten times as large. Column 4 of the above table gives the class interval of each class in units of the first class interval. Column 5 gives the frequency density of each class, i.e. its frequency per standard class interval of Rs. 500.
In the histogram, the first class of 0-500 is drawn with a frequency of 200 and the second class of 500-1000 with a frequency of 50. The third class corresponds to two classes, 1000-1500 and 1500-2000, each with a frequency of 20. The fourth class corresponds to six classes, 2000-2500, 2500-3000, 3000-3500, 3500-4000, 4000-4500 and 4500-5000, each with a frequency of 10. Lastly, the class 5000-10000 corresponds to 10 classes of interval 500, starting with 5000-5500 and ending with 9500-10000, each with a frequency of 5.
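The frequency density calculation can be sketched in Python as follows. The class limits and frequencies are those of the income table above; the variable names are our own.

```python
# (lower limit, upper limit, frequency) for each income class, in Rs.
classes = [
    (0, 500, 200),
    (500, 1000, 50),
    (1000, 2000, 40),
    (2000, 5000, 60),
    (5000, 10000, 50),
]
STANDARD = 500  # class interval taken as the standard unit

densities = []
for lower, upper, frequency in classes:
    units = (upper - lower) / STANDARD  # class interval in units of Rs. 500
    density = frequency / units         # height of the rectangle
    densities.append(density)
    print(f"{lower}-{upper} | {frequency} | {upper - lower} | {units:g} | {density:g}")

# The areas (height x width in standard units) add back to the total frequency.
total_area = sum(d * (u - l) / STANDARD for d, (l, u, _) in zip(densities, classes))
print("Total area:", total_area)
```

The final check illustrates why density is used: with heights equal to frequency density, the area of each rectangle, rather than its height, represents the number of persons in the class.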
A histogram of the above distribution of unequal class intervals will be as given below.
Frequency curves play an important role in statistical analysis. The shape of a frequency curve helps us understand the process through which the data is generated.
An ordinary process in which neither very high nor very low values are favoured will generate a symmetrical curve. Examples are the average annual rainfall of an area over a period of time, the height of children in a given age group and the agricultural productivity of plots in adjoining areas. A symmetrical curve is such that if it is folded down the middle, one half overlaps the other.
On the contrary, owing to certain natural or social factors, the values in some distributions are not symmetrical, and we get a curve which is “asymmetric” or “skewed”. The values show inequalities in their distribution either on the higher side or on the lower side. The distribution of agricultural land holdings, the income distribution and the district-wise proportion of urban population give curves elongated to the right-hand side, known as “positively skewed”. The proportion of rural population to total population in different districts gives a curve elongated to the left-hand side, known as “negatively skewed”.
Death rates by age in a population give a “U-shaped curve”, as mortality is higher at the beginning and at the end of life and lowest in the middle ages. The shapes of symmetric and skewed curves are also given below:
Comparison of Frequency Distributions
Any research enquiry begins with observations of the real-world situation around us and its comparison under different geographical situations. After we collect data about the real world and summarise it with the help of frequency tabulation, different types of graphs give us only a preliminary understanding of its comparative position under different geographical conditions, as they are not very accurate. For an accurate and meaningful comparison we need some numerical measures of the distribution. There are several such measures, known as “Descriptive Statistics”. Some of the commonly used ones are given below:
- Measures of Central Tendency
- Measures of Dispersion
- Measures of Skewness
The first two, i.e. measures of central tendency and measures of dispersion, are very important parameters of any distribution, as they are used extensively in the theory of sampling, in inference and in many other places. Measures of skewness, however, are used relatively less frequently.
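As an illustrative sketch only, the snippet below computes one measure of each kind from the grouped income distribution used earlier. It assumes that every observation in a class can be represented by the class mid-point, and it uses the mean, the standard deviation and the moment coefficient of skewness; these particular measures are our choice among several that exist.

```python
import math

# (lower limit, upper limit, frequency) of the income distribution, in Rs.
classes = [
    (0, 500, 200),
    (500, 1000, 50),
    (1000, 2000, 40),
    (2000, 5000, 60),
    (5000, 10000, 50),
]

midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
freqs = [f for _, _, f in classes]
n = sum(freqs)

# Central tendency: arithmetic mean of the grouped data.
mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Dispersion: standard deviation about the mean.
sd = math.sqrt(sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / n)

# Skewness: third central moment divided by the cube of the standard deviation.
skewness = sum(f * (m - mean) ** 3 for f, m in zip(freqs, midpoints)) / n / sd ** 3

print(f"mean = {mean:.2f}, standard deviation = {sd:.2f}, skewness = {skewness:.2f}")
```

A positive value of `skewness` corresponds to a curve elongated to the right-hand side, which is what the text describes for income distributions.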