11 Data Processing and Frequency Distribution
Prof Surendra Singh
1.0 Introduction:
Quantitative information is a main component of scientific enquiry. After the data of a geographical attribute (called a variable) are collected, either from field work as a primary source or from secondary official sources, they have to be processed and arranged in an orderly way in accordance with the purpose of the study, so that facts can be inferred for scientific explanation. The collection of data and its reliability form a separate dimension that is not discussed here. After the data are collected, the three processes described below make them more organized for further statistical operations.
1.1 Editing:
After collection, the raw data, especially those from field surveys, are examined and edited to remove errors and to correct the data for further use. Field surveys are conducted through schedules/questionnaires. Sometimes field editing is done when a team of data collectors visits the field and notices data errors that deviate from the observed reality of the object for which quantitative data are collected. Editors may correct many entries in the field itself, either after observing the reality or after obtaining a proper answer from respondents. The second step of editing is known as 'Central Editing', when the schedules filled in the field are returned to the concerned office for final processing. Editors correct the data keeping in mind various aspects of data arrangement, such as entries made in the wrong place in the table and the unitization of data (all data of a variable are converted to the same unit). For example, income per household is asked month-wise in the printed schedules, but some respondents report it on a different time scale, such as yearly income. It is the task of the central editing cell in the concerned office to convert the income data to the same unit of time, either monthly or annual income of a household, by converting it linearly or curvilinearly as required for fact-finding reports.
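The unitization step described above can be illustrated with a minimal Python sketch. The household records, the field layout and the function name `to_monthly` are hypothetical and serve only as illustration. The sketch assumes the simple linear conversion (yearly income divided by 12); a curvilinear conversion would need a rule that the text does not specify.

```python
# Hypothetical schedule entries: (household id, reported income, reporting period).
records = [
    ("H01", 12000.0, "monthly"),
    ("H02", 180000.0, "yearly"),
    ("H03", 9500.0, "monthly"),
]

MONTHS_PER_YEAR = 12


def to_monthly(amount, period):
    """Convert a reported income to a monthly figure (linear conversion)."""
    if period == "monthly":
        return amount
    if period == "yearly":
        return amount / MONTHS_PER_YEAR
    raise ValueError(f"unknown reporting period: {period!r}")


# After unitization every household income is expressed per month.
monthly_income = {hid: to_monthly(amount, period) for hid, amount, period in records}
print(monthly_income)
```

Entries with an unrecognised reporting period raise an error instead of being converted silently, so that they can be referred back for editing.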
1.2 Coding:
Coding is necessary for efficient analysis and correct findings. It transfers the raw material of the filled schedules into numeric form, reducing the variables/characteristics of the data to a limited number of categories for further computation and data processing. It is an essential part of the process of data arrangement.
Many research scholars earlier used the 'hand coding' method to keep information in proper arrangement. Coding sheets were used to complete the data arrangement because modern tools for coding and computing data were not available. The coding process is nowadays more efficient and accurate. When IBM computation started in the 1960s, punch cards were used to arrange data sheets. With the advancement of software such as Excel, the coding and processing of large volumes of data have become easy and fast.
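As a rough illustration of coding, the following Python sketch converts text answers from a schedule into numeric codes. The question name `land_use`, its categories, the code numbers and the missing-value code 9 are all hypothetical; an actual survey would use its own codebook.

```python
# Hypothetical codebook: each category of a schedule question gets a numeric code.
CODEBOOK = {
    "land_use": {"forest": 1, "cultivated": 2, "fallow": 3, "settlement": 4},
}
MISSING = 9  # assumed code for blank or unrecognised answers


def encode(question, answer):
    """Return the numeric code of an answer, ignoring case and stray spaces."""
    cleaned = answer.strip().lower()
    return CODEBOOK[question].get(cleaned, MISSING)


responses = ["Forest", "cultivated ", "Fallow", "", "settlement"]
codes = [encode("land_use", r) for r in responses]
print(codes)  # [1, 2, 3, 9, 4]
```

Keeping the codebook in one place means the same category always receives the same code, which is the purpose of a coding sheet.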
1.3 Classification:
Most research studies nowadays depend on large volumes of data, which are available from various advanced sources such as remotely sensed data of spatial attributes, data generated from topo-sheets, field inquiries and so on. Such large volumes of data are classified into homogeneous groups of variables or common characteristics of geographical factors/attributes. This is called the statistics of attributes. On the other hand, since geographical analyses are more concerned with areas/places, homogeneous classes of data may also be formed on the basis of the homogeneity of places/areas. Classification of data, whether based on characteristics (variables) or on places (observations), has its own importance in inferring facts. Classification is used as a tool in geographical studies in a variety of ways.
(i) In regional planning and development, the concepts of 'compage' (togetherness of characteristics) and fusion (interaction of different spaces) depend largely on the process of regionalization. It deals with the dimensions of the geographic data matrix, as described in another module entitled Geography and Quantitative Techniques. Classification is thus a source of regionalization. Areal homogeneity within areas and heterogeneity between areas is a fundamental principle of regionalization, which operates only through the concept of classification.
(ii) There are two types of statistical procedures for handling a set of data: ungrouped data, handled with simple statistical techniques (used for small samples, N ≤ 30, where N is the number of observations/items in a data set), and grouped data (called a frequency distribution), used for larger data sets (N > 30). The problems of frequency distribution must be kept in mind when the classes of the variable are chosen and the class limits are fixed.
(iii) Grouping a large set of data leads to a frequency distribution. Grouped data tables are called frequency tables. Grouping is an essential step of data processing for statistical operations as well as for mapping geographical attributes. The number of observations falling in a given class is known as the frequency of that class. The selection of the class interval and the number of classes usually involves three problems: (a) how many classes there should be for a set of data, (b) how to choose the class limits, and (c) how to determine the frequency of each class. These problems of frequency distribution and its application in geographical research are discussed in detail in the following parts, with examples from geographical studies.
Although there is no specific answer to the questions cited above, some suggestions are given below. Sometimes a larger number of classes is needed to prepare a frequency distribution table. However, five to fifteen classes are reasonably good for understanding the variability of a distribution. In general, the following formulas provide rough guidelines for deciding the number of classes and the class interval when preparing a frequency table.
N = 2^k ……..… (1)
or
k = 1.0 + 3.3 log10 N ……….… (2)
where k = number of classes to be used in a frequency table, N = total number of observations and log10 N = logarithm to the base 10 of the number of observations (Pal 1998). In addition, the width of a class, called the class interval, is determined by dividing the range of the data series (maximum minus minimum value of the data set, R) by the number of classes, k, written as
i = R / (1.0 + 3.3 log10 N), ……………… (3)
where i is the size of the class interval. As indicated earlier, this gives only a rough estimate of the number of classes. It cannot be applied mechanically, because the distribution of a data set may be strongly skewed. Sometimes the general principle of equi-interval classes has to be set aside because of large variation in the given set of data. This may be understood clearly from the example of daily rainfall at Cherrapunji (one of the wettest places on earth).
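The three guidelines can be tried out with a short Python sketch. The function names are illustrative only. Equation (1) is read here as "the smallest k for which 2^k is at least N", which is an assumption about how the rule is applied. The values N = 170 and R = 793.2 - 1.0 mm are those of the Cherrapunji rainfall example used in this module.

```python
import math


def classes_power_rule(n):
    """Smallest k with 2**k >= n (equation 1 read as an inequality)."""
    k = 1
    while 2 ** k < n:
        k += 1
    return k


def classes_log_rule(n):
    """k = 1 + 3.3 * log10(N), as in equation 2."""
    return 1.0 + 3.3 * math.log10(n)


def class_interval(data_range, n):
    """i = R / (1 + 3.3 * log10(N)), as in equation 3."""
    return data_range / classes_log_rule(n)


n = 170                      # rainy days in the Cherrapunji example
data_range = 793.2 - 1.0     # maximum minus minimum daily rainfall (mm)

print(classes_power_rule(n))                 # 8, because 2**8 = 256 >= 170
print(round(classes_log_rule(n), 2))         # 8.36
print(class_interval(data_range, n))         # a little under 95 mm
```

Both rules suggest about eight classes for this data set, and the interval of roughly 95 mm is rounded to a convenient 100 mm.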
2.0 Example:
Daily rainfall data were collected for 211 days from 20 March to 16 October 2004 (the rainy season in the Cherrapunji area), of which 41 days were zero-rain days. The remaining 170 days had daily rainfall of 1.0 mm or more. A total rainfall of 14,719.3 mm was recorded in these 170 days, with a large variation ranging from 1.0 mm to 793.2 mm (Table-1).
Table-1: Daily Rainfall Data Collected for 211 Days (20 March to 16 October 2004) at Cherrapunji
Applying equations 2 and 3, with R = 793.2 - 1.0 = 792.2 mm, the number of classes works out to about 8 and the class interval to approximately 100 mm, because
k = 1 + 3.3 log10 170 = 8.36 (number of classes)
and
i = 792.2 / 8.36 = 94.76 (approximately 100 mm)
Following the calculated number of classes and their size, a frequency table is prepared by filtering the rainfall data in a Microsoft Excel sheet. A discrete frequency distribution is given below (Table-2, Fig.-1). When the frequency distribution of a continuous series with small class intervals is drawn as a smooth curve, it is called a 'frequency curve'. On the basis of their shape, frequency curves are divided into six different forms:
(i) Symmetric (normal curve)
(ii) Positively skewed (concentration of frequencies towards lower magnitude of classes)
(iii) Negatively skewed (concentration of frequencies toward higher magnitude classes)
(iv) J-shape distribution (as frequencies are concentrated in the higher value classes)
(v) Reverse J-shape distribution (as frequencies are concentrated in the lowest value class as shown below in the frequency distribution of Cherrapunji rainfall), and
(vi) U-shaped distribution (the inverse of the normal curve, with frequencies concentrated in the lowest and highest classes)
Measures of this distribution would be given separately in another module. However, an interpretation of rainfall table is given in the following paragraphs.
Table-2: Frequency Distribution of Daily Rainfall Data Following the Equi-interval Classes (Total 170 days)
N.B.: Zero-rain days are excluded from this distribution. This table is compiled from the data given in Table-1.
The distribution is discrete because adjacent classes do not share a common class limit.
Fig.-1: Frequency Curve of the Daily Rainfall at Cherrapunji
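The preparation of an equi-interval frequency table can be sketched in Python as follows. The ten rainfall values are hypothetical and are NOT the Table-1 data. The function name and the half-open class convention (lower limit included, upper limit excluded, last class closed) are assumptions made for this sketch; they differ slightly from the discrete limits 0-100, 101-200 and so on used in the tables.

```python
def frequency_table(values, lower, width, k):
    """Count values in k equal-width classes starting at `lower`.

    Each class is [lower limit, upper limit), except the last one, which
    also includes its upper limit so that the maximum value is not lost.
    """
    counts = [0] * k
    upper = lower + k * width
    for v in values:
        if v < lower or v > upper:
            raise ValueError(f"{v} lies outside the class limits")
        index = min(int((v - lower) // width), k - 1)
        counts[index] += 1
    return counts


# Hypothetical daily rainfall values in mm (illustration only).
sample = [1.0, 3.2, 12.5, 48.0, 75.4, 99.9, 120.0, 150.3, 260.0, 793.2]
counts = frequency_table(sample, lower=0.0, width=100.0, k=8)

for j, f in enumerate(counts):
    print(f"{j * 100:>3}-{(j + 1) * 100:<3} mm: {f}")
```

Even in this small sample most values fall in the first class, which is the pattern of a reverse J-shaped distribution.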
From the frequency table of rainfall, it is obvious that most of the frequencies (130 out of a total of 170) fall within the category of 1-100 mm. It is a reverse J-shaped distribution. A distribution with nearly equal frequencies in each class is generally considered preferable to the grouping obtained by following equation 3. Sometimes the formula for identifying the number of classes does not give a proper result. This may be due to a large range in the magnitude of the given data.
In such a case, either the equal-interval requirement is relaxed or the data series is transformed to logarithms to examine the real nature of the distribution. For example, the same data set is transformed to log10 values and then classified according to the variation in the data. In the present case, an equal class interval of 0.40 of the log of rainfall is used. It gives better results, as the frequency distribution comes closer to the normal form (Table-3, Fig.-2).
Table -3: Transformed Frequency Distribution
log10 value of rainfall (mm) | Frequency (no. of days) |
0.00- 0.40 | 16 |
0.40- 0.80 | 10 |
0.80- 1.20 | 25 |
1.20- 1.60 | 47 |
1.60- 2.00 | 32 |
2.00- 2.40 | 23 |
2.40- 2.80 | 15 |
2.80- 3.20 | 2 |
Fig.-2: Graphical Representation of Transformed Frequency distribution (Daily log rainfall in Cherrapunji)
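The logarithmic transformation can be sketched in Python in the same way. The sample values are hypothetical and are not the Table-1 data. The class width of 0.40 and the eight classes follow Table-3, while the function name and the treatment of class limits (lower limit included) are assumptions.

```python
import math


def log_frequency_table(values, width=0.40, k=8):
    """Frequencies of log10(values) in k classes of equal log-width."""
    counts = [0] * k
    for v in values:
        if v < 1.0:
            raise ValueError("values below 1 mm would give a negative log10")
        index = min(int(math.log10(v) / width), k - 1)
        counts[index] += 1
    return counts


# Hypothetical daily rainfall values in mm (illustration only).
sample = [1.0, 3.2, 12.5, 48.0, 75.4, 99.9, 120.0, 150.3, 260.0, 793.2]
counts = log_frequency_table(sample)

for j, f in enumerate(counts):
    print(f"{j * 0.40:.2f}-{(j + 1) * 0.40:.2f}: {f}")
```

On the log scale the values spread over the classes instead of piling up in the first one, which is the effect shown in Table-3.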
3.0 Cumulative Frequency Distribution:
Class frequencies are often cumulated to show their combined effect where required (mathematically, this is the integration of frequencies to obtain the combined effect of an event or the trend of a variable). The cumulative frequency curve (called an ogive) is S-shaped and always rises if class frequencies are cumulated on a 'less than' basis. Conversely, the curve has a reverse S shape if the 'more than' rule of accumulation is followed (Table-4, Fig.-3).
Table -4: Cumulative Frequency Distribution
Rainfall Class (mm) | % of days below the lower class limit (less than) | % of days at or above the lower class limit (more than) |
0-100 | 0.000 | 100.000 |
101-200 | 76.471 | 23.529 |
201-300 | 86.471 | 13.529 |
301-400 | 91.765 | 8.235 |
401-500 | 95.882 | 4.118 |
501-600 | 98.235 | 1.765 |
601-700 | 98.824 | 1.176 |
701-800 | 99.412 | 0.588 |
Fig.-3: Ogive of Daily Rainfall at Cherrapunji
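The cumulative percentages can be reproduced from the class frequencies with a few lines of Python. The frequencies are those listed in Table-5 of this module. The variable names are illustrative, and the percentages are cumulated at the lower limit of each class, which is the reading that matches the values of Table-4.

```python
# Class frequencies (days) of the eight 100 mm rainfall classes (Table-5).
frequencies = [130, 17, 9, 7, 4, 1, 1, 1]
lower_limits = [0, 101, 201, 301, 401, 501, 601, 701]

total = sum(frequencies)

running = 0   # number of days in all classes below the current one
rows = []
for lower, f in zip(lower_limits, frequencies):
    less_than = 100.0 * running / total   # % of days below this class
    more_than = 100.0 - less_than         # % of days in this class or above
    rows.append((lower, round(less_than, 3), round(more_than, 3)))
    running += f

for lower, lt, mt in rows:
    print(f"{lower:>4} | {lt:7.3f} | {mt:7.3f}")
```

Plotting the 'less than' column against the class limits gives the rising ogive, and the 'more than' column gives the falling one.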
4.0 Characteristics of Frequency Distribution:
The following points are important to note when a frequency curve is used.
(i) The area of each constituent rectangle (bar) of a histogram is proportional to the frequency falling in that rectangle.
(ii) If a perpendicular is dropped from any point on the frequency curve to the X-axis, the area between the X-axis and the curve up to that perpendicular is proportional to the number of observations below that value. When the total area under the curve is scaled to one, the curve is called a 'density function'.
(iii) Relative frequency shows the concentration of frequencies in the distribution. It is calculated by dividing the frequency of each class by the total frequency. It can easily be converted into a probability distribution (Table-5).
Table-5: Relative Frequency and Probability in Different Rainfall Classes at Cherrapunji
Sl no | Rainfall Class (mm) | Frequency (no. of days) | Relative Frequency (%) | Probability |
1 | 0-100 | 130 | 76.47 | 0.765 |
2 | 101-200 | 17 | 10.00 | 0.100 |
3 | 201-300 | 9 | 5.29 | 0.053 |
4 | 301-400 | 7 | 4.12 | 0.041 |
5 | 401-500 | 4 | 2.35 | 0.023 |
6 | 501-600 | 1 | 0.59 | 0.006 |
7 | 601-700 | 1 | 0.59 | 0.006 |
8 | 701-800 | 1 | 0.59 | 0.006 |
Total | | 170 | 100.00 | 1.000 |
(iv) The cumulative frequency curve offers a quick graphical procedure for finding the median and any percentile measure, including the quartiles. The 'more than' and 'less than' curves intersect at the median of the variable. Equivalently, the point at which the 'less than' cumulative curve crosses the 50% level on the Y-axis gives the approximate median value of the variable on the X-axis (Fig.-4).
Fig-4: Finding Median Value through Frequency Curve
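The relative frequencies of Table-5 and an approximate median can also be computed numerically. The Python sketch below uses the class frequencies of Table-5. It assumes continuous class boundaries at 0, 100, 200 and so on, and finds the median by linear interpolation within the class that contains the N/2-th observation. This is the usual grouped-data counterpart of reading the 50% point off the ogive, not a procedure stated in the text. The function name is illustrative.

```python
frequencies = [130, 17, 9, 7, 4, 1, 1, 1]                # days per class (Table-5)
lower_bounds = [0, 100, 200, 300, 400, 500, 600, 700]    # assumed class boundaries (mm)
width = 100.0

total = sum(frequencies)
relative = [100.0 * f / total for f in frequencies]      # relative frequency (%)
probability = [f / total for f in frequencies]           # relative frequency as a proportion


def grouped_median(freqs, bounds, h):
    """Median by linear interpolation inside the class containing N/2."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("empty distribution")
    half = n / 2
    below = 0   # cumulative frequency of the classes already passed
    for f, lower in zip(freqs, bounds):
        if f > 0 and below + f >= half:
            return lower + (half - below) / f * h
        below += f
    raise ValueError("median class not found")


median = grouped_median(frequencies, lower_bounds, width)
print([round(r, 2) for r in relative])
print(round(median, 1))   # about 65.4 mm
```

Because 130 of the 170 days fall in the first class, the 85th observation lies in that class and the interpolated median is about 65 mm, well below the class interval of 100 mm.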
5.0 Application of Frequency Distribution in Geosciences
In geomorphological studies, the area-elevation and area-slope curves of the relief of a watershed are the best examples of the use of cumulative frequency curves. The cumulative percentage of area between zones of different heights is shown for three small experimental watersheds (namely Um-u-lah, Paham Syiem and Umpher) located in different topographic conditions of the Meghalaya plateau (Fig.-6). Watersheds are selected because they represent different topo-sequences and are natural units in which natural resources such as soil, water and forests may be studied with reference to topo-sequences in the Meghalaya plateau. The frequency distribution is a good model for showing the reality of resource distribution, while the hypsometric curve shows variations in the topographic features of a watershed. Distributional features and hypsometric curves are exemplified below.
5.1 Example:
(a) Show the distribution of annual runoff depth in the Meghalaya plateau in order to identify the areas of high runoff availability for utilization.
The frequency distribution of annual runoff depth for 235 watersheds of the Meghalaya plateau is obtained by classifying the areal units into eight classes with a class interval of 1000 mm of runoff depth, following the rule given above for the number of classes. It gives a clear picture of the areal distribution: the southern slopes of the Meghalaya plateau contain more runoff in their watersheds, which may be utilized for hydroelectric generation. The areas adjacent to the Bangladesh border and interior Cherrapunji yield extremely high runoff (more than 7000 mm annually), especially during the rainy season (Fig.-5).
Abbreviations: 1=International Boundary, 2=State, 3=River Catchment, 4=Sub-Catchments, 5=Micro-Catchments, 6=Water Divide between Northern and Southern Slope, 7=Streams, 8=Rivers
Source: Hydrologic Observations collected and compiled under Project sponsored by DST, New Delhi.
Fig-5: Distributional Pattern of Annual Runoff Depth
(b) Relief Conditions of Sample Watersheds Selected in Meghalaya Plateau and Interpretation of Hypsometric Curves:
(i) With a relative height of 130 m over its 1.70 km length, the Um-u-lah watershed (Cherrapunji) has an average slope of about 8.0 percent from the mouth of the watershed at 1,310 m to the upper reaches at 1,440 m a.s.l. Hill tops are generally flat and dissected by seasonal rivulets and depressions. Two streams are perennial, and the others are seasonal, passing through these depressions during the heavy rainfall of the monsoon season. The area-elevation curve of Um-u-lah shows the proportionate area under each category of relative height of the watershed.
(ii) The Paham Syiem watershed (Nongpoh), about 3.6 km long (straight-line section), has a relative height of 277 m varying from 533 m to 830 m a.s.l. at its upper flat reaches in the eastern part of the watershed. Being located in an inter-piedmont valley, this watershed has relatively smooth surface conditions.
(iii) The relative height of the Umpher watershed (Byrnihat) is recorded as 360 m, varying from 80 m to 440 m a.s.l. from the mouth to the upper reaches in the south-western parts of the watershed. The Kiling stream is the main tributary of the Umpher watershed; it provides a regular water supply and influences the pattern of the discharge rate. Undulating topography is the main feature of the watershed. The area-elevation curve shows that about 15.0 percent of the area of the watershed lies at higher elevations.
(c) Interpretation of Area-Elevation Curves:
The hypsometric curves (Fig.-6) indicate that the areas of the central Meghalaya plateau display a mature plateau relief. For example, the fractional areas under each relative height are almost equal in the Um-u-lah watershed, which is located on the flat lands of the Meghalaya plateau, while the Umpher watershed of the foot-hills has a smaller percentage of area under the categories of higher elevation (Fig.-6A).
Abbreviation: 1= Um-u-lah , 2= Paham Syiem and 3=Umpher Watersheds
Fig.-6: (A) Hypsometric Curves and (B) Area-Slope Curves for the Experimental Watersheds
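A hypsometric (area-elevation) curve is a cumulative frequency curve in which area plays the role of frequency. The Python sketch below shows the calculation under stated assumptions: the elevation bands and their areas are hypothetical figures, not measurements from the three watersheds, and the function name is illustrative.

```python
def area_above(band_areas):
    """Cumulative percentage of area at or above the lower limit of each band.

    `band_areas` lists (lower elevation of the band, area of the band),
    ordered from the lowest band to the highest.
    """
    total = sum(area for _, area in band_areas)
    remaining = total
    curve = []
    for elevation, area in band_areas:
        curve.append((elevation, 100.0 * remaining / total))
        remaining -= area
    return curve


# Hypothetical elevation bands (m a.s.l.) and their areas (ha).
bands = [(1310, 4.0), (1340, 6.0), (1370, 5.0), (1400, 3.0), (1430, 2.0)]

for elevation, pct in area_above(bands):
    print(f"{elevation} m: {pct:5.1f}% of the area lies at or above")
```

Plotting these percentages against elevation gives the hypsometric curve; this is the 'more than' form of cumulation applied to area.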
(d) Area-Slope Curves: Another Example of Cumulative Frequency Curve:
Slope is an important land surface feature. Its relationship with area in the watershed is significant for the study of slope features. It is shown through a frequency distribution, categorizing slopes by preparing slope maps (Fig.-7).
More than three-fourths of the area of the Um-u-lah watershed has slopes of moderate categories below 35%, while small areas in Paham Syiem (Nongpoh) and Umpher (Byrnihat) fall under steep slopes of more than 100.0 percent (a 100 percent slope being equal to 45°) (Fig.-6B). Flat hill tops with steep valley slopes are the topographic features of these watersheds. This is indicative of higher dissection. There is a longitudinal flat land of rice cultivation along the main stream at the mouth of the Paham Syiem watershed, while steep slopes are found either near the mouth or near the confluence areas of streams in these watersheds (Fig.-7).
Abbreviations: Figures in parentheses are in percent; 1= Flat (below 2%), 2= Very gentle (2-4), 3= Gentle (4-10), 4= Moderate (10-20), 5= Moderately steep (20-35), 6= Steep (35-60), 7= Very steep (60-100), 8= Most steep (100-175), 9= Extremely steep (above 175%)
Fig.-7: Slope Variations in the Experimental Watersheds: A Case of Frequency Distribution Showing Areas of Different Slope Classes
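The slope categories of the legend can be applied to any slope value with a short Python sketch. The class limits are those listed in the abbreviations above. The function names and the rule that a value lying exactly on a limit goes to the steeper class are assumptions. The average slope of Um-u-lah is computed from the relative height (130 m) and length (1.70 km) quoted in this module.

```python
import bisect
import math

# Upper limits (percent) of the first eight slope classes of the legend.
UPPER_LIMITS = [2, 4, 10, 20, 35, 60, 100, 175]
LABELS = [
    "Flat", "Very gentle", "Gentle", "Moderate", "Moderately steep",
    "Steep", "Very steep", "Most steep", "Extremely steep",
]


def slope_class(percent):
    """Return the legend category of a slope given in percent."""
    return LABELS[bisect.bisect_right(UPPER_LIMITS, percent)]


def percent_to_degrees(percent):
    """Convert a slope in percent (rise / run * 100) to degrees."""
    return math.degrees(math.atan(percent / 100.0))


# Average slope of the Um-u-lah watershed: 130 m of relief over 1,700 m.
um_u_lah = 100.0 * 130 / 1700
print(round(um_u_lah, 1), slope_class(um_u_lah))   # 7.6 Gentle
print(percent_to_degrees(100.0))                   # a 100 percent slope is 45 degrees
```

Counting the area that falls in each category, and cumulating it, gives the area-slope curve in the same way as the rainfall classes give the ogive.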
6.0 Summary:
Processing of collected data and its arrangement in table form are fundamental steps for the further application of statistical techniques to infer results from data. Editing, coding, classification of data and frequency distribution provide an accurate and correct procedure for data arrangement. Graphical depiction of a frequency distribution is an important tool for the interpretation of geographical distributions. The hypsometric curve (showing the area-elevation relationship), slope depiction (calculation of the frequency of different slope categories) and the mapping of geographical attributes are implicitly concerned with frequency distribution.
References
- Pal, S.K. (1998): Statistics for Geoscientists – Techniques and Applications, Concept Publishing Co., New Delhi.
- Kothari, C.R. (2002): Research Methodology – Methods and Techniques (II Edn.), Wishwa Prakashan, New Delhi.
- Mahmood, Aslam (1977): Statistical Methods in Geographical Studies, Rajesh Publications, New Delhi.
- Alvi, Z. (1995): Statistical Geography – Methods and Applications, Rawat Publications, New Delhi.
- WWWabs.gov.au/3121120nsf/home/statistical+language+-+frequency++distribution.
- WWW. en.wikipedia.org./wiki/frequency – distribution.