8 Transformation of Data for Geographical Analysis
Prof Bimal Kar
Introduction
With the development of information technology, both the magnitude and the dimensions of available data have increased, and this raises many issues in arriving at scientific explanations of hidden geographical facts. A large amount of multi-dimensional data about a real-world situation can sometimes create confusion and yield incorrect information when facts are drawn from the collected data. Without proper knowledge of data handling, the selection and use of attributes may lead geographical analysis in a wrong direction. Selected attributes/variables may also be used wrongly if their proper form for effective analysis is not known. Transforming these variables makes them usable for obtaining appropriate results and for understanding their characteristic features in a geographic frame. Transformation of a data set is therefore an important step in the procedural analysis of geographic phenomena. When information or data about various phenomena, geographical or otherwise, is converted into a different form, it is called data transformation. It is generally done to enhance the information content and to make meaningful representation and analysis possible. Data transformation is also sometimes necessary to make the data of attributes measured in different units dimensionless, so that they can be compared and aggregated.
Scales of Measurement
Information about any phenomenon, including its dimension, can be expressed through four different levels or scales of measurement. In increasing order of information content, they are Nominal, Ordinal, Interval and Ratio. The nominal scale, which is the initial form of measurement and carries no numerical value, gives information about various phenomena in terms of name, category, class, etc., viz. soil types, vegetation types, rock categories, settlement types, land use categories, etc. Of course, qualitative statements of this kind are quite significant in scientific analysis. But such statements are informative only, and they do not provide the logical clues needed to reach the reality. For example, consider the two statements given below; one is qualitatively informative and the other is numerically data dependent:
(a) ‘Cherrapunjee area has extremely high annual rainfall.’
(b) ‘Cherrapunjee area receives 11,000 mm of average annual rainfall.’
What do you understand and perceive about the magnitude of annual rainfall at Cherrapunjee? Of these two statements, the first is based purely on perception, without accurate information about the amount of rainfall. If the same statement were made in Rajasthan, a person might take 'extremely high rainfall' to mean 2000-3000 mm, because annual rainfall in Rajasthan does not go beyond 1500 mm. The same statement is perceived quite differently at Cherrapunjee. The second statement, on the other hand, gives an accurate picture of annual rainfall because of its numerical form. Quantitative data used for scientific explanation therefore provide correct and accurate information for understanding the facts. Hence, the nominal dimension of data or information (i.e. qualitative) is used for an elementary search that prepares background material for fact finding.
When the information about any phenomenon or attribute is expressed in terms of relative importance or magnitude, rank or quantitative order, it is called ordinal scale of measurement. Although it is superior to the nominal scale of measurement, it cannot give the absolute difference of value between ranks or orders. It does not capture the exact values in a given data series, because it treats successive terms in the series as equally spaced ranks, whereas the intervals between the actual values are not equal. For example, the median (a measure of central tendency) is calculated from the rank/order (descending/ascending) of the observed values of a series, that is, from the ordered arrangement of the data.
Interval scale of measurement of phenomena is purely quantitative. It considers the individual value of each observation in a series. Exact interval between values in a series is the main consideration in this measurement. Hence, when the information about a phenomenon is given in terms of some absolute value or magnitude, it is called interval scale of measurement.
Ratio scale is more comparative and provides good ground for geographical analysis. Quantitative information about a phenomenon that can be meaningfully converted into a ratio is said to be on the ratio scale of measurement. The basic difference between the interval and ratio scales is that the interval scale does not have an absolute zero, whereas the ratio scale does. Hence, a good example of interval scale data is temperature measured in °F or °C, neither of which has an absolute zero. This type of data cannot be converted into meaningful ratio data, because 0 °C cannot be considered as no temperature, and a temperature of 20 °C cannot be said to be two times hotter than 10 °C. But distance, weight, area, production, etc. are examples of the ratio scale of measurement, because all of these have an absolute zero and can be converted into meaningful ratio values.
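A small numeric check may make the interval/ratio distinction concrete. The sketch below is only an illustration: the helper names `c_to_f` and `km_to_miles` are hypothetical, and the distances of 10 km and 20 km are chosen only to mirror the 10 °C and 20 °C of the text. It shows that the apparent ratio of two temperatures changes with the unit, while the ratio of two distances does not.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def km_to_miles(km):
    """Convert kilometres to miles (1 km is about 0.621371 miles)."""
    return km * 0.621371


# Interval scale: the "ratio" of two temperatures depends on the unit chosen.
ratio_c = 20.0 / 10.0                    # ratio of the Celsius readings
ratio_f = c_to_f(20.0) / c_to_f(10.0)    # same temperatures, Fahrenheit readings

# Ratio scale: the ratio of two distances is the same in any unit.
ratio_km = 20.0 / 10.0
ratio_mi = km_to_miles(20.0) / km_to_miles(10.0)

print("temperature ratios:", ratio_c, ratio_f)   # the two values differ
print("distance ratios:", ratio_km, ratio_mi)    # the two values agree
```

Because the Celsius and Fahrenheit zeros are arbitrary, the first pair of ratios disagree; kilometres and miles share a true zero, so the second pair agree.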
Data Transformation Techniques
The data transformation basically involves conversion from nominal to ordinal, interval and ratio scales of measurement, and conversion of numerical information given in interval and ratio scales of measurement into various other forms of interval and ratio data for meaningful representation and analysis.
Data transformation from nominal scale to ordinal and interval/ratio scales
Let us first see how information or data in the nominal scale of measurement, whether point, line or polygon data, can be converted into ordinal, interval and ratio scales of measurement. For this purpose, the classification of settlements into village and town is taken as a nominal scale of measurement. When the towns among the settlements in an area are ordered or ranked from top to bottom with respect to their population size, area, health facility, etc., it is called conversion from nominal to ordinal scale of measurement. If these ranked towns are categorized with respect to absolute population size or area, it is called transformation from ordinal to interval scale of measurement. When the absolute data (i.e., population size) of the categorized towns can be meaningfully converted into ratios, in terms of how many times the first order town is larger than the second order one, the second order town than the third order one, and so on, it is called conversion from interval into ratio scale of measurement.
In order to illustrate the above, let us take an example relating to data about settlements as given below (Table-1):
Table-1: Capital Towns of the States of North East India and their Population, 2011
Categorization of settlements, based on functional character, mainly into village and town is a nominal scale of measurement. Accordingly, the grouping of all settlements of North East India into villages and towns constitutes nominal data. Similarly, a list of the capital towns of North East India, viz. Itanagar, Dispur (Guwahati), Imphal, Shillong, Aizawl, Kohima and Agartala, is also nominal data. When the population size of these capital towns is given in relative terms (rank), it is called ordinal data. It is obvious from the second column of Table-1 that Guwahati is the 1st rank city and Agartala the 2nd rank city in the North East, but the ranks alone do not tell us the exact population of these towns. If the population size of these towns is added to the data series and used for population analysis, the data are on the interval scale. Data used in proportionate form are on the ratio scale, which links the proportionate strength of the attribute's values.
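The progression from names to ranks to ratios can be sketched in a few lines of Python. The town names and population figures below are hypothetical placeholders, not the Table-1 values; they are chosen only so that the arithmetic is easy to follow.

```python
# Hypothetical population figures (placeholders, NOT the Table-1 values).
towns = {"Town A": 120000, "Town B": 480000, "Town C": 60000, "Town D": 240000}

# Nominal: only the names (categories) are known.
names = sorted(towns)

# Ordinal: rank 1 is the largest town; the gaps between ranks are unknown.
ordered = sorted(towns, key=towns.get, reverse=True)
rank = {name: position for position, name in enumerate(ordered, start=1)}

# Interval/ratio: absolute sizes allow "how many times larger" statements
# between each town and the next one down the ranking.
ratio_to_next = {
    bigger: towns[bigger] / towns[smaller]
    for bigger, smaller in zip(ordered, ordered[1:])
}

print(names)
print(rank)
print(ratio_to_next)
```

The ranks alone would be identical for any populations in the same order; only the absolute figures make the ratios recoverable.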
The present discussion introduces the concept of data transformation and the correct form of data to be used in fact finding. There are many ways of transforming data, which can be used as required to make the explanation of facts more accurate and sound. Here we describe the three most familiar and useful methods of data transformation.
(a) Non-linear to Linear Form of Data Series:
A time series of an attribute may fluctuate rather than change smoothly over time. This means there may be extreme values among the observations in a distribution, and there may be many causes behind them. In order to find the reasons for fluctuation and fast change in a data series, the nature of its change may be described suitably by using linear transformation. In the case of the trend of population growth of towns over a long period of time, population varies from a few thousands in small towns to several millions in the large cities of a country or a large state. If such variation of population growth and its trend is not linear, it can be transformed into a linear scale through some appropriate mathematical conversion technique.
Example-1: To show the trend of population change in Assam during the period 1901-2011:
In order to illustrate the above example, a set of population data of Assam for the period 1901-2011 is considered (Table -2).
Table -2: Trend of Population in Assam, 1901-2011
Census Year | Population (in millions) | Transformation of population data in logarithmic form (log Population) |
1901 | 3.289 | 6.52 |
1911 | 3.848 | 6.58 |
1921 | 4.638 | 6.67 |
1931 | 5.560 | 6.74 |
1941 | 6.694 | 6.83 |
1951 | 8.028 | 6.90 |
1961 | 10.839 | 7.03 |
1971 | 14.625 | 7.16 |
1981 | 18.041 | 7.26 |
1991 | 22.414 | 7.35 |
2001 | 26.655 | 7.43 |
2011 | 31.205 | 7.49 |
As the population of Assam increased rapidly from as low as 3.29 million in 1901 to 31.21 million in 2011, it follows a non-linear/exponential trend. Viewed in segments, it exhibits two roughly linear trends, the first during 1901-1951 and the second during 1951-2011, with the growth of population being more rapid during the later period, i.e. 1951-2011. This is visible in the line graph showing the trend of population growth based on actual data (Fig.-1). However, through the logarithmic conversion of the population data, the trend line is transformed into a linear one (Fig.-2).
Since time is scaled in uniform units and population is the variable, the linear form of increase in actual population size is
$P_1 = P_0 + r P_0 t = P_0 (1 + r t)$, ………….. (1)
where $P_1$ denotes the population of the current year, $P_0$ is the population of the base year, $r$ is the rate of population increase and $t$ is time.
In fact, this form does not fit the general trend of population increase. The following equation is somewhat more appropriate, and a semi-log transformation is used to make it linear:
$P_1 = P_0 (1 + r)^t$. ………….. (2)
Its logarithmic form can be expressed as follows:
$\log P_1 = \log P_0 + t \log(1 + r)$. ………….. (3)
In this form $\log P_1$ is a linear function of $t$, with intercept $\log P_0$ and slope $\log(1 + r)$. Note that $t$ follows a simple (arithmetic) scale while population is in log-transformed form (Fig.-2).
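The logarithmic form in equation (3) can be tried on the Table-2 figures. The sketch below is one possible way of doing so, not the procedure used to draw Fig.-2: it assumes base-10 logarithms of the population in persons, takes t as years elapsed since 1901, and fits a straight line by ordinary least squares. The function name `least_squares_line` is a hypothetical helper.

```python
import math

# Census years and population of Assam in millions, as given in Table-2.
years = [1901, 1911, 1921, 1931, 1941, 1951,
         1961, 1971, 1981, 1991, 2001, 2011]
population_millions = [3.289, 3.848, 4.638, 5.560, 6.694, 8.028,
                       10.839, 14.625, 18.041, 22.414, 26.655, 31.205]

# Semi-log transformation: log10 of the population in persons.
log_pop = [math.log10(p * 1_000_000) for p in population_millions]


def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# t is measured in years elapsed since the first census in the table.
t = [year - years[0] for year in years]
slope, intercept = least_squares_line(t, log_pop)

# In equation (3) the slope is log(1 + r) and the intercept is log P0.
r = 10 ** slope - 1
p0_fitted = 10 ** intercept

print("log10 population:", [round(v, 2) for v in log_pop])
print("fitted annual growth rate r:", r)
print("fitted base-year population:", p0_fitted)
```

Under these assumptions the fitted annual rate comes out at roughly 2 per cent. It summarizes the whole period 1901-2011 in a single figure and so averages over the slower and faster phases noted in the text.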
Fig-1: Population Increase in Assam (1901-2011)
Fig. -2: Linear Trend of Log-Transformed Population Statistics (time series data on the X-axis are in simple scale and the Y-axis shows log population)
(b) Transformation from Skewed to Normal Distribution
Many attributes, such as land holding size, income, town population, population density, etc., have a skewed distribution, and it sometimes becomes difficult to represent such data, with their extreme values, in a histogram and/or frequency curve. In view of this, data transformation is needed to normalize a skewed data distribution. To make the concept clearer, two examples are considered: one from the distribution of land holding size and another from the distribution of population density among the districts of Assam.
Example-I:
To show the frequency distribution of land holding size in Assam and its normalized distribution using a power function.
For this exercise, land holding size data for the state of Assam were collected from the Agriculture Census. As per the standard classification of land holding size, the total number of land holders has been classified into ten categories, as given in Table -3.
Table 3: Distribution of Land in Assam, 2010-11
Here, a data set on the distribution of land holding size in Assam (Table -3) is considered to study the nature of the distribution of land. As expected, the actual data present a highly uneven distribution, with the frequency concentrated in the very small sizes (less than 2.0 ha) and only a few land holders having large holdings. Such concavity in the frequency distribution shows its unevenness and skewness (Fig.-3). If the land holding sizes are transformed by a power function with exponent one half (i.e., the square root), the frequency curve becomes flatter, with a lower degree of skewness (Fig.-4).
Fig. 3: Distribution of Land Holding size in Assam 2010-2011 (based on data collected from Office)
Fig. -4: Distribution of Land Holding size in Assam 2010-2011 (based on Square Root Transformed data)
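The effect of the square root transformation on skewness can be checked numerically. Since the Table-3 figures are not reproduced here, the sketch below uses a hypothetical list of holding sizes (in ha) that merely imitates the pattern described, many small holdings and a few large ones. The skewness measure used is the moment coefficient (third central moment divided by the 1.5 power of the second), which is one common choice and an assumption on our part.

```python
import math


def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical holding sizes in hectares (NOT the Agriculture Census data).
holdings_ha = [0.2, 0.3, 0.4, 0.5, 0.5, 0.8, 1.0,
               1.2, 1.5, 2.0, 3.0, 4.5, 8.0, 12.0]

# Power transformation with exponent 1/2 (square root).
sqrt_holdings = [math.sqrt(h) for h in holdings_ha]

skew_raw = skewness(holdings_ha)
skew_sqrt = skewness(sqrt_holdings)

print("skewness of raw sizes:", skew_raw)
print("skewness after square root:", skew_sqrt)
```

For these illustrative values the transformed series remains positively skewed, but less so than the raw series, which is the behaviour the comparison of Fig.-3 and Fig.-4 describes.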
Example-II:
Now, let us see how to group a set of individual data of district level population density (persons/km²) of Assam (as per 2011 Census data) in the form of a frequency distribution and represent the same in the form of a histogram.
Population density (persons/sq. km) for 27 districts of Assam (2011):
280, 1171, 553, 632, 618, 711, 365, 457, 213, 347, 393, 431, 383, 302, 93, 44, 459, 673, 497,
425, 244, 436, 2010, 763, 475, 491 and 497.
The frequency distribution of the number of districts in each population density class, using a uniform class width of 100 from 0-100 up to 2,000-2,100, and the resultant histogram are as follows (Table -4 and Fig. -5).
Table 4: Distribution of Population Density among the Districts of Assam, 2011
Fig. -5: Population density Pattern in Assam, 2011
It is observed that a large number of classes (12 classes) do not have any observation (zero frequency), and accordingly the histogram appears almost discrete and less representative. In view of this problem, the individual population density values are converted into logarithms and then grouped into a frequency distribution with a class interval of 0.2 (Table -5). This transformation makes the frequency distribution of population density (now with only one zero-frequency class) more meaningful and more representative (Fig.- 6).
Table 5: Distribution of Population Density (Log converted) Among the Districts of Assam, 2011
Fig.- 6: Log transformed Population Density in Assam 2011
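The two groupings can be reproduced from the 27 density values listed above. The following is a minimal sketch: the helper `class_counts` is hypothetical, base-10 logarithms are assumed, and the log classes are assumed to start at 1.6 (the text gives the class interval of 0.2 but not the starting point).

```python
import math

# Population density (persons per sq. km) of the 27 districts of Assam, 2011,
# as listed in the text.
density = [280, 1171, 553, 632, 618, 711, 365, 457, 213, 347, 393, 431, 383,
           302, 93, 44, 459, 673, 497, 425, 244, 436, 2010, 763, 475, 491, 497]


def class_counts(values, start, width, n_classes):
    """Count the values falling in each class [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        counts[int((v - start) // width)] += 1
    return counts


# Raw data: 21 classes of width 100, from 0-100 up to 2,000-2,100.
raw_counts = class_counts(density, 0, 100, 21)

# Log-transformed data: 9 classes of width 0.2, assumed to run from 1.6 to 3.4.
log_density = [math.log10(d) for d in density]
log_counts = class_counts(log_density, 1.6, 0.2, 9)

print("raw class frequencies:", raw_counts)
print("empty raw classes:", raw_counts.count(0))
print("log class frequencies:", log_counts)
print("empty log classes:", log_counts.count(0))
```

With these class limits the raw grouping leaves 12 of its 21 classes empty, while the log grouping leaves only one of its 9 classes empty, in line with the counts stated in the text.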
(c) Transformation for Creation of Dimensionless (Free Scale) Data
Data on different attributes with varying units of measurement (e.g., literacy rate in percentage, road density in km/km², life expectancy in years, income in money value, and so on) can be transformed by applying different statistical techniques. Among these, Z-transformation and the simple transformation index are quite useful. These two simple techniques not only make the data dimensionless but also give a measure of the relative magnitude of every value in comparison to the mean, which represents the central tendency. This transformation finally helps in aggregating the data to compute a composite score. The following examples illustrate the procedure and utility of these techniques.
Example-I:
Z-Transformation to Make Dimensionless Data Series: In traditional mapping techniques, maps of different attributes are superimposed to understand their composite picture. Statistically, the attributes are converted into standardized scores, which are said to be scale-free because after Z-transformation the mean (central value) and SD (dispersion) of the distribution become zero and unity respectively, while the pattern of the distribution remains unchanged, because
$Z = (X - \bar{X}) / SD$ ……………. (4)
To make the data dimensionless, a data set of literacy rate, hospital beds and road density for six districts of Assam for the year 2011 is taken (Table- 6).
Table -6: Socio-Economic data about few attributes for Selected Districts of Assam, 2011
By calculating the Z-value for each attribute and district, the original values become dimensionless (Table-7). These Z-values also indicate the deviation of the attribute value from the mean (in both positive and negative terms), expressed as a multiple of the standard deviation. The Z-value is 0 (zero) when the observed value of an attribute equals its mean, negative when the attribute value is smaller than the mean, and positive when the attribute value is larger than the mean. Hence, aggregation of these Z-values is possible, and it gives the relative cumulative magnitude of all the attributes.
Table 7: Z-Transformed Values of each observation of the above data of Table 6
As the mean and SD of the transformed series are zero and unity respectively for each distribution, the standard scores of the attributes are now comparable, and their magnitudes are aggregated to give a composite index of distribution. Table-7, which contains the district-wise composite index of socio-economic level, shows that literacy rate and hospital bed density have the highest scores in Kamrup Metro district, but that the district is less developed in its socio-economic level.
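The procedure of equation (4) and the aggregation into a composite index can be sketched as follows. The Table-6 figures are not reproduced here, so the values below are hypothetical placeholders for six unnamed districts, and their units are assumptions. The population standard deviation (`statistics.pstdev`) is used so that each transformed series has SD exactly equal to unity; a sample SD would give slightly different scores.

```python
import statistics

# Hypothetical values for six districts (placeholders, NOT the Table-6 data).
data = {
    "literacy_rate": [72.0, 88.5, 65.3, 70.1, 79.8, 67.4],   # per cent
    "hospital_beds": [35.0, 120.0, 48.0, 30.0, 55.0, 42.0],  # assumed unit
    "road_density": [0.45, 0.90, 0.38, 0.62, 0.71, 0.50],    # km per sq. km
}


def z_scores(values):
    """Z = (X - mean) / SD for every value, using the population SD."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Dimensionless scores for each attribute.
z = {name: z_scores(values) for name, values in data.items()}

# Composite index: sum of the Z-values of all attributes for each district.
n_districts = len(data["literacy_rate"])
composite = [sum(column[i] for column in z.values()) for i in range(n_districts)]

for name, column in z.items():
    print(name, [round(v, 2) for v in column])
print("composite index:", [round(v, 2) for v in composite])
```

Each transformed column has mean zero and SD one, so the attributes contribute to the composite index on an equal footing regardless of their original units.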
Example-II:
Ratio-Transformation Method:
It considers only a 'mean-unity' base for transforming the observed values in a data series. The ratio for each individual value is calculated by dividing the observed value by the mean of the series. For instance, the same data of socio-economic attributes given in Table-6 are transformed to get dimensionless ratio-transformed values. The emerging pattern of these transformed attributes is different from that of the Z-transformed scores. Kokrajhar district stands first in hospital facilities and Nagaon in road density. This is due to the different basis of transformation: Z-transformation is mean and SD free scaling, while ratio-transformation uses only a mean-unity base to make the data dimensionless.
Transformation of each individual value can thus be done by dividing it by the mean value (Table 8). It makes the data unitless, and the resulting value also indicates the extent of the attribute value (small or large) in comparison to the corresponding mean. Such a value is 1 when the observed value of an attribute equals its mean, less than 1 when the attribute value is smaller than the mean, and more than 1 when the attribute value is higher than the mean. The aggregation of such values also finally gives the relative cumulative magnitude of all the attributes.
Table 8: Data Transformation through Ratio-Transformation Quotient ($SI_{ij} = X_{ij} / \bar{X}_i$) of the above data of Table- 6
Note: To get Composite Index for each district all the columns are to be added
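The ratio transformation and the composite index described in the note can be sketched in the same way. Again the figures are hypothetical placeholders rather than the Table-6 values, and the function name `ratio_to_mean` is our own.

```python
# Hypothetical values for six districts (placeholders, NOT the Table-6 data).
data = {
    "literacy_rate": [72.0, 88.5, 65.3, 70.1, 79.8, 67.4],
    "hospital_beds": [35.0, 120.0, 48.0, 30.0, 55.0, 42.0],
    "road_density": [0.45, 0.90, 0.38, 0.62, 0.71, 0.50],
}


def ratio_to_mean(values):
    """Divide every value by the mean of its series (mean-unity base)."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]


# Dimensionless quotients: 1 means "equal to the mean of that attribute".
quotients = {name: ratio_to_mean(values) for name, values in data.items()}

# Composite index: add the quotients of all attributes for each district.
n_districts = len(data["literacy_rate"])
composite = [
    sum(column[i] for column in quotients.values()) for i in range(n_districts)
]

for name, column in quotients.items():
    print(name, [round(v, 2) for v in column])
print("composite index:", [round(v, 2) for v in composite])
```

Unlike Z-scores, these quotients keep the relative spread of each attribute, so an attribute with large variation among districts carries more weight in the composite index. This is one reason the two methods can rank districts differently.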
Summary:
It is thus clear from the above discussion that data about various attributes or phenomena can be found in different forms and at different levels of measurement. Data of different forms, viz. non-linear, exponential, skewed, etc., can be converted into linear, normal, etc. forms according to our need and convenience through statistical techniques like logarithmic, square root and square conversions. Besides, for aggregation of data about various attributes with different units of measurement, data conversion can be done through Z-transformation and the simple transformation quotient.