21 Mapping the Principal Components (aggregated factors)

Prof. Aslam Mahmood

epgp books

   

 

 

 

 

(1)  E-Text

 

In a principal component analysis, the basic objective is to summarize a large set of data through a smaller number of dimensions, in such a way that a large proportion of the reality portrayed by all the variables is retained by a smaller number of principal components. This exercise helps the researcher in handling a large body of data and in measuring a phenomenon for which no direct information is available. The scores of the principal components can also be mapped to portray that phenomenon.

 

The exercise starts with a large body of data relevant to the desired phenomenon, collected by the researcher directly or indirectly through primary or secondary sources. After analysing the data through descriptive statistics such as means and standard deviations, the data on the n variables are first converted into Z-scores for further mathematical treatment. Z-scores are free from units of measurement, as they are only ratios; they are also standardized scores, as they all have zero means and unit standard deviations. After converting the data into Z-scores, the scores are converted into weighted composite scores of the different principal components. The weights are obtained from the eigenvectors of the different eigenvalues of the correlation matrix of all the variables, with the length of each eigenvector normalized to the inverse of its eigenvalue. For the first principal component, the weights used are the elements of the eigenvector corresponding to the highest eigenvalue; for the second principal component, the weights are the elements of the eigenvector corresponding to the second highest eigenvalue, and so on.

 

After the Z-scores, an n×n correlation matrix is worked out, giving the inter-correlations between all pairs of variables. From this correlation matrix the eigenvalues and their corresponding eigenvectors are worked out, and only those whose eigenvalue is greater than 1 are retained. The eigenvalues are then arranged in descending order. The length of each retained eigenvector is then normalized to the inverse of its eigenvalue, and the elements of these vectors are used as weights to prepare the weighted combinations of Z-scores.
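The steps above can be sketched in Python with NumPy (an assumption — the chapter itself uses SPSS). The function below standardizes the data, builds the correlation matrix, extracts its eigenvalues and eigenvectors, sorts them in descending order, and flags the components with eigenvalue greater than 1:

```python
# A minimal sketch of the procedure described above, assuming NumPy.
import numpy as np

def pca_eigen(X):
    """Return eigenvalues (descending) and matching eigenvectors
    of the correlation matrix of data X (rows = observations)."""
    n = X.shape[0]
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # Z-scores
    R = (Z.T @ Z) / (n - 1)                            # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # ascending order
    order = np.argsort(eigvals)[::-1]                  # re-sort descending
    return eigvals[order], eigvecs[:, order]

# Illustrative (made-up) data: 10 observations on 3 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
vals, vecs = pca_eigen(X)
keep = vals > 1                 # retain only eigenvalues greater than 1
# The sum of the eigenvalues equals the trace of R, i.e. the number of variables
print(round(vals.sum(), 6))     # 3.0
```

The data here are random and purely illustrative; the chapter's own data set is introduced below.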

 

Example:

 

To explain the above-mentioned steps, let us take an example of summarizing, with the help of principal component analysis, a set of data on five variables of development for 10 districts of an area, as given below:

  1. Percentage of workers in secondary sector (X1)
  2. Rate of population growth (X2)
  3. Literacy (X3)
  4. Population density (X4)
  5. Percentage urban (X5)
S.N. X1 X2 X3 X4 X5
1 12 9 55 90 6.5
2 15 11 67 70 7
3 16 10 70 80 10
4 22 18 50 110 14.5
5 21 19 55 105 13
6 17 12 45 95 11
7 21 15 50 100 14.5
8 31 40 70 110 17
9 32 43 75 120 18
10 18 11 70 90 12

 

The above set of data can be subjected to a principal component analysis using SPSS. The package has several options for the output, not all of which are required for a first understanding of the method. The most essential are the descriptive statistics, the correlation matrix, its eigenvalues and associated eigenvectors, the component matrix, the component score coefficient matrix and the factor scores. All these outputs are explained in detail below.
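As a cross-check on the SPSS output that follows, the same quantities can be computed with NumPy (a sketch, not the chapter's own method) from the data table above:

```python
# Reproducing the descriptive statistics and eigenvalues with NumPy —
# a sketch assuming the chapter's 10-district data set.
import numpy as np

X = np.array([
    [12,  9, 55,  90,  6.5],
    [15, 11, 67,  70,  7.0],
    [16, 10, 70,  80, 10.0],
    [22, 18, 50, 110, 14.5],
    [21, 19, 55, 105, 13.0],
    [17, 12, 45,  95, 11.0],
    [21, 15, 50, 100, 14.5],
    [31, 40, 70, 110, 17.0],
    [32, 43, 75, 120, 18.0],
    [18, 11, 70,  90, 12.0],
])

print(X.mean(axis=0))             # means: 20.5, 18.8, 60.7, 97.0, 12.35
print(X.std(axis=0, ddof=1))      # sample standard deviations, as in SPSS

R = np.corrcoef(X, rowvar=False)  # 5x5 correlation matrix
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
print(eigvals.round(3))           # should match the eigenvalues reported below
```

Note that SPSS uses the sample standard deviation (divisor n − 1), hence `ddof=1`.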

 

The first output gives the descriptive statistics: the variables, their mean values and standard deviations, with the number of observations, as given below.

 

Descriptive Statistics

Mean Std. Deviation Analysis N
VAR00001 20.5000 6.55320 10
VAR00002 18.8000 12.43472 10
VAR00003 60.7000 10.77085 10
VAR00004 97.0000 15.12907 10
VAR00005 12.3500 3.85177 10

The second output is the correlation matrix between all pairs of variables. It is a 5×5 symmetric matrix with unity in the diagonal; the off-diagonal elements give the correlation coefficients between the different variables. The output of the correlation matrix is given below:

 

Correlation Matrix

VAR00001 VAR00002 VAR00003 VAR00004 VAR00005
Correlation VAR00001 1.000 .964 .380 .824 .941
VAR00002 .964 1.000 .457 .773 .829
VAR00003 .380 .457 1.000 -.061 .188
VAR00004 .824 .773 -.061 1.000 .878
VAR00005 .941 .829 .188 .878 1.000

 

 

The next step in this exercise of principal component analysis is to find the eigenvalues. The package computes the eigenvalues and puts them in descending order, as given here: 3.697, 1.092, 0.127, 0.083 and 0.001. Note that the sum of all these eigenvalues is 5.000, verifying the property that the sum of the eigenvalues of a matrix is equal to the sum of its diagonal elements, which is 5 in the present case.

 

Also note that the first eigenvalue, which is also the variance explained by the first principal component, comprises (3.697/5)×100 = 73.936% of the total variance. The second eigenvalue is 1.092 and comprises (1.092/5)×100 = 21.832% of the total variance. The cumulative effect of the first and second principal components is to explain 95.767% of the total variance in the variables, which is quite high and may be considered sufficient to summarize the five original variables. Further components do not add much and may be considered redundant.

 

The programme has a default command to discard all eigenvalues < 1. Due to this command, only the two eigenvalues > 1 are given in the next table, along with their proportional strength. Researchers have the option to change this default value as per their choice.

 

Total Variance Explained

Extraction Sums of Squared Loadings
Component Total % of Variance Cumulative %
1 3.697 73.936 73.936
2 1.092 21.832 95.767

 

       Extraction Method: Principal Component Analysis.

 

Once the two eigenvalues have been extracted, the programme also gives the eigenvectors whose lengths have been normalized to their eigenvalues (the sums of the squared values of all their elements are equal to 3.697 and 1.092 respectively). The values under each column are also known as factor loadings and are the coefficients of correlation of each variable with the first and second principal components. All these values are given below in the Component Matrix.

 

 

Component Matrix(a)

Component
1 2
VAR00001 .991 .051
VAR00002 .957 .156
VAR00003 .348 .929
VAR00004 .879 -.418
VAR00005 .950 -.165

     Extraction Method: Principal Component Analysis.

 

a  2 components extracted

 

The component matrix given above gives the coefficients of correlation of all five variables with the first and second principal components. These values are also known as "factor loadings" in the literature on principal components and help in making sense of these linear combinations of the constituent variables.
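The loadings in the Component Matrix can be reproduced by scaling each retained eigenvector so that its length is the square root of its eigenvalue — then the sum of the squared loadings of a component equals its eigenvalue. A sketch with NumPy (the sign of an eigenvector is arbitrary, so the code fixes each column's sign to match the table):

```python
# Factor loadings = eigenvectors of R scaled to length sqrt(lambda) —
# a sketch assuming NumPy and the chapter's data set.
import numpy as np

X = np.array([
    [12,  9, 55,  90,  6.5], [15, 11, 67,  70,  7.0],
    [16, 10, 70,  80, 10.0], [22, 18, 50, 110, 14.5],
    [21, 19, 55, 105, 13.0], [17, 12, 45,  95, 11.0],
    [21, 15, 50, 100, 14.5], [31, 40, 70, 110, 17.0],
    [32, 43, 75, 120, 18.0], [18, 11, 70,  90, 12.0],
])
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1][:2]          # two components with lambda > 1
vals, vecs = eigvals[order], eigvecs[:, order]
# Make the largest-magnitude element of each column positive (sign convention)
vecs *= np.sign(vecs[np.abs(vecs).argmax(axis=0), [0, 1]])
loadings = vecs * np.sqrt(vals)                # the Component Matrix
print(loadings.round(3))
print((loadings**2).sum(axis=0).round(3))      # equals 3.697 and 1.092
```

The printed matrix should agree (up to rounding) with the Component Matrix above.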

 

The first principal component is found to have a high positive correlation with the percentage of workers in the secondary sector (X1), the rate of population growth (X2), population density (X4) and the percentage of urban population (X5). It shows a lower positive correlation with literacy (X3). This pattern of relationships can be named "Industrial Development", as it shows crowding, migration and the expansion of industrial activities, but not much literacy.

 

The second principal component shows a high positive relationship with literacy (X3) and negative relationships with population density (X4) and urbanization (X5). The pattern appears to be just the opposite: away from crowding and industrialization, and going with literacy. Such a pattern of relationships may be called "Social Development".

Like Z-scores, our principal component scores are also normalized to unit standard deviations; this is achieved when the length of the eigenvector giving the weights is normalized to the inverse of the corresponding eigenvalue λ. The next output of the exercise is the component score coefficient matrix, giving the values of the elements of the above eigenvectors after normalizing their lengths to 1/λ, as in the following table.

 

Component Score Coefficient Matrix

Component
1 2
VAR00001 .268 .047
VAR00002 .259 .143
VAR00003 .094 .851
VAR00004 .238 -.383
VAR00005 .257 -.151

 

Extraction Method: Principal Component Analysis.  Component Scores.

 

One can verify that the sum of the squared values of the first column is 0.271, which is the inverse of the first eigenvalue of 3.697. Similarly, the sum of the squared values of the second column is 0.916, which is the inverse of the second eigenvalue of 1.092.
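This relationship between the loadings and the score coefficients can be verified directly: dividing each loading column by its eigenvalue λ gives the score coefficients, and the squared coefficients of a column then sum to 1/λ. A sketch with NumPy:

```python
# Component score coefficients = loadings / lambda — a sketch assuming
# NumPy and the chapter's data set.
import numpy as np

X = np.array([
    [12,  9, 55,  90,  6.5], [15, 11, 67,  70,  7.0],
    [16, 10, 70,  80, 10.0], [22, 18, 50, 110, 14.5],
    [21, 19, 55, 105, 13.0], [17, 12, 45,  95, 11.0],
    [21, 15, 50, 100, 14.5], [31, 40, 70, 110, 17.0],
    [32, 43, 75, 120, 18.0], [18, 11, 70,  90, 12.0],
])
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1][:2]
vals, vecs = eigvals[order], eigvecs[:, order]
vecs *= np.sign(vecs[np.abs(vecs).argmax(axis=0), [0, 1]])  # fix signs
loadings = vecs * np.sqrt(vals)
coef = loadings / vals                  # score coefficient matrix
print(coef.round(3))
print((coef**2).sum(axis=0).round(3))   # approx. 0.271 and 0.916
```

The printed coefficients should agree (up to rounding) with the Component Score Coefficient Matrix above.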

 

After getting the weights for making scores of two principal components we are ready to work out component scores.

 

First, we convert the given data on the five variables into their Z-scores, as mentioned above.

 

The “Z” scores of the variables given in our exercise are given below:

 

Z1 Z2 Z3 Z4 Z5
-1.29708 -0.78812 -0.52921 -0.46269 -1.51878
-0.83928 -0.62728 0.584914 -1.78464 -1.38897
-0.68669 -0.7077 0.863444 -1.12366 -0.61011
0.228896 -0.06434 -0.99342 0.859273 0.558185
0.076299 0.016084 -0.52921 0.528783 0.168754
-0.53409 -0.54686 -1.45764 -0.1322 -0.35049
0.076299 -0.3056 -0.99342 0.198294 0.558185
1.602271 1.704904 0.863444 0.859273 1.207237
1.754868 1.946164 1.327661 1.520252 1.466858
-0.38149 -0.62728 0.863444 -0.46269 -0.09087

 

These Z values can be converted into principal component scores as linear combinations, using the weights given in the component score coefficient matrix above. For the first area, for example, the score of the first principal component will be:

 

(-1.29708)×0.268 + (-0.78812)×0.259 + (-0.52921)×0.094 + (-0.46269)×0.238 + (-1.51878)×0.257 = -1.1018.

 

Using the same weights with the Z values for the second area, we can work out the score of the first principal component for the second area. Likewise, we can work out the component scores for all 10 areas/observations, which are given below.

 

Component scores for the second principal component can be worked out in the same way, using the weights given in the second column of the component score coefficient matrix. For example, for the first area the score will be:

 

(-1.29708)×0.047 + (-0.78812)×0.143 + (-0.52921)×0.851 + (-0.46269)×(-0.383) + (-1.51878)×(-0.151) = -0.21679

 

 

Scores of first and second principal components

S.N. Scores of First P.C. Scores of second P.C.
1 -1.1018 -0.21679
2 -1.11353 1.261802
3 -0.70993 1.123697
4 0.298926 -1.2569
5 0.143889 -0.67226
6 -0.54343 -1.23957
7 0.038415 -1.04548
8 1.466621 0.541621
9 1.837543 0.685898
10 -0.31672 0.817971
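The whole table of component scores above can be reproduced (up to rounding) as the matrix product of the Z-scores and the score coefficient matrix. A sketch with NumPy, which also confirms that the scores, like Z-scores, have zero means and unit standard deviations:

```python
# Component scores = Z @ (score coefficient matrix) — a sketch assuming
# NumPy and the chapter's data set.
import numpy as np

X = np.array([
    [12,  9, 55,  90,  6.5], [15, 11, 67,  70,  7.0],
    [16, 10, 70,  80, 10.0], [22, 18, 50, 110, 14.5],
    [21, 19, 55, 105, 13.0], [17, 12, 45,  95, 11.0],
    [21, 15, 50, 100, 14.5], [31, 40, 70, 110, 17.0],
    [32, 43, 75, 120, 18.0], [18, 11, 70,  90, 12.0],
])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # Z-scores (sample s.d.)
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1][:2]
vals, vecs = eigvals[order], eigvecs[:, order]
vecs *= np.sign(vecs[np.abs(vecs).argmax(axis=0), [0, 1]])  # fix signs
coef = (vecs * np.sqrt(vals)) / vals       # score coefficient matrix
scores = Z @ coef                          # 10x2 matrix of component scores
print(scores.round(4))
print(scores.mean(axis=0).round(6))        # zero means
print(scores.std(axis=0, ddof=1).round(6)) # unit standard deviations
```

The first row should reproduce the hand-calculated values -1.1018 and -0.21679 up to rounding of the published weights.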

 

These factor scores of the first and second principal components give the relative position of each area in terms of the created dimensions of "Industrial Development" and "Social Development". It is interesting to note that the position of an area need not be similar on both dimensions. For example, area no. 2 is low on the first principal component (-1.11353) but quite high on the second. Similarly, areas no. 8 and 9 are the highest on the first principal component but rank lower on the second.

 

Mapping of Principal Component

 

To show spatial variations in the principal component scores, we can also map them. Before mapping these scores, one should note that, like Z-scores, they have zero means and unit standard deviations.

 

Roughly half of the scores, therefore, are positive and the rest are negative. The easiest way is thus to divide the scores into two sets, of positive and negative values, which may be called high and low.

 

Each set can be further divided into two equal parts by its median value. Thus we have four categories, which may be called: very low, low, high and very high.
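This four-category scheme can be sketched as a small helper function (an illustrative sketch, assuming NumPy; the function name `classify` is hypothetical), shown here applied to the first principal component's scores:

```python
# Split scores at zero into low/high, then split each half at its own
# median — the four-category scheme described above.
import numpy as np

def classify(scores):
    scores = np.asarray(scores, dtype=float)
    labels = np.empty(scores.shape, dtype=object)
    neg, pos = scores < 0, scores >= 0
    labels[neg & (scores <  np.median(scores[neg]))] = "very low"
    labels[neg & (scores >= np.median(scores[neg]))] = "low"
    labels[pos & (scores <  np.median(scores[pos]))] = "high"
    labels[pos & (scores >= np.median(scores[pos]))] = "very high"
    return labels

# Scores of the first principal component from the table above
pc1 = [-1.1018, -1.11353, -0.70993, 0.298926, 0.143889,
       -0.54343, 0.038415, 1.466621, 1.837543, -0.31672]
print(classify(pc1))
```

The resulting labels can then be assigned to the districts and shaded on a choropleth map.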

 

We can also use any alternative method of classifying the scores into categories and prepare a choropleth map or any other type of map.

 


 
