21 Mapping the Principal Components (aggregated factors)

Prof. Aslam Mahmood

epgp books

   

 

 

 

 

(1)  E-Text

 

In a principal component analysis, the basic objective is to summarize a large set of data through a smaller number of dimensions, in such a way that a large proportion of the reality portrayed by all the variables is retained by a smaller number of principal components. This exercise helps the researcher in handling a large body of data and in measuring a phenomenon for which no direct information is available. The scores of the principal components can also be mapped to portray that phenomenon.

 

The exercise starts with a large body of data relevant to the desired phenomenon, collected by the researcher directly or indirectly through primary or secondary sources. After analysing the data through descriptive statistics such as means and standard deviations, the data on the n variables are first converted into Z-scores for further mathematical treatment. Z-scores are free from units of measurement, as they are only ratios; they are also standardized scores, as they all have zero means and unit standard deviations. After converting the data into Z-scores, the scores are converted into weighted composite scores of the different principal components. The weights are obtained from the eigenvectors of the different eigenvalues of the correlation matrix of all the variables, with the length of each eigenvector normalized to the inverse of its eigenvalue. For the first principal component, the weights used are the elements of the eigenvector corresponding to the highest eigenvalue; for the second principal component, the weights are the elements of the eigenvector corresponding to the second highest eigenvalue, and so on.

 

After the Z-scores, an n×n correlation matrix is worked out, giving the inter-correlations between all pairs of variables. From this correlation matrix the eigenvalues and their corresponding eigenvectors are worked out, and only those whose eigenvalue is greater than 1 are retained. The eigenvalues are then arranged in descending order. The length of each retained eigenvector is then normalized to the inverse of its eigenvalue, and the elements of these vectors are used as weights to prepare the weighted combinations of Z-scores.
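The steps above can be sketched in Python with NumPy (an assumption — the chapter itself uses SPSS). The function below standardizes the data, builds the correlation matrix, extracts its eigenvalues and eigenvectors, sorts them in descending order, and flags the components with eigenvalue greater than 1:

```python
# A minimal sketch of the procedure described above, assuming NumPy.
import numpy as np

def pca_eigen(X):
    """Return eigenvalues (descending) and matching eigenvectors
    of the correlation matrix of data X (rows = observations)."""
    n = X.shape[0]
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # Z-scores
    R = (Z.T @ Z) / (n - 1)                            # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # ascending order
    order = np.argsort(eigvals)[::-1]                  # re-sort descending
    return eigvals[order], eigvecs[:, order]

# Illustrative (made-up) data: 10 observations on 3 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
vals, vecs = pca_eigen(X)
keep = vals > 1                 # retain only eigenvalues greater than 1
# The sum of the eigenvalues equals the trace of R, i.e. the number of variables
print(round(vals.sum(), 6))     # 3.0
```

The data here are random and purely illustrative; the chapter's own data set is introduced below.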

 

Example:

 

To explain the above-mentioned steps, let us take an example of summarizing, with the help of principal component analysis, a set of data on five variables of development for 10 districts of an area, as given below:

  1. Percentage of workers in secondary sector (X1)
  2. Rate of population growth (X2)
  3. Literacy (X3)
  4. Population density (X4)
  5. Percentage urban (X5)
S.N. X1 X2 X3 X4 X5
1 12 9 55 90 6.5
2 15 11 67 70 7
3 16 10 70 80 10
4 22 18 50 110 14.5
5 21 19 55 105 13
6 17 12 45 95 11
7 21 15 50 100 14.5
8 31 40 70 110 17
9 32 43 75 120 18
10 18 11 70 90 12

 

The above set of data can be subjected to a principal component analysis using SPSS. The package has several options for the output, not all of which are required for a first understanding of the method. The most essential are the descriptive statistics, the correlation matrix, its eigenvalues and associated eigenvectors, the component matrix, the component score coefficient matrix and the factor scores. All these outputs are explained in detail below.
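As a cross-check on the SPSS output that follows, the same quantities can be computed with NumPy (a sketch, not the chapter's own method) from the data table above:

```python
# Reproducing the descriptive statistics and eigenvalues with NumPy —
# a sketch assuming the chapter's 10-district data set.
import numpy as np

X = np.array([
    [12,  9, 55,  90,  6.5],
    [15, 11, 67,  70,  7.0],
    [16, 10, 70,  80, 10.0],
    [22, 18, 50, 110, 14.5],
    [21, 19, 55, 105, 13.0],
    [17, 12, 45,  95, 11.0],
    [21, 15, 50, 100, 14.5],
    [31, 40, 70, 110, 17.0],
    [32, 43, 75, 120, 18.0],
    [18, 11, 70,  90, 12.0],
])

print(X.mean(axis=0))             # means: 20.5, 18.8, 60.7, 97.0, 12.35
print(X.std(axis=0, ddof=1))      # sample standard deviations, as in SPSS

R = np.corrcoef(X, rowvar=False)  # 5x5 correlation matrix
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
print(eigvals.round(3))           # should match the eigenvalues reported below
```

Note that SPSS uses the sample standard deviation (divisor n − 1), hence `ddof=1`.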

 

The first output gives the descriptive statistics: the variables, their mean values and standard deviations, with the number of observations, as given below.

 

Descriptive Statistics

Mean Std. Deviation Analysis N
VAR00001 20.5000 6.55320 10
VAR00002 18.8000 12.43472 10
VAR00003 60.7000 10.77085 10
VAR00004 97.0000 15.12907 10
VAR00005 12.3500 3.85177 10

The second output is the correlation matrix between all pairs of variables. It is a 5×5 symmetric matrix with unity in the diagonal; the off-diagonal elements give the correlation coefficients between the different variables. The output of the correlation matrix is given below:

 

Correlation Matrix

VAR00001 VAR00002 VAR00003 VAR00004 VAR00005
Correlation VAR00001 1.000 .964 .380 .824 .941
VAR00002 .964 1.000 .457 .773 .829
VAR00003 .380 .457 1.000 -.061 .188
VAR00004 .824 .773 -.061 1.000 .878
VAR00005 .941 .829 .188 .878 1.000

 

 

The next step in this exercise of principal component analysis is to find the eigenvalues. The package computes the eigenvalues and puts them in descending order, as given here: 3.697, 1.092, 0.127, 0.083 and 0.001. Note that the sum of all these eigenvalues is 5.000, verifying the property that the sum of the eigenvalues of a matrix is equal to the sum of its diagonal elements, which is 5 in the present case.

 

Also note that the first eigenvalue, which is also the variance explained by the first principal component, comprises (3.697/5)×100 = 73.936% of the total variance. The second eigenvalue is 1.092 and comprises (1.092/5)×100 = 21.832% of the total variance. The cumulative effect of the first and second principal components is to explain 95.767% of the total variance in the variables, which is quite high and may be considered sufficient to summarize the five original variables. Further components do not add much and may be considered redundant.

 

The programme has a default command to discard all eigenvalues < 1. Due to this command, only the two eigenvalues > 1 are given in the next table, along with their proportional strength. Researchers have the option to change this default value as per their choice.

 

Total Variance Explained

Extraction Sums of Squared Loadings
Component Total % of Variance Cumulative %
1 3.697 73.936 73.936
2 1.092 21.832 95.767

 

       Extraction Method: Principal Component Analysis.

 

Once the two eigenvalues have been extracted, the programme also gives the eigenvectors whose lengths have been normalized to their eigenvalues (the sums of the squared values of all their elements are equal to 3.697 and 1.092 respectively). The values under each column are also known as factor loadings and are the coefficients of correlation of each variable with the first and second principal components. All these values are given below in the Component Matrix.

 

 

Component Matrix(a)

Component
1 2
VAR00001 .991 .051
VAR00002 .957 .156
VAR00003 .348 .929
VAR00004 .879 -.418
VAR00005 .950 -.165

     Extraction Method: Principal Component Analysis.

 

a  2 components extracted

 

The component matrix given above gives the coefficients of correlation of all five variables with the first and second principal components. These values are also known as "factor loadings" in the literature on principal components and help in making sense of these linear combinations of the constituent variables.
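The loadings in the Component Matrix can be reproduced by scaling each retained eigenvector so that its length is the square root of its eigenvalue — then the sum of the squared loadings of a component equals its eigenvalue. A sketch with NumPy (the sign of an eigenvector is arbitrary, so the code fixes each column's sign to match the table):

```python
# Factor loadings = eigenvectors of R scaled to length sqrt(lambda) —
# a sketch assuming NumPy and the chapter's data set.
import numpy as np

X = np.array([
    [12,  9, 55,  90,  6.5], [15, 11, 67,  70,  7.0],
    [16, 10, 70,  80, 10.0], [22, 18, 50, 110, 14.5],
    [21, 19, 55, 105, 13.0], [17, 12, 45,  95, 11.0],
    [21, 15, 50, 100, 14.5], [31, 40, 70, 110, 17.0],
    [32, 43, 75, 120, 18.0], [18, 11, 70,  90, 12.0],
])
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1][:2]          # two components with lambda > 1
vals, vecs = eigvals[order], eigvecs[:, order]
# Make the largest-magnitude element of each column positive (sign convention)
vecs *= np.sign(vecs[np.abs(vecs).argmax(axis=0), [0, 1]])
loadings = vecs * np.sqrt(vals)                # the Component Matrix
print(loadings.round(3))
print((loadings**2).sum(axis=0).round(3))      # equals 3.697 and 1.092
```

The printed matrix should agree (up to rounding) with the Component Matrix above.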

 

The first principal component is found to have a high positive correlation with the percentage of workers in the secondary sector (X1), the rate of population growth (X2), population density (X4) and the percentage of urban population (X5). It shows a lower positive correlation with literacy (X3). This pattern of relationships can be named "Industrial Development", as it shows crowding, migration and the expansion of industrial activities, but not much literacy.

 

The second principal component shows a high positive relationship with literacy (X3) and negative relationships with population density (X4) and urbanization (X5). The pattern appears to be just the opposite: away from crowding and industrialization, and going with literacy. Such a pattern of relationships may be called "Social Development".

Like Z-scores, our principal component scores are also normalized to unit standard deviations; this is achieved when the length of the eigenvector giving the weights is normalized to the inverse of the corresponding eigenvalue λ. The next output of the exercise is the component score coefficient matrix, giving the values of the elements of the above eigenvectors after normalizing their lengths to 1/λ, as in the following table.

 

Component Score Coefficient Matrix

Component
1 2
VAR00001 .268 .047
VAR00002 .259 .143
VAR00003 .094 .851
VAR00004 .238 -.383
VAR00005 .257 -.151

 

Extraction Method: Principal Component Analysis.  Component Scores.

 

One can verify that the sum of the squared values of the first column is 0.271, which is the inverse of the first eigenvalue of 3.697. Similarly, the sum of the squared values of the second column is 0.916, which is the inverse of the second eigenvalue of 1.092.
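This relationship between the loadings and the score coefficients can be verified directly: dividing each loading column by its eigenvalue λ gives the score coefficients, and the squared coefficients of a column then sum to 1/λ. A sketch with NumPy:

```python
# Component score coefficients = loadings / lambda — a sketch assuming
# NumPy and the chapter's data set.
import numpy as np

X = np.array([
    [12,  9, 55,  90,  6.5], [15, 11, 67,  70,  7.0],
    [16, 10, 70,  80, 10.0], [22, 18, 50, 110, 14.5],
    [21, 19, 55, 105, 13.0], [17, 12, 45,  95, 11.0],
    [21, 15, 50, 100, 14.5], [31, 40, 70, 110, 17.0],
    [32, 43, 75, 120, 18.0], [18, 11, 70,  90, 12.0],
])
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1][:2]
vals, vecs = eigvals[order], eigvecs[:, order]
vecs *= np.sign(vecs[np.abs(vecs).argmax(axis=0), [0, 1]])  # fix signs
loadings = vecs * np.sqrt(vals)
coef = loadings / vals                  # score coefficient matrix
print(coef.round(3))
print((coef**2).sum(axis=0).round(3))   # approx. 0.271 and 0.916
```

The printed coefficients should agree (up to rounding) with the Component Score Coefficient Matrix above.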

 

After getting the weights for making scores of two principal components we are ready to work out component scores.

 

First, we convert the given data on the five variables into their Z-scores, as mentioned above.

 

The “Z” scores of the variables given in our exercise are given below:

 

Z1 Z2 Z3 Z4 Z5
-1.29708 -0.78812 -0.52921 -0.46269 -1.51878
-0.83928 -0.62728 0.584914 -1.78464 -1.38897
-0.68669 -0.7077 0.863444 -1.12366 -0.61011
0.228896 -0.06434 -0.99342 0.859273 0.558185
0.076299 0.016084 -0.52921 0.528783 0.168754
-0.53409 -0.54686 -1.45764 -0.1322 -0.35049
0.076299 -0.3056 -0.99342 0.198294 0.558185
1.602271 1.704904 0.863444 0.859273 1.207237
1.754868 1.946164 1.327661 1.520252 1.466858
-0.38149 -0.62728 0.863444 -0.46269 -0.09087

 

These Z values can be converted into principal component scores as linear combinations, using the weights given in the component score coefficient matrix above. For the first area, for example, the score of the first principal component will be:

 

(-1.29708)×0.268 + (-0.78812)×0.259 + (-0.52921)×0.094 + (-0.46269)×0.238 + (-1.51878)×0.257 = -1.1018.

 

Using the same weights with the Z values for the second area, we can work out the score of the first principal component for the second area. Likewise, we can work out the component scores for all 10 areas/observations, which are given below.

 

Component scores for the second principal component can be worked out in the same way, using the weights given in the second column of the component score coefficient matrix. For example, for the first area the score will be:

 

(-1.29708)×0.047 + (-0.78812)×0.143 + (-0.52921)×0.851 + (-0.46269)×(-0.383) + (-1.51878)×(-0.151) = -0.21679

 

 

Scores of first and second principal components

S.N. Scores of First P.C. Scores of second P.C.
1 -1.1018 -0.21679
2 -1.11353 1.261802
3 -0.70993 1.123697
4 0.298926 -1.2569
5 0.143889 -0.67226
6 -0.54343 -1.23957
7 0.038415 -1.04548
8 1.466621 0.541621
9 1.837543 0.685898
10 -0.31672 0.817971
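The whole table of component scores above can be reproduced (up to rounding) as the matrix product of the Z-scores and the score coefficient matrix. A sketch with NumPy, which also confirms that the scores, like Z-scores, have zero means and unit standard deviations:

```python
# Component scores = Z @ (score coefficient matrix) — a sketch assuming
# NumPy and the chapter's data set.
import numpy as np

X = np.array([
    [12,  9, 55,  90,  6.5], [15, 11, 67,  70,  7.0],
    [16, 10, 70,  80, 10.0], [22, 18, 50, 110, 14.5],
    [21, 19, 55, 105, 13.0], [17, 12, 45,  95, 11.0],
    [21, 15, 50, 100, 14.5], [31, 40, 70, 110, 17.0],
    [32, 43, 75, 120, 18.0], [18, 11, 70,  90, 12.0],
])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # Z-scores (sample s.d.)
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1][:2]
vals, vecs = eigvals[order], eigvecs[:, order]
vecs *= np.sign(vecs[np.abs(vecs).argmax(axis=0), [0, 1]])  # fix signs
coef = (vecs * np.sqrt(vals)) / vals       # score coefficient matrix
scores = Z @ coef                          # 10x2 matrix of component scores
print(scores.round(4))
print(scores.mean(axis=0).round(6))        # zero means
print(scores.std(axis=0, ddof=1).round(6)) # unit standard deviations
```

The first row should reproduce the hand-calculated values -1.1018 and -0.21679 up to rounding of the published weights.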

 

These factor scores of the first and second principal components give the relative position of each area in terms of the created dimensions of "Industrial Development" and "Social Development". It is interesting to note that the position of an area need not be similar on both dimensions. For example, area no. 2 is low on the first principal component (-1.11353) but quite high on the second. Similarly, areas no. 8 and 9 are the highest on the first principal component but rank lower on the second.

 

Mapping of Principal Component

 

To show spatial variations in the principal component scores, we can also map them. Before mapping these scores, one should note that, like Z-scores, they have zero means and unit standard deviations.

 

Roughly half of the scores, therefore, are positive and the rest are negative. The easiest way is thus to divide the scores into two sets, of positive and negative values, which may be called high and low.

 

Each set can be further divided into two equal parts by its median value. Thus we have four categories, which may be called: very low, low, high and very high.
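This four-category scheme can be sketched as a small helper function (an illustrative sketch, assuming NumPy; the function name `classify` is hypothetical), shown here applied to the first principal component's scores:

```python
# Split scores at zero into low/high, then split each half at its own
# median — the four-category scheme described above.
import numpy as np

def classify(scores):
    scores = np.asarray(scores, dtype=float)
    labels = np.empty(scores.shape, dtype=object)
    neg, pos = scores < 0, scores >= 0
    labels[neg & (scores <  np.median(scores[neg]))] = "very low"
    labels[neg & (scores >= np.median(scores[neg]))] = "low"
    labels[pos & (scores <  np.median(scores[pos]))] = "high"
    labels[pos & (scores >= np.median(scores[pos]))] = "very high"
    return labels

# Scores of the first principal component from the table above
pc1 = [-1.1018, -1.11353, -0.70993, 0.298926, 0.143889,
       -0.54343, 0.038415, 1.466621, 1.837543, -0.31672]
print(classify(pc1))
```

The resulting labels can then be assigned to the districts and shaded on a choropleth map.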

 

We can also use any alternative method of classifying the scores into categories and prepare a choropleth map or any other type of map.

 


 
