Principal Component Analysis and Eigenvalue Weights for Aggregation
Prof. Aslam Mahmood
Eigenvalues and Eigenvectors of a Symmetric Matrix
Eigenvalues of a symmetric matrix and their associated eigenvectors play an important role in statistical methods, especially in "Principal Component Analysis". A brief introduction to them is therefore given below, before "Principal Component Analysis" itself is taken up.
Definition
For a symmetric matrix A, if there exists a non-zero vector X such that AX = λX, the vector X is known as an eigenvector of A and the scalar λ is known as an eigenvalue of A. The relationship given above shows that when a symmetric matrix is post-multiplied by one of its eigenvectors, the result is the same vector multiplied by a constant λ (also known as a scalar). Note that not every vector has this property; only some special vectors satisfy this relationship.
Example
Consider the symmetric matrix

A = | 1.0   0.6 |
    | 0.6   1.0 |

For X1 = (1, 1)′ we have AX1 = (1.6, 1.6)′ = 1.6 X1, so λ1 = 1.6 is an eigenvalue of A and X1 is its associated eigenvector. Similarly, for X2 = (1, −1)′ we have AX2 = (0.4, −0.4)′ = 0.4 X2, where λ2 = 0.4 is another eigenvalue, with eigenvector X2.
Apart from the two vectors given above, no other vector satisfies the condition AX = λX. (However, if we multiply an eigenvector by any constant, it remains the same eigenvector with a different length.)
The length of an eigenvector is the positive square root of the sum of the squared values of all its elements. In the first case the length of the eigenvector is √(1² + 1²) = √2. If each element of a vector is divided by its length, the length of the vector is also divided by it, so the resulting vector has length one. We can then rescale such a unit-length vector by multiplying each of its elements by any desired length. Manipulating the length of a vector does not change the relative positions of its elements, i.e. effectively the vector itself is unchanged.
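As a quick check, the example above can be reproduced in a few lines of code. The sketch below (Python with numpy; an illustration, not part of the original text) computes the eigenvalues and eigenvectors of A, verifies the defining relation AX = λX, and normalizes the vector (1, 1)′ to unit length.

import numpy as np

A = np.array([[1.0, 0.6],
              [0.6, 1.0]])

# eigh handles symmetric matrices; eigenvalues come back in ascending
# order, eigenvectors as unit-length columns.
eigenvalues, eigenvectors = np.linalg.eigh(A)
print(eigenvalues)              # [0.4 1.6]
print(eigenvectors)             # columns proportional to (1, -1) and (1, 1)

# The defining relation AX = lambda X holds for each pair:
for lam, x in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ x, lam * x)

# Length of (1, 1) is sqrt(2); dividing by the length normalizes the
# vector to unit length without changing its direction.
x1 = np.array([1.0, 1.0])
print(np.linalg.norm(x1))       # 1.4142...
print(x1 / np.linalg.norm(x1))  # [0.7071 0.7071]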
Important Properties
- The number of eigenvectors and eigenvalues is equal to the number of rows (or columns) of the symmetric matrix. In the case above, the 2×2 symmetric matrix has two eigenvectors and two eigenvalues.
- The sum of the eigenvalues of a symmetric matrix is equal to the sum of its diagonal elements (also known as the trace of the matrix). In the example, λ1 + λ2 = 1.6 + 0.4 = 2, the trace of A.
- Eigenvectors of a symmetric matrix are mutually orthogonal (and hence linearly independent): if X1 and X2 are two eigenvectors, then X1′X2 = 0.
- The product of all the eigenvalues of a symmetric matrix is equal to the determinant of the matrix. In the example, λ1λ2 = 1.6 × 0.4 = 0.64 = det A.
- The eigenvalues of a correlation matrix (which is symmetric and positive semi-definite) are never negative. All of these properties are verified numerically in the sketch after this list.
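The following short check (again Python/numpy, illustrative only) confirms each property for the example matrix A.

import numpy as np

A = np.array([[1.0, 0.6],
              [0.6, 1.0]])
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Sum of eigenvalues equals the trace: 0.4 + 1.6 = 2.0
assert np.isclose(eigenvalues.sum(), np.trace(A))

# Eigenvectors are mutually orthogonal: X1'X2 = 0
assert np.isclose(eigenvectors[:, 0] @ eigenvectors[:, 1], 0.0)

# Product of eigenvalues equals the determinant: 0.4 * 1.6 = 0.64
assert np.isclose(eigenvalues.prod(), np.linalg.det(A))

# A correlation matrix is positive semi-definite: no negative eigenvalues
assert (eigenvalues >= -1e-12).all()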
Computation of Eigenvalues and Eigenvectors
Computation of the eigenvalues and eigenvectors of an n×n symmetric matrix requires the solution of n simultaneous equations in n unknowns. Beyond a 2×2 matrix such solutions become difficult, and the complexity increases with the number of variables, so computer assistance is required. In almost all practical problems, where the number of variables is generally large, principal component analysis cannot be applied without access to a computer. Computer programmes are available for computing the eigenvalues and eigenvectors of different matrices. However, principal component analysis uses them along with many other calculations, so packages that carry out all these computations together are preferred. Packages such as the Statistical Package for the Social Sciences (SPSS) and Stata provide menu-driven programmes for principal component analysis.
Principal Component Analysis
Quite often we come across the problem of measuring a characteristic for which no direct measurement is possible, for example development, agricultural productivity or quality of life. Development is a concept that does not relate to a single variable: the development of an area may comprise per capita income, better health conditions, good education and better housing facilities. Similarly, agricultural productivity means the productivity not of one crop only, but of the several crops sown in the area. Quality of life, too, is measured not through one variable but through many, such as education, income, housing and health. We therefore need to collect information on several variables related to a given phenomenon and combine them into one index to get a composite picture of the phenomenon.
Conversely, sometimes diverse information related to a phenomenon, such as socio-economic development, the regional characteristics of an area or different traits of human personality, is already available in the form of several variables, but the body of information is so large that we find it difficult to handle. In such cases, too, we need to summarize these variables with the help of a few combinations.
In both cases the problem is to convert a large body of data into a smaller number of meaningful combinations of the constituent variables.
Construction of a Composite Index
After collecting data on a large number of variables related to a phenomenon, the next step is to construct a composite index from these variables. This involves two main problems:
1. The problem of the scale of measurement of the constituent variables.
2. The problem of the weights to be given to the constituent variables while combining them.
Problem of scale of measurement
If the constituent variables are measured in different units, they cannot be combined into one index, as the result would carry no meaning. For example, if we have agricultural production in tonnes as one variable and road length in kilometres as another, their addition carries no meaning; adding further variables in rupees, litres, life years and so on is equally meaningless, so combining them in their given form is not possible. The problem can, however, be handled by converting the constituent variables into some form that is free from the unit of measurement, such as percentages, ratios, ranks or "Z" scores. Methods of constructing composite indices such as the Human Development Index (HDI) and the Cost of Living Index convert the raw data into relative forms for precisely this purpose. Principal component analysis uses "Z" scores to remove the unit of measurement.
Z-scores
In a multivariate data set, different variables are not only measured in different units but also show different scales of variation: some vary little, others vary a great deal. "Z" scores are relative values with zero mean and unit standard deviation. The values of a variable are converted into Z-scores by subtracting the mean of the variable from each value and dividing each difference by the standard deviation, i.e. Z = (X − X̄)/s. This transformation gives every variable zero mean and unit standard deviation, so no variable shows exceptionally large variation relative to the others. As Z-scores are ratios, they are also free from any unit of measurement.
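A minimal sketch of the transformation (Python/numpy, with made-up values for a hypothetical income variable) is given below.

import numpy as np

income = np.array([12.0, 15.0, 9.0, 20.0, 14.0])   # hypothetical values

# Z-score: subtract the mean, divide by the standard deviation
z = (income - income.mean()) / income.std()

print(z.mean())   # ~0.0 (zero mean)
print(z.std())    # 1.0  (unit standard deviation)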
Problem of weights
After we convert the given set of multivariate data into Z-scores, we can construct a composite index by combining them. Before combining them, however, we also have to decide their weights. We may give them different weights, including equal weights, or assign subjective weights on the basis of our experience, giving full justification for them. Alternatively, we can use principal component analysis.
Principal Component Analysis
Principal component analysis is based on a method in which the weights of the different variables are generated from the inter-correlations among the given variables. It converts the set of Z-scores into a weighted index Y, known as a principal component,

Y = w1Z1 + w2Z2 + w3Z3 + …… + wnZn

in such a manner that Y has the largest possible variance, so that it accounts for as much of the variability in the data as possible.
The main problem of principal component analysis is to find such weights. The mathematical solution shows that the weights w = (w1, w2, w3, …, wn) are the n elements of an eigenvector of the correlation matrix of the given variables, normalized to unit length. As there are exactly n such eigenvectors (equal to the number of variables), there are n such principal components. It can also be shown that the total variance of the n standardized variables, which is n (each variable has unit variance, adding up to n), is preserved exactly in the total variance of all the principal components, and that the variance of each principal component is equal to the eigenvalue of the eigenvector used to generate it. We therefore arrange the n eigenvalues of the correlation matrix, λ1, λ2, λ3 …… λn, in descending order and choose the corresponding eigenvectors for working out the respective principal components. If we use the highest eigenvalue and the elements of its associated eigenvector as the weights, the resulting component is known as the first principal component. Similarly, the next highest eigenvalue and the elements of its corresponding eigenvector give the second principal component, and likewise we can get the third, fourth, fifth and nth principal components. The variance of the first principal component is λ1 (the maximum), that of the second is λ2 (the next highest), that of the third is λ3 (the third highest), and so on, declining to λn (the least) for the last. The proportional contribution of the first principal component to the total variance of the data is λ1/n, that of the second is λ2/n, that of the third is λ3/n, and so on. The proportional variances of all n principal components add to 1, or 100 per cent if given in percentages.
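To make the procedure concrete, here is an illustrative sketch (Python/numpy, with randomly generated stand-in data; the array names are assumptions, not from the text). It standardizes the variables, takes the eigenvalues and eigenvectors of their correlation matrix, and builds the principal components with unit-length eigenvectors as weights.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))              # 100 areas, 4 made-up variables

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # Z-scores
R = np.corrcoef(Z, rowvar=False)           # 4x4 correlation matrix

eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]      # descending: lambda1 >= lambda2 >= ...
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each principal component Y_i = w1*Z1 + ... + wn*Zn, with the elements
# of the i-th unit-length eigenvector as the weights.
Y = Z @ eigenvectors

# The variance of the i-th component equals lambda_i, and lambda_i / n
# is its proportional contribution to the total variance n (here n = 4).
print(Y.var(axis=0))          # matches the eigenvalues
print(eigenvalues / 4)        # proportional contributions, summing to 1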
Identification of Dimensions: Interpretation of Principal Components
When a large body of data is converted into principal components, the researcher faces the questions of how many principal components to select and how to interpret each of them. As discussed at the beginning, the aim is to measure some phenomenon for which no direct data are available, and the composite index should reflect that phenomenon.
Sometimes, when most of the variables relate to one phenomenon, say agricultural development, the first principal component alone can explain a large proportion of the variation in the data and may be sufficient to replace all the variables. In such a case the first principal component may explain around 60-80 % of the total variance, and the researcher may decide to retain only this one component. The composite index reflected by the first principal component can also be named easily on the basis of the constituent variables, which are similar in nature.
However, in a multidimensional situation the constituent variables may relate to more than one dimension, such as agriculture, industry, urbanization and social development. In such a situation principal component analysis is likely to generate more than one principal component, so that each dimension is covered by one principal component (with some overlapping as well).
In such a case we first have to decide how many of the n principal components to retain. The contribution of successive principal components keeps declining, indicating their increasing redundancy. We generally stop extracting principal components once their eigenvalues fall below some specified value (commonly 1). We can feel content when a large proportion of the variation in the data (around 65 % or more) is summarized in this manner by only a few principal components (say 3 to 5). The cumulative proportion of variance explained by the chosen components is worked out by taking the sum of the corresponding eigenvalues as a percentage of the total, n, as in the sketch below.
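A small sketch of this retention rule (Python/numpy; the eigenvalues are hypothetical) follows.

import numpy as np

eigenvalues = np.array([2.4, 1.1, 0.3, 0.2])   # hypothetical, descending
n = eigenvalues.size

keep = eigenvalues > 1.0                 # the "eigenvalue > 1" rule
cumulative = np.cumsum(eigenvalues) / n  # cumulative proportion of variance

print(keep)               # [ True  True False False] -> retain 2 components
print(cumulative * 100)   # [ 60.   87.5  95.  100. ] per cent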
When more than one principal component is retained, the nature of each principal component can be identified with the help of the weights it gives to the constituent variables.
It can be shown that if the unit-length eigenvector used for a principal component is rescaled by the square root of its corresponding eigenvalue, the values of its elements equal the coefficients of correlation of the constituent variables with the principal component. These values are very helpful in identifying the principal component and are known as factor loadings. They are given in the computer output as the Component Matrix to facilitate the interpretation of the different principal components.
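A minimal sketch of the loadings computation (Python/numpy, continuing the 2×2 correlation-matrix example from earlier) is shown below.

import numpy as np

R = np.array([[1.0, 0.6],
              [0.6, 1.0]])                 # correlation matrix

eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]      # descending order
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# loadings[j, i] = correlation of variable j with component i; packages
# such as SPSS report this array as the "Component Matrix".
loadings = eigenvectors * np.sqrt(eigenvalues)
print(loadings)   # first column ~0.894 for both variables (sign may flip)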