19 Principal Component Analysis using R

Sumitra Purkayastha

epgp books

 

 

 

1  Introduction

 

A principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables.

 

Its objectives are:

  • Data reduction
  • Interpretation

2 Quick Review

Let X = (X1, X2, ..., Xp)' be a random vector with variance-covariance matrix Σ, and let (λ1, e1), (λ2, e2), ..., (λp, ep) be the eigenvalue-(normalized) eigenvector pairs of Σ, with λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0.

The i-th principal component is the linear combination

Yi = ei'X = ei1 X1 + ei2 X2 + ... + eip Xp,     i = 1, 2, ..., p.

Then Var(Yi) = λi and Cov(Yi, Yk) = 0 for i ≠ k. Moreover,

λ1 + λ2 + ... + λp = σ11 + σ22 + ... + σpp (the total population variance),

so the proportion of the total variance explained by the k-th principal component is λk/(λ1 + λ2 + ... + λp).

In practice Σ is unknown and is replaced by the sample variance-covariance matrix (or by the sample correlation matrix when the variables are measured in different units), and the sample principal components are obtained from its eigenvalues and eigenvectors.

A 2-Dimensional Example

 

First we consider a two-dimensional example and thereby look at the geometric interpretation of principal component analysis.

 

We create some arbitrary data in R, say

 

x <- c(2.5,0.5,2.2,1.9,3.1,2.3,2,1,1.5,1.1)

y <- c(2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9)

 

Let us see what a plot of the data looks like.

We use the R command plot() to plot the data:

 

plot(x,y,xlim=c(-1,4),ylim=c(-1,4))

abline(h=0,v=0,lty=3)

 

For PCA to work properly, we must subtract the mean for each dimension. This produces a data set whose mean is zero.

 

We subtract the mean of the x values from each component of x, and the same mean centering is done for the y values.

 

x1 <- x - mean(x)

y1 <- y - mean(y)

 

We plot this mean centered data

 

plot(x1,y1,main="Plot of the centered data")

abline(h=0,v=0,lty=3)

Before proceeding further, we first take a look at what is meant by an orthogonal transformation.

 

What is an Orthogonal Transformation?

 

Consider a linear transformation from the variables (x, y) to (u, v), where

u = a11*x + a12*y
v = a21*x + a22*y

The transformation is orthogonal if the coefficient matrix A = (aij) satisfies A'A = I, i.e. its rows (and columns) form an orthonormal set. A familiar example is a rotation through an angle θ:

u =  x*cos(θ) + y*sin(θ)
v = -x*sin(θ) + y*cos(θ)

What effect does an orthogonal transformation have on the axes?

 

Consider the mean centered data obtained previously. We apply an orthogonal transformation to the data and see what the plot of the new data looks like.
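The module does not show the code for this step; a minimal sketch, under the assumption that the new (u, v) axes arise from a rotation through an arbitrary angle theta (taken here as pi/6 purely for illustration), is:

theta <- pi/6                                            # arbitrary angle, for illustration only
plot(x1, y1, xlim = c(-2, 2), ylim = c(-2, 2))
abline(h = 0, v = 0, lty = 3)                            # original (x, y) axes: black dotted lines
abline(a = 0, b = tan(theta), col = "red", lty = 3)      # new u-axis
abline(a = 0, b = -1/tan(theta), col = "green", lty = 3) # new v-axis, perpendicular to the u-axis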

 

The new data, when plotted, looks like the above, with the red and green coloured axes as the reference axes. That is, the (u, v) axes are shown as the red and green dotted lines respectively.

 

Also in the same plot we have the original set of axes, denoted by the black dotted lines, i.e. the (x, y) axes.

 

Note that the linear transformation of the data has essentially led to a rotation of the axes. Here the orthogonal transformation, apart from bringing about a rotation of the axes, has also maintained the orthogonality of the axes.

 

If we rotate the axes further, we would have the following figure.

Back to PCA

 

We find the eigen values and the corresponding normalized eigen vectors of the variance-covariance matrix of the mean centered data (the variance-covariance matrix of the mean centered data is the same as that of the original data).

 

m <- as.matrix(cbind(x1,y1))

cov.m <- cov(m)

cov.eig <- eigen(cov.m)

 

m.values <- cov.eig$values        #We get the eigen values

 

#Cumulative percentage of variability explained by the principal components 

 

(cumsum(m.values)/sum(m.values))*100

 

[1]       96.31813 100.00000

 

Thus we see that 96% of the total variability is explained by the first principal component.

 

We plot the mean centered data along with the principal components

 

Note that the eigen vectors give the direction of the principal components

 

So here we plot the directions of the principal components as lines through the origin; the slope of each line is the ratio of the second to the first component of the corresponding eigen vector.

 

plot(x1,y1)

abline(h=0,v=0,lty=3)

abline(a=0,b=(cov.eig$vectors[2,1]/cov.eig$vectors[1,1]), col="red")

abline(a=0,b=(cov.eig$vectors[2,2]/cov.eig$vectors[1,2]), col="green")

 

The eigen vector corresponding to the largest eigen value is obtained as

 

p1 <- cov.eig$vectors[,1]          # eigen vector corresponding to the largest eigen value
p1

 

[1] 0.6778734 0.7351787

 

The First Principal Component is given by

 

Y1 = 0.6778734*x1 + 0.7351787*y1, where x1 and y1 are the mean centered variables.

 

We can optionally recover the original data: exactly, if we use all the components, or approximately otherwise.

 

Let us consider that we have the data on only the first principal component. To get back to the original data set, we invert the orthogonal transformation. The R code is as follows.

 

Y1 <- as.numeric(p1 %*% t(m)) # scores of the data on the first principal component

original.dataset1 <- t(p1 %*% t(Y1)) # project the scores back onto the original axes

original.dataset1[,1] <- original.dataset1[,1] + mean(x) # re-add means

original.dataset1[,2] <- original.dataset1[,2] + mean(y)

 

original.dataset1

(The reconstructed data, a 10 × 2 matrix of approximate x and y values, is printed.)

Let us plot the data obtained

 

plot(original.dataset1[,1],original.dataset1[,2],xlim=c(-1,4),ylim=c(-1,4),
type="p",main="Plot of Data",xlab="X",ylab="Y")

abline(h=0,v=0,lty=3)

 

Notice that in this approximation the variation along the second eigenvector is gone, as expected, since that component was discarded.
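As a quick check (not in the original module), we can verify this numerically: after removing the means again, the projection of the reconstructed data onto the second eigenvector has essentially zero variance.

recentered <- sweep(original.dataset1, 2, c(mean(x), mean(y)))   # centre the reconstructed data again
var(as.numeric(recentered %*% cov.eig$vectors[, 2]))             # approximately 0, up to rounding error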

 

Example where PCA is worthwhile

 

Consider the data on the times taken by some athletes to complete different categories of races, such as the 100m race, the 200m race, etc.

(Data table: recorded times for the athletes in eight race categories.)

Notice that the times in the first three categories of races (the first three columns) are given in seconds, whereas the other times are given in minutes.

 

There are several functions from different packages for performing PCA:

  • prcomp() and princomp() from the built-in R stats package
  • PCA() from the FactoMineR package
  • dudi.pca() from the ade4 package

The function princomp() uses the spectral decomposition approach, whereas prcomp() and PCA() [FactoMineR] use the singular value decomposition (SVD).

 

According to R help, SVD has slightly better numerical accuracy. Therefore, prcomp() is the preferred function

 

Let us read the data into R and use the function prcomp():

 

runner.pca=prcomp(data,scale=TRUE)

names(runner.pca)

[1] "sdev"     "rotation" "center"   "scale"    "x"

 

runner.pca$sdev

 

[1]     2.5733531 0.9368128 0.3991505 0.3522065 0.2826310 0.2607013 0.2154519

[8]     0.1503333

 

The argument scale=TRUE specifies that the observations are first standardized and then PCA is performed. This standardization is required since the variables are in different units.
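In other words, with scale=TRUE, prcomp() effectively works with the correlation matrix of the variables. As a sketch (assuming, as above, that the data object is named data), the squared sdev values should therefore agree with the eigen values of cor(data):

eigen(cor(data))$values     # should match (runner.pca$sdev)^2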

 


 

When we extract the output sdev from the R object runner.pca we get the square roots of the eigen values of the correlation matrix corresponding to the data

 

runner.pca$rotation

 

The columns given by extracting the output rotation from the R object runner.pca are the eigen vectors corresponding to these eigen values.

 

So to obtain the eigen values we square the sdev values.

 

We also obtain the percentage of variability explained by the different principal components, and the cumulative percentage of variability explained, directly from the formulae.

 

eig <- (runner.pca$sdev)^2       # eigenvalues
variance <- eig*100/sum(eig)     # variances in percentage
cumvar <- cumsum(variance)       # cumulative variances

 

runner.eig <- data.frame(eig = eig, variance = variance, cumvariance = cumvar)
runner.eig

         eig   variance cumvariance
1 6.62214613 82.7768266    82.77683
2 0.87761829 10.9702287    93.74706
3 0.15932114  1.9915143    95.73857
4 0.12404939  1.5506173    97.28919
5 0.07988027  0.9985034    98.28769
6 0.06796515  0.8495644    99.13725
7 0.04641953  0.5802441    99.71750
8 0.02260010  0.2825012   100.00000
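The same information can also be obtained directly from the summary() method for prcomp objects, which reports the standard deviation, proportion of variance and cumulative proportion for each component:

summary(runner.pca)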
Scree Plot Using Base Graphics

To draw the Scree plot we can use base graphics functions like barplot(), lines(), etc. The R code is as follows:

barplot(runner.eig[, 2], names.arg = 1:nrow(runner.eig), main = "Variances",
        xlab = "Principal Components",
        ylab = "Percentage of variances",
        col = "steelblue")
# Add connected line segments to the plot
lines(x = 1:nrow(runner.eig), runner.eig[, 2], type = "b", pch = 19, col = "red")

Note that:

  • The first 2 principal components retain about 94% of the original data variability
  • Instead of 8 original variables, we may work with these 2 principal components

We can achieve a significant reduction in dimension by PCA.
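The scores of the observations on the principal components are stored in the x component of the prcomp object, so a reduced two-dimensional version of the data may be taken as its first two columns; a brief sketch:

scores <- runner.pca$x[, 1:2]   # scores on the first two principal components
head(scores)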

 

There are several inbuilt R functions under different libraries which aid in PCA. The package factoextra is used for the visualization of the principal component analysis results.

 

factoextra can be installed and loaded as follows:

 

install.packages("devtools")
devtools::install_github("kassambara/factoextra")
# load
library("factoextra")
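factoextra is also available on CRAN, so instead of installing the development version from GitHub one may simply run:

install.packages("factoextra")
library("factoextra")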

 

Scree plot using Factoextra:

 

The R function is fviz_screeplot():

 

fviz_screeplot(runner.pca, ncp=10)
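Beyond the Scree plot, factoextra provides other visualizations of a PCA; for instance, a plot of the variables (loadings) and a biplot of individuals and variables can be obtained as a quick sketch:

fviz_pca_var(runner.pca)      # plot of the variables
fviz_pca_biplot(runner.pca)   # biplot of individuals and variables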

 

Example where PCA is not worthwhile

(A second data set, with six variables, is analysed in the same way; the eigenvalues obtained from it are stored in the vector eigen.values.)

Note that:

 

The first 5 principal components together explain 93% of the total variability.

 

So we do not achieve any significant reduction in dimension by PCA for this problem. This may be attributed to the fact that the variables may have very low correlations amongst themselves.
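One way to see this, not shown in the module, would be to inspect the correlation matrix of that data set; assuming it is stored in an object named, say, data2 (a hypothetical name), a sketch is:

round(cor(data2), 2)   # most off-diagonal entries would be close to 0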

 

Scree Plot

 

par(mar=c(5.1,4.85,4.1,2.1))

 

plot(1:6, eigen.values, type="o", main="Scree Plot", xlab="Index", ylab=expression(lambda[j]), cex.lab=1.5, col="darkred", lwd=2)

Even from the Scree plot we see that the first bend in the curve occurs at index 2; however, at that point the eigenvalues have not yet dropped close to 0. A bend accompanied by a drop to near-zero eigenvalues is seen only around index 5.

 

SUMMARY

  • In R we can perform Principal Component Analysis from first principles or using several inbuilt R functions from different libraries

 


 

References

  • R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, Pearson
  • T. W. Anderson, An Introduction to Multivariate Statistical Analysis, John Wiley
  • G. A. F. Seber, Multivariate Observations, John Wiley
  • N. C. Giri, Multivariate Statistical Inference, Academic Press