19 Principal Component Analysis using R

Sumitra Purkayastha

epgp books

 

 

 

1  Introduction

 

A principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables.

 

Its objectives are:

  • Data reduction
  • Interpretation

2 Quick Review

Let X = (X1, X2, ..., Xp)' be a random vector with variance-covariance matrix Σ, and let (λ1, e1), (λ2, e2), ..., (λp, ep) be the eigenvalue-(normalized) eigenvector pairs of Σ, with λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0.

The i-th principal component is the linear combination

Yi = ei'X = ei1 X1 + ei2 X2 + ... + eip Xp,     i = 1, 2, ..., p.

Then Var(Yi) = λi and Cov(Yi, Yk) = 0 for i ≠ k. Moreover,

λ1 + λ2 + ... + λp = σ11 + σ22 + ... + σpp (the total population variance),

so the proportion of the total variance explained by the k-th principal component is λk/(λ1 + λ2 + ... + λp).

In practice Σ is unknown and is replaced by the sample variance-covariance matrix (or by the sample correlation matrix when the variables are measured in different units), and the sample principal components are obtained from its eigenvalues and eigenvectors.

A 2-Dimensional Example

 

First we consider a two-dimensional example and thereby look at the geometric interpretation of principal component analysis.

 

We create some arbitrary data in R, say

 

x <- c(2.5,0.5,2.2,1.9,3.1,2.3,2,1,1.5,1.1)

y <- c(2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9)

 

Let us see what a plot of the data looks like.

We use the R command plot() to plot the data:

 

plot(x,y,xlim=c(-1,4),ylim=c(-1,4))

abline(h=0,v=0,lty=3)

 

For PCA to work properly, we must subtract the mean for each dimension. This produces a data set whose mean is zero.

 

We subtract the mean of the x values from each component of x, and the same mean centering is done for the y values.

 

x1 <- x - mean(x)

y1 <- y - mean(y)

 

We plot this mean centered data

 

plot(x1,y1,main="Plot of the centered data")

abline(h=0,v=0,lty=3)

Before proceeding further, we first take a look at what is meant by an orthogonal transformation.

 

What is an Orthogonal Transformation?

 

Consider a linear transformation from the variables (x, y) to (u, v), where

u = a11*x + a12*y
v = a21*x + a22*y

The transformation is orthogonal if the coefficient matrix A = (aij) satisfies A'A = I, i.e. its rows (and columns) form an orthonormal set. A familiar example is a rotation through an angle θ:

u =  x*cos(θ) + y*sin(θ)
v = -x*sin(θ) + y*cos(θ)

What effect does an orthogonal transformation have on the axes?

 

Consider the mean centered data obtained previously. We apply an orthogonal transformation to the data and see what the plot of the new data looks like.
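The module does not show the code for this step; a minimal sketch, under the assumption that the new (u, v) axes arise from a rotation through an arbitrary angle theta (taken here as pi/6 purely for illustration), is:

theta <- pi/6                                            # arbitrary angle, for illustration only
plot(x1, y1, xlim = c(-2, 2), ylim = c(-2, 2))
abline(h = 0, v = 0, lty = 3)                            # original (x, y) axes: black dotted lines
abline(a = 0, b = tan(theta), col = "red", lty = 3)      # new u-axis
abline(a = 0, b = -1/tan(theta), col = "green", lty = 3) # new v-axis, perpendicular to the u-axis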

 

The new data, when plotted, looks like the above, with the red and green coloured axes as the reference axes. That is, the (u, v) axes are shown as the red and green dotted lines respectively.

 

Also in the same plot we have the original set of axes, denoted by the black dotted lines, i.e. the (x, y) axes.

 

Note that the linear transformation of the data has essentially led to a rotation of the axes. Here the orthogonal transformation, apart from bringing about a rotation of the axes, has also maintained the orthogonality of the axes.

 

If we rotate the axes further, we would have the following figure.

Back to PCA

 

We find the eigen values and the corresponding normalized eigen vectors of the variance-covariance matrix of the mean centered data (the variance-covariance matrix of the mean centered data is the same as that of the original data).

 

m <- as.matrix(cbind(x1,y1))

cov.m <- cov(m)

cov.eig <- eigen(cov.m)

 

m.values <- cov.eig$values        #We get the eigen values

 

#Cumulative percentage of variability explained by the principal components 

 

(cumsum(m.values)/sum(m.values))*100

 

[1]       96.31813 100.00000

 

Thus we see that 96% of the total variability is explained by the first principal component.

 

We plot the mean centered data along with the principal components

 

Note that the eigen vectors give the direction of the principal components

 

So here we plot the directions of the principal components as lines through the origin; the slope of each line is the ratio of the second to the first component of the corresponding eigen vector.

 

plot(x1,y1)

abline(h=0,v=0,lty=3)

abline(a=0,b=(cov.eig$vectors[2,1]/cov.eig$vectors[1,1]), col="red")

abline(a=0,b=(cov.eig$vectors[2,2]/cov.eig$vectors[1,2]), col="green")

 

The eigen vector corresponding to the largest eigen value is obtained as

 

p1 <- cov.eig$vectors[,1]          # eigen vector corresponding to the largest eigen value
p1

 

[1] 0.6778734 0.7351787

 

The First Principal Component is given by

 

Y1 = 0.6778734*x1 + 0.7351787*y1, where x1 and y1 are the mean centered variables.

 

We can optionally recover the original data: exactly, if we use all the components, or approximately otherwise.

 

Let us consider that we have the data on only the first principal component. To get back to the original data set, we invert the orthogonal transformation. The R code is as follows.

 

Y1 <- as.numeric(p1 %*% t(m)) # scores of the data on the first principal component

original.dataset1 <- t(p1 %*% t(Y1)) # project the scores back onto the original axes

original.dataset1[,1] <- original.dataset1[,1] + mean(x) # re-add means

original.dataset1[,2] <- original.dataset1[,2] + mean(y)

 

original.dataset1

(The reconstructed data, a 10 × 2 matrix of approximate x and y values, is printed.)

Let us plot the data obtained

 

plot(original.dataset1[,1],original.dataset1[,2],xlim=c(-1,4),ylim=c(-1,4),
type="p",main="Plot of Data",xlab="X",ylab="Y")

abline(h=0,v=0,lty=3)

 

Notice that in this approximation the variation along the second eigenvector is gone, as expected, since that component was discarded.
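As a quick check (not in the original module), we can verify this numerically: after removing the means again, the projection of the reconstructed data onto the second eigenvector has essentially zero variance.

recentered <- sweep(original.dataset1, 2, c(mean(x), mean(y)))   # centre the reconstructed data again
var(as.numeric(recentered %*% cov.eig$vectors[, 2]))             # approximately 0, up to rounding error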

 

Example where PCA is worthwhile

 

Consider the data on the times taken by some athletes to complete different categories of races, such as the 100m race, the 200m race, etc.

(Data table: recorded times for the athletes in eight race categories.)

Notice that the times in the first three categories of races (the first three columns) are given in seconds, whereas the other times are given in minutes.

 

There are several functions from different packages for performing PCA:

  • prcomp() and princomp() from the built-in R stats package
  • PCA() from the FactoMineR package
  • dudi.pca() from the ade4 package

The function princomp() uses the spectral decomposition approach, whereas prcomp() and PCA() [FactoMineR] use the singular value decomposition (SVD).

 

According to R help, SVD has slightly better numerical accuracy. Therefore, prcomp() is the preferred function

 

Let us read the data into R and use the function prcomp():

 

runner.pca=prcomp(data,scale=TRUE)

names(runner.pca)

[1] "sdev"     "rotation" "center"   "scale"    "x"

 

runner.pca$sdev

 

[1]     2.5733531 0.9368128 0.3991505 0.3522065 0.2826310 0.2607013 0.2154519

[8]     0.1503333

 

The argument scale=TRUE specifies that the observations are first standardized and then PCA is performed. This standardization is required since the variables are in different units.
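In other words, with scale=TRUE, prcomp() effectively works with the correlation matrix of the variables. As a sketch (assuming, as above, that the data object is named data), the squared sdev values should therefore agree with the eigen values of cor(data):

eigen(cor(data))$values     # should match (runner.pca$sdev)^2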

 


 

When we extract the output sdev from the R object runner.pca we get the square roots of the eigen values of the correlation matrix corresponding to the data

 

runner.pca$rotation

 

The columns given by extracting the output rotation from the R object runner.pca are the eigen vectors corresponding to these eigen values.

 

So to obtain the eigen values we square the sdev values.

 

We also obtain the percentage of variability explained by the different principal components, and the cumulative percentage of variability explained, directly from the formulae.

 

eig <- (runner.pca$sdev)^2       # eigenvalues
variance <- eig*100/sum(eig)     # variances in percentage
cumvar <- cumsum(variance)       # cumulative variances

 

runner.eig <- data.frame(eig = eig, variance = variance, cumvariance = cumvar)
runner.eig

         eig   variance cumvariance
1 6.62214613 82.7768266    82.77683
2 0.87761829 10.9702287    93.74706
3 0.15932114  1.9915143    95.73857
4 0.12404939  1.5506173    97.28919
5 0.07988027  0.9985034    98.28769
6 0.06796515  0.8495644    99.13725
7 0.04641953  0.5802441    99.71750
8 0.02260010  0.2825012   100.00000
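The same information can also be obtained directly from the summary() method for prcomp objects, which reports the standard deviation, proportion of variance and cumulative proportion for each component:

summary(runner.pca)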
Scree Plot Using Base Graphics

To draw the Scree plot we can use base graphics functions like barplot(), lines(), etc. The R code is as follows:

barplot(runner.eig[, 2], names.arg = 1:nrow(runner.eig), main = "Variances",
        xlab = "Principal Components",
        ylab = "Percentage of variances",
        col = "steelblue")
# Add connected line segments to the plot
lines(x = 1:nrow(runner.eig), runner.eig[, 2], type = "b", pch = 19, col = "red")

Note that:

  • The first 2 principal components retain about 94% of the original data variability
  • Instead of 8 original variables, we may work with these 2 principal components

We can achieve a significant reduction in dimension by PCA.
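The scores of the observations on the principal components are stored in the x component of the prcomp object, so a reduced two-dimensional version of the data may be taken as its first two columns; a brief sketch:

scores <- runner.pca$x[, 1:2]   # scores on the first two principal components
head(scores)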

 

There are several inbuilt R functions under different libraries which aid in PCA. The package factoextra is used for the visualization of the principal component analysis results.

 

factoextra can be installed and loaded as follows:

 

install.packages("devtools")
devtools::install_github("kassambara/factoextra")
# load
library("factoextra")
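factoextra is also available on CRAN, so instead of installing the development version from GitHub one may simply run:

install.packages("factoextra")
library("factoextra")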

 

Scree plot using Factoextra:

 

The R function is fviz_screeplot():

 

fviz_screeplot(runner.pca, ncp=10)
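Beyond the Scree plot, factoextra provides other visualizations of a PCA; for instance, a plot of the variables (loadings) and a biplot of individuals and variables can be obtained as a quick sketch:

fviz_pca_var(runner.pca)      # plot of the variables
fviz_pca_biplot(runner.pca)   # biplot of individuals and variables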

 

Example where PCA is not worthwhile

(A second data set, with six variables, is analysed in the same way; the eigenvalues obtained from it are stored in the vector eigen.values.)

Note that:

 

The first 5 principal components together explain 93% of the total variability.

 

So we do not achieve any significant reduction in dimension by PCA for this problem. This may be attributed to the fact that the variables may have very low correlations amongst themselves.
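One way to see this, not shown in the module, would be to inspect the correlation matrix of that data set; assuming it is stored in an object named, say, data2 (a hypothetical name), a sketch is:

round(cor(data2), 2)   # most off-diagonal entries would be close to 0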

 

Scree Plot

 

par(mar=c(5.1,4.85,4.1,2.1))

 

plot(1:6, eigen.values, type="o", main="Scree Plot", xlab="Index", ylab=expression(lambda[j]), cex.lab=1.5, col="darkred", lwd=2)

Even from the Scree plot we see that the first bend in the curve occurs at index 2; however, at that point the eigenvalues have not yet dropped close to 0. A bend accompanied by a drop to near-zero eigenvalues is seen only around index 5.

 

SUMMARY

  • In R we can perform Principal Component Analysis from first principles or using several inbuilt R functions from different libraries

 


 

References

  • R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, Pearson
  • T. W. Anderson, An Introduction to Multivariate Statistical Analysis, John Wiley
  • G. A. F. Seber, Multivariate Observations, John Wiley
  • N. C. Giri, Multivariate Statistical Inference, Academic Press