19 Principal Component Analysis using R
Sumitra Purkayastha
1 Introduction
A principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables.
Its objectives are:
- Data reduction
- Interpretation
2 Quick Review

A 2-Dimensional Example
First we consider a two-dimensional example and thereby look at the geometric interpretation of Principal Component Analysis.
We start with some arbitrary data in R, say
x <- c(2.5,0.5,2.2,1.9,3.1,2.3,2,1,1.5,1.1)
y <- c(2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9)
Let us see what a plot of the data looks like.
We use the R command plot() to plot the data.
plot(x,y,xlim=c(-1,4),ylim=c(-1,4))
abline(h=0,v=0,lty=3)

For PCA to work properly, we must subtract the mean of each dimension. This produces a data set whose mean is zero.
We subtract the mean of the x values from each component of x, and do the same mean centering for the y values.
x1 <- x - mean(x)
y1 <- y - mean(y)
We now plot this mean-centered data.
plot(x1,y1,main="Plot of the centered data")
abline(h=0,v=0,lty=3)

Before proceeding further we first take a look at what is meant by an orthogonal transformation.
What is an Orthogonal Transformation?
Consider a linear transformation from the variables (x, y) to new variables (u, v) of the form
u = a11*x + a12*y,  v = a21*x + a22*y,
where the 2 x 2 coefficient matrix A = (a_ij) is orthogonal, i.e. A'A = AA' = I (its rows, and equally its columns, form an orthonormal set).
What effect does an orthogonal transformation have on the axes?
Consider the mean-centered data obtained previously. We apply an orthogonal transformation to the data and see what the plot of the new data looks like.
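The code and figure for this step are not reproduced in the text; the sketch below, with an arbitrarily chosen rotation angle theta, illustrates the idea by plotting the centered data together with the original (x, y) axes as black dotted lines and the rotated (u, v) axes as red and green dotted lines.
theta <- pi/6 # illustrative rotation angle, not taken from the original text
plot(x1,y1,xlim=c(-2,2),ylim=c(-2,2))
abline(h=0,v=0,lty=3) # original (x, y) axes
abline(a=0,b=tan(theta),lty=3,col="red") # rotated u-axis
abline(a=0,b=-1/tan(theta),lty=3,col="green") # rotated v-axis, perpendicular to the u-axis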

The transformed data, when plotted, look like the figure above, with the red and green coloured axes as the reference axes; that is, the (u, v) axes are shown as the red and green dotted lines respectively.
In the same plot we also have the original set of axes, denoted by the black dotted lines, i.e. the (x, y) axes.
Note that the linear transformation of the data has essentially led to a rotation of the axes. Here the orthogonal transformation, apart from rotating the axes, has also preserved their orthogonality.
If we further rotate the axes we would have the following figure

Back to PCA
We find the eigenvalues and the corresponding normalized eigenvectors of the variance-covariance matrix of the mean-centered data (the variance-covariance matrix of the mean-centered data is the same as that of the original data, since centering changes neither variances nor covariances).
m <- as.matrix(cbind(x1,y1))
cov.m <- cov(m)
cov.eig <- eigen(cov.m)
m.values <- cov.eig$values #We get the eigen values
#Cumulative percentage of variability explained by the principal components
(cumsum(m.values)/sum(m.values))*100
[1] 96.31813 100.00000
Thus we see that about 96% of the total variability is explained by the first principal component.
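In passing, the parenthetical remark above, that mean-centering does not change the variance-covariance matrix, can be verified directly (a quick check):
max(abs(cov(cbind(x1,y1)) - cov(cbind(x,y)))) # 0, up to rounding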
We plot the mean centered data along with the principal components
Note that the eigenvectors give the directions of the principal components.
So here we plot the principal component directions as lines through the origin; the slope of each line is the ratio of the second to the first component of the corresponding eigenvector.
plot(x1,y1)
abline(h=0,v=0,lty=3)
abline(a=0,b=(cov.eig$vectors[2,1]/cov.eig$vectors[1,1]), col="red")
abline(a=0,b=(cov.eig$vectors[2,2]/cov.eig$vectors[1,2]), col="green")

The eigenvector corresponding to the largest eigenvalue is obtained as
p1 <- cov.eig$vectors[,1] # eigenvector corresponding to the largest eigenvalue
p1
[1] 0.6778734 0.7351787
The First Principal Component is therefore
Y1 = 0.6778734*x1 + 0.7351787*y1
(in terms of the mean-centered variables).
We can optionally recover the original data: exactly, if we use all the components, or approximately otherwise.
Suppose we keep only the data on the First Principal Component. To get back to (an approximation of) the original data set, we invert the orthogonal transformation. The R code is as follows.
Y1 <- as.numeric(p1 %*% t(m)) # new dataset for feature vector 1
original.dataset1 <- t(p1 %*% t(Y1)) # project the PC1 scores back into the original coordinates
original.dataset1[,1] <- original.dataset1[,1] + mean(x) # re-add means
original.dataset1[,2] <- original.dataset1[,2] + mean(y)
original.dataset1

Let us plot the data obtained
plot(original.dataset1[,1],original.dataset1[,2],xlim=c(-1,4),ylim=c(-1,4),
type="p",main="Plot of Data",xlab="X",ylab="Y")
abline(h=0,v=0,lty=3)

Notice that in this approximation the variation along the second eigenvector is gone, as expected, since that component was discarded.
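As a quick check of the earlier statement that retaining all the components recovers the data exactly, the following sketch (using the objects already defined) inverts the full transformation:
scores <- m %*% cov.eig$vectors # scores on both principal components
recovered <- scores %*% t(cov.eig$vectors) # invert the orthogonal transformation
recovered[,1] <- recovered[,1] + mean(x) # re-add the means
recovered[,2] <- recovered[,2] + mean(y)
max(abs(recovered - cbind(x,y))) # effectively zero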
Example where PCA is worthwhile
Consider data on the times taken by some athletes to complete races of different categories, such as the 100m race, the 200m race, and so on.

Notice that the times for the first three categories of races (the first three columns) are given in seconds, whereas the other times are given in minutes.
There are several functions from different packages for performing PCA:
- prcomp() and princomp() from the built-in R stats package
- PCA() from FactoMineR package
- dudi.pca() from ade4 package
princomp() uses the spectral decomposition approach.
The functions prcomp() and PCA() [FactoMineR] use the singular value decomposition (SVD).
According to the R help, SVD has slightly better numerical accuracy, so prcomp() is the preferred function.
Let us read the data into R and use the R function prcomp().
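The original text does not show the data-reading step; a minimal sketch, with a purely illustrative file name, would be:
data <- read.table("athlete_times.txt", header=TRUE) # file name is illustrative; replace with the actual file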
runner.pca <- prcomp(data, scale.=TRUE)
The argument scale.=TRUE specifies that the observations are first standardized and then PCA is performed. This standardization is required since the variables are measured in different units.
names(runner.pca)
[1] "sdev" "rotation" "center" "scale" "x"
runner.pca$sdev
[1] 2.5733531 0.9368128 0.3991505 0.3522065 0.2826310 0.2607013 0.2154519
[8] 0.1503333
When we extract the component sdev from the R object runner.pca we get the square roots of the eigenvalues of the correlation matrix of the data.
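As a quick check of this statement (a sketch, assuming data contains only the numeric race-time columns):
eigen(cor(data))$values # matches (runner.pca$sdev)^2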
runner.pca$rotation
The columns obtained by extracting the component rotation from the R object runner.pca are the eigenvectors corresponding to these eigenvalues.
So to obtain the eigenvalues we square the sdev values.
We also obtain the percentage of variability explained by the different principal components, and the cumulative percentage of variability explained, directly from the eigenvalues.
# Eigenvalues
eig <- (runner.pca$sdev)^2
# Variances in percentage
variance <- eig*100/sum(eig)
# Cumulative variances
cumvar <- cumsum(variance)
runner.eig <- data.frame(eig = eig, variance = variance, cumvariance = cumvar)
runner.eig
         eig   variance cumvariance
1 6.62214613 82.7768266    82.77683
2 0.87761829 10.9702287    93.74706
3 0.15932114  1.9915143    95.73857
4 0.12404939  1.5506173    97.28919
5 0.07988027  0.9985034    98.28769
6 0.06796515  0.8495644    99.13725
7 0.04641953  0.5802441    99.71750
8 0.02260010  0.2825012   100.00000

Note that:
- The first 2 principal components retain about 94% of the original data variability
- Instead of the 8 original variables we may work with 2 principal components
Thus we can achieve a significant reduction in dimension by PCA.
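If we decide to work with the two retained components, their scores can be extracted from the prcomp output as follows (a minimal sketch):
scores2 <- runner.pca$x[,1:2] # scores of the athletes on the first two principal components
head(scores2)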
There are several inbuilt R functions in different packages which aid in PCA. The package factoextra is used for visualization of principal component analysis results.
factoextra can be installed and loaded as follows:
install.packages("devtools")
devtools::install_github("kassambara/factoextra")
# load
library("factoextra")
Scree plot using Factoextra:
The R function is fviz_screeplot().
fviz_screeplot(runner.pca, ncp=10)

Example where PCA is not worthwhile

Note that:
The first 5 principal components together explain 93% of the total variability
So we do not achieve any significant reduction in dimension by PCA for this problem. This may be attributed to the fact that the variables may have very low correlations amongst themselves.
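The data for this example are not reproduced here, but the point can be illustrated with simulated, nearly uncorrelated variables (a sketch, not the original data): the eigenvalues of the correlation matrix are then all close to 1 and no single component dominates.
set.seed(1)
z <- matrix(rnorm(100*6), ncol=6) # six nearly uncorrelated variables
round(prcomp(z, scale.=TRUE)$sdev^2, 2) # eigenvalues all close to 1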
Scree Plot
par(mar=c(5.1,4.85,4.1,2.1))
# eigen.values holds the eigenvalues of the correlation matrix for this example (computed earlier, not shown)
plot(1:6,eigen.values, type="o", main="Scree Plot", xlab="Index", ylab=expression(lambda[j]), cex.lab=1.5, col="dark red",lwd=2)

Even from the scree plot we see that the first bend in the curve occurs at index 2; however, at that point the eigenvalues have not yet dropped close to 0. A bend together with a drop to near 0 is seen only at about index 5.
SUMMARY
- In R we can perform Principal Component Analysis from first principles or using several inbuilt R functions from different libraries