Visualising and exploring multivariate datasets using singular value decomposition and self organising maps

July 17th, 2007

Hola from Madrid, I’ve come here for a data analysis summer school. Last week, there was an interesting class on dimensionality reduction, and since multivariate datasets are prevalent in this -omic era, I thought to post a discussion of what I learnt. The aim of this example is illustrate one technique for visualising multivariate data, singular value decomposition, and a second technique for exploring it, self organising maps.

The Dataset

I’ll be doing this example in R, instructions for getting and installing can be found here. I’m using the example crab data which is in the MASS package. This data contains five measured morphological characteristics for both sexes of two species, orange and blue. The morphological categories are as follows

  • FL frontal lobe size (mm)
  • RW rear width (mm)
  • CL carapace length (mm)
  • CW carapace width (mm)
  • BD body depth (mm)

We can treat these five variables as our observations, and in practice these could be any characteristic, for example each could be an expression level in a microarray experiment. The aim of the analysis to identify the underlying characteristics of the data, which in this case could be crab species and gender. Here’s how to load the crab data


# Access the MASS library, which contains the crab data
library(MASS)


# Load the crab data
data(crabs)


# Show the first few entries of the data
head(crabs)

Dimensionality reduction and visualisation

The first thing I want to do is visualise the data to get a feel for it’s shape. Because of the restrictions of time and space, it’s difficult to conceptualise information using more than three dimensions. It would be possible to create a plot on three axis, then use point colour and size for the last two variables, however this would be complex and add little meaning.

Instead I’m going to use dimensionality reduction techniques to effectively “simplify” the data, then plot it. The most widely used technique, and one that you have likely already heard of, is principle components analysis or PCA.

PCA is difficult to describe, but in essence it will create a new variable that describes a large proportion of the variation in the data, this is then the first component. This component is factored out, and a second component calculated that explains the remaining variation. This is repeated, creating new components until all the variation has been explained. When I use the phrase “explaining variation”, imagine that some variable in the data set are correlated, for example shell length and width; a change in one usually shows a corresponding change in other. When you factor this out, to create a component, you have explained x amount of variation.

PCA calculation is based in linear algebra, if you’re interested there are plenty of sites that explain the process but I’m not going into detail. Instead I’ll use the available R function, and plot the first two components.


# Library for producing plots
library(lattice)


# Simple method call for PCA on the variables of the crab data
prin.comp <- (princomp(crabs[,4:8]))


# Create a simple plot of the first two components
library(lattice)
pca.plot <- xyplot(prin.comp$scores[,2] ~ prin.comp$scores[,1])
pca.plot$xlab <- “First Component”
pca.plot$ylab <- “Second Component”
pca_plot

Principle components plot

This demonstrates how the five dimensional data is now projected into two. This is a perfectly acceptable method for dimensionality reduction. However for the rest of this post, I prefer to use singular value decomposition (SVD) as opposed to PCA. PCA and SVD are closely related, but I have a better understanding of SVD, and therefore will be explain how to use it in more detail. Here’s the code to perform SVD on my data.


# First normalise the data by row
# This isn't necessary for SVD,
# but it is for the self organising maps we'll be using later.
crabs.x <- as.matrix(t(scale(t(crabs[,4:8]))))


# Run the SVD with a similar method call to that of PCA
crabs.svd <- svd(crabs.x)

After performing dimensionality reduction using SVD, PCA or any related technique, the first thing you might be interested in is how much of the variation is explained by each component. In the R svd object, the singular values are scores in the d matrix, which is stored as an array.


# Calculate the percentage of variation explained by each of the components
# Each value divided by the total
var.explained <- crabs.svd$d/sum(crabs.svd$d)
var.explained.plot <- barchart(crabs.svd$d/sum(crabs.svd$d) ~ c(1:length(crabs.svd$d)),
horizontal=F,xlab="Component",ylab="Percentage of variation explained")
var.explained.plot

Barchart showing the variation explained by each of the singular values

It’s easy to see from this plot that the first singular value explains a huge amount of the variation, about 95%. The other components quickly drop off, the second value explaining about 2%. Similarly to the PCA, we want to project the five dimensional crab data into two dimensions, for this we’ll use the first and second components since they explain most of the variation. As an aside, since the first value is so large it would be reasonable to visualise the data in just one dimension, for example using a histogram.

In the svd object, the singular value scores for each crab are in the u matrix, so that the first two dimensions can be projected using the first two columns of this matrix.


plot(crabs.svd$u[,1],crabs.svd$u[,2])

However, I would like to demonstrate some interesting properties of the svd matrices. The svd matrix v contains the singular value scores for each variable; the first column of this matrix shows how much each morphological characteristic contributes to the first singular value. The matrix can therefore be used to project the original data into k number of required dimensions. So for example multiplying the original crab.x data by the first column of the v matrix will result in the first column of the u matrix: the first dimensional projection of the data. Since we’re interested in k=2 dimensions we subset the v matrix for the first two columns then multiply the crab.x data by this.


# Trim to first two components
transform <- crabs.svd$v[,1:2]


# Project the crab data onto two dimensions using the first transform matrix
# The symbol %*% denotes matrix multiplications,
# as opposed to the R default of array multiplication
crabs.projection <- crabs.x %*% transform
# You can check the resulting values against the first
# two columns of the u matrix to see that they are the same

You might ask, whats the point of performing this procedure when the projection scores are already available? Well imagine you have a new set of data, for example five new crabs not in the original set, if you wished to see how these new crabs relate to the original ones, rather than rerunning the SVD, you could project the new data onto the old data using the transform matrix. Another example, imagine you have a library of articles, and word frequencies in each. Using SVD and then a transform matrix, you can query new articles against your existing data set to see which is the most related. This is very similar to latent semantic indexing, an important method in text mining.

So, I plot the first two projections


# Plot the projection
projection.plot <- xyplot(crabs.projection[,2] ~ crabs.projection[,1],
xlab=”First singular value”,ylab=”Second singular value”)
projection.plot

Singular value decomposition projection of the data

The same as the PCA plot, I now have the data in two dimensions, but remember even though they are plotted on similar scales the x axis explains 95% of the variation in the data, compared with the y axis which only explains about 2%. From the appearance of this plot we can see a continuous arc shape with no easily identifiable clusters. If you saw a set of points identifiably split away from the rest, you could say this was a distinct set of crabs and would look in more detail at their characteristics. Instead, here there are no visible clusters so we use another multivariate method to explore the data - self organising maps.

Self organising maps

A self organising map is a rectangular or a hexagonal grid which, through a number of iterations, shapes itself to best fit the landscape of a dataset. Each node in this grid is associated with a number of points. It’s easier to see how Sims work rather than explain them, so here’s the code to produce a self organising map.


# Dimensions of the SIM grid, the number of nodes in each side
Sim.x = 6
som.y = 6
som.shape = "rect"


# Number of iterations to perform
# More iterations produces a better fit
iterations <- 1e6


# Call the self organising map method
# This may take up to 10 minutes depending on your processor
# On my 2GHz Mac, about 2 minutes
crabs.som <- som(crabs.x,som.x,som.y,rlen=iterations,topol=som.shape)

The figure below illustrates how the som fits the data better with increasing numbers of iterations.

Illustration of the self organising map fitting the crab data

You’ll notice that the SOM plot is in two dimensions, the same as the SVD projection of the data. I ran the SOM on the five dimensional crab data, not it’s projection, therefore the SOM fits itself to a five dimensional landscape, however using the transform matrix I made earlier I can again project each node’s five dimensional position into that of the two dimensional plot. Another reason why using the transform matrix can be useful.

Here’s some R code, that you can use to perform a similar operation.


# The codes object contains the positions of the SOM nodes, in terms of the original five dimensional data
# Using the transform matrix we can project the position of these nodes from five dimension into two.
som.projection <- crabs.som$code %*% transform


# A custom panel to for projecting the SOM onto the data
# Plotting the SOM node points is easy, but joining them correctly with lines requires a little more R/Lattice trickery
som.panel <- function(x,y,subscripts,...){


panel.xyplot(x,y,...)


# Plot the SOM points
panel.points(som.projection[,1],som.projection[,2],
col=”red”)


# Plot the x axis lines
for(i in 1:som.x){
panel.lines(som.projection[(i*som.x-(som.x-1)):(i*som.x),],
col=”red”)
}


# Plot the y axis lines
for(i in 1:som.y){
panel.lines(som.projection[som.y*(1:som.y) - (i-1),],col=”red”)
}


}
som.plot <- xyplot(crabs.projection[,2] ~ crabs.projection[,1],panel=som.panel)
som.plot$xlab <- “First component”
som.plot$ylab <- “Second component”
som.plot

Since the SOM has now shaped to fit the data, we can now effectively look at the data through the “eyes” of the self organising map. The SOM is a 6×6 grid, each node in the grid has a number of data points associated with it. By plotting the density of the data in the SOM, we can begin to see the “landscape”.


# The code.sum object lists the x and y coordinates of each node, and the number of data points associated with it.
# We can create a heatmap showing the density of data points across the SOM
# nobs = number of observations
# I added 1 to each axis, so the range is 1-6 rather than 0-5
data.landscape <- levelplot(nobs ~ (x + 1) * (y + 1) ,crabs.som$code.sum)
data.landscape$xlab <- "SOM x axis"
data.landscape$ylab <- "SOM y axis"
data.landscape

Visualising the density of observations in the self organising map

From this plot we can see that there two clusters of high density, in the lower left and in the upper right. We can look at the morphology of the crabs in the these clusters with the following code.


# Select the original data using the x y coordinates from the som
cluster.1 <- crabs.som$visual$x == 0 & crabs.som$visual$y == 0
cluster.2 <- crabs.som$visual$x == 5 & crabs.som$visual$y == 5


# Cross reference this with the original crab data, and create a data frame
clusters <- rbind(cbind(crabs.x[cluster.1,],1),
cbind(crabs.x[cluster.2,],2))
names(clusters) <- c(names(crabs[,4:8]),”cluster”)


# Create parallel plot, illustrating the morphology of each cluster.
cluster.plot <- parallel( ~ clusters[,1:5] | as.factor(clusters[,6]),varnames=names(crabs[,4:8]))
cluster.plot

Two clusters in the crab data

You can see the data in the first cluster has the opposite morphology to that of the crabs in the second cluster. Furthermore looking at the species and sex of these crabs, we can see that cluster one contains male blue crabs and the second cluster contains orange female crabs.


crabs[cluster.1,]
crabs[cluster.2,]

Each of these clusters lies at the extreme ends of the arc we saw in the original SVD plot, so we can hypothesise, based on the shape of the plot, that crab morphology shows a continuous distribution between orange males and blue females. Given the shape of the arc we might expect that at the apex on the left hand side, we get the opposite: orange males and blue females. We can test this simply.


# Select the crabs at the extreme left of the first projection
extreme.left <- crabs.projection[,1] < -1.9995
crabs[extreme.left,]

We can see, with the exception of one, that this is indeed the case.

Summary

If you’re still reading this - thanks. What I intended to be a short demonstration has come in at over two thousand words. I hope it has been a useful introduction to how SVD can be used to simplify and visualise data, and with a combination of self organising maps and a little intuition you can begin to get an understanding of your multivariate data.

10 responses

  1. Animesh comments:

    Great post, thanks :)
    When I was reading about PCA and SVD, I also got confused. My Prof. jokingly said “Well PCA is what physicist call SVD”!
    Essentially when PCA is calculated using the covariance matrix, it is directly proportional to SVD. There is a nice tutorial at http://arxiv.org/ftp/physics/papers/0208/0208101.pdf [PDF] which demonstrates this difference.
    Problem I face using such dimensionality reduction techniques:
    1. The components can not have a biological meaning attached to it, for e.g. if we use this for analyzing the Gene expression data we move from Gene space to these component space and the whole meaning attached to gene expression is lost in a way
    2. Once you pick up the top few vectors which capture say about > 80% of variance, there is no way to go back and retrieve the original data. So gene informations is lost and thus not a good way for feature selection.
    Regarding SOFM, unless we use a good feature selector initially [pick good discriminating genes], the feature space is so huge for gene expression that SOFM literally cries.
    I am not an expert in this field, but this is what I have felt after reading and using the material available on this. Would love to read more in case you have some pointers which address above problems.

  2. Mike comments:

    Hi Animesh,
    I am in similar situation, multivariate analysis techniques are an important method, but the mechanics are difficult for me to understand since I have a biological rather than mathematical background. I´ll answer both you points with the limited knowledge I have though.

    That principle components have no biological value, I would say yes and no. Without any interpretation, they are indeed meaningless abstractions that produce no additional knowledge about your data. However, as I have tried to show in the above example, with a little more intuition you can begin to interpret the what the components mean. In the crab data, a combination of the first and second components can be used to determine which species the crab is. But as you say, this is an arbitrary mathematical factorisation of the five morphological characteristics, which on it´s own is useless.

    There is a technique call negative matrix factorisation, where the produced components have a more relevent meaning to the original data. This technique isn´t available for R, but is for Matlab, so if you have access it might be worth a try.

    Secondly, I don´t how to do this with PCA, but it is possible to return to the original data from the SVD factored data. If the first two components contain 80% of the data, then I believe you can select the first two columns of the u matrix and the first two rows of the transposed v matrix, matrix multiply these together and I believe you will get your original data back containing 80% of the original variation. I think this could be a useful method for filtering noise from data.

    As for SOMs, again my knowledge is limited, but I would suggest increasing the size of the grid. It might be worth reading this paper which is from the developers of the SOM package, where they applied it to gene expression data.

  3. Animesh Sharma comments:

    Had no idea about “negative matrix factorisation”. I have been looking for such a thing me try that and see how it goes :)
    That Tamayo et al. paper is classic one, I could not understand anything when I first read, now slowly it is making sense to me. They have used a non variant criterion (with their own definition) for thresholding. Further the expression thresholding does bring the dimension to 3 digits. I feel everyone should read (even those who are averse to gene expression analysis papers), good stuff for sure, thanks.

  4. Bioinformatics Zen » Deriving biological meaning from principle components analysis pings back:

    [...] made an excellent point on my previous post - principle components analysis is great and all that, but the results have [...]

  5. Glenn Hammonds comments:

    I hope you don’t mind a bug request …

    > crabs.x <- as.matrix(normalize(crabs[,4:8]))
    Error in as.matrix(normalize(crabs[, 4:8])) :
    could not find function “normalize”

    any hints? Are you assuming a standard package I don’t have?

  6. Mike comments:

    Hi Glen,

    Thanks for finding this bug. This script worked when I wrote it a few months back. I can’t find the normalize function now, which is really weird. I was in Spain when I wrote this, but doesn’t really explain much.

    You can get a similar row-normalise functionality by transposing the data, normalising the data by columns then transposing back. This is quite awkward but is the first solution that came to mind.

    crabs.x <- as.matrix(t(scale(t(crabs[,4:8]))))

  7. Hanif Khalak comments:

    Very nice post and blog in general - thanks for the tutorial. I’m interested in using matrix factorization as well, to integrate “multi-omics” datasets into a correlation model. I found a nice MATLAB example here: http://www.stanford.edu/~boyd/cvx/examples/html/nonneg_matrix_fact.html

    As for R code for NMF, I haven’t found anything either, but did see this exchange on R-help:

    http://www.mail-archive.com/r-help@stat.math.ethz.ch/msg88593.html

    BTW, the original “normalize” line should work fine if you prepend the line “library(som)”.

  8. chris comments:

    Great round up Mike - very useful!

    re: transformation. As a side-note, if multiplying subsets of U and V, you should also multiply by the equivalent elements of D (the eigenvalues).

  9. Singular Value Decomposition « Mstobe’s Weblog pings back:

    [...] 16, 2008 at 11:15 pm (Uncategorized) Today I popped into a quite interesting discussion from Mike Barton about Singular Value Decomposition in biology, which is what I’m working on [...]

  10. paul comments:

    Hi, I’m a math guy getting into bioinformatics. While the biology perplexes me, the math isn’t so bad. Anyway, fyi, principal components and svd are basically the same thing. The main difference between the two methods is mean centering. PCA does this whereas SVD does not. Also, usually the output of pca programs is the new coordinates for the data whereas for svd the output is 3 matrices that can be used to get the new coordinates. Every PCA program really does SVD with just some extra steps.

Leave a comment