Bioinformatics Zen | How to draw simple graphs in R

How to draw simple graphs in R

2007.May.04

Graphs and statistics, you can't really get away from it. Even if you try, like a warm seafood sandwich, it'll come up later. So here I made up some example data to produce a short tutorial on how to represent common types of information in R.

R and example data First of all you'll need R to follow the examples, install instructions are here. Second you'll need the example data either in zip or tar.gz format.

Categorical data Categorical data is usually something with a name, and no numerical value. Round, Square, Triangular are categories. Small, Medium, and Large are ordered categories. The example categorical data file is categorical.csv.

To load this into R: data <- read.csv(file="path/to/file/categorical.R") This stores the information in an object called data. The data is 20 measurements for three bioinformaticians based on their field of study (categories). Each measurement is how many cups of tea that bioinformatician drank in one day. We can look at the first few lines of the data using the head command head(data) Systems.biology Functional.genomics Non.coding.RNA 1 2 2 5 2 2 2 5 3 2 2 6 4 2 3 4 5 2 4 7 6 2 3 5 The first column is the row number, followed by three columns, one for each bioinformatician and their cups of tea that day.

So how to visualise categorical data in R? Well first the data needs to in a different format. Currently the data has one column for each set of measurements - all the systems biologists are in the first column, and so forth. This is called "wide", because if I had collected data for 100 bioinformaticians, there would be 100 columns. You can see how this quickly gets difficult to manage.

A better format is "long". Which instead has a column for each type of variable. We have two variables, the first is the area of bioinformatics, the second is the number of cups of tea.

So how how can we can get the data from wide to long? Fortunately, there is an R package for this called, reshape. Unfortunately, reshape doesn't come as standard, with R. You'll need to follow the instructions for installing additional packages.

Once you've installed reshape, the library is loaded by typing. library(reshape) Then we use reshape as follows data <- melt(data,measure.var=c("Systems.biology","Functional.genomics","Non.coding.RNA")) The three variables are identified by the measure.var=c(... argument. Looking at the data, we can now see it's in the "long" format. head(data) variable value 1 Systems.biology 2 2 Systems.biology 2 3 Systems.biology 2 4 Systems.biology 2 5 Systems.biology 2 6 Systems.biology 2 The first column is the type of bioinformatics, the second the cups of tea. You can see the advantage of this, if I wanted to add another area of research, such as phylogenetics, I would add an extra row rather than an extra column. Now, imagine if we were using the database, this would be the difference between updating the schema and or inserting a new record.

So next I want to plot this data to see which researchers are drinking the most tea. I'm using another R package for this, called lattice. Luckily, lattice comes as standard with R so can be loaded with no extra installation. library(lattice) For visualising categorical data, a good plot is a box and whisker plot. The lattice command for this is bwplot. Here's the code to produce the plot. plot <- bwplot(value ~ variable, data = data) plot$ylab <- "Cups of Tea per day" print(plot) The formula value ~ variable tells R that the value (cups of tea) is a result of the variable (area of bioinformatics studied). The next bit data=data says that the data for this plot is stored in the R object named data. We put the result of this command into an R object called plot. On the next line plot$ylab is the label for the y-axis and I'm setting it to cups of tea per day. Finally print(plot) prints out my plot object to the graphics device.

From this it's pretty obvious what the trends are. Systems biologists drink exactly 2 cups of tea every data, functional genomists range 1-5, whilst a non-coding RNA-ist can drink up to 8 cups of tea per day. Crazy!

If you wanted to model this categorical data, to see if area of study significantly affects how many cups of tea were drunk, anova would be a good choice.

Continuous data

Continuous data is numerical. The example data is in continuous.csv. The file is read in using the same method as before. data <- read.csv(file="/Users/mike/Desktop/plots/continuous.csv") head(data) distance productivity 1 41.30473 34.69197 2 36.34507 35.53591 3 39.88439 36.17467 4 26.05197 37.84792 5 39.92318 34.59474 6 55.23140 34.89835 The data measure shows a bioinformatician's distance from the tea making area and their weekly productivity. Since I'm comparing one value with another, the data is already in the correct format with two columns.

The best way to compare two continuous variables is with a simple xyplot, with one on each axis. I create the plot in a similar way as before, by modelling productivity as a response of "distance to tea making area". This time I'm using the xyplot function. Not forgetting to label each axis. plot <- xyplot(productivity ~ distance, data=data) plot$xlab <- "Distance to tea making area (feet)" plot$ylab <- "Weekly productivity (hours)" print(plot) Continuous data

It looks likes productivity declines the further away from the tea making area. I knew drinking cups of tea made we work better! I want to confirm this, and one way to do this would be to add a line to plot. To do this with lattice I have to change the way the "panel" is created.

The panel is the area inside the axis where the points are drawn. At the moment the panel is created using panel.xyplot, which is the default for xyplot. Here's how to create a custom panel. custom_panel_lm <- function(x,y,...){

panel.xyplot(x,y,...) panel.lmline(x,y)

}

plot$panel <- custom_panel_lm print(plot)

The first command function(x,y,...) defines a method that expects two parameters x and y, our continuous variables, plus some other arguments (...) that I'm not worried about. The function defines two method calls, the first draws the plot using the default panel.xyplot function. The second then draws a linear trend line over the data using the xy parameters. Setting a custom panel is done similarly to the way the axis text was defined. plot$panel <- custom_panel_lm print(plot)

Which produces the plot I was after, the line indicates the trend between distance and productivity. However it's worth introducing a note of caution about trend lines. If you follow the line far enough, at some point the number of hours will become negative, something not possible. What we might expect is that hours decreases with distance, but never crosses the x-axis. One way to look at trends without making assumptions is to use the loess rather than lmline function. custom_panel_loess <- function(x,y,...){

panel.xyplot(x,y,...) panel.loess(x,y)

}

plot$panel <- custom_panel_loess print(plot)

The line is more "wobbly" but it gives us an unbiased view of the trend without coercing to a prior assumption, for example, a straight line. If I wanted to create a statistical model of this data I would use some sort of regression.

Factored categorical data

We have an idea of the trends in how many cups of tea researchers drink. But what if I wanted to see how this changed across seasons. How many cups of tea get drunk in winter as opposed to summer. This data is in categorical_categorical.csv. data <- read.csv(file="/Users/mike/Desktop/plots/categorical_categorical.csv") head(data) SB FG ncRNA SB.1 FG.1 ncRNA.1 1 2 3 5 0 1 6 2 2 3 5 0 0 8 3 2 3 4 0 3 6 4 2 2 6 0 3 6 5 2 4 5 0 4 7 6 2 3 6 0 3 7 Again this data is in wide format, The first three columns is the winter data, the second three columns are the summer data. The data is categorical (area of research), organised, by a second category (season). So first of all I need to shape this into long format . This time I'll need three columns because I have three variables, cups of tea, area of bioinformatics, and season. winter <- data[,1:3] summer <- data[,4:6]

winter <- melt(winter,measure.var=c("SB","FG","ncRNA")) summer <- melt(summer,measure.var=c("SB.1","FG.1","ncRNA.1"))

In the first two lines I'm splitting the data into two R objects one for winter, and one for summer. This bit [,1:3] selects columns 1 to 3. In the next two lines I'm melting the data in the same was as before. If we look at the two data sets, they're both now in long format. head(summer) variable value 1 SB.1 0 2 SB.1 0 3 SB.1 0 4 SB.1 0 5 SB.1 0 6 SB.1 0

head(winter) variable value 1 SB 2 2 SB 2 3 SB 2 4 SB 2 5 SB 2 6 SB 2

However I need to add the extra seasonal variable as a new column for each. summer <- cbind(summer,season="summer") winter <- cbind(winter,season="winter") The command cbind means bind a new column, I'm calling this column season, and it only contains one value, either summer or winter, which will be repeated for each row. head(summer) variable value season 1 SB.1 0 summer 2 SB.1 0 summer 3 SB.1 0 summer 4 SB.1 0 summer 5 SB.1 0 summer 6 SB.1 0 summer You can see that this extra column has now been added. Finally we need to combine the two data set together. data <- rbind(winter,summer) The command rbind means bind new rows. I'm using this to join the summer data to the bottom of the winter data. I'm almost ready to plot the data. However there's one more thing, the summer data has .1 attached to the end of each bioinformatics entries. We can see this by looking at the levels of the variable column. levels(data$variable) [1] "SB" "FG" "ncRNA" "SB.1" "FG.1" "ncRNA.1" This is corrected by replacing the names of the last three variables with that of the first three. levels(data$variable)[4:6] <- levels(data$variable)[1:3] levels(data$variable) [1] "SB" "FG" "ncRNA" As you can see the names are now consistent. And the only thing that remains is to plot the data. plot <- bwplot(value ~ variable | season, data=data) plot$ylab <- "Cups of Tea per day" print(plot) Categorical variable organised by second category The plot has been split for the two different seasons. This factoring was produced by using the | season part of the plot formula. The vertical bar means "given". So the layout of the plot is cups of tea per day as a result of area of bioinformatics given time of year. If I wanted create a statistical model for this data, I'd probably think about a factorial anova.

Factored continuous data

My final example is continuous data factored by a categorical variable. This is stored in the file called continuous_categorical.csv. data <- read.csv("/path/to/file/continuous_categorical.csv") head(data) water water.prod tea tea.prod hipflask hf.prod 1 0.4172023 33.99677 0.7731859 31.71567 1.1661047 24.90616 2 0.9628104 40.04022 0.5545423 29.11507 0.3830907 31.74964 3 1.3025050 40.78758 0.9033747 28.75205 1.0460207 29.64453 4 1.3352481 40.19776 1.2515750 28.68057 1.3909051 20.29845 5 0.8253107 38.38265 0.9861313 30.44975 0.2510354 31.51177 6 1.5136886 40.61841 0.9458123 29.50657 1.1346057 25.31571 In this data there are three different types of drinks, water, tea, and the contents of a hipflask. The amount consumed is measured, with the corresponding effect on productivity. So here we have two continuous variables, amount drunk and productivity, and one categorical, the type of liquid. The first thing I need to do is convert the data into long format to reflect this. water <- cbind(data[,1:2],drink="water") tea <- cbind(data[,3:4],drink="tea") hipflask <- cbind(data[,5:6],drink="hipflask") Here I've split the the data set into three smaller ones, and added an extra column to reflect the type of liquid. head(water) water water.prod drink 1 0.4172023 33.99677 water 2 0.9628104 40.04022 water 3 1.3025050 40.78758 water 4 1.3352481 40.19776 water 5 0.8253107 38.38265 water 6 1.5136886 40.61841 water > head(tea) tea tea.prod drink 1 0.7731859 31.71567 tea 2 0.5545423 29.11507 tea 3 0.9033747 28.75205 tea 4 1.2515750 28.68057 tea 5 0.9861313 30.44975 tea 6 0.9458123 29.50657 tea However looking at the individual data sets we can see that the variable names do not match. For example there's one column named water productivity and another named tea productivity. Actually they are the same variable, productivity. So therefore I have to rename the first two columns to be consistent. col.names <- c("volume","productivity") names(water)[1:2] <- col.names names(tea)[1:2] <- col.names names(hipflask)[1:2] <- col.names Now I can join this this three datasets back together again. Row bind would have complained if I hadn't they didn't all have the same column names. data <- rbind(water,tea,hipflask) head(data) volume productivity drink 1 0.4172023 33.99677 water 2 0.9628104 40.04022 water 3 1.3025050 40.78758 water 4 1.3352481 40.19776 water 5 0.8253107 38.38265 water 6 1.5136886 40.61841 water All that remains is to plot the data using the given notation for the categorical variable. I've also added a loess function for the using a custom panel. Both of these should be familiar from above. plot <- xyplot(productivity ~ volume | drink, data=data) plot$xlab <- "Average volume ingested (litres / day)" plot$ylab <- "Weekly productivity (hours)" custom_panel <- function(x,y,...){

panel.xyplot(x,y,...) panel.loess(x,y,col="red",lty=2)

}

plot$panel <- custom_panel print(plot)

Continuous variable organised by category

If I wanted to model this data I would probably start with an ANCOVA. However it might be a bit trickier than that as the contents of the hipflask seem to have a very non-linear effect on productivity.