# Numerical Ecology with R (Tutorial)

In this exercise we will look at a data matrix of 16S rRNA counts in the 74 samples demultiplexed and processed with LotuS2. This dataset is the microbiota composition of 74 mice from 5 different mice strains. The aim of the original research was to define the effect that the mouse genome has on the microbiota and what the effect of living in the same cage would be. However, we found much stronger trends in the data, and these we will look at in this exercise. The 454 data was already compiled into a matrix with genus abundance per sample in a previous step. This matrix is called a feature abundance matrix, or abundance matrix for short. We will do an ecology-oriented analysis of the data, in later steps also taking metadata (experimental, environmental or clinical data that was collected for each sample, independent of the DNA) into account. The aim of this tutorial is to get an idea of the very basic steps of ecological data analysis using the programming language R. If LotuS was not run, the genus abundance table can be downloaded from here.
##### Requirements:
Finished LotuS2 run that can be downloaded here. These files are required for the following R tutorial.

### File preparation

Copy the LotuS output folder to a directory on your local computer, if not done so. Set the higherLvl folder with the provided files as your working directory in R using “setwd”. This way required files can be easily loaded. To find out how to use this command, you can type ?setwd() to open up the help. If there are other R-commands that you want to know more about, you can open the R-help for that command by entering in the R-prompt “?command”. This will be very useful when working with R, make sure to use this a lot as you can only learn more.
setwd(“dir_to_data”)

Load the provided data into the matrices M (Genus.txt, actual genus abundance data), using the read.delim command.
As R reads the matrix as an object of class data.frame, we convert M from a data.frame to a matrix [M=as.matrix(M)]. This is important for some of the following calculations, where we need a “matrix” class object.
```M = read.delim(file="genus.txt",row.names=1)
M = as.matrix(M)```
The matrix you loaded represents the number of 16S sequences assignable to each genus, which we could find in the samples. Also note that not all genera are real genera, but partly assigned unknown sequences. With these groups we do not know if this is a single genus or in fact several genera or in extreme cases even several classes, that just all fall under the same phylum tag. What are the advantages and disadvantages of keeping such undefined groups in the data?
Use the function “edit(M)” to have a different view of the abundance matrix.

### Matrix feature distributions

Let’s look at some basic features of the abundance matrix. The “summary(M)” command is a good start, but also look at total row and column counts (colSums, rowSums command). To see how the genera are distributed within each sample, we will plot a sample-wise density plot.We will be using a combination of the “density”, “lines” and “lapply” functions, to draw the densities of values found in each sample, let’s start with looking at the density of the first sample. In R you can access specific columns by writing the matrix coordinates in square brackets. For example M[1,] shows the first row of a matrix, M[,7] shows the 7th column etc. :
densityOfSample1 = density(M[,1])
Look at the object densityOfSample1 by simply entering the object name into the command prompt. Next try to visualize it with "plot(densityOfSample1)". In this plot you see that most genera are at 0 abundance, some genera have an abundance below 10 and some rare genera actually occur with a higher frequency, one genus even having ~1100 16S reads assigned to it. Which genus is this?
Alternatively you can also use the function hist, to plot a histogram of the abundances:
hist(M[,1], nclass = 50)
We can use the “apply” command, to apply the density to command to every column of M, which will return a list of density objects, which we store in object S_densities. The second argument to the apply function is the margin and set to 2, which tells the apply function that we want to work on columns (Margin = 2) and not on rows (Margin = 1).
S_densities = apply(M,2,density)
plot(1,1,type="n",ylim=c(0,3),xlim=c(0,5000))
this will open a new plotting window, already set to the range of x and y coordinates (xlim, ylim) we will need in this example. In this case we just want to plot a blank space, this is done with the type="n" argument. Try to replace the argument by type="p", to actually see that one point! S_densities is a list, so we use lapply (list apply), in combination with the lines function, that is:
lapply(S_densities,lines)
What you should see now in the plot window is the density distribution of all samples. The lines function is adding new lines, while a plot function makes a completely new plot. Try to replace the "lines" with "plot" to see this (it’s very fast, so keep a close eye on your plot). How are these lines already telling us something about the differences between the communities of each sample?

### Data normalization I

Maybe you noticed that the colSums command showed that the totals are not equal. In this state the data is actually not comparable among each other. One way to “correct” the data for this shortcoming is to normalize the matrix. In this step we will normalize the abundance matrix into variable M1:
M1 = sweep(M,2,colSums(M),"/")
The sweep command is extremely useful, as it will apply a simple arithmetic operation (like divide in this case) in a matrix column or row-wise with a vector of your choice. So it is very similar to apply, but takes more basic mathematical operations. In this case we will divide each column by the sum of the column, this is called normalization. Now we will compare these matrices using the barplot function. For this we need to open another graphical window, using the X11 function:
```barplot(M)
X11()
barplot(M1)```
What do you notice about the sample composition? What does the graph mean? Discuss where you would want to normalize the data (and where not). Close all open plots. Now replot the sample-wise density plot (as you did in step 3), but start the plot with these adapted x and y ranges. Additionally we will this time put a label on the x- and y-axis:
plot(1,1,type="n",ylim=c(0,80),xlim=c(0,1),xlab="relative genus abundance", ylab="Frequency of genera")
You will notice that the graph looks different from you previous plot. What changed due to the normalization? Are the samples more similar to each other using M or M1? If you spot a difference in species abundance between two samples using matrix M, is this difference real, does it have scientific value?

### Diversity

For the next step the R-library vegan is required. It is a set of functions specifically designed for ecological data analysis. First we need to install this external package using the command:
install.packages(“vegan”)
This might take a minute or two, use the time to explore the vegan webpage. Load vegan
library(vegan)
Let’s try to put the differences we observed in sample density into numbers. To do this ecologists rely on the concept of diversity. Diversity describes the evenness of species distributions as well as the richness of species that are observed in a given ecological system. We will first calculate the Shannon diversity, one commonly used method to calculate diversity:
div = apply(M1,2,diversity,index="shannon")
Now we can see in action what these indices are actually doing for us; Plot the density of the sample with the lowest and highest diversity in red and blue on your previous density plot of M1, this you do by first finding out which diversity indexes are the maximum and minimum values using the which.max and which.min functions. Don’t forget to have the last density plot still open (or replot it from step 4 on M1):
```which.min(div) #should be bl43
which.max(div) #should be bl48
You can readjust the window by changing the ylim and xlim attribute in the plot function, if necessary (e.g. try to rerun using ylim=c(0,180) ). Try to explain why the colored samples have the highest and lowest diversity. What does this tell about an ecosystem (remember that these are genus abundances).

### Data normalization II

A different way to normalize the data is to sample exactly equal amounts of 16S rDNA for each sample in this experiment. Of course in practice this is impossible to do, but we can simulate this, by randomly selecting a subset of 16S rDNA. This is called rarefaction. Rarefy your original abundance matrix (M) into M2, using 2000 reads per sample:
```if (!require("rtk")) install.packages("rtk");
library(rtk)
rarefied <- rtk((M), depth = 2000, ReturnMatrix = 1)
M2 <- rarefied\$raremat[]
colSums(M2)```
Use colSums(M2) to check if the rarefaction worked. The main use of rarefaction is in calculating diversity and richness correctly, for this we will look in the following at observed richness. The concept of observed richness within a sample is pretty simple (but useful): richness describes the number of different species that occur at least once in a sample. This can be calculated in two steps:
OnceOrMoreOftenPresent = M1>0
The sum of each column in this matrix will tell us how many species were detected in total within the respective sample:
rich1 = apply(OnceOrMoreOftenPresent,2,sum)
Compare the richness values in “rich1” to the richness obtained on the rarefied matrix M2, calculated with a shortened command:
rich2 = apply(M2>0,2,sum)
Compare rich1 and rich2 in a matrix value by value. This can be easily done using the “cbind” command to bind two vectors column-wise together, so we get a matrix with 2 columns. What does the second part of the command do? What happens if you change that to order(rich2)?
cbind(rich1,rich2)[order(rich1),]
Discuss which richness values have the highest value to the researcher and why the order is very different between these two richness estimates. Is one way clearly wrong? Why did we choose 2000 as cutoff for the sequences per sample? What is the maximum value we could choose?

### Collectors curve

To create a collectors curve of the dataset is as easy as calling:
collectors.curve(M1, times = 100, bin = 1, ylab = "Diversity", xlab = "Samples", col = 3, col2 = 1)
This creates a colectors curve, sampeling a hundred times without bining samples and coloring the border black (1) while keeping the background green (3). Changing the object from M1 to the rarefied dataset M2 allows the visualisation of the rarefaction result: Collector's curve as plotted with rtk. Note that several groups are also possible. Note that you can modify these plots, to also include several groups that are compared separately, or even richness accumulated within a group, but iteratively across groups so that the final richness of all samples is still gained. Try
?collectors.curve
to get the help for collectors.curve and play with the arguments that can be given to the plot.

### Plotting richness and diversity

To visualize richness and diversity of the samples is as easy as calling plot(rarefied) on the dataset rarefied by RTK. Without grouping the samples into clusters, this makes very little sense. Grouping the samples could be done using the column names like this:
```groups <- sapply((colnames(M2)), substr,start = 1, stop = 2)
plot(rarefied, groups = groups, div = 'richness')```
Comparison of three rarefactions at one specific depth Which shows equal diversity across the groups. To plot a rarefaction curve is equally easy, but requires fresh rarefaction to several depths first:
```rarefiedCurve <- rtk((M), depth = seq(10,2000, length.out = 50), ReturnMatrix = 1)
groups <- sapply((colnames(M2)), substr,start = 1, stop = 2)
plot(rarefiedCurve, groups = groups, div = 'richness', raw = T, pch = c(1,5,16), xlab = 'Depth', ylab = 'Richness', fit = F, lty = 0)```
Rarefaction curve comparison of three random groups.

### Clustering / Ordination

First samples are clustered to see underlying data structures. For this tutorial we will choose a hierarchical clustering, based on a bray-curtis distance between samples:
```BCD = vegdist(t(M1), dist="bray")
cluster = hclust(BCD)
plot(cluster)```
How many groups do you think might be in this clustering?
To visualize the samples and their relatedness to each other in a two-dimensional space, we can use an ordination to visualize the data in a low dimensional space. The dimensionality of the original matrix (73 genera=73 dimensions) is reduced to two dimensions. If you know what a PCA (Principal component analysis) is, this step will use a conceptually similar, but methodologically quite different technique to perform an ordination of the data, NMDS (non-metric multidimensional scaling).
Start by calculating a 2-dimensional NMDS of the data using M1. We will use the Bray-Curtis distance in this example.
nmds = metaMDS(t(M1),distance = "bray") #actual NMDS command; matrix needs to be transformed to conform with vegan’s standards
Take a look at the “nmds” object and explore some of its features (e.g. type “str(nmds)” to see what variables are stored within the NMDS object). Try to find out what the “stress” of your ordination is. What does stress stand for (tip: go to the R help on metaMDS)? Next we can visualize the NMDS, similar to what you get out of PCA’s, displaying samples only:
plot(nmds,display ="sites")
The important difference of NMDS compared to PCA is, that NMDS works with any kind of distance metric, while PCA can only use Euclidean distances between samples. A second important feature of NMDS is, that this method it finds non-parametric, monotonic relationships between objects; in short: it doesn’t assume a specific data distribution. Why might these two features be important for ecologists?
You might have noticed that you see two clusters, similar to the hierarchical clustering of the data. We can get for each sample the identity within the two clusters using the “cutree” commands, specifying k=2 (2 clusters). This can be plotted into the NMDS with the following command:
memb = cutree(cluster, k = 2)
ordispider(nmds,memb) Congratulations, you have just visualized the mouse enterotypes. Next we are going to look closer at these. If you want to know the exact methods to detect enterotypes in your data visit the embl on enterotyping.

### Univariate Statistics

In the last tutorial part, we will test for all the genera in the matrix whether they show significant differences between two clusters (mouse enterotypes). The scientific question we are posing here is: what are the significant differences in the gut microbiota of between enterotypes? We will use a non-parametric test (kruskal-wallis) to do the tests, as ecological data is in most cases not normally distributed. This test is very similar to the student t-test, and the interpretation works the same way:
Kt = kruskal.test(M1[6,],memb)
Look at the object typing Kt. This will show you a human readable summary of the test and significance level. You can access elements of a list (Kt is a list in this case) using the “\$” operator. Try to extract the p-value from the Kt object.
Once you know how, we can start to calculate the significance for every genus in the M1 matrix. These p-values we will store in a newly created vector “pvals”. Let’s add the first 2 p-values to the vector:
```pvals = c()
pvals = kruskal.test(M1[1,], memb)\$p.value
pvals = kruskal.test(M1[2,], memb)\$p.value```
... Since doing this 73 times takes a long time, we will be using a for-loop to “loop” over the matrix and do this for us. We could as well use the apply function, but the syntax would get a little more complicated, since we are only interested in a subpart of the result, the \$p.value part. Use the for loop like this:
```for (i in 1:dim(M1)){
pvals[i] = kruskal.test(M1[i,], memb)\$p.value
}```
If some of the p-values are NA's, this could mean that there are not enough comparable values for this genera. As an additional help, you can add the name of the taxa to the pvals vector using the names command (that will name a vector):
names(pvals) = dimnames(M1)[]
Which taxa are significantly different? In this case we will use the normalized M1 matrix, can you explain why we do not use the M or M2 matrix? Would either be wrong to use? In total we were testing in 73 genera, if their p-value was below a threshold of 0.05. What is the chance of observing data with a p-value >0.05 by random chance? How many genera do you expect to be below this threshold by random chance? To avoid statistical errors of this kind, we will use a Benjamini-Hochberg multiple testing correction: