When analyzing data, we often want to look at partial statistics, such as the maximum value of a variable per species or the average value per celltype.
One way to do this is to select the data we want, and create a new temporary array, using the subset()
function. For instance, if we just want to look at the samples of a specific celltype "typeA":
subset(metadata, celltype == "typeA")
## genotype celltype replicate
## sample1 Wt typeA 1
## sample2 Wt typeA 2
## sample3 Wt typeA 3
## sample4 KO typeA 1
## sample5 KO typeA 2
## sample6 KO typeA 3
We can also subset using other logical operators in R. For example suppose we wanted to subset to keep only the WT samples from the typeA cells.
subset(metadata, celltype == "typeA" & genotype == "Wt")
## genotype celltype replicate
## sample1 Wt typeA 1
## sample2 Wt typeA 2
## sample3 Wt typeA 3
Alternatively, we could try looking at only the first two replicates of each sample set. Here, we can use the less than operator since replicate is currently a numeric vector. Adding in the argument select allows us to specify which columns to keep. Which columns are left?
subset(metadata, replicate < 3, select = c('genotype', 'celltype'))
Let's bring in our data matrix. The experimental design file is associated with RNA-Seq data. From the experiment we have three different data files that have been generated which can be found in the /data
directory. We will be using the RPKM data for now. Load in this file similar to how we did for our metadata file.
rpkm_data <- read.csv("data/counts.rpkm")
Take a look at the first few lines of the data matrix to see what's in there.
head(rpkm_data)
It looks as if the sample names (header) are similar to the row names of our metadata file, but it's hard to tell since they are not in the same order. We can do a quick check of the dimensions and see if the numbers match up.
dim(rpkm_data)
## [1] 38828 12
Do we have data for every sample that we have metadata? To do that we can use the %in%
operator. Although lacking in documentation this operator is well used and quite useful. It will take a vector as input to the left and will evaluate each element to see if there is a match in the vector that follows on the right of the operator. For each element of vector on the left it will return a logical value. Take a look at the example below:
A <- c(1,3,5,7,9,11) # odd numbers
B <- c(2,4,6,8,10,12) # even numbers
# test to see if any of A are in B
A %in% B
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
Since vector A is odd numbers and vector B is even numbers, there is no overlap and so for each element we get a FALSE
. Let's change a couple of numbers inside vector B to match vector A.
B <- c(2,4,6,8,1,5) # add some odd numbers in
# test to see if any of A are in B
A %in% B
## [1] TRUE FALSE TRUE FALSE FALSE FALSE
The logical vector returned will tell us which elements are mismatching and we can count how many are TRUE
to get an overlap. The two vectors do not have to be the same size. In this example the vectors are small and so it's easy to count; but when we work with large datasets this is not practical. The all
function comes in handy here. Given a logical vector, it will tell you whether are all values returned are TRUE
. If there is atleast one FALSE
value, the all
function will return a FALSE
and you know that all of A are not contained in B.
all(A %in% B)
Suppose we had two vectors that had the same values but just not in the same order. We could also use all
to test for that. Rather than using the %in%
operator we would use ==
and compare each element to the same position in the other vector. Unlike the %in%
operator, for this to work you must have two vectors that are of equal length.
A <- c(1,2,3,4,5)
B <- c(5,4,3,2,1) # same numbers but backwards
# test to see if any of A are in B
A %in% B
# test to see if any of A is equal to B
A == B
# use all to check if they are a perfect match
all(A == B)
Apply the all
function to check whether all of the row names of our metadata are contained in the header of our data file. Also check to see that they are in the same order.
The column names in our data matrix are all contained in the row names of our metadata, but they appear to be out of order. We can use the match()
function to create a new data matrix such that samples are in the same order. The function match()
takes at least 2 arguments: the values to be matched (in our case the row names metadata
), and the table that contains the values to be matched against (in our case the column names of data
. The function returns the position of the matches in the table, and this can be used to data columns. For values that don't match you can specify what values you would have it assigned using nomatch
argument (by default this is NA).
# Retrieve a vector of indices where the column names are reordered to match the row names
m <- match(row.names(metadata), colnames(rpkm_data))
# Create a re-ordered data matrix based on the matching indices
data_ordered <- rpkm_data[,m]
# Check and see what happened
head(data_ordered)
# Verify that the two are identical in order
all(row.names(metadata) == colnames(data_ordered))
data.frames
Sometimes, you may have to add a new column to your dataset that represents a new variable. You can add columns to a data.frame
using the function cbind()
(column bind). Beware, the additional column must have the same number of elements as there are rows in the data.frame
.
The data matrix we are working with has Ensembl IDs as the identifier, however it would also be nice to know the official gene symbols assocated with those IDs. The gene annotation was only provided for the counts file (data/annotated_combined.counts
) but not the RPKM data. We need to take the gene symbols from the counts file and add it as a column in our data matrix.
counts <- read.csv('data/annotated_combined.counts')
First you will use what you have learned previously to read in the file and check to make see if:
Now, we should have two data files of the same dimensions. The column of interest from our counts file (gene symbols) can be cut and paste onto our data file using cbind
(column__bind__) function to add columns. The number of rows needs to be identical between the two objects.
Suppose we wanted to remove the column, just like you can select columns by their positions in the data.frame
or by their names you can also remove. The only difference is that we would use a minus sign to indicate the column number we wish to remove annot_data[,-1]
or specify the column by name annot_data[,-genesymbol]
. Note that the column names should not be quoted
Similar to cbind()
for columns, there is a function rbind()
(row__bind__) that puts together two data.frame
. With rbind()
the number of columns and their names must be identical between the two objects.
Use the code provided below to generate a dataframe df1
of coloured blocks. Given the dataframe, try the following tasks:
20, 20, 30
?df1 <- data.frame(color=c("red", "green", "blue"), length=c(15,20,26), width=c(20,25,35))
Although we have the annotated RPKM data matrix stored in memory, it would be quite useful for future analyses. We can export it to file using the write.csv
command, including arguments for the filename (be sure to add on the correct path so it gets exported to your data
directory). There are a numberof different options for data output (and input) and we will cover that in more depth in the last section of the R session.
write.csv(annotated_rpkm, file="data/annotated_rpkm", quote=F)