Re-organizing data

Learning Objectives

  • Subsetting data and reordering data
  • Matching elements of dataframes
  • Replacing values in row/vectors
  • Add/remove columns and rows
  • Write to file

Subsetting data

When analyzing data, we often want to look at partial statistics, such as the maximum value of a variable per species or the average value per celltype.

One way to do this is to select the data we want, and create a new temporary array, using the subset() function. For instance, if we just want to look at the samples of a specific celltype "typeA":

subset(metadata, celltype == "typeA")
##         genotype celltype replicate
## sample1       Wt    typeA         1
## sample2       Wt    typeA         2
## sample3       Wt    typeA         3
## sample4       KO    typeA         1
## sample5       KO    typeA         2
## sample6       KO    typeA         3

We can also subset using other logical operators in R. For example suppose we wanted to subset to keep only the WT samples from the typeA cells.

subset(metadata, celltype == "typeA" & genotype == "Wt")
##         genotype celltype replicate
## sample1       Wt    typeA         1
## sample2       Wt    typeA         2
## sample3       Wt    typeA         3

Alternatively, we could try looking at only the first two replicates of each sample set. Here, we can use the less than operator since replicate is currently a numeric vector. Adding in the argument select allows us to specify which columns to keep. Which columns are left?

subset(metadata, replicate < 3, select = c('genotype', 'celltype'))

Matching data

Let's bring in our data matrix. The experimental design file is associated with RNA-Seq data. From the experiment we have three different data files that have been generated which can be found in the /data directory. We will be using the RPKM data for now. Load in this file similar to how we did for our metadata file.

rpkm_data <- read.csv("data/counts.rpkm")

Take a look at the first few lines of the data matrix to see what's in there.

head(rpkm_data)

It looks as if the sample names (header) are similar to the row names of our metadata file, but it's hard to tell since they are not in the same order. We can do a quick check of the dimensions and see if the numbers match up.

dim(rpkm_data)
## [1] 38828    12

Do we have data for every sample that we have metadata? To do that we can use the %in% operator. Although lacking in documentation this operator is well used and quite useful. It will take a vector as input to the left and will evaluate each element to see if there is a match in the vector that follows on the right of the operator. For each element of vector on the left it will return a logical value. Take a look at the example below:

A <- c(1,3,5,7,9,11)   # odd numbers
B <- c(2,4,6,8,10,12)  # even numbers

# test to see if any of A are in B
A %in% B
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

Since vector A is odd numbers and vector B is even numbers, there is no overlap and so for each element we get a FALSE. Let's change a couple of numbers inside vector B to match vector A.

B <- c(2,4,6,8,1,5)  # add some odd numbers in 

# test to see if any of A are in B
A %in% B
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE

The logical vector returned will tell us which elements are mismatching and we can count how many are TRUE to get an overlap. The two vectors do not have to be the same size. In this example the vectors are small and so it's easy to count; but when we work with large datasets this is not practical. The all function comes in handy here. Given a logical vector, it will tell you whether are all values returned are TRUE. If there is atleast one FALSE value, the all function will return a FALSE and you know that all of A are not contained in B.

all(A %in% B)

Suppose we had two vectors that had the same values but just not in the same order. We could also use all to test for that. Rather than using the %in% operator we would use == and compare each element to the same position in the other vector. Unlike the %in% operator, for this to work you must have two vectors that are of equal length.

A <- c(1,2,3,4,5)
B <- c(5,4,3,2,1)  # same numbers but backwards 

# test to see if any of A are in B
A %in% B

# test to see if any of A is equal to B
A == B

# use all to check if they are a perfect match
all(A == B)

Challenge

Apply the all function to check whether all of the row names of our metadata are contained in the header of our data file. Also check to see that they are in the same order.

Subset data using matching

The column names in our data matrix are all contained in the row names of our metadata, but they appear to be out of order. We can use the match() function to create a new data matrix such that samples are in the same order. The function match() takes at least 2 arguments: the values to be matched (in our case the row names metadata), and the table that contains the values to be matched against (in our case the column names of data. The function returns the position of the matches in the table, and this can be used to data columns. For values that don't match you can specify what values you would have it assigned using nomatch argument (by default this is NA).

# Retrieve a vector of indices where the column names are reordered to match the row names 
m <- match(row.names(metadata), colnames(rpkm_data))

# Create a re-ordered data matrix based on the matching indices
data_ordered  <- rpkm_data[,m]
# Check and see what happened
head(data_ordered)

# Verify that the two are identical in order
all(row.names(metadata) == colnames(data_ordered))

Modifying data.frames

Sometimes, you may have to add a new column to your dataset that represents a new variable. You can add columns to a data.frame using the function cbind() (column bind). Beware, the additional column must have the same number of elements as there are rows in the data.frame.

The data matrix we are working with has Ensembl IDs as the identifier, however it would also be nice to know the official gene symbols assocated with those IDs. The gene annotation was only provided for the counts file (data/annotated_combined.counts) but not the RPKM data. We need to take the gene symbols from the counts file and add it as a column in our data matrix.

counts <- read.csv('data/annotated_combined.counts')

Challenge

First you will use what you have learned previously to read in the file and check to make see if:

  1. all of the Ensemble IDs in the count matrix are contained in the RPKM matrix
  2. the order of Ensemble Ids is the same in both data frames
  3. subset such that you have two data files that have matching row names

Now, we should have two data files of the same dimensions. The column of interest from our counts file (gene symbols) can be cut and paste onto our data file using cbind (column__bind__) function to add columns. The number of rows needs to be identical between the two objects.

Suppose we wanted to remove the column, just like you can select columns by their positions in the data.frame or by their names you can also remove. The only difference is that we would use a minus sign to indicate the column number we wish to remove annot_data[,-1] or specify the column by name annot_data[,-genesymbol]. Note that the column names should not be quoted

Similar to cbind() for columns, there is a function rbind() (row__bind__) that puts together two data.frame. With rbind() the number of columns and their names must be identical between the two objects.

Challenge

Use the code provided below to generate a dataframe df1 of coloured blocks. Given the dataframe, try the following tasks:

  1. The length and width of the blocks are provided, can you add a column to include the following height values: 20, 20, 30?
  2. We have information for an additional block that is yellow, except we don't have height for information (height =20; width= 15). Can you still add the yellow block in as an additional row?
df1 <- data.frame(color=c("red", "green", "blue"), length=c(15,20,26), width=c(20,25,35))

Writing data to file

Although we have the annotated RPKM data matrix stored in memory, it would be quite useful for future analyses. We can export it to file using the write.csv command, including arguments for the filename (be sure to add on the correct path so it gets exported to your data directory). There are a numberof different options for data output (and input) and we will cover that in more depth in the last section of the R session.

write.csv(annotated_rpkm, file="data/annotated_rpkm", quote=F)