- ggplot2 revisited
- dplyr
- The pipe %>%
- tidyr
- An RNA-Seq example
- Appendix: Tidy linear modelling
This lesson covers a set of packages, primarily by Hadley Wickham, for tidying data and then working with it in tidy form. These packages are collectively known as the “tidyverse”.
The packages we are using in this lesson are all from CRAN, so we can install them with install.packages. Don’t run this if you are using our biotraining server, the packages are already installed!
# install.packages(c(
#     "tidyverse",
#     "viridis",
#     "broom"
# ))
library(tidyverse)  # Load all "tidyverse" libraries.
# OR
# library(readr)    # Read tabular data.
# library(tidyr)    # Data frame tidying functions.
# library(dplyr)    # General data frame manipulation.
# library(ggplot2)  # Flexible plotting.

library(viridis)    # Viridis color scale.
These packages usually have useful documentation in the form of “vignettes”. These are readable on the CRAN website, or within R:
vignette()
vignette(package="dplyr")
vignette("dplyr", package="dplyr")
Let’s continue our examination of the FastQC output. If you’re starting fresh for this lesson, you can load the necessary data frame with:
bigtab <- read_csv("r-more-files/fastqc.csv")
## Parsed with column specification:
## cols(
##   grade = col_character(),
##   test = col_character(),
##   file = col_character()
## )
We saw ggplot2 in the introductory R day. Recall that we could assign columns of a data frame to aesthetics (x and y position, color, etc.) and then add “geom”s to draw the data.
With ggplot2 we can easily view the whole data set.
ggplot(bigtab, aes(x=file,y=test,color=grade)) + geom_point()
With categorical data on the x and y axes, a better geom to use is geom_tile.
ggplot(bigtab, aes(x=file,y=test,fill=grade)) + geom_tile()
Publication quality images
ggplot2 offers a very wide variety of ways to adjust a plot. For categorical aesthetics, usually the first step is ensuring the relevant column is a factor with a meaningful level order.
y_order <- sort(unique(bigtab$test), decreasing=T)   # y axis plots from bottom to top, so reverse
bigtab$test <- factor(bigtab$test, levels=y_order)

x_order <- unique(bigtab$file)
bigtab$file <- factor(bigtab$file, levels=x_order)

# Only necessary if not continuing from previous lesson on programming!
color_order <- c("FAIL", "WARN", "PASS")
bigtab$grade <- factor(bigtab$grade, levels=color_order)

myplot <- ggplot(bigtab, aes(x=file, y=test, fill=grade)) +
    geom_tile(color="black", size=0.5) +            # Black border on tiles
    scale_fill_manual(                              # Colors, as color hex codes
        values=c("#ee0000","#ffee00","#00aa00")) +
    labs(x="", y="", fill="") +                     # Remove axis labels
    coord_fixed() +                                 # Square tiles
    theme_minimal() +                               # Minimal theme, no grey background
    theme(panel.grid=element_blank(),               # No underlying grid lines
          axis.text.x=element_text(                 # Vertical text on x axis
              angle=90, vjust=0.5, hjust=0))

myplot
Plots can be saved with ggsave. Width and height are given in inches, and an image resolution in dots per inch (DPI) should also be given. The width and height determine the relative size of text and graphics, so increasing the resolution is best done by adjusting the DPI. Compare the files produced by:
ggsave("plot1.png", myplot, width=5, height=5, dpi=600)
ggsave("plot2.png", myplot, width=10, height=10, dpi=300)
It may be necessary to edit a plot further in a program such as Inkscape or Adobe Illustrator. To allow this, the image needs to be saved in a “vector” format such as SVG, EPS, or PDF. In this case, the DPI argument isn’t needed.
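For example, a sketch of saving in a vector format (the filename here is arbitrary, and the built-in mtcars data merely stands in for a real plot; SVG output would additionally need the svglite package, so it is left commented out):

```r
library(ggplot2)

# Any ggplot object will do; mtcars is only a stand-in.
p <- ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()

# Vector formats: no dpi argument needed, and the file can be
# edited later in Inkscape or Adobe Illustrator.
ggsave("plot.pdf", p, width=5, height=5)
# ggsave("plot.svg", p, width=5, height=5)   # SVG needs the svglite package
```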
dplyr

dplyr gives us a collection of convenient “verbs” for manipulating data frames. The name is a contraction of “Data frame apPLYeR”, as some of these verbs have a similar flavour to apply/lapply/tapply/etc.
Each verb takes a data frame as input and returns a modified version of it. The idea is that complex operations can be performed by stringing together a series of simpler operations in a pipeline.
#  input     +--------+     +--------+     +--------+    result
#  data  %>% |  verb  | %>% |  verb  | %>% |  verb  | -> data
#  frame     +--------+     +--------+     +--------+    frame
tibbles
bigtab
## # A tibble: 72 x 3
##    grade test                         file      
##    <fct> <fct>                        <fct>     
##  1 PASS  Basic Statistics             Day0.fastq
##  2 PASS  Per base sequence quality    Day0.fastq
##  3 PASS  Per tile sequence quality    Day0.fastq
##  4 PASS  Per sequence quality scores  Day0.fastq
##  5 FAIL  Per base sequence content    Day0.fastq
##  6 WARN  Per sequence GC content      Day0.fastq
##  7 PASS  Per base N content           Day0.fastq
##  8 PASS  Sequence Length Distribution Day0.fastq
##  9 PASS  Sequence Duplication Levels  Day0.fastq
## 10 WARN  Overrepresented sequences    Day0.fastq
## # ... with 62 more rows
As we used readr to read the data, bigtab is a “tibble”, which is the Tidyverse’s improved data frame. The dplyr verbs also all return their results as tibbles. You can also create tibbles explicitly with the tibble function (or data_frame). One convenient feature is that a tibble only shows a few rows when you print it. If you do want to see the whole table, you can use as.data.frame, or use the View function to view it in a tab.
as.data.frame(bigtab)
View(bigtab)
The n and width arguments to print can also be used to print more rows or columns respectively.
print(bigtab, n=100, width=1000)
filter

Say we want to know all the tests that failed. filter provides a convenient way to get these.
filter(bigtab, grade == "FAIL")
## # A tibble: 8 x 3
##   grade test                      file       
##   <fct> <fct>                     <fct>      
## 1 FAIL  Per base sequence content Day0.fastq 
## 2 FAIL  Per base sequence content Day4.fastq 
## 3 FAIL  Per base sequence content Day7.fastq 
## 4 FAIL  Per sequence GC content   Day7.fastq 
## 5 FAIL  Per base sequence content Day10.fastq
## 6 FAIL  Per base sequence content Day15.fastq
## 7 FAIL  Per base sequence content Day20.fastq
## 8 FAIL  Per sequence GC content   Day20.fastq
Something is magic here: we do not have grade in our environment, only within the data frame. Arguments to filter can use any column of the data frame as a variable. dplyr uses this idea heavily: arguments to dplyr verbs often behave in a special way.
arrange

Rather than filtering, we might instead want to sort the data so the most important rows are at the top. arrange sorts a data frame by one or more columns.
arrange(bigtab, grade)
## # A tibble: 72 x 3
##    grade test                      file       
##    <fct> <fct>                     <fct>      
##  1 FAIL  Per base sequence content Day0.fastq 
##  2 FAIL  Per base sequence content Day4.fastq 
##  3 FAIL  Per base sequence content Day7.fastq 
##  4 FAIL  Per sequence GC content   Day7.fastq 
##  5 FAIL  Per base sequence content Day10.fastq
##  6 FAIL  Per base sequence content Day15.fastq
##  7 FAIL  Per base sequence content Day20.fastq
##  8 FAIL  Per sequence GC content   Day20.fastq
##  9 WARN  Per sequence GC content   Day0.fastq 
## 10 WARN  Overrepresented sequences Day0.fastq 
## # ... with 62 more rows
# desc( ) can be used to reverse the sort order
arrange(bigtab, desc(grade))
## # A tibble: 72 x 3
##    grade test                         file      
##    <fct> <fct>                        <fct>     
##  1 PASS  Basic Statistics             Day0.fastq
##  2 PASS  Per base sequence quality    Day0.fastq
##  3 PASS  Per tile sequence quality    Day0.fastq
##  4 PASS  Per sequence quality scores  Day0.fastq
##  5 PASS  Per base N content           Day0.fastq
##  6 PASS  Sequence Length Distribution Day0.fastq
##  7 PASS  Sequence Duplication Levels  Day0.fastq
##  8 PASS  Adapter Content              Day0.fastq
##  9 PASS  Kmer Content                 Day0.fastq
## 10 PASS  Basic Statistics             Day4.fastq
## # ... with 62 more rows
select

dplyr’s select function is for subsetting, renaming, and reordering columns. Old columns can be referred to by name or by number.
select(bigtab, test,grade)
## # A tibble: 72 x 2
##    test                         grade
##    <fct>                        <fct>
##  1 Basic Statistics             PASS 
##  2 Per base sequence quality    PASS 
##  3 Per tile sequence quality    PASS 
##  4 Per sequence quality scores  PASS 
##  5 Per base sequence content    FAIL 
##  6 Per sequence GC content      WARN 
##  7 Per base N content           PASS 
##  8 Sequence Length Distribution PASS 
##  9 Sequence Duplication Levels  PASS 
## 10 Overrepresented sequences    WARN 
## # ... with 62 more rows
select(bigtab, 2,1)
## # A tibble: 72 x 2
##    test                         grade
##    <fct>                        <fct>
##  1 Basic Statistics             PASS 
##  2 Per base sequence quality    PASS 
##  3 Per tile sequence quality    PASS 
##  4 Per sequence quality scores  PASS 
##  5 Per base sequence content    FAIL 
##  6 Per sequence GC content      WARN 
##  7 Per base N content           PASS 
##  8 Sequence Length Distribution PASS 
##  9 Sequence Duplication Levels  PASS 
## 10 Overrepresented sequences    WARN 
## # ... with 62 more rows
select(bigtab, foo=file, bar=test, baz=grade)
## # A tibble: 72 x 3
##    foo        bar                          baz  
##    <fct>      <fct>                        <fct>
##  1 Day0.fastq Basic Statistics             PASS 
##  2 Day0.fastq Per base sequence quality    PASS 
##  3 Day0.fastq Per tile sequence quality    PASS 
##  4 Day0.fastq Per sequence quality scores  PASS 
##  5 Day0.fastq Per base sequence content    FAIL 
##  6 Day0.fastq Per sequence GC content      WARN 
##  7 Day0.fastq Per base N content           PASS 
##  8 Day0.fastq Sequence Length Distribution PASS 
##  9 Day0.fastq Sequence Duplication Levels  PASS 
## 10 Day0.fastq Overrepresented sequences    WARN 
## # ... with 62 more rows
You can also remove specific columns like this:
select(bigtab, -file)
## # A tibble: 72 x 2
##    grade test                        
##    <fct> <fct>                       
##  1 PASS  Basic Statistics            
##  2 PASS  Per base sequence quality   
##  3 PASS  Per tile sequence quality   
##  4 PASS  Per sequence quality scores 
##  5 FAIL  Per base sequence content   
##  6 WARN  Per sequence GC content     
##  7 PASS  Per base N content          
##  8 PASS  Sequence Length Distribution
##  9 PASS  Sequence Duplication Levels 
## 10 WARN  Overrepresented sequences   
## # ... with 62 more rows
The magic here is that column variables refer to their column number. select subsets columns rather like subsetting vectors with [ ]. So -file becomes -1, and negative numbers remove items. We can also refer to ranges of columns, such as grade:file, which would become 1:3, which is c(1,2,3).
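For example, with a toy tibble laid out like bigtab (an illustration, not the FastQC data), the following are all equivalent:

```r
library(dplyr)

df <- tibble(grade="PASS", test="Basic Statistics", file="Day0.fastq")

# Equivalent ways of selecting columns 1 to 3:
select(df, grade:file)
select(df, 1:3)
select(df, grade, test, file)
```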
Joins
Say we want to convert PASS/WARN/FAIL into a numeric score, so we can produce some numerical summaries. The scoring scheme will be:
fwp <- c("FAIL","WARN","PASS")
scoring <- tibble(grade=factor(fwp,levels=fwp), score=c(0,0.5,1))

# Or:
# scoring <- data.frame(grade=factor(fwp,levels=fwp), score=c(0,0.5,1))

scoring
## # A tibble: 3 x 2
##   grade score
##   <fct> <dbl>
## 1 FAIL    0  
## 2 WARN    0.5
## 3 PASS    1
In the introduction to R day, we saw the merge function. dplyr has its own version of this in the form of several functions: left_join, right_join, inner_join, full_join, and anti_join.
The difference between these functions is what happens when there is a row in one data frame without a corresponding row in the other data frame. inner_join discards such rows. full_join always keeps them, filling in missing data with NA. left_join always keeps rows from the first data frame. right_join always keeps rows from the second data frame. anti_join is a bit different: it gives you the rows from the first data frame that aren’t in the second data frame.
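The differences can be seen with two toy tibbles (a sketch, not the FastQC data):

```r
library(dplyr)

a <- tibble(id=c(1,2,3), x=c("a","b","c"))
b <- tibble(id=c(2,3,4), y=c("B","C","D"))

inner_join(a, b, by="id")   # ids 2,3     : only rows with a match in both
full_join(a, b, by="id")    # ids 1,2,3,4 : all rows, NAs fill the gaps
left_join(a, b, by="id")    # ids 1,2,3   : all rows of a
right_join(a, b, by="id")   # ids 2,3,4   : all rows of b
anti_join(a, b, by="id")    # id 1        : rows of a with no match in b
```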
Often we use a join to augment a data frame with some extra information. left_join is a good default choice for this, as it will never delete rows from the data frame that is being augmented.
scoretab <- left_join(bigtab, scoring, by="grade")
scoretab
## # A tibble: 72 x 4
##    grade test                         file       score
##    <fct> <fct>                        <fct>      <dbl>
##  1 PASS  Basic Statistics             Day0.fastq   1  
##  2 PASS  Per base sequence quality    Day0.fastq   1  
##  3 PASS  Per tile sequence quality    Day0.fastq   1  
##  4 PASS  Per sequence quality scores  Day0.fastq   1  
##  5 FAIL  Per base sequence content    Day0.fastq   0  
##  6 WARN  Per sequence GC content      Day0.fastq   0.5
##  7 PASS  Per base N content           Day0.fastq   1  
##  8 PASS  Sequence Length Distribution Day0.fastq   1  
##  9 PASS  Sequence Duplication Levels  Day0.fastq   1  
## 10 WARN  Overrepresented sequences    Day0.fastq   0.5
## # ... with 62 more rows
“grade” is here acting as the key, allowing corresponding rows in the data frames to be joined together.
One important thing that all the join functions do: if multiple rows have the same key in either data frame, all ways of combining the two sets of rows will be included in the result. So, here, rows from the scoring data frame have been copied many times in the output.
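A small sketch of this behaviour, with hypothetical data:

```r
library(dplyr)

left_tab  <- tibble(key=c("x","x","y"), a=c(1,2,3))
right_tab <- tibble(key=c("x","x"),    b=c(10,20))

# Both "x" rows on the left pair with both "x" rows on the right
# (2 * 2 = 4 combinations), and "y" is kept with NA for b,
# giving 5 rows in total.
left_join(left_tab, right_tab, by="key")
```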
Joins are the key tool for integrating different types of data, based on some common key such as gene id.
See also: match and merge in base R.
Challenge
Using the newly created score column:

- Filter scoretab to get only “WARN” or “FAIL” grades.
- Sort the result so that “FAIL”s are at the top.
mutate

mutate lets us add or overwrite columns by computing a new value for them.
mutate(scoretab, doublescore = score*2)
## # A tibble: 72 x 5
##    grade test                         file       score doublescore
##    <fct> <fct>                        <fct>      <dbl>       <dbl>
##  1 PASS  Basic Statistics             Day0.fastq   1             2
##  2 PASS  Per base sequence quality    Day0.fastq   1             2
##  3 PASS  Per tile sequence quality    Day0.fastq   1             2
##  4 PASS  Per sequence quality scores  Day0.fastq   1             2
##  5 FAIL  Per base sequence content    Day0.fastq   0             0
##  6 WARN  Per sequence GC content      Day0.fastq   0.5           1
##  7 PASS  Per base N content           Day0.fastq   1             2
##  8 PASS  Sequence Length Distribution Day0.fastq   1             2
##  9 PASS  Sequence Duplication Levels  Day0.fastq   1             2
## 10 WARN  Overrepresented sequences    Day0.fastq   0.5           1
## # ... with 62 more rows
Even though this is called “mutate”, it does not literally modify the input. Rather, it produces a copy of the data frame containing the modification.
This is equivalent to:
scoretab2 <- scoretab
scoretab2$doublescore <- scoretab2$score * 2
summarize

summarize lets us compute summaries of data.
summarize(scoretab, total=sum(score))
## # A tibble: 1 x 1
##   total
##   <dbl>
## 1    60
We really want to summarize the data grouped by file, so we can see if there are any particularly bad files. This is achieved using the group_by “adjective”. group_by gives the data frame a grouping using one or more columns, which modifies the subsequent call to summarize.
group_by(scoretab, file)
## # A tibble: 72 x 4
## # Groups:   file [6]
##    grade test                         file       score
##    <fct> <fct>                        <fct>      <dbl>
##  1 PASS  Basic Statistics             Day0.fastq   1  
##  2 PASS  Per base sequence quality    Day0.fastq   1  
##  3 PASS  Per tile sequence quality    Day0.fastq   1  
##  4 PASS  Per sequence quality scores  Day0.fastq   1  
##  5 FAIL  Per base sequence content    Day0.fastq   0  
##  6 WARN  Per sequence GC content      Day0.fastq   0.5
##  7 PASS  Per base N content           Day0.fastq   1  
##  8 PASS  Sequence Length Distribution Day0.fastq   1  
##  9 PASS  Sequence Duplication Levels  Day0.fastq   1  
## 10 WARN  Overrepresented sequences    Day0.fastq   0.5
## # ... with 62 more rows
summarize(group_by(scoretab, file), average_score=mean(score))
## # A tibble: 6 x 2
##   file        average_score
##   <fct>               <dbl>
## 1 Day0.fastq          0.833
## 2 Day4.fastq          0.875
## 3 Day7.fastq          0.833
## 4 Day10.fastq         0.833
## 5 Day15.fastq         0.833
## 6 Day20.fastq         0.792
The special function n() can be used within summarize to get the number of rows. (This also works in mutate, but is most useful in summarize.)
summarize(group_by(scoretab, grade), count=n())
## # A tibble: 3 x 2
##   grade count
##   <fct> <int>
## 1 FAIL      8
## 2 WARN      8
## 3 PASS     56
So summarize lets us do the things we do with base R using table and tapply, while staying tidy by using data frames.
Tip: group_by also affects other verbs, such as mutate. This is more advanced than we will be going into today. Grouping can be removed with ungroup.
See also these functions for common uses of summarize: count and distinct.
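For example, count collapses the group_by/summarize pattern above into one call (a toy illustration, not the FastQC data):

```r
library(dplyr)

df <- tibble(grade=c("FAIL","WARN","PASS","PASS"))

count(df, grade)      # like summarize(group_by(df, grade), n=n())
distinct(df, grade)   # the unique combinations of the given columns
```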
The pipe %>%

We often want to string together a series of dplyr functions. This is achieved using dplyr’s pipe operator, %>%. This takes the value on the left and passes it as the first argument to the function call on the right. So the previous summarization could be written:
scoretab %>% group_by(grade) %>% summarize(count=n())
%>% isn’t limited to dplyr functions. It’s an alternative way of writing any R code.
rep(paste("hello", "world"), 5)
## [1] "hello world" "hello world" "hello world" "hello world" "hello world"
"hello" %>% paste("world") %>% rep(5)
## [1] "hello world" "hello world" "hello world" "hello world" "hello world"
The idea of adding geoms in ggplot2 is rather like the dplyr pipe. ggplot2 predates dplyr, and Hadley Wickham has had a progression of ideas. It will probably be possible to use %>% instead of + in some successor to ggplot2 in the not too distant future.
tidyr

tidyr is the Tidyverse package for getting data frames into tidy form. Recall that in a tidy data frame:
- each row is a unit of observation
- each column is a single piece of information
I’m not convinced tidyr is a great package, but the ideas behind it are very important. dplyr uses magic arguments to produce very beautiful R code, but in tidyr it gets a bit confusing. Anyway, let’s work through a toy example:
untidy <- read_csv(
    "country,     male-young, male-old, female-young, female-old
     Australia,            1,        2,            3,          4
     New Zealand,          5,        6,            7,          8")

untidy
## # A tibble: 2 x 5
##   country     `male-young` `male-old` `female-young` `female-old`
##   <chr>              <int>      <int>          <int>        <int>
## 1 Australia              1          2              3            4
## 2 New Zealand            5          6              7            8
The first problem is that rows are not distinct units of observation: there are actually four observations per row. This is fixed using gather.
gathered <- gather(untidy, key=group, value=cases, -country)
gathered
## # A tibble: 8 x 3
##   country     group        cases
##   <chr>       <chr>        <int>
## 1 Australia   male-young       1
## 2 New Zealand male-young       5
## 3 Australia   male-old         2
## 4 New Zealand male-old         6
## 5 Australia   female-young     3
## 6 New Zealand female-young     7
## 7 Australia   female-old       4
## 8 New Zealand female-old       8
We give a data frame, and names for the key and value columns to be created. The remaining arguments behave like select: here we are saying that the country column shouldn’t be gathered.
In the introductory R day we saw melt from the reshape2 package, which did a similar thing to matrices.
spread is the opposite operation: spreading a column into many columns. This generally makes data less tidy but is sometimes necessary. A common application is producing a scatter plot, where the x and y axes need to be two different columns even though they measure the same type of thing. Data may also be easier to look at in a table in spread form.
spread(bigtab, key=file, value=grade)
## # A tibble: 12 x 7
##    test  Day0.fastq Day4.fastq Day7.fastq Day10.fastq Day15.fastq
##    <fct> <fct>      <fct>      <fct>      <fct>       <fct>      
##  1 Sequ… PASS       PASS       PASS       PASS        PASS       
##  2 Sequ… PASS       PASS       PASS       PASS        PASS       
##  3 Per … PASS       PASS       PASS       PASS        PASS       
##  4 Per … PASS       PASS       PASS       PASS        PASS       
##  5 Per … WARN       WARN       FAIL       WARN        WARN       
##  6 Per … PASS       PASS       PASS       PASS        PASS       
##  7 Per … FAIL       FAIL       FAIL       FAIL        FAIL       
##  8 Per … PASS       PASS       PASS       PASS        PASS       
##  9 Over… WARN       PASS       PASS       WARN        WARN       
## 10 Kmer… PASS       PASS       PASS       PASS        PASS       
## 11 Basi… PASS       PASS       PASS       PASS        PASS       
## 12 Adap… PASS       PASS       PASS       PASS        PASS       
## # ... with 1 more variable: Day20.fastq <fct>
In our toy example, we have a further problem: the “group” column contains two pieces of data. This can be fixed with separate. By default separate splits on any non-alphanumeric characters, but a different separator character can be specified.
separate(gathered, col=group, into=c("gender","age"))
## # A tibble: 8 x 4
##   country     gender age   cases
##   <chr>       <chr>  <chr> <int>
## 1 Australia   male   young     1
## 2 New Zealand male   young     5
## 3 Australia   male   old       2
## 4 New Zealand male   old       6
## 5 Australia   female young     3
## 6 New Zealand female young     7
## 7 Australia   female old       4
## 8 New Zealand female old       8
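For example, the separator can be given explicitly with the sep argument (a toy sketch; for this data sep="-" matches what the default would do anyway):

```r
library(dplyr)
library(tidyr)

df <- tibble(group=c("male-young","female-old"))

separate(df, col=group, into=c("gender","age"), sep="-")
```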
All of this would typically be written as a single pipeline:
tidied <- untidy %>%
    gather(key=group, value=cases, -country) %>%
    separate(group, into=c("gender","age"))
Advanced: tidyr has a number of other useful data frame manipulations. For completeness, we mention nest. This creates a list column in a tibble, with the list column containing one tibble per row. Yo, Hadley put tibbles in your tibble so you can dplyr while you dplyr. The inverse operation is unnest. unnest is rather like the bind_rows function we saw earlier. nest and unnest go well with creating or modifying list columns in a mutate with lapply (or the purrr package).
# Advanced
nested <- nest(gathered, -country)
nested
nested$data
unnest(nested)
Challenge
You receive data on a set of points. The points are in two dimensions (dim), and each point has x and y coordinates. Unfortunately it looks like this:
df <- read_csv(
    "dim, A_1, A_2, B_1, B_2, B_3, B_4, B_5
       x,   2,   4,   1,   2,   3,   4,   5
       y,   4,   4,   2,   1,   1,   1,   2")
- Tidy the data by gathering all of the columns except dim. What does each row now represent?
- We want to plot the points as a scatter-plot, using either plot or ggplot. Spread the gathered data so that this is possible. Now what do the rows represent?
- What other tidying operation could be applied to this data?
An RNA-Seq example

We will now put this all together by looking at a small-scale gene expression data set. To simplify somewhat: the data was generated from a multiplex PCR targeting 44 genes. The PCR product is quantified by counting reads from high throughput sequencing. The experiment is of yeast released synchronously into cell cycling, so that we can see which genes are active at different points in the cycle. Several strains of yeast have been used: a wildtype and several gene knock-downs. Each has nine time points, covering two cell cycles (two hours).
This is much smaller than a typical RNA-Seq dataset. We are only looking at 44 genes, rather than all of the thousands of genes in the yeast genome.
Recall the three activities involved in data analysis. This dataset requires all of them.
Tidying
Hint: You can select from the top of a pipeline to partway through and press Ctrl-Enter or Command-Enter to see the value part-way through the pipeline.
# Use readr's read_csv function to load the file
counts_untidy <- read_csv("r-more-files/read-counts.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Feature = col_character()
## )
## See spec(...) for full column specifications.
counts <- counts_untidy %>%
    gather(key=sample, value=count, -Feature, factor_key=TRUE) %>%
    separate(sample, sep=":", into=c("strain","time"), convert=TRUE, remove=FALSE) %>%
    mutate(
        strain = factor(strain, levels=c("WT","SET1","RRP6","SET1-RRP6"))
    ) %>%
    filter(time >= 2) %>%
    select(sample, strain, time, gene=Feature, count)

summary(counts)
##     sample           strain         time          gene          
##  WT:2   :  45   WT       :405   Min.   : 2   Length:1620       
##  WT:3   :  45   SET1     :405   1st Qu.: 4   Class :character  
##  WT:4   :  45   RRP6     :405   Median : 6   Mode  :character  
##  WT:5   :  45   SET1-RRP6:405   Mean   : 6                     
##  WT:6   :  45                   3rd Qu.: 8                     
##  WT:7   :  45                   Max.   :10                     
##  (Other):1350                                                  
##      count       
##  Min.   :     0  
##  1st Qu.:   172  
##  Median :   666  
##  Mean   :  3612  
##  3rd Qu.:  2548  
##  Max.   :143592  
## 
Time point 1 was omitted because it isn’t actually part of the time series; it’s from cells in an unsynchronized state. If we wanted to be even tidier, the remaining time points could be recoded with the actual time within the experiment, since they are at 15-minute intervals.
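The recoding could be sketched like this (assuming time point 2 corresponds to 0 minutes; that offset is our assumption, not stated in the data):

```r
library(dplyr)

# Toy stand-in for the time column of counts: time points 2 to 10.
timetab <- tibble(time=2:10)

# Time points are 15 minutes apart, with time point 2 taken as 0 minutes.
mutate(timetab, minutes=(time - 2) * 15)
```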
Transformation and normalization
ggplot(counts, aes(x=sample, y=count)) + geom_boxplot() + coord_flip()
We immediately see from a Tukey box plot that the data is far from normal: there are outliers many box-widths to the right of the box. Perhaps a log transformation is in order.
ggplot(counts, aes(x=sample, y=log2(count))) + geom_boxplot() + coord_flip()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
ggplot(counts, aes(x=log2(count), group=sample)) + geom_line(stat="density")
## Warning: Removed 14 rows containing non-finite values (stat_density).