--- title: "Introduction to cuperdec" output: rmarkdown::html_vignette author: "James A. Fellows Yates" date: "`r Sys.Date()`" vignette: > %\VignetteIndexEntry{Introduction to cuperdec} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction This vignette gives an example on how to use the `cuperdec` R package. The aim of this package is to assist in the generation of what I have termed 'cumulative percent decay curves`. These visualisations were originally developed to assist in the identification of the abundance and prevalence of microbial taxa in ancient microbiome samples, with samples with high abundance and prevalence of the expected original microbiome, when comparing to modern samples from the same microbiome source. However, this concept could potentially be applied in other contexts. The curves consist of an 'abundance rank' (x-axis), and a percentage of taxa from the target (i.e.original microbiome) source (y-axis). The curve is generated by asking at each rank, what percentage of the taxa up until this rank have been from the target source. This vignette will show examples later on. The package also applies a few simple filtering functions, which helps to automate the detection of samples that are similar to modern comparative samples, versus those that do not have the same signal. This can then be used to inform downstream analyses in terms of discarding potentially unreliable samples. # Setup You will require the library [`dplyr`](https://dplyr.tidyverse.org/) for this tutorial! Please install this if you have not already. ```{r message = FALSE} library("dplyr") ``` # Input ## Required The generation of these curves require at a minimum: * A TSV formatted 'OTU Table' with the following characteristics: * Column names: e.g. Taxon, Sample_A, Sample_B, Sample_C * First Column: a list of taxa * Cells: some numeric value representing number of observations of a taxon in a sample | Taxon | Sample_A | Sample_B | Sample_C | |:-------:|---------:|---------:|---------:| | Taxon_1 | 0 | 102 | 237 | | Taxon_2 | 10023 | 7483 | 298 | | Taxon_3 | 6 | 28 | 3409 | | ... | ... | ... | ... | * A TSV formatted isolation source table with two columns: * A list of taxa that matches the names in your OTU table (e.g. _Tannerella forsythia_, _Helicobacter pylori_, _Mycobacterium kansasii_) * The corresponding isolation sources of each taxon, i.e. where the particular reference sequence of the taxon was extracted (e.g. oral cavity, gut, soil) | Taxon | Sample_A | |:-------:|:--------:| | Taxon_1 | oral | | Taxon_2 | gut | | Taxon_3 | unknown | | ... | ... | ## Optional Additionally, you can supply: * A TSV formatted metadata table, which contains: * A column with sample names (matching OTU table) * A grouping column (i.e. allows grouping of samples to be displayed together) | Samples | Other | Group | Other | |:---------:|:--------:|:--------:|---------:| | Sample_A | False | Group_A | 23,000 | | Sample_B | True | Group_A | 4,000 | | Sample_C | True | Group_B | 15,000 | | ... | ... | ... | ... | In this example, we will use a dataset from Fellows Yates _et al._ 202X. This study looked at microbiome DNA isolated from dental calculus of ancient skeletons. We wanted to try to identify samples which appeared to have still displayed evidence of the remains original microbes that makes up a tooth biofilm. This allowed us to discard any samples that were so degraded no signature of the original microbiome remained, and that could negatively influence downstream analyses. This assessment can be done by by comparing the curves of the ancient samples with modern samples of known isolation sources. # Data Loading First we will load the `cuperdec` package, and helper libraries to help make reading R code easier in this vignette. ```{r setup} library(cuperdec) library(magrittr) ## for pipes! library(dplyr) # for mutate()! ``` Next we can load the example dataset that comes with the package with the helper functions that load and format the tables that can be used by downstream `cuperdec` functions. The example data is stored within the package, so we first need find the locations of these in your system. > :warning: You would not normally have to run this step for your own analysis! ```{r} data(cuperdec_taxatable_ex) data(cuperdec_database_ex) data(cuperdec_metadata_ex) ``` Now we can use the loading functions to load the tables for you. You can either supply your table as a path to a TSV on your computer, or alternatively a data frame/tibble that you've already loaded into R. The function `load_taxa_table()` will then convert a OTU table into long format and rename columns to names that downstream functions will use. ```{r} taxatable <- load_taxa_table(cuperdec_taxatable_ex) %>% print() ``` Alternatively you can provide a path directly to a TSV table (not run here), rather than an already loaded object. ```{r, eval = F} taxatable <- load_taxa_table("///.tsv") ``` Next we need to use`load_database()` to load the file which describes the original isolation source of each OTU. Again, this function can take either a path to a file on your system, or a data frame already loaded into R. This function takes two parameters, in addition to the the isolation source table itself, and you also need to supply the name of the target isolation source you are looking for within each sample. In this case, we want to identify the all the taxa in our samples that have been isolated in from the oral cavity - indicated in the database file's second column as 'oral'. The `load_database()` function will then indicate for every Taxon that is derived from the target source as 'True' and everything that isn't as 'False'. ```{r} database <- load_database(cuperdec_database_ex, target = "oral") %>% print() ``` Finally, in the example study, we wanted to compare the curves of different groups of our samples, with each different known-isolation comparative sample. Therefore, we will load an optional file carrying various metadata about each sample For this we use `cuperdec_metadata_ex()`, supplying the metadata TSV, the name of the column with sample names, and the name of the column with the groups. You can have other columns in the metadata file, but only the two indicated will be taken downstream. ```{r} metadata_map <- load_map(cuperdec_metadata_ex, sample_col = "#SampleID", source_col = "Env") %>% print() ``` # Calculating a simple curve Now, with all our data loaded we can calculate the cumulative percent curves per sample. The way this works is to order (per sample) each taxon by it's abundance. Then, calculate the fraction of taxa at each rank, that have come from your 'target' source ( including all ranks up until the current one). A schematic can be seen on Figure R8 in [GitHub repository](https://github.com/jfy133/Hominid_Calculus_Microbiome_Evolution/tree/master/#r81-cumulative-percent-decay-plots) of Fellows Yates _et al._ (202X). To do these of the taxa table loaded above, we use the `calculate_curve()` function. ```{r} curves <- calculate_curve(taxatable, database = database) %>% print() ``` Here you can see a Rank column, which represents the abundance rank of each taxon (from smallest to largest rank representing most to least abundant), and a fraction of all taxa up to that rank that are derived from your target isolation source. `cuperdec` also offers a helper function for fast plotting. ```{r} plot_cuperdec(curves) ``` > Note that these functions just run basic `ggplot()` functions with a couple of preferred theming based on the preferences of the package's author. You can adapt the functions for your own function by running the internal `plot_simple` function without brackets to see the ggplot2 function, and copying the ggplot2 function from there. Due to the large number of samples in the example data, we cannot see much. To improve this, we can use a second plotting function to separate each group of samples into their own window. ```{r fig.width=7, fig.height=5} plot_cuperdec(curves, metadata_map) ``` This is much easier to visualise, and we can start to see some differences. Note that if you wanted to re-order the order panels, you can simply convert the 'Sample_Source' column of the `metadata_map` to factors, with your preferred ordering. For example: ```{r fig.width=7, fig.height=5} ## Set ordering of groups group_order <- c("subPlaque", "supPlaque", "urbanGut", "ruralGut", "skin", "sediment", "EnvironmentalControl", "ExtractionControl", "LibraryControl", "Howler_Monkey", "Gorilla_1", "Gorilla_2", "Gorilla_3", "Chimp_1", "Chimp_2", "Chimp_3", "Chimp_4", "Neanderthal", "PreagriculturalHuman_1", "PreagriculturalHuman_2", "PreantibioticHuman_1", "PreantibioticHuman_2", "ModernDayHuman_1", "ModernDayHuman_2" ) metadata_map <- metadata_map %>% dplyr::mutate(Sample_Source = factor(Sample_Source, levels = group_order)) # Re-plot plot_cuperdec(curves, metadata = metadata_map) ``` Now we can see that the calculus samples (Howler monkey onwards), look much more like the Plaque samples, compared to e.g. sediment and skin. # Preservation filtering ## Simple These visualisation of these difference curves already start help us to distinguish well-preserved samples from those that maybe have few oral taxa left remaining. However, we can also try and use the observations from the plot above to set up cut-off filters that allows us to make consistent 'keep'/'discard' decisions. For example, other than a single extraction control, all non-plaque comparative samples do not exceed a minimum 'Percentage Target Source' threshold of ~50%. Using observations like this `cuperdec` offers a range of different functions that perform various cut-off assessments between samples. For example, we can use `simple_filter()` to indicate any sample that exceeds a minimum percent source threshold of 50% at any point along the abundance rank. ```{r} filter_result <- simple_filter(curves, percent_threshold = 50) %>% print() ``` We can then supply this information as an additional parameter for plotting. ```{r fig.width=7, fig.height=5} plot_cuperdec(curves, metadata_map, filter_result) ``` As a demonstration, the defined cut-off means that any curve that enters the yellow box at any abundance rank (i.e. above the minimum 50% threshold at any point across the x-axis) is considered as being `Passed`. ```{r echo=FALSE, fig.width=7, fig.height=5} plot_cuperdec(curves, metadata_map, filter_result) + ggplot2::geom_hline(yintercept = 50, color = "grey") + ggplot2::geom_rect(xmin = 0, xmax = max(curves$Rank), ymin = 50, ymax = 100, alpha = 0, colour = "#D39200" ) ``` ## Hard Burnin The list of samples generated by `simple_filter()` above could now be used to discard those samples that did not pass the cut off from downstream analysis. However, the one ExtractionControl sample that exceeds 50% for only a couple of ranks, actually looks like for the most part that it doesn't look like it has much 'oral' taxa at all. It may just be spurious hits, or an over-weighting due to the small denominator of the abundance rank in the early ranks. Therefore, we can additionally apply a simple 'burn in', after which we start considering whether the curve exceeds the minimum percent threshold. In other words, only consider whether a given curve goes above 50% after X amount of ranks (given here as 1.0% of the ranks), for each curve. Alternatively phrase, ignore the first 10% ranks of each sample before checking if the curve goes above 50% of the target source. To do this on the full dataset ```{r fig.width=7, fig.height=5} burnin_result <- hard_burnin_filter(curves, percent_threshold = 50, rank_burnin = 0.1) plot_cuperdec(curves, metadata_map, burnin_result) ``` This is better, but now we have a Neanderthal and a few pre-agricultural humans that clearly exceed 50% for quite a few ranks, so while preservation might still be lower than others, they may still have _sufficient_ preservation that we still want to keep these samples for downstream. We can also supply an extra parameter to `plot_cuperdec` Lets look a bit more closely at the curves to by restricting the X-axis Ranks to 250 ranks. ```{r fig.width=7, fig.height=5} plot_cuperdec(curves, metadata_map, burnin_result, restrict_x = 250) ``` We can again graphically demonstrate the burn-in thresholds of a _single sample_ for this filter, following the yellow-box example in the previous section, see the following image. ```{r echo=FALSE, fig.width=7, fig.height=5} sub_curves <- curves %>% filter(Sample %in% c("ABM008.A0101", "FUM001.A0101")) sub_curve_burnin <- sub_curves %>% group_by(Sample) %>% summarise(max = max(Rank)) %>% mutate(perc = round(max * 0.1)) %>% select(Sample, perc) %>% tibble::deframe(.) plot_cuperdec(sub_curves, burnin_result = burnin_result, restrict_x = 250) + ggplot2::geom_hline(yintercept = 50, color = "grey") + ggplot2::geom_rect( xmin = sub_curve_burnin["ABM008.A0101"], xmax = 250, ymin = 50, ymax = 100, alpha = 0, colour = "#D39200" ) + ggplot2::geom_rect( xmin = sub_curve_burnin["FUM001.A0101"], xmax = 250, ymin = 50, ymax = 100, alpha = 0, colour = "#FF61C3" ) ``` > Note this image is purely for demonstrative purposes, and is not from a > `cuperdec` function. You can see the boxes specifying where the curves of each sample must enter has now shifted away from 0 along the x-axis. One sample's curve enters its box (yellow, i.e. after the 10% burning threshold of`r sub_curve_burnin["ABM008.A0101"]` the abundance rank) and therefore is considered passing the minimum percent target source threshold. In contrast, the 'failed' sample has a curve falls just outside of its box (pink), has a 10% burn-in threshold of rank `r sub_curve_burnin["FUM001.A0101"]`. The different boxes emphasise that the burn-in is calculated on a _per-sample_ basis, and that there is not a single abundance rank cut-off for all samples. However, when looking back at the whole dataset definitely see that there are some Neanderthals that are excluded with a hard burn-in of 10% despite having otherwise a curve pretty similar to other Neanderthals. This could be because they have unusually long tails of very low abundant taxa, meaning that the 'peak' exceeding the minimum threshold of 50% occurs within the first 10% of ranks, however this is still the region of the abundance rank that would give the strongest oral taxon signal in very old samples. ## Adaptive Burnin Instead, we can try and customise the burn-in setting on a per-sample basis, based on the amount of 'fluctuation' that occurs early in the curves (based on the small fraction denominators). This is performed here by only starting to considering whether a curve exceeds minimum percentage target source threshold (50%), from the point at which the difference of the percentage target source of one rank to the next, does not exceed the standard-deviation of all differences. ```{r fig.width=7, fig.height=5} burnin_result <- adaptive_burnin_filter(curves, percent_threshold = 50) plot_cuperdec(curves, metadata_map, burnin_result, restrict_x = 250) ``` As with the hard burn-in filter, each curve (i.e., sample) gets a different abundance rank, after which exceeding of the minimum percent target source is considered. Once we are happy that the curves appear to be distinguishing between well-preserved samples and those with a low 'endogenous' microbial content, we can use the list of samples that did not our thresholds to remove from downstream analysis. ```{r} discard_list <- burnin_result %>% filter(!Passed) discard_list ``` ## Custom filters You can of course write your own functions with your own filtering techniques, the only requirement is that you produce a two column output: * Sample: the names of each sample represented in the curves file. * Passed: a logical (TRUE/FALSE) column indicating whether a sample passed the filter or not