--- title: "Performance" output: rmarkdown::html_vignette bibliography: references.bib csl: evolution.csl vignette: > %\VignetteIndexEntry{Performance} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", fig.height = 5, fig.width = 7 ) ``` ```{r setup} library(DAISIEprep) ``` ### Performance analysis of DAISIEprep::extract_island_species() In this article we test and analyse the performance, in terms of time consumption/complexity of the main function in the `DAISIEprep` R package: `extract_island_species()`. This function takes in a phylogenetic tree and species endemicity data in the form of a `phylo4d` object (an S4 class from the R package `phylobase`). This is not a thorough examination of all possible uses of the `extract_island_species()` function, but rather gives an indication of the usability of the function for data sets of different sizes, as well as detecting any features of the data that may slow the process of extracting and formatting the island community data. All of the work for the performance analysis is carried out by the `benchmark()` function from the `DAISIEprep` package. But before calling this function we explain how the benchmarking is set up. The first argument of `benchmark()` is `phylod`. If this is `NULL` then the function will simulate the phylogenies and the endemicity data given the: `tree_size_range`, `num_points`, `prob_on_island`, and `prob_endemic` arguments. These specify the lower and upper range of the tree size to simulate (the sequence breaks can be in linear or log space depending on the argument `log_scale`), the number of breaks in the sequence between lower and upper tree sizes, the probability that a species will be on the island, and if a species is on the island whether it is endemic (1 - `prob_endemic` is the probability a species on the island is non-endemic). If a `phylo4d` object is supplied to the `phylod` argument in `benchmark()` then this tree will be used to perform the benchmarking. In the case of simulating data the parameter space of the performance analysis is then the combination of these variables (using the `expand.grid()` function). The phylogeny is simulated to a fixed size using the `rphylo()` function from the `ape` package. As we are stochastically simulating the endemicity statuses of the species in the tree it may be that there is not an outgroup species not on the island in order to correctly extract the colonisation times from the stem age of the species or clade on the island. Therefore, we add an outgroup which we ensure is not present on the island. When we want to test the performance of the `extract_island_species()` function using the `asr` algorithm we need to first run an ancestral state reconstruction. This can easily be achieved if the `phylod` data with the phylogeny and endemicity statuses at each tip of the tree are ready (see tutorial vignette for more information on the `asr` algorithm). To quantify the time elapsed while the function runs there are several methods in both base R and in various R packages (e.g. `microbenchmark` or `rbenchmark`). However here we use `system.time()` from base R (see https://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/) for reasoning). Side note: by default `DAISIEprep::benchmark()` conditions on each simulated data set having a non-empty island, and thus the function is not tested for the trivial case that no species need to be extracted. Due to the computational time not being deterministic and to ensure the results are not spurious we replicate timing three times and the mean "real time elapsed" is calculated, as well as replicating the simulation (given by argument `replicates`) to account for stochasticity in the simulation of the data. The range of tree sizes we use encompasses those of common empirical phylogenies (10 tips to 10,000 tips). We then generate a random sample of tip states for each species, with the possible states being: endemic to the island, non-endemic and not present on the island. We tested a low probability of each species in the tree being on the island ($P(0.2)$) and a high probability of species being on the island ($P(0.5)$). For each of these scenarios we tested a low ($P_E(0.2)$, $P_{NE}(0.8)$) and high probability ($P_E(0.8)$, $P_{NE}(0.2)$) of island species being endemic to test whether this affects performance. For each scenario we ran ten replicates. For isolated islands where cladogenesis is high, island species will likely not be spread across uniformly across the phylogeny as we assumed in our simulations, but instead will be clustered. To check for scenarios where the number of species per colonisation events is high (i.e. radiations on the island) we used the mammals of Madagascar data set. For this we used the [@upham_inferring_2019] complete and DNA-only phylogenies, and the island endemicity data of Madagascar [@michielsen_macroevolutionary_2023]. The results presented in this vignette are not computed each time the vignette is rendered due to the large computation time required. Instead, the analyses are run on a cluster computer and saved in the package. The analysis script run to produce the results can be found in the [DAISIEprepExtra](https://github.com/joshwlambert/DAISIEprepExtra) package [here](https://github.com/joshwlambert/DAISIEprepExtra/tree/main/inst/scripts). The performance analysis uses the `benchmark()` function in the `DAISIEprep` package. The output produces results for the DNA-only phylogeny and the complete phylogeny. The raw data of parameter estimates for the different parameter settings is tidied into a tibble containing the data we need for both the DNA and complete phylogeny. ```{r} # Internal function - Not to be called for regular 'DAISIEprep' operation, # but merely to load in data performance_data <- DAISIEprep:::read_performance() ``` We can plot the time consumption of each simulated and empirical (DNA-only and complete phylogeny) data set and group the results by the probability of species being on the island (`prob_on_island`). Now the results can be plotted with `plot_performance` implemented in `DAISIEprep`. This function follows the tidyverse convention of giving variable names as variables (as opposed to strings) and uses tidy evaluation to group by the variable given, either `prob_on_island` or `prob_endemic`. First for the `min` algorithm: ```{r plot-min-prob-on-island} plot_performance( performance_data = performance_data$performance_data_min, group_by = prob_on_island ) ``` And secondly for the `asr` algorithm: ```{r plot-asr-prob-on-island} plot_performance( performance_data = performance_data$performance_data_asr, group_by = prob_on_island ) ``` Alternatively, the data can be grouped by the probability that species on the island are endemic (`prob_endemic`) for the `min` (first) and `asr` (second) algorithms: ```{r plot-min-prob-endemic} plot_performance( performance_data = performance_data$performance_data_min, group_by = prob_endemic ) ``` ```{r plot-asr-prob-endemic} plot_performance( performance_data = performance_data$performance_data_min, group_by = prob_endemic ) ``` We find that even for large trees (10,000 tips) the scale of extraction less than seconds, whereas running the DAISIE inference model is on the scale of minutes to days, and thus pre-processing does not present a bottleneck to the pipeline. The empirical trees we ran (the Madagascar mammal example) were quicker to process than trees simulated with uniformly random island presence (figure 1). This suggests that when the ratio of species to colonisation events on the island is higher it should be faster to extract. However, even for large trees with many colonisations extraction should pose a computational problem. The speed of extraction facilitates extraction of island data across many phylogenies, for example extracting island community data across the posterior distribution of inferred phylogenetic trees to account for uncertainty in branching times and tree topology. Given the approximately linear relationship between the size of the phylogeny and time required for extraction in the above log-log plots we fit a power law to the mean run time (mean across the replicates). ```{r} grouped_performance_data <- dplyr::group_by( performance_data$performance_data_min, tree_size, "prob_on_island" ) mean_performance_data <- dplyr::summarise( grouped_performance_data, mean = mean(median_time), .groups = "drop" ) fit_min <- lm(log(mean_performance_data$mean) ~ log(mean_performance_data$tree_size)) fit_min$coefficients grouped_performance_data <- dplyr::group_by( performance_data$performance_data_asr, tree_size, "prob_on_island" ) mean_performance_data <- dplyr::summarise( grouped_performance_data, mean = mean(median_time), .groups = "drop" ) fit_asr <- lm(log(mean_performance_data$mean) ~ log(mean_performance_data$tree_size)) fit_asr$coefficients ``` The growth in time for both the `min` and `asr` algorithms follows a power function ($y=ax^k$) with exponents $k =$ `r fit_min$coefficients[2]` and $k =$ `r fit_asr$coefficients[2]`, respectively, when fitted on mean run time against tree size. Therefore the time complexity scales relatively well with tree size, and thus unless needing to be applied to extremely large phylogenies (>10,000 tips) the `DAISIEprep` extraction functionality should be applicable. ## References