The ipyrad.analysis module: RAxML
As part of the ipyrad.analysis
toolkit we’ve created convenience functions for
easily running common RAxML
commands, a maximum likelihood inference of phylogenetic trees. This can be
useful when you want to run all of your analyses in a clean stream-lined way in
a jupyter notebook to create a completely reproducible study.
Install software
There are many ways to install RAxML, the simplest of which is to use conda. This will install several RAxML binaries into your conda path. Switch back to the terminal (or open a new one) and run this command:
$ conda install raxml -c bioconda
RAxML Phylogenetic Inference
Create a new notebook inside your /home/jovyan/ipyrad-workshop/
directory
called anolis-raxml.ipynb
by choosing New->Python 3, in the
upper right hand corner. The rest of the materials in this part of the
workshop assume you are running all code in cells of a jupyter notebook inside
your binder instance.
Create a RAxML Class object
First, copy and paste the usual imports into a notebook cell and run it:
import ipyrad.analysis as ipa ## ipyrad analysis toolkit
import toyplot ## plotting library
import toytree ## tree plotting
We will use the anolis data again, just because simulated data is so boring.
In a new cell (+) you may fetch the anolis.phy
file from the RADCamp site with
wget
:
## Use wget to fetch the vcf from the RADCamp website
!wget https://radcamp.github.io/NYC2020/Prates_et_al_2016_example_data/anolis.phy
Now create a RAxML object. The only required argument to initialize the object is a phylip formatted sequence file. In this example we provide a name and working directory as well:
rax = ipa.raxml(
data="./anolis.phy",
name="anolis-tree",
workdir="./anolis-raxml",
);
Additional options
RAxML has a ton of parameters for modifying how it behaves, and we will only
explore just a fraction of these. For more info on RAxML parameters, look
here. You can
also specify many of these parameters by setting values in the params dictionary
of your RAxML object. In the following cell we modify the number of
bootstrapping runs on distinct starting trees (params.N
), the number of
threads to use (params.T
), and the outgroup samples (params.o
).
## Number of runs
rax.params.N = 10
## Number of threads
rax.params.T = 2
## Set the outgroup. Because we don't have an outgroup for Anolis we use None.
rax.params.o = None
## Alternatively, if we had an outgroup we could specify this with sample names
## Here we could specify the Northern samples as the outgroup, this is just for illustration
## rax.params.o = ['punc_ICST764', 'punc_MUFAL9635']
Print the command string
It is good practice to always print the command string so that you know exactly what was called for your analysis and it is documented.
print(rax.command)
/srv/conda/envs/notebook/bin/raxmlHPC-PTHREADS-AVX2 -f a -T 2 -m GTRGAMMA -n anolis-tree -w /home/jovyan/newdocs/API-analysis/anolis-raxml -s /home/jovyan/newdocs/API-analysis/anolis.phy -p 54321 -N 10 -x 12345
Explanation of RAxML arguments:
- -f: selects the algorithm RAxML will execute. In this case,
-f a
will perform a rapid bootstrap analysis and search for best-scoring ML tree in a single run. - -T: specifies the number of threads you want to run. This should reflect the number of cores available in your machine, only 2 in our case.
- -m: selects the model of nucleotide substitution. In this case, GTRGAMMA is the GTR (generalised time-reversible) + GAMMA model of rate heterogeneity. A complex, but standard model used in RAxML.
- -N: specifies the number of alternative runs on distinct starting parsimony trees (bootstrapping).
- -x: random seed number for the analysis.
- -p: random seed number for the parsimony inference of starting trees.
- -n: specifies the root name for the output files.
- -w: specifies the name of the output directory where RAxML will write the output files. Note that you need to specify the full path.
- -s: specifies the name of the input alignment file in PHYLIP format.
Run the job
This will start the job running. The subsampled dataset we are using should run very quickly (~1-2 minutes).
rax.run(force=True)
job aligntest finished successfully
Note: We are running only 10 bootstraps, which takes very little time. In fact, when running a real analysis, we should run at least 500 or 1000 bootstraps. For real large datasets, running an alignment of the entire loci can be very time consumming. Because of that, you can explore RAxML using only SNPs in a PHYLIP format (e.g. anolis.snps.phy) and excluding the invariant sites. Using only variable sites should reduce considerably the running time. However, branch lenghts can be biased when using only variable sites, especially with high levels of missing data. See Leaché et al 2015 for methods correcting for aquisition bias in RAxML when using SNPs only.
Access results
One of the reasons it is so convenient to run your RAxML jobs this way is that the results files are easily accessible from your RAxML objects.
rax.trees
bestTree ~/ipyrad-workshop/anolis-raxml/RAxML_bestTree.anolis-tree
bipartitions ~/ipyrad-workshop/anolis-raxml/RAxML_bipartitions.anolis-tree
bipartitionsBranchLabels ~/ipyrad-workshop/anolis-raxml/RAxML_bipartitionsBranchLabels.anolis-tree
bootstrap ~/ipyrad-workshop/anolis-raxml/RAxML_bootstrap.anolis-tree
info ~/ipyrad-workshop/anolis-raxml/RAxML_info.anolis-tree * bestTree - Exactly what it says, the single best ML tree. * bipartions - The ML tree with bootstrap support on nodes. * bipartionsBranchLabels - The ML tree with support values on branches rather than nodes. * bootstrap - All bootstraped trees. * info - RAxML command line parameters and run info
Plot the results
Here we use toytree to plot the bootstrap results.
tre = toytree.tree(rax.trees.bipartitions)
tre.draw(
height=300,
width=800,
node_labels=tre.get_node_values("support"),
node_sizes=15
)
Note: Toytree is a simple yet flexible and powerful tree drawing program, which we will only briefly introduce. Extensive docs and a tutorial are available on the toytree documentation site.
Rooting the tree
In the above figure the two Northern samples are nested deep within the Southern
clade, but this tree is unrooted. Lets say we want to root the tree on the
Northern samples and replot. This is accomplished by adding the root
parameter
to the tree.draw()
function and specifying the samples to root the tree to:
tre = toytree.tree(rax.trees.bipartitions)
tre = tre.root(["punc_ICST764", "punc_MUFAL9635"])
tre.draw(
width=600,
node_labels=tre.get_node_values("support"),
node_sizes=15
)
Note: The
root()
function accepts a list of samples, so if you have multiple samples from the root taxon, you can include them like this:tre.root(["punc_ICST764", "punc_MUFAL9635", "punc_MTR05978"])
Experimenting with the simulated data
Tree rooting can also be accomplished with the wildcard
parameter of the
tree.root()
function. This is somewhat more straightforward to demonstrate
with the simulated data, so we can create a new raxml
object with the
simulated phylip file, rerun the RAxML tree inference, and then do some plotting:
rax = ipa.raxml(
data="ipyrad-workshop/rad_outfiles/rad.phy",
name="aligntest",
workdir="./analysis-raxml",
);
rax.params.N = 10
rax.params.T = 2
rax.params.o = None
rax.run(force=True)
Here the wildcard="3"
argument specifies to root the tree using all the
samples that include “3” in their names.
tre = toytree.tree(rax.trees.bipartitions)
tre = tre.root(wildcard="3")
tre.draw(
width=600,
node_labels=tre.get_node_values("support"),
node_sizes=15
);
Further exploration
We provide a more thorough exploration of the ipyrad.analysis.raxml
module in
a notebook on the ipyrad github site,
including more details about how to take full advantage of running parallel
RAxML processes on a cluster.