Last update: May 27, 2023, Contributors: Minh Bui

IQ-TREE 2 Tutorial (Workshop on Molecular Evolution, Woods Hole 2023)

In the virtual machine established by the organisers you can run IQ-TREE version 2.2.2.6 from the command line:

iqtree2

which should display something like this to the screen:

IQ-TREE multicore version 2.2.2.6 COVID-edition for Linux 64-bit built May 27 2023
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams, Ly Trong Nhan.
...

If you instead want to run it on your own computer, download version 2.2.2.6 and install the binary for your platform. For the next steps, add the folder containing your iqtree2 executable to your PATH environment variable so that IQ-TREE can be invoked by simply entering iqtree2 at the command line. Alternatively, you can copy the iqtree2 binary into a folder on your system search path. Note that this does not apply if you are using the virtual machines.
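As a minimal sketch of the PATH setup on Linux or macOS, the lines below prepend a folder to PATH for the current terminal session and check whether iqtree2 can be found. The folder name in IQTREE_DIR is a hypothetical location used only for illustration; replace it with the folder that actually contains your iqtree2 executable.

```shell
# Hypothetical install location (an assumption); adjust to your own setup.
IQTREE_DIR="$HOME/iqtree-2.2.2.6-Linux/bin"

# Prepend the folder to PATH for the current shell session only.
export PATH="$IQTREE_DIR:$PATH"

# Report whether the shell can now locate the executable.
if command -v iqtree2 >/dev/null 2>&1; then
    echo "iqtree2 found at: $(command -v iqtree2)"
else
    echo "iqtree2 not found; check IQTREE_DIR"
fi
```

To make the change permanent you would typically add the export line to your shell start-up file (for example ~/.bashrc), but that depends on your shell.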

1) Input data

We will use a Turtle data set to demonstrate the use of IQ-TREE throughout this workshop tutorial. We try to resolve a once hotly debated phylogenetic position of Turtles, relative to Crocodiles and Birds. There are three possible relationships between them and we want to know which one is the true one:

Three possible trees of Turtles, Crocodiles and Birds

(Picture courtesy of Jeremy Brown)

If you are logged into the virtual machine, you can copy the data from moledata/iqtreelab/ folder:

cd
cp -r moledata/iqtreelab .
cd iqtreelab

This folder contains two input files (which can also be downloaded from the following link):

  • turtle.fa: The DNA alignment (in FASTA format), which is a subset of the original Turtle data set used to assess the phylogenetic position of Turtle relative to Crocodile and Bird (Chiari et al., 2012).
  • turtle.nex: The partition file (in NEXUS format) defining 29 genes, which are a subset of the published 248 genes (Chiari et al., 2012).

QUESTIONS:

  • View the alignment in Jalview or your favourite alignment viewer.

  • Can you identify the gene boundaries from the viewer? Do they roughly match the partition file?

  • Is there missing data? Do you think missing data can be problematic?
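If you want a rough number to go with what you see in the viewer, the following sketch counts the sequences in a FASTA file and the percentage of characters that are gaps or missing. It assumes a DNA alignment in which "-", "?" and "N" denote gaps or missing data; the function name fasta_summary and the toy alignment are made up for illustration.

```shell
# Summarise a FASTA alignment: number of sequences and percentage of
# gap/missing characters. Assumes DNA, where -, ? and N mean gap/missing.
fasta_summary() {
    awk '
        /^>/ { nseq++; next }                      # header lines start a new sequence
        {
            seq = toupper($0)
            total += length(seq)                   # all characters on this sequence line
            missing += gsub(/[-?N]/, "", seq)      # gsub returns the number of matches
        }
        END { printf "sequences=%d missing_pct=%.1f\n", nseq, (total ? 100 * missing / total : 0) }
    ' "$1"
}

# Toy alignment used only to demonstrate the function.
demo=$(mktemp)
cat > "$demo" <<'EOF'
>A
ACGT--NN
>B
ACGTACGT
EOF
summary=$(fasta_summary "$demo")
echo "$summary"
rm -f "$demo"
```

On the workshop data you would call it as fasta_summary turtle.fa.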

2) Inferring the first phylogeny

You can now start to reconstruct a maximum-likelihood (ML) tree for the Turtle data set (assuming that you are in the same folder where the alignment is stored).

What is the command line to run iqtree2 that takes the alignment file turtle.fa as input, performs 1000 ultrafast bootstrap replicates, and automatically determines the best number of cores to use (-T AUTO option)?

Once the run is done, IQ-TREE will write several output files including:

  • turtle.fa.iqtree: the main report file, which is human-readable. You should look at this file to see the computational results. It also contains a textual representation of the final tree.
  • turtle.fa.treefile: the ML tree in NEWICK format, which can be visualized in FigTree or any other tree viewer program.
  • turtle.fa.log: log file of the entire run (also printed on the screen).
  • turtle.fa.ckp.gz: checkpoint file used to resume an interrupted analysis.
  • And a few other files.

QUESTIONS:

  • Look at the report file turtle.fa.iqtree.

  • What is the best-fit model name? What do you know about this model? (see substitution models available in IQ-TREE)

  • What are the AIC/AICc/BIC scores of this model and tree?

  • Look at the tree in turtle.fa.iqtree or visualise the tree turtle.fa.treefile in a tree viewer such as FigTree. Which of the three possible relationships does this tree support?

  • What is the ultrafast bootstrap support (%) for the relevant clade?

  • Does this tree agree with the published tree (Chiari et al., 2012)?
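When comparing these scores across models, it may help to recall the standard definitions: AIC = -2 lnL + 2k, AICc = AIC + 2k(k+1)/(n-k-1) and BIC = -2 lnL + k ln(n), where lnL is the log-likelihood, k the number of free parameters and n the number of alignment sites. Lower scores indicate a better trade-off between fit and model complexity. The sketch below evaluates these formulas for made-up values of lnL, k and n; they are not results from the Turtle data.

```shell
# Hypothetical values for illustration only; read the real ones from the report file.
lnL=-1000      # log-likelihood of the tree
k=10           # number of free parameters
n=500          # number of alignment sites

scores=$(awk -v lnL="$lnL" -v k="$k" -v n="$n" 'BEGIN {
    aic  = -2 * lnL + 2 * k                         # Akaike information criterion
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)    # small-sample correction
    bic  = -2 * lnL + k * log(n)                    # log() is the natural logarithm in awk
    printf "AIC=%.2f AICc=%.2f BIC=%.2f", aic, aicc, bic
}')
echo "$scores"
```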

3) Applying partition model

We now perform a partition model analysis (Chernomor et al., 2016), where one allows each partition to have its own model.

What is the command line to run iqtree2 that takes turtle.fa as input alignment, turtle.nex as input partition file, performs 1000 ultrafast bootstrap replicates, and automatically determines the best number of cores?

QUESTIONS:

  • Look at the report file turtle.nex.iqtree. What are the AIC/AICc/BIC scores of the partition model? Is it better than the previous model?

  • Look at the tree in turtle.nex.iqtree or visualize turtle.nex.treefile in FigTree. Which of the three possible relationships does this tree support?

  • What is the ultrafast bootstrap support (%) for the relevant clade?

  • Does this tree agree with the published tree (Chiari et al., 2012)?

4) Choosing the best partitioning scheme

We now perform the PartitionFinder algorithm (Lanfear et al., 2012) that tries to merge partitions to reduce the potential over-parameterization.

What is the command line to run iqtree2 that takes turtle.fa as input alignment, turtle.nex as input partition file, performs 1000 ultrafast bootstrap replicates, merges the partitions with relaxed clustering algorithm, and automatically determines the best number of cores?

  • Please use --prefix turtle.merge to set the prefix for all output files as turtle.merge.*. This is to avoid overwriting outputs from the previous analysis.

QUESTIONS:

  • Look at the report file turtle.merge.iqtree. How many partitions do we have now?

  • Look at the AIC/AICc/BIC scores. Compared with the two previous models, is this model better or worse?

  • Look at the tree in turtle.merge.iqtree or visualize turtle.merge.treefile in FigTree. Which of the three possible relationships does this tree support?

  • What is the ultrafast bootstrap support (%) for the relevant clade?

  • Does this tree agree with the published tree (Chiari et al., 2012)?

5) Tree topology tests

We now want to know whether the trees inferred for the Turtle data set have significantly different log-likelihoods or not. This can be conducted with the SH test (Shimodaira and Hasegawa, 1999), or expected likelihood weights (Strimmer and Rambaut, 2002).

First, concatenate the trees constructed by single and partition models into one file:

For Linux/MacOS:

cat turtle.fa.treefile turtle.nex.treefile >turtle.trees

For Windows:

type turtle.fa.treefile turtle.nex.treefile >turtle.trees
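On Linux/MacOS, a quick sanity check is to count the trees in the combined file before passing it to IQ-TREE. The sketch below assumes one Newick tree per line, each ending in a semicolon; the function name count_trees and the two toy trees are made up for illustration.

```shell
# Count lines that contain a semicolon, i.e. lines holding a Newick tree
# (assuming one tree per line).
count_trees() {
    grep -c ';' "$1"
}

# Toy demonstration with two small trees written to a temporary folder.
workdir=$(mktemp -d)
printf '(A,B,(C,D));\n' > "$workdir/first.treefile"
printf '(A,C,(B,D));\n' > "$workdir/second.treefile"
cat "$workdir/first.treefile" "$workdir/second.treefile" > "$workdir/both.trees"
n_trees=$(count_trees "$workdir/both.trees")
echo "trees in file: $n_trees"
rm -rf "$workdir"
```

For the tutorial data, count_trees turtle.trees should report 2. A smaller number may mean that one input file lacked a trailing newline, so that two trees ended up on the same line.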

Now you can pass this file into IQ-TREE via -z option.

What is the command line to run iqtree2 that takes turtle.fa as input alignment, turtle.merge.best_scheme.nex as input partition file, turtle.trees as input trees file, performs topology tests with 10,000 replicates, performs the approximately unbiased (AU) test, and no tree search to save time?

  • Please use --prefix turtle.test to set the prefix for all output files as turtle.test.*.

QUESTIONS:

  • Look at the USER TREES section in the report file turtle.test.iqtree. Which tree has worse log-likelihood?

  • Can you reject this tree according to the Shimodaira-Hasegawa test, assuming a p-value cutoff of 0.05?

  • Can you reject this tree according to the Approximately Unbiased test, assuming a p-value cutoff of 0.05?

HINTS:

  • The KH, SH and AU tests return p-values, thus a tree is rejected if its p-value < 0.05 (marked with a - sign).
  • bp-RELL and c-ELW return posterior weights, which are not p-values. The weights sum to 1 across the trees tested.

6) Tree mixture model

Another way to analyze different topologies is to use the mixture across sites and trees (MAST) model. MAST relaxes the assumption of a single bifurcating tree on the data. MAST assumes that there is a collection of trees, where each site of the alignment can have a certain probability of having evolved under each of the trees. Each tree has its own topology and branch lengths, and optionally different substitution rates, different nucleotide/amino acid frequencies, and even different rate heterogeneities across sites. The MAST model will estimate all these parameters, and additionally a weight for each tree, roughly representing the proportion of sites evolving under that tree. Unlike partition models, the tree mixture model does not need a partition file, and thus is actually simpler to run.

Your task is now to apply the MAST model to the Turtle data. To use this model, use the option -m to specify the model and add “+T” to the model name. For example, you can use -m GTR+T, but this model is a bit too simple. A better way is to look again at the best model found in step 2 and add “+T” to that model name.

What is the command line to run iqtree2 that takes turtle.fa as input alignment, turtle.trees as input trees file, applies the MAST model combined with the best model found in step 2?

  • Please use --prefix turtle.mix to set the prefix for all output files as turtle.mix.*.

QUESTIONS:

  • Look at turtle.mix.iqtree for the line printing the tree weights. Which tree has a higher weight?
  • Is it the tree having higher likelihood found in step 5?

7) Identifying most influential genes

Now we want to investigate the cause of the topological difference between the trees inferred under the single and partition models. One way is to identify the genes that contribute the most phylogenetic signal towards one tree but not the other.

How can one do this? We can look at the gene-wise log-likelihood (logL) differences between the two given trees T1 and T2. Genes with the largest logL(T1)-logL(T2) favor T1, whereas genes with the largest logL(T2)-logL(T1) favor T2.

To compute gene-wise log-likelihoods for the two trees, you can use the -wpl option (for writing partition log-likelihoods):

iqtree2 -s turtle.fa -p turtle.nex.best_scheme.nex -z turtle.trees -n 0 -wpl --prefix turtle.wpl

will write a file turtle.wpl.partlh, that contains log-likelihoods for all partitions in the original partition file. We use -p turtle.nex.best_scheme.nex here (instead of -p turtle.nex) to avoid doing model selection again.

Import turtle.wpl.partlh into MS Excel, LibreOffice Calc, or any other spreadsheet software. You will need to tell the software to treat spaces as delimiters, so that the values are imported into different columns for easy processing (e.g., doing the log-likelihood subtraction described above).
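If you prefer the command line to a spreadsheet, the subtraction can be sketched with awk. The layout assumed here is a header line followed by one row per tree, each row starting with a tree label and then one log-likelihood per partition. This is an assumption about the .partlh file, so inspect your file first and adapt the script if it differs. The function name partlh_diff and the numbers in the toy file are invented for illustration.

```shell
# Print "partition_index <TAB> logL(T1)-logL(T2)", sorted by the difference.
# Assumes: line 1 is a header, line 2 holds tree 1, line 3 holds tree 2,
# and the first field of each tree row is a label.
partlh_diff() {
    awk '
        NR == 1 { next }
        NR == 2 { n = NF; for (i = 2; i <= n; i++) t1[i] = $i; next }
        NR == 3 { for (i = 2; i <= n; i++) printf "%d\t%.3f\n", i - 1, t1[i] - $i }
    ' "$1" | LC_ALL=C sort -k2,2n
}

# Toy file with invented log-likelihoods for three partitions.
demo=$(mktemp)
cat > "$demo" <<'EOF'
2 3
Tree1 -100.5 -200.0 -300.25
Tree2 -101.5 -199.0 -300.25
EOF
diffs=$(partlh_diff "$demo")
echo "$diffs"
rm -f "$demo"
```

The most negative values at the top favor T2 and the most positive values at the bottom favor T1. The partition index is simply the column position, so you still need to look up the corresponding gene name in the partition file.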

QUESTIONS:

  • Compute the gene-wise log-likelihood differences between two trees.

  • What is the name of the gene showing the largest log-likelihood difference between two trees?

  • What is the name of the gene showing the second largest log-likelihood difference between two trees?

  • Were these two genes identified in (Brown and Thomson, 2016)?

  • Briefly describe the problem with these two genes.

8) Removing influential genes

We now try to construct a tree without these “influential” genes. To do so, copy the partition file turtle.nex to a new file and remove the lines defining the charset of these genes, and then repeat the IQ-TREE run with a partition model (see section 4). You will need to figure out a command line to run IQ-TREE yourself here.
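Instead of editing the copy by hand, the charset lines can also be filtered out on the command line. The sketch below assumes that each gene is defined on its own line of the form "charset NAME = ...;" and that gene names contain no regular-expression special characters. The function name remove_charsets and the gene names geneA, geneB and geneC are placeholders, not genes from the Turtle data.

```shell
# Print a NEXUS partition file without the charset lines of the given genes.
# $1 = partition file; remaining arguments = gene names to drop.
remove_charsets() {
    file="$1"; shift
    # Join the gene names into an alternation such as "geneA|geneB".
    pattern=$(printf '%s\n' "$@" | paste -sd '|' -)
    grep -Eiv "^[[:space:]]*charset[[:space:]]+($pattern)[[:space:]]*=" "$file"
}

# Toy partition file with placeholder gene names.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#nexus
begin sets;
    charset geneA = 1-100;
    charset geneB = 101-200;
    charset geneC = 201-300;
end;
EOF
kept=$(remove_charsets "$demo" geneB)
echo "$kept"
rm -f "$demo"
```

On the real data you would redirect the output to a new file of your choosing and pass that file to IQ-TREE as the partition file.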

QUESTIONS:

  • Document the command line you used to run IQ-TREE.

  • What tree topology do you get now?

  • What is the ultrafast bootstrap support (%) for the relevant clade?

  • Does this tree agree with the published tree (Chiari et al., 2012)?

9) Concordance factors

This task is optional

So far we have assumed that gene trees and species tree are equal. However, it is well known that gene trees might be discordant. Therefore, we now want to quantify the agreement between gene trees and species tree in a so-called concordance factor (Minh et al., 2020).

You first need to compute the gene trees, one for each partition separately:

iqtree2 -s turtle.fa -S turtle.nex --prefix turtle.loci -T 2

Options explained:

  • -S turtle.nex to tell IQ-TREE to infer separate trees for every partition in turtle.nex. All output files are similar to a partition analysis, except that the tree turtle.loci.treefile now contains a set of gene trees.

Definitions:

  • Gene concordance factor (gCF) is the percentage of decisive gene trees concordant with a particular branch of the species tree (0% <= gCF(b) <= 100%). gCF=0% means that branch b does not occur in any gene trees, whereas gCF=100% means that branch b occurs in every gene tree.

  • Site concordance factor (sCF) is the percentage of decisive (parsimony informative) alignment sites supporting a particular branch of the species tree (~33% <= sCF(b) <= 100%). sCF<33% means that another discordant branch b’ is more supported, whereas sCF=100% means that branch b is supported by all sites.

  • CAUTION when gCF ~ 0% or sCF < 33%, even if bootstrap supports are ~100%!

  • GREAT when gCF and sCF > 50% (i.e., branch is supported by a majority of genes and sites).
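As a small worked example of the gCF definition above, the percentage is the number of concordant gene trees divided by the number of decisive gene trees for that branch. The counts below are invented for illustration and are not results from the Turtle data.

```shell
# Hypothetical counts for one branch of the species tree.
concordant=10   # decisive gene trees that contain the branch
decisive=29     # gene trees that are decisive for the branch

gcf=$(awk -v c="$concordant" -v d="$decisive" 'BEGIN { printf "%.1f", 100 * c / d }')
echo "gCF = ${gcf}%"
```

With these numbers gCF is 34.5%, which is below the 50% mark mentioned above, so only a minority of the decisive gene trees would support the branch.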

You can now compute gCF and sCF for the tree inferred under the partition model:

iqtree2 -t turtle.nex.treefile --gcf turtle.loci.treefile -s turtle.fa --scf 100

Options explained:

  • -t turtle.nex.treefile to specify a species tree. We use the tree inferred under the partitioned model here, but you can of course use the other tree.
  • --gcf turtle.loci.treefile to specify a gene-trees file.
  • --scf 100 to draw 100 random quartets when computing sCF.

Once finished, this run will write several files:

  • turtle.nex.treefile.cf.tree: tree file where branches are annotated with bootstrap/gCF/sCF values.
  • turtle.nex.treefile.cf.stat: a table file with various statistics for every branch of the tree.

Similarly, you can compute gCF and sCF for the tree inferred under the unpartitioned model:

iqtree2 -t turtle.fa.treefile --gcf turtle.loci.treefile -s turtle.fa --scf 100

QUESTIONS:

  • Visualise turtle.nex.treefile.cf.tree.nex in FigTree.

  • Explore gene concordance factor (gCF), gene discordance factors (gDF1, gDF2, gDFP), site concordance factor (sCF) and site discordance factors (sDF1, sDF2).

  • How do gCF and sCF values look compared with bootstrap supports?

  • Visualise turtle.fa.treefile.cf.tree. What do these values look like now on the contradicting branch?

FINAL QUESTIONS:

  • Given all analyses you have done in this tutorial, which relationship between Turtle, Crocodile and Bird is true in your opinion?