Tutorial

Last update: Oct 16, 2021, Contributors: Cuong Dang

Estimating amino acid substitution models

Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees without outgroups. In our recent papers (Dang et al., 2021, Minh et al., 2021), we introduced QMaker and nQMaker, both using maximum likelihood approach that respectively allow the estimation of time reversible and time non-reversible amino acid substitution models from a single concatenated alignment or from a set of protein sequence alignments. Both reversible models estimated with QMaker and non-reversible models estimated with nQMaker fit better to empirical alignments than existing models, across a wide range of datasets. Moreover, the non-reversible models recovered correctly the commonly known root placements for the plant and bird trees with high statistical support without the need to use an outgroup. Six new reversible models (Q.pfam, Q.plant, Q.mammal, Q.bird, Q.insect, Q.yeast) and six new non-reversible models (NQ.pfam, NQ.plant, NQ.mammal, NQ.bird, NQ.insect, NQ.yeast) are implemented in the latest version of IQ-TREE. This tutorial provides an easy-to-use approach for users to estimate non-reversible models and rooted phylogenies from their own protein datasets. Please make sure that you use the lateset version of IQ-TREE for full features and cite our manuscripts:

Bui Quang Minh, Cuong Cao Dang, Le Sy Vinh, and Robert Lanfear (2021), QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution. https://doi.org/10.1093/sysbio/syab010

Cuong Cao Dang, Bui Quang Minh, Hanon McShea, Joanna Masel, Jennifer Eleanor James, Le Sy Vinh, and Robert Lanfear (2021), nQMaker: estimating time non-reversible amino acid substitution models. https://doi.org/10.1101/2021.10.18.464754

Estimating a model from a single concatenated alignment

Estimating a reversible model

We first demonstrate the estimation of a reversible model for a clade-specific dataset. Please download and extract the sample training data. This sample data was exctracted from plant data (Ran et al., 2018). There are two files in the downloaded folder:

  • alignment.nex contains the alignment in NEXUS format.
  • train.nex contains the training partitions in NEXUS format.

The estimation (training) then can be accomplished with three commands in IQ-TREE. The first command is:

# step 1: infer an single edge-linked tree with reversible models as initial models
iqtree2 --seed 1 -T AUTO -s alignment.nex -p train.nex -m MFP -mset LG,WAG,JTT -cmax 4 --prefix train_plant
  • -seed 1 sets the random seed.
  • -T AUTO sets the number of computing threads to automatically detection, -T 10 will let IQ-TREE utilize up to 10 CPU threads.
  • -s alignment.nex specifies the NEXUS file containing the concatenated alignment.
  • -p train.nex specifies the NEXUS file containing the training loci, -p option will estimate an edge-linked partition model with a single tree topology shared across all loci. This -p option is typically used with concatenation tree estimation that assumes a single species tree but rescale the branch lengths of the locus trees. It was shown to perform best among other partition models (Duchêne et al., 2019). Note: If you don’t have the partition file train.nex, simply ignore the -p option.
  • -mset LG,WAG,JTT defines the initial candidate matrices to reduce computational burden.
  • -cmax 4 restricts up to four categories for the rate heterogeneity across sites.
  • --prefix train_plant or -pre train_plant sets the name of the output files.

This will run ModelFinder to find the best model for each loci. The best models and the best trees will be saved to train_plant.best_model.nex and to train_plant.treefile, respectively. These two files will be used as the input for the second command that estimate a join reversible matrix:

# step 2: estimate a join non-reversible matrix across all loci
iqtree2 -seed 1 -T AUTO -s alignment.nex -p train_plant.best_model.nex -te train_plant.treefile --init-model LG --model-joint GTR20+FO --prefix plant_GTR20_FO - `--init-model LG` option specifies the initial matrix - `-p train_plant.best_model.nex` and `-te train_plant.treefile` options specify the best models and trees found in step 1.

The resulting matrix (Q.plant) which is included in the file plant_GTR20_FO.iqtree can be obtained with the command:

# step 3: extract the resulting reversible matrix
grep -A 21 "can be used as input for IQ-TREE" plant_GTR20_FO.iqtree | tail -n20 > Q.plant

We can open Q.plant in a text viewer as below, or use with IQ-TREE (e.g. -m Q.plant option). The Q.plant is in PAML format, it contains the lower diagonal part of the AA exchange rates matrix and 20 AA frequencies.

0.120955
0.048462 0.458565
0.436023 0.037551 4.799928
0.470505 1.707355 0.338537 0.000100
0.261114 3.736017 0.605443 0.147890 0.286279
0.837596 0.016734 0.388965 5.815020 0.000100 4.138646
1.939708 1.199364 0.911021 1.413132 0.681747 0.419571 0.886564
0.232927 3.431828 4.195096 1.867861 1.850593 4.826298 0.328092 0.170808
0.118245 0.101767 0.246939 0.041486 0.186430 0.018556 0.000100 0.000100 0.015294
0.137216 0.190052 0.000100 0.000100 0.362263 0.642152 0.028449 0.025460 0.554159 2.930110
0.135127 5.915805 3.499350 0.090416 0.000100 2.539190 1.229355 0.196033 0.301659 0.149798 0.013760
0.359701 0.983896 0.049506 0.000100 0.380313 0.236000 0.389444 0.000100 0.031711 5.188215 4.848803 0.701453
0.203942 0.034729 0.000100 0.000100 1.671509 0.000100 0.000100 0.048554 0.091040 1.015749 3.012478 0.000154 0.873360
1.317722 0.441669 0.000100 0.128632 0.000100 1.357928 0.123957 0.048531 0.824914 0.000100 0.502828 0.122503 0.000100 0.000100
3.425467 1.067776 4.114253 0.435618 2.788606 0.280845 0.104001 1.549784 0.361033 0.041339 0.384561 0.290811 0.188779 0.615963 2.474679
4.260751 0.715157 2.261495 0.178679 0.000100 0.277338 0.227034 0.186275 0.000100 1.909182 0.041874 0.966718 3.385793 0.072074 0.806596 4.938908
0.245284 0.698218 0.000100 0.000100 1.273897 0.161369 0.000100 0.405661 0.536027 0.012605 0.896675 0.000100 0.341461 0.993273 0.031731 0.131122 0.000100
0.012568 0.009986 0.621948 0.200102 2.122422 0.031013 0.010158 0.000100 5.748250 0.031443 0.000100 0.000100 0.156847 7.335297 0.173676 0.241389 0.000100 1.228204
2.630085 0.043992 0.071593 0.042649 0.326353 0.036449 0.277692 0.162678 0.140434 9.733184 1.730443 0.035985 1.921040 0.678498 0.149578 0.089526 0.989870 0.000100 0.234116

0.076028 0.051084 0.039819 0.051496 0.012072 0.038467 0.064409 0.052997 0.017632 0.064148 0.102707 0.073142 0.019406 0.048651 0.034180 0.077354 0.046007 0.011774 0.033201 0.085428

The amino-acid order in this file is:

 A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val

Estimating a non-reversible model

To estimate a non-reversible model, we use --model-joint NONREV+FO option instead of --model-joint GTR20+FO (in step 2). The three commands now are:

# step 1: infer an single edge-linked tree with reversible models as initial models
iqtree2 --seed 1 -T AUTO -s alignment.nex -p train.nex -m MFP -mset LG,WAG,JTT -cmax 4 --prefix train_plant

# step 2: estimate a join non-reversible matrix across all loci
iqtree2 -seed 1 -T AUTO -s alignment.nex -p train_plant.best_model.nex -te train_plant.treefile --model-joint NONREV+FO --prefix plant_NONREV_FO

# step 3: extract the resulting non-reversible matrix
grep -A 22 "can be used as input for IQ-TREE" plant_NONREV_FO.iqtree | tail -n21 > NQ.plant

The resulting NQ.plant matrix may now look like:

-1.061624 0.007785 0.002655 0.022039 0.006589 0.014742 0.062839 0.090131 0.001704 0.002323 0.003087 0.001751 0.012195 0.012809 0.041760 0.320267 0.210023 0.003323 0.000857 0.244743
0.008289 -1.080907 0.020365 0.000987 0.026624 0.177551 0.001802 0.086672 0.075705 0.009088 0.021643 0.491866 0.024868 0.000653 0.006541 0.072228 0.039720 0.013123 0.000986 0.002196
0.004032 0.019786 -1.313687 0.267488 0.005393 0.008927 0.032858 0.067708 0.062829 0.023892 0.000094 0.252395 0.000094 0.000094 0.000094 0.438096 0.096599 0.000094 0.027919 0.005295
0.041666 0.003896 0.227284 -0.890796 0.000094 0.000094 0.400738 0.099378 0.050443 0.002720 0.000094 0.007025 0.000094 0.000094 0.000094 0.027181 0.013001 0.000094 0.009413 0.007395
0.051903 0.094902 0.010786 0.000094 -0.893500 0.026469 0.000094 0.037475 0.106883 0.010559 0.070577 0.000094 0.000094 0.102084 0.000094 0.236907 0.000094 0.012537 0.099632 0.032222
0.013065 0.194182 0.039046 0.017941 0.000305 -1.075698 0.292962 0.027675 0.095402 0.001329 0.081175 0.209181 0.004369 0.000094 0.062599 0.012804 0.017520 0.000094 0.000094 0.005861
0.064413 0.000094 0.012826 0.355911 0.000094 0.181070 -0.822679 0.053530 0.003000 0.000094 0.002512 0.097327 0.005370 0.000094 0.005713 0.002772 0.012020 0.000094 0.000094 0.025650
0.187614 0.053999 0.033263 0.066321 0.009740 0.014380 0.058840 -0.621387 0.003268 0.000094 0.005951 0.015482 0.000094 0.000544 0.000094 0.145278 0.009973 0.000144 0.000094 0.016213
0.033332 0.235841 0.278083 0.091410 0.023465 0.257662 0.049843 0.011802 -1.428426 0.004130 0.053383 0.021902 0.000094 0.007900 0.054172 0.039777 0.000094 0.000094 0.250585 0.014858
0.023950 0.001632 0.004395 0.002772 0.003342 0.000717 0.000094 0.000837 0.000094 -1.493125 0.272227 0.010394 0.124246 0.059446 0.000094 0.003597 0.087965 0.000297 0.003057 0.893969
0.026527 0.011287 0.000094 0.000094 0.002177 0.021633 0.001975 0.000094 0.012033 0.226203 -0.745614 0.000094 0.089496 0.149580 0.006638 0.026123 0.003215 0.013687 0.002033 0.152631
0.017415 0.341123 0.171696 0.006194 0.000094 0.105871 0.089788 0.009665 0.007083 0.009163 0.002472 -0.829488 0.013473 0.000094 0.003097 0.013503 0.038477 0.000094 0.000094 0.000094
0.000094 0.043168 0.007303 0.000094 0.011450 0.011888 0.047858 0.000094 0.004725 0.350677 0.567158 0.061171 -1.532577 0.026166 0.000094 0.027269 0.182166 0.012220 0.014137 0.164845
0.007119 0.004300 0.000094 0.000094 0.023494 0.000094 0.002681 0.008815 0.000094 0.054752 0.348000 0.001664 0.026142 -0.836377 0.000094 0.051269 0.011247 0.006273 0.217177 0.072974
0.115066 0.035090 0.000094 0.011471 0.000094 0.046128 0.005991 0.007020 0.013780 0.000094 0.071513 0.009870 0.000094 0.000094 -0.550036 0.167147 0.048381 0.001156 0.005462 0.011491
0.238188 0.080851 0.127209 0.032161 0.040145 0.018437 0.014191 0.071781 0.006455 0.002646 0.050029 0.035037 0.001082 0.032565 0.125243 -1.151350 0.243892 0.000094 0.012692 0.018650
0.361018 0.036699 0.129824 0.003747 0.000094 0.010391 0.015700 0.012775 0.000630 0.135648 0.002005 0.101363 0.080632 0.000094 0.013000 0.417972 -1.388325 0.000094 0.000094 0.066546
0.014117 0.017973 0.000094 0.000094 0.025311 0.008866 0.000094 0.034917 0.015069 0.000094 0.068188 0.000094 0.000094 0.037263 0.000094 0.016805 0.000094 -0.251399 0.012044 0.000094
0.000094 0.000094 0.010861 0.001684 0.031664 0.000094 0.001058 0.000094 0.132692 0.000094 0.000094 0.000094 0.000094 0.405258 0.006474 0.003214 0.000094 0.007815 -0.601659 0.000094
0.203068 0.005150 0.004281 0.000094 0.003602 0.000454 0.018986 0.008201 0.003052 0.638676 0.198251 0.009524 0.040070 0.033033 0.008064 0.000094 0.062714 0.000094 0.007267 -1.244676

0.076646 0.049413 0.038372 0.049451 0.010780 0.037824 0.063761 0.052468 0.015186 0.065770 0.104298 0.072672 0.019435 0.049968 0.035179 0.078294 0.045874 0.013490 0.033377 0.087742

Note: To assess the statistical support of the root position with bootstraping (-B 1000 option), users can use this tutorial.

Estimating a model from a folder of alignments

Estimating a reversible model

We will now estimate a reversible model from a folder of alignments. Please first download the file plant_10alignments.zip. There is a sub-folder named train_plant in the downloaded folder. We use -S option instead of -s and -p options to allow each alignment having a separate tree. This -S option is typically used with a folder of alignments. The three commands are:

# step 1: infer a  separate tree for each alignment with reversible models as initial models
iqtree2 -seed 1 -T AUTO -S train_plant -mset LG,WAG,JTT -cmax 4 -pre train_plant

# step 2: estimate a join reversible matrix across all alignments
iqtree2 -seed 1 -T AUTO -S train_plant.best_model.nex -te train_plant.treefile --model-joint GTR20+FO --init-model LG -pre train_plant.GTR20

# step 3: extract the resulting reversible matrix
grep -A 21 "can be used as input for IQ-TREE" train_plant.GTR20.iqtree | tail -n20 > Q.plant

Estimating a non-reversible model

Same as estimating a non-reversible model from a single concatenated alignment, we change --model-joint GTR20+FO option (in step 2) to --model-joint NONREV+FO.

# step 1: infer a  separate tree for each alignment with reversible models as initial models
iqtree2 -seed 1 -T AUTO -S train_plant -mset LG,WAG,JTT -cmax 4 -pre train_plant

# step 2: estimate a join non-reversible matrix across all alignments
iqtree2 -seed 1 -T AUTO -S train_plant.best_model.nex -te train_plant.treefile --model-joint NONREV+FO --init-model LG -pre train_plant.NONREV

# step 3: extract the resulting non-reversible matrix
grep -A 22 "can be used as input for IQ-TREE" train_plant.NONREV.iqtree | tail -n21 > NQ.plant