Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets

Progress in sequencing technology allows researchers to assemble ever-larger supermatrices for phylogenomic inference. However, current phylogenomic studies often rest on patchy data sets, some with 80% or more missing (or ambiguous) data. Though early simulations had suggested that missing data per se do not harm phylogenetic inference when using sufficiently large data sets, Lemmon et al. (Lemmon AR, Brown JM, Stanger-Hall K, Lemmon EM. 2009. The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Syst Biol. 58:130-145.) have recently cast doubt on this consensus in a study based on the introduction of parsimony-uninformative incomplete characters. In this work, we empirically reassess the issue of missing data in phylogenomics while exploring possible interactions with the model of sequence evolution. First, we note that parsimony-uninformative incomplete characters are actually informative in a probabilistic framework. A reanalysis of the data set of Lemmon et al. with this in mind gives a very different interpretation of their results and shows that some of their conclusions may be unfounded. Second, we investigate the effect of the progressive introduction of missing data in a complete supermatrix (126 genes × 39 species) capable of resolving animal relationships. These analyses demonstrate that missing data perturb phylogenetic inference slightly beyond the expected decrease in resolving power. In particular, they exacerbate systematic errors by reducing the number of species effectively available for the detection of multiple substitutions. Consequently, large sparse supermatrices are more sensitive to phylogenetic artifacts than smaller but less incomplete data sets, which argues for experimental designs aimed at collecting a modest number (∼50) of highly covered genes. Our results further confirm that including incomplete yet short-branch taxa (i.e., slowly evolving species or close outgroups) can help to eschew artifacts, as predicted by simulations. Finally, it appears that selecting an adequate model of sequence evolution (e.g., the site-heterogeneous CAT model instead of the site-homogeneous WAG model) is more beneficial to phylogenetic accuracy than reducing the level of missing data.
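The progressive introduction of missing data into a complete supermatrix can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not specify how cells were chosen, so the sketch assumes whole gene blocks are blanked for randomly drawn (species, gene) pairs. The names `mask_gene_blocks`, `missing_fraction`, `toy`, and `bounds` are hypothetical, and the toy matrix is invented for illustration.

```python
import random


def mask_gene_blocks(matrix, gene_bounds, fraction, seed=0):
    """Return a copy of `matrix` with about `fraction` of its
    (species, gene) cells replaced by '?' over the whole gene block.

    matrix      -- dict mapping species name to a concatenated sequence
    gene_bounds -- list of (start, end) column ranges, one per gene
    fraction    -- share of (species, gene) cells to blank, in [0, 1]

    Assumption: cells are drawn uniformly at random, which is only one
    of many possible masking schemes.
    """
    rng = random.Random(seed)
    cells = [(sp, g) for sp in matrix for g in range(len(gene_bounds))]
    n_mask = round(fraction * len(cells))
    masked = {sp: list(seq) for sp, seq in matrix.items()}
    for sp, g in rng.sample(cells, n_mask):
        start, end = gene_bounds[g]
        masked[sp][start:end] = "?" * (end - start)
    return {sp: "".join(chars) for sp, chars in masked.items()}


def missing_fraction(matrix):
    """Share of '?' characters over all cells of the supermatrix."""
    total = sum(len(seq) for seq in matrix.values())
    missing = sum(seq.count("?") for seq in matrix.values())
    return missing / total


# Toy complete supermatrix: 4 species x 3 genes of 4 columns each.
toy = {
    "sp1": "MKVLAAGIVGLS",
    "sp2": "MKVLSAGIVGLT",
    "sp3": "MRVLAAGLVGLS",
    "sp4": "MKILAAGIVALS",
}
bounds = [(0, 4), (4, 8), (8, 12)]

# Progressively sparser versions of the same matrix.
series = {f: mask_gene_blocks(toy, bounds, f) for f in (0.0, 0.25, 0.5, 0.75)}
```

Each matrix in `series` could then be passed to the same inference pipeline, so that any change in the resulting trees can be related to the level of missing data. Uniform random masking may blank every gene of one species or leave a gene with very few species, so a real design would probably add constraints on minimum coverage.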

Disciplines :

Genetics & genetic processes
Zoology
Biochemistry, biophysics & molecular biology

Author, co-author :

Roure, Béatrice; Université de Montréal - UdeM > Département de Biochimie

Baurain, Denis ; Université de Liège - ULiège > Département de productions animales > GIGA-R : Génomique animale

Philippe, Hervé; Université de Montréal - UdeM > Département de Biochimie

Language :

English

Title :

Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets

Publication date :

January 2013

Journal title :

Molecular Biology and Evolution

ISSN :

0737-4038

eISSN :

1537-1719

Publisher :

Oxford University Press, New York, United States - New York

Volume :

Issue :

Pages :

197-214

Peer reviewed :

Peer Reviewed verified by ORBi

Available on ORBi :

since 14 November 2012

Bibliography

Bapteste E, Brinkmann H, Lee JA, et al. (11 co-authors). 2002. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc Natl Acad Sci USA. 99:1414-1419.
Barley AJ, Spinks PQ, Thomson RC, Shaffer HB. 2010. Fourteen nuclear genes provide phylogenetic resolution for difficult nodes in the turtle tree of life. Mol Phylogenet Evol. 55:1189-1194.
Bininda-Emonds OR, Gittleman JL, Steel MA. 2002. The (super)tree of life: procedures, problems, and prospects. Annu Rev Ecol Evol Syst. 33:265-289.
Bourlat SJ, Juliusdottir T, Lowe CJ, et al. (14 co-authors). 2006. Deuterostome phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida. Nature 444:85-88.
Castresana J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 17:540-552.
Criscuolo A, Berry V, Douzery EJ, Gascuel O. 2006. SDM: a fast distance-based approach for (super) tree building in phylogenomics. Syst Biol. 55:740-755.
Criscuolo A, Gascuel O. 2008. Fast NJ-like algorithms to deal with incomplete distance matrices. BMC Bioinformatics. 9:166.
Delsuc F, Brinkmann H, Chourrout D, Philippe H. 2006. Tunicates and not cephalochordates are the closest living relatives of vertebrates. Nature 439:965-968.
Delsuc F, Brinkmann H, Philippe H. 2005. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 6:361-375.
Driskell AC, Ane C, Burleigh JG, McMahon MM, O'Meara BC, Sanderson MJ. 2004. Prospects for building the tree of life from large sequence databases. Science 306:1172-1174.
Dunn CW, Hejnol A, Matus DQ, et al. (18 co-authors). 2008. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452:745-749.
Evans NM, Holder MT, Barbeitos MS, Okamura B, Cartwright P. 2010. The phylogenetic position of Myxozoa: exploring conflicting signals in phylogenomic and ribosomal data sets. Mol Biol Evol. 27:2733-2746.
Felsenstein J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool. 27:401-410.
Felsenstein J. 2001. PHYLIP (Phylogeny Inference Package): distributed by the author. Seattle (WA): Department of Genetics, University of Washington. http://evolution.gs.washington.edu/phylip.html.
Gauthier JA. 1986. Saurischian monophyly and the origin of birds. In: Padian K, editor. The origin of birds and the evolution of flight. Memoirs of the California Academy of Sciences, No. 8. San Francisco: California Academy of Sciences. p. 1-55.
Halanych KM. 2004. The new view of animal phylogeny. Annu Rev Ecol Evol Syst. 35:229-256.
Hejnol A, Obst M, Stamatakis A, et al. (17 co-authors). 2009. Assessing the root of bilaterian animals with scalable phylogenomic methods. Proc Biol Sci. 276:4261-4270.
Hendy MD, Penny D. 1989. A framework for the quantitative study of evolutionary trees. Syst Zool. 38:297-309.
Huelsenbeck JP. 1991. When are fossils better than extant taxa in phylogenetic analysis? Syst Zool. 40:458-469.
Jeffroy O, Brinkmann H, Delsuc F, Philippe H. 2006. Phylogenomics: the beginning of incongruence? Trends Genet. 22:225-231.
Kupczok A, Schmidt HA, von Haeseler A. 2010. Accuracy of phylogeny reconstruction methods combining overlapping gene data sets. Algorithms Mol Biol. 5:37.
Lartillot N, Brinkmann H, Philippe H. 2007. Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol Biol. 7(1 Suppl):S4.
Lartillot N, Lepage T, Blanquart S. 2009. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25:2286-2288.
Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21:1095-1109.
Lartillot N, Philippe H. 2008. Improvement of molecular phylogenetic inference and the phylogeny of Bilateria. Philos Trans R Soc Lond B Biol Sci. 363:1463-1472.
Laurin-Lemay S, Brinkmann H, Philippe H. 2012. Origin of land plants revisited in the light of sequence contamination and missing data. Curr Biol. 22:R593-R594.
Lemmon AR, Brown JM, Stanger-Hall K, Lemmon EM. 2009. The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Syst Biol. 58:130-145.
Lemmon AR, Emme SA, Lemmon EM. 2012. Anchored hybrid enrichment for massively high-throughput phylogenomics. Syst Biol. 61:727-744.
Madsen O, Scally M, Douady CJ, Kao DJ, DeBry RW, Adkins R, Amrine HM, Stanhope MJ, de Jong WW, Springer MS. 2001. Parallel adaptive radiations in two major clades of placental mammals. Nature 409:610-614.
Novacek MJ. 1992. Fossils, topologies, missing data, and the higher level phylogeny of eutherian mammals. Syst Biol. 41:58-73.
Pagel M, Meade A. 2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst Biol. 53:571-581.
Parkinson CL, Adams KL, Palmer JD. 1999. Multigene analyses identify the three earliest lineages of extant flowering plants. Curr Biol. 9:1485-1488.
Philippe H. 1993. MUST, a computer package of management utilities for sequences and trees. Nucleic Acids Res. 21:5264-5272.
Philippe H, Brinkmann H, Copley RR, Moroz LL, Nakano H, Poustka AJ, Wallberg A, Peterson KJ, Telford MJ. 2011. Acoelomorph flatworms are deuterostomes related to Xenoturbella. Nature 470:255-258.
Philippe H, Brinkmann H, Lavrov DV, Littlewood DT, Manuel M, Worheide G, Baurain D. 2011. Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol. 9:e1000602.
Philippe H, Brinkmann H, Martinez P, Riutort M, Baguna J. 2007. Acoel flatworms are not platyhelminthes: evidence from phylogenomics. PLoS One. 2:e717.
Philippe H, Delsuc F, Brinkmann H, Lartillot N. 2005. Phylogenomics. Annu Rev Ecol Evol Syst. 36:541-562.
Philippe H, Derelle R, Lopez P, et al. (20 co-authors). 2009. Phylogenomics revives traditional views on deep animal relationships. Curr Biol. 19:706-712.
Philippe H, Lartillot N, Brinkmann H. 2005. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol Biol Evol. 22:1246-1253.
Philippe H, Snell EA, Bapteste E, Lopez P, Holland PW, Casane D. 2004. Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol Biol Evol. 21:1740-1752.
Philippe H, Telford MJ. 2006. Large-scale sequencing and the new animal phylogeny. Trends Ecol Evol. 21:614-620.
Phillips MJ, Delsuc F, Penny D. 2004. Genome-scale phylogeny and the detection of systematic biases. Mol Biol Evol. 21:1455-1458.
Pick KS, Philippe H, Schreiber F, et al. (11 co-authors). 2010. Improved phylogenomic taxon sampling noticeably affects nonbilaterian relationships. Mol Biol Evol. 27:1983-1987.
Regier JC, Shultz JW, Zwick A, Hussey A, Ball B, Wetzer R, Martin JW, Cunningham CW. 2010. Arthropod relationships revealed by phylogenomic analysis of nuclear protein-coding sequences. Nature 463:1079-1083.
Robinson DF, Foulds LR. 1981. Comparison of phylogenetic trees. Math Biosci. 53:131-147.
Rodrigue N, Philippe H, Lartillot N. 2010. Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles. Proc Natl Acad Sci USA. 107:4629-4634.
Ronquist F, Huelsenbeck JP. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572-1574.
Rota-Stabelli O, Campbell L, Brinkmann H, Edgecombe GD, Longhorn SJ, Peterson KJ, Pisani D, Philippe H, Telford MJ. 2011. A congruent solution to arthropod phylogeny: phylogenomics, microRNAs and morphology support monophyletic Mandibulata. Proc Biol Sci. 278:298-306.
Roure B, Philippe H. 2011. Site-specific time heterogeneity of the substitution process and its impact on phylogenetic inference. BMC Evol Biol. 11:17.
Roure B, Rodriguez-Ezpeleta N, Philippe H. 2007. SCaFoS: a tool for Selection, Concatenation and Fusion of Sequences for phylogenomics. BMC Evol Biol. 7(1 Suppl):S2.
Rubin BE, Ree RH, Moreau CS. 2012. Inferring phylogenies from RAD sequence data. PLoS One. 7:e33394.
Sanderson MJ, McMahon MM, Steel M. 2011. Terraces in phylogenetic tree space. Science 333:448-450.
Sanderson MJ, Purvis A, Henze C. 1998. Phylogenetic supertrees: assembling the trees of life. Trends Ecol Evol. 13:105-109.
Schierwater B, Eitel M, Jakob W, Osigus HJ, Hadrys H, Dellaporta SL, Kolokotronis SO, Desalle R. 2009. Concatenated analysis sheds light on early metazoan evolution and fuels a modern "urmetazoon" hypothesis. PLoS Biol. 7:e20.
Simon S, Strauss S, von Haeseler A, Hadrys H. 2009. A phylogenomic approach to resolve the basal pterygote divergence. Mol Biol Evol. 26:2719-2730.
Soltis DE, Albert VA, Savolainen V, et al. (11 co-authors). 2004. Genome-scale data, angiosperm relationships, and "ending incongruence": a cautionary tale in phylogenetics. Trends Plant Sci. 9:477-483.
Soria-Carrasco V, Talavera G, Igea J, Castresana J. 2007. The K tree score: quantification of differences in the relative branch length and topology of phylogenetic trees. Bioinformatics 23:2954-2956.
Sperling EA, Peterson KJ, Pisani D. 2009. Phylogenetic-signal dissection of nuclear housekeeping genes supports the paraphyly of sponges and the monophyly of Eumetazoa. Mol Biol Evol. 26:2261-2274.
Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688-2690.
Stefanovic S, Rice DW, Palmer JD. 2004. Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots? BMC Evol Biol. 4:35.
Swofford DL. 2000. PAUP*: phylogenetic analysis using parsimony and other methods. Sunderland (MA): Sinauer.
Telford MJ, Copley RR. 2011. Improving animal phylogenies with genomic data. Trends Genet. 27:186-195.
Thompson JD, Higgins DG, Gibson TJ. 1994. CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Vos RA, Caravas J, Hartmann K, Jensen MA, Miller C. 2011. Bio::Phylo - phyloinformatic analysis using Perl. BMC Bioinformatics. 12:63.
Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18:691-699.
Wiens JJ. 1998. Does adding characters with missing data increase or decrease phylogenetic accuracy? Syst Biol. 47:625-640.
Wiens JJ. 2003. Missing data, incomplete taxa, and phylogenetic accuracy. Syst Biol. 52:528-538.
Wiens JJ. 2005. Can incomplete taxa rescue phylogenetic analyses from long-branch attraction? Syst Biol. 54:731-742.
Wiens JJ. 2006. Missing data and the design of phylogenetic analyses. J Biomed Inform. 39:34-42.
Wiens JJ, Moen DS. 2008. Missing data and the accuracy of Bayesian phylogenetics. J Syst Evol. 46:307-314.
Wiens JJ, Morrill MC. 2011. Missing data in phylogenetic analysis: reconciling results from simulations and empirical data. Syst Biol. 60:719-731.
Wiens JJ, Tiu J. 2012. Highly incomplete taxa can rescue phylogenetic analyses from the negative impacts of limited taxon sampling. PLoS One. 7:e42925.
Wilkinson M. 1995. Coping with missing entries in phylogenetic inference using parsimony. Syst Biol. 44:501-514.
Yang Z. 1996. Maximum-likelihood models for combined analyses of multiple sequence data. J Mol Evol. 42:587-596.
Zwickl DJ, Hillis DM. 2002. Increased taxon sampling greatly reduces phylogenetic error. Syst Biol. 51:588-598.