Article (Scientific journals)
Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets
Roure, Béatrice; Baurain, Denis; Philippe, Hervé
2013. In Molecular Biology and Evolution, 30 (1), p. 197-214
Peer Reviewed verified by ORBi
 

Files


Full Text
Roure_et_al_2013_MBE_postprint_editor.pdf
Publisher postprint (2.24 MB)
Annexes
Roure_et_al_2013_MBE_suppl_data.pdf
Publisher postprint (1.4 MB)
Supplementary Data
Details



Abstract :
Progress in sequencing technology allows researchers to assemble ever-larger supermatrices for phylogenomic inference. However, current phylogenomic studies often rest on patchy data sets, with some having 80% missing (or ambiguous) data or more. Though early simulations had suggested that missing data per se do not harm phylogenetic inference when using sufficiently large data sets, Lemmon et al. (Lemmon AR, Brown JM, Stanger-Hall K, Lemmon EM. 2009. The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Syst Biol. 58:130-145.) have recently cast doubt on this consensus in a study based on the introduction of parsimony-uninformative incomplete characters. In this work, we empirically reassess the issue of missing data in phylogenomics while exploring possible interactions with the model of sequence evolution. First, we note that parsimony-uninformative incomplete characters are actually informative in a probabilistic framework. A reanalysis of Lemmon's data set with this in mind gives a very different interpretation of their results and shows that some of their conclusions may be unfounded. Second, we investigate the effect of the progressive introduction of missing data in a complete supermatrix (126 genes × 39 species) capable of resolving animal relationships. These analyses demonstrate that missing data perturb phylogenetic inference slightly beyond the expected decrease in resolving power. In particular, they exacerbate systematic errors by reducing the number of species effectively available for the detection of multiple substitutions. Consequently, large sparse supermatrices are more sensitive to phylogenetic artifacts than smaller but less incomplete data sets, which argues for experimental designs aimed at collecting a modest number (∼50) of highly covered genes. Our results further confirm that including incomplete yet short-branch taxa (i.e., slowly evolving species or close outgroups) can help to eschew artifacts, as predicted by simulations. Finally, it appears that selecting an adequate model of sequence evolution (e.g., the site-heterogeneous CAT model instead of the site-homogeneous WAG model) is more beneficial to phylogenetic accuracy than reducing the level of missing data.
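As an illustration of what "progressive introduction of missing data in a complete supermatrix" can look like in practice, the sketch below builds a complete 39 species × 126 genes presence matrix and masks gene-by-species cells until a target fraction is missing. This is only a minimal sketch: the abstract does not describe how the authors chose which cells to remove, so the uniform random masking used here is an assumption, and the function names `introduce_missing` and `missing_fraction` are hypothetical.

```python
import random


def introduce_missing(n_species, n_genes, target_fraction, seed=0):
    """Return a presence matrix (rows = species, columns = genes) in which
    about target_fraction of the gene-by-species cells are marked missing.

    Assumption: cells are removed uniformly at random; the sampling scheme
    of the original study is not given in the abstract.
    """
    if not 0.0 <= target_fraction < 1.0:
        raise ValueError("target_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Start from a complete matrix: every gene is present for every species.
    present = [[True] * n_genes for _ in range(n_species)]
    cells = [(s, g) for s in range(n_species) for g in range(n_genes)]
    rng.shuffle(cells)
    n_remove = round(target_fraction * len(cells))
    for s, g in cells[:n_remove]:
        present[s][g] = False
    return present


def missing_fraction(present):
    """Fraction of gene-by-species cells that are marked missing."""
    total = sum(len(row) for row in present)
    missing = sum(1 for row in present for cell in row if not cell)
    return missing / total


if __name__ == "__main__":
    # Progressively sparser versions of the same 39 x 126 design.
    for level in (0.0, 0.2, 0.5, 0.8):
        matrix = introduce_missing(39, 126, level)
        print(level, round(missing_fraction(matrix), 3))
```

Uniform masking can leave a species with very few genes or a gene with very few species, so a real experiment would probably add constraints on minimum coverage per row and per column.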
Disciplines :
Genetics & genetic processes
Zoology
Biochemistry, biophysics & molecular biology
Author, co-author :
Roure, Béatrice;  Université de Montréal - UdeM > Département de Biochimie
Baurain, Denis  ;  Université de Liège - ULiège > Département de productions animales > GIGA-R : Génomique animale
Philippe, Hervé;  Université de Montréal - UdeM > Département de Biochimie
Language :
English
Title :
Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets
Publication date :
January 2013
Journal title :
Molecular Biology and Evolution
ISSN :
0737-4038
eISSN :
1537-1719
Publisher :
Oxford University Press, New York, United States
Volume :
30
Issue :
1
Pages :
197-214
Peer reviewed :
Peer Reviewed verified by ORBi
Available on ORBi :
since 14 November 2012

Statistics


Number of views
106 (6 by ULiège)
Number of downloads
1 (0 by ULiège)

Scopus citations®
246
Scopus citations® without self-citations
242
OpenCitations
245
