Humans; Haplotypes; Cluster Analysis; Polymorphism, Single Nucleotide; Computational Biology; 1000 Genome project; fine-level population structure; iterative clustering; SNP-based population clustering; Clusterings; Genome projects; Level populations; Level structure; Population structures; Medicine (all)
Abstract :
SNP-based information is used in several existing clustering methods to detect shared genetic ancestry or to identify population substructure. Here, we present a methodology, called IPCAPS, for unsupervised population analysis using iterative pruning. Our method, which can capture fine-level structure in populations, supports ordinal data and thus can readily be applied to SNP data. Although haplotypes may be more informative than SNPs, especially for detecting fine-level substructure, haplotype inference often remains too computationally intensive. In this work, we investigate the scale of structure we can detect in populations without knowledge of haplotypes; our simulations do not assume that haplotype information is available when comparing our method to existing tools for detecting fine-level population substructure. We demonstrate experimentally that IPCAPS can achieve high accuracy and can outperform existing tools in several simulated scenarios. The fine-level structure that IPCAPS detects in the 1000 Genomes Project data underlines the heterogeneity of its subjects.
Disciplines :
Life sciences: Multidisciplinary, general & others
Author, co-author :
Chaichoompu, Kridsadakorn ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Bioinformatique
Wilantho, Alisa; National Biobank Of Thailand, 111 Thailand Science Park, Pathum Thani, Thailand
Wangkumhang, Pongsakorn; National Biobank Of Thailand, 111 Thailand Science Park, Pathum Thani, Thailand
Tongsima, Sissades; National Biobank Of Thailand, 111 Thailand Science Park, Pathum Thani, Thailand
Cavadas, Bruno; Instituto De Investigação E Inovação Em Saúde, Universidade Do Porto, Porto, Portugal ; Instituto De Patologia E Imunologia Molecular Da Universidade Do Porto, Porto, Portugal
Pereira, Luísa; Instituto De Investigação E Inovação Em Saúde, Universidade Do Porto, Porto, Portugal ; Instituto De Patologia E Imunologia Molecular Da Universidade Do Porto, Porto, Portugal
Van Steen, Kristel ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Bioinformatique
Language :
English
Title :
Fine-scale subpopulation detection via an SNP-based unsupervised method: A case study on the 1000 Genomes Project resources.
Publication date :
2023
Journal title :
Pacific Symposium on Biocomputing
ISSN :
1793-5091
Publisher :
World Scientific, United States
Volume :
28
Issue :
2023
Pages :
245 - 256
Peer reviewed :
Peer Reviewed verified by ORBi
Funding text :
Supported by Fonds de la Recherche Scientifique (FNRS PDR T.0180.13); by National Science and Technology Development Agency research grant (P18507439); by European Regional Development Fund (COMPETE 2020 and Portugal 2020); and by Portuguese funds through Fundação para a Ciência e a Tecnologia (POCI-01-0145-FEDER-007274).