Poster (Scientific congresses and symposiums)
Iterative pruning method of unsupervised clustering for categorical data
Chaichoompu, Kridsadakorn; Tongsima, Sissades; Shaw, Philip James et al.
2016, the 13th International Congress of Human Genetics (ICHG 2016)
 

Files


Full Text
poster_ichg_23032016.pdf
Author preprint (5.6 MB)

Details



Abstract :
Single Nucleotide Polymorphisms (SNPs) are commonly used to identify population structure. Iterative pruning Principal Component Analysis (ipPCA) uses SNP profiles to assign individuals to subpopulations without making assumptions about ancestry. The strategy can be extrapolated to patient samples to identify molecular classes of patients. It is challenging to investigate the utility of substructure detection using profiles based on pre-defined genomic regions-of-interest rather than profiles based on SNPs. Using principles outlined in Fouladi, 2015, we can construct gene-based categorical variables representing different summary gene profiles in a region. These new gene-based constructs no longer have an equal number of unordered category levels. Here, we present C-PCA, an extension of ipPCA that performs iterative pruning for categorical variables using optimal scaling. It allows non-linear principal component analysis, which can handle possibly non-linearly related variables with different measurement levels. To show the power of C-PCA compared to ipPCA, we simulated 500 individuals and assigned them to two populations of equal size. We considered genetic distances between the populations with fixation index (FST) values from 0.001 to 0.006. For each dataset, we simulated 10,000 independent random SNPs for 100 replicates using the Balding–Nichols model. The SNPs were treated as numeric variables in ipPCA and as categorical variables in C-PCA. In conclusion, like ipPCA, we expect C-PCA to perform well in the presence of fine substructure. This paves the way to apply C-PCA to DNA-seq data and to input categorical variables derived from genomic regions-of-interest to which common and rare variants are mapped. We foresee additional advantages of C-PCA in this context, since region-based categorical variables are likely to be non-linearly associated against the background of underlying gene-gene interaction networks. C-PCA is implemented in R.
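The simulation design described in the abstract can be illustrated with a minimal sketch of the Balding–Nichols model. This is not the authors' R implementation. The function names `simulate_balding_nichols` and `as_categorical`, the Uniform(0.1, 0.9) distribution for ancestral allele frequencies, and the genotype labels "AA"/"AB"/"BB" are all assumptions made for illustration. Under the model, each population's allele frequency at a SNP with ancestral frequency p is drawn from Beta(p(1-F)/F, (1-p)(1-F)/F), where F is the fixation index, and each genotype is the sum of two independent allele draws.

```python
import random


def simulate_balding_nichols(n_per_pop=250, n_snps=10_000, fst=0.005,
                             n_pops=2, seed=0):
    """Simulate SNP genotypes (0, 1 or 2 copies of one allele) for
    n_pops populations of equal size under the Balding-Nichols model.

    Returns (genotypes, labels): genotypes[i][j] is the genotype of
    individual i at SNP j, and labels[i] is that individual's population.
    """
    rng = random.Random(seed)
    scale = (1.0 - fst) / fst

    # Assumption: ancestral allele frequencies are Uniform(0.1, 0.9);
    # the abstract does not state which distribution was used.
    ancestral = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]

    # Population-specific allele frequencies, one Beta draw per SNP.
    pop_freqs = [
        [rng.betavariate(p * scale, (1.0 - p) * scale) for p in ancestral]
        for _ in range(n_pops)
    ]

    genotypes, labels = [], []
    for k in range(n_pops):
        for _ in range(n_per_pop):
            # Genotype = sum of two independent Bernoulli(f) allele draws.
            row = [(rng.random() < f) + (rng.random() < f)
                   for f in pop_freqs[k]]
            genotypes.append(row)
            labels.append(k)
    return genotypes, labels


def as_categorical(genotypes):
    """Recode numeric genotypes as unordered category labels
    (hypothetical labels; any three distinct levels would do)."""
    codes = ("AA", "AB", "BB")
    return [[codes[g] for g in row] for row in genotypes]


if __name__ == "__main__":
    # Small demo; the abstract's setting is n_per_pop=250, n_snps=10_000.
    geno, pops = simulate_balding_nichols(n_per_pop=5, n_snps=8, fst=0.005)
    print(geno[0], pops[0])
    print(as_categorical(geno)[0])
```

The numeric matrix corresponds to the input form described for ipPCA, and the recoded matrix to the categorical form described for C-PCA. Repeating the call with different seeds and `fst` values between 0.001 and 0.006 would mirror the replicate structure mentioned in the abstract.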
Disciplines :
Life sciences: Multidisciplinary, general & others
Author, co-author :
Chaichoompu, Kridsadakorn; Université de Liège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Bioinformatique
Tongsima, Sissades; National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand > Genome Technology Research Unit > Biostatistics and Bioinformatics Laboratory
Shaw, Philip James; National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand > Medical Molecular Biology Research Unit > Protein-Ligand Engineering and Molecular Biology Laboratory
Sakuntabhai, Anavaj; Institut Pasteur, France > Functional Genetics of Infectious Diseases Unit
Van Steen, Kristel; Université de Liège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Bioinformatique
Language :
English
Title :
Iterative pruning method of unsupervised clustering for categorical data
Publication date :
03 April 2016
Event name :
the 13th International Congress of Human Genetics (ICHG 2016)
Event place :
Kyoto, Japan
Event date :
3-7 April 2016
Audience :
International
Name of the research project :
Foresting in Integromics Inference
Funders :
F.R.S.-FNRS - Fonds de la Recherche Scientifique [BE]
Available on ORBi :
since 20 July 2016

