References of "Becker, Julien"
Protein Structural Annotation: Multi-Task Learning and Feature Selection
Becker, Julien ULg

Doctoral thesis (2014)

Experimentally determining the three-dimensional structure of a protein is a slow and expensive process. Nowadays, supervised machine learning techniques are widely used to predict protein structures, and in particular to predict surrogate annotations, which are much less complex than 3D structures. This dissertation presents, on the one hand, methodological contributions for learning multiple tasks simultaneously and for selecting relevant feature representations and, on the other hand, biological contributions derived from applying these techniques to several protein annotation problems. Our first methodological contribution introduces a multi-task formulation for learning various protein structural annotation tasks. Unlike the traditional methods proposed in the bioinformatics literature, which mostly treat these tasks independently, our framework exploits the natural idea that multiple related prediction tasks should be addressed simultaneously. Our empirical experiments on a set of five sequence labeling tasks clearly highlight the benefit of our multi-task approach over single-task approaches in terms of correctly predicted labels. Our second methodological contribution focuses on the best way to identify a minimal subset of feature functions, i.e., functions that encode properties of complex objects, such as sequences or graphs, into appropriate forms (typically, vectors of features) for learning algorithms. Our empirical experiments on disulfide connectivity pattern prediction and disordered regions prediction show that using carefully selected feature functions combined with ensembles of extremely randomized trees leads to very accurate models. Our biological contributions derive mainly from the results obtained by applying our feature function selection algorithm to the problems of predicting disulfide connectivity patterns and of predicting disordered regions.
In both cases, our approach identified a relevant representation of the data that should play a role in the prediction of disulfide bonds (respectively, disordered regions) and, consequently, in protein structure-function relationships. For example, the major biological contribution made by our method is the discovery of a novel feature function, which has, to the best of our knowledge, never been highlighted in the context of predicting disordered regions. These representations were carefully assessed against several baselines such as the 10th Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition.

Peer Reviewed
On the Encoding of Proteins for Disordered Regions Prediction
Becker, Julien ULg; Maes, Francis; Wehenkel, Louis ULg

in PLoS ONE (2013)

Disordered regions, i.e., regions of proteins that do not adopt a stable three-dimensional structure, have been shown to play various and critical roles in many biological processes. Predicting and understanding their formation is therefore a key sub-problem of protein structure and function inference. A wide range of machine learning approaches have been developed to automatically predict disordered regions of proteins. One key factor in the success of these methods is the way in which protein information is encoded into features. Recently, we proposed a systematic methodology to study the relevance of various feature encodings in the context of disulfide connectivity pattern prediction. In the present paper, we adapt this methodology to the problem of predicting disordered regions and assess it on proteins from the 10th CASP competition, as well as on a very large subset of proteins extracted from PDB. Our results, obtained with ensembles of extremely randomized trees, highlight a novel feature function encoding the proximity of residues according to their accessibility to the solvent, which plays the second most important role in the prediction of disordered regions, just after evolutionary information. Furthermore, even though our approach treats each residue independently, our results are very competitive in terms of accuracy with respect to the state of the art. A web application is available at http://m24.giga.ulg.ac.be:81/x3Disorder.

Peer Reviewed
On the Relevance of Sophisticated Structural Annotations for Disulfide Connectivity Pattern Prediction
Becker, Julien ULg; Maes, Francis; Wehenkel, Louis ULg

in PLoS ONE (2013), 8(2), 56621

Disulfide bridges strongly constrain the native structure of many proteins, and predicting their formation is therefore a key sub-problem of protein structure and function inference. Most recently proposed approaches for this prediction problem adopt the following pipeline: first, they enrich the primary sequence with structural annotations; second, they apply a binary classifier to each candidate pair of cysteines to predict disulfide bonding probabilities; and finally, they use a maximum weight graph matching algorithm to derive the predicted disulfide connectivity pattern of a protein. In this paper, we adopt this three-step pipeline and propose an extensive study of the relevance of various structural annotations and feature encodings. In particular, we consider five kinds of structural annotations, three of which are novel in the context of disulfide bridge prediction. To be usable by machine learning algorithms, these annotations must be encoded into features. For this purpose, we propose four different feature encodings based on local windows and on different kinds of histograms. The combination of structural annotations with these possible encodings leads to a large number of possible feature functions. In order to identify a minimal subset of relevant feature functions among those, we propose an efficient and interpretable feature function selection scheme, designed to avoid any form of overfitting. We apply this scheme on top of three supervised learning algorithms: k-nearest neighbors, support vector machines and extremely randomized trees. Our results indicate that using only the PSSM (position-specific scoring matrix) together with the CSP (cysteine separation profile) is sufficient to construct a high-performance disulfide pattern predictor, and that extremely randomized trees reach a disulfide pattern prediction accuracy on the benchmark dataset SPX+ that corresponds to a +3.2% improvement over the state of the art.
A web application is available at http://m24.giga.ulg.ac.be:81/x3CysBridges.
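
The last step of the pipeline described in this abstract can be illustrated with a minimal sketch. It assumes every cysteine is bonded and that a classifier has already produced a bonding probability for each candidate pair. The function name, the cysteine positions and the probabilities are made up for illustration and do not come from the paper. For brevity the sketch uses exhaustive search over all pairings, not the maximum weight graph matching algorithm used in the pipeline; it returns the same optimum but scales exponentially.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def best_pattern(cysteines: List[int],
                 prob: Dict[Pair, float]) -> Tuple[float, List[Pair]]:
    """Return (total weight, bonds) of the best complete pairing.

    Assumption: all cysteines are bonded, so their number must be even.
    `prob` maps each ordered pair (i, j) with i before j in `cysteines`
    to a hypothetical bonding probability.
    """
    if len(cysteines) % 2:
        raise ValueError("expected an even number of bonded cysteines")
    if not cysteines:
        return 0.0, []
    first, rest = cysteines[0], cysteines[1:]
    best_score, best_bonds = float("-inf"), []
    # Try every partner for the first cysteine, then solve the rest.
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        score, bonds = best_pattern(remaining, prob)
        score += prob[(first, partner)]
        if score > best_score:
            best_score, best_bonds = score, [(first, partner)] + bonds
    return best_score, best_bonds


# Illustrative input: four cysteine positions and made-up probabilities.
positions = [5, 12, 30, 41]
probabilities = {
    (5, 12): 0.9, (5, 30): 0.2, (5, 41): 0.1,
    (12, 30): 0.3, (12, 41): 0.4, (30, 41): 0.8,
}
score, bonds = best_pattern(positions, probabilities)
print(bonds)  # [(5, 12), (30, 41)]
```

With more than a handful of cysteines, a polynomial-time matching routine from a graph library would be the more practical choice.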

Peer Reviewed
Prédiction structurée multitâche itérative de propriétés structurelles de protéines [Iterative multi-task structured prediction of structural properties of proteins]
Maes, Francis ULg; Becker, Julien ULg; Wehenkel, Louis ULg

in 7e Plateforme AFIA: Association Française pour l'Intelligence Artificielle (2011)

The development of computational tools for predicting structural information about proteins from their amino acid sequence is one of the major challenges of bioinformatics. Problems such as the prediction of secondary structure, of solvent accessibility, or of disordered regions can be expressed as prediction problems with structured outputs and are traditionally solved individually by existing machine learning methods. Since these problems are strongly related to one another, we propose to address them jointly through a multi-task learning approach. To this end, we introduce a new generic learning framework for multi-task structured prediction. We apply this strategy to solve a set of five tasks of predicting structural properties of proteins. Our experimental results on two datasets show that the proposed strategy is significantly better than approaches that treat the tasks individually.

Peer Reviewed
Iterative multi-task sequence labeling for predicting structural properties of proteins
Maes, Francis ULg; Becker, Julien ULg; Wehenkel, Louis ULg

in ESANN 2011 (2011)

Developing computational tools for predicting protein structural information from the amino acid sequence is of primary importance in protein science. Problems such as the prediction of secondary structures, of solvent accessibility, or of disordered regions can be expressed as sequence labeling problems and could be solved independently by existing machine-learning-based sequence labeling approaches. However, since these problems are closely related, we propose instead to address them jointly in a multi-task approach. To this end, we introduce a new generic framework for iterative multi-task sequence labeling. We apply this conceptually simple but quite effective strategy to jointly solve a set of five protein annotation tasks. Our empirical results with two protein datasets show that the proposed strategy significantly outperforms the single-task approaches.
