Inference of CMGC Kinase Interaction Network Topology

Title: Inference of CMGC Kinase Interaction Network Topology
Student: Ke Li
Type: MSc
Completion Date: 2013-06-11
Abstract

Protein complexes are the basic functional modules that carry out a variety of fundamental cellular functions. Identifying them is of central importance in current biological research, both for interpreting the information encoded in genomes and for understanding many fundamental cellular processes. The CMGC kinase group, which consists of 9 subfamilies, plays critical roles in cell signaling, cell cycle regulation, metabolic and splicing control, and other processes. Within the CMGC kinase group, some subfamilies such as the MAPKs and CDKs are among the most intensively studied protein groups, whereas others such as the HIPKs and RCKs remain poorly understood. The first global proteomic analysis of human CMGC kinase complexes, carried out by Matthias Gstaiger, provides valuable information on many poorly studied CMGC kinases, including 652 high-confidence kinase-protein interactions identified from AP-MS experiments with the help of computational tools. Because of the limitations of AP-MS experiments and of current computational methods, these 652 interactions are not necessarily actual physical interactions, and the interactions alone offer no clear way to distinguish the different complexes formed by the same kinase. These issues cannot be resolved using AP-MS experimental data alone. This project was therefore launched to address them.

To identify physical interactions and possible protein complexes formed by CMGC kinases, we combined the AP-MS data with information from the PrePPI database, which is mainly a structure-based protein interaction database, and then applied machine learning techniques to the integrated protein interaction data. Machine learning, a rapidly growing interdisciplinary field drawing on computer science and statistics with extensive applications in scientific research and engineering, focuses on making predictions for new observations based on properties (also known as features or attributes) learned from training data. A list of features was first generated from various bioinformatic sources for all proteins present in the 652 identified interactions, and a feature selection procedure then determined which features to include in the model training step. Three classes of machine learning models were trained and tested individually and then compared: logistic regression, random forests (RF) and support vector machines (SVM). An SVM with a radial basis function (RBF) kernel was finally chosen and applied to predict protein-protein interactions.

After the modeling and prediction step, an overall interaction network was constructed from the 652 high-confidence kinase-protein interactions identified by Matthias Gstaiger, the PrePPI interactions and our predicted interactions. This integrated global network was first studied manually, which yielded several interesting results. Because the large scale of the network makes further manual inspection impractical, a graph-density-based protein complex detection algorithm was proposed and applied to the network. It was found that a complicated network centered on one kinase can be decomposed into several dense subnetworks, and that the size and interaction intensity of these subnetworks depend largely on the choice of graph density and degree thresholds.
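
The role of the density and degree thresholds can be illustrated with a minimal sketch. It is not the algorithm developed in the thesis, whose details are not given in this abstract. It assumes one simple peeling strategy: split a kinase's interaction partners into connected groups, then drop the least-connected partner from each group until the group (together with the kinase) reaches a minimum graph density, 2|E| / (|V|(|V| - 1)), and a minimum degree. The function names, the default thresholds and the toy network with placeholder protein names (K, A to F) are all hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(nodes, adj):
    """Graph density 2|E| / (|V|(|V|-1)) of the subgraph induced by `nodes`."""
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(1 for u, v in combinations(nodes, 2) if v in adj[u])
    return 2.0 * edges / (n * (n - 1))


def components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    remaining, comps = set(nodes), []
    while remaining:
        start = remaining.pop()
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u] & remaining:  # new set, safe to shrink `remaining`
                remaining.discard(v)
                comp.add(v)
                stack.append(v)
        comps.append(comp)
    return comps


def dense_subnetworks(kinase, adj, min_density=0.7, min_degree=2):
    """Split the partners of `kinase` into dense subnetworks (assumed strategy).

    Each connected group of partners is peeled, removing the lowest-degree
    partner first, until the group plus the kinase meets both thresholds.
    Groups that shrink to a single partner are discarded.
    """
    found = []
    for comp in components(adj[kinase], adj):
        sub = comp | {kinase}
        while len(sub) > 2:
            deg = {u: len(adj[u] & sub) for u in sub if u != kinase}
            weakest = min(sorted(deg), key=deg.get)
            if density(sub, adj) >= min_density and deg[weakest] >= min_degree:
                found.append(sub)
                break
            sub.remove(weakest)
    return found


# Hypothetical toy network: kinase K with partners A-F (placeholder names).
EDGES = [("K", "A"), ("K", "B"), ("K", "C"), ("K", "D"), ("K", "E"), ("K", "F"),
         ("A", "B"), ("A", "C"), ("B", "C"), ("D", "E"), ("C", "F")]
ADJ = build_adjacency(EDGES)

if __name__ == "__main__":
    for d, k in [(0.75, 2), (0.9, 2), (0.9, 3)]:
        subs = dense_subnetworks("K", ADJ, min_density=d, min_degree=k)
        print(d, k, sorted(sorted(s) for s in subs))
```

In this toy network, raising the density threshold removes the loosely attached partner F from the first subnetwork, and raising the degree threshold discards the small K, D, E group entirely, which mirrors the threshold dependence described above.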

In conclusion, we used a variety of bioinformatic resources and machine learning techniques to overcome the limitations of AP-MS experimental data and to infer the network topology of CMGC kinases. The results suggest that combining AP-MS experimental data with protein interaction data from other bioinformatic resources, and using machine learning methods for prediction, can help to construct a refined and more realistic protein interaction network topology. The complex detection algorithm designed in this project can serve as a computational aid for analyzing large-scale, complicated networks.