Mesh PPI classifier
About

The worldwide surge of multiresistant microbial strains has propelled the search for alternative treatment options. A key aspect of this task is understanding the mechanisms by which specific pathogens colonize, survive and replicate within the host, which can be achieved through the study of protein-protein interactions. Despite advances in laboratory techniques, computational models based on protein sequence remain attractive because they allow the screening of protein interactions between entire proteomes in a fast and inexpensive manner. These models are especially valuable given the recent advances in sequencing metagenomic organisms, where only the protein sequence is available.
Here, we present an improved supervised machine learning model for the prediction of protein interactions based on the protein primary structure (its amino acid sequence). We propose the discrete cosine transform as an efficient way of representing protein sequences, using categories derived from the physicochemical properties of amino acids.
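As a rough illustration of the representation described above, the sketch below encodes a sequence as a numeric signal of physicochemical class indices and keeps the first few coefficients of its discrete cosine transform (DCT-II) as a fixed-length descriptor. The seven-class amino acid grouping, the function names (`encode`, `dct2`, `dct_features`, `pair_features`) and the number of retained coefficients are assumptions made for this example; the scripts in this repository may use different categories and settings.

```python
import math

# Assumed grouping of the 20 amino acids into 7 physicochemical classes
# (the grouping commonly used with the conjoint triad method). This README
# does not state which categories the authors actually use.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx + 1 for idx, group in enumerate(CLASSES) for aa in group}


def encode(sequence):
    """Map each residue to its class index (1..7); unknown residues become 0."""
    return [AA_TO_CLASS.get(aa, 0) for aa in sequence.upper()]


def dct2(signal):
    """Unnormalised DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n_samples = len(signal)
    return [
        sum(x * math.cos(math.pi / n_samples * (n + 0.5) * k)
            for n, x in enumerate(signal))
        for k in range(n_samples)
    ]


def dct_features(sequence, n_coeffs=8):
    """Fixed-length descriptor: the first n_coeffs DCT coefficients, zero-padded."""
    coeffs = dct2(encode(sequence))[:n_coeffs]
    return coeffs + [0.0] * (n_coeffs - len(coeffs))


def pair_features(seq_a, seq_b, n_coeffs=8):
    """Descriptor for a protein pair: both DCT descriptors concatenated."""
    return dct_features(seq_a, n_coeffs) + dct_features(seq_b, n_coeffs)
```

Because only the leading coefficients are kept, proteins of different lengths map to vectors of the same size, which is what a standard classifier expects as input.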
For the classification task we use a mesh of hyper-specialised classifiers dedicated to the most relevant pairs of Gene Ontology molecular function annotations.
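The routing idea behind such a mesh can be sketched as follows: training examples are grouped by their pair of Gene Ontology molecular function terms, one classifier is trained per sufficiently populated pair, and each prediction is dispatched to the matching classifier. Everything named here is hypothetical: `MeshClassifier`, the `min_examples` cut-off, the general fallback model for uncovered pairs, and `MajorityClassifier`, which is only a dependency-free stand-in for the real model (any object with `fit` and `predict`, such as an SVM with an RBF kernel, could be passed as `make_model`).

```python
from collections import defaultdict


class MajorityClassifier:
    """Stand-in model (hypothetical): predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class MeshClassifier:
    """Routes each protein pair to a classifier dedicated to its GO-term pair."""

    def __init__(self, make_model=MajorityClassifier, min_examples=2):
        self.make_model = make_model
        self.min_examples = min_examples  # assumed cut-off for a "relevant" pair
        self.specialists = {}
        self.fallback = None

    @staticmethod
    def _key(go_a, go_b):
        # Interactions are unordered, so (a, b) and (b, a) share one classifier.
        return tuple(sorted((go_a, go_b)))

    def fit(self, X, y, go_pairs):
        groups = defaultdict(lambda: ([], []))
        for features, label, (go_a, go_b) in zip(X, y, go_pairs):
            xs, ys = groups[self._key(go_a, go_b)]
            xs.append(features)
            ys.append(label)
        self.specialists = {
            key: self.make_model().fit(xs, ys)
            for key, (xs, ys) in groups.items()
            if len(ys) >= self.min_examples
        }
        # Assumption: a general model covers GO pairs without a specialist.
        self.fallback = self.make_model().fit(list(X), list(y))
        return self

    def predict(self, X, go_pairs):
        return [
            self.specialists.get(self._key(*pair), self.fallback).predict([features])[0]
            for features, pair in zip(X, go_pairs)
        ]
```

A call such as `MeshClassifier().fit(X, y, go_pairs).predict(X_new, go_pairs_new)` takes one feature vector, one label and one `(GO term, GO term)` tuple per protein pair.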
Based on an exhaustive evaluation that includes datasets with different configurations, cross-validation and out-of-sample validation, the obtained results outperform the state of the art for sequence-based methods. For the final mesh model using an SVM with an RBF kernel, a consistent average AUC of 0.84 was attained.
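For readers unfamiliar with the metric, the AUC reported above can be read as the probability that a randomly chosen interacting pair receives a higher score than a randomly chosen non-interacting pair. The sketch below computes it directly from that definition and averages it over validation folds. The function names `roc_auc` and `mean_auc` are chosen for this example and are not taken from the repository's scripts.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def mean_auc(folds):
    """Average AUC over (labels, scores) pairs, one pair per validation fold."""
    aucs = [roc_auc(labels, scores) for labels, scores in folds]
    return sum(aucs) / len(aucs)
```

This pairwise form is quadratic in the number of examples, which is fine for illustration; a rank-based implementation is preferable for datasets of the sizes listed below.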

Available data

-datasets: datasets used for testing

  -d1: dataset of 6702 protein interactions (50% negative, 50% positive)
    -dataset1_RNA.fasta: mRNA sequences for the proteins of the dataset
    -negatome_3351.txt: negative interactions (examples of proteins that do not interact)
    -positive_3351.txt: positive interactions

  -d2: dataset of 10000 interactions (50% negative, 50% positive)
    -d2.fasta: mRNA sequences for the proteins of the dataset
    -negatome_verified_random.txt: negative interactions (examples of proteins that do not interact)
    -shuffled-yeast-positive-10k.txt: positive interactions

  -d3: dataset of 20000 interactions (50% negative, 50% positive)
    -20k_negative_random:
    -protein_sequences.fasta: negative interactions (examples of proteins that do not interact)
    -shuffled-yeast-positive-10k.txt: positive interactions
    -yeast_ppi_orig.txt: original protein-protein interactions from yeast

-dct_d2_rbf: script used for studying d2 with the DCT RBF method
-dct_random_d3: script used for the DCT method with dataset 3
-dct_rbf_parameters: script used for studying RBF parameters for DCT
-dct_rbf_parameters: script used for studying RBF execution time

-original_code: original script
-guo_d3: script used for studying the Guo method with dataset 3

-shen_d2: script used for studying the Shen method with dataset 2
-shen_d3: script used for studying the Shen method with dataset 3

-shen time: script used for studying the execution time of the Shen method
-guo time: script used for studying the execution time of the Guo method

Download