Structuring the protein sequence space
Many biological and evolutionary questions are related to the structure of the protein sequence space and its partitioning into substructures represented by the all-against-all similarity relations. We are developing database resources providing pre-calculated sequence similarities and features for all known proteins as well as new methods to derive biologically and evolutionarily relevant clusters.
Pairwise similarity comparison of every protein against the set of all known proteins is an indispensable step in any annotation process. However, individual searches for homologs do not allow structuring the sequence universe. SIMAP, the Similarity Matrix of Proteins, currently contains a matrix of all-against-all comparisons of more than 30 million proteins from more than 10,000 organisms whose genomes have been completely sequenced. The similarities were computed by exhaustive sequence similarity searches using the Smith-Waterman algorithm. SIMAP data is used by other important bioinformatics databases and resources, such as eggNOG and STRING.
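To make the idea of an all-against-all similarity matrix concrete, the following is a minimal sketch of Smith-Waterman local alignment scoring applied to every pair in a small set of sequences. It is an illustration only, not the SIMAP pipeline: the function names, the simple match/mismatch scores and the linear gap penalty are assumptions chosen for brevity, whereas production searches typically use amino acid substitution matrices and affine gap penalties.

```python
from itertools import combinations


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of sequences a and b.

    Assumed toy scoring: fixed match/mismatch values and a linear gap
    penalty. Only the score is computed; no traceback is performed.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def all_against_all(seqs):
    """Score every unordered pair of sequences in a {id: sequence} dict."""
    return {
        (x, y): smith_waterman_score(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from any database.
    toy = {"p1": "MKTAYIAK", "p2": "MKTAYLAK", "p3": "GGSGGS"}
    for pair, score in all_against_all(toy).items():
        print(pair, score)
```

The number of pairs grows quadratically with the number of sequences, which is why pre-calculating and storing the matrix is valuable at the scale of tens of millions of proteins.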
Currently, we are developing novel clustering methods that rely on SIMAP similarities, SIMAP features, and taxonomic and genomic information to derive a new generation of biologically relevant protein clusters that reflect the evolutionary relationships within the protein sequence space.
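For orientation, a very simple baseline for partitioning proteins from pairwise similarities is single-linkage clustering: link every pair whose score reaches a threshold and take the connected components. The sketch below is not the method under development and uses no feature, taxonomic or genomic information; the function name, the input layout and the threshold are assumptions made for illustration.

```python
def cluster_by_threshold(ids, scores, threshold):
    """Group ids into connected components of the thresholded similarity graph.

    ids       -- iterable of protein identifiers
    scores    -- {(id1, id2): similarity score} for unordered pairs (assumed layout)
    threshold -- pairs scoring at least this value are linked
    """
    parent = {p: p for p in ids}

    def find(p):
        # Follow parent links to the root, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (x, y), score in scores.items():
        if score >= threshold:
            parent[find(x)] = find(y)  # merge the two components

    clusters = {}
    for p in parent:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Hypothetical scores for four hypothetical proteins.
    toy_scores = {("p1", "p2"): 50, ("p2", "p3"): 45, ("p1", "p3"): 10, ("p3", "p4"): 5}
    print(cluster_by_threshold(["p1", "p2", "p3", "p4"], toy_scores, threshold=40))
```

Single linkage tends to chain loosely related proteins into one large cluster, which is one reason a score threshold alone is usually not enough to obtain evolutionarily meaningful groups.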