SIMAP - The similarity matrix of proteins

Overview

The similarity matrix of proteins (SIMAP) is a database of protein sequences, their all-against-all sequence similarities, and functional annotations. The database is currently being re-implemented, based on a different algorithm for sequence similarity calculation. SIMAP 1 refers to the traditional database, which ran from 2004 until 2014. SIMAP 2 is the new SIMAP database, which started in 2015.

SIMAP 1

The previous release of SIMAP was based on the FASTA algorithm for sequence similarity calculation. Sequence similarities were calculated in the public resource computing project BOINCSIMAP. Thanks to the thousands of volunteers, up to 30 TeraFLOPs were available for database maintenance. We thank all users of the BOINCSIMAP project for their tremendous and long-lasting support. Without this project it would have been impossible to keep SIMAP up-to-date!

The last major update of SIMAP 1 was in August 2014, when the then-current versions of the UniProt Knowledgebase and NCBI RefSeq were imported into SIMAP. The sequence similarities of the approximately 45 million non-redundant sequences in SIMAP 1 are still available via the SIMAP webserver:

http://liferay.csb.univie.ac.at/portal/web/simap

However, no further updates of SIMAP 1 are scheduled. SIMAP 1 will end its life cycle by the end of 2015 and will then be fully replaced by SIMAP 2.

SIMAP 2

The new SIMAP database will better support the main use case of SIMAP 1, which was comparative genomics. SIMAP 2 is therefore limited to proteins from completely sequenced genomes. Multiple levels of redundancy reduction will keep the SIMAP 2 approach scalable in the future as well, when millions of complete genomes are expected in the public databases.

Sequence similarities are calculated with the Smith-Waterman algorithm, combined with composition-based score adjustment as in BLAST (see the attached PDF file for details on the algorithm and hardware acceleration). By March 2015, similarities of the proteins from the current eggNOG and STRING releases will be re-calculated in SIMAP 2. Thereafter, SIMAP 2 will be continuously extended to cover the proteins of all completely sequenced genomes.
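
For readers unfamiliar with Smith-Waterman, the following is a minimal sketch of the local alignment score it computes. It is an illustration only and not the SIMAP 2 implementation: the function name smith_waterman_score, the simple match/mismatch scores and the linear gap penalty are assumptions made for brevity. A production protein search would use a substitution matrix with affine gap penalties, and the composition-based score adjustment and hardware acceleration mentioned above are not shown here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring only (assumed values): fixed match/mismatch
    scores and a linear gap penalty, not a protein substitution matrix.
    """
    # Only the previous row of the dynamic programming matrix is needed
    # when we want the score and not the alignment itself.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev[j - 1] + pair   # align a[i-1] with b[j-1]
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            # Local alignment: a cell never drops below zero, so an
            # alignment can start anywhere in either sequence.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACACACTA", "AGCACACA"))
```

Taking the maximum with zero in every cell is what distinguishes local from global alignment: the score reflects the best-matching pair of subsequences, which suits proteins that share only a domain.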

Public access to SIMAP 2 is not yet available. We are happy to share all SIMAP 2 data; please contact us if you are interested.