CLuster ANalysis of Sequences
Tancred Frickey and Andrei Lupas
Max Planck Institut fuer Entwicklungsbiologie
Spemannstr. 35; 72076 Tuebingen, Germany
Frickey T., Lupas A.N. (2004) CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics 20:3702-3704
Frequently, homology is used as a reason to transfer knowledge about function or structure from known to unknown proteins. Although phylogenies are the method of choice when attempting to determine homology, the most frequently used marker is pairwise sequence similarity. Similarity search programs, such as BLAST or PSI-BLAST (Altschul et al. 1997), can efficiently work with enormous data sets while phylogenetic inference and the prerequisite sequence alignments rapidly reach a point where they become unsuitable due to prohibitive calculation costs and loss of resolution. On the other hand, pairwise similarity searches are plagued by false positive matches and problems arising from amino acid composition bias causing, in many cases, the best BLAST hits not to be the closest sequence relatives (Koski & Golding 2001).
Aiming for the best of both worlds, we have implemented a version of the Fruchterman-Reingold graph-layout algorithm (1991). The program takes unaligned FASTA format sequences as input, performs all-against-all BLAST searches and displays the pairwise similarities as 2D or 3D graphs. In contrast to phylogenetic inference methods, this approach uses unaligned sequences and works better the more sequences are provided, because a larger number of pairwise similarities better averages out the chance hits that plague standard BLAST comparisons.
Although the application is meant for use with protein sequences, any kind of pairwise similarity data can be displayed (see below: HHSearch results & BLOSUM62).
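The layout idea described above can be illustrated with a minimal force-directed sketch in the spirit of Fruchterman and Reingold (1991). It is not the CLANS implementation: the function name `layout`, the edge weights in (0, 1], the cooling schedule and all constants are assumptions chosen for illustration. Every pair of vertices repels, while pairs joined by a similarity edge attract in proportion to their weight.

```python
import math
import random


def layout(n, edges, iterations=200, seed=0):
    """Return 2D positions for n vertices (illustrative sketch only).

    edges maps (i, j) index pairs to an attraction weight in (0, 1];
    how similarities are turned into weights is left to the caller.
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    k = 1.0 / math.sqrt(n)   # ideal vertex spacing
    temperature = 0.1        # caps the step length; lowered every iteration

    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n)]

        # Repulsion between every pair of vertices, magnitude k^2 / d.
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[i][0] += dx / d * f
                disp[i][1] += dy / d * f
                disp[j][0] -= dx / d * f
                disp[j][1] -= dy / d * f

        # Attraction along similarity edges, magnitude w * d^2 / k.
        for (i, j), w in edges.items():
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            d = math.hypot(dx, dy) or 1e-9
            f = w * d * d / k
            disp[i][0] -= dx / d * f
            disp[i][1] -= dy / d * f
            disp[j][0] += dx / d * f
            disp[j][1] += dy / d * f

        # Move each vertex along its net force, limited by the temperature.
        for i in range(n):
            d = math.hypot(disp[i][0], disp[i][1]) or 1e-9
            step = min(d, temperature)
            pos[i][0] += disp[i][0] / d * step
            pos[i][1] += disp[i][1] / d * step

        temperature *= 0.98

    return pos
```

Calling `layout(6, edges)` with two fully connected triples and no edges between them should place each triple in its own group, which is the clustering effect the graphs on this page rely on. The all-pairs repulsion loop is quadratic in the number of sequences, so a sketch like this is only practical for small inputs.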
Running the program:
To run CLANS you need Java 1.4 or later installed. For full functionality you will also need the BLAST, PSI-BLAST and formatdb executables from NCBI. For command-line parameters and basic help, please refer to the README file.
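Before starting, it can help to check that the required programs are reachable on the search path. The snippet below is a small sketch, not part of CLANS; the executable names `blastall`, `blastpgp` and `formatdb` are assumed from the legacy NCBI BLAST distribution and may differ on your system.

```python
import shutil

# Assumed executable names (legacy NCBI BLAST distribution); adjust as needed.
REQUIRED = ("java", "blastall", "blastpgp", "formatdb")


def find_tools(names=REQUIRED):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED):
    """Return the names of the tools that could not be found."""
    return [name for name, path in find_tools(names).items() if path is None]
```

An empty list from `missing_tools()` means all four programs were found.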
Graphs calculated using FASTA format sequences as input:
Graph layout for 5101 sequences identified as putative AAA-ATPases by PSI-BLAST and Hidden Markov Model (HMM) searches. The set consists largely of AAA+-ATPases (a superfamily of AAA-ATPases). ABC-transporters, a known outgroup to AAA+-ATPases, form a separate cluster (bottom-left). BLAST hits are displayed using a color gradient from red (good) to pale blue (less good). Edges with P-values worse than 10^-10 are not shown.
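The P-value cutoff and color gradient described in this caption can be sketched as follows. This is an illustration, not the code used to draw the figure: the function names, the RGB values chosen for "pale blue" and "red", the interpolation on a -log10 scale and the `best` value at which the gradient saturates are all assumptions.

```python
import math


def visible_edges(pvalues, cutoff=1e-10):
    """Keep only the pairs whose P-value is at least as good as the cutoff.

    pvalues maps (i, j) pairs to BLAST P-values; smaller is better.
    """
    return {pair: p for pair, p in pvalues.items() if p <= cutoff}


def edge_color(p, cutoff=1e-10, best=1e-100):
    """Return an (R, G, B) color between pale blue (at cutoff) and red (at best)."""
    p = max(p, 1e-300)  # avoid taking the logarithm of zero
    low = -math.log10(cutoff)
    high = -math.log10(best)
    t = (-math.log10(p) - low) / (high - low)
    t = min(1.0, max(0.0, t))  # clamp to the ends of the gradient
    pale_blue = (173, 216, 230)  # assumed RGB for "pale blue"
    red = (255, 0, 0)
    return tuple(round(a + t * (b - a)) for a, b in zip(pale_blue, red))
```

With these defaults, a hit at the cutoff is drawn in pale blue and a hit at or beyond the assumed `best` value is drawn in pure red.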
OUTER MEMBRANE PROTEINS
Graph layout for 14092 sequences obtained from recursive PSI-BLAST searches using outer membrane proteins as seeds. Many large clusters are visible, as well as many low-confidence hits (dots with no connections). Analysis in progress.
Graphs calculated from precomputed similarities:
Graph layout of amino acid similarities according to the BLOSUM62 substitution matrix (only edges with positive values are shown).
Standard input is a file of FASTA format sequences.
Or, as an alternative, a file containing precomputed similarities can be used as input:
sequences=3 #number of sequences
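For the standard FASTA input mentioned above, the following sketch shows how such a file can be read and how the list of all-against-all comparisons arises from it. It is an illustration only and not taken from CLANS; the function names `read_fasta` and `all_pairs` are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA text given as an iterable of lines.

    Returns a list of (name, sequence) tuples in file order.
    """
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def all_pairs(records):
    """List every unordered pair of sequence names (all-against-all)."""
    names = [name for name, _ in records]
    return [(names[i], names[j])
            for i in range(len(names))
            for j in range(i + 1, len(names))]
```

For a file on disk, `read_fasta(open(path))` works because a file object iterates over its lines. The number of pairs grows quadratically with the number of sequences, which is why the all-against-all BLAST step dominates the running time for large sets.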
Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402
Enright A.J., Ouzounis C.A. (2001) BioLayout - an automatic graph layout algorithm for similarity visualization. Bioinformatics 17:853-854
Fruchterman T.M., Reingold E.M. (1991) Graph drawing by force-directed placement. Softw. Pract. Exp. 21:1129-1164
Koski L.B., Golding G.B. (2001) The closest BLAST hit is often not the nearest neighbor. J. Mol. Evol. 52:540-542