CLANS

CLuster ANalysis of Sequences

Tancred Frickey and Andrei Lupas
Max Planck Institut fuer Entwicklungsbiologie
Spemannstr. 35; 72076 Tuebingen, Germany

Frickey T., Lupas A.N. (2004) CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics 20:3702-3704

Download

Abstract

Homology is frequently used to justify transferring knowledge about function or structure from characterized to uncharacterized proteins. Although phylogenies are the method of choice for determining homology, the most frequently used marker is pairwise sequence similarity. Similarity search programs such as BLAST or PSI-BLAST (Altschul et al. 1997) work efficiently with enormous data sets, whereas phylogenetic inference and the sequence alignments it requires rapidly become unsuitable because of prohibitive calculation costs and loss of resolution. On the other hand, pairwise similarity searches are plagued by false positive matches and by amino acid composition bias, so that in many cases the best BLAST hit is not the closest sequence relative (Koski & Golding 2001).

Aiming for the best of both worlds, we have implemented a version of the Fruchterman-Reingold graph-layout algorithm (Fruchterman & Reingold 1991). The program takes unaligned FASTA-format sequences as input, performs all-against-all BLAST searches and displays the pairwise similarities as 2D or 3D graphs. In contrast to phylogenetic inference methods, this approach uses unaligned sequences and works better the more sequences are provided, because a larger number of pairwise similarities better averages out the chance hits that plague standard BLAST comparisons.
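The idea of a force-directed layout driven by pairwise similarities can be illustrated with a minimal 2D sketch. This is not the CLANS implementation: the function names, the -log10 mapping from P-value to attraction weight, the cooling schedule and all constants are assumptions chosen for illustration only.

```python
import math
import random


def pvalue_to_weight(p, floor=1e-200):
    """Map a P-value to an attraction weight; smaller P-values attract more.

    The -log10 mapping is an assumption made for this sketch.
    """
    return -math.log10(max(p, floor))


def layout(n, pvalues, iterations=300, seed=0):
    """Place n vertices in 2D with Fruchterman-Reingold-style forces.

    pvalues maps vertex pairs (i, j) to P-values; unlisted pairs have no edge.
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    k = 1.0 / math.sqrt(n)  # preferred distance between vertices
    weights = {pair: pvalue_to_weight(p) for pair, p in pvalues.items()}
    top = max(weights.values(), default=1.0) or 1.0  # used to scale weights to [0, 1]
    temperature = 0.1  # upper bound on movement per iteration
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n)]
        # Every pair of vertices repels with force k^2 / d.
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[i][0] += dx / d * f
                disp[i][1] += dy / d * f
                disp[j][0] -= dx / d * f
                disp[j][1] -= dy / d * f
        # Connected vertices attract with force w * d^2 / k.
        for (i, j), w in weights.items():
            if i == j:
                continue
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            d = math.hypot(dx, dy) or 1e-9
            f = (w / top) * d * d / k
            disp[i][0] -= dx / d * f
            disp[i][1] -= dy / d * f
            disp[j][0] += dx / d * f
            disp[j][1] += dy / d * f
        # Move each vertex along its net force, capped by the temperature.
        for i in range(n):
            length = math.hypot(disp[i][0], disp[i][1]) or 1e-9
            step = min(length, temperature)
            pos[i][0] += disp[i][0] / length * step
            pos[i][1] += disp[i][1] / length * step
        temperature *= 0.98  # cool down so the layout settles
    return pos
```

With three vertices and P-values of 1e-3, 1e-60 and 1e-10, the pair connected by the 1e-60 edge should end up closest together, which is the behaviour the cluster maps rely on.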

Although the application is meant for use with protein sequences, any kind of pairwise similarity data can be displayed (see below: HHSearch results & BLOSUM62).

Running the program:
To run CLANS you need Java 1.4 or later installed. For full functionality you will also need the BLAST, PSI-BLAST and formatdb executables from NCBI. For command line parameters and basic help please refer to the README file.

Examples

Graphs calculated using FASTA format sequences as input:

AAA+ ATPASES

Graph layout for 5101 sequences identified as putative AAA-ATPases by PSI-BLAST and Hidden Markov Model (HMM) searches. The set consists largely of AAA⁺-ATPases (a superfamily of AAA-ATPases). ABC transporters, a known outgroup to AAA⁺-ATPases, form a separate cluster (bottom left). BLAST hits are displayed using a color gradient from red (strong) to pale blue (weak). Edges with P-values worse (larger) than 10^-10 are not shown.

OUTER MEMBRANE PROTEINS

Graph layout for 14092 sequences obtained by recursive PSI-BLAST searches using outer membrane proteins as seeds. Many large clusters are visible, as well as many low-confidence hits (dots with no connections). Analysis in progress.

OTHER DATA

Graphs calculated from precomputed similarities:
Graph layout of amino acid similarities according to the BLOSUM62 substitution matrix (only edges with positive values are shown).

Files

Standard Input:
Standard input is a file of FASTA-format sequences:

<pre>sequences=3 #number of sequences
<param>#optional
parameters used for graph layout
</param>
<rotmtx>#optional
current rotation of the graph
</rotmtx>
<seqs>
>sequence name1
sequence1
>sequence name2
sequence2
>sequence name3
sequence3
</seqs>
<pos>#X,Y and Z coordinates of vertices
0 -2.359319 -3.0919282 1.6470909
1 2.5697038 -3.258371 -1.4772072
2 -2.1152546 -3.3181837 -1.3644718
</pos>
<hsp>#P-values for edges between vertices (smaller value=better)(missing values=P-value of 1)
0 0:0
0 1:1e-3
0 2:1e-60
1 1:0
1 2:1e-10
2 0:1e-57
2 1:1e-8
2 2:0
</hsp></pre>
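A file in the layout shown above can be read with a few lines of code. The following is a minimal sketch, not part of CLANS: the function names `parse_clans` and `pvalue` are hypothetical, and the parser assumes that comments start with `#` at the beginning of a line or directly after an opening tag, as in the example. The `<param>` and `<rotmtx>` blocks are ignored.

```python
import re


def _block(text, tag):
    """Return the non-empty, non-comment lines between <tag> and </tag>."""
    match = re.search(r"<{0}>(.*?)</{0}>".format(tag), text, re.S)
    if match is None:
        return []
    lines = (line.strip() for line in match.group(1).splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def parse_clans(text):
    """Parse the header, <seqs>, <pos> and <hsp> blocks into Python structures."""
    header = re.search(r"sequences=(\d+)", text)
    names, sequences = [], []
    for line in _block(text, "seqs"):
        if line.startswith(">"):
            names.append(line[1:].strip())
            sequences.append("")
        elif sequences:
            sequences[-1] += line  # sequences may span several lines
    positions = {}
    for line in _block(text, "pos"):
        index, x, y, z = line.split()
        positions[int(index)] = (float(x), float(y), float(z))
    pvalues = {}
    for line in _block(text, "hsp"):
        source, rest = line.split(None, 1)  # e.g. "0 2:1e-60"
        target, value = rest.split(":")
        pvalues[(int(source), int(target))] = float(value)
    return {
        "count": int(header.group(1)) if header else len(names),
        "names": names,
        "sequences": sequences,
        "positions": positions,
        "pvalues": pvalues,
    }


def pvalue(parsed, i, j):
    """Look up an edge; pairs absent from <hsp> count as a P-value of 1."""
    return parsed["pvalues"].get((i, j), 1.0)
```

Note that the edge list is directional in the example (`0 2` and `2 0` carry different P-values), so the sketch keeps both entries rather than merging them.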

Or, as an alternative, a file containing precomputed similarities can be used as input.
Format:

<pre>sequences=3 #number_of_sequences
<seqs>
>vertex_name_1
>vertex_name_2
>vertex_name_3
</seqs>
<mtx> #positive and negative values possible (neg. values = additional repulsive interaction)
0 0.2 -0.4
0.1 0 0.1
-0.1 0 0
</mtx></pre>
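Precomputed similarities from any source can be written out in the matrix layout shown above. The sketch below is illustrative only: `format_matrix_file` is a hypothetical helper, it omits the optional comments, and whether CLANS accepts every variation of spacing or number formatting has not been verified here.

```python
def format_matrix_file(names, matrix):
    """Return text in the <seqs>/<mtx> layout for a square matrix of values.

    names  -- one label per vertex
    matrix -- list of rows; matrix[i][j] is the value for the pair (i, j),
              where negative values act as additional repulsion
    """
    n = len(names)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be %d x %d to match the vertex names" % (n, n))
    lines = ["sequences=%d" % n, "<seqs>"]
    lines.extend(">" + name for name in names)
    lines.append("</seqs>")
    lines.append("<mtx>")
    for row in matrix:
        # %g keeps short values such as 0.2 or -0.4 compact
        lines.append(" ".join("%g" % value for value in row))
    lines.append("</mtx>")
    return "\n".join(lines) + "\n"
```

Calling it with the three vertex names and the 3 x 3 matrix from the example reproduces the same rows, one matrix row per line.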

References

Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402

Enright A.J., Ouzounis C.A. (2001) BioLayout - an automatic graph layout algorithm for similarity visualization. Bioinformatics 17:853-854

Fruchterman T.M., Reingold E.M. (1991) Graph drawing by force-directed placement. Softw. Pract. Exp. 21:1129-1164

Koski L.B., Golding G.B. (2001) The closest BLAST hit is often not the nearest neighbor. J. Mol. Evol. 52:540-542