Introduction
CE is a method for calculating pairwise structure alignments. CE aligns two polypeptide chains using characteristics of their local geometry as defined by vectors between C alpha positions. Matches are termed aligned fragment pairs (AFPs). Heuristics are used to define a set of optimal paths joining AFPs, with gaps as needed. The path with the best RMSD is then subjected to dynamic programming to achieve an optimal alignment. For specific families of proteins, additional characteristics are used to weight the alignment. Complete details are described in the paper (PDF format). Databases of alignments for all polypeptide chains and for a representative set of proteins are available and kept current with the PDB.
Individual topics are arranged alphabetically.
Single character used to identify a polypeptide chain. Can be left blank if the chain is the first polypeptide chain in the protein.
Enter a valid email address, e.g. bourne@sdsc.edu. Running CE may take several hours depending on the number of other jobs awaiting processing. You will be notified by email when your job completes. That email will include a URL where you may point your Web browser to preview results. If you do not specify an email address, you must periodically revisit the URL provided when you submit the job to check for results.
Filter results based on the number of gaps, that is, the number of positions that have no recognized partner in the other protein being aligned and that are surrounded by aligned positions. The default is no limit on the number of gaps, although it is possible to limit the number of gaps to 10%, 20% or 30% of those residues that actually align.
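The gap definition above can be illustrated with a minimal sketch. It assumes the alignment is available as two equal-length gapped strings with '-' marking an unaligned position; that representation and the function name count_internal_gaps are illustrative assumptions, not part of CE.

```python
def count_internal_gaps(aln1, aln2):
    """Return (n_gaps, n_aligned) for two equal-length gapped strings.

    A column is "aligned" when neither string has '-' in it. A gap is an
    unaligned column lying between the first and last aligned columns,
    i.e. surrounded by aligned positions.
    """
    if len(aln1) != len(aln2):
        raise ValueError("gapped strings must have the same length")
    aligned = [a != "-" and b != "-" for a, b in zip(aln1, aln2)]
    if True not in aligned:
        return 0, 0
    first = aligned.index(True)
    last = len(aligned) - 1 - aligned[::-1].index(True)
    n_gaps = sum(1 for k in range(first, last + 1) if not aligned[k])
    return n_gaps, sum(aligned)


n_gaps, n_aligned = count_internal_gaps("ACD-EFG", "AC-WEFG")
gap_fraction = n_gaps / n_aligned  # compare against 0.10, 0.20 or 0.30
```

Unaligned leading or trailing positions are not counted here, since they are not surrounded by aligned positions.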
A filter used to limit the number of structure matches reported based on the lengths of the two polypeptide chains being compared. The length is defined as the number of amino acids in the polypeptide. The default is to compare polypeptide chains of any two lengths. Options permit filtering to chains that differ in length by less than 10%, 20% or 30%. NOTE: polypeptide chains of fewer than 30 residues in length are NOT included in CE.
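As a rough sketch of how such a length filter could be applied, the function below checks a pair of chain lengths. The text does not say which length the percentage is taken against, so this sketch assumes the longer chain; the function and parameter names are hypothetical.

```python
MIN_CHAIN_LENGTH = 30  # chains shorter than this are not included in CE


def passes_length_filter(len1, len2, max_diff_fraction=None):
    """Return True if two chain lengths pass the length-difference filter.

    max_diff_fraction is None for no limit (the default), or a value such
    as 0.10, 0.20 or 0.30.
    Assumption: the difference is measured relative to the longer chain.
    """
    if len1 < MIN_CHAIN_LENGTH or len2 < MIN_CHAIN_LENGTH:
        return False
    if max_diff_fraction is None:
        return True
    return abs(len1 - len2) / max(len1, len2) < max_diff_fraction
```

For example, chains of 100 and 95 residues pass the 10% option, while chains of 100 and 80 residues do not.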
Macromolecular structure database maintained by the Research Collaboratory for Structural Bioinformatics (RCSB).
Polypeptide chains are specified in the form: PPPP:C or PPPP where PPPP
is the 4-character PDB assigned identifier and C is the chain identifier as found
in the PDB file for the desired chain. If the chain identifier is not provided,
alignment is performed using the first polypeptide chain found in the protein.
An underscore (_) is used when presenting results to indicate that the PDB file contains a
chain with blank chain identifier. It is not needed as input. User input files are
assigned the PDB identifier USR1 [and USR2].
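The PPPP:C input form can be checked with a short parser. This is only a sketch: the function name parse_chain_spec and the restriction to alphanumeric characters are assumptions made for illustration, not a description of how the CE server validates input.

```python
import re

# Four-character identifier, optionally followed by ':' and one chain character.
_CHAIN_SPEC = re.compile(r"^([0-9A-Za-z]{4})(?::([0-9A-Za-z]))?$")


def parse_chain_spec(spec):
    """Split 'PPPP:C' or 'PPPP' into (pdb_id, chain_id).

    chain_id is None when no chain is given, meaning "use the first
    polypeptide chain found in the protein".
    """
    match = _CHAIN_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"not a valid chain specification: {spec!r}")
    return match.group(1), match.group(2)
```

With this sketch, "1HBA:B" yields the pair ("1HBA", "B") and "1HBA" yields ("1HBA", None).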
Specific protein families have been aligned using enhancements to CE that take into account multiple sequence alignments, secondary structure, and property profiles at the final dynamic programming step. These are pairwise alignments. Contact Phil Bourne if you have an interest in alignments of protein families not currently available.
Representatives are structures used to represent others based on anticipated homology as determined from the following set of criteria:
There is nothing significant about the choice of representative; for example, it is not chosen because it was historically the first protein solved. Rather, it is the first protein processed from the complete PDB.
Filter results based on the RMSD between C alpha atoms over the length of the alignment. The default is less than 5 Angstroms. Don't be fooled: members of the same protein family, which obviously have the same fold, can differ by 4 Angstroms or more in RMSD. Useful results may lie in the twilight zone of greater than 4 Angstroms.
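For reference, the RMSD over aligned C alpha pairs can be computed as below. This sketch assumes the two coordinate lists are already optimally superimposed and paired position by position; it does not perform the superposition itself, and the function name is illustrative.

```python
import math


def rmsd(coords1, coords2):
    """Root-mean-square deviation between paired (x, y, z) coordinates.

    Assumption: both lists hold C alpha positions of aligned residues,
    already superimposed, in the same order.
    """
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(coords1))


# Two atom pairs, each displaced by 3 Angstroms along z.
value = rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 0, 3)])
passes_default = value < 5.0  # default filter: less than 5 Angstroms
```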
A filter used to limit structures aligned to only those that have high sequence identity. The default is no limit, but limiting the results to those structures that show more than 30%, 50%, 70%, and 90% sequence identity is possible.
Limits the number of reported structure neighbors to 100 or 500, taking either the most similar or the most dissimilar ones from those selected according to the other selection criteria specified.
Selects either none or all of the reported structure neighbors for further alignment.
The Select Similarity Level option can be used to specify the desired level of structural similarity. If only a close match is desired, select High; if a less rigorous match is acceptable, select Low. The default is Medium. This value corresponds to the heuristic parameter D1 (in eqs 10 and 11 of the CE paper). Higher values should allow you to detect longer alignments with higher RMSD, but this holds only approximately, since the optimization step and significance evaluation may alter the final result.
Defines the parameter on which to sort the output. The default is Z-score. That is, structure homologs to an input polypeptide chain are sorted such that the most statistically significant match is given first.
Additional dynamic programming optimization is applied on the matrix built as follows:
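$$\mathrm{mat}(i,j) = \mathrm{dstr}(i,j) + \mathrm{dseq}(i,j)$$

$$\mathrm{dstr}(i,j) = \begin{cases} 4 - \mathrm{dcalpha}(i,j), & \text{if } 4 - \mathrm{dcalpha}(i,j) > -2 \\ -2, & \text{otherwise} \end{cases}$$

$$\mathrm{dseq}(i,j) = \begin{cases} 2, & \text{if } \mathrm{aa1}(i) = \mathrm{aa2}(j) \\ -0.01, & \text{otherwise} \end{cases}$$
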
mat(i,j) - matrix element for position i in protein 1 and position j in protein 2.
dcalpha(i,j) - distance between calpha atoms at position i in protein 1 and j in protein 2 in optimal superposition.
aa1(k), aa2(k) - amino acid code at position k in proteins 1 and 2, respectively.
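The matrix defined above can be built directly from its definition. The sketch below assumes the C alpha distances in optimal superposition are supplied as a nested list dist[i][j] and the sequences as one-letter strings; these input conventions and the function names are illustrative, and the dynamic programming step itself is not shown.

```python
def dstr(d_calpha):
    """Structural term: 4 - distance, with a floor of -2."""
    score = 4.0 - d_calpha
    return score if score > -2.0 else -2.0


def dseq(aa_i, aa_j):
    """Sequence term: 2 for identical amino acids, -0.01 otherwise."""
    return 2.0 if aa_i == aa_j else -0.01


def build_matrix(dist, seq1, seq2):
    """mat[i][j] = dstr(i,j) + dseq(i,j) for every pair of positions.

    dist[i][j] is the C alpha distance between position i of protein 1
    and position j of protein 2 in optimal superposition.
    """
    return [
        [dstr(dist[i][j]) + dseq(seq1[i], seq2[j]) for j in range(len(seq2))]
        for i in range(len(seq1))
    ]


# Made-up distances for a 2 x 2 illustration.
mat = build_matrix([[1.0, 7.0], [3.0, 10.0]], "AG", "AC")
```

Note that the structural term stops decreasing once the distance exceeds 6 Angstroms, so distant pairs are all penalized equally.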
Specifies the name of a file on the local computer to be uploaded and used as a probe for finding structure neighbors. The file must be in PDB format and contain, at a minimum, PDB HEADER and ATOM records. SEQRES records are desirable if the chain consists of several fragments.
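Before uploading, a file can be checked locally for the record types mentioned above. This sketch relies only on the PDB convention that the record name occupies the first six columns of each line; the function name is hypothetical, and the check is not a full format validation.

```python
def check_pdb_records(text):
    """Report which of the HEADER, ATOM and SEQRES records are present.

    The record name is read from the first six columns of each line.
    """
    names = {line[:6].strip() for line in text.splitlines()}
    return {record: record in names for record in ("HEADER", "ATOM", "SEQRES")}


# Tiny made-up example, not a real PDB entry.
sample = (
    "HEADER    TEST STRUCTURE\n"
    "ATOM      1  CA  ALA A   1       0.000   0.000   0.000\n"
    "END\n"
)
report = check_pdb_records(sample)
```

A file whose report shows HEADER or ATOM as missing would not meet the minimum stated above; a missing SEQRES matters mainly for chains made of several fragments.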
Measure of the statistical significance of the result relative to an alignment of
random structures. Typically proteins with a similar fold will have a Z-score of 3.5 or
better. The Z-score can be used to filter out less significant results or, alternatively, to look
for weak similarities.
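Filtering and sorting by Z-score might look like the sketch below. The representation of results as (name, z_score) pairs, the function name and the sample values are all made up for illustration; only the 3.5 threshold comes from the text.

```python
def significant_hits(hits, min_z=3.5):
    """Keep hits with a Z-score of at least min_z, most significant first.

    hits is a list of (name, z_score) pairs (a hypothetical format).
    """
    kept = [hit for hit in hits if hit[1] >= min_z]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


# Made-up values for illustration only.
ranked = significant_hits([("hitA", 2.1), ("hitB", 5.4), ("hitC", 3.5)])
```

Lowering min_z below 3.5 keeps weaker similarities in the list, at the cost of more matches that may not share a fold.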