The statistics of local sequence comparison

Bit scores

P-values

Database searches

The statistics of gapped alignments

Edge effects

The choice of substitution scores

The PAM and BLOSUM amino acid substitution matrices

DNA substitution matrices

Gap scores

Low complexity sequence regions

References

### Introduction

To assess whether a given alignment constitutes evidence for homology, ithelps to know how strong an alignment can be expected from chance alone.In this context, "chance" can mean the comparison of (i) real but non-homologous sequences; (ii) real sequences that are shuffled to preservecompositional properties [1-3]; or (iii) sequences that are generatedrandomly based upon a DNA or protein sequence model. Analytic statisticalresults invariably use the last of these definitions of chance, whileempirical results based on simulation and curve-fitting may use any ofthe definitions.### The statistics of global sequence comparison

Unfortunately, under even the simplest random models and scoring systems,very little is known about the random distribution of optimal globalalignment scores [4]. Monte Carlo experiments can provide roughdistributional results for some specific scoring systems and sequencecompositions [5], but these can not be generalized easily. Therefore,one of the few methods available for assessing the statistical significanceof a particular global alignment is to generate many random sequencepairs of the appropriate length and composition, and calculate theoptimal alignment score for each [1,3]. While it is then possible toexpress the score of interest in terms of standard deviations from themean, it is a mistake to assume that the relevant distribution is normaland convert this*Z*-value into a

*P*-value; the tail behavior of globalalignment scores is unknown. The most one can say reliably is that if100 random alignments have score inferior to the alignment of interest,the

*P*-value in question is likely less than 0.01. One further pitfallto avoid is exaggerating the significance of a result found among multipletests. When many alignments have been generated, e.g. in a databasesearch, the significance of the best must be discounted accordingly.An alignment with

*P*-value 0.0001 in the context of a single trial maybe assigned a

*P*-value of only 0.1 if it was selected as the best among1000 independent trials.

### The statistics of local sequence comparison

Fortunately statistics for the scores of local alignments, unlike those ofglobal alignments, are well understood. This is particularly true for localalignments lacking gaps, which we will consider first. Such alignments wereprecisely those sought by the original BLAST database search programs [6].A local alignment without gaps consists simply of a pair of equal lengthsegments, one from each of the two sequences being compared. A modificationof the Smith-Waterman [7] or Sellers [8] algorithms will find all segmentpairs whose scores can not be improved by extension or trimming. These arecalled high-scoring segment pairs or HSPs.

To analyze how high a score is likely to arise by chance, a model of randomsequences is needed. For proteins, the simplest model chooses the amino acidresidues in a sequence independently, with specific background probabilitiesfor the various residues. Additionally, the expected score for aligning arandom pair of amino acid is required to be negative. Were this not the case,long alignments would tend to have high score independently of whether thesegments aligned were related, and the statistical theory would break down.

Just as the sum of a large number of independent identically distributed(i.i.d) random variables tends to a normal distribution, the maximumof a large number of i.i.d. random variables tends to an extreme valuedistribution [9]. (We will elide the many technical points requiredto make this statement rigorous.) In studying optimal local sequencealignments, we are essentially dealing with the latter case [10,11].In the limit of sufficiently large sequence lengths

*m*and

*n*, thestatistics of HSP scores are characterized by two parameters,

*K*and

*lambda*. Most simply, the expected number of HSPs with score at least

*S*is given by the formula

We call this the *E*-value for the score *S*.

This formula makes eminently intuitive sense. Doubling the length ofeither sequence should double the number of HSPs attaining a given score.Also, for an HSP to attain the score *2x* it must attain the score *x* twicein a row, so one expects *E* to decrease exponentially with score. Theparameters *K* and *lambda* can be thought of simply as natural scales forthe search space size and the scoring system respectively.

### Bit scores

Raw scores have little meaning without detailed knowledge of the scoringsystem used, or more simply its statistical parameters*K*and

*lambda*.Unless the scoring system is understood, citing a raw score alone is like citing a distance without specifyingfeet, meters, or light years.By normalizing a raw score using the formula

one attains a "bit score" *S'*, which has a standard set of units. The *E*-valuecorresponding to a given bit score is simply

Bit scores subsume the statistical essence of the scoring system employed,so that to calculate significance one needs to know in addition only thesize of the search space.

### P-values

The number of random HSPs with score >=*S*is described by a Poissondistribution [10,11]. This means that the probability of finding exactly

*a*HSPs with score >=

*S*is given by

where *E* is the *E*-value of *S* given by equation (1) above. Specifically thechance of finding zero HSPs with score >=*S* is e^{-E}, so the probabilityof finding at least one such HSP is

This is the *P*-value associated with the score *S*. For example, if one expectsto find three HSPs with score >= *S*, the probability of finding at least oneis 0.95. The BLAST programs report *E*-value rather than *P*-values because itis easier to understand the difference between, for example, *E*-value of 5and 10 than *P*-values of 0.993 and 0.99995. However, when *E* < 0.01, *P*-valuesand *E*-value are nearly identical.

### Database searches

The*E*-value of equation (1) applies to the comparison of two proteins oflengths

*m*and

*n*. How does one assess the significance of an alignment thatarises from the comparison of a protein of length

*m*to a database containingmany different proteins, of varying lengths? One view is that all proteinsin the database are

*a priori*equally likely to be related to the query.This implies that a low

*E*-value for an alignment involving a short databasesequence should carry the same weight as a low

*E*-value for an alignmentinvolving a long database sequence. To calculate a "database search"

*E*-value,one simply multiplies the pairwise-comparison

*E*-value by the number ofsequences in the database. Recent versions of the FASTA protein comparisonprograms [12] take this approach [13].

An alternative view is that a query is

*a priori*more likely to be related toa long than to a short sequence, because long sequences are often composed ofmultiple distinct domains. If we assume the

*a priori*chance of relatedness isproportional to sequence length, then the pairwise

*E*-value involving a databasesequence of length

*n*should be multiplied by

*N/n*, where

*N*is the total lengthof the database in residues. Examining equation (1), this can be accomplishedsimply by treating the database as a single long sequence of length

*N*. TheBLAST programs [6,14,15] take this approach to calculating database

*E*-value.Notice that for DNA sequence comparisons, the length of database records islargely arbitrary, and therefore this is the only really tenable method forestimating statistical significance.

### The statistics of gapped alignments

The statistics developed above have a solid theoretical foundation onlyfor local alignments that are not permitted to have gaps. However, manycomputational experiments [14-21] and some analytic results [22] stronglysuggest that the same theory applies as well to gapped alignments. Forungapped alignments, the statistical parameters can be calculated, usinganalytic formulas, from the substitution scores and the background residue frequencies of the sequences being compared. For gapped alignments,these parameters must be estimated from a large-scale comparison of"random" sequences.Some database search programs, such as FASTA [12] or various implementationof the Smith-Waterman algorithm [7], produce optimal local alignment scoresfor the comparison of the query sequence to every sequence in the database.Most of these scores involve unrelated sequences, and therefore can be usedto estimate

*lambda*and

*K*[17,21]. This approach avoids the artificiality ofa random sequence model by employing real sequences, with their attendantinternal structure and correlations, but it must face the problem of excludingfrom the estimation scores from pairs of related sequences. The BLAST programsachieve much of their speed by avoiding the calculation of optimal alignmentscores for all but a handful of unrelated sequences. The must therefore relyupon a pre-estimation of the parameters

*lambda*and

*K*, for a selected set ofsubstitution matrices and gap costs. This estimation could be done using realsequences, but has instead relied upon a random sequence model [14], whichappears to yield fairly accurate results [21].

### Edge effects

The statistics described above tend to be somewhat conservative for shortsequences. The theory supporting these statistics is an asymptotic one,which assumes an optimal local alignment can begin with any aligned pairof residues. However, a high-scoring alignment must have some length,and therefore can not begin near to the end of either of two sequencesbeing compared. This "edge effect" may be corrected for by calculatingan "effective length" for sequences [14]; the BLAST programs implementsuch a correction. For sequences longer than about 200 residues the edgeeffect correction is usually negligible.### The choice of substitution scores

The results a local alignment program produces depend strongly upon thescores it uses. No single scoring scheme is best for all purposes, andan understanding of the basic theory of local alignment scores can improvethe sensitivity of one's sequence analyses. As before, the theory is fullydeveloped only for scores used to find ungapped local alignments, so westart with that case.A large number of different amino acid substitution scores, based upon avariety of rationales, have been described [23-36]. However the scores ofany substitution matrix with negative expected score can be written uniquelyin the form

where the *q _{ij}*, called target frequencies, are positive numbers that sumto 1, the

*p*are background frequencies for the various residues, and

_{i}*lambda*is a positive constant [10,31]. The

*lambda*here is identical to the

*lambda*of equation (1).

Multiplying all the scores in a substitution matrix by a positive constantdoes not change their essence: an alignment that was optimal using theoriginal scores remains optimal. Such multiplication alters the parameter

*lambda*but not the target frequencies

*q*. Thus, up to a constantscaling factor, every substitution matrix is uniquely determined by itstarget frequencies. These frequencies have a special significance [10,31]:

_{ij}A given class of alignments is best distinguished from chance by the substitution matrix whose target frequencies characterize the class.

To elaborate, one may characterize a set of alignments representing homologousprotein regions by the frequency with which each possible pair of residues isaligned. If valine in the first sequence and leucine in the second appear in1% of all alignment positions, the target frequency for (valine, leucine) is0.01. The most direct way to construct appropriate substitution matrices forlocal sequence comparison is to estimate target and background frequencies,and calculate the corresponding log-odds scores of formula (6). Thesefrequencies in general can not be derived from first principles, and theirestimation requires empirical input.

### The PAM and BLOSUM amino acid substitution matrices

While all substitution matrices are implicitly of log-odds form, the firstexplicit construction using formula (6) was by Dayhoff and coworkers [24,25]. From a study of observed residue replacements in closely related proteins,they constructed the PAM (for "point accepted mutation") model of molecularevolution. One "PAM" corresponds to an average change in 1% of all aminoacid positions. After 100 PAMs of evolution, not every residue will havechanged: some will have mutated several times, perhaps returning to theiroriginal state, and others not at all. Thus it is possible to recognize ashomologous proteins separated by much more than 100 PAMs. Note that thereis no general correspondence between PAM distance and evolutionary time, asdifferent protein families evolve at different rates.Using the PAM model, the target frequencies and the corresponding substitutionmatrix may be calculated for any given evolutionary distance. When twosequences are compared, it is not generally known a priori what evolutionarydistance will best characterize any similarity they may share. Closelyrelated sequences, however, are relatively easy to find even will non-optimalmatrices, so the tendency has been to use matrices tailored for fairly distantsimilarities. For many years, the most widely used matrix was PAM-250,because it was the only one originally published by Dayhoff.

Dayhoff's formalism for calculating target frequencies has been criticized[27], and there have been several efforts to update her numbers using thevast quantities of derived protein sequence data generated since her work[33,35]. These newer PAM matrices do not differ greatly from the originalones [37].

An alternative approach to estimating target frequencies, and the correspondinglog-odds matrices, has been advanced by Henikoff & Henikoff [34]. They examinemultiple alignments of distantly related protein regions directly, rather thanextrapolate from closely related sequences. An advantage of this approach isthat it cleaves closer to observation; a disadvantage is that it yields noevolutionary model. A number of tests [13,37] suggest that the "BLOSUM"matrices produced by this method generally are superior to the PAM matricesfor detecting biological relationships.

### DNA substitution matrices

While we have discussed substitution matrices only in the context of protein sequence comparison, all the main issues carry over to DNA sequence comparison.One warning is that when the sequences of interest code for protein, it is almost always better to compare the protein translations than to compare the DNA sequences directly.The reason is that after only a small amount of evolutionary change, the DNA sequences, when compared using simple nucleotide substitution scores, contain lessinformation with which to deduce homology than do the encoded protein sequences[32].Sometimes, however, one may wish to compare non-coding DNA sequences, at which point the same log-odds approach as before applies.An evolutionary model in which all nucleotides are equally common and all substitution mutations are equally likely yields different scores only for matches and mismatches [32].A more complex model, in which transitions are more likely than transversions, yields different "mismatch" scores for transitions and transversions [32].The best scores to use will depend upon whether one is seeking relatively diverged or closely related sequences [32].

### Gap scores

Our theoretical development concerning the optimality of matrices constructedusing equation (6) unfortunately is invalid as soon as gaps and associated gapscores are introduced, and no more general theory is available to take itsplace. However, if the gap scores employed are sufficiently large, one canexpect that the optimal substitution scores for a given application will notchange substantially. In practice, the same substitution scores have beenapplied fruitfully to local alignments both with and without gaps. Appropriategap scores have been selected over the years by trial and error [13], and mostalignment programs will have a default set of gap scores to go with a defaultset of substitution scores. If the user wishes to employ a different set ofsubstitution scores, there is no guarantee that the same gap scores will remainappropriate. No clear theoretical guidance can be given, but "affine gapscores" [38-41], with a large penalty for opening a gap and a much smallerone for extending it, have generally proved among the most effective.### Low complexity sequence regions

There is one frequent case where the random models and therefore the statisticsdiscussed here break down. As many as one fourth of all residues in proteinsequences occur within regions with highly biased amino acid composition.Alignments of two regions with similarly biased composition may achieve veryhigh scores that owe virtually nothing to residue order but are due insteadto segment composition. Alignments of such "low complexity" regions havelittle meaning in any case: since these regions most likely arise by geneslippage, the one-to-one residue correspondence imposed by alignment isnot valid. While it is worth noting that two proteins contain similar lowcomplexity regions, they are best excluded when constructing alignments[42-44]. The BLAST programs employ the SEG algorithm [43] to filter lowcomplexity regions from proteins before executing a database search.### References

[1] Fitch, W.M. (1983) "Random sequences." J. Mol. Biol. 163:171-176. (PubMed)[2] Lipman, D.J., Wilbur, W.J., Smith T.F. & Waterman, M.S. (1984) "On the statistical significance of nucleic acid similarities." Nucl. Acids Res. 12:215-226. (PubMed)

[3] Altschul, S.F. & Erickson, B.W. (1985) "Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage." Mol. Biol. Evol. 2:526-538. (PubMed)

[4] Deken, J. (1983) "Probabilistic behavior of longest-common-subsequence length." In "Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison." D. Sankoff & J.B. Kruskal (eds.), pp. 55-91, Addison-Wesley, Reading, MA.

[5] Reich, J.G., Drabsch, H. & Daumler, A. (1984) "On the statistical assessment of similarities in DNA sequences." Nucl. Acids Res. 12:5529-5543. (PubMed)

[6] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410. (PubMed)

[7] Smith, T.F. & Waterman, M.S. (1981) "Identification of common molecular subsequences." J. Mol. Biol. 147:195-197. (PubMed)

[8] Sellers, P.H. (1984) "Pattern recognition in genetic sequences by mismatch density." Bull. Math. Biol. 46:501-514.

[9] Gumbel, E. J. (1958) "Statistics of extremes." Columbia University Press, New York, NY.

[10] Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268.(PubMed)

[11] Dembo, A., Karlin, S. & Zeitouni, O. (1994) "Limit distribution of maximal non-aligned two-sequence segmental score." Ann. Prob. 22:2022-2039.

[12] Pearson, W.R. & Lipman, D.J. (1988) Improved tools for biological sequence comparison." Proc. Natl. Acad. Sci. USA 85:2444-2448. (PubMed)

[13] Pearson, W.R. (1995) "Comparison of methods for searching protein sequence databases." Prot. Sci. 4:1145-1160. (PubMed)

[14] Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth. Enzymol. 266:460-480. (PubMed)

[15] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402. (PubMed)

[16] Smith, T.F., Waterman, M.S. & Burks, C. (1985) "The statistical distribution of nucleic acid similarities." Nucleic Acids Res. 13:645-656. (PubMed)

[17] Collins, J.F., Coulson, A.F.W. & Lyall, A. (1988) "The significance of protein sequence similarities." Comput. Appl. Biosci. 4:67-71. (PubMed)

[18] Mott, R. (1992) "Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores." Bull. Math. Biol. 54:59-75.

[19] Waterman, M.S. & Vingron, M. (1994) "Rapid and accurate estimates of statistical significance for sequence database searches." Proc. Natl. Acad. Sci. USA 91:4625-4628. (PubMed)

[20] Waterman, M.S. & Vingron, M. (1994) "Sequence comparison significance and Poisson approximation." Stat. Sci. 9:367-381.

[21] Pearson, W.R. (1998) "Empirical statistical estimates for sequence similarity searches." J. Mol. Biol. 276:71-84. (PubMed)

[22] Arratia, R. & Waterman, M.S. (1994) "A phase transition for the score in matching random sequences allowing deletions." Ann. Appl. Prob. 4:200-225.

[23] McLachlan, A.D. (1971) "Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c-551." J. Mol. Biol. 61:409-424. (PubMed)

[24] Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C. (1978) "A model of evolutionary change in proteins." In "Atlas of Protein Sequence and Structure," Vol. 5, Suppl. 3 (ed. M.O. Dayhoff), pp. 345-352. Natl. Biomed. Res. Found., Washington, DC.

[25] Schwartz, R.M. & Dayhoff, M.O. (1978) "Matrices for detecting distant relationships." In "Atlas of Protein Sequence and Structure," Vol. 5, Suppl. 3 (ed. M.O. Dayhoff), p. 353-358. Natl. Biomed. Res. Found., Washington, DC.

[26] Feng, D.F., Johnson, M.S. & Doolittle, R.F. (1984) "Aligning amino acid sequences: comparison of commonly used methods." J. Mol. Evol. 21:112-125. (PubMed)

[27] Wilbur, W.J. (1985) "On the PAM matrix model of protein evolution." Mol. Biol. Evol. 2:434-447. (PubMed)

[28] Taylor, W.R. (1986) "The classification of amino acid conservation." J. Theor. Biol. 119:205-218. (PubMed)

[29] Rao, J.K.M. (1987) "New scoring matrix for amino acid residue exchanges based on residue characteristic physical parameters." Int. J. Peptide Protein Res. 29:276-281.

[30] Risler, J.L., Delorme, M.O., Delacroix, H. & Henaut, A. (1988) "Amino acid substitutions in structurally related proteins. A pattern recognition approach. Determination of a new and efficient scoring matrix." J. Mol. Biol. 204:1019-1029. (PubMed)

[31] Altschul, S.F. (1991) "Amino acid substitution matrices from an information theoretic perspective." J. Mol. Biol. 219:555-565. (PubMed)

[32] States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70.

[33] Gonnet, G.H., Cohen, M.A. & Benner, S.A. (1992) "Exhaustive matching of the entire protein sequence database." Science 256:1443-1445. (PubMed)

[34] Henikoff, S. & Henikoff, J.G. (1992) "Amino acid substitution matrices from protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919. (PubMed)

[35] Jones, D.T., Taylor, W.R. & Thornton, J.M. (1992) "The rapid generation of mutation data matrices from protein sequences." Comput. Appl. Biosci. 8:275-282. (PubMed)

[36] Overington, J., Donnelly, D., Johnson M.S., Sali, A. & Blundell, T.L. (1992) "Environment-specific amino acid substitution tables: Tertiary templates and prediction of protein folds." Prot. Sci. 1:216-226. (PubMed)

[37] Henikoff, S. & Henikoff, J.G. (1993) "Performance evaluation of amino acid substitution matrices." Proteins 17:49-61. (PubMed)

[38] Gotoh, O. (1982) "An improved algorithm for matching biological sequences." J. Mol. Biol. 162:705-708. (PubMed)

[39] Fitch, W.M. & Smith, T.F. (1983) "Optimal sequence alignments." Proc. Natl. Acad. Sci. USA 80:1382-1386.

[40] Altschul, S.F. & Erickson, B.W. (1986) "Optimal sequence alignment using affine gap costs." Bull. Math. Biol. 48:603-616. (PubMed)

[41] Myers, E.W. & Miller, W. (1988) "Optimal alignments in linear space." Comput. Appl. Biosci. 4:11-17. (PubMed)

[42] Claverie, J.-M. & States, D.J. (1993) "Information enhancement methods for large-scale sequence-analysis." Comput. Chem. 17:191-201.

[43] Wootton, J.C. & Federhen, S. (1993) "Statistics of local complexity in amino acid sequences and sequence databases." Comput. Chem. 17:149-163.

[44] Altschul, S.F., Boguski, M.S., Gish, W. & Wootton, J.C. (1994) "Issues in searching molecular sequence databases." Nature Genet. 6:119-129. (PubMed)