


The toolkit
Live version: 1.0.2
Tools:
- SFS & Tajima's D
- Shannon's H
- Aln to Matrix
- HGVS to FASTA
- Wave
- Pair Frequencies
- RNA2Dmatch
- Kurtosis coef.
- Hydrophilicity plot (Kyte-Doolittle Analysis)
- FASTA to SEQRES
SFS & Tajima's D
Tajima’s D is a statistical test that compares the number of segregating sites to the average number of nucleotide differences [1].
Tajima’s D = (π - θ_W) / sqrt(V_D)
where:
π is the average number of pairwise nucleotide differences (an estimator of θ),
θ_W is Watterson's estimator of θ, based on the number of segregating sites,
V_D is the estimated variance of (π - θ_W).
The code calculates these components as follows:
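a) π (theta_pi): π = Σ_{i=1}^{n-1} 2i(n-i)·SFS[i-1] / (n(n-1)), where n is the number of sampled sequences and SFS[i-1] is the number of sites at which an allele is carried by exactly i of them (the array index is zero-based).
b) θ_W (theta_w): θ_W = S / a_1, where S is the number of segregating sites and a_1 = Σ_{i=1}^{n-1} 1/i.
c) Variance V_D: V_D = e_1·S + e_2·S(S-1), where e_1 and e_2 are functions of the sample size n only, derived in Tajima's original paper [1].
The implementation uses JavaScript to perform these calculations in the browser.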
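The three components above can be sketched in Python as follows. This is a minimal illustration, not the toolkit's JavaScript code: the function name tajimas_d and the returned tuple are hypothetical, and the constants a_2, b_1, b_2, c_1 and c_2 are the intermediate quantities from Tajima (1989) [1] that define e_1 and e_2. The sketch assumes the SFS is given as a list of n-1 counts.

```python
import math

def tajimas_d(sfs):
    """Return (theta_pi, theta_w, D) for a site frequency spectrum.

    sfs[i-1] is the number of sites whose allele count is i (i = 1..n-1),
    so the sample size is n = len(sfs) + 1. D is None when undefined.
    """
    n = len(sfs) + 1
    if n < 2:
        raise ValueError("the SFS must contain at least one class")

    s = sum(sfs)  # number of segregating sites
    theta_pi = sum(2 * i * (n - i) * count
                   for i, count in enumerate(sfs, start=1)) / (n * (n - 1))
    a1 = sum(1 / i for i in range(1, n))
    theta_w = s / a1

    # With no segregating sites, or fewer than 4 sequences, the variance
    # term is zero and D is undefined.
    if s == 0 or n < 4:
        return theta_pi, theta_w, None

    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    var_d = e1 * s + e2 * s * (s - 1)
    return theta_pi, theta_w, (theta_pi - theta_w) / math.sqrt(var_d)
```

For a spectrum proportional to 1/i, such as [6, 3, 2] for n = 4, both estimators agree and D is zero; an excess of rare alleles pushes D below zero, and an excess of intermediate-frequency alleles pushes it above zero.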
Shannon's H
This tool computes Shannon's entropy, providing a measure of genetic diversity within a sequence [2].
For a discrete random variable X with possible values {x₁, ..., xₙ} and probability mass function P(X), the Shannon entropy H(X) is defined as:
H(X) = -∑ P(xᵢ) * log₂(P(xᵢ))
where the sum is over all possible values of X.
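As an illustration of the definition, the following Python sketch estimates P(xᵢ) from the relative frequency of each symbol in a sequence. It is not the toolkit's own code, and the function name shannon_entropy is hypothetical. Symbols that do not occur contribute nothing to the sum.

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    """Shannon entropy in bits of the symbol frequencies in `sequence`."""
    total = len(sequence)
    if total == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    # P(x) is estimated as count(x) / total for each observed symbol x.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A sequence with four equally frequent nucleotides, such as "ACGT", gives the maximum of 2 bits for a four-letter alphabet, while a homopolymer such as "AAAA" gives 0.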
Aln to Matrix
This tool transforms a vertically aligned set of sequences into a horizontally aligned set while preserving the alignment relationships.
Pair frequencies
The "wave function" in this context is not a quantum mechanical wave function but a plot of per-position entropy values: WaveFunction = {(i, H(i)) | i = 1, ..., L}, where L is the length of the sequences.
The Shannon entropy is calculated for each position i as H(i) = -∑ f(a,i) * log₂(f(a,i)), where f(a,i) is the relative frequency of character a at position i and the sum is taken over all characters a present at that position [2].
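The per-position calculation can be sketched in Python as below. This is an illustrative reading of the formula rather than the toolkit's implementation: the function name entropy_wave is hypothetical, and the sketch assumes that all sequences have the same length and that every character, including a gap symbol, counts as an ordinary character.

```python
import math
from collections import Counter

def entropy_wave(sequences):
    """Return [(i, H(i))] for positions i = 1..L of equal-length sequences."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    wave = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)  # characters in column i
        total = len(sequences)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        wave.append((i + 1, h))  # positions are reported 1-based
    return wave
```

Fully conserved columns give H(i) = 0, so peaks in the plotted values mark the variable positions.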
Kurtosis coefficient
The kurtosis coefficient, often called simply kurtosis, is a statistical measure of the "tailedness" of a dataset, that is, of how prone it is to extreme deviations from the mean. It describes the shape of the probability distribution of a real-valued random variable [3].
Excess kurtosis is kurtosis minus 3. The adjustment is made because the kurtosis of the normal distribution is 3, so the normal distribution has an excess kurtosis of 0. A positive excess kurtosis indicates a leptokurtic distribution (heavier tails than the normal), while a negative excess kurtosis indicates a platykurtic distribution (lighter tails).
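A minimal Python sketch of both quantities follows. It assumes the plain moment-based definition, the fourth central moment divided by the squared second central moment, with both moments divided by the number of values; the toolkit may use a bias-corrected estimator instead. The function names are hypothetical.

```python
def kurtosis(values):
    """Moment-based kurtosis: m4 / m2**2 (equals 3 for a normal distribution)."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2

def excess_kurtosis(values):
    """Kurtosis minus 3, so that the normal distribution scores 0."""
    return kurtosis(values) - 3
```

For the evenly spaced values 1, 2, 3, 4, 5 the kurtosis is 1.7 and the excess kurtosis is -1.3, which is platykurtic.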
SEQUENCE FORMATS
HGVS (Human Genome Variation Society) Nomenclature
The HGVS nomenclature is a standardized system for describing variations in DNA, RNA, and protein sequences. It provides a consistent way to report genetic mutations.
DNA Variations: Described using a 'g.' prefix (genomic), 'c.' prefix (coding DNA), or 'm.' prefix (mitochondrial DNA). Example: c.76A>T indicates an adenine (A) to thymine (T) change at position 76 in the coding sequence.
Protein Variations: Described using a 'p.' prefix. Example: p.Gly38Ser indicates a glycine (Gly) to serine (Ser) change at position 38 in the protein.
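To make the notation concrete, the Python sketch below parses a simple DNA substitution such as c.76A>T and applies it to a reference sequence. It is a hedged illustration only: the function names are hypothetical, the pattern covers single-nucleotide substitutions and none of the other HGVS variant types (deletions, insertions, duplications and so on), and it is not a description of how the toolkit converts HGVS to FASTA.

```python
import re

# Simple substitutions only, e.g. "c.76A>T", "g.1234G>C" or "m.100T>C".
_SUBSTITUTION = re.compile(r"([gcm])\.(\d+)([ACGT])>([ACGT])")

def parse_substitution(variant):
    """Split a simple HGVS substitution into its parts."""
    match = _SUBSTITUTION.fullmatch(variant)
    if match is None:
        raise ValueError(f"not a simple substitution: {variant!r}")
    prefix, position, ref, alt = match.groups()
    return {"prefix": prefix, "position": int(position), "ref": ref, "alt": alt}

def apply_substitution(sequence, variant):
    """Return `sequence` with the substitution applied (positions are 1-based)."""
    v = parse_substitution(variant)
    index = v["position"] - 1
    if not 0 <= index < len(sequence):
        raise ValueError("position lies outside the sequence")
    if sequence[index] != v["ref"]:
        raise ValueError("reference base does not match the sequence")
    return sequence[:index] + v["alt"] + sequence[index + 1:]
```

Checking the reference base before substituting catches variants described against a different reference sequence.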
FASTA
FASTA is a text-based format for representing nucleotide or peptide sequences. Each sequence in a FASTA file begins with a single-line description (preceded by a '>'), followed by lines of sequence data.
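The layout described above can be read with a few lines of Python. This is a minimal sketch with a hypothetical function name, parse_fasta; it joins the sequence lines of each record and ignores blank lines, and it does not validate the sequence alphabet.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

For example, the text ">seq1 example", "ACGT", "ACGT", ">seq2", "TTGA" (one item per line) yields the records ("seq1 example", "ACGTACGT") and ("seq2", "TTGA").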
SEQRES
SEQRES records are used in the Protein Data Bank (PDB) file format to list the primary structure of a protein or nucleic acid sequence as it appears in the corresponding structure file.
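As an illustration, the Python sketch below writes SEQRES lines for a protein sequence given in one-letter code. It follows the fixed-column layout of the PDB format (record name, serial number, chain identifier, residue count, then up to 13 three-letter residue names per line), but it is a sketch under assumptions: the function name and the default chain identifier "A" are hypothetical, only the 20 standard amino acids are handled, lines are not padded to 80 columns, and nothing here is taken from the toolkit's own converter.

```python
THREE_LETTER = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}

def protein_to_seqres(sequence, chain_id="A"):
    """Return SEQRES lines for a one-letter protein sequence."""
    try:
        residues = [THREE_LETTER[aa] for aa in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"unsupported residue code: {err.args[0]!r}") from None

    lines = []
    for start in range(0, len(residues), 13):  # 13 residues per record
        serial = start // 13 + 1
        names = " ".join(residues[start:start + 13])
        lines.append(f"SEQRES {serial:3d} {chain_id} {len(residues):4d}  {names}")
    return lines
```

Each line repeats the total residue count of the chain, so a 14-residue chain produces two records that both carry the count 14.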
[1] Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123, 585-595.
[2] Shannon, C.E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27, 379-423. http://dx.doi.org/10.1002/j.1538-7305.1948.tb01338.x
[3] Pearson, K. (1905). Das Fehlergesetz und seine Verallgemeinerungen durch Fechner und Pearson. A Rejoinder [The Error Law and its Generalizations by Fechner and Pearson. A Rejoinder]. Biometrika, 4(1–2), 169–212.