The toolkit

Live version: 1.0.2

Tools:

  • SFS & Tajima's D

  • Shannon's H

  • Aln to Matrix

  • HGVS to FASTA

  • Wave 

  • Pair Frequencies

  • RNA2Dmatch

  • Kurtosis coef.

  • Hydrophilicity plot (Kyte-Doolittle Analysis)

  • FASTA to SEQRES

SFS & Tajima's D

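Tajima’s D is a statistical test of neutral evolution that compares two estimates of the population mutation parameter θ: one based on the number of segregating sites and one based on the average number of pairwise nucleotide differences [1].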

Tajima’s D = (π - θ_W) / sqrt(V_D), where:

  • π is the average number of pairwise nucleotide differences (an estimator of θ)

  • θ_W is Watterson's estimator of θ

  • V_D is the estimated variance of (π - θ_W)

The code calculates these components as follows:

a) π (theta_pi): π = (Σ_{i=1}^{n-1} 2i(n-i) SFS[i-1]) / (n(n-1)), where n is the number of sequences and SFS[i-1] is the number of segregating sites at which the variant is present in i of them.

b) θ_W (theta_w): θ_W = S / a_1, where S is the number of segregating sites and a_1 = Σ_{i=1}^{n-1} 1/i.

c) Variance V_D: V_D = e_1 S + e_2 S(S-1), where e_1 and e_2 are constants that depend only on n and are derived in Tajima's original paper [1].

The implementation uses JavaScript to perform these calculations in the browser.
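The three components above can be put together in a short JavaScript sketch. This is not the toolkit's own code: the function name `tajimaD` and the input layout (an unfolded site frequency spectrum of length n-1) are assumptions made for illustration. The constants e_1 and e_2 are built from the intermediate quantities a_1, a_2, b_1, b_2, c_1, c_2 as defined in Tajima's paper [1].

```javascript
// Illustrative sketch, not the toolkit's actual implementation.
// Assumption: sfs[i - 1] is the number of segregating sites at which the
// variant occurs in i of the n sequences (i = 1..n-1), so n = sfs.length + 1.
function tajimaD(sfs) {
  const n = sfs.length + 1;
  if (n < 2) {
    return { n, S: 0, thetaPi: NaN, thetaW: NaN, D: NaN };
  }

  let S = 0;      // number of segregating sites
  let piSum = 0;  // numerator of pi
  let a1 = 0;     // sum of 1/i
  let a2 = 0;     // sum of 1/i^2
  for (let i = 1; i < n; i++) {
    S += sfs[i - 1];
    piSum += 2 * i * (n - i) * sfs[i - 1];
    a1 += 1 / i;
    a2 += 1 / (i * i);
  }

  const thetaPi = piSum / (n * (n - 1));
  const thetaW = S / a1;

  // Constants from Tajima (1989); they depend only on n.
  const b1 = (n + 1) / (3 * (n - 1));
  const b2 = (2 * (n * n + n + 3)) / (9 * n * (n - 1));
  const c1 = b1 - 1 / a1;
  const c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1);
  const e1 = c1 / a1;
  const e2 = c2 / (a1 * a1 + a2);

  const variance = e1 * S + e2 * S * (S - 1);
  // D is undefined when there is no variation (variance of zero).
  const D = variance > 0 ? (thetaPi - thetaW) / Math.sqrt(variance) : NaN;
  return { n, S, thetaPi, thetaW, D };
}
```

For example, `tajimaD([3, 0, 0])` describes four sequences with three singleton sites; an excess of rare variants like this gives a negative D, while `tajimaD([0, 3, 0])` (all variants at intermediate frequency) gives a positive D.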

Shannon's H

This tool computes Shannon's entropy, providing a measure of genetic diversity within a sequence [2].

 

For a discrete random variable X with possible values {x₁, ..., xₙ} and probability mass function P(X), the Shannon entropy H(X) is defined as:

H(X) = -∑ P(xᵢ) * log₂(P(xᵢ))

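where the sum runs over all possible values of X, and terms with P(xᵢ) = 0 are taken to contribute 0. With the base-2 logarithm, H(X) is measured in bits.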
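As a minimal sketch of this definition, the entropy of a single sequence can be computed from the observed character frequencies. The function name `shannonEntropy` is hypothetical, and treating each character's relative frequency as P(xᵢ) is an assumption about how the tool estimates the probabilities.

```javascript
// Illustrative sketch: Shannon entropy (in bits) of the characters in a string.
// Assumption: P(x) is estimated as the relative frequency of x in the input.
function shannonEntropy(sequence) {
  const counts = new Map();
  let total = 0;
  for (const ch of sequence) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
    total += 1;
  }

  let h = 0;
  for (const count of counts.values()) {
    const p = count / total;
    h -= p * Math.log2(p); // only observed characters appear, so p > 0
  }
  return h; // an empty input yields 0
}
```

A sequence made of a single repeated character has entropy 0, and a sequence in which four characters are equally frequent has entropy 2 bits, the maximum for a four-letter alphabet.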

 

  

Aln to Matrix

This tool transforms a vertically aligned set of sequences into a horizontally aligned set, while preserving the alignment relationships between positions.
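One way to read this description is as a transposition of the alignment, sketched below. This is an interpretation, not the tool's confirmed behavior: the function name `transposeAlignment`, the input layout (each input line holds one character per sequence), and the use of "-" to pad short lines are all assumptions.

```javascript
// Illustrative sketch: transpose a column-wise ("vertical") alignment into
// row-wise ("horizontal") sequences.
// Assumption: each input line holds one character from every sequence, so
// the k-th character of every line belongs to sequence k.
function transposeAlignment(lines) {
  const width = Math.max(0, ...lines.map((line) => line.length));
  const sequences = [];
  for (let col = 0; col < width; col++) {
    let seq = "";
    for (const line of lines) {
      // Assumption: pad short lines with a gap character.
      seq += col < line.length ? line[col] : "-";
    }
    sequences.push(seq);
  }
  return sequences;
}
```

Because each output character keeps its original row index as its new position, characters that shared an alignment position in the input still share one in the output.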

Pair frequencies

The "wave function" in this context is not a quantum mechanical wave function, but rather a plot of entropy values: WaveFunction = {(i, H(i)) | i = 1, ..., L}, where L is the length of the aligned sequences.

The Shannon entropy is calculated for each position i: H(i) = -∑ₐ f(a,i) * log₂(f(a,i)), where H(i) is the entropy at position i, f(a,i) is the relative frequency of character a at position i, and the sum is taken over all characters a present at that position [2].
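The per-position entropy can be sketched as follows. The function name `entropyWave` is hypothetical, and the sketch assumes that all input sequences are already aligned and have the same length.

```javascript
// Illustrative sketch: per-position Shannon entropy of aligned sequences.
// Assumption: all sequences have the same length L.
// Returns the pairs (i, H(i)) for i = 1..L as [position, entropy] arrays.
function entropyWave(sequences) {
  const length = sequences.length > 0 ? sequences[0].length : 0;
  const wave = [];
  for (let i = 0; i < length; i++) {
    // Count each character a observed at position i.
    const counts = new Map();
    for (const seq of sequences) {
      const a = seq[i];
      counts.set(a, (counts.get(a) || 0) + 1);
    }
    // H(i) = -sum over a of f(a, i) * log2 f(a, i)
    let h = 0;
    for (const count of counts.values()) {
      const f = count / sequences.length;
      h -= f * Math.log2(f);
    }
    wave.push([i + 1, h]); // positions are reported 1-based
  }
  return wave;
}
```

A fully conserved position has entropy 0, so peaks in the resulting plot mark the most variable positions of the alignment.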

Kurtosis coefficient

The kurtosis coefficient, often referred to simply as kurtosis, is a statistical measure of the "tailedness" of a distribution, that is, how extreme the deviations from the mean are. It is the fourth standardized moment of a real-valued random variable and describes the shape of its probability distribution [3].

Excess kurtosis is kurtosis minus 3. This adjustment is made because the kurtosis of the normal distribution is 3. A positive excess kurtosis indicates a leptokurtic distribution (heavier tails than the normal distribution), while a negative excess kurtosis indicates a platykurtic distribution (lighter tails).
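A minimal sketch of the calculation, using the population form m₄ / m₂² (the fourth central moment divided by the squared variance), is shown below. The function name `kurtosis` is hypothetical, and the tool may use a different estimator, for example one with a small-sample correction.

```javascript
// Illustrative sketch: population kurtosis and excess kurtosis.
// Assumption: uncorrected (population) moments are used; values must
// contain at least two distinct numbers, otherwise the result is NaN.
function kurtosis(values) {
  const n = values.length;
  const mean = values.reduce((sum, v) => sum + v, 0) / n;

  let m2 = 0; // second central moment (variance)
  let m4 = 0; // fourth central moment
  for (const v of values) {
    const d = v - mean;
    m2 += d * d;
    m4 += d * d * d * d;
  }
  m2 /= n;
  m4 /= n;

  const k = m4 / (m2 * m2);
  return { kurtosis: k, excess: k - 3 };
}
```

For the evenly spread values 1, 2, 3, 4, 5 this gives a kurtosis of 1.7 and an excess kurtosis of -1.3, a platykurtic result.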

SEQUENCE FORMATS

HGVS (Human Genome Variation Society) Nomenclature
The HGVS nomenclature is a standardized system for describing variations in DNA, RNA, and protein sequences. It provides a consistent way to report genetic mutations.

DNA Variations: Described using a 'g.' prefix (genomic), 'c.' prefix (coding DNA), or 'm.' prefix (mitochondrial DNA). Example: c.76A>T indicates an adenine (A) to thymine (T) change at position 76 in the coding sequence.
Protein Variations: Described using a 'p.' prefix. Example: p.Gly38Ser indicates a glycine (Gly) to serine (Ser) change at position 38 in the protein.
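To illustrate how a simple HGVS substitution maps onto a sequence, the sketch below applies a variant such as c.76A>T to a plain nucleotide string. It is not the HGVS to FASTA tool's code: the function name `applySubstitution` is hypothetical, only single-nucleotide substitutions with a 'g.', 'c.', or 'm.' prefix are handled, and the input string is assumed to use the same coordinate system as the variant (position 1 is the first character).

```javascript
// Illustrative sketch: apply a single-nucleotide HGVS substitution
// (for example "c.76A>T") to a sequence string.
// Assumption: the sequence starts at position 1 of the variant's
// coordinate system; other HGVS variant types are not supported.
function applySubstitution(sequence, variant) {
  const match = /^[gcm]\.(\d+)([ACGT])>([ACGT])$/.exec(variant);
  if (!match) {
    throw new Error(`Unsupported variant: ${variant}`);
  }
  const position = Number(match[1]);
  const ref = match[2];
  const alt = match[3];

  if (position < 1 || position > sequence.length) {
    throw new Error(`Position ${position} is outside the sequence`);
  }
  // Guard against applying a variant to the wrong reference sequence.
  if (sequence[position - 1] !== ref) {
    throw new Error(`Reference base mismatch at position ${position}`);
  }
  return sequence.slice(0, position - 1) + alt + sequence.slice(position);
}
```

Checking the reference base before substituting catches the common mistake of pairing a variant with the wrong transcript or coordinate system.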


FASTA
FASTA is a text-based format for representing nucleotide or peptide sequences. Each sequence in a FASTA file begins with a single-line description (preceded by a '>'), followed by lines of sequence data.
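A minimal parser for this layout might look like the sketch below. The function name `parseFasta` and the shape of the returned records are assumptions chosen for illustration.

```javascript
// Illustrative sketch: parse FASTA text into { header, sequence } records.
// Lines before the first '>' header are ignored.
function parseFasta(text) {
  const records = [];
  let current = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue;
    if (line.startsWith(">")) {
      // A description line starts a new record.
      current = { header: line.slice(1).trim(), sequence: "" };
      records.push(current);
    } else if (current) {
      // Sequence data may span several lines; join them.
      current.sequence += line;
    }
  }
  return records;
}
```

For instance, the text ">seq1 example\nACGT\nAC" yields one record with header "seq1 example" and sequence "ACGTAC".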

SEQRES
SEQRES records are used in the Protein Data Bank (PDB) file format to list the primary structure of a protein or nucleic acid sequence as it appears in the corresponding structure file.
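As a rough sketch of the FASTA to SEQRES direction, the code below turns a one-letter protein sequence into SEQRES lines with 13 residues per record, following the fixed-column layout of the PDB format. The function name `toSeqres`, the default chain identifier "A", and the restriction to the 20 standard amino acids are assumptions; the tool itself may handle more cases.

```javascript
// Illustrative sketch: format a one-letter protein sequence as PDB SEQRES lines.
// Assumption: only the 20 standard amino acids are supported.
const THREE_LETTER = {
  A: "ALA", R: "ARG", N: "ASN", D: "ASP", C: "CYS",
  Q: "GLN", E: "GLU", G: "GLY", H: "HIS", I: "ILE",
  L: "LEU", K: "LYS", M: "MET", F: "PHE", P: "PRO",
  S: "SER", T: "THR", W: "TRP", Y: "TYR", V: "VAL",
};

function toSeqres(sequence, chainId = "A") {
  const residues = [...sequence.toUpperCase()].map((aa) => {
    const name = THREE_LETTER[aa];
    if (!name) throw new Error(`Unknown residue: ${aa}`);
    return name;
  });

  const lines = [];
  for (let start = 0; start < residues.length; start += 13) {
    const serial = String(lines.length + 1).padStart(3); // record serial number
    const count = String(residues.length).padStart(4);   // residues in the chain
    const chunk = residues.slice(start, start + 13).join(" ");
    lines.push(`SEQRES ${serial} ${chainId} ${count}  ${chunk}`);
  }
  return lines;
}
```

Each record repeats the total residue count of the chain, so a 14-residue sequence produces two lines that both report 14.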

  1. Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123, 585-595.

  2. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379-423. http://dx.doi.org/10.1002/j.1538-7305.1948.tb01338.x

  3. Pearson, K. (1905). Das Fehlergesetz und seine Verallgemeinerungen durch Fechner und Pearson. A Rejoinder [The Error Law and its Generalizations by Fechner and Pearson. A Rejoinder]. Biometrika, 4(1-2), 169-212.

All rights reserved @Alper Karagöl, @Taner Karagöl

Confused? Go to my twin's website >> www.tanerkaragol.com

The views and opinions expressed on this website are solely my own and do not reflect the views, policies, or positions of any institution, organization, or entity with which I am or have been affiliated.

Nothing on this website should be construed as professional, legal, or official advice. 
