Derivations

Some Derivations and Criticisms

Simpson Index
Rarefaction
Entropy
Evenness
Chao1
Are there Species?

Simpson Index

$\sum_i{p_i^2}$, with $p_i = n_i/\sum_i{n_i}$, where $n_i$ is the abundance of species $i$

derivation: the probability of choosing a specimen of a specific species $i$ twice is $p_i*p_i$, and thus the probability that any two chosen specimens both belong to the same species is the sum of $p_i^2$ over all species $i$.
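The sum above can be sketched in a few lines of Python. This is only an illustration: the function name `simpson_index` and the abundance vector are made up for the example, and the sample is assumed to be representative of the population.

```python
def simpson_index(abundances):
    """Return sum_i p_i^2 for a list of species abundances n_i."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("need at least one specimen")
    # p_i = n_i / sum_i n_i; each species contributes p_i * p_i
    return sum((n / total) ** 2 for n in abundances)


# Hypothetical sample with three species
counts = [50, 30, 20]
print(simpson_index(counts))
```

A sample with a single species gives 1, and a uniform sample of $k$ species gives $1/k$, so the value stays between 0 and 1.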

criticism: none. It is easy to understand and, as a true probability, bounded between 0 and 1. Of course we shouldn't interpret the "choosing" of a specimen as catching one (this definitely has a different probability); rather, we have to assume that the sample is representative of the population, which is still questionable but difficult to avoid.

Rarefaction

$E(S) = \sum_i \left(1 - \binom{N - N_i}{n} / \binom{N}{n}\right)$

where $N_i$ is the number of specimens of species $i$ in a population, $N$ is the total number of specimens and $n$ is the size of the sample (in specimens).

derivation: the expected number of species in a sample drawn from an abundance vector is the sum, over all species $i$ in the population, of the probability that species $i$ is present in the sample, which in turn is one minus the probability of missing species $i$. The probability of missing species $i$ is equal to the number of possible samples without that species, $\binom{N - N_i}{n}$, divided by the total number of possible samples, $\binom{N}{n}$.

criticism: we have to assume both that the abundance distribution of our measured sample is representative of the distribution in the population and that choosing a specimen from the population is a Laplace experiment (all elementary probabilities are equal). Neither assumption holds, so a proper application of rarefaction requires that we know in advance how close we are to saturation, and is thus circular. What is especially dangerous is that every rarefaction curve shows some sort of convergence, like in this study. And ceterum censeo, I don't see how Monte Carlo methods like jackknifing make sense if we have an analytical formula.
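To show that the analytical formula is cheap to evaluate directly, here is a minimal Python sketch. The function name `expected_species` and the abundance vector are assumptions for illustration; `math.comb` needs Python 3.8 or later.

```python
from math import comb


def expected_species(abundances, n):
    """Expected number of species in a sample of n specimens.

    abundances: list of N_i (specimens per species in the population)
    n: sample size in specimens, 0 <= n <= N
    """
    N = sum(abundances)
    if not 0 <= n <= N:
        raise ValueError("sample size must lie between 0 and N")
    all_samples = comb(N, n)
    # comb(N - N_i, n) counts the samples that miss species i entirely;
    # it is 0 when fewer than n specimens remain without species i.
    return sum(1 - comb(N - N_i, n) / all_samples for N_i in abundances)


# Hypothetical population: 3 species, 10 specimens
population = [5, 3, 2]
for size in range(0, 11, 2):
    print(size, expected_species(population, size))
```

Evaluating the function for every sample size from 0 to $N$ yields the full rarefaction curve without any resampling.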

Entropy

$entropy = -\sum_i{p_i*\ln\left({p_i}\right)}$, again with $p_i = n_i/\sum_i{n_i}$, where $n_i$ is the abundance of species $i$

derivation: the basic thought experiment in classical thermodynamics starts with a set of distinguishable species kept separately in different volumes, which are allowed to mix adiabatically (no heat is exchanged) by opening some valve. As, additionally, the total volume is constant, the difference in internal energy between the final and initial state is zero for the whole system: $0 = dU = TdS - pdV$, with $U$, $T$, $S$, $p$, and $V$ being internal energy, temperature, entropy, pressure, and volume, respectively. Using the ideal gas law $pV = nRT$, with $n$ and $R$ being the number of moles and the ideal gas constant, this means

$dS = (p/T)\,dV = (nR/V)\,dV = nR\,d\ln\left({V}\right)$,

and thus the difference in entropy for species $i$, which expands from its initial volume $V_i$ into the total volume $V$, is

$\Delta S_i = n_i*R*\ln\left({V/V_i}\right) = -n_i*R*\ln\left({V_i/V}\right) = -n_i*R*\ln\left({p_i}\right)$

where $p_i = V_i/V = n_i/n$ is the mole fraction of species $i$ (all species start at the same pressure and temperature; $p_i$ is not to be confused with the pressure $p$). As entropy change per mole of mixture this reads $\Delta \hat S_i = -p_i*R*\ln\left({p_i}\right)$. To get the total entropy of mixing we simply sum over all species:

$\Delta \hat S = -R*\sum_i p_i*\ln\left({p_i}\right)$,

which differs from the more metaphorical use above only by the ideal gas constant $R$, which is set to 1.
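As a numerical cross-check of the derivation, the following Python sketch computes the entropy index directly and, separately, the entropy of mixing per mole from the expansion of each species. Both function names and the numbers are assumptions made for the example; the initial volumes are taken to be proportional to the amounts, i.e. $V_i/V = n_i/n$.

```python
from math import log


def shannon_entropy(abundances):
    """-sum_i p_i * ln(p_i) with p_i = n_i / sum_i n_i (absent species skipped)."""
    total = sum(abundances)
    return -sum((n / total) * log(n / total) for n in abundances if n > 0)


def mixing_entropy_per_mole(moles, R=1.0):
    """Entropy of mixing per mole of an ideal mixture.

    Each species i expands from V_i to V, contributing
    n_i * R * ln(V / V_i), where V / V_i = n / n_i.
    """
    n = sum(moles)
    delta_s = sum(n_i * R * log(n / n_i) for n_i in moles if n_i > 0)
    return delta_s / n


amounts = [50, 30, 20]  # hypothetical abundances or mole numbers
print(shannon_entropy(amounts))
print(mixing_entropy_per_mole(amounts))
```

With $R$ set to 1 the two functions agree up to rounding, which is just the statement that the index is the ideal mixing entropy per mole.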

criticism: the derivation above should make clear that the entropy of an ideal mixture has nothing to do with the actual process of mixing: we get the same result if we just expand each (ideal gas) species separately into the volume $V = \sum_i V_i$. What actually causes the calculated increase in entropy is simply the increase in phase space available to every species. With interacting species, like polar molecules, this formula doesn't make much sense, and if we deal with a chemical reaction, the entropy of an ideal mixture usually makes a negligible contribution. And flies do react with each other... (The same arguments hold if we choose Boltzmann's microstates for the derivation, or if we rename the term entropy to the thereby defined "information content".)

Evenness

$entropy / \max(entropy) = \left(\sum_i{p_i*\ln\left({p_i}\right)}\right) / \ln\left({p}\right)$

where $p_i$ is again the probability of choosing species $i$ in the population and $p$ is the probability of choosing species $i$ if all species were equally abundant (the uniform distribution).

derivation: to find $\max(entropy)$ we can't simply set the first derivative to zero; we have to take into account, as a variational constraint, that the $p_i$ must sum to 1. Writing $S = -\sum_i p_i*\ln\left({p_i}\right)$ for the entropy, maximizing $S$ is the same as minimizing $-S$, so we build the function

$F = -S + (\lambda - 1)\sum_i p_i = \sum_i p_i*\ln\left({p_i}\right) + (\lambda - 1)\sum_i p_i$

with the Lagrangian multiplier $(\lambda - 1)$ and set its total derivative to zero:

$0 = dF = \sum_i \left(\frac{\partial F}{\partial p_i}\right)_{p_{j \neq i}} dp_i$.

Now we can vary all $p_i$ independently of each other and thus set all coefficients (partial derivatives) separately to zero:

$0 = \frac{\partial \left(p_i*\ln\left({p_i}\right) + (\lambda - 1)p_i\right)}{\partial p_i} = \ln\left({p_i}\right) + p_i*\left(\frac{\partial \ln\left({p_i}\right)}{\partial p_i}\right) + (\lambda - 1)$.

Using $\partial \ln\left({p_i}\right) = \frac{1}{p_i}\partial p_i$, the middle term equals 1 and the equation reduces to $\ln\left({p_i}\right) + \lambda = 0$, so we get

$p_i = \exp\left({-\lambda}\right)$,

which already shows that in the state of maximum entropy all $p_i$ must be the same size, say $p$. More precisely, $\sum_i p_i = 1 = n*\exp\left({-\lambda}\right)$, thus $\lambda = \ln\left({n}\right)$ and finally $p_i = p = 1/n$, where $n$ is the number of species (not specimens). The entropy for the uniform distribution is then:

$\max(entropy) = -\sum_{i=1}^{n} p*\ln\left({p}\right) = -n*p*\ln\left({p}\right) = -\ln\left({p}\right) = \ln\left({n}\right)$.
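The result can be checked numerically with a short Python sketch. The function name `evenness` and the abundance vectors are hypothetical, and the number of species $n$ is assumed to be the number of species actually present in the sample.

```python
from math import log


def evenness(abundances):
    """Entropy divided by its maximum ln(n), n = number of species present."""
    present = [a for a in abundances if a > 0]
    if len(present) < 2:
        raise ValueError("evenness needs at least two species")
    total = sum(present)
    entropy = -sum((a / total) * log(a / total) for a in present)
    # For the uniform distribution p = 1/n the entropy is -ln(p) = ln(n)
    return entropy / log(len(present))


print(evenness([5, 5, 5, 5]))   # uniform sample
print(evenness([97, 1, 1, 1]))  # strongly dominated sample
```

The uniform sample reaches the upper bound of 1 (up to rounding), while the dominated sample falls well below it.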

criticism: the slightly "unbiological" derivation above shouldn't conceal the important point that evenness is just the entropy of an ideal mixture, but with an upper bound of 1, which makes it something very close to a probability. Before delving into all these technical details we should strongly consider sticking to the Simpson index, which carries almost the same information and is so much easier to understand.

Chao1

$S_{expected} = S_{obs} + f_1^2/(2*f_2)$

where $f_1$ is the number of singletons in a sample (the number of species caught only once) and $f_2$ the number of doubletons
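A minimal Python sketch of the formula as written above, assuming doubletons are the species caught exactly twice. The function name `chao1` and the sample are made up for illustration, and the sketch simply refuses to evaluate the formula when there are no doubletons, because the expression is then undefined.

```python
def chao1(abundances):
    """S_obs + f_1^2 / (2 * f_2) for a list of per-species counts."""
    s_obs = sum(1 for n in abundances if n > 0)  # observed species
    f1 = sum(1 for n in abundances if n == 1)    # singletons
    f2 = sum(1 for n in abundances if n == 2)    # doubletons
    if f2 == 0:
        raise ValueError("formula is undefined without doubletons")
    return s_obs + f1 ** 2 / (2 * f2)


# Hypothetical sample: 7 species, 3 singletons, 2 doubletons
sample = [10, 4, 2, 2, 1, 1, 1]
print(chao1(sample))
```

For this sample the estimate is $7 + 9/4 = 9.25$ species.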

derivation: Chao, A. (1984) Non-parametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11, 265–270.

criticism: not yet

Are there Species?

In the 1970s and 1980s we had a broad discussion (mostly in the mass media) about the biological species concept, whose only trace today is that the topic is now "mega-out". But there has been no solution, and I don't think it's a good idea to ignore this point. To me it sometimes appears as if taxonomy and the species concept are like a black hole buried deep inside biology, which, once it becomes clear that there is no species concept, could lead to the collapse of many fields of modern biology.

In the simplest (i.e. mathematical) terminology a species is an equivalence class, which is created by an equivalence relation R. Such a relation needs just 3 properties: aRa is true (reflexivity), aRb entails bRa (symmetry), and aRb together with bRc entails aRc (transitivity). (A simple example of a transitive relation is "smaller than" or "<": a < b and b < c entails a < c.) These three properties are sufficient to ensure that any set on which an equivalence relation is defined can be partitioned into disjoint subsets (or classes): in our case the species.

The problem in taxonomy is: we don't have such an equivalence relation. What we use instead is the similarity relation, and this is not transitive.

As an example let's talk about the most famous species definition: two specimens from a population belong to the same species if they can produce fertile progeny (we skip the sex discussion). Let's imagine the population is somehow ordered (by its gene pool or whatever) in a plane, with the more similar specimens situated in the middle and the less similar ones farther out on the margins. Of course the specimens from the middle will all belong to the same species. Furthermore, let's assume that specimens from the left margin belong to the same species as the middle ones, and so do the specimens from the right margin. But this does not entail that the specimens from the left and right margins are still similar enough to produce fertile offspring.

More precisely this means that if specimen a belongs to the same species as H (the holotype) and H belongs to the same species as b, this does NOT entail that a and b belong to the same species!
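The argument can be made concrete with a toy model in Python. Everything here is an assumption for illustration: the specimens are reduced to positions on a line, and "similar" is taken to mean "not farther apart than a fixed threshold", standing in for "can produce fertile progeny".

```python
from itertools import product

# Hypothetical positions of three specimens: left margin, holotype, right margin
positions = {"a": 0.0, "H": 1.0, "b": 2.0}


def similar(x, y, threshold=1.0):
    """Toy similarity relation: close enough in the ordering."""
    return abs(positions[x] - positions[y]) <= threshold


def is_reflexive(items, rel):
    return all(rel(x, x) for x in items)


def is_symmetric(items, rel):
    return all(rel(y, x) for x, y in product(items, repeat=2) if rel(x, y))


def is_transitive(items, rel):
    return all(
        rel(x, z)
        for x, y, z in product(items, repeat=3)
        if rel(x, y) and rel(y, z)
    )


specimens = list(positions)
print(is_reflexive(specimens, similar))   # True
print(is_symmetric(specimens, similar))   # True
print(is_transitive(specimens, similar))  # False: a~H and H~b, but not a~b
```

The relation passes the first two tests and fails the third, so it cannot partition the specimens into disjoint classes.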

And we are not talking about exceptions to a rule (the typical excuse of biologists); we are talking about a concept which is self-contradictory as long as we don't find a proper equivalence relation.