Internucleotide correlations and nucleotide periodicity in Drosophila mtDNA: New evidence for panselective evolution

The homogeneity of the distribution of the second base of dinucleotides in relation to the first, with the two bases separated by 0, 1, 2, ... 21 nucleotide sites, was analyzed in the HIV-1 genome (cDNA), the Drosophila mtDNA, the Drosophila Torso gene and the human β-globin gene. These four DNA segments showed highly significant heterogeneities of base distributions that cannot be accounted for by neutral or nearly neutral evolution or by the “neighbor influence” of nucleotides on mutation rates. High correlations were found between the bases of dinucleotides separated by 0, 1 or more sites. A periodicity of three consecutive significance values (measured by the χ²₉) was found only in Drosophila mtDNA. This periodicity may be due to an unknown structure or organization of mtDNA. This non-random distribution of the two bases of dinucleotides, widespread throughout these DNA segments, is more compatible with panselective evolution and generalized internucleotide co-adaptation.

Key terms: nucleotide heterogeneity, DNA periodicity, dinucleotide analysis, neutral evolution, panselective evolution.

Corresponding author: Carlos Y Valenzuela, Programa de Genética Humana, Instituto de Ciencias Biomédicas (ICBM), Facultad de Medicina, Universidad de Chile, Independencia 1027, Casilla 70061, Independencia, Chile. Fax: (56-2) 7373158; Phone: (56-2) 9786302; E-mail: cvalenzu@med.uchile.cl

Received: August 22, 2009. In revised form: August 3, 2010. Accepted: November 3, 2010.


INTRODUCTION
Most, if not all, studies on molecular evolution have been made assuming that evolution is related, directly or indirectly, to protein synthesis and to the genetic code (Nei, 2005; Nei et al., 2010). The evolutionary mechanisms that gave rise to the present genetic code among several others, the selection of four bases, the length and shape of chromosomes, the secondary and tertiary organization of nucleic acids, non-protein-coding DNA (i.e. most of eukaryote DNA), base isochores and signatures, and several other pre-transcriptional or non-transcriptional traits are seldom or never studied (Valenzuela, 2007, 2009; Valenzuela et al., 2010).

In a previous article, the non-random association of the two bases of a dinucleotide, where the first base was separated by 0, 1, 2 and 3 nucleotide sites from the second base, showed an almost pan-selective evolution in the HIV-1 complete genome and in the corresponding segment of the env GP120 gene of its envelope (Valenzuela, 2009). Our conclusion was that selection made HIV-1 genomes whose nucleotide bases were correlated with those of the neighborhood, along the whole genome, have a higher probability of remaining in the population; this internucleotide non-random distribution indicated an internucleotide co-adapted organization of the HIV-1 genome. This could occur because of the need for the virus' RNA to be folded and put into the envelope. Thus, the tertiary RNA or DNA structure by itself may be regarded as having been subjected to strong selection pressure, besides selection due to transcriptional or protein functions. We also found that in 103 HIV-1 strains, chosen from widely separated regions in 35 countries on 5 continents, fixation at each site of the GP120 env gene was distributed as if each site had its own pattern of mutation. This independent evidence also affirms the condition of pan-selective evolution (Valenzuela et al., 2010).

The present study analyses the complete HIV-1 genome, the complete Drosophila mtDNA genome and two eukaryote nuclear genes in search of internucleotide correlations along whole genomes or DNA segments.

RATIONALE

Base frequency expectancies under evolutionary models
An allele in a gene locus or a base in a nucleotide site exists at frequency p = 1.0 (fixation), 1.0 > p > 0.0 (polymorphism) or p = 0.0 (elimination or loss); these are the exhaustive classes of the state of alleles or bases in a locus or site, respectively, in a population. Our work deals with bases in sites, thus we shall refer to the bases Adenine (A), Thymine (T), Guanine (G) and Cytosine (C) in a nucleotide site. More precisely, we study the genetic equilibrium of bases in nucleotide sites (Valenzuela et al., 2010).

The present theories of evolution, namely the Synthetic Theory of Evolution (STE), the Neutral Theory of Evolution (NTE) and the Nearly-Neutral Theory of Evolution (NNTE), agree that the genetic variation in evolution emerges by mutation and that selection and drift (random fluctuations of base frequencies) are the main factors (less important factors are migration and, in sexual species, assortative mating) that lead mutations to substitution, fixation or loss, or to remain polymorphic. However, the theories disagree on the importance of these factors. The STE proposes that fixations, losses or polymorphisms are mostly due to selection (adaptive processes) and that drift rarely contributes to evolution, except in extinction processes; drift is not a directional evolutionary force, and its expected contribution (by the nature of random processes) to evolutionary processes is zero, on average. The NTE proposes that most of evolution has occurred at random (by drift) and that selection rarely contributes to evolution, mainly through purifying selection (lethal or semi-lethal base conditions); positive selection is infrequent. The NNTE proposes the same as the NTE, but adds some selection processes with selection coefficients similar in value to mutation rates; the inclusion by the NNTE of a selection coefficient similar to 1/N (N is the population size) leads to inconsistencies when populations change their N, making evolutionary processes fluctuate between the STE and NNTE models (Kimura, 1991a, 1991b, 1993; Ohta, 1992; Ohta and Gillespie, 1996; Nei, 2005; Leigh, 2007; Valenzuela, 2007, 2009, 2010a, 2010b; Nei et al., 2010; Valenzuela et al., 2010).

No quantitative value has ever been given for what counts as a "rare" or "small" contribution of selection or drift to evolution, so these claims cannot be tested with data. NNTE models have included all the values of selection coefficients (even positive ones) in variable proportions, interacting with drift at various levels, so that they cannot be differentiated from STE models (Ohta and Gillespie, 1996; Ohta, 2002; Valenzuela, 2010a). Fortunately, these theoretical models lead to sharp expectancies in relation to base frequencies in a site and can be tested with data (Valenzuela, 2009; Valenzuela et al., 2010).

NTE expectancies for equilibrium frequencies
The neutral (random, NTE) expectancy of equilibrium frequencies for the four bases in a site, due to equal random forward and backward mutation rates, is (by definition of neutrality and randomness) 1/4 A, 1/4 T, 1/4 G, 1/4 C (Li, 1997; Nei, 1987; Valenzuela et al., 2010). This expectancy changes according to the actual mutation rates given by the matrix of mutation from one base to the others (Nei, 1987; Valenzuela et al., 2010), but the expected equality of frequencies between A and T and between G and C due to base complementarity is a strong constraint (Sueoka, 1995; Valenzuela, 1997). Thus, the expected neutral situation for a locus or a site is a polymorphism of alleles or bases, respectively, and not fixation or loss, regardless of the population size. With recurrent forward and backward mutation, fixation is impossible (Wright, 1931; Feller, 1951; Jacquard, 1970; Valenzuela and Santos, 1996; Valenzuela, 1997, 2000, 2002, 2007, 2009; Valenzuela et al., 2010).

The fact that gene or base fixation is impossible has not been accepted because there is a regrettable confusion between replacement or substitution and fixation (Valenzuela, 2000, 2002, 2007, 2009; Valenzuela et al., 2010). This confusion dates from the founding articles of the NTE (Kimura, 1957, 1962, 1968, 1993; King and Jukes, 1969). Substitution is the arrival of an allele or base at frequency 1.0 (a turnover process); fixation is the permanence of an allele or base at frequency 1.0 (a fixated state); they are antithetical physical processes. They are also dimensionally different: substitutions (sub) occur per site (s) or locus, per generation (g) or year (sub/s/g); fixations (fix) occur per site, but not per generation (fix/s; they may be expressed per set of data under analysis), because the number of generations during which a base remains fixated differs greatly from taxon to taxon (or biotic group) (Valenzuela, 2000, 2002, 2007, 2009; Valenzuela et al., 2010). All the articles on phylogenies use fixations (taxonomic characters), but they are presented as substitutions or replacements; this error is widespread. The error is also included in models, such as that of Nei et al. (2010), who, when defining the neutral theory following the studies of Fisher and Wright, state that "the probability of fixation (u) of a new mutant allele (A₂) in the population is …"; however, this is not the probability of fixation, it is the probability for a new mutant allele to reach the frequency 1.0 (substitution).

Once an allele or base has reached the frequency 1.0, the probability of remaining at this frequency (fixation) is 0, because in the next generation its frequency should be (1-m), where m is the mutation rate (Wright, 1931; Feller, 1951; Valenzuela and Santos, 1996; Valenzuela, 2000, 2002, 2007, 2009). In the nth generation, the probability that this allele or base or their copies remain at frequency 1.0 is (1-m)ⁿ; as 1 > (1-m) > 0, this probability tends to 0 as n increases (Valenzuela and Santos, 1996; Valenzuela, 2000). At the time of Wright and Fisher, the distinction between fixation and substitution was not made; however, it is clear from the Wright (1931) article that fixation (a permanent state as we understand it at present) is impossible: "If mutation is occurring, however low the rate, the decline in heterozygosis, following isolation of a relatively small group from a large population, cannot go on indefinitely. There will come a time when the chance elimination of genes will be exactly balanced by new genes arising by mutation". This agrees with the expected polymorphism of neutral frequencies of bases in a site.

To demonstrate conclusively that fixation is impossible independently of population size, a population of one bacterium was studied (Valenzuela, 2000); in this population (where drift is maximal) only monomorphism is possible at any site, but in any generation a mutation can substitute the monomorphic base at this site, making fixation impossible. Fixation is factually impossible, as is demonstrated in our daily life: we age, become ill and die by mutation; our genome is unstable during our life (Valenzuela, 2007, 2009).
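The decay of the probability (1-m)ⁿ described above can be illustrated with a few lines of Python. This is only a sketch: the mutation rate m = 10⁻⁸ and the chosen numbers of generations are illustrative assumptions, not values taken from the data, and the function name is hypothetical.

```python
import math


def prob_still_at_frequency_one(m: float, n: int) -> float:
    """Return (1 - m)**n, the probability quoted in the text that a base
    at frequency 1.0 is still at frequency 1.0 after n generations.

    Computed as exp(n * log(1 - m)) so that very small m and very large n
    do not lose precision.
    """
    return math.exp(n * math.log1p(-m))


m = 1e-8  # assumed per-site, per-generation mutation rate (illustrative)
for n in (1, 10**6, 10**8, 10**9, 10**10):
    print(f"n = {n:>11d}   (1-m)^n = {prob_still_at_frequency_one(m, n):.6g}")
```

For any m > 0 the printed values decrease monotonically towards 0 as n grows, which is the behaviour the argument relies on.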

NNTE expectancies for equilibrium frequencies
Under the NNTE, assuming a positive selection coefficient for one base, say A, equal to the forward mutation rate to T, G and C, the expected equilibrium base frequencies in a site are near 0.43 A, 0.19 T, 0.19 G, 0.19 C [in haploid organisms; the demonstration is beyond the scope of this article, see Wright (1931)]. That is, the expectancy for a site according to the NNTE is also a polymorphism of the four bases and not fixation or loss, regardless of the population size.
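The quoted values can be approximated numerically with a deterministic haploid recursion. The sketch below rests on assumptions that are not stated in the article: mutation is symmetric with rate u from any base to each of the other three, and the selection coefficient of A is taken equal to the total forward mutation rate of A (sc = 3u), which is the reading that yields values near 0.43 and 0.19. Function and parameter names are hypothetical.

```python
def equilibrium_base_frequencies(u, sc, generations=50_000, start=None):
    """Iterate selection followed by mutation at one site of a haploid organism.

    u  : assumed mutation rate from any base to each of the other three bases
    sc : selection coefficient of base A (fitness 1 + sc; T, G, C have fitness 1)
    """
    freqs = dict(start) if start else {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}
    fitness = {"A": 1.0 + sc, "T": 1.0, "G": 1.0, "C": 1.0}
    for _ in range(generations):
        mean_fitness = sum(freqs[b] * fitness[b] for b in freqs)
        selected = {b: freqs[b] * fitness[b] / mean_fitness for b in freqs}
        # Each base loses 3u to the other bases and gains u from each of them:
        # f' = f * (1 - 3u) + u * (1 - f) = f * (1 - 4u) + u
        freqs = {b: selected[b] * (1.0 - 4.0 * u) + u for b in selected}
    return freqs


u = 1e-3  # illustrative mutation rate, chosen large so the loop converges quickly
nearly_neutral = equilibrium_base_frequencies(u, sc=3 * u)
neutral = equilibrium_base_frequencies(
    u, sc=0.0, start={"A": 0.7, "T": 0.1, "G": 0.1, "C": 0.1}
)
print({b: round(f, 3) for b, f in nearly_neutral.items()})
print({b: round(f, 3) for b, f in neutral.items()})
```

With sc = 0 the same recursion returns to 1/4 for each base from a skewed starting point, which is the neutral expectancy of the previous section.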

STE expectancies for equilibrium frequencies
The STE expectancy is also a polymorphism that depends on the mutation rate (m) and the selection coefficient (sc). For a base under strong negative selection in a site, the expected equilibrium frequency is near m/|sc| (Li, 1976). For example (working with a haploid organism), if m = 10⁻⁸ and sc = -0.1, the expected frequency of this base is 10⁻⁷; for a base under strong positive selection (sc = +0.1) the frequency is 1 - m/sc (the complement of m/sc), which in the example is near 0.9999999 (Wright, 1931). These highly positively selected bases appear fixated (some of them are used as taxonomic characters in phylogenies), but they are not fixated: they remain at an equilibrium frequency near 1, and mutations inexorably occur at these sites. STE expected equilibrium frequencies may be any that sc and m stochastically determine.
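The two numerical examples above can be checked with a minimal two-state recursion (focal base versus all other bases). This is a sketch under assumptions that are not in the article: a haploid organism, selection followed by mutation in each generation, and the same rate m for forward and backward mutation. The agreement with m/|sc| and 1 - m/sc is approximate (to first order in m).

```python
def focal_base_equilibrium(m, sc, generations=5_000, start=0.5):
    """Deterministic frequency of a focal base with fitness 1 + sc
    (all other bases have fitness 1) after many generations."""
    q = start
    for _ in range(generations):
        q = q * (1.0 + sc) / (1.0 + sc * q)   # selection
        q = q * (1.0 - m) + (1.0 - q) * m     # forward and backward mutation
    return q


m = 1e-8
negative = focal_base_equilibrium(m, sc=-0.1)  # expected near m/|sc| = 1e-7
positive = focal_base_equilibrium(m, sc=+0.1)  # expected near 1 - m/sc
print(f"sc = -0.1: equilibrium frequency     = {negative:.3e}")
print(f"sc = +0.1: 1 - equilibrium frequency = {1.0 - positive:.3e}")
```

In both cases the frequency settles close to, but never exactly at, 0 or 1, which is the point made in the text: strongly selected bases appear lost or fixated while remaining polymorphic.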

Non-overlapping conditions of evolutionary models
Moreover, all these equilibrium frequencies (for the three models) are stable and resilient; the models do not overlap, nor are there overlapping situations among them (except for factual sample variations). If random fluctuations move frequencies away from the equilibrium, the evolutionary factors tend to re-establish them immediately (equilibria are resilient). The equilibrium frequencies, for the three models, are determined by the mutation rate and the selection coefficient, independently of N and drift. Neutral (random) fluctuations may transitorily move these equilibrium frequencies up or down, but their evolutionary contribution over thousands or millions of generations is zero on average.

Expectancies for nucleotide correlations
Now we provide a rationale for the correlation among sites. As mutations occur independently (not necessarily at random) at each site, the correlation of bases among sites, in long DNA segments, is expected to be zero. Moreover, mutations occur without evolutionary "purpose", that is, the causal mechanisms of mutation are different from the processes of selection and drift (they are independent). As mutations (in a wide sense, including chromosome mutation, duplication, etc.) are accepted by all the evolutionary theories as the basis for evolution, the fundamental expectation for internucleotide correlations due to mutation is zero. The NTE and the NNTE propose that, besides mutation, evolution (fixation, loss or polymorphism) in a nucleotide site is mostly determined by random genetic drift, and that selection or adaptive processes have limited importance. The expectancy for internucleotide correlation due to mutation is zero; if we add random processes to this expectation, it continues to be zero. Moreover, if the expectancy of correlations due to mutation were not exactly zero because of some small correlations among sites due to the hypothetical neighbor influence of a base on neighboring mutation rates, drift would contribute to blurring these small correlations and bringing them nearer to zero.

Gatlin (1976) found, by studying longitudinal nucleotide or amino acid series, that there was a high correlation among nucleotides in genomes. This was taken as a demonstration of the non-neutrality of evolution. Neutralists counter-argued that this was not a refutation of neutralism because a base could influence the mutation rates of the neighboring sites (Jukes, 1976; Kimura and Ohta, 1977). But neutralists have never demonstrated this neighborhood influence nor shown that Gatlin's order of nucleotides was produced by a factual neighbor influence. This debate ended without resolution, and the battle was assumed to have been won by neutralists. Moreover, even accepting the neighbor influence, the expected frequencies of bases at sites are homogeneously distributed among sites and continue to be mostly an independent polymorphism for each site. Historically, and since the expected situation for a site is a polymorphism of the 4 bases, the expected influence on neighboring sites is this historical influence of the 4 bases, which is, on average, equal for all the sites, regardless of the actual base in each site. The neighbor influence proposed by neutralists is only partially possible for small neighborhoods (2 or 3 sites up and downstream) if, and only if, non-selective (neutral or random) fixation for millions of generations can occur. Since fixation is impossible, finding a high correlation between both bases of a dinucleotide or pair of bases separated by more than 2 sites is a strong refutation of neutralism, near-neutralism and the neighbor influence hypothesis. This is equivalent to searching for co-adaptive processes among alleles of two or more loci. Finding non-random associations among nucleotides separated by K sites demonstrates co-nucleotide-adaptation.

There are 4 bases for the first and second position in a dinucleotide, thus 16 dinucleotides are possible. If the NTE or the NNTE is true, the expected frequency of each of these 16 dinucleotides is obtained by the product of the total observed frequencies of its two bases in the analyzed DNA segment. For example, if the frequency of G is 0.2, the expected frequency of G-G pairs whose bases are separated by 100 (or K) sites is 0.04. If we found 0.12, this would indicate that G-G pairs have been positively selected (and adapted to life conditions) and that other pairs were negatively selected and did not remain during evolution in this DNA segment. Thus, these G-G pairs whose bases are separated by 100 sites would be co-adapted. Mutation and drift alone cannot construct a meaningful (adapted) DNA segment (by the nature of both processes); only selection by environmental requirements can give the randomly produced correlations among nucleotide sites the prevalence needed to obtain and maintain statistically biased sequences that are better adapted to these environments. The present study aims to examine these biased sequences (associations of the bases of a dinucleotide). This is not a quantitative problem that can be solved by adjusting N, coefficients, types of mutations, historical processes, and so on; it is a qualitative (factual and conceptual) problem that needs to be rigorously handled. Of course, we should be cautious with statistical significances and work with a small type I error (alpha = 0.01 or less).
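The G-G example in this section can be written out as a short calculation. In the sketch below only the frequency of G (0.2) and the observed G-G frequency (0.12) come from the text; the frequencies assigned to A, T and C are assumed placeholders chosen so that the four frequencies sum to 1, and the function name is hypothetical.

```python
from itertools import product


def expected_dinucleotide_frequencies(base_freqs):
    """Expected frequency of each of the 16 ordered base pairs when the two
    positions are independent: the product of the two base frequencies."""
    return {
        first + second: base_freqs[first] * base_freqs[second]
        for first, second in product(base_freqs, repeat=2)
    }


# G = 0.2 is the value used in the text; A, T and C are assumed placeholders.
base_freqs = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}
expected = expected_dinucleotide_frequencies(base_freqs)

observed_gg = 0.12  # observed G-G frequency in the example of the text
excess_ratio = observed_gg / expected["GG"]
print(f"expected G-G = {expected['GG']:.2f}, observed/expected = {excess_ratio:.1f}")
```

Because the 16 expected frequencies sum to 1, an excess of one pair (here G-G at three times its expectation) necessarily implies a deficit of other pairs.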

METHODS
The analysis of dinucleotides (longitudinal pairs of mononucleotides) whose two bases were separated by 0 (consecutive bases), 1, 2, 3 … 21 nucleotide sites was performed. The first nucleotide of the pair was chosen consecutively from the first to the (S-K)th nucleotide of the genome or DNA segment (S = total number of nucleotides of the genome or DNA segment; K = number of nucleotide sites that separate the first and the second nucleotide of the pair); the second nucleotide was chosen 0, 1, 2 … K nucleotide sites downstream from the first one. Two nucleotides, with 4 alternatives each, yield 16 dinucleotides whose frequency expectancies are obtained by the product of the frequencies of both mononucleotides, calculated from the marginal totals of the 4x4 matrix for the possible bases of the first and second nucleotide. In this analysis, the homogeneity or heterogeneity of the distribution of the second nucleotide base according to the first was tested by a χ²₉ test for homogeneity, with the 9 degrees of freedom (df; 4 rows minus 1, times 4 columns minus 1) noted as a subscript. The critical values of the χ²₉ for the significance levels 0.05, 0.025, 0.01, 0.005 and 0.001 are 16.9 (17), 19.0 (19), 21.7 (22), 23.6 (24) and 27.9 (28), respectively. For very large χ²₉ values, a conservative z test approximation was used, taking 9 (df) as the mean and 18 (2 x df) as the variance of the normal distribution. More specific details of the method are given in a previous article (Valenzuela, 2009) and results are presented.

[…] significant separations at 0, 1, 5, 11, 14 and 20 nucleotide sites were so high (except at 14) as to render the whole analysis highly significant. In human βHb there are 14 significant (most of them highly significant) and 8 non-significant values. Table III shows a remarkable structure of significances in Drosophila mtDNA, which is not present in the other DNA segments. From separation 2 to separation 21, mtDNA showed a periodicity of one highly significant χ²₉ value (several hundred) followed by two less significant ones (with the exception of separation 3, less than one hundred and close to 60). This periodicity was also found between 300 and 309 separations and between 600 and 609 separations (Valenzuela, 2010a). In a further analysis, it was found between 1000 and 1009 and between 2002 and 2011 separations (Valenzuela, 2010c). Systematically significant values of the χ²₉ were found for HIV-1 up to K = 320, for the Torso gene up to K = 25 and for the βHb gene up to K = 70.
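The test described in the Methods can be sketched in Python as follows. This is an illustrative reimplementation, not the program used for the study: the function names are hypothetical, the pairing convention (a separation of K sites means the second base lies K + 1 positions downstream) is an assumption derived from K = 0 meaning consecutive bases, and pairs containing symbols other than A, C, G, T are skipped by choice.

```python
from collections import Counter
from math import sqrt

BASES = "ACGT"


def chi2_dinucleotide(seq, k):
    """Chi-square for homogeneity of the second base according to the first,
    for pairs of bases separated by k sites (k = 0: adjacent bases).

    The statistic has 9 df only if all four bases occur in both positions.
    """
    seq = seq.upper()
    gap = k + 1
    pairs = [(a, b) for a, b in zip(seq, seq[gap:]) if a in BASES and b in BASES]
    n = len(pairs)
    if n == 0:
        raise ValueError("sequence too short for this separation")
    observed = Counter(pairs)
    first_totals = Counter(a for a, _ in pairs)    # row marginals
    second_totals = Counter(b for _, b in pairs)   # column marginals
    chi2 = 0.0
    for a in BASES:
        for b in BASES:
            expected = first_totals[a] * second_totals[b] / n
            if expected > 0:
                chi2 += (observed[(a, b)] - expected) ** 2 / expected
    return chi2


def z_approximation(chi2, df=9):
    """Normal approximation from the Methods: mean df, variance 2 * df."""
    return (chi2 - df) / sqrt(2 * df)


toy = "ACGT" * 100  # artificial sequence with an exact period of 4
for k in range(0, 6):
    value = chi2_dinucleotide(toy, k)
    print(f"K = {k}: chi2 = {value:.1f}, z = {z_approximation(value):.1f}")
```

The toy sequence is fully deterministic, so every separation gives a very large χ²; with real sequences the values would be compared against the critical values listed in the Methods.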

DISCUSSION
These correlations between the mononucleotides of a dinucleotide separated by 0, 1, 2 … K nucleotide sites make neutral and nearly neutral evolution untenable, as well as the hypothesis that the nucleotide neighborhood systematically changes the mutation rate (see Rationale and Valenzuela, 1997, 2000, 2007, 2009; Valenzuela et al., 2010). As stated in the Rationale, these correlations are non-random (biased) frequencies of dinucleotides; thus they imply that there are excesses of dinucleotides (over the random expectation) that were positively selected and are maintained by positive selection, and deficiencies of dinucleotides, whose frequencies are lower than randomly expected, that were negatively selected and are maintained at lower frequencies by negative selection. Even in the case of Torso, the least significant DNA segment, any site correlates significantly with several sites that are separated by up to 25 sites upstream in the genome. These correlations are probably due to pre-transcriptional or non-transcriptional processes. The lower significance found in the Torso and human βHb genes may be due to their small number of sites; however, these segments have sufficiently significant results to consider them qualitatively equally significant, and the βHb gene shows a higher significance than Torso, despite being half its length.

The periodicity in the level of values of three consecutive χ²₉ tests, found only in Drosophila mtDNA, was unexpected and deserves a more detailed and comparative analysis. It shows that there are other levels of organization of nucleic acids, at least as far as mtDNA is concerned. Our analyses scanned these DNA segments completely, so these non-random associations are widespread throughout the four segments; this indicates that evolution is panselective, not neutral or nearly neutral. Any base is selected in the whole context of the genome of the individual to which it belongs. Thus, there is always a co-selective process that implies a co-adapted condition. DNA sequences (with their internucleotide correlations) that do not fit the environmental requirements are negatively selected and do not remain in the population.

Note: These ideas were presented at the Annual Meeting of the Chilean Society of Evolution and the Chilean Society of Genetics, in Concepción, Chile, October 21-23, 2009.

TABLE II
Expected and observed dinucleotide frequencies whose bases are separated by 0, 1, 2 and 3 sites. I.
Drosophila torso and human βHb genes

TABLE III
χ²₉ for homogeneity found in dinucleotides separated by 0 to 21 nucleotide sites in HIV-1, Drosophila mtDNA, Drosophila Torso gene and human βHb gene