Doppel delay of 54 months
Origin of doppel chimeric transcripts
First human doppel allele found: M174T
Doppelcross: cis-trans genetic issues
Alignments: getting them straight
Doppel disulfide in 3D (and related issues)
Glycosylation sites in bird prion and mammalian doppel
Candidate exon 2 switch alleles
Intergenic exon 1 is within a Mer115 insertion element
142,634 genes on human chromosome?
US, UK oppose gene withholding
1 Oct 99 webmaster opinion
The webmaster has learned that the DNA sequence of mouse doppel gene was finished by April of 1995 and that the doppel gene was recognized almost immediately. That news was kept secret from hundreds of affected CJD researchers until print publication 54 months later in the 1 Oct 99 J Mol Biol. And the meagre characterization there implies still other long-completed papers and sequences will trickle down slowly in the years ahead, in direct violation of federal government policy, scientific norms, and biomedical ethics.
During the delay, over two thousand Americans died from CJD because no treatment existed. Families simply do not want long and unnecessary delays in therapy development. During those 54 months, Medline shows 2,230 other research papers on TSE appeared -- almost all will have to be revisited at great cost because the "new" development puts their current interpretation at risk.
For example, sporadic CJD is defined by default as a normal prion protein sequence and lack of other risk factors. But a normal prion coding sequence says nothing about a normal doppel prion sequence or normal regulatory regions (eg, the exon-skipping switch at prion exon 2). But sporadic CJD researchers did not know they needed to look at these other sequences.
A full year after the secret discovery of the doppel gene, the British disclosed that mad cow disease had passed the species barrier to humans: the first 10 cases of nvCJD were announced on 26 Mar 96. British researchers determined all the nvCJD victims had normal prion protein (including methionine at position 129 in both chromosomes). Scientists all over the world were baffled by the victim set -- one had been a vegetarian for a decade, others had below-average exposure to infectious agent. But how normal was their doppel gene or exon 2? But the doppel gene news blackout continued -- no one could study nvCJD properly until the secrecy surrounding the doppel gene was lifted.
A Nobel Prize, ironically in medicine, was awarded to one of the authors of the doppel paper in November 1997, some 30 months into the cover-up. Experimental support for protein-only theory was in fact a tremendous advance in the face of decades of no real progress with alternative theories. Imagine the mess today in addressing the nvCJD epidemic if viral theories still held sway. (Note: the focus on infectivity in both protein-only and viral theories is itself bizarre in that CJD was clearly identified as an ordinary inherited autosomal dominant human disease gene in two German papers published in 1930. The prion gene could have been mapped and the amyloid sequenced with technology of the late 1950's. But it was not -- cannibalism was more titillating.)
Today the question is, 'yes, protein-only, but which proteins?' The prion and doppel genes were once identical and are still joined at the hip today. The discovery of doppel may be just as important as the discovery of the prion gene -- but its discovery seriously muddied the waters. Could a prize have been awarded before this was straightened out, or was keeping the awards committee in the dark the ultimate reason for the inordinate delay in disclosure?
Excuses (author correspondence):
-- "Senior researchers did not believe initially that doppel was a working gene homologous to prion."
-- "A delay was needed to give us a leg up on competitors."
-- "The paper was too long for rapid publication in PNAS [2 authors are members] and was delayed by manuscript rejection by Cell; sequence release would jeopardize publication even after acceptance."
An open reading frame, coding for a clearly homologous protein, with a clean ATG start and TAA stop, in tandem duplication position, conserved in 3 species of mammals, and supported by mRNA EST sequences [as of 07 Feb 97] is clearly a functional gene. The J Mol Bio paper is too long -- a separate companion paper addressing ataxia in mouse strains should have been broken out. Cell is an exercise in vanity, not communication. The international norm for human genome sequence data release is same-day posting to Internet databases. Many prominent journals encourage prior draft document and sequence distribution. Indeed, Genome Research -- where these same authors previously published the first paper in this series -- offers 'ahead of print' publication.
Because the authors and their proxies sit on all prion grant committees and peer-review all prion research manuscripts submitted to journals, ample warning would be given of a hypothetical competitor's plans. Knowing this, few labs would consider challenging four well-funded labs already months ahead. A 54 month delay is not playing by the rules that govern biomedical research -- imagine the reaction in AIDS research to a 54 month delay in announcing a relevant new gene.
These researchers need to get their priorities straight. Millions of people worldwide are exposed to mad cow disease. Public money is the source of all NIH grants -- the public wants a diagnosis and cure, not to pay for leg-ups on competitors. It is time for the tail to stop wagging the dog -- the NIH needs to crack the whip over self-absorbed researchers oblivious to human suffering in CJD. No timely disclosure, no grant renewal -- it's that simple.
In the end, they were hoist by their own petard. The webmaster discovered the doppel gene independently in August 1999 using Human Genome Project data and published it first (1, 2, 3) in a rapid disclosure format (preprint server) personally approved by the director of the National Institutes of Health. That server, as prominently profiled in Science magazine and Chemical & Engineering News, adheres to the open lab notebook, daily disclosure standards of the Human Genome Project and the US Freedom of Information Act.
Four labs, 21 authors, 54 months -- did they even get it right? Well, the experimental work is very thorough with numerous cross-checks on conclusions. On glycosylation sites and GPI anchors, the bioinformatics homework was done well. Nonetheless, the webmaster still had to fix numerous problems with the paper: the naturalness of chimeric mRNA transcripts, the first human doppel allele, egregious misalignments with prion protein, missing human doppel sequences, candidate exon 2 and intergenic exon switching alleles, and prion NMR structure exploitation for the disulfide, glycan, and other homologies.
Thu, 30 Sep 1999 webmaster research
Prion mRNA seems far more abundant than doppel-containing mRNA, judging by the 174:11 (about 16:1) ratio in which the two are represented in the 1.5 million-entry human mRNA EST database (respectively 133:7 = 19:1 in the 690,000-entry mouse EST set). It is difficult to anticipate how productive each species of mRNA in the prion-doppel region will be in making protein.
There may exist conditions (or cell types) inducing production of much more doppel -- its own promoter has an apparent TATA positive regulatory sequence. [The "knockout" left doppel under control of the constitutive prion promoter with no alternative splicing possible so more doppel mRNA got made; in other words, it was not responsive upregulation to cell conditions.] However, genuine exon-skipping to doppel was confirmed in both wild-type [inbred] rats and mice.
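The EST arithmetic above can be rechecked in a few lines. This is only a sketch: the hit counts and database sizes are the ones quoted in the text, while the function names and the per-million normalization are illustrative additions, not part of the original analysis.

```python
# EST counts quoted in the text: (prion hits, doppel hits, EST database size).
EST_COUNTS = {
    "human": (174, 11, 1_500_000),
    "mouse": (133, 7, 690_000),
}

def abundance_ratio(prion_hits: int, doppel_hits: int) -> float:
    """Prion-to-doppel ratio of EST hits (hypothetical helper)."""
    return prion_hits / doppel_hits

def hits_per_million(hits: int, database_size: int) -> float:
    """Normalize a hit count to hits per million ESTs (illustrative only)."""
    return hits * 1_000_000 / database_size

for species, (prion, doppel, size) in EST_COUNTS.items():
    print(
        f"{species}: prion:doppel = {abundance_ratio(prion, doppel):.1f}:1, "
        f"prion {hits_per_million(prion, size):.0f} per million, "
        f"doppel {hits_per_million(doppel, size):.1f} per million"
    )
```

EST representation is a rough proxy for transcript abundance at best, since it depends on which tissues and libraries were sampled.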
The JMB paper properly notes the similarity in length of the short untranslated 5' leader to the ORF in both prion and doppel genes. This would fit conservation of this exon's splice site subsequent to the tandem duplication event. Relating doppel exon 1ab to bovine prion exon 1ab makes no sense given 5 other species including closely related sheep lack this feature and the poor conservation of exon 1 generally.
Prion exon 2 is baffling in its sequence conservation, which is much better than the seemingly more mission-critical exon 1 transcription start. The conservation extends to species such as hamster (once thought to lack it) and human (no use seen yet). This conservation might be explained by the need to maintain exon skipping over to doppel, which begins at the splice donor at the end of exon 2. Since the function of doppel and prion proteins is unknown, it is not possible to anticipate conditions which might regulate exon 2 use.
The webmaster has proposed an elementary explanation for how exon skipping to doppel arose historically. If the gene duplication event did not involve prion promoter but only subsequent sequence, the switch would automatically have been built in at the get-go: the new distal gene had a fully competitive splice acceptor (indeed, initially an identical one). RNA polymerase II need only read through the prion polyA site (which it does at many genes) to give the 5' leader splice site a fighting chance to be spliced to prion exon 2.
The model shown below (three variants, the first of which is ruled out) accounts for the observed mRNAs. Because intron 2 is so long, if the tandem duplication occurred at a random site, model C is rather favored: the intron length ratios in human are 2623:9986 or 1 to 3.8. (Polarity variants, with the tandem duplication in the 3' direction, were considered earlier.)
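The intron-length argument is simple proportional reasoning, sketched here under an explicit assumption: a breakpoint falling uniformly at random within the two introns lands in each with probability proportional to its length. The text gives the two lengths; assigning 2623 bp to intron 1 and 9986 bp to intron 2 is inferred from the remark that intron 2 is the long one.

```python
# Human prion intron lengths quoted in the text (bp).
# Assumption: the shorter figure is intron 1, the longer is intron 2.
INTRON_1_BP = 2623
INTRON_2_BP = 9986

def breakpoint_fractions(lengths):
    """Chance that a uniformly random breakpoint falls in each interval."""
    total = sum(lengths)
    return [length / total for length in lengths]

length_ratio = INTRON_2_BP / INTRON_1_BP
frac_intron_1, frac_intron_2 = breakpoint_fractions([INTRON_1_BP, INTRON_2_BP])

print(f"intron 1 : intron 2 = 1 : {length_ratio:.1f}")
print(f"random breakpoint in intron 2: {frac_intron_2:.0%}")
```

A uniform breakpoint model ignores any local sequence features that promote duplication, so this only illustrates why the longer intron is the likelier site.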
Note there is no sign of a promoter upstream of intergene exon 1, suggesting the prion promoter was not part of the original duplication. With little but splice domains to constrain them, nucleotide comparisons between intergenic exons and prion upstream exons are largely futile (until sufficient species are sequenced to allow ancestral exon reconstruction), even with a new alignment tool written specifically to analyze such regions.
Initially, the switch merely kept both copies of the gene functional; the distal copy otherwise would have had no promoter of its own (and no time to evolve one before accumulating coding mutations and becoming a pseudogene). Later, as protein sequences and functions diverged, the switch was maintained and exploited to a changing purpose. Doppel gradually acquired a promoter of its own. Intergenic exons 1 and 2 either arose later or are fossil prion exons from the duplication [see graphic]. Because doppel is fixing mutations faster and is so little transcribed today, the prion protein may be making it gradually redundant.
The duplication apparently took place prior to bird-mammal divergence at 310 million years. At that time, none of the currently recognizable retrotransposons were present in the intergenic regions. These regions were therefore much shorter than today (53% can now be identified as insertional elements); longer intergenic regions may have put doppel's splice acceptor at a growing disadvantage. The frequency of tandem duplications with promoter exclusion is subject to an interesting effect related to growth in genome size: as the promoter and open reading frame become ever more widely separated, the size distribution of tandem repeats relative to this growing separation improves the odds that the promoter will not be part of a CDS-containing tandem duplication.
Comparative studies on large sequence regions of human and various other species bear on this (see Hardison, R. et al, Long human-mouse sequence alignments reveal novel regulatory elements: A reason to sequence the mouse genome. Genome Res. 7: 959-966, 1997). Most protein sequences are easily alignable back to the divergence of fish and higher vertebrates (450 Myr), and at that distance splicing patterns are largely conserved.
In the prion literature, every observation is called "unprecedented" or "paradigm-breaking". However, tandem duplications are not at all rare -- Pauling and Zuckerkandl proposed them 40 years ago as a major mechanism for generating protein diversity as hemoglobin primary structures emerged. Tandem duplication is also a common disease mechanism, eg Charcot-Marie-Tooth 1A. [See the review in Genome Research, Pathological consequences of sequence duplications in the human genome, 8:1007-1021, October 1998. Note that the sequences are so diverged in the prion-doppel pair that recombination pathologies are unlikely at this point in time.]
Provided the promoter is not part of the tandem duplication, the model above shows that "chimeric" mRNA is a very natural, indeed an inevitable outcome of the duplication event. Its retention over evolutionary time suggests some benefit derived from this arrangement as the protein sequences diverged.
The question arises why a bazillion other examples of "chimeric transcripts" are not at Medline -- a single cite is given in the JMB paper and not to a homologous tandem pair. Answer: examples exist in abundance, it is just hard to search for them right now. The mechanism will likely be found in many other tandem gene pairs, depending on the utility of functional divergence and the opportunity afforded for regulation.
The problem is that structures of genes are just now becoming encoded at GenBank. Adjacent genes (which are only now being sequenced) almost always have separate, unlinked GenBank entries; alternative splicing is seldom given. Graphics of the exon/intron structure are nowhere collected. There is no sorting procedure to pull out all entries that are both homologous and adjacent. [The new Popseq feature at GenBank will carry superfamilies.] Any mRNA data available for a tandem pair is buried in full texts of individual articles.
So while all of this is computationally feasible and desirable, it is second-generation bioinformatics. But are we reduced to going through stacks of journals to find examples? The opsin genes (figure 7) of color vision on chr X, hox genes, and globins are tandem homologous duplications whose transcription is known in great detail; each gene family has an appallingly complicated specialist literature.
Human chromosome 22 is now completely sequenced, but how far along is the annotation of its genes? It is computationally quite feasible to march along the chromosome aligning each protein with its neighbor to find tandem paralogues with chimeric transcripts: simply feed the 35 million basepairs through GenScanW, using Blastp on consecutive ORFs (allowing for false positives), and searching in those above a cutoff for EST chimeric mRNAs. Many other chromosomes have extensive finished sequences as well. More is annotated in non-mammalian species such as yeast, nematode, and fruit fly.
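The "march along the chromosome" step can be sketched as follows. This is a toy illustration, not the pipeline itself: gene prediction (GenScanW) and the EST search for chimeric mRNAs are left out, and Python's standard-library `difflib.SequenceMatcher` stands in for Blastp as a crude similarity score. The function names, the 0.5 cutoff, and the third ORF are hypothetical; the first two ORFs are N-terminal fragments of the mouse and rat doppel proteins from this page, placed side by side only to exercise the code.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude 0..1 similarity between two protein strings (NOT a Blastp score)."""
    return SequenceMatcher(None, a.upper(), b.upper(), autojunk=False).ratio()

def adjacent_paralogue_candidates(orfs, cutoff=0.5):
    """Compare each predicted ORF with its immediate downstream neighbor.

    `orfs` is a list of (name, protein_sequence) in chromosomal order.
    Returns (upstream_name, downstream_name, score) for pairs at or above cutoff.
    """
    hits = []
    for (name_a, seq_a), (name_b, seq_b) in zip(orfs, orfs[1:]):
        score = similarity(seq_a, seq_b)
        if score >= cutoff:
            hits.append((name_a, name_b, round(score, 2)))
    return hits

# Toy input in "chromosomal order"; orf3 is an invented unrelated sequence.
toy_orfs = [
    ("orf1", "MKNRLGTWWVAILCMLLASH"),
    ("orf2", "MKNRLGTWGLAILCLLLASH"),
    ("orf3", "PPPPPPPPPPQQQQQQQQQQ"),
]
candidates = adjacent_paralogue_candidates(toy_orfs)
print(candidates)
```

Pairs flagged this way would still need the EST step to show that a transcript actually spans both genes; an ancient pair as diverged as prion and doppel would likely need a more sensitive comparison than a plain string match.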
In the end, the prion-doppel situation will turn out to be quite common. It requires only that the tandem duplication region not include the promoter, that there be leading exons as splice donors, that pol II be leaky at polyA termination, and that the built-in initial alternative splicing (that allows the second copy to use the same promoter) be retained over evolutionary time, at least in some lineages.
More recent tandem duplications are thus more likely to still exhibit a shared promoter. Rodent chimeric transcripts may not be found in all mammalian lineages. Indeed, human exon 2 seems to have an uncompetitive splice acceptor even within the prion gene, so any prion CDS exon-skipping to doppel would proceed from exon 1 (unless unknown regulatory situations come into play). Rodents may be a poor model system for TSEs if their doppel gene is regulated differently.
Genome Res 1998 Oct;8(10):1007-21 Mazzarella R, Schlessinger D
As large-scale sequencing accumulates momentum, an increasing number of instances are being revealed in which genes or other relatively rare sequences are duplicated, either in tandem or at nearby locations. Such duplications are a source of considerable polymorphism in populations, and also increase the evolutionary possibilities for the coregulation of juxtaposed sequences. As a further consequence, they promote inversions and deletions that are responsible for significant inherited pathology. Here we review known examples of genomic duplications present on the human X chromosome and autosomes.
Genome Research Vol. 9, Issue 9, 803-814, September 1999 Eyal Seroussi,...and Jan P. Dumanski
Analysis of 600 kb of sequence encompassing the beta-prime adaptin (BAM22) gene on human chromosome 22 revealed intrachromosomal duplications within 22q12-13 resulting in three active RFPL genes, two RFPL pseudogenes, and two pseudogenes of BAM22. The cDNA sequence comparison of RFPL1, RFPL2, and RFPL3 showed 95%-96% identity between the genes, which were most similar to the Ret Finger Protein gene from human chromosome 6.
The sense RFPL transcripts encode proteins with the tripartite structure. Each of these domains are thought to mediate protein-protein interactions by promoting homo- or heterodimerization. We identified 6-kb and 1.2-kb noncoding antisense mRNAs of RFPL1S and RFPL3S antisense genes. The RFPL1S and RFPL3S genes cover substantial portions of their sense counterparts, which suggests that the function of RFPL1S and RFPL3S is a post-transcriptional regulation of the sense RFPL genes.
We illustrate the role of intrachromosomal duplications in the generation of RFPL genes, which were created by a series of duplications and share an ancestor with the RING-B30 domain containing genes from the major histocompatibility complex region on human chromosome 6.
Am J Med Genet 1999 Aug 6;85(4):403-8 Voullaire L, Saffery R, Davies J, Earle E, Kalitsis P, Slater H, Irvine DV, Choo KH
Normal human centromeres contain large tandem arrays of alpha-satellite DNA of varying composition and complexity. However, a new class of mitotically stable marker chromosomes which contain neocentromeres formed from genomic regions previously devoid of centromere activity was described recently. These neocentromeres are fully functional yet lack the repeat sequences traditionally associated with normal centromere function. We report here a supernumerary marker chromosome derived from the short arm of chromosome 20 in a patient with manifestations of dup(20p) syndrome.
Detailed cytogenetic, FISH, and polymorphic microsatellite analyses indicate the de novo formation of the marker chromosome during meiosis or early postzygotically, involving an initial chromosome breakage at 20p11.2, followed by an inverted duplication of the distal 20p segment due to rejoining of sister chromatids and the activation of a neocentromere within 20p12 [the location of the prion-doppel genes -- webmaster]. This inv dup(20p) marker chromosome lacks detectable centromeric alpha-satellite and pericentric satellite III sequences, or centromere protein CENP-B. Functional activity of the neocentromere is evidenced by its association with 5 different, functionally critical centromere proteins: CENP-A, CENP-C, CENP-E, CENP-F, and INCENP. Formation of a neocentromere on human chromosome 20 has not been reported previously and in this context represents a new mechanism for the origin of dup(20p) syndrome.
Tue, 28 Sep 1999 webmaster research
The J Mol Bio 1 Oct 99 doppel paper is an excellent analysis in many ways. It was electronically released by the journal Friday, 24 Sept 99. However, sequences needed to evaluate conclusions drawn in the paper have still not been released. These sequences include human and rat doppel genes, and mouse 1.7 and 2.7 kb doppel cDNAs 9 and 12, accession numbers AF165165 and AF165166 [still held back on 10 Oct 99]. The main long incubation mouse doppel sequence in chr 2, U29187, was posted to GenBank on 25 June 1999 (conspicuously omitting the annotation for doppel).
Other researchers must type in sequences for human and rat doppel proteins (but not genes) from figure 5; they are given below. Critical rat intergenic exons are not provided. The webmaster does not consider this paper published until all sequences discussed in it are freely available on Internet databases as required by journal policy and prevailing scientific norms. [Rat and human doppel sequences are still missing on 6 Oct 99.]
By comparing the human doppel protein sequence in figure 5 with the one determined independently by the Sanger Centre chromosome 20 team and still others in the EST database [tiled AA234322-AL042906], the webmaster found that an apparently common polymorphism exists in the doppel protein. (This was first noted on 3 Sep 99 based on error-prone EST data.) The human sequence in the paper is identical to the EST mRNA in having a methionine at position 174; the Sanger sequence has threonine. Thus the allele should be called M174T; it arises from a simple transition in the DNA (ATG to ACG). There is otherwise perfect agreement over the 513 nucleotides, ie, no silent polymorphisms of ESTs vs. Sanger (no JMB human nucleotide has been released).
The genetic code gives weak support to methionine as both the wildtype allele for humans and ancestral: mouse is ATT isoleucine, a 2 bp change from ACG threonine but only a single bp change from ATG methionine, ie, it is more parsimonious to proceed by single base changes beginning from the methionine.
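The parsimony argument amounts to counting base differences between codons. A minimal sketch follows; the human codons ATG and ACG are those given in the text, and the mouse codon is taken as ATT, as read from the mouse nucleotide sequence listed under Reference Sequences (the att just before gtgaagtaa). The helper names are illustrative.

```python
PURINES = {"A", "G"}

def codon_distance(a: str, b: str) -> int:
    """Number of positions at which two codons differ (Hamming distance)."""
    if len(a) != 3 or len(b) != 3:
        raise ValueError("codons must be three bases long")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))

def is_transition(x: str, y: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    x, y = x.upper(), y.upper()
    return x != y and ((x in PURINES) == (y in PURINES))

MOUSE_ILE = "ATT"   # mouse doppel codon at the aligned position
HUMAN_MET = "ATG"   # paper and EST allele
HUMAN_THR = "ACG"   # Sanger allele

print("mouse Ile -> human Met:", codon_distance(MOUSE_ILE, HUMAN_MET), "change(s)")
print("mouse Ile -> human Thr:", codon_distance(MOUSE_ILE, HUMAN_THR), "change(s)")
print("Met -> Thr is a transition:", is_transition("T", "C"))
```

With a single outgroup codon this is weak evidence, as the text says; more species would be needed to reconstruct the ancestral state with any confidence.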
The nominal ratio of alleles is 2:1 based on this meagre data set. However, if only 3 human doppel sequences are determined, one does not expect them to differ unless an allele is fairly common. (Recall however the first human prion sequence had a repeat deletion that turned out only to occur with a 1-2% frequency.) Clearly widespread screening of doppel genes in disease and controls is needed to establish this and other variants.
The allele is tentative because the Sanger sequence is unfinished (though 3.2x redundant in coverage so unlikely to be in error), the human protein from this paper has no underlying published data (though previous IY Lee sequences have been first-rate), and the single relevant EST sequence has upstream regions of documentable error. There is no guidance as to wildtype or ancestral value from mouse (position 177) or rat doppel proteins: both are isoleucine at this location.
The allele is not a structural counterpart to the well-known M129V allele of human prion because it occurs in the hydrophobic tail of doppel protein (which is either transmembrane or cleaved off during GPI attachment) whereas M129V occurs non-homologously in a beta strand.
There is no way to predict whether this tentative allele has any significance either to prion or doppel disease susceptibility. It is tempting to say that because it occurs within a peptide segment cleaved off mature protein (or buried as a transmembrane domain) that its significance is minimal. However, a similarly located mutation in human prion protein, M232R, can be causative for CJD, whereas nearby E219K is an apparently neutral or beneficial polymorphism. One cannot be sure of the efficiency with which glycosylphosphatidylinositol is attached; the cell can cleave these off to release the protein to the inter-cellular milieu.
Reference Sequences

>human doppel protein, T174
MRKHLSWWWLATVCMLLFSHLSAVQTRGIKHRIKWNRKALPSTAQITEAQVAENRPGAFI
KQGRKLDIDFGAEGNRYYEANYWQFPDGIHYNGCSEANVTKEAFVTGCINATQAANQGEF
QKPDNKLHQQVLWRLVQELCSLKHCEFWLERGAGLRVTMHQPVLLCLLALIWLtVK
>human doppel protein, M174
MRKHLSWWWLATVCMLLFSHLSAVQTRGIKHRIKWNRKALPSTAQITEAQVAENRPGAFI
KQGRKLDIDFGAEGNRYYEANYWQFPDGIHYNGCSEANVTKEAFVTGCINATQAANQGEF
QKPDNKLHQQVLWRLVQELCSLKHCEFWLERGAGLRVTMHQPVLLCLLALIWLmVK
>mouse doppel protein
MKNRLGTWWVAILCMLLASHLSTVKARGIKHRFKWNRKVLPSSGGQITEARVAENRPGAFI
KQGRKLDIDFGAEGNRYYAANYWQFPDGIYYEGCSEANVTKEMLVTSCVNATQAANQAEF
SREKQDSKLHQRVLWRLIKEICSAKHCDFWLERGAALRVAVDQPAMVCLLGFVWFIVK
>rat doppel protein
MKNRLGTWglAILClLLASHLSTVKARGIKHRFKWNRKVLPSSGQITEAqVAENRPGAFI
KQGRKLDIDFGAEGNkYYAANYWQFPDGIYYEGCSEANVTKEvLVTrCVNATQAANQAEF
SREKQDSKLHQRVLWRLIKEICStKHCDFWLERGAALRiTVDQqAMVCLLGFIWFIVK
>human prion protein
MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSP
GGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMK
HMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDE
YSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYY
QRGSSMVLFSSPPVILLISFLIFLIVG
>rat prion protein
MANLGYWLLALFVTTCTDVGLCKKRPKPGGWNTGGSRYPGQGSP
GGNRYPPQSGGTWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWSQGGGTHNQWNKP
SKPKTNLKHVAGAGAVVGGLGGYMLGSAMSRPMLHFGNDWEDRYYRENMYRYPNQ
VYYRPVDQYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCVTQY
QKESQAYYDGRRSSAVLFSSPPVILLISFLIFLIVG
>mouse prion protein, long incubation
MANLGYWLLALFVTMWTDVGLCKKRPKPGGWNTGGSRYPGQGSP
GGNRYPPQGGTWGQPHGGGWGQPHGGSWGQPHGGSWGQPHGGGWGQGGGTHNQWNKPS
KPKTNFKHVAGAAAAGAVVGGLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQV
YYRPVDQYSNQNNFVHDCVNITIKQHTVVTTTKGENFTETDVKMMERVVEQMCVTQYQ
KESQAYYDGRRSSSTVLFSSPPVILLISFLIFLIVG
>mouse prion protein, short incubation
MANLGYWLLALFVTMWTDVGLCKKRPKPGGWNTGGSRYPGQGSP
GGNRYPPQGGTWGQPHGGGWGQPHGGSWGQPHGGSWGQPHGGGWGQGGGTHNQWNKPS
KPKTNLKHVAGAAAAGAVVGGLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQV
YYRPVDQYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCVTQYQ
KESQAYYDGRRSSSTVLFSSPPVILLISFLIFLIVG

>mouse
atgaagaaccggctgggtacatggtgggtggccatcctctgcatgctgcttgccagccac
M K N R L G T W W V A I L C M L L A S H
ctctccacggtcaaggcaaggggcataaagcacaggttcaagtggaaccggaaggtcctg
L S T V K A R G I K H R F K W N R K V L
cccagcagcggcggccagatcaccgaagctcgggtagctgagaaccgcccaggagccttc
P S S G G Q I T E A R V A E N R P G A F
atcaagcaaggccggaagctggacatcgactttggagcagagggcaacaggtactacgcg
I K Q G R K L D I D F G A E G N R Y Y A
gctaactattggcagttccctgatgggatctactacgaaggctgctctgaagccaacgtg
A N Y W Q F P D G I Y Y E G C S E A N V
accaaggagatgctggtgaccagctgcgtcaacgccacccaggcggccaaccaggctgag
T K E M L V T S C V N A T Q A A N Q A E
ttctcccgggagaagcaggatagcaagctccaccagcgagtcctgtggcggctgatcaaa
F S R E K Q D S K L H Q R V L W R L I K
gagatctgctccgccaagcactgcgatttctggctggaaaggggagctgcgcttcgggtc
E I C S A K H C D F W L E R G A A L R V
gccgtggaccaaccggcgatggtctgcctgctgggtttcgtttggttcattgtgaagtaa
A V D Q P A M V C L L G F V W F I V K -

>human 174T
atgaggaagcacctgagctggtggtggctggccactgtctgcatgctgctcttcagccac
M R K H L S W W W L A T V C M L L F S H
ctctctgcggtccagacgaggggcatcaagcacagaatcaagtggaaccggaaggccctg
L S A V Q T R G I K H R I K W N R K A L
cccagcactgcccagatcactgaggcccaggtggctgagaaccgcccgggagccttcatc
P S T A Q I T E A Q V A E N R P G A F I
aagcaaggccgcaagctcgacattgacttcggagccgagggcaacaggtactacgaggcc
K Q G R K L D I D F G A E G N R Y Y E A
aactactggcagttccccgatggcatccactacaacggctgctctgaggctaatgtgacc
N Y W Q F P D G I H Y N G C S E A N V T
aaggaggcatttgtcaccggctgcatcaatgccacccaggcggcgaaccagggggagttc
K E A F V T G C I N A T Q A A N Q G E F
cagaagccagacaacaagctccaccagcaggtgctctggcggctggtccaggagctctgc
Q K P D N K L H Q Q V L W R L V Q E L C
tccctcaagcattgcgagttttggttggagaggggcgcaggacttcgggtcaccatgcac
S L K H C E F W L E R G A G L R V T M H
cagccagtgctcctctgccttctggctttgatctggctcacggtgaaataa
Q P V L L C L L A L I W L T V K -

>human doppel from AA234322-AL042906 tiled ESTs M174
tggtggtggctggccactgtctgcatgctgctcttcagccacctc
W W W L A T V C M L L F S H L
tctgcggtccagacgaggggcatcaagcacagaatcaagtggaaccggaaggccctgccc
S A V Q T R G I K H R I K W N R K A L P
agcactgcccagatcactgaggcccaggtggctgagaaccgcccgggagccttcatcaag
S T A Q I T E A Q V A E N R P G A F I K
caaggccgcaagctcgacattgacttcggagccgagggcaacaggtactacgaggccaac
Q G R K L D I D F G A E G N R Y Y E A N
tactggcagttccccgatggcatccactacaacggctgctctgaggctaatgtgaccaag
Y W Q F P D G I H Y N G C S E A N V T K
gaggcatttgtcaccggctgcatcaatgccacccaggcggcgaaccagggggagttccag
E A F V T G C I N A T Q A A N Q G E F Q
aagccagacaacaagctccaccagcaggtgctctggcggctggtccaggagctctgctcc
K P D N K L H Q Q V L W R L V Q E L C S
ctcaagcattgcgagttttggttggagaggggcgcaggacttcgggtcaccatgcaccag
L K H C E F W L E R G A G L R V T M H Q
ccagtgctcctctgccttctggctttgatctggctcatggtgaaataa
P V L L C L L A L I W L M V K -
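A position-wise comparison of the two human doppel protein sequences listed above can be sketched as follows. The sequences are copied from the T174 and M174 entries, written as a shared 173-residue prefix plus the differing tail; the function name is an illustrative choice, and equal-length input is assumed, so no alignment is attempted.

```python
def differing_positions(a: str, b: str):
    """Return (1-based position, residue_a, residue_b) for every mismatch."""
    if len(a) != len(b):
        raise ValueError("position-wise comparison needs equal-length sequences")
    pairs = zip(a.upper(), b.upper())
    return [(i, x, y) for i, (x, y) in enumerate(pairs, start=1) if x != y]

# First 173 residues, identical in both human doppel entries on this page.
COMMON = (
    "MRKHLSWWWLATVCMLLFSHLSAVQTRGIKHRIKWNRKALPSTAQITEAQVAENRPGAFI"
    "KQGRKLDIDFGAEGNRYYEANYWQFPDGIHYNGCSEANVTKEAFVTGCINATQAANQGEF"
    "QKPDNKLHQQVLWRLVQELCSLKHCEFWLERGAGLRVTMHQPVLLCLLALIWL"
)
SANGER_T174 = COMMON + "tVK"     # Sanger Centre genomic sequence
PAPER_EST_M174 = COMMON + "mVK"  # JMB figure 5 and tiled ESTs

differences = differing_positions(SANGER_T174, PAPER_EST_M174)
print(differences)
```

Run on these two entries, the only reported difference should be the threonine/methionine pair at residue 174 that the text names M174T.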
28 Sept 99 webmaster opinion
Doppelcross might be taken to mean shafting the very public whose money is the source of all NIH grants and from whom the victim class is drawn. However the term doppelcross is better reserved for special cis-trans genetic issues that arise in partial tandem gene duplication with promoter retention, eg, the prion-doppel region of human chromosome 20.
Picture the situation where there are allelic variants of both genes in the same individual distributed in various ways across the maternal and paternal chromosome copies. These alleles might involve coding regions of the genes, eg codon 129 met/val of prion protein, or codon 174 thr/met of doppel, or variations in the crucial switch region of prion exon 2, or the respective splice acceptors, and so on.
It is not enough to specify which alleles are present; the allocation of alleles to each chromosomal copy must be described (ie, the haplotype). Exon-skipping (upregulation of doppel) takes place at the level of haplotype, ie, on the same strand (cis). That is, pre mRNA is internally spliced. There is no trans-splicing of a pre mRNA from one copy of chr 20 with a pre mRNA transcript from the second chromosome.
As a specific example, consider the 3 alleles above (as capital or lower case) using E for exon 2, M for codon 129, and T for codon 174. In the heterozygous case, they could be distributed across the two chromosome copies in eight ways (vertical pairs):
EMT EMt EmT Emt eMT eMt emT emt   paternal copy of chr 20
emt emT eMt eMT Emt EmT EMt EMT   maternal copy of chr 20

Familial CJD involves a fourth locus, say K for E200K, so 16 possibilities:
EMKT EMKt EmKT EmKt eMKT eMKt emKT emKt   paternal copy of chr 20
emkt emkT eMkt eMkT Emkt EmkT EMkt EMkT   maternal copy of chr 20

EMkT EMkt EmkT Emkt eMkT eMkt emkT emkt   paternal copy of chr 20
emKt emKT eMKt eMKT EmKt EmKT EMKt EMKT   maternal copy of chr 20

If T174M turns out to be a somewhat common doppel polymorphism and doppel disease alleles emerge, 32 possibilities exist. In homozygous cases, the complexity drops out; familial CJD is almost always heterozygous. It is not clear at this time that familial doppel disease exists and if so whether it is autosomal dominant or recessive. Prion mutations are autosomal dominant because amyloid represents toxic gain-of-function; doppel protein is unlikely to form amyloid because of poor homology to the 106-126 region.
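The phase bookkeeping above is a straightforward enumeration: with n heterozygous loci there are 2^n ways to assign the capital and lower-case alleles to the paternal copy, and the maternal copy is then the case-flipped complement. A minimal sketch, in which the function name is illustrative and the locus letters follow the text's convention:

```python
from itertools import product

def phased_configurations(loci: str):
    """List (paternal, maternal) haplotype pairs for fully heterozygous loci.

    Each locus letter appears upper case on one chromosome copy and
    lower case on the other; every assignment is enumerated once.
    """
    configurations = []
    for is_upper in product((True, False), repeat=len(loci)):
        paternal = "".join(
            letter.upper() if upper else letter.lower()
            for letter, upper in zip(loci, is_upper)
        )
        maternal = paternal.swapcase()  # complementary alleles on the other copy
        configurations.append((paternal, maternal))
    return configurations

for loci in ("EMT", "EMKT"):
    pairs = phased_configurations(loci)
    print(loci, len(pairs), "configurations")
    print("  paternal:", " ".join(p for p, _ in pairs))
    print("  maternal:", " ".join(m for _, m in pairs))
```

This counts paternal and maternal origin as distinct; if parent of origin is ignored, each pair is counted twice and the totals halve.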
It is already a very poor idea to pool patients with the "same" set of alleles distributed differently even in familial CJD. For example, in P102L M129V heterozygotes, it may matter very much whether the mutation 102L is on the same strand as the polymorphism 129V because that determines what primary protein sequences are available for amyloid formation.
In prion-doppel disease, there is a second reason for determining haplotypes: the inherent cis nature of tandem transcription and exon-skipping. An exon 2 allele that depletes exon-skipping to doppel may boost prion protein levels, so it matters whether the prion gene of that particular strand carries an amyloidogenic allele.
A third issue concerns recombination. From an April 1999 paper in Am J Hum Gen on E200K, it emerged that recombination of even fairly distant markers is quite uncommon in this region of human chromosome 20. Collinge's group has also reported that all known kindreds of A117V also carry a closely linked -21 allele upstream of the start codon, not supporting recombination between these tightly linked alleles.
The question becomes, is it the A117V that causes CJD, or simple overproduction of protein from that strand due to enhancement of the splice site at the end of intron 2? This splice acceptor must compete with doppel exons for the exon 2 or exon 1 splice donor (humans, unlike mice, make no known use of exon 2). Suppose amyloid from A117V patients mainly had valine at position 117: was that because 117V had propensity to form amyloid or because more was made due to the -21 mutation? Are symptoms due to proteinase K resistant protein or from altered levels of doppel from the -21 splice effect? These are difficult issues to resolve experimentally.
This linkage disequilibrium arising from lack of recombination raises similar questions throughout TSE research. In VRQ sheep, are we looking at susceptibility to scrapie from the prion gene or from some tightly linked doppel allele or a splice junction variant on the same haplotype? Here in vitro conversion, at least of purified protein, has correlated with inferred genetic susceptibility attributable to VRQ. A high priority for research would be to determine the sequence of the doppel gene and its transcripts in cattle, sheep, and mule deer.
30 Sep 99 webmaster research. Last update: 11 Oct 99
The doppel and prion genes are a classic case of paralogues resulting from tandem gene duplication. It is important to extract as much information as possible from available sequences because the doppel protein may provide clues to normal prion function, modulate TSEs, and cause other diseases on its own.
The JMB paper and a recent Tubingen meeting talk make a start on this. Figure 5 of the paper aligns 3 doppel proteins relative to chicken, marsupial, and consensus eutherian mammal prions; various percent identities are calculated. Figure 6 proposes a most plausible second disulfide bridging 95-148 and a second glycosylation site at position 99 of mouse doppel.
However, the alignment has 13 distinct kinds of error. This is not appropriate in a paper with 21 authors and 54 months of development:
Signal peptides and GPI (or transmembrane) segments are subject to pronounced convergent evolution, ie, the same enzymes in the endoplasmic reticulum have to recognize signal and GPI sites from thousands of unrelated proteins. Amino acid composition and length are restricted in the required hydrophobic and hydrophilic subdomains, causing further statistical artefacts. Beyond this, signal and GPI regions have no known structural packing constraints.
Mammalian prions are themselves poorly conserved within the signal region and the orthologous alignment to bird prion signal is already somewhat dubious (only 2 of 9 avian sequences cover this region). Within doppel, signal and GPI regions are evolving dramatically faster than the mature protein or nucleotide. Since the doppel-prion divergence is quite ancient, what is learned from aligning these regions of doppel to prions is merely that they conform to general requirements of signal and GPI domains. (Similarly no alignment credit accrues to methionines at initiating position.) The effect of parallel evolution under common constraints is inflated homology values.
The doppel signal differs strikingly from the GPI region in its ratio of fixed non-synonymous to synonymous nucleotide change (18:3 versus 11:10) despite similar opportunities, suggesting conservative amino acid changes within the signal region are not experiencing as much selective pressure. At the time of tandem duplication, prion and doppel signal domains were identical. Both have fixed a great many changes since divergence of rodents and primates. Alignments are still favorable today, not because of sequence conservation, but rather because both began trapped within the same 'potential well' of signal peptide constraints.
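One way to put a number on how "striking" the 18:3 versus 11:10 contrast is would be a Fisher exact test on the 2x2 table of counts. The sketch below is an illustration only, not the analysis used in the JMB paper or on this site; the function name is hypothetical and the test treats each fixed change as an independent observation, which is an assumption.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, row1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(col1, x) * comb(total - col1, row1 - x) / denom

    lo = max(0, row1 - (total - col1))
    hi = min(row1, col1)
    observed = prob(a)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))

# Counts quoted in the text: (non-synonymous, synonymous) fixed changes.
signal = (18, 3)
gpi = (11, 10)

odds_ratio = (signal[0] * gpi[1]) / (signal[1] * gpi[0])
p_value = fisher_exact_two_sided(*signal, *gpi)
print(f"odds ratio = {odds_ratio:.2f}, two-sided p = {p_value:.3f}")
```

With only 42 changes in total the test has little power, so the result is best read as a rough sanity check on the contrast rather than as proof of differing selection.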
The signal and GPI domains, which really should include a few residues across the cleavage sites, occupy some 53 residues, nearly a third of the protein. These two domains do not have determinable 3D structures essential to reliable alignment of distant sequences; however the domains are important to the overall domain line-up and confidence in the paralogous relation. JMB author PM Harrison did an exceptional analysis of the GPI anchor to set it at position 155 GAA/G (which however is probably in homologous position to that of prion with respect to helical wheel 3). Harrison's study of known mammalian GPI sites remains unfortunately unpublished and unimplemented as a web tool.
The alignment section in figure 5b covering the region between the pre-repeat region and the first beta strand is ridiculous. Doppel lacks anything resembling the repeat region, hinge, and amyloidogenic domains; it is far more plausible to posit a single longish deletion event rather than 5-6 custom deletions that cherry-pick scattered residue agreements at the price of placing 5 charged or polar residues within a purely hydrophobic extremely conserved domain. ["But our alignment must be right -- we used a popular computer program and computers don't make mistakes!"]
It is wrong to align prion repeat regions even within mammals (figure 5b) because the constituent repeats have very different histories due to an ongoing dynamic between slippage deletions and insertions; these regions are only functionally homologous (eg, histidines bind copper) and have never been sequentially homologous. The repeat regions emanated at different times from slightly divergent upstream repeat generators, giving rise to hexapeptide, octapeptide, and nonapeptide repeat lineages. This phenomenon has been observed hundreds of times in other proteins and nucleic acids with short tandem repeats. Since doppel lacks a repeat region, the effect here is to inflate genetic distance.
The tertiary fold of a protein is by far its most conserved aspect over long time periods. Since nmr structures are available for the globular domain of prion protein, any alignment based on simple linear features should have been subordinated to 3D requirements. The doppel protein has few regions of strong identity with prion proteins, but these are precisely at the crucial interior packing region formed by the 'underpass' domain and disulfide. These residues anchor the alignment in space. Internal hydrogen bonding pairs of prion protein cannot be substituted with inappropriate partners, further constraining 3D structure and the implied alignment.
Secondary structure is also generally better conserved than primary sequence. Since prion protein has 3 alpha helical domains and these have regular periodicity, a linear alignment should display a similar helical wheel in terms of surface and interior residues. The beta sheet in prion protein is very short, with only two 3 residue strands accepted by structure editors at PDB. It is problematic whether such a weak feature can be reliably modelled in doppel especially since the adjacent upstream domain is deleted. If one exists, it would also be anti-parallel.
Chicken was not a good choice to represent the 7 avian genera with known prion sequence because inbreeding in domesticated animals introduces various allele artefacts (as in long incubation lab mouse). It is replaced below by a reconstructed ancestral avian sequence to suppress idiosyncratic change in individual lineages.
A simple consensus (majority rule) sequence from 40 species is unsuitable as a representative of eutherian mammals when nearly 100 species are known; the first 40 were over-weighted with closely related old world primates. Even had a topology inferred from prion gene sequences been used, that would be less appropriate than one based on the totality of sequence and fossil data. Clamping to a known topology allows an even more reliable ancestral eutherian mammal sequence to be used in comparisons. Marsupial is good to include; however, the single available sequence makes it a dangerous long branch, ie, it must be deweighted in alignments. (Sequencing should always be done in 3's.)
For doppel protein, only 3 sequences are known, with topology: (human, (mouse, rat)). If rat and mouse differ at a residue and human agrees with either, then that value of the residue rules. Rat is surprisingly effective in outgroup arbitration: it allows 10 corrections in the ancestral doppel gene reconstruction (not attempted in the paper). This will be important in constructing a better fugu/zebrafish/fly/nematode probe, which is needed because the prion/doppel superfamily is still an orphan.
The ancestral doppel sequence is shown below. Reference sequences are broken at the signal, mature, and GPI domains. Lower case shows residual uncertainty; human is used as the default because rodents evolve more rapidly. The second sequence uses rodent at ambiguous sites randomly in 1:2 proportion to make a better all-around probe. No homologues are currently known in pufferfish, zebrafish, Drosophila, or yeast.
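The arbitration rule for the (human, (mouse, rat)) topology can be sketched in a few lines. This is a minimal illustration under stated assumptions: the three sequences are already aligned to equal length, the function names are hypothetical, and the toy fragments are invented rather than real doppel sequence. Lower case marks a site that three species cannot resolve, with human as the default.

```python
def ancestral_residue(human, mouse, rat):
    """Reconstruct one ancestral residue for the topology (human, (mouse, rat)).

    Returns an upper-case letter when the call is unambiguous and a
    lower-case letter (human used as the default) when it is not.
    """
    if mouse == rat:
        # Rodents agree: unambiguous only if human agrees too.
        return human if human == mouse else human.lower()
    if human in (mouse, rat):
        # Rodents differ; human arbitrates between them.
        return human
    return human.lower()  # three-way disagreement

def ancestral_sequence(human, mouse, rat):
    # Assumes the three sequences are aligned, upper case, and of equal length.
    return "".join(ancestral_residue(h, m, r)
                   for h, m, r in zip(human, mouse, rat))

# Toy aligned fragments, invented purely for illustration.
print(ancestral_sequence("RGIKH", "RGVKH", "RGIKQ"))  # RGIKH
print(ancestral_sequence("RGIKH", "RGVKH", "RGVKH"))  # RGiKH
```

In the first call each rodent-specific change is overruled by human agreeing with the other rodent; in the second, mouse and rat agree against human, so the site stays ambiguous.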
>ancestral eutherian doppel (ambiguous = human)
MrkhLswWWLAtvCMLLfSHLSaVqt
RGIKHRiKWNRKaLPStaQITEAQVAENRPGAFIKQGRKLDIDFGAEGNRYYeANYWQFPDGIhYnGCSEANVTKEafVTgCiNATQAANQgEFSReKpDnKLHQqVLWRLvqElCSlKHCeFWLERG
AgLRVTmhQPvllCLLalIWliVK
>ancestral eutherian doppel (ambiguous = human:rodent 2:1 random sites)
MrnhLgwWWLAtvCMLLfSHLSaVqt
RGIKHRfKWNRKaLPSsg-QITEAQVAENRPGAFIKQGRKLDIDFGAEGNRYYeANYWQFPDGIyYeGCSEANVTKEafVTgCiNATQAANQaEFSReKqDnKLHQqVLWRLvqElCSlKHCeFWLERG
AgLRVTmhQPallCLLafIWliVK
One of the reasons for reconstructing ancestral doppel protein is to measure convergence. Comparing ancestral prion and ancestral doppel reconstructions measures how much better the homology was 100 million years ago -- the rate of convergence. The rates of evolution of doppel and prion may be coupled and quite variable (marsupial and bird doppel sequences are needed, in 3's).
Be this as it may, the date of the tandem duplication can still be estimated from the apparent rate of convergence. If the homology between prion and doppel was 40% at 100 mya and 20% today, then the tandem duplication dates roughly to 400 mya (ignoring saturation effects). If the homology has only converged to 25%, the date implied is absurdly old, establishing rate acceleration (or the first derivative of the molecular clock function which, like any function, is constant to zeroth order).
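The arithmetic behind the 400 mya figure is a straight-line extrapolation back to 100% identity. The sketch below makes that explicit; the function name and the linear model are assumptions for illustration, and saturation is ignored exactly as in the text.

```python
def duplication_date_mya(identity_then, identity_now, then_mya=100.0):
    """Linear extrapolation back to 100% identity, ignoring saturation.

    identity_then: percent identity of the ancestral reconstructions at then_mya
    identity_now:  percent identity of the contemporary sequences
    """
    lost = identity_then - identity_now  # percent identity lost since then_mya
    if lost <= 0:
        raise ValueError("identity must have declined over the interval")
    # Time needed, at the same rate, to fall from 100% to identity_then.
    earlier = then_mya * (100.0 - identity_then) / lost
    return then_mya + earlier

print(duplication_date_mya(40, 20))  # 400.0
print(duplication_date_mya(40, 25))  # 500.0
```

The smaller the drop observed over the last 100 million years, the further back the line must run to reach 100%, which is why a slow recent rate pushes the implied date so far into the past.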
The goal here really is a tandem duplication date relative to the two tetraploidization events in the vertebrate lineage (whole genome duplications). If the prion-doppel creation event came first, second, or third, there would now be 8, 4, or 2 of these genes in the haploid genome (though some may have been lost). And these could all be so diverged by now that they would not be picked up by hybridization to either prion or doppel probe.
Instead, they would be suspected in partial synteny. But the official keepers of human/mouse synteny at Jackson Harbor wrote the webmaster last week saying that they had given up long ago in tracking these. (Indeed, in yeast, which also had a genome doubling long ago, the partial syntenies are all worked out but are very complex.)
In other words, doppel trouble could be just the start. Doubles of the double may emerge soon from the human genome sequencing project -- they are best recognized by tBlastn with ancestral prion or doppel as probe (none are known yet on the dozen unfinished chromosomes at the Sanger Centre).
Another use of ancestral doppel sequence is to improve reconstruction of the common ancestor of prion and doppel. This could resemble doppel more than prion. Ideally, doppel sequences from bird will become available, putting doppel reconstruction on a par with that of prion. Together these reconstructions point to a Blast probe that might find ancestral proteins farther back, in fugu, zebrafish, nematode, and fruit fly. A probe is given by:
>ancestral eutherian doppel by domain (alternate residues in brackets)
[Technical note: At doppel residues that can't be reconstructed unambiguously from only 3 species, a dynamically samplable probability distribution of sequences is easily made. There are 38 positions left of the 179 amino acids at which rat equals mouse not equals human. An ancestral doppel sequence is made by choosing rodent or human at each of the 38 positions according to the random number generator in a spreadsheet (inversely weighted by rate of rodent and primate lineages). Distances (and variances) are measured by averaging over a few thousand of these ancestral doppel sequences; of course, the other 141 residues do not need this treatment. Prion protein can be treated the same way, though far fewer positions are affected because so many species have been sequenced and marsupial and bird ancestor are available as outgroups. Distances between prion and doppel are taken by sampling the distance between the two distributions. It all takes longer to explain than to do.]
Doppel and prion are similar enough in 3D structure to be capable of interacting as a heterodimer with a pseudo twofold axis, provided prion protein itself can dimerize (not proven) on docking surfaces also available in doppel. A whiff of this is implied by the biased paralogue drift described earlier. (Mammal doppel is closer to mammal prion than to bird prion even though the prion-doppel tandem duplication event preceded avian/mammal divergence.)
Co-evolution of doppel and prion would differ from situations such as virus/cell surface receptor in that the latter is do-or-die about getting into the cell, whereas prion and doppel are only 2 of an estimated 142,634 human genes. A paralogue heterodimer situation is a little specialized because the coevolution is cooperative, not antagonistic, and restricted to the docking surfaces of a pseudo twofold symmetry, ie, corresponding homologous residues would change. In a hypothetical prion-doppel physical interaction, nearly-neutral Kimura drift may dominate. So many genes, so many things going on, how is there to be selection on a small effect?
Yet the relative rate of fixation of mutations is roughly 2.4 times higher in doppel than prion. This is an average over the last 100 million years (rodent-primate divergence); the rate could not really have been constant at this factor since mammal-bird divergence without doppel becoming totally unrecognizable by now. So doppel change has apparently speeded up in eutherian mammals.
We know that doppel and prion were identical long ago (but when?). Now they are severely diverged, with many bacterial-mammal homologues far stronger. What happened? Comparing avian and mammalian prion hexa- and octapeptide repeats, we see that domain arose from internal tandem duplication (replication slippage) from the post-signal generator beginning around 310 million years ago. [The repeat generator may have been 'implemented' at different times in different lineages; marsupial nonapeptide suggests mammals were late in acquiring their repeat.] Doppel got left behind, not acquiring the new copper-related function nor the severe constraint on the double palindrome domain.
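The spreadsheet procedure in the technical note translates directly into a short script. The sketch below is a hedged illustration, not the webmaster's actual spreadsheet: the sequences are invented toy fragments, the function names are hypothetical, and the rodent weight of 1/3 is an assumption taken from the 2:1 human:rodent proportion mentioned for the second probe.

```python
import random

def sample_ancestor(human, rodent, p_rodent, rng):
    """Draw one ancestral sequence; sites where human == rodent stay fixed."""
    return "".join(
        h if h == r or rng.random() >= p_rodent else r
        for h, r in zip(human, rodent)
    )

def percent_identity(a, b):
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def mean_and_variance(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, var

# Hypothetical toy inputs: aligned human and rodent-consensus fragments,
# plus a target sequence to measure distance against.
human = "RGIKHRIKWNRKA"
rodent = "RGIKHRFKWNRKV"
target = "RGIKHRFKWNRKA"

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_ancestor(human, rodent, p_rodent=1 / 3, rng=rng)
           for _ in range(2000)]
mean, var = mean_and_variance([percent_identity(s, target) for s in samples])
print(f"mean identity {mean:.1f}%, variance {var:.1f}")
```

Only the ambiguous sites consume random numbers, so the cost scales with the 38 uncertain positions rather than with the full 179 residues.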
Like the stay-at-home husband who enables a brilliant new career for his wife, doppel stuck to traditional mundane tasks while prion experimented with a new lifestyle. But they are still a couple eons later.
A quick check can be made for physical coupling of doppel and prion within lineages. That is, we have rat, mouse, and human sequences for both prion and doppel. If all possible pairwise hetero alignment scores are computed, is rat doppel closest to rat prion etc? That would support some common denominator that both are interacting with, within the particular species.
However, the % identity/similarity matrix scores relative to (human, rat, mouse long, mouse short) provide no support for this notion. Whatever correlation might exist may be overridden by rodents evolving faster than primates. The correlation measure might be improved by reference to the 3D structure where perhaps the RMS errors of threaded sequences could be compared.
human doppel: (25/49, 23/46, 24/45, 25/46)
rat doppel: (20/54, 17/50, 18/52, 18/52)
mouse doppel: (22/54, 19/50, 20/52, 20/52)
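The same-species test can be run mechanically on the scores above. The sketch below uses only the identity half of each identity/similarity pair (reading the first number as identity is an assumption), and the helper name is hypothetical.

```python
# Percent identity of each doppel to each prion, in the column order given
# in the text: human, rat, mouse long, mouse short.
prions = ["human", "rat", "mouse long", "mouse short"]
identity = {
    "human": [25, 23, 24, 25],
    "rat":   [20, 17, 18, 18],
    "mouse": [22, 19, 20, 20],
}

def closest_prions(scores):
    """Names of the prion(s) with the highest identity; ties are all returned."""
    best = max(scores)
    return [p for p, s in zip(prions, scores) if s == best]

for species, scores in identity.items():
    best = closest_prions(scores)
    same = any(p.startswith(species) for p in best)
    print(f"{species} doppel is closest to {best}; same-species match: {same}")
```

Rat and mouse doppel both score highest against human prion rather than their own, which is the pattern expected if faster rodent evolution swamps any within-species coupling.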
29 Sept 99 webmaster research
It would come as a great surprise if doppel protein did not contain two disulfides.
The first candidate pair of cysteines is homologous to one well-documented in prion protein. The few regions of strong primary sequence agreement between doppel and prion include the neighborhoods of the two prion cysteines, which are deep interior residues conserved in all species including birds. Even if the pair initially mispairs during in vivo synthesis, disulfide isomerase quickly allows this lowest energy state to prevail.
After allowing for cleavage or transmembrane burial of remaining cysteines from signal and GPI domains, two further flanking cysteines are found in all 3 species. The endoplasmic reticulum and outer cell surface are oxidizing environments for cysteines. If two cysteines go unpaired, the protein may be subject to a chain of inter-molecular cross-links causing aggregation, metal mercaptan formation, or oxidation to cysteic acid. There seems to be no support for reduced cysteines in comparable environments in other proteins.
Direct support comes from visualizing the flanking cysteines in three dimensional space. This can be done from the prion nmr structure because the tertiary fold is still preserved at far lower sequence identity percentages than observed here. Although the cysteines are far apart in terms of the linear peptide sequence (95 and 148 in mouse doppel), they are facing each other in pre helix 2 and distal helix 3 of the protein interior with acceptable atomic separation distances.
A proximity effect and interior position makes dimer inter-molecular disulfides less favored (unless docking surfaces and conformational change overwhelm this effect). Of the two possible dimers, namely 95-95/148-148 and 95-148/148-95, the latter twofold axis type is more common in proteins. If such a dimer existed in doppel, it would point to homologous docking surfaces in prion protein even though the covalent stabilization would be lacking.
Disulfides are chiral centers in proteins with angles of rotation largely restricted to +90 or -90 degrees -- this geometry may also be retained. In prion protein, the Cbeta-S-S-Cbeta disulfide is trans at about 106 degrees. The S-S bond is 2.04 angstroms in length, the Cbeta-Cbeta distance is 4.61 angstroms, the Calpha-Calpha 6.76 angstroms. This is geometry that anchors 3D models of doppel at the homologous site. At the second site, the S-S bond must be of the same length; the chirality and other distances are not determined (although constrained).
Within SwissPdbView, candidate homologous amino acids may be 'mutated' within prion protein to the cysteines of doppel, to see if they are within range of forming a disulfide (allowing for reasonable conformational change). The distal candidate is nearby and constrained on a helical wheel as to whether the side chain points towards or away from the loop containing the proximal candidate. The distal candidate in prion protein has 4-7 intervening residues depending on gapping (a full helical turn may be lost in doppel; such units respect other wheel orientations).
Residues +5 and +6 (relative to second prion cysteine) can be seen from the figure below to be on the wrong side of the helical wheel. The loop region is easier to move (through a torsional angle change -- the Cbeta distance is 6.2 angstroms or 1.6 angstroms too long) than helix 3 because its position is not as constrained by other affiliations. This makes position +8 more favorable, with the new disulfide (to position 166 in mouse) stabilizing the very turning point of the loop. That suggests linear alignment in the region (figure 5 of the paper) needs adjustment.
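The helical wheel argument can be checked with simple arithmetic, assuming an ideal alpha helix at 100 degrees per residue. The sketch below is only a geometric illustration: the 90 degree cut-off for "same face" is a hypothetical criterion, and taking the reference residue at offset 0 as the side that matters is an assumption, not something read from the nmr coordinates.

```python
def wheel_angle(offset, degrees_per_residue=100.0):
    """Angular position around an ideal alpha helix, relative to offset 0,
    folded into the range (-180, 180]."""
    angle = (offset * degrees_per_residue) % 360.0
    return angle - 360.0 if angle > 180.0 else angle

def same_face(offset, tolerance=90.0):
    # Hypothetical criterion: within 90 degrees of the reference side chain.
    return abs(wheel_angle(offset)) <= tolerance

for offset in range(4, 9):
    print(f"+{offset}: {wheel_angle(offset):6.1f} deg, "
          f"same face: {same_face(offset)}")

# Gap to close at the loop, from the distances quoted in the text (angstroms):
# observed Cbeta-Cbeta separation minus the value in the prion disulfide.
excess = 6.2 - 4.61
print(f"Cbeta-Cbeta excess: {excess:.2f} A")
```

Under these assumptions offsets +5 and +6 fall on the far side of the wheel while +4, +7, and +8 do not, and the distance to be made up is 1.59 angstroms, which rounds to the 1.6 quoted above.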
If prion protein had this second disulfide pair, the effect would be to stabilize the loop between the second beta strand and the second alpha helix, which appear in the original Nature nmr ensemble as a fattened tube of relative positional uncertainty. This second disulfide in doppel protein is very helpful in anchoring coordinates of adjacent amino acids in modelling and also in validating alignment register of doppel and prion.
However, prion protein does not have this second pair nor any intermediate towards it in any species (though a residue such as serine is physically similar). This raises the question of which state is ancestral (present at the tandem duplication), and how and why the second disulfide was lost or gained. The extra structural rigidity possibly correlates with the respective normal functional roles of these two proteins. Mammalian and bird prion are wildly diverged in the loop region today.
Distal homology candidate (first cysteine and glutamate-aromatic are alignment anchors)
CIQQY-REYRL.. ancestral bird prion
CITQYQaEYEA.. marsupial prion (deweighted for long branch singlets)
CITQYQKEYEA.. ancestral eutherian mammalian prion
CSlKHC-EFWL-- ancestral doppel protein
The proposal here needs confirmation with consideration of global modelling energetics. The JMB paper draws numerous conclusions from an invisible model whose coordinates are not available to other researchers. The charge surface of doppel is said to be positive on the non-carbohydrate side like prion protein but more neutral on the negative side.
Helical boundaries for doppel are given as 76-86, 105-125, and 131-154 in figure 6; in the past, the authors have not been able to accurately predict secondary structure in prion protein as shown in nmr. The eutherian sequence oddly gives a 'B' in two places in the first helix that clearly should be 'D' (aspartate). It is not made clear whether these are secondary structure predictions using the 3-sequence linear alignment or seen after energy minimization of a full-blown 3D model of doppel.
A prior helix in prion, 106-126, is now said to be transmembrane non-helix though experimental evidence for this so far is transitorily in the endoplasmic reticulum during synthesis; the nmr shows instead a horseshoe. The conserved region is incorrectly given in several places in the paper: the full 21-residue invariant region of all vertebrate prions is a much longer triple tandem palindrome: KTNKHM/F AGAAAAGA VV GGLGG. The authors at one place correctly note that doppel has no statistically significant region of alignment, yet in the Pileup figure 6 they exhibit such an alignment and credit it towards percent identity. Hand-gapping must also have been done in figure 5 to squeeze out more points -- it is better to cluster small close indels rather than introduce multiple events and cherry-pick bogus residue agreement: eg, the GRKL region of doppel.
No beta-sheet is shown in either prion or doppel in figure 6. This is a conserved feature of vertebrate prions having 127-130 and 161-164 anti-parallel strands that straddle the first helix. Being so short, the sheet is difficult to reliably predict. The first strand has been proposed numerous times as the internal seed for conformational change to cross-beta structure of amyloid; an animation of this was posted by this site years ago. The nominal location of this sheet in doppel is FIKQ-IYYE according to the figure 5 alignment. In prion, these residues are YMLG-VYYR in mammal and YAMG-VYYR in birds; a case could thus be made for sequence conservation in the second strand (there are no gap issues and good upstream anchors). The putative first strand shows 2 conservative and 2 non-conservative changes.
Doppel is missing the invariant, hyper-variant transitional, and repeat domains though it has some suggestive compositional similarities in the post-signal repeat leader sequence. Thus the small beta sheet is probably present in doppel and it then transitions into some 30 residues of unknown and perhaps indeterminate structure. That region, as noted years ago on this web site, surely spun off the repeat region through various replication slippages in different lineages during the era of bird/mammal divergence. In this regard, doppel resembles the ancestral protein more than prion.
The most perplexing issue concerns the long invariant region. A region of 21 amino acids that remained fixed in many lineages for 310 million years did not arise overnight. Because of its bland repetitive composition, it is difficult to search for earlier homologues with Blast -- many unrelated proteins have runs of alanine and glycine. The constituent amino acids are not those found in enzyme active sites nor as metal ligands (though some possibilities exist at the region's N-terminus).
This domain was either an important feature of the doppel-prion ancestral protein or came in via an unknown mechanism of domain shuffling from an unknown protein about the time of repeat generation; it did not arise in situ by point mutation subsequent to the tandem duplication event. Its presence or absence is correlated with the repeat region, though no functional or structural association is known. It has somewhat of a structural coupling in nmr to the main region but seems to represent an independent domain.
The majority of mammalian genes have internal introns, often many of them. There is no sign of these in either prion or doppel, therefore no opportunities for alternative splicing of the invariant domain nor for recombining it in through retrotranspositional matching.
In summary, the invariant domain (and its associated function) existed long before the tandem duplication event and, most likely, it was found at its current site in the common ancestral protein. Pseudogenes elsewhere in the human genome may be numerous enough to help resolve this situation.
In one scenario, the ancestral protein was an inefficient copper transporter. The repeat region creation vastly enhanced this role. Tandem duplication occurred while "fine tuning" of the repeat was evolving. It became excessive to have two efficient copies of a copper detoxifying gene; doppel took on more of a prion protein modulatory role. A deletion of the repeat region happened to extend to the invariant region, which was in any event redundant to the modulatory role, and its continued existence in prion protein sufficed for cellular needs. (It is no harder to delete a long contiguous region than a short one.)
Since there are 3 events to order in time (tandem duplication, repeat creation, invariant region gain/loss), 6 such scenarios exist. While no doubt a persuasive case could be made for each, the question is what actually happened. Experimentally, these alternatives are easily resolved by sequencing the syntenic region in amphibian or fish. This was already a pressing need even before the discovery of doppel because of the light it would shed on normal function. However, research has stalled out at the level of apparent detection of invariant region in salmon by antibody 3F4.
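The count of 6 is simply the number of orderings of three events, and listing them makes it easier to see which ones a fish or amphibian sequence would rule out. A minimal sketch, with the event labels taken from the text:

```python
from itertools import permutations

events = ["tandem duplication", "repeat creation", "invariant region gain/loss"]

# Every possible chronological ordering of the three events (3! = 6).
scenarios = list(permutations(events))
for number, order in enumerate(scenarios, start=1):
    print(number, " -> ".join(order))
```

The scenario described above, with repeat creation under way before duplication and the invariant region lost from doppel afterwards, is one of these six orderings.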
Hairpin C is also worth checking for in doppel; more sequences would be helpful. This region extends on both sides of the start of the repeat region. Other old issues that need to be revisited are the long open reading frame in the anti-sense strand of prion protein and anti-sense transcripts said still to be present in prion knockouts (which may represent a pseudogene inserted in complementary UTR of an unrelated gene).
4 Oct 99 webmaster research
It is easily seen that doppel proteins from all 3 species have two consensus glycosylation sites, NVT on the loop at position 99 and NAT at 111, identical in the 3 species at hand. Recall that mammalian prion has two sites. The first of these corresponds to the second site in doppel. Birds have 3 glycosylation consensus sites. Two of these correspond to mammal; the third lies in between (in the loop region between helix 2 and 3) and has no counterpart in mammalian prion or doppel.
It will emerge shortly that, in the 3D structures of bird prion and mammalian doppel, the glycans at the new sites are situated on the same side of the molecular fold as in mammal, despite the disparities in siting. The role remains unknown; the one site conserved in all lineages is the best candidate for most fundamental; numerous familial CJD mutations center around it (D178N, V180I, T183A). However, the second human glycosylation site also has nearby mutations (F198S, E200K, D202N).
The JMB paper observes that "a minority of Asn-glycosylated proteins (23%) have one or more unglycosylated consensus sites, and only 10% of consensus sites in Asn-glycosylated proteins are unglycosylated (von Heijne, 1992), indicating that the position 99 site is also likely to be glycosylated." Experimental evidence for Dpl glycosylation is given in figure 7(c) -- PNGase deglycosylation causes faster migration without clarifying whether both sites are glycosylated. Thus strain type issues arise for doppel even if it is not infectious.
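Consensus sites of this kind are easy to locate by pattern matching. The sketch below scans for N-X-S/T with X anything but proline, which is a common convention assumed here rather than a rule stated in the paper; the function name is hypothetical, and the test fragment is cut from the ancestral doppel reconstruction given earlier, so positions are local to the fragment rather than mouse doppel numbering.

```python
import re

# N-X-S/T consensus, X being any residue except proline (assumed convention).
# The lookahead lets overlapping candidate sites all be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")

def find_sequons(protein, first_position=1):
    """Return (position, tripeptide) for every consensus glycosylation site."""
    protein = protein.upper()  # reconstructions use lower case for ambiguity
    return [(m.start() + first_position, m.group(1))
            for m in SEQUON.finditer(protein)]

# Fragment of the ancestral doppel reconstruction spanning both sites.
fragment = "CSEANVTKEafVTgCiNATQAANQ"
print(find_sequons(fragment))  # [(5, 'NVT'), (17, 'NAT')]
```

The two hits sit 12 residues apart, matching the spacing between positions 99 and 111. A consensus match only shows that a site could be glycosylated; whether it is occupied remains an experimental question.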
It is not clear whether sites are glycosylated during maturation in the same order as they appear in the sequence, or whether some sites (because of accessibility or more favorable consensus) are preferentially glycosylated regardless of their place in the queue. It would not be surprising to find varying glycoform ratios in doppel that varied by tissue and cell type with no particular correlation with those of prion protein.
Indeed, the specific carbohydrates attached to doppel need not be the same as the tetra-antennary sialylated attachments to prion protein. The degree of biosynthetic completion of individual carbohydrate chains may also differ from prion protein. However, the least surprising scenario for doppel would be for its second site to be the primary glycosylation site and for both its sites to have the same substituent glycan as prion protein.
This second site is clearly an ancient glycosylation, whereas the novel bird loop site has probably been restricted to that lineage. It will take prion and doppel sequences from fish and earlier divergences to sort out the history of gain or loss of the other sites, as contemporary pseudogenes are unsuitable for such residue-specific questions.
28 Sept 99 webmaster research
The JMB paper experimentally confirmed exon-skipping in mRNA from the prion-doppel diad. Contrary to the paper, this is expected in any tandem duplication not extending to the promoter. Otherwise, the gene in second position becomes a pseudogene or protein fusion. In mouse, that exon skipping extended from the end of untranslated exon 2 (splice donor) to either intergenic exons or exon 3 of doppel gene.
Let's do a quick in silico screen for alleles of human and mouse prion exons 1 and 2. This amounts to aligning all available genomic and EST sequences to see if they are completely identical. Note both types of sequences can and do contain errors. Even a variant that crops up repeatedly might be a repeatable error arising from some consistent structural impediment to accurate sequencing, rather than representing a new allele.
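The screen amounts to listing mismatches against a reference and asking which variants recur. A minimal sketch follows, assuming the reads have already been aligned to the reference at equal length (the Blastn server does this step in practice); the function names and the toy reads are invented for illustration.

```python
def mismatches(a, b):
    """List (1-based position, base_a, base_b) where two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return [(i, x, y)
            for i, (x, y) in enumerate(zip(a.lower(), b.lower()), start=1)
            if x != y and "n" not in (x, y)]  # skip ambiguous base calls

def recurrent_sites(reference, reads):
    """Variants carried by more than one read; singletons are likelier errors."""
    counts = {}
    for read in reads:
        for pos, _, alt in mismatches(reference, read):
            counts[(pos, alt)] = counts.get((pos, alt), 0) + 1
    return {site: n for site, n in counts.items() if n > 1}

# Toy reads invented for illustration against a 10-base reference.
reference = "gactcctgag"
reads = ["gactcctgag", "gactcatgag", "gactcatgag", "gacacctgag"]
print(recurrent_sites(reference, reads))  # {(6, 'a'): 2}
```

Recurrence alone does not make an allele: a systematic sequencing artefact would also show up in more than one read, so strain and library of origin still need to be checked by hand.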
Even bona fide alleles would not necessarily influence exon-skipping. Along these lines, the JMB paper notes a polymorphism in the AGA trinucleotide immediately after the doppel termination codon (AAA in C57 and 129Sv/J mice). Additional differences have been noted in long and short incubation period mice.
Exon 2 of human U29185 can be used as probe; the Blastn server counts mismatches. However, exon 2 is well known to be cryptic (spliced out) in humans and it is somewhat upstream for an EST; these would make chimeric transcripts uncommon. In fact, no additional exon 2 human sequences were found in the main non-redundant GenBank nor in the dbEST collection. The Sanger Centre exon 1 and 2 sequences are independently determined and 100% identical to U29185.
Since the conventional splice junction in humans is between prion exon 1 and prion exon 3, it is also worth checking dbEST for mRNAs that skip from prion exon 1 to doppel exons. A single exon 1 EST is found, AL119735, a 406 bp mRNA determined 27 Sep 1999 by the German Genome Project. It is a perfect match to human exon 1 for its first 51 bases, then continues to prion exon 3. A second mRNA of 366 bp with the same date, AL119841, barely reaches into exon 1. [A 24 bp gap arises inside exon 3 as these are both 5 repeat alleles whereas U29185 has 4 repeats.]
Thus there is no evidence for exon-skipping in humans at this time, though the number of relevant sequences is very small. Some lineages of mammals may have lost prion promoter skipping to doppel; it may be just a feature in rodents.
>human exon 1 134 bp from U29185
  1 ccgcccgcga gcgccgccgc ttcccttccc cgccccgcgt ccctccccct cggccccgcg
 61 cgtcgcctgt cctccgagcc agtcgctgac agccgcggcg ccgcgagctt ctcctctcct
121 cacgaccgag gcag
>human exon 2 99 bp from U29185
  1 gactcctgaa tatttttcaa aactgaacaa tttcagccat gtctgagctt tccgtcttcc
 61 tggaggcaca aatctagttt agctgaacca caacagatt
>human exon 3
  1 agcagtcatt atggcgaacc ttggctgctg gatgctggtt ctctttgtgg ccacatggag
 61 tgacctgggc ctctgcaaga agcgcccgaa gcctggagga tggaacactg ggggcagccg ....
Maybe we can get some better 'action' over at mouse:
Mouse exon 1 has 11 dbEST matches. Minor inconsistencies are seen in some entries; however no exon-skipping. For exon 2 there are 4 genomic and 18 EST sequences, most full length and none exon-skipping. The genomic sequences and the first 14 ESTs are in perfect agreement with the probe; then come 4 sequences [below] that possibly represent mouse prion exon 2 alleles.
Aligning the 4 ESTs shows only 1 site where a given variation is seen more than once, a G to A transition in AV079732 and AV120178, from the same mouse strain, C57BL/6. These sequences should agree elsewhere, being from the same strain, but they do not. The likeliest interpretation then is sequencing error in the ESTs. However, the exercise here is worth repeating periodically as more sequence comes in to the databases.
Exon 2 sequences from all available species were aligned here last year to establish their degree of sequence conservation (which far exceeds exon 1).
>exon 1 mouse U29187 = U29186
  1 gtcggatcag cagaccgatt ctgggcgctg cgtcgcatcg gtggcag
>exon 2 mouse U29187 = U29186
  1 gactcctgag tatatttcag aactgaacca tttcaaccga gctgaagcat tctgccttcc
 61 tagtggtacc agtccaattt aggagagcca agcagact
>exon 3 mouse
  1 atcagtcatc atggcgaacc ttggctactg gctgctggcc ctctttgtga ctatgtggac
 61 tgatgtcggc ctctgcaaaa agcggccaaa gcctggaggg tggaacaccg gtggaagccg...
The 4 mouse EST sequences representing prion exon 2 variants:
exon 2     1 gactcctgagtatatttcagaactgaaccatttcaaccgagctgaagcattc-tgccttc 59
AU051624  32 .........................g..........................-....... 90   mouse brain
AV079732 210    ......a..............................a.....a.....-....... 155  mouse stomach
AV120178 245 .........t.........a....................a...........-....... 187  mouse 10-day embryo
AA667591  18                                      ...............a....... 40   mouse myotubes

exon 2    60 ctagtggtaccagtccaatttaggagagccaagcagact 98
AU051624  91 ....................................... 129
AV079732 154 ...........................a.......     120
AV120178 186 ...a.....................a.t.......     152
AA667591  41 ....................................... 79

Since exon 2 is known from other species such as hamster, rat, cow, and sheep, these could also be checked. However, the EST databases are much smaller for these species.
Another approach to exon-skipping is to work backwards from the intergenic exons of mouse. Being closer to the end of the gene, they are more likely to be represented as an EST. These might have been found earlier in the course of looking at ESTs that covered or extended the doppel coding sequences but were not.
Now only long incubation mouse intergenic regions are disclosed in the JMB paper, but it is quickly seen that short incubation period mouse also contains intergenic exon 1 as an unannotated feature at positions 34845-35007 of the 38,418 bp in this sequence (not enough for intergenic exon 2). Oddly, it differs at 5 locations from intergenic exon 1 of long incubation mouse (see the alignment below). Rat intergenic exon 1 is said in the paper to be quite diverged from mouse. There is no detectable homology between the upstream prion exons and the intergenic exons, nor do mouse exons find homology to any stretch of human or sheep intergenic region, nor do intergenic stretches conserved between human and mouse match mouse intergenic exons [but see below].
To find intergenic exons in the human sequence from the Sanger Center, older retrotransposons flanking mouse intergenic exons can be matched up.
>intergenic exon 1 163 bp positions 25842-26004 of U29187, no ESTs as of 29 Sept 99
gtaccaagg atgccggaaatttctgccca aagaccaggc ctcttccgcc tcttatctgt
ctgctttgtc ctggatggac ttcacttcgt gaagatttga ctctgtgtcc tacagatagc
caaagtttgg ctgtgaggga caaagagact cagagaaagc ttag

repeat_region complement(24414..24565) MER5A [matches human Mer5A at 30816-31433, sheep U67922 at 30369-30546]
intergenic exon 1
repeat_region complement(27124..27269) B1_MM

>intergenic exon 1 of short incubation mouse U29186
gtaccaaggatgcaggaaattcctgtccaaagaccaggcctcttccgcctcttatctgtc
tgctttgtcctggatggacttcacctcgtgaagatttgactctgtgtcctacagatagcc
aaagtttggctgtgagggacaaagggactcagagaaagcttag

>intergenic exon 2 84 bp positions 29589-29672 of U29187, no ESTs as of 29 Sept 99
atcaagcgaag gcttttctgg aggtcgagtt ctggatcatg atggagtggaggtcgcttcg agtggaggtc ttcgcgcacc gg

repeat_region 28530..28729- MLT1B
intergenic exon 2
repeat_region 30254..30345 L1ME3

Alignment of exon 1 from long and short incubation mouse: 96% identity

long:      1 gtaccaaggatgccggaaatttctgcccaaagaccaggcctcttccgcctcttatctgtc 60
             ||||||||||||| ||||||| ||| ||||||||||||||||||||||||||||||||||
short: 34845 gtaccaaggatgcaggaaattcctgtccaaagaccaggcctcttccgcctcttatctgtc 34904

long:     61 tgctttgtcctggatggacttcacttcgtgaagatttgactctgtgtcctacagatagcc 120
             |||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||
short: 34905 tgctttgtcctggatggacttcacctcgtgaagatttgactctgtgtcctacagatagcc 34964

long:    121 aaagtttggctgtgagggacaaagagactcagagaaagcttag 163
             |||||||||||||||||||||||| ||||||||||||||||||
short: 34965 aaagtttggctgtgagggacaaagggactcagagaaagcttag 35007
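The identity figure for the long versus short incubation alignment can be rechecked by direct comparison. The following is only a minimal Python sketch: the two strings are copied from the ungapped alignment above, and the function names are illustrative, not part of any tool used on this page.

```python
# Intergenic exon 1, long (U29187) and short (U29186) incubation mouse,
# copied from the ungapped alignment above (163 bp each).
LONG = (
    "gtaccaaggatgccggaaatttctgcccaaagaccaggcctcttccgcctcttatctgtc"
    "tgctttgtcctggatggacttcacttcgtgaagatttgactctgtgtcctacagatagcc"
    "aaagtttggctgtgagggacaaagagactcagagaaagcttag"
)
SHORT = (
    "gtaccaaggatgcaggaaattcctgtccaaagaccaggcctcttccgcctcttatctgtc"
    "tgctttgtcctggatggacttcacctcgtgaagatttgactctgtgtcctacagatagcc"
    "aaagtttggctgtgagggacaaagggactcagagaaagcttag"
)


def mismatch_positions(a, b):
    """Return 1-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (no gaps handled)")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


def percent_identity(a, b):
    """Percent of aligned positions that are identical."""
    return 100.0 * (len(a) - len(mismatch_positions(a, b))) / len(a)


if __name__ == "__main__":
    print("length:", len(LONG))
    print("mismatches at:", mismatch_positions(LONG, SHORT))
    print("identity: %.1f%%" % percent_identity(LONG, SHORT))
```

Adding 34844 to a reported position gives the corresponding coordinate in the short incubation sequence, since that alignment starts at 34845.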
webmaster research 6 Oct 99
It turns out that the first of the two intergenic exons between the prion and doppel genes is actually completely contained within a Mer1 type of DNA element insertion, specifically a Mer115 on the coding strand. This is an older element found in both mouse and human prion-doppel intergenic regions. That doesn't make it any less valid as an intergenic exon but rules out an origin of this exon by a tandem duplication including prion exon 1 (model A in the figure above). This element may have contained a splice acceptor and donor as part of its own original exonic structure or acquired them through accrued point mutations.
The reported sequence of mouse interexon 1 is shown below embedded within the full Mer115 sequence. Long and short incubation period mice agree at 284/292 (97%) positions over the element. RepeatMasker also finds Mer115 in the human prion intergenic region in a homologous position (between a MER5A and MLT2CB also found in mouse). MER stands for MEdium Reiteration frequency repeat - over 500 copies are present in the human genome.
Mer115 was first recognized as a repeat in March 1999. The situation is a little confused because the two insertion element tools, Censor Server and RepeatMasker, give slightly discordant results. RepeatMasker finds the element in all 3 sequences; Censor finds the element only in humans. This repeat is not annotated at any of the prion entries at GenBank (even though long incubation mouse was updated yet again on 5 Oct 99).
The intergenic exon is not recognized by GenScanW in mouse. GenScanW does report back two intergenic coding exons for human but neither matches. Long mouse has a reading frame without a stop codon, as does short incubation mouse. Aligning the MER115 from mouse to human is difficult (without reference to the special database of known MER115 sequences). This would fit the observation in the JMB paper that rat intergenic exons are already quite diverged from mouse.
Indeed, human intergenic exon 1 may no longer be functional because it is too diverged from a canonical form of MER115. There are no dbEST matches to it. Only because of its fortuitous existence in a repeat could the homologous human position be located at all.
Human intergenic exon 2 is also problematic. In mouse, this exon does not occur in a known insertion element but rather after MLT2CB and MLT1B sequences that can be anchored to human (because of length, position, and orientation) but before a L1ME3 also with counterparts in human sequence by the same criteria. As the intergenic exon in mouse begins 860 bp after the end of the MLT1B and ends 603 bp before the L1ME3, the location of a putative human intergenic exon is constrained to this 1600 bp. Mouse intergenic exon 2 extends for 84 bp, positions 29589-29672. The MLT1B is at 28530-28729 and the L1ME3 extends from 30275-30573. [GenBank is slightly misannotated as a 30345-30573 L1ME3 and L1ME3A pair].
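The spacing argument above is simple coordinate arithmetic on U29187. As a hedged illustration only (the variable and function names are made up here; the coordinates are the ones quoted in the paragraph), a short Python sketch:

```python
# Coordinates on long incubation mouse U29187, as quoted in the text.
MLT1B = (28530, 28729)   # upstream flanking repeat
EXON2 = (29589, 29672)   # intergenic exon 2
L1ME3 = (30275, 30573)   # downstream flanking repeat (webmaster's boundaries)


def offset(upstream_end, downstream_start):
    """Distance in bp from the end of one feature to the start of the next."""
    return downstream_start - upstream_end


exon_length = EXON2[1] - EXON2[0] + 1
upstream_gap = offset(MLT1B[1], EXON2[0])      # MLT1B end -> exon 2 start
downstream_gap = offset(EXON2[1], L1ME3[0])    # exon 2 end -> L1ME3 start
window = offset(MLT1B[1], L1ME3[0])            # MLT1B end -> L1ME3 start

if __name__ == "__main__":
    print("exon 2 length:", exon_length)
    print("bp after MLT1B:", upstream_gap)
    print("bp before L1ME3:", downstream_gap)
    print("search window:", window)
```

With these boundaries the window between the two flanking repeats comes to about 1550 bp, which the text rounds to 1600 bp.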
Short incubation mouse sequence, U29186, ends with MLT2CB and MLT1B, positions 37591-37790 in a sequence of 38418 bp, ie 628 bp into the region containing intergenic exon 2, as revised 8 Sept 1999, stopping at position 29357 relative to long incubation mouse. Intergenic exon 2 begins at 29589, leaving the revised sequence 204 bp short, half a single sequenator run. The nature of the revision is not given by GenBank -- possibly a few hundred bases were withdrawn to conceal intergenic exon 2 from other researchers. Could a sequence of 38,000 base pairs stop so close to an important feature by coincidence?
The main idea in analyzing retrotransposons is to integrate Censor and RepeatMasker (each has strengths) and analyze each sequence both as rodent and primate (cross-species nomenclature issues can be clarified). RepeatMasker does better at resolving boundaries in slow mode. The relevant sequences are U29187 for mouse and the Sanger Centre contig for human. The flanking repeats are sufficiently ancient to be in the rodent and primate common ancestor. Neither tool finds intervening repeats; there is no support for mouse intergenic exon 2 being associated with a repeat sequence in mouse.
While the MLT2CB, MLT1B, and L1ME3A can be located in an appropriate doppel-prion intergenic region of human chr 20, human has 8 additional intervening transposons in comparison to mouse. The MLT1B end and the L1ME3 start are separated in human by 6613 bp whereas in mouse this separation is only 1612 bp. This seems to make it unfeasible to locate a counterpart to intergenic exon 2 by this method. (The intervening repeats could be masked and residual sequences aligned but change may be too rapidly fixed.)
However, the first intervening sequence, a primate-specific AluJb, begins 1201 bp downstream, leaving room for a human intergenic exon 2 (which began 860 bp and ended 924 bp after the end of the upstream MLT1B marker). Other insertions follow in quick succession.
Aligning mouse intergenic exon 2 with the MLT1B-AluJb interval fails to show any significant match under permissive alignment conditions. No dbEST or Blastn matches are found. Similarly, no match is found to the 4233 bp of masked MLT1B - L1ME3A using either 2Blastn or the better-suited DBA. Splice junction tools find too many false positives.
Thus establishing the existence of a human intergenic exon 2 requires experimental study of in vivo transcripts. Release of the rat sequence may clarify the origin of this exon in mouse.
human sequence flanking mouse interexon 2

score  %div  %del  %ins  query  begin   end  (left)  strand  repeat     class/family    position in repeat
 2028  17.2   4.5   1.9  hum        3   379  (7324)  C       MLT1B      LTR/MaLR        (3)    387     1
 1749  16.4   1.3   0.7  hum     1580  1883  (5820)  C       AluJb      SINE/Alu        (6)    306     1
 1773  20.6   0.6   0.6  hum     2197  2555  (5148)  +       L1MB6      LINE/L1         5798   6156  (19)
 2124  11.6   0.3   0.0  hum     2587  2897  (4806)  C       AluSx      SINE/Alu        (0)    312     1
 3080   7.5   0.8   0.0  hum     3548  3947  (3756)  C       L1PA8      LINE/L1         (2)    6161  5759
 2239   9.7   0.0   0.0  hum     5033  5332  (2371)  +       AluSx      SINE/Alu        1      300   (12)
  696  20.8  12.5   1.7  hum     5353  5592  (2111)  C       Charlie4a  DNA/MER1_type   (37)   471   206
  516  20.9   2.7   0.0  hum     5593  5702  (2001)  C       MER81      DNA             (0)    114     2
  347  13.7   0.0   0.0  hum     5705  5755  (1948)  C       MER81      DNA             (63)   51      1
  972  20.2   0.9   3.7  hum     5757  5974  (1729)  C       Charlie4a  DNA/MER1_type   (296)  212     1
  890  27.0   4.3   0.7  hum     6169  6468  (1235)  +       L1ME       LINE/L1         5509   5819  (345)
 1457  20.5   0.7   0.3  hum     6485  6777  (926)   +       AluJo      SINE/Alu        1      294   (18)
  436  21.3   2.8   1.9  hum     6780  6887  (816)   C       L1PA14     LINE/L1         (209)  5940  5832
  290  28.4   2.1   0.0  hum     6992  7086  (617)   +       L1ME3A     LINE/L1         6062   6158  (5)
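The two distances quoted earlier (1201 bp from the MLT1B end to the AluJb start, and 6613 bp from the MLT1B end to the L1ME3A start) can be read straight off the query coordinates in this table. A minimal Python sketch follows, assuming the usual whitespace-separated RepeatMasker row layout; the parser and helper names are illustrative and are not RepeatMasker's own API.

```python
# RepeatMasker rows for the human interval, copied from the table above.
REPEATMASKER_ROWS = """\
2028 17.2 4.5 1.9 hum 3 379 (7324) C MLT1B LTR/MaLR (3) 387 1
1749 16.4 1.3 0.7 hum 1580 1883 (5820) C AluJb SINE/Alu (6) 306 1
1773 20.6 0.6 0.6 hum 2197 2555 (5148) + L1MB6 LINE/L1 5798 6156 (19)
2124 11.6 0.3 0.0 hum 2587 2897 (4806) C AluSx SINE/Alu (0) 312 1
3080 7.5 0.8 0.0 hum 3548 3947 (3756) C L1PA8 LINE/L1 (2) 6161 5759
2239 9.7 0.0 0.0 hum 5033 5332 (2371) + AluSx SINE/Alu 1 300 (12)
696 20.8 12.5 1.7 hum 5353 5592 (2111) C Charlie4a DNA/MER1_type (37) 471 206
516 20.9 2.7 0.0 hum 5593 5702 (2001) C MER81 DNA (0) 114 2
347 13.7 0.0 0.0 hum 5705 5755 (1948) C MER81 DNA (63) 51 1
972 20.2 0.9 3.7 hum 5757 5974 (1729) C Charlie4a DNA/MER1_type (296) 212 1
890 27.0 4.3 0.7 hum 6169 6468 (1235) + L1ME LINE/L1 5509 5819 (345)
1457 20.5 0.7 0.3 hum 6485 6777 (926) + AluJo SINE/Alu 1 294 (18)
436 21.3 2.8 1.9 hum 6780 6887 (816) C L1PA14 LINE/L1 (209) 5940 5832
290 28.4 2.1 0.0 hum 6992 7086 (617) + L1ME3A LINE/L1 6062 6158 (5)
"""


def parse_hits(rows):
    """Parse rows into dicts holding the query coordinates of each repeat."""
    hits = []
    for line in rows.strip().splitlines():
        f = line.split()
        hits.append({
            "score": int(f[0]),
            "begin": int(f[5]),     # start in the human query sequence
            "end": int(f[6]),       # end in the human query sequence
            "strand": f[8],         # "+" or "C" (complement)
            "repeat": f[9],
            "family": f[10],
        })
    return hits


def first_hit(hits, name):
    """Return the first hit whose repeat name matches exactly."""
    return next(h for h in hits if h["repeat"] == name)


hits = parse_hits(REPEATMASKER_ROWS)
mlt1b_end = first_hit(hits, "MLT1B")["end"]
alujb_gap = first_hit(hits, "AluJb")["begin"] - mlt1b_end
l1me3a_gap = first_hit(hits, "L1ME3A")["begin"] - mlt1b_end

if __name__ == "__main__":
    print("MLT1B end to AluJb start:", alujb_gap)
    print("MLT1B end to L1ME3A start:", l1me3a_gap)
```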
Genome Research Vol. 9, Issue 9, 815-824, September 1999 Niclas Jareborg, Ewan Birney, and Richard Durbin; The Sanger Centre
A data set of 77 genomic mouse/human gene pairs has been compiled from the EMBL nucleotide database, and their corresponding features determined. This set was used to analyze the degree of conservation of noncoding sequences between mouse and human. A new alignment algorithm was developed to cope with the fact that large parts of noncoding sequences are not alignable in a meaningful way because of genetic drift. This new algorithm, DNA Block Aligner (DBA), finds colinear-conserved blocks that are flanked by nonconserved sequences of varying lengths. [This anchor method works well on prion intron 1, even when Blast is failing to return significant matches. -- webmaster]
The noncoding regions of the data set were aligned with DBA. The proportion of the noncoding regions covered by blocks >60% identical was 36% for upstream regions, 50% for 5' UTRs, 23% for introns, and 56% for 3' UTRs. These blocks of high identity were more or less evenly distributed across the length of the features, except for upstream regions in which the first 100 bp upstream of the transcription start site was covered in up to 70% of the gene pairs. This data set complements earlier sets on the basis of cDNA sequences and will be useful for further comparative studies.
Nature 401, 311 (1999) 23 September 1999 David Dickson
This new estimate of the number of human genes is quite relevant to the quest for structurally or functionally related neighbors of the prion and ghost prion genes on chromosome 20. These estimates have been going up and down for many years, just like the Hubble constant in astronomy. Incyte has a lot of sequencing data and the estimate could be a good one. They have a promising set of links to public EST data.
After spending 3 weeks on 150,000 bp of human chromosome 20, the webmaster is still of the opinion that there are only 2 active genes and 2 pseudogenes in this stretch. Assuming 3 billion base pairs overall, that works out to a meagre 40,000 genes (20,000 such stretches at 2 genes each) whereas these folks want 7.1 active genes in the 150 kbp. The parasitism of chromosome 20 is phenomenal, up to 70% retrotransposons and defunct viruses in some stretches.
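The extrapolation above is plain proportional scaling. As a rough Python sketch, assuming a uniform gene density across a 3 billion bp genome (an assumption made here purely for illustration) and taking 142,634 as the Incyte estimate quoted on this page:

```python
GENOME_BP = 3_000_000_000    # assumed genome size, as in the text
WINDOW_BP = 150_000          # stretch of chromosome 20 examined
GENES_IN_WINDOW = 2          # active genes the webmaster counts there
INCYTE_GENES = 142_634       # Incyte estimate quoted on this page


def windows_in_genome(genome_bp, window_bp):
    """Number of window-sized stretches that fit in the genome."""
    return genome_bp / window_bp


def genome_wide_genes(genes_per_window, genome_bp, window_bp):
    """Scale a per-window gene count up to the whole genome."""
    return genes_per_window * windows_in_genome(genome_bp, window_bp)


def genes_per_window(total_genes, genome_bp, window_bp):
    """Average genes per window implied by a genome-wide total."""
    return total_genes / windows_in_genome(genome_bp, window_bp)


if __name__ == "__main__":
    print("windows:", windows_in_genome(GENOME_BP, WINDOW_BP))
    print("genes at 2 per window:",
          genome_wide_genes(GENES_IN_WINDOW, GENOME_BP, WINDOW_BP))
    print("Incyte genes per window: %.1f"
          % genes_per_window(INCYTE_GENES, GENOME_BP, WINDOW_BP))
```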
The prion gene itself has a dramatic CpG island, as graphed by Inyoul Lee et al. in the Dec 98 Genomics paper. The two pseudogenes are apparently processed mRNA, so even the recent RPSx4 event would not carry its island with it (if it had one to begin with). A CpG island is sometimes defined as a DNA stretch at least 200 bp long with a GC content exceeding 50% and an observed-to-expected ratio of CpG dinucleotides greater than 0.6 (Gardiner-Garden and Frommer 1987).
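The three criteria just quoted are easy to test on any stretch of sequence. Below is a minimal Python sketch, assuming the commonly used observed-to-expected formula (CpG count x length) / (C count x G count); the function names are illustrative, and real island finders typically slide a window along the sequence rather than scoring one fixed stretch.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    s = seq.upper()
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")                  # CpG dinucleotides on this strand
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the length, GC content, and obs/exp thresholds quoted above."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp


if __name__ == "__main__":
    toy_island = "CG" * 100              # artificial 200 bp, all CpG
    toy_at_rich = "AT" * 100             # artificial 200 bp, no C or G
    print(cpg_stats(toy_island), looks_like_cpg_island(toy_island))
    print(cpg_stats(toy_at_rich), looks_like_cpg_island(toy_at_rich))
```

The toy sequences are artificial and serve only to exercise the thresholds; they are not taken from the prion or doppel genes.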
Taking a hasty look at the human prion doppel gene: there are 289 CpG sites in the 20311 bp between the end of the prion gene mRNA and the TATA box of the prion doppel gene. A further 21 are found in the 2032 bp of intron 1. The GC contents are 47.1% and 49.2% respectively. Roughly 1269 and 127 CpG sites would be expected by chance (1 in 16), respectively, so overall these regions are severely depleted, as expected for mutational hotspots after a great length of time.
However, just upstream of the prnd promoter, there are 30 CpG in 812 bp vs 51 expected, which is a substantial enhancement (2.7x) over the average depletion level of 0.22 but still significantly depleted from the statistical expectation; no case can be made for hypomethylation. So there is somewhat of a CpG island preceding the prnd gene as well. However, this needs to be explored with web tools that evaluate statistical significance relative to other CpG islands, eg, G. Micklem and R. Durbin, unpubl. Of course, doppel does have a CpG island -- the one that it borrows from the prion gene in chimeric transcripts.
Now 47% of human genes do not have a CpG island but the significance of this absence isn't fully known. What does it mean for a gene not to have a CpG island, in terms of transcriptional initiation or regulation? Is there a dichotomy of genes or a continuous spectrum of intermediate strength islands? CpG islands are associated with the 5' ends of all housekeeping genes and many tissue-specific genes and with 3' ends of some tissue-specific genes. The 5' CpG islands extended through 5'-flanking DNA, exons and introns, whereas most of the 3' CpG islands appeared to be associated with exons.
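The CpG counts above can be rechecked with a few lines of arithmetic. This is only a sketch in Python: it uses the same 1-in-16 expectation as the text (which assumes equal base frequencies and ignores the measured GC content), and the region labels are illustrative.

```python
# (observed CpG count, region length in bp), as quoted in the text.
REGIONS = {
    "intergenic": (289, 20311),   # prion mRNA end to doppel TATA box
    "intron 1": (21, 2032),
    "upstream of prnd": (30, 812),
}


def expected_cpg(length_bp):
    """Expected CpG count if each dinucleotide occurs 1 time in 16."""
    return length_bp / 16.0


def obs_over_exp(observed, length_bp):
    """Observed CpG count relative to the 1-in-16 expectation."""
    return observed / expected_cpg(length_bp)


# Pool the two depleted regions to get the background depletion level.
background_obs = REGIONS["intergenic"][0] + REGIONS["intron 1"][0]
background_len = REGIONS["intergenic"][1] + REGIONS["intron 1"][1]
background_ratio = obs_over_exp(background_obs, background_len)

upstream_ratio = obs_over_exp(*REGIONS["upstream of prnd"])
enrichment = upstream_ratio / background_ratio

if __name__ == "__main__":
    for name, (obs, length) in REGIONS.items():
        print("%-17s expected %.0f, observed %d, obs/exp %.2f"
              % (name, expected_cpg(length), obs, obs_over_exp(obs, length)))
    print("background obs/exp: %.2f" % background_ratio)
    print("upstream enrichment over background: %.1fx" % enrichment)
```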
The sequence actually covers only the lower arm, the so-called q region, of the chromosome. It's roughly 32 megabases long and contains almost all of the chromosome's genes. The upper arm, called the p region, was ignored because it doesn't seem to code for proteins. The consortium also skipped the telomere--the tail end of the arm--and most of the centromere--the "waist" of the chromosome that separates the two arms. These two regions contain few genes and are very difficult to sequence.
The last bits of sequence were the most difficult. "We decided we were almost finished [last spring]," says Shimizu, "but then it took 6 months to actually finish." For reasons that are not completely understood, the bacterial clones that researchers depend on to produce the DNA needed for actual sequencing don't retain certain human sequences. This led to an exhaustive and frustrating search through clone libraries in hope of finding a clone that would cover a particular gap. The group succeeded in filling some of the gaps, but nine small gaps remain that Dunham says "seem to be unclonable."
Comment (webmaster): Swiss-Prot just began carrying indexes of proteins found on various chromosomes: 12, 14, 15, 16, and 20. These are helpful in that the cross-links will be good. The webmaster requested that Swiss-Prot carry chromosomal adjacency links and regional Blastp matches so that tandem duplications and inversions are more readily seen. Pseudogenes would also be important for protein evolution in many instances.
Nature 401, 311 (1999) 23 September 1999 David Dickson
One of the leading private-sector participants in US genome sequencing efforts says he has firm evidence that there may be more than 140,000 genes in the human genome -- a significant increase over conventional estimates closer to 100,000.
Randall Scott, president and chief scientific officer of the biotechnology company Incyte Pharmaceuticals, suggested this new figure during a presentation on Monday (20 September) to the annual sequencing conference organized in Miami, Florida, by the Institute for Genomic Research (TIGR).
His announcement coincides with the news that the chief scientific advisers to the US and British governments, Neal Lane and Robert May respectively, have been discussing a joint declaration underlining the commitment of their governments to public access to raw sequence data.
Scott's new estimate of the number of human genes will come as little surprise to most researchers. Already one result of sequencing work on other organisms, for example the fruitfly Drosophila, has been to increase the estimate of the number of genes these organisms contain, usually by a factor of about 20 per cent.
Nevertheless, the higher number will be of considerable interest to geneticists, particularly as it follows other suggestions that the total number of nucleotide bases in the genome is considerably more than the figure of three billion usually quoted in debates on sequencing projects.
Scott's new estimate of the total number of human genes is based on an analysis of the prevalence in genes of CpG islands -- short stretches of DNA that can be methylated and as such provide a mechanism for controlling gene expression (full details supposedly available at Incyte).
Incyte has already produced a large bank of sequence data that is made available for searching by other companies and research institutions on a contract basis. Incyte researchers have sequenced just under half of the CpG islands and compared them to the finished sequences of genes now available.
"This has allowed us to estimate that, overall, 53 per cent of genes have CpG islands associated with them," says Scott. A further estimation that there are just over 75,000 CpG islands in total in the genome has led Scott and his team to predict a total of 142,634 genes.
"Previous recent estimates appear to have been substantially lower than this because they have overestimated the frequency of CpG islands in genes," he says. "One of the implications is that the genome is even more complex than we originally thought."
Incyte's calculations are likely to increase interest in the question of patenting of sequencing data, which produced headlines in Britain this week when The Guardian revealed the discussions between Washington and London on a possible joint declaration on access to sequence data. The report was later confirmed by the Department of Trade and Industry.
Officials on both sides of the Atlantic were quick to point out, however, that such a declaration would be aimed primarily at ensuring rapid access to raw sequence data, rather than preventing the patenting of genetic data as such (including the right to patent a specific gene when its sequence data is linked to a specified application).
The move has been welcomed by Britain's Wellcome Trust, which is sponsoring one third of the total human genome sequencing effort. The trust has been insistent -- together with its US partners -- that one condition of its support is that all such data is made publicly available within 24 hours.
John Sulston, director of the Sanger Centre near Cambridge, which is jointly funded by Wellcome and the Medical Research Council, says: "This data must be shared and controlled by all, and I strongly endorse the idea of giving the human genome an 'international ownership' flavour in this way".
Guardian (London) Monday September 20, 1999 David Hencke, Rob Evans and Tim Radford
Tony Blair and Bill Clinton are negotiating an Anglo-American agreement to protect the 100,000 genes that control the human body and provide the catalysts for medical advance.
The extraordinary deal - initiated by Mr Blair - aims to prevent entrepreneurs profiting from gene patents and to ensure that the benefits of research are freely available worldwide to combat or even eliminate diseases. The two leaders decided to act after an acceleration in the pace of discovery of the make-up of the human body. In 1997, 8,000 genes had been mapped; by 2003, the body's entire 100,000 genes will have been mapped.
The deal aims to ensure that the world's largest medical charity, the British-owned Wellcome Trust, and the US government owned National Institute of Health, publicise genes within 24 hours of their discovery - so that the benefits accrue entirely to the public. Research bodies, universities or laboratories, would be obliged to waive their rights to patent their work in the public interest.
To get the deal Mr Blair, through his science envoy, Lord Sainsbury, pressed the US government to scrap an agreement with an American entrepreneur scientist, Craig Venter, who set up his own company, Celera in Maryland, to patent as many human genes as possible.
As revealed in the Guardian last year, Dr Venter believed that he had developed a method to map the whole gene make-up before the international venture could do so - thus enabling him to patent the information. To protect his investment he tried to get a deal with the US department of energy, which the Wellcome Trust warned would inhibit development of drugs since companies would have to buy a licence to use a Venter gene.
Documents released to the Guardian under the US freedom of information law show there have been discussions between Lord Sainsbury and Neal Lane, Bill Clinton's science and technology adviser, to turn what is known as the Bermuda accord - an informal agreement to release all research on human genes without claiming patents - into a full inter-governmental agreement. The two talked in Kyoto in Japan and Williamsburg in the US during the Carnegie group summits of G8 science ministers.
One e-mail by Mr Lane to a colleague in Washington last December says: "Tony Blair might approach the potus [Bill Clinton] about having a written agreement on cooperation re the human genome project. Lord Sainsbury is handling this matter for the PM. Harold Varmus [director of the US national institute of health] feels an agreement is not really needed but has no objection to having one if it is felt to be important."
Another e-mail discloses talks this year with Ari Patrinos, head of the human genome project at the US department of energy, on how to draw up the Anglo-American agreement.
The e-mail discloses that, before the talks, the department of energy withdrew its agreement with Celera and put up proposals to incorporate the company in a joint US-UK agreement. Officials are worried it may not agree, but the e-mail ends: "Bottom line is that, although [the energy department] did have an earlier agreement with Celera, they have since withdrawn it and are working with [the US health institute] and the Wellcome Trust as a group on any future industry agreements."
The department of trade and industry said yesterday: "The US has proposed an inter-government agreement on the human genome project. We are currently negotiating." The Wellcome Trust said it was keen on a deal that would develop the Bermuda accord.