
Human genome: advanced annotation tutorial

Introduction to genome annotation
30 kb example: a dynein pseudogene
Characterizing features
Pseudogene annotation: the issues
Assembling unfinished contigs

How GeneBander tracks are made
How they did it in the old days

Status and limitations of current tools
Glossary, notation, and acronyms
Published examples of good annotations
Searching for disease genes: Hallervorden-Spatz
Dynein pseudogene consisting solely of 3' UTR
KIAA1628: annotation of unfinished contigs
Tandem duplicate with chimeric intergenic exons
Dating multiply nested retrotransposons
Rps4X: inter-chromosomal superfamily
Syntenic annotation across 5 species
Older pseudogenes: psIPP
Comprehensive single gene annotation
L7a: processed pseudogene with retained intron
GenMap00 integrated map
More Examples (soon)
A tough little gene
Microsatellites in proteins 1, 2, 3
Predicting CpG mutations 1, 2, 3
Annotating 3'UTR across species
Remote orthologues by synteny
Human disease genes: mapped to protein
Human disease genes: partly mapped
The human genome browser at UCSC
Curated single gene mutation database
NCBI/GenBank nightmare

Introduction to genome annotation

Last updated 10 Aug 00 webmaster; unrestricted educational and non-commercial use; please cite this link.
This tutorial follows an experienced annotator through real-life annotation of uncharacterized consecutive segments of the human genome -- look elsewhere for a desiccated classroom drill. It is not intended as an introduction to molecular biology or basic Blast searching. This site features field-tested, experimentally-integrated annotation methods, not pie-in-the-sky annotation tools that seldom deliver.

In fact, the tutorial is designed so visitors can annotate their own piece of a genome as they go along, that is, get some real work done rather than just do an exercise. A blank worksheet is provided for this purpose. Your sequence may present issues not adequately addressed in the current set of tutorial examples -- submit a web link to your completed worksheet if it develops into an instructive example (but don't ask the webmaster to help annotate your sequence).

There are no real paradigms or standards for annotation -- each person does it differently. It is very easy to miss or misinterpret genomic features. GenBank entries themselves are annotated very unevenly, depending on the knowledge and interest level of the sequencing lab (and no one is allowed to fix a bad annotation!). GenBank is not curated: entries only provide suggestions for genomic features such as promoters, alternative splicing of mRNAs, retrotransposons, pseudogenes, tandem duplications, synteny, and homology.

As time goes by, most annotations need updating to reflect the new information gained through rapid accumulation of genomic sequences. This will continue long after the sequencing phase of the human genome project is completed, due especially to sequencing of longer ESTs and the mouse genome. Additionally, there is tremendous synergy from improvements in distantly related annotations. But since no one has time to revisit old annotations, how does this happen?

The fact is, trustworthy computer re-annotation remains far off on the horizon, so realistically we seek second-generation tools that take the sting out of hand annotation and allow rapid expert curation, as well as easy updating. The tutorial uses a graphical annotation web assistant called GeneBander with exactly these design features.

Let's get started. The four main ideas in successful genome annotation are (1) have someone else's computer do the heavy lifting, (2) assemble complex information in a simple graphic, (3) work at different scales (first coarse, then fine), (4) be done by lunch:

1. Do not download reams of links, sequences, alignments, and software: instead re-visit the site and re-generate matches when needed. Do not run Blast, RepeatMasker, or GenScan locally -- these sites work just fine over the internet even from a dial-up connection. Do not write computer programs -- to get a new feature, encourage an existing web tool provider to add features; in the meantime, cut and paste. Let tool providers update or maintain software. Run Blast 100 times in a day if needed (throwing out almost all of it) -- it makes no difference to the NCBI computer whether it is idle or busy or how many times it has filled the same request.

2. The main idea of graphical annotation is assembling information from a large number of sequence tests as bands stacked in register. Four standard bands come from GC isochores, CpG islands, RepeatMasker transposons, and GenScan predictions. Then come the 3 basic Blast tests: Blastn(nrn), Blastx(nrp), and Blastn(est). [Notation: the target database for a Blast tool at NCBI is given in parentheses: nrn, non-redundant nucleotides; nrp, non-redundant proteins; est, expressed sequence tags or mRNAs in dbEST.]

The best-kept secret in annotation is the Blast graphic. Screen shots of the first few inches of the Blast graphic can be rapidly curated and stacked in register with other interpretative bands. Forget entry descriptions and alignments -- they are easily regenerated and just clutter initially (set alignments to 0 or 10 in Blast formatting).

3. There can be too much detail at first: a 30kbp contig is a good size for the first pass. This allows significant matches to show up in the Blast graphic while doing a bite of a large contig. Don't chase down doubtful homology matches -- if it is black in the Blast graphic, forget it. Avoid the twilight zone of weak matches -- they rarely work out.

On the first pass, only identify candidate features. Later, confirm and characterize these candidates in order of declining quality of support: working known genes, unknown but well-supported genes, possible genes, pseudogenes, ... Annotate at various scales: micro, local, intra-chromosomal, inter-chromosomal, and inter-species. Thus, if the contig contains an ORF with EST support, are there internal stop codons, tandem duplications, paralogues on other chromosomes, and conserved matches in other species?

4. Never wait on a web window; keep a dozen jobs running in parallel. Send off 20-30 Blast requests at a time; track the queuing numbers in a small spreadsheet: the first request will be ready by the time the last query is in. Work directly within an html editor and save the developing annotation to your internet site as you go. That way, when you're done for the day, you are done with the job.
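The idea of bands stacked in register (point 2) at a coarse first-pass scale (point 3) can be sketched in a few lines of Python. This is only an illustration, not how GeneBander draws its graphic: the function names, the text rendering, and the choice of 60 columns are all assumptions made for the example.

```python
def render_band(length, intervals, width=60):
    """Draw one track as a row of characters: '#' where a feature lies, '.' elsewhere.

    length    -- contig length in bp
    intervals -- list of (start, end) pairs, 1-based inclusive contig coordinates
    width     -- number of character columns representing the whole contig
    """
    row = ["."] * width
    for start, end in intervals:
        # Map 1-based inclusive bp coordinates onto 0-based column indices.
        first = (start - 1) * width // length
        last = (end - 1) * width // length
        for col in range(first, last + 1):
            row[col] = "#"
    return "".join(row)


def stack_bands(length, tracks, width=60):
    """Stack named tracks so that a given column is the same contig position in every row."""
    label_width = max(len(name) for name in tracks)
    return "\n".join(
        f"{name.ljust(label_width)} {render_band(length, ivs, width)}"
        for name, ivs in tracks.items()
    )


if __name__ == "__main__":
    # Coordinates are those quoted for the dynein feature in example 1;
    # the track labels are illustrative.
    tracks = {
        "RepeatMasker": [(13468, 13670)],
        "Blastn(est)": [(13396, 13457), (13660, 13776)],
    }
    print(stack_bands(30000, tracks))
```

At 60 columns each character covers 500 bp of a 30 kb contig, which is deliberately coarse: the point of a first pass is to see which columns carry support from several bands at once.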

Identifying candidate features: example 1

Contig:dJ1068H6. Unfinished, Sanger Centre. 30,000 base pairs at 3' end
Setting:human chromosome 20p12 downstream of prion-doppel genes.  
187,029 bp: length of full contig
106,461 bp to ATG start of prion
 25,331 bp to ATG start of doppel
 38,620 bp to start of feature analyzed below
 16,617 bp to end of contig
... do neighboring genes shed light on function of prion gene? 
... how did this stretch of chromosome evolve?
... is there a syntenic counterpart on another chromosome?
Bottom line: 1 pure 3'UTR dynein pseudogene, 2 doubtful features.
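The distances listed in the setting can be checked with a little arithmetic. The sketch below assumes they are consecutive segments along the full contig (an interpretation, though one supported by the fact that they sum exactly to the full contig length) and converts the feature position into coordinates of the 30,000 bp segment analyzed here. Variable names are illustrative.

```python
# Distances quoted in the setting, in base pairs, read as consecutive segments.
TO_PRION_ATG = 106_461
PRION_TO_DOPPEL_ATG = 25_331
DOPPEL_TO_FEATURE = 38_620
FEATURE_TO_END = 16_617
FULL_CONTIG = 187_029
SEGMENT = 30_000  # the 3' portion of the contig annotated in this example

# Position of the feature start measured from the 5' end of the full contig.
feature_start_full = TO_PRION_ATG + PRION_TO_DOPPEL_ATG + DOPPEL_TO_FEATURE

# The 30 kb segment begins this many bp into the full contig.
segment_offset = FULL_CONTIG - SEGMENT

# Feature start expressed in coordinates of the 30 kb segment.
feature_start_segment = feature_start_full - segment_offset

if __name__ == "__main__":
    print("segments sum to full contig:", feature_start_full + FEATURE_TO_END == FULL_CONTIG)
    print("feature start in full contig:", feature_start_full)
    print("feature start in 30 kb segment:", feature_start_segment)
```

Under this reading the feature begins at about 13,383 in the 30 kb segment, which agrees with the positions near 13,400-13,800 used in the rest of the example.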

Annotation graphic:

Let's get started on the annotation by inspecting the baseline graphic. The GC content is unremarkable, averaging about 43% -- track 1 represents GC content as a grayscale centered about the mean; higher GC content is darker. Note a possible CpG island (track 2ab); some 60% of the genes on completed human chromosome 22 had them. Because of complementarity, neither GC nor CpG depends on which strand is analyzed. In this example, the contig is taken as the plus strand because the earlier prion and doppel genes are in this orientation.
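For readers who want to see what lies behind tracks 1 and 2, here is a minimal Python sketch of windowed GC content and of the observed/expected CpG ratio, a commonly used CpG island statistic. The window size, the function names, and the use of this particular ratio are assumptions for illustration; the tutorial does not say how the GeneBander tracks are actually computed.

```python
def gc_fraction(seq):
    """Fraction of G+C among unambiguous bases; masked N's are ignored."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def windowed_gc(seq, window=1000, step=1000):
    """Return (offset, GC fraction) for successive windows along the sequence."""
    last_start = max(len(seq) - window + 1, 1)
    return [(i, gc_fraction(seq[i:i + window])) for i in range(0, last_start, step)]


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def reverse_complement(seq):
    """Reverse complement, used here only to show strand independence."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


if __name__ == "__main__":
    toy = "ATCGCGGATTACACGTTCGA"  # made-up sequence, not from the contig
    # Both statistics are the same on either strand, as stated in the text.
    print(gc_fraction(toy) == gc_fraction(reverse_complement(toy)))
    print(toy.count("CG") == reverse_complement(toy).count("CG"))
```

The strand independence follows because G pairs with C (so the G+C total is unchanged) and because the reverse complement of the dinucleotide CG is again CG.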

RepeatMasker (black bars in track 3) found a large number of repeats. Retrotransposons are interesting in their own right but are not pursued further in this example except to the extent they are part of gene features. To avoid later artefacts (eg, active LINE 1 elements represented in dbEST), it is very important to run RepeatMasker in its most sensitive mode ('slow'). Masked output from RepeatMasker should always be used in Blast searches (leave masking N's and X's in place to preserve graphical alignments).

Usually there are a few items too new for RepeatMasker that have to be added to the mask after Blast searching. Look at the Blast graphic for long columns of very short matches and open a few of the alignment entries at Entrez to see whether it is a variant retrotransposon or simple sequence that has not yet made it into RepBase (the database underlying RepeatMasker). Such repeat elements might be called undocumented aliens. In this example, there were none.

Unmasked (raw) sequence should be used for GenScan, which is designed to find genes in genomic sequence. GenScan is the best ab initio gene prediction program but still reports many false positives and false negatives. In the tutorial graphic (track 4), only exon predictions are investigated (polyA and promoter site predictions are very weak). Do not run masked sequence -- it seems like it should help to remove repeats but it does not. GenScan is very fast and accepts long sequences, so do not tile or use short contigs. The GenScan graphic is clunky at this time.

GenScan often juxtaposes exons from unrelated genes within a single predicted protein, so each contiguous peptide needs to be checked separately. Do not attempt to confirm GenScan predictions until more of the annotation steps have been completed. Note however that a protein is a more powerful query (because of third codon synonymy) than a nucleotide: tBlastn(nrn) is allowed at NCBI but not tBlastx(nrn).

As the number of ESTs has gone into the millions and approaches saturation, Blastn(est) has become a far better method for finding gene features. Unlike gene predictions, EST matches are experimentally grounded. The first two things to note on Blast matches are (1) whether the match is near-perfect (eg, 99%) or simply good and (2) whether a few or a great many similar matches are found. To eliminate repeats, the best hit should be immediately checked at RepeatMasker; a few matches can be opened as well to see if their annotations mention transposons.

dbEST as of 11 Dec 99:  3,292,593 entries
human        1,642,565
mouse          860,918
rat            128,950
nematode       101,232
fruit fly       86,121
zebrafish       46,428
Blast searches can be repeated with various phylogenetic restrictions. It is not possible to request Boolean restrictions on taxonomy (eg, mammals BUTNOT human) nor to request that Blast output be clustered by phylogeny. The color of the Blast bar tracks the quality of the match so it is retained in the annotation graphic. Phylogenetic matching strongly affirms matches, especially open reading frames.

For experimental reasons, EST data is strongly biased towards the 3' end of genes; some authentic 3' UTR may include retrotransposons. Watch for gapped EST matches indicating a spliced-out intron; use the mouse-over feature to watch the accession number. Some pseudogenes are also transcribed. Some spliced mRNAs re-insert into the genome to become intron-purged functional genes. Many proteins have internal duplications; tandem genomic duplication is also common, often with one copy becoming a pseudogene. Adjacent features may be clustered into gene features by ESTs and Blastx single-match gaps across introns and polyA signals.

By looking at vertical columns of the annotation graphic, we now have a fair idea of where the annotation-worthy features are located, which ones code for known and unknown proteins, and how large and conserved the respective protein families are. Note the graphical process is scalable without limit: new types of data and new feature-finder output tools can be added without any disruption.

Features need to be explored in order of declining interest. Weak features can be very elusive -- no matter how much work is done, the issue remains in doubt. This tutorial does not use 'putative gene' or 'conceptual translation' or 'inferred CDS' -- it shall be understood that an annotation simply marshalls the available evidence the best it can and that further experimental data (or longer ESTs) are always informative.

Feature 1: As seen from the tutorial graphic (tracks 5-7), Blastn(est) has found a solid multiple mRNA match (87%, 90 matches) near position 13,500 with solid genomic support by Blastn(nrn) though not Blastx(nrp). Furthermore, both types of Blast matches were consistently annotated with 'dynein light chain 1 cytoplasmic, neuronal nitric oxide synthase inhibitor PIN', ie, there is a significant resemblance to a gene with known protein function. Nitric oxide is a neurotransmitter; adjacent prion and doppel genes are at synapses. This calls for intensive high resolution annotation of this feature.

Since dyneins (a family of small proteins of 89 amino acids) are contained in the non-redundant protein database, Blastx(nrp) might well have matched it. However, no match was found (track 6). Similarly, neither GenScan (track 4) nor Grail (data not shown) supported a gene here. Neither of the other EST matches (more about them later) in track 6 extends the dynein EST match.

What is going on? Investigating a representative EST (the 568 bp human EST AI741345, track 9), it emerges that only 227 bp of its length match in the contig (positions 321-382 ... 449-563 to 13396-13457..13660-13776 of the contig). Furthermore, that match is interrupted by a complete FLAM_A type SINE found only in the genomic contig from 13468-13670. [This accounts for the broken match in the tutorial graphic, tracks 5+7.]

To avoid major confusion, the EST is reverse-complemented to positive orientation relative to the contig. It is often impossible to determine unambiguously for a GenBank EST entry what orientation the sequence really had relative to authentic mRNA (minus). However this entry had a leading polyT, implying a trailing polyA in reverse complementation, or initial complementarity to an mRNA as befits reverse transcription. But recall that some genes have functional anti-sense transcripts.
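The orientation argument can be illustrated with a short Python sketch: reverse-complementing a read that begins with a poly T run yields a sequence that ends in poly A. The toy read below is invented for the example and is not the AI741345 sequence; the helper names are likewise illustrative.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N stays N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def leading_run(seq, base):
    """Length of the uninterrupted run of `base` at the start of seq."""
    count = 0
    for ch in seq.upper():
        if ch != base:
            break
        count += 1
    return count


def trailing_run(seq, base):
    """Length of the uninterrupted run of `base` at the end of seq."""
    return leading_run(seq[::-1], base)


if __name__ == "__main__":
    est = "TTTTTTTTGCATCG"  # hypothetical read with a leading poly T
    flipped = reverse_complement(est)
    print(flipped)
    print(leading_run(est, "T"), trailing_run(flipped, "A"))
```

A long leading T run that becomes a trailing A run after flipping is the signature used above to infer that the read was primed from the poly A tail.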

It quickly emerges from Blastx(AI741345, nrp) that positions 1-261 correspond to a full length match of the protein coding sequence of numerous dyneins. Note this fixes once and for all the correct orientation of the feature to the positive strand. This also explains why Blastx(contig, nrp) failed: the feature represents pure 3' UTR of a dynein gene or pseudogene. It stops 61 nucleotides short of reaching coding sequence.

However, neither lack of coding sequence nor the presence of an internal SINE proves pseudogene. Perhaps more of the dynein gene lies past intervening intronic sequence outside the contig boundaries. Perhaps a transcript of this 3'UTR functions to regulate translation or stability of mRNA from a coding dynein gene. While no EST match is near-perfect (parental), nonetheless agreement is high. This suggests either a very recent insertion (seemingly followed by an even more recent Flam_A event on the minus strand) or conservation of sequence through functional selection or maintenance by gene conversion. The Flam_A is complete. Further sequences from other primates are unavailable.

The run of genomic poly A (here 13 bp or 20 in 27: AAAAAAAAAAAAAGAGCCAGAAGAAAA) at the end of 3'UTR supports retrotransposition of processed mRNA: poly A is generally added during mRNA maturation and is not genomic. Often, direct flanking repeats, arising from staggered nicks, accompany the insertion event. Here the waters are muddied at the 5' end because of back-to-back AluJb and (AAAT)10. (RepeatMasker, after removing the full length Flam_A, cannot find an older repeat here.)

In any event, the 3' flanking sequence has no upstream repeat. Conventional retrotransposon events, such as L1 LINE and Alu establishments, sometimes are associated with piggybacked pseudogene mRNA insertion. A nearly complete AluJb (16 bp missing 3') on the positive strand ending in AAAAAGGAGAAA immediately precedes the (AAAT)10 upstream of the dynein UTR, suggesting an independent event. Possibly the Alu was inserted earlier and the mRNA insertion used the same break points.

It is instructive to look at the exon/intron structure of previously annotated dynein genes. Intron positions are generally stable in evolution. Rabbit AF020710 is shown (track 10). It is typical of genomic dyneins in that it lacks the Flam_A insertion. It is a good homology match otherwise to some portions of 3' UTR of the human chromosome 20 contig. While it matches the human EST (track 9), the latter must arise from a gene with a more distal poly A site and possibly a distinct 3' UTR intron. A rabbit dynein pseudogene AF005066 (possibly annotated in Gene 214, 67-75 1998) has an identical region of homology to human chr 20.

A human dynein on chromosome 14, unfinished AL121769, gives a human genomic partial match to the chr 20 feature. However, other dyneins are found on chromosomes 1p35 (AL031737), 7q31, 8, 12 (x2), 14, 17, and 22q12.

There is an excellent full length match (possible parental gene source) to the EST on chromosome 12 unfinished sequence 75N14.02591, assuming that the quality of the EST read deteriorates towards the coding end. The match is non-contiguous, with a gap of 1518 bp suggesting an intron spliced out in the EST near position 126. This is the same structure observed in the rabbit gene, where an intron of 1538 bp separated coding exons 1 (length 132) and 2.

The sequence of the chromosome 12 gene thus affords a 3' genomic extension of the EST (which necessarily stopped at the poly A site) but 3 kb of this failed on Blastn(2) to match any 3' sequence in the chr 20 contig. Likewise the (repeat-less) intron had no match to genomic chr 20 contig. The 5' extension also had no matches. The chr 12 gene lacks the AluJb, (AAAT)10, and Flam_A. This latter SINE is found 107 times in the human EST database (ie, transcribed) and 1059 times among the non-redundant nucleotides.

Thus there is no support for a genomic origin of the chr 20 feature; note, however, a processed mRNA insertion might later be subject to a chromosomal rearrangement. The chr 12 dynein, unlike that of the EST, has no stop codon: MCDRKAVIKNADMSEEMQQDSVECATQALEKYNIEKDIAAHIKKEEFDKKYNPTWHCIVGRNFGSYVTHETKHFIYFYLGQVAILLFKSG. Using it as a tBlastn query against all sequenced chr 20 gave no matches; other dyneins varied from 65-86% identity.

Thus, at this time, the best parental candidate for the many dynein ESTs matching the chr 20 contig is this gene; the best explanation of the chr 20 feature is partial retrotransposition of a processed mRNA from this gene during primate evolution. There is no evidence that the chr 20 feature itself is ever expressed in any tissue. Its only unusual feature is the 262 bp stopping short by 61 bp of upstream coding sequence. Thus this feature is best viewed for now as a non-transcribed, pure 3' UTR processed pseudogene at position 13515-13776 of this 30 kb contig.

The only puzzling aspect is why mouse, rabbit, human, and the pseudogene all conserve a distal region that is not really near the well-preserved poly A signal upstream of the poly A. Gene conversion could be maintaining this stretch of sequence.

Millions of base pairs of new sequence have come in to the databases in the few days taken in writing up this tutorial. Wouldn't it be nice if this button actually launched an update of the GeneBander graphic:

Characterizing features

last update 26 Dec 99 webmaster
Clustering and classification of features: The first step is prioritizing annotation effort. Features need to be explored in order of declining interest. Protein features are columns of the GeneBander graphic with good protein support [some mix of GenScan, tBlastx(est), gapped EST matches, or Blastx(nrn) matches] and long solid matches to non-hypothetical annotated proteins. Unmatched protein features have protein support but lack a match to an annotated protein.

Weak features lack protein support but can have numerous good EST matches. This easily arises in a gene with a long 3' UTR (which is unlikely to match an nrn nucleotide entry); some 3' UTRs even have long (but out-of-phase) open reading frames. These features may be attached to protein features many tens of kilobases upstream. Sometimes ESTs tile upstream until a coding region is reached. If the ESTs even tile past a non-coding intron, that warrants interpretation as a gene.

If not, little more can be done to annotate weak features, other than to collect tissue expression data and look for polyA and splice signals that cluster them with adjacent elements. Gapped ESTs and Blastx matches may consolidate neighboring but non-contiguous features into adjacent exons.

Determining the best available parent: For a protein feature, the next priority is to identify the best currently available parental candidate. For a matched protein feature, this is simply the best Blastx match to the contig or the best Blastp match to a GenScan prediction. The best parent is used to determine the boundaries [see below] of the feature, which results in a need for re-determination of the best parent etc; this is an iterative process (probe reversal) and one that greatly benefits from periodic checking of newly posted sequence.

If there is a reasonable homologue to the feature protein in tandem or near-tandem position, this is very likely the true parental gene [see below] even if it is not the best parent. The true parent of a pseudogene may not yet have been sequenced (indeed it may have been lost from the genome) or the apparent parent may be a distant paralogue. The best available parent may even be in mouse or another species. Nonetheless, it is useful in characterizing the feature.

Determining feature boundaries: We now can repeat many of the Blast searches using the best available parent proteins. tBlastn against the original contig is often eye-opening: the boundaries of the contig's protein can be much wider than GenScan or Blastx above predicted because of frameshifts, stop codons, small indels, and lost exons. This usually means pseudogene; detailed annotation of pseudogenes is important.

Extended boundaries of the feature should also be explored by Blast2n using best available parent DNA or mRNA. This depends on the quality of annotation of the parent. If the parent gene is intronless, it is difficult to establish whether the feature has been processed (ie, is retrotransposed mature mRNA).

Apparent EST matches to the coding region might not have emanated from the feature (even if it is functional: mRNA could be rare or tissue-restricted) but rather from mRNA to a parent gene. ESTs by their nature can have a significant error rate; a better percent identity to the best available parent than to the feature is instructive. Error noise in ESTs can also be reduced using 'pairwise with identities' in Blast formatting. Sometimes ESTs will tile in an informative manner; UniGene links are not yet available from the EST entry or accession number. One must be wary of recent gene conversion of a pseudogene by the parent because this results in a closer match to mRNA.

Determining feature boundaries often improves the best available protein sequence for the feature. It is convenient to translate the relevant DNA into all frames at SwissProt's translator and assemble a full-length protein by frame-jumping. This can also be done by assembling tBlastn output fragments of the parent against the contig. (Blast returns are annoyingly scrambled: within a matching entry, fragmentary matches are in quality order rather than in query-sequential order.)
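What a web translator does before any frame-jumping is simple enough to sketch in Python. The code below translates one strand in its three forward frames using the standard genetic code; the function names are illustrative, codons containing masked or ambiguous bases are shown as X, and the minus strand would be handled by reverse-complementing first. It is a sketch of the principle, not a replacement for the SwissProt tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)}


def translate(dna, frame=0):
    """Translate dna starting at offset `frame` (0, 1, or 2); '*' marks a stop codon."""
    dna = dna.upper()
    peptide = []
    for i in range(frame, len(dna) - 2, 3):
        # Codons with N or other ambiguity codes are not in the table and become 'X'.
        peptide.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(peptide)


def three_frames(dna):
    """Return a dict {1: ..., 2: ..., 3: ...} of the three forward-frame translations."""
    return {frame + 1: translate(dna, frame) for frame in range(3)}


if __name__ == "__main__":
    for frame, peptide in three_frames("ATGGCCTAA").items():  # toy sequence
        print(frame, peptide)
```

Frame-jumping then amounts to reading along one frame until the match to the parent protein breaks off, and picking the match up again in whichever of the other two frames continues it.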

Gene or Pseudogene?: It is far easier to prove something is a pseudogene than to prove it is a functional gene. Individual genes are often greatly outnumbered by their pseudogenes; the ratio for cytochrome c is 1:15. A very recent pseudogene may be hard to distinguish from a working gene; cases are known of long ORFs with ATG start and a single stop codon, frequent transcription, yet defective translation initiation.

Pseudogenes of moderate age are usually easily recognizable. They have accumulated many defects across the coding sequence but not so many that homology to the parent is blurred. A single internal stop codon is not completely persuasive as there are cases of mRNA editing.

Older pseudogenes may have drifted off to the point of marginal recognizability. Determining a good parental sequence and event boundaries becomes problematical. Nonetheless, ancient pseudogenes can still establish whether a given protein domain was present at the time of formation -- these are still identifiable long after point mutations have largely obliterated alignments. (Absence of a domain might also mean retro-insertion of an alternatively spliced mRNA.)

The two basic mechanisms of pseudogene origin are tandem duplication followed by loss of function, and insertion of a reverse-transcribed parental gene (usually processed into mature mRNA). In the case of adjacency of the pseudogene, the first mechanism is strongly favored (insertions have no preference for the parental gene location). However, tandem paralogues can also be translocated to other chromosomes; a regional internal duplication could also have separated the duplicates by a considerable distance.

Sorting out large protein families: It is eye-opening to pick 10 proteins at random and run tBlastn(nrn+htgs+hgp) with each against finished and unfinished (htgs) human genome project sequences. It is not at all uncommon to find 5-6 good matches to a given protein even though the human genome is only a quarter completed. These are distributed over several to many chromosomes and represent diverged paralogues as well as pseudogenes. By investigating homologue locations of consecutive features from the contig under investigation, regions of synteny can sometimes be found for the whole segment. This can clarify parental gene relationships.

Pseudogene annotation: the issues

26 Dec 99 webmaster
Pseudogenes are diverse, easy to miss, and complex to annotate. They fall into two rough classes: those that arose from genomic duplication and those that arose from retrotranscription of mRNA. The first class is important for understanding chromosomal evolution; the second for a snapshot in time of a gene's evolution. Pseudogenes are part of the extended superfamily of a gene.

Severe problems at GenBank make it difficult to get an overall picture of pseudogenes, despite the immense amount of data. Historically, pseudogenes were often dismissed as insignificant irritating relics tangential to the real research goals of the investigator. Annotation of pseudogenes at GenBank is erratic, seldom informative, and generally just a byproduct of the annotation of a nearby functional gene. Only meagre descriptive options are allowed by Sequin to begin with. There is no mechanism permitted by GenBank by which a missed pseudogene can be added to an annotation by third parties. Every entry has to be reannotated from scratch by each person who looks at it.

At GenBank Entrez, 623 entries are found using 'pseudogene[Keyword] AND homo sapiens[Organism]' whereas the number of genes and pseudogenes are approximately equal in carefully annotated stretches of the human genome. Of these, 33 also satisfy 'processed[Keyword]'. Only 2 coding regions of pseudogene proteins are annotated; one is annotated 'expression, lack of start codon; frameshift mutation' while the other offers no details. 'Coding' pseudogenes are best read by frame-jumping the 3-frame translation or frame assembly from Blastx(nrp) of the parent gene using GeneBander.

Using 'pseudogene[All Fields] AND homo sapiens[Organism]' gives 2634 entries; adding processed cuts this to 235, substituting transcribed gives 22, and requiring both processed and transcribed gives 5 entries. 'Transcribed pseudogene' is not an allowed keyword; using 'transcribed[AllFields] AND pseudogene[Keyword] AND homo sapiens[Organism]' gives 11 sequences (that merely mention transcribed in the journal title).

Thus the great majority of pseudogenes at GenBank are simply mentioned in the journal title but not found in the annotation proper. In other words, the whole human genome will have to be reannotated from scratch using higher annotation standards for pseudogenes.

Pseudogenes are rarely described adequately in Medline abstracts. 'Pseudogene[MeSH Terms] AND human[All Fields]' gives 976 entries (1559 for pseudogene not as MeSH term); few entries are linked to full text. It would cost tens of thousands of dollars to acquire a full set of reprints of the human pseudogene scientific literature and take years to summarize it in electronic form.

Given this mess, it is probably easier to wait until the human genome is sequenced and then tBlastn(nrn) each known functional protein against the whole genome to pull out its superfamily, among which would be its coding pseudogenes. Querying Blastn(nrn) and Blastn(est) with known genes and known genomic mRNAs would find non-coding and truncated pseudogenes, while resolving retroinsertional pseudogenes from genomic duplicational ones.

Human chromosome 22 contains 545 genes and 134 pseudogenes in 33.4 megabases, which extrapolates to 15,000 pseudogenes across the genome. However, it is likely that a great many pseudogenes were missed; search methods were not described.

Pseudogene checklist: Here is a list of issues to be considered in annotating pseudogenes:

coding disruptions. In-frame stop codons are an elementary property that distinguishes a pseudogene from a functioning gene. In rare instances, the preceding protein fragment is still functional; bad codons can also be bypassed or edited at the level of mRNA. Small indels, frameshifts, and internal retrotransposons are also indicative of pseudogenes. More subtly, pseudogenes have a higher level of non-synonymous, non-conservative coding mutations than a functional gene. A new pseudogene may lack significant coding disruptions; conversely, a gene with many coding changes can still be functional (best supported by phylogenetic conservation of the open reading frame).

direct flanking repeats and poly A tail. If the pseudogene arose from reverse-transcribed mRNA, the nicking and insertion mechanism leaves diagnostic flanking repeats of perhaps a dozen base pairs; the downstream repeat may follow a poly A tail. Over time, the initial identity of the direct repeats as well as the polyA will deteriorate to unrecognizability.

introns. Splicing purges introns from a mature mRNA, so if the pseudogene has them, it probably arose from genomic duplication (local, regional, chromosomal, or tetraploidization), though unspliced mRNA could conceivably be an alternative origin. However, genomic duplications usually include considerable flanking material as well.

no introns. If the pseudogene lacks introns, then mRNA retrotransposition is favored. However, if the parent gene had no introns to begin with, a genomic origin is still viable. Many genes on the X chromosome have given rise to intron-purged compensatory functional paralogues elsewhere via mRNA retrotransposition (complete with flanking direct repeats and 3' poly A). Once thought to lack promoters (located upstream of mRNA transcription starts), retrotranspositions can use secondary promoters or be expressed in permissive tissues (testes). In these cases, the annotator sees a parental gene with introns and a (functional!) daughter gene without them. Of course, parent and daughter genes could also give rise to pseudogenes.

transcribed. A pseudogene may fall under the control of another promoter if the insertion falls (non-disruptively) within an internal intron or exonic UTR. A significant percentage of pseudogenes continue to be transcribed even as their protein-coding capability deteriorates (lost translational start or truncated or quickly degraded protein). These pseudogenes would continue to be represented in dbEST. Chromosome 22 had two cases of functional genes on the minus strand of large introns of other genes. ESTs need to be examined closely (allowing for their inherent error rate): matches may be to a parental gene, a close homologue, or to the pseudogene itself.

anti-sense transcribed. Regulation of genes can occur uncommonly through anti-sense transcripts. An apparent pseudogene might instead be providing such an anti-sense transcript modulo accumulated mismatches. Very serious curation errors were made in compiling dbEST; it is very difficult to reliably determine which strand sense actually occurred in the starting tissue. This makes it impossible to search directly for anti-sense transcripts.

truncated or fragmented. A pseudogene may be terminally truncated at either end or be fragmented (missing internal exons). These effects arise in retrotranscriptional pseudogenes from alternative splicing or from incomplete insertion usually affecting the 5' end, or from an alternative poly A site. For genomic duplication pseudogenes, missing termini may not have been included in the original event (5' and 3' equally likely). For either kind of pseudogene, a subsequent deletion (perhaps millions of years later) may have caused the truncation; only for transcriptional pseudogenes could deletion boundaries plausibly be at intron-exon boundaries of the parent gene.

non-coding. Many mRNAs have long 3'UTRs. Since the retrotranspositional process starts at the 3' poly A tail but does not necessarily include a full-length mRNA, there is a built-in bias towards pseudogenes consisting of distal regions of the parental gene. The dynein pseudogene annotated above had no overlap with protein-coding exons of the parental gene. 3' UTR pseudogenes are rarely reported and never annotated in GenBank. With 3% of the genome functional and some 45% high copy repeats, a considerable portion of the genome might have originated as 3' UTR pseudogenes (many now unrecognizable).

gene conversion. A case is known of a pseudogene unexpressed for much of its evolutionary history that was later repaired and expressed, apparently after gene conversion by the parent gene. Pseudogenes may also be created by gene conversion events.

unequal recombination. A tandem or cis duplication predisposes a chromosome to misaligned recombination that can result in three copies on one daughter strand and one on the other. One copy may already have become a pseudogene. Considerable time is needed to accrue enough mutational change to preclude recombination or gene conversion. Orphons is the name given to solitary members of large copy clusters that are less affected by recombination.

parental gene. The parental gene giving rise to the pseudogene is seldom known for certain. For brevity, the best currently available parental gene candidate is called the parental gene. The best candidate for a parental gene is an adjacent homologous gene in the case of a genomic duplication pseudogene; otherwise the best Blast match (subject to updating as genome sequencing progresses). In some cases, the original gene has been lost (eg, a better Blast match is in another species). A duplicated gene may be the one retaining function as the original gene becomes the pseudogene, or both may become pseudogenes. A pseudogene itself could give rise to further pseudogenes through duplication or retrotranscription of its flawed mRNA and thus be the primary parent.

dating pseudogenes. Assuming the parental gene has been reliably identified, the degree of divergence is roughly proportional to the time elapsed. The rate of mutational fixation is commonly taken as that of synonymous codon evolution in coding genes, say 3-4 point mutations fixed per 100 base pairs per 10 million years. However, the synonymous codons are in fact under selectional constraint as inferred from two tiers of strong codon bias even in third position, so this under-estimates pseudogene time scales. CpG islands in new pseudogenes can experience rapid deterioration as newly available mutational hotspots give more chances at neutral drift fixation. Note that a pseudogene could experience selection not on its coding abilities but rather on its hybridization potential, if it still served in anti-sense regulation.

The date estimated is thus not that of the establishing event nor the date of loss of function because of gene conversion and recombination. Many pseudogenes have a checkered mixed-rate history. Older relics may exhibit mutational saturation (multiple mutations at a given site). Nearby intronic and UTR exonic base rates may help estimate overall local mutational fixation rates. Comparative phylogeny is the best dating technique: examining other species with known divergence dates for presence/absence of the pseudogene in syntenic position (example: present great apes, absent old world monkeys); it is not plausible that the same pseudogene would insert at the same site in different lineages.
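The arithmetic behind a first-pass date can be sketched in a few lines. This is only a back-of-envelope illustration, assuming the 3-4 mutations per 100 bp per 10 million years quoted above and a simple mismatch count from an alignment to the parental gene; the function names are hypothetical, and saturation, gene conversion, and mixed-rate histories are ignored.

```python
def estimate_age_myr(mismatches, aligned_bp, rate_per_100bp_per_10myr=3.5):
    """Crude age in millions of years from percent divergence.

    Assumes divergence accrues linearly at the given fixation rate;
    no correction is made for multiple hits at the same site.
    """
    if aligned_bp <= 0:
        raise ValueError("aligned_bp must be positive")
    percent_divergence = 100.0 * mismatches / aligned_bp
    return percent_divergence / rate_per_100bp_per_10myr * 10.0


def age_range_myr(mismatches, aligned_bp):
    """Younger and older estimates from the 4 and 3 per 100 bp rates."""
    return (estimate_age_myr(mismatches, aligned_bp, 4.0),
            estimate_age_myr(mismatches, aligned_bp, 3.0))
```

For instance, 60 mismatches over 1000 aligned bp is 6% divergence, which this sketch brackets at 15 to 20 million years. Treat such numbers as a starting point to be checked against comparative phylogeny.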

Cheap tricks for assembling unfinished contigs

last updated: 18 Dec 99 webmaster
The goal: unfinished contigs can be ordered by various techniques
Why wait around for the Human Genome Project to finish? It is quite feasible to assemble contigs without overlapping tiling sequences. This can be a great boon in extending the genomic neighborhood of a gene of interest. Let's assume we have a 140,000 bp contig in 20 unordered pieces. Here are some tricks that allow these to be ordered:

Retrotransposon extensions: RepeatMasker output gives some very important details in the column at the far right, namely how much of the repeat is missing. There are various reasons for this but for the first and last repeats, a continuation occurs on another contig fragment. Since half the human genome is repeats, three-quarters of the time a contig fragment begins and/or ends within a retrotransposon.

Example: contig #13 ends with a plus strand LINE 1 missing its last 2344 bp. Contig #9 ends with a minus strand LINE 1 element beginning with the last 2344 bp of such an element. Ergo, the order of assembled contigs is 13-9rc (reverse-complemented). A real-life example received strong additional support from the fact that the contig break took place within a nested contig.
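The matching logic in this example can be sketched as follows. The record layout is hypothetical: it assumes the annotator has already read off, by hand from the RepeatMasker table, which contig end each terminal repeat touches, its strand, and how many bp of the repeat's 3' tail are missing or present. It is not a parser for RepeatMasker output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TerminalRepeat:
    contig: str        # contig label, e.g. "13"
    contig_end: str    # "start" or "end": which end of the contig the repeat touches
    family: str        # repeat name, e.g. "L1"
    strand: str        # "+" or "-"
    tail_missing: int  # bp of the repeat's 3' tail absent here (0 if tail present)
    tail_present: int  # bp of the repeat's 3' tail present here (0 if tail absent)


def propose_join(a, b, tolerance=0):
    """Return an order such as '13-9rc' if a's missing tail is supplied by b, else None."""
    if a.family != b.family or a.contig == b.contig:
        return None
    if a.tail_missing == 0 or b.tail_present == 0:
        return None
    if abs(a.tail_missing - b.tail_present) > tolerance:
        return None
    # A plus-strand repeat missing its tail must run off the contig's right end;
    # a minus-strand one runs off the left end.
    if a.contig_end != ("end" if a.strand == "+" else "start"):
        return None
    # A tail-only fragment is broken on its 5' side: left for plus, right for minus.
    if b.contig_end != ("start" if b.strand == "+" else "end"):
        return None
    left, right = (a, b) if a.contig_end == "end" else (b, a)

    def label(rep, needed_end):
        # Reverse-complement a contig whose broken end faces the wrong way.
        return rep.contig + ("" if rep.contig_end == needed_end else "rc")

    return label(left, "end") + "-" + label(right, "start")
```

Feeding in the two records from the example, `TerminalRepeat("13", "end", "L1", "+", 2344, 0)` and `TerminalRepeat("9", "end", "L1", "-", 0, 2344)`, yields the order 13-9rc. A nonzero `tolerance` allows for slightly different consensus coordinates on the two fragments.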

Spanning mRNAs: A typical human gene has numerous small exons separated by largish introns. Mature mRNA represents the genomic sequence with introns spliced out, meaning that longer Blast(est) matches to genomic sequence are non-contiguous (shown in the Blast graphic as hatched).

In the case of contig fragments, the EST match on one contig may very well continue onto another. This is seen by comparing sequence match positions in Blastn(htgs) with contig fragments in the GenBank entry (fragments are separated by runs of 800 N's). In the event that a known protein is associated with the EST matches, the full protein can be used to assemble the coding exon-containing contigs. In the example, an EST match found on one small contig led to a very long mRNA at GenBank that had been sequenced starting from a set of capped mRNAs. This allowed the entire gene structure to be quickly assembled from 21 contig fragments.
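Comparing match positions against fragment boundaries is tedious by eye. A minimal sketch, assuming the unfinished entry has been pasted in as one string with fragments separated by runs of N's as described above (the function names and the `min_gap` parameter are hypothetical):

```python
import re


def split_fragments(sequence, min_gap=800):
    """Return 1-based inclusive (start, end) coordinates of each contig fragment.

    Fragments are the stretches between runs of at least `min_gap` N's.
    """
    fragments = []
    position = 0
    for gap in re.finditer(r"[Nn]{%d,}" % min_gap, sequence):
        if gap.start() > position:
            fragments.append((position + 1, gap.start()))
        position = gap.end()
    if position < len(sequence):
        fragments.append((position + 1, len(sequence)))
    return fragments


def fragment_of(position, fragments):
    """Return the 1-based fragment number holding a 1-based position, or None in a gap."""
    for index, (start, end) in enumerate(fragments, start=1):
        if start <= position <= end:
            return index
    return None
```

Passing the start and end coordinates of each Blast match through `fragment_of` shows at a glance which EST or mRNA matches span more than one fragment.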

Syntenic extension: In this example, genomic regions for mouse, human, and sheep had been determined for a syntenic region about the prion gene. However, the ends were different even though the 3 sequences were about the same length (due to deletions). The mouse sequence, being longer than the homologous stretch in humans, was then used to locate the unfinished human contig that extended the human sequence. In effect, the mouse sequence served as overlap, tiling two human sequences.

Iterated extensions: After exploiting the above methods to the maximum extent possible, the set of contig fragments may be ordered or only partially ordered. It doesn't matter. Concatenate the residual repeat-masked contigs that still have prospects for being terminal (ie, don't include contigs internal to the extension) and look for matches in both Blastn(htgs) and Blastn(nrn). With luck, overlap will be found with finished or unfinished sequence and the process of extension starts anew. If not, check back in a couple of weeks: lots of new sequence is coming in.

With enough extensions, the original gene cluster eventually incorporates enough satellite or other markers to determine its orientation with respect to the centromere-telomere axis of that chromosome.

How Tracks are made

Tracks are made in GeneBander and Photoshop
The goal: high information density in the annotation graphic without clutter
Numbers refer to band numbers on the tutorial example graphic above
1A. GC content: The sequence is treated as a binary sequence (1's for G and C, 0's for A and T) binned from 5' to 3' according to the scale. The scale used here was 40 bp = 1 pixel. Therefore, 40 consecutive binary numbers are averaged, representing the % GC for that bin. This is converted to an 8-bit grayscale pixel between 0 and 255 of vertical height 12 pixels. Example: if the GC content is 40%, the grayscale value is (0.40)x255 = 102. For a contig of 30,000 bp, the end product is a 'film strip' 750 pixels wide showing the GC content in 40 bp bins. The higher the GC content, the darker the band.
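The binning step can be sketched as below. This is not GeneBander's own code, just an illustration of the arithmetic in the text; the function names are hypothetical, and the last bin is averaged over however many bases it actually contains.

```python
def bin_fractions(flags, bin_size=40):
    """Average a list of 0/1 flags over consecutive bins of `bin_size`."""
    fractions = []
    for i in range(0, len(flags), bin_size):
        window = flags[i:i + bin_size]
        fractions.append(sum(window) / len(window))
    return fractions


def gc_track(sequence, bin_size=40):
    """One grayscale value (0-255) per bin: fraction of G+C times 255."""
    flags = [1 if base in "GCgc" else 0 for base in sequence]
    return [round(f * 255) for f in bin_fractions(flags, bin_size)]
```

A 40 bp bin with 16 G or C bases gives 102, matching the example above, and a 30,000 bp contig gives 750 values. Note that in most image formats 0 is black, so if higher GC should look darker on screen, plot 255 minus each value.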

1B. GC content blur: It is easier to see the 'big picture' of GC isochores (regional biases in GC content) if the bin window results above are subjected to convolution smoothing. To make the 'GC blur' track, each bin is replaced by a weighted local average (Gaussian of radius 5 or 40x5 = 200 bp at the scale here) using the gaussian blur tool in Photoshop.
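For those without Photoshop, the same smoothing can be approximated directly on the list of bin values. The sketch below assumes that the Photoshop 'radius' corresponds roughly to the standard deviation of the Gaussian, which is an assumption rather than something stated by the tool; edges are handled by repeating the end values.

```python
import math


def gaussian_kernel(sigma, truncate=3.0):
    """Normalized Gaussian weights out to `truncate` standard deviations."""
    half = int(truncate * sigma + 0.5)
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(values, sigma=5.0):
    """Replace each bin by a Gaussian-weighted local average."""
    kernel = gaussian_kernel(sigma)
    half = len(kernel) // 2
    last = len(values) - 1
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - half, 0), last)  # clamp at the ends of the track
            total += weight * values[j]
        smoothed.append(total)
    return smoothed
```

With 40 bp bins, `sigma=5.0` spreads each bin over roughly 200 bp on either side, in line with the scale quoted above.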

2A. CpG islands: This is done the same way except that the binary number representing the sequence has 11's wherever C occurs next to G and 0's elsewhere. These are fairly uncommon so for more dramatic visualization, the negative of the 'film strip' is used plus linear contrast rescaling to taste ('level' tool in Photoshop). About 60% of housekeeping and tissue-specific genes have upstream CpG islands.
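A sketch of the CpG version follows, again with hypothetical function names. `stretch` stands in for a simple linear 'levels' adjustment and `negative` for taking the photographic negative; both are approximations of the Photoshop operations, not reproductions of them.

```python
def cpg_flags(sequence):
    """Mark both bases of every CG dinucleotide with 1, everything else 0."""
    seq = sequence.upper()
    flags = [0] * len(seq)
    for i in range(len(seq) - 1):
        if seq[i] == "C" and seq[i + 1] == "G":
            flags[i] = flags[i + 1] = 1
    return flags


def bin_to_gray(flags, bin_size=40):
    """Average flags per bin and scale to 0-255."""
    track = []
    for i in range(0, len(flags), bin_size):
        window = flags[i:i + bin_size]
        track.append(round(255 * sum(window) / len(window)))
    return track


def stretch(track):
    """Linear contrast rescale so the track spans the full 0-255 range."""
    if not track:
        return []
    lo, hi = min(track), max(track)
    if hi == lo:
        return list(track)
    return [round((v - lo) * 255 / (hi - lo)) for v in track]


def negative(track):
    """Photographic negative of a grayscale track."""
    return [255 - v for v in track]
```

A typical call would be `negative(stretch(bin_to_gray(cpg_flags(seq))))`; because the stretch is relative to the contig in hand, values are not comparable between contigs unless fixed limits are substituted.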

2B. CpG island blur: Convolution smoothing is applied with a gaussian of suitable radius. The parameters can be chosen to taste because the goal is only to present the best CpG islands relative to the given contig most effectively to the eye. For consistency across annotation of hundreds of contigs, parameters should be calibrated to values that correctly display known CpG islands. By including an internal grayscale ramp band ('gradient' tool) in all Photoshop manipulations, precise quantitative levels can always be recovered later ('histogram tool').

3. RepeatMasker: In the simple version, the X's in the masked sequence output are replaced with 1's, the remaining sequence by 0's. This gives a grayscale filmstrip upon compression to the chosen scale. This can be converted to black and white using the 'threshold' tool of Photoshop but is better left for accuracy as is, perhaps with some contrast enhancement. The fact is, when 1 pixel must represent 40 nucleotides, boundaries of retrotransposons cannot be represented sharply.

More advanced RepeatMasker tracks exploit the full details of the output table from RepeatMasker. The class of repeat can be represented by color (eg, blue for Sine, red for Mer) and the orientation (strand) by a split track. The sequence is simply masked sequentially with each RepeatMasker feature and the resulting filmstrips overlaid. Work is in progress at the RepeatMasker web site to provide advanced GeneBander output directly.
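A split, color-coded track might be built along these lines. The input is assumed to be a hand-prepared list of (start, end, class, strand) tuples taken from the RepeatMasker table, not the raw table itself. Coloring each pixel by whichever class covers most of the bin is a choice made for this sketch, as are the gray and white fallback colors.

```python
CLASS_COLORS = {"SINE": (0, 0, 255), "MER": (255, 0, 0)}  # blue and red, as in the text
OTHER_COLOR = (128, 128, 128)    # assumed color for classes not listed above
BACKGROUND = (255, 255, 255)     # assumed color for unmasked sequence


def repeat_tracks(features, length, bin_size=40):
    """Return {'+': [...], '-': [...]}: one RGB tuple per bin for each strand.

    `features` holds (start, end, repeat_class, strand), 1-based inclusive.
    """
    n_bins = (length + bin_size - 1) // bin_size
    coverage = {"+": [{} for _ in range(n_bins)],
                "-": [{} for _ in range(n_bins)]}
    for start, end, repeat_class, strand in features:
        for pos in range(max(start, 1) - 1, min(end, length)):
            counts = coverage[strand][pos // bin_size]
            counts[repeat_class] = counts.get(repeat_class, 0) + 1
    tracks = {}
    for strand, bins in coverage.items():
        row = []
        for counts in bins:
            if not counts:
                row.append(BACKGROUND)
            else:
                dominant = max(counts, key=counts.get)
                row.append(CLASS_COLORS.get(dominant, OTHER_COLOR))
        tracks[strand] = row
    return tracks
```

The two rows can then be drawn one above the other to give the split track, with the plus strand on top.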

4. GenScan: The exon predictions of GenScan are used to mask the input sequence separately by strand. The track is made as described above.

5-7. Blast tracks: The NCBI Blast server includes an output graphic. By capturing this to Photoshop, hand-curating significant matches (use the 'magic wand' tool with shift key to move matches up to the scale bar), and rescaling to the annotation graphic (effects menu, scale), the output from various Blast searches is quickly converted to compatible tracks. Blast colors should be retained -- they represent quality of hits. Choosing which features to capture is a job that only a human editor can do wisely. It is a good idea to label key matches with their accession number using the text tool.

8-10. Alignments: GeneBander can effectively show mismatches in sequence alignments. Here alignment outputs from NCBI Blast are stripped of sequence details, leaving mismatches as vertical bars. These are overlaid on the color blocks representing the features (in Photoshop, use wand to select white, select similar, select inverse, and paste). In protein sequences of pseudogenes, frameshifts and premature stop codons can be shown in bright colors. Capturing alignments at the time of Blast searches results in very high information density in the graphic.
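Stripping an alignment down to its mismatches is simple to do in code. The sketch assumes the two aligned rows (query and subject, gaps written as '-') have already been copied out of the Blast output as equal-length strings; the function name is hypothetical.

```python
def mismatch_bars(query, subject, match=" ", mismatch="|"):
    """One character per alignment column: a bar at mismatches and gaps, blank at matches."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    bars = []
    for q, s in zip(query, subject):
        same = q.upper() == s.upper() and q != "-"
        bars.append(match if same else mismatch)
    return "".join(bars)
```

For the rows `ACGT-A` and `ACCTTA` the result has bars at columns 3 and 5 only. Rendered in a narrow font and rescaled, the string becomes the mismatch overlay described above.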

8-10. Feature tracks: Here it was possible to expand the scale so that 1 bp is represented by 1 pixel. (For the whole 30 kb contig, this would give a 35-foot wide graphic.) A table of start and stop positions for each type of feature and its color code is masked in. Features of little interest such as long repeat insertions are internally truncated. The text tool in Photoshop is used to overlay feature labels from the table.

It is also useful to display protein secondary structure as colored bands for alpha helix and beta sheet.

Technical trick: graphical representation of sequence transposons, CpG islands, and exons

How GeneBander was emulated in the old days...

The bands can easily be made with a text editor and Photoshop without any internet tool.

For example, to convert RepeatMasker output into a graphic that masks a blast graphic, search and replace the masked sequence with '[' for transposons and '.' for residual sequence letters.

It turns out that these characters are exactly the same width -- 1 pixel -- in 4 point Symbol font even though Symbol is not generally a fixed-width font. When rendered as text from Photoshop, '[' collapses to a vertical line and '.' to a blank of width 1 pixel and height 4 pixels, giving the desired pattern of solid vertical line and blank in place of sequence letters. This graphic can then be rescaled to the desired width, eg, the 467 pixel width of the NCBI blast graphic and placed over or under it. Photoshop 3.0 is capped at 32,000 nucleotides maximum.

1. Replace RepeatMasker N's with '[' and remaining sequence with '.' in Word.
2a. Paste into a Photoshop b/w graphic 4 pixels high and as wide as the length of the sequence.
The paste is tricky: to left-register the end correctly, enlarge several times to the scale of the text tool cursor and click it on the extreme left.
2b. Rescale to the desired scale using the 'image' tool, eg to 750 pixels wide by 15 pixels high.
Rescaling is tricky: uncheck the box that holds height and width proportional.
3. Copy the web Blast graphic directly into Photoshop and tweak it there as needed.
Photoshop will make a correctly sized graphic automatically, depending on the clipboard.
4. Combine the Blast and RepeatMasker graphics, add text as needed.
5. Voila.
The same procedure works very well in displaying CpG islands at the beginning of genes. It would appear that an individual feature (CG) is inherently too small to show after rescaling, say 20x, 10,000 bp down to 500 pixels width. However, what results from Photoshop bicubic interpolation is a grayscale value from 0-255 depending on local clustering (density) of CpG dinucleotides. This grayscale image can be contrast-stretched or colored as a blackbody spectrum, etc. to emphasize features to the eye.

Naturally the CpG island should be displayed relative to the location of upstream exons. The structure of a whole gene is easily represented by replacing exon nucleotides with '['. (In Word, strip out all spaces, numbers, and carriage returns from the sequence. Then use the GenBank annotation to pull individual exons. Search and replace these with '['.) It is also easy to separately replace coding exons to distinguish them from 5' and 3' UTR.
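The search-and-replace steps done in Word can also be scripted. The sketch below covers the cleanup of a pasted GenBank-style sequence, the '[' and '.' substitution for masked sequence, and the marking of exons from a list of coordinates; the function names are hypothetical and exon coordinates are assumed to be 1-based and inclusive, as in GenBank annotation.

```python
import re


def clean_sequence(raw):
    """Strip spaces, numbers, and line breaks from a pasted sequence."""
    return re.sub(r"[\s\d]", "", raw)


def mask_to_bars(masked, mask_chars="NXnx", bar="[", blank="."):
    """Turn masked positions into bars and all other letters into blanks."""
    return "".join(bar if c in mask_chars else blank for c in masked)


def exons_to_bars(sequence, exons, bar="[", blank="."):
    """Put a bar at every position inside an exon, a blank elsewhere."""
    chars = [blank] * len(sequence)
    for start, end in exons:
        for i in range(max(start, 1) - 1, min(end, len(sequence))):
            chars[i] = bar
    return "".join(chars)
```

Running `exons_to_bars` twice, once with all exons and once with coding exons only, gives two strings that can be rendered in different colors to set the UTRs apart.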

Status of tools

28 Dec 99 webmaster
Many tool sites have responded helpfully to requests for better features.
... allows file upload, text paste, but will not look up an accession number.
... contact
... graphic is mediocre
... output needs fixing to show exon breaks in predicted protein sequences.
... output of masked DNA sequences for predicted genes would be a helpful option.
... perl script output denies access to html, causes Netscape to crash.
... frame layout is poor, abbreviations hard to read.
... phase tracking is good.
... should do better job of masking repeats to lower false positive rates.

... annoying requirement for fasta header can stop job without informing user.
... queuing system is unpredictable.
... 'slow' mode often fails to generate output except by email.
... output and FAQs are outstanding.
... needs output colored by repeat type
... output graphic showing strand and repeat type would greatly help.
... masking of sequence should be done by repeat type, eg Line = 1, Sine =2, Merv = 3

NCBI Blast 
... delays have gotten intolerable, queuing times often unreliable.
... repeat masking option compares poorly to RepeatMasker.
... blast graphic excellent.
... GenBank annotations unreadable, poorly conceived, hard to search
... GenBank annotation graphic poorly conceived

Netscape browser
... Latest versions have fixed text boxes allowing pastes of huge genomic sequences.

... text box in version 3 and earlier limited to 32,000 total characters.
... width of empty document limited to 30,000 pixel width.

Glossary, notation, and acronyms

last updated: 29 Dec 99 webmaster (under development)
Send email if something wasn't clear in the tutorial
 Types and uses of NCBI Blast tools 
Blastn(nrn): non-redundant nucleotide database used by NCBI Blast server
Blastp(nrp): protein database targeted
Blastn(est): expressed sequence tags, ie mRNAs
Blastn(htgs): high throughput genomic sequence, ie unfinished contig nucleotides
Blastn(sts): sequence tagged sites (used to see if contig has any nucleotide markers used in making map)
Blastp(pdb): database of proteins with known 3D structures (used to predict 3D structure of new proteins)
Blastn(epd): collection of eucaryotic promoters (to identify similar promoters in other genes)

Published examples of good annotations

last updated: 29 Dec 99 webmaster (under development)
Send email of recommended articles
This section will briefly describe instructive examples of annotations and give medline links to the abstract (which has links to full text when available). But few people have access to full text online of every possible journal, so it may be better to simply browse print journals that often carry articles on genomic annotation.
