Genome Annotation Worksheet
Find Entrez Seq : Find STS etc. : NCBI Blast : Blast Finished : Sanger Blast : Blast 2 Seqs : Blast yours
RepeatMask : GenScan : Translate : Swiss Tools : Tool Launch : Seq. Utilities : Align 1, 2 : GeneBander : Tutorial
Medline : OMIM : Mouse : Zfish : Fly : Worm : PNAS : Science : Nature : Genome Res : Mol Bio Evol : More Links

Steps in the annotation
The annotation graphic
Pseudogene feature checklist


last updated:
sequence being annotated: 
reason and context: 
bottom line summary: 
The idea of the worksheet is to build the annotation directly within an html document and have it immediately online. By viewing source, the html for this page can be captured, along with the blank worksheet image. Then it is just a matter of filling in a few details.

Steps in the annotation

last updated: 
Queue numbers at NCBI Blast server:

The annotation graphic

last updated: 

Pseudogene annotation: the issues

last updated:
Pseudogene checklist: Here is a list of issues to be considered in annotating pseudogenes:

coding disruptions. In-frame stop codons are an elementary property that distinguishes a pseudogene from a functioning gene. In rare instances, the preceding protein fragment is still functional; bad codons can also be bypassed or edited at the level of mRNA. Small indels, frameshifts, and internal retrotransposons are also indicative of pseudogenes. More subtly, pseudogenes have a higher level of non-synonymous, non-conservative coding mutations than functional genes. A new pseudogene may lack significant coding disruptions; conversely, a gene with many coding changes can still be functional (best supported by phylogenetic conservation of the open reading frame).
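As a first-pass check for the coding disruptions described above, the following is a minimal sketch that lists in-frame stop codons in a candidate sequence. The function name `in_frame_stops` and the choice to ignore the final codon (assumed to be the normal terminator) are illustrative assumptions, not part of the worksheet, and the sketch does not detect frameshifts or indels.

```python
# Standard-code stop codons (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def in_frame_stops(seq, frame=0):
    """Return 0-based start positions of stop codons in the given frame.

    The last complete codon is skipped, on the assumption that the
    sequence may end with the gene's normal terminator.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    codon_starts = range(frame, len(seq) - 2, 3)
    last_start = codon_starts[-1] if len(codon_starts) else None
    return [i for i in codon_starts
            if seq[i:i + 3] in STOP_CODONS and i != last_start]

# Example: one premature TAG at position 6, terminal TGA ignored.
print(in_frame_stops("ATGAAATAGCCCTGA"))
```

A non-empty result in the frame matching the parental protein suggests a disrupted reading frame, but as noted above it is not conclusive by itself.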

direct flanking repeats and poly A tail. If the pseudogene arose from reverse-transcribed mRNA, the nicking and insertion mechanism leaves diagnostic flanking direct repeats of perhaps a dozen base pairs; the downstream repeat may follow a poly A tail. Over time, both the initial identity of the direct repeats and the poly A tail will deteriorate to unrecognizability.
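The search for a poly A tail and flanking direct repeats can be sketched as below. This is only an illustration under stated assumptions: the helper names, the exact-match requirement, and the minimum lengths of 8 are hypothetical choices, and real repeats that have deteriorated over time would need a mismatch-tolerant aligner instead.

```python
import re

def find_poly_a(seq, min_len=8):
    """Return (start, end) of the longest run of A's with length >= min_len, or None."""
    runs = [m.span() for m in re.finditer(r"A{%d,}" % min_len, seq.upper())]
    return max(runs, key=lambda s: s[1] - s[0]) if runs else None

def longest_shared_repeat(upstream, downstream, min_len=8):
    """Return the longest exact substring found in both flanks, or None if too short."""
    upstream, downstream = upstream.upper(), downstream.upper()
    best = ""
    for i in range(len(upstream)):
        # Only test candidates longer than the best match found so far.
        j = i + len(best) + 1
        while j <= len(upstream) and upstream[i:j] in downstream:
            best = upstream[i:j]
            j += 1
    return best if len(best) >= min_len else None

# Example with made-up flanks: a 13 bp direct repeat on both sides.
up = "GGCTAGCTTACGGATCCAT"
down = "AAAAAAAAAAAATTCTAGCTTACGGATGG"
print(find_poly_a(down), longest_shared_repeat(up, down))
```

Here `upstream` and `downstream` stand for short windows of genomic sequence on either side of the Blast match; how wide to make those windows is left to the annotator.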

introns. Splicing purges introns from a mature mRNA, so if the pseudogene has them, it probably arose from genomic duplication (local, regional, chromosomal, or tetraploidization), though unspliced mRNA could conceivably be an alternative origin. However, genomic duplications usually include considerable flanking material as well.

no introns. If the pseudogene lacks introns, then mRNA retrotransposition is favored. However, if the parent gene had no introns to begin with, a genomic origin is still viable. Many genes on the X chromosome have given rise to intron-purged compensatory functional paralogues elsewhere via mRNA retrotransposition (complete with flanking direct repeats and 3' poly A). Although retrotransposed copies were once thought to lack promoters (which lie upstream of mRNA transcription starts), they can use secondary promoters or be expressed in permissive tissues (testes). In these cases, the annotator sees a parental gene with introns and a (functional!) daughter gene without them. Of course, parent and daughter genes could also give rise to pseudogenes.
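The intron-based reasoning in the two items above can be condensed into a small decision sketch. The function `likely_origin`, its arguments, and its return labels are hypothetical; it encodes only the rules of thumb stated here and ignores the caveats (unspliced mRNA, functional retrocopies) that an annotator still has to weigh.

```python
def likely_origin(has_introns, parent_has_introns,
                  direct_repeats=False, poly_a=False):
    """Suggest a probable origin for a gene copy from the checklist features."""
    if has_introns:
        # Splicing would have removed introns from a mature mRNA.
        return "genomic duplication"
    if direct_repeats or poly_a:
        # Hallmarks of insertion of reverse-transcribed mRNA.
        return "retrotransposition"
    if parent_has_introns:
        # Introns were lost relative to the parent: spliced mRNA favored.
        return "retrotransposition"
    # Intronless parent and no insertion hallmarks: either route is viable.
    return "undetermined"

print(likely_origin(has_introns=False, parent_has_introns=True))
```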

transcribed. A pseudogene may fall under the control of another promoter if the insertion falls (non-disruptively) within an internal intron or exonic UTR. A significant percentage of pseudogenes continue to be transcribed even as their protein-coding capability deteriorates (lost translational start, or truncated or quickly degraded protein). These pseudogenes would continue to be represented in dbEST. Chromosome 22 had two cases of functional genes on the minus strand of large introns of other genes. ESTs need to be examined closely (allowing for their inherent error rate): matches may be to a parental gene, a close homologue, or to the pseudogene itself.

anti-sense transcribed. Genes are occasionally regulated through anti-sense transcripts. An apparent pseudogene might instead be providing such an anti-sense transcript, modulo accumulated mismatches. Very serious curation errors were made in compiling dbEST; it is very difficult to reliably determine which strand sense actually occurred in the starting tissue. This makes it impossible to search directly for anti-sense transcripts.

truncated or fragmented. A pseudogene may be terminally truncated at either end or be fragmented (missing internal exons). These effects arise in retrotranscriptional pseudogenes from alternative splicing or from incomplete insertion usually affecting the 5' end, or from an alternative poly A site. For genomic duplication pseudogenes, missing termini may not have been included in the original event (5' and 3' equally likely). For either kind of pseudogene, a subsequent deletion (perhaps millions of years later) may have caused the truncation; only for transcriptional pseudogenes could deletion boundaries plausibly be at intron-exon boundaries of the parent gene.

non-coding. Many mRNAs have long 3'UTRs. Since the retrotranspositional process starts at the 3' poly A tail but does not necessarily include a full-length mRNA, there is a built-in bias towards pseudogenes consisting of distal regions of the parental gene. The dynein pseudogene annotated above had no overlap with protein-coding exons of the parental gene. 3' UTR pseudogenes are rarely reported and never annotated in GenBank. With 3% of the genome functional and some 45% high copy repeats, a considerable portion of the genome might have originated as 3' UTR pseudogenes (many now unrecognizable).

gene conversion. A case is known of a pseudogene unexpressed for much of its evolutionary history that was later repaired and expressed, apparently after gene conversion by the parent gene. Pseudogenes may also be created by gene conversion events.

unequal recombination. A tandem or cis duplication predisposes a chromosome to misaligned recombination that can result in three copies on one daughter strand and one on the other. One copy may already have become a pseudogene. Considerable time is needed to accrue enough mutational change to preclude recombination or gene conversion. Orphons are solitary members of large copy clusters that are less affected by recombination.

parental gene. The parental gene giving rise to the pseudogene is seldom known for certain. For brevity, the best currently available parental gene candidate is called the parental gene. The best candidate for a parental gene is an adjacent homologous gene in the case of a genomic duplication pseudogene; otherwise it is the best Blast match (subject to updating as genome sequencing progresses). In some cases, the original gene has been lost (e.g., a better Blast match is in another species). A duplicated gene may be the one retaining function as the original gene becomes the pseudogene, or both may become pseudogenes. A pseudogene itself could give rise to further pseudogenes through duplication or retrotranscription of its flawed mRNA and thus be the primary parent.

dating pseudogenes. Assuming the parental gene has been reliably identified, the degree of divergence is roughly proportional to the time elapsed. The rate of mutational fixation is commonly taken as that of synonymous codon evolution in coding genes, say 3-4 point mutations fixed per 100 base pairs per 10 million years. However, the synonymous codons are in fact under selective constraint, as inferred from two tiers of strong codon bias even in third position, so this under-estimates pseudogene time scales. CpG islands in new pseudogenes can experience rapid deterioration as newly available mutational hotspots give more chances at neutral drift fixation. Note that a pseudogene could experience selection not on its coding abilities but rather on its hybridization potential, if it still served in anti-sense regulation.
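The arithmetic behind such a date can be sketched as follows. This is a rough illustration only: the function name and the default rate of 3.5 (the midpoint of the 3-4 range quoted above) are assumptions, and the calculation treats every observed difference as having accumulated at that single rate, ignoring the caveats about codon bias, CpG hotspots, and selection.

```python
def pseudogene_age_my(mismatches, aligned_bp, rate_per_100bp_per_10my=3.5):
    """Rough age in millions of years from divergence to the parental gene.

    rate_per_100bp_per_10my: point mutations fixed per 100 bp per 10 million
    years (assumed midpoint of the 3-4 range given in the text).
    """
    if aligned_bp <= 0:
        raise ValueError("aligned_bp must be positive")
    percent_divergence = 100.0 * mismatches / aligned_bp
    return 10.0 * percent_divergence / rate_per_100bp_per_10my

# Example: 35 mismatches over 500 aligned bp is 7% divergence,
# i.e. two rate units of 10 million years each.
print(pseudogene_age_my(35, 500))  # 20.0
```

Running the same alignment with rates of 3 and 4 gives a quick sense of how wide the resulting date range is.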

Because of gene conversion and recombination, the estimated date is thus neither that of the establishing event nor that of loss of function. Many pseudogenes have a checkered mixed-rate history. Older relics may exhibit mutational saturation (multiple mutations at a given site). Nearby intronic and UTR exonic base rates may help estimate overall local mutational fixation rates. Comparative phylogeny is the best dating technique: examining other species with known divergence dates for presence/absence of the pseudogene in syntenic position (example: present in great apes, absent in Old World monkeys); it is not plausible that the same pseudogene would insert at the same site in different lineages.
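One standard way to allow for the mutational saturation mentioned above is the Jukes-Cantor correction, which converts an observed fraction of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The worksheet does not prescribe this model; it is shown here only as a simple, commonly used option that assumes equal rates for all base changes.

```python
import math

def jukes_cantor_distance(p):
    """Estimated substitutions per site from observed mismatch fraction p.

    Valid only for 0 <= p < 0.75; at p = 0.75 the sequences are as
    different as random sequences and the distance is undefined.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Example: 30% observed mismatches imply more than 0.30 substitutions
# per site once multiple hits at the same site are allowed for.
print(round(jukes_cantor_distance(0.30), 3))
```

For young pseudogenes the correction is negligible; it matters mainly for the older relics.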


