Sulfatase Gene Family Annotation

Annotation of the sulfatase gene family: introduction
Properties of the 17 human sulfatases
New human sulfatase ARSG on X chromosome
Correlating structure and sequence
Classifying sulfatases: phylogenetic tree for 83 sulfatases
Co-regulation of sulfatases and sulfotransferases?
Location and structure of STS, ARSD, ARSE, ARSF insertion
Are disulfides and glycosylation sites conserved within subfamilies?
Mouse and human long sulfatases: insertion and synteny
Exon structure of human sulfatases
Miscellaneous topics (under development)
130 sulfatase reference sequences (off-page)
34 + 21 sulfatase modifying sequences (off-page)

Annotation of the sulfatase gene family

Last updated 22 June 03
The sulfatases are a conserved gene family having 17 functional representatives in the human genome assembly of April 2003. This enzyme lineage can reliably be traced back to early prokaryotes due to a diagnostic modified cysteine (to formylglycine), catalytic metal binding sites, and a conserved fold (which identifies the sulfatase gene family as a branch of the alkaline phosphatase gene family).

The earliest sulfatases studied were conventional lysosomal catabolic enzymes responsible for breakdown of sulfated metabolites such as dermatan sulfate, yet today half the members seem involved in tissue remodeling or implementation of developmental regulation, in some cases balancing synthesis by sulfotransferases. Although some sulfatases have known in vivo substrates (inferred from metabolite accumulation in rare human diseases), for others there is no clue to the substrate, though sulfate hydrolysis is surely retained in all.

For example, the steroid sulfatase STS belongs to a relatively recent X chromosomal tandem duplication multiplet where the order of evolutionary events, from genome position and sequence conservation, was clearly STS, [ARSF, (ARSD, ARSE)], yet the substrates of the latter 3 enzymes remain obscure, despite a well-studied bone and joint disease associated with ARSE. The STS family also includes pseudogenes on non-recombining chr Y; it does not include the chr X q arm sulfatase IDS.

Sulfatase nomenclature is highly unsatisfactory, with numerous synonyms in use and some members named for uninformative artificial substrates (arylsulfatases). Since mammals appear to have a full complement of sulfatases, and humans are the best studied, this site uses the official gene nomenclature at HUGO whenever possible, using the same name for gene and encoded protein, even across species when orthology is clear. A table of synonyms would be useful however. There are currently 6 new unnamed human sulfatases. Two of these could be put in the ARSB series but following the guidelines, this would require renaming ARSB as ARSB1.

Even a gene family constrained to slow divergence becomes difficult to align after long enough time passes. Although subfamilies present few alignment issues, numerous small insertions and deletions (indels) make alignment problematic across subfamilies: the position of indels is not stable to choice of gapping parameters in software such as Blast and ClustalW. Most indels occur within the 20-odd loops linking beta strands and alpha helices and have little effect on the overall fold.

While the 3 known crystallographic structures are quite helpful in remote alignment, residues critical to the active site -- and reliably conserved across subfamilies -- occur only in the catalytic core domain. Mysteriously, other residues having no clear structural or functional role can be even more strongly conserved, but reliable alignment anchors outside the core remain rare. This region determines oligomerization: the ancestral pattern is arguably dimer as ARSA contacts are conserved in alkaline phosphatase; even octameric ARSA is a tetramer of dimers. (However the Ps. aeruginosa sulfatase and ARSB are apparently monomeric.)

Disulfides show surprisingly poor conservation across subfamily lineages. Potential glycosylation sites (NxTx, NxSx, x not proline) are better but still unsatisfactory, conserved within specific subfamilies and sometimes across them. Some 287 sites are observed in 50 eukaryotic sulfatases (average 5.7, range 2-14). The new quail sulfatase and its allies are heavily glycosylated; a 31 Aug 01 Science paper by Dhoot et al. showed the quail protein is exposed on the cell surface. For sulfatases such as ARSA and ARSB, experimental work has confirmed that most potential glycosylation sites are in fact modified in vivo. Like the indel signature, conserved glycosylation sites may have some role as recognition anchors in alignment and subfamily diagnostics.
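A count of potential sites of this kind can be reproduced by scanning each protein for the sequon pattern. The following is a minimal Python sketch, assuming the pattern is read as N, any residue except proline, S or T, any residue except proline. The function name and the toy sequence are illustrative only and are not taken from the reference sequence set.

```python
import re

# Potential N-glycosylation sequon: N, x (not P), S or T, x (not P).
# The lookahead lets overlapping candidate sites be reported separately.
SEQUON = re.compile(r"(?=(N[^P][ST][^P]))")

def find_sequons(protein):
    """Return (1-based position of the N, matched 4-mer) for each potential site."""
    protein = protein.upper()
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(protein)]

# Toy sequence (hypothetical, for illustration only).
toy = "MANGTAANPSLNVSKNNTS"
for position, motif in find_sequons(toy):
    print(position, motif)
```

Because the pattern requires a fourth residue, a site whose S or T is the last residue of the sequence is not reported; whether such sites should count is a choice left to the user.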

Past the catalytic core, secondary structure features of the 4-stranded beta sheet and terminal helix are likely conserved even though sequence similarity itself has severely diminished. The best alignment strategy in the post-core region is secondary structure prediction within each subfamily (significantly aided by multiple alignment input), augmented by anchor residues, reliable Blast patches, and exon boundary correspondences.

The central structural issue in sulfatases -- and the key motivation to align remote members -- concerns "long" sulfatases. The newly discovered human sulfatases KIAA1077 and KIAA1247 (and their counterparts in mouse and quail) average 830 amino acids, some 320 residues more than their closest relative among conventional sulfatases, the 550 residue GNS subfamily. (A related sulfatase in Drosophila is even longer at 1114 aa.) Note that 320 extra residues is longer than the average stand-alone enzyme in E.coli.

The key questions are, what function do these extra 320 residues serve (Wnt signalling and embryo patterning?), how are they inserted into the existing structural scaffold (between the 4-stranded beta sheet and terminal helix?), and where did they come from (intronic extension or mobile domain?). To study this region by bioinformatics or experiment, it first must be isolated by alignment. Then its homology relationships to non-sulfatases (if any, fusion or operon-like clues to function), structural properties, and relationship to exon structure can be determined (6 extra exons occur relative to 14-exon GNS). The 3 known structures unfortunately do not involve sulfatases closely related to GNS. Indeed, as a Blast query, the isolated post-catalytic region of GNS sulfatases fails to recognize any other class of sulfatase.

Now the STS subfamily also acquired an extensive extra domain, inserted earlier as a loop within the catalytic core. The most peculiar composition of those 45 residues -- an unbroken run of hydrophobic and apolar residues -- is consistent with experimental data placing it as a luminal side membrane attachment site determinant (Stein et al, 1989). Below the insertion is shown to have occurred within the extruded early 2-strand beta sheet just after core beta strand B6 (ARSA or ARSB numbering).

The second central biosynthetic issue in sulfatases is the origin of the modified cysteine. Bizarrely, no enzyme or cofactor involved in the oxidative modification is yet known despite sulfatases in the complete E.coli genome. Complementation assays and operon associations so far have not definitively fingered any bacterial gene; lab strains may be inadvertently defective in non-essential genes. The key enzyme apparently recognizes a linear stretch about the altered cysteine concomitant with translation. It is likely highly conserved from bacteria to human and, when partly disabled by mutation, causal for the rare multiple sulfatase disorder MSD in humans. If it could be mapped to a position, or a homologue found in about any species, the issue could be quickly resolved by genomics.

Using the 500-odd bacterial genomes nearing completion and the very restricted distribution of sulfatases within prokaryotes, it should be possible using the NCBI COG orthologous cluster resource to identify candidate genes, because few genes will both have unassigned function and strictly fit the observed phylogenetic distribution. It is essential here to have a very reliable Blast probe and complete genomes to show that a given species (eg, yeast) completely lacks sulfatases and, one supposes, the modifying system unique to them.
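The filtering idea in the preceding paragraph can be sketched as a phylogenetic profile match: keep only those orthologous clusters that have no assigned function and occur in exactly the genomes that carry a sulfatase. The Python sketch below is illustrative only; the genome names, cluster identifiers and annotations are all hypothetical placeholders, not COG data, and a real search would likely need to tolerate a few mismatches from incomplete genomes.

```python
def matching_candidates(sulfatase_genomes, clusters):
    """Return clusters of unassigned function whose genome distribution
    is identical to that of the sulfatases.

    clusters maps a cluster name to (set of genomes, function or None).
    """
    target = frozenset(sulfatase_genomes)
    return sorted(
        name
        for name, (genomes, function) in clusters.items()
        if function is None and frozenset(genomes) == target
    )

# Hypothetical toy data: three genomes carry a sulfatase, genome_D does not.
sulfatase_genomes = {"genome_A", "genome_B", "genome_C"}
clusters = {
    "cluster_1": ({"genome_A", "genome_B", "genome_C"}, None),
    "cluster_2": ({"genome_A", "genome_B", "genome_C"}, "ribosomal protein"),
    "cluster_3": ({"genome_A", "genome_B", "genome_C", "genome_D"}, None),
    "cluster_4": ({"genome_A", "genome_B"}, None),
}

print(matching_candidates(sulfatase_genomes, clusters))
```

Only cluster_1 survives in this toy case: cluster_2 already has a function, cluster_3 also occurs in a genome lacking sulfatases, and cluster_4 is missing from a genome that has them.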

A candidate Fe-S oxidoreductase found by this method was further compatible with E.coli operon associations but no convincing counterpart could be found in eukaryotes. There may be multiple copies in animals because of the need to recognize diverged substrates; these could be regulatory choke points for specific sulfatase subfamilies. MSD families have been too rare to map but a strong candidate gene could readily be confirmed by screening.

Properties of the 17 human sulfatases

gene_hsa  strand  chrom_coords              span     aa   exons  glyco  S-S
SulfY     -       chr4:115181037-115257516  76,480   573  2      4+2    5+0
ARSB      -       chr5:78111979-78316827    204,849  534  10     3+2    3+2
SulfX     +       chr5:94916739-94964983    48,245   526  8      4+2    2+0
SulfZ     -       chr5:149656973-149662129  5,157    569  2      2+2    5+0
Sulf1     +       chr8:70638765-70713649    74,885   872  19     7+3    1+18
GNS       -       chr12:63396791-63439323   42,533   553  14     10+4   2+10
GALNS     -       chr16:87408351-87450786   42,436   523  14     1+1    2+5
SGSH      -       chr17:75798849-75808707   9,859    503  8      4+1    2+0
KIAA1001  +       chr17:63815230-63928196   112,967  526  11     3+1    5+2
SULF2     -       chr20:45721549-45819514   97,966   871  20     7+3    1+18
ARSA      -       chr22:49353720-49356345   2,626    508  8      3+0    5+6
ARSD      -       chrX:2818676-2837169      18,494   594  10     3+0    4+8
ARSE      -       chrX:2846237-2869836      23,600   590  10     3+0    4+8
ARSG      +       chrX:2901783-2944784      43,002   699  11     3+0    4+8
ARSF      +       chrX:2983426-3023955      40,530   591  10     3+0    4+8
STS       +       chrX:7030971-7128035      97,065   584  10     2+1    4+8
IDS       -       chrX:148270039-148292425  22,387   551  9      4+2    3+0

ave:                                        56,652   598  10     .      .

The table above summarizes basic genomic and proteomic properties of the 17 known human sulfatases. Note 3 of these have only been described in machine-annotated GenBank entries and 3 others are altogether new. To view the genomic context of the coding part of each gene within the assembled human genome, simply paste the genome location column entry into the UCSC August 2001 genome browser.
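The span column and the averages row follow directly from the coordinates. As a check, here is a short Python sketch that recomputes them, assuming the coordinates are inclusive at both ends (so span = end - start + 1); the variable and function names are illustrative.

```python
from statistics import mean

# (gene, coordinates, amino acids, exons), copied from the table above.
ROWS = [
    ("SulfY", "chr4:115181037-115257516", 573, 2),
    ("ARSB", "chr5:78111979-78316827", 534, 10),
    ("SulfX", "chr5:94916739-94964983", 526, 8),
    ("SulfZ", "chr5:149656973-149662129", 569, 2),
    ("Sulf1", "chr8:70638765-70713649", 872, 19),
    ("GNS", "chr12:63396791-63439323", 553, 14),
    ("GALNS", "chr16:87408351-87450786", 523, 14),
    ("SGSH", "chr17:75798849-75808707", 503, 8),
    ("KIAA1001", "chr17:63815230-63928196", 526, 11),
    ("SULF2", "chr20:45721549-45819514", 871, 20),
    ("ARSA", "chr22:49353720-49356345", 508, 8),
    ("ARSD", "chrX:2818676-2837169", 594, 10),
    ("ARSE", "chrX:2846237-2869836", 590, 10),
    ("ARSG", "chrX:2901783-2944784", 699, 11),
    ("ARSF", "chrX:2983426-3023955", 591, 10),
    ("STS", "chrX:7030971-7128035", 584, 10),
    ("IDS", "chrX:148270039-148292425", 551, 9),
]

def span(coords):
    """Length in bp of an inclusive 'chrom:start-end' interval."""
    _, interval = coords.split(":")
    start, end = (int(x) for x in interval.split("-"))
    return end - start + 1

spans = {gene: span(coords) for gene, coords, _, _ in ROWS}
mean_span = round(mean(list(spans.values())))
mean_aa = round(mean([aa for _, _, aa, _ in ROWS]))
mean_exons = round(mean([exons for _, _, _, exons in ROWS]))

print(spans["ARSA"], spans["ARSB"])
print(mean_span, mean_aa, mean_exons)
```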

Note 3 of the 4 NDST sulfotransferases are immediately adjacent to sulfatases perhaps catalyzing the opposing reaction, though there is no evidence for coordinated regulation vs coincidence propagated by block duplication. Quite a few important small molecules appear to be down-regulated by sulfotransferase sulfation, the opposite reaction from sulfate removal by sulfatases. These enzymes do not contain a formylglycine; there are at least 31 of them in the human genome. The human genome contains enough sequencing gaps to potentially encode 1-2 additional sulfatase genes even in the May 2004 assembly (which has 424 gaps covering 228,799,690 bp).

31 Sulfotransferases in the human genome
HS2ST1 chr1:87092382-87287685 NM_012262 heparan sulfate 2-O-sulfotransferase 1
CHST10 chr2:100466839-100492609 NM_004854 HNK-1 sulfotransferase
SULT1C1 chr2:108363612-108384888 NM_001056 sulfotransferase family cytosolic 1C
HS6ST1 chr2:128741261-128792482 NM_004807 heparan sulfate 6-O-sulfotransferase
GAL3ST2 chr2:242436229-242470393 NM_022134 galactose-3-O-sulfotransferase 2
CHST13 chr3:127725873-127744831 NM_152889 carbohydrate chondroitin 4 sulfotransferase
NDST4 chr4:116106534-116392636 NM_022569 N-deacetylase/N-sulfotransferase heparan
NDST3 chr4:119313102-119534619 NM_004784 N-deacetylase/N-sulfotransferase heparan
SULT1B1 chr4:70773445-70807190 NM_014465 sulfotransferase family cytosolic 1B
NDST1 chr5:149880622-149917966 NM_001543 N-deacetylase/N-sulfotransferase heparan
UST chr6:149110156-149439818 NM_005715 uronyl-2-sulfotransferase
CHST12 chr7:2216463-2247455 NM_018641 carbohydrate chondroitin 4 sulfotransferase
TPST1 chr7:65114463-65269579 NM_003596 tyrosylprotein sulfotransferase 1
GAL3ST4 chr7:99401518-99410867 NM_024637 galactose-3-O-sulfotransferase 4
CHST3 chr10:73394125-73443318 NM_004273 carbohydrate chondroitin 6 sulfotransferase 3
NDST2 chr10:75231674-75241348 NM_003635 N-deacetylase/N-sulfotransferase heparan
GAL3ST3 chr11:65566028-65573227 NM_033036 galactose-3-O-sulfotransferase 3
CHST11 chr12:103353244-103654350 NM_018413 carbohydrate chondroitin 4 sulfotransferase
HS6ST3 chr13:95541093-96289812 NM_153456 heparan sulfate 6-O-sulfotransferase 3
D4ST1 chr15:38550504-38552645 NM_130468 dermatan 4 sulfotransferase 1
SULT1A2 chr16:28510766-28515892 NM_001054 sulfotransferase family cytosolic 1A
SULT1A1 chr16:28524418-28528150 NM_177530 sulfotransferase family cytosolic 1A
SULT1A3 chr16:29376468-29383784 NM_003166 sulfotransferase family cytosolic 1A
CHST9 chr18:22749594-23019177 NM_031422 GalNAc-4-sulfotransferase 2
SULT2A1 chr19:53065681-53081405 NM_003167 sulfotransferase family cytosolic 2A
SULT2B1 chr19:53747240-53794495 NM_177973 sulfotransferase family cytosolic 2B 1
TPST2 chr22:25246292-25310623 NM_003595 tyrosylprotein sulfotransferase 2
GAL3ST1 chr22:29275177-29285430 NM_004861 galactose-3-O-sulfotransferase 1
SULT4A1 chr22:42545289-42583257 NM_014351 sulfotransferase family 4A 1 a
HS6ST2 chrX:131485578-131816891 NM_147174 heparan sulfate 6-O-sulfotransferase 2
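The NDST and sulfatase positions listed on this page can be compared directly. The Python sketch below measures, for each NDST gene, the number of bases separating it from the nearest sulfatase on the same chromosome. It uses only the coordinates of SulfY, SulfZ and the four NDST genes as given above; the function names are illustrative. A distance alone does not establish adjacency in the sense of no intervening genes, and the inferred chr10 sulfatase lies in a sequence gap, so NDST2 necessarily comes out unmatched here.

```python
def parse(coords):
    """Split 'chrom:start-end' into (chrom, start, end)."""
    chrom, interval = coords.split(":")
    start, end = (int(x) for x in interval.split("-"))
    return chrom, start, end

def gap(a, b):
    """Bases lying strictly between two intervals; 0 if they overlap,
    None if they are on different chromosomes."""
    chrom_a, start_a, end_a = parse(a)
    chrom_b, start_b, end_b = parse(b)
    if chrom_a != chrom_b:
        return None
    return max(0, max(start_a, start_b) - min(end_a, end_b) - 1)

SULFATASES = {
    "SulfY": "chr4:115181037-115257516",
    "SulfZ": "chr5:149656973-149662129",
}
NDST = {
    "NDST1": "chr5:149880622-149917966",
    "NDST2": "chr10:75231674-75241348",
    "NDST3": "chr4:119313102-119534619",
    "NDST4": "chr4:116106534-116392636",
}

def nearest_sulfatase(ndst_coords):
    """Return (gap in bp, sulfatase name) for the closest same-chromosome
    sulfatase, or None if there is none."""
    candidates = []
    for name, coords in SULFATASES.items():
        distance = gap(ndst_coords, coords)
        if distance is not None:
            candidates.append((distance, name))
    return min(candidates) if candidates else None

for gene, coords in NDST.items():
    print(gene, nearest_sulfatase(coords))
```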

The glycosylation and disulfide columns break potential occurrences of these elements into catalytic and post-catalytic domains; sporadic occurrences (not conserved within other mammals) are omitted. Fragmentary sequences from ESTs are helpful in this determination because of the additional species they bring in. These structural elements have been surprisingly fluid and are barely conserved outside of narrow sub-families. Direct experimental support for their existence is meagre outside of ARSA and ARSB. However, by structure threading of the conserved catalytic domain fold, it is possible to determine whether given cysteine pairs are in physical proximity and whether glycosylation sites are exposed on the surface. Post-catalytic elements are more problematic for those sulfatases not readily alignable to the three determined structures.

Sulfatases fall into two length categories: standard and long. The length column gives the approximate number of amino acids in the mature protein. Note that 550 amino acids is excessive for a simple hydrolytic reaction -- the average enzyme in E.coli is 300 amino acids. Indeed, all the key residues (active site and metal binding) fall within the amino terminal 60%; alternatively spliced sulfatase genes such as IDS and ARSD encode proteins more or less truncated to the core catalytic domain. None of the xray structures really exhibit a classical substrate pocket beyond that for the sulfate moiety; the carboxy terminal beta sheet and terminal helix are not positioned to contribute substrate specificity. The remainder of the protein has been shown, however, to provide the surface for homo-oligomer formation.

Note the bizarre range in gene sizes, from a tiny 2,626 bp in ARSA to 204,849 bp in ARSB (span column of the table above); similarly the number of exons ranges from 2 to 14 in short sulfatases and to 20 in long. The four genes in the STS cluster have similar exon structures as relatively recent events but otherwise conservation of exon boundaries is less than might be expected even in subfamilies. Of course, even a small sulfatase subfamily can span an immense time scale.

Cellular location has been determined reliably for perhaps half the sulfatases -- the column below is mainly taken from the SwissProt and OMIM entries (which are curated digests of published experiments). The cell surface location of KIAA1077_hsa is taken from recent work on the quail orthologue and is likely applicable to the closely related KIAA1247_hsa. Both of these proteins bristle with potential disulfides and N-glycosylation sites compatible with an exposed exterior position. Despite the frequent association with membrane compartments, sulfatases do not contain packets of helical transmembrane domains; however the X chromosomal STS group does contain a substantial hydrophobic insert in a catalytic domain loop. All mammalian sulfatases contain a targeting leader peptide not part of the mature protein; however these are not satisfactorily interpretable at this time with tools such as ProSort.

Genomically, sulfatases are quite dispersed, occurring on 10 human chromosomes. All sulfatases arose from duplications of a common ancestral enzyme (ultimately an alkaline phosphatase or dual purpose). The X chromosome patch likely arose from local tandem duplications with order of the 3 events likely STS, [(ARSD, ARSE), ARSF]. While these events pre-dated the divergence with rodents, it remains unclear how many of these genes were retained in the latter lineage beyond STS. The grouping (ARSA, KIAA1001), GALNS is related to this family but represents events too remote to illuminate with genomics. IDS is also on chrX but like SulfX and SGSH diverged so much earlier that its affinities are to bacterial sulfatases.

A parsimonious scenario for the ARSB, (SulfY, SulfZ) group is tandem duplication on chr5 yielding ARSB and SulfZ, followed by much more recent translocations of the SulfZ region to chr4 (SulfY) and then to chr10 (inferred SulfP in sequence gap). These translocations also resulted in amplification of the adjacent NDST sulfotransferase family (which later had a tandem duplication resulting in NDST3 and NDST4 on chr4).

A fairly recent vertebrate translocation of a small block of genes duplicated an ancestral long sulfatase to KIAA1077 and KIAA1247 on chr8 and chr20. A much earlier duplication of GNS, now on chr12, possibly a translocation too washed out to be detectable today, created the ancestral gene to the long sulfatases, which acquired the 330 extra residues prior to the subsequent translocative duplication. GNS, (KIAA1077, KIAA1247) clearly describes the tree topology of the two events.

Thus, neglecting pseudogenes and events not retained in evolution, 16 gene duplications were necessary to create the sulfatases seen today in humans starting with an ancestral alkaline phosphatase. Other genes, like hemoglobins, had translocations of tandem duplications yielding more genes with fewer events. The overall order of genes is given tentatively by ClustalW alignment of catalytic cores as {[SulfX, (SGSH, IDS)], {{[ARSB, (SulfY, SulfZ)]*, {{STS, [(ARSD, ARSE), ARSF]*}, [(ARSA, KIAA1001), GALNS]}}, [GNS, (KIAA1077, KIAA1247)]*}}.

Specific diseases have been associated with 8 of the genes, mostly classical lysosomal catabolic disorders. Quail and sea urchin data however suggest developmental regulatory problems, possibly lethal in utero, could arise from other sulfatase genes. Now that the apparent full complement of human sulfatases has been found and localized beyond cytoband to flanking genes and even to actual positional coordinates, it should be possible to rapidly screen candidate diseases for mutations in sulfatases unassigned to diseases. Indeed, OMIM may already contain reference to disorders for which the standard lysosomal genes were excluded and possibly roughly positionally mapped to another chromosome.
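The bracket expression can be read mechanically as a nested tree. The Python sketch below is a rough illustration with made-up function names: it retypes the expression with balanced brackets, treats the three bracket styles as equivalent, ignores the asterisks, and counts genes and branch points. Note that the expression as written names 16 genes (ARSG does not appear in it), so a strictly bifurcating reading gives 15 branch points; placing ARSG in the X chromosome cluster would add one more, giving the 16 duplications cited for 17 genes.

```python
import re

# Gene order expression retyped with balanced brackets.
TREE = ("{[SulfX, (SGSH, IDS)], {{[ARSB, (SulfY, SulfZ)]*, "
        "{{STS, [(ARSD, ARSE), ARSF]*}, [(ARSA, KIAA1001), GALNS]}}, "
        "[GNS, (KIAA1077, KIAA1247)]*}}")

CLOSER = {"(": ")", "[": "]", "{": "}"}

def parse(expr):
    """Parse the bracket notation into nested tuples of gene names."""
    # Commas and asterisks are skipped by the tokenizer.
    tokens = re.findall(r"[(){}\[\]]|\w+", expr)
    pos = 0

    def node():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok in CLOSER.values():
            raise ValueError("unexpected closing bracket " + tok)
        if tok not in CLOSER:
            return tok                      # a gene name (leaf)
        children = []
        while tokens[pos] != CLOSER[tok]:   # read until the matching closer
            children.append(node())
        pos += 1
        return tuple(children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("unbalanced expression")
    return tree

def leaves(tree):
    """Gene names in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

def branch_points(tree):
    """Number of internal nodes (one duplication each if all are binary)."""
    if isinstance(tree, str):
        return 0
    return 1 + sum(branch_points(child) for child in tree)

tree = parse(TREE)
print(len(leaves(tree)), branch_points(tree))
```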

New human sulfatase ARSG at Xp22.33

Last updated 9 June 02
As sequencing of the human genome nears completion, a new sulfatase (called ARSG here) has become detectable on the X chromosome with the April 2002 assembly. It is part of a 205 Kbp inverted doublet of tandem pairs (ARSD, ARSE, ARSG, ARSF) strand oriented - - + + relative to the p arm telomere. The new sulfatase has the 10 exons found in this subgroup and matching intron location and phases, conserved active site helix CTPSRAAFLTG and metal sites, and ESTs BM069810 and BM069570 from human Islet establish its transcriptional activity. The amino terminus, as seen elsewhere in this family, is somewhat uncertain, unalignable, and may contain a novel upstream coding exon predicted in GenomeScan entry XM_066808. A 393 bp 3' UTR region with canonical AATAAA polyA site can also be recovered from these ESTs.

The 4 sulfatases align among themselves with nearly equal 63% amino acid identity and match most closely steroid sulfatase (STS, 52% identity) among other human sulfatases. STS is also on the X chromosome p arm but approximately 15.5 Mbp downstream at Xp22.13 separated by many intervening unrelated genes. The cluster is next most closely related to GALNS at approximately 40% identity. An even more distantly related 6th sulfatase on this chromosome, IDS, is found at Xq28.

Draft genomes for mouse and rat do not yet allow reliable comparisons of the cluster region, but apparently different expansion gave rise to the ARSDEGF cluster in the human lineage (or contraction occurred in rodent lineages). Rat core STS (NM_012661) bears only 64% identity to its best match among human sulfatases, the apparently orthologous human STS gene (mouse to human is 61%). The match of rat to mouse is also low at 76%. Possibly multiple copies in human allowed rapid divergence as functions specialized.

The situation is complicated by rapid divergence of pseudoautosomal regions (2.6 Mbp in humans), escape from X inactivation, anomalous recombination, and the notion that most of human Xp represents a relic of an autosomal region added to both X and Y about 120 Myr ago. One evolutionary scenario envisions an ancestral STS gene duplication, separation of location by inversion, tandem duplication, followed shortly by a second round of inverted tandem duplication creating the final cluster of 5 sulfatases. Current flanking genes for STS provide no help.

The nearly finished human Y chromosome sheds some light on the X chromosome sulfatase cluster but raises new questions. Two homologous regions are found by Blat alignment on chrY:13683780-13809882 (best related to ARSF, ARSE, and ARSD) and chrY:16913857-16913934 (best related to STS). ARSDp and ARSEp on chrY were recognized in the mid-90's as truncated pseudogenes. Their GenBank entries NG_000881 and NG_000880 are garbled. There is no counterpart to the IDS gene on chrY though chrX contains a tandem pseudogene.

The ARSFp pseudogene on chrY has not been previously described. It is a 93% match at the DNA level, but only exons 2,3, and 10 are represented. These are insufficient to code for a functional sulfatase and additionally contain stop codons, making ARSFp an internally truncated pseudogene. ARSEp contains exons 8,9, and 10; ARSDp is represented on chrY by exons 2,3,4 and 7,8,9, and 10. All in all, it is not so surprising to see ARSGp missing.

On the April 2002 assembly, 13 sulfatase exon homologs can be seen in this chrY pseudogene region in Blat matches using all known human sulfatase proteins as probes. DNA probes clearly show 3 portions of ARSF at 83% identity on the - strand at chrY:13683780-13689494, preceding ARSEp (chrY:13771299-13778985, 93%) and ARSDp (chrY:13793713-13809882, 88%) on the + strand (indicating overall reversal or, more likely, misassembly). ARSG is not detectable by Blat with DNA probes on chrY nor by tBlastn of the appropriate intervening 81 Kbp of DNA, chrY:13689494-13771299, of which all but 21 Kbp is repeatmasked (75%). As ARSG is not a newly evolved feature on chrX, most likely it has been lost from chrY or obliterated by retrotransposon insertion. The intergenic distance between ARSE and ARSF is 108 Kbp, of which 42 Kbp is not repeatmasked -- ARSG itself spans 43 Kbp of which 22 Kbp are repeats.

Properties of the ARSG protein are those expected for its sulfatase subclass. The metal binding sites, by homology, are DD and xx; the active site has the usual residues; and the insertional hydrophobic residues are present from positions 555-666. Note that endoplasmic reticulum human STS enzyme has been crystallized and its structure is expected shortly. This will enable accurate threading of ARSDEGF, including the novel region.

However the in vivo substrate of ARSG will remain unclear as only artificial substrates are known for ARSD, ARSE, ARSF. iodothyronine sulfates

Correlating structure and sequence

The graphics below illustrate an early stage of transferring secondary structures and key active site residues determined by crystallography to a full alignment of all sulfatase families. It is also quite instructive to color the two crystallographic structures in conjunction with a multi-sequence alignment using the Combosa tool.

For ARSA_hs the PDB code is 1AUK for wild type, 1E2S for substituted active site C69S, and 1E33 for the mutation P426L. For ARSB_hs, 1FSU is available. A KIAA1001-related dimeric sulfatase from Pseudomonas aeruginosa (PDB 1HDH, 1 of 3 in that organism) has just been released, establishing that the associated cation is calcium. Alkaline phosphatase, which exhibits a near-identical fold (beta strands 8 and 9 are swapped), has 449 aa and PDB accession code 1ALK. Free full text is available for the ARSB structure determination paper and others (1, 2, 3).

Classifying sulfatases: phylogenetic tree for 83 sulfatases

The tree below is based on a ClustalW alignment (default parameters) of 83 sulfatases from human to bacteria. Only the catalytic domains were aligned (no amino terminal signal regions, no secondary oligomeric domains, no extra regions from long sulfatases). The sequences are available from the reference sequence page in fasta format; these have more informative fasta headers. Here only genus and species are indicated in a 3 letter code, eg _hsa for Homo sapiens. Note that the tree largely reproduces conventional wisdom, for example known GNS sulfatases from mammals are grouped together.

The most interesting aspect of the tree is that it proposes deeply branching positions for the IDS, SGSH, and newly discovered SulfX families. These have closer affinities to bacterial sulfatases than to any of the other mammalian families. The other two new human sulfatases, SulfY and SulfZ, group with ARSB as expected from their Blast scores and unique predicted terminal disulfide knot. The paired long sulfatases in mammals and birds evidently represent a gene doubling event that took place after the lineage split off from Drosophila and nematode. As expected, GNS sulfatases appear as the nearest relative among the short sulfatases.

It does not work well to use full length sequence alignments (in say, Blast or ClustalW) except between very closely related sequences, since choices of gap parameters are ad hoc and the resulting alignments and statistics computed from them are meaningless. However the catalytic domains are of comparable length and align adequately for purposes of classification. Shorter probes, beginning 9 residues before the early DD metal coordinating site and continuing through the conserved TG 4 residues after the R at the catalytic site, also work well. However, bacterial sulfatases are very diverged and often give a barely significant percent identity.
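A shorter probe of the kind just described can be cut out programmatically. The Python sketch below is only an illustration under stated assumptions: the active site is located with the loose pattern C.P.R followed by 4 residues and TG (consistent with the ARSG helix CTPSRAAFLTG quoted on this page, but an assumption for other families), the first DD in the sequence is taken to be the metal site, and the input sequence is an invented toy, not a real sulfatase.

```python
import re

# Assumed active-site pattern: C x P x R, four residues, then the conserved TG.
ACTIVE_SITE = re.compile(r"C.P.R.{4}TG")

def short_probe(protein, upstream=9):
    """Return the segment from `upstream` residues before the first DD
    through the TG that follows the active-site R, or None if not found."""
    protein = protein.upper()
    dd = protein.find("DD")
    if dd == -1:
        return None
    site = ACTIVE_SITE.search(protein, dd)   # look only downstream of DD
    if site is None:
        return None
    return protein[max(0, dd - upstream):site.end()]

# Invented toy sequence for illustration only.
toy = "MSTQRPNIVLILADDLGFGDLGHAYSPTCTPSRAAFLTGRYPIRSG"
print(short_probe(toy))
```

In a real sequence the first DD need not be the metal site, so the result should be checked against an alignment before use as a Blast probe.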

It is also feasible to reliably classify sulfatases from other species relative to human even though these may only be available as partial sequences (from tBlastn recovery of ESTs) or from partially finished genomes (eg, mouse) via diagnostic regions within better-conserved parts of the protein, for example about the active site where all sulfatases are alignable without gaps. Many indels amount to altering loop lengths that end up no longer truly homologous in any detailed structural sense.

As a practical matter, the query sequence is aligned, best via Blastp, against a small database of reference sequences, the best match providing the working classification. Typically, all top entries will be clustered within a single class, eg ARSB-type sulfatases, with an abrupt cutoff in Blast expectation value in succeeding secondary matches.
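The working classification step can be sketched in a few lines of Python. The hit lists below are the quail and chicken values reported in the examples further down this page (reference sequence, score, expectation value). Locating the "abrupt cutoff" as the largest drop in log10 expectation value between successive hits is an assumption made for this sketch, not the rule used by the classifier itself, and the function name is illustrative.

```python
import math

# (reference sequence, score, expectation value) from the examples on this page.
QUAIL = [
    ("KIAA1077", 1114, 2.3e-117),
    ("KIAA1247", 1036, 4.3e-109),
    ("GNS", 563, 5.7e-59),
    ("ARSA", 207, 3.0e-21),
    ("ARSF", 171, 2.0e-17),
]
CHICKEN = [
    ("KIAA1247", 502, 1.7e-52),
    ("KIAA1077", 449, 6.8e-47),
    ("GNS", 269, 8.1e-28),
    ("ARSF", 124, 4.7e-12),
    ("STS", 117, 2.9e-11),
]

def classify(hits):
    """Return (best match, names above the largest log10(E) gap)."""
    ranked = sorted(hits, key=lambda hit: hit[2])          # smallest E first
    if len(ranked) < 2:
        return (ranked[0][0], [ranked[0][0]]) if ranked else (None, [])
    # Guard against E values reported as exactly zero.
    logs = [math.log10(max(e, 1e-300)) for _, _, e in ranked]
    gaps = [logs[i + 1] - logs[i] for i in range(len(logs) - 1)]
    cut = gaps.index(max(gaps)) + 1                        # hits before the drop
    return ranked[0][0], [name for name, _, _ in ranked[:cut]]

print(classify(QUAIL))
print(classify(CHICKEN))
```

For both queries the two long sulfatases sit above the cutoff and GNS falls just below it, which matches the grouping described in the text.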

To characterize a new sulfatase, translate to protein and provide a fasta header line as necessary. Filtering at the UW Blast site should be set to off. The target database is best taken as the full set of known sulfatase catalytic domains. The post-catalytic domains can provide a higher discriminatory resolution within a predetermined narrow subfamily but so far this has not proved useful, as it simply follows the conventional taxonomic ordering of species.

As examples, quail sulfatase Qsulf1, chicken EST, and Drosophila Sulf1 sulfatase are reported by the classifier to be orthologous to mammalian KIAA1077/KIAA1247 sulfatases. The indel pattern of quail is strongly diagnostic of the KIAA1077 class as would be a conventional NCBI best Blast match.

quail                        drosophila                   chicken
match     score  E           match     score  E           match     score  E
KIAA1077  1114   2.3e-117    KIAA1077  847    4.6e-89     KIAA1247  502    1.7e-52
KIAA1247  1036   4.3e-109    KIAA1247  832    1.8e-87     KIAA1077  449    6.8e-47
GNS       563    5.7e-59     GNS       506    6.2e-53     GNS       269    8.1e-28
ARSA      207    3.0e-21     ARSF      210    1.1e-22     ARSF      124    4.7e-12
ARSF      171    2.0e-17     ARSA      200    1.7e-20     STS       117    2.9e-11
>quail adjusted probe
>drosph adjusted probe
 >chicken BI393009 

ClustalW alignment of mammalian sulfatases: first 110 residues of the catalytic domain

Co-regulation of sulfatases and sulfotransferases?

The newly discovered sulfatases in the ARSB family, SulfY on chr4 and SulfZ on chr5, are about 70% identical as proteins. As both have introns, this duplication likely arose as a translocation rather than retroposition. The high percent identity suggests any neighboring genes in the translocation block will still be detectable by protein Blat and still be adjacent, provided this relation hasn't been dissipated by inversions or loss of function. Note that STS, ARSD, ARSE, and ARSF duplicated by a different in situ tandem mechanism, whereas GNS, KIAA1077, and KIAA1247 are further examples of translocative blocks.

Neighboring genes of the block can no longer be recognized about ARSB or GNS. Now SulfY and SulfZ have an unusual gene structure for a sulfatase consisting of a single intron at the same position (the conserved TG), whereas ARSB has a more traditional 9 exons in non-corresponding positions. It seems possible then that the initial duplication of ARSB was as a processed retropositioned mRNA that later acquired an intron but prior to translocational duplications.

Indeed, a translocative block involving 4 genes with order Ank2, Camk2, Sulf, and NDST is quickly seen. Interestingly, the NDST genes adjacent to the sulfatases are known sulfotransferases involved in heparan biosynthesis (N-deacetylase/N-sulfotransferases acting on N-acetylglucosamines 1, 2, 3, 4) and have about the same 65-70% protein sequence identity.

It is thus possible that SulfY and SulfZ sulfatases are co-regulated with their cognate sulfotransferases to together regulate (on or off with the sulfate) the first step of heparan synthesis in tissue specific fashion. This additionally suggests a substrate for these ARSB paralogs. Note however that the ARSB enzyme acts on N-acetylgalactosamine-4-sulfate; also the sulfatase/sulfotransferase genes are not assembled head to head with a common divergent promoter. Thus the adjacency could be coincidental, the impression of association reinforced by the translocation blocks.

The third newly discovered sulfatase, SulfX, definitely appears to share a common divergent promoter with another gene, KIAA0372 (LocusLink: 9652) on 5q15. Here, in a gapless region of finished genome, only 144 bp separate the 5' UTR of the two genes, which are both supported by numerous mRNAs. It is possible an upstream exon exists so that one gene sits within an intron of another, yet gene sizes of 83 kbp and 48 kbp respectively militate against that. Properties of the KIAA0372 gene product (multiple copies of the tetratricopeptide or TPR domain) are available but shed no light on function or relatedness to sulfatases. There are weak Blast hits to O-linked GlcNAc transferases and a strong match to the CG8777 gene product in Drosophila.

Examining 33 mapped sulfotransferases, none was associated with a sulfatase other than the NDST group, so this is not a general phenomenon. The sulfatase SGSH is unmapped other than to chr 17q25 but it could not be adjacent to HS3ST3A1 and HS3ST3B1 on 17p11.2. The sulfotransferases studied were CHST1, CHST2, CHST3, CHST4, CHST5, CHST6, CHST7, CHST8, HNK1ST, HS2ST1, HS3ST1, HS3ST2, HS3ST3A1, HS3ST3A2, HS3ST3B1, HS3ST3B2, HS3ST4, HS3ST5, HS6ST, NDST1, NDST2, NDST3, NDST4, SULT1A1, SULT1A2, SULT1A3, SULT1C1, SULT1C2, SULT2A1, SULT2B1, SULT4A1, TPST1, and TPST2.

Oddly however, the gene causing mucopolysaccharidosis type I (Hurler), alpha-L-iduronidase (IDUA) at 4p16.3 contains within an intron the gene for a sulfate transporter, SLC26A1, at chr4:817863-870341 in opposite orientation. Mucopolysaccharidosis type II is caused by iduronate-2-sulfatase (IDS) deficiency which maps to chr Xq28. A kindred is known in which both IDUA and IDS are simultaneously affected (Am J Hum Genet 1996 Jan;58:75-85), apparently by conventional CDS mutations. SLC26A2, a sulfate anion transporter, maps to 5q33.1 as does SulfZ but at a location 500 kbp away, chr5:164569261-164574033.

On the chromosome 5 block, the TCOF1 gene for Treacher Collins syndrome (OMIM 154500) intervenes between NDST1 and SulfZ_hsa. It is somewhat reminiscent of a sulfatase disorder, affecting craniofacial development, conductive hearing loss and palate, with antimongoloid slant of the eyes, coloboma of the lid, micrognathia, microtia and other deformity of the ears, hypoplastic zygomatic arches, and macrostomia as features. There is no apparent paralog family in the chr4 or chr10 counterparts; it is rapidly evolving in mammals. According to the August 01 human genome assembly, TCOF1 is divergently transcribed from a 2.5 kbp (or less, depending on 5'UTR determination) promoter region shared with NDST1; it has no sequence homology to sulfatases or NDST1.

A third related block on chr10 has strong Ank2 and NDST paralogs but no sulfatase or Camk2 at the UCSC genome assembly; however, these may lie in two largish gaps in the critical region. (SulfZ is a small gene; its CDS spans 5161 bp.) A paper in the July 01 issue of Am J Hum Gen has this region completely misassembled, using an obsolete human genome assembly, but did report a nearby Camk2 gene on chr10. Indeed, this is supported by Blast at NCBI using the chr5 Camk2A query: NT_024037 (completed 16-Oct-2001) contains a 73% identity match called Camk2G. This maps to a slightly out-of-position region in a confused portion of the August 01 genome assembly, chr10:75521627-75532674. The gene order appears to be ANK3 .. CAMK2G.. [SulfP?] .. NDST2 .. KIAA0913 .. SEC24C ... KIAA0187.

Thus a 17th human sulfatase gene with 2 exons, SulfP, can be predicted by virtual syntenic gap bridging; at this time there is no support at NCBI human or mouse finished, htgs, or trace database but inversions are unlikely to rearrange internal elements. What can be said about the sequence of this putative new sulfatase? Going by the average percent divergences in the adjacent genes present in all three blocks, it will be much closer to the chr5 sulfatase sequence than to one on chr 4, say 80% identical, but more likely to have ancestral node values at more rapidly changing amino acid positions than idiosyncratic changes seen in SulfZ, unless it has become a pseudogene.

The orthologous murine NDSTs are known and SulfY_mmu and SulfZ_mmu orthologs are recoverable from ESTs and htgs. This means if an imperfect SulfZ match in ESTs or htgs shows up on an unmapped datum such as mus or human EST or htgs with the right homological percentages and tree topology, it can actually be mapped with some confidence to the putative chr10 gap gene since sulfatase orthologs run in the mid-90's between human and mouse. So EST mapping may occasionally be possible by virtual synteny in conjunction with ClustalW. No such candidate EST is available; SulfZ itself has a single EST at GenBank, fetal lung AA358883, which in conjunction with conserved protein features in human and mouse, means it is probably not a pseudogene, only rarely transcribed in the tissues studied to date. SulfP may be similar.
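
The "homological percentages" used in this kind of argument are simple pairwise identities over an alignment. A minimal sketch follows, assuming two rows already aligned by ClustalW or similar; the function name is hypothetical and the two fragments are toy inputs differing at one position, not real EST data.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns, skipping any column with a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x in "-." or y in "-.":
            continue  # gap column: not counted either way
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared

# Toy fragments (illustrative only): 10 of 11 columns identical.
print(round(percent_identity("SRAALLTGRYP", "SRAALLTGKYP"), 1))
```

A candidate EST would then be scored against each known paralog and assigned to the gap gene only if its identities fit the expected pattern for all three blocks.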

Location and structure of STS, ARSD, ARSE, ARSF insertion

The STS family of sulfatases has long been known to carry a block of extra residues within its catalytic domain, presumed to represent a loop insertion that serves to attach the protein to the luminal membrane wall. Its evolutionary origin is unknown: the bland hydrophobic composition is uninformative at Blast searches, especially given several hundred million years of mutational processes. Individual residues are poorly conserved within the STS group, even between mouse and rat, though the insert character is consistent.

The insertion is difficult to localize precisely. However, as the above ClustalW alignment of 110 residues shows, the STS family is of standard sequence through the metal coordinating residues K and H in the conserved feature GKWHL; indeed, careful Blast comparison to not too distant sulfatases extends the alignable region another 28 residues, to just past the conserved beta strand B5 (position 164 in full length human STS). Alignability does not pick up again until 79 residues later at the start of beta strand 8.

This would still permit the standard 10 strand beta sheet in the STS family. Note that the "missing" strands 6-7 are not part of the protein core but instead are a protrusion. Thus it appears quite possible that the corresponding region in the STS group lacks beta structure (unsuitable for the new membrane attachment role?) and amounts to a mega-loop replacement of the region following strand 5 and preceding strand 8 in conventional sulfatases. This illustrates a fallacious assumption in proteomics: here is a second domain in sulfatases whose structure cannot be inferred despite 3 known structures. Secondary structure prediction (using Predator) of 6 family members does however yield a fairly consistent picture of 2-3 transmembrane helical domains unlike any other sulfatase class (using TMHMM):


outside  1 160
TMhelix 161 183
inside  184 189
TMhelix 190 212
outside 213 558
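
For bookkeeping, a topology listing in the three-column form shown above can be read into segments and checked for gaps. This is a minimal sketch that assumes exactly that layout (label, start, end); it is not a parser for the full TMHMM output format, and the function names are hypothetical.

```python
TOPOLOGY = """\
outside 1 160
TMhelix 161 183
inside 184 189
TMhelix 190 212
outside 213 558
"""

def parse_topology(text):
    """Return a list of (label, start, end) from 'label start end' lines."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        label, start, end = fields
        segments.append((label, int(start), int(end)))
    return segments

def is_contiguous(segments):
    """True if every segment starts one residue after the previous one ends."""
    return all(nxt[1] == prev[2] + 1 for prev, nxt in zip(segments, segments[1:]))

segments = parse_topology(TOPOLOGY)
helix_lengths = [end - start + 1 for label, start, end in segments if label == "TMhelix"]
print(helix_lengths, is_contiguous(segments))
```

Both predicted helices span 23 residues and are separated by a 6-residue cytoplasmic turn, which is the helical hairpin picture described in the text.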

The insert does not correspond to a complete exon. Instead it is part of a large 141aa exon with the same boundaries in the 4 human sulfatase genes, with 32aa preceding the insert and 30aa following. This exon has no counterpart in its boundaries, even approximately, in any other group of sulfatases; that is, no simple fusion or insertion scheme can bring exon boundaries into consistency. Thus its origin cannot be resolved or even reliably dated (without additional sequences -- only an additional fragment is available from zebra fish). It may simply have resulted from pass-over of a splice donor that resulted in a longer exon that happened to have an initial hydrophobic character.

A new internal conserved pair of cysteines occurs at the extreme flanks of the insert. It is not known whether these form a disulfide, though this could provide a modicum of structural anchoring. Otherwise the region is extremely poorly conserved and has been evolving very rapidly compared to other domains and seemingly randomly (within the context of membrane attachment retention) over at least the last 100 million years. It cannot be modelled due to the lack of relevant xray structures and influences of the membrane milieu.

Using a Kyte-Doolittle hydrophobicity plot, the insert is clearly visible as a broad patch of predominant hydrophobic character. The insert occurs at positions 140-218 relative to the mature catalytic core (the modified cysteine is at position 50 in this numbering system).
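
A Kyte-Doolittle profile is just a sliding-window average of per-residue hydropathy values. The sketch below uses the standard published scale; the window size and the toy sequence are assumptions chosen for illustration and do not reproduce the plot made for this page.

```python
# Kyte-Doolittle hydropathy scale (standard published values).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy of each window; entry i covers residues i .. i+window-1 (0-based).

    Raises KeyError on non-standard residue codes such as X.
    """
    seq = seq.upper()
    if window > len(seq):
        return []
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

# Toy sequence (hypothetical): an acidic stretch, a leucine patch, a basic stretch.
toy = "DDDDD" + "LLLLLLLLL" + "KKKKK"
profile = hydropathy_profile(toy, window=5)
peak = profile.index(max(profile))
print(peak, round(profile[peak], 2))
```

On a real STS sequence, a window of roughly 19 residues is the usual choice for spotting membrane-spanning stretches; the insert would show up as a sustained run of positive values.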

Are disulfides and glycosylation sites conserved?

Best viewed by collecting sequences and coloring cysteines in a word processor by search and replace
The actual disulfide linking pattern is known from the 3 xray structures (the Pseudomonas protein has none) and therefore can be inferred to a certain extent in others: a conserved cysteine pair, adjacent in the folded threaded structure, is likely to form a disulfide. Cysteines not even conserved across mammals are unlikely to be in disulfides; these are called sporadic cysteines below.

ARSA_hsa has a terminal disulfide knot linking 6 cysteines, a region at 87-103 aa past the catalytic cysteine with two nested pairs in beta strands 6-7, and a cross-domain pair 231-351 linking two loops. These are conserved -- and presumably also disulfides -- in the other 3 mammalian species known. Human has a further cysteine 31 aa prior to the catalytic site with no counterpart in other species. A single conserved loop cysteine 225 aa past the catalytic site has no available intra-molecular binding partner.

These disulfide pairs are conserved to some extent in KIAA1001 proteins, the nearest neighbors to ARSA. However, the first quartet has more intervening residues, as does the long range pair, and the final knot has only two disulfide pairs. In the more weakly related GALNS sulfatases, nothing corresponds to the first quartet, the long-range disulfide is supported by alignment, and the terminal knot again could be two pairs. Since 2 in 5 residues are conserved overall between GALNS and ARSA, two cysteines would be conserved by chance 4 times in 25 (16%).
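
The 16% figure is the square of the per-residue conservation, on the assumption (made here only for illustration) that the two cysteine positions are conserved independently of each other. The function name is hypothetical.

```python
from fractions import Fraction

def chance_pair_conservation(per_residue_identity):
    """Probability that both members of a pair are conserved, assuming independence."""
    return per_residue_identity ** 2

p = chance_pair_conservation(Fraction(2, 5))
print(p, float(p))
```

The same calculation shows why a conserved terminal knot of several cysteines is much stronger evidence than a single conserved pair: the chance expectation falls off as a power of the number of positions involved.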

The conclusion for the extended ARSA family, wrongly described in the literature, is that disulfide pairs have considerable conservation, though some pairs appear more fundamental than others. The knot and long-range pair appear to be the oldest conserved elements.

ARSB_hsa has 4 disulfide pairs that have conserved counterparts in rat and cat ARSB proteins (the latter each have a sporadic cysteine). The first, a long-range pair 26-436 is not an option in the nearest neighboring proteins, SulfX and SulfY. The 30-64 pair is conserved in position and length; the 90-101 pair is also present though slightly shorter at 90-96; and the 320-362 pair is missing altogether. None of these are conserved in the 4 drosophila ARSB-type proteins; these have their own conserved cluster of 4 cysteines extending from 93-119 of the post-catalytic core that are likely paired as disulfides. There is no explanation here for why C521Y causes severe disease (Am J Hum Genet 1994 Mar;54(3):454-63).

The conclusion here is that disulfides are important to ARSB structure but have evolved independently in different lineages. In humans, two of the four disulfides are deeply conserved.

The STS group has a possible counterpart to the two disulfide terminal knot; 5 conserved cysteines in the catalytic core including 1 preceding the active site and 4 conserved cysteines in the post-core. Some of these may be in disulfides; the organization could be inferred from threading, as the cysteines need to be adjacent in the folded protein.

The GNS family of sulfatases contains 10 conserved cysteines in addition to the one in the active site. The two early cysteines, 27-49, do not have a counterpart outside this family. The remaining 8 are conserved in the post-catalytic core. The long sulfatases, to which the short GNS is best related, have only 1 of the early cysteines. There are 18 cysteines in each long sulfatase's post-catalytic region, of which the last 6 correspond to those of GNS. Some 25 residues of GNS at the beginning of the post-catalytic region cannot be located in the KIAA series.

The inserted portion of long sulfatases consists of 7 complete exons, chr8:80324928-80349742 on the August 01 human genome assembly. These are exons 9-15 of the whole CDS. This suggests, since it is not that ancient a feature, that it arrived with introns from another gene. However, no related region can be located by Blast at this time, even though the inserted region is extremely well conserved. There is a 5% chance that something homologous exists in unsequenced human genome; otherwise, the counterpart may have been lost, or perhaps it transferred in its entirety so no outside homolog exists.

(to be continued)

Phosphorylation of arylsulphatase A occurs through multiple interactions with the UDP-N-acetylglucosamine-1-phosphotransferase proximal and distal to its retrieval site by the KDEL receptor.
Biochem J 1999 Jun 15;340 ( Pt 3):729-36 PMID: 10359658
Dittmer F, von Figura K.

Phosphorylation of oligosaccharides of the lysosomal enzyme arylsulphatase A (ASA), which accumulate in the secretions of cells that mis-sort most of the newly synthesized lysosomal enzymes due to a deficiency of mannose 6-phosphate receptors, was found to be site specific. ASA residing within the secretory route of these cells contains about one third of the incorporated [2-3H]mannose in phosphorylated oligosaccharides. Oligosaccharides carrying two phosphate groups are almost 2-fold less frequent than those with one phosphate group and only a few of the phosphate groups are uncovered. Addition of a KDEL (Lys-Asp-Glu-Leu) retention signal prolongs the residence time of ASA within the secretory route 6-fold, but does not result in more efficient phosphorylation. In contrast, more than 90% of the [2-3H]mannose incorporated into secreted ASA (with or without a KDEL retention signal) is present in phosphorylated oligosaccharides. Those with two phosphate groups are almost twice as frequent as those with one phosphate group and most of the phosphate groups are uncovered. Thus, ASA receives N-acetylglucosamine 1-phosphate groups in a sequential manner at two or more sites located within the secretory route proximal and distal to the site where ASA is retrieved by the KDEL receptor, i.e. proximal to the trans-Golgi. At each of these sites, up to two N-acetylglucosamine 1-phosphate groups can be added to a single oligosaccharide. Of several drugs known to inhibit transit of ASA through the secretory route only the ionophore monensin had a major inhibitory effect on phosphorylation, uncovering and sialylation.

The insertion in long sulfatases

The insertion boundaries were determined using Blast relative to the collection of GNS genes, in conjunction with the exonic structures of these genes in finished human genomic sequence. KIAA1077_hsa turned out to have its extra residues cleanly in 7 exons, 9-15. Inserts in the others were recovered by alignment. This domain has an extensive coiled coil, a striking charged domain, and 12 conserved cysteines (in addition to 6 terminal cysteines in common with GNS). However, these extra residues of mammalian long sulfatases fail to exhibit homology with anything non-sulfatase using careful Blast search techniques and so offer no clue as to their origin or presumed extra function.

However, using the insert as a probe for tBlastn of dbEST, a well-conserved long sulfatase fragment from Xenopus turns up, which by good fortune can be extended to include a full length catalytic domain (35aa gap). Note that all 6 cysteines in the insert region are precisely conserved in Xenopus; this provides support for disulfides. A similar fragment assembly from cattle ESTs conserves 16 cysteines. The chicken EST BI393009 is of KIAA1247 type strongly suggesting, given quail is KIAA1077 type, that birds have both forms of long sulfatases. The conservation of these proteins is extraordinary. Rough alignment of long sulfatases, anchored on 12 conserved cysteines and showing nominal coiled-coil domains, with lysine and arginine replaced by *:

gag77 *FL***EEAN*NTQQSNQLP*YE*V*ELCQQA*YQTACEQPGQ*WQCTEDASG*L*IH*C*VSSDILAI***A*.....SIHS*GYSG*D*DCNCGDTDF*NS*TQ**SQ*QFL*NPSAQ*Y*P*FVHT*QT*SLSVEFEGEIYDINLEEEELQVL*T*SIT**HNAEND**AEETDGAPGDTMVADGTDAIGQPssv*vth*Cfilpndti*Ce*elyqsa*aw*dh*ayid*eiealqd*i*nl*ev*ghl****pdeCdCt*qsyyn*e*gv*tqe*i*shlhpf*eaaqevds*lqlf*en*****e**g***q**gdeCslpglTCFTHDNNHWQTAPFW
ele.. *MP*L**I*D*YI*Q***FN*EN*LS*EC****WQ*DCVH.GQLW*CYYTVED*W*IY*C*DNW.........................SDQCSC****EISNYDDDDID.............................................................................................................EFLTYAD*ENFSEGHEWYQGEFEDSGEVGEELDGH*S**GILS*CSCS*NVSHPI*LL..........................EQ*MS**HYL*Y***PQNGSL*P*DCSLPQMNCFTHTASHW*TPPLW
cys ..............................1........2........3............4................................5.6...............................................................................................................7.........8..............................................9.10.............................................................11.....12.............
>KIAA1077_xla_frag 471aa Xenopus insert BG408276 BG360235 BG814048 83% to quail KIAA1077; 35aa missing filled with quail (final caps show 97aa of insert match)

>KIAA1247_bta_frag 422aa Bos taurus BF230382 AV617208 AV600225 98% to KIAA1247_hsa (last caps show insert match)

>KIAA1077_hsa_insert 346aa frame 2 exons 9-15 chr8:80324928-80349742

>KIAA1077_mmu_insert 344aa mouse 88%

>KIAA1077_cco_insert 338aa quail 77%

>KIAA1247_hsa_insert 324aa human 46%

>KIAA1247_mmu_insert 329aa mouse 45%

>KIAA47/77_cin insert 332aa tunicate unique Ciona intestinalis long sulfatase
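
The display convention used in the rough alignment above (lysine and arginine shown as *, cysteines numbered on a ruler line) is easy to regenerate for any sequence. This is a minimal sketch with hypothetical function names and a toy input string; it works on one ungapped sequence at a time and is not the tool used to build the alignment shown.

```python
def mask_basic(seq):
    """Replace lysine (K) and arginine (R) with '*', leaving everything else as is."""
    return "".join("*" if aa in "KRkr" else aa for aa in seq)

def cysteine_positions(seq):
    """1-based positions of every cysteine in the sequence."""
    return [i for i, aa in enumerate(seq, start=1) if aa in "Cc"]

# Toy fragment (illustrative only).
toy = "KELCQQARYQTACEQ"
print(mask_basic(toy))
print(cysteine_positions(toy))
```

Comparing the position lists of two aligned, equally masked rows is a quick way to confirm that a cysteine counted as "conserved" really sits in the same column.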
Synteny of mouse and human sulfatase genes:

A small translocation involving human chromosomes 8q21 and 20q13 resulted in a retained gene duplication that is known today as the KIAA1077 and KIAA1247 sulfatases (after Kazusa Institute cDNAs). This event preceded the divergence of human and mouse; evidence for this is given in raw form below. However, the event giving rise to these long sulfatases, by parsimony considerations, must have preceded their doubling: a much earlier gene duplication, involving the GNS now on 12q14, gave rise initially to a KIAA sulfatase that then acquired extra coding properties (unlike the member that is today GNS). Indeed, since Drosophila also has a related long sulfatase, this event occurred early in animal evolution. Beyond this, affinities of the GNS-KIAA 1077-KIAA 1247 family remain obscure.

Mouse chr 1 genes and their human counterparts on chr 8q21 and chr 20q13:

OPRK1  chr8:56794983-56816757 NM_000912 opioid receptor, kappa 1
Sox17  chr8:58028772-58030137 NM_022454 FLJ22252
RP1  chr8:58112555-58127323 NM_006269 retinitis pigmentosa RP1 protein
MYBL1  chr8:70234049-70234271
KIAA1077 chr8:73675649-73704612
EYA1  chr8:75452356-75587230 NM_000503 eyes absent Drosophila homolog 1
MSC  chr8:75959614-75962428 NM_005098 musculin 8q21 11
TERF1  chr8:77147357-77165680 NM_003218 telomeric repeat binding

OPRL1  chr20:64627479-64628368 opiate receptor-like 1 NM_000913
SOX18  chr20:64581727-64582163 NM_018419 SRY sex determining region Y-box 18
RP1  chr20 counterpart missing, oddly no counterpart anywhere, large protein
MYBL2  chr20:44004996-44047675 NM_002466 20q12
KIAA1247 chr20:47988960-48033834
EYA2  chr20:47321193-47519325 NM_005244
MSC  chr20 counterpart missing
TERF2  chr20 counterpart missing chr16:80806221-80836549 NM_005652

Mouse chr2 genes and their human counterparts on chr 20q:

RPN2  chr20:37420266-37620266 no chr 8
PLCG1  chr20:41468766-41506160 no chr 8
RBL1  chr20:37329033-37426950 no chr 8
TOP1  chr20:41360013-41455684 topoisomerase; chr 8 147627504-147763382 not applicable
MYBL2* chr20:44004996-44047675 NM_002466 20q12
PLTP  chr20:46229943-46243330 no chr 8
PTPRT  chr20:42403965-43521108 no chr8
ADA  chr20:44950714-44982894 NM_000022 adenosine deaminase no chr8
SDC4  chr20:45656478-45679601 NM_002999 syndecan 4
SEMG1  chr20:45538239-45540959 NM_003007 semenogelin no chr8
TCF4  chr20:522959-528967  NM_004609 TCF15
EYA2*  chr20:47321193-47519325 NM_005244
SDCBP  chr8:61928362-61962228 NM_005625 20 +1231038 1232169 not applicable
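
One way to summarize the raw synteny evidence above is to sort each chromosome's genes by start coordinate and count which neighbor pairs survive in both blocks. The sketch below does this for the five paralog pairs that have coordinates on both chr8 and chr20 in the first table; the family labels (OPR, SOX, MYBL, KIAA, EYA) are shorthand introduced here, and adjacency counting is an illustrative choice, not the analysis used for this page.

```python
def order_by_start(genes):
    """Family labels sorted by start coordinate along the chromosome."""
    return [family for family, start in sorted(genes.items(), key=lambda kv: kv[1])]

def adjacencies(order):
    """Unordered neighbor pairs in a gene order (orientation-independent)."""
    return {frozenset(pair) for pair in zip(order, order[1:])}

# Start coordinates copied from the chr8 / chr20 table above.
CHR8 = {"OPR": 56794983, "SOX": 58028772, "MYBL": 70234049,
        "KIAA": 73675649, "EYA": 75452356}
CHR20 = {"OPR": 64627479, "SOX": 64581727, "MYBL": 44004996,
         "KIAA": 47988960, "EYA": 47321193}

order8 = order_by_start(CHR8)
order20 = order_by_start(CHR20)
shared = adjacencies(order8) & adjacencies(order20)

print(order8)
print(order20)
print(sorted(sorted(pair) for pair in shared))
```

Two of the four chr8 neighbor pairs (OPR with SOX, and the sulfatase with EYA) are also neighbors on chr20, consistent with a shared ancestral block that has since been internally rearranged.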

Exon structure of human sulfatases

The phylogram below shows coding exon structures for 17 human sulfatases according to the April 2002 human genome assembly. The sequences are aligned on the active site rather than by N-termini (which are ragged); the scale is 1 pixel to 1 amino acid. Overall, exon boundaries are poorly conserved except in subfamilies. The same can largely be said for putative disulfides and glycosylation sites. SulfY and SulfZ, with only one intron, may have arisen as processed pseudogenes that retained activity and later acquired an intron prior to their subsequent translocational duplication.

To see the tree at better scale, paste the .PHB file into TreeView. The tree was derived from ClustalW alignment of catalytic domains. Note that many of the gene duplications occurred very early in the history of the family and that branch lengths (roughly, rates of evolution) are similar, with the exception of anomalously high rates of change in the X chromosome cluster (ARSG subtree).


KIAA1247 has 20 coding exons in all species where this is determinable. Exon 19 is missing in all known KIAA1077 (floor plate) sulfatases, including human, mouse, rat, and quail. This is not alternative splicing because tBlastn at e=100 can find no sign of a skipped exon in intervening finished DNA. The gene duplication leading to long sulfatases is quite ancient since 3 distinct long sulfatases are observed in fugu.

The approximate exon boundaries for some human sulfatases are given below. Phase is indicated by a lower case letter attached to the exon having 2/3 codon letters. By default, the GT-AG rule (resp. GC-AG) was assumed in determining splice junctions. The quality of data is uneven, depending on the status of the human genome project at the site involved and will improve in coming assemblies. Mouse exon structure can also be determined in some instances from unfinished sequence.

One fact to emerge from comparing exon boundaries carefully is that several of the genes have a phase 2 exon ending at the mysteriously conserved TG following the catalytic site (for example, SulfY and SulfZ). Could this explain the conservation of the amino acid feature: its nucleotides enhance and define a good splice donor? Other sulfatases, however, do not have exon breaks here, e.g. GNS, though ARSA and ARSB have a boundary nearby. Bacterial sulfatases also have this conserved TG (sometimes SG) but have no introns. Being at the opposite end of the catalytic alpha helix leaves these residues distant from the active site; alanine mutagenesis could not provide a structural explanation. Folding intermediates have also been postulated. Using the pattern [ST].[RK] for protein kinase C phosphorylation site gives TGR and TGK as candidates.
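
The [ST].[RK] pattern translates directly into a regular expression. The sketch below scans a protein string for every match, including overlapping ones; the function name is hypothetical, and the three fragments are short stretches copied from sequences on this page (the SulfZ exon 1 end, the STS Y pseudogene line, and the Drosophila CG7408 translation).

```python
import re

# Lookahead so that overlapping candidate sites are all reported.
PKC_SITE = re.compile(r"(?=([ST].[RK]))")

def pkc_candidates(seq):
    """Return (1-based position, tripeptide) for each [ST].[RK] match."""
    return [(m.start() + 1, m.group(1)) for m in PKC_SITE.finditer(seq.upper())]

for fragment in ("QLLTGR", "AAFMTGR", "SRAALLTGKYP"):
    print(fragment, pkc_candidates(fragment))
```

A pattern this short matches often by chance, so a hit is only a candidate; it says nothing about whether the site is accessible or actually phosphorylated.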

>SulfX_hs 8 exon structure 15 aa signal 55.6 %: extracellular

>SulfY_hsa 574aa 2 exons chr4:125570802-125648441 size 77640 - ::123::45

>SulfZ_hsa 569aa NT_006951 ARSB type another chr 5 gene 2 exon, 4 glyco QLLTGR end of exon1::123::45

>ARSB from NT_027010

>ARSA cds 8 exons

>GNS_hs 552 aa 14 cds exons Glu6S

>GNS_hs on August 01 assembly relative to KIAA group

>KIAA1247_hs 20 cds exons

>KIAA1077_hs 20 exons

>STS_hs Xp22.31 10 exons

>STS Y pseudogene
2 RTPNIDWLASEGVKLTQHLAASPLCTPSRAAFMTGR*PV.................................33

>ARSD_hs 9 cds exons 
0 MRSAARRGRAAPAA ... orphaned
8 VIDGHSLVPLLQGAEARSAHEFLFHYCGQHLHAARWHQKDS [[old break 8-9 fused, new break to 9 9-10 fused

>AC084294 Mus musculus htgs genomic clone RP23-169K20 14 unordered pieces Length = 199377 chr ?

>ARSE_hs 9 cds exons
0 MLHLHHSC ... orphaned

>ARSF_hs 9 exons

>KIAA1001_hs cds

>IDS_hs 9 exons

>Sulf2_C.elegans U43375 14 exons complement

>AE003522 CG7408 gp Drosophila melanogaster "486 aa" 585 (33%) ARSB false gene start, 18aa short of site, 99 aa missing, aligns to ARSB starting with LLLLAPP 29 to 1; align ends at HNEWTWW...554

3 ESTs support a normal gene here: SD22483 678 bp SD23718 RE13542 670 bp
cgcacttcgagtccggcggatcagagattagaattcggggaacgattaactcgatcgcgaataccagtgactgaaattggcatgcacgaatctagctgataagtccattgttatttggattccttttatttgatttcgcattataccgctaatatatccggaacgatctgaagagctcatccatcattgacagagtgttacccgctgacttgttgtcgcccatttgctgtcccacatcatcctcactccattcacggcg LVVAHLLSHIILTPFTAMSTHLDKFSSATSLLTGFVLCIALSNGIVATSDKPNIIIIMADDLGFDDVSFRGSNNFLTPNIDALAYSGVILNNLYVAPMCTPSRAALLTGKYPINTGMQHYVIVNDQPWGLPLNETTMAEIFRENGYRTSLLGKWHL

drosophila sulfatase cluster:
CG5584 75A2  "N-acetylgalactosamine-4-sulfatase" 996 aa
CG7402 75A6--8 "N-acetylgalactosamine-4-sulfatase" 579 aa
CG7408 75A4--7 "N-acetylgalactosamine-4-sulfatase" 486 aa 5 coding exons:

Relationship of the drosophila proteins, relative to CG7408 
AE003522 CG5584 gp Drosophila melanogaster 996 aa (41%) ... 1301 1.6e-137 1
AE003522 CG7402 gp Drosophila melanogaster 579 aa (43%) ... 1277 5.7e-135 1
AE003821 CG8646 gp Drosophila melanogaster 542 aa (51%) ... 676 3.5e-112 2
AAF55607 CG14291 gp 524 aa Drosophila melanogaster (53%)... 190 5.2e-17 2
AE003712 Sulf1 Drosophila melanogaster 1114 aa (59%) KIA... 187 1.5e-14 2
AE003478 CG12014 gp Drosophila melanogaster 512 aa (46%)... 131 1.2e-09 2

Under development

Coming soon to the sulfatase home page:
... comparative exons boundaries of the 16 human sulfatases relative to key residues
... visit OMIM to see if the 6 new sulfatases correspond to positionally mapped diseases
... visit PubMed to collect sulfated metabolites for candidate substrates, eg dopamine sulfate
... finish the synonym glossary and regularize the tentative names for unpublished sulfatases
... finish checking into alternative splicing in human sulfatases and resultant proteins
... exclude missing sequences by converging alkaline phosphatase to sulfatase probe
... thread 3D structures to make plausible disulfide proximity assignments
... map the known substrates onto phylogenetic tree to suggest substrates for duplications
... collect and map the known mutations onto the structures
... does the metal ion change as the coordination residues change, are latter cross-correlated?
... overview of sulfotransferases because of off-setting regulation
... check if GNS/KIAA duplication is still recognizable in genomes
... identification of MSD modification gene
... provide more details on 3 new human sulfatases
... add leader peptides to extended fasta format; identify cation ligands throughout
... add counter, collect sulfatase emails, notify OMIM, fix best links

Human genome location of sulfotransferases (August 01 assembly):

Gene      Location                   Band      RefSeq     Description
CHST1     chr11:49003217-49019659    11p11.2   NM_003654  chondroitin 6/keratan
CHST2     chr3:163373104-163375596   3q23      NM_004267  chondroitin 6/keratan
CHST3     chr10:78488846-78533765    10q22.1   NM_004273  chondroitin 6/keratan
CHST4     chr16:85850815-85863171    16q22.2   NM_005769  N-acetylglucosamine 6-O sulfotransferase
CHST5     chr16:90730589-90747406    16q23.1   NM_012126  N-acetylglucosamine 6-O sulfotransferase
CHST6     chr16:90680422-90697352    16q23.1   NM_021615  N-acetylglucosamine 6-O sulfotransferase
CHST7     chrX:46968426-46992908     Xp11.23   NM_019886  N-acetylglucosamine 6-O sulfotransferase
HNK1ST    chr2:104031497-104122622   2q11.2    NM_004854  HNK-1 sulfotransferase
HS2ST1    chr1:101010965-101048005   1p22.3    NM_012262  heparan sulfate 2-O-sulfotransferase 1
HS3ST1    chr4:12991117-12992413     4p15.33   NM_005114  heparan sulfate D-glucosaminyl
HS3ST2    chr16:27575495-27676992    16p12.2   NM_006043  heparan sulfate D-glucosaminyl
HS3ST3A1  chr17:14846992-14953230    17p11.2   NM_006042  heparan sulfate D-glucosaminyl
HS3ST3B1  chr17:15652492-15697478    17p11.2   NM_006041  heparan sulfate D-glucosaminyl
HS6ST     chr2:132836865-132888085   2q21.1    NM_004807  heparan sulfate 6-O-sulfotransferase
NDST1     chr5:165245132-165290827   5q33.1    NM_001543  N-deacetylase/N-sulfotransferase (heparan
NDST2     chr10:66113014-66119515    10q21.2   NM_003635  N-deacetylase/N-sulfotransferase (heparan
NDST3     chr4:129869870-130413364   4q26      NM_003635  N-deacetylase/N-sulfotransferase (heparan
NDST4     chr4:126364694-126634426   4q26      NM_022569  N-deacetylase/N-sulfotransferase (heparan
SULT1A1   chr16:34232477-34236209    16p11.2   NM_001055  sulfotransferase family, cytosolic, 1A,
SULT1A2   chr16:34302736-34307862    16p11.2   NM_001054  sulfotransferase family, cytosolic, 1A,
SULT1A3   chr16:84175628-84185516    16q22.1   NM_003166  sulfotransferase family, cytosolic, 1A,
SULT1C1   chr2:112366547-112387460   2q12.3    NM_001056  sulfotransferase family, cytosolic, 1A,
SULT1C2   chr2:112455590-112465628   2q12.3    NM_006588  sulfotransferase family, cytosolic, 1A,
SULT2A1   chr19:58784636-58799587    19q13.32  NM_003167  sulfotransferase family, cytosolic, 1A,
SULT2B1   chr19:59910354-59934240    19q13.32  NM_004605  sulfotransferase family, cytosolic, 1A,
SULT4A1   chr22:40838242-40875040    22q13.31  NM_014351  sulfotransferase family, cytosolic, 1A,
TPST1     chr7:68803924-68961571     7q11.21   NM_003596  tyrosylprotein sulfotransferase 1
TPST2     chr22:23617817-23682152    22q12.1   NM_003595  tyrosylprotein sulfotransferase 2
