Annotation of the sulfatase gene family: introduction
Properties of the 17 human sulfatases
New human sulfatase ARSG on X chromosome
Correlating structure and sequence
Classifying sulfatases: phylogenetic tree for 83 sulfatases
Co-regulation of sulfatases and sulfotransferases?
Location and structure of STS, ARSD, ARSE, ARSF insertion
Are disulfides and glycosylation sites conserved within subfamilies?
Mouse and human long sulfatases: insertion and synteny
Exon structure of human sulfatases
Miscellaneous topics (under development)
130 sulfatase reference sequences (off-page)
34 + 21 sulfatase modifying sequences (off-page)
Last updated 22 June 03
The sulfatases are a conserved gene family having 17 functional representatives in the human genome assembly of April 2003. This enzyme lineage can reliably be traced back to early prokaryotes due to a diagnostic modified cysteine (to formylglycine), catalytic metal binding sites, and a conserved fold (which identifies the sulfatase gene family as a branch of the alkaline phosphatase gene family).
The earliest sulfatases studied were conventional lysosomal catabolic enzymes responsible for breakdown of sulfated metabolites such as dermatan sulfate, yet today half the members seem involved in tissue remodeling or implementation of developmental regulation, in some cases balancing synthesis by sulfotransferases. Although some sulfatases have known in vivo substrates (inferred from metabolite accumulation in rare human diseases), for others there is no clue, though sulfate hydrolysis is surely retained in all.
For example, the steroid sulfatase STS belongs to a relatively recent X chromosomal tandem duplication multiplet where the order of evolutionary events, from genome position and sequence conservation, was clearly STS, [ARSF, (ARSD, ARSE)], yet the substrates of the later 3 enzymes remain obscure, despite a well-studied bone and joint disease associated with ARSE. The STS family also includes pseudogenes on non-recombining chr Y; it does not include the chr X q arm sulfatase IDS.
Sulfatase nomenclature is highly unsatisfactory, with numerous synonyms in use and some members named for uninformative artificial substrates (arylsulfatases). Since mammals appear to have a full complement of sulfatases, and humans are the best studied, this site uses the official gene nomenclature at HUGO whenever possible, using the same name for gene and encoded protein, even across species when orthology is clear. A table of synonyms would be useful however. There are currently 6 new unnamed human sulfatases. Two of these could be put in the ARSB series but following the guidelines, this would require renaming ARSB as ARSB1.
Even a gene family constrained to slow divergence becomes difficult to align after long enough time passes. Although subfamilies present few alignment issues, numerous small insertions and deletions (indels) make alignment problematic across subfamilies: the position of indels is not stable to choice of gapping parameters in software such as Blast and ClustalW. Most indels occur within the 20-odd loops linking beta strands and alpha helices and have little effect on the overall fold.
While the 3 known crystallographic structures are quite helpful in remote alignment, residues critical to the active site -- and reliably conserved across subfamilies -- occur only in the catalytic core domain. Mysteriously, other residues having no clear structural or functional role can be even more strongly conserved, but reliable alignment anchors outside the core remain rare. This region determines oligomerization: the ancestral pattern is arguably dimer as ARSA contacts are conserved in alkaline phosphatase; even octameric ARSA is a tetramer of dimers. (However the Ps. aeruginosa sulfatase and ARSB are apparently monomeric.)
Disulfides show surprisingly poor conservation across subfamily lineages. Potential glycosylation sites (NxTx, NxSx, x not proline) are better but still unsatisfactory, conserved within specific subfamilies and sometimes across them. Some 287 sites are observed in 50 eukaryotic sulfatases (average 5.7, range 2-14). The new quail sulfatase and its allies are heavily glycosylated; a 31 Aug 01 Science paper by Dhoot et al. showed the quail protein is exposed on the cell surface. For sulfatases such as ARSA and ARSB, experimental work has confirmed that most potential glycosylation sites are in fact modified in vivo. Like the indel signature, conserved glycosylation sites may have some role as recognition anchors in alignment and subfamily diagnostics.
Past the catalytic core, secondary structure features of the 4-stranded beta sheet and terminal helix are likely conserved even though sequence similarity itself has severely diminished. The best alignment strategy in the post-core region is secondary structure prediction within each subfamily (significantly aided by multiple alignment input), augmented by anchor residues, reliable Blast patches, and exon boundary correspondences.
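The site pattern quoted above (N, any residue but proline, S or T, any residue but proline) is easy to scan for. A minimal sketch in Python follows; the function name is made up for illustration, and the test sequence is simply the first 55 residues of the quail probe quoted further down this page. It only finds potential sites and says nothing about whether a site is conserved or actually modified in vivo.

```python
import re

# Sequon as described in the text: N-x-[S/T]-x, neither x being proline.
# The lookahead reports overlapping sites (e.g. "NNTS") separately.
SEQUON = re.compile(r"(?=(N[^P][ST][^P]))")

def potential_glyco_sites(protein):
    """Return 1-based positions of the Asn in each potential N-glycosylation site."""
    return [m.start() + 1 for m in SEQUON.finditer(protein.upper())]

# First 55 residues of the quail probe sequence quoted on this page.
quail_start = "PNIILVLTDDQDVELGSLQVMNKTRRIMENGGASFINAFVTTPMCCPSRSSMLTG"

sites = potential_glyco_sites(quail_start)
print(sites)

# The average quoted in the text: 287 sites over 50 sulfatases.
print(round(287 / 50, 1))
```

Counting sites separately in the catalytic and post-catalytic parts of each protein would give figures in the style of the "glyco" column of the table below, though that column also filters on conservation in other mammals.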
The central structural issue in sulfatases -- and the key motivation to align remote members -- concerns "long" sulfatases. The newly discovered human sulfatases KIAA1077 and KIAA1247 (and their counterparts in mouse and quail) average 830 amino acids, some 320 residues more than their closest relative among conventional sulfatases, the 550 residue GNS subfamily. (A related sulfatase in Drosophila is even longer at 1114 aa.) Note that 320 extra residues is longer than the average stand-alone enzyme in E.coli.
The key questions are, what function do these extra 320 residues serve (Wnt signalling and embryo patterning?), how are they inserted into the existing structural scaffold (between the 4-stranded beta sheet and terminal helix?), and where did they come from (intronic extension or mobile domain?). To study this region by bioinformatics or experiment, it first must be isolated by alignment. Then its homology relationships to non-sulfatases (if any, fusion or operon-like clues to function), structural properties, and relationship to exon structure can be determined (6 extra exons occur relative to 14-exon GNS). The 3 known structures unfortunately do not involve sulfatases closely related to GNS. Indeed, as a Blast query, the isolated post-catalytic region of GNS sulfatases fails to recognize any other class of sulfatase.
Now the STS subfamily also acquired an extensive extra domain, inserted earlier as a loop within the catalytic core. The most peculiar composition of those 45 residues -- an unbroken run of hydrophobic and apolar residues -- is consistent with experimental data placing it as a luminal side membrane attachment site determinant (Stein et al, 1989). Below the insertion is shown to have occurred within the extruded early 2-strand beta sheet just after core beta strand B6 (ARSA or ARSB numbering).
The second central biosynthetic issue in sulfatases is origin of the modified cysteine. Bizarrely, no enzyme or cofactor involved in the oxidative modification is yet known despite sulfatases in the complete E.coli genome. Complementation assays and operon associations so far have not definitively fingered any bacterial gene; lab strains may be inadvertently defective in non-essential genes. The key enzyme apparently recognizes a linear stretch about the altered cysteine concomitant with translation. It is likely highly conserved from bacteria to human and, when partly disabled by mutation, causal for the rare multiple sulfatase disorder MSD in humans. If it could be mapped to position or a homologue found in almost any species, the issue could be quickly resolved by genomics.
Using the 500-odd bacterial genomes nearing completion and the very restricted distribution of sulfatases within prokaryotes, it should be possible using the NCBI COG orthologous cluster resource to identify candidate genes because few genes will have both unassigned function and strictly fit the observed phylogenetic distribution. It is essential here to have a very reliable Blast probe and complete genomes to show that a given species (eg, yeast) completely lacks sulfatases and one supposes, the modifying system unique to this system.
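The filtering idea in the preceding paragraph amounts to phylogenetic profiling: keep only gene clusters of unassigned function whose presence/absence across complete genomes matches that of the sulfatases. A minimal sketch follows. All genome names, cluster names, and annotations below are hypothetical placeholders, not real COG entries, and a real run would need presence/absence calls from a reliable Blast probe against complete genomes.

```python
# Hypothetical data: genomes in which a sulfatase was detected.
sulfatase_genomes = {"genomeA", "genomeC", "genomeD"}

# Hypothetical orthologous clusters: where each occurs, and whether
# a function has already been assigned to it.
cog_profiles = {
    "COG_x1": {"genomes": {"genomeA", "genomeC", "genomeD"}, "annotated": False},
    "COG_x2": {"genomes": {"genomeA", "genomeB", "genomeC", "genomeD"}, "annotated": False},
    "COG_x3": {"genomes": {"genomeA", "genomeC", "genomeD"}, "annotated": True},
    "COG_x4": {"genomes": {"genomeA"}, "annotated": False},
}

def candidate_modifiers(profiles, target, max_mismatch=0):
    """Unannotated clusters whose genome distribution matches the target set.

    max_mismatch is the number of genomes allowed to disagree, counted as
    the size of the symmetric difference between the two genome sets.
    """
    hits = []
    for name, info in sorted(profiles.items()):
        if info["annotated"]:
            continue
        mismatches = len(info["genomes"] ^ target)
        if mismatches <= max_mismatch:
            hits.append(name)
    return hits

strict = candidate_modifiers(cog_profiles, sulfatase_genomes)
relaxed = candidate_modifiers(cog_profiles, sulfatase_genomes, max_mismatch=1)
print(strict, relaxed)
```

Allowing a mismatch or two is probably prudent in practice, since an unfinished genome or a missed Blast hit would otherwise eliminate a true candidate.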
A candidate Fe-S oxidoreductase found by this method was further compatible with E.coli operon associations but no convincing counterpart could be found in eukaryotes. There may be multiple copies in animals because of the need to recognize diverged substrates; these could be regulatory choke points for specific sulfatase subfamilies. MSD families have been too rare to map but a strong candidate gene could readily be confirmed by screening.
gene_hsa | strand | chrom_coords | span | aa | exons | glyco | S-S |
SulfY | - | chr4:115181037-115257516 | 76,480 | 573 | 2 | 4+2 | 5+0 |
ARSB | - | chr5:78111979-78316827 | 204,849 | 534 | 10 | 3+2 | 3+2 |
SulfX | + | chr5:94916739-94964983 | 48,245 | 526 | 8 | 4+2 | 2+0 |
SulfZ | - | chr5:149656973-149662129 | 5,157 | 569 | 2 | 2+2 | 5+0 |
Sulf1 | + | chr8:70638765-70713649 | 74,885 | 872 | 19 | 7+3 | 1+18 |
GNS | - | chr12:63396791-63439323 | 42,533 | 553 | 14 | 10+4 | 2+10 |
GALNS | - | chr16:87408351-87450786 | 42,436 | 523 | 14 | 1+1 | 2+5 |
SGSH | - | chr17:75798849-75808707 | 9,859 | 503 | 8 | 4+1 | 2+0 |
KIAA1001 | + | chr17:63815230-63928196 | 112,967 | 526 | 11 | 3+1 | 5+2 |
SULF2 | - | chr20:45721549-45819514 | 97,966 | 871 | 20 | 7+3 | 1+18 |
ARSA | - | chr22:49353720-49356345 | 2,626 | 508 | 8 | 3+0 | 5+6 |
ARSD | - | chrX:2818676-2837169 | 18,494 | 594 | 10 | 3+0 | 4+8 |
ARSE | - | chrX:2846237-2869836 | 23,600 | 590 | 10 | 3+0 | 4+8 |
ARSG | + | chrX:2901783-2944784 | 43,002 | 699 | 11 | 3+0 | 4+8 |
ARSF | + | chrX:2983426-3023955 | 40,530 | 591 | 10 | 3+0 | 4+8 |
STS | + | chrX:7030971-7128035 | 97,065 | 584 | 10 | 2+1 | 4+8 |
IDS | - | chrX:148270039-148292425 | 22,387 | 551 | 9 | 4+2 | 3+0 |
17 | ave: | . | 56,652 | 598 | 10 | . | . |
The table above summarizes basic genomic and proteomic properties of the 17 known human sulfatases. Note 3 of these have only been described in machine-annotated GenBank entries and 3 others are altogether new. To view the genomic context of the coding part of each gene within the assembled human genome, simply paste the genome location column entry into the UCSC August 2001 genome browser.
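The span column and the averages in the last row can be recomputed directly from the coordinate column. The sketch below parses the rows as given in the table above and assumes span = end - start + 1 (inclusive coordinates), which reproduces the listed spans; the function and variable names are illustrative only.

```python
# Rows copied from the table above (header and summary row omitted).
TABLE = """\
SulfY | - | chr4:115181037-115257516 | 76,480 | 573 | 2 | 4+2 | 5+0 |
ARSB | - | chr5:78111979-78316827 | 204,849 | 534 | 10 | 3+2 | 3+2 |
SulfX | + | chr5:94916739-94964983 | 48,245 | 526 | 8 | 4+2 | 2+0 |
SulfZ | - | chr5:149656973-149662129 | 5,157 | 569 | 2 | 2+2 | 5+0 |
Sulf1 | + | chr8:70638765-70713649 | 74,885 | 872 | 19 | 7+3 | 1+18 |
GNS | - | chr12:63396791-63439323 | 42,533 | 553 | 14 | 10+4 | 2+10 |
GALNS | - | chr16:87408351-87450786 | 42,436 | 523 | 14 | 1+1 | 2+5 |
SGSH | - | chr17:75798849-75808707 | 9,859 | 503 | 8 | 4+1 | 2+0 |
KIAA1001 | + | chr17:63815230-63928196 | 112,967 | 526 | 11 | 3+1 | 5+2 |
SULF2 | - | chr20:45721549-45819514 | 97,966 | 871 | 20 | 7+3 | 1+18 |
ARSA | - | chr22:49353720-49356345 | 2,626 | 508 | 8 | 3+0 | 5+6 |
ARSD | - | chrX:2818676-2837169 | 18,494 | 594 | 10 | 3+0 | 4+8 |
ARSE | - | chrX:2846237-2869836 | 23,600 | 590 | 10 | 3+0 | 4+8 |
ARSG | + | chrX:2901783-2944784 | 43,002 | 699 | 11 | 3+0 | 4+8 |
ARSF | + | chrX:2983426-3023955 | 40,530 | 591 | 10 | 3+0 | 4+8 |
STS | + | chrX:7030971-7128035 | 97,065 | 584 | 10 | 2+1 | 4+8 |
IDS | - | chrX:148270039-148292425 | 22,387 | 551 | 9 | 4+2 | 3+0 |
"""

def parse_row(line):
    """Split one pipe-delimited row and recompute the span from the coordinates."""
    cells = [c.strip() for c in line.split("|")]
    chrom, interval = cells[2].split(":")
    start, end = (int(x) for x in interval.split("-"))
    return {
        "gene": cells[0],
        "chrom": chrom,
        "span": end - start + 1,                      # inclusive coordinates
        "listed_span": int(cells[3].replace(",", "")),
        "aa": int(cells[4]),
        "exons": int(cells[5]),
    }

genes = [parse_row(line) for line in TABLE.splitlines() if line.strip()]

mean_span = round(sum(g["span"] for g in genes) / len(genes))
mean_aa = round(sum(g["aa"] for g in genes) / len(genes))
mean_exons = round(sum(g["exons"] for g in genes) / len(genes))
smallest = min(genes, key=lambda g: g["span"])["gene"]
largest = max(genes, key=lambda g: g["span"])["gene"]

print(len(genes), mean_span, mean_aa, mean_exons, smallest, largest)
```

The same coordinate strings (e.g. chrX:2901783-2944784) are what gets pasted into the genome browser position box.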
Note 3 of the 4 NDST sulfotransferases are immediately adjacent to sulfatases perhaps catalyzing the opposing reaction, though there is no evidence for coordinated regulation vs coincidence propagated by block duplication. Quite a few important small molecules appear to be down-regulated by sulfotransferase sulfation, the opposite reaction from sulfate removal by sulfatases. These enzymes do not contain a formylglycine; there are at least 31 of them in the human genome. The human genome contains enough sequencing gaps to potentially encode 1-2 additional sulfatase genes even in the May 2004 assembly (which has 424 gaps covering 228,799,690 bp).
31 Sulfotransferases in the human genome
HS2ST1 | chr1:87092382-87287685 | NM_012262 | heparan sulfate 2-O-sulfotransferase 1 |
CHST10 | chr2:100466839-100492609 | NM_004854 | HNK-1 sulfotransferase |
SULT1C1 | chr2:108363612-108384888 | NM_001056 | sulfotransferase family cytosolic 1C |
HS6ST1 | chr2:128741261-128792482 | NM_004807 | heparan sulfate 6-O-sulfotransferase |
GAL3ST2 | chr2:242436229-242470393 | NM_022134 | galactose-3-O-sulfotransferase 2 |
CHST13 | chr3:127725873-127744831 | NM_152889 | carbohydrate chondroitin 4 sulfotransferase |
NDST4 | chr4:116106534-116392636 | NM_022569 | N-deacetylase/N-sulfotransferase heparan |
NDST3 | chr4:119313102-119534619 | NM_004784 | N-deacetylase/N-sulfotransferase heparan |
SULT1B1 | chr4:70773445-70807190 | NM_014465 | sulfotransferase family cytosolic 1B |
NDST1 | chr5:149880622-149917966 | NM_001543 | N-deacetylase/N-sulfotransferase heparan |
UST | chr6:149110156-149439818 | NM_005715 | uronyl-2-sulfotransferase |
CHST12 | chr7:2216463-2247455 | NM_018641 | carbohydrate chondroitin 4 sulfotransferase |
TPST1 | chr7:65114463-65269579 | NM_003596 | tyrosylprotein sulfotransferase 1 |
GAL3ST4 | chr7:99401518-99410867 | NM_024637 | galactose-3-O-sulfotransferase 4 |
CHST3 | chr10:73394125-73443318 | NM_004273 | carbohydrate chondroitin 6 sulfotransferase 3 |
NDST2 | chr10:75231674-75241348 | NM_003635 | N-deacetylase/N-sulfotransferase heparan |
GAL3ST3 | chr11:65566028-65573227 | NM_033036 | galactose-3-O-sulfotransferase 3 |
CHST11 | chr12:103353244-103654350 | NM_018413 | carbohydrate chondroitin 4 sulfotransferase |
HS6ST3 | chr13:95541093-96289812 | NM_153456 | heparan sulfate 6-O-sulfotransferase 3 |
D4ST1 | chr15:38550504-38552645 | NM_130468 | dermatan 4 sulfotransferase 1 |
SULT1A2 | chr16:28510766-28515892 | NM_001054 | sulfotransferase family cytosolic 1A |
SULT1A1 | chr16:28524418-28528150 | NM_177530 | sulfotransferase family cytosolic 1A |
SULT1A3 | chr16:29376468-29383784 | NM_003166 | sulfotransferase family cytosolic 1A |
CHST9 | chr18:22749594-23019177 | NM_031422 | GalNAc-4-sulfotransferase 2 |
SULT2A1 | chr19:53065681-53081405 | NM_003167 | sulfotransferase family cytosolic 2A |
SULT2B1 | chr19:53747240-53794495 | NM_177973 | sulfotransferase family cytosolic 2B 1 |
TPST2 | chr22:25246292-25310623 | NM_003595 | tyrosylprotein sulfotransferase 2 |
GAL3ST1 | chr22:29275177-29285430 | NM_004861 | galactose-3-O-sulfotransferase 1 |
SULT4A1 | chr22:42545289-42583257 | NM_014351 | sulfotransferase family 4A 1 a |
HS6ST2 | chrX:131485578-131816891 | NM_147174 | heparan sulfate 6-O-sulfotransferase 2 |
The glycosylation and disulfide columns break potential occurrences of these elements into catalytic and post-catalytic domains; sporadic occurrences (not conserved within other mammals) are omitted. Fragmentary sequences from ESTs are helpful in this determination because of the additional species they bring in. These structural elements have been surprisingly fluid and are barely conserved outside of narrow sub-families. Direct experimental support for their existence is meagre outside of ARSA and ARSB. However, by structure threading of the conserved catalytic domain fold, it is possible to determine whether given cysteine pairs are in physical proximity and whether glycosylation sites are exposed on the surface. Post-catalytic elements are more problematic for those sulfatases not readily alignable to the three determined structures.
Sulfatases fall into two length categories: standard and long. The length column gives the approximate number of amino acids in the mature protein. Note that 550 amino acids is excessive for a simple hydrolytic reaction -- the average enzyme in E.coli is 300 amino acids. Indeed, all the key residues (active site and metal binding) fall within the amino terminal 60%; alternatively spliced sulfatase genes such as IDS and ARSD encode proteins more or less truncated to the core catalytic domain. None of the X-ray structures really exhibits a classical substrate pocket beyond that for the sulfate moiety; the carboxy terminal beta sheet and terminal helix are not positioned to contribute substrate specificity. The remainder of the protein has been shown, however, to provide the surface for homo-oligomer formation.
Note the bizarre range in gene sizes, from a tiny 2630 bp in ARSA to 367080 bp in ARSB; similarly the number of exons ranges from 2 to 14 in short sulfatases and to 20 in long. The four genes in the STS cluster have similar exon structures, being the product of relatively recent events, but otherwise conservation of exon boundaries is less than might be expected even in subfamilies. Of course, even a small sulfatase subfamily can span an immense time scale.
Cellular location has been determined reliably for perhaps half the sulfatases -- the column below is mainly taken from the SwissProt and OMIM entries (which are curated digests of published experiments). The cell surface location of KIAA1077_hsa is taken from recent work on the quail orthologue and is likely applicable to the closely related KIAA1247_hsa. Both of these proteins bristle with potential disulfides and N-glycosylation sites compatible with an exposed exterior position. Despite the frequent association with membrane compartments, sulfatases do not contain packets of helical transmembrane domains; however the X chromosomal STS group does contain a substantial hydrophobic insert in a catalytic domain loop. All mammalian sulfatases contain a targeting leader peptide not part of the mature protein; however these are not satisfactorily interpretable at this time with tools such as ProSort.
Genomically, sulfatases are quite dispersed, occurring on 10 human chromosomes. All sulfatases arose from duplications of a common ancestral enzyme (ultimately an alkaline phosphatase or dual purpose). The X chromosome patch likely arose from local tandem duplications with order of the 3 events likely STS, [(ARSD, ARSE), ARSF]. While these events pre-dated the divergence with rodents, it remains unclear how many of these genes were retained in the latter lineage beyond STS. The grouping (ARSA, KIAA1001), GALNS is related to this family but represents events too remote to illuminate with genomics. IDS is also on chrX but like SulfX and SGSH diverged so much earlier that its affinities are to bacterial sulfatases.
A parsimonious scenario for the ARSB, (SulfY, SulfZ) group is tandem duplication on chr5 yielding ARSB and SulfZ, followed by much more recent translocations of the SulfZ region to chr4 (SulfY) and then to chr10 (inferred SulfP in sequence gap). These translocations also resulted in amplification of the adjacent NDST sulfotransferase family (which later had a tandem duplication resulting in NDST3 and NDST4 on chr4).
A fairly recent vertebrate translocation of a small block of genes duplicated an ancestral long sulfatase to KIAA1077 and KIAA1247 on chr8 and chr20. A much earlier duplication of GNS, now on chr12, possibly a translocation too washed out to be detectable today, created the ancestral gene to the long sulfatases, which acquired the 330 extra residues prior to the subsequent translocative duplication. GNS, (KIAA1077, KIAA1247) clearly describes the tree topology of the two events.
Thus, neglecting pseudogenes and events not retained in evolution, 16 gene duplications were necessary to create the sulfatases seen today in humans starting with an ancestral alkaline phosphatase. Other genes, like hemoglobins, had translocations of tandem duplications yielding more genes with fewer events. The overall order of genes is given tentatively by ClustalW alignment of catalytic cores as {[SulfX, (SGSH, IDS)], {{[ARSB, (SulfY, SulfZ)]*, {{STS, [(ARSD, ARSE), ARSF]*}, [(ARSA, KIAA1001), GALNS]}}, [GNS, (KIAA1077, KIAA1247)]*}}.
Specific diseases have been associated with 8 of the genes, mostly classical lysosomal catabolic disorders. Quail and sea urchin data however suggest developmental regulatory problems, possibly lethal in utero, could arise from other sulfatase genes. Now that the apparent full complement of human sulfatases has been found and localized beyond cytoband to flanking genes and even to actual positional coordinates, it should be possible to rapidly screen candidate diseases for mutations in sulfatases unassigned to diseases. Indeed, OMIM may already contain reference to disorders for which the standard lysosomal genes were excluded and possibly roughly positionally mapped to another chromosome.
Last updated 9 June 02
As sequencing of the human genome nears completion, a new sulfatase (called ARSG here) has become detectable on the X chromosome with the April 2002 assembly. It is part of a 205 Kbp inverted doublet of tandem pairs (ARSD, ARSE, ARSG, ARSF) strand oriented - - + + relative to the p arm telomere. The new sulfatase has the 10 exons found in this subgroup with matching intron locations and phases, and the conserved active site helix CTPSRAAFLTG and metal sites; ESTs BM069810 and BM069570 from human islet establish its transcriptional activity. The amino terminus, as seen elsewhere in this family, is somewhat uncertain, unalignable, and may contain a novel upstream coding exon predicted in GenomeScan entry XM_066808. A 393 bp 3' UTR region with canonical AATAAA polyA site can also be recovered from these ESTs.
The 4 sulfatases align among themselves with nearly equal 63% amino acid identity and match most closely steroid sulfatase (STS, 52% identity) among other human sulfatases. STS is also on the X chromosome p arm but approximately 15.5 Mbp downstream at Xp22.13 separated by many intervening unrelated genes. The cluster is next most closely related to GALNS at approximately 40% identity. An even more distantly related 6th sulfatase on this chromosome, IDS, is found on Xq28.
Draft genomes for mouse and rat do not yet allow reliable comparisons of the cluster region, but apparently different expansion gave rise to the ARSDEGF cluster in the human lineage (or contraction occurred in rodent lineages). Rat core STS (NM_012661) bears only 64% identity to its best match among human sulfatases, the apparently orthologous human STS gene (mouse to human is 61%). The match of rat to mouse is also low at 76%. Possibly multiple copies in human allowed rapid divergence as functions specialized.
The situation is complicated by rapid divergence of pseudoautosomal regions (2.6 Mbp in humans), escape from X inactivation, anomalous recombination, and the notion that most of human Xp represents a relic of an autosomal region added to both X and Y at about 120 MYr. One evolutionary scenario envisions an ancestral STS gene duplication, separation of location by inversion, tandem duplication, followed shortly by a second round of inverted tandem duplication creating the final cluster of 5 sulfatases. Current flanking genes for STS provide no help.
The nearly finished human Y chromosome sheds some light on the X chromosome sulfatase cluster but raises new questions. Two homologous regions are found by Blat alignment on chrY:13683780-13809882 (best related to ARSF, ARSE, and ARSD) and chrY:16913857-16913934 (best related to STS). ARSDp and ARSEp on chrY were recognized in the mid-90's as truncated pseudogenes. Their GenBank entries NG_000881 and NG_000880 are garbled. There is no counterpart to the IDS gene on chrY though chrX contains a tandem pseudogene.
The ARSFp pseudogene on chrY has not been previously described. It is a 93% match at the DNA level, but only exons 2,3, and 10 are represented. These are insufficient to code for a functional sulfatase and additionally contain stop codons, making ARSFp an internally truncated pseudogene. ARSEp contains exons 8,9, and 10; ARSDp is represented on chrY by exons 2,3,4 and 7,8,9, and 10. All in all, it is not so surprising to see ARSGp missing.
On the April 2002 assembly, 13 sulfatase exon homologs can be seen in this chrY pseudogene region in Blat matches using all known human sulfatase proteins as probes. DNA probes clearly show 3 portions of ARSF at 83% identity on the - strand at chrY:13683780-13689494, preceding ARSEp (chrY:13771299-13778985, 93%) and ARSDp (chrY:13793713-13809882, 88%) on the + strand (indicating overall reversal or more likely misassembly). ARSG is not detectable by Blat with DNA probes on chrY nor by tBlastn of the appropriate intervening 81 Kbp of DNA, chrY:13689494-13771299, of which all but 21 Kbp is repeatmasked (75%). As ARSG is not a newly evolved feature on chrX, most likely it has been lost from chrY or obliterated by retrotransposon insertion. The intergenic distance between ARSE and ARSF is 108 Kbp, of which 42 Kbp is not repeatmasked -- ARSG itself spans 43 Kbp of which 22 Kbp are repeats.
Properties of the ARSG protein are those expected for its sulfatase subclass. The metal binding sites, by homology, are DD and xx; the active site has the usual residues; and the insertional hydrophobic residues are present from positions 555-666. Note that endoplasmic reticulum human STS enzyme has been crystallized and its structure is expected shortly. This will enable accurate threading of ARSDEGF, including the novel region.
However the in vivo substrate of ARSG will remain unclear as only artificial substrates are known for ARSD, ARSE, ARSF. iodothyronine sulfates
For ARSA_hs the PDB code is 1AUK for wild type, 1E2S for substituted active site C69S, and 1E33 for the mutation P426L. For ARSB_hs, 1FSU is available. A KIAA1001-related dimeric sulfatase from Pseudomonas aeruginosa (PDB 1HDH, 1 of 3 in that organism) has just been released, establishing that the associated cation is calcium. Alkaline phosphatase, which exhibits a near-identical fold (beta strands 8 and 9 are swapped), has 449 aa and PDB accession code 1ALK. Free full text is available for the ARSB structure determination paper and others (1, 2, 3).
The most interesting aspect of the tree is that it proposes deeply branching positions for the IDS, SGSH, and newly discovered SulfX families. These have closer affinities to bacterial sulfatases than to any of the other mammalian families. The other two new human sulfatases, SulfY and SulfZ, group with ARSB as expected from their Blast scores and unique predicted terminal disulfide knot. The paired long sulfatases in mammals and birds evidently represent a gene doubling event that took place after the lineage split off from Drosophila and nematode. As expected, GNS sulfatases appear as the nearest relative among the short sulfatases.
It does not work well to use full length sequence alignments (in say, Blast or ClustalW) except between very closely related sequences since choices of gap parameters are ad hoc and resulting alignments and statistics computed from them meaningless. However the catalytic domains are of comparable length and align adequately for purposes of classification. Shorter probes, beginning 9 residues before the early DD metal coordinating site and continuing through the conserved TG 4 residues after the R at the catalytic site, also work well. However, bacterial sulfatases are very diverged and often give a barely significant percent identity.
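The shorter probe described above can be cut out mechanically once the two landmarks are located. The sketch below makes two assumptions not stated in the text: the catalytic site is recognized by the pattern C-x-P-x-R followed by 4 residues and TG (as in the CTPSRAAFLTG helix quoted for ARSG), and the "early DD" is taken to be the DD nearest upstream of that site. Function and variable names are illustrative.

```python
import re

# Assumed active-site pattern: C-x-P-x-R, then 4 residues, then the conserved TG.
ACTIVE_SITE = re.compile(r"C.P.R.{4}TG")

def short_probe(protein, lead=9):
    """Return the region from `lead` residues before the DD metal site
    through the TG after the catalytic R, or None if a landmark is missing."""
    protein = protein.upper()
    site = ACTIVE_SITE.search(protein)
    if site is None:
        return None
    # Assumption: use the DD closest to (and upstream of) the active site.
    dd = protein.rfind("DD", 0, site.start())
    if dd == -1:
        return None
    start = max(0, dd - lead)   # clip if fewer than `lead` residues are available
    return protein[start:site.end()]

# First 55 residues of the quail probe quoted on this page; here only
# 8 residues precede the DD, so the probe simply starts at the first residue.
quail_start = "PNIILVLTDDQDVELGSLQVMNKTRRIMENGGASFINAFVTTPMCCPSRSSMLTG"
probe = short_probe(quail_start)
print(probe)
```

Bacterial sulfatases may well deviate from this exact pattern, so a looser regular expression, or a Blast-located anchor, could be needed for the more diverged sequences.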
It is also feasible to reliably classify sulfatases from other species relative to human even though these may only be available as partial sequences (from tBlastn recovery of Ests) or from partially finished genomes (eg, mouse) via diagnostic regions within better-conserved parts of the protein, for example about the active site where all sulfatases are alignable without gaps. Many indels amount to altering loop lengths that end up no longer truly homologous in any detailed structural sense.
As a practical matter, the query sequence is aligned, best via Blastp, against a small database of reference sequences, the best match providing the working classification. Typically, all top entries will be clustered within a single class, eg ARSB-type sulfatases, with an abrupt cutoff in Blast expectation value in succeeding secondary matches.
To characterize a new sulfatase, translate to protein and provide a fasta header line as necessary. Filtering at the UW Blast site should be set to off. The target database is best taken as the full set of known sulfatase catalytic domains. The post-catalytic domains can provide a higher discriminatory resolution within a predetermined narrow subfamily but so far this has not proven useful, as it simply follows the conventional taxonomic ordering of species.
As examples, quail sulfatase Qsulf1, chicken EST, and Drosophila Sulf1 sulfatase are reported by the classifier to be orthologous to mammalian KIAA1077/KIAA1247 sulfatases. The indel pattern of quail is strongly diagnostic of the KIAA1077 class as would be a conventional NCBI best Blast match.
quail hit | score | E-value | drosophila hit | score | E-value | chicken hit | score | E-value |
KIAA1077 | 1114 | 2.3e-117 | KIAA1077 | 847 | 4.6e-89 | KIAA1247 | 502 | 1.7e-52 |
KIAA1247 | 1036 | 4.3e-109 | KIAA1247 | 832 | 1.8e-87 | KIAA1077 | 449 | 6.8e-47 |
GNS | 563 | 5.7e-59 | GNS | 506 | 6.2e-53 | GNS | 269 | 8.1e-28 |
ARSA | 207 | 3.0e-21 | ARSF | 210 | 1.1e-22 | ARSF | 124 | 4.7e-12 |
ARSF | 171 | 2.0e-17 | ARSA | 200 | 1.7e-20 | STS | 117 | 2.9e-11 |
...
>quail adjusted probe
PNIILVLTDDQDVELGSLQVMNKTRRIMENGGASFINAFVTTPMCCPSRSSMLTGKYVHNHNIYTNNENCSSPSWQATHEPRTFAVYLNNTGYRTAFFGKYLNEYNGSYIPPGWREWVG...DDSMERLYQMLAEMGELENTYIIYTADHGYHIGQFGLVKGKSMPYDFDIRVPFFIRGPSVEPGSVVPQIVLNIDLAPTILDIAGLDTPPDMDGKSVLKLL
>drosph adjusted probe
PNIILILTDDQDVELGSLNFMPRTLRLLRDGGAEFRHAYTTTPMCCPARSSLLTGMYVHNHMVFTNNDNCSSPQWQATHETRSYATYLSNAGYRTGYFGKYLNKYNGSYIPPGWREWGGDVAVERVYNELKELGELDNTYIVYTSDHGYHLGQFGLIKGKSFPFEFDVRVPFLIRGPGIQASKVVNEIVLNVDLAPTFLDMGGVPTPQHMDGRSILPLL
>chicken BI393009
DDSMEMIYNTLVETGELDNTYIIYTADHGYHIGQFGLVKGKSMPYEFDIRVPFYVRGPNVEAGSLNPHIVLNIDLAPTILDIAGLDIPSDMDGKSILKLL
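The "abrupt cutoff in Blast expectation value" used for the working classification can be made explicit. The sketch below uses the quail and chicken hit lists reported above and splits each ranked list at the largest jump in log10(E). That splitting rule is only one plausible heuristic, assumed here for illustration; it is not necessarily how the classifier on this site decides.

```python
import math

# Hits (reference class, E-value) as reported above for two of the queries.
quail_hits = [("KIAA1077", 2.3e-117), ("KIAA1247", 4.3e-109),
              ("GNS", 5.7e-59), ("ARSA", 3.0e-21), ("ARSF", 2.0e-17)]
chicken_hits = [("KIAA1247", 1.7e-52), ("KIAA1077", 6.8e-47),
                ("GNS", 8.1e-28), ("ARSF", 4.7e-12), ("STS", 2.9e-11)]

def top_cluster(hits):
    """Return the hits ranked above the largest jump in log10(E-value).

    Needs at least two hits, and every E-value must be greater than zero
    (an E-value reported as 0 would have to be floored first).
    """
    ranked = sorted(hits, key=lambda h: h[1])          # best (smallest E) first
    logs = [math.log10(e) for _, e in ranked]
    jumps = [logs[i + 1] - logs[i] for i in range(len(logs) - 1)]
    cut = jumps.index(max(jumps)) + 1                  # hits before the biggest jump
    return [name for name, _ in ranked[:cut]]

quail_class = top_cluster(quail_hits)
chicken_class = top_cluster(chicken_hits)
print(quail_class, chicken_class)
```

For both queries the largest jump falls between the second hit and GNS, consistent with the assignment of quail and chicken to the KIAA1077/KIAA1247 class given in the text.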
Neighboring genes of the block can no longer be recognized about ARSB or GNS. Now SulfY and SulfZ have an unusual gene structure for a sulfatase consisting of a single intron at the same position (the conserved TG), whereas ARSB has a more traditional 9 exons in non-corresponding positions. It seems possible then that the initial duplication of ARSB was as a processed retropositioned mRNA that later acquired an intron but prior to translocational duplications.
Indeed, a translocative block involving 4 genes with order Ank2, Camk2, Sulf, and NDST is quickly seen. Interestingly, the NDST genes adjacent to the sulfatases are known sulfotransferases involved in heparan biosynthesis (N-deacetylase/N-sulfotransferases acting on N-acetylglucosamines 1, 2, 3,4) and have about the same 65-70% protein sequence identity.
It is thus possible that SulfY and SulfZ sulfatases are co-regulated with their cognate sulfotransferases to together regulate (on or off with the sulfate) the first step of heparan synthesis in tissue specific fashion. This additionally suggests a substrate for these ARSB paralogs. Note however that the ARSB enzyme acts on N-acetylgalactosamine-4-sulfate; also the sulfatase/sulfotransferase genes are not assembled head to head with a common divergent promoter. Thus the adjacency could be coincidental, the impression of association reinforced by the translocation blocks.
The third newly discovered sulfatase, SulfX, definitely appears to share a common divergent promoter with another gene, KIAA0372 (LocusLink: 9652) on 5q15. Here, in a gapless region of finished genome, only 144 bp separate the 5' UTR of the two genes, which are both supported by numerous mRNAs. It is possible an upstream exon exists so that one gene sits within an intron of another, yet gene sizes of 83 kbp and 48 kbp resp. militate against that. Properties of the neighboring gene (multiple copies of the tetratricopeptide or TPR domain) are available but shed no light on function or relatedness to sulfatases. There are weak Blast hits to O-linked GlcNAc transferases and a strong match to the CG8777 gene product in Drosophila.
Examining 33 mapped sulfotransferases, none was associated with a sulfatase other than the NDST group, so this is not a general phenomenon. The sulfatase SGSH is unmapped other than to chr 17q25 but it could not be adjacent to HS3ST3A1 and HS3ST3B1 on 17p11.2. The sulfotransferases studied were CHST1, CHST2, CHST3, CHST4, CHST5, CHST6, CHST7, CHST8, HNK1ST, HS2ST1, HS3ST1, HS3ST2, HS3ST3A1, HS3ST3A2, HS3ST3B1, HS3ST3B2, HS3ST4, HS3ST5, HS6ST, NDST1, NDST2, NDST3, NDST4, SULT1A1, SULT1A2, SULT1A3, SULT1C1, SULT1C2, SULT2A1, SULT2B1, SULT4A1, TPST1, and TPST2.
Oddly however, the gene causing mucopolysaccharidosis type I (Hurler), alpha-L-iduronidase (IDUA) at 4p16.3 contains within an intron the gene for a sulfate transporter, SLC26A1, at chr4:817863-870341 in opposite orientation. Mucopolysaccharidosis type II is caused by iduronate-2-sulfatase (IDS) deficiency which maps to chr Xq28. A kindred is known in which both IDUA and IDS are simultaneously affected (Am J Hum Genet 1996 Jan;58:75-85), apparently by conventional CDS mutations. SLC26A2, a sulfate anion transporter, maps to 5q33.1 as does SulfZ but at a location 500 kbp away, chr5:164569261-164574033.
On the chromosome 5 block, the TCOF1 gene for Treacher Collins syndrome (OMIM 154500) intervenes between NDST1 and SulfZ_hsa. It is somewhat reminiscent of a sulfatase disorder, affecting craniofacial development, conductive hearing loss and palate, with antimongoloid slant of the eyes, coloboma of the lid, micrognathia, microtia and other deformity of the ears, hypoplastic zygomatic arches, and macrostomia as features. There is no apparent paralog family in the chr4 or chr10 counterparts; it is rapidly evolving in mammals. According to the August 01 human genome assembly, TCOF1 is divergently transcribed from a 2.5 kbp (or less, depending on 5'UTR determination) promoter region shared with NDST1; it has no sequence homology to sulfatases or NDST1.
A third related block on chr10 has strong Ank2 and NDST paralogs but no sulfatase or Camk2 at the UCSC genome assembly; however, these may lie in two largish gaps in the critical region. (SulfZ is a small gene; its CDS spans 5161 bp.) A paper in the July 01 issue of Am J Hum Genet has this region completely misassembled, using an obsolete human genome assembly, but did report a nearby Camk2 gene on chr10. Indeed, this is supported by Blast at NCBI using the chr5 Camk2A query: NT_024037 (completed 16-Oct-2001) contains a 73% identity match called Camk2G. This maps to a slightly out-of-position region in a confused portion of the August 01 genome assembly, chr10:75521627-75532674. The gene order appears to be ANK3 .. CAMK2G .. [SulfP?] .. NDST2 .. KIAA0913 .. SEC24C .. KIAA0187.
Thus a 17th human sulfatase gene with 2 exons, SulfP, can be predicted by virtual syntenic gap bridging; at this time there is no support at NCBI human or mouse finished, htgs, or trace database but inversions are unlikely to rearrange internal elements. What can be said about the sequence of this putative new sulfatase? Going by the average percent divergences in the adjacent genes present in all three blocks, it will be much closer to the chr5 sulfatase sequence than to one on chr 4, say 80% identical, but more likely to have ancestral node values at more rapidly changing amino acid positions than idiosyncratic changes seen in SulfZ, unless it has become a pseudogene.
The orthologous murine NDSTs are known and SulfY_mmu and SulfZ_mmu orthologs are recoverable from ESTs and htgs. This means if an imperfect SulfZ match in ESTs or htgs shows up on an unmapped datum such as mus or human EST or htgs with the right homological percentages and tree topology, it can actually be mapped with some confidence to the putative chr10 gap gene since sulfatase orthologs run in the mid-90's between human and mouse. So EST mapping may occasionally be possible by virtual synteny in conjunction with ClustalW. No such candidate EST is available; SulfZ itself has a single EST at GenBank, fetal lung AA358883, which in conjunction with conserved protein features in human and mouse, means it is probably not a pseudogene, only rarely transcribed in the tissues studied to date. SulfP may be similar.
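The identity test described above can be sketched in a few lines. This is only an illustration: the function names and the 90% cutoff are assumptions chosen to reflect the "mid-90's" figure quoted for human-mouse sulfatase orthologs, and the input is taken to be two rows of an existing alignment (for example from ClustalW) with gaps written as "-" or ".".

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent identity over aligned columns in which neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = [
        (x, y)
        for x, y in zip(row_a.upper(), row_b.upper())
        if x not in "-." and y not in "-."
    ]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def looks_like_ortholog(identity: float, cutoff: float = 90.0) -> bool:
    """Hypothetical screen: the cutoff is an assumption, not a calibrated value."""
    return identity >= cutoff


# Toy rows only; real use would take rows from a ClustalW alignment.
print(percent_identity("ACDE-F", "ACDQGF"))
```

A percentage alone is not sufficient for the mapping argument; the tree topology relative to SulfY and SulfZ would still have to be checked.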
The insertion is difficult to localize precisely. However, as the above ClustalW alignment of 110 residues shows, the STS family is of standard sequence through the metal-coordinating residues K and H in the conserved feature GKWHL; indeed, careful Blast comparison to not too distant sulfatases extends the alignment another 28 residues, to just past the conserved beta strand B5 (position 164 in full-length human STS). Alignability does not pick up again until 79 residues later, at the start of beta strand 8.
This would still permit the standard 10-strand beta sheet in the STS family. Note that the "missing" strands 6-7 are not part of the protein core but instead are a protrusion. Thus it appears quite possible that the corresponding region in the STS group lacks beta structure (unsuitable for the new membrane attachment role?) and amounts to a mega-loop replacement of the region following strand 5 and preceding strand 8 in conventional sulfatases. This illustrates a fallacious assumption in proteomics: here is a second domain in sulfatases whose structure cannot be inferred despite 3 known structures. Secondary structure prediction (using Predator) of 6 family members does however yield a fairly consistent picture of 2-3 transmembrane helical domains unlike any other sulfatase class (using TMHMM):
ARSD_hsa
TLTNDCDPGRPPEVDAALRAQLWGYTQFLALGILTLAAGQTCGFFSVSARAVTGMAGVGCLFFISWYSSFGFVRRWNCI
____________HHHHHHHHHHH_HHHHHHHHHHHHHHHHH________HHHHHHHHHHHH__________________
ARSE_hsa
SLMGDCARWELSEKRVNLEQKLNFLFQVLALVALTLVAGKLTHLIPVSWMPVIWSALSAVLLLASSYFVGALIVHADCF
____________HHHHHHHHHHHHHHHHHHHHHHHHHHHHH________HHHHHHHHHHHHHHHHHHH___EEE_____
ARSF_hsa
TLVDSCWPDPSRNTELAFESQLWLCVQLVAIAILTLTFGKLSGWVSVPWLLIFSMILFIFLLGYAWFSSHTSPLYWDCL
_____________HHHHHHHHHHHHHHHHHHHHHHHHHHH________HHHHHHHHHHHHHHHHEEE____________
STS_hsa
TNLRDCKPGEGSVFTTGFKRLVFLPLQIVGVTLLTLAALNCLGLLHVPLGVFFSLLFLAALILTLFLGFLHYFRPLNCF
____________EEE___EEEEE___HHHHHHHHHHHHHHHHH______HHHHHHHHHHHHHHHHHHHHH_________
STS_mmu
TNLRDCRPGAGTVFGPALRVFAAGPLAALGASLAAMAAARWAGLARVPGWALAGTAAAMLAVGGPRSASCLGFRPANCF
__________________EEEEE___HHHHHHHHHHHHHHHHH______HHHHHHHHHHHHH_________________
STS_rra
TNLRDCKPGGGTVFGSAQQVFVVLPMNILGAVLLAMALARWAGLARPPGWVFGVTVAAMAAVGGAYVAFLYHFRPANCF
_________________EEEEEE___HHHHHHHHHHHHHHHHHH_________HHHHHHHHH___HHHHHHH_______

STS_hsa:
outside 1 160
TMhelix 161 183
inside 184 189
TMhelix 190 212
outside 213 558
The insert does not correspond to a complete exon. Instead it is part of a large 141aa exon with the same boundaries in the 4 human sulfatase genes, with 32aa preceding the insert and 30aa following. This exon has no counterpart in its boundaries, even approximately, in any other group of sulfatases; that is, no simple fusion or insertion scheme can bring exon boundaries into consistency. Thus its origin cannot be resolved or even reliably dated (without additional sequences -- only an additional fragment is available from zebra fish). It may simply have resulted from pass-over of a splice donor that resulted in a longer exon that happened to have an initial hydrophobic character.
A new internal conserved pair of cysteines occurs at the extreme flanks of the insert. It is not known whether these form a disulfide, though this could provide a modicum of structural anchoring. Otherwise the region is extremely poorly conserved and has been evolving very rapidly compared to other domains, and seemingly randomly (within the context of membrane attachment retention), over at least the last 100 million years. It cannot be modelled due to the lack of relevant x-ray structures and the influence of the membrane milieu.
Using a Kyte-Doolittle hydrophobicity plot, the insert is clearly visible as a broad patch of predominant hydrophobic character. The insert occurs at positions 140-218 relative to the mature catalytic core (the modified cysteine is at position 50 in this numbering system):
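A hydropathy profile of this kind can be computed with a sliding-window average over the published Kyte-Doolittle scale. The sketch below is illustrative only: the function name and the default window of 19 residues (a conventional choice when looking for membrane-spanning segments) are assumptions, and the example input is the 79-residue STS_hsa insert segment shown in the secondary structure listing above.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19) -> list:
    """Mean Kyte-Doolittle score for every full window along the sequence."""
    seq = seq.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KD[aa] for aa in seq]  # KeyError on non-standard residues
    total = sum(scores[:window])
    profile = [total / window]
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]  # slide the window by one
        profile.append(total / window)
    return profile


STS_INSERT = (
    "TNLRDCKPGEGSVFTTGFKRLVFLPLQIVGVTLLTLAALNCLGLLHVP"
    "LGVFFSLLFLAALILTLFLGFLHYFRPLNCF"
)
profile = hydropathy_profile(STS_INSERT)
peak = max(range(len(profile)), key=profile.__getitem__)
print("most hydrophobic window starts at residue", peak + 1)
```

Plotting the returned list against window position gives the familiar curve; sustained positive stretches correspond to the broad hydrophobic patch described above.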
Best viewed by collecting sequences and coloring cysteines in a word processor by search and replace.
The actual disulfide linking pattern is known from the 3 x-ray structures (the Pseudomonas protein has none) and therefore can be inferred to a certain extent in others: a conserved cysteine pair, adjacent in the folded threaded structure, is likely to form a disulfide. Cysteines not even conserved across mammals are unlikely to be in disulfides; these are called sporadic cysteines below.
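The same bookkeeping can be done programmatically rather than in a word processor. The following is a minimal sketch, assuming the sequences have already been aligned and are supplied as equal-length rows with "-" for gaps; the function names are hypothetical. Columns where every row has a cysteine are reported as conserved, and cysteines elsewhere are reported per row as sporadic.

```python
def conserved_cysteine_columns(rows):
    """1-based alignment columns in which every row has a cysteine."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("rows must be non-empty and of equal length")
    return [
        i + 1
        for i, column in enumerate(zip(*rows))
        if all(c in ("C", "c") for c in column)
    ]


def sporadic_cysteines(rows):
    """Per row, 1-based columns holding a cysteine that is not conserved in all rows."""
    conserved = set(conserved_cysteine_columns(rows))
    return [
        [i + 1 for i, c in enumerate(row) if c in ("C", "c") and i + 1 not in conserved]
        for row in rows
    ]


# Toy alignment, not real sulfatase data.
toy = ["ACDC-", "GCECC", "ACD-C"]
print(conserved_cysteine_columns(toy))
print(sporadic_cysteines(toy))
```

Conservation of a column says nothing by itself about pairing; deciding which conserved cysteines are adjacent in the fold still requires threading onto a known structure.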
ARSA_hsa has a terminal disulfide knot linking 6 cysteines, a region at 87-103 aa past the catalytic cysteine with two nested pairs in beta strands 6-7, and a cross-domain pair 231-351 linking two loops. These are conserved -- and presumably also disulfides -- in the other 3 mammalian species known. Human has a further cysteine 31 aa prior to the catalytic site with no counterpart in other species. A single conserved loop cysteine 225 aa past the catalytic site has no available intra-molecular binding partner.
These disulfide pairs are conserved to some extent in KIAA1001 proteins, the nearest neighbors to ARSA. However, the first quartet has more intervening residues, as does the long-range pair, and the final knot has only two disulfide pairs. In the more weakly related GALNS sulfatases, nothing corresponds to the first quartet, the long-range disulfide is supported by alignment, and the terminal knot again could be two pairs. Since 2 in 5 residues are conserved overall between GALNS and ARSA, two cysteines would be conserved by chance 4 times in 25 (16%).
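The 16% figure is simply the per-residue conservation rate squared, which assumes the two positions are conserved independently of each other (an assumption of this sketch, and implicitly of the estimate itself). A short check with exact fractions:

```python
from fractions import Fraction


def chance_all_conserved(per_residue: Fraction, positions: int = 2) -> Fraction:
    """Probability that all listed positions are conserved by chance,
    treating positions as independent (an assumption)."""
    return Fraction(per_residue) ** positions


pair = chance_all_conserved(Fraction(2, 5))
print(pair, float(pair))  # 4/25 0.16
```

Under the same assumption, the chance that two separate pairs (four cysteines) are all conserved would be the fourth power of the per-residue rate.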
The conclusion for the extended ARSA family, wrongly described in the literature, is that disulfide pairs have considerable conservation, though some pairs appear more fundamental than others. The knot and long-range pair appear to be the oldest conserved elements.
ARSB_hsa has 4 disulfide pairs that have conserved counterparts in rat and cat ARSB proteins (the latter each have a sporadic cysteine). The first, a long-range pair 26-436 is not an option in the nearest neighboring proteins, SulfX and SulfY. The 30-64 pair is conserved in position and length; the 90-101 pair is also present though slightly shorter at 90-96; and the 320-362 pair is missing altogether. None of these are conserved in the 4 drosophila ARSB-type proteins; these have their own conserved cluster of 4 cysteines extending from 93-119 of the post-catalytic core that are likely paired as disulfides. There is no explanation here for why C521Y causes severe disease (Am J Hum Genet 1994 Mar;54(3):454-63).
The conclusion here is that disulfides are important to ARSB structure but have evolved independently in different lineages. In humans, two of the four disulfides are deeply conserved.
The STS group has a possible counterpart to the two-disulfide terminal knot: 5 conserved cysteines in the catalytic core, including 1 preceding the active site, and 4 conserved cysteines in the post-core. Some of these may be in disulfides; the organization could be inferred from threading, as paired cysteines need to be adjacent in the folded protein.
The GNS family of sulfatases contains 10 conserved cysteines in addition to the one in the active site. The two early cysteines, 27-49, do not have a counterpart outside this family. The remaining 8 are conserved in the post-catalytic core. The long sulfatases, to which the short GNS is best related, have only 1 of the early cysteines. There are 18 cysteines in each long sulfatase's post-catalytic region, of which the last 6 correspond to those of GNS. Some 25 residues of GNS at the beginning of the post-catalytic region cannot be located in the KIAA series.
The inserted portion of long sulfatases consists of 7 complete exons, chr8:80324928-80349742 on the August 01 human genome assembly. These are exons 9-15 of the whole CDS. This suggests, since it is not that ancient a feature, that it arrived with introns from another gene. However, no related region can be located by Blast at this time, even though the inserted region is extremely well conserved. There is a 5% chance that something homologous exists in unsequenced human genome; otherwise, the counterpart may have been lost, or perhaps it transferred in its entirety so no outside homolog exists.
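Genome positions throughout this page are written in the UCSC browser style, chrN:start-end. A small helper for reading such strings and computing the genomic span is sketched below; the names are hypothetical, and treating both endpoints as inclusive is an assumption about the coordinate convention.

```python
import re
from typing import NamedTuple


class Locus(NamedTuple):
    chrom: str
    start: int
    end: int


_LOCUS = re.compile(r"^(chr[0-9A-Za-z]+):(\d+)-(\d+)$")


def parse_locus(text: str) -> Locus:
    """Parse a 'chr8:80324928-80349742' style position string."""
    match = _LOCUS.match(text.strip())
    if match is None:
        raise ValueError(f"not a chrN:start-end position: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not exceed end")
    return Locus(chrom, start, end)


def span(locus: Locus) -> int:
    """Length in bp, counting both endpoints (assumed inclusive)."""
    return locus.end - locus.start + 1


insert_region = parse_locus("chr8:80324928-80349742")
print(insert_region.chrom, span(insert_region))  # chr8 24815
```

On this reading the 7 inserted exons, with their introns, occupy just under 25 kbp of chr8.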
(to be continued)
Phosphorylation of arylsulphatase A occurs through multiple interactions with the UDP-N-acetylglucosamine-1-phosphotransferase proximal and distal to its retrieval site by the KDEL receptor. Biochem J 1999 Jun 15;340 (Pt 3):729-36 PMID: 10359658 Dittmer F, von Figura K. Phosphorylation of oligosaccharides of the lysosomal enzyme arylsulphatase A (ASA), which accumulate in the secretions of cells that mis-sort most of the newly synthesized lysosomal enzymes due to a deficiency of mannose 6-phosphate receptors, was found to be site specific. ASA residing within the secretory route of these cells contains about one third of the incorporated [2-3H]mannose in phosphorylated oligosaccharides. Oligosaccharides carrying two phosphate groups are almost 2-fold less frequent than those with one phosphate group and only a few of the phosphate groups are uncovered. Addition of a KDEL (Lys-Asp-Glu-Leu) retention signal prolongs the residence time of ASA within the secretory route 6-fold, but does not result in more efficient phosphorylation. In contrast, more than 90% of the [2-3H]mannose incorporated into secreted ASA (with or without a KDEL retention signal) is present in phosphorylated oligosaccharides. Those with two phosphate groups are almost twice as frequent as those with one phosphate group and most of the phosphate groups are uncovered. Thus, ASA receives N-acetylglucosamine 1-phosphate groups in a sequential manner at two or more sites located within the secretory route proximal and distal to the site where ASA is retrieved by the KDEL receptor, i.e. proximal to the trans-Golgi. At each of these sites, up to two N-acetylglucosamine 1-phosphate groups can be added to a single oligosaccharide. Of several drugs known to inhibit transit of ASA through the secretory route only the ionophore monensin had a major inhibitory effect on phosphorylation, uncovering and sialylation.
However, using the insert as a probe for tBlastn of dbEST, a well-conserved long sulfatase fragment from Xenopus turns up, which by good fortune can be extended to include a full length catalytic domain (35aa gap). Note that all 6 cysteines in the insert region are precisely conserved in Xenopus; this provides support for disulfides. A similar fragment assembly from cattle ESTs conserves 16 cysteines. The chicken EST BI393009 is of KIAA1247 type strongly suggesting, given quail is KIAA1077 type, that birds have both forms of long sulfatases. The conservation of these proteins is extraordinary. Rough alignment of long sulfatases, anchored on 12 conserved cysteines and showing nominal coiled-coil domains, with lysine and arginine replaced by *:
hsa77 *FL***EESS*NIQQSNHLP*YE*V*ELCQQA*YQTACEQPGQ*WQCIEDTSG*L*IH*C*GPSDLLTV*QST*NLYA*GF.....HD*D*ECSC*ESGY*AS*SQ**SQ*QFL*NQGTP*Y*P*FVHT*QT*SLSVEFEGEIYDINLEEEEELQVLQP*NIA**HDEGH*GP*DLQASSGGN*G*MLADSSNAVGPPTTV*VTH*CFILPNDSIHCE*ELYQSA*AW*DH*AYID*EIEALQD*I*NL*EV*GHL****PEECSCS*QSYYN*E*GV**QE*L*SHLHPF*EAA.QEVDS*LQLF*ENN*****E**E***Q**GEECSLPGLTCFTHDNNHWQTAPFW mus77 *FL***EESG*NIQQSNHLP*YE*V*ELCQQA*YQTACEQPGQNWQCIEDTSG*L*IH*C*GPSDLLTV*QNA*.LYS*GL.....HD*D*ECHC*DSGY*SS*SQ**NQ*QFL*N*GTP*Y*P*FVHT*QT*SLSVEFEGEIYDINLEEEE.LQVLPP*SIA**HDEGHQGFIGHQAAAGDI*NEMLADSNNAVGLPATV*VTH*CFILPNDTIHCE*ELYQSA*AW*DH*AYID*EIEVLQD*I*NL*EV*GHL****PEECGCGDQSYYN*E*GV**QE*L*SHLHPF*EAAAQEVDS*LQLF*EH.*****E**E***Q**GEECSLPGLTCFTHDNNHWQTAPFW gag77 *FL***EEAN*NTQQSNQLP*YE*V*ELCQQA*YQTACEQPGQ*WQCTEDASG*L*IH*C*VSSDILAI***A*.....SIHS*GYSG*D*DCNCGDTDF*NS*TQ**SQ*QFL*NPSAQ*Y*P*FVHT*QT*SLSVEFEGEIYDINLEEEELQVL*T*SIT**HNAEND**AEETDGAPGDTMVADGTDAIGQPssv*vth*Cfilpndti*Ce*elyqsa*aw*dh*ayid*eiealqd*i*nl*ev*ghl****pdeCdCt*qsyyn*e*gv*tqe*i*shlhpf*eaaqevds*lqlf*en*****e**g***q**gdeCslpglTCFTHDNNHWQTAPFW cot77 *FL***EEAN*NTQQSNQLP*YE*V*ELCQQA*YQTACEQPGQ*WQCTEDASG*L*IH*C*VSSDILAI***T*.....SIHS*GYSG*D*DCNCGDTDF*NS*TQ**NQ*QFL*AQ...*Y*P*FVHT*QT*SLSVEFEGEIYDINLEEEE.LQVL*T*SIT**HN..AEND**AEETDGAPGDTMVADGTDVIGQPSSV*VTH*CFILPNDTI*CE*ELYQSA*AW*DH*AYID*EIEALQD*I*NL*EV*GHL****PDECDCT*QSYYN*E*GV*TQE*I*SHLHPF*EAA.QEVDS*LQLF*EN.*****E**G***Q**GDECSLPGLTCFTHDNNHWQTAPFW xen77 *FL***EEPS*STPQSNHLP*YE*V*ELCQQA*YQTACEQPGQ*WQCIENMFG*L*IH*C*GSSDTISL***T.....*SINS*GYGS*H*ECVCGEADY*SS*SQ***Q*LFM*TPGV**FNP*FVHT*HT*SLSVEFEGEIYDINLEDEEDHQASQl*slt**hydneedeedDDED*DMEDYSGTGGLTNDLIVPSSI*VTH*CFILANDTVQCDMDLY*SLQAW*DH*VHIDHEIETLQS*I*NL*EV*GHL****PDECDCS*PGFY**E*GV*VQD*L*GHMHPF*EGV.QEVDS*FQIF*EN.*****E**E***Q**GEECSLPGLTCFTHD*NHWQTAPFW fug77 
*ML***DDSA.STQHTNSLP*Y**V*ETCQQAEFQTPCEQPGQ*WHCVEEITG*W*IQ*C*GSP*ESS***V*SL*P*.....SGYDNGE*GCDCGEAAF*PS*VE**SH*QLSSGQ*Y*P*FVHT*PT*SLSVEFEGQIYDIDLQADDQSGI***AIS**HHNAED......PEYDLGSDDGSEEMLGDDTNAVGYPNSL*VTH*CFILMNDSV*CE*EIYQSS*AW*DH*SYVDQEIETLQD*I*NL*EV*GHL**T*PEECDCDG*SYYT*GD*N*AE*T*N**DQLHPF*ETAQEADG*AQLYNEI*****E**E***Q**GDDCSLPGLTCFTHNNDHWQTAPFW hsa47 *LLH**DND*VDAQEENFLP*YQ*V*DLCQ*AEYQTACEQLGQ*WQCVEDATG*L*LH*C*GPM*LGGS*ALSNLVP......*YYGQGSEACTCDSGDY*LSLAG****LF***Y*ASYV*S*SI*SVAIEVDG*VYHVGL...........GDAAQP*NLT**HWPGAPEDQ....DD*DGGDFSGTGGLPDYSAANPI*VTH*CYILENDTVQCDLDLY*SLQAW*DH*LHIDHEIETLQN*I*NL*EV*GHL****PEECDCH*ISYH..TQH*G*L*H*GSSLHPF**G.LQE*D.*VWLL*EQ.*****L**LL**LQNNDTCSMPGLTCFTHDNQHWQTAPFW mus47 *LLH**EGD*VNAQEENFLP*YQ*V*DLCQ*AEYQTACEQLGQ*WQCVEDASGTL*LH*C*GPM*F..S*ALSNLVP......*YDGQSSEACSCDS.DY*LGLAG***.LF***Y*TSYA*N*SI*SVAIEVDGEIYHVGL...........DTVPQP*NLS*PHWPGAPEDQ....DD*DGGSFSGTGGLPDYSAPNPI*VTH*CYILENDTVQCDLDLY*SLQAW*DH*LHIDHEIETLQN*I*NL*EV*GHL****PEECDCH*ISYH..SQH*G*L*H*GSSLHPF**G.LQE*D.*VWLL*EQ.*****L**LL**LQNNDTCSMPGLTCFTHDNHHWQTAPLW gag47 *LLH**ENE*VDAQEENFLP*YQ*V*DLCQ*AEYQTACEQLGQ*WQCVEDPSG*LTLH*C*GMVNLAGNS*GT......SNLLPYYN*NSEDCNCEENEY*LSHTG****LFS***Y*PSYA*N*ST*SVSVELNGAVFNL.........GLEDGYQPVLP*NIT**H*MQ*AVL*EEED*DMAEYSGTGGIAEYAAPNLI*VTH*CYILENDTVQCDTDLY*SLQAW*DH*LHIDHEIETLQN*I*NL*EV*GHL****PEECDCN*ISYHS*..H*S*L*H*GSNLHPF...**ALQE*D*LWLL*EQ*****L**LL**I*NNDTCSMPGLTCFTHDNQHWQTAPLW fug47 *PLH**ADG*EVSQEENFLP*YQ*V*DLCQ*AEYQTSCQQPGQ*WQCVEDSTG*L*LY*C*GMAGLYAP*MQALMA*GASQPSAAAADSSDSCNCGNWGL**TAVL*******WA*SVSFELGGDLYAVDLEEGY*PLSSSN...........SSWA*F*NDEDNDEFSGMGLT............A*PTNNN*LTPPAAL*VTY*CSILMNDTV*CDGGLY*SLQAW*DH*LHIEHEIETLQT*I*NL*EV*GHL**V*PEECQCDPP*YH..TQH*G*L*H*GSSLHPF.....VPS*Q*TQWLQ*EQ*****L**FL**LQNNDTCSMPGLTCFTHDNHHWQTAPFW tun.. 
IE*G*IPLN*VHLA*PVLPS*QE*IEEEC*SPIY*YPCD.PGQEWTCVLEDGES.*I**C................**F*VP*T*SQ*****CTCHG***TTVSHDCM**LNQI**EY**FATTNE***FI**ASAL**SNQWHSFLSGDEGGLGSS*YSDESFSSLGGSNGLF*T***SPNNEVVSLTNN*TVMFN*LNMAISPQCVFMFPEY*I*CN*GAS*TGA**W****$*L*L*LL*IASACVTN*ASPQ*I**SPQC*C*TSSG*AP*P*IEP*VQQ$*IIAMEI*QAYN**LQEF*E**QEVNFH*I***I*QS*PS*G*CNSNGMSCF*LDAN*WETPPLW ele.. *MP*L**I*D*YI*Q***FN*EN*LS*EC****WQ*DCVH.GQLW*CYYTVED*W*IY*C*DNW.........................SDQCSC****EISNYDDDDID.............................................................................................................EFLTYAD*ENFSEGHEWYQGEFEDSGEVGEELDGH*S**GILS*CSCS*NVSHPI*LL..........................EQ*MS**HYL*Y***PQNGSL*P*DCSLPQMNCFTHTASHW*TPPLW dme.. EELDQEFQQNNDLPLAPYIT*MM*LNSECSDPALL*NCL.PGQ*W*CV.NEEG*W**H*C*FH...........LQLEHQLAAMP**QYQ*NCACFTPDGVVYT*I*APSAGLH*VN**THNGPG***N**EVFHTELPDEMEELLDLHQVV............................................DQLVDHTH*SCFVDATTA*VNCSNVIYDDE*TW*TS*TQID..$YGSASAFDSLEQTQSH*FTP*AECYCEPDVGE....................EHADS*EMA*EA***L*E*Q***E***I**A*LE*ECLSE*MNCFSHDNQHW*TAPLW coil.............................................................................................................................................................................................................................................$$$$$$$$$$$$$$$$$$$$$$$$$.............................................$$$$$$$$$$$$$$$$$$$$$........................ cys ..............................1........2........3............4................................5.6...............................................................................................................7.........8..............................................9.10.............................................................11.....12.............
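The display convention used in the alignment above (lysine and arginine shown as *) is easy to reproduce for any further sequence. A minimal sketch, with a hypothetical function name; both upper and lower case residues are masked, since some rows above are partly in lower case:

```python
_MASK_TABLE = str.maketrans("KRkr", "****")


def mask_basic_residues(seq: str) -> str:
    """Replace every lysine (K) and arginine (R) with '*', leaving other residues as-is."""
    return seq.translate(_MASK_TABLE)


# First residues of the human KIAA1077 insert as listed on this page.
print(mask_basic_residues("KFLRKKEESSKNIQQSNHLPKYERVKELCQQARYQTACEQ"))
```

Masking the basic residues in this way makes the lysine- and arginine-rich stretches, which coincide with the nominal coiled-coil regions marked in the coil row, stand out at a glance.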
>KIAA1077_xla_frag 471aa Xenopus insert BG408276 BG360235 BG814048 83% to quail KIAA1077; 35aa missing filled with quail (final caps show 97aa of insert match) RPNIILVLTDDQDVELGSMQVMNKTRRIMEQGGTHFINAFVTTPMCCPSRSSILTGKYVHNHNTYTNNENCSSPSWQAQHETHTFFVYLNNTGYRTAFFGKYLNEYNGTYVPPGWrewvglvknsrfynytisrngnkekhgfdyakdyfLTDLITNDSISFFRMSKKIYPHRPVLMVLSHAAPHGPEDSAPQYSQMFQNASQHITPSYNYAPNPDKHWIMRYTGPMKPIHMEFTNMLQRRRLQTLMSVDNSMEMIYNMLVETGELENTYVIYTADHGYHIGQFGLVKGKSMPYEFDIRVPFYIRGPNVEAGSLNPHIVLNIDLAPTILDIAGLDTPPDMDGKSVLTLLDIERPGyrlrtnkknkiwrdsifvergKFLRKKEEPSKSTPQSNHLPKYERVKELCQQARYQTACEQPGQKWQCIENMFGKLRIHKCKGSSDTISLKKRTRSINSKGYGSKHkECVCGEADYK >KIAA1247_bta_frag 422aa Bos taurus BF230382 AV617208 AV600225 98% to KIAA1247_hsa (last caps show insert match) LTDLITNDSVSFFRASKKMYPHRPVLMVLSHAAPHGPED rvwrdsflvergKLLHKRDSDKVDAQEENFLPKYQRVKDLCQRAEYQTACEQLGQKWQCVEDASGKLKLHKCKGPVRPGGRALSNLVPKYYVQGSEGCICDSGDGQLTLAGRRKKLFKKKYKASYARNRSIRSVAVRGHHVSLDD HEIETLQNKIKNLREVRGHLKKKRPEECDCHKISYHAQHKGRLKHKGSSLHPFRKGLQEKDKVWLLREQKRKKKLRKLLKRLQNNDTCSMPGLTCFTHDNQHWQTAPLWTLGPFCACTSANNNTYWCMRTINETHNFLFCEFATGFLEYFDLNTDPYQLMNAVNTLDRDVLNQLHVQLMELRSCKGYKQCNPRTRNMDLGLKDGGSYEQYRQFQRRKWPEXKRPSSKSLGQLWEGWEG >KIAA1077_hsa_insert 346aa frame 2 frame 2 exons 9-15 chr8:80324928-80349742 KFLRKKEESSKNIQQSNHLPKYERVKELCQQARYQTACEQPGQKWQCIEDTSGKLRIHKCKGPSDLLTVRQSTRNLYARGFHDKDKECSCRESGYRASRSQRKSQRQFLRNQGTPKYKPRFVHTRQTRSLSVEFEGEIYDINLEEEEELQVLQPRNIAKRHDEGHKGPRDLQASSGGNRGRMLADSSNAVGPPTTVRVTHKCFILPNDSIHCERELYQSARAWKDHKAYIDKEIEALQDK >KIAA1077_mmu_insert 344aa mouse 88% KFLRKKEESGKNIQQSNHLPKYERVKELCQQARYQTACEQPGQNWQCIEDTSGKLRIHKCKGPSDLLTVRQNARXLYSRGLHDKDKECHCRDSGYRSSRSQRKNQRQFLRNKGTPKYKPRFVHTRQTRSLSVEFEGEIYDINLEEEELQVLPPRSIAKRHDEGHQGFIGHQAAAGDIRNEMLADSNNAVGLPATVRVTHKCFILPNDTIHCERELYQSARAWKDHKAYIDKEIEVLQDKIKNLREVRGHLKKRKPEECGCGDQSYYNKEKGVKRQEKLKSHLHPFKEAAAQEVDSKLQLFKEHRRRKKERKEKKRQRKGEECSLPGLTCFTHDNNHWQTAPFWN >KIAA1077_cco_insert 338aa quail 77% 
KFLRKKEEANKNTQQSNQLPKYERVKELCQQARYQTACEQPGQKWQCTEDASGKLRIHKCKVSSDILAIRKRTRSIHSRGYSGKDKDCNCGDTDFRNSRTQRKNQRQFLRAQKYKPRFVHTRQTRSLSVEFEGEIYDINLEEEELQVLKTRSITKRHNAENDKKAEETDGAPGDTMVADGTDVIGQPSSVRVTHKCFILPNDTIRCERELYQSARAWKDHKAYIDKEIEALQDKIKNLREVRGHLKRRKPDECDCTKQSYYNKEKGVKTQEKIKSHLHPFKEAAQEVDSKLQLFKENRRRKKERKGKKRQKKGDECSLPGLTCFTHDNNHWQTAPFWN
>KIAA1247_hsa_insert 324aa human 46%
KLLHKRDNDKVDAQEENFLPKYQRVKDLCQRAEYQTACEQLGQKWQCVEDATGKLKLHKCKGPMRLGGSRALSNLVPKYYGQGSEACTCDSGDYKLSLAGRRKKLFKKKYKASYVRSRSIRSVAIEVDGRVYHVGLGDAAQPRNLTKRHWPGAPEDQDDKDGGDFSGTGGLPDYSAANPIKVTHRCYILENDTVQCDLDLYKSLQAWKDHKLHIDHEIETLQNKIKNLREVRGHLKKKRPEECDCHKISYHTQHKGRLKHRGSSLHPFRKGLQEKDKVWLLREQKRKKKLRKLLKRLQNNDTCSMPGLTCFTHDNQHWQTAPFW
>KIAA1247_mmu_insert 329aa mouse 45%
KLLHKREGDKVNAQEENFLPKYQRVKDLCQRAEYQTACEQLGQKWQCVEDASGTLKLHKCKGPMRFGGGGGSRALSNLVPKYDGQSSEACSCDSGGGGDYKLGLAGRRKLFKKKYKTSYARNRSIRSVAIEVDGEIYHVGLDTVPQPRNLSKPHWPGAPEDQDDKDGGSFSGTGGLPDYSAPNPIKVTHRCYILENDTVQCDLDLYKSLQAWKDHKLHIDHEIETLQNKIKNLREVRGHLKKKRPEECDCHKISYHSQHKGRLKHKGSSLHPFRKGLQEKDKVWLLREQKRKKKLRKLLKRLQNNDTCSMPGLTCFTHDNHHWQTAPLW
>KIAA47/77_cin insert 332aa tunicate unique Ciona intestinalis long sulfatase
LPSKQERIEEECRSPIYKYPCDPGQEWTCVLEDGESKIRKCKRFKVPRTRSQKRRRKCTCHGKKKTTVSHDCMRRLNQIKREYKRFATTNERRRFIRKASALRRSNQWHSFLSGDEGGLGSSRYSDESFSSLGGSNGLFRTKRRSPNNEVVSLTNNRTVMFNKLNMAISPQCVFMFPEYKIKCNKGASRTGARRWKKKRLEQLNRHIVATRLRLKLLRIASACVTNKASPQRIRRSPQCKCKTSSGRAPKPRIEPRVQQQNQSVNPMKIIAMEIRQAYNKRLQEFRERRQEVNFHRIKKKIKQSRPSRGRCNSNGMSCFKLDANRWETPPLW

Synteny of mouse and human sulfatase genes:
A small translocation involving human chromosomes 8q21 and 20q13 resulted in a retained gene duplication that is known today as the KIAA1077 and KIAA1247 sulfatases (named after Kazusa DNA Research Institute cDNAs). This event preceded the divergence of human and mouse; evidence for this is given in raw form below. However, the event giving rise to these long sulfatases, by parsimony considerations, must have preceded their doubling: a much earlier gene duplication, involving the GNS now on 12q14, gave rise initially to a KIAA sulfatase that then acquired extra coding properties (unlike the member that is today GNS). Indeed, since Drosophila also has a related long sulfatase, this event occurred early in animal evolution. Beyond this, affinities of the GNS-KIAA1077-KIAA1247 family remain obscure.
Mouse chr 1 genes and their human counterparts on chr 8q21 and chr 20q13:
OPRK1 chr8:56794983-56816757 NM_000912 opioid receptor, kappa 1
Sox17 chr8:58028772-58030137 NM_022454 FLJ22252
RP1 chr8:58112555-58127323 NM_006269 retinitis pigmentosa RP1 protein
MYBL1 chr8:70234049-70234271
KIAA1077 chr8:73675649-73704612
EYA1 chr8:75452356-75587230 NM_000503 eyes absent Drosophila homolog 1
MSC chr8:75959614-75962428 NM_005098 musculin 8q21 11
TERF1 chr8:77147357-77165680 NM_003218 telomeric repeat binding

OPRL1 chr20:64627479-64628368 opiate receptor-like 1 NM_000913
SOX18 chr20:64581727-64582163 NM_018419 SRY sex determining region Y-box 18
RP1 chr20 counterpart missing, oddly no counterpart anywhere, large protein
MYBL2 chr20:44004996-44047675 NM_002466 20q12
KIAA1247 chr20:47988960-48033834
EYA2 chr20:47321193-47519325 NM_005244
MSC chr20 counterpart missing
TERF2 chr20 counterpart missing chr16:80806221-80836549 NM_005652

Mouse chr2 genes and their human counterparts on chr 20q:
RPN2 chr20:37420266-37620266 no chr 8
PLCG1 chr20:41468766-41506160 no chr 8
RBL1 chr20:37329033-37426950 no chr 8
TOP1 chr20:41360013-41455684 topoisomerase; chr 8 147627504-147763382 not applicable
MYBL2* chr20:44004996-44047675 NM_002466 20q12
PLTP chr20:46229943-46243330 no chr 8
PTPRT chr20:42403965-43521108 no chr 8
ADA chr20:44950714-44982894 NM_000022 adenosine deaminase no chr 8
SDC4 chr20:45656478-45679601 NM_002999 syndecan 4
SEMG1 chr20:45538239-45540959 NM_003007 semenogelin no chr 8
TCF4 chr20:522959-528967 NM_004609 TCF15
EYA2* chr20:47321193-47519325 NM_005244
SDCBP chr8:61928362-61962228 NM_005625 20 +1231038 1232169 not applicable
To see the tree at better scale, paste the .PHB file into TreeView. The tree was derived from ClustalW alignment of catalytic domains. Note that many of the gene duplications occurred very early in the history of the family and that branch lengths (roughly, rates of evolution) are similar, with the exception of anomalously high rates of change in the X chromosome cluster (ARSG subtree).
(((GNS:0.41864,(K1247:0.07563,K1077:0.07563):0.34607):0.0407,(SulfX:0.46305,((K1001:0.42549,(ARSA:0.39383,GALNS:0.39328):0.01429):0.00861,(ARSG:0.38955,(STS:0.259,(ARSF:0.19885,(ARSD:0.18763,ARSE:0.18492):0.01123):0.06383):0.12725):0.04255):0.02547):0.00547):0.00185,(IDS:0.45052,(ARSB:0.41513,SGSH:0.4224):0.03968):0.00714,(SulfY:0.46719,SulfZ:0.45998):0.02122);
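For readers without TreeView, the tree string above can also be read directly. The following is a minimal sketch of a Newick reader, not a full implementation of the format (no quoted labels or comments); the function names are hypothetical. It lists each leaf with its summed branch length back to the root of the string, which is one simple way to compare branch lengths across subtrees.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, branch_length, children) tuples."""
    text = "".join(text.split())  # drop whitespace and line breaks
    pos = 0

    def read_token(stop_chars):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stop_chars:
            pos += 1
        return text[start:pos]

    def read_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(read_node())
            while text[pos] == ",":
                pos += 1
                children.append(read_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at offset {pos}")
            pos += 1  # consume ")"
        name = read_token(",():;")
        length = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            length = float(read_token(",();"))
        return name, length, children

    return read_node()


def leaf_depths(node, depth=0.0):
    """Map each leaf name to the sum of branch lengths from the root."""
    name, length, children = node
    total = depth + length
    if not children:
        return {name: total}
    depths = {}
    for child in children:
        depths.update(leaf_depths(child, total))
    return depths


TREE = (
    "(((GNS:0.41864,(K1247:0.07563,K1077:0.07563):0.34607):0.0407,"
    "(SulfX:0.46305,((K1001:0.42549,(ARSA:0.39383,GALNS:0.39328):0.01429):0.00861,"
    "(ARSG:0.38955,(STS:0.259,(ARSF:0.19885,(ARSD:0.18763,ARSE:0.18492):0.01123)"
    ":0.06383):0.12725):0.04255):0.02547):0.00547):0.00185,"
    "(IDS:0.45052,(ARSB:0.41513,SGSH:0.4224):0.03968):0.00714,"
    "(SulfY:0.46719,SulfZ:0.45998):0.02122);"
)

for leaf, depth in sorted(leaf_depths(parse_newick(TREE)).items()):
    print(f"{leaf:6s} {depth:.5f}")
```

Root-to-leaf sums depend on where the tree string happens to be rooted, so they are a rough guide only; individual internal branch lengths are the more direct indication of rate differences.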
KIAA1247 has 20 coding exons in all species where this is determinable. Exon 19 is missing in all known KIAA1077 (floor plate) sulfatases, including human, mouse, rat, and quail. This is not alternative splicing because tBlastn at e=100 can find no sign of a skipped exon in intervening finished DNA. The gene duplication leading to long sulfatases is quite ancient since 3 distinct long sulfatases are observed in fugu.
The approximate exon boundaries for some human sulfatases are given below. Phase is indicated by a lower case letter attached to the exon having 2/3 codon letters. By default, the GT-AG rule (resp. GC-AG) was assumed in determining splice junctions. The quality of data is uneven, depending on the status of the human genome project at the site involved and will improve in coming assemblies. Mouse exon structure can also be determined in some instances from unfinished sequence.
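The two conventions just described can be checked mechanically. The sketch below is illustrative, with hypothetical function names: the first function classifies an intron by its terminal dinucleotides (GT-AG, or the rarer GC-AG), and the second reports how many bases of a split codon are left at the end of each coding exon, which is the information the lower case letters encode.

```python
def splice_signal(intron: str) -> str:
    """Classify an intron sequence by its first and last two bases."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to hold donor and acceptor sites")
    donor, acceptor = seq[:2], seq[-2:]
    if acceptor == "AG" and donor == "GT":
        return "GT-AG"
    if acceptor == "AG" and donor == "GC":
        return "GC-AG"
    return "noncanonical"


def split_codon_bases(cds_exon_lengths):
    """For each coding exon, the number of bases (0, 1 or 2) of an
    incomplete codon left at its 3' end, from cumulative CDS length."""
    leftovers = []
    total = 0
    for length in cds_exon_lengths:
        total += length
        leftovers.append(total % 3)
    return leftovers


print(splice_signal("GTAAGTccttAG"))  # GT-AG
print(split_codon_bases([10, 5, 6]))  # [1, 0, 0]
```

An exon ending with two leftover bases is the case marked on this page by attaching a lower case residue letter to that exon.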
One fact to emerge from comparing exon boundaries carefully is that several of the genes have a phase 2 exon ending at the mysteriously conserved TG following the catalytic site (for example, SulfY and SulfZ). Could this explain the conservation of the amino acid feature: its nucleotides enhance and define a good splice donor? Other sulfatases, however, do not have exon breaks here, e.g. GNS, though ARSA and ARSB have a boundary nearby. Bacterial sulfatases also have this conserved TG (sometimes SG) but have no introns. Being at the opposite end of the catalytic alpha helix leaves these residues distant from the active site; alanine mutagenesis could not provide a structural explanation. Folding intermediates have also been postulated. Using the pattern [ST].[RK] for protein kinase C phosphorylation sites gives TGR and TGK as candidates.
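The pattern search mentioned above can be reproduced with a regular expression. This is a minimal sketch with a hypothetical function name; a lookahead is used so that overlapping candidate sites are all reported. The example fragment spans the exon 1/exon 2 junction of SulfY_hsa as listed on this page. A pattern match only flags a candidate and is not evidence of phosphorylation.

```python
import re

# [ST].[RK]: serine or threonine, any residue, then arginine or lysine.
PKC_SITE = re.compile(r"(?=([ST].[RK]))")


def pkc_candidates(protein: str):
    """Return (1-based position, tripeptide) for every [ST].[RK] match,
    including overlapping ones."""
    return [
        (match.start() + 1, match.group(1))
        for match in PKC_SITE.finditer(protein.upper())
    ]


fragment = "CTPSRSQFITGRYQIHTG"  # SulfY_hsa around the exon 1/2 boundary
print(pkc_candidates(fragment))  # [(10, 'TGR')]
```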
>SulfX_hs 8 exon structure 15 aa signal 55.6 %: extracellular MLLLWVSVVAALALAVLAPGAGEQRRRAAKAP DGRLTFHPGSQVVKLPFINFMKTRGTSFLNAYTNSPICCPSRA AMWSGLFTHLTESWNNFKGLDPNYTTWMDVMERHGYRTQKFGKLDYTSGHHSI SNRVEAWTRDVAFLLRQEGRPMVNLIRNRTKVRVMERDWQNTDKAVNWLRKEAINYTEPFVIYLGLNLPHPYPSPSSGENFGSSTFHTSLYWLEK VSHDAIKIPKWSPLSEMHPVDYYSSYTKNCTGRFTKKEIKNIRAFYYAMCAETDAML GEIILALHQLDLLQKTIVIYSSDHGELAMEHRQFYKMSMYEASAHVPLLMMGPGIKAGLQVSNVVSLVDIYPTML DIAGIPLPQNLSGYSLLPLSSETFKNEHKVKNLHPPWILSEFHGCNVNASTYMLRTNHWKYIAYSDGASILPQLF DLSSDPDELTNVAVKFPEITYSLDQKLHSIINYPKVSASVHQYNKEQFIKWKQSIGQNYSNVIANLRWHQDWQKEPRKYENAIDQWLKTHMNPRAV >SulfY_hsa 574aa 2 exons chr4:125570802-125648441 size 77640 - ::123::45 MGALAGFWILCLLTYGYLSWGQALEEEEEGALLAQAGEKLEPSTTSTSQPHLIFILADDQGFRDVGYHGSEIKTPTLDKLAAEGVKLENYYVQPICTPSRSQFITG RYQIHTGLQHSIIRPTQPNCLPLDNATLPQKLKEVGYSTHMVGKWHLGFYRKECMPTRRGFDTFFGSLLGSGDYYTHYKCDSPGMCGYDLYENDNAAWDYDNGIYSTQMYTQRVQQILASHNPTKPIFLYIAYQAVHSPLQAPGRYFEHYRSIININRRRYAAMLSCLDEAINNVTLALKTYGFYNNSIIIYSSDNGGQPTAGGSNWPLRGSKGTYWEGGIRAVGFVHSPLLKNKGTVCKELVHITDWYPTLISLAEGQIDEDIQLDGYDIWETISEGLRSPRVDILHNIDPIYTKAKNGSWAAGYGIWNTAIQSAIRVQHWKLLTGNPGYSDWVPPQSFSNLGPNRWHNERITLSTGKSVWLFNITADPYERVDLSNRYPGIVKKLLRRLSQFNKTAVPVVRYPPKDPRSNPRLNGGVWGPWYKEETKKKKPSKNQAEKKQKKSKKKKKKQQKAVSGSTCHSGVTCG >SulfZ_hsa 569aa NT_006951 ARSB type another chr 5 gene 2 exon, 4 glcyo QLLTGR end of exon1::123::45 MHTLTGFSLVSLLSFGYLSWDWAKPSFVADGPGEAGEQPSAAPPQPPHIIFILTDDQGYHDVGYHGSDIETPTLDRLAAKGVKLENYYIQPICTPSRSQLLTG RYQIHTGLQHSIIRPQQPNCLPLDQVTLPQKLQEAGYSTHMVGKWHLGFYRKECLPTRRGFDTFLGSLTGNVDYYTYDNCDGPGVCGFDLHEGENVAWGLSGQYSTMLYAQRASHILASHSPQRPLFLYVAFQAVHTPLQSPREYLYRYRTMGNVARRKYAAMVTCMDEAVRNITWALKRYGFYNNSVIIFSSDNGGQTFSGGSNWPLRGRKGTYWEGGVRGLGFVHSPLLKRKQRTSRALMHITDWYPTLVGLAGGTTSAADGLDGYDVWPAISEGRASPRTEILHNIDPLYNHAQHGSLEGGFGIWNTAVQAAIRVGEWKLLTGDPGYGDWIPPQTLATFPGSWWNLERMASVRQAVWLFNISADPYEREDLAGQRPDVVRTLLARLAEYNRTAIPVRYPAENPRAHPDFNGGAWGPWASDEEEEEEEGRARSFSRGRRKKKCKICKLRSFFRKLNTRLMSQRI >ARSB from NT_027010 PPHLVFLLADDLGWNDVGFHGSRIRTPHLDALAAGGVLLDNYYTQPLCTPSRSQLLTGRYQ 
IRTGLQHQIIWPCQPSCVPLDEKLLPQLLKEAGYTTHMVGKWHLGMYRKECLPTRRGFDTYFG YLLGSEDYYSHERCTLIDALNVTRCALDFRDGEEVATGYKNMYSTNIFTKRAIALITNHPPEK PLFLYLALQSVHEPLQVPEEYLKPYDFIQDKNRHHYAGMVSLMDEAVGNVTAALKSSGLWNNTVFIFST DNGGQTLAGGNNWPLRGRKWSLWEGGVRGVGFVASPLLKQKGVKNRELIHISDWLPTLVKLARGHTNGTKPLDGFDVWKTI SEGSPSPRIELLHNIDPNFVDSSPC PRNSMAPAKDDSSLPEYSAFNTSVHAAIRHGN WKLLTGYP GCGYWFPPPSQYNVSEIPSSDPPTKTLWLFDIDRDPEERHDLSREYPHIVTKLLSRLQFYHKHSVPVYFPAQDPRCDPKATGVWGPWM >ARSA cds 8 exons MGAPRSLLLALAAGLAVARPPNIVLIFADDLGYGDLGCYGHPSSTTPNLDQLAAGGLRFTDFYVPVSLCTPS AALLTGRLPVRMGMYPGVLVPSSRGGLPLEEVTVAEVLAARGYLTGMAGKWHLGVGPEGAFLPPHQGFHRFLGIPYSHDQ GPCQNLTCFPPATPCDGGCDQGLVPIPLLANLSVEAQPPWLPGLEARYMAFAHDLMADAQRQDRPFFLYYASH HTHYPQFSGQSFAERSGRGPFGDSLMELDAAVGTLMTAIGDLGLLEETLVIFTADNg PETMRMSRGGCSGLLRCGKGTTYEGGVREPALAFWPGHIAP PGVTHELASSLDLLPTLAALAGAPLPNVTLDGFDLSPLLLGTGK SPRQSLFFYPSYPDEVRGVFAVRTGKYKAHFFTQ gSAHSDTTADPACHASSSLTAHEPPLLYDLSKDPGENYNLLGGVAGATPEVLQALKQLQLLKAQLDAAVTFGPSQVARGEDPALQICCHPGCTPRPACCHCPDPHA >GNS_hs 552 aa 14 cds exons Glu6S MRLLPLAPGRLRRGSPRHLPSCSPALLLLVLGGCLGVFGVAAGTRRPNVVLLLTDDQDEVLGGM TPLKKTKALIGEMGMTFSSA YVPSALCCPSRASILTGKYPHNHHVVNNTLEGNCSSKSWQKIQEPNTFPAILRSMCGYQTFFAGKYLNE YGAPDAGGLEHVPLGWSYWYAL EKNSKYYNYTLSINGKARKHGENYSVDYLTDVL ANVSLDFLDYKSNFEPFFMMIATPAPHSPWTAAPQYQKAFQNVFAPRNKNFNIHGT NKHWLIRQAKTPMTNSSIQFLDNAFRK RWQTLLSVDDLVEKLVKRLEFTGELNNTYIFYTSDNGYHT GQFSLPIDKRQLYEFDIKVPLLVRGPGIKPNQTSK MLVANIDLGPTILDIAGYDLNKTQMDGMSLLPIL RGASNLTWRSDVLVEYQGEGRNVTDPTCPSLSPGVS QCFPDCVCEDAYNNTYACVRTMSALWNLQYCEFDDQE VFVEVYNLTADPDQITNIAKTIDPELLGKMNYRLMMLQSCSGPTCRTPGVFDPG YRFDPRLMFSNRGSVRTRRFSKHLL >GNS_hs on August 01 assembly relative to KIAA group MLVANIDLGPTILDIAGYDLNKTQMDGMSLLPIL end of exon 10 RGASNLTWRSDVLVEYQGEGRNVTDPTCPSLSPGVS exon 11 QCFPDCVCEDAYNNTYACVRTMSALWNLQYCEFDDQE exon 12 VFVEVYNLTADPDQITNIAKTIDPELLGKMNYRLMMLQSCSGPTCRTPGVFDP exon 13 YRFDPRLMFSNRGSVRTRRFSKHLL- exon 14 >KIAA1247_hs 20 cds exons MGPPSLVLCLLSATVFSLLGGSSAFLSHHRLKGRFQRDRRNIRPNIILVLTDDQDVEL 
GSMQVMNKTRRIMEQGGTHFINAFVTTPMCCPSRSSILTGKYVHNHNTYTNNENCSSPSWQAQHESRTFAVYLNSTGYRT AFFGKYLNEYNGSYVPPGWKEWVGLLKNSRFYNYTLCRNGVKEKHGSDYSK DYLTDLITNDSVSFFRTSKKMYPHRPVLMVISHAAPHGPEDSAPQYSRLFPNASQH TPSYNYAPNPDKHWIMRYTGPMKPIHMEFTNMLQRKRLQTLMSVDDSMET IYNMLVETGELDNTYIVYTADHGYHIGQFGLVKGKSMPYEFDIRVPFYVRGPNVEAGCL NPHIVLNIDLAPTILDIAGLDIPADMDGKSILKLLDTERPVN RFHLKKKMRVWRDSFLVERG KLLHKRDNDKVDAQEENFLPKYQRVKDLCQRAEYQTACEQLG QKWQCVEDATGKLKLHKCKGPMRLGGSRALSNLVPKYYGQGSEACTCDSGDYKLSLAGRRKKLFKK YKASYVRSRSIRSVAIEVDGRVYHVGLGDAAQPRNLTKRHWPGAPEDQDDKDGGDFSGTGGLPDYSAANPIKVTH RCYILENDTVQCDLDLYKSLQAWKDHKLHIDHE IETLQNKIKNLREVRGHLKKKRPEECDCHKI SYHTQHKGRLKHRGSSLHPF RKGLQEKDKVWLLREQKRKKKLRKLLKRLQNNDTCSMPGLTCFTHDNQHWQTAPFWT GPFCACTSANNNTYWCMRTINETHNFLFCEFATGFLEYFDLNTDPY QLMNAVNTLDRDVLNQLHVQLMELRSCKGYKQCNPRTRNMDLG LKDGGSYEQY RQFQRRKWPEMKRPSSKSL GQLWEGWEG >KIAA1077_hs 20 exons MKYSCCALVLAVLGTELLGSLCSTVRSPRFRGRIQQERKNIRPNIILVPTDDQDVEL GSLQVMNKTRKIMEHGGATFINAFVTTPMCCPSRSSMLTGKYVHNHNVYTNNENCSSPSWQAMHEPRTFAVYLNNTGYRT AFFGKYLNEYNGSYIPPGWREWLGLIKNSRFYNYTVCRNGIKEKHGFDYAK DYFTDLITNESINYFKMSKRMYPHRPVMMVISHAAPHGPEDSAPQFSKLYPNASQH TPSYNYAPNMDKHWIMQYTGPMLPIHMEFTNILQRKRLQTLMSVDDSVER LYNMLVETGELENTYIIYTADHGYHIGQFGLVKGKSMPYDFDIRVPFFIRGPSVEPGS VPQIVLNIDLAPTILDIAGLDTPPDVDGKSVLKLLDPEKPGN RFRTNKKAKIWRDTFLVERG KFLRKKEESSKNIQQSNHLPKYERVKELCQQARYQTACEQPGQ KWQCIEDTSGKLRIHKCKGPSDLLTVRQSTRNLYARGFHDKDKECSCRESGYRASRSQRKSQRQFLRNQGTP YKPRFVHTRQTRSLSVEFEGEIYDINLEEEEELQVLQPRNIAKRHDEGHKGPRDLQASSGGNRGRMLADSSNAVGPPTTVRVTH CFILPNDSIHCERELYQSARAWKDHKAYIDKE IEALQDKIKNLREVRGHLKRRKPEECSCSKQS YYNKEKGVKKQEKLKSHLHPF EAAQEVDSKLQLFKENNRRRKKERKEKRRQRKGEECSLPGLTCFTHDNNHWQTAPFWN GSFCACTSSNNNTYWCLRTVNETHNFLFCEFATGFLEYFDMNTDPYQ LTNTVHTVERGILNQLHVQLMELRSCQGYKQCNPRPKNLDV NKDGGSYDLH GQLWDGWEG >STS_hs Xp22.31 10 exons 1 RKMKIPFLLLFFLWEAESHAASRPNIILVMADDLGIGDPGCYGNKTI 2 rTPNIDRLASGGVKLTQHLAASPLCTPSRAAFMTGRYPVRS..................................34 3 GMASWSRTGVFLFTASSGGLPTDEITFAKLLKDQGYSTALI 4 
gKWHLGMSCHSKTDFCHHPLHHGFNYFYGISLTNLRDCKPGEGSVFTTGFKRLVFLPLQIVGVTLLTLAALNCLGLLHVP LGVFFSLLFLAALILTLFLGFLHYFRPLNCFMMRNYEIIQQPMSYDNLTQRLTVEAAQFIQ 5 rNTETPFLLVLSYLHVHTALFSSKDFAGKSQHGVYGDAVEEMDWSV 6 GQILNLLDELRLANDTLIYFTSDQGAHVEEVSSKGEIHGGSNGIYK 7 gGKANNWEGGIRVPGILRWPRVIQAGQKIDEPTSNMDIFPTVAKLAGAPLPEDR 8 IIDGRDLMPLLEGKSQRSDHEFLFHYCNA 9 YLNAVRWHPQNS..TSIWKAFFFTPNFNPV 10 GSNGCFATHVCFCFGSYVTHHDPPLLFDISKDPRERNPLTPASEPRFYEILKVMQEAADRHTQTLPEVPDQFSWNNFLWKPWLQLCCPSTGLSCQCDREKQDKRLSR >STS Y pseudogene 1 RKKKIPFHLLFFP*EAESHAASRPNIILVMVDDLGIGDPGCYGNRTL 2 RTPNIDWLASEGVKLTQHLAASPLCTPSRAAFMTGR*PV.................................33 3 GMASWSCTGGFLLTASSGGLPTNEIT 4 gKWHLGMSCHSKTDFCHHPLHHSFDYFYGMTSLTHLRDSKAREGSVFTMGFKRPVFIPLQIIRVALLTLTALNCLGLLHL PLATFFGLLFLAALILTLFLGFLHYFRPLNCFMMRNYEIIQQPMSYDNLTQRLTVEAAQFIQ 5 RNTETPFLLVLSYLHVHMALFSSKDFAGKSKHGVCGDAVEEMDCSVGMSL 6 GQILNLLDELRLANDTLIYFTSDQGAHVEEVSSKGEIHGGSNGIYK 7 gGKANNWEGGIRVPGILRWPRVIQAGQKIDEPTSNMDIFPTVAKLAGAPLPED 8 RIIDGHDLMPLLEGKYQYSDHEFLFRY*NAHLNA 9 YLNAVRWHPQNSTSIWKAFFFTPNFNPV 10 GSNGCFATHVCFCFGSYVTHHDPPLLFDISKDPRERNPLTPASEPRFYEILKVMQEAADRHTQTLPEVPDQFSWNNFLWKPWLQLCCPSTGLSCQCDREKQDKRLSR >ARSD_hs 9 cds exons 0 MRSAARRGRAAPAA ... orphaned 1 RDSLPVLLFLCLLLKTCEPKTANAFKPNILLIMADDLGTGDLGCYGNNTl 2 RTPNIDQLAEEGVRLTQHLAAAPLCTPSRAAFLTGRHSFRS 83112 ... 81938 3 gMDASNGYRALQWNAGSGGLPENETTFARILQQHGYATGLI 4 gKWHQGVNCASRGDHCHHPLNHGFDYFYGMPFTLTNDCDPGRPPEVDAALRAQLWGYTQFLA LGILTLAAGQTCGFFSVSARAVTGMAGVGCLFFISWYSSFGFVRRWNCILMRNHDVTEQPMVLEKTASLMLKEAVSYIE 5 rHKHGPFLLFLSLLHVHIPLVTTSAFLGKSQHGLYGDNVEEMDWLI 6 gKVLNAIEDNGLKNSTFTYFTSDHGGHLEARDGHSQLGGWNGIYK 7 gGKGMGGWEGGIRVPGIFHWPGVLPAGRVIGEPTSLMDVFPTVVQLVGGEVPQDr 8 VIDGHSLVPLLQGAEARSAHEFLFHYCGQHLHAARWHQKDS [[old break 8-9 fused, new break to 9 9-10 fused 9 GSVWKVHYTTPQFHPEGAGACYGRGVCPCSGEGVTHHRPPLLFDLSRDPSEARPLTPDSEPLYHAVIARVGAA VSEHRQTLSPVPQQFSMSNILWKPWLQPCCGHFPFCSCHEDGDGTP >AC084294 Mus musculus htgs genomic clone RP23-169K20 14 unordered pieces Length = 199377 chr ? 
1 MHCPGLACCTILLVLGLCGAHSRNVLLIV 2 ADDGGFESGVYNNTAIATPHLDALSRHSLIFRNAFTSVSSCSPSRASLLTGLPq 3 HQNGMYGLHQDVHHFNSFDKVQSLPLLLNQAGVRTg 4 IIGKKHVGPETVYPFDFAFTEENSSVMQVGRNITRIKQLVQKFLQTQDDr 5 PFFLYVAFHDPHRCGHSQPQYGTFCEKFGNGESGMGYIPDWTPQIYDPQDVMv 6 PYFVPDTPAARADLAAQYTTIGRMDQg 7 VGLVLQELRGAGVLNDTLIIFTSDNGIPFPSGRTNLYWPGTAEPLLVSSPEHPQRWGQVSDAYVSLL 8 DLTPTILDWFSIPYPSYAIFGSKTIQLTGRSLLPALEAEPLWATVFSSQSHHEVTMSYPMRSVYHQNFRLI HNLSFKMPFPIDQDFYVSPTFQDLLNRTTTGRPTGWYKDLHRYYYRERWELYDISRDPRETRNLAADPDLAQVLEMLKAQLVKWQWETHDPWVCAPDGVLEEKLTPQCRPLHNEL >ARSE_hs 9 cds exons 0 MLHLHHSC ... orphaned 1 LCFRSWLPAMLAVLLSLAPSASSDISASRPNILLLMADDLGIGDIGCYGNNTM 2 RTPNIDRLAEDGVKLTQHISAASLCTPSRAAFLTGRYPVRS 3 GMVSSIGYRVLQWTGASGGLPTNETTFAKILKEKGYATGLI 4 GKWHLGLNCESASDHCHHPLHHGFEHFYGMPFSLMGDCARWELSEKRVNLEQKLNFLFQVLALVALTLVAGKLTHLIPVSWMPVIWSALSAVLLLASSYFVGALIV HADCFLMRNHTITEQPMCFQRTTPLILQEVASFLK 5 RNKHGPFLLFVSFLHVHIPLITMENFLGKSLHGLYGDNVEEMDWMV 6 GRILDTLDVEGLSNSTLIYFTSDHGGSLENQLGNTQYGGWNGIYK 7 GGKGMGGWEGGIRVPGIFRWPGVLPAGRVIGEPTSLMDVFPTVVRLAGGEVPQD 8 RVIDGQDLLPLLLGTAQHSDHEFLMHYCERFLHAARWHQRD 9 RGTMWKVHFVTPVFQPEGAGACYGRKVCPCFGEKVVHHDPPLLFDLSRDPSETHILTPASEPVFYQVMERVQQAVWEHQRTLSPVPLQLDRLGNIWRPWLQPCCGPFPLC WCLREDDPQ >ARSF_hs 9 exons 0 MRP 1 RRPLVFMSLVCALLNTWPGHTGCMTTRPNIVLIMVDDLGIGDLGCYGNDTM 2 RTPHIDRLAREGVRLTQHISAASLCSPSRSAFLTGRYPIRS 3 GMVSSGNRRVIQNLAVPAGLPLNETTLAALLKKQGYSTGLI 4 GKWHQGLNCDSRSDQCHHPYNYGFDYYYGMPFTLVDSCWPDPSRNTELAFESQLWLCVQLVAIAILTLTFGKLSGWVSVPWLLIFSMILFIFLLGY AWFSSHTSPLYWDCLLMRGHEITEQPMKAERAGSIMVKEAISFLE 5 RHSKETFLLFFSFLHVHTPLPTTDDFTGTSKHGLYGDNVEEMDSMV 6 GKILDAIDDFGLRNNTLVYFTSDHGGHLEARRGHAQLGGWNGIYK 7 GGKGMGGWEGGIRVPGIVRWPGKVPAGRLIKEPTSLMDILPTVASVSGGSLPQD 8 RVIDGRDLMPLLQGNVRHSEHEFLFHYCGSYLHAVRWIPKDDS 9 GSVWKAHYVTPVFQPPASGGCYVTSLCRCFGEQVTYHNPPLLFDLSRDPSESTPLTPATEPLYDFVIKKVANALKEHQETIVPVTYQLSELNQGRTWLKPCCGVFPFCLCD KEEEVSQPRGPNEKR >KIAA1001_hs cds RFVDFHAAASTCSPSRASLLTGRLGLRNGVTRNFAVTSVGGLPLNETTLAEVLQQAGYVTGIIg KWHLGHHGSYHPNFRg FDYYFGIPYSHDMGCTDTPGYNHPPCPACPQGDGPSr 
NLQRDCYTDVALPLYENLNIVEQPVNLSSLAQKYAEKATQFIQRA STSGRPFLLYVALAHMHVPLPVTQLPAAPRGRSLYGAGLWEMDSLVGQIKDKVDHTVKENTFLWFTg DNGPWAQKCELAGSVGPFTGFWQTRQg GSPAKQTTWEGGHRVPALAYWPGRVPVNVTSTALL SVLDIFPTVVALAQASLPQGRRFDGVDVSEVLFGRSQPGHRv LFHPNSGAAGEFGALQTVRLERYKAFYITg GARACDGSTGPELQHKFPLIFNLEDDTAEAVPLERGGAEYQAVLPEVRKVLADVLQDIANDNISSADYTQDPSVTPCCNPYQIACRCQAA >IDS_hs 9 exons 1 MPPPRTGRGLLWLGLVLSSVCVALGSETQANSTT 2 DALNVLLIIVDDLRPSLGCYGDKLVRSPNIDQLASHSLLFQNAFA 3 QQAVCAPSRVSFLTGRRPDTTRLYDFNSYWRVHAGNFSTIPQYFKENGYVTMSVGKVFHP 4 GISSNHTDDSPYSWSFPPYHPSSEKYENT 5 KTCRGPDGELHANLLCPVDVLDVPEGTLPDKQSTEQAIQLLEKMKTSASPFFLAVGYHKPHIPFRY 6 PKEFQKLYPLENITLAPDPEVPDGLPPVAYNPWMDIRQREDVQALNISVPYGPIPVDF 7 QRKIRQSYFASVSYLDTQVGRLLSALDDLQLANSTIIAFTSDH 8 GWALGEHGEWAKYSNFDVATHVPLIFYVPGRTASLPEAGEKLFPYLDPFDSASQLMEP 9 GRQSMDLVELVSLFPTLAGLAGLQVPPRCPVPSFHVELCREGKNLLKHFRFRDLEEDPYLPGNPRELIAYSQYPRPSDIPQWNSDKPSL KDIKIMGYSIRTIDYRYTVWVGFNPDEFLANFSDIHAGELYFVDSDPLQDHNMYNDSQGGDLFQLLMP >Sulf2_C.elegans U43375 14 exons complement 1068..1259 1392..1500 1548..1685 1759..1815 1867..2156 2543..2669 2715..2894 2946..3121 3190..3340 3396..3565 3757..3967 4026..4139 4186..4302 4356..4453 >AE003522 CG7408 gp Drosophila melanogaster "486 aa" 585 (33%) ARSB false gene start, 18aa short of site, 99 aa missing, aligns to ARSB starting with LLLLAPP 29 to 1; align ends at HNEWTWW...554 MSTHLDKFSSATSLLTGFVLCIALSNGIVATSDKPNIIIIMADDLGFDDVSFRGSNNFLTPNIDALAYSGVILNNLYVAPMCTPSRAALLTGKYPINTGMQHYVIVNDQPWGLPLNETTMAEIFRENGYRTSLLGKWHLGLSQRNFTPTERGFDRHLGYLGAYVDYYTQSYEQQNKGYNGHDFRDSLKSTHDHVGHYVTDLLTDAAVKEIEDHGSKNSSQPLFLLLNHLAPHAANDDDPMQAPAEEVSRFEYISNKTHRYYAAMVSRLDKSVGSVIDALARQEMLQNSIILFLSDNGGPTQGQHSTTASNYPLRGQKNSPWEGALRSSAAIWSTEFERLGSVWKQQIYIGDLLPTLAAAAGISPDPALHLDGLNLWSALKYGYESVEREIVHVIDEDVAEPHLSYTRGKWKVISGTTNQGLYDGWLGHRETSEVDPRAVEYEELVRNTSVWLQLQQVSFGERNISELRDQSRIECPDPATGVKPCLPLEGPCLFDIEADPCERSNLYAEYQNSTIFLDLWSRIQQFAKQAHPPNNKPGDPNCDPRFYHNEWTWWqdekasssgtigvmkvftllvvfiftsscin 3 ESTs support a normal gene here: SD22483 678 bp SD23718 RE13542 670 bp 
cgcacttcgagtccggcggatcagagattagaattcggggaacgattaactcgatcgcgaataccagtgactgaaattggcatgcacgaatctagctgataagtccattgttatttggattccttttatttgatttcgcattataccgctaatatatccggaacgatctgaagagctcatccatcattgacagagtgttacccgctgacttgttgtcgcccatttgctgtcccacatcatcctcactccattcacggcg
LVVAHLLSHIILTPFTAMSTHLDKFSSATSLLTGFVLCIALSNGIVATSDKPNIIIIMADDLGFDDVSFRGSNNFLTPNIDALAYSGVILNNLYVAPMCTPSRAALLTGKYPINTGMQHYVIVNDQPWGLPLNETTMAEIFRENGYRTSLLGKWHL

Drosophila sulfatase cluster:
CG5584 75A2 "N-acetylgalactosamine-4-sulfatase" 996 aa
CG7402 75A6--8 "N-acetylgalactosamine-4-sulfatase" 579 aa
CG7408 75A4--7 "N-acetylgalactosamine-4-sulfatase" 486 aa

5 coding exons:
exon1: 201141 MSTHLDKFSSATSLLTGFVLCIALSNGIVATSDKPN 201034
exon2: 200951 GFDDVSFRGSNNFLTPNIDALAYSGVILNNLYVAPMCTPSRAALLTGKYPINTG 200790
exon3: 200720 MQHYVIVNDQPWGLPLNETTMAEIFRENGY... DYYTQSYEQQ 200496
exon4: 200256 NKGYNGHDFRDSLKS........QGQHSTTASNYPLRG 199834
exon5: 199503 QKNSPWEGALRSSAAIWS...KVFTLLVVFIFTSSCIN 198694

Relationship of the Drosophila proteins, relative to CG7408:
AE003522 CG5584 gp Drosophila melanogaster 996 aa (41%) ... 1301 1.6e-137 1
AE003522 CG7402 gp Drosophila melanogaster 579 aa (43%) ... 1277 5.7e-135 1
AE003821 CG8646 gp Drosophila melanogaster 542 aa (51%) ... 676 3.5e-112 2
AAF55607 CG14291 gp 524 aa Drosophila melanogaster (53%)... 190 5.2e-17 2
AE003712 Sulf1 Drosophila melanogaster 1114 aa (59%) KIA... 187 1.5e-14 2
AE003478 CG12014 gp Drosophila melanogaster 512 aa (46%)... 131 1.2e-09 2
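The exon coordinates quoted above can be sanity-checked mechanically: the inclusive span of each exon should match the length of the peptide fragment it is said to encode, and the spans can be summed to see whether the set stays in frame. The following is a minimal Python sketch using only coordinates printed on this page; the function names `parse_ranges` and `exon_lengths` are illustrative and not part of any annotation toolkit.

```python
import re

def parse_ranges(text):
    """Parse GenBank-style 'start..end' pairs into (start, end) tuples."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", text)]

def exon_lengths(ranges):
    """Inclusive length of each exon; works for either strand orientation."""
    return [abs(end - start) + 1 for start, end in ranges]

# Sulf2_C.elegans U43375, 14 exons (coordinates as listed on this page)
SULF2_CE = ("1068..1259 1392..1500 1548..1685 1759..1815 1867..2156 "
            "2543..2669 2715..2894 2946..3121 3190..3340 3396..3565 "
            "3757..3967 4026..4139 4186..4302 4356..4453")

# Drosophila CG7408, 5 coding exons listed with descending coordinates
CG7408 = [(201141, 201034), (200951, 200790), (200720, 200496),
          (200256, 199834), (199503, 198694)]

sulf2_lengths = exon_lengths(parse_ranges(SULF2_CE))
cg7408_lengths = exon_lengths(CG7408)

for name, lengths in [("Sulf2_C.elegans", sulf2_lengths),
                      ("CG7408", cg7408_lengths)]:
    total = sum(lengths)
    print(name, lengths, "total nt:", total, "in frame:", total % 3 == 0)
```

With the coordinates as given, exon 1 of CG7408 spans 108 nt (36 codons), which agrees with the 36-residue fragment shown for it, and exon 2 spans 162 nt (54 codons), again matching its fragment. The 14 C. elegans exons sum to 2130 nt and the 5 CG7408 exons to 1728 nt, both multiples of three.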
... compare exon boundaries of the 16 human sulfatases relative to key residues
... visit OMIM to see if the 6 new sulfatases correspond to positionally mapped diseases
... visit PubMed to collect sulfated metabolites for candidate substrates, e.g. dopamine sulfate
... finish the synonym glossary and regularize the tentative names for unpublished sulfatases
... finish checking into alternative splicing in human sulfatases and resultant proteins
... exclude missing sequences by converging alkaline phosphatase to sulfatase probe
... thread 3D structures to make plausible disulfide proximity assignments
... map the known substrates onto phylogenetic tree to suggest substrates for duplications
... collect and map the known mutations onto the structures
... does the metal ion change as the coordination residues change, and are the latter cross-correlated?
... overview of sulfotransferases because of off-setting regulation
... check if GNS/KIAA duplication is still recognizable in genomes
... identification of MSD modification gene
... provide more details on 3 new human sulfatases
... add leader peptides to extended fasta format; identify cation ligands throughout
... add counter, collect sulfatase emails, notify OMIM, fix best links

Human genome location of sulfotransferases (August 01 assembly):
Gene | Position | Chr | Band | Accession | Description |
CHST1 | chr11:49003217-49019659 | 11 | p11.2 | NM_003654 | chondroitin 6/keratan |
CHST2 | chr3:163373104-163375596 | 3 | q23 | NM_004267 | chondroitin 6/keratan |
CHST3 | chr10:78488846-78533765 | 10 | q22.1 | NM_004273 | chondroitin 6/keratan |
CHST4 | chr16:85850815-85863171 | 16 | q22.2 | NM_005769 | N-acetylglucosamine 6-O sulfotransferase |
CHST5 | chr16:90730589-90747406 | 16 | q23.1 | NM_012126 | N-acetylglucosamine 6-O sulfotransferase |
CHST6 | chr16:90680422-90697352 | 16 | q23.1 | NM_021615 | N-acetylglucosamine 6-O sulfotransferase |
CHST7 | chrX:46968426-46992908 | X | p11.23 | NM_019886 | N-acetylglucosamine 6-O sulfotransferase |
CHST8 | chr19:41880733-41969172 | 19 | q13.11 | NM_022467 | N-acetylgalactosamine-4-O-sulfotransferase |
HNK1ST | chr2:104031497-104122622 | 2 | q11.2 | NM_004854 | HNK-1 sulfotransferase |
HS2ST1 | chr1:101010965-101048005 | 1 | p22.3 | NM_012262 | heparan sulfate 2-O-sulfotransferase 1 |
HS3ST1 | chr4:12991117-12992413 | 4 | p15.33 | NM_005114 | heparan sulfate D-glucosaminyl |
HS3ST2 | chr16:27575495-27676992 | 16 | p12.2 | NM_006043 | heparan sulfate D-glucosaminyl |
HS3ST3A1 | chr17:14846992-14953230 | 17 | p11.2 | NM_006042 | heparan sulfate D-glucosaminyl |
HS3ST3A2 | - | - | - | - | - |
HS3ST3B1 | chr17:15652492-15697478 | 17 | p11.2 | NM_006041 | heparan sulfate D-glucosaminyl |
HS3ST3B2 | - | - | - | - | - |
HS3ST4 | - | 16 | - | NT_010604 | - |
HS3ST5 | - | 16 | p13.3 | - | - |
HS6ST | chr2:132836865-132888085 | 2 | q21.1 | NM_004807 | heparan sulfate 6-O-sulfotransferase |
NDST1 | chr5:165245132-165290827 | 5 | q33.1 | NM_001543 | N-deacetylase/N-sulfotransferase (heparan |
NDST2 | chr10:66113014-66119515 | 10 | q21.2 | NM_003635 | N-deacetylase/N-sulfotransferase (heparan |
NDST3 | chr4:129869870-130413364 | 4 | q26 | NM_004784 | N-deacetylase/N-sulfotransferase (heparan |
NDST4 | chr4:126364694-126634426 | 4 | q26 | NM_022569 | N-deacetylase/N-sulfotransferase (heparan |
SULT1A1 | chr16:34232477-34236209 | 16 | p11.2 | NM_001055 | sulfotransferase family, cytosolic, 1A, |
SULT1A2 | chr16:34302736-34307862 | 16 | p11.2 | NM_001054 | sulfotransferase family, cytosolic, 1A, |
SULT1A3 | chr16:84175628-84185516 | 16 | q22.1 | NM_003166 | sulfotransferase family, cytosolic, 1A, |
SULT1C1 | chr2:112366547-112387460 | 2 | q12.3 | NM_001056 | sulfotransferase family, cytosolic, 1C, |
SULT1C2 | chr2:112455590-112465628 | 2 | q12.3 | NM_006588 | sulfotransferase family, cytosolic, 1C, |
SULT2A1 | chr19:58784636-58799587 | 19 | q13.32 | NM_003167 | sulfotransferase family, cytosolic, 2A, |
SULT2B1 | chr19:59910354-59934240 | 19 | q13.32 | NM_004605 | sulfotransferase family, cytosolic, 2B, |
SULT4A1 | chr22:40838242-40875040 | 22 | q13.31 | NM_014351 | sulfotransferase family 4A, member 1 |
TPST1 | chr7:68803924-68961571 | 7 | q11.21 | NM_003596 | tyrosylprotein sulfotransferase 1 |
TPST2 | chr22:23617817-23682152 | 22 | q12.1 | NM_003595 | tyrosylprotein sulfotransferase 2 |
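Because tandem duplication is a recurring theme for this gene family, it can be useful to scan the table above for sulfotransferases that sit close together on the same chromosome. The sketch below is a minimal Python illustration, not part of the original annotation: it parses a handful of rows copied from the table and reports adjacent genes separated by less than a chosen gap. The 100 kb threshold and the function names `parse_rows` and `neighbours` are assumptions made for the example.

```python
import re
from collections import defaultdict

# A few rows copied from the table (August 01 assembly coordinates).
ROWS = """\
CHST4 | chr16:85850815-85863171 | 16 | q22.2 | NM_005769 |
CHST5 | chr16:90730589-90747406 | 16 | q23.1 | NM_012126 |
CHST6 | chr16:90680422-90697352 | 16 | q23.1 | NM_021615 |
SULT1A1 | chr16:34232477-34236209 | 16 | p11.2 | NM_001055 |
SULT1A2 | chr16:34302736-34307862 | 16 | p11.2 | NM_001054 |
HS3ST3A2 | - | - | - | - | - |
"""

POSITION = re.compile(r"chr(\w+):(\d+)-(\d+)")

def parse_rows(text):
    """Yield (gene, chrom, start, end) for rows that carry a mapped position."""
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2:
            continue
        m = POSITION.fullmatch(fields[1])
        if m:  # unmapped genes ('-') are skipped
            yield fields[0], m.group(1), int(m.group(2)), int(m.group(3))

def neighbours(genes, max_gap=100_000):
    """Return (gene_a, gene_b, gap) for adjacent loci on one chromosome
    whose gap is at most max_gap bp (threshold is an assumption)."""
    by_chrom = defaultdict(list)
    for gene, chrom, start, end in genes:
        by_chrom[chrom].append((start, end, gene))
    pairs = []
    for loci in by_chrom.values():
        loci.sort()
        for (s1, e1, g1), (s2, e2, g2) in zip(loci, loci[1:]):
            gap = s2 - e1
            if gap <= max_gap:
                pairs.append((g1, g2, gap))
    return pairs

pairs = neighbours(parse_rows(ROWS))
for g1, g2, gap in pairs:
    print(f"{g1} - {g2}: {gap} bp apart")
```

On these rows the scan pairs SULT1A1 with SULT1A2 (66527 bp apart) and CHST6 with CHST5 (33237 bp apart), while CHST4 lies several megabases away from either group. Proximity alone does not establish a tandem duplication, so any pair flagged this way would still need sequence comparison.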