[Air-L] Information wants to be ASCII or Unicode? Tibetan-written information cannot be ASCII anyway.

Joseph Reagle reagle at mit.edu
Thu Jul 16 12:03:39 PDT 2009


On Thursday 16 July 2009, Mike Stanger wrote:
> My comment wasn't intended to imply that it was confusing to implement  
> your own software in Unicode, but that Unicode is still an encoding,  
> and you have to deal with it, its assumptions, and the assumptions of  

Just as a bit of evidence of how difficult it can be to grok character issues: Unicode is not "an encoding" itself, but a repertoire of characters, their names, and (abstract) code points (i.e., UCS), plus a set of encodings (e.g., UTF-8, UTF-16), extra properties, and algorithms. And I'm sure a Unicode geek could pick some holes in what I've said!
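
To make the repertoire/encoding distinction concrete, here is a minimal Python 3 sketch. The choice of U+0F40 (a Tibetan letter) is only an illustrative assumption, picked to match the thread's subject; any non-ASCII character would show the same thing: one abstract code point, several possible byte sequences, and no ASCII representation at all.

```python
import unicodedata

ch = "\u0f40"                       # one character from the repertoire

name = unicodedata.name(ch)         # its name: 'TIBETAN LETTER KA'
code_point = ord(ch)                # its abstract code point: 0x0F40

# The same code point serialized under two different encodings.
utf8_bytes = ch.encode("utf-8")     # three bytes
utf16_bytes = ch.encode("utf-16-be")  # two bytes

# ASCII simply has no slot for this character.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(name, hex(code_point), utf8_bytes, utf16_bytes, ascii_ok)
```

The code point stays the same throughout; only the bytes change with the encoding chosen.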

Unfortunately, I've spent lots of time wrapping my head around these issues in XML and Python. The character repertoire is still growing; imagine what that can mean for digital signatures on XML documents. Or how easy it is to trick people into thinking they are going to a URL they know when you can pull off character hijinks with IRIs. Byte-order marks (BOMs), transcoding, character decomposition, etc. *are* confusing to implement. So it's not just a matter of lazy westerners. I personally look forward to the day when I can use Python 3.*, as it is only now that we are finally moving into a Unicode world.
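
A rough Python 3 sketch of three of these pitfalls, using only the standard library. The sample strings (an accented "e", the text "hi", and a look-alike of "example.com") are hypothetical illustrations and do not come from any real incident; SHA-256 stands in for whatever digest a signature scheme might use.

```python
import codecs
import hashlib
import unicodedata

# 1. Decomposition: identical on screen, different code point sequences.
composed = "\u00e9"        # e-acute as a single code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Different code points mean different bytes, so a digest (and hence a
# signature) differs unless both sides normalize first.
digest_a = hashlib.sha256(composed.encode("utf-8")).hexdigest()
digest_b = hashlib.sha256(decomposed.encode("utf-8")).hexdigest()

# 2. Byte-order marks: the plain 'utf-8' codec keeps a leading BOM as
# U+FEFF in the decoded text, while 'utf-8-sig' strips it.
data = codecs.BOM_UTF8 + "hi".encode("utf-8")
kept_bom = data.decode("utf-8")
stripped_bom = data.decode("utf-8-sig")

# 3. Look-alike characters in an IRI: the second string swaps in a
# Cyrillic letter that most fonts draw like a Latin 'a'.
latin_host = "example.com"
mixed_host = "ex\u0430mple.com"
impostor_name = unicodedata.name(mixed_host[2])

print(equal_raw, equal_nfc, digest_a == digest_b)
print(repr(kept_bom), repr(stripped_bom))
print(latin_host == mixed_host, impostor_name)
```

None of this is exotic code, which is rather the point: each case looks fine to the eye and only fails when the bytes or code points are compared.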

(Something I was just dealing with today: not even bibtex8 can deal with all of Unicode, for those of us who use LaTeX.)


