[Air-L] Information wants to be ASCII or Unicode? Tibetan-written information cannot be ASCII anyway.

Mike Stanger mstanger at sfu.ca
Thu Jul 16 10:23:00 PDT 2009


On Jul 15, 2009, at 5:04 PM, Han-Teng Liao (OII) wrote:

> I agree that "Information wants to be digital", and that is why we
> should start an honest conversation among programmers, IT support
> staff, academics, and policy makers.
> I disagree with the notion that the technical support of Unicode
> is confusing for programmers.  Please refer to the following
> blog post:
>
> The Absolute Minimum Every Software Developer Absolutely, Positively  
> Must Know About Unicode and Character Sets (No Excuses!)  by Joel  
> Spolsky
> http://www.joelonsoftware.com/articles/Unicode.html

Yes, I read that many years ago, when we first needed to address
codepages/Unicode/etc.  :-) But I think the article's main point is
that a programmer's assumption (or ignorance) that a given text
string is meaningful as ASCII is problematic, and that Unicode is a
better general solution for a number of reasons. Using Unicode in
the belief that it solves the interoperability issues, or that it by
itself communicates the intent of the programmer, is much the same
sin, in my view.
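
To make that point concrete, here is a minimal Python sketch (the
Russian string and the two codepages are just illustrative choices):

    # The same bytes mean different things under different codepages;
    # bytes carry no text until an encoding is assumed or declared.
    raw = "Привет".encode("koi8-r")   # Russian "hello" in a legacy codepage

    print(raw.decode("koi8-r"))       # round-trips correctly
    print(raw.decode("cp1252"))       # mojibake: same bytes, wrong assumption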

My comment wasn't intended to imply that it is confusing to
implement your own software in Unicode, but that Unicode is still an
encoding: you have to deal with it, with its assumptions, with the
assumptions of the programmers, and with the assumptions of the same
set of actors behind any software or system with which you might
interact. If one is intentionally trying to be interoperable, the
concerns are much the same as when dealing with traditional
codepages; one just uses different approaches to ensure the content
can be displayed accurately by the time it makes its way to the
ultimate recipient of the information. Simply using Unicode isn't
going to resolve all of the interoperability issues (e.g. reading
direction, and other unique features of the written form of a
particular language, etc.).
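
As one small example of what Unicode alone leaves unresolved, here
is a Python sketch of the normalization problem (the strings are
hypothetical, chosen only for illustration):

    import unicodedata

    # Two renderings of "café" that look identical on screen:
    # precomposed é (U+00E9) vs. "e" plus combining acute (U+0301).
    a = "caf\u00e9"
    b = "cafe\u0301"

    print(a == b)   # False: same visible text, different codepoints
    # Interoperable software still has to agree on a normalization form:
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))   # True after normalizing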

Ultimately, though, what data storage in Unicode does provide almost
automatically is the preservation of the appropriate data (unless it
gets transformed, of course), and its use could potentially signal
the author's intent to enable the coexistence of mixed-language
content as a politically friendly gesture.  I would agree that
character encodings could send a signal about the intent to be good
internet citizens, or that the intentional use of something other
than Unicode could be seen as a statement of political position
(e.g. mainland China's use of jianti (simplified) character sets in
a particular codepage vs. a codepage that supported fanti
(traditional) characters).
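
A small Python sketch of that codepage point, assuming Python's
built-in gb2312 codec (GB2312 is a simplified-only codepage):

    # "体" is the jianti (simplified) form, "體" the fanti
    # (traditional) form of the same character.
    mixed = "\u4f53\u9ad4"

    print(mixed.encode("utf-8"))   # Unicode has codepoints for both forms

    try:
        mixed.encode("gb2312")     # the simplified-only codepage fails
    except UnicodeEncodeError as e:
        print("gb2312 cannot carry the fanti form:", e)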

However, I think programmer intent is often lost in the end product.
It would be encouraging to see a movement in which programmers
stated that their active decision to use Unicode is a deliberate
recognition of the multitude of languages, as a 'politically
friendly' gesture.

I also assume that there are many coders who are using Unicode but
doing so less than deliberately, perhaps even as a side effect of
the development environment they use (e.g. Java's native
character/string support), mirroring the use of ASCII in earlier
environments. These applications may well support Unicode at the
character level, but because the programmer's use of Unicode is a
sort of side effect, the end product may not actually interoperate
with other languages properly or completely.
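
To illustrate that side-effect problem, a minimal Python sketch (the
file name is hypothetical):

    # Strings are Unicode in memory "for free", but every I/O boundary
    # still has an encoding. Omitting it falls back to the platform
    # default, which is exactly the less-than-deliberate use above.
    greeting = "\uc548\ub155\ud558\uc138\uc694"   # Korean "hello"

    with open("greeting.txt", "w", encoding="utf-8") as f:
        f.write(greeting)

    # On a machine whose default locale encoding is not UTF-8
    # (e.g. cp1252), this read yields mojibake or a UnicodeDecodeError:
    with open("greeting.txt") as f:   # encoding omitted: platform default
        print(f.read())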

So while I agree that the use of Unicode is a step forward in
interoperability, I'd argue that the work to be done is not so much
about the use of Unicode itself as about the 'publicly' stated
intent to be interoperable. Unicode may be one tool that can assist
in that goal if used properly, but the use of Unicode alone says
little about intent.

Mike



> We can debate the technical implementation on and on (but I
> hope the above link has settled the technical debate).  However, it
> would be better to ask first whether we need, for example, Korean,
> Japanese and Chinese to *coexist* on the same page, or,
> alternatively, Hebrew and Arabic to *coexist* on the same page.  If
> the social and communicative need across languages is among our
> priorities in supporting a better Internet environment, then the
> answer is obvious.  Again, the reason why Unicode is supported and
> maintained by industry and experts points to the fact that Google,
> YouTube, Facebook and other websites support Unicode probably for a
> simple reason: they want to reach other local markets.
> My short-cut, simplified understanding of the whole software
> industry movement in i18n (internationalization) and l10n
> (localization) is as follows.  The industry (along with the open
> source community, which actually excels in i18n and l10n) proposed
> Unicode by first imagining there is limitless space (codepoints)
> for alphabets/scripts/strokes/characters to be assigned.  The
> industry can then compete to implement them and satisfy any
> potential markets.
> So I am of the opinion that Unicode is actually market-friendly and
> potentially programmer-friendly.  It takes more effort to make it
> politically friendly rather than merely politically correct.  I hope
> my starting point is not about multicultural or multilingual
> correctness, but about the open nature of the Internet....
>
> All languages want to be digital.  We have enough space for them.
>



