[Air-L] Information wants to be ASCII or Unicode? Tibetan-written information cannot be ASCII anyway.

Han-Teng Liao (OII) han-teng.liao at oii.ox.ac.uk
Thu Jul 16 22:02:57 PDT 2009


At the risk of taking your comments out of context, I have listed my 
responses below.
Mike Stanger wrote:
> ......The use of Unicode believing that it solves the interoperability 
> issues and/or is a communication about the intent of the programmer is 
> much the same sin, in my view.
I am not sure that "not using" Unicode can solve the interoperability 
issues.  If Unicode is one of the more attractive ways to deliver some 
interoperability (as Google, Wikipedia, YouTube, etc. try to do), then 
I do not see how the two beliefs are "much the same sin". 

> ...... However, just using unicode isn't going to resolve all of the 
> interoperability issues (eg. reading direction, and other unique 
> features of the written form of a particular language, etc.). 
Agreed, using Unicode by itself cannot save the world.  Still, would you 
mind showing me how not using Unicode, or using some alternative, would 
solve these issues better?  If such a solution or vision exists, why 
have Google, Wikipedia, Microsoft, Linux, Mac, etc. adopted Unicode?  I 
am not citing these examples to refute your argument.  I am genuinely 
intrigued to find out why they arrived at this solution and not others 
(including maintaining the status quo by not deploying Unicode to some 
extent). 

> Ultimately though, what data storage in Unicode does provide almost 
> automatically is the preservation of the appropriate data (unless it 
> gets transformed of course), and its use /could potentially/ signal 
> the intent by the author to enable the coexistence of mixed language 
> content as a politically friendly gesture. 
>  I would agree that character encodings could potentially send a 
> signal about the /intent/ to be good internet citizens, or that the 
> /intentional/ use of something other than unicode could be seen as a 
> statement of political position (eg. mainland China's use of jianti 
> character sets in a particular code page vs. a codepage that supported 
> fanti). 
Agreed, good will matters.  Still, the efforts to deliver on that good 
will matter as well.  I will present some evidence in another email 
that, for Perl (the programming language that supports MediaWiki, which 
makes Wikipedia possible) and for the logos of Wikipedia and Chinese 
Wikipedia, most of the work was requested and done by those who need 
Unicode support.  So the picture is not only one of good will but also 
of some kind of push and pull.

> However, I think often programmer intent is lost in the end-product. 
>  It would be encouraging to see a movement where programmers stated 
> that their /active decision/ to use Unicode is a deliberate 
> recognition of the multitude of languages as a 'politically friendly' 
> gesture.
"Politically friendly" or "politically correct" could sound a bit 
patronizing.  I would argue that it is more a matter of Wikipedia 
benefiting from its other language versions (ranking higher in search 
results, a better webometric position, etc.). 

>
> I also assume that there are many coders who are using unicode, but 
> doing so less than deliberately, perhaps even as a side-effect of the 
> development environment that they use (eg. Java's native 
> character/string support), /mirroring the use of ASCII in earlier 
> environments/. These applications may well support Unicode at the 
> character level, but because the programmer's use of Unicode is a sort 
> of side-effect, the end product may not actually interoperate with 
> other languages properly or completely.
> So while I agree that the use of Unicode is a step forward in 
> interoperability, I'd argue that the work to be done is not so much 
> about the use of Unicode, but the '/publicly' stated intent to be 
> interoperable./ Unicode may be one tool that can assist in that goal 
> if used properly, but the use of Unicode alone says little about intent.
I slightly disagree on the meaning of interoperability.  If 
interoperability means that a given linguistic space can keep using a 
non-Unicode standard, then it may create a linguistic hierarchy.  For 
example, if Chinese user-generated websites use GB2312 throughout, then 
Tibetan and traditional Chinese characters cannot have a voice there.  
Or imagine that YouTube could not automatically accept content 
contributed by Arabic or Persian users, and offered only some kind of 
"interfaces" that promise interoperability.  To me the issue is not 
full support of Unicode at this moment, but awareness: the fact that 
Unicode is arguably the most open linguistic infrastructure receives 
little attention.
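
To make the GB2312 point concrete, here is a minimal sketch in Python.  
It assumes only Python's built-in codecs; the helper name can_encode and 
the sample strings are my own illustrations, not taken from any of the 
sites mentioned.  It checks which sample texts a given encoding can 
represent at all.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative samples (hypothetical choices, one short word per script).
samples = {
    "simplified Chinese": "汉语",
    "traditional Chinese": "漢語",
    "Tibetan": "\u0f56\u0f7c\u0f51",  # Tibetan script, written as escapes
    "Persian": "فارسی",
}

# Compare a legacy 7-bit set, a national standard, and a Unicode encoding.
for encoding in ("ascii", "gb2312", "utf-8"):
    for label, text in samples.items():
        status = "ok" if can_encode(text, encoding) else "cannot encode"
        print(f"{encoding:7} {label:20} {status}")
```

Under these assumptions only UTF-8 should accept all four samples, 
while GB2312 should accept only the simplified Chinese one.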

Then the sharp question is: can Beijing, Washington, London, and Tokyo 
deliver their government services and communicative spaces while 
staying in their linguistic ghettos, without using Unicode or another 
open linguistic architecture? 

-- 
Han-Teng Liao
PhD Candidate
Oxford Internet Institute
http://www.oii.ox.ac.uk/people/students.cfm?id=123



