[Air-L] Information wants to be ASCII or Unicode? Tibetan-written information cannot be ASCII anyway.

Han-Teng Liao (OII) han-teng.liao at oii.ox.ac.uk
Thu Jul 16 22:24:44 PDT 2009

Using Wikipedia as a case to further the discussion

(1) The history of the Wikipedia logo: from an English-only to an
international identity, with some mistakes along the way.

(2) An unsung hero (in my personal view, open to debate): Autrijus
Tang and his work on Perl internationalization.
Tang is a Taiwanese hacker.

(3) Unicode's support in Wikipedia
I have had trouble locating the version-control records that would show
when Unicode support began and when it became complete.
http://meta.wikimedia.org/wiki/Wikipedia_timeline   (Unicode is not
mentioned here)
However, the "Chinese Wikipedia" entry in the English Wikipedia
contains the following paragraphs:


The Chinese Wikipedia was established along with 12 other Wikipedias in
May 2001. At the beginning, however, the Chinese Wikipedia did not
support Chinese characters, and had no encyclopedic content.
It was in October 2002 that the first Chinese-language page was written,
the Main Page <http://zh.wikipedia.org/wiki/>. The first registered user
of the Chinese Wikipedia was Mountain. A software update on October 27,
2002 allowed Chinese language input. .....
In order to accommodate the orthographic differences between simplified
Chinese and traditional Chinese (or Orthodox Chinese), from 2002 to
2003, Chinese Wikipedia community gradually decided to combine the two
originally separate versions of Chinese Wikipedia. The first running
automatic conversion between the two orthographic representation starts
from December 23, 2004, with MediaWiki 1.4 release. The needs from Hong
Kong and Singapore were taken into accounts in MediaWiki 1.4.2 release,
which made conversion table for zh-sg default to zh-cn, and zh-hk
default to zh-tw.
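
To make the last point of the quoted passage concrete, here is a
minimal sketch of how such a default (fallback) between variants might
work. It is only an illustration: the names VARIANT_FALLBACK,
CONVERSION_TABLES, resolve_variant and convert are hypothetical, the
two-character tables are made up for the example, and none of this is
MediaWiki's actual code or data.

```python
# Hypothetical fallback map mirroring the defaults described in the quote:
# zh-sg falls back to zh-cn, zh-hk falls back to zh-tw.
VARIANT_FALLBACK = {"zh-sg": "zh-cn", "zh-hk": "zh-tw"}

# Tiny illustrative character tables (assumed entries, not real data).
CONVERSION_TABLES = {
    "zh-cn": {"漢": "汉", "體": "体"},  # traditional -> simplified
    "zh-tw": {"汉": "漢", "体": "體"},  # simplified -> traditional
}


def resolve_variant(variant: str) -> str:
    """Return the variant whose conversion table should be used."""
    if variant in CONVERSION_TABLES:
        return variant
    # No table of its own: use the default variant, if one is defined.
    return VARIANT_FALLBACK.get(variant, variant)


def convert(text: str, variant: str) -> str:
    """Convert text character by character for the requested variant."""
    table = CONVERSION_TABLES.get(resolve_variant(variant), {})
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(convert("漢字", "zh-sg"))  # uses the zh-cn table
    print(convert("汉字", "zh-hk"))  # uses the zh-tw table
```

Real conversion is harder than this, since some simplified characters
correspond to more than one traditional character, so a per-character
table like the one above is only a first approximation.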


Overall, the evidence above suggests that Wikipedia's
internationalization was a deliberate effort to adopt the Unicode
standard, driven mostly by the people who needed Unicode.  It is worth
pointing out that around 2001 and 2002, the major operating systems
that most ordinary PC users ran at the time, such as Microsoft Windows
and Mac OS, do not seem to have been Unicode-ready yet, which makes
this development in Wikipedia all the more interesting.
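
The claim in the subject line, and the GB2312 example in the quoted
message below, can be checked with a few lines of Python. This is only
a rough sketch: the sample strings and the helper names can_encode and
support_matrix are my own choices for illustration, and it relies on
the ascii, gb2312 and utf-8 codecs that ship with Python.

```python
# Sample strings chosen for illustration only.
SAMPLES = {
    "English": "information",
    "Simplified Chinese": "汉字",
    "Traditional Chinese": "漢字",
    "Tibetan": "བོད་ཡིག",
}


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def support_matrix(encodings=("ascii", "gb2312", "utf-8")):
    """Map each sample name to {encoding: whether it can be encoded}."""
    return {
        name: {enc: can_encode(text, enc) for enc in encodings}
        for name, text in SAMPLES.items()
    }


if __name__ == "__main__":
    for name, row in support_matrix().items():
        print(name, row)
```

With these samples, only the English string fits in ASCII, GB2312 adds
the simplified Chinese string but not the traditional or Tibetan ones,
and UTF-8 (a Unicode encoding) accepts all four.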

Again, coming back to the original question: why does Wikipedia want to
be Unicode?  Or... why does Wikipedia not choose other solutions to deliver

Han-Teng Liao
PhD Candidate
Oxford Internet Institute

Han-Teng Liao (OII) wrote:
> Running the risk of taking your comments out of the context, I have 
> listed the following responses.
> Mike Stanger wrote:
>> ......The use of Unicode believing that it solves the 
>> interoperability issues and/or is a communication about the intent of 
>> the programmer is much the same sin, in my view.
> Not sure about "not using" Unicode can solve the interoperability 
> issues.  If the use of Unicode is one of the more attractive solutions 
> that can deliver some interoperability solutions (as Google, 
> Wikipedia, Youtube, etc. try to do, then I do not know whether the two 
> belief is "much the same sin".
>> ...... However, just using unicode isn't going to resolve all of the 
>> interoperability issues (eg. reading direction, and other unique 
>> features of the written form of a particular language, etc.). 
> Agree, using Unicode by itself cannot save the world. Still, do you 
> mind showing me not using Unicode or other alternatives would solve 
> the issues better?  If such solution or vision does exist, why Google, 
> Wikipedia, Microsoft, Linux, Mac, etc., adopts the Unicode?  I am not 
> citing these examples to refute your argument.  I am genuinely 
> intrigued to find out why they come to certain solution but not others 
> (including maintaining the status quo by not deploying Unicode to some 
> extent).
>> Ultimately though, what data storage in Unicode does provide almost 
>> automatically is the preservation of the appropriate data (unless it 
>> gets transformed of course), and its use /could potentially/ signal 
>> the intent by the author to enable the coexistence of mixed language 
>> content as a politically friendly gesture.  I would agree that 
>> character encodings could potentially send a signal about the 
>> /intent/ to be good internet citizens, or that the /intentional/ use 
>> of something other than unicode could be seen as a statement of 
>> political position (eg. mainland China's use of jianti character sets 
>> in a particular code page vs. a codepage that supported fanti). 
> Agree, good will matters.  Still, efforts to deliver that good will 
> matter as well.  I will exhibit some evidence in another email that 
> inside Perl (the programming language that supports MediaWiki which 
> makes Wikipedia possible) and the logo of Wikipedia and Chinese 
> Wikipedia, most of the efforts are requested and done by those who 
> need Unicode support.  Then it is not only a picture of good will but 
> some kind of push and pull.
>> However, I think often programmer intent is lost in the end-product. 
>>  It would be encouraging to see a movement where programmers stated 
>> that their /active decision/ to use Unicode is a deliberate 
>> recognition of the multitude of languages as a 'politically friendly' 
>> gesture.
> Politically friendly or politically correct could be a bit 
> patronizing.  I will argue that Wikipedia benefits more from other 
> language versions (ranking higher in search results, better webometric 
> position, etc.).
>> I also assume that there are many coders who are using unicode, but 
>> doing so less than deliberately, perhaps even as a side-effect of the 
>> development environment that they use (eg. Java's native 
>> character/string support), /mirroring the use of ASCII in earlier 
>> environments/. These applications may well support Unicode at the 
>> character level, but because the programmer's use of Unicode is a 
>> sort of side-effect, the end product may not actually interoperate 
>> with other languages properly or completely.
>> So while I agree that the use of Unicode is a step forward in 
>> interoperability, I'd argue that the work to be done is not so much 
>> about the use of Unicode, but the '/publicly' stated intent to be 
>> interoperable./ Unicode may be one tool that can assist in that goal 
>> if used properly, but the use of Unicode alone says little about intent.
> I slightly disagree on the meaning of interoperability.  If 
> interoperability means a certain linguistic space can still use a 
> non-Unicode standard, then it may create a linguistic hierarchy.  For 
> example, Chinese can use GB2312 through out in their user-generated 
> websites, and then Tibetans and traditional Chinese characters cannot 
> have a voice.  Again imagine Youtube cannot automatically take the 
> content contributed by Arabic or Persian users, but only some kind of 
> "interfaces" to promise the interoperability.  To me it is not about a 
> full support of Unicode at this moment, but it is the awareness that 
> the fact that Unicode is arguably the most open linguistic 
> infrastructure receives little attention.
> Then the sharp question will be, can Beijing, Washington, London, 
> Tokyo deliver their government services and communicative spaces by 
> sticking to their linguistic ghetto without using Unicode or other 
> open linguistic architecture?
