[Air-L] Information wants to be ASCII or Unicode? Tibetan-written information cannot be ASCII anyway.

Fri Jul 17 11:46:06 PDT 2009

I'll combine my responses to multiple messages into one; hopefully I
don't break the context:

> First of all, I have to reframe the question in different way. Is  
> the problem of ASCII or the problem of Unicode we are talking  
> about?  On the one extreme we can argue that would it be nice that  
> every domain names, hyperlinks and URL should stay in English  
> alphabets (which enters the ICANN multilingual issue which I aim to  
> avoid in this discussion), on the other extreme we can argue that  
> there would be no problems if everyone is using Unicode now (which  
> implies a coercive force to impose that without the usual technology  
> diffusion).
[snip]
> Again, I am not arguing that the transition from non-Unicode to  
> Unicode is easy and could be done overnight, and hence I have no  
> intention to imply that it is all programmers' unwillingness and  
> laziness to finish the mundane jobs.   It is the opposite.  If we  
> lay out why, how much and how Wikipedia, youtube, Google and etc.  
> invest in Unicode deployment (exploiting the open nature of  
> Internet), we can better understand the richer dimensions of techno- 
> linguistic polices.  It is not my intention to play blame game (the  
> west versus east or the programmers versus users).   It is the  
> opposite.  Why Baidu supports simplified Chinese versions of  
> services, excluding Tibetans, Hong Kongese and even Taiwanese whom  
> Beijing try to represent while Google and Youtube do much better  
> jobs in creating a space where East Asians can fight with each other  
> on the same page.  I hope this case shows my intention to make this  
> an interesting research issue for mutli-discinplinary research than  
> blaming any particular groups of people.
> I hope we are debating on "Information wants to be ASCII or Unicode"  
> versus "Information wants to be digital", not "Information moving  
> from ASCII to Unicode is difficult".  Then the issue would be  
> clearer.  Who decides what digital standards should be selected and  
> deployed.  What is the negotiation process.  And why?  Operating  
> systems, global websites, regional websites, e-government services,  
> citation databases etc are all the domains we should ask.

I think with this reframing of your question I understand the issue
you pose better. I was addressing commentary that I often hear in
other contexts, where Unicode is proposed as a 'solution' to
multi-language representation in applications/sites/documentation at a
trivial level, i.e., if we use Unicode, we can support any character,
ergo we can support any language. That's obviously incorrect, as you
mention above.

The case of Baidu would be a very interesting one for seeing what
pressures may be at play, given that it began largely as a media
search site and later apparently obtained official 'licensing' from
Beijing itself in order to add functionality. The effect of that
interaction on the company's decision (assuming an active process) to
support only jianti characters would be interesting to follow.

> I cannot speak for all those open source contributors out there.  I  
> did not even try to find the regional and linguistic demographics of  
> open source community.  Though I am a big fan of "good will" in  
> Reagle's thesis, I cannot overlook the potentials of competitions  
> and creative conflicts among all branches of open source projects.
> Then the question would be, who should make this efforts?  I will  
> argue that the weight is overwhelmingly weighted on people who has  
> to use Unicode.  In practice, it easily becomes a favor to be asked  
> from those who need Unicode, and extra work to be done by the IT  
> support.  Then Unicode the solution becomes a problem.  I am not  
> saying there is no problem in Unicode implementation.  The reason  
> why I raise the problem here in the AOIR mailing list, not in the  
> Unicode mailing list is not to reaffirm the perception that adoption  
> of Unicode could be difficult, but rather raise the relevant  
> research issues around it.

There are a number of interesting aspects to follow:

i) the resources required to support the use of Unicode with the  
intent to provide, say, the ability for a site to be read in both  
jiantizi and fantizi (at least for one scope, given the example of  
Baidu)

ii) the negotiation of the process of support within the Open Source  
community - as you say, is the weight of the responsibility on the  
people who need the support?

iii)  the reasons that an institutional entity (say a business or  
university) might choose to expend the resources to provide Unicode as  
a piece of the base infrastructure. (eg. market share, goodwill,  
officially stated requirements)

The variant I would expect (for what that is worth) to be the most
complex is:

iv) the reasons that an entity chooses to use an infrastructure that  
excludes the ability to support, say, jianti and fanti ... eg. Chinese  
university websites such as Jilin Daxue (just using that as an example  
because I was there for a couple of semesters in the early 90s,  
they're not an exception, just an example at the top of my mind) --  
The school has students from Taiwan, Japan, Russia and other places  
around the world, including those who are only experienced with  
fantizi, but the pages are encoded as GB2312. Is that because they
feel their target market may be better supported with GB2312 (eg. some  
having computers with older versions of operating systems that will  
support GB2312 but not UTF-8, but systems that support UTF-8 will also  
support GB2312, ergo they're just addressing the lowest common  
denominator of their market)? Is there an official edict that  
universities should only use the national standard character set,  
regardless of who they might target (which would seem to be counter- 
productive from a marketing stand-point)? Or was there no active  
decision at all: was the website created with existing tools and  
support people who haven't considered the implications and haven't  
made an active choice?
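
To make the GB2312 point concrete, here is a minimal Python sketch
(an illustration added for this discussion, not anything the sites
mentioned actually run). It assumes Python's standard "gb2312" and
"utf-8" codecs; the sample strings and the helper names can_encode and
coverage are my own, chosen only for the example. It checks which
sample texts can be represented at all in each encoding.

```python
# Illustrative sample strings (chosen for this sketch, not taken from any site).
SAMPLES = {
    "jianti": "汉语",    # simplified Chinese
    "fanti": "漢語",     # traditional Chinese
    "tibetan": "བོད",    # Tibetan
}


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def coverage(encodings=("gb2312", "utf-8")):
    """Map each sample name to {encoding: representable?}."""
    return {
        name: {enc: can_encode(text, enc) for enc in encodings}
        for name, text in SAMPLES.items()
    }


if __name__ == "__main__":
    for name, result in coverage().items():
        print(name, result)
```

With these samples, only the jianti string should be representable in
GB2312, while all three are representable in UTF-8. A page served as
GB2312 therefore cannot carry fantizi or Tibetan text as plain
characters, whatever the reason the encoding was chosen.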

Our own University's website is almost entirely in English even though
our country is officially French/English bilingual. Supporting both
French and English is a simple problem, but what are the reasons that
French is not supported? (Being close to the problem, I'd suspect that
resources and target market are the primary reasons, as well as the
lack of a central web content management system.)  We have a
connection also with Zhejiang University (a joint degree program)  
which is seen as a key connection to the internationalization of the  
university: the page that describes this has one line of Chinese text  
in jiantizi ( http://www.cs.sfu.ca/undergrad/prospective/ddp/ ) but  
none in fanti, which in the history of Vancouver and environs has a  
much greater pool of readers given that many locals of Chinese  
ethnicity were schooled in Hong Kong, or other fanti-using countries,
and most of the Chinese schools here have taught with fantizi as  
well.  As a result, our local media (newspapers, television, etc.) in  
Chinese are all in fantizi.

> Not sure about "not using" Unicode can solve the interoperability  
> issues.  If the use of Unicode is one of the more attractive  
> solutions that can deliver some interoperability solutions (as  
> Google, Wikipedia, Youtube, etc. try to do, then I do not know  
> whether the two belief is "much the same sin".
> [snip]
> Agree, using Unicode by itself cannot save the world. Still, do you  
> mind showing me not using Unicode or other alternatives would solve  
> the issues better?  If such solution or vision does exist, why  
> Google, Wikipedia, Microsoft, Linux, Mac, etc., adopts the Unicode?   
> I am not citing these examples to refute your argument.  I am  
> genuinely intrigued to find out why they come to certain solution  
> but not others (including maintaining the status quo by not  
> deploying Unicode to some extent).

The "much the same sin" remark was in reference to using Unicode
without providing the additional layers that support true
internationalization. Again, I was referring to the naïve approach
some take, that using Unicode is sufficient to represent information;
the error lies in not understanding that Unicode is only one part of a
set of tools that supports internationalization and/or localization. I
suppose that I'm reading more into the terms "solution" and "vision"
than you intend.
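
As a rough illustration of the "additional layers" point, the Python
sketch below shows that storing text as Unicode does not by itself
make a jiantizi page readable as fantizi: a separate conversion layer
is needed, and a naive one falls short. The mapping table and the
function name to_fanti are hypothetical and deliberately tiny; a real
converter would need a full, context-aware dictionary.

```python
# Hypothetical, deliberately tiny jianti -> fanti table (illustration only).
JIANTI_TO_FANTI = str.maketrans({
    "汉": "漢",
    "语": "語",
    "头": "頭",
    "发": "發",  # assumption of this sketch: always pick 發, although 发 can also be 髮
})


def to_fanti(text: str) -> str:
    """Naive per-character conversion using the toy table above."""
    return text.translate(JIANTI_TO_FANTI)


def utf8_round_trip(text: str) -> str:
    """Encode to UTF-8 and back; the characters come back unchanged."""
    return text.encode("utf-8").decode("utf-8")


if __name__ == "__main__":
    # Unicode carries the text faithfully but does not change its script.
    print(utf8_round_trip("汉语") == "汉语")
    # The extra layer converts simple cases...
    print(to_fanti("汉语"))
    # ...but a one-to-one table mishandles 头发 ("hair"), written 頭髮 in fanti.
    print(to_fanti("头发"))
```

The last case is the interesting one: because one simplified character
can correspond to more than one traditional character, the conversion
needs linguistic context, which is exactly the kind of work that sits
above the character encoding.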

> I slightly disagree on the meaning of interoperability.  If  
> interoperability means a certain linguistic space can still use a  
> non-Unicode standard, then it may create a linguistic hierarchy.   
> For example, Chinese can use GB2312 through out in their user- 
> generated websites, and then Tibetans and traditional Chinese  
> characters cannot have a voice.  Again imagine Youtube cannot  
> automatically take the content contributed by Arabic or Persian  
> users, but only some kind of "interfaces" to promise the  
> interoperability.  To me it is not about a full support of Unicode  
> at this moment, but it is the awareness that the fact that Unicode  
> is arguably the most open linguistic infrastructure receives little  
> attention.
>
> Then the sharp question will be, can Beijing, Washington, London,  
> Tokyo deliver their government services and communicative spaces by  
> sticking to their linguistic ghetto without using Unicode or other  
> open linguistic architecture?

> Agree, good will matters.  Still, efforts to deliver that good will  
> matter as well.  I will exhibit some evidence in another email that  
> inside Perl (the programming language that supports MediaWiki which  
> makes Wikipedia possible) and the logo of Wikipedia and Chinese  
> Wikipedia, most of the efforts are requested and done by those who  
> need Unicode support.  Then it is not only a picture of good will  
> but some kind of push and pull.

MediaWiki/Wikipedia are written in PHP, not Perl (unless historical  
versions used Perl? If so, I was previously unaware of that - I've  
only worked with MediaWiki in PHP).

The push and pull is an interesting aspect, and MediaWiki/Wikipedia is
a good example of the needs of the community being supported, at least
in part, by those who need the functionality. Another example, though
a slightly different variant, is Facebook: a commercial entity whose
localization efforts seem to be community-based (e.g. translations are
done largely by volunteers after a request by Facebook for
participants, presumably as a result of requests from users -- I've
not quite figured out which group of users supported the English
(Pirate) translation :-)  )....

> Overall, from the above evidence, it could be argued that  
> Wikipedia's internationalization is a clear effort to adopt the  
> Unicode standards by mostly the Unicode-needed crowd.  It is worth  
> pointing out that around 2001 and 2002, the major operating systems  
> such as Microsoft and Mac that most normal PC users used at that  
> time seem to be not Unicode available yet, which makes such  
> development in Wikipedia more interesting.

> Again, coming back to the original question.  Why Wikipedia wants to  
> be Unicode?  or....Why not Wikipedia choose other solutions to  
> deliver interoperability?

Thinking about MediaWiki and PHP: looking at the PHP history page
( http://ca3.php.net/manual/en/history.php.php ) and other places, I
cannot determine when proper internationalization support was
achieved, but I do notice that a true Unicode module is still in the
internal development phase
( http://www.php.net/manual/en/intro.unicode.php ). I have to wonder
how the Wikipedia site would have developed 'internationally' had the
development environment been different. In the programming language
Java, the default character encoding has always been Unicode (allow me
to use the term inaccurately for convenience) as far as I am aware.
But given that it was intended initially as a language to support
set-top appliances that would likely be sold internationally, was that
simply a 'corporate decision'? How might MediaWiki have developed if
written in Java initially?

And to follow on to this point in another message:

On Jul 17, 2009, at 7:39:30 AM PDT (CA), Andrew Russell wrote:
> On Jul 17, 2009, at 8:50 AM, Joseph Reagle wrote:
>
>> On Thursday 16 July 2009, Han-Teng Liao (OII) wrote:
>>> ask.  We (internet researchers) need empirical research to see why  
>>> and
>>> how the Unicode support is implemented in various projects.
>>
>> I did not appreciate this point, and it is an interesting one. I  
>> haven't followed the literature that takes on standardization as a  
>> business or social science concern and so don't know if people have  
>> focused on Unicode at all. (I'm thinking of continuations of  
>> Cargill's "Open Systems Standardization" and Agre's course  
>> "Institutional Aspects of Computing" from the 90s.)
>
> It is an interesting point, and I am more comfortable with the  
> issues being framed this way - that is, looking at what people do  
> rather than what information "wants".

True, though I think there is an interesting path that could be taken
in the sense of what information 'wants' by making a small indirect
reference to W.J.T. Mitchell's work (What Do Pictures Want?): seeing
'want' both as meaning "to lack" (as in being denied a means to
participate in a particular forum such as Baidu's sites) and as
pointing to how information/language has no power as an agent alone,
without another agent to receive and process it. One could make an
argument that the development of Unicode itself is the expression of
the desire of information to have power and meaning across boundaries.
Following that line of thought, the question could be asked: given
that information has no power without the ability to be communicated,
what does an entity gain or lose by adopting a standard such as
Unicode (e.g. the control of messages, the acquisition of markets, the
benefit of intercultural communication for its own sake, etc.), and
how does that affect power relations (etc.)?

Mike