[Air-L] Information wants to be ASCII or Unicode? Tibetan-written information cannot be ASCII anyway.

han-teng.liao@oii han-teng.liao at oii.ox.ac.uk
Sat Oct 31 04:32:46 PDT 2009


Thank you, Mike Stanger, for a detailed and sincere reply.  I am happy to
correct some of my mistakes and to insist on the issue of Unicode
adoption (especially now that ICANN has announced the approval of
non-Latin domain names, which will likely complicate this conversation
in the near future).


Before we go into the nitty-gritty of the discussion, allow me to
reframe the issue in terms of "default culture" and "redundancy".  We
should be able to agree that, because of historical development,
Latin-based English is the default culture of the World Wide Web:
typing in English is available to almost everyone.  Unicode, by
contrast, aims to provide a utopia in which every language can be
displayed, input, and digitally processed.  The reality today lies
somewhere in between: extra effort is still needed for specific
language support.  Unicode is merely an architecture for a utopia yet
to be implemented, and being Unicode-ready cannot solve the language
capacity problem overnight.  On the other hand, without the Unicode
architecture, languages have little chance to co-exist except through
agreements already made within Unicode.
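
To make the subject line of this thread concrete, here is a minimal
sketch in Python (the Tibetan sample string is my own illustration, not
anything from the standards themselves): Tibetan simply has no ASCII
representation, while UTF-8 handles it with no per-language arrangement
beyond Unicode itself.

    tibetan = "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42"  # "bod yig" (Tibetan script)

    # UTF-8, one concrete encoding of the Unicode architecture, encodes
    # the string without any prior per-language agreement.
    utf8_bytes = tibetan.encode("utf-8")
    print(len(utf8_bytes))  # 21 bytes: 3 bytes per Tibetan character

    # ASCII has no code points for Tibetan at all, so encoding fails.
    try:
        tibetan.encode("ascii")
    except UnicodeEncodeError as err:
        print("Tibetan-written information cannot be ASCII:", err)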


Once we draw a clear distinction between architecture / infrastructure
(I really don't have a nicer metaphor for this) and actual
implementation / support, we can see that Unicode is an architectural
"solution" for multilingual support whose "actual implementation" is
still pending.  Now, if a small business owner in North America wants
to stay in an ASCII or Latin-only environment, that is his or her own
choice to make, especially if the market for non-Latin support is
small.  However, when big players such as universities and governments
tell me that they cannot support multilingual capacity yet for various
reasons (under-investment, lack of expertise, lack of demand, etc.), I
would suggest this: (1) try everything you can to be Unicode-ready,
which is not that difficult these days if we are not asking for
Unicode-complete or Unicode-to-the-core; (2) leave room for future
implementation.  I support this suggestion with two arguments: (1) the
extra cost of moving from a Latin-only architecture to a Unicode-ready
one has been increasingly reduced to the extra *redundancy* of storage
space; (2) this extra *redundant* storage space is nothing compared to
the demands of multimedia materials.
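
The storage argument is easy to check.  A minimal sketch in Python,
with sample strings of my own choosing, shows that the Unicode-ready
"redundancy" costs nothing for Latin-only text, because ASCII is a
strict subset of UTF-8; only non-Latin characters pay a modest
per-character premium.

    latin = "Hello, world"                # 12 ASCII characters
    chinese = "\u4f60\u597d\u4e16\u754c"  # "ni hao shi jie" ("hello world")

    # For Latin-only text, the UTF-8 bytes are identical to the ASCII
    # bytes: the "redundancy" costs nothing until non-Latin text appears.
    assert latin.encode("utf-8") == latin.encode("ascii")
    print(len(latin.encode("utf-8")))    # 12 bytes, exactly as in ASCII

    # Non-Latin characters cost more bytes, but only where they occur.
    print(len(chinese.encode("utf-8")))  # 12 bytes: 3 bytes per CJK character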


Metaphorically, big players should build their houses right from the
very beginning.  It is another issue if the new rooms in those houses
(redundant rooms for other languages) remain empty and cannot yet
practically accommodate other languages; people in the future, or
users out there, can help work on that.  At least the space is
available.  And I will argue later that this is not really redundancy
but a necessary gesture for an environment that values openness.


Therefore, I am aware, and do agree, that full support for displaying
and inputting every language on every single personal computer is not
necessary.  Still, I am of the strong opinion that any service website
run by a big player should use the Unicode architecture, as most of
them already do.


The Jilin Daxue example you mentioned illustrates this perfectly.
Chinese universities and governments have had, for a few years now, a
Unicode-compatible national standard called GB 18030
(http://en.wikipedia.org/wiki/GB18030).   It is claimed to solve the
issues of simplified/orthodox Chinese characters (jianti and fanti)
and even to cover Mongolian script
<http://en.wikipedia.org/wiki/Mongolian_script> and Tibetan script
<http://en.wikipedia.org/wiki/Tibetan_script>.   In addition, since
2006 Beijing has mandated that all software sold in China must support
this standard.  It is then very peculiar that many websites and
webpages in China are still GB2312-only (and thus simplified-Chinese-
characters-only) when the software they use should be, as mandated by
the authorities in Beijing, GB 18030-ready and thus Unicode-ready.
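
The difference between the two standards is easy to demonstrate.  A
minimal sketch using the two codecs as shipped in, for example,
Python's standard library (the sample characters are my own picks):
GB2312 rejects a traditional character and a Tibetan letter outright,
while GB 18030, which maps all of Unicode, accepts both.

    samples = {
        "traditional Chinese": "\u7063",  # "wan" (bay), traditional form
        "Tibetan": "\u0f40",              # the Tibetan letter KA
    }

    for name, ch in samples.items():
        encoded = ch.encode("gb18030")  # GB 18030 covers all of Unicode
        try:
            ch.encode("gb2312")
            status = "also representable in GB2312"
        except UnicodeEncodeError:
            status = "denied existence in GB2312"
        print(name, repr(encoded), "-", status)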


On GB2312-only websites, traditional Chinese characters, Mongolian
script, and Tibetan script are denied "existence"; only the Latin
alphabet is spared.  So in the eyes of Westerners, it may seem okay to
stay in a Latin-only environment, where other languages being denied
"existence" may not look like such a big issue.  But what about this?


"Name Not on Our List? Change It, China Says"
http://www.nytimes.com/2009/04/21/world/asia/21china.html    (To geeky
audience, the Chinese character mentioned in the new york times is
supported by both GB 18030 and Unicode, so it maybe cause the character
is traditional /orthodox one..... )


Therefore, could we agree that full Unicode support may depend on the
demands and resources of a given IT project, but that there is no need
to stick to a Latin-only architecture or text, when the extra
redundant storage space is a low price to pay for future extension, a
good gesture, and a statement of language neutrality?


I have just checked SFU's website and am pleasantly surprised that it
is already encoded in Unicode.  I do not mind that it carries only
English and Canadian content, which merely reflects the cultural and
political context in which the university is situated.  The fact that
it uses Unicode as the architecture for its web content proves my
point: if any members of the university want to contribute in, or mix,
the languages of their choice, they are not automatically denied
because the underlying web content architecture fails to support those
languages.


How would you explain why most Chinese government websites still stick
to GB 2312 when the government itself mandates that software must
support the more inclusive GB 18030?  The SFU case in Canada is a nice
contrast: the university adopts Unicode for its website anyway, even
though the official languages of Canada could easily be supported by a
Latin-only encoding.
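
Anyone can check what a website declares.  A minimal sketch in Python
(the URL is a pure placeholder; substitute any site you wish to
inspect) that reads the charset a server announces in its Content-Type
header:

    from urllib.request import urlopen

    # Placeholder URL: substitute any website whose declared
    # encoding you want to inspect.
    url = "http://www.example.com/"

    with urlopen(url) as response:
        # Servers typically declare the charset in the Content-Type
        # header, e.g. "text/html; charset=utf-8" or "...; charset=gb2312".
        print(response.headers.get("Content-Type", ""))
        print("declared charset:", response.headers.get_content_charset())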


Going back to my initial question, "Does information want to be ASCII
or Unicode?", I hope I have made the point that information should be
Unicode, so as to avoid the situation where some languages are
fundamentally denied digital existence at the level of script or
character encoding.    I am aware that on personal computers,
universal support for all languages (including typing and display) is
up to individual choice.  However, I have to insist that information,
online or offline, should be Unicode-ready.


It is one thing to sponsor everyone's attendance at an open party; it
is another to invite everyone to it.  I am insisting on the latter.
Unicode is an open invitation.


Following Dr. Andrew L. Russell's suggestion of re-framing the issue,
it would go like this: information wants to be Unicode because people
are nice enough to invite every language into the digital world.
(Maybe that is not the case for some state players ......language
politics.)


In response to the issue of Project Gutenberg: I respect people's
choice between plain-text and HTML formats.  However, that choice
should not be conflated with the choice between Unicode and ASCII.
One can perfectly well have a Unicode plain text file.
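
To make that last point concrete, here is a minimal sketch in Python
(the file name and sample line are my own) that writes and reads back
an ordinary plain .txt file which is nonetheless fully Unicode:

    line = "Plain text, still Unicode: \u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42\n"

    # An ordinary plain text file; the only "format" is the UTF-8 encoding.
    with open("unicode_plain_text.txt", "w", encoding="utf-8") as f:
        f.write(line)

    # Reading it back requires nothing beyond knowing the encoding.
    with open("unicode_plain_text.txt", encoding="utf-8") as f:
        print(f.read())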

-----
*Correction:*

Indeed, as Dr. Mike Stanger has rightly pointed out, "MediaWiki/Wikipedia
are written in PHP, not Perl".  That was my own mistake and bad memory.
Orz ><

PS.  It is interesting to point out, as part of a bigger endeavour to
trace which open source community members have made Unicode support
possible, that the Unicode person inside the PHP community is probably
Andrei Zmievski.  His Russian heritage and his blog entry
"My name is not really Andrei" may be of interest.

http://zmievski.org/2006/07/my-name-is-not-really-andrei


Mike Stanger wrote:
> I'll combine a response to multiple messages in one, hopefully I don't 
> break the context:
>
> [snip]
>
> I think with this reframing of your question I understand the issue 
> you pose better: I was addressing commentary that I often hear in 
> other contexts where Unicode is proposed as a 'solution' to 
> multi-language representation in applications/sites/documentation at a 
> trivial level. ie: if we use Unicode, we can support any character, 
> ergo we can support any language, but that's obviously incorrect as 
> you mention above.
>
> The case of Baidu would be a very interesting one to see what 
> pressures may be at play given their inception largely as a media 
> search site, which later acquired, apparently, official 'licensing' 
> from Beijing itself in order to add functionality. The effect of that 
> interaction on the decision of the company (assuming an active 
> process) to support only jianti characters would be interesting to 
> follow.
>
> [snip]
>
> There are a number of interesting aspects to follow
>
> i) the resources required to support the use of Unicode with the 
> intent to provide, say, the ability for a site to be read in both 
> jiantizi and fantizi (at least for one scope, given the example of Baidu)
>
> ii) the negotiation of the process of support within the Open Source 
> community - as you say, is the weight of the responsibility on the 
> people who need the support?
>
> iii)  the reasons that an institutional entity (say a business or 
> university) might choose to expend the resources to provide Unicode as 
> a piece of the base infrastructure. (eg. market share, goodwill, 
> officially stated requirements)
>
> The variant I would expect (for what that is worth) is the most 
> complex would be:
>
> iv) the reasons that an entity chooses to use an infrastructure that 
> excludes the ability to support, say, jianti and fanti ... eg. Chinese 
> university websites such as Jilin Daxue (just using that as an example 
> because I was there for a couple of semesters in the early 90s, 
> they're not an exception, just an example at the top of my mind) -- 
> The school has students from Taiwan, Japan, Russia and other places 
> around the world, including those who are only experienced with 
> fantizi, but the pages are encoded as GB2312 . Is that because they 
> feel their target market may be better supported with GB2312 (eg. some 
> having computers with older versions of operating systems that will 
> support GB2312 but not UTF-8, but systems that support UTF-8 will also 
> support GB2312, ergo they're just addressing the lowest common 
> denominator of their market)? Is there an official edict that 
> universities should only use the national standard character set, 
> regardless of who they might target (which would seem to be 
> counter-productive from a marketing stand-point)? Or was there no 
> active decision at all: was the website created with existing tools 
> and support people who haven't considered the implications and haven't 
> made an active choice?
>
> Our own University's website is almost entirely in English even though 
> our country is official French/English bi-lingual. Supporting both 
> French and English is a simple problem, but what are the reasons that 
> French is not supported (being close to the problem I'd suspect that 
> resources and target market are the primary reasons, as well as the 
> lack of a central web content management system).  We have a 
> connection also with Zhejiang University (a joint degree program) 
> which is seen as a key connection to the internationalization of the 
> university: the page that describes this has one line of Chinese text 
> in jiantizi ( http://www.cs.sfu.ca/undergrad/prospective/ddp/ ) but 
> none in fanti, which in the history of Vancouver and environs has a 
> much greater pool of readers given that many locals of Chinese 
> ethnicity were schooled in Hong Kong, or other fanti using countries, 
> and most of the Chinese schools here have taught with fantizi as 
> well.  As a result, our local media (newspapers, television, etc.) in 
> Chinese are all in fantizi.
>
> [snip]
>
> the "much the same sin" remark was in reference to using Unicode 
> without providing additional layers that support true 
> internationalization.  Again, referring to the naïve approach that 
> some take that using Unicode is sufficient to represent information, 
> where the error is made in not understanding that Unicode is only a 
> part of a set of tools that supports internationalization and/or 
> localization. I suppose that I'm reading more into the term "solution" 
> and "vision" than you intend.
>
> [snip]
>
> MediaWiki/Wikipedia are written in PHP, not Perl (unless historical 
> versions used Perl? If so, I was previously unaware of that - I've 
> only worked with MediaWiki in PHP).
>
> The push and pull is an interesting aspect, and in the case of 
> MediaWiki / Wikipedia, it's a good example of the needs of the 
> community being somewhat supported by those who need the 
> functionality.  Another example, though is a slightly different 
> variant: Facebook: a commercial entity whose localization efforts seem 
> to be community based (eg. translations are done largely by volunteers 
> on a request by Facebook for participants, presumably as a result of 
> requests from users -- I've not quite figured out who was the group of 
> users who supported the English (Pirate) translation :-)  )....
>
> [snip]
>
> Thinking about MediaWiki and PHP: Looking at the PHP history page ( 
> http://ca3.php.net/manual/en/history.php.php ) and other places, I 
> cannot determine when proper internationalization support was 
> achieved, but do notice that a true Unicode module is still in the 
> internal development phase ( 
> http://www.php.net/manual/en/intro.unicode.php ) .. but I have to 
> wonder how the Wikipedia site would have developed 'internationally' 
> had the development environment been different.  In the programming 
> language Java, the default character encoding has always been Unicode 
> (allow me to use the term inaccurately for convenience) as far as I am 
> aware.  But given that it was intended initially as a language to 
> support set-top appliances that would likely be sold internationally, 
> was that simply a 'corporate decision'?   How might MediaWiki have 
> developed if written in Java initially?
>
> And to follow on to this point in another message:
>
> [snip]
>
> True, though I think there is an interesting path that could be taken 
> in the sense of what information 'wants' by making a small indirect 
> reference to W.J.T. Mitchell's work (What do Pictures Want): seeing 
> 'want' as both meaning "to lack" (as in being denied a means to 
> participate in a particular forum such as Baidu's sites) and how 
> information/language has no power as an agent alone without another 
> agent to receive and process it. One could make an argument that the 
> development of Unicode itself is the expression of the desire of 
> information to have power and meaning across boundaries.  Following 
> that line of thought, the question could be asked: Given that 
> information has no power without the ability to be communicated, what 
> does an entity gain or lose by adopting a standard such as Unicode, 
> (eg. the control of messages, the acquisition of markets, the benefit 
> of intercultural communication for its own sake, etc.) and how does 
> that affect power relations (etc.)
>
> Mike
>
>




