[Air-L] Information wants to be ASCII or Unicode? Tibetan-written information cannot be ASCII anyway.
Mike Stanger
mstanger at sfu.ca
Thu Jul 16 10:23:00 PDT 2009
On Jul 15, 2009, at 5:04 PM, Han-Teng Liao (OII) wrote:
> I agree that "Information wants to be digital", and that is why we
> should start a honest conversations among programmers, IT support,
> academics and policy makers.
> I disagree that the notion that the technical support of Unicode
> source is confusing for programmers. Please refer to the following
> blog post:
>
> The Absolute Minimum Every Software Developer Absolutely, Positively
> Must Know About Unicode and Character Sets (No Excuses!) by Joel
> Spolsky
> http://www.joelonsoftware.com/articles/Unicode.html
Yes, I read that many years ago, when we first needed to deal with
codepages/Unicode/etc. :-) but I think the article is primarily saying
that a programmer's assumption (or ignorance) that a given text string
has meaning in ASCII is problematic, and that Unicode is a better
general solution for a number of reasons. Using Unicode in the belief
that it solves the interoperability issues, and/or that it
communicates the intent of the programmer, is much the same sin, in
my view.
My comment wasn't intended to imply that it is confusing to implement
your own software in Unicode, but that Unicode is still an encoding:
you have to deal with it, with its assumptions, and with the
assumptions of the programmers behind any software or system with
which you might interact. If one is intentionally trying to be
interoperable, the concerns are much the same as when dealing with
traditional codepages; one just uses different approaches to ensure
the content can be displayed accurately by the time it makes its way
to the ultimate recipient of the information. However, just using
Unicode isn't going to resolve all of the interoperability issues
(e.g. reading direction and other features unique to the written form
of a particular language).
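To make the "still an encoding" point concrete, here is a minimal
Java sketch of my own (using Tibetan, as in the subject line): the
same UTF-8 bytes silently turn into mojibake when the receiving
system assumes a different encoding.

    import java.nio.charset.Charset;

    public class EncodingAssumptions {
        public static void main(String[] args) {
            // "bod skad" (the Tibetan language), well outside ASCII
            String original = "\u0F56\u0F7C\u0F51\u0F0B\u0F66\u0F90\u0F51";
            byte[] utf8 = original.getBytes(Charset.forName("UTF-8"));

            // A receiver that assumes Latin-1 gets mojibake; no error
            // is raised, the text is just silently garbled
            String misread = new String(utf8, Charset.forName("ISO-8859-1"));
            System.out.println(misread.equals(original));   // false

            // Only a matching assumption round-trips the text intact
            String correct = new String(utf8, Charset.forName("UTF-8"));
            System.out.println(correct.equals(original));   // true
        }
    }

Neither call fails; the mismatch only shows up when a human (or the
next system) tries to read the result.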
Ultimately though, what data storage in Unicode does provide almost
automatically is preservation of the data (unless it gets transformed,
of course), and its use could potentially signal the author's intent
to enable the coexistence of mixed-language content as a politically
friendly gesture. I would agree that character encodings could
potentially send a signal about the intent to be good internet
citizens, or that the intentional use of something other than Unicode
could be seen as a statement of political position (e.g. mainland
China's use of jianti (simplified) character sets in a particular
codepage vs. a codepage that supported fanti (traditional)).
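That codepage split is easy to demonstrate; another minimal Java
sketch of my own (the character choices are mine): GB2312 can encode
the simplified form of 'country' but not the traditional one, Big5
the reverse, while Unicode lets both coexist in a single string.

    import java.nio.charset.Charset;

    public class CodepageCoexistence {
        public static void main(String[] args) {
            String jianti = "\u56FD"; // simplified 'guo' (country)
            String fanti  = "\u570B"; // traditional 'guo'

            Charset gb2312 = Charset.forName("GB2312"); // mainland jianti codepage
            Charset big5   = Charset.forName("Big5");   // fanti codepage

            System.out.println(gb2312.newEncoder().canEncode(jianti)); // true
            System.out.println(gb2312.newEncoder().canEncode(fanti));  // false
            System.out.println(big5.newEncoder().canEncode(fanti));    // true
            System.out.println(big5.newEncoder().canEncode(jianti));   // false

            // UTF-8 carries both in the same document, without taking sides
            Charset utf8 = Charset.forName("UTF-8");
            System.out.println(utf8.newEncoder().canEncode(jianti + fanti)); // true
        }
    }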
However, I think programmer intent is often lost in the end product.
It would be encouraging to see a movement in which programmers state
that their active decision to use Unicode is a deliberate recognition
of the multitude of languages, as a 'politically friendly' gesture.
I also assume that there are many coders who are using Unicode, but
doing so less than deliberately, perhaps even as a side-effect of the
development environment they use (e.g. Java's native character/string
support), mirroring the use of ASCII in earlier environments. These
applications may well support Unicode at the character level, but
because the programmer's use of Unicode is a side-effect, the end
product may not actually interoperate with other languages properly
or completely.
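Java itself illustrates that side-effect (again, a minimal sketch of
my own): strings are Unicode 'for free', but characters outside the
Basic Multilingual Plane are stored as surrogate pairs, and code that
never thinks about Unicode will happily split them.

    public class AccidentalUnicode {
        public static void main(String[] args) {
            // U+20BB7, a CJK ideograph outside the BMP, occupies two
            // UTF-16 code units (a surrogate pair) in a Java String
            String s = "\uD842\uDFB7";

            System.out.println(s.length());                      // 2, not 1
            System.out.println(s.codePointCount(0, s.length())); // 1

            // Naive per-char processing splits the pair and corrupts
            // the text without any warning
            char half = s.charAt(0);
            System.out.println(Character.isHighSurrogate(half)); // true
        }
    }

The environment handed the programmer Unicode storage, but correct
handling of the text still requires a deliberate decision.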
So while I agree that the use of Unicode is a step forward in
interoperability, I'd argue that the work to be done is not so much
about the use of Unicode as about the 'publicly' stated intent to be
interoperable. Unicode may be one tool that can assist in that goal if
used properly, but the use of Unicode alone says little about intent.
Mike
> We can debate the technical implementation on and on (but I hope
> the above link has settled the technical debate). However, it would
> be better to ask first whether we need, for example, Korean,
> Japanese and Chinese to *coexist* on the same page, or
> alternatively, Hebrew and Arabic to *coexist* on the same page. If
> the social and communicative need across languages is among our
> priorities in supporting a better Internet environment, then the
> answer is obvious. Again, the fact that Unicode is supported and
> maintained by industry and experts, and that Google, YouTube,
> Facebook and other websites support it, probably comes down to a
> simple reason: they want to reach other local markets.
> My shortcut and simplified understanding of the whole software
> industry movement in i18n (internationalization) and l10n
> (localization) is as follows. The industry (along with the open
> source community, which actually excels in i18n and l10n) proposed
> Unicode by first imagining that there is limitless space
> (codepoints) for alphabets/scripts/strokes/characters to be
> assigned. The industry can then compete to implement them and
> satisfy any potential markets.
> So I am of the opinion that Unicode is actually market-friendly and
> potentially programmer-friendly. It takes more effort to make it
> politically friendly rather than merely politically correct. I hope
> my starting point is not about multicultural or multilingual
> correctness, but about the open nature of the Internet....
>
> All languages want to be digital. We have enough space for them.
>