Hi Terry and all,
I usually just lurk on the list, but since I'm a C++ aficionado, I wanted
to question your statement quoted below.
If we settle on wchar_t being 16 bits, then we will still be forced to handle
a variable-length encoding (UTF-7/8/16) to properly process an arbitrary
Unicode (or ISO/IEC 10646) string, since we must deal with that charming
thing known as "surrogate pairs" (see section 3.7 of the Unicode standard
v3.0). This again breaks the "one wchar_t == one character" assumption. When
forced to deal with Unicode, I much prefer working with 32 bits, since that
guarantees a fixed length for each character. Admittedly, it is space
inefficient to the Nth degree, but speedwise it is better.
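
To illustrate the surrogate pair point, here is a minimal sketch of what
decoding 16-bit units into full code points involves. The function name
decode_utf16 is my own invention for this example, and replacing unpaired
surrogates with U+FFFD is just one possible policy, not something mandated
here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: turn a sequence of 16-bit units into code points.
// A high surrogate (0xD800-0xDBFF) followed by a low surrogate
// (0xDC00-0xDFFF) combines into one code point above U+FFFF.
// Unpaired surrogates are replaced with U+FFFD (an assumed policy).
std::vector<std::uint32_t> decode_utf16(const std::vector<std::uint16_t>& units)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        std::uint32_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF) {            // high surrogate
            if (i + 1 < units.size()) {
                std::uint32_t lo = units[i + 1];
                if (lo >= 0xDC00 && lo <= 0xDFFF) {  // valid pair
                    out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
                    ++i;                             // consumed two units
                    continue;
                }
            }
            out.push_back(0xFFFD);                   // lone high surrogate
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            out.push_back(0xFFFD);                   // lone low surrogate
        } else {
            out.push_back(u);                        // ordinary BMP character
        }
    }
    return out;
}
```

With a 32-bit wchar_t none of this is needed for indexing: element N of the
array is simply character N.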
As for interoperability with Windows, it is clearly stated that the wchar_t
is intended for internal usage only, and the various encoding schemes should
be used when storing strings outside of a process. In reality this means
that just about every Unicode-capable application reads and writes in UTF-8
or UTF-7, so interoperability should not become an issue. If it really were
expected to be an issue, I'm sure the C++ standard would have mandated a
specific width for wchar_t, which as far as I am aware it does not. The
draft copy I pulled out via Google says the following:
Type wchar_t is a distinct type whose values can represent distinct codes
for all members of the largest extended character set specified among the
supported locales (_lib.locale_). Type wchar_t shall have the same size,
signedness, and alignment requirements (_intro.memory_) as one of the other
integral types, called its underlying type.
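
Since the width is left to the implementation, it can at least be inspected
at build time. A small sketch follows; the names wchar_bits and
wchar_holds_any_code_point are made up for this example, and 0x10FFFF is the
highest code point reachable through surrogate pairs.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits on this particular implementation.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// True if one wchar_t can hold any code point up to U+10FFFF,
// i.e. no surrogate pairs are needed in a wchar_t string.
bool wchar_holds_any_code_point()
{
    return static_cast<unsigned long>(WCHAR_MAX) >= 0x10FFFFul;
}
```

On a platform with a 16-bit wchar_t the function returns false, which is
exactly the case where the surrogate handling becomes our problem.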
So, in the light of this, what would be the most appropriate choice? I
haven't yet had a chance to explore what locales we support, but I would
lean toward saying wchar_t == 32 bits, since this is future-proof. If we
are later forced to go from 16 -> 32 bits due to us supporting more of the
Asian locales, I foresee this causing _major_ breakage.
If anyone actually has a copy of the C++ standard and would be kind enough
to paste the section regarding the size of wchar_t, that would be most
helpful for this discussion I believe.
Johny Mattsson | Email: Johny.Mattsson@ericsson.com.au
Ericsson Support Engineer | Phone: +61 (0)3 9301 1372
NCSA NetScreen Certified | Mobile: +61 (0)404 003 713
> -----Original Message-----
> From: Terry Lambert [SMTP:firstname.lastname@example.org]
> Sent: Tuesday, June 18, 2002 9:47 PM
> To: Thomas David Rivers
> Cc: email@example.com; current@FreeBSD.ORG; firstname.lastname@example.org
> Subject: Re: PATCH: wchar_t is already defined in libstd++
> o A desire for raw storage of Unicode, rather than UTF-8 or
> UTF-7 encoding. This last one is:
> o UTF encoding breaks fixed field storage, which has
> always been a measure of the number of characters
> you can put in a field.