LinuxLists.cc - Alphabet of kernel source

2004-06-23 21:07:35

Subject: Alphabet of kernel source

Guys,

I have a silly question, for which I am unable to google out the answer
so far. Do we have a Linus' decree on the charset and encoding of the
kernel source?

I had a funny situation recently... I prefer non-MIME attachments
for two reasons: a) I grab parts of the header and fold them into the
patch and b) it is easier to quote fragments of the patch with the clients
I tried (mutt and sylpheed). Admittedly, different MUA software may
change these habits, but please bear with me here. So, someone sent
me a patch which included a context line with MODULE_AUTHOR() with
an accented name, which the author entered in ISO-8859-1 (he was German).
I replied, but my mail agent recoded the reply as UTF-8. The author
agreed to my patch, copied my reply, and sent it back to me. Everything was
perfectly readable at this point, but the patch was rejected. Because
I use Russian and Japanese simultaneously, all utilities run with UTF-8
on my boxes, so it took me a moment to do "LANG=C vi" and find the problem.
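
A minimal sketch of why a patch like this stops applying, assuming a context line with an accented author name (the name below is made up for illustration and is not from the actual patch). The same text has different bytes in ISO-8859-1 and UTF-8, and the byte-wise comparison here only approximates how patch matches context lines:

```python
# Hypothetical context line; the author name is invented for illustration.
line = 'MODULE_AUTHOR("J\u00fcrgen Example");'

latin1_bytes = line.encode("iso-8859-1")  # bytes as stored in the source tree
utf8_bytes = line.encode("utf-8")         # bytes after the mail agent recoded it


def context_matches(tree_line: bytes, patch_line: bytes) -> bool:
    """Compare a tree line and a patch context line byte for byte.

    Assumption: this approximates how patch matches context lines.
    """
    return tree_line == patch_line


print(latin1_bytes)
print(utf8_bytes)
print(context_matches(latin1_bytes, utf8_bytes))
```

Both byte strings display as the same text in a terminal set to the matching charset, which is why the mail looked perfectly readable while the context line no longer matched.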

Anyhow, long story short, this got me thinking... What is the charset
and the encoding of the actual source? I saw quite a discussion about
the filenames, but this is different. I am sorry if this was discussed
previously.

-- Pete

2004-06-23 21:51:50

by David Eger

Subject: Re: Alphabet of kernel source

I started a thread a while ago (2.6.3/2.6.4) where I submitted some
patches UTF-8ifying the kernel sources. Basically, most of the
kernel is ASCII (98.4% of the files). The rest are mostly ISO-Latin-1,
with the rare bit of Japanese (in a couple of charsets) and some just
random bytes in some of the Documentation/...
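
A survey along these lines could be sketched as follows. This is not the script used for the figures above; the function names `classify` and `survey` are hypothetical, and the three categories are a simplification:

```python
import os
from collections import Counter


def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for a blob of bytes."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"  # e.g. ISO-8859-1, EUC-JP, or random bytes


def survey(root: str) -> Counter:
    """Count the files under root by the category classify() assigns."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as fh:
                counts[classify(fh.read())] += 1
    return counts
```

Note that ISO-2022-JP is a 7-bit encoding, so a file in that charset would be counted as 'ascii' here; catching it would need an extra check for escape (0x1b) bytes.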

http://www.yak.net/random/linux-2.6.4-utf8-cleanup-auto.diff
http://www.yak.net/random/linux-2.6.4-utf8-cleanup-cstrings2utf8.diff
http://www.yak.net/random/linux-2.6.4-utf8-cleanup-jp.diff
http://www.yak.net/random/linux-2.6.4-utf8-cleanup-wrong.diff

It's sorta difficult to do non-ASCII patches over email because
the kernel developers like reading their mail in mutt, and don't
like attachments (the only sane ways to send non 7-bit clean data:
8-bit MIME: tagged and bagged or uuencoded)

Further, you confuse the hell out of vi if you have any trash (8bit data
in another charset) in a file that's supposed to be UTF-8. i.e. don't
think you're going to be able to look at a charset changing patch in
anything.
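
One way to locate such stray bytes without relying on an editor is to test each line separately. A minimal sketch, with `non_utf8_lines` being a hypothetical helper name:

```python
def non_utf8_lines(data: bytes):
    """Yield (line_number, raw_line) for each line that is not valid UTF-8."""
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            yield number, raw
```

Working on raw bytes keeps the check independent of the locale the tool happens to run in.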

-dte

2004-06-23 22:01:57

by Andries Brouwer

Subject: Re: Alphabet of kernel source

On Wed, Jun 23, 2004 at 02:06:28PM -0700, Pete Zaitcev wrote:

> Anyhow, long story short, this got me thinking... What is the charset
> and the encoding of the actual source? I saw quite a discussion about
> the filenames, but this is different. I am sorry if this was discussed
> previously.

This has come up repeatedly. As far as I recall, Linus has never said
anything. The de facto situation can be seen by just inspecting the
MAINTAINERS file. Kai Makisara has a diaeresis on the first vowel of
his last name. Today (2.6.6) that is still coded in ISO 8859-1.

In old discussions people who disliked 8859-1 expressed strong preference
for plain ASCII (possibly with TeX-like escape sequences for non-ASCII).
These days it seems that, if anything is changed, the only reasonable action
would be to switch to UTF-8.
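
A switch like that could be sketched per file as below, assuming that anything which is not already valid UTF-8 is ISO 8859-1. That assumption is wrong for the Japanese files mentioned elsewhere in this thread, and `to_utf8` is a hypothetical helper name:

```python
def to_utf8(data: bytes, assumed: str = "iso-8859-1") -> bytes:
    """Re-encode data as UTF-8, leaving valid UTF-8 (including ASCII) alone."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8, so do not convert twice
    except UnicodeDecodeError:
        return data.decode(assumed).encode("utf-8")
```

Checking for valid UTF-8 first makes the conversion safe to run repeatedly, although a Latin-1 byte sequence that happens to form valid UTF-8 would be left untouched.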

Andries

2004-06-23 22:26:46

by Kalin KOZHUHAROV

Subject: Re: Alphabet of kernel source

David Eger wrote:
> I started a thread a while ago (2.6.3/2.6.4) where I submitted some
> patches UTF-8ifying the kernel sources. Basically, most of the
> kernel is ASCII (98.4% of the files). The rest are mostly ISO-Latin-1,
> with the rare bit of Japanese (in a couple of charsets) and some just
> random bytes in some of the Documentation/...

The "problem" is contributor names, although having everything in plain ASCII is reasonable, I guess.

> http://www.yak.net/random/linux-2.6.4-utf8-cleanup-auto.diff
A lot of names and some art supposed to be ASCII.

> http://www.yak.net/random/linux-2.6.4-utf8-cleanup-cstrings2utf8.diff
Some degree symbols and microseconds... and names.
I remember having problems with lm-sensors trying to print degrees, how did they fight the problem?

> http://www.yak.net/random/linux-2.6.4-utf8-cleanup-jp.diff
Ok, this Japanese is only in the comments.
I can translate that in no time and fix this diff.
WTF is arch/v850/ ?
I guess you had some kind of script. Can you try it on vanilla 2.6.7, please, and post the results?

> http://www.yak.net/random/linux-2.6.4-utf8-cleanup-wrong.diff
There are a few microseconds written properly, but they are more commonly typed as "us"; or just don't abbreviate.

> It's sorta difficult to do non-ASCII patches over email because
> the kernel developers like reading their mail in mutt, and don't
> like attachments (the only sane ways to send non 7-bit clean data:
> 8-bit MIME: tagged and bagged or uuencoded)
>
> Further, you confuse the hell out of vi if you have any trash (8bit data
> in another charset) in a file that's supposed to be UTF-8. i.e. don't
> think you're going to be able to look at a charset changing patch in
> anything.
Totally agree, although I use Mozilla Mail (and sometimes mutt).

Kalin.

--
||///_ o *****************************
||//'_/> WWW: http://ThinRope.net/
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

2004-06-24 06:17:20

by David Eger

Subject: Re: Alphabet of kernel source

On Thu, Jun 24, 2004 at 07:18:41AM +0900, Kalin KOZHUHAROV wrote:
> >http://www.yak.net/random/linux-2.6.4-utf8-cleanup-cstrings2utf8.diff
> Some degree symbols and microseconds... and names.
> I remember having problems with lm-sensors trying to print degrees, how did
> they fight the problem?

I assume the machines where they cat /proc/blah are running with
ISO Latin 1 as the local charset ;-)
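
A minimal sketch of what happens to such a degree sign elsewhere, assuming a made-up sensor line (the format is hypothetical, not actual lm-sensors output):

```python
# Hypothetical sensor reading; the format is invented for illustration.
reading = "42.0\u00b0C"

latin1 = reading.encode("iso-8859-1")  # degree sign is the single byte 0xb0

# A Latin-1 terminal interprets the bytes as intended ...
on_latin1_terminal = latin1.decode("iso-8859-1")

# ... while a UTF-8 consumer sees an invalid byte, replaced here by U+FFFD.
on_utf8_terminal = latin1.decode("utf-8", errors="replace")
```

In UTF-8 the degree sign takes two bytes (0xc2 0xb0), so a lone 0xb0 is not a valid sequence.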

> >http://www.yak.net/random/linux-2.6.4-utf8-cleanup-jp.diff
> Ok, this Japanese is only in the comments.
> I can translate that in no time and fix this diff.

actually, I'm pretty sure the diff is correct against 2.6.4 - the bytes
should all be correct, as I checked it with someone who works with
said files...

> I guess you had some kind of script, can you try it on vanilla 2.6.7,
> please, and post results.

I will regenerate the patches if someone in charge (Linus or Andrew)
actually wants them.

-dte

2004-06-24 11:06:59

by Richard B. Johnson

Subject: Re: Alphabet of kernel source

On Wed, 23 Jun 2004, Pete Zaitcev wrote:

> Guys,
>
> I have a silly question, for which I am unable to google out the answer
> so far. Do we have a Linus' decree on the charset and encoding of the
> kernel source?
>
[SNIPPED...]

Good question! It was supposed to be ASCII, which, I guess, is
UTF-8 or something like that. However, I find that tabs, which
were decreed to be at 8-column intervals, end up being used
instead of single spaces, i.e., as one column, etc. So, if you look at
some well-patched source you sometimes see a mess.
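
On the two points above, a small sketch with invented sample strings: pure ASCII text is byte-identical when encoded as UTF-8 (the reverse does not hold), and a tab advances to the next multiple of 8 columns rather than by one column:

```python
ascii_text = "static int foo;\n"

# Pure ASCII encodes to the same bytes in ASCII and in UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")


def is_ascii(data: bytes) -> bool:
    """True if every byte is 7-bit, i.e. the data is plain ASCII."""
    return all(byte < 0x80 for byte in data)


# A tab is not interchangeable with a single space at 8-column tab stops.
tabbed = "int\tx;"
spaced = "int x;"
expanded = tabbed.expandtabs(8)
```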

The names of contributors often have non-ASCII characters
in them. This may not be a problem, but when using `pine`
without the 'latest-and-greatest' version, they sometimes
are unreadable.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.26 on an i686 machine (5570.56 BogoMips).
Note 96.31% of all statistics are fiction.

2004-06-27 05:48:14

by Kalin KOZHUHAROV

Subject: [PATCH] Translate Japanese comments in arch/v850 (was: Alphabet of kernel source)

David Eger wrote:
>>>http://www.yak.net/random/linux-2.6.4-utf8-cleanup-jp.diff
>>
>>Ok, this Japanese is only in the comments.
>>I can translate that in no time and fix this diff.
>
> actually, I'm pretty sure the diff is correct against 2.6.4 - the bytes
> should all be correct, as I checked it with someone who works with
> said files...

OK, I had a few idle minutes, so I did patch the Japanese comments in arch/v850.

I am not exactly 100% sure I translated it correctly, since I have no idea what exactly that NEC v850 evaluation board was, but it should be OK (say 95% sure).

Patches just the comments, so code is untouched.

The other thing is that one of the files was encoded (i.e. readable) in iso-2022-jp, the other in euc-jp...
No idea how patch will handle this, I hope it doesn't bother with locale settings, etc.
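
A minimal sketch of the two encodings involved, using an invented sample comment (not text from the arch/v850 files). The same characters yield different bytes in each charset, so each file has to be decoded with its own encoding:

```python
# Hypothetical sample text meaning "initialization"; not from the kernel tree.
comment = "\u521d\u671f\u5316"

jis = comment.encode("iso2022_jp")  # 7-bit bytes with escape sequences
euc = comment.encode("euc_jp")      # bytes with the high bit set


def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode data from the source charset and re-encode it as target."""
    return data.decode(source).encode(target)


# Both files end up with identical UTF-8 once decoded correctly.
same_after_transcoding = (
    transcode(jis, "iso2022_jp") == transcode(euc, "euc_jp")
)
```

Decoding either file with the other's charset would fail or produce garbage, so any tool that treats the bytes as text has to know the encoding per file.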

Attaching as application/octet-stream in the hope of better handling of i18n issues, sorry for the inconvenience.

Here goes the patch:

Signed-off-by: Kalin KOZHUHAROV <[email protected]>

Attachments:

v850-jp2en.diff (1.90 kB)