2004-03-04 10:05:23

by David Eger

Subject: [PATCH] UTF-8ifying the kernel source



http://www.yak.net/random/linux-2.6.3-utf8-cleanup-auto.diff.bz2

Here you find the first of several patches to convert the kernel
source from ISO Latin-1 to UTF-8. I'm working on the files that didn't
auto-convert easily; comments welcome ;-)

First, some statistics!

In Linux 2.6.3, there are:
15860 clean 7-bit ASCII files
274 text files that are not 7-bit clean

38 of these 274 files are not auto-convertible -- either they are not ISO
Latin-1 or the high octets appear within the actual code (not comments).

This first patch applies to help files, documentation, and comments which
are trivially correct ISO Latin-1 => UTF-8 conversions. The work I have
left to do is summarized below.
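
As a rough illustration of the classification above (not the script that produced these numbers), a file is 7-bit clean when no octet has the high bit set, and a Latin-1 file converts to UTF-8 by decoding and re-encoding. The function names and the sample comment below are made up for the example.

```python
def is_seven_bit_clean(data: bytes) -> bool:
    """True when every octet is below 0x80, i.e. plain ASCII."""
    return all(b < 0x80 for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """True when the octets already decode as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def latin1_to_utf8(data: bytes) -> bytes:
    """Treat the octets as ISO Latin-1 text and re-encode as UTF-8.

    Every octet is valid Latin-1, so this never fails; it only gives
    the right result if the input really was Latin-1.
    """
    return data.decode("latin-1").encode("utf-8")


# Hypothetical comment containing one Latin-1 octet (0xF6, o-umlaut).
sample = b"/* written by Bj\xf6rn */"
converted = latin1_to_utf8(sample)
```

Because any octet sequence decodes as Latin-1, a conversion like this cannot detect files in some other charset; those still need to be looked at by hand.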

--dte


Un-needed/wrong non-ASCII characters (these fixes will form patch 2)
====================================================================
drivers/video/amifb.c - +- sign?
Documentation/i2c/i2c-protocol - NBSP, but why?
arch/i386/kernel/cpu/cyrix.c - NBSP, but why?
arch/v850/kernel/as85ep1.ld - WTF? comments in some random charset...
drivers/char/ftape/lowlevel/fdc-isr.c - WTF? shit in the comments
include/asm-m68k/atarihw.h - 0x94 - "cancel character"?
include/asm-m68k/atariints.h - 0x94 - "cancel character"?
include/linux/802_11.h - why the non-standard dash?
scripts/docproc.c - why the bizarre spelling for specific?
fs/ext2/xattr.c - bad ASCII art
fs/ext3/xattr.c - bad ASCII art
fs/afs/vlclient.h - a degrees sign, but why?

Box-drawing ASCII art (these fixes will form patch 3)
=====================================================
Documentation/networking/tms380tr.txt - DOS-style ASCII art
arch/arm/nwfpe/fpopcode.h - line-drawing characters

C strings - (what to do?)
=========================
arch/ppc/platforms/proc_rtas.c - a C string containing "degrees"
arch/ppc64/kernel/rtas-proc.c - a C string containing "degrees"
drivers/macintosh/therm_adt7467.c - degrees, MODULE_PARM_DESC(),
and a C string
drivers/mtd/chips/cfi_probe.c - C strings
drivers/net/wireless/netwave_cs.c - C strings
drivers/scsi/dc395x.c - C strings

Other - (i'd convert it, but...)
================================
drivers/pci/pci.ids - I don't know what program processes this...
drivers/ieee1394/oui.db - I don't know what program processes this...

Machine / charset specific shite - (does anything need to be done?)
===================================================================
arch/m68k/hp300/hp300map.map - maps to "char"s.. grr
drivers/char/defkeymap.map - a map file... maps to "char"s.. grr
drivers/char/qtronixmap.c_shipped - maps to "char"s.. grr
drivers/char/qtronixmap.map - maps to "char"s.. grr
drivers/tc/lk201-map.c_shipped - maps to "char"s.. grr
drivers/tc/lk201-map.map - maps to "char"s.. grr
drivers/acorn/char/defkeymap-l7200.c - maps to "char"s.. grr
arch/s390/kernel/ebcdic.c - comments on a keymap table
drivers/video/console/font_8x16.c - comments on a keymap table
drivers/video/console/font_8x8.c - comments on a keymap table
drivers/video/console/font_pearl_8x8.c - comments on a keymap table
drivers/s390/ebcdic.c - comments on a keymap table

Noise from userland (this I won't be touching)
==============================================
Documentation/networking/ethertap.txt - random crap cat'd from /dev/tap0
Documentation/s390/Debugging390.txt - weird gdb output


2004-03-04 10:18:53

by Meelis Roos

Subject: Re: [PATCH] UTF-8ifying the kernel source

DE> Here you find the first of several patches to convert the kernel
DE> source from ISO Latin-1 to UTF-8. I'm working on the files that didn't
DE> auto-convert easily; comments welcome ;-)

Why? It's just easier to use plain 8-bit text files today (with editors,
code tools etc) and accept the limitations of it than to overcome the
limitations by forcing people to UTF-8 editors & other tools.

I am not a kernel developer but this seems a bad idea to me.

--
Meelis Roos

2004-03-04 10:32:16

by Måns Rullgård

Subject: Re: [PATCH] UTF-8ifying the kernel source

Meelis Roos writes:

> DE> Here you find the first of several patches to convert the kernel
> DE> source from ISO Latin-1 to UTF-8. I'm working on the files that didn't
> DE> auto-convert easily; comments welcome ;-)
>
> Why? It's just easier to use plain 8-bit text files today (with editors,
> code tools etc) and accept the limitations of it than to overcome the
> limitations by forcing people to UTF-8 editors & other tools.

How do you propose that editors should know which encoding a file
uses? The trend seems to be moving towards UTF-8 for everything, so
the kernel might as well do it too.

--
Måns Rullgård

2004-03-04 21:56:19

by Alex Belits

Subject: Re: [PATCH] UTF-8ifying the kernel source

On Thu, 4 Mar 2004, David Eger wrote:

> http://www.yak.net/random/linux-2.6.3-utf8-cleanup-auto.diff.bz2
>
> Here you find the first of several patches to convert the kernel
> source from ISO Latin-1 to UTF-8. I'm working on the files that didn't
> auto-convert easily; comments welcome ;-)
>
> First, some statistics!
>
> In Linux 2.6.3, there are:
> 15860 clean 7-bit ASCII files
> 274 text files that are not 7-bit clean
>
> 38 of these 274 files are not auto-convertible -- either they are not ISO
> Latin-1 or the high octets appear within the actual code (not comments).
>
> This first patch applies to help files, documentation, and comments which
> are trivially correct ISO Latin-1 => UTF-8 conversions. The work I have
> left to do is summarized below.

That will be of a great help for the future developers that will edit
kernel sources in Microsoft Word.

[a large collection of expletives in multiple languages and charsets is
skipped here]

--
Alex

2004-03-05 08:26:22

by Miles Bader

Subject: Re: [PATCH] UTF-8ifying the kernel source

David Eger writes:
> arch/v850/kernel/as85ep1.ld - WTF? comments in some random charset...

FWIW, the charset is EUC-JP.

Even other files in that same directory aren't consistent, e.g.,
as85ep1.c uses ISO-2022-JP.

[My fault, but it never really registered on my important-enough-to-fix
radar (emacs autodetects them all so I never really noticed the
discrepancy).]
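
A minimal sketch of how a tool might tell these two encodings apart, assuming Python's bundled euc_jp and iso2022_jp codecs. ISO-2022-JP stays within 7 bits and switches character sets with ESC sequences, so a file in that encoding would pass a naive 7-bit-clean check, while EUC-JP uses high octets. guess_japanese_encoding is a hypothetical helper, trial decoding is only a heuristic, and this is not a description of how emacs does its detection.

```python
def guess_japanese_encoding(data: bytes) -> str:
    """Heuristically label data as ascii, iso2022_jp, euc_jp or unknown."""
    seven_bit = all(b < 0x80 for b in data)

    if seven_bit:
        # ISO-2022-JP is 7-bit but contains ESC (0x1B) shift sequences.
        if b"\x1b" in data:
            try:
                data.decode("iso2022_jp")
                return "iso2022_jp"
            except UnicodeDecodeError:
                pass
        return "ascii"

    # High octets present: see whether they form valid EUC-JP.
    try:
        data.decode("euc_jp")
        return "euc_jp"
    except UnicodeDecodeError:
        return "unknown"


text = "\u65e5\u672c\u8a9e"  # the word "Japanese", written in Japanese
as_euc = text.encode("euc_jp")
as_jis = text.encode("iso2022_jp")
```

Other 8-bit charsets can happen to decode as EUC-JP as well, so a positive answer here is a hint rather than proof.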

-Miles
--
We are all lying in the gutter, but some of us are looking at the stars.
-Oscar Wilde

2004-03-05 13:21:06

by paolo ciarrocchi

Subject: Re: [PATCH] UTF-8ifying the kernel source

Sorry to jump into this thread without providing any useful information...

I'm looking for docs and/or links to info regarding UTF-8 and iso-*.

Any hints?

Thanks in advance.

Ciao,
Paolo


2004-03-05 20:02:00

by H. Peter Anvin

Subject: Re: [PATCH] UTF-8ifying the kernel source

By author: Miles Bader
In newsgroup: linux.dev.kernel
>
> David Eger writes:
> > arch/v850/kernel/as85ep1.ld - WTF? comments in some random charset...
>
> FWIW, the charset is EUC-JP.
>
> Even other files in that same directory aren't consistent, e.g.,
> as85ep1.c uses ISO-2022-JP.
>
> [My fault, but it never really registered on my important-enough-to-fix
> radar (emacs autodetects them all so I never really noticed the
> discrepancy).]
>

OK, this is definitely a good reason to go to UTF-8 across the board.

-hpa

2004-03-05 21:01:06

by Mike Fedyk

Subject: Re: [PATCH] UTF-8ifying the kernel source

H. Peter Anvin wrote:
> By author: Miles Bader
> In newsgroup: linux.dev.kernel
>
>>David Eger writes:
>>
>>>arch/v850/kernel/as85ep1.ld - WTF? comments in some random charset...
>>
>>FWIW, the charset is EUC-JP.
>>
>>Even other files in that same directory aren't consistent, e.g.,
>>as85ep1.c uses ISO-2022-JP.
>>
>>[My fault, but it never really registered on my important-enough-to-fix
>>radar (emacs autodetects them all so I never really noticed the
>>discrepancy).]
>>
>
>
> OK, this is definitely a good reason to go to UTF-8 across the board.

So when is "less" going to support utf8? Right now, it just shows
escape codes... :(

2004-03-05 21:03:16

by H. Peter Anvin

Subject: Re: [PATCH] UTF-8ifying the kernel source

Mike Fedyk wrote:
>>
>> OK, this is definitely a good reason to go to UTF-8 across the board.
>
> So when is "less" going to support utf8? Right now, it just shows
> escape codes... :(
>

Why don't you ask the "less" maintainer about that?

Right now, "less" seems to insist on showing ampersands for *any*
non-ASCII character for me...

-hpa

2004-03-05 21:17:56

by Måns Rullgård

Subject: Re: [PATCH] UTF-8ifying the kernel source

"H. Peter Anvin" writes:

> Mike Fedyk wrote:
>>>
>>> OK, this is definitely a good reason to go to UTF-8 across the board.
>>
>> So when is "less" going to support utf8? Right now, it just shows
>> escape codes... :(
>>
>
> Why don't you ask the "less" maintainer about that?
>
> Right now, "less" seems to insist on showing ampersands for *any*
> non-ASCII character for me...

Less version 381 is working fine here with UTF-8. I have LANG and
LC_CTYPE set to en_US.UTF-8.

--
Måns Rullgård

2004-03-05 21:20:32

by David Eger

Subject: Re: [PATCH] UTF-8ifying the kernel source

On Fri, Mar 05, 2004 at 01:00:55PM -0800, Mike Fedyk wrote:
>
> So when is "less" going to support utf8? Right now, it just shows
> escape codes... :(

bash user? try:
$ export LESSCHARSET="utf-8"
$ less myfavoritefile.c

-dte ;-)

2004-03-05 21:22:45

by Charles Cazabon

Subject: Re: [PATCH] UTF-8ifying the kernel source

Måns Rullgård wrote:
> >
> > Right now, "less" seems to insist on showing ampersands for *any*
> > non-ASCII character for me...
>
> Less version 381 is working fine here with UTF-8. I have LANG and
> LC_CTYPE set to en_US.UTF-8.

less 340 works fine here with the same settings.

Charles
--
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at: http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

2004-03-05 23:24:29

by David Eger

Subject: Re: [PATCH] UTF-8ifying the kernel source

There are now three patches available, and some work left to do.

The first patch hasn't changed, still the trivial ISO Latin-1 => UTF-8.

The second patch takes care of a lot of wrong and/or unneeded non-ASCII.

The third patch concerns 8-bit characters embedded in C strings.
These are almost always output to devfs or proc. The characters used are
the degrees symbol (for ppc temp. sensors) and mu (for microseconds).
I do not want to make a value judgement on what the kernel outputs
to userspace, so I leave the strings the same. However, C99 makes it
implementation-defined how the source character set is translated to
the character set in the compiled binary... Therefore, I've taken the
raw octets and converted them in the source file to octal constants in
the strings, just to make sure cc doesn't mangle things if you set your
locale differently...
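
To make the octal-constant idea concrete, here is a small sketch (not the tool used for the patch) that rewrites the high octets of a string body as three-digit C octal escapes and leaves the ASCII octets alone. c_octal_escape and the sample strings are hypothetical; in Latin-1 the degree sign is 0xB0 (octal 260) and the micro sign is 0xB5 (octal 265).

```python
def c_octal_escape(raw: bytes) -> str:
    """Return raw as C string-literal text with high octets as \\ooo.

    Only octets >= 0x80 are rewritten; quotes and backslashes already
    present in the input are assumed to be escaped by the author.
    """
    pieces = []
    for octet in raw:
        if octet >= 0x80:
            # Always emit three digits: a C octal escape consumes at
            # most three, so a following digit cannot be absorbed.
            pieces.append("\\%03o" % octet)
        else:
            pieces.append(chr(octet))
    return "".join(pieces)


# Hypothetical format strings with Latin-1 degree and micro signs.
temperature = c_octal_escape(b"%d\xb0C")
delay = c_octal_escape(b"%d \xb5s")
```

The compiled binary then contains the same single octet as before, whatever charset the compiler assumes for the source file.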

http://www.yak.net/random/linux-2.6.3-utf8-cleanup-auto.diff.bz2
http://www.yak.net/random/linux-2.6.3-utf8-cleanup-wrong.diff
http://www.yak.net/random/linux-2.6.3-utf8-cleanup-cstrings.diff

-dte


Un-needed/wrong non-ASCII characters (patch 2)
==============================================
drivers/video/amifb.c - +- sign (NOTE: X's .ttf files just don't have it)
Documentation/i2c/i2c-protocol - NBSP, but why? (made regular space)
arch/i386/kernel/cpu/cyrix.c - NBSP, but why? (made regular space)
include/linux/802_11.h - why the non-standard dash? (made regular dash)
scripts/docproc.c - why the bizarre spelling for specific? (fixed)
fs/ext2/xattr.c - bad ASCII art (made regular pipe - fixed)
fs/ext3/xattr.c - bad ASCII art (made regular pipe - fixed)
arch/arm/nwfpe/fpopcode.h - line-drawing characters (fixed)
include/asm-m68k/atarihw.h - 0x94? no, it's an ö, for Björn
include/asm-m68k/atariints.h - 0x94? no, it's an ö, for Björn

C strings - (patch 3)
=====================
arch/ppc/platforms/proc_rtas.c - a C string w/"degrees": exports to proc
arch/ppc64/kernel/rtas-proc.c - a C string w/"degrees": exports to proc
drivers/macintosh/therm_adt7467.c - temperature reporting (degrees sign)
- several printk's, output to a devfs interface, MODULE_PARM_DESC()
drivers/mtd/chips/cfi_probe.c - time reporting (micro sign)
- printk's in the DEBUG code
drivers/net/wireless/netwave_cs.c - module version string
(author's name - but it doesn't seem to be *used* for anything...)

BELOW HERE not fixed...

(was going to be fixed w/ patch, but, umm, huh?)
==================================================
arch/v850/kernel/as85ep1.ld - according to Miles Bader,
it's EUC-JP in the comments, and e.g. as85ep1.c uses ISO-2022-JP...
drivers/char/ftape/lowlevel/fdc-isr.c - WTF? shit in the comments
fs/afs/vlclient.h - a degrees sign, but why? (author says he'll get it)
drivers/scsi/dc395x.c - C debug strings... is this Traditional Chinese?
Documentation/networking/tms380tr.txt - DOS-style ASCII art

Other - (i'd convert it, but...)
================================
drivers/pci/pci.ids - I don't know what program processes this...
drivers/ieee1394/oui.db - I don't know what program processes this...

Machine / charset specific shite - (does anything need to be done?)
===================================================================
arch/m68k/hp300/hp300map.map - maps to "char"s.. grr
drivers/char/defkeymap.map - a map file... maps to "char"s.. grr
drivers/char/qtronixmap.c_shipped - maps to "char"s.. grr
drivers/char/qtronixmap.map - maps to "char"s.. grr
drivers/tc/lk201-map.c_shipped - maps to "char"s.. grr
drivers/tc/lk201-map.map - maps to "char"s.. grr
drivers/acorn/char/defkeymap-l7200.c - maps to "char"s.. grr
arch/s390/kernel/ebcdic.c - comments on a keymap table
drivers/video/console/font_8x16.c - comments on a keymap table
drivers/video/console/font_8x8.c - comments on a keymap table
drivers/video/console/font_pearl_8x8.c - comments on a keymap table
drivers/s390/ebcdic.c - comments on a keymap table

Noise from userland (this I won't be touching)
==============================================
Documentation/networking/ethertap.txt - random crap cat'd from /dev/tap0
Documentation/s390/Debugging390.txt - weird gdb output

2004-03-05 23:33:29

by H. Peter Anvin

Subject: Re: [PATCH] UTF-8ifying the kernel source

By author: David Eger
In newsgroup: linux.dev.kernel

> The third patch concerns 8-bit characters embedded in C strings.
> These are almost always output to devfs or proc. The characters used are
> the degrees symbol (for ppc temp. sensors) and mu (for micro-seconds).
> I do not want to make a value judgement on what the kernel outputs
> to userspace, so I leave the strings the same. However, C99 makes it
> implementation defined how the source character set is translated to
> the character set in the compiled binary... Therefore, I've taken the
> raw octets and converted them in the source file to octal constants in
> the strings, just to make sure cc doesn't mangle things if you set your
> locale differently...
>

I would highly vote for making those UTF-8 unless it breaks protocol.

Plain ASCII would be better, though.

-hpa

2004-03-06 11:09:04

by Xavier Bestel

Subject: Re: [PATCH] UTF-8ifying the kernel source

On Sat 06/03/2004 at 00:33, H. Peter Anvin wrote:
> By author: David Eger
> In newsgroup: linux.dev.kernel
>
> > The third patch concerns 8-bit characters embedded in C strings.
> > These are almost always output to devfs or proc. The characters used are
> > the degrees symbol (for ppc temp. sensors) and mu (for micro-seconds).
>
> I would highly vote for making those UTF-8 unless it breaks protocol.

ISO-8859-1 characters are mostly the same in UTF-8.

Xav

2004-03-06 11:14:34

by Måns Rullgård

Subject: Re: [PATCH] UTF-8ifying the kernel source

Xavier Bestel writes:

> On Sat 06/03/2004 at 00:33, H. Peter Anvin wrote:
>> By author: David Eger
>> In newsgroup: linux.dev.kernel
>>
>> > The third patch concerns 8-bit characters embedded in C strings.
>> > These are almost always output to devfs or proc. The characters used are
>> > the degrees symbol (for ppc temp. sensors) and mu (for micro-seconds).
>>
>> I would highly vote for making those UTF-8 unless it breaks protocol.
>
> ISO-8859-1 characters are mostly the same in UTF-8.

The 7-bit ones are the same. The 8-bit ones are all different.

--
Måns Rullgård

2004-03-06 13:33:11

by David Eger

Subject: Other bizarre thing... backspaces?

There are five files with embedded backspace octets in them.... ;-)

fs/hfs/FAQ.txt
fs/hfs/HFS.txt
fs/hfs/INSTALL.txt
Documentation/filesystems/coda.txt
Documentation/uml/UserModeLinux-HOWTO.txt

-dte

2004-03-06 14:04:41

by Måns Rullgård

Subject: Re: Other bizarre thing... backspaces?

David Eger writes:

> There are five files with embedded backspace octets in them.... ;-)

That's an old way to do underlining and bold face and it seems like at
least coda.txt is doing that. If I could choose I'd probably just
remove them.
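
For reference, the overstrike convention writes underlining as "_", backspace, character and bold as character, backspace, same character. A minimal sketch of removing it, assuming that convention and a hypothetical helper name, is to drop every "octet followed by backspace" pair so that only the last character struck at each position remains:

```python
import re

# Any octet except newline, followed by a backspace (0x08).
_OVERSTRIKE = re.compile(rb"[^\n]\x08")


def strip_overstrike(data: bytes) -> bytes:
    """Remove nroff-style overstrike sequences, keeping the final glyph."""
    return _OVERSTRIKE.sub(b"", data)


# Hypothetical samples: an underlined word and a bold word.
underlined = b"_\x08c_\x08o_\x08d_\x08a"
bold = b"N\x08NA\x08AM\x08ME\x08E"
```

A backspace at the very start of a line has nothing before it to remove and is left in place by this sketch.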

--
Måns Rullgård

2004-03-09 00:30:21

by H. Peter Anvin

Subject: Re: [PATCH] UTF-8ifying the kernel source

By author: Xavier Bestel
In newsgroup: linux.dev.kernel
>
> On Sat 06/03/2004 at 00:33, H. Peter Anvin wrote:
> > By author: David Eger
> > In newsgroup: linux.dev.kernel
> >
> > > The third patch concerns 8-bit characters embedded in C strings.
> > > These are almost always output to devfs or proc. The characters used are
> > > the degrees symbol (for ppc temp. sensors) and mu (for micro-seconds).
> >
> > I would highly vote for making those UTF-8 unless it breaks protocol.
>
> ISO-8859-1 characters are mostly the same in UTF-8.
>

Unicode, yes. UTF-8, no. The ISO-8859-1 character "Å" (0xC5) does,
indeed correspond to Unicode character U+00C5, but it's encoded 0xC3
0x85 in UTF-8.
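
The distinction is easy to check with a few lines of Python (an illustration only; same_octets is a made-up helper): the code point is shared between ISO-8859-1 and Unicode, but the octets on disk differ for everything above 0x7F.

```python
def same_octets(ch: str) -> bool:
    """True when ch is stored identically in ISO-8859-1 and UTF-8."""
    return ch.encode("latin-1") == ch.encode("utf-8")


a_ring = "\u00c5"                      # LATIN CAPITAL LETTER A WITH RING ABOVE
as_latin1 = a_ring.encode("latin-1")   # one octet, equal to the code point
as_utf8 = a_ring.encode("utf-8")       # two octets

# Every ISO-8859-1 character whose octets survive unchanged in UTF-8.
unchanged = [cp for cp in range(256) if same_octets(chr(cp))]
```

Here unchanged comes out as exactly the 128 ASCII code points, which is the "7-bit ones are the same, 8-bit ones are all different" observation made earlier in the thread.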

-hpa

2004-03-09 09:55:25

by Xavier Bestel

Subject: Re: [PATCH] UTF-8ifying the kernel source

On Tue, 2004-03-09 at 00:30 +0000, H. Peter Anvin wrote:

> By author: Xavier Bestel
> > ISO-8859-1 characters are mostly the same in UTF-8.
> >
>
> Unicode, yes. UTF-8, no. The ISO-8859-1 character "Å" (0xC5) does,
> indeed correspond to Unicode character U+00C5, but it's encoded 0xC3
> 0x85 in UTF-8.

Yeah, that's what I realized, after posting of course.
While utf-8ying the sources is certainly a good thing, I have mixed
feelings about kernel strings. They will render poorly in some
environments.
Maybe the all-ASCII route is better for strings?

Xav

2004-03-09 12:20:09

by Geert Uytterhoeven

Subject: Re: [PATCH] UTF-8ifying the kernel source

On Fri, 5 Mar 2004, David Eger wrote:
> Un-needed/wrong non-ASCII characters (patch 2)
> ==============================================
> drivers/video/amifb.c - +- sign (NOTE: X's .ttf files just don't have it)

do_blank is either 0 (do nothing), -1 (unblank), or +1 (blank).

You can replace it by `+/-1' if you want.

> include/asm-m68k/atarihw.h - 0x94? no, it's an ö, for Björn
> include/asm-m68k/atariints.h - 0x94? no, it's an ö, for Björn

Yep.

> Machine / charset specific shite - (does anything need to be done?)
> ===================================================================
> arch/m68k/hp300/hp300map.map - maps to "char"s.. grr
> drivers/char/defkeymap.map - a map file... maps to "char"s.. grr
> drivers/char/qtronixmap.c_shipped - maps to "char"s.. grr
> drivers/char/qtronixmap.map - maps to "char"s.. grr
> drivers/tc/lk201-map.c_shipped - maps to "char"s.. grr
> drivers/tc/lk201-map.map - maps to "char"s.. grr
> drivers/acorn/char/defkeymap-l7200.c - maps to "char"s.. grr

If you want the keyboard to generate UTF-8, I think you should change these
(not sure, please test).

> drivers/video/console/font_8x16.c - comments on a keymap table
> drivers/video/console/font_8x8.c - comments on a keymap table
> drivers/video/console/font_pearl_8x8.c - comments on a keymap table

These fonts have the box-drawing ASCII art.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2004-03-14 23:32:23

by Petr Baudis

Subject: Re: Re: Other bizarre thing... backspaces?

Dear diary, on Sat, Mar 06, 2004 at 03:04:35PM CET, I got a letter,
where Måns Rullgård told me, that...
> David Eger writes:
>
> > There are five files with embedded backspace octets in them.... ;-)
>
> That's an old way to do underlining and bold face and it seems like at
> least coda.txt is doing that. If I could choose I'd probably just
> remove them.

Well, what's the "new way" for ASCII documents? At least less produces a
desired result.

Kind regards,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man. -- George Bernard Shaw