2004-03-23 23:02:14

by David Eger

Subject: [PATCH 0/4] UTF-8ifying the kernel sources


Following are four patches against Linux 2.6.4 for charset clean-up.
That is, converting the sources to be UTF-8 text. Please consider.

Thankfully, the vast majority of the kernel sources (some 15860 files)
are 7-bit clean ASCII. The rest (274 files) are mostly ISO Latin-1.
However, there are spots of pure junk (userland output in the docs and
some corruption that's crept in over time), box-drawing characters,
EUC-JP, and ISO-2022-JP.
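
A census like the one above could be taken roughly as follows. This is
only a sketch, not the script used to produce these numbers: the function
name and category labels are made up for illustration. Note that
ISO-2022-JP is itself 7-bit, so the sketch treats an ESC byte as a sign
that a file is not plain ASCII.

```python
def classify(data: bytes) -> str:
    """Bucket a file's raw bytes by apparent encoding (hypothetical labels)."""
    has_high = any(b >= 0x80 for b in data)
    has_esc = 0x1B in data
    if not has_high:
        # 7-bit only; ESC suggests an ISO-2022 style encoding, not plain text
        return "7-bit-escapes" if has_esc else "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Latin-1, EUC-JP, box-drawing code pages, or junk: needs a human
        return "other-8-bit"
```

Anything landing in the last two buckets still needs a human to decide
which charset it really is.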

The only controversial patch here is patch 4, which changes the
output of the kernel to userspace (via /proc, mostly). It sets the
policy that the kernel should output UTF-8 instead of ISO Latin-1
when printing things that are not 7-bit ASCII. Details below.

-dte

Patch 1: linux-2.6.4-utf8-cleanup-auto.diff.bz2
===============================================
237 ISO Latin-1 files auto-converted to UTF-8.
All changes are to documentation or to the comments of code.
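
For illustration, the automatic conversion amounts to something like the
sketch below (the helper name is hypothetical, and the real conversion
may well have been done with a tool such as iconv). Every byte string is
valid Latin-1, so this step cannot notice a file that was really in some
other charset.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO Latin-1 bytes as UTF-8 (illustrative helper)."""
    try:
        # Already valid UTF-8 (including pure ASCII): leave it alone.
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        pass
    # Latin-1 maps each byte 0x00-0xFF to the code point of the same value.
    return data.decode("latin-1").encode("utf-8")
```

The guard at the top is an assumption made for this sketch; it keeps the
conversion from double-encoding a file that has already been converted.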

Patch 2: linux-2.6.4-utf8-cleanup-jp.diff
=========================================
arch/v850/kernel/as85ep1.ld
arch/v850/kernel/as85ep1.c

Two files containing Japanese comments - in two different charsets!
Unfortunately, emacs doesn't understand CJK pages in UTF-8.
cat, less, and vim do (though vim gets confused when you look
at the diff, since it is a mix of two charsets...)
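
A rough sketch of converting such files, assuming they are either EUC-JP
or ISO-2022-JP as described in the cover letter. Which file uses which
charset is not stated here, so the sketch guesses from the bytes:
ISO-2022-JP is 7-bit and switches character sets with ESC sequences,
while EUC-JP uses bytes with the high bit set. The function name is
hypothetical.

```python
def japanese_to_utf8(data: bytes) -> bytes:
    """Convert EUC-JP or ISO-2022-JP bytes to UTF-8 (heuristic guess)."""
    # An ESC byte (0x1B) is taken as the mark of ISO-2022-JP.
    codec = "iso2022_jp" if b"\x1b" in data else "euc_jp"
    return data.decode(codec).encode("utf-8")
```

Both inputs come out as the same UTF-8 bytes, which is the point of the
patch: one charset for both files.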

Patch 3: linux-2.6.4-utf8-cleanup-wrong.diff
============================================
drivers/video/amifb.c - +- sign (NOTE: X's .ttf files just don't have it)
Documentation/i2c/i2c-protocol - NBSP, but why? (made regular space)
arch/i386/kernel/cpu/cyrix.c - NBSP, but why? (made regular space)
include/linux/802_11.h - why the non-standard dash? (made regular dash)
scripts/docproc.c - why the bizarre spelling for specific? (fixed)
fs/ext2/xattr.c - bad ASCII art (made regular pipe - fixed)
fs/ext3/xattr.c - bad ASCII art (made regular pipe - fixed)
arch/arm/nwfpe/fpopcode.h - line-drawing characters (fixed)
include/asm-m68k/atarihw.h - 0x94? no, it's an ö, for Björn
include/asm-m68k/atariints.h - 0x94? no, it's an ö, for Björn
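
The 0x94 case shows why these files could not go through the automatic
Latin-1 conversion. The sketch below assumes the byte came from a
DOS-style code page such as CP437, where 0x94 is ö; the original charset
is not named above, so that choice is a guess. In Latin-1 the same byte
is an invisible C1 control character.

```python
def decode_name(raw: bytes, codec: str) -> str:
    """Decode raw bytes with the given codec (illustrative helper)."""
    return raw.decode(codec)

raw = b"Bj\x94rn"
wrong = decode_name(raw, "latin-1")  # 0x94 becomes control char U+0094
right = decode_name(raw, "cp437")    # 0x94 becomes U+00F6 (o with diaeresis)
```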

Patch 4: linux-2.6.4-utf8-cleanup-cstrings2utf8.diff
====================================================
arch/ppc/platforms/proc_rtas.c - a C string w/"degrees": exports to proc
arch/ppc64/kernel/rtas-proc.c - a C string w/"degrees": exports to proc
drivers/macintosh/therm_adt7467.c - temperature reporting (degrees sign)
- several printk's, output to a devfs interface, MODULE_PARM_DESC()
drivers/mtd/chips/cfi_probe.c - time reporting (micro sign)
- printk's in the DEBUG code
drivers/net/wireless/netwave_cs.c - module version string
(author's name - but it doesn't seem to be *used* for anything...)
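
To make the userspace-visible change concrete, here is a small sketch of
what happens to the two symbols involved. The names are made up for
illustration. In ISO Latin-1 each sign is a single byte; in UTF-8 each
becomes two bytes, so a program reading these strings from /proc or the
log would see different bytes after this patch.

```python
DEGREE = "\u00b0"  # degree sign, used in the temperature strings
MICRO = "\u00b5"   # micro sign, used in the timing messages

def encoded_forms(ch: str) -> tuple[bytes, bytes]:
    """Return (Latin-1 bytes, UTF-8 bytes) for a single character."""
    return ch.encode("latin-1"), ch.encode("utf-8")

degree_latin1, degree_utf8 = encoded_forms(DEGREE)  # b'\xb0', b'\xc2\xb0'
micro_latin1, micro_utf8 = encoded_forms(MICRO)     # b'\xb5', b'\xc2\xb5'
```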


2004-03-24 06:49:50

by Miles Bader

Subject: Re: [PATCH 0/4] UTF-8ifying the kernel sources

David Eger writes:
> Two files containing Japanese comments - in two different charsets!
> Unfortunately, emacs doesn't understand CJK pages in UTF-8.

If you're using a recent version of GNU Emacs (I'm not sure how recent
-- I track emacs CVS), do:

M-x utf-translate-cjk-mode RET

That will make emacs handle CJK UTF-8 properly (it's not turned on by
default because it involves loading some large tables).

-Miles
--
"1971 pickup truck; will trade for guns"