LinuxLists.cc - [PATCH] Russian encoding support for MacHFS

2005-01-24 10:01:07

Subject: [PATCH] Russian encoding support for MacHFS

Hello, guys! I'd like to present to the community my second kernel project. This patch adds support for Russian characters on MacHFS volumes if you use the KOI8-R encoding on Linux (the common case in Russia).
The implementation is probably not ideal because it uses its own tables instead of NLS modules. I consider using NLS modules impossible because, due to the nature of MacHFS (or at least of the current implementation), names must be supplied in the MacOS encoding for lookups to work. This means you must be able to reverse-translate every name from the Linux encoding to the Mac encoding. Using NLS loses characters: if a requested character does not exist in the table, it is substituted by '?'. Macintosh disks often contain special characters in file names (the "Folder" character, for example), which would be lost in this case.
If someone has an idea how to fix this, you're welcome. I currently don't see a way to improve it because I don't know the internal HFS structure. Using UTF-8 as the host encoding would probably solve the problem, but it's not commonly used in Russia.

--
Best regards,
Pavel Fedin, mailto:[email protected]

Attachments:

hfs-koi8r.diff.bz2 (4.36 kB)

2005-01-24 18:46:24

by Alex Riesen

Subject: Re: [PATCH] Russian encoding support for MacHFS

(could you please use shorter lines? Around 80 is good. It's difficult to read).

On Mon, 24 Jan 2005 12:57:56 +0300, Pavel Fedin <[email protected]> wrote:
> ... This means that you must to be able to reverse-translate all names from
> Linux encoding to Mac encoding. Using NLS causes characters loss if
> requested character does not exist in the table (it is substituted by '?').
> Macintosh disks often contains specific characters in file names
> ("Folder" character for example) which will be lost in this case.

How about just leaving the characters unchanged (remapping them to the
same codes in Unicode)?

> Probably using utf8 as host encoding would solve the problem but it's not
> commonly used in Russia.

Unicode, and its encoding UTF-8, IS commonly used everywhere.
And Russia can (and often does) use it just as well.

P.S. Read Documentation/SubmittingPatches.
What kernel is the patch against?

2005-01-25 09:38:12

by Pavel Fedin

Subject: Re: [PATCH] Russian encoding support for MacHFS

On Mon, 24 Jan 2005 19:46:18 +0100
Alex Riesen <[email protected]> wrote:

> how about just leave the characters unchanged? (remap them to the same
> codes in Unicode).

But what do I do when I then convert from Unicode to an 8-bit iocharset? Several characters in the Mac charset may be converted to the same character in the Linux charset. This loses information, and the name is no longer reverse-translatable.
To describe the problem better: I have an 8-bit Mac encoding (1) and an 8-bit target encoding, the iocharset (2). I need to convert from (1) to (2) and be able to convert back. I tried a one-way conversion like in other filesystems, but this didn't work.
NLS tables can probably be used when the iocharset is UTF-8. If you wish, I can try to implement that after some time.
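
The reason a UTF-8 iocharset sidesteps the problem is that every Unicode code point has its own UTF-8 byte sequence, so nothing has to collapse to '?'. A minimal userspace sketch of that round trip, assuming only BMP code points are needed (the function names are made up for illustration and are not part of the patch or of the kernel NLS code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one BMP code point as UTF-8; returns the number of bytes written
 * (1..3). Surrogates and code points above U+FFFF are out of scope here. */
size_t utf8_encode(uint16_t cp, uint8_t out[3])
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (uint8_t)(0xE0 | (cp >> 12));
    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 3;
}

/* Decode one sequence produced by utf8_encode(); minimal validation only. */
uint16_t utf8_decode(const uint8_t *in, size_t len)
{
    assert(in != NULL && len >= 1 && len <= 3);
    if (len == 1)
        return in[0];
    if (len == 2)
        return (uint16_t)(((in[0] & 0x1F) << 6) | (in[1] & 0x3F));
    return (uint16_t)(((in[0] & 0x0F) << 12) |
                      ((in[1] & 0x3F) << 6) |
                      (in[2] & 0x3F));
}

/* Returns 1 if cp survives an encode/decode round trip. */
int utf8_roundtrip_ok(uint16_t cp)
{
    uint8_t buf[3];
    size_t n = utf8_encode(cp, buf);

    return utf8_decode(buf, n) == cp;
}
```

With this, a Cyrillic letter such as U+0410 and a symbol such as U+2020 both come back unchanged, which is exactly the property the 8-bit to 8-bit conversion lacks.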

> Unicode, and its encoding UTF8 IS commonly used everywhere.
> And Russia can (and often does) use it just as well.

Many people say a lot of software is not UTF-8-ready yet. In any case, I had problems when I tried to use it. Many Russian plain-text documents use an 8-bit encoding, so I need to be able to deal with them. A lot of software assumes that 1 byte is 1 character.

> P.S. Read Documentation/SubmittingPatches.

OK. Sorry for the violations.

> What kernel is the patch against?

2.6.8.

--
Best regards,
Pavel Fedin, mailto:[email protected]

2005-01-25 15:35:13

by Roman Zippel

Subject: Re: [PATCH] Russian encoding support for MacHFS

Hi,

On Tue, 25 Jan 2005, Pavel Fedin wrote:

> > how about just leave the characters unchanged? (remap them to the same
> > codes in Unicode).
>
> But what to do when i convert then from unicode to 8-bit iocharset?
> This can lead to that several characters in Mac charset will be
> converted to the same character in Linux charset. This will lead to
> information loss and name will not be reverse-translatable.
> To describe the thing better: i have 8-bit Mac encoding and 8-bit
> target encoding (iocharset). I need to convert from (1) to (2) and be
> able to convert back. I tried to perform a one-way conversion like in
> other filesystems but this didn't work.
> Probably NLS tables can be used when iocharset is UTF8. If you wish i
> can try to implement it after some time.

I'm not quite sure what problem you're trying to solve here. NLS is used
to convert from a local encoding to Unicode. HFS has only 8-bit
characters, so there isn't much space to store Unicode characters in.
If you want to use UTF-8, you can do so without changing HFS. All
filesystems which don't use NLS (that includes e.g. ext3) store the
filename in the local encoding.
If you want to store Unicode characters, use HFS+. I plan to implement NLS
support for it real soon (especially because I also want to fix the
missing decomposition support).

bye, Roman

2005-01-26 06:33:46

by Pavel Fedin

Subject: Re: [PATCH] Russian encoding support for MacHFS

On Tue, 25 Jan 2005 16:34:57 +0100 (CET)
Roman Zippel <[email protected]> wrote:

> I'm not quite sure, what problem you're trying to solve here.

I am trying to implement character set conversion for MacHFS. I have some CDs with Russian file names. Currently they are not displayed properly because Linux uses the KOI8-R character set for Russian letters, while the Macintosh uses its own character set called Mac-Cyrillic, or codepage 10007.
At first I tried to implement the conversion using NLS tables, via the "iocharset" and "codepage" arguments. "iocharset" specified Linux's local character set and "codepage" specified the HFS character set. So to convert a character I had to process it twice: convert from "codepage" to Unicode, then from Unicode to "iocharset".
The problem is that some characters are lost during this conversion. Not all characters from the source ("codepage") charset are present in the destination ("iocharset") table (the "Folder" sign, for example). But for hfs_lookup() in dir.c to work, we must be able to convert the name back from KOI8-R to CP10007; otherwise the search algorithm fails. As a result, we would be unable to operate on any file whose name contains such characters.
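
The loss described above can be reproduced in userspace with a rough sketch of the two-step conversion. The tables below are tiny hand-written excerpts chosen for illustration (an assumption, not the kernel's NLS tables and not the tables from the patch), and all function names are hypothetical; the byte values should be checked against the real code pages before relying on them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct map_entry {
    uint8_t byte;   /* 8-bit code in the legacy charset */
    uint16_t uni;   /* corresponding Unicode code point */
};

/* Illustrative excerpt of a Mac Cyrillic table. */
static const struct map_entry mac_tab[] = {
    { 0x80, 0x0410 },   /* CYRILLIC CAPITAL LETTER A */
    { 0xE0, 0x0430 },   /* CYRILLIC SMALL LETTER A */
    { 0xA0, 0x2020 },   /* DAGGER: has no KOI8-R counterpart */
};

/* Illustrative excerpt of a KOI8-R table. */
static const struct map_entry koi_tab[] = {
    { 0xE1, 0x0410 },   /* CYRILLIC CAPITAL LETTER A */
    { 0xC1, 0x0430 },   /* CYRILLIC SMALL LETTER A */
};

#define TAB_LEN(t) (sizeof(t) / sizeof((t)[0]))

/* Charset byte -> Unicode; ASCII passes through, unknown bytes become '?'. */
static uint16_t to_uni(const struct map_entry *t, size_t n, uint8_t c)
{
    assert(t != NULL);
    if (c < 0x80)
        return c;
    for (size_t i = 0; i < n; i++)
        if (t[i].byte == c)
            return t[i].uni;
    return '?';
}

/* Unicode -> charset byte; characters missing from the table become '?'. */
static uint8_t from_uni(const struct map_entry *t, size_t n, uint16_t u)
{
    assert(t != NULL);
    if (u < 0x80)
        return (uint8_t)u;
    for (size_t i = 0; i < n; i++)
        if (t[i].uni == u)
            return t[i].byte;
    return '?';
}

/* "codepage" -> Unicode -> "iocharset" */
uint8_t mac_to_koi(uint8_t c)
{
    return from_uni(koi_tab, TAB_LEN(koi_tab),
                    to_uni(mac_tab, TAB_LEN(mac_tab), c));
}

/* "iocharset" -> Unicode -> "codepage", as a lookup would need */
uint8_t koi_to_mac(uint8_t c)
{
    return from_uni(mac_tab, TAB_LEN(mac_tab),
                    to_uni(koi_tab, TAB_LEN(koi_tab), c));
}
```

Here the letters convert in both directions, but the byte 0xA0 turns into '?' on the way out and comes back as '?', so a name containing it no longer matches the on-disk name.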
My solution was to use my own conversion table, which ensures that no characters are lost in either direction. Every unique source character is translated to a unique destination character. Of course, Mac-specific characters are not displayed properly, but they're not lost either. The "codepage" argument was omitted for simplicity because a specific "iocharset" implies a specific "codepage" (for example, if iocharset is koi8-r, we can assume the Macintosh codepage is mac-cyrillic). But some people said that this patch can't be approved because not using NLS is a bad solution. So I'd like to talk with you: maybe we'll find a better solution (because you know HFS better than I do), or we'll conclude that there really is no better solution and push the patch upstream.
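
The property such a table needs is that it is a permutation of the 256 byte values, so an inverse table exists. A minimal sketch of checking that and building the inverse, assuming a toy table (identity plus one swapped pair) rather than the actual table from the patch, which is not shown in this thread; the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build a toy forward table: identity everywhere, except that one pair of
 * bytes is swapped. The swap keeps the table a permutation. With lossy != 0,
 * one byte is additionally forced to '?', which breaks the permutation. */
void build_toy_table(uint8_t fwd[256], int lossy)
{
    assert(fwd != NULL);
    for (int i = 0; i < 256; i++)
        fwd[i] = (uint8_t)i;
    fwd[0x80] = 0xE1;   /* illustrative: Mac byte -> KOI8-R byte */
    fwd[0xE1] = 0x80;   /* displaced byte takes the freed slot */
    if (lossy)
        fwd[0xA0] = '?';
}

/* Fill inv with the inverse of fwd. Returns 0 on success, or -1 if two
 * source bytes map to the same destination (table is not reversible). */
int invert_table(const uint8_t fwd[256], uint8_t inv[256])
{
    uint8_t seen[256];

    assert(fwd != NULL && inv != NULL);
    memset(seen, 0, sizeof(seen));
    for (int i = 0; i < 256; i++) {
        if (seen[fwd[i]])
            return -1;
        seen[fwd[i]] = 1;
        inv[fwd[i]] = (uint8_t)i;
    }
    return 0;
}

/* Translate a name in place through a 256-entry table. */
void translate_name(const uint8_t table[256], uint8_t *name, size_t len)
{
    assert(table != NULL && name != NULL);
    for (size_t i = 0; i < len; i++)
        name[i] = table[name[i]];
}
```

Translating a name with the forward table and then with the inverse returns the original bytes for any input. The displaced byte ends up on some unrelated glyph, which matches the trade-off described above: such characters display incorrectly but are not lost.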

> If you want to store unicode characters use HFS+, I plan to implement nls
> support real soon for it (especially because to also fix the missing
> decomposition support).

Would be nice. I also thought about it, but I have no HFS+ disks with Russian names, so I can't test it. And I decided not to do a "blind" implementation so as not to break the filesystem. Currently my patch also adds the "iocharset" argument to HFS+ (so that I can specify both filesystems in one /etc/fstab line, which is useful for CD-ROMs), but it is ignored there.

--
Best regards,
Pavel Fedin, mailto:[email protected]

2005-01-27 08:29:24

by Alex Riesen

Subject: Re: [PATCH] Russian encoding support for MacHFS

On Tue, 25 Jan 2005 12:35:16 +0300, Pavel Fedin <[email protected]> wrote:
> > how about just leave the characters unchanged? (remap them to the same
> > codes in Unicode).
>
> But what to do when i convert then from unicode to 8-bit iocharset? This can lead to that several characters in Mac charset will be converted to the same character in Linux charset. This will lead to information loss and name will not be reverse-translatable.
> To describe the thing better: i have 8-bit Mac encoding and 8-bit target encoding (iocharset). I need to convert from (1) to (2) and be able to convert back. I tried to perform a one-way conversion like in other filesystems but this didn't work.
> Probably NLS tables can be used when iocharset is UTF8. If you wish i can try to implement it after some time.

Remap Unicode characters missing from the filesystem codepage to something
like '?'. I believe this is what the NLS routines do if the converter
returns -1 (error). You'd lose the new characters, right. But you'd lose
them anyway, as they have no place in Mac software.

> > Unicode, and its encoding UTF8 IS commonly used everywhere.
> > And Russia can (and often does) use it just as well.
>
> Many people say many software is not UTF8-ready yet. Anyway i had problems when tried to use it. Many russian ASCII documents use 8-bit encoding so i need to be able to deal with them. Many software assumes that 1 byte is 1 character.

just fix that software instead of polluting the kernel.

And besides: software which _does_ work with Unicode
can make good use of an NLS module for HFS.