LinuxLists.cc - Re: [Patch] Support UTF-8 scripts

2005-09-18 00:53:35

Subject: Re: [Patch] Support UTF-8 scripts

Bernd Petrovitsch <[email protected]> wrote:
> On Sat, 2005-09-17 at 08:20 +0200, "Martin v. Löwis" wrote:
>> Bernd Petrovitsch wrote:
>> > On Fri, 2005-09-16 at 22:41 +0200, "Martin v. Löwis" wrote:

>> > [ Language-specific examples ]
>> >
>> > And that's the only working way - the programming languages can actually
>> > do it because they define the syntax and semantics of the contents
>> > anyway.
>>
>> It works from the programming language point of view, but it is a mess
>> from the text editor point of view.
>
> Most of the text editors have ways to mark up the source files. Not even
> the various editors are able to agree on one method for all, so why
> could the (Linux) world agree on one for all text files?

You don't need a marker for all text files, but it's legal to have a marker
for utf-8 text files (see the Unicode Standard 4.0.0, section 15.9), and
it's handy to use it until you have made everybody in the world convert
everything to utf-8 (but not utf-{16,32}{le,be}).

>> > With this marker you are interfering with (at least) *all* text files.
>>
>> Hmm. What does that have to do with the patch I'm proposing? This
>> patch does *not* interfere with all text files. It is only relevant
>> for executable files starting with the #! magic.
>
> It *does* interfere since scripts are also text files in every aspect.
> So every feature you want for "scripts" you also get for text files (and
> vice versa BTW).

If utf-8 encoded text files are text files, and text files are scripts,
and all of them shall have the same features, utf-8 encoded text files
with BOM MUST be recognized as legal scripts, too. Therefore this patch
fixes a kernel bug.
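
A minimal sketch of the check this implies, written in Python purely for
illustration (the patch itself is kernel code and is not reproduced in this
thread). The function name looks_like_script is hypothetical; the only
assumptions are that the UTF-8 signature is the byte sequence EF BB BF and
that a script starts with "#!".

```python
import codecs

# The UTF-8 signature (BOM) as raw bytes: EF BB BF.
UTF8_SIGNATURE = codecs.BOM_UTF8


def looks_like_script(head: bytes) -> bool:
    """Return True if `head` starts with "#!", optionally preceded by
    the UTF-8 signature. `head` is the first few bytes of a file."""
    if head.startswith(UTF8_SIGNATURE):
        # Skip the signature and inspect what follows it.
        head = head[len(UTF8_SIGNATURE):]
    return head.startswith(b"#!")
```

With this, both b"#!/bin/sh\n" and b"\xef\xbb\xbf#!/bin/sh\n" are accepted,
while a signed text file without "#!" is still rejected.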

BTW: Implementing the other utf signatures from Table 15.3 is left to the
reader as an exercise.-)
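
For that exercise, a rough Python sketch that recognizes the encoding scheme
signatures by their leading bytes. The byte values are taken from the
constants in Python's codecs module; the name detect_signature and the
ordering of the list are choices made for this sketch and are not part of
the patch.

```python
import codecs

# Longer signatures come first: the UTF-32LE signature (FF FE 00 00)
# begins with the UTF-16LE signature (FF FE), so order matters.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def detect_signature(head: bytes):
    """Return the name of the signature at the start of `head`,
    or None if no known signature is present."""
    for signature, name in SIGNATURES:
        if head.startswith(signature):
            return name
    return None
```

One caveat of this approach: a UTF-16LE file whose first character after the
signature is U+0000 has the same leading bytes as a UTF-32LE signature, so
the result is a guess, not a proof.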

> If you think "script" and "text file" are different, define both of
> them, please, otherwise a discussion is pointless.

If all text files are script files, execute this mail.

>> > And there are always tools out there which simply do not understand the
>> > generic marker and can not ignore it since these bytes are part of the
>> > file.
>>
>> This conclusion is false. Many tools that don't understand the file
>> structure still can do their job on the files. So the fact that a tool
>> does not understand the structure does not necessarily imply that
>> the tool breaks when the structure changes.
>
> It *may* break just because of some to-be-ignored inline marking due to
> some questionable feature.

How exactly does it break, and what is it? And why must *it* be prevented
from breaking by ignoring script signatures in valid text files?

> And *when* (not if) it breaks, it is probably cumbersome to find since
> you have pretty unprintable characters.

If your tools can't print utf-8 encoded characters, they are broken for
ISO-8859-*, too. Besides that, it's not a kernel problem.

> Let alone the confusion why the size of a file with `ls -l` is different
> from the size in the editor or a marker-aware `wc -c`.
> So IMHO either you have a clear and visible marker or you none at all.

Like e.g. the "From "-line starting each message in an mbox file? Virtually
no email client will display it. The size of an email message differs
from its unencoded content size, too! Of course nobody can handle this,
and all users constantly try to kill themselves because of that - NOT.
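
To put a number on the size difference mentioned in the quoted text, a small
Python sketch. The helper names raw_size and marker_aware_size are
hypothetical; the "marker-aware" count simply assumes that a leading UTF-8
signature is not counted.

```python
import codecs


def raw_size(data: bytes) -> int:
    """Size in bytes as stored on disk (what `ls -l` reports)."""
    return len(data)


def marker_aware_size(data: bytes) -> int:
    """Size in bytes with a leading UTF-8 signature left out."""
    if data.startswith(codecs.BOM_UTF8):
        return len(data) - len(codecs.BOM_UTF8)
    return len(data)


script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hi\n"
print(raw_size(script), marker_aware_size(script))  # prints "21 18"
```

The difference is always exactly three bytes for a UTF-8 file, regardless of
its length.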

--
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.

2005-09-18 16:56:15

by Bernd Petrovitsch

Subject: Re: [Patch] Support UTF-8 scripts

On Sun, 2005-09-18 at 02:53 +0200, Bodo Eggert wrote:
> Bernd Petrovitsch <[email protected]> wrote:
[...]
> > Most of the text editors have ways to mark up the source files. Not even
> > the various editors are able to agree on one method for all, so why
> > could the (Linux) world agree on one for all text files?
>
> You don't need a marker for all text files, but it's legal to have a marker
> for utf-8 text files (see the Unicode Standard 4.0.0, section 15.9), and
> it's handy to use it until you have made everybody in the world convert
> everything to utf-8 (but not utf-{16,32}{le,be}).

Have fun patching almost every text processing tool and concept out
there.
Apart from the fact that the way this marker works is wrong, it seems to me
that the UTF-8 body has no other choice than such an insane "rule" (or
recommendation).

> >> > With this marker you are interfering with (at least) *all* text files.
> >>
> >> Hmm. What does that have to do with the patch I'm proposing? This
> >> patch does *not* interfere with all text files. It is only relevant
> >> for executable files starting with the #! magic.
> >
> > It *does* interfere since scripts are also text files in every aspect.
> > So every feature you want for "scripts" you also get for text files (and
> > vice versa BTW).
>
> If utf-8 encoded text files are text files, and text files are scripts,

No one said all text files are scripts, instead it is the other way
'round.

[ snipped because of ex falso quodlibet ]

> > If you think "script" and "text file" are different, define both of
> > them, please, otherwise a discussion is pointless.
>
> If all text files are script files, execute this mail.

See above. Obviously you misunderstand something.

> >> > And there are always tools out there which simply do not understand the
> >> > generic marker and can not ignore it since these bytes are part of the
> >> > file.
> >>
> >> This conclusion is false. Many tools that don't understand the file
> >> structure still can do their job on the files. So the fact that a tool
> >> does not understand the structure does not necessarily imply that
> >> the tool breaks when the structure changes.
> >
> > It *may* break just because of some to-be-ignored inline marking due to
> > some questionable feature.
>
> How exactly does it break, and what is it? And why must *it* be prevented
> from breaking by ignoring script signatures in valid text files?

The question was: What happens if this marker is encountered within a file?
Is it to be ignored (by UTF-8 aware tools)? Given some other interpretation?
Illegal/forbidden?
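
As one data point only, here is how a single UTF-8 aware tool behaves. This
is a sketch using Python's "utf-8-sig" codec, which is an assumption about
that one implementation and not a statement about what every tool does or
should do. The helper name decode_signed is hypothetical.

```python
import unicodedata


def decode_signed(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a signature at the very start only."""
    return data.decode("utf-8-sig")


# The same three bytes appear at the start and again in the middle.
data = b"\xef\xbb\xbf" + b"ab" + b"\xef\xbb\xbf" + b"cd"
text = decode_signed(data)

# The leading signature is gone; the inner one survives as U+FEFF.
print(len(text))                   # prints "5"
print(unicodedata.name(text[2]))   # prints "ZERO WIDTH NO-BREAK SPACE"
```

So this decoder neither rejects nor silently drops the inner occurrence; it
keeps it as an ordinary (invisible) character in the text.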

> > And *when* (not if) it breaks, it is probably cumbersome to find since
> > you have pretty unprintable characters.
>
> If your tools can't print utf-8 encoded characters, they are broken for
> ISO-8859-*, too. Besides that, it's not a kernel problem.

Which is again not true since lots of tools out there printed ISO-8859-*
correctly before UTF-8 was deployed.

[...]

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services