LinuxLists.cc - Re: [Patch] Support UTF-8 scripts

2005-09-16 18:02:49

Subject: Re: [Patch] Support UTF-8 scripts

Bernd Petrovitsch <[email protected]> wrote:
> On Thu, 2005-09-15 at 20:39 +0200, "Martin v. Löwis" wrote:
>> H. Peter Anvin wrote:

>> > In Unix, it's a hideously bad idea.  The reason is that Unix inherently
>> > assumes that text streams can be merged, split, and modified.  In other
>> > words, unless you can guarantee that EVERY program can handle BOM
>> > EVERYWHERE, it's broken.

You can't sort /bin/ls into /tmp/ls and expect /tmp/ls to be meaningful,
but /bin/ls works as expected. You can't usually concat perl scripts and
shell scripts either, but both kinds of script run quite well.

And if you do "cat /bin/cat /bin/cp > /bin/catcp", what's "catcp foo bar"
supposed to do? First output foo and bar to stdout, then copy foo to bar?
Is execve() broken if it doesn't do what I described? Is the ELF header
broken because it's not recognized EVERYWHERE? I don't think so.

>> This argument is bogus. We are talking about scripts here, which cannot
>> be merged, split, and modified. You don't cat(1) or sort(1) them - it's
>
> Sure they can since they are plain text files.
> How do you think one merges scripts?
> Just `cat`ing them all into one new file and editing that new file is much
> faster and simpler than to open an empty new file with your editor, then
> you open all the other scripts in your editor and copy them by hand.

What's supposed to happen if you concatenate a script from your French
user and from your Russian user, both using localized text, into one file?
Unless you can guarantee every editor to correctly handle this case, all
usage of 8-bit-characters should be disabled - NOT!

If you concatenate two plain text files, you will use cat.
If you concatenate two pnm image files, you will use pnmcat.
If you concatenate two utf-8 files, you will use utf8cat.
If you concatenate two binaries, you will shoot your feet.
That's easy, isn't it?

BTW: I think decent utf-8 capable programs SHOULD ignore extra BOM markers.
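
A minimal sketch of what such a "utf8cat" could do, assuming it only needs
to drop the signature (EF BB BF) at the start of every input after the
first. No such tool is named anywhere else in this thread; the function
names sig_len and utf8cat_append are made up for illustration, and the
sketch works on memory buffers rather than files to stay short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of a leading UTF-8 signature (EF BB BF) in buf, or 0 if absent. */
static size_t sig_len(const unsigned char *buf, size_t len)
{
    return (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
               ? 3 : 0;
}

/*
 * Hypothetical helper: append src to the dst_used bytes already in dst
 * (capacity dst_cap). A leading signature in src is dropped unless dst
 * is still empty, so the result carries at most one signature, and only
 * at its very start.
 * Returns the new used length, or (size_t)-1 if src does not fit.
 */
static size_t utf8cat_append(unsigned char *dst, size_t dst_used,
                             size_t dst_cap,
                             const unsigned char *src, size_t src_len)
{
    size_t skip;

    assert(dst != NULL && src != NULL && dst_used <= dst_cap);
    skip = (dst_used > 0) ? sig_len(src, src_len) : 0;
    if (src_len - skip > dst_cap - dst_used)
        return (size_t)-1;
    memcpy(dst + dst_used, src + skip, src_len - skip);
    return dst_used + (src_len - skip);
}
```

Calling utf8cat_append once per input, in order, gives the concatenation;
a signature in the middle of an input is left alone, since only a leading
one is treated as a marker here.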

> And you (or at least I) do `grep`/`egrep`/`fgrep`, `wc` them.

You can *grep utf-8 scripts, but you can't *grep binaries. Shouldn't
this be fixed by implementing an in-kernel ASCII assembler and convert
all binaries to assembler text?

> And
> probably with several other tools too - think of `find <dir> -type f
> -print0 | xargs -0r <cmd>`.

utf-8 filenames will work correctly (unless used as an extended BASIC
script with non-ASCII variable names, but that would be insane).

>> just pointless to do that. You create them with text editors, and those
>> can handle the UTF-8 signature.
>
> It is not uncommon to create scripts and the like with other programs,
> other scripts, what-else.

It's not uncommon to create binaries using other programs. So what?

> Apart from the fact that a "script" is merely a plain text file with the
> eXecutable bit set.

And a utf-8 script is a utf-8 encoded text file with its executable bit
set.

> And that is the only difference, so you have to at
> least change (all instances of) `chmod` to insert and remove the BOM.
[...]

In order to make it harder for the interpreter to correctly detect utf-8?
You can have DOS executables run in dosboxes, windows applications run
in windows, java archives run in java, but utf-8 scripts should be
mangled in order to work "correctly", and mangled back in order to be
editable? *That*'s insane!

Just make execve ignore the BOM marker before "#!" as the patch does, and
you're done. The rest is somebody else's not-a-problem.

BTW2: However, I don't like the patch.

I'd first check for a utf-8 signature, and if it's found, adjust the
buffer offset by 3. Then I'd run the old code checking for the sh_bang.
OTOH, I just read the patch and not the .c file, maybe (unlikely) my idea
wouldn't work correctly.

--
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.

2005-09-16 18:09:33

by H. Peter Anvin

Subject: Re: [Patch] Support UTF-8 scripts

Bodo Eggert wrote:
>
> What's supposed to happen if you concatenate a script from your French
> user and from your Russian user, both using localized text, into one file?
> Unless you can guarantee every editor to correctly handle this case, all
> usage of 8-bit-characters should be disabled - NOT!
>

Actually, it's quite easy to avoid problems by using UTF-8 consistently.
The 8-bit characters are oddballs and need to be treated specially,
but look, guys, it's 2005 - UTF-8 should be the norm, not the exception.

-hpa

2005-09-16 18:57:38

by Bodo Eggert

Subject: Re: [Patch] Support UTF-8 scripts

On Fri, 16 Sep 2005, H. Peter Anvin wrote:
> Bodo Eggert wrote:

> > What's supposed to happen if you concatenate a script from your French
> > user and from your Russian user, both using localized text, into one file?
> > Unless you can guarantee every editor to correctly handle this case, all
> > usage of 8-bit-characters should be disabled - NOT!
>
> Actually, it's quite easy to avoid problems by using UTF-8 consistently.
> The 8-bit characters are oddballs and need to be treated specially,
> but look, guys, it's 2005 - UTF-8 should be the norm, not the exception.

It should, but as long as old programs are still around, we'll have both
and need a marker to distinguish them. Otherwise we'll be stuck with
legacy scripts for a long time.

--
I'm a member of DNA (National Assocciation of Dyslexics).
-- Storm in <[email protected]>

2005-09-16 19:08:34

by Martin Mares

Subject: Re: [Patch] Support UTF-8 scripts

Hello!

> It should, but as long as old programs are still around, we'll have both
> and need a marker to distinguish them.

I doubt that. For ages people were using several different encodings on
a single system (at least here in .cz) without any markers and although
there were some rough edges, almost everything worked. Now we do the same
with ISO-8859-2 and UTF-8, again with no need for a marker.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Linux vs. Windows is a no-WIN situation.

2005-09-16 19:25:28

by H. Peter Anvin

Subject: Re: [Patch] Support UTF-8 scripts

Bodo Eggert wrote:
>
> It should, but as long as old programs are still around, we'll have both
> and need a marker to distinguish them. Otherwise we'll be stuck with
> legacy scripts for a long time.
>

You don't have markers (although they're defined, see ISO 2022) for your
8-bit encodings, and *THEY'RE THE ONES THAT NEED TO BE DISTINGUISHED.*
Flagging UTF-8, especially with the BOM (as opposed to the ISO 2022
signature, <ESC>%G) is pointless in the context, since you still can't
distinguish your arbitrary number of legacy encodings.
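
To make that concrete, here is a small sketch, assuming a checker that
knows both signatures: the BOM (EF BB BF) and the ISO 2022 one (ESC % G,
i.e. 1B 25 47). The enum and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum text_sig {
    SIG_NONE,          /* no signature: could be ASCII or ANY 8-bit set */
    SIG_UTF8_BOM,      /* EF BB BF */
    SIG_ISO2022_UTF8   /* ESC % G  */
};

/* Classify the first three bytes of buf; hypothetical helper. */
static enum text_sig detect_sig(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return SIG_UTF8_BOM;
    if (len >= 3 && buf[0] == 0x1B && buf[1] == '%' && buf[2] == 'G')
        return SIG_ISO2022_UTF8;
    return SIG_NONE;
}
```

Every legacy-encoded file, whatever its character set, lands in SIG_NONE,
so the marker tells you nothing about the files that actually need it.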

Oh, yes, and try to stick ISO 2022 signatures in scripts or whatnot, and
you can see what current software does with a signature standard that
dates back to the 1970s.

-hpa