2005-09-18 19:24:05

by Bodo Eggert

[permalink] [raw]
Subject: Re: [Patch] Support UTF-8 scripts

Bernd Petrovitsch <[email protected]> wrote:
> On Sun, 2005-09-18 at 09:23 +0200, "Martin v. Löwis" wrote:
> [...]
>> >>Hmm. What does that have to do with the patch I'm proposing? This
>> >>patch does *not* interfere with all text files. It is only relevant
>> >>for executable files starting with the #! magic.
>> >
>> > It *does* interfere since scripts are also text files in every aspect.
>> > So every feature you want for "scripts" you also get for text files (and
>> > vice versa BTW).
>>
>> The specific feature I get is that when I pass a file starting
>> with <utf8sig>#! to execve, Linux will execute the file following
>> the #!. In what way do I get this feature for text in general?
>> And if I do, why is that a problem?
>
> After applying this patch it seems that "Linux" is supporting this
> marker officially in general - especially if the kernel supports it.

It will be the first POSIX kernel to correctly support utf-8 scripts.
It's 2005, and according to other(?) posters, this should be standard.

> I
> suppose the next kernel patch is to support Win-like CR-LF sequences
> (which is not the case AFAIK).

Maybe it should, maybe it shouldn't. If I used Mac OS or DOS, I'd be sure it
should. ;-)

> BTW even some standards body thinks that this is the way to go,

Not surprisingly the Unicode Consortium is one of them.

> it
> raises more problems and questions than resolves anything.

The problem of how to handle the BOM is solved by reading the standard.

> And though scripts are usually edited/changed/"parsed"/... with an text
> editor, it is not always the case. Therefore the automatic extension to
> *all text files* (especially as the marker basically applies to all text
> files, not only scripts).
> You want to focus just on your patch and ignore the directly implied
> potential problems arising ...

There is no problem arising from the patch; it solves one.
To solve the rest, use recode.

[...]
> Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> a.txt and b.txt have this marker, then c.txt have the marker of b.txt
> somewhere in the middle. Does this make sense in anyway?
> How do I get rid of the marker in the middle transparently?

The unicode standard defines how to handle them.

>> > Let alone the confusion why the size of a file with `ls -l` is different
>> > from the size in the editor or a marker-aware `wc -c`.
>>
>> This is true for any UTF-8 file, or any multibyte encoding. For any
>> multibyte encoding, the number of bytes in the file is different from
>> the number of characters. That doesn't (and shouldn't) stop people from
>> using multi-byte encodings.
>
> It is different even if a pure ASCII file is marked as UTF-8.

No pure ASCII file will be marked, since a marked file is no longer an
ASCII file.

> And sure, the problem exists in general with multi-byte encodings.

ACK, but that's not a kernel problem nor a specific unicode problem.
Fix it by making China, Greece and Japan convert to ASCII and by making
all mathematicians stop using strange characters. All other users will
follow.

>> What the editor displays as the number of "things" is up to its own.
>> The output of wc -c will always be the same as the one of ls -l,
>> as wc -c does *not* give you characters:
>>
>> -c, --bytes
>> print the byte counts
>>
>> You might have been thinking of 'wc -m'.
>
> It depends on the definition of "character". There are other standards
> which define "character" as "byte".

There are architectures defining a byte to be 32 bit.
They are irrelevant, too.

[...]
>> Not sure what this has
>> to do with the specific patch, though.
>
> It is not supported by the kernel. So either you remove it or you make
> some compatibility hack (like an appropriate sym-link

-EDOESNOTWORK

#!/usr/bin/perl -T -s -w

>, etc.). Since the
> kernel can start java classes directly, you can probably make a similar
> thing for the UTF-8 stuff.

If MSDOS text files are text files, and text files are legal scripts,
the kernel should recognize [\x0D\x0A] as valid line breaks.

(The real reason would be Unicode allowing a newline function to be
encoded as 0x0D as well as 0x0A.)

This compile-tested patch adds 32 bytes to binfmt_script:

--- ./fs/binfmt_script.c.old 2005-09-18 20:28:32.000000000 +0200
+++ ./fs/binfmt_script.c 2005-09-18 20:29:44.000000000 +0200
@@ -18,7 +18,7 @@

static int load_script(struct linux_binprm *bprm,struct pt_regs *regs)
{
- char *cp, *i_name, *i_arg;
+ char *cp, *cp2, *i_name, *i_arg;
struct file *file;
char interp[BINPRM_BUF_SIZE];
int retval;
@@ -47,6 +47,9 @@ static int load_script(struct linux_binp
bprm->buf[BINPRM_BUF_SIZE - 1] = '\0';
if ((cp = strchr(bprm->buf, '\n')) == NULL)
cp = bprm->buf+BINPRM_BUF_SIZE-1;
+ if ((cp2 = strchr(bprm->buf, '\x0D')) != NULL
+ && cp2 < cp)
+ cp = cp2;
*cp = '\0';
while (cp > bprm->buf) {
cp--;
--
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.


2005-09-18 21:06:24

by Bernd Petrovitsch

[permalink] [raw]
Subject: Re: [Patch] Support UTF-8 scripts

On Sun, 2005-09-18 at 21:23 +0200, Bodo Eggert wrote:
[...]
> >> Not sure what this has
> >> to do with the specific patch, though.
> >
> > It is not supported by the kernel. So either you remove it or you make
> > some compatibility hack (like an appropriate sym-link
>
> -EDOESNOTWORK
>
> #!/usr/bin/perl -T -s -w

Whether that works depends on how /usr/bin/perl handles a white-space
character directly after "-w".

> >, etc.). Since the
> > kernel can start java classes directly, you can probably make a similar
> > thing for the UTF-8 stuff.
>
> If MSDOS text files are text files, and text files are legal scripts,
> the kernel should recognize [\x0D\x0A] as valid line breaks.

The Unix world does recognize those line breaks. It's up to the tool how
to handle the white-space character before the LF. Especially for C and
similar languages with continuation lines, this leads to interesting (or
by now rather boring) problems.

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services



2005-09-18 22:29:55

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [Patch] Support UTF-8 scripts

On Sun, 18 Sep 2005 21:23:42 +0200, Bodo Eggert said:
> Bernd Petrovitsch <[email protected]> wrote:
> > Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> > a.txt and b.txt have this marker, then c.txt have the marker of b.txt
> > somewhere in the middle. Does this make sense in anyway?
> > How do I get rid of the marker in the middle transparently?
>
> The unicode standard defines how to handle them.

For the benefit of those of us who are interested in the problem, but aren't
in the mood to wade through a long standard looking for the answer to a
specific question, can you elaborate?

It isn't as obvious as all that, because of all the nasty corner cases...

> > It is different even if a pure ASCII file is marked as UTF-8.
>
> No pure ASCII file will be marked, since a marked file will be no
> ASCII file.

Given a file "a.txt" that's pure ASCII, and a file "b.txt" that has the BOM
marker on it, what happens when you do "cat a.txt b.txt > c.txt"?

'cat' doesn't know, and has no way of knowing, that c.txt needs a BOM at the
*front* of the file until it's already written past the point in c.txt where
the BOM has to go.

What does the Unicode standard say to do in this case?



2005-09-19 06:03:34

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [Patch] Support UTF-8 scripts

Bodo Eggert wrote:
>
> It will be the first POSIX kernel to correctly support utf-8 scripts.
> It's 2005, and according to other(?) posters, this should be standard.
>

UTF-8, yes. BOM bullshit, no.

-hpa

2005-09-19 19:37:35

by Bodo Eggert

[permalink] [raw]
Subject: Re: [Patch] Support UTF-8 scripts

On Sun, 18 Sep 2005, Bernd Petrovitsch wrote:
> On Sun, 2005-09-18 at 21:23 +0200, Bodo Eggert wrote:

> > >, etc.). Since the
> > > kernel can start java classes directly, you can probably make a similar
> > > thing for the UTF-8 stuff.
> >
> > If MSDOS text files are text files, and text files are legal scripts,
> > the kernel should recognize [\x0D\x0A] as valid line breaks.
>
> The Unix world does recognize the line breaks.

Create a valid text file with Macintosh line breaks (as allowed in
Unicode text files) and try it.
--
If enough data is collected, a board of inquiry can prove ANYTHING.