2005-09-19 04:54:44

by Martin v. Löwis

Subject: Re: [Patch] Support UTF-8 scripts

Bernd Petrovitsch wrote:
>>The specific feature I get is that when I pass a file starting
>>with <utf8sig>#! to execve, Linux will execute the file following
>>the #!. In what way do I get this feature for text in general?
>>And if I do, why is that a problem?
>
>
> After applying this patch it seems that "Linux" is supporting this
> marker officially in general - especially if the kernel supports it.

What makes it seem so? That binfmt_script supports a certain convention
doesn't mean that all other programs also somehow need to support that
convention - and certainly not in the same way.
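
To make the behaviour under discussion concrete, here is a minimal
user-space sketch in Python of the check as described above: skip an
optional UTF-8 signature, then look for #!. It is not the patch itself;
the helper name script_interpreter is made up for illustration, and real
binfmt_script details (argument splitting, length limits) are left out.

```python
import codecs

UTF8_SIG = codecs.BOM_UTF8  # the three bytes EF BB BF


def script_interpreter(head: bytes):
    """Return the interpreter line of a script, or None if it is not one.

    Hypothetical helper: it only mimics the described behaviour and is
    not the kernel code.
    """
    # Skip an optional UTF-8 signature in front of the #! marker.
    if head.startswith(UTF8_SIG):
        head = head[len(UTF8_SIG):]
    if not head.startswith(b"#!"):
        return None
    # Everything up to the first newline names the interpreter.
    first_line = head[2:].split(b"\n", 1)[0]
    return first_line.strip().decode("utf-8")


with_sig = UTF8_SIG + b"#!/usr/bin/python\nprint(1)\n"
without_sig = b"#!/bin/sh\necho hi\n"
```

Only the first few bytes of the file are inspected; no other program
reading the same file is affected by this check.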

> I suppose the next kernel patch is to support Win-like CR-LF sequences
> (which is not the case AFAIK).

What makes you suppose that? I have no plans to submit such a patch.

> And though scripts are usually edited/changed/"parsed"/... with a text
> editor, it is not always the case. Therefore the automatic extension to
> *all text files* (especially as the marker basically applies to all text
> files, not only scripts).
> You want to focus just on your patch and ignore the directly implied
> potential problems arising ...

Because there are no problems arising. When somebody submits a patch
to cat(1) to strip off UTF-8 signatures, you can *then* complain
that this is the wrong thing to do, because it violates the
specification of cat.

This reasoning is just flawed: it is like saying to a web browser
developer: "don't _support_ XHTML, because there are so many tools
which use HTML 4".

> Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> a.txt and b.txt have this marker, then c.txt has the marker of b.txt
> somewhere in the middle. Does this make sense in any way?

Indeed, it does. There is nothing inherently wrong with having
the marker in the middle.

> How do I get rid of the marker in the middle transparently?

http://www.unicode.org/faq/utf_bom.html#38
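
As an illustration of the cat case, the following Python sketch
concatenates two signed byte strings and then drops every U+FEFF except
a leading one. It assumes that any interior U+FEFF is a stray signature
and not an intended zero-width no-break space; the function names are
made up for this example.

```python
import codecs

BOM = codecs.BOM_UTF8  # UTF-8 encoding of U+FEFF


def cat(*parts: bytes) -> bytes:
    """Concatenate byte strings unchanged, as cat(1) does with files."""
    return b"".join(parts)


def strip_interior_signatures(data: bytes) -> bytes:
    """Keep the first character as it is and drop any later U+FEFF."""
    text = data.decode("utf-8")
    head, rest = text[:1], text[1:]
    return (head + rest.replace("\ufeff", "")).encode("utf-8")


a = BOM + b"first\n"
b = BOM + b"second\n"
c = cat(a, b)            # the signature of b now sits in the middle of c
cleaned = strip_interior_signatures(c)
```

Here c contains the signature twice, while cleaned keeps only the
leading one.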

>>What the editor displays as the number of "things" is up to its own.
>>The output of wc -c will always be the same as the one of ls -l,
>>as wc -c does *not* give you characters:
>>
>> -c, --bytes
>> print the byte counts
>>
>>You might have been thinking of 'wc -m'.
>
>
> It depends on the definition of "character". There are other standards
> which define "character" as "byte".

Certainly. However, you specifically talked about 'wc -c', and, in
wc(1), at least in the implementation commonly used on Linux, characters
and bytes are not the same.

>>It depends on the editor I use, of course
>
>
> No, more on the OS the editor runs on.

You talked about Windows specifically. On Windows, most editors let you
choose the line ending, and will preserve whatever line ending they
find when adding new lines to a file.

Regards,
Martin


2005-09-19 08:26:28

by Bernd Petrovitsch

Subject: Re: [Patch] Support UTF-8 scripts

On Mon, 2005-09-19 at 06:54 +0200, "Martin v. Löwis" wrote:
> Bernd Petrovitsch wrote:
> >>The specific feature I get is that when I pass a file starting
> >>with <utf8sig>#! to execve, Linux will execute the file following
> >>the #!. In what way do I get this feature for text in general?
> >>And if I do, why is that a problem?
> >
> > After applying this patch it seems that "Linux" is supporting this
> > marker officially in general - especially if the kernel supports it.
>
> What makes it seem so? That binfmt_script supports a certain convention
> doesn't mean that all other programs also somehow need to support that
> convention - and certainly not in the same way.

We will see how it develops. Actually the marker could be used to detect
the endianness of the file, if I read the URL below correctly ....

> > I suppose the next kernel patch is to support Win-like CR-LF sequences
> > (which is not the case AFAIK).
>
> What makes you suppose that? I have no plans to submit such a patch.

No need to. Other people tried already.

> This reasoning is just flawed: it is like saying to a web browser
> developer: "don't _support_ XHTML, because there are so many tools
> which use HTML 4".

No, the point was more: "don't support XHTML since it may break HTML
compliant browsers."
For XHTML/HTML we all know that this is not the case, so the comparison
is flawed.

> > Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> > a.txt and b.txt have this marker, then c.txt has the marker of b.txt
> > somewhere in the middle. Does this make sense in any way?
>
> Indeed, it does. There is nothing inherently wrong with having
> the marker in the middle.
>
> > How do I get rid of the marker in the middle transparently?
>
> http://www.unicode.org/faq/utf_bom.html#38

Thanks.
---- snip ----
In that case, any U+FEFF occurring in the middle of the file can be
ignored, or treated as an error.
---- snip ----
Well, this doesn't sound like a clear rule stating that it *must* be
ignored.
BTW:
---- snip ----
Q: How I should deal with BOMs?
[...]
3. Some byte oriented protocols expect ASCII characters at the beginning
of a file. If UTF-8 is used with these protocols, use of the BOM as
encoding form signature should be avoided.
---- snip ----
Voila. Avoid the BOM in your scripts and be done.

> > It depends on the definition of "character". There are other standards
> > which define "character" as "byte".
>
> Certainly. However, you specifically talked about 'wc -c', and, in
> wc(1), at least in the implementation commonly used on Linux, characters
> and bytes are not the same.

Yes, now that multi-byte character sets are more commonly used.
However, I don't think you get this into the C standard. But we are now
far off the discussion ....

> >>It depends on the editor I use, of course
> >
> > No, more on the OS the editor runs on.
>
> You talked about Windows specifically. On Windows, most editors let you
> choose the line ending, and will preserve whatever line ending they
> find when adding new lines to a file.

I believe this for vim, emacs, etc., but I don't believe it for the
native ones ...

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services

2005-09-19 09:01:09

by Valdis Klētnieks

Subject: Re: [Patch] Support UTF-8 scripts

On Mon, 19 Sep 2005 10:26:22 +0200, Bernd Petrovitsch said:

> We will see how it develops. Actually the marker could be used to detect
> the endianness of the file, if I read the URL below correctly ....

Text files have endianness????

> ---- snip ----
> Q: How I should deal with BOMs?
> [...]
> 3. Some byte oriented protocols expect ASCII characters at the beginning
> of a file. If UTF-8 is used with these protocols, use of the BOM as
> encoding form signature should be avoided.
> ---- snip ----
> Voila. Avoid the BOM in your scripts and be done.

At which point the proposed kernel patch becomes pointless.. ;)



2005-09-19 09:41:42

by Bernd Petrovitsch

Subject: Re: [Patch] Support UTF-8 scripts

On Mon, 2005-09-19 at 05:00 -0400, Valdis Klētnieks wrote:
> On Mon, 19 Sep 2005 10:26:22 +0200, Bernd Petrovitsch said:
>
> > We will see how it develops. Actually the marker could be used to detect
> > the endianness of the file, if I read the URL below correctly ....
>
> Text files have endianness????

UTF-16 ones, with 16-bit code units (as in Win NT), yes.
UTF-8 ones do not, AFAIK.
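
A small Python sketch of the difference: in UTF-16 the marker comes out
in two byte orders, in UTF-8 it is always the same three bytes. The
detect function is made up for this example and deliberately ignores
UTF-32, whose little-endian marker starts with the same two bytes as the
UTF-16 one.

```python
import codecs

text = "A"

# The same character with a leading marker in three encodings.
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # FF FE 41 00
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")  # FE FF 00 41
u8 = codecs.BOM_UTF8 + text.encode("utf-8")          # EF BB BF 41


def detect(data: bytes) -> str:
    """Guess the encoding from a leading marker (UTF-32 not handled)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return "unknown"
```

In the UTF-8 case the marker carries no byte-order information; it only
acts as a signature for the encoding.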

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services

2005-09-19 21:40:45

by Martin v. Löwis

Subject: Re: [Patch] Support UTF-8 scripts

Bernd Petrovitsch wrote:
>>>It depends on the definition of "character". There are other standards
>>>which define "character" as "byte".
>>
>>Certainly. However, you specifically talked about 'wc -c', and, in
>>wc(1), at least in the implementation commonly used on Linux, characters
>>and bytes are not the same.
>
>
> Yes, now that multi-byte character sets are more commonly used.
> However, I don't think you get this into the C standard. But we are now
> far off the discussion ....

We are indeed, so just one final clarification. wc(1) is not part
of the C standard - ISO 9899 does not talk about command line utilities
at all. The relevant standard is POSIX; IEEE Std 1003.1, 2004 Edition
says, in

http://www.opengroup.org/onlinepubs/009695399/utilities/wc.html

-c
Write to the standard output the number of bytes in each input file.
[...]
-m
Write to the standard output the number of characters in each input
file.

[...]
RATIONALE
[...]
The -c option stands for "character" count, even though it counts bytes.
This stems from the sometimes erroneous historical view that bytes and
characters are the same size. Due to international requirements, the -m
option (reminiscent of "multi-byte") was added to obtain actual
character counts.
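
The distinction can be sketched in a few lines of Python. The two
functions below are illustrative stand-ins for 'wc -c' and 'wc -m', not
the wc implementation, and they assume the input is UTF-8.

```python
import codecs


def count_bytes(data: bytes) -> int:
    """Byte count, corresponding to what 'wc -c' reports."""
    return len(data)


def count_chars(data: bytes, encoding: str = "utf-8") -> int:
    """Character count, roughly 'wc -m' in a locale using `encoding`."""
    return len(data.decode(encoding))


sample = "Löwis\n".encode("utf-8")    # 'ö' takes two bytes in UTF-8
signed = codecs.BOM_UTF8 + sample     # the signature adds 3 bytes, 1 char

print(count_bytes(sample), count_chars(sample))   # 7 6
print(count_bytes(signed), count_chars(signed))   # 10 7
```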

Regards,
Martin