2005-09-15 18:24:16

by Martin v. Löwis

Subject: Re: [Patch] Support UTF-8 scripts

H. Peter Anvin wrote:
> BOM should not be used in UTF-8. In fact, it shouldn't be used at
> all.

Says who? In UTF-8, it is not used to indicate a byte order; instead,
it is used to indicate the fact that the file is UTF-8, like a magic.
That's why I prefer to call it "UTF-8 signature".
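
As a rough illustration of the "signature" idea, here is a minimal Python
sketch that checks for the three bytes EF BB BF at the start of a file's
contents. The function names are hypothetical and only serve the example;
`codecs.BOM_UTF8` is the standard library constant for those bytes.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """Return True if data begins with the UTF-8 signature (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_signature(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature, if one is present."""
    if has_utf8_signature(data):
        return data[len(codecs.BOM_UTF8):]
    return data
```

A reader that wants to treat the signature as a magic number would call
something like `has_utf8_signature()` on the first bytes of the file before
deciding how to decode the rest.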

The Unicode consortium thinks that the BOM can be used in UTF-8:

http://www.unicode.org/faq/utf_bom.html#29

The UTF-8 signature is very useful, and I would prefer if it would
be used instead of format-specific encoding declarations.

Regards,
Martin


2005-09-15 18:26:13

by H. Peter Anvin

Subject: Re: [Patch] Support UTF-8 scripts

Martin v. Löwis wrote:
>
> Says who? In UTF-8, it is not used to indicate a byte order; instead,
> it is used to indicate the fact that the file is UTF-8, like a magic.
> That's why I prefer to call it "UTF-8 signature".
>
> The Unicode consortium thinks that the BOM can be used in UTF-8:
>
> http://www.unicode.org/faq/utf_bom.html#29
>
> The UTF-8 signature is very useful, and I would prefer if it would
> be used instead of format-specific encoding declarations.
>

In Unix, it's a hideously bad idea. The reason is that Unix inherently
assumes that text streams can be merged, split, and modified. In other
words, unless you can guarantee that EVERY program can handle BOM
EVERYWHERE, it's broken.

In other words, it's broken.
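
The merging problem can be sketched in a few lines of Python. This is only
an illustration under simple assumptions: `concatenate()` stands in for a
byte-wise tool such as cat(1), and both function names are hypothetical.

```python
import codecs
from typing import List

BOM = codecs.BOM_UTF8  # the bytes EF BB BF


def concatenate(*parts: bytes) -> bytes:
    """Join byte streams verbatim, without looking at their content."""
    return b"".join(parts)


def stray_signature_offsets(data: bytes) -> List[int]:
    """Return offsets of every UTF-8 signature that is not at offset 0."""
    offsets = []
    pos = data.find(BOM, 1)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(BOM, pos + 1)
    return offsets


first = BOM + b"echo one\n"
second = BOM + b"echo two\n"
merged = concatenate(first, second)

# The second file's signature is now buried in the middle of the stream.
print(stray_signature_offsets(merged))  # [12]

# A decoder that strips the signature only strips the leading one;
# the other survives as a U+FEFF character inside the text.
print(repr(merged.decode("utf-8-sig")))  # 'echo one\n\ufeffecho two\n'
```

Any program downstream of the merge then has to cope with U+FEFF at an
arbitrary position, not just at the start of its input.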

-hpa

2005-09-15 18:39:28

by Martin v. Löwis

Subject: Re: [Patch] Support UTF-8 scripts

H. Peter Anvin wrote:

> In Unix, it's a hideously bad idea. The reason is that Unix inherently
> assumes that text streams can be merged, split, and modified. In other
> words, unless you can guarantee that EVERY program can handle BOM
> EVERYWHERE, it's broken.

This argument is bogus. We are talking about scripts here, which cannot
be merged, split, and modified. You don't cat(1) or sort(1) them - it's
just pointless to do that. You create them with text editors, and those
*can* handle the UTF-8 signature.

> In other words, it's broken.

We can do that now, or in five or ten years. I'm willing to wait that
long, but I'm certain that more people will find the UTF-8 signature
useful over time. It's the only sane way to get non-ASCII into script
source in a consistent way.

Regards,
Martin

2005-09-15 19:21:08

by H. Peter Anvin

Subject: Re: [Patch] Support UTF-8 scripts

Martin v. Löwis wrote:
>
> We can do that now, or in five or ten years. I'm willing to wait that
> long, but I'm certain that more people will find the UTF-8 signature
> useful over time. It's the only sane way to get non-ASCII into script
> source in a consistent way.
>

No. The sane way is to just use UTF-8.

In five or ten years, by the time you've gotten your idiotic BOM mess to
sort-of work, it will be completely pointless to have anything *but*
UTF-8, and thus it's pointless.

Don't perpetuate the braindamage.

-hpa

2005-09-16 08:13:14

by Bernd Petrovitsch

Subject: Re: [Patch] Support UTF-8 scripts

On Thu, 2005-09-15 at 20:39 +0200, "Martin v. Löwis" wrote:
> H. Peter Anvin wrote:
>
> > In Unix, it's a hideously bad idea. The reason is that Unix inherently
> > assumes that text streams can be merged, split, and modified. In other
> > words, unless you can guarantee that EVERY program can handle BOM
> > EVERYWHERE, it's broken.
>
> This argument is bogus. We are talking about scripts here, which cannot
> be merged, split, and modified. You don't cat(1) or sort(1) them - it's

Sure they can since they are plain text files.
How do you think one merges scripts?
Just `cat`ing them all into one new file and editing that file is much
faster and simpler than opening an empty new file in your editor, then
opening all the other scripts and copying them over by hand.
And you (or at least I) run `grep`/`egrep`/`fgrep` and `wc` on them, and
probably several other tools too - think of `find <dir> -type f
-print0 | xargs -0r <cmd>`.

> just pointless to do that. You create them with text editors, and those
> *can* handle the UTF-8 signature.

It is not uncommon to create scripts and the like with other programs,
other scripts, or whatever else.
Apart from that, a "script" is merely a plain text file with the
eXecutable bit set. And *that* is the only difference, so you would have
to patch at least (all instances of) `chmod` to insert and remove the BOM.
This gets funny if you think of file systems without a concept of an
"executable bit" and of copying files around. Another standard tool to
patch.
And how do you solve `cat`ing a script (with the X bit set) like
`cat <script >other-file`, where other-file will not have the X bit set?
The `cat` program doesn't even know (or care about) the names of the two
files.
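
A small Python sketch of that last point, assuming a POSIX file system. The
helper names are hypothetical; `copy_bytes_like_cat()` only imitates what
`cat <script >other-file` does, namely copy bytes into a file the shell has
just created.

```python
import os
import stat


def copy_bytes_like_cat(src: str, dst: str) -> None:
    """Copy the content of src into a newly created dst, bytes only.

    Like cat with shell redirection, this transfers no metadata: dst gets
    default permissions, not the mode of src.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read())


def is_executable(path: str) -> bool:
    """Return True if the owner-executable bit is set on path."""
    return bool(os.stat(path).st_mode & stat.S_IXUSR)
```

After `copy_bytes_like_cat("script", "other-file")` the two files hold the
same bytes, but `is_executable("other-file")` is false even when the source
had the X bit set, so nothing in the copy step could know whether a
signature belongs there.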

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services