LinuxLists.cc - Re: [Patch] Support UTF-8 scripts

2005-09-16 20:34:26

Subject: Re: [Patch] Support UTF-8 scripts

Martin Mares wrote:
> I doubt that. For ages people were using several different encodings on
> a single system (at least here in .cz) without any markers and although
> there were some rough edges, almost everything worked. Now we do the same
> with ISO-8859-2 and UTF-8, again with no need for a marker.

This is true for text files, where a human reader can interpret the data
correctly even in absence of a declaration. For programming languages,
this is typically not the case. Instead, in order to correctly interpret
the source code, you need to declare the encoding. For a script, this
should be done inside the file itself, as there is no explicit
invocation of a compiler or some such where the script encoding could
be specified externally.

Regards,
Martin

2005-09-17 13:33:32

by Martin v. Löwis

Martin Mares wrote:
> I still think that this does solve only a completely insignificant part
> of the problem. Given the zillion existing encodings, you are able to identify
> UTF-8, leaving you with zillion-1 other encodings you are unable to deal with.

Correct. This is a special case only. The more general problem is
already solved: both Python and Perl support source files in any of
the zillion encodings. As I explained, this general solution,
while being general, is also not very user-friendly.

Now, why does UTF-8 deserve to be a special case? One reason is that it
has the potential to replace the entire zillion of encodings over time.
However, this can only happen if tool support for this encoding is
really good. The patch contributes a (minor) fragment to the support -
it is a small patch only.

The other reason is that UTF-8 defines its own encoding declaration,
unlike most of the other zillion-1 encodings. So naturally, an
implementation that supports UTF-8 in this way cannot extend to other
encodings. hpa suggested that ISO-2022 would be a more general
mechanism, but pointed out that it hasn't been implemented widely in the
last 30 years, so it is unlikely that it will get much better support
in the next thirty years.

> I see a need for a feature which would help identify the charset of the script,
> but the patch in question obviously doesn't offer that -- it solves only a single
> special case of the problem in a completely non-systematic way. This does not
> sound right.

It's not a complete solution, but it *is* part of a general solution.
People have tried in the past to solve the general problem of "identify
the encoding of a text file", both in really general ways (iso-2022)
and in format-specific ways (perl, python). All these solutions are
tedious to use.

There is another general solution: gradually replace the zillion
encodings with a single one, namely Unicode (or, specifically, UTF-8).
This solution will only work when done gradually. Clearly, this
patch doesn't implement this solution entirely, but it contributes
to it, by making the use of UTF-8 in script files simpler. Many
more changes to other software (i.e. non-kernel changes) will be
necessary to implement this solution, as well as (obviously) changes
to existing files.

Regards,
Martin

2005-09-17 13:05:30

by Martin Mares

Hello!

> With the UTF-8 signature, things become much simpler: editors can
> automatically detect presence of the signature, and need no
> language-specific parsing.

I still think that this does solve only a completely insignificant part
of the problem. Given the zillion existing encodings, you are able to identify
UTF-8, leaving you with zillion-1 other encodings you are unable to deal with.

> Probably not literally, as we are not searching for an explanation of
> some phenomenon.

ACK, not literally.

> You are probably suggesting that people dislike the
> feature because they see no need for it (as one poster stated it:
> I don't use UTF-8, so I don't want that feature).

I see a need for a feature which would help identify the charset of the script,
but the patch in question obviously doesn't offer that -- it solves only a single
special case of the problem in a completely non-systematic way. This does not
sound right.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
"How I need a drink, alcoholic in nature, after the tough chapters involving quantum mechanics!" = \pi

2005-09-17 12:53:12

by Martin v. Löwis

Martin Mares wrote:
> Agreed. On the other hand, in all these languages you can pass the encoding
> as a parameter to the interpreter, cannot you?

Not in general, no. If you have a library of multiple modules, different
modules may have different encodings. In particular, if UTF-8 in source
code becomes more common (because it is better supported than now),
people will start using it for libraries. At the same time, a lot of
code is around that still uses other encodings (typically Latin-1).
So you may have two encodings in the same program (different modules);
that's why you need the encoding declared *in* the file.

Now, there are different ways to do that: you can find language-specific
ways (such as 'use utf8;'), and this is what most languages currently
do. However, this is a nightmare for editor developers, and a severe
inconvenience for script authors, who now have to put the encoding
declaration into the files.

With the UTF-8 signature, things become much simpler: editors can
automatically detect presence of the signature, and need no
language-specific parsing. The language interpreters have a guarantee
that the signature is at the beginning of the file, so they don't
need to switch encodings in the middle of parsing. Users can configure
their editors to always write the signature for certain types of
files, and don't need to worry about putting correct encoding
declarations into the files.
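
As an illustration of the editor-side check described here, a minimal Python sketch follows. The function names and the Latin-1 fallback are assumptions made for the example, not part of the patch or of any particular editor. The check looks only at the first three bytes and needs no knowledge of the scripting language:

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 signature (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def read_script_text(data: bytes, fallback: str = "latin-1") -> str:
    """Decode script bytes: UTF-8 if the signature is present, else a fallback.

    The fallback encoding is an assumption for illustration; a real editor
    would take it from the user's locale or configuration.
    """
    if has_utf8_signature(data):
        # 'utf-8-sig' decodes UTF-8 and drops the leading signature.
        return data.decode("utf-8-sig")
    return data.decode(fallback)


source = "#!/usr/bin/python\nprint('Löwis')\n"
signed = codecs.BOM_UTF8 + source.encode("utf-8")
unsigned = source.encode("latin-1")
```

Both `read_script_text(signed)` and `read_script_text(unsigned)` give back the same text here, but only the signed file could be identified without guessing.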

>>In the future, the signature *will* carry no information. But the future
>>is, well, in the future.
>>
>>I just can't understand why (some) people are so opposed to this patch.
>
>
> Occam's razor?

Probably not literally, as we are not searching for an explanation of
some phenomenon. You are probably suggesting that people dislike the
feature because they see no need for it (as one poster stated it:
I don't use UTF-8, so I don't want that feature).

However, I do believe there is a need for the feature, and that
the gains by far outweigh the costs.

Regards,
Martin

2005-09-17 12:01:23

by Martin Mares

Hello!

> This is true for text files, where a human reader can interpret the data
> correctly even in absence of a declaration. For programming languages,
> this is typically not the case. Instead, in order to correctly interpret
> the source code, you need to declare the encoding. For a script,
[...]

This makes no sense. For a script, the shell does not care about the encoding
at all.

Also, currently, people use zillions of encodings, most of which have no
signature, so introducing a signature for UTF-8 does not win anything.

In the future, most people will probably use only UTF-8, so the signature
carries no information.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Q: Who invented the first airplane that did not fly? A: The Wrong Brothers.

2005-09-17 12:26:01

by Martin v. Löwis

Martin Mares wrote:
> This makes no sense. For a script, the shell does not care about the encoding
> at all.

I'm not (only) talking about /bin/sh. I'm primarily talking about
/usr/bin/python, /usr/bin/perl, and /usr/bin/wish. In all these
languages, the interpreter *does* care about the encoding.

1. In Python, the syntax

   u"some data"

denotes a Unicode literal (stored internally either in UCS-2 or
UCS-4); the literals are converted from the source encoding to
the internal representation. This requires knowledge of the source
encoding.

2. In Tcl, all strings are internally represented in UTF-8, and
converted from the source encoding (which currently is inferred
from the locale of the process executing the script).

3. In Perl, 'use utf8' declares that the encoding of the script is
UTF-8, meaning that non-ASCII can be used in string literals,
identifiers, and regular expressions.
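
To illustrate why the interpreter needs to know the source encoding, here is a small Python sketch (written for present-day Python 3, and using made-up sample bytes). The same bytes in a script file turn into different string contents depending on which encoding is assumed:

```python
# Bytes as they might appear inside a string literal in a script file.
raw = b"caf\xc3\xa9"

# Interpreted as UTF-8, the last two bytes form a single character.
as_utf8 = raw.decode("utf-8")       # 'café' (4 characters)

# Interpreted as Latin-1, each byte is a character of its own.
as_latin1 = raw.decode("latin-1")   # 'cafÃ©' (5 characters)
```

Without a declaration, the interpreter cannot tell which of the two strings the author meant.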

> Also, currently, people use zillions of encodings, most of which have no
> signature, so introducing a signature for UTF-8 does not win anything.

This specific patch does win something: it allows the execution of scripts
which start with <utf8 signature>#!

This is useful e.g. for Python, which recognizes the UTF-8 signature
as declaring the source encoding of the Python module to be UTF-8.
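
The Python behaviour mentioned here can be observed with the standard library. The sketch below uses `tokenize.detect_encoding` from present-day Python 3, which postdates this thread, so treat it as an illustration of the idea rather than of the 2005 interpreter. The helper name `source_encoding` and the sample scripts are assumptions for the example:

```python
import codecs
import io
import tokenize


def source_encoding(data: bytes) -> str:
    """Report the encoding Python would assume for these source bytes."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding


# A script carrying the UTF-8 signature and no other declaration.
with_signature = codecs.BOM_UTF8 + b"#!/usr/bin/python\nprint('hi')\n"

# A script using the language-specific declaration instead.
with_cookie = b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\nprint('hi')\n"
```

For `with_signature` the reported encoding is `utf-8-sig`; for `with_cookie` it is the normalized name `iso-8859-1`, found only by parsing a Python-specific comment.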

> In the future, most people will probably use only UTF-8, so the signature
> carries no information.

In the future, the signature *will* carry no information. But the future
is, well, in the future.

I just can't understand why (some) people are so opposed to this patch.
It is a really trivial, straightforward change. It introduces no
policy, just a feature: you can put the UTF-8 signature in your script
file, if you want to (and your scripting language supports it). By
no means does it force you to put the UTF-8 signature in all your script
files, let alone all your text files.

Regards,
Martin

2005-09-17 12:28:28

by Martin Mares

Hello!

> I'm not (only) talking about /bin/sh. I'm primarily talking about
> /usr/bin/python, /usr/bin/perl, and /usr/bin/wish. In all these
> languages, the interpreter *does* care about the encoding.

Agreed. On the other hand, in all these languages you can pass the encoding
as a parameter to the interpreter, cannot you?

> In the future, the signature *will* carry no information. But the future
> is, well, in the future.
>
> I just can't understand why (some) people are so opposed to this patch.

Occam's razor?

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
"In accord to UNIX philosophy, PERL gives you enough rope to hang yourself." -- Larry Wall

2005-09-19 07:08:35

by Pavel Machek

Hi!

> I just can't understand why (some) people are so opposed to this patch.
> It is a really trivial, straightforward change. It introduces no
> policy, just a feature: you can put the UTF-8 signature in your script
> file, if you want to (and your scripting language supports it). By
> no means does it force you to put the UTF-8 signature in all your script
> files, let alone all your text files.

Why is binfmt_misc not enough for you?
Pavel

--
if you have sharp zaurus hardware you don't need... you know my address

2005-09-19 07:18:41

by Martin v. Löwis

Pavel Machek wrote:
> Why is binfmt_misc not enough for you?

For two reasons: for one, it has the overhead of yet another
exec call. This is different from usages for, say, Java byte
code or Python byte code, where the registered interpreter already
is the eventual binary which has to be invoked anyway; for
a binfmt_misc application, you need an additional wrapper
which reinterprets the first line, and then invokes the eventual
interpreter.
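
For illustration, the core of such a wrapper might look like the following Python sketch. The function name `interpreter_argv` and the example paths are hypothetical; this is not the patch and not an existing tool, only the first-line reinterpretation step under the assumption that the script starts with the UTF-8 signature followed by `#!`:

```python
import codecs


def interpreter_argv(script_path: str, head: bytes) -> list[str]:
    """Build the argv for the real interpreter from a signed '#!' line.

    `head` is the first chunk of the script file. Raises ValueError if it
    does not start with <UTF-8 signature>#!.
    """
    prefix = codecs.BOM_UTF8 + b"#!"
    if not head.startswith(prefix):
        raise ValueError("not a UTF-8 signature followed by #!")

    # Take the rest of the first line, without surrounding whitespace.
    first_line = head[len(prefix):].split(b"\n", 1)[0].strip()
    if not first_line:
        raise ValueError("empty interpreter line")

    # Treat everything after the interpreter path as one optional argument,
    # as the kernel's own #! handling does.
    parts = first_line.split(None, 1)
    argv = [parts[0].decode()]
    if len(parts) == 2:
        argv.append(parts[1].decode())
    argv.append(script_path)
    return argv
```

A real wrapper would read the head of the file it is handed and then call something like `os.execv(argv[0], argv)`, which is the additional exec referred to here.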

The other reason is availability: as an author of a UTF-8
script, you would have to communicate to your users that they
need the right binfmt_misc wrapper installed (which they may
have to build first). While installing additional stuff to
run a single program is acceptable for large applications,
it is likely not for script files. To make the feature useful
in practice, it must be built in.

Regards,
Martin

2005-09-19 07:24:59

by Pavel Machek

On Mon 19-09-05 09:18:33, "Martin v. Löwis" wrote:
> Pavel Machek wrote:
> > Why is binfmt_misc not enough for you?
>
> For two reasons: for one, it has the overhead of yet another
> exec call. This is different from usages for, say, Java byte
> code or Python byte code, where the registered interpreter already
> is the eventual binary which has to be invoked anyway; for
> a binfmt_misc application, you need an additional wrapper
> which reinterprets the first line, and then invokes the eventual
> interpreter.

Who cares? exec is fast.

> The other reason is availability: as an author of a UTF-8
> script, you would have to communicate to your users that they
> need the right binfmt_misc wrapper installed (which they may
> have to build first). While installing additional stuff to
> run a single program is acceptable for large applications,
> it is likely not for script files. To make the feature useful
> in practice, it must be built in.

This is a distribution problem, not a kernel problem. "/bin/ls should be
built into the kernel, because otherwise you can't call /bin/ls from a
script" is not an argument.

If UTF-8 compatibility is important, distros will get it right. If it
is not, you lose, but at least the kernel is not messed up.

Pavel
--
if you have sharp zaurus hardware you don't need... you know my address

2005-09-19 07:46:15

by Martin v. Löwis

Pavel Machek wrote:
> If UTF-8 compatibility is important, distros will get it right. If it
> is not, you lose, but at least the kernel is not messed up.

The patch doesn't mess up the kernel.

Regards,
Martin

2005-09-19 07:51:04

by Pavel Machek

On Mon 19-09-05 09:46:11, "Martin v. Löwis" wrote:
> Pavel Machek wrote:
> > If UTF-8 compatibility is important, distros will get it right. If it
> > is not, you lose, but at least the kernel is not messed up.
>
> The patch doesn't mess up the kernel.

Every patch does.

Except that yours does not, because it is not going in :-).
Pavel

--
if you have sharp zaurus hardware you don't need... you know my address

2005-09-19 10:23:00

by Alan

On Mon, 2005-09-19 at 09:24 +0200, Pavel Machek wrote:
> > which reinterprets the first line, and then invokes the eventual
> > interpreter.
>
> Who cares? exec is fast.

It would be nice if it were, but exec plus the user-space startup overhead is
merely "faster than many equivalent systems". It's still slow