2005-09-16 20:41:49

by Martin v. Löwis

Subject: Re: [Patch] Support UTF-8 scripts

H. Peter Anvin wrote:
> You don't have markers (although they're defined, see ISO 2022) for your
> 8-bit encodings, and *THEY'RE THE ONES THAT NEED TO BE DISTINGUISHED.*
> Flagging UTF-8, especially with the BOM (as opposed to the ISO 2022
> signature, <ESC>%G) is pointless in the context, since you still can't
> distinguish your arbitrary number of legacy encodings.

In programming languages that support the notion of source encodings,
you do have markers for 8-bit encodings. For example, in Python, you
can specify

# -*- coding: iso-8859-1 -*-

to denote the source encoding. In Perl, you write

use encoding "latin-1";

(with 'use utf8;' being a special-case shortcut).

In Java, you can specify the encoding through the -encoding argument
to javac. In gcc, you use -finput-charset (with the special case of
-fexec-charset and -fwide-exec-charset potentially being different).

So you *must* use encoding declarations in some languages; the UTF-8
signature is a particularly convenient way of doing so, since it allows
for uniformity across languages, with no need for the text editors to
parse all the different programming languages.
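
To illustrate the editor's burden: even the Python rule alone requires
something like the following (a simplified sketch of my own, not the
actual implementation; the helper name and the UTF-8 default are made
up):

import re

# Simplified PEP 263-style detection: look for a coding declaration
# in the first two lines of a source file.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def detect_source_encoding(path):
    with open(path, "rb") as f:
        for line in (f.readline(), f.readline()):
            m = CODING_RE.search(line)
            if m:
                return m.group(1).decode("ascii")
    return "utf-8"  # assumed default when nothing is declared

An editor would need one such parser per language it supports; the
UTF-8 signature needs none of this.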

Regards,
Martin


2005-09-16 22:08:29

by H. Peter Anvin

Subject: Re: [Patch] Support UTF-8 scripts

Martin v. Löwis wrote:
> In programming languages that support the notion of source encodings,
> you do have markers for 8-bit encodings. For example, in Python, you
> can specify
>
> # -*- coding: iso-8859-1 -*-
>
> to denote the source encoding. In Perl, you write
>
> use encoding "latin-1";
>
> (with 'use utf8;' being a special-case shortcut).
>
> In Java, you can specify the encoding through the -encoding argument
> to javac. In gcc, you use -finput-charset (with the special case of
> -fexec-charset and -fwide-exec-charset potentially being different).
>
> So you *must* use encoding declarations in some languages; the UTF-8
> signature is a particularly convenient way of doing so, since it allows
> for uniformity across languages, with no need for the text editors to
> parse all the different programming languages.

Did you miss the point? There has been a standard for marking for *30
years*, and virtually NO ONE (outside Japan) uses it.

-hpa

2005-09-16 22:48:05

by Bernd Petrovitsch

Subject: Re: [Patch] Support UTF-8 scripts

On Fri, 2005-09-16 at 22:41 +0200, "Martin v. Löwis" wrote:
[ Language-specific examples ]

And that's the only working way - the programming languages can
actually do it because each defines the syntax and semantics of the
contents anyway.
With this marker you are interfering with (at least) *all* text files.
And thus with *all* tools which "handle" those text files.

> So you *must* use encoding declarations in some languages; the UTF-8

... if you absolutely want to use non-ASCII characters in the source
code. Most (if not all) of them have a native gettext()
interface ...
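
For instance, in Python (a sketch of my own; the catalog domain and
directory are made up):

import gettext

# Keep the source pure ASCII and pull the translated, non-ASCII text
# from a message catalog at run time instead.
t = gettext.translation("myapp", localedir="locale",
                        languages=["de"], fallback=True)
_ = t.gettext
print(_("Hello"))  # German catalog entry, if one exists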

> signature is a particularly convenient way of doing so, since it allows
> for uniformity across languages, with no need for the text editors to
> parse all the different programming languages.

And there are always tools out there which simply do not understand the
generic marker and cannot ignore it, since these bytes are part of the
file. And thus tools (and people) will strip those markers (for whatever
reason, even if it's simple ignorance) anyway.

Or another example: (try to) start a perl/shell/... script (without
parameters on the first line) which was edited on Win* and copied in
binary mode to a Unix system. Or at least guess what will happen ....

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services



2005-09-17 06:05:33

by Martin v. Löwis

Subject: Re: [Patch] Support UTF-8 scripts

H. Peter Anvin wrote:
> Did you miss the point? There has been a standard for marking for *30
> years*, and virtually NO ONE (outside Japan) uses it.

I understood that fact - but I fail to see the point. If you mean to
imply "people did not use ISO-2022, therefore, they will never use
encoding declarations", I think this implication is false. People
do use encoding declarations.

If you mean to imply "people did not use ISO-2022, therefore, they
will never use the UTF-8 signature", I think this implication is
also false. People do use the UTF-8 signature, even outside Japan.
The primary reason is that the UTF-8 signature is much easier to
implement than ISO-2022: if you support UTF-8 in your tool (say,
a text editor) anyway, adding support for the UTF-8 signature
is almost trivial. Therefore, many more editors support the UTF-8
signature today than ever supported ISO-2022.
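
To illustrate how little is needed (a sketch of my own; the latin-1
fallback is just an example of a legacy default):

import codecs

def read_text(path):
    # A tool that already decodes UTF-8 only has to recognise and
    # strip the three signature bytes at the start of the file.
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode("latin-1")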

Regards,
Martin

2005-09-17 06:20:16

by Martin v. Löwis

Subject: Re: [Patch] Support UTF-8 scripts

Bernd Petrovitsch wrote:
> On Fri, 2005-09-16 at 22:41 +0200, "Martin v. Löwis" wrote:
> [ Language-specific examples ]
>
> And that's the only working way - the programming languages can
> actually do it because each defines the syntax and semantics of the
> contents anyway.

It works from the programming language point of view, but it is a mess
from the text editor point of view.

Even for the programming language, it is a pain to implement: what
if you have non-ASCII characters before the pragma that declares the
encoding? And so on.

> With this marker you are interfering with (at least) *all* text files.

Hmm. What does that have to do with the patch I'm proposing? This
patch does *not* interfere with all text files. It is only relevant
for executable files starting with the #! magic.

> And thus with *all* tools which "handle" those text files.

This is simply not true. My patch does not interfere with any such
tools. They continue to work just fine.

>>So you *must* use encoding declarations in some languages; the UTF-8
>
>
> ... if you absolutely want to use non-ASCII characters in the source
> code. Most (if not all) of them have a native gettext()
> interface ...

True. However, this is more tedious to use. Also, it doesn't apply to
all cases: e.g. if you have comments, documentation etc. in the source
code, gettext is not an option.

Likewise, people often want to use non-ASCII in identifiers (e.g. class
Lösung); this can also only work if you know what the source encoding
is. You may argue that people just shouldn't do that, because it does
not work well, but this is not convincing: it doesn't work well because
language developers are too lazy to implement it. In fact, some languages
(C, C++, Java, C#) do support non-ASCII identifiers (at least in their
specifications); there really isn't a good reason not to support it
in scripting languages as well.
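
For illustration only (my own example; Python 3 accepts exactly this
under PEP 3131, provided the source encoding is known):

# Works in Python 3, where the parser knows the source is UTF-8.
class Lösung:
    größe = 42

print(Lösung.größe)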

> And there are always tools out there which simply do not understand the
> generic marker and cannot ignore it, since these bytes are part of the
> file.

This conclusion is false. Many tools that don't understand the file
structure still can do their job on the files. So the fact that a tool
does not understand the structure does not necessarily imply that
the tool breaks when the structure changes.

> Or another example: (try to) start a perl/shell/... script (without
> parameters on the first line) which was edited on Win* and copied in
> binary mode to a Unix system. Or at least guess what will happen ....

For a Python script, I don't need to guess: It will just work.

Regards,
Martin

2005-09-17 22:30:47

by Bernd Petrovitsch

Subject: Re: [Patch] Support UTF-8 scripts

On Sat, 2005-09-17 at 08:20 +0200, "Martin v. Löwis" wrote:
> Bernd Petrovitsch wrote:
> > On Fri, 2005-09-16 at 22:41 +0200, "Martin v. Löwis" wrote:
> > [ Language-specific examples ]
> >
> > And that's the only working way - the programming languages can
> > actually do it because each defines the syntax and semantics of the
> > contents anyway.
>
> It works from the programming language point of view, but it is a mess
> from the text editor point of view.

Most of the text editors have ways to mark up the source files. Not even
the various editors are able to agree on one method for all, so how
could the (Linux) world agree on one for all text files?

> Even for the programming language, it is a pain to implement: what
> if you have non-ASCII characters before the pragma that declares the
> encoding? And so on.

That's the problem of the language designers who absolutely want such
(IMHO absolutely superfluous) features.

> > With this marker you are interfering with (at least) *all* text files.
>
> Hmm. What does that have to do with the patch I'm proposing? This
> patch does *not* interfere with all text files. It is only relevant
> for executable files starting with the #! magic.

It *does* interfere since scripts are also text files in every aspect.
So every feature you want for "scripts" you also get for text files (and
vice versa BTW).
If you think "script" and "text file" are different, define both of
them, please, otherwise a discussion is pointless.

> > And there are always tools out there which simply do not understand the
> > generic marker and cannot ignore it, since these bytes are part of the
> > file.
>
> This conclusion is false. Many tools that don't understand the file
> structure still can do their job on the files. So the fact that a tool
> does not understand the structure does not necessarily imply that
> the tool breaks when the structure changes.

It *may* break just because of some to-be-ignored inline marking due to
some questionable feature.
And *when* (not if) it breaks, it is probably cumbersome to track
down, since you are dealing with unprintable characters.
Let alone the confusion when the size of a file per `ls -l` differs
from the size in the editor or in a marker-aware `wc -c`.
So IMHO you either have a clear and visible marker or none at all.

> > Or another example: (try to) start a perl/shell/... script (without
> > parameters on the first line) which was edited on Win* and copied in
> > binary mode to a Unix system. Or at least guess what will happen ....
>
> For a Python script, I don't need to guess: It will just work.

Then write a short Python script (with a "#!/usr/bin/python" line at the
start [without parameters]) natively on a Win* system, copy it in binary
mode to an arbitrary Linux system, and see what happens.

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services



2005-09-18 07:23:43

by Martin v. Löwis

Subject: Re: [Patch] Support UTF-8 scripts

Bernd Petrovitsch wrote:
> Most of the text editors have ways to mark up the source files. Not even
> the various editors are able to agree on one method for all, so how
> could the (Linux) world agree on one for all text files?

You are ignoring the role of standardization. People invent their own
mechanism if a standard is missing (or virtually unimplementable). For
declaring encodings, there is no standard (except for ISO 2022, which
is really hard to implement correctly). Therefore, editor authors
create their own standards.

At least Python abstained from creating yet another standard, and instead
supports both the declarations from Emacs and vim. To some degree, it
also supports notepad (namely through the UTF-8 signature).

However, people are much more likely to agree on a technology when it
is defined by a recognized standards body. This is the case for the
UTF-8 signature, which is defined by the Unicode consortium, for
precisely this purpose. Therefore, editors *will* agree on that
mechanism, while keeping their own mechanism for the more general
problem.

>>Even for the programming language, it is a pain to implement: what
>>if you have non-ASCII characters before the pragma that declares the
>>encoding? And so on.
>
>
> That's the problem of the language designers who absolutely want such
> (IMHO absolutely superfluous) features.

It's not the language designers who absolutely want this feature. It's
the language users. Of course, you'd have to be a language designer to
know that fact - language users go to the language designers asking for
the feature, not to the kernel developers.

>>Hmm. What does that have to do with the patch I'm proposing? This
>>patch does *not* interfere with all text files. It is only relevant
>>for executable files starting with the #! magic.
>
>
> It *does* interfere since scripts are also text files in every aspect.
> So every feature you want for "scripts" you also get for text files (and
> vice versa BTW).

The specific feature I get is that when I pass a file starting
with <utf8sig>#! to execve, Linux will execute the file following
the #!. In what way do I get this feature for text in general?
And if I do, why is that a problem?
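
Concretely, the new behaviour amounts to no more than this (a
user-space Python model of my own, not the kernel code;
parse_interpreter is a made-up name):

import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

def parse_interpreter(first_line):
    # If the file starts with the UTF-8 signature immediately followed
    # by '#!', skip the signature and parse the #! line as usual.
    if first_line.startswith(BOM + b"#!"):
        first_line = first_line[len(BOM):]
    if not first_line.startswith(b"#!"):
        return None  # not a #! script; the patch does not apply
    rest = first_line[2:].split(b"\n", 1)[0].strip()
    return rest.split()[0] if rest else None

A text file that does not start with <utf8sig>#! is handled exactly
as before.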

> If you think "script" and "text file" are different, define both of
> them, please, otherwise a discussion is pointless.

A script file (in the context of this discussion) is a text file
that is executable (i.e. has the appropriate subset of
S_IXUSR|S_IXGRP|S_IXOTH set), starts with #!, and has the path
name of an executable file after the #!.

More generally, a script file is a text file written in a scripting
language. A scripting language is a programming language which
supports "direct" execution of source code. So in the more
general definition, a script file does not need to start with
#!; for the context of this discussion, we should restrict
attention to files actually affected by the patch.

>>This conclusion is false. Many tools that don't understand the file
>>structure still can do their job on the files. So the fact that a tool
>>does not understand the structure does not necessarily imply that
>>the tool breaks when the structure changes.
>
>
> It *may* break just because of some to-be-ignored inline marking due to
> some questionable feature.

Be more specific. For what specific kind of file will cat(1) break?
Unless cat(1) has a 2GB limitation, I very much doubt it will break
(i.e. fail to do its job, "concatenate files and print on the standard
output") for any kind of input - whether this is text files, binary
files, images, sound files, HTML files. cat always does what it is
designed to do.

> Let alone the confusion when the size of a file per `ls -l` differs
> from the size in the editor or in a marker-aware `wc -c`.

This is true for any UTF-8 file, or any multibyte encoding. For any
multibyte encoding, the number of bytes in the file is different from
the number of characters. That doesn't (and shouldn't) stop people from
using multi-byte encodings.

What the editor displays as the number of "things" is up to the
editor. The output of wc -c will always be the same as that of ls -l,
as wc -c does *not* give you characters:

-c, --bytes
print the byte counts

You might have been thinking of 'wc -m'.
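
To illustrate the difference (my own example):

# A signature-prefixed, otherwise pure-ASCII file:
data = u"\ufeffhello\n".encode("utf-8")
print(len(data))                  # 9 bytes: what ls -l and wc -c report
print(len(data.decode("utf-8")))  # 7 characters: what wc -m reports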

>>For a Python script, I don't need to guess: It will just work.
>
>
> Then write a short Python script (with a "#!/usr/bin/python" line at the
> start [without parameters]) natively on a Win* system, copy it in binary
> mode to an arbitrary Linux system, and see what happens.

It depends on the editor I use, of course: the kernel will consider any
CR after the n as part of the interpreter name. Not sure what this has
to do with the specific patch, though.
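
A sketch of what happens (my own illustration):

# The kernel parses the #! line up to the newline, so the CR that a
# Win* editor added becomes part of the interpreter path:
line = b"#!/usr/bin/python\r\n"
interp = line[2:].split(b"\n", 1)[0]
print(interp)  # b'/usr/bin/python\r' - no such file, execve fails
# Rewriting the line endings first avoids the problem:
fixed = line.replace(b"\r\n", b"\n")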

Regards,
Martin

2005-09-18 14:53:31

by Bernd Petrovitsch

Subject: Re: [Patch] Support UTF-8 scripts

On Sun, 2005-09-18 at 09:23 +0200, "Martin v. Löwis" wrote:
[...]
> >>Hmm. What does that have to do with the patch I'm proposing? This
> >>patch does *not* interfere with all text files. It is only relevant
> >>for executable files starting with the #! magic.
> >
> > It *does* interfere since scripts are also text files in every aspect.
> > So every feature you want for "scripts" you also get for text files (and
> > vice versa BTW).
>
> The specific feature I get is that when I pass a file starting
> with <utf8sig>#! to execve, Linux will execute the file following
> the #!. In what way do I get this feature for text in general?
> And if I do, why is that a problem?

After applying this patch it seems that "Linux" supports this marker
officially in general - especially since the kernel supports it. I
suppose the next kernel patch will be to support Win-like CR-LF
sequences (which are not supported AFAIK).
BTW, even if some standards body thinks that this is the way to go, it
raises more problems and questions than it resolves.

> > If you think "script" and "text file" are different, define both of
> > them, please, otherwise a discussion is pointless.
>
> A script file (in the context of this discussion) is a text file
> that is executable (i.e. has the appropriate subset of
> S_IXUSR|S_IXGRP|S_IXOTH set), starts with #!, and has the path
> name of an executable file after the #!.
>
> More generally, a script file is a text file written in a scripting
> language. A scripting language is a programming language which
> supports "direct" execution of source code. So in the more
> general definition, a script file does not need to start with
> #!; for the context of this discussion, we should restrict
> attention to files actually affected by the patch.

And though scripts are usually edited/changed/"parsed"/... with a text
editor, that is not always the case. Hence the automatic extension to
*all text files* (especially as the marker basically applies to all
text files, not only scripts).
You want to focus just on your patch and ignore the directly implied
potential problems ...

[...]
> > It *may* break just because of some to-be-ignored inline marking due to
> > some questionable feature.
>
> Be more specific. For what specific kind of file will cat(1) break?

`cat` as such will not break.

> Unless cat(1) has a 2GB limitation, I very much doubt it will break
> (i.e. fail to do its job, "concatenate files and print on the standard
> output") for any kind of input - whether this is text files, binary
> files, images, sound files, HTML files. cat always does what it is
> designed to do.

Apparently I have to repeat: if you do `cat a.txt b.txt >c.txt` where
a.txt and b.txt have this marker, then c.txt has the marker of b.txt
somewhere in the middle. Does this make sense in any way?
How do I get rid of the marker in the middle transparently?
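
The only answer I can see is an explicit cleanup pass that every
pipeline would have to grow - something like this sketch (the function
name is made up, and it naively treats every occurrence of the three
bytes as a marker):

import codecs

def strip_interior_boms(data):
    bom = codecs.BOM_UTF8
    head = bom if data.startswith(bom) else b""
    return head + data.replace(bom, b"")  # keep at most one, leading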

> Let alone the confusion when the size of a file per `ls -l` differs
> from the size in the editor or in a marker-aware `wc -c`.
>
> This is true for any UTF-8 file, or any multibyte encoding. For any
> multibyte encoding, the number of bytes in the file is different from
> the number of characters. That doesn't (and shouldn't) stop people from
> using multi-byte encodings.

It is different even if a pure ASCII file is marked as UTF-8.
And sure, the problem exists in general with multi-byte encodings.

> What the editor displays as the number of "things" is up to the
> editor. The output of wc -c will always be the same as that of ls -l,
> as wc -c does *not* give you characters:
>
> -c, --bytes
> print the byte counts
>
> You might have been thinking of 'wc -m'.

It depends on the definition of "character". There are other standards
which define "character" as "byte".

[...]
> > Then write a short Python script (with a "#!/usr/bin/python" line at the
> > start [without parameters]) natively on a Win* system, copy it in binary
> > mode to an arbitrary Linux system, and see what happens.
>
> It depends on the editor I use, of course: the kernel will consider any

No, more on the OS the editor runs on.

> CR after the n as part of the interpreter name. Not sure what this has

ACK.

> to do with the specific patch, though.

It is not supported by the kernel. So either you remove it or you make
some compatibility hack (like an appropriate symlink, etc.). Since the
kernel can start Java classes directly, you can probably do a similar
thing for the UTF-8 stuff.
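
Roughly like this, I imagine (an untested sketch of my own; the wrapper
path is made up, and the rule uses binfmt_misc's documented hex-escape
syntax for the magic bytes):

# Register a binfmt_misc rule that matches the UTF-8 signature
# followed by '#!' and hands such files to a wrapper which strips
# the signature and re-executes the script. Run as root.
RULE = b":utf8script:M::\\xef\\xbb\\xbf#!::/usr/local/bin/bom-exec:"

with open("/proc/sys/fs/binfmt_misc/register", "wb") as f:
    f.write(RULE)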

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services