Bernd Petrovitsch <[email protected]> wrote:
> On Sun, 2005-09-18 at 09:23 +0200, "Martin v. Löwis" wrote:
> [...]
>> >>Hmm. What does that have to do with the patch I'm proposing? This
>> >>patch does *not* interfere with all text files. It is only relevant
>> >>for executable files starting with the #! magic.
>> >
>> > It *does* interfere since scripts are also text files in every aspect.
>> > So every feature you want for "scripts" you also get for text files (and
>> > vice versa BTW).
>>
>> The specific feature I get is that when I pass a file starting
>> with <utf8sig>#! to execve, Linux will execute the file following
>> the #!. In what way do I get this feature for text in general?
>> And if I do, why is that a problem?
>
> After applying this patch it seems that "Linux" is supporting this
> marker officially in general - especially if the kernel supports it.
It will be the first POSIX kernel to correctly support utf-8 scripts.
It's 2005, and according to other(?) posters, this should be standard.
> I
> suppose the next kernel patch is to support Win-like CR-LF sequences
> (which is not the case AFAIK).
Maybe it should, maybe it shouldn't. If I used Mac OS or DOS, I'd be sure it
should.-)
> BTW even some standards body thinks that this is the way to go,
Not surprisingly the Unicode Consortium is one of them.
> it
> raises more problems and questions than it resolves.
The problem of how to handle a BOM is solved by reading the standard.
> And though scripts are usually edited/changed/"parsed"/... with a text
> editor, it is not always the case. Hence the automatic extension to
> *all text files* (especially as the marker basically applies to all text
> files, not only scripts).
> You want to focus just on your patch and ignore the directly implied
> potential problems arising ...
There is no problem arising from the patch; it solves one.
To solve the rest, use recode.
[...]
> Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> a.txt and b.txt have this marker, then c.txt has the marker of b.txt
> somewhere in the middle. Does this make sense in any way?
> How do I get rid of the marker in the middle transparently?
The Unicode standard defines how to handle them.
>> > Let alone the confusion why the size of a file with `ls -l` is different
>> > from the size in the editor or a marker-aware `wc -c`.
>>
>> This is true for any UTF-8 file, or any multibyte encoding. For any
>> multibyte encoding, the number of bytes in the file is different from
>> the number of characters. That doesn't (and shouldn't) stop people from
>> using multi-byte encodings.
>
> It is different even if a pure ASCII file is marked as UTF-8.
No pure ASCII file will be marked, since a marked file is no longer a
pure ASCII file.
> And sure, the problem exists in general with multi-byte encodings.
ACK, but that's not a kernel problem nor a specific unicode problem.
Fix it by making China, Greece and Japan convert to ASCII and by making
all mathematicians stop using strange characters. All other users will
follow.
>> What the editor displays as the number of "things" is up to the editor.
>> The output of wc -c will always be the same as the one of ls -l,
>> as wc -c does *not* give you characters:
>>
>> -c, --bytes
>> print the byte counts
>>
>> You might have been thinking of 'wc -m'.
>
> It depends on the definition of "character". There are other standards
> which define "character" as "byte".
There are architectures defining a byte to be 32 bit.
They are irrelevant, too.
[...]
>> Not sure what this has
>> to do with the specific patch, though.
>
> It is not supported by the kernel. So either you remove it or you make
> some compatibility hack (like an appropriate sym-link
-EDOESNOTWORK
#!/usr/bin/perl -T -s -w
>, etc.). Since the
> kernel can start java classes directly, you can probably make a similar
> thing for the UTF-8 stuff.
If MS-DOS text files are legal scripts, the kernel
should recognize [\x0D\x0A] as valid line breaks.
(The real reason would be Unicode allowing a newline function to be
encoded as CR (0x0D) as well as LF (0x0A).)
This compile-tested patch adds 32 bytes to binfmt_script:
--- ./fs/binfmt_script.c.old	2005-09-18 20:28:32.000000000 +0200
+++ ./fs/binfmt_script.c	2005-09-18 20:29:44.000000000 +0200
@@ -18,7 +18,7 @@
 static int load_script(struct linux_binprm *bprm,struct pt_regs *regs)
 {
-	char *cp, *i_name, *i_arg;
+	char *cp, *cp2, *i_name, *i_arg;
 	struct file *file;
 	char interp[BINPRM_BUF_SIZE];
 	int retval;
@@ -47,6 +47,9 @@ static int load_script(struct linux_binp
 	bprm->buf[BINPRM_BUF_SIZE - 1] = '\0';
 	if ((cp = strchr(bprm->buf, '\n')) == NULL)
 		cp = bprm->buf+BINPRM_BUF_SIZE-1;
+	if ((cp2 = strchr(bprm->buf, '\x0D')) != NULL
+	    && cp2 < cp)
+		cp = cp2;
 	*cp = '\0';
 	while (cp > bprm->buf) {
 		cp--;
--
I thank GMX for sabotaging the use of my addresses with lies
spread via SPF.
On Sun, 2005-09-18 at 21:23 +0200, Bodo Eggert wrote:
[...]
> >> Not sure what this has
> >> to do with the specific patch, though.
> >
> > It is not supported by the kernel. So either you remove it or you make
> > some compatibility hack (like an appropriate sym-link
>
> -EDOESNOTWORK
>
> #!/usr/bin/perl -T -s -w
It is up to /usr/bin/perl how it handles a white-space character
directly after "-w".
> >, etc.). Since the
> > kernel can start java classes directly, you can probably make a similar
> > thing for the UTF-8 stuff.
>
> > If MS-DOS text files are legal scripts, the kernel
> > should recognize [\x0D\x0A] as valid line breaks.
The Unix world does recognize these line breaks. It is up to the tool how
to handle the white-space character before them. Especially for C and
similar languages with continuation lines this leads to interesting (or
by now more boring) problems.
Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services
On Sun, 18 Sep 2005 21:23:42 +0200, Bodo Eggert said:
> Bernd Petrovitsch <[email protected]> wrote:
> > Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> > a.txt and b.txt have this marker, then c.txt has the marker of b.txt
> > somewhere in the middle. Does this make sense in any way?
> > How do I get rid of the marker in the middle transparently?
>
> The Unicode standard defines how to handle them.
For the benefit of those of us who are interested in the problem, but aren't
in the mood to wade through a long standard looking for the answer to a
specific question, can you elaborate?
It isn't as obvious as all that, because of all the nasty corner cases...
> > It is different even if a pure ASCII file is marked as UTF-8.
>
> No pure ASCII file will be marked, since a marked file will be no
> ASCII file.
Given a file "a.txt" that's pure ASCII, and a file "b.txt" that has the BOM
marker on it, what happens when you do "cat a.txt b.txt > c.txt"?
'cat' doesn't know, and has no way of knowing, that c.txt needs a BOM at the
*front* of the file until it's already written past the point in c.txt where
the BOM has to go.
What does the Unicode standard say to do in this case?
Bodo Eggert wrote:
>
> It will be the first POSIX kernel to correctly support utf-8 scripts.
> It's 2005, and according to other(?) posters, this should be standard.
>
UTF-8, yes. BOM bullshit, no.
-hpa
On Sun, 18 Sep 2005, Bernd Petrovitsch wrote:
> On Sun, 2005-09-18 at 21:23 +0200, Bodo Eggert wrote:
> > >, etc.). Since the
> > > kernel can start java classes directly, you can probably make a similar
> > > thing for the UTF-8 stuff.
> >
> > If MS-DOS text files are legal scripts, the kernel
> > should recognize [\x0D\x0A] as valid line breaks.
>
> The Unix world does recognize the line breaks.
Create a valid text file with Macintosh line breaks (as allowed in Unicode
files) and try it.
--
If enough data is collected, a board of inquiry can prove ANYTHING.