LinuxLists.cc - Re: [Patch] Support UTF-8 scripts

2005-09-17 06:28:17

Subject: Re: [Patch] Support UTF-8 scripts

D. Hazelton wrote:
> This is a bogus argument. You're comparing the way a _binary_
> executable works to the way an interpreted _text_ script works.
> execve(), at least on my system, isn't capable of running a script -
> if I want to do that from a program I have to tell execve() that it's
> running /bin/sh and the script file is in the parameter list.

This being the linux-kernel list, I assume your system is Linux, no?
Well, on Linux, execve *does* support script files. This is the whole
point of my patch - I would not propose a kernel patch to improve
this support if it weren't there in the first place.

> While I appreciate that the kernel is capable of performing complex
> actions when execve runs into a file that is not an a.out or elf
> binary I have yet to see a "binfmt script" option in the kernel
> config files ever.

It's not a config option because it is always enabled. See
fs/binfmt_script.c for details. It wasn't integrated into the binfmt
system until I made it so some ten years ago, though.

> On the other hand, there is the "binfmt_misc" option, which does the
> work that you seem to be looking for and can, AFAIK, be set to handle
> both ASCII and UTF-8 scripts. Why add the complexity to the kernel
> when it's not needed?

One shouldn't add complexity if it's not needed. However, this patch
does not add complexity. It is fairly trivial.

Regards,
Martin

2005-09-18 02:28:43

by Daniel Hazelton

Subject: Re: [Patch] Support UTF-8 scripts

On Saturday 17 September 2005 06:28, "Martin v. Löwis" wrote:
> D. Hazelton wrote:
> > This is a bogus argument. You're comparing the way a _binary_
> > executable works to the way an interpreted _text_ script works.
> > execve(), at least on my system, isn't capable of running a
> > script - if I want to do that from a program I have to tell
> > execve() that it's running /bin/sh and the script file is in the
> > parameter list.
>
> This being the linux-kernel list, I assume your system is Linux,
> no? Well, on Linux, execve *does* support script files. This is the
> whole point of my patch - I would not propose a kernel patch to
> improve this support if it weren't there in the first place.

This is news to me. The last time I handed execve() a script as a
parameter I had errors returned from execve() -- I must admit that
this was not on my current system and I had assumed that the behavior
would be consistent.

> > While I appreciate that the kernel is capable of performing
> > complex actions when execve runs into a file that is not an a.out
> > or elf binary I have yet to see a "binfmt script" option in the
> > kernel config files ever.
>
> It's not a config option because it is always enabled. See
> fs/binfmt_script.c for details. It wasn't integrated into the
> binfmt system until I made it so some ten years ago, though.

I haven't gotten into that section of the code yet. I've been slowly
working my way through the code from the drivers that seem to cause
strange behavior on my system and then up the tree from there.

> > On the other hand, there is the "binfmt_misc" option, which does
> > the work that you seem to be looking for and can, AFAIK, be set
> > to handle both ASCII and UTF-8 scripts. Why add the complexity to
> > the kernel when it's not needed?
>
> One shouldn't add complexity if it's not needed. However, this patch
> does not add complexity. It is fairly trivial.

You are correct. It is fairly trivial. However my point still is valid
that the Kernel has the whole binfmt_misc system -- I will admit that
I have recently been shown numbers that show a noticeable difference
in the speed of a binary executed using the binfmt_misc system and
the binfmt_script system, but the fact remains that offering handling
for UTF8 and ASCII scripts directly in the kernel will likely lead to
at least one more patch in which the full Unicode standard is
implemented.

That, and my point remains that the kernel should know absolutely
nothing about how to execute a text file - the kernel should return
an error to the extent of "I don't know what to do with this file" to
the shell that tries to execute it, and the shell can then check for
the sh_bang. I do admit that this change would break a lot of
existing code, so I'll leave the argument to the experts.

> Regards,
> Martin

DRH

2005-09-18 03:46:26

by Kyle Moffett

Subject: Re: [Patch] Support UTF-8 scripts

On Sep 17, 2005, at 18:31:33, D. Hazelton wrote:
> That, and my point remains that the kernel should know absolutely
> nothing about how to execute a text file - the kernel should return
> an error to the extent of "I don't know what to do with this file"
> to the shell that tries to execute it, and the shell can then check
> for the sh_bang. I do admit that this change would break a lot of
> existing code, so I'll leave the argument to the experts.

No, that would not work at all. We have a very nice system to allow
set-uid scripts (Specifically, I like my nice secure taint-mode set-
uid perl scripts). If you did this, they would break completely, not
to mention _add_ all sorts of unsolvable race conditions to the few
ways of working around such a lack of SUID scripts. Also, it means
that I can't just "mv /sbin/init /sbin/init.real ; vim /sbin/init" to
do a simple wrapper around the init program, I would need to write a
compiled C program to do all sorts of fragile hackish things like
calling a script /sbin/init.sh.

Cheers,
Kyle Moffett

--
There are two ways of constructing a software design. One way is to
make it so simple that there are obviously no deficiencies. And the
other way is to make it so complicated that there are no obvious
deficiencies. The first method is far more difficult.
-- C.A.R. Hoare

2005-09-18 06:59:08

by Martin v. Löwis

Subject: Re: [Patch] Support UTF-8 scripts

D. Hazelton wrote:
> This is news to me. The last time I handed execve() a script as a
> parameter I had errors returned from execve() -- I must admit that
> this was not on my current system and I had assumed that the behavior
> would be consistent.

The kernel checks for #!<path>, and that <path> is an existing
executable. If not, execve fails.
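
Both outcomes can be observed from user space. The following is a
minimal sketch, not taken from the kernel or from the patch; the helper
names write_executable and execve_errno are made up for illustration.
execve_errno runs execve() in a child and reports 0 if the exec
succeeded, or the errno value it failed with. The close-on-exec pipe is
one common way to tell the two cases apart.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: write `text` to `path` and mark it executable. */
int write_executable(const char *path, const char *text)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    if (fputs(text, f) == EOF) {
        fclose(f);
        return -1;
    }
    if (fclose(f) != 0)
        return -1;
    return chmod(path, 0755);
}

/* Hypothetical helper: try execve(path) in a child process.
 * Returns 0 if the exec succeeded, the errno reported by execve() if it
 * failed, or -1 if the experiment itself could not be set up. */
int execve_errno(const char *path)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;
    /* Close-on-exec: a successful exec closes the write end, so the
     * parent reads end-of-file instead of an error code. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        char *const argv[] = { (char *)path, NULL };
        char *const envp[] = { NULL };
        close(fds[0]);
        execve(path, argv, envp);       /* returns only on failure */
        int err = errno;
        if (write(fds[1], &err, sizeof err) < 0)
            _exit(126);
        _exit(127);
    }

    close(fds[1]);
    int err = 0;
    ssize_t n = read(fds[0], &err, sizeof err);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    if (n == 0)
        return 0;                       /* exec succeeded */
    return n == (ssize_t)sizeof err ? err : -1;
}
```

On Linux one would expect 0 for a script whose #! line names an
existing executable such as /bin/sh, ENOENT when the named interpreter
does not exist, and ENOEXEC for an executable text file with no #! line
at all.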

> You are correct. It is fairly trivial. However my point still is valid
> that the Kernel has the whole binfmt_misc system -- I will admit that
> I have recently been shown numbers that show a noticeable difference
> in the speed of a binary executed using the binfmt_misc system and
> the binfmt_script system, but the fact remains that offering handling
> for UTF8 and ASCII scripts directly in the kernel will likely lead to
> at least one more patch in which the full Unicode standard is
> implemented.

The problem with the binfmt_misc approach is that you need *another*
execve call: with binfmt_misc, you register <utf8sig>#!, and a
generic binary. Then, this generic binary will interpret the #!
signature *again*, and invoke the proper interpreter. This will
interpret the first line *yet again* (finding that it is a comment),
and continue processing the file.
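
For reference, <utf8sig> here is the three-byte UTF-8 signature
EF BB BF. A minimal user-space sketch of recognising either form of
script header follows. It is an illustration only, not the code from
the patch, and the function name script_prefix_len is made up.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return how many leading bytes precede the
 * interpreter path: 2 for "#!", 5 for the UTF-8 signature followed by
 * "#!", or 0 if the buffer does not start with a script header. */
size_t script_prefix_len(const unsigned char *buf, size_t len)
{
    static const unsigned char sig[] = { 0xEF, 0xBB, 0xBF };
    size_t off = 0;

    /* Skip the optional UTF-8 signature. */
    if (len >= sizeof sig && memcmp(buf, sig, sizeof sig) == 0)
        off = sizeof sig;

    /* off <= len holds here, so len - off cannot wrap around. */
    if (len - off >= 2 && buf[off] == '#' && buf[off + 1] == '!')
        return off + 2;
    return 0;
}
```

The check is a single comparison against a fixed byte sequence; nothing
beyond the signature bytes is decoded.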

However, this is not the real problem. The real problem is that
the specific binfmt_misc "backend" would not be universally
available, and then the same script would start on some systems,
and break on others. This may be acceptable for large or specific
applications (e.g. you have to setup the ibcs2 module to run
SCO applications); it is not for scripts.

Now, the "universally available" part would not apply right now,
as only the most recent kernels would provide the feature. However,
within a few years, the feature would be part of "Linux" - then
people can start using it extensively.

> That, and my point remains that the kernel should know absolutely
> nothing about how to execute a text file - the kernel should return
> an error to the extent of "I don't know what to do with this file" to
> the shell that tries to execute it, and the shell can then check for
> the sh_bang. I do admit that this change would break a lot of
> existing code, so I'll leave the argument to the experts.

The point is that it is not necessarily the shell which starts
programs - the shell is but one creator of new processes. It is
very common today that, say, httpd starts new programs - this
mechanism is called CGI. Your approach was in use until 1985 or
so, when Unix implementations started to support #! natively.
This was done both for convenience and for performance: if
programs would always use system(3) to start new processes,
there would always be a shell that execs the eventual
interpreter.

I'm not sure, but I believe that most current shells have "forgotten"
how to do the #! magic, since, by now, "traditionally" this is
a kernel responsibility.
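
To illustrate what the user-space approach asks of every program that
launches scripts: each one has to read the first line and pick out the
interpreter itself. A rough sketch of that single step is below. The
name parse_shebang is hypothetical, and a real launcher would also have
to deal with interpreter arguments and line-length limits.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the interpreter path from a "#!" first line
 * into `out`. Returns 0 on success, or -1 if the line has no "#!"
 * prefix, names no interpreter, or the path does not fit in `out`.
 * Anything after the path (interpreter arguments) is ignored here. */
int parse_shebang(const char *line, char *out, size_t outsz)
{
    if (strncmp(line, "#!", 2) != 0)
        return -1;
    line += 2;
    line += strspn(line, " \t");            /* blanks after "#!" */

    size_t n = strcspn(line, " \t\r\n");    /* path ends at blank or newline */
    if (n == 0 || n >= outsz)
        return -1;

    memcpy(out, line, n);
    out[n] = '\0';
    return 0;
}
```

For "#! /usr/bin/perl -T" this yields "/usr/bin/perl"; the launcher
would then have to exec that path with the script name appended, which
is what the kernel does on its own today.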

Regards,
Martin

2005-09-19 04:11:59

by Daniel Hazelton

Subject: Re: [Patch] Support UTF-8 scripts

On Sunday 18 September 2005 03:45, Kyle Moffett wrote:
> On Sep 17, 2005, at 18:31:33, D. Hazelton wrote:
> > That, and my point remains that the kernel should know absolutely
> > nothing about how to execute a text file - the kernel should
> > return an error to the extent of "I don't know what to do with
> > this file" to the shell that tries to execute it, and the shell
> > can then check for the sh_bang. I do admit that this change would
> > break a lot of existing code, so I'll leave the argument to the
> > experts.
>
> No, that would not work at all. We have a very nice system to
> allow set-uid scripts (Specifically, I like my nice secure
> taint-mode set-uid perl scripts). If you did this, they would
> break completely, not to mention _add_ all sorts of unsolvable race
> conditions to the few ways of working around such a lack of SUID
> scripts. Also, it means that I can't just "mv /sbin/init
> /sbin/init.real ; vim /sbin/init" to do a simple wrapper around the
> init program, I would need to write a compiled C program to do all
> sorts of fragile hackish things like calling a script
> /sbin/init.sh.

This makes a lot more sense than I expected to hear. This argument
alone is enough for me to understand the reasoning behind the kernel
knowing how to interpret a shell script. Problem is, the program
would not be fragile or hackish - it'd be almost as simple as a
"hello world" program.

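#include <unistd.h>

int main(void) {
    /* argv[0] is the shell's own name, argv[1] the script it should run */
    char *const argv[] = { "/bin/sh", "/sbin/init.sh", NULL };
    char *const envp[] = { NULL };

    /* if this fails the system is busted anyway */
    execve("/bin/sh", argv, envp);
    return 1; /* only reached if execve() failed */
}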

-- This program would do the trick nicely, and since init is run as
root, there is no need to worry about the program having to grab
privs.

However, the real problem is that this would break the initrd systems
used by most distributions for installation, and it would probably
break most of the "early userspace" systems just coming into use. As
I said originally - my comment about having the shell itself
interpret the sh_bang would break a lot of stuff and I've been shown
that I have to spend more time in the kernel code (as I haven't
finished going through the various drivers to see how those have been
made to work) before I can make a good suggestion in a discussion
like this.

DRH

2005-09-19 04:55:30

by Daniel Hazelton

Subject: Re: [Patch] Support UTF-8 scripts

On Sunday 18 September 2005 06:58, "Martin v. Löwis" wrote:
> D. Hazelton wrote:
> > This is news to me. The last time I handed execve() a script as a
> > parameter I had errors returned from execve() -- I must admit that
> > this was not on my current system and I had assumed that the
> > behavior would be consistent.
>
> The kernel checks for #!<path>, and that <path> is an existing
> executable. If not, execve fails.
>
> > You are correct. It is fairly trivial. However my point still is
> > valid that the Kernel has the whole binfmt_misc system -- I will
> > admit that I have recently been shown numbers that show a
> > noticeable difference in the speed of a binary executed using the
> > binfmt_misc system and the binfmt_script system, but the fact
> > remains that offering handling for UTF8 and ASCII scripts
> > directly in the kernel will likely lead to at least one more
> > patch in which the full Unicode standard is implemented.
>
> The problem with the binfmt_misc approach is that you need
> *another* execve call: with binfmt_misc, you register <utf8sig>#!,
> and a generic binary. Then, this generic binary will interpret the
> #! signature *again*, and invoke the proper interpreter. This will
> interpret the first line *yet again* (finding that it is a comment),
> and continue processing the file.

True. I had forgotten that for truly generic rules about handling the
#! there would be double the overhead for the sh_bang.

> However, this is not the real problem. The real problem is that
> the specific binfmt_misc "backend" would not be universally
> available, and then the same script would start on some systems,
> and break on others. This may be acceptable for large or specific
> applications (e.g. you have to setup the ibcs2 module to run
> SCO applications); it is not for scripts.

Again this is all too true. Doubly so with the problem of an initrd
that has 'init' as a script.

> Now, the "universally available" part would not apply right now,
> as only the most recent kernels would provide the feature. However,
> within a few years, the feature would be part of "Linux" - then
> people can start using it extensively.

This sounds to me like you're saying in a few years my suggestion of
using binfmt_misc would be tenable. Unfortunately, unless forced into
it, no distro would ever use it. As I now see it, binfmt_script is
pretty much a hard-coded hack that gives the system a bit more speed
for running scripts. And since I've thought about the consequences of
ripping it out after the posts yesterday - there is no clean way to
remove it and have a large number of systems still function.

> > That, and my point remains that the kernel should know absolutely
> > nothing about how to execute a text file - the kernel should
> > return an error to the extent of "I don't know what to do with
> > this file" to the shell that tries to execute it, and the shell
> > can then check for the sh_bang. I do admit that this change would
> > break a lot of existing code, so I'll leave the argument to the
> > experts.
>
> The point is that it is not necessarily the shell which starts
> programs - the shell is but one creator of new processes. It is
> very common today that, say, httpd starts new programs - this
> mechanism is called CGI. Your approach was in use until 1985 or
> so, when Unix implementations started to support #! natively.
> This was done both for convenience and for performance: if
> programs would always use system(3) to start new processes,
> there would always be a shell that execs the eventual
> interpreter.

True. In some cases, though, system(3) is really unusable - like you
mentioned, httpd often starts new processes. Since daemons don't,
technically, run on top of a shell, having one use system(3) to start
a new process would add a lot of unnecessary overhead.

> I'm not sure, but I believe that most current shells have
> "forgotten" how to do the #! magic, since, by now, "traditionally"
> this is a kernel responsibility.

Not true. Bash, at least, still handles the sh_bang. (Provable by
using it to call a perl script that doesn't have the exec bit set.
This worked for me just a week ago :)

DRH