Hi Linus,
Raised by: http://bugzilla.kernel.org/show_bug.cgi?id=10095 , there is
the question of whether we want to separate the env+arg arrays from the
stack proper.
Currently these arrays are considered part of the stack, and
RLIMIT_STACK includes them. However POSIX does not specify it must be
so.
The complaint is that sysconf(_SC_ARG_MAX) returns a hard coded value
(which is not obtained from the kernel) and might, depending on the
RLIMIT_STACK setting, be invalid.
POSIX disallows sysconf() variables to change during the execution of a
process, so even if it would ask the kernel for a value, we could not
give a sane answer.
The suggestion is to introduce a new RLIMIT_ARG_MAX which takes over the
role of sysconf(_SC_ARG_MAX), however this would require we either
separate these values into their own vma, or subtract
mm->env_end - mm->env_start + mm->arg_end - mm->arg_start
from the computed vma size when we test RLIMIT_STACK.
I'm still of two minds on this issue.. but fwiw here is a patch
implementing RLIMIT_ARG_MAX - utterly untested and doesn't consider
!MMU.
---
Subject: RLIMIT_ARG_MAX
Having this rlimit allows userspace to determine how large argv arrays
can be (after they bother to calculate the env size).
Signed-off-by: Peter Zijlstra <[email protected]>
---
fs/exec.c | 2 +-
fs/proc/base.c | 1 +
include/asm-generic/resource.h | 4 +++-
mm/mmap.c | 6 +++++-
4 files changed, 10 insertions(+), 3 deletions(-)
Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c
+++ linux-2.6/fs/exec.c
@@ -183,7 +183,7 @@ static struct page *get_arg_page(struct
* - the program will have a reasonable amount of stack left
* to work from.
*/
- if (size > rlim[RLIMIT_STACK].rlim_cur / 4) {
+ if (size > rlim[RLIMIT_ARG_MAX].rlim_cur) {
put_page(page);
return NULL;
}
Index: linux-2.6/fs/proc/base.c
===================================================================
--- linux-2.6.orig/fs/proc/base.c
+++ linux-2.6/fs/proc/base.c
@@ -412,6 +412,7 @@ static const struct limit_names lnames[R
[RLIMIT_NICE] = {"Max nice priority", NULL},
[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
+ [RLIMIT_ARG_MAX] = {"Max env+arg space", "bytes"},
};
/* Display limits for a process */
Index: linux-2.6/include/asm-generic/resource.h
===================================================================
--- linux-2.6.orig/include/asm-generic/resource.h
+++ linux-2.6/include/asm-generic/resource.h
@@ -45,7 +45,8 @@
0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
#define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
-#define RLIM_NLIMITS 16
+#define RLIMIT_ARG_MAX 16 /* maximum env+arg space */
+#define RLIM_NLIMITS 17
/*
* SuS says limits have to be unsigned.
@@ -87,6 +88,7 @@
[RLIMIT_NICE] = { 0, 0 }, \
[RLIMIT_RTPRIO] = { 0, 0 }, \
[RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
+ [RLIMIT_ARG_MAX] = { 32*PAGE_SIZE, _STK_LIM/4 }, \
}
#endif /* __KERNEL__ */
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1516,13 +1516,17 @@ static int acct_stack_growth(struct vm_a
struct mm_struct *mm = vma->vm_mm;
struct rlimit *rlim = current->signal->rlim;
unsigned long new_start;
+ unsigned long env_arg_size;
/* address space limit tests */
if (!may_expand_vm(mm, grow))
return -ENOMEM;
+ env_arg_size = mm->env_end - mm->env_start +
+ mm->arg_end - mm->arg_start;
+
/* Stack limit test */
- if (size > rlim[RLIMIT_STACK].rlim_cur)
+ if (size - env_arg_size > rlim[RLIMIT_STACK].rlim_cur)
return -ENOMEM;
/* mlock limit tests */
On Wed, 27 Feb 2008, Peter Zijlstra wrote:
>
> Currently these arrays are considered part of the stack, and
> RLIMIT_STACK includes them. However POSIX does not specify it must be
> so.
What's the real advantage of this? I'm not seeing it. Just an extra
complexity "niceness" that nobody can rely on anyway since it's not even
specified, and older kernels won't do it.
Linus
[Adding Ulrich D to the CC]
On Fri, Feb 29, 2008 at 5:05 PM, Linus Torvalds
<[email protected]> wrote:
>
>
> On Wed, 27 Feb 2008, Peter Zijlstra wrote:
> >
> > Currently these arrays are considered part of the stack, and
> > RLIMIT_STACK includes them. However POSIX does not specify it must be
> > so.
>
> What's the real advantage of this? I'm not seeing it. Just an extra
> complexity "niceness" that nobody can rely on anyway since it's not even
> specified, and older kernels won't do it.
The advantages are the following:
1. We don't break the ABI. In 2.6.23, RLIMIT_STACK acquired an
additional semantic: RLIMIT_STACK/4 specified the size for
argv+environ. [email protected] added this feature to allow processes to
have much larger argument lists. However, if the user sets
RLIMIT_STACK to less than 512k, then the amount of space for
argv+environ falls below the space guaranteed by kernel 2.6.22 and
earlier. (Older kernels guaranteed at least 128k for argv+environ.)
Manipulating RLIMIT_STACK did not previously have this effect. (One
place this matters is with NPTL, where, if RLIMIT_STACK is set to
anything other than unlimited, then it is used as the default stack
size when creating new threads. When creating many threads, it may
well be desirable to set RLIMIT_STACK to a value lower than 512k.)
While the new functionality provided by [email protected]'s work is
useful, RLIMIT_STACK really should not have been overloaded with a
second meaning, since it is no longer possible to control stack size
without also changing the limit on argv+environ space. Hence the
proposal of a new resource limit.
2. It provides a sane mechanism for an application to determine the
space available for argv+environ. Formerly this space was an
invariant, advertised via sysconf(_SC_ARG_MAX).
3. The implementation details about stack size and size/location of
argv+environ can be decoupled.
Cheers,
Michael
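The 512k threshold in point 1 is simple arithmetic. Here is a minimal sketch, assuming the 2.6.23 rule is exactly RLIMIT_STACK/4 and the older guarantee is the 128k figure quoted above. The helper names are invented for illustration and are not kernel functions.

```c
#include <stdbool.h>

/* Pre-2.6.23 guarantee for argv+environ as quoted in this thread. */
#define OLD_ARG_GUARANTEE (128UL * 1024)

/* Illustrative helper: argv+environ space under the 2.6.23 rule. */
unsigned long arg_space_2_6_23(unsigned long stack_limit)
{
	return stack_limit / 4;
}

/* True when the 2.6.23 rule yields less than older kernels guaranteed. */
bool below_old_guarantee(unsigned long stack_limit)
{
	return arg_space_2_6_23(stack_limit) < OLD_ARG_GUARANTEE;
}
```

Any stack limit below 512k makes below_old_guarantee() return true, which is the case that matters for NPTL users who shrink RLIMIT_STACK.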
On Fri, 29 Feb 2008, Michael Kerrisk wrote:
> > What's the real advantage of this? I'm not seeing it. Just an extra
> > complexity "niceness" that nobody can rely on anyway since it's not even
> > specified, and older kernels won't do it.
>
> The advantages are the following:
>
> 1. We don't break the ABI. in 2.6.23, RLIMIT_STACK acquired an
> additional semantic: RLIMIT_STACK/4 specified the size for
> argv+environ.
So maybe we should change *that* then, and just allow arg/env to be more
than 25%.
> 2. It provides a sane mechanism for an application to determine the
> space available for argv+environ. Formerly this space was an
> invariant, advertised via sysconf(_SC_ARG_MAX).
... and what's the point? We've never had it before, nobody has ever cared,
and the whole notion is just stupid. Why would we want to limit it? The
only thing that the kernel *cares* about is the stack size - any other
size limits are always going to be arbitrary.
> 3. The implementation details about stack size and size/location of
> argv+environ can be decoupled.
Now, this is a potentially interesting argument, but is it true (ie don't
we have programs that know about the status quo) and are people actually
planning on doing that (for what reason?) or is it just a theoretical one?
Linus
On Fri, 2008-02-29 at 17:58 +0100, Michael Kerrisk wrote:
> [Adding Ulrich D to the CC]
>
> On Fri, Feb 29, 2008 at 5:05 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> >
> > On Wed, 27 Feb 2008, Peter Zijlstra wrote:
> > >
> > > Currently these arrays are considered part of the stack, and
> > > RLIMIT_STACK includes them. However POSIX does not specify it must be
> > > so.
> >
> > What's the real advantage of this? I'm not seeing it. Just an extra
> > complexity "niceness" that nobody can rely on anyway since it's not even
> > specified, and older kernels won't do it.
>
> The advantages are the following:
>
> 1. We don't break the ABI. in 2.6.23, RLIMIT_STACK acquired an
> additional semantic: RLIMIT_STACK/4 specified the size for
> argv+environ. [email protected] added this feature to allow processes to
> have much larger argument lists. However, if the user sets
> RLIMIT_STACK to less than 512k, then the amount of space for
> argv+environ falls below the space guaranteed by kernel 2.6.22 and
> earlier. (Older kernels guaranteed at least 128k for argv+environ.)
> Manipulating RLIMIT_STACK did not previously have this effect. (One
> place this matters is with NPTL, where, if RLIMIT_STACK is set to
> anything other than unlimited, then it is used as the default stack
> size when creating new threads. When creating many threads, it may
> well be desirable to set RLIMIT_STACK to a value lower than 512k.)
>
> While the new functionality provided by [email protected]'s work is
> useful, RLIMIT_STACK really should not have been overloaded with a
> second meaning, since it is no longer possible to control stack size
> without also changing the limit on argv+environ space. Hence the
> proposal of a new resource limit.
>
> 2. It provides a sane mechanism for an application to determine the
> space available for argv+environ. Formerly this space was an
> invariant, advertised via sysconf(_SC_ARG_MAX).
>
> 3. The implementation details about stack size and size/location of
> argv+environ can be decoupled.
You fail to mention that <23 will still fault the first time it tries to
grow the stack when you set rlimit_stack to 128k and actually supply
128k of env+arg.
On Fri, 2008-02-29 at 09:12 -0800, Linus Torvalds wrote:
> > 2. It provides a sane mechanism for an application to determine the
> > space available for argv+environ. Formerly this space was an
> > invariant, advertised via sysconf(_SC_ARG_MAX).
>
> ... and what's the point? We've never had it before, nobody has ever cared,
> and the whole notion is just stupid. Why would we want to limit it? The
> only thing that the kernel *cares* about is the stack size - any other
> size limits are always going to be arbitrary.
Well, don't think of limiting it, but querying the limit.
Programs like xargs would need to know how much to stuff into argv
before starting a new invocation.
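For illustration, this is roughly what an xargs-like program could do today to pick a conservative budget. It is only a sketch: the stack/4 term assumes the 2.6.23 behaviour described in this thread, arg_budget() is a made-up name, and the caller would still have to subtract its own environment and the pointer arrays.

```c
#define _POSIX_C_SOURCE 200809L
#include <limits.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Conservative argv+environ budget: the smaller of sysconf(_SC_ARG_MAX)
 * and a quarter of the current stack limit (assumed 2.6.23 rule).
 */
long arg_budget(void)
{
	long budget = sysconf(_SC_ARG_MAX);
	struct rlimit rl;

	if (budget < 0)
		budget = _POSIX_ARG_MAX;	/* POSIX-mandated minimum */

	if (getrlimit(RLIMIT_STACK, &rl) == 0 &&
	    rl.rlim_cur != RLIM_INFINITY) {
		rlim_t quarter = rl.rlim_cur / 4;

		if (quarter < (rlim_t)budget)
			budget = (long)quarter;
	}
	return budget;
}
```

The result is only valid until somebody changes RLIMIT_STACK, which is exactly the consistency problem with sysconf() discussed here.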
On Fri, 29 Feb 2008, Peter Zijlstra wrote:
>
> > ... and what's the point? We've never had it before, nobody has ever cared,
> > and the whole notion is just stupid. Why would we want to limit it? The
> > only thing that the kernel *cares* about is the stack size - any other
> > size limits are always going to be arbitrary.
>
> Well, don't think of limiting it, but querying the limit.
>
> Programs like xargs would need to know how much to stuff into argv
> before starting a new invocation.
But they already can't really do that. More importantly, isn't it better
to just use the whole stack size then (or just return "stack size / 4" or
whatever)?
Linus
On Fri, 29 Feb 2008, Peter Zijlstra wrote:
>
> You fail to mention that <23 will still fault the first time it tries to
> grow the stack when you set rlimit_stack to 128k and actually supply
> 128k of env+arg.
So? That's what rlimit_stack has always meant (and not just on Linux
either, afaik). That's not a bug, it's a feature. If the system has a
limited stack, it has a limited stack. That's what RLIMIT_STACK means.
Linus
On Fri, 2008-02-29 at 09:29 -0800, Linus Torvalds wrote:
>
> On Fri, 29 Feb 2008, Peter Zijlstra wrote:
> >
> > > ... and what's the point? We've never had it before, nobody has ever cared,
> > > and the whole notion is just stupid. Why would we want to limit it? The
> > > only thing that the kernel *cares* about is the stack size - any other
> > > size limits are always going to be arbitrary.
> >
> > Well, don't think of limiting it, but querying the limit.
> >
> > Programs like xargs would need to know how much to stuff into argv
> > before starting a new invocation.
>
> But they already can't really do that.
I think they used to use sysconf(_SC_ARG_MAX) to do that.
> More importantly, isn't it better to just use the whole stack size then
Well, we ran into trouble of freshly spawned tasks faulting on the first
stack grow. The /4 thing was to avoid that situation.
> (or just return "stack size / 4" or whatever)?
I'm all for that, trouble is that the POSIX folks specified that the
sysconf() value must be consistent during the lifetime of a process.
Which isn't true, because we can change rlimit_stack after asking. And
the linux implementation doesn't even seem to bother asking the kernel -
so there just isn't much we _can_ do here.
My suggestion was a kernel version check along with sysconf or
rlimit_stack. But I guess that made the userspace people puke :-)
On Fri, 2008-02-29 at 09:35 -0800, Linus Torvalds wrote:
>
> On Fri, 29 Feb 2008, Peter Zijlstra wrote:
> >
> > You fail to mention that <23 will still fault the first time it tries to
> > grow the stack when you set rlimit_stack to 128k and actually supply
> > 128k of env+arg.
>
> So? That's what rlimit_stack has always meant (and not just on Linux
> either, afaik). That's not a bug, it's a feature. If the system has a
> limited stack, it has a limited stack. That's what RLIMIT_STACK means.
Well, I agree with that point. It's just that apparently POSIX does not.
According to Michael POSIX does not consider the arg+env array part of
the stack proper.
On Fri, 29 Feb 2008, Peter Zijlstra wrote:
>
> > More importantly, isn't it better to just use the whole stack size then
>
> Well, we ran into trouble of freshly spawned tasks faulting on the first
> stack grow. The /4 thing was to avoid that situation.
Yeah, I do see the point of wanting slop, but maybe the right thing to do
is simply to just not make the slop be 75% of it all ;)
The thing is, RLIMIT_STACK has never been "exact" in the first place (ie
it has *always* contained the argument and environment as a part of it),
and that is really traditional behaviour even outside of Linux.
And I seriously doubt that RLIMIT_ARG_MAX really buys people anything
truly wonderful, and it definitely adds just another thing you can screw
up and make the system just behave differently depending on a config value
that doesn't even matter to the kernel. In fact, even with that patch,
it's *still* not going to handle the difference between the actual string
space and the pointers themselves, or even all the other setup stuff that
the binary loaders will put on the stack.
So it's not *going* to be exact even with RLIMIT_ARG_MAX, because it's
going to have all those other issues to contend with - on a 64-bit
architecture, the argument _pointers_ are often within an order of
magnitude of the argument strings themselves, and I don't think your patch
counted them as part of the argument/environment size (I was too lazy to
check the sources, but I'm pretty sure argv/env_start/end is just the
string space, not the pointers).
So rather than introduce a new thing that is not going to be trustworthy
anyway, I'd much rather just remove the limit entirely.
Also, it all boils down to the fact that the whole argument is utter crap:
> POSIX disallows sysconf() variables to change during the execution of a
> process, so even if it would ask the kernel for a value, we could not
> give a sane answer.
If the resource limits change, then it makes not a whit of a difference
whether _SC_ARG_MAX changes or not, because it's not going to reflect
reality. So you might as well just continue to give the value we've always
given (128k? I don't remember). Because it's not going to be the "real"
value *anyway*.
This whole argument seems pointless. Has anybody ever really cared? Why
not just keep _SC_ARG_MAX at the old (small) limit, and then the fact that
99% of all programs won't even care, and that they can actually use much
larger limits in real life is just gravy.
A *good* implementation would generally just do the execve() with the
maximal arguments, and only bother to see "oh, maybe I can split it up" if
it returns ENOMEM or whatever it does.
So I don't see the practical value here - _SC_ARG_MAX is not worth having
another tweaking value for that people will just always get wrong anyway
because there is no right answer (except "I don't want my stack to grow
too large" where it's just one of the relevant things)
Linus
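The "just try the execve() and split on failure" approach can be sketched as below. This is an illustration only: try_exec_fn is a hypothetical callback standing in for a fork()+execve() wrapper, and fake_try_exec() is a test stand-in that rejects oversized batches with E2BIG, the error execve() reports for an argument list that is too large.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical callback: attempt one invocation with items[0..n).
 * Returns 0 on success, or -1 with errno set.
 */
typedef int (*try_exec_fn)(char *const items[], size_t n, void *ctx);

/* Run every item, halving the batch whenever it is rejected as too big. */
int run_in_batches(char *const items[], size_t n,
		   try_exec_fn try_exec, void *ctx)
{
	size_t done = 0;
	size_t batch = n;

	while (done < n) {
		if (batch > n - done)
			batch = n - done;
		if (try_exec(items + done, batch, ctx) == 0) {
			done += batch;
			continue;
		}
		/* Only size-related failures are worth retrying. */
		if ((errno != E2BIG && errno != ENOMEM) || batch == 1)
			return -1;
		batch /= 2;
	}
	return 0;
}

/* Stand-in for a real exec wrapper: rejects more than max_items at once. */
struct fake_exec {
	size_t max_items;
	size_t items_run;
	size_t calls;
};

int fake_try_exec(char *const items[], size_t n, void *ctx)
{
	struct fake_exec *fx = ctx;

	(void)items;
	fx->calls++;
	if (n > fx->max_items) {
		errno = E2BIG;
		return -1;
	}
	fx->items_run += n;
	return 0;
}
```

No limit is queried up front, so the loop keeps working whatever the kernel's real limit is. The cost is a few failed exec attempts on the first oversized batch.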
On Fri, 29 Feb 2008, Peter Zijlstra wrote:
>
> Well, I agree with that point. It just that apparently POSIX does not.
> According to Michael POSIX does not consider the arg+env array part of
> the stack proper.
I don't think that's true.
POSIX has always been guided on "what you can depend on", not "we make up
new features". And if that has changed, then it's a problem for POSIX, not
for Linux.
IOW, I think somebody is either reading the standard wrong, or the
standard is not worth reading.
Linus
Peter Zijlstra wrote:
> On Fri, 2008-02-29 at 09:35 -0800, Linus Torvalds wrote:
>> On Fri, 29 Feb 2008, Peter Zijlstra wrote:
>>> You fail to mention that <23 will still fault the first time it tries to
>>> grow the stack when you set rlimit_stack to 128k and actually supply
>>> 128k of env+arg.
>> So? That's what rlimit_stack has always meant (and not just on Linux
>> either, afaik). That's not a bug, it's a feature. If the system has a
>> limited stack, it has a limited stack. That's what RLIMIT_STACK means.
>
> Well, I agree with that point. It just that apparently POSIX does not.
> According to Michael POSIX does not consider the arg+env array part of
> the stack proper.
AFAIK, POSIX.1 makes no requirement here. Most (all?) Unix systems have
traditionally placed argv+environ just above the stack, but that isn't
required.
My reading of POSIX.1 (and POSIX doesn't seem very explicit on this
point), is that the limits on argv+environ and on stack are decoupled,
since POSIX specifies RLIMIT_STACK and sysconf(_SC_ARG_MAX) and doesn't
specify any relationship between the two.
On Fri, 29 Feb 2008, Michael Kerrisk wrote:
>
> My reading of POSIX.1 (and POSIX doesn't seem very explicit on this point), is
> that the limits on argv+environ and on stack are decoupled, since POSIX
> specifies RLIMIT_STACK and sysconf(_SC_ARG_MAX) and doesn't specify any
> relationship between the two.
I agree. And clearly there _are_ relationships and always have been, but
equally clearly they simply haven't been a big issue in practice, and
nobody really cares.
Usually, _SC_ARG_MAX is just so much smaller than RLIMIT_STACK that it
makes no possible difference. Which I would actually argue we should just
continue with: just keep _SC_ARG_MAX a smallish, irrelevant constant.
We still have to have the compile-time ARG_MAX constant (as in *real*
constant - a #define) anyway, for traditional programs, and you might as
well make sysconf(_SC_ARG_MAX) always just match ARG_MAX.
It's not like there is likely a single user of _SC_ARG_MAX that cares.
Linus
On Fri, 29 Feb 2008 18:55:56 +0100
Peter Zijlstra <[email protected]> wrote:
>
> On Fri, 2008-02-29 at 09:35 -0800, Linus Torvalds wrote:
> >
> > On Fri, 29 Feb 2008, Peter Zijlstra wrote:
> > >
> > > You fail to mention that <23 will still fault the first time it tries to
> > > grow the stack when you set rlimit_stack to 128k and actually supply
> > > 128k of env+arg.
> >
> > So? That's what rlimit_stack has always meant (and not just on Linux
> > either, afaik). That's not a bug, it's a feature. If the system has a
> > limited stack, it has a limited stack. That's what RLIMIT_STACK means.
>
> Well, I agree with that point. It just that apparently POSIX does not.
> According to Michael POSIX does not consider the arg+env array part of
> the stack proper.
As far as I can see POSIX and SuS do not care. In all the ABIs some of
your stack is already used by stuff. Posix doesn't seem to consider it
either way. By some undefined magic main() gets argc, argv, envp. Quite
frankly it could read them from a pipe before main is called.
Alan
On Fri, Feb 29, 2008 at 10:12 AM, Linus Torvalds
<[email protected]> wrote:
>
> So it's not *going* to be exact even with RLIMIT_ARG_MAX, because it's
> going to have all those other issues to contend with - on a 64-bit
> architecture, the argument _pointers_ are often within an order of
> magnitude of the argument strings themselves, and I don't think your patch
> counted them as part of the argument/environemnt size (I was too lazy to
> check the sources, but I'm pretty sure argv/env_start/end is just the
> string space, not the pointers).
This is precisely why I picked 25% as the maximum argument size ratio.
In practice, that 25% can easily mean 50% or more. If people want to
increase this, it can probably be tweaked somewhat, but switching it
to, say, 50% probably isn't a good idea.
Ollie
On Fri, Feb 29, 2008 at 11:01:38AM -0800, Ollie Wild wrote:
> On Fri, Feb 29, 2008 at 10:12 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > So it's not *going* to be exact even with RLIMIT_ARG_MAX, because it's
> > going to have all those other issues to contend with - on a 64-bit
> > architecture, the argument _pointers_ are often within an order of
> > magnitude of the argument strings themselves, and I don't think your patch
> > counted them as part of the argument/environemnt size (I was too lazy to
> > check the sources, but I'm pretty sure argv/env_start/end is just the
> > string space, not the pointers).
>
> This is precisely why I picked 25% as the maximum argument size ratio.
> In practice, that 25% can easily mean 50% or more. If people want to
> increase this, it can probably be tweaked somewhat, but switching it
> to, say, 50% probably isn't a good idea.
I think 50% would be still fine. And, ideally make that
MAX (RLIMIT_STACK / 2, 128KB) to avoid regressions for programs which assume
they can pass ARG_MAX args+env, even if they have say 192KB stack limit.
Jakub
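As a worked example of the MAX (RLIMIT_STACK / 2, 128KB) suggestion, here is a small sketch. The function name is invented and this is not the kernel code, just the arithmetic.

```c
/* Floor taken from the 128KB figure in the suggestion above. */
#define ARG_FLOOR (128UL * 1024)

/* Illustrative policy: half the stack limit, but never below 128KB. */
unsigned long arg_limit_with_floor(unsigned long stack_limit)
{
	unsigned long half = stack_limit / 2;

	return half > ARG_FLOOR ? half : ARG_FLOOR;
}
```

With a 192KB stack limit this gives 128KB for args+env, leaving 64KB for everything else on the stack. With the common 8MB limit it gives 4MB.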
On Fri, Feb 29, 2008 at 7:39 PM, Linus Torvalds
<[email protected]> wrote:
>
>
> On Fri, 29 Feb 2008, Michael Kerrisk wrote:
> >
> > My reading of POSIX.1 (and POSIX doesn't seem very explicit on this point), is
> > that the limits on argv+environ and on stack are decoupled, since POSIX
> > specifies RLIMIT_STACK and sysconf(_SC_ARG_MAX) and doesn't specify any
> > relationship between the two.
>
> I agree. And clearly there _are_ relationships and always have been, but
> equally clearly they simply haven't been a big issue in practice, and
> nobody really cares.
Do we know that for sure?
> Usually, _SC_ARG_MAX is just so much smaller than RLIMIT_STACK that it
> makes no possible difference. Which I would actually argue we should just
> continue with: just keep _SC_ARG_MAX a smallish, irrelevant constant.
>
> We still have to have the compile-time ARG_MAX constant (as in *real*
> constant - a #define) anyway, for traditional programs, and you might as
> well make sysconf(_SC_ARG_MAX) always just match ARG_MAX.
>
> It's not like there is likely a single user of _SC_ARG_MAX that cares.
In my initial reply, I pointed out one example where users *may* care:
NPTL uses RLIMIT_STACK to determine the size of per-thread stacks. It
is conceivable that users might want to set RLIMIT_STACK < 512k, and
that would have the effect of lowering the amount of space for
argv+environ below what the kernel has historically guaranteed. That's
an ABI change, though it's unclear whether it would impact anyone in
practice.
On Fri, 29 Feb 2008, Jakub Jelinek wrote:
> On Fri, Feb 29, 2008 at 11:01:38AM -0800, Ollie Wild wrote:
> >
> > This is precisely why I picked 25% as the maximum argument size ratio.
> > In practice, that 25% can easily mean 50% or more. If people want to
> > increase this, it can probably be tweaked somewhat, but switching it
> > to, say, 50% probably isn't a good idea.
>
> I think 50% would be still fine. And, ideally make that
> MAX (RLIMIT_STACK / 2, 128KB) to avoid regressions for programs which assume
> they can pass ARG_MAX args+env, even if they have say 192KB stack limit.
It would certainly be worth at least testing that as an approach.
Another thing we could decide to do is to just check the size of the stack
that is left at the end of all the stack setup code, and just say "if it's
less than X bytes, just return ENOMEM rather than set up a process with a
really unusably small stack".
Linus
On Fri, Feb 29, 2008 at 11:50 AM, Linus Torvalds
<[email protected]> wrote:
>
> Another thing we could decide to do is to just check the size of the stack
> that is left at the end of all the stack setup code, and just say "if it's
> less than X bytes, just return ENOMEM rather than set up a process with a
> really unusably small stack".
What would be a reasonable value, though? Whereas argument space is
static and known at process execution time, required stack space is
program dependent. If the program is going to crash, I'd rather it do
so at exec time. Otherwise, we end up with corrupted files, partially
committed database transactions, and so forth. I'd rather err on
the side of small argument space.
In the common situation, a 25% argument allocation is vastly larger
than the pre-2.6.23 limit. We're really only talking about cases
where the limits have been set to unusually small values. I'd be
interested in hearing from people that do this in practice. Why do
they do this? What are their expectations?
Ollie
On Fri, 29 Feb 2008, Michael Kerrisk wrote:
> On Fri, Feb 29, 2008 at 7:39 PM, Linus Torvalds
> <[email protected]> wrote:
>
> > I agree. And clearly there _are_ relationships and always have been, but
> > equally clearly they simply haven't been a big issue in practice, and
> > nobody really cares.
>
> Do we know that for sure?
We *do* know for sure that the relationship has always been there. At
least in Linux, and I bet in 99% of all other Unixes too. The arguments
simply have traditionally been counted as part of the stack size.
Or did you mean the latter part?
The fact is, we *also* know for sure that anybody that depends on
_SC_ARG_MAX being exact has always - and will continue to be - broken.
Again, because of not only older kernels but also because even with the
patch in question, we don't count argument sizes exactly.
> In my initial reply, I pointed out one example where users *may* care:
> NPTL uses RLIMIT_STACK to determine the size of per-thread stacks. It
> is conceivable that users might want to set RLIMIT_STACK < 512k, and
> that would have the effect of lowering the amount of space for
> argv+eviron below what the kernel has historically guaranteed. That's
> an ABI change, though it's unclear whether it would impact anyone in
> practice.
I do agree that we should at least make the "MAX(stacksize/4, 128k)"
change for backwards compatibility. That is actually a potential
regression, but it has nothing to do with a new RLIMIT_ARG_MAX, because
quite frankly, it's a regression *regardless* of whether we'd expose a new
rlimit or not!
And one of the reasons I'm so down on new resource limits is that nobody
then has the code to actually update them. You won't see it in "ulimit -a"
until you have a newly compiled bash that cares etc etc, so as far as I'm
concerned, I see that RLIMIT_ARG_MAX as nothing but a pain. It actually
makes the code more complex, makes it work less like we and others have
*always* worked, and doesn't even help users - quite the reverse.
I don't like unnecessary complexity. If the RLIMIT_STACK/4 check is truly
so troublesome, let's just *remove* it, rather than add more crap and
complexity on top of it!
Problem solved.
Linus
On Fri, Feb 29, 2008 at 9:07 PM, Linus Torvalds
<[email protected]> wrote:
>
>
> On Fri, 29 Feb 2008, Michael Kerrisk wrote:
>
> > On Fri, Feb 29, 2008 at 7:39 PM, Linus Torvalds
> > <[email protected]> wrote:
> >
>
> > > I agree. And clearly there _are_ relationships and always have been, but
> > > equally clearly they simply haven't been a big issue in practice, and
> > > nobody really cares.
> >
> > Do we know that for sure?
>
> We *do* know for sure that the relationship has always been there. At
> least in Linux, and I bet in 99% of all other Unixes too. The arguments
> simply have traditionally been counted as part of the stack size.
>
> Or did you mean the latter part?
I meant: do we know for sure that no one really cares?
> The fact is, we *also* know for sure that anybody that depends on
> _SC_ARG_MAX being exact has always - and will continue to be - broken.
> Again, because of not only older kernels but also because even with the
> patch in question, we don't count argument sizes exactly.
>
>
> > In my initial reply, I pointed out one example where users *may* care:
> > NPTL uses RLIMIT_STACK to determine the size of per-thread stacks. It
> > is conceivable that users might want to set RLIMIT_STACK < 512k, and
> > that would have the effect of lowering the amount of space for
> > argv+eviron below what the kernel has historically guaranteed. That's
> > an ABI change, though it's unclear whether it would impact anyone in
> > practice.
>
> I do agree that we should at least make the "MAX(stacksize/4, 128k)"
> change for backwards compatibility.
Good -- because that's probably the most important point, IMO.
> That is actually a potential
> regression, but it has nothing to do with a new _SC_ARG_SIZE, because
> quite frankly, it's a regression *regardless* of whether we'd expose a new
> rlimit or not!
Agreed.
The new rlimit is primarily for the (supposed) applications that care
about knowing (at least approximately) what _SC_ARG_MAX is. I raised
the initial bug report against glibc because applications can no
longer (post 2.6.23) do this, but I haven't done the investigation
about how many applications actually care.
Cheers,
Michael
On Fri, 29 Feb 2008, Michael Kerrisk wrote:
> On Fri, Feb 29, 2008 at 9:07 PM, Linus Torvalds
> >
> > > > I agree. And clearly there _are_ relationships and always have been, but
> > > > equally clearly they simply haven't been a big issue in practice, and
> > > > nobody really cares.
> > >
> > > Do we know that for sure?
> >
> > We *do* know for sure that the relationship has always been there. At
> > least in Linux, and I bet in 99% of all other Unixes too. The arguments
> > simply have traditionally been counted as part of the stack size.
> >
> > Or did you mean the latter part?
>
> I meant: do we know for sure that no one really cares?
Well, what I have tried to argue is that even if they care, the patch
won't actually really help. It just moves existing behaviour around a bit,
but leaves all the fundamental issues totally untouched in that it may
count the strings, but not the pointers themselves etc.
More importantly, anybody who would depend on any new behaviour would
still be screwed on all other platforms - including older Linux ones - in
that they'd depend on some very specific behaviour that simply isn't going
to be there in other cases.
So yeah, I can see that people could care, but they *shouldn't*.
> The new rlimit is primarily for the (supposed) applications that care
> about knowing (at least approximately) what _SC_ARG_MAX is. I raised
> the initial bug report against glibc because applications can no
> longer (post 2.6.23) do this, but I haven't done the investigation
> about how many applications actually care.
Very few reasonably can. The thing is, in order to care, you have to count
things like your own environment space etc, and you have to know that
there is something you can even *do* about it if the counts go wrong.
So in practice, I think it's just about things like "xargs" and very few
actual applications.
I did try to do a google codesearch on "sysconf(_SC_ARG_MAX)" and it
exists, but there wasn't a whole lot. The most logical one (and the one
that didn't prefer the ARG_MAX #define) was the built-in xargs in ksh.
But I really didn't look very hard, just a few screenfuls of codesearch.
Realistically, "xargs" really is the main user. *Most* users of execve()
simply either want all their arguments or none. It's not that common that
somebody says "ok, I have a ton of arguments, but if you limit them I'll
just use a fraction of them".
Linus
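To make the "strings versus pointers" point concrete, here is a sketch of how a program might measure what an argv or envp vector costs. The struct and function names are made up for illustration, and this ignores whatever else the binary loader puts on the stack.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative result type: the two components discussed above. */
struct arg_footprint {
	size_t string_bytes;	/* strings including their NUL terminators */
	size_t pointer_bytes;	/* pointer array including the NULL entry */
};

/* Measure a NULL-terminated vector such as argv or environ. */
struct arg_footprint measure_vector(char *const vec[])
{
	struct arg_footprint fp = { 0, 0 };
	size_t i;

	for (i = 0; vec[i] != NULL; i++)
		fp.string_bytes += strlen(vec[i]) + 1;
	fp.pointer_bytes = (i + 1) * sizeof(char *);
	return fp;
}
```

With many short arguments on a 64-bit machine, pointer_bytes can exceed string_bytes, so a limit that counts only the string space understates the real stack usage.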
On Fri, 29 Feb 2008, Linus Torvalds wrote:
>
> I do agree that we should at least make the "MAX(stacksize/4, 128k)"
> change for backwards compatibility.
How about something like this?
The alternative is to just remove that size check entirely, and depend on
get_user_pages() doing the stack limit check (among all the *other* checks
it does when it does the acct_stack_growth() thing).
I'd almost prefer that simpler approach, but I don't have any really
strong preferences. Anybody?
Linus
---
fs/exec.c | 10 +++++++++-
1 files changed, 9 insertions(+), 1 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index a44b142..e91f9cb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -173,8 +173,15 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		return NULL;

 	if (write) {
-		struct rlimit *rlim = current->signal->rlim;
 		unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;
+		struct rlimit *rlim;
+
+		/*
+		 * We've historically supported up to 32 pages of argument
+		 * strings even with small stacks
+		 */
+		if (size <= 32*PAGE_SIZE)
+			return page;

 		/*
 		 * Limit to 1/4-th the stack size for the argv+env strings.
@@ -183,6 +190,7 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		 *  - the program will have a reasonable amount of stack left
 		 *    to work from.
 		 */
+		rlim = current->signal->rlim;
 		if (size > rlim[RLIMIT_STACK].rlim_cur / 4) {
 			put_page(page);
 			return NULL;
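As a userspace illustration only (not kernel code), the two tests in the
patch above can be modelled like this. The names arg_size_allowed(),
effective_arg_limit() and the MODEL_* constants are hypothetical, and the
4 KiB page size is an assumption; the kernel uses the architecture's
PAGE_SIZE, so "32 pages" equals 128k only on 4 KiB-page systems.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE		4096UL	/* assumption: 4 KiB pages */
#define MODEL_MIN_ARG_PAGES	32UL	/* the historical 32-page allowance */

/*
 * Same order of tests as the patch: anything up to 32 pages is always
 * accepted, larger sizes must fit in a quarter of the soft stack limit.
 */
bool arg_size_allowed(unsigned long size, unsigned long stack_rlim_cur)
{
	if (size <= MODEL_MIN_ARG_PAGES * MODEL_PAGE_SIZE)
		return true;
	return size <= stack_rlim_cur / 4;
}

/* The resulting ceiling, i.e. MAX(stacksize/4, 32 pages). */
unsigned long effective_arg_limit(unsigned long stack_rlim_cur)
{
	unsigned long floor = MODEL_MIN_ARG_PAGES * MODEL_PAGE_SIZE;
	unsigned long quarter = stack_rlim_cur / 4;
	unsigned long limit = quarter > floor ? quarter : floor;

	/* The ceiling itself must pass the check it summarises. */
	assert(arg_size_allowed(limit, stack_rlim_cur));
	return limit;
}
```

With an 8 MiB stack limit the model gives a 2 MiB ceiling; with a 64 KiB
stack limit the 32-page floor applies instead.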
Michael Kerrisk wrote, on 29 Feb 2008:
>
> My reading of POSIX.1 (and POSIX doesn't seem very explicit on this
> point) is that the limits on argv+environ and on stack are decoupled,
> since POSIX specifies RLIMIT_STACK and sysconf(_SC_ARG_MAX) and doesn't
> specify any relationship between the two.
POSIX doesn't specify any relationship between them because (as far
as POSIX is concerned) they are limits on entirely different things.
sysconf(_SC_ARG_MAX) is a limit on how much arg+env a process can
_pass_ to the exec*() functions. The RLIMIT_* limits are limits on
the process itself.
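To make the distinction concrete, the two values are obtained through
separate interfaces. The sketch below simply reads both; the struct
exec_limits type and the read_exec_limits() helper are hypothetical names
for this example, while sysconf() and getrlimit() are the standard calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <unistd.h>

struct exec_limits {
	long arg_max;		/* sysconf(_SC_ARG_MAX): arg+env passable to exec*() */
	rlim_t stack_cur;	/* soft RLIMIT_STACK of the calling process */
};

/* Fill *out; returns 0 on success, -1 if either query fails. */
int read_exec_limits(struct exec_limits *out)
{
	struct rlimit rl;

	assert(out != NULL);
	out->arg_max = sysconf(_SC_ARG_MAX);
	if (out->arg_max == -1)
		return -1;	/* error, or no definite limit reported */
	if (getrlimit(RLIMIT_STACK, &rl) != 0)
		return -1;
	out->stack_cur = rl.rlim_cur;
	return 0;
}
```

Nothing in either interface ties one number to the other; any coupling
between them comes from the implementation.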
--
Geoff Clare
The Open Group, Thames Tower, Station Road, Reading, RG1 1LX, England
Linus Torvalds wrote:
> On Fri, 29 Feb 2008, Linus Torvalds wrote:
>> I do agree that we should at least make the "MAX(stacksize/4, 128k)"
>> change for backwards compatibility.
>
> How about something like this?
This is perfect. As the original submitter of the bug my primary
interest is in having the regression fixed.
> The alternative is to just remove that size check entirely, and depend on
> get_user_pages() doing the stack limit check (among all the *other* checks
> it does when it does the acct_stack_growth() thing).
>
> I'd almost prefer that simpler approach, but I don't have any really
> strong preferences. Anybody?
>
> Linus
>
> ---
> fs/exec.c | 10 +++++++++-
> 1 files changed, 9 insertions(+), 1 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index a44b142..e91f9cb 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -173,8 +173,15 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
>  		return NULL;
>
>  	if (write) {
> -		struct rlimit *rlim = current->signal->rlim;
>  		unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;
> +		struct rlimit *rlim;
> +
> +		/*
> +		 * We've historically supported up to 32 pages of argument
> +		 * strings even with small stacks
> +		 */
> +		if (size <= 32*PAGE_SIZE)
> +			return page;
Could you use ARG_MAX as defined in include/linux/limits.h?
>
>  		/*
>  		 * Limit to 1/4-th the stack size for the argv+env strings.
> @@ -183,6 +190,7 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
>  		 *  - the program will have a reasonable amount of stack left
>  		 *    to work from.
>  		 */
> +		rlim = current->signal->rlim;
>  		if (size > rlim[RLIMIT_STACK].rlim_cur / 4) {
>  			put_page(page);
>  			return NULL;
Cheers,
Carlos.
--
Carlos O'Donell
CodeSourcery
(650) 331-3385 x716
On Fri 2008-02-29 09:29:19, Linus Torvalds wrote:
>
>
> On Fri, 29 Feb 2008, Peter Zijlstra wrote:
> >
> > > ... and what's the point? We've never had it before, nobody has ever cared,
> > > and the whole notion is just stupid. Why would we want to limit it? The
> > > only thing that the kernel *cares* about is the stack size - any other
> > > size limits are always going to be arbitrary.
> >
> > Well, don't think of limiting it, but querying the limit.
> >
> > Programs like xargs would need to know how much to stuff into argv
> > before starting a new invocation.
>
> But they already can't really do that. More importantly, isn't it better
> to just use the whole stack size then (or just return "stack size / 4" or
> whatever)?
Using the whole stack smells like a security problem to me.
...pass so many parameters that passwd dies of stack shortage. Make
sure passwd grabbed some system-wide lock before dying.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html