2010-11-11 04:36:02

by Mandeep Singh Baines

[permalink] [raw]
Subject: [PATCH] oom: create a resource limit for oom_adj

For ChromiumOS, we'd like to be able to oom_adj a process up/down
as its leaves/enters the foreground. Currently, it is not possible
to oom_adj down without CAP_SYS_RESOURCE. This patch creates a new
resource limit, RLIMIT_OOMADJ, which is works in a similar fashion
to RLIMIT_NICE. This allows a process's oom_adj to be lowered
without CAP_SYS_RESOURCE as long as the new value is greater
than the resource limit.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their
oom_adj, a setuid or daemon implementation would be complex. The
alternatives also have much higher overhead.

Signed-off-by: Mandeep Singh Baines <[email protected]>
---
fs/proc/base.c | 12 ++++++++++--
include/asm-generic/resource.h | 5 ++++-
2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..4384013 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -462,6 +462,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
[RLIMIT_NICE] = {"Max nice priority", NULL},
[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
+ [RLIMIT_OOMADJ] = {"Max OOM adjust", NULL},
};

/* Display limits for a process */
@@ -1057,8 +1058,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
}

if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
- err = -EACCES;
- goto err_sighand;
+ /* convert oom_adj [15,-17] to rlimit style value [1,33] */
+ long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
+
+ if (oom_rlim > task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {
+ unlock_task_sighand(task, &flags);
+ put_task_struct(task);
+ err = -EACCES;
+ goto err_sighand;
+ }
}

if (oom_adjust != task->signal->oom_adj) {
diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index 587566f..a8640a4 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -45,7 +45,9 @@
0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
#define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
-#define RLIM_NLIMITS 16
+#define RLIMIT_OOMADJ 16 /* max oom_adj allowed to lower to
+ 0-32 for oom level 15 .. -17 */
+#define RLIM_NLIMITS 17

/*
* SuS says limits have to be unsigned.
@@ -86,6 +88,7 @@
[RLIMIT_MSGQUEUE] = { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
[RLIMIT_NICE] = { 0, 0 }, \
[RLIMIT_RTPRIO] = { 0, 0 }, \
+ [RLIMIT_OOMADJ] = { 0, 0 }, \
[RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
}

--
1.7.3.1


2010-11-11 05:19:04

by Figo.zhang

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj


> if (oom_adjust< task->signal->oom_adj&& !capable(CAP_SYS_RESOURCE)) {
> - err = -EACCES;
> - goto err_sighand;
> + /* convert oom_adj [15,-17] to rlimit style value [1,33] */
> + long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
> +
> + if (oom_rlim> task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {
> + unlock_task_sighand(task,&flags);
> + put_task_struct(task);
> + err = -EACCES;
> + goto err_sighand;
> + }
> }

=> Label "err_sighand" have do that, why are you do that on here?

+ err = -EACCES;
+ goto err_sighand;
+ }
}

2010-11-11 07:36:01

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj

On Wed, 10 Nov 2010, Mandeep Singh Baines wrote:

> For ChromiumOS, we'd like to be able to oom_adj a process up/down
> as its leaves/enters the foreground. Currently, it is not possible
> to oom_adj down without CAP_SYS_RESOURCE. This patch creates a new
> resource limit, RLIMIT_OOMADJ, which is works in a similar fashion
> to RLIMIT_NICE. This allows a process's oom_adj to be lowered
> without CAP_SYS_RESOURCE as long as the new value is greater
> than the resource limit.
>

First of all, oom_adj is deprecated and scheduled for removal in a couple
of years (see Documentation/feature-removal-schedule.txt) so any work in
this area should be targeting oom_score_adj instead.

What is the anticipated use case for this? We know that you want to lower
oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an
app moves from foreground to background? I assume it's something like
having an oom_adj of 0 in the background and +15 in the foreground. If
so, does /proc/sys/vm/oom_kill_allocating_task get you most of what you're
looking for?

I'm wondering if we can avoid yet another resource limit for something
like this.

> Alternative considered:
>
> * a setuid binary
> * a daemon with CAP_SYS_RESOURCE
>
> Since you don't wan't all processes to be able to reduce their
> oom_adj, a setuid or daemon implementation would be complex. The
> alternatives also have much higher overhead.
>

What do you anticipate will be writing to oom_score_adj with this patch,
the app itself?

> Signed-off-by: Mandeep Singh Baines <[email protected]>
> ---
> fs/proc/base.c | 12 ++++++++++--
> include/asm-generic/resource.h | 5 ++++-
> 2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f3d02ca..4384013 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -462,6 +462,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
> [RLIMIT_NICE] = {"Max nice priority", NULL},
> [RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
> [RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> + [RLIMIT_OOMADJ] = {"Max OOM adjust", NULL},

s/Max/Min, right?

> };
>
> /* Display limits for a process */
> @@ -1057,8 +1058,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
> }
>
> if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> - err = -EACCES;
> - goto err_sighand;
> + /* convert oom_adj [15,-17] to rlimit style value [1,33] */
> + long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
> +

Ouch, that's a rather unfortunate mapping.

> + if (oom_rlim > task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {
> + unlock_task_sighand(task, &flags);
> + put_task_struct(task);
> + err = -EACCES;
> + goto err_sighand;

err_sighand has duplicate unlock_task_sighand() and put_task_struct();
since you're missing the task_unlock(task) here, just using goto
err_sighand would suffice.

> + }
> }
>
> if (oom_adjust != task->signal->oom_adj) {

2010-11-11 18:31:12

by Mandeep Singh Baines

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj

David Rientjes ([email protected]) wrote:
> On Wed, 10 Nov 2010, Mandeep Singh Baines wrote:
>
> > For ChromiumOS, we'd like to be able to oom_adj a process up/down
> > as its leaves/enters the foreground. Currently, it is not possible
> > to oom_adj down without CAP_SYS_RESOURCE. This patch creates a new
> > resource limit, RLIMIT_OOMADJ, which is works in a similar fashion
> > to RLIMIT_NICE. This allows a process's oom_adj to be lowered
> > without CAP_SYS_RESOURCE as long as the new value is greater
> > than the resource limit.
> >
>
> First of all, oom_adj is deprecated and scheduled for removal in a couple
> of years (see Documentation/feature-removal-schedule.txt) so any work in
> this area should be targeting oom_score_adj instead.
>

Ah. Thanks for the pointer.

> What is the anticipated use case for this? We know that you want to lower
> oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an
> app moves from foreground to background? I assume it's something like

The focus here is the web browser's tabs. In our case, each is a process. If
OOM is going to kill a process, you'd rather it kill the tab you looked at
hours ago instead of the one you're looking at now. So you'd like to have a
policy where the LRU tab gets killed first. We'd like to use oom_score_adj
as the mechanism to implement an LRU policy like this.

> having an oom_adj of 0 in the background and +15 in the foreground. If
> so, does /proc/sys/vm/oom_kill_allocating_task get you most of what you're
> looking for?
>

As explained above, oom_kill_allocating_task won't give us what we want.

> I'm wondering if we can avoid yet another resource limit for something
> like this.
>
> > Alternative considered:
> >
> > * a setuid binary
> > * a daemon with CAP_SYS_RESOURCE
> >
> > Since you don't wan't all processes to be able to reduce their
> > oom_adj, a setuid or daemon implementation would be complex. The
> > alternatives also have much higher overhead.
> >
>
> What do you anticipate will be writing to oom_score_adj with this patch,
> the app itself?
>

A process in the browser session will do the adusting. We'd rather not give
it CAP_SYS_RESOURCE. It should only be allowed to change oom_score_adj up
and down within the bounds set by the administrator. Analagous to renice()
which we also do using a similar policy.

> > Signed-off-by: Mandeep Singh Baines <[email protected]>
> > ---
> > fs/proc/base.c | 12 ++++++++++--
> > include/asm-generic/resource.h | 5 ++++-
> > 2 files changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index f3d02ca..4384013 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -462,6 +462,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
> > [RLIMIT_NICE] = {"Max nice priority", NULL},
> > [RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
> > [RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> > + [RLIMIT_OOMADJ] = {"Max OOM adjust", NULL},
>
> s/Max/Min, right?
>

This is a MAX value because of how resource limits work. On the other hand,
it is really controlling the minimum oom_adj. So its a toss up for me.
More than happy to change if you prefer Min.

> > };
> >
> > /* Display limits for a process */
> > @@ -1057,8 +1058,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
> > }
> >
> > if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> > - err = -EACCES;
> > - goto err_sighand;
> > + /* convert oom_adj [15,-17] to rlimit style value [1,33] */
> > + long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
> > +
>
> Ouch, that's a rather unfortunate mapping.
>

Unfortunate but unavoidable. The resource limit code checks to see if the
new limit is greater than the limit. This code was based on the can_nice()
code in sched.c.

> > + if (oom_rlim > task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {
> > + unlock_task_sighand(task, &flags);
> > + put_task_struct(task);
> > + err = -EACCES;
> > + goto err_sighand;
>
> err_sighand has duplicate unlock_task_sighand() and put_task_struct();
> since you're missing the task_unlock(task) here, just using goto
> err_sighand would suffice.
>

D'oh. Forward port error. I should be more careful. Thanks for catching:)

> > + }
> > }
> >
> > if (oom_adjust != task->signal->oom_adj) {

Thank you for reviewing this patch.

Should I send an updated oom_score_adj patch?

Regards,
Mandeep

2010-11-11 20:57:43

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj

On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:

> > What is the anticipated use case for this? We know that you want to lower
> > oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an
> > app moves from foreground to background? I assume it's something like
>
> The focus here is the web browser's tabs. In our case, each is a process. If
> OOM is going to kill a process, you'd rather it kill the tab you looked at
> hours ago instead of the one you're looking at now. So you'd like to have a
> policy where the LRU tab gets killed first. We'd like to use oom_score_adj
> as the mechanism to implement an LRU policy like this.
>

Hmm, at first glance that seems potentially dangerous if the current tab
generates a burt of memory allocations and it ends up killing all other
tabs before finally targeting the culprit whereas currently the heuristic
should do a good job of finding this problematic tab and killing it
instantly.

Perhaps that can't happen and it probably doesn't even matter:
oom_score_adj allows users to determine which process to kill regardless
of the underlying reason.

> > What do you anticipate will be writing to oom_score_adj with this patch,
> > the app itself?
>
> A process in the browser session will do the adusting. We'd rather not give
> it CAP_SYS_RESOURCE. It should only be allowed to change oom_score_adj up
> and down within the bounds set by the administrator. Analagous to renice()
> which we also do using a similar policy.
>

So as more and more tabs get used, the least recently used tab gets its
oom_score_adj raised higher and higher until it is reused itself and then
it gets reset back to 0 for the current tab?

Is there a reason you don't want to give the underlying browser session
process CAP_SYS_RESOURCE? Will it not be enforcing resource limits to
ensure tabs don't deplete all memory when certain sites are opened? Are
you concerned that it may deplete all memory itself (for which case you
could raise its own oom_score_adj, which is a proportion of available
memory so you can define where that point of depletiong is)?

2010-11-11 22:25:26

by Mandeep Singh Baines

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj

David Rientjes ([email protected]) wrote:
> On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:
>
> > > What is the anticipated use case for this? We know that you want to lower
> > > oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an
> > > app moves from foreground to background? I assume it's something like
> >
> > The focus here is the web browser's tabs. In our case, each is a process. If
> > OOM is going to kill a process, you'd rather it kill the tab you looked at
> > hours ago instead of the one you're looking at now. So you'd like to have a
> > policy where the LRU tab gets killed first. We'd like to use oom_score_adj
> > as the mechanism to implement an LRU policy like this.
> >
>
> Hmm, at first glance that seems potentially dangerous if the current tab
> generates a burt of memory allocations and it ends up killing all other
> tabs before finally targeting the culprit whereas currently the heuristic
> should do a good job of finding this problematic tab and killing it
> instantly.
>

If you're watching a movie, video chatting, playing a game, etc. What
would you rather have killed: the current tab you are interacting with or
some tab you opened a while back and are no longer interacting with.

> Perhaps that can't happen and it probably doesn't even matter:
> oom_score_adj allows users to determine which process to kill regardless
> of the underlying reason.
>
> > > What do you anticipate will be writing to oom_score_adj with this patch,
> > > the app itself?
> >
> > A process in the browser session will do the adusting. We'd rather not give
> > it CAP_SYS_RESOURCE. It should only be allowed to change oom_score_adj up
> > and down within the bounds set by the administrator. Analagous to renice()
> > which we also do using a similar policy.
> >
>
> So as more and more tabs get used, the least recently used tab gets its
> oom_score_adj raised higher and higher until it is reused itself and then
> it gets reset back to 0 for the current tab?
>

Exactly.

> Is there a reason you don't want to give the underlying browser session
> process CAP_SYS_RESOURCE? Will it not be enforcing resource limits to

Security. We want to use the least-privilege possible. We really want to
avoid giving special privileges to the browser. You shouldn't need
administrative privileges to run the browser. We'd like for oom_score_adj
to work the same as nice. An unprivileged user can nice up and down as
long as the new setting is within the administratively configured
resource limit: ulimit -e.

> ensure tabs don't deplete all memory when certain sites are opened? Are
> you concerned that it may deplete all memory itself (for which case you
> could raise its own oom_score_adj, which is a proportion of available
> memory so you can define where that point of depletiong is)?

2010-11-11 23:15:42

by Bodo Eggert

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj

David Rientjes <[email protected]> wrote:
> On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:

>> > What is the anticipated use case for this? We know that you want to lower
>> > oom_adj without CAP_SYS_RESOURCE, but what's the expected behavior when an
>> > app moves from foreground to background? I assume it's something like
>>
>> The focus here is the web browser's tabs. In our case, each is a process. If
>> OOM is going to kill a process, you'd rather it kill the tab you looked at
>> hours ago instead of the one you're looking at now. So you'd like to have a
>> policy where the LRU tab gets killed first. We'd like to use oom_score_adj
>> as the mechanism to implement an LRU policy like this.
>>
>
> Hmm, at first glance that seems potentially dangerous if the current tab
> generates a burt of memory allocations and it ends up killing all other
> tabs before finally targeting the culprit whereas currently the heuristic
> should do a good job of finding this problematic tab and killing it
> instantly.

The original oom_adj design would e.g. adjust all background tabs to seem
twice as bad as the current tab, so a ever-growing current tab would
only be able to get one or two tabs killed in the worst case:

System is near OOM while the current tab starts using a normal amount of
mem and grows beyond limits. After it easily killed the first bg tab by
allocating one byte, it needs to grow to twice the normal tab memsize to
start the OOM killer again. By then, it's score will be equal to normal
background tabs (half size, double score), and killing the second tab will
be luck. Killing the third tab should be impossible.

(If you adjust the bg tabs to be four times as killable, you'll get what
you asked for.)


I don't know the current oom_score_adj, it should be able to do something
similar? Or should there be a oom_score_mul?

>> > What do you anticipate will be writing to oom_score_adj with this patch,
>> > the app itself?
>>
>> A process in the browser session will do the adusting. We'd rather not give
>> it CAP_SYS_RESOURCE. It should only be allowed to change oom_score_adj up
>> and down within the bounds set by the administrator. Analagous to renice()
>> which we also do using a similar policy.
>>
>
> So as more and more tabs get used, the least recently used tab gets its
> oom_score_adj raised higher and higher until it is reused itself and then
> it gets reset back to 0 for the current tab?

As far as I understand, a background tab should get a higher score if it
rests untouched for some time, and if it gets used again, it will jump to
neutral oom_adj.

> Is there a reason you don't want to give the underlying browser session
> process CAP_SYS_RESOURCE?

I should not have control over a CAP_SYS_RESOURCE process while being a user,
but I should be able to tell the system which process to kill. When I designed
the oom_adj, I did not think about one-process-per-tab, so I did not suggest
a soft/hard limit.

I think the concept of using an rlimit is the right thing to do, I did not yet
look/think about the mapping. Glancing at oom_score_adj does suggest a different
range ...

2010-11-11 23:19:16

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj

On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:

> > Hmm, at first glance that seems potentially dangerous if the current tab
> > generates a burt of memory allocations and it ends up killing all other
> > tabs before finally targeting the culprit whereas currently the heuristic
> > should do a good job of finding this problematic tab and killing it
> > instantly.
> >
>
> If you're watching a movie, video chatting, playing a game, etc. What
> would you rather have killed: the current tab you are interacting with or
> some tab you opened a while back and are no longer interacting with.
>

Well, it's a tangential point, but I'd personally prefer that my existing
tabs that I've decided to leave open are guaranteed to remain open
regardless of where I'm browsing next (they could hold valuable data that
I can't easily get back) and avoid having all of them sacrificed out from
under me for the newly opened tab. I can always go back and close those
tabs for more memory if I know I don't need them anymore and then retry
the failed allocation.

> > So as more and more tabs get used, the least recently used tab gets its
> > oom_score_adj raised higher and higher until it is reused itself and then
> > it gets reset back to 0 for the current tab?
> >
>
> Exactly.
>

We don't necessarily want arbitrary tasks to be able to decrease their
oom_score_adj back to 0 if a CAP_SYS_RESOURCE thread has elevated it,
that's part of the reason for the restriction (in addition to decreasing
your own oom_score_adj all the way to OOM_SCORE_ADJ_MIN).

Would it suffice to allow a task to decrease its oom_score_adj back to the
highest value that a CAP_SYS_RESOURCE thread set it or its inherited value
at fork? Assuming the thread that has forked it has oom_score_adj of 0,
each tab could decrease it back from 0 upon activation unless a
CAP_SYS_RESOURCE thread elevated it to something higher.

To do this, we'd need to save the highest oom_score_adj set by a
CAP_SYS_RESOURCE in struct signal_struct.

2010-11-11 23:21:57

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj

On Fri, 12 Nov 2010, Bodo Eggert wrote:

> The original oom_adj design would e.g. adjust all background tabs to seem
> twice as bad as the current tab, so a ever-growing current tab would
> only be able to get one or two tabs killed in the worst case:
>
> System is near OOM while the current tab starts using a normal amount of
> mem and grows beyond limits. After it easily killed the first bg tab by
> allocating one byte, it needs to grow to twice the normal tab memsize to
> start the OOM killer again. By then, it's score will be equal to normal
> background tabs (half size, double score), and killing the second tab will
> be luck. Killing the third tab should be impossible.
>
> (If you adjust the bg tabs to be four times as killable, you'll get what
> you asked for.)
>
>
> I don't know the current oom_score_adj, it should be able to do something
> similar? Or should there be a oom_score_mul?
>

oom_score_adj is done on a linear scale instead of exponential so instead
of increasing oom_adj by 1 for each hop to the current tab, you would need
to double it.

2010-11-11 23:56:36

by Mandeep Singh Baines

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj

David Rientjes ([email protected]) wrote:
> On Thu, 11 Nov 2010, Mandeep Singh Baines wrote:
>
> > > Hmm, at first glance that seems potentially dangerous if the current tab
> > > generates a burt of memory allocations and it ends up killing all other
> > > tabs before finally targeting the culprit whereas currently the heuristic
> > > should do a good job of finding this problematic tab and killing it
> > > instantly.
> > >
> >
> > If you're watching a movie, video chatting, playing a game, etc. What
> > would you rather have killed: the current tab you are interacting with or
> > some tab you opened a while back and are no longer interacting with.
> >
>
> Well, it's a tangential point, but I'd personally prefer that my existing
> tabs that I've decided to leave open are guaranteed to remain open
> regardless of where I'm browsing next (they could hold valuable data that
> I can't easily get back) and avoid having all of them sacrificed out from
> under me for the newly opened tab. I can always go back and close those
> tabs for more memory if I know I don't need them anymore and then retry
> the failed allocation.
>
> > > So as more and more tabs get used, the least recently used tab gets its
> > > oom_score_adj raised higher and higher until it is reused itself and then
> > > it gets reset back to 0 for the current tab?
> > >
> >
> > Exactly.
> >
>
> We don't necessarily want arbitrary tasks to be able to decrease their
> oom_score_adj back to 0 if a CAP_SYS_RESOURCE thread has elevated it,
> that's part of the reason for the restriction (in addition to decreasing
> your own oom_score_adj all the way to OOM_SCORE_ADJ_MIN).
>
> Would it suffice to allow a task to decrease its oom_score_adj back to the
> highest value that a CAP_SYS_RESOURCE thread set it or its inherited value
> at fork? Assuming the thread that has forked it has oom_score_adj of 0,
> each tab could decrease it back from 0 upon activation unless a
> CAP_SYS_RESOURCE thread elevated it to something higher.
>
> To do this, we'd need to save the highest oom_score_adj set by a
> CAP_SYS_RESOURCE in struct signal_struct.

Sounds good to me. I'll start working on this patch.

Thanks,
Mandeep

2010-11-13 00:47:13

by Mandeep Baines

[permalink] [raw]
Subject: [PATCH] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down

We'd like to be able to oom_score_adj a process up/down as its
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease
its oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set
it or its inherited value at fork. Assuming the thread that has forked
it has oom_score_adj of 0, each tab could decrease it back from 0 upon
activation unless a CAP_SYS_RESOURCE thread elevated it to something
higher.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their
oom_adj, a setuid or daemon implementation would be complex. The
alternatives also have much higher overhead.

This patch updated based on feedback from
David Rientjes <[email protected]>.

Change-Id: If8f52363fd6c156e1730f43148aee987260e9c72
Signed-off-by: Mandeep Singh Baines <[email protected]>
---
fs/proc/base.c | 4 +++-
include/linux/sched.h | 2 ++
2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..e617413 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1164,7 +1164,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
goto err_task_lock;
}

- if (oom_score_adj < task->signal->oom_score_adj &&
+ if (oom_score_adj < task->signal->oom_score_adj_min &&
!capable(CAP_SYS_RESOURCE)) {
err = -EACCES;
goto err_sighand;
@@ -1177,6 +1177,8 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
atomic_dec(&task->mm->oom_disable_count);
}
task->signal->oom_score_adj = oom_score_adj;
+ if (capable(CAP_SYS_RESOURCE))
+ task->signal->oom_score_adj_min = oom_score_adj;
/*
* Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
* always attainable.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f53cdf2..2a71ee0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -626,6 +626,8 @@ struct signal_struct {

int oom_adj; /* OOM kill score adjustment (bit shift) */
int oom_score_adj; /* OOM kill score adjustment */
+ int oom_score_adj_min; /* OOM kill score adjustment minimum value.
+ * Only settable by CAP_SYS_RESOURCE. */

struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
--
1.7.3.1

2010-11-14 01:37:45

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down

On Fri, 12 Nov 2010, Mandeep Singh Baines wrote:

> We'd like to be able to oom_score_adj a process up/down as its
> enters/leaves the foreground. Currently, it is not possible to oom_adj
> down without CAP_SYS_RESOURCE. This patch allows a task to decrease
> its oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set
> it or its inherited value at fork. Assuming the thread that has forked
> it has oom_score_adj of 0, each tab could decrease it back from 0 upon
> activation unless a CAP_SYS_RESOURCE thread elevated it to something
> higher.
>

oom_score_adj_min doesn't appear to be inherited at fork in your patch.

> Alternative considered:
>
> * a setuid binary
> * a daemon with CAP_SYS_RESOURCE
>
> Since you don't wan't all processes to be able to reduce their
> oom_adj, a setuid or daemon implementation would be complex. The
> alternatives also have much higher overhead.
>

This behavior should be documented in Documentation/filesystems/proc.txt.

> This patch updated based on feedback from
> David Rientjes <[email protected]>.
>
> Change-Id: If8f52363fd6c156e1730f43148aee987260e9c72

I know what a Change-Id is , but nobody else here does :)

> Signed-off-by: Mandeep Singh Baines <[email protected]>
> ---
> fs/proc/base.c | 4 +++-
> include/linux/sched.h | 2 ++
> 2 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f3d02ca..e617413 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1164,7 +1164,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
> goto err_task_lock;
> }
>
> - if (oom_score_adj < task->signal->oom_score_adj &&
> + if (oom_score_adj < task->signal->oom_score_adj_min &&
> !capable(CAP_SYS_RESOURCE)) {
> err = -EACCES;
> goto err_sighand;
> @@ -1177,6 +1177,8 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
> atomic_dec(&task->mm->oom_disable_count);
> }
> task->signal->oom_score_adj = oom_score_adj;
> + if (capable(CAP_SYS_RESOURCE))
> + task->signal->oom_score_adj_min = oom_score_adj;
> /*
> * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
> * always attainable.
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f53cdf2..2a71ee0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -626,6 +626,8 @@ struct signal_struct {
>
> int oom_adj; /* OOM kill score adjustment (bit shift) */
> int oom_score_adj; /* OOM kill score adjustment */
> + int oom_score_adj_min; /* OOM kill score adjustment minimum value.
> + * Only settable by CAP_SYS_RESOURCE. */
>
> struct mutex cred_guard_mutex; /* guard against foreign influences on
> * credential calculations

2010-11-14 05:07:29

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj

Hi Mandeep,

> For ChromiumOS, we'd like to be able to oom_adj a process up/down
> as its leaves/enters the foreground. Currently, it is not possible
> to oom_adj down without CAP_SYS_RESOURCE. This patch creates a new
> resource limit, RLIMIT_OOMADJ, which is works in a similar fashion
> to RLIMIT_NICE. This allows a process's oom_adj to be lowered
> without CAP_SYS_RESOURCE as long as the new value is greater
> than the resource limit.
>
> Alternative considered:
>
> * a setuid binary
> * a daemon with CAP_SYS_RESOURCE
>
> Since you don't wan't all processes to be able to reduce their
> oom_adj, a setuid or daemon implementation would be complex. The
> alternatives also have much higher overhead.
>
> Signed-off-by: Mandeep Singh Baines <[email protected]>
> ---
> fs/proc/base.c | 12 ++++++++++--
> include/asm-generic/resource.h | 5 ++++-
> 2 files changed, 14 insertions(+), 3 deletions(-)

This concept sound useful for embedeed. but I dislike this interface
a bit. Why don't you create /proc/{pid}/oom_adj_lower_bound or similar?
It is more straight forward because oom_adj are already using /proc.

I also think 15..-17 to 0-32 convertion is a bit user unfriendly.


>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f3d02ca..4384013 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -462,6 +462,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
> [RLIMIT_NICE] = {"Max nice priority", NULL},
> [RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
> [RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
> + [RLIMIT_OOMADJ] = {"Max OOM adjust", NULL},
> };
>
> /* Display limits for a process */
> @@ -1057,8 +1058,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
> }
>
> if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> - err = -EACCES;
> - goto err_sighand;
> + /* convert oom_adj [15,-17] to rlimit style value [1,33] */
> + long oom_rlim = OOM_ADJUST_MAX + 1 - oom_adjust;
> +
> + if (oom_rlim > task->signal->rlim[RLIMIT_OOMADJ].rlim_cur) {

two points.

1) task->signal->rlim[RLIMIT_OOMADJ].rlim_cur is incorrect.
please use task_rlimit().

2) If process has CAP_SYS_RESOURCE, we should ignore RLIMIT_OOMADJ for
backword compatibility. CAP_NICE do so. (see below)

------------------------------------------------------------------
int can_nice(const struct task_struct *p, const int nice)
{
/* convert nice value [19,-20] to rlimit style value [1,40] */
int nice_rlim = 20 - nice;

return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
capable(CAP_SYS_NICE));
}
------------------------------------------------------------------



> + unlock_task_sighand(task, &flags);
> + put_task_struct(task);
> + err = -EACCES;
> + goto err_sighand;
> + }
> }
>
> if (oom_adjust != task->signal->oom_adj) {
> diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
> index 587566f..a8640a4 100644
> --- a/include/asm-generic/resource.h
> +++ b/include/asm-generic/resource.h
> @@ -45,7 +45,9 @@
> 0-39 for nice level 19 .. -20 */
> #define RLIMIT_RTPRIO 14 /* maximum realtime priority */
> #define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
> -#define RLIM_NLIMITS 16
> +#define RLIMIT_OOMADJ 16 /* max oom_adj allowed to lower to
> + 0-32 for oom level 15 .. -17 */
> +#define RLIM_NLIMITS 17
>
> /*
> * SuS says limits have to be unsigned.
> @@ -86,6 +88,7 @@
> [RLIMIT_MSGQUEUE] = { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
> [RLIMIT_NICE] = { 0, 0 }, \
> [RLIMIT_RTPRIO] = { 0, 0 }, \
> + [RLIMIT_OOMADJ] = { 0, 0 }, \

I don't think 0 is good initial value because 0 mean oom_adj==15.



> [RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
> }
>
> --
> 1.7.3.1
>


2010-11-14 05:07:34

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj

> > Hmm, at first glance that seems potentially dangerous if the current tab
> > generates a burt of memory allocations and it ends up killing all other
> > tabs before finally targeting the culprit whereas currently the heuristic
> > should do a good job of finding this problematic tab and killing it
> > instantly.
>
> The original oom_adj design would e.g. adjust all background tabs to seem
> twice as bad as the current tab, so a ever-growing current tab would
> only be able to get one or two tabs killed in the worst case:
>
> System is near OOM while the current tab starts using a normal amount of
> mem and grows beyond limits. After it easily killed the first bg tab by
> allocating one byte, it needs to grow to twice the normal tab memsize to
> start the OOM killer again. By then, it's score will be equal to normal
> background tabs (half size, double score), and killing the second tab will
> be luck. Killing the third tab should be impossible.
>
> (If you adjust the bg tabs to be four times as killable, you'll get what
> you asked for.)

Sorry for the delay.

I've sent revert patch to linus. We DON'T want to break any userland seriously.
So, we reconsider about this area. personally I hope to reintroduce oom_score_adj
later, but It must don't break current userland.


2010-11-14 21:42:41

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj

On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:

> I've sent revert patch to linus. We DON'T want to break any userland seriously.
> So, we reconsider about this area. personally I hope to reintroduce oom_score_adj
> later, but It must don't break current userland.
>

Nobody is going to stall their development waiting for your patches to be
merged (especially when you've proposed them four times before and they
haven't been), so patches in this area should be against Linus' latest
tree.

2010-11-15 22:02:11

by Mandeep Singh Baines

[permalink] [raw]
Subject: [PATCH v2] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down

Hi,

Attached is V2 of this patch. Since V1:

* Added documentation in Documentation/filesystems/proc.txt
* Copy oom_score_adj_min value across a fork

An alternative to this patch would be to expose oom_score_adj_min via
proc (suggested by Kosaki Motohiro). You could allow non-CAP_SYS_RESOURCE
processes to irreversibly increase the minimum. That would give you
similar semantics to a resource limit and would give you a bit more
control but it means adding another proc variable.

Regards,
Mandeep

---

We'd like to be able to oom_score_adj a process up/down as its
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease
its oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set
it or its inherited value at fork. Assuming the thread that has forked
it has oom_score_adj of 0, each tab process could decrease it back from
0 upon activation unless a CAP_SYS_RESOURCE thread elevated it to
something higher.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their
oom_adj, a setuid or daemon implementation would be complex. The
alternatives also have much higher overhead.

This patch updated from original patch based on feedback from
David Rientjes <[email protected]>.

Signed-off-by: Mandeep Singh Baines <[email protected]>
---
Documentation/filesystems/proc.txt | 4 ++++
fs/proc/base.c | 4 +++-
include/linux/sched.h | 2 ++
kernel/fork.c | 1 +
4 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e73df27..7139c50 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1296,6 +1296,10 @@ scaled linearly with /proc/<pid>/oom_score_adj.
Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
other with its scaled value.

+The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
+value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
+requires CAP_SYS_RESOURCE.
+
NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
Documentation/feature-removal-schedule.txt.

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..e617413 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1164,7 +1164,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
goto err_task_lock;
}

- if (oom_score_adj < task->signal->oom_score_adj &&
+ if (oom_score_adj < task->signal->oom_score_adj_min &&
!capable(CAP_SYS_RESOURCE)) {
err = -EACCES;
goto err_sighand;
@@ -1177,6 +1177,8 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
atomic_dec(&task->mm->oom_disable_count);
}
task->signal->oom_score_adj = oom_score_adj;
+ if (capable(CAP_SYS_RESOURCE))
+ task->signal->oom_score_adj_min = oom_score_adj;
/*
* Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
* always attainable.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f53cdf2..2a71ee0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -626,6 +626,8 @@ struct signal_struct {

int oom_adj; /* OOM kill score adjustment (bit shift) */
int oom_score_adj; /* OOM kill score adjustment */
+ int oom_score_adj_min; /* OOM kill score adjustment minimum value.
+ * Only settable by CAP_SYS_RESOURCE. */

struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..0979527 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -907,6 +907,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)

sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
+ sig->oom_score_adj_min = current->signal->oom_score_adj_min;

mutex_init(&sig->cred_guard_mutex);

--
1.7.3.1

2010-11-15 22:06:59

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v2] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down

On Mon, 15 Nov 2010, Mandeep Singh Baines wrote:

> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f3d02ca..e617413 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1164,7 +1164,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
> goto err_task_lock;
> }
>
> - if (oom_score_adj < task->signal->oom_score_adj &&
> + if (oom_score_adj < task->signal->oom_score_adj_min &&
> !capable(CAP_SYS_RESOURCE)) {
> err = -EACCES;
> goto err_sighand;
> @@ -1177,6 +1177,8 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
> atomic_dec(&task->mm->oom_disable_count);
> }
> task->signal->oom_score_adj = oom_score_adj;
> + if (capable(CAP_SYS_RESOURCE))
> + task->signal->oom_score_adj_min = oom_score_adj;
> /*
> * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
> * always attainable.

This should be has_capability_noaudit(current, CAP_SYS_RESOURCE) instead,
we don't want an audit message to be emitted when checking if
oom_score_adj_min should be set.

Other than that:

Acked-by: David Rientjes <[email protected]>

2010-11-16 01:21:06

by Mandeep Singh Baines

[permalink] [raw]
Subject: [PATCH v3] oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down

We'd like to be able to oom_score_adj a process up/down as its
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease
its oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set
it or its inherited value at fork. Assuming the thread that has forked
it has oom_score_adj of 0, each tab process could decrease it back from
0 upon activation unless a CAP_SYS_RESOURCE thread elevated it to
something higher.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their
oom_adj, a setuid or daemon implementation would be complex. The
alternatives also have much higher overhead.

This patch updated from original patch based on feedback from
David Rientjes <[email protected]>.

Signed-off-by: Mandeep Singh Baines <[email protected]>
Acked-by: David Rientjes <[email protected]>
---
Documentation/filesystems/proc.txt | 4 ++++
fs/proc/base.c | 4 +++-
include/linux/sched.h | 2 ++
kernel/fork.c | 1 +
4 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e73df27..7139c50 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1296,6 +1296,10 @@ scaled linearly with /proc/<pid>/oom_score_adj.
Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
other with its scaled value.

+The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
+value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
+requires CAP_SYS_RESOURCE.
+
NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
Documentation/feature-removal-schedule.txt.

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..7b1a9df 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1164,7 +1164,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
goto err_task_lock;
}

- if (oom_score_adj < task->signal->oom_score_adj &&
+ if (oom_score_adj < task->signal->oom_score_adj_min &&
!capable(CAP_SYS_RESOURCE)) {
err = -EACCES;
goto err_sighand;
@@ -1177,6 +1177,8 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
atomic_dec(&task->mm->oom_disable_count);
}
task->signal->oom_score_adj = oom_score_adj;
+ if (has_capability_noaudit(current, CAP_SYS_RESOURCE))
+ task->signal->oom_score_adj_min = oom_score_adj;
/*
* Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
* always attainable.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f53cdf2..2a71ee0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -626,6 +626,8 @@ struct signal_struct {

int oom_adj; /* OOM kill score adjustment (bit shift) */
int oom_score_adj; /* OOM kill score adjustment */
+ int oom_score_adj_min; /* OOM kill score adjustment minimum value.
+ * Only settable by CAP_SYS_RESOURCE. */

struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..0979527 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -907,6 +907,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)

sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
+ sig->oom_score_adj_min = current->signal->oom_score_adj_min;

mutex_init(&sig->cred_guard_mutex);

--
1.7.3.1

2010-11-23 07:17:15

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] oom: create a resource limit for oom_adj

> On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
>
> > I've sent revert patch to linus. We DON'T want to break any userland seriously.
> > So, we reconsider about this area. personally I hope to reintroduce oom_score_adj
> > later, but It must don't break current userland.
> >
>
> Nobody is going to stall their development waiting for your patches to be
> merged (especially when you've proposed them four times before and they
> haven't been), so patches in this area should be against Linus' latest
> tree.

Nobody want to accept you break userland.