From: Paul Menage <[email protected]>
make get_user_pages interruptible
The initial implementation of checking TIF_MEMDIE covers the cases of OOM
killing. If the process has been OOM killed, the TIF_MEMDIE is set and it
return immediately. This patch includes:
1. add the case that the SIGKILL is sent by user processes. The process can
try to get_user_pages() unlimited memory even if a user process has sent a
SIGKILL to it(maybe a monitor find the process exceed its memory limit and
try to kill it). In the old implementation, the SIGKILL won't be handled
until the get_user_pages() returns.
2. change the return value to be ERESTARTSYS. It makes no sense to return
ENOMEM if the get_user_pages returned by getting a SIGKILL signal.
Considering the general convention for a system call interrupted by a
signal is ERESTARTNOSYS, so the current return value is consistant to that.
Signed-off-by: Paul Menage <[email protected]>
Singed-off-by: Ying Han <[email protected]>
include/linux/sched.h | 1 +
kernel/signal.c | 2 +-
mm/memory.c | 9 +-
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b483f39..f9c6a8a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1790,6 +1790,7 @@ extern void sched_dead(struct task_struct *p);
extern int in_group_p(gid_t);
extern int in_egroup_p(gid_t);
+extern int sigkill_pending(struct task_struct *tsk);
extern void proc_caches_init(void);
extern void flush_signals(struct task_struct *);
extern void ignore_signals(struct task_struct *);
diff --git a/kernel/signal.c b/kernel/signal.c
index 105217d..f3f154e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1497,7 +1497,7 @@ static inline int may_ptrace_stop(void)
* Return nonzero if there is a SIGKILL that should be waking us up.
* Called with the siglock held.
*/
-static int sigkill_pending(struct task_struct *tsk)
+int sigkill_pending(struct task_struct *tsk)
{
return sigismember(&tsk->pending.signal, SIGKILL) ||
sigismember(&tsk->signal->shared_pending.signal, SIGKILL);
diff --git a/mm/memory.c b/mm/memory.c
index 164951c..ae24300 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1218,12 +1218,11 @@ int __get_user_pages(struct task_struct *tsk, struct m
struct page *page;
/*
- * If tsk is ooming, cut off its access to large memory
- * allocations. It has a pending SIGKILL, but it can't
- * be processed until returning to user space.
+ * If we have a pending SIGKILL, don't keep
+ * allocating memory.
*/
- if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE)))
- return i ? i : -ENOMEM;
+ if (unlikely(sigkill_pending(tsk)))
+ return i ? i : -ERESTARTSYS;
if (write)
foll_flags |= FOLL_WRITE;
On Fri, Nov 21, 2008 at 5:31 PM, Ying Han <[email protected]> wrote:
> From: Paul Menage <[email protected]>
This patch is getting further and further from my original internal
changes, so I'm not sure that a From: line from me is appropriate.
> */
> - if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE)))
> - return i ? i : -ENOMEM;
> + if (unlikely(sigkill_pending(tsk)))
> + return i ? i : -ERESTARTSYS;
You've changed the check from sigkill_pending(current) to sigkill_pending(tsk).
I originally made that sigkill_pending(current) since we want to avoid
tasks entering an unkillable state just because they're doing
get_user_pages() on a system that's short of memory. Admittedly for
the main case that we care about, mlock() (or an mmap() with
MCL_FUTURE set) then tsk==current, but philosophically it seems to me
to be more correct to do the check against current than tsk, since
current is the thing that's actually allocating the memory. But maybe
it would be better to check both?
Paul
Hi Paul,
On Fri, Nov 21, 2008 at 5:31 PM, Ying Han <[email protected]> wrote:
>> */
>> - if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE)))
>> - return i ? i : -ENOMEM;
>> + if (unlikely(sigkill_pending(tsk)))
>> + return i ? i : -ERESTARTSYS;
On Mon, Nov 24, 2008 at 10:02 PM, Paul Menage <[email protected]> wrote:
> You've changed the check from sigkill_pending(current) to sigkill_pending(tsk).
>
> I originally made that sigkill_pending(current) since we want to avoid
> tasks entering an unkillable state just because they're doing
> get_user_pages() on a system that's short of memory. Admittedly for
> the main case that we care about, mlock() (or an mmap() with
> MCL_FUTURE set) then tsk==current, but philosophically it seems to me
> to be more correct to do the check against current than tsk, since
> current is the thing that's actually allocating the memory. But maybe
> it would be better to check both?
Well, most callers seem to pass 'current' to get_user_pages() but for
the out-of-tree revoke patches, for example, you certainly want to
check sigkill_pending(current) as well; otherwise the revoke operation
is unkillable while in get_user_pages().
Not that revoke() is going to hit mainline any time soon but it does
serve as an argument for checking both.
Pekka
On Mon, Nov 24, 2008 at 12:02 PM, Paul Menage <[email protected]> wrote:
> On Fri, Nov 21, 2008 at 5:31 PM, Ying Han <[email protected]> wrote:
>> From: Paul Menage <[email protected]>
>
> This patch is getting further and further from my original internal
> changes, so I'm not sure that a From: line from me is appropriate.
>
>> */
>> - if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE)))
>> - return i ? i : -ENOMEM;
>> + if (unlikely(sigkill_pending(tsk)))
>> + return i ? i : -ERESTARTSYS;
>
> You've changed the check from sigkill_pending(current) to sigkill_pending(tsk).
>
> I originally made that sigkill_pending(current) since we want to avoid
> tasks entering an unkillable state just because they're doing
> get_user_pages() on a system that's short of memory. Admittedly for
> the main case that we care about, mlock() (or an mmap() with
> MCL_FUTURE set) then tsk==current, but philosophically it seems to me
> to be more correct to do the check against current than tsk, since
> current is the thing that's actually allocating the memory. But maybe
> it would be better to check both?
In most of cases, tsk==current in get_user_pages(), that is why i
change current to tsk since
tsk is a superset of current, no? If that is right, why we need to check both?
>
> Paul
>
On Fri, Nov 21, 2008 at 5:31 PM, Ying Han <[email protected]> wrote:
>>> */
>>> - if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE)))
>>> - return i ? i : -ENOMEM;
>>> + if (unlikely(sigkill_pending(tsk)))
>>> + return i ? i : -ERESTARTSYS;
On Mon, Nov 24, 2008 at 12:02 PM, Paul Menage <[email protected]> wrote:
>> You've changed the check from sigkill_pending(current) to sigkill_pending(tsk).
>>
>> I originally made that sigkill_pending(current) since we want to avoid
>> tasks entering an unkillable state just because they're doing
>> get_user_pages() on a system that's short of memory. Admittedly for
>> the main case that we care about, mlock() (or an mmap() with
>> MCL_FUTURE set) then tsk==current, but philosophically it seems to me
>> to be more correct to do the check against current than tsk, since
>> current is the thing that's actually allocating the memory. But maybe
>> it would be better to check both?
On Mon, Nov 24, 2008 at 11:02 PM, Ying Han <[email protected]> wrote:
> In most of cases, tsk==current in get_user_pages(), that is why i
> change current to tsk since
> tsk is a superset of current, no? If that is right, why we need to check both?
I'm not sure if it's strictly necessary but as I pointed out in the
other mail, there can be callers that are doing get_user_pages() on
behalf of other tasks and you probably want to be able to kill the
task that's actually _calling_ get_user_pages() as well.
Pekka
thanks Pekka and i think one example of the case you mentioned is in
access_process_vm() which is calling
get_user_pages(tsk, mm, addr, 1, write, 1, &pages, &vma). However, it
is allocating only one page here which
much less likely to be stuck under memory pressure. Like you said, in
order to make it more flexible for future
changes, i might make the change like:
>>>> */
>>>> - if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE)))
>>>> - return i ? i : -ENOMEM;
>>>> + if (unlikely(sigkill_pending(current) | | sigkill_pending(tsk)))
>>>> + return i ? i : -ERESTARTSYS;
is this something acceptable?
On Mon, Nov 24, 2008 at 1:13 PM, Pekka Enberg <[email protected]> wrote:
> On Fri, Nov 21, 2008 at 5:31 PM, Ying Han <[email protected]> wrote:
>>>> */
>>>> - if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE)))
>>>> - return i ? i : -ENOMEM;
>>>> + if (unlikely(sigkill_pending(tsk)))
>>>> + return i ? i : -ERESTARTSYS;
>
> On Mon, Nov 24, 2008 at 12:02 PM, Paul Menage <[email protected]> wrote:
>>> You've changed the check from sigkill_pending(current) to sigkill_pending(tsk).
>>>
>>> I originally made that sigkill_pending(current) since we want to avoid
>>> tasks entering an unkillable state just because they're doing
>>> get_user_pages() on a system that's short of memory. Admittedly for
>>> the main case that we care about, mlock() (or an mmap() with
>>> MCL_FUTURE set) then tsk==current, but philosophically it seems to me
>>> to be more correct to do the check against current than tsk, since
>>> current is the thing that's actually allocating the memory. But maybe
>>> it would be better to check both?
>
> On Mon, Nov 24, 2008 at 11:02 PM, Ying Han <[email protected]> wrote:
>> In most of cases, tsk==current in get_user_pages(), that is why i
>> change current to tsk since
>> tsk is a superset of current, no? If that is right, why we need to check both?
>
> I'm not sure if it's strictly necessary but as I pointed out in the
> other mail, there can be callers that are doing get_user_pages() on
> behalf of other tasks and you probably want to be able to kill the
> task that's actually _calling_ get_user_pages() as well.
>
> Pekka
>
Ying Han wrote:
> thanks Pekka and i think one example of the case you mentioned is in
> access_process_vm() which is calling
> get_user_pages(tsk, mm, addr, 1, write, 1, &pages, &vma). However, it
> is allocating only one page here which
> much less likely to be stuck under memory pressure. Like you said, in
> order to make it more flexible for future
> changes, i might make the change like:
>>>>> */
>>>>> - if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE)))
>>>>> - return i ? i : -ENOMEM;
>>>>> + if (unlikely(sigkill_pending(current) | | sigkill_pending(tsk)))
>>>>> + return i ? i : -ERESTARTSYS;
>
> is this something acceptable?
The formatting is bit wacky but I'm certainly OK with the change.