2022-04-15 23:45:16

by Nico Pache

Subject: [PATCH v9] oom_kill.c: futex: Delay the OOM reaper to allow time for proper futex cleanup

The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
be targeted by the oom reaper. This mapping is used to store the futex
robust list head; the kernel does not keep a copy of the robust list and
instead references a userspace address to maintain the robustness during
a process death. A race can occur between exit_mm and the oom reaper that
allows the oom reaper to free the memory of the futex robust list before
the exit path has handled the futex death:

    CPU1                               CPU2
    ------------------------------------------------------------------------
    page_fault
    do_exit "signal"
    wake_oom_reaper
                                       oom_reaper
                                       oom_reap_task_mm (invalidates mm)
    exit_mm
    exit_mm_release
    futex_exit_release
    futex_cleanup
    exit_robust_list
    get_user (EFAULT- can't access memory)

If the get_user EFAULT's, the kernel will be unable to recover the
waiters on the robust_list, leaving userspace mutexes hung indefinitely.

Delay the OOM reaper, allowing more time for the exit path to perform
the futex cleanup.

Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer

[1] https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L370

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: Rafael Aquini <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Herton R. Krzesinski <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joel Savitz <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
[ Based on a patch by Michal Hocko ]
Co-developed-by: Joel Savitz <[email protected]>
Signed-off-by: Joel Savitz <[email protected]>
Signed-off-by: Nico Pache <[email protected]>
---
 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 54 ++++++++++++++++++++++++++++++++-----------
 2 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d5e3c00b74e1..a8911b1f35aa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1443,6 +1443,7 @@ struct task_struct {
 	int pagefault_disabled;
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
+	struct timer_list oom_reaper_timer;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct *stack_vm_area;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ec38194f8e1..49d7df39b02d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -632,7 +632,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	 */
 	set_bit(MMF_OOM_SKIP, &mm->flags);

-	/* Drop a reference taken by wake_oom_reaper */
+	/* Drop a reference taken by queue_oom_reaper */
 	put_task_struct(tsk);
 }

@@ -644,12 +644,12 @@ static int oom_reaper(void *unused)
 		struct task_struct *tsk = NULL;

 		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
-		spin_lock(&oom_reaper_lock);
+		spin_lock_irq(&oom_reaper_lock);
 		if (oom_reaper_list != NULL) {
 			tsk = oom_reaper_list;
 			oom_reaper_list = tsk->oom_reaper_list;
 		}
-		spin_unlock(&oom_reaper_lock);
+		spin_unlock_irq(&oom_reaper_lock);

 		if (tsk)
 			oom_reap_task(tsk);
@@ -658,22 +658,48 @@ static int oom_reaper(void *unused)
 	return 0;
 }

-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(struct timer_list *timer)
 {
-	/* mm is already queued? */
-	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
-		return;
+	struct task_struct *tsk = container_of(timer, struct task_struct,
+			oom_reaper_timer);
+	struct mm_struct *mm = tsk->signal->oom_mm;
+	unsigned long flags;

-	get_task_struct(tsk);
+	/* The victim managed to terminate on its own - see exit_mmap */
+	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+		put_task_struct(tsk);
+		return;
+	}

-	spin_lock(&oom_reaper_lock);
+	spin_lock_irqsave(&oom_reaper_lock, flags);
 	tsk->oom_reaper_list = oom_reaper_list;
 	oom_reaper_list = tsk;
-	spin_unlock(&oom_reaper_lock);
+	spin_unlock_irqrestore(&oom_reaper_lock, flags);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);
 }

+/*
+ * Give the OOM victim time to exit naturally before invoking the oom_reaping.
+ * The timers timeout is arbitrary... the longer it is, the longer the worst
+ * case scenario for the OOM can take. If it is too small, the oom_reaper can
+ * get in the way and release resources needed by the process exit path.
+ * e.g. The futex robust list can sit in Anon|Private memory that gets reaped
+ * before the exit path is able to wake the futex waiters.
+ */
+#define OOM_REAPER_DELAY (2*HZ)
+static void queue_oom_reaper(struct task_struct *tsk)
+{
+	/* mm is already queued? */
+	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
+		return;
+
+	get_task_struct(tsk);
+	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
+	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+	add_timer(&tsk->oom_reaper_timer);
+}
+
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -681,7 +707,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static inline void wake_oom_reaper(struct task_struct *tsk)
+static inline void queue_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif /* CONFIG_MMU */
@@ -932,7 +958,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 	rcu_read_unlock();

 	if (can_oom_reap)
-		wake_oom_reaper(victim);
+		queue_oom_reaper(victim);

 	mmdrop(mm);
 	put_task_struct(victim);
@@ -968,7 +994,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	task_lock(victim);
 	if (task_will_free_mem(victim)) {
 		mark_oom_victim(victim);
-		wake_oom_reaper(victim);
+		queue_oom_reaper(victim);
 		task_unlock(victim);
 		put_task_struct(victim);
 		return;
@@ -1067,7 +1093,7 @@ bool out_of_memory(struct oom_control *oc)
 	 */
 	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		wake_oom_reaper(current);
+		queue_oom_reaper(current);
 		return true;
 	}

--
2.35.1
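
The control flow introduced by the patch above can be modelled in userspace. What follows is a rough sketch under stated assumptions, not kernel code: C11 atomics stand in for the kernel bit operations, a direct function call stands in for timer expiry, and locking of the list is not modelled. Every name in it (struct victim, victim_init, queue_reaper, timer_expired, FLAG_REAP_QUEUED, FLAG_OOM_SKIP) is hypothetical and only mirrors the roles of queue_oom_reaper(), wake_oom_reaper(), MMF_OOM_REAP_QUEUED and MMF_OOM_SKIP.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for MMF_OOM_REAP_QUEUED and MMF_OOM_SKIP. */
enum { FLAG_REAP_QUEUED = 1u << 0, FLAG_OOM_SKIP = 1u << 1 };

struct victim {
	atomic_uint flags;	/* models mm->flags */
	atomic_int refcount;	/* models the task_struct reference count */
	bool timer_armed;	/* models a pending oom_reaper_timer */
	struct victim *next;	/* models tsk->oom_reaper_list */
};

/* Models the global oom_reaper_list (locking omitted in this sketch). */
static struct victim *reaper_list;

static void victim_init(struct victim *v)
{
	atomic_init(&v->flags, 0);
	atomic_init(&v->refcount, 1);	/* the caller's own reference */
	v->timer_armed = false;
	v->next = NULL;
}

/* Models queue_oom_reaper(): arm the delay at most once per victim. */
static bool queue_reaper(struct victim *v)
{
	unsigned int old = atomic_fetch_or(&v->flags, FLAG_REAP_QUEUED);

	if (old & FLAG_REAP_QUEUED)
		return false;			/* already queued */

	atomic_fetch_add(&v->refcount, 1);	/* held while the timer is pending */
	v->timer_armed = true;
	return true;
}

/*
 * Models the timer callback wake_oom_reaper(). Returns true if the victim
 * was handed to the reaper, false if it had already exited on its own.
 */
static bool timer_expired(struct victim *v)
{
	assert(v->timer_armed);
	v->timer_armed = false;

	if (atomic_load(&v->flags) & FLAG_OOM_SKIP) {
		atomic_fetch_sub(&v->refcount, 1);	/* nothing left to reap */
		return false;
	}

	/* Push onto the list; the reaper drops the reference later. */
	v->next = reaper_list;
	reaper_list = v;
	return true;
}
```

The point of the model is the ordering: the queued bit guards against arming the delay twice, and the skip check at expiry is what lets a victim that exited within the delay avoid being reaped at all.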


2022-04-21 15:57:16

by Thomas Gleixner

Subject: Re: [PATCH v9] oom_kill.c: futex: Delay the OOM reaper to allow time for proper futex cleanup

On Thu, Apr 14 2022 at 10:40, Nico Pache wrote:
> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
> be targeted by the oom reaper. This mapping is used to store the futex
> robust list head; the kernel does not keep a copy of the robust list and
> instead references a userspace address to maintain the robustness during
> a process death. A race can occur between exit_mm and the oom reaper that
> allows the oom reaper to free the memory of the futex robust list before
> the exit path has handled the futex death:
>
>     CPU1                               CPU2
>     ------------------------------------------------------------------------
>     page_fault
>     do_exit "signal"
>     wake_oom_reaper
>                                        oom_reaper
>                                        oom_reap_task_mm (invalidates mm)
>     exit_mm
>     exit_mm_release
>     futex_exit_release
>     futex_cleanup
>     exit_robust_list
>     get_user (EFAULT- can't access memory)
>
> If the get_user EFAULT's, the kernel will be unable to recover the
> waiters on the robust_list, leaving userspace mutexes hung indefinitely.
>
> Delay the OOM reaper, allowing more time for the exit path to perform
> the futex cleanup.
>
> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
>
> [1] https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L370

A link to the original discussion about this would be more useful than a
code reference which is stale tomorrow. The above explanation is good
enough to describe the problem.

>
> +/*
> + * Give the OOM victim time to exit naturally before invoking the oom_reaping.
> + * The timers timeout is arbitrary... the longer it is, the longer the worst
> + * case scenario for the OOM can take. If it is too small, the oom_reaper can
> + * get in the way and release resources needed by the process exit path.
> + * e.g. The futex robust list can sit in Anon|Private memory that gets reaped
> + * before the exit path is able to wake the futex waiters.
> + */
> +#define OOM_REAPER_DELAY (2*HZ)
> +static void queue_oom_reaper(struct task_struct *tsk)

Bah. Did you run out of newlines? Glueing that define between the
comment and the function is unreadable.

Other than that.

Acked-by: Thomas Gleixner <[email protected]>

2022-04-21 16:25:32

by Michal Hocko

Subject: Re: [PATCH v9] oom_kill.c: futex: Delay the OOM reaper to allow time for proper futex cleanup

On Thu 14-04-22 10:40:42, Nico Pache wrote:
> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
> be targeted by the oom reaper. This mapping is used to store the futex
> robust list head; the kernel does not keep a copy of the robust list and
> instead references a userspace address to maintain the robustness during
> a process death. A race can occur between exit_mm and the oom reaper that
> allows the oom reaper to free the memory of the futex robust list before
> the exit path has handled the futex death:
>
>     CPU1                               CPU2
>     ------------------------------------------------------------------------
>     page_fault
>     do_exit "signal"
>     wake_oom_reaper
>                                        oom_reaper
>                                        oom_reap_task_mm (invalidates mm)
>     exit_mm
>     exit_mm_release
>     futex_exit_release
>     futex_cleanup
>     exit_robust_list
>     get_user (EFAULT- can't access memory)
>
> If the get_user EFAULT's, the kernel will be unable to recover the
> waiters on the robust_list, leaving userspace mutexes hung indefinitely.
>
> Delay the OOM reaper, allowing more time for the exit path to perform
> the futex cleanup.
>
> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
>
> [1] https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L370
>
> Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> Cc: Rafael Aquini <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: Herton R. Krzesinski <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Davidlohr Bueso <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Joel Savitz <[email protected]>
> Cc: Darren Hart <[email protected]>
> Cc: [email protected]
> Cc: Thomas Gleixner <[email protected]>
> Suggested-by: Thomas Gleixner <[email protected]>
> [ Based on a patch by Michal Hocko ]
> Co-developed-by: Joel Savitz <[email protected]>
> Signed-off-by: Joel Savitz <[email protected]>
> Signed-off-by: Nico Pache <[email protected]>

Acked-by: Michal Hocko <[email protected]>
Thanks!

> ---
>  include/linux/sched.h |  1 +
>  mm/oom_kill.c         | 54 ++++++++++++++++++++++++++++++++-----------
>  2 files changed, 41 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d5e3c00b74e1..a8911b1f35aa 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1443,6 +1443,7 @@ struct task_struct {
>  	int pagefault_disabled;
>  #ifdef CONFIG_MMU
>  	struct task_struct *oom_reaper_list;
> +	struct timer_list oom_reaper_timer;
>  #endif
>  #ifdef CONFIG_VMAP_STACK
>  	struct vm_struct *stack_vm_area;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7ec38194f8e1..49d7df39b02d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -632,7 +632,7 @@ static void oom_reap_task(struct task_struct *tsk)
>  	 */
>  	set_bit(MMF_OOM_SKIP, &mm->flags);
>
> -	/* Drop a reference taken by wake_oom_reaper */
> +	/* Drop a reference taken by queue_oom_reaper */
>  	put_task_struct(tsk);
>  }
>
> @@ -644,12 +644,12 @@ static int oom_reaper(void *unused)
>  		struct task_struct *tsk = NULL;
>
>  		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
> -		spin_lock(&oom_reaper_lock);
> +		spin_lock_irq(&oom_reaper_lock);
>  		if (oom_reaper_list != NULL) {
>  			tsk = oom_reaper_list;
>  			oom_reaper_list = tsk->oom_reaper_list;
>  		}
> -		spin_unlock(&oom_reaper_lock);
> +		spin_unlock_irq(&oom_reaper_lock);
>
>  		if (tsk)
>  			oom_reap_task(tsk);
> @@ -658,22 +658,48 @@ static int oom_reaper(void *unused)
>  	return 0;
>  }
>
> -static void wake_oom_reaper(struct task_struct *tsk)
> +static void wake_oom_reaper(struct timer_list *timer)
>  {
> -	/* mm is already queued? */
> -	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
> -		return;
> +	struct task_struct *tsk = container_of(timer, struct task_struct,
> +			oom_reaper_timer);
> +	struct mm_struct *mm = tsk->signal->oom_mm;
> +	unsigned long flags;
>
> -	get_task_struct(tsk);
> +	/* The victim managed to terminate on its own - see exit_mmap */
> +	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
> +		put_task_struct(tsk);
> +		return;
> +	}
>
> -	spin_lock(&oom_reaper_lock);
> +	spin_lock_irqsave(&oom_reaper_lock, flags);
>  	tsk->oom_reaper_list = oom_reaper_list;
>  	oom_reaper_list = tsk;
> -	spin_unlock(&oom_reaper_lock);
> +	spin_unlock_irqrestore(&oom_reaper_lock, flags);
>  	trace_wake_reaper(tsk->pid);
>  	wake_up(&oom_reaper_wait);
>  }
>
> +/*
> + * Give the OOM victim time to exit naturally before invoking the oom_reaping.
> + * The timers timeout is arbitrary... the longer it is, the longer the worst
> + * case scenario for the OOM can take. If it is too small, the oom_reaper can
> + * get in the way and release resources needed by the process exit path.
> + * e.g. The futex robust list can sit in Anon|Private memory that gets reaped
> + * before the exit path is able to wake the futex waiters.
> + */
> +#define OOM_REAPER_DELAY (2*HZ)
> +static void queue_oom_reaper(struct task_struct *tsk)
> +{
> +	/* mm is already queued? */
> +	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
> +		return;
> +
> +	get_task_struct(tsk);
> +	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> +	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	add_timer(&tsk->oom_reaper_timer);
> +}
> +
>  static int __init oom_init(void)
>  {
>  	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> @@ -681,7 +707,7 @@ static int __init oom_init(void)
>  }
>  subsys_initcall(oom_init)
>  #else
> -static inline void wake_oom_reaper(struct task_struct *tsk)
> +static inline void queue_oom_reaper(struct task_struct *tsk)
>  {
>  }
>  #endif /* CONFIG_MMU */
> @@ -932,7 +958,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>  	rcu_read_unlock();
>
>  	if (can_oom_reap)
> -		wake_oom_reaper(victim);
> +		queue_oom_reaper(victim);
>
>  	mmdrop(mm);
>  	put_task_struct(victim);
> @@ -968,7 +994,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	task_lock(victim);
>  	if (task_will_free_mem(victim)) {
>  		mark_oom_victim(victim);
> -		wake_oom_reaper(victim);
> +		queue_oom_reaper(victim);
>  		task_unlock(victim);
>  		put_task_struct(victim);
>  		return;
> @@ -1067,7 +1093,7 @@ bool out_of_memory(struct oom_control *oc)
>  	 */
>  	if (task_will_free_mem(current)) {
>  		mark_oom_victim(current);
> -		wake_oom_reaper(current);
> +		queue_oom_reaper(current);
>  		return true;
>  	}
>
> --
> 2.35.1

--
Michal Hocko
SUSE Labs

2022-04-21 22:45:50

by Nico Pache

Subject: Re: [PATCH v9] oom_kill.c: futex: Delay the OOM reaper to allow time for proper futex cleanup



On 4/21/22 10:40, Thomas Gleixner wrote:
> On Thu, Apr 14 2022 at 10:40, Nico Pache wrote:
>> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
>> be targeted by the oom reaper. This mapping is used to store the futex
>> robust list head; the kernel does not keep a copy of the robust list and
>> instead references a userspace address to maintain the robustness during
>> a process death. A race can occur between exit_mm and the oom reaper that
>> allows the oom reaper to free the memory of the futex robust list before
>> the exit path has handled the futex death:
>>
>>     CPU1                               CPU2
>>     ------------------------------------------------------------------------
>>     page_fault
>>     do_exit "signal"
>>     wake_oom_reaper
>>                                        oom_reaper
>>                                        oom_reap_task_mm (invalidates mm)
>>     exit_mm
>>     exit_mm_release
>>     futex_exit_release
>>     futex_cleanup
>>     exit_robust_list
>>     get_user (EFAULT- can't access memory)
>>
>> If the get_user EFAULT's, the kernel will be unable to recover the
>> waiters on the robust_list, leaving userspace mutexes hung indefinitely.
>>
>> Delay the OOM reaper, allowing more time for the exit path to perform
>> the futex cleanup.
>>
>> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
>>
>> [1] https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L370
>
> A link to the original discussion about this would be more useful than a
> code reference which is stale tomorrow. The above explanation is good
> enough to describe the problem.

Hi Andrew,

Can you please update the link when you add the ACKs?

Here is a more stable link:
[1] https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370

>
>>
>> +/*
>> + * Give the OOM victim time to exit naturally before invoking the oom_reaping.
>> + * The timers timeout is arbitrary... the longer it is, the longer the worst
>> + * case scenario for the OOM can take. If it is too small, the oom_reaper can
>> + * get in the way and release resources needed by the process exit path.
>> + * e.g. The futex robust list can sit in Anon|Private memory that gets reaped
>> + * before the exit path is able to wake the futex waiters.
>> + */
>> +#define OOM_REAPER_DELAY (2*HZ)
>> +static void queue_oom_reaper(struct task_struct *tsk)
>
> Bah. Did you run out of newlines? Glueing that define between the
> comment and the function is unreadable.

My Enter key hit its cgroup limit for newlines.

Andrew, would it be possible to also add a newline when you make the other
changes? Sorry about that.

>
> Other than that.
>
> Acked-by: Thomas Gleixner <[email protected]>

Thanks!

2022-04-22 18:57:34

by Andrew Morton

Subject: Re: [PATCH v9] oom_kill.c: futex: Delay the OOM reaper to allow time for proper futex cleanup

On Thu, 21 Apr 2022 12:25:58 -0400 Nico Pache <[email protected]> wrote:

> can you please update the link when you add the ACKs.
>
> Here is a more stable link:
> [1] https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370

Done. Thanks, all.

2022-04-22 20:31:59

by Thomas Gleixner

Subject: Re: [PATCH v9] oom_kill.c: futex: Delay the OOM reaper to allow time for proper futex cleanup

On Thu, Apr 21 2022 at 12:25, Nico Pache wrote:
> On 4/21/22 10:40, Thomas Gleixner wrote:
>>> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
>>>
>>> [1] https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L370
>>
>> A link to the original discussion about this would be more useful than a
>> code reference which is stale tomorrow. The above explanation is good
>> enough to describe the problem.
>
> Hi Andrew,
>
> can you please update the link when you add the ACKs.
>
> Here is a more stable link:
> [1] https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370

That link is still uninteresting and has nothing to do with what I was
asking for, i.e. replacing it with a link to the original discussion
which led to this patch.

Thanks,

tglx