While investigating hosts with high cgroup memory pressures, Tejun
found culprit zombie tasks that were holding on to a lot of
memory, had SIGKILL pending, but were stuck in memory.high reclaim.
In the past, we used to always force-charge allocations from tasks
that were exiting in order to accelerate them dying and freeing up
their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
prohibit unconditional exceeding the limit of dying tasks"); it noted
that this can cause (userspace-inducible) containment failures, so it
added a mandatory reclaim and OOM kill cycle before forcing charges.
At the time, memory.high enforcement was handled in the userspace
return path, which isn't reached by dying tasks, and so memory.high
was still never enforced by dying tasks.
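For context, the deferred path works roughly like this (a simplified
sketch from memory, not the verbatim kernel code):
	/* try_charge_memcg(), after the charge has succeeded */
	if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->memory.high)) {
		current->memcg_nr_pages_over_high += batch;
		set_notify_resume(current);
	}
	/* on return to userspace, from resume_user_mode_work() */
	mem_cgroup_handle_over_high(GFP_KERNEL);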
When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
overcharges") added synchronous reclaim for memory.high, it added
unconditional memory.high enforcement for dying tasks as well. The
callstack shows that this is the path where the zombie is stuck.
We need to accelerate dying tasks getting past memory.high, but we
cannot do it quite the same way as we do for memory.max: memory.max is
enforced strictly, and tasks aren't allowed to move past it without
FIRST reclaiming and OOM killing if necessary. This ensures very small
levels of excess. With memory.high, though, enforcement happens lazily
after the charge, and OOM killing is never triggered. A lot of
concurrent threads could have pushed, or could actively be pushing,
the cgroup into excess. The dying task will enter reclaim on every
allocation attempt, with little hope of restoring balance.
To fix this, skip synchronous memory.high enforcement on dying tasks
altogether again. Update memory.high path documentation while at it.
Fixes: c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges")
Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 73692cd8c142..aca879995022 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2603,8 +2603,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
}
/*
- * Scheduled by try_charge() to be executed from the userland return path
- * and reclaims memory over the high limit.
+ * Reclaims memory over the high limit. Called directly from
+ * try_charge() when possible, but also scheduled to be called from
+ * the userland return path where reclaim is always able to block.
*/
void mem_cgroup_handle_over_high(gfp_t gfp_mask)
{
@@ -2673,6 +2674,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
}
/*
+ * Reclaim didn't manage to push usage below the limit, slow
+ * this allocating task down.
+ *
* If we exit early, we're guaranteed to die (since
* schedule_timeout_killable sets TASK_KILLABLE). This means we don't
* need to account for any ill-begotten jiffies to pay them off later.
@@ -2867,8 +2871,22 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
}
} while ((memcg = parent_mem_cgroup(memcg)));
+ /*
+ * Reclaim is scheduled for the userland return path already,
+ * but also attempt synchronous reclaim to avoid excessive
+ * overrun while the task is still inside the kernel. If this
+ * is successful, the return path will see it when it rechecks
+ * the overage, and simply bail out.
+ *
+ * Skip if the task is already dying, though. Unlike
+ * memory.max, memory.high enforcement isn't as strict, and
+ * there is no OOM killer involved, which means the excess
+ * could already be much bigger (and still growing) than it
+ * could for memory.max; the dying task could get stuck in
+ * fruitless reclaim for a long time, which isn't desirable.
+ */
if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
- !(current->flags & PF_MEMALLOC) &&
+ !(current->flags & PF_MEMALLOC) && !task_is_dying() &&
gfpflags_allow_blocking(gfp_mask)) {
mem_cgroup_handle_over_high(gfp_mask);
}
--
2.43.0
On Thu, Jan 11, 2024 at 5:29 AM Johannes Weiner <[email protected]> wrote:
>
> While investigating hosts with high cgroup memory pressures, Tejun
> found culprit zombie tasks that were holding on to a lot of
> memory, had SIGKILL pending, but were stuck in memory.high reclaim.
>
> In the past, we used to always force-charge allocations from tasks
> that were exiting in order to accelerate them dying and freeing up
> their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
> prohibit unconditional exceeding the limit of dying tasks"); it noted
> that this can cause (userspace-inducible) containment failures, so it
> added a mandatory reclaim and OOM kill cycle before forcing charges.
> At the time, memory.high enforcement was handled in the userspace
> return path, which isn't reached by dying tasks, and so memory.high
> was still never enforced by dying tasks.
>
> When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
> overcharges") added synchronous reclaim for memory.high, it added
> unconditional memory.high enforcement for dying tasks as well. The
> callstack shows that this is the path where the zombie is stuck.
>
> We need to accelerate dying tasks getting past memory.high, but we
> cannot do it quite the same way as we do for memory.max: memory.max is
> enforced strictly, and tasks aren't allowed to move past it without
> FIRST reclaiming and OOM killing if necessary. This ensures very small
> levels of excess. With memory.high, though, enforcement happens lazily
> after the charge, and OOM killing is never triggered. A lot of
> concurrent threads could have pushed, or could actively be pushing,
> the cgroup into excess. The dying task will enter reclaim on every
> allocation attempt, with little hope of restoring balance.
>
> To fix this, skip synchronous memory.high enforcement on dying tasks
> altogether again. Update memory.high path documentation while at it.
>
> Fixes: c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges")
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
LGTM with a couple of nits below:
Reviewed-by: Yosry Ahmed <[email protected]>
> ---
> mm/memcontrol.c | 24 +++++++++++++++++++++---
> 1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 73692cd8c142..aca879995022 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2603,8 +2603,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
> }
>
> /*
> - * Scheduled by try_charge() to be executed from the userland return path
> - * and reclaims memory over the high limit.
> + * Reclaims memory over the high limit. Called directly from
> + * try_charge() when possible, but also scheduled to be called from
> + * the userland return path where reclaim is always able to block.
> */
nit: The term "scheduled" here is deceptive imo, it makes me think of
queue_work() and friends, when it is directly called from
resume_user_mode_work(). Can we change the terminology to "called from
the userland return path" or directly reference
resume_user_mode_work() instead? Same applies to the added comment
below in try_charge_memcg().
nit: "when possible" is not entirely accurate, it makes it seem like
we call mem_cgroup_handle_over_high() whenever we can (which means
gfpflags_allow_blocking() imo). We actually choose not to call it in
some situations, and this patch is adding one such situation. So
perhaps "when possible and desirable" or just "when appropriate".
> void mem_cgroup_handle_over_high(gfp_t gfp_mask)
> {
> @@ -2673,6 +2674,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
> }
>
> /*
> + * Reclaim didn't manage to push usage below the limit, slow
> + * this allocating task down.
> + *
> * If we exit early, we're guaranteed to die (since
> * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
> * need to account for any ill-begotten jiffies to pay them off later.
> @@ -2867,8 +2871,22 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> }
> } while ((memcg = parent_mem_cgroup(memcg)));
>
> + /*
> + * Reclaim is scheduled for the userland return path already,
> + * but also attempt synchronous reclaim to avoid excessive
> + * overrun while the task is still inside the kernel. If this
> + * is successful, the return path will see it when it rechecks
> + * the overage, and simply bail out.
> + *
> + * Skip if the task is already dying, though. Unlike
> + * memory.max, memory.high enforcement isn't as strict, and
> + * there is no OOM killer involved, which means the excess
> + * could already be much bigger (and still growing) than it
> + * could for memory.max; the dying task could get stuck in
> + * fruitless reclaim for a long time, which isn't desirable.
> + */
> if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
> - !(current->flags & PF_MEMALLOC) &&
> + !(current->flags & PF_MEMALLOC) && !task_is_dying() &&
> gfpflags_allow_blocking(gfp_mask)) {
> mem_cgroup_handle_over_high(gfp_mask);
> }
> --
> 2.43.0
>
>
On Thu, Jan 11, 2024 at 08:29:02AM -0500, Johannes Weiner wrote:
> While investigating hosts with high cgroup memory pressures, Tejun
> found culprit zombie tasks that were holding on to a lot of
> memory, had SIGKILL pending, but were stuck in memory.high reclaim.
>
> In the past, we used to always force-charge allocations from tasks
> that were exiting in order to accelerate them dying and freeing up
> their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
> prohibit unconditional exceeding the limit of dying tasks"); it noted
> that this can cause (userspace-inducible) containment failures, so it
> added a mandatory reclaim and OOM kill cycle before forcing charges.
> At the time, memory.high enforcement was handled in the userspace
> return path, which isn't reached by dying tasks, and so memory.high
> was still never enforced by dying tasks.
>
> When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
> overcharges") added synchronous reclaim for memory.high, it added
> unconditional memory.high enforcement for dying tasks as well. The
> callstack shows that this is the path where the zombie is stuck.
>
> We need to accelerate dying tasks getting past memory.high, but we
> cannot do it quite the same way as we do for memory.max: memory.max is
> enforced strictly, and tasks aren't allowed to move past it without
> FIRST reclaiming and OOM killing if necessary. This ensures very small
> levels of excess. With memory.high, though, enforcement happens lazily
> after the charge, and OOM killing is never triggered. A lot of
> concurrent threads could have pushed, or could actively be pushing,
> the cgroup into excess. The dying task will enter reclaim on every
> allocation attempt, with little hope of restoring balance.
>
> To fix this, skip synchronous memory.high enforcement on dying tasks
> altogether again. Update memory.high path documentation while at it.
>
> Fixes: c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges")
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
On Thu, Jan 11, 2024 at 08:29:02AM -0500, Johannes Weiner wrote:
> While investigating hosts with high cgroup memory pressures, Tejun
> found culprit zombie tasks that were holding on to a lot of
> memory, had SIGKILL pending, but were stuck in memory.high reclaim.
>
> In the past, we used to always force-charge allocations from tasks
> that were exiting in order to accelerate them dying and freeing up
> their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
> prohibit unconditional exceeding the limit of dying tasks"); it noted
> that this can cause (userspace-inducible) containment failures, so it
> added a mandatory reclaim and OOM kill cycle before forcing charges.
> At the time, memory.high enforcement was handled in the userspace
> return path, which isn't reached by dying tasks, and so memory.high
> was still never enforced by dying tasks.
>
> When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
> overcharges") added synchronous reclaim for memory.high, it added
> unconditional memory.high enforcement for dying tasks as well. The
> callstack shows that this is the path where the zombie is stuck.
>
> We need to accelerate dying tasks getting past memory.high, but we
> cannot do it quite the same way as we do for memory.max: memory.max is
> enforced strictly, and tasks aren't allowed to move past it without
> FIRST reclaiming and OOM killing if necessary. This ensures very small
> levels of excess. With memory.high, though, enforcement happens lazily
> after the charge, and OOM killing is never triggered. A lot of
> concurrent threads could have pushed, or could actively be pushing,
> the cgroup into excess. The dying task will enter reclaim on every
> allocation attempt, with little hope of restoring balance.
>
> To fix this, skip synchronous memory.high enforcement on dying tasks
> altogether again. Update memory.high path documentation while at it.
It makes total sense to me.
Acked-by: Roman Gushchin <[email protected]>
However if tasks can get stuck for a long time in the "high reclaim" state,
shouldn't we also handle the case when tasks are being killed during the
reclaim? E.g. something like this (completely untested):
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c4c422c81f93..9f971fc6aae8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2465,6 +2465,9 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
READ_ONCE(memcg->memory.high))
continue;
+ if (task_is_dying())
+ break;
+
memcg_memory_event(memcg, MEMCG_HIGH);
psi_memstall_enter(&pflags);
@@ -2645,6 +2648,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
current->memcg_nr_pages_over_high = 0;
retry_reclaim:
+ if (task_is_dying())
+ return;
+
/*
* The allocating task should reclaim at least the batch size, but for
* subsequent retries we only want to do what's necessary to prevent oom
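For reference, task_is_dying() is the existing helper that a4ebf1b6ca1e
added to mm/memcontrol.c; if I remember correctly it is just:
	static bool task_is_dying(void)
	{
		return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
			(current->flags & PF_EXITING);
	}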
On Thu, Jan 11, 2024 at 09:59:11AM -0800, Roman Gushchin wrote:
> On Thu, Jan 11, 2024 at 08:29:02AM -0500, Johannes Weiner wrote:
> > While investigating hosts with high cgroup memory pressures, Tejun
> > found culprit zombie tasks that were holding on to a lot of
> > memory, had SIGKILL pending, but were stuck in memory.high reclaim.
> >
> > In the past, we used to always force-charge allocations from tasks
> > that were exiting in order to accelerate them dying and freeing up
> > their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
> > prohibit unconditional exceeding the limit of dying tasks"); it noted
> > that this can cause (userspace-inducible) containment failures, so it
> > added a mandatory reclaim and OOM kill cycle before forcing charges.
> > At the time, memory.high enforcement was handled in the userspace
> > return path, which isn't reached by dying tasks, and so memory.high
> > was still never enforced by dying tasks.
> >
> > When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
> > overcharges") added synchronous reclaim for memory.high, it added
> > unconditional memory.high enforcement for dying tasks as well. The
> > callstack shows that this is the path where the zombie is stuck.
> >
> > We need to accelerate dying tasks getting past memory.high, but we
> > cannot do it quite the same way as we do for memory.max: memory.max is
> > enforced strictly, and tasks aren't allowed to move past it without
> > FIRST reclaiming and OOM killing if necessary. This ensures very small
> > levels of excess. With memory.high, though, enforcement happens lazily
> > after the charge, and OOM killing is never triggered. A lot of
> > concurrent threads could have pushed, or could actively be pushing,
> > the cgroup into excess. The dying task will enter reclaim on every
> > allocation attempt, with little hope of restoring balance.
> >
> > To fix this, skip synchronous memory.high enforcement on dying tasks
> > altogether again. Update memory.high path documentation while at it.
>
> It makes total sense to me.
> Acked-by: Roman Gushchin <[email protected]>
Thanks
> However if tasks can get stuck for a long time in the "high reclaim" state,
> shouldn't we also handle the case when tasks are being killed during the
> reclaim? E.g. something like this (completely untested):
Yes, that's probably a good idea.
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c4c422c81f93..9f971fc6aae8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2465,6 +2465,9 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
> READ_ONCE(memcg->memory.high))
> continue;
>
> + if (task_is_dying())
> + break;
> +
> memcg_memory_event(memcg, MEMCG_HIGH);
>
> psi_memstall_enter(&pflags);
I think we can skip this one. The loop is for traversing from the
charging cgroup to the one that has memory.high set and breached, and
then reclaim it. It's not expected to run multiple reclaims.
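The loop body is roughly this (simplified from mm/memcontrol.c, not a
verbatim copy):
	do {
		/* skip ancestors that aren't above their high limit */
		if (page_counter_read(&memcg->memory) <=
		    READ_ONCE(memcg->memory.high))
			continue;
		memcg_memory_event(memcg, MEMCG_HIGH);
		psi_memstall_enter(&pflags);
		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
							     gfp_mask,
							     MEMCG_RECLAIM_MAY_SWAP);
		psi_memstall_leave(&pflags);
	} while ((memcg = parent_mem_cgroup(memcg)) &&
		 !mem_cgroup_is_root(memcg));
One walk up the hierarchy, one reclaim call per breached level.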
> @@ -2645,6 +2648,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
> current->memcg_nr_pages_over_high = 0;
>
> retry_reclaim:
> + if (task_is_dying())
> + return;
> +
> /*
> * The allocating task should reclaim at least the batch size, but for
> * subsequent retries we only want to do what's necessary to prevent oom
Yeah this is the better place for this check.
How about this?
---
From 6124a13cb073f5ff06b9c1309505bc937d65d6e5 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Thu, 11 Jan 2024 07:18:47 -0500
Subject: [PATCH] mm: memcontrol: don't throttle dying tasks on memory.high
While investigating hosts with high cgroup memory pressures, Tejun
found culprit zombie tasks that were holding on to a lot of
memory, had SIGKILL pending, but were stuck in memory.high reclaim.
In the past, we used to always force-charge allocations from tasks
that were exiting in order to accelerate them dying and freeing up
their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
prohibit unconditional exceeding the limit of dying tasks"); it noted
that this can cause (userspace-inducible) containment failures, so it
added a mandatory reclaim and OOM kill cycle before forcing charges.
At the time, memory.high enforcement was handled in the userspace
return path, which isn't reached by dying tasks, and so memory.high
was still never enforced by dying tasks.
When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
overcharges") added synchronous reclaim for memory.high, it added
unconditional memory.high enforcement for dying tasks as well. The
callstack shows that this is the path where the zombie is stuck.
We need to accelerate dying tasks getting past memory.high, but we
cannot do it quite the same way as we do for memory.max: memory.max is
enforced strictly, and tasks aren't allowed to move past it without
FIRST reclaiming and OOM killing if necessary. This ensures very small
levels of excess. With memory.high, though, enforcement happens lazily
after the charge, and OOM killing is never triggered. A lot of
concurrent threads could have pushed, or could actively be pushing,
the cgroup into excess. The dying task will enter reclaim on every
allocation attempt, with little hope of restoring balance.
To fix this, skip synchronous memory.high enforcement on dying tasks
altogether again. Update memory.high path documentation while at it.
Fixes: c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges")
Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 29 +++++++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 73692cd8c142..7be7a2f4e536 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2603,8 +2603,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
}
/*
- * Scheduled by try_charge() to be executed from the userland return path
- * and reclaims memory over the high limit.
+ * Reclaims memory over the high limit. Called directly from
+ * try_charge() (context permitting), as well as from the userland
+ * return path where reclaim is always able to block.
*/
void mem_cgroup_handle_over_high(gfp_t gfp_mask)
{
@@ -2623,6 +2624,17 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
current->memcg_nr_pages_over_high = 0;
retry_reclaim:
+ /*
+ * Bail if the task is already exiting. Unlike memory.max,
+ * memory.high enforcement isn't as strict, and there is no
+ * OOM killer involved, which means the excess could already
+ * be much bigger (and still growing) than it could for
+ * memory.max; the dying task could get stuck in fruitless
+ * reclaim for a long time, which isn't desirable.
+ */
+ if (task_is_dying())
+ goto out;
+
/*
* The allocating task should reclaim at least the batch size, but for
* subsequent retries we only want to do what's necessary to prevent oom
@@ -2673,6 +2685,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
}
/*
+ * Reclaim didn't manage to push usage below the limit, slow
+ * this allocating task down.
+ *
* If we exit early, we're guaranteed to die (since
* schedule_timeout_killable sets TASK_KILLABLE). This means we don't
* need to account for any ill-begotten jiffies to pay them off later.
@@ -2867,11 +2882,17 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
}
} while ((memcg = parent_mem_cgroup(memcg)));
+ /*
+ * Reclaim is set up above to be called from the userland
+ * return path. But also attempt synchronous reclaim to avoid
+ * excessive overrun while the task is still inside the
+ * kernel. If this is successful, the return path will see it
+ * when it rechecks the overage and simply bail out.
+ */
if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
!(current->flags & PF_MEMALLOC) &&
- gfpflags_allow_blocking(gfp_mask)) {
+ gfpflags_allow_blocking(gfp_mask))
mem_cgroup_handle_over_high(gfp_mask);
- }
return 0;
}
--
2.43.0
On Thu, Jan 11, 2024 at 02:28:07PM -0500, Johannes Weiner wrote:
> On Thu, Jan 11, 2024 at 09:59:11AM -0800, Roman Gushchin wrote:
> > On Thu, Jan 11, 2024 at 08:29:02AM -0500, Johannes Weiner wrote:
> > > While investigating hosts with high cgroup memory pressures, Tejun
> > > found culprit zombie tasks that were holding on to a lot of
> > > memory, had SIGKILL pending, but were stuck in memory.high reclaim.
> > >
> > > In the past, we used to always force-charge allocations from tasks
> > > that were exiting in order to accelerate them dying and freeing up
> > > their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
> > > prohibit unconditional exceeding the limit of dying tasks"); it noted
> > > that this can cause (userspace-inducible) containment failures, so it
> > > added a mandatory reclaim and OOM kill cycle before forcing charges.
> > > At the time, memory.high enforcement was handled in the userspace
> > > return path, which isn't reached by dying tasks, and so memory.high
> > > was still never enforced by dying tasks.
> > >
> > > When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
> > > overcharges") added synchronous reclaim for memory.high, it added
> > > unconditional memory.high enforcement for dying tasks as well. The
> > > callstack shows that this is the path where the zombie is stuck.
> > >
> > > We need to accelerate dying tasks getting past memory.high, but we
> > > cannot do it quite the same way as we do for memory.max: memory.max is
> > > enforced strictly, and tasks aren't allowed to move past it without
> > > FIRST reclaiming and OOM killing if necessary. This ensures very small
> > > levels of excess. With memory.high, though, enforcement happens lazily
> > > after the charge, and OOM killing is never triggered. A lot of
> > > concurrent threads could have pushed, or could actively be pushing,
> > > the cgroup into excess. The dying task will enter reclaim on every
> > > allocation attempt, with little hope of restoring balance.
> > >
> > > To fix this, skip synchronous memory.high enforcement on dying tasks
> > > altogether again. Update memory.high path documentation while at it.
> >
> > It makes total sense to me.
> > Acked-by: Roman Gushchin <[email protected]>
>
> Thanks
>
> > However if tasks can get stuck for a long time in the "high reclaim" state,
> > shouldn't we also handle the case when tasks are being killed during the
> > reclaim? E.g. something like this (completely untested):
>
> Yes, that's probably a good idea.
>
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c4c422c81f93..9f971fc6aae8 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2465,6 +2465,9 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
> > READ_ONCE(memcg->memory.high))
> > continue;
> >
> > + if (task_is_dying())
> > + break;
> > +
> > memcg_memory_event(memcg, MEMCG_HIGH);
> >
> > psi_memstall_enter(&pflags);
>
> I think we can skip this one. The loop is for traversing from the
> charging cgroup to the one that has memory.high set and breached, and
> then reclaim it. It's not expected to run multiple reclaims.
Yes, the next one is probably enough (hard for me to say without knowing
exactly where those dying processes are getting stuck - you should have
actual stacktraces, I guess).
>
> > @@ -2645,6 +2648,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
> > current->memcg_nr_pages_over_high = 0;
> >
> > retry_reclaim:
> > + if (task_is_dying())
> > + return;
> > +
> > /*
> > * The allocating task should reclaim at least the batch size, but for
> > * subsequent retries we only want to do what's necessary to prevent oom
>
> Yeah this is the better place for this check.
>
> How about this?
Looks really good to me!
I actually thought about moving the check into mem_cgroup_handle_over_high(),
and you already did it in this version.
Thanks!
On Thu, Jan 11, 2024 at 11:38:09AM -0800, Roman Gushchin wrote:
> On Thu, Jan 11, 2024 at 02:28:07PM -0500, Johannes Weiner wrote:
> > On Thu, Jan 11, 2024 at 09:59:11AM -0800, Roman Gushchin wrote:
> > > On Thu, Jan 11, 2024 at 08:29:02AM -0500, Johannes Weiner wrote:
> > > > While investigating hosts with high cgroup memory pressures, Tejun
> > > > found culprit zombie tasks that were holding on to a lot of
> > > > memory, had SIGKILL pending, but were stuck in memory.high reclaim.
> > > >
> > > > In the past, we used to always force-charge allocations from tasks
> > > > that were exiting in order to accelerate them dying and freeing up
> > > > their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
> > > > prohibit unconditional exceeding the limit of dying tasks"); it noted
> > > > that this can cause (userspace-inducible) containment failures, so it
> > > > added a mandatory reclaim and OOM kill cycle before forcing charges.
> > > > At the time, memory.high enforcement was handled in the userspace
> > > > return path, which isn't reached by dying tasks, and so memory.high
> > > > was still never enforced by dying tasks.
> > > >
> > > > When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
> > > > overcharges") added synchronous reclaim for memory.high, it added
> > > > unconditional memory.high enforcement for dying tasks as well. The
> > > > callstack shows that this is the path where the zombie is stuck.
> > > >
> > > > We need to accelerate dying tasks getting past memory.high, but we
> > > > cannot do it quite the same way as we do for memory.max: memory.max is
> > > > enforced strictly, and tasks aren't allowed to move past it without
> > > > FIRST reclaiming and OOM killing if necessary. This ensures very small
> > > > levels of excess. With memory.high, though, enforcement happens lazily
> > > > after the charge, and OOM killing is never triggered. A lot of
> > > > concurrent threads could have pushed, or could actively be pushing,
> > > > the cgroup into excess. The dying task will enter reclaim on every
> > > > allocation attempt, with little hope of restoring balance.
> > > >
> > > > To fix this, skip synchronous memory.high enforcement on dying tasks
> > > > altogether again. Update memory.high path documentation while at it.
> > >
> > > It makes total sense to me.
> > > Acked-by: Roman Gushchin <[email protected]>
> >
> > Thanks
> >
> > > However if tasks can get stuck for a long time in the "high reclaim" state,
> > > shouldn't we also handle the case when tasks are being killed during the
> > > reclaim? E.g. something like this (completely untested):
> >
> > Yes, that's probably a good idea.
> >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index c4c422c81f93..9f971fc6aae8 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2465,6 +2465,9 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
> > > READ_ONCE(memcg->memory.high))
> > > continue;
> > >
> > > + if (task_is_dying())
> > > + break;
> > > +
> > > memcg_memory_event(memcg, MEMCG_HIGH);
> > >
> > > psi_memstall_enter(&pflags);
> >
> > I think we can skip this one. The loop is for traversing from the
> > charging cgroup to the one that has memory.high set and breached, and
> > then reclaim it. It's not expected to run multiple reclaims.
>
> Yes, the next one is probably enough (hard for me to say without knowing
> exactly where those dying processes are getting stuck - you should have
> actual stacktraces, I guess).
A bit tricky to say. Tejun managed to get a trace from a crashdump,
but you can't tell where exactly it's looping:
#0 arch_atomic_dec_and_test (./arch/x86/include/asm/atomic.h:123:9)
#1 atomic_dec_and_test (./include/linux/atomic/atomic-instrumented.h:576:9)
#2 page_ref_dec_and_test (./include/linux/page_ref.h:210:12)
#3 put_page_testzero (./include/linux/mm.h:999:9)
#4 folio_put_testzero (./include/linux/mm.h:1004:9)
#5 move_folios_to_lru (mm/vmscan.c:2495:7)
#6 shrink_inactive_list (mm/vmscan.c:2594:2)
#7 shrink_list (mm/vmscan.c:2835:9)
#8 shrink_lruvec (mm/vmscan.c:6271:21)
#9 shrink_node_memcgs (mm/vmscan.c:6458:3)
#10 shrink_node (mm/vmscan.c:6493:2)
#11 shrink_zones (mm/vmscan.c:6728:3)
#12 do_try_to_free_pages (mm/vmscan.c:6790:3)
#13 try_to_free_mem_cgroup_pages (mm/vmscan.c:7105:17)
#14 reclaim_high (mm/memcontrol.c:2451:19)
#15 mem_cgroup_handle_over_high (mm/memcontrol.c:2670:17)
#16 try_charge_memcg (mm/memcontrol.c:2887:3)
#17 try_charge (mm/memcontrol.c:2898:9)
#18 charge_memcg (mm/memcontrol.c:7062:8)
#19 __mem_cgroup_charge (mm/memcontrol.c:7083:8)
#20 mem_cgroup_charge (./include/linux/memcontrol.h:682:9)
#21 __filemap_add_folio (mm/filemap.c:860:15)
#22 filemap_add_folio (mm/filemap.c:942:8)
#23 page_cache_ra_unbounded (mm/readahead.c:251:7)
#24 do_sync_mmap_readahead (mm/filemap.c:0)
#25 filemap_fault (mm/filemap.c:3288:10)
#26 __do_fault (mm/memory.c:4184:8)
#27 do_read_fault (mm/memory.c:4538:8)
#28 do_fault (mm/memory.c:4667:9)
#29 do_pte_missing (mm/memory.c:3648:10)
#30 handle_pte_fault (mm/memory.c:4955:10)
#31 __handle_mm_fault (mm/memory.c:5097:9)
#32 handle_mm_fault (mm/memory.c:5251:9)
#33 do_user_addr_fault (arch/x86/mm/fault.c:1392:10)
#34 handle_page_fault (arch/x86/mm/fault.c:1486:3)
#35 exc_page_fault (arch/x86/mm/fault.c:1542:2)
#36 asm_exc_page_fault+0x22/0x27 (./arch/x86/include/asm/idtentry.h:570)
There is some circumstantial evidence: this thread has SIGKILL set and
memory pressure is high in its cgroup. When memory.high is reset, the
task exits and pressure drops. So the most likely culprit is the
somewhat vaguely bounded loop in mem_cgroup_handle_over_high().
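That loop is, roughly (simplified, leaving out the penalty/throttling
math):
retry_reclaim:
	nr_reclaimed = reclaim_high(memcg,
				    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
				    gfp_mask);
	/* ... compute penalty_jiffies, bail out if it's negligible ... */
	/* keep retrying as long as reclaim makes any forward progress */
	if (nr_reclaimed || nr_retries--) {
		in_retry = true;
		goto retry_reclaim;
	}
	/* ... otherwise sleep off the penalty ... */
With other threads in the cgroup still allocating, reclaim can keep making
just enough progress for a dying task to spin here for a very long time.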
> > > @@ -2645,6 +2648,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
> > > current->memcg_nr_pages_over_high = 0;
> > >
> > > retry_reclaim:
> > > + if (task_is_dying())
> > > + return;
> > > +
> > > /*
> > > * The allocating task should reclaim at least the batch size, but for
> > > * subsequent retries we only want to do what's necessary to prevent oom
> >
> > Yeah this is the better place for this check.
> >
> > How about this?
>
> Looks really good to me!
>
> I actually thought about moving the check into mem_cgroup_handle_over_high(),
> and you already did it in this version.
Excellent, thanks for your input.
On Thu 11-01-24 14:28:07, Johannes Weiner wrote:
[...]
> @@ -2867,11 +2882,17 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> }
> } while ((memcg = parent_mem_cgroup(memcg)));
>
> + /*
> + * Reclaim is set up above to be called from the userland
> + * return path. But also attempt synchronous reclaim to avoid
> + * excessive overrun while the task is still inside the
> + * kernel. If this is successful, the return path will see it
> + * when it rechecks the overage and simply bail out.
> + */
> if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
> !(current->flags & PF_MEMALLOC) &&
> - gfpflags_allow_blocking(gfp_mask)) {
> + gfpflags_allow_blocking(gfp_mask))
> mem_cgroup_handle_over_high(gfp_mask);
Have you lost the check for the dying task here?
Other than that looks good to me.
--
Michal Hocko
SUSE Labs
On Fri, Jan 12, 2024 at 06:06:39PM +0100, Michal Hocko wrote:
> On Thu 11-01-24 14:28:07, Johannes Weiner wrote:
> [...]
> > @@ -2867,11 +2882,17 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > }
> > } while ((memcg = parent_mem_cgroup(memcg)));
> >
> > + /*
> > + * Reclaim is set up above to be called from the userland
> > + * return path. But also attempt synchronous reclaim to avoid
> > + * excessive overrun while the task is still inside the
> > + * kernel. If this is successful, the return path will see it
> > + * when it rechecks the overage and simply bail out.
> > + */
> > if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
> > !(current->flags & PF_MEMALLOC) &&
> > - gfpflags_allow_blocking(gfp_mask)) {
> > + gfpflags_allow_blocking(gfp_mask))
> > mem_cgroup_handle_over_high(gfp_mask);
>
> Have you lost the check for the dying task here?
It was moved into mem_cgroup_handle_over_high()'s body.
On Fri 12-01-24 09:10:33, Roman Gushchin wrote:
> On Fri, Jan 12, 2024 at 06:06:39PM +0100, Michal Hocko wrote:
> > On Thu 11-01-24 14:28:07, Johannes Weiner wrote:
> > [...]
> > > @@ -2867,11 +2882,17 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > }
> > > } while ((memcg = parent_mem_cgroup(memcg)));
> > >
> > > + /*
> > > + * Reclaim is set up above to be called from the userland
> > > + * return path. But also attempt synchronous reclaim to avoid
> > > + * excessive overrun while the task is still inside the
> > > + * kernel. If this is successful, the return path will see it
> > > + * when it rechecks the overage and simply bail out.
> > > + */
> > > if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
> > > !(current->flags & PF_MEMALLOC) &&
> > > - gfpflags_allow_blocking(gfp_mask)) {
> > > + gfpflags_allow_blocking(gfp_mask))
> > > mem_cgroup_handle_over_high(gfp_mask);
> >
> > Have you lost the check for the dying task here?
>
> It was moved into mem_cgroup_handle_over_high()'s body.
Ohh, right. Somehow overlooked that even when I was staring at that
path.
Acked-by: Michal Hocko <[email protected]>
Thanks!
--
Michal Hocko
SUSE Labs
On Thu, Jan 11, 2024 at 11:28 AM Johannes Weiner <[email protected]> wrote:
>
[...]
>
> From 6124a13cb073f5ff06b9c1309505bc937d65d6e5 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Thu, 11 Jan 2024 07:18:47 -0500
> Subject: [PATCH] mm: memcontrol: don't throttle dying tasks on memory.high
>
> While investigating hosts with high cgroup memory pressures, Tejun
> found culprit zombie tasks that were holding on to a lot of
> memory, had SIGKILL pending, but were stuck in memory.high reclaim.
>
> In the past, we used to always force-charge allocations from tasks
> that were exiting in order to accelerate them dying and freeing up
> their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
> prohibit unconditional exceeding the limit of dying tasks"); it noted
> that this can cause (userspace-inducible) containment failures, so it
> added a mandatory reclaim and OOM kill cycle before forcing charges.
> At the time, memory.high enforcement was handled in the userspace
> return path, which isn't reached by dying tasks, and so memory.high
> was still never enforced by dying tasks.
>
> When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
> overcharges") added synchronous reclaim for memory.high, it added
> unconditional memory.high enforcement for dying tasks as well. The
> callstack shows that this is the path where the zombie is stuck.
>
> We need to accelerate dying tasks getting past memory.high, but we
> cannot do it quite the same way as we do for memory.max: memory.max is
> enforced strictly, and tasks aren't allowed to move past it without
> FIRST reclaiming and OOM killing if necessary. This ensures very small
> levels of excess. With memory.high, though, enforcement happens lazily
> after the charge, and OOM killing is never triggered. A lot of
> concurrent threads could have pushed, or could actively be pushing,
> the cgroup into excess. The dying task will enter reclaim on every
> allocation attempt, with little hope of restoring balance.
>
> To fix this, skip synchronous memory.high enforcement on dying tasks
> altogether again. Update memory.high path documentation while at it.
>
> Fixes: c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges")
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
I am wondering if you have seen or suspected a similar issue but for
remote memcg charging. For example pageout on a global reclaim which
has to allocate buffers for some other memcg.
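For example, folio_alloc_buffers() charges the buffer_heads to the
folio's memcg, roughly:
	/* fs/buffer.c, simplified */
	memcg = folio_memcg(folio);
	old_memcg = set_active_memcg(memcg);
	bh = alloc_buffer_head(gfp);
	set_active_memcg(old_memcg);
so the allocation is charged to a memcg the reclaiming task doesn't belong
to.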
On Fri, Jan 12, 2024 at 11:04:06AM -0800, Shakeel Butt wrote:
> On Thu, Jan 11, 2024 at 11:28 AM Johannes Weiner <[email protected]> wrote:
> >
> [...]
> >
> > From 6124a13cb073f5ff06b9c1309505bc937d65d6e5 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <[email protected]>
> > Date: Thu, 11 Jan 2024 07:18:47 -0500
> > Subject: [PATCH] mm: memcontrol: don't throttle dying tasks on memory.high
> >
> > While investigating hosts with high cgroup memory pressures, Tejun
> > found culprit zombie tasks that were holding on to a lot of
> > memory, had SIGKILL pending, but were stuck in memory.high reclaim.
> >
> > In the past, we used to always force-charge allocations from tasks
> > that were exiting in order to accelerate them dying and freeing up
> > their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
> > prohibit unconditional exceeding the limit of dying tasks"); it noted
> > that this can cause (userspace-inducible) containment failures, so it
> > added a mandatory reclaim and OOM kill cycle before forcing charges.
> > At the time, memory.high enforcement was handled in the userspace
> > return path, which isn't reached by dying tasks, and so memory.high
> > was still never enforced by dying tasks.
> >
> > When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
> > overcharges") added synchronous reclaim for memory.high, it added
> > unconditional memory.high enforcement for dying tasks as well. The
> > callstack shows that this is the path where the zombie is stuck.
> >
> > We need to accelerate dying tasks getting past memory.high, but we
> > cannot do it quite the same way as we do for memory.max: memory.max is
> > enforced strictly, and tasks aren't allowed to move past it without
> > FIRST reclaiming and OOM killing if necessary. This ensures very small
> > levels of excess. With memory.high, though, enforcement happens lazily
> > after the charge, and OOM killing is never triggered. A lot of
> > concurrent threads could have pushed, or could actively be pushing,
> > the cgroup into excess. The dying task will enter reclaim on every
> > allocation attempt, with little hope of restoring balance.
> >
> > To fix this, skip synchronous memory.high enforcement on dying tasks
> > altogether again. Update memory.high path documentation while at it.
> >
> > Fixes: c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges")
> > Reported-by: Tejun Heo <[email protected]>
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Acked-by: Shakeel Butt <[email protected]>
>
> I am wondering if you have seen or suspected a similar issue but for
> remote memcg charging. For example pageout on a global reclaim which
> has to allocate buffers for some other memcg.
You mean dying tasks entering a direct reclaim mode?
Or kswapd being stuck in the reclaim path?