Changes since v2:
- Use single character (e.g. 'R' for MMF_OOM_SKIP) as suggested
by Tetsuo Handa <[email protected]>
- Add new header to oom_dump_tasks documentation
- Provide further justification
The output generated by dump_tasks() can be helpful to determine why
there was an OOM condition and which rogue task potentially caused it.
Please note that this is only provided when sysctl oom_dump_tasks is
enabled.
At the present time, when showing potential OOM victims, we do not
exclude any task that are not OOM eligible e.g. those that have
MMF_OOM_SKIP set; it is possible that the last OOM killable victim was
already OOM killed, yet the OOM reaper failed to reclaim memory and set
MMF_OOM_SKIP. This can be confusing (or perhaps even be misleading) to the
viewer. Now, we already unconditionally display a task's oom_score_adj_min
value that can be set to OOM_SCORE_ADJ_MIN which is indicative of an
"unkillable" task.
This patch provides a clear indication with regard to the OOM ineligibility
(and why) of each displayed task with the addition of a new column namely
"oom_skipped". An example is provided below:
[ 5084.524970] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj oom_skipped name
[ 5084.526397] [660417] 0 660417 35869 683 167936 0 -1000 M conmon
[ 5084.526400] [660452] 0 660452 175834 472 86016 0 -998 pod
[ 5084.527460] [752415] 0 752415 35869 650 172032 0 -1000 M conmon
[ 5084.527462] [752575] 1001050000 752575 184205 11158 700416 0 999 npm
[ 5084.527467] [753606] 1001050000 753606 183380 46843 2134016 0 999 node
[ 5084.527581] Memory cgroup out of memory: Killed process 753606 (node) total-vm:733520kB, anon-rss:161228kB, file-rss:26144kB, shmem-rss:0kB, UID:1001050000
So, a single character 'M' is for OOM_SCORE_ADJ_MIN, 'R' MMF_OOM_SKIP and
'V' for in_vfork().
Signed-off-by: Aaron Tomlin <[email protected]>
---
Documentation/admin-guide/sysctl/vm.rst | 5 ++--
mm/oom_kill.c | 31 +++++++++++++++++++++----
2 files changed, 30 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 003d5cc3751b..4c79fa00ddb3 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -650,8 +650,9 @@ oom_dump_tasks
Enables a system-wide task dump (excluding kernel threads) to be produced
when the kernel performs an OOM-killing and includes such information as
pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
-score, and name. This is helpful to determine why the OOM killer was
-invoked, to identify the rogue task that caused it, and to determine why
+score, oom eligibility status and name. This is helpful to determine why
+the OOM killer was invoked, to identify the rogue task that caused it, and
+to determine why
the OOM killer chose the task it did to kill.
If this is set to zero, this information is suppressed. On very
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c729a4c4a1ac..36daa6917b62 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -160,6 +160,27 @@ static inline bool is_sysrq_oom(struct oom_control *oc)
return oc->order == -1;
}
+/**
+ * is_task_eligible_oom - determine if and why a task cannot be OOM killed
+ * @tsk: task to check
+ *
+ * Needs to be called with task_lock().
+ */
+static const char * const is_task_oom_eligible(struct task_struct *p)
+{
+ long adj;
+
+ adj = (long)p->signal->oom_score_adj;
+ if (adj == OOM_SCORE_ADJ_MIN)
+ return "M";
+ else if (test_bit(MMF_OOM_SKIP, &p->mm->flags)
+ return "R";
+ else if (in_vfork(p))
+ return "V";
+ else
+ return "";
+}
+
/* return true if the task is not adequate as candidate victim task. */
static bool oom_unkillable_task(struct task_struct *p)
{
@@ -401,12 +422,13 @@ static int dump_task(struct task_struct *p, void *arg)
return 0;
}
- pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
+ pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %1s %s\n",
task->pid, from_kuid(&init_user_ns, task_uid(task)),
task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
mm_pgtables_bytes(task->mm),
get_mm_counter(task->mm, MM_SWAPENTS),
- task->signal->oom_score_adj, task->comm);
+ task->signal->oom_score_adj, is_task_oom_eligible(task),
+ task->comm);
task_unlock(task);
return 0;
@@ -420,12 +442,13 @@ static int dump_task(struct task_struct *p, void *arg)
* memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
* are not shown.
* State information includes task's pid, uid, tgid, vm size, rss,
- * pgtables_bytes, swapents, oom_score_adj value, and name.
+ * pgtables_bytes, swapents, oom_score_adj value, oom eligibility status
+ * and name.
*/
static void dump_tasks(struct oom_control *oc)
{
pr_info("Tasks state (memory values in pages):\n");
- pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n");
+ pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj oom_skipped name\n");
if (is_memcg_oom(oc))
mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
--
2.31.1
On Fri, 30 Jul 2021 17:20:02 +0100 Aaron Tomlin <[email protected]> wrote:
> Changes since v2:
> - Use single character (e.g. 'R' for MMF_OOM_SKIP) as suggested
> by Tetsuo Handa <[email protected]>
> - Add new header to oom_dump_tasks documentation
> - Provide further justification
>
>
> The output generated by dump_tasks() can be helpful to determine why
> there was an OOM condition and which rogue task potentially caused it.
> Please note that this is only provided when sysctl oom_dump_tasks is
> enabled.
>
> At the present time, when showing potential OOM victims, we do not
> exclude any task that are not OOM eligible e.g. those that have
> MMF_OOM_SKIP set; it is possible that the last OOM killable victim was
> already OOM killed, yet the OOM reaper failed to reclaim memory and set
> MMF_OOM_SKIP. This can be confusing (or perhaps even be misleading) to the
> viewer. Now, we already unconditionally display a task's oom_score_adj_min
> value that can be set to OOM_SCORE_ADJ_MIN which is indicative of an
> "unkillable" task.
>
> This patch provides a clear indication with regard to the OOM ineligibility
> (and why) of each displayed task with the addition of a new column namely
> "oom_skipped". An example is provided below:
>
> [ 5084.524970] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj oom_skipped name
> [ 5084.526397] [660417] 0 660417 35869 683 167936 0 -1000 M conmon
> [ 5084.526400] [660452] 0 660452 175834 472 86016 0 -998 pod
> [ 5084.527460] [752415] 0 752415 35869 650 172032 0 -1000 M conmon
> [ 5084.527462] [752575] 1001050000 752575 184205 11158 700416 0 999 npm
> [ 5084.527467] [753606] 1001050000 753606 183380 46843 2134016 0 999 node
> [ 5084.527581] Memory cgroup out of memory: Killed process 753606 (node) total-vm:733520kB, anon-rss:161228kB, file-rss:26144kB, shmem-rss:0kB, UID:1001050000
>
> So, a single character 'M' is for OOM_SCORE_ADJ_MIN, 'R' MMF_OOM_SKIP and
> 'V' for in_vfork().
>
> index 003d5cc3751b..4c79fa00ddb3 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -650,8 +650,9 @@ oom_dump_tasks
> Enables a system-wide task dump (excluding kernel threads) to be produced
> when the kernel performs an OOM-killing and includes such information as
> pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
> -score, and name. This is helpful to determine why the OOM killer was
> -invoked, to identify the rogue task that caused it, and to determine why
> +score, oom eligibility status and name. This is helpful to determine why
> +the OOM killer was invoked, to identify the rogue task that caused it, and
> +to determine why
It would be better if the meaning of 'M', 'R' and 'V' were described here.
> the OOM killer chose the task it did to kill.
>
> +/**
> + * is_task_eligible_oom - determine if and why a task cannot be OOM killed
> + * @tsk: task to check
> + *
> + * Needs to be called with task_lock().
> + */
> +static const char * const is_task_oom_eligible(struct task_struct *p)
Name seems inappropriate. task_oom_eligibility()?
> +{
> + long adj;
> +
> + adj = (long)p->signal->oom_score_adj;
> + if (adj == OOM_SCORE_ADJ_MIN)
> + return "M";
> + else if (test_bit(MMF_OOM_SKIP, &p->mm->flags)
> + return "R";
> + else if (in_vfork(p))
> + return "V";
> + else
> + return "";
> +}
On Fri, 30 Jul 2021, Aaron Tomlin wrote:
> Documentation/admin-guide/sysctl/vm.rst | 5 ++--
> mm/oom_kill.c | 31 +++++++++++++++++++++----
> 2 files changed, 30 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 003d5cc3751b..4c79fa00ddb3 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -650,8 +650,9 @@ oom_dump_tasks
> Enables a system-wide task dump (excluding kernel threads) to be produced
> when the kernel performs an OOM-killing and includes such information as
> pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
> -score, and name. This is helpful to determine why the OOM killer was
> -invoked, to identify the rogue task that caused it, and to determine why
> +score, oom eligibility status and name. This is helpful to determine why
> +the OOM killer was invoked, to identify the rogue task that caused it, and
> +to determine why
> the OOM killer chose the task it did to kill.
>
> If this is set to zero, this information is suppressed. On very
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index c729a4c4a1ac..36daa6917b62 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -160,6 +160,27 @@ static inline bool is_sysrq_oom(struct oom_control *oc)
> return oc->order == -1;
> }
>
> +/**
> + * is_task_eligible_oom - determine if and why a task cannot be OOM killed
> + * @tsk: task to check
> + *
> + * Needs to be called with task_lock().
> + */
> +static const char * const is_task_oom_eligible(struct task_struct *p)
> +{
> + long adj;
> +
> + adj = (long)p->signal->oom_score_adj;
> + if (adj == OOM_SCORE_ADJ_MIN)
> + return "M";
oom_score_adj is shown already in the tasklist dump, I'm not sure what
value this adds.
> + else if (test_bit(MMF_OOM_SKIP, &p->mm->flags)
> + return "R";
> + else if (in_vfork(p))
> + return "V";
This is going to be racy, we can't show that a task that is emitted as
part of the tasklist dump was did not have in_vfork() == true at the time
oom_badness() was called.
Wouldn't it be better to simply print the output of oom_badness() to the
tasklist dump instead so we get complete information?
We could simply special case a LONG_MIN return value as -1000 or "min".
> + else
> + return "";
> +}
> +
> /* return true if the task is not adequate as candidate victim task. */
> static bool oom_unkillable_task(struct task_struct *p)
> {
> @@ -401,12 +422,13 @@ static int dump_task(struct task_struct *p, void *arg)
> return 0;
> }
>
> - pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
> + pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %1s %s\n",
> task->pid, from_kuid(&init_user_ns, task_uid(task)),
> task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
> mm_pgtables_bytes(task->mm),
> get_mm_counter(task->mm, MM_SWAPENTS),
> - task->signal->oom_score_adj, task->comm);
> + task->signal->oom_score_adj, is_task_oom_eligible(task),
> + task->comm);
> task_unlock(task);
>
> return 0;
> @@ -420,12 +442,13 @@ static int dump_task(struct task_struct *p, void *arg)
> * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
> * are not shown.
> * State information includes task's pid, uid, tgid, vm size, rss,
> - * pgtables_bytes, swapents, oom_score_adj value, and name.
> + * pgtables_bytes, swapents, oom_score_adj value, oom eligibility status
> + * and name.
> */
> static void dump_tasks(struct oom_control *oc)
> {
> pr_info("Tasks state (memory values in pages):\n");
> - pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n");
> + pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj oom_skipped name\n");
>
> if (is_memcg_oom(oc))
> mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
> --
> 2.31.1
>
>
On Fri 30-07-21 17:20:02, Aaron Tomlin wrote:
> Changes since v2:
> - Use single character (e.g. 'R' for MMF_OOM_SKIP) as suggested
> by Tetsuo Handa <[email protected]>
> - Add new header to oom_dump_tasks documentation
> - Provide further justification
>
>
> The output generated by dump_tasks() can be helpful to determine why
> there was an OOM condition and which rogue task potentially caused it.
> Please note that this is only provided when sysctl oom_dump_tasks is
> enabled.
>
> At the present time, when showing potential OOM victims, we do not
> exclude any task that are not OOM eligible e.g.
Well, this is not precise. We do exclude ineligible. Consider tasks that
are outside of the OOM domain for example. You are right that we are not
excluding all of them though.
> those that have
> MMF_OOM_SKIP set; it is possible that the last OOM killable victim was
> already OOM killed, yet the OOM reaper failed to reclaim memory and set
> MMF_OOM_SKIP. This can be confusing (or perhaps even be misleading) to the
> viewer. Now, we already unconditionally display a task's oom_score_adj_min
> value that can be set to OOM_SCORE_ADJ_MIN which is indicative of an
> "unkillable" task.
>
> This patch provides a clear indication with regard to the OOM ineligibility
> (and why) of each displayed task with the addition of a new column namely
> "oom_skipped". An example is provided below:
>
> [ 5084.524970] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj oom_skipped name
> [ 5084.526397] [660417] 0 660417 35869 683 167936 0 -1000 M conmon
> [ 5084.526400] [660452] 0 660452 175834 472 86016 0 -998 pod
> [ 5084.527460] [752415] 0 752415 35869 650 172032 0 -1000 M conmon
> [ 5084.527462] [752575] 1001050000 752575 184205 11158 700416 0 999 npm
> [ 5084.527467] [753606] 1001050000 753606 183380 46843 2134016 0 999 node
> [ 5084.527581] Memory cgroup out of memory: Killed process 753606 (node) total-vm:733520kB, anon-rss:161228kB, file-rss:26144kB, shmem-rss:0kB, UID:1001050000
>
> So, a single character 'M' is for OOM_SCORE_ADJ_MIN, 'R' MMF_OOM_SKIP and
> 'V' for in_vfork().
I have to say I dislike this for two reasons. First and foremost it
makes parsing unnecessarily more complex. Now you have a potentially
an empty column to special case. Secondly MMF_OOM_SKIP is an internal
state that shouldn't really leak to userspace IMO. in_vfork is a racy
check and M is already expressed via oom_score_adj so it duplicates an
existing information.
If you really want/need to make any change here then I would propose to
either add E(eligible)/I(ligible) column without any specifics or
consistently skip over all tasks which are not eligible.
> 2.31.1
--
Michal Hocko
SUSE Labs
On Sun 2021-08-01 20:49 -0700, David Rientjes wrote:
> oom_score_adj is shown already in the tasklist dump, I'm not sure what
> value this adds.
Fair enough.
>
> > + else if (test_bit(MMF_OOM_SKIP, &p->mm->flags)
> > + return "R";
> > + else if (in_vfork(p))
> > + return "V";
>
> This is going to be racy, we can't show that a task that is emitted as
> part of the tasklist dump was did not have in_vfork() == true at the time
> oom_badness() was called.
Yes, this is true.
> Wouldn't it be better to simply print the output of oom_badness() to the
> tasklist dump instead so we get complete information?
I think this would be acceptable.
> We could simply special case a LONG_MIN return value as -1000 or "min".
Agreed.
Kind regards,
--
Aaron Tomlin
On Mon 2021-08-02 08:34 +0200, Michal Hocko wrote:
> If you really want/need to make any change here then I would propose to
> either add E(eligible)/I(ligible) column without any specifics or
> consistently skip over all tasks which are not eligible.
How about the suggestion made by David i.e. exposing the value returned by
oom_badness(), as if I understand correctly, this would provide a more
complete picture with regard to the rationale used?
Kind regards,
--
Aaron Tomlin
On Mon 02-08-21 16:12:50, Aaron Tomlin wrote:
> On Mon 2021-08-02 08:34 +0200, Michal Hocko wrote:
> > If you really want/need to make any change here then I would propose to
> > either add E(eligible)/I(ligible) column without any specifics or
> > consistently skip over all tasks which are not eligible.
>
> How about the suggestion made by David i.e. exposing the value returned by
> oom_badness(), as if I understand correctly, this would provide a more
> complete picture with regard to the rationale used?
There were some attempts to print oom_score during OOM. E.g.
http://lkml.kernel.org/r/[email protected]. That
one was rejected on the grounds that the number on its own doesn't
really present any real value. It is really only valuable when comparing
to other potential oom victims. I have to say I am still worried about
printing this internal scoring as it should have really been an
implementation detail but with /proc/<pid>/oom_score this is likely a
lost battle and I am willing to give up on that front.
I am still not entirely convinced this is worth doing though.
oom_badness is not a cheap operation. task_lock has to be taken again
during dump_tasks for each task so the already quite expensive operation
will be more so. Is this really something we cannot live without?
--
Michal Hocko
SUSE Labs
On Tue 2021-08-03 09:05 +0200, Michal Hocko wrote:
> There were some attempts to print oom_score during OOM. E.g.
> http://lkml.kernel.org/r/[email protected]. That
> one was rejected on the grounds that the number on its own doesn't
> really present any real value. It is really only valuable when comparing
> to other potential oom victims. I have to say I am still worried about
> printing this internal scoring as it should have really been an
> implementation detail but with /proc/<pid>/oom_score this is likely a
> lost battle and I am willing to give up on that front.
Understood.
> I am still not entirely convinced this is worth doing though.
> oom_badness is not a cheap operation. task_lock has to be taken again
> during dump_tasks for each task so the already quite expensive operation
> will be more so. Is this really something we cannot live without?
Fair enough and I now agree, it is unquestionably not worth it.
Kind regards,
--
Aaron Tomlin