2020-07-09 06:26:47

by Michal Hocko

Subject: [PATCH 1/2] doc, mm: sync up oom_score_adj documentation

From: Michal Hocko <[email protected]>

There are at least two outdated notes in the oom section. The 3% discount for
root processes is gone since d46078b28889 ("mm, oom: remove 3% bonus for
CAP_SYS_ADMIN processes").

Likewise, children of the selected oom victim are no longer sacrificed since
bbbe48029720 ("mm, oom: remove 'prefer children over parent' heuristic").

Drop both of them.

Signed-off-by: Michal Hocko <[email protected]>
---
Documentation/filesystems/proc.rst | 8 --------
1 file changed, 8 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 996f3cfe7030..8e3b5dffcfa8 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1634,9 +1634,6 @@ may allocate from based on an estimation of its current memory and swap use.
For example, if a task is using all allowed memory, its badness score will be
1000. If it is using half of its allowed memory, its score will be 500.

-There is an additional factor included in the badness score: the current memory
-and swap usage is discounted by 3% for root processes.
-
The amount of "allowed" memory depends on the context in which the oom killer
was called. If it is due to the memory assigned to the allocating task's cpuset
being exhausted, the allowed memory represents the set of mems assigned to that
@@ -1672,11 +1669,6 @@ The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
requires CAP_SYS_RESOURCE.

-Caveat: when a parent task is selected, the oom killer will sacrifice any first
-generation children with separate address spaces instead, if possible. This
-avoids servers and important system daemons from being killed and loses the
-minimal amount of work.
-

3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------
--
2.27.0


2020-07-09 06:28:55

by Michal Hocko

Subject: [PATCH 2/2] doc, mm: clarify /proc/<pid>/oom_score value range

From: Michal Hocko <[email protected]>

The exported value includes oom_score_adj so the range is not [0, 1000]
as described in the previous section but rather [0, 2000]. Mention that
fact explicitly.

Signed-off-by: Michal Hocko <[email protected]>
---
Documentation/filesystems/proc.rst | 3 +++
1 file changed, 3 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 8e3b5dffcfa8..78a0dec323a3 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1673,6 +1673,9 @@ requires CAP_SYS_RESOURCE.
3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------

+Please note that the exported value includes oom_score_adj so it is effectively
+in range [0,2000].
+
This file can be used to check the current score used by the oom-killer is for
any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which
process should be killed in an out-of-memory situation.
--
2.27.0
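
The range stated in the patch above can be checked with a little arithmetic.
The following is a minimal sketch of the simplified model the thread describes,
not the kernel's actual code. It assumes the base badness is a value in
[0, 1000], that oom_score_adj lies in [-1000, 1000], and that the sum is floored
at zero. The function name exported_oom_score is hypothetical.

```python
ADJ_MIN, ADJ_MAX = -1000, 1000  # documented bounds of oom_score_adj


def exported_oom_score(base_badness: int, oom_score_adj: int) -> int:
    """Simplified model: base badness plus adjustment, floored at zero.

    Assumption: this mirrors only the range argument made in the thread,
    not the kernel's real computation.
    """
    if not 0 <= base_badness <= 1000:
        raise ValueError("base_badness must be in [0, 1000]")
    if not ADJ_MIN <= oom_score_adj <= ADJ_MAX:
        raise ValueError("oom_score_adj must be in [-1000, 1000]")
    return max(0, base_badness + oom_score_adj)


# The extremes of the model give the two ends of the claimed range.
lowest = exported_oom_score(0, ADJ_MIN)
highest = exported_oom_score(1000, ADJ_MAX)
```

Under these assumptions the lowest value is 0 and the highest is 2000, which
matches the [0, 2000] range the patch documents.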

2020-07-09 07:42:37

by Yafang Shao

Subject: Re: [PATCH 2/2] doc, mm: clarify /proc/<pid>/oom_score value range

On Thu, Jul 9, 2020 at 2:26 PM Michal Hocko <[email protected]> wrote:
>
> From: Michal Hocko <[email protected]>
>
> The exported value includes oom_score_adj so the range is no [0, 1000]
> as described in the previous section but rather [0, 2000]. Mention that
> fact explicitly.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> Documentation/filesystems/proc.rst | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 8e3b5dffcfa8..78a0dec323a3 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -1673,6 +1673,9 @@ requires CAP_SYS_RESOURCE.
> 3.2 /proc/<pid>/oom_score - Display current oom-killer score
> -------------------------------------------------------------
>
> +Please note that the exported value includes oom_score_adj so it is effectively
> +in range [0,2000].
> +

[0, 2000] may not be the proper range; see my reply in another thread [1].
As this value hasn't been documented before and nobody has noticed that, I
think no user has really cared about it so far.
So we should discuss the proper range if we really think users will
care about this value.

[1]. https://lore.kernel.org/linux-mm/CALOAHbAvj-gWZMLef=PuKTfDScwfM8gPPX0evzjoref1bG=zwA@mail.gmail.com/T/#m2332c3e6b7f869383cb74ab3a0f7b6c670b3b23b

> This file can be used to check the current score used by the oom-killer is for
> any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which
> process should be killed in an out-of-memory situation.
> --
> 2.27.0
>


--
Thanks
Yafang

2020-07-09 08:21:36

by Michal Hocko

Subject: Re: [PATCH 2/2] doc, mm: clarify /proc/<pid>/oom_score value range

On Thu 09-07-20 15:41:11, Yafang Shao wrote:
> On Thu, Jul 9, 2020 at 2:26 PM Michal Hocko <[email protected]> wrote:
> >
> > From: Michal Hocko <[email protected]>
> >
> > The exported value includes oom_score_adj so the range is no [0, 1000]
> > as described in the previous section but rather [0, 2000]. Mention that
> > fact explicitly.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > Documentation/filesystems/proc.rst | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index 8e3b5dffcfa8..78a0dec323a3 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -1673,6 +1673,9 @@ requires CAP_SYS_RESOURCE.
> > 3.2 /proc/<pid>/oom_score - Display current oom-killer score
> > -------------------------------------------------------------
> >
> > +Please note that the exported value includes oom_score_adj so it is effectively
> > +in range [0,2000].
> > +
>
> [0, 2000] may be not a proper range, see my reply in another thread.[1]
> As this value hasn't been documented before and nobody notices that, I
> think there might be no user really care about it before.
> So we should discuss the proper range if we really think the user will
> care about this value.

Even if we decide the range should change (which I do not really expect
to happen), it is good to have the existing behavior clarified.

--
Michal Hocko
SUSE Labs

2020-07-09 09:04:28

by Yafang Shao

Subject: Re: [PATCH 2/2] doc, mm: clarify /proc/<pid>/oom_score value range

On Thu, Jul 9, 2020 at 4:18 PM Michal Hocko <[email protected]> wrote:
>
> On Thu 09-07-20 15:41:11, Yafang Shao wrote:
> > On Thu, Jul 9, 2020 at 2:26 PM Michal Hocko <[email protected]> wrote:
> > >
> > > From: Michal Hocko <[email protected]>
> > >
> > > The exported value includes oom_score_adj so the range is no [0, 1000]
> > > as described in the previous section but rather [0, 2000]. Mention that
> > > fact explicitly.
> > >
> > > Signed-off-by: Michal Hocko <[email protected]>
> > > ---
> > > Documentation/filesystems/proc.rst | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > index 8e3b5dffcfa8..78a0dec323a3 100644
> > > --- a/Documentation/filesystems/proc.rst
> > > +++ b/Documentation/filesystems/proc.rst
> > > @@ -1673,6 +1673,9 @@ requires CAP_SYS_RESOURCE.
> > > 3.2 /proc/<pid>/oom_score - Display current oom-killer score
> > > -------------------------------------------------------------
> > >
> > > +Please note that the exported value includes oom_score_adj so it is effectively
> > > +in range [0,2000].
> > > +
> >
> > [0, 2000] may be not a proper range, see my reply in another thread.[1]
> > As this value hasn't been documented before and nobody notices that, I
> > think there might be no user really care about it before.
> > So we should discuss the proper range if we really think the user will
> > care about this value.
>
> Even if we decide the range should change, I do not really assume this
> will happen, it is good to have the existing behavior clarified.
>

But the existing behavior was not defined in the kernel documentation
before, so I don't think users have a clear understanding of
the existing behavior.
The way to use the result of proc_oom_score is to compare which
processes will be killed first by the OOM killer. IOW, the user should
always use it to compare different processes. For example:

    if proc_oom_score(process_a) > proc_oom_score(process_b)
    then
        process_a will be killed before process_b
    fi

And then the user will "Use it together with /proc/<pid>/oom_score_adj
to tune which process should be killed in an out-of-memory situation."

That means what the user really cares about is the relative value, and
they will not care about the range or the absolute value.
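
The comparison described above can be sketched in Python as follows. This is
only an illustration under stated assumptions: the helper names
read_oom_value and likelier_victim are hypothetical, the proc_root parameter
exists only so the code can be pointed at a test directory, and a higher
/proc/<pid>/oom_score is taken to mean a likelier victim at the moment the
files are read. Scores change as memory usage changes.

```python
from pathlib import Path


def read_oom_value(pid, name, proc_root="/proc"):
    """Read an integer such as oom_score or oom_score_adj for one pid."""
    return int((Path(proc_root) / str(pid) / name).read_text().strip())


def likelier_victim(pid_a, pid_b, proc_root="/proc"):
    """Return the pid with the higher oom_score; a tie goes to pid_a."""
    score_a = read_oom_value(pid_a, "oom_score", proc_root)
    score_b = read_oom_value(pid_b, "oom_score", proc_root)
    return pid_a if score_a >= score_b else pid_b
```

On a Linux system, likelier_victim(os.getpid(), 1) would compare the calling
process with pid 1, and read_oom_value(pid, "oom_score_adj") shows the
adjustment that can be tuned to change the outcome.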

--
Thanks
Yafang

2020-07-09 10:02:37

by Michal Hocko

Subject: Re: [PATCH 2/2] doc, mm: clarify /proc/<pid>/oom_score value range

On Thu 09-07-20 17:01:06, Yafang Shao wrote:
> On Thu, Jul 9, 2020 at 4:18 PM Michal Hocko <[email protected]> wrote:
> >
> > On Thu 09-07-20 15:41:11, Yafang Shao wrote:
> > > On Thu, Jul 9, 2020 at 2:26 PM Michal Hocko <[email protected]> wrote:
> > > >
> > > > From: Michal Hocko <[email protected]>
> > > >
> > > > The exported value includes oom_score_adj so the range is no [0, 1000]
> > > > as described in the previous section but rather [0, 2000]. Mention that
> > > > fact explicitly.
> > > >
> > > > Signed-off-by: Michal Hocko <[email protected]>
> > > > ---
> > > > Documentation/filesystems/proc.rst | 3 +++
> > > > 1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > > index 8e3b5dffcfa8..78a0dec323a3 100644
> > > > --- a/Documentation/filesystems/proc.rst
> > > > +++ b/Documentation/filesystems/proc.rst
> > > > @@ -1673,6 +1673,9 @@ requires CAP_SYS_RESOURCE.
> > > > 3.2 /proc/<pid>/oom_score - Display current oom-killer score
> > > > -------------------------------------------------------------
> > > >
> > > > +Please note that the exported value includes oom_score_adj so it is effectively
> > > > +in range [0,2000].
> > > > +
> > >
> > > [0, 2000] may be not a proper range, see my reply in another thread.[1]
> > > As this value hasn't been documented before and nobody notices that, I
> > > think there might be no user really care about it before.
> > > So we should discuss the proper range if we really think the user will
> > > care about this value.
> >
> > Even if we decide the range should change, I do not really assume this
> > will happen, it is good to have the existing behavior clarified.
> >
>
> But the existing behavior is not defined in the kernel documentation
> before, so I don't think that the user has a clear understanding of
> the existing behavior.

Well, documentation is by no means authoritative, especially when it is
outdated or incomplete. What really matters is the observed behavior, and
a lot of userspace depends on it or is built around the specific
implementation.

> The way to use the result of proc_oom_score is to compare which
> processes will be killed first by the OOM killer, IOW, the user should
> always use it to compare different processes. For example,
>
> if proc_oom_score(process_a) > proc_oom_score(process_b)
> then
> process_a will be killed before process_b
> fi
>
> And then the user will "Use it together with
> /proc/<pid>/oom_score_adj to tune which
> process should be killed in an out-of-memory situation."
>
> That means what the user really cares about is the relative value, and
> they will not care about the range or the absolute value.

In an ideal world, yes. But real life tells a different story. Many
times userspace (ab)uses certain undocumented/unintended (mis)features,
and the hard rule is that we never break userspace. We've learned that
through many painful historical experiences. Vaguely defined
functionality suffers from this problem in particular.
--
Michal Hocko
SUSE Labs

2020-07-09 11:21:32

by Yafang Shao

Subject: Re: [PATCH 2/2] doc, mm: clarify /proc/<pid>/oom_score value range

On Thu, Jul 9, 2020 at 5:58 PM Michal Hocko <[email protected]> wrote:
>
> On Thu 09-07-20 17:01:06, Yafang Shao wrote:
> > On Thu, Jul 9, 2020 at 4:18 PM Michal Hocko <[email protected]> wrote:
> > >
> > > On Thu 09-07-20 15:41:11, Yafang Shao wrote:
> > > > On Thu, Jul 9, 2020 at 2:26 PM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > From: Michal Hocko <[email protected]>
> > > > >
> > > > > The exported value includes oom_score_adj so the range is no [0, 1000]
> > > > > as described in the previous section but rather [0, 2000]. Mention that
> > > > > fact explicitly.
> > > > >
> > > > > Signed-off-by: Michal Hocko <[email protected]>
> > > > > ---
> > > > > Documentation/filesystems/proc.rst | 3 +++
> > > > > 1 file changed, 3 insertions(+)
> > > > >
> > > > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > > > index 8e3b5dffcfa8..78a0dec323a3 100644
> > > > > --- a/Documentation/filesystems/proc.rst
> > > > > +++ b/Documentation/filesystems/proc.rst
> > > > > @@ -1673,6 +1673,9 @@ requires CAP_SYS_RESOURCE.
> > > > > 3.2 /proc/<pid>/oom_score - Display current oom-killer score
> > > > > -------------------------------------------------------------
> > > > >
> > > > > +Please note that the exported value includes oom_score_adj so it is effectively
> > > > > +in range [0,2000].
> > > > > +
> > > >
> > > > [0, 2000] may be not a proper range, see my reply in another thread.[1]
> > > > As this value hasn't been documented before and nobody notices that, I
> > > > think there might be no user really care about it before.
> > > > So we should discuss the proper range if we really think the user will
> > > > care about this value.
> > >
> > > Even if we decide the range should change, I do not really assume this
> > > will happen, it is good to have the existing behavior clarified.
> > >
> >
> > But the existing behavior is not defined in the kernel documentation
> > before, so I don't think that the user has a clear understanding of
> > the existing behavior.
>
> Well, documentation is by no means authoritative, especially when it is
> outdated or incomplete. What really matters is the observed behavior and
> a lot of userspace depends on that or based on the specific
> implementation.
>
> > The way to use the result of proc_oom_score is to compare which
> > processes will be killed first by the OOM killer, IOW, the user should
> > always use it to compare different processes. For example,
> >
> > if proc_oom_score(process_a) > proc_oom_score(process_b)
> > then
> > process_a will be killed before process_b
> > fi
> >
> > And then the user will "Use it together with
> > /proc/<pid>/oom_score_adj to tune which
> > process should be killed in an out-of-memory situation."
> >
> > That means what the user really cares about is the relative value, and
> > they will not care about the range or the absolute value.
>
> In an ideal world yes. But the real life tells a different story. Many
> times userspace (ab)uses certain undocumented/unintended (mis)features
> and the hard rule is that we never break userspace. We've learned that
> through many painful historical experiences. Especially vaguely defined
> functionality suffers from the problem.
> --

All right. I won't insist if we think changing the range may break
userspace.

--
Thanks
Yafang