On Thu, 30 Nov 2017, Andrew Morton wrote:
> > This patchset makes the OOM killer cgroup-aware.
>
> Thanks, I'll grab these.
>
> There has been controversy over this patchset, to say the least. I
> can't say that I followed it closely! Could those who still have
> reservations please summarise their concerns and hopefully suggest a
> way forward?
>

Yes, I'll summarize what my concerns have been in the past and what they
are wrt the patchset as it stands in -mm. None of them originate from my
current usecase or anticipated future usecase of the oom killer for
system-wide or memcg-constrained oom conditions. They are based purely on
the patchset's use of an incomplete and unfair heuristic for deciding
which cgroup to target.

I'll also suggest simple changes to the patchset, which I have in the
past, that can be made to address all of these concerns.

1. The unfair comparison of the root mem cgroup vs leaf mem cgroups

The patchset uses two different heuristics to compare root and leaf mem
cgroups and scores them based on number of pages. For the root mem
cgroup, it totals the /proc/pid/oom_score of all attached processes:
that score is based on rss, swap, pgtables, and, most importantly,
oom_score_adj. For leaf mem cgroups, it's based on that memcg's
anonymous, unevictable, unreclaimable slab, kernel stack, and swap
counters. These can be wildly different independent of
/proc/pid/oom_score_adj, but the most obvious unfairness comes from
users who tune oom_score_adj.

An example: start a process that faults 1GB of anonymous memory and
leave it attached to the root mem cgroup. Start six more processes that
each fault 1GB of anonymous memory and attach them to a leaf mem cgroup.
Set all processes to have /proc/pid/oom_score_adj of 1000. A system oom
kill will always kill the 1GB process attached to the root mem cgroup.
That's because oom_badness(), which relies on /proc/pid/oom_score_adj,
is used to evaluate the root mem cgroup, while leaf mem cgroups
completely disregard it.

In this example, the leaf mem cgroup's score is 1,573,044, the number of
pages for the 6GB of faulted memory. The root mem cgroup's score is
12,652,907, eight times larger even though its usage is six times
smaller.

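
To make the arithmetic concrete, here is a back-of-the-envelope model of
the two scoring paths (a sketch, not the kernel code: the 64GB machine
size is hypothetical, and per-task badness is reduced to rss plus the
oom_score_adj bias):

```python
# Simplified model of the two scoring paths described above. Leaf mem
# cgroups are scored on their own usage in pages; the root mem cgroup
# is scored by summing a per-task badness that includes the
# oom_score_adj bias of adj * totalpages / 1000 per task.

PAGE_SIZE = 4096
GB = 1 << 30

def leaf_score(usage_bytes):
    # anon + unevictable + unreclaimable slab + ... in pages;
    # oom_score_adj is ignored entirely
    return usage_bytes // PAGE_SIZE

def root_score(task_rss_bytes, oom_score_adj, totalpages):
    # sum of per-task badness over all root-attached tasks:
    # rss in pages plus the oom_score_adj bias
    return sum(rss // PAGE_SIZE + oom_score_adj * totalpages // 1000
               for rss in task_rss_bytes)

totalpages = (64 * GB) // PAGE_SIZE           # hypothetical 64GB machine

leaf = leaf_score(6 * GB)                      # six 1GB tasks in one leaf memcg
root = root_score([1 * GB], 1000, totalpages)  # one 1GB task, adj = +1000

# the root score dwarfs the leaf score despite six times less usage
print(leaf, root)
```

With adj = 1000, the bias term alone equals the whole machine's page
count, so the single root task outscores the 6GB leaf cgroup by an order
of magnitude.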
This is caused by the patchset disregarding oom_score_adj entirely for
leaf mem cgroups while relying on it heavily for the root mem cgroup.
It's the complete opposite of the result that the cgroup aware oom
killer advertises.

It also works the other way: if a large memory hog is attached to the
root mem cgroup but has a negative oom_score_adj, it is never killed and
random processes are nuked solely because they happened to be attached
to a leaf mem cgroup. This behavior wrt oom_score_adj is completely
undocumented, so I can't presume that it is either known or tested.

Solution: compare the root mem cgroup and leaf mem cgroups equally, with
the same criteria, by doing hierarchical accounting of usage and
subtracting from total system usage to find root usage.

2. Evading the oom killer by attaching processes to child cgroups

Any cgroup on the system can attach all of its processes to individual
child cgroups. This is functionally the same as doing

for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done

without the "no internal process" constraint introduced with cgroup v2.
All child cgroups are evaluated based on their own usage: all anon,
unevictable, and unreclaimable slab as described previously. An
individual cgroup must be the single largest consumer to be targeted by
the oom killer.

An example: allow users to manage two different mem cgroup hierarchies
limited to 100GB each. User A uses 10GB of memory and user B uses 90GB
of memory in their respective hierarchies. On a system oom condition,
we'd expect at least one process from user B's hierarchy to always be
oom killed with the cgroup aware oom killer. In fact, the changelog
explicitly states it solves an issue where "1) There is no fairness
between containers. A small container with few large processes will be
chosen over a large one with huge number of small processes."

The opposite becomes true, however, if user B creates child cgroups and
distributes its processes such that each child cgroup's usage never
exceeds 10GB of memory. This can either be done intentionally, to
purposefully keep a low per-cgroup memory footprint and evade the oom
killer, or unintentionally with cgroup v2, to allow those individual
processes to be constrained by other cgroups in a single hierarchy
model. User A, using 10% of his memory limit, is always oom killed
instead of user B, using 90% of his memory limit.

Others have commented that it's still possible to do this with a
per-process model if users split their processes into many subprocesses
with small memory footprints.

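
The effect of the split can be sketched in a few lines (the usage
numbers are hypothetical; flat selection models the per-cgroup
comparison in the current patchset, hierarchical selection models the
proposed fix):

```python
# User B splits 90GB across ten 9GB child cgroups; user A keeps a
# single 10GB cgroup. Scores are usage in GB for simplicity.
cgroups = {
    "A": {"usage": 10, "children": {}},
    "B": {"usage": 0,
          "children": {f"b{i}": {"usage": 9, "children": {}}
                       for i in range(10)}},
}

def flat_victim(tree):
    # current behaviour: every cgroup is compared on its own usage only
    scores = {}
    def walk(name, node):
        scores[name] = node["usage"]
        for child_name, child in node["children"].items():
            walk(child_name, child)
    for name, node in tree.items():
        walk(name, node)
    return max(scores, key=scores.get)

def hierarchical_victim(tree):
    # proposed behaviour: compare total hierarchical usage at the
    # top level, so splitting into children changes nothing
    def total(node):
        return node["usage"] + sum(total(c)
                                   for c in node["children"].values())
    return max(tree, key=lambda name: total(tree[name]))

print(flat_victim(cgroups))          # "A": 10GB beats every 9GB child
print(hierarchical_victim(cgroups))  # "B": 90GB total is targeted
```

With flat per-cgroup scoring, user A's single 10GB cgroup is the largest
individual consumer and is always targeted; hierarchical totals make the
split irrelevant.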
Solution: comparing cgroups must be done hierarchically. Neither user A
nor user B can evade the oom killer because targeting is done based on the
total hierarchical usage rather than individual cgroups in their
hierarchies.

3. Userspace has zero control over oom kill selection in leaf mem cgroups

Unlike /proc/pid/oom_score_adj, which can be used to bias processes
toward or away from oom kill, the cgroup aware oom killer does not
provide any means for the user to protect leaf mem cgroups. This is a
result of leaf mem cgroups being evaluated based on their anon,
unevictable, and unreclaimable slab usage while disregarding any user
tunable.

Absent the cgroup aware oom killer, users have the ability to strongly
prefer that a process is oom killed (/proc/pid/oom_score_adj = 1000) or
strongly bias against it (/proc/pid/oom_score_adj = -999).

An example: a process knows it's going to use a lot of memory, so it
sets /proc/self/oom_score_adj to 1000. It wants to be killed first to
avoid disrupting any other process. If it's attached to the root mem
cgroup, it will be oom killed. If it's attached to a leaf mem cgroup by
an admin outside its control, it will never be oom killed unless that
cgroup's usage is the largest single cgroup usage on the system. The
reverse is also true for processes that the admin does not want to be
oom killed: set /proc/pid/oom_score_adj to -999, but the process will
*always* be oom killed if its cgroup has the highest usage on the
system.

The result is that both admins and users have lost all control over
which processes are oom killed. They are left with only one
alternative: set /proc/pid/oom_score_adj to -1000 to completely exempt
a process from oom kill. That doesn't address the issue at all for
memcg-constrained oom conditions, since no processes are killable
anymore, and it risks panicking the system if that is the only process
left. A process preferring to be first in line for oom kill simply
cannot volunteer anymore.

Solution: allow users and admins to control oom kill selection by
introducing a memory.oom_score_adj to affect the oom score of that mem
cgroup, exactly the same way /proc/pid/oom_score_adj affects the oom
score of a process.

I proposed a solution in
https://marc.info/?l=linux-kernel&m=150956897302725, which was never
responded to, for all of these issues. The idea is to do hierarchical
accounting of mem cgroup hierarchies so that the hierarchy is traversed
comparing total usage at each level to select target cgroups. Admins
and users can use memory.oom_score_adj to influence that decisionmaking
at each level.

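
A minimal sketch of what that traversal could look like (this is my
reading of the proposed semantics, not code from any patchset: I assume
memory.oom_score_adj biases the hierarchical usage at each level by
adj * totalpages / 1000, mirroring the per-process formula):

```python
def adjusted_usage(node, totalpages):
    # total hierarchical usage in pages, biased by the cgroup's own
    # oom_score_adj the same way per-process badness is biased
    usage = node["usage"] + sum(adjusted_usage(c, totalpages)
                                for c in node["children"].values())
    return usage + node.get("oom_score_adj", 0) * totalpages // 1000

def select_victim_cgroup(top_level, totalpages):
    # walk down level by level, at each level picking the child with
    # the largest adjusted hierarchical usage, until a leaf is reached
    level = top_level
    victim = None
    while level:
        name = max(level,
                   key=lambda n: adjusted_usage(level[n], totalpages))
        victim = name
        level = level[name]["children"]
    return victim

# hypothetical example: a cgroup using 75% of a 1000-page system is
# protected by a negative adjustment, so the smaller cgroup is chosen
totalpages = 1000
tree = {
    "important": {"usage": 750, "oom_score_adj": -500, "children": {}},
    "batch":     {"usage": 300, "oom_score_adj": 0,    "children": {}},
}
print(select_victim_cgroup(tree, totalpages))  # "batch"
```

This shows the point of #3: the 75%-usage cgroup is no longer
unconditionally selected, yet it would still be chosen if its adjusted
usage grew past the others.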
This solves #1 because mem cgroups can be compared based on the same
classes of memory and the root mem cgroup's usage can be fairly compared
by subtracting top-level mem cgroup usage from system usage. All of the
criteria used to evaluate a leaf mem cgroup have reasonable system-wide
counterparts that can be used to do the simple subtraction.

This solves #2 because evaluation is done hierarchically, so that
distributing processes over a set of child cgroups, either intentionally
or unintentionally, no longer evades the oom killer. Total usage is
always accounted to the parent and there is no escaping these criteria
for users.

This solves #3 because it allows admins to protect important processes
in cgroups that are supposed to use, for example, 75% of system memory
without those cgroups being unconditionally selected for oom kill, while
still oom killing them if they exceed a certain threshold. In this
sense, the cgroup aware oom killer, as currently implemented, is selling
mem cgroups short by requiring the user to accept that the important
process will be oom killed iff it uses mem cgroups and isn't attached to
root. It also allows users to actually volunteer to be oom killed first
without having majority usage.

It has come up time and time again that this support can be introduced
on top of the cgroup oom killer as implemented. It simply cannot. For
admins and users to have control over decisionmaking, there needs to be
an oom_score_adj-type tunable that does not change semantics from kernel
version to kernel version and does not pollute the mem cgroup
filesystem. That, in my suggestion, is an adjustment on the total
hierarchical usage of each mem cgroup at each level of the hierarchy.
That requires the heuristic to use hierarchical usage rather than
considering each cgroup as an independent consumer as it does today. We
need to implement that heuristic and introduce userspace influence over
oom kill selection now rather than later, because its implementation
changes how this patchset is implemented.

I can implement these changes, if preferred, on top of the current
patchset, but I do not believe we want inconsistencies between kernel
versions that introduce user-visible changes for the sole reason that
the current implementation is incomplete and unfair. We can implement
and introduce it once, without behavior changing later, because
otherwise the core heuristic will necessarily have to change.

Hello, David!
On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> On Thu, 30 Nov 2017, Andrew Morton wrote:
>
> > > This patchset makes the OOM killer cgroup-aware.
> >
> > Thanks, I'll grab these.
> >
> > There has been controversy over this patchset, to say the least. I
> > can't say that I followed it closely! Could those who still have
> > reservations please summarise their concerns and hopefully suggest a
> > way forward?
> >
>
> Yes, I'll summarize what my concerns have been in the past and what they
> are wrt the patchset as it stands in -mm. None of them originate from my
> current usecase or anticipated future usecase of the oom killer for
> system-wide or memcg-constrained oom conditions. They are based purely on
> the patchset's use of an incomplete and unfair heuristic for deciding
> which cgroup to target.
>
> I'll also suggest simple changes to the patchset, which I have in the
> past, that can be made to address all of these concerns.
>
> 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
>
> The patchset uses two different heuristics to compare root and leaf mem
> cgroups and scores them based on number of pages. For the root mem
> cgroup, it totals the /proc/pid/oom_score of all processes attached:
> that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.
> For leaf mem cgroups, it's based on that memcg's anonymous, unevictable,
> unreclaimable slab, kernel stack, and swap counters. These can be wildly
> different independent of /proc/pid/oom_score_adj, but the most obvious
> unfairness comes from users who tune oom_score_adj.
>
> An example: start a process that faults 1GB of anonymous memory and leave
> it attached to the root mem cgroup. Start six more processes that each
> fault 1GB of anonymous memory and attached them to a leaf mem cgroup. Set
> all processes to have /proc/pid/oom_score_adj of 1000. System oom kill
> will always kill the 1GB process attached to the root mem cgroup. It's
> because oom_badness() relies on /proc/pid/oom_score_adj, which is used to
> evaluate the root mem cgroup, and leaf mem cgroups completely disregard
> it.
>
> In this example, the leaf mem cgroup's score is 1,573,044, the number of
> pages for the 6GB of faulted memory. The root mem cgroup's score is
> 12,652,907, eight times larger even though its usage is six times smaller.
>
> This is caused by the patchset disregarding oom_score_adj entirely for
> leaf mem cgroups and relying on it heavily for the root mem cgroup. It's
> the complete opposite result of what the cgroup aware oom killer
> advertises.
>
> It also works the other way, if a large memory hog is attached to the root
> mem cgroup but has a negative oom_score_adj it is never killed and random
> processes are nuked solely because they happened to be attached to a leaf
> mem cgroup. This behavior wrt oom_score_adj is completely undocumented,
> so I can't presume that it is either known nor tested.
>
> Solution: compare the root mem cgroup and leaf mem cgroups equally with
> the same criteria by doing hierarchical accounting of usage and
> subtracting from total system usage to find root usage.
I find this problem quite minor, because I haven't seen any practical
problems caused by the accounting of root cgroup memory.

If it's a serious problem for you, it can be solved without switching to
hierarchical accounting: it's possible to sum up all leaf cgroup stats
and subtract them from the global values. So, it can be a relatively
small enhancement on top of the current mm tree. This has nothing to do
with the global victim selection approach.

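
If I understand the suggestion, the root estimate would be a simple
subtraction along these lines (a sketch with made-up page counts; the
real implementation would subtract the matching per-class counters, not
one opaque total):

```python
def root_usage_estimate(global_usage_pages, leaf_cgroup_usage_pages):
    # approximate the root memcg's usage for one memory class by
    # subtracting the summed leaf cgroup usage from the global counter
    return global_usage_pages - sum(leaf_cgroup_usage_pages)

# e.g. 1,000,000 pages globally, 750,000 accounted to leaf cgroups,
# leaving 250,000 attributable to the root mem cgroup
print(root_usage_estimate(1_000_000, [300_000, 450_000]))
```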
>
> 2. Evading the oom killer by attaching processes to child cgroups
>
> Any cgroup on the system can attach all their processes to individual
> child cgroups. This is functionally the same as doing
>
> for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
>
> without the no internal process constraint introduced with cgroup v2. All
> child cgroups are evaluated based on their own usage: all anon,
> unevictable, and unreclaimable slab as described previously. It requires
> an individual cgroup to be the single largest consumer to be targeted by
> the oom killer.
>
> An example: allow users to manage two different mem cgroup hierarchies
> limited to 100GB each. User A uses 10GB of memory and user B uses 90GB of
> memory in their respective hierarchies. On a system oom condition, we'd
> expect at least one process from user B's hierarchy would always be oom
> killed with the cgroup aware oom killer. In fact, the changelog
> explicitly states it solves an issue where "1) There is no fairness
> between containers. A small container with few large processes will be
> chosen over a large one with huge number of small processes."
>
> The opposite becomes true, however, if user B creates child cgroups and
> distributes its processes such that each child cgroup's usage never
> exceeds 10GB of memory. This can either be done intentionally to
> purposefully have a low cgroup memory footprint to evade the oom killer or
> unintentionally with cgroup v2 to allow those individual processes to be
> constrained by other cgroups in a single hierarchy model. User A, using
> 10% of his memory limit, is always oom killed instead of user B, using 90%
> of his memory limit.
>
> Others have commented its still possible to do this with a per-process
> model if users split their processes into many subprocesses with small
> memory footprints.
>
> Solution: comparing cgroups must be done hierarchically. Neither user A
> nor user B can evade the oom killer because targeting is done based on the
> total hierarchical usage rather than individual cgroups in their
> hierarchies.
We've discussed this a lot.

The hierarchical approach has its own issues, which we've discussed
during previous iterations of the patchset. If you know how to address
them (I've no idea), please go on and suggest your version.

>
> 3. Userspace has zero control over oom kill selection in leaf mem cgroups
>
> Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes
> from the oom killer, the cgroup aware oom killer does not provide any
> solution for the user to protect leaf mem cgroups. This is a result of
> leaf mem cgroups being evaluated based on their anon, unevictable, and
> unreclaimable slab usage and disregarding any user tunable.
>
> Absent the cgroup aware oom killer, users have the ability to strongly
> prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or
> strongly bias against a process (/proc/pid/oom_score_adj = -999).
>
> An example: a process knows its going to use a lot of memory, so it sets
> /proc/self/oom_score_adj to 1000. It wants to be killed first to avoid
> distrupting any other process. If it's attached to the root mem cgroup,
> it will be oom killed. If it's attached to a leaf mem cgroup by an admin
> outside its control, it will never be oom killed unless that cgroup's
> usage is the largest single cgroup usage on the system. The reverse also
> is true for processes that the admin does not want to be oom killed: set
> /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its
> cgroup has the highest usage on the system.
>
> The result is that both admins and users have lost all control over which
> processes are oom killed. They are left with only one alternative: set
> /proc/pid/oom_score_adj to -1000 to completely disable a process from oom
> kill. It doesn't address the issue at all for memcg-constrained oom
> conditions since no processes are killable anymore, and risks panicking
> the system if it is the only process left on the system. A process
> preferring that it is first in line for oom kill simply cannot volunteer
> anymore.
>
> Solution: allow users and admins to control oom kill selection by
> introducing a memory.oom_score_adj to affect the oom score of that mem
> cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score
> of a process.
The per-process oom_score_adj interface is not the nicest one, and I'm
not sure we want to replicate it at the cgroup level as is. If you have
an idea of what it should look like, please propose a patch; otherwise
it's hard to discuss it without the code.

>
>
> I proposed a solution in
> https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> responded to, for all of these issues. The idea is to do hierarchical
> accounting of mem cgroup hierarchies so that the hierarchy is traversed
> comparing total usage at each level to select target cgroups. Admins and
> users can use memory.oom_score_adj to influence that decisionmaking at
> each level.
>
> This solves #1 because mem cgroups can be compared based on the same
> classes of memory and the root mem cgroup's usage can be fairly compared
> by subtracting top-level mem cgroup usage from system usage. All of the
> criteria used to evaluate a leaf mem cgroup has a reasonable system-wide
> counterpart that can be used to do the simple subtraction.
>
> This solves #2 because evaluation is done hierarchically so that
> distributing processes over a set of child cgroups either intentionally
> or unintentionally no longer evades the oom killer. Total usage is always
> accounted to the parent and there is no escaping this criteria for users.
>
> This solves #3 because it allows admins to protect important processes in
> cgroups that are supposed to use, for example, 75% of system memory
> without it unconditionally being selected for oom kill but still oom kill
> if it exceeds a certain threshold. In this sense, the cgroup aware oom
> killer, as currently implemented, is selling mem cgroups short by
> requiring the user to accept that the important process will be oom killed
> iff it uses mem cgroups and isn't attached to root. It also allows users
> to actually volunteer to be oom killed first without majority usage.
>
> It has come up time and time again that this support can be introduced on
> top of the cgroup oom killer as implemented. It simply cannot. For
> admins and users to have control over decisionmaking, it needs a
> oom_score_adj type tunable that cannot change semantics from kernel
> version to kernel version and without polluting the mem cgroup filesystem.
> That, in my suggestion, is an adjustment on the amount of total
> hierarchical usage of each mem cgroup at each level of the hierarchy.
> That requires that the heuristic uses hierarchical usage rather than
> considering each cgroup as independent consumers as it does today. We
> need to implement that heuristic and introduce userspace influence over
> oom kill selection now rather than later because its implementation
> changes how this patchset is implemented.
>
> I can implement these changes, if preferred, on top of the current
> patchset, but I do not believe we want inconsistencies between kernel
> versions that introduce user visible changes for the sole reason that this
> current implementation is incomplete and unfair. We can implement and
> introduce it once without behavior changing later because the core
> heuristic has necessarily changed.
David, I _had_ hierarchical accounting implemented in one of the
previous versions of this patchset, and there were _reasons_ why we
moved away from it. You can't just ignore them and say "there is a
simple solution to which Roman is not responding". If you know how to
address these issues and convince everybody that the hierarchical
approach is the way to go, please go on and send your version of the
patchset.

Thanks!

Roman

On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <[email protected]> wrote:
> Hello, David!
>
> On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> > On Thu, 30 Nov 2017, Andrew Morton wrote:
> >
> > > > This patchset makes the OOM killer cgroup-aware.
> > >
> > > Thanks, I'll grab these.
> > >
> > > There has been controversy over this patchset, to say the least. I
> > > can't say that I followed it closely! Could those who still have
> > > reservations please summarise their concerns and hopefully suggest a
> > > way forward?
> > >
> >
> > Yes, I'll summarize what my concerns have been in the past and what they
> > are wrt the patchset as it stands in -mm. None of them originate from my
> > current usecase or anticipated future usecase of the oom killer for
> > system-wide or memcg-constrained oom conditions. They are based purely on
> > the patchset's use of an incomplete and unfair heuristic for deciding
> > which cgroup to target.
> >
> > I'll also suggest simple changes to the patchset, which I have in the
> > past, that can be made to address all of these concerns.
> >
> > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> >
> > The patchset uses two different heuristics to compare root and leaf mem
> > cgroups and scores them based on number of pages. For the root mem
> > cgroup, it totals the /proc/pid/oom_score of all processes attached:
> > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.
> > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable,
> > unreclaimable slab, kernel stack, and swap counters. These can be wildly
> > different independent of /proc/pid/oom_score_adj, but the most obvious
> > unfairness comes from users who tune oom_score_adj.
> >
> > An example: start a process that faults 1GB of anonymous memory and leave
> > it attached to the root mem cgroup. Start six more processes that each
> > fault 1GB of anonymous memory and attached them to a leaf mem cgroup. Set
> > all processes to have /proc/pid/oom_score_adj of 1000. System oom kill
> > will always kill the 1GB process attached to the root mem cgroup. It's
> > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to
> > evaluate the root mem cgroup, and leaf mem cgroups completely disregard
> > it.
> >
> > In this example, the leaf mem cgroup's score is 1,573,044, the number of
> > pages for the 6GB of faulted memory. The root mem cgroup's score is
> > 12,652,907, eight times larger even though its usage is six times smaller.
> >
> > This is caused by the patchset disregarding oom_score_adj entirely for
> > leaf mem cgroups and relying on it heavily for the root mem cgroup. It's
> > the complete opposite result of what the cgroup aware oom killer
> > advertises.
> >
> > It also works the other way, if a large memory hog is attached to the root
> > mem cgroup but has a negative oom_score_adj it is never killed and random
> > processes are nuked solely because they happened to be attached to a leaf
> > mem cgroup. This behavior wrt oom_score_adj is completely undocumented,
> > so I can't presume that it is either known nor tested.
> >
> > Solution: compare the root mem cgroup and leaf mem cgroups equally with
> > the same criteria by doing hierarchical accounting of usage and
> > subtracting from total system usage to find root usage.
>
> I find this problem quite minor, because I haven't seen any practical problems
> caused by accounting of the root cgroup memory.
> If it's a serious problem for you, it can be solved without switching to the
> hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> substract them from global values. So, it can be a relatively small enhancement
> on top of the current mm tree. This has nothing to do with global victim selection
> approach.

It sounds like a significant shortcoming to me - the oom-killing
decisions which David describes are clearly incorrect?

If this can be fixed against the -mm patchset with a "relatively small
enhancement" then please let's get that done so it can be reviewed and
tested.

> >
> > 2. Evading the oom killer by attaching processes to child cgroups
> >
> > Any cgroup on the system can attach all their processes to individual
> > child cgroups. This is functionally the same as doing
> >
> > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> >
> > without the no internal process constraint introduced with cgroup v2. All
> > child cgroups are evaluated based on their own usage: all anon,
> > unevictable, and unreclaimable slab as described previously. It requires
> > an individual cgroup to be the single largest consumer to be targeted by
> > the oom killer.
> >
> > An example: allow users to manage two different mem cgroup hierarchies
> > limited to 100GB each. User A uses 10GB of memory and user B uses 90GB of
> > memory in their respective hierarchies. On a system oom condition, we'd
> > expect at least one process from user B's hierarchy would always be oom
> > killed with the cgroup aware oom killer. In fact, the changelog
> > explicitly states it solves an issue where "1) There is no fairness
> > between containers. A small container with few large processes will be
> > chosen over a large one with huge number of small processes."
> >
> > The opposite becomes true, however, if user B creates child cgroups and
> > distributes its processes such that each child cgroup's usage never
> > exceeds 10GB of memory. This can either be done intentionally to
> > purposefully have a low cgroup memory footprint to evade the oom killer or
> > unintentionally with cgroup v2 to allow those individual processes to be
> > constrained by other cgroups in a single hierarchy model. User A, using
> > 10% of his memory limit, is always oom killed instead of user B, using 90%
> > of his memory limit.
> >
> > Others have commented its still possible to do this with a per-process
> > model if users split their processes into many subprocesses with small
> > memory footprints.
> >
> > Solution: comparing cgroups must be done hierarchically. Neither user A
> > nor user B can evade the oom killer because targeting is done based on the
> > total hierarchical usage rather than individual cgroups in their
> > hierarchies.
>
> We've discussed this a lot.
> Hierarchical approach has their own issues, which we've discussed during
> previous iterations of the patchset. If you know how to address them
> (I've no idea), please, go on and suggest your version.

Well, if a hierarchical approach isn't a workable fix for the problem
which David has identified then what *is* the fix?

> >
> > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> >
> > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes
> > from the oom killer, the cgroup aware oom killer does not provide any
> > solution for the user to protect leaf mem cgroups. This is a result of
> > leaf mem cgroups being evaluated based on their anon, unevictable, and
> > unreclaimable slab usage and disregarding any user tunable.
> >
> > Absent the cgroup aware oom killer, users have the ability to strongly
> > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or
> > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> >
> > An example: a process knows its going to use a lot of memory, so it sets
> > /proc/self/oom_score_adj to 1000. It wants to be killed first to avoid
> > distrupting any other process. If it's attached to the root mem cgroup,
> > it will be oom killed. If it's attached to a leaf mem cgroup by an admin
> > outside its control, it will never be oom killed unless that cgroup's
> > usage is the largest single cgroup usage on the system. The reverse also
> > is true for processes that the admin does not want to be oom killed: set
> > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its
> > cgroup has the highest usage on the system.
> >
> > The result is that both admins and users have lost all control over which
> > processes are oom killed. They are left with only one alternative: set
> > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom
> > kill. It doesn't address the issue at all for memcg-constrained oom
> > conditions since no processes are killable anymore, and risks panicking
> > the system if it is the only process left on the system. A process
> > preferring that it is first in line for oom kill simply cannot volunteer
> > anymore.
> >
> > Solution: allow users and admins to control oom kill selection by
> > introducing a memory.oom_score_adj to affect the oom score of that mem
> > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score
> > of a process.
>
> The per-process oom_score_adj interface is not the nicest one, and I'm not
> sure we want to replicate it on cgroup level as is. If you have an idea of how
> it should look like, please, propose a patch; otherwise it's hard to discuss
> it without the code.

It does make sense to have some form of per-cgroup tunability. Why is
the oom_score_adj approach inappropriate and what would be better? How
hard is it to graft such a thing onto the -mm patchset?

> >
> >
> > I proposed a solution in
> > https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> > responded to, for all of these issues. The idea is to do hierarchical
> > accounting of mem cgroup hierarchies so that the hierarchy is traversed
> > comparing total usage at each level to select target cgroups. Admins and
> > users can use memory.oom_score_adj to influence that decisionmaking at
> > each level.
> >
> > This solves #1 because mem cgroups can be compared based on the same
> > classes of memory and the root mem cgroup's usage can be fairly compared
> > by subtracting top-level mem cgroup usage from system usage. All of the
> > criteria used to evaluate a leaf mem cgroup has a reasonable system-wide
> > counterpart that can be used to do the simple subtraction.
> >
> > This solves #2 because evaluation is done hierarchically so that
> > distributing processes over a set of child cgroups either intentionally
> > or unintentionally no longer evades the oom killer. Total usage is always
> > accounted to the parent and there is no escaping this criteria for users.
> >
> > This solves #3 because it allows admins to protect important processes in
> > cgroups that are supposed to use, for example, 75% of system memory
> > without it unconditionally being selected for oom kill but still oom kill
> > if it exceeds a certain threshold. In this sense, the cgroup aware oom
> > killer, as currently implemented, is selling mem cgroups short by
> > requiring the user to accept that the important process will be oom killed
> > iff it uses mem cgroups and isn't attached to root. It also allows users
> > to actually volunteer to be oom killed first without majority usage.
> >
> > It has come up time and time again that this support can be introduced on
> > top of the cgroup oom killer as implemented. It simply cannot. For
> > admins and users to have control over decisionmaking, it needs an
> > oom_score_adj-type tunable that cannot change semantics from kernel
> > version to kernel version and that does not pollute the mem cgroup filesystem.
> > That, in my suggestion, is an adjustment on the amount of total
> > hierarchical usage of each mem cgroup at each level of the hierarchy.
> > That requires that the heuristic uses hierarchical usage rather than
> > considering each cgroup as independent consumers as it does today. We
> > need to implement that heuristic and introduce userspace influence over
> > oom kill selection now rather than later because its implementation
> > changes how this patchset is implemented.
> >
> > I can implement these changes, if preferred, on top of the current
> > patchset, but I do not believe we want inconsistencies between kernel
> > versions that introduce user visible changes for the sole reason that this
> > current implementation is incomplete and unfair. We can implement and
> > introduce it once without behavior changing later because the core
> > heuristic has necessarily changed.
>
> David, I _had_ hierarchical accounting implemented in one of the previous
> versions of this patchset. And there were _reasons_, why we went away from it.
Can you please summarize those issues for my understanding?
> You can't just ignore them and say that "there is a simple solution, which
> Roman is not responding". If you know how to address these issues and
> convince everybody that hierarchical approach is a way to go, please,
> go on and send your version of the patchset.
>
> Thanks!
>
> Roman
On Wed, 10 Jan 2018, Roman Gushchin wrote:
> > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> >
> > The patchset uses two different heuristics to compare root and leaf mem
> > cgroups and scores them based on number of pages. For the root mem
> > cgroup, it totals the /proc/pid/oom_score of all processes attached:
> > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.
> > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable,
> > unreclaimable slab, kernel stack, and swap counters. These can be wildly
> > different independent of /proc/pid/oom_score_adj, but the most obvious
> > unfairness comes from users who tune oom_score_adj.
> >
> > An example: start a process that faults 1GB of anonymous memory and leave
> > it attached to the root mem cgroup. Start six more processes that each
> > fault 1GB of anonymous memory and attach them to a leaf mem cgroup. Set
> > all processes to have /proc/pid/oom_score_adj of 1000. System oom kill
> > will always kill the 1GB process attached to the root mem cgroup. It's
> > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to
> > evaluate the root mem cgroup, and leaf mem cgroups completely disregard
> > it.
> >
> > In this example, the leaf mem cgroup's score is 1,573,044, the number of
> > pages for the 6GB of faulted memory. The root mem cgroup's score is
> > 12,652,907, eight times larger even though its usage is six times smaller.
> >
> > This is caused by the patchset disregarding oom_score_adj entirely for
> > leaf mem cgroups and relying on it heavily for the root mem cgroup. It's
> > the complete opposite result of what the cgroup aware oom killer
> > advertises.
> >
> > It also works the other way, if a large memory hog is attached to the root
> > mem cgroup but has a negative oom_score_adj it is never killed and random
> > processes are nuked solely because they happened to be attached to a leaf
> > mem cgroup. This behavior wrt oom_score_adj is completely undocumented,
> > so I can't presume that it is either known or tested.
> >
> > Solution: compare the root mem cgroup and leaf mem cgroups equally with
> > the same criteria by doing hierarchical accounting of usage and
> > subtracting from total system usage to find root usage.
>
> I find this problem quite minor, because I haven't seen any practical problems
> caused by accounting of the root cgroup memory.
> If it's a serious problem for you, it can be solved without switching to the
> hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> subtract them from global values. So, it can be a relatively small enhancement
> on top of the current mm tree. This has nothing to do with global victim selection
> approach.
>
It's not surprising that this problem is considered minor, since it was
quite obviously never tested when developing the series.

/proc/pid/oom_score_adj being effective when the process is attached to
the root mem cgroup but not when attached to a leaf mem cgroup is a major
problem, and one that was never documented or explained. I'm somewhat
stunned this is considered minor.

Allow me to put a stake in the ground: admins and users need to be able to
influence oom kill decisions. Killing a process matters. The ability to
protect important processes matters, such as processes that are really
supposed to use 50% of a system's memory. As implemented, there is no
control over this.

If a process's oom_score_adj = 1000 (always kill it), this always happens
when it is attached to the root mem cgroup. If it is attached to a leaf
mem cgroup, it never happens unless that individual mem cgroup is the next
largest single consumer of memory. This makes it impossible for a process
to be the preferred oom victim if it is ever attached to any non-root mem
cgroup, and where it is attached can be outside the control of the
application itself. Perhaps worse, an important process with
oom_score_adj = -999 (try to kill everything before it) is always oom
killed first if attached to a leaf mem cgroup with majority usage.
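To make the asymmetry concrete, here is a rough userspace model of the two
scoring paths (plain Python, not kernel code; the machine size, process
sizes, and helper names are all illustrative):

```python
# Rough model of the two scoring paths: the root memcg sums per-process
# badness (which includes oom_score_adj), while a leaf memcg is scored by
# raw memory counters that never consult oom_score_adj.
# All quantities are in 4 KiB pages; constants are illustrative.

TOTALPAGES = 4 * 1024 * 1024  # a hypothetical 16GB machine
GB_PAGES = 1024 * 1024 // 4   # pages per 1GB

def process_badness(rss, swap, pgtables, oom_score_adj):
    # Roughly what oom_badness() computes for one process.
    points = rss + swap + pgtables + oom_score_adj * TOTALPAGES // 1000
    return max(points, 1)

def root_memcg_score(processes):
    # Root memcg: sum of per-process scores, so oom_score_adj counts.
    return sum(process_badness(**p) for p in processes)

def leaf_memcg_score(anon, unevictable, unreclaimable_slab, kernel_stack, swap):
    # Leaf memcg: raw memory counters, oom_score_adj is never consulted.
    return anon + unevictable + unreclaimable_slab + kernel_stack + swap

# One 1GB process in root, six 1GB processes in a leaf, all with adj=1000.
root = root_memcg_score([{"rss": GB_PAGES, "swap": 0, "pgtables": 0,
                          "oom_score_adj": 1000}])
leaf = leaf_memcg_score(anon=6 * GB_PAGES, unevictable=0,
                        unreclaimable_slab=0, kernel_stack=0, swap=0)
# root scores far higher than leaf despite six times less actual usage.
```

The adj=1000 boost alone is worth the entire machine's worth of pages, so
the root memcg always loses this comparison regardless of real usage.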
> > 2. Evading the oom killer by attaching processes to child cgroups
> >
> > Any cgroup on the system can attach all their processes to individual
> > child cgroups. This is functionally the same as doing
> >
> > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> >
> > without the no internal process constraint introduced with cgroup v2. All
> > child cgroups are evaluated based on their own usage: all anon,
> > unevictable, and unreclaimable slab as described previously. It requires
> > an individual cgroup to be the single largest consumer to be targeted by
> > the oom killer.
> >
> > An example: allow users to manage two different mem cgroup hierarchies
> > limited to 100GB each. User A uses 10GB of memory and user B uses 90GB of
> > memory in their respective hierarchies. On a system oom condition, we'd
> > expect at least one process from user B's hierarchy would always be oom
> > killed with the cgroup aware oom killer. In fact, the changelog
> > explicitly states it solves an issue where "1) There is no fairness
> > between containers. A small container with few large processes will be
> > chosen over a large one with huge number of small processes."
> >
> > The opposite becomes true, however, if user B creates child cgroups and
> > distributes its processes such that each child cgroup's usage never
> > exceeds 10GB of memory. This can either be done intentionally to
> > purposefully have a low cgroup memory footprint to evade the oom killer or
> > unintentionally with cgroup v2 to allow those individual processes to be
> > constrained by other cgroups in a single hierarchy model. User A, using
> > 10% of his memory limit, is always oom killed instead of user B, using 90%
> > of his memory limit.
> >
> > Others have commented it's still possible to do this with a per-process
> > model if users split their processes into many subprocesses with small
> > memory footprints.
> >
> > Solution: comparing cgroups must be done hierarchically. Neither user A
> > nor user B can evade the oom killer because targeting is done based on the
> > total hierarchical usage rather than individual cgroups in their
> > hierarchies.
>
> We've discussed this a lot.
> The hierarchical approach has its own issues, which we've discussed during
> previous iterations of the patchset. If you know how to address them
> (I've no idea), please, go on and suggest your version.
>
I asked in https://marc.info/?l=linux-kernel&m=150956897302725 for a clear
example of these issues when there is userspace control over target
selection available. It has received no response, but I can ask again:
please enumerate the concerns you have with hierarchical accounting in the
presence of userspace control over target selection.

Also note that even though userspace control over target selection is
available (through memory.oom_score_adj in my suggestion), it does not
need to be tuned. It can be left to the default value of 0 if there
should be no preference or bias of mem cgroups, just as
/proc/pid/oom_score_adj does not need to be tuned.
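As a concrete sketch of what I'm proposing (again plain Python, not kernel
code; treating memory.oom_score_adj as a per-mille adjustment of total
hierarchical usage, applied at each level, is only one possible semantic):

```python
# Toy model of hierarchical victim selection with a per-cgroup adjustment.
# Not kernel code; the adjustment semantics (per-mille of total usage) is
# an assumption for illustration only.

class Memcg:
    def __init__(self, name, usage=0, oom_score_adj=0, children=None):
        self.name = name
        self.local_usage = usage            # pages charged directly here
        self.oom_score_adj = oom_score_adj  # proposed per-cgroup knob
        self.children = children or []

    def hierarchical_usage(self):
        return self.local_usage + sum(c.hierarchical_usage()
                                      for c in self.children)

def select_victim_memcg(root):
    # Walk down the tree; at each level descend into the child with the
    # highest adjusted hierarchical usage. The leaf we end in is the target.
    total = root.hierarchical_usage()
    node = root
    while node.children:
        node = max(node.children,
                   key=lambda c: c.hierarchical_usage()
                                 + c.oom_score_adj * total // 1000)
    return node

GB = 262144  # pages per 1GB with 4 KiB pages

# User B splits 90GB over nine 10GB children; user A has 10GB in one cgroup.
a = Memcg("A", usage=10 * GB)
b = Memcg("B", children=[Memcg("B/%d" % i, usage=10 * GB) for i in range(9)])
victim = select_victim_memcg(Memcg("root", children=[a, b]))
# The victim comes from B's subtree: splitting no longer evades the killer.
```

Because comparison happens on subtree totals at each level, distributing
processes over child cgroups cannot lower a user's effective footprint.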
> > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> >
> > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes
> > from the oom killer, the cgroup aware oom killer does not provide any
> > solution for the user to protect leaf mem cgroups. This is a result of
> > leaf mem cgroups being evaluated based on their anon, unevictable, and
> > unreclaimable slab usage and disregarding any user tunable.
> >
> > Absent the cgroup aware oom killer, users have the ability to strongly
> > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or
> > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> >
> > An example: a process knows it's going to use a lot of memory, so it sets
> > /proc/self/oom_score_adj to 1000. It wants to be killed first to avoid
> > disrupting any other process. If it's attached to the root mem cgroup,
> > it will be oom killed. If it's attached to a leaf mem cgroup by an admin
> > outside its control, it will never be oom killed unless that cgroup's
> > usage is the largest single cgroup usage on the system. The reverse also
> > is true for processes that the admin does not want to be oom killed: set
> > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its
> > cgroup has the highest usage on the system.
> >
> > The result is that both admins and users have lost all control over which
> > processes are oom killed. They are left with only one alternative: set
> > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom
> > kill. It doesn't address the issue at all for memcg-constrained oom
> > conditions since no processes are killable anymore, and risks panicking
> > the system if it is the only process left on the system. A process
> > preferring that it is first in line for oom kill simply cannot volunteer
> > anymore.
> >
> > Solution: allow users and admins to control oom kill selection by
> > introducing a memory.oom_score_adj to affect the oom score of that mem
> > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score
> > of a process.
>
> The per-process oom_score_adj interface is not the nicest one, and I'm not
> sure we want to replicate it on cgroup level as is. If you have an idea of how
> it should look like, please, propose a patch; otherwise it's hard to discuss
> it without the code.
>
Please read the rest of the email; we cannot introduce
memory.oom_score_adj without hierarchical accounting, which requires a
change to the core heuristic used by this patchset. Because hierarchical
accounting is needed for #1 and #2, the values userspace uses for
memory.oom_score_adj would be radically different for hierarchical usage
vs the current heuristic of local usage. We simply cannot push an
inconsistency that depends on kernel version onto the end user. The core
heuristic needs to change for this to be possible while still addressing
#1 and #2. We do not want to be in a situation where the heuristic
changes after users have constructed their mem cgroup hierarchies to use
the current heuristic; doing so would be a maintenance nightmare and,
frankly, irresponsible.
> > It has come up time and time again that this support can be introduced on
> > top of the cgroup oom killer as implemented. It simply cannot. For
> > admins and users to have control over decisionmaking, it needs an
> > oom_score_adj-type tunable that cannot change semantics from kernel
> > version to kernel version and that does not pollute the mem cgroup filesystem.
> > That, in my suggestion, is an adjustment on the amount of total
> > hierarchical usage of each mem cgroup at each level of the hierarchy.
> > That requires that the heuristic uses hierarchical usage rather than
> > considering each cgroup as independent consumers as it does today. We
> > need to implement that heuristic and introduce userspace influence over
> > oom kill selection now rather than later because its implementation
> > changes how this patchset is implemented.
> >
> > I can implement these changes, if preferred, on top of the current
> > patchset, but I do not believe we want inconsistencies between kernel
> > versions that introduce user visible changes for the sole reason that this
> > current implementation is incomplete and unfair. We can implement and
> > introduce it once without behavior changing later because the core
> > heuristic has necessarily changed.
>
> David, I _had_ hierarchical accounting implemented in one of the previous
> versions of this patchset. And there were _reasons_, why we went away from it.
> You can't just ignore them and say that "there is a simple solution, which
> Roman is not responding". If you know how to address these issues and
> convince everybody that hierarchical approach is a way to go, please,
> go on and send your version of the patchset.
>
I am very much aware that you had hierarchical accounting implemented and
you'll remember that I strongly advocated for it, because it is the
correct way to address cgroup target selection when mem cgroups are
hierarchical by nature. The reasons you're referring to did not account
for the proposed memory.oom_score_adj, but I cannot continue to ask for
those reasons to be enumerated if there is no response. If there are
real-world concerns to what is proposed in my email, I'd love those
concerns to be described as I've described the obvious shortcomings with
this patchset. I'm eager to debate that.
Absent any response that shows #1, #2, and #3 above are incorrect, or that
shows what is proposed in my email makes real-world usecases impossible or
somehow worse than what is proposed here, I'll ask that the cgroup aware
oom killer be removed from -mm and, if you'd like me to do it, I'd
repropose a modified version of your earlier hierarchical accounting
patchset with the added memory.oom_score_adj.
On Wed 10-01-18 11:33:45, Andrew Morton wrote:
> On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <[email protected]> wrote:
>
> > Hello, David!
> >
> > On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> > > On Thu, 30 Nov 2017, Andrew Morton wrote:
> > >
> > > > > This patchset makes the OOM killer cgroup-aware.
> > > >
> > > > Thanks, I'll grab these.
> > > >
> > > > There has been controversy over this patchset, to say the least. I
> > > > can't say that I followed it closely! Could those who still have
> > > > reservations please summarise their concerns and hopefully suggest a
> > > > way forward?
> > > >
> > >
> > > Yes, I'll summarize what my concerns have been in the past and what they
> > > are wrt the patchset as it stands in -mm. None of them originate from my
> > > current usecase or anticipated future usecase of the oom killer for
> > > system-wide or memcg-constrained oom conditions. They are based purely on
> > > the patchset's use of an incomplete and unfair heuristic for deciding
> > > which cgroup to target.
> > >
> > > I'll also suggest simple changes to the patchset, which I have in the
> > > past, that can be made to address all of these concerns.
> > >
> > > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> > >
> > > The patchset uses two different heuristics to compare root and leaf mem
> > > cgroups and scores them based on number of pages. For the root mem
> > > cgroup, it totals the /proc/pid/oom_score of all processes attached:
> > > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.
> > > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable,
> > > unreclaimable slab, kernel stack, and swap counters. These can be wildly
> > > different independent of /proc/pid/oom_score_adj, but the most obvious
> > > unfairness comes from users who tune oom_score_adj.
> > >
> > > An example: start a process that faults 1GB of anonymous memory and leave
> > > it attached to the root mem cgroup. Start six more processes that each
> > > fault 1GB of anonymous memory and attach them to a leaf mem cgroup. Set
> > > all processes to have /proc/pid/oom_score_adj of 1000. System oom kill
> > > will always kill the 1GB process attached to the root mem cgroup. It's
> > > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to
> > > evaluate the root mem cgroup, and leaf mem cgroups completely disregard
> > > it.
> > >
> > > In this example, the leaf mem cgroup's score is 1,573,044, the number of
> > > pages for the 6GB of faulted memory. The root mem cgroup's score is
> > > 12,652,907, eight times larger even though its usage is six times smaller.
> > >
> > > This is caused by the patchset disregarding oom_score_adj entirely for
> > > leaf mem cgroups and relying on it heavily for the root mem cgroup. It's
> > > the complete opposite result of what the cgroup aware oom killer
> > > advertises.
> > >
> > > It also works the other way, if a large memory hog is attached to the root
> > > mem cgroup but has a negative oom_score_adj it is never killed and random
> > > processes are nuked solely because they happened to be attached to a leaf
> > > mem cgroup. This behavior wrt oom_score_adj is completely undocumented,
> > > so I can't presume that it is either known or tested.
> > >
> > > Solution: compare the root mem cgroup and leaf mem cgroups equally with
> > > the same criteria by doing hierarchical accounting of usage and
> > > subtracting from total system usage to find root usage.
> >
> > I find this problem quite minor, because I haven't seen any practical problems
> > caused by accounting of the root cgroup memory.
> > If it's a serious problem for you, it can be solved without switching to the
> > hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> > subtract them from global values. So, it can be a relatively small enhancement
> > on top of the current mm tree. This has nothing to do with global victim selection
> > approach.
>
> It sounds like a significant shortcoming to me - the oom-killing
> decisions which David describes are clearly incorrect?
Well, I would rather look at that from the use case POV. The primary
users of the new OOM killer functionality are containers. I might be
wrong but I _assume_ that the root cgroup will only contain the basic
system infrastructure and all the workload will run in containers (aka
cgroups). The only oom tuning inside the root cgroup would be to disable
oom for some of those processes. The current implementation should work
reasonably well for that configuration.
> If this can be fixed against the -mm patchset with a "relatively small
> enhancement" then please let's get that done so it can be reviewed and
> tested.
The root memcg will always be special, and I would rather base the
semantics on usecases than implement something based on theoretical
examples.
> > > 2. Evading the oom killer by attaching processes to child cgroups
> > >
> > > Any cgroup on the system can attach all their processes to individual
> > > child cgroups. This is functionally the same as doing
> > >
> > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > >
> > > without the no internal process constraint introduced with cgroup v2. All
> > > child cgroups are evaluated based on their own usage: all anon,
> > > unevictable, and unreclaimable slab as described previously. It requires
> > > an individual cgroup to be the single largest consumer to be targeted by
> > > the oom killer.
> > >
> > > An example: allow users to manage two different mem cgroup hierarchies
> > > limited to 100GB each. User A uses 10GB of memory and user B uses 90GB of
> > > memory in their respective hierarchies. On a system oom condition, we'd
> > > expect at least one process from user B's hierarchy would always be oom
> > > killed with the cgroup aware oom killer. In fact, the changelog
> > > explicitly states it solves an issue where "1) There is no fairness
> > > between containers. A small container with few large processes will be
> > > chosen over a large one with huge number of small processes."
> > >
> > > The opposite becomes true, however, if user B creates child cgroups and
> > > distributes its processes such that each child cgroup's usage never
> > > exceeds 10GB of memory. This can either be done intentionally to
> > > purposefully have a low cgroup memory footprint to evade the oom killer or
> > > unintentionally with cgroup v2 to allow those individual processes to be
> > > constrained by other cgroups in a single hierarchy model. User A, using
> > > 10% of his memory limit, is always oom killed instead of user B, using 90%
> > > of his memory limit.
> > >
> > > Others have commented it's still possible to do this with a per-process
> > > model if users split their processes into many subprocesses with small
> > > memory footprints.
> > >
> > > Solution: comparing cgroups must be done hierarchically. Neither user A
> > > nor user B can evade the oom killer because targeting is done based on the
> > > total hierarchical usage rather than individual cgroups in their
> > > hierarchies.
> >
> > We've discussed this a lot.
> > The hierarchical approach has its own issues, which we've discussed during
> > previous iterations of the patchset. If you know how to address them
> > (I've no idea), please, go on and suggest your version.
>
> Well, if a hierarchical approach isn't a workable fix for the problem
> which David has identified then what *is* the fix?
The hierarchical approach basically hardcodes the oom decision into the
hierarchy structure, and that is simply a no-go because it would turn
into a massive configuration PITA (more on that below). I consider the above
example rather artificial, to be honest. Anyway, if we _really_ have to
address it in the future, we can do that by providing a mechanism to
prioritize cgroups. It seems that this is required for some usecases
anyway.
> > > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> > >
> > > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes
> > > from the oom killer, the cgroup aware oom killer does not provide any
> > > solution for the user to protect leaf mem cgroups. This is a result of
> > > leaf mem cgroups being evaluated based on their anon, unevictable, and
> > > unreclaimable slab usage and disregarding any user tunable.
> > >
> > > Absent the cgroup aware oom killer, users have the ability to strongly
> > > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or
> > > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> > >
> > > An example: a process knows it's going to use a lot of memory, so it sets
> > > /proc/self/oom_score_adj to 1000. It wants to be killed first to avoid
> > > disrupting any other process. If it's attached to the root mem cgroup,
> > > it will be oom killed. If it's attached to a leaf mem cgroup by an admin
> > > outside its control, it will never be oom killed unless that cgroup's
> > > usage is the largest single cgroup usage on the system. The reverse also
> > > is true for processes that the admin does not want to be oom killed: set
> > > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its
> > > cgroup has the highest usage on the system.
> > >
> > > The result is that both admins and users have lost all control over which
> > > processes are oom killed. They are left with only one alternative: set
> > > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom
> > > kill. It doesn't address the issue at all for memcg-constrained oom
> > > conditions since no processes are killable anymore, and risks panicking
> > > the system if it is the only process left on the system. A process
> > > preferring that it is first in line for oom kill simply cannot volunteer
> > > anymore.
> > >
> > > Solution: allow users and admins to control oom kill selection by
> > > introducing a memory.oom_score_adj to affect the oom score of that mem
> > > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score
> > > of a process.
> >
> > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > sure we want to replicate it on cgroup level as is. If you have an idea of how
> > it should look like, please, propose a patch; otherwise it's hard to discuss
> > it without the code.
>
> It does make sense to have some form of per-cgroup tunability. Why is
> the oom_score_adj approach inappropriate and what would be better?
oom_score_adj is basically unusable for any fine tuning on the process
level for most setups except for very specialized ones. The only
reasonable usage I've seen so far was to disable OOM killer for a
process or make it a prime candidate. Using the same limited concept for
cgroups sounds like repeating the same error to me.
> How hard is it to graft such a thing onto the -mm patchset?
I think this should be thought through very well before we add another
tuning. Moreover the current usecase doesn't seem to require it so I am
not really sure why we should implement something right away and later
suffer from API mistakes.
> > > I proposed a solution in
> > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> > > responded to, for all of these issues. The idea is to do hierarchical
> > > accounting of mem cgroup hierarchies so that the hierarchy is traversed
> > > comparing total usage at each level to select target cgroups. Admins and
> > > users can use memory.oom_score_adj to influence that decisionmaking at
> > > each level.
> > >
> > > This solves #1 because mem cgroups can be compared based on the same
> > > classes of memory and the root mem cgroup's usage can be fairly compared
> > > by subtracting top-level mem cgroup usage from system usage. All of the
> > > criteria used to evaluate a leaf mem cgroup has a reasonable system-wide
> > > counterpart that can be used to do the simple subtraction.
> > >
> > > This solves #2 because evaluation is done hierarchically so that
> > > distributing processes over a set of child cgroups either intentionally
> > > or unintentionally no longer evades the oom killer. Total usage is always
> > > accounted to the parent and there is no escaping this criteria for users.
> > >
> > > This solves #3 because it allows admins to protect important processes in
> > > cgroups that are supposed to use, for example, 75% of system memory
> > > without it unconditionally being selected for oom kill but still oom kill
> > > if it exceeds a certain threshold. In this sense, the cgroup aware oom
> > > killer, as currently implemented, is selling mem cgroups short by
> > > requiring the user to accept that the important process will be oom killed
> > > iff it uses mem cgroups and isn't attached to root. It also allows users
> > > to actually volunteer to be oom killed first without majority usage.
> > >
> > > It has come up time and time again that this support can be introduced on
> > > top of the cgroup oom killer as implemented. It simply cannot. For
> > > admins and users to have control over decisionmaking, it needs an
> > > oom_score_adj-type tunable that cannot change semantics from kernel
> > > version to kernel version and that does not pollute the mem cgroup filesystem.
> > > That, in my suggestion, is an adjustment on the amount of total
> > > hierarchical usage of each mem cgroup at each level of the hierarchy.
> > > That requires that the heuristic uses hierarchical usage rather than
> > > considering each cgroup as independent consumers as it does today. We
> > > need to implement that heuristic and introduce userspace influence over
> > > oom kill selection now rather than later because its implementation
> > > changes how this patchset is implemented.
> > >
> > > I can implement these changes, if preferred, on top of the current
> > > patchset, but I do not believe we want inconsistencies between kernel
> > > versions that introduce user visible changes for the sole reason that this
> > > current implementation is incomplete and unfair. We can implement and
> > > introduce it once without behavior changing later because the core
> > > heuristic has necessarily changed.
> >
> > David, I _had_ hierarchical accounting implemented in one of the previous
> > versions of this patchset. And there were _reasons_, why we went away from it.
>
> Can you please summarize those issues for my understanding?
Because it makes the oom decision directly hardwired to the hierarchy
structure. Just take a simple example of a cgroup structure which
reflects a higher-level organization:

          root
         / | \
   admins  |  teachers
        students

Now your students group will most likely be the largest one. Why should
we kill tasks/cgroups from that cgroup just because it is cumulatively
the largest one? It might have been one of the teachers blowing up the
memory usage.
Another example is when you need mid-layer cgroups for other
controllers to better control resources:

          root
         /    \
   cpuset1    cpuset2
    /  \     / | \
   A    B   C  D  E
You really do not want to select cpuset2 just because it has more
subgroups and potentially larger cumulative usage. The hierarchical
accounting works _only_ if higher-level cgroups are semantically
_comparable_, which might be true for some workloads but is by no means
true in general.
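To spell the objection out, a toy version of the cpuset example (the
structure and numbers are made up, purely for illustration):

```python
# With cumulative (hierarchical) comparison, the victim comes from cpuset2
# purely because the subtree total is larger, even though the single
# biggest consumer (A) lives under cpuset1. Numbers are arbitrary units.

tree = {
    "root": ["cpuset1", "cpuset2"],
    "cpuset1": ["A", "B"],
    "cpuset2": ["C", "D", "E"],
}
usage = {"A": 50, "B": 10, "C": 31, "D": 30, "E": 30}

def subtree_usage(node):
    return usage.get(node, 0) + sum(subtree_usage(c)
                                    for c in tree.get(node, []))

def pick_hierarchical():
    # Descend into the child subtree with the largest cumulative usage.
    node = "root"
    while node in tree:
        node = max(tree[node], key=subtree_usage)
    return node

def pick_flat():
    # Compare leaf cgroups by their own usage only.
    return max(usage, key=usage.get)

# Flat comparison blames A (the largest single consumer); hierarchical
# comparison descends into cpuset2 (91 > 60) and blames C instead.
```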
That all being said, I can see further improvements happening on top of
the current work, but I also think that the current implementation covers
a usecase which many users can benefit from without those improvements.
--
Michal Hocko
SUSE Labs
On Thu, Jan 11, 2018 at 10:08:09AM +0100, Michal Hocko wrote:
> On Wed 10-01-18 11:33:45, Andrew Morton wrote:
> > On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <[email protected]> wrote:
> >
> > > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > > sure we want to replicate it on cgroup level as is. If you have an idea of how
> > > it should look like, please, propose a patch; otherwise it's hard to discuss
> > > it without the code.
> >
> > It does make sense to have some form of per-cgroup tunability. Why is
> > the oom_score_adj approach inappropriate and what would be better?
>
> oom_score_adj is basically unusable for any fine tuning on the process
> level for most setups except for very specialized ones. The only
> reasonable usage I've seen so far was to disable OOM killer for a
> process or make it a prime candidate. Using the same limited concept for
> cgroups sounds like repeating the same error to me.
My 2c here: the current oom_score_adj semantics are really non-trivial for
cgroups. The knob defines an addition/subtraction in 1/1000s of either total
memory or the OOMing cgroup's memory limit, depending on the scope of the
OOM event. This is totally out of the user's control, because they may have
no idea about the limit of an upper cgroup in the hierarchy. I've provided
an example earlier in which one or the other of two processes in the
same cgroup can be killed, depending on the scope of the OOM event.
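A rough sketch of that semantic (simplified from the kernel's oom_badness() normalization; the sizes below are hypothetical) shows how the same adj value translates into very different score shifts depending on the OOM scope:

```python
# Simplified sketch of oom_badness() normalization -- not the exact kernel code.
def badness(rss_pages, oom_score_adj, totalpages):
    # oom_score_adj shifts the score by adj/1000 of the relevant memory scope:
    # total system memory for a global OOM, the memcg limit for a memcg OOM.
    return rss_pages + oom_score_adj * totalpages // 1000

SYSTEM_PAGES = 4 * 1024 * 1024   # hypothetical 16GB system in 4K pages
MEMCG_LIMIT  = 256 * 1024        # hypothetical 1GB memcg limit in 4K pages

# The same adj of 500 moves the score by wildly different amounts:
global_delta = badness(0, 500, SYSTEM_PAGES)  # half of 16GB worth of pages
memcg_delta  = badness(0, 500, MEMCG_LIMIT)   # half of 1GB worth of pages
```

With these numbers the same adj shifts the global score sixteen times more than the memcg-local score, which is exactly the "out of the user's control" effect described above.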
>
> > How hard is it to graft such a thing onto the -mm patchset?
>
> I think this should be thought through very well before we add another
> tuning. Moreover the current usecase doesn't seem to require it so I am
> not really sure why we should implement something right away and later
> suffer from API mistakes.
>
> > > > I proposed a solution in
> > > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> > > > responded to, for all of these issues. The idea is to do hierarchical
> > > > accounting of mem cgroup hierarchies so that the hierarchy is traversed
> > > > comparing total usage at each level to select target cgroups. Admins and
> > > > users can use memory.oom_score_adj to influence that decisionmaking at
> > > > each level.
> > > >
> > > > This solves #1 because mem cgroups can be compared based on the same
> > > > classes of memory and the root mem cgroup's usage can be fairly compared
> > > > by subtracting top-level mem cgroup usage from system usage. All of the
> > > > criteria used to evaluate a leaf mem cgroup has a reasonable system-wide
> > > > counterpart that can be used to do the simple subtraction.
> > > >
> > > > This solves #2 because evaluation is done hierarchically so that
> > > > distributing processes over a set of child cgroups either intentionally
> > > > or unintentionally no longer evades the oom killer. Total usage is always
> > > > accounted to the parent and there is no escaping this criteria for users.
> > > >
> > > > This solves #3 because it allows admins to protect important processes in
> > > > cgroups that are supposed to use, for example, 75% of system memory
> > > > without it unconditionally being selected for oom kill but still oom kill
> > > > if it exceeds a certain threshold. In this sense, the cgroup aware oom
> > > > killer, as currently implemented, is selling mem cgroups short by
> > > > requiring the user to accept that the important process will be oom killed
> > > > iff it uses mem cgroups and isn't attached to root. It also allows users
> > > > to actually volunteer to be oom killed first without majority usage.
> > > >
> > > > It has come up time and time again that this support can be introduced on
> > > > top of the cgroup oom killer as implemented. It simply cannot. For
> > > > admins and users to have control over decisionmaking, it needs a
> > > > oom_score_adj type tunable that cannot change semantics from kernel
> > > > version to kernel version and without polluting the mem cgroup filesystem.
> > > > That, in my suggestion, is an adjustment on the amount of total
> > > > hierarchical usage of each mem cgroup at each level of the hierarchy.
> > > > That requires that the heuristic uses hierarchical usage rather than
> > > > considering each cgroup as independent consumers as it does today. We
> > > > need to implement that heuristic and introduce userspace influence over
> > > > oom kill selection now rather than later because its implementation
> > > > changes how this patchset is implemented.
> > > >
> > > > I can implement these changes, if preferred, on top of the current
> > > > patchset, but I do not believe we want inconsistencies between kernel
> > > > versions that introduce user visible changes for the sole reason that this
> > > > current implementation is incomplete and unfair. We can implement and
> > > > introduce it once without behavior changing later because the core
> > > > heuristic has necessarily changed.
> > >
> > > David, I _had_ hierarchical accounting implemented in one of the previous
> > > versions of this patchset. And there were _reasons_, why we went away from it.
> >
> > Can you please summarize those issues for my understanding?
>
> Because it makes the oom decision directly hardwired to the hierarchy
> structure. Just take a simple example of the cgroup structure which
> reflects a higher level organization
>            root
>          /  |  \
>    admins   |   teachers
>         students
>
> Now your students group will most likely be the largest one. Why should we
> kill tasks/cgroups from that cgroup just because it is cumulatively the
> largest one? It might have been one of the teachers blowing up the
> memory usage.
>
> Another example is when you need a mid layer cgroups for other
> controllers to better control resources.
>           root
>          /    \
>    cpuset1    cpuset2
>     /   \     /  |  \
>    A     B   C   D   E
>
> You really do not want to select cpuset2 just because it has more
> subgroups and potentially larger cumulative usage. The hierarchical
> accounting works _only_ if higher level cgroups are semantically
> _comparable_ which might be true for some workloads but by no means this
> is true in general.
>
> That all being said, I can see further improvements to happen on top of
> the current work but I also think that the current implementation works
> for the usecase which many users can use without those improvements.
Thank you, Michal, you wrote exactly what I wanted to write here!
Summarizing all this: following the hierarchy is good when it reflects
the "importance" of a cgroup's memory for the user, and bad otherwise.
In the generic case with a unified hierarchy that isn't true, so following
the hierarchy unconditionally is bad.
The current version of the patchset allows common evaluation of cgroups
by setting memory.groupoom to true. The only limitation is that it
also changes the OOM action: all processes belonging to the cgroup will
be killed. If we really want to preserve an option to evaluate cgroups
together without forcing the "kill all processes" action, we can
convert memory.groupoom from a boolean to a multi-value knob,
for instance "disabled" and "killall". That will allow adding
a third state ("evaluate", for example) later without breaking the API.
Re root cgroup evaluation:
I believe it's discussed here mostly as an argument for the
hierarchical approach. The current heuristic can definitely be improved,
but it doesn't require changing the whole semantics. For example,
we can ignore oom_score_adj (except -1000) in this particular case,
which will fix David's example.
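One possible reading of that tweak, sketched in illustrative Python (not a kernel patch; the process layout is hypothetical): when scoring the root cgroup, sum per-process usage but disregard oom_score_adj, except that -1000 still excludes a process entirely since it is unkillable.

```python
# Illustrative sketch of the suggested root-cgroup scoring -- not a patch.
OOM_SCORE_ADJ_MIN = -1000

def root_memcg_score(processes):
    # processes: list of (rss_pages, oom_score_adj) tuples.
    # Ignore oom_score_adj entirely, except that -1000 still means
    # "never kill", so such processes contribute nothing to the score.
    return sum(rss for rss, adj in processes if adj != OOM_SCORE_ADJ_MIN)

procs = [(262144, 1000),   # 1GB task; adj 1000 no longer inflates the score
         (262144, 0),      # 1GB task with default adj
         (262144, -1000)]  # oom-disabled task: excluded entirely
score = root_memcg_score(procs)
```

Under this reading the root cgroup is compared on usage alone, the same class of information leaf cgroups are scored on, which removes the adj-driven distortion in David's example.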
Thank you!
Roman
On Thu, 11 Jan 2018, Michal Hocko wrote:
> > > I find this problem quite minor, because I haven't seen any practical problems
> > > caused by accounting of the root cgroup memory.
> > > If it's a serious problem for you, it can be solved without switching to the
> > > hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> > > subtract them from global values. So, it can be a relatively small enhancement
> > > on top of the current mm tree. This has nothing to do with global victim selection
> > > approach.
> >
> > It sounds like a significant shortcoming to me - the oom-killing
> > decisions which David describes are clearly incorrect?
>
> Well, I would rather look at that from the use case POV. The primary
> user of the new OOM killer functionality are containers. I might be
> wrong but I _assume_ that root cgroup will only contain the basic system
> infrastructure there and all the workload will run containers (aka
> cgroups). The only oom tuning inside the root cgroup would be to disable
> oom for some of those processes. The current implementation should work
> reasonably well for that configuration.
>
It's a decision made at the system level when cgroup2 is mounted, and it
affects all processes that get attached to a leaf cgroup, regardless
of whether or not the memory controller is enabled for that subtree.
In other words, mounting cgroup2 with the cgroup aware oom killer mount
option immediately:
 - makes /proc/pid/oom_score_adj effective only for processes attached
   to the root cgroup, and
 - makes /proc/pid/oom_score_adj a no-op for all processes attached to a
   non-root cgroup.
Note that there may be no active usage of any memory controller at the
time of oom, yet this tunable inconsistency still exists for any process.
The initial response is correct: it clearly produces incorrect oom killing
decisions. This implementation detail requires that the system either is
not containerized at all (no processes attached to any leaf cgroup) or is
fully containerized (all processes attached to leaf cgroups). It's a
highly specialized usecase with very limited scope, and it is fraught
with pitfalls if any oom_score_adj is tuned, because whether that tunable
is effective strictly depends on the cgroup to which those processes are
attached at any given time.
> > > We've discussed this a lot.
> > > Hierarchical approach has their own issues, which we've discussed during
> > > previous iterations of the patchset. If you know how to address them
> > > (I've no idea), please, go on and suggest your version.
> >
> > Well, if a hierarchical approach isn't a workable fix for the problem
> > which David has identified then what *is* the fix?
>
> Hierarchical approach basically hardcodes the oom decision into the
> hierarchy structure and that is simply a no go because that would turn
> into a massive configuration PITA (more on that below). I consider the above
> example rather artificial to be honest. Anyway, if we _really_ have to
> address it in the future we can do that by providing a mechanism to
> prioritize cgroups. It seems that this is required for some usecases
> anyway.
>
I'll address the hierarchical accounting suggestion below.
The example is not artificial, it's not theoretical, it's a real-world
example. Users can attach processes to subcontainers purely for memory
stats from a mem cgroup point of view without limiting usage, or to
subcontainers for use with controllers other than the memory
controller. We do this all the time: it's helpful to assign groups of
processes to subcontainers simply to track statistics while the actual
limitation is enforced by an ancestor.
So without hierarchical accounting, we can extend the above restriction,
that a system is either fully containerized or not containerized at all,
by saying that a fully containerized system must not use subcontainers to
avoid the cgroup aware oom killer heuristic. In other words, using this
feature requires:
- the entire system is not containerized at all, or
- the entire system is fully containerized and no cgroup (any controller,
not just memory) uses subcontainers to intentionally/unintentionally
distribute usage to evade this heuristic.
Of course the second restriction severely limits the flexibility that
cgroup v2 introduces as a whole as a caveat of an implementation detail of
the memory cgroup aware oom killer. Why not simply avoid using the cgroup
aware oom killer? It's not so easy, since it's a property of the
machine itself: users probably have no control over it themselves and, in
the worst case, can trivially evade ever being oom killed if it is used.
> > > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > > sure we want to replicate it on cgroup level as is. If you have an idea of how
> > > it should look like, please, propose a patch; otherwise it's hard to discuss
> > > it without the code.
> >
> > It does make sense to have some form of per-cgroup tunability. Why is
> > the oom_score_adj approach inappropriate and what would be better?
>
> oom_score_adj is basically unusable for any fine tuning on the process
> level for most setups except for very specialized ones. The only
> reasonable usage I've seen so far was to disable OOM killer for a
> process or make it a prime candidate. Using the same limited concept for
> cgroups sounds like repeating the same error to me.
>
oom_score_adj is based on the heuristic the oom killer uses to decide
which process to kill: it kills the largest memory hogging process. Being
able to tune that based on a proportion of available memory, whether for
the system as a whole or for a memory cgroup hierarchy, makes sense for
varying RAM capacities and memory cgroup limits. It works quite well in
practice; I'm not sure what your experience with it has been, given the
needs you have with overcommit.
The ability to protect important cgroups and bias against non-important
cgroups is vital to any selection implementation. Strictly requiring that
important cgroups never have majority usage, which is what this patchset
implements, is not a solution. It's also insufficient to say that this
can be added on later, even though the need is acknowledged, because any
prioritization or userspace influence will require a change to the cgroup
aware oom killer's core heuristic itself. That needs to be decided now
rather than later to avoid differing behavior between kernel versions for
those who adopt this feature and carefully arrange their cgroup
hierarchies to fit with the highly specialized usecase before it's
completely changed out from under them.
> > How hard is it to graft such a thing onto the -mm patchset?
>
> I think this should be thought through very well before we add another
> tuning. Moreover the current usecase doesn't seem to require it so I am
> not really sure why we should implement something right away and later
> suffer from API mistakes.
>
The current usecase is so highly specialized that the feature is not
usable by anybody else, and making it usable for others in the future
will break that highly specialized usecase.
It's so highly specialized that you can't even protect an important
process from oom kill; there is no remedy provided to userspace that still
allows local mem cgroup oom to actually work.
# echo "+memory" > cgroup.subtree_control
# mkdir cg1
# echo $$ > cg1/cgroup.procs
<fork important process>
# echo -999 > /proc/$!/oom_score_adj
# echo f > /proc/sysrq-trigger
This kills the important process if cg1 has the highest usage; the user
has no control other than purposefully evading the oom killer, if
possible, by distributing the process's threads over subcontainers.
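A toy model of the selection above (illustrative Python only; the scoring is a simplification of the patchset's behavior, and the usage numbers are invented) shows why the -999 bias never enters the picture:

```python
# Toy model of the behavior in the sysrq example -- not kernel code.
def leaf_cgroup_score(usage_pages, _oom_score_adjs):
    # Leaf memcg scoring in the patchset is based on usage counters only;
    # per-process oom_score_adj is disregarded at this level.
    return usage_pages

cgroups = {
    "cg1": {"usage": 262144, "procs": {"important": -999}},  # largest usage
    "cg2": {"usage": 131072, "procs": {"other": 0}},
}

# Step 1: pick the cgroup with the highest usage-based score.
victim_cg = max(cgroups,
                key=lambda c: leaf_cgroup_score(cgroups[c]["usage"],
                                                cgroups[c]["procs"]))

# Step 2: a process from that cgroup is killed. The -999 bias never
# entered the cgroup comparison, so the "protected" process still dies.
victim = next(iter(cgroups[victim_cg]["procs"]))
```

The per-process knob is only consulted after the cgroup has already been selected, which is exactly the loss of control being described.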
Because a highly specialized user is proposing this and it works well for
them right now does not mean that it is in the best interest of Linux to
merge, especially if it cannot become generally useful to others without
core changes that affect the original highly specialized user. This is
why the patchset, as is, is currently incomplete.
> > > David, I _had_ hierarchical accounting implemented in one of the previous
> > > versions of this patchset. And there were _reasons_, why we went away from it.
> >
> > Can you please summarize those issues for my understanding?
>
> Because it makes the oom decision directly hardwired to the hierarchy
> structure. Just take a simple example of the cgroup structure which
> reflects a higher level organization
>            root
>          /  |  \
>    admins   |   teachers
>         students
>
> Now your students group will most likely be the largest one. Why should we
> kill tasks/cgroups from that cgroup just because it is cumulatively the
> largest one? It might have been one of the teachers blowing up the
> memory usage.
>
There's one thing missing here: the use of the proposed
memory.oom_score_adj. You can use it to either polarize the decision so
that oom kills always originate from any of /admins, /students, or
/teachers, or bias against them. You can also proportionally prefer
/admins or /teachers, if desired, so they are selected if their usage
exceeds a certain threshold based on the limit of the ancestor, even though
that hierarchical usage does not exceed, or even approach, the hierarchical
usage of /students.
This requires that the entity setting up this hierarchy know only one
thing: the allowed memory usage of /admins or /teachers compared to the
very large usage of /students. It can actually be tuned in great
detail. I only suggest memory.oom_score_adj because it can work for all
/root capacities, regardless of RAM capacity, as a proportion of the
system that should be set aside for /admins, /students, and/or /teachers,
rather than hardcoded bytes, which would change depending on the amount of
RAM you have.
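Assuming a hypothetical memory.oom_score_adj with the same 1/1000 semantics as the per-process knob (a sketch, not the proposed implementation; all numbers are invented), the bias could be modeled roughly like this:

```python
# Hypothetical sketch: memory.oom_score_adj applied to hierarchical usage.
PARENT_LIMIT = 1000  # pages; the ancestor limit the proportions refer to

def adjusted_score(hier_usage, oom_score_adj, parent_limit):
    # Same semantics as the per-process knob: shift the score by
    # adj/1000 of the parent's limit, so the tuning scales with any
    # RAM capacity rather than being hardcoded in bytes.
    return hier_usage + oom_score_adj * parent_limit // 1000

groups = {
    "students": {"usage": 700, "adj": 0},     # huge usage, no bias
    "teachers": {"usage": 100, "adj": 650},   # preferred past a threshold
    "admins":   {"usage": 50,  "adj": -500},  # strongly protected
}

scores = {g: adjusted_score(v["usage"], v["adj"], PARENT_LIMIT)
          for g, v in groups.items()}
selected = max(scores, key=scores.get)
```

With these numbers /teachers is selected (score 750) despite using a fraction of what /students uses (700), illustrating the proportional preference described above, while /admins remains protected.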
FWIW, we do this exact thing and it works quite well. This doesn't mean
that I'm advocating for my own usecase, or anticipated usecase for the
cgroup aware oom killer, but that you've picked an example that I have
personally worked with and the bias can work quite well for oom kill
selection.
On Thu, 11 Jan 2018, Roman Gushchin wrote:
> Summarizing all this, following the hierarchy is good when it reflects
> the "importance" of cgroup's memory for a user, and bad otherwise.
> In generic case with unified hierarchy it's not true, so following
> the hierarchy unconditionally is bad.
>
> Current version of the patchset allows common evaluation of cgroups
> by setting memory.groupoom to true. The only limitation is that it
> also changes the OOM action: all belonging processes will be killed.
> If we really want to preserve an option to evaluate cgroups
> together without forcing "kill all processes" action, we can
> convert memory.groupoom from being boolean to the multi-value knob.
> For instance, "disabled" and "killall". This will allow to add
> a third state ("evaluate", for example) later without breaking the API.
>
No, this isn't how kernel features get introduced. We don't design a new
kernel feature with its own API for a highly specialized usecase and then
claim we'll fix the problems later. Users will work around the
constraints of the new feature, if possible, and then we can end up
breaking them later. Or, we can pollute the mem cgroup v2 filesystem with
even more tunables to cover up for mistakes in earlier designs.
The key point to all three of my objections: extensibility.
Both you and Michal have acknowledged blatantly obvious shortcomings in
the design. We do not need to merge an incomplete patchset that forces us
into a corner later so that we need to add more tunables and behaviors.
It's also an incompletely documented patchset: the user has *no* insight
other than reading the kernel code that with this functionality their
/proc/pid/oom_score_adj values are effective when attached to the root
cgroup and ineffective when attached to a non-root cgroup, regardless of
whether the memory controller is enabled or not.
Extensibility when designing a new kernel feature is important. If you
see shortcomings, which I've enumerated and which have been acknowledged,
they need to be fixed.
The main consideration here comes down to two choices:
(1) any logic to determine process(es) to be oom killed belongs only in
userspace, and has no place in the kernel, or
(2) such logic is complete, documented, extensible, and is generally
useful.
This patchset obviously is neither. I believe we can move toward (2), but
it needs to be sanely designed and implemented such that it addresses the
concerns I've enumerated that we are all familiar with.
The solution is to define the oom policy as a trait of the mem cgroup
itself. That needs to be complemented with a per mem cgroup value member
to specify the policy for child mem cgroups. We don't need any new
memory.groupoom, which depends on only this policy decision and is
otherwise a complete no-op, and we don't need any new mount option.
It's quite obvious the system cannot have a single cgroup aware oom
policy. That's when you get crazy inconsistencies and functionally broken
behavior such as influencing oom kill selection based on whether a cgroup
distributed processes over a set of child cgroups, even if they do not
have memory enabled in their cgroup.subtree_control. It's also where you
can give the user the ability to still prefer to protect important cgroups
or bias against non-important cgroups.
The point is that the functionality needs to be complete and it needs to
be extensible.
I'd propose to introduce a new memory.oom_policy and a new
memory.oom_value. All cgroups have these files, but memory.oom_value is a
no-op for the root mem cgroup, just as you can't limit the memory usage on
the root mem cgroup.
memory.oom_policy defines the policy for selection when that mem cgroup is
oom; system oom conditions refer to the root mem cgroup's
memory.oom_policy.
memory.oom_value defines any parameter that child mem cgroups should be
considered with when effecting that policy.
There are three memory.oom_policy values I can envision at this time, but it
can be extended in the future:
- "adj": cgroup aware just as you have implemented based on how large a
cgroup is. oom_value can be used to define any adjustment made to that
usage so that userspace can protect important cgroups or bias against
non-important cgroups (I'd suggest using that same semantics as
oom_score_adj for consistency, but wouldn't object to using a byte
value).
- "hierarchy": same as "adj" but based on hierarchical usage. oom_value
has the same semantics as "adj". This prevents users from
intentionally or unintentionally affecting oom kill selection by
limiting groups of their processes by the memory controller, or any
other cgroup controller.
- "priority": cgroup aware just as you implemented memory.oom_priority in
the past. oom_value is a strict priority value where usage is not
considered and only the priority of the subtree is compared.
With cgroup v2 semantics of no internal process constraint, this is
extremely straightforward. All of your oom evaluation functions can be
reused, with a simple comparison based on the policy to score individual
cgroups. In the simplest policy, "priority", this is like a 10 line
function, but extremely useful to userspace.
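For instance, the "priority" comparison could be as small as this sketch suggests (illustrative Python with hypothetical interface names; usage is deliberately ignored, per the policy's definition):

```python
# Sketch of the "priority" policy comparison -- hypothetical names.
def select_victim_cgroup(children):
    # children: list of (name, oom_value, hierarchical_usage) tuples
    # directly under the oom cgroup. Under "priority", usage is not
    # considered; only oom_value is compared, with ties broken
    # deterministically by name.
    return max(children, key=lambda c: (c[1], c[0]))[0]

children = [("db", 1, 900), ("batch", 10, 50), ("web", 5, 400)]
victim = select_victim_cgroup(children)  # "batch", despite smallest usage
```

The point is only that the per-policy scoring is a tiny, self-contained comparison layered on top of the existing evaluation walk.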
This allows users to have full power over the decisionmaking in every
subtree wrt oom kill selection and doesn't regress or allow for any
pitfalls of the current patchset. The goal is not to have one single oom
policy for the entire system, but to define the policy that makes sense
for each subtree. This is how an extensible feature is designed, and it
does not require any further pollution of the mem cgroup filesystem.
If you have additional features such as groupoom, you can make
memory.oom_policy comma delimited, just as vmpressure modes are comma
delimited. You would want to use "adj,groupoom". We don't need another
file that is pointless in other policy decisions. We don't need a mount
option to lock the entire system into a single methodology.
Cgroup v2 is a very clean interface and I think it's the responsibility of
every controller to maintain that. We should not fall into a cgroup v1
mentality which became very difficult to make extensible. Let's make a
feature that is generally useful, complete, and empowers the user rather
than push them into a corner with a system wide policy with obvious
downsides.
For these reasons, and the three concerns that I enumerated earlier which
have been acknowledged of obvious shortcoming with this approach:
Nacked-by: David Rientjes <[email protected]>
I'll be happy to implement the core functionality that allows oom policies
to be written by the user and introduce memory.oom_value, and then rework
the logic defined in your patchset as "adj" by giving the user an optional
way of preferring or biasing that usage in a way that is clearly
documented and extensible. Root mem cgroup usage is obviously wrong in this
current patchset since it uses oom_score_adj whereas leaf cgroups do not,
so that will be fixed. But I'll ask that the current patchset is removed
from -mm since it has obvious downsides, pollutes the mem cgroup v2
filesystem, is not extensible, is not documented wrt oom_score_adj,
allows evasion of the heuristic, and doesn't allow the user any
power in the important decision of which of their important processes is
oom killed, such that this feature is not useful outside very specialized
usecases.
On Wed, Jan 10, 2018 at 11:33:45AM -0800, Andrew Morton wrote:
> On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <[email protected]> wrote:
> > On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> > > On Thu, 30 Nov 2017, Andrew Morton wrote:
> > > > > This patchset makes the OOM killer cgroup-aware.
> > > >
> > > > Thanks, I'll grab these.
> > > >
> > > > There has been controversy over this patchset, to say the least. I
> > > > can't say that I followed it closely! Could those who still have
> > > > reservations please summarise their concerns and hopefully suggest a
> > > > way forward?
> > > >
> > >
> > > Yes, I'll summarize what my concerns have been in the past and what they
> > > are wrt the patchset as it stands in -mm. None of them originate from my
> > > current usecase or anticipated future usecase of the oom killer for
> > > system-wide or memcg-constrained oom conditions. They are based purely on
> > > the patchset's use of an incomplete and unfair heuristic for deciding
> > > which cgroup to target.
> > >
> > > I'll also suggest simple changes to the patchset, which I have in the
> > > past, that can be made to address all of these concerns.
> > >
> > > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> > >
> > > The patchset uses two different heuristics to compare root and leaf mem
> > > cgroups and scores them based on number of pages. For the root mem
> > > cgroup, it totals the /proc/pid/oom_score of all processes attached:
> > > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.
> > > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable,
> > > unreclaimable slab, kernel stack, and swap counters. These can be wildly
> > > different independent of /proc/pid/oom_score_adj, but the most obvious
> > > unfairness comes from users who tune oom_score_adj.
> > >
> > > An example: start a process that faults 1GB of anonymous memory and leave
> > > it attached to the root mem cgroup. Start six more processes that each
> > > fault 1GB of anonymous memory and attach them to a leaf mem cgroup. Set
> > > all processes to have /proc/pid/oom_score_adj of 1000. System oom kill
> > > will always kill the 1GB process attached to the root mem cgroup. It's
> > > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to
> > > evaluate the root mem cgroup, and leaf mem cgroups completely disregard
> > > it.
> > >
> > > In this example, the leaf mem cgroup's score is 1,573,044, the number of
> > > pages for the 6GB of faulted memory. The root mem cgroup's score is
> > > 12,652,907, eight times larger even though its usage is six times smaller.
> > >
> > > This is caused by the patchset disregarding oom_score_adj entirely for
> > > leaf mem cgroups and relying on it heavily for the root mem cgroup. It's
> > > the complete opposite result of what the cgroup aware oom killer
> > > advertises.
> > >
> > > It also works the other way, if a large memory hog is attached to the root
> > > mem cgroup but has a negative oom_score_adj it is never killed and random
> > > processes are nuked solely because they happened to be attached to a leaf
> > > mem cgroup. This behavior wrt oom_score_adj is completely undocumented,
> > > so I can't presume that it is either known nor tested.
> > >
> > > Solution: compare the root mem cgroup and leaf mem cgroups equally with
> > > the same criteria by doing hierarchical accounting of usage and
> > > subtracting from total system usage to find root usage.
> >
> > I find this problem quite minor, because I haven't seen any practical problems
> > caused by accounting of the root cgroup memory.
> > If it's a serious problem for you, it can be solved without switching to the
> > hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> > subtract them from global values. So, it can be a relatively small enhancement
> > on top of the current mm tree. This has nothing to do with global victim selection
> > approach.
>
> It sounds like a significant shortcoming to me - the oom-killing
> decisions which David describes are clearly incorrect?
As others have pointed out, it's an academic problem.
You don't have any control and no accounting of the stuff situated
inside the root cgroup, so it doesn't make sense to leave anything in
there while also using sophisticated containerization mechanisms like
this group oom setting.
In fact, the laptop I'm writing this email on runs an unmodified
mainstream Linux distribution. The only things in the root cgroup are
kernel threads.
The decisions are good enough for the rare cases you forget something
in there and it explodes.
> If this can be fixed against the -mm patchset with a "relatively small
> enhancement" then please let's get that done so it can be reviewed and
> tested.
You'd have to sum up all the memory consumed by first-level cgroups
and then subtract from the respective system-wide counters. They're
implemented differently than the cgroup tracking, so it's kind of
messy. But it wouldn't be impossible.
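Conceptually the subtraction is simple; a toy model (illustrative Python with invented numbers, glossing over the counter-implementation mismatch just mentioned):

```python
# Illustrative model of estimating root-cgroup usage by subtraction.
def root_usage(global_counter, top_level_cgroup_counters):
    # Whatever is not accounted to a first-level cgroup must have been
    # consumed from the root cgroup (kernel threads, stragglers).
    return global_counter - sum(top_level_cgroup_counters)

GLOBAL_ANON = 2_000_000            # pages, hypothetical system-wide counter
top_level   = [900_000, 950_000]   # two hypothetical first-level cgroups
root_anon   = root_usage(GLOBAL_ANON, top_level)
```

The messiness Johannes refers to is that the system-wide and per-cgroup counters are maintained by different code paths, so each class of memory (anon, unevictable, slab, ...) needs its own matching pair of counters for the subtraction to be meaningful.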
> > > 2. Evading the oom killer by attaching processes to child cgroups
> > >
> > > Any cgroup on the system can attach all their processes to individual
> > > child cgroups. This is functionally the same as doing
> > >
> > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > >
> > > without the no internal process constraint introduced with cgroup v2. All
> > > child cgroups are evaluated based on their own usage: all anon,
> > > unevictable, and unreclaimable slab as described previously. It requires
> > > an individual cgroup to be the single largest consumer to be targeted by
> > > the oom killer.
> > >
> > > An example: allow users to manage two different mem cgroup hierarchies
> > > limited to 100GB each. User A uses 10GB of memory and user B uses 90GB of
> > > memory in their respective hierarchies. On a system oom condition, we'd
> > > expect at least one process from user B's hierarchy would always be oom
> > > killed with the cgroup aware oom killer. In fact, the changelog
> > > explicitly states it solves an issue where "1) There is no fairness
> > > between containers. A small container with few large processes will be
> > > chosen over a large one with huge number of small processes."
> > >
> > > The opposite becomes true, however, if user B creates child cgroups and
> > > distributes its processes such that each child cgroup's usage never
> > > exceeds 10GB of memory. This can either be done intentionally to
> > > purposefully have a low cgroup memory footprint to evade the oom killer or
> > > unintentionally with cgroup v2 to allow those individual processes to be
> > > constrained by other cgroups in a single hierarchy model. User A, using
> > > 10% of his memory limit, is always oom killed instead of user B, using 90%
> > > of his memory limit.
> > >
> > > Others have commented its still possible to do this with a per-process
> > > model if users split their processes into many subprocesses with small
> > > memory footprints.
> > >
> > > Solution: comparing cgroups must be done hierarchically. Neither user A
> > > nor user B can evade the oom killer because targeting is done based on the
> > > total hierarchical usage rather than individual cgroups in their
> > > hierarchies.
> >
> > We've discussed this a lot.
> > Hierarchical approach has their own issues, which we've discussed during
> > previous iterations of the patchset. If you know how to address them
> > (I've no idea), please, go on and suggest your version.
>
> Well, if a hierarchical approach isn't a workable fix for the problem
> which David has identified then what *is* the fix?
This assumes you even need one. Right now, the OOM killer picks the
biggest MM, so you can evade selection by forking your MM. This patch
allows picking the biggest cgroup, so you can evade by forking groups.
It's not a new vector, and clearly nobody cares. This has never been
brought up against the current design that I know of.
Note, however, that there actually *is* a way to guard against it: in
cgroup2 there is a hierarchical limit you can configure for the number
of cgroups that are allowed to be created in the subtree. See
1a926e0bbab8 ("cgroup: implement hierarchy limits").
> > > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> > >
> > > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes
> > > from the oom killer, the cgroup aware oom killer does not provide any
> > > solution for the user to protect leaf mem cgroups. This is a result of
> > > leaf mem cgroups being evaluated based on their anon, unevictable, and
> > > unreclaimable slab usage and disregarding any user tunable.
> > >
> > > Absent the cgroup aware oom killer, users have the ability to strongly
> > > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or
> > > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> > >
> > > An example: a process knows its going to use a lot of memory, so it sets
> > > /proc/self/oom_score_adj to 1000. It wants to be killed first to avoid
> > > distrupting any other process. If it's attached to the root mem cgroup,
> > > it will be oom killed. If it's attached to a leaf mem cgroup by an admin
> > > outside its control, it will never be oom killed unless that cgroup's
> > > usage is the largest single cgroup usage on the system. The reverse also
> > > is true for processes that the admin does not want to be oom killed: set
> > > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its
> > > cgroup has the highest usage on the system.
> > >
> > > The result is that both admins and users have lost all control over which
> > > processes are oom killed. They are left with only one alternative: set
> > > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom
> > > kill. It doesn't address the issue at all for memcg-constrained oom
> > > conditions since no processes are killable anymore, and risks panicking
> > > the system if it is the only process left on the system. A process
> > > preferring that it is first in line for oom kill simply cannot volunteer
> > > anymore.
> > >
> > > Solution: allow users and admins to control oom kill selection by
> > > introducing a memory.oom_score_adj to affect the oom score of that mem
> > > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score
> > > of a process.
> >
> > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > sure we want to replicate it on cgroup level as is. If you have an idea of how
> > it should look like, please, propose a patch; otherwise it's hard to discuss
> > it without the code.
>
> It does make sense to have some form of per-cgroup tunability. Why is
> the oom_score_adj approach inappropriate and what would be better? How
> hard is it to graft such a thing onto the -mm patchset?
It could be useful, but we have no consensus on the desired
semantics. And it's not clear why we couldn't add it later as long as
the default settings of a new knob maintain the default behavior
(which would have to be preserved anyway, since we rely on it).
> > > I proposed a solution in
> > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> > > responded to, for all of these issues. The idea is to do hierarchical
> > > accounting of mem cgroup hierarchies so that the hierarchy is traversed
> > > comparing total usage at each level to select target cgroups. Admins and
> > > users can use memory.oom_score_adj to influence that decisionmaking at
> > > each level.
We did respond repeatedly: this doesn't work for a lot of setups.
Take our systems, for example. They have two big top-level cgroups:
system stuff - logging, monitoring, package management, etc. - and
then the workload subgroup.
Your scheme would always descend the workload group by default. Why
would that be obviously correct? Something in the system group could
have blown up.
What you propose could be done, but it would have to be opt-in - just
like this oom group feature is opt-in.
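As a sketch of that opt-in (names are taken from this thread's version of the patchset; the mount option and file name may differ in later revisions):

```shell
# Opt in to cgroup-aware oom killing system-wide via the cgroup2
# mount option introduced by the patchset:
mount -t cgroup2 -o groupoom cgroup2 /sys/fs/cgroup

# Then mark a specific workload cgroup as an indivisible unit, so an
# oom kill inside it takes down the whole group:
echo 1 > /sys/fs/cgroup/workload/job1/memory.groupoom
```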
> > > This solves #1 because mem cgroups can be compared based on the same
> > > classes of memory and the root mem cgroup's usage can be fairly compared
> > > by subtracting top-level mem cgroup usage from system usage. All of the
> > > criteria used to evaluate a leaf mem cgroup has a reasonable system-wide
> > > counterpart that can be used to do the simple subtraction.
> > >
> > > This solves #2 because evaluation is done hierarchically so that
> > > distributing processes over a set of child cgroups either intentionally
> > > or unintentionally no longer evades the oom killer. Total usage is always
> > > accounted to the parent and there is no escaping this criteria for users.
> > >
> > > This solves #3 because it allows admins to protect important processes in
> > > cgroups that are supposed to use, for example, 75% of system memory
> > > without it unconditionally being selected for oom kill but still oom kill
> > > if it exceeds a certain threshold. In this sense, the cgroup aware oom
> > > killer, as currently implemented, is selling mem cgroups short by
> > > requiring the user to accept that the important process will be oom killed
> > > iff it uses mem cgroups and isn't attached to root. It also allows users
> > > to actually volunteer to be oom killed first without majority usage.
> > >
> > > It has come up time and time again that this support can be introduced on
> > > top of the cgroup oom killer as implemented. It simply cannot. For
> > > admins and users to have control over decisionmaking, it needs a
> > > oom_score_adj type tunable that cannot change semantics from kernel
> > > version to kernel version and without polluting the mem cgroup filesystem.
> > > That, in my suggestion, is an adjustment on the amount of total
> > > hierarchical usage of each mem cgroup at each level of the hierarchy.
> > > That requires that the heuristic uses hierarchical usage rather than
> > > considering each cgroup as independent consumers as it does today. We
> > > need to implement that heuristic and introduce userspace influence over
> > > oom kill selection now rather than later because its implementation
> > > changes how this patchset is implemented.
> > >
> > > I can implement these changes, if preferred, on top of the current
> > > patchset, but I do not believe we want inconsistencies between kernel
> > > versions that introduce user visible changes for the sole reason that this
> > > current implementation is incomplete and unfair. We can implement and
> > > introduce it once without behavior changing later because the core
> > > heuristic has necessarily changed.
> >
> > David, I _had_ hierarchical accounting implemented in one of the previous
> > versions of this patchset. And there were _reasons_, why we went away from it.
>
> Can you please summarize those issues for my understanding?
Comparing siblings at each level to find OOM victims is not
universally meaningful. As per above, our workload subgroup will
always be bigger than the system subgroup. Does that mean nothing
should *ever* be killed in the system group by default? That's silly.
In such a scheme, configuring priorities on all cgroups down the tree
is not just something you *can* do; it's something you *must* do to
get reasonable behavior.
That's fine for David, who does that already anyway. But we actually
*like* the way the current OOM killer heuristics work: find the
biggest indivisible memory consumer in the system and kill it (we
merely want to extend the concept of "indivisible memory consumer"
from individual MMs to cgroups of interdependent MMs).
Such a heuristic is mutually exclusive with a hierarchy walk. You
can't know, looking at the consumption of system group vs. workload
group, which subtree has the biggest indivisible consumer. The huge
workload group can consist of hundreds of independent jobs.
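The distinction can be made concrete with a toy model (hypothetical numbers and group names, not code from the patchset): a flat comparison of indivisible consumers finds the actual biggest one, while a level-by-level hierarchy walk commits to the biggest subtree first and never looks at the other side.

```python
# Toy tree: the system group holds one runaway 8GB daemon; the
# workload group holds a hundred independent 1GB jobs (usage in GB).
tree = {
    "system":   {"logger": 1, "runaway-monitor": 8},
    "workload": {f"job{i}": 1 for i in range(100)},
}

# Flat selection: compare all indivisible consumers directly.
flat_victim = max(
    ((top, leaf, usage)
     for top, leaves in tree.items()
     for leaf, usage in leaves.items()),
    key=lambda t: t[2],
)

# Hierarchical selection: pick the biggest subtree, then descend.
biggest_top = max(tree, key=lambda g: sum(tree[g].values()))
hier_victim = max(tree[biggest_top], key=tree[biggest_top].get)

print(flat_victim[1])  # the runaway 8GB daemon: the biggest consumer
print(hier_victim)     # some 1GB job inside the 100GB workload group
```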
And a scoring system can't fix this. Once you're in manual mode, you
cannot configure your way back to letting the OOM killer decide.
A hierarchy mode would simply make the existing OOM heuristics
impossible. And we want to use them. So it's not an alternative
proposal, it's in direct opposition of what we're trying to do here.
A way to bias oom_group configured cgroups in their selection could be
useful. But 1) it's not necessary for these patches to be useful (we
certainly don't care). 2) It's absolutely possible to add such knobs
later, as long as you leave the *default* behavior alone. 3) David's
proposal is something entirely different and conflicts with existing
OOM heuristics and our usecase, so it couldn't possibly be the answer.
On Sat, 13 Jan 2018, Johannes Weiner wrote:
> You don't have any control and no accounting of the stuff situated
> inside the root cgroup, so it doesn't make sense to leave anything in
> there while also using sophisticated containerization mechanisms like
> this group oom setting.
>
> In fact, the laptop I'm writing this email on runs an unmodified
> mainstream Linux distribution. The only thing in the root cgroup are
> kernel threads.
>
> The decisions are good enough for the rare cases you forget something
> in there and it explodes.
>
It's quite trivial to allow the root mem cgroup to be compared exactly the
same as another cgroup. Please see
https://marc.info/?l=linux-kernel&m=151579459920305.
> This assumes you even need one. Right now, the OOM killer picks the
> biggest MM, so you can evade selection by forking your MM. This patch
> allows picking the biggest cgroup, so you can evade by forking groups.
>
It's quite trivial to prevent any cgroup from evading the oom killer by
either forking their mm or attaching all their processes to subcontainers.
Please see https://marc.info/?l=linux-kernel&m=151579459920305.
> It's not a new vector, and clearly nobody cares. This has never been
> brought up against the current design that I know of.
>
As cgroup v2 becomes more popular, people will organize their cgroup
hierarchies for all controllers they need to use. We do this today, for
example, by attaching some individual consumers to child mem cgroups
purely for the rich statistics and vmscan stats that mem cgroup provides
without any limitation on those cgroups.
> Note, however, that there actually *is* a way to guard against it: in
> cgroup2 there is a hierarchical limit you can configure for the number
> of cgroups that are allowed to be created in the subtree. See
> 1a926e0bbab8 ("cgroup: implement hierarchy limits").
>
Not allowing the user to create subcontainers to track statistics to paper
over an obvious and acknowledged shortcoming in the design of the cgroup
aware oom killer seems like a pretty nasty shortcoming itself.
> It could be useful, but we have no consensus on the desired
> semantics. And it's not clear why we couldn't add it later as long as
> the default settings of a new knob maintain the default behavior
> (which would have to be preserved anyway, since we rely on it).
>
The active proposal is
https://marc.info/?l=linux-kernel&m=151579459920305, which describes an
extensible interface, one that covers all the shortcomings of this
patchset without polluting the mem cgroup filesystem. The default oom
policy in that proposal would be "none", i.e. we do what we do today,
based on process usage. You can configure local or hierarchical cgroup
targeting without the mount option this patchset introduces.
> > > > I proposed a solution in
> > > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> > > > responded to, for all of these issues. The idea is to do hierarchical
> > > > accounting of mem cgroup hierarchies so that the hierarchy is traversed
> > > > comparing total usage at each level to select target cgroups. Admins and
> > > > users can use memory.oom_score_adj to influence that decisionmaking at
> > > > each level.
>
> We did respond repeatedly: this doesn't work for a lot of setups.
>
We need to move this discussion to the active proposal at
https://marc.info/?l=linux-kernel&m=151579459920305, because it does
address your setup, so it's not a good use of anyone's time to keep
discussing memory.oom_score_adj alone.
Thanks.
On Fri 12-01-18 14:03:03, David Rientjes wrote:
> On Thu, 11 Jan 2018, Roman Gushchin wrote:
>
> > Summarizing all this, following the hierarchy is good when it reflects
> > the "importance" of cgroup's memory for a user, and bad otherwise.
> > In generic case with unified hierarchy it's not true, so following
> > the hierarchy unconditionally is bad.
> >
> > Current version of the patchset allows common evaluation of cgroups
> > by setting memory.groupoom to true. The only limitation is that it
> > also changes the OOM action: all belonging processes will be killed.
> > If we really want to preserve an option to evaluate cgroups
> > together without forcing "kill all processes" action, we can
> > convert memory.groupoom from being boolean to the multi-value knob.
> > For instance, "disabled" and "killall". This will allow adding
> > a third state ("evaluate", for example) later without breaking the API.
> >
>
> No, this isn't how kernel features get introduced. We don't design a new
> kernel feature with its own API for a highly specialized usecase and then
> claim we'll fix the problems later. Users will work around the
> constraints of the new feature, if possible, and then we can end up
> breaking them later. Or, we can pollute the mem cgroup v2 filesystem with
> even more tunables to cover up for mistakes in earlier designs.
This is a blatant misinterpretation of the proposed changes. I haven't
heard _any_ single argument against the proposed user interface except
for complaints about missing tunables. This is not how kernel
development works and should work. The usecase was clearly described and
far from limited to a single workload or company.
> The key point to all three of my objections: extensibility.
And it has been argued that further _features_ can be added on top. I am
absolutely fed up discussing those things again and again without any
progress. You simply keep _ignoring_ counter arguments and that is a
major PITA to be honest with you. You are basically blocking a useful
feature because it doesn't solve your particular workload. This is
simply not acceptable in kernel development.
> Both you and Michal have acknowledged blatantly obvious shortcomings in
> the design.
What you call blatant shortcomings we do not see affecting any
_existing_ workloads. If they turn out to be real issues then we can fix
them without deprecating any user APIs added by this patchset.
Look, I am not the one who is a heavy container user. The primary reason
I am supporting this is that a) it has a _real_ user and I can see
more emerging and b) it _makes_ sense in its current form and
it doesn't bring heavy maintenance burden in the OOM code base which I
do care about.
I am deliberately skipping the rest of your email because it is _yet_
_again_ a form of obstructing the current patchset which is what you
have been doing for quite some time. I will leave the final decision
for merging to Andrew. If you want to build a more fine grained control
on top, you are free to do so. I will be reviewing those like any other
upstream oom changes.
--
Michal Hocko
SUSE Labs
On Sun, Jan 14, 2018 at 03:44:09PM -0800, David Rientjes wrote:
> On Sat, 13 Jan 2018, Johannes Weiner wrote:
>
> > You don't have any control and no accounting of the stuff situated
> > inside the root cgroup, so it doesn't make sense to leave anything in
> > there while also using sophisticated containerization mechanisms like
> > this group oom setting.
> >
> > In fact, the laptop I'm writing this email on runs an unmodified
> > mainstream Linux distribution. The only thing in the root cgroup are
> > kernel threads.
> >
> > The decisions are good enough for the rare cases you forget something
> > in there and it explodes.
>
> It's quite trivial to allow the root mem cgroup to be compared exactly the
> same as another cgroup. Please see
> https://marc.info/?l=linux-kernel&m=151579459920305.
This only says "that will be fixed" and doesn't address why I care.
> > This assumes you even need one. Right now, the OOM killer picks the
> > biggest MM, so you can evade selection by forking your MM. This patch
> > allows picking the biggest cgroup, so you can evade by forking groups.
>
> It's quite trivial to prevent any cgroup from evading the oom killer by
> either forking their mm or attaching all their processes to subcontainers.
> Please see https://marc.info/?l=linux-kernel&m=151579459920305.
This doesn't address anything I wrote.
> > It's not a new vector, and clearly nobody cares. This has never been
> > brought up against the current design that I know of.
>
> As cgroup v2 becomes more popular, people will organize their cgroup
> hierarchies for all controllers they need to use. We do this today, for
> example, by attaching some individual consumers to child mem cgroups
> purely for the rich statistics and vmscan stats that mem cgroup provides
> without any limitation on those cgroups.
There is no connecting tissue between what I wrote and what you wrote.
> > Note, however, that there actually *is* a way to guard against it: in
> > cgroup2 there is a hierarchical limit you can configure for the number
> > of cgroups that are allowed to be created in the subtree. See
> > 1a926e0bbab8 ("cgroup: implement hierarchy limits").
>
> Not allowing the user to create subcontainers to track statistics to paper
> over an obvious and acknowledged shortcoming in the design of the cgroup
> aware oom killer seems like a pretty nasty shortcoming itself.
It's not what I proposed. There is a big difference between cgroup
fork bombs and having a couple of groups for statistics.
> > > > > I proposed a solution in
> > > > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> > > > > responded to, for all of these issues. The idea is to do hierarchical
> > > > > accounting of mem cgroup hierarchies so that the hierarchy is traversed
> > > > > comparing total usage at each level to select target cgroups. Admins and
> > > > > users can use memory.oom_score_adj to influence that decisionmaking at
> > > > > each level.
> >
> > We did respond repeatedly: this doesn't work for a lot of setups.
>
> We need to move this discussion to the active proposal at
> https://marc.info/?l=linux-kernel&m=151579459920305, because it does
> address your setup, so it's not a good use of anyone's time to keep
> discussing memory.oom_score_adj alone.
No, we don't.
We have a patch set that was developed, iterated and improved over 10+
revisions, based on evaluating and weighing trade-offs. It's reached a
state where the memcg maintainers are happy with it and don't have any
concern about future extendability to cover more specialized setups.
You've had nine months to contribute and shape this patch series
productively, and instead resorted to a cavalcade of polemics,
evasion, and repetition of truisms and refuted points. A ten paragraph
proposal of vague ideas at this point is simply not a valid argument.
You can send patches to replace or improve on Roman's code and make
the case for them.
Thanks
On Mon, 15 Jan 2018, Johannes Weiner wrote:
> > It's quite trivial to allow the root mem cgroup to be compared exactly the
> > same as another cgroup. Please see
> > https://marc.info/?l=linux-kernel&m=151579459920305.
>
> This only says "that will be fixed" and doesn't address why I care.
>
Because the design of the cgroup aware oom killer requires the entire
system to be fully containerized to be useful and select the
correct/anticipated victim. If anything is left behind in the root mem
cgroup, or on systems that use mem cgroups in ways that you do not, it
returns wildly unpredictable and downright wrong results; it factors
oom_score_adj into the selection criteria only for processes attached to
the root mem cgroup and ignores it otherwise. If we treated root and leaf
cgroups as comparable, this problem wouldn't arise.
> > > This assumes you even need one. Right now, the OOM killer picks the
> > > biggest MM, so you can evade selection by forking your MM. This patch
> > > allows picking the biggest cgroup, so you can evade by forking groups.
> >
> > It's quite trivial to prevent any cgroup from evading the oom killer by
> > either forking their mm or attaching all their processes to subcontainers.
> > Please see https://marc.info/?l=linux-kernel&m=151579459920305.
>
> This doesn't address anything I wrote.
>
It prevents both problems if you are attached to a mem cgroup subtree.
You can't fork the mm and you can't fork groups to evade the selection
logic. If the root mem cgroup and leaf cgroups were truly comparable, it
also prevents both problems regardless of which cgroup the processes
attached to.
> > > It's not a new vector, and clearly nobody cares. This has never been
> > > brought up against the current design that I know of.
> >
> > As cgroup v2 becomes more popular, people will organize their cgroup
> > hierarchies for all controllers they need to use. We do this today, for
> > example, by attaching some individual consumers to child mem cgroups
> > purely for the rich statistics and vmscan stats that mem cgroup provides
> > without any limitation on those cgroups.
>
> There is no connecting tissue between what I wrote and what you wrote.
>
You're ignoring every usecase other than your own highly
specialized one. You may attach every user process on the system to
leaf cgroups and you may not have any users who have control over a
subtree. Other people do. And this patchset, as implemented, returns
very inaccurate results or allows users to intentionally or
unintentionally evade the oom killer just because they want to use
subcontainers.
Creating and attaching a subset of processes to either top-level
containers or subcontainers for either limitation by mem cgroup or for
statistics is a valid usecase, and one that is used in practice. Your
suggestion of avoiding that problem to work around the shortcomings of
this patchset by limiting how many subcontainers can be created by the
user is utterly ridiculous.
> We have a patch set that was developed, iterated and improved over 10+
> revisions, based on evaluating and weighing trade-offs. It's reached a
> state where the memcg maintainers are happy with it and don't have any
> concern about future extendability to cover more specialized setups.
>
It's also obviously untested in those 10+ revisions, since it uses
oom_badness() for the root mem cgroup but not for leaf mem cgroups, which
is why it breaks any system where user processes are attached to the root
mem cgroup. See my example where a 1GB worker attached to the root mem
cgroup
is preferred over a cgroup with 6GB workers. It may have been tested with
your own highly specialized usecase, but not with anything else, and the
author obviously admits its shortcomings.
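The asymmetry in that example can be shown with rough arithmetic (page size and machine size are assumptions; the two formulas mirror the heuristics as described earlier in this thread):

```python
PAGE = 4096
GB = 1 << 30
total_pages = 128 * GB // PAGE        # assumed machine size: 128GB

# Leaf mem cgroup score: based on the cgroup's page counters, with
# per-process oom_score_adj ignored. Six 1GB workers:
leaf_score = 6 * GB // PAGE           # ~1.57 million pages

# Root mem cgroup score: sum of per-process oom_badness(), which adds
# oom_score_adj scaled by total memory. One 1GB worker with
# oom_score_adj = 1000:
rss = 1 * GB // PAGE
adj = 1000
root_score = rss + adj * total_pages // 1000

# The root-attached 1GB worker dwarfs the 6GB leaf cgroup:
print(root_score > leaf_score)        # True
```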
> You've had nine months to contribute and shape this patch series
> productively, and instead resorted to a cavalcade of polemics,
> evasion, and repetition of truisms and refuted points. A ten paragraph
> proposal of vague ideas at this point is simply not a valid argument.
>
The patch series has gone through massive changes over the past nine
months and I've brought up three very specific concerns with its current
form that makes it non-extensible. I know the patchset has very
significant changes or rewrites in its history, but I'm only concerned
with its present form because it's not something that can easily be
extended later. We don't need useless files polluting the cgroup v2
filesystem for things that don't matter with other oom policies, and we
don't need the mount option either.
> You can send patches to replace or improve on Roman's code and make
> the case for them.
>
I volunteered in the email thread where I proposed an alternative to
replace it, I'm actively seeking any feedback on that proposal to address
anybody else's concerns early on. Your participation in that would be
very useful.
Thanks.
On Mon, 15 Jan 2018, Michal Hocko wrote:
> > No, this isn't how kernel features get introduced. We don't design a new
> > kernel feature with its own API for a highly specialized usecase and then
> > claim we'll fix the problems later. Users will work around the
> > constraints of the new feature, if possible, and then we can end up
> > breaking them later. Or, we can pollute the mem cgroup v2 filesystem with
> > even more tunables to cover up for mistakes in earlier designs.
>
> This is a blatant misinterpretation of the proposed changes. I haven't
> heard _any_ single argument against the proposed user interface except
> for complaints for missing tunables. This is not how the kernel
> development works and should work. The usecase was clearly described and
> far from limited to a single workload or company.
>
The complaint about the user interface is that it is not extensible, as my
next line states. This doesn't need to be opted into with a mount option
that locks the entire system into a single oom policy. That, itself, is the
result of a poor design. What is needed is a way for users to define an
oom policy that is generally useful, not something that is locked in for
the whole system. We don't need several different cgroup mount options
only for mem cgroup oom policies. We also don't need random
memory.groupoom files being added to the mem cgroup v2 filesystem only for
one or two particular policies and being no-ops otherwise. It can easily
be specified as part of the policy itself. My suggestion adds two new
files to the mem cgroup v2 filesystem and no mount option, and allows any
policy to be added later that only uses these two files. I see you've
ignored all of that in this email; perhaps reading it would be
worthwhile so that you can provide constructive feedback.
> > The key point to all three of my objections: extensibility.
>
> And it has been argued that further _features_ can be added on top. I am
> absolutely fed up discussing those things again and again without any
> progress. You simply keep _ignoring_ counter arguments and that is a
> major PITA to be honest with you. You are basically blocking a useful
> feature because it doesn't solve your particular workload. This is
> simply not acceptable in the kernel development.
>
As the thread says, this has nothing to do with my own particular
workload; it has to do with three obvious shortcomings in the design that
the user has no control over. We can't add features on top if the
heuristic itself changes as a result of the proposal; it needs to be
introduced in an extensible way so that additional changes can be made
later, if necessary, while still working around the very obvious problems
with this current implementation. My suggestion is that we introduce a
way to define the oom policy once, so that we don't have to change it later
and end up with needless mount options or mem cgroup v2 files that
become no-ops. I hope that you will read the
proposal for that extensible interface and comment on it about any
concerns that you have, because that feedback would generally be useful.
> > Both you and Michal have acknowledged blatantly obvious shortcomings in
> > the design.
>
> What you call blatant shortcomings we do not see affecting any
> _existing_ workloads. If they turn out to be real issues then we can fix
> them without deprecating any user APIs added by this patchset.
>
There are existing workloads that use mem cgroup subcontainers purely for
tracking charging and vmscan stats, which results in this logic being
evaded. It's a real issue, and a perfectly acceptable usecase for mem
cgroup. It's a result of the entire oom policy being either opted into or
opted out of for the whole system, impossible for the user to
configure or avoid. That can be done better by enabling the oom policy
only for a subtree, as I've suggested, but you've ignored. It would also
render both the mount option and the additional file in the mem cgroup v2
filesystem added by this patchset to be no-ops.
On Tue 16-01-18 13:36:21, David Rientjes wrote:
> On Mon, 15 Jan 2018, Michal Hocko wrote:
>
> > > No, this isn't how kernel features get introduced. We don't design a new
> > > kernel feature with its own API for a highly specialized usecase and then
> > > claim we'll fix the problems later. Users will work around the
> > > constraints of the new feature, if possible, and then we can end up
> > > breaking them later. Or, we can pollute the mem cgroup v2 filesystem with
> > > even more tunables to cover up for mistakes in earlier designs.
> >
> > This is a blatant misinterpretation of the proposed changes. I haven't
> > heard _any_ single argument against the proposed user interface except
> > for complaints for missing tunables. This is not how the kernel
> > development works and should work. The usecase was clearly described and
> > far from limited to a single workload or company.
> >
>
> The complaint about the user interface is that it is not extensible, as my
> next line states.
I disagree and will not repeat argument why.
> This doesn't need to be opted into with a mount option
> locking the entire system into a single oom policy. That, itself, is the
> result of a poor design. What is needed is a way for users to define an
> oom policy that is generally useful, not something that is locked in for
> the whole system.
We have been discussing general oom policies for years now, and not a
_single_ useful/acceptable approach has been suggested. Nor is your sketch
one, I am afraid, because we could argue that it doesn't address other
usecases out there which need more specific control. All that without
having _any_ code merged. The current one is a policy that addresses a
reasonably large class of container-based usecases without
forcing everybody else to use the same policy.
> We don't need several different cgroup mount options
> only for mem cgroup oom policies.
A cgroup mount option sounds like a reasonable approach; one was already
used for the unified hierarchy in its early stages.
> We also don't need random
> memory.groupoom files being added to the mem cgroup v2 filesystem only for
> one or two particular policies and being no-ops otherwise.
The groupoom file is a fundamental API that allows killing a whole cgroup,
which is a reasonable thing to do and a sane user API regardless of
implementation details. Any oom selection policy can be built on top.
> It can easily
> be specified as part of the policy itself.
No, it cannot, because it would conflate oom selection _and_ oom action.
And that would be wrong _semantically_, I believe. And I am quite
sure we could discuss what kinds of policies we need to death and never
move on. Exactly like, ehm, until now.
So let me repeat. There are users for this functionality. Users have to
explicitly opt in, so existing users are not at risk of regressions.
Further, more fine-grained oom selection policies can be implemented on top
without breaking new users.
In short: There is no single reason to block this to be merged.
If your usecase is not covered yet then feel free to extend the existing
code/APIs to do so. I will happily review and discuss them like I've
been doing here even though I am myself not a user of this new
functionality.
--
Michal Hocko
SUSE Labs