It seems that this is still in limbo mostly because of David's concerns.
So let me reiterate them and provide my POV once more (and the last
time) just to help Andrew make a decision:
1) Comparing the root with tail memcgs during the OOM killer is not fair
because we are comparing tasks with memcgs.
This is true, but I do not think this matters much for workloads which
are going to use the feature. Why? Because the main consumers of the new
feature seem to be containers which really need some fairness when
comparing _workloads_ rather than processes. Those are unlikely to
contain any significant memory consumers in the root memcg. That would
be mostly common infrastructure.
Is this fixable? Yes, we would need to account in the root memcg as well.
Why are we not doing that now? Because it has some non-negligible
performance overhead. Are there other ways? Yes, we can approximate root
memcg memory consumption, but I would rather wait for somebody seeing
that as a real problem rather than add hacks now without a strong
reason.
2) Evading the oom killer by attaching processes to child cgroups, which
basically means that a task can split up the workload into smaller
memcgs to hide its real memory consumption.
Again true, but not really anything new. Processes can already fork and
split up the memory consumption. Moreover, it doesn't even require any
special privileges to do so, unlike creating a sub memcg. Is this
fixable? Yes, untrusted workloads can set up group oom evaluation at the
delegation layer so all subgroups would be considered together (see the
sketch after point 4 below).
3) Userspace has zero control over oom kill selection in leaf mem
cgroups.
Again true, but this is something that needs a good evaluation to not end
up in the fiasco we have seen with oom_score*. Current users demanding
this feature can live without any prioritization, so blocking the whole
feature seems unreasonable.
4) Future extensibility to be backward compatible.
David is wrong here IMHO. Any prioritization or oom selection policy
controls added in future are orthogonal to the oom_group concept added
by this patchset. Allowing memcg to be an oom entity is something that
we really want long term. Global CGRP_GROUP_OOM is the most restrictive
semantic, and softening it will be possible by adding a new knob to
tell whether a memcg/hierarchy is a workload or a set of tasks.
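To make the delegation-layer setup from point 2 concrete, here is a minimal
sketch. It assumes cgroup v2 mounted at /sys/fs/cgroup, a delegated subtree
named "untrusted", and a group-oom knob spelled memory.oom.group as in the
later lightweight proposal; none of these names are guaranteed by this
patchset.

/*
 * Sketch only: enable group OOM evaluation at the delegation boundary so
 * that the whole delegated subtree is evaluated (and killed) as one unit
 * and children cannot hide memory by splitting into sub-memcgs.
 * The path and the knob name are assumptions, not the final API.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *knob = "/sys/fs/cgroup/untrusted/memory.oom.group";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror(knob);
		return EXIT_FAILURE;
	}
	fprintf(f, "1\n");
	fclose(f);
	return EXIT_SUCCESS;
}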
--
Michal Hocko
SUSE Labs
On Tue 05-06-18 13:47:29, Michal Hocko wrote:
> It seems that this is still in limbo mostly because of David's concerns.
> So let me reiterate them and provide my POV once more (and the last
> time) just to help Andrew make a decision:
Sorry, I forgot to add a reference to the email with David's full
reasoning. Here it is: http://lkml.kernel.org/r/[email protected]
> 1) Comparing the root with tail memcgs during the OOM killer is not fair
> because we are comparing tasks with memcgs.
>
> This is true, but I do not think this matters much for workloads which
> are going to use the feature. Why? Because the main consumers of the new
> feature seem to be containers which really need some fairness when
> comparing _workloads_ rather than processes. Those are unlikely to
> contain any significant memory consumers in the root memcg. That would
> be mostly common infrastructure.
>
> Is this fixable? Yes, we would need to account in the root memcg as well.
> Why are we not doing that now? Because it has some non-negligible
> performance overhead. Are there other ways? Yes, we can approximate root
> memcg memory consumption, but I would rather wait for somebody seeing
> that as a real problem rather than add hacks now without a strong
> reason.
>
>
> 2) Evading the oom killer by attaching processes to child cgroups, which
> basically means that a task can split up the workload into smaller
> memcgs to hide its real memory consumption.
>
> Again true, but not really anything new. Processes can already fork and
> split up the memory consumption. Moreover, it doesn't even require any
> special privileges to do so, unlike creating a sub memcg. Is this
> fixable? Yes, untrusted workloads can set up group oom evaluation at the
> delegation layer so all subgroups would be considered together.
>
> 3) Userspace has zero control over oom kill selection in leaf mem
> cgroups.
>
> Again true, but this is something that needs a good evaluation to not end
> up in the fiasco we have seen with oom_score*. Current users demanding
> this feature can live without any prioritization, so blocking the whole
> feature seems unreasonable.
>
> 4) Future extensibility to be backward compatible.
>
> David is wrong here IMHO. Any prioritization or oom selection policy
> controls added in future are orthogonal to the oom_group concept added
> by this patchset. Allowing memcg to be an oom entity is something that
> we really want long term. Global CGRP_GROUP_OOM is the most restrictive
> semantic, and softening it will be possible by adding a new knob to
> tell whether a memcg/hierarchy is a workload or a set of tasks.
> --
> Michal Hocko
> SUSE Labs
--
Michal Hocko
SUSE Labs
On Tue, 5 Jun 2018, Michal Hocko wrote:
> 1) Comparing the root with tail memcgs during the OOM killer is not fair
> because we are comparing tasks with memcgs.
>
> This is true, but I do not think this matters much for workloads which
> are going to use the feature. Why? Because the main consumers of the new
> feature seem to be containers which really need some fairness when
> comparing _workloads_ rather than processes. Those are unlikely to
> contain any significant memory consumers in the root memcg. That would
> be mostly common infrastructure.
>
There are users (us) who want to use the feature, and not all processes are
attached to leaf mem cgroups. The functionality can be provided in a
generally useful way that doesn't require any specific hierarchy, and I
implemented this in my patch series at
https://marc.info/?l=linux-mm&m=152175563004458&w=2. That proposal fixes
*all* of my concerns with the cgroup-aware oom killer as it sits in -mm and,
in its entirety, only extends it so it is generally useful; it does not
remove any functionality. I'm not sure why we are discussing ways of
moving forward when that patchset has been waiting for review for almost
four months and, to date, I haven't seen an objection to it.
I don't know why we cannot agree on making solutions generally useful nor
why that patchset has not been merged into -mm so that the whole feature
can be merged. It's baffling. This is the first time I've encountered a
perceived stalemate when there is a patchset sitting, unreviewed, that
fixes all of the concerns that there are about the implementation sitting
in -mm.
This isn't a criticism just of comparing processes attached to root
differently from leaf mem cgroups; it's how oom_score_adj influences that
decision. It's trivial for a very small consumer (not a "significant"
memory consumer, as you put it) to require an oom kill from root instead
of a leaf mem cgroup. I show in
https://marc.info/?l=linux-mm&m=152175564104468&w=2 that changing the
oom_score_adj of my bash shell attached to the root mem cgroup makes it
considered equal to a 95GB leaf mem cgroup with the current
implementation.
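For a rough back-of-the-envelope illustration of the effect (this is not the
kernel code, just the shape of the per-task scoring in which oom_score_adj is
scaled by the size of the machine; all numbers are made up):

/*
 * Sketch: on a host with ~95GB worth of pages, a tiny shell whose
 * oom_score_adj has been raised to the maximum scores roughly the same
 * as a leaf memcg that actually uses ~95GB, because the adjustment is
 * multiplied by totalpages / 1000.
 */
#include <stdio.h>

int main(void)
{
	long long totalpages = 24903680LL;	/* ~95GB in 4KB pages (made up)   */
	long long bash_rss   = 1000;		/* a few MB for the shell         */
	long long adj        = 1000;		/* oom_score_adj at its maximum   */

	long long task_points  = bash_rss + adj * totalpages / 1000;
	long long memcg_points = totalpages;	/* leaf memcg judged by its usage */

	printf("root task: %lld, 95GB leaf memcg: %lld\n", task_points, memcg_points);
	return 0;
}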
> Is this fixable? Yes, we would need to account in the root memcg as well.
> Why are we not doing that now? Because it has some non-negligible
> performance overhead. Are there other ways? Yes, we can approximate root
> memcg memory consumption, but I would rather wait for somebody seeing
> that as a real problem rather than add hacks now without a strong
> reason.
>
I fixed this in https://marc.info/?t=152175564500007&r=1&w=2, and from
what I remember Roman actually liked it.
> 2) Evading the oom killer by attaching processes to child cgroups, which
> basically means that a task can split up the workload into smaller
> memcgs to hide its real memory consumption.
>
> Again true, but not really anything new. Processes can already fork and
> split up the memory consumption. Moreover, it doesn't even require any
> special privileges to do so, unlike creating a sub memcg. Is this
> fixable? Yes, untrusted workloads can set up group oom evaluation at the
> delegation layer so all subgroups would be considered together.
>
Processes being able to fork to split up memory consumption is also fixed
by https://marc.info/?l=linux-mm&m=152175564104467 just as creating
subcontainers to intentionally or unintentionally subvert the oom
policy is fixed. It solves both problems.
> 3) Userspace has zero control over oom kill selection in leaf mem
> cgroups.
>
> Again true, but this is something that needs a good evaluation to not end
> up in the fiasco we have seen with oom_score*. Current users demanding
> this feature can live without any prioritization, so blocking the whole
> feature seems unreasonable.
>
One objection here is how the oom_score_adj of a process means something
or doesn't mean something depending on what cgroup it is attached to. The
cgroup-aware oom killer is cgroup aware. oom_score_adj should play no
part. I fixed this with https://marc.info/?t=152175564500007&r=1&w=2.
The other objection is that users do have cgroups that shouldn't be oom
killed because they are important, either because they are required to
provide a service for a smaller cgroup or because of business goals. We
have cgroups that use more than half of system memory and they are allowed
to do so because they are important. We would love to be able to bias
against that cgroup to prefer others, or prefer cgroups for oom kill
because they are less important. This was done for processes with
oom_score_adj, we need it for a cgroup aware oom killer for the same
reason.
But notice even in https://marc.info/?l=linux-mm&m=152175563004458&w=2
that I said priority or adjustment can be added on top of the feature
after it's merged. This itself is not precluding anything from being
merged.
> 4) Future extensibility to be backward compatible.
>
> David is wrong here IMHO. Any prioritization or oom selection policy
> controls added in future are orthogonal to the oom_group concept added
> by this patchset. Allowing memcg to be an oom entity is something that
> we really want long term. Global CGRP_GROUP_OOM is the most restrictive
> semantic, and softening it will be possible by adding a new knob to
> tell whether a memcg/hierarchy is a workload or a set of tasks.
I've always said that the mechanism and policy in this patchset should be
separated. I do that exact thing in
https://marc.info/?l=linux-mm&m=152175564304469&w=2. I suggest that
different subtrees will want (or the admin will require) different
behaviors with regard to the mechanism.
I've stated the problems (and there are others wrt mempolicy selection)
that the current implementation has and given a full solution at
https://marc.info/?l=linux-mm&m=152175563004458&w=2 that has not been
reviewed. I would love feedback from anybody on this thread on that. I'm
not trying to preclude the cgroup-aware oom killer from being merged, I'm
the only person actively trying to get it merged.
Thanks.
On 2018/07/14 6:59, David Rientjes wrote:
> I'm not trying to preclude the cgroup-aware oom killer from being merged,
> I'm the only person actively trying to get it merged.
Before merging the cgroup-aware oom killer, can we merge the OOM lockup fixes
and my cleanup? The gap between linux.git and linux-next.git keeps us from
being able to use an agreed baseline.
On Fri 13-07-18 14:59:59, David Rientjes wrote:
> On Tue, 5 Jun 2018, Michal Hocko wrote:
>
> > 1) Comparing the root with tail memcgs during the OOM killer is not fair
> > because we are comparing tasks with memcgs.
> >
> > This is true, but I do not think this matters much for workloads which
> > are going to use the feature. Why? Because the main consumers of the new
> > feature seem to be containers which really need some fairness when
> > comparing _workloads_ rather than processes. Those are unlikely to
> > contain any significant memory consumers in the root memcg. That would
> > be mostly common infrastructure.
> >
>
> There are users (us) who want to use the feature, and not all processes are
> attached to leaf mem cgroups. The functionality can be provided in a
> generally useful way that doesn't require any specific hierarchy, and I
> implemented this in my patch series at
> https://marc.info/?l=linux-mm&m=152175563004458&w=2. That proposal fixes
> *all* of my concerns with the cgroup-aware oom killer as it sits in -mm and,
> in its entirety, only extends it so it is generally useful; it does not
> remove any functionality. I'm not sure why we are discussing ways of
> moving forward when that patchset has been waiting for review for almost
> four months and, to date, I haven't seen an objection to it.
Well, I didn't really get to your patches yet. The last time I checked
I had some pretty serious concerns about the consistency of your
proposal. Those might have been fixed in the latest version of your
patchset, which I haven't seen. But I still strongly suspect that you are
largely underestimating the complexity of the more generic oom policies
you are heading towards.
Considering user API failures from the past (oom_*adj fiasco for
example) suggests that we should start with smaller steps and only
provide a clear and simple API. oom_group is such a simple and
semantically consistent thing which is the reason I am OK with it much
more than your "we can be more generic" approach. I simply do not trust
we can agree on sane and consistent api in a reasonable time.
And it is quite mind boggling that a simpler approach has been basically
blocked for months because there are some concerns for workloads which
are not really asking for the feature. Sure your usecase might need to
handle root memcg differently. That is a fair point but that shouldn't
really block containers users who can use the proposed solution without
any further changes. If we ever decide to handle root memcg differently
we are free to do so because the oom selection policy is not carved in
stone by any api.
[...]
--
Michal Hocko
SUSE Labs
No response from Roman and David...
Andrew, will you drop Roman's cgroup-aware OOM killer and David's patches for now?
Roman's series has a bug which I mentioned and which can be avoided by my patch.
David's patch is using MMF_UNSTABLE incorrectly such that it might start selecting
the next OOM victim without trying to reclaim any memory.
Since they are not responding to my mail, I suggest dropping them from linux-next for now.
https://www.spinics.net/lists/linux-mm/msg153212.html
https://lore.kernel.org/lkml/[email protected]/T/#u
On 2018/07/14 10:55, Tetsuo Handa wrote:
> On 2018/07/14 6:59, David Rientjes wrote:
>> I'm not trying to preclude the cgroup-aware oom killer from being merged,
>> I'm the only person actively trying to get it merged.
>
> Before merging the cgroup-aware oom killer, can we merge the OOM lockup fixes
> and my cleanup? The gap between linux.git and linux-next.git keeps us from
> being able to use an agreed baseline.
>
On Tue, Jul 17, 2018 at 06:13:47AM +0900, Tetsuo Handa wrote:
> No response from Roman and David...
>
> Andrew, will you drop Roman's cgroup-aware OOM killer and David's patches for now?
> Roman's series has a bug which I mentioned and which can be avoided by my patch.
> David's patch is using MMF_UNSTABLE incorrectly such that it might start selecting
> the next OOM victim without trying to reclaim any memory.
>
> Since they are not responding to my mail, I suggest dropping them from linux-next for now.
I was in cc, and didn't think that you were expecting something from me.
I don't get why it's necessary to drop the cgroup oom killer to merge your fix.
I'm happy to help with rebasing and everything else.
Thanks,
Roman
Roman Gushchin wrote:
> On Tue, Jul 17, 2018 at 06:13:47AM +0900, Tetsuo Handa wrote:
> > No response from Roman and David...
> >
> > Andrew, will you drop Roman's cgroup-aware OOM killer and David's patches for now?
> > Roman's series has a bug which I mentioned and which can be avoided by my patch.
> > David's patch is using MMF_UNSTABLE incorrectly such that it might start selecting
> > the next OOM victim without trying to reclaim any memory.
> >
> > Since they are not responding to my mail, I suggest dropping them from linux-next for now.
>
> I was in cc, and didn't think that you were expecting something from me.
Oops. I was waiting for your response. ;-)
But Roman, my patch conflicts with your "mm, oom: cgroup-aware OOM killer" patch
in linux-next. And it seems to me that your patch contains a bug which leads to
premature memory allocation failure explained below.
Can we apply my patch prior to your "mm, oom: cgroup-aware OOM killer" patch
(which eliminates "delay" and "out:" from your patch) so that people can easily
backport my patch? Or, do you want to apply a fix (which eliminates "delay" and
"out:" from linux-next) prior to my patch?
>
> I don't get why it's necessary to drop the cgroup oom killer to merge your fix.
> I'm happy to help with rebasing and everything else.
Yes, I wish you would rebase your series on top of the OOM lockup (CVE-2016-10723)
mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a
trivial change and easy to cleanly backport (if applied before your series).
Also, I expect you to check whether my cleanup patch which removes the "abort" path
( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps
simplify your series. I don't know the detailed behavior of your series, but I
assume that your series does not kill threads for which the current thread should
not wait for MMF_OOM_SKIP.
On Mon, 16 Jul 2018, Michal Hocko wrote:
> Well, I didn't really get to your patches yet. The last time I checked
> I had some pretty serious concerns about the consistency of your
> proposal. Those might have been fixed in the latest version of your
> patchset, which I haven't seen. But I still strongly suspect that you are
> largely underestimating the complexity of the more generic oom policies
> you are heading towards.
>
I don't believe it's underestimated, since it's being used. It's perfectly
valid to lock an entire hierarchy or individual subtrees into a single
policy if that's what is preferred. Any use of a different policy at a
subtree root is a conscious decision made by the owner of that subtree. If
they prefer to kill the largest process, the largest descendant cgroup, or
the largest subtree, it is up to them. All three have valid usecases; the
goal is not to lock the entire hierarchy into a single policy: that
introduces the ability for users to subvert the selection policy either
intentionally or unintentionally, because they are using a unified single
hierarchy with cgroup v2 and they are using controllers other than mem
cgroup.
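As a purely hypothetical illustration of those three per-subtree choices (the
file name memory.oom_policy and the values below are placeholders for the sake
of the example, not an existing kernel interface):

/*
 * Sketch only: the owner of a delegated subtree picks how victims are
 * selected under it.  File name and accepted values are assumptions:
 *   "none"   - kill the single largest process (per-task selection)
 *   "cgroup" - kill from the largest descendant cgroup
 *   "tree"   - compare whole subtrees and kill from the largest one
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/myjob/memory.oom_policy", "w");

	if (!f) {
		perror("memory.oom_policy");
		return 1;
	}
	fputs("tree\n", f);
	fclose(f);
	return 0;
}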
> Considering user API failures from the past (oom_*adj fiasco for
> example) suggests that we should start with smaller steps and only
> provide a clear and simple API. oom_group is such a simple and
> semantically consistent thing which is the reason I am OK with it much
> more than your "we can be more generic" approach. I simply do not trust
> we can agree on sane and consistent api in a reasonable time.
>
> And it is quite mind boggling that a simpler approach has been basically
> blocked for months because there are some concerns for workloads which
> are not really asking for the feature. Sure your usecase might need to
> handle root memcg differently. That is a fair point but that shouldn't
> really block containers users who can use the proposed solution without
> any further changes. If we ever decide to handle root memcg differently
> we are free to do so because the oom selection policy is not carved in
> stone by any api.
>
Please respond directly to the patchset which clearly enumerates the
problems with the current implementation in -mm.
On 2018/07/17 9:55, Tetsuo Handa wrote:
>> I don't get why it's necessary to drop the cgroup oom killer to merge your fix.
>> I'm happy to help with rebasing and everything else.
>
> Yes, I wish you would rebase your series on top of the OOM lockup (CVE-2016-10723)
> mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a
> trivial change and easy to cleanly backport (if applied before your series).
>
> Also, I expect you to check whether my cleanup patch which removes the "abort" path
> ( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps
> simplify your series. I don't know the detailed behavior of your series, but I
> assume that your series does not kill threads for which the current thread should
> not wait for MMF_OOM_SKIP.
syzbot is hitting WARN(1) due to mem_cgroup_out_of_memory() == false.
https://syzkaller.appspot.com/bug?id=ea8c7912757d253537375e981b61749b2da69258
I can't tell what change is triggering this race. Maybe the removal of oom_lock from
the oom reaper made it more likely to hit. But anyway, I suspect that
static bool oom_kill_memcg_victim(struct oom_control *oc)
{
	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
		return oc->chosen_memcg; // <= This line is still broken

because

	/* We have one or more terminating processes at this point. */
	oc->chosen_task = INFLIGHT_VICTIM;

is not called.
Also, that patch is causing confusion by reviving schedule_timeout_killable(1)
with oom_lock held.
Can we temporarily drop the cgroup-aware OOM killer from linux-next.git and
apply my cleanup patch? Since the merge window is approaching, I really want to
see what the next -rc1 would look like...
On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
> On 2018/07/17 9:55, Tetsuo Handa wrote:
> >> I don't get why it's necessary to drop the cgroup oom killer to merge your fix.
> >> I'm happy to help with rebasing and everything else.
> >
> > Yes, I wish you would rebase your series on top of the OOM lockup (CVE-2016-10723)
> > mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a
> > trivial change and easy to cleanly backport (if applied before your series).
> >
> > Also, I expect you to check whether my cleanup patch which removes the "abort" path
> > ( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps
> > simplify your series. I don't know the detailed behavior of your series, but I
> > assume that your series does not kill threads for which the current thread should
> > not wait for MMF_OOM_SKIP.
>
> syzbot is hitting WARN(1) due to mem_cgroup_out_of_memory() == false.
> https://syzkaller.appspot.com/bug?id=ea8c7912757d253537375e981b61749b2da69258
>
> I can't tell what change is triggering this race. Maybe the removal of oom_lock from
> the oom reaper made it more likely to hit. But anyway, I suspect that
>
> static bool oom_kill_memcg_victim(struct oom_control *oc)
> {
> 	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> 		return oc->chosen_memcg; // <= This line is still broken
>
> because
>
> 	/* We have one or more terminating processes at this point. */
> 	oc->chosen_task = INFLIGHT_VICTIM;
>
> is not called.
>
> Also, that patch is causing confusion by reviving schedule_timeout_killable(1)
> with oom_lock held.
>
> Can we temporarily drop the cgroup-aware OOM killer from linux-next.git and
> apply my cleanup patch? Since the merge window is approaching, I really want to
> see what the next -rc1 would look like...
Hi Tetsuo!
Has this cleanup patch been acked by somebody?
Which problem does it solve?
Dropping patches for making a cleanup (if it's a cleanup) sounds a bit strange.
Anyway, there is a good chance that the current cgroup-aware OOM killer
implementation will be replaced by a lightweight version (memory.oom.group).
Please take a look at it; your cleanup probably will not conflict with it
at all.
Thanks!
On 2018/08/02 1:37, Roman Gushchin wrote:
> On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
>> Can we temporarily drop the cgroup-aware OOM killer from linux-next.git and
>> apply my cleanup patch? Since the merge window is approaching, I really want to
>> see what the next -rc1 would look like...
>
> Hi Tetsuo!
>
> Has this cleanup patch been acked by somebody?
Not yet. But since Michal considers this cleanup as "a nice shortcut"
( https://marc.info/[email protected] ), I assume that
I will get an ACK regarding this cleanup.
> Which problem does it solve?
It simplifies tricky out_of_memory() return value decision, and
it also fixes a bug in your series which syzbot is pointing out.
> Dropping patches for making a cleanup (if it's a cleanup) sounds a bit strange.
What I need is a git tree which I can use as a baseline for making this cleanup.
linux.git is not suitable because it does not include Michal's fix, but
linux-next.git is not suitable because Michal's fix is overwritten by your series.
I want a git tree which includes Michal's fix and does not include your series.
>
> Anyway, there is a good chance that the current cgroup-aware OOM killer
> implementation will be replaced by a lightweight version (memory.oom.group).
> Please take a look at it; your cleanup probably will not conflict with it
> at all.
Then, please drop the current cgroup-aware OOM killer implementation from linux-next.git.
I want to see what the next -rc1 would look like (for testing purposes) and want to use
linux-next.git as a baseline (for making this cleanup).
>
> Thanks!
>
>
On Thu, Aug 02, 2018 at 07:01:28AM +0900, Tetsuo Handa wrote:
> On 2018/08/02 1:37, Roman Gushchin wrote:
> > On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
> >> Can we temporarily drop the cgroup-aware OOM killer from linux-next.git and
> >> apply my cleanup patch? Since the merge window is approaching, I really want to
> >> see what the next -rc1 would look like...
> >
> > Hi Tetsuo!
> >
> > Has this cleanup patch been acked by somebody?
>
> Not yet. But since Michal considers this cleanup as "a nice shortcut"
> ( https://marc.info/[email protected] ), I assume that
> I will get an ACK regarding this cleanup.
>
> > Which problem does it solve?
>
> It simplifies tricky out_of_memory() return value decision, and
> it also fixes a bug in your series which syzbot is pointing out.
>
> > Dropping patches for making a cleanup (if it's a cleanup) sounds a bit strange.
>
> What I need is a git tree which I can use as a baseline for making this cleanup.
> linux.git is not suitable because it does not include Michal's fix, but
> linux-next.git is not suitable because Michal's fix is overwritten by your series.
> I want a git tree which includes Michal's fix and does not include your series.
>
> >
> > Anyway, there is a good chance that the current cgroup-aware OOM killer
> > implementation will be replaced by a lightweight version (memory.oom.group).
> > Please take a look at it; your cleanup probably will not conflict with it
> > at all.
>
> Then, please drop the current cgroup-aware OOM killer implementation from linux-next.git.
> I want to see what the next -rc1 would look like (for testing purposes) and want to use
> linux-next.git as a baseline (for making this cleanup).
I'll post memory.oom.group v2 later today, and if there are no objections,
I'll ask Andrew to drop the current memcg-aware OOM killer and replace it
with the lightweight memory.oom.group.
These changes will be picked up by linux-next in a few days.
Thanks!