2015-11-04 10:42:45

by Michal Hocko

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Thu 29-10-15 09:10:09, Johannes Weiner wrote:
> On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote:
> > On Tue 27-10-15 09:42:27, Johannes Weiner wrote:
[...]
> > > You carefully skipped over this part. We can ignore it for socket
> > > memory but it's something we need to figure out when it comes to slab
> > > accounting and tracking.
> >
I am sorry, I didn't mean to skip this part, I thought it would be clear
> > from the previous text. I think kmem accounting falls into the same
> > category. Have a sane default and a global boottime knob to override it
> > for those that think differently - for whatever reason they might have.
>
> Yes, that makes sense to me.
>
> Like cgroup.memory=nosocket, would you think it makes sense to include
> slab in the default for functional/semantical completeness and provide
> a cgroup.memory=noslab for powerusers?

I am still not sure whether the kmem accounting is stable enough to be
enabled by default. If nothing else, the allocation failures, which
are not allowed in the global case but are easily triggered by the hard
limit, might be a big problem. My last attempts to allow GFP_NOFS to
fail made me quite skeptical. I still believe this is something which
will be solved in the long term, but the current state might still be too
fragile. So I would rather be conservative and have the kmem accounting
disabled by default with a config option and boot parameter to override.
If somebody is confident that the desired load is stable then the config
can be enabled easily.
--
Michal Hocko
SUSE Labs


2015-11-04 19:50:53

by Johannes Weiner

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Wed, Nov 04, 2015 at 11:42:40AM +0100, Michal Hocko wrote:
> On Thu 29-10-15 09:10:09, Johannes Weiner wrote:
> > On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote:
> > > On Tue 27-10-15 09:42:27, Johannes Weiner wrote:
> [...]
> > > > You carefully skipped over this part. We can ignore it for socket
> > > > memory but it's something we need to figure out when it comes to slab
> > > > accounting and tracking.
> > >
> > > I am sorry, I didn't mean to skip this part, I thought it would be clear
> > > from the previous text. I think kmem accounting falls into the same
> > > category. Have a sane default and a global boottime knob to override it
> > > for those that think differently - for whatever reason they might have.
> >
> > Yes, that makes sense to me.
> >
> > Like cgroup.memory=nosocket, would you think it makes sense to include
> > slab in the default for functional/semantical completeness and provide
> > a cgroup.memory=noslab for powerusers?
>
> I am still not sure whether the kmem accounting is stable enough to be
> enabled by default. If for nothing else the allocation failures, which
> are not allowed for the global case and easily triggered by the hard
> limit, might be a big problem. My last attempts to allow GFP_NOFS to
> fail made me quite skeptical. I still believe this is something which
> will be solved in the long term but the current state might be still too
> fragile. So I would rather be conservative and have the kmem accounting
> disabled by default with a config option and boot parameter to override.
> If somebody is confident that the desired load is stable then the config
> can be enabled easily.

I agree with your assessment of the current kmem code state, but I
think your conclusion is completely backwards here.

The interface will be set in stone forever, whereas any stability
issues will be transient and will have to be addressed in a finite
amount of time anyway. It doesn't make sense to design an interface
based on temporary quality of implementation. Only one of those two
can ever be changed.

Because it goes without saying that once the cgroupv2 interface is
released, and people use it in production, there is no way we can then
*add* dentry cache, inode cache, and others to memory.current. That
would be an unacceptable change in interface behavior. On the other
hand, people will be prepared for hiccups in the early stages of
cgroupv2 release, and we're providing cgroup.memory=noslab to let them
work around severe problems in production until we fix it without
forcing them to fully revert to cgroupv1.

So if we agree that there are no fundamental architectural concerns
with slab accounting, i.e. nothing that can't be addressed in the
implementation, we have to make the call now.

And I maintain that not accounting dentry cache and inode cache is a
gaping hole in memory isolation, so it should be included by default.
(The rest of the slabs is arguable, but IMO the risk of missing
something important is higher than the cost of including them.)


As far as your allocation failure concerns go, I think the kmem code
is currently not behaving as Glauber originally intended, which is to
force charge if reclaim and OOM killing weren't able to make enough
space. See this recently rewritten section of the kmem charge path:

- /*
- * try_charge() chose to bypass to root due to OOM kill or
- * fatal signal. Since our only options are to either fail
- * the allocation or charge it to this cgroup, do it as a
- * temporary condition. But we can't fail. From a kmem/slab
- * perspective, the cache has already been selected, by
- * mem_cgroup_kmem_get_cache(), so it is too late to change
- * our minds.
- *
- * This condition will only trigger if the task entered
- * memcg_charge_kmem in a sane state, but was OOM-killed
- * during try_charge() above. Tasks that were already dying
- * when the allocation triggers should have been already
- * directed to the root cgroup in memcontrol.h
- */
- page_counter_charge(&memcg->memory, nr_pages);
- if (do_swap_account)
- page_counter_charge(&memcg->memsw, nr_pages);

It could be that this never properly worked as it was tied to the
-EINTR bypass trick, but the idea was these charges never fail.

And this makes sense. If the allocator semantics are such that we
never fail these page allocations for slab, and the callsites rely on
that, surely we should not fail them in the memory controller, either.

And it makes a lot more sense to account them in excess of the limit
than pretend they don't exist. We might not be able to completely
fulfill the containment part of the memory controller (although these
slab charges will still create significant pressure before that), but
at least we don't fail the accounting part on top of it.

2015-11-05 14:40:10

by Michal Hocko

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
[...]
> Because it goes without saying that once the cgroupv2 interface is
> released, and people use it in production, there is no way we can then
> *add* dentry cache, inode cache, and others to memory.current. That
> would be an unacceptable change in interface behavior.

They would still have to _enable_ the config option _explicitly_. make
oldconfig wouldn't change it silently for them. I do not think
it is an unacceptable change of behavior if the config is changed
explicitly.

> On the other
> hand, people will be prepared for hiccups in the early stages of
> cgroupv2 release, and we're providing cgroup.memory=noslab to let them
> work around severe problems in production until we fix it without
> forcing them to fully revert to cgroupv1.

This would be true if they moved on to the new cgroup API intentionally.
The reality is more complicated though. AFAIK systemd is waiting for
cgroup2 already and privileged services enable all available resource
controllers by default as I've learned just recently. If we know that
the interface is not stable enough then we are basically forcing _most_
users to use the kernel boot parameter if we stay with the current
kmem semantic. More on that below.

> So if we agree that there are no fundamental architectural concerns
> with slab accounting, i.e. nothing that can't be addressed in the
> implementation, we have to make the call now.

We are on the same page here.

> And I maintain that not accounting dentry cache and inode cache is a
> gaping hole in memory isolation, so it should be included by default.
> (The rest of the slabs is arguable, but IMO the risk of missing
> something important is higher than the cost of including them.)

More on that below.

> As far as your allocation failure concerns go, I think the kmem code
> is currently not behaving as Glauber originally intended, which is to
> force charge if reclaim and OOM killing weren't able to make enough
> space. See this recently rewritten section of the kmem charge path:
>
> - /*
> - * try_charge() chose to bypass to root due to OOM kill or
> - * fatal signal. Since our only options are to either fail
> - * the allocation or charge it to this cgroup, do it as a
> - * temporary condition. But we can't fail. From a kmem/slab
> - * perspective, the cache has already been selected, by
> - * mem_cgroup_kmem_get_cache(), so it is too late to change
> - * our minds.
> - *
> - * This condition will only trigger if the task entered
> - * memcg_charge_kmem in a sane state, but was OOM-killed
> - * during try_charge() above. Tasks that were already dying
> - * when the allocation triggers should have been already
> - * directed to the root cgroup in memcontrol.h
> - */
> - page_counter_charge(&memcg->memory, nr_pages);
> - if (do_swap_account)
> - page_counter_charge(&memcg->memsw, nr_pages);
>
> It could be that this never properly worked as it was tied to the
> -EINTR bypass trick, but the idea was these charges never fail.

I have always understood this path as a corner case when the task is an
oom victim or exiting. So this would be only a temporary condition which
cannot cause a complete runaway.

> And this makes sense. If the allocator semantics are such that we
> never fail these page allocations for slab, and the callsites rely on
> that, surely we should not fail them in the memory controller, either.

Then we can only bypass them or loop inside the charge code forever
like we do in the page allocator. The latter is really fragile, and it
would be even more so in a restricted environment, as we have learned
with the memcg OOM killer in the past.

> And it makes a lot more sense to account them in excess of the limit
> than pretend they don't exist. We might not be able to completely
> fulfill the containment part of the memory controller (although these
> slab charges will still create significant pressure before that), but
> at least we don't fail the accounting part on top of it.

Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any
load could simply run away via kernel allocations. What is even worse, we
might not even trigger the memcg OOM killer before we hit the global OOM. So
the whole containment goes straight to hell.

I can see four options here:
1) enable kmem by default with the current semantic which we know can
BUG_ON (at least btrfs is known to hit this) or lead to other issues.
2) enable kmem by default and change the semantic for cgroup2 to allow
runaway charges above the hard limit which would defeat the whole
purpose of the containment for cgroup2. This can be a temporary
workaround until we can afford kmem failures. This has a big risk
that we will end up with this permanently because there is a strong
pressure that GFP_KERNEL allocations should never fail. Yet this is
the most common type of request. Or do we change the consistency with
the global case at some point?
3) keep only some (safe) cache types enabled by default with the current
failing semantic and require an explicit enabling for the complete
kmem accounting. [di]cache code paths should be quite robust to
handle allocation failures.
4) disable kmem by default and change the config default later to signal
the accounting is safe as far as we are aware and let people enable
the functionality on that basis. We would keep the current failing
semantic.

To me 4) sounds like the safest option because it still keeps the
functionality available to those who can benefit from it in v1 already
while we are not exposing a potentially buggy behavior to the majority
(many of them even unintentionally). Moreover we can still change
the default later on an explicit basis. 3) sounds like the second best
option but I am not really sure whether we can do that very easily
without bringing up a lot of unmaintainable mess. 2) sounds like the
third best approach but I am afraid it would render the basic use cases
unusable for a very long time and kill any interest in cgroup2 for even
longer (cargo cults are really hard to get rid of). 1) sounds like a
land mine approach which would force many/most users to simply keep
using the boot option and force us to re-evaluate the default the hard way.
--
Michal Hocko
SUSE Labs

2015-11-05 16:16:15

by David Miller

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

From: Michal Hocko <[email protected]>
Date: Thu, 5 Nov 2015 15:40:02 +0100

> On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> [...]
>> Because it goes without saying that once the cgroupv2 interface is
>> released, and people use it in production, there is no way we can then
>> *add* dentry cache, inode cache, and others to memory.current. That
>> would be an unacceptable change in interface behavior.
>
> They would still have to _enable_ the config option _explicitly_. make
> oldconfig wouldn't change it silently for them. I do not think
> it is an unacceptable change of behavior if the config is changed
> explicitly.

Every user is going to get this config option when they update their
distribution kernel or whatever.

Then they will all wonder why their networking performance went down.

This is why I do not want the networking accounting bits on by default
even if the kconfig option is enabled. They must be off by default
and guarded by a static branch so the cost is exactly zero.

2015-11-05 16:28:10

by Michal Hocko

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Thu 05-11-15 11:16:09, David S. Miller wrote:
> From: Michal Hocko <[email protected]>
> Date: Thu, 5 Nov 2015 15:40:02 +0100
>
> > On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> > [...]
> >> Because it goes without saying that once the cgroupv2 interface is
> >> released, and people use it in production, there is no way we can then
> >> *add* dentry cache, inode cache, and others to memory.current. That
> >> would be an unacceptable change in interface behavior.
> >
> > They would still have to _enable_ the config option _explicitly_. make
> > oldconfig wouldn't change it silently for them. I do not think
> > it is an unacceptable change of behavior if the config is changed
> > explicitly.
>
> Every user is going to get this config option when they update their
> distribution kernel or whatever.
>
> Then they will all wonder why their networking performance went down.
>
> This is why I do not want the networking accounting bits on by default
> even if the kconfig option is enabled. They must be off by default
> and guarded by a static branch so the cost is exactly zero.

Yes, that part is clear and Johannes made it clear that the kmem tcp
part is disabled by default. Or are you considering all the slab
usage by the networking code as well?

--
Michal Hocko
SUSE Labs

2015-11-05 16:30:19

by David Miller

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

From: Michal Hocko <[email protected]>
Date: Thu, 5 Nov 2015 17:28:03 +0100

> Yes, that part is clear and Johannes made it clear that the kmem tcp
> part is disabled by default. Or are you considering also all the slab
> usage by the networking code as well?

I'm still thinking about the implications of that aspect, and will
comment when I have something coherent to say about it.

2015-11-05 20:55:36

by Johannes Weiner

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> [...]
> > Because it goes without saying that once the cgroupv2 interface is
> > released, and people use it in production, there is no way we can then
> > *add* dentry cache, inode cache, and others to memory.current. That
> > would be an unacceptable change in interface behavior.
>
> They would still have to _enable_ the config option _explicitly_. make
> oldconfig wouldn't change it silently for them. I do not think
> it is an unacceptable change of behavior if the config is changed
> explicitly.

Yeah, as Dave said these will all get turned on anyway, so there is no
point in fragmenting the Kconfig space in the first place.

> > On the other
> > hand, people will be prepared for hiccups in the early stages of
> > cgroupv2 release, and we're providing cgroup.memory=noslab to let them
> > work around severe problems in production until we fix it without
> > forcing them to fully revert to cgroupv1.
>
> This would be true if they moved on to the new cgroup API intentionally.
> The reality is more complicated though. AFAIK systemd is waiting for
> cgroup2 already and privileged services enable all available resource
> controllers by default as I've learned just recently.

Have you filed a report with them? I don't think they should turn them
on unless users explicitly configure resource control for the unit.

But what I said still holds: critical production machines don't just
get rolling updates and "accidentally" switch to all this new
code. And those that do take the plunge have the cmdline options.

> > And it makes a lot more sense to account them in excess of the limit
> > than pretend they don't exist. We might not be able to completely
> > fulfill the containment part of the memory controller (although these
> > slab charges will still create significant pressure before that), but
> > at least we don't fail the accounting part on top of it.
>
> Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any
> load could simply run away via kernel allocations. What is even worse, we
> might not even trigger the memcg OOM killer before we hit the global OOM. So
> the whole containment goes straight to hell.
>
> I can see four options here:
> 1) enable kmem by default with the current semantic which we know can
> BUG_ON (at least btrfs is known to hit this) or lead to other issues.

Can you point me to that report?

That's not "semantics", that's a bug! Whether or not a feature is
enabled by default, it can not be allowed to crash the kernel.

Presenting this as a choice is a bit of a strawman argument.

> 2) enable kmem by default and change the semantic for cgroup2 to allow
> runaway charges above the hard limit which would defeat the whole
> purpose of the containment for cgroup2. This can be a temporary
> workaround until we can afford kmem failures. This has a big risk
> that we will end up with this permanently because there is a strong
> pressure that GFP_KERNEL allocations should never fail. Yet this is
> the most common type of request. Or do we change the consistency with
> the global case at some point?

As per 1) we *have* to fail containment eventually if not doing so
means crashes and lockups. That's not a choice of semantics.

But that doesn't mean we have to give up *immediately* and allow
unrestrained "runaway charges"--again, more of a strawman than a
choice. We can still throttle the allocator and apply significant
pressure on the memory pool, culminating in OOM kills eventually.

Once we run out of available containment tools, however, we *have* to
follow the semantics of the page and slab allocator and succeed the
request. We can not just return -ENOMEM if that causes kernel bugs.

That's the only thing we can do right now.

In fact, it's likely going to be the best we will ever be able to do
when it comes to kernel memory accounting. Linus made it clear where
he stands on failing kernel allocations, so all we can do is continue
to improve our containment tools and then give up on containment when
they're exhausted and force the charge past the limit.

> 3) keep only some (safe) cache types enabled by default with the current
> failing semantic and require an explicit enabling for the complete
> kmem accounting. [di]cache code paths should be quite robust to
> handle allocation failures.

Vladimir, what would be your opinion on this?

> 4) disable kmem by default and change the config default later to signal
> the accounting is safe as far as we are aware and let people enable
> the functionality on that basis. We would keep the current failing
> semantic.
>
> To me 4) sounds like the safest option because it still keeps the
> functionality available to those who can benefit from it in v1 already
> while we are not exposing a potentially buggy behavior to the majority
> (many of them even unintentionally). Moreover we still allow to change
> the default later on an explicit basis.

I'm not interested in fragmenting the interface forever out of caution
because there might be a bug in the implementation right now. As I
said we have to fix any instability in the features we provide whether
they are turned on by default or not. I don't see how this is relevant
to the interface discussion.

Also, there is no way we can later fundamentally change the semantics
of memory.current, so it would have to remain configurable forever,
forcing people forever to select multiple options in order to piece
together a single logical kernel feature.

This is really not an option, either.

If there are show-stopping bugs in the implementation, I'd rather hold
off the release of the unified hierarchy than commit to a half-assed
interface right out of the gate. The point of v2 is sane interfaces.

So let's please focus on fixing any problems that slab accounting may
have, rather than designing complex config options and transition
procedures whose sole purpose is to defer dealing with our issues.

Please?

2015-11-05 22:33:04

by Johannes Weiner

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Thu, Nov 05, 2015 at 05:28:03PM +0100, Michal Hocko wrote:
> On Thu 05-11-15 11:16:09, David S. Miller wrote:
> > From: Michal Hocko <[email protected]>
> > Date: Thu, 5 Nov 2015 15:40:02 +0100
> >
> > > On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> > > [...]
> > >> Because it goes without saying that once the cgroupv2 interface is
> > >> released, and people use it in production, there is no way we can then
> > >> *add* dentry cache, inode cache, and others to memory.current. That
> > >> would be an unacceptable change in interface behavior.
> > >
> > > They would still have to _enable_ the config option _explicitly_. make
> > > oldconfig wouldn't change it silently for them. I do not think
> > > it is an unacceptable change of behavior if the config is changed
> > > explicitly.
> >
> > Every user is going to get this config option when they update their
> > distribution kernel or whatever.
> >
> > Then they will all wonder why their networking performance went down.
> >
> > This is why I do not want the networking accounting bits on by default
> > even if the kconfig option is enabled. They must be off by default
> > and guarded by a static branch so the cost is exactly zero.
>
> Yes, that part is clear and Johannes made it clear that the kmem tcp
> part is disabled by default. Or are you considering also all the slab
> usage by the networking code as well?

Michal, there shouldn't be any tracking or accounting going on per
default when you boot into a fresh system.

I removed all accounting and statistics on the system level in
cgroupv2, so distribution kernels can compile-time enable a single,
feature-complete CONFIG_MEMCG that provides a full memory controller
while at the same time puts no overhead on users that don't benefit
from mem control at all and just want to use the machine bare-metal.

This is completely doable. My new series does it for skmem, but I also
want to retrofit the code to eliminate that current overhead for page
cache, anonymous memory, slab memory and so forth.

This is the only sane way to make the memory controller powerful and
generally useful without having to make unreasonable compromises with
memory consumers. We shouldn't even be *having* the discussion about
whether we should sacrifice the quality of our interface in order to
compromise with a class of users that doesn't care about any of this
in the first place.

So let's eliminate the cost for non-users, but make the memory
controller feature-complete and useful--with reasonable cost,
implementation, and interface--for our actual userbase.

Paying the necessary cost for a functionality you actually want is not
the problem. Paying for something that doesn't benefit you is.

2015-11-05 22:52:11

by Johannes Weiner

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > This would be true if they moved on to the new cgroup API intentionally.
> > The reality is more complicated though. AFAIK systemd is waiting for
> > cgroup2 already and privileged services enable all available resource
> > controllers by default as I've learned just recently.
>
> Have you filed a report with them? I don't think they should turn them
> > on unless users explicitly configure resource control for the unit.

Okay, verified with systemd people that they're not planning on
enabling resource control per default.

Inflammatory half-truths, man. This is not constructive.

2015-11-06 09:06:23

by Vladimir Davydov

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
...
> > 3) keep only some (safe) cache types enabled by default with the current
> > failing semantic and require an explicit enabling for the complete
> > kmem accounting. [di]cache code paths should be quite robust to
> > handle allocation failures.
>
> Vladimir, what would be your opinion on this?

I'm all for this option. Actually, I've been thinking about this since I
introduced the __GFP_NOACCOUNT flag. Not because of the failing
semantics, since we can always let kmem allocations breach the limit.
This shouldn't be critical, because I don't think it's possible to issue
a series of kmem allocations w/o a single user page allocation, which
would reclaim/kill the excess.

The point is there are allocations that are shared system-wide and
therefore shouldn't go to any memcg. Most obvious examples are: mempool
users and radix_tree/idr preloads. Accounting them to memcg is likely to
result in noticeable memory overhead as memory cgroups are
created/destroyed, because they pin dead memory cgroups with all their
kmem caches, which aren't tiny.

Another funny example is objects destroyed lazily for performance
reasons, e.g. vmap_area. Such objects are usually very small, so
delaying destruction of a bunch of them will normally go unnoticed.
However, if kmemcg is used, the effective memory consumption caused by
such objects can be multiplied many times over due to dangling kmem
caches.

We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the
problem is they are tricky to identify, because they are scattered all
over the kernel source tree. E.g. Dave Chinner mentioned that XFS
internals do a lot of allocations that are shared among all XFS
filesystems and therefore should not be accounted (BTW that's why
list_lru's used by XFS are not marked as memcg-aware). There must be
more out there. Besides, kernel developers don't usually even know about
kmemcg (they just write the code for their subsys, so why should they?)
so they won't even think about using __GFP_NOACCOUNT, and hence new
falsely-accounted allocations are likely to appear.

That said, by switching from black-list (__GFP_NOACCOUNT) to white-list
(__GFP_ACCOUNT) kmem accounting policy we would make the system more
predictable and robust IMO. OTOH what would we lose? Security? Well,
containers aren't secure IMHO. In fact, I doubt they will ever be (as
secure as VMs). Anyway, if a runaway allocation is reported, it should
be trivial to fix by adding __GFP_ACCOUNT where appropriate.

If there are no objections, I'll prepare a patch switching to the
white-list approach. Let's start from obvious things like fs_struct,
mm_struct, task_struct, signal_struct, dentry, inode, which can be
easily allocated from user space. This should cover 90% of all
allocations that should be accounted AFAICS. The rest will be added
later if necessary.

Thanks,
Vladimir

2015-11-06 10:57:30

by Michal Hocko

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > This would be true if they moved on to the new cgroup API intentionally.
> > > The reality is more complicated though. AFAIK systemd is waiting for
> > > cgroup2 already and privileged services enable all available resource
> > > controllers by default as I've learned just recently.
> >
> > Have you filed a report with them? I don't think they should turn them
> > on unless users explicitly configure resource control for the unit.
>
> Okay, verified with systemd people that they're not planning on
> enabling resource control per default.
>
> Inflammatory half-truths, man. This is not constructive.

What about the Delegate=yes feature then? We have just been burnt by this
quite heavily. AFAIU [email protected] and [email protected] have this
enabled by default
http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html

--
Michal Hocko
SUSE Labs

2015-11-06 12:51:45

by Michal Hocko

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Thu 05-11-15 17:32:51, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 05:28:03PM +0100, Michal Hocko wrote:
[...]
> > Yes, that part is clear and Johannes made it clear that the kmem tcp
> > part is disabled by default. Or are you considering also all the slab
> > usage by the networking code as well?
>
> Michal, there shouldn't be any tracking or accounting going on per
> default when you boot into a fresh system.
>
> I removed all accounting and statistics on the system level in
> cgroupv2, so distribution kernels can compile-time enable a single,
> feature-complete CONFIG_MEMCG that provides a full memory controller
> while at the same time puts no overhead on users that don't benefit
> from mem control at all and just want to use the machine bare-metal.

Yes, that part is clear and I am not disputing it _at all_. It is just
that chances are high that the memory controller _will_ be enabled on
typical distribution systems. E.g. systemd _is_ enabling all resource
controllers by default for some services with the Delegate=yes option.

> This is completely doable. My new series does it for skmem, but I also
> want to retrofit the code to eliminate that current overhead for page
> cache, anonymous memory, slab memory and so forth.
>
> This is the only sane way to make the memory controller powerful and
> generally useful without having to make unreasonable compromises with
> memory consumers. We shouldn't even be *having* the discussion about
> whether we should sacrifice the quality of our interface in order to
> compromise with a class of users that doesn't care about any of this
> in the first place.
>
> So let's eliminate the cost for non-users, but make the memory
> controller feature-complete and useful--with reasonable cost,
> implementation, and interface--for our actual userbase.
>
> Paying the necessary cost for a functionality you actually want is not
> the problem. Paying for something that doesn't benefit you is.

I completely agree that a reasonable cost is acceptable for those who
_want_ the functionality. It hasn't been shown that people in the wild
have actually been missing kmem accounting in general. E.g. the kmem
controller is not even enabled in the openSUSE or SLES kernels and I do
not remember any huge push to enable it.

I do understand that you want to have an out-of-the-box isolation
behavior, which I agree is a nice-to-have feature, especially with the
growing penetration of containerized workloads. But my point still
holds: this is not something everybody wants. So having a config option
plus a boot time parameter to override it is the most reasonable way to
go. You can clearly see from the tcp kmem extension that there is
already demand for this, because those users really _care_ about every
single cpu cycle even though some part of their userspace happens to
have memcg enabled.

The question of the configuration default is a separate one and we can
discuss it on its own, because it is not an easy one to decide right now
IMHO.
--
Michal Hocko
SUSE Labs

2015-11-06 13:21:08

by Michal Hocko

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Thu 05-11-15 15:55:22, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
[...]
> > This would be true if they moved on to the new cgroup API intentionally.
> > The reality is more complicated though. AFAIK systemd is waiting for
> > cgroup2 already and privileged services enable all available resource
> > controllers by default as I've learned just recently.
>
> Have you filed a report with them? I don't think they should turn them
> on unless users explicitly configure resource control for the unit.

We have just been bitten by this (aka Delegate=yes for some basic
services) and our systemd people are supposed to bring this up upstream.
I've mentioned that in another email, where you accuse me of spreading
FUD.

> But what I said still holds: critical production machines don't just
> get rolling updates and "accidentally" switch to all this new
> code. And those that do take the plunge have the cmdline options.

That is exactly my point and why I do not think re-evaluating the default
config option is a problem at all. The default wouldn't matter for
existing users. Those who care can have all the functionality they need
right away - be it kmem enabled or disabled.

> > > And it makes a lot more sense to account them in excess of the limit
> > > than pretend they don't exist. We might not be able to completely
> > > fullfill the containment part of the memory controller (although these
> > > slab charges will still create significant pressure before that), but
> > > at least we don't fail the accounting part on top of it.
> >
> > Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any
> > load could simply runaway via kernel allocations. What is even worse we
> > might even not trigger memcg OOM killer before we hit the global OOM. So
> > the whole containment goes straight to hell.
> >
> > I can see four options here:
> > 1) enable kmem by default with the current semantic which we know can
> > BUG_ON (at least btrfs is known to hit this) or lead to other issues.
>
> Can you point me to that report?

git grep "BUG_ON.*ENOMEM" -- fs/btrfs

just to give you a picture. Not all of them are kmalloc failures and
others are not annotated with an ENOMEM comment. This came out as a
result of my last attempt to allow GFP_NOFS to fail
(http://lkml.kernel.org/r/1438768284-30927-1-git-send-email-mhocko%40kernel.org)

> That's not "semantics", that's a bug! Whether or not a feature is
> enabled by default, it can not be allowed to crash the kernel.

Yes, those are bugs and have to be fixed. Not an easy task, but nothing
that couldn't be solved - it just takes some time. They are not very
likely to trigger right now because they are reduced to corner cases,
but they become more visible with the current kmem accounting semantic.
So either we change the semantic or we wait until this gets fixed, if
the accounting is to be on by default.

> Presenting this as a choice is a bit of a strawman argument.
>
> > 2) enable kmem by default and change the semantic for cgroup2 to allow
> > runaway charges above the hard limit which would defeat the whole
> > purpose of the containment for cgroup2. This can be a temporary
> > workaround until we can afford kmem failures. This has a big risk
> > that we will end up with this permanently because there is a strong
> > pressure that GFP_KERNEL allocations should never fail. Yet this is
> > the most common type of request. Or do we change the consistency with
> > the global case at some point?
>
> As per 1) we *have* to fail containment eventually if not doing so
> means crashes and lockups. That's not a choice of semantics.
>
> But that doesn't mean we have to give up *immediately* and allow
> unrestrained "runaway charges"--again, more of a strawman than a
> choice. We can still throttle the allocator and apply significant
> pressure on the memory pool, culminating in OOM kills eventually.
>
> Once we run out of available containment tools, however, we *have* to
> follow the semantics of the page and slab allocator and succeed the
> request. We can not just return -ENOMEM if that causes kernel bugs.
>
> That's the only thing we can do right now.
>
> In fact, it's likely going to be the best we will ever be able to do
> when it comes to kernel memory accounting. Linus made it clear where
> he stands on failing kernel allocations, so all we can do is continue
> to improve our containment tools and then give up on containment when
> they're exhausted and force the charge past the limit.

OK, then we need all the additional measures to keep the hard limit
excess bounded.

> > 3) keep only some (safe) cache types enabled by default with the current
> > failing semantic and require an explicit enabling for the complete
> > kmem accounting. [di]cache code paths should be quite robust to
> > handle allocation failures.
>
> Vladimir, what would be your opinion on this?
>
> > 4) disable kmem by default and change the config default later to signal
> > the accounting is safe as far as we are aware and let people enable
> > the functionality on those basis. We would keep the current failing
> > semantic.
> >
> > To me 4) sounds like the safest option because it still keeps the
> > functionality available to those who can benefit from it in v1 already
> > while we are not exposing a potentially buggy behavior to the majority
> > (many of them even unintentionally). Moreover we still allow to change
> > the default later on an explicit basis.
>
> I'm not interested in fragmenting the interface forever out of caution
> because there might be a bug in the implementation right now. As I
> said we have to fix any instability in the features we provide whether
> they are turned on by default or not. I don't see how this is relevant
> to the interface discussion.
>
> Also, there is no way we can later fundamentally change the semantics
> of memory.current, so it would have to remain configurable forever,
> forcing people forever to select multiple options in order to piece
> together a single logical kernel feature.
>
> This is really not an option, either.

Why not? I can clearly see people who would really want to have this
disabled, and doing that via config is much easier than providing a
command line parameter. A config option doesn't add any more maintenance
burden than the boot time parameter does.

> If there are show-stopping bugs in the implementation, I'd rather hold
> off the release of the unified hierarchy than commit to a half-assed
> interface right out of the gate.

If you are willing to postpone releasing cgroup2 until this gets
resolved - one way or another - then I have no objections. My impression
was that Tejun wanted to release it sooner rather than later. As this
very discussion shows, we are not even sure yet what the kmem failure
behavior should be.

> The point of v2 is sane interfaces.

And the sane interface to me is to use a single set of knobs regardless
of memory type. We are currently only discussing what should be
accounted by default. My understanding of what David said is that tcp
kmem should be enabled only when explicitly opted in. Did I get this
wrong?

> So let's please focus on fixing any problems that slab accounting may
> have, rather than designing complex config options and transition
> procedures whose sole purpose is to defer dealing with our issues.
>
> Please?

--
Michal Hocko
SUSE Labs

2015-11-06 13:29:45

by Michal Hocko

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Fri 06-11-15 12:05:55, Vladimir Davydov wrote:
[...]
> If there are no objections, I'll prepare a patch switching to the
> white-list approach. Let's start from obvious things like fs_struct,
> mm_struct, task_struct, signal_struct, dentry, inode, which can be
> easily allocated from user space.

pipe buffers, kernel stacks and who knows what more.

> This should cover 90% of all
> allocations that should be accounted AFAICS. The rest will be added
> later if necessary.

The more I think about it, the more I am convinced that this is the only
sane way forward. The only concern I have is how we deal with the old
interface in cgroup1. We do not want to break existing deployments which
might depend on the current behavior. I doubt any do, but...
--
Michal Hocko
SUSE Labs

2015-11-06 16:20:08

by Johannes Weiner

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote:
> On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > > This would be true if they moved on to the new cgroup API intentionally.
> > > > The reality is more complicated though. AFAIK systemd is waiting for
> > > > cgroup2 already and privileged services enable all available resource
> > > > controllers by default as I've learned just recently.
> > >
> > > Have you filed a report with them? I don't think they should turn them
> > > > on unless users explicitly configure resource control for the unit.
> >
> > Okay, verified with systemd people that they're not planning on
> > enabling resource control per default.
> >
> > Inflammatory half-truths, man. This is not constructive.
>
> What about Delegate=yes feature then? We have just been burnt by this
> quite heavily. AFAIU [email protected] and [email protected] have this
> enabled by default
> http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html

That's when you launch a *container* and want it to be able to use
nested resource control.

We're talking about actual container users here. It's not turning on
resource control for all "privileged services", which is what we were
worried about here. Can you at least admit that when you yourself link
to the refuting evidence?

And if you've been "burnt quite heavily" by this, where is your bug
report to stop other users from getting "burnt quite heavily" as well?

All I read here is vague inflammatory language to spread FUD.

You might think sending these emails is helpful, but it really
isn't. Not only is it not contributing code, insights, or solutions,
you're now actively sabotaging someone else's effort to build something.

2015-11-06 16:36:09

by Johannes Weiner

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Fri, Nov 06, 2015 at 12:05:55PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> ...
> > > 3) keep only some (safe) cache types enabled by default with the current
> > > failing semantic and require an explicit enabling for the complete
> > > kmem accounting. [di]cache code paths should be quite robust to
> > > handle allocation failures.
> >
> > Vladimir, what would be your opinion on this?
>
> I'm all for this option. Actually, I've been thinking about this since I
> introduced the __GFP_NOACCOUNT flag. Not because of the failing
> semantics, since we can always let kmem allocations breach the limit.
> This shouldn't be critical, because I don't think it's possible to issue
> a series of kmem allocations w/o a single user page allocation, which
> would reclaim/kill the excess.
>
> The point is there are allocations that are shared system-wide and
> therefore shouldn't go to any memcg. Most obvious examples are: mempool
> users and radix_tree/idr preloads. Accounting them to memcg is likely to
> result in noticeable memory overhead as memory cgroups are
> created/destroyed, because they pin dead memory cgroups with all their
> kmem caches, which aren't tiny.
>
> Another funny example is objects destroyed lazily for performance
> reasons, e.g. vmap_area. Such objects are usually very small, so
> delaying destruction of a bunch of them will normally go unnoticed.
> However, if kmemcg is used the effective memory consumption caused by
> such objects can be multiplied by many times due to dangling kmem
> caches.
>
> We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the
> problem is they are tricky to identify, because they are scattered all
> over the kernel source tree. E.g. Dave Chinner mentioned that XFS
> internals do a lot of allocations that are shared among all XFS
> filesystems and therefore should not be accounted (BTW that's why
> list_lru's used by XFS are not marked as memcg-aware). There must be
> more out there. Besides, kernel developers don't usually even know about
> kmemcg (they just write the code for their subsys, so why should they?)
> so they won't care thinking about using __GFP_NOACCOUNT, and hence new
> falsely-accounted allocations are likely to appear.
>
> That said, by switching from black-list (__GFP_NOACCOUNT) to white-list
> (__GFP_ACCOUNT) kmem accounting policy we would make the system more
> predictable and robust IMO. OTOH what would we lose? Security? Well,
> containers aren't secure IMHO. In fact, I doubt they will ever be (as
> secure as VMs). Anyway, if a runaway allocation is reported, it should
> be trivial to fix by adding __GFP_ACCOUNT where appropriate.

I wholeheartedly agree with all of this.

> If there are no objections, I'll prepare a patch switching to the
> white-list approach. Let's start from obvious things like fs_struct,
> mm_struct, task_struct, signal_struct, dentry, inode, which can be
> easily allocated from user space. This should cover 90% of all
> allocations that should be accounted AFAICS. The rest will be added
> later if necessary.

Awesome, I'm looking forward to that patch!

2015-11-06 16:47:03

by Michal Hocko

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Fri 06-11-15 11:19:53, Johannes Weiner wrote:
> On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote:
> > On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > > > This would be true if they moved on to the new cgroup API intentionally.
> > > > > The reality is more complicated though. AFAIK systemd is waiting for
> > > > > cgroup2 already and privileged services enable all available resource
> > > > > controllers by default as I've learned just recently.
> > > >
> > > > Have you filed a report with them? I don't think they should turn them
> > > > on unless users explicitly configure resource control for the unit.
> > >
> > > Okay, verified with systemd people that they're not planning on
> > > enabling resource control per default.
> > >
> > > Inflammatory half-truths, man. This is not constructive.
> >
> > What about Delegate=yes feature then? We have just been burnt by this
> > quite heavily. AFAIU [email protected] and [email protected] have this
> > enabled by default
> > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html
>
> That's when you launch a *container* and want it to be able to use
> nested resource control.

Oops, copy&paste error here. The second one was [email protected]. So
AFAIU it is not only about containers but about all user defined
sessions.

> We're talking about actual container users here. It's not turning on
> resource control for all "privileged services", which is what we were
> worried about here. Can you at least admit that when you yourself link
> to the refuting evidence?

My bad, that was a misunderstanding of the changelog.

> And if you've been "burnt quite heavily" by this, where is your bug
> report to stop other users from getting "burnt quite heavily" as well?

The bug report is still internal because it is tracking an unreleased
product. We have ended up reverting the Delegate feature. Our systemd
developers are supposed to bring this up with upstream.

The basic problem was that the Delegate feature had been backported to
our systemd package without further consideration, and that invalidated
a lot of performance testing because some resource controllers have
measurable effects on those benchmarks.

> All I read here is vague inflammatory language to spread FUD.

I was merely pointing out that the memory controller might be enabled
without the _user_ even noticing, because the controller wasn't enabled
explicitly. I haven't blamed anybody for that.

> You might think sending these emails is helpful, but it really
> isn't. Not only is it not contributing code, insights, or solutions,
> you're now actively sabotaging someone else's effort to build something.

Come on! Are you even serious?

--
Michal Hocko
SUSE Labs

2015-11-06 17:45:47

by Johannes Weiner

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Fri, Nov 06, 2015 at 05:46:57PM +0100, Michal Hocko wrote:
> The basic problem was that the Delegate feature has been backported to
> our systemd package without further consideration and that has
> invalidated a lot of performance testing because some resource
> controllers have measurable effects on those benchmarks.

You're talking about a userspace bug. No amount of fragmenting and
layering and opt-in in the kernel's runtime configuration space is
going to help you if you screw up and enable it all by accident.

> > All I read here is vague inflammatory language to spread FUD.
>
> I was merely pointing out that memory controller might be enabled without
> _user_ actually even noticing because the controller wasn't enabled
> explicitly. I haven't blamed anybody for that.

Why does that have anything to do with how we design our interface?

We can't do more than present a sane interface in good faith and lobby
userspace projects if we think they misuse it.

2015-11-07 03:45:45

by David Miller

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

From: Michal Hocko <[email protected]>
Date: Fri, 6 Nov 2015 17:46:57 +0100

> On Fri 06-11-15 11:19:53, Johannes Weiner wrote:
>> You might think sending these emails is helpful, but it really
>> isn't. Not only is it not contributing code, insights, or solutions,
>> you're now actively sabotaging someone else's effort to build something.
>
> Come on! Are you even serious?

He is, and I agree 100% with him FWIW.

2015-11-12 18:41:45

by Mel Gorman

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Fri, Nov 06, 2015 at 11:19:53AM -0500, Johannes Weiner wrote:
> On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote:
> > On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > > > This would be true if they moved on to the new cgroup API intentionally.
> > > > > The reality is more complicated though. AFAIK systemd is waiting for
> > > > > cgroup2 already and privileged services enable all available resource
> > > > > controllers by default as I've learned just recently.
> > > >
> > > > Have you filed a report with them? I don't think they should turn them
> > > > on unless users explicitly configure resource control for the unit.
> > >
> > > Okay, verified with systemd people that they're not planning on
> > > enabling resource control per default.
> > >
> > > Inflammatory half-truths, man. This is not constructive.
> >
> > What about Delegate=yes feature then? We have just been burnt by this
> > quite heavily. AFAIU [email protected] and [email protected] have this
> > enabled by default
> > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html
>
> That's when you launch a *container* and want it to be able to use
> nested resource control.
>
> We're talking about actual container users here. It's not turning on
> resource control for all "privileged services", which is what we were
> worried about here. Can you at least admit that when you yourself link
> to the refuting evidence?
>
> And if you've been "burnt quite heavily" by this, where is your bug
> report to stop other users from getting "burnt quite heavily" as well?
>

I didn't read this thread in detail but the lack of public information on
problems with cgroup controllers is partially my fault so I'd like to correct
that.

https://bugzilla.suse.com/show_bug.cgi?id=954765

This bug documents some of the impact that was incurred due to ssh
sessions being resource controlled by default. It talks primarily about
pipetest being impacted by cpu,cpuacct. It was also found in the recent
past that dbench4 was impacted because the blkio controller was enabled.
That bug is not public, but basically dbench4 regressed by 80% because
the journal thread was in a different cgroup than dbench4; dbench4 would
stall for 8ms whenever more IO was issued before the journal thread
could issue any IO. The openSUSE bug 954765 is not affected by blkio
because it's disabled by a distribution-specific patch. Mike Galbraith
adds some additional information on why activating the cpu controller
can have an impact on semantics even if the overhead were zero.

It may be the case that it's an oversight by the systemd developers and the
intent was only to affect containers. My experience was that everything
was affected. It also may be the case that this is an opensuse-specific
problem due to how the maintainers packaged systemd. I don't actually
know and hopefully the bug will be able to determine if upstream is really
affected or not.

There is also a link to this bug in the upstream project, so there is
some chance they are aware: https://github.com/systemd/systemd/issues/1715

Bottom line, there is legitimate confusion over whether cgroup
controllers are going to be enabled by default or not in the future. If
they are enabled by default, there is a non-zero cost to that and a
change in semantics that people may or may not be surprised by.

--
Mel Gorman
SUSE Labs

2015-11-12 19:12:38

by Johannes Weiner

Subject: Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

On Thu, Nov 12, 2015 at 06:36:20PM +0000, Mel Gorman wrote:
> Bottom line, there is legimate confusion over whether cgroup controllers
> are going to be enabled by default or not in the future. If they are
> enabled by default, there is a non-zero cost to that and a change in
> semantics that people may or may not be surprised by.

Thanks for elaborating, Mel.

My understanding is that this is a plain bug. I don't think anybody
wants to put costs without benefits on their users.

But I'll keep an eye on these reports, and I'll work with the systemd
people should issues with the kernel interface materialize that would
force them to enable resource control prematurely.