2014-04-20 10:39:43

by Vladimir Davydov

Subject: [RFC] how should we deal with dead memcgs' kmem caches?

Hi,

From my point of view, one of the biggest problems in the kmemcg
implementation is how to handle per-memcg kmem caches that still have
objects when the owner memcg is turned offline. There are actually two
issues here. First, when and where should we initiate cache destruction
(e.g. schedule the destruction work from kmem_cache_free when the last
object goes away, or reap dead caches periodically)? Second, how do we
prevent races between kmem_cache_destroy and kmem_cache_free for a dead
cache: the point is that kmem_cache_free may still access the kmem_cache
structure after it has released the object, and releasing that object is
exactly what may make the cache destroyable.

Here I'd like to present the possible ways of sorting out this problem,
along with their pros and cons, in the hope that you will share your
thoughts on them and perhaps we will be able to come to a consensus
about which way to choose.


* How it works now *

We count the pages (slabs) allocated to a per-memcg cache in
memcg_cache_params::nr_pages. When a memcg is turned offline, we set the
memcg_cache_params::dead flag, which makes the slab freeing functions
schedule the memcg_cache_params::destroy work as soon as nr_pages
reaches 0; the work then destroys the cache (see memcg_release_pages).
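
For reference, here is a simplified sketch of the fields involved, based
only on the names used above (the actual layout of memcg_cache_params in
the kernel is somewhat different):

/* Simplified sketch of the per-memcg part of memcg_cache_params,
 * showing only the fields mentioned above. */
struct memcg_cache_params {
	struct mem_cgroup *memcg;	/* owner memcg */
	bool dead;			/* set when the owner memcg goes offline */
	atomic_t nr_pages;		/* pages (slabs) allocated to the cache */
	struct work_struct destroy;	/* scheduled when dead && nr_pages == 0 */
};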

Actually, it does not work as expected: kmem caches that still have
objects at memcg offline will be leaked, because neither the slab nor
the slub design ever frees all pages in kmem_cache_free, in order to
speed up further allocations/frees.

Furthermore, we currently don't handle possible races between
kmem_cache_free/shrink and the destruction work - we can still be using
the cache in kmem_cache_free/shrink after we have freed the last page
and initiated destruction.


* Way #1 - prevent dead kmem caches from caching slabs on free *

We can modify sl[au]b implementation so that it won't cache any objects
on free if the kmem cache belongs to a dead memcg. Then it'd be enough
to drain per-cpu pools of all dead kmem caches on css offline - no new
slabs will be added there on further frees, and the last object will go
away along with the last slab.

Pros: I don't see any
Cons:
- have to intrude into sl[au]b internals
- frees to dead caches will be noticeably slowed down

We still have to solve the kmem_cache_free vs destroy race somehow, e.g.
by rearranging the operations in kmem_cache_free so that nr_pages is
always decremented at the very end.
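
To illustrate the idea, here is a rough sketch of such an ordering (the
helper name and the call site are hypothetical, not actual sl[au]b code):

/* Hypothetical sketch: the nr_pages decrement is the very last thing a
 * freeing thread does with the cache, so destruction (triggered by
 * whoever drops the count to zero) cannot race with a freer that still
 * uses the cache structure. */
static void memcg_release_slab_page(struct kmem_cache *cachep)
{
	struct memcg_cache_params *params = cachep->memcg_params;

	/*
	 * Only the thread that drops nr_pages to zero schedules the
	 * destroy work, so it is still safe for it to dereference
	 * params here; all other freers must not touch the cache
	 * after their decrement.
	 */
	if (atomic_dec_and_test(&params->nr_pages) && params->dead)
		schedule_work(&params->destroy);
}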


* Way #2 - reap caches periodically or on vmpressure *

We can remove the async work scheduling from kmem_cache_free completely,
and instead walk over all dead kmem caches either periodically or on
vmpressure to shrink and destroy those of them that become empty.

That is what I had in mind when submitting the patch set titled "kmemcg:
simplify work-flow":
https://lkml.org/lkml/2014/4/18/42

Pros: easy to implement
Cons: instead of being destroyed asap, dead caches will hang around
until some point in time or, even worse, until a memory pressure
condition arises.

Again, this says nothing about the free-vs-destroy race. I was planning
to rearrange the operations in kmem_cache_free as I described above.
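
For illustration, the reaping walk could look something like this (a
rough sketch: the list, the lock, and the back-pointer from
memcg_cache_params to its cache are assumptions, and the locking around
cache destruction is glossed over):

/* Hypothetical sketch of periodic/vmpressure-driven reaping of dead
 * per-memcg caches. */
static void kmemcg_reap_dead_caches(void)
{
	struct memcg_cache_params *params, *tmp;

	mutex_lock(&memcg_dead_caches_mutex);	/* assumed list lock */
	list_for_each_entry_safe(params, tmp, &memcg_dead_caches, list) {
		struct kmem_cache *cachep = params->cachep;	/* assumed back-pointer */

		kmem_cache_shrink(cachep);	/* drop empty slabs and per-cpu caches */
		if (atomic_read(&params->nr_pages) == 0)
			kmem_cache_destroy(cachep);	/* also unlinks it from the list */
	}
	mutex_unlock(&memcg_dead_caches_mutex);
}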


* Way #3 - re-parent individual slabs *

Theoretically, we could move all slab pages belonging to a kmem cache of
a dead memcg to its parent memcg's cache. Then we could remove all dead
caches immediately on css offline.

Pros:
- slabs of dead caches could be reused by parent memcg
- should solve the cache free-vs-destroy race
Cons:
- difficult to implement - requires deep knowledge of the sl[au]b
  designs and an individual approach to each of the two algorithms
- will require heavy intrusion into sl[au]b internals


* Way #4 - count active objects per memcg cache *

Instead of counting the pages allocated to per-memcg kmem caches, we
could count individual objects. To minimize the performance impact, we
could use percpu counters (something like percpu_ref). A rough sketch
follows the pros and cons below.

Pros:
- very simple and clear implementation, independent of the slab
  algorithm
- caches are destroyed as soon as they become empty
- solves the free-vs-destroy race automatically - cache destruction
  will be initiated at the end of the last kfree, so that no races
  are possible
Cons:
- will impact the performance of alloc/free for per-memcg caches,
  especially for dead ones, for which we have to switch to an atomic
  counter
- the existing implementation of percpu_ref can only hold 2^31-1
  values; although we can currently hardly have 2G kmem objects of one
  kind even in a global cache, not to mention a per-memcg one, we
  should use a long counter to avoid overflows in the future;
  therefore we would either have to extend the existing implementation,
  introduce a new percpu long counter, or use an in-place solution
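
To make this more concrete, here is a rough sketch of how the counting
could be wired up, using percpu_ref as-is just to illustrate the idea
(the refcnt field and the helper names are hypothetical; as noted above,
a long-based counter would really be needed):

/* Hypothetical sketch: one percpu_ref per per-memcg cache, counting
 * live objects; the release callback fires when the count drops to
 * zero after the memcg has gone offline. */

static void memcg_cache_last_object_gone(struct percpu_ref *ref)
{
	struct memcg_cache_params *params =
		container_of(ref, struct memcg_cache_params, refcnt);

	schedule_work(&params->destroy);	/* the cache is empty and dead */
}

static int memcg_cache_init_refcnt(struct memcg_cache_params *params)
{
	return percpu_ref_init(&params->refcnt, memcg_cache_last_object_gone);
}

/* called from the allocation path */
static inline void memcg_cache_get_object(struct memcg_cache_params *params)
{
	percpu_ref_get(&params->refcnt);
}

/* called from the free path; may trigger destruction of a dead cache */
static inline void memcg_cache_put_object(struct memcg_cache_params *params)
{
	percpu_ref_put(&params->refcnt);
}

/* called on memcg offline: switch to the (slower) atomic counter and
 * drop the base reference, so the release fires with the last object */
static void memcg_cache_kill_refcnt(struct memcg_cache_params *params)
{
	percpu_ref_kill(&params->refcnt);
}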


I'd appreciate it if you could vote for the solution you like most or
propose other approaches.

Thank you.


2014-04-21 12:19:00

by Johannes Weiner

Subject: Re: [RFC] how should we deal with dead memcgs' kmem caches?

On Sun, Apr 20, 2014 at 02:39:31PM +0400, Vladimir Davydov wrote:
> * Way #2 - reap caches periodically or on vmpressure *
>
> We can remove the async work scheduling from kmem_cache_free completely,
> and instead walk over all dead kmem caches either periodically or on
> vmpressure to shrink and destroy those of them that become empty.
>
> That is what I had in mind when submitting the patch set titled "kmemcg:
> simplify work-flow":
> https://lkml.org/lkml/2014/4/18/42
>
> Pros: easy to implement
> Cons: instead of being destroyed asap, dead caches will hang around
> until some point in time or, even worse, until a memory pressure
> condition arises.

This would continue to pin css after cgroup destruction indefinitely,
or at least for an arbitrary amount of time. To reduce the waste from
such pinning, we currently have to tear down other parts of the memcg
optimistically from css_offline(), which is called before the last
reference disappears and out of hierarchy order, making the teardown
unnecessarily complicated and error prone.

So I think "easy to implement" is misleading. What we really care
about is "easy to maintain", and this basically excludes any async
schemes.

As far as synchronous cache teardown goes, I think everything that
introduces object accounting into the slab hotpaths will also be a
tough sell.

Personally, I would prefer the cache merging, where remaining child
slab pages are moved to the parent's cache on cgroup destruction.

2014-04-21 15:00:30

by Vladimir Davydov

Subject: Re: [RFC] how should we deal with dead memcgs' kmem caches?

On 04/21/2014 04:18 PM, Johannes Weiner wrote:
> On Sun, Apr 20, 2014 at 02:39:31PM +0400, Vladimir Davydov wrote:
>> * Way #2 - reap caches periodically or on vmpressure *
>>
>> We can remove the async work scheduling from kmem_cache_free completely,
>> and instead walk over all dead kmem caches either periodically or on
>> vmpressure to shrink and destroy those of them that become empty.
>>
>> That is what I had in mind when submitting the patch set titled "kmemcg:
>> simplify work-flow":
>> https://lkml.org/lkml/2014/4/18/42
>>
>> Pros: easy to implement
>> Cons: instead of being destroyed asap, dead caches will hang around
>> until some point in time or, even worse, until a memory pressure
>> condition arises.
>
> This would continue to pin css after cgroup destruction indefinitely,
> or at least for an arbitrary amount of time. To reduce the waste from
> such pinning, we currently have to tear down other parts of the memcg
> optimistically from css_offline(), which is called before the last
> reference disappears and out of hierarchy order, making the teardown
> unnecessarily complicated and error prone.

I don't think that re-parenting kmem caches on css offline would be
complicated in such a scheme. We just need to walk over the memcg's
memcg_slab_caches list and move the caches to the list of its parent,
changing the memcg_params::memcg ptr along the way. Also, we have to
ensure that all readers of memcg_params::memcg are protected with RCU
and handle re-parenting properly. AFAIU, we'd have to do approximately
the same if we decided to go with re-parenting of individual slabs.
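
Something along these lines (a rough sketch; the lock name is an
assumption, and the RCU handling on the reader side is not shown):

/* Hypothetical sketch of re-parenting a memcg's kmem caches on css
 * offline. Chargers are assumed to read memcg_params::memcg under
 * rcu_read_lock(). */
static void memcg_reparent_kmem_caches(struct mem_cgroup *memcg,
				       struct mem_cgroup *parent)
{
	struct memcg_cache_params *params, *tmp;

	mutex_lock(&memcg_slab_mutex);		/* assumed list lock */
	list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) {
		rcu_assign_pointer(params->memcg, parent);
		list_move(&params->list, &parent->memcg_slab_caches);
	}
	mutex_unlock(&memcg_slab_mutex);

	synchronize_rcu();	/* make sure nobody still sees the old memcg */
}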

To me the most disgusting part is that after css offline we'll have
pointless dead caches hanging around for an indefinite time w/o any
chance of getting reused.

> So I think "easy to implement" is misleading. What we really care
> about is "easy to maintain", and this basically excludes any async
> schemes.
>
> As far as synchronous cache teardown goes, I think everything that
> introduces object accounting into the slab hotpaths will also be a
> tough sell.

Agreed, so ways #1 and #4 don't seem to be an option.

> Personally, I would prefer the cache merging, where remaining child
> slab pages are moved to the parent's cache on cgroup destruction.

Your point is clear to me and sounds quite reasonable, but I'm afraid
that moving slabs from one active kmem cache to another would be really
difficult to implement, since kmem_cache_free is mostly lock-less. Also,
we'd have to intrude into the free fast path to handle concurrent cache
merging. That's why I'm trying to find another way around this.

There is another idea that has just sprung to mind. It's based on Way
#2, but with one significant difference - dead caches can be reused.
Here is the full picture.

First, for re-parenting of kmem charges, there will be a kmem context
object for each kmem-enabled memcg. All kmem pages (including slabs)
that are charged to a memcg will point to the context object of that
memcg through the page cgroup, not to the memcg itself as it works now.
When a kmem page is freed, it will be uncharged against the memcg of
the context it points to. The context will look like this:

struct memcg_kmem_context {
	struct mem_cgroup *memcg;	/* owner memcg */
	atomic_long_t nr_pages;		/* nr pages charged to the memcg
					   through this context */
	struct list_head list;		/* list of all memcg's contexts */
};

struct mem_cgroup {
	[...]
	struct memcg_kmem_context *kmem_ctx;
	[...]
};

On memcg offline we'll re-parent its context to the parent memcg by
changing the memcg_kmem_context::memcg ptr to the parent, so that all
previously allocated objects will be uncharged properly against the
parent. Regarding the memcg's caches, no re-parenting is necessary -
we'll just mark them as orphaned (e.g. by clearing memcg_params::memcg).
We won't remove the orphaned caches from the root caches'
memcg_params::memcg_caches arrays, although we will release the dead
memcg along with its kmemcg_id. The orphaned caches won't have any
references to the dead memcg and therefore won't pin the css.
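
Roughly, the css offline part could look like this (a sketch only; the
helper name and the per-memcg list of contexts are assumptions):

/* Hypothetical sketch of kmem teardown on css offline in the scheme
 * described above. */
static void memcg_offline_kmem(struct mem_cgroup *memcg)
{
	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
	struct memcg_cache_params *params;

	/* Re-parent the context: pages charged through it will from
	 * now on be uncharged against the parent when freed. */
	rcu_assign_pointer(memcg->kmem_ctx->memcg, parent);
	list_move(&memcg->kmem_ctx->list, &parent->kmem_ctx_list);	/* assumed list */

	/* Orphan the caches: clear the owner pointer so they don't pin
	 * the css; they stay in the root caches' memcg_caches arrays
	 * and can be adopted by whoever gets this kmemcg_id next. */
	list_for_each_entry(params, &memcg->memcg_slab_caches, list)
		rcu_assign_pointer(params->memcg, NULL);
}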

Then, if the kmemcg_id is later given to another memcg, the new memcg
will just adopt such an orphaned cache and allocate objects from it.
Note that some of the pages the new memcg will be allocating from may
still be accounted to the parent of the previous owner memcg.

On vmpressure we will walk over the orphaned caches to shrink them and
optionally destroy those that have become empty.

To sum up what we will have in such a design:
Pros:
- no css pinning: kmem caches are just marked orphaned while the kmem
  context is re-parented, so that the dead css can go away
- simple teardown procedure on css offline: re-parenting of kmem
  contexts is just changing memcg_kmem_context::memcg, while orphaning
  the caches is just clearing the pointer to the owner memcg; since no
  concurrent allocations from the caches are possible, no sophisticated
  synchronization is needed
- reaping of orphaned caches on vmpressure will look quite natural,
  because the caches may be reused at any time
- independent of slab internals
Cons:
- objects belonging to different memcgs may be located on the same
  slab of the same cache - the slab will be accounted to only one of
  the cgroups though; AFAIU we would have the same picture if we moved
  slabs from a dead cache to its parent
- kmem context objects will hang around until all pages accounted
  through them are gone, which may take an indefinitely long time, but
  that shouldn't be a big problem, because their size is rather small.

What do you think about that?

Thank you.

by Christoph Lameter

Subject: Re: [RFC] how should we deal with dead memcgs' kmem caches?

On Sun, 20 Apr 2014, Vladimir Davydov wrote:

> * Way #1 - prevent dead kmem caches from caching slabs on free *
>
> We can modify sl[au]b implementation so that it won't cache any objects
> on free if the kmem cache belongs to a dead memcg. Then it'd be enough
> to drain per-cpu pools of all dead kmem caches on css offline - no new
> slabs will be added there on further frees, and the last object will go
> away along with the last slab.

You can call kmem_cache_shrink() to force slab allocators to drop cached
objects after a free.

2014-04-21 17:56:25

by Vladimir Davydov

Subject: Re: [RFC] how should we deal with dead memcgs' kmem caches?

21.04.2014 20:29, Christoph Lameter:
> On Sun, 20 Apr 2014, Vladimir Davydov wrote:
>
>> * Way #1 - prevent dead kmem caches from caching slabs on free *
>>
>> We can modify sl[au]b implementation so that it won't cache any objects
>> on free if the kmem cache belongs to a dead memcg. Then it'd be enough
>> to drain per-cpu pools of all dead kmem caches on css offline - no new
>> slabs will be added there on further frees, and the last object will go
>> away along with the last slab.
>
> You can call kmem_cache_shrink() to force slab allocators to drop cached
> objects after a free.

Yes, but the question is when and how often we should do that. Calling
it after each kfree would be overkill, because there may be plenty of
objects left in a dead cache. Calling it periodically or on vmpressure
is the first thing that springs to mind - that's covered by "way #2".

Thanks.