2015-08-30 19:02:29

by Vladimir Davydov

Subject: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

Hi,

Tejun reported that sometimes memcg/memory.high threshold seems to be
silently ignored if kmem accounting is enabled:

http://www.spinics.net/lists/linux-mm/msg93613.html

It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
first. As a result, if there are enough free pages, memcg reclaim will
not get invoked on kmem allocations, which will lead to uncontrollable
growth of memory usage no matter what memory.high is set to.

This patch set attempts to fix this issue. For more details please see
comments to individual patches.

Thanks,

Vladimir Davydov (2):
mm/slab: skip memcg reclaim only if in atomic context
mm/slub: do not bypass memcg reclaim for high-order page allocation

mm/slab.c | 32 +++++++++++---------------------
mm/slub.c | 24 +++++++++++-------------
2 files changed, 22 insertions(+), 34 deletions(-)

--
2.1.4


2015-08-30 19:02:46

by Vladimir Davydov

Subject: [PATCH 1/2] mm/slab: skip memcg reclaim only if in atomic context

SLAB's implementation of kmem_cache_alloc() works as follows:
1. First, it tries to allocate from the preferred NUMA node without
issuing reclaim.
2. If step 1 fails, it tries all nodes in the order of preference,
again without invoking the reclaimer.
3. Only if steps 1 and 2 fail does it fall back on allocation from any
allowed node with reclaim enabled.
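The three steps above can be sketched as a small userspace mock (the node count, helper names, and the reclaim model are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 4

/* free pages per mock node */
static int free_pages[NR_NODES];

/* A mock allocation: succeeds if the node has a free page; if reclaim
 * is allowed, pretend direct reclaim always frees one up. */
static bool alloc_from_node(int node, bool may_reclaim)
{
	if (free_pages[node] > 0) {
		free_pages[node]--;
		return true;
	}
	return may_reclaim;
}

/* Returns the node the slab page came from, or -1 on failure. */
static int slab_alloc_node(int preferred)
{
	/* step 1: preferred node, no reclaim */
	if (alloc_from_node(preferred, false))
		return preferred;
	/* step 2: all nodes in order of preference, still no reclaim */
	for (int n = 0; n < NR_NODES; n++)
		if (n != preferred && alloc_from_node(n, false))
			return n;
	/* step 3: any allowed node, reclaim enabled */
	for (int n = 0; n < NR_NODES; n++)
		if (alloc_from_node(n, true))
			return n;
	return -1;
}
```

With one free page on node 2 and a preference for node 0, step 1 fails, step 2 returns node 2, and once every node is empty only step 3 (reclaim allowed) can satisfy the request.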

Before commit 4167e9b2cf10f ("mm: remove GFP_THISNODE"), the GFP_THISNODE
combination, which equaled __GFP_THISNODE|__GFP_NOWARN|__GFP_NORETRY on
NUMA-enabled builds, was used in order to avoid reclaim during steps 1
and 2. If __alloc_pages_slowpath() saw this combination in the gfp flags,
it aborted immediately even if the __GFP_WAIT flag was set. So there was
no need to clear the __GFP_WAIT flag while performing steps 1 and 2, and
hence we could invoke memcg reclaim when allocating a slab page if the
context allowed.

Commit 4167e9b2cf10f removed the GFP_THISNODE combination. Instead of
OR-ing the gfp mask with GFP_THISNODE, the gfp_exact_node() helper should
now be used. The latter sets the __GFP_THISNODE and __GFP_NOWARN flags and
clears __GFP_WAIT on the current gfp mask. As a result, it effectively
prohibits invoking memcg reclaim in steps 1 and 2. This breaks the
memcg/memory.high logic when kmem accounting is enabled. The memory.high
threshold is supposed to work as a soft limit, i.e. it does not fail an
allocation on breaching it, but it still forces the caller to invoke
direct reclaim to compensate for the excess. Without the __GFP_WAIT flag
direct reclaim is impossible, so the caller will go on without being
pushed back to the threshold.
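The masking performed by gfp_exact_node() can be illustrated with mock flag bits (the real values live in include/linux/gfp.h and differ between kernel versions; these are placeholders):

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Illustrative flag bits, not the kernel's actual values. */
#define __GFP_WAIT      0x01u
#define __GFP_THISNODE  0x02u
#define __GFP_NOWARN    0x04u
#define __GFP_NORETRY   0x08u

/* What gfp_exact_node() does after commit 4167e9b2cf10f: pin the
 * allocation to one node, suppress warnings and, problematically for
 * memcg, strip __GFP_WAIT so reclaim can never run. */
static gfp_t gfp_exact_node(gfp_t flags)
{
	return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
}
```

Any other flags in the caller's mask (e.g. __GFP_NORETRY here) pass through untouched; only the wait capability is lost.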

To fix this issue, we get rid of the gfp_exact_node() helper and move gfp
flag filtering to kmem_getpages(), after memcg_charge_slab() is called.

To understand the patch, note that:
- In fallback_alloc() the only effect of using gfp_exact_node() is
preventing recursion fallback_alloc() -> ____cache_alloc_node() ->
fallback_alloc().
- Aside from fallback_alloc(), gfp_exact_node() is only used along with
cache_grow(). Moreover, the only place where cache_grow() is used
without it is fallback_alloc(), which, in contrast to other
cache_grow() users, preallocates a page and passes it to cache_grow()
so that the latter does not need to invoke kmem_getpages() by itself.

Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Vladimir Davydov <[email protected]>
---
mm/slab.c | 32 +++++++++++---------------------
1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index d890750ec31e..9ee809d2ed8b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -857,11 +857,6 @@ static inline void *____cache_alloc_node(struct kmem_cache *cachep,
return NULL;
}

-static inline gfp_t gfp_exact_node(gfp_t flags)
-{
- return flags;
-}
-
#else /* CONFIG_NUMA */

static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int);
@@ -1028,15 +1023,6 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)

return __cache_free_alien(cachep, objp, node, page_node);
}
-
-/*
- * Construct gfp mask to allocate from a specific node but do not invoke reclaim
- * or warn about failures.
- */
-static inline gfp_t gfp_exact_node(gfp_t flags)
-{
- return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
-}
#endif

/*
@@ -1583,7 +1569,7 @@ slab_out_of_memory(struct kmem_cache *cachep, gfp_t gfpflags, int nodeid)
* would be relatively rare and ignorable.
*/
static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
- int nodeid)
+ int nodeid, bool fallback)
{
struct page *page;
int nr_pages;
@@ -1595,6 +1581,9 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
if (memcg_charge_slab(cachep, flags, cachep->gfporder))
return NULL;

+ if (!fallback)
+ flags = (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
+
page = __alloc_pages_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
if (!page) {
memcg_uncharge_slab(cachep, cachep->gfporder);
@@ -2641,7 +2630,8 @@ static int cache_grow(struct kmem_cache *cachep,
* 'nodeid'.
*/
if (!page)
- page = kmem_getpages(cachep, local_flags, nodeid);
+ page = kmem_getpages(cachep, local_flags, nodeid,
+ !IS_ENABLED(CONFIG_NUMA));
if (!page)
goto failed;

@@ -2840,7 +2830,7 @@ alloc_done:
if (unlikely(!ac->avail)) {
int x;
force_grow:
- x = cache_grow(cachep, gfp_exact_node(flags), node, NULL);
+ x = cache_grow(cachep, flags, node, NULL);

/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
@@ -3034,7 +3024,7 @@ retry:
get_node(cache, nid) &&
get_node(cache, nid)->free_objects) {
obj = ____cache_alloc_node(cache,
- gfp_exact_node(flags), nid);
+ flags | __GFP_THISNODE, nid);
if (obj)
break;
}
@@ -3052,7 +3042,7 @@ retry:
if (local_flags & __GFP_WAIT)
local_irq_enable();
kmem_flagcheck(cache, flags);
- page = kmem_getpages(cache, local_flags, numa_mem_id());
+ page = kmem_getpages(cache, local_flags, numa_mem_id(), true);
if (local_flags & __GFP_WAIT)
local_irq_disable();
if (page) {
@@ -3062,7 +3052,7 @@ retry:
nid = page_to_nid(page);
if (cache_grow(cache, flags, nid, page)) {
obj = ____cache_alloc_node(cache,
- gfp_exact_node(flags), nid);
+ flags | __GFP_THISNODE, nid);
if (!obj)
/*
* Another processor may allocate the
@@ -3133,7 +3123,7 @@ retry:

must_grow:
spin_unlock(&n->list_lock);
- x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL);
+ x = cache_grow(cachep, flags, nodeid, NULL);
if (x)
goto retry;

--
2.1.4

2015-08-30 19:03:04

by Vladimir Davydov

Subject: [PATCH 2/2] mm/slub: do not bypass memcg reclaim for high-order page allocation

Commit 6af3142bed1f52 ("mm/slub: don't wait for high-order page
allocation") made allocate_slab() try to allocate high-order slab pages
without __GFP_WAIT in order to avoid invoking reclaim/compaction when we
can fall back on low-order pages. However, it broke the memcg/memory.high
logic when kmem accounting is enabled. The memory.high threshold
works as a soft limit: an allocation does not fail if it is breached,
but we call direct reclaim to compensate for the excess. Without
__GFP_WAIT we cannot invoke the reclaimer, and therefore we will go on
exceeding memory.high more and more until a normal __GFP_WAIT allocation
is issued.

Since memcg reclaim never triggers compaction, we can pass __GFP_WAIT to
memcg_charge_slab() even on high order page allocations w/o any
performance impact. So let us fix this problem by excluding __GFP_WAIT
only from alloc_pages() while still forwarding it to memcg_charge_slab()
if the context allows.
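The approach can be sketched with mock helpers (names, signatures, and flag values are illustrative, not the actual mm/slub.c code):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;
#define __GFP_WAIT 0x01u	/* illustrative bit value */

/* recorded by the mock helpers so the effect can be inspected */
static gfp_t charged_with, allocated_with;

static bool memcg_charge_slab(gfp_t flags) { charged_with = flags; return true; }
static bool alloc_pages_mock(gfp_t flags)  { allocated_with = flags; return true; }

/* Sketch of the patched logic: the memcg charge always sees the
 * caller's full gfp mask (so memory.high reclaim can run), while the
 * speculative high-order page allocation itself drops __GFP_WAIT. */
static bool alloc_slab_page(gfp_t flags, int order, int min_order)
{
	if (!memcg_charge_slab(flags))
		return false;
	if (order > min_order)
		flags &= ~__GFP_WAIT;	/* optimistic high-order attempt */
	return alloc_pages_mock(flags);
}
```

For a high-order attempt the page allocator sees a non-waiting mask, but the charge path keeps __GFP_WAIT; a minimum-order attempt passes the caller's mask through unchanged.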

Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Vladimir Davydov <[email protected]>
---
mm/slub.c | 24 +++++++++++-------------
1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index e180f8dcd06d..416a332277cb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1333,6 +1333,14 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
if (memcg_charge_slab(s, flags, order))
return NULL;

+ /*
+ * Let the initial higher-order allocation fail under memory pressure
+ * so we fall-back to the minimum order allocation.
+ */
+ if (oo_order(oo) > oo_order(s->min))
+ flags = (flags | __GFP_NOWARN | __GFP_NOMEMALLOC) &
+ ~(__GFP_NOFAIL | __GFP_WAIT);
+
if (node == NUMA_NO_NODE)
page = alloc_pages(flags, order);
else
@@ -1348,7 +1356,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
struct page *page;
struct kmem_cache_order_objects oo = s->oo;
- gfp_t alloc_gfp;
void *start, *p;
int idx, order;

@@ -1359,23 +1366,14 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)

flags |= s->allocflags;

- /*
- * Let the initial higher-order allocation fail under memory pressure
- * so we fall-back to the minimum order allocation.
- */
- alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
- if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
- alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
-
- page = alloc_slab_page(s, alloc_gfp, node, oo);
+ page = alloc_slab_page(s, flags, node, oo);
if (unlikely(!page)) {
oo = s->min;
- alloc_gfp = flags;
/*
* Allocation may have failed due to fragmentation.
* Try a lower order alloc if possible
*/
- page = alloc_slab_page(s, alloc_gfp, node, oo);
+ page = alloc_slab_page(s, flags, node, oo);
if (unlikely(!page))
goto out;
stat(s, ORDER_FALLBACK);
@@ -1385,7 +1383,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
!(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) {
int pages = 1 << oo_order(oo);

- kmemcheck_alloc_shadow(page, oo_order(oo), alloc_gfp, node);
+ kmemcheck_alloc_shadow(page, oo_order(oo), flags, node);

/*
* Objects from caches that have a constructor don't get
--
2.1.4

2015-08-31 13:24:19

by Michal Hocko

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

On Sun 30-08-15 22:02:16, Vladimir Davydov wrote:
> Hi,
>
> Tejun reported that sometimes memcg/memory.high threshold seems to be
> silently ignored if kmem accounting is enabled:
>
> http://www.spinics.net/lists/linux-mm/msg93613.html
>
> It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
> first. As a result, if there are enough free pages, memcg reclaim will
> not get invoked on kmem allocations, which will lead to uncontrollable
> growth of memory usage no matter what memory.high is set to.

Right, but isn't that what the caller explicitly asked for? Why should we
ignore that for kmem accounting? It seems like a fix at the wrong layer to
me. Either we should start failing GFP_NOWAIT charges when we are above
the high wmark or deploy an additional catch-up mechanism as suggested by
Tejun. I like the latter more because it allows us to better handle GFP_NOFS
requests as well, and there are many sources of these from kmem paths.

> This patch set attempts to fix this issue. For more details please see
> comments to individual patches.
>
> Thanks,
>
> Vladimir Davydov (2):
> mm/slab: skip memcg reclaim only if in atomic context
> mm/slub: do not bypass memcg reclaim for high-order page allocation
>
> mm/slab.c | 32 +++++++++++---------------------
> mm/slub.c | 24 +++++++++++-------------
> 2 files changed, 22 insertions(+), 34 deletions(-)
>
> --
> 2.1.4

--
Michal Hocko
SUSE Labs

2015-08-31 13:43:39

by Tejun Heo

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

Hello,

On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> Right but isn't that what the caller explicitly asked for? Why should we
> ignore that for kmem accounting? It seems like a fix at a wrong layer to
> me. Either we should start failing GFP_NOWAIT charges when we are above
> high wmark or deploy an additional catchup mechanism as suggested by
> Tejun. I like the latter more because it allows us to better handle GFP_NOFS
> requests as well and there are many sources of these from kmem paths.

Yeah, this is beginning to look like we're trying to solve the problem
at the wrong layer. slab/slub or whatever else should be able to use
GFP_NOWAIT in whatever frequency they want for speculative
allocations.

Thanks.

--
tejun

2015-08-31 14:21:09

by Vladimir Davydov

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> On Sun 30-08-15 22:02:16, Vladimir Davydov wrote:

> > Tejun reported that sometimes memcg/memory.high threshold seems to be
> > silently ignored if kmem accounting is enabled:
> >
> > http://www.spinics.net/lists/linux-mm/msg93613.html
> >
> > It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
> > first. As a result, if there are enough free pages, memcg reclaim will
> > not get invoked on kmem allocations, which will lead to uncontrollable
> > growth of memory usage no matter what memory.high is set to.
>
> Right but isn't that what the caller explicitly asked for?

No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we
might ignore that and charge memcg w/o __GFP_WAIT.

> Why should we ignore that for kmem accounting? It seems like a fix at
> a wrong layer to me.

Let's forget about memory.high for a minute.

1. SLAB. Suppose someone calls kmalloc_node and there is enough free
memory on the preferred node. W/o a memcg limit set, the allocation
will happen from the preferred node, which is OK. If there is a memcg
limit, we can currently fail to allocate from the preferred node if
we are near the limit. We then issue memcg reclaim and go to the
fallback alloc, which will most probably allocate from a different node,
although there is no reason for that. This is a bug.

2. SLUB. Someone calls kmalloc and there are enough free high-order
pages. If there is no memcg limit, we will allocate a high-order
slab page, which is in accordance with SLUB's internal logic. With
a memcg limit set, we are likely to fail to charge the high-order page
(because we currently try to charge high-order pages w/o __GFP_WAIT)
and fall back on a low-order page. The latter is unexpected and
unjustified.

That being said, this is the fix at the right layer.

> Either we should start failing GFP_NOWAIT charges when we are above
> high wmark or deploy an additional catchup mechanism as suggested by
> Tejun.

The mechanism proposed by Tejun won't help us avoid allocation
failures if we are hitting memory.max w/o __GFP_WAIT or __GFP_FS.

To fix GFP_NOFS/GFP_NOWAIT failures we just need to start reclaim when
the gap between the limit and the usage is getting too small. It may be
done from a workqueue or from task_work, but currently I don't see any
reason to complicate things rather than just start reclaim directly, just
like memory.high does.

I mean, currently you can protect against GFP_NOWAIT failures by setting
memory.high to be 1-2 MB lower than memory.max, and this *will* work,
because GFP_NOWAIT/GFP_NOFS allocations can't go on infinitely - they
will alternate with normal GFP_KERNEL allocations sooner or later. It
does not mean we should encourage users to set memory.high to protect
against such failures, because, as pointed out by Tejun, the logic behind
memory.high is currently opaque and can change, but we can introduce
memcg-internal watermarks that would work exactly like memory.high and
hence help us against GFP_NOWAIT/GFP_NOFS failures.
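Such a memcg-internal watermark might look roughly like this (a hypothetical sketch; the names, page-sized units, and reclaim model are made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical memcg state, in page-sized units. A waiting charge that
 * pushes usage into the "margin" below the hard limit triggers direct
 * reclaim, keeping headroom for later GFP_NOWAIT charges. */
static long limit = 100, usage = 0, margin = 10;

/* mock direct reclaim: drop nr pages of usage */
static void reclaim(long nr) { usage -= nr < usage ? nr : usage; }

static bool try_charge(long nr, bool can_wait)
{
	if (usage + nr > limit)
		return false;		/* hard limit (memory.max): charge fails */
	usage += nr;
	if (can_wait && usage > limit - margin)
		reclaim(usage - (limit - margin));	/* restore headroom */
	return true;
}
```

A __GFP_WAIT-capable charge that crosses the watermark pays the reclaim cost itself, so a subsequent non-waiting charge still fits inside the margin instead of failing at the hard limit.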

Thanks,
Vladimir

> I like the latter more because it allows us to better handle GFP_NOFS
> requests as well and there are many sources of these from kmem paths.

2015-08-31 14:30:25

by Vladimir Davydov

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

On Mon, Aug 31, 2015 at 09:43:35AM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> > Right but isn't that what the caller explicitly asked for? Why should we
> > ignore that for kmem accounting? It seems like a fix at a wrong layer to
> > me. Either we should start failing GFP_NOWAIT charges when we are above
> > high wmark or deploy an additional catchup mechanism as suggested by
> > Tejun. I like the latter more because it allows us to better handle GFP_NOFS
> > requests as well and there are many sources of these from kmem paths.
>
> Yeah, this is beginning to look like we're trying to solve the problem
> at the wrong layer. slab/slub or whatever else should be able to use
> GFP_NOWAIT in whatever frequency they want for speculative
> allocations.

slab/slub can issue alloc_pages() any time with any flags they want and
it won't be accounted to memcg, because kmem is accounted at the slab/slub
layer, not in the buddy allocator.

Thanks,
Vladimir

2015-08-31 14:39:44

by Tejun Heo

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

On Mon, Aug 31, 2015 at 05:30:08PM +0300, Vladimir Davydov wrote:
> slab/slub can issue alloc_pages() any time with any flags they want and
> it won't be accounted to memcg, because kmem is accounted at slab/slub
> layer, not in buddy.

Hmmm? I meant the eventual calling into try_charge w/ GFP_NOWAIT.
Speculative usage of GFP_NOWAIT is bound to increase and we don't want
to put extra restrictions on it from the memcg side. For memory.high,
punting to the return path is a pretty straightforward solution which
should make the problem go away almost entirely.

Thanks.

--
tejun

2015-08-31 14:46:09

by Tejun Heo

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

Hello, Vladimir.

On Mon, Aug 31, 2015 at 05:20:49PM +0300, Vladimir Davydov wrote:
...
> That being said, this is the fix at the right layer.

While this *might* be a necessary workaround for the hard limit case
right now, this is by no means the fix at the right layer. The
expectation is that mm keeps a reasonable amount of memory available
for allocations which can't block. These allocations may fail from
time to time depending on luck and under extreme memory pressure but
the caller should be able to depend on it as a speculative allocation
mechanism which doesn't fail willy-nilly.

Hardlimit breaking GFP_NOWAIT behavior is a bug on memcg side, not
slab or slub.

Thanks.

--
tejun

2015-08-31 15:18:33

by Vladimir Davydov

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

On Mon, Aug 31, 2015 at 10:39:39AM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 05:30:08PM +0300, Vladimir Davydov wrote:
> > slab/slub can issue alloc_pages() any time with any flags they want and
> > it won't be accounted to memcg, because kmem is accounted at slab/slub
> > layer, not in buddy.
>
> Hmmm? I meant the eventual calling into try_charge w/ GFP_NOWAIT.
> Speculative usage of GFP_NOWAIT is bound to increase and we don't want
> to put on extra restrictions from memcg side.

We already put restrictions on slab/slub from the memcg side, because kmem
accounting is a part of slab/slub. They have to cooperate in order to
get things working. If slab/slub wants to make a speculative allocation
for some reason, it should just move memcg_charge out of the speculative
alloc section. This is what this patch set does.

We have to be cautious about placing memcg_charge in slab/slub. To
understand why, consider the SLAB case, which first tries to allocate from
all nodes in the order of preference w/o __GFP_WAIT and only if that fails
falls back on an allocation from any node w/ __GFP_WAIT. This is its
internal algorithm. If we blindly put memcg_charge into the alloc_slab method,
then, when we are near the memcg limit, we will go over all NUMA nodes
in vain, then finally fall back to a __GFP_WAIT allocation, which will get
a slab from a random node. Not only do we do more work than necessary due
to walking over all NUMA nodes for nothing, but we also break SLAB
internal logic! And you just can't fix it in memcg, because memcg knows
nothing about the internal logic of SLAB, how it handles NUMA nodes.

SLUB has a different problem. It tries to avoid high-order allocations
if there is a risk of invoking costly memory compactor. It has nothing
to do with memcg, because memcg does not care if the charge is for a
high order page or not.

Thanks,
Vladimir

> For memory.high,
> punting to the return path is a pretty stright-forward solution which
> should make the problem go away almost entirely.

2015-08-31 15:24:54

by Vladimir Davydov

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

On Mon, Aug 31, 2015 at 10:46:04AM -0400, Tejun Heo wrote:
> Hello, Vladimir.
>
> On Mon, Aug 31, 2015 at 05:20:49PM +0300, Vladimir Davydov wrote:
> ...
> > That being said, this is the fix at the right layer.
>
> While this *might* be a necessary workaround for the hard limit case
> right now, this is by no means the fix at the right layer. The
> expectation is that mm keeps a reasonable amount of memory available
> for allocations which can't block. These allocations may fail from
> time to time depending on luck and under extreme memory pressure but
> the caller should be able to depend on it as a speculative allocation
> mechanism which doesn't fail willy-nilly.
>
> Hardlimit breaking GFP_NOWAIT behavior is a bug on memcg side, not
> slab or slub.

I never denied that there is a GFP_NOWAIT/GFP_NOFS problem in memcg. I
even proposed ways to cope with it in one of the previous e-mails.

Nevertheless, we just can't allow slab/slub internals to call memcg_charge
whenever they want, as I pointed out in a parallel thread.

Thanks,
Vladimir

2015-08-31 15:48:00

by Tejun Heo

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

Hello,

On Mon, Aug 31, 2015 at 06:18:14PM +0300, Vladimir Davydov wrote:
> We have to be cautious about placing memcg_charge in slab/slub. To
> understand why, consider SLAB case, which first tries to allocate from
> all nodes in the order of preference w/o __GFP_WAIT and only if it fails
> falls back on an allocation from any node w/ __GFP_WAIT. This is its
> internal algorithm. If we blindly put memcg_charge into the alloc_slab method,
> then, when we are near the memcg limit, we will go over all NUMA nodes
> in vain, then finally fall back to a __GFP_WAIT allocation, which will get
> a slab from a random node. Not only do we do more work than necessary due
> to walking over all NUMA nodes for nothing, but we also break SLAB
> internal logic! And you just can't fix it in memcg, because memcg knows
> nothing about the internal logic of SLAB, how it handles NUMA nodes.
>
> SLUB has a different problem. It tries to avoid high-order allocations
> if there is a risk of invoking costly memory compactor. It has nothing
> to do with memcg, because memcg does not care if the charge is for a
> high order page or not.

Maybe I'm missing something but aren't both issues caused by memcg
failing to provide headroom for NOWAIT allocations when the
consumption gets close to the max limit? Regardless of the specific
usage, !__GFP_WAIT means "give me memory if it can be spared w/o
inducing direct time-consuming maintenance work" and the contract
around it is that such requests will mostly succeed under nominal
conditions. Also, slab/slub might not stay as the only user of
try_charge(). I still think solving this from memcg side is the right
direction.

Thanks.

--
tejun

2015-08-31 16:51:50

by Vladimir Davydov

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

On Mon, Aug 31, 2015 at 11:47:56AM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 06:18:14PM +0300, Vladimir Davydov wrote:
> > We have to be cautious about placing memcg_charge in slab/slub. To
> > understand why, consider SLAB case, which first tries to allocate from
> > all nodes in the order of preference w/o __GFP_WAIT and only if it fails
> > falls back on an allocation from any node w/ __GFP_WAIT. This is its
> > internal algorithm. If we blindly put memcg_charge into the alloc_slab method,
> > then, when we are near the memcg limit, we will go over all NUMA nodes
> > in vain, then finally fall back to a __GFP_WAIT allocation, which will get
> > a slab from a random node. Not only do we do more work than necessary due
> > to walking over all NUMA nodes for nothing, but we also break SLAB
> > internal logic! And you just can't fix it in memcg, because memcg knows
> > nothing about the internal logic of SLAB, how it handles NUMA nodes.
> >
> > SLUB has a different problem. It tries to avoid high-order allocations
> > if there is a risk of invoking costly memory compactor. It has nothing
> > to do with memcg, because memcg does not care if the charge is for a
> > high order page or not.
>
> Maybe I'm missing something but aren't both issues caused by memcg
> failing to provide headroom for NOWAIT allocations when the
> consumption gets close to the max limit?

That's correct.

> Regardless of the specific usage, !__GFP_WAIT means "give me memory if
> it can be spared w/o inducing direct time-consuming maintenance work"
> and the contract around it is that such requests will mostly succeed
> under nominal conditions. Also, slab/slub might not stay as the only
> user of try_charge().

Indeed, there might be other users trying GFP_NOWAIT before falling back
to GFP_KERNEL, but they are not doing that constantly and hence cause no
problems. If SLAB/SLUB plays such tricks, the problem becomes massive:
under certain conditions *every* try_charge may be invoked w/o
__GFP_WAIT, resulting in memory.high being breached and memory.max being hit.

Generally speaking, handing over reclaim responsibility to task_work
won't help, because there might be cases when a process spends quite a
lot of time in kernel invoking lots of GFP_KERNEL allocations before
returning to userspace. Without fixing slab/slub, such a process will
charge w/o __GFP_WAIT and therefore can exceed memory.high and reach
memory.max. If there are no other active processes in the cgroup, the
cgroup can stay above memory.high for a relatively long time
(suppose the process was throttled in the kernel), possibly hurting the rest
of the system. What is worse, if the process happens to invoke a real
GFP_NOWAIT allocation when it's about to hit the limit, it will fail.

If we want to allow the slab/slub implementation to invoke try_charge
wherever it wants, we need to introduce an asynchronous thread doing
reclaim when a memcg is approaching its limit (or teach kswapd to do that).
That's a way to go, but what's the point in complicating things
prematurely while it seems we can fix the problem by using a technique
similar to the one behind memory.high?

Nevertheless, even if we introduced such a thread, it'd be just insane
to allow slab/slub blindly insert try_charge. Let me repeat the examples
of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
try_charge I gave above:

- memcg knows nothing about NUMA nodes, so what's the point in failing
!__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
- memcg knows nothing about high order pages, so what's the point in
failing !__GFP_WAIT allocations used by SLUB to try to allocate a
high order page?

Thanks,
Vladimir

> I still think solving this from memcg side is the right direction.

2015-08-31 17:03:14

by Tejun Heo

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

Hello,

On Mon, Aug 31, 2015 at 07:51:32PM +0300, Vladimir Davydov wrote:
...
> If we want to allow slab/slub implementation to invoke try_charge
> wherever it wants, we need to introduce an asynchronous thread doing
> reclaim when a memcg is approaching its limit (or teach kswapd to do that).

In the long term, I think this is the way to go.

> That's a way to go, but what's the point to complicate things
> prematurely while it seems we can fix the problem by using the technique
> similar to the one behind memory.high?

Cuz we're now scattering workarounds to multiple places and I'm sure
we'll add more try_charge() users (e.g. we want to fold in tcp memcg
under the same knobs) and we'll have to worry about the same problem
all over again and will inevitably miss some cases leading to subtle
failures.

> Nevertheless, even if we introduced such a thread, it'd be just insane
> to allow slab/slub blindly insert try_charge. Let me repeat the examples
> of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
> try_charge I gave above:
>
> - memcg knows nothing about NUMA nodes, so what's the point in failing
> !__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
> - memcg knows nothing about high order pages, so what's the point in
> failing !__GFP_WAIT allocations used by SLUB to try to allocate a
> high order page?

Both are optimistic speculative actions and as long as memcg can
guarantee that those requests will succeed under normal circumstances,
as the system-wide mm does, it isn't a problem.

In general, we want to make sure inside-cgroup behaviors as close to
system-wide behaviors as possible, scoped but equivalent in kind.
Doing things differently, while inevitable in certain cases, is likely
to get messy in the long term.

Thanks.

--
tejun

2015-08-31 19:26:33

by Vladimir Davydov

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

On Mon, Aug 31, 2015 at 01:03:09PM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 07:51:32PM +0300, Vladimir Davydov wrote:
> ...
> > If we want to allow slab/slub implementation to invoke try_charge
> > wherever it wants, we need to introduce an asynchronous thread doing
> > reclaim when a memcg is approaching its limit (or teach kswapd to do that).
>
> In the long term, I think this is the way to go.

Quite probably, or we can use task_work, or direct reclaim instead. It's
not that obvious to me yet which one is the best.

>
> > That's a way to go, but what's the point to complicate things
> > prematurely while it seems we can fix the problem by using the technique
> > similar to the one behind memory.high?
>
> Cuz we're now scattering workarounds to multiple places and I'm sure
> we'll add more try_charge() users (e.g. we want to fold in tcp memcg
> under the same knobs) and we'll have to worry about the same problem
> all over again and will inevitably miss some cases leading to subtle
> failures.

I don't think we will need to insert try_charge_kmem anywhere else,
because all kmem users either allocate memory using kmalloc and friends
or using alloc_pages. kmalloc is accounted. For those who prefer
alloc_pages, there is alloc_kmem_pages helper.

>
> > Nevertheless, even if we introduced such a thread, it'd be just insane
> > to allow slab/slub blindly insert try_charge. Let me repeat the examples
> > of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
> > try_charge I gave above:
> >
> > - memcg knows nothing about NUMA nodes, so what's the point in failing
> > !__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
> > - memcg knows nothing about high order pages, so what's the point in
> > failing !__GFP_WAIT allocations used by SLUB to try to allocate a
> > high order page?
>
> Both are optimistic speculative actions and as long as memcg can
> guarantee that those requests will succeed under normal circumstances,
> as does the system-wide mm does, it isn't a problem.
>
> In general, we want to make sure inside-cgroup behaviors as close to
> system-wide behaviors as possible, scoped but equivalent in kind.
> Doing things differently, while inevitable in certain cases, is likely
> to get messy in the long term.

I totally agree that we should strive to make a kmem user feel roughly
the same in memcg as if it were running on a host with equal amount of
RAM. There are two ways to achieve that:

1. Make the API functions, i.e. kmalloc and friends, behave inside
memcg roughly the same way as they do in the root cgroup.
2. Make the internal memcg functions, i.e. try_charge and friends,
behave roughly the same way as alloc_pages.

I find way 1 more flexible, because we don't have to blindly follow
heuristics used on global memory reclaim and therefore have more
opportunities to achieve the same goal.

Thanks,
Vladimir

Subject: Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

On Mon, 31 Aug 2015, Vladimir Davydov wrote:

> I totally agree that we should strive to make a kmem user feel roughly
> the same in memcg as if it were running on a host with equal amount of
> RAM. There are two ways to achieve that:
>
> 1. Make the API functions, i.e. kmalloc and friends, behave inside
> memcg roughly the same way as they do in the root cgroup.
> 2. Make the internal memcg functions, i.e. try_charge and friends,
> behave roughly the same way as alloc_pages.
>
> I find way 1 more flexible, because we don't have to blindly follow
> heuristics used on global memory reclaim and therefore have more
> opportunities to achieve the same goal.

The heuristics need to integrate well whether we are in a cgroup or not. In
general, make the use of cgroups as transparent as possible to the rest of the
code.