2015-05-05 09:46:05

by Vladimir Davydov

[permalink] [raw]
Subject: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

Not all kmem allocations should be accounted to memcg. The following
patch gives an example when accounting of a certain type of allocations
to memcg can effectively result in a memory leak. This patch adds the
__GFP_NOACCOUNT flag which if passed to kmalloc and friends will force
the allocation to go through the root cgroup. It will be used by the
next patch.

Note, since in case of kmemleak enabled each kmalloc implies yet another
allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
gfp_kmemleak_mask.

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/gfp.h | 2 ++
include/linux/memcontrol.h | 4 ++++
mm/kmemleak.c | 3 ++-
3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 97a9373e61e8..37c422df2a0f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -30,6 +30,7 @@ struct vm_area_struct;
#define ___GFP_HARDWALL 0x20000u
#define ___GFP_THISNODE 0x40000u
#define ___GFP_RECLAIMABLE 0x80000u
+#define ___GFP_NOACCOUNT 0x100000u
#define ___GFP_NOTRACK 0x200000u
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
@@ -87,6 +88,7 @@ struct vm_area_struct;
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
+#define __GFP_NOACCOUNT ((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to memcg */
#define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) /* Don't track with kmemcheck */

#define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..6c8918114804 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -463,6 +463,8 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
if (!memcg_kmem_enabled())
return true;

+ if (gfp & __GFP_NOACCOUNT)
+ return true;
/*
* __GFP_NOFAIL allocations will move on even if charging is not
* possible. Therefore we don't even try, and have this allocation
@@ -522,6 +524,8 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
{
if (!memcg_kmem_enabled())
return cachep;
+ if (gfp & __GFP_NOACCOUNT)
+ return cachep;
if (gfp & __GFP_NOFAIL)
return cachep;
if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 5405aff5a590..f0fe4f2c1fa7 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -115,7 +115,8 @@
#define BYTES_PER_POINTER sizeof(void *)

/* GFP bitmask for kmemleak internal allocations */
-#define gfp_kmemleak_mask(gfp) (((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
+#define gfp_kmemleak_mask(gfp) (((gfp) & (GFP_KERNEL | GFP_ATOMIC | \
+ __GFP_NOACCOUNT)) | \
__GFP_NORETRY | __GFP_NOMEMALLOC | \
__GFP_NOWARN)

--
1.7.10.4


2015-05-05 09:46:09

by Vladimir Davydov

[permalink] [raw]
Subject: [PATCH 2/2] kernfs: do not account ino_ida allocations to memcg

root->ino_ida is used for kernfs inode number allocations. Since IDA has
a layered structure, different IDs can reside on the same layer, which
is currently accounted to some memory cgroup. The problem is that each
kmem cache of a memory cgroup has its own directory on sysfs (under
/sys/fs/kernel/<cache-name>/cgroup). If the inode number of such a
directory or any file in it gets allocated from a layer accounted to the
cgroup which the cache is created for, the cgroup will get pinned for
good, because one has to free all kmem allocations accounted to a cgroup
in order to release it and destroy all its kmem caches. That said we
must not account layers of ino_ida to any memory cgroup.

Since per net init operations may create new sysfs entries directly
(e.g. lo device) or indirectly (nf_conntrack creates a new kmem cache
per each namespace, which, in turn, creates new sysfs entries), an easy
way to reproduce this issue is by creating network namespace(s) from
inside a kmem-active memory cgroup.

Signed-off-by: Vladimir Davydov <[email protected]>
---
fs/kernfs/dir.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index f131fc23ffc4..fffca9517321 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -518,7 +518,14 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
if (!kn)
goto err_out1;

- ret = ida_simple_get(&root->ino_ida, 1, 0, GFP_KERNEL);
+ /*
+ * If the ino of the sysfs entry created for a kmem cache gets
+ * allocated from an ida layer, which is accounted to the memcg that
+ * owns the cache, the memcg will get pinned forever. So do not account
+ * ino ida allocations.
+ */
+ ret = ida_simple_get(&root->ino_ida, 1, 0,
+ GFP_KERNEL | __GFP_NOACCOUNT);
if (ret < 0)
goto err_out2;
kn->ino = ret;
--
1.7.10.4

2015-05-05 13:45:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/2] kernfs: do not account ino_ida allocations to memcg

On Tue, May 05, 2015 at 12:45:43PM +0300, Vladimir Davydov wrote:
> root->ino_ida is used for kernfs inode number allocations. Since IDA has
> a layered structure, different IDs can reside on the same layer, which
> is currently accounted to some memory cgroup. The problem is that each
> kmem cache of a memory cgroup has its own directory on sysfs (under
> /sys/fs/kernel/<cache-name>/cgroup). If the inode number of such a
> directory or any file in it gets allocated from a layer accounted to the
> cgroup which the cache is created for, the cgroup will get pinned for
> good, because one has to free all kmem allocations accounted to a cgroup
> in order to release it and destroy all its kmem caches. That said we
> must not account layers of ino_ida to any memory cgroup.
>
> Since per net init operations may create new sysfs entries directly
> (e.g. lo device) or indirectly (nf_conntrack creates a new kmem cache
> per each namespace, which, in turn, creates new sysfs entries), an easy
> way to reproduce this issue is by creating network namespace(s) from
> inside a kmem-active memory cgroup.
>
> Signed-off-by: Vladimir Davydov <[email protected]>

Man, that's nasty. For the kernfs part,

Acked-by: Tejun Heo <[email protected]>

Can you please repost this patch w/ Greg KH cc'd?

Thanks.

--
tejun

2015-05-05 16:11:46

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 2/2] kernfs: do not account ino_ida allocations to memcg

On Tue, May 05, 2015 at 09:45:21AM -0400, Tejun Heo wrote:
> On Tue, May 05, 2015 at 12:45:43PM +0300, Vladimir Davydov wrote:
> > root->ino_ida is used for kernfs inode number allocations. Since IDA has
> > a layered structure, different IDs can reside on the same layer, which
> > is currently accounted to some memory cgroup. The problem is that each
> > kmem cache of a memory cgroup has its own directory on sysfs (under
> > /sys/fs/kernel/<cache-name>/cgroup). If the inode number of such a
> > directory or any file in it gets allocated from a layer accounted to the
> > cgroup which the cache is created for, the cgroup will get pinned for
> > good, because one has to free all kmem allocations accounted to a cgroup
> > in order to release it and destroy all its kmem caches. That said we
> > must not account layers of ino_ida to any memory cgroup.
> >
> > Since per net init operations may create new sysfs entries directly
> > (e.g. lo device) or indirectly (nf_conntrack creates a new kmem cache
> > per each namespace, which, in turn, creates new sysfs entries), an easy
> > way to reproduce this issue is by creating network namespace(s) from
> > inside a kmem-active memory cgroup.
> >
> > Signed-off-by: Vladimir Davydov <[email protected]>
>
> Man, that's nasty. For the kernfs part,
>
> Acked-by: Tejun Heo <[email protected]>

Thanks.

>
> Can you please repost this patch w/ Greg KH cc'd?

OK.

2015-05-06 11:59:45

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

On Tue 05-05-15 12:45:42, Vladimir Davydov wrote:
> Not all kmem allocations should be accounted to memcg. The following
> patch gives an example when accounting of a certain type of allocations
> to memcg can effectively result in a memory leak.

> This patch adds the __GFP_NOACCOUNT flag which if passed to kmalloc
> and friends will force the allocation to go through the root
> cgroup. It will be used by the next patch.

The name of the flag is way too generic. It is not clear that the
accounting is KMEMCG related. __GFP_NO_KMEMCG sounds better?

I was going to suggest doing per-cache rather than gfp flag and that
would actually work just fine for the kmemleak as it uses its own cache
already. But the ida_simple_get would be trickier because it doesn't use
any special cache and more over only one user seem to have a problem so
this doesn't sound like a good fit.

So I do not object to opt-out for kmemcg accounting but I really think
the name should be changed.

> Note, since in case of kmemleak enabled each kmalloc implies yet another
> allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
> gfp_kmemleak_mask.

> Signed-off-by: Vladimir Davydov <[email protected]>
> ---
> include/linux/gfp.h | 2 ++
> include/linux/memcontrol.h | 4 ++++
> mm/kmemleak.c | 3 ++-
> 3 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 97a9373e61e8..37c422df2a0f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -30,6 +30,7 @@ struct vm_area_struct;
> #define ___GFP_HARDWALL 0x20000u
> #define ___GFP_THISNODE 0x40000u
> #define ___GFP_RECLAIMABLE 0x80000u
> +#define ___GFP_NOACCOUNT 0x100000u
> #define ___GFP_NOTRACK 0x200000u
> #define ___GFP_NO_KSWAPD 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> @@ -87,6 +88,7 @@ struct vm_area_struct;
> #define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
> #define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
> #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
> +#define __GFP_NOACCOUNT ((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to memcg */
> #define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) /* Don't track with kmemcheck */
>
> #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 72dff5fb0d0c..6c8918114804 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -463,6 +463,8 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
> if (!memcg_kmem_enabled())
> return true;
>
> + if (gfp & __GFP_NOACCOUNT)
> + return true;
> /*
> * __GFP_NOFAIL allocations will move on even if charging is not
> * possible. Therefore we don't even try, and have this allocation
> @@ -522,6 +524,8 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
> {
> if (!memcg_kmem_enabled())
> return cachep;
> + if (gfp & __GFP_NOACCOUNT)
> + return cachep;
> if (gfp & __GFP_NOFAIL)
> return cachep;
> if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index 5405aff5a590..f0fe4f2c1fa7 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -115,7 +115,8 @@
> #define BYTES_PER_POINTER sizeof(void *)
>
> /* GFP bitmask for kmemleak internal allocations */
> -#define gfp_kmemleak_mask(gfp) (((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
> +#define gfp_kmemleak_mask(gfp) (((gfp) & (GFP_KERNEL | GFP_ATOMIC | \
> + __GFP_NOACCOUNT)) | \
> __GFP_NORETRY | __GFP_NOMEMALLOC | \
> __GFP_NOWARN)
>
> --
> 1.7.10.4
>

--
Michal Hocko
SUSE Labs

2015-05-06 12:24:45

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

On Wed, May 06, 2015 at 01:59:41PM +0200, Michal Hocko wrote:
> On Tue 05-05-15 12:45:42, Vladimir Davydov wrote:
> > Not all kmem allocations should be accounted to memcg. The following
> > patch gives an example when accounting of a certain type of allocations
> > to memcg can effectively result in a memory leak.
>
> > This patch adds the __GFP_NOACCOUNT flag which if passed to kmalloc
> > and friends will force the allocation to go through the root
> > cgroup. It will be used by the next patch.
>
> The name of the flag is way too generic. It is not clear that the
> accounting is KMEMCG related. __GFP_NO_KMEMCG sounds better?
>
> I was going to suggest doing per-cache rather than gfp flag and that
> would actually work just fine for the kmemleak as it uses its own cache
> already. But the ida_simple_get would be trickier because it doesn't use
> any special cache and more over only one user seem to have a problem so
> this doesn't sound like a good fit.

I don't think making this flag per-cache is an option either, but for
another reason - it would not be possible to merge such a kmem cache
with caches without this flag set. As a result, total memory pressure
would increase, even for setups without kmem-active memory cgroups,
which does not sound acceptable to me.

>
> So I do not object to opt-out for kmemcg accounting but I really think
> the name should be changed.

I named it __GFP_NOACCOUNT to match with __GFP_NOTRACK, which is a very
specific flag too (kmemcheck), nevertheless it has a rather generic
name.

Anyways, what else apart from memcg can account kmem so that we have to
mention KMEMCG in the flag name explicitly?

Thanks,
Vladimir

2015-05-06 12:35:48

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

On Wed 06-05-15 15:24:31, Vladimir Davydov wrote:
> On Wed, May 06, 2015 at 01:59:41PM +0200, Michal Hocko wrote:
> > On Tue 05-05-15 12:45:42, Vladimir Davydov wrote:
> > > Not all kmem allocations should be accounted to memcg. The following
> > > patch gives an example when accounting of a certain type of allocations
> > > to memcg can effectively result in a memory leak.
> >
> > > This patch adds the __GFP_NOACCOUNT flag which if passed to kmalloc
> > > and friends will force the allocation to go through the root
> > > cgroup. It will be used by the next patch.
> >
> > The name of the flag is way too generic. It is not clear that the
> > accounting is KMEMCG related. __GFP_NO_KMEMCG sounds better?
> >
> > I was going to suggest doing per-cache rather than gfp flag and that
> > would actually work just fine for the kmemleak as it uses its own cache
> > already. But the ida_simple_get would be trickier because it doesn't use
> > any special cache and more over only one user seem to have a problem so
> > this doesn't sound like a good fit.
>
> I don't think making this flag per-cache is an option either, but for
> another reason - it would not be possible to merge such a kmem cache
> with caches without this flag set. As a result, total memory pressure
> would increase, even for setups without kmem-active memory cgroups,
> which does not sound acceptable to me.

I am not sure I see the performance implications here because kmem
accounted memcgs would have their copy of the cache anyway, no?
Anyway, I guess it would be good to document these reasons in the
changelog.

> > So I do not object to opt-out for kmemcg accounting but I really think
> > the name should be changed.
>
> I named it __GFP_NOACCOUNT to match with __GFP_NOTRACK, which is a very
> specific flag too (kmemcheck), nevertheless it has a rather generic
> name.

__GFP_NOTRACK is a bad name IMHO as well. One has to go and check the
comment to see this is kmemleak related.

> Anyways, what else apart from memcg can account kmem so that we have to
> mention KMEMCG in the flag name explicitly?

NOACCOUNT doesn't imply kmem at all so it is not clear who is in charge
of the accounting. I do not insist on __GFP_NO_KMEMCG of course but
it sounds quite specific about its meaning and scope.
--
Michal Hocko
SUSE Labs

2015-05-06 13:16:50

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

On Wed, May 06, 2015 at 01:59:41PM +0200, Michal Hocko wrote:
> On Tue 05-05-15 12:45:42, Vladimir Davydov wrote:
> > Not all kmem allocations should be accounted to memcg. The following
> > patch gives an example when accounting of a certain type of allocations
> > to memcg can effectively result in a memory leak.
>
> > This patch adds the __GFP_NOACCOUNT flag which if passed to kmalloc
> > and friends will force the allocation to go through the root
> > cgroup. It will be used by the next patch.
>
> The name of the flag is way too generic. It is not clear that the
> accounting is KMEMCG related.

The memory controller is the (primary) component that accounts
physical memory allocations in the kernel, so I don't see how this
would be ambiguous in any way.

> __GFP_NO_KMEMCG sounds better?

I think that's much worse. I would prefer communicating the desired
behavior directly instead of having to derive it from a subsystem
name.

(And KMEMCG should not even be a term, it's all just the memory
controller, i.e. memcg.)

2015-05-06 13:25:29

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

On Wed, May 06, 2015 at 02:35:41PM +0200, Michal Hocko wrote:
> On Wed 06-05-15 15:24:31, Vladimir Davydov wrote:
> > On Wed, May 06, 2015 at 01:59:41PM +0200, Michal Hocko wrote:
> > > On Tue 05-05-15 12:45:42, Vladimir Davydov wrote:
> > > > Not all kmem allocations should be accounted to memcg. The following
> > > > patch gives an example when accounting of a certain type of allocations
> > > > to memcg can effectively result in a memory leak.
> > >
> > > > This patch adds the __GFP_NOACCOUNT flag which if passed to kmalloc
> > > > and friends will force the allocation to go through the root
> > > > cgroup. It will be used by the next patch.
> > >
> > > The name of the flag is way too generic. It is not clear that the
> > > accounting is KMEMCG related. __GFP_NO_KMEMCG sounds better?
> > >
> > > I was going to suggest doing per-cache rather than gfp flag and that
> > > would actually work just fine for the kmemleak as it uses its own cache
> > > already. But the ida_simple_get would be trickier because it doesn't use
> > > any special cache and more over only one user seem to have a problem so
> > > this doesn't sound like a good fit.
> >
> > I don't think making this flag per-cache is an option either, but for
> > another reason - it would not be possible to merge such a kmem cache
> > with caches without this flag set. As a result, total memory pressure
> > would increase, even for setups without kmem-active memory cgroups,
> > which does not sound acceptable to me.
>
> I am not sure I see the performance implications here because kmem
> accounted memcgs would have their copy of the cache anyway, no?

It's orthogonal.

Suppose there are two *global* kmem caches, A and B, which would
normally be merged, i.e. A=B. Then we find out that we don't want to
account allocations from A to memcg while still accounting allocations
from B. Obviously, cache A can no longer be merged with cache B so we
have two different caches instead of the only merged one, even if there
are *no* memory cgroups at all. That might result in increased memory
consumption due to fragmentation.

Although it is not really critical, especially counting that SLAB
merging was introduced not long before, the idea that enabling an extra
feature, such as memcg, without actually using it, may affect the global
behavior does not sound good to me.

> Anyway, I guess it would be good to document these reasons in the
> changelog.
>
> > > So I do not object to opt-out for kmemcg accounting but I really think
> > > the name should be changed.
> >
> > I named it __GFP_NOACCOUNT to match with __GFP_NOTRACK, which is a very
> > specific flag too (kmemcheck), nevertheless it has a rather generic
> > name.
>
> __GFP_NOTRACK is a bad name IMHO as well. One has to go and check the
> comment to see this is kmemleak related.

I think it's a good practice to go to its definition and check comments
when encountering an unknown symbol anyway. With ctags/cscope it's
trivial :-)

>
> > Anyways, what else apart from memcg can account kmem so that we have to
> > mention KMEMCG in the flag name explicitly?
>
> NOACCOUNT doesn't imply kmem at all so it is not clear who is in charge
> of the accounting.

IMO it is a benefit. If one day for some reason we want to bypass memcg
accounting for some other type of allocation somewhere, we can simply
reuse it.

> I do not insist on __GFP_NO_KMEMCG of course but it sounds quite
> specific about its meaning and scope.

There is another argument against __GFP_NO_KMEMCG: it is not yet clear
if kmem is going to be accounted separately in the unified cgroup
hierarchy. As I mentioned before, it is quite difficult to draw the line
between user and kernel memory at times - think of buffer_head or
radix_tree_node, which are pinned by user pages and therefore cannot be
dropped without reclaiming user memory. That said, chances are high that
there will be the only knob, memory.max, to limit all types of memory
allocations together, in which case __GFP_NO_KMEMCG will look awkward
IMO. We could use __GFP_NO_MEMCG (without 'K'), of course, but again,
what else except for memcg does full memory accounting so that we have
to mention MEMCG explicitly?

Thanks,
Vladimir

2015-05-06 13:46:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

On Wed 06-05-15 09:16:22, Johannes Weiner wrote:
> On Wed, May 06, 2015 at 01:59:41PM +0200, Michal Hocko wrote:
> > On Tue 05-05-15 12:45:42, Vladimir Davydov wrote:
> > > Not all kmem allocations should be accounted to memcg. The following
> > > patch gives an example when accounting of a certain type of allocations
> > > to memcg can effectively result in a memory leak.
> >
> > > This patch adds the __GFP_NOACCOUNT flag which if passed to kmalloc
> > > and friends will force the allocation to go through the root
> > > cgroup. It will be used by the next patch.
> >
> > The name of the flag is way too generic. It is not clear that the
> > accounting is KMEMCG related.
>
> The memory controller is the (primary) component that accounts
> physical memory allocations in the kernel, so I don't see how this
> would be ambiguous in any way.

What if a high-level allocator wants to do some accounting as well?
E.g. slab allocator accounts {un}reclaimable pages. It is a different
thing because the accounting is per-cache rather than gfp based but I
just wanted to point out that accounting is rather a wide term.

> > __GFP_NO_KMEMCG sounds better?
>
> I think that's much worse. I would prefer communicating the desired
> behavior directly instead of having to derive it from a subsystem
> name.
> (And KMEMCG should not even be a term, it's all just the memory
> controller, i.e. memcg.)

I do not mind __GFP_NO_MEMCG either.

--
Michal Hocko
SUSE Labs

2015-05-06 13:55:25

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

On Wed 06-05-15 16:25:10, Vladimir Davydov wrote:
> On Wed, May 06, 2015 at 02:35:41PM +0200, Michal Hocko wrote:
> > On Wed 06-05-15 15:24:31, Vladimir Davydov wrote:
[...]
> > > I don't think making this flag per-cache is an option either, but for
> > > another reason - it would not be possible to merge such a kmem cache
> > > with caches without this flag set. As a result, total memory pressure
> > > would increase, even for setups without kmem-active memory cgroups,
> > > which does not sound acceptable to me.
> >
> > I am not sure I see the performance implications here because kmem
> > accounted memcgs would have their copy of the cache anyway, no?
>
> It's orthogonal.
>
> Suppose there are two *global* kmem caches, A and B, which would
> normally be merged, i.e. A=B. Then we find out that we don't want to
> account allocations from A to memcg while still accounting allocations
> from B. Obviously, cache A can no longer be merged with cache B so we
> have two different caches instead of the only merged one, even if there
> are *no* memory cgroups at all. That might result in increased memory
> consumption due to fragmentation.

Got your point. Thanks for the clarification!

> Although it is not really critical, especially counting that SLAB
> merging was introduced not long before, the idea that enabling an extra
> feature, such as memcg, without actually using it, may affect the global
> behavior does not sound good to me.

Agreed.

> > Anyway, I guess it would be good to document these reasons in the
> > changelog.
> >
> > > > So I do not object to opt-out for kmemcg accounting but I really think
> > > > the name should be changed.
> > >
> > > I named it __GFP_NOACCOUNT to match with __GFP_NOTRACK, which is a very
> > > specific flag too (kmemcheck), nevertheless it has a rather generic
> > > name.
> >
> > __GFP_NOTRACK is a bad name IMHO as well. One has to go and check the
> > comment to see this is kmemleak related.
>
> I think it's a good practice to go to its definition and check comments
> when encountering an unknown symbol anyway. With ctags/cscope it's
> trivial :-)
>
> >
> > > Anyways, what else apart from memcg can account kmem so that we have to
> > > mention KMEMCG in the flag name explicitly?
> >
> > NOACCOUNT doesn't imply kmem at all so it is not clear who is in charge
> > of the accounting.
>
> IMO it is a benefit. If one day for some reason we want to bypass memcg
> accounting for some other type of allocation somewhere, we can simply
> reuse it.

But what if somebody, say a highlevel memory allocator in the kernel,
want's to (ab)use this flag for its internal purpose as well?

> > I do not insist on __GFP_NO_KMEMCG of course but it sounds quite
> > specific about its meaning and scope.
>
> There is another argument against __GFP_NO_KMEMCG: it is not yet clear
> if kmem is going to be accounted separately in the unified cgroup
> hierarchy.

As I've said, I do not insist on *KMEMCG. __GFP_NO_MEMCG would be
generic enough to rule out MEMCG altogether as well. Be it kmem or user
memory.
--
Michal Hocko
SUSE Labs

2015-05-06 14:30:08

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

On Wed, May 06, 2015 at 03:55:20PM +0200, Michal Hocko wrote:
> On Wed 06-05-15 16:25:10, Vladimir Davydov wrote:
> > On Wed, May 06, 2015 at 02:35:41PM +0200, Michal Hocko wrote:
[...]
> > > NOACCOUNT doesn't imply kmem at all so it is not clear who is in charge
> > > of the accounting.
> >
> > IMO it is a benefit. If one day for some reason we want to bypass memcg
> > accounting for some other type of allocation somewhere, we can simply
> > reuse it.
>
> But what if somebody, say a highlevel memory allocator in the kernel,
> want's to (ab)use this flag for its internal purpose as well?

We won't let him :-)

If we take your argument about future (ab)users seriously, we should
also consider what will happen if one wants to use e.g. __GFP_HARDWALL,
which BTW has a generic name too although it's cpuset-specific.

My point is that MEMCG is the only subsystem of the kernel that tries to
do full memory accounting, and there is no point in introducing another
one, because we already have it. So we have full right to reserve
__GFP_NOACCOUNT for our purposes, just like cpuset reserves
__GFP_HARDWALL and kmemcheck __GFP_NOTRACK. Any newcomer must take this
into account.

Thanks,
Vladimir

2015-05-06 14:47:05

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

On Wed 06-05-15 17:29:51, Vladimir Davydov wrote:
[...]
> My point is that MEMCG is the only subsystem of the kernel that tries to
> do full memory accounting, and there is no point in introducing another
> one, because we already have it.

Then I really do not get why the gfp flag cannot be specific about that.
Anyway, it doesn't really make much sense to bikeshed about the flag
here. So if both you and Johannes agree on the name I will not stand in
the way. I will go and check into include/linux/gfp.h anytime I will try
to remember the flag name...

--
Michal Hocko
SUSE Labs

2015-05-06 14:58:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

On Tue 05-05-15 12:45:42, Vladimir Davydov wrote:
[...]
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 97a9373e61e8..37c422df2a0f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -30,6 +30,7 @@ struct vm_area_struct;
> #define ___GFP_HARDWALL 0x20000u
> #define ___GFP_THISNODE 0x40000u
> #define ___GFP_RECLAIMABLE 0x80000u
> +#define ___GFP_NOACCOUNT 0x100000u
> #define ___GFP_NOTRACK 0x200000u
> #define ___GFP_NO_KSWAPD 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> @@ -87,6 +88,7 @@ struct vm_area_struct;
> #define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
> #define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
> #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
> +#define __GFP_NOACCOUNT ((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to memcg */

The wording suggests that _any_ memcg charge might be skipped by this flag
but only kmem part is handled.

So either handle the flag in try_charge or, IMO preferably, update the
comment here and add WARN_ON{_ONCE}(gfp & __GFP_NOACCOUNT). I do not
think we should allow to skip the charge for user pages ATM and warning
could tell us about the abuse of the flag.

> #define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) /* Don't track with kmemcheck */
>
> #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 72dff5fb0d0c..6c8918114804 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -463,6 +463,8 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
> if (!memcg_kmem_enabled())
> return true;
>
> + if (gfp & __GFP_NOACCOUNT)
> + return true;
> /*
> * __GFP_NOFAIL allocations will move on even if charging is not
> * possible. Therefore we don't even try, and have this allocation
> @@ -522,6 +524,8 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
> {
> if (!memcg_kmem_enabled())
> return cachep;
> + if (gfp & __GFP_NOACCOUNT)
> + return cachep;
> if (gfp & __GFP_NOFAIL)
> return cachep;
> if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index 5405aff5a590..f0fe4f2c1fa7 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -115,7 +115,8 @@
> #define BYTES_PER_POINTER sizeof(void *)
>
> /* GFP bitmask for kmemleak internal allocations */
> -#define gfp_kmemleak_mask(gfp) (((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
> +#define gfp_kmemleak_mask(gfp) (((gfp) & (GFP_KERNEL | GFP_ATOMIC | \
> + __GFP_NOACCOUNT)) | \
> __GFP_NORETRY | __GFP_NOMEMALLOC | \
> __GFP_NOWARN)
>
> --
> 1.7.10.4
>

--
Michal Hocko
SUSE Labs

2015-05-06 15:01:17

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

On Wed, May 06, 2015 at 03:46:20PM +0200, Michal Hocko wrote:
> On Wed 06-05-15 09:16:22, Johannes Weiner wrote:
> > On Wed, May 06, 2015 at 01:59:41PM +0200, Michal Hocko wrote:
> > > On Tue 05-05-15 12:45:42, Vladimir Davydov wrote:
> > > > Not all kmem allocations should be accounted to memcg. The following
> > > > patch gives an example when accounting of a certain type of allocations
> > > > to memcg can effectively result in a memory leak.
> > >
> > > > This patch adds the __GFP_NOACCOUNT flag which if passed to kmalloc
> > > > and friends will force the allocation to go through the root
> > > > cgroup. It will be used by the next patch.
> > >
> > > The name of the flag is way too generic. It is not clear that the
> > > accounting is KMEMCG related.
> >
> > The memory controller is the (primary) component that accounts
> > physical memory allocations in the kernel, so I don't see how this
> > would be ambiguous in any way.
>
> What if a high-level allocator wants to do some accounting as well?
> E.g. slab allocator accounts {un}reclaimable pages. It is a different
> thing because the accounting is per-cache rather than gfp based but I
> just wanted to point out that accounting is rather a wide term.

This is a concept we have discussed so many times before. The more
generic or fundamental the functionality in its context, the more
generic the name. The longer and more specific the name, the more
niche its functionality in that context.

See schedule().

See open().

Accounting is a broad term, but in the context of a physical memory
request I can not think of a more fundamental meaning of "do not
account" - without further qualification - then inhibiting what memcg
does: accounting the requested memory to the caller.

If some allocator would want to modify the accounting of a certain
*type* of memory, then that would be more specific, and a flag to
accomplish this would be expected to have a longer name.

One is a core feature of the VM, the other is a niche aspect of
another core feature of the VM.

2015-05-06 16:35:31

by Vladimir Davydov

[permalink] [raw]
Subject: [PATCH v2] gfp: add __GFP_NOACCOUNT

Not all kmem allocations should be accounted to memcg. The following
patch gives an example when accounting of a certain type of allocations
to memcg can effectively result in a memory leak. This patch adds the
__GFP_NOACCOUNT flag which if passed to kmalloc and friends will force
the allocation to go through the root cgroup. It will be used by the
next patch.

Note, since in case of kmemleak enabled each kmalloc implies yet another
allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
gfp_kmemleak_mask.

Alternatively, we could introduce a per kmem cache flag disabling
accounting for all allocations of a particular kind, but (a) we would
not be able to bypass accounting for kmalloc then and (b) a kmem cache
with this flag set could not be merged with a kmem cache without this
flag, which would increase the number of global caches and therefore
fragmentation even if the memory cgroup controller is not used.

Despite its generic name, currently __GFP_NOACCOUNT disables accounting
only for kmem allocations while user page allocations are always
charged. To catch abusing of this flag, a warning is issued on an
attempt of passing it to mem_cgroup_try_charge.

Signed-off-by: Vladimir Davydov <[email protected]>
Cc: <[email protected]> # 4.0
---
Changes in v2:
- explain drawbacks of per kmem cache flag disabling accounting as a possible
alternative to a GFP flag in commit message (Michal)
- warn if __GFP_NOACCOUNT is passed to mem_cgroup_try_charge (Michal)

include/linux/gfp.h | 2 ++
include/linux/memcontrol.h | 4 ++++
mm/kmemleak.c | 3 ++-
mm/memcontrol.c | 2 ++
4 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 97a9373e61e8..15928f0647e4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -30,6 +30,7 @@ struct vm_area_struct;
#define ___GFP_HARDWALL 0x20000u
#define ___GFP_THISNODE 0x40000u
#define ___GFP_RECLAIMABLE 0x80000u
+#define ___GFP_NOACCOUNT 0x100000u
#define ___GFP_NOTRACK 0x200000u
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
@@ -87,6 +88,7 @@ struct vm_area_struct;
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
+#define __GFP_NOACCOUNT ((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to kmemcg */
#define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) /* Don't track with kmemcheck */

#define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..6c8918114804 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -463,6 +463,8 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
if (!memcg_kmem_enabled())
return true;

+ if (gfp & __GFP_NOACCOUNT)
+ return true;
/*
* __GFP_NOFAIL allocations will move on even if charging is not
* possible. Therefore we don't even try, and have this allocation
@@ -522,6 +524,8 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
{
if (!memcg_kmem_enabled())
return cachep;
+ if (gfp & __GFP_NOACCOUNT)
+ return cachep;
if (gfp & __GFP_NOFAIL)
return cachep;
if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 5405aff5a590..f0fe4f2c1fa7 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -115,7 +115,8 @@
#define BYTES_PER_POINTER sizeof(void *)

/* GFP bitmask for kmemleak internal allocations */
-#define gfp_kmemleak_mask(gfp) (((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
+#define gfp_kmemleak_mask(gfp) (((gfp) & (GFP_KERNEL | GFP_ATOMIC | \
+ __GFP_NOACCOUNT)) | \
__GFP_NORETRY | __GFP_NOMEMALLOC | \
__GFP_NOWARN)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14c2f2017e37..41c9e530f1cd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5462,6 +5462,8 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
unsigned int nr_pages = 1;
int ret = 0;

+ WARN_ON_ONCE(gfp_mask & __GFP_NOACCOUNT);
+
if (mem_cgroup_disabled())
goto out;

--
1.7.10.4

2015-05-06 17:52:33

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 1/2] gfp: add __GFP_NOACCOUNT

On Wed, May 06, 2015 at 04:58:14PM +0200, Michal Hocko wrote:
> On Tue 05-05-15 12:45:42, Vladimir Davydov wrote:
> [...]
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 97a9373e61e8..37c422df2a0f 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -30,6 +30,7 @@ struct vm_area_struct;
> > #define ___GFP_HARDWALL 0x20000u
> > #define ___GFP_THISNODE 0x40000u
> > #define ___GFP_RECLAIMABLE 0x80000u
> > +#define ___GFP_NOACCOUNT 0x100000u
> > #define ___GFP_NOTRACK 0x200000u
> > #define ___GFP_NO_KSWAPD 0x400000u
> > #define ___GFP_OTHER_NODE 0x800000u
> > @@ -87,6 +88,7 @@ struct vm_area_struct;
> > #define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
> > #define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
> > #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
> > +#define __GFP_NOACCOUNT ((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to memcg */
>
> The wording suggests that _any_ memcg charge might be skipped by this flag
> but only kmem part is handled.
>
> So either handle the flag in try_charge or, IMO preferably, update the
> comment here and add WARN_ON{_ONCE}(gfp & __GFP_NOACCOUNT). I do not
> think we should allow to skip the charge for user pages ATM and warning
> could tell us about the abuse of the flag.

Michal, please just stop.

There is no reason to warn the user about this whatsoever. If you
want to prevent abuse - whatever that means - program your mailreader
to flag patches containing __GFP_NOACCOUNT and review them carefully.

This eagerness to clutter the code with defensiveness against the rest
of the kernel tree and to disrupt the user over every little blip that
has nothing to do with them is really getting old at this point.