v4:
- Incorporate suggestions from others like setting SLAB_ACCOUNT for
KMALLOC_CGROUP caches into patch 2.
- Add a new patch 3 to disable caches merging for KMALLOC_NORMAL caches
as suggested by Roman.
v3:
- Update patch 2 commit log and rework kmalloc_type() to make it easier
to read.
v2:
- Take suggestion from Vlastimil to use a new set of kmalloc-cg-* to
handle the objcg pointer array allocation and freeing problems.
Since the merging of the new slab memory controller in v5.9,
the page structure stores a pointer to the objcg pointer array for
slab pages. When the slab has no used objects, it can be freed in
free_slab() which will call kfree() to free the objcg pointer array in
memcg_free_page_obj_cgroups(). If it happens that the objcg pointer
array is the last used object in its slab, that slab may then be freed
which may cause kfree() to be called again.
With the right workload, the slab cache may be set up in a way that
allows the recursive kfree() calling loop to nest deep enough to
cause a kernel stack overflow and panic the system. In fact, we have
a reproducer that can cause kernel stack overflow on a s390 system
involving kmalloc-rcl-256 and kmalloc-rcl-128 slabs with the following
kfree() loop recursively called 74 times:
[ 285.520739] [<000000000ec432fc>] kfree+0x4bc/0x560
[ 285.520740] [<000000000ec43466>] __free_slab+0xc6/0x228
[ 285.520741] [<000000000ec41fc2>] __slab_free+0x3c2/0x3e0
[ 285.520742] [<000000000ec432fc>] kfree+0x4bc/0x560
:
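To make the nesting easier to picture, here is a small userspace toy
model of the chain described above. It is only a sketch and not kernel
code: struct toy_slab, toy_kfree(), toy_chain_depth() and
toy_split_depth() are made-up names, and each toy slab holds exactly one
object. Freeing the last object of a slab releases the slab, which in
turn frees its objcg pointer array from the next slab in the chain.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model, not kernel code: each "slab" tracks how many objects are in
 * use and which other slab (if any) holds its objcg pointer array.
 */
#define TOY_NR_SLABS 8

struct toy_slab {
	int inuse;			/* objects currently allocated */
	struct toy_slab *objcg_home;	/* slab holding our objcg array, or NULL */
};

/* Free one object from @s and return the nesting depth that was reached. */
static int toy_kfree(struct toy_slab *s, int depth)
{
	assert(s->inuse > 0);
	if (--s->inuse > 0)
		return depth;		/* slab still has users, stop here */

	/* Slab is now empty: releasing it frees its objcg array as well. */
	if (s->objcg_home)
		return toy_kfree(s->objcg_home, depth + 1);
	return depth;
}

/*
 * Build a chain where the objcg array of slab i is the only object in
 * slab i + 1, then free the single object in slab 0.
 */
static int toy_chain_depth(int nr)
{
	struct toy_slab slabs[TOY_NR_SLABS];
	int i;

	assert(nr >= 1 && nr <= TOY_NR_SLABS);
	for (i = 0; i < nr; i++) {
		slabs[i].inuse = 1;
		slabs[i].objcg_home = (i + 1 < nr) ? &slabs[i + 1] : NULL;
	}
	return toy_kfree(&slabs[0], 1);
}

/*
 * Same model after the split: the accounted slab keeps its objcg array
 * in an unaccounted slab, which has no objcg array of its own.
 */
static int toy_split_depth(void)
{
	struct toy_slab normal = { .inuse = 1, .objcg_home = NULL };
	struct toy_slab cg = { .inuse = 1, .objcg_home = &normal };

	return toy_kfree(&cg, 1);
}
```

In this model toy_chain_depth(n) nests n levels deep, while
toy_split_depth() can never go beyond two levels because the chain has
nowhere further to go.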
While investigating this issue, I also found an issue on the allocation
side. If the objcg pointer array happens to come from the same slab or
a circular dependency linkage is formed with multiple slabs, those
affected slabs can never be freed again.
This patch series addresses these two issues by introducing a new
set of kmalloc-cg-<n> caches split from kmalloc-<n> caches. The new
set will only contain non-reclaimable and non-dma objects that are
accounted in memory cgroups whereas the old set is now for unaccounted
objects only. By making this split, all the objcg pointer arrays will
come from the kmalloc-<n> caches, but those caches will never hold any
objcg pointer array. As a result, deeply nested kfree() calls and the
unfreeable slab problems are now gone.
Waiman Long (3):
mm: memcg/slab: Properly set up gfp flags for objcg pointer array
mm: memcg/slab: Create a new set of kmalloc-cg-<n> caches
mm: memcg/slab: Disable cache merging for KMALLOC_NORMAL caches
include/linux/slab.h | 41 ++++++++++++++++++++++++++++++++---------
mm/memcontrol.c | 8 ++++++++
mm/slab.h | 1 -
mm/slab_common.c | 32 ++++++++++++++++++++++++--------
4 files changed, 64 insertions(+), 18 deletions(-)
--
2.18.1
There are currently two problems in the way the objcg pointer array
(memcg_data) in the page structure is being allocated and freed.
On its allocation, it is possible that the allocated objcg pointer
array comes from the same slab that requires memory accounting. If this
happens, the slab will never become empty again as there is at least
one object left (the obj_cgroup array) in the slab.
When it is freed, the objcg pointer array object may be the last one
in its slab and hence causes kfree() to be called again. With the
right workload, the slab cache may be set up in a way that allows the
recursive kfree() calling loop to nest deep enough to cause a kernel
stack overflow and panic the system.
One way to solve this problem is to split the kmalloc-<n> caches
(KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
(KMALLOC_NORMAL) caches for unaccounted objects only and a new set of
kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
the other caches can still allow a mix of accounted and unaccounted
objects.
With this change, all the objcg pointer array objects will come from
KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
both the recursive kfree() problem and non-freeable slab problem are
gone.
Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have
mixed accounted and unaccounted objects, this will slightly reduce the
number of objcg pointer arrays that need to be allocated and save a bit
of memory. On the other hand, creating a new set of kmalloc caches does
have the effect of reducing cache utilization. So it is probably a wash.
The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
will include the newly added caches without change.
Suggested-by: Vlastimil Babka <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
---
include/linux/slab.h | 41 ++++++++++++++++++++++++++++++++---------
mm/slab_common.c | 25 +++++++++++++++++--------
2 files changed, 49 insertions(+), 17 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0c97d788762c..a51cad5f561c 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -305,12 +305,23 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
/*
* Whenever changing this, take care of that kmalloc_type() and
* create_kmalloc_caches() still work as intended.
+ *
+ * KMALLOC_NORMAL can contain only unaccounted objects whereas KMALLOC_CGROUP
+ * is for accounted but unreclaimable and non-dma objects. All the other
+ * kmem caches can have both accounted and unaccounted objects.
*/
enum kmalloc_cache_type {
KMALLOC_NORMAL = 0,
+#ifdef CONFIG_MEMCG_KMEM
+ KMALLOC_CGROUP,
+#else
+ KMALLOC_CGROUP = KMALLOC_NORMAL,
+#endif
KMALLOC_RECLAIM,
#ifdef CONFIG_ZONE_DMA
KMALLOC_DMA,
+#else
+ KMALLOC_DMA = KMALLOC_NORMAL,
#endif
NR_KMALLOC_TYPES
};
@@ -319,24 +330,36 @@ enum kmalloc_cache_type {
extern struct kmem_cache *
kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
+/*
+ * Define gfp bits that should not be set for KMALLOC_NORMAL.
+ */
+#define KMALLOC_NOT_NORMAL_BITS \
+ (__GFP_RECLAIMABLE | \
+ (IS_ENABLED(CONFIG_ZONE_DMA) ? __GFP_DMA : 0) | \
+ (IS_ENABLED(CONFIG_MEMCG_KMEM) ? __GFP_ACCOUNT : 0))
+
static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
{
-#ifdef CONFIG_ZONE_DMA
/*
* The most common case is KMALLOC_NORMAL, so test for it
- * with a single branch for both flags.
+ * with a single branch for all the relevant flags.
*/
- if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0))
+ if (likely((flags & KMALLOC_NOT_NORMAL_BITS) == 0))
return KMALLOC_NORMAL;
/*
- * At least one of the flags has to be set. If both are, __GFP_DMA
- * is more important.
+ * At least one of the flags has to be set. Their priorities in
+ * decreasing order are:
+ * 1) __GFP_DMA
+ * 2) __GFP_RECLAIMABLE
+ * 3) __GFP_ACCOUNT
*/
- return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM;
-#else
- return flags & __GFP_RECLAIMABLE ? KMALLOC_RECLAIM : KMALLOC_NORMAL;
-#endif
+ if (IS_ENABLED(CONFIG_ZONE_DMA) && (flags & __GFP_DMA))
+ return KMALLOC_DMA;
+ if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || (flags & __GFP_RECLAIMABLE))
+ return KMALLOC_RECLAIM;
+ else
+ return KMALLOC_CGROUP;
}
/*
diff --git a/mm/slab_common.c b/mm/slab_common.c
index f8833d3e5d47..bbaf41a7c77e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -727,21 +727,25 @@ struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
}
#ifdef CONFIG_ZONE_DMA
-#define INIT_KMALLOC_INFO(__size, __short_size) \
-{ \
- .name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \
- .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \
- .name[KMALLOC_DMA] = "dma-kmalloc-" #__short_size, \
- .size = __size, \
-}
+#define KMALLOC_DMA_NAME(sz) .name[KMALLOC_DMA] = "dma-kmalloc-" #sz,
+#else
+#define KMALLOC_DMA_NAME(sz)
+#endif
+
+#ifdef CONFIG_MEMCG_KMEM
+#define KMALLOC_CGROUP_NAME(sz) .name[KMALLOC_CGROUP] = "kmalloc-cg-" #sz,
#else
+#define KMALLOC_CGROUP_NAME(sz)
+#endif
+
#define INIT_KMALLOC_INFO(__size, __short_size) \
{ \
.name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \
.name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \
+ KMALLOC_CGROUP_NAME(__short_size) \
+ KMALLOC_DMA_NAME(__short_size) \
.size = __size, \
}
-#endif
/*
* kmalloc_info[] is to make slub_debug=,kmalloc-xx option work at boot time.
@@ -830,6 +834,8 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
{
if (type == KMALLOC_RECLAIM)
flags |= SLAB_RECLAIM_ACCOUNT;
+ else if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_CGROUP))
+ flags |= SLAB_ACCOUNT;
kmalloc_caches[type][idx] = create_kmalloc_cache(
kmalloc_info[idx].name[type],
@@ -847,6 +853,9 @@ void __init create_kmalloc_caches(slab_flags_t flags)
int i;
enum kmalloc_cache_type type;
+ /*
+ * Including KMALLOC_CGROUP if CONFIG_MEMCG_KMEM defined
+ */
for (type = KMALLOC_NORMAL; type <= KMALLOC_RECLAIM; type++) {
for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
if (!kmalloc_caches[type][i])
--
2.18.1
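For readers who want to try the flag priorities outside the kernel, the
selection logic of the reworked kmalloc_type() can be sketched in
userspace as below. This is only an illustration: the TOY_* flag values
are arbitrary and do not match the real __GFP_* bits, and TOY_ZONE_DMA
and TOY_MEMCG_KMEM are stand-ins for IS_ENABLED(CONFIG_ZONE_DMA) and
IS_ENABLED(CONFIG_MEMCG_KMEM), both assumed to be on here.

```c
/* Hypothetical flag bits; the real __GFP_* values differ. */
#define TOY_GFP_DMA		0x1u
#define TOY_GFP_RECLAIMABLE	0x2u
#define TOY_GFP_ACCOUNT		0x4u

/* Stand-ins for the two IS_ENABLED() checks, both assumed enabled. */
#define TOY_ZONE_DMA	1
#define TOY_MEMCG_KMEM	1

enum toy_cache_type {
	TOY_NORMAL = 0,
	TOY_CGROUP,
	TOY_RECLAIM,
	TOY_DMA,
};

/* Bits that must all be clear for the TOY_NORMAL fast path. */
#define TOY_NOT_NORMAL_BITS				\
	(TOY_GFP_RECLAIMABLE |				\
	 (TOY_ZONE_DMA ? TOY_GFP_DMA : 0) |		\
	 (TOY_MEMCG_KMEM ? TOY_GFP_ACCOUNT : 0))

static enum toy_cache_type toy_kmalloc_type(unsigned int flags)
{
	/* Most common case: none of the special bits is set. */
	if ((flags & TOY_NOT_NORMAL_BITS) == 0)
		return TOY_NORMAL;

	/* Priority order: DMA, then RECLAIMABLE, then ACCOUNT. */
	if (TOY_ZONE_DMA && (flags & TOY_GFP_DMA))
		return TOY_DMA;
	if (!TOY_MEMCG_KMEM || (flags & TOY_GFP_RECLAIMABLE))
		return TOY_RECLAIM;
	return TOY_CGROUP;
}
```

With these definitions an accounted request that is also reclaimable
lands in the reclaimable type, and only a plain accounted request ends
up in the cgroup type, which matches the priority list in the comment
of the patch.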
The KMALLOC_NORMAL (kmalloc-<n>) caches are for unaccounted objects only
when CONFIG_MEMCG_KMEM is enabled. To make sure that this condition
remains true, we will have to prevent KMALLOC_NORMAL caches from merging
with other kmem caches. This is now done by setting their refcount to -1
right after their creation.
Suggested-by: Roman Gushchin <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
mm/slab_common.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index bbaf41a7c77e..a0ff8e1d8b67 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -841,6 +841,13 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
kmalloc_info[idx].name[type],
kmalloc_info[idx].size, flags, 0,
kmalloc_info[idx].size);
+
+ /*
+ * If CONFIG_MEMCG_KMEM is enabled, disable cache merging for
+ * KMALLOC_NORMAL caches.
+ */
+ if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_NORMAL))
+ kmalloc_caches[type][idx]->refcount = -1;
}
/*
--
2.18.1
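The effect of the negative refcount in the patch above can be pictured
with the following userspace sketch. It is a strong simplification and
an assumption about how the merge lookup behaves: the real code compares
much more than the object size, and struct toy_cache,
toy_unmergeable(), toy_find_mergeable() and the "other-64" cache are
hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a kmem cache; only the fields needed here. */
struct toy_cache {
	const char *name;
	unsigned int size;
	int refcount;		/* negative means "never merge with me" */
};

/* A cache with a negative refcount opts out of merging. */
static bool toy_unmergeable(const struct toy_cache *c)
{
	return c->refcount < 0;
}

/*
 * Return an existing cache that a new cache of @size could be merged
 * into, or NULL if no candidate exists. Only the size is compared in
 * this sketch.
 */
static struct toy_cache *
toy_find_mergeable(struct toy_cache *caches, size_t nr, unsigned int size)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (toy_unmergeable(&caches[i]))
			continue;	/* skipped, like the KMALLOC_NORMAL caches */
		if (caches[i].size == size)
			return &caches[i];
	}
	return NULL;
}

/* Two toy kmalloc caches marked unmergeable and one ordinary cache. */
static struct toy_cache toy_caches[] = {
	{ "kmalloc-64",   64, -1 },
	{ "other-64",     64,  1 },
	{ "kmalloc-128", 128, -1 },
};
```

A lookup for size 64 skips "kmalloc-64" and returns "other-64", and a
lookup for size 128 finds nothing, so nothing is ever folded into the
caches that carry the negative refcount.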
On 5/5/21 10:06 PM, Waiman Long wrote:
> There are currently two problems in the way the objcg pointer array
> (memcg_data) in the page structure is being allocated and freed.
>
> On its allocation, it is possible that the allocated objcg pointer
> array comes from the same slab that requires memory accounting. If this
> happens, the slab will never become empty again as there is at least
> one object left (the obj_cgroup array) in the slab.
>
> When it is freed, the objcg pointer array object may be the last one
> in its slab and hence causes kfree() to be called again. With the
> right workload, the slab cache may be set up in a way that allows the
> recursive kfree() calling loop to nest deep enough to cause a kernel
> stack overflow and panic the system.
>
> One way to solve this problem is to split the kmalloc-<n> caches
> (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
> (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of
> kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
> the other caches can still allow a mix of accounted and unaccounted
> objects.
>
> With this change, all the objcg pointer array objects will come from
> KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
> both the recursive kfree() problem and non-freeable slab problem are
> gone.
>
> Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have
> mixed accounted and unaccounted objects, this will slightly reduce the
> number of objcg pointer arrays that need to be allocated and save a bit
> of memory. On the other hand, creating a new set of kmalloc caches does
> have the effect of reducing cache utilization. So it is probably a wash.
>
> The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
> KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
> will include the newly added caches without change.
>
> Suggested-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Waiman Long <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
I still believe the cgroup.memory=nokmem parameter should be respected,
otherwise the caches are not only created, but also used. I offer this followup
for squashing into your patch if you and Andrew agree:
----8<----
From c87378d437d9a59b8757033485431b4721c74173 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Thu, 6 May 2021 17:53:21 +0200
Subject: [PATCH] mm: memcg/slab: don't create kmalloc-cg caches with
cgroup.memory=nokmem
The caches should not be created when kmemcg is disabled on boot, otherwise
they are also filled by kmalloc(__GFP_ACCOUNT) allocations. When booted with
cgroup.memory=nokmem, link the kmalloc_caches[KMALLOC_CGROUP] entries to
KMALLOC_NORMAL entries instead.
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/internal.h | 5 +++++
mm/memcontrol.c | 2 +-
mm/slab_common.c | 9 +++++++--
3 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index ef5f336f59bd..b2d60b3403c7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -135,6 +135,11 @@ extern void putback_lru_page(struct page *page);
*/
extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
+/*
+ * in mm/memcontrol.c:
+ */
+extern bool cgroup_memory_nokmem;
+
/*
* in mm/page_alloc.c
*/
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5e3b4f23b830..b9ec01f2b4f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -83,7 +83,7 @@ DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
static bool cgroup_memory_nosocket;
/* Kernel memory accounting disabled? */
-static bool cgroup_memory_nokmem;
+bool cgroup_memory_nokmem;
/* Whether the swap controller is active */
#ifdef CONFIG_MEMCG_SWAP
diff --git a/mm/slab_common.c b/mm/slab_common.c
index bbaf41a7c77e..363f90215401 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -832,10 +832,15 @@ void __init setup_kmalloc_cache_index_table(void)
static void __init
new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
{
- if (type == KMALLOC_RECLAIM)
+ if (type == KMALLOC_RECLAIM) {
flags |= SLAB_RECLAIM_ACCOUNT;
- else if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_CGROUP))
+ } else if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_CGROUP)) {
+ if (cgroup_memory_nokmem) {
+ kmalloc_caches[type][idx] = kmalloc_caches[KMALLOC_NORMAL][idx];
+ return;
+ }
flags |= SLAB_ACCOUNT;
+ }
kmalloc_caches[type][idx] = create_kmalloc_cache(
kmalloc_info[idx].name[type],
--
2.31.1
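The aliasing done by the followup above can be modelled in userspace
roughly as follows. This is a sketch under simplifying assumptions, not
the kernel implementation: the toy_* names, the three cache sizes and
the nokmem function argument (standing in for the cgroup_memory_nokmem
variable) are all made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_type { TOY_NORMAL, TOY_CGROUP, TOY_RECLAIM, TOY_NR_TYPES };
#define TOY_NR_SIZES 3

struct toy_cache {
	enum toy_type type;
	int idx;
};

/* Backing storage for every cache that actually gets created. */
static struct toy_cache toy_pool[TOY_NR_TYPES * TOY_NR_SIZES];
static size_t toy_pool_used;

/* Lookup table, playing the role of kmalloc_caches[][] in this sketch. */
static struct toy_cache *toy_caches[TOY_NR_TYPES][TOY_NR_SIZES];

static struct toy_cache *toy_create(enum toy_type type, int idx)
{
	struct toy_cache *c = &toy_pool[toy_pool_used++];

	c->type = type;
	c->idx = idx;
	return c;
}

static void toy_new_cache(int idx, enum toy_type type, bool nokmem)
{
	if (type == TOY_CGROUP && nokmem) {
		/* Reuse the normal cache instead of creating a new one. */
		toy_caches[type][idx] = toy_caches[TOY_NORMAL][idx];
		return;
	}
	toy_caches[type][idx] = toy_create(type, idx);
}

/* Populate the table and return how many caches were really created. */
static size_t toy_setup(bool nokmem)
{
	int type, idx;

	toy_pool_used = 0;
	for (type = TOY_NORMAL; type <= TOY_RECLAIM; type++)
		for (idx = 0; idx < TOY_NR_SIZES; idx++)
			toy_new_cache(idx, (enum toy_type)type, nokmem);
	return toy_pool_used;
}
```

Note that the aliasing only works because the normal type is populated
before the cgroup type, which the enum ordering guarantees. In this
model toy_setup(false) creates nine caches and toy_setup(true) creates
six, with the cgroup row pointing at the normal row.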
On Thu, May 6, 2021 at 9:00 AM Vlastimil Babka <[email protected]> wrote:
>
>
> On 5/5/21 10:06 PM, Waiman Long wrote:
> > There are currently two problems in the way the objcg pointer array
> > (memcg_data) in the page structure is being allocated and freed.
> >
> > On its allocation, it is possible that the allocated objcg pointer
> > array comes from the same slab that requires memory accounting. If this
> > happens, the slab will never become empty again as there is at least
> > one object left (the obj_cgroup array) in the slab.
> >
> > When it is freed, the objcg pointer array object may be the last one
> > in its slab and hence causes kfree() to be called again. With the
> > right workload, the slab cache may be set up in a way that allows the
> > recursive kfree() calling loop to nest deep enough to cause a kernel
> > stack overflow and panic the system.
> >
> > One way to solve this problem is to split the kmalloc-<n> caches
> > (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
> > (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of
> > kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
> > the other caches can still allow a mix of accounted and unaccounted
> > objects.
> >
> > With this change, all the objcg pointer array objects will come from
> > KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
> > both the recursive kfree() problem and non-freeable slab problem are
> > gone.
> >
> > Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have
> > mixed accounted and unaccounted objects, this will slightly reduce the
> > number of objcg pointer arrays that need to be allocated and save a bit
> > of memory. On the other hand, creating a new set of kmalloc caches does
> > have the effect of reducing cache utilization. So it is probably a wash.
> >
> > The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
> > KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
> > will include the newly added caches without change.
> >
> > Suggested-by: Vlastimil Babka <[email protected]>
> > Signed-off-by: Waiman Long <[email protected]>
> > Reviewed-by: Shakeel Butt <[email protected]>
>
> Reviewed-by: Vlastimil Babka <[email protected]>
>
> I still believe the cgroup.memory=nokmem parameter should be respected,
> otherwise the caches are not only created, but also used. I offer this followup
> for squashing into your patch if you and Andrew agree:
>
> ----8<----
> From c87378d437d9a59b8757033485431b4721c74173 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <[email protected]>
> Date: Thu, 6 May 2021 17:53:21 +0200
> Subject: [PATCH] mm: memcg/slab: don't create kmalloc-cg caches with
> cgroup.memory=nokmem
>
> The caches should not be created when kmemcg is disabled on boot, otherwise
> they are also filled by kmalloc(__GFP_ACCOUNT) allocations. When booted with
> cgroup.memory=nokmem, link the kmalloc_caches[KMALLOC_CGROUP] entries to
> KMALLOC_NORMAL entries instead.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
Yes this makes sense:
Reviewed-by: Shakeel Butt <[email protected]>
On 5/5/21 10:06 PM, Waiman Long wrote:
> The KMALLOC_NORMAL (kmalloc-<n>) caches are for unaccounted objects only
> when CONFIG_MEMCG_KMEM is enabled. To make sure that this condition
> remains true, we will have to prevent KMALLOC_NORMAL caches from merging
> with other kmem caches. This is now done by setting their refcount to -1
> right after their creation.
>
> Suggested-by: Roman Gushchin <[email protected]>
> Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
(outside of scope of this patch/series, we should later replace this refcount
ugliness with a proper slab flag)
> ---
> mm/slab_common.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index bbaf41a7c77e..a0ff8e1d8b67 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -841,6 +841,13 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
> kmalloc_info[idx].name[type],
> kmalloc_info[idx].size, flags, 0,
> kmalloc_info[idx].size);
> +
> + /*
> + * If CONFIG_MEMCG_KMEM is enabled, disable cache merging for
> + * KMALLOC_NORMAL caches.
> + */
> + if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_NORMAL))
> + kmalloc_caches[type][idx]->refcount = -1;
> }
>
> /*
>
On Thu, May 06, 2021 at 06:00:16PM +0200, Vlastimil Babka wrote:
>
> On 5/5/21 10:06 PM, Waiman Long wrote:
> > There are currently two problems in the way the objcg pointer array
> > (memcg_data) in the page structure is being allocated and freed.
> >
> > On its allocation, it is possible that the allocated objcg pointer
> > array comes from the same slab that requires memory accounting. If this
> > happens, the slab will never become empty again as there is at least
> > one object left (the obj_cgroup array) in the slab.
> >
> > When it is freed, the objcg pointer array object may be the last one
> > in its slab and hence causes kfree() to be called again. With the
> > right workload, the slab cache may be set up in a way that allows the
> > recursive kfree() calling loop to nest deep enough to cause a kernel
> > stack overflow and panic the system.
> >
> > One way to solve this problem is to split the kmalloc-<n> caches
> > (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
> > (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of
> > kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
> > the other caches can still allow a mix of accounted and unaccounted
> > objects.
> >
> > With this change, all the objcg pointer array objects will come from
> > KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
> > both the recursive kfree() problem and non-freeable slab problem are
> > gone.
> >
> > Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have
> > mixed accounted and unaccounted objects, this will slightly reduce the
> > number of objcg pointer arrays that need to be allocated and save a bit
> > of memory. On the other hand, creating a new set of kmalloc caches does
> > have the effect of reducing cache utilization. So it is probably a wash.
> >
> > The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
> > KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
> > will include the newly added caches without change.
> >
> > Suggested-by: Vlastimil Babka <[email protected]>
> > Signed-off-by: Waiman Long <[email protected]>
> > Reviewed-by: Shakeel Butt <[email protected]>
>
> Reviewed-by: Vlastimil Babka <[email protected]>
>
> I still believe the cgroup.memory=nokmem parameter should be respected,
> otherwise the caches are not only created, but also used.
+1
> I offer this followup
> for squashing into your patch if you and Andrew agree:
>
> ----8<----
> From c87378d437d9a59b8757033485431b4721c74173 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <[email protected]>
> Date: Thu, 6 May 2021 17:53:21 +0200
> Subject: [PATCH] mm: memcg/slab: don't create kmalloc-cg caches with
> cgroup.memory=nokmem
>
> The caches should not be created when kmemcg is disabled on boot, otherwise
> they are also filled by kmalloc(__GFP_ACCOUNT) allocations. When booted with
> cgroup.memory=nokmem, link the kmalloc_caches[KMALLOC_CGROUP] entries to
> KMALLOC_NORMAL entries instead.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Thanks!
On 5/6/21 12:00 PM, Vlastimil Babka wrote:
> On 5/5/21 10:06 PM, Waiman Long wrote:
>> There are currently two problems in the way the objcg pointer array
>> (memcg_data) in the page structure is being allocated and freed.
>>
>> On its allocation, it is possible that the allocated objcg pointer
>> array comes from the same slab that requires memory accounting. If this
>> happens, the slab will never become empty again as there is at least
>> one object left (the obj_cgroup array) in the slab.
>>
>> When it is freed, the objcg pointer array object may be the last one
>> in its slab and hence causes kfree() to be called again. With the
>> right workload, the slab cache may be set up in a way that allows the
>> recursive kfree() calling loop to nest deep enough to cause a kernel
>> stack overflow and panic the system.
>>
>> One way to solve this problem is to split the kmalloc-<n> caches
>> (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
>> (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of
>> kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
>> the other caches can still allow a mix of accounted and unaccounted
>> objects.
>>
>> With this change, all the objcg pointer array objects will come from
>> KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
>> both the recursive kfree() problem and non-freeable slab problem are
>> gone.
>>
>> Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have
>> mixed accounted and unaccounted objects, this will slightly reduce the
>> number of objcg pointer arrays that need to be allocated and save a bit
>> of memory. On the other hand, creating a new set of kmalloc caches does
>> have the effect of reducing cache utilization. So it is probably a wash.
>>
>> The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
>> KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
>> will include the newly added caches without change.
>>
>> Suggested-by: Vlastimil Babka <[email protected]>
>> Signed-off-by: Waiman Long <[email protected]>
>> Reviewed-by: Shakeel Butt <[email protected]>
> Reviewed-by: Vlastimil Babka <[email protected]>
>
> I still believe the cgroup.memory=nokmem parameter should be respected,
> otherwise the caches are not only created, but also used. I offer this followup
> for squashing into your patch if you and Andrew agree:
>
> ----8<----
> From c87378d437d9a59b8757033485431b4721c74173 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <[email protected]>
> Date: Thu, 6 May 2021 17:53:21 +0200
> Subject: [PATCH] mm: memcg/slab: don't create kmalloc-cg caches with
> cgroup.memory=nokmem
>
> The caches should not be created when kmemcg is disabled on boot, otherwise
> they are also filled by kmalloc(__GFP_ACCOUNT) allocations. When booted with
> cgroup.memory=nokmem, link the kmalloc_caches[KMALLOC_CGROUP] entries to
> KMALLOC_NORMAL entries instead.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
> mm/internal.h | 5 +++++
> mm/memcontrol.c | 2 +-
> mm/slab_common.c | 9 +++++++--
> 3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index ef5f336f59bd..b2d60b3403c7 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -135,6 +135,11 @@ extern void putback_lru_page(struct page *page);
> */
> extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
>
> +/*
> + * in mm/memcontrol.c:
> + */
> +extern bool cgroup_memory_nokmem;
> +
> /*
> * in mm/page_alloc.c
> */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5e3b4f23b830..b9ec01f2b4f6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -83,7 +83,7 @@ DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
> static bool cgroup_memory_nosocket;
>
> /* Kernel memory accounting disabled? */
> -static bool cgroup_memory_nokmem;
> +bool cgroup_memory_nokmem;
>
> /* Whether the swap controller is active */
> #ifdef CONFIG_MEMCG_SWAP
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index bbaf41a7c77e..363f90215401 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -832,10 +832,15 @@ void __init setup_kmalloc_cache_index_table(void)
> static void __init
> new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
> {
> - if (type == KMALLOC_RECLAIM)
> + if (type == KMALLOC_RECLAIM) {
> flags |= SLAB_RECLAIM_ACCOUNT;
> - else if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_CGROUP))
> + } else if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_CGROUP)) {
> + if (cgroup_memory_nokmem) {
> + kmalloc_caches[type][idx] = kmalloc_caches[KMALLOC_NORMAL][idx];
> + return;
> + }
> flags |= SLAB_ACCOUNT;
> + }
>
> kmalloc_caches[type][idx] = create_kmalloc_cache(
> kmalloc_info[idx].name[type],
Thanks, the patch looks good to me.
Acked-by: Waiman Long <[email protected]>
Cheers,
Longman
There are currently two problems in the way the objcg pointer array
(memcg_data) in the page structure is being allocated and freed.
On its allocation, it is possible that the allocated objcg pointer
array comes from the same slab that requires memory accounting. If this
happens, the slab will never become empty again as there is at least
one object left (the obj_cgroup array) in the slab.
When it is freed, the objcg pointer array object may be the last one
in its slab and hence causes kfree() to be called again. With the
right workload, the slab cache may be set up in a way that allows the
recursive kfree() calling loop to nest deep enough to cause a kernel
stack overflow and panic the system.
One way to solve this problem is to split the kmalloc-<n> caches
(KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
(KMALLOC_NORMAL) caches for unaccounted objects only and a new set of
kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
the other caches can still allow a mix of accounted and unaccounted
objects.
With this change, all the objcg pointer array objects will come from
KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
both the recursive kfree() problem and non-freeable slab problem are
gone.
Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have
mixed accounted and unaccounted objects, this will slightly reduce the
number of objcg pointer arrays that need to be allocated and save a bit
of memory. On the other hand, creating a new set of kmalloc caches does
have the effect of reducing cache utilization. So it is probably a wash.
The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
will include the newly added caches without change.
Signed-off-by: Waiman Long <[email protected]>
Suggested-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
---
include/linux/slab.h | 42 +++++++++++++++++++++++++++++++++---------
mm/slab_common.c | 25 +++++++++++++++++--------
2 files changed, 50 insertions(+), 17 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0c97d788762c..aa7f6c222a60 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -305,9 +305,21 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
/*
* Whenever changing this, take care of that kmalloc_type() and
* create_kmalloc_caches() still work as intended.
+ *
+ * KMALLOC_NORMAL can contain only unaccounted objects whereas KMALLOC_CGROUP
+ * is for accounted but unreclaimable and non-dma objects. All the other
+ * kmem caches can have both accounted and unaccounted objects.
*/
enum kmalloc_cache_type {
KMALLOC_NORMAL = 0,
+#ifndef CONFIG_ZONE_DMA
+ KMALLOC_DMA = KMALLOC_NORMAL,
+#endif
+#ifndef CONFIG_MEMCG_KMEM
+ KMALLOC_CGROUP = KMALLOC_NORMAL,
+#else
+ KMALLOC_CGROUP,
+#endif
KMALLOC_RECLAIM,
#ifdef CONFIG_ZONE_DMA
KMALLOC_DMA,
@@ -319,24 +331,36 @@ enum kmalloc_cache_type {
extern struct kmem_cache *
kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
+/*
+ * Define gfp bits that should not be set for KMALLOC_NORMAL.
+ */
+#define KMALLOC_NOT_NORMAL_BITS \
+ (__GFP_RECLAIMABLE | \
+ (IS_ENABLED(CONFIG_ZONE_DMA) ? __GFP_DMA : 0) | \
+ (IS_ENABLED(CONFIG_MEMCG_KMEM) ? __GFP_ACCOUNT : 0))
+
static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
{
-#ifdef CONFIG_ZONE_DMA
/*
* The most common case is KMALLOC_NORMAL, so test for it
- * with a single branch for both flags.
+ * with a single branch for all the relevant flags.
*/
- if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0))
+ if (likely((flags & KMALLOC_NOT_NORMAL_BITS) == 0))
return KMALLOC_NORMAL;
/*
- * At least one of the flags has to be set. If both are, __GFP_DMA
- * is more important.
+ * At least one of the flags has to be set. Their priorities in
+ * decreasing order are:
+ * 1) __GFP_DMA
+ * 2) __GFP_RECLAIMABLE
+ * 3) __GFP_ACCOUNT
*/
- return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM;
-#else
- return flags & __GFP_RECLAIMABLE ? KMALLOC_RECLAIM : KMALLOC_NORMAL;
-#endif
+ if (IS_ENABLED(CONFIG_ZONE_DMA) && (flags & __GFP_DMA))
+ return KMALLOC_DMA;
+ if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || (flags & __GFP_RECLAIMABLE))
+ return KMALLOC_RECLAIM;
+ else
+ return KMALLOC_CGROUP;
}
/*
diff --git a/mm/slab_common.c b/mm/slab_common.c
index f8833d3e5d47..bbaf41a7c77e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -727,21 +727,25 @@ struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
}
#ifdef CONFIG_ZONE_DMA
-#define INIT_KMALLOC_INFO(__size, __short_size) \
-{ \
- .name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \
- .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \
- .name[KMALLOC_DMA] = "dma-kmalloc-" #__short_size, \
- .size = __size, \
-}
+#define KMALLOC_DMA_NAME(sz) .name[KMALLOC_DMA] = "dma-kmalloc-" #sz,
+#else
+#define KMALLOC_DMA_NAME(sz)
+#endif
+
+#ifdef CONFIG_MEMCG_KMEM
+#define KMALLOC_CGROUP_NAME(sz) .name[KMALLOC_CGROUP] = "kmalloc-cg-" #sz,
#else
+#define KMALLOC_CGROUP_NAME(sz)
+#endif
+
#define INIT_KMALLOC_INFO(__size, __short_size) \
{ \
.name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \
.name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \
+ KMALLOC_CGROUP_NAME(__short_size) \
+ KMALLOC_DMA_NAME(__short_size) \
.size = __size, \
}
-#endif
/*
* kmalloc_info[] is to make slub_debug=,kmalloc-xx option work at boot time.
@@ -830,6 +834,8 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags)
{
if (type == KMALLOC_RECLAIM)
flags |= SLAB_RECLAIM_ACCOUNT;
+ else if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_CGROUP))
+ flags |= SLAB_ACCOUNT;
kmalloc_caches[type][idx] = create_kmalloc_cache(
kmalloc_info[idx].name[type],
@@ -847,6 +853,9 @@ void __init create_kmalloc_caches(slab_flags_t flags)
int i;
enum kmalloc_cache_type type;
+ /*
+ * Including KMALLOC_CGROUP if CONFIG_MEMCG_KMEM defined
+ */
for (type = KMALLOC_NORMAL; type <= KMALLOC_RECLAIM; type++) {
for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
if (!kmalloc_caches[type][i])
--
2.18.1
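
For readers following the flag handling, the priority order in the reworked
kmalloc_type() can be modelled in a small userspace sketch. Everything
prefixed TOY_/toy_ is hypothetical: the bit values are stand-ins rather than
the kernel's real gfp bits, and the two TOY_* config macros replace
IS_ENABLED() with plain constants. It only illustrates the decision order
and is not the kernel implementation.

```c
#include <assert.h>

/* Hypothetical stand-in bit values; the kernel's real gfp bits differ. */
#define TOY_GFP_DMA         0x1u
#define TOY_GFP_RECLAIMABLE 0x2u
#define TOY_GFP_ACCOUNT     0x4u

/* Model a config with both options enabled (assumption for this sketch). */
#define TOY_ZONE_DMA   1
#define TOY_MEMCG_KMEM 1

enum toy_cache_type { TOY_NORMAL, TOY_CGROUP, TOY_RECLAIM, TOY_DMA };

/* Bits that must all be clear for a request to count as "normal". */
#define TOY_NOT_NORMAL_BITS \
	(TOY_GFP_RECLAIMABLE | \
	 (TOY_ZONE_DMA ? TOY_GFP_DMA : 0u) | \
	 (TOY_MEMCG_KMEM ? TOY_GFP_ACCOUNT : 0u))

static enum toy_cache_type toy_kmalloc_type(unsigned int flags)
{
	/* Common case first: one test covers all relevant bits. */
	if ((flags & TOY_NOT_NORMAL_BITS) == 0)
		return TOY_NORMAL;
	/* Priority 1: DMA wins over everything else. */
	if (TOY_ZONE_DMA && (flags & TOY_GFP_DMA))
		return TOY_DMA;
	/* Priority 2: reclaimable, also the only choice left without memcg. */
	if (!TOY_MEMCG_KMEM || (flags & TOY_GFP_RECLAIMABLE))
		return TOY_RECLAIM;
	/* Priority 3: only the account bit is set. */
	return TOY_CGROUP;
}

/* Checks that hold while both TOY_* config macros above are 1. */
static void toy_type_self_check(void)
{
	assert(toy_kmalloc_type(0) == TOY_NORMAL);
	assert(toy_kmalloc_type(TOY_GFP_ACCOUNT) == TOY_CGROUP);
	assert(toy_kmalloc_type(TOY_GFP_ACCOUNT | TOY_GFP_RECLAIMABLE) == TOY_RECLAIM);
	assert(toy_kmalloc_type(TOY_GFP_DMA | TOY_GFP_ACCOUNT) == TOY_DMA);
}
```

Calling toy_type_self_check() from a main() exercises the four cases. In
this model, setting TOY_MEMCG_KMEM to 0 drops the account bit from
TOY_NOT_NORMAL_BITS, so an account-only request is then classified as
TOY_NORMAL and the second check no longer applies.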
On 5/12/21 10:51 AM, Waiman Long wrote:
> There are currently two problems in the way the objcg pointer array
> (memcg_data) in the page structure is being allocated and freed.
>
> On its allocation, it is possible that the allocated objcg pointer
> array comes from the same slab that requires memory accounting. If this
> happens, the slab will never become empty again as there is at least
> one object left (the obj_cgroup array) in the slab.
>
> When it is freed, the objcg pointer array object may be the last one
> in its slab and hence causes kfree() to be called again. With the
> right workload, the slab cache may be set up in a way that allows the
> recursive kfree() calling loop to nest deep enough to cause a kernel
> stack overflow and panic the system.
>
> One way to solve this problem is to split the kmalloc-<n> caches
> (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
> (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of
> kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
> the other caches can still allow a mix of accounted and unaccounted
> objects.
>
> With this change, all the objcg pointer array objects will come from
> KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
> both the recursive kfree() problem and non-freeable slab problem are
> gone.
>
> Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have
> mixed accounted and unaccounted objects, this will slightly reduce the
> number of objcg pointer arrays that need to be allocated and save a bit
> of memory. On the other hand, creating a new set of kmalloc caches does
> have the effect of reducing cache utilization. So it is probably a wash.
>
> The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
> KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
> will include the newly added caches without change.
>
> Signed-off-by: Waiman Long <[email protected]>
> Suggested-by: Vlastimil Babka <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>
> ---
> include/linux/slab.h | 42 +++++++++++++++++++++++++++++++++---------
> mm/slab_common.c | 25 +++++++++++++++++--------
> 2 files changed, 50 insertions(+), 17 deletions(-)
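
The freeing problem described in the quoted commit message can be
illustrated with a toy userspace model. This is only a sketch under stated
assumptions: the toy_* names, the arrays and the slab layout are all
hypothetical, and "depth" here counts nested calls in the model, not real
kernel stack frames. The value 74 used in the usage note is borrowed from
the nesting count reported in the cover letter.

```c
#include <assert.h>

#define TOY_MAX_SLABS 128

/*
 * used[i]:    live objects in toy slab i.
 * meta_in[i]: index of the slab holding slab i's bookkeeping array
 *             (a stand-in for the objcg pointer array), or -1 if none.
 */
static int used[TOY_MAX_SLABS];
static int meta_in[TOY_MAX_SLABS];
static int max_depth;

/* Free one object; an emptied slab also frees its bookkeeping array. */
static void toy_free_object(int slab, int depth)
{
	if (depth > max_depth)
		max_depth = depth;
	if (--used[slab] > 0)
		return;
	if (meta_in[slab] >= 0)
		toy_free_object(meta_in[slab], depth + 1);
}

/* Mixed caches: slab i's array is the only object left in slab i + 1. */
static int toy_chained_depth(int n)
{
	int i;

	assert(n >= 1 && n <= TOY_MAX_SLABS);
	for (i = 0; i < n; i++) {
		used[i] = 1;
		meta_in[i] = (i + 1 < n) ? i + 1 : -1;
	}
	max_depth = 0;
	toy_free_object(0, 1);
	return max_depth;
}

/* Split caches: all arrays live in slab n - 1, which has no array itself. */
static int toy_split_depth(int n)
{
	int i;

	assert(n >= 2 && n <= TOY_MAX_SLABS);
	for (i = 0; i < n - 1; i++) {
		used[i] = 1;
		meta_in[i] = n - 1;
	}
	used[n - 1] = n - 1;
	meta_in[n - 1] = -1;
	max_depth = 0;
	toy_free_object(0, 1);
	return max_depth;
}
```

In this model toy_chained_depth(74) nests 74 levels deep, while
toy_split_depth(74) stops at depth 2, because the slab that holds the
arrays has no array of its own to free.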
The following is the diff from the previous version. It turns out that
the previous patch doesn't work if CONFIG_ZONE_DMA isn't defined.
diff --git a/include/linux/slab.h b/include/linux/slab.h
index a51cad5f561c..aa7f6c222a60 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -312,16 +312,17 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
*/
enum kmalloc_cache_type {
KMALLOC_NORMAL = 0,
-#ifdef CONFIG_MEMCG_KMEM
- KMALLOC_CGROUP,
-#else
+#ifndef CONFIG_ZONE_DMA
+ KMALLOC_DMA = KMALLOC_NORMAL,
+#endif
+#ifndef CONFIG_MEMCG_KMEM
KMALLOC_CGROUP = KMALLOC_NORMAL,
+#else
+ KMALLOC_CGROUP,
#endif
KMALLOC_RECLAIM,
#ifdef CONFIG_ZONE_DMA
KMALLOC_DMA,
-#else
- KMALLOC_DMA = KMALLOC_NORMAL,
#endif
NR_KMALLOC_TYPES
};
Cheers,
Longman
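
The thread does not spell out why the previous version failed without
CONFIG_ZONE_DMA. One likely reason follows from C's enumerator numbering
rule (an enumerator without an initializer is the previous one plus 1): an
alias placed last resets the counter right before NR_KMALLOC_TYPES. The
sketch below models a build with memcg accounting but without ZONE_DMA; the
OLD_*/NEW_*/toy_* names are hypothetical stand-ins for the kernel's
enumerators.

```c
#include <assert.h>

/* Previous ordering: the DMA alias comes after the real entries. */
enum toy_old_order {
	OLD_NORMAL = 0,
	OLD_CGROUP,			/* 1 */
	OLD_RECLAIM,			/* 2 */
	OLD_DMA = OLD_NORMAL,		/* alias, resets the counter to 0 */
	OLD_NR_TYPES			/* 0 + 1 = 1, not 3 */
};

/* New ordering: aliases are declared before the real entries. */
enum toy_new_order {
	NEW_NORMAL = 0,
	NEW_DMA = NEW_NORMAL,		/* alias, counter stays at 0 */
	NEW_CGROUP,			/* 1 */
	NEW_RECLAIM,			/* 2 */
	NEW_NR_TYPES			/* 3 */
};

/* Count the types visited by a "NORMAL .. RECLAIM" style loop. */
static int toy_count_types(int first, int last)
{
	int type, n = 0;

	for (type = first; type <= last; type++)
		n++;
	return n;
}

static void toy_enum_self_check(void)
{
	assert(OLD_NR_TYPES == 1);
	assert(NEW_NR_TYPES == 3);
	/* Normal, cgroup and reclaim are all covered by the one loop. */
	assert(toy_count_types(NEW_NORMAL, NEW_RECLAIM) == 3);
}
```

With the new ordering the type count matches the number of distinct cache
sets, and a loop from the normal type to the reclaim type picks up the
cgroup type in between, which is the property the commit message relies on.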
On Wed, 12 May 2021 10:54:19 -0400 Waiman Long <[email protected]> wrote:
> > include/linux/slab.h | 42 +++++++++++++++++++++++++++++++++---------
> > mm/slab_common.c | 25 +++++++++++++++++--------
> > 2 files changed, 50 insertions(+), 17 deletions(-)
>
> The following are the diff's from previous version. It turns out that
> the previous patch doesn't work if CONFIG_ZONE_DMA isn't defined.
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index a51cad5f561c..aa7f6c222a60 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -312,16 +312,17 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
> */
> enum kmalloc_cache_type {
> KMALLOC_NORMAL = 0,
> -#ifdef CONFIG_MEMCG_KMEM
> - KMALLOC_CGROUP,
> -#else
> +#ifndef CONFIG_ZONE_DMA
> + KMALLOC_DMA = KMALLOC_NORMAL,
> +#endif
> +#ifndef CONFIG_MEMCG_KMEM
> KMALLOC_CGROUP = KMALLOC_NORMAL,
> +#else
> + KMALLOC_CGROUP,
> #endif
> KMALLOC_RECLAIM,
> #ifdef CONFIG_ZONE_DMA
> KMALLOC_DMA,
> -#else
> - KMALLOC_DMA = KMALLOC_NORMAL,
> #endif
> NR_KMALLOC_TYPES
> };
I assume this fixes
https://lkml.kernel.org/r/[email protected]?
On 5/13/21 2:32 AM, Andrew Morton wrote:
> On Wed, 12 May 2021 10:54:19 -0400 Waiman Long <[email protected]> wrote:
>
>> > include/linux/slab.h | 42 +++++++++++++++++++++++++++++++++---------
>> > mm/slab_common.c | 25 +++++++++++++++++--------
>> > 2 files changed, 50 insertions(+), 17 deletions(-)
>>
>> The following are the diff's from previous version. It turns out that
>> the previous patch doesn't work if CONFIG_ZONE_DMA isn't defined.
>>
>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>> index a51cad5f561c..aa7f6c222a60 100644
>> --- a/include/linux/slab.h
>> +++ b/include/linux/slab.h
>> @@ -312,16 +312,17 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
>> */
>> enum kmalloc_cache_type {
>> KMALLOC_NORMAL = 0,
>> -#ifdef CONFIG_MEMCG_KMEM
>> - KMALLOC_CGROUP,
>> -#else
>> +#ifndef CONFIG_ZONE_DMA
>> + KMALLOC_DMA = KMALLOC_NORMAL,
>> +#endif
>> +#ifndef CONFIG_MEMCG_KMEM
>> KMALLOC_CGROUP = KMALLOC_NORMAL,
>> +#else
>> + KMALLOC_CGROUP,
>> #endif
>> KMALLOC_RECLAIM,
>> #ifdef CONFIG_ZONE_DMA
>> KMALLOC_DMA,
>> -#else
>> - KMALLOC_DMA = KMALLOC_NORMAL,
>> #endif
>> NR_KMALLOC_TYPES
>> };
>
> I assume this fixes
> https://lkml.kernel.org/r/[email protected]?
Yeah it should.
On 5/12/21 8:32 PM, Andrew Morton wrote:
> On Wed, 12 May 2021 10:54:19 -0400 Waiman Long <[email protected]> wrote:
>
>>> include/linux/slab.h | 42 +++++++++++++++++++++++++++++++++---------
>>> mm/slab_common.c | 25 +++++++++++++++++--------
>>> 2 files changed, 50 insertions(+), 17 deletions(-)
>> The following are the diff's from previous version. It turns out that
>> the previous patch doesn't work if CONFIG_ZONE_DMA isn't defined.
>>
>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>> index a51cad5f561c..aa7f6c222a60 100644
>> --- a/include/linux/slab.h
>> +++ b/include/linux/slab.h
>> @@ -312,16 +312,17 @@ static inline void __check_heap_object(const void *ptr, unsigned long n,
>> */
>> enum kmalloc_cache_type {
>> KMALLOC_NORMAL = 0,
>> -#ifdef CONFIG_MEMCG_KMEM
>> - KMALLOC_CGROUP,
>> -#else
>> +#ifndef CONFIG_ZONE_DMA
>> + KMALLOC_DMA = KMALLOC_NORMAL,
>> +#endif
>> +#ifndef CONFIG_MEMCG_KMEM
>> KMALLOC_CGROUP = KMALLOC_NORMAL,
>> +#else
>> + KMALLOC_CGROUP,
>> #endif
>> KMALLOC_RECLAIM,
>> #ifdef CONFIG_ZONE_DMA
>> KMALLOC_DMA,
>> -#else
>> - KMALLOC_DMA = KMALLOC_NORMAL,
>> #endif
>> NR_KMALLOC_TYPES
>> };
> I assume this fixes
> https://lkml.kernel.org/r/[email protected]?
>
Yes.
Cheers,
Longman