2022-09-13 07:07:55

by Feng Tang

Subject: [PATCH v6 0/4] mm/slub: some debug enhancements for kmalloc

kmalloc's API family is critical for mm, and one of its characteristics
is that it rounds up the request size to a fixed one (mostly power of 2).
When a user requests memory for '2^n + 1' bytes, 2^(n+1) bytes could
actually be allocated, so in the worst case there is around 50% memory
space waste.
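
For illustration only, a minimal sketch of how the rounding shows up
via ksize() (sizes assume the default kmalloc caches):

        /* a 1025-byte request falls into kmalloc-2048 */
        void *buf = kmalloc(1025, GFP_KERNEL);

        if (buf) {
                /* ksize() reports the bucket size, 2048 here */
                pr_info("asked for 1025, got %zu bytes\n", ksize(buf));
                kfree(buf);
        }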

The wastage is not a big issue for requests that get allocated/freed
quickly, but may cause problems with objects that have a longer
lifetime, and there have been OOM panics in some extreme cases.

This patchset (4 patches) tries to:
* Add a debug method to track each kmalloced object's wastage info,
and show the call stack of the original allocation (depends on the
SLAB_STORE_USER flag) (Patch 1)

* Extend the redzone sanity check to the extra kmalloced buffer beyond
what is requested, to better detect illegitimate access to it (depends
on SLAB_STORE_USER & SLAB_RED_ZONE) (Patch 2/3/4, where 2/3 are
preparation patches)

The redzone part has been tested with the code below:

        int shift, size, oob_size;
        void *buf;

        for (shift = 3; shift <= 12; shift++) {
                size = 1 << shift;
                buf = kmalloc(size + 4, GFP_KERNEL);
                /* We have 96 and 192 kmalloc sizes, which are not power of 2 */
                if (size == 64 || size == 128)
                        oob_size = 16;
                else
                        oob_size = size - 4;
                memset(buf + size + 4, 0xee, oob_size);
                kfree(buf);
        }
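
(With the series applied and 'slub_debug=UZ' on the kernel cmdline,
each memset above writes only into the extra space beyond the request,
so every kfree() should trigger a "kmalloc Redzone" report.)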

Please help to review, thanks!

- Feng

---
Changelogs:

since v5:
* Refine code/comments and add more perf info in commit log for
kzalloc change (Hyeonggon Yoo)
* change the kasan param name and refine comments about
kasan+redzone handling (Andrey Konovalov)
* put free pointer in meta data to make redzone check cover all
kmalloc objects (Hyeonggon Yoo)

since v4:
* fix a race issue in v3, by moving kmalloc debug init into
alloc_debug_processing (Hyeonggon Yoo)
* add 'partial_context' for better parameter passing in get_partial()
call chain (Vlastimil Babka)
* update 'slub.rst' for 'alloc_traces' part (Hyeonggon Yoo)
* update code comments for 'orig_size'

since v3:
* rebase against latest post 6.0-rc1 slab tree's 'for-next' branch
* fix a bug reported by 0Day, that kmalloc-redzoned data and kasan's
free meta data overlaps in the same kmalloc object data area

since v2:
* rebase against slab tree's 'for-next' branch
* fix pointer handling (Kefeng Wang)
* move kzalloc zeroing handling change to a separate patch (Vlastimil Babka)
* make 'orig_size' only depend on KMALLOC & STORE_USER flag
bits (Vlastimil Babka)

since v1:
* limit the 'orig_size' to kmalloc objects only, and save
it after track in metadata (Vlastimil Babka)
* fix an offset calculation problem in print_trailer

since RFC:
* fix problems in kmem_cache_alloc_bulk() and records sorting,
improve the print format (Hyeonggon Yoo)
* fix a compiling issue found by 0Day bot
* update the commit log based on info from iova developers

Feng Tang (4):
mm/slub: enable debugging memory wasting of kmalloc
mm/slub: only zero the requested size of buffer for kzalloc
mm: kasan: Add free_meta size info in struct kasan_cache
mm/slub: extend redzone check to extra allocated kmalloc space than
requested

Documentation/mm/slub.rst | 33 +++---
include/linux/kasan.h | 2 +
include/linux/slab.h | 2 +
mm/kasan/common.c | 2 +
mm/slab.c | 7 +-
mm/slab.h | 9 +-
mm/slab_common.c | 4 +
mm/slub.c | 217 ++++++++++++++++++++++++++++++--------
8 files changed, 214 insertions(+), 62 deletions(-)

--
2.34.1


2022-09-13 07:15:55

by Feng Tang

Subject: [PATCH v6 4/4] mm/slub: extend redzone check to extra allocated kmalloc space than requested

kmalloc will round up the request size to a fixed size (mostly power
of 2), so there could be an extra space beyond what is requested, whose
size is the actual buffer size minus the original request size.

To better detect out-of-bound access or abuse of this space, add
a redzone sanity check for it.
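
With the whole series applied, the layout of a kmalloc object under
SLAB_STORE_USER + SLAB_RED_ZONE roughly becomes (sketch):

        | left redzone | data [0, orig_size) | extra-space redzone
          [orig_size, object_size) | right redzone | free pointer |
          2 * struct track | orig_size | kasan meta | padding |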

And in the current kernel, some kmalloc users already know about the
existence of this space and utilize it after calling 'ksize()' to learn
the real size of the allocated buffer. So we skip the sanity check for
objects on which ksize() has been called, treating them as legitimate
users.
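
A sketch of that legitimate pattern (sizes are only for illustration):

        void *buf = kmalloc(70, GFP_KERNEL);    /* falls into kmalloc-96 */

        if (buf) {
                /* returns 96, and marks the object so that the
                 * extra-space redzone check is skipped for it */
                size_t sz = ksize(buf);

                memset(buf, 0, sz);     /* using all 96 bytes is fine */
        }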

In some cases, the free pointer could be saved inside the latter
part of the object data area, which may overlap the redzone part (for
small sizes of kmalloc objects). As suggested by Hyeonggon Yoo,
force the free pointer to be in the meta data area when kmalloc redzone
debug is enabled, to make all kmalloc objects covered by the redzone
check.

Suggested-by: Vlastimil Babka <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
mm/slab.h | 4 ++++
mm/slab_common.c | 4 ++++
mm/slub.c | 51 ++++++++++++++++++++++++++++++++++++++++++++----
3 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 3cf5adf63f48..5ca04d9c8bf5 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -881,4 +881,8 @@ void __check_heap_object(const void *ptr, unsigned long n,
}
#endif

+#ifdef CONFIG_SLUB_DEBUG
+void skip_orig_size_check(struct kmem_cache *s, const void *object);
+#endif
+
#endif /* MM_SLAB_H */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8e13e3aac53f..5106667d6adb 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1001,6 +1001,10 @@ size_t __ksize(const void *object)
return folio_size(folio);
}

+#ifdef CONFIG_SLUB_DEBUG
+ skip_orig_size_check(folio_slab(folio)->slab_cache, object);
+#endif
+
return slab_ksize(folio_slab(folio)->slab_cache);
}

diff --git a/mm/slub.c b/mm/slub.c
index 6f823e99d8b4..546b30ed5afd 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -812,12 +812,28 @@ static inline void set_orig_size(struct kmem_cache *s,
if (!slub_debug_orig_size(s))
return;

+#ifdef CONFIG_KASAN_GENERIC
+ /*
+ * KASAN could save its free meta data in the object's data area at
+ * offset 0. If that size is larger than 'orig_size', it could
+ * overlap the data redzone (from 'orig_size+1' to 'object_size'),
+ * where the check should be skipped.
+ */
+ if (s->kasan_info.free_meta_size > orig_size)
+ orig_size = s->object_size;
+#endif
+
p += get_info_end(s);
p += sizeof(struct track) * 2;

*(unsigned int *)p = orig_size;
}

+void skip_orig_size_check(struct kmem_cache *s, const void *object)
+{
+ set_orig_size(s, (void *)object, s->object_size);
+}
+
static unsigned int get_orig_size(struct kmem_cache *s, void *object)
{
void *p = kasan_reset_tag(object);
@@ -949,13 +965,27 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct slab *slab,
static void init_object(struct kmem_cache *s, void *object, u8 val)
{
u8 *p = kasan_reset_tag(object);
+ unsigned int orig_size = s->object_size;

- if (s->flags & SLAB_RED_ZONE)
+ if (s->flags & SLAB_RED_ZONE) {
memset(p - s->red_left_pad, val, s->red_left_pad);

+ if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
+ orig_size = get_orig_size(s, object);
+
+ /*
+ * Redzone the extra space allocated by kmalloc
+ * beyond what was requested.
+ */
+ if (orig_size < s->object_size)
+ memset(p + orig_size, val,
+ s->object_size - orig_size);
+ }
+ }
+
if (s->flags & __OBJECT_POISON) {
- memset(p, POISON_FREE, s->object_size - 1);
- p[s->object_size - 1] = POISON_END;
+ memset(p, POISON_FREE, orig_size - 1);
+ p[orig_size - 1] = POISON_END;
}

if (s->flags & SLAB_RED_ZONE)
@@ -1103,6 +1133,7 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
{
u8 *p = object;
u8 *endobject = object + s->object_size;
+ unsigned int orig_size;

if (s->flags & SLAB_RED_ZONE) {
if (!check_bytes_and_report(s, slab, object, "Left Redzone",
@@ -1112,6 +1143,17 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
if (!check_bytes_and_report(s, slab, object, "Right Redzone",
endobject, val, s->inuse - s->object_size))
return 0;
+
+ if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
+ orig_size = get_orig_size(s, object);
+
+ if (s->object_size > orig_size &&
+ !check_bytes_and_report(s, slab, object,
+ "kmalloc Redzone", p + orig_size,
+ val, s->object_size - orig_size)) {
+ return 0;
+ }
+ }
} else {
if ((s->flags & SLAB_POISON) && s->object_size < s->inuse) {
check_bytes_and_report(s, slab, p, "Alignment padding",
@@ -4187,7 +4229,8 @@ static int calculate_sizes(struct kmem_cache *s)
*/
s->inuse = size;

- if ((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
+ if (slub_debug_orig_size(s) ||
+ (flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
((flags & SLAB_RED_ZONE) && s->object_size < sizeof(void *)) ||
s->ctor) {
/*
--
2.34.1

2022-09-13 07:19:25

by Feng Tang

Subject: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

kmalloc's API family is critical for mm, and one of its characteristics
is that it rounds up the request size to a fixed one (mostly power of
2). Say when a user requests memory for '2^n + 1' bytes, 2^(n+1) bytes
could actually be allocated, so in the worst case there is around 50%
memory space waste.

The wastage is not a big issue for requests that get allocated/freed
quickly, but may cause problems with objects that have a longer
lifetime.

We've met a kernel boot OOM panic (v5.10), and from the dumped slab
info:

[ 26.062145] kmalloc-2k 814056KB 814056KB

From debugging we found a huge number of 'struct iova_magazine' objects,
whose size is 1032 bytes (1024 + 8), so each allocation wastes 1016
bytes. Though the issue was solved by providing the right (bigger)
amount of RAM, it is still nice to optimize the size (either use a
kmalloc-friendly size or create a dedicated slab for it).
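
For example, a dedicated cache would avoid the 1016-byte per-object
waste; a sketch (assuming 'struct iova_magazine' is visible here):

        static struct kmem_cache *iova_magazine_cache;
        struct iova_magazine *mag;

        iova_magazine_cache = kmem_cache_create("iova_magazine",
                                                sizeof(struct iova_magazine),
                                                0, 0, NULL);
        mag = kmem_cache_alloc(iova_magazine_cache, GFP_KERNEL);
        ...
        kmem_cache_free(iova_magazine_cache, mag);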

And from the lkml archive, there was another crash kernel OOM case [1]
back in 2019, which seems to be related to a similar slab waste
situation, as the log is similar:

[ 4.332648] iommu: Adding device 0000:20:02.0 to group 16
[ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
...
[ 4.857565] kmalloc-2048 59164KB 59164KB

The crash kernel only has 256M of memory, and 59M is pretty big here.
(Note: the related code has been changed and optimised in recent
kernels [2]; these logs are just picked to demo the problem, and a
patch changing its size to 1024 bytes has been merged.)

So add a way to track each kmalloc's memory waste info, and
leverage the existing SLUB debug framework (specifically
SLAB_STORE_USER) to show the call stack of the original allocation,
so that users can evaluate the waste situation, identify hot
spots and optimize accordingly, for better utilization of memory.

The waste info is integrated into existing interface:
'/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of
'kmalloc-4k' after boot is:

126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1
__kmem_cache_alloc_node+0x11f/0x4e0
__kmalloc_node+0x4e/0x140
ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe]
ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe]
ixgbe_probe+0x165f/0x1d20 [ixgbe]
local_pci_probe+0x78/0xc0
work_for_cpu_fn+0x26/0x40
...

which means in 'kmalloc-4k' slab, there are 126 requests of
2240 bytes which got a 4KB space (wasting 1856 bytes each
and 233856 bytes in total), from ixgbe_alloc_q_vector().

And when the system starts some real workload like multiple docker
instances, there could be more severe waste.

[1]. https://lkml.org/lkml/2019/8/12/266
[2]. https://lore.kernel.org/lkml/[email protected]/

[Thanks Hyeonggon for pointing out several bugs about sorting/format]
[Thanks Vlastimil for suggesting way to reduce memory usage of
orig_size and keep it only for kmalloc objects]

Signed-off-by: Feng Tang <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: John Garry <[email protected]>
Cc: Kefeng Wang <[email protected]>
---
Documentation/mm/slub.rst | 33 +++++---
include/linux/slab.h | 2 +
mm/slub.c | 156 ++++++++++++++++++++++++++++----------
3 files changed, 141 insertions(+), 50 deletions(-)

diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst
index 43063ade737a..4e1578186b4f 100644
--- a/Documentation/mm/slub.rst
+++ b/Documentation/mm/slub.rst
@@ -400,21 +400,30 @@ information:
allocated objects. The output is sorted by frequency of each trace.

Information in the output:
- Number of objects, allocating function, minimal/average/maximal jiffies since alloc,
- pid range of the allocating processes, cpu mask of allocating cpus, and stack trace.
+ Number of objects, allocating function, possible memory wastage of
+ kmalloc objects (total/per-object), minimal/average/maximal jiffies
+ since alloc, pid range of the allocating processes, cpu mask of
+ allocating cpus, numa node mask of origins of memory, and stack trace.

Example:::

- 1085 populate_error_injection_list+0x97/0x110 age=166678/166680/166682 pid=1 cpus=1::
- __slab_alloc+0x6d/0x90
- kmem_cache_alloc_trace+0x2eb/0x300
- populate_error_injection_list+0x97/0x110
- init_error_injection+0x1b/0x71
- do_one_initcall+0x5f/0x2d0
- kernel_init_freeable+0x26f/0x2d7
- kernel_init+0xe/0x118
- ret_from_fork+0x22/0x30
-
+ 338 pci_alloc_dev+0x2c/0xa0 waste=521872/1544 age=290837/291891/293509 pid=1 cpus=106 nodes=0-1
+ __kmem_cache_alloc_node+0x11f/0x4e0
+ kmalloc_trace+0x26/0xa0
+ pci_alloc_dev+0x2c/0xa0
+ pci_scan_single_device+0xd2/0x150
+ pci_scan_slot+0xf7/0x2d0
+ pci_scan_child_bus_extend+0x4e/0x360
+ acpi_pci_root_create+0x32e/0x3b0
+ pci_acpi_scan_root+0x2b9/0x2d0
+ acpi_pci_root_add.cold.11+0x110/0xb0a
+ acpi_bus_attach+0x262/0x3f0
+ device_for_each_child+0xb7/0x110
+ acpi_dev_for_each_child+0x77/0xa0
+ acpi_bus_attach+0x108/0x3f0
+ device_for_each_child+0xb7/0x110
+ acpi_dev_for_each_child+0x77/0xa0
+ acpi_bus_attach+0x108/0x3f0

2. free_traces::

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9b592e611cb1..6dc495f76644 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -29,6 +29,8 @@
#define SLAB_RED_ZONE ((slab_flags_t __force)0x00000400U)
/* DEBUG: Poison objects */
#define SLAB_POISON ((slab_flags_t __force)0x00000800U)
+/* Indicate a kmalloc slab */
+#define SLAB_KMALLOC ((slab_flags_t __force)0x00001000U)
/* Align objs on cache lines */
#define SLAB_HWCACHE_ALIGN ((slab_flags_t __force)0x00002000U)
/* Use GFP_DMA memory */
diff --git a/mm/slub.c b/mm/slub.c
index fe4fe0e72daf..c8ba16b3a4db 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -194,11 +194,24 @@ DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
#endif
#endif /* CONFIG_SLUB_DEBUG */

+/* Structure holding parameters for get_partial() call chain */
+struct partial_context {
+ struct slab **slab;
+ gfp_t flags;
+ unsigned int orig_size;
+};
+
static inline bool kmem_cache_debug(struct kmem_cache *s)
{
return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS);
}

+static inline bool slub_debug_orig_size(struct kmem_cache *s)
+{
+ return (kmem_cache_debug_flags(s, SLAB_STORE_USER) &&
+ (s->flags & SLAB_KMALLOC));
+}
+
void *fixup_red_left(struct kmem_cache *s, void *p)
{
if (kmem_cache_debug_flags(s, SLAB_RED_ZONE))
@@ -785,6 +798,39 @@ static void print_slab_info(const struct slab *slab)
folio_flags(folio, 0));
}

+/*
+ * kmalloc caches have fixed sizes (mostly power of 2), and the kmalloc()
+ * API family will round up the real request size to these fixed ones, so
+ * there could be an extra area beyond what is requested. Save the original
+ * request size in the meta data area, for better debugging and sanity checks.
+ */
+static inline void set_orig_size(struct kmem_cache *s,
+ void *object, unsigned int orig_size)
+{
+ void *p = kasan_reset_tag(object);
+
+ if (!slub_debug_orig_size(s))
+ return;
+
+ p += get_info_end(s);
+ p += sizeof(struct track) * 2;
+
+ *(unsigned int *)p = orig_size;
+}
+
+static unsigned int get_orig_size(struct kmem_cache *s, void *object)
+{
+ void *p = kasan_reset_tag(object);
+
+ if (!slub_debug_orig_size(s))
+ return s->object_size;
+
+ p += get_info_end(s);
+ p += sizeof(struct track) * 2;
+
+ return *(unsigned int *)p;
+}
+
static void slab_bug(struct kmem_cache *s, char *fmt, ...)
{
struct va_format vaf;
@@ -844,6 +890,9 @@ static void print_trailer(struct kmem_cache *s, struct slab *slab, u8 *p)
if (s->flags & SLAB_STORE_USER)
off += 2 * sizeof(struct track);

+ if (slub_debug_orig_size(s))
+ off += sizeof(unsigned int);
+
off += kasan_metadata_size(s);

if (off != size_from_object(s))
@@ -977,7 +1026,8 @@ static int check_bytes_and_report(struct kmem_cache *s, struct slab *slab,
*
* A. Free pointer (if we cannot overwrite object on free)
* B. Tracking data for SLAB_STORE_USER
- * C. Padding to reach required alignment boundary or at minimum
+ * C. Original request size for kmalloc object (SLAB_STORE_USER enabled)
+ * D. Padding to reach required alignment boundary or at minimum
* one word if debugging is on to be able to detect writes
* before the word boundary.
*
@@ -995,10 +1045,14 @@ static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
{
unsigned long off = get_info_end(s); /* The end of info */

- if (s->flags & SLAB_STORE_USER)
+ if (s->flags & SLAB_STORE_USER) {
/* We also have user information there */
off += 2 * sizeof(struct track);

+ if (s->flags & SLAB_KMALLOC)
+ off += sizeof(unsigned int);
+ }
+
off += kasan_metadata_size(s);

if (size_from_object(s) == off)
@@ -1293,7 +1347,7 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
}

static noinline int alloc_debug_processing(struct kmem_cache *s,
- struct slab *slab, void *object)
+ struct slab *slab, void *object, int orig_size)
{
if (s->flags & SLAB_CONSISTENCY_CHECKS) {
if (!alloc_consistency_checks(s, slab, object))
@@ -1302,6 +1356,7 @@ static noinline int alloc_debug_processing(struct kmem_cache *s,

/* Success. Perform special debug activities for allocs */
trace(s, slab, object, 1);
+ set_orig_size(s, object, orig_size);
init_object(s, object, SLUB_RED_ACTIVE);
return 1;

@@ -1570,7 +1625,10 @@ static inline
void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}

static inline int alloc_debug_processing(struct kmem_cache *s,
- struct slab *slab, void *object) { return 0; }
+ struct slab *slab, void *object, int orig_size) { return 0; }
+
+static inline void set_orig_size(struct kmem_cache *s,
+ void *object, unsigned int orig_size) {}

static inline void free_debug_processing(
struct kmem_cache *s, struct slab *slab,
@@ -1999,7 +2057,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
* it to full list if it was the last free object.
*/
static void *alloc_single_from_partial(struct kmem_cache *s,
- struct kmem_cache_node *n, struct slab *slab)
+ struct kmem_cache_node *n, struct slab *slab, int orig_size)
{
void *object;

@@ -2009,7 +2067,7 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
slab->freelist = get_freepointer(s, object);
slab->inuse++;

- if (!alloc_debug_processing(s, slab, object)) {
+ if (!alloc_debug_processing(s, slab, object, orig_size)) {
remove_partial(n, slab);
return NULL;
}
@@ -2028,7 +2086,7 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
* and put the slab to the partial (or full) list.
*/
static void *alloc_single_from_new_slab(struct kmem_cache *s,
- struct slab *slab)
+ struct slab *slab, int orig_size)
{
int nid = slab_nid(slab);
struct kmem_cache_node *n = get_node(s, nid);
@@ -2040,7 +2098,7 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s,
slab->freelist = get_freepointer(s, object);
slab->inuse = 1;

- if (!alloc_debug_processing(s, slab, object))
+ if (!alloc_debug_processing(s, slab, object, orig_size))
/*
* It's not really expected that this would fail on a
* freshly allocated slab, but a concurrent memory
@@ -2118,7 +2176,7 @@ static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
* Try to allocate a partial slab from a specific node.
*/
static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
- struct slab **ret_slab, gfp_t gfpflags)
+ struct partial_context *pc)
{
struct slab *slab, *slab2;
void *object = NULL;
@@ -2138,11 +2196,12 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
void *t;

- if (!pfmemalloc_match(slab, gfpflags))
+ if (!pfmemalloc_match(slab, pc->flags))
continue;

if (kmem_cache_debug(s)) {
- object = alloc_single_from_partial(s, n, slab);
+ object = alloc_single_from_partial(s, n, slab,
+ pc->orig_size);
if (object)
break;
continue;
@@ -2153,7 +2212,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
break;

if (!object) {
- *ret_slab = slab;
+ *pc->slab = slab;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
} else {
@@ -2177,14 +2236,13 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
/*
* Get a slab from somewhere. Search in increasing NUMA distances.
*/
-static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
- struct slab **ret_slab)
+static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
{
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
struct zoneref *z;
struct zone *zone;
- enum zone_type highest_zoneidx = gfp_zone(flags);
+ enum zone_type highest_zoneidx = gfp_zone(pc->flags);
void *object;
unsigned int cpuset_mems_cookie;

@@ -2212,15 +2270,15 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,

do {
cpuset_mems_cookie = read_mems_allowed_begin();
- zonelist = node_zonelist(mempolicy_slab_node(), flags);
+ zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
struct kmem_cache_node *n;

n = get_node(s, zone_to_nid(zone));

- if (n && cpuset_zone_allowed(zone, flags) &&
+ if (n && cpuset_zone_allowed(zone, pc->flags) &&
n->nr_partial > s->min_partial) {
- object = get_partial_node(s, n, ret_slab, flags);
+ object = get_partial_node(s, n, pc);
if (object) {
/*
* Don't check read_mems_allowed_retry()
@@ -2241,8 +2299,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
/*
* Get a partial slab, lock it and return it.
*/
-static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
- struct slab **ret_slab)
+static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc)
{
void *object;
int searchnode = node;
@@ -2250,11 +2307,11 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
if (node == NUMA_NO_NODE)
searchnode = numa_mem_id();

- object = get_partial_node(s, get_node(s, searchnode), ret_slab, flags);
+ object = get_partial_node(s, get_node(s, searchnode), pc);
if (object || node != NUMA_NO_NODE)
return object;

- return get_any_partial(s, flags, ret_slab);
+ return get_any_partial(s, pc);
}

#ifdef CONFIG_PREEMPTION
@@ -2974,11 +3031,12 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
* already disabled (which is the case for bulk allocation).
*/
static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
- unsigned long addr, struct kmem_cache_cpu *c)
+ unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
{
void *freelist;
struct slab *slab;
unsigned long flags;
+ struct partial_context pc;

stat(s, ALLOC_SLOWPATH);

@@ -3092,7 +3150,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

new_objects:

- freelist = get_partial(s, gfpflags, node, &slab);
+ pc.flags = gfpflags;
+ pc.slab = &slab;
+ pc.orig_size = orig_size;
+ freelist = get_partial(s, node, &pc);
if (freelist)
goto check_new_slab;

@@ -3108,7 +3169,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
stat(s, ALLOC_SLAB);

if (kmem_cache_debug(s)) {
- freelist = alloc_single_from_new_slab(s, slab);
+ freelist = alloc_single_from_new_slab(s, slab, orig_size);

if (unlikely(!freelist))
goto new_objects;
@@ -3140,6 +3201,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
if (s->flags & SLAB_STORE_USER)
set_track(s, freelist, TRACK_ALLOC, addr);
+
return freelist;
}

@@ -3182,7 +3244,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* pointer.
*/
static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
- unsigned long addr, struct kmem_cache_cpu *c)
+ unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
{
void *p;

@@ -3195,7 +3257,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c = slub_get_cpu_ptr(s->cpu_slab);
#endif

- p = ___slab_alloc(s, gfpflags, node, addr, c);
+ p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
#ifdef CONFIG_PREEMPT_COUNT
slub_put_cpu_ptr(s->cpu_slab);
#endif
@@ -3280,7 +3342,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l

if (!USE_LOCKLESS_FAST_PATH() ||
unlikely(!object || !slab || !node_match(slab, node))) {
- object = __slab_alloc(s, gfpflags, node, addr, c);
+ object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
} else {
void *next_object = get_freepointer_safe(s, object);

@@ -3747,7 +3809,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* of re-populating per CPU c->freelist
*/
p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
- _RET_IP_, c);
+ _RET_IP_, c, s->object_size);
if (unlikely(!p[i]))
goto error;

@@ -4150,12 +4212,17 @@ static int calculate_sizes(struct kmem_cache *s)
}

#ifdef CONFIG_SLUB_DEBUG
- if (flags & SLAB_STORE_USER)
+ if (flags & SLAB_STORE_USER) {
/*
* Need to store information about allocs and frees after
* the object.
*/
size += 2 * sizeof(struct track);
+
+ /* Save the original kmalloc request size */
+ if (flags & SLAB_KMALLOC)
+ size += sizeof(unsigned int);
+ }
#endif

kasan_cache_create(s, &size, &s->flags);
@@ -4770,7 +4837,7 @@ void __init kmem_cache_init(void)

/* Now we can use the kmem_cache to allocate kmalloc slabs */
setup_kmalloc_cache_index_table();
- create_kmalloc_caches(0);
+ create_kmalloc_caches(SLAB_KMALLOC);

/* Setup random freelists for each cache */
init_freelist_randomization();
@@ -4937,6 +5004,7 @@ struct location {
depot_stack_handle_t handle;
unsigned long count;
unsigned long addr;
+ unsigned long waste;
long long sum_time;
long min_time;
long max_time;
@@ -4983,13 +5051,15 @@ static int alloc_loc_track(struct loc_track *t, unsigned long max, gfp_t flags)
}

static int add_location(struct loc_track *t, struct kmem_cache *s,
- const struct track *track)
+ const struct track *track,
+ unsigned int orig_size)
{
long start, end, pos;
struct location *l;
- unsigned long caddr, chandle;
+ unsigned long caddr, chandle, cwaste;
unsigned long age = jiffies - track->when;
depot_stack_handle_t handle = 0;
+ unsigned int waste = s->object_size - orig_size;

#ifdef CONFIG_STACKDEPOT
handle = READ_ONCE(track->handle);
@@ -5007,11 +5077,13 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
if (pos == end)
break;

- caddr = t->loc[pos].addr;
- chandle = t->loc[pos].handle;
- if ((track->addr == caddr) && (handle == chandle)) {
+ l = &t->loc[pos];
+ caddr = l->addr;
+ chandle = l->handle;
+ cwaste = l->waste;
+ if ((track->addr == caddr) && (handle == chandle) &&
+ (waste == cwaste)) {

- l = &t->loc[pos];
l->count++;
if (track->when) {
l->sum_time += age;
@@ -5036,6 +5108,9 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
end = pos;
else if (track->addr == caddr && handle < chandle)
end = pos;
+ else if (track->addr == caddr && handle == chandle &&
+ waste < cwaste)
+ end = pos;
else
start = pos;
}
@@ -5059,6 +5134,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
l->min_pid = track->pid;
l->max_pid = track->pid;
l->handle = handle;
+ l->waste = waste;
cpumask_clear(to_cpumask(l->cpus));
cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
nodes_clear(l->nodes);
@@ -5077,7 +5153,7 @@ static void process_slab(struct loc_track *t, struct kmem_cache *s,

for_each_object(p, s, addr, slab->objects)
if (!test_bit(__obj_to_index(s, addr, p), obj_map))
- add_location(t, s, get_track(s, p, alloc));
+ add_location(t, s, get_track(s, p, alloc), get_orig_size(s, p));
}
#endif /* CONFIG_DEBUG_FS */
#endif /* CONFIG_SLUB_DEBUG */
@@ -5942,6 +6018,10 @@ static int slab_debugfs_show(struct seq_file *seq, void *v)
else
seq_puts(seq, "<not-available>");

+ if (l->waste)
+ seq_printf(seq, " waste=%lu/%lu",
+ l->count * l->waste, l->waste);
+
if (l->sum_time != l->min_time) {
seq_printf(seq, " age=%ld/%llu/%ld",
l->min_time, div_u64(l->sum_time, l->count),
--
2.34.1

2022-09-13 07:20:59

by Feng Tang

Subject: [PATCH v6 3/4] mm: kasan: Add free_meta size info in struct kasan_cache

When KASAN is enabled for slab/slub, it may save its free_meta
data in the first part of the slab object's data area on the object's
free path, which works fine.

There is an ongoing effort to extend slub's debug function to
redzone the latter part of the kmalloc object area, and when both
debug features are enabled there is a possible conflict, especially
when the kmalloc object is small, as caught by the 0Day bot [1].

To provide better information to slab/slub, add free_meta's data size
to 'struct kasan_cache', so that its users can take the right action
to avoid the data conflict.
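
The intended consumer is slub in patch 4/4 of this series, roughly:

        /* in slub's set_orig_size(): don't redzone the area that
         * KASAN's free meta data may be written into */
        if (s->kasan_info.free_meta_size > orig_size)
                orig_size = s->object_size;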

[1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
---
include/linux/kasan.h | 2 ++
mm/kasan/common.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index b092277bf48d..49af9513e8ed 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -100,6 +100,8 @@ static inline bool kasan_has_integrated_init(void)
struct kasan_cache {
int alloc_meta_offset;
int free_meta_offset;
+ /* size of free_meta data saved in object's data area */
+ int free_meta_size;
bool is_kmalloc;
};

diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 69f583855c8b..0cb867e92524 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -201,6 +201,8 @@ void __kasan_cache_create(struct kmem_cache *cache, unsigned int *size,
cache->kasan_info.free_meta_offset = KASAN_NO_FREE_META;
*size = ok_size;
}
+ } else {
+ cache->kasan_info.free_meta_size = sizeof(struct kasan_free_meta);
}

/* Calculate size with optimal redzone. */
--
2.34.1

2022-09-13 07:22:09

by Feng Tang

Subject: [PATCH v6 2/4] mm/slub: only zero the requested size of buffer for kzalloc

kzalloc/kmalloc will round up the request size to a fixed size
(mostly power of 2), so the allocated memory could be more than
requested. Currently the kzalloc family of APIs will zero all of
the allocated memory.

To detect out-of-bound usage of the extra allocated memory, only
zero the requested part, so that a sanity check can be added for
the extra space later.

Performance-wise, the smaller zeroing length also brings shorter
execution time, as shown by test data on various server/desktop
platforms.

For kzalloc users who will call ksize() later and utilize this
extra space, please be aware that the space is not zeroed
anymore.
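
That is, a (hypothetical) pattern like the following changes behavior
with this patch:

        void *buf;
        size_t sz;

        buf = kzalloc(24, GFP_KERNEL);  /* served from kmalloc-32 */
        sz = ksize(buf);                /* 32 */
        /* buf[24..31] is no longer guaranteed to be zero */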

Signed-off-by: Feng Tang <[email protected]>
---
mm/slab.c | 7 ++++---
mm/slab.h | 5 +++--
mm/slub.c | 10 +++++++---
3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index a5486ff8362a..4594de0e3d6b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3253,7 +3253,8 @@ slab_alloc_node(struct kmem_cache *cachep, struct list_lru *lru, gfp_t flags,
init = slab_want_init_on_alloc(flags, cachep);

out:
- slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init);
+ slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init,
+ cachep->object_size);
return objp;
}

@@ -3506,13 +3507,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* Done outside of the IRQ disabled section.
*/
slab_post_alloc_hook(s, objcg, flags, size, p,
- slab_want_init_on_alloc(flags, s));
+ slab_want_init_on_alloc(flags, s), s->object_size);
/* FIXME: Trace call missing. Christoph would like a bulk variant */
return size;
error:
local_irq_enable();
cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
- slab_post_alloc_hook(s, objcg, flags, i, p, false);
+ slab_post_alloc_hook(s, objcg, flags, i, p, false, s->object_size);
kmem_cache_free_bulk(s, i, p);
return 0;
}
diff --git a/mm/slab.h b/mm/slab.h
index d0ef9dd44b71..3cf5adf63f48 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -730,7 +730,8 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,

static inline void slab_post_alloc_hook(struct kmem_cache *s,
struct obj_cgroup *objcg, gfp_t flags,
- size_t size, void **p, bool init)
+ size_t size, void **p, bool init,
+ unsigned int orig_size)
{
size_t i;

@@ -746,7 +747,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
for (i = 0; i < size; i++) {
p[i] = kasan_slab_alloc(s, p[i], flags, init);
if (p[i] && init && !kasan_has_integrated_init())
- memset(p[i], 0, s->object_size);
+ memset(p[i], 0, orig_size);
kmemleak_alloc_recursive(p[i], s->object_size, 1,
s->flags, flags);
}
diff --git a/mm/slub.c b/mm/slub.c
index c8ba16b3a4db..6f823e99d8b4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3376,7 +3376,11 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
init = slab_want_init_on_alloc(gfpflags, s);

out:
- slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
+ /*
+ * When init equals 'true', as for the kzalloc() family, only
+ * @orig_size bytes will be zeroed instead of s->object_size
+ */
+ slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init, orig_size);

return object;
}
@@ -3833,11 +3837,11 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
* Done outside of the IRQ disabled fastpath loop.
*/
slab_post_alloc_hook(s, objcg, flags, size, p,
- slab_want_init_on_alloc(flags, s));
+ slab_want_init_on_alloc(flags, s), s->object_size);
return i;
error:
slub_put_cpu_ptr(s->cpu_slab);
- slab_post_alloc_hook(s, objcg, flags, i, p, false);
+ slab_post_alloc_hook(s, objcg, flags, i, p, false, s->object_size);
kmem_cache_free_bulk(s, i, p);
return 0;
}
--
2.34.1

2022-09-13 09:45:48

by Hyeonggon Yoo

Subject: Re: [PATCH v6 4/4] mm/slub: extend redzone check to extra allocated kmalloc space than requested

On Tue, Sep 13, 2022 at 02:54:23PM +0800, Feng Tang wrote:
> kmalloc will round up the request size to a fixed size (mostly power
> of 2), so there could be an extra space beyond what is requested, whose
> size is the actual buffer size minus the original request size.
>
> To better detect out-of-bound access or abuse of this space, add
> a redzone sanity check for it.
>
> And in the current kernel, some kmalloc users already know about the
> existence of this space and utilize it after calling 'ksize()' to learn
> the real size of the allocated buffer. So we skip the sanity check for
> objects on which ksize() has been called, treating them as legitimate
> users.
>
> In some cases, the free pointer could be saved inside the latter
> part of the object data area, which may overlap the redzone part (for
> small sizes of kmalloc objects). As suggested by Hyeonggon Yoo,
> force the free pointer to be in the meta data area when kmalloc redzone
> debug is enabled, to make all kmalloc objects covered by the redzone
> check.
>
> Suggested-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> mm/slab.h | 4 ++++
> mm/slab_common.c | 4 ++++
> mm/slub.c | 51 ++++++++++++++++++++++++++++++++++++++++++++----
> 3 files changed, 55 insertions(+), 4 deletions(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 3cf5adf63f48..5ca04d9c8bf5 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -881,4 +881,8 @@ void __check_heap_object(const void *ptr, unsigned long n,
> }
> #endif
>
> +#ifdef CONFIG_SLUB_DEBUG
> +void skip_orig_size_check(struct kmem_cache *s, const void *object);
> +#endif
> +
> #endif /* MM_SLAB_H */
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 8e13e3aac53f..5106667d6adb 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1001,6 +1001,10 @@ size_t __ksize(const void *object)
> return folio_size(folio);
> }
>
> +#ifdef CONFIG_SLUB_DEBUG
> + skip_orig_size_check(folio_slab(folio)->slab_cache, object);
> +#endif
> +
> return slab_ksize(folio_slab(folio)->slab_cache);
> }
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 6f823e99d8b4..546b30ed5afd 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -812,12 +812,28 @@ static inline void set_orig_size(struct kmem_cache *s,
> if (!slub_debug_orig_size(s))
> return;
>
> +#ifdef CONFIG_KASAN_GENERIC
> + /*
> + * KASAN could save its free meta data in the object's data area at
> + * offset 0. If that size is larger than 'orig_size', it could
> + * overlap the data redzone (from 'orig_size+1' to 'object_size'),
> + * where the check should be skipped.
> + */
> + if (s->kasan_info.free_meta_size > orig_size)
> + orig_size = s->object_size;
> +#endif
> +
> p += get_info_end(s);
> p += sizeof(struct track) * 2;
>
> *(unsigned int *)p = orig_size;
> }
>
> +void skip_orig_size_check(struct kmem_cache *s, const void *object)
> +{
> + set_orig_size(s, (void *)object, s->object_size);
> +}
> +
> static unsigned int get_orig_size(struct kmem_cache *s, void *object)
> {
> void *p = kasan_reset_tag(object);
> @@ -949,13 +965,27 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct slab *slab,
> static void init_object(struct kmem_cache *s, void *object, u8 val)
> {
> u8 *p = kasan_reset_tag(object);
> + unsigned int orig_size = s->object_size;
>
> - if (s->flags & SLAB_RED_ZONE)
> + if (s->flags & SLAB_RED_ZONE) {
> memset(p - s->red_left_pad, val, s->red_left_pad);
>
> + if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
> + orig_size = get_orig_size(s, object);
> +
> + /*
> + * Redzone the extra space allocated by kmalloc
> + * beyond what was requested.
> + */
> + if (orig_size < s->object_size)
> + memset(p + orig_size, val,
> + s->object_size - orig_size);
> + }
> + }
> +
> if (s->flags & __OBJECT_POISON) {
> - memset(p, POISON_FREE, s->object_size - 1);
> - p[s->object_size - 1] = POISON_END;
> + memset(p, POISON_FREE, orig_size - 1);
> + p[orig_size - 1] = POISON_END;
> }
>
> if (s->flags & SLAB_RED_ZONE)
> @@ -1103,6 +1133,7 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
> {
> u8 *p = object;
> u8 *endobject = object + s->object_size;
> + unsigned int orig_size;
>
> if (s->flags & SLAB_RED_ZONE) {
> if (!check_bytes_and_report(s, slab, object, "Left Redzone",
> @@ -1112,6 +1143,17 @@ static int check_object(struct kmem_cache *s, struct slab *slab,
> if (!check_bytes_and_report(s, slab, object, "Right Redzone",
> endobject, val, s->inuse - s->object_size))
> return 0;
> +
> + if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
> + orig_size = get_orig_size(s, object);
> +
> + if (s->object_size > orig_size &&
> + !check_bytes_and_report(s, slab, object,
> + "kmalloc Redzone", p + orig_size,
> + val, s->object_size - orig_size)) {
> + return 0;
> + }
> + }
> } else {
> if ((s->flags & SLAB_POISON) && s->object_size < s->inuse) {
> check_bytes_and_report(s, slab, p, "Alignment padding",
> @@ -4187,7 +4229,8 @@ static int calculate_sizes(struct kmem_cache *s)
> */
> s->inuse = size;
>
> - if ((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
> + if (slub_debug_orig_size(s) ||
> + (flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
> ((flags & SLAB_RED_ZONE) && s->object_size < sizeof(void *)) ||
> s->ctor) {
> /*
> --
> 2.34.1
>

For the slab part:

Looks good to me.
Acked-by: Hyeonggon Yoo <[email protected]>

Thanks!

--
Thanks,
Hyeonggon

2022-09-20 20:22:05

by Andrey Konovalov

Subject: Re: [PATCH v6 3/4] mm: kasan: Add free_meta size info in struct kasan_cache

On Tue, Sep 13, 2022 at 8:54 AM Feng Tang <[email protected]> wrote:
>
> When KASAN is enabled for slab/slub, it may save its free_meta
> data in the first part of the slab object's data area on the object's
> free path, which works fine.
>
> There is an ongoing effort to extend slub's debug function to
> redzone the latter part of the kmalloc object area, and when both
> debug features are enabled there is a possible conflict, especially
> when the kmalloc object is small, as caught by the 0Day bot [1].
>
> To provide better information to slab/slub, add free_meta's data size
> to 'struct kasan_cache', so that its users can take the right action
> to avoid the data conflict.
>
> [1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> Acked-by: Dmitry Vyukov <[email protected]>
> ---
> include/linux/kasan.h | 2 ++
> mm/kasan/common.c | 2 ++
> 2 files changed, 4 insertions(+)
>
> diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> index b092277bf48d..49af9513e8ed 100644
> --- a/include/linux/kasan.h
> +++ b/include/linux/kasan.h
> @@ -100,6 +100,8 @@ static inline bool kasan_has_integrated_init(void)
> struct kasan_cache {
> int alloc_meta_offset;
> int free_meta_offset;
> + /* size of free_meta data saved in object's data area */
> + int free_meta_size;
> bool is_kmalloc;
> };
>
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index 69f583855c8b..0cb867e92524 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -201,6 +201,8 @@ void __kasan_cache_create(struct kmem_cache *cache, unsigned int *size,
> cache->kasan_info.free_meta_offset = KASAN_NO_FREE_META;
> *size = ok_size;
> }
> + } else {
> + cache->kasan_info.free_meta_size = sizeof(struct kasan_free_meta);

Hi Feng,

I just realized that we already have a function that exposes similar
functionality: kasan_metadata_size. However, that function returns the
size of the metadata that is stored in the redzone.

I think, instead of adding free_meta_size, a better approach would be to:

1. Rename kasan_metadata_size to kasan_metadata_size_in_redzone (or
something like that).
2. Add kasan_metadata_size_in_object with appropriate implementation
and use that in your patches.

This avoids exposing KASAN-internal details, such as what fields the
kasan_cache struct has, to the common code.

Sorry for not realizing this straight away.

(Note that there's an upcoming patch that fixes a bug in
kasan_metadata_size's implementation [1].)

Thanks!

[1] https://lore.kernel.org/linux-mm/c7b316d30d90e5947eb8280f4dc78856a49298cf.1662411799.git.andreyknvl@google.com/



> }
>
> /* Calculate size with optimal redzone. */
> --
> 2.34.1
>

2022-09-21 12:47:25

by Feng Tang

Subject: Re: [PATCH v6 3/4] mm: kasan: Add free_meta size info in struct kasan_cache

Hi Andrey,

On Wed, Sep 21, 2022 at 03:20:58AM +0800, Andrey Konovalov wrote:
> On Tue, Sep 13, 2022 at 8:54 AM Feng Tang <[email protected]> wrote:
> >
> > When KASAN is enabled for slab/slub, it may save its free_meta
> > data in the first part of the slab object's data area on the object's
> > free path, which works fine.
> >
> > There is an ongoing effort to extend slub's debug function to
> > redzone the latter part of the kmalloc object area, and when both
> > debug features are enabled there is a possible conflict, especially
> > when the kmalloc object is small, as caught by the 0Day bot [1].
> >
> > To provide better information to slab/slub, add free_meta's data size
> > to 'struct kasan_cache', so that its users can take the right action
> > to avoid the data conflict.
> >
> > [1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/
> > Reported-by: kernel test robot <[email protected]>
> > Signed-off-by: Feng Tang <[email protected]>
> > Acked-by: Dmitry Vyukov <[email protected]>
> > ---
> > include/linux/kasan.h | 2 ++
> > mm/kasan/common.c | 2 ++
> > 2 files changed, 4 insertions(+)
> >
> > diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> > index b092277bf48d..49af9513e8ed 100644
> > --- a/include/linux/kasan.h
> > +++ b/include/linux/kasan.h
> > @@ -100,6 +100,8 @@ static inline bool kasan_has_integrated_init(void)
> > struct kasan_cache {
> > int alloc_meta_offset;
> > int free_meta_offset;
> > + /* size of free_meta data saved in object's data area */
> > + int free_meta_size;
> > bool is_kmalloc;
> > };
> >
> > diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> > index 69f583855c8b..0cb867e92524 100644
> > --- a/mm/kasan/common.c
> > +++ b/mm/kasan/common.c
> > @@ -201,6 +201,8 @@ void __kasan_cache_create(struct kmem_cache *cache, unsigned int *size,
> > cache->kasan_info.free_meta_offset = KASAN_NO_FREE_META;
> > *size = ok_size;
> > }
> > + } else {
> > + cache->kasan_info.free_meta_size = sizeof(struct kasan_free_meta);
>
> Hi Feng,
>
> I just realized that we already have a function that exposes a similar
> functionality: kasan_metadata_size. However, this function returns the
> size of metadata that is stored in the redzone.
>
> I think, instead of adding free_meta_size, a better approach would be to:
>
> 1. Rename kasan_metadata_size to kasan_metadata_size_in_redzone (or
> something like that).
> 2. Add kasan_metadata_size_in_object with appropriate implementation
> and use that in your patches.
>
> This allows avoiding exposing KASAN-internal details such as what kind
> of fields the kasan_cache struct has to the common code.

Agree, it's better not to touch the internal fields in slub code.

How about the following patch? It merges the 2 functions, with one flag
indicating whether the meta data is in the redzone or in the object.
(I'm fine with 2 separate functions.)


diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index b092277bf48d..0ad05a34e708 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -150,11 +150,12 @@ static __always_inline void kasan_cache_create_kmalloc(struct kmem_cache *cache)
__kasan_cache_create_kmalloc(cache);
}

-size_t __kasan_metadata_size(struct kmem_cache *cache);
-static __always_inline size_t kasan_metadata_size(struct kmem_cache *cache)
+size_t __kasan_meta_size(struct kmem_cache *cache, bool in_slab_object);
+static __always_inline size_t kasan_meta_size(struct kmem_cache *cache,
+ bool in_slab_object)
{
if (kasan_enabled())
- return __kasan_metadata_size(cache);
+ return __kasan_meta_size(cache, in_slab_object);
return 0;
}

diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 69f583855c8b..2a8710461ebb 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -218,14 +218,21 @@ void __kasan_cache_create_kmalloc(struct kmem_cache *cache)
cache->kasan_info.is_kmalloc = true;
}

-size_t __kasan_metadata_size(struct kmem_cache *cache)
+size_t __kasan_meta_size(struct kmem_cache *cache, bool in_slab_object)
{
if (!kasan_stack_collection_enabled())
return 0;
- return (cache->kasan_info.alloc_meta_offset ?
- sizeof(struct kasan_alloc_meta) : 0) +
- (cache->kasan_info.free_meta_offset ?
- sizeof(struct kasan_free_meta) : 0);
+
+ if (in_slab_object)
+ return (cache->kasan_info.free_meta_offset == 0 ?
+ sizeof(struct kasan_free_meta) : 0);
+ else
+ return (cache->kasan_info.alloc_meta_offset ?
+ sizeof(struct kasan_alloc_meta) : 0) +
+ (cache->kasan_info.free_meta_offset ?
+ sizeof(struct kasan_free_meta) : 0);
}

struct kasan_alloc_meta *kasan_get_alloc_meta(struct kmem_cache *cache,

> Sorry for nor realizing this straight away.
>
> (Note that there's an upcoming patch that fixes a bug in
> kasan_metadata_size' implementation [1].)

Thanks for the note, will check this when making the formal patch.

- Feng

> Thanks!
>
> [1] https://lore.kernel.org/linux-mm/c7b316d30d90e5947eb8280f4dc78856a49298cf.1662411799.git.andreyknvl@google.com/

2022-09-23 12:28:56

by Vlastimil Babka

Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On 9/13/22 08:54, Feng Tang wrote:
> kmalloc's API family is critical for mm, and one of its characteristics
> is that it rounds up the request size to a fixed one (mostly power of
> 2). Say when a user requests memory for '2^n + 1' bytes, 2^(n+1) bytes
> could actually be allocated, so in the worst case there is around 50%
> memory space waste.
>
> The wastage is not a big issue for requests that get allocated/freed
> quickly, but may cause problems with objects that have a longer
> lifetime.
>
> We've met a kernel boot OOM panic (v5.10), and from the dumped slab
> info:
>
> [ 26.062145] kmalloc-2k 814056KB 814056KB
>
> From debugging we found a huge number of 'struct iova_magazine' objects,
> whose size is 1032 bytes (1024 + 8), so each allocation wastes 1016
> bytes. Though the issue was solved by providing the right (bigger)
> amount of RAM, it is still nice to optimize the size (either use a
> kmalloc-friendly size or create a dedicated slab for it).
>
> And from the lkml archive, there was another crash kernel OOM case [1]
> back in 2019, which seems to be related to a similar slab waste
> situation, as the log is similar:
>
> [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16
> [ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
> ...
> [ 4.857565] kmalloc-2048 59164KB 59164KB
>
> The crash kernel only has 256M of memory, and 59M is pretty big here.
> (Note: the related code has been changed and optimised in recent
> kernels [2]; these logs are just picked to demo the problem, and a
> patch changing its size to 1024 bytes has been merged.)
>
> So add a way to track each kmalloc's memory waste info, and
> leverage the existing SLUB debug framework (specifically
> SLAB_STORE_USER) to show the call stack of the original allocation,
> so that users can evaluate the waste situation, identify hot
> spots and optimize accordingly, for better utilization of memory.
>
> The waste info is integrated into existing interface:
> '/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of
> 'kmalloc-4k' after boot is:
>
> 126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1
> __kmem_cache_alloc_node+0x11f/0x4e0
> __kmalloc_node+0x4e/0x140
> ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe]
> ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe]
> ixgbe_probe+0x165f/0x1d20 [ixgbe]
> local_pci_probe+0x78/0xc0
> work_for_cpu_fn+0x26/0x40
> ...
>
> which means in 'kmalloc-4k' slab, there are 126 requests of
> 2240 bytes which got a 4KB space (wasting 1856 bytes each
> and 233856 bytes in total), from ixgbe_alloc_q_vector().
>
> And when the system starts some real workload like multiple docker
> instances, there could be more severe waste.
>
> [1]. https://lkml.org/lkml/2019/8/12/266
> [2]. https://lore.kernel.org/lkml/[email protected]/
>
> [Thanks Hyeonggon for pointing out several bugs about sorting/format]
> [Thanks Vlastimil for suggesting way to reduce memory usage of
> orig_size and keep it only for kmalloc objects]
>
> Signed-off-by: Feng Tang <[email protected]>
> Reviewed-by: Hyeonggon Yoo <[email protected]>
> Cc: Robin Murphy <[email protected]>
> Cc: John Garry <[email protected]>
> Cc: Kefeng Wang <[email protected]>

Thanks.
Given that the merge window is nearing, and the rest of the series a) has
some changes suggested and b) could hopefully be done in a simpler way with
the proposed ksize() cleanup, I am picking just this patch now into slab.git
(and thus -next), with some small modifications:

...

> +
> +static unsigned int get_orig_size(struct kmem_cache *s, void *object)

Made this inline for consistency.

> +{
> + void *p = kasan_reset_tag(object);
> +
> + if (!slub_debug_orig_size(s))
> + return s->object_size;
> +
> + p += get_info_end(s);
> + p += sizeof(struct track) * 2;
> +
> + return *(unsigned int *)p;
> +}
> +
> static void slab_bug(struct kmem_cache *s, char *fmt, ...)
> {
> struct va_format vaf;
> @@ -844,6 +890,9 @@ static void print_trailer(struct kmem_cache *s, struct slab *slab, u8 *p)
> if (s->flags & SLAB_STORE_USER)
> off += 2 * sizeof(struct track);
>
> + if (slub_debug_orig_size(s))
> + off += sizeof(unsigned int);
> +
> off += kasan_metadata_size(s);
>
> if (off != size_from_object(s))
> @@ -977,7 +1026,8 @@ static int check_bytes_and_report(struct kmem_cache *s, struct slab *slab,
> *
> * A. Free pointer (if we cannot overwrite object on free)
> * B. Tracking data for SLAB_STORE_USER
> - * C. Padding to reach required alignment boundary or at minimum
> + * C. Original request size for kmalloc object (SLAB_STORE_USER enabled)
> + * D. Padding to reach required alignment boundary or at minimum
> * one word if debugging is on to be able to detect writes
> * before the word boundary.
> *
> @@ -995,10 +1045,14 @@ static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
> {
> unsigned long off = get_info_end(s); /* The end of info */
>
> - if (s->flags & SLAB_STORE_USER)
> + if (s->flags & SLAB_STORE_USER) {
> /* We also have user information there */
> off += 2 * sizeof(struct track);
>
> + if (s->flags & SLAB_KMALLOC)
> + off += sizeof(unsigned int);
> + }
> +
> off += kasan_metadata_size(s);
>
> if (size_from_object(s) == off)
> @@ -1293,7 +1347,7 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
> }
>
> static noinline int alloc_debug_processing(struct kmem_cache *s,
> - struct slab *slab, void *object)
> + struct slab *slab, void *object, int orig_size)
> {
> if (s->flags & SLAB_CONSISTENCY_CHECKS) {
> if (!alloc_consistency_checks(s, slab, object))
> @@ -1302,6 +1356,7 @@ static noinline int alloc_debug_processing(struct kmem_cache *s,
>
> /* Success. Perform special debug activities for allocs */
> trace(s, slab, object, 1);
> + set_orig_size(s, object, orig_size);
> init_object(s, object, SLUB_RED_ACTIVE);
> return 1;
>
> @@ -1570,7 +1625,10 @@ static inline
> void setup_slab_debug(struct kmem_cache *s, struct slab *slab, void *addr) {}
>
> static inline int alloc_debug_processing(struct kmem_cache *s,
> - struct slab *slab, void *object) { return 0; }
> + struct slab *slab, void *object, int orig_size) { return 0; }
> +
> +static inline void set_orig_size(struct kmem_cache *s,
> + void *object, unsigned int orig_size) {}

There's no caller (in this patch alone) for the !SLUB_DEBUG version, so removed.

> static inline void free_debug_processing(
> struct kmem_cache *s, struct slab *slab,
> @@ -1999,7 +2057,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
> * it to full list if it was the last free object.
> */
> static void *alloc_single_from_partial(struct kmem_cache *s,
> - struct kmem_cache_node *n, struct slab *slab)
> + struct kmem_cache_node *n, struct slab *slab, int orig_size)
> {
> void *object;
>
> @@ -2009,7 +2067,7 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
> slab->freelist = get_freepointer(s, object);
> slab->inuse++;
>
> - if (!alloc_debug_processing(s, slab, object)) {
> + if (!alloc_debug_processing(s, slab, object, orig_size)) {
> remove_partial(n, slab);
> return NULL;
> }
> @@ -2028,7 +2086,7 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
> * and put the slab to the partial (or full) list.
> */
> static void *alloc_single_from_new_slab(struct kmem_cache *s,
> - struct slab *slab)
> + struct slab *slab, int orig_size)
> {
> int nid = slab_nid(slab);
> struct kmem_cache_node *n = get_node(s, nid);
> @@ -2040,7 +2098,7 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s,
> slab->freelist = get_freepointer(s, object);
> slab->inuse = 1;
>
> - if (!alloc_debug_processing(s, slab, object))
> + if (!alloc_debug_processing(s, slab, object, orig_size))
> /*
> * It's not really expected that this would fail on a
> * freshly allocated slab, but a concurrent memory
> @@ -2118,7 +2176,7 @@ static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
> * Try to allocate a partial slab from a specific node.
> */
> static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
> - struct slab **ret_slab, gfp_t gfpflags)
> + struct partial_context *pc)
> {
> struct slab *slab, *slab2;
> void *object = NULL;
> @@ -2138,11 +2196,12 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
> list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
> void *t;
>
> - if (!pfmemalloc_match(slab, gfpflags))
> + if (!pfmemalloc_match(slab, pc->flags))
> continue;
>
> if (kmem_cache_debug(s)) {
> - object = alloc_single_from_partial(s, n, slab);
> + object = alloc_single_from_partial(s, n, slab,
> + pc->orig_size);
> if (object)
> break;
> continue;
> @@ -2153,7 +2212,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
> break;
>
> if (!object) {
> - *ret_slab = slab;
> + *pc->slab = slab;
> stat(s, ALLOC_FROM_PARTIAL);
> object = t;
> } else {
> @@ -2177,14 +2236,13 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
> /*
> * Get a slab from somewhere. Search in increasing NUMA distances.
> */
> -static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
> - struct slab **ret_slab)
> +static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
> {
> #ifdef CONFIG_NUMA
> struct zonelist *zonelist;
> struct zoneref *z;
> struct zone *zone;
> - enum zone_type highest_zoneidx = gfp_zone(flags);
> + enum zone_type highest_zoneidx = gfp_zone(pc->flags);
> void *object;
> unsigned int cpuset_mems_cookie;
>
> @@ -2212,15 +2270,15 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
>
> do {
> cpuset_mems_cookie = read_mems_allowed_begin();
> - zonelist = node_zonelist(mempolicy_slab_node(), flags);
> + zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
> for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
> struct kmem_cache_node *n;
>
> n = get_node(s, zone_to_nid(zone));
>
> - if (n && cpuset_zone_allowed(zone, flags) &&
> + if (n && cpuset_zone_allowed(zone, pc->flags) &&
> n->nr_partial > s->min_partial) {
> - object = get_partial_node(s, n, ret_slab, flags);
> + object = get_partial_node(s, n, pc);
> if (object) {
> /*
> * Don't check read_mems_allowed_retry()
> @@ -2241,8 +2299,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
> /*
> * Get a partial slab, lock it and return it.
> */
> -static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> - struct slab **ret_slab)
> +static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc)
> {
> void *object;
> int searchnode = node;
> @@ -2250,11 +2307,11 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> if (node == NUMA_NO_NODE)
> searchnode = numa_mem_id();
>
> - object = get_partial_node(s, get_node(s, searchnode), ret_slab, flags);
> + object = get_partial_node(s, get_node(s, searchnode), pc);
> if (object || node != NUMA_NO_NODE)
> return object;
>
> - return get_any_partial(s, flags, ret_slab);
> + return get_any_partial(s, pc);
> }
>
> #ifdef CONFIG_PREEMPTION
> @@ -2974,11 +3031,12 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
> * already disabled (which is the case for bulk allocation).
> */
> static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> - unsigned long addr, struct kmem_cache_cpu *c)
> + unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
> {
> void *freelist;
> struct slab *slab;
> unsigned long flags;
> + struct partial_context pc;
>
> stat(s, ALLOC_SLOWPATH);
>
> @@ -3092,7 +3150,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>
> new_objects:
>
> - freelist = get_partial(s, gfpflags, node, &slab);
> + pc.flags = gfpflags;
> + pc.slab = &slab;
> + pc.orig_size = orig_size;
> + freelist = get_partial(s, node, &pc);
> if (freelist)
> goto check_new_slab;
>
> @@ -3108,7 +3169,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> stat(s, ALLOC_SLAB);
>
> if (kmem_cache_debug(s)) {
> - freelist = alloc_single_from_new_slab(s, slab);
> + freelist = alloc_single_from_new_slab(s, slab, orig_size);
>
> if (unlikely(!freelist))
> goto new_objects;
> @@ -3140,6 +3201,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> */
> if (s->flags & SLAB_STORE_USER)
> set_track(s, freelist, TRACK_ALLOC, addr);
> +
> return freelist;
> }
>
> @@ -3182,7 +3244,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> * pointer.
> */
> static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> - unsigned long addr, struct kmem_cache_cpu *c)
> + unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
> {
> void *p;
>
> @@ -3195,7 +3257,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> c = slub_get_cpu_ptr(s->cpu_slab);
> #endif
>
> - p = ___slab_alloc(s, gfpflags, node, addr, c);
> + p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> #ifdef CONFIG_PREEMPT_COUNT
> slub_put_cpu_ptr(s->cpu_slab);
> #endif
> @@ -3280,7 +3342,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
>
> if (!USE_LOCKLESS_FAST_PATH() ||
> unlikely(!object || !slab || !node_match(slab, node))) {
> - object = __slab_alloc(s, gfpflags, node, addr, c);
> + object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
> } else {
> void *next_object = get_freepointer_safe(s, object);
>
> @@ -3747,7 +3809,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> * of re-populating per CPU c->freelist
> */
> p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
> - _RET_IP_, c);
> + _RET_IP_, c, s->object_size);
> if (unlikely(!p[i]))
> goto error;
>
> @@ -4150,12 +4212,17 @@ static int calculate_sizes(struct kmem_cache *s)
> }
>
> #ifdef CONFIG_SLUB_DEBUG
> - if (flags & SLAB_STORE_USER)
> + if (flags & SLAB_STORE_USER) {
> /*
> * Need to store information about allocs and frees after
> * the object.
> */
> size += 2 * sizeof(struct track);
> +
> + /* Save the original kmalloc request size */
> + if (flags & SLAB_KMALLOC)
> + size += sizeof(unsigned int);
> + }
> #endif
>
> kasan_cache_create(s, &size, &s->flags);
> @@ -4770,7 +4837,7 @@ void __init kmem_cache_init(void)
>
> /* Now we can use the kmem_cache to allocate kmalloc slabs */
> setup_kmalloc_cache_index_table();
> - create_kmalloc_caches(0);
> + create_kmalloc_caches(SLAB_KMALLOC);

Instead of this, add the flag in the common creation function, so SLAB kmalloc caches are also marked even if there's no use for it there now.

--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -649,7 +649,8 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
if (!s)
panic("Out of memory when creating slab %s\n", name);

- create_boot_cache(s, name, size, flags, useroffset, usersize);
+ create_boot_cache(s, name, size, flags | SLAB_KMALLOC, useroffset,
+ usersize);
kasan_cache_create_kmalloc(s);
list_add(&s->list, &slab_caches);
s->refcount = 1;


> /* Setup random freelists for each cache */
> init_freelist_randomization();
> @@ -4937,6 +5004,7 @@ struct location {
> depot_stack_handle_t handle;
> unsigned long count;
> unsigned long addr;
> + unsigned long waste;
> long long sum_time;
> long min_time;
> long max_time;
> @@ -4983,13 +5051,15 @@ static int alloc_loc_track(struct loc_track *t, unsigned long max, gfp_t flags)
> }
>
> static int add_location(struct loc_track *t, struct kmem_cache *s,
> - const struct track *track)
> + const struct track *track,
> + unsigned int orig_size)
> {
> long start, end, pos;
> struct location *l;
> - unsigned long caddr, chandle;
> + unsigned long caddr, chandle, cwaste;
> unsigned long age = jiffies - track->when;
> depot_stack_handle_t handle = 0;
> + unsigned int waste = s->object_size - orig_size;
>
> #ifdef CONFIG_STACKDEPOT
> handle = READ_ONCE(track->handle);
> @@ -5007,11 +5077,13 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
> if (pos == end)
> break;
>
> - caddr = t->loc[pos].addr;
> - chandle = t->loc[pos].handle;
> - if ((track->addr == caddr) && (handle == chandle)) {
> + l = &t->loc[pos];
> + caddr = l->addr;
> + chandle = l->handle;
> + cwaste = l->waste;
> + if ((track->addr == caddr) && (handle == chandle) &&
> + (waste == cwaste)) {
>
> - l = &t->loc[pos];
> l->count++;
> if (track->when) {
> l->sum_time += age;
> @@ -5036,6 +5108,9 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
> end = pos;
> else if (track->addr == caddr && handle < chandle)
> end = pos;
> + else if (track->addr == caddr && handle == chandle &&
> + waste < cwaste)
> + end = pos;
> else
> start = pos;
> }
> @@ -5059,6 +5134,7 @@ static int add_location(struct loc_track *t, struct kmem_cache *s,
> l->min_pid = track->pid;
> l->max_pid = track->pid;
> l->handle = handle;
> + l->waste = waste;
> cpumask_clear(to_cpumask(l->cpus));
> cpumask_set_cpu(track->cpu, to_cpumask(l->cpus));
> nodes_clear(l->nodes);
> @@ -5077,7 +5153,7 @@ static void process_slab(struct loc_track *t, struct kmem_cache *s,
>
> for_each_object(p, s, addr, slab->objects)
> if (!test_bit(__obj_to_index(s, addr, p), obj_map))
> - add_location(t, s, get_track(s, p, alloc));
> + add_location(t, s, get_track(s, p, alloc), get_orig_size(s, p));

I think it makes little sense to report waste in the 'free_traces' file?
So adjusted like this to make sure nothing is reported there:

@@ -5356,13 +5353,16 @@ static void process_slab(struct loc_track *t, struct kmem_cache *s,
unsigned long *obj_map)
{
void *addr = slab_address(slab);
+ bool is_alloc = (alloc == TRACK_ALLOC);
void *p;

__fill_map(obj_map, s, slab);

for_each_object(p, s, addr, slab->objects)
if (!test_bit(__obj_to_index(s, addr, p), obj_map))
- add_location(t, s, get_track(s, p, alloc), get_orig_size(s, p));
+ add_location(t, s, get_track(s, p, alloc),
+ is_alloc ? get_orig_size(s, p) :
+ s->object_size);


> }
> #endif /* CONFIG_DEBUG_FS */
> #endif /* CONFIG_SLUB_DEBUG */
> @@ -5942,6 +6018,10 @@ static int slab_debugfs_show(struct seq_file *seq, void *v)
> else
> seq_puts(seq, "<not-available>");
>
> + if (l->waste)
> + seq_printf(seq, " waste=%lu/%lu",
> + l->count * l->waste, l->waste);
> +
> if (l->sum_time != l->min_time) {
> seq_printf(seq, " age=%ld/%llu/%ld",
> l->min_time, div_u64(l->sum_time, l->count),

2022-09-24 07:25:13

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Fri, Sep 23, 2022 at 07:43:22PM +0800, Vlastimil Babka wrote:
> On 9/13/22 08:54, Feng Tang wrote:
[...]
> > which means in 'kmalloc-4k' slab, there are 126 requests of
> > 2240 bytes which got a 4KB space (wasting 1856 bytes each
> > and 233856 bytes in total), from ixgbe_alloc_q_vector().
> >
> > And when system starts some real workload like multiple docker
> > instances, there could be more severe waste.
> >
> > [1]. https://lkml.org/lkml/2019/8/12/266
> > [2]. https://lore.kernel.org/lkml/[email protected]/
> >
> > [Thanks Hyeonggon for pointing out several bugs about sorting/format]
> > [Thanks Vlastimil for suggesting way to reduce memory usage of
> > orig_size and keep it only for kmalloc objects]
> >
> > Signed-off-by: Feng Tang <[email protected]>
> > Reviewed-by: Hyeonggon Yoo <[email protected]>
> > Cc: Robin Murphy <[email protected]>
> > Cc: John Garry <[email protected]>
> > Cc: Kefeng Wang <[email protected]>
>
> Thanks.
> Given that the merge window is nearing, and the rest of the series a) has
> some changes suggested and b) could be hopefully done in a simpler way with
> the proposed ksize() cleanup, I am picking just this patch now to slab.git
> (and thus -next), with some small modifications:

OK, and all the cleanups/improvements from you look good to me. Many thanks!

For kasan and ksize() related patches, I'll keep monitoring and working on
them.

- Feng

> ...
>
> > +
> > +static unsigned int get_orig_size(struct kmem_cache *s, void *object)
>
> Made this inline for consistency.
>
> > +{
> > + void *p = kasan_reset_tag(object);
> > +
> > + if (!slub_debug_orig_size(s))
> > + return s->object_size;
> > +
> > + p += get_info_end(s);
> > + p += sizeof(struct track) * 2;
> > +
> > + return *(unsigned int *)p;
> > +}
> > +
> > static void slab_bug(struct kmem_cache *s, char *fmt, ...)
> > {
> > struct va_format vaf;
> > @@ -844,6 +890,9 @@ static void print_trailer(struct kmem_cache *s, struct slab *slab, u8 *p)
> > if (s->flags & SLAB_STORE_USER)
> > off += 2 * sizeof(struct track);
> >
> > + if (slub_debug_orig_size(s))
> > + off += sizeof(unsigned int);
> > +
> > off += kasan_metadata_size(s);
> >
> > if (off != size_from_object(s))
> > @@ -977,7 +1026,8 @@ static int check_bytes_and_report(struct kmem_cache *s, struct slab *slab,
> > *
> > * A. Free pointer (if we cannot overwrite object on free)
> > * B. Tracking data for SLAB_STORE_USER
> > - * C. Padding to reach required alignment boundary or at minimum
> > + * C. Original request size for kmalloc object (SLAB_STORE_USER enabled)
> > + * D. Padding to reach required alignment boundary or at minimum
> > * one word if debugging is on to be able to detect writes
> > * before the word boundary.
> > *
> > @@ -995,10 +1045,14 @@ static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
> > {
> > unsigned long off = get_info_end(s); /* The end of info */
> >
> > - if (s->flags & SLAB_STORE_USER)
> > + if (s->flags & SLAB_STORE_USER) {
> > /* We also have user information there */
> > off += 2 * sizeof(struct track);
> >
> > + if (s->flags & SLAB_KMALLOC)
> > + off += sizeof(unsigned int);
> > + }
> > +
> > off += kasan_metadata_size(s);
> >
> > if (size_from_object(s) == off)
> > @@ -1293,7 +1347,7 @@ static inline int alloc_consistency_checks(struct kmem_cache *s,
> > }
> >
> > static noinline int alloc_debug_processing(struct kmem_cache *s,
> > - struct slab *slab, void *object)
> > + struct slab *slab, void *object, int orig_size)
> > {
> > if (s->flags & SLAB_CONSISTENCY_CHECKS) {
> > if (!alloc_consistency_checks(s, slab, object))
[...]

2022-09-24 18:21:02

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [PATCH v6 3/4] mm: kasan: Add free_meta size info in struct kasan_cache

On Wed, Sep 21, 2022 at 2:03 PM Feng Tang <[email protected]> wrote:
>
> Agree, it's better not to touch the internal fields in slub code.
>
> How about the following patch? It merges the 2 functions, with one flag
> indicating whether the size is in the meta data or the object. (I'm fine
> with 2 separate functions)

The overall approach sounds good. See some comments below.

> diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> index b092277bf48d..0ad05a34e708 100644
> --- a/include/linux/kasan.h
> +++ b/include/linux/kasan.h
> @@ -150,11 +150,12 @@ static __always_inline void kasan_cache_create_kmalloc(struct kmem_cache *cache)
> __kasan_cache_create_kmalloc(cache);
> }
>
> -size_t __kasan_metadata_size(struct kmem_cache *cache);
> -static __always_inline size_t kasan_metadata_size(struct kmem_cache *cache)
> +size_t __kasan_meta_size(struct kmem_cache *cache, bool in_slab_object);
> +static __always_inline size_t kasan_meta_size(struct kmem_cache *cache,
> + bool in_slab_object)

I would keep the name as kasan_metadata_size as it's more clear to
external users but rename in_slab_object to in_object to make the
declaration shorter.
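
That is, roughly this shape for the declaration (sketch):

	size_t kasan_metadata_size(struct kmem_cache *cache, bool in_object);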

> {
> if (kasan_enabled())
> - return __kasan_metadata_size(cache);
> + return __kasan_meta_size(cache, in_slab_object);
> return 0;
> }
>
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index 69f583855c8b..2a8710461ebb 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -218,14 +218,21 @@ void __kasan_cache_create_kmalloc(struct kmem_cache *cache)
> cache->kasan_info.is_kmalloc = true;
> }
>
> -size_t __kasan_metadata_size(struct kmem_cache *cache)
> +size_t __kasan_meta_size(struct kmem_cache *cache, bool in_slab_object)
> {
> if (!kasan_stack_collection_enabled())
> return 0;
> - return (cache->kasan_info.alloc_meta_offset ?
> - sizeof(struct kasan_alloc_meta) : 0) +
> - (cache->kasan_info.free_meta_offset ?
> - sizeof(struct kasan_free_meta) : 0);
> +
> + if (in_slab_object)
> + return (cache->kasan_info.alloc_meta_offset == 0 ?
> + sizeof(struct kasan_alloc_meta) : 0) +
> + (cache->kasan_info.free_meta_offset ?
> + sizeof(struct kasan_free_meta) : 0);
> + else
> + return (cache->kasan_info.alloc_meta_offset == 0 ?
> + sizeof(struct kasan_alloc_meta) : 0) +
> + (cache->kasan_info.free_meta_offset ?
> + sizeof(struct kasan_free_meta) : 0);

Something weird here: both if and else cases are the same.

The change also needs to be rebased onto [1].

Thanks!

[1] https://lore.kernel.org/linux-mm/c7b316d30d90e5947eb8280f4dc78856a49298cf.1662411799.git.andreyknvl@google.com/

2022-09-25 11:37:21

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 3/4] mm: kasan: Add free_meta size info in struct kasan_cache

On Sun, Sep 25, 2022 at 02:05:04AM +0800, Andrey Konovalov wrote:
> On Wed, Sep 21, 2022 at 2:03 PM Feng Tang <[email protected]> wrote:
> >
> > Agree, it's better not to touch the internal fields in slub code.
> >
> > How about the following patch? It merges the 2 functions, with one flag
> > indicating whether the size is in the meta data or the object. (I'm fine
> > with 2 separate functions)
>
> The overall approach sounds good. See some comments below.
>
> > diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> > index b092277bf48d..0ad05a34e708 100644
> > --- a/include/linux/kasan.h
> > +++ b/include/linux/kasan.h
> > @@ -150,11 +150,12 @@ static __always_inline void kasan_cache_create_kmalloc(struct kmem_cache *cache)
> > __kasan_cache_create_kmalloc(cache);
> > }
> >
> > -size_t __kasan_metadata_size(struct kmem_cache *cache);
> > -static __always_inline size_t kasan_metadata_size(struct kmem_cache *cache)
> > +size_t __kasan_meta_size(struct kmem_cache *cache, bool in_slab_object);
> > +static __always_inline size_t kasan_meta_size(struct kmem_cache *cache,
> > + bool in_slab_object)
>
> I would keep the name as kasan_metadata_size as it's more clear to
> external users but rename in_slab_object to in_object to make the
> declaration shorter.

Makes sense to me, will do.

[...]

> > + if (in_slab_object)
> > + return (cache->kasan_info.alloc_meta_offset == 0 ?
> > + sizeof(struct kasan_alloc_meta) : 0) +
> > + (cache->kasan_info.free_meta_offset ?
> > + sizeof(struct kasan_free_meta) : 0);
> > + else
> > + return (cache->kasan_info.alloc_meta_offset == 0 ?
> > + sizeof(struct kasan_alloc_meta) : 0) +
> > + (cache->kasan_info.free_meta_offset ?
> > + sizeof(struct kasan_free_meta) : 0);
>
> Something weird here: both if and else cases are the same.

Yes, will fix it.

> The change also needs to be rebased onto [1].
>
> Thanks!
>
> [1] https://lore.kernel.org/linux-mm/c7b316d30d90e5947eb8280f4dc78856a49298cf.1662411799.git.andreyknvl@google.com/

I noticed this has been merged to -mm tree's 'mm-everything' branch,
so the following is the patch against that. Thanks!

One thing I'm not very sure of: when checking the 'in-object' kasan meta
size, I didn't check 'alloc_meta_offset', as from reading the code the
alloc_meta is never put inside the slab object data area.


Thanks,
Feng

---8<---

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index d811b3d7d2a1..96c9d56e5510 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -302,7 +302,7 @@ static inline void kasan_unpoison_task_stack(struct task_struct *task) {}

#ifdef CONFIG_KASAN_GENERIC

-size_t kasan_metadata_size(struct kmem_cache *cache);
+size_t kasan_metadata_size(struct kmem_cache *cache, bool in_object);
slab_flags_t kasan_never_merge(void);
void kasan_cache_create(struct kmem_cache *cache, unsigned int *size,
slab_flags_t *flags);
@@ -315,7 +315,8 @@ void kasan_record_aux_stack_noalloc(void *ptr);
#else /* CONFIG_KASAN_GENERIC */

/* Tag-based KASAN modes do not use per-object metadata. */
-static inline size_t kasan_metadata_size(struct kmem_cache *cache)
+static inline size_t kasan_metadata_size(struct kmem_cache *cache,
+ bool in_object)
{
return 0;
}
diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c
index d8b5590f9484..5a806f9b9466 100644
--- a/mm/kasan/generic.c
+++ b/mm/kasan/generic.c
@@ -450,15 +450,22 @@ void kasan_init_object_meta(struct kmem_cache *cache, const void *object)
__memset(alloc_meta, 0, sizeof(*alloc_meta));
}

-size_t kasan_metadata_size(struct kmem_cache *cache)
+size_t kasan_metadata_size(struct kmem_cache *cache, bool in_object)
{
+ struct kasan_cache *info = &cache->kasan_info;
+
if (!kasan_requires_meta())
return 0;
- return (cache->kasan_info.alloc_meta_offset ?
- sizeof(struct kasan_alloc_meta) : 0) +
- ((cache->kasan_info.free_meta_offset &&
- cache->kasan_info.free_meta_offset != KASAN_NO_FREE_META) ?
- sizeof(struct kasan_free_meta) : 0);
+
+ if (in_object)
+ return (info->free_meta_offset ?
+ 0 : sizeof(struct kasan_free_meta));
+ else
+ return (info->alloc_meta_offset ?
+ sizeof(struct kasan_alloc_meta) : 0) +
+ ((info->free_meta_offset &&
+ info->free_meta_offset != KASAN_NO_FREE_META) ?
+ sizeof(struct kasan_free_meta) : 0);
}

static void __kasan_record_aux_stack(void *addr, bool can_alloc)
diff --git a/mm/slub.c b/mm/slub.c
index ce8310e131b3..a75c21a0da8b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -887,7 +887,7 @@ static void print_trailer(struct kmem_cache *s, struct slab *slab, u8 *p)
if (s->flags & SLAB_STORE_USER)
off += 2 * sizeof(struct track);

- off += kasan_metadata_size(s);
+ off += kasan_metadata_size(s, false);

if (off != size_from_object(s))
/* Beginning of the filler is the free pointer */
@@ -1042,7 +1042,7 @@ static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
/* We also have user information there */
off += 2 * sizeof(struct track);

- off += kasan_metadata_size(s);
+ off += kasan_metadata_size(s, false);

if (size_from_object(s) == off)
return 1;
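
With this in place, the slub side can tell when kasan's in-object free
meta (stored at offset 0 of the object data area when free_meta_offset
is 0) would overlap the kmalloc redzone. A rough sketch of the intended
use on the slub side (illustrative, not part of this patch):

	/* skip redzoning the unused tail if kasan's free meta overlaps it */
	if (kasan_metadata_size(s, true) > orig_size)
		orig_size = s->object_size;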

2022-09-25 16:41:48

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [PATCH v6 3/4] mm: kasan: Add free_meta size info in struct kasan_cache

On Sun, Sep 25, 2022 at 1:27 PM Feng Tang <[email protected]> wrote:
>
> > [1] https://lore.kernel.org/linux-mm/c7b316d30d90e5947eb8280f4dc78856a49298cf.1662411799.git.andreyknvl@google.com/
>
> I noticed this has been merged to -mm tree's 'mm-everything' branch,
> so the following is the patch against that. Thanks!
>
> One thing I'm not very sure of: when checking the 'in-object' kasan meta
> size, I didn't check 'alloc_meta_offset', as from reading the code the
> alloc_meta is never put inside the slab object data area.

Yes, this is correct.

> Thanks,
> Feng
>
[...]

Reviewed-by: Andrey Konovalov <[email protected]>

Thanks!

2022-09-26 19:20:42

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [PATCH v6 2/4] mm/slub: only zero the requested size of buffer for kzalloc

On Tue, Sep 13, 2022 at 8:54 AM Feng Tang <[email protected]> wrote:
>

Hi Feng,

> kzalloc/kmalloc will round up the request size to a fixed size
> (mostly power of 2), so the allocated memory could be more than
> requested. Currently kzalloc family APIs will zero all the
> allocated memory.
>
> To detect out-of-bound usage of the extra allocated memory, only
> zero the requested part, so that sanity check could be added to
> the extra space later.

I still don't like the idea of only zeroing the requested memory and
not the whole object. Considering potential info-leak vulnerabilities.

Can we only do this when SLAB_DEBUG is enabled?

> Performance wise, smaller zeroing length also brings shorter
> execution time, as shown from test data on various server/desktop
> platforms.
>
> For kzalloc users who will call ksize() later and utilize this
> extra space, please be aware that the space is not zeroed any
> more.

CC Kees

>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> mm/slab.c | 7 ++++---
> mm/slab.h | 5 +++--
> mm/slub.c | 10 +++++++---
> 3 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index a5486ff8362a..4594de0e3d6b 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3253,7 +3253,8 @@ slab_alloc_node(struct kmem_cache *cachep, struct list_lru *lru, gfp_t flags,
> init = slab_want_init_on_alloc(flags, cachep);
>
> out:
> - slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init);
> + slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init,
> + cachep->object_size);
> return objp;
> }
>
> @@ -3506,13 +3507,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> * Done outside of the IRQ disabled section.
> */
> slab_post_alloc_hook(s, objcg, flags, size, p,
> - slab_want_init_on_alloc(flags, s));
> + slab_want_init_on_alloc(flags, s), s->object_size);
> /* FIXME: Trace call missing. Christoph would like a bulk variant */
> return size;
> error:
> local_irq_enable();
> cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
> - slab_post_alloc_hook(s, objcg, flags, i, p, false);
> + slab_post_alloc_hook(s, objcg, flags, i, p, false, s->object_size);
> kmem_cache_free_bulk(s, i, p);
> return 0;
> }
> diff --git a/mm/slab.h b/mm/slab.h
> index d0ef9dd44b71..3cf5adf63f48 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -730,7 +730,8 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
>
> static inline void slab_post_alloc_hook(struct kmem_cache *s,
> struct obj_cgroup *objcg, gfp_t flags,
> - size_t size, void **p, bool init)
> + size_t size, void **p, bool init,
> + unsigned int orig_size)
> {
> size_t i;
>
> @@ -746,7 +747,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
> for (i = 0; i < size; i++) {
> p[i] = kasan_slab_alloc(s, p[i], flags, init);
> if (p[i] && init && !kasan_has_integrated_init())
> - memset(p[i], 0, s->object_size);
> + memset(p[i], 0, orig_size);

Note that when KASAN is enabled and has integrated init, it will
initialize the whole object, which leads to an inconsistency with this
change.

> kmemleak_alloc_recursive(p[i], s->object_size, 1,
> s->flags, flags);
> }
> diff --git a/mm/slub.c b/mm/slub.c
> index c8ba16b3a4db..6f823e99d8b4 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3376,7 +3376,11 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct list_l
> init = slab_want_init_on_alloc(gfpflags, s);
>
> out:
> - slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
> + /*
> + * When init equals 'true', like for kzalloc() family, only
> + * @orig_size bytes will be zeroed instead of s->object_size
> + */
> + slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init, orig_size);
>
> return object;
> }
> @@ -3833,11 +3837,11 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> * Done outside of the IRQ disabled fastpath loop.
> */
> slab_post_alloc_hook(s, objcg, flags, size, p,
> - slab_want_init_on_alloc(flags, s));
> + slab_want_init_on_alloc(flags, s), s->object_size);
> return i;
> error:
> slub_put_cpu_ptr(s->cpu_slab);
> - slab_post_alloc_hook(s, objcg, flags, i, p, false);
> + slab_post_alloc_hook(s, objcg, flags, i, p, false, s->object_size);
> kmem_cache_free_bulk(s, i, p);
> return 0;
> }
> --
> 2.34.1
>

2022-09-26 20:33:14

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v6 2/4] mm/slub: only zero the requested size of buffer for kzalloc

On Mon, Sep 26, 2022 at 09:11:24PM +0200, Andrey Konovalov wrote:
> On Tue, Sep 13, 2022 at 8:54 AM Feng Tang <[email protected]> wrote:
> >
>
> Hi Feng,
>
> > kzalloc/kmalloc will round up the request size to a fixed size
> > (mostly power of 2), so the allocated memory could be more than
> > requested. Currently kzalloc family APIs will zero all the
> > allocated memory.
> >
> > To detect out-of-bound usage of the extra allocated memory, only
> > zero the requested part, so that sanity check could be added to
> > the extra space later.
>
> I still don't like the idea of only zeroing the requested memory and
> not the whole object. Considering potential info-leak vulnerabilities.

I really really do not like reducing the zeroing size. We're trying to
be proactive against _flaws_, which means that when there's a memory
over-read (or uninitialized use), suddenly the scope of the exposure (or
control) is wider/looser.

Imagine the (unfortunately very common) case of use-after-free attacks,
which leverage type confusion: some object is located in kmalloc-128
because it's 126 bytes. That slot gets freed and reallocated to, say, a
97 byte object going through kzalloc() or zero-on-init. With this patch
the bytes above the 97 don't get zeroed, and the stale data from the
prior 126 byte object stays there happily to be used again later through
a dangling pointer, or whatever. Without the proposed patch, the entire
128 bytes is wiped, which makes stale data re-use more difficult.
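
To make that concrete, here's a tiny userspace simulation of the
kmalloc-128 scenario (illustrative only, obviously not kernel code):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		unsigned char slot[128];   /* stands in for one kmalloc-128 slot */

		memset(slot, 'S', 126);    /* old 126-byte object, now stale data */
		/* slot is "freed", then reused by a 97-byte kzalloc()-style request */
		memset(slot, 0, 97);       /* zero only the requested 97 bytes */

		/* bytes 97..127 still hold the previous object's contents */
		printf("slot[100] = '%c'\n", slot[100]);   /* prints 'S' */
		return 0;
	}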

> > Performance wise, smaller zeroing length also brings shorter
> > execution time, as shown from test data on various server/desktop
> > platforms.

For these cases, I think a much better solution is to provide those
sensitive allocations their own dedicated kmem_cache.
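
For example, a frequently-allocated 97-byte object could get its own
cache instead of sharing kmalloc-128 (sketch; 'foo_ctx' and the cache
name are made up for illustration):

	#include <linux/slab.h>

	struct foo_ctx { u8 data[97]; };   /* hypothetical sensitive object */

	static struct kmem_cache *foo_cachep;

	static int foo_cache_init(void)
	{
		/* objects now come from a dedicated slab, not kmalloc-128 */
		foo_cachep = kmem_cache_create("foo_ctx", sizeof(struct foo_ctx),
					       0, SLAB_HWCACHE_ALIGN, NULL);
		return foo_cachep ? 0 : -ENOMEM;
	}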

> >
> > For kzalloc users who will call ksize() later and utilize this
> > extra space, please be aware that the space is not zeroed any
> > more.
>
> CC Kees

Thanks! Well, the good news is that the ksize() side-effects are hopefully
going to vanish soon, but my objections about stale memory remain.

-Kees

--
Kees Cook

2022-09-27 01:37:44

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 2/4] mm/slub: only zero the requested size of buffer for kzalloc

On Tue, Sep 27, 2022 at 04:15:02AM +0800, Kees Cook wrote:
> On Mon, Sep 26, 2022 at 09:11:24PM +0200, Andrey Konovalov wrote:
> > On Tue, Sep 13, 2022 at 8:54 AM Feng Tang <[email protected]> wrote:
> > >
> >
> > Hi Feng,
> >
> > > kzalloc/kmalloc will round up the request size to a fixed size
> > > (mostly power of 2), so the allocated memory could be more than
> > > requested. Currently kzalloc family APIs will zero all the
> > > allocated memory.
> > >
> > > To detect out-of-bound usage of the extra allocated memory, only
> > > zero the requested part, so that sanity check could be added to
> > > the extra space later.
> >
> > I still don't like the idea of only zeroing the requested memory and
> > not the whole object. Considering potential info-leak vulnerabilities.
>
> I really really do not like reducing the zeroing size. We're trying to
> be proactive against _flaws_, which means that when there's a memory
> over-read (or uninitialized use), suddenly the scope of the exposure (or
> control) is wider/looser.
>
> Imagine the (unfortunately very common) case of use-after-free attacks,
> which leverage type confusion: some object is located in kmalloc-128
> because it's 126 bytes. That slot gets freed and reallocated to, say, a
> 97 byte object going through kzalloc() or zero-on-init. With this patch
> the bytes above the 97 don't get zeroed, and the stale data from the
> prior 126 byte object stays there happily to be used again later through
> a dangling pointer, or whatever. Without the proposed patch, the entire
> 128 bytes is wiped, which makes stale data re-use more difficult.

Thanks for the detailed explanation; it's a valid concern.

And Andrey's suggestion is a good solution: only reduce the zeroing
size for kmalloc-redzone enabled objects, as the extra space will be
redzoned, and no info will be leaked.

Thanks,
Feng

2022-09-27 02:51:36

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 2/4] mm/slub: only zero the requested size of buffer for kzalloc

On Tue, Sep 27, 2022 at 03:11:24AM +0800, Andrey Konovalov wrote:
> On Tue, Sep 13, 2022 at 8:54 AM Feng Tang <[email protected]> wrote:
> >
>
> Hi Feng,
>
> > kzalloc/kmalloc will round up the request size to a fixed size
> > (mostly power of 2), so the allocated memory could be more than
> > requested. Currently kzalloc family APIs will zero all the
> > allocated memory.
> >
> > To detect out-of-bound usage of the extra allocated memory, only
> > zero the requested part, so that sanity check could be added to
> > the extra space later.
>
> I still don't like the idea of only zeroing the requested memory and
> not the whole object. Considering potential info-leak vulnerabilities.
>
> Can we only do this when SLAB_DEBUG is enabled?

Good point! Will add a slub_debug_orig_size(s) check.
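
Something along these lines in slab_post_alloc_hook() (a sketch of the
direction, assuming the slub_debug_orig_size() helper from patch 1; the
zeroing is only narrowed when the unused tail will be redzoned):

	if (p[i] && init && !kasan_has_integrated_init()) {
		unsigned int zero_size = s->object_size;

		/*
		 * The [orig_size, object_size) tail is covered by the
		 * kmalloc redzone check, so no info can leak from it.
		 */
		if (slub_debug_orig_size(s))
			zero_size = orig_size;

		memset(p[i], 0, zero_size);
	}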

> > Performance wise, smaller zeroing length also brings shorter
> > execution time, as shown from test data on various server/desktop
> > platforms.
> >
> > For kzalloc users who will call ksize() later and utilize this
> > extra space, please be aware that the space is not zeroed any
> > more.
>
> CC Kees

Thanks for adding Kees, who provided review from a security point
of view.

> >
> > Signed-off-by: Feng Tang <[email protected]>
> > ---
> > mm/slab.c | 7 ++++---
> > mm/slab.h | 5 +++--
> > mm/slub.c | 10 +++++++---
[...]
> > @@ -730,7 +730,8 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
> >
> > static inline void slab_post_alloc_hook(struct kmem_cache *s,
> > struct obj_cgroup *objcg, gfp_t flags,
> > - size_t size, void **p, bool init)
> > + size_t size, void **p, bool init,
> > + unsigned int orig_size)
> > {
> > size_t i;
> >
> > @@ -746,7 +747,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
> > for (i = 0; i < size; i++) {
> > p[i] = kasan_slab_alloc(s, p[i], flags, init);
> > if (p[i] && init && !kasan_has_integrated_init())
> > - memset(p[i], 0, s->object_size);
> > + memset(p[i], 0, orig_size);
>
> Note that when KASAN is enabled and has integrated init, it will
> initialize the whole object, which leads to an inconsistency with this
> change.

Do you mean for kzalloc() only, or is there some newly added kasan check?

I'm not familiar with the kasan code; during development, I usually
enabled the KASAN and KFENCE configs and did catch some bugs, while the
0Day bot also reported some. And with the latest v6 patchset, I haven't
seen any kasan/kfence failures.

And for generic slub objects, when slub_debug is enabled, the object
data area could already be modified, as in init_object():

if (s->flags & __OBJECT_POISON) {
memset(p, POISON_FREE, s->object_size - 1);
p[s->object_size - 1] = POISON_END;
}

The slub redzone check actually splits it into 2 regions, [0, orig_size-1]
and [orig_size, object_size-1], and adds different sanity checks to
them.
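
A rough sketch of what the extended check amounts to (illustrative, not
the exact patch 4 code; SLUB_RED_ACTIVE is the regular slub redzone
pattern):

	static int check_kmalloc_redzone(struct kmem_cache *s, void *object)
	{
		unsigned int orig_size = get_orig_size(s, object);
		u8 *p = object;
		unsigned int i;

		/*
		 * [0, orig_size) belongs to the caller, so no pattern is
		 * expected there; [orig_size, object_size) must still hold
		 * the redzone byte.
		 */
		for (i = orig_size; i < s->object_size; i++)
			if (p[i] != SLUB_RED_ACTIVE)
				return 0;	/* out-of-bounds write detected */

		return 1;
	}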

Anyway, I'll go check the latest linux-next tree.

Thanks,
Feng

2022-09-27 03:13:53

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 3/4] mm: kasan: Add free_meta size info in struct kasan_cache

On Mon, Sep 26, 2022 at 12:31:37AM +0800, Andrey Konovalov wrote:
> On Sun, Sep 25, 2022 at 1:27 PM Feng Tang <[email protected]> wrote:
> >
> > > [1] https://lore.kernel.org/linux-mm/c7b316d30d90e5947eb8280f4dc78856a49298cf.1662411799.git.andreyknvl@google.com/
> >
> > I noticed this has been merged to -mm tree's 'mm-everything' branch,
> > so the following is the patch against that. Thanks!
> >
> > One thing I'm not very sure of: when checking the 'in-object' kasan meta
> > size, I didn't check 'alloc_meta_offset', as from reading the code the
> > alloc_meta is never put inside the slab object data area.
>
> Yes, this is correct.
>
> > @@ -1042,7 +1042,7 @@ static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
> > /* We also have user information there */
> > off += 2 * sizeof(struct track);
> >
> > - off += kasan_metadata_size(s);
> > + off += kasan_metadata_size(s, false);
> >
> > if (size_from_object(s) == off)
> > return 1;
>
> Reviewed-by: Andrey Konovalov <[email protected]>

Thank you!

I made a formal patch, which is based on your latest kasan patchset
in -mm tree

Thanks,
Feng

---
From ba4cfd81b86c3339523b467451baa5e87ca1c9f8 Mon Sep 17 00:00:00 2001
From: Feng Tang <[email protected]>
Date: Sun, 25 Sep 2022 15:37:31 +0800
Subject: [PATCH] mm: kasan: Extend kasan_metadata_size() to also cover
in-object size

When kasan is enabled for slab/slub, it may save kasan's free_meta
data in the first part of the slab object data area on the object's
free path, which works fine.

There is an ongoing effort to extend slub's debug function to also
redzone the latter part of the kmalloc object area, and when both
debug features are enabled, there is a possible conflict, especially
when the kmalloc object is small, as caught by the 0Day bot [1].

To solve it, the slub code needs to know the size of kasan's in-object
meta data. There is an existing kasan_metadata_size() which returns
kasan's metadata size inside slub's metadata area, so extend it to
also cover the in-object meta size by adding a boolean flag
'in_object'.

There is no functional change to existing code logic.

[1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
Suggested-by: Andrey Konovalov <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
---
include/linux/kasan.h | 5 +++--
mm/kasan/generic.c | 19 +++++++++++++------
mm/slub.c | 4 ++--
3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index d811b3d7d2a1..96c9d56e5510 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -302,7 +302,7 @@ static inline void kasan_unpoison_task_stack(struct task_struct *task) {}

#ifdef CONFIG_KASAN_GENERIC

-size_t kasan_metadata_size(struct kmem_cache *cache);
+size_t kasan_metadata_size(struct kmem_cache *cache, bool in_object);
slab_flags_t kasan_never_merge(void);
void kasan_cache_create(struct kmem_cache *cache, unsigned int *size,
slab_flags_t *flags);
@@ -315,7 +315,8 @@ void kasan_record_aux_stack_noalloc(void *ptr);
#else /* CONFIG_KASAN_GENERIC */

/* Tag-based KASAN modes do not use per-object metadata. */
-static inline size_t kasan_metadata_size(struct kmem_cache *cache)
+static inline size_t kasan_metadata_size(struct kmem_cache *cache,
+ bool in_object)
{
return 0;
}
diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c
index d8b5590f9484..b076f597a378 100644
--- a/mm/kasan/generic.c
+++ b/mm/kasan/generic.c
@@ -450,15 +450,22 @@ void kasan_init_object_meta(struct kmem_cache *cache, const void *object)
__memset(alloc_meta, 0, sizeof(*alloc_meta));
}

-size_t kasan_metadata_size(struct kmem_cache *cache)
+size_t kasan_metadata_size(struct kmem_cache *cache, bool in_object)
{
+ struct kasan_cache *info = &cache->kasan_info;
+
if (!kasan_requires_meta())
return 0;
- return (cache->kasan_info.alloc_meta_offset ?
- sizeof(struct kasan_alloc_meta) : 0) +
- ((cache->kasan_info.free_meta_offset &&
- cache->kasan_info.free_meta_offset != KASAN_NO_FREE_META) ?
- sizeof(struct kasan_free_meta) : 0);
+
+ if (in_object)
+ return (info->free_meta_offset ?
+ 0 : sizeof(struct kasan_free_meta));
+ else
+ return (info->alloc_meta_offset ?
+ sizeof(struct kasan_alloc_meta) : 0) +
+ ((info->free_meta_offset &&
+ info->free_meta_offset != KASAN_NO_FREE_META) ?
+ sizeof(struct kasan_free_meta) : 0);
}

static void __kasan_record_aux_stack(void *addr, bool can_alloc)
diff --git a/mm/slub.c b/mm/slub.c
index ce8310e131b3..a75c21a0da8b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -887,7 +887,7 @@ static void print_trailer(struct kmem_cache *s, struct slab *slab, u8 *p)
if (s->flags & SLAB_STORE_USER)
off += 2 * sizeof(struct track);

- off += kasan_metadata_size(s);
+ off += kasan_metadata_size(s, false);

if (off != size_from_object(s))
/* Beginning of the filler is the free pointer */
@@ -1042,7 +1042,7 @@ static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
/* We also have user information there */
off += 2 * sizeof(struct track);

- off += kasan_metadata_size(s);
+ off += kasan_metadata_size(s, false);

if (size_from_object(s) == off)
return 1;
--
2.34.1

2022-10-13 14:14:41

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [PATCH v6 2/4] mm/slub: only zero the requested size of buffer for kzalloc

On Tue, Sep 27, 2022 at 4:42 AM Feng Tang <[email protected]> wrote:
>
> > > @@ -746,7 +747,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
> > > for (i = 0; i < size; i++) {
> > > p[i] = kasan_slab_alloc(s, p[i], flags, init);
> > > if (p[i] && init && !kasan_has_integrated_init())
> > > - memset(p[i], 0, s->object_size);
> > > + memset(p[i], 0, orig_size);
> >
> > Note that when KASAN is enabled and has integrated init, it will
> > initialize the whole object, which leads to an inconsistency with this
> > change.
>
> > Do you mean for kzalloc() only, or is there some newly added kasan check?

Hi Feng,

I mean that when init is true and kasan_has_integrated_init() is true
(with HW_TAGS mode), kasan_slab_alloc() initializes the whole object.
Which is inconsistent with the memset() of only orig_size when
!kasan_has_integrated_init(). But I think this is fine assuming SLAB
poisoning happens later. But please add a comment.

Thanks!

2022-10-14 06:16:54

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 2/4] mm/slub: only zero the requested size of buffer for kzalloc

On Thu, Oct 13, 2022 at 10:00:57PM +0800, Andrey Konovalov wrote:
> On Tue, Sep 27, 2022 at 4:42 AM Feng Tang <[email protected]> wrote:
> >
> > > > @@ -746,7 +747,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
> > > > for (i = 0; i < size; i++) {
> > > > p[i] = kasan_slab_alloc(s, p[i], flags, init);
> > > > if (p[i] && init && !kasan_has_integrated_init())
> > > > - memset(p[i], 0, s->object_size);
> > > > + memset(p[i], 0, orig_size);
> > >
> > > Note that when KASAN is enabled and has integrated init, it will
> > > initialize the whole object, which leads to an inconsistency with this
> > > change.
> >
> > > Do you mean for kzalloc() only, or is there some newly added kasan check?
>
> Hi Feng,
>
> I mean that when init is true and kasan_has_integrated_init() is true
> (with HW_TAGS mode), kasan_slab_alloc() initializes the whole object.
> Which is inconsistent with the memset() of only orig_size when
> !kasan_has_integrated_init(). But I think this is fine assuming SLAB
> poisoning happens later. But please add a comment.

I see now. Will add a comment. Thanks!

- Feng

> Thanks!

2022-10-30 19:44:35

by John Thomson

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Tue, 13 Sep 2022, at 06:54, Feng Tang wrote:
> kmalloc's API family is critical for mm, with one nature that it will
> round up the request size to a fixed one (mostly power of 2). Say
> when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
> could be allocated, so in worst case, there is around 50% memory
> space waste.


I have a ralink mt7621 router running Openwrt, using the mips ZBOOT kernel, and appear to have bisected
a very-nearly-clean kernel v6.1-rc2 boot issue to this commit.
I have 3 commits atop 6.1-rc2: fix a ZBOOT compile error, use the Openwrt LZMA options,
and enable DEBUG_ZBOOT for my platform. I am compiling my kernel within the Openwrt build system.
No guarantees this is not due to something I am doing wrong, but any insight would be greatly appreciated.


On UART, No indication of the (once extracted) kernel booting:

transfer started ......................................... transfer ok, time=2.01s
setting up elf image... OK
jumping to kernel code
zimage at: 80BA4100 810D4720
Uncompressing Linux at load address 80001000
Copy device tree to address 80B96EE0
Now, booting the kernel...

Nothing follows

6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc") reverted, normal boot:
transfer started ......................................... transfer ok, time=2.01s
setting up elf image... OK
jumping to kernel code
zimage at: 80BA4100 810D47A4
Uncompressing Linux at load address 80001000
Copy device tree to address 80B96EE0
Now, booting the kernel...

[ 0.000000] Linux version 6.1.0-rc2 (john@john) (mipsel-openwrt-linux-musl-gcc (OpenWrt GCC 11.3.0 r19724+16-1521d5f453) 11.3.0, GNU ld (GNU Binutils) 2.37) #0 SMP Fri Oct 28 03:48:10 2022
[ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
[ 0.000000] printk: bootconsole [early0] enabled
[ 0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
[ 0.000000] MIPS: machine is MikroTik RouterBOARD 760iGS
[ 0.000000] Initrd not found or empty - disabling initrd
[ 0.000000] VPE topology {2,2} total 4
[ 0.000000] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[ 0.000000] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[ 0.000000] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[ 0.000000] Zone ranges:
[ 0.000000] Normal [mem 0x0000000000000000-0x000000000fffffff]
[ 0.000000] HighMem empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000000000-0x000000000fffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000000fffffff]
[ 0.000000] percpu: Embedded 11 pages/cpu s16064 r8192 d20800 u45056
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 64960
[ 0.000000] Kernel command line: console=ttyS0,115200 rootfstype=squashfs,jffs2
[ 0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
[ 0.000000] Writing ErrCtl register=00019146
[ 0.000000] Readback ErrCtl register=00019146
[ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[ 0.000000] Memory: 246220K/262144K available (7455K kernel code, 628K rwdata, 1308K rodata, 3524K init, 245K bss, 15924K reserved, 0K cma-reserved, 0K highmem)
[ 0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[ 0.000000] rcu: Hierarchical RCU implementation.


boot continues as expected


possibly relevant config options:
grep -E '(SLUB|SLAB)' .config
# SLAB allocator options
# CONFIG_SLAB is not set
CONFIG_SLUB=y
CONFIG_SLAB_MERGE_DEFAULT=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
# CONFIG_SLAB_FREELIST_HARDENED is not set
# CONFIG_SLUB_STATS is not set
CONFIG_SLUB_CPU_PARTIAL=y
# end of SLAB allocator options
# CONFIG_SLUB_DEBUG is not set


With this commit reverted: cpuinfo and meminfo

system type : MediaTek MT7621 ver:1 eco:3
machine : MikroTik RouterBOARD 760iGS
processor : 0
cpu model : MIPS 1004Kc V2.15
BogoMIPS : 586.13
wait instruction : yes
microsecond timers : yes
tlb_entries : 32
extra interrupt vector : yes
hardware watchpoint : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
isa : mips1 mips2 mips32r1 mips32r2
ASEs implemented : mips16 dsp mt
Options implemented : tlb 4kex 4k_cache prefetch mcheck ejtag llsc pindexed_dcache userlocal vint perf_cntr_intr_bit cdmm perf
shadow register sets : 1
kscratch registers : 0
package : 0
core : 0
VPE : 0
VCED exceptions : not available
VCEI exceptions : not available

processor : 1
cpu model : MIPS 1004Kc V2.15
BogoMIPS : 586.13
wait instruction : yes
microsecond timers : yes
tlb_entries : 32
extra interrupt vector : yes
hardware watchpoint : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
isa : mips1 mips2 mips32r1 mips32r2
ASEs implemented : mips16 dsp mt
Options implemented : tlb 4kex 4k_cache prefetch mcheck ejtag llsc pindexed_dcache userlocal vint perf_cntr_intr_bit cdmm perf
shadow register sets : 1
kscratch registers : 0
package : 0
core : 0
VPE : 1
VCED exceptions : not available
VCEI exceptions : not available

processor : 2
cpu model : MIPS 1004Kc V2.15
BogoMIPS : 586.13
wait instruction : yes
microsecond timers : yes
tlb_entries : 32
extra interrupt vector : yes
hardware watchpoint : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
isa : mips1 mips2 mips32r1 mips32r2
ASEs implemented : mips16 dsp mt
Options implemented : tlb 4kex 4k_cache prefetch mcheck ejtag llsc pindexed_dcache userlocal vint perf_cntr_intr_bit cdmm perf
shadow register sets : 1
kscratch registers : 0
package : 0
core : 1
VPE : 0
VCED exceptions : not available
VCEI exceptions : not available

processor : 3
cpu model : MIPS 1004Kc V2.15
BogoMIPS : 586.13
wait instruction : yes
microsecond timers : yes
tlb_entries : 32
extra interrupt vector : yes
hardware watchpoint : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
isa : mips1 mips2 mips32r1 mips32r2
ASEs implemented : mips16 dsp mt
Options implemented : tlb 4kex 4k_cache prefetch mcheck ejtag llsc pindexed_dcache userlocal vint perf_cntr_intr_bit cdmm perf
shadow register sets : 1
kscratch registers : 0
package : 0
core : 1
VPE : 1
VCED exceptions : not available
VCEI exceptions : not available

MemTotal: 249744 kB
MemFree: 211088 kB
MemAvailable: 187364 kB
Buffers: 0 kB
Cached: 8824 kB
SwapCached: 0 kB
Active: 1104 kB
Inactive: 8860 kB
Active(anon): 1104 kB
Inactive(anon): 8860 kB
Active(file): 0 kB
Inactive(file): 0 kB
Unevictable: 0 kB
Mlocked: 0 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 249744 kB
LowFree: 211088 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 1192 kB
Mapped: 2092 kB
Shmem: 8824 kB
KReclaimable: 1704 kB
Slab: 9372 kB
SReclaimable: 1704 kB
SUnreclaim: 7668 kB
KernelStack: 592 kB
PageTables: 264 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 124872 kB
Committed_AS: 14676 kB
VmallocTotal: 1040376 kB
VmallocUsed: 2652 kB
VmallocChunk: 0 kB
Percpu: 272 kB


Cheers,

--
John Thomson

2022-10-30 21:53:28

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On 10/30/22 20:23, John Thomson wrote:
> On Tue, 13 Sep 2022, at 06:54, Feng Tang wrote:
>> kmalloc's API family is critical for mm, with one nature that it will
>> round up the request size to a fixed one (mostly power of 2). Say
>> when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
>> could be allocated, so in worst case, there is around 50% memory
>> space waste.
>
>
> I have a ralink mt7621 router running Openwrt, using the mips ZBOOT kernel, and appear to have bisected
> a very-nearly-clean kernel v6.1-rc2 boot issue to this commit.
> I have 3 commits atop 6.1-rc2: fix a ZBOOT compile error, use the Openwrt LZMA options,
> and enable DEBUG_ZBOOT for my platform. I am compiling my kernel within the Openwrt build system.
> No guarantees this is not due to something I am doing wrong, but any insight would be greatly appreciated.
>
>
> On UART, no indication of the (once extracted) kernel booting:
>
> transfer started ......................................... transfer ok, time=2.01s
> setting up elf image... OK
> jumping to kernel code
> zimage at: 80BA4100 810D4720
> Uncompressing Linux at load address 80001000
> Copy device tree to address 80B96EE0
> Now, booting the kernel...

It's weird that the commit would cause no output so early; SLUB code is
run only later.
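
For readers following along, a rough sketch of the boot ordering
behind that remark (simplified from init/main.c):

	/*
	 * start_kernel()
	 *   setup_arch()          <- arch/platform setup runs here
	 *   ...
	 *   mm_init()
	 *     kmem_cache_init()   <- SLUB only usable from here on
	 */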

> Nothing follows
>
> 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc") reverted, normal boot:
> transfer started ......................................... transfer ok, time=2.01s
> setting up elf image... OK
> jumping to kernel code
> zimage at: 80BA4100 810D47A4
> Uncompressing Linux at load address 80001000
> Copy device tree to address 80B96EE0
> Now, booting the kernel...
>
> [ 0.000000] Linux version 6.1.0-rc2 (john@john) (mipsel-openwrt-linux-musl-gcc (OpenWrt GCC 11.3.0 r19724+16-1521d5f453) 11.3.0, GNU ld (GNU Binutils) 2.37) #0 SMP Fri Oct 28 03:48:10 2022
> [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
> [ 0.000000] printk: bootconsole [early0] enabled
> [ 0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
> [ 0.000000] MIPS: machine is MikroTik RouterBOARD 760iGS
> [ 0.000000] Initrd not found or empty - disabling initrd
> [ 0.000000] VPE topology {2,2} total 4
> [ 0.000000] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
> [ 0.000000] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
> [ 0.000000] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
> [ 0.000000] Zone ranges:
> [ 0.000000] Normal [mem 0x0000000000000000-0x000000000fffffff]
> [ 0.000000] HighMem empty
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000000000000-0x000000000fffffff]
> [ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000000fffffff]
> [ 0.000000] percpu: Embedded 11 pages/cpu s16064 r8192 d20800 u45056
> [ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 64960
> [ 0.000000] Kernel command line: console=ttyS0,115200 rootfstype=squashfs,jffs2
> [ 0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
> [ 0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
> [ 0.000000] Writing ErrCtl register=00019146
> [ 0.000000] Readback ErrCtl register=00019146
> [ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
> [ 0.000000] Memory: 246220K/262144K available (7455K kernel code, 628K rwdata, 1308K rodata, 3524K init, 245K bss, 15924K reserved, 0K cma-reserved, 0K highmem)
> [ 0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
> [ 0.000000] rcu: Hierarchical RCU implementation.
>
>
> boot continues as expected
>
>
> possibly relevant config options:
> grep -E '(SLUB|SLAB)' .config
> # SLAB allocator options
> # CONFIG_SLAB is not set
> CONFIG_SLUB=y
> CONFIG_SLAB_MERGE_DEFAULT=y
> # CONFIG_SLAB_FREELIST_RANDOM is not set
> # CONFIG_SLAB_FREELIST_HARDENED is not set
> # CONFIG_SLUB_STATS is not set
> CONFIG_SLUB_CPU_PARTIAL=y
> # end of SLAB allocator options
> # CONFIG_SLUB_DEBUG is not set

Also not having CONFIG_SLUB_DEBUG enabled means most of the code the
patch/commit touches is not even active.
Could this be some miscompile or code layout change exposing some
different bug, hmm.
Is it any different if you do enable CONFIG_SLUB_DEBUG ?
Or change to CONFIG_SLAB? (that would be really weird if not)

>
> With this commit reverted: cpuinfo and meminfo
>
> system type : MediaTek MT7621 ver:1 eco:3
> machine : MikroTik RouterBOARD 760iGS
> processor : 0
> cpu model : MIPS 1004Kc V2.15
> BogoMIPS : 586.13
> wait instruction : yes
> microsecond timers : yes
> tlb_entries : 32
> extra interrupt vector : yes
> hardware watchpoint : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
> isa : mips1 mips2 mips32r1 mips32r2
> ASEs implemented : mips16 dsp mt
> Options implemented : tlb 4kex 4k_cache prefetch mcheck ejtag llsc pindexed_dcache userlocal vint perf_cntr_intr_bit cdmm perf
> shadow register sets : 1
> kscratch registers : 0
> package : 0
> core : 0
> VPE : 0
> VCED exceptions : not available
> VCEI exceptions : not available
>
> processor : 1
> cpu model : MIPS 1004Kc V2.15
> BogoMIPS : 586.13
> wait instruction : yes
> microsecond timers : yes
> tlb_entries : 32
> extra interrupt vector : yes
> hardware watchpoint : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
> isa : mips1 mips2 mips32r1 mips32r2
> ASEs implemented : mips16 dsp mt
> Options implemented : tlb 4kex 4k_cache prefetch mcheck ejtag llsc pindexed_dcache userlocal vint perf_cntr_intr_bit cdmm perf
> shadow register sets : 1
> kscratch registers : 0
> package : 0
> core : 0
> VPE : 1
> VCED exceptions : not available
> VCEI exceptions : not available
>
> processor : 2
> cpu model : MIPS 1004Kc V2.15
> BogoMIPS : 586.13
> wait instruction : yes
> microsecond timers : yes
> tlb_entries : 32
> extra interrupt vector : yes
> hardware watchpoint : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
> isa : mips1 mips2 mips32r1 mips32r2
> ASEs implemented : mips16 dsp mt
> Options implemented : tlb 4kex 4k_cache prefetch mcheck ejtag llsc pindexed_dcache userlocal vint perf_cntr_intr_bit cdmm perf
> shadow register sets : 1
> kscratch registers : 0
> package : 0
> core : 1
> VPE : 0
> VCED exceptions : not available
> VCEI exceptions : not available
>
> processor : 3
> cpu model : MIPS 1004Kc V2.15
> BogoMIPS : 586.13
> wait instruction : yes
> microsecond timers : yes
> tlb_entries : 32
> extra interrupt vector : yes
> hardware watchpoint : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
> isa : mips1 mips2 mips32r1 mips32r2
> ASEs implemented : mips16 dsp mt
> Options implemented : tlb 4kex 4k_cache prefetch mcheck ejtag llsc pindexed_dcache userlocal vint perf_cntr_intr_bit cdmm perf
> shadow register sets : 1
> kscratch registers : 0
> package : 0
> core : 1
> VPE : 1
> VCED exceptions : not available
> VCEI exceptions : not available
>
> MemTotal: 249744 kB
> MemFree: 211088 kB
> MemAvailable: 187364 kB
> Buffers: 0 kB
> Cached: 8824 kB
> SwapCached: 0 kB
> Active: 1104 kB
> Inactive: 8860 kB
> Active(anon): 1104 kB
> Inactive(anon): 8860 kB
> Active(file): 0 kB
> Inactive(file): 0 kB
> Unevictable: 0 kB
> Mlocked: 0 kB
> HighTotal: 0 kB
> HighFree: 0 kB
> LowTotal: 249744 kB
> LowFree: 211088 kB
> SwapTotal: 0 kB
> SwapFree: 0 kB
> Dirty: 0 kB
> Writeback: 0 kB
> AnonPages: 1192 kB
> Mapped: 2092 kB
> Shmem: 8824 kB
> KReclaimable: 1704 kB
> Slab: 9372 kB
> SReclaimable: 1704 kB
> SUnreclaim: 7668 kB
> KernelStack: 592 kB
> PageTables: 264 kB
> SecPageTables: 0 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 124872 kB
> Committed_AS: 14676 kB
> VmallocTotal: 1040376 kB
> VmallocUsed: 2652 kB
> VmallocChunk: 0 kB
> Percpu: 272 kB
>
>
> Cheers,
>


2022-10-31 03:53:26

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

Hi John,

Thanks for the bisecting and reporting!

On Mon, Oct 31, 2022 at 05:30:24AM +0800, Vlastimil Babka wrote:
> On 10/30/22 20:23, John Thomson wrote:
> > On Tue, 13 Sep 2022, at 06:54, Feng Tang wrote:
> >> kmalloc's API family is critical for mm, with one nature that it will
> >> round up the request size to a fixed one (mostly power of 2). Say
> >> when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
> >> could be allocated, so in worst case, there is around 50% memory
> >> space waste.
> >
> >
> > I have a ralink mt7621 router running Openwrt, using the mips ZBOOT kernel, and appear to have bisected
> > a very-nearly-clean kernel v6.1-rc2 boot issue to this commit.
> > I have 3 commits atop 6.1-rc2: fix a ZBOOT compile error, use the Openwrt LZMA options,
> > and enable DEBUG_ZBOOT for my platform. I am compiling my kernel within the Openwrt build system.
> > No guarantees this is not due to something I am doing wrong, but any insight would be greatly appreciated.
> >
> >
> > On UART, no indication of the (once extracted) kernel booting:
> >
> > transfer started ......................................... transfer ok, time=2.01s
> > setting up elf image... OK
> > jumping to kernel code
> > zimage at: 80BA4100 810D4720
> > Uncompressing Linux at load address 80001000
> > Copy device tree to address 80B96EE0
> > Now, booting the kernel...
>
> It's weird that the commit would cause no output so early; SLUB code is
> run only later.

I noticed your cmdline has a console setting; could you enable
earlyprintk in the cmdline, e.g. "earlyprintk=ttyS0,115200", to see
if more messages are printed out.

Also, I want to confirm this is a boot failure and not just a missing
boot message.

> > Nothing follows
> >
> > 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc") reverted, normal boot:
> > transfer started ......................................... transfer ok, time=2.01s
> > setting up elf image... OK
> > jumping to kernel code
> > zimage at: 80BA4100 810D47A4
> > Uncompressing Linux at load address 80001000
> > Copy device tree to address 80B96EE0
> > Now, booting the kernel...
> >
> > [ 0.000000] Linux version 6.1.0-rc2 (john@john) (mipsel-openwrt-linux-musl-gcc (OpenWrt GCC 11.3.0 r19724+16-1521d5f453) 11.3.0, GNU ld (GNU Binutils) 2.37) #0 SMP Fri Oct 28 03:48:10 2022
> > [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
> > [ 0.000000] printk: bootconsole [early0] enabled
> > [ 0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
> > [ 0.000000] MIPS: machine is MikroTik RouterBOARD 760iGS
> > [ 0.000000] Initrd not found or empty - disabling initrd
> > [ 0.000000] VPE topology {2,2} total 4
> > [ 0.000000] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
> > [ 0.000000] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
> > [ 0.000000] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
> > [ 0.000000] Zone ranges:
> > [ 0.000000] Normal [mem 0x0000000000000000-0x000000000fffffff]
> > [ 0.000000] HighMem empty
> > [ 0.000000] Movable zone start for each node
> > [ 0.000000] Early memory node ranges
> > [ 0.000000] node 0: [mem 0x0000000000000000-0x000000000fffffff]
> > [ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000000fffffff]
> > [ 0.000000] percpu: Embedded 11 pages/cpu s16064 r8192 d20800 u45056
> > [ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 64960
> > [ 0.000000] Kernel command line: console=ttyS0,115200 rootfstype=squashfs,jffs2
> > [ 0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
> > [ 0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
> > [ 0.000000] Writing ErrCtl register=00019146
> > [ 0.000000] Readback ErrCtl register=00019146
> > [ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
> > [ 0.000000] Memory: 246220K/262144K available (7455K kernel code, 628K rwdata, 1308K rodata, 3524K init, 245K bss, 15924K reserved, 0K cma-reserved, 0K highmem)
> > [ 0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
> > [ 0.000000] rcu: Hierarchical RCU implementation.
> >
> >
> > boot continues as expected
> >
> >
> > possibly relevant config options:
> > grep -E '(SLUB|SLAB)' .config
> > # SLAB allocator options
> > # CONFIG_SLAB is not set
> > CONFIG_SLUB=y
> > CONFIG_SLAB_MERGE_DEFAULT=y
> > # CONFIG_SLAB_FREELIST_RANDOM is not set
> > # CONFIG_SLAB_FREELIST_HARDENED is not set
> > # CONFIG_SLUB_STATS is not set
> > CONFIG_SLUB_CPU_PARTIAL=y
> > # end of SLAB allocator options
> > # CONFIG_SLUB_DEBUG is not set
>
> Also not having CONFIG_SLUB_DEBUG enabled means most of the code the
> patch/commit touches is not even active.
> Could this be some miscompile or code layout change exposing some
> different bug, hmm.
> Is it any different if you do enable CONFIG_SLUB_DEBUG ?
> Or change to CONFIG_SLAB? (that would be really weird if not)

I haven't found any clue from the code either, and I compiled a
kernel with the config above and tested booting on an Alder Lake
desktop and in QEMU; both boot fine.

Could you provide the full kernel config and dmesg (in compressed
format if you think it's too big), so we can check more?

Thanks,
Feng

2022-10-31 10:51:16

by John Thomson

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Mon, 31 Oct 2022, at 02:36, Feng Tang wrote:
> Hi John,
>
> Thanks for the bisecting and reporting!
>
> On Mon, Oct 31, 2022 at 05:30:24AM +0800, Vlastimil Babka wrote:
>> On 10/30/22 20:23, John Thomson wrote:
>> > On Tue, 13 Sep 2022, at 06:54, Feng Tang wrote:
>> >> kmalloc's API family is critical for mm, with one nature that it will
>> >> round up the request size to a fixed one (mostly power of 2). Say
>> >> when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
>> >> could be allocated, so in worst case, there is around 50% memory
>> >> space waste.
>> >
>> >
>> > I have a ralink mt7621 router running Openwrt, using the mips ZBOOT kernel, and appear to have bisected
>> > a very-nearly-clean kernel v6.1-rc2 boot issue to this commit.
>> > I have 3 commits atop 6.1-rc2: fix a ZBOOT compile error, use the Openwrt LZMA options,
>> > and enable DEBUG_ZBOOT for my platform. I am compiling my kernel within the Openwrt build system.
>> > No guarantees this is not due to something I am doing wrong, but any insight would be greatly appreciated.
>> >
>> >
>> > On UART, no indication of the (once extracted) kernel booting:
>> >
>> > transfer started ......................................... transfer ok, time=2.01s
>> > setting up elf image... OK
>> > jumping to kernel code
>> > zimage at: 80BA4100 810D4720
>> > Uncompressing Linux at load address 80001000
>> > Copy device tree to address 80B96EE0
>> > Now, booting the kernel...
>>
>> It's weird that the commit would cause no output so early; SLUB code is
>> run only later.
>
> I noticed your cmdline has a console setting; could you enable
> earlyprintk in the cmdline, e.g. "earlyprintk=ttyS0,115200", to see
> if more messages are printed out.

Still nothing from vmlinux with earlyprintk on UART unless the commit is reverted.

>
> Also, I want to confirm this is a boot failure and not just a missing
> boot message.

Yes, boot failure.
Network comes up automatically on a successful boot; that is not happening when there is no kernel UART output.

>
>> > Nothing follows
>> >
>> > 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc") reverted, normal boot:
>> > transfer started ......................................... transfer ok, time=2.01s
>> > setting up elf image... OK
>> > jumping to kernel code
>> > zimage at: 80BA4100 810D47A4
>> > Uncompressing Linux at load address 80001000
>> > Copy device tree to address 80B96EE0
>> > Now, booting the kernel...
>> >
>> > [ 0.000000] Linux version 6.1.0-rc2 (john@john) (mipsel-openwrt-linux-musl-gcc (OpenWrt GCC 11.3.0 r19724+16-1521d5f453) 11.3.0, GNU ld (GNU Binutils) 2.37) #0 SMP Fri Oct 28 03:48:10 2022
>> > [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
>> > [ 0.000000] printk: bootconsole [early0] enabled
>> > [ 0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
>> > [ 0.000000] MIPS: machine is MikroTik RouterBOARD 760iGS
>> > [ 0.000000] Initrd not found or empty - disabling initrd
>> > [ 0.000000] VPE topology {2,2} total 4
>> > [ 0.000000] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
>> > [ 0.000000] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
>> > [ 0.000000] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
>> > [ 0.000000] Zone ranges:
>> > [ 0.000000] Normal [mem 0x0000000000000000-0x000000000fffffff]
>> > [ 0.000000] HighMem empty
>> > [ 0.000000] Movable zone start for each node
>> > [ 0.000000] Early memory node ranges
>> > [ 0.000000] node 0: [mem 0x0000000000000000-0x000000000fffffff]
>> > [ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000000fffffff]
>> > [ 0.000000] percpu: Embedded 11 pages/cpu s16064 r8192 d20800 u45056
>> > [ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 64960
>> > [ 0.000000] Kernel command line: console=ttyS0,115200 rootfstype=squashfs,jffs2
>> > [ 0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
>> > [ 0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
>> > [ 0.000000] Writing ErrCtl register=00019146
>> > [ 0.000000] Readback ErrCtl register=00019146
>> > [ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
>> > [ 0.000000] Memory: 246220K/262144K available (7455K kernel code, 628K rwdata, 1308K rodata, 3524K init, 245K bss, 15924K reserved, 0K cma-reserved, 0K highmem)
>> > [ 0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
>> > [ 0.000000] rcu: Hierarchical RCU implementation.
>> >
>> >
>> > boot continues as expected
>> >
>> >
>> > possibly relevant config options:
>> > grep -E '(SLUB|SLAB)' .config
>> > # SLAB allocator options
>> > # CONFIG_SLAB is not set
>> > CONFIG_SLUB=y
>> > CONFIG_SLAB_MERGE_DEFAULT=y
>> > # CONFIG_SLAB_FREELIST_RANDOM is not set
>> > # CONFIG_SLAB_FREELIST_HARDENED is not set
>> > # CONFIG_SLUB_STATS is not set
>> > CONFIG_SLUB_CPU_PARTIAL=y
>> > # end of SLAB allocator options
>> > # CONFIG_SLUB_DEBUG is not set
>>
>> Also not having CONFIG_SLUB_DEBUG enabled means most of the code the
>> patch/commit touches is not even active.
>> Could this be some miscompile or code layout change exposing some
>> different bug, hmm.

Yes, it could be.

>> Is it any different if you do enable CONFIG_SLUB_DEBUG ?

No change

>> Or change to CONFIG_SLAB? (that would be really weird if not)

This boots fine

> I haven't found any clue from the code either, and I compiled a
> kernel with the config above and tested booting on an Alder Lake
> desktop and in QEMU; both boot fine.
>
> Could you provide the full kernel config and dmesg (in compressed
> format if you think it's too big), so we can check more?

Attached

> Thanks,
> Feng

vmlinux is bigger, and entry point is larger (0x8074081c vs 0x807407dc revert vs 0x8073fcbc),
so that may be it? Or not?
revert + SLUB_DEBUG + SLUB_DEBUG_ON is bigger still, but does successfully boot.
vmlinux entry point is 0x8074705c


transfer started ......................................... transfer ok, time=2.01s
setting up elf image... OK
jumping to kernel code
zimage at: 80BA4100 810D6FA0
Uncompressing Linux at load address 80001000
Copy device tree to address 80B9EEE0
Now, booting the kernel...
[ 0.000000] Linux version 6.1.0-rc2 (john@john) (mipsel-openwrt-linux-musl-gc
c (OpenWrt GCC 11.3.0 r19724+16-1521d5f453) 11.3.0, GNU ld (GNU Binutils) 2.37)
#0 SMP Fri Oct 28 03:48:10 2022


I will keep looking.

Thank you,
--
John Thomson


Attachments:
mt7621-mm-waste-debug.tar.zstd (29.06 kB)

2022-10-31 12:31:47

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Mon, Oct 31, 2022 at 10:05:58AM +0000, John Thomson wrote:
> On Mon, 31 Oct 2022, at 02:36, Feng Tang wrote:
> >> >
> >> > possibly relevant config options:
> >> > grep -E '(SLUB|SLAB)' .config
> >> > # SLAB allocator options
> >> > # CONFIG_SLAB is not set
> >> > CONFIG_SLUB=y
> >> > CONFIG_SLAB_MERGE_DEFAULT=y
> >> > # CONFIG_SLAB_FREELIST_RANDOM is not set
> >> > # CONFIG_SLAB_FREELIST_HARDENED is not set
> >> > # CONFIG_SLUB_STATS is not set
> >> > CONFIG_SLUB_CPU_PARTIAL=y
> >> > # end of SLAB allocator options
> >> > # CONFIG_SLUB_DEBUG is not set
> >>
> >> Also not having CONFIG_SLUB_DEBUG enabled means most of the code the
> >> patch/commit touches is not even active.
> >> Could this be some miscompile or code layout change exposing some
> >> different bug, hmm.
>
> Yes, it could be.
>
> >> Is it any different if you do enable CONFIG_SLUB_DEBUG ?
>
> No change
>
> >> Or change to CONFIG_SLAB? (that would be really weird if not)
>
> This boots fine
>
> > I haven't found any clue from the code either, and I compiled a
> > kernel with the config above and tested booting on an Alder Lake
> > desktop and in QEMU; both boot fine.
> >
> > Could you provide the full kernel config and dmesg (in compressed
> > format if you think it's too big), so we can check more?
>
> Attached
>
> > Thanks,
> > Feng
>
> vmlinux is bigger, and entry point is larger (0x8074081c vs 0x807407dc revert vs 0x8073fcbc),
> so that may be it? Or not?
> revert + SLUB_DEBUG + SLUB_DEBUG_ON is bigger still, but does successfully boot.
> vmlinux entry point is 0x8074705c

Thanks for prompt info!

As I can't reproduce it locally yet, could you help try 3 tests separately:
* change the O2/O3 compile option to O1
* try the attached 0001 patch (which cuts part of the commit)
* try the attached 0001+0002 patches

Thanks!


- Feng




Attachments:
(No filename) (1.85 kB)
0001-reduced-slub-patch.patch (10.95 kB)
0002-reorder-the-partial_context-initialization.patch (1.02 kB)

2022-10-31 12:58:12

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Mon, Oct 31, 2022 at 10:05:58AM +0000, John Thomson wrote:
> On Mon, 31 Oct 2022, at 02:36, Feng Tang wrote:
> > Hi John,
> >
> > Thanks for the bisecting and reporting!
> >
> > On Mon, Oct 31, 2022 at 05:30:24AM +0800, Vlastimil Babka wrote:
> >> On 10/30/22 20:23, John Thomson wrote:
> >> > On Tue, 13 Sep 2022, at 06:54, Feng Tang wrote:
> >> >> kmalloc's API family is critical for mm, with one nature that it will
> >> >> round up the request size to a fixed one (mostly power of 2). Say
> >> >> when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
> >> >> could be allocated, so in worst case, there is around 50% memory
> >> >> space waste.
> >> >
> >> >
> >> > I have a ralink mt7621 router running Openwrt, using the mips ZBOOT kernel, and appear to have bisected
> >> > a very-nearly-clean kernel v6.1-rc2 boot issue to this commit.
> >> > I have 3 commits atop 6.1-rc2: fix a ZBOOT compile error, use the Openwrt LZMA options,
> >> > and enable DEBUG_ZBOOT for my platform. I am compiling my kernel within the Openwrt build system.
> >> > No guarantees this is not due to something I am doing wrong, but any insight would be greatly appreciated.
> >> >
> >> >
> >> > On UART, no indication of the (once extracted) kernel booting:
> >> >
> >> > transfer started ......................................... transfer ok, time=2.01s
> >> > setting up elf image... OK
> >> > jumping to kernel code
> >> > zimage at: 80BA4100 810D4720
> >> > Uncompressing Linux at load address 80001000
> >> > Copy device tree to address 80B96EE0
> >> > Now, booting the kernel...
> >>
> >> It's weird that the commit would cause no output so early; SLUB code is
> >> run only later.
> >
> > I noticed your cmdline has a console setting; could you enable
> > earlyprintk in the cmdline, e.g. "earlyprintk=ttyS0,115200", to see
> > if more messages are printed out.
>
> Still nothing from vmlinux with earlyprintk on UART unless the commit is reverted.
>
> >
> > Also, I want to confirm this is a boot failure and not just a missing
> > boot message.
>
> Yes, boot failure.
> Network comes up automatically on a successful boot; that is not happening when there is no kernel UART output.

It is really weird that I see no boot issue on my MIPS emulation with almost the same
config, with a different target - the Malta board that QEMU supports. It just boots fine.

Can you attach a debugger to the board?
(Which I haven't tried; I have only tried it with QEMU.)
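
For anyone reproducing under emulation instead, a sketch of the QEMU
gdbstub route (standard -s/-S usage; the Malta machine matches the
emulation mentioned above, and the gdb binary name is an assumption
about the local toolchain):

	# Boot the kernel frozen, with a gdb server on tcp::1234
	qemu-system-mipsel -M malta -m 256 -nographic \
	        -kernel vmlinux -append "console=ttyS0" -s -S

	# In another terminal: attach, then break in the SLUB slow path
	mipsel-linux-gnu-gdb vmlinux \
	        -ex 'target remote :1234' \
	        -ex 'break ___slab_alloc' \
	        -ex 'continue'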

[...]

> >> >
> >> >
> >> > possibly relevant config options:
> >> > grep -E '(SLUB|SLAB)' .config
> >> > # SLAB allocator options
> >> > # CONFIG_SLAB is not set
> >> > CONFIG_SLUB=y
> >> > CONFIG_SLAB_MERGE_DEFAULT=y
> >> > # CONFIG_SLAB_FREELIST_RANDOM is not set
> >> > # CONFIG_SLAB_FREELIST_HARDENED is not set
> >> > # CONFIG_SLUB_STATS is not set
> >> > CONFIG_SLUB_CPU_PARTIAL=y
> >> > # end of SLAB allocator options
> >> > # CONFIG_SLUB_DEBUG is not set
> >>
> >> Also not having CONFIG_SLUB_DEBUG enabled means most of the code the
> >> patch/commit touches is not even active.
> >> Could this be some miscompile or code layout change exposing some
> >> different bug, hmm.
>
> Yes, it could be.

What happens with clang?

>
> >> Is it any different if you do enable CONFIG_SLUB_DEBUG ?
>
> No change
>
> >> Or change to CONFIG_SLAB? (that would be really weird if not)
>
> This boots fine
>
> > I haven't found any clue from the code either, and I compiled a
> > kernel with the config above and tested booting on an Alder Lake
> > desktop and in QEMU; both boot fine.
> >
> > Could you provide the full kernel config and dmesg (in compressed
> > format if you think it's too big), so we can check more?
>
> Attached
>
> > Thanks,
> > Feng
>
> vmlinux is bigger, and entry point is larger (0x8074081c vs 0x807407dc revert vs 0x8073fcbc),
> so that may be it? Or not?
> revert + SLUB_DEBUG + SLUB_DEBUG_ON is bigger still, but does successfully boot.
> vmlinux entry point is 0x8074705c
>
>
> transfer started ......................................... transfer ok, time=2.01s
> setting up elf image... OK
> jumping to kernel code
> zimage at: 80BA4100 810D6FA0
> Uncompressing Linux at load address 80001000
> Copy device tree to address 80B9EEE0
> Now, booting the kernel...
> [ 0.000000] Linux version 6.1.0-rc2 (john@john) (mipsel-openwrt-linux-musl-gc
> c (OpenWrt GCC 11.3.0 r19724+16-1521d5f453) 11.3.0, GNU ld (GNU Binutils) 2.37)
> #0 SMP Fri Oct 28 03:48:10 2022
>
>
> I will keep looking.
>
> Thank you,
> --
> John Thomson



--
Thanks,
Hyeonggon

2022-11-01 00:38:06

by John Thomson

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Mon, 31 Oct 2022, at 11:42, Feng Tang wrote:
> On Mon, Oct 31, 2022 at 10:05:58AM +0000, John Thomson wrote:
>> On Mon, 31 Oct 2022, at 02:36, Feng Tang wrote:
>> >> >
>> >> > possibly relevant config options:
>> >> > grep -E '(SLUB|SLAB)' .config
>> >> > # SLAB allocator options
>> >> > # CONFIG_SLAB is not set
>> >> > CONFIG_SLUB=y
>> >> > CONFIG_SLAB_MERGE_DEFAULT=y
>> >> > # CONFIG_SLAB_FREELIST_RANDOM is not set
>> >> > # CONFIG_SLAB_FREELIST_HARDENED is not set
>> >> > # CONFIG_SLUB_STATS is not set
>> >> > CONFIG_SLUB_CPU_PARTIAL=y
>> >> > # end of SLAB allocator options
>> >> > # CONFIG_SLUB_DEBUG is not set
>> >>
>> >> Also not having CONFIG_SLUB_DEBUG enabled means most of the code the
>> >> patch/commit touches is not even active.
>> >> Could this be some miscompile or code layout change exposing some
>> >> different bug, hmm.
>>
>> Yes, it could be.
>>
>> >> Is it any different if you do enable CONFIG_SLUB_DEBUG ?
>>
>> No change
>>
>> >> Or change to CONFIG_SLAB? (that would be really weird if not)
>>
>> This boots fine
>>
>> > I haven't found any clue from the code either, and I compiled a
>> > kernel with the config above and tested booting on an Alder Lake
>> > desktop and in QEMU; both boot fine.
>> >
>> > Could you provide the full kernel config and dmesg (in compressed
>> > format if you think it's too big), so we can check more?
>>
>> Attached
>>
>> > Thanks,
>> > Feng
>>
>> vmlinux is bigger, and entry point is larger (0x8074081c vs 0x807407dc revert vs 0x8073fcbc),
>> so that may be it? Or not?
>> revert + SLUB_DEBUG + SLUB_DEBUG_ON is bigger still, but does successfully boot.
>> vmlinux entry point is 0x8074705c
>
> Thanks for prompt info!
>
> As I can't reproduce it locally yet, could you help try 3 tests separately:
> * change the O2/O3 compile option to O1
> * try the attached 0001 patch (which cuts part of the commit)
> * try the attached 0001+0002 patches

None of these changed my outcome.

I also tried compiling the same linux tree & config with the Bootlin toolchain
(mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0)
with the same results.
I will look into finding or building a mips clang toolchain.

No JTAG capability to debug, sorry.

I get the same outcome with either the ZBOOT vmlinuz or vmlinux.

Same happening with 6.1-rc3


After some blind poking around, changing how much of the commit affected mm/slub.c,
I may have got lucky. It appears as though this is all I need to boot
(against 6.1-rc3, and with the Bootlin toolchain). Will test my other build system as well.

--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3276,7 +3276,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c = slub_get_cpu_ptr(s->cpu_slab);
#endif

- p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
+ p = ___slab_alloc(s, gfpflags, node, addr, c, 0);
#ifdef CONFIG_PREEMPT_COUNT
slub_put_cpu_ptr(s->cpu_slab);
#endif


Would like to hear your thoughts, but I will keep digging.

>
> Thanks!
>
>
> - Feng
>
>
>
>
> Attachments:
> * 0001-reduced-slub-patch.patch
> * 0002-reorder-the-partial_context-initialization.patch

--
John Thomson

2022-11-01 03:11:08

by John Thomson

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc


On Tue, 1 Nov 2022, at 00:18, John Thomson wrote:
> I may have got lucky. It appears as though this is all I need to boot
> (against 6.1-rc3, and with the Bootlin toolchain). Will test my other
> build system as well.
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3276,7 +3276,7 @@ static void *__slab_alloc(struct kmem_cache *s,
> gfp_t gfpflags, int node,
> c = slub_get_cpu_ptr(s->cpu_slab);
> #endif
>
> - p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> + p = ___slab_alloc(s, gfpflags, node, addr, c, 0);
> #ifdef CONFIG_PREEMPT_COUNT
> slub_put_cpu_ptr(s->cpu_slab);
> #endif

Tested that with and without SLUB_DEBUG


Testing without SLUB_DEBUG below:
With this change on 6.1-rc3:
diff --git a/mm/slub.c b/mm/slub.c
index 157527d7101b..5fdb7609bb9e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3410,6 +3410,8 @@ static __always_inline
void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
gfp_t gfpflags)
{
+
+ pr_warn("SLUB: __slab_alloc from slab_alloc s->object_size=%d\n", s->object_size);
void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);

trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);


UART & boot stops here:
transfer started ......................................... transfer ok, time=2.00s
setting up elf image... OK
jumping to kernel code
zimage at: 80B842A0 810B4BE4

Uncompressing Linux at load address 80001000

Copy device tree to address 80B80EE0

Now, booting the kernel...

[ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #34 SMP Tue Nov 1 12:33:10 AEST 2022
[ 0.000000] Overriding previously set SMP ops
[ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
[ 0.000000] printk: bootconsole [early0] enabled
[ 0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
[ 0.000000] MIPS: machine is MikroTik RouterBOARD 760iGS
[ 0.000000] Initrd not found or empty - disabling initrd
[ 0.000000] VPE topology {2,2} total 4
[ 0.000000] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[ 0.000000] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[ 0.000000] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[ 0.000000] Zone ranges:
[ 0.000000] Normal [mem 0x0000000000000000-0x000000000fffffff]
[ 0.000000] HighMem empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000000000-0x000000000fffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000000fffffff]
[ 0.000000] percpu: Embedded 11 pages/cpu s16064 r8192 d20800 u45056
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 64960
[ 0.000000] Kernel command line: earlyprintk=ttyS0,115200 console=ttyS0,115200 rootfstype=squashfs,jffs2
[ 0.000000] Unknown kernel command line parameters "earlyprintk=ttyS0,115200", will be passed to user space.
[ 0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
[ 0.000000] Writing ErrCtl register=00011146
[ 0.000000] Readback ErrCtl register=00011146
[ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[ 0.000000] Memory: 246284K/262144K available (7417K kernel code, 630K rwdata, 1304K rodata, 3500K init, 245K bss, 15860K reserved, 0K cma-reserved, 0K highmem)
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=32
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=132
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=300
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] rcu: Hierarchical RCU implementation.
[ 0.000000] Tracing variant of Tasks RCU enabled.
[ 0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
[ 0.000000] NR_IRQS: 256
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=512
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=512
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=256
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=256
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=256
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=128
[ 0.000000] clocksource: GIC: mask: 0xffffffffffffffff max_cycles: 0xcaf478abb4, max_idle_ns: 440795247997 ns
[ 0.000000] SLUB: __slab_alloc from slab_alloc s->object_size=256
[ 0.000004] sched_clock: 64 bits at 880MHz, resolution 1ns, wraps every 4398046511103ns


With this change, the kernel boots fine:

diff --git a/mm/slub.c b/mm/slub.c
index 157527d7101b..e9677c04d19c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3410,7 +3410,11 @@ static __always_inline
void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
gfp_t gfpflags)
{
- void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
+
+ /*
+ pr_warn("SLUB: __slab_alloc from slab_alloc s->object_size=%d\n", s->object_size);
+ void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);*/
+ void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, 0);

trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);



Cheers,

--
John Thomson

2022-11-01 08:23:05

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

Hi Thomson,

Thanks for testing!

+ MIPS maintainer and mailing list. The original report is here:

https://lore.kernel.org/lkml/[email protected]/

On Tue, Nov 01, 2022 at 12:18:19AM +0000, John Thomson wrote:
> > As I can't reproduce it locally yet, could you help try 3 tests separately:
> > * change the O2/O3 compile option to O1
> > * try the attached 0001 patch (which cuts part of the commit)
> > * try the attached 0001+0002 patches
>
> None of these changed my outcome.

The 0001 patch chops off most of the functional changes, and it still
fails. I'm wondering if it is related to function argument passing;
I noticed this is a 32-bit MIPS platform (from your config).

> I also tried compiling the same linux tree & config with the Bootlin toolchain
> (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0)
> with the same results.
> I will look into finding or building a mips clang toolchain.

Yes, Hyeonggon's suggestion makes sense; a different compiler may
show some difference.

> No JTAG capability to debug, sorry.
>
> I get the same outcome with either the ZBOOT vmlinuz or vmlinux.
>
> Same happening with 6.1-rc3
>
>
> After some blind poking around, changing how much of the commit affected mm/slub.c,
> I may have got lucky. It appears as though this is all I need to boot
> (against 6.1-rc3, and with the Bootlin toolchain). Will test my other build system as well.
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3276,7 +3276,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> c = slub_get_cpu_ptr(s->cpu_slab);
> #endif
>
> - p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> + p = ___slab_alloc(s, gfpflags, node, addr, c, 0);

___slab_alloc()'s argument count has been changed from 5 to 6, and
some other functions' from 4 to 5.
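
For reference, a sketch of the prototype change being described
(reconstructed from the hunks quoted in this thread; parameter types
are inferred):

	/* before the commit */
	static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags,
				   int node, unsigned long addr,
				   struct kmem_cache_cpu *c);

	/* after the commit: a sixth 'orig_size' parameter */
	static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags,
				   int node, unsigned long addr,
				   struct kmem_cache_cpu *c,
				   unsigned int orig_size);

This matters on 32-bit MIPS because the o32 ABI passes only the first
four arguments in registers (a0-a3); the fifth and sixth go on the
caller's stack, so the extra parameter changes the stack frame at
every call site.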

Thanks,
Feng


> #ifdef CONFIG_PREEMPT_COUNT
> slub_put_cpu_ptr(s->cpu_slab);
> #endif


2022-11-01 09:50:51

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Tue, Nov 01, 2022 at 09:20:21AM +0000, John Thomson wrote:
> On Tue, 1 Nov 2022, at 07:57, Feng Tang wrote:
> > Hi Thomson,
> >
> > Thanks for testing!
> >
> > + mips maintainer and mail list. The original report is here
> >
> > https://lore.kernel.org/lkml/[email protected]/
>
> I am guessing my issue comes from __kmem_cache_alloc_lru accessing s->object_size when (kmem_cache) s is NULL?
> If that is the case, this change is not to blame, it only exposes the issue?
>
> I get the following dmesg (note very early NULL kmem_cache) with the below change atop v6.1-rc3:
>
> transfer started ......................................... transfer ok, time=2.02s
> setting up elf image... OK
> jumping to kernel code
> zimage at: 80B842A0 810B4EFC
>
> Uncompressing Linux at load address 80001000
>
> Copy device tree to address 80B80EE0
>
> Now, booting the kernel...
>
> [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #61 SMP Tue Nov 1 18:04:13 AEST 2022
> [ 0.000000] slub: kmem_cache_alloc called with kmem_cache: 0x0
> [ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache: 0x0
> [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
> [ 0.000000] printk: bootconsole [early0] enabled
> [ 0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
> [ 0.000000] MIPS: machine is MikroTik RouterBOARD 760iGS
>
> normal boot
>
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 157527d7101b..10fcdf2520d2 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3410,7 +3410,13 @@ static __always_inline
> void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> gfp_t gfpflags)
> {
> - void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> + void *ret;
> + if (IS_ERR_OR_NULL(s)) {
> + pr_warn("slub: __kmem_cache_alloc_lru called with kmem_cache: %pSR\n", s);
> + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, 0);
> + } else {
> + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> + }
>
> trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
>
> @@ -3419,6 +3425,8 @@ void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
>
> void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> {
> + if (IS_ERR_OR_NULL(s))
> + pr_warn("slub: kmem_cache_alloc called with kmem_cache: %pSR\n", s);
> return __kmem_cache_alloc_lru(s, NULL, gfpflags);
> }
> EXPORT_SYMBOL(kmem_cache_alloc);
> @@ -3426,6 +3434,8 @@ EXPORT_SYMBOL(kmem_cache_alloc);
> void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> gfp_t gfpflags)
> {
> + if (IS_ERR_OR_NULL(s))
> + pr_warn("slub: __kmem_cache_alloc_lru called with kmem_cache: %pSR\n", s);
> return __kmem_cache_alloc_lru(s, lru, gfpflags);
> }
> EXPORT_SYMBOL(kmem_cache_alloc_lru);
>
>
> Any hints on where kmem_cache_alloc would be called from this early?
> I will start looking from /init/main.c around pr_notice("%s", linux_banner);

Great. Would you try calling dump_stack() when s == NULL is observed?
That would give more information about who passed s == NULL to these
functions.

--
Thanks,
Hyeonggon

2022-11-01 10:17:25

by John Thomson

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Tue, 1 Nov 2022, at 07:57, Feng Tang wrote:
> Hi Thomson,
>
> Thanks for testing!
>
> + mips maintainer and mail list. The original report is here
>
> https://lore.kernel.org/lkml/[email protected]/



I am guessing my issue comes from __kmem_cache_alloc_lru accessing s->object_size when (kmem_cache) s is NULL?
If that is the case, this change is not to blame, it only exposes the issue?
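
(To make that guess concrete - a sketch echoing the unpatched line
from the diff below, with an added comment; the exact fault behaviour
is an assumption:)

	/*
	 * With s == NULL, the s->object_size load below dereferences
	 * a NULL pointer before any check can run; before the commit,
	 * a NULL s was merely passed along at this point. Plausibly
	 * the fault happens before any console is up, matching the
	 * silent hang.
	 */
	void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);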

I get the following dmesg (note very early NULL kmem_cache) with the below change atop v6.1-rc3:

transfer started ......................................... transfer ok, time=2.02s
setting up elf image... OK
jumping to kernel code
zimage at: 80B842A0 810B4EFC

Uncompressing Linux at load address 80001000

Copy device tree to address 80B80EE0

Now, booting the kernel...

[ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #61 SMP Tue Nov 1 18:04:13 AEST 2022
[ 0.000000] slub: kmem_cache_alloc called with kmem_cache: 0x0
[ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache: 0x0
[ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
[ 0.000000] printk: bootconsole [early0] enabled
[ 0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
[ 0.000000] MIPS: machine is MikroTik RouterBOARD 760iGS

normal boot


diff --git a/mm/slub.c b/mm/slub.c
index 157527d7101b..10fcdf2520d2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3410,7 +3410,13 @@ static __always_inline
void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
gfp_t gfpflags)
{
- void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
+ void *ret;
+ if (IS_ERR_OR_NULL(s)) {
+ pr_warn("slub: __kmem_cache_alloc_lru called with kmem_cache: %pSR\n", s);
+ ret = slab_alloc(s, lru, gfpflags, _RET_IP_, 0);
+ } else {
+ ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
+ }

trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);

@@ -3419,6 +3425,8 @@ void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
+ if (IS_ERR_OR_NULL(s))
+ pr_warn("slub: kmem_cache_alloc called with kmem_cache: %pSR\n", s);
return __kmem_cache_alloc_lru(s, NULL, gfpflags);
}
EXPORT_SYMBOL(kmem_cache_alloc);
@@ -3426,6 +3434,8 @@ EXPORT_SYMBOL(kmem_cache_alloc);
void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
gfp_t gfpflags)
{
+ if (IS_ERR_OR_NULL(s))
+ pr_warn("slub: __kmem_cache_alloc_lru called with kmem_cache: %pSR\n", s);
return __kmem_cache_alloc_lru(s, lru, gfpflags);
}
EXPORT_SYMBOL(kmem_cache_alloc_lru);


Any hints on where kmem_cache_alloc would be called from this early?
I will start looking from /init/main.c around pr_notice("%s", linux_banner);

Thank you for your help.

Let me know if you want me to stop replying to this mm/slub debug memory wasting thread,
and take this to a new email?

Cheers,
--
John Thomson

2022-11-01 10:57:25

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Tue, Nov 01, 2022 at 10:33:32AM +0000, John Thomson wrote:
> On Tue, 1 Nov 2022, at 09:31, Hyeonggon Yoo wrote:
> > On Tue, Nov 01, 2022 at 09:20:21AM +0000, John Thomson wrote:
> >> On Tue, 1 Nov 2022, at 07:57, Feng Tang wrote:
> >> > Hi Thomson,
> >> >
> >> > Thanks for testing!
> >> >
> >> > + mips maintainer and mail list. The original report is here
> >> >
> >> > https://lore.kernel.org/lkml/[email protected]/
> >>
> >> I am guessing my issue comes from __kmem_cache_alloc_lru accessing s->object_size when (kmem_cache) s is NULL?
> >> If that is the case, this change is not to blame, it only exposes the issue?
> >>
> >> I get the following dmesg (note very early NULL kmem_cache) with the below change atop v6.1-rc3:
> >>
> >> transfer started ......................................... transfer ok, time=2.02s
> >> setting up elf image... OK
> >> jumping to kernel code
> >> zimage at: 80B842A0 810B4EFC
> >>
> >> Uncompressing Linux at load address 80001000
> >>
> >> Copy device tree to address 80B80EE0
> >>
> >> Now, booting the kernel...
> >>
> >> [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #61 SMP Tue Nov 1 18:04:13 AEST 2022
> >> [ 0.000000] slub: kmem_cache_alloc called with kmem_cache: 0x0
> >> [ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache: 0x0
> >> [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
> >> [ 0.000000] printk: bootconsole [early0] enabled
> >> [ 0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
> >> [ 0.000000] MIPS: machine is MikroTik RouterBOARD 760iGS
> >>
> >> normal boot
> >>
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 157527d7101b..10fcdf2520d2 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -3410,7 +3410,13 @@ static __always_inline
> >> void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> >> gfp_t gfpflags)
> >> {
> >> - void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> >> + void *ret;
> >> + if (IS_ERR_OR_NULL(s)) {
> >> + pr_warn("slub: __kmem_cache_alloc_lru called with kmem_cache: %pSR\n", s);
> >> + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, 0);
> >> + } else {
> >> + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> >> + }
> >>
> >> trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
> >>
> >> @@ -3419,6 +3425,8 @@ void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> >>
> >> void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> >> {
> >> + if (IS_ERR_OR_NULL(s))
> >> + pr_warn("slub: kmem_cache_alloc called with kmem_cache: %pSR\n", s);
> >> return __kmem_cache_alloc_lru(s, NULL, gfpflags);
> >> }
> >> EXPORT_SYMBOL(kmem_cache_alloc);
> >> @@ -3426,6 +3434,8 @@ EXPORT_SYMBOL(kmem_cache_alloc);
> >> void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> >> gfp_t gfpflags)
> >> {
> >> + if (IS_ERR_OR_NULL(s))
> >> + pr_warn("slub: __kmem_cache_alloc_lru called with kmem_cache: %pSR\n", s);
> >> return __kmem_cache_alloc_lru(s, lru, gfpflags);
> >> }
> >> EXPORT_SYMBOL(kmem_cache_alloc_lru);
> >>
> >>
> >> Any hints on where kmem_cache_alloc would be being called from this early?
> >> I will start looking from /init/main.c around pr_notice("%s", linux_banner);
> >
> > Great. Would you try calling dump_stack(); when we observed s == NULL?
> > That would give more information about who passed s == NULL to these
> > functions.
> >
>
> With the dump_stack() in place:
>
> Now, booting the kernel...
>
> [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #62 SMP Tue Nov 1 19:49:52 AEST 2022
> [ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache ptr: 0x0
> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #62
> [ 0.000000] Stack : 810fff78 80084d98 80889d00 00000004 00000000 00000000 80889d5c 80c90000
> [ 0.000000] 80920000 807bd380 8089d368 80923bd3 00000000 00000001 80889d08 00000000
> [ 0.000000] 00000000 00000000 807bd380 8084bd51 00000002 00000002 00000001 6d6f4320
> [ 0.000000] 00000000 80c97ce9 80c97d14 fffffffc 807bd380 00000000 00000003 00000dc0
> [ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
> [ 0.000000] ...
> [ 0.000000] Call Trace:
> [ 0.000000] [<80008260>] show_stack+0x28/0xf0
> [ 0.000000] [<8070cdc0>] dump_stack_lvl+0x60/0x80
> [ 0.000000] [<801c1428>] kmem_cache_alloc+0x5c0/0x740
> [ 0.000000] [<8092856c>] prom_soc_init+0x1fc/0x2b4
> [ 0.000000] [<80928060>] prom_init+0x44/0xf0
> [ 0.000000] [<80929214>] setup_arch+0x4c/0x6a8
> [ 0.000000] [<809257e0>] start_kernel+0x88/0x7c0
> [ 0.000000]
> [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3

setup_arch() is too early to use slab allocators.
I think slab received a NULL pointer because the kmalloc caches are not initialized yet.

It seems arch/mips/ralink/mt7621.c is using slab too early.
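
A rough sketch of the pattern, judging from the stack above (the exact code in arch/mips/ralink/mt7621.c may differ; the field setup is illustrative):

	struct soc_device_attribute *soc_dev_attr;

	/*
	 * Runs from prom_init() -> setup_arch(), i.e. before mm_init():
	 * the kmalloc caches are still NULL here, so kzalloc() ends up
	 * passing a NULL kmem_cache into kmem_cache_alloc().
	 */
	soc_dev_attr = kzalloc(sizeof(*soc_dev_attr), GFP_KERNEL);
	if (!soc_dev_attr)
		return;

	soc_dev_attr->soc_id = "mt7621";
	soc_device_register(soc_dev_attr);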

>
>
> Now, booting the kernel...
>
> [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #62 SMP Tue Nov 1 19:49:52 AEST 2022
> [ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache ptr: 0x0
> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #62
> [ 0.000000] Stack : 810fff78 80084d98 80889d00 00000004 00000000 00000000 80889d5c 80c90000
> [ 0.000000] 80920000 807bd380 8089d368 80923bd3 00000000 00000001 80889d08 00000000
> [ 0.000000] 00000000 00000000 807bd380 8084bd51 00000002 00000002 00000001 6d6f4320
> [ 0.000000] 00000000 80c97ce9 80c97d14 fffffffc 807bd380 00000000 00000003 00000dc0
> [ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
> [ 0.000000] ...
> [ 0.000000] Call Trace:
> [ 0.000000] show_stack (/mnt/pool_ssd/code/linux/linux-stable-mt7621/./arch/mips/include/asm/stacktrace.h:43 /mnt/pool_ssd/code/linux/linux-stable-mt7621/arch/mips/kernel/traps.c:223)
> [ 0.000000] dump_stack_lvl (/mnt/pool_ssd/code/linux/linux-stable-mt7621/lib/dump_stack.c:107 (discriminator 1))
> [ 0.000000] kmem_cache_alloc (/mnt/pool_ssd/code/linux/linux-stable-mt7621/mm/slub.c:3318 /mnt/pool_ssd/code/linux/linux-stable-mt7621/mm/slub.c:3406 /mnt/pool_ssd/code/linux/linux-stable-mt7621/mm/slub.c:3418 /mnt/pool_ssd/code/linux/linux-stable-mt7621/mm/slub.c:3430)
> [ 0.000000] prom_soc_init (/mnt/pool_ssd/code/linux/linux-stable-mt7621/arch/mips/ralink/mt7621.c:106 /mnt/pool_ssd/code/linux/linux-stable-mt7621/arch/mips/ralink/mt7621.c:177)
> [ 0.000000] prom_init (/mnt/pool_ssd/code/linux/linux-stable-mt7621/arch/mips/ralink/prom.c:64)
> [ 0.000000] setup_arch (/mnt/pool_ssd/code/linux/linux-stable-mt7621/arch/mips/kernel/setup.c:786)
> [ 0.000000] start_kernel (/mnt/pool_ssd/code/linux/linux-stable-mt7621/init/main.c:279 /mnt/pool_ssd/code/linux/linux-stable-mt7621/init/main.c:477 /mnt/pool_ssd/code/linux/linux-stable-mt7621/init/main.c:960)
> [ 0.000000]
> [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
>
>
> I have not found it yet.
>
>
> Cheers,
> --
> John Thomson

--
Thanks,
Hyeonggon

2022-11-01 11:16:58

by John Thomson

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Tue, 1 Nov 2022, at 09:31, Hyeonggon Yoo wrote:
> On Tue, Nov 01, 2022 at 09:20:21AM +0000, John Thomson wrote:
>> On Tue, 1 Nov 2022, at 07:57, Feng Tang wrote:
>> > Hi Thomson,
>> >
>> > Thanks for testing!
>> >
>> > + mips maintainer and mail list. The original report is here
>> >
>> > https://lore.kernel.org/lkml/[email protected]/
>>
>> I am guessing my issue comes from __kmem_cache_alloc_lru accessing s->object_size when (kmem_cache) s is NULL?
>> If that is the case, this change is not to blame, it only exposes the issue?
>>
>> I get the following dmesg (note very early NULL kmem_cache) with the below change atop v6.1-rc3:
>>
>> transfer started ......................................... transfer ok, time=2.02s
>> setting up elf image... OK
>> jumping to kernel code
>> zimage at: 80B842A0 810B4EFC
>>
>> Uncompressing Linux at load address 80001000
>>
>> Copy device tree to address 80B80EE0
>>
>> Now, booting the kernel...
>>
>> [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #61 SMP Tue Nov 1 18:04:13 AEST 2022
>> [ 0.000000] slub: kmem_cache_alloc called with kmem_cache: 0x0
>> [ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache: 0x0
>> [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
>> [ 0.000000] printk: bootconsole [early0] enabled
>> [ 0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
>> [ 0.000000] MIPS: machine is MikroTik RouterBOARD 760iGS
>>
>> normal boot
>>
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 157527d7101b..10fcdf2520d2 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3410,7 +3410,13 @@ static __always_inline
>> void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
>> gfp_t gfpflags)
>> {
>> - void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
>> + void *ret;
>> + if (IS_ERR_OR_NULL(s)) {
>> + pr_warn("slub: __kmem_cache_alloc_lru called with kmem_cache: %pSR\n", s);
>> + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, 0);
>> + } else {
>> + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
>> + }
>>
>> trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
>>
>> @@ -3419,6 +3425,8 @@ void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
>>
>> void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
>> {
>> + if (IS_ERR_OR_NULL(s))
>> + pr_warn("slub: kmem_cache_alloc called with kmem_cache: %pSR\n", s);
>> return __kmem_cache_alloc_lru(s, NULL, gfpflags);
>> }
>> EXPORT_SYMBOL(kmem_cache_alloc);
>> @@ -3426,6 +3434,8 @@ EXPORT_SYMBOL(kmem_cache_alloc);
>> void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
>> gfp_t gfpflags)
>> {
>> + if (IS_ERR_OR_NULL(s))
>> + pr_warn("slub: __kmem_cache_alloc_lru called with kmem_cache: %pSR\n", s);
>> return __kmem_cache_alloc_lru(s, lru, gfpflags);
>> }
>> EXPORT_SYMBOL(kmem_cache_alloc_lru);
>>
>>
>> Any hints on where kmem_cache_alloc would be being called from this early?
>> I will start looking from /init/main.c around pr_notice("%s", linux_banner);
>
> Great. Would you try calling dump_stack(); when we observed s == NULL?
> That would give more information about who passed s == NULL to these
> functions.
>

With the dump_stack() in place:

Now, booting the kernel...

[ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #62 SMP Tue Nov 1 19:49:52 AEST 2022
[ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache ptr: 0x0
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #62
[ 0.000000] Stack : 810fff78 80084d98 80889d00 00000004 00000000 00000000 80889d5c 80c90000
[ 0.000000] 80920000 807bd380 8089d368 80923bd3 00000000 00000001 80889d08 00000000
[ 0.000000] 00000000 00000000 807bd380 8084bd51 00000002 00000002 00000001 6d6f4320
[ 0.000000] 00000000 80c97ce9 80c97d14 fffffffc 807bd380 00000000 00000003 00000dc0
[ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
[ 0.000000] ...
[ 0.000000] Call Trace:
[ 0.000000] [<80008260>] show_stack+0x28/0xf0
[ 0.000000] [<8070cdc0>] dump_stack_lvl+0x60/0x80
[ 0.000000] [<801c1428>] kmem_cache_alloc+0x5c0/0x740
[ 0.000000] [<8092856c>] prom_soc_init+0x1fc/0x2b4
[ 0.000000] [<80928060>] prom_init+0x44/0xf0
[ 0.000000] [<80929214>] setup_arch+0x4c/0x6a8
[ 0.000000] [<809257e0>] start_kernel+0x88/0x7c0
[ 0.000000]
[ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3



Now, booting the kernel...

[ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #62 SMP Tue Nov 1 19:49:52 AEST 2022
[ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache ptr: 0x0
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #62
[ 0.000000] Stack : 810fff78 80084d98 80889d00 00000004 00000000 00000000 80889d5c 80c90000
[ 0.000000] 80920000 807bd380 8089d368 80923bd3 00000000 00000001 80889d08 00000000
[ 0.000000] 00000000 00000000 807bd380 8084bd51 00000002 00000002 00000001 6d6f4320
[ 0.000000] 00000000 80c97ce9 80c97d14 fffffffc 807bd380 00000000 00000003 00000dc0
[ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
[ 0.000000] ...
[ 0.000000] Call Trace:
[ 0.000000] show_stack (/mnt/pool_ssd/code/linux/linux-stable-mt7621/./arch/mips/include/asm/stacktrace.h:43 /mnt/pool_ssd/code/linux/linux-stable-mt7621/arch/mips/kernel/traps.c:223)
[ 0.000000] dump_stack_lvl (/mnt/pool_ssd/code/linux/linux-stable-mt7621/lib/dump_stack.c:107 (discriminator 1))
[ 0.000000] kmem_cache_alloc (/mnt/pool_ssd/code/linux/linux-stable-mt7621/mm/slub.c:3318 /mnt/pool_ssd/code/linux/linux-stable-mt7621/mm/slub.c:3406 /mnt/pool_ssd/code/linux/linux-stable-mt7621/mm/slub.c:3418 /mnt/pool_ssd/code/linux/linux-stable-mt7621/mm/slub.c:3430)
[ 0.000000] prom_soc_init (/mnt/pool_ssd/code/linux/linux-stable-mt7621/arch/mips/ralink/mt7621.c:106 /mnt/pool_ssd/code/linux/linux-stable-mt7621/arch/mips/ralink/mt7621.c:177)
[ 0.000000] prom_init (/mnt/pool_ssd/code/linux/linux-stable-mt7621/arch/mips/ralink/prom.c:64)
[ 0.000000] setup_arch (/mnt/pool_ssd/code/linux/linux-stable-mt7621/arch/mips/kernel/setup.c:786)
[ 0.000000] start_kernel (/mnt/pool_ssd/code/linux/linux-stable-mt7621/init/main.c:279 /mnt/pool_ssd/code/linux/linux-stable-mt7621/init/main.c:477 /mnt/pool_ssd/code/linux/linux-stable-mt7621/init/main.c:960)
[ 0.000000]
[ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3


I have not found it yet.


Cheers,
--
John Thomson

2022-11-01 14:08:36

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Tue, Nov 01, 2022 at 06:42:23PM +0800, Hyeonggon Yoo wrote:
> On Tue, Nov 01, 2022 at 10:33:32AM +0000, John Thomson wrote:
> > On Tue, 1 Nov 2022, at 09:31, Hyeonggon Yoo wrote:
> > > On Tue, Nov 01, 2022 at 09:20:21AM +0000, John Thomson wrote:
> > >> On Tue, 1 Nov 2022, at 07:57, Feng Tang wrote:
> > >> > Hi Thomson,
> > >> >
> > >> > Thanks for testing!
> > >> >
> > >> > + mips maintainer and mail list. The original report is here
> > >> >
> > >> > https://lore.kernel.org/lkml/[email protected]/
> > >>
> > >> I am guessing my issue comes from __kmem_cache_alloc_lru accessing s->object_size when (kmem_cache) s is NULL?
> > >> If that is the case, this change is not to blame, it only exposes the issue?
> > >>
> > >> I get the following dmesg (note very early NULL kmem_cache) with the below change atop v6.1-rc3:
> > >>
> > >> transfer started ......................................... transfer ok, time=2.02s
> > >> setting up elf image... OK
> > >> jumping to kernel code
> > >> zimage at: 80B842A0 810B4EFC
> > >>
> > >> Uncompressing Linux at load address 80001000
> > >>
> > >> Copy device tree to address 80B80EE0
> > >>
> > >> Now, booting the kernel...
> > >>
> > >> [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #61 SMP Tue Nov 1 18:04:13 AEST 2022
> > >> [ 0.000000] slub: kmem_cache_alloc called with kmem_cache: 0x0
> > >> [ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache: 0x0
> > >> [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
> > >> [ 0.000000] printk: bootconsole [early0] enabled
> > >> [ 0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
> > >> [ 0.000000] MIPS: machine is MikroTik RouterBOARD 760iGS
> > >>
> > >> normal boot
> > >>
> > >>
> > >> diff --git a/mm/slub.c b/mm/slub.c
> > >> index 157527d7101b..10fcdf2520d2 100644
> > >> --- a/mm/slub.c
> > >> +++ b/mm/slub.c
> > >> @@ -3410,7 +3410,13 @@ static __always_inline
> > >> void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> > >> gfp_t gfpflags)
> > >> {
> > >> - void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> > >> + void *ret;
> > >> + if (IS_ERR_OR_NULL(s)) {
> > >> + pr_warn("slub: __kmem_cache_alloc_lru called with kmem_cache: %pSR\n", s);
> > >> + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, 0);
> > >> + } else {
> > >> + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> > >> + }
> > >>
> > >> trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
> > >>
> > >> @@ -3419,6 +3425,8 @@ void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> > >>
> > >> void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> > >> {
> > >> + if (IS_ERR_OR_NULL(s))
> > >> + pr_warn("slub: kmem_cache_alloc called with kmem_cache: %pSR\n", s);
> > >> return __kmem_cache_alloc_lru(s, NULL, gfpflags);
> > >> }
> > >> EXPORT_SYMBOL(kmem_cache_alloc);
> > >> @@ -3426,6 +3434,8 @@ EXPORT_SYMBOL(kmem_cache_alloc);
> > >> void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> > >> gfp_t gfpflags)
> > >> {
> > >> + if (IS_ERR_OR_NULL(s))
> > >> + pr_warn("slub: __kmem_cache_alloc_lru called with kmem_cache: %pSR\n", s);
> > >> return __kmem_cache_alloc_lru(s, lru, gfpflags);
> > >> }
> > >> EXPORT_SYMBOL(kmem_cache_alloc_lru);
> > >>
> > >>
> > >> Any hints on where kmem_cache_alloc would be being called from this early?
> > >> I will start looking from /init/main.c around pr_notice("%s", linux_banner);
> > >
> > > Great. Would you try calling dump_stack(); when we observed s == NULL?
> > > That would give more information about who passed s == NULL to these
> > > functions.
> > >
> >
> > With the dump_stack() in place:
> >
> > Now, booting the kernel...
> >
> > [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #62 SMP Tue Nov 1 19:49:52 AEST 2022
> > [ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache ptr: 0x0
> > [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #62
> > [ 0.000000] Stack : 810fff78 80084d98 80889d00 00000004 00000000 00000000 80889d5c 80c90000
> > [ 0.000000] 80920000 807bd380 8089d368 80923bd3 00000000 00000001 80889d08 00000000
> > [ 0.000000] 00000000 00000000 807bd380 8084bd51 00000002 00000002 00000001 6d6f4320
> > [ 0.000000] 00000000 80c97ce9 80c97d14 fffffffc 807bd380 00000000 00000003 00000dc0
> > [ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
> > [ 0.000000] ...
> > [ 0.000000] Call Trace:
> > [ 0.000000] [<80008260>] show_stack+0x28/0xf0
> > [ 0.000000] [<8070cdc0>] dump_stack_lvl+0x60/0x80
> > [ 0.000000] [<801c1428>] kmem_cache_alloc+0x5c0/0x740
> > [ 0.000000] [<8092856c>] prom_soc_init+0x1fc/0x2b4
> > [ 0.000000] [<80928060>] prom_init+0x44/0xf0
> > [ 0.000000] [<80929214>] setup_arch+0x4c/0x6a8
> > [ 0.000000] [<809257e0>] start_kernel+0x88/0x7c0
> > [ 0.000000]
> > [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
>
> setup_arch() is too early to use slab allocators.
> I think slab received NULL pointer because kmalloc is not initialized.
>
> It seems arch/mips/ralink/mt7621.c is using slab too early.

Cool! It is finally root-caused :) Thanks!

The following patch should solve it and emit a warning message, though
I'm not sure if there are other holes.

Thanks,
Feng

---
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 33b1886b06eb..429c21b7ecbc 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1043,7 +1043,14 @@ size_t __ksize(const void *object)
#ifdef CONFIG_TRACING
void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
{
- void *ret = __kmem_cache_alloc_node(s, gfpflags, NUMA_NO_NODE,
+ void *ret;
+
+ if (unlikely(ZERO_OR_NULL_PTR(s))) {
+ WARN_ON_ONCE(1);
+ return s;
+ }
+
+ ret = __kmem_cache_alloc_node(s, gfpflags, NUMA_NO_NODE,
size, _RET_IP_);

trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags, NUMA_NO_NODE);
diff --git a/mm/slub.c b/mm/slub.c
index 157527d7101b..85d24bb6eda7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3410,8 +3410,14 @@ static __always_inline
void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
gfp_t gfpflags)
{
- void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
+ void *ret;

+ if (unlikely(ZERO_OR_NULL_PTR(s))) {
+ WARN_ON_ONCE(1);
+ return s;
+ }
+
+ ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);

return ret;
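
(For reference, ZERO_OR_NULL_PTR() from include/linux/slab.h covers both NULL and the zero-size sentinel:

	#define ZERO_SIZE_PTR ((void *)16)

	#define ZERO_OR_NULL_PTR(x) ((unsigned long)(x) <= \
					(unsigned long)ZERO_SIZE_PTR)

so the check above also bails out if a ZERO_SIZE_PTR is ever passed in as a cache pointer.)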

2022-11-01 19:45:44

by John Thomson

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc



On Tue, 1 Nov 2022, at 13:55, Feng Tang wrote:
> On Tue, Nov 01, 2022 at 06:42:23PM +0800, Hyeonggon Yoo wrote:
>> setup_arch() is too early to use slab allocators.
>> I think slab received NULL pointer because kmalloc is not initialized.
>>
>> It seems arch/mips/ralink/mt7621.c is using slab too early.
>
> Cool! it is finally root caused :) Thanks!
>
> The following patch should solve it and give it a warning message, though
> I'm not sure if there is other holes.
>
> Thanks,
> Feng
>
> ---
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 33b1886b06eb..429c21b7ecbc 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1043,7 +1043,14 @@ size_t __ksize(const void *object)
> #ifdef CONFIG_TRACING
> void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
> {
> - void *ret = __kmem_cache_alloc_node(s, gfpflags, NUMA_NO_NODE,
> + void *ret;
> +
> + if (unlikely(ZERO_OR_NULL_PTR(s))) {
> + WARN_ON_ONCE(1);
> + return s;
> + }
> +
> + ret = __kmem_cache_alloc_node(s, gfpflags, NUMA_NO_NODE,
> size, _RET_IP_);
>
> trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags, NUMA_NO_NODE);
> diff --git a/mm/slub.c b/mm/slub.c
> index 157527d7101b..85d24bb6eda7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3410,8 +3410,14 @@ static __always_inline
> void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> gfp_t gfpflags)
> {
> - void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> + void *ret;
>
> + if (unlikely(ZERO_OR_NULL_PTR(s))) {
> + WARN_ON_ONCE(1);
> + return s;
> + }
> +
> + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
>
> return ret;

Yes, thank you, that patch atop v6.1-rc3 lets me boot, and shows the warning and stack dump.
Will you submit that, or how do we want to proceed?

transfer started ......................................... transfer ok, time=2.11s
setting up elf image... OK
jumping to kernel code
zimage at: 80B842A0 810B4BC0

Uncompressing Linux at load address 80001000

Copy device tree to address 80B80EE0

Now, booting the kernel...

[ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #73 SMP Wed Nov 2 05:10:01 AEST 2022
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] WARNING: CPU: 0 PID: 0 at mm/slub.c:3416 kmem_cache_alloc+0x5a4/0x5e8
[ 0.000000] Modules linked in:
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #73
[ 0.000000] Stack : 810fff78 80084d98 00000000 00000004 00000000 00000000 80889d04 80c90000
[ 0.000000] 80920000 807bd328 8089d368 80923bd3 00000000 00000001 80889cb0 00000000
[ 0.000000] 00000000 00000000 807bd328 8084bcb1 00000002 00000002 00000001 6d6f4320
[ 0.000000] 00000000 80c97d3d 80c97d68 fffffffc 807bd328 00000000 00000000 00000000
[ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
[ 0.000000] ...
[ 0.000000] Call Trace:
[ 0.000000] [<80008260>] show_stack+0x28/0xf0
[ 0.000000] [<8070c958>] dump_stack_lvl+0x60/0x80
[ 0.000000] [<8002e184>] __warn+0xc4/0xf8
[ 0.000000] [<8002e210>] warn_slowpath_fmt+0x58/0xa4
[ 0.000000] [<801c0fac>] kmem_cache_alloc+0x5a4/0x5e8
[ 0.000000] [<8092856c>] prom_soc_init+0x1fc/0x2b4
[ 0.000000] [<80928060>] prom_init+0x44/0xf0
[ 0.000000] [<80929214>] setup_arch+0x4c/0x6a8
[ 0.000000] [<809257e0>] start_kernel+0x88/0x7c0
[ 0.000000]
[ 0.000000] ---[ end trace 0000000000000000 ]---
[ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
[ 0.000000] printk: bootconsole [early0] enabled

Thank you for working through this with me.
I will try to address the root cause in mt7621.c.
It looks like other arch/** soc_device_register users use postcore_initcall, device_initcall,
or the ARM DT_MACHINE_START .init_machine. A quick hack to use postcore_initcall in mt7621
avoided this NULL kmem_cache pointer being passed to kmem_cache_alloc_lru.
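
The quick hack looked roughly like this (an untested sketch; the function name is mine, and the real attribute setup in mt7621.c is more involved):

	/* defer SoC device registration until initcalls run, when slab is up */
	static int __init mt7621_soc_dev_init(void)
	{
		struct soc_device_attribute *soc_dev_attr;

		soc_dev_attr = kzalloc(sizeof(*soc_dev_attr), GFP_KERNEL);
		if (!soc_dev_attr)
			return -ENOMEM;

		soc_dev_attr->soc_id = "mt7621";
		soc_device_register(soc_dev_attr);
		return 0;
	}
	postcore_initcall(mt7621_soc_dev_init);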


Thanks,

--
John Thomson

2022-11-02 06:40:52

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Tue, Nov 01, 2022 at 07:39:13PM +0000, John Thomson wrote:
>
>
> On Tue, 1 Nov 2022, at 13:55, Feng Tang wrote:
> > On Tue, Nov 01, 2022 at 06:42:23PM +0800, Hyeonggon Yoo wrote:
> >> setup_arch() is too early to use slab allocators.
> >> I think slab received NULL pointer because kmalloc is not initialized.
> >>
> >> It seems arch/mips/ralink/mt7621.c is using slab too early.
> >
> > Cool! it is finally root caused :) Thanks!
> >
> > The following patch should solve it and give it a warning message, though
> > I'm not sure if there is other holes.
> >
> > Thanks,
> > Feng
> >
> > ---
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 33b1886b06eb..429c21b7ecbc 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -1043,7 +1043,14 @@ size_t __ksize(const void *object)
> > #ifdef CONFIG_TRACING
> > void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
> > {
> > - void *ret = __kmem_cache_alloc_node(s, gfpflags, NUMA_NO_NODE,
> > + void *ret;
> > +
> > + if (unlikely(ZERO_OR_NULL_PTR(s))) {
> > + WARN_ON_ONCE(1);
> > + return s;
> > + }
> > +
> > + ret = __kmem_cache_alloc_node(s, gfpflags, NUMA_NO_NODE,
> > size, _RET_IP_);
> >
> > trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags, NUMA_NO_NODE);
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 157527d7101b..85d24bb6eda7 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -3410,8 +3410,14 @@ static __always_inline
> > void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> > gfp_t gfpflags)
> > {
> > - void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> > + void *ret;
> >
> > + if (unlikely(ZERO_OR_NULL_PTR(s))) {
> > + WARN_ON_ONCE(1);
> > + return s;
> > + }
> > +
> > + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> > trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
> >
> > return ret;
>
> Yes, thank you, that patch atop v6.1-rc3 lets me boot, and shows the warning and stack dump.
> Will you submit that, or how do we want to proceed?

Thanks for confirming. I wanted to wait for Vlastimil's, Hyeonggon's and
other developers' opinions. And yes, I can also post a more formal one.

> transfer started ......................................... transfer ok, time=2.11s
> setting up elf image... OK
> jumping to kernel code
> zimage at: 80B842A0 810B4BC0
>
> Uncompressing Linux at load address 80001000
>
> Copy device tree to address 80B80EE0
>
> Now, booting the kernel...
>
> [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #73 SMP Wed Nov 2 05:10:01 AEST 2022
> [ 0.000000] ------------[ cut here ]------------
> [ 0.000000] WARNING: CPU: 0 PID: 0 at mm/slub.c:3416 kmem_cache_alloc+0x5a4/0x5e8
> [ 0.000000] Modules linked in:
> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #73
> [ 0.000000] Stack : 810fff78 80084d98 00000000 00000004 00000000 00000000 80889d04 80c90000
> [ 0.000000] 80920000 807bd328 8089d368 80923bd3 00000000 00000001 80889cb0 00000000
> [ 0.000000] 00000000 00000000 807bd328 8084bcb1 00000002 00000002 00000001 6d6f4320
> [ 0.000000] 00000000 80c97d3d 80c97d68 fffffffc 807bd328 00000000 00000000 00000000
> [ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
> [ 0.000000] ...
> [ 0.000000] Call Trace:
> [ 0.000000] [<80008260>] show_stack+0x28/0xf0
> [ 0.000000] [<8070c958>] dump_stack_lvl+0x60/0x80
> [ 0.000000] [<8002e184>] __warn+0xc4/0xf8
> [ 0.000000] [<8002e210>] warn_slowpath_fmt+0x58/0xa4
> [ 0.000000] [<801c0fac>] kmem_cache_alloc+0x5a4/0x5e8
> [ 0.000000] [<8092856c>] prom_soc_init+0x1fc/0x2b4
> [ 0.000000] [<80928060>] prom_init+0x44/0xf0
> [ 0.000000] [<80929214>] setup_arch+0x4c/0x6a8
> [ 0.000000] [<809257e0>] start_kernel+0x88/0x7c0
> [ 0.000000]
> [ 0.000000] ---[ end trace 0000000000000000 ]---
> [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
> [ 0.000000] printk: bootconsole [early0] enabled
>
> Thank you for working through this with me.
> I will try to address the root cause in mt7621.c.
> It looks like other arch/** soc_device_register users use postcore_initcall, device_initcall,
> or the ARM DT_MACHINE_START .init_machine. A quick hack to use postcore_initcall in mt7621
> avoided this zero ptr kmem_cache passed to kmem_cache_alloc_lru.

IIUC, prom_soc_init() is only called once in the kernel, so could
'soc_dev_attr' just be defined as a global data structure instead
of being allocated with kzalloc()? Its size is small, containing only 7 pointers.
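
A rough illustration (the variable name is mine and the field setup is abbreviated):

	/* static storage: nothing to allocate this early */
	static struct soc_device_attribute mt7621_soc_dev_attr;

	/* in prom_soc_init(): */
	mt7621_soc_dev_attr.soc_id = "mt7621";
	soc_device_register(&mt7621_soc_dev_attr);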

Thanks,
Feng

2022-11-02 07:47:39

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Wed, Nov 02, 2022 at 02:08:09PM +0800, Feng Tang wrote:
> On Tue, Nov 01, 2022 at 07:39:13PM +0000, John Thomson wrote:
> >
> >
> > On Tue, 1 Nov 2022, at 13:55, Feng Tang wrote:
> > > On Tue, Nov 01, 2022 at 06:42:23PM +0800, Hyeonggon Yoo wrote:
> > >> setup_arch() is too early to use slab allocators.
> > >> I think slab received NULL pointer because kmalloc is not initialized.
> > >>
> > >> It seems arch/mips/ralink/mt7621.c is using slab too early.
> > >
> > > Cool! it is finally root caused :) Thanks!
> > >
> > > The following patch should solve it and give it a warning message, though
> > > I'm not sure if there is other holes.
> > >
> > > Thanks,
> > > Feng
> > >
> > > ---
> > > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > > index 33b1886b06eb..429c21b7ecbc 100644
> > > --- a/mm/slab_common.c
> > > +++ b/mm/slab_common.c
> > > @@ -1043,7 +1043,14 @@ size_t __ksize(const void *object)
> > > #ifdef CONFIG_TRACING
> > > void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
> > > {
> > > - void *ret = __kmem_cache_alloc_node(s, gfpflags, NUMA_NO_NODE,
> > > + void *ret;
> > > +
> > > + if (unlikely(ZERO_OR_NULL_PTR(s))) {
> > > + WARN_ON_ONCE(1);
> > > + return s;
> > > + }
> > > +
> > > + ret = __kmem_cache_alloc_node(s, gfpflags, NUMA_NO_NODE,
> > > size, _RET_IP_);
> > >
> > > trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags, NUMA_NO_NODE);
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 157527d7101b..85d24bb6eda7 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -3410,8 +3410,14 @@ static __always_inline
> > > void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> > > gfp_t gfpflags)
> > > {
> > > - void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> > > + void *ret;
> > >
> > > + if (unlikely(ZERO_OR_NULL_PTR(s))) {
> > > + WARN_ON_ONCE(1);
> > > + return s;
> > > + }
> > > +

Thank you for the suggestion!

I think the holes are:
kmalloc_node_trace(), kmem_cache_alloc_node(), __do_kmalloc_node()

And I want to suggest:
What about using VM_WARN_ON_ONCE() instead?
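
That is, something like this sketch in the checks above, so the warning compiles away unless CONFIG_DEBUG_VM is set:

	if (unlikely(ZERO_OR_NULL_PTR(s))) {
		/* warns only when CONFIG_DEBUG_VM is set */
		VM_WARN_ON_ONCE(1);
		return s;
	}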

> > > + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> > > trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
> > >
> > > return ret;
> >
> > Yes, thank you, that patch atop v6.1-rc3 lets me boot, and shows the warning and stack dump.
> > Will you submit that, or how do we want to proceed?
>
> Thanks for confirming. I wanted to wait for Vlastimil, Hyeonggon and
> other developer's opinion. And yes, I can also post a more formal one.
>
> > transfer started ......................................... transfer ok, time=2.11s
> > setting up elf image... OK
> > jumping to kernel code
> > zimage at: 80B842A0 810B4BC0
> >
> > Uncompressing Linux at load address 80001000
> >
> > Copy device tree to address 80B80EE0
> >
> > Now, booting the kernel...
> >
> > [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #73 SMP Wed Nov 2 05:10:01 AEST 2022
> > [ 0.000000] ------------[ cut here ]------------
> > [ 0.000000] WARNING: CPU: 0 PID: 0 at mm/slub.c:3416 kmem_cache_alloc+0x5a4/0x5e8
> > [ 0.000000] Modules linked in:
> > [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #73
> > [ 0.000000] Stack : 810fff78 80084d98 00000000 00000004 00000000 00000000 80889d04 80c90000
> > [ 0.000000] 80920000 807bd328 8089d368 80923bd3 00000000 00000001 80889cb0 00000000
> > [ 0.000000] 00000000 00000000 807bd328 8084bcb1 00000002 00000002 00000001 6d6f4320
> > [ 0.000000] 00000000 80c97d3d 80c97d68 fffffffc 807bd328 00000000 00000000 00000000
> > [ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
> > [ 0.000000] ...
> > [ 0.000000] Call Trace:
> > [ 0.000000] [<80008260>] show_stack+0x28/0xf0
> > [ 0.000000] [<8070c958>] dump_stack_lvl+0x60/0x80
> > [ 0.000000] [<8002e184>] __warn+0xc4/0xf8
> > [ 0.000000] [<8002e210>] warn_slowpath_fmt+0x58/0xa4
> > [ 0.000000] [<801c0fac>] kmem_cache_alloc+0x5a4/0x5e8
> > [ 0.000000] [<8092856c>] prom_soc_init+0x1fc/0x2b4
> > [ 0.000000] [<80928060>] prom_init+0x44/0xf0
> > [ 0.000000] [<80929214>] setup_arch+0x4c/0x6a8
> > [ 0.000000] [<809257e0>] start_kernel+0x88/0x7c0
> > [ 0.000000]
> > [ 0.000000] ---[ end trace 0000000000000000 ]---
> > [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
> > [ 0.000000] printk: bootconsole [early0] enabled
> >
> > Thank you for working through this with me.
> > I will try to address the root cause in mt7621.c.
> > It looks like other arch/** soc_device_register users use postcore_initcall, device_initcall,
> > or the ARM DT_MACHINE_START .init_machine. A quick hack to use postcore_initcall in mt7621
> > avoided this zero ptr kmem_cache passed to kmem_cache_alloc_lru.
>
> If IIUC, the prom_soc_init() is only called once in kernel, can the
> 'soc_dev_attr' just be defined as a global data structure instead
> of calling kzalloc(), as its size is small only containing 7 pointers.

But soc_device_register() also uses kmalloc. I think calling it
after slab initialization would be the best solution - if that is correct.

>
> Thanks,
> Feng
>

--
Thanks,
Hyeonggon

2022-11-02 09:23:37

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On 11/1/22 11:33, John Thomson wrote:
> On Tue, 1 Nov 2022, at 09:31, Hyeonggon Yoo wrote:
>> On Tue, Nov 01, 2022 at 09:20:21AM +0000, John Thomson wrote:
>>> On Tue, 1 Nov 2022, at 07:57, Feng Tang wrote:
>>> > Hi Thomson,
>>> >
>>> > Thanks for testing!
>>> >
>>> > + mips maintainer and mail list. The original report is here
>>> >
>>> > https://lore.kernel.org/lkml/[email protected]/
>>>
>>> I am guessing my issue comes from __kmem_cache_alloc_lru accessing s->object_size when (kmem_cache) s is NULL?
>>> If that is the case, this change is not to blame, it only exposes the issue?
>>>
>>> I get the following dmesg (note very early NULL kmem_cache) with the below change atop v6.1-rc3:
>>>
>>> transfer started ......................................... transfer ok, time=2.02s
>>> setting up elf image... OK
>>> jumping to kernel code
>>> zimage at: 80B842A0 810B4EFC
>>>
>>> Uncompressing Linux at load address 80001000
>>>
>>> Copy device tree to address 80B80EE0
>>>
>>> Now, booting the kernel...
>>>
>>> [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #61 SMP Tue Nov 1 18:04:13 AEST 2022
>>> [ 0.000000] slub: kmem_cache_alloc called with kmem_cache: 0x0
>>> [ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache: 0x0
>>> [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
>>> [ 0.000000] printk: bootconsole [early0] enabled
>>> [ 0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
>>> [ 0.000000] MIPS: machine is MikroTik RouterBOARD 760iGS
>>>
>>> normal boot
>>>
>>>
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index 157527d7101b..10fcdf2520d2 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -3410,7 +3410,13 @@ static __always_inline
>>> void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
>>> gfp_t gfpflags)
>>> {
>>> - void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
>>> + void *ret;
>>> + if (IS_ERR_OR_NULL(s)) {
>>> + pr_warn("slub: __kmem_cache_alloc_lru called with kmem_cache: %pSR\n", s);
>>> + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, 0);
>>> + } else {
>>> + ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
>>> + }
>>>
>>> trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);
>>>
>>> @@ -3419,6 +3425,8 @@ void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
>>>
>>> void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
>>> {
>>> + if (IS_ERR_OR_NULL(s))
>>> + pr_warn("slub: kmem_cache_alloc called with kmem_cache: %pSR\n", s);
>>> return __kmem_cache_alloc_lru(s, NULL, gfpflags);
>>> }
>>> EXPORT_SYMBOL(kmem_cache_alloc);
>>> @@ -3426,6 +3434,8 @@ EXPORT_SYMBOL(kmem_cache_alloc);
>>> void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
>>> gfp_t gfpflags)
>>> {
>>> + if (IS_ERR_OR_NULL(s))
>>> + pr_warn("slub: __kmem_cache_alloc_lru called with kmem_cache: %pSR\n", s);
>>> return __kmem_cache_alloc_lru(s, lru, gfpflags);
>>> }
>>> EXPORT_SYMBOL(kmem_cache_alloc_lru);
>>>
>>>
>>> Any hints on where kmem_cache_alloc would be being called from this early?
>>> I will start looking from /init/main.c around pr_notice("%s", linux_banner);
>>
>> Great. Would you try calling dump_stack(); when we observed s == NULL?
>> That would give more information about who passed s == NULL to these
>> functions.
>>
>
> With the dump_stack() in place:
>
> Now, booting the kernel...
>
> [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #62 SMP Tue Nov 1 19:49:52 AEST 2022
> [ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache ptr: 0x0
> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #62
> [ 0.000000] Stack : 810fff78 80084d98 80889d00 00000004 00000000 00000000 80889d5c 80c90000
> [ 0.000000] 80920000 807bd380 8089d368 80923bd3 00000000 00000001 80889d08 00000000
> [ 0.000000] 00000000 00000000 807bd380 8084bd51 00000002 00000002 00000001 6d6f4320
> [ 0.000000] 00000000 80c97ce9 80c97d14 fffffffc 807bd380 00000000 00000003 00000dc0
> [ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
> [ 0.000000] ...
> [ 0.000000] Call Trace:
> [ 0.000000] [<80008260>] show_stack+0x28/0xf0
> [ 0.000000] [<8070cdc0>] dump_stack_lvl+0x60/0x80
> [ 0.000000] [<801c1428>] kmem_cache_alloc+0x5c0/0x740
> [ 0.000000] [<8092856c>] prom_soc_init+0x1fc/0x2b4
> [ 0.000000] [<80928060>] prom_init+0x44/0xf0
> [ 0.000000] [<80929214>] setup_arch+0x4c/0x6a8
> [ 0.000000] [<809257e0>] start_kernel+0x88/0x7c0
> [ 0.000000]
> [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3

The stack means CONFIG_TRACING=n, is that right?

That would mean
prom_soc_init()
soc_dev_init()
kzalloc() -> kmalloc()
kmalloc_trace() // after #else /* CONFIG_TRACING */
kmem_cache_alloc(s, flags);

Looks like this path hits a small bug in the wasting detection patch, as we
throw away the size there.

AFAICS before this patch, we "survive" "kmem_cache *s" being NULL as
slab_pre_alloc_hook() will happen to return NULL and we bail out from
slab_alloc_node(). But this is a side-effect, not an intended protection.
Also the CONFIG_TRACING variant of kmalloc_trace() would have called
trace_kmalloc dereferencing s->size anyway even before this patch.

I don't think we should add WARNs in the slab hot paths just to prevent this
rare error of using slab too early. At most VM_WARN... would be acceptable,
but still not necessary, as crashing immediately from a NULL pointer is
sufficient.

So IMHO mips should fix their soc init, and we should look into the
CONFIG_TRACING=n variant of kmalloc_trace(), to pass orig_size properly.
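
For reference, the CONFIG_TRACING=n variant in include/linux/slab.h currently reads:

#else /* CONFIG_TRACING */
/* Save a function call when CONFIG_TRACING=n */
static __always_inline __alloc_size(3)
void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
{
	void *ret = kmem_cache_alloc(s, flags);

	ret = kasan_kmalloc(s, ret, size, flags);
	return ret;
}

note how 'size' never reaches the allocator.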

2022-11-03 06:40:41

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Wed, Nov 02, 2022 at 04:22:37PM +0800, Vlastimil Babka wrote:
> On 11/1/22 11:33, John Thomson wrote:
[...]
> >
> > [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #62 SMP Tue Nov 1 19:49:52 AEST 2022
> > [ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache ptr: 0x0
> > [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #62
> > [ 0.000000] Stack : 810fff78 80084d98 80889d00 00000004 00000000 00000000 80889d5c 80c90000
> > [ 0.000000] 80920000 807bd380 8089d368 80923bd3 00000000 00000001 80889d08 00000000
> > [ 0.000000] 00000000 00000000 807bd380 8084bd51 00000002 00000002 00000001 6d6f4320
> > [ 0.000000] 00000000 80c97ce9 80c97d14 fffffffc 807bd380 00000000 00000003 00000dc0
> > [ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
> > [ 0.000000] ...
> > [ 0.000000] Call Trace:
> > [ 0.000000] [<80008260>] show_stack+0x28/0xf0
> > [ 0.000000] [<8070cdc0>] dump_stack_lvl+0x60/0x80
> > [ 0.000000] [<801c1428>] kmem_cache_alloc+0x5c0/0x740
> > [ 0.000000] [<8092856c>] prom_soc_init+0x1fc/0x2b4
> > [ 0.000000] [<80928060>] prom_init+0x44/0xf0
> > [ 0.000000] [<80929214>] setup_arch+0x4c/0x6a8
> > [ 0.000000] [<809257e0>] start_kernel+0x88/0x7c0
> > [ 0.000000]
> > [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
>
> The stack means CONFIG_TRACING=n, is that right?

Yes, from the kconfig, CONFIG_TRACING is not set.

> That would mean
> prom_soc_init()
> soc_dev_init()
> kzalloc() -> kmalloc()
> kmalloc_trace() // after #else /* CONFIG_TRACING */
> kmem_cache_alloc(s, flags);
>
> Looks like this path is a small bug in the wasting detection patch, as we
> throw away size there.

Yes, from the code reading and John's log, it is.

One strange thing is, I reset the code to v6.0, and found that
__kmem_cache_alloc_lru() also accesses 's->object_size':

void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
gfp_t gfpflags)
{
void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
...
}

And from John's dump_stack() info, this call is also where the NULL pointer
shows up, which I still can't figure out.

> AFAICS before this patch, we "survive" "kmem_cache *s" being NULL as
> slab_pre_alloc_hook() will happen to return NULL and we bail out from
> slab_alloc_node(). But this is a side-effect, not an intended protection.
> Also the CONFIG_TRACING variant of kmalloc_trace() would have called
> trace_kmalloc dereferencing s->size anyway even before this patch.
>
> I don't think we should add WARNS in the slab hot paths just to prevent this
> rare error of using slab too early. At most VM_WARN... would be acceptable
> but still not necessary as crashing immediately from a NULL pointer is
> sufficient.
>
> So IMHO mips should fix their soc init,

Yes, for the mips fix, John has proposed deferring the call to prom_soc_init(),
which looks reasonable.

> and we should look into the
> CONFIG_TRACING=n variant of kmalloc_trace(), to pass orig_size properly.

You mean checking if the pointer is NULL and bailing out early?

Thanks,
Feng

2022-11-03 07:49:31

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Wed, Nov 02, 2022 at 04:16:06PM +0900, Hyeonggon Yoo wrote:
> On Wed, Nov 02, 2022 at 02:08:09PM +0800, Feng Tang wrote:
[...]
> > > transfer started ......................................... transfer ok, time=2.11s
> > > setting up elf image... OK
> > > jumping to kernel code
> > > zimage at: 80B842A0 810B4BC0
> > >
> > > Uncompressing Linux at load address 80001000
> > >
> > > Copy device tree to address 80B80EE0
> > >
> > > Now, booting the kernel...
> > >
> > > [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #73 SMP Wed Nov 2 05:10:01 AEST 2022
> > > [ 0.000000] ------------[ cut here ]------------
> > > [ 0.000000] WARNING: CPU: 0 PID: 0 at mm/slub.c:3416 kmem_cache_alloc+0x5a4/0x5e8
> > > [ 0.000000] Modules linked in:
> > > [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #73
> > > [ 0.000000] Stack : 810fff78 80084d98 00000000 00000004 00000000 00000000 80889d04 80c90000
> > > [ 0.000000] 80920000 807bd328 8089d368 80923bd3 00000000 00000001 80889cb0 00000000
> > > [ 0.000000] 00000000 00000000 807bd328 8084bcb1 00000002 00000002 00000001 6d6f4320
> > > [ 0.000000] 00000000 80c97d3d 80c97d68 fffffffc 807bd328 00000000 00000000 00000000
> > > [ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
> > > [ 0.000000] ...
> > > [ 0.000000] Call Trace:
> > > [ 0.000000] [<80008260>] show_stack+0x28/0xf0
> > > [ 0.000000] [<8070c958>] dump_stack_lvl+0x60/0x80
> > > [ 0.000000] [<8002e184>] __warn+0xc4/0xf8
> > > [ 0.000000] [<8002e210>] warn_slowpath_fmt+0x58/0xa4
> > > [ 0.000000] [<801c0fac>] kmem_cache_alloc+0x5a4/0x5e8
> > > [ 0.000000] [<8092856c>] prom_soc_init+0x1fc/0x2b4
> > > [ 0.000000] [<80928060>] prom_init+0x44/0xf0
> > > [ 0.000000] [<80929214>] setup_arch+0x4c/0x6a8
> > > [ 0.000000] [<809257e0>] start_kernel+0x88/0x7c0
> > > [ 0.000000]
> > > [ 0.000000] ---[ end trace 0000000000000000 ]---
> > > [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
> > > [ 0.000000] printk: bootconsole [early0] enabled
> > >
> > > Thank you for working through this with me.
> > > I will try to address the root cause in mt7621.c.
> > > It looks like other arch/** soc_device_register users use postcore_initcall, device_initcall,
> > > or the ARM DT_MACHINE_START .init_machine. A quick hack to use postcore_initcall in mt7621
> > > avoided this zero ptr kmem_cache passed to kmem_cache_alloc_lru.
> >
> > If IIUC, the prom_soc_init() is only called once in kernel, can the
> > 'soc_dev_attr' just be defined as a global data structure instead
> > of calling kzalloc(), as its size is small only containing 7 pointers.
>
> But soc_device_registers() too uses kmalloc. I think calling it
> after slab initialization will be best solution - if that is correct.

Yes, you are right, there are other kmalloc() calls down the call chain.

Hi John,

Will you verify and submit a patch for your proposal of deferring
the call to prom_soc_init()? Thanks.

- Feng

2022-11-03 08:02:19

by John Thomson

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Thu, 3 Nov 2022, at 07:18, Feng Tang wrote:
> On Wed, Nov 02, 2022 at 04:16:06PM +0900, Hyeonggon Yoo wrote:
>> On Wed, Nov 02, 2022 at 02:08:09PM +0800, Feng Tang wrote:
> [...]
>> > > transfer started ......................................... transfer ok, time=2.11s
>> > > setting up elf image... OK
>> > > jumping to kernel code
>> > > zimage at: 80B842A0 810B4BC0
>> > >
>> > > Uncompressing Linux at load address 80001000
>> > >
>> > > Copy device tree to address 80B80EE0
>> > >
>> > > Now, booting the kernel...
>> > >
>> > > [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #73 SMP Wed Nov 2 05:10:01 AEST 2022
>> > > [ 0.000000] ------------[ cut here ]------------
>> > > [ 0.000000] WARNING: CPU: 0 PID: 0 at mm/slub.c:3416 kmem_cache_alloc+0x5a4/0x5e8
>> > > [ 0.000000] Modules linked in:
>> > > [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #73
>> > > [ 0.000000] Stack : 810fff78 80084d98 00000000 00000004 00000000 00000000 80889d04 80c90000
>> > > [ 0.000000] 80920000 807bd328 8089d368 80923bd3 00000000 00000001 80889cb0 00000000
>> > > [ 0.000000] 00000000 00000000 807bd328 8084bcb1 00000002 00000002 00000001 6d6f4320
>> > > [ 0.000000] 00000000 80c97d3d 80c97d68 fffffffc 807bd328 00000000 00000000 00000000
>> > > [ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
>> > > [ 0.000000] ...
>> > > [ 0.000000] Call Trace:
>> > > [ 0.000000] [<80008260>] show_stack+0x28/0xf0
>> > > [ 0.000000] [<8070c958>] dump_stack_lvl+0x60/0x80
>> > > [ 0.000000] [<8002e184>] __warn+0xc4/0xf8
>> > > [ 0.000000] [<8002e210>] warn_slowpath_fmt+0x58/0xa4
>> > > [ 0.000000] [<801c0fac>] kmem_cache_alloc+0x5a4/0x5e8
>> > > [ 0.000000] [<8092856c>] prom_soc_init+0x1fc/0x2b4
>> > > [ 0.000000] [<80928060>] prom_init+0x44/0xf0
>> > > [ 0.000000] [<80929214>] setup_arch+0x4c/0x6a8
>> > > [ 0.000000] [<809257e0>] start_kernel+0x88/0x7c0
>> > > [ 0.000000]
>> > > [ 0.000000] ---[ end trace 0000000000000000 ]---
>> > > [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
>> > > [ 0.000000] printk: bootconsole [early0] enabled
>> > >
>> > > Thank you for working through this with me.
>> > > I will try to address the root cause in mt7621.c.
>> > > It looks like other arch/** soc_device_register users use postcore_initcall, device_initcall,
>> > > or the ARM DT_MACHINE_START .init_machine. A quick hack to use postcore_initcall in mt7621
>> > > avoided this zero ptr kmem_cache passed to kmem_cache_alloc_lru.
>> >
>> > If IIUC, the prom_soc_init() is only called once in kernel, can the
>> > 'soc_dev_attr' just be defined as a global data structure instead
>> > of calling kzalloc(), as its size is small only containing 7 pointers.
>>
>> But soc_device_registers() too uses kmalloc. I think calling it
>> after slab initialization will be best solution - if that is correct.
>
> Yes, you are right, there is other kmalloc() down the call chain.
>
> Hi John,
>
> Will you verify and submit a patch for your proposal of deferring
> calling prom_soc_init()? thanks
>
> - Feng

Hi Feng,

My proposed mt7621.c changes are RFC here:
https://lore.kernel.org/lkml/[email protected]/
That series lets me boot the v6.1-rc3 kernel. I have only tried it with my config (as sent earlier); if there are other suspect config settings that I should test, please let me know.
I used device_initcall, but postcore_initcall also works fine.
I rephrased Vlastimil's explanation and used it in the patch 3 description.
I have not referenced a Fixes tag yet (unsure which, if any, I should use).

Cheers,
--
John Thomson

2022-11-03 08:55:36

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Thu, Nov 03, 2022 at 07:45:49AM +0000, John Thomson wrote:
> On Thu, 3 Nov 2022, at 07:18, Feng Tang wrote:
> > On Wed, Nov 02, 2022 at 04:16:06PM +0900, Hyeonggon Yoo wrote:
> >> On Wed, Nov 02, 2022 at 02:08:09PM +0800, Feng Tang wrote:
> > [...]
> >> > > transfer started ......................................... transfer ok, time=2.11s
> >> > > setting up elf image... OK
> >> > > jumping to kernel code
> >> > > zimage at: 80B842A0 810B4BC0
> >> > >
> >> > > Uncompressing Linux at load address 80001000
> >> > >
> >> > > Copy device tree to address 80B80EE0
> >> > >
> >> > > Now, booting the kernel...
> >> > >
> >> > > [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #73 SMP Wed Nov 2 05:10:01 AEST 2022
> >> > > [ 0.000000] ------------[ cut here ]------------
> >> > > [ 0.000000] WARNING: CPU: 0 PID: 0 at mm/slub.c:3416 kmem_cache_alloc+0x5a4/0x5e8
> >> > > [ 0.000000] Modules linked in:
> >> > > [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #73
> >> > > [ 0.000000] Stack : 810fff78 80084d98 00000000 00000004 00000000 00000000 80889d04 80c90000
> >> > > [ 0.000000] 80920000 807bd328 8089d368 80923bd3 00000000 00000001 80889cb0 00000000
> >> > > [ 0.000000] 00000000 00000000 807bd328 8084bcb1 00000002 00000002 00000001 6d6f4320
> >> > > [ 0.000000] 00000000 80c97d3d 80c97d68 fffffffc 807bd328 00000000 00000000 00000000
> >> > > [ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
> >> > > [ 0.000000] ...
> >> > > [ 0.000000] Call Trace:
> >> > > [ 0.000000] [<80008260>] show_stack+0x28/0xf0
> >> > > [ 0.000000] [<8070c958>] dump_stack_lvl+0x60/0x80
> >> > > [ 0.000000] [<8002e184>] __warn+0xc4/0xf8
> >> > > [ 0.000000] [<8002e210>] warn_slowpath_fmt+0x58/0xa4
> >> > > [ 0.000000] [<801c0fac>] kmem_cache_alloc+0x5a4/0x5e8
> >> > > [ 0.000000] [<8092856c>] prom_soc_init+0x1fc/0x2b4
> >> > > [ 0.000000] [<80928060>] prom_init+0x44/0xf0
> >> > > [ 0.000000] [<80929214>] setup_arch+0x4c/0x6a8
> >> > > [ 0.000000] [<809257e0>] start_kernel+0x88/0x7c0
> >> > > [ 0.000000]
> >> > > [ 0.000000] ---[ end trace 0000000000000000 ]---
> >> > > [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
> >> > > [ 0.000000] printk: bootconsole [early0] enabled
> >> > >
> >> > > Thank you for working through this with me.
> >> > > I will try to address the root cause in mt7621.c.
> >> > > It looks like other arch/** soc_device_register users use postcore_initcall, device_initcall,
> >> > > or the ARM DT_MACHINE_START .init_machine. A quick hack to use postcore_initcall in mt7621
> >> > > avoided this zero ptr kmem_cache passed to kmem_cache_alloc_lru.
> >> >
> >> > If IIUC, the prom_soc_init() is only called once in kernel, can the
> >> > 'soc_dev_attr' just be defined as a global data structure instead
> >> > of calling kzalloc(), as its size is small only containing 7 pointers.
> >>
> >> But soc_device_registers() too uses kmalloc. I think calling it
> >> after slab initialization will be best solution - if that is correct.
> >
> > Yes, you are right, there is other kmalloc() down the call chain.
> >
> > Hi John,
> >
> > Will you verify and submit a patch for your proposal of deferring
> > calling prom_soc_init()? thanks
> >
> > - Feng
>
> Hi Feng,
>
> My proposed mt7621.c changes are RFC here:
> https://lore.kernel.org/lkml/[email protected]/

Great!

> That series lets me boot the v6.1-rc3 kernel. I have only tried it with my config (as sent earlier). If there are other suspect config settings that I should test, please let me know?
> I used device_initcall, but postcore_initcall also works fine.

I'm not sure which initcall level is better, due to my lack of mips
platform knowledge.

> I rephrased Vlastimil's explanation and used it in patch 3 description.
> I have not referenced a Fixes tag yet (unsure which/if any I should use)

With older versions, the kernel boots fine without soc_dev_init() actually
being called, and I don't know whether they also need it to be called.

Thanks,
Feng

> Cheers,
> --
> John Thomson
>

2022-11-03 09:25:12

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On 11/3/22 06:54, Feng Tang wrote:
> On Wed, Nov 02, 2022 at 04:22:37PM +0800, Vlastimil Babka wrote:
>> On 11/1/22 11:33, John Thomson wrote:
> [...]
>> >
>> > [ 0.000000] Linux version 6.1.0-rc3+ (john@john) (mipsel-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.11-4428-g6b6741b) 12.2.0, GNU ld (GNU Binutils) 2.39) #62 SMP Tue Nov 1 19:49:52 AEST 2022
>> > [ 0.000000] slub: __kmem_cache_alloc_lru called with kmem_cache ptr: 0x0
>> > [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3+ #62
>> > [ 0.000000] Stack : 810fff78 80084d98 80889d00 00000004 00000000 00000000 80889d5c 80c90000
>> > [ 0.000000] 80920000 807bd380 8089d368 80923bd3 00000000 00000001 80889d08 00000000
>> > [ 0.000000] 00000000 00000000 807bd380 8084bd51 00000002 00000002 00000001 6d6f4320
>> > [ 0.000000] 00000000 80c97ce9 80c97d14 fffffffc 807bd380 00000000 00000003 00000dc0
>> > [ 0.000000] 00000000 a0000000 80910000 8110a0b4 00000000 00000020 80010000 80010000
>> > [ 0.000000] ...
>> > [ 0.000000] Call Trace:
>> > [ 0.000000] [<80008260>] show_stack+0x28/0xf0
>> > [ 0.000000] [<8070cdc0>] dump_stack_lvl+0x60/0x80
>> > [ 0.000000] [<801c1428>] kmem_cache_alloc+0x5c0/0x740
>> > [ 0.000000] [<8092856c>] prom_soc_init+0x1fc/0x2b4
>> > [ 0.000000] [<80928060>] prom_init+0x44/0xf0
>> > [ 0.000000] [<80929214>] setup_arch+0x4c/0x6a8
>> > [ 0.000000] [<809257e0>] start_kernel+0x88/0x7c0
>> > [ 0.000000]
>> > [ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
>>
>> The stack means CONFIG_TRACING=n, is that right?
>
> Yes, from the kconfig, CONFIG_TRACING is not set.
>
>> That would mean
>> prom_soc_init()
>> soc_dev_init()
>> kzalloc() -> kmalloc()
>> kmalloc_trace() // after #else /* CONFIG_TRACING */
>> kmem_cache_alloc(s, flags);
>>
>> Looks like this path is a small bug in the wasting detection patch, as we
>> throw away size there.
>
> Yes, from the code reading and log from John, it is.
>
> One strange thing is, I reset the code to v6.0, and found that
> __kmem_cache_alloc_lru() also access the 's->object_size'
>
> void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
> gfp_t gfpflags)
> {
> void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);
> ...
> }
>
> And from John's dump_stack() info, this call is also where the NULL pointer
> happens, which I still can't figue out.
>
>> AFAICS before this patch, we "survive" "kmem_cache *s" being NULL as
>> slab_pre_alloc_hook() will happen to return NULL and we bail out from
>> slab_alloc_node(). But this is a side-effect, not an intended protection.
>> Also the CONFIG_TRACING variant of kmalloc_trace() would have called
>> trace_kmalloc dereferencing s->size anyway even before this patch.
>>
>> I don't think we should add WARNS in the slab hot paths just to prevent this
>> rare error of using slab too early. At most VM_WARN... would be acceptable
>> but still not necessary as crashing immediately from a NULL pointer is
>> sufficient.
>>
>> So IMHO mips should fix their soc init,
>
> Yes, for the mips fix, John has proposed to defer the calling of prom_soc_init(),
> which looks reasonable.
>
>> and we should look into the
>> CONFIG_TRACING=n variant of kmalloc_trace(), to pass orig_size properly.
>
> You mean check if the pointer is NULL and bail out early.

No I mean here:

#else /* CONFIG_TRACING */
/* Save a function call when CONFIG_TRACING=n */
static __always_inline __alloc_size(3)
void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
{
void *ret = kmem_cache_alloc(s, flags);

ret = kasan_kmalloc(s, ret, size, flags);
return ret;
}

we call kmem_cache_alloc() and discard the size parameter, so it will assume
s->object_size (and, as a side effect, crash if s is NULL). We shouldn't
add "s is NULL?" checks, but fix passing the size - probably switch to
__kmem_cache_alloc_node()? - and do the same in the following kmalloc_node_trace().
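
Condensed into a sketch (simplified from the snippets quoted above;
tracepoints and the lru indirection are omitted), the problem path
looks like:

/* Condensed sketch of the CONFIG_TRACING=n path, not verbatim kernel
 * code: 'size' is dropped on the floor here ...
 */
static __always_inline void *kmalloc_trace(struct kmem_cache *s,
					   gfp_t flags, size_t size)
{
	void *ret = kmem_cache_alloc(s, flags);	/* 'size' unused */

	ret = kasan_kmalloc(s, ret, size, flags);
	return ret;
}

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
	/* ... so the allocator substitutes s->object_size for the
	 * requested size: the kmalloc wastage info is lost, and s is
	 * dereferenced even when a too-early caller passed s == NULL.
	 */
	return slab_alloc(s, NULL, gfpflags, _RET_IP_, s->object_size);
}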


2022-11-03 14:55:04

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Thu, Nov 03, 2022 at 10:16:12PM +0800, Feng Tang wrote:
> On Thu, Nov 03, 2022 at 09:33:28AM +0100, Vlastimil Babka wrote:
> [...]
> > >> AFAICS before this patch, we "survive" "kmem_cache *s" being NULL as
> > >> slab_pre_alloc_hook() will happen to return NULL and we bail out from
> > >> slab_alloc_node(). But this is a side-effect, not an intended protection.
> > >> Also the CONFIG_TRACING variant of kmalloc_trace() would have called
> > >> trace_kmalloc dereferencing s->size anyway even before this patch.
> > >>
> > >> I don't think we should add WARNS in the slab hot paths just to prevent this
> > >> rare error of using slab too early. At most VM_WARN... would be acceptable
> > >> but still not necessary as crashing immediately from a NULL pointer is
> > >> sufficient.
> > >>
> > >> So IMHO mips should fix their soc init,
> > >
> > > Yes, for the mips fix, John has proposed to defer the calling of prom_soc_init(),
> > > which looks reasonable.
> > >
> > >> and we should look into the
> > >> CONFIG_TRACING=n variant of kmalloc_trace(), to pass orig_size properly.
> > >
> > > You mean check if the pointer is NULL and bail out early.
> >
> > No I mean here:
> >
> > #else /* CONFIG_TRACING */
> > /* Save a function call when CONFIG_TRACING=n */
> > static __always_inline __alloc_size(3)
> > void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
> > {
> > void *ret = kmem_cache_alloc(s, flags);
> >
> > ret = kasan_kmalloc(s, ret, size, flags);
> > return ret;
> > }
> >
> > we call kmem_cache_alloc() and discard the size parameter, so it will assume
> > s->object_size (and as the side-effect, crash if s is NULL). We shouldn't
> > add "s is NULL?" checks, but fix passing the size - probably switch to
> > __kmem_cache_alloc_node()? and in the following kmalloc_node_trace() analogically.
>
> Got it, thanks! I might have missed it during some rebasing for the
> kmalloc wastage debug patch.

That was a good catch, and I missed it too!
But FYI I'm suggesting to drop the CONFIG_TRACING=n variant:

https://lore.kernel.org/linux-mm/[email protected]/T/#m20ecf14390e406247bde0ea9cce368f469c539ed

Any thoughts?
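
Roughly, dropping the variant would leave one always out-of-line
definition in mm/slub.c, sketched below under the assumption that it
can reuse slab_alloc_node() directly (tracepoint details omitted; this
is a sketch, not the actual patch from the link above):

/* Sketch of a single, always out-of-line kmalloc_trace(): 'size' is
 * passed through, so the wastage accounting and the extended redzone
 * checks always see the real request size, even with CONFIG_TRACING=n,
 * at the cost of one extra function call.
 */
void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
{
	void *ret = slab_alloc_node(s, NULL, flags, NUMA_NO_NODE,
				    _RET_IP_, size);

	ret = kasan_kmalloc(s, ret, size, flags);
	return ret;
}
EXPORT_SYMBOL(kmalloc_trace);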

>
> How about the following fix?
>
> Thanks,
> Feng
>
> ---
> From 9f9fa9da8946fd44625f873c0f51167357075be1 Mon Sep 17 00:00:00 2001
> From: Feng Tang <[email protected]>
> Date: Thu, 3 Nov 2022 21:32:10 +0800
> Subject: [PATCH] mm/slub: Add missing orig_size parameter for wastage debug
>
> commit 6edf2576a6cc ("mm/slub: enable debugging memory wasting of
> kmalloc") was introduced for debugging kmalloc memory wastage,
> and it missed to pass the original request size for kmalloc_trace()
> and kmalloc_node_trace() in CONFIG_TRACING=n path.
>
> Fix it by using __kmem_cache_alloc_node() with correct original
> request size.
>
> Fixes: 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc")
> Suggested-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> include/linux/slab.h | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 90877fcde70b..9691afa569e1 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -469,6 +469,9 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignm
> __alloc_size(1);
> void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
> __malloc;
> +void *__kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node,
> + size_t orig_size, unsigned long caller) __assume_slab_alignment
> + __malloc;
>
> #ifdef CONFIG_TRACING
> void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
> @@ -482,7 +485,8 @@ void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
> static __always_inline __alloc_size(3)
> void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
> {
> - void *ret = kmem_cache_alloc(s, flags);
> + void *ret = __kmem_cache_alloc_node(s, flags, NUMA_NO_NODE,
> + size, _RET_IP_);
>
> ret = kasan_kmalloc(s, ret, size, flags);
> return ret;
> @@ -492,7 +496,8 @@ static __always_inline __alloc_size(4)
> void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
> int node, size_t size)
> {
> - void *ret = kmem_cache_alloc_node(s, gfpflags, node);
> + void *ret = __kmem_cache_alloc_node(s, gfpflags, node,
> + size, _RET_IP_);
>
> ret = kasan_kmalloc(s, ret, size, gfpflags);
> return ret;
> --
> 2.34.1
>
>
>

--
Thanks,
Hyeonggon

2022-11-03 15:32:30

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Thu, Nov 03, 2022 at 09:33:28AM +0100, Vlastimil Babka wrote:
[...]
> >> AFAICS before this patch, we "survive" "kmem_cache *s" being NULL as
> >> slab_pre_alloc_hook() will happen to return NULL and we bail out from
> >> slab_alloc_node(). But this is a side-effect, not an intended protection.
> >> Also the CONFIG_TRACING variant of kmalloc_trace() would have called
> >> trace_kmalloc dereferencing s->size anyway even before this patch.
> >>
> >> I don't think we should add WARNS in the slab hot paths just to prevent this
> >> rare error of using slab too early. At most VM_WARN... would be acceptable
> >> but still not necessary as crashing immediately from a NULL pointer is
> >> sufficient.
> >>
> >> So IMHO mips should fix their soc init,
> >
> > Yes, for the mips fix, John has proposed to defer the calling of prom_soc_init(),
> > which looks reasonable.
> >
> >> and we should look into the
> >> CONFIG_TRACING=n variant of kmalloc_trace(), to pass orig_size properly.
> >
> > You mean check if the pointer is NULL and bail out early.
>
> No I mean here:
>
> #else /* CONFIG_TRACING */
> /* Save a function call when CONFIG_TRACING=n */
> static __always_inline __alloc_size(3)
> void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
> {
> void *ret = kmem_cache_alloc(s, flags);
>
> ret = kasan_kmalloc(s, ret, size, flags);
> return ret;
> }
>
> we call kmem_cache_alloc() and discard the size parameter, so it will assume
> s->object_size (and as the side-effect, crash if s is NULL). We shouldn't
> add "s is NULL?" checks, but fix passing the size - probably switch to
> __kmem_cache_alloc_node()? and in the following kmalloc_node_trace() analogically.

Got it, thanks! I might have missed it during one of the rebases of the
kmalloc wastage debug patch.

How about the following fix?

Thanks,
Feng

---
From 9f9fa9da8946fd44625f873c0f51167357075be1 Mon Sep 17 00:00:00 2001
From: Feng Tang <[email protected]>
Date: Thu, 3 Nov 2022 21:32:10 +0800
Subject: [PATCH] mm/slub: Add missing orig_size parameter for wastage debug

commit 6edf2576a6cc ("mm/slub: enable debugging memory wasting of
kmalloc") was introduced for debugging kmalloc memory wastage,
and it missed to pass the original request size for kmalloc_trace()
and kmalloc_node_trace() in CONFIG_TRACING=n path.

Fix it by using __kmem_cache_alloc_node() with correct original
request size.

Fixes: 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc")
Suggested-by: Vlastimil Babka <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
include/linux/slab.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 90877fcde70b..9691afa569e1 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -469,6 +469,9 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignm
__alloc_size(1);
void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
__malloc;
+void *__kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node,
+ size_t orig_size, unsigned long caller) __assume_slab_alignment
+ __malloc;

#ifdef CONFIG_TRACING
void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
@@ -482,7 +485,8 @@ void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
static __always_inline __alloc_size(3)
void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
{
- void *ret = kmem_cache_alloc(s, flags);
+ void *ret = __kmem_cache_alloc_node(s, flags, NUMA_NO_NODE,
+ size, _RET_IP_);

ret = kasan_kmalloc(s, ret, size, flags);
return ret;
@@ -492,7 +496,8 @@ static __always_inline __alloc_size(4)
void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
int node, size_t size)
{
- void *ret = kmem_cache_alloc_node(s, gfpflags, node);
+ void *ret = __kmem_cache_alloc_node(s, gfpflags, node,
+ size, _RET_IP_);

ret = kasan_kmalloc(s, ret, size, gfpflags);
return ret;
--
2.34.1




2022-11-03 17:45:28

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On 11/3/22 15:36, Hyeonggon Yoo wrote:
> On Thu, Nov 03, 2022 at 10:16:12PM +0800, Feng Tang wrote:
>> On Thu, Nov 03, 2022 at 09:33:28AM +0100, Vlastimil Babka wrote:
>> [...]
>> > >> AFAICS before this patch, we "survive" "kmem_cache *s" being NULL as
>> > >> slab_pre_alloc_hook() will happen to return NULL and we bail out from
>> > >> slab_alloc_node(). But this is a side-effect, not an intended protection.
>> > >> Also the CONFIG_TRACING variant of kmalloc_trace() would have called
>> > >> trace_kmalloc dereferencing s->size anyway even before this patch.
>> > >>
>> > >> I don't think we should add WARNS in the slab hot paths just to prevent this
>> > >> rare error of using slab too early. At most VM_WARN... would be acceptable
>> > >> but still not necessary as crashing immediately from a NULL pointer is
>> > >> sufficient.
>> > >>
>> > >> So IMHO mips should fix their soc init,
>> > >
>> > > Yes, for the mips fix, John has proposed to defer the calling of prom_soc_init(),
>> > > which looks reasonable.
>> > >
>> > >> and we should look into the
>> > >> CONFIG_TRACING=n variant of kmalloc_trace(), to pass orig_size properly.
>> > >
>> > > You mean check if the pointer is NULL and bail out early.
>> >
>> > No I mean here:
>> >
>> > #else /* CONFIG_TRACING */
>> > /* Save a function call when CONFIG_TRACING=n */
>> > static __always_inline __alloc_size(3)
>> > void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
>> > {
>> > void *ret = kmem_cache_alloc(s, flags);
>> >
>> > ret = kasan_kmalloc(s, ret, size, flags);
>> > return ret;
>> > }
>> >
>> > we call kmem_cache_alloc() and discard the size parameter, so it will assume
>> > s->object_size (and as the side-effect, crash if s is NULL). We shouldn't
>> > add "s is NULL?" checks, but fix passing the size - probably switch to
>> > __kmem_cache_alloc_node()? and in the following kmalloc_node_trace() analogically.
>>
>> Got it, thanks! I might have missed it during some rebasing for the
>> kmalloc wastage debug patch.
>
> That was good catch and I missed too!
> But FYI I'm suggesting to drop CONFIG_TRACING=n variant:
>
> https://lore.kernel.org/linux-mm/[email protected]/T/#m20ecf14390e406247bde0ea9cce368f469c539ed
>
> Any thoughts?

I'll get to it. Also, I think we were pondering that within your series too,
but I wanted to postpone it in case somebody objects to the extra function
call it creates.
But that would be for 6.2 anyway, while I'll collect the fix here for 6.1.

>>
>> How about the following fix?
>>
>> Thanks,
>> Feng
>>
>> ---
>> From 9f9fa9da8946fd44625f873c0f51167357075be1 Mon Sep 17 00:00:00 2001
>> From: Feng Tang <[email protected]>
>> Date: Thu, 3 Nov 2022 21:32:10 +0800
>> Subject: [PATCH] mm/slub: Add missing orig_size parameter for wastage debug
>>
>> commit 6edf2576a6cc ("mm/slub: enable debugging memory wasting of
>> kmalloc") was introduced for debugging kmalloc memory wastage,
>> and it missed to pass the original request size for kmalloc_trace()
>> and kmalloc_node_trace() in CONFIG_TRACING=n path.
>>
>> Fix it by using __kmem_cache_alloc_node() with correct original
>> request size.
>>
>> Fixes: 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc")
>> Suggested-by: Vlastimil Babka <[email protected]>
>> Signed-off-by: Feng Tang <[email protected]>
>> ---
>> include/linux/slab.h | 9 +++++++--
>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>> index 90877fcde70b..9691afa569e1 100644
>> --- a/include/linux/slab.h
>> +++ b/include/linux/slab.h
>> @@ -469,6 +469,9 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignm
>> __alloc_size(1);
>> void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
>> __malloc;
>> +void *__kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node,
>> + size_t orig_size, unsigned long caller) __assume_slab_alignment
>> + __malloc;
>>
>> #ifdef CONFIG_TRACING
>> void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
>> @@ -482,7 +485,8 @@ void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
>> static __always_inline __alloc_size(3)
>> void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
>> {
>> - void *ret = kmem_cache_alloc(s, flags);
>> + void *ret = __kmem_cache_alloc_node(s, flags, NUMA_NO_NODE,
>> + size, _RET_IP_);
>>
>> ret = kasan_kmalloc(s, ret, size, flags);
>> return ret;
>> @@ -492,7 +496,8 @@ static __always_inline __alloc_size(4)
>> void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
>> int node, size_t size)
>> {
>> - void *ret = kmem_cache_alloc_node(s, gfpflags, node);
>> + void *ret = __kmem_cache_alloc_node(s, gfpflags, node,
>> + size, _RET_IP_);
>>
>> ret = kasan_kmalloc(s, ret, size, gfpflags);
>> return ret;
>> --
>> 2.34.1
>>
>>
>>
>


2022-11-03 17:54:01

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On 11/3/22 17:57, Vlastimil Babka wrote:
> On 11/3/22 15:36, Hyeonggon Yoo wrote:
>> On Thu, Nov 03, 2022 at 10:16:12PM +0800, Feng Tang wrote:
>>> On Thu, Nov 03, 2022 at 09:33:28AM +0100, Vlastimil Babka wrote:
>>> [...]
>>> > >> AFAICS before this patch, we "survive" "kmem_cache *s" being NULL as
>>> > >> slab_pre_alloc_hook() will happen to return NULL and we bail out from
>>> > >> slab_alloc_node(). But this is a side-effect, not an intended protection.
>>> > >> Also the CONFIG_TRACING variant of kmalloc_trace() would have called
>>> > >> trace_kmalloc dereferencing s->size anyway even before this patch.
>>> > >>
>>> > >> I don't think we should add WARNS in the slab hot paths just to prevent this
>>> > >> rare error of using slab too early. At most VM_WARN... would be acceptable
>>> > >> but still not necessary as crashing immediately from a NULL pointer is
>>> > >> sufficient.
>>> > >>
>>> > >> So IMHO mips should fix their soc init,
>>> > >
>>> > > Yes, for the mips fix, John has proposed to defer the calling of prom_soc_init(),
>>> > > which looks reasonable.
>>> > >
>>> > >> and we should look into the
>>> > >> CONFIG_TRACING=n variant of kmalloc_trace(), to pass orig_size properly.
>>> > >
>>> > > You mean check if the pointer is NULL and bail out early.
>>> >
>>> > No I mean here:
>>> >
>>> > #else /* CONFIG_TRACING */
>>> > /* Save a function call when CONFIG_TRACING=n */
>>> > static __always_inline __alloc_size(3)
>>> > void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
>>> > {
>>> > void *ret = kmem_cache_alloc(s, flags);
>>> >
>>> > ret = kasan_kmalloc(s, ret, size, flags);
>>> > return ret;
>>> > }
>>> >
>>> > we call kmem_cache_alloc() and discard the size parameter, so it will assume
>>> > s->object_size (and as the side-effect, crash if s is NULL). We shouldn't
>>> > add "s is NULL?" checks, but fix passing the size - probably switch to
>>> > __kmem_cache_alloc_node()? and in the following kmalloc_node_trace() analogically.
>>>
>>> Got it, thanks! I might have missed it during some rebasing for the
>>> kmalloc wastage debug patch.
>>
>> That was good catch and I missed too!
>> But FYI I'm suggesting to drop CONFIG_TRACING=n variant:
>>
>> https://lore.kernel.org/linux-mm/[email protected]/T/#m20ecf14390e406247bde0ea9cce368f469c539ed
>>
>> Any thoughts?
>
> I'll get to it, also I think we were pondering that within your series too,
> but I wanted to postpone in case somebody objects to the extra function call
> it creates.
> But that would be for 6.2 anyway while I'll collect the fix here for 6.1.

On second thought, the fix makes the inlined kmalloc_trace() expand to a
call with 5 parameters where it previously had 2, which seems to me like a
worse trade-off (code bloat) than the function call. Given the other reasons
to ditch the CONFIG_TRACING=n variant, I'm inclined to just do it right now.
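
To make the trade-off concrete, a simplified side-by-side of the two
shapes (kmalloc_trace_inline is a made-up name for illustration, and
the kasan call is omitted):

/* (a) inline + the fix above: five arguments must be materialized at
 * every kmalloc() call site
 */
static __always_inline void *kmalloc_trace_inline(struct kmem_cache *s,
						  gfp_t flags, size_t size)
{
	return __kmem_cache_alloc_node(s, flags, NUMA_NO_NODE,
				       size, _RET_IP_);
}

/* (b) always out-of-line: each call site passes just the original
 * three arguments; the rest is set up once inside mm/slub.c
 */
void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size);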

>>>
>>> How about the following fix?
>>>
>>> Thanks,
>>> Feng
>>>
>>> ---
>>> From 9f9fa9da8946fd44625f873c0f51167357075be1 Mon Sep 17 00:00:00 2001
>>> From: Feng Tang <[email protected]>
>>> Date: Thu, 3 Nov 2022 21:32:10 +0800
>>> Subject: [PATCH] mm/slub: Add missing orig_size parameter for wastage debug
>>>
>>> commit 6edf2576a6cc ("mm/slub: enable debugging memory wasting of
>>> kmalloc") was introduced for debugging kmalloc memory wastage,
>>> and it missed to pass the original request size for kmalloc_trace()
>>> and kmalloc_node_trace() in CONFIG_TRACING=n path.
>>>
>>> Fix it by using __kmem_cache_alloc_node() with correct original
>>> request size.
>>>
>>> Fixes: 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc")
>>> Suggested-by: Vlastimil Babka <[email protected]>
>>> Signed-off-by: Feng Tang <[email protected]>
>>> ---
>>> include/linux/slab.h | 9 +++++++--
>>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>>> index 90877fcde70b..9691afa569e1 100644
>>> --- a/include/linux/slab.h
>>> +++ b/include/linux/slab.h
>>> @@ -469,6 +469,9 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignm
>>> __alloc_size(1);
>>> void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
>>> __malloc;
>>> +void *__kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node,
>>> + size_t orig_size, unsigned long caller) __assume_slab_alignment
>>> + __malloc;
>>>
>>> #ifdef CONFIG_TRACING
>>> void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
>>> @@ -482,7 +485,8 @@ void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
>>> static __always_inline __alloc_size(3)
>>> void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
>>> {
>>> - void *ret = kmem_cache_alloc(s, flags);
>>> + void *ret = __kmem_cache_alloc_node(s, flags, NUMA_NO_NODE,
>>> + size, _RET_IP_);
>>>
>>> ret = kasan_kmalloc(s, ret, size, flags);
>>> return ret;
>>> @@ -492,7 +496,8 @@ static __always_inline __alloc_size(4)
>>> void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
>>> int node, size_t size)
>>> {
>>> - void *ret = kmem_cache_alloc_node(s, gfpflags, node);
>>> + void *ret = __kmem_cache_alloc_node(s, gfpflags, node,
>>> + size, _RET_IP_);
>>>
>>> ret = kasan_kmalloc(s, ret, size, gfpflags);
>>> return ret;
>>> --
>>> 2.34.1
>>>
>>>
>>>
>>
>


2022-11-04 03:59:04

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] mm/slub: enable debugging memory wasting of kmalloc

On Thu, Nov 03, 2022 at 06:35:53PM +0100, Vlastimil Babka wrote:

[...]

> >> But FYI I'm suggesting to drop CONFIG_TRACING=n variant:
> >>
> >> https://lore.kernel.org/linux-mm/[email protected]/T/#m20ecf14390e406247bde0ea9cce368f469c539ed
> >>
> >> Any thoughts?
> >
> > I'll get to it, also I think we were pondering that within your series too,
> > but I wanted to postpone in case somebody objects to the extra function call
> > it creates.
> > But that would be for 6.2 anyway while I'll collect the fix here for 6.1.
>
> On second thought, the fix is making the inlined kmalloc_trace() expand to a
> call that had 2 parameters and now it has 5, which seems to me like a worse
> thing (code bloat) than the function call. With the other reasons to ditch
> the CONFIG_TRACING=n variant I'm inclined to just do it right now.

That's great! It will save a lot of trouble and reduce code complexity.

Btw, the patch below also has a compile issue with some kconfigs
(thanks to 0Day's kbuild bot for the report).

Thanks,
Feng

> >>>
> >>> How about the following fix?
> >>>
> >>> Thanks,
> >>> Feng
> >>>
> >>> ---
> >>> From 9f9fa9da8946fd44625f873c0f51167357075be1 Mon Sep 17 00:00:00 2001
> >>> From: Feng Tang <[email protected]>
> >>> Date: Thu, 3 Nov 2022 21:32:10 +0800
> >>> Subject: [PATCH] mm/slub: Add missing orig_size parameter for wastage debug
> >>>
> >>> commit 6edf2576a6cc ("mm/slub: enable debugging memory wasting of
> >>> kmalloc") was introduced for debugging kmalloc memory wastage,
> >>> and it missed to pass the original request size for kmalloc_trace()
> >>> and kmalloc_node_trace() in CONFIG_TRACING=n path.
> >>>
> >>> Fix it by using __kmem_cache_alloc_node() with correct original
> >>> request size.
> >>>
> >>> Fixes: 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc")
> >>> Suggested-by: Vlastimil Babka <[email protected]>
> >>> Signed-off-by: Feng Tang <[email protected]>
> >>> ---
> >>> include/linux/slab.h | 9 +++++++--
> >>> 1 file changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/include/linux/slab.h b/include/linux/slab.h
> >>> index 90877fcde70b..9691afa569e1 100644
> >>> --- a/include/linux/slab.h
> >>> +++ b/include/linux/slab.h
> >>> @@ -469,6 +469,9 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignm
> >>> __alloc_size(1);
> >>> void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
> >>> __malloc;
> >>> +void *__kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node,
> >>> + size_t orig_size, unsigned long caller) __assume_slab_alignment
> >>> + __malloc;
> >>>
> >>> #ifdef CONFIG_TRACING
> >>> void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
> >>> @@ -482,7 +485,8 @@ void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
> >>> static __always_inline __alloc_size(3)
> >>> void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
> >>> {
> >>> - void *ret = kmem_cache_alloc(s, flags);
> >>> + void *ret = __kmem_cache_alloc_node(s, flags, NUMA_NO_NODE,
> >>> + size, _RET_IP_);
> >>>
> >>> ret = kasan_kmalloc(s, ret, size, flags);
> >>> return ret;
> >>> @@ -492,7 +496,8 @@ static __always_inline __alloc_size(4)
> >>> void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
> >>> int node, size_t size)
> >>> {
> >>> - void *ret = kmem_cache_alloc_node(s, gfpflags, node);
> >>> + void *ret = __kmem_cache_alloc_node(s, gfpflags, node,
> >>> + size, _RET_IP_);
> >>>
> >>> ret = kasan_kmalloc(s, ret, size, gfpflags);
> >>> return ret;
> >>> --
> >>> 2.34.1
> >>>
> >>>
> >>>
> >>
> >
>
>