Memory allocation profiling infrastructure provides a low overhead
mechanism to make all kernel allocations in the system visible. It can be
used to monitor memory usage, track memory hotspots, detect memory leaks,
and identify memory regressions.
To keep the overhead to a minimum, we record only allocation sizes for
every allocation in the codebase. With that information, if users are
interested in more detailed context for a specific allocation, they can
enable in-depth context tracking, which includes capturing the pid, tgid,
task name, allocation size, timestamp and call stack for every allocation
at the specified code location.
The data is exposed to the user space via a read-only debugfs file called
allocations. Usage example:
$ sort -hr /sys/kernel/debug/allocations|head
153MiB 8599 mm/slub.c:1826 module:slub func:alloc_slab_page
6.08MiB 49 mm/slab_common.c:950 module:slab_common func:_kmalloc_order
5.09MiB 6335 mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
4.54MiB 78 mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
1.32MiB 338 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
1.16MiB 603 fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
1.00MiB 256 mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
734KiB 5380 fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
640KiB 160 kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
640KiB 160 drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf
For allocation context capture, a new debugfs file called allocations.ctx
is used to select which code location should capture allocation context
and to read captured context information. Usage example:
$ cd /sys/kernel/debug/
$ echo "file include/asm-generic/pgalloc.h line 63 enable" > allocations.ctx
$ cat allocations.ctx
920KiB 230 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
size: 4096
pid: 1474
tgid: 1474
comm: bash
ts: 175332940994
call stack:
pte_alloc_one+0xfe/0x130
__pte_alloc+0x22/0xb0
copy_page_range+0x842/0x1640
dup_mm+0x42d/0x580
copy_process+0xfb1/0x1ac0
kernel_clone+0x92/0x3e0
__do_sys_clone+0x66/0x90
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
...
The implementation uses the more generic concept of code tagging, introduced
as part of this patchset. A code tag is a structure that identifies a specific
location in the source code; it is generated at compile time and can be
embedded in an application-specific structure. A number of applications
for code tagging have been presented in the original RFC [1].
Code tagging uses the old trick of "define a special elf section for
objects of a given type so that we can iterate over them at runtime" and
creates a proper library for it.
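The section trick can be illustrated outside the kernel. The following is a
minimal user-space sketch, assuming GCC or Clang with GNU ld on an ELF
target. struct demo_tag, DEMO_TAG_HERE() and the "demo_tags" section name
are hypothetical names chosen for illustration; they are not the API added
by this series:

```c
#include <stddef.h>

/* Hypothetical tag type: one instance per instrumented source location. */
struct demo_tag {
	const char *file;
	const char *func;
	long line;
	long bytes;
};

/*
 * Emit a static tag into the "demo_tags" ELF section and evaluate to its
 * address. This relies on a GNU statement expression; "used" keeps the
 * object even if the compiler considers it unreferenced, and the explicit
 * alignment keeps the entries packed like an array.
 */
#define DEMO_TAG_HERE() ({						\
	static struct demo_tag _tag					\
		__attribute__((section("demo_tags"), used, aligned(8))) = \
		{ __FILE__, __func__, __LINE__, 0 };			\
	&_tag; })

/* GNU ld provides these for a section whose name is a valid C identifier. */
extern struct demo_tag __start_demo_tags[];
extern struct demo_tag __stop_demo_tags[];

/* Two independent "call sites", each with its own tag. */
void demo_site_a(void) { DEMO_TAG_HERE()->bytes += 64; }
void demo_site_b(void) { DEMO_TAG_HERE()->bytes += 4096; }

/* Number of tags the linker collected into the section. */
size_t demo_tag_count(void)
{
	return (size_t)(__stop_demo_tags - __start_demo_tags);
}

/* Walk every tag at runtime without any registration step. */
long demo_tag_total(void)
{
	long total = 0;

	for (struct demo_tag *t = __start_demo_tags; t < __stop_demo_tags; t++)
		total += t->bytes;
	return total;
}
```

Calling demo_site_a() and demo_site_b() and then demo_tag_total() sums the
counters of both sites, even though neither site registered itself anywhere.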
To profile memory allocations, we instrument the page, slab and percpu
allocators to record the total memory allocated in the associated code tag
at every allocation in the codebase. Every time an allocation is performed
by an instrumented allocator, the code tag at that location increments its
counter by the allocation size. Every time the memory is freed, the counter
is decremented. To decrement the counter upon freeing, each allocated object
needs a reference to its code tag. Page allocators use page_ext to record
this reference, while slab allocators use memcg_data (renamed into the more
generic slabobj_ext) of the slab page.
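The accounting scheme can be sketched in user space as follows. This is an
illustration only, under the assumption that the tag reference is kept in a
small header in front of each object; the kernel instead keeps it out of
line in page_ext or slabobj_ext as described above. All demo_* names are
hypothetical:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-call-site counters. */
struct demo_alloc_tag {
	const char *site;
	long bytes;
	long calls;
};

/* Header placed in front of every object so that free can find the tag. */
struct demo_hdr {
	struct demo_alloc_tag *tag;
	size_t size;
};

/* Allocate and charge the size to the given tag. */
void *demo_tagged_alloc(struct demo_alloc_tag *tag, size_t size)
{
	struct demo_hdr *h = malloc(sizeof(*h) + size);

	if (!h)
		return NULL;
	h->tag = tag;
	h->size = size;
	tag->bytes += (long)size;
	tag->calls++;
	return h + 1;
}

/* Free and uncharge, using the reference saved at allocation time. */
void demo_tagged_free(void *p)
{
	struct demo_hdr *h;

	if (!p)
		return;
	h = (struct demo_hdr *)p - 1;
	h->tag->bytes -= (long)h->size;
	h->tag->calls--;
	free(h);
}
```

The free path never needs to know which call site allocated the object; the
saved reference is enough to find the counter to decrement.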
Module allocations are accounted for the same way as other kernel allocations.
Module loading and unloading are supported. If a module is unloaded while
one or more of its allocations have not been freed (a rather rare condition),
its data section is kept in memory so that the code tags can still be
referenced when those allocations are eventually freed.
As part of this series we introduce several kernel configs:
CONFIG_CODE_TAGGING - to enable code tagging framework
CONFIG_MEM_ALLOC_PROFILING - to enable memory allocation profiling
CONFIG_MEM_ALLOC_PROFILING_DEBUG - to enable memory allocation profiling
validation
Note: CONFIG_MEM_ALLOC_PROFILING enables CONFIG_PAGE_EXTENSION to store
code tag reference in the page_ext object.
The nomem_profiling kernel command-line parameter is also provided to
disable the functionality and avoid the performance overhead.
Performance overhead:
To evaluate performance we implemented an in-kernel test that executes
multiple get_free_page/free_page and kmalloc/kfree calls with allocation
sizes growing from 8 to 240 bytes. To minimize noise, the CPU frequency was
set to max and CPU affinity was set to a specific CPU. Below is a performance
comparison between the baseline kernel, profiling when enabled, profiling
when disabled (nomem_profiling=y) and (for comparison purposes) baseline
with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:
                        kmalloc                 pgalloc
Baseline (6.3-rc7)      9.200s                  31.050s
profiling disabled      9.800s (+6.52%)         32.600s (+4.99%)
profiling enabled       12.500s (+35.87%)       39.010s (+25.60%)
memcg_kmem enabled      41.400s (+350.00%)      70.600s (+127.38%)
[1] https://lore.kernel.org/all/[email protected]/
Kent Overstreet (15):
lib/string_helpers: Drop space in string_get_size's output
scripts/kallsyms: Always include __start and __stop symbols
fs: Convert alloc_inode_sb() to a macro
nodemask: Split out include/linux/nodemask_types.h
prandom: Remove unused include
lib/string.c: strsep_no_empty()
Lazy percpu counters
lib: code tagging query helper functions
mm/slub: Mark slab_free_freelist_hook() __always_inline
mempool: Hook up to memory allocation profiling
timekeeping: Fix a circular include dependency
mm: percpu: Introduce pcpuobj_ext
mm: percpu: Add codetag reference into pcpuobj_ext
arm64: Fix circular header dependency
MAINTAINERS: Add entries for code tagging and memory allocation
profiling
Suren Baghdasaryan (25):
mm: introduce slabobj_ext to support slab object extensions
mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext
creation
mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation
mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache
objects
slab: objext: introduce objext_flags as extension to
page_memcg_data_flags
lib: code tagging framework
lib: code tagging module support
lib: prevent module unloading if memory is not freed
lib: add allocation tagging support for memory allocation profiling
lib: introduce support for page allocation tagging
change alloc_pages name in dma_map_ops to avoid name conflicts
mm: enable page allocation tagging
mm/page_ext: enable early_page_ext when
CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
mm: create new codetag references during page splitting
lib: add codetag reference into slabobj_ext
mm/slab: add allocation accounting into slab allocation and free paths
mm/slab: enable slab allocation tagging for kmalloc and friends
mm: percpu: enable per-cpu allocation tagging
move stack capture functionality into a separate function for reuse
lib: code tagging context capture support
lib: implement context capture support for tagged allocations
lib: add memory allocations report in show_mem()
codetag: debug: skip objext checking when it's for objext itself
codetag: debug: mark codetags for reserved pages as empty
codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext
allocations
.../admin-guide/kernel-parameters.txt | 2 +
MAINTAINERS | 22 +
arch/arm64/include/asm/spectre.h | 4 +-
arch/x86/kernel/amd_gart_64.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/xen/grant-dma-ops.c | 2 +-
drivers/xen/swiotlb-xen.c | 2 +-
include/asm-generic/codetag.lds.h | 14 +
include/asm-generic/vmlinux.lds.h | 3 +
include/linux/alloc_tag.h | 161 ++++++
include/linux/codetag.h | 159 ++++++
include/linux/codetag_ctx.h | 48 ++
include/linux/dma-map-ops.h | 2 +-
include/linux/fs.h | 6 +-
include/linux/gfp.h | 123 ++--
include/linux/gfp_types.h | 12 +-
include/linux/hrtimer.h | 2 +-
include/linux/lazy-percpu-counter.h | 102 ++++
include/linux/memcontrol.h | 56 +-
include/linux/mempool.h | 73 ++-
include/linux/mm.h | 8 +
include/linux/mm_types.h | 4 +-
include/linux/nodemask.h | 2 +-
include/linux/nodemask_types.h | 9 +
include/linux/page_ext.h | 1 -
include/linux/pagemap.h | 9 +-
include/linux/percpu.h | 19 +-
include/linux/pgalloc_tag.h | 95 ++++
include/linux/prandom.h | 1 -
include/linux/sched.h | 32 +-
include/linux/slab.h | 182 +++---
include/linux/slab_def.h | 2 +-
include/linux/slub_def.h | 4 +-
include/linux/stackdepot.h | 16 +
include/linux/string.h | 1 +
include/linux/time_namespace.h | 2 +
init/Kconfig | 4 +
kernel/dma/mapping.c | 4 +-
kernel/module/main.c | 25 +-
lib/Kconfig | 3 +
lib/Kconfig.debug | 26 +
lib/Makefile | 5 +
lib/alloc_tag.c | 464 +++++++++++++++
lib/codetag.c | 529 ++++++++++++++++++
lib/lazy-percpu-counter.c | 127 +++++
lib/show_mem.c | 15 +
lib/stackdepot.c | 68 +++
lib/string.c | 19 +
lib/string_helpers.c | 3 +-
mm/compaction.c | 9 +-
mm/filemap.c | 6 +-
mm/huge_memory.c | 2 +
mm/kfence/core.c | 14 +-
mm/kfence/kfence.h | 4 +-
mm/memcontrol.c | 56 +-
mm/mempolicy.c | 30 +-
mm/mempool.c | 28 +-
mm/mm_init.c | 1 +
mm/page_alloc.c | 75 ++-
mm/page_ext.c | 21 +-
mm/page_owner.c | 54 +-
mm/percpu-internal.h | 26 +-
mm/percpu.c | 122 ++--
mm/slab.c | 22 +-
mm/slab.h | 224 ++++++--
mm/slab_common.c | 95 +++-
mm/slub.c | 24 +-
mm/util.c | 10 +-
scripts/kallsyms.c | 13 +
scripts/module.lds.S | 7 +
70 files changed, 2765 insertions(+), 554 deletions(-)
create mode 100644 include/asm-generic/codetag.lds.h
create mode 100644 include/linux/alloc_tag.h
create mode 100644 include/linux/codetag.h
create mode 100644 include/linux/codetag_ctx.h
create mode 100644 include/linux/lazy-percpu-counter.h
create mode 100644 include/linux/nodemask_types.h
create mode 100644 include/linux/pgalloc_tag.h
create mode 100644 lib/alloc_tag.c
create mode 100644 lib/codetag.c
create mode 100644 lib/lazy-percpu-counter.c
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
Previously, string_get_size() put a space between the number and
the units, e.g.
9.88 MiB
This changes it to
9.88MiB
which allows it to be parsed correctly by the 'sort -h' command.
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: "Noralf Trønnes" <[email protected]>
Cc: Jens Axboe <[email protected]>
---
lib/string_helpers.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/lib/string_helpers.c b/lib/string_helpers.c
index 230020a2e076..593b29fece32 100644
--- a/lib/string_helpers.c
+++ b/lib/string_helpers.c
@@ -126,8 +126,7 @@ void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
else
unit = units_str[units][i];
- snprintf(buf, len, "%u%s %s", (u32)size,
- tmp, unit);
+ snprintf(buf, len, "%u%s%s", (u32)size, tmp, unit);
}
EXPORT_SYMBOL(string_get_size);
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
These symbols are used to denote section boundaries: by always including
them we can unify loading sections from modules with loading built-in
sections, which leads to some significant cleanup.
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
scripts/kallsyms.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 0d2db41177b2..7b7dbeb5bd6e 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -203,6 +203,11 @@ static int symbol_in_range(const struct sym_entry *s,
return 0;
}
+static bool string_starts_with(const char *s, const char *prefix)
+{
+ return strncmp(s, prefix, strlen(prefix)) == 0;
+}
+
static int symbol_valid(const struct sym_entry *s)
{
const char *name = sym_name(s);
@@ -210,6 +215,14 @@ static int symbol_valid(const struct sym_entry *s)
/* if --all-symbols is not specified, then symbols outside the text
* and inittext sections are discarded */
if (!all_symbols) {
+ /*
+ * Symbols starting with __start and __stop are used to denote
+ * section boundaries, and should always be included:
+ */
+ if (string_starts_with(name, "__start_") ||
+ string_starts_with(name, "__stop_"))
+ return 1;
+
if (symbol_in_range(s, text_ranges,
ARRAY_SIZE(text_ranges)) == 0)
return 0;
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
We're introducing alloc tagging, which tracks memory allocations by
callsite. Converting alloc_inode_sb() to a macro means allocations will
be attributed to its caller rather than to the helper itself, which is a
bit more useful.
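The difference in attribution can be seen in a small user-space sketch. The
names below (struct demo_site, demo_where_inline(), demo_where_macro(),
demo_fs_alloc_inode()) are hypothetical and only illustrate where
__func__ and __LINE__ get evaluated; they are not part of this patch:

```c
#include <string.h>

/* Hypothetical record of a source location. */
struct demo_site {
	const char *func;
	int line;
};

/* An inline helper records its own location, whoever calls it. */
static inline struct demo_site demo_where_inline(void)
{
	struct demo_site s = { __func__, __LINE__ };

	return s;
}

/* A macro expands in the caller, so it records the caller's location. */
#define demo_where_macro() ((struct demo_site){ __func__, __LINE__ })

/* Hypothetical caller standing in for a filesystem's inode allocation. */
void demo_fs_alloc_inode(struct demo_site *via_inline,
			 struct demo_site *via_macro)
{
	*via_inline = demo_where_inline();
	*via_macro = demo_where_macro();
}

/* Return non-zero if the recorded function name matches. */
int demo_site_is(const struct demo_site *s, const char *func)
{
	return strcmp(s->func, func) == 0;
}
```

With the inline helper every filesystem would share a single location
inside the helper; with the macro each caller gets its own.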
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Alexander Viro <[email protected]>
---
include/linux/fs.h | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21a981680856..4905ce14db0b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2699,11 +2699,7 @@ int setattr_should_drop_sgid(struct mnt_idmap *idmap,
* This must be used for allocating filesystems specific inodes to set
* up the inode reclaim context correctly.
*/
-static inline void *
-alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
-{
- return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
-}
+#define alloc_inode_sb(_sb, _cache, _gfp) kmem_cache_alloc_lru(_cache, &(_sb)->s_inode_lru, _gfp)
extern void __insert_inode_hash(struct inode *, unsigned long hashval);
static inline void insert_inode_hash(struct inode *inode)
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
prandom.h doesn't use percpu.h - this fixes some circular header issues.
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/prandom.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index f2ed5b72b3d6..f7f1e5251c67 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -10,7 +10,6 @@
#include <linux/types.h>
#include <linux/once.h>
-#include <linux/percpu.h>
#include <linux/random.h>
struct rnd_state {
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
sched.h, which defines task_struct, needs nodemask_t - but sched.h is a
frequently used header and ideally shouldn't be pulling in any more code
than it needs to.
This splits out nodemask_types.h which has the definition sched.h needs,
which will avoid a circular header dependency in the alloc tagging patch
series, and as a bonus should speed up kernel build times.
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
include/linux/nodemask.h | 2 +-
include/linux/nodemask_types.h | 9 +++++++++
include/linux/sched.h | 2 +-
3 files changed, 11 insertions(+), 2 deletions(-)
create mode 100644 include/linux/nodemask_types.h
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index bb0ee80526b2..fda37b6df274 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -93,10 +93,10 @@
#include <linux/threads.h>
#include <linux/bitmap.h>
#include <linux/minmax.h>
+#include <linux/nodemask_types.h>
#include <linux/numa.h>
#include <linux/random.h>
-typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
extern nodemask_t _unused_nodemask_arg_;
/**
diff --git a/include/linux/nodemask_types.h b/include/linux/nodemask_types.h
new file mode 100644
index 000000000000..84c2f47c4237
--- /dev/null
+++ b/include/linux/nodemask_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_NODEMASK_TYPES_H
+#define __LINUX_NODEMASK_TYPES_H
+
+#include <linux/numa.h>
+
+typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
+
+#endif /* __LINUX_NODEMASK_TYPES_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index eed5d65b8d1f..35e7efdea2d9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -20,7 +20,7 @@
#include <linux/hrtimer.h>
#include <linux/irqflags.h>
#include <linux/seccomp.h>
-#include <linux/nodemask.h>
+#include <linux/nodemask_types.h>
#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <linux/resource.h>
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
This adds a new helper which is like strsep, except that it skips empty
tokens.
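The behavioural difference from strsep() can be sketched in user space. The
demo_strsep() function below is a portable stand-in built on strpbrk(), an
assumption made so the example does not depend on a libc strsep(); the
retry loop in demo_strsep_no_empty() mirrors the helper added here:

```c
#include <stddef.h>
#include <string.h>

/* Portable stand-in for strsep(), built on standard strpbrk(). */
char *demo_strsep(char **s, const char *ct)
{
	char *start = *s;
	char *end;

	if (!start)
		return NULL;
	end = strpbrk(start, ct);
	if (end)
		*end++ = '\0';
	*s = end;
	return start;
}

/* Same loop as the new helper: retry while the token is empty. */
char *demo_strsep_no_empty(char **s, const char *ct)
{
	char *ret;

	do {
		ret = demo_strsep(s, ct);
	} while (ret && !*ret);

	return ret;
}

/* Count tokens in buf, which is modified in place. */
size_t demo_count_tokens(char *buf, const char *ct, int skip_empty)
{
	size_t n = 0;

	while (skip_empty ? demo_strsep_no_empty(&buf, ct)
			  : demo_strsep(&buf, ct))
		n++;
	return n;
}
```

For the input ",a,,b," with "," as the delimiter, the plain variant yields
five tokens (three of them empty), while the no-empty variant yields only
"a" and "b".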
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/string.h | 1 +
lib/string.c | 19 +++++++++++++++++++
2 files changed, 20 insertions(+)
diff --git a/include/linux/string.h b/include/linux/string.h
index c062c581a98b..6cd5451c262c 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -96,6 +96,7 @@ extern char * strpbrk(const char *,const char *);
#ifndef __HAVE_ARCH_STRSEP
extern char * strsep(char **,const char *);
#endif
+extern char *strsep_no_empty(char **, const char *);
#ifndef __HAVE_ARCH_STRSPN
extern __kernel_size_t strspn(const char *,const char *);
#endif
diff --git a/lib/string.c b/lib/string.c
index 3d55ef890106..dd4914baf45a 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -520,6 +520,25 @@ char *strsep(char **s, const char *ct)
EXPORT_SYMBOL(strsep);
#endif
+/**
+ * strsep_no_empty - Split a string into tokens, but don't return empty tokens
+ * @s: The string to be searched
+ * @ct: The characters to search for
+ *
+ * strsep_no_empty() updates @s to point after the token, ready for the next call.
+ */
+char *strsep_no_empty(char **s, const char *ct)
+{
+ char *ret;
+
+ do {
+ ret = strsep(s, ct);
+ } while (ret && !*ret);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(strsep_no_empty);
+
#ifndef __HAVE_ARCH_MEMSET
/**
* memset - Fill a region of memory with the given value
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
This patch adds lib/lazy-percpu-counter.c, which implements counters
that start out as atomics, but lazily switch to percpu mode if the
update rate crosses some threshold (arbitrarily set at 256 per second).
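The atomic-mode encoding used by this patch (low bit as the mode flag, high
8 bits as the modification counter, the value in between) can be exercised
in user space. The sketch below is non-atomic and uses hypothetical DEMO_*
and demo_* names; it also assumes an arithmetic right shift for negative
signed values, which GCC and Clang provide:

```c
#include <stdint.h>

#define DEMO_MOD_BITS		8
#define DEMO_MOD_MASK		(~(~0ULL >> DEMO_MOD_BITS))
#define DEMO_MOD_BITS_START	(64 - DEMO_MOD_BITS)
#define DEMO_IS_PCPU_BIT	1

/*
 * Add i to an atomic-mode word: shift the value past the mode bit, keep it
 * out of the high bits, and bump the modification counter by one.
 */
uint64_t demo_atomic_add(uint64_t v, int64_t i)
{
	uint64_t delta = (uint64_t)i << DEMO_IS_PCPU_BIT;

	delta &= ~DEMO_MOD_MASK;
	delta |= 1ULL << DEMO_MOD_BITS_START;
	return v + delta;
}

/* Extract the signed value: drop the high bits, then sign-extend. */
int64_t demo_atomic_val(uint64_t v)
{
	return (int64_t)(v << DEMO_MOD_BITS) >>
		(DEMO_MOD_BITS + DEMO_IS_PCPU_BIT);
}

/* Read the modification counter stored in the high bits. */
unsigned int demo_mod_count(uint64_t v)
{
	return (unsigned int)(v >> DEMO_MOD_BITS_START);
}
```

Adding 5 and then -7 to a zeroed word reads back as -2 with a modification
count of 2; after 256 additions of 1 the modification count wraps to 0,
which is the point at which the real code compares timestamps to decide
whether to switch to percpu mode.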
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/lazy-percpu-counter.h | 102 ++++++++++++++++++++++
lib/Kconfig | 3 +
lib/Makefile | 2 +
lib/lazy-percpu-counter.c | 127 ++++++++++++++++++++++++++++
4 files changed, 234 insertions(+)
create mode 100644 include/linux/lazy-percpu-counter.h
create mode 100644 lib/lazy-percpu-counter.c
diff --git a/include/linux/lazy-percpu-counter.h b/include/linux/lazy-percpu-counter.h
new file mode 100644
index 000000000000..45ca9e2ce58b
--- /dev/null
+++ b/include/linux/lazy-percpu-counter.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Lazy percpu counters:
+ * (C) 2022 Kent Overstreet
+ *
+ * Lazy percpu counters start out in atomic mode, then switch to percpu mode if
+ * the update rate crosses some threshold.
+ *
+ * This means we don't have to decide between low memory overhead atomic
+ * counters and higher performance percpu counters - we can have our cake and
+ * eat it, too!
+ *
+ * Internally we use an atomic64_t, where the low bit indicates whether we're in
+ * percpu mode, and the high 8 bits are a secondary counter that's incremented
+ * when the counter is modified - meaning 55 bits of precision are available for
+ * the counter itself.
+ */
+
+#ifndef _LINUX_LAZY_PERCPU_COUNTER_H
+#define _LINUX_LAZY_PERCPU_COUNTER_H
+
+#include <linux/atomic.h>
+#include <asm/percpu.h>
+
+struct lazy_percpu_counter {
+ atomic64_t v;
+ unsigned long last_wrap;
+};
+
+void lazy_percpu_counter_exit(struct lazy_percpu_counter *c);
+void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i);
+void lazy_percpu_counter_add_slowpath_noupgrade(struct lazy_percpu_counter *c, s64 i);
+s64 lazy_percpu_counter_read(struct lazy_percpu_counter *c);
+
+/*
+ * We use the high bits of the atomic counter for a secondary counter, which is
+ * incremented every time the counter is touched. When the secondary counter
+ * wraps, we check the time the counter last wrapped, and if it was recent
+ * enough that means the update frequency has crossed our threshold and we
+ * switch to percpu mode:
+ */
+#define COUNTER_MOD_BITS 8
+#define COUNTER_MOD_MASK ~(~0ULL >> COUNTER_MOD_BITS)
+#define COUNTER_MOD_BITS_START (64 - COUNTER_MOD_BITS)
+
+/*
+ * We use the low bit of the counter to indicate whether we're in atomic mode
+ * (low bit clear), or percpu mode (low bit set, counter is a pointer to actual
+ * percpu counters):
+ */
+#define COUNTER_IS_PCPU_BIT 1
+
+static inline u64 __percpu *lazy_percpu_counter_is_pcpu(u64 v)
+{
+ if (!(v & COUNTER_IS_PCPU_BIT))
+ return NULL;
+
+ v ^= COUNTER_IS_PCPU_BIT;
+ return (u64 __percpu *)(unsigned long)v;
+}
+
+/**
+ * lazy_percpu_counter_add: Add a value to a lazy_percpu_counter
+ *
+ * @c: counter to modify
+ * @i: value to add
+ */
+static inline void lazy_percpu_counter_add(struct lazy_percpu_counter *c, s64 i)
+{
+ u64 v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+ if (likely(pcpu_v))
+ this_cpu_add(*pcpu_v, i);
+ else
+ lazy_percpu_counter_add_slowpath(c, i);
+}
+
+/**
+ * lazy_percpu_counter_add_noupgrade: Add a value to a lazy_percpu_counter,
+ * without upgrading to percpu mode
+ *
+ * @c: counter to modify
+ * @i: value to add
+ */
+static inline void lazy_percpu_counter_add_noupgrade(struct lazy_percpu_counter *c, s64 i)
+{
+ u64 v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+ if (likely(pcpu_v))
+ this_cpu_add(*pcpu_v, i);
+ else
+ lazy_percpu_counter_add_slowpath_noupgrade(c, i);
+}
+
+static inline void lazy_percpu_counter_sub(struct lazy_percpu_counter *c, s64 i)
+{
+ lazy_percpu_counter_add(c, -i);
+}
+
+#endif /* _LINUX_LAZY_PERCPU_COUNTER_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 5c2da561c516..7380292a8fcd 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -505,6 +505,9 @@ config ASSOCIATIVE_ARRAY
for more information.
+config LAZY_PERCPU_COUNTER
+ bool
+
config HAS_IOMEM
bool
depends on !NO_IOMEM
diff --git a/lib/Makefile b/lib/Makefile
index 876fcdeae34e..293a0858a3f8 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -164,6 +164,8 @@ obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o
+obj-$(CONFIG_LAZY_PERCPU_COUNTER) += lazy-percpu-counter.o
+
obj-$(CONFIG_BITREVERSE) += bitrev.o
obj-$(CONFIG_LINEAR_RANGES) += linear_ranges.o
obj-$(CONFIG_PACKING) += packing.o
diff --git a/lib/lazy-percpu-counter.c b/lib/lazy-percpu-counter.c
new file mode 100644
index 000000000000..4f4e32c2dc09
--- /dev/null
+++ b/lib/lazy-percpu-counter.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/atomic.h>
+#include <linux/gfp.h>
+#include <linux/jiffies.h>
+#include <linux/lazy-percpu-counter.h>
+#include <linux/percpu.h>
+
+static inline s64 lazy_percpu_counter_atomic_val(s64 v)
+{
+ /* Ensure output is sign extended properly: */
+ return (v << COUNTER_MOD_BITS) >>
+ (COUNTER_MOD_BITS + COUNTER_IS_PCPU_BIT);
+}
+
+static void lazy_percpu_counter_switch_to_pcpu(struct lazy_percpu_counter *c)
+{
+ u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
+ u64 old, new, v;
+
+ if (!pcpu_v)
+ return;
+
+ preempt_disable();
+ v = atomic64_read(&c->v);
+ do {
+ if (lazy_percpu_counter_is_pcpu(v)) {
+ free_percpu(pcpu_v);
+ return;
+ }
+
+ old = v;
+ new = (unsigned long)pcpu_v | 1;
+
+ *this_cpu_ptr(pcpu_v) = lazy_percpu_counter_atomic_val(v);
+ } while ((v = atomic64_cmpxchg(&c->v, old, new)) != old);
+ preempt_enable();
+}
+
+/**
+ * lazy_percpu_counter_exit: Free resources associated with a
+ * lazy_percpu_counter
+ *
+ * @c: counter to exit
+ */
+void lazy_percpu_counter_exit(struct lazy_percpu_counter *c)
+{
+ free_percpu(lazy_percpu_counter_is_pcpu(atomic64_read(&c->v)));
+}
+EXPORT_SYMBOL_GPL(lazy_percpu_counter_exit);
+
+/**
+ * lazy_percpu_counter_read: Read current value of a lazy_percpu_counter
+ *
+ * @c: counter to read
+ */
+s64 lazy_percpu_counter_read(struct lazy_percpu_counter *c)
+{
+ s64 v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+ if (pcpu_v) {
+ int cpu;
+
+ v = 0;
+ for_each_possible_cpu(cpu)
+ v += *per_cpu_ptr(pcpu_v, cpu);
+ } else {
+ v = lazy_percpu_counter_atomic_val(v);
+ }
+
+ return v;
+}
+EXPORT_SYMBOL_GPL(lazy_percpu_counter_read);
+
+void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i)
+{
+ u64 atomic_i;
+ u64 old, v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v;
+
+ atomic_i = i << COUNTER_IS_PCPU_BIT;
+ atomic_i &= ~COUNTER_MOD_MASK;
+ atomic_i |= 1ULL << COUNTER_MOD_BITS_START;
+
+ do {
+ pcpu_v = lazy_percpu_counter_is_pcpu(v);
+ if (pcpu_v) {
+ this_cpu_add(*pcpu_v, i);
+ return;
+ }
+
+ old = v;
+ } while ((v = atomic64_cmpxchg(&c->v, old, old + atomic_i)) != old);
+
+ if (unlikely(!(v & COUNTER_MOD_MASK))) {
+ unsigned long now = jiffies;
+
+ if (c->last_wrap &&
+ unlikely(time_after(c->last_wrap + HZ, now)))
+ lazy_percpu_counter_switch_to_pcpu(c);
+ else
+ c->last_wrap = now;
+ }
+}
+EXPORT_SYMBOL(lazy_percpu_counter_add_slowpath);
+
+void lazy_percpu_counter_add_slowpath_noupgrade(struct lazy_percpu_counter *c, s64 i)
+{
+ u64 atomic_i;
+ u64 old, v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v;
+
+ atomic_i = i << COUNTER_IS_PCPU_BIT;
+ atomic_i &= ~COUNTER_MOD_MASK;
+
+ do {
+ pcpu_v = lazy_percpu_counter_is_pcpu(v);
+ if (pcpu_v) {
+ this_cpu_add(*pcpu_v, i);
+ return;
+ }
+
+ old = v;
+ } while ((v = atomic64_cmpxchg(&c->v, old, old + atomic_i)) != old);
+}
+EXPORT_SYMBOL(lazy_percpu_counter_add_slowpath_noupgrade);
--
2.40.1.495.gc816e09b53d-goog
Currently slab pages can store only vectors of obj_cgroup pointers in
page->memcg_data. Introduce the slabobj_ext structure to allow more data
to be stored for each slab object. Wrap obj_cgroup into slabobj_ext
to support current functionality while allowing slabobj_ext to be extended
in the future.
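The idea of a flagged pointer to a per-object extension vector can be
sketched in user space. The DEMO_* constants and demo_* functions below are
hypothetical stand-ins for MEMCG_DATA_OBJEXTS, the flags mask and the
slab_obj_exts() style accessors; the sketch assumes the vector is at least
8-byte aligned so the low bits are free to carry flags:

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_DATA_OBJEXTS	(1UL << 0)		/* word points to a vector */
#define DEMO_DATA_FLAGS_MASK	((1UL << 2) - 1)	/* all low flag bits */

/* One extension record per slab object; more fields can be added later. */
struct demo_slabobj_ext {
	void *objcg;	/* stand-in for struct obj_cgroup * */
} __attribute__((aligned(8)));

/* Store the vector address and the "is a vector" flag in one word. */
uintptr_t demo_pack_obj_exts(struct demo_slabobj_ext *vec)
{
	return (uintptr_t)vec | DEMO_DATA_OBJEXTS;
}

/* Recover the vector, or NULL if the word does not hold one. */
struct demo_slabobj_ext *demo_unpack_obj_exts(uintptr_t data)
{
	if (!(data & DEMO_DATA_OBJEXTS))
		return NULL;
	return (struct demo_slabobj_ext *)
		(data & ~(uintptr_t)DEMO_DATA_FLAGS_MASK);
}

/* Look up the cgroup reference of the object at the given index. */
void *demo_obj_cgroup(uintptr_t data, size_t index)
{
	struct demo_slabobj_ext *vec = demo_unpack_obj_exts(data);

	return vec ? vec[index].objcg : NULL;
}
```

Because callers index into an array of structures instead of an array of
bare pointers, adding a field to the structure does not change how the
vector is located or indexed.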
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/memcontrol.h | 20 +++--
include/linux/mm_types.h | 4 +-
init/Kconfig | 4 +
mm/kfence/core.c | 14 ++--
mm/kfence/kfence.h | 4 +-
mm/memcontrol.c | 56 ++------------
mm/page_owner.c | 2 +-
mm/slab.h | 148 +++++++++++++++++++++++++------------
mm/slab_common.c | 47 ++++++++++++
9 files changed, 185 insertions(+), 114 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 222d7370134c..b9fd9732a52b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -339,8 +339,8 @@ struct mem_cgroup {
extern struct mem_cgroup *root_mem_cgroup;
enum page_memcg_data_flags {
- /* page->memcg_data is a pointer to an objcgs vector */
- MEMCG_DATA_OBJCGS = (1UL << 0),
+ /* page->memcg_data is a pointer to a slabobj_ext vector */
+ MEMCG_DATA_OBJEXTS = (1UL << 0),
/* page has been accounted as a non-slab kernel page */
MEMCG_DATA_KMEM = (1UL << 1),
/* the next bit after the last actual flag */
@@ -378,7 +378,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;
VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -399,7 +399,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;
VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -496,7 +496,7 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
*/
unsigned long memcg_data = READ_ONCE(folio->memcg_data);
- if (memcg_data & MEMCG_DATA_OBJCGS)
+ if (memcg_data & MEMCG_DATA_OBJEXTS)
return NULL;
if (memcg_data & MEMCG_DATA_KMEM) {
@@ -542,7 +542,7 @@ static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
static inline bool folio_memcg_kmem(struct folio *folio)
{
VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
- VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
return folio->memcg_data & MEMCG_DATA_KMEM;
}
@@ -1606,6 +1606,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
}
#endif /* CONFIG_MEMCG */
+/*
+ * Extended information for slab objects stored as an array in page->memcg_data
+ * if MEMCG_DATA_OBJEXTS is set.
+ */
+struct slabobj_ext {
+ struct obj_cgroup *objcg;
+} __aligned(8);
+
static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
{
__mod_lruvec_kmem_state(p, idx, 1);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..e79303e1e30c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -194,7 +194,7 @@ struct page {
/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
atomic_t _refcount;
-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_SLAB_OBJ_EXT
unsigned long memcg_data;
#endif
@@ -320,7 +320,7 @@ struct folio {
void *private;
atomic_t _mapcount;
atomic_t _refcount;
-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_SLAB_OBJ_EXT
unsigned long memcg_data;
#endif
/* private: the union with struct page is transitional */
diff --git a/init/Kconfig b/init/Kconfig
index 32c24950c4ce..44267919a2a2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -936,10 +936,14 @@ config CGROUP_FAVOR_DYNMODS
Say N if unsure.
+config SLAB_OBJ_EXT
+ bool
+
config MEMCG
bool "Memory controller"
select PAGE_COUNTER
select EVENTFD
+ select SLAB_OBJ_EXT
help
Provides control over the memory footprint of tasks in a cgroup.
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index dad3c0eb70a0..aea6fa145080 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -590,9 +590,9 @@ static unsigned long kfence_init_pool(void)
continue;
__folio_set_slab(slab_folio(slab));
-#ifdef CONFIG_MEMCG
- slab->memcg_data = (unsigned long)&kfence_metadata[i / 2 - 1].objcg |
- MEMCG_DATA_OBJCGS;
+#ifdef CONFIG_MEMCG_KMEM
+ slab->obj_exts = (unsigned long)&kfence_metadata[i / 2 - 1].obj_exts |
+ MEMCG_DATA_OBJEXTS;
#endif
}
@@ -634,8 +634,8 @@ static unsigned long kfence_init_pool(void)
if (!i || (i % 2))
continue;
-#ifdef CONFIG_MEMCG
- slab->memcg_data = 0;
+#ifdef CONFIG_MEMCG_KMEM
+ slab->obj_exts = 0;
#endif
__folio_clear_slab(slab_folio(slab));
}
@@ -1093,8 +1093,8 @@ void __kfence_free(void *addr)
{
struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
-#ifdef CONFIG_MEMCG
- KFENCE_WARN_ON(meta->objcg);
+#ifdef CONFIG_MEMCG_KMEM
+ KFENCE_WARN_ON(meta->obj_exts.objcg);
#endif
/*
* If the objects of the cache are SLAB_TYPESAFE_BY_RCU, defer freeing
diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
index 2aafc46a4aaf..8e0d76c4ea2a 100644
--- a/mm/kfence/kfence.h
+++ b/mm/kfence/kfence.h
@@ -97,8 +97,8 @@ struct kfence_metadata {
struct kfence_track free_track;
/* For updating alloc_covered on frees. */
u32 alloc_stack_hash;
-#ifdef CONFIG_MEMCG
- struct obj_cgroup *objcg;
+#ifdef CONFIG_MEMCG_KMEM
+ struct slabobj_ext obj_exts;
#endif
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4b27e245a055..f2a7fe718117 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2892,13 +2892,6 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
}
#ifdef CONFIG_MEMCG_KMEM
-/*
- * The allocated objcg pointers array is not accounted directly.
- * Moreover, it should not come from DMA buffer and is not readily
- * reclaimable. So those GFP bits should be masked off.
- */
-#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
-
/*
* mod_objcg_mlstate() may be called with irq enabled, so
* mod_memcg_lruvec_state() should be used.
@@ -2917,62 +2910,27 @@ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
rcu_read_unlock();
}
-int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
- gfp_t gfp, bool new_slab)
-{
- unsigned int objects = objs_per_slab(s, slab);
- unsigned long memcg_data;
- void *vec;
-
- gfp &= ~OBJCGS_CLEAR_MASK;
- vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
- slab_nid(slab));
- if (!vec)
- return -ENOMEM;
-
- memcg_data = (unsigned long) vec | MEMCG_DATA_OBJCGS;
- if (new_slab) {
- /*
- * If the slab is brand new and nobody can yet access its
- * memcg_data, no synchronization is required and memcg_data can
- * be simply assigned.
- */
- slab->memcg_data = memcg_data;
- } else if (cmpxchg(&slab->memcg_data, 0, memcg_data)) {
- /*
- * If the slab is already in use, somebody can allocate and
- * assign obj_cgroups in parallel. In this case the existing
- * objcg vector should be reused.
- */
- kfree(vec);
- return 0;
- }
-
- kmemleak_not_leak(vec);
- return 0;
-}
-
static __always_inline
struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
{
/*
* Slab objects are accounted individually, not per-page.
* Memcg membership data for each individual object is saved in
- * slab->memcg_data.
+ * slab->obj_exts.
*/
if (folio_test_slab(folio)) {
- struct obj_cgroup **objcgs;
+ struct slabobj_ext *obj_exts;
struct slab *slab;
unsigned int off;
slab = folio_slab(folio);
- objcgs = slab_objcgs(slab);
- if (!objcgs)
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
return NULL;
off = obj_to_index(slab->slab_cache, slab, p);
- if (objcgs[off])
- return obj_cgroup_memcg(objcgs[off]);
+ if (obj_exts[off].objcg)
+ return obj_cgroup_memcg(obj_exts[off].objcg);
return NULL;
}
@@ -2980,7 +2938,7 @@ struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
/*
* folio_memcg_check() is used here, because in theory we can encounter
* a folio where the slab flag has been cleared already, but
- * slab->memcg_data has not been freed yet
+ * slab->obj_exts has not been freed yet
* folio_memcg_check() will guarantee that a proper memory
* cgroup pointer or NULL will be returned.
*/
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 31169b3e7f06..8b6086c666e6 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -372,7 +372,7 @@ static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
if (!memcg_data)
goto out_unlock;
- if (memcg_data & MEMCG_DATA_OBJCGS)
+ if (memcg_data & MEMCG_DATA_OBJEXTS)
ret += scnprintf(kbuf + ret, count - ret,
"Slab cache page\n");
diff --git a/mm/slab.h b/mm/slab.h
index f01ac256a8f5..25d14b3a7280 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -57,8 +57,8 @@ struct slab {
#endif
atomic_t __page_refcount;
-#ifdef CONFIG_MEMCG
- unsigned long memcg_data;
+#ifdef CONFIG_SLAB_OBJ_EXT
+ unsigned long obj_exts;
#endif
};
@@ -67,8 +67,8 @@ struct slab {
SLAB_MATCH(flags, __page_flags);
SLAB_MATCH(compound_head, slab_cache); /* Ensure bit 0 is clear */
SLAB_MATCH(_refcount, __page_refcount);
-#ifdef CONFIG_MEMCG
-SLAB_MATCH(memcg_data, memcg_data);
+#ifdef CONFIG_SLAB_OBJ_EXT
+SLAB_MATCH(memcg_data, obj_exts);
#endif
#undef SLAB_MATCH
static_assert(sizeof(struct slab) <= sizeof(struct page));
@@ -390,36 +390,106 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
return false;
}
-#ifdef CONFIG_MEMCG_KMEM
+#ifdef CONFIG_SLAB_OBJ_EXT
+
/*
- * slab_objcgs - get the object cgroups vector associated with a slab
+ * slab_obj_exts - get the pointer to the slab object extension vector
+ * associated with a slab.
* @slab: a pointer to the slab struct
*
- * Returns a pointer to the object cgroups vector associated with the slab,
+ * Returns a pointer to the object extension vector associated with the slab,
* or NULL if no such vector has been associated yet.
*/
-static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
+static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
{
- unsigned long memcg_data = READ_ONCE(slab->memcg_data);
+ unsigned long obj_exts = READ_ONCE(slab->obj_exts);
- VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS),
+#ifdef CONFIG_MEMCG
+ VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS),
slab_page(slab));
- VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, slab_page(slab));
+ VM_BUG_ON_PAGE(obj_exts & MEMCG_DATA_KMEM, slab_page(slab));
- return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct slabobj_ext *)(obj_exts & ~MEMCG_DATA_FLAGS_MASK);
+#else
+ return (struct slabobj_ext *)obj_exts;
+#endif
}
-int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
- gfp_t gfp, bool new_slab);
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
- enum node_stat_item idx, int nr);
+int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
+ gfp_t gfp, bool new_slab);
-static inline void memcg_free_slab_cgroups(struct slab *slab)
+static inline bool need_slab_obj_ext(void)
{
- kfree(slab_objcgs(slab));
- slab->memcg_data = 0;
+ /*
+ * CONFIG_MEMCG_KMEM creates vector of obj_cgroup objects conditionally
+ * inside memcg_slab_post_alloc_hook. No other users for now.
+ */
+ return false;
}
+static inline void free_slab_obj_exts(struct slab *slab)
+{
+ struct slabobj_ext *obj_exts;
+
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
+ return;
+
+ kfree(obj_exts);
+ slab->obj_exts = 0;
+}
+
+static inline struct slabobj_ext *
+prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
+{
+ struct slab *slab;
+
+ if (!p)
+ return NULL;
+
+ if (!need_slab_obj_ext())
+ return NULL;
+
+ slab = virt_to_slab(p);
+ if (!slab_obj_exts(slab) &&
+ WARN(alloc_slab_obj_exts(slab, s, flags, false),
+ "%s, %s: Failed to create slab extension vector!\n",
+ __func__, s->name))
+ return NULL;
+
+ return slab_obj_exts(slab) + obj_to_index(s, slab, p);
+}
+
+#else /* CONFIG_SLAB_OBJ_EXT */
+
+static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
+{
+ return NULL;
+}
+
+static inline int alloc_slab_obj_exts(struct slab *slab,
+ struct kmem_cache *s, gfp_t gfp,
+ bool new_slab)
+{
+ return 0;
+}
+
+static inline void free_slab_obj_exts(struct slab *slab)
+{
+}
+
+static inline struct slabobj_ext *
+prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
+{
+ return NULL;
+}
+
+#endif /* CONFIG_SLAB_OBJ_EXT */
+
+#ifdef CONFIG_MEMCG_KMEM
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+ enum node_stat_item idx, int nr);
+
static inline size_t obj_full_size(struct kmem_cache *s)
{
/*
@@ -487,16 +557,15 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
if (likely(p[i])) {
slab = virt_to_slab(p[i]);
- if (!slab_objcgs(slab) &&
- memcg_alloc_slab_cgroups(slab, s, flags,
- false)) {
+ if (!slab_obj_exts(slab) &&
+ alloc_slab_obj_exts(slab, s, flags, false)) {
obj_cgroup_uncharge(objcg, obj_full_size(s));
continue;
}
off = obj_to_index(s, slab, p[i]);
obj_cgroup_get(objcg);
- slab_objcgs(slab)[off] = objcg;
+ slab_obj_exts(slab)[off].objcg = objcg;
mod_objcg_state(objcg, slab_pgdat(slab),
cache_vmstat_idx(s), obj_full_size(s));
} else {
@@ -509,14 +578,14 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
void **p, int objects)
{
- struct obj_cgroup **objcgs;
+ struct slabobj_ext *obj_exts;
int i;
if (!memcg_kmem_online())
return;
- objcgs = slab_objcgs(slab);
- if (!objcgs)
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
return;
for (i = 0; i < objects; i++) {
@@ -524,11 +593,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
unsigned int off;
off = obj_to_index(s, slab, p[i]);
- objcg = objcgs[off];
+ objcg = obj_exts[off].objcg;
if (!objcg)
continue;
- objcgs[off] = NULL;
+ obj_exts[off].objcg = NULL;
obj_cgroup_uncharge(objcg, obj_full_size(s));
mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s),
-obj_full_size(s));
@@ -537,27 +606,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
}
#else /* CONFIG_MEMCG_KMEM */
-static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
-{
- return NULL;
-}
-
static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
{
return NULL;
}
-static inline int memcg_alloc_slab_cgroups(struct slab *slab,
- struct kmem_cache *s, gfp_t gfp,
- bool new_slab)
-{
- return 0;
-}
-
-static inline void memcg_free_slab_cgroups(struct slab *slab)
-{
-}
-
static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
struct list_lru *lru,
struct obj_cgroup **objcgp,
@@ -594,7 +647,7 @@ static __always_inline void account_slab(struct slab *slab, int order,
struct kmem_cache *s, gfp_t gfp)
{
if (memcg_kmem_online() && (s->flags & SLAB_ACCOUNT))
- memcg_alloc_slab_cgroups(slab, s, gfp, true);
+ alloc_slab_obj_exts(slab, s, gfp, true);
mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
PAGE_SIZE << order);
@@ -603,8 +656,7 @@ static __always_inline void account_slab(struct slab *slab, int order,
static __always_inline void unaccount_slab(struct slab *slab, int order,
struct kmem_cache *s)
{
- if (memcg_kmem_online())
- memcg_free_slab_cgroups(slab);
+ free_slab_obj_exts(slab);
mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
-(PAGE_SIZE << order));
@@ -684,6 +736,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
unsigned int orig_size)
{
unsigned int zero_size = s->object_size;
+ struct slabobj_ext *obj_exts;
size_t i;
flags &= gfp_allowed_mask;
@@ -714,6 +767,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
kmemleak_alloc_recursive(p[i], s->object_size, 1,
s->flags, flags);
kmsan_slab_alloc(s, p[i], flags);
+ obj_exts = prepare_slab_obj_exts_hook(s, flags, p[i]);
}
memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 607249785c07..f11cc072b01e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -204,6 +204,53 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
return NULL;
}
+#ifdef CONFIG_SLAB_OBJ_EXT
+/*
+ * The allocated objcg pointers array is not accounted directly.
+ * Moreover, it should not come from DMA buffer and is not readily
+ * reclaimable. So those GFP bits should be masked off.
+ */
+#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
+
+int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
+ gfp_t gfp, bool new_slab)
+{
+ unsigned int objects = objs_per_slab(s, slab);
+ unsigned long obj_exts;
+ void *vec;
+
+ gfp &= ~OBJCGS_CLEAR_MASK;
+ vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
+ slab_nid(slab));
+ if (!vec)
+ return -ENOMEM;
+
+ obj_exts = (unsigned long)vec;
+#ifdef CONFIG_MEMCG
+ obj_exts |= MEMCG_DATA_OBJEXTS;
+#endif
+ if (new_slab) {
+ /*
+ * If the slab is brand new and nobody can yet access its
+ * obj_exts, no synchronization is required and obj_exts can
+ * be simply assigned.
+ */
+ slab->obj_exts = obj_exts;
+ } else if (cmpxchg(&slab->obj_exts, 0, obj_exts)) {
+ /*
+ * If the slab is already in use, somebody can allocate and
+ * assign slabobj_exts in parallel. In this case the existing
+ * obj_exts vector should be reused.
+ */
+ kfree(vec);
+ return 0;
+ }
+
+ kmemleak_not_leak(vec);
+ return 0;
+}
+#endif /* CONFIG_SLAB_OBJ_EXT */
+
static struct kmem_cache *create_cache(const char *name,
unsigned int object_size, unsigned int align,
slab_flags_t flags, unsigned int useroffset,
--
2.40.1.495.gc816e09b53d-goog
Introduce __GFP_NO_OBJ_EXT flag in order to prevent recursive allocations
when allocating slabobj_ext on a slab.
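
The bit assignment in this patch can be checked with a small userspace
sketch. The SKETCH_* names are hypothetical stand-ins for the kernel
macros and config options; only the numeric values are taken from the diff
below. It shows that the new flag takes bit 26, ___GFP_NOLOCKDEP moves to
bit 27, and the widened __GFP_BITS_SHIFT still covers both.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for CONFIG_SLAB_OBJ_EXT and CONFIG_LOCKDEP. */
#define SKETCH_SLAB_OBJ_EXT 1
#define SKETCH_LOCKDEP 1

#if SKETCH_SLAB_OBJ_EXT
#define SKETCH_GFP_NO_OBJ_EXT 0x4000000u	/* bit 26, as in the patch */
#else
#define SKETCH_GFP_NO_OBJ_EXT 0u
#endif

#if SKETCH_LOCKDEP
#define SKETCH_GFP_NOLOCKDEP 0x8000000u		/* moved from bit 26 to bit 27 */
#else
#define SKETCH_GFP_NOLOCKDEP 0u
#endif

/* Mirrors the updated __GFP_BITS_SHIFT and __GFP_BITS_MASK. */
#define SKETCH_GFP_BITS_SHIFT (27 + SKETCH_LOCKDEP)
#define SKETCH_GFP_BITS_MASK ((1u << SKETCH_GFP_BITS_SHIFT) - 1)

/* True when every bit of flag falls inside the valid-bits mask. */
bool sketch_flag_in_mask(unsigned int flag)
{
	return (flag & ~SKETCH_GFP_BITS_MASK) == 0;
}
```

Setting either SKETCH_* switch to 0 turns the matching flag into 0, which
is how the patch makes the flag a no-op when the feature is compiled out.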
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/gfp_types.h | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 6583a58670c5..aab1959130f9 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -53,8 +53,13 @@ typedef unsigned int __bitwise gfp_t;
#define ___GFP_SKIP_ZERO 0
#define ___GFP_SKIP_KASAN 0
#endif
+#ifdef CONFIG_SLAB_OBJ_EXT
+#define ___GFP_NO_OBJ_EXT 0x4000000u
+#else
+#define ___GFP_NO_OBJ_EXT 0
+#endif
#ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP 0x4000000u
+#define ___GFP_NOLOCKDEP 0x8000000u
#else
#define ___GFP_NOLOCKDEP 0
#endif
@@ -99,12 +104,15 @@ typedef unsigned int __bitwise gfp_t;
* node with no fallbacks or placement policy enforcements.
*
* %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
+ *
+ * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
*/
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
+#define __GFP_NO_OBJ_EXT ((__force gfp_t)___GFP_NO_OBJ_EXT)
/**
* DOC: Watermark modifiers
@@ -249,7 +257,7 @@ typedef unsigned int __bitwise gfp_t;
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
/* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (27 + IS_ENABLED(CONFIG_LOCKDEP))
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/**
--
2.40.1.495.gc816e09b53d-goog
Slab extension objects can't be allocated before the slab infrastructure is
initialized. Some caches, such as kmem_cache and kmem_cache_node, are
created before that point, so objects from these caches can't have
extension objects. Introduce the SLAB_NO_OBJ_EXT slab flag to mark these
caches and avoid creating extensions for objects allocated from them.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/slab.h | 7 +++++++
mm/slab.c | 2 +-
mm/slub.c | 5 +++--
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 6b3e155b70bf..99a146f3cedf 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -147,6 +147,13 @@
#endif
#define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */
+#ifdef CONFIG_SLAB_OBJ_EXT
+/* Slab created using create_boot_cache */
+#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x20000000U)
+#else
+#define SLAB_NO_OBJ_EXT 0
+#endif
+
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
*
diff --git a/mm/slab.c b/mm/slab.c
index bb57f7fdbae1..ccc76f7455e9 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1232,7 +1232,7 @@ void __init kmem_cache_init(void)
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
nr_node_ids * sizeof(struct kmem_cache_node *),
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
list_add(&kmem_cache->list, &slab_caches);
slab_state = PARTIAL;
diff --git a/mm/slub.c b/mm/slub.c
index c87628cd8a9a..507b71372ee4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5020,7 +5020,8 @@ void __init kmem_cache_init(void)
node_set(node, slab_nodes);
create_boot_cache(kmem_cache_node, "kmem_cache_node",
- sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN, 0, 0);
+ sizeof(struct kmem_cache_node),
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
@@ -5030,7 +5031,7 @@ void __init kmem_cache_init(void)
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
nr_node_ids * sizeof(struct kmem_cache_node *),
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
kmem_cache = bootstrap(&boot_kmem_cache);
kmem_cache_node = bootstrap(&boot_kmem_cache_node);
--
2.40.1.495.gc816e09b53d-goog
Use __GFP_NO_OBJ_EXT to prevent recursion when allocating slabobj_ext
vectors. Also skip slabobj_ext allocation for objects from caches marked
with SLAB_NO_OBJ_EXT, such as kmem_cache.
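
The recursion being avoided can be modelled in userspace. This is a
minimal sketch, not the kernel code: the sketch_* names, the flag values
and the single-cache model are all assumptions made for illustration. The
extension vector is obtained from the same allocator that wants to attach
extensions, so without the flag the vector allocation would re-enter
itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_GFP_NO_OBJ_EXT	0x1u	/* hypothetical request flag value */
#define SKETCH_SLAB_NO_OBJ_EXT	0x1u	/* hypothetical cache flag value */

struct sketch_cache {
	unsigned int flags;
	bool has_ext_vector;
};

int sketch_vector_allocs;	/* how many extension vectors were created */

bool sketch_alloc(struct sketch_cache *s, unsigned int gfp);

/* Models alloc_slab_obj_exts(): the vector itself comes from the allocator. */
static bool sketch_alloc_ext_vector(struct sketch_cache *s, unsigned int gfp)
{
	/* Prevent recursive extension vector allocation. */
	gfp |= SKETCH_GFP_NO_OBJ_EXT;
	if (!sketch_alloc(s, gfp))
		return false;
	s->has_ext_vector = true;
	sketch_vector_allocs++;
	return true;
}

/* Models the early returns added to prepare_slab_obj_exts_hook(). */
bool sketch_alloc(struct sketch_cache *s, unsigned int gfp)
{
	if (s->flags & SKETCH_SLAB_NO_OBJ_EXT)
		return true;	/* boot cache: never gets extensions */
	if (gfp & SKETCH_GFP_NO_OBJ_EXT)
		return true;	/* this is the vector allocation itself */
	if (!s->has_ext_vector)
		return sketch_alloc_ext_vector(s, gfp);
	return true;
}
```

Dropping the `gfp |= SKETCH_GFP_NO_OBJ_EXT` line makes sketch_alloc() call
itself without bound in this model, which is the failure the flag guards
against.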
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/slab.h | 6 ++++++
mm/slab_common.c | 2 ++
2 files changed, 8 insertions(+)
diff --git a/mm/slab.h b/mm/slab.h
index 25d14b3a7280..b1c22dc87047 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -450,6 +450,12 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
if (!need_slab_obj_ext())
return NULL;
+ if (s->flags & SLAB_NO_OBJ_EXT)
+ return NULL;
+
+ if (flags & __GFP_NO_OBJ_EXT)
+ return NULL;
+
slab = virt_to_slab(p);
if (!slab_obj_exts(slab) &&
WARN(alloc_slab_obj_exts(slab, s, flags, false),
diff --git a/mm/slab_common.c b/mm/slab_common.c
index f11cc072b01e..42777d66d0e3 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -220,6 +220,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
void *vec;
gfp &= ~OBJCGS_CLEAR_MASK;
+ /* Prevent recursive extension vector allocation */
+ gfp |= __GFP_NO_OBJ_EXT;
vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
slab_nid(slab));
if (!vec)
--
2.40.1.495.gc816e09b53d-goog
Add basic infrastructure to support code tagging. Each code tag stores the
information common to all tag types: the module name, function, file name
and line number. Provide functions to register a new code tag type and to
iterate over code tags.
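
The iteration scheme can be sketched in userspace as follows. The
sketch_* types are hypothetical; in particular struct sketch_alloc_tag is
an invented tag type used only to show why the iterator advances by the
registered tag_size rather than by the size of the common header.

```c
#include <assert.h>
#include <stddef.h>

/* Common part of every tag, mirroring struct codetag without flags. */
struct sketch_codetag {
	unsigned int lineno;
	const char *modname;
	const char *function;
	const char *filename;
};

/* Hypothetical tag type: the common part first, then its own payload. */
struct sketch_alloc_tag {
	struct sketch_codetag ct;
	unsigned long bytes;
};

struct sketch_range {
	struct sketch_codetag *start;
	struct sketch_codetag *stop;
};

/* Return the first tag when cur is NULL, else the tag after cur. */
struct sketch_codetag *sketch_next(const struct sketch_range *r,
				   struct sketch_codetag *cur, size_t tag_size)
{
	struct sketch_codetag *res;

	if (!cur)
		return r->start < r->stop ? r->start : NULL;

	/* Step by tag_size bytes, not sizeof(struct sketch_codetag). */
	res = (struct sketch_codetag *)((char *)cur + tag_size);
	return res < r->stop ? res : NULL;
}

/* Number of tags in the range, computed the way range_size() does. */
size_t sketch_range_size(const struct sketch_range *r, size_t tag_size)
{
	return (size_t)((char *)r->stop - (char *)r->start) / tag_size;
}
```

In the patch the range comes from the __start_ and __stop_ symbols of the
tag section; here a plain array stands in for that section.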
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/codetag.h | 71 ++++++++++++++
lib/Kconfig.debug | 4 +
lib/Makefile | 1 +
lib/codetag.c | 199 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 275 insertions(+)
create mode 100644 include/linux/codetag.h
create mode 100644 lib/codetag.c
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
new file mode 100644
index 000000000000..a9d7adecc2a5
--- /dev/null
+++ b/include/linux/codetag.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * code tagging framework
+ */
+#ifndef _LINUX_CODETAG_H
+#define _LINUX_CODETAG_H
+
+#include <linux/types.h>
+
+struct codetag_iterator;
+struct codetag_type;
+struct seq_buf;
+struct module;
+
+/*
+ * An instance of this structure is created in a special ELF section at every
+ * code location being tagged. At runtime, the special section is treated as
+ * an array of these.
+ */
+struct codetag {
+ unsigned int flags; /* used in later patches */
+ unsigned int lineno;
+ const char *modname;
+ const char *function;
+ const char *filename;
+} __aligned(8);
+
+union codetag_ref {
+ struct codetag *ct;
+};
+
+struct codetag_range {
+ struct codetag *start;
+ struct codetag *stop;
+};
+
+struct codetag_module {
+ struct module *mod;
+ struct codetag_range range;
+};
+
+struct codetag_type_desc {
+ const char *section;
+ size_t tag_size;
+};
+
+struct codetag_iterator {
+ struct codetag_type *cttype;
+ struct codetag_module *cmod;
+ unsigned long mod_id;
+ struct codetag *ct;
+};
+
+#define CODE_TAG_INIT { \
+ .modname = KBUILD_MODNAME, \
+ .function = __func__, \
+ .filename = __FILE__, \
+ .lineno = __LINE__, \
+ .flags = 0, \
+}
+
+void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
+struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
+struct codetag *codetag_next_ct(struct codetag_iterator *iter);
+
+void codetag_to_text(struct seq_buf *out, struct codetag *ct);
+
+struct codetag_type *
+codetag_register_type(const struct codetag_type_desc *desc);
+
+#endif /* _LINUX_CODETAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ce51d4dc6803..5078da7d3ffb 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -957,6 +957,10 @@ config DEBUG_STACKOVERFLOW
If in doubt, say "N".
+config CODE_TAGGING
+ bool
+ select KALLSYMS
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"
source "lib/Kconfig.kmsan"
diff --git a/lib/Makefile b/lib/Makefile
index 293a0858a3f8..28d70ecf2976 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -228,6 +228,7 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
of-reconfig-notifier-error-inject.o
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
+obj-$(CONFIG_CODE_TAGGING) += codetag.o
lib-$(CONFIG_GENERIC_BUG) += bug.o
obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/codetag.c b/lib/codetag.c
new file mode 100644
index 000000000000..7708f8388e55
--- /dev/null
+++ b/lib/codetag.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/codetag.h>
+#include <linux/idr.h>
+#include <linux/kallsyms.h>
+#include <linux/module.h>
+#include <linux/seq_buf.h>
+#include <linux/slab.h>
+
+struct codetag_type {
+ struct list_head link;
+ unsigned int count;
+ struct idr mod_idr;
+ struct rw_semaphore mod_lock; /* protects mod_idr */
+ struct codetag_type_desc desc;
+};
+
+static DEFINE_MUTEX(codetag_lock);
+static LIST_HEAD(codetag_types);
+
+void codetag_lock_module_list(struct codetag_type *cttype, bool lock)
+{
+ if (lock)
+ down_read(&cttype->mod_lock);
+ else
+ up_read(&cttype->mod_lock);
+}
+
+struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
+{
+ struct codetag_iterator iter = {
+ .cttype = cttype,
+ .cmod = NULL,
+ .mod_id = 0,
+ .ct = NULL,
+ };
+
+ return iter;
+}
+
+static inline struct codetag *get_first_module_ct(struct codetag_module *cmod)
+{
+ return cmod->range.start < cmod->range.stop ? cmod->range.start : NULL;
+}
+
+static inline
+struct codetag *get_next_module_ct(struct codetag_iterator *iter)
+{
+ struct codetag *res = (struct codetag *)
+ ((char *)iter->ct + iter->cttype->desc.tag_size);
+
+ return res < iter->cmod->range.stop ? res : NULL;
+}
+
+struct codetag *codetag_next_ct(struct codetag_iterator *iter)
+{
+ struct codetag_type *cttype = iter->cttype;
+ struct codetag_module *cmod;
+ struct codetag *ct;
+
+ lockdep_assert_held(&cttype->mod_lock);
+
+ if (unlikely(idr_is_empty(&cttype->mod_idr)))
+ return NULL;
+
+ ct = NULL;
+ while (true) {
+ cmod = idr_find(&cttype->mod_idr, iter->mod_id);
+
+ /* If module was removed move to the next one */
+ if (!cmod)
+ cmod = idr_get_next_ul(&cttype->mod_idr,
+ &iter->mod_id);
+
+ /* Exit if no more modules */
+ if (!cmod)
+ break;
+
+ if (cmod != iter->cmod) {
+ iter->cmod = cmod;
+ ct = get_first_module_ct(cmod);
+ } else
+ ct = get_next_module_ct(iter);
+
+ if (ct)
+ break;
+
+ iter->mod_id++;
+ }
+
+ iter->ct = ct;
+ return ct;
+}
+
+void codetag_to_text(struct seq_buf *out, struct codetag *ct)
+{
+ seq_buf_printf(out, "%s:%u module:%s func:%s",
+ ct->filename, ct->lineno,
+ ct->modname, ct->function);
+}
+
+static inline size_t range_size(const struct codetag_type *cttype,
+ const struct codetag_range *range)
+{
+ return ((char *)range->stop - (char *)range->start) /
+ cttype->desc.tag_size;
+}
+
+static void *get_symbol(struct module *mod, const char *prefix, const char *name)
+{
+ char buf[64];
+ int res;
+
+ res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
+ if (WARN_ON(res < 1 || res >= sizeof(buf)))
+ return NULL;
+
+ return mod ?
+ (void *)find_kallsyms_symbol_value(mod, buf) :
+ (void *)kallsyms_lookup_name(buf);
+}
+
+static struct codetag_range get_section_range(struct module *mod,
+ const char *section)
+{
+ return (struct codetag_range) {
+ get_symbol(mod, "__start_", section),
+ get_symbol(mod, "__stop_", section),
+ };
+}
+
+static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
+{
+ struct codetag_range range;
+ struct codetag_module *cmod;
+ int err;
+
+ range = get_section_range(mod, cttype->desc.section);
+ if (!range.start || !range.stop) {
+ pr_warn("Failed to load code tags of type %s from the module %s\n",
+ cttype->desc.section,
+ mod ? mod->name : "(built-in)");
+ return -EINVAL;
+ }
+
+ /* Ignore empty ranges */
+ if (range.start == range.stop)
+ return 0;
+
+ BUG_ON(range.start > range.stop);
+
+ cmod = kmalloc(sizeof(*cmod), GFP_KERNEL);
+ if (unlikely(!cmod))
+ return -ENOMEM;
+
+ cmod->mod = mod;
+ cmod->range = range;
+
+ down_write(&cttype->mod_lock);
+ err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
+ if (err >= 0)
+ cttype->count += range_size(cttype, &range);
+ up_write(&cttype->mod_lock);
+
+ if (err < 0) {
+ kfree(cmod);
+ return err;
+ }
+
+ return 0;
+}
+
+struct codetag_type *
+codetag_register_type(const struct codetag_type_desc *desc)
+{
+ struct codetag_type *cttype;
+ int err;
+
+ BUG_ON(desc->tag_size <= 0);
+
+ cttype = kzalloc(sizeof(*cttype), GFP_KERNEL);
+ if (unlikely(!cttype))
+ return ERR_PTR(-ENOMEM);
+
+ cttype->desc = *desc;
+ idr_init(&cttype->mod_idr);
+ init_rwsem(&cttype->mod_lock);
+
+ err = codetag_module_init(cttype, NULL);
+ if (unlikely(err)) {
+ kfree(cttype);
+ return ERR_PTR(err);
+ }
+
+ mutex_lock(&codetag_lock);
+ list_add_tail(&cttype->link, &codetag_types);
+ mutex_unlock(&codetag_lock);
+
+ return cttype;
+}
--
2.40.1.495.gc816e09b53d-goog
Introduce objext_flags to store additional objext flags unrelated to memcg.
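
The masking arithmetic can be sketched in userspace. The SKETCH_* names
are hypothetical, and the two low memcg bits are assumed to sit at bits 0
and 1 (the diff only shows __NR_MEMCG_DATA_FLAGS = 1UL << 2). The first
objext flag starts right after the memcg flags, so a single mask strips
every flag bit from the stored pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for CONFIG_MEMCG. */
#define SKETCH_MEMCG 1

#if SKETCH_MEMCG
#define SKETCH_DATA_OBJEXTS		(1UL << 0)	/* assumed bit position */
#define SKETCH_DATA_KMEM		(1UL << 1)	/* assumed bit position */
#define SKETCH_FIRST_OBJEXT_FLAG	(1UL << 2)
#else
#define SKETCH_FIRST_OBJEXT_FLAG	(1UL << 0)
#endif

/* No objext flags yet, so the count is the first free bit. */
#define SKETCH_NR_OBJEXTS_FLAGS		SKETCH_FIRST_OBJEXT_FLAG
#define SKETCH_OBJEXTS_FLAGS_MASK	(SKETCH_NR_OBJEXTS_FLAGS - 1)

/* Store flag bits in the unused low bits of an aligned pointer. */
uintptr_t sketch_pack(void *vec, uintptr_t flags)
{
	return (uintptr_t)vec | flags;
}

/* Recover the pointer by clearing every flag bit. */
void *sketch_unpack(uintptr_t word)
{
	return (void *)(word & ~(uintptr_t)SKETCH_OBJEXTS_FLAGS_MASK);
}
```

With SKETCH_MEMCG set to 0 the mask becomes 0 and sketch_unpack() returns
the word unchanged, which matches the non-memcg branch removed from
slab_obj_exts() below.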
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/memcontrol.h | 29 ++++++++++++++++++++++-------
mm/slab.h | 4 +---
2 files changed, 23 insertions(+), 10 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b9fd9732a52b..5e2da63c525f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -347,7 +347,22 @@ enum page_memcg_data_flags {
__NR_MEMCG_DATA_FLAGS = (1UL << 2),
};
-#define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
+#define __FIRST_OBJEXT_FLAG __NR_MEMCG_DATA_FLAGS
+
+#else /* CONFIG_MEMCG */
+
+#define __FIRST_OBJEXT_FLAG (1UL << 0)
+
+#endif /* CONFIG_MEMCG */
+
+enum objext_flags {
+ /* the next bit after the last actual flag */
+ __NR_OBJEXTS_FLAGS = __FIRST_OBJEXT_FLAG,
+};
+
+#define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
+
+#ifdef CONFIG_MEMCG
static inline bool folio_memcg_kmem(struct folio *folio);
@@ -381,7 +396,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}
/*
@@ -402,7 +417,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
- return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}
/*
@@ -459,11 +474,11 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
if (memcg_data & MEMCG_DATA_KMEM) {
struct obj_cgroup *objcg;
- objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
return obj_cgroup_memcg(objcg);
}
- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}
/*
@@ -502,11 +517,11 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
if (memcg_data & MEMCG_DATA_KMEM) {
struct obj_cgroup *objcg;
- objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
return obj_cgroup_memcg(objcg);
}
- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}
static inline struct mem_cgroup *page_memcg_check(struct page *page)
diff --git a/mm/slab.h b/mm/slab.h
index b1c22dc87047..bec202bdcfb8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -409,10 +409,8 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
slab_page(slab));
VM_BUG_ON_PAGE(obj_exts & MEMCG_DATA_KMEM, slab_page(slab));
- return (struct slabobj_ext *)(obj_exts & ~MEMCG_DATA_FLAGS_MASK);
-#else
- return (struct slabobj_ext *)obj_exts;
#endif
+ return (struct slabobj_ext *)(obj_exts & ~OBJEXTS_FLAGS_MASK);
}
int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
--
2.40.1.495.gc816e09b53d-goog
Once alloc_pages is redefined, the preprocessor replaces every use of that
name, including unrelated ones such as the alloc_pages member of
struct dma_map_ops. Rename the conflicting member to alloc_pages_op so the
preprocessor does not replace it unintentionally.
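
A minimal userspace sketch of the collision, assuming the redefinition is
a function-like macro (the actual definition is not part of this patch).
All sketch_* names are hypothetical. With a member named alloc_pages, a
call such as ops->alloc_pages(size) would be expanded by the macro and
fail to compile; the renamed member is left alone.

```c
#include <assert.h>
#include <stddef.h>

int sketch_tagged_calls;	/* counts calls routed through the macro */

/* Stand-in for the real allocator. */
void *sketch_alloc_pages_impl(size_t size)
{
	static char page[64];

	return size <= sizeof(page) ? page : NULL;
}

/* Hypothetical redefinition: every alloc_pages(...) call is instrumented. */
#define alloc_pages(size) \
	(sketch_tagged_calls++, sketch_alloc_pages_impl(size))

struct sketch_dma_ops {
	/*
	 * A member named alloc_pages would have its call sites rewritten
	 * by the macro above. The _op suffix keeps the name out of the
	 * preprocessor's reach.
	 */
	void *(*alloc_pages_op)(size_t size);
};

/* Mirrors the NULL check and indirect call in __dma_alloc_pages(). */
void *sketch_dma_alloc(const struct sketch_dma_ops *ops, size_t size)
{
	if (!ops->alloc_pages_op)
		return NULL;
	return ops->alloc_pages_op(size);
}
```

Calls through the ops table bypass the macro entirely, so only direct
alloc_pages() callers are counted in this model.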
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/x86/kernel/amd_gart_64.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/xen/grant-dma-ops.c | 2 +-
drivers/xen/swiotlb-xen.c | 2 +-
include/linux/dma-map-ops.h | 2 +-
kernel/dma/mapping.c | 4 ++--
6 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 56a917df410d..842a0ec5eaa9 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
.get_sgtable = dma_common_get_sgtable,
.dma_supported = dma_direct_supported,
.get_required_mask = dma_direct_get_required_mask,
- .alloc_pages = dma_direct_alloc_pages,
+ .alloc_pages_op = dma_direct_alloc_pages,
.free_pages = dma_direct_free_pages,
};
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7a9f0b0bddbd..76a9d5ca4eee 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1556,7 +1556,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
.free = iommu_dma_free,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
.alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
.free_noncontiguous = iommu_dma_free_noncontiguous,
diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
index 9784a77fa3c9..6c7d984f164d 100644
--- a/drivers/xen/grant-dma-ops.c
+++ b/drivers/xen/grant-dma-ops.c
@@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
static const struct dma_map_ops xen_grant_dma_ops = {
.alloc = xen_grant_dma_alloc,
.free = xen_grant_dma_free,
- .alloc_pages = xen_grant_dma_alloc_pages,
+ .alloc_pages_op = xen_grant_dma_alloc_pages,
.free_pages = xen_grant_dma_free_pages,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 67aa74d20162..5ab2616153f0 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -403,6 +403,6 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
.dma_supported = xen_swiotlb_dma_supported,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 31f114f486c4..d741940dcb3b 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -27,7 +27,7 @@ struct dma_map_ops {
unsigned long attrs);
void (*free)(struct device *dev, size_t size, void *vaddr,
dma_addr_t dma_handle, unsigned long attrs);
- struct page *(*alloc_pages)(struct device *dev, size_t size,
+ struct page *(*alloc_pages_op)(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir,
gfp_t gfp);
void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 9a4db5cce600..fc42930af14b 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
size = PAGE_ALIGN(size);
if (dma_alloc_direct(dev, ops))
return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
- if (!ops->alloc_pages)
+ if (!ops->alloc_pages_op)
return NULL;
- return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
+ return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
}
struct page *dma_alloc_pages(struct device *dev, size_t size,
--
2.40.1.495.gc816e09b53d-goog
Add support for code tagging from dynamically loaded modules.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/codetag.h | 12 +++++++++
kernel/module/main.c | 4 +++
lib/codetag.c | 58 +++++++++++++++++++++++++++++++++++++++--
3 files changed, 72 insertions(+), 2 deletions(-)
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index a9d7adecc2a5..386733e89b31 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -42,6 +42,10 @@ struct codetag_module {
struct codetag_type_desc {
const char *section;
size_t tag_size;
+ void (*module_load)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
+ void (*module_unload)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
};
struct codetag_iterator {
@@ -68,4 +72,12 @@ void codetag_to_text(struct seq_buf *out, struct codetag *ct);
struct codetag_type *
codetag_register_type(const struct codetag_type_desc *desc);
+#ifdef CONFIG_CODE_TAGGING
+void codetag_load_module(struct module *mod);
+void codetag_unload_module(struct module *mod);
+#else
+static inline void codetag_load_module(struct module *mod) {}
+static inline void codetag_unload_module(struct module *mod) {}
+#endif
+
#endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 044aa2c9e3cb..4232e7bff549 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -56,6 +56,7 @@
#include <linux/dynamic_debug.h>
#include <linux/audit.h>
#include <linux/cfi.h>
+#include <linux/codetag.h>
#include <linux/debugfs.h>
#include <uapi/linux/module.h>
#include "internal.h"
@@ -1249,6 +1250,7 @@ static void free_module(struct module *mod)
{
trace_module_free(mod);
+ codetag_unload_module(mod);
mod_sysfs_teardown(mod);
/*
@@ -2974,6 +2976,8 @@ static int load_module(struct load_info *info, const char __user *uargs,
/* Get rid of temporary copy. */
free_copy(info, flags);
+ codetag_load_module(mod);
+
/* Done! */
trace_module_load(mod);
diff --git a/lib/codetag.c b/lib/codetag.c
index 7708f8388e55..4ea57fb37346 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -108,15 +108,20 @@ static inline size_t range_size(const struct codetag_type *cttype,
static void *get_symbol(struct module *mod, const char *prefix, const char *name)
{
char buf[64];
+ void *ret;
int res;
res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
if (WARN_ON(res < 1 || res > sizeof(buf)))
return NULL;
- return mod ?
+ preempt_disable();
+ ret = mod ?
(void *)find_kallsyms_symbol_value(mod, buf) :
(void *)kallsyms_lookup_name(buf);
+ preempt_enable();
+
+ return ret;
}
static struct codetag_range get_section_range(struct module *mod,
@@ -157,8 +162,11 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
down_write(&cttype->mod_lock);
err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
- if (err >= 0)
+ if (err >= 0) {
cttype->count += range_size(cttype, &range);
+ if (cttype->desc.module_load)
+ cttype->desc.module_load(cttype, cmod);
+ }
up_write(&cttype->mod_lock);
if (err < 0) {
@@ -197,3 +205,49 @@ codetag_register_type(const struct codetag_type_desc *desc)
return cttype;
}
+
+void codetag_load_module(struct module *mod)
+{
+ struct codetag_type *cttype;
+
+ if (!mod)
+ return;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link)
+ codetag_module_init(cttype, mod);
+ mutex_unlock(&codetag_lock);
+}
+
+void codetag_unload_module(struct module *mod)
+{
+ struct codetag_type *cttype;
+
+ if (!mod)
+ return;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link) {
+ struct codetag_module *found = NULL;
+ struct codetag_module *cmod;
+ unsigned long mod_id, tmp;
+
+ down_write(&cttype->mod_lock);
+ idr_for_each_entry_ul(&cttype->mod_idr, cmod, tmp, mod_id) {
+ if (cmod->mod && cmod->mod == mod) {
+ found = cmod;
+ break;
+ }
+ }
+ if (found) {
+ if (cttype->desc.module_unload)
+ cttype->desc.module_unload(cttype, cmod);
+
+ cttype->count -= range_size(cttype, &cmod->range);
+ idr_remove(&cttype->mod_idr, mod_id);
+ kfree(cmod);
+ }
+ up_write(&cttype->mod_lock);
+ }
+ mutex_unlock(&codetag_lock);
+}
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
Provide codetag_query_parse() to parse codetag queries and
codetag_matches_query() to check if the query affects a given codetag.
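For illustration only, here is a minimal userspace sketch of the "N" or
"N-M" range syntax that the line and index tokens accept. It is a model
under stated assumptions, not the kernel code: parse_uint() is a
hypothetical and slightly stricter stand-in for kstrtouint(), and
line_in_range() is a hypothetical helper that mirrors the inclusive line
check in codetag_matches_query().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for kstrtouint(): base 10, digits only. */
static int parse_uint(const char *s, unsigned int *out)
{
	char *end;
	unsigned long v;

	if (*s < '0' || *s > '9')
		return -EINVAL;	/* reject empty strings, signs, whitespace */
	errno = 0;
	v = strtoul(s, &end, 10);
	if (errno || *end != '\0' || v > 0xffffffffUL)
		return -EINVAL;
	*out = (unsigned int)v;
	return 0;
}

/* Splits str in place at the first '-'; a single number means first == last. */
static int parse_range(char *str, unsigned int *first, unsigned int *last)
{
	char *last_str = strchr(str, '-');

	if (last_str)
		*last_str++ = '\0';
	if (parse_uint(str, first))
		return -EINVAL;
	if (!last_str)
		*last = *first;
	else if (parse_uint(last_str, last))
		return -EINVAL;
	return 0;
}

/* Inclusive range test, as applied to a code tag's line number. */
static int line_in_range(unsigned int lineno, unsigned int first,
			 unsigned int last)
{
	return lineno >= first && lineno <= last;
}
```

With this model, "63" selects exactly line 63, "10-20" selects lines 10
through 20 inclusive, and a malformed bound such as "10-x" is rejected
with -EINVAL.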
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/codetag.h | 27 ++++++++
lib/codetag.c | 135 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 162 insertions(+)
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index d98e4c8e86f0..87207f199ac9 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -80,4 +80,31 @@ static inline void codetag_load_module(struct module *mod) {}
static inline bool codetag_unload_module(struct module *mod) { return true; }
#endif
+/* Codetag query parsing */
+
+struct codetag_query {
+ const char *filename;
+ const char *module;
+ const char *function;
+ const char *class;
+ unsigned int first_line, last_line;
+ unsigned int first_index, last_index;
+ unsigned int cur_index;
+
+ bool match_line:1;
+ bool match_index:1;
+
+ unsigned int set_enabled:1;
+ unsigned int enabled:2;
+
+ unsigned int set_frequency:1;
+ unsigned int frequency;
+};
+
+char *codetag_query_parse(struct codetag_query *q, char *buf);
+bool codetag_matches_query(struct codetag_query *q,
+ const struct codetag *ct,
+ const struct codetag_module *mod,
+ const char *class);
+
#endif /* _LINUX_CODETAG_H */
diff --git a/lib/codetag.c b/lib/codetag.c
index 0ad4ea66c769..84f90f3b922c 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -256,3 +256,138 @@ bool codetag_unload_module(struct module *mod)
return unload_ok;
}
+
+/* Codetag query parsing */
+
+#define CODETAG_QUERY_TOKENS() \
+ x(func) \
+ x(file) \
+ x(line) \
+ x(module) \
+ x(class) \
+ x(index)
+
+enum tokens {
+#define x(name) TOK_##name,
+ CODETAG_QUERY_TOKENS()
+#undef x
+};
+
+static const char * const token_strs[] = {
+#define x(name) #name,
+ CODETAG_QUERY_TOKENS()
+#undef x
+ NULL
+};
+
+static int parse_range(char *str, unsigned int *first, unsigned int *last)
+{
+ char *first_str = str;
+ char *last_str = strchr(first_str, '-');
+
+ if (last_str)
+ *last_str++ = '\0';
+
+ if (kstrtouint(first_str, 10, first))
+ return -EINVAL;
+
+ if (!last_str)
+ *last = *first;
+ else if (kstrtouint(last_str, 10, last))
+ return -EINVAL;
+
+ return 0;
+}
+
+char *codetag_query_parse(struct codetag_query *q, char *buf)
+{
+ while (1) {
+ char *p = buf;
+ char *str1 = strsep_no_empty(&p, " \t\r\n");
+ char *str2 = strsep_no_empty(&p, " \t\r\n");
+ int ret, token;
+
+ if (!str1 || !str2)
+ break;
+
+ token = match_string(token_strs, ARRAY_SIZE(token_strs), str1);
+ if (token < 0)
+ break;
+
+ switch (token) {
+ case TOK_func:
+ q->function = str2;
+ break;
+ case TOK_file:
+ q->filename = str2;
+ break;
+ case TOK_line:
+ ret = parse_range(str2, &q->first_line, &q->last_line);
+ if (ret)
+ return ERR_PTR(ret);
+ q->match_line = true;
+ break;
+ case TOK_module:
+ q->module = str2;
+ break;
+ case TOK_class:
+ q->class = str2;
+ break;
+ case TOK_index:
+ ret = parse_range(str2, &q->first_index, &q->last_index);
+ if (ret)
+ return ERR_PTR(ret);
+ q->match_index = true;
+ break;
+ }
+
+ buf = p;
+ }
+
+ return buf;
+}
+
+bool codetag_matches_query(struct codetag_query *q,
+ const struct codetag *ct,
+ const struct codetag_module *mod,
+ const char *class)
+{
+ size_t classlen = q->class ? strlen(q->class) : 0;
+
+ if (q->module &&
+ (!mod->mod ||
+ strcmp(q->module, ct->modname)))
+ return false;
+
+ if (q->filename &&
+ strcmp(q->filename, ct->filename) &&
+ strcmp(q->filename, kbasename(ct->filename)))
+ return false;
+
+ if (q->function &&
+ strcmp(q->function, ct->function))
+ return false;
+
+ /* match against the line number range */
+ if (q->match_line &&
+ (ct->lineno < q->first_line ||
+ ct->lineno > q->last_line))
+ return false;
+
+ /* match against the class */
+ if (classlen &&
+ (strncmp(q->class, class, classlen) ||
+ (class[classlen] && class[classlen] != ':')))
+ return false;
+
+ /* match against the fault index */
+ if (q->match_index &&
+ (q->cur_index < q->first_index ||
+ q->cur_index > q->last_index)) {
+ q->cur_index++;
+ return false;
+ }
+
+ q->cur_index++;
+ return true;
+}
--
2.40.1.495.gc816e09b53d-goog
Skip freeing a module's data section if any of its allocation tags still
has a non-zero byte count. Otherwise, when those outstanding allocations
are eventually freed, accessing their code tags would be a use-after-free
(UAF).
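The decision can be modelled in userspace roughly as follows. This is a
sketch under stated assumptions: struct tag_model, tags_unused() and
should_free() are hypothetical names that stand in for struct alloc_tag,
the module_unload callback and the check added to module_memory_free().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct alloc_tag. */
struct tag_model {
	const char *location;
	size_t bytes_allocated;	/* bytes still outstanding from this call site */
};

/*
 * Models the module_unload callback: the tags may only go away when no
 * call site has live allocations that still point back at its tag.
 */
static bool tags_unused(const struct tag_model *tags, size_t n)
{
	bool unused = true;
	size_t i;

	for (i = 0; i < n; i++)
		if (tags[i].bytes_allocated)
			unused = false;	/* keep scanning; the kernel warns per tag */
	return unused;
}

/*
 * Models the check in module_memory_free(): core data is kept, that is
 * deliberately leaked, while its tags are still referenced.
 */
static bool should_free(bool is_core_data, bool unload_codetags)
{
	return unload_codetags || !is_core_data;
}
```

Only the core data regions are retained in that case; all other module
memory is freed as before.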
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/codetag.h | 6 +++---
kernel/module/main.c | 23 +++++++++++++++--------
lib/codetag.c | 11 ++++++++---
3 files changed, 26 insertions(+), 14 deletions(-)
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 386733e89b31..d98e4c8e86f0 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -44,7 +44,7 @@ struct codetag_type_desc {
size_t tag_size;
void (*module_load)(struct codetag_type *cttype,
struct codetag_module *cmod);
- void (*module_unload)(struct codetag_type *cttype,
+ bool (*module_unload)(struct codetag_type *cttype,
struct codetag_module *cmod);
};
@@ -74,10 +74,10 @@ codetag_register_type(const struct codetag_type_desc *desc);
#ifdef CONFIG_CODE_TAGGING
void codetag_load_module(struct module *mod);
-void codetag_unload_module(struct module *mod);
+bool codetag_unload_module(struct module *mod);
#else
static inline void codetag_load_module(struct module *mod) {}
-static inline void codetag_unload_module(struct module *mod) {}
+static inline bool codetag_unload_module(struct module *mod) { return true; }
#endif
#endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 4232e7bff549..9ff56f2bb09d 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1218,15 +1218,19 @@ static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
return module_alloc(size);
}
-static void module_memory_free(void *ptr, enum mod_mem_type type)
+static void module_memory_free(void *ptr, enum mod_mem_type type,
+ bool unload_codetags)
{
+ if (!unload_codetags && mod_mem_type_is_core_data(type))
+ return;
+
if (mod_mem_use_vmalloc(type))
vfree(ptr);
else
module_memfree(ptr);
}
-static void free_mod_mem(struct module *mod)
+static void free_mod_mem(struct module *mod, bool unload_codetags)
{
for_each_mod_mem_type(type) {
struct module_memory *mod_mem = &mod->mem[type];
@@ -1237,20 +1241,23 @@ static void free_mod_mem(struct module *mod)
/* Free lock-classes; relies on the preceding sync_rcu(). */
lockdep_free_key_range(mod_mem->base, mod_mem->size);
if (mod_mem->size)
- module_memory_free(mod_mem->base, type);
+ module_memory_free(mod_mem->base, type,
+ unload_codetags);
}
/* MOD_DATA hosts mod, so free it at last */
lockdep_free_key_range(mod->mem[MOD_DATA].base, mod->mem[MOD_DATA].size);
- module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA);
+ module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA, unload_codetags);
}
/* Free a module, remove from lists, etc. */
static void free_module(struct module *mod)
{
+ bool unload_codetags;
+
trace_module_free(mod);
- codetag_unload_module(mod);
+ unload_codetags = codetag_unload_module(mod);
mod_sysfs_teardown(mod);
/*
@@ -1292,7 +1299,7 @@ static void free_module(struct module *mod)
kfree(mod->args);
percpu_modfree(mod);
- free_mod_mem(mod);
+ free_mod_mem(mod, unload_codetags);
}
void *__symbol_get(const char *symbol)
@@ -2294,7 +2301,7 @@ static int move_module(struct module *mod, struct load_info *info)
return 0;
out_enomem:
for (t--; t >= 0; t--)
- module_memory_free(mod->mem[t].base, t);
+ module_memory_free(mod->mem[t].base, t, true);
return ret;
}
@@ -2424,7 +2431,7 @@ static void module_deallocate(struct module *mod, struct load_info *info)
percpu_modfree(mod);
module_arch_freeing_init(mod);
- free_mod_mem(mod);
+ free_mod_mem(mod, true);
}
int __weak module_finalize(const Elf_Ehdr *hdr,
diff --git a/lib/codetag.c b/lib/codetag.c
index 4ea57fb37346..0ad4ea66c769 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -5,6 +5,7 @@
#include <linux/module.h>
#include <linux/seq_buf.h>
#include <linux/slab.h>
+#include <linux/vmalloc.h>
struct codetag_type {
struct list_head link;
@@ -219,12 +220,13 @@ void codetag_load_module(struct module *mod)
mutex_unlock(&codetag_lock);
}
-void codetag_unload_module(struct module *mod)
+bool codetag_unload_module(struct module *mod)
{
struct codetag_type *cttype;
+ bool unload_ok = true;
if (!mod)
- return;
+ return true;
mutex_lock(&codetag_lock);
list_for_each_entry(cttype, &codetag_types, link) {
@@ -241,7 +243,8 @@ void codetag_unload_module(struct module *mod)
}
if (found) {
if (cttype->desc.module_unload)
- cttype->desc.module_unload(cttype, cmod);
+ if (!cttype->desc.module_unload(cttype, cmod))
+ unload_ok = false;
cttype->count -= range_size(cttype, &cmod->range);
idr_remove(&cttype->mod_idr, mod_id);
@@ -250,4 +253,6 @@ void codetag_unload_module(struct module *mod)
up_write(&cttype->mod_lock);
}
mutex_unlock(&codetag_lock);
+
+ return unload_ok;
}
--
2.40.1.495.gc816e09b53d-goog
To store a code tag for every slab object, a codetag reference is embedded
into slabobj_ext when CONFIG_MEM_ALLOC_PROFILING=y.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/memcontrol.h | 5 +++++
lib/Kconfig.debug | 1 +
mm/slab.h | 4 ++++
3 files changed, 10 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5e2da63c525f..c7f21b15b540 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1626,7 +1626,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
* if MEMCG_DATA_OBJEXTS is set.
*/
struct slabobj_ext {
+#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *objcg;
+#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ union codetag_ref ref;
+#endif
} __aligned(8);
static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d3aa5ee0bf0d..4157c2251b07 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -968,6 +968,7 @@ config MEM_ALLOC_PROFILING
select CODE_TAGGING
select LAZY_PERCPU_COUNTER
select PAGE_EXTENSION
+ select SLAB_OBJ_EXT
help
Track allocation source code and record total allocation size
initiated at that code location. The mechanism can be used to track
diff --git a/mm/slab.h b/mm/slab.h
index bec202bdcfb8..f953e7c81e98 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -418,6 +418,10 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
static inline bool need_slab_obj_ext(void)
{
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ if (mem_alloc_profiling_enabled())
+ return true;
+#endif
/*
* CONFIG_MEMCG_KMEM creates vector of obj_cgroup objects conditionally
* inside memcg_slab_post_alloc_hook. No other users for now.
--
2.40.1.495.gc816e09b53d-goog
Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily
instrument memory allocators. It also registers an "alloc_tags" codetag
type with an "allocations" debugfs interface to output allocation tag
information.
CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
allocation profiling instrumentation.
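As a rough illustration of the accounting model, here is a userspace
sketch. It rests on assumptions: all names are hypothetical, a plain
long replaces the lazy percpu counter, and a global pointer stands in
for current->alloc_tag. A call site publishes its tag, the allocator
charges the bytes to that tag and records a reference in the object, and
the free path uses the reference to uncharge.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct alloc_tag and union codetag_ref. */
struct tag_model {
	const char *file;
	unsigned int line;
	long bytes;		/* plain counter instead of a lazy percpu counter */
};

struct ref_model {
	struct tag_model *tag;
};

static struct tag_model *current_tag;	/* models current->alloc_tag */

/* Publish the call site's tag and return the previous one. */
static struct tag_model *tag_save(struct tag_model *tag)
{
	struct tag_model *old = current_tag;

	current_tag = tag;
	return old;
}

static void tag_restore(struct tag_model *old)
{
	current_tag = old;
}

/* Allocation path: charge the bytes and remember which tag was charged. */
static void tag_add(struct ref_model *ref, struct tag_model *tag, size_t bytes)
{
	if (!ref || !tag)
		return;
	ref->tag = tag;
	tag->bytes += (long)bytes;
}

/* Free path: uncharge via the stored reference, then clear it. */
static void tag_sub(struct ref_model *ref, size_t bytes)
{
	if (!ref || !ref->tag)
		return;
	ref->tag->bytes -= (long)bytes;
	ref->tag = NULL;
}
```

Because the reference is cleared on free, a second tag_sub() on the same
object is a no-op in this model rather than a double uncharge.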
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 2 +
include/asm-generic/codetag.lds.h | 14 ++
include/asm-generic/vmlinux.lds.h | 3 +
include/linux/alloc_tag.h | 105 +++++++++++
include/linux/sched.h | 24 +++
lib/Kconfig.debug | 19 ++
lib/Makefile | 2 +
lib/alloc_tag.c | 177 ++++++++++++++++++
scripts/module.lds.S | 7 +
9 files changed, 353 insertions(+)
create mode 100644 include/asm-generic/codetag.lds.h
create mode 100644 include/linux/alloc_tag.h
create mode 100644 lib/alloc_tag.c
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9e5bab29685f..2fd8e56b7af8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3770,6 +3770,8 @@
nomce [X86-32] Disable Machine Check Exception
+ nomem_profiling Disable memory allocation profiling.
+
nomfgpt [X86-32] Disable Multi-Function General Purpose
Timer usage (for AMD Geode machines).
diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
new file mode 100644
index 000000000000..64f536b80380
--- /dev/null
+++ b/include/asm-generic/codetag.lds.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_GENERIC_CODETAG_LDS_H
+#define __ASM_GENERIC_CODETAG_LDS_H
+
+#define SECTION_WITH_BOUNDARIES(_name) \
+ . = ALIGN(8); \
+ __start_##_name = .; \
+ KEEP(*(_name)) \
+ __stop_##_name = .;
+
+#define CODETAG_SECTIONS() \
+ SECTION_WITH_BOUNDARIES(alloc_tags)
+
+#endif /* __ASM_GENERIC_CODETAG_LDS_H */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index d1f57e4868ed..985ff045c2a2 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -50,6 +50,8 @@
* [__nosave_begin, __nosave_end] for the nosave data
*/
+#include <asm-generic/codetag.lds.h>
+
#ifndef LOAD_OFFSET
#define LOAD_OFFSET 0
#endif
@@ -374,6 +376,7 @@
. = ALIGN(8); \
BOUNDED_SECTION_BY(__dyndbg_classes, ___dyndbg_classes) \
BOUNDED_SECTION_BY(__dyndbg, ___dyndbg) \
+ CODETAG_SECTIONS() \
LIKELY_PROFILE() \
BRANCH_PROFILE() \
TRACE_PRINTKS() \
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
new file mode 100644
index 000000000000..d913f8d9a7d8
--- /dev/null
+++ b/include/linux/alloc_tag.h
@@ -0,0 +1,105 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * allocation tagging
+ */
+#ifndef _LINUX_ALLOC_TAG_H
+#define _LINUX_ALLOC_TAG_H
+
+#include <linux/bug.h>
+#include <linux/codetag.h>
+#include <linux/container_of.h>
+#include <linux/lazy-percpu-counter.h>
+#include <linux/static_key.h>
+
+/*
+ * An instance of this structure is created in a special ELF section at every
+ * allocation callsite. At runtime, the special section is treated as
+ * an array of these. Embedded codetag utilizes codetag framework.
+ */
+struct alloc_tag {
+ struct codetag ct;
+ struct lazy_percpu_counter bytes_allocated;
+} __aligned(8);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
+static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
+{
+ return container_of(ct, struct alloc_tag, ct);
+}
+
+#define DEFINE_ALLOC_TAG(_alloc_tag, _old) \
+ static struct alloc_tag _alloc_tag __used __aligned(8) \
+ __section("alloc_tags") = { .ct = CODE_TAG_INIT }; \
+ struct alloc_tag * __maybe_unused _old = alloc_tag_save(&_alloc_tag)
+
+extern struct static_key_true mem_alloc_profiling_key;
+
+static inline bool mem_alloc_profiling_enabled(void)
+{
+ return static_branch_likely(&mem_alloc_profiling_key);
+}
+
+static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes,
+ bool may_allocate)
+{
+ struct alloc_tag *tag;
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ /* The switch should be checked before this */
+ BUG_ON(!mem_alloc_profiling_enabled());
+
+ WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n");
+#endif
+ if (!ref || !ref->ct)
+ return;
+
+ tag = ct_to_alloc_tag(ref->ct);
+
+ if (may_allocate)
+ lazy_percpu_counter_add(&tag->bytes_allocated, -bytes);
+ else
+ lazy_percpu_counter_add_noupgrade(&tag->bytes_allocated, -bytes);
+ ref->ct = NULL;
+}
+
+static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
+{
+ __alloc_tag_sub(ref, bytes, true);
+}
+
+static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes)
+{
+ __alloc_tag_sub(ref, bytes, false);
+}
+
+static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
+{
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ /* The switch should be checked before this */
+ BUG_ON(!mem_alloc_profiling_enabled());
+
+ WARN_ONCE(ref && ref->ct,
+ "alloc_tag was not cleared (got tag for %s:%u)\n",\
+ ref->ct->filename, ref->ct->lineno);
+
+ WARN_ONCE(!tag, "current->alloc_tag not set");
+#endif
+ if (!ref || !tag)
+ return;
+
+ ref->ct = &tag->ct;
+ lazy_percpu_counter_add(&tag->bytes_allocated, bytes);
+}
+
+#else
+
+#define DEFINE_ALLOC_TAG(_alloc_tag, _old)
+static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
+static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
+static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
+ size_t bytes) {}
+
+#endif
+
+#endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 35e7efdea2d9..33708bf8f191 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -763,6 +763,10 @@ struct task_struct {
unsigned int flags;
unsigned int ptrace;
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ struct alloc_tag *alloc_tag;
+#endif
+
#ifdef CONFIG_SMP
int on_cpu;
struct __call_single_node wake_entry;
@@ -802,6 +806,7 @@ struct task_struct {
struct task_group *sched_task_group;
#endif
+
#ifdef CONFIG_UCLAMP_TASK
/*
* Clamp values requested for a scheduling entity.
@@ -2444,4 +2449,23 @@ static inline void sched_core_fork(struct task_struct *p) { }
extern void sched_set_stop_task(int cpu, struct task_struct *stop);
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag)
+{
+ swap(current->alloc_tag, tag);
+ return tag;
+}
+
+static inline void alloc_tag_restore(struct alloc_tag *tag, struct alloc_tag *old)
+{
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ WARN(current->alloc_tag != tag, "current->alloc_tag was changed:\n");
+#endif
+ current->alloc_tag = old;
+}
+#else
+static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag) { return NULL; }
+#define alloc_tag_restore(_tag, _old)
+#endif
+
#endif
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5078da7d3ffb..da0a91ea6042 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -961,6 +961,25 @@ config CODE_TAGGING
bool
select KALLSYMS
+config MEM_ALLOC_PROFILING
+ bool "Enable memory allocation profiling"
+ default n
+ depends on DEBUG_FS
+ select CODE_TAGGING
+ select LAZY_PERCPU_COUNTER
+ help
+ Track allocation source code and record total allocation size
+ initiated at that code location. The mechanism can be used to track
+ memory leaks with a low performance impact.
+
+config MEM_ALLOC_PROFILING_DEBUG
+ bool "Memory allocation profiler debugging"
+ default n
+ depends on MEM_ALLOC_PROFILING
+ help
+ Adds warnings with helpful error messages for memory allocation
+ profiling.
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"
source "lib/Kconfig.kmsan"
diff --git a/lib/Makefile b/lib/Makefile
index 28d70ecf2976..8d09ccb4d30c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -229,6 +229,8 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
obj-$(CONFIG_CODE_TAGGING) += codetag.o
+obj-$(CONFIG_MEM_ALLOC_PROFILING) += alloc_tag.o
+
lib-$(CONFIG_GENERIC_BUG) += bug.o
obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
new file mode 100644
index 000000000000..3c4cfeb79862
--- /dev/null
+++ b/lib/alloc_tag.c
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/alloc_tag.h>
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/seq_buf.h>
+#include <linux/uaccess.h>
+
+DEFINE_STATIC_KEY_TRUE(mem_alloc_profiling_key);
+
+/*
+ * Won't need to be exported once page allocation accounting is moved to the
+ * correct place:
+ */
+EXPORT_SYMBOL(mem_alloc_profiling_key);
+
+static int __init mem_alloc_profiling_disable(char *s)
+{
+ static_branch_disable(&mem_alloc_profiling_key);
+ return 1;
+}
+__setup("nomem_profiling", mem_alloc_profiling_disable);
+
+struct alloc_tag_file_iterator {
+ struct codetag_iterator ct_iter;
+ struct seq_buf buf;
+ char rawbuf[4096];
+};
+
+struct user_buf {
+ char __user *buf; /* destination user buffer */
+ size_t size; /* size of requested read */
+ ssize_t ret; /* bytes read so far */
+};
+
+static int flush_ubuf(struct user_buf *dst, struct seq_buf *src)
+{
+ if (src->len) {
+ size_t bytes = min_t(size_t, src->len, dst->size);
+
+ /* copy_to_user() returns the number of uncopied bytes, not an errno. */
+ if (copy_to_user(dst->buf, src->buffer, bytes))
+ return -EFAULT;
+
+ dst->ret += bytes;
+ dst->buf += bytes;
+ dst->size -= bytes;
+ src->len -= bytes;
+ memmove(src->buffer, src->buffer + bytes, src->len);
+ }
+
+ return 0;
+}
+
+static int allocations_file_open(struct inode *inode, struct file *file)
+{
+ struct codetag_type *cttype = inode->i_private;
+ struct alloc_tag_file_iterator *iter;
+
+ iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter)
+ return -ENOMEM;
+
+ codetag_lock_module_list(cttype, true);
+ iter->ct_iter = codetag_get_ct_iter(cttype);
+ codetag_lock_module_list(cttype, false);
+ seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
+ file->private_data = iter;
+
+ return 0;
+}
+
+static int allocations_file_release(struct inode *inode, struct file *file)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+
+ kfree(iter);
+ return 0;
+}
+
+static void alloc_tag_to_text(struct seq_buf *out, struct codetag *ct)
+{
+ struct alloc_tag *tag = ct_to_alloc_tag(ct);
+ char buf[10];
+
+ string_get_size(lazy_percpu_counter_read(&tag->bytes_allocated), 1,
+ STRING_UNITS_2, buf, sizeof(buf));
+
+ seq_buf_printf(out, "%8s ", buf);
+ codetag_to_text(out, ct);
+ seq_buf_putc(out, '\n');
+}
+
+static ssize_t allocations_file_read(struct file *file, char __user *ubuf,
+ size_t size, loff_t *ppos)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+ struct user_buf buf = { .buf = ubuf, .size = size };
+ struct codetag *ct;
+ int err = 0;
+
+ codetag_lock_module_list(iter->ct_iter.cttype, true);
+ while (1) {
+ err = flush_ubuf(&buf, &iter->buf);
+ if (err || !buf.size)
+ break;
+
+ ct = codetag_next_ct(&iter->ct_iter);
+ if (!ct)
+ break;
+
+ alloc_tag_to_text(&iter->buf, ct);
+ }
+ codetag_lock_module_list(iter->ct_iter.cttype, false);
+
+ return err ? : buf.ret;
+}
+
+static const struct file_operations allocations_file_ops = {
+ .owner = THIS_MODULE,
+ .open = allocations_file_open,
+ .release = allocations_file_release,
+ .read = allocations_file_read,
+};
+
+static int __init dbgfs_init(struct codetag_type *cttype)
+{
+ struct dentry *file;
+
+ file = debugfs_create_file("allocations", 0444, NULL, cttype,
+ &allocations_file_ops);
+
+ return IS_ERR(file) ? PTR_ERR(file) : 0;
+}
+
+static bool alloc_tag_module_unload(struct codetag_type *cttype, struct codetag_module *cmod)
+{
+ struct codetag_iterator iter = codetag_get_ct_iter(cttype);
+ bool module_unused = true;
+ struct alloc_tag *tag;
+ struct codetag *ct;
+ size_t bytes;
+
+ for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
+ if (iter.cmod != cmod)
+ continue;
+
+ tag = ct_to_alloc_tag(ct);
+ bytes = lazy_percpu_counter_read(&tag->bytes_allocated);
+
+ if (!WARN(bytes, "%s:%u module %s func:%s has %zu allocated at module unload",
+ ct->filename, ct->lineno, ct->modname, ct->function, bytes))
+ lazy_percpu_counter_exit(&tag->bytes_allocated);
+ else
+ module_unused = false;
+ }
+
+ return module_unused;
+}
+
+static int __init alloc_tag_init(void)
+{
+ struct codetag_type *cttype;
+ const struct codetag_type_desc desc = {
+ .section = "alloc_tags",
+ .tag_size = sizeof(struct alloc_tag),
+ .module_unload = alloc_tag_module_unload,
+ };
+
+ cttype = codetag_register_type(&desc);
+ if (IS_ERR_OR_NULL(cttype))
+ return PTR_ERR(cttype);
+
+ return dbgfs_init(cttype);
+}
+module_init(alloc_tag_init);
diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index bf5bcf2836d8..45c67a0994f3 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S
@@ -9,6 +9,8 @@
#define DISCARD_EH_FRAME *(.eh_frame)
#endif
+#include <asm-generic/codetag.lds.h>
+
SECTIONS {
/DISCARD/ : {
*(.discard)
@@ -47,12 +49,17 @@ SECTIONS {
.data : {
*(.data .data.[0-9a-zA-Z_]*)
*(.data..L*)
+ CODETAG_SECTIONS()
}
.rodata : {
*(.rodata .rodata.[0-9a-zA-Z_]*)
*(.rodata..L*)
}
+#else
+ .data : {
+ CODETAG_SECTIONS()
+ }
#endif
}
--
2.40.1.495.gc816e09b53d-goog
Introduce helper functions to easily instrument page allocators by
storing a pointer to the allocation tag associated with the code that
allocated the page in a page_ext field.
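Two details of the helpers can be sketched in userspace: the tag
reference lives at a per-client byte offset inside the page_ext blob,
and freeing an order-N allocation uncharges PAGE_SIZE << N bytes. The
sketch below is a model under assumptions: MODEL_PAGE_SIZE assumes 4 KiB
pages, and ext_blob_model, ext_field() and order_to_bytes() are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Bytes covered by an order-N page allocation. */
static unsigned long order_to_bytes(unsigned int order)
{
	return MODEL_PAGE_SIZE << order;
}

/* Hypothetical stand-in for the per-page extension blob. */
struct ext_blob_model {
	unsigned char data[32];
};

/*
 * Each page_ext client owns a slice of the blob, located by adding its
 * registered byte offset to the blob's base address.
 */
static void *ext_field(struct ext_blob_model *ext, size_t offset)
{
	return ext ? (unsigned char *)ext + offset : NULL;
}
```

In this model an order-3 allocation accounts for 32768 bytes, and a page
with no extension blob yields no reference, so nothing is uncharged.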
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/pgalloc_tag.h | 33 +++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 1 +
lib/alloc_tag.c | 17 +++++++++++++++++
mm/page_ext.c | 12 +++++++++---
4 files changed, 60 insertions(+), 3 deletions(-)
create mode 100644 include/linux/pgalloc_tag.h
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
new file mode 100644
index 000000000000..f8c7b6ef9c75
--- /dev/null
+++ b/include/linux/pgalloc_tag.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * page allocation tagging
+ */
+#ifndef _LINUX_PGALLOC_TAG_H
+#define _LINUX_PGALLOC_TAG_H
+
+#include <linux/alloc_tag.h>
+#include <linux/page_ext.h>
+
+extern struct page_ext_operations page_alloc_tagging_ops;
+struct page_ext *lookup_page_ext(const struct page *page);
+
+static inline union codetag_ref *get_page_tag_ref(struct page *page)
+{
+ if (page && mem_alloc_profiling_enabled()) {
+ struct page_ext *page_ext = lookup_page_ext(page);
+
+ if (page_ext)
+ return (void *)page_ext + page_alloc_tagging_ops.offset;
+ }
+ return NULL;
+}
+
+static inline void pgalloc_tag_dec(struct page *page, unsigned int order)
+{
+ union codetag_ref *ref = get_page_tag_ref(page);
+
+ if (ref)
+ alloc_tag_sub(ref, PAGE_SIZE << order);
+}
+
+#endif /* _LINUX_PGALLOC_TAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index da0a91ea6042..d3aa5ee0bf0d 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -967,6 +967,7 @@ config MEM_ALLOC_PROFILING
depends on DEBUG_FS
select CODE_TAGGING
select LAZY_PERCPU_COUNTER
+ select PAGE_EXTENSION
help
Track allocation source code and record total allocation size
initiated at that code location. The mechanism can be used to track
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 3c4cfeb79862..4a0b95a46b2e 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -4,6 +4,7 @@
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/module.h>
+#include <linux/page_ext.h>
#include <linux/seq_buf.h>
#include <linux/uaccess.h>
@@ -159,6 +160,22 @@ static bool alloc_tag_module_unload(struct codetag_type *cttype, struct codetag_
return module_unused;
}
+static __init bool need_page_alloc_tagging(void)
+{
+ return true;
+}
+
+static __init void init_page_alloc_tagging(void)
+{
+}
+
+struct page_ext_operations page_alloc_tagging_ops = {
+ .size = sizeof(union codetag_ref),
+ .need = need_page_alloc_tagging,
+ .init = init_page_alloc_tagging,
+};
+EXPORT_SYMBOL(page_alloc_tagging_ops);
+
static int __init alloc_tag_init(void)
{
struct codetag_type *cttype;
diff --git a/mm/page_ext.c b/mm/page_ext.c
index dc1626be458b..eaf054ec276c 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -10,6 +10,7 @@
#include <linux/page_idle.h>
#include <linux/page_table_check.h>
#include <linux/rcupdate.h>
+#include <linux/pgalloc_tag.h>
/*
* struct page extension
@@ -82,6 +83,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
#if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ &page_alloc_tagging_ops,
+#endif
#ifdef CONFIG_PAGE_TABLE_CHECK
&page_table_check_ops,
#endif
@@ -90,7 +94,7 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
unsigned long page_ext_size;
static unsigned long total_usage;
-static struct page_ext *lookup_page_ext(const struct page *page);
+struct page_ext *lookup_page_ext(const struct page *page);
bool early_page_ext __meminitdata;
static int __init setup_early_page_ext(char *str)
@@ -199,7 +203,7 @@ void __meminit pgdat_page_ext_init(struct pglist_data *pgdat)
pgdat->node_page_ext = NULL;
}
-static struct page_ext *lookup_page_ext(const struct page *page)
+struct page_ext *lookup_page_ext(const struct page *page)
{
unsigned long pfn = page_to_pfn(page);
unsigned long index;
@@ -219,6 +223,7 @@ static struct page_ext *lookup_page_ext(const struct page *page)
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
}
+EXPORT_SYMBOL(lookup_page_ext);
static int __init alloc_node_page_ext(int nid)
{
@@ -278,7 +283,7 @@ static bool page_ext_invalid(struct page_ext *page_ext)
return !page_ext || (((unsigned long)page_ext & PAGE_EXT_INVALID) == PAGE_EXT_INVALID);
}
-static struct page_ext *lookup_page_ext(const struct page *page)
+struct page_ext *lookup_page_ext(const struct page *page)
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
@@ -295,6 +300,7 @@ static struct page_ext *lookup_page_ext(const struct page *page)
return NULL;
return get_entry(page_ext, pfn);
}
+EXPORT_SYMBOL(lookup_page_ext);
static void *__meminit alloc_page_ext(size_t size, int nid)
{
--
2.40.1.495.gc816e09b53d-goog
For all page allocations to be tagged, page_ext has to be initialized
before the first page allocation. Early tasks allocate their stacks
using the page allocator before alloc_node_page_ext() initializes the
page_ext area, unless early_page_ext is enabled. These allocations
therefore generate a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG is
enabled.
Enable early_page_ext whenever CONFIG_MEM_ALLOC_PROFILING_DEBUG=y to
ensure that page_ext is initialized before any page allocation. This
brings all the negative effects associated with early_page_ext, such as
a possibly longer boot time, so it is enabled only when debugging with
CONFIG_MEM_ALLOC_PROFILING_DEBUG and not universally for
CONFIG_MEM_ALLOC_PROFILING.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
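Note for readers: the default-selection logic described above can be
modelled in user space roughly as follows. This is only a sketch under
stated assumptions: demo_early_page_ext() and its profiling_debug argument
are hypothetical stand-ins for the compile-time #ifdef in the patch, and
the string match only approximates how the early_page_ext boot parameter
is handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical model, not kernel code. profiling_debug stands in for
 * CONFIG_MEM_ALLOC_PROFILING_DEBUG, which the patch tests at compile time.
 */
static bool demo_early_page_ext(bool profiling_debug, const char *cmdline)
{
	/* Default: on for debug builds, off otherwise. */
	bool early = profiling_debug;

	/* The boot parameter can only switch the flag on, never off. */
	if (cmdline && strstr(cmdline, "early_page_ext"))
		early = true;

	return early;
}
```

With this model a debug build always initializes page_ext early, while a
non-debug build does so only when the boot parameter is given.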
mm/page_ext.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/mm/page_ext.c b/mm/page_ext.c
index eaf054ec276c..55ba797f8881 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -96,7 +96,16 @@ unsigned long page_ext_size;
static unsigned long total_usage;
struct page_ext *lookup_page_ext(const struct page *page);
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+/*
+ * To ensure correct allocation tagging for pages, page_ext should be available
+ * before the first page allocation. Otherwise early task stacks will be
+ * allocated before page_ext initialization and missing tags will be flagged.
+ */
+bool early_page_ext __meminitdata = true;
+#else
bool early_page_ext __meminitdata;
+#endif
static int __init setup_early_page_ext(char *str)
{
early_page_ext = true;
--
2.40.1.495.gc816e09b53d-goog
When a high-order page is split into smaller ones, each newly split
page should get its codetag. The original codetag is reused for these
pages, but it is recorded as a 0-byte allocation because the original
codetag already accounts for the original high-order allocated page.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
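Note for readers: the accounting rule above can be checked with a small
user-space model. This is a sketch under assumptions, not the kernel
implementation: struct demo_tag, struct demo_page and the demo_* helpers
are hypothetical simplifications of alloc_tag, the per-page codetag
reference and alloc_tag_add()/alloc_tag_sub(), and the refs counter is
added only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL

/* Simplified stand-in for struct alloc_tag: counters for one call site. */
struct demo_tag {
	unsigned long bytes;
	unsigned long refs;
};

/* Simplified stand-in for the per-page codetag reference in page_ext. */
struct demo_page {
	struct demo_tag *tag;
};

static void demo_tag_add(struct demo_page *page, struct demo_tag *tag,
			 unsigned long bytes)
{
	page->tag = tag;
	tag->bytes += bytes;
	tag->refs++;
}

static void demo_tag_sub(struct demo_page *page, unsigned long bytes)
{
	if (!page->tag)
		return;
	page->tag->bytes -= bytes;
	page->tag->refs--;
	page->tag = NULL;
}

/* On split, tail pages get a reference to the same tag with 0 bytes. */
static void demo_tag_split(struct demo_page *pages, unsigned int nr)
{
	struct demo_tag *tag = pages[0].tag;
	unsigned int i;

	if (!tag)
		return;
	for (i = 1; i < nr; i++)
		demo_tag_add(&pages[i], tag, 0);
}

/* Allocate order-2, split into 4 pages, free each one as order-0. */
static unsigned long demo_split_then_free(void)
{
	struct demo_tag tag = { 0, 0 };
	struct demo_page pages[4] = { { NULL } };
	unsigned int i;

	/* Only the head page is charged, for the whole high-order block. */
	demo_tag_add(&pages[0], &tag, DEMO_PAGE_SIZE << 2);
	demo_tag_split(pages, 4);
	/* The split adds references but no bytes. */
	assert(tag.bytes == 4 * DEMO_PAGE_SIZE);
	assert(tag.refs == 4);

	for (i = 0; i < 4; i++)
		demo_tag_sub(&pages[i], DEMO_PAGE_SIZE);

	return tag.bytes;	/* bytes still charged to the tag */
}
```

Because every split page holds a reference to the tag, freeing the pages
one by one subtracts PAGE_SIZE each time and the total returns to zero
without the split itself changing the byte count.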
include/linux/pgalloc_tag.h | 30 ++++++++++++++++++++++++++++++
mm/huge_memory.c | 2 ++
mm/page_alloc.c | 2 ++
3 files changed, 34 insertions(+)
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index 567327c1c46f..0cbba13869b5 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -52,11 +52,41 @@ static inline void pgalloc_tag_dec(struct page *page, unsigned int order)
}
}
+static inline void pgalloc_tag_split(struct page *page, unsigned int nr)
+{
+ int i;
+ struct page_ext *page_ext;
+ union codetag_ref *ref;
+ struct alloc_tag *tag;
+
+ if (!mem_alloc_profiling_enabled())
+ return;
+
+ page_ext = page_ext_get(page);
+ if (unlikely(!page_ext))
+ return;
+
+ ref = codetag_ref_from_page_ext(page_ext);
+ if (!ref->ct)
+ goto out;
+
+ tag = ct_to_alloc_tag(ref->ct);
+ page_ext = page_ext_next(page_ext);
+ for (i = 1; i < nr; i++) {
+ /* New reference with 0 bytes accounted */
+ alloc_tag_add(codetag_ref_from_page_ext(page_ext), tag, 0);
+ page_ext = page_ext_next(page_ext);
+ }
+out:
+ page_ext_put(page_ext);
+}
+
#else /* CONFIG_MEM_ALLOC_PROFILING */
static inline union codetag_ref *get_page_tag_ref(struct page *page) { return NULL; }
static inline void put_page_tag_ref(union codetag_ref *ref) {}
#define pgalloc_tag_dec(__page, __size) do {} while (0)
+static inline void pgalloc_tag_split(struct page *page, unsigned int nr) {}
#endif /* CONFIG_MEM_ALLOC_PROFILING */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 624671aaa60d..221cce0052a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -37,6 +37,7 @@
#include <linux/page_owner.h>
#include <linux/sched/sysctl.h>
#include <linux/memory-tiers.h>
+#include <linux/pgalloc_tag.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -2557,6 +2558,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
/* Caller disabled irqs, so they are still disabled here */
split_page_owner(head, nr);
+ pgalloc_tag_split(head, nr);
/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index edd35500f7f6..8cf5a835af7f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2796,6 +2796,7 @@ void split_page(struct page *page, unsigned int order)
for (i = 1; i < (1 << order); i++)
set_page_refcounted(page + i);
split_page_owner(page, 1 << order);
+ pgalloc_tag_split(page, 1 << order);
split_page_memcg(page, 1 << order);
}
EXPORT_SYMBOL_GPL(split_page);
@@ -5012,6 +5013,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
struct page *last = page + nr;
split_page_owner(page, 1 << order);
+ pgalloc_tag_split(page, 1 << order);
split_page_memcg(page, 1 << order);
while (page < --last)
set_page_refcounted(last);
--
2.40.1.495.gc816e09b53d-goog
Redefine the page allocators as macros that record an allocation tag at
the point of invocation. Instrument post_alloc_hook() to charge each
allocation to the current allocation tag and free_pages_prepare() to
uncharge it when the pages are freed.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
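Note for readers: the wrapping pattern used by alloc_hooks() can be tried
out in user space. The sketch below is an illustration under assumptions:
all demo_* names are hypothetical, demo_current_tag stands in for
current->alloc_tag, and the macro body only approximates what
DEFINE_ALLOC_TAG() and alloc_tag_restore() do. Like the kernel macro, it
relies on GNU statement expressions (GCC and Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified per-call-site tag. */
struct demo_tag {
	const char *file;
	int line;
	unsigned long bytes;
};

/* Stand-in for current->alloc_tag. */
static struct demo_tag *demo_current_tag;
/* Remembers the last tag that was charged, for inspection only. */
static struct demo_tag *demo_last_charged;

/* Innermost allocator: charges whichever tag is current. */
static void *demo_alloc_inner(size_t size)
{
	void *p = malloc(size);

	if (p && demo_current_tag) {
		demo_current_tag->bytes += size;
		demo_last_charged = demo_current_tag;
	}
	return p;
}

/* Save the current tag, install the call-site tag, call, then restore. */
#define demo_alloc_hooks(_do_alloc, _res_type)				\
({									\
	static struct demo_tag _tag = { __FILE__, __LINE__, 0 };	\
	struct demo_tag *_old = demo_current_tag;			\
	_res_type _res;							\
									\
	demo_current_tag = &_tag;					\
	_res = (_do_alloc);						\
	demo_current_tag = _old;					\
	_res;								\
})

#define demo_alloc(_size) \
	demo_alloc_hooks(demo_alloc_inner(_size), void *)

/* Two call sites on different lines get two distinct tags. */
static int demo_two_call_sites(void)
{
	void *a = demo_alloc(100);
	struct demo_tag *tag_a = demo_last_charged;
	void *b = demo_alloc(200);
	struct demo_tag *tag_b = demo_last_charged;
	int ok = a && b && tag_a != tag_b &&
		 tag_a->line != tag_b->line &&
		 tag_a->bytes >= 100 && tag_b->bytes >= 200 &&
		 demo_current_tag == NULL;

	free(a);
	free(b);
	return ok;
}
```

Making the public allocator a macro is what lets the static tag be
instantiated once per call site; a plain function would have a single tag
shared by all callers.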
include/linux/alloc_tag.h | 11 ++++
include/linux/gfp.h | 123 +++++++++++++++++++++++++-----------
include/linux/page_ext.h | 1 -
include/linux/pagemap.h | 9 ++-
include/linux/pgalloc_tag.h | 38 +++++++++--
mm/compaction.c | 9 ++-
mm/filemap.c | 6 +-
mm/mempolicy.c | 30 ++++-----
mm/mm_init.c | 1 +
mm/page_alloc.c | 73 ++++++++++++---------
10 files changed, 208 insertions(+), 93 deletions(-)
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index d913f8d9a7d8..07922d81b641 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -102,4 +102,15 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
#endif
+#define alloc_hooks(_do_alloc, _res_type, _err) \
+({ \
+ _res_type _res; \
+ DEFINE_ALLOC_TAG(_alloc_tag, _old); \
+ \
+ _res = _do_alloc; \
+ alloc_tag_restore(&_alloc_tag, _old); \
+ _res; \
+})
+
+
#endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ed8cb537c6a7..0cb4a515109a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -6,6 +6,8 @@
#include <linux/mmzone.h>
#include <linux/topology.h>
+#include <linux/alloc_tag.h>
+#include <linux/sched.h>
struct vm_area_struct;
@@ -174,42 +176,57 @@ static inline void arch_free_page(struct page *page, int order) { }
static inline void arch_alloc_page(struct page *page, int order) { }
#endif
-struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
+struct page *_alloc_pages2(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
-struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+#define __alloc_pages(_gfp, _order, _preferred_nid, _nodemask) \
+ alloc_hooks(_alloc_pages2(_gfp, _order, _preferred_nid, \
+ _nodemask), struct page *, NULL)
+
+struct folio *_folio_alloc2(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
+#define __folio_alloc(_gfp, _order, _preferred_nid, _nodemask) \
+ alloc_hooks(_folio_alloc2(_gfp, _order, _preferred_nid, \
+ _nodemask), struct folio *, NULL)
-unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long _alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
struct page **page_array);
-
-unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+#define __alloc_pages_bulk(_gfp, _preferred_nid, _nodemask, _nr_pages, \
+ _page_list, _page_array) \
+ alloc_hooks(_alloc_pages_bulk(_gfp, _preferred_nid, \
+ _nodemask, _nr_pages, \
+ _page_list, _page_array), \
+ unsigned long, 0)
+
+unsigned long _alloc_pages_bulk_array_mempolicy(gfp_t gfp,
unsigned long nr_pages,
struct page **page_array);
+#define alloc_pages_bulk_array_mempolicy(_gfp, _nr_pages, _page_array) \
+ alloc_hooks(_alloc_pages_bulk_array_mempolicy(_gfp, \
+ _nr_pages, _page_array), \
+ unsigned long, 0)
/* Bulk allocate order-0 pages */
-static inline unsigned long
-alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
-{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL);
-}
+#define alloc_pages_bulk_list(_gfp, _nr_pages, _list) \
+ __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _list, NULL)
-static inline unsigned long
-alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array)
-{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array);
-}
+#define alloc_pages_bulk_array(_gfp, _nr_pages, _page_array) \
+ __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, NULL, _page_array)
static inline unsigned long
-alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
+_alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();
- return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
+ return _alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
}
+#define alloc_pages_bulk_array_node(_gfp, _nid, _nr_pages, _page_array) \
+ alloc_hooks(_alloc_pages_bulk_array_node(_gfp, _nid, _nr_pages, _page_array), \
+ unsigned long, 0)
+
static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
{
gfp_t warn_gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN);
@@ -229,21 +246,25 @@ static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
* online. For more general interface, see alloc_pages_node().
*/
static inline struct page *
-__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
+_alloc_pages_node2(int nid, gfp_t gfp_mask, unsigned int order)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp_mask);
- return __alloc_pages(gfp_mask, order, nid, NULL);
+ return _alloc_pages2(gfp_mask, order, nid, NULL);
}
+#define __alloc_pages_node(_nid, _gfp_mask, _order) \
+ alloc_hooks(_alloc_pages_node2(_nid, _gfp_mask, _order), \
+ struct page *, NULL)
+
static inline
struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp);
- return __folio_alloc(gfp, order, nid, NULL);
+ return _folio_alloc2(gfp, order, nid, NULL);
}
/*
@@ -251,32 +272,45 @@ struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
* prefer the current CPU's closest node. Otherwise node must be valid and
* online.
*/
-static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
+static inline struct page *_alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();
- return __alloc_pages_node(nid, gfp_mask, order);
+ return _alloc_pages_node2(nid, gfp_mask, order);
}
+#define alloc_pages_node(_nid, _gfp_mask, _order) \
+ alloc_hooks(_alloc_pages_node(_nid, _gfp_mask, _order), \
+ struct page *, NULL)
+
#ifdef CONFIG_NUMA
-struct page *alloc_pages(gfp_t gfp, unsigned int order);
-struct folio *folio_alloc(gfp_t gfp, unsigned order);
-struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+struct page *_alloc_pages(gfp_t gfp, unsigned int order);
+struct folio *_folio_alloc(gfp_t gfp, unsigned int order);
+struct folio *_vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage);
#else
-static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
+static inline struct page *_alloc_pages(gfp_t gfp_mask, unsigned int order)
{
- return alloc_pages_node(numa_node_id(), gfp_mask, order);
+ return _alloc_pages_node(numa_node_id(), gfp_mask, order);
}
-static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
+static inline struct folio *_folio_alloc(gfp_t gfp, unsigned int order)
{
return __folio_alloc_node(gfp, order, numa_node_id());
}
-#define vma_alloc_folio(gfp, order, vma, addr, hugepage) \
- folio_alloc(gfp, order)
+#define _vma_alloc_folio(gfp, order, vma, addr, hugepage) \
+ _folio_alloc(gfp, order)
#endif
+
+#define alloc_pages(_gfp, _order) \
+ alloc_hooks(_alloc_pages(_gfp, _order), struct page *, NULL)
+#define folio_alloc(_gfp, _order) \
+ alloc_hooks(_folio_alloc(_gfp, _order), struct folio *, NULL)
+#define vma_alloc_folio(_gfp, _order, _vma, _addr, _hugepage) \
+ alloc_hooks(_vma_alloc_folio(_gfp, _order, _vma, _addr, \
+ _hugepage), struct folio *, NULL)
+
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
static inline struct page *alloc_page_vma(gfp_t gfp,
struct vm_area_struct *vma, unsigned long addr)
@@ -286,12 +320,21 @@ static inline struct page *alloc_page_vma(gfp_t gfp,
return &folio->page;
}
-extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
-extern unsigned long get_zeroed_page(gfp_t gfp_mask);
+extern unsigned long _get_free_pages(gfp_t gfp_mask, unsigned int order);
+#define __get_free_pages(_gfp_mask, _order) \
+ alloc_hooks(_get_free_pages(_gfp_mask, _order), unsigned long, 0)
+extern unsigned long _get_zeroed_page(gfp_t gfp_mask);
+#define get_zeroed_page(_gfp_mask) \
+ alloc_hooks(_get_zeroed_page(_gfp_mask), unsigned long, 0)
-void *alloc_pages_exact(size_t size, gfp_t gfp_mask) __alloc_size(1);
+void *_alloc_pages_exact(size_t size, gfp_t gfp_mask) __alloc_size(1);
+#define alloc_pages_exact(_size, _gfp_mask) \
+ alloc_hooks(_alloc_pages_exact(_size, _gfp_mask), void *, NULL)
void free_pages_exact(void *virt, size_t size);
-__meminit void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);
+
+__meminit void *_alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);
+#define alloc_pages_exact_nid(_nid, _size, _gfp_mask) \
+ alloc_hooks(_alloc_pages_exact_nid(_nid, _size, _gfp_mask), void *, NULL)
#define __get_free_page(gfp_mask) \
__get_free_pages((gfp_mask), 0)
@@ -354,10 +397,16 @@ static inline bool pm_suspended_storage(void)
#ifdef CONFIG_CONTIG_ALLOC
/* The below functions must be run on a range from a single zone. */
-extern int alloc_contig_range(unsigned long start, unsigned long end,
+extern int _alloc_contig_range(unsigned long start, unsigned long end,
unsigned migratetype, gfp_t gfp_mask);
-extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
- int nid, nodemask_t *nodemask);
+#define alloc_contig_range(_start, _end, _migratetype, _gfp_mask) \
+ alloc_hooks(_alloc_contig_range(_start, _end, _migratetype, \
+ _gfp_mask), int, -ENOMEM)
+extern struct page *_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask);
+#define alloc_contig_pages(_nr_pages, _gfp_mask, _nid, _nodemask) \
+ alloc_hooks(_alloc_contig_pages(_nr_pages, _gfp_mask, _nid, \
+ _nodemask), struct page *, NULL)
#endif
void free_contig_range(unsigned long pfn, unsigned long nr_pages);
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 67314f648aeb..cff15ee5440e 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -4,7 +4,6 @@
#include <linux/types.h>
#include <linux/stacktrace.h>
-#include <linux/stackdepot.h>
struct pglist_data;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..b2efafa001f8 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -467,14 +467,17 @@ static inline void *detach_page_private(struct page *page)
}
#ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
+struct folio *_filemap_alloc_folio(gfp_t gfp, unsigned int order);
#else
-static inline struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
+static inline struct folio *_filemap_alloc_folio(gfp_t gfp, unsigned int order)
{
- return folio_alloc(gfp, order);
+ return _folio_alloc(gfp, order);
}
#endif
+#define filemap_alloc_folio(_gfp, _order) \
+ alloc_hooks(_filemap_alloc_folio(_gfp, _order), struct folio *, NULL)
+
static inline struct page *__page_cache_alloc(gfp_t gfp)
{
return &filemap_alloc_folio(gfp, 0)->page;
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index f8c7b6ef9c75..567327c1c46f 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -6,28 +6,58 @@
#define _LINUX_PGALLOC_TAG_H
#include <linux/alloc_tag.h>
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
#include <linux/page_ext.h>
extern struct page_ext_operations page_alloc_tagging_ops;
-struct page_ext *lookup_page_ext(const struct page *page);
+extern struct page_ext *page_ext_get(struct page *page);
+extern void page_ext_put(struct page_ext *page_ext);
+
+static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext)
+{
+ return (void *)page_ext + page_alloc_tagging_ops.offset;
+}
+
+static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref)
+{
+ return (void *)ref - page_alloc_tagging_ops.offset;
+}
static inline union codetag_ref *get_page_tag_ref(struct page *page)
{
if (page && mem_alloc_profiling_enabled()) {
- struct page_ext *page_ext = lookup_page_ext(page);
+ struct page_ext *page_ext = page_ext_get(page);
if (page_ext)
- return (void *)page_ext + page_alloc_tagging_ops.offset;
+ return codetag_ref_from_page_ext(page_ext);
}
return NULL;
}
+static inline void put_page_tag_ref(union codetag_ref *ref)
+{
+ if (ref)
+ page_ext_put(page_ext_from_codetag_ref(ref));
+}
+
static inline void pgalloc_tag_dec(struct page *page, unsigned int order)
{
union codetag_ref *ref = get_page_tag_ref(page);
- if (ref)
+ if (ref) {
alloc_tag_sub(ref, PAGE_SIZE << order);
+ put_page_tag_ref(ref);
+ }
}
+#else /* CONFIG_MEM_ALLOC_PROFILING */
+
+static inline union codetag_ref *get_page_tag_ref(struct page *page) { return NULL; }
+static inline void put_page_tag_ref(union codetag_ref *ref) {}
+#define pgalloc_tag_dec(__page, __size) do {} while (0)
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING */
+
#endif /* _LINUX_PGALLOC_TAG_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index c8bcdea15f5f..32707fb62495 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1684,7 +1684,7 @@ static void isolate_freepages(struct compact_control *cc)
* This is a migrate-callback that "allocates" freepages by taking pages
* from the isolated freelists in the block we are migrating to.
*/
-static struct page *compaction_alloc(struct page *migratepage,
+static struct page *_compaction_alloc(struct page *migratepage,
unsigned long data)
{
struct compact_control *cc = (struct compact_control *)data;
@@ -1704,6 +1704,13 @@ static struct page *compaction_alloc(struct page *migratepage,
return freepage;
}
+static struct page *compaction_alloc(struct page *migratepage,
+ unsigned long data)
+{
+ return alloc_hooks(_compaction_alloc(migratepage, data),
+ struct page *, NULL);
+}
+
/*
* This is a migrate-callback that "frees" freepages back to the isolated
* freelist. All pages on the freelist are from the same zone, so there is no
diff --git a/mm/filemap.c b/mm/filemap.c
index a34abfe8c654..f0f8b782d172 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -958,7 +958,7 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
EXPORT_SYMBOL_GPL(filemap_add_folio);
#ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
+struct folio *_filemap_alloc_folio(gfp_t gfp, unsigned int order)
{
int n;
struct folio *folio;
@@ -973,9 +973,9 @@ struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
return folio;
}
- return folio_alloc(gfp, order);
+ return _folio_alloc(gfp, order);
}
-EXPORT_SYMBOL(filemap_alloc_folio);
+EXPORT_SYMBOL(_filemap_alloc_folio);
#endif
/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2068b594dc88..80cd33811641 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2141,7 +2141,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
}
/**
- * vma_alloc_folio - Allocate a folio for a VMA.
+ * _vma_alloc_folio - Allocate a folio for a VMA.
* @gfp: GFP flags.
* @order: Order of the folio.
* @vma: Pointer to VMA or NULL if not available.
@@ -2155,7 +2155,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*
* Return: The folio on success or NULL if allocation fails.
*/
-struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+struct folio *_vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage)
{
struct mempolicy *pol;
@@ -2240,10 +2240,10 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
out:
return folio;
}
-EXPORT_SYMBOL(vma_alloc_folio);
+EXPORT_SYMBOL(_vma_alloc_folio);
/**
- * alloc_pages - Allocate pages.
+ * _alloc_pages - Allocate pages.
* @gfp: GFP flags.
* @order: Power of two of number of pages to allocate.
*
@@ -2256,7 +2256,7 @@ EXPORT_SYMBOL(vma_alloc_folio);
* flags are used.
* Return: The page on success or NULL if allocation fails.
*/
-struct page *alloc_pages(gfp_t gfp, unsigned order)
+struct page *_alloc_pages(gfp_t gfp, unsigned int order)
{
struct mempolicy *pol = &default_policy;
struct page *page;
@@ -2274,15 +2274,15 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
page = alloc_pages_preferred_many(gfp, order,
policy_node(gfp, pol, numa_node_id()), pol);
else
- page = __alloc_pages(gfp, order,
+ page = _alloc_pages2(gfp, order,
policy_node(gfp, pol, numa_node_id()),
policy_nodemask(gfp, pol));
return page;
}
-EXPORT_SYMBOL(alloc_pages);
+EXPORT_SYMBOL(_alloc_pages);
-struct folio *folio_alloc(gfp_t gfp, unsigned order)
+struct folio *_folio_alloc(gfp_t gfp, unsigned int order)
{
struct page *page = alloc_pages(gfp | __GFP_COMP, order);
@@ -2290,7 +2290,7 @@ struct folio *folio_alloc(gfp_t gfp, unsigned order)
prep_transhuge_page(page);
return (struct folio *)page;
}
-EXPORT_SYMBOL(folio_alloc);
+EXPORT_SYMBOL(_folio_alloc);
static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
struct mempolicy *pol, unsigned long nr_pages,
@@ -2309,13 +2309,13 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
for (i = 0; i < nodes; i++) {
if (delta) {
- nr_allocated = __alloc_pages_bulk(gfp,
+ nr_allocated = _alloc_pages_bulk(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node + 1, NULL,
page_array);
delta--;
} else {
- nr_allocated = __alloc_pages_bulk(gfp,
+ nr_allocated = _alloc_pages_bulk(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node, NULL, page_array);
}
@@ -2337,11 +2337,11 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- nr_allocated = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
+ nr_allocated = _alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
nr_pages, NULL, page_array);
if (nr_allocated < nr_pages)
- nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL,
+ nr_allocated += _alloc_pages_bulk(gfp, numa_node_id(), NULL,
nr_pages - nr_allocated, NULL,
page_array + nr_allocated);
return nr_allocated;
@@ -2353,7 +2353,7 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
* It can accelerate memory allocation especially interleaving
* allocate memory.
*/
-unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+unsigned long _alloc_pages_bulk_array_mempolicy(gfp_t gfp,
unsigned long nr_pages, struct page **page_array)
{
struct mempolicy *pol = &default_policy;
@@ -2369,7 +2369,7 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
return alloc_pages_bulk_array_preferred_many(gfp,
numa_node_id(), pol, nr_pages, page_array);
- return __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
+ return _alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
policy_nodemask(gfp, pol), nr_pages, NULL,
page_array);
}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f7f9c677854..42135fad4d8a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -24,6 +24,7 @@
#include <linux/page_ext.h>
#include <linux/pti.h>
#include <linux/pgtable.h>
+#include <linux/stackdepot.h>
#include <linux/swap.h>
#include <linux/cma.h>
#include "internal.h"
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9de2a18519a1..edd35500f7f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -74,6 +74,7 @@
#include <linux/psi.h>
#include <linux/khugepaged.h>
#include <linux/delayacct.h>
+#include <linux/pgalloc_tag.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -657,6 +658,7 @@ static inline bool pcp_allowed_order(unsigned int order)
static inline void free_the_page(struct page *page, unsigned int order)
{
+
if (pcp_allowed_order(order)) /* Via pcp? */
free_unref_page(page, order);
else
@@ -1259,6 +1261,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
__memcg_kmem_uncharge_page(page, order);
reset_page_owner(page, order);
page_table_check_free(page, order);
+ pgalloc_tag_dec(page, order);
return false;
}
@@ -1301,6 +1304,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
page_table_check_free(page, order);
+ pgalloc_tag_dec(page, order);
if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),
@@ -1669,6 +1673,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
!should_skip_init(gfp_flags);
bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS);
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ union codetag_ref *ref;
+#endif
int i;
set_page_private(page, 0);
@@ -1721,6 +1728,14 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ ref = get_page_tag_ref(page);
+ if (ref) {
+ alloc_tag_add(ref, current->alloc_tag, PAGE_SIZE << order);
+ put_page_tag_ref(ref);
+ }
+#endif
}
static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
@@ -4568,7 +4583,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
*
* Returns the number of pages on the list or array.
*/
-unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long _alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
struct page **page_array)
@@ -4704,7 +4719,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
pcp_trylock_finish(UP_flags);
failed:
- page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
+ page = _alloc_pages2(gfp, 0, preferred_nid, nodemask);
if (page) {
if (page_list)
list_add(&page->lru, page_list);
@@ -4715,12 +4730,12 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto out;
}
-EXPORT_SYMBOL_GPL(__alloc_pages_bulk);
+EXPORT_SYMBOL_GPL(_alloc_pages_bulk);
/*
* This is the 'heart' of the zoned buddy allocator.
*/
-struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
+struct page *_alloc_pages2(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask)
{
struct page *page;
@@ -4783,41 +4798,41 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
return page;
}
-EXPORT_SYMBOL(__alloc_pages);
+EXPORT_SYMBOL(_alloc_pages2);
-struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+struct folio *_folio_alloc2(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask)
{
- struct page *page = __alloc_pages(gfp | __GFP_COMP, order,
+ struct page *page = _alloc_pages2(gfp | __GFP_COMP, order,
preferred_nid, nodemask);
if (page && order > 1)
prep_transhuge_page(page);
return (struct folio *)page;
}
-EXPORT_SYMBOL(__folio_alloc);
+EXPORT_SYMBOL(_folio_alloc2);
/*
* Common helper functions. Never use with __GFP_HIGHMEM because the returned
* address cannot represent highmem pages. Use alloc_pages and then kmap if
* you need to access high mem.
*/
-unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
+unsigned long _get_free_pages(gfp_t gfp_mask, unsigned int order)
{
struct page *page;
- page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
+ page = _alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
if (!page)
return 0;
return (unsigned long) page_address(page);
}
-EXPORT_SYMBOL(__get_free_pages);
+EXPORT_SYMBOL(_get_free_pages);
-unsigned long get_zeroed_page(gfp_t gfp_mask)
+unsigned long _get_zeroed_page(gfp_t gfp_mask)
{
- return __get_free_page(gfp_mask | __GFP_ZERO);
+ return _get_free_pages(gfp_mask | __GFP_ZERO, 0);
}
-EXPORT_SYMBOL(get_zeroed_page);
+EXPORT_SYMBOL(_get_zeroed_page);
/**
* __free_pages - Free pages allocated with alloc_pages().
@@ -5009,7 +5024,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
}
/**
- * alloc_pages_exact - allocate an exact number physically-contiguous pages.
+ * _alloc_pages_exact - allocate an exact number physically-contiguous pages.
* @size: the number of bytes to allocate
* @gfp_mask: GFP flags for the allocation, must not contain __GFP_COMP
*
@@ -5023,7 +5038,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
*
* Return: pointer to the allocated area or %NULL in case of error.
*/
-void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
+void *_alloc_pages_exact(size_t size, gfp_t gfp_mask)
{
unsigned int order = get_order(size);
unsigned long addr;
@@ -5031,13 +5046,13 @@ void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);
- addr = __get_free_pages(gfp_mask, order);
+ addr = _get_free_pages(gfp_mask, order);
return make_alloc_exact(addr, order, size);
}
-EXPORT_SYMBOL(alloc_pages_exact);
+EXPORT_SYMBOL(_alloc_pages_exact);
/**
- * alloc_pages_exact_nid - allocate an exact number of physically-contiguous
+ * _alloc_pages_exact_nid - allocate an exact number of physically-contiguous
* pages on a node.
* @nid: the preferred node ID where memory should be allocated
* @size: the number of bytes to allocate
@@ -5048,7 +5063,7 @@ EXPORT_SYMBOL(alloc_pages_exact);
*
* Return: pointer to the allocated area or %NULL in case of error.
*/
-void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
+void * __meminit _alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
{
unsigned int order = get_order(size);
struct page *p;
@@ -5056,7 +5071,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);
- p = alloc_pages_node(nid, gfp_mask, order);
+ p = _alloc_pages_node(nid, gfp_mask, order);
if (!p)
return NULL;
return make_alloc_exact((unsigned long)page_address(p), order, size);
@@ -6729,7 +6744,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
}
/**
- * alloc_contig_range() -- tries to allocate given range of pages
+ * _alloc_contig_range() -- tries to allocate given range of pages
* @start: start PFN to allocate
* @end: one-past-the-last PFN to allocate
* @migratetype: migratetype of the underlying pageblocks (either
@@ -6749,7 +6764,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
* pages which PFN is in [start, end) are allocated for the caller and
* need to be freed with free_contig_range().
*/
-int alloc_contig_range(unsigned long start, unsigned long end,
+int _alloc_contig_range(unsigned long start, unsigned long end,
unsigned migratetype, gfp_t gfp_mask)
{
unsigned long outer_start, outer_end;
@@ -6873,15 +6888,15 @@ int alloc_contig_range(unsigned long start, unsigned long end,
undo_isolate_page_range(start, end, migratetype);
return ret;
}
-EXPORT_SYMBOL(alloc_contig_range);
+EXPORT_SYMBOL(_alloc_contig_range);
static int __alloc_contig_pages(unsigned long start_pfn,
unsigned long nr_pages, gfp_t gfp_mask)
{
unsigned long end_pfn = start_pfn + nr_pages;
- return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
- gfp_mask);
+ return _alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
+ gfp_mask);
}
static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
@@ -6916,7 +6931,7 @@ static bool zone_spans_last_pfn(const struct zone *zone,
}
/**
- * alloc_contig_pages() -- tries to find and allocate contiguous range of pages
+ * _alloc_contig_pages() -- tries to find and allocate contiguous range of pages
* @nr_pages: Number of contiguous pages to allocate
* @gfp_mask: GFP mask to limit search and used during compaction
* @nid: Target node
@@ -6936,8 +6951,8 @@ static bool zone_spans_last_pfn(const struct zone *zone,
*
* Return: pointer to contiguous pages on success, or NULL if not successful.
*/
-struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
- int nid, nodemask_t *nodemask)
+struct page *_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask)
{
unsigned long ret, pfn, flags;
struct zonelist *zonelist;
--
2.40.1.495.gc816e09b53d-goog
Account slab allocations using codetag reference embedded into slabobj_ext.
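As a rough illustration of the accounting idea (a user-space sketch, not
the kernel code: the struct layouts, OBJS_PER_SLAB, slab_alloc_hook() and
slab_free_hook() below are simplified, hypothetical stand-ins), each object
slot keeps a reference to the tag it was charged to, so the free path can
uncharge that same tag without knowing who allocated the object:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct alloc_tag {		/* one per allocation call site */
	const char *location;
	size_t bytes;
	size_t calls;
};

struct slabobj_ext {		/* one per object in a slab */
	struct alloc_tag *ref;	/* call site the object is charged to */
};

#define OBJS_PER_SLAB 4		/* assumption: tiny fixed-size slab */

struct slab {
	char *base;		/* address of object 0 */
	size_t obj_size;	/* plays the role of s->size */
	struct slabobj_ext exts[OBJS_PER_SLAB];
};

/* Map an object address to its index within the slab. */
static size_t obj_to_index(const struct slab *slab, const void *obj)
{
	return (size_t)((const char *)obj - slab->base) / slab->obj_size;
}

/* Allocation side: charge the object to the tag and remember the tag. */
static void slab_alloc_hook(struct slab *slab, const void *obj,
			    struct alloc_tag *tag)
{
	struct slabobj_ext *ext = &slab->exts[obj_to_index(slab, obj)];

	ext->ref = tag;
	tag->bytes += slab->obj_size;
	tag->calls++;
}

/* Free side: uncharge whichever tag the object slot refers to. */
static void slab_free_hook(struct slab *slab, const void *obj)
{
	struct slabobj_ext *ext = &slab->exts[obj_to_index(slab, obj)];
	struct alloc_tag *tag = ext->ref;

	if (!tag)
		return;		/* object was never accounted */

	assert(tag->bytes >= slab->obj_size);
	tag->bytes -= slab->obj_size;
	tag->calls--;
	ext->ref = NULL;
}
```

In the patch itself the reference lives in slabobj_ext and is updated with
alloc_tag_add()/alloc_tag_sub() using s->size; the sketch only mirrors that
shape.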
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/slab_def.h | 2 +-
include/linux/slub_def.h | 4 ++--
mm/slab.c | 4 +++-
mm/slab.h | 35 +++++++++++++++++++++++++++++++++++
4 files changed, 41 insertions(+), 4 deletions(-)
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index a61e7d55d0d3..23f14dcb8d5b 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -107,7 +107,7 @@ static inline void *nearest_obj(struct kmem_cache *cache, const struct slab *sla
* reciprocal_divide(offset, cache->reciprocal_buffer_size)
*/
static inline unsigned int obj_to_index(const struct kmem_cache *cache,
- const struct slab *slab, void *obj)
+ const struct slab *slab, const void *obj)
{
u32 offset = (obj - slab->s_mem);
return reciprocal_divide(offset, cache->reciprocal_buffer_size);
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index f6df03f934e5..e8be5b368857 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -176,14 +176,14 @@ static inline void *nearest_obj(struct kmem_cache *cache, const struct slab *sla
/* Determine object index from a given position */
static inline unsigned int __obj_to_index(const struct kmem_cache *cache,
- void *addr, void *obj)
+ void *addr, const void *obj)
{
return reciprocal_divide(kasan_reset_tag(obj) - addr,
cache->reciprocal_size);
}
static inline unsigned int obj_to_index(const struct kmem_cache *cache,
- const struct slab *slab, void *obj)
+ const struct slab *slab, const void *obj)
{
if (is_kfence_address(obj))
return 0;
diff --git a/mm/slab.c b/mm/slab.c
index ccc76f7455e9..026f0c08708a 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3367,9 +3367,11 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
static __always_inline void __cache_free(struct kmem_cache *cachep, void *objp,
unsigned long caller)
{
+ struct slab *slab = virt_to_slab(objp);
bool init;
- memcg_slab_free_hook(cachep, virt_to_slab(objp), &objp, 1);
+ memcg_slab_free_hook(cachep, slab, &objp, 1);
+ alloc_tagging_slab_free_hook(cachep, slab, &objp, 1);
if (is_kfence_address(objp)) {
kmemleak_free_recursive(objp, cachep->flags);
diff --git a/mm/slab.h b/mm/slab.h
index f953e7c81e98..f9442d3a10b2 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -494,6 +494,35 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
#endif /* CONFIG_SLAB_OBJ_EXT */
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
+static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
+ void **p, int objects)
+{
+ struct slabobj_ext *obj_exts;
+ int i;
+
+ if (!mem_alloc_profiling_enabled())
+ return;
+
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
+ return;
+
+ for (i = 0; i < objects; i++) {
+ unsigned int off = obj_to_index(s, slab, p[i]);
+
+ alloc_tag_sub(&obj_exts[off].ref, s->size);
+ }
+}
+
+#else
+
+static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
+ void **p, int objects) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING */
+
#ifdef CONFIG_MEMCG_KMEM
void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
enum node_stat_item idx, int nr);
@@ -776,6 +805,12 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
s->flags, flags);
kmsan_slab_alloc(s, p[i], flags);
obj_exts = prepare_slab_obj_exts_hook(s, flags, p[i]);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ /* obj_exts can be allocated for other reasons */
+ if (likely(obj_exts) && mem_alloc_profiling_enabled())
+ alloc_tag_add(&obj_exts->ref, current->alloc_tag, s->size);
+#endif
}
memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
--
2.40.1.495.gc816e09b53d-goog
Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. to record allocations
and deallocations done by these functions.
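The renaming pattern used throughout this patch can be sketched in user
space as follows. This is a simplified illustration under stated
assumptions, not the real alloc_hooks(): struct alloc_tag here is a
hypothetical three-field struct, current_tag stands in for
current->alloc_tag, last_tag, my_alloc() and _my_alloc() are made-up names,
and the _err argument is accepted but unused. It relies on GNU statement
expressions, as kernel code does:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-call-site tag; the real struct alloc_tag differs. */
struct alloc_tag {
	const char *file;
	int line;
	size_t calls;
};

/* Stand-in for current->alloc_tag: tag of the call site being served. */
static struct alloc_tag *current_tag;
/* Demo-only: remembers which tag served the most recent allocation. */
static struct alloc_tag *last_tag;

/*
 * Simplified alloc_hooks(): define one static tag per expansion site,
 * publish it while the allocation runs, then restore the previous tag.
 */
#define alloc_hooks(_do_alloc, _res_type, _err)				\
({									\
	static struct alloc_tag _tag = { __FILE__, __LINE__, 0 };	\
	struct alloc_tag *_old = current_tag;				\
	_res_type _res;							\
									\
	current_tag = &_tag;						\
	_res = _do_alloc;						\
	current_tag = _old;						\
	_res;								\
})

/* The renamed implementation accounts against whatever tag is current. */
static void *_my_alloc(size_t size)
{
	assert(current_tag);	/* in this sketch, only reached via the macro */
	current_tag->calls++;
	last_tag = current_tag;
	return malloc(size);
}

/* The public name becomes a macro, so a tag is created at each caller. */
#define my_alloc(_size) alloc_hooks(_my_alloc(_size), void *, NULL)
```

Two my_alloc() calls at different places end up with two distinct tags,
each counting one call, which is what lets the allocations file attribute
memory to individual source lines.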
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/slab.h | 175 ++++++++++++++++++++++---------------------
mm/slab.c | 16 ++--
mm/slab_common.c | 22 +++---
mm/slub.c | 17 +++--
mm/util.c | 10 +--
5 files changed, 124 insertions(+), 116 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 99a146f3cedf..43c922524081 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -213,7 +213,10 @@ int kmem_cache_shrink(struct kmem_cache *s);
/*
* Common kmalloc functions provided by all allocators
*/
-void * __must_check krealloc(const void *objp, size_t new_size, gfp_t flags) __realloc_size(2);
+void * __must_check _krealloc(const void *objp, size_t new_size, gfp_t flags) __realloc_size(2);
+#define krealloc(_p, _size, _flags) \
+ alloc_hooks(_krealloc(_p, _size, _flags), void*, NULL)
+
void kfree(const void *objp);
void kfree_sensitive(const void *objp);
size_t __ksize(const void *objp);
@@ -451,6 +454,8 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
static_assert(PAGE_SHIFT <= 20);
#define kmalloc_index(s) __kmalloc_index(s, true)
+#include <linux/alloc_tag.h>
+
void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_size(1);
/**
@@ -463,9 +468,15 @@ void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_siz
*
* Return: pointer to the new object or %NULL in case of error
*/
-void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) __assume_slab_alignment __malloc;
-void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
- gfp_t gfpflags) __assume_slab_alignment __malloc;
+void *_kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc(_s, _flags) \
+ alloc_hooks(_kmem_cache_alloc(_s, _flags), void*, NULL)
+
+void *_kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
+ gfp_t gfpflags) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc_lru(_s, _lru, _flags) \
+ alloc_hooks(_kmem_cache_alloc_lru(_s, _lru, _flags), void*, NULL)
+
void kmem_cache_free(struct kmem_cache *s, void *objp);
/*
@@ -476,7 +487,9 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
* Note that interrupts must be enabled when calling these functions.
*/
void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
-int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
+int _kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
+#define kmem_cache_alloc_bulk(_s, _flags, _size, _p) \
+ alloc_hooks(_kmem_cache_alloc_bulk(_s, _flags, _size, _p), int, 0)
static __always_inline void kfree_bulk(size_t size, void **p)
{
@@ -485,20 +498,32 @@ static __always_inline void kfree_bulk(size_t size, void **p)
void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment
__alloc_size(1);
-void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
- __malloc;
+void *_kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
+ __malloc;
+#define kmem_cache_alloc_node(_s, _flags, _node) \
+ alloc_hooks(_kmem_cache_alloc_node(_s, _flags, _node), void*, NULL)
-void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
+void *_kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
__assume_kmalloc_alignment __alloc_size(3);
-void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
+void *_kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
int node, size_t size) __assume_kmalloc_alignment
__alloc_size(4);
-void *kmalloc_large(size_t size, gfp_t flags) __assume_page_alignment
+#define kmalloc_trace(_s, _flags, _size) \
+ alloc_hooks(_kmalloc_trace(_s, _flags, _size), void*, NULL)
+
+#define kmalloc_node_trace(_s, _gfpflags, _node, _size) \
+ alloc_hooks(_kmalloc_node_trace(_s, _gfpflags, _node, _size), void*, NULL)
+
+void *_kmalloc_large(size_t size, gfp_t flags) __assume_page_alignment
__alloc_size(1);
+#define kmalloc_large(_size, _flags) \
+ alloc_hooks(_kmalloc_large(_size, _flags), void*, NULL)
-void *kmalloc_large_node(size_t size, gfp_t flags, int node) __assume_page_alignment
+void *_kmalloc_large_node(size_t size, gfp_t flags, int node) __assume_page_alignment
__alloc_size(1);
+#define kmalloc_large_node(_size, _flags, _node) \
+ alloc_hooks(_kmalloc_large_node(_size, _flags, _node), void*, NULL)
/**
* kmalloc - allocate kernel memory
@@ -554,37 +579,40 @@ void *kmalloc_large_node(size_t size, gfp_t flags, int node) __assume_page_align
* Try really hard to succeed the allocation but fail
* eventually.
*/
-static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
+static __always_inline __alloc_size(1) void *_kmalloc(size_t size, gfp_t flags)
{
if (__builtin_constant_p(size) && size) {
unsigned int index;
if (size > KMALLOC_MAX_CACHE_SIZE)
- return kmalloc_large(size, flags);
+ return _kmalloc_large(size, flags);
index = kmalloc_index(size);
- return kmalloc_trace(
+ return _kmalloc_trace(
kmalloc_caches[kmalloc_type(flags)][index],
flags, size);
}
return __kmalloc(size, flags);
}
+#define kmalloc(_size, _flags) alloc_hooks(_kmalloc(_size, _flags), void*, NULL)
-static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t flags, int node)
+static __always_inline __alloc_size(1) void *_kmalloc_node(size_t size, gfp_t flags, int node)
{
if (__builtin_constant_p(size) && size) {
unsigned int index;
if (size > KMALLOC_MAX_CACHE_SIZE)
- return kmalloc_large_node(size, flags, node);
+ return _kmalloc_large_node(size, flags, node);
index = kmalloc_index(size);
- return kmalloc_node_trace(
+ return _kmalloc_node_trace(
kmalloc_caches[kmalloc_type(flags)][index],
flags, node, size);
}
return __kmalloc_node(size, flags, node);
}
+#define kmalloc_node(_size, _flags, _node) \
+ alloc_hooks(_kmalloc_node(_size, _flags, _node), void*, NULL)
/**
* kmalloc_array - allocate memory for an array.
@@ -592,16 +620,18 @@ static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t fla
* @size: element size.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline __alloc_size(1, 2) void *kmalloc_array(size_t n, size_t size, gfp_t flags)
+static inline __alloc_size(1, 2) void *_kmalloc_array(size_t n, size_t size, gfp_t flags)
{
size_t bytes;
if (unlikely(check_mul_overflow(n, size, &bytes)))
return NULL;
if (__builtin_constant_p(n) && __builtin_constant_p(size))
- return kmalloc(bytes, flags);
- return __kmalloc(bytes, flags);
+ return _kmalloc(bytes, flags);
+ return _kmalloc(bytes, flags);
}
+#define kmalloc_array(_n, _size, _flags) \
+ alloc_hooks(_kmalloc_array(_n, _size, _flags), void*, NULL)
/**
* krealloc_array - reallocate memory for an array.
@@ -610,18 +640,20 @@ static inline __alloc_size(1, 2) void *kmalloc_array(size_t n, size_t size, gfp_
* @new_size: new size of a single member of the array
* @flags: the type of memory to allocate (see kmalloc)
*/
-static inline __realloc_size(2, 3) void * __must_check krealloc_array(void *p,
- size_t new_n,
- size_t new_size,
- gfp_t flags)
+static inline __realloc_size(2, 3) void * __must_check _krealloc_array(void *p,
+ size_t new_n,
+ size_t new_size,
+ gfp_t flags)
{
size_t bytes;
if (unlikely(check_mul_overflow(new_n, new_size, &bytes)))
return NULL;
- return krealloc(p, bytes, flags);
+ return _krealloc(p, bytes, flags);
}
+#define krealloc_array(_p, _n, _size, _flags) \
+ alloc_hooks(_krealloc_array(_p, _n, _size, _flags), void*, NULL)
/**
* kcalloc - allocate memory for an array. The memory is set to zero.
@@ -629,16 +661,14 @@ static inline __realloc_size(2, 3) void * __must_check krealloc_array(void *p,
* @size: element size.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline __alloc_size(1, 2) void *kcalloc(size_t n, size_t size, gfp_t flags)
-{
- return kmalloc_array(n, size, flags | __GFP_ZERO);
-}
+#define kcalloc(_n, _size, _flags) \
+ kmalloc_array(_n, _size, (_flags) | __GFP_ZERO)
void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
unsigned long caller) __alloc_size(1);
#define kmalloc_node_track_caller(size, flags, node) \
- __kmalloc_node_track_caller(size, flags, node, \
- _RET_IP_)
+ alloc_hooks(__kmalloc_node_track_caller(size, flags, node, \
+ _RET_IP_), void*, NULL)
/*
* kmalloc_track_caller is a special version of kmalloc that records the
@@ -648,11 +678,10 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
* allocator where we care about the real place the memory allocation
* request comes from.
*/
-#define kmalloc_track_caller(size, flags) \
- __kmalloc_node_track_caller(size, flags, \
- NUMA_NO_NODE, _RET_IP_)
+#define kmalloc_track_caller(size, flags) \
+ kmalloc_node_track_caller(size, flags, NUMA_NO_NODE)
-static inline __alloc_size(1, 2) void *kmalloc_array_node(size_t n, size_t size, gfp_t flags,
+static inline __alloc_size(1, 2) void *_kmalloc_array_node(size_t n, size_t size, gfp_t flags,
int node)
{
size_t bytes;
@@ -660,75 +689,53 @@ static inline __alloc_size(1, 2) void *kmalloc_array_node(size_t n, size_t size,
if (unlikely(check_mul_overflow(n, size, &bytes)))
return NULL;
if (__builtin_constant_p(n) && __builtin_constant_p(size))
- return kmalloc_node(bytes, flags, node);
+ return _kmalloc_node(bytes, flags, node);
return __kmalloc_node(bytes, flags, node);
}
+#define kmalloc_array_node(_n, _size, _flags, _node) \
+ alloc_hooks(_kmalloc_array_node(_n, _size, _flags, _node), void*, NULL)
-static inline __alloc_size(1, 2) void *kcalloc_node(size_t n, size_t size, gfp_t flags, int node)
-{
- return kmalloc_array_node(n, size, flags | __GFP_ZERO, node);
-}
+#define kcalloc_node(_n, _size, _flags, _node) \
+ kmalloc_array_node(_n, _size, (_flags) | __GFP_ZERO, _node)
/*
* Shortcuts
*/
-static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
-{
- return kmem_cache_alloc(k, flags | __GFP_ZERO);
-}
+#define kmem_cache_zalloc(_k, _flags) \
+ kmem_cache_alloc(_k, (_flags)|__GFP_ZERO)
/**
* kzalloc - allocate memory. The memory is set to zero.
* @size: how many bytes of memory are required.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
-{
- return kmalloc(size, flags | __GFP_ZERO);
-}
-
-/**
- * kzalloc_node - allocate zeroed memory from a particular memory node.
- * @size: how many bytes of memory are required.
- * @flags: the type of memory to allocate (see kmalloc).
- * @node: memory node from which to allocate
- */
-static inline __alloc_size(1) void *kzalloc_node(size_t size, gfp_t flags, int node)
-{
- return kmalloc_node(size, flags | __GFP_ZERO, node);
-}
+#define kzalloc(_size, _flags) kmalloc(_size, (_flags)|__GFP_ZERO)
+#define kzalloc_node(_size, _flags, _node) kmalloc_node(_size, (_flags)|__GFP_ZERO, _node)
-extern void *kvmalloc_node(size_t size, gfp_t flags, int node) __alloc_size(1);
-static inline __alloc_size(1) void *kvmalloc(size_t size, gfp_t flags)
-{
- return kvmalloc_node(size, flags, NUMA_NO_NODE);
-}
-static inline __alloc_size(1) void *kvzalloc_node(size_t size, gfp_t flags, int node)
-{
- return kvmalloc_node(size, flags | __GFP_ZERO, node);
-}
-static inline __alloc_size(1) void *kvzalloc(size_t size, gfp_t flags)
-{
- return kvmalloc(size, flags | __GFP_ZERO);
-}
+extern void *_kvmalloc_node(size_t size, gfp_t flags, int node) __alloc_size(1);
+#define kvmalloc_node(_size, _flags, _node) \
+ alloc_hooks(_kvmalloc_node(_size, _flags, _node), void*, NULL)
-static inline __alloc_size(1, 2) void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
-{
- size_t bytes;
+#define kvmalloc(_size, _flags) kvmalloc_node(_size, _flags, NUMA_NO_NODE)
+#define kvzalloc(_size, _flags) kvmalloc(_size, (_flags)|__GFP_ZERO)
- if (unlikely(check_mul_overflow(n, size, &bytes)))
- return NULL;
+#define kvzalloc_node(_size, _flags, _node) kvmalloc_node(_size, (_flags)|__GFP_ZERO, _node)
- return kvmalloc(bytes, flags);
-}
+#define kvmalloc_array(_n, _size, _flags) \
+({ \
+ size_t _bytes; \
+ \
+ !check_mul_overflow(_n, _size, &_bytes) ? kvmalloc(_bytes, _flags) : NULL; \
+})
-static inline __alloc_size(1, 2) void *kvcalloc(size_t n, size_t size, gfp_t flags)
-{
- return kvmalloc_array(n, size, flags | __GFP_ZERO);
-}
+#define kvcalloc(_n, _size, _flags) kvmalloc_array(_n, _size, (_flags)|__GFP_ZERO)
-extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
+extern void *_kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
__realloc_size(3);
+
+#define kvrealloc(_p, _oldsize, _newsize, _flags) \
+ alloc_hooks(_kvrealloc(_p, _oldsize, _newsize, _flags), void*, NULL)
+
extern void kvfree(const void *addr);
extern void kvfree_sensitive(const void *addr, size_t len);
diff --git a/mm/slab.c b/mm/slab.c
index 026f0c08708a..e08bd3496f56 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3448,18 +3448,18 @@ void *__kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru,
return ret;
}
-void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+void *_kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
return __kmem_cache_alloc_lru(cachep, NULL, flags);
}
-EXPORT_SYMBOL(kmem_cache_alloc);
+EXPORT_SYMBOL(_kmem_cache_alloc);
-void *kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru,
+void *_kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru,
gfp_t flags)
{
return __kmem_cache_alloc_lru(cachep, lru, flags);
}
-EXPORT_SYMBOL(kmem_cache_alloc_lru);
+EXPORT_SYMBOL(_kmem_cache_alloc_lru);
static __always_inline void
cache_alloc_debugcheck_after_bulk(struct kmem_cache *s, gfp_t flags,
@@ -3471,7 +3471,7 @@ cache_alloc_debugcheck_after_bulk(struct kmem_cache *s, gfp_t flags,
p[i] = cache_alloc_debugcheck_after(s, flags, p[i], caller);
}
-int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
+int _kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
{
struct obj_cgroup *objcg = NULL;
@@ -3510,7 +3510,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
kmem_cache_free_bulk(s, i, p);
return 0;
}
-EXPORT_SYMBOL(kmem_cache_alloc_bulk);
+EXPORT_SYMBOL(_kmem_cache_alloc_bulk);
/**
* kmem_cache_alloc_node - Allocate an object on the specified node
@@ -3525,7 +3525,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_bulk);
*
* Return: pointer to the new object or %NULL in case of error
*/
-void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+void *_kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
{
void *ret = slab_alloc_node(cachep, NULL, flags, nodeid, cachep->object_size, _RET_IP_);
@@ -3533,7 +3533,7 @@ void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
return ret;
}
-EXPORT_SYMBOL(kmem_cache_alloc_node);
+EXPORT_SYMBOL(_kmem_cache_alloc_node);
void *__kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
int nodeid, size_t orig_size,
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 42777d66d0e3..a05333bbb7f1 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1101,7 +1101,7 @@ size_t __ksize(const void *object)
return slab_ksize(folio_slab(folio)->slab_cache);
}
-void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
+void *_kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
{
void *ret = __kmem_cache_alloc_node(s, gfpflags, NUMA_NO_NODE,
size, _RET_IP_);
@@ -1111,9 +1111,9 @@ void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
ret = kasan_kmalloc(s, ret, size, gfpflags);
return ret;
}
-EXPORT_SYMBOL(kmalloc_trace);
+EXPORT_SYMBOL(_kmalloc_trace);
-void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
+void *_kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
int node, size_t size)
{
void *ret = __kmem_cache_alloc_node(s, gfpflags, node, size, _RET_IP_);
@@ -1123,7 +1123,7 @@ void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
ret = kasan_kmalloc(s, ret, size, gfpflags);
return ret;
}
-EXPORT_SYMBOL(kmalloc_node_trace);
+EXPORT_SYMBOL(_kmalloc_node_trace);
gfp_t kmalloc_fix_flags(gfp_t flags)
{
@@ -1168,7 +1168,7 @@ static void *__kmalloc_large_node(size_t size, gfp_t flags, int node)
return ptr;
}
-void *kmalloc_large(size_t size, gfp_t flags)
+void *_kmalloc_large(size_t size, gfp_t flags)
{
void *ret = __kmalloc_large_node(size, flags, NUMA_NO_NODE);
@@ -1176,9 +1176,9 @@ void *kmalloc_large(size_t size, gfp_t flags)
flags, NUMA_NO_NODE);
return ret;
}
-EXPORT_SYMBOL(kmalloc_large);
+EXPORT_SYMBOL(_kmalloc_large);
-void *kmalloc_large_node(size_t size, gfp_t flags, int node)
+void *_kmalloc_large_node(size_t size, gfp_t flags, int node)
{
void *ret = __kmalloc_large_node(size, flags, node);
@@ -1186,7 +1186,7 @@ void *kmalloc_large_node(size_t size, gfp_t flags, int node)
flags, node);
return ret;
}
-EXPORT_SYMBOL(kmalloc_large_node);
+EXPORT_SYMBOL(_kmalloc_large_node);
#ifdef CONFIG_SLAB_FREELIST_RANDOM
/* Randomize a generic freelist */
@@ -1405,7 +1405,7 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
return (void *)p;
}
- ret = kmalloc_track_caller(new_size, flags);
+ ret = __kmalloc_node_track_caller(new_size, flags, NUMA_NO_NODE, _RET_IP_);
if (ret && p) {
/* Disable KASAN checks as the object's redzone is accessed. */
kasan_disable_current();
@@ -1429,7 +1429,7 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
*
* Return: pointer to the allocated memory or %NULL in case of error
*/
-void *krealloc(const void *p, size_t new_size, gfp_t flags)
+void *_krealloc(const void *p, size_t new_size, gfp_t flags)
{
void *ret;
@@ -1444,7 +1444,7 @@ void *krealloc(const void *p, size_t new_size, gfp_t flags)
return ret;
}
-EXPORT_SYMBOL(krealloc);
+EXPORT_SYMBOL(_krealloc);
/**
* kfree_sensitive - Clear sensitive information in memory before freeing
diff --git a/mm/slub.c b/mm/slub.c
index 507b71372ee4..8f57fd086f69 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3470,18 +3470,18 @@ void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
return ret;
}
-void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+void *_kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
return __kmem_cache_alloc_lru(s, NULL, gfpflags);
}
-EXPORT_SYMBOL(kmem_cache_alloc);
+EXPORT_SYMBOL(_kmem_cache_alloc);
-void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
+void *_kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
gfp_t gfpflags)
{
return __kmem_cache_alloc_lru(s, lru, gfpflags);
}
-EXPORT_SYMBOL(kmem_cache_alloc_lru);
+EXPORT_SYMBOL(_kmem_cache_alloc_lru);
void *__kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags,
int node, size_t orig_size,
@@ -3491,7 +3491,7 @@ void *__kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags,
caller, orig_size);
}
-void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+void *_kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
{
void *ret = slab_alloc_node(s, NULL, gfpflags, node, _RET_IP_, s->object_size);
@@ -3499,7 +3499,7 @@ void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
return ret;
}
-EXPORT_SYMBOL(kmem_cache_alloc_node);
+EXPORT_SYMBOL(_kmem_cache_alloc_node);
static noinline void free_to_partial_list(
struct kmem_cache *s, struct slab *slab,
@@ -3779,6 +3779,7 @@ static __fastpath_inline void slab_free(struct kmem_cache *s, struct slab *slab,
unsigned long addr)
{
memcg_slab_free_hook(s, slab, p, cnt);
+ alloc_tagging_slab_free_hook(s, slab, p, cnt);
/*
* With KASAN enabled slab_free_freelist_hook modifies the freelist
* to remove objects, whose reuse must be delayed.
@@ -4009,7 +4010,7 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
#endif /* CONFIG_SLUB_TINY */
/* Note that interrupts must be enabled when calling this function. */
-int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
+int _kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
{
int i;
@@ -4034,7 +4035,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
slab_want_init_on_alloc(flags, s), s->object_size);
return i;
}
-EXPORT_SYMBOL(kmem_cache_alloc_bulk);
+EXPORT_SYMBOL(_kmem_cache_alloc_bulk);
/*
diff --git a/mm/util.c b/mm/util.c
index dd12b9531ac4..e9077d1af676 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -579,7 +579,7 @@ EXPORT_SYMBOL(vm_mmap);
*
* Return: pointer to the allocated memory of %NULL in case of failure
*/
-void *kvmalloc_node(size_t size, gfp_t flags, int node)
+void *_kvmalloc_node(size_t size, gfp_t flags, int node)
{
gfp_t kmalloc_flags = flags;
void *ret;
@@ -601,7 +601,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
kmalloc_flags &= ~__GFP_NOFAIL;
}
- ret = kmalloc_node(size, kmalloc_flags, node);
+ ret = _kmalloc_node(size, kmalloc_flags, node);
/*
* It doesn't really make sense to fallback to vmalloc for sub page
@@ -630,7 +630,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
node, __builtin_return_address(0));
}
-EXPORT_SYMBOL(kvmalloc_node);
+EXPORT_SYMBOL(_kvmalloc_node);
/**
* kvfree() - Free memory.
@@ -669,7 +669,7 @@ void kvfree_sensitive(const void *addr, size_t len)
}
EXPORT_SYMBOL(kvfree_sensitive);
-void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
+void *_kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
{
void *newp;
@@ -682,7 +682,7 @@ void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
kvfree(p);
return newp;
}
-EXPORT_SYMBOL(kvrealloc);
+EXPORT_SYMBOL(_kvrealloc);
/**
* __vmalloc_array - allocate memory for a virtually contiguous array.
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
It seems we need to be more forceful with the compiler on this one: mark
slab_free_freelist_hook() __always_inline instead of plain inline.
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index 8f57fd086f69..9dd57b3384a1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1781,7 +1781,7 @@ static __always_inline bool slab_free_hook(struct kmem_cache *s,
return kasan_slab_free(s, x, init);
}
-static inline bool slab_free_freelist_hook(struct kmem_cache *s,
+static __always_inline bool slab_free_freelist_hook(struct kmem_cache *s,
void **head, void **tail,
int *cnt)
{
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
Add hooks to mempools so that mempool-backed allocations are attributed
to the source line of the caller and show up correctly in
/sys/kernel/debug/allocations.
Various inline functions are converted to macro wrappers so that
alloc_hooks() has to be invoked in fewer places.
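A small user-space sketch of why the inline helpers become macros (the
names struct site, this_site(), pool_create_inline() and
pool_create_macro() are hypothetical, and this_site() is only a stand-in
for a per-call-site hook such as alloc_hooks()): a hook placed inside an
inline function is expanded once, in the header, so every caller shares
that one site, while a macro wrapper expands the hook at each caller.

```c
#include <assert.h>

/* Hypothetical call-site record. */
struct site {
	const char *file;
	int line;
};

/* Evaluates to the address of a static record unique to this expansion. */
#define this_site()							\
({									\
	static const struct site _s = { __FILE__, __LINE__ };		\
	&_s;								\
})

/* Inline wrapper: the hook expands here, once, for all callers. */
static inline const struct site *pool_create_inline(void)
{
	return this_site();
}

/* Macro wrapper: the hook expands separately at every caller. */
#define pool_create_macro() this_site()

/* Demo: inline callers collapse to one site, macro callers stay apart. */
static void demo(void)
{
	const struct site *a = pool_create_inline();
	const struct site *b = pool_create_inline();
	const struct site *c = pool_create_macro();
	const struct site *d = pool_create_macro();

	assert(a == b);		/* both attributed to the wrapper itself */
	assert(c != d);		/* each caller gets its own site */
}
```

With the inline form, every mempool created through a helper would be
charged to the helper's line in mempool.h rather than to the driver or
filesystem that asked for the pool.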
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mempool.h | 73 ++++++++++++++++++++---------------------
mm/mempool.c | 28 ++++++----------
2 files changed, 45 insertions(+), 56 deletions(-)
diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 4aae6c06c5f2..aa6e886b01d7 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -5,6 +5,8 @@
#ifndef _LINUX_MEMPOOL_H
#define _LINUX_MEMPOOL_H
+#include <linux/sched.h>
+#include <linux/alloc_tag.h>
#include <linux/wait.h>
#include <linux/compiler.h>
@@ -39,18 +41,32 @@ void mempool_exit(mempool_t *pool);
int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int node_id);
-int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+
+int _mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data);
+#define mempool_init(...) \
+ alloc_hooks(_mempool_init(__VA_ARGS__), int, -ENOMEM)
extern mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data);
-extern mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
+
+extern mempool_t *_mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int nid);
+#define mempool_create_node(...) \
+ alloc_hooks(_mempool_create_node(__VA_ARGS__), mempool_t *, NULL)
+
+#define mempool_create(_min_nr, _alloc_fn, _free_fn, _pool_data) \
+ mempool_create_node(_min_nr, _alloc_fn, _free_fn, _pool_data, \
+ GFP_KERNEL, NUMA_NO_NODE)
extern int mempool_resize(mempool_t *pool, int new_min_nr);
extern void mempool_destroy(mempool_t *pool);
-extern void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) __malloc;
+
+extern void *_mempool_alloc(mempool_t *pool, gfp_t gfp_mask) __malloc;
+#define mempool_alloc(_pool, _gfp) \
+ alloc_hooks(_mempool_alloc((_pool), (_gfp)), void *, NULL)
+
extern void mempool_free(void *element, mempool_t *pool);
/*
@@ -61,19 +77,10 @@ extern void mempool_free(void *element, mempool_t *pool);
void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data);
void mempool_free_slab(void *element, void *pool_data);
-static inline int
-mempool_init_slab_pool(mempool_t *pool, int min_nr, struct kmem_cache *kc)
-{
- return mempool_init(pool, min_nr, mempool_alloc_slab,
- mempool_free_slab, (void *) kc);
-}
-
-static inline mempool_t *
-mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)
-{
- return mempool_create(min_nr, mempool_alloc_slab, mempool_free_slab,
- (void *) kc);
-}
+#define mempool_init_slab_pool(_pool, _min_nr, _kc) \
+ mempool_init(_pool, (_min_nr), mempool_alloc_slab, mempool_free_slab, (void *)(_kc))
+#define mempool_create_slab_pool(_min_nr, _kc) \
+ mempool_create((_min_nr), mempool_alloc_slab, mempool_free_slab, (void *)(_kc))
/*
* a mempool_alloc_t and a mempool_free_t to kmalloc and kfree the
@@ -82,17 +89,12 @@ mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)
void *mempool_kmalloc(gfp_t gfp_mask, void *pool_data);
void mempool_kfree(void *element, void *pool_data);
-static inline int mempool_init_kmalloc_pool(mempool_t *pool, int min_nr, size_t size)
-{
- return mempool_init(pool, min_nr, mempool_kmalloc,
- mempool_kfree, (void *) size);
-}
-
-static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)
-{
- return mempool_create(min_nr, mempool_kmalloc, mempool_kfree,
- (void *) size);
-}
+#define mempool_init_kmalloc_pool(_pool, _min_nr, _size) \
+ mempool_init(_pool, (_min_nr), mempool_kmalloc, mempool_kfree, \
+ (void *)(unsigned long)(_size))
+#define mempool_create_kmalloc_pool(_min_nr, _size) \
+ mempool_create((_min_nr), mempool_kmalloc, mempool_kfree, \
+ (void *)(unsigned long)(_size))
/*
* A mempool_alloc_t and mempool_free_t for a simple page allocator that
@@ -101,16 +103,11 @@ static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)
void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data);
void mempool_free_pages(void *element, void *pool_data);
-static inline int mempool_init_page_pool(mempool_t *pool, int min_nr, int order)
-{
- return mempool_init(pool, min_nr, mempool_alloc_pages,
- mempool_free_pages, (void *)(long)order);
-}
-
-static inline mempool_t *mempool_create_page_pool(int min_nr, int order)
-{
- return mempool_create(min_nr, mempool_alloc_pages, mempool_free_pages,
- (void *)(long)order);
-}
+#define mempool_init_page_pool(_pool, _min_nr, _order) \
+ mempool_init(_pool, (_min_nr), mempool_alloc_pages, \
+ mempool_free_pages, (void *)(long)(_order))
+#define mempool_create_page_pool(_min_nr, _order) \
+ mempool_create((_min_nr), mempool_alloc_pages, \
+ mempool_free_pages, (void *)(long)(_order))
#endif /* _LINUX_MEMPOOL_H */
diff --git a/mm/mempool.c b/mm/mempool.c
index 734bcf5afbb7..4fc90735853c 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -230,17 +230,17 @@ EXPORT_SYMBOL(mempool_init_node);
*
* Return: %0 on success, negative error code otherwise.
*/
-int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+int _mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data)
{
return mempool_init_node(pool, min_nr, alloc_fn, free_fn,
pool_data, GFP_KERNEL, NUMA_NO_NODE);
}
-EXPORT_SYMBOL(mempool_init);
+EXPORT_SYMBOL(_mempool_init);
/**
- * mempool_create - create a memory pool
+ * mempool_create_node - create a memory pool
* @min_nr: the minimum number of elements guaranteed to be
* allocated for this pool.
* @alloc_fn: user-defined element-allocation function.
@@ -255,15 +255,7 @@ EXPORT_SYMBOL(mempool_init);
*
* Return: pointer to the created memory pool object or %NULL on error.
*/
-mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data)
-{
- return mempool_create_node(min_nr, alloc_fn, free_fn, pool_data,
- GFP_KERNEL, NUMA_NO_NODE);
-}
-EXPORT_SYMBOL(mempool_create);
-
-mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
+mempool_t *_mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int node_id)
{
@@ -281,7 +273,7 @@ mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
return pool;
}
-EXPORT_SYMBOL(mempool_create_node);
+EXPORT_SYMBOL(_mempool_create_node);
/**
* mempool_resize - resize an existing memory pool
@@ -377,7 +369,7 @@ EXPORT_SYMBOL(mempool_resize);
*
* Return: pointer to the allocated element or %NULL on error.
*/
-void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
+void *_mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
{
void *element;
unsigned long flags;
@@ -444,7 +436,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
finish_wait(&pool->wait, &wait);
goto repeat_alloc;
}
-EXPORT_SYMBOL(mempool_alloc);
+EXPORT_SYMBOL(_mempool_alloc);
/**
* mempool_free - return an element to the pool.
@@ -515,7 +507,7 @@ void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data)
{
struct kmem_cache *mem = pool_data;
VM_BUG_ON(mem->ctor);
- return kmem_cache_alloc(mem, gfp_mask);
+ return _kmem_cache_alloc(mem, gfp_mask);
}
EXPORT_SYMBOL(mempool_alloc_slab);
@@ -533,7 +525,7 @@ EXPORT_SYMBOL(mempool_free_slab);
void *mempool_kmalloc(gfp_t gfp_mask, void *pool_data)
{
size_t size = (size_t)pool_data;
- return kmalloc(size, gfp_mask);
+ return _kmalloc(size, gfp_mask);
}
EXPORT_SYMBOL(mempool_kmalloc);
@@ -550,7 +542,7 @@ EXPORT_SYMBOL(mempool_kfree);
void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data)
{
int order = (int)(long)pool_data;
- return alloc_pages(gfp_mask, order);
+ return _alloc_pages(gfp_mask, order);
}
EXPORT_SYMBOL(mempool_alloc_pages);
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
This avoids a circular header dependency in an upcoming patch by making
hrtimer.h depend only on percpu-defs.h.
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
include/linux/hrtimer.h | 2 +-
include/linux/time_namespace.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 0ee140176f10..e67349e84364 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -16,7 +16,7 @@
#include <linux/rbtree.h>
#include <linux/init.h>
#include <linux/list.h>
-#include <linux/percpu.h>
+#include <linux/percpu-defs.h>
#include <linux/seqlock.h>
#include <linux/timer.h>
#include <linux/timerqueue.h>
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index bb9d3f5542f8..d8e0cacfcae5 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -11,6 +11,8 @@
struct user_namespace;
extern struct user_namespace init_user_ns;
+struct vm_area_struct;
+
struct timens_offsets {
struct timespec64 monotonic;
struct timespec64 boottime;
--
2.40.1.495.gc816e09b53d-goog
objext objects are created with the __GFP_NO_OBJ_EXT flag and therefore have
no corresponding objext themselves (otherwise we would get infinite
recursion). When these objects are freed, their codetag is empty, and
when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled this leads to false
warnings. Introduce a special CODETAG_EMPTY codetag value to mark
allocations which intentionally lack a codetag, to avoid these warnings.
Set objext codetags to CODETAG_EMPTY before freeing to indicate that
the codetag is expected to be empty.
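The sentinel idea described above can be modelled in userspace. The following is only a sketch under stated assumptions: struct tag, union tag_ref, TAG_EMPTY, mark_empty(), tag_sub() and the warnings counter are hypothetical stand-ins, and the placement of the warning is inferred from the comment this patch adds to free_slab_obj_exts(), not copied from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; names are illustrative. */
struct tag { const char *where; long bytes; };
union tag_ref { struct tag *ct; };

/* Sentinel: never a valid pointer, and distinct from NULL. */
#define TAG_EMPTY ((struct tag *)1)

static int warnings;	/* counts "unexpectedly untagged" reports */

static void mark_empty(union tag_ref *ref)
{
	if (ref)
		ref->ct = TAG_EMPTY;
}

/* Model of the free path: account the free against the owning tag. */
static void tag_sub(union tag_ref *ref, long bytes)
{
	if (!ref)
		return;
	if (!ref->ct) {
		warnings++;	/* assumed debug check: object was never tagged */
		return;
	}
	if (ref->ct == TAG_EMPTY) {
		ref->ct = NULL;	/* expected to be untagged: stay silent */
		return;
	}
	ref->ct->bytes -= bytes;
	ref->ct = NULL;
}
```

Calling mark_empty() on a reference right before tag_sub() suppresses the warning for that one object, while a reference that is NULL for any other reason is still reported.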
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/alloc_tag.h | 28 ++++++++++++++++++++++++++++
mm/slab.h | 33 +++++++++++++++++++++++++++++++++
mm/slab_common.c | 1 +
3 files changed, 62 insertions(+)
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 190ab793f7e5..2c3f4f3a8c93 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -51,6 +51,28 @@ static inline bool mem_alloc_profiling_enabled(void)
return static_branch_likely(&mem_alloc_profiling_key);
}
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+
+#define CODETAG_EMPTY (void *)1
+
+static inline bool is_codetag_empty(union codetag_ref *ref)
+{
+ return ref->ct == CODETAG_EMPTY;
+}
+
+static inline void set_codetag_empty(union codetag_ref *ref)
+{
+ if (ref)
+ ref->ct = CODETAG_EMPTY;
+}
+
+#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
+static inline bool is_codetag_empty(union codetag_ref *ref) { return false; }
+static inline void set_codetag_empty(union codetag_ref *ref) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes,
bool may_allocate)
{
@@ -65,6 +87,11 @@ static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes,
if (!ref || !ref->ct)
return;
+ if (is_codetag_empty(ref)) {
+ ref->ct = NULL;
+ return;
+ }
+
if (is_codetag_ctx_ref(ref))
alloc_tag_free_ctx(ref->ctx, &tag);
else
@@ -112,6 +139,7 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
#else
#define DEFINE_ALLOC_TAG(_alloc_tag, _old)
+static inline void set_codetag_empty(union codetag_ref *ref) {}
static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
diff --git a/mm/slab.h b/mm/slab.h
index f9442d3a10b2..50d86008a86a 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -416,6 +416,31 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
gfp_t gfp, bool new_slab);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+
+static inline void mark_objexts_empty(struct slabobj_ext *obj_exts)
+{
+ struct slabobj_ext *slab_exts;
+ struct slab *obj_exts_slab;
+
+ obj_exts_slab = virt_to_slab(obj_exts);
+ slab_exts = slab_obj_exts(obj_exts_slab);
+ if (slab_exts) {
+ unsigned int offs = obj_to_index(obj_exts_slab->slab_cache,
+ obj_exts_slab, obj_exts);
+ /* codetag should be NULL */
+ WARN_ON(slab_exts[offs].ref.ct);
+ set_codetag_empty(&slab_exts[offs].ref);
+ }
+}
+
+#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
+static inline void mark_objexts_empty(struct slabobj_ext *obj_exts) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
static inline bool need_slab_obj_ext(void)
{
#ifdef CONFIG_MEM_ALLOC_PROFILING
@@ -437,6 +462,14 @@ static inline void free_slab_obj_exts(struct slab *slab)
if (!obj_exts)
return;
+ /*
+ * obj_exts was created with __GFP_NO_OBJ_EXT flag, therefore its
+ * corresponding extension will be NULL. alloc_tag_sub() will throw a
+ * warning if slab has extensions but the extension of an object is
+ * NULL, therefore replace NULL with CODETAG_EMPTY to indicate that
+ * the extension for obj_exts is expected to be NULL.
+ */
+ mark_objexts_empty(obj_exts);
kfree(obj_exts);
slab->obj_exts = 0;
}
diff --git a/mm/slab_common.c b/mm/slab_common.c
index a05333bbb7f1..89265f825c43 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -244,6 +244,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
* assign slabobj_exts in parallel. In this case the existing
* objcg vector should be reused.
*/
+ mark_objexts_empty(vec);
kfree(vec);
return 0;
}
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
Upcoming alloc tagging patches require a place to stash per-allocation
metadata.
We already do this when memcg is enabled, so this patch generalizes the
obj_cgroup * vector in struct pcpu_chunk by creating a pcpuobj_ext
type, which an upcoming patch will extend, similarly to the
previous slabobj_ext patch.
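As a rough userspace illustration of this generalization, the sketch below replaces a bare vector of owner pointers with a vector of small extension structs, indexed by offset in minimum-allocation units. All names here (struct obj_ext, struct chunk, chunk_ext(), MIN_ALLOC_SHIFT) are hypothetical, and the shift value is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MIN_ALLOC_SHIFT 2	/* assumed: one slot per 4-byte unit */

struct owner;			/* opaque stand-in for struct obj_cgroup */

/* One extension record per minimum-sized allocation unit. */
struct obj_ext {
	struct owner *cgroup;
	/* further per-object fields can be added here later */
};

struct chunk {
	size_t size;		/* bytes covered by the chunk */
	struct obj_ext *exts;	/* NULL when no metadata is needed */
};

static int chunk_init(struct chunk *c, size_t size)
{
	c->size = size;
	c->exts = calloc(size >> MIN_ALLOC_SHIFT, sizeof(*c->exts));
	return c->exts ? 0 : -1;
}

/* Map a byte offset inside the chunk to its extension record. */
static struct obj_ext *chunk_ext(struct chunk *c, size_t off)
{
	if (!c->exts || off >= c->size)
		return NULL;
	return &c->exts[off >> MIN_ALLOC_SHIFT];
}

static void chunk_destroy(struct chunk *c)
{
	free(c->exts);
	c->exts = NULL;
}
```

Because callers go through a struct member (`.cgroup`) rather than a raw pointer slot, adding another field later changes only the element size, not the indexing.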
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: [email protected]
---
mm/percpu-internal.h | 19 +++++++++++++++++--
mm/percpu.c | 30 +++++++++++++++---------------
2 files changed, 32 insertions(+), 17 deletions(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index f9847c131998..2433e7b24172 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -32,6 +32,16 @@ struct pcpu_block_md {
int nr_bits; /* total bits responsible for */
};
+struct pcpuobj_ext {
+#ifdef CONFIG_MEMCG_KMEM
+ struct obj_cgroup *cgroup;
+#endif
+};
+
+#ifdef CONFIG_MEMCG_KMEM
+#define NEED_PCPUOBJ_EXT
+#endif
+
struct pcpu_chunk {
#ifdef CONFIG_PERCPU_STATS
int nr_alloc; /* # of allocations */
@@ -57,8 +67,8 @@ struct pcpu_chunk {
int end_offset; /* additional area required to
have the region end page
aligned */
-#ifdef CONFIG_MEMCG_KMEM
- struct obj_cgroup **obj_cgroups; /* vector of object cgroups */
+#ifdef NEED_PCPUOBJ_EXT
+ struct pcpuobj_ext *obj_exts; /* vector of object extensions */
#endif
int nr_pages; /* # of pages served by this chunk */
@@ -67,6 +77,11 @@ struct pcpu_chunk {
unsigned long populated[]; /* populated bitmap */
};
+static inline bool need_pcpuobj_ext(void)
+{
+ return !mem_cgroup_kmem_disabled();
+}
+
extern spinlock_t pcpu_lock;
extern struct list_head *pcpu_chunk_lists;
diff --git a/mm/percpu.c b/mm/percpu.c
index 28e07ede46f6..95b26a6b718d 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1392,9 +1392,9 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
panic("%s: Failed to allocate %zu bytes\n", __func__,
alloc_size);
-#ifdef CONFIG_MEMCG_KMEM
+#ifdef NEED_PCPUOBJ_EXT
/* first chunk is free to use */
- chunk->obj_cgroups = NULL;
+ chunk->obj_exts = NULL;
#endif
pcpu_init_md_blocks(chunk);
@@ -1463,12 +1463,12 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
if (!chunk->md_blocks)
goto md_blocks_fail;
-#ifdef CONFIG_MEMCG_KMEM
- if (!mem_cgroup_kmem_disabled()) {
- chunk->obj_cgroups =
+#ifdef NEED_PCPUOBJ_EXT
+ if (need_pcpuobj_ext()) {
+ chunk->obj_exts =
pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
- sizeof(struct obj_cgroup *), gfp);
- if (!chunk->obj_cgroups)
+ sizeof(struct pcpuobj_ext), gfp);
+ if (!chunk->obj_exts)
goto objcg_fail;
}
#endif
@@ -1480,7 +1480,7 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
return chunk;
-#ifdef CONFIG_MEMCG_KMEM
+#ifdef NEED_PCPUOBJ_EXT
objcg_fail:
pcpu_mem_free(chunk->md_blocks);
#endif
@@ -1498,8 +1498,8 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
{
if (!chunk)
return;
-#ifdef CONFIG_MEMCG_KMEM
- pcpu_mem_free(chunk->obj_cgroups);
+#ifdef NEED_PCPUOBJ_EXT
+ pcpu_mem_free(chunk->obj_exts);
#endif
pcpu_mem_free(chunk->md_blocks);
pcpu_mem_free(chunk->bound_map);
@@ -1648,8 +1648,8 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
if (!objcg)
return;
- if (likely(chunk && chunk->obj_cgroups)) {
- chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
+ if (likely(chunk && chunk->obj_exts)) {
+ chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = objcg;
rcu_read_lock();
mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
@@ -1665,13 +1665,13 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
{
struct obj_cgroup *objcg;
- if (unlikely(!chunk->obj_cgroups))
+ if (unlikely(!chunk->obj_exts))
return;
- objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
+ objcg = chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup;
if (!objcg)
return;
- chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
+ chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = NULL;
obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
--
2.40.1.495.gc816e09b53d-goog
Redefine __alloc_percpu, __alloc_percpu_gfp and __alloc_reserved_percpu
to record allocations and deallocations done by these functions.
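The general pattern, turning a public allocator name into a macro so that each call site gets its own accounting record, can be sketched in userspace as follows. This is a simplified analogy and not the kernel's alloc_hooks(): struct site_tag, raw_alloc(), account() and tagged_alloc() are hypothetical names, the `_tagp` out-parameter exists only so the demo can inspect the tag, and the macro relies on GNU statement expressions (GCC/Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-call-site record; not the kernel's struct alloc_tag. */
struct site_tag {
	const char *file;
	int line;
	size_t calls;
	size_t bytes;
};

/* The real allocator keeps a distinct name, much like __pcpu_alloc(). */
static void *raw_alloc(size_t size)
{
	return malloc(size);
}

/* Charge a successful allocation to the call site's tag. */
static void *account(struct site_tag *tag, void *ptr, size_t size)
{
	if (ptr) {
		tag->calls++;
		tag->bytes += size;
	}
	return ptr;
}

/*
 * The public name is a macro, so the static tag is instantiated once per
 * call site. _tagp lets the caller look at the tag (demo only).
 */
#define tagged_alloc(_size, _tagp) ({					\
	static struct site_tag _tag = { __FILE__, __LINE__, 0, 0 };	\
	size_t _sz = (_size);						\
	*(_tagp) = &_tag;						\
	account(&_tag, raw_alloc(_sz), _sz);				\
})
```

A call inside a loop is still a single call site, so all iterations accumulate into the same tag, while a second textual call gets a separate one.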
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/percpu.h | 19 ++++++++----
mm/percpu.c | 66 +++++-------------------------------------
2 files changed, 22 insertions(+), 63 deletions(-)
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 1338ea2aa720..51ec257379af 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -2,12 +2,14 @@
#ifndef __LINUX_PERCPU_H
#define __LINUX_PERCPU_H
+#include <linux/alloc_tag.h>
#include <linux/mmdebug.h>
#include <linux/preempt.h>
#include <linux/smp.h>
#include <linux/cpumask.h>
#include <linux/pfn.h>
#include <linux/init.h>
+#include <linux/sched.h>
#include <asm/percpu.h>
@@ -116,7 +118,6 @@ extern int __init pcpu_page_first_chunk(size_t reserved_size,
pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn);
#endif
-extern void __percpu *__alloc_reserved_percpu(size_t size, size_t align) __alloc_size(1);
extern bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr);
extern bool is_kernel_percpu_address(unsigned long addr);
@@ -124,10 +125,15 @@ extern bool is_kernel_percpu_address(unsigned long addr);
extern void __init setup_per_cpu_areas(void);
#endif
-extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
-extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
-extern void free_percpu(void __percpu *__pdata);
-extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
+extern void __percpu *__pcpu_alloc(size_t size, size_t align, bool reserved,
+ gfp_t gfp) __alloc_size(1);
+
+#define __alloc_percpu_gfp(_size, _align, _gfp) alloc_hooks( \
+ __pcpu_alloc(_size, _align, false, _gfp), void __percpu *, NULL)
+#define __alloc_percpu(_size, _align) alloc_hooks( \
+ __pcpu_alloc(_size, _align, false, GFP_KERNEL), void __percpu *, NULL)
+#define __alloc_reserved_percpu(_size, _align) alloc_hooks( \
+ __pcpu_alloc(_size, _align, true, GFP_KERNEL), void __percpu *, NULL)
#define alloc_percpu_gfp(type, gfp) \
(typeof(type) __percpu *)__alloc_percpu_gfp(sizeof(type), \
@@ -136,6 +142,9 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
(typeof(type) __percpu *)__alloc_percpu(sizeof(type), \
__alignof__(type))
+extern void free_percpu(void __percpu *__pdata);
+extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
+
extern unsigned long pcpu_nr_pages(void);
#endif /* __LINUX_PERCPU_H */
diff --git a/mm/percpu.c b/mm/percpu.c
index 4e2592f2e58f..4b5cf260d8e0 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1728,7 +1728,7 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
#endif
/**
- * pcpu_alloc - the percpu allocator
+ * __pcpu_alloc - the percpu allocator
* @size: size of area to allocate in bytes
* @align: alignment of area (max PAGE_SIZE)
* @reserved: allocate from the reserved chunk if available
@@ -1742,8 +1742,8 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
* RETURNS:
* Percpu pointer to the allocated area on success, NULL on failure.
*/
-static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
- gfp_t gfp)
+void __percpu *__pcpu_alloc(size_t size, size_t align, bool reserved,
+ gfp_t gfp)
{
gfp_t pcpu_gfp;
bool is_atomic;
@@ -1909,6 +1909,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
+ pcpu_alloc_tag_alloc_hook(chunk, off, size);
+
return ptr;
fail_unlock:
@@ -1935,61 +1937,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
return NULL;
}
-
-/**
- * __alloc_percpu_gfp - allocate dynamic percpu area
- * @size: size of area to allocate in bytes
- * @align: alignment of area (max PAGE_SIZE)
- * @gfp: allocation flags
- *
- * Allocate zero-filled percpu area of @size bytes aligned at @align. If
- * @gfp doesn't contain %GFP_KERNEL, the allocation doesn't block and can
- * be called from any context but is a lot more likely to fail. If @gfp
- * has __GFP_NOWARN then no warning will be triggered on invalid or failed
- * allocation requests.
- *
- * RETURNS:
- * Percpu pointer to the allocated area on success, NULL on failure.
- */
-void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp)
-{
- return pcpu_alloc(size, align, false, gfp);
-}
-EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
-
-/**
- * __alloc_percpu - allocate dynamic percpu area
- * @size: size of area to allocate in bytes
- * @align: alignment of area (max PAGE_SIZE)
- *
- * Equivalent to __alloc_percpu_gfp(size, align, %GFP_KERNEL).
- */
-void __percpu *__alloc_percpu(size_t size, size_t align)
-{
- return pcpu_alloc(size, align, false, GFP_KERNEL);
-}
-EXPORT_SYMBOL_GPL(__alloc_percpu);
-
-/**
- * __alloc_reserved_percpu - allocate reserved percpu area
- * @size: size of area to allocate in bytes
- * @align: alignment of area (max PAGE_SIZE)
- *
- * Allocate zero-filled percpu area of @size bytes aligned at @align
- * from reserved percpu area if arch has set it up; otherwise,
- * allocation is served from the same dynamic area. Might sleep.
- * Might trigger writeouts.
- *
- * CONTEXT:
- * Does GFP_KERNEL allocation.
- *
- * RETURNS:
- * Percpu pointer to the allocated area on success, NULL on failure.
- */
-void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
-{
- return pcpu_alloc(size, align, true, GFP_KERNEL);
-}
+EXPORT_SYMBOL_GPL(__pcpu_alloc);
/**
* pcpu_balance_free - manage the amount of free chunks
@@ -2299,6 +2247,8 @@ void free_percpu(void __percpu *ptr)
size = pcpu_free_area(chunk, off);
+ pcpu_alloc_tag_free_hook(chunk, off, size);
+
pcpu_memcg_free_hook(chunk, off, size);
/*
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
To store a codetag for every per-cpu allocation, a codetag reference is
embedded into pcpuobj_ext when CONFIG_MEM_ALLOC_PROFILING=y. Hooks that
use the newly introduced codetag reference are added.
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/percpu-internal.h | 11 +++++++++--
mm/percpu.c | 26 ++++++++++++++++++++++++++
2 files changed, 35 insertions(+), 2 deletions(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 2433e7b24172..c5d1d6723a66 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -36,9 +36,12 @@ struct pcpuobj_ext {
#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cgroup;
#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ union codetag_ref tag;
+#endif
};
-#ifdef CONFIG_MEMCG_KMEM
+#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEM_ALLOC_PROFILING)
#define NEED_PCPUOBJ_EXT
#endif
@@ -79,7 +82,11 @@ struct pcpu_chunk {
static inline bool need_pcpuobj_ext(void)
{
- return !mem_cgroup_kmem_disabled();
+ if (IS_ENABLED(CONFIG_MEM_ALLOC_PROFILING))
+ return true;
+ if (!mem_cgroup_kmem_disabled())
+ return true;
+ return false;
}
extern spinlock_t pcpu_lock;
diff --git a/mm/percpu.c b/mm/percpu.c
index 95b26a6b718d..4e2592f2e58f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1701,6 +1701,32 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
}
#endif /* CONFIG_MEMCG_KMEM */
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
+ size_t size)
+{
+ if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts)) {
+ alloc_tag_add(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag,
+ current->alloc_tag, size);
+ }
+}
+
+static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+ if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts))
+ alloc_tag_sub_noalloc(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag, size);
+}
+#else
+static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
+ size_t size)
+{
+}
+
+static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+}
+#endif
+
/**
* pcpu_alloc - the percpu allocator
* @size: size of area to allocate in bytes
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
Replace the linux/percpu.h include with asm/percpu.h to avoid a circular
dependency, and include linux/smp.h explicitly.
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/arm64/include/asm/spectre.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/spectre.h b/arch/arm64/include/asm/spectre.h
index db7b371b367c..31823d9715ab 100644
--- a/arch/arm64/include/asm/spectre.h
+++ b/arch/arm64/include/asm/spectre.h
@@ -13,8 +13,8 @@
#define __BP_HARDEN_HYP_VECS_SZ ((BP_HARDEN_EL2_SLOTS - 1) * SZ_2K)
#ifndef __ASSEMBLY__
-
-#include <linux/percpu.h>
+#include <linux/smp.h>
+#include <asm/percpu.h>
#include <asm/cpufeature.h>
#include <asm/virt.h>
--
2.40.1.495.gc816e09b53d-goog
Move page_owner's save_stack() function into the stackdepot API as
stack_depot_capture_stack() so that it can be used outside of
page_owner. Also rename task_struct's in_page_owner to the in_capture_stack
flag to better convey the wider use of this flag.
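The recursion guard that moves along with the function can be illustrated with a small userspace model. This is a sketch only: handle_t, capture(), the save_fn callback and the two placeholder handle values are hypothetical, and a thread-local variable stands in for the per-task in_capture_stack bit.

```c
#include <assert.h>

typedef unsigned int handle_t;

#define RECURSION_HANDLE 1u	/* placeholder trace for re-entrant calls */
#define FAILURE_HANDLE   2u	/* placeholder trace when saving fails */

/* Per-thread flag, standing in for current->in_capture_stack. */
static _Thread_local int in_capture;

/* Hypothetical saver: may itself allocate and re-enter capture(). */
typedef handle_t (*save_fn)(void);

static handle_t capture(save_fn save)
{
	handle_t handle;

	/* A nested call gets a fixed handle instead of recursing. */
	if (in_capture)
		return RECURSION_HANDLE;
	in_capture = 1;

	handle = save();
	if (!handle)
		handle = FAILURE_HANDLE;

	in_capture = 0;
	return handle;
}

/* Demo savers: success, failure, and one that re-enters capture(). */
static handle_t nested_result;
static handle_t save_ok(void) { return 42; }
static handle_t save_fail(void) { return 0; }
static handle_t save_reentrant(void)
{
	nested_result = capture(save_ok);
	return 7;
}
```

The outer call still completes normally; only the nested call is short-circuited, which is why the caller always receives a usable handle.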
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/sched.h | 6 ++--
include/linux/stackdepot.h | 16 +++++++++
lib/stackdepot.c | 68 ++++++++++++++++++++++++++++++++++++++
mm/page_owner.c | 52 ++---------------------------
4 files changed, 90 insertions(+), 52 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 33708bf8f191..6eca46ab6d78 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -942,9 +942,9 @@ struct task_struct {
/* Stalled due to lack of memory */
unsigned in_memstall:1;
#endif
-#ifdef CONFIG_PAGE_OWNER
- /* Used by page_owner=on to detect recursion in page tracking. */
- unsigned in_page_owner:1;
+#ifdef CONFIG_STACKDEPOT
+ /* Used by stack_depot_capture_stack to detect recursion. */
+ unsigned in_capture_stack:1;
#endif
#ifdef CONFIG_EVENTFD
/* Recursion prevention for eventfd_signal() */
diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index e58306783d8e..baf7e80cf449 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -164,4 +164,20 @@ depot_stack_handle_t __must_check stack_depot_set_extra_bits(
*/
unsigned int stack_depot_get_extra_bits(depot_stack_handle_t handle);
+/**
+ * stack_depot_capture_init - Initialize stack depot capture mechanism
+ *
+ * Return: Stack depot initialization status
+ */
+bool stack_depot_capture_init(void);
+
+/**
+ * stack_depot_capture_stack - Capture current stack trace into stack depot
+ *
+ * @flags: Allocation GFP flags
+ *
+ * Return: Handle of the stack trace stored in depot, 0 on failure
+ */
+depot_stack_handle_t stack_depot_capture_stack(gfp_t flags);
+
#endif
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 2f5aa851834e..c7e5e22fcb16 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -539,3 +539,71 @@ unsigned int stack_depot_get_extra_bits(depot_stack_handle_t handle)
return parts.extra;
}
EXPORT_SYMBOL(stack_depot_get_extra_bits);
+
+static depot_stack_handle_t recursion_handle;
+static depot_stack_handle_t failure_handle;
+
+static __always_inline depot_stack_handle_t create_custom_stack(void)
+{
+ unsigned long entries[4];
+ unsigned int nr_entries;
+
+ nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
+ return stack_depot_save(entries, nr_entries, GFP_KERNEL);
+}
+
+static noinline void register_recursion_stack(void)
+{
+ recursion_handle = create_custom_stack();
+}
+
+static noinline void register_failure_stack(void)
+{
+ failure_handle = create_custom_stack();
+}
+
+bool stack_depot_capture_init(void)
+{
+ static DEFINE_MUTEX(stack_depot_capture_init_mutex);
+ static bool utility_stacks_ready;
+
+ mutex_lock(&stack_depot_capture_init_mutex);
+ if (!utility_stacks_ready) {
+ register_recursion_stack();
+ register_failure_stack();
+ utility_stacks_ready = true;
+ }
+ mutex_unlock(&stack_depot_capture_init_mutex);
+
+ return utility_stacks_ready;
+}
+
+/* TODO: teach stack_depot_capture_stack to use off-stack temporary storage */
+#define CAPTURE_STACK_DEPTH (16)
+
+depot_stack_handle_t stack_depot_capture_stack(gfp_t flags)
+{
+ unsigned long entries[CAPTURE_STACK_DEPTH];
+ depot_stack_handle_t handle;
+ unsigned int nr_entries;
+
+ /*
+ * Avoid recursion.
+ *
+ * Sometimes page metadata allocation tracking requires more
+ * memory to be allocated:
+ * - when new stack trace is saved to stack depot
+ * - when backtrace itself is calculated (ia64)
+ */
+ if (current->in_capture_stack)
+ return recursion_handle;
+ current->in_capture_stack = 1;
+
+ nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2);
+ handle = stack_depot_save(entries, nr_entries, flags);
+ if (!handle)
+ handle = failure_handle;
+
+ current->in_capture_stack = 0;
+ return handle;
+}
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 8b6086c666e6..9fafbc290d5b 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -15,12 +15,6 @@
#include "internal.h"
-/*
- * TODO: teach PAGE_OWNER_STACK_DEPTH (__dump_page_owner and save_stack)
- * to use off stack temporal storage
- */
-#define PAGE_OWNER_STACK_DEPTH (16)
-
struct page_owner {
unsigned short order;
short last_migrate_reason;
@@ -37,8 +31,6 @@ struct page_owner {
static bool page_owner_enabled __initdata;
DEFINE_STATIC_KEY_FALSE(page_owner_inited);
-static depot_stack_handle_t dummy_handle;
-static depot_stack_handle_t failure_handle;
static depot_stack_handle_t early_handle;
static void init_early_allocated_pages(void);
@@ -68,16 +60,6 @@ static __always_inline depot_stack_handle_t create_dummy_stack(void)
return stack_depot_save(entries, nr_entries, GFP_KERNEL);
}
-static noinline void register_dummy_stack(void)
-{
- dummy_handle = create_dummy_stack();
-}
-
-static noinline void register_failure_stack(void)
-{
- failure_handle = create_dummy_stack();
-}
-
static noinline void register_early_stack(void)
{
early_handle = create_dummy_stack();
@@ -88,8 +70,7 @@ static __init void init_page_owner(void)
if (!page_owner_enabled)
return;
- register_dummy_stack();
- register_failure_stack();
+ stack_depot_capture_init();
register_early_stack();
static_branch_enable(&page_owner_inited);
init_early_allocated_pages();
@@ -107,33 +88,6 @@ static inline struct page_owner *get_page_owner(struct page_ext *page_ext)
return (void *)page_ext + page_owner_ops.offset;
}
-static noinline depot_stack_handle_t save_stack(gfp_t flags)
-{
- unsigned long entries[PAGE_OWNER_STACK_DEPTH];
- depot_stack_handle_t handle;
- unsigned int nr_entries;
-
- /*
- * Avoid recursion.
- *
- * Sometimes page metadata allocation tracking requires more
- * memory to be allocated:
- * - when new stack trace is saved to stack depot
- * - when backtrace itself is calculated (ia64)
- */
- if (current->in_page_owner)
- return dummy_handle;
- current->in_page_owner = 1;
-
- nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2);
- handle = stack_depot_save(entries, nr_entries, flags);
- if (!handle)
- handle = failure_handle;
-
- current->in_page_owner = 0;
- return handle;
-}
-
void __reset_page_owner(struct page *page, unsigned short order)
{
int i;
@@ -146,7 +100,7 @@ void __reset_page_owner(struct page *page, unsigned short order)
if (unlikely(!page_ext))
return;
- handle = save_stack(GFP_NOWAIT | __GFP_NOWARN);
+ handle = stack_depot_capture_stack(GFP_NOWAIT | __GFP_NOWARN);
for (i = 0; i < (1 << order); i++) {
__clear_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
page_owner = get_page_owner(page_ext);
@@ -189,7 +143,7 @@ noinline void __set_page_owner(struct page *page, unsigned short order,
struct page_ext *page_ext;
depot_stack_handle_t handle;
- handle = save_stack(gfp_mask);
+ handle = stack_depot_capture_stack(gfp_mask);
page_ext = page_ext_get(page);
if (unlikely(!page_ext))
--
2.40.1.495.gc816e09b53d-goog
Add support for code tag context capture when registering a new code tag
type. When context capture for a specific code tag is enabled,
codetag_ref will point to a codetag_ctx object which can be attached
to an application-specific object storing code invocation context.
codetag_ctx has a pointer to its codetag_with_ctx object, which embeds the
codetag object. All context objects of the same code tag are placed
into the codetag_with_ctx.ctx_head linked list. codetag.flags is used to
indicate when context capture for the associated code tag is
initialized and enabled.
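The direct versus indirect reference scheme can be modelled in a few lines of userspace C. This is a hedged sketch: the layout of struct tag_ctx is an assumption (the real codetag_ctx is defined in codetag_ctx.h), and FLAG_CTX_PTR, is_ctx_ref() and ref_to_tag() are illustrative names. The only property relied on is that both structures start with a flags word.

```c
#include <assert.h>
#include <stddef.h>

#define FLAG_CTX_PTR (1u << 0)	/* set only in context objects */

struct tag {
	unsigned int flags;	/* must be first: shared with struct tag_ctx */
	unsigned int lineno;
};

struct tag_with_ctx {
	struct tag ct;		/* embedded tag */
};

/* Hypothetical context layout, assumed for this example only. */
struct tag_ctx {
	unsigned int flags;	/* must be first: carries FLAG_CTX_PTR */
	struct tag_with_ctx *ctc;
};

union tag_ref {
	struct tag *ct;		/* direct reference */
	struct tag_ctx *ctx;	/* indirect reference via a context */
};

/* Both layouts start with flags, so read it via a pointer to the first member. */
static int is_ctx_ref(const union tag_ref *ref)
{
	return !!(*(const unsigned int *)ref->ct & FLAG_CTX_PTR);
}

/* Resolve either kind of reference to the underlying tag. */
static struct tag *ref_to_tag(const union tag_ref *ref)
{
	return is_ctx_ref(ref) ? &ref->ctx->ctc->ct : ref->ct;
}
```

With this layout a single pointer-sized reference serves both modes, and code that only needs the tag does not have to know whether context capture is active.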
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/codetag.h | 50 +++++++++++++-
include/linux/codetag_ctx.h | 48 +++++++++++++
lib/codetag.c | 134 ++++++++++++++++++++++++++++++++++++
3 files changed, 231 insertions(+), 1 deletion(-)
create mode 100644 include/linux/codetag_ctx.h
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 87207f199ac9..9ab2f017e845 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -5,8 +5,12 @@
#ifndef _LINUX_CODETAG_H
#define _LINUX_CODETAG_H
+#include <linux/container_of.h>
+#include <linux/spinlock.h>
#include <linux/types.h>
+struct kref;
+struct codetag_ctx;
struct codetag_iterator;
struct codetag_type;
struct seq_buf;
@@ -18,15 +22,38 @@ struct module;
* an array of these.
*/
struct codetag {
- unsigned int flags; /* used in later patches */
+ unsigned int flags; /* has to be the first member shared with codetag_ctx */
unsigned int lineno;
const char *modname;
const char *function;
const char *filename;
} __aligned(8);
+/* codetag_with_ctx flags */
+#define CTC_FLAG_CTX_PTR (1 << 0)
+#define CTC_FLAG_CTX_READY (1 << 1)
+#define CTC_FLAG_CTX_ENABLED (1 << 2)
+
+/*
+ * Code tag with context capture support. Contains a list to store context for
+ * each tag hit, a lock protecting the list and a flag to indicate whether
+ * context capture is enabled for the tag.
+ */
+struct codetag_with_ctx {
+ struct codetag ct;
+ struct list_head ctx_head;
+ spinlock_t ctx_lock;
+} __aligned(8);
+
+/*
+ * Tag reference can point to codetag directly or indirectly via codetag_ctx.
+ * Direct codetag pointer is used when context capture is disabled or not
+ * supported. When context capture for the tag is used, the reference points
+ * to the codetag_ctx through which the codetag can be reached.
+ */
union codetag_ref {
struct codetag *ct;
+ struct codetag_ctx *ctx;
};
struct codetag_range {
@@ -46,6 +73,7 @@ struct codetag_type_desc {
struct codetag_module *cmod);
bool (*module_unload)(struct codetag_type *cttype,
struct codetag_module *cmod);
+ void (*free_ctx)(struct kref *ref);
};
struct codetag_iterator {
@@ -53,6 +81,7 @@ struct codetag_iterator {
struct codetag_module *cmod;
unsigned long mod_id;
struct codetag *ct;
+ struct codetag_ctx *ctx;
};
#define CODE_TAG_INIT { \
@@ -63,9 +92,28 @@ struct codetag_iterator {
.flags = 0, \
}
+static inline bool is_codetag_ctx_ref(union codetag_ref *ref)
+{
+ return !!(ref->ct->flags & CTC_FLAG_CTX_PTR);
+}
+
+static inline
+struct codetag_with_ctx *ct_to_ctc(struct codetag *ct)
+{
+ return container_of(ct, struct codetag_with_ctx, ct);
+}
+
void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
struct codetag *codetag_next_ct(struct codetag_iterator *iter);
+struct codetag_ctx *codetag_next_ctx(struct codetag_iterator *iter);
+
+bool codetag_enable_ctx(struct codetag_with_ctx *ctc, bool enable);
+static inline bool codetag_ctx_enabled(struct codetag_with_ctx *ctc)
+{
+ return !!(ctc->ct.flags & CTC_FLAG_CTX_ENABLED);
+}
+bool codetag_has_ctx(struct codetag_with_ctx *ctc);
void codetag_to_text(struct seq_buf *out, struct codetag *ct);
diff --git a/include/linux/codetag_ctx.h b/include/linux/codetag_ctx.h
new file mode 100644
index 000000000000..e741484f0e08
--- /dev/null
+++ b/include/linux/codetag_ctx.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * code tag context
+ */
+#ifndef _LINUX_CODETAG_CTX_H
+#define _LINUX_CODETAG_CTX_H
+
+#include <linux/codetag.h>
+#include <linux/kref.h>
+
+/* Code tag hit context. */
+struct codetag_ctx {
+ unsigned int flags; /* has to be the first member shared with codetag */
+ struct codetag_with_ctx *ctc;
+ struct list_head node;
+ struct kref refcount;
+} __aligned(8);
+
+static inline struct codetag_ctx *kref_to_ctx(struct kref *refcount)
+{
+ return container_of(refcount, struct codetag_ctx, refcount);
+}
+
+static inline void add_ctx(struct codetag_ctx *ctx,
+ struct codetag_with_ctx *ctc)
+{
+ kref_init(&ctx->refcount);
+ spin_lock(&ctc->ctx_lock);
+ ctx->flags = CTC_FLAG_CTX_PTR;
+ ctx->ctc = ctc;
+ list_add_tail(&ctx->node, &ctc->ctx_head);
+ spin_unlock(&ctc->ctx_lock);
+}
+
+static inline void rem_ctx(struct codetag_ctx *ctx,
+ void (*free_ctx)(struct kref *refcount))
+{
+ struct codetag_with_ctx *ctc = ctx->ctc;
+
+ spin_lock(&ctc->ctx_lock);
+ /* ctx might have been removed while we were using it */
+ if (!list_empty(&ctx->node))
+ list_del_init(&ctx->node);
+ spin_unlock(&ctc->ctx_lock);
+ kref_put(&ctx->refcount, free_ctx);
+}
+
+#endif /* _LINUX_CODETAG_CTX_H */
diff --git a/lib/codetag.c b/lib/codetag.c
index 84f90f3b922c..d891bbe4481d 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-only
#include <linux/codetag.h>
+#include <linux/codetag_ctx.h>
#include <linux/idr.h>
#include <linux/kallsyms.h>
#include <linux/module.h>
@@ -92,6 +93,139 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
return ct;
}
+static struct codetag_ctx *next_ctx_from_ct(struct codetag_iterator *iter)
+{
+ struct codetag_with_ctx *ctc;
+ struct codetag_ctx *ctx = NULL;
+ struct codetag *ct = iter->ct;
+
+ while (ct) {
+ if (!(ct->flags & CTC_FLAG_CTX_READY))
+ goto next;
+
+ ctc = ct_to_ctc(ct);
+ spin_lock(&ctc->ctx_lock);
+ if (!list_empty(&ctc->ctx_head)) {
+ ctx = list_first_entry(&ctc->ctx_head,
+ struct codetag_ctx, node);
+ kref_get(&ctx->refcount);
+ }
+ spin_unlock(&ctc->ctx_lock);
+ if (ctx)
+ break;
+next:
+ ct = codetag_next_ct(iter);
+ }
+
+ iter->ctx = ctx;
+ return ctx;
+}
+
+struct codetag_ctx *codetag_next_ctx(struct codetag_iterator *iter)
+{
+ struct codetag_ctx *ctx = iter->ctx;
+ struct codetag_ctx *found = NULL;
+
+ lockdep_assert_held(&iter->cttype->mod_lock);
+
+ if (!ctx)
+ return next_ctx_from_ct(iter);
+
+ spin_lock(&ctx->ctc->ctx_lock);
+ /*
+ * Do not advance if the object was isolated, restart at the same tag.
+ */
+ if (!list_empty(&ctx->node)) {
+ if (list_is_last(&ctx->node, &ctx->ctc->ctx_head)) {
+ /* Finished with this tag, advance to the next */
+ codetag_next_ct(iter);
+ } else {
+ found = list_next_entry(ctx, node);
+ kref_get(&found->refcount);
+ }
+ }
+ spin_unlock(&ctx->ctc->ctx_lock);
+ kref_put(&ctx->refcount, iter->cttype->desc.free_ctx);
+
+ if (!found)
+ return next_ctx_from_ct(iter);
+
+ iter->ctx = found;
+ return found;
+}
+
+static struct codetag_type *find_cttype(struct codetag *ct)
+{
+ struct codetag_module *cmod;
+ struct codetag_type *cttype;
+ unsigned long mod_id;
+ unsigned long tmp;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link) {
+ down_read(&cttype->mod_lock);
+ idr_for_each_entry_ul(&cttype->mod_idr, cmod, tmp, mod_id) {
+ if (ct >= cmod->range.start && ct < cmod->range.stop) {
+ up_read(&cttype->mod_lock);
+ goto found;
+ }
+ }
+ up_read(&cttype->mod_lock);
+ }
+ cttype = NULL;
+found:
+ mutex_unlock(&codetag_lock);
+
+ return cttype;
+}
+
+bool codetag_enable_ctx(struct codetag_with_ctx *ctc, bool enable)
+{
+ struct codetag_type *cttype = find_cttype(&ctc->ct);
+
+ if (!cttype || !cttype->desc.free_ctx)
+ return false;
+
+ lockdep_assert_held(&cttype->mod_lock);
+ BUG_ON(!rwsem_is_locked(&cttype->mod_lock));
+
+ if (codetag_ctx_enabled(ctc) == enable)
+ return false;
+
+ if (enable) {
+ /* Initialize context capture fields only once */
+ if (!(ctc->ct.flags & CTC_FLAG_CTX_READY)) {
+ spin_lock_init(&ctc->ctx_lock);
+ INIT_LIST_HEAD(&ctc->ctx_head);
+ ctc->ct.flags |= CTC_FLAG_CTX_READY;
+ }
+ ctc->ct.flags |= CTC_FLAG_CTX_ENABLED;
+ } else {
+ /*
+ * The list of context objects is intentionally left untouched.
+ * It can be read back and if context capture is re-enabled it
+ * will append new objects.
+ */
+ ctc->ct.flags &= ~CTC_FLAG_CTX_ENABLED;
+ }
+
+ return true;
+}
+
+bool codetag_has_ctx(struct codetag_with_ctx *ctc)
+{
+ bool no_ctx;
+
+ if (!(ctc->ct.flags & CTC_FLAG_CTX_READY))
+ return false;
+
+ spin_lock(&ctc->ctx_lock);
+ no_ctx = list_empty(&ctc->ctx_head);
+ spin_unlock(&ctc->ctx_lock);
+
+ return !no_ctx;
+}
+
void codetag_to_text(struct seq_buf *out, struct codetag *ct)
{
seq_buf_printf(out, "%s:%u module:%s func:%s",
--
2.40.1.495.gc816e09b53d-goog
Implement mechanisms for capturing allocation call context which consists
of:
- allocation size
- pid, tgid and name of the allocating task
- allocation timestamp
- allocation call stack
The patch creates the allocations.ctx debugfs file. Writing to it enables
or disables context capture for a specific code tag; reading it returns
the captured context.
Usage example:
echo "file include/asm-generic/pgalloc.h line 63 enable" > \
/sys/kernel/debug/allocations.ctx
cat /sys/kernel/debug/allocations.ctx
91.0MiB 212 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
size: 4096
pid: 1551
tgid: 1551
comm: cat
ts: 670109646361
call stack:
pte_alloc_one+0xfe/0x130
__pte_alloc+0x22/0x90
move_page_tables.part.0+0x994/0xa60
shift_arg_pages+0xa4/0x180
setup_arg_pages+0x286/0x2d0
load_elf_binary+0x4e1/0x18d0
bprm_execve+0x26b/0x660
do_execveat_common.isra.0+0x19d/0x220
__x64_sys_execve+0x2e/0x40
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
size: 4096
pid: 1551
tgid: 1551
comm: cat
ts: 670109711801
call stack:
pte_alloc_one+0xfe/0x130
__do_fault+0x52/0xc0
__handle_mm_fault+0x7d9/0xdd0
handle_mm_fault+0xc0/0x2b0
do_user_addr_fault+0x1c3/0x660
exc_page_fault+0x62/0x150
asm_exc_page_fault+0x22/0x30
...
echo "file include/asm-generic/pgalloc.h line 63 disable" > \
/sys/kernel/debug/allocations.ctx
Note that disabling context capture will not clear already captured
context but no new context will be captured.
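The enable and disable keywords written to allocations.ctx are matched
against a string table generated from an X-macro. Below is a minimal
userspace sketch of that pattern. The names CAPTURE_TOKENS, capture_token
and match_token are hypothetical, and a plain strcmp() loop is assumed in
place of the kernel's match_string().

```c
#include <assert.h>
#include <string.h>

/* Single list of keywords; the enum and string table are both derived from it. */
#define CAPTURE_TOKENS() \
	x(disable) \
	x(enable)

enum capture_token {
#define x(name) TOK_##name,
	CAPTURE_TOKENS()
#undef x
	TOK_COUNT
};

static const char * const capture_token_strs[] = {
#define x(name) #name,
	CAPTURE_TOKENS()
#undef x
};

/* Return the token index for 'cmd', or -1 if the keyword is unknown. */
static int match_token(const char *cmd)
{
	for (int i = 0; i < TOK_COUNT; i++)
		if (strcmp(cmd, capture_token_strs[i]) == 0)
			return i;
	return -1;
}
```

Keeping the keywords in one macro means a new command only has to be added
in a single place for the enum and the string table to stay in sync.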
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/alloc_tag.h | 25 +++-
include/linux/codetag.h | 3 +-
include/linux/pgalloc_tag.h | 4 +-
lib/Kconfig.debug | 1 +
lib/alloc_tag.c | 238 +++++++++++++++++++++++++++++++++++-
lib/codetag.c | 20 +--
6 files changed, 272 insertions(+), 19 deletions(-)
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 07922d81b641..2a3d248aae10 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -17,20 +17,29 @@
* an array of these. Embedded codetag utilizes codetag framework.
*/
struct alloc_tag {
- struct codetag ct;
+ struct codetag_with_ctx ctc;
struct lazy_percpu_counter bytes_allocated;
} __aligned(8);
#ifdef CONFIG_MEM_ALLOC_PROFILING
+static inline struct alloc_tag *ctc_to_alloc_tag(struct codetag_with_ctx *ctc)
+{
+ return container_of(ctc, struct alloc_tag, ctc);
+}
+
static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
{
- return container_of(ct, struct alloc_tag, ct);
+ return container_of(ct_to_ctc(ct), struct alloc_tag, ctc);
}
+struct codetag_ctx *alloc_tag_create_ctx(struct alloc_tag *tag, size_t size);
+void alloc_tag_free_ctx(struct codetag_ctx *ctx, struct alloc_tag **ptag);
+bool alloc_tag_enable_ctx(struct alloc_tag *tag, bool enable);
+
#define DEFINE_ALLOC_TAG(_alloc_tag, _old) \
static struct alloc_tag _alloc_tag __used __aligned(8) \
- __section("alloc_tags") = { .ct = CODE_TAG_INIT }; \
+ __section("alloc_tags") = { .ctc.ct = CODE_TAG_INIT }; \
struct alloc_tag * __maybe_unused _old = alloc_tag_save(&_alloc_tag)
extern struct static_key_true mem_alloc_profiling_key;
@@ -54,7 +63,10 @@ static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes,
if (!ref || !ref->ct)
return;
- tag = ct_to_alloc_tag(ref->ct);
+ if (is_codetag_ctx_ref(ref))
+ alloc_tag_free_ctx(ref->ctx, &tag);
+ else
+ tag = ct_to_alloc_tag(ref->ct);
if (may_allocate)
lazy_percpu_counter_add(&tag->bytes_allocated, -bytes);
@@ -88,7 +100,10 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
if (!ref || !tag)
return;
- ref->ct = &tag->ct;
+ if (codetag_ctx_enabled(&tag->ctc))
+ ref->ctx = alloc_tag_create_ctx(tag, bytes);
+ else
+ ref->ct = &tag->ctc.ct;
lazy_percpu_counter_add(&tag->bytes_allocated, bytes);
}
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 9ab2f017e845..b6a2f0287a83 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -104,7 +104,8 @@ struct codetag_with_ctx *ct_to_ctc(struct codetag *ct)
}
void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
-struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
+void codetag_init_iter(struct codetag_iterator *iter,
+ struct codetag_type *cttype);
struct codetag *codetag_next_ct(struct codetag_iterator *iter);
struct codetag_ctx *codetag_next_ctx(struct codetag_iterator *iter);
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index 0cbba13869b5..e4661bbd40c6 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -6,6 +6,7 @@
#define _LINUX_PGALLOC_TAG_H
#include <linux/alloc_tag.h>
+#include <linux/codetag_ctx.h>
#ifdef CONFIG_MEM_ALLOC_PROFILING
@@ -70,7 +71,8 @@ static inline void pgalloc_tag_split(struct page *page, unsigned int nr)
if (!ref->ct)
goto out;
- tag = ct_to_alloc_tag(ref->ct);
+ tag = is_codetag_ctx_ref(ref) ? ctc_to_alloc_tag(ref->ctx->ctc)
+ : ct_to_alloc_tag(ref->ct);
page_ext = page_ext_next(page_ext);
for (i = 1; i < nr; i++) {
/* New reference with 0 bytes accounted */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 4157c2251b07..1b83ef17d232 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -969,6 +969,7 @@ config MEM_ALLOC_PROFILING
select LAZY_PERCPU_COUNTER
select PAGE_EXTENSION
select SLAB_OBJ_EXT
+ select STACKDEPOT
help
Track allocation source code and record total allocation size
initiated at that code location. The mechanism can be used to track
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 4a0b95a46b2e..675c7a08e38b 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -1,13 +1,18 @@
// SPDX-License-Identifier: GPL-2.0-only
#include <linux/alloc_tag.h>
+#include <linux/codetag_ctx.h>
#include <linux/debugfs.h>
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/module.h>
#include <linux/page_ext.h>
+#include <linux/sched/clock.h>
#include <linux/seq_buf.h>
+#include <linux/stackdepot.h>
#include <linux/uaccess.h>
+#define STACK_BUF_SIZE 1024
+
DEFINE_STATIC_KEY_TRUE(mem_alloc_profiling_key);
/*
@@ -23,6 +28,16 @@ static int __init mem_alloc_profiling_disable(char *s)
}
__setup("nomem_profiling", mem_alloc_profiling_disable);
+struct alloc_call_ctx {
+ struct codetag_ctx ctx;
+ size_t size;
+ pid_t pid;
+ pid_t tgid;
+ char comm[TASK_COMM_LEN];
+ u64 ts_nsec;
+ depot_stack_handle_t stack_handle;
+} __aligned(8);
+
struct alloc_tag_file_iterator {
struct codetag_iterator ct_iter;
struct seq_buf buf;
@@ -64,7 +79,7 @@ static int allocations_file_open(struct inode *inode, struct file *file)
return -ENOMEM;
codetag_lock_module_list(cttype, true);
- iter->ct_iter = codetag_get_ct_iter(cttype);
+ codetag_init_iter(&iter->ct_iter, cttype);
codetag_lock_module_list(cttype, false);
seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
file->private_data = iter;
@@ -125,24 +140,240 @@ static const struct file_operations allocations_file_ops = {
.read = allocations_file_read,
};
+static void alloc_tag_ops_free_ctx(struct kref *refcount)
+{
+ kfree(container_of(kref_to_ctx(refcount), struct alloc_call_ctx, ctx));
+}
+
+struct codetag_ctx *alloc_tag_create_ctx(struct alloc_tag *tag, size_t size)
+{
+ struct alloc_call_ctx *ac_ctx;
+
+ /* TODO: use a dedicated kmem_cache */
+ ac_ctx = kmalloc(sizeof(struct alloc_call_ctx), GFP_KERNEL);
+ if (WARN_ON(!ac_ctx))
+ return NULL;
+
+ ac_ctx->size = size;
+ ac_ctx->pid = current->pid;
+ ac_ctx->tgid = current->tgid;
+ strscpy(ac_ctx->comm, current->comm, sizeof(ac_ctx->comm));
+ ac_ctx->ts_nsec = local_clock();
+ ac_ctx->stack_handle =
+ stack_depot_capture_stack(GFP_NOWAIT | __GFP_NOWARN);
+ add_ctx(&ac_ctx->ctx, &tag->ctc);
+
+ return &ac_ctx->ctx;
+}
+EXPORT_SYMBOL_GPL(alloc_tag_create_ctx);
+
+void alloc_tag_free_ctx(struct codetag_ctx *ctx, struct alloc_tag **ptag)
+{
+ *ptag = ctc_to_alloc_tag(ctx->ctc);
+ rem_ctx(ctx, alloc_tag_ops_free_ctx);
+}
+EXPORT_SYMBOL_GPL(alloc_tag_free_ctx);
+
+bool alloc_tag_enable_ctx(struct alloc_tag *tag, bool enable)
+{
+ static bool stack_depot_ready;
+
+ if (enable && !stack_depot_ready) {
+ stack_depot_init();
+ stack_depot_capture_init();
+ stack_depot_ready = true;
+ }
+
+ return codetag_enable_ctx(&tag->ctc, enable);
+}
+
+static void alloc_tag_ctx_to_text(struct seq_buf *out, struct codetag_ctx *ctx)
+{
+ struct alloc_call_ctx *ac_ctx;
+ char *buf;
+
+ ac_ctx = container_of(ctx, struct alloc_call_ctx, ctx);
+ seq_buf_printf(out, " size: %zu\n", ac_ctx->size);
+ seq_buf_printf(out, " pid: %d\n", ac_ctx->pid);
+ seq_buf_printf(out, " tgid: %d\n", ac_ctx->tgid);
+ seq_buf_printf(out, " comm: %s\n", ac_ctx->comm);
+ seq_buf_printf(out, " ts: %llu\n", ac_ctx->ts_nsec);
+
+ buf = kmalloc(STACK_BUF_SIZE, GFP_KERNEL);
+ if (buf) {
+ int bytes_read = stack_depot_snprint(ac_ctx->stack_handle, buf,
+ STACK_BUF_SIZE - 1, 8);
+ buf[bytes_read] = '\0';
+ seq_buf_printf(out, " call stack:\n%s\n", buf);
+ }
+ kfree(buf);
+}
+
+static ssize_t allocations_ctx_file_read(struct file *file, char __user *ubuf,
+ size_t size, loff_t *ppos)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+ struct codetag_iterator *ct_iter = &iter->ct_iter;
+ struct user_buf buf = { .buf = ubuf, .size = size };
+ struct codetag_ctx *ctx;
+ struct codetag *prev_ct;
+ int err = 0;
+
+ codetag_lock_module_list(ct_iter->cttype, true);
+ while (1) {
+ err = flush_ubuf(&buf, &iter->buf);
+ if (err || !buf.size)
+ break;
+
+ prev_ct = ct_iter->ct;
+ ctx = codetag_next_ctx(ct_iter);
+ if (!ctx)
+ break;
+
+ if (prev_ct != &ctx->ctc->ct)
+ alloc_tag_to_text(&iter->buf, &ctx->ctc->ct);
+ alloc_tag_ctx_to_text(&iter->buf, ctx);
+ }
+ codetag_lock_module_list(ct_iter->cttype, false);
+
+ return err ? : buf.ret;
+}
+
+#define CTX_CAPTURE_TOKENS() \
+ x(disable, 0) \
+ x(enable, 0)
+
+static const char * const ctx_capture_token_strs[] = {
+#define x(name, nr_args) #name,
+ CTX_CAPTURE_TOKENS()
+#undef x
+ NULL
+};
+
+enum ctx_capture_token {
+#define x(name, nr_args) TOK_##name,
+ CTX_CAPTURE_TOKENS()
+#undef x
+};
+
+static int enable_ctx_capture(struct codetag_type *cttype,
+ struct codetag_query *query, bool enable)
+{
+ struct codetag_iterator ct_iter;
+ struct codetag_with_ctx *ctc;
+ struct codetag *ct;
+ unsigned int nfound = 0;
+
+ codetag_lock_module_list(cttype, true);
+
+ codetag_init_iter(&ct_iter, cttype);
+ while ((ct = codetag_next_ct(&ct_iter))) {
+ if (!codetag_matches_query(query, ct, ct_iter.cmod, NULL))
+ continue;
+
+ ctc = ct_to_ctc(ct);
+ if (codetag_ctx_enabled(ctc) == enable)
+ continue;
+
+ if (!alloc_tag_enable_ctx(ctc_to_alloc_tag(ctc), enable)) {
+ pr_warn("Failed to toggle context capture\n");
+ continue;
+ }
+
+ nfound++;
+ }
+
+ codetag_lock_module_list(cttype, false);
+
+ return nfound ? 0 : -ENOENT;
+}
+
+static int parse_command(struct codetag_type *cttype, char *buf)
+{
+ struct codetag_query query = { NULL };
+ char *cmd;
+ int ret;
+ int tok;
+
+ buf = codetag_query_parse(&query, buf);
+ if (IS_ERR(buf))
+ return PTR_ERR(buf);
+
+ cmd = strsep_no_empty(&buf, " \t\r\n");
+ if (!cmd)
+ return -EINVAL; /* no command */
+
+ tok = match_string(ctx_capture_token_strs,
+ ARRAY_SIZE(ctx_capture_token_strs), cmd);
+ if (tok < 0)
+ return -EINVAL; /* unknown command */
+
+ ret = enable_ctx_capture(cttype, &query, tok == TOK_enable);
+ if (ret < 0)
+ return ret;
+
+ return 0;
+}
+
+static ssize_t allocations_ctx_file_write(struct file *file, const char __user *ubuf,
+ size_t len, loff_t *offp)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+ char tmpbuf[256];
+
+ if (len == 0)
+ return 0;
+ /* we don't check *offp -- multiple writes() are allowed */
+ if (len > sizeof(tmpbuf) - 1)
+ return -E2BIG;
+
+ if (copy_from_user(tmpbuf, ubuf, len))
+ return -EFAULT;
+
+ tmpbuf[len] = '\0';
+ parse_command(iter->ct_iter.cttype, tmpbuf);
+
+ *offp += len;
+ return len;
+}
+
+static const struct file_operations allocations_ctx_file_ops = {
+ .owner = THIS_MODULE,
+ .open = allocations_file_open,
+ .release = allocations_file_release,
+ .read = allocations_ctx_file_read,
+ .write = allocations_ctx_file_write,
+};
+
static int __init dbgfs_init(struct codetag_type *cttype)
{
struct dentry *file;
+ struct dentry *ctx_file;
file = debugfs_create_file("allocations", 0444, NULL, cttype,
&allocations_file_ops);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ ctx_file = debugfs_create_file("allocations.ctx", 0666, NULL, cttype,
+ &allocations_ctx_file_ops);
+ if (IS_ERR(ctx_file)) {
+ debugfs_remove(file);
+ return PTR_ERR(ctx_file);
+ }
- return IS_ERR(file) ? PTR_ERR(file) : 0;
+ return 0;
}
static bool alloc_tag_module_unload(struct codetag_type *cttype, struct codetag_module *cmod)
{
- struct codetag_iterator iter = codetag_get_ct_iter(cttype);
+ struct codetag_iterator iter;
bool module_unused = true;
struct alloc_tag *tag;
struct codetag *ct;
size_t bytes;
+ codetag_init_iter(&iter, cttype);
for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
if (iter.cmod != cmod)
continue;
@@ -183,6 +414,7 @@ static int __init alloc_tag_init(void)
.section = "alloc_tags",
.tag_size = sizeof(struct alloc_tag),
.module_unload = alloc_tag_module_unload,
+ .free_ctx = alloc_tag_ops_free_ctx,
};
cttype = codetag_register_type(&desc);
diff --git a/lib/codetag.c b/lib/codetag.c
index d891bbe4481d..cbff146b3fe8 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -27,16 +27,14 @@ void codetag_lock_module_list(struct codetag_type *cttype, bool lock)
up_read(&cttype->mod_lock);
}
-struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
+void codetag_init_iter(struct codetag_iterator *iter,
+ struct codetag_type *cttype)
{
- struct codetag_iterator iter = {
- .cttype = cttype,
- .cmod = NULL,
- .mod_id = 0,
- .ct = NULL,
- };
-
- return iter;
+ iter->cttype = cttype;
+ iter->cmod = NULL;
+ iter->mod_id = 0;
+ iter->ct = NULL;
+ iter->ctx = NULL;
}
static inline struct codetag *get_first_module_ct(struct codetag_module *cmod)
@@ -128,6 +126,10 @@ struct codetag_ctx *codetag_next_ctx(struct codetag_iterator *iter)
lockdep_assert_held(&iter->cttype->mod_lock);
+ /* Move to the first codetag if search just started */
+ if (!iter->ct)
+ codetag_next_ct(iter);
+
if (!ctx)
return next_ctx_from_ct(iter);
--
2.40.1.495.gc816e09b53d-goog
Include allocations in show_mem reports.
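The report lists only the largest allocation sites, kept in a small sorted
array that is updated while iterating over all tags. A minimal userspace
sketch of that bounded insertion follows; struct entry, struct top_list
and top_list_add are hypothetical names, and an int id is assumed in place
of the codetag pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOP_N 10

struct entry {
	int id;		/* hypothetical stand-in for the codetag pointer */
	size_t bytes;
};

struct top_list {
	struct entry e[TOP_N];
	unsigned int nr;	/* number of valid entries in e[] */
};

/* Insert 'n', keeping e[] sorted by bytes (largest first) and at most TOP_N long. */
static void top_list_add(struct top_list *t, struct entry n)
{
	unsigned int i;

	/* Find the first slot holding a smaller value. */
	for (i = 0; i < t->nr; i++)
		if (n.bytes > t->e[i].bytes)
			break;

	if (i >= TOP_N)
		return;		/* not larger than anything in a full list */

	if (t->nr == TOP_N)
		t->nr--;	/* drop the current smallest to make room */

	/* Shift the tail down by one slot and place the new entry. */
	memmove(&t->e[i + 1], &t->e[i], sizeof(t->e[0]) * (t->nr - i));
	t->e[i] = n;
	t->nr++;
}
```

The array is fixed-size and lives on the stack, so the sketch needs no
allocation while building the list.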
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/alloc_tag.h | 2 ++
lib/alloc_tag.c | 48 +++++++++++++++++++++++++++++++++++----
lib/show_mem.c | 15 ++++++++++++
3 files changed, 60 insertions(+), 5 deletions(-)
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 2a3d248aae10..190ab793f7e5 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -23,6 +23,8 @@ struct alloc_tag {
#ifdef CONFIG_MEM_ALLOC_PROFILING
+void alloc_tags_show_mem_report(struct seq_buf *s);
+
static inline struct alloc_tag *ctc_to_alloc_tag(struct codetag_with_ctx *ctc)
{
return container_of(ctc, struct alloc_tag, ctc);
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 675c7a08e38b..e2ebab8999a9 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -13,6 +13,8 @@
#define STACK_BUF_SIZE 1024
+static struct codetag_type *alloc_tag_cttype;
+
DEFINE_STATIC_KEY_TRUE(mem_alloc_profiling_key);
/*
@@ -133,6 +135,43 @@ static ssize_t allocations_file_read(struct file *file, char __user *ubuf,
return err ? : buf.ret;
}
+void alloc_tags_show_mem_report(struct seq_buf *s)
+{
+ struct codetag_iterator iter;
+ struct codetag *ct;
+ struct {
+ struct codetag *tag;
+ size_t bytes;
+ } tags[10], n;
+ unsigned int i, nr = 0;
+
+ codetag_init_iter(&iter, alloc_tag_cttype);
+
+ codetag_lock_module_list(alloc_tag_cttype, true);
+ while ((ct = codetag_next_ct(&iter))) {
+ n.tag = ct;
+ n.bytes = lazy_percpu_counter_read(&ct_to_alloc_tag(ct)->bytes_allocated);
+
+ for (i = 0; i < nr; i++)
+ if (n.bytes > tags[i].bytes)
+ break;
+
+ if (i < ARRAY_SIZE(tags)) {
+ nr -= nr == ARRAY_SIZE(tags);
+ memmove(&tags[i + 1],
+ &tags[i],
+ sizeof(tags[0]) * (nr - i));
+ nr++;
+ tags[i] = n;
+ }
+ }
+
+ for (i = 0; i < nr; i++)
+ alloc_tag_to_text(s, tags[i].tag);
+
+ codetag_lock_module_list(alloc_tag_cttype, false);
+}
+
static const struct file_operations allocations_file_ops = {
.owner = THIS_MODULE,
.open = allocations_file_open,
@@ -409,7 +448,6 @@ EXPORT_SYMBOL(page_alloc_tagging_ops);
static int __init alloc_tag_init(void)
{
- struct codetag_type *cttype;
const struct codetag_type_desc desc = {
.section = "alloc_tags",
.tag_size = sizeof(struct alloc_tag),
@@ -417,10 +455,10 @@ static int __init alloc_tag_init(void)
.free_ctx = alloc_tag_ops_free_ctx,
};
- cttype = codetag_register_type(&desc);
- if (IS_ERR_OR_NULL(cttype))
- return PTR_ERR(cttype);
+ alloc_tag_cttype = codetag_register_type(&desc);
+ if (IS_ERR_OR_NULL(alloc_tag_cttype))
+ return PTR_ERR(alloc_tag_cttype);
- return dbgfs_init(cttype);
+ return dbgfs_init(alloc_tag_cttype);
}
module_init(alloc_tag_init);
diff --git a/lib/show_mem.c b/lib/show_mem.c
index 1485c87be935..5c82f29168e3 100644
--- a/lib/show_mem.c
+++ b/lib/show_mem.c
@@ -7,6 +7,7 @@
#include <linux/mm.h>
#include <linux/cma.h>
+#include <linux/seq_buf.h>
void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
{
@@ -34,4 +35,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
#ifdef CONFIG_MEMORY_FAILURE
printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ {
+ struct seq_buf s;
+ char *buf = kmalloc(4096, GFP_ATOMIC);
+
+ if (buf) {
+ printk("Memory allocations:\n");
+ seq_buf_init(&s, buf, 4096);
+ alloc_tags_show_mem_report(&s);
+ printk("%s", buf);
+ kfree(buf);
+ }
+ }
+#endif
}
--
2.40.1.495.gc816e09b53d-goog
To avoid debug warnings while freeing reserved pages which were not
allocated with usual allocators, mark their codetags as empty before
freeing.
Maybe we can annotate reserved pages correctly and avoid this?
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..f5969cb85879 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -5,6 +5,7 @@
#include <linux/errno.h>
#include <linux/mmdebug.h>
#include <linux/gfp.h>
+#include <linux/pgalloc_tag.h>
#include <linux/bug.h>
#include <linux/list.h>
#include <linux/mmzone.h>
@@ -2920,6 +2921,13 @@ extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
/* Free the reserved page into the buddy system, so it gets managed. */
static inline void free_reserved_page(struct page *page)
{
+ union codetag_ref *ref;
+
+ ref = get_page_tag_ref(page);
+ if (ref) {
+ set_codetag_empty(ref);
+ put_page_tag_ref(ref);
+ }
ClearPageReserved(page);
init_page_count(page);
__free_page(page);
--
2.40.1.495.gc816e09b53d-goog
If slabobj_ext vector allocation for a slab object fails and later
succeeds for another object in the same slab, the slabobj_ext for the
original object will be NULL and will trigger a warning when
CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled.
Mark failed slabobj_ext vector allocations using a new objext_flags flag
stored in the lower bits of slab->obj_exts. When a later allocation
succeeds, it marks all tag references in the same slabobj_ext vector as
empty to avoid the warnings raised by CONFIG_MEM_ALLOC_PROFILING_DEBUG checks.
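Storing a flag in the lower bits of a pointer-sized word works because the
vector is aligned, which leaves those bits zero in any valid pointer. A
minimal userspace sketch of the idea is shown here; EXT_ALLOC_FAIL,
EXT_FLAGS_MASK, struct ext and the helper functions are hypothetical, and
8-byte alignment of the vector is an assumption of this sketch, not a
statement about the kernel's flag layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag and mask; the kernel derives its own values. */
#define EXT_ALLOC_FAIL	((uintptr_t)1 << 0)
#define EXT_FLAGS_MASK	((uintptr_t)0x7)	/* assumes 8-byte-aligned vectors */

struct ext {
	void *ref;	/* stand-in for the per-object tag reference */
};

/* Combine an aligned vector pointer (possibly NULL) with flag bits. */
static uintptr_t ext_pack(struct ext *vec, uintptr_t flags)
{
	assert(((uintptr_t)vec & EXT_FLAGS_MASK) == 0);
	return (uintptr_t)vec | (flags & EXT_FLAGS_MASK);
}

/* Recover the vector pointer by masking off the flag bits. */
static struct ext *ext_vec(uintptr_t word)
{
	return (struct ext *)(word & ~EXT_FLAGS_MASK);
}

/* Report whether an earlier vector allocation was marked as failed. */
static int ext_alloc_failed(uintptr_t word)
{
	return (word & EXT_ALLOC_FAIL) != 0;
}
```

A word holding only the failure flag decodes to a NULL vector, which is how
a failed attempt can be remembered without any extra storage.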
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/memcontrol.h | 4 +++-
mm/slab_common.c | 27 +++++++++++++++++++++++++--
2 files changed, 28 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c7f21b15b540..3eb8975c1462 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -356,8 +356,10 @@ enum page_memcg_data_flags {
#endif /* CONFIG_MEMCG */
enum objext_flags {
+ /* slabobj_ext vector failed to allocate */
+ OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG,
/* the next bit after the last actual flag */
- __NR_OBJEXTS_FLAGS = __FIRST_OBJEXT_FLAG,
+ __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1),
};
#define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 89265f825c43..5b7e096b70a5 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -217,21 +217,44 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
{
unsigned int objects = objs_per_slab(s, slab);
unsigned long obj_exts;
- void *vec;
+ struct slabobj_ext *vec;
gfp &= ~OBJCGS_CLEAR_MASK;
/* Prevent recursive extension vector allocation */
gfp |= __GFP_NO_OBJ_EXT;
vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
slab_nid(slab));
- if (!vec)
+ if (!vec) {
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ if (new_slab) {
+ /* Mark vectors which failed to allocate */
+ slab->obj_exts = OBJEXTS_ALLOC_FAIL;
+#ifdef CONFIG_MEMCG
+ slab->obj_exts |= MEMCG_DATA_OBJEXTS;
+#endif
+ }
+#endif
return -ENOMEM;
+ }
obj_exts = (unsigned long)vec;
#ifdef CONFIG_MEMCG
obj_exts |= MEMCG_DATA_OBJEXTS;
#endif
if (new_slab) {
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ /*
+ * If vector previously failed to allocate then we have live
+ * objects with no tag reference. Mark all references in this
+ * vector as empty to avoid warnings later on.
+ */
+ if (slab->obj_exts & OBJEXTS_ALLOC_FAIL) {
+ unsigned int i;
+
+ for (i = 0; i < objects; i++)
+ set_codetag_empty(&vec[i].ref);
+ }
+#endif
/*
* If the slab is brand new and nobody can yet access its
* obj_exts, no synchronization is required and obj_exts can
--
2.40.1.495.gc816e09b53d-goog
From: Kent Overstreet <[email protected]>
The new code & libraries added are being maintained - mark them as such.
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
MAINTAINERS | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 3889d1adf71f..6f3b79266204 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5116,6 +5116,13 @@ S: Supported
F: Documentation/process/code-of-conduct-interpretation.rst
F: Documentation/process/code-of-conduct.rst
+CODE TAGGING
+M: Suren Baghdasaryan <[email protected]>
+M: Kent Overstreet <[email protected]>
+S: Maintained
+F: include/linux/codetag.h
+F: lib/codetag.c
+
COMEDI DRIVERS
M: Ian Abbott <[email protected]>
M: H Hartley Sweeten <[email protected]>
@@ -11658,6 +11665,12 @@ S: Maintained
F: Documentation/devicetree/bindings/leds/backlight/kinetic,ktz8866.yaml
F: drivers/video/backlight/ktz8866.c
+LAZY PERCPU COUNTERS
+M: Kent Overstreet <[email protected]>
+S: Maintained
+F: include/linux/lazy-percpu-counter.h
+F: lib/lazy-percpu-counter.c
+
L3MDEV
M: David Ahern <[email protected]>
L: [email protected]
@@ -13468,6 +13481,15 @@ F: mm/memblock.c
F: mm/mm_init.c
F: tools/testing/memblock/
+MEMORY ALLOCATION PROFILING
+M: Suren Baghdasaryan <[email protected]>
+M: Kent Overstreet <[email protected]>
+S: Maintained
+F: include/linux/alloc_tag.h
+F: include/linux/codetag_ctx.h
+F: lib/alloc_tag.c
+F: lib/pgalloc_tag.c
+
MEMORY CONTROLLER DRIVERS
M: Krzysztof Kozlowski <[email protected]>
L: [email protected]
--
2.40.1.495.gc816e09b53d-goog
On Mon, May 01, 2023 at 09:54:10AM -0700, Suren Baghdasaryan wrote:
> Performance overhead:
> To evaluate performance we implemented an in-kernel test executing
> multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> affinity set to a specific CPU to minimize the noise. Below is performance
> comparison between the baseline kernel, profiling when enabled, profiling
> when disabled (nomem_profiling=y) and (for comparison purposes) baseline
> with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:
>
> kmalloc pgalloc
> Baseline (6.3-rc7) 9.200s 31.050s
> profiling disabled 9.800 (+6.52%) 32.600 (+4.99%)
> profiling enabled 12.500 (+35.87%) 39.010 (+25.60%)
> memcg_kmem enabled 41.400 (+350.00%) 70.600 (+127.38%)
Hm, this makes me think we have a regression with memcg_kmem in one of
the recent releases. When I measured it a couple of years ago, the overhead
was definitely within 100%.
Do you understand what makes your profiling drastically faster than kmem?
Thanks!
On Mon, May 1, 2023 at 10:47 AM Roman Gushchin <[email protected]> wrote:
>
> On Mon, May 01, 2023 at 09:54:10AM -0700, Suren Baghdasaryan wrote:
> > Performance overhead:
> > To evaluate performance we implemented an in-kernel test executing
> > multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> > affinity set to a specific CPU to minimize the noise. Below is performance
> > comparison between the baseline kernel, profiling when enabled, profiling
> > when disabled (nomem_profiling=y) and (for comparison purposes) baseline
> > with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:
> >
> > kmalloc pgalloc
> > Baseline (6.3-rc7) 9.200s 31.050s
> > profiling disabled 9.800 (+6.52%) 32.600 (+4.99%)
> > profiling enabled 12.500 (+35.87%) 39.010 (+25.60%)
> > memcg_kmem enabled 41.400 (+350.00%) 70.600 (+127.38%)
>
> Hm, this makes me think we have a regression with memcg_kmem in one of
> the recent releases. When I measured it a couple of years ago, the overhead
> was definitely within 100%.
>
> Do you understand what makes your profiling drastically faster than kmem?
I haven't profiled or looked into kmem overhead closely but I can do
that. I just wanted to see how the overhead compares with the existing
accounting mechanisms.
For kmalloc, the overhead is low because after we create the vector of
slab_ext objects (which is the same as what memcg_kmem does), memory
profiling just increments a lazy counter (which in many cases would be
a per-cpu counter). memcg_kmem operates on cgroup hierarchy with
additional overhead associated with that. I'm guessing that's the
reason for the big difference between these mechanisms, but I didn't
look into the details to understand memcg_kmem performance.
>
> Thanks!
On Mon, May 01, 2023 at 11:08:05AM -0700, Suren Baghdasaryan wrote:
> On Mon, May 1, 2023 at 10:47 AM Roman Gushchin <[email protected]> wrote:
> >
> > On Mon, May 01, 2023 at 09:54:10AM -0700, Suren Baghdasaryan wrote:
> > > Performance overhead:
> > > To evaluate performance we implemented an in-kernel test executing
> > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> > > affinity set to a specific CPU to minimize the noise. Below is performance
> > > comparison between the baseline kernel, profiling when enabled, profiling
> > > when disabled (nomem_profiling=y) and (for comparison purposes) baseline
> > > with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:
> > >
> > > kmalloc pgalloc
> > > Baseline (6.3-rc7) 9.200s 31.050s
> > > profiling disabled 9.800 (+6.52%) 32.600 (+4.99%)
> > > profiling enabled 12.500 (+35.87%) 39.010 (+25.60%)
> > > memcg_kmem enabled 41.400 (+350.00%) 70.600 (+127.38%)
> >
> > Hm, this makes me think we have a regression with memcg_kmem in one of
> > the recent releases. When I measured it a couple of years ago, the overhead
> > was definitely within 100%.
> >
> > Do you understand what makes your profiling drastically faster than kmem?
>
> I haven't profiled or looked into kmem overhead closely but I can do
> that. I just wanted to see how the overhead compares with the existing
> accounting mechanisms.
It's a good idea and I generally think that +25-35% for kmalloc/pgalloc
should be OK for production use, which is great!
In reality, most workloads are not that sensitive to the speed of
memory allocation.
>
> For kmalloc, the overhead is low because after we create the vector of
> slab_ext objects (which is the same as what memcg_kmem does), memory
> profiling just increments a lazy counter (which in many cases would be
> a per-cpu counter).
So does kmem (this is why I'm somewhat surprised by the difference).
> memcg_kmem operates on cgroup hierarchy with
> additional overhead associated with that. I'm guessing that's the
> reason for the big difference between these mechanisms, but I didn't
> look into the details to understand memcg_kmem performance.
I suspect recent rt-related changes and also the wide usage of
rcu primitives in the kmem code. I'll try to look closer as well.
Thanks!
On Mon, 01 May 2023, Suren Baghdasaryan wrote:
>From: Kent Overstreet <[email protected]>
>
>Previously, string_get_size() outputted a space between the number and
>the units, i.e.
> 9.88 MiB
>
>This changes it to
> 9.88MiB
>
>which allows it to be parsed correctly by the 'sort -h' command.
Wouldn't this break users that already parse it the current way?
Thanks,
Davidlohr
Hi--
On 5/1/23 09:54, Suren Baghdasaryan wrote:
> From: Kent Overstreet <[email protected]>
>
> This patch adds lib/lazy-percpu-counter.c, which implements counters
> that start out as atomics, but lazily switch to percpu mode if the
> update rate crosses some threshold (arbitrarily set at 256 per second).
>
from submitting-patches.rst:
Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
to do frotz", as if you are giving orders to the codebase to change
its behaviour.
> Signed-off-by: Kent Overstreet <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/lazy-percpu-counter.h | 102 ++++++++++++++++++++++
> lib/Kconfig | 3 +
> lib/Makefile | 2 +
> lib/lazy-percpu-counter.c | 127 ++++++++++++++++++++++++++++
> 4 files changed, 234 insertions(+)
> create mode 100644 include/linux/lazy-percpu-counter.h
> create mode 100644 lib/lazy-percpu-counter.c
>
> diff --git a/include/linux/lazy-percpu-counter.h b/include/linux/lazy-percpu-counter.h
> new file mode 100644
> index 000000000000..45ca9e2ce58b
> --- /dev/null
> +++ b/include/linux/lazy-percpu-counter.h
> @@ -0,0 +1,102 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Lazy percpu counters:
> + * (C) 2022 Kent Overstreet
> + *
> + * Lazy percpu counters start out in atomic mode, then switch to percpu mode if
> + * the update rate crosses some threshold.
> + *
> + * This means we don't have to decide between low memory overhead atomic
> + * counters and higher performance percpu counters - we can have our cake and
> + * eat it, too!
> + *
> + * Internally we use an atomic64_t, where the low bit indicates whether we're in
> + * percpu mode, and the high 8 bits are a secondary counter that's incremented
> + * when the counter is modified - meaning 55 bits of precision are available for
> + * the counter itself.
> + */
> +
> +#ifndef _LINUX_LAZY_PERCPU_COUNTER_H
> +#define _LINUX_LAZY_PERCPU_COUNTER_H
> +
> +#include <linux/atomic.h>
> +#include <asm/percpu.h>
> +
> +struct lazy_percpu_counter {
> + atomic64_t v;
> + unsigned long last_wrap;
> +};
> +
> +void lazy_percpu_counter_exit(struct lazy_percpu_counter *c);
> +void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i);
> +void lazy_percpu_counter_add_slowpath_noupgrade(struct lazy_percpu_counter *c, s64 i);
> +s64 lazy_percpu_counter_read(struct lazy_percpu_counter *c);
> +
> +/*
> + * We use the high bits of the atomic counter for a secondary counter, which is
> + * incremented every time the counter is touched. When the secondary counter
> + * wraps, we check the time the counter last wrapped, and if it was recent
> + * enough that means the update frequency has crossed our threshold and we
> + * switch to percpu mode:
> + */
> +#define COUNTER_MOD_BITS 8
> +#define COUNTER_MOD_MASK ~(~0ULL >> COUNTER_MOD_BITS)
> +#define COUNTER_MOD_BITS_START (64 - COUNTER_MOD_BITS)
> +
> +/*
> + * We use the low bit of the counter to indicate whether we're in atomic mode
> + * (low bit clear), or percpu mode (low bit set, counter is a pointer to actual
> + * percpu counters:
> + */
> +#define COUNTER_IS_PCPU_BIT 1
> +
> +static inline u64 __percpu *lazy_percpu_counter_is_pcpu(u64 v)
> +{
> + if (!(v & COUNTER_IS_PCPU_BIT))
> + return NULL;
> +
> + v ^= COUNTER_IS_PCPU_BIT;
> + return (u64 __percpu *)(unsigned long)v;
> +}
> +
> +/**
> + * lazy_percpu_counter_add: Add a value to a lazy_percpu_counter
For kernel-doc, the function name should be followed by '-', not ':'.
(many places)
> + *
> + * @c: counter to modify
> + * @i: value to add
> + */
> +static inline void lazy_percpu_counter_add(struct lazy_percpu_counter *c, s64 i)
> +{
> + u64 v = atomic64_read(&c->v);
> + u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
> +
> + if (likely(pcpu_v))
> + this_cpu_add(*pcpu_v, i);
> + else
> + lazy_percpu_counter_add_slowpath(c, i);
> +}
> +
> +/**
> + * lazy_percpu_counter_add_noupgrade: Add a value to a lazy_percpu_counter,
> + * without upgrading to percpu mode
> + *
> + * @c: counter to modify
> + * @i: value to add
> + */
> +static inline void lazy_percpu_counter_add_noupgrade(struct lazy_percpu_counter *c, s64 i)
> +{
> + u64 v = atomic64_read(&c->v);
> + u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
> +
> + if (likely(pcpu_v))
> + this_cpu_add(*pcpu_v, i);
> + else
> + lazy_percpu_counter_add_slowpath_noupgrade(c, i);
> +}
> +
> +static inline void lazy_percpu_counter_sub(struct lazy_percpu_counter *c, s64 i)
> +{
> + lazy_percpu_counter_add(c, -i);
> +}
> +
> +#endif /* _LINUX_LAZY_PERCPU_COUNTER_H */
> diff --git a/lib/lazy-percpu-counter.c b/lib/lazy-percpu-counter.c
> new file mode 100644
> index 000000000000..4f4e32c2dc09
> --- /dev/null
> +++ b/lib/lazy-percpu-counter.c
> @@ -0,0 +1,127 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/atomic.h>
> +#include <linux/gfp.h>
> +#include <linux/jiffies.h>
> +#include <linux/lazy-percpu-counter.h>
> +#include <linux/percpu.h>
> +
> +static inline s64 lazy_percpu_counter_atomic_val(s64 v)
> +{
> + /* Ensure output is sign extended properly: */
> + return (v << COUNTER_MOD_BITS) >>
> + (COUNTER_MOD_BITS + COUNTER_IS_PCPU_BIT);
> +}
> +
...
> +
> +/**
> + * lazy_percpu_counter_exit: Free resources associated with a
> + * lazy_percpu_counter
Same kernel-doc comment.
> + *
> + * @c: counter to exit
> + */
> +void lazy_percpu_counter_exit(struct lazy_percpu_counter *c)
> +{
> + free_percpu(lazy_percpu_counter_is_pcpu(atomic64_read(&c->v)));
> +}
> +EXPORT_SYMBOL_GPL(lazy_percpu_counter_exit);
> +
> +/**
> + * lazy_percpu_counter_read: Read current value of a lazy_percpu_counter
> + *
> + * @c: counter to read
> + */
> +s64 lazy_percpu_counter_read(struct lazy_percpu_counter *c)
> +{
> + s64 v = atomic64_read(&c->v);
> + u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
> +
> + if (pcpu_v) {
> + int cpu;
> +
> + v = 0;
> + for_each_possible_cpu(cpu)
> + v += *per_cpu_ptr(pcpu_v, cpu);
> + } else {
> + v = lazy_percpu_counter_atomic_val(v);
> + }
> +
> + return v;
> +}
> +EXPORT_SYMBOL_GPL(lazy_percpu_counter_read);
> +
> +void lazy_percpu_counter_add_slowpath(struct lazy_percpu_counter *c, s64 i)
> +{
> + u64 atomic_i;
> + u64 old, v = atomic64_read(&c->v);
> + u64 __percpu *pcpu_v;
> +
> + atomic_i = i << COUNTER_IS_PCPU_BIT;
> + atomic_i &= ~COUNTER_MOD_MASK;
> + atomic_i |= 1ULL << COUNTER_MOD_BITS_START;
> +
> + do {
> + pcpu_v = lazy_percpu_counter_is_pcpu(v);
> + if (pcpu_v) {
> + this_cpu_add(*pcpu_v, i);
> + return;
> + }
> +
> + old = v;
> + } while ((v = atomic64_cmpxchg(&c->v, old, old + atomic_i)) != old);
> +
> + if (unlikely(!(v & COUNTER_MOD_MASK))) {
> + unsigned long now = jiffies;
> +
> + if (c->last_wrap &&
> + unlikely(time_after(c->last_wrap + HZ, now)))
> + lazy_percpu_counter_switch_to_pcpu(c);
> + else
> + c->last_wrap = now;
> + }
> +}
> +EXPORT_SYMBOL(lazy_percpu_counter_add_slowpath);
> +
> +void lazy_percpu_counter_add_slowpath_noupgrade(struct lazy_percpu_counter *c, s64 i)
> +{
> + u64 atomic_i;
> + u64 old, v = atomic64_read(&c->v);
> + u64 __percpu *pcpu_v;
> +
> + atomic_i = i << COUNTER_IS_PCPU_BIT;
> + atomic_i &= ~COUNTER_MOD_MASK;
> +
> + do {
> + pcpu_v = lazy_percpu_counter_is_pcpu(v);
> + if (pcpu_v) {
> + this_cpu_add(*pcpu_v, i);
> + return;
> + }
> +
> + old = v;
> + } while ((v = atomic64_cmpxchg(&c->v, old, old + atomic_i)) != old);
> +}
> +EXPORT_SYMBOL(lazy_percpu_counter_add_slowpath_noupgrade);
These last 2 exported functions could use some comments, preferably in
kernel-doc format.
Thanks.
--
~Randy
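
The bit layout and the upgrade check described in the quoted header can
be tried out in userspace. What follows is only a sketch under stated
assumptions, not the patch's code: every sketch_* name is hypothetical,
a plain uint64_t stands in for atomic64_t (so there is no cmpxchg
loop), and the actual switch to percpu storage is left out. It shows how
one 64-bit word carries the mode flag, the 55-bit value and the 8-bit
modification counter, and how a wrap of that counter is compared against
the time of the previous wrap.

```c
#include <stdint.h>

/* Layout from the quoted header: bit 0 flags percpu mode, the top 8 bits
 * count modifications, the 55 bits in between hold the value. */
#define SKETCH_MOD_BITS		8
#define SKETCH_MOD_MASK		(~(~0ULL >> SKETCH_MOD_BITS))
#define SKETCH_MOD_START	(64 - SKETCH_MOD_BITS)
#define SKETCH_PCPU_BIT		1

/* Hypothetical, non-atomic stand-in for the atomic-mode add. */
static uint64_t sketch_atomic_add(uint64_t word, int64_t i)
{
	/* Shift past the flag bit; do it unsigned so negative i is defined. */
	uint64_t delta = (uint64_t)i << SKETCH_PCPU_BIT;

	delta &= ~SKETCH_MOD_MASK;		/* keep it out of the top byte */
	delta |= 1ULL << SKETCH_MOD_START;	/* bump the modification count */
	return word + delta;
}

/* Decode the value: drop the top byte, then shift back down past the
 * flag bit. Relies on arithmetic right shift of negative values, which
 * gcc and clang provide. */
static int64_t sketch_value(uint64_t word)
{
	return (int64_t)(word << SKETCH_MOD_BITS) >>
		(SKETCH_MOD_BITS + SKETCH_PCPU_BIT);
}

/* Current value of the 8-bit modification counter. */
static unsigned int sketch_mod_count(uint64_t word)
{
	return (unsigned int)(word >> SKETCH_MOD_START);
}

/* Same comparison as time_after(last_wrap + hz, now) in the quoted slow
 * path: true when the previous wrap happened less than hz ticks ago. */
static int sketch_should_upgrade(unsigned long last_wrap, unsigned long now,
				 unsigned long hz)
{
	return last_wrap != 0 && (long)(now - (last_wrap + hz)) < 0;
}
```

With this layout the modification counter returns to zero after 256
positive updates, which is where the "256 per second" threshold in the
commit message comes from: two wraps less than HZ jiffies apart trigger
the switch.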
On Mon, May 01, 2023 at 11:13:15AM -0700, Davidlohr Bueso wrote:
> On Mon, 01 May 2023, Suren Baghdasaryan wrote:
>
> > From: Kent Overstreet <[email protected]>
> >
> > Previously, string_get_size() outputted a space between the number and
> > the units, i.e.
> > 9.88 MiB
> >
> > This changes it to
> > 9.88MiB
> >
> > which allows it to be parsed correctly by the 'sort -h' command.
>
> Wouldn't this break users that already parse it the current way?
It's not impossible - but it's not used in very many places and we
wouldn't be printing in human-readable units if it was meant to be
parsed - it's mainly used for debug output currently.
If someone raises a specific objection we'll do something different,
otherwise I think standardizing on what userspace tooling already parses
is a good idea.
On Mon, May 01, 2023 at 11:14:45AM -0700, Roman Gushchin wrote:
> It's a good idea and I generally think that +25-35% for kmalloc/pgalloc
> should be OK for production use, which is great!
> In reality, most workloads are not that sensitive to the speed of
> memory allocation.
:)
My main takeaway has been "the slub fast path is _really_ fast". No
disabling of preemption, no atomic instructions, just a non locked
double word cmpxchg - it's a slick piece of work.
> > For kmalloc, the overhead is low because after we create the vector of
> > slab_ext objects (which is the same as what memcg_kmem does), memory
> > profiling just increments a lazy counter (which in many cases would be
> > a per-cpu counter).
>
> So does kmem (this is why I'm somewhat surprised by the difference).
>
> > memcg_kmem operates on cgroup hierarchy with
> > additional overhead associated with that. I'm guessing that's the
> > reason for the big difference between these mechanisms, but I didn't
> > look into the details to understand memcg_kmem performance.
>
> I suspect recent rt-related changes and also the wide usage of
> rcu primitives in the kmem code. I'll try to look closer as well.
Happy to give you something to compare against :)
On Mon, May 1, 2023 at 10:36 PM Kent Overstreet
<[email protected]> wrote:
>
> On Mon, May 01, 2023 at 11:13:15AM -0700, Davidlohr Bueso wrote:
> > On Mon, 01 May 2023, Suren Baghdasaryan wrote:
> >
> > > From: Kent Overstreet <[email protected]>
> > >
> > > Previously, string_get_size() outputted a space between the number and
> > > the units, i.e.
> > > 9.88 MiB
> > >
> > > This changes it to
> > > 9.88MiB
> > >
> > > which allows it to be parsed correctly by the 'sort -h' command.
But why do we need that? What's the use case?
> > Wouldn't this break users that already parse it the current way?
>
> It's not impossible - but it's not used in very many places and we
> wouldn't be printing in human-readable units if it was meant to be
> parsed - it's mainly used for debug output currently.
>
> If someone raises a specific objection we'll do something different,
> otherwise I think standardizing on what userspace tooling already parses
> is a good idea.
Yes, I NAK this on the basis of
https://english.stackexchange.com/a/2911/153144
--
With Best Regards,
Andy Shevchenko
On Mon, May 01, 2023 at 10:57:07PM +0300, Andy Shevchenko wrote:
> On Mon, May 1, 2023 at 10:36 PM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Mon, May 01, 2023 at 11:13:15AM -0700, Davidlohr Bueso wrote:
> > > On Mon, 01 May 2023, Suren Baghdasaryan wrote:
> > >
> > > > From: Kent Overstreet <[email protected]>
> > > >
> > > > Previously, string_get_size() outputted a space between the number and
> > > > the units, i.e.
> > > > 9.88 MiB
> > > >
> > > > This changes it to
> > > > 9.88MiB
> > > >
> > > > which allows it to be parsed correctly by the 'sort -h' command.
>
> But why do we need that? What's the use case?
As was in the commit message: to produce output that sort -h knows how
to parse.
> > > Wouldn't this break users that already parse it the current way?
> >
> > It's not impossible - but it's not used in very many places and we
> > wouldn't be printing in human-readable units if it was meant to be
> > parsed - it's mainly used for debug output currently.
> >
> > If someone raises a specific objection we'll do something different,
> > otherwise I think standardizing on what userspace tooling already parses
> > is a good idea.
>
> Yes, I NAK this on the basis of
> https://english.stackexchange.com/a/2911/153144
Not sure I find a style guide on stackexchange more compelling than
interop with a tool everyone already has installed :)
On Mon, May 01, 2023 at 03:37:58PM -0400, Kent Overstreet wrote:
> On Mon, May 01, 2023 at 11:14:45AM -0700, Roman Gushchin wrote:
> > It's a good idea and I generally think that +25-35% for kmalloc/pgalloc
> > should be OK for production use, which is great!
> > In reality, most workloads are not that sensitive to the speed of
> > memory allocation.
>
> :)
>
> My main takeaway has been "the slub fast path is _really_ fast". No
> disabling of preemption, no atomic instructions, just a non locked
> double word cmpxchg - it's a slick piece of work.
>
> > > For kmalloc, the overhead is low because after we create the vector of
> > > slab_ext objects (which is the same as what memcg_kmem does), memory
> > > profiling just increments a lazy counter (which in many cases would be
> > > a per-cpu counter).
> >
> > So does kmem (this is why I'm somewhat surprised by the difference).
> >
> > > memcg_kmem operates on cgroup hierarchy with
> > > additional overhead associated with that. I'm guessing that's the
> > > reason for the big difference between these mechanisms, but I didn't
> > > look into the details to understand memcg_kmem performance.
> >
> > I suspect recent rt-related changes and also the wide usage of
> > rcu primitives in the kmem code. I'll try to look closer as well.
>
> Happy to give you something to compare against :)
To be fair, it's not an apples-to-apples comparison, because:
1) memcgs are organized in a tree, these days usually with at least 3 layers,
2) memcgs are dynamic. In theory a task can be moved to a different
memcg while performing a (very slow) allocation, and the original
memcg can be released. To prevent this we have to perform a lot
of operations which you can happily avoid.
That said, there is clearly room for optimization, so thank you
for indirectly bringing this up.
Thanks!
* Andy Shevchenko <[email protected]> [230501 15:57]:
> On Mon, May 1, 2023 at 10:36 PM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Mon, May 01, 2023 at 11:13:15AM -0700, Davidlohr Bueso wrote:
> > > On Mon, 01 May 2023, Suren Baghdasaryan wrote:
> > >
> > > > From: Kent Overstreet <[email protected]>
> > > >
> > > > Previously, string_get_size() outputted a space between the number and
> > > > the units, i.e.
> > > > 9.88 MiB
> > > >
> > > > This changes it to
> > > > 9.88MiB
> > > >
> > > > which allows it to be parsed correctly by the 'sort -h' command.
>
> But why do we need that? What's the use case?
>
> > > Wouldn't this break users that already parse it the current way?
> >
> > It's not impossible - but it's not used in very many places and we
> > wouldn't be printing in human-readable units if it was meant to be
> > parsed - it's mainly used for debug output currently.
> >
> > If someone raises a specific objection we'll do something different,
> > otherwise I think standardizing on what userspace tooling already parses
> > is a good idea.
>
> Yes, I NAK this on the basis of
> https://english.stackexchange.com/a/2911/153144
This fixes the output to be better aligned with:
the output of ls -sh
the input expected by find -size
Are there counter-examples of commands that follow the SI Brochure?
Thanks,
Liam
On Mon, May 01, 2023 at 05:33:49PM -0400, Liam R. Howlett wrote:
> * Andy Shevchenko <[email protected]> [230501 15:57]:
> This fixes the output to be better aligned with:
> the output of ls -sh
> the input expected by find -size
>
> Are there counter-examples of commands that follow the SI Brochure?
Even perf, which is included in the kernel tree, doesn't include the
space - example perf top output:
0 bcachefs:move_extent_fail
0 bcachefs:move_extent_alloc_mem_fail
3 bcachefs:move_data
0 bcachefs:evacuate_bucket
0 bcachefs:copygc
2 bcachefs:copygc_wait
195K bcachefs:transaction_commit
0 bcachefs:trans_restart_injected
(I'm also going to need to submit a patch that deletes the B suffix or
makes it optional: just because we're using human readable units doesn't
mean it's bytes).
On Mon, May 01, 2023 at 10:57:07PM +0300, Andy Shevchenko wrote:
> But why do we need that? What's the use case?
It looks like we missed you on the initial CC, here's the use case:
https://lore.kernel.org/linux-fsdevel/ZFAsm0XTqC%2F%2Ff4FP@P9FQF9L96D/T/#mdda814a8c569e2214baa31320912b0ef83432fa9
On Mon, 2023-05-01 at 15:35 -0400, Kent Overstreet wrote:
> On Mon, May 01, 2023 at 11:13:15AM -0700, Davidlohr Bueso wrote:
> > On Mon, 01 May 2023, Suren Baghdasaryan wrote:
> >
> > > From: Kent Overstreet <[email protected]>
> > >
> > > Previously, string_get_size() outputted a space between the
> > > number and the units, i.e.
> > > 9.88 MiB
> > >
> > > This changes it to
> > > 9.88MiB
> > >
> > > which allows it to be parsed correctly by the 'sort -h' command.
> >
> > Wouldn't this break users that already parse it the current way?
>
> It's not impossible - but it's not used in very many places and we
> wouldn't be printing in human-readable units if it was meant to be
> parsed - it's mainly used for debug output currently.
It is not used just for debug. It's used all over the kernel for
printing out device sizes. The output mostly goes to the kernel print
buffer, so it's anyone's guess as to what, if any, tools are parsing
it, but the concern about breaking log parsers seems to be a valid one.
> If someone raises a specific objection we'll do something different,
> otherwise I think standardizing on what userspace tooling already
> parses is a good idea.
If you want to omit the space, why not simply add your own variant? A
string_get_size_nospace() which would use most of the body of this one
as a helper function but give its own snprintf format string at the
end. It's only a couple of lines longer as a patch and has the bonus
that it definitely wouldn't break anything by altering an existing
output.
James
On Mon, May 01, 2023 at 10:22:18PM -0400, James Bottomley wrote:
> It is not used just for debug. It's used all over the kernel for
> printing out device sizes. The output mostly goes to the kernel print
> buffer, so it's anyone's guess as to what, if any, tools are parsing
> it, but the concern about breaking log parsers seems to be a valid one.
Ok, there is sd_print_capacity() - but who in their right mind would be
trying to scrape device sizes, in human readable units, from log
messages when it's available in sysfs/procfs (actually, is it in sysfs?
if not, that's an oversight) in more reasonable units?
Correct me if I'm wrong, but I've yet to hear about kernel log messages
being considered a stable interface, and this seems a bit out there.
But, you did write the code :)
> > If someone raises a specific objection we'll do something different,
> > otherwise I think standardizing on what userspace tooling already
> > parses is a good idea.
>
> If you want to omit the space, why not simply add your own variant? A
> string_get_size_nospace() which would use most of the body of this one
> as a helper function but give its own snprintf format string at the
> end. It's only a couple of lines longer as a patch and has the bonus
> that it definitely wouldn't break anything by altering an existing
> output.
I'm happy to do that - I just wanted to post this version first to see
if we can avoid the fragmentation and do a bit of standardizing with
how everything else seems to do that.
On Tue, May 2, 2023 at 6:18 AM Kent Overstreet
<[email protected]> wrote:
> On Mon, May 01, 2023 at 10:22:18PM -0400, James Bottomley wrote:
...
> > > If someone raises a specific objection we'll do something different,
> > > otherwise I think standardizing on what userspace tooling already
> > > parses is a good idea.
> >
> > If you want to omit the space, why not simply add your own variant? A
> > string_get_size_nospace() which would use most of the body of this one
> > as a helper function but give its own snprintf format string at the
> > end. It's only a couple of lines longer as a patch and has the bonus
> > that it definitely wouldn't break anything by altering an existing
> > output.
>
> I'm happy to do that - I just wanted to post this version first to see
> if we can avoid the fragmentation and do a bit of standardizing with
> how everything else seems to do that.
Actually instead of producing zillions of variants, do a %p extension
to the printf() and that's it. We have, for example, %pt with T and
with space to follow users that want one or the other variant. Same
can be done with string_get_size().
--
With Best Regards,
Andy Shevchenko
On Tue, May 02, 2023 at 08:33:57AM +0300, Andy Shevchenko wrote:
> Actually instead of producing zillions of variants, do a %p extension
> to the printf() and that's it. We have, for example, %pt with T and
> with space to follow users that want one or the other variant. Same
> can be done with string_get_size().
God no.
On Mon, 01 May 2023, Suren Baghdasaryan <[email protected]> wrote:
> From: Kent Overstreet <[email protected]>
>
> Previously, string_get_size() outputted a space between the number and
> the units, i.e.
> 9.88 MiB
>
> This changes it to
> 9.88MiB
>
> which allows it to be parsed correctly by the 'sort -h' command.
The former is easier for humans to parse, and that should be
preferred. 'sort -h' is supposed to compare "human readable numbers", so
arguably sort does not do its job here.
BR,
Jani.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Cc: Andy Shevchenko <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: "Michael S. Tsirkin" <[email protected]>
> Cc: Jason Wang <[email protected]>
> Cc: "Noralf Trønnes" <[email protected]>
> Cc: Jens Axboe <[email protected]>
> ---
> lib/string_helpers.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/lib/string_helpers.c b/lib/string_helpers.c
> index 230020a2e076..593b29fece32 100644
> --- a/lib/string_helpers.c
> +++ b/lib/string_helpers.c
> @@ -126,8 +126,7 @@ void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
> else
> unit = units_str[units][i];
>
> - snprintf(buf, len, "%u%s %s", (u32)size,
> - tmp, unit);
> + snprintf(buf, len, "%u%s%s", (u32)size, tmp, unit);
> }
> EXPORT_SYMBOL(string_get_size);
--
Jani Nikula, Intel Open Source Graphics Center
On Mon, 2023-05-01 at 23:17 -0400, Kent Overstreet wrote:
> On Mon, May 01, 2023 at 10:22:18PM -0400, James Bottomley wrote:
> > It is not used just for debug. It's used all over the kernel for
> > printing out device sizes. The output mostly goes to the kernel
> > print buffer, so it's anyone's guess as to what, if any, tools are
> > parsing it, but the concern about breaking log parsers seems to be
> > a valid one.
>
> Ok, there is sd_print_capacity() - but who in their right mind would
> be trying to scrape device sizes, in human readable units,
If you bother to google "kernel log parser", you'll discover it's quite
an active area which supports a load of company business models.
> from log messages when it's available in sysfs/procfs (actually, is
> it in sysfs? if not, that's an oversight) in more reasonable units?
It's not in sysfs, no. As aren't a lot of things, which is why log
parsing for system monitoring is big business.
> Correct me if I'm wrong, but I've yet to hear about kernel log
> messages being considered a stable interface, and this seems a bit out
> there.
It might not be listed as stable, but when it's known there's a large
ecosystem out there consuming it we shouldn't break it just because you
feel like it. You should have a good reason and the break should be
unavoidable. "I wanted my output in a particular form so I thought I'd
change everyone else's output as well" isn't a good reason, and it only
costs a couple of lines to avoid.
> But, you did write the code :)
>
> > > If someone raises a specific objection we'll do something
> > > different, otherwise I think standardizing on what userspace
> > > tooling already parses is a good idea.
> >
> > If you want to omit the space, why not simply add your own
> > variant? A string_get_size_nospace() which would use most of the
> > body of this one as a helper function but give its own snprintf
> > format string at the end. It's only a couple of lines longer as a
> > patch and has the bonus that it definitely wouldn't break anything
> > by altering an existing output.
>
> I'm happy to do that - I just wanted to post this version first to
> see if we can avoid the fragmentation and do a bit of standardizing
> with how everything else seems to do that.
What fragmentation? To do this properly you move the whole of the
current function to a helper which takes a format string, say with a
double underscore prefix, then the existing function and what you want
become one-line additions calling the helper with their specific format
string. There's no fragmentation of the base function at all.
James
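
The helper-plus-wrappers structure described above could look roughly
like the following userspace sketch. It is an illustration under
assumptions, not proposed kernel code: all sketch_* names are
hypothetical, and the helper takes an already scaled number, fraction
string and unit instead of reproducing the unit scaling that
string_get_size() performs before its final snprintf().

```c
#include <stddef.h>
#include <stdio.h>

/* Shared body: the only thing the callers vary is the format string. */
static int sketch__get_size(char *buf, size_t len, const char *fmt,
			    unsigned int whole, const char *frac,
			    const char *unit)
{
	return snprintf(buf, len, fmt, whole, frac, unit);
}

/* Existing behaviour: a space between the number and the unit. */
static int sketch_get_size(char *buf, size_t len, unsigned int whole,
			   const char *frac, const char *unit)
{
	return sketch__get_size(buf, len, "%u%s %s", whole, frac, unit);
}

/* New variant: no space, so the result sorts with 'sort -h'. */
static int sketch_get_size_nospace(char *buf, size_t len, unsigned int whole,
				   const char *frac, const char *unit)
{
	return sketch__get_size(buf, len, "%u%s%s", whole, frac, unit);
}
```

Existing callers keep the output they have today, and only code that
opts into the nospace variant sees the new format.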
On Mon, 1 May 2023 09:54:13 -0700
Suren Baghdasaryan <[email protected]> wrote:
> From: Kent Overstreet <[email protected]>
>
> We're introducing alloc tagging, which tracks memory allocations by
> callsite. Converting alloc_inode_sb() to a macro means allocations will
> be tracked by its caller, which is a bit more useful.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Cc: Alexander Viro <[email protected]>
> ---
> include/linux/fs.h | 6 +-----
> 1 file changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 21a981680856..4905ce14db0b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2699,11 +2699,7 @@ int setattr_should_drop_sgid(struct mnt_idmap *idmap,
> * This must be used for allocating filesystems specific inodes to set
> * up the inode reclaim context correctly.
> */
> -static inline void *
> -alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
> -{
> - return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
> -}
> +#define alloc_inode_sb(_sb, _cache, _gfp) kmem_cache_alloc_lru(_cache, &_sb->s_inode_lru, _gfp)
Honestly, I don't like this change. In general, pre-processor macros
are ugly and error-prone.
Besides, it works for you only because __kmem_cache_alloc_lru() is
declared __always_inline (unless CONFIG_SLUB_TINY is defined, but then
you probably don't want the tracking either). In any case, it's going
to be difficult for people to understand why and how this works.
If the actual caller of alloc_inode_sb() is needed, I'd rather add it
as a parameter and pass down _RET_IP_ explicitly here.
Just my two cents,
Petr T
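
The trade-off discussed above can be illustrated in userspace. This is a
sketch under assumptions rather than the alloc tagging code: the
sketch_* names are hypothetical, __FILE__ and __LINE__ stand in for the
per-callsite tag, and malloc() stands in for the slab allocator. It
contrasts a wrapper written as an inline function (every allocation is
attributed to the wrapper itself), the same wrapper as a macro (the
allocation is attributed to whoever uses it), and a variant in the
spirit of the explicit-parameter suggestion, where the caller's location
is passed down as an argument.

```c
#include <stddef.h>
#include <stdlib.h>

struct sketch_site {
	const char *file;
	int line;
};

/* Last location an allocation was attributed to (stand-in for a tag). */
static struct sketch_site sketch_last_site;

static void *sketch_tagged_alloc(size_t size, const char *file, int line)
{
	sketch_last_site.file = file;
	sketch_last_site.line = line;
	return malloc(size);
}

/* The location is captured wherever this macro expands. */
#define sketch_alloc(size) sketch_tagged_alloc((size), __FILE__, __LINE__)

/* Inline-function wrapper: sketch_alloc() expands here, so all callers
 * of the wrapper share this single location. */
static inline void *sketch_wrapper_fn(size_t size)
{
	return sketch_alloc(size);
}

/* Macro wrapper: sketch_alloc() expands in the caller, so each call
 * site gets its own location. */
#define sketch_wrapper_macro(size) sketch_alloc(size)

/* Explicit variant: the wrapper stays a function and the caller's
 * location travels as ordinary parameters. */
static inline void *sketch_wrapper_explicit(size_t size, const char *file,
					    int line)
{
	return sketch_tagged_alloc(size, file, line);
}
```

The explicit variant keeps the wrapper a real function, at the cost of
touching every caller or hiding the extra arguments behind yet another
macro.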
On Mon, 1 May 2023 09:54:16 -0700
Suren Baghdasaryan <[email protected]> wrote:
> From: Kent Overstreet <[email protected]>
>
> This adds a new helper which is like strsep, except that it skips empty
> tokens.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/string.h | 1 +
> lib/string.c | 19 +++++++++++++++++++
> 2 files changed, 20 insertions(+)
>
> diff --git a/include/linux/string.h b/include/linux/string.h
> index c062c581a98b..6cd5451c262c 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -96,6 +96,7 @@ extern char * strpbrk(const char *,const char *);
> #ifndef __HAVE_ARCH_STRSEP
> extern char * strsep(char **,const char *);
> #endif
> +extern char *strsep_no_empty(char **, const char *);
> #ifndef __HAVE_ARCH_STRSPN
> extern __kernel_size_t strspn(const char *,const char *);
> #endif
> diff --git a/lib/string.c b/lib/string.c
> index 3d55ef890106..dd4914baf45a 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -520,6 +520,25 @@ char *strsep(char **s, const char *ct)
> EXPORT_SYMBOL(strsep);
> #endif
>
> +/**
> + * strsep_no_empt - Split a string into tokens, but don't return empty tokens
^^^^
Typo: strsep_no_empty
Petr T
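
For readers who have not seen the rest of the patch, the behaviour the
commit message describes ("like strsep, except that it skips empty
tokens") can be sketched in userspace as follows. This is an
illustration only: sketch_strsep_no_empty() is a hypothetical name and
the body is written from the description, not copied from the patch.

```c
#include <string.h>

/* Return the next non-empty token from *s, or NULL when none is left.
 * Like strsep(), it writes a NUL over the delimiter and advances *s. */
static char *sketch_strsep_no_empty(char **s, const char *delims)
{
	char *tok, *end;

	if (!*s)
		return NULL;

	/* Skip any run of delimiters, i.e. the empty tokens. */
	tok = *s + strspn(*s, delims);
	if (!*tok) {
		*s = NULL;
		return NULL;
	}

	/* Find the end of the token and terminate it. */
	end = tok + strcspn(tok, delims);
	if (*end) {
		*end = '\0';
		*s = end + 1;
	} else {
		*s = NULL;
	}
	return tok;
}
```

On the input ",a,,b," with "," as the delimiter this yields "a", then
"b", then NULL, whereas plain strsep() would also hand back the empty
strings between the commas.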
On Mon, 1 May 2023 09:54:19 -0700
Suren Baghdasaryan <[email protected]> wrote:
> Introduce __GFP_NO_OBJ_EXT flag in order to prevent recursive allocations
> when allocating slabobj_ext on a slab.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/gfp_types.h | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
> index 6583a58670c5..aab1959130f9 100644
> --- a/include/linux/gfp_types.h
> +++ b/include/linux/gfp_types.h
> @@ -53,8 +53,13 @@ typedef unsigned int __bitwise gfp_t;
> #define ___GFP_SKIP_ZERO 0
> #define ___GFP_SKIP_KASAN 0
> #endif
> +#ifdef CONFIG_SLAB_OBJ_EXT
> +#define ___GFP_NO_OBJ_EXT 0x4000000u
> +#else
> +#define ___GFP_NO_OBJ_EXT 0
> +#endif
> #ifdef CONFIG_LOCKDEP
> -#define ___GFP_NOLOCKDEP 0x4000000u
> +#define ___GFP_NOLOCKDEP 0x8000000u
So now we have two flags that depend on config options, but the first
one is always allocated in fact. I wonder if you could use an enum to
let the compiler allocate bits. Something similar to what Muchun Song
did with section flags.
See commit ed7802dd48f7a507213cbb95bb4c6f1fe134eb5d for reference.
> #else
> #define ___GFP_NOLOCKDEP 0
> #endif
> @@ -99,12 +104,15 @@ typedef unsigned int __bitwise gfp_t;
> * node with no fallbacks or placement policy enforcements.
> *
> * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
> + *
> + * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
> */
> #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
> #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
> #define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
> #define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
> #define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
> +#define __GFP_NO_OBJ_EXT ((__force gfp_t)___GFP_NO_OBJ_EXT)
>
> /**
> * DOC: Watermark modifiers
> @@ -249,7 +257,7 @@ typedef unsigned int __bitwise gfp_t;
> #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
>
> /* Room for N __GFP_FOO bits */
> -#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
> +#define __GFP_BITS_SHIFT (27 + IS_ENABLED(CONFIG_LOCKDEP))
If the above suggestion is implemented, this could be changed to
something like __GFP_LAST_BIT (the enum's last identifier).
Petr T
On Tue, May 2, 2023 at 9:22 AM Kent Overstreet
<[email protected]> wrote:
> On Tue, May 02, 2023 at 08:33:57AM +0300, Andy Shevchenko wrote:
> > Actually instead of producing zillions of variants, do a %p extension
> > to the printf() and that's it. We have, for example, %pt with T and
> > with space to follow users that want one or the other variant. Same
> > can be done with string_get_size().
>
> God no.
Could you elaborate on what's wrong with that?
God no to a zillion APIs that do almost the same thing. Today you want a
space, tomorrow some other (special) delimiter.
--
With Best Regards,
Andy Shevchenko
On Mon, May 01 2023 at 09:54, Suren Baghdasaryan wrote:
> From: Kent Overstreet <[email protected]>
>
> This avoids a circular header dependency in an upcoming patch by only
> making hrtimer.h depend on percpu-defs.h
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
On Mon, 1 May 2023 09:54:29 -0700
Suren Baghdasaryan <[email protected]> wrote:
> After redefining alloc_pages, all uses of that name are being replaced.
> Change the conflicting names to prevent preprocessor from replacing them
> when it's not intended.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> arch/x86/kernel/amd_gart_64.c | 2 +-
> drivers/iommu/dma-iommu.c | 2 +-
> drivers/xen/grant-dma-ops.c | 2 +-
> drivers/xen/swiotlb-xen.c | 2 +-
> include/linux/dma-map-ops.h | 2 +-
> kernel/dma/mapping.c | 4 ++--
> 6 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
> index 56a917df410d..842a0ec5eaa9 100644
> --- a/arch/x86/kernel/amd_gart_64.c
> +++ b/arch/x86/kernel/amd_gart_64.c
> @@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
> .get_sgtable = dma_common_get_sgtable,
> .dma_supported = dma_direct_supported,
> .get_required_mask = dma_direct_get_required_mask,
> - .alloc_pages = dma_direct_alloc_pages,
> + .alloc_pages_op = dma_direct_alloc_pages,
> .free_pages = dma_direct_free_pages,
> };
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 7a9f0b0bddbd..76a9d5ca4eee 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1556,7 +1556,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
> .alloc = iommu_dma_alloc,
> .free = iommu_dma_free,
> - .alloc_pages = dma_common_alloc_pages,
> + .alloc_pages_op = dma_common_alloc_pages,
> .free_pages = dma_common_free_pages,
> .alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
> .free_noncontiguous = iommu_dma_free_noncontiguous,
> diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
> index 9784a77fa3c9..6c7d984f164d 100644
> --- a/drivers/xen/grant-dma-ops.c
> +++ b/drivers/xen/grant-dma-ops.c
> @@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
> static const struct dma_map_ops xen_grant_dma_ops = {
> .alloc = xen_grant_dma_alloc,
> .free = xen_grant_dma_free,
> - .alloc_pages = xen_grant_dma_alloc_pages,
> + .alloc_pages_op = xen_grant_dma_alloc_pages,
> .free_pages = xen_grant_dma_free_pages,
> .mmap = dma_common_mmap,
> .get_sgtable = dma_common_get_sgtable,
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index 67aa74d20162..5ab2616153f0 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -403,6 +403,6 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
> .dma_supported = xen_swiotlb_dma_supported,
> .mmap = dma_common_mmap,
> .get_sgtable = dma_common_get_sgtable,
> - .alloc_pages = dma_common_alloc_pages,
> + .alloc_pages_op = dma_common_alloc_pages,
> .free_pages = dma_common_free_pages,
> };
> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> index 31f114f486c4..d741940dcb3b 100644
> --- a/include/linux/dma-map-ops.h
> +++ b/include/linux/dma-map-ops.h
> @@ -27,7 +27,7 @@ struct dma_map_ops {
> unsigned long attrs);
> void (*free)(struct device *dev, size_t size, void *vaddr,
> dma_addr_t dma_handle, unsigned long attrs);
> - struct page *(*alloc_pages)(struct device *dev, size_t size,
> + struct page *(*alloc_pages_op)(struct device *dev, size_t size,
> dma_addr_t *dma_handle, enum dma_data_direction dir,
> gfp_t gfp);
> void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index 9a4db5cce600..fc42930af14b 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
> size = PAGE_ALIGN(size);
> if (dma_alloc_direct(dev, ops))
> return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
> - if (!ops->alloc_pages)
> + if (!ops->alloc_pages_op)
> return NULL;
> - return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
> + return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
> }
>
> struct page *dma_alloc_pages(struct device *dev, size_t size,
I'm not impressed. This patch increases churn for code which does not
(directly) benefit from the change, and that for limitations in your
tooling?
Why not just rename the conflicting uses in your local tree, but then
remove the rename from the final patch series?
Just my two cents,
Petr T
On Tue, May 2, 2023 at 5:50 AM Petr Tesařík <[email protected]> wrote:
>
> On Mon, 1 May 2023 09:54:19 -0700
> Suren Baghdasaryan <[email protected]> wrote:
>
> > Introduce __GFP_NO_OBJ_EXT flag in order to prevent recursive allocations
> > when allocating slabobj_ext on a slab.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/gfp_types.h | 12 ++++++++++--
> > 1 file changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
> > index 6583a58670c5..aab1959130f9 100644
> > --- a/include/linux/gfp_types.h
> > +++ b/include/linux/gfp_types.h
> > @@ -53,8 +53,13 @@ typedef unsigned int __bitwise gfp_t;
> > #define ___GFP_SKIP_ZERO 0
> > #define ___GFP_SKIP_KASAN 0
> > #endif
> > +#ifdef CONFIG_SLAB_OBJ_EXT
> > +#define ___GFP_NO_OBJ_EXT 0x4000000u
> > +#else
> > +#define ___GFP_NO_OBJ_EXT 0
> > +#endif
> > #ifdef CONFIG_LOCKDEP
> > -#define ___GFP_NOLOCKDEP 0x4000000u
> > +#define ___GFP_NOLOCKDEP 0x8000000u
>
> So now we have two flags that depend on config options, but the first
> one is always allocated in fact. I wonder if you could use an enum to
> let the compiler allocate bits. Something similar to what Muchun Song
> did with section flags.
>
> See commit ed7802dd48f7a507213cbb95bb4c6f1fe134eb5d for reference.
Thanks for the reference. I'll take a closer look and will try to clean it up.
>
> > #else
> > #define ___GFP_NOLOCKDEP 0
> > #endif
> > @@ -99,12 +104,15 @@ typedef unsigned int __bitwise gfp_t;
> > * node with no fallbacks or placement policy enforcements.
> > *
> > * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
> > + *
> > + * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
> > */
> > #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
> > #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
> > #define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
> > #define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
> > #define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
> > +#define __GFP_NO_OBJ_EXT ((__force gfp_t)___GFP_NO_OBJ_EXT)
> >
> > /**
> > * DOC: Watermark modifiers
> > @@ -249,7 +257,7 @@ typedef unsigned int __bitwise gfp_t;
> > #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
> >
> > /* Room for N __GFP_FOO bits */
> > -#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
> > +#define __GFP_BITS_SHIFT (27 + IS_ENABLED(CONFIG_LOCKDEP))
>
> If the above suggestion is implemented, this could be changed to
> something like __GFP_LAST_BIT (the enum's last identifier).
Ack.
Thanks for reviewing!
Suren.
>
> Petr T
On Tue, May 2, 2023 at 8:50 AM Petr Tesařík <[email protected]> wrote:
>
> On Mon, 1 May 2023 09:54:29 -0700
> Suren Baghdasaryan <[email protected]> wrote:
>
> > After redefining alloc_pages, all uses of that name are being replaced.
> > Change the conflicting names to prevent preprocessor from replacing them
> > when it's not intended.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > arch/x86/kernel/amd_gart_64.c | 2 +-
> > drivers/iommu/dma-iommu.c | 2 +-
> > drivers/xen/grant-dma-ops.c | 2 +-
> > drivers/xen/swiotlb-xen.c | 2 +-
> > include/linux/dma-map-ops.h | 2 +-
> > kernel/dma/mapping.c | 4 ++--
> > 6 files changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
> > index 56a917df410d..842a0ec5eaa9 100644
> > --- a/arch/x86/kernel/amd_gart_64.c
> > +++ b/arch/x86/kernel/amd_gart_64.c
> > @@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
> > .get_sgtable = dma_common_get_sgtable,
> > .dma_supported = dma_direct_supported,
> > .get_required_mask = dma_direct_get_required_mask,
> > - .alloc_pages = dma_direct_alloc_pages,
> > + .alloc_pages_op = dma_direct_alloc_pages,
> > .free_pages = dma_direct_free_pages,
> > };
> >
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index 7a9f0b0bddbd..76a9d5ca4eee 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -1556,7 +1556,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> > .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
> > .alloc = iommu_dma_alloc,
> > .free = iommu_dma_free,
> > - .alloc_pages = dma_common_alloc_pages,
> > + .alloc_pages_op = dma_common_alloc_pages,
> > .free_pages = dma_common_free_pages,
> > .alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
> > .free_noncontiguous = iommu_dma_free_noncontiguous,
> > diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
> > index 9784a77fa3c9..6c7d984f164d 100644
> > --- a/drivers/xen/grant-dma-ops.c
> > +++ b/drivers/xen/grant-dma-ops.c
> > @@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
> > static const struct dma_map_ops xen_grant_dma_ops = {
> > .alloc = xen_grant_dma_alloc,
> > .free = xen_grant_dma_free,
> > - .alloc_pages = xen_grant_dma_alloc_pages,
> > + .alloc_pages_op = xen_grant_dma_alloc_pages,
> > .free_pages = xen_grant_dma_free_pages,
> > .mmap = dma_common_mmap,
> > .get_sgtable = dma_common_get_sgtable,
> > diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> > index 67aa74d20162..5ab2616153f0 100644
> > --- a/drivers/xen/swiotlb-xen.c
> > +++ b/drivers/xen/swiotlb-xen.c
> > @@ -403,6 +403,6 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
> > .dma_supported = xen_swiotlb_dma_supported,
> > .mmap = dma_common_mmap,
> > .get_sgtable = dma_common_get_sgtable,
> > - .alloc_pages = dma_common_alloc_pages,
> > + .alloc_pages_op = dma_common_alloc_pages,
> > .free_pages = dma_common_free_pages,
> > };
> > diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> > index 31f114f486c4..d741940dcb3b 100644
> > --- a/include/linux/dma-map-ops.h
> > +++ b/include/linux/dma-map-ops.h
> > @@ -27,7 +27,7 @@ struct dma_map_ops {
> > unsigned long attrs);
> > void (*free)(struct device *dev, size_t size, void *vaddr,
> > dma_addr_t dma_handle, unsigned long attrs);
> > - struct page *(*alloc_pages)(struct device *dev, size_t size,
> > + struct page *(*alloc_pages_op)(struct device *dev, size_t size,
> > dma_addr_t *dma_handle, enum dma_data_direction dir,
> > gfp_t gfp);
> > void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
> > diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> > index 9a4db5cce600..fc42930af14b 100644
> > --- a/kernel/dma/mapping.c
> > +++ b/kernel/dma/mapping.c
> > @@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
> > size = PAGE_ALIGN(size);
> > if (dma_alloc_direct(dev, ops))
> > return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
> > - if (!ops->alloc_pages)
> > + if (!ops->alloc_pages_op)
> > return NULL;
> > - return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
> > + return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
> > }
> >
> > struct page *dma_alloc_pages(struct device *dev, size_t size,
>
> I'm not impressed. This patch increases churn for code which does not
> (directly) benefit from the change, and that for limitations in your
> tooling?
>
> Why not just rename the conflicting uses in your local tree, but then
> remove the rename from the final patch series?
With alloc_pages function becoming a macro, the preprocessor ends up
replacing all instances of that name, even when it's not used as a
function. That is what necessitates this change. If there is a way to
work around this issue without changing all alloc_pages() calls in the
source base I would love to learn it but I'm not quite clear about
your suggestion and if it solves the issue. Could you please provide
more details?
>
> Just my two cents,
> Petr T
>
On Tue, May 02, 2023 at 02:35:30PM +0200, Petr Tesařík wrote:
> On Mon, 1 May 2023 09:54:13 -0700
> Suren Baghdasaryan <[email protected]> wrote:
>
> > From: Kent Overstreet <[email protected]>
> >
> > We're introducing alloc tagging, which tracks memory allocations by
> > callsite. Converting alloc_inode_sb() to a macro means allocations will
> > be tracked by its caller, which is a bit more useful.
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Cc: Alexander Viro <[email protected]>
> > ---
> > include/linux/fs.h | 6 +-----
> > 1 file changed, 1 insertion(+), 5 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 21a981680856..4905ce14db0b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2699,11 +2699,7 @@ int setattr_should_drop_sgid(struct mnt_idmap *idmap,
> > * This must be used for allocating filesystems specific inodes to set
> > * up the inode reclaim context correctly.
> > */
> > -static inline void *
> > -alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
> > -{
> > - return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
> > -}
> > +#define alloc_inode_sb(_sb, _cache, _gfp) kmem_cache_alloc_lru(_cache, &_sb->s_inode_lru, _gfp)
>
> Honestly, I don't like this change. In general, pre-processor macros
> are ugly and error-prone.
It's a one line macro, it's fine.
> Besides, it works for you only because __kmem_cache_alloc_lru() is
> declared __always_inline (unless CONFIG_SLUB_TINY is defined, but then
> you probably don't want the tracking either). In any case, it's going
> to be difficult for people to understand why and how this works.
I think you must be confused. kmem_cache_alloc_lru() is a macro, and we
need that macro to be expanded at the alloc_inode_sb() callsite. It's
got nothing to do with whether or not __kmem_cache_alloc_lru() is inline
or not.
> If the actual caller of alloc_inode_sb() is needed, I'd rather add it
> as a parameter and pass down _RET_IP_ explicitly here.
That approach was considered, but adding an ip parameter to every memory
allocation function would've been far more churn.
On Tue, 2 May 2023 11:38:49 -0700
Suren Baghdasaryan <[email protected]> wrote:
> On Tue, May 2, 2023 at 8:50 AM Petr Tesařík <[email protected]> wrote:
> >
> > On Mon, 1 May 2023 09:54:29 -0700
> > Suren Baghdasaryan <[email protected]> wrote:
> >
> > > After redefining alloc_pages, all uses of that name are being replaced.
> > > Change the conflicting names to prevent preprocessor from replacing them
> > > when it's not intended.
> > >
> > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > ---
> > > arch/x86/kernel/amd_gart_64.c | 2 +-
> > > drivers/iommu/dma-iommu.c | 2 +-
> > > drivers/xen/grant-dma-ops.c | 2 +-
> > > drivers/xen/swiotlb-xen.c | 2 +-
> > > include/linux/dma-map-ops.h | 2 +-
> > > kernel/dma/mapping.c | 4 ++--
> > > 6 files changed, 7 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
> > > index 56a917df410d..842a0ec5eaa9 100644
> > > --- a/arch/x86/kernel/amd_gart_64.c
> > > +++ b/arch/x86/kernel/amd_gart_64.c
> > > @@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
> > > .get_sgtable = dma_common_get_sgtable,
> > > .dma_supported = dma_direct_supported,
> > > .get_required_mask = dma_direct_get_required_mask,
> > > - .alloc_pages = dma_direct_alloc_pages,
> > > + .alloc_pages_op = dma_direct_alloc_pages,
> > > .free_pages = dma_direct_free_pages,
> > > };
> > >
> > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > > index 7a9f0b0bddbd..76a9d5ca4eee 100644
> > > --- a/drivers/iommu/dma-iommu.c
> > > +++ b/drivers/iommu/dma-iommu.c
> > > @@ -1556,7 +1556,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> > > .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
> > > .alloc = iommu_dma_alloc,
> > > .free = iommu_dma_free,
> > > - .alloc_pages = dma_common_alloc_pages,
> > > + .alloc_pages_op = dma_common_alloc_pages,
> > > .free_pages = dma_common_free_pages,
> > > .alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
> > > .free_noncontiguous = iommu_dma_free_noncontiguous,
> > > diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
> > > index 9784a77fa3c9..6c7d984f164d 100644
> > > --- a/drivers/xen/grant-dma-ops.c
> > > +++ b/drivers/xen/grant-dma-ops.c
> > > @@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
> > > static const struct dma_map_ops xen_grant_dma_ops = {
> > > .alloc = xen_grant_dma_alloc,
> > > .free = xen_grant_dma_free,
> > > - .alloc_pages = xen_grant_dma_alloc_pages,
> > > + .alloc_pages_op = xen_grant_dma_alloc_pages,
> > > .free_pages = xen_grant_dma_free_pages,
> > > .mmap = dma_common_mmap,
> > > .get_sgtable = dma_common_get_sgtable,
> > > diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> > > index 67aa74d20162..5ab2616153f0 100644
> > > --- a/drivers/xen/swiotlb-xen.c
> > > +++ b/drivers/xen/swiotlb-xen.c
> > > @@ -403,6 +403,6 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
> > > .dma_supported = xen_swiotlb_dma_supported,
> > > .mmap = dma_common_mmap,
> > > .get_sgtable = dma_common_get_sgtable,
> > > - .alloc_pages = dma_common_alloc_pages,
> > > + .alloc_pages_op = dma_common_alloc_pages,
> > > .free_pages = dma_common_free_pages,
> > > };
> > > diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> > > index 31f114f486c4..d741940dcb3b 100644
> > > --- a/include/linux/dma-map-ops.h
> > > +++ b/include/linux/dma-map-ops.h
> > > @@ -27,7 +27,7 @@ struct dma_map_ops {
> > > unsigned long attrs);
> > > void (*free)(struct device *dev, size_t size, void *vaddr,
> > > dma_addr_t dma_handle, unsigned long attrs);
> > > - struct page *(*alloc_pages)(struct device *dev, size_t size,
> > > + struct page *(*alloc_pages_op)(struct device *dev, size_t size,
> > > dma_addr_t *dma_handle, enum dma_data_direction dir,
> > > gfp_t gfp);
> > > void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
> > > diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> > > index 9a4db5cce600..fc42930af14b 100644
> > > --- a/kernel/dma/mapping.c
> > > +++ b/kernel/dma/mapping.c
> > > @@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
> > > size = PAGE_ALIGN(size);
> > > if (dma_alloc_direct(dev, ops))
> > > return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
> > > - if (!ops->alloc_pages)
> > > + if (!ops->alloc_pages_op)
> > > return NULL;
> > > - return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
> > > + return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
> > > }
> > >
> > > struct page *dma_alloc_pages(struct device *dev, size_t size,
> >
> > I'm not impressed. This patch increases churn for code which does not
> > (directly) benefit from the change, and that for limitations in your
> > tooling?
> >
> > Why not just rename the conflicting uses in your local tree, but then
> > remove the rename from the final patch series?
>
> With alloc_pages function becoming a macro, the preprocessor ends up
> replacing all instances of that name, even when it's not used as a
> function. That is what necessitates this change. If there is a way to
> work around this issue without changing all alloc_pages() calls in the
> source base I would love to learn it but I'm not quite clear about
> your suggestion and if it solves the issue. Could you please provide
> more details?
Ah, right, I admit I did not quite understand why this change is
needed. However, this is exactly what I don't like about preprocessor
macros. Each macro effectively adds a new keyword to the language.
I believe everything can be solved with inline functions. What exactly
does not work if you rename alloc_pages() to e.g. alloc_pages_caller()
and then add an alloc_pages() inline function which calls
alloc_pages_caller() with _RET_IP_ as a parameter?
Petr T
On Tue, 2 May 2023 15:57:51 -0400
Kent Overstreet <[email protected]> wrote:
> On Tue, May 02, 2023 at 02:35:30PM +0200, Petr Tesařík wrote:
> > On Mon, 1 May 2023 09:54:13 -0700
> > Suren Baghdasaryan <[email protected]> wrote:
> >
> > > From: Kent Overstreet <[email protected]>
> > >
> > > We're introducing alloc tagging, which tracks memory allocations by
> > > callsite. Converting alloc_inode_sb() to a macro means allocations will
> > > be tracked by its caller, which is a bit more useful.
> > >
> > > Signed-off-by: Kent Overstreet <[email protected]>
> > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > Cc: Alexander Viro <[email protected]>
> > > ---
> > > include/linux/fs.h | 6 +-----
> > > 1 file changed, 1 insertion(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 21a981680856..4905ce14db0b 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -2699,11 +2699,7 @@ int setattr_should_drop_sgid(struct mnt_idmap *idmap,
> > > * This must be used for allocating filesystems specific inodes to set
> > > * up the inode reclaim context correctly.
> > > */
> > > -static inline void *
> > > -alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
> > > -{
> > > - return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
> > > -}
> > > +#define alloc_inode_sb(_sb, _cache, _gfp) kmem_cache_alloc_lru(_cache, &_sb->s_inode_lru, _gfp)
> >
> > Honestly, I don't like this change. In general, pre-processor macros
> > are ugly and error-prone.
>
> It's a one line macro, it's fine.
It's not the same. A macro effectively adds a keyword, because it gets
expanded regardless of context; for example, you can't declare a local
variable called alloc_inode_sb, and the compiler errors may be quite
confusing at first. See also the discussion about patch 19/40 in this
series.
> > Besides, it works for you only because __kmem_cache_alloc_lru() is
> > declared __always_inline (unless CONFIG_SLUB_TINY is defined, but then
> > you probably don't want the tracking either). In any case, it's going
> > to be difficult for people to understand why and how this works.
>
> I think you must be confused. kmem_cache_alloc_lru() is a macro, and we
> need that macro to be expanded at the alloc_inode_sb() callsite. It's
> got nothing to do with whether or not __kmem_cache_alloc_lru() is inline
> or not.
Oh no, I am not confused. Look at the definition of
kmem_cache_alloc_lru():
void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
			   gfp_t gfpflags)
{
	return __kmem_cache_alloc_lru(s, lru, gfpflags);
}
See? No _RET_IP_ here. That's because it's here:
static __fastpath_inline
void *__kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
			     gfp_t gfpflags)
{
	void *ret = slab_alloc(s, lru, gfpflags, _RET_IP_, s->object_size);

	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);

	return ret;
}
Now, if __kmem_cache_alloc_lru() is not inlined, then this _RET_IP_
will be somewhere inside kmem_cache_alloc_lru(), which is not very
useful.
But what is __fastpath_inline? Well, it depends:
#ifndef CONFIG_SLUB_TINY
#define __fastpath_inline __always_inline
#else
#define __fastpath_inline
#endif
In short, if CONFIG_SLUB_TINY is defined, it's up to the C compiler
whether __kmem_cache_alloc_lru() is inlined or not.
> > If the actual caller of alloc_inode_sb() is needed, I'd rather add it
> > as a parameter and pass down _RET_IP_ explicitly here.
>
> That approach was considered, but adding an ip parameter to every memory
> allocation function would've been far more churn.
See my reply to patch 19/40. Rename the original function, but add an
__always_inline function with the original signature, and let it take
care of _RET_IP_.
Petr T
On Tue, May 02, 2023 at 10:09:09PM +0200, Petr Tesařík wrote:
> Ah, right, I admit I did not quite understand why this change is
> needed. However, this is exactly what I don't like about preprocessor
> macros. Each macro effectively adds a new keyword to the language.
>
> I believe everything can be solved with inline functions. What exactly
> does not work if you rename alloc_pages() to e.g. alloc_pages_caller()
> and then add an alloc_pages() inline function which calls
> alloc_pages_caller() with _RET_IP_ as a parameter?
Perhaps you should spend a little more time reading the patchset and
learning how the code works before commenting.
On Tue, May 2, 2023 at 1:09 PM Petr Tesařík <[email protected]> wrote:
>
> On Tue, 2 May 2023 11:38:49 -0700
> Suren Baghdasaryan <[email protected]> wrote:
>
> > On Tue, May 2, 2023 at 8:50 AM Petr Tesařík <[email protected]> wrote:
> > >
> > > On Mon, 1 May 2023 09:54:29 -0700
> > > Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > > After redefining alloc_pages, all uses of that name are being replaced.
> > > > Change the conflicting names to prevent preprocessor from replacing them
> > > > when it's not intended.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > ---
> > > > arch/x86/kernel/amd_gart_64.c | 2 +-
> > > > drivers/iommu/dma-iommu.c | 2 +-
> > > > drivers/xen/grant-dma-ops.c | 2 +-
> > > > drivers/xen/swiotlb-xen.c | 2 +-
> > > > include/linux/dma-map-ops.h | 2 +-
> > > > kernel/dma/mapping.c | 4 ++--
> > > > 6 files changed, 7 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
> > > > index 56a917df410d..842a0ec5eaa9 100644
> > > > --- a/arch/x86/kernel/amd_gart_64.c
> > > > +++ b/arch/x86/kernel/amd_gart_64.c
> > > > @@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
> > > > .get_sgtable = dma_common_get_sgtable,
> > > > .dma_supported = dma_direct_supported,
> > > > .get_required_mask = dma_direct_get_required_mask,
> > > > - .alloc_pages = dma_direct_alloc_pages,
> > > > + .alloc_pages_op = dma_direct_alloc_pages,
> > > > .free_pages = dma_direct_free_pages,
> > > > };
> > > >
> > > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > > > index 7a9f0b0bddbd..76a9d5ca4eee 100644
> > > > --- a/drivers/iommu/dma-iommu.c
> > > > +++ b/drivers/iommu/dma-iommu.c
> > > > @@ -1556,7 +1556,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> > > > .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
> > > > .alloc = iommu_dma_alloc,
> > > > .free = iommu_dma_free,
> > > > - .alloc_pages = dma_common_alloc_pages,
> > > > + .alloc_pages_op = dma_common_alloc_pages,
> > > > .free_pages = dma_common_free_pages,
> > > > .alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
> > > > .free_noncontiguous = iommu_dma_free_noncontiguous,
> > > > diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
> > > > index 9784a77fa3c9..6c7d984f164d 100644
> > > > --- a/drivers/xen/grant-dma-ops.c
> > > > +++ b/drivers/xen/grant-dma-ops.c
> > > > @@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
> > > > static const struct dma_map_ops xen_grant_dma_ops = {
> > > > .alloc = xen_grant_dma_alloc,
> > > > .free = xen_grant_dma_free,
> > > > - .alloc_pages = xen_grant_dma_alloc_pages,
> > > > + .alloc_pages_op = xen_grant_dma_alloc_pages,
> > > > .free_pages = xen_grant_dma_free_pages,
> > > > .mmap = dma_common_mmap,
> > > > .get_sgtable = dma_common_get_sgtable,
> > > > diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> > > > index 67aa74d20162..5ab2616153f0 100644
> > > > --- a/drivers/xen/swiotlb-xen.c
> > > > +++ b/drivers/xen/swiotlb-xen.c
> > > > @@ -403,6 +403,6 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
> > > > .dma_supported = xen_swiotlb_dma_supported,
> > > > .mmap = dma_common_mmap,
> > > > .get_sgtable = dma_common_get_sgtable,
> > > > - .alloc_pages = dma_common_alloc_pages,
> > > > + .alloc_pages_op = dma_common_alloc_pages,
> > > > .free_pages = dma_common_free_pages,
> > > > };
> > > > diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> > > > index 31f114f486c4..d741940dcb3b 100644
> > > > --- a/include/linux/dma-map-ops.h
> > > > +++ b/include/linux/dma-map-ops.h
> > > > @@ -27,7 +27,7 @@ struct dma_map_ops {
> > > > unsigned long attrs);
> > > > void (*free)(struct device *dev, size_t size, void *vaddr,
> > > > dma_addr_t dma_handle, unsigned long attrs);
> > > > - struct page *(*alloc_pages)(struct device *dev, size_t size,
> > > > + struct page *(*alloc_pages_op)(struct device *dev, size_t size,
> > > > dma_addr_t *dma_handle, enum dma_data_direction dir,
> > > > gfp_t gfp);
> > > > void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
> > > > diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> > > > index 9a4db5cce600..fc42930af14b 100644
> > > > --- a/kernel/dma/mapping.c
> > > > +++ b/kernel/dma/mapping.c
> > > > @@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
> > > > size = PAGE_ALIGN(size);
> > > > if (dma_alloc_direct(dev, ops))
> > > > return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
> > > > - if (!ops->alloc_pages)
> > > > + if (!ops->alloc_pages_op)
> > > > return NULL;
> > > > - return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
> > > > + return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
> > > > }
> > > >
> > > > struct page *dma_alloc_pages(struct device *dev, size_t size,
> > >
> > > I'm not impressed. This patch increases churn for code which does not
> > > (directly) benefit from the change, and that for limitations in your
> > > tooling?
> > >
> > > Why not just rename the conflicting uses in your local tree, but then
> > > remove the rename from the final patch series?
> >
> > With alloc_pages function becoming a macro, the preprocessor ends up
> > replacing all instances of that name, even when it's not used as a
> > function. That what necessitates this change. If there is a way to
> > work around this issue without changing all alloc_pages() calls in the
> > source base I would love to learn it but I'm not quite clear about
> > your suggestion and if it solves the issue. Could you please provide
> > more details?
>
> Ah, right, I admit I did not quite understand why this change is
> needed. However, this is exactly what I don't like about preprocessor
> macros. Each macro effectively adds a new keyword to the language.
>
> I believe everything can be solved with inline functions. What exactly
> does not work if you rename alloc_pages() to e.g. alloc_pages_caller()
> and then add an alloc_pages() inline function which calls
> alloc_pages_caller() with _RET_IP_ as a parameter?
I don't think that would work because we need to inject the codetag at
the file/line of the actual allocation call. If we pass _RET_IP_ then
we would have to look up the codetag associated with that _RET_IP_,
which results in additional runtime overhead.
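A minimal userspace sketch of that distinction, not the patchset's code. All names here (codetag_sketch, alloc_tagged, alloc_sketch, alloc_inline, last_tag) are hypothetical, and malloc() stands in for the real allocator. The macro emits one static tag per expansion, so the tag is known at the call site without any lookup. An inline wrapper can only own a single tag, which describes the wrapper itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a code tag: one static object per call site. */
struct codetag_sketch {
	const char *file;
	int line;
	size_t bytes;		/* total bytes charged to this call site */
};

/* Exposed only so the sketch can be inspected from a caller or a test. */
static struct codetag_sketch *last_tag;

/* The tag arrives as an argument, so no lookup is needed at runtime. */
static void *alloc_tagged(struct codetag_sketch *tag, size_t size)
{
	void *p;

	assert(tag != NULL);
	p = malloc(size);
	if (p)
		tag->bytes += size;
	last_tag = tag;
	return p;
}

/*
 * Macro wrapper (GNU statement expression): the static tag is emitted at
 * each expansion, so __FILE__/__LINE__ name the caller's location.
 */
#define alloc_sketch(size)						\
({									\
	static struct codetag_sketch _tag = { __FILE__, __LINE__, 0 };	\
	alloc_tagged(&_tag, (size));					\
})

/*
 * Inline-function wrapper: there is only one static tag, and it names
 * this line in the wrapper, whoever the caller is.
 */
static inline void *alloc_inline(size_t size)
{
	static struct codetag_sketch tag = { __FILE__, __LINE__, 0 };

	return alloc_tagged(&tag, size);
}
```

Two alloc_sketch() calls in different places charge two different tags, while every alloc_inline() caller charges the same one. Telling those callers apart would need something like a return-address lookup at runtime.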
>
> Petr T
On Tue, 2 May 2023 13:24:37 -0700
Suren Baghdasaryan <[email protected]> wrote:
> On Tue, May 2, 2023 at 1:09 PM Petr Tesařík <[email protected]> wrote:
> >
> > On Tue, 2 May 2023 11:38:49 -0700
> > Suren Baghdasaryan <[email protected]> wrote:
> >
> > > On Tue, May 2, 2023 at 8:50 AM Petr Tesařík <[email protected]> wrote:
> > > >
> > > > On Mon, 1 May 2023 09:54:29 -0700
> > > > Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > > After redefining alloc_pages, all uses of that name are being replaced.
> > > > > Change the conflicting names to prevent preprocessor from replacing them
> > > > > when it's not intended.
> > > > >
> > > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > > ---
> > > > > arch/x86/kernel/amd_gart_64.c | 2 +-
> > > > > drivers/iommu/dma-iommu.c | 2 +-
> > > > > drivers/xen/grant-dma-ops.c | 2 +-
> > > > > drivers/xen/swiotlb-xen.c | 2 +-
> > > > > include/linux/dma-map-ops.h | 2 +-
> > > > > kernel/dma/mapping.c | 4 ++--
> > > > > 6 files changed, 7 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
> > > > > index 56a917df410d..842a0ec5eaa9 100644
> > > > > --- a/arch/x86/kernel/amd_gart_64.c
> > > > > +++ b/arch/x86/kernel/amd_gart_64.c
> > > > > @@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
> > > > > .get_sgtable = dma_common_get_sgtable,
> > > > > .dma_supported = dma_direct_supported,
> > > > > .get_required_mask = dma_direct_get_required_mask,
> > > > > - .alloc_pages = dma_direct_alloc_pages,
> > > > > + .alloc_pages_op = dma_direct_alloc_pages,
> > > > > .free_pages = dma_direct_free_pages,
> > > > > };
> > > > >
> > > > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > > > > index 7a9f0b0bddbd..76a9d5ca4eee 100644
> > > > > --- a/drivers/iommu/dma-iommu.c
> > > > > +++ b/drivers/iommu/dma-iommu.c
> > > > > @@ -1556,7 +1556,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> > > > > .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
> > > > > .alloc = iommu_dma_alloc,
> > > > > .free = iommu_dma_free,
> > > > > - .alloc_pages = dma_common_alloc_pages,
> > > > > + .alloc_pages_op = dma_common_alloc_pages,
> > > > > .free_pages = dma_common_free_pages,
> > > > > .alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
> > > > > .free_noncontiguous = iommu_dma_free_noncontiguous,
> > > > > diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
> > > > > index 9784a77fa3c9..6c7d984f164d 100644
> > > > > --- a/drivers/xen/grant-dma-ops.c
> > > > > +++ b/drivers/xen/grant-dma-ops.c
> > > > > @@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
> > > > > static const struct dma_map_ops xen_grant_dma_ops = {
> > > > > .alloc = xen_grant_dma_alloc,
> > > > > .free = xen_grant_dma_free,
> > > > > - .alloc_pages = xen_grant_dma_alloc_pages,
> > > > > + .alloc_pages_op = xen_grant_dma_alloc_pages,
> > > > > .free_pages = xen_grant_dma_free_pages,
> > > > > .mmap = dma_common_mmap,
> > > > > .get_sgtable = dma_common_get_sgtable,
> > > > > diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> > > > > index 67aa74d20162..5ab2616153f0 100644
> > > > > --- a/drivers/xen/swiotlb-xen.c
> > > > > +++ b/drivers/xen/swiotlb-xen.c
> > > > > @@ -403,6 +403,6 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
> > > > > .dma_supported = xen_swiotlb_dma_supported,
> > > > > .mmap = dma_common_mmap,
> > > > > .get_sgtable = dma_common_get_sgtable,
> > > > > - .alloc_pages = dma_common_alloc_pages,
> > > > > + .alloc_pages_op = dma_common_alloc_pages,
> > > > > .free_pages = dma_common_free_pages,
> > > > > };
> > > > > diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> > > > > index 31f114f486c4..d741940dcb3b 100644
> > > > > --- a/include/linux/dma-map-ops.h
> > > > > +++ b/include/linux/dma-map-ops.h
> > > > > @@ -27,7 +27,7 @@ struct dma_map_ops {
> > > > > unsigned long attrs);
> > > > > void (*free)(struct device *dev, size_t size, void *vaddr,
> > > > > dma_addr_t dma_handle, unsigned long attrs);
> > > > > - struct page *(*alloc_pages)(struct device *dev, size_t size,
> > > > > + struct page *(*alloc_pages_op)(struct device *dev, size_t size,
> > > > > dma_addr_t *dma_handle, enum dma_data_direction dir,
> > > > > gfp_t gfp);
> > > > > void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
> > > > > diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> > > > > index 9a4db5cce600..fc42930af14b 100644
> > > > > --- a/kernel/dma/mapping.c
> > > > > +++ b/kernel/dma/mapping.c
> > > > > @@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
> > > > > size = PAGE_ALIGN(size);
> > > > > if (dma_alloc_direct(dev, ops))
> > > > > return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
> > > > > - if (!ops->alloc_pages)
> > > > > + if (!ops->alloc_pages_op)
> > > > > return NULL;
> > > > > - return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
> > > > > + return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
> > > > > }
> > > > >
> > > > > struct page *dma_alloc_pages(struct device *dev, size_t size,
> > > >
> > > > I'm not impressed. This patch increases churn for code which does not
> > > > (directly) benefit from the change, and that for limitations in your
> > > > tooling?
> > > >
> > > > Why not just rename the conflicting uses in your local tree, but then
> > > > remove the rename from the final patch series?
> > >
> > > With alloc_pages function becoming a macro, the preprocessor ends up
> > > replacing all instances of that name, even when it's not used as a
> > > function. That what necessitates this change. If there is a way to
> > > work around this issue without changing all alloc_pages() calls in the
> > > source base I would love to learn it but I'm not quite clear about
> > > your suggestion and if it solves the issue. Could you please provide
> > > more details?
> >
> > Ah, right, I admit I did not quite understand why this change is
> > needed. However, this is exactly what I don't like about preprocessor
> > macros. Each macro effectively adds a new keyword to the language.
> >
> > I believe everything can be solved with inline functions. What exactly
> > does not work if you rename alloc_pages() to e.g. alloc_pages_caller()
> > and then add an alloc_pages() inline function which calls
> > alloc_pages_caller() with _RET_IP_ as a parameter?
>
> I don't think that would work because we need to inject the codetag at
> the file/line of the actual allocation call. If we pass _RET_IP_ then
> we would have to lookup the codetag associated with that _RET_IP_
> which results in additional runtime overhead.

OK. If the reference to source code itself must be recorded in the
kernel, and not resolved later (either by the debugfs read fops, or by
a tool which reads the file), then this information can only be
obtained with a preprocessor macro.

I was hoping that a debugging feature could be less intrusive. OTOH
it's not my call to balance the tradeoffs.

Thank you for your patient explanations.

Petr T

On Tue, May 2, 2023 at 1:39 PM Petr Tesařík <[email protected]> wrote:
>
> On Tue, 2 May 2023 13:24:37 -0700
> Suren Baghdasaryan <[email protected]> wrote:
>
> > On Tue, May 2, 2023 at 1:09 PM Petr Tesařík <[email protected]> wrote:
> > >
> > > On Tue, 2 May 2023 11:38:49 -0700
> > > Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > > On Tue, May 2, 2023 at 8:50 AM Petr Tesařík <[email protected]> wrote:
> > > > >
> > > > > On Mon, 1 May 2023 09:54:29 -0700
> > > > > Suren Baghdasaryan <[email protected]> wrote:
> > > > >
> > > > > > After redefining alloc_pages, all uses of that name are being replaced.
> > > > > > Change the conflicting names to prevent preprocessor from replacing them
> > > > > > when it's not intended.
> > > > > >
> > > > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > > > ---
> > > > > > arch/x86/kernel/amd_gart_64.c | 2 +-
> > > > > > drivers/iommu/dma-iommu.c | 2 +-
> > > > > > drivers/xen/grant-dma-ops.c | 2 +-
> > > > > > drivers/xen/swiotlb-xen.c | 2 +-
> > > > > > include/linux/dma-map-ops.h | 2 +-
> > > > > > kernel/dma/mapping.c | 4 ++--
> > > > > > 6 files changed, 7 insertions(+), 7 deletions(-)
> > > > > >
> > > > > > diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
> > > > > > index 56a917df410d..842a0ec5eaa9 100644
> > > > > > --- a/arch/x86/kernel/amd_gart_64.c
> > > > > > +++ b/arch/x86/kernel/amd_gart_64.c
> > > > > > @@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
> > > > > > .get_sgtable = dma_common_get_sgtable,
> > > > > > .dma_supported = dma_direct_supported,
> > > > > > .get_required_mask = dma_direct_get_required_mask,
> > > > > > - .alloc_pages = dma_direct_alloc_pages,
> > > > > > + .alloc_pages_op = dma_direct_alloc_pages,
> > > > > > .free_pages = dma_direct_free_pages,
> > > > > > };
> > > > > >
> > > > > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > > > > > index 7a9f0b0bddbd..76a9d5ca4eee 100644
> > > > > > --- a/drivers/iommu/dma-iommu.c
> > > > > > +++ b/drivers/iommu/dma-iommu.c
> > > > > > @@ -1556,7 +1556,7 @@ static const struct dma_map_ops iommu_dma_ops = {
> > > > > > .flags = DMA_F_PCI_P2PDMA_SUPPORTED,
> > > > > > .alloc = iommu_dma_alloc,
> > > > > > .free = iommu_dma_free,
> > > > > > - .alloc_pages = dma_common_alloc_pages,
> > > > > > + .alloc_pages_op = dma_common_alloc_pages,
> > > > > > .free_pages = dma_common_free_pages,
> > > > > > .alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
> > > > > > .free_noncontiguous = iommu_dma_free_noncontiguous,
> > > > > > diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
> > > > > > index 9784a77fa3c9..6c7d984f164d 100644
> > > > > > --- a/drivers/xen/grant-dma-ops.c
> > > > > > +++ b/drivers/xen/grant-dma-ops.c
> > > > > > @@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
> > > > > > static const struct dma_map_ops xen_grant_dma_ops = {
> > > > > > .alloc = xen_grant_dma_alloc,
> > > > > > .free = xen_grant_dma_free,
> > > > > > - .alloc_pages = xen_grant_dma_alloc_pages,
> > > > > > + .alloc_pages_op = xen_grant_dma_alloc_pages,
> > > > > > .free_pages = xen_grant_dma_free_pages,
> > > > > > .mmap = dma_common_mmap,
> > > > > > .get_sgtable = dma_common_get_sgtable,
> > > > > > diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> > > > > > index 67aa74d20162..5ab2616153f0 100644
> > > > > > --- a/drivers/xen/swiotlb-xen.c
> > > > > > +++ b/drivers/xen/swiotlb-xen.c
> > > > > > @@ -403,6 +403,6 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
> > > > > > .dma_supported = xen_swiotlb_dma_supported,
> > > > > > .mmap = dma_common_mmap,
> > > > > > .get_sgtable = dma_common_get_sgtable,
> > > > > > - .alloc_pages = dma_common_alloc_pages,
> > > > > > + .alloc_pages_op = dma_common_alloc_pages,
> > > > > > .free_pages = dma_common_free_pages,
> > > > > > };
> > > > > > diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> > > > > > index 31f114f486c4..d741940dcb3b 100644
> > > > > > --- a/include/linux/dma-map-ops.h
> > > > > > +++ b/include/linux/dma-map-ops.h
> > > > > > @@ -27,7 +27,7 @@ struct dma_map_ops {
> > > > > > unsigned long attrs);
> > > > > > void (*free)(struct device *dev, size_t size, void *vaddr,
> > > > > > dma_addr_t dma_handle, unsigned long attrs);
> > > > > > - struct page *(*alloc_pages)(struct device *dev, size_t size,
> > > > > > + struct page *(*alloc_pages_op)(struct device *dev, size_t size,
> > > > > > dma_addr_t *dma_handle, enum dma_data_direction dir,
> > > > > > gfp_t gfp);
> > > > > > void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
> > > > > > diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> > > > > > index 9a4db5cce600..fc42930af14b 100644
> > > > > > --- a/kernel/dma/mapping.c
> > > > > > +++ b/kernel/dma/mapping.c
> > > > > > @@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
> > > > > > size = PAGE_ALIGN(size);
> > > > > > if (dma_alloc_direct(dev, ops))
> > > > > > return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
> > > > > > - if (!ops->alloc_pages)
> > > > > > + if (!ops->alloc_pages_op)
> > > > > > return NULL;
> > > > > > - return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
> > > > > > + return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
> > > > > > }
> > > > > >
> > > > > > struct page *dma_alloc_pages(struct device *dev, size_t size,
> > > > >
> > > > > I'm not impressed. This patch increases churn for code which does not
> > > > > (directly) benefit from the change, and that for limitations in your
> > > > > tooling?
> > > > >
> > > > > Why not just rename the conflicting uses in your local tree, but then
> > > > > remove the rename from the final patch series?
> > > >
> > > > With alloc_pages function becoming a macro, the preprocessor ends up
> > > > replacing all instances of that name, even when it's not used as a
> > > > function. That what necessitates this change. If there is a way to
> > > > work around this issue without changing all alloc_pages() calls in the
> > > > source base I would love to learn it but I'm not quite clear about
> > > > your suggestion and if it solves the issue. Could you please provide
> > > > more details?
> > >
> > > Ah, right, I admit I did not quite understand why this change is
> > > needed. However, this is exactly what I don't like about preprocessor
> > > macros. Each macro effectively adds a new keyword to the language.
> > >
> > > I believe everything can be solved with inline functions. What exactly
> > > does not work if you rename alloc_pages() to e.g. alloc_pages_caller()
> > > and then add an alloc_pages() inline function which calls
> > > alloc_pages_caller() with _RET_IP_ as a parameter?
> >
> > I don't think that would work because we need to inject the codetag at
> > the file/line of the actual allocation call. If we pass _RET_IP_ then
> > we would have to lookup the codetag associated with that _RET_IP_
> > which results in additional runtime overhead.
>
> OK. If the reference to source code itself must be recorded in the
> kernel, and not resolved later (either by the debugfs read fops, or by
> a tool which reads the file), then this information can only be
> obtained with a preprocessor macro.
>
> I was hoping that a debugging feature could be less intrusive. OTOH
> it's not my call to balance the tradeoffs.
>
> Thank you for your patient explanations.
Thanks for reviewing and the suggestions! I'll address the actionable
ones in the next version.
Suren.
>
> Petr T
On Tue, May 02, 2023 at 06:19:27PM +0300, Andy Shevchenko wrote:
> On Tue, May 2, 2023 at 9:22 AM Kent Overstreet
> <[email protected]> wrote:
> > On Tue, May 02, 2023 at 08:33:57AM +0300, Andy Shevchenko wrote:
> > > Actually instead of producing zillions of variants, do a %p extension
> > > to the printf() and that's it. We have, for example, %pt with T and
> > > with space to follow users that want one or the other variant. Same
> > > can be done with string_get_size().
> >
> > God no.
>
> Any elaboration what's wrong with that?
I'm really not a fan of %p extensions in general (they are what people
reach for because we can't standardize on a common string output API),
but when we'd be passing it bare integers the lack of type safety would
be a particularly big footgun.
> God no for zillion APIs for almost the same. Today you want space,
> tomorrow some other (special) delimiter.
No, I just want to delete the space and output numbers the same way
everyone else does. And if we are stuck with two string_get_size()
functions, %p extensions in no way improve the situation.
On Wed, May 3, 2023 at 5:07 AM Kent Overstreet
<[email protected]> wrote:
> On Tue, May 02, 2023 at 06:19:27PM +0300, Andy Shevchenko wrote:
> > On Tue, May 2, 2023 at 9:22 AM Kent Overstreet
> > <[email protected]> wrote:
> > > On Tue, May 02, 2023 at 08:33:57AM +0300, Andy Shevchenko wrote:
> > > > Actually instead of producing zillions of variants, do a %p extension
> > > > to the printf() and that's it. We have, for example, %pt with T and
> > > > with space to follow users that want one or the other variant. Same
> > > > can be done with string_get_size().
> > >
> > > God no.
> >
> > Any elaboration what's wrong with that?
>
> I'm really not a fan of %p extensions in general (they are what people
> reach for because we can't standardize on a common string output API),
The whole story behind, for example, %pt is to _standardize_ the
output of the same stanza in the kernel.
> but when we'd be passing it bare integers the lack of type safety would
> be a particularly big footgun.
This is no different from any other place in the kernel where we can
shoot ourselves in the foot.
> > God no for zillion APIs for almost the same. Today you want space,
> > tomorrow some other (special) delimiter.
>
> No, I just want to delete the space and output numbers the same way
> everyone else does. And if we are stuck with two string_get_size()
> functions, %p extensions in no way improve the situation.
I think it's exactly the opposite, i.e. it would standardize that
output once and for all.
--
With Best Regards,
Andy Shevchenko
On Wed, May 03, 2023 at 09:30:11AM +0300, Andy Shevchenko wrote:
> On Wed, May 3, 2023 at 5:07 AM Kent Overstreet
> <[email protected]> wrote:
> > On Tue, May 02, 2023 at 06:19:27PM +0300, Andy Shevchenko wrote:
> > > On Tue, May 2, 2023 at 9:22 AM Kent Overstreet
> > > <[email protected]> wrote:
> > > > On Tue, May 02, 2023 at 08:33:57AM +0300, Andy Shevchenko wrote:
> > > > > Actually instead of producing zillions of variants, do a %p extension
> > > > > to the printf() and that's it. We have, for example, %pt with T and
> > > > > with space to follow users that want one or the other variant. Same
> > > > > can be done with string_get_size().
> > > >
> > > > God no.
> > >
> > > Any elaboration what's wrong with that?
> >
> > I'm really not a fan of %p extensions in general (they are what people
> > reach for because we can't standardize on a common string output API),
>
> The whole story behind, for example, %pt is to _standardize_ the
> output of the same stanza in the kernel.
Wtf does this have to do with the rest of the discussion? The %p thing
seems like a total non sequitur and a distraction.
I'm not getting involved with that. All I'm interested in is fixing the
memory allocation profiling output to make it more usable.
> > but when we'd be passing it bare integers the lack of type safety would
> > be a particularly big footgun.
>
> There is no difference to any other place in the kernel where we can
> shoot into our foot.
Yeah, no, absolutely not. Passing different size integers to
string_get_size() is fine; passing pointers to different size integers
to a %p extension will explode and the compiler won't be able to warn.
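To make the by-value versus by-pointer point concrete, here is a small userspace sketch. The helper names (size_by_value, size_by_pointer, demo_mismatch) are hypothetical and this is not how vsprintf() is implemented. It only shows that a callee handed an untyped pointer has to assume a width, whereas a by-value parameter is converted by the compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* By value: the compiler converts narrower integers to uint64_t. */
static uint64_t size_by_value(uint64_t size)
{
	return size;
}

/*
 * By pointer, as a %p-style extension would receive it: the callee sees
 * only an untyped address and has to assume the width (64 bits here).
 */
static uint64_t size_by_pointer(const void *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));	/* always reads 8 bytes */
	return v;
}

/* Returns 1 when the by-pointer read does not yield the caller's value. */
static int demo_mismatch(void)
{
	/* Two adjacent 32-bit values keep the 8-byte read in bounds. */
	uint32_t pair[2] = { 4096, 7 };

	assert(size_by_value(pair[0]) == 4096);	/* widened correctly */
	return size_by_pointer(&pair[0]) != 4096;	/* neighbour leaks in */
}
```

The array of two values is there only to keep the sketch well defined. With a lone 32-bit variable the 8-byte read would run past the object, and nothing in the call would let the compiler flag it.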
>
> > > God no for zillion APIs for almost the same. Today you want space,
> > > tomorrow some other (special) delimiter.
> >
> > No, I just want to delete the space and output numbers the same way
> > everyone else does. And if we are stuck with two string_get_size()
> > functions, %p extensions in no way improve the situation.
>
> I think it's exactly for the opposite, i.e. standardize that output
> once and for all.
So, are you dropping your NACK then, so we can standardize the kernel on
the way everything else does it?
On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote:
> Memory allocation profiling infrastructure provides a low overhead
> mechanism to make all kernel allocations in the system visible. It can be
> used to monitor memory usage, track memory hotspots, detect memory leaks,
> identify memory regressions.
>
> To keep the overhead to the minimum, we record only allocation sizes for
> every allocation in the codebase. With that information, if users are
> interested in more detailed context for a specific allocation, they can
> enable in-depth context tracking, which includes capturing the pid, tgid,
> task name, allocation size, timestamp and call stack for every allocation
> at the specified code location.
[...]
> Implementation utilizes a more generic concept of code tagging, introduced
> as part of this patchset. Code tag is a structure identifying a specific
> location in the source code which is generated at compile time and can be
> embedded in an application-specific structure. A number of applications
> for code tagging have been presented in the original RFC [1].
> Code tagging uses the old trick of "define a special elf section for
> objects of a given type so that we can iterate over them at runtime" and
> creates a proper library for it.
>
> To profile memory allocations, we instrument page, slab and percpu
> allocators to record total memory allocated in the associated code tag at
> every allocation in the codebase. Every time an allocation is performed by
> an instrumented allocator, the code tag at that location increments its
> counter by allocation size. Every time the memory is freed the counter is
> decremented. To decrement the counter upon freeing, allocated object needs
> a reference to its code tag. Page allocators use page_ext to record this
> reference while slab allocators use memcg_data (renamed into more generic
> slabobj_ext) of the slab page.
[...]
> [1] https://lore.kernel.org/all/[email protected]/
[...]
> 70 files changed, 2765 insertions(+), 554 deletions(-)
Sorry for cutting the cover letter considerably, but I believe I have
quoted the most important/interesting parts here. The approach is not
fundamentally different from the previous version [1] and there was a
significant discussion around this approach. The cover letter neither
summarizes nor deals with the concerns expressed previously, AFAICS. So
let me bring those back up. At least those I find the most important:
- This is a big change and it adds a significant maintenance burden
because each allocation entry point needs to be handled specifically.
The cost will grow with the intended coverage, especially where an
allocation is hidden in library code.
- It has been brought up that this duplicates functionality already
available via the existing tracing infrastructure. You should make it
very clear why that is not suitable for the job.
- We already have the page_owner infrastructure that provides allocation
tracking data. Why can't it be used or extended?
Thanks!
--
Michal Hocko
SUSE Labs
On Mon 01-05-23 09:54:44, Suren Baghdasaryan wrote:
[...]
> +static inline void add_ctx(struct codetag_ctx *ctx,
> + struct codetag_with_ctx *ctc)
> +{
> + kref_init(&ctx->refcount);
> + spin_lock(&ctc->ctx_lock);
> + ctx->flags = CTC_FLAG_CTX_PTR;
> + ctx->ctc = ctc;
> + list_add_tail(&ctx->node, &ctc->ctx_head);
> + spin_unlock(&ctc->ctx_lock);
AFAIU every single tracked allocation will get its own codetag_ctx.
There is no aggregation per allocation site or anything else. This looks
like a scalability and a memory overhead red flag to me.
> +}
> +
> +static inline void rem_ctx(struct codetag_ctx *ctx,
> + void (*free_ctx)(struct kref *refcount))
> +{
> + struct codetag_with_ctx *ctc = ctx->ctc;
> +
> + spin_lock(&ctc->ctx_lock);
This could deadlock when the allocator is called from IRQ context,
since the lock is taken without disabling interrupts.
> + /* ctx might have been removed while we were using it */
> + if (!list_empty(&ctx->node))
> + list_del_init(&ctx->node);
> + spin_unlock(&ctc->ctx_lock);
> + kref_put(&ctx->refcount, free_ctx);
--
Michal Hocko
SUSE Labs
On Wed, May 03, 2023 at 09:25:29AM +0200, Michal Hocko wrote:
> On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote:
> > Memory allocation profiling infrastructure provides a low overhead
> > mechanism to make all kernel allocations in the system visible. It can be
> > used to monitor memory usage, track memory hotspots, detect memory leaks,
> > identify memory regressions.
> >
> > To keep the overhead to the minimum, we record only allocation sizes for
> > every allocation in the codebase. With that information, if users are
> > interested in more detailed context for a specific allocation, they can
> > enable in-depth context tracking, which includes capturing the pid, tgid,
> > task name, allocation size, timestamp and call stack for every allocation
> > at the specified code location.
> [...]
> > Implementation utilizes a more generic concept of code tagging, introduced
> > as part of this patchset. Code tag is a structure identifying a specific
> > location in the source code which is generated at compile time and can be
> > embedded in an application-specific structure. A number of applications
> > for code tagging have been presented in the original RFC [1].
> > Code tagging uses the old trick of "define a special elf section for
> > objects of a given type so that we can iterate over them at runtime" and
> > creates a proper library for it.
> >
> > To profile memory allocations, we instrument page, slab and percpu
> > allocators to record total memory allocated in the associated code tag at
> > every allocation in the codebase. Every time an allocation is performed by
> > an instrumented allocator, the code tag at that location increments its
> > counter by allocation size. Every time the memory is freed the counter is
> > decremented. To decrement the counter upon freeing, allocated object needs
> > a reference to its code tag. Page allocators use page_ext to record this
> > reference while slab allocators use memcg_data (renamed into more generic
> > slabobj_ext) of the slab page.
> [...]
> > [1] https://lore.kernel.org/all/[email protected]/
> [...]
> > 70 files changed, 2765 insertions(+), 554 deletions(-)
>
> Sorry for cutting the cover considerably but I believe I have quoted the
> most important/interesting parts here. The approach is not fundamentally
> different from the previous version [1] and there was a significant
> discussion around this approach. The cover letter doesn't summarize nor
> deal with concerns expressed previous AFAICS. So let me bring those up
> back. At least those I find the most important:
We covered this previously, I'll just be giving the same answers I did
before:
> - This is a big change and it adds a significant maintenance burden
> because each allocation entry point needs to be handled specifically.
> The cost will grow with the intended coverage especially there when
> allocation is hidden in a library code.
We've made this as clean and simple as possible: a single new macro
invocation per allocation function, no calling convention changes (that
would indeed have been a lot of churn!)
> - It has been brought up that this is duplicating functionality already
> available via existing tracing infrastructure. You should make it very
> clear why that is not suitable for the job
Tracing people _claimed_ this, but never demonstrated it. Tracepoints
exist but the tooling that would consume them to provide this kind of
information does not exist; it would require maintaining an index of
_every outstanding allocation_ so that frees could be accounted
correctly - IOW, it would be _drastically_ higher overhead, so not at
all comparable.
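For contrast, a rough userspace sketch of the counter scheme described in the cover letter: each object keeps a reference to its tag, so a free finds the right counter without any index of outstanding allocations. The names (tag_sketch, obj_hdr, tracked_alloc, tracked_free) are hypothetical. The in-band header is an assumption made to keep the sketch small; per the cover letter, the patchset keeps the reference in page_ext or slabobj_ext instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-call-site counter. */
struct tag_sketch {
	const char *site;
	size_t bytes;		/* bytes currently allocated from this site */
};

/* Header stored in front of each object; stands in for page_ext/slabobj_ext. */
struct obj_hdr {
	struct tag_sketch *tag;
	size_t size;
};

static void *tracked_alloc(struct tag_sketch *tag, size_t size)
{
	struct obj_hdr *h = malloc(sizeof(*h) + size);

	if (!h)
		return NULL;
	h->tag = tag;
	h->size = size;
	tag->bytes += size;	/* charge the call site */
	return h + 1;		/* caller sees the memory after the header */
}

static void tracked_free(void *p)
{
	struct obj_hdr *h;

	if (!p)
		return;
	h = (struct obj_hdr *)p - 1;
	assert(h->tag->bytes >= h->size);
	h->tag->bytes -= h->size;	/* uncharge via the stored reference */
	free(h);
}
```

Both paths are a constant amount of work per call. A consumer built only on alloc/free trace events would instead have to remember which site each live pointer came from.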
> - We already have page_owner infrastructure that provides allocation
> tracking data. Why it cannot be used/extended?
Page owner is also very high overhead, and the output is not very user
friendly (tracking the full call stack means a lot of related overhead
gets split up, which is not generally what you want), and it doesn't
cover slab.
This tracks _all_ memory allocations - slab, page, vmalloc, percpu.
On Mon 01-05-23 09:54:45, Suren Baghdasaryan wrote:
[...]
> +struct codetag_ctx *alloc_tag_create_ctx(struct alloc_tag *tag, size_t size)
> +{
> + struct alloc_call_ctx *ac_ctx;
> +
> + /* TODO: use a dedicated kmem_cache */
> + ac_ctx = kmalloc(sizeof(struct alloc_call_ctx), GFP_KERNEL);
You cannot really use GFP_KERNEL here. This is the post_alloc_hook path,
which has its own gfp context.
--
Michal Hocko
SUSE Labs
On Wed 03-05-23 03:34:21, Kent Overstreet wrote:
> On Wed, May 03, 2023 at 09:25:29AM +0200, Michal Hocko wrote:
> > On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote:
> > > Memory allocation profiling infrastructure provides a low overhead
> > > mechanism to make all kernel allocations in the system visible. It can be
> > > used to monitor memory usage, track memory hotspots, detect memory leaks,
> > > identify memory regressions.
> > >
> > > To keep the overhead to the minimum, we record only allocation sizes for
> > > every allocation in the codebase. With that information, if users are
> > > interested in more detailed context for a specific allocation, they can
> > > enable in-depth context tracking, which includes capturing the pid, tgid,
> > > task name, allocation size, timestamp and call stack for every allocation
> > > at the specified code location.
> > [...]
> > > Implementation utilizes a more generic concept of code tagging, introduced
> > > as part of this patchset. Code tag is a structure identifying a specific
> > > location in the source code which is generated at compile time and can be
> > > embedded in an application-specific structure. A number of applications
> > > for code tagging have been presented in the original RFC [1].
> > > Code tagging uses the old trick of "define a special elf section for
> > > objects of a given type so that we can iterate over them at runtime" and
> > > creates a proper library for it.
> > >
> > > To profile memory allocations, we instrument page, slab and percpu
> > > allocators to record total memory allocated in the associated code tag at
> > > every allocation in the codebase. Every time an allocation is performed by
> > > an instrumented allocator, the code tag at that location increments its
> > > counter by allocation size. Every time the memory is freed the counter is
> > > decremented. To decrement the counter upon freeing, allocated object needs
> > > a reference to its code tag. Page allocators use page_ext to record this
> > > reference while slab allocators use memcg_data (renamed into more generic
> > > slabobj_ext) of the slab page.
> > [...]
> > > [1] https://lore.kernel.org/all/[email protected]/
> > [...]
> > > 70 files changed, 2765 insertions(+), 554 deletions(-)
> >
> > Sorry for cutting the cover considerably but I believe I have quoted the
> > most important/interesting parts here. The approach is not fundamentally
> > different from the previous version [1] and there was a significant
> > discussion around this approach. The cover letter neither summarizes nor
> > deals with the concerns expressed previously, AFAICS. So let me bring those
> > up again. At least those I find the most important:
>
> We covered this previously, I'll just be giving the same answers I did
> before:
Your answers have shown your insight into tracing is very limited. I
have a clear recollection there were many suggestions on how to get what
you need and willingness to help out. Repeating your previous position
will not help much to be honest with you.
> > - This is a big change and it adds a significant maintenance burden
> > because each allocation entry point needs to be handled specifically.
> > The cost will grow with the intended coverage, especially where the
> > allocation is hidden in library code.
>
> We've made this as clean and simple as possible: a single new macro
> invocation per allocation function, no calling convention changes (that
> would indeed have been a lot of churn!)
That doesn't really make the concern any less relevant. I believe you
and Suren have made a great effort to reduce the churn as much as
possible but looking at the diffstat the code changes are clearly there
and you have to convince the rest of the community that this maintenance
overhead is really worth it. The above statement hasn't helped to
convince me, to be honest.
> > - It has been brought up that this is duplicating functionality already
> > available via existing tracing infrastructure. You should make it very
> > clear why that is not suitable for the job
>
> Tracing people _claimed_ this, but never demonstrated it.
The burden is on you and Suren. You are proposing to implement an
alternative tracing infrastructure.
> Tracepoints
> exist but the tooling that would consume them to provide this kind of
> information does not exist;
Is there any reason why appropriate tooling cannot be developed?
> it would require maintaining an index of
> _every outstanding allocation_ so that frees could be accounted
> correctly - IOW, it would be _drastically_ higher overhead, so not at
> all comparable.
Do you have any actual data points to prove your claim?
> > - We already have page_owner infrastructure that provides allocation
> > tracking data. Why it cannot be used/extended?
>
> Page owner is also very high overhead,
Is there any data to prove that claim? I would be really surprised that
page_owner would give higher overhead than page tagging with profiling
enabled (there is an allocation for each allocation request!!!). We can
discuss the bare bone page tagging comparison to page_owner because of
the full stack unwinding but is that overhead really prohibitively costly?
Can we reduce that by trimming the unwinder information?
> and the output is not very user
> friendly (tracking full call stack means many related overhead gets
> split, not generally what you want), and it doesn't cover slab.
Is this something we cannot do anything about? Have you explored any
potential ways?
> This tracks _all_ memory allocations - slab, page, vmalloc, percpu.
--
Michal Hocko
SUSE Labs
On Wed, May 03, 2023 at 09:51:49AM +0200, Michal Hocko wrote:
> Your answers have shown your insight into tracing is very limited. I
> have a clear recollection there were many suggestions on how to get what
> you need and willingness to help out. Repeating your previous position
> will not help much to be honest with you.
Please enlighten us, oh wise one.
> > > - It has been brought up that this is duplicating functionality already
> > > available via existing tracing infrastructure. You should make it very
> > > clear why that is not suitable for the job
> >
> > Tracing people _claimed_ this, but never demonstrated it.
>
> The burden is on you and Suren. You are proposing to implement an
> alternative tracing infrastructure.
No, we're still waiting on the tracing people to _demonstrate_, not
claim, that this is at all possible in a comparable way with tracing.
It's not on us to make your argument for you, and before making
accusations about honesty you should try to be more honest yourself.
The expectations you're trying to level have never been the norm in the
kernel community, sorry. When there's a technical argument about the
best way to do something, _code wins_ and we've got working code to do
something that hasn't been possible previously.
There's absolutely no rule that "tracing has to be the one and only tool
for kernel visibility".
I'm considering the tracing discussion closed until someone in the
pro-tracing camp shows something new.
> > > - We already have page_owner infrastructure that provides allocation
> > > tracking data. Why it cannot be used/extended?
> >
> > Page owner is also very high overhead,
>
> Is there any data to prove that claim? I would be really surprised that
> page_owner would give higher overhead than page tagging with profiling
> enabled (there is an allocation for each allocation request!!!). We can
> discuss the bare bone page tagging comparison to page_owner because of
> the full stack unwinding but is that overhead really prohibitively costly?
> Can we reduce that by trimming the unwinder information?
Honestly, this isn't terribly relevant, because as noted before page
owner is limited to just page allocations.
>
> > and the output is not very user
> > friendly (tracking full call stack means many related overhead gets
> > split, not generally what you want), and it doesn't cover slab.
>
> Is this something we cannot do anything about? Have you explored any
> potential ways?
>
> > This tracks _all_ memory allocations - slab, page, vmalloc, percpu.
Michal, the discussions with you seem to perpetually go in circles; it's
clear you're negative on the patchset, you keep raising the same
objections while refusing to concede a single point.
I believe I've answered enough, so I'll leave off further discussions
with you.
On Wed, May 3, 2023 at 10:13 AM Kent Overstreet
<[email protected]> wrote:
> On Wed, May 03, 2023 at 09:30:11AM +0300, Andy Shevchenko wrote:
> > On Wed, May 3, 2023 at 5:07 AM Kent Overstreet
> > <[email protected]> wrote:
> > > On Tue, May 02, 2023 at 06:19:27PM +0300, Andy Shevchenko wrote:
> > > > On Tue, May 2, 2023 at 9:22 AM Kent Overstreet
> > > > <[email protected]> wrote:
> > > > > On Tue, May 02, 2023 at 08:33:57AM +0300, Andy Shevchenko wrote:
> > > > > > Actually instead of producing zillions of variants, do a %p extension
> > > > > > to the printf() and that's it. We have, for example, %pt with T and
> > > > > > with space to follow users that want one or the other variant. Same
> > > > > > can be done with string_get_size().
> > > > >
> > > > > God no.
> > > >
> > > > Any elaboration what's wrong with that?
> > >
> > > I'm really not a fan of %p extensions in general (they are what people
> > > reach for because we can't standardize on a common string output API),
> >
> > The whole story behind, for example, %pt is to _standardize_ the
> > output of the same stanza in the kernel.
>
> Wtf does this have to do with the rest of the discussion? The %p thing
> seems like a total non sequitur and a distraction.
>
> I'm not getting involved with that. All I'm interested in is fixing the
> memory allocation profiling output to make it more usable.
>
> > > but when we'd be passing it bare integers the lack of type safety would
> > > be a particularly big footgun.
> >
> > There is no difference to any other place in the kernel where we can
> > shoot into our foot.
>
> Yeah, no, absolutely not. Passing different size integers to
> string_get_size() is fine; passing pointers to different size integers
> to a %p extension will explode and the compiler won't be able to warn.
This is another topic. Yes, there is a discussion to have a compiler
plugin to check this.
> > > > God no for zillion APIs for almost the same. Today you want space,
> > > > tomorrow some other (special) delimiter.
> > >
> > > No, I just want to delete the space and output numbers the same way
> > > everyone else does. And if we are stuck with two string_get_size()
> > > functions, %p extensions in no way improve the situation.
> >
> > I think it's exactly for the opposite, i.e. standardize that output
> > once and for all.
>
> So, are you dropping your NACK then, so we can standardize the kernel on
> the way everything else does it?
No, you are breaking existing users. The NAK stays.
The whole discussion after that is about how users can utilize both
your format and the existing format without multiplying APIs.
--
With Best Regards,
Andy Shevchenko
On Wed, May 03, 2023 at 12:12:12PM +0300, Andy Shevchenko wrote:
> > So, are you dropping your NACK then, so we can standardize the kernel on
> > the way everything else does it?
>
> No, you are breaking existing users. The NAK stays.
> The whole discussion after that is to make the way on how users can
> utilize your format and existing format without multiplying APIs.
Dave seems to think we shouldn't be, and I'm in agreement.
On 5/3/23 00:50, Dave Chinner wrote:
> On Tue, May 02, 2023 at 07:42:59AM -0400, James Bottomley wrote:
>> On Mon, 2023-05-01 at 23:17 -0400, Kent Overstreet wrote:
>> > On Mon, May 01, 2023 at 10:22:18PM -0400, James Bottomley wrote:
>> > > It is not used just for debug. It's used all over the kernel for
>> > > printing out device sizes. The output mostly goes to the kernel
>> > > print buffer, so it's anyone's guess as to what, if any, tools are
>> > > parsing it, but the concern about breaking log parsers seems to be
>> > > a valid one.
>> >
>> > Ok, there is sd_print_capacity() - but who in their right mind would
>> > be trying to scrape device sizes, in human readable units,
>>
>> If you bother to google "kernel log parser", you'll discover it's quite
>> an active area which supports a load of company business models.
>
> That doesn't mean log messages are unchangeable ABI. Indeed, we had
> the whole "printk_index_emit()" addition recently to create
> an external index of printk message formats for such applications to
> use. [*]
>
>> > from log messages when it's available in sysfs/procfs (actually, is
>> > it in sysfs? if not, that's an oversight) in more reasonable units?
>>
>> It's not in sysfs, no. As aren't a lot of things, which is why log
>> parsing for system monitoring is big business.
>
> And that big business is why printk_index_emit() exists to allow
> them to easily determine how log messages change format and come and
> go across different kernel versions.
>
>> > Correct me if I'm wrong, but I've yet to hear about kernel log
>> > messages being considered a stable interface, and this seems a bit out
>> > there.
>>
>> It might not be listed as stable, but when it's known there's a large
>> ecosystem out there consuming it we shouldn't break it just because you
>> feel like it.
>
> But we've solved this problem already, yes?
>
> If the userspace applications are not using the kernel printk format
> index to detect such changes between kernel versions, then they
> should be. This makes trivial issues, like whether we have a space or
> not between units, completely irrelevant because the entry in the
> printk format index for the log output we emit will match whatever
> is output by the kernel....
If I understand the commit changelog correctly, this would indeed have
helped, but only if the change were reflected in the format string. With
string_get_size() the format is always %s, so a change to the helper, or a
switch to another variant of the helper that omits the space, wouldn't be
reflected in the format string at all. I guess that would be an argument
for Andy's suggestion of adding a new %pt / %pT, which would then be
reflected in the format string. It would also be more concise to use than
the helper, fwiw.
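
To make the point about %s concrete, here is a small userspace sketch. The two helpers are hypothetical stand-ins for "the helper with a space" and "a variant without it"; they are not the kernel's string_get_size(), and CAPACITY_FMT is an invented format string. The call site's format string, which is all a format index could record, is identical in both cases, while the emitted text differs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper variant that puts a space before the unit. */
static void size_with_space(unsigned long long bytes, char *buf, size_t len)
{
	snprintf(buf, len, "%llu MiB", bytes >> 20);
}

/* Hypothetical helper variant that omits the space. */
static void size_without_space(unsigned long long bytes, char *buf, size_t len)
{
	snprintf(buf, len, "%lluMiB", bytes >> 20);
}

/* The format string at the call site: the same whichever helper is used. */
#define CAPACITY_FMT "capacity: %s\n"

/* Formats one log line from an already-rendered size string. */
static void format_capacity(char *out, size_t len, const char *size_str)
{
	snprintf(out, len, CAPACITY_FMT, size_str);
}
```

A consumer that compares only the indexed format string sees "capacity: %s\n" before and after the helper changes, so it has no way to notice that the rendered size lost its space.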
> Cheers,
>
> Dave.
>
> [*]
> commit 337015573718b161891a3473d25f59273f2e626b
> Author: Chris Down <[email protected]>
> Date: Tue Jun 15 17:52:53 2021 +0100
>
> printk: Userspace format indexing support
>
> We have a number of systems industry-wide that have a subset of their
> functionality that works as follows:
>
> 1. Receive a message from local kmsg, serial console, or netconsole;
> 2. Apply a set of rules to classify the message;
> 3. Do something based on this classification (like scheduling a
> remediation for the machine), rinse, and repeat.
>
> As a couple of examples of places we have this implemented just inside
> Facebook, although this isn't a Facebook-specific problem, we have this
> inside our netconsole processing (for alarm classification), and as part
> of our machine health checking. We use these messages to determine
> fairly important metrics around production health, and it's important
> that we get them right.
>
> While for some kinds of issues we have counters, tracepoints, or metrics
> with a stable interface which can reliably indicate the issue, in order
> to react to production issues quickly we need to work with the interface
> which most kernel developers naturally use when developing: printk.
>
> Most production issues come from unexpected phenomena, and as such
> usually the code in question doesn't have easily usable tracepoints or
> other counters available for the specific problem being mitigated. We
> have a number of lines of monitoring defence against problems in
> production (host metrics, process metrics, service metrics, etc), and
> where it's not feasible to reliably monitor at another level, this kind
> of pragmatic netconsole monitoring is essential.
>
> As one would expect, monitoring using printk is rather brittle for a
> number of reasons -- most notably that the message might disappear
> entirely in a new version of the kernel, or that the message may change
> in some way that the regex or other classification methods start to
> silently fail.
>
> One factor that makes this even harder is that, under normal operation,
> many of these messages are never expected to be hit. For example, there
> may be a rare hardware bug which one wants to detect if it was to ever
> happen again, but its recurrence is not likely or anticipated. This
> precludes using something like checking whether the printk in question
> was printed somewhere fleetwide recently to determine whether the
> message in question is still present or not, since we don't anticipate
> that it should be printed anywhere, but still need to monitor for its
> future presence in the long-term.
>
> This class of issue has happened on a number of occasions, causing
> unhealthy machines with hardware issues to remain in production for
> longer than ideal. As a recent example, some monitoring around
> blk_update_request fell out of date and caused semi-broken machines to
> remain in production for longer than would be desirable.
>
> Searching through the codebase to find the message is also extremely
> fragile, because many of the messages are further constructed beyond
> their callsite (eg. btrfs_printk and other module-specific wrappers,
> each with their own functionality). Even if they aren't, guessing the
> format and formulation of the underlying message based on the aesthetics
> of the message emitted is not a recipe for success at scale, and our
> previous issues with fleetwide machine health checking demonstrate as
> much.
>
> This provides a solution to the issue of silently changed or deleted
> printks: we record pointers to all printk format strings known at
> compile time into a new .printk_index section, both in vmlinux and
> modules. At runtime, this can then be iterated by looking at
> <debugfs>/printk/index/<module>, which emits the following format, both
> readable by humans and able to be parsed by machines:
>
> $ head -1 vmlinux; shuf -n 5 vmlinux
> # <level[,flags]> filename:line function "format"
> <5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n"
> <4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n"
> <6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n"
> <6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n"
> <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n"
>
> This mitigates the majority of cases where we have a highly-specific
> printk which we want to match on, as we can now enumerate and check
> whether the format changed or the printk callsite disappeared entirely
> in userspace. This allows us to catch changes to printks we monitor
> earlier and decide what to do about it before it becomes problematic.
>
> There is no additional runtime cost for printk callers or printk itself,
> and the assembly generated is exactly the same.
>
> Signed-off-by: Chris Down <[email protected]>
> Cc: Petr Mladek <[email protected]>
> Cc: Jessica Yu <[email protected]>
> Cc: Sergey Senozhatsky <[email protected]>
> Cc: John Ogness <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Kees Cook <[email protected]>
> Reviewed-by: Petr Mladek <[email protected]>
> Tested-by: Petr Mladek <[email protected]>
> Reported-by: kernel test robot <[email protected]>
> Acked-by: Andy Shevchenko <[email protected]>
> Acked-by: Jessica Yu <[email protected]> # for module.{c,h}
> Signed-off-by: Petr Mladek <[email protected]>
> Link: https://lore.kernel.org/r/e42070983637ac5e384f17fbdbe86d19c7b212a5.1623775748.git.chris@chrisdown.name
>
On Wed, May 3, 2023 at 12:28 PM Vlastimil Babka <[email protected]> wrote:
>
> On 5/3/23 00:50, Dave Chinner wrote:
> > On Tue, May 02, 2023 at 07:42:59AM -0400, James Bottomley wrote:
> >> On Mon, 2023-05-01 at 23:17 -0400, Kent Overstreet wrote:
> >> > On Mon, May 01, 2023 at 10:22:18PM -0400, James Bottomley wrote:
> >> > > It is not used just for debug. It's used all over the kernel for
> >> > > printing out device sizes. The output mostly goes to the kernel
> >> > > print buffer, so it's anyone's guess as to what, if any, tools are
> >> > > parsing it, but the concern about breaking log parsers seems to be
> >> > > a valid one.
> >> >
> >> > Ok, there is sd_print_capacity() - but who in their right mind would
> >> > be trying to scrape device sizes, in human readable units,
> >>
> >> If you bother to google "kernel log parser", you'll discover it's quite
> >> an active area which supports a load of company business models.
> >
> > That doesn't mean log messages are unchangeable ABI. Indeed, we had
> > the whole "printk_index_emit()" addition recently to create
> > an external index of printk message formats for such applications to
> > use. [*]
> >
> >> > from log messages when it's available in sysfs/procfs (actually, is
> >> > it in sysfs? if not, that's an oversight) in more reasonable units?
> >>
> >> It's not in sysfs, no. As aren't a lot of things, which is why log
> >> parsing for system monitoring is big business.
> >
> > And that big business is why printk_index_emit() exists to allow
> > them to easily determine how log messages change format and come and
> > go across different kernel versions.
> >
> >> > Correct me if I'm wrong, but I've yet to hear about kernel log
> >> > messages being considered a stable interface, and this seems a bit out
> >> > there.
> >>
> >> It might not be listed as stable, but when it's known there's a large
> >> ecosystem out there consuming it we shouldn't break it just because you
> >> feel like it.
> >
> > But we've solved this problem already, yes?
> >
> > If the userspace applications are not using the kernel printk format
> > index to detect such changes between kernel versions, then they
> > should be. This makes trivial issues, like whether we have a space or
> > not between units, completely irrelevant because the entry in the
> > printk format index for the log output we emit will match whatever
> > is output by the kernel....
>
> If I understand that correctly from the commit changelog, this would have
> indeed helped, but if the change was reflected in format string. But with
> string_get_size() it's always an %s and the change of the helper's or a
> switch to another variant of the helper that would omit the space, wouldn't
> be reflected in the format string at all? I guess that would be an argument
> for Andy's suggestion for adding a new %pt / %pT which would then be
(Note, there is no respective %p extension for string_get_size() yet.
%pt is for time and was used as an example when its evolution included
a change like this)
> reflected in the format string. And also more concise to use than using the
> helper, fwiw.
--
With Best Regards,
Andy Shevchenko
On Wed, 3 May 2023 09:51:49 +0200
Michal Hocko <[email protected]> wrote:
> On Wed 03-05-23 03:34:21, Kent Overstreet wrote:
>[...]
> > We've made this as clean and simple as possible: a single new macro
> > invocation per allocation function, no calling convention changes (that
> > would indeed have been a lot of churn!)
>
> That doesn't really make the concern any less relevant. I believe you
> and Suren have made a great effort to reduce the churn as much as
> possible but looking at the diffstat the code changes are clearly there
> and you have to convince the rest of the community that this maintenance
> overhead is really worth it.
I believe this is the crucial point.
I have my own concerns about the use of preprocessor macros, which goes
against the basic idea of a code tagging framework (patch 13/40).
AFAICS the CODE_TAG_INIT macro must be expanded on the same source code
line as the tagged code, which makes it hard to use without further
macros (unless you want to make the source code unreadable beyond
imagination). That's why all allocation functions must be converted to
macros.
If anyone ever wants to use this code tagging framework for something
else, they will also have to convert relevant functions to macros,
slowly changing the kernel to a minefield where local identifiers,
struct, union and enum tags, field names and labels must avoid name
conflict with a tagged function. For now, I have to remember that
alloc_pages is forbidden, but the list may grow.
FWIW I can see some occurrences of "alloc_pages" under arch/ which are
not renamed by patch 19/40 of this series. For instance, does the
kernel build for s390x after applying the patch series?
New code may also work initially, but explode after adding an #include
later...
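
The name-conflict concern can be illustrated with a simplified userspace sketch. This is not the patchset's macro: struct site, last_site, alloc_pages_impl() and struct page_ops are invented for the example, and the real tagging macros are more involved. It shows only the general C behaviour: once a function name becomes a function-like macro, any other use of that identifier followed by an opening parenthesis is rewritten by the preprocessor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag: records the call site of the most recent tagged call. */
struct site {
	const char *file;
	int line;
};
static struct site last_site;

/* The real function, renamed so the original name can become a macro. */
static int alloc_pages_impl(int order)
{
	return 1 << order;
}

/* Function-like macro under the original name: tag the site, then call. */
#define alloc_pages(order) \
	(last_site.file = __FILE__, last_site.line = __LINE__, \
	 alloc_pages_impl(order))

/* A struct member that happens to share the macro's name. */
struct page_ops {
	int (*alloc_pages)(int order);
};

static int twice(int order)
{
	return 2 << order;
}

/* An ordinary call site: the macro expands and records this line. */
static int tagged_alloc(int order)
{
	return alloc_pages(order);
}

static int call_through_member(struct page_ops *ops, int order)
{
	/*
	 * Writing ops->alloc_pages(order) here would be macro-expanded and
	 * fail to compile. Parenthesising the member name keeps the macro
	 * name from being followed by '(', so no expansion takes place.
	 */
	return (ops->alloc_pages)(order);
}
```

The declaration and the designated initialiser of the member are unaffected, because the identifier is not directly followed by an opening parenthesis there; only the call syntax trips over the macro.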
HOWEVER, if the rest of the community agrees that the added value of
code tagging is worth all these potential risks, I can live with it.
Petr T
On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> On Wed, 3 May 2023 09:51:49 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Wed 03-05-23 03:34:21, Kent Overstreet wrote:
> >[...]
> > > We've made this as clean and simple as possible: a single new macro
> > > invocation per allocation function, no calling convention changes (that
> > > would indeed have been a lot of churn!)
> >
> > That doesn't really make the concern any less relevant. I believe you
> > and Suren have made a great effort to reduce the churn as much as
> > possible but looking at the diffstat the code changes are clearly there
> > and you have to convince the rest of the community that this maintenance
> > overhead is really worth it.
>
> I believe this is the crucial point.
>
> I have my own concerns about the use of preprocessor macros, which goes
> against the basic idea of a code tagging framework (patch 13/40).
> AFAICS the CODE_TAG_INIT macro must be expanded on the same source code
> line as the tagged code, which makes it hard to use without further
> macros (unless you want to make the source code unreadable beyond
> imagination). That's why all allocation functions must be converted to
> macros.
>
> If anyone ever wants to use this code tagging framework for something
> else, they will also have to convert relevant functions to macros,
> slowly changing the kernel to a minefield where local identifiers,
> struct, union and enum tags, field names and labels must avoid name
> conflict with a tagged function. For now, I have to remember that
> alloc_pages is forbidden, but the list may grow.
No, we've got other code tagging applications (that have already been
posted!) and they don't "convert functions to macros" in the way this
patchset does - they do introduce new macros, but as new identifiers,
which we do all the time.
This was simply the least churny way to hook memory allocations.
On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> If anyone ever wants to use this code tagging framework for something
> else, they will also have to convert relevant functions to macros,
> slowly changing the kernel to a minefield where local identifiers,
> struct, union and enum tags, field names and labels must avoid name
> conflict with a tagged function. For now, I have to remember that
> alloc_pages is forbidden, but the list may grow.
Also, since you're not actually a kernel contributor yet...
It's not really good decorum to speculate in code review about things
that can be answered by just reading the code. If you're going to
comment, please do the necessary work to make sure you're saying
something that makes sense.
On Wed, 3 May 2023 05:54:43 -0400
Kent Overstreet <[email protected]> wrote:
> On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > On Wed, 3 May 2023 09:51:49 +0200
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Wed 03-05-23 03:34:21, Kent Overstreet wrote:
> > >[...]
> > > > We've made this as clean and simple as possible: a single new macro
> > > > invocation per allocation function, no calling convention changes (that
> > > > would indeed have been a lot of churn!)
> > >
> > > That doesn't really make the concern any less relevant. I believe you
> > > and Suren have made a great effort to reduce the churn as much as
> > > possible but looking at the diffstat the code changes are clearly there
> > > and you have to convince the rest of the community that this maintenance
> > > overhead is really worth it.
> >
> > I believe this is the crucial point.
> >
> > I have my own concerns about the use of preprocessor macros, which goes
> > against the basic idea of a code tagging framework (patch 13/40).
> > AFAICS the CODE_TAG_INIT macro must be expanded on the same source code
> > line as the tagged code, which makes it hard to use without further
> > macros (unless you want to make the source code unreadable beyond
> > imagination). That's why all allocation functions must be converted to
> > macros.
> >
> > If anyone ever wants to use this code tagging framework for something
> > else, they will also have to convert relevant functions to macros,
> > slowly changing the kernel to a minefield where local identifiers,
> > struct, union and enum tags, field names and labels must avoid name
> > conflict with a tagged function. For now, I have to remember that
> > alloc_pages is forbidden, but the list may grow.
>
> No, we've got other code tagging applications (that have already been
> posted!) and they don't "convert functions to macros" in the way this
> patchset does - they do introduce new macros, but as new identifiers,
> which we do all the time.
Yes, new all-lowercase macros which do not expand to a single
identifier are still added under include/linux. It's unfortunate IMO,
but it's a fact of life. You have a point here.
> This was simply the least churny way to hook memory allocations.
This is a bold statement. You certainly know what you plan to do, but
other people keep coming up with ideas... Like, what if someone wants to
tag semaphore use: up() and down()?
Don't get me wrong. I can see the benefits of code tagging, and I
agree that my concerns are not very strong. I just want the
consequences to be understood and accepted, so that they don't take us by
surprise.
Petr T
On Wed, 3 May 2023 05:57:15 -0400
Kent Overstreet <[email protected]> wrote:
> On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > If anyone ever wants to use this code tagging framework for something
> > else, they will also have to convert relevant functions to macros,
> > slowly changing the kernel to a minefield where local identifiers,
> > struct, union and enum tags, field names and labels must avoid name
> > conflict with a tagged function. For now, I have to remember that
> > alloc_pages is forbidden, but the list may grow.
>
> Also, since you're not actually a kernel contributor yet...
I see, I've been around only since 2007...
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a97468024fb5b6eccee2a67a7796485c829343a
Petr T
On Wed, 2023-05-03 at 08:50 +1000, Dave Chinner wrote:
> On Tue, May 02, 2023 at 07:42:59AM -0400, James Bottomley wrote:
> > On Mon, 2023-05-01 at 23:17 -0400, Kent Overstreet wrote:
> > > On Mon, May 01, 2023 at 10:22:18PM -0400, James Bottomley wrote:
> > > > It is not used just for debug. It's used all over the kernel
> > > > for printing out device sizes. The output mostly goes to the
> > > > kernel print buffer, so it's anyone's guess as to what, if any,
> > > > tools are parsing it, but the concern about breaking log
> > > > parsers seems to be a valid one.
> > >
> > > Ok, there is sd_print_capacity() - but who in their right mind
> > > would be trying to scrape device sizes, in human readable units,
> >
> > If you bother to google "kernel log parser", you'll discover it's
> > quite an active area which supports a load of company business
> > models.
>
> That doesn't mean log messages are unchangeable ABI. Indeed, we had
> the whole "printk_index_emit()" addition recently to create
> an external index of printk message formats for such applications to
> use. [*]
I didn't say they were.
>
> > > from log messages when it's available in sysfs/procfs (actually,
> > > is it in sysfs? if not, that's an oversight) in more reasonable
> > > units?
> >
> > It's not in sysfs, no. As aren't a lot of things, which is why log
> > parsing for system monitoring is big business.
>
> And that big business is why printk_index_emit() exists to allow
> them to easily determine how log messages change format and come and
> go across different kernel versions.
>
> > > Correct me if I'm wrong, but I've yet to hear about kernel log
> > > messages being considered a stable interface, and this seems a bit
> > > out there.
> >
> > It might not be listed as stable, but when it's known there's a
> > large ecosystem out there consuming it we shouldn't break it just
> > because you feel like it.
>
> But we've solved this problem already, yes?
Well, yes; since it's a simple bit of extra thought and a couple of
lines addition not to afflict everyone with the change, that's the
simplest course. It also gets us out of arguing about whether the
space reads better and is SI consistent.
> If the userspace applications are not using the kernel printk format
> index to detect such changes between kernel versions, then they
> should be. This makes trivial issues, like whether we have a space or
> not between units, completely irrelevant because the entry in the
> printk format index for the log output we emit will match whatever
> is output by the kernel....
Just because we have better tools to fix a problem when it happens
doesn't mean we should actively cause the problem when it's easily
avoidable. In the same way we shouldn't drive less carefully just
because cars are built safer today.
James
On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote:
> On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > If anyone ever wants to use this code tagging framework for
> > something
> > else, they will also have to convert relevant functions to macros,
> > slowly changing the kernel to a minefield where local identifiers,
> > struct, union and enum tags, field names and labels must avoid name
> > conflict with a tagged function. For now, I have to remember that
> > alloc_pages is forbidden, but the list may grow.
>
> Also, since you're not actually a kernel contributor yet...
You have an amazing talent for being wrong. But even if you were
actually right about this, it would be an ad hominem personal attack on
a new contributor which crosses the line into unacceptable behaviour on
the list and runs counter to our code of conduct.
James
On Wed, 3 May 2023 04:05:08 -0400
Kent Overstreet <[email protected]> wrote:
> > The burden is on you and Suren. You are proposing to implement an
> > alternative tracing infrastructure.
>
> No, we're still waiting on the tracing people to _demonstrate_, not
> claim, that this is at all possible in a comparable way with tracing.
It's not my job to do your work for you!
I gave you hints on how you can do this with attaching to existing trace
events and your response was "If you don't think it's hard, go ahead and
show us." No! I'm too busy with my own work to do free work for you!
https://lore.kernel.org/all/20220905235007.sc4uk6illlog62fl@kmo-framework/
I know it's easier to create something from scratch that you fully know,
than to work with an existing infrastructure that you need to spend effort
and learn to make it do what you want. But by recreating the work, you now
pass the burden onto everyone else that needs to learn what you did. Not to
mention, we would likely have multiple ways to do the same thing.
Sorry, but that's not how an open source community is supposed to work.
-- Steve
On Wed, May 3, 2023 at 5:34 AM James Bottomley
<[email protected]> wrote:
>
> On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote:
> > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > > If anyone ever wants to use this code tagging framework for
> > > something
> > > else, they will also have to convert relevant functions to macros,
> > > slowly changing the kernel to a minefield where local identifiers,
> > > struct, union and enum tags, field names and labels must avoid name
> > > conflict with a tagged function. For now, I have to remember that
> > > alloc_pages is forbidden, but the list may grow.
> >
> > Also, since you're not actually a kernel contributor yet...
>
> You have an amazing talent for being wrong. But even if you were
> actually right about this, it would be an ad hominem personal attack on
> a new contributor which crosses the line into unacceptable behaviour on
> the list and runs counter to our code of conduct.
Kent, I asked you before and I'm asking you again. Please focus on the
technical discussion and stop personal attacks. That is extremely
counter-productive.
>
> James
>
On Wed, May 3, 2023 at 12:25 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote:
> > Memory allocation profiling infrastructure provides a low overhead
> > mechanism to make all kernel allocations in the system visible. It can be
> > used to monitor memory usage, track memory hotspots, detect memory leaks,
> > identify memory regressions.
> >
> > To keep the overhead to the minimum, we record only allocation sizes for
> > every allocation in the codebase. With that information, if users are
> > interested in more detailed context for a specific allocation, they can
> > enable in-depth context tracking, which includes capturing the pid, tgid,
> > task name, allocation size, timestamp and call stack for every allocation
> > at the specified code location.
> [...]
> > Implementation utilizes a more generic concept of code tagging, introduced
> > as part of this patchset. Code tag is a structure identifying a specific
> > location in the source code which is generated at compile time and can be
> > embedded in an application-specific structure. A number of applications
> > for code tagging have been presented in the original RFC [1].
> > Code tagging uses the old trick of "define a special elf section for
> > objects of a given type so that we can iterate over them at runtime" and
> > creates a proper library for it.
> >
> > To profile memory allocations, we instrument page, slab and percpu
> > allocators to record total memory allocated in the associated code tag at
> > every allocation in the codebase. Every time an allocation is performed by
> > an instrumented allocator, the code tag at that location increments its
> > counter by allocation size. Every time the memory is freed the counter is
> > decremented. To decrement the counter upon freeing, allocated object needs
> > a reference to its code tag. Page allocators use page_ext to record this
> > reference while slab allocators use memcg_data (renamed into more generic
> > slabobj_ext) of the slab page.
> [...]
> > [1] https://lore.kernel.org/all/[email protected]/
> [...]
> > 70 files changed, 2765 insertions(+), 554 deletions(-)
>
> Sorry for cutting the cover considerably but I believe I have quoted the
> most important/interesting parts here. The approach is not fundamentally
> different from the previous version [1] and there was a significant
> discussion around this approach. The cover letter neither summarizes nor
> deals with the concerns expressed previously AFAICS. So let me bring those
> back up.
Thanks for summarizing!
> At least those I find the most important:
> - This is a big change and it adds a significant maintenance burden
> because each allocation entry point needs to be handled specifically.
> The cost will grow with the intended coverage, especially where the
> allocation is hidden in library code.
Do you mean that with more allocations in the codebase, more codetags
will be generated? Is that the concern? Or is it, as you commented on
another patch, that the context capturing feature does not limit how
many stacks will be captured?
> - It has been brought up that this is duplicating functionality already
> available via existing tracing infrastructure. You should make it very
> clear why that is not suitable for the job
I experimented with using tracing with _RET_IP_ to implement this
accounting. The major issue is the runtime overhead of the _RET_IP_ to
codetag lookup, which is orders of magnitude higher than with the
proposed code tagging approach. With the code tagging proposal, that
link is resolved at compile time. Since we want this mechanism deployed
in production, we want to keep the overhead to the absolute minimum.
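To make the difference concrete, here is a rough userspace model in Python, not kernel code. `runtime_lookup_alloc` mimics the tracing-style approach by resolving the caller's location to a counter on every call, while `make_tagged_alloc` binds the counter once, up front, which is loosely what resolving the link at compile time buys. All names here (`Tag`, `runtime_tags`, `make_tagged_alloc`, `free`) are made up for illustration and do not appear in the patchset.

```python
import sys


class Tag:
    """Per-call-site byte counter (a stand-in for a code tag)."""

    def __init__(self, site):
        self.site = site
        self.bytes = 0


# Tracing-style model: the site is discovered and looked up on every call.
runtime_tags = {}


def runtime_lookup_alloc(size):
    frame = sys._getframe(1)  # the caller, playing the role of _RET_IP_
    site = (frame.f_code.co_filename, frame.f_lineno)
    tag = runtime_tags.get(site)
    if tag is None:  # first hit: a tag has to be allocated at runtime
        tag = runtime_tags[site] = Tag(site)
    tag.bytes += size
    return tag


# Code-tagging model: the tag is bound once, when the call site is set up.
def make_tagged_alloc(site):
    tag = Tag(site)

    def tagged_alloc(size):
        tag.bytes += size  # no lookup on the hot path
        return tag

    return tagged_alloc, tag


def free(tag, size):
    """The freed object carries a reference to its tag, so no lookup here."""
    tag.bytes -= size


# Small demo: three allocations and one free at a single tagged site.
demo_alloc, demo_tag = make_tagged_alloc("demo.c:10")
for _ in range(3):
    demo_alloc(64)
free(demo_tag, 64)
print(demo_tag.site, demo_tag.bytes)
```

Both variants end up with the same counters; the difference is only in how much work each allocation does to find its counter.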
You asked me before how much overhead would be tolerable and the
answer will always be "as small as possible". This is especially true
for slab allocators which are ridiculously fast and regressing them
would be very noticeable (due to the frequent use).
There is another issue, which I think can be solved in a smart way but
will either affect performance or would require more memory. With the
tracing approach we don't know beforehand how many individual
allocation sites exist, so we have to allocate code tags (or similar
structures for counting) at runtime vs compile time. We can be smart
about it and allocate in batches or even preallocate more than we need
beforehand but, as I said, it will require some kind of compromise.
I understand that code tagging creates additional maintenance burdens
but I hope it also produces enough benefits that people will want
this. The cost is also hopefully amortized when additional
applications like the ones we presented in RFC [1] are built using the
same framework.
> - We already have page_owner infrastructure that provides allocation
> tracking data. Why can't it be used/extended?
1. The overhead.
2. Covers only page allocators.
I didn't think about extending the page_owner approach to slab
allocators but I suspect it would not be trivial. I don't see
attaching an owner to every slab object to be a scalable solution. The
overhead would again be of concern here.
I should point out that there was one important technical concern
about lack of a kill switch for this feature, which was an issue for
distributions that can't disable the CONFIG flag. In this series we
addressed that concern.
[1] https://lore.kernel.org/all/[email protected]/
Thanks,
Suren.
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs
On Wed, May 3, 2023 at 12:36 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 01-05-23 09:54:44, Suren Baghdasaryan wrote:
> [...]
> > +static inline void add_ctx(struct codetag_ctx *ctx,
> > +			   struct codetag_with_ctx *ctc)
> > +{
> > +	kref_init(&ctx->refcount);
> > +	spin_lock(&ctc->ctx_lock);
> > +	ctx->flags = CTC_FLAG_CTX_PTR;
> > +	ctx->ctc = ctc;
> > +	list_add_tail(&ctx->node, &ctc->ctx_head);
> > +	spin_unlock(&ctc->ctx_lock);
>
> AFAIU every single tracked allocation will get its own codetag_ctx.
> There is no aggregation per allocation site or anything else. This looks
> like a scalability and a memory overhead red flag to me.
True. The allocations here would not be limited. We could introduce a
global limit to the amount of memory that we can use to store contexts
and maybe reuse the oldest entry (in LRU fashion) when we hit that
limit?
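A minimal sketch of what such a cap could look like, as a userspace Python model rather than kernel code. `ContextStore`, its `max_bytes` budget and the per-entry `cost` are assumptions made for illustration; the real thing would need to live in the allocator hooks and deal with locking.

```python
from collections import OrderedDict


class ContextStore:
    """Keeps captured allocation contexts under a global byte budget."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.entries = OrderedDict()  # key -> (cost, ctx), oldest first

    def add(self, key, ctx, cost):
        if cost > self.max_bytes:
            return False  # can never fit, so it is not recorded
        self.remove(key)  # replacing an existing entry must not double count
        # Drop the oldest contexts until the new one fits in the budget.
        while self.used + cost > self.max_bytes:
            _, (old_cost, _) = self.entries.popitem(last=False)
            self.used -= old_cost
        self.entries[key] = (cost, ctx)
        self.used += cost
        return True

    def remove(self, key):
        """Called when the tracked object is freed."""
        entry = self.entries.pop(key, None)
        if entry is not None:
            self.used -= entry[0]
```

The trade-off is that once the limit is hit, the oldest contexts are lost, so a long-lived leak captured early could be evicted by later short-lived allocations.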
>
> > +}
> > +
> > +static inline void rem_ctx(struct codetag_ctx *ctx,
> > +			   void (*free_ctx)(struct kref *refcount))
> > +{
> > +	struct codetag_with_ctx *ctc = ctx->ctc;
> > +
> > +	spin_lock(&ctc->ctx_lock);
>
> This could deadlock when allocator is called from the IRQ context.
I see. spin_lock_irqsave() then?
Thanks for the feedback!
Suren.
>
> > +	/* ctx might have been removed while we were using it */
> > +	if (!list_empty(&ctx->node))
> > +		list_del_init(&ctx->node);
> > +	spin_unlock(&ctc->ctx_lock);
> > +	kref_put(&ctx->refcount, free_ctx);
> --
> Michal Hocko
> SUSE Labs
On Wed, May 3, 2023 at 12:39 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 01-05-23 09:54:45, Suren Baghdasaryan wrote:
> [...]
> > +struct codetag_ctx *alloc_tag_create_ctx(struct alloc_tag *tag, size_t size)
> > +{
> > +	struct alloc_call_ctx *ac_ctx;
> > +
> > +	/* TODO: use a dedicated kmem_cache */
> > +	ac_ctx = kmalloc(sizeof(struct alloc_call_ctx), GFP_KERNEL);
>
> You cannot really use GFP_KERNEL here. This is post_alloc_hook path and
> that has its own gfp context.
I missed that. Would it be appropriate to use the gfp_flags parameter
of post_alloc_hook() here?
> --
> Michal Hocko
> SUSE Labs
On 5/3/23 08:18, Suren Baghdasaryan wrote:
>>> +static inline void rem_ctx(struct codetag_ctx *ctx,
>>> +			   void (*free_ctx)(struct kref *refcount))
>>> +{
>>> +	struct codetag_with_ctx *ctc = ctx->ctc;
>>> +
>>> +	spin_lock(&ctc->ctx_lock);
>> This could deadlock when allocator is called from the IRQ context.
> I see. spin_lock_irqsave() then?
Yes. But, even better, please turn on lockdep when you are testing. It
will find these for you. If you're on x86, we have a set of handy-dandy
debug options that you can add to an existing config with:
make x86_debug.config
That said, I'm as concerned as everyone else that this is all "new" code
and doesn't lean on existing tracing or things like PAGE_OWNER enough.
On Wed, May 03, 2023 at 08:33:48AM -0400, James Bottomley wrote:
> On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote:
> > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > > If anyone ever wants to use this code tagging framework for
> > > something
> > > else, they will also have to convert relevant functions to macros,
> > > slowly changing the kernel to a minefield where local identifiers,
> > > struct, union and enum tags, field names and labels must avoid name
> > > conflict with a tagged function. For now, I have to remember that
> > > alloc_pages is forbidden, but the list may grow.
> >
> > Also, since you're not actually a kernel contributor yet...
>
> You have an amazing talent for being wrong. But even if you were
> actually right about this, it would be an ad hominem personal attack on
> a new contributor which crosses the line into unacceptable behaviour on
> the list and runs counter to our code of conduct.
...Err, what? That was intended _in no way_ as a personal attack.
If I was mistaken I do apologize, but lately I've run across quite a lot
of people offering review feedback to patches I post that turn out to
have 0 or 10 patches in the kernel, and - to be blunt - a pattern of
offering feedback in strong language with a presumption of experience
that takes a lot to respond to adequately on a technical basis.
I don't think a suggestion to spend a bit more time reading code instead
of speculating is out of order! We could all put more effort into how
we offer review feedback.
On Wed, May 03, 2023 at 11:28:06AM -0400, Kent Overstreet wrote:
> On Wed, May 03, 2023 at 08:33:48AM -0400, James Bottomley wrote:
> > On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote:
> > > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > > > If anyone ever wants to use this code tagging framework for
> > > > something
> > > > else, they will also have to convert relevant functions to macros,
> > > > slowly changing the kernel to a minefield where local identifiers,
> > > > struct, union and enum tags, field names and labels must avoid name
> > > > conflict with a tagged function. For now, I have to remember that
> > > > alloc_pages is forbidden, but the list may grow.
> > >
> > > Also, since you're not actually a kernel contributor yet...
> >
> > You have an amazing talent for being wrong. But even if you were
> > actually right about this, it would be an ad hominem personal attack on
> > a new contributor which crosses the line into unacceptable behaviour on
> > the list and runs counter to our code of conduct.
>
> ...Err, what? That was intended _in no way_ as a personal attack.
>
As an outside observer, I can assure you that absolutely came across as a
personal attack, and the precise kind that puts people off from
contributing. I should know as a hobbyist contributor myself.
> If I was mistaken I do apologize, but lately I've run across quite a lot
> of people offering review feedback to patches I post that turn out to
> have 0 or 10 patches in the kernel, and - to be blunt - a pattern of
> offering feedback in strong language with a presumption of experience
> that takes a lot to respond to adequately on a technical basis.
>
I, who may very well not merit being considered a contributor of
significant merit in your view, have had such 'drive-by' commentary on some
of my patches by precisely this type of person, and at no time felt the
need to question whether they were a true Scotsman or not. It's simply not
productive.
> I don't think a suggestion to spend a bit more time reading code instead
> of speculating is out of order! We could all put more effort into how
> we offer review feedback.
It's the means by which you say it that counts for everything. If you feel
the technical comments might not be merited on a deeper level, perhaps ask
a broader question, or even don't respond at all? There are other means
available.
It's remarkable the impact comments like the one you made can have on
contributors, certainly those of us who are not maintainers and are
naturally plagued with imposter syndrome, so I would ask you on a human
level to try to be a little more considerate.
By all means address technical issues as robustly as you feel appropriate,
that is after all the purpose of code review, but just take a step back and
perhaps find the 'cuddlier' side of yourself when not addressing technical
things :)
On Wed, May 03, 2023 at 12:26:27PM +0200, Petr Tesařík wrote:
> On Wed, 3 May 2023 05:57:15 -0400
> Kent Overstreet <[email protected]> wrote:
>
> > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > > If anyone ever wants to use this code tagging framework for something
> > > else, they will also have to convert relevant functions to macros,
> > > slowly changing the kernel to a minefield where local identifiers,
> > > struct, union and enum tags, field names and labels must avoid name
> > > conflict with a tagged function. For now, I have to remember that
> > > alloc_pages is forbidden, but the list may grow.
> >
> > Also, since you're not actually a kernel contributor yet...
>
> I see, I've been around only since 2007...
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a97468024fb5b6eccee2a67a7796485c829343a
My sincere apologies :) I'd searched for your name and email and found
nothing, whoops.
On Wed, 2023-05-03 at 11:28 -0400, Kent Overstreet wrote:
> On Wed, May 03, 2023 at 08:33:48AM -0400, James Bottomley wrote:
> > On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote:
> > > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > > > If anyone ever wants to use this code tagging framework for
> > > > something else, they will also have to convert relevant
> > > > functions to macros, slowly changing the kernel to a minefield
> > > > where local identifiers, struct, union and enum tags, field
> > > > names and labels must avoid name conflict with a tagged
> > > > function. For now, I have to remember that alloc_pages is
> > > > forbidden, but the list may grow.
> > >
> > > Also, since you're not actually a kernel contributor yet...
> >
> > You have an amazing talent for being wrong. But even if you were
> > actually right about this, it would be an ad hominem personal
> > attack on a new contributor which crosses the line into
> > unacceptable behaviour on the list and runs counter to our code of
> > conduct.
>
> ...Err, what? That was intended _in no way_ as a personal attack.
Your reply went on to say "If you're going to comment, please do the
necessary work to make sure you're saying something that makes sense."
That is a personal attack belittling the person involved and holding
them up for general contempt on the mailing list. This is exactly how
we should *not* treat newcomers.
> If I was mistaken I do apologize, but lately I've run across quite a
> lot of people offering review feedback to patches I post that turn
> out to have 0 or 10 patches in the kernel, and - to be blunt - a
> pattern of offering feedback in strong language with a presumption of
> experience that takes a lot to respond to adequately on a technical
> basis.
A synopsis of the feedback is that using macros to attach trace tags
pollutes the global function namespace of the kernel. That's a valid
observation and merits a technical not a personal response.
James
On Wed, May 03, 2023 at 04:37:36PM +0100, Lorenzo Stoakes wrote:
> As an outside observer, I can assure you that absolutely came across as a
> personal attack, and the precise kind that puts people off from
> contributing. I should know as a hobbyist contributor myself.
>
> > If I was mistaken I do apologize, but lately I've run across quite a lot
> > of people offering review feedback to patches I post that turn out to
> > have 0 or 10 patches in the kernel, and - to be blunt - a pattern of
> > offering feedback in strong language with a presumption of experience
> > that takes a lot to respond to adequately on a technical basis.
> >
>
> I, who may very well not merit being considered a contributor of
> significant merit in your view, have had such 'drive-by' commentary on some
> of my patches by precisely this type of person, and at no time felt the
> need to question whether they were a true Scotsman or not. It's simply not
> productive.
>
> > I don't think a suggestion to spend a bit more time reading code instead
> > of speculating is out of order! We could all put more effort into how
> > we offer review feedback.
>
> It's the means by which you say it that counts for everything. If you feel
> the technical comments might not be merited on a deeper level, perhaps ask
> a broader question, or even don't respond at all? There are other means
> available.
>
> It's remarkable the impact comments like the one you made can have on
> contributors, certainly those of us who are not maintainers and are
> naturally plagued with imposter syndrome, so I would ask you on a human
> level to try to be a little more considerate.
>
> By all means address technical issues as robustly as you feel appropriate,
> that is after all the purpose of code review, but just take a step back and
> perhaps find the 'cuddlier' side of yourself when not addressing technical
> things :)
Thanks for your reply, it's level headed and appreciated.
But I personally value directness, and I see quite a few people in this
thread going all out on the tone policing - but look, without the
directness the confusion (that Petr is not actually a new contributor)
never would've been cleared up.
Food for thought, perhaps?
On Mon, 1 May 2023 09:54:29 -0700
Suren Baghdasaryan <[email protected]> wrote:
> After redefining alloc_pages, all uses of that name are being replaced.
> Change the conflicting names to prevent preprocessor from replacing them
> when it's not intended.
Note, every change log should have enough information in it to know why it
is being done. This says what the patch does, but does not fully explain
"why". It should never be assumed that one must read other patches to get
the context. A year from now, investigating git history, this may be the
only thing someone sees for why this change occurred.
The "why" above is simply "prevent preprocessor from replacing them
when it's not intended". What does that mean?
-- Steve
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
On Wed, 3 May 2023 08:09:28 -0700
Suren Baghdasaryan <[email protected]> wrote:
> There is another issue, which I think can be solved in a smart way but
> will either affect performance or would require more memory. With the
> tracing approach we don't know beforehand how many individual
> allocation sites exist, so we have to allocate code tags (or similar
> structures for counting) at runtime vs compile time. We can be smart
> about it and allocate in batches or even preallocate more than we need
> beforehand but, as I said, it will require some kind of compromise.
This approach is actually quite common, especially since tagging every
instance is usually overkill: if you trace function calls in a running
kernel, you will find that only a small percentage of the kernel ever
executes. It's possible that you will be allocating a lot of tags that will
never be used. If run time allocation is possible, that is usually the
better approach.
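As a rough illustration of run time allocation, here is a small Python model in which a counter slot is only handed out the first time a site is hit, with the backing storage grown in batches. `LazyTagPool`, the `batch` size and the string site names are hypothetical and only meant to show the shape of the idea.

```python
class LazyTagPool:
    """Hands out per-site counters on first use, growing storage in batches."""

    def __init__(self, batch=64):
        self.batch = batch
        self.counters = []  # backing storage, grown one batch at a time
        self.index = {}     # site -> slot in self.counters
        self.batches = 0    # number of batch allocations performed so far

    def _slot(self, site):
        slot = self.index.get(site)
        if slot is None:
            if len(self.index) == len(self.counters):
                # Out of free slots: this is the runtime allocation.
                self.counters.extend([0] * self.batch)
                self.batches += 1
            slot = self.index[site] = len(self.index)
        return slot

    def account(self, site, size):
        self.counters[self._slot(site)] += size
```

Sites that never execute never get a slot, at the price of a lookup on every call and an occasional allocation when a new batch is needed.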
-- Steve
Hello, Kent.
On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> No, we're still waiting on the tracing people to _demonstrate_, not
> claim, that this is at all possible in a comparable way with tracing.
So, we (meta) happen to do stuff like this all the time in the fleet to hunt
down tricky persistent problems like memory leaks, ref leaks, what-have-you.
In recent kernels, with kprobe and BPF, our ability to debug these sorts of
problems has improved a great deal. Below, I'm attaching a bcc script I used
to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
can follow the same pattern.
There are of course some pros and cons to this approach:
Pros:
* The framework doesn't really have any runtime overhead, so we can have it
  deployed in the entire fleet and debug wherever the problem is.
* It's fully flexible and programmable, which enables non-trivial filtering
  and summarizing to be done inside the kernel w/ BPF as necessary, which is
  pretty handy for tracking high frequency events.
* BPF is pretty performant. Dedicated built-in kernel code can do better of
  course but BPF's JIT-compiled code & its data structures are fast enough.
  I don't remember any time this was a problem.
Cons:
* BPF has some learning curve. Also, the fact that what it provides is a wide
  open field rather than something scoped out for a specific problem can
  make it seem a bit daunting at the beginning.
* Because tracking starts when the script starts running, it doesn't know
  anything which has happened up to that point, so you gotta pay attention to
  handling e.g. frees which don't match allocs. It's kinda annoying
  but not a huge problem usually. There are ways to build BPF progs into
  the kernel and load them early but I haven't experimented with that yet
  personally.
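To illustrate the last point, here is a minimal pure-Python sketch (no BPF involved) of a tracker that attaches mid-flight and therefore has to tolerate frees it has no matching alloc for. `LiveAllocTracker` and its method names are invented for this example.

```python
class LiveAllocTracker:
    """Tracks allocations seen since tracing started."""

    def __init__(self):
        self.live = {}            # ptr -> (size, site)
        self.unmatched_frees = 0  # frees with no alloc on record

    def on_alloc(self, ptr, size, site):
        self.live[ptr] = (size, site)

    def on_free(self, ptr):
        if self.live.pop(ptr, None) is None:
            # Either allocated before we attached or a double free; the
            # tracker alone cannot tell which, so it only counts the event.
            self.unmatched_frees += 1

    def outstanding_by_site(self):
        """Bytes still allocated per site, among allocations we observed."""
        totals = {}
        for size, site in self.live.values():
            totals[site] = totals.get(site, 0) + size
        return totals
```

A site whose outstanding total keeps growing over time is a leak candidate, while a burst of unmatched frees right after attaching is expected and can usually be ignored.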
I'm not necessarily against adding dedicated memory debugging mechanism but
do wonder whether the extra benefits would be enough to justify the code and
maintenance overhead.
Oh, a bit of a tangent, but for anyone who's interested in debugging
problems like this: I tend to go for bcc
(https://github.com/iovisor/bcc) for this sort of problem, while others
prefer to write against libbpf directly or use bpftrace
(https://github.com/iovisor/bpftrace).
Thanks.
#!/usr/bin/env bcc-py

import bcc
import time
import datetime
import argparse
import os
import sys
import errno

description = """
Record vmalloc/vfrees and trigger on unmatched vfree
"""

bpf_source = """
#include <uapi/linux/ptrace.h>
#include <linux/vmalloc.h>

struct vmalloc_rec {
	unsigned long ptr;
	int last_alloc_stkid;
	int last_free_stkid;
	int this_stkid;
	bool allocated;
};

BPF_STACK_TRACE(stacks, 8192);
BPF_HASH(vmallocs, unsigned long, struct vmalloc_rec, 131072);
BPF_ARRAY(dup_free, struct vmalloc_rec, 1);

int kpret_vmalloc_node_range(struct pt_regs *ctx)
{
	unsigned long ptr = PT_REGS_RC(ctx);
	uint32_t zkey = 0;
	struct vmalloc_rec rec_init = { };
	struct vmalloc_rec *rec;
	int stkid;

	if (!ptr)
		return 0;

	stkid = stacks.get_stackid(ctx, 0);

	rec_init.ptr = ptr;
	rec_init.last_alloc_stkid = -1;
	rec_init.last_free_stkid = -1;
	rec_init.this_stkid = -1;

	rec = vmallocs.lookup_or_init(&ptr, &rec_init);
	rec->allocated = true;
	rec->last_alloc_stkid = stkid;
	return 0;
}

int kp_vfree(struct pt_regs *ctx, const void *addr)
{
	unsigned long ptr = (unsigned long)addr;
	uint32_t zkey = 0;
	struct vmalloc_rec rec_init = { };
	struct vmalloc_rec *rec;
	int stkid;

	stkid = stacks.get_stackid(ctx, 0);

	rec_init.ptr = ptr;
	rec_init.last_alloc_stkid = -1;
	rec_init.last_free_stkid = -1;
	rec_init.this_stkid = -1;

	rec = vmallocs.lookup_or_init(&ptr, &rec_init);
	if (!rec->allocated && rec->last_alloc_stkid >= 0) {
		rec->this_stkid = stkid;
		dup_free.update(&zkey, rec);
	}

	rec->allocated = false;
	rec->last_free_stkid = stkid;
	return 0;
}
"""

bpf = bcc.BPF(text=bpf_source)
bpf.attach_kretprobe(event="__vmalloc_node_range", fn_name="kpret_vmalloc_node_range");
bpf.attach_kprobe(event="vfree", fn_name="kp_vfree");
bpf.attach_kprobe(event="vfree_atomic", fn_name="kp_vfree");

stacks = bpf["stacks"]
vmallocs = bpf["vmallocs"]
dup_free = bpf["dup_free"]
last_dup_free_ptr = dup_free[0].ptr

def print_stack(stkid):
    for addr in stacks.walk(stkid):
        sym = bpf.ksym(addr)
        print('  {}'.format(sym))

def print_dup(dup):
    print('allocated={} ptr={}'.format(dup.allocated, hex(dup.ptr)))
    if (dup.last_alloc_stkid >= 0):
        print('last_alloc_stack: ')
        print_stack(dup.last_alloc_stkid)
    if (dup.last_free_stkid >= 0):
        print('last_free_stack: ')
        print_stack(dup.last_free_stkid)
    if (dup.this_stkid >= 0):
        print('this_stack: ')
        print_stack(dup.this_stkid)

while True:
    time.sleep(1)

    if dup_free[0].ptr != last_dup_free_ptr:
        print('\nDUP_FREE:')
        print_dup(dup_free[0])
        last_dup_free_ptr = dup_free[0].ptr
On Wed, May 3, 2023 at 9:28 AM Steven Rostedt <[email protected]> wrote:
>
> On Wed, 3 May 2023 08:09:28 -0700
> Suren Baghdasaryan <[email protected]> wrote:
>
> > There is another issue, which I think can be solved in a smart way but
> > will either affect performance or would require more memory. With the
> > tracing approach we don't know beforehand how many individual
> > allocation sites exist, so we have to allocate code tags (or similar
> > structures for counting) at runtime vs compile time. We can be smart
> > about it and allocate in batches or even preallocate more than we need
> > beforehand but, as I said, it will require some kind of compromise.
>
> This approach is actually quite common, especially since tagging every
> instance is usually overkill, as if you trace function calls in a running
> kernel, you will find that only a small percentage of the kernel ever
> executes. It's possible that you will be allocating a lot of tags that will
> never be used. If run time allocation is possible, that is usually the
> better approach.
True, but the memory overhead should not be prohibitive here. As a
ballpark number, on my machine I see 4838 individual allocation
locations and each codetag structure is 32 bytes, so that's about
151KiB.
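The ballpark is easy to check from the two figures given here (4838 locations, 32 bytes per codetag); the variable names below are just for the calculation.

```python
sites = 4838   # individual allocation locations reported on this machine
tag_size = 32  # bytes per codetag structure

total = sites * tag_size
print(total, "bytes")                 # 154816 bytes
print(round(total / 1024, 1), "KiB")  # 151.2 KiB
```

This covers only the static tags themselves, not any per-allocation context captured on top of them.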
>
> -- Steve
On Wed, May 3, 2023 at 9:35 AM Tejun Heo <[email protected]> wrote:
>
> Hello, Kent.
>
> On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> > No, we're still waiting on the tracing people to _demonstrate_, not
> > claim, that this is at all possible in a comparable way with tracing.
>
> So, we (meta) happen to do stuff like this all the time in the fleet to hunt
> down tricky persistent problems like memory leaks, ref leaks, what-have-you.
> In recent kernels, with kprobe and BPF, our ability to debug these sorts of
> problems has improved a great deal. Below, I'm attaching a bcc script I used
> to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
> can follow the same pattern.
Thanks for sharing, Tejun!
>
> There are of course some pros and cons to this approach:
>
> Pros:
>
> * The framework doesn't really have any runtime overhead, so we can have it
> deployed in the entire fleet and debug wherever the problem is.
Do you mean it has no runtime overhead when disabled?
If so, do you know what the overhead is when enabled? I want to
understand if that's truly a viable solution to track all allocations
(including slab) all the time.
Thanks,
Suren.
>
> * It's fully flexible and programmable which enables non-trivial filtering
> and summarizing to be done inside kernel w/ BPF as necessary, which is
> pretty handy for tracking high frequency events.
>
> * BPF is pretty performant. Dedicated built-in kernel code can do better of
> course but BPF's jit compiled code & its data structures are fast enough.
> I don't remember any time this was a problem.
>
> Cons:
>
> * BPF has some learning curve. Also the fact that what it provides is a wide
> open field rather than something scoped out for a specific problem can
> make it seem a bit daunting at the beginning.
>
> * Because tracking starts when the script starts running, it doesn't know
> anything which has happened up to that point, so you gotta pay attention to
> handling e.g. frees which don't match allocs. It's kinda annoying
> but not a huge problem usually. There are ways to build BPF progs into
> the kernel and load them early but I haven't experimented with that yet
> personally.
>
> I'm not necessarily against adding dedicated memory debugging mechanism but
> do wonder whether the extra benefits would be enough to justify the code and
> maintenance overhead.
>
> Oh, a bit of a tangent, but for anyone who's interested in debugging
> problems like this: I tend to go for bcc
> (https://github.com/iovisor/bcc) for this sort of problem, while others
> prefer to write against libbpf directly or use bpftrace
> (https://github.com/iovisor/bpftrace).
>
> Thanks.
>
> #!/usr/bin/env bcc-py
>
> import bcc
> import time
> import datetime
> import argparse
> import os
> import sys
> import errno
>
> description = """
> Record vmalloc/vfrees and trigger on unmatched vfree
> """
>
> bpf_source = """
> #include <uapi/linux/ptrace.h>
> #include <linux/vmalloc.h>
>
> struct vmalloc_rec {
>     unsigned long ptr;
>     int last_alloc_stkid;
>     int last_free_stkid;
>     int this_stkid;
>     bool allocated;
> };
>
> BPF_STACK_TRACE(stacks, 8192);
> BPF_HASH(vmallocs, unsigned long, struct vmalloc_rec, 131072);
> BPF_ARRAY(dup_free, struct vmalloc_rec, 1);
>
> int kpret_vmalloc_node_range(struct pt_regs *ctx)
> {
>     unsigned long ptr = PT_REGS_RC(ctx);
>     uint32_t zkey = 0;
>     struct vmalloc_rec rec_init = { };
>     struct vmalloc_rec *rec;
>     int stkid;
>
>     if (!ptr)
>         return 0;
>
>     stkid = stacks.get_stackid(ctx, 0);
>
>     rec_init.ptr = ptr;
>     rec_init.last_alloc_stkid = -1;
>     rec_init.last_free_stkid = -1;
>     rec_init.this_stkid = -1;
>
>     rec = vmallocs.lookup_or_init(&ptr, &rec_init);
>     rec->allocated = true;
>     rec->last_alloc_stkid = stkid;
>     return 0;
> }
>
> int kp_vfree(struct pt_regs *ctx, const void *addr)
> {
>     unsigned long ptr = (unsigned long)addr;
>     uint32_t zkey = 0;
>     struct vmalloc_rec rec_init = { };
>     struct vmalloc_rec *rec;
>     int stkid;
>
>     stkid = stacks.get_stackid(ctx, 0);
>
>     rec_init.ptr = ptr;
>     rec_init.last_alloc_stkid = -1;
>     rec_init.last_free_stkid = -1;
>     rec_init.this_stkid = -1;
>
>     rec = vmallocs.lookup_or_init(&ptr, &rec_init);
>     if (!rec->allocated && rec->last_alloc_stkid >= 0) {
>         rec->this_stkid = stkid;
>         dup_free.update(&zkey, rec);
>     }
>
>     rec->allocated = false;
>     rec->last_free_stkid = stkid;
>     return 0;
> }
> """
>
> bpf = bcc.BPF(text=bpf_source)
> bpf.attach_kretprobe(event="__vmalloc_node_range", fn_name="kpret_vmalloc_node_range");
> bpf.attach_kprobe(event="vfree", fn_name="kp_vfree");
> bpf.attach_kprobe(event="vfree_atomic", fn_name="kp_vfree");
>
> stacks = bpf["stacks"]
> vmallocs = bpf["vmallocs"]
> dup_free = bpf["dup_free"]
> last_dup_free_ptr = dup_free[0].ptr
>
> def print_stack(stkid):
>     for addr in stacks.walk(stkid):
>         sym = bpf.ksym(addr)
>         print(' {}'.format(sym))
>
> def print_dup(dup):
>     print('allocated={} ptr={}'.format(dup.allocated, hex(dup.ptr)))
>     if (dup.last_alloc_stkid >= 0):
>         print('last_alloc_stack: ')
>         print_stack(dup.last_alloc_stkid)
>     if (dup.last_free_stkid >= 0):
>         print('last_free_stack: ')
>         print_stack(dup.last_free_stkid)
>     if (dup.this_stkid >= 0):
>         print('this_stack: ')
>         print_stack(dup.this_stkid)
>
> while True:
>     time.sleep(1)
>
>     if dup_free[0].ptr != last_dup_free_ptr:
>         print('\nDUP_FREE:')
>         print_dup(dup_free[0])
>         last_dup_free_ptr = dup_free[0].ptr
>
>
On Wed, May 03, 2023 at 06:35:49AM -1000, Tejun Heo wrote:
> Hello, Kent.
>
> On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> > No, we're still waiting on the tracing people to _demonstrate_, not
> > claim, that this is at all possible in a comparable way with tracing.
>
> So, we (meta) happen to do stuff like this all the time in the fleet to hunt
> down tricky persistent problems like memory leaks, ref leaks, what-have-you.
> In recent kernels, with kprobe and BPF, our ability to debug these sorts of
> problems has improved a great deal. Below, I'm attaching a bcc script I used
> to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
> can follow the same pattern.
>
> There are of course some pros and cons to this approach:
>
> Pros:
>
> * The framework doesn't really have any runtime overhead, so we can have it
> deployed in the entire fleet and debug wherever problem is.
>
> * It's fully flexible and programmable which enables non-trivial filtering
> and summarizing to be done inside kernel w/ BPF as necessary, which is
> pretty handy for tracking high frequency events.
>
> * BPF is pretty performant. Dedicated built-in kernel code can do better of
> course but BPF's jit compiled code & its data structures are fast enough.
> I don't remember any time this was a problem.
You're still going to have the inherent overhead of a separate index of
outstanding memory allocations, so that frees can be decremented against the
correct callsite.
The BPF approach is going to be _way_ higher overhead if you try to use
it as a general profiler, like this is.
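To make that cost concrete, here is a minimal Python sketch of the bookkeeping a tracing-based profiler would need. It is an illustration only, not code from the patchset or from bcc; the names OutstandingIndex, on_alloc and on_free are hypothetical, and the addresses and sizes are made up.

```python
from collections import defaultdict


class OutstandingIndex:
    """Toy model of the side index a tracing-based profiler must maintain."""

    def __init__(self):
        # address -> (callsite, size) for every allocation still outstanding
        self.live = {}
        # callsite -> bytes currently allocated from that site
        self.bytes_by_site = defaultdict(int)

    def on_alloc(self, addr, size, callsite):
        self.live[addr] = (callsite, size)
        self.bytes_by_site[callsite] += size

    def on_free(self, addr):
        # A free only carries the address, so the callsite has to be looked up.
        entry = self.live.pop(addr, None)
        if entry is None:
            return False  # allocation predates tracing, or a double free
        callsite, size = entry
        self.bytes_by_site[callsite] -= size
        return True


idx = OutstandingIndex()
idx.on_alloc(0x1000, 4096, "mm/slub.c:1826")
idx.on_alloc(0x2000, 512, "fs/xfs/kmem.c:20")
idx.on_free(0x1000)
```

Every alloc and every free touches the `live` map, which is the extra lookup being discussed; a scheme that stores the tag next to the object does not need it.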
On Wed, May 03, 2023 at 06:35:49AM -1000, Tejun Heo wrote:
> Hello, Kent.
>
> On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> > No, we're still waiting on the tracing people to _demonstrate_, not
> > claim, that this is at all possible in a comparable way with tracing.
>
> So, we (meta) happen to do stuff like this all the time in the fleet to hunt
> down tricky persistent problems like memory leaks, ref leaks, what-have-you.
> In recent kernels, with kprobe and BPF, our ability to debug these sorts of
> problems has improved a great deal. Below, I'm attaching a bcc script I used
> to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
> can follow the same pattern.
>
> There are of course some pros and cons to this approach:
>
> Pros:
>
> * The framework doesn't really have any runtime overhead, so we can have it
> deployed in the entire fleet and debug wherever problem is.
>
> * It's fully flexible and programmable which enables non-trivial filtering
> and summarizing to be done inside kernel w/ BPF as necessary, which is
> pretty handy for tracking high frequency events.
>
> * BPF is pretty performant. Dedicated built-in kernel code can do better of
> course but BPF's jit compiled code & its data structures are fast enough.
> I don't remember any time this was a problem.
>
> Cons:
>
> * BPF has some learning curve. Also the fact that what it provides is a wide
> open field rather than something scoped out for a specific problem can
> make it seem a bit daunting at the beginning.
>
> * Because tracking starts when the script starts running, it doesn't know
> anything which has happened up to that point, so you gotta pay attention to
> handling e.g. frees which don't match allocs. It's kinda annoying
> but not a huge problem usually. There are ways to build BPF progs into
> the kernel and load them early but I haven't experimented with it yet
> personally.
>
> I'm not necessarily against adding dedicated memory debugging mechanism but
> do wonder whether the extra benefits would be enough to justify the code and
> maintenance overhead.
>
> Oh, a bit of delta but for anyone who's more interested in debugging
> problems like this: I tend to go for bcc
> (https://github.com/iovisor/bcc) for this sort of problem, while others prefer to
> write against libbpf directly or use bpftrace
> (https://github.com/iovisor/bpftrace).
Do you have example output?
TBH I'm skeptical that it's even possible to do full memory allocation
profiling with tracing/bpf, due to recursive memory allocations and
needing an index of outstanding allocations.
On Wed, May 3, 2023 at 9:25 AM Steven Rostedt <[email protected]> wrote:
>
> On Mon, 1 May 2023 09:54:29 -0700
> Suren Baghdasaryan <[email protected]> wrote:
>
> > After redefining alloc_pages, all uses of that name are being replaced.
> > Change the conflicting names to prevent preprocessor from replacing them
> > when it's not intended.
>
> Note, every change log should have enough information in it to know why it
> is being done. This says what the patch does, but does not fully explain
> "why". It should never be assumed that one must read other patches to get
> the context. A year from now, investigating git history, this may be the
> only thing someone sees for why this change occurred.
>
> The "why" above is simply "prevent preprocessor from replacing them
> when it's not intended". What does that mean?
Thanks for the feedback, Steve. I'll make appropriate modifications to
the description.
>
> -- Steve
>
>
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>
>
On Wed, 3 May 2023 10:40:42 -0700
Suren Baghdasaryan <[email protected]> wrote:
> > This approach is actually quite common, especially since tagging every
> > instance is usually overkill, as if you trace function calls in a running
> > kernel, you will find that only a small percentage of the kernel ever
> > executes. It's possible that you will be allocating a lot of tags that will
> > never be used. If run time allocation is possible, that is usually the
> > better approach.
>
> True but the memory overhead should not be prohibitive here. As a
> ballpark number, on my machine I see there are 4838 individual
> allocation locations and each codetag structure is 32 bytes, so that's
> 152KB.
If it's not that big, then allocating at runtime should not be an issue
either. If runtime allocation can make it less intrusive to the code, that
would be more rationale to do so.
-- Steve
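The ballpark figure quoted above can be checked with quick arithmetic. The two inputs are the numbers from the quoted message; expressed in KiB the result comes out slightly under the quoted 152KB.

```python
NUM_CALLSITES = 4838   # individual allocation locations quoted above
CODETAG_BYTES = 32     # size of one codetag structure quoted above

total_bytes = NUM_CALLSITES * CODETAG_BYTES
total_kib = total_bytes / 1024
print(f"{total_bytes} bytes = {total_kib:.1f} KiB")
# prints "154816 bytes = 151.2 KiB"
```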
Hello, Suren.
On Wed, May 03, 2023 at 10:42:11AM -0700, Suren Baghdasaryan wrote:
> > * The framework doesn't really have any runtime overhead, so we can have it
> > deployed in the entire fleet and debug wherever problem is.
>
> Do you mean it has no runtime overhead when disabled?
Yes, that's what I meant.
> If so, do you know what's the overhead when enabled? I want to
> understand if that's truly a viable solution to track all allocations
> (including slab) all the time.
(cc'ing Alexei and Andrii who know a lot better than me)
I don't have enough concrete benchmark data on hand to answer
definitively but hopefully my general impression will help. We attach
BPF programs to both per-packet and per-IO paths. They obviously aren't free
but their overhead isn't significantly higher than building in the same thing
in C code. Once loaded, BPF progs are jit compiled into native code. The
generated code will be a bit worse than regularly compiled C code but those
are really micro differences. There's some bridging code to jump into BPF
but again negligible / acceptable even in the hottest paths.
In terms of execution overhead, I don't think there is a significant
disadvantage to doing these things in BPF. Bigger differences would likely
be in tracking data structures and locking around them. One can definitely
better integrate tracking into alloc / free paths piggybacking on existing
locking and whatnot. That said, BPF hashtable is pretty fast and BPF is
constantly improving in terms of data structure support.
It really depends on the workload and how much overhead one considers
acceptable and I'm sure persistent global tracking can be done more
efficiently with built-in C code. That said, done right, the overhead
difference most likely isn't gonna be orders of magnitude but more like in
the realm of tens of percent, if that.
So, it doesn't nullify the benefits a dedicated mechanism can bring but does
change the conversation quite a bit. Is the extra code justifiable given
that most of what it enables is already possible using a more generic
mechanism, albeit at a bit higher cost? That may well be the case but it
does raise the bar.
Thanks.
--
tejun
On Wed, May 03, 2023 at 06:35:49AM -1000, Tejun Heo wrote:
> Hello, Kent.
>
> On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> > No, we're still waiting on the tracing people to _demonstrate_, not
> > claim, that this is at all possible in a comparable way with tracing.
>
> So, we (meta) happen to do stuff like this all the time in the fleet to hunt
> down tricky persistent problems like memory leaks, ref leaks, what-have-you.
> In recent kernels, with kprobe and BPF, our ability to debug these sorts of
> problems has improved a great deal. Below, I'm attaching a bcc script I used
> to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
> can follow the same pattern.
>
> There are of course some pros and cons to this approach:
>
> Pros:
>
> * The framework doesn't really have any runtime overhead, so we can have it
> deployed in the entire fleet and debug wherever problem is.
>
> * It's fully flexible and programmable which enables non-trivial filtering
> and summarizing to be done inside kernel w/ BPF as necessary, which is
> pretty handy for tracking high frequency events.
>
> * BPF is pretty performant. Dedicated built-in kernel code can do better of
> course but BPF's jit compiled code & its data structures are fast enough.
> I don't remember any time this was a problem.
>
> Cons:
>
> * BPF has some learning curve. Also the fact that what it provides is a wide
> open field rather than something scoped out for a specific problem can
> make it seem a bit daunting at the beginning.
>
> * Because tracking starts when the script starts running, it doesn't know
> anything which has happened up to that point, so you gotta pay attention to
> handling e.g. frees which don't match allocs. It's kinda annoying
> but not a huge problem usually. There are ways to build BPF progs into
> the kernel and load them early but I haven't experimented with it yet
> personally.
Yeah, early loading is definitely important, especially before module
loading etc.
One common usecase is that we see a machine in the wild with a high
amount of kernel memory disappearing somewhere that isn't voluntarily
reported in vmstat/meminfo. Reproducing it isn't always
practical. Something that records early and always (with acceptable
runtime overhead) would be the holy grail.
Matching allocs to frees is doable using the pfn as the key for pages,
and virtual addresses for slab objects.
The biggest issue I had when I tried with bpf was losing updates to
the map. IIRC there is some trylocking going on to avoid deadlocks
from nested contexts (alloc interrupted, interrupt frees). It doesn't
sound like an unsolvable problem, though.
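A rough userspace analogy of how trylocking loses updates, as far as I understand the mechanism. This is a sketch, not BPF map code: TrylockCounter is a hypothetical name, and the "interrupt" is simulated by a callback that runs while the lock is held.

```python
import threading


class TrylockCounter:
    """Counter whose update is dropped when the lock is already held."""

    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant, like a spinlock
        self.value = 0
        self.dropped = 0               # unlocked; fine for this single-threaded demo

    def add(self, n, nested=None):
        # Never block: a nested context waiting on a lock held by the
        # context it interrupted would deadlock, so give up instead.
        if not self._lock.acquire(blocking=False):
            self.dropped += 1
            return False
        try:
            self.value += n
            if nested is not None:
                nested()  # simulate an interrupt arriving mid-update
        finally:
            self._lock.release()
        return True


counter = TrylockCounter()
# The "alloc" update is interrupted by a "free" touching the same counter.
counter.add(4096, nested=lambda: counter.add(-4096))
print(counter.value, counter.dropped)  # prints "4096 1"
```

The free is silently lost, so the counter stays at 4096 even though nothing is outstanding anymore.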
Another minor thing was the stack trace map exploding on a basically
infinite number of unique interrupt stacks. This could probably also
be solved by extending the trace extraction API to cut the frames off
at the context switch boundary.
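A sketch of what such trimming could look like if done in post-processing. The entry symbol names and both stacks below are illustrative assumptions, not taken from a real trace, and trim_at_irq_entry is a hypothetical helper rather than an existing API.

```python
# Illustrative entry-point names; a real tool would need the
# architecture's actual interrupt entry symbols.
IRQ_ENTRY_SYMBOLS = {"asm_common_interrupt", "asm_sysvec_apic_timer_interrupt"}


def trim_at_irq_entry(frames):
    """Keep frames from the innermost call up to the interrupt entry.

    `frames` is ordered innermost first, so everything after the entry
    symbol belongs to whatever code happened to be interrupted.
    """
    for i, sym in enumerate(frames):
        if sym in IRQ_ENTRY_SYMBOLS:
            return tuple(frames[: i + 1])
    return tuple(frames)


# Same interrupt-side allocation, two different interrupted contexts.
stack_a = ["kmalloc", "some_irq_handler", "asm_common_interrupt", "do_sys_open"]
stack_b = ["kmalloc", "some_irq_handler", "asm_common_interrupt", "schedule"]
unique = {trim_at_irq_entry(s) for s in (stack_a, stack_b)}
print(len(unique))  # prints 1
```

Without trimming, the two stacks would occupy two slots in the stack trace map even though the allocation path is identical.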
Taking a step back though, given the multitude of allocation sites in
the kernel, it's a bit odd that the only accounting we do is the tiny
fraction of voluntary vmstat/meminfo reporting. We try to cover the
biggest consumers with this of course, but it's always going to be
incomplete and is maintenance overhead too. There are on average
several gigabytes in unknown memory (total - known vmstats) on our
machines. It's difficult to detect regressions easily. And it's by
definition the unexpected cornercases that are the trickiest to track
down. So it might be doable with BPF, but it does feel like the kernel
should do a better job of tracking out of the box and without
requiring too much plumbing and somewhat fragile kernel allocation API
tracking and probing from userspace.
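For illustration, the "total - known vmstats" figure can be approximated along these lines. The sample values are made up, and the set of fields treated as "known" is an assumption chosen for the example; it is neither exhaustive nor free of overlap on a real system.

```python
# Made-up sample in /proc/meminfo format, so the sketch is self-contained.
SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
MemFree:         2048000 kB
Buffers:          256000 kB
Cached:          6144000 kB
AnonPages:       4096000 kB
Slab:            1024000 kB
KernelStack:       32000 kB
PageTables:        64000 kB
"""

# Assumed, non-exhaustive set of counters treated as "known" consumers.
KNOWN_FIELDS = ("MemFree", "Buffers", "Cached", "AnonPages",
                "Slab", "KernelStack", "PageTables")


def parse_meminfo(text):
    """Return {field name: value in kB} for each 'Name: value kB' line."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if rest:
            fields[name.strip()] = int(rest.split()[0])
    return fields


def unaccounted_kb(fields):
    """Memory not covered by any of the counters in KNOWN_FIELDS."""
    return fields["MemTotal"] - sum(fields.get(k, 0) for k in KNOWN_FIELDS)


info = parse_meminfo(SAMPLE_MEMINFO)
print(unaccounted_kb(info), "kB unaccounted")  # prints "2720000 kB unaccounted"
```

The number says how much is missing but nothing about which allocation site holds it, which is the gap being described.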
On Wed, May 3, 2023 at 11:03 AM Steven Rostedt <[email protected]> wrote:
>
> On Wed, 3 May 2023 10:40:42 -0700
> Suren Baghdasaryan <[email protected]> wrote:
>
> > > This approach is actually quite common, especially since tagging every
> > > instance is usually overkill, as if you trace function calls in a running
> > > kernel, you will find that only a small percentage of the kernel ever
> > > executes. It's possible that you will be allocating a lot of tags that will
> > > never be used. If run time allocation is possible, that is usually the
> > > better approach.
> >
> > True but the memory overhead should not be prohibitive here. As a
> > ballpark number, on my machine I see there are 4838 individual
> > allocation locations and each codetag structure is 32 bytes, so that's
> > 152KB.
>
> If it's not that big, then allocating at runtime should not be an issue
> either. If runtime allocation can make it less intrusive to the code, that
> would be more rationale to do so.
As I noted, this issue is minor since we can be smart about how we
allocate these entries. The main issue is the performance overhead.
The kmalloc path is extremely fast and very hot. Even adding a per-cpu
increment in our patchset has a 35% overhead. Adding an additional
lookup here would prevent us from having it enabled all the time in
production.
>
> -- Steve
On Wed, May 03, 2023 at 02:03:37PM -0400, Steven Rostedt wrote:
> On Wed, 3 May 2023 10:40:42 -0700
> Suren Baghdasaryan <[email protected]> wrote:
>
> > > This approach is actually quite common, especially since tagging every
> > > instance is usually overkill, as if you trace function calls in a running
> > > kernel, you will find that only a small percentage of the kernel ever
> > > executes. It's possible that you will be allocating a lot of tags that will
> > > never be used. If run time allocation is possible, that is usually the
> > > better approach.
> >
> > True but the memory overhead should not be prohibitive here. As a
> > ballpark number, on my machine I see there are 4838 individual
> > allocation locations and each codetag structure is 32 bytes, so that's
> > 152KB.
>
> If it's not that big, then allocating at runtime should not be an issue
> either. If runtime allocation can make it less intrusive to the code, that
> would be more rationale to do so.
We're more optimizing for runtime overhead - a major goal of this
patchset was to be cheap enough to be always on, we've got too many
debugging features that are really useful, but too expensive to have on
all the time.
Doing more runtime allocation would add another pointer fetch to the
fast paths - and I don't see how it would even be possible to runtime
allocate the codetag struct itself.
We already do runtime allocation of percpu counters; see the lazy percpu
counter patch.
Hello,
On Wed, May 03, 2023 at 02:07:26PM -0400, Johannes Weiner wrote:
...
> > * Because tracking starts when the script starts running, it doesn't know
> > anything which has happened up to that point, so you gotta pay attention to
> > handling e.g. frees which don't match allocs. It's kinda annoying
> > but not a huge problem usually. There are ways to build BPF progs into
> > the kernel and load them early but I haven't experimented with it yet
> > personally.
>
> Yeah, early loading is definitely important, especially before module
> loading etc.
>
> One common usecase is that we see a machine in the wild with a high
> amount of kernel memory disappearing somewhere that isn't voluntarily
> reported in vmstat/meminfo. Reproducing it isn't always
> practical. Something that records early and always (with acceptable
> runtime overhead) would be the holy grail.
>
> Matching allocs to frees is doable using the pfn as the key for pages,
> and virtual addresses for slab objects.
>
> The biggest issue I had when I tried with bpf was losing updates to
> the map. IIRC there is some trylocking going on to avoid deadlocks
> from nested contexts (alloc interrupted, interrupt frees). It doesn't
> sound like an unsolvable problem, though.
(cc'ing Alexei and Andrii)
This is the same thing that I hit with sched_ext. BPF plugged it for
struct_ops but I wonder whether it can be done for specific maps / progs -
ie. just declare that a given map or prog is not to be accessed from NMI and
bypass the trylock deadlock avoidance mechanism. But, yeah, this should be
addressed from BPF side.
> Another minor thing was the stack trace map exploding on a basically
> infinite number of unique interrupt stacks. This could probably also
> be solved by extending the trace extraction API to cut the frames off
> at the context switch boundary.
>
> Taking a step back though, given the multitude of allocation sites in
> the kernel, it's a bit odd that the only accounting we do is the tiny
> fraction of voluntary vmstat/meminfo reporting. We try to cover the
> biggest consumers with this of course, but it's always going to be
> incomplete and is maintenance overhead too. There are on average
> several gigabytes in unknown memory (total - known vmstats) on our
> machines. It's difficult to detect regressions easily. And it's per
> definition the unexpected cornercases that are the trickiest to track
> down. So it might be doable with BPF, but it does feel like the kernel
> should do a better job of tracking out of the box and without
> requiring too much plumbing and somewhat fragile kernel allocation API
> tracking and probing from userspace.
Yeah, easy / default visibility argument does make sense to me.
Thanks.
--
tejun
Hello,
On Wed, May 03, 2023 at 01:51:23PM -0400, Kent Overstreet wrote:
> Do you have example output?
Not right now. It's from many months ago. It's just a script I could find
easily.
> TBH I'm skeptical that it's even possible to do full memory allocation
> profiling with tracing/bpf, due to recursive memory allocations and
> needing an index of outstanding allcations.
There are some issues e.g. w/ lossy updates which should be fixed from BPF
side but we do run BPF on every single packet and IO on most of our
machines, so basing this argument on whether tracking all memory allocations
from BPF is possible is probably not a winning strategy for this proposal.
Thanks.
--
tejun
On Wed, May 03, 2023 at 08:19:24AM -1000, Tejun Heo wrote:
> > Taking a step back though, given the multitude of allocation sites in
> > the kernel, it's a bit odd that the only accounting we do is the tiny
> > fraction of voluntary vmstat/meminfo reporting. We try to cover the
> > biggest consumers with this of course, but it's always going to be
> > incomplete and is maintenance overhead too. There are on average
> > several gigabytes in unknown memory (total - known vmstats) on our
> > machines. It's difficult to detect regressions easily. And it's per
> > definition the unexpected cornercases that are the trickiest to track
> > down. So it might be doable with BPF, but it does feel like the kernel
> > should do a better job of tracking out of the box and without
> > requiring too much plumbing and somewhat fragile kernel allocation API
> > tracking and probing from userspace.
>
> Yeah, easy / default visibility argument does make sense to me.
So, a bit of addition here. If this is the thrust, the debugfs part seems
rather redundant, right? That's trivially obtainable with tracing / bpf and
in a more flexible and performant manner. Also, are we happy with recording
just single depth for persistent tracking?
Thanks.
--
tejun
On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote:
> On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote:
> > > Yeah, easy / default visibility argument does make sense to me.
> >
> > So, a bit of addition here. If this is the thrust, the debugfs part seems
> > rather redundant, right? That's trivially obtainable with tracing / bpf and
> > in a more flexible and performant manner. Also, are we happy with recording
> > just single depth for persistent tracking?
>
> Not sure what you're envisioning?
>
> I'd consider the debugfs interface pretty integral; it's much more
> discoverable for users, and it's hardly any code out of the whole
> patchset.
You can do the same thing with a bpftrace one liner tho. That's rather
difficult to beat.
Thanks.
--
tejun
On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote:
> > Yeah, easy / default visibility argument does make sense to me.
>
> So, a bit of addition here. If this is the thrust, the debugfs part seems
> rather redundant, right? That's trivially obtainable with tracing / bpf and
> in a more flexible and performant manner. Also, are we happy with recording
> just single depth for persistent tracking?
Not sure what you're envisioning?
I'd consider the debugfs interface pretty integral; it's much more
discoverable for users, and it's hardly any code out of the whole
patchset.
Single depth was discussed previously. It's what makes it cheap enough
to be always-on (saving stack traces is expensive!), and I find the
output much more usable than e.g. page owner.
On Wed, May 03, 2023 at 08:58:51AM -1000, Tejun Heo wrote:
> On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote:
> > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote:
> > > > Yeah, easy / default visibility argument does make sense to me.
> > >
> > > So, a bit of addition here. If this is the thrust, the debugfs part seems
> > > rather redundant, right? That's trivially obtainable with tracing / bpf and
> > > in a more flexible and performant manner. Also, are we happy with recording
> > > just single depth for persistent tracking?
> >
> > Not sure what you're envisioning?
> >
> > I'd consider the debugfs interface pretty integral; it's much more
> > discoverable for users, and it's hardly any code out of the whole
> > patchset.
>
> You can do the same thing with a bpftrace one liner tho. That's rather
> difficult to beat.
Ah, shit, I'm an idiot. Sorry. I thought allocations was under /proc and
allocations.ctx under debugfs. I meant allocations.ctx is redundant.
Thanks.
--
tejun
Hello,
On Wed, May 03, 2023 at 12:41:08PM -0700, Suren Baghdasaryan wrote:
> On Wed, May 3, 2023 at 12:09 PM Tejun Heo <[email protected]> wrote:
> >
> > On Wed, May 03, 2023 at 08:58:51AM -1000, Tejun Heo wrote:
> > > On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote:
> > > > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote:
> > > > > > Yeah, easy / default visibility argument does make sense to me.
> > > > >
> > > > > So, a bit of addition here. If this is the thrust, the debugfs part seems
> > > > > rather redundant, right? That's trivially obtainable with tracing / bpf and
> > > > > in a more flexible and performant manner. Also, are we happy with recording
> > > > > just single depth for persistent tracking?
>
> IIUC, by single depth you mean no call stack capturing?
Yes.
> If so, that's the idea behind the context capture feature so that we
> can enable it on specific allocations only after we determine there is
> something interesting there. So, with low-cost persistent tracking we
> can determine the suspects and then pay some more to investigate those
> suspects in more detail.
Yeah, I was wondering whether it'd be useful to have that configurable so
that it'd be possible for a user to say "I'm okay with the cost, please
track more context per allocation". Given that tracking the immediate caller
is already a huge improvement and narrowing it down from there using
existing tools shouldn't be that difficult, I don't think this is a blocker
in any way. It just bothers me a bit that the code is structured so that
source line is the main abstraction.
> > > > Not sure what you're envisioning?
> > > >
> > > > I'd consider the debugfs interface pretty integral; it's much more
> > > > discoverable for users, and it's hardly any code out of the whole
> > > > patchset.
> > >
> > > You can do the same thing with a bpftrace one liner tho. That's rather
> > > difficult to beat.
>
> debugfs seemed like a natural choice for such information. If another
> interface is more appropriate I'm happy to explore that.
>
> >
> > Ah, shit, I'm an idiot. Sorry. I thought allocations was under /proc and
> > allocations.ctx under debugfs. I meant allocations.ctx is redundant.
>
> Do you mean that we could display allocation context in
> debugfs/allocations file (for the allocations which we explicitly
> enabled context capturing)?
Sorry about the fumbled communication. Here's what I mean:
* Improving memory allocation visibility makes sense to me. To me, a more
natural place for that feels like /proc/allocations next to other memory
info files rather than under debugfs.
* The default visibility provided by "allocations" provides something which
is more difficult or at least cumbersome to obtain using existing tracing
tools. However, what's provided by "allocations.ctx" can be trivially
obtained using kprobe and BPF and seems redundant.
Thanks.
--
tejun
On Wed, May 3, 2023 at 8:26 AM Dave Hansen <[email protected]> wrote:
>
> On 5/3/23 08:18, Suren Baghdasaryan wrote:
> >>> +static inline void rem_ctx(struct codetag_ctx *ctx,
> >>> + void (*free_ctx)(struct kref *refcount))
> >>> +{
> >>> + struct codetag_with_ctx *ctc = ctx->ctc;
> >>> +
> >>> + spin_lock(&ctc->ctx_lock);
> >> This could deadlock when allocator is called from the IRQ context.
> > I see. spin_lock_irqsave() then?
>
> Yes. But, even better, please turn on lockdep when you are testing. It
> will find these for you. If you're on x86, we have a set of handy-dandy
> debug options that you can add to an existing config with:
>
> make x86_debug.config
Nice!
I thought I tested with lockdep enabled but I might be wrong. The
beauty of working on multiple patchsets in parallel is that I can't
remember what I did for each one :)
>
> That said, I'm as concerned as everyone else that this is all "new" code
> and doesn't lean on existing tracing or things like PAGE_OWNER enough.
Yeah, that's being actively discussed.
>
On Wed, May 3, 2023 at 12:09 PM Tejun Heo <[email protected]> wrote:
>
> On Wed, May 03, 2023 at 08:58:51AM -1000, Tejun Heo wrote:
> > On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote:
> > > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote:
> > > > > Yeah, easy / default visibility argument does make sense to me.
> > > >
> > > > So, a bit of addition here. If this is the thrust, the debugfs part seems
> > > > rather redundant, right? That's trivially obtainable with tracing / bpf and
> > > > in a more flexible and performant manner. Also, are we happy with recording
> > > > just single depth for persistent tracking?
IIUC, by single depth you mean no call stack capturing?
If so, that's the idea behind the context capture feature so that we
can enable it on specific allocations only after we determine there is
something interesting there. So, with low-cost persistent tracking we
can determine the suspects and then pay some more to investigate those
suspects in more detail.
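A minimal sketch of that two-stage workflow from userspace, assuming the line format of the allocations file and the "file ... line ... enable" syntax shown in the cover letter. The helper names are hypothetical, and the second column is assumed to be the number of allocations.

```python
UNITS = {"B": 1, "KiB": 1 << 10, "MiB": 1 << 20, "GiB": 1 << 30}


def parse_size(text):
    """Convert a size such as '734KiB' or '1.32MiB' to bytes."""
    for unit in ("KiB", "MiB", "GiB", "B"):  # check "B" last, all end in B
        if text.endswith(unit):
            return int(float(text[: -len(unit)]) * UNITS[unit])
    return int(text)


def parse_line(line):
    """Split one allocations line into (bytes, count, path, line number)."""
    size, count, location, *_ = line.split()
    path, _, lineno = location.rpartition(":")
    return parse_size(size), int(count), path, int(lineno)


def enable_commands(lines, top=1):
    """Build allocations.ctx commands for the `top` largest call sites."""
    sites = sorted((parse_line(l) for l in lines if l.strip()), reverse=True)
    return [f"file {path} line {lineno} enable"
            for _, _, path, lineno in sites[:top]]


sample = [
    "734KiB 5380 fs/xfs/kmem.c:20 module:xfs func:kmem_alloc",
    "1.32MiB 338 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one",
]
print(enable_commands(sample))
# prints ['file include/asm-generic/pgalloc.h line 63 enable']
```

Stage one only reads the cheap always-on counters; the string produced here is what would then be written to allocations.ctx to start paying for context capture on that one site.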
> > >
> > > Not sure what you're envisioning?
> > >
> > > I'd consider the debugfs interface pretty integral; it's much more
> > > discoverable for users, and it's hardly any code out of the whole
> > > patchset.
> >
> > You can do the same thing with a bpftrace one liner tho. That's rather
> > difficult to beat.
debugfs seemed like a natural choice for such information. If another
interface is more appropriate I'm happy to explore that.
>
> Ah, shit, I'm an idiot. Sorry. I thought allocations was under /proc and
> allocations.ctx under debugfs. I meant allocations.ctx is redundant.
Do you mean that we could display allocation context in
debugfs/allocations file (for the allocations which we explicitly
enabled context capturing)?
>
> Thanks.
>
> --
> tejun
Hello,
On Wed, May 03, 2023 at 09:48:55AM -1000, Tejun Heo wrote:
> > If so, that's the idea behind the context capture feature so that we
> > can enable it on specific allocations only after we determine there is
> > something interesting there. So, with low-cost persistent tracking we
> > can determine the suspects and then pay some more to investigate those
> > suspects in more detail.
>
> Yeah, I was wondering whether it'd be useful to have that configurable so
> that it'd be possible for a user to say "I'm okay with the cost, please
> track more context per allocation". Given that tracking the immediate caller
> is already a huge improvement and narrowing it down from there using
> existing tools shouldn't be that difficult, I don't think this is a blocker
> in any way. It just bothers me a bit that the code is structured so that
> source line is the main abstraction.
Another related question. So, the reason macro'ing stuff is needed is
because you want to print the line directly from kernel, right? Is that
really necessary? Values from __builtin_return_address() can easily be
printed out as function+offset from kernel which already gives most of the
necessary information for triaging and mapping that back to source line from
userspace isn't difficult. Wouldn't using __builtin_return_address() make
the whole thing a lot simpler?
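To illustrate the userspace side of that, here is a small sketch that splits a function+offset string, in the form used by the call stack example in the cover letter, into its parts. parse_symbol is a hypothetical helper; turning the result into a source line would still need a vmlinux with debug info and a tool such as the kernel tree's scripts/faddr2line.

```python
import re

# func+0xOFFSET/0xSIZE, e.g. "pte_alloc_one+0xfe/0x130"
SYM_RE = re.compile(
    r"^(?P<func>[\w.]+)\+(?P<off>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)$"
)


def parse_symbol(text):
    """Return (function name, offset, function size) as (str, int, int)."""
    m = SYM_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not in function+offset/size form: {text!r}")
    return m["func"], int(m["off"], 16), int(m["size"], 16)


func, offset, size = parse_symbol("pte_alloc_one+0xfe/0x130")
print(func, offset, size)  # prints "pte_alloc_one 254 304"
```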
Thanks.
--
tejun
On Wed, May 3, 2023 at 6:35 PM Tejun Heo <[email protected]> wrote:
>
> Hello, Kent.
>
> On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> > No, we're still waiting on the tracing people to _demonstrate_, not
> > claim, that this is at all possible in a comparable way with tracing.
>
> So, we (meta) happen to do stuff like this all the time in the fleet to hunt
> down tricky persistent problems like memory leaks, ref leaks, what-have-you.
> In recent kernels, with kprobe and BPF, our ability to debug these sorts of
> problems has improved a great deal. Below, I'm attaching a bcc script I used
> to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
> can follow the same pattern.
>
For leaks there is an example in bcc:
https://github.com/iovisor/bcc/blob/master/tools/memleak.py
On Wed, May 3, 2023 at 12:49 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Wed, May 03, 2023 at 12:41:08PM -0700, Suren Baghdasaryan wrote:
> > On Wed, May 3, 2023 at 12:09 PM Tejun Heo <[email protected]> wrote:
> > >
> > > On Wed, May 03, 2023 at 08:58:51AM -1000, Tejun Heo wrote:
> > > > On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote:
> > > > > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote:
> > > > > > > Yeah, easy / default visibility argument does make sense to me.
> > > > > >
> > > > > > So, a bit of addition here. If this is the thrust, the debugfs part seems
> > > > > > rather redundant, right? That's trivially obtainable with tracing / bpf and
> > > > > > in a more flexible and performant manner. Also, are we happy with recording
> > > > > > just single depth for persistent tracking?
> >
> > IIUC, by single depth you mean no call stack capturing?
>
> Yes.
>
> > If so, that's the idea behind the context capture feature so that we
> > can enable it on specific allocations only after we determine there is
> > something interesting there. So, with low-cost persistent tracking we
> > can determine the suspects and then pay some more to investigate those
> > suspects in more detail.
>
> Yeah, I was wondering whether it'd be useful to have that configurable so
> that it'd be possible for a user to say "I'm okay with the cost, please
> track more context per allocation".
I assume by "more context per allocation" you mean for a specific
allocation, not for all allocations.
So, in a sense you are asking if the context capture feature can be
dropped from this series and implemented using some other means. Is
that right?
> Given that tracking the immediate caller
> is already a huge improvement and narrowing it down from there using
> existing tools shouldn't be that difficult, I don't think this is a blocker
> in any way. It just bothers me a bit that the code is structured so that
> source line is the main abstraction.
>
> > > > > Not sure what you're envisioning?
> > > > >
> > > > > I'd consider the debugfs interface pretty integral; it's much more
> > > > > discoverable for users, and it's hardly any code out of the whole
> > > > > patchset.
> > > >
> > > > You can do the same thing with a bpftrace one liner tho. That's rather
> > > > difficult to beat.
> >
> > debugfs seemed like a natural choice for such information. If another
> > interface is more appropriate I'm happy to explore that.
> >
> > >
> > > Ah, shit, I'm an idiot. Sorry. I thought allocations was under /proc and
> > > allocations.ctx under debugfs. I meant allocations.ctx is redundant.
> >
> > Do you mean that we could display allocation context in
> > debugfs/allocations file (for the allocations which we explicitly
> > enabled context capturing)?
>
> Sorry about the fumbled communication. Here's what I mean:
>
> * Improving memory allocation visibility makes sense to me. To me, a more
> natural place for that feels like /proc/allocations next to other memory
> info files rather than under debugfs.
TBH I would love that if this approach is acceptable.
>
> * The default visibility provided by "allocations" provides something which
> is more difficult or at least cumbersome to obtain using existing tracing
> tools. However, what's provided by "allocations.ctx" can be trivially
> obtained using kprobe and BPF and seems redundant.
Hmm. That might be a good way forward. Since context capture already has
a high performance overhead, maybe choosing a more generic solution, even
if it is not the most performant one, is the right answer here. I'll
need to think about it some more but thanks for the idea!
>
> Thanks.
>
> --
> tejun
On Wed, May 03, 2023 at 01:08:40PM -0700, Suren Baghdasaryan wrote:
> On Wed, May 3, 2023 at 12:49 PM Tejun Heo <[email protected]> wrote:
> > * Improving memory allocation visibility makes sense to me. To me, a more
> > natural place for that feels like /proc/allocations next to other memory
> > info files rather than under debugfs.
>
> TBH I would love that if this approach is acceptable.
Ack
On Wed, May 3, 2023 at 1:00 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Wed, May 03, 2023 at 09:48:55AM -1000, Tejun Heo wrote:
> > > If so, that's the idea behind the context capture feature so that we
> > > can enable it on specific allocations only after we determine there is
> > > something interesting there. So, with low-cost persistent tracking we
> > > can determine the suspects and then pay some more to investigate those
> > > suspects in more detail.
> >
> > Yeah, I was wondering whether it'd be useful to have that configurable so
> > that it'd be possible for a user to say "I'm okay with the cost, please
> > track more context per allocation". Given that tracking the immediate caller
> > is already a huge improvement and narrowing it down from there using
> > existing tools shouldn't be that difficult, I don't think this is a blocker
> > in any way. It just bothers me a bit that the code is structured so that
> > source line is the main abstraction.
>
> Another related question. So, the reason macro'ing stuff is needed is
> because you want to print the line directly from kernel, right?
The main reason is because we want to inject a code tag at the
location of the call. If we have a code tag injected at every
allocation call, then finding the allocation counter (code tag) to
operate on takes no time.
> Is that
> really necessary? Values from __builtin_return_address() can easily be
> printed out as function+offset from kernel which already gives most of the
> necessary information for triaging and mapping that back to source line from
> userspace isn't difficult. Wouldn't using __builtin_return_address() make
> the whole thing a lot simpler?
If we do that we have to associate that address with the allocation
counter at runtime on the first allocation and look it up on all
following allocations. That introduces the overhead which we are
trying to avoid by using macros.
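To illustrate the difference, here is a minimal user-space sketch of the
call-site tagging idea. It is not the code from this series: struct
site_tag, tagged_alloc(), tag_of(), my_alloc() and my_free() are
hypothetical names, malloc() stands in for the kernel allocator, and the
counters are not thread safe. The only point is that each macro expansion
owns a static tag, so the address of the counter is fixed at build time
and the allocation path does no lookup.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-call-site counter (illustration only). */
struct site_tag {
	const char *file;
	int line;
	size_t bytes;	/* bytes currently charged to this call site */
	size_t calls;	/* allocations made from this call site */
};

/* Header placed in front of each object so that free can find its tag. */
struct alloc_hdr {
	struct site_tag *tag;
	size_t size;
};

/* Allocate and charge the tag that the macro passed in. */
static void *tagged_alloc(struct site_tag *tag, size_t size)
{
	struct alloc_hdr *h = malloc(sizeof(*h) + size);

	if (!h)
		return NULL;
	h->tag = tag;
	h->size = size;
	tag->bytes += size;
	tag->calls++;
	return h + 1;
}

/* Return the tag an object was charged to. */
static struct site_tag *tag_of(void *p)
{
	return ((struct alloc_hdr *)p - 1)->tag;
}

/* Uncharge the owning tag and release the object. */
static void my_free(void *p)
{
	struct alloc_hdr *h;

	if (!p)
		return;
	h = (struct alloc_hdr *)p - 1;
	h->tag->bytes -= h->size;
	free(h);
}

/*
 * Every expansion defines its own static tag, so no runtime association
 * between caller and counter is needed. Relies on GNU statement
 * expressions (gcc/clang).
 */
#define my_alloc(size) ({						\
	static struct site_tag _tag = { __FILE__, __LINE__, 0, 0 };	\
	tagged_alloc(&_tag, (size));					\
})
```

Two my_alloc() calls at different places charge different tags, while a
call inside a loop keeps charging the same one. With
__builtin_return_address() that same association has to be established
and looked up at run time.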
>
> Thanks.
>
> --
> tejun
Hello,
On Wed, May 03, 2023 at 01:08:40PM -0700, Suren Baghdasaryan wrote:
> > Yeah, I was wondering whether it'd be useful to have that configurable so
> > that it'd be possible for a user to say "I'm okay with the cost, please
> > track more context per allocation".
>
> I assume by "more context per allocation" you mean for a specific
> allocation, not for all allocations.
> So, in a sense you are asking if the context capture feature can be
> dropped from this series and implemented using some other means. Is
> that right?
Oh, no, what I meant was whether it'd make sense to allow enabling richer
tracking (e.g. record deeper into callstack) for all allocations. For
targeted tracking, it seems that the kernel already has everything needed.
But this is more of an idle thought and the immediate caller tracking is
already a big improvement in terms of visibility, so no need to be hung up
on this part of discussion at all.
Thanks.
--
tejun
Hello,
On Wed, May 03, 2023 at 01:14:57PM -0700, Suren Baghdasaryan wrote:
> On Wed, May 3, 2023 at 1:00 PM Tejun Heo <[email protected]> wrote:
> > Another related question. So, the reason macro'ing stuff is needed is
> > because you want to print the line directly from kernel, right?
>
> The main reason is because we want to inject a code tag at the
> location of the call. If we have a code tag injected at every
> allocation call, then finding the allocation counter (code tag) to
> operate on takes no time.
>
> > Is that
> > really necessary? Values from __builtin_return_address() can easily be
> > printed out as function+offset from kernel which already gives most of the
> > necessary information for triaging and mapping that back to source line from
> > userspace isn't difficult. Wouldn't using __builtin_return_address() make
> > the whole thing a lot simpler?
>
> If we do that we have to associate that address with the allocation
> counter at runtime on the first allocation and look it up on all
> following allocations. That introduces the overhead which we are
> trying to avoid by using macros.
I see. I'm a bit skeptical about the performance angle given that the hot
path can be probably made really cheap even with lookups. In most cases,
it's just gonna be an extra pointer deref and a few more arithmetics. That
can show up in microbenchmarks but it's not gonna be much. The benefit of
going that route would be the tracking thing being mostly self contained.
That said, it's nice to not have to worry about allocating tracking slots
and managing hash table, so no strong opinion.
Thanks.
--
tejun
On Wed, May 03, 2023 at 04:25:30PM -1000, Tejun Heo wrote:
> I see. I'm a bit skeptical about the performance angle given that the hot
> path can be probably made really cheap even with lookups. In most cases,
> it's just gonna be an extra pointer deref and a few more arithmetics. That
> can show up in microbenchmarks but it's not gonna be much. The benefit of
> going that route would be the tracking thing being mostly self contained.
The only way to do it with a single additional pointer deref would be
with a completely statically sized hash table without chaining - it'd
have to be open addressing.
More realistically you're looking at ~3 dependent loads.
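For reference, a minimal user-space sketch of the single-deref variant
described above: a statically sized, open-addressed table keyed by the
caller address. The table size, hash_ip(), lookup_site(), account() and
counted_alloc() are all assumptions made for illustration; nothing like
this exists in the series, and the sketch ignores locking and per-CPU
counters.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOT_BITS 10
#define SLOTS (1u << SLOT_BITS)	/* fixed at build time, never resized */

struct site_counter {
	uintptr_t ip;	/* caller address; 0 marks an empty slot */
	size_t bytes;
	size_t calls;
};

static struct site_counter table[SLOTS];

/* Multiplicative hash: keep the top SLOT_BITS bits of the product. */
static size_t hash_ip(uintptr_t ip)
{
	return (size_t)(((uint64_t)ip * 0x9E3779B97F4A7C15ull) >>
			(64 - SLOT_BITS));
}

/*
 * Find the slot for a (non-zero) caller address, claiming an empty slot
 * on first use. Collisions are resolved by linear probing, so there are
 * no chain pointers to follow.
 */
static struct site_counter *lookup_site(uintptr_t ip)
{
	size_t i = hash_ip(ip);
	size_t n;

	for (n = 0; n < SLOTS; n++) {
		struct site_counter *c = &table[i];

		if (c->ip == ip)
			return c;
		if (c->ip == 0) {
			c->ip = ip;
			return c;
		}
		i = (i + 1) & (SLOTS - 1);
	}
	return NULL;	/* table full, allocation goes unaccounted */
}

/* Charge one allocation of the given size to the caller address. */
static void account(uintptr_t ip, size_t size)
{
	struct site_counter *c = lookup_site(ip);

	if (c) {
		c->bytes += size;
		c->calls++;
	}
}

/* Allocation wrapper that charges its immediate caller. */
__attribute__((noinline)) static void *counted_alloc(size_t size)
{
	void *p = malloc(size);

	if (p)
		account((uintptr_t)__builtin_return_address(0), size);
	return p;
}
```

When there is no collision, a hit costs one load of table[i].ip, because
the table address is a link-time constant. A resizable or chained table
would add a load of the table pointer and of each chain node on top of
that, which is where the additional dependent loads come from.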
On Wed, May 3, 2023 at 7:25 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Wed, May 03, 2023 at 01:14:57PM -0700, Suren Baghdasaryan wrote:
> > On Wed, May 3, 2023 at 1:00 PM Tejun Heo <[email protected]> wrote:
> > > Another related question. So, the reason macro'ing stuff is needed is
> > > because you want to print the line directly from kernel, right?
> >
> > The main reason is because we want to inject a code tag at the
> > location of the call. If we have a code tag injected at every
> > allocation call, then finding the allocation counter (code tag) to
> > operate on takes no time.
> >
> > > Is that
> > > really necessary? Values from __builtin_return_address() can easily be
> > > printed out as function+offset from kernel which already gives most of the
> > > necessary information for triaging and mapping that back to source line from
> > > userspace isn't difficult. Wouldn't using __builtin_return_address() make
> > > the whole thing a lot simpler?
> >
> > If we do that we have to associate that address with the allocation
> > counter at runtime on the first allocation and look it up on all
> > following allocations. That introduces the overhead which we are
> > trying to avoid by using macros.
>
> I see. I'm a bit skeptical about the performance angle given that the hot
> path can be probably made really cheap even with lookups. In most cases,
> it's just gonna be an extra pointer deref and a few more arithmetics. That
> can show up in microbenchmarks but it's not gonna be much. The benefit of
> going that route would be the tracking thing being mostly self contained.
I'm in the process of rerunning the tests to compare the overhead on
the latest kernel but I don't expect that to be cheap compared to
kmalloc().
>
> That said, it's nice to not have to worry about allocating tracking slots
> and managing hash table, so no strong opinion.
>
> Thanks.
>
> --
> tejun
On Wed, 3 May 2023 13:14:57 -0700
Suren Baghdasaryan <[email protected]> wrote:
> On Wed, May 3, 2023 at 1:00 PM Tejun Heo <[email protected]> wrote:
> >
> > Hello,
> >
> > On Wed, May 03, 2023 at 09:48:55AM -1000, Tejun Heo wrote:
> > > > If so, that's the idea behind the context capture feature so that we
> > > > can enable it on specific allocations only after we determine there is
> > > > something interesting there. So, with low-cost persistent tracking we
> > > > can determine the suspects and then pay some more to investigate those
> > > > suspects in more detail.
> > >
> > > Yeah, I was wondering whether it'd be useful to have that configurable so
> > > that it'd be possible for a user to say "I'm okay with the cost, please
> > > track more context per allocation". Given that tracking the immediate caller
> > > is already a huge improvement and narrowing it down from there using
> > > existing tools shouldn't be that difficult, I don't think this is a blocker
> > > in any way. It just bothers me a bit that the code is structured so that
> > > source line is the main abstraction.
> >
> > Another related question. So, the reason macro'ing stuff is needed is
> > because you want to print the line directly from kernel, right?
>
> The main reason is because we want to inject a code tag at the
> location of the call. If we have a code tag injected at every
> allocation call, then finding the allocation counter (code tag) to
> operate on takes no time.
Another consequence is that each source code location gets its own tag.
The compiler can no longer apply common subexpression elimination
(because the tag is different). I have some doubts that there are any
places where CSE could be applied to allocation calls, but in general,
this is one more difference to using _RET_IP_.
Petr T
> > Is that
> > really necessary? Values from __builtin_return_address() can easily be
> > printed out as function+offset from kernel which already gives most of the
> > necessary information for triaging and mapping that back to source line from
> > userspace isn't difficult. Wouldn't using __builtin_return_address() make
> > the whole thing a lot simpler?
>
> If we do that we have to associate that address with the allocation
> counter at runtime on the first allocation and look it up on all
> following allocations. That introduces the overhead which we are
> trying to avoid by using macros.
>
> >
> > Thanks.
> >
> > --
> > tejun
>
On Wed 03-05-23 08:18:39, Suren Baghdasaryan wrote:
> On Wed, May 3, 2023 at 12:36 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 01-05-23 09:54:44, Suren Baghdasaryan wrote:
> > [...]
> > > +static inline void add_ctx(struct codetag_ctx *ctx,
> > > + struct codetag_with_ctx *ctc)
> > > +{
> > > + kref_init(&ctx->refcount);
> > > + spin_lock(&ctc->ctx_lock);
> > > + ctx->flags = CTC_FLAG_CTX_PTR;
> > > + ctx->ctc = ctc;
> > > + list_add_tail(&ctx->node, &ctc->ctx_head);
> > > + spin_unlock(&ctc->ctx_lock);
> >
> > AFAIU every single tracked allocation will get its own codetag_ctx.
> > There is no aggregation per allocation site or anything else. This looks
> > like a scalability and a memory overhead red flag to me.
>
> True. The allocations here would not be limited. We could introduce a
> global limit to the amount of memory that we can use to store contexts
> and maybe reuse the oldest entry (in LRU fashion) when we hit that
> limit?
Wouldn't it make more sense to aggregate same allocations? Sure pids
get recycled but quite honestly I am not sure that information is all
that interesting. Precisely because of the recycle and short lived
processes reasons. I think there is quite a lot to think about the
detailed context tracking.
> >
> > > +}
> > > +
> > > +static inline void rem_ctx(struct codetag_ctx *ctx,
> > > + void (*free_ctx)(struct kref *refcount))
> > > +{
> > > + struct codetag_with_ctx *ctc = ctx->ctc;
> > > +
> > > + spin_lock(&ctc->ctx_lock);
> >
> > This could deadlock when allocator is called from the IRQ context.
>
> I see. spin_lock_irqsave() then?
Yes. I have checked that the lock is not held over the whole list
traversal which is good but the changelog could be more explicit about
the iterators and the lock hold time implications.
--
Michal Hocko
SUSE Labs
On Wed 03-05-23 08:24:19, Suren Baghdasaryan wrote:
> On Wed, May 3, 2023 at 12:39 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 01-05-23 09:54:45, Suren Baghdasaryan wrote:
> > [...]
> > > +struct codetag_ctx *alloc_tag_create_ctx(struct alloc_tag *tag, size_t size)
> > > +{
> > > + struct alloc_call_ctx *ac_ctx;
> > > +
> > > + /* TODO: use a dedicated kmem_cache */
> > > + ac_ctx = kmalloc(sizeof(struct alloc_call_ctx), GFP_KERNEL);
> >
> > You cannot really use GFP_KERNEL here. This is post_alloc_hook path and
> > that has its own gfp context.
>
> I missed that. Would it be appropriate to use the gfp_flags parameter
> of post_alloc_hook() here?
No. The original allocation could have been GFP_USER based and you do
not want these allocations to pollute other zones potentially. You want
a GFP_KERNEL compatible subset of that mask.
But even then I really detest an additional allocation from this context
for every single allocation request. There is a GFP_NOWAIT allocation for
stackdepot but that is at least cached and generally not allocating.
This will allocate for every single allocation. There must be a better
way.
--
Michal Hocko
SUSE Labs
On Wed 03-05-23 08:09:28, Suren Baghdasaryan wrote:
> On Wed, May 3, 2023 at 12:25 AM Michal Hocko <[email protected]> wrote:
[...]
> Thanks for summarizing!
>
> > At least those I find the most important:
> > - This is a big change and it adds a significant maintenance burden
> > because each allocation entry point needs to be handled specifically.
> > The cost will grow with the intended coverage especially there when
> > allocation is hidden in a library code.
>
> Do you mean with more allocations in the codebase more codetags will
> be generated? Is that the concern?
No. I am mostly concerned about the _maintenance_ overhead. For the
bare tracking (without profiling and thus stack traces) only those
allocations that are directly inlined into the consumer are really
of any use. That increases the code impact of the tracing because any
relevant allocation location has to go through the micro surgery.
e.g. is it really interesting to know that there is a likely memory
leak in seq_file proper doing an allocation? No, as it is most likely the
specific implementation using seq_file that is leaking. There are
other examples like that. See?
> Or maybe as you commented in
> another patch that context capturing feature does not limit how many
> stacks will be captured?
That is a memory overhead which can be really huge and it would be nice
to be more explicit about that in the cover letter. It is a downside for
sure but not something that has a code maintenance impact and it is an
opt-in so it can be enabled only when necessary.
Quite honestly, though, the more I look into context capturing part it
seems to me that there is much more to be reconsidered there and if you
really want to move forward with the code tagging part then you should
drop that for now. It would make the whole series smaller and easier to
digest.
> > - It has been brought up that this is duplicating functionality already
> > available via existing tracing infrastructure. You should make it very
> > clear why that is not suitable for the job
>
> I experimented with using tracing with _RET_IP_ to implement this
> accounting. The major issue is the _RET_IP_ to codetag lookup runtime
> overhead which is orders of magnitude higher than proposed code
> tagging approach. With code tagging proposal, that link is resolved at
> compile time. Since we want this mechanism deployed in production, we
> want to keep the overhead to the absolute minimum.
> You asked me before how much overhead would be tolerable and the
> answer will always be "as small as possible". This is especially true
> for slab allocators which are ridiculously fast and regressing them
> would be very noticeable (due to the frequent use).
It would have been more convincing if you had some numbers at hand.
E.g. this is a typical workload we are dealing with. With the compile
time tags we are able to learn this much at that cost. With dynamic
tracing we are able to learn this much at that cost. See? "As small as
possible" is a rather vague term that different people will have very
different ideas about.
> There is another issue, which I think can be solved in a smart way but
> will either affect performance or would require more memory. With the
> tracing approach we don't know beforehand how many individual
> allocation sites exist, so we have to allocate code tags (or similar
> structures for counting) at runtime vs compile time. We can be smart
> about it and allocate in batches or even preallocate more than we need
> beforehand but, as I said, it will require some kind of compromise.
I have tried our usual distribution config (only vmlinux without modules
so the real impact will be larger as we build a lot of stuff into
modules) just to get an idea:
    text     data      bss      dec     hex filename
28755345 17040322 19845124 65640791 3e99957 vmlinux.before
28867168 17571838 19386372 65825378 3ec6a62 vmlinux.after
Less than 1% for text, 3% for data. This is not all that terrible
for an initial submission and a more dynamic approach could be added
later. E.g. with a smaller pre-allocated hash table that could be
expanded lazily. Anyway not something I would be losing sleep over. This
can always be improved later on.
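For the record, the percentages can be recomputed from the size output
above with a small helper. growth_pct() and print_growth() are just
illustrative names, and the inputs are the vmlinux.before and
vmlinux.after columns quoted above.

```c
#include <stdio.h>

/* Relative change in percent from the "before" to the "after" value. */
static double growth_pct(long before, long after)
{
	return 100.0 * (double)(after - before) / (double)before;
}

/* Print one signed percentage per section, e.g. for text, data, bss. */
static void print_growth(const char *name, long before, long after)
{
	printf("%-5s %+.2f%%\n", name, growth_pct(before, after));
}
```

Fed with the numbers above, this gives roughly +0.39% for text and +3.12%
for data, while bss shrinks by about 2.3% and the total (dec) grows by
about 0.28%.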
> I understand that code tagging creates additional maintenance burdens
> but I hope it also produces enough benefits that people will want
> this. The cost is also hopefully amortized when additional
> applications like the ones we presented in RFC [1] are built using the
> same framework.
TBH I am much more concerned about the maintenance burden on the MM side
than the actual code tagging itself which is much more self contained. I
haven't seen other potential applications of the same infrastructure and
maybe the code impact would be much smaller than in the MM proper. Our
allocator API is really hairy and convoluted.
> > - We already have page_owner infrastructure that provides allocation
> > tracking data. Why it cannot be used/extended?
>
> 1. The overhead.
Do you have any numbers?
> 2. Covers only page allocators.
Yes this sucks.
>
> I didn't think about extending the page_owner approach to slab
> allocators but I suspect it would not be trivial. I don't see
> attaching an owner to every slab object to be a scalable solution. The
> overhead would again be of concern here.
This would have been a nice argument to mention in the changelog so that
we know that you have considered that option at least. Why should I (as
a reviewer) wild guess that?
> I should point out that there was one important technical concern
> about lack of a kill switch for this feature, which was an issue for
> distributions that can't disable the CONFIG flag. In this series we
> addressed that concern.
Thanks, that is certainly appreciated. I haven't looked deeper into that
part but from the cover letter I have understood that CONFIG_MEM_ALLOC_PROFILING
implies unconditional page_ext and therefore the memory overhead
associated with that. There seems to be a killswitch nomem_profiling but
from a quick look it doesn't seem to disable page_ext allocations. I
might be missing something there of course. Having a high-level
description for that would be really nice as well.
> [1] https://lore.kernel.org/all/[email protected]/
--
Michal Hocko
SUSE Labs
On Thu, May 4, 2023 at 1:04 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 03-05-23 08:18:39, Suren Baghdasaryan wrote:
> > On Wed, May 3, 2023 at 12:36 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 01-05-23 09:54:44, Suren Baghdasaryan wrote:
> > > [...]
> > > > +static inline void add_ctx(struct codetag_ctx *ctx,
> > > > + struct codetag_with_ctx *ctc)
> > > > +{
> > > > + kref_init(&ctx->refcount);
> > > > + spin_lock(&ctc->ctx_lock);
> > > > + ctx->flags = CTC_FLAG_CTX_PTR;
> > > > + ctx->ctc = ctc;
> > > > + list_add_tail(&ctx->node, &ctc->ctx_head);
> > > > + spin_unlock(&ctc->ctx_lock);
> > >
> > > AFAIU every single tracked allocation will get its own codetag_ctx.
> > > There is no aggregation per allocation site or anything else. This looks
> > > like a scalability and a memory overhead red flag to me.
> >
> > True. The allocations here would not be limited. We could introduce a
> > global limit to the amount of memory that we can use to store contexts
> > and maybe reuse the oldest entry (in LRU fashion) when we hit that
> > limit?
>
> Wouldn't it make more sense to aggregate same allocations? Sure pids
> get recycled but quite honestly I am not sure that information is all
> that interesting. Precisely because of the recycle and short lived
> processes reasons. I think there is quite a lot to think about the
> detailed context tracking.
That would be a nice optimization. I'll need to look into the
implementation details. Thanks for the idea.
>
> > >
> > > > +}
> > > > +
> > > > +static inline void rem_ctx(struct codetag_ctx *ctx,
> > > > + void (*free_ctx)(struct kref *refcount))
> > > > +{
> > > > + struct codetag_with_ctx *ctc = ctx->ctc;
> > > > +
> > > > + spin_lock(&ctc->ctx_lock);
> > >
> > > This could deadlock when allocator is called from the IRQ context.
> >
> > I see. spin_lock_irqsave() then?
>
> Yes. I have checked that the lock is not held over the whole list
> traversal which is good but the changelog could be more explicit about
> the iterators and the lock hold time implications.
Ack. Will add more information.
>
> --
> Michal Hocko
> SUSE Labs
On Thu, May 4, 2023 at 2:07 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 03-05-23 08:09:28, Suren Baghdasaryan wrote:
> > On Wed, May 3, 2023 at 12:25 AM Michal Hocko <[email protected]> wrote:
> [...]
> > Thanks for summarizing!
> >
> > > At least those I find the most important:
> > > - This is a big change and it adds a significant maintenance burden
> > > because each allocation entry point needs to be handled specifically.
> > > The cost will grow with the intended coverage especially there when
> > > allocation is hidden in a library code.
> >
> > Do you mean with more allocations in the codebase more codetags will
> > be generated? Is that the concern?
>
> No. I am mostly concerned about the _maintenance_ overhead. For the
> bare tracking (without profiling and thus stack traces) only those
> allocations that are directly inlined into the consumer are really
> of any use. That increases the code impact of the tracing because any
> relevant allocation location has to go through the micro surgery.
>
> e.g. is it really interesting to know that there is a likely memory
> leak in seq_file proper doing an allocation? No, as it is most likely the
> specific implementation using seq_file that is leaking. There are
> other examples like that. See?
Yes, I see that. One level tracking does not provide all the
information needed to track such issues. Something more informative
would cost more. That's why our proposal is to have a light-weight
mechanism to get a high level picture and then be able to zoom into a
specific area using context capture. If you have ideas to improve
this, I'm open to suggestions.
>
> > Or maybe as you commented in
> > another patch that context capturing feature does not limit how many
> > stacks will be captured?
>
> That is a memory overhead which can be really huge and it would be nice
> to be more explicit about that in the cover letter. It is a downside for
> sure but not something that has a code maintenance impact and it is an
> opt-in so it can be enabled only when necessary.
You are right, I'll add that into the cover letter.
>
> Quite honestly, though, the more I look into context capturing part it
> seems to me that there is much more to be reconsidered there and if you
> really want to move forward with the code tagging part then you should
> drop that for now. It would make the whole series smaller and easier to
> digest.
Sure, I don't see an issue with removing that for now and refining the
mechanism before posting again.
>
> > > - It has been brought up that this is duplicating functionality already
> > > available via existing tracing infrastructure. You should make it very
> > > clear why that is not suitable for the job
> >
> > I experimented with using tracing with _RET_IP_ to implement this
> > accounting. The major issue is the _RET_IP_ to codetag lookup runtime
> > overhead which is orders of magnitude higher than proposed code
> > tagging approach. With code tagging proposal, that link is resolved at
> > compile time. Since we want this mechanism deployed in production, we
> > want to keep the overhead to the absolute minimum.
> > You asked me before how much overhead would be tolerable and the
> > answer will always be "as small as possible". This is especially true
> > for slab allocators which are ridiculously fast and regressing them
> > would be very noticeable (due to the frequent use).
>
> It would have been more convincing if you had some numbers at hand.
> E.g. this is a typical workload we are dealing with. With the compile
> time tags we are able to learn this much at that cost. With dynamic
> tracing we are able to learn this much at that cost. See? "As small as
> possible" is a rather vague term that different people will have very
> different ideas about.
I'm rerunning my tests with the latest kernel to collect the
comparison data. I profiled these solutions before but the kernel
changed since then, so I need to update them.
>
> > There is another issue, which I think can be solved in a smart way but
> > will either affect performance or would require more memory. With the
> > tracing approach we don't know beforehand how many individual
> > allocation sites exist, so we have to allocate code tags (or similar
> > structures for counting) at runtime vs compile time. We can be smart
> > about it and allocate in batches or even preallocate more than we need
> > beforehand but, as I said, it will require some kind of compromise.
>
> I have tried our usual distribution config (only vmlinux without modules
> so the real impact will be larger as we build a lot of stuff into
> modules) just to get an idea:
>     text     data      bss      dec     hex filename
> 28755345 17040322 19845124 65640791 3e99957 vmlinux.before
> 28867168 17571838 19386372 65825378 3ec6a62 vmlinux.after
>
> Less than 1% for text, 3% for data. This is not all that terrible
> for an initial submission and a more dynamic approach could be added
> later. E.g. with a smaller pre-allocated hash table that could be
> expanded lazily. Anyway not something I would be losing sleep over. This
> can always be improved later on.
Ah, right. I should have mentioned this overhead too. Thanks for
keeping me honest.
> > I understand that code tagging creates additional maintenance burdens
> > but I hope it also produces enough benefits that people will want
> > this. The cost is also hopefully amortized when additional
> > applications like the ones we presented in RFC [1] are built using the
> > same framework.
>
> TBH I am much more concerned about the maintenance burden on the MM side
> than the actual code tagging itself which is much more self contained. I
> haven't seen other potential applications of the same infrastructure and
> maybe the code impact would be much smaller than in the MM proper. Our
> allocator API is really hairy and convoluted.
Yes, other applications are much smaller and cleaner. MM allocation
code is quite complex indeed.
>
> > > - We already have page_owner infrastructure that provides allocation
> > > tracking data. Why it cannot be used/extended?
> >
> > 1. The overhead.
>
> Do you have any numbers?
Will post once my tests are completed.
>
> > 2. Covers only page allocators.
>
> Yes this sucks.
> >
> > I didn't think about extending the page_owner approach to slab
> > allocators but I suspect it would not be trivial. I don't see
> > attaching an owner to every slab object to be a scalable solution. The
> > overhead would again be of concern here.
>
> This would have been a nice argument to mention in the changelog so that
> we know that you have considered that option at least. Why should I (as
> a reviewer) wild guess that?
Sorry, it's hard to remember all the decisions, discussions and
conclusions when working on a feature over a long time period. I'll
include more information about that.
>
> > I should point out that there was one important technical concern
> > about lack of a kill switch for this feature, which was an issue for
> > distributions that can't disable the CONFIG flag. In this series we
> > addressed that concern.
>
> Thanks, that is certainly appreciated. I haven't looked deeper into that
> part but from the cover letter I have understood that CONFIG_MEM_ALLOC_PROFILING
> implies unconditional page_ext and therefore the memory overhead
> associated with that. There seems to be a killswitch nomem_profiling but
> from a quick look it doesn't seem to disable page_ext allocations. I
> might be missing something there of course. Having a high-level
> description for that would be really nice as well.
Right, will add a description of that as well.
We eliminate the runtime overhead but not the memory one. However I
believe it's also doable using page_ext_operations.need callback. Will
look into it.
Thanks,
Suren.
>
> > [1] https://lore.kernel.org/all/[email protected]/
>
> --
> Michal Hocko
> SUSE Labs
On Thu, May 4, 2023 at 1:09 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 03-05-23 08:24:19, Suren Baghdasaryan wrote:
> > On Wed, May 3, 2023 at 12:39 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 01-05-23 09:54:45, Suren Baghdasaryan wrote:
> > > [...]
> > > > +struct codetag_ctx *alloc_tag_create_ctx(struct alloc_tag *tag, size_t size)
> > > > +{
> > > > + struct alloc_call_ctx *ac_ctx;
> > > > +
> > > > + /* TODO: use a dedicated kmem_cache */
> > > > + ac_ctx = kmalloc(sizeof(struct alloc_call_ctx), GFP_KERNEL);
> > >
> > > You cannot really use GFP_KERNEL here. This is post_alloc_hook path and
> > > that has its own gfp context.
> >
> > I missed that. Would it be appropriate to use the gfp_flags parameter
> > of post_alloc_hook() here?
>
> No. The original allocation could have been GFP_USER based and you do
> not want these allocations to pollute other zones potentially. You want
> a GFP_KERNEL compatible subset of that mask.
Ack.
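To make the point about the mask concrete: one way to read "GFP_KERNEL
compatible subset" is to AND the caller's flags with GFP_KERNEL, so that
zone and placement modifiers are dropped while any restriction the caller
already had (for example a missing FS bit) is preserved. The sketch below
is userspace C with made-up flag bits. The X_GFP_* names and values are
hypothetical, not the kernel's, and ctx_gfp_mask() is an illustrative
helper, not a function from the patchset.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration only; NOT the kernel's values. */
typedef unsigned int gfp_t;

#define X_GFP_RECLAIM   0x01u  /* may reclaim/sleep */
#define X_GFP_IO        0x02u  /* may start I/O */
#define X_GFP_FS        0x04u  /* may call into the filesystem */
#define X_GFP_HARDWALL  0x08u  /* extra bit carried by user allocations */
#define X_GFP_HIGHMEM   0x10u  /* zone modifier */
#define X_GFP_MOVABLE   0x20u  /* placement hint */

#define X_GFP_KERNEL (X_GFP_RECLAIM | X_GFP_IO | X_GFP_FS)
#define X_GFP_HIGHUSER_MOVABLE \
	(X_GFP_KERNEL | X_GFP_HARDWALL | X_GFP_HIGHMEM | X_GFP_MOVABLE)

/*
 * Derive the mask for the bookkeeping allocation from the mask of the
 * allocation being tracked: keep only bits that GFP_KERNEL itself has.
 */
static gfp_t ctx_gfp_mask(gfp_t caller_gfp)
{
	return caller_gfp & X_GFP_KERNEL;
}
```

With these made-up values, a user-style mask collapses to plain
X_GFP_KERNEL, and a caller that was only allowed to reclaim stays
restricted to that.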
>
> But even then I really detest an additional allocation from this context
> for every single allocation request. There is a GFP_NOWAIT allocation for
> stackdepot but that is at least cached and generally not allocating.
> This will allocate for every single allocation.
A small correction here. alloc_tag_create_ctx() is used only for
allocations which we requested to capture the context. So, this last
sentence is true for allocations we specifically marked to capture the
context, not in general.
> There must be a better way.
Yeah, agree, it would be good to avoid allocations in this path. Any
specific ideas on how to improve this? Pooling/caching perhaps? I
think kmem_cache does some of that already but maybe something else?
Thanks,
Suren.
> --
> Michal Hocko
> SUSE Labs
On Thu 04-05-23 09:22:07, Suren Baghdasaryan wrote:
[...]
> > But even then I really detest an additional allocation from this context
> > for every single allocation request. There is a GFP_NOWAIT allocation for
> > stackdepot but that is at least cached and generally not allocating.
> > This will allocate for every single allocation.
>
> A small correction here. alloc_tag_create_ctx() is used only for
> allocations which we requested to capture the context. So, this last
> sentence is true for allocations we specifically marked to capture the
> context, not in general.
Ohh, right. I have misunderstood that part. Slightly better, still
potentially a scalability issue because hard-to-debug memory leaks
usually use generic caches (for kmalloc). So this might still be a lot
of objects to track.
> > There must be a better way.
>
> Yeah, agree, it would be good to avoid allocations in this path. Any
> specific ideas on how to improve this? Pooling/caching perhaps? I
> think kmem_cache does some of that already but maybe something else?
The best I can come up with is a preallocated hash table to store
references to stack depots with some additional data associated. The
memory overhead could be still quite big but the hash tables could be
resized lazily.
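
A rough userspace sketch of that idea, assuming a fixed-size open-addressing
table keyed by a short stack trace: identical traces share one entry, so
recording an allocation never allocates. Everything here (TRACE_DEPTH,
TABLE_SLOTS, struct trace_entry, trace_record()) is hypothetical and only
illustrates the deduplication; it is not stackdepot and not code from the
series, and lazy resizing is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRACE_DEPTH 4    /* hypothetical configurable depth */
#define TABLE_SLOTS 64   /* preallocated; must be a power of two */

struct trace_entry {
	uintptr_t frames[TRACE_DEPTH]; /* return addresses */
	unsigned long count;           /* allocations seen with this trace */
	size_t bytes;                  /* total bytes for this trace */
	int used;
};

static struct trace_entry table[TABLE_SLOTS];

/* FNV-1a over the bytes of each frame address. */
static uint32_t trace_hash(const uintptr_t *frames)
{
	uint32_t h = 2166136261u;

	for (int i = 0; i < TRACE_DEPTH; i++) {
		uintptr_t v = frames[i];

		for (size_t b = 0; b < sizeof(v); b++) {
			h ^= (uint32_t)(v & 0xff);
			h *= 16777619u;
			v >>= 8;
		}
	}
	return h;
}

/*
 * Account one allocation against its trace. Linear probing finds either
 * the existing entry or the first free slot. Returns NULL if the table
 * is full (where a real implementation might resize lazily).
 */
static struct trace_entry *trace_record(const uintptr_t *frames, size_t size)
{
	uint32_t start;

	assert(frames != NULL);
	start = trace_hash(frames) & (TABLE_SLOTS - 1);

	for (uint32_t i = 0; i < TABLE_SLOTS; i++) {
		struct trace_entry *e = &table[(start + i) & (TABLE_SLOTS - 1)];

		if (!e->used) {
			memcpy(e->frames, frames, sizeof(e->frames));
			e->used = 1;
		} else if (memcmp(e->frames, frames, sizeof(e->frames)) != 0) {
			continue; /* slot holds a different trace, keep probing */
		}
		e->count++;
		e->bytes += size;
		return e;
	}
	return NULL;
}
```

Two calls with the same frames land in the same entry and only bump its
counters; a different trace gets its own slot.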
--
Michal Hocko
SUSE Labs
On Fri, May 5, 2023 at 1:40 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 04-05-23 09:22:07, Suren Baghdasaryan wrote:
> [...]
> > > But even then I really detest an additional allocation from this context
> > > for every single allocation request. There is a GFP_NOWAIT allocation for
> > > stackdepot but that is at least cached and generally not allocating.
> > > This will allocate for every single allocation.
> >
> > A small correction here. alloc_tag_create_ctx() is used only for
> > allocations which we requested to capture the context. So, this last
> > sentence is true for allocations we specifically marked to capture the
> > context, not in general.
>
> Ohh, right. I have misunderstood that part. Slightly better, still
> potentially a scalability issue because hard-to-debug memory leaks
> usually use generic caches (for kmalloc). So this might still be a lot
> of objects to track.
Yes, generally speaking, if a single code location is allocating very
frequently then enabling context capture for it will generate many
callstack buffers.
Your note about use of generic caches makes me think we still have a
small misunderstanding. We tag at the allocation call site, not based
on which cache is used. Two kmalloc calls from different code
locations will have unique codetags for each, so enabling context
capture for one would not result in context capturing for the other
one.
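
As an illustration of per-call-site tagging (a userspace sketch, not the
patchset's macros): each expansion of the macro below defines its own
static tag, so two call sites never share counters even though both end up
in the same underlying allocator. struct site_tag, tagged_malloc() and
TAGGED_ALLOC() are hypothetical names, and the tagp argument exists only
so the example can inspect which tag was charged.

```c
#include <assert.h>
#include <stdlib.h>

struct site_tag {
	const char *file;
	int line;
	unsigned long calls; /* allocations made from this site */
	size_t bytes;        /* bytes allocated from this site */
};

/* Shared allocator: charges whichever tag the call site passed in. */
static void *tagged_malloc(struct site_tag *tag, size_t size)
{
	void *p = malloc(size);

	if (p) {
		tag->calls++;
		tag->bytes += size;
	}
	return p;
}

/*
 * Every textual use of this macro gets a distinct static tag, created at
 * compile time and identified by file and line.
 */
#define TAGGED_ALLOC(ptr, size, tagp) do {                              \
	static struct site_tag site_tag_ = { __FILE__, __LINE__, 0, 0 }; \
	(ptr) = tagged_malloc(&site_tag_, (size));                       \
	*(tagp) = &site_tag_;                                            \
} while (0)
```

Calling TAGGED_ALLOC() from two different lines yields two different tags;
calling it repeatedly from one line (for example in a loop) accumulates
into a single tag.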
>
> > > There must be a better way.
> >
> > Yeah, agree, it would be good to avoid allocations in this path. Any
> > specific ideas on how to improve this? Pooling/caching perhaps? I
> > think kmem_cache does some of that already but maybe something else?
>
> The best I can come up with is a preallocated hash table to store
> references to stack depots with some additional data associated. The
> memory overhead could be still quite big but the hash tables could be
> resized lazily.
Ok, that seems like the continuation of your suggestion in another
thread to combine identical callstack traces. That's an excellent
idea! I think it would not be hard to implement. Thanks!
> --
> Michal Hocko
> SUSE Labs
On Thu 04-05-23 08:08:13, Suren Baghdasaryan wrote:
> On Thu, May 4, 2023 at 2:07 AM Michal Hocko <[email protected]> wrote:
[...]
> > e.g. is it really interesting to know that there is a likely memory
> > leak in seq_file proper doing an allocation? No, as it is the specific
> > implementation using seq_file that is leaking most likely. There are
> > other examples like that. See?
>
> Yes, I see that. One level tracking does not provide all the
> information needed to track such issues. Something more informative
> would cost more. That's why our proposal is to have a light-weight
> mechanism to get a high level picture and then be able to zoom into a
> specific area using context capture. If you have ideas to improve
> this, I'm open to suggestions.
Well, I think that a more scalable approach would be to not track in
callers but in the allocator itself. The full stack trace might not be
all that important or interesting and maybe even increase the overall
overhead but a partial one with a configurable depth would sound more
interesting to me. A per-cache hashtable indexed by stack trace reference
and extending slab metadata to store the reference for kfree path won't
be free but the overhead might be just acceptable.
If the stack unwinding is really too expensive for tracking another
option would be to add code tags dynamically to the compiled
kernel without any actual code changes. I can imagine the tracing
infrastructure could be used for that or maybe even consider compiler
plugins to inject code for functions marked as allocators. So the kernel
could be instrumented even without any userspace tooling required by
users directly.
--
Michal Hocko
SUSE Labs
On Sun, May 07, 2023 at 12:27:18PM +0200, Michal Hocko wrote:
> On Thu 04-05-23 08:08:13, Suren Baghdasaryan wrote:
> > On Thu, May 4, 2023 at 2:07 AM Michal Hocko <[email protected]> wrote:
> [...]
> > > e.g. is it really interesting to know that there is a likely memory
> > > leak in seq_file proper doing an allocation? No, as it is the specific
> > > implementation using seq_file that is leaking most likely. There are
> > > other examples like that. See?
> >
> > Yes, I see that. One level tracking does not provide all the
> > information needed to track such issues. Something more informative
> > would cost more. That's why our proposal is to have a light-weight
> > mechanism to get a high level picture and then be able to zoom into a
> > specific area using context capture. If you have ideas to improve
> > this, I'm open to suggestions.
>
> Well, I think that a more scalable approach would be to not track in
> callers but in the allocator itself. The full stack trace might not be
> all that important or interesting and maybe even increase the overall
> overhead but a partial one with a configurable depth would sound more
> interesting to me. A per-cache hashtable indexed by stack trace reference
> and extending slab metadata to store the reference for kfree path won't
> be free but the overhead might be just acceptable.
How would you propose to annotate what call chains need what depth of
stack trace recorded?
How would you propose to make this performant?
On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote:
> No. I am mostly concerned about the _maintenance_ overhead. For the
> bare tracking (without profiling and thus stack traces) only those
> allocations that are directly inlined into the consumer are really
> of any use. That increases the code impact of the tracing because any
> relevant allocation location has to go through the micro surgery.
>
> e.g. is it really interesting to know that there is a likely memory
> leak in seq_file proper doing an allocation? No, as it is the specific
> implementation using seq_file that is leaking most likely. There are
> other examples like that. See?
So this is a rather strange usage of "maintenance overhead" :)
But it's something we thought of. If we had to plumb around a _RET_IP_
parameter, or a codetag pointer, it would be a hassle annotating the
correct callsite.
Instead, alloc_hooks() wraps a memory allocation function and stashes a
pointer to a codetag in task_struct for use by the core slub/buddy
allocator code.
That means that in your example, to move tracking to a given seq_file
function, we just:
- hook the seq_file function with alloc_hooks
- change the seq_file function to call non-hooked memory allocation
functions.
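
A minimal userspace sketch of that mechanism, under stated assumptions: a
global pointer stands in for the codetag pointer kept in task_struct, the
hook macro sets it around the wrapped call, and the "core allocator"
charges whatever tag it finds (or counts the allocation as untagged, where
the kernel would warn). All names ending in _sketch, plus current_tag,
last_hooked_tag and core_alloc(), are hypothetical and are not the
patchset's definitions.

```c
#include <assert.h>
#include <stdlib.h>

struct alloc_tag_sketch {
	const char *file;
	int line;
	unsigned long calls;
	size_t bytes;
};

/* Stand-in for the per-task codetag pointer (per task in the kernel). */
static struct alloc_tag_sketch *current_tag;
/* Only so the example can look at the tag a hook created. */
static struct alloc_tag_sketch *last_hooked_tag;
/* Allocations that reached the core allocator without any tag set. */
static unsigned long untagged_allocs;

/* Core allocator: accounts against the stashed tag, if any. */
static void *core_alloc(size_t size)
{
	void *p = malloc(size);

	if (!p)
		return NULL;
	if (current_tag) {
		current_tag->calls++;
		current_tag->bytes += size;
	} else {
		untagged_allocs++; /* the kernel would emit a warning here */
	}
	return p;
}

/* Hook: create a static tag for this call site and stash it for the call. */
#define ALLOC_HOOKS_SKETCH(out, call) do {                                 \
	static struct alloc_tag_sketch tag_ = { __FILE__, __LINE__, 0, 0 }; \
	struct alloc_tag_sketch *saved_ = current_tag;                      \
	current_tag = last_hooked_tag = &tag_;                              \
	(out) = (call);                                                     \
	current_tag = saved_;                                               \
} while (0)

/*
 * Unhooked helper, playing the role of a seq_file function: it calls the
 * core allocator directly, so the charge goes to its hooked caller.
 */
static void *seq_buf_alloc_sketch(size_t size)
{
	return core_alloc(size);
}
```

Because the helper itself is not hooked, each caller that wraps it with
ALLOC_HOOKS_SKETCH() gets its own tag, which is the "move tracking up one
level" step described in the two bullets above.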
> It would have been more convincing if you had some numbers at hand.
> E.g. this is a typical workload we are dealing with. With the compile
> time tags we are able to learn this with that much of cost. With a dynamic
> tracing we are able to learn this much with that cost. See? As small as
> possible is a rather vague term that different people will have a very
> different idea about.
Engineers don't prototype and benchmark everything as a matter of
course, we're expected to have the rough equivalent of a CS education
and an understanding of big O notation, cache architecture, etc.
The slub fast path is _really_ fast - double word non locked cmpxchg.
That's what we're trying to compete with. Adding a big globally
accessible hash table is going to tank performance compared to that.
I believe the numbers we already posted speak for themselves. We're
considerably faster than memcg, fast enough to run in production.
I'm not going to be switching to a design that significantly regresses
performance, sorry :)
> TBH I am much more concerned about the maintenance burden on the MM side
> than the actual code tagging itself which is much more self contained. I
> haven't seen other potential applications of the same infrastructure and
> maybe the code impact would be much smaller than in the MM proper. Our
> allocator API is really hairy and convoluted.
You keep saying "maintenance burden", but this is a criticism that can
be directed at _any_ patchset that adds new code; it's generally
understood that that is the accepted cost for new functionality.
If you have specific concerns where you think we did something that
makes the code harder to maintain, _please point them out in the
appropriate patch_. I don't think you'll find too much - the
instrumentation in the allocators simply generalizes what memcg was
already doing, and the hooks themselves are a bit boilerplaty but hardly
the sort of thing people will be tripping over later.
TL;DR - put up or shut up :)
On Sun, 7 May 2023 13:20:55 -0400
Kent Overstreet <[email protected]> wrote:
> On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote:
> > No. I am mostly concerned about the _maintenance_ overhead. For the
> > bare tracking (without profiling and thus stack traces) only those
> > allocations that are directly inlined into the consumer are really
> > of any use. That increases the code impact of the tracing because any
> > relevant allocation location has to go through the micro surgery.
> >
> > e.g. is it really interesting to know that there is a likely memory
> > leak in seq_file proper doing an allocation? No, as it is the specific
> > implementation using seq_file that is leaking most likely. There are
> > other examples like that. See?
>
> So this is a rather strange usage of "maintenance overhead" :)
>
> But it's something we thought of. If we had to plumb around a _RET_IP_
> parameter, or a codetag pointer, it would be a hassle annotating the
> correct callsite.
>
> Instead, alloc_hooks() wraps a memory allocation function and stashes a
> pointer to a codetag in task_struct for use by the core slub/buddy
> allocator code.
>
> That means that in your example, to move tracking to a given seq_file
> function, we just:
> - hook the seq_file function with alloc_hooks
> - change the seq_file function to call non-hooked memory allocation
> functions.
>
> > It would have been more convincing if you had some numbers at hand.
> > E.g. this is a typical workload we are dealing with. With the compile
> > time tags we are able to learn this with that much of cost. With a dynamic
> > tracing we are able to learn this much with that cost. See? As small as
> > possible is a rather vague term that different people will have a very
> > different idea about.
>
> Engineers don't prototype and benchmark everything as a matter of
> course, we're expected to have the rough equivalent of a CS education
> and an understanding of big O notation, cache architecture, etc.
>
> The slub fast path is _really_ fast - double word non locked cmpxchg.
> That's what we're trying to compete with. Adding a big globally
> accessible hash table is going to tank performance compared to that.
>
> I believe the numbers we already posted speak for themselves. We're
> considerably faster than memcg, fast enough to run in production.
>
> I'm not going to be switching to a design that significantly regresses
> performance, sorry :)
>
> > TBH I am much more concerned about the maintenance burden on the MM side
> > than the actual code tagging itself which is much more self contained. I
> > haven't seen other potential applications of the same infrastructure and
> > maybe the code impact would be much smaller than in the MM proper. Our
> > allocator API is really hairy and convoluted.
>
> You keep saying "maintenance burden", but this is a criticism that can
> be directed at _any_ patchset that adds new code; it's generally
> understood that that is the accepted cost for new functionality.
>
> If you have specific concerns where you think we did something that
> makes the code harder to maintain, _please point them out in the
> appropriate patch_. I don't think you'll find too much - the
> instrumentation in the allocators simply generalizes what memcg was
> already doing, and the hooks themselves are a bit boilerplaty but hardly
> the sort of thing people will be tripping over later.
>
> TL;DR - put up or shut up :)
Your email would have been much better if you left the above line out. :-/
Comments like the above do not go over well via text, even if you add the ":)".
Back to the comment about this being a burden. I just applied all the
patches and did a diff (much easier than to wade through 40 patches!)
One thing we need to get rid of, and this isn't your fault but this
series is extending it, is the use of the damn underscores to
differentiate functions. This is one of the abominations of the early
Linux kernel code base. I admit, I'm guilty of this too. But today I
have learned and avoid it at all cost. Underscores are meaningless and
error prone, not to mention confusing to people coming onboard. Let's
use something that has some meaning.
What's the difference between:
_kmem_cache_alloc_node() and __kmem_cache_alloc_node()?
And if every allocation function requires a double hook, that is a
maintenance burden. We do this for things like system calls, but
there's a strong rationale for that. I'm guessing that Michal's concern
is that he and other mm maintainers will need to make sure any new
allocation function has this double call and is done properly. This
isn't just new code that needs to be maintained, it's something that
needs to be understood when adding any new interface to page
allocations.
It's true that all new code has a maintenance burden, and unless the
maintainer feels the burden is worth their time, they have the right to
complain about it.
I've given talks about how to get code into open source projects, and
the title is "Commits are pulled and never pushed". Where basically I
talk about convincing the maintainers that they want your change, and
not by pushing it because you want it.
-- Steve
On Sun, May 07, 2023 at 04:55:38PM -0400, Steven Rostedt wrote:
> > TL;DR - put up or shut up :)
>
> Your email would have been much better if you left the above line out. :-/
> Comments like the above do not go over well via text, even if you add the ":)".
I stand by that comment :)
> Back to the comment about this being a burden. I just applied all the
> patches and did a diff (much easier than to wade through 40 patches!)
>
> One thing we need to get rid of, and this isn't your fault but this
> series is extending it, is the use of the damn underscores to
> differentiate functions. This is one of the abominations of the early
> Linux kernel code base. I admit, I'm guilty of this too. But today I
> have learned and avoid it at all cost. Underscores are meaningless and
> error prone, not to mention confusing to people coming onboard. Let's
> use something that has some meaning.
>
> What's the difference between:
>
> _kmem_cache_alloc_node() and __kmem_cache_alloc_node()?
>
> And if every allocation function requires a double hook, that is a
> maintenance burden. We do this for things like system calls, but
> there's a strong rationale for that.
The underscore is a legitimate complaint - I brought this up in
development, not sure why it got lost. We'll do something better with a
consistent suffix, perhaps kmem_cache_alloc_noacct().
> I'm guessing that Michal's concern is that he and other mm maintainers
> will need to make sure any new allocation function has this double
> call and is done properly. This isn't just new code that needs to be
> maintained, it's something that needs to be understood when adding any
> new interface to page allocations.
Well, isn't that part of the problem then? We're _this far_ into the
thread and still guessing on what Michal's "maintenance concerns" are?
Regarding your specific concern: My main design consideration was making
sure every allocation gets accounted somewhere; we don't want a memory
allocation profiling system where it's possible for allocations to be
silently not tracked! There's warnings in the core allocators if they
see an allocation without an alloc tag, and in testing we chased down
everything we found.
So if anyone later creates a new memory allocation interface and forgets
to hook it, they'll see the same warning - but perhaps we could improve
the warning message so it says exactly what needs to be done (wrap the
allocation in an alloc_hooks() call).
> It's true that all new code has a maintenance burden, and unless the
> maintainer feels the burden is worth their time, they have the right to
> complain about it.
Sure, but complaints should say what they're complaining about.
Complaints so vague they could be levelled at any patchset don't do
anything for the discussion.
On Sun, 7 May 2023 17:53:09 -0400
Kent Overstreet <[email protected]> wrote:
> The underscore is a legitimate complaint - I brought this up in
> development, not sure why it got lost. We'll do something better with a
> consistent suffix, perhaps kmem_cache_alloc_noacct().
Would "_noprofile()" be a better name? I'm not sure what "acct" means.
-- Steve
On Sun, May 07, 2023 at 06:09:11PM -0400, Steven Rostedt wrote:
> On Sun, 7 May 2023 17:53:09 -0400
> Kent Overstreet <[email protected]> wrote:
>
> > The underscore is a legitimate complaint - I brought this up in
> > development, not sure why it got lost. We'll do something better with a
> > consistent suffix, perhaps kmem_cache_alloc_noacct().
>
> Would "_noprofile()" be a better name? I'm not sure what "acct" means.
account - but _noprofile() is probably better. Didn't suggest it at
first because of an association in my head with CPU profiling, but
considering we already renamed the feature to memory allocation
profiling... :)
On Sun, 7 May 2023 13:20:55 -0400
Kent Overstreet <[email protected]> wrote:
> On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote:
> > No. I am mostly concerned about the _maintenance_ overhead. For the
> > bare tracking (without profiling and thus stack traces) only those
> > allocations that are directly inlined into the consumer are really
> > of any use. That increases the code impact of the tracing because any
> > relevant allocation location has to go through the micro surgery.
> >
> > e.g. is it really interesting to know that there is a likely memory
> > leak in seq_file proper doing an allocation? No, as it is the specific
> > implementation using seq_file that is leaking most likely. There are
> > other examples like that. See?
>
> So this is a rather strange usage of "maintenance overhead" :)
>
> But it's something we thought of. If we had to plumb around a _RET_IP_
> parameter, or a codetag pointer, it would be a hassle annotating the
> correct callsite.
>
> Instead, alloc_hooks() wraps a memory allocation function and stashes a
> pointer to a codetag in task_struct for use by the core slub/buddy
> allocator code.
>
> That means that in your example, to move tracking to a given seq_file
> function, we just:
> - hook the seq_file function with alloc_hooks
Thank you. That's exactly what I was trying to point out. So you hook
seq_buf_alloc(), just to find out it's called from traverse(), which
is not very helpful either. So, you hook traverse(), which sounds quite
generic. Yes, you're lucky, because it is a static function, and the
identifier is not actually used anywhere else (right now), but each
time you want to hook something, you must make sure it does not
conflict with any other identifier in the kernel...
Petr T
On Mon, 8 May 2023 11:57:10 -0400
Kent Overstreet <[email protected]> wrote:
> On Mon, May 08, 2023 at 05:52:06PM +0200, Petr Tesařík wrote:
> > On Sun, 7 May 2023 13:20:55 -0400
> > Kent Overstreet <[email protected]> wrote:
> >
> > > On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote:
> > > > No. I am mostly concerned about the _maintenance_ overhead. For the
> > > > bare tracking (without profiling and thus stack traces) only those
> > > > allocations that are directly inlined into the consumer are really
> > > > of any use. That increases the code impact of the tracing because any
> > > > relevant allocation location has to go through the micro surgery.
> > > >
> > > > e.g. is it really interesting to know that there is a likely memory
> > > > leak in seq_file proper doing an allocation? No, as it is the specific
> > > > implementation using seq_file that is leaking most likely. There are
> > > > other examples like that. See?
> > >
> > > So this is a rather strange usage of "maintenance overhead" :)
> > >
> > > But it's something we thought of. If we had to plumb around a _RET_IP_
> > > parameter, or a codetag pointer, it would be a hassle annotating the
> > > correct callsite.
> > >
> > > Instead, alloc_hooks() wraps a memory allocation function and stashes a
> > > pointer to a codetag in task_struct for use by the core slub/buddy
> > > allocator code.
> > >
> > > That means that in your example, to move tracking to a given seq_file
> > > function, we just:
> > > - hook the seq_file function with alloc_hooks
> >
> > Thank you. That's exactly what I was trying to point out. So you hook
> > seq_buf_alloc(), just to find out it's called from traverse(), which
> > is not very helpful either. So, you hook traverse(), which sounds quite
> > generic. Yes, you're lucky, because it is a static function, and the
> > identifier is not actually used anywhere else (right now), but each
> > time you want to hook something, you must make sure it does not
> > conflict with any other identifier in the kernel...
>
> Cscope makes quick and easy work of this kind of stuff.
Sure, although AFAIK the index does not cover all possible config
options (so non-x86 arch code is often forgotten). However, that's the
less important part.
What do you do if you need to hook something that does conflict with an
existing identifier?
Petr T
On Mon, May 08, 2023 at 05:52:06PM +0200, Petr Tesařík wrote:
> On Sun, 7 May 2023 13:20:55 -0400
> Kent Overstreet <[email protected]> wrote:
>
> > On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote:
> > > No. I am mostly concerned about the _maintenance_ overhead. For the
> > > bare tracking (without profiling and thus stack traces) only those
> > > allocations that are directly inlined into the consumer are really
> > > of any use. That increases the code impact of the tracing because any
> > > relevant allocation location has to go through the micro surgery.
> > >
> > > e.g. is it really interesting to know that there is a likely memory
> > > leak in seq_file proper doing an allocation? No, as it is the specific
> > > implementation using seq_file that is leaking most likely. There are
> > > other examples like that. See?
> >
> > So this is a rather strange usage of "maintenance overhead" :)
> >
> > But it's something we thought of. If we had to plumb around a _RET_IP_
> > parameter, or a codetag pointer, it would be a hassle annotating the
> > correct callsite.
> >
> > Instead, alloc_hooks() wraps a memory allocation function and stashes a
> > pointer to a codetag in task_struct for use by the core slub/buddy
> > allocator code.
> >
> > That means that in your example, to move tracking to a given seq_file
> > function, we just:
> > - hook the seq_file function with alloc_hooks
>
> Thank you. That's exactly what I was trying to point out. So you hook
> seq_buf_alloc(), just to find out it's called from traverse(), which
> is not very helpful either. So, you hook traverse(), which sounds quite
> generic. Yes, you're lucky, because it is a static function, and the
> identifier is not actually used anywhere else (right now), but each
> time you want to hook something, you must make sure it does not
> conflict with any other identifier in the kernel...
Cscope makes quick and easy work of this kind of stuff.
On Mon, May 08, 2023 at 06:09:13PM +0200, Petr Tesařík wrote:
> Sure, although AFAIK the index does not cover all possible config
> options (so non-x86 arch code is often forgotten). However, that's the
> less important part.
>
> What do you do if you need to hook something that does conflict with an
> existing identifier?
As already happens in this patchset, rename the other identifier.
But this is C, we avoid these kinds of conflicts already because the
language has no namespacing - it's going to be a pretty rare situation
going forward. Most of the hooking that will be done is done with this
patchset, and there was only one identifier that needed to be renamed.
On Mon, 8 May 2023 12:28:52 -0400
Kent Overstreet <[email protected]> wrote:
> On Mon, May 08, 2023 at 06:09:13PM +0200, Petr Tesařík wrote:
> > Sure, although AFAIK the index does not cover all possible config
> > options (so non-x86 arch code is often forgotten). However, that's the
> > less important part.
> >
> > What do you do if you need to hook something that does conflict with an
> > existing identifier?
>
> As already happens in this patchset, rename the other identifier.
>
> But this is C, we avoid these kinds of conflicts already because the
> language has no namespacing
This statement is not accurate, but I agree there's not much. Refer to
section 6.2.3 of ISO/IEC 9899:2018 (Name spaces of identifiers).
More importantly, macros also interfere with identifier scoping, e.g.
you cannot even have a local variable with the same name as a macro.
That's why I dislike macros so much.
But since there's no clear policy regarding macros in the kernel, I'm
merely showing a downside; it's perfectly fine to write kernel code
like this as long as the maintainers agree that the limitation is
acceptable and outweighed by the benefits.
Petr T
> it's going to be a pretty rare situation
> going forward. Most of the hooking that will be done is done with this
> patchset, and there was only one identifier that needed to be renamed.
>
On Mon, May 08, 2023 at 08:59:39PM +0200, Petr Tesařík wrote:
> On Mon, 8 May 2023 12:28:52 -0400
> Kent Overstreet <[email protected]> wrote:
>
> > On Mon, May 08, 2023 at 06:09:13PM +0200, Petr Tesařík wrote:
> > > Sure, although AFAIK the index does not cover all possible config
> > > options (so non-x86 arch code is often forgotten). However, that's the
> > > less important part.
> > >
> > > What do you do if you need to hook something that does conflict with an
> > > existing identifier?
> >
> > As already happens in this patchset, rename the other identifier.
> >
> > But this is C, we avoid these kinds of conflicts already because the
> > language has no namespacing
>
> This statement is not accurate, but I agree there's not much. Refer to
> section 6.2.3 of ISO/IEC 9899:2018 (Name spaces of identifiers).
>
> More importantly, macros also interfere with identifier scoping, e.g.
> you cannot even have a local variable with the same name as a macro.
> That's why I dislike macros so much.
Shadowing a global identifier like that would at best be considered poor
style, so I don't see this as a major downside.
> But since there's no clear policy regarding macros in the kernel, I'm
> merely showing a downside; it's perfectly fine to write kernel code
> like this as long as the maintainers agree that the limitation is
> acceptable and outweighed by the benefits.
Macros do have lots of tricky downsides, but in general we're not shy
about using them for things that can't be done otherwise; see
wait_event(), all of tracing...
I think we could in general do a better job of making the macros _themselves_
more manageable; when writing things that need to be macros I'll often
have just the wrapper as a macro and write the bulk as inline functions.
See the generic radix tree code for example.
Reflection is a major use case for macros, and the underlying mechanism
here - code tagging - is something worth talking about more, since it's
codifying something that's been done ad-hoc in the kernel for a long
time and something we hope to refactor other existing code to use,
including tracing - I've got a patch already written to convert the
dynamic debug code to code tagging; it's a nice -200 loc cleanup.
Regarding the alloc_hooks() macro itself specifically, I've got more
plans for it. I have another patch series after this one that implements
code tagging based fault injection, which is _far_ more ergonomic to use
than our existing fault injection capabilities (and this matters! Fault
injection is a really important tool for getting good test coverage, but
tools that are a pain in the ass to use don't get used) - and with the
alloc_hooks() macro already in place, we'll be able to turn _every
individual memory allocation callsite_ into a distinct, individually
selectable fault injection point - which is something our existing fault
injection framework attempts at but doesn't really manage.
If we can get this in, it'll make it really easy to write unit tests
that iterate over every memory allocation site in a given module,
individually telling them to fail, run a workload until they hit, and
verify that the code path being tested was executed. It'll nicely
complement the fuzz testing capabilities that we've been working on,
particularly in filesystem land.