2022-08-30 22:10:09

by Suren Baghdasaryan

Subject: [RFC PATCH 00/30] Code tagging framework and applications

===========================
Code tagging framework
===========================
A code tag is a structure identifying a specific location in the source
code. It is generated at compile time and can be embedded in an
application-specific structure. Several applications of code tagging
are included in this RFC, such as memory allocation tracking, dynamic
fault injection, latency tracking and improved error code reporting.
Basically, it takes the old trick of "define a special ELF section for
objects of a given type so that we can iterate over them at runtime"
and creates a proper library for it.

===========================
Memory allocation tracking
===========================
The goal of using codetags for memory allocation tracking is to
minimize performance and memory overhead. By recording only the call
count and allocation size, the required operations are kept to a
minimum while statistics are still collected for every allocation in
the codebase. With that information, if users are interested in more
detailed context for a specific allocation, they can enable in-depth
context tracking, which captures the pid, tgid, task name, allocation
size, timestamp and call stack for every allocation at the specified
code location.
Memory allocation tracking is implemented in two parts:

part 1: instruments the page and slab allocators to record the call
count and total memory allocated at every allocation site in the
source code. Every time an allocation is performed by an instrumented
allocator, the codetag at that location increments its call and size
counters. Every time the memory is freed, these counters are
decremented. To decrement the counters upon free, the allocated object
needs a reference to its codetag. Page allocators use page_ext to
record this reference, while slab allocators use the memcg_data field
of the slab page.
The data is exposed to user space via a read-only debugfs file called
alloc_tags.

Usage example:

$ sort -hr /sys/kernel/debug/alloc_tags|head
153MiB 8599 mm/slub.c:1826 module:slub func:alloc_slab_page
6.08MiB 49 mm/slab_common.c:950 module:slab_common func:_kmalloc_order
5.09MiB 6335 mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
4.54MiB 78 mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
1.32MiB 338 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
1.16MiB 603 fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
1.00MiB 256 mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
734KiB 5380 fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
640KiB 160 kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
640KiB 160 drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf

part 2: adds support for selecting a specific code location to capture
allocation context. A new debugfs file called alloc_tags.ctx is used
to select which code location should capture allocation context and to
read the captured context information.

Usage example:

$ cd /sys/kernel/debug/
$ echo "file include/asm-generic/pgalloc.h line 63 enable" > alloc_tags.ctx
$ cat alloc_tags.ctx
920KiB 230 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
size: 4096
pid: 1474
tgid: 1474
comm: bash
ts: 175332940994
call stack:
pte_alloc_one+0xfe/0x130
__pte_alloc+0x22/0xb0
copy_page_range+0x842/0x1640
dup_mm+0x42d/0x580
copy_process+0xfb1/0x1ac0
kernel_clone+0x92/0x3e0
__do_sys_clone+0x66/0x90
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
...

NOTE: slab allocation tracking is not yet stable and has a leak that
shows up in long-running tests. We are working on fixing it and posting
the RFC early to collect some feedback and to have a reference code in
public before presenting the idea at LPC2022.

===========================
Dynamic fault injection
===========================
Dynamic fault injection lets you do fault injection with a single call
to dynamic_fault(), with a debugfs interface similar to dynamic_debug.

Calls to dynamic_fault are listed in debugfs and can be enabled at
runtime (oneshot mode or a defined frequency are also available). This
patch also uses the memory allocation wrapper macros introduced by the
memory allocation tracking patches to add distinct fault injection
points for every memory allocation in the kernel.

Example fault injection points, after hooking memory allocation paths:

fs/xfs/libxfs/xfs_iext_tree.c:606 module:xfs func:xfs_iext_realloc_root class:memory disabled "
fs/xfs/libxfs/xfs_inode_fork.c:503 module:xfs func:xfs_idata_realloc class:memory disabled "
fs/xfs/libxfs/xfs_inode_fork.c:399 module:xfs func:xfs_iroot_realloc class:memory disabled "
fs/xfs/xfs_buf.c:373 module:xfs func:xfs_buf_alloc_pages class:memory disabled "
fs/xfs/xfs_iops.c:497 module:xfs func:xfs_vn_get_link class:memory disabled "
fs/xfs/xfs_mount.c:85 module:xfs func:xfs_uuid_mount class:memory disabled "

===========================
Latency tracking
===========================
This lets you instrument code for measuring latency with just two calls
to code_tag_time_stats_start() and code_tag_time_stats_finish(), and
makes statistics available in debugfs on a per-callsite basis.

Recorded statistics include total count, frequency/rate, average
duration, max duration, and event duration quantiles.

Additionally, this patch instruments prepare_to_wait() and finish_wait().

Example output:

fs/xfs/xfs_extent_busy.c:589 module:xfs func:xfs_extent_busy_flush
count: 61
rate: 0/sec
frequency: 19 sec
avg duration: 632 us
max duration: 2 ms
quantiles (us): 274 288 288 296 296 296 296 336 336 336 336 336 336 336 336

===========================
Improved error codes
===========================
Ever waste hours trying to figure out which line of code from some
obscure module is returning you -EINVAL and nothing else?

What if we had... more error codes?

This patch adds ERR(), which returns a unique error code related to
the error code passed to it: the original error code can be recovered
with error_class(), and errname() (as well as %pE) returns an error
string that includes the file and line number of the ERR() call.

Example output:

VFS: Cannot open root device "sda" or unknown-block(8,0): error -EINVAL at fs/ext4/super.c:4387

===========================
Dynamic debug conversion to code tagging
===========================
There are several open-coded implementations of the "define a special
ELF section for objects and iterate" technique that should be
converted to code tagging. This series converts just dynamic debug;
there are others (several in ftrace, in particular) that should also
be converted.

===========================

The patchset applies cleanly on top of Linux 6.0-rc3.
The tree for testing is published at:
https://github.com/surenbaghdasaryan/linux/tree/alloc_tags_rfc

The structure of the patchset is:
- code tagging framework (patches 1-6)
- page allocation tracking (patches 7-10)
- slab allocation tracking (patches 11-16)
- allocation context capture (patches 17-21)
- dynamic fault injection (patch 22)
- latency tracking (patches 23-27)
- improved error codes (patch 28)
- dynamic debug conversion to code tagging (patch 29)
- MAINTAINERS update (patch 30)

Next steps:
- track down and fix the slab allocator leak mentioned earlier;
- instrument more allocators: vmalloc, per-cpu allocations, others?


Kent Overstreet (14):
lib/string_helpers: Drop space in string_get_size's output
Lazy percpu counters
scripts/kallysms: Always include __start and __stop symbols
lib/string.c: strsep_no_empty()
codetag: add codetag query helper functions
Code tagging based fault injection
timekeeping: Add a missing include
wait: Clean up waitqueue_entry initialization
lib/time_stats: New library for statistics on events
bcache: Convert to lib/time_stats
Code tagging based latency tracking
Improved symbolic error names
dyndbg: Convert to code tagging
MAINTAINERS: Add entries for code tagging & related

Suren Baghdasaryan (16):
kernel/module: move find_kallsyms_symbol_value declaration
lib: code tagging framework
lib: code tagging module support
lib: add support for allocation tagging
lib: introduce page allocation tagging
change alloc_pages name in dma_map_ops to avoid name conflicts
mm: enable page allocation tagging for __get_free_pages and
alloc_pages
mm: introduce slabobj_ext to support slab object extensions
mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext
creation
mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation
mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache
objects
lib: introduce slab allocation tagging
mm: enable slab allocation tagging for kmalloc and friends
move stack capture functionality into a separate function for reuse
lib: introduce support for storing code tag context
lib: implement context capture support for page and slab allocators

MAINTAINERS | 34 ++
arch/x86/kernel/amd_gart_64.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/md/bcache/Kconfig | 1 +
drivers/md/bcache/bcache.h | 1 +
drivers/md/bcache/bset.c | 8 +-
drivers/md/bcache/bset.h | 1 +
drivers/md/bcache/btree.c | 12 +-
drivers/md/bcache/super.c | 3 +
drivers/md/bcache/sysfs.c | 43 ++-
drivers/md/bcache/util.c | 30 --
drivers/md/bcache/util.h | 57 ---
drivers/xen/grant-dma-ops.c | 2 +-
drivers/xen/swiotlb-xen.c | 2 +-
include/asm-generic/codetag.lds.h | 18 +
include/asm-generic/vmlinux.lds.h | 8 +-
include/linux/alloc_tag.h | 84 +++++
include/linux/codetag.h | 159 +++++++++
include/linux/codetag_ctx.h | 48 +++
include/linux/codetag_time_stats.h | 54 +++
include/linux/dma-map-ops.h | 2 +-
include/linux/dynamic_debug.h | 11 +-
include/linux/dynamic_fault.h | 79 +++++
include/linux/err.h | 2 +-
include/linux/errname.h | 50 +++
include/linux/gfp.h | 10 +-
include/linux/gfp_types.h | 12 +-
include/linux/io_uring_types.h | 2 +-
include/linux/lazy-percpu-counter.h | 67 ++++
include/linux/memcontrol.h | 23 +-
include/linux/module.h | 1 +
include/linux/page_ext.h | 3 +-
include/linux/pgalloc_tag.h | 63 ++++
include/linux/sbitmap.h | 6 +-
include/linux/sched.h | 6 +-
include/linux/slab.h | 136 +++++---
include/linux/slab_def.h | 2 +-
include/linux/slub_def.h | 4 +-
include/linux/stackdepot.h | 3 +
include/linux/string.h | 1 +
include/linux/time_stats.h | 44 +++
include/linux/timekeeping.h | 1 +
include/linux/wait.h | 72 ++--
include/linux/wait_bit.h | 7 +-
init/Kconfig | 5 +
kernel/dma/mapping.c | 4 +-
kernel/module/internal.h | 3 -
kernel/module/main.c | 27 +-
kernel/sched/wait.c | 15 +-
lib/Kconfig | 6 +
lib/Kconfig.debug | 46 +++
lib/Makefile | 10 +
lib/alloc_tag.c | 391 +++++++++++++++++++++
lib/codetag.c | 519 ++++++++++++++++++++++++++++
lib/codetag_time_stats.c | 143 ++++++++
lib/dynamic_debug.c | 452 +++++++++---------------
lib/dynamic_fault.c | 372 ++++++++++++++++++++
lib/errname.c | 103 ++++++
lib/lazy-percpu-counter.c | 141 ++++++++
lib/pgalloc_tag.c | 22 ++
lib/stackdepot.c | 68 ++++
lib/string.c | 19 +
lib/string_helpers.c | 3 +-
lib/time_stats.c | 236 +++++++++++++
mm/kfence/core.c | 2 +-
mm/memcontrol.c | 62 ++--
mm/mempolicy.c | 4 +-
mm/page_alloc.c | 13 +-
mm/page_ext.c | 6 +
mm/page_owner.c | 54 +--
mm/slab.c | 4 +-
mm/slab.h | 125 ++++---
mm/slab_common.c | 49 ++-
mm/slob.c | 2 +
mm/slub.c | 7 +-
scripts/kallsyms.c | 13 +
scripts/module.lds.S | 7 +
77 files changed, 3406 insertions(+), 703 deletions(-)
create mode 100644 include/asm-generic/codetag.lds.h
create mode 100644 include/linux/alloc_tag.h
create mode 100644 include/linux/codetag.h
create mode 100644 include/linux/codetag_ctx.h
create mode 100644 include/linux/codetag_time_stats.h
create mode 100644 include/linux/dynamic_fault.h
create mode 100644 include/linux/lazy-percpu-counter.h
create mode 100644 include/linux/pgalloc_tag.h
create mode 100644 include/linux/time_stats.h
create mode 100644 lib/alloc_tag.c
create mode 100644 lib/codetag.c
create mode 100644 lib/codetag_time_stats.c
create mode 100644 lib/dynamic_fault.c
create mode 100644 lib/lazy-percpu-counter.c
create mode 100644 lib/pgalloc_tag.c
create mode 100644 lib/time_stats.c

--
2.37.2.672.g94769d06f0-goog


2022-08-30 22:11:18

by Suren Baghdasaryan

Subject: [RFC PATCH 23/30] timekeeping: Add a missing include

From: Kent Overstreet <[email protected]>

We need ktime.h for ktime_t.

Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/timekeeping.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index fe1e467ba046..7c43e98cf211 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -4,6 +4,7 @@

#include <linux/errno.h>
#include <linux/clocksource_ids.h>
+#include <linux/ktime.h>

/* Included from linux/ktime.h */

--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:12:01

by Suren Baghdasaryan

Subject: [RFC PATCH 14/30] mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache objects

Use __GFP_NO_OBJ_EXT to prevent recursions when allocating slabobj_ext
objects. Also prevent slabobj_ext allocations for kmem_cache objects.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/memcontrol.c | 2 ++
mm/slab.h | 6 ++++++
2 files changed, 8 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3f407ef2f3f1..dabb451dc364 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2809,6 +2809,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
void *vec;

gfp &= ~OBJCGS_CLEAR_MASK;
+ /* Prevent recursive extension vector allocation */
+ gfp |= __GFP_NO_OBJ_EXT;
vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
slab_nid(slab));
if (!vec)
diff --git a/mm/slab.h b/mm/slab.h
index c767ce3f0fe2..d93b22b8bbe2 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -475,6 +475,12 @@ static inline void prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags,
if (is_kmem_only_obj_ext())
return;

+ if (s->flags & SLAB_NO_OBJ_EXT)
+ return;
+
+ if (flags & __GFP_NO_OBJ_EXT)
+ return;
+
slab = virt_to_slab(p);
if (!slab_obj_exts(slab))
WARN(alloc_slab_obj_exts(slab, s, flags, false),
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:13:03

by Suren Baghdasaryan

Subject: [RFC PATCH 13/30] mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation

Slab extension objects can't be allocated before slab infrastructure is
initialized. Some caches, like kmem_cache and kmem_cache_node, are created
before slab infrastructure is initialized. Objects from these caches can't
have extension objects. Introduce SLAB_NO_OBJ_EXT slab flag to mark these
caches and avoid creating extensions for objects allocated from these
slabs.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/slab.h | 7 +++++++
mm/slab.c | 2 +-
mm/slub.c | 5 +++--
3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0fefdf528e0d..55ae3ea864a4 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -124,6 +124,13 @@
#define SLAB_RECLAIM_ACCOUNT ((slab_flags_t __force)0x00020000U)
#define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */

+#ifdef CONFIG_SLAB_OBJ_EXT
+/* Slab created using create_boot_cache */
+#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x20000000U)
+#else
+#define SLAB_NO_OBJ_EXT 0
+#endif
+
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
*
diff --git a/mm/slab.c b/mm/slab.c
index 10e96137b44f..ba97aeef7ec1 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1233,7 +1233,7 @@ void __init kmem_cache_init(void)
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
nr_node_ids * sizeof(struct kmem_cache_node *),
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
list_add(&kmem_cache->list, &slab_caches);
slab_state = PARTIAL;

diff --git a/mm/slub.c b/mm/slub.c
index 862dbd9af4f5..80199d5ac7c9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4825,7 +4825,8 @@ void __init kmem_cache_init(void)
node_set(node, slab_nodes);

create_boot_cache(kmem_cache_node, "kmem_cache_node",
- sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN, 0, 0);
+ sizeof(struct kmem_cache_node),
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);

register_hotmemory_notifier(&slab_memory_callback_nb);

@@ -4835,7 +4836,7 @@ void __init kmem_cache_init(void)
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
nr_node_ids * sizeof(struct kmem_cache_node *),
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);

kmem_cache = bootstrap(&boot_kmem_cache);
kmem_cache_node = bootstrap(&boot_kmem_cache_node);
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:14:10

by Suren Baghdasaryan

Subject: [RFC PATCH 22/30] Code tagging based fault injection

From: Kent Overstreet <[email protected]>

This adds a new fault injection capability, based on code tagging.

To use, simply insert somewhere in your code

dynamic_fault("fault_class_name")

and check whether it returns true - if so, inject the error.
For example

if (dynamic_fault("init"))
return -EINVAL;

There's no need to define faults elsewhere, as with
include/linux/fault-injection.h. Faults show up in debugfs, under
/sys/kernel/debug/dynamic_faults, and can be selected based on
file/module/function/line number/class, and enabled permanently, or in
oneshot mode, or with a specified frequency.

Signed-off-by: Kent Overstreet <[email protected]>
---
include/asm-generic/codetag.lds.h | 3 +-
include/linux/dynamic_fault.h | 79 +++++++
include/linux/slab.h | 3 +-
lib/Kconfig.debug | 6 +
lib/Makefile | 2 +
lib/dynamic_fault.c | 372 ++++++++++++++++++++++++++++++
6 files changed, 463 insertions(+), 2 deletions(-)
create mode 100644 include/linux/dynamic_fault.h
create mode 100644 lib/dynamic_fault.c

diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
index 64f536b80380..16fbf74edc3d 100644
--- a/include/asm-generic/codetag.lds.h
+++ b/include/asm-generic/codetag.lds.h
@@ -9,6 +9,7 @@
__stop_##_name = .;

#define CODETAG_SECTIONS() \
- SECTION_WITH_BOUNDARIES(alloc_tags)
+ SECTION_WITH_BOUNDARIES(alloc_tags) \
+ SECTION_WITH_BOUNDARIES(dynamic_fault_tags)

#endif /* __ASM_GENERIC_CODETAG_LDS_H */
diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
new file mode 100644
index 000000000000..526a33209e94
--- /dev/null
+++ b/include/linux/dynamic_fault.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_DYNAMIC_FAULT_H
+#define _LINUX_DYNAMIC_FAULT_H
+
+/*
+ * Dynamic/code tagging fault injection:
+ *
+ * Originally based on the dynamic debug trick of putting types in a special elf
+ * section, then rewritten using code tagging:
+ *
+ * To use, simply insert a call to dynamic_fault("fault_class"), which will
+ * return true if an error should be injected.
+ *
+ * Fault injection sites may be listed and enabled via debugfs, under
+ * /sys/kernel/debug/dynamic_faults.
+ */
+
+#ifdef CONFIG_CODETAG_FAULT_INJECTION
+
+#include <linux/codetag.h>
+#include <linux/jump_label.h>
+
+#define DFAULT_STATES() \
+ x(disabled) \
+ x(enabled) \
+ x(oneshot)
+
+enum dfault_enabled {
+#define x(n) DFAULT_##n,
+ DFAULT_STATES()
+#undef x
+};
+
+union dfault_state {
+ struct {
+ unsigned int enabled:2;
+ unsigned int count:30;
+ };
+
+ struct {
+ unsigned int v;
+ };
+};
+
+struct dfault {
+ struct codetag tag;
+ const char *class;
+ unsigned int frequency;
+ union dfault_state state;
+ struct static_key_false enabled;
+};
+
+bool __dynamic_fault_enabled(struct dfault *df);
+
+#define dynamic_fault(_class) \
+({ \
+ static struct dfault \
+ __used \
+ __section("dynamic_fault_tags") \
+ __aligned(8) df = { \
+ .tag = CODE_TAG_INIT, \
+ .class = _class, \
+ .enabled = STATIC_KEY_FALSE_INIT, \
+ }; \
+ \
+ static_key_false(&df.enabled.key) && \
+ __dynamic_fault_enabled(&df); \
+})
+
+#else
+
+#define dynamic_fault(_class) false
+
+#endif /* CODETAG_FAULT_INJECTION */
+
+#define memory_fault() dynamic_fault("memory")
+
+#endif /* _LINUX_DYNAMIC_FAULT_H */
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 89273be35743..4be5a93ed15a 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -17,6 +17,7 @@
#include <linux/types.h>
#include <linux/workqueue.h>
#include <linux/percpu-refcount.h>
+#include <linux/dynamic_fault.h>


/*
@@ -468,7 +469,7 @@ static inline void slab_tag_dec(const void *ptr) {}

#define krealloc_hooks(_p, _do_alloc) \
({ \
- void *_res = _do_alloc; \
+ void *_res = !memory_fault() ? _do_alloc : NULL; \
slab_tag_add(_p, _res); \
_res; \
})
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2790848464f1..b7d03afbc808 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1982,6 +1982,12 @@ config FAULT_INJECTION_STACKTRACE_FILTER
help
Provide stacktrace filter for fault-injection capabilities

+config CODETAG_FAULT_INJECTION
+ bool "Code tagging based fault injection"
+ select CODE_TAGGING
+ help
+ Dynamic fault injection based on code tagging
+
config ARCH_HAS_KCOV
bool
help
diff --git a/lib/Makefile b/lib/Makefile
index 99f732156673..489ea000c528 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -231,6 +231,8 @@ obj-$(CONFIG_CODE_TAGGING) += codetag.o
obj-$(CONFIG_ALLOC_TAGGING) += alloc_tag.o
obj-$(CONFIG_PAGE_ALLOC_TAGGING) += pgalloc_tag.o

+obj-$(CONFIG_CODETAG_FAULT_INJECTION) += dynamic_fault.o
+
lib-$(CONFIG_GENERIC_BUG) += bug.o

obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/dynamic_fault.c b/lib/dynamic_fault.c
new file mode 100644
index 000000000000..4c9cd18686be
--- /dev/null
+++ b/lib/dynamic_fault.c
@@ -0,0 +1,372 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/ctype.h>
+#include <linux/debugfs.h>
+#include <linux/dynamic_fault.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/seq_buf.h>
+
+static struct codetag_type *cttype;
+
+bool __dynamic_fault_enabled(struct dfault *df)
+{
+ union dfault_state old, new;
+ unsigned int v = df->state.v;
+ bool ret;
+
+ do {
+ old.v = new.v = v;
+
+ if (new.enabled == DFAULT_disabled)
+ return false;
+
+ ret = df->frequency
+ ? ++new.count >= df->frequency
+ : true;
+ if (ret)
+ new.count = 0;
+ if (ret && new.enabled == DFAULT_oneshot)
+ new.enabled = DFAULT_disabled;
+ } while ((v = cmpxchg(&df->state.v, old.v, new.v)) != old.v);
+
+ if (ret)
+ pr_debug("returned true for %s:%u", df->tag.filename, df->tag.lineno);
+
+ return ret;
+}
+EXPORT_SYMBOL(__dynamic_fault_enabled);
+
+static const char * const dfault_state_strs[] = {
+#define x(n) #n,
+ DFAULT_STATES()
+#undef x
+ NULL
+};
+
+static void dynamic_fault_to_text(struct seq_buf *out, struct dfault *df)
+{
+ codetag_to_text(out, &df->tag);
+ seq_buf_printf(out, "class:%s %s \"", df->class,
+ dfault_state_strs[df->state.enabled]);
+}
+
+struct dfault_query {
+ struct codetag_query q;
+
+ bool set_enabled:1;
+ unsigned int enabled:2;
+
+ bool set_frequency:1;
+ unsigned int frequency;
+};
+
+/*
+ * Search the tables for _dfault's which match the given
+ * `query' and apply the `flags' and `mask' to them. Tells
+ * the user which dfault's were changed, or whether none
+ * were matched.
+ */
+static int dfault_change(struct dfault_query *query)
+{
+ struct codetag_iterator ct_iter;
+ struct codetag *ct;
+ unsigned int nfound = 0;
+
+ codetag_lock_module_list(cttype, true);
+ codetag_init_iter(&ct_iter, cttype);
+
+ while ((ct = codetag_next_ct(&ct_iter))) {
+ struct dfault *df = container_of(ct, struct dfault, tag);
+
+ if (!codetag_matches_query(&query->q, ct, ct_iter.cmod, df->class))
+ continue;
+
+ if (query->set_enabled &&
+ query->enabled != df->state.enabled) {
+ if (query->enabled != DFAULT_disabled)
+ static_key_slow_inc(&df->enabled.key);
+ else if (df->state.enabled != DFAULT_disabled)
+ static_key_slow_dec(&df->enabled.key);
+
+ df->state.enabled = query->enabled;
+ }
+
+ if (query->set_frequency)
+ df->frequency = query->frequency;
+
+ pr_debug("changed %s:%d [%s]%s #%d %s",
+ df->tag.filename, df->tag.lineno, df->tag.modname,
+ df->tag.function, query->q.cur_index,
+ dfault_state_strs[df->state.enabled]);
+
+ nfound++;
+ }
+
+ pr_debug("dfault: %u matches", nfound);
+
+ codetag_lock_module_list(cttype, false);
+
+ return nfound ? 0 : -ENOENT;
+}
+
+#define DFAULT_TOKENS() \
+ x(disable, 0) \
+ x(enable, 0) \
+ x(oneshot, 0) \
+ x(frequency, 1)
+
+enum dfault_token {
+#define x(name, nr_args) TOK_##name,
+ DFAULT_TOKENS()
+#undef x
+};
+
+static const char * const dfault_token_strs[] = {
+#define x(name, nr_args) #name,
+ DFAULT_TOKENS()
+#undef x
+ NULL
+};
+
+static unsigned int dfault_token_nr_args[] = {
+#define x(name, nr_args) nr_args,
+ DFAULT_TOKENS()
+#undef x
+};
+
+static enum dfault_token str_to_token(const char *word, unsigned int nr_words)
+{
+ int tok = match_string(dfault_token_strs, ARRAY_SIZE(dfault_token_strs), word);
+
+ if (tok < 0) {
+ pr_debug("unknown keyword \"%s\"", word);
+ return tok;
+ }
+
+ if (nr_words < dfault_token_nr_args[tok]) {
+ pr_debug("insufficient arguments to \"%s\"", word);
+ return -EINVAL;
+ }
+
+ return tok;
+}
+
+static int dfault_parse_command(struct dfault_query *query,
+ enum dfault_token tok,
+ char *words[], size_t nr_words)
+{
+ unsigned int i = 0;
+ int ret;
+
+ switch (tok) {
+ case TOK_disable:
+ query->set_enabled = true;
+ query->enabled = DFAULT_disabled;
+ break;
+ case TOK_enable:
+ query->set_enabled = true;
+ query->enabled = DFAULT_enabled;
+ break;
+ case TOK_oneshot:
+ query->set_enabled = true;
+ query->enabled = DFAULT_oneshot;
+ break;
+ case TOK_frequency:
+ query->set_frequency = 1;
+ ret = kstrtouint(words[i++], 10, &query->frequency);
+ if (ret)
+ return ret;
+
+ if (!query->set_enabled) {
+ query->set_enabled = 1;
+ query->enabled = DFAULT_enabled;
+ }
+ break;
+ }
+
+ return i;
+}
+
+static int dynamic_fault_store(char *buf)
+{
+ struct dfault_query query = { NULL };
+#define MAXWORDS 9
+ char *tok, *words[MAXWORDS];
+ int ret, nr_words = 0, i = 0;
+
+ buf = codetag_query_parse(&query.q, buf);
+ if (IS_ERR(buf))
+ return PTR_ERR(buf);
+
+ while ((tok = strsep_no_empty(&buf, " \t\r\n"))) {
+ if (nr_words == ARRAY_SIZE(words))
+ return -EINVAL; /* ran out of words[] before bytes */
+ words[nr_words++] = tok;
+ }
+
+ while (i < nr_words) {
+ const char *tok_str = words[i++];
+ enum dfault_token tok = str_to_token(tok_str, nr_words - i);
+
+ if (tok < 0)
+ return tok;
+
+ ret = dfault_parse_command(&query, tok, words + i, nr_words - i);
+ if (ret < 0)
+ return ret;
+
+ i += ret;
+ BUG_ON(i > nr_words);
+ }
+
+ pr_debug("q->function=\"%s\" q->filename=\"%s\" "
+ "q->module=\"%s\" q->line=%u-%u\n q->index=%u-%u",
+ query.q.function, query.q.filename, query.q.module,
+ query.q.first_line, query.q.last_line,
+ query.q.first_index, query.q.last_index);
+
+ ret = dfault_change(&query);
+ if (ret < 0)
+ return ret;
+
+ return 0;
+}
+
+struct dfault_iter {
+ struct codetag_iterator ct_iter;
+
+ struct seq_buf buf;
+ char rawbuf[4096];
+};
+
+static int dfault_open(struct inode *inode, struct file *file)
+{
+ struct dfault_iter *iter;
+
+ iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter)
+ return -ENOMEM;
+
+ codetag_lock_module_list(cttype, true);
+ codetag_init_iter(&iter->ct_iter, cttype);
+ codetag_lock_module_list(cttype, false);
+
+ file->private_data = iter;
+ seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
+ return 0;
+}
+
+static int dfault_release(struct inode *inode, struct file *file)
+{
+ struct dfault_iter *iter = file->private_data;
+
+ kfree(iter);
+ return 0;
+}
+
+struct user_buf {
+ char __user *buf; /* destination user buffer */
+ size_t size; /* size of requested read */
+ ssize_t ret; /* bytes read so far */
+};
+
+static int flush_ubuf(struct user_buf *dst, struct seq_buf *src)
+{
+ if (src->len) {
+ size_t bytes = min_t(size_t, src->len, dst->size);
+ int err = copy_to_user(dst->buf, src->buffer, bytes);
+
+ if (err)
+ return err;
+
+ dst->ret += bytes;
+ dst->buf += bytes;
+ dst->size -= bytes;
+ src->len -= bytes;
+ memmove(src->buffer, src->buffer + bytes, src->len);
+ }
+
+ return 0;
+}
+
+static ssize_t dfault_read(struct file *file, char __user *ubuf,
+ size_t size, loff_t *ppos)
+{
+ struct dfault_iter *iter = file->private_data;
+ struct user_buf buf = { .buf = ubuf, .size = size };
+ struct codetag *ct;
+ struct dfault *df;
+ int err;
+
+ codetag_lock_module_list(iter->ct_iter.cttype, true);
+ while (1) {
+ err = flush_ubuf(&buf, &iter->buf);
+ if (err || !buf.size)
+ break;
+
+ ct = codetag_next_ct(&iter->ct_iter);
+ if (!ct)
+ break;
+
+ df = container_of(ct, struct dfault, tag);
+ dynamic_fault_to_text(&iter->buf, df);
+ seq_buf_putc(&iter->buf, '\n');
+ }
+ codetag_lock_module_list(iter->ct_iter.cttype, false);
+
+ return err ?: buf.ret;
+}
+
+/*
+ * File_ops->write method for <debugfs>/dynamic_faults. Gathers the
+ * command text from userspace, parses and executes it.
+ */
+static ssize_t dfault_write(struct file *file, const char __user *ubuf,
+ size_t len, loff_t *offp)
+{
+ char tmpbuf[256];
+
+ if (len == 0)
+ return 0;
+ /* we don't check *offp -- multiple writes() are allowed */
+ if (len > sizeof(tmpbuf)-1)
+ return -E2BIG;
+ if (copy_from_user(tmpbuf, ubuf, len))
+ return -EFAULT;
+ tmpbuf[len] = '\0';
+ pr_debug("read %zu bytes from userspace", len);
+
+ dynamic_fault_store(tmpbuf);
+
+ *offp += len;
+ return len;
+}
+
+static const struct file_operations dfault_ops = {
+ .owner = THIS_MODULE,
+ .open = dfault_open,
+ .release = dfault_release,
+ .read = dfault_read,
+ .write = dfault_write
+};
+
+static int __init dynamic_fault_init(void)
+{
+ const struct codetag_type_desc desc = {
+ .section = "dynamic_fault_tags",
+ .tag_size = sizeof(struct dfault),
+ };
+ struct dentry *debugfs_file;
+
+ cttype = codetag_register_type(&desc);
+ if (IS_ERR_OR_NULL(cttype))
+ return PTR_ERR(cttype);
+
+ debugfs_file = debugfs_create_file("dynamic_faults", 0666, NULL, NULL, &dfault_ops);
+ if (IS_ERR(debugfs_file))
+ return PTR_ERR(debugfs_file);
+
+ return 0;
+}
+module_init(dynamic_fault_init);
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:14:29

by Suren Baghdasaryan

Subject: [RFC PATCH 25/30] lib/time_stats: New library for statistics on events

From: Kent Overstreet <[email protected]>

This adds a small new library for tracking statistics on events that
have a duration, i.e. a start and end time.

- number of events
- rate/frequency
- average duration
- max duration
- duration quantiles

This code comes from bcachefs, and originally bcache: the next patch
will be converting bcache to use this version, and a subsequent patch
will be using code_tagging to instrument all wait_event() calls in the
kernel.

Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/time_stats.h | 44 +++++++
lib/Kconfig | 3 +
lib/Makefile | 1 +
lib/time_stats.c | 236 +++++++++++++++++++++++++++++++++++++
4 files changed, 284 insertions(+)
create mode 100644 include/linux/time_stats.h
create mode 100644 lib/time_stats.c

diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
new file mode 100644
index 000000000000..7ae929e6f836
--- /dev/null
+++ b/include/linux/time_stats.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_TIMESTATS_H
+#define _LINUX_TIMESTATS_H
+
+#include <linux/spinlock_types.h>
+#include <linux/types.h>
+
+#define NR_QUANTILES 15
+
+struct quantiles {
+ struct quantile_entry {
+ u64 m;
+ u64 step;
+ } entries[NR_QUANTILES];
+};
+
+struct time_stat_buffer {
+ unsigned int nr;
+ struct time_stat_buffer_entry {
+ u64 start;
+ u64 end;
+ } entries[32];
+};
+
+struct time_stats {
+ spinlock_t lock;
+ u64 count;
+ /* all fields are in nanoseconds */
+ u64 average_duration;
+ u64 average_frequency;
+ u64 max_duration;
+ u64 last_event;
+ struct quantiles quantiles;
+
+ struct time_stat_buffer __percpu *buffer;
+};
+
+struct seq_buf;
+void time_stats_update(struct time_stats *stats, u64 start);
+void time_stats_to_text(struct seq_buf *out, struct time_stats *stats);
+void time_stats_exit(struct time_stats *stats);
+
+#endif /* _LINUX_TIMESTATS_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index fc6dbc425728..884fd9f2f06d 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -744,3 +744,6 @@ config ASN1_ENCODER

config POLYNOMIAL
tristate
+
+config TIME_STATS
+ bool
diff --git a/lib/Makefile b/lib/Makefile
index 489ea000c528..e54392011f5e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -232,6 +232,7 @@ obj-$(CONFIG_ALLOC_TAGGING) += alloc_tag.o
obj-$(CONFIG_PAGE_ALLOC_TAGGING) += pgalloc_tag.o

obj-$(CONFIG_CODETAG_FAULT_INJECTION) += dynamic_fault.o
+obj-$(CONFIG_TIME_STATS) += time_stats.o

lib-$(CONFIG_GENERIC_BUG) += bug.o

diff --git a/lib/time_stats.c b/lib/time_stats.c
new file mode 100644
index 000000000000..30362364fdd2
--- /dev/null
+++ b/lib/time_stats.c
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/gfp.h>
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/ktime.h>
+#include <linux/percpu.h>
+#include <linux/seq_buf.h>
+#include <linux/spinlock.h>
+#include <linux/time_stats.h>
+#include <linux/timekeeping.h>
+
+static inline unsigned int eytzinger1_child(unsigned int i, unsigned int child)
+{
+ return (i << 1) + child;
+}
+
+static inline unsigned int eytzinger1_right_child(unsigned int i)
+{
+ return eytzinger1_child(i, 1);
+}
+
+static inline unsigned int eytzinger1_next(unsigned int i, unsigned int size)
+{
+ if (eytzinger1_right_child(i) <= size) {
+ i = eytzinger1_right_child(i);
+
+ i <<= __fls(size + 1) - __fls(i);
+ i >>= i > size;
+ } else {
+ i >>= ffz(i) + 1;
+ }
+
+ return i;
+}
+
+static inline unsigned int eytzinger0_child(unsigned int i, unsigned int child)
+{
+ return (i << 1) + 1 + child;
+}
+
+static inline unsigned int eytzinger0_first(unsigned int size)
+{
+ return rounddown_pow_of_two(size) - 1;
+}
+
+static inline unsigned int eytzinger0_next(unsigned int i, unsigned int size)
+{
+ return eytzinger1_next(i + 1, size) - 1;
+}
+
+#define eytzinger0_for_each(_i, _size) \
+ for ((_i) = eytzinger0_first((_size)); \
+ (_i) != -1; \
+ (_i) = eytzinger0_next((_i), (_size)))
+
+#define ewma_add(ewma, val, weight) \
+({ \
+ typeof(ewma) _ewma = (ewma); \
+ typeof(weight) _weight = (weight); \
+ \
+ (((_ewma << _weight) - _ewma) + (val)) >> _weight; \
+})
+
+static void quantiles_update(struct quantiles *q, u64 v)
+{
+ unsigned int i = 0;
+
+ while (i < ARRAY_SIZE(q->entries)) {
+ struct quantile_entry *e = q->entries + i;
+
+ if (unlikely(!e->step)) {
+ e->m = v;
+ e->step = max_t(unsigned int, v / 2, 1024);
+ } else if (e->m > v) {
+ e->m = e->m >= e->step
+ ? e->m - e->step
+ : 0;
+ } else if (e->m < v) {
+ e->m = e->m + e->step > e->m
+ ? e->m + e->step
+ : U32_MAX;
+ }
+
+ if ((e->m > v ? e->m - v : v - e->m) < e->step)
+ e->step = max_t(unsigned int, e->step / 2, 1);
+
+ if (v >= e->m)
+ break;
+
+ i = eytzinger0_child(i, v > e->m);
+ }
+}
+
+static void time_stats_update_one(struct time_stats *stats,
+ u64 start, u64 end)
+{
+ u64 duration, freq;
+
+ duration = time_after64(end, start)
+ ? end - start : 0;
+ freq = time_after64(end, stats->last_event)
+ ? end - stats->last_event : 0;
+
+ stats->count++;
+
+ stats->average_duration = stats->average_duration
+ ? ewma_add(stats->average_duration, duration, 6)
+ : duration;
+
+ stats->average_frequency = stats->average_frequency
+ ? ewma_add(stats->average_frequency, freq, 6)
+ : freq;
+
+ stats->max_duration = max(stats->max_duration, duration);
+
+ stats->last_event = end;
+
+ quantiles_update(&stats->quantiles, duration);
+}
+
+void time_stats_update(struct time_stats *stats, u64 start)
+{
+ u64 end = ktime_get_ns();
+ unsigned long flags;
+
+ if (!stats->buffer) {
+ spin_lock_irqsave(&stats->lock, flags);
+ time_stats_update_one(stats, start, end);
+
+ if (stats->average_frequency < 32 &&
+ stats->count > 1024)
+ stats->buffer =
+ alloc_percpu_gfp(struct time_stat_buffer,
+ GFP_ATOMIC);
+ spin_unlock_irqrestore(&stats->lock, flags);
+ } else {
+ struct time_stat_buffer_entry *i;
+ struct time_stat_buffer *b;
+
+ preempt_disable();
+ b = this_cpu_ptr(stats->buffer);
+
+ BUG_ON(b->nr >= ARRAY_SIZE(b->entries));
+ b->entries[b->nr++] = (struct time_stat_buffer_entry) {
+ .start = start,
+ .end = end
+ };
+
+ if (b->nr == ARRAY_SIZE(b->entries)) {
+ spin_lock_irqsave(&stats->lock, flags);
+ for (i = b->entries;
+ i < b->entries + ARRAY_SIZE(b->entries);
+ i++)
+ time_stats_update_one(stats, i->start, i->end);
+ spin_unlock_irqrestore(&stats->lock, flags);
+
+ b->nr = 0;
+ }
+
+ preempt_enable();
+ }
+}
+EXPORT_SYMBOL(time_stats_update);
+
+static const struct time_unit {
+ const char *name;
+ u32 nsecs;
+} time_units[] = {
+ { "ns", 1 },
+ { "us", NSEC_PER_USEC },
+ { "ms", NSEC_PER_MSEC },
+ { "sec", NSEC_PER_SEC },
+};
+
+static const struct time_unit *pick_time_units(u64 ns)
+{
+ const struct time_unit *u;
+
+ for (u = time_units;
+ u + 1 < time_units + ARRAY_SIZE(time_units) &&
+ ns >= u[1].nsecs << 1;
+ u++)
+ ;
+
+ return u;
+}
+
+static void pr_time_units(struct seq_buf *out, u64 ns)
+{
+ const struct time_unit *u = pick_time_units(ns);
+
+ seq_buf_printf(out, "%llu %s", div_u64(ns, u->nsecs), u->name);
+}
+
+void time_stats_to_text(struct seq_buf *out, struct time_stats *stats)
+{
+ const struct time_unit *u;
+ u64 freq = READ_ONCE(stats->average_frequency);
+ u64 q, last_q = 0;
+ int i;
+
+ seq_buf_printf(out, "count: %llu\n", stats->count);
+ seq_buf_printf(out, "rate: %llu/sec\n",
+ freq ? div64_u64(NSEC_PER_SEC, freq) : 0);
+ seq_buf_printf(out, "frequency: ");
+ pr_time_units(out, freq);
+ seq_buf_putc(out, '\n');
+
+ seq_buf_printf(out, "avg duration: ");
+ pr_time_units(out, stats->average_duration);
+ seq_buf_putc(out, '\n');
+
+ seq_buf_printf(out, "max duration: ");
+ pr_time_units(out, stats->max_duration);
+ seq_buf_putc(out, '\n');
+
+ i = eytzinger0_first(NR_QUANTILES);
+ u = pick_time_units(stats->quantiles.entries[i].m);
+ seq_buf_printf(out, "quantiles (%s): ", u->name);
+ eytzinger0_for_each(i, NR_QUANTILES) {
+ q = max(stats->quantiles.entries[i].m, last_q);
+ seq_buf_printf(out, "%llu ", div_u64(q, u->nsecs));
+ last_q = q;
+ }
+
+ seq_buf_putc(out, '\n');
+}
+EXPORT_SYMBOL_GPL(time_stats_to_text);
+
+void time_stats_exit(struct time_stats *stats)
+{
+ free_percpu(stats->buffer);
+ stats->buffer = NULL;
+}
+EXPORT_SYMBOL_GPL(time_stats_exit);
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:20:38

by Suren Baghdasaryan

Subject: [RFC PATCH 12/30] mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext creation

Introduce the __GFP_NO_OBJ_EXT flag to prevent recursive allocations
when allocating a slabobj_ext vector on a slab.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/gfp_types.h | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index d88c46ca82e1..a2cba1d20b86 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -55,8 +55,13 @@ typedef unsigned int __bitwise gfp_t;
#define ___GFP_SKIP_KASAN_UNPOISON 0
#define ___GFP_SKIP_KASAN_POISON 0
#endif
+#ifdef CONFIG_SLAB_OBJ_EXT
+#define ___GFP_NO_OBJ_EXT 0x8000000u
+#else
+#define ___GFP_NO_OBJ_EXT 0
+#endif
#ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP 0x8000000u
+#define ___GFP_NOLOCKDEP 0x10000000u
#else
#define ___GFP_NOLOCKDEP 0
#endif
@@ -101,12 +106,15 @@ typedef unsigned int __bitwise gfp_t;
* node with no fallbacks or placement policy enforcements.
*
* %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
+ *
+ * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
*/
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
+#define __GFP_NO_OBJ_EXT ((__force gfp_t)___GFP_NO_OBJ_EXT)

/**
* DOC: Watermark modifiers
@@ -256,7 +264,7 @@ typedef unsigned int __bitwise gfp_t;
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)

/* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (27 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (28 + IS_ENABLED(CONFIG_LOCKDEP))
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/**
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:20:46

by Suren Baghdasaryan

Subject: [RFC PATCH 29/30] dyndbg: Convert to code tagging

From: Kent Overstreet <[email protected]>

This converts dynamic debug to the new code tagging framework, which
provides an interface for iterating over objects in a particular ELF
section.

It also converts the debugfs interface from seq_file to the style used
by other code tagging users, which also makes the code a bit smaller and
simpler.

It doesn't yet convert struct _ddebug to use struct codetag; a
follow-up cleanup could do that, and also switch it to
codetag_query_parse().

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Luis Chamberlain <[email protected]>
---
include/asm-generic/codetag.lds.h | 5 +-
include/asm-generic/vmlinux.lds.h | 5 -
include/linux/dynamic_debug.h | 11 +-
kernel/module/internal.h | 2 -
kernel/module/main.c | 23 --
lib/dynamic_debug.c | 452 ++++++++++--------------------
6 files changed, 158 insertions(+), 340 deletions(-)

diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
index b087cf1874a9..b7e351f80e9e 100644
--- a/include/asm-generic/codetag.lds.h
+++ b/include/asm-generic/codetag.lds.h
@@ -8,10 +8,11 @@
KEEP(*(_name)) \
__stop_##_name = .;

-#define CODETAG_SECTIONS() \
+#define CODETAG_SECTIONS() \
SECTION_WITH_BOUNDARIES(alloc_tags) \
SECTION_WITH_BOUNDARIES(dynamic_fault_tags) \
SECTION_WITH_BOUNDARIES(time_stats_tags) \
- SECTION_WITH_BOUNDARIES(error_code_tags)
+ SECTION_WITH_BOUNDARIES(error_code_tags) \
+ SECTION_WITH_BOUNDARIES(dyndbg)

#endif /* __ASM_GENERIC_CODETAG_LDS_H */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index c2dc2a59ab2e..d3fb914d157f 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -345,11 +345,6 @@
__end_once = .; \
STRUCT_ALIGN(); \
*(__tracepoints) \
- /* implement dynamic printk debug */ \
- . = ALIGN(8); \
- __start___dyndbg = .; \
- KEEP(*(__dyndbg)) \
- __stop___dyndbg = .; \
CODETAG_SECTIONS() \
LIKELY_PROFILE() \
BRANCH_PROFILE() \
diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h
index dce631e678dd..6a57009dd29e 100644
--- a/include/linux/dynamic_debug.h
+++ b/include/linux/dynamic_debug.h
@@ -58,9 +58,6 @@ struct _ddebug {
/* exported for module authors to exercise >control */
int dynamic_debug_exec_queries(const char *query, const char *modname);

-int ddebug_add_module(struct _ddebug *tab, unsigned int n,
- const char *modname);
-extern int ddebug_remove_module(const char *mod_name);
extern __printf(2, 3)
void __dynamic_pr_debug(struct _ddebug *descriptor, const char *fmt, ...);

@@ -89,7 +86,7 @@ void __dynamic_ibdev_dbg(struct _ddebug *descriptor,

#define DEFINE_DYNAMIC_DEBUG_METADATA(name, fmt) \
static struct _ddebug __aligned(8) \
- __section("__dyndbg") name = { \
+ __section("dyndbg") name = { \
.modname = KBUILD_MODNAME, \
.function = __func__, \
.filename = __FILE__, \
@@ -187,12 +184,6 @@ void __dynamic_ibdev_dbg(struct _ddebug *descriptor,
#include <linux/errno.h>
#include <linux/printk.h>

-static inline int ddebug_add_module(struct _ddebug *tab, unsigned int n,
- const char *modname)
-{
- return 0;
-}
-
static inline int ddebug_remove_module(const char *mod)
{
return 0;
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index f1b6c477bd93..f867c57ab74f 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -62,8 +62,6 @@ struct load_info {
Elf_Shdr *sechdrs;
char *secstrings, *strtab;
unsigned long symoffs, stroffs, init_typeoffs, core_typeoffs;
- struct _ddebug *debug;
- unsigned int num_debug;
bool sig_ok;
#ifdef CONFIG_KALLSYMS
unsigned long mod_kallsyms_init_off;
diff --git a/kernel/module/main.c b/kernel/module/main.c
index d253277492fd..28e3b337841b 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1163,9 +1163,6 @@ static void free_module(struct module *mod)
mod->state = MODULE_STATE_UNFORMED;
mutex_unlock(&module_mutex);

- /* Remove dynamic debug info */
- ddebug_remove_module(mod->name);
-
/* Arch-specific cleanup. */
module_arch_cleanup(mod);

@@ -1600,19 +1597,6 @@ static void free_modinfo(struct module *mod)
}
}

-static void dynamic_debug_setup(struct module *mod, struct _ddebug *debug, unsigned int num)
-{
- if (!debug)
- return;
- ddebug_add_module(debug, num, mod->name);
-}
-
-static void dynamic_debug_remove(struct module *mod, struct _ddebug *debug)
-{
- if (debug)
- ddebug_remove_module(mod->name);
-}
-
void * __weak module_alloc(unsigned long size)
{
return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
@@ -2113,9 +2097,6 @@ static int find_module_sections(struct module *mod, struct load_info *info)
if (section_addr(info, "__obsparm"))
pr_warn("%s: Ignoring obsolete parameters\n", mod->name);

- info->debug = section_objs(info, "__dyndbg",
- sizeof(*info->debug), &info->num_debug);
-
return 0;
}

@@ -2808,9 +2789,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
goto free_arch_cleanup;
}

- init_build_id(mod, info);
- dynamic_debug_setup(mod, info->debug, info->num_debug);
-
/* Ftrace init must be called in the MODULE_STATE_UNFORMED state */
ftrace_module_init(mod);

@@ -2875,7 +2853,6 @@ static int load_module(struct load_info *info, const char __user *uargs,

ddebug_cleanup:
ftrace_release_mod(mod);
- dynamic_debug_remove(mod, info->debug);
synchronize_rcu();
kfree(mod->args);
free_arch_cleanup:
diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c
index dd7f56af9aed..e9079825fb3b 100644
--- a/lib/dynamic_debug.c
+++ b/lib/dynamic_debug.c
@@ -13,6 +13,7 @@

#define pr_fmt(fmt) "dyndbg: " fmt

+#include <linux/codetag.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
@@ -36,19 +37,37 @@
#include <linux/sched.h>
#include <linux/device.h>
#include <linux/netdevice.h>
+#include <linux/seq_buf.h>

#include <rdma/ib_verbs.h>

-extern struct _ddebug __start___dyndbg[];
-extern struct _ddebug __stop___dyndbg[];
+static struct codetag_type *cttype;

-struct ddebug_table {
- struct list_head link;
- const char *mod_name;
- unsigned int num_ddebugs;
- struct _ddebug *ddebugs;
+struct user_buf {
+ char __user *buf; /* destination user buffer */
+ size_t size; /* size of requested read */
+ ssize_t ret; /* bytes read so far */
};

+static int flush_ubuf(struct user_buf *dst, struct seq_buf *src)
+{
+ if (src->len) {
+ size_t bytes = min_t(size_t, src->len, dst->size);
+ int err = copy_to_user(dst->buf, src->buffer, bytes);
+
+ if (err)
+ return err;
+
+ dst->ret += bytes;
+ dst->buf += bytes;
+ dst->size -= bytes;
+ src->len -= bytes;
+ memmove(src->buffer, src->buffer + bytes, src->len);
+ }
+
+ return 0;
+}
+
struct ddebug_query {
const char *filename;
const char *module;
@@ -58,8 +77,9 @@ struct ddebug_query {
};

struct ddebug_iter {
- struct ddebug_table *table;
- unsigned int idx;
+ struct codetag_iterator ct_iter;
+ struct seq_buf buf;
+ char rawbuf[4096];
};

struct flag_settings {
@@ -67,8 +87,6 @@ struct flag_settings {
unsigned int mask;
};

-static DEFINE_MUTEX(ddebug_lock);
-static LIST_HEAD(ddebug_tables);
static int verbose;
module_param(verbose, int, 0644);
MODULE_PARM_DESC(verbose, " dynamic_debug/control processing "
@@ -152,78 +170,76 @@ static void vpr_info_dq(const struct ddebug_query *query, const char *msg)
static int ddebug_change(const struct ddebug_query *query,
struct flag_settings *modifiers)
{
- int i;
- struct ddebug_table *dt;
+ struct codetag_iterator ct_iter;
+ struct codetag *ct;
unsigned int newflags;
unsigned int nfound = 0;
struct flagsbuf fbuf;

- /* search for matching ddebugs */
- mutex_lock(&ddebug_lock);
- list_for_each_entry(dt, &ddebug_tables, link) {
+ codetag_lock_module_list(cttype, true);
+ codetag_init_iter(&ct_iter, cttype);
+
+ while ((ct = codetag_next_ct(&ct_iter))) {
+ struct _ddebug *dp = (void *) ct;

/* match against the module name */
if (query->module &&
- !match_wildcard(query->module, dt->mod_name))
+ !match_wildcard(query->module, dp->modname))
continue;

- for (i = 0; i < dt->num_ddebugs; i++) {
- struct _ddebug *dp = &dt->ddebugs[i];
-
- /* match against the source filename */
- if (query->filename &&
- !match_wildcard(query->filename, dp->filename) &&
- !match_wildcard(query->filename,
- kbasename(dp->filename)) &&
- !match_wildcard(query->filename,
- trim_prefix(dp->filename)))
- continue;
+ /* match against the source filename */
+ if (query->filename &&
+ !match_wildcard(query->filename, dp->filename) &&
+ !match_wildcard(query->filename,
+ kbasename(dp->filename)) &&
+ !match_wildcard(query->filename,
+ trim_prefix(dp->filename)))
+ continue;

- /* match against the function */
- if (query->function &&
- !match_wildcard(query->function, dp->function))
- continue;
+ /* match against the function */
+ if (query->function &&
+ !match_wildcard(query->function, dp->function))
+ continue;

- /* match against the format */
- if (query->format) {
- if (*query->format == '^') {
- char *p;
- /* anchored search. match must be at beginning */
- p = strstr(dp->format, query->format+1);
- if (p != dp->format)
- continue;
- } else if (!strstr(dp->format, query->format))
+ /* match against the format */
+ if (query->format) {
+ if (*query->format == '^') {
+ char *p;
+ /* anchored search. match must be at beginning */
+ p = strstr(dp->format, query->format+1);
+ if (p != dp->format)
continue;
- }
-
- /* match against the line number range */
- if (query->first_lineno &&
- dp->lineno < query->first_lineno)
- continue;
- if (query->last_lineno &&
- dp->lineno > query->last_lineno)
+ } else if (!strstr(dp->format, query->format))
continue;
+ }
+
+ /* match against the line number range */
+ if (query->first_lineno &&
+ dp->lineno < query->first_lineno)
+ continue;
+ if (query->last_lineno &&
+ dp->lineno > query->last_lineno)
+ continue;

- nfound++;
+ nfound++;

- newflags = (dp->flags & modifiers->mask) | modifiers->flags;
- if (newflags == dp->flags)
- continue;
+ newflags = (dp->flags & modifiers->mask) | modifiers->flags;
+ if (newflags == dp->flags)
+ continue;
#ifdef CONFIG_JUMP_LABEL
- if (dp->flags & _DPRINTK_FLAGS_PRINT) {
- if (!(modifiers->flags & _DPRINTK_FLAGS_PRINT))
- static_branch_disable(&dp->key.dd_key_true);
- } else if (modifiers->flags & _DPRINTK_FLAGS_PRINT)
- static_branch_enable(&dp->key.dd_key_true);
+ if (dp->flags & _DPRINTK_FLAGS_PRINT) {
+ if (!(modifiers->flags & _DPRINTK_FLAGS_PRINT))
+ static_branch_disable(&dp->key.dd_key_true);
+ } else if (modifiers->flags & _DPRINTK_FLAGS_PRINT)
+ static_branch_enable(&dp->key.dd_key_true);
#endif
- dp->flags = newflags;
- v4pr_info("changed %s:%d [%s]%s =%s\n",
- trim_prefix(dp->filename), dp->lineno,
- dt->mod_name, dp->function,
- ddebug_describe_flags(dp->flags, &fbuf));
- }
+ dp->flags = newflags;
+ v4pr_info("changed %s:%d [%s]%s =%s\n",
+ trim_prefix(dp->filename), dp->lineno,
+ dp->modname, dp->function,
+ ddebug_describe_flags(dp->flags, &fbuf));
}
- mutex_unlock(&ddebug_lock);
+ codetag_lock_module_list(cttype, false);

if (!nfound && verbose)
pr_info("no matches for query\n");
@@ -794,187 +810,96 @@ static ssize_t ddebug_proc_write(struct file *file, const char __user *ubuf,
return len;
}

-/*
- * Set the iterator to point to the first _ddebug object
- * and return a pointer to that first object. Returns
- * NULL if there are no _ddebugs at all.
- */
-static struct _ddebug *ddebug_iter_first(struct ddebug_iter *iter)
-{
- if (list_empty(&ddebug_tables)) {
- iter->table = NULL;
- iter->idx = 0;
- return NULL;
- }
- iter->table = list_entry(ddebug_tables.next,
- struct ddebug_table, link);
- iter->idx = 0;
- return &iter->table->ddebugs[iter->idx];
-}
-
-/*
- * Advance the iterator to point to the next _ddebug
- * object from the one the iterator currently points at,
- * and returns a pointer to the new _ddebug. Returns
- * NULL if the iterator has seen all the _ddebugs.
- */
-static struct _ddebug *ddebug_iter_next(struct ddebug_iter *iter)
-{
- if (iter->table == NULL)
- return NULL;
- if (++iter->idx == iter->table->num_ddebugs) {
- /* iterate to next table */
- iter->idx = 0;
- if (list_is_last(&iter->table->link, &ddebug_tables)) {
- iter->table = NULL;
- return NULL;
- }
- iter->table = list_entry(iter->table->link.next,
- struct ddebug_table, link);
- }
- return &iter->table->ddebugs[iter->idx];
-}
-
-/*
- * Seq_ops start method. Called at the start of every
- * read() call from userspace. Takes the ddebug_lock and
- * seeks the seq_file's iterator to the given position.
- */
-static void *ddebug_proc_start(struct seq_file *m, loff_t *pos)
-{
- struct ddebug_iter *iter = m->private;
- struct _ddebug *dp;
- int n = *pos;
-
- mutex_lock(&ddebug_lock);
-
- if (!n)
- return SEQ_START_TOKEN;
- if (n < 0)
- return NULL;
- dp = ddebug_iter_first(iter);
- while (dp != NULL && --n > 0)
- dp = ddebug_iter_next(iter);
- return dp;
-}
-
-/*
- * Seq_ops next method. Called several times within a read()
- * call from userspace, with ddebug_lock held. Walks to the
- * next _ddebug object with a special case for the header line.
- */
-static void *ddebug_proc_next(struct seq_file *m, void *p, loff_t *pos)
-{
- struct ddebug_iter *iter = m->private;
- struct _ddebug *dp;
-
- if (p == SEQ_START_TOKEN)
- dp = ddebug_iter_first(iter);
- else
- dp = ddebug_iter_next(iter);
- ++*pos;
- return dp;
-}
-
/*
* Seq_ops show method. Called several times within a read()
* call from userspace, with ddebug_lock held. Formats the
* current _ddebug as a single human-readable line, with a
* special case for the header line.
*/
-static int ddebug_proc_show(struct seq_file *m, void *p)
+static void ddebug_to_text(struct seq_buf *out, struct _ddebug *dp)
{
- struct ddebug_iter *iter = m->private;
- struct _ddebug *dp = p;
struct flagsbuf flags;
+ char *buf;
+ size_t len;

- if (p == SEQ_START_TOKEN) {
- seq_puts(m,
- "# filename:lineno [module]function flags format\n");
- return 0;
- }
-
- seq_printf(m, "%s:%u [%s]%s =%s \"",
+ seq_buf_printf(out, "%s:%u [%s]%s =%s \"",
trim_prefix(dp->filename), dp->lineno,
- iter->table->mod_name, dp->function,
+ dp->modname, dp->function,
ddebug_describe_flags(dp->flags, &flags));
- seq_escape(m, dp->format, "\t\r\n\"");
- seq_puts(m, "\"\n");

- return 0;
-}
+ len = seq_buf_get_buf(out, &buf);
+ len = string_escape_mem(dp->format, strlen(dp->format),
+ buf, len, ESCAPE_OCTAL, "\t\r\n\"");
+ seq_buf_commit(out, len);

-/*
- * Seq_ops stop method. Called at the end of each read()
- * call from userspace. Drops ddebug_lock.
- */
-static void ddebug_proc_stop(struct seq_file *m, void *p)
-{
- mutex_unlock(&ddebug_lock);
+ seq_buf_puts(out, "\"\n");
}

-static const struct seq_operations ddebug_proc_seqops = {
- .start = ddebug_proc_start,
- .next = ddebug_proc_next,
- .show = ddebug_proc_show,
- .stop = ddebug_proc_stop
-};
-
static int ddebug_proc_open(struct inode *inode, struct file *file)
{
- return seq_open_private(file, &ddebug_proc_seqops,
- sizeof(struct ddebug_iter));
+ struct ddebug_iter *iter;
+
+ iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter)
+ return -ENOMEM;
+
+ codetag_lock_module_list(cttype, true);
+ codetag_init_iter(&iter->ct_iter, cttype);
+ codetag_lock_module_list(cttype, false);
+ seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
+ file->private_data = iter;
+
+ return 0;
}

-static const struct file_operations ddebug_proc_fops = {
- .owner = THIS_MODULE,
- .open = ddebug_proc_open,
- .read = seq_read,
- .llseek = seq_lseek,
- .release = seq_release_private,
- .write = ddebug_proc_write
-};
+static int ddebug_proc_release(struct inode *inode, struct file *file)
+{
+ struct ddebug_iter *iter = file->private_data;

-static const struct proc_ops proc_fops = {
- .proc_open = ddebug_proc_open,
- .proc_read = seq_read,
- .proc_lseek = seq_lseek,
- .proc_release = seq_release_private,
- .proc_write = ddebug_proc_write
-};
+ kfree(iter);
+ return 0;
+}

-/*
- * Allocate a new ddebug_table for the given module
- * and add it to the global list.
- */
-int ddebug_add_module(struct _ddebug *tab, unsigned int n,
- const char *name)
+static ssize_t ddebug_proc_read(struct file *file, char __user *ubuf,
+ size_t size, loff_t *ppos)
{
- struct ddebug_table *dt;
+ struct ddebug_iter *iter = file->private_data;
+ struct user_buf buf = { .buf = ubuf, .size = size };
+ struct codetag *ct;
+ int err = 0;

- dt = kzalloc(sizeof(*dt), GFP_KERNEL);
- if (dt == NULL) {
- pr_err("error adding module: %s\n", name);
- return -ENOMEM;
- }
- /*
- * For built-in modules, name lives in .rodata and is
- * immortal. For loaded modules, name points at the name[]
- * member of struct module, which lives at least as long as
- * this struct ddebug_table.
- */
- dt->mod_name = name;
- dt->num_ddebugs = n;
- dt->ddebugs = tab;
+ codetag_lock_module_list(iter->ct_iter.cttype, true);
+ while (1) {
+ err = flush_ubuf(&buf, &iter->buf);
+ if (err || !buf.size)
+ break;
+
+ ct = codetag_next_ct(&iter->ct_iter);
+ if (!ct)
+ break;

- mutex_lock(&ddebug_lock);
- list_add(&dt->link, &ddebug_tables);
- mutex_unlock(&ddebug_lock);
+ ddebug_to_text(&iter->buf, (void *) ct);
+ }
+ codetag_lock_module_list(iter->ct_iter.cttype, false);

- vpr_info("%3u debug prints in module %s\n", n, dt->mod_name);
- return 0;
+ return err ? : buf.ret;
}

+static const struct file_operations ddebug_proc_fops = {
+ .owner = THIS_MODULE,
+ .open = ddebug_proc_open,
+ .read = ddebug_proc_read,
+ .release = ddebug_proc_release,
+ .write = ddebug_proc_write,
+};
+
+static const struct proc_ops proc_fops = {
+ .proc_open = ddebug_proc_open,
+ .proc_read = ddebug_proc_read,
+ .proc_release = ddebug_proc_release,
+ .proc_write = ddebug_proc_write,
+};
+
/* helper for ddebug_dyndbg_(boot|module)_param_cb */
static int ddebug_dyndbg_param_cb(char *param, char *val,
const char *modname, int on_err)
@@ -1015,47 +940,6 @@ int ddebug_dyndbg_module_param_cb(char *param, char *val, const char *module)
return ddebug_dyndbg_param_cb(param, val, module, -ENOENT);
}

-static void ddebug_table_free(struct ddebug_table *dt)
-{
- list_del_init(&dt->link);
- kfree(dt);
-}
-
-/*
- * Called in response to a module being unloaded. Removes
- * any ddebug_table's which point at the module.
- */
-int ddebug_remove_module(const char *mod_name)
-{
- struct ddebug_table *dt, *nextdt;
- int ret = -ENOENT;
-
- mutex_lock(&ddebug_lock);
- list_for_each_entry_safe(dt, nextdt, &ddebug_tables, link) {
- if (dt->mod_name == mod_name) {
- ddebug_table_free(dt);
- ret = 0;
- break;
- }
- }
- mutex_unlock(&ddebug_lock);
- if (!ret)
- v2pr_info("removed module \"%s\"\n", mod_name);
- return ret;
-}
-
-static void ddebug_remove_all_tables(void)
-{
- mutex_lock(&ddebug_lock);
- while (!list_empty(&ddebug_tables)) {
- struct ddebug_table *dt = list_entry(ddebug_tables.next,
- struct ddebug_table,
- link);
- ddebug_table_free(dt);
- }
- mutex_unlock(&ddebug_lock);
-}
-
static __initdata int ddebug_init_success;

static int __init dynamic_debug_init_control(void)
@@ -1083,45 +967,19 @@ static int __init dynamic_debug_init_control(void)

static int __init dynamic_debug_init(void)
{
- struct _ddebug *iter, *iter_start;
- const char *modname = NULL;
+ const struct codetag_type_desc desc = {
+ .section = "dyndbg",
+ .tag_size = sizeof(struct _ddebug),
+ };
char *cmdline;
- int ret = 0;
- int n = 0, entries = 0, modct = 0;
+ int ret;

- if (&__start___dyndbg == &__stop___dyndbg) {
- if (IS_ENABLED(CONFIG_DYNAMIC_DEBUG)) {
- pr_warn("_ddebug table is empty in a CONFIG_DYNAMIC_DEBUG build\n");
- return 1;
- }
- pr_info("Ignore empty _ddebug table in a CONFIG_DYNAMIC_DEBUG_CORE build\n");
- ddebug_init_success = 1;
- return 0;
- }
- iter = __start___dyndbg;
- modname = iter->modname;
- iter_start = iter;
- for (; iter < __stop___dyndbg; iter++) {
- entries++;
- if (strcmp(modname, iter->modname)) {
- modct++;
- ret = ddebug_add_module(iter_start, n, modname);
- if (ret)
- goto out_err;
- n = 0;
- modname = iter->modname;
- iter_start = iter;
- }
- n++;
- }
- ret = ddebug_add_module(iter_start, n, modname);
+ cttype = codetag_register_type(&desc);
+ ret = PTR_ERR_OR_ZERO(cttype);
if (ret)
- goto out_err;
+ return ret;

ddebug_init_success = 1;
- vpr_info("%d prdebugs in %d modules, %d KiB in ddebug tables, %d kiB in __dyndbg section\n",
- entries, modct, (int)((modct * sizeof(struct ddebug_table)) >> 10),
- (int)((entries * sizeof(struct _ddebug)) >> 10));

/* now that ddebug tables are loaded, process all boot args
* again to find and activate queries given in dyndbg params.
@@ -1132,14 +990,12 @@ static int __init dynamic_debug_init(void)
* slightly noisy if verbose, but harmless.
*/
cmdline = kstrdup(saved_command_line, GFP_KERNEL);
+ if (!cmdline)
+ return -ENOMEM;
parse_args("dyndbg params", cmdline, NULL,
0, 0, 0, NULL, &ddebug_dyndbg_boot_param_cb);
kfree(cmdline);
return 0;
-
-out_err:
- ddebug_remove_all_tables();
- return 0;
}
/* Allow early initialization for boot messages via boot param */
early_initcall(dynamic_debug_init);
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:20:47

by Suren Baghdasaryan

Subject: [RFC PATCH 18/30] codetag: add codetag query helper functions

From: Kent Overstreet <[email protected]>

Provide codetag_query_parse() to parse codetag queries and
codetag_matches_query() to check if the query affects a given codetag.

Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/codetag.h | 27 ++++++++
lib/codetag.c | 135 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 162 insertions(+)

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 386733e89b31..0c605417ebbe 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -80,4 +80,31 @@ static inline void codetag_load_module(struct module *mod) {}
static inline void codetag_unload_module(struct module *mod) {}
#endif

+/* Codetag query parsing */
+
+struct codetag_query {
+ const char *filename;
+ const char *module;
+ const char *function;
+ const char *class;
+ unsigned int first_line, last_line;
+ unsigned int first_index, last_index;
+ unsigned int cur_index;
+
+ bool match_line:1;
+ bool match_index:1;
+
+ unsigned int set_enabled:1;
+ unsigned int enabled:2;
+
+ unsigned int set_frequency:1;
+ unsigned int frequency;
+};
+
+char *codetag_query_parse(struct codetag_query *q, char *buf);
+bool codetag_matches_query(struct codetag_query *q,
+ const struct codetag *ct,
+ const struct codetag_module *mod,
+ const char *class);
+
#endif /* _LINUX_CODETAG_H */
diff --git a/lib/codetag.c b/lib/codetag.c
index f0a3174f9b71..288ccfd5cbd0 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -246,3 +246,138 @@ void codetag_unload_module(struct module *mod)
}
mutex_unlock(&codetag_lock);
}
+
+/* Codetag query parsing */
+
+#define CODETAG_QUERY_TOKENS() \
+ x(func) \
+ x(file) \
+ x(line) \
+ x(module) \
+ x(class) \
+ x(index)
+
+enum tokens {
+#define x(name) TOK_##name,
+ CODETAG_QUERY_TOKENS()
+#undef x
+};
+
+static const char * const token_strs[] = {
+#define x(name) #name,
+ CODETAG_QUERY_TOKENS()
+#undef x
+ NULL
+};
+
+static int parse_range(char *str, unsigned int *first, unsigned int *last)
+{
+ char *first_str = str;
+ char *last_str = strchr(first_str, '-');
+
+ if (last_str)
+ *last_str++ = '\0';
+
+ if (kstrtouint(first_str, 10, first))
+ return -EINVAL;
+
+ if (!last_str)
+ *last = *first;
+ else if (kstrtouint(last_str, 10, last))
+ return -EINVAL;
+
+ return 0;
+}
+
+char *codetag_query_parse(struct codetag_query *q, char *buf)
+{
+ while (1) {
+ char *p = buf;
+ char *str1 = strsep_no_empty(&p, " \t\r\n");
+ char *str2 = strsep_no_empty(&p, " \t\r\n");
+ int ret, token;
+
+ if (!str1 || !str2)
+ break;
+
+ token = match_string(token_strs, ARRAY_SIZE(token_strs), str1);
+ if (token < 0)
+ break;
+
+ switch (token) {
+ case TOK_func:
+ q->function = str2;
+ break;
+ case TOK_file:
+ q->filename = str2;
+ break;
+ case TOK_line:
+ ret = parse_range(str2, &q->first_line, &q->last_line);
+ if (ret)
+ return ERR_PTR(ret);
+ q->match_line = true;
+ break;
+ case TOK_module:
+ q->module = str2;
+ break;
+ case TOK_class:
+ q->class = str2;
+ break;
+ case TOK_index:
+ ret = parse_range(str2, &q->first_index, &q->last_index);
+ if (ret)
+ return ERR_PTR(ret);
+ q->match_index = true;
+ break;
+ }
+
+ buf = p;
+ }
+
+ return buf;
+}
+
+bool codetag_matches_query(struct codetag_query *q,
+ const struct codetag *ct,
+ const struct codetag_module *mod,
+ const char *class)
+{
+ size_t classlen = q->class ? strlen(q->class) : 0;
+
+ if (q->module &&
+ (!mod->mod ||
+ strcmp(q->module, ct->modname)))
+ return false;
+
+ if (q->filename &&
+ strcmp(q->filename, ct->filename) &&
+ strcmp(q->filename, kbasename(ct->filename)))
+ return false;
+
+ if (q->function &&
+ strcmp(q->function, ct->function))
+ return false;
+
+ /* match against the line number range */
+ if (q->match_line &&
+ (ct->lineno < q->first_line ||
+ ct->lineno > q->last_line))
+ return false;
+
+ /* match against the class */
+ if (classlen &&
+ (strncmp(q->class, class, classlen) ||
+ (class[classlen] && class[classlen] != ':')))
+ return false;
+
+ /* match against the fault index */
+ if (q->match_index &&
+ (q->cur_index < q->first_index ||
+ q->cur_index > q->last_index)) {
+ q->cur_index++;
+ return false;
+ }
+
+ q->cur_index++;
+ return true;
+}
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:20:49

by Suren Baghdasaryan

Subject: [RFC PATCH 30/30] MAINTAINERS: Add entries for code tagging & related

From: Kent Overstreet <[email protected]>

The new code and libraries added by this series are being maintained; mark
them as such in MAINTAINERS.

Signed-off-by: Kent Overstreet <[email protected]>
---
MAINTAINERS | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 589517372408..902c96744bcb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5111,6 +5111,19 @@ S: Supported
F: Documentation/process/code-of-conduct-interpretation.rst
F: Documentation/process/code-of-conduct.rst

+CODE TAGGING
+M: Suren Baghdasaryan <[email protected]>
+M: Kent Overstreet <[email protected]>
+S: Maintained
+F: lib/codetag.c
+F: include/linux/codetag.h
+
+CODE TAGGING TIME STATS
+M: Kent Overstreet <[email protected]>
+S: Maintained
+F: lib/codetag_time_stats.c
+F: include/linux/codetag_time_stats.h
+
COMEDI DRIVERS
M: Ian Abbott <[email protected]>
M: H Hartley Sweeten <[email protected]>
@@ -11405,6 +11418,12 @@ M: John Hawley <[email protected]>
S: Maintained
F: tools/testing/ktest

+LAZY PERCPU COUNTERS
+M: Kent Overstreet <[email protected]>
+S: Maintained
+F: lib/lazy-percpu-counter.c
+F: include/linux/lazy-percpu-counter.h
+
L3MDEV
M: David Ahern <[email protected]>
L: [email protected]
@@ -13124,6 +13143,15 @@ F: include/linux/memblock.h
F: mm/memblock.c
F: tools/testing/memblock/

+MEMORY ALLOCATION TRACKING
+M: Suren Baghdasaryan <[email protected]>
+M: Kent Overstreet <[email protected]>
+S: Maintained
+F: lib/alloc_tag.c
+F: lib/pgalloc_tag.c
+F: include/linux/alloc_tag.h
+F: include/linux/codetag_ctx.h
+
MEMORY CONTROLLER DRIVERS
M: Krzysztof Kozlowski <[email protected]>
L: [email protected]
@@ -20421,6 +20449,12 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/luca/wl12xx.git
F: drivers/net/wireless/ti/
F: include/linux/wl12xx.h

+TIME STATS
+M: Kent Overstreet <[email protected]>
+S: Maintained
+F: lib/time_stats.c
+F: include/linux/time_stats.h
+
TIMEKEEPING, CLOCKSOURCE CORE, NTP, ALARMTIMER
M: John Stultz <[email protected]>
M: Thomas Gleixner <[email protected]>
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:21:01

by Suren Baghdasaryan

Subject: [RFC PATCH 06/30] lib: code tagging module support

Add support for code tagging from dynamically loaded modules.

Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/codetag.h | 12 ++++++++++
kernel/module/main.c | 4 ++++
lib/codetag.c | 51 ++++++++++++++++++++++++++++++++++++++++-
3 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index a9d7adecc2a5..386733e89b31 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -42,6 +42,10 @@ struct codetag_module {
struct codetag_type_desc {
const char *section;
size_t tag_size;
+ void (*module_load)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
+ void (*module_unload)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
};

struct codetag_iterator {
@@ -68,4 +72,12 @@ void codetag_to_text(struct seq_buf *out, struct codetag *ct);
struct codetag_type *
codetag_register_type(const struct codetag_type_desc *desc);

+#ifdef CONFIG_CODE_TAGGING
+void codetag_load_module(struct module *mod);
+void codetag_unload_module(struct module *mod);
+#else
+static inline void codetag_load_module(struct module *mod) {}
+static inline void codetag_unload_module(struct module *mod) {}
+#endif
+
#endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index a4e4d84b6f4e..d253277492fd 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -53,6 +53,7 @@
#include <linux/bsearch.h>
#include <linux/dynamic_debug.h>
#include <linux/audit.h>
+#include <linux/codetag.h>
#include <uapi/linux/module.h>
#include "internal.h"

@@ -1151,6 +1152,7 @@ static void free_module(struct module *mod)
{
trace_module_free(mod);

+ codetag_unload_module(mod);
mod_sysfs_teardown(mod);

/*
@@ -2849,6 +2851,8 @@ static int load_module(struct load_info *info, const char __user *uargs,
/* Get rid of temporary copy. */
free_copy(info, flags);

+ codetag_load_module(mod);
+
/* Done! */
trace_module_load(mod);

diff --git a/lib/codetag.c b/lib/codetag.c
index 7708f8388e55..f0a3174f9b71 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -157,8 +157,11 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)

down_write(&cttype->mod_lock);
err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
- if (err >= 0)
+ if (err >= 0) {
cttype->count += range_size(cttype, &range);
+ if (cttype->desc.module_load)
+ cttype->desc.module_load(cttype, cmod);
+ }
up_write(&cttype->mod_lock);

if (err < 0) {
@@ -197,3 +200,49 @@ codetag_register_type(const struct codetag_type_desc *desc)

return cttype;
}
+
+void codetag_load_module(struct module *mod)
+{
+ struct codetag_type *cttype;
+
+ if (!mod)
+ return;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link)
+ codetag_module_init(cttype, mod);
+ mutex_unlock(&codetag_lock);
+}
+
+void codetag_unload_module(struct module *mod)
+{
+ struct codetag_type *cttype;
+
+ if (!mod)
+ return;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link) {
+ struct codetag_module *found = NULL;
+ struct codetag_module *cmod;
+ unsigned long mod_id, tmp;
+
+ down_write(&cttype->mod_lock);
+ idr_for_each_entry_ul(&cttype->mod_idr, cmod, tmp, mod_id) {
+ if (cmod->mod && cmod->mod == mod) {
+ found = cmod;
+ break;
+ }
+ }
+ if (found) {
+ if (cttype->desc.module_unload)
+ cttype->desc.module_unload(cttype, cmod);
+
+ cttype->count -= range_size(cttype, &cmod->range);
+ idr_remove(&cttype->mod_idr, mod_id);
+ kfree(cmod);
+ }
+ up_write(&cttype->mod_lock);
+ }
+ mutex_unlock(&codetag_lock);
+}
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:21:04

by Suren Baghdasaryan

Subject: [RFC PATCH 20/30] lib: introduce support for storing code tag context

Add support for code tag context capture when registering a new code tag
type. When context capture for a specific code tag is enabled,
codetag_ref will point to a codetag_ctx object which can be attached
to an application-specific object storing code invocation context.
codetag_ctx holds a pointer to its codetag_with_ctx object, which has a
codetag object embedded in it. All context objects for the same code tag
are placed on the codetag_with_ctx.ctx_head linked list. codetag.flags is
used to indicate when context capture for the associated code tag has
been initialized and enabled.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/codetag.h | 50 +++++++++++++-
include/linux/codetag_ctx.h | 48 +++++++++++++
lib/codetag.c | 134 ++++++++++++++++++++++++++++++++++++
3 files changed, 231 insertions(+), 1 deletion(-)
create mode 100644 include/linux/codetag_ctx.h

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 0c605417ebbe..57736ec77b45 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -5,8 +5,12 @@
#ifndef _LINUX_CODETAG_H
#define _LINUX_CODETAG_H

+#include <linux/container_of.h>
+#include <linux/spinlock.h>
#include <linux/types.h>

+struct kref;
+struct codetag_ctx;
struct codetag_iterator;
struct codetag_type;
struct seq_buf;
@@ -18,15 +22,38 @@ struct module;
* an array of these.
*/
struct codetag {
- unsigned int flags; /* used in later patches */
+ unsigned int flags; /* has to be the first member shared with codetag_ctx */
unsigned int lineno;
const char *modname;
const char *function;
const char *filename;
} __aligned(8);

+/* codetag_with_ctx flags */
+#define CTC_FLAG_CTX_PTR (1 << 0)
+#define CTC_FLAG_CTX_READY (1 << 1)
+#define CTC_FLAG_CTX_ENABLED (1 << 2)
+
+/*
+ * Code tag with context capture support. Contains a list to store context for
+ * each tag hit, a lock protecting the list and a flag to indicate whether
+ * context capture is enabled for the tag.
+ */
+struct codetag_with_ctx {
+ struct codetag ct;
+ struct list_head ctx_head;
+ spinlock_t ctx_lock;
+} __aligned(8);
+
+/*
+ * Tag reference can point to codetag directly or indirectly via codetag_ctx.
+ * Direct codetag pointer is used when context capture is disabled or not
+ * supported. When context capture for the tag is used, the reference points
+ * to the codetag_ctx through which the codetag can be reached.
+ */
union codetag_ref {
struct codetag *ct;
+ struct codetag_ctx *ctx;
};

struct codetag_range {
@@ -46,6 +73,7 @@ struct codetag_type_desc {
struct codetag_module *cmod);
void (*module_unload)(struct codetag_type *cttype,
struct codetag_module *cmod);
+ void (*free_ctx)(struct kref *ref);
};

struct codetag_iterator {
@@ -53,6 +81,7 @@ struct codetag_iterator {
struct codetag_module *cmod;
unsigned long mod_id;
struct codetag *ct;
+ struct codetag_ctx *ctx;
};

#define CODE_TAG_INIT { \
@@ -63,9 +92,28 @@ struct codetag_iterator {
.flags = 0, \
}

+static inline bool is_codetag_ctx_ref(union codetag_ref *ref)
+{
+ return !!(ref->ct->flags & CTC_FLAG_CTX_PTR);
+}
+
+static inline
+struct codetag_with_ctx *ct_to_ctc(struct codetag *ct)
+{
+ return container_of(ct, struct codetag_with_ctx, ct);
+}
+
void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
struct codetag *codetag_next_ct(struct codetag_iterator *iter);
+struct codetag_ctx *codetag_next_ctx(struct codetag_iterator *iter);
+
+bool codetag_enable_ctx(struct codetag_with_ctx *ctc, bool enable);
+static inline bool codetag_ctx_enabled(struct codetag_with_ctx *ctc)
+{
+ return !!(ctc->ct.flags & CTC_FLAG_CTX_ENABLED);
+}
+bool codetag_has_ctx(struct codetag_with_ctx *ctc);

void codetag_to_text(struct seq_buf *out, struct codetag *ct);

diff --git a/include/linux/codetag_ctx.h b/include/linux/codetag_ctx.h
new file mode 100644
index 000000000000..e741484f0e08
--- /dev/null
+++ b/include/linux/codetag_ctx.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * code tag context
+ */
+#ifndef _LINUX_CODETAG_CTX_H
+#define _LINUX_CODETAG_CTX_H
+
+#include <linux/codetag.h>
+#include <linux/kref.h>
+
+/* Code tag hit context. */
+struct codetag_ctx {
+ unsigned int flags; /* has to be the first member shared with codetag */
+ struct codetag_with_ctx *ctc;
+ struct list_head node;
+ struct kref refcount;
+} __aligned(8);
+
+static inline struct codetag_ctx *kref_to_ctx(struct kref *refcount)
+{
+ return container_of(refcount, struct codetag_ctx, refcount);
+}
+
+static inline void add_ctx(struct codetag_ctx *ctx,
+ struct codetag_with_ctx *ctc)
+{
+ kref_init(&ctx->refcount);
+ spin_lock(&ctc->ctx_lock);
+ ctx->flags = CTC_FLAG_CTX_PTR;
+ ctx->ctc = ctc;
+ list_add_tail(&ctx->node, &ctc->ctx_head);
+ spin_unlock(&ctc->ctx_lock);
+}
+
+static inline void rem_ctx(struct codetag_ctx *ctx,
+ void (*free_ctx)(struct kref *refcount))
+{
+ struct codetag_with_ctx *ctc = ctx->ctc;
+
+ spin_lock(&ctc->ctx_lock);
+ /* ctx might have been removed while we were using it */
+ if (!list_empty(&ctx->node))
+ list_del_init(&ctx->node);
+ spin_unlock(&ctc->ctx_lock);
+ kref_put(&ctx->refcount, free_ctx);
+}
+
+#endif /* _LINUX_CODETAG_CTX_H */
diff --git a/lib/codetag.c b/lib/codetag.c
index 288ccfd5cbd0..2762fda5c016 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-only
#include <linux/codetag.h>
+#include <linux/codetag_ctx.h>
#include <linux/idr.h>
#include <linux/kallsyms.h>
#include <linux/module.h>
@@ -91,6 +92,139 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
return ct;
}

+static struct codetag_ctx *next_ctx_from_ct(struct codetag_iterator *iter)
+{
+ struct codetag_with_ctx *ctc;
+ struct codetag_ctx *ctx = NULL;
+ struct codetag *ct = iter->ct;
+
+ while (ct) {
+ if (!(ct->flags & CTC_FLAG_CTX_READY))
+ goto next;
+
+ ctc = ct_to_ctc(ct);
+ spin_lock(&ctc->ctx_lock);
+ if (!list_empty(&ctc->ctx_head)) {
+ ctx = list_first_entry(&ctc->ctx_head,
+ struct codetag_ctx, node);
+ kref_get(&ctx->refcount);
+ }
+ spin_unlock(&ctc->ctx_lock);
+ if (ctx)
+ break;
+next:
+ ct = codetag_next_ct(iter);
+ }
+
+ iter->ctx = ctx;
+ return ctx;
+}
+
+struct codetag_ctx *codetag_next_ctx(struct codetag_iterator *iter)
+{
+ struct codetag_ctx *ctx = iter->ctx;
+ struct codetag_ctx *found = NULL;
+
+ lockdep_assert_held(&iter->cttype->mod_lock);
+
+ if (!ctx)
+ return next_ctx_from_ct(iter);
+
+ spin_lock(&ctx->ctc->ctx_lock);
+ /*
+ * Do not advance if the object was isolated, restart at the same tag.
+ */
+ if (!list_empty(&ctx->node)) {
+ if (list_is_last(&ctx->node, &ctx->ctc->ctx_head)) {
+ /* Finished with this tag, advance to the next */
+ codetag_next_ct(iter);
+ } else {
+ found = list_next_entry(ctx, node);
+ kref_get(&found->refcount);
+ }
+ }
+ spin_unlock(&ctx->ctc->ctx_lock);
+ kref_put(&ctx->refcount, iter->cttype->desc.free_ctx);
+
+ if (!found)
+ return next_ctx_from_ct(iter);
+
+ iter->ctx = found;
+ return found;
+}
+
+static struct codetag_type *find_cttype(struct codetag *ct)
+{
+ struct codetag_module *cmod;
+ struct codetag_type *cttype;
+ unsigned long mod_id;
+ unsigned long tmp;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link) {
+ down_read(&cttype->mod_lock);
+ idr_for_each_entry_ul(&cttype->mod_idr, cmod, tmp, mod_id) {
+ if (ct >= cmod->range.start && ct < cmod->range.stop) {
+ up_read(&cttype->mod_lock);
+ goto found;
+ }
+ }
+ up_read(&cttype->mod_lock);
+ }
+ cttype = NULL;
+found:
+ mutex_unlock(&codetag_lock);
+
+ return cttype;
+}
+
+bool codetag_enable_ctx(struct codetag_with_ctx *ctc, bool enable)
+{
+ struct codetag_type *cttype = find_cttype(&ctc->ct);
+
+ if (!cttype || !cttype->desc.free_ctx)
+ return false;
+
+ lockdep_assert_held(&cttype->mod_lock);
+ BUG_ON(!rwsem_is_locked(&cttype->mod_lock));
+
+ if (codetag_ctx_enabled(ctc) == enable)
+ return false;
+
+ if (enable) {
+ /* Initialize context capture fields only once */
+ if (!(ctc->ct.flags & CTC_FLAG_CTX_READY)) {
+ spin_lock_init(&ctc->ctx_lock);
+ INIT_LIST_HEAD(&ctc->ctx_head);
+ ctc->ct.flags |= CTC_FLAG_CTX_READY;
+ }
+ ctc->ct.flags |= CTC_FLAG_CTX_ENABLED;
+ } else {
+ /*
+ * The list of context objects is intentionally left untouched.
+ * It can be read back and if context capture is re-enabled it
+ * will append new objects.
+ */
+ ctc->ct.flags &= ~CTC_FLAG_CTX_ENABLED;
+ }
+
+ return true;
+}
+
+bool codetag_has_ctx(struct codetag_with_ctx *ctc)
+{
+ bool no_ctx;
+
+ if (!(ctc->ct.flags & CTC_FLAG_CTX_READY))
+ return false;
+
+ spin_lock(&ctc->ctx_lock);
+ no_ctx = list_empty(&ctc->ctx_head);
+ spin_unlock(&ctc->ctx_lock);
+
+ return !no_ctx;
+}
+
void codetag_to_text(struct seq_buf *out, struct codetag *ct)
{
seq_buf_printf(out, "%s:%u module:%s func:%s",
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:21:33

by Suren Baghdasaryan

Subject: [RFC PATCH 11/30] mm: introduce slabobj_ext to support slab object extensions

Currently slab pages can store only vectors of obj_cgroup pointers in
page->memcg_data. Introduce slabobj_ext structure to allow more data
to be stored for each slab object. Wraps obj_cgroup into slabobj_ext
to support current functionality while allowing to extend slabobj_ext
in the future.

Note: ideally the config dependency should be turned the other way around:
MEMCG should depend on SLAB_OBJ_EXT and {page|slab|folio}.memcg_data would
be renamed to something like {page|slab|folio}.objext_data. However doing
this in an RFC would introduce considerable churn unrelated to the
overall idea, so it is deferred until v1.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/memcontrol.h | 18 ++++--
init/Kconfig | 5 ++
mm/kfence/core.c | 2 +-
mm/memcontrol.c | 60 ++++++++++---------
mm/page_owner.c | 2 +-
mm/slab.h | 119 +++++++++++++++++++++++++------------
6 files changed, 131 insertions(+), 75 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6257867fbf95..315399f77173 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -227,6 +227,14 @@ struct obj_cgroup {
};
};

+/*
+ * Extended information for slab objects stored as an array in page->memcg_data
+ * if MEMCG_DATA_OBJEXTS is set.
+ */
+struct slabobj_ext {
+ struct obj_cgroup *objcg;
+} __aligned(8);
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -363,7 +371,7 @@ extern struct mem_cgroup *root_mem_cgroup;

enum page_memcg_data_flags {
/* page->memcg_data is a pointer to an objcgs vector */
- MEMCG_DATA_OBJCGS = (1UL << 0),
+ MEMCG_DATA_OBJEXTS = (1UL << 0),
/* page has been accounted as a non-slab kernel page */
MEMCG_DATA_KMEM = (1UL << 1),
/* the next bit after the last actual flag */
@@ -401,7 +409,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;

VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);

return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -422,7 +430,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;

VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);

return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -517,7 +525,7 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
*/
unsigned long memcg_data = READ_ONCE(page->memcg_data);

- if (memcg_data & MEMCG_DATA_OBJCGS)
+ if (memcg_data & MEMCG_DATA_OBJEXTS)
return NULL;

if (memcg_data & MEMCG_DATA_KMEM) {
@@ -556,7 +564,7 @@ static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
static inline bool folio_memcg_kmem(struct folio *folio)
{
VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
- VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
return folio->memcg_data & MEMCG_DATA_KMEM;
}

diff --git a/init/Kconfig b/init/Kconfig
index 532362fcfe31..82396d7a2717 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -958,6 +958,10 @@ config MEMCG
help
Provides control over the memory footprint of tasks in a cgroup.

+config SLAB_OBJ_EXT
+ bool
+ depends on MEMCG
+
config MEMCG_SWAP
bool
depends on MEMCG && SWAP
@@ -966,6 +970,7 @@ config MEMCG_SWAP
config MEMCG_KMEM
bool
depends on MEMCG && !SLOB
+ select SLAB_OBJ_EXT
default y

config BLK_CGROUP
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index c252081b11df..c0958e4a32e2 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -569,7 +569,7 @@ static unsigned long kfence_init_pool(void)
__folio_set_slab(slab_folio(slab));
#ifdef CONFIG_MEMCG
slab->memcg_data = (unsigned long)&kfence_metadata[i / 2 - 1].objcg |
- MEMCG_DATA_OBJCGS;
+ MEMCG_DATA_OBJEXTS;
#endif
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b69979c9ced5..3f407ef2f3f1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2793,7 +2793,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
folio->memcg_data = (unsigned long)memcg;
}

-#ifdef CONFIG_MEMCG_KMEM
+#ifdef CONFIG_SLAB_OBJ_EXT
/*
* The allocated objcg pointers array is not accounted directly.
* Moreover, it should not come from DMA buffer and is not readily
@@ -2801,38 +2801,20 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
*/
#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)

-/*
- * mod_objcg_mlstate() may be called with irq enabled, so
- * mod_memcg_lruvec_state() should be used.
- */
-static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
- struct pglist_data *pgdat,
- enum node_stat_item idx, int nr)
-{
- struct mem_cgroup *memcg;
- struct lruvec *lruvec;
-
- rcu_read_lock();
- memcg = obj_cgroup_memcg(objcg);
- lruvec = mem_cgroup_lruvec(memcg, pgdat);
- mod_memcg_lruvec_state(lruvec, idx, nr);
- rcu_read_unlock();
-}
-
-int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
- gfp_t gfp, bool new_slab)
+int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
+ gfp_t gfp, bool new_slab)
{
unsigned int objects = objs_per_slab(s, slab);
unsigned long memcg_data;
void *vec;

gfp &= ~OBJCGS_CLEAR_MASK;
- vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
+ vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
slab_nid(slab));
if (!vec)
return -ENOMEM;

- memcg_data = (unsigned long) vec | MEMCG_DATA_OBJCGS;
+ memcg_data = (unsigned long) vec | MEMCG_DATA_OBJEXTS;
if (new_slab) {
/*
* If the slab is brand new and nobody can yet access its
@@ -2843,7 +2825,7 @@ int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
} else if (cmpxchg(&slab->memcg_data, 0, memcg_data)) {
/*
* If the slab is already in use, somebody can allocate and
- * assign obj_cgroups in parallel. In this case the existing
+ * assign slabobj_exts in parallel. In this case the existing
* objcg vector should be reused.
*/
kfree(vec);
@@ -2853,6 +2835,26 @@ int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
kmemleak_not_leak(vec);
return 0;
}
+#endif /* CONFIG_SLAB_OBJ_EXT */
+
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * mod_objcg_mlstate() may be called with irq enabled, so
+ * mod_memcg_lruvec_state() should be used.
+ */
+static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
+ struct pglist_data *pgdat,
+ enum node_stat_item idx, int nr)
+{
+ struct mem_cgroup *memcg;
+ struct lruvec *lruvec;
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ mod_memcg_lruvec_state(lruvec, idx, nr);
+ rcu_read_unlock();
+}

static __always_inline
struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
@@ -2863,18 +2865,18 @@ struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
* slab->memcg_data.
*/
if (folio_test_slab(folio)) {
- struct obj_cgroup **objcgs;
+ struct slabobj_ext *obj_exts;
struct slab *slab;
unsigned int off;

slab = folio_slab(folio);
- objcgs = slab_objcgs(slab);
- if (!objcgs)
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
return NULL;

off = obj_to_index(slab->slab_cache, slab, p);
- if (objcgs[off])
- return obj_cgroup_memcg(objcgs[off]);
+ if (obj_exts[off].objcg)
+ return obj_cgroup_memcg(obj_exts[off].objcg);

return NULL;
}
diff --git a/mm/page_owner.c b/mm/page_owner.c
index e4c6f3f1695b..fd4af1ad34b8 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -353,7 +353,7 @@ static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
if (!memcg_data)
goto out_unlock;

- if (memcg_data & MEMCG_DATA_OBJCGS)
+ if (memcg_data & MEMCG_DATA_OBJEXTS)
ret += scnprintf(kbuf + ret, count - ret,
"Slab cache page\n");

diff --git a/mm/slab.h b/mm/slab.h
index 4ec82bec15ec..c767ce3f0fe2 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -422,36 +422,94 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
return false;
}

+#ifdef CONFIG_SLAB_OBJ_EXT
+
+static inline bool is_kmem_only_obj_ext(void)
+{
#ifdef CONFIG_MEMCG_KMEM
+ return sizeof(struct slabobj_ext) == sizeof(struct obj_cgroup *);
+#else
+ return false;
+#endif
+}
+
/*
- * slab_objcgs - get the object cgroups vector associated with a slab
+ * slab_obj_exts - get the pointer to the slab object extension vector
+ * associated with a slab.
* @slab: a pointer to the slab struct
*
- * Returns a pointer to the object cgroups vector associated with the slab,
+ * Returns a pointer to the object extension vector associated with the slab,
* or NULL if no such vector has been associated yet.
*/
-static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
+static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
{
unsigned long memcg_data = READ_ONCE(slab->memcg_data);

- VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS),
+ VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJEXTS),
slab_page(slab));
VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, slab_page(slab));

- return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct slabobj_ext *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
}

-int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
- gfp_t gfp, bool new_slab);
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
- enum node_stat_item idx, int nr);
+int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
+ gfp_t gfp, bool new_slab);

-static inline void memcg_free_slab_cgroups(struct slab *slab)
+static inline void free_slab_obj_exts(struct slab *slab)
{
- kfree(slab_objcgs(slab));
+ struct slabobj_ext *obj_exts;
+
+ if (!memcg_kmem_enabled() && is_kmem_only_obj_ext())
+ return;
+
+ obj_exts = slab_obj_exts(slab);
+ kfree(obj_exts);
slab->memcg_data = 0;
}

+static inline void prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
+{
+ struct slab *slab;
+
+ /* If kmem is the only extension then the vector will be created conditionally */
+ if (is_kmem_only_obj_ext())
+ return;
+
+ slab = virt_to_slab(p);
+ if (!slab_obj_exts(slab))
+ WARN(alloc_slab_obj_exts(slab, s, flags, false),
+ "%s, %s: Failed to create slab extension vector!\n",
+ __func__, s->name);
+}
+
+#else /* CONFIG_SLAB_OBJ_EXT */
+
+static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
+{
+ return NULL;
+}
+
+static inline int alloc_slab_obj_exts(struct slab *slab,
+ struct kmem_cache *s, gfp_t gfp,
+ bool new_slab)
+{
+ return 0;
+}
+
+static inline void free_slab_obj_exts(struct slab *slab)
+{
+}
+
+static inline void prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
+{
+}
+
+#endif /* CONFIG_SLAB_OBJ_EXT */
+
+#ifdef CONFIG_MEMCG_KMEM
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+ enum node_stat_item idx, int nr);
+
static inline size_t obj_full_size(struct kmem_cache *s)
{
/*
@@ -519,16 +577,15 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
if (likely(p[i])) {
slab = virt_to_slab(p[i]);

- if (!slab_objcgs(slab) &&
- memcg_alloc_slab_cgroups(slab, s, flags,
- false)) {
+ if (!slab_obj_exts(slab) &&
+ alloc_slab_obj_exts(slab, s, flags, false)) {
obj_cgroup_uncharge(objcg, obj_full_size(s));
continue;
}

off = obj_to_index(s, slab, p[i]);
obj_cgroup_get(objcg);
- slab_objcgs(slab)[off] = objcg;
+ slab_obj_exts(slab)[off].objcg = objcg;
mod_objcg_state(objcg, slab_pgdat(slab),
cache_vmstat_idx(s), obj_full_size(s));
} else {
@@ -541,14 +598,14 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
void **p, int objects)
{
- struct obj_cgroup **objcgs;
+ struct slabobj_ext *obj_exts;
int i;

if (!memcg_kmem_enabled())
return;

- objcgs = slab_objcgs(slab);
- if (!objcgs)
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
return;

for (i = 0; i < objects; i++) {
@@ -556,11 +613,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
unsigned int off;

off = obj_to_index(s, slab, p[i]);
- objcg = objcgs[off];
+ objcg = obj_exts[off].objcg;
if (!objcg)
continue;

- objcgs[off] = NULL;
+ obj_exts[off].objcg = NULL;
obj_cgroup_uncharge(objcg, obj_full_size(s));
mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s),
-obj_full_size(s));
@@ -569,27 +626,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
}

#else /* CONFIG_MEMCG_KMEM */
-static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
-{
- return NULL;
-}
-
static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
{
return NULL;
}

-static inline int memcg_alloc_slab_cgroups(struct slab *slab,
- struct kmem_cache *s, gfp_t gfp,
- bool new_slab)
-{
- return 0;
-}
-
-static inline void memcg_free_slab_cgroups(struct slab *slab)
-{
-}
-
static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
struct list_lru *lru,
struct obj_cgroup **objcgp,
@@ -627,7 +668,7 @@ static __always_inline void account_slab(struct slab *slab, int order,
struct kmem_cache *s, gfp_t gfp)
{
if (memcg_kmem_enabled() && (s->flags & SLAB_ACCOUNT))
- memcg_alloc_slab_cgroups(slab, s, gfp, true);
+ alloc_slab_obj_exts(slab, s, gfp, true);

mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
PAGE_SIZE << order);
@@ -636,8 +677,7 @@ static __always_inline void account_slab(struct slab *slab, int order,
static __always_inline void unaccount_slab(struct slab *slab, int order,
struct kmem_cache *s)
{
- if (memcg_kmem_enabled())
- memcg_free_slab_cgroups(slab);
+ free_slab_obj_exts(slab);

mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
-(PAGE_SIZE << order));
@@ -729,6 +769,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
memset(p[i], 0, s->object_size);
kmemleak_alloc_recursive(p[i], s->object_size, 1,
s->flags, flags);
+ prepare_slab_obj_exts_hook(s, flags, p[i]);
}

memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:21:53

by Suren Baghdasaryan

Subject: [RFC PATCH 19/30] move stack capture functionality into a separate function for reuse

Make the save_stack() function part of the stackdepot API so that it can
be used outside of page_owner. Also rename the task_struct flag
in_page_owner to in_capture_stack to better convey its wider use.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/sched.h | 6 ++--
include/linux/stackdepot.h | 3 ++
lib/stackdepot.c | 68 ++++++++++++++++++++++++++++++++++++++
mm/page_owner.c | 52 ++---------------------------
4 files changed, 77 insertions(+), 52 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e7b2f8a5c711..d06cad6c14bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -930,9 +930,9 @@ struct task_struct {
/* Stalled due to lack of memory */
unsigned in_memstall:1;
#endif
-#ifdef CONFIG_PAGE_OWNER
- /* Used by page_owner=on to detect recursion in page tracking. */
- unsigned in_page_owner:1;
+#ifdef CONFIG_STACKDEPOT
+ /* Used by stack_depot_capture_stack to detect recursion. */
+ unsigned in_capture_stack:1;
#endif
#ifdef CONFIG_EVENTFD
/* Recursion prevention for eventfd_signal() */
diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index bc2797955de9..8dc9fdb2c4dd 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -64,4 +64,7 @@ int stack_depot_snprint(depot_stack_handle_t handle, char *buf, size_t size,

void stack_depot_print(depot_stack_handle_t stack);

+bool stack_depot_capture_init(void);
+depot_stack_handle_t stack_depot_capture_stack(gfp_t flags);
+
#endif
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index e73fda23388d..c8615bd6dc25 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -514,3 +514,71 @@ depot_stack_handle_t stack_depot_save(unsigned long *entries,
return __stack_depot_save(entries, nr_entries, alloc_flags, true);
}
EXPORT_SYMBOL_GPL(stack_depot_save);
+
+static depot_stack_handle_t recursion_handle;
+static depot_stack_handle_t failure_handle;
+
+static __always_inline depot_stack_handle_t create_custom_stack(void)
+{
+ unsigned long entries[4];
+ unsigned int nr_entries;
+
+ nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
+ return stack_depot_save(entries, nr_entries, GFP_KERNEL);
+}
+
+static noinline void register_recursion_stack(void)
+{
+ recursion_handle = create_custom_stack();
+}
+
+static noinline void register_failure_stack(void)
+{
+ failure_handle = create_custom_stack();
+}
+
+bool stack_depot_capture_init(void)
+{
+ static DEFINE_MUTEX(stack_depot_capture_init_mutex);
+ static bool utility_stacks_ready;
+
+ mutex_lock(&stack_depot_capture_init_mutex);
+ if (!utility_stacks_ready) {
+ register_recursion_stack();
+ register_failure_stack();
+ utility_stacks_ready = true;
+ }
+ mutex_unlock(&stack_depot_capture_init_mutex);
+
+ return utility_stacks_ready;
+}
+
+/* TODO: teach stack_depot_capture_stack to use off-stack temporary storage */
+#define CAPTURE_STACK_DEPTH (16)
+
+depot_stack_handle_t stack_depot_capture_stack(gfp_t flags)
+{
+ unsigned long entries[CAPTURE_STACK_DEPTH];
+ depot_stack_handle_t handle;
+ unsigned int nr_entries;
+
+ /*
+ * Avoid recursion.
+ *
+ * Sometimes page metadata allocation tracking requires more
+ * memory to be allocated:
+ * - when new stack trace is saved to stack depot
+ * - when backtrace itself is calculated (ia64)
+ */
+ if (current->in_capture_stack)
+ return recursion_handle;
+ current->in_capture_stack = 1;
+
+ nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2);
+ handle = stack_depot_save(entries, nr_entries, flags);
+ if (!handle)
+ handle = failure_handle;
+
+ current->in_capture_stack = 0;
+ return handle;
+}
diff --git a/mm/page_owner.c b/mm/page_owner.c
index fd4af1ad34b8..c3173e34a779 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -15,12 +15,6 @@

#include "internal.h"

-/*
- * TODO: teach PAGE_OWNER_STACK_DEPTH (__dump_page_owner and save_stack)
- * to use off stack temporal storage
- */
-#define PAGE_OWNER_STACK_DEPTH (16)
-
struct page_owner {
unsigned short order;
short last_migrate_reason;
@@ -37,8 +31,6 @@ struct page_owner {
static bool page_owner_enabled __initdata;
DEFINE_STATIC_KEY_FALSE(page_owner_inited);

-static depot_stack_handle_t dummy_handle;
-static depot_stack_handle_t failure_handle;
static depot_stack_handle_t early_handle;

static void init_early_allocated_pages(void);
@@ -68,16 +60,6 @@ static __always_inline depot_stack_handle_t create_dummy_stack(void)
return stack_depot_save(entries, nr_entries, GFP_KERNEL);
}

-static noinline void register_dummy_stack(void)
-{
- dummy_handle = create_dummy_stack();
-}
-
-static noinline void register_failure_stack(void)
-{
- failure_handle = create_dummy_stack();
-}
-
static noinline void register_early_stack(void)
{
early_handle = create_dummy_stack();
@@ -88,8 +70,7 @@ static __init void init_page_owner(void)
if (!page_owner_enabled)
return;

- register_dummy_stack();
- register_failure_stack();
+ stack_depot_capture_init();
register_early_stack();
static_branch_enable(&page_owner_inited);
init_early_allocated_pages();
@@ -106,33 +87,6 @@ static inline struct page_owner *get_page_owner(struct page_ext *page_ext)
return (void *)page_ext + page_owner_ops.offset;
}

-static noinline depot_stack_handle_t save_stack(gfp_t flags)
-{
- unsigned long entries[PAGE_OWNER_STACK_DEPTH];
- depot_stack_handle_t handle;
- unsigned int nr_entries;
-
- /*
- * Avoid recursion.
- *
- * Sometimes page metadata allocation tracking requires more
- * memory to be allocated:
- * - when new stack trace is saved to stack depot
- * - when backtrace itself is calculated (ia64)
- */
- if (current->in_page_owner)
- return dummy_handle;
- current->in_page_owner = 1;
-
- nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2);
- handle = stack_depot_save(entries, nr_entries, flags);
- if (!handle)
- handle = failure_handle;
-
- current->in_page_owner = 0;
- return handle;
-}
-
void __reset_page_owner(struct page *page, unsigned short order)
{
int i;
@@ -145,7 +99,7 @@ void __reset_page_owner(struct page *page, unsigned short order)
if (unlikely(!page_ext))
return;

- handle = save_stack(GFP_NOWAIT | __GFP_NOWARN);
+ handle = stack_depot_capture_stack(GFP_NOWAIT | __GFP_NOWARN);
for (i = 0; i < (1 << order); i++) {
__clear_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
page_owner = get_page_owner(page_ext);
@@ -189,7 +143,7 @@ noinline void __set_page_owner(struct page *page, unsigned short order,
if (unlikely(!page_ext))
return;

- handle = save_stack(gfp_mask);
+ handle = stack_depot_capture_stack(gfp_mask);
__set_page_owner_handle(page_ext, handle, order, gfp_mask);
}

--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:22:27

by Suren Baghdasaryan

Subject: [RFC PATCH 07/30] lib: add support for allocation tagging

Introduce CONFIG_ALLOC_TAGGING, which provides definitions to easily
instrument allocators. It also registers an "alloc_tags" codetag type
with a debugfs interface to output allocation tag information.
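
The cover letter's "special ELF section" trick that this builds on can be demonstrated standalone. A hedged userspace sketch, with illustrative section and struct names; it relies on the linker-generated `__start_`/`__stop_` symbols, so it assumes an ELF toolchain (GCC/Clang on Linux):

```c
/*
 * Userspace sketch of the codetag section trick: objects emitted into a
 * named ELF section can be walked at runtime as a plain array via the
 * linker-generated __start_<sec>/__stop_<sec> symbols.
 */
struct tag {
	const char *file;
	int line;
};

#define DEFINE_TAG(name)						\
	static struct tag name __attribute__((used, aligned(8),		\
		section("my_tags"))) = { __FILE__, __LINE__ }

DEFINE_TAG(tag_a);
DEFINE_TAG(tag_b);

/* Defined by the linker for any non-empty, identifier-named section. */
extern struct tag __start_my_tags[];
extern struct tag __stop_my_tags[];

/* Iterate the section as an array, much like codetag_next_ct() does. */
int count_tags(void)
{
	int n = 0;
	struct tag *t;

	for (t = __start_my_tags; t != __stop_my_tags; t++)
		n++;
	return n;
}
```

The kernel variant additionally needs the SECTION_WITH_BOUNDARIES() linker-script glue below because vmlinux and modules use explicit linker scripts rather than the default orphan-section handling.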

Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/asm-generic/codetag.lds.h | 14 +++
include/asm-generic/vmlinux.lds.h | 3 +
include/linux/alloc_tag.h | 66 +++++++++++++
lib/Kconfig.debug | 5 +
lib/Makefile | 2 +
lib/alloc_tag.c | 158 ++++++++++++++++++++++++++++++
scripts/module.lds.S | 7 ++
7 files changed, 255 insertions(+)
create mode 100644 include/asm-generic/codetag.lds.h
create mode 100644 include/linux/alloc_tag.h
create mode 100644 lib/alloc_tag.c

diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
new file mode 100644
index 000000000000..64f536b80380
--- /dev/null
+++ b/include/asm-generic/codetag.lds.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_GENERIC_CODETAG_LDS_H
+#define __ASM_GENERIC_CODETAG_LDS_H
+
+#define SECTION_WITH_BOUNDARIES(_name) \
+ . = ALIGN(8); \
+ __start_##_name = .; \
+ KEEP(*(_name)) \
+ __stop_##_name = .;
+
+#define CODETAG_SECTIONS() \
+ SECTION_WITH_BOUNDARIES(alloc_tags)
+
+#endif /* __ASM_GENERIC_CODETAG_LDS_H */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 7515a465ec03..c2dc2a59ab2e 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -50,6 +50,8 @@
* [__nosave_begin, __nosave_end] for the nosave data
*/

+#include <asm-generic/codetag.lds.h>
+
#ifndef LOAD_OFFSET
#define LOAD_OFFSET 0
#endif
@@ -348,6 +350,7 @@
__start___dyndbg = .; \
KEEP(*(__dyndbg)) \
__stop___dyndbg = .; \
+ CODETAG_SECTIONS() \
LIKELY_PROFILE() \
BRANCH_PROFILE() \
TRACE_PRINTKS() \
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
new file mode 100644
index 000000000000..b3f589afb1c9
--- /dev/null
+++ b/include/linux/alloc_tag.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * allocation tagging
+ */
+#ifndef _LINUX_ALLOC_TAG_H
+#define _LINUX_ALLOC_TAG_H
+
+#include <linux/bug.h>
+#include <linux/codetag.h>
+#include <linux/container_of.h>
+#include <linux/lazy-percpu-counter.h>
+
+/*
+ * An instance of this structure is created in a special ELF section at every
+ * allocation callsite. At runtime, the special section is treated as
+ * an array of these. The embedded codetag ties into the codetag framework.
+ */
+struct alloc_tag {
+ struct codetag ct;
+ unsigned long last_wrap;
+ struct raw_lazy_percpu_counter call_count;
+ struct raw_lazy_percpu_counter bytes_allocated;
+} __aligned(8);
+
+static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
+{
+ return container_of(ct, struct alloc_tag, ct);
+}
+
+#define DEFINE_ALLOC_TAG(_alloc_tag) \
+ static struct alloc_tag _alloc_tag __used __aligned(8) \
+ __section("alloc_tags") = { .ct = CODE_TAG_INIT }
+
+#define alloc_tag_counter_read(counter) \
+ __lazy_percpu_counter_read(counter)
+
+static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes)
+{
+ struct alloc_tag *tag = ct_to_alloc_tag(ref->ct);
+
+ __lazy_percpu_counter_add(&tag->call_count, &tag->last_wrap, -1);
+ __lazy_percpu_counter_add(&tag->bytes_allocated, &tag->last_wrap, -bytes);
+ ref->ct = NULL;
+}
+
+#define alloc_tag_sub(_ref, _bytes) \
+do { \
+ if ((_ref) && (_ref)->ct) \
+ __alloc_tag_sub(_ref, _bytes); \
+} while (0)
+
+static inline void __alloc_tag_add(struct alloc_tag *tag, union codetag_ref *ref, size_t bytes)
+{
+ ref->ct = &tag->ct;
+ __lazy_percpu_counter_add(&tag->call_count, &tag->last_wrap, 1);
+ __lazy_percpu_counter_add(&tag->bytes_allocated, &tag->last_wrap, bytes);
+}
+
+#define alloc_tag_add(_ref, _bytes) \
+do { \
+ DEFINE_ALLOC_TAG(_alloc_tag); \
+ if (_ref && !WARN_ONCE(_ref->ct, "alloc_tag was not cleared")) \
+ __alloc_tag_add(&_alloc_tag, _ref, _bytes); \
+} while (0)
+
+#endif /* _LINUX_ALLOC_TAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 22bc1eff7f8f..795bf6993f8a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -973,6 +973,11 @@ config CODE_TAGGING
bool
select KALLSYMS

+config ALLOC_TAGGING
+ bool
+ select CODE_TAGGING
+ select LAZY_PERCPU_COUNTER
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"

diff --git a/lib/Makefile b/lib/Makefile
index 574d7716e640..dc00533fc5c8 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -228,6 +228,8 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o

obj-$(CONFIG_CODE_TAGGING) += codetag.o
+obj-$(CONFIG_ALLOC_TAGGING) += alloc_tag.o
+
lib-$(CONFIG_GENERIC_BUG) += bug.o

obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
new file mode 100644
index 000000000000..082fbde184ef
--- /dev/null
+++ b/lib/alloc_tag.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/alloc_tag.h>
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/seq_buf.h>
+#include <linux/uaccess.h>
+
+#ifdef CONFIG_DEBUG_FS
+
+struct alloc_tag_file_iterator {
+ struct codetag_iterator ct_iter;
+ struct seq_buf buf;
+ char rawbuf[4096];
+};
+
+struct user_buf {
+ char __user *buf; /* destination user buffer */
+ size_t size; /* size of requested read */
+ ssize_t ret; /* bytes read so far */
+};
+
+static int flush_ubuf(struct user_buf *dst, struct seq_buf *src)
+{
+ if (src->len) {
+ size_t bytes = min_t(size_t, src->len, dst->size);
+ int err = copy_to_user(dst->buf, src->buffer, bytes);
+
+ if (err)
+ return err;
+
+ dst->ret += bytes;
+ dst->buf += bytes;
+ dst->size -= bytes;
+ src->len -= bytes;
+ memmove(src->buffer, src->buffer + bytes, src->len);
+ }
+
+ return 0;
+}
+
+static int alloc_tag_file_open(struct inode *inode, struct file *file)
+{
+ struct codetag_type *cttype = inode->i_private;
+ struct alloc_tag_file_iterator *iter;
+
+ iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter)
+ return -ENOMEM;
+
+ codetag_lock_module_list(cttype, true);
+ iter->ct_iter = codetag_get_ct_iter(cttype);
+ codetag_lock_module_list(cttype, false);
+ seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
+ file->private_data = iter;
+
+ return 0;
+}
+
+static int alloc_tag_file_release(struct inode *inode, struct file *file)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+
+ kfree(iter);
+ return 0;
+}
+
+static void alloc_tag_to_text(struct seq_buf *out, struct codetag *ct)
+{
+ struct alloc_tag *tag = ct_to_alloc_tag(ct);
+ char buf[10];
+
+ string_get_size(alloc_tag_counter_read(&tag->bytes_allocated), 1,
+ STRING_UNITS_2, buf, sizeof(buf));
+
+ seq_buf_printf(out, "%8s %8lld ", buf, alloc_tag_counter_read(&tag->call_count));
+ codetag_to_text(out, ct);
+ seq_buf_putc(out, '\n');
+}
+
+static ssize_t alloc_tag_file_read(struct file *file, char __user *ubuf,
+ size_t size, loff_t *ppos)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+ struct user_buf buf = { .buf = ubuf, .size = size };
+ struct codetag *ct;
+ int err = 0;
+
+ codetag_lock_module_list(iter->ct_iter.cttype, true);
+ while (1) {
+ err = flush_ubuf(&buf, &iter->buf);
+ if (err || !buf.size)
+ break;
+
+ ct = codetag_next_ct(&iter->ct_iter);
+ if (!ct)
+ break;
+
+ alloc_tag_to_text(&iter->buf, ct);
+ }
+ codetag_lock_module_list(iter->ct_iter.cttype, false);
+
+ return err ? : buf.ret;
+}
+
+static const struct file_operations alloc_tag_file_ops = {
+ .owner = THIS_MODULE,
+ .open = alloc_tag_file_open,
+ .release = alloc_tag_file_release,
+ .read = alloc_tag_file_read,
+};
+
+static int dbgfs_init(struct codetag_type *cttype)
+{
+ struct dentry *file;
+
+ file = debugfs_create_file("alloc_tags", 0444, NULL, cttype,
+ &alloc_tag_file_ops);
+
+ return IS_ERR(file) ? PTR_ERR(file) : 0;
+}
+
+#else /* CONFIG_DEBUG_FS */
+
+static int dbgfs_init(struct codetag_type *cttype) { return 0; }
+
+#endif /* CONFIG_DEBUG_FS */
+
+static void alloc_tag_module_unload(struct codetag_type *cttype, struct codetag_module *cmod)
+{
+ struct codetag_iterator iter = codetag_get_ct_iter(cttype);
+ struct codetag *ct;
+
+ for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
+ struct alloc_tag *tag = ct_to_alloc_tag(ct);
+
+ __lazy_percpu_counter_exit(&tag->call_count);
+ __lazy_percpu_counter_exit(&tag->bytes_allocated);
+ }
+}
+
+static int __init alloc_tag_init(void)
+{
+ struct codetag_type *cttype;
+ const struct codetag_type_desc desc = {
+ .section = "alloc_tags",
+ .tag_size = sizeof(struct alloc_tag),
+ .module_unload = alloc_tag_module_unload,
+ };
+
+ cttype = codetag_register_type(&desc);
+ if (IS_ERR_OR_NULL(cttype))
+ return PTR_ERR(cttype);
+
+ return dbgfs_init(cttype);
+}
+module_init(alloc_tag_init);
diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index 3a3aa2354ed8..e73a8781f239 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S
@@ -12,6 +12,8 @@
# define SANITIZER_DISCARDS
#endif

+#include <asm-generic/codetag.lds.h>
+
SECTIONS {
/DISCARD/ : {
*(.discard)
@@ -47,6 +49,7 @@ SECTIONS {
.data : {
*(.data .data.[0-9a-zA-Z_]*)
*(.data..L*)
+ CODETAG_SECTIONS()
}

.rodata : {
@@ -62,6 +65,10 @@ SECTIONS {
*(.text.__cfi_check)
*(.text .text.[0-9a-zA-Z_]* .text..L.cfi*)
}
+#else
+ .data : {
+ CODETAG_SECTIONS()
+ }
#endif
}

--
2.37.2.672.g94769d06f0-goog
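
The read loop in alloc_tag_file_read() above follows a common drain pattern: format one record into a fixed staging buffer, flush as much as the reader asked for, shift the remainder to the front, repeat. A minimal userspace sketch of that flush step, with copy_to_user() replaced by memcpy and illustrative names:

```c
#include <string.h>

/*
 * Sketch of the flush_ubuf() draining pattern: hand the reader as many
 * staged bytes as fit, then memmove the leftover to the front so the
 * next record can be formatted into the same buffer.
 */
struct stage {
	char buf[64];
	size_t len;
};

/* Returns the number of bytes handed to the reader. */
size_t flush_stage(struct stage *src, char *dst, size_t dst_size)
{
	size_t bytes = src->len < dst_size ? src->len : dst_size;

	memcpy(dst, src->buf, bytes);	/* stands in for copy_to_user() */
	src->len -= bytes;
	memmove(src->buf, src->buf + bytes, src->len);
	return bytes;
}
```

Because leftover bytes survive between read() calls in the iterator, a reader with a small buffer still sees every record exactly once.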

2022-08-30 22:37:12

by Suren Baghdasaryan

Subject: [RFC PATCH 03/30] Lazy percpu counters

From: Kent Overstreet <[email protected]>

This patch adds lib/lazy-percpu-counter.c, which implements counters
that start out as atomics, but lazily switch to percpu mode if the
update rate crosses some threshold (arbitrarily set at 256 per second).
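
The bit layout described in the header below (low bit = percpu-mode flag, top 8 bits = wrap counter, 55 signed bits of value) can be checked with plain arithmetic. A userspace-only sketch mirroring lazy_percpu_counter_atomic_val() and the delta packing in __lazy_percpu_counter_add(); it assumes the usual arithmetic right shift on signed values, as the kernel does:

```c
#include <stdint.h>

/* Packed word: [8-bit mod counter][55-bit signed value][pcpu-mode bit] */
#define MOD_BITS	8
#define PCPU_BIT	1
#define MOD_MASK	(~(~0ULL >> MOD_BITS))
#define MOD_INC		(1ULL << (64 - MOD_BITS))

/* Recover the signed value: shift the mod bits off the top, then
 * arithmetic-shift past them and the mode bit to preserve the sign. */
int64_t atomic_val(int64_t v)
{
	return (int64_t)((uint64_t)v << MOD_BITS) >> (MOD_BITS + PCPU_BIT);
}

/* Pack a delta the way each atomic add does: step over the mode bit
 * and bump the mod counter in the top byte. */
int64_t pack_add(int64_t i)
{
	uint64_t a = (uint64_t)i << PCPU_BIT;

	a &= ~MOD_MASK;
	a |= MOD_INC;
	return (int64_t)a;
}

/* Low bit set means the word is really a percpu pointer. */
int is_pcpu(int64_t v)
{
	return v & PCPU_BIT;
}
```

Note that overflow out of the top mod byte simply falls off the word, so frequent updates wrap the mod counter without corrupting the 55-bit value; the wrap is what triggers the rate check.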

Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/lazy-percpu-counter.h | 67 +++++++++++++
lib/Kconfig | 3 +
lib/Makefile | 2 +
lib/lazy-percpu-counter.c | 141 ++++++++++++++++++++++++++++
4 files changed, 213 insertions(+)
create mode 100644 include/linux/lazy-percpu-counter.h
create mode 100644 lib/lazy-percpu-counter.c

diff --git a/include/linux/lazy-percpu-counter.h b/include/linux/lazy-percpu-counter.h
new file mode 100644
index 000000000000..a22a2b9a9f32
--- /dev/null
+++ b/include/linux/lazy-percpu-counter.h
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Lazy percpu counters:
+ * (C) 2022 Kent Overstreet
+ *
+ * Lazy percpu counters start out in atomic mode, then switch to percpu mode if
+ * the update rate crosses some threshold.
+ *
+ * This means we don't have to decide between low memory overhead atomic
+ * counters and higher performance percpu counters - we can have our cake and
+ * eat it, too!
+ *
+ * Internally we use an atomic64_t, where the low bit indicates whether we're in
+ * percpu mode, and the high 8 bits are a secondary counter that's incremented
+ * when the counter is modified - meaning 55 bits of precision are available for
+ * the counter itself.
+ *
+ * lazy_percpu_counter is 16 bytes (on 64 bit machines), raw_lazy_percpu_counter
+ * is 8 bytes but requires a separate unsigned long to record when the counter
+ * wraps - because sometimes multiple counters are used together and can share
+ * the same timestamp.
+ */
+
+#ifndef _LINUX_LAZY_PERCPU_COUNTER_H
+#define _LINUX_LAZY_PERCPU_COUNTER_H
+
+struct raw_lazy_percpu_counter {
+ atomic64_t v;
+};
+
+void __lazy_percpu_counter_exit(struct raw_lazy_percpu_counter *c);
+void __lazy_percpu_counter_add(struct raw_lazy_percpu_counter *c,
+ unsigned long *last_wrap, s64 i);
+s64 __lazy_percpu_counter_read(struct raw_lazy_percpu_counter *c);
+
+static inline void __lazy_percpu_counter_sub(struct raw_lazy_percpu_counter *c,
+ unsigned long *last_wrap, s64 i)
+{
+ __lazy_percpu_counter_add(c, last_wrap, -i);
+}
+
+struct lazy_percpu_counter {
+ struct raw_lazy_percpu_counter v;
+ unsigned long last_wrap;
+};
+
+static inline void lazy_percpu_counter_exit(struct lazy_percpu_counter *c)
+{
+ __lazy_percpu_counter_exit(&c->v);
+}
+
+static inline void lazy_percpu_counter_add(struct lazy_percpu_counter *c, s64 i)
+{
+ __lazy_percpu_counter_add(&c->v, &c->last_wrap, i);
+}
+
+static inline void lazy_percpu_counter_sub(struct lazy_percpu_counter *c, s64 i)
+{
+ __lazy_percpu_counter_sub(&c->v, &c->last_wrap, i);
+}
+
+static inline s64 lazy_percpu_counter_read(struct lazy_percpu_counter *c)
+{
+ return __lazy_percpu_counter_read(&c->v);
+}
+
+#endif /* _LINUX_LAZY_PERCPU_COUNTER_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index dc1ab2ed1dc6..fc6dbc425728 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -498,6 +498,9 @@ config ASSOCIATIVE_ARRAY

for more information.

+config LAZY_PERCPU_COUNTER
+ bool
+
config HAS_IOMEM
bool
depends on !NO_IOMEM
diff --git a/lib/Makefile b/lib/Makefile
index ffabc30a27d4..cc7762748708 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -163,6 +163,8 @@ obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o

+obj-$(CONFIG_LAZY_PERCPU_COUNTER) += lazy-percpu-counter.o
+
obj-$(CONFIG_BITREVERSE) += bitrev.o
obj-$(CONFIG_LINEAR_RANGES) += linear_ranges.o
obj-$(CONFIG_PACKING) += packing.o
diff --git a/lib/lazy-percpu-counter.c b/lib/lazy-percpu-counter.c
new file mode 100644
index 000000000000..299ef36137ee
--- /dev/null
+++ b/lib/lazy-percpu-counter.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/atomic.h>
+#include <linux/gfp.h>
+#include <linux/jiffies.h>
+#include <linux/lazy-percpu-counter.h>
+#include <linux/percpu.h>
+
+/*
+ * We use the high bits of the atomic counter for a secondary counter, which is
+ * incremented every time the counter is touched. When the secondary counter
+ * wraps, we check the time the counter last wrapped, and if it was recent
+ * enough that means the update frequency has crossed our threshold and we
+ * switch to percpu mode:
+ */
+#define COUNTER_MOD_BITS 8
+#define COUNTER_MOD_MASK ~(~0ULL >> COUNTER_MOD_BITS)
+#define COUNTER_MOD_BITS_START (64 - COUNTER_MOD_BITS)
+
+/*
+ * We use the low bit of the counter to indicate whether we're in atomic mode
+ * (low bit clear), or percpu mode (low bit set, counter is a pointer to actual
+ * percpu counters:
+ */
+#define COUNTER_IS_PCPU_BIT 1
+
+static inline u64 __percpu *lazy_percpu_counter_is_pcpu(u64 v)
+{
+ if (!(v & COUNTER_IS_PCPU_BIT))
+ return NULL;
+
+ v ^= COUNTER_IS_PCPU_BIT;
+ return (u64 __percpu *)(unsigned long)v;
+}
+
+static inline s64 lazy_percpu_counter_atomic_val(s64 v)
+{
+ /* Ensure output is sign extended properly: */
+ return (v << COUNTER_MOD_BITS) >>
+ (COUNTER_MOD_BITS + COUNTER_IS_PCPU_BIT);
+}
+
+static void lazy_percpu_counter_switch_to_pcpu(struct raw_lazy_percpu_counter *c)
+{
+ u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
+ u64 old, new, v;
+
+ if (!pcpu_v)
+ return;
+
+ preempt_disable();
+ v = atomic64_read(&c->v);
+ do {
+ if (lazy_percpu_counter_is_pcpu(v)) {
+ free_percpu(pcpu_v);
+ return;
+ }
+
+ old = v;
+ new = (unsigned long)pcpu_v | 1;
+
+ *this_cpu_ptr(pcpu_v) = lazy_percpu_counter_atomic_val(v);
+ } while ((v = atomic64_cmpxchg(&c->v, old, new)) != old);
+ preempt_enable();
+}
+
+/**
+ * __lazy_percpu_counter_exit: Free resources associated with a
+ * raw_lazy_percpu_counter
+ *
+ * @c: counter to exit
+ */
+void __lazy_percpu_counter_exit(struct raw_lazy_percpu_counter *c)
+{
+ free_percpu(lazy_percpu_counter_is_pcpu(atomic64_read(&c->v)));
+}
+EXPORT_SYMBOL_GPL(__lazy_percpu_counter_exit);
+
+/**
+ * __lazy_percpu_counter_read: Read current value of a raw_lazy_percpu_counter
+ *
+ * @c: counter to read
+ */
+s64 __lazy_percpu_counter_read(struct raw_lazy_percpu_counter *c)
+{
+ s64 v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v = lazy_percpu_counter_is_pcpu(v);
+
+ if (pcpu_v) {
+ int cpu;
+
+ v = 0;
+ for_each_possible_cpu(cpu)
+ v += *per_cpu_ptr(pcpu_v, cpu);
+ } else {
+ v = lazy_percpu_counter_atomic_val(v);
+ }
+
+ return v;
+}
+EXPORT_SYMBOL_GPL(__lazy_percpu_counter_read);
+
+/**
+ * __lazy_percpu_counter_add: Add a value to a lazy_percpu_counter
+ *
+ * @c: counter to modify
+ * @last_wrap: pointer to a timestamp, updated when mod counter wraps
+ * @i: value to add
+ */
+void __lazy_percpu_counter_add(struct raw_lazy_percpu_counter *c,
+ unsigned long *last_wrap, s64 i)
+{
+ u64 atomic_i;
+ u64 old, v = atomic64_read(&c->v);
+ u64 __percpu *pcpu_v;
+
+ atomic_i = i << COUNTER_IS_PCPU_BIT;
+ atomic_i &= ~COUNTER_MOD_MASK;
+ atomic_i |= 1ULL << COUNTER_MOD_BITS_START;
+
+ do {
+ pcpu_v = lazy_percpu_counter_is_pcpu(v);
+ if (pcpu_v) {
+ this_cpu_add(*pcpu_v, i);
+ return;
+ }
+
+ old = v;
+ } while ((v = atomic64_cmpxchg(&c->v, old, old + atomic_i)) != old);
+
+ if (unlikely(!(v & COUNTER_MOD_MASK))) {
+ unsigned long now = jiffies;
+
+ if (*last_wrap &&
+ unlikely(time_after(*last_wrap + HZ, now)))
+ lazy_percpu_counter_switch_to_pcpu(c);
+ else
+ *last_wrap = now;
+ }
+}
+EXPORT_SYMBOL(__lazy_percpu_counter_add);
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:41:09

by Suren Baghdasaryan

Subject: [RFC PATCH 16/30] mm: enable slab allocation tagging for kmalloc and friends

Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. to record allocations
and deallocations done by these functions.
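
The wrapping scheme below can be sketched in userspace: a statement-expression macro runs the real allocator, lets a hook observe the (old pointer, result) pair, and yields the result unchanged, so call sites keep their familiar spelling. Names here are illustrative stand-ins, not the kernel macros; it requires GCC/Clang statement expressions:

```c
#include <stdlib.h>

int hook_calls;				/* stands in for alloc_tag counters */

void tag_hook(const void *old, const void *res)
{
	(void)old;			/* a real hook would also untag `old` */
	if (res)
		hook_calls++;		/* model of slab_tag_add() accounting */
}

/* Statement-expression wrapper: run the allocation, notify the hook,
 * and evaluate to the allocator's return value. */
#define alloc_hooks(_old, _do_alloc)		\
({						\
	void *_res = (_do_alloc);		\
	tag_hook((_old), _res);			\
	_res;					\
})

#define my_malloc(_sz)		alloc_hooks(NULL, malloc(_sz))
#define my_realloc(_p, _sz)	alloc_hooks(_p, realloc(_p, _sz))
```

This is why the patch renames the inline functions to `_kmalloc` etc. and reintroduces the old names as macros: the macro must expand at the call site so the codetag lands at the caller's file and line, which a plain function cannot provide.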

Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/slab.h | 103 +++++++++++++++++++++++++------------------
mm/slab.c | 2 +
mm/slab_common.c | 16 +++----
mm/slob.c | 2 +
mm/slub.c | 2 +
5 files changed, 75 insertions(+), 50 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 5a198aa02a08..89273be35743 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -191,7 +191,10 @@ int kmem_cache_shrink(struct kmem_cache *s);
/*
* Common kmalloc functions provided by all allocators
*/
-void * __must_check krealloc(const void *objp, size_t new_size, gfp_t flags) __alloc_size(2);
+void * __must_check _krealloc(const void *objp, size_t new_size, gfp_t flags) __alloc_size(2);
+#define krealloc(_p, _size, _flags) \
+ krealloc_hooks(_p, _krealloc(_p, _size, _flags))
+
void kfree(const void *objp);
void kfree_sensitive(const void *objp);
size_t __ksize(const void *objp);
@@ -463,6 +466,15 @@ static inline void slab_tag_dec(const void *ptr) {}

#endif

+#define krealloc_hooks(_p, _do_alloc) \
+({ \
+ void *_res = _do_alloc; \
+ slab_tag_add(_p, _res); \
+ _res; \
+})
+
+#define kmalloc_hooks(_do_alloc) krealloc_hooks(NULL, _do_alloc)
+
void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_size(1);
void *kmem_cache_alloc(struct kmem_cache *s, gfp_t flags) __assume_slab_alignment __malloc;
void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
@@ -541,25 +553,31 @@ static __always_inline void *kmem_cache_alloc_node_trace(struct kmem_cache *s, g
}
#endif /* CONFIG_TRACING */

-extern void *kmalloc_order(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment
+extern void *_kmalloc_order(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment
__alloc_size(1);
+#define kmalloc_order(_size, _flags, _order) \
+ kmalloc_hooks(_kmalloc_order(_size, _flags, _order))

#ifdef CONFIG_TRACING
-extern void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
+extern void *_kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
__assume_page_alignment __alloc_size(1);
#else
-static __always_inline __alloc_size(1) void *kmalloc_order_trace(size_t size, gfp_t flags,
+static __always_inline __alloc_size(1) void *_kmalloc_order_trace(size_t size, gfp_t flags,
unsigned int order)
{
- return kmalloc_order(size, flags, order);
+ return _kmalloc_order(size, flags, order);
}
#endif
+#define kmalloc_order_trace(_size, _flags, _order) \
+ kmalloc_hooks(_kmalloc_order_trace(_size, _flags, _order))

-static __always_inline __alloc_size(1) void *kmalloc_large(size_t size, gfp_t flags)
+static __always_inline __alloc_size(1) void *_kmalloc_large(size_t size, gfp_t flags)
{
unsigned int order = get_order(size);
- return kmalloc_order_trace(size, flags, order);
+ return _kmalloc_order_trace(size, flags, order);
}
+#define kmalloc_large(_size, _flags) \
+ kmalloc_hooks(_kmalloc_large(_size, _flags))

/**
* kmalloc - allocate memory
@@ -615,14 +633,14 @@ static __always_inline __alloc_size(1) void *kmalloc_large(size_t size, gfp_t fl
* Try really hard to succeed the allocation but fail
* eventually.
*/
-static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
+static __always_inline __alloc_size(1) void *_kmalloc(size_t size, gfp_t flags)
{
if (__builtin_constant_p(size)) {
#ifndef CONFIG_SLOB
unsigned int index;
#endif
if (size > KMALLOC_MAX_CACHE_SIZE)
- return kmalloc_large(size, flags);
+ return _kmalloc_large(size, flags);
#ifndef CONFIG_SLOB
index = kmalloc_index(size);

@@ -636,8 +654,9 @@ static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
}
return __kmalloc(size, flags);
}
+#define kmalloc(_size, _flags) kmalloc_hooks(_kmalloc(_size, _flags))

-static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t flags, int node)
+static __always_inline __alloc_size(1) void *_kmalloc_node(size_t size, gfp_t flags, int node)
{
#ifndef CONFIG_SLOB
if (__builtin_constant_p(size) &&
@@ -654,6 +673,8 @@ static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t fla
#endif
return __kmalloc_node(size, flags, node);
}
+#define kmalloc_node(_size, _flags, _node) \
+ kmalloc_hooks(_kmalloc_node(_size, _flags, _node))

/**
* kmalloc_array - allocate memory for an array.
@@ -661,16 +682,18 @@ static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t fla
* @size: element size.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline __alloc_size(1, 2) void *kmalloc_array(size_t n, size_t size, gfp_t flags)
+static inline __alloc_size(1, 2) void *_kmalloc_array(size_t n, size_t size, gfp_t flags)
{
size_t bytes;

if (unlikely(check_mul_overflow(n, size, &bytes)))
return NULL;
if (__builtin_constant_p(n) && __builtin_constant_p(size))
- return kmalloc(bytes, flags);
- return __kmalloc(bytes, flags);
+ return _kmalloc(bytes, flags);
+ return __kmalloc(bytes, flags);
}
+#define kmalloc_array(_n, _size, _flags) \
+ kmalloc_hooks(_kmalloc_array(_n, _size, _flags))

/**
* krealloc_array - reallocate memory for an array.
@@ -679,7 +702,7 @@ static inline __alloc_size(1, 2) void *kmalloc_array(size_t n, size_t size, gfp_
* @new_size: new size of a single member of the array
* @flags: the type of memory to allocate (see kmalloc)
*/
-static inline __alloc_size(2, 3) void * __must_check krealloc_array(void *p,
+static inline __alloc_size(2, 3) void * __must_check _krealloc_array(void *p,
size_t new_n,
size_t new_size,
gfp_t flags)
@@ -689,8 +712,10 @@ static inline __alloc_size(2, 3) void * __must_check krealloc_array(void *p,
if (unlikely(check_mul_overflow(new_n, new_size, &bytes)))
return NULL;

- return krealloc(p, bytes, flags);
+ return _krealloc(p, bytes, flags);
}
+#define krealloc_array(_p, _n, _size, _flags) \
+ krealloc_hooks(_p, _krealloc_array(_p, _n, _size, _flags))

/**
* kcalloc - allocate memory for an array. The memory is set to zero.
@@ -698,10 +723,8 @@ static inline __alloc_size(2, 3) void * __must_check krealloc_array(void *p,
* @size: element size.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline __alloc_size(1, 2) void *kcalloc(size_t n, size_t size, gfp_t flags)
-{
- return kmalloc_array(n, size, flags | __GFP_ZERO);
-}
+#define kcalloc(_n, _size, _flags) \
+ kmalloc_array(_n, _size, (_flags)|__GFP_ZERO)

/*
* kmalloc_track_caller is a special version of kmalloc that records the
@@ -712,10 +735,10 @@ static inline __alloc_size(1, 2) void *kcalloc(size_t n, size_t size, gfp_t flag
* request comes from.
*/
extern void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller);
-#define kmalloc_track_caller(size, flags) \
- __kmalloc_track_caller(size, flags, _RET_IP_)
+#define kmalloc_track_caller(size, flags) \
+ kmalloc_hooks(__kmalloc_track_caller(size, flags, _RET_IP_))

-static inline __alloc_size(1, 2) void *kmalloc_array_node(size_t n, size_t size, gfp_t flags,
+static inline __alloc_size(1, 2) void *_kmalloc_array_node(size_t n, size_t size, gfp_t flags,
int node)
{
size_t bytes;
@@ -723,26 +746,24 @@ static inline __alloc_size(1, 2) void *kmalloc_array_node(size_t n, size_t size,
if (unlikely(check_mul_overflow(n, size, &bytes)))
return NULL;
if (__builtin_constant_p(n) && __builtin_constant_p(size))
- return kmalloc_node(bytes, flags, node);
+ return _kmalloc_node(bytes, flags, node);
return __kmalloc_node(bytes, flags, node);
}
+#define kmalloc_array_node(_n, _size, _flags, _node) \
+ kmalloc_hooks(_kmalloc_array_node(_n, _size, _flags, _node))

-static inline __alloc_size(1, 2) void *kcalloc_node(size_t n, size_t size, gfp_t flags, int node)
-{
- return kmalloc_array_node(n, size, flags | __GFP_ZERO, node);
-}
-
+#define kcalloc_node(_n, _size, _flags, _node) \
+ kmalloc_array_node(_n, _size, (_flags)|__GFP_ZERO, _node)

#ifdef CONFIG_NUMA
extern void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
unsigned long caller) __alloc_size(1);
-#define kmalloc_node_track_caller(size, flags, node) \
- __kmalloc_node_track_caller(size, flags, node, \
- _RET_IP_)
+#define kmalloc_node_track_caller(size, flags, node) \
+ kmalloc_hooks(__kmalloc_node_track_caller(size, flags, node, _RET_IP_))

#else /* CONFIG_NUMA */

-#define kmalloc_node_track_caller(size, flags, node) \
+#define kmalloc_node_track_caller(size, flags, node) \
kmalloc_track_caller(size, flags)

#endif /* CONFIG_NUMA */
@@ -750,20 +771,16 @@ extern void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
/*
* Shortcuts
*/
-static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
-{
- return kmem_cache_alloc(k, flags | __GFP_ZERO);
-}
+#define kmem_cache_zalloc(_k, _flags) \
+ kmem_cache_alloc(_k, (_flags)|__GFP_ZERO)

/**
* kzalloc - allocate memory. The memory is set to zero.
* @size: how many bytes of memory are required.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
-{
- return kmalloc(size, flags | __GFP_ZERO);
-}
+#define kzalloc(_size, _flags) \
+ kmalloc(_size, (_flags)|__GFP_ZERO)

/**
* kzalloc_node - allocate zeroed memory from a particular memory node.
@@ -771,10 +788,12 @@ static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
* @flags: the type of memory to allocate (see kmalloc).
* @node: memory node from which to allocate
*/
-static inline __alloc_size(1) void *kzalloc_node(size_t size, gfp_t flags, int node)
+static inline __alloc_size(1) void *_kzalloc_node(size_t size, gfp_t flags, int node)
{
- return kmalloc_node(size, flags | __GFP_ZERO, node);
+ return _kmalloc_node(size, flags | __GFP_ZERO, node);
}
+#define kzalloc_node(_size, _flags, _node) \
+ kmalloc_hooks(_kzalloc_node(_size, _flags, _node))

extern void *kvmalloc_node(size_t size, gfp_t flags, int node) __alloc_size(1);
static inline __alloc_size(1) void *kvmalloc(size_t size, gfp_t flags)
diff --git a/mm/slab.c b/mm/slab.c
index ba97aeef7ec1..db344de3b260 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3402,6 +3402,7 @@ static __always_inline void __cache_free(struct kmem_cache *cachep, void *objp,

if (is_kfence_address(objp)) {
kmemleak_free_recursive(objp, cachep->flags);
+ slab_tag_dec(objp);
__kfence_free(objp);
return;
}
@@ -3433,6 +3434,7 @@ void ___cache_free(struct kmem_cache *cachep, void *objp,

check_irq_off();
kmemleak_free_recursive(objp, cachep->flags);
+ slab_tag_dec(objp);
objp = cache_free_debugcheck(cachep, objp, caller);

/*
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 272eda62ecaa..7b6473db5ab4 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -938,7 +938,7 @@ gfp_t kmalloc_fix_flags(gfp_t flags)
* directly to the page allocator. We use __GFP_COMP, because we will need to
* know the allocation order to free the pages properly in kfree.
*/
-void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
+void *_kmalloc_order(size_t size, gfp_t flags, unsigned int order)
{
void *ret = NULL;
struct page *page;
@@ -958,16 +958,16 @@ void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
kmemleak_alloc(ret, size, 1, flags);
return ret;
}
-EXPORT_SYMBOL(kmalloc_order);
+EXPORT_SYMBOL(_kmalloc_order);

#ifdef CONFIG_TRACING
-void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
+void *_kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
{
- void *ret = kmalloc_order(size, flags, order);
+ void *ret = _kmalloc_order(size, flags, order);
trace_kmalloc(_RET_IP_, ret, NULL, size, PAGE_SIZE << order, flags);
return ret;
}
-EXPORT_SYMBOL(kmalloc_order_trace);
+EXPORT_SYMBOL(_kmalloc_order_trace);
#endif

#ifdef CONFIG_SLAB_FREELIST_RANDOM
@@ -1187,7 +1187,7 @@ static __always_inline void *__do_krealloc(const void *p, size_t new_size,
return (void *)p;
}

- ret = kmalloc_track_caller(new_size, flags);
+ ret = __kmalloc_track_caller(new_size, flags, _RET_IP_);
if (ret && p) {
/* Disable KASAN checks as the object's redzone is accessed. */
kasan_disable_current();
@@ -1211,7 +1211,7 @@ static __always_inline void *__do_krealloc(const void *p, size_t new_size,
*
* Return: pointer to the allocated memory or %NULL in case of error
*/
-void *krealloc(const void *p, size_t new_size, gfp_t flags)
+void *_krealloc(const void *p, size_t new_size, gfp_t flags)
{
void *ret;

@@ -1226,7 +1226,7 @@ void *krealloc(const void *p, size_t new_size, gfp_t flags)

return ret;
}
-EXPORT_SYMBOL(krealloc);
+EXPORT_SYMBOL(_krealloc);

/**
* kfree_sensitive - Clear sensitive information in memory before freeing
diff --git a/mm/slob.c b/mm/slob.c
index 2bd4f476c340..23b49f6c9c8f 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -554,6 +554,7 @@ void kfree(const void *block)
if (unlikely(ZERO_OR_NULL_PTR(block)))
return;
kmemleak_free(block);
+ slab_tag_dec(block);

sp = virt_to_folio(block);
if (folio_test_slab(sp)) {
@@ -680,6 +681,7 @@ static void kmem_rcu_free(struct rcu_head *head)
void kmem_cache_free(struct kmem_cache *c, void *b)
{
kmemleak_free_recursive(b, c->flags);
+ slab_tag_dec(b);
trace_kmem_cache_free(_RET_IP_, b, c->name);
if (unlikely(c->flags & SLAB_TYPESAFE_BY_RCU)) {
struct slob_rcu *slob_rcu;
diff --git a/mm/slub.c b/mm/slub.c
index 80199d5ac7c9..caf752087ad6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1715,6 +1715,7 @@ static inline void *kmalloc_large_node_hook(void *ptr, size_t size, gfp_t flags)
static __always_inline void kfree_hook(void *x)
{
kmemleak_free(x);
+ slab_tag_dec(x);
kasan_kfree_large(x);
}

@@ -1722,6 +1723,7 @@ static __always_inline bool slab_free_hook(struct kmem_cache *s,
void *x, bool init)
{
kmemleak_free_recursive(x, s->flags);
+ slab_tag_dec(x);

debug_check_no_locks_freed(x, s->object_size);

--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:43:08

by Suren Baghdasaryan

Subject: [RFC PATCH 01/30] kernel/module: move find_kallsyms_symbol_value declaration

Allow find_kallsyms_symbol_value to be called by code outside of
kernel/module. It will be used for code tagging module support.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/module.h | 1 +
kernel/module/internal.h | 1 -
2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index 518296ea7f73..563d38ad84ed 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -605,6 +605,7 @@ struct module *find_module(const char *name);
int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
char *name, char *module_name, int *exported);

+unsigned long find_kallsyms_symbol_value(struct module *mod, const char *name);
/* Look for this name: can be of form module:name. */
unsigned long module_kallsyms_lookup_name(const char *name);

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 680d980a4fb2..f1b6c477bd93 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -246,7 +246,6 @@ static inline void kmemleak_load_module(const struct module *mod,
void init_build_id(struct module *mod, const struct load_info *info);
void layout_symtab(struct module *mod, struct load_info *info);
void add_kallsyms(struct module *mod, const struct load_info *info);
-unsigned long find_kallsyms_symbol_value(struct module *mod, const char *name);

static inline bool sect_empty(const Elf_Shdr *sect)
{
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:43:42

by Suren Baghdasaryan

Subject: [RFC PATCH 04/30] scripts/kallsyms: Always include __start and __stop symbols

From: Kent Overstreet <[email protected]>

These symbols are used to denote section boundaries: by always including
them we can unify loading sections from modules with loading built-in
sections, which leads to some significant cleanup.

Signed-off-by: Kent Overstreet <[email protected]>
---
scripts/kallsyms.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index f18e6dfc68c5..3d51639a595d 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -263,6 +263,11 @@ static int symbol_in_range(const struct sym_entry *s,
return 0;
}

+static bool string_starts_with(const char *s, const char *prefix)
+{
+ return strncmp(s, prefix, strlen(prefix)) == 0;
+}
+
static int symbol_valid(const struct sym_entry *s)
{
const char *name = sym_name(s);
@@ -270,6 +275,14 @@ static int symbol_valid(const struct sym_entry *s)
/* if --all-symbols is not specified, then symbols outside the text
* and inittext sections are discarded */
if (!all_symbols) {
+ /*
+ * Symbols starting with __start and __stop are used to denote
+ * section boundaries, and should always be included:
+ */
+ if (string_starts_with(name, "__start_") ||
+ string_starts_with(name, "__stop_"))
+ return 1;
+
if (symbol_in_range(s, text_ranges,
ARRAY_SIZE(text_ranges)) == 0)
return 0;
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:45:49

by Suren Baghdasaryan

Subject: [RFC PATCH 21/30] lib: implement context capture support for page and slab allocators

Implement mechanisms for capturing the allocation call context, which
consists of:
- allocation size
- pid, tgid and name of the allocating task
- allocation timestamp
- allocation call stack
The patch creates an alloc_tags.ctx file, which can be written to
enable or disable context capture for a specific code tag. Captured
context can be obtained by reading the alloc_tags.ctx file.
Usage example:

echo "file include/asm-generic/pgalloc.h line 63 enable" > \
/sys/kernel/debug/alloc_tags.ctx
cat alloc_tags.ctx
91.0MiB 212 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
size: 4096
pid: 1551
tgid: 1551
comm: cat
ts: 670109646361
call stack:
pte_alloc_one+0xfe/0x130
__pte_alloc+0x22/0x90
move_page_tables.part.0+0x994/0xa60
shift_arg_pages+0xa4/0x180
setup_arg_pages+0x286/0x2d0
load_elf_binary+0x4e1/0x18d0
bprm_execve+0x26b/0x660
do_execveat_common.isra.0+0x19d/0x220
__x64_sys_execve+0x2e/0x40
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

size: 4096
pid: 1551
tgid: 1551
comm: cat
ts: 670109711801
call stack:
pte_alloc_one+0xfe/0x130
__do_fault+0x52/0xc0
__handle_mm_fault+0x7d9/0xdd0
handle_mm_fault+0xc0/0x2b0
do_user_addr_fault+0x1c3/0x660
exc_page_fault+0x62/0x150
asm_exc_page_fault+0x22/0x30
...

echo "file include/asm-generic/pgalloc.h line 63 disable" > \
/sys/kernel/debug/alloc_tags.ctx

Note that disabling context capture does not clear already-captured
context; it only stops new context from being captured.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/alloc_tag.h | 28 ++++-
include/linux/codetag.h | 3 +-
lib/Kconfig.debug | 1 +
lib/alloc_tag.c | 239 +++++++++++++++++++++++++++++++++++++-
lib/codetag.c | 20 ++--
5 files changed, 273 insertions(+), 18 deletions(-)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index b3f589afb1c9..66638cbf349a 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -16,27 +16,41 @@
* an array of these. Embedded codetag utilizes codetag framework.
*/
struct alloc_tag {
- struct codetag ct;
+ struct codetag_with_ctx ctc;
unsigned long last_wrap;
struct raw_lazy_percpu_counter call_count;
struct raw_lazy_percpu_counter bytes_allocated;
} __aligned(8);

+static inline struct alloc_tag *ctc_to_alloc_tag(struct codetag_with_ctx *ctc)
+{
+ return container_of(ctc, struct alloc_tag, ctc);
+}
+
static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
{
- return container_of(ct, struct alloc_tag, ct);
+ return container_of(ct_to_ctc(ct), struct alloc_tag, ctc);
}

+struct codetag_ctx *alloc_tag_create_ctx(struct alloc_tag *tag, size_t size);
+void alloc_tag_free_ctx(struct codetag_ctx *ctx, struct alloc_tag **ptag);
+bool alloc_tag_enable_ctx(struct alloc_tag *tag, bool enable);
+
#define DEFINE_ALLOC_TAG(_alloc_tag) \
static struct alloc_tag _alloc_tag __used __aligned(8) \
- __section("alloc_tags") = { .ct = CODE_TAG_INIT }
+ __section("alloc_tags") = { .ctc.ct = CODE_TAG_INIT }

#define alloc_tag_counter_read(counter) \
__lazy_percpu_counter_read(counter)

static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes)
{
- struct alloc_tag *tag = ct_to_alloc_tag(ref->ct);
+ struct alloc_tag *tag;
+
+ if (is_codetag_ctx_ref(ref))
+ alloc_tag_free_ctx(ref->ctx, &tag);
+ else
+ tag = ct_to_alloc_tag(ref->ct);

__lazy_percpu_counter_add(&tag->call_count, &tag->last_wrap, -1);
__lazy_percpu_counter_add(&tag->bytes_allocated, &tag->last_wrap, -bytes);
@@ -51,7 +65,11 @@ do { \

static inline void __alloc_tag_add(struct alloc_tag *tag, union codetag_ref *ref, size_t bytes)
{
- ref->ct = &tag->ct;
+ if (codetag_ctx_enabled(&tag->ctc))
+ ref->ctx = alloc_tag_create_ctx(tag, bytes);
+ else
+ ref->ct = &tag->ctc.ct;
+
__lazy_percpu_counter_add(&tag->call_count, &tag->last_wrap, 1);
__lazy_percpu_counter_add(&tag->bytes_allocated, &tag->last_wrap, bytes);
}
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 57736ec77b45..a10c5fcbdd20 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -104,7 +104,8 @@ struct codetag_with_ctx *ct_to_ctc(struct codetag *ct)
}

void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
-struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
+void codetag_init_iter(struct codetag_iterator *iter,
+ struct codetag_type *cttype);
struct codetag *codetag_next_ct(struct codetag_iterator *iter);
struct codetag_ctx *codetag_next_ctx(struct codetag_iterator *iter);

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 08c97a978906..2790848464f1 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -977,6 +977,7 @@ config ALLOC_TAGGING
bool
select CODE_TAGGING
select LAZY_PERCPU_COUNTER
+ select STACKDEPOT

config PAGE_ALLOC_TAGGING
bool "Enable page allocation tagging"
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 082fbde184ef..50d7bdc2a3c8 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -1,12 +1,75 @@
// SPDX-License-Identifier: GPL-2.0-only
#include <linux/alloc_tag.h>
+#include <linux/codetag_ctx.h>
#include <linux/debugfs.h>
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/sched/clock.h>
#include <linux/seq_buf.h>
+#include <linux/stackdepot.h>
#include <linux/uaccess.h>

+#define STACK_BUF_SIZE 1024
+
+struct alloc_call_ctx {
+ struct codetag_ctx ctx;
+ size_t size;
+ pid_t pid;
+ pid_t tgid;
+ char comm[TASK_COMM_LEN];
+ u64 ts_nsec;
+ depot_stack_handle_t stack_handle;
+} __aligned(8);
+
+static void alloc_tag_ops_free_ctx(struct kref *refcount)
+{
+ kfree(container_of(kref_to_ctx(refcount), struct alloc_call_ctx, ctx));
+}
+
+struct codetag_ctx *alloc_tag_create_ctx(struct alloc_tag *tag, size_t size)
+{
+ struct alloc_call_ctx *ac_ctx;
+
+ /* TODO: use a dedicated kmem_cache */
+ ac_ctx = kmalloc(sizeof(struct alloc_call_ctx), GFP_KERNEL);
+ if (WARN_ON(!ac_ctx))
+ return NULL;
+
+ ac_ctx->size = size;
+ ac_ctx->pid = current->pid;
+ ac_ctx->tgid = current->tgid;
+ strscpy(ac_ctx->comm, current->comm, sizeof(ac_ctx->comm));
+ ac_ctx->ts_nsec = local_clock();
+ ac_ctx->stack_handle =
+ stack_depot_capture_stack(GFP_NOWAIT | __GFP_NOWARN);
+ add_ctx(&ac_ctx->ctx, &tag->ctc);
+
+ return &ac_ctx->ctx;
+}
+EXPORT_SYMBOL_GPL(alloc_tag_create_ctx);
+
+void alloc_tag_free_ctx(struct codetag_ctx *ctx, struct alloc_tag **ptag)
+{
+ *ptag = ctc_to_alloc_tag(ctx->ctc);
+ rem_ctx(ctx, alloc_tag_ops_free_ctx);
+}
+EXPORT_SYMBOL_GPL(alloc_tag_free_ctx);
+
+bool alloc_tag_enable_ctx(struct alloc_tag *tag, bool enable)
+{
+ static bool stack_depot_ready;
+
+ if (enable && !stack_depot_ready) {
+ stack_depot_init();
+ stack_depot_capture_init();
+ stack_depot_ready = true;
+ }
+
+ return codetag_enable_ctx(&tag->ctc, enable);
+}
+
#ifdef CONFIG_DEBUG_FS

struct alloc_tag_file_iterator {
@@ -50,7 +113,7 @@ static int alloc_tag_file_open(struct inode *inode, struct file *file)
return -ENOMEM;

codetag_lock_module_list(cttype, true);
- iter->ct_iter = codetag_get_ct_iter(cttype);
+ codetag_init_iter(&iter->ct_iter, cttype);
codetag_lock_module_list(cttype, false);
seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
file->private_data = iter;
@@ -111,14 +174,182 @@ static const struct file_operations alloc_tag_file_ops = {
.read = alloc_tag_file_read,
};

+static void alloc_tag_ctx_to_text(struct seq_buf *out, struct codetag_ctx *ctx)
+{
+ struct alloc_call_ctx *ac_ctx;
+ char *buf;
+
+ ac_ctx = container_of(ctx, struct alloc_call_ctx, ctx);
+ seq_buf_printf(out, " size: %zu\n", ac_ctx->size);
+ seq_buf_printf(out, " pid: %d\n", ac_ctx->pid);
+ seq_buf_printf(out, " tgid: %d\n", ac_ctx->tgid);
+ seq_buf_printf(out, " comm: %s\n", ac_ctx->comm);
+ seq_buf_printf(out, " ts: %llu\n", ac_ctx->ts_nsec);
+
+ buf = kmalloc(STACK_BUF_SIZE, GFP_KERNEL);
+ if (buf) {
+ int bytes_read = stack_depot_snprint(ac_ctx->stack_handle, buf,
+ STACK_BUF_SIZE - 1, 8);
+ buf[bytes_read] = '\0';
+ seq_buf_printf(out, " call stack:\n%s\n", buf);
+ }
+ kfree(buf);
+}
+
+static ssize_t alloc_tag_ctx_file_read(struct file *file, char __user *ubuf,
+ size_t size, loff_t *ppos)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+ struct codetag_iterator *ct_iter = &iter->ct_iter;
+ struct user_buf buf = { .buf = ubuf, .size = size };
+ struct codetag_ctx *ctx;
+ struct codetag *prev_ct;
+ int err = 0;
+
+ codetag_lock_module_list(ct_iter->cttype, true);
+ while (1) {
+ err = flush_ubuf(&buf, &iter->buf);
+ if (err || !buf.size)
+ break;
+
+ prev_ct = ct_iter->ct;
+ ctx = codetag_next_ctx(ct_iter);
+ if (!ctx)
+ break;
+
+ if (prev_ct != &ctx->ctc->ct)
+ alloc_tag_to_text(&iter->buf, &ctx->ctc->ct);
+ alloc_tag_ctx_to_text(&iter->buf, ctx);
+ }
+ codetag_lock_module_list(ct_iter->cttype, false);
+
+ return err ? : buf.ret;
+}
+
+#define CTX_CAPTURE_TOKENS() \
+ x(disable, 0) \
+ x(enable, 0)
+
+static const char * const ctx_capture_token_strs[] = {
+#define x(name, nr_args) #name,
+ CTX_CAPTURE_TOKENS()
+#undef x
+ NULL
+};
+
+enum ctx_capture_token {
+#define x(name, nr_args) TOK_##name,
+ CTX_CAPTURE_TOKENS()
+#undef x
+};
+
+static int enable_ctx_capture(struct codetag_type *cttype,
+ struct codetag_query *query, bool enable)
+{
+ struct codetag_iterator ct_iter;
+ struct codetag_with_ctx *ctc;
+ struct codetag *ct;
+ unsigned int nfound = 0;
+
+ codetag_lock_module_list(cttype, true);
+
+ codetag_init_iter(&ct_iter, cttype);
+ while ((ct = codetag_next_ct(&ct_iter))) {
+ if (!codetag_matches_query(query, ct, ct_iter.cmod, NULL))
+ continue;
+
+ ctc = ct_to_ctc(ct);
+ if (codetag_ctx_enabled(ctc) == enable)
+ continue;
+
+ if (!alloc_tag_enable_ctx(ctc_to_alloc_tag(ctc), enable)) {
+ pr_warn("Failed to toggle context capture\n");
+ continue;
+ }
+
+ nfound++;
+ }
+
+ codetag_lock_module_list(cttype, false);
+
+ return nfound ? 0 : -ENOENT;
+}
+
+static int parse_command(struct codetag_type *cttype, char *buf)
+{
+ struct codetag_query query = { NULL };
+ char *cmd;
+ int ret;
+ int tok;
+
+ buf = codetag_query_parse(&query, buf);
+ if (IS_ERR(buf))
+ return PTR_ERR(buf);
+
+ cmd = strsep_no_empty(&buf, " \t\r\n");
+ if (!cmd)
+ return -EINVAL; /* no command */
+
+ tok = match_string(ctx_capture_token_strs,
+ ARRAY_SIZE(ctx_capture_token_strs), cmd);
+ if (tok < 0)
+ return -EINVAL; /* unknown command */
+
+ ret = enable_ctx_capture(cttype, &query, tok == TOK_enable);
+ if (ret < 0)
+ return ret;
+
+ return 0;
+}
+
+static ssize_t alloc_tag_ctx_file_write(struct file *file, const char __user *ubuf,
+ size_t len, loff_t *offp)
+{
+ struct alloc_tag_file_iterator *iter = file->private_data;
+ char tmpbuf[256];
+
+ if (len == 0)
+ return 0;
+ /* we don't check *offp -- multiple writes() are allowed */
+ if (len > sizeof(tmpbuf) - 1)
+ return -E2BIG;
+
+ if (copy_from_user(tmpbuf, ubuf, len))
+ return -EFAULT;
+
+ tmpbuf[len] = '\0';
+ parse_command(iter->ct_iter.cttype, tmpbuf);
+
+ *offp += len;
+ return len;
+}
+
+static const struct file_operations alloc_tag_ctx_file_ops = {
+ .owner = THIS_MODULE,
+ .open = alloc_tag_file_open,
+ .release = alloc_tag_file_release,
+ .read = alloc_tag_ctx_file_read,
+ .write = alloc_tag_ctx_file_write,
+};
+
static int dbgfs_init(struct codetag_type *cttype)
{
struct dentry *file;
+ struct dentry *ctx_file;

file = debugfs_create_file("alloc_tags", 0444, NULL, cttype,
&alloc_tag_file_ops);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ ctx_file = debugfs_create_file("alloc_tags.ctx", 0666, NULL, cttype,
+ &alloc_tag_ctx_file_ops);
+ if (IS_ERR(ctx_file)) {
+ debugfs_remove(file);
+ return PTR_ERR(ctx_file);
+ }

- return IS_ERR(file) ? PTR_ERR(file) : 0;
+ return 0;
}

#else /* CONFIG_DEBUG_FS */
@@ -129,9 +360,10 @@ static int dbgfs_init(struct codetag_type *) { return 0; }

static void alloc_tag_module_unload(struct codetag_type *cttype, struct codetag_module *cmod)
{
- struct codetag_iterator iter = codetag_get_ct_iter(cttype);
+ struct codetag_iterator iter;
struct codetag *ct;

+ codetag_init_iter(&iter, cttype);
for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
struct alloc_tag *tag = ct_to_alloc_tag(ct);

@@ -147,6 +379,7 @@ static int __init alloc_tag_init(void)
.section = "alloc_tags",
.tag_size = sizeof(struct alloc_tag),
.module_unload = alloc_tag_module_unload,
+ .free_ctx = alloc_tag_ops_free_ctx,
};

cttype = codetag_register_type(&desc);
diff --git a/lib/codetag.c b/lib/codetag.c
index 2762fda5c016..a936d2988c96 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -26,16 +26,14 @@ void codetag_lock_module_list(struct codetag_type *cttype, bool lock)
up_read(&cttype->mod_lock);
}

-struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
+void codetag_init_iter(struct codetag_iterator *iter,
+ struct codetag_type *cttype)
{
- struct codetag_iterator iter = {
- .cttype = cttype,
- .cmod = NULL,
- .mod_id = 0,
- .ct = NULL,
- };
-
- return iter;
+ iter->cttype = cttype;
+ iter->cmod = NULL;
+ iter->mod_id = 0;
+ iter->ct = NULL;
+ iter->ctx = NULL;
}

static inline struct codetag *get_first_module_ct(struct codetag_module *cmod)
@@ -127,6 +125,10 @@ struct codetag_ctx *codetag_next_ctx(struct codetag_iterator *iter)

lockdep_assert_held(&iter->cttype->mod_lock);

+ /* Move to the first codetag if search just started */
+ if (!iter->ct)
+ codetag_next_ct(iter);
+
if (!ctx)
return next_ctx_from_ct(iter);

--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:46:42

by Suren Baghdasaryan

Subject: [RFC PATCH 09/30] change alloc_pages name in dma_map_ops to avoid name conflicts

After alloc_pages is redefined as a macro, every use of that name is
rewritten by the preprocessor. Rename the conflicting dma_map_ops member
to prevent the preprocessor from replacing it where that is not intended.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/x86/kernel/amd_gart_64.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/xen/grant-dma-ops.c | 2 +-
drivers/xen/swiotlb-xen.c | 2 +-
include/linux/dma-map-ops.h | 2 +-
kernel/dma/mapping.c | 4 ++--
6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 194d54eed537..5e83a387bfef 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
.get_sgtable = dma_common_get_sgtable,
.dma_supported = dma_direct_supported,
.get_required_mask = dma_direct_get_required_mask,
- .alloc_pages = dma_direct_alloc_pages,
+ .alloc_pages_op = dma_direct_alloc_pages,
.free_pages = dma_direct_free_pages,
};

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 17dd683b2fce..58b4878ef930 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1547,7 +1547,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
.free = iommu_dma_free,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
.alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
.free_noncontiguous = iommu_dma_free_noncontiguous,
diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
index 8973fc1e9ccc..0e26d066036e 100644
--- a/drivers/xen/grant-dma-ops.c
+++ b/drivers/xen/grant-dma-ops.c
@@ -262,7 +262,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
static const struct dma_map_ops xen_grant_dma_ops = {
.alloc = xen_grant_dma_alloc,
.free = xen_grant_dma_free,
- .alloc_pages = xen_grant_dma_alloc_pages,
+ .alloc_pages_op = xen_grant_dma_alloc_pages,
.free_pages = xen_grant_dma_free_pages,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 67aa74d20162..5ab2616153f0 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -403,6 +403,6 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
.dma_supported = xen_swiotlb_dma_supported,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index d678afeb8a13..e8e2d210ba68 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -27,7 +27,7 @@ struct dma_map_ops {
unsigned long attrs);
void (*free)(struct device *dev, size_t size, void *vaddr,
dma_addr_t dma_handle, unsigned long attrs);
- struct page *(*alloc_pages)(struct device *dev, size_t size,
+ struct page *(*alloc_pages_op)(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir,
gfp_t gfp);
void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 49cbf3e33de7..80a2bfeed8d0 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -552,9 +552,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
size = PAGE_ALIGN(size);
if (dma_alloc_direct(dev, ops))
return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
- if (!ops->alloc_pages)
+ if (!ops->alloc_pages_op)
return NULL;
- return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
+ return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
}

struct page *dma_alloc_pages(struct device *dev, size_t size,
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:46:42

by Suren Baghdasaryan

Subject: [RFC PATCH 27/30] Code tagging based latency tracking

From: Kent Overstreet <[email protected]>

This adds the ability to easily instrument code for measuring latency.
To use, add the following calls to your code at the start and end of the
event you wish to measure:

codetag_time_stats_start(start_time);
codetag_time_stats_finish(start_time);

Statistics will then show up in debugfs under
/sys/kernel/debug/time_stats, listed by file and line number.

Statistics measured include weighted averages of frequency and duration,
maximum duration, and quantiles.

This patch also instruments all calls to init_wait and finish_wait,
which includes all calls to wait_event. Example debugfs output:

fs/xfs/xfs_trans_ail.c:746 module:xfs func:xfs_ail_push_all_sync
count: 17
rate: 0/sec
frequency: 2 sec
avg duration: 10 us
max duration: 232 us
quantiles (ns): 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128

lib/sbitmap.c:813 module:sbitmap func:sbitmap_finish_wait
count: 3
rate: 0/sec
frequency: 4 sec
avg duration: 4 sec
max duration: 4 sec
quantiles (ns): 0 4288669120 4288669120 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048

net/core/datagram.c:122 module:datagram func:__skb_wait_for_more_packets
count: 10
rate: 1/sec
frequency: 859 ms
avg duration: 472 ms
max duration: 30 sec
quantiles (ns): 0 12279 12279 15669 15669 15669 15669 17217 17217 17217 17217 17217 17217 17217 17217

Signed-off-by: Kent Overstreet <[email protected]>
---
include/asm-generic/codetag.lds.h | 3 +-
include/linux/codetag_time_stats.h | 54 +++++++++++
include/linux/io_uring_types.h | 2 +-
include/linux/wait.h | 22 ++++-
kernel/sched/wait.c | 6 +-
lib/Kconfig.debug | 8 ++
lib/Makefile | 1 +
lib/codetag_time_stats.c | 143 +++++++++++++++++++++++++++++
8 files changed, 233 insertions(+), 6 deletions(-)
create mode 100644 include/linux/codetag_time_stats.h
create mode 100644 lib/codetag_time_stats.c

diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
index 16fbf74edc3d..d799f4aced82 100644
--- a/include/asm-generic/codetag.lds.h
+++ b/include/asm-generic/codetag.lds.h
@@ -10,6 +10,7 @@

#define CODETAG_SECTIONS() \
SECTION_WITH_BOUNDARIES(alloc_tags) \
- SECTION_WITH_BOUNDARIES(dynamic_fault_tags)
+ SECTION_WITH_BOUNDARIES(dynamic_fault_tags) \
+ SECTION_WITH_BOUNDARIES(time_stats_tags)

#endif /* __ASM_GENERIC_CODETAG_LDS_H */
diff --git a/include/linux/codetag_time_stats.h b/include/linux/codetag_time_stats.h
new file mode 100644
index 000000000000..7e44c7ee9e9b
--- /dev/null
+++ b/include/linux/codetag_time_stats.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CODETAG_TIMESTATS_H
+#define _LINUX_CODETAG_TIMESTATS_H
+
+/*
+ * Code tagging based latency tracking:
+ * (C) 2022 Kent Overstreet
+ *
+ * This allows you to easily instrument code to track latency, and have the
+ * results show up in debugfs. To use, add the following two calls to your code
+ * at the beginning and end of the event you wish to instrument:
+ *
+ * code_tag_time_stats_start(start_time);
+ * code_tag_time_stats_finish(start_time);
+ *
+ * Statistics will then show up in debugfs under /sys/kernel/debug/time_stats,
+ * listed by file and line number.
+ */
+
+#ifdef CONFIG_CODETAG_TIME_STATS
+
+#include <linux/codetag.h>
+#include <linux/time_stats.h>
+#include <linux/timekeeping.h>
+
+struct codetag_time_stats {
+ struct codetag tag;
+ struct time_stats stats;
+};
+
+#define codetag_time_stats_start(_start_time) u64 _start_time = ktime_get_ns()
+
+#define codetag_time_stats_finish(_start_time) \
+do { \
+ static struct codetag_time_stats \
+ __used \
+ __section("time_stats_tags") \
+ __aligned(8) s = { \
+ .tag = CODE_TAG_INIT, \
+ .stats.lock = __SPIN_LOCK_UNLOCKED(_lock) \
+ }; \
+ \
+ WARN_ONCE(!(_start_time), "codetag_time_stats_start() not called");\
+ time_stats_update(&s.stats, _start_time); \
+} while (0)
+
+#else
+
+#define codetag_time_stats_finish(_start_time) do {} while (0)
+#define codetag_time_stats_start(_start_time) do {} while (0)
+
+#endif /* CODETAG_CODETAG_TIME_STATS */
+
+#endif
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 677a25d44d7f..3bcef85eacd8 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -488,7 +488,7 @@ struct io_cqe {
struct io_cmd_data {
struct file *file;
/* each command gets 56 bytes of data */
- __u8 data[56];
+ __u8 data[64];
};

static inline void io_kiocb_cmd_sz_check(size_t cmd_sz)
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 91ced6a118bc..bab11b7ef19a 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -4,6 +4,7 @@
/*
* Linux wait queue related types and methods
*/
+#include <linux/codetag_time_stats.h>
#include <linux/list.h>
#include <linux/stddef.h>
#include <linux/spinlock.h>
@@ -32,6 +33,9 @@ struct wait_queue_entry {
void *private;
wait_queue_func_t func;
struct list_head entry;
+#ifdef CONFIG_CODETAG_TIME_STATS
+ u64 start_time;
+#endif
};

struct wait_queue_head {
@@ -79,10 +83,17 @@ extern void __init_waitqueue_head(struct wait_queue_head *wq_head, const char *n
# define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) DECLARE_WAIT_QUEUE_HEAD(name)
#endif

+#ifdef CONFIG_CODETAG_TIME_STATS
+#define WAIT_QUEUE_ENTRY_START_TIME_INITIALIZER .start_time = ktime_get_ns(),
+#else
+#define WAIT_QUEUE_ENTRY_START_TIME_INITIALIZER
+#endif
+
#define WAIT_FUNC_INITIALIZER(name, function) { \
.private = current, \
.func = function, \
.entry = LIST_HEAD_INIT((name).entry), \
+ WAIT_QUEUE_ENTRY_START_TIME_INITIALIZER \
}

#define DEFINE_WAIT_FUNC(name, function) \
@@ -98,6 +109,9 @@ __init_waitqueue_entry(struct wait_queue_entry *wq_entry, unsigned int flags,
wq_entry->private = private;
wq_entry->func = func;
INIT_LIST_HEAD(&wq_entry->entry);
+#ifdef CONFIG_CODETAG_TIME_STATS
+ wq_entry->start_time = ktime_get_ns();
+#endif
}

#define init_waitqueue_func_entry(_wq_entry, _func) \
@@ -1180,11 +1194,17 @@ do { \
void prepare_to_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state);
bool prepare_to_wait_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state);
long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state);
-void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+void __finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout);
int woken_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key);
int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key);

+#define finish_wait(_wq_head, _wq_entry) \
+do { \
+ codetag_time_stats_finish((_wq_entry)->start_time); \
+ __finish_wait(_wq_head, _wq_entry); \
+} while (0)
+
typedef int (*task_call_f)(struct task_struct *p, void *arg);
extern int task_call_func(struct task_struct *p, task_call_f func, void *arg);

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index b9922346077d..e88de3f0c3ad 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -367,7 +367,7 @@ int do_wait_intr_irq(wait_queue_head_t *wq, wait_queue_entry_t *wait)
EXPORT_SYMBOL(do_wait_intr_irq);

/**
- * finish_wait - clean up after waiting in a queue
+ * __finish_wait - clean up after waiting in a queue
* @wq_head: waitqueue waited on
* @wq_entry: wait descriptor
*
@@ -375,7 +375,7 @@ EXPORT_SYMBOL(do_wait_intr_irq);
* the wait descriptor from the given waitqueue if still
* queued.
*/
-void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+void __finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
{
unsigned long flags;

@@ -399,7 +399,7 @@ void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_en
spin_unlock_irqrestore(&wq_head->lock, flags);
}
}
-EXPORT_SYMBOL(finish_wait);
+EXPORT_SYMBOL(__finish_wait);

int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key)
{
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index b7d03afbc808..b0f86643b8f0 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1728,6 +1728,14 @@ config LATENCYTOP
Enable this option if you want to use the LatencyTOP tool
to find out which userspace is blocking on what kernel operations.

+config CODETAG_TIME_STATS
+ bool "Code tagging based latency measuring"
+ depends on DEBUG_FS
+ select TIME_STATS
+ select CODE_TAGGING
+ help
+ Enabling this option makes latency statistics available in debugfs
+
source "kernel/trace/Kconfig"

config PROVIDE_OHCI1394_DMA_INIT
diff --git a/lib/Makefile b/lib/Makefile
index e54392011f5e..d4067973805b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -233,6 +233,7 @@ obj-$(CONFIG_PAGE_ALLOC_TAGGING) += pgalloc_tag.o

obj-$(CONFIG_CODETAG_FAULT_INJECTION) += dynamic_fault.o
obj-$(CONFIG_TIME_STATS) += time_stats.o
+obj-$(CONFIG_CODETAG_TIME_STATS) += codetag_time_stats.o

lib-$(CONFIG_GENERIC_BUG) += bug.o

diff --git a/lib/codetag_time_stats.c b/lib/codetag_time_stats.c
new file mode 100644
index 000000000000..b0e9a08308a2
--- /dev/null
+++ b/lib/codetag_time_stats.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/codetag_time_stats.h>
+#include <linux/ctype.h>
+#include <linux/debugfs.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/seq_buf.h>
+
+static struct codetag_type *cttype;
+
+struct user_buf {
+ char __user *buf; /* destination user buffer */
+ size_t size; /* size of requested read */
+ ssize_t ret; /* bytes read so far */
+};
+
+static int flush_ubuf(struct user_buf *dst, struct seq_buf *src)
+{
+ if (src->len) {
+ size_t bytes = min_t(size_t, src->len, dst->size);
+ int err = copy_to_user(dst->buf, src->buffer, bytes);
+
+ if (err)
+ return err;
+
+ dst->ret += bytes;
+ dst->buf += bytes;
+ dst->size -= bytes;
+ src->len -= bytes;
+ memmove(src->buffer, src->buffer + bytes, src->len);
+ }
+
+ return 0;
+}
+
+struct time_stats_iter {
+ struct codetag_iterator ct_iter;
+ struct seq_buf buf;
+ char rawbuf[4096];
+ bool first;
+};
+
+static int time_stats_open(struct inode *inode, struct file *file)
+{
+ struct time_stats_iter *iter;
+
+ pr_debug("called");
+
+ iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter)
+ return -ENOMEM;
+
+ codetag_lock_module_list(cttype, true);
+ codetag_init_iter(&iter->ct_iter, cttype);
+ codetag_lock_module_list(cttype, false);
+
+ file->private_data = iter;
+ seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
+ iter->first = true;
+ return 0;
+}
+
+static int time_stats_release(struct inode *inode, struct file *file)
+{
+ struct time_stats_iter *i = file->private_data;
+
+ kfree(i);
+ return 0;
+}
+
+static ssize_t time_stats_read(struct file *file, char __user *ubuf,
+ size_t size, loff_t *ppos)
+{
+ struct time_stats_iter *iter = file->private_data;
+ struct user_buf buf = { .buf = ubuf, .size = size };
+ struct codetag_time_stats *s;
+ struct codetag *ct;
+ int err;
+
+ codetag_lock_module_list(iter->ct_iter.cttype, true);
+ while (1) {
+ err = flush_ubuf(&buf, &iter->buf);
+ if (err || !buf.size)
+ break;
+
+ ct = codetag_next_ct(&iter->ct_iter);
+ if (!ct)
+ break;
+
+ s = container_of(ct, struct codetag_time_stats, tag);
+ if (s->stats.count) {
+ if (!iter->first)
+ seq_buf_putc(&iter->buf, '\n');
+
+ iter->first = false;
+
+ codetag_to_text(&iter->buf, &s->tag);
+ seq_buf_putc(&iter->buf, '\n');
+ time_stats_to_text(&iter->buf, &s->stats);
+ }
+ }
+ codetag_lock_module_list(iter->ct_iter.cttype, false);
+
+ return err ?: buf.ret;
+}
+
+static const struct file_operations time_stats_ops = {
+ .owner = THIS_MODULE,
+ .open = time_stats_open,
+ .release = time_stats_release,
+ .read = time_stats_read,
+};
+
+static void time_stats_module_unload(struct codetag_type *cttype, struct codetag_module *mod)
+{
+ struct codetag_time_stats *i, *start = (void *) mod->range.start;
+ struct codetag_time_stats *end = (void *) mod->range.stop;
+
+ for (i = start; i != end; i++)
+ time_stats_exit(&i->stats);
+}
+
+static int __init codetag_time_stats_init(void)
+{
+ const struct codetag_type_desc desc = {
+ .section = "time_stats_tags",
+ .tag_size = sizeof(struct codetag_time_stats),
+ .module_unload = time_stats_module_unload,
+ };
+ struct dentry *debugfs_file;
+
+ cttype = codetag_register_type(&desc);
+ if (IS_ERR_OR_NULL(cttype))
+ return PTR_ERR(cttype);
+
+ debugfs_file = debugfs_create_file("time_stats", 0666, NULL, NULL, &time_stats_ops);
+ if (IS_ERR(debugfs_file))
+ return PTR_ERR(debugfs_file);
+
+ return 0;
+}
+module_init(codetag_time_stats_init);
--
2.37.2.672.g94769d06f0-goog

2022-08-30 22:47:55

by Suren Baghdasaryan

Subject: [RFC PATCH 15/30] lib: introduce slab allocation tagging

Introduce CONFIG_SLAB_ALLOC_TAGGING which provides helper functions
to easily instrument slab allocators and adds a codetag_ref field into
slabobj_ext to store a pointer to the allocation tag associated with
the code that allocated the slab object.

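(Editor's note: the accounting described above can be modeled in userspace as a minimal sketch. Names here are illustrative, not the kernel's, and a side table stands in for the slabobj_ext reference that lets a freed object find its allocation tag again.)

```c
#include <assert.h>
#include <stdlib.h>

/* One counter pair per allocation call site. */
struct alloc_tag {
	long calls;
	long bytes;
};

/* Toy stand-in for slabobj_ext: lets a freed object find its tag again. */
struct obj_ref { void *obj; struct alloc_tag *tag; size_t size; };
static struct obj_ref refs[64];

static void *tagged_alloc(struct alloc_tag *tag, size_t size)
{
	void *p = malloc(size);

	if (!p)
		return NULL;
	/* increment the call site's counters, as the instrumented allocator does */
	tag->calls++;
	tag->bytes += size;
	for (size_t i = 0; i < sizeof(refs) / sizeof(refs[0]); i++) {
		if (!refs[i].obj) {
			refs[i] = (struct obj_ref){ p, tag, size };
			break;
		}
	}
	return p;
}

static void tagged_free(void *p)
{
	for (size_t i = 0; i < sizeof(refs) / sizeof(refs[0]); i++) {
		if (refs[i].obj == p) {
			/* decrement the owning site's counters on free */
			refs[i].tag->calls--;
			refs[i].tag->bytes -= refs[i].size;
			refs[i].obj = NULL;
			break;
		}
	}
	free(p);
}
```

After `tagged_alloc(&tag, 32)` and a matching `tagged_free()`, the tag's counters return to zero, so a nonzero steady-state count at a site indicates a leak candidate.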
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/memcontrol.h | 5 +++++
include/linux/slab.h | 25 +++++++++++++++++++++++++
include/linux/slab_def.h | 2 +-
include/linux/slub_def.h | 4 ++--
lib/Kconfig.debug | 11 +++++++++++
mm/slab_common.c | 33 +++++++++++++++++++++++++++++++++
6 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 315399f77173..97c0153f0247 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -232,7 +232,12 @@ struct obj_cgroup {
* if MEMCG_DATA_OBJEXTS is set.
*/
struct slabobj_ext {
+#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *objcg;
+#endif
+#ifdef CONFIG_SLAB_ALLOC_TAGGING
+ union codetag_ref ref;
+#endif
} __aligned(8);

/*
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 55ae3ea864a4..5a198aa02a08 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -438,6 +438,31 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
#define kmalloc_index(s) __kmalloc_index(s, true)
#endif /* !CONFIG_SLOB */

+#ifdef CONFIG_SLAB_ALLOC_TAGGING
+
+#include <linux/alloc_tag.h>
+
+union codetag_ref *get_slab_tag_ref(const void *objp);
+
+#define slab_tag_add(_old, _new) \
+do { \
+ if (!ZERO_OR_NULL_PTR(_new) && _old != _new) \
+ alloc_tag_add(get_slab_tag_ref(_new), __ksize(_new)); \
+} while (0)
+
+static inline void slab_tag_dec(const void *ptr)
+{
+ if (!ZERO_OR_NULL_PTR(ptr))
+ alloc_tag_sub(get_slab_tag_ref(ptr), __ksize(ptr));
+}
+
+#else
+
+#define slab_tag_add(_old, _new) do {} while (0)
+static inline void slab_tag_dec(const void *ptr) {}
+
+#endif
+
void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_size(1);
void *kmem_cache_alloc(struct kmem_cache *s, gfp_t flags) __assume_slab_alignment __malloc;
void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index e24c9aff6fed..25feb5f7dc32 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -106,7 +106,7 @@ static inline void *nearest_obj(struct kmem_cache *cache, const struct slab *sla
* reciprocal_divide(offset, cache->reciprocal_buffer_size)
*/
static inline unsigned int obj_to_index(const struct kmem_cache *cache,
- const struct slab *slab, void *obj)
+ const struct slab *slab, const void *obj)
{
u32 offset = (obj - slab->s_mem);
return reciprocal_divide(offset, cache->reciprocal_buffer_size);
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index f9c68a9dac04..940c146768d4 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -170,14 +170,14 @@ static inline void *nearest_obj(struct kmem_cache *cache, const struct slab *sla

/* Determine object index from a given position */
static inline unsigned int __obj_to_index(const struct kmem_cache *cache,
- void *addr, void *obj)
+ void *addr, const void *obj)
{
return reciprocal_divide(kasan_reset_tag(obj) - addr,
cache->reciprocal_size);
}

static inline unsigned int obj_to_index(const struct kmem_cache *cache,
- const struct slab *slab, void *obj)
+ const struct slab *slab, const void *obj)
{
if (is_kfence_address(obj))
return 0;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 6686648843b3..08c97a978906 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -989,6 +989,17 @@ config PAGE_ALLOC_TAGGING
initiated at that code location. The mechanism can be used to track
memory leaks with a low performance impact.

+config SLAB_ALLOC_TAGGING
+ bool "Enable slab allocation tagging"
+ default n
+ select ALLOC_TAGGING
+ select SLAB_OBJ_EXT
+ help
+ Instrument slab allocators to track allocation source code and
+ collect statistics on the number of allocations and their total size
+ initiated at that code location. The mechanism can be used to track
+ memory leaks with a low performance impact.
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 17996649cfe3..272eda62ecaa 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -202,6 +202,39 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
return NULL;
}

+#ifdef CONFIG_SLAB_ALLOC_TAGGING
+
+union codetag_ref *get_slab_tag_ref(const void *objp)
+{
+ struct slabobj_ext *obj_exts;
+ union codetag_ref *res = NULL;
+ struct slab *slab;
+ unsigned int off;
+
+ slab = virt_to_slab(objp);
+ /*
+ * We could be given a kmalloc_large() object, skip those. They use
+ * alloc_pages and can be tracked by page allocation tracking.
+ */
+ if (!slab)
+ goto out;
+
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
+ goto out;
+
+ if (!slab->slab_cache)
+ goto out;
+
+ off = obj_to_index(slab->slab_cache, slab, objp);
+ res = &obj_exts[off].ref;
+out:
+ return res;
+}
+EXPORT_SYMBOL(get_slab_tag_ref);
+
+#endif /* CONFIG_SLAB_ALLOC_TAGGING */
+
static struct kmem_cache *create_cache(const char *name,
unsigned int object_size, unsigned int align,
slab_flags_t flags, unsigned int useroffset,
--
2.37.2.672.g94769d06f0-goog

2022-08-31 02:32:19

by Randy Dunlap

Subject: Re: [RFC PATCH 27/30] Code tagging based latency tracking



On 8/30/22 14:49, Suren Baghdasaryan wrote:
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index b7d03afbc808..b0f86643b8f0 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1728,6 +1728,14 @@ config LATENCYTOP
> Enable this option if you want to use the LatencyTOP tool
> to find out which userspace is blocking on what kernel operations.
>
> +config CODETAG_TIME_STATS
> + bool "Code tagging based latency measuring"
> + depends on DEBUG_FS
> + select TIME_STATS
> + select CODE_TAGGING
> + help
> + Enabling this option makes latency statistics available in debugfs

Missing period at the end of the sentence.

--
~Randy

2022-08-31 02:38:54

by Randy Dunlap

Subject: Re: [RFC PATCH 22/30] Code tagging based fault injection



On 8/30/22 14:49, Suren Baghdasaryan wrote:
> From: Kent Overstreet <[email protected]>
>
> This adds a new fault injection capability, based on code tagging.
>
> To use, simply insert somewhere in your code
>
> dynamic_fault("fault_class_name")
>
> and check whether it returns true - if so, inject the error.
> For example
>
> if (dynamic_fault("init"))
> return -EINVAL;
>
> There's no need to define faults elsewhere, as with
> include/linux/fault-injection.h. Faults show up in debugfs, under
> /sys/kernel/debug/dynamic_faults, and can be selected based on
> file/module/function/line number/class, and enabled permanently, or in
> oneshot mode, or with a specified frequency.
>
> Signed-off-by: Kent Overstreet <[email protected]>

Missing Signed-off-by: from Suren.
See Documentation/process/submitting-patches.rst:

When to use Acked-by:, Cc:, and Co-developed-by:
------------------------------------------------

The Signed-off-by: tag indicates that the signer was involved in the
development of the patch, or that he/she was in the patch's delivery path.


--
~Randy

2022-08-31 07:40:25

by Peter Zijlstra

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> ===========================
> Code tagging framework
> ===========================
> Code tag is a structure identifying a specific location in the source code
> which is generated at compile time and can be embedded in an application-
> specific structure. Several applications of code tagging are included in
> this RFC, such as memory allocation tracking, dynamic fault injection,
> latency tracking and improved error code reporting.
> Basically, it takes the old trick of "define a special elf section for
> objects of a given type so that we can iterate over them at runtime" and
> creates a proper library for it.

I might be super dense this morning, but what!? I've skimmed through the
set and I don't think I get it.

What does this provide that ftrace/kprobes don't already allow?

2022-08-31 08:51:31

by Kent Overstreet

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > ===========================
> > Code tagging framework
> > ===========================
> > Code tag is a structure identifying a specific location in the source code
> > which is generated at compile time and can be embedded in an application-
> > specific structure. Several applications of code tagging are included in
> > this RFC, such as memory allocation tracking, dynamic fault injection,
> > latency tracking and improved error code reporting.
> > Basically, it takes the old trick of "define a special elf section for
> > objects of a given type so that we can iterate over them at runtime" and
> > creates a proper library for it.
>
> I might be super dense this morning, but what!? I've skimmed through the
> set and I don't think I get it.
>
> What does this provide that ftrace/kprobes don't already allow?

You're kidding, right?

2022-08-31 10:23:29

by Mel Gorman

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > ===========================
> > > Code tagging framework
> > > ===========================
> > > Code tag is a structure identifying a specific location in the source code
> > > which is generated at compile time and can be embedded in an application-
> > > specific structure. Several applications of code tagging are included in
> > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > latency tracking and improved error code reporting.
> > > Basically, it takes the old trick of "define a special elf section for
> > > objects of a given type so that we can iterate over them at runtime" and
> > > creates a proper library for it.
> >
> > I might be super dense this morning, but what!? I've skimmed through the
> > set and I don't think I get it.
> >
> > What does this provide that ftrace/kprobes don't already allow?
>
> You're kidding, right?

It's a valid question. From the description, its main addition that would
be hard to do with ftrace or probes is catching where an error code is
returned. A secondary addition would be catching all historical state and
not just state since the tracing started.

It's also unclear *who* would enable this. It looks like it would mostly
have value during the development stage of an embedded platform to track
kernel memory usage on a per-application basis in an environment where it
may be difficult to set up tracing and tracking. Would it ever be enabled
in production? Would a distribution ever enable this? If it's enabled, the
overhead cannot be disabled/enabled at run or boot time, so anyone enabling
this would carry the cost without necessarily ever consuming the data.

It might be an ease-of-use thing. Gathering the information from traces
is tricky and would need to combine multiple different elements; that is
development effort but not impossible.

Whatever the case, asking for an explanation as to why equivalent
functionality cannot be created from ftrace/kprobe/eBPF/whatever is
reasonable.
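(Editor's note: the "special elf section" trick the cover letter refers to can be sketched in userspace C. This assumes GCC/Clang on an ELF target, where the linker synthesizes `__start_<section>`/`__stop_<section>` boundary symbols; all names are illustrative, not the kernel's.)

```c
#include <assert.h>
#include <stddef.h>

struct tag {
	const char *file;
	int line;
};

/* Drop a tag object into the "tags" section at the point of use. */
#define DEFINE_TAG() \
	static struct tag __attribute__((used, section("tags"), aligned(8))) \
	tag_here = { __FILE__, __LINE__ }

/* Boundary symbols the linker generates for the "tags" section. */
extern struct tag __start_tags[], __stop_tags[];

static int use_site_a(void) { DEFINE_TAG(); return 1; }
static int use_site_b(void) { DEFINE_TAG(); return 2; }

/* Iterate every tag in the final image, with no registration calls. */
static size_t count_tags(void)
{
	size_t n = 0;

	(void)(use_site_a() + use_site_b()); /* keep the sites linked in */
	for (struct tag *t = __start_tags; t != __stop_tags; t++)
		n++;
	return n;
}
```

Each use site contributes one statically allocated object, so iteration cost is proportional to the number of sites and nothing runs at the sites themselves until the data is consumed.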

--
Mel Gorman
SUSE Labs

2022-08-31 10:45:21

by Dmitry Vyukov

Subject: Re: [RFC PATCH 22/30] Code tagging based fault injection

On Tue, 30 Aug 2022 at 23:50, Suren Baghdasaryan <[email protected]> wrote:
>
> From: Kent Overstreet <[email protected]>
>
> This adds a new fault injection capability, based on code tagging.
>
> To use, simply insert somewhere in your code
>
> dynamic_fault("fault_class_name")
>
> and check whether it returns true - if so, inject the error.
> For example
>
> if (dynamic_fault("init"))
> return -EINVAL;

Hi Suren,

If this is going to be used by mainline kernel, it would be good to
integrate this with fail_nth systematic fault injection:
https://elixir.bootlin.com/linux/latest/source/lib/fault-inject.c#L109

Otherwise these dynamic sites won't be tested by testing systems doing
systematic fault injection testing.

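(Editor's note: the enable/oneshot/frequency state machine in `__dynamic_fault_enabled()`, quoted later in this patch, can be sketched in portable C11. This is a hedged model with illustrative names; the kernel packs the same state into `union dfault_state` and updates it with `cmpxchg()`.)

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { DF_DISABLED, DF_ENABLED, DF_ONESHOT };

/* Low 2 bits of 'state' hold the enabled mode, the rest hold the count. */
struct dfault_sketch {
	_Atomic unsigned int state;
	unsigned int frequency;
};

static bool dfault_fire(struct dfault_sketch *df)
{
	unsigned int old = atomic_load(&df->state);
	bool ret;

	for (;;) {
		unsigned int enabled = old & 3;
		unsigned int count = old >> 2;

		if (enabled == DF_DISABLED)
			return false;

		/* fire every Nth call when a frequency is set, else always */
		ret = df->frequency ? ++count >= df->frequency : true;
		if (ret)
			count = 0;
		if (ret && enabled == DF_ONESHOT)
			enabled = DF_DISABLED;

		/* lock-free update; a failed CAS reloads 'old' and retries */
		if (atomic_compare_exchange_weak(&df->state, &old,
						 enabled | (count << 2)))
			return ret;
	}
}
```

With `frequency = 3` the site fires on every third call; in oneshot mode it fires once and then disables itself, which is the behavior a fail_nth-style harness would need to drive deterministically.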

> There's no need to define faults elsewhere, as with
> include/linux/fault-injection.h. Faults show up in debugfs, under
> /sys/kernel/debug/dynamic_faults, and can be selected based on
> file/module/function/line number/class, and enabled permanently, or in
> oneshot mode, or with a specified frequency.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> include/asm-generic/codetag.lds.h | 3 +-
> include/linux/dynamic_fault.h | 79 +++++++
> include/linux/slab.h | 3 +-
> lib/Kconfig.debug | 6 +
> lib/Makefile | 2 +
> lib/dynamic_fault.c | 372 ++++++++++++++++++++++++++++++
> 6 files changed, 463 insertions(+), 2 deletions(-)
> create mode 100644 include/linux/dynamic_fault.h
> create mode 100644 lib/dynamic_fault.c
>
> diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
> index 64f536b80380..16fbf74edc3d 100644
> --- a/include/asm-generic/codetag.lds.h
> +++ b/include/asm-generic/codetag.lds.h
> @@ -9,6 +9,7 @@
> __stop_##_name = .;
>
> #define CODETAG_SECTIONS() \
> - SECTION_WITH_BOUNDARIES(alloc_tags)
> + SECTION_WITH_BOUNDARIES(alloc_tags) \
> + SECTION_WITH_BOUNDARIES(dynamic_fault_tags)
>
> #endif /* __ASM_GENERIC_CODETAG_LDS_H */
> diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
> new file mode 100644
> index 000000000000..526a33209e94
> --- /dev/null
> +++ b/include/linux/dynamic_fault.h
> @@ -0,0 +1,79 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _LINUX_DYNAMIC_FAULT_H
> +#define _LINUX_DYNAMIC_FAULT_H
> +
> +/*
> + * Dynamic/code tagging fault injection:
> + *
> + * Originally based on the dynamic debug trick of putting types in a special elf
> + * section, then rewritten using code tagging:
> + *
> + * To use, simply insert a call to dynamic_fault("fault_class"), which will
> + * return true if an error should be injected.
> + *
> + * Fault injection sites may be listed and enabled via debugfs, under
> + * /sys/kernel/debug/dynamic_faults.
> + */
> +
> +#ifdef CONFIG_CODETAG_FAULT_INJECTION
> +
> +#include <linux/codetag.h>
> +#include <linux/jump_label.h>
> +
> +#define DFAULT_STATES() \
> + x(disabled) \
> + x(enabled) \
> + x(oneshot)
> +
> +enum dfault_enabled {
> +#define x(n) DFAULT_##n,
> + DFAULT_STATES()
> +#undef x
> +};
> +
> +union dfault_state {
> + struct {
> + unsigned int enabled:2;
> + unsigned int count:30;
> + };
> +
> + struct {
> + unsigned int v;
> + };
> +};
> +
> +struct dfault {
> + struct codetag tag;
> + const char *class;
> + unsigned int frequency;
> + union dfault_state state;
> + struct static_key_false enabled;
> +};
> +
> +bool __dynamic_fault_enabled(struct dfault *df);
> +
> +#define dynamic_fault(_class) \
> +({ \
> + static struct dfault \
> + __used \
> + __section("dynamic_fault_tags") \
> + __aligned(8) df = { \
> + .tag = CODE_TAG_INIT, \
> + .class = _class, \
> + .enabled = STATIC_KEY_FALSE_INIT, \
> + }; \
> + \
> + static_key_false(&df.enabled.key) && \
> + __dynamic_fault_enabled(&df); \
> +})
> +
> +#else
> +
> +#define dynamic_fault(_class) false
> +
> +#endif /* CODETAG_FAULT_INJECTION */
> +
> +#define memory_fault() dynamic_fault("memory")
> +
> +#endif /* _LINUX_DYNAMIC_FAULT_H */
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 89273be35743..4be5a93ed15a 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -17,6 +17,7 @@
> #include <linux/types.h>
> #include <linux/workqueue.h>
> #include <linux/percpu-refcount.h>
> +#include <linux/dynamic_fault.h>
>
>
> /*
> @@ -468,7 +469,7 @@ static inline void slab_tag_dec(const void *ptr) {}
>
> #define krealloc_hooks(_p, _do_alloc) \
> ({ \
> - void *_res = _do_alloc; \
> + void *_res = !memory_fault() ? _do_alloc : NULL; \
> slab_tag_add(_p, _res); \
> _res; \
> })
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 2790848464f1..b7d03afbc808 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1982,6 +1982,12 @@ config FAULT_INJECTION_STACKTRACE_FILTER
> help
> Provide stacktrace filter for fault-injection capabilities
>
> +config CODETAG_FAULT_INJECTION
> + bool "Code tagging based fault injection"
> + select CODE_TAGGING
> + help
> + Dynamic fault injection based on code tagging
> +
> config ARCH_HAS_KCOV
> bool
> help
> diff --git a/lib/Makefile b/lib/Makefile
> index 99f732156673..489ea000c528 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -231,6 +231,8 @@ obj-$(CONFIG_CODE_TAGGING) += codetag.o
> obj-$(CONFIG_ALLOC_TAGGING) += alloc_tag.o
> obj-$(CONFIG_PAGE_ALLOC_TAGGING) += pgalloc_tag.o
>
> +obj-$(CONFIG_CODETAG_FAULT_INJECTION) += dynamic_fault.o
> +
> lib-$(CONFIG_GENERIC_BUG) += bug.o
>
> obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
> diff --git a/lib/dynamic_fault.c b/lib/dynamic_fault.c
> new file mode 100644
> index 000000000000..4c9cd18686be
> --- /dev/null
> +++ b/lib/dynamic_fault.c
> @@ -0,0 +1,372 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/ctype.h>
> +#include <linux/debugfs.h>
> +#include <linux/dynamic_fault.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/seq_buf.h>
> +
> +static struct codetag_type *cttype;
> +
> +bool __dynamic_fault_enabled(struct dfault *df)
> +{
> + union dfault_state old, new;
> + unsigned int v = df->state.v;
> + bool ret;
> +
> + do {
> + old.v = new.v = v;
> +
> + if (new.enabled == DFAULT_disabled)
> + return false;
> +
> + ret = df->frequency
> + ? ++new.count >= df->frequency
> + : true;
> + if (ret)
> + new.count = 0;
> + if (ret && new.enabled == DFAULT_oneshot)
> + new.enabled = DFAULT_disabled;
> + } while ((v = cmpxchg(&df->state.v, old.v, new.v)) != old.v);
> +
> + if (ret)
> + pr_debug("returned true for %s:%u", df->tag.filename, df->tag.lineno);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(__dynamic_fault_enabled);
> +
> +static const char * const dfault_state_strs[] = {
> +#define x(n) #n,
> + DFAULT_STATES()
> +#undef x
> + NULL
> +};
> +
> +static void dynamic_fault_to_text(struct seq_buf *out, struct dfault *df)
> +{
> + codetag_to_text(out, &df->tag);
> + seq_buf_printf(out, "class:%s %s \"", df->class,
> + dfault_state_strs[df->state.enabled]);
> +}
> +
> +struct dfault_query {
> + struct codetag_query q;
> +
> + bool set_enabled:1;
> + unsigned int enabled:2;
> +
> + bool set_frequency:1;
> + unsigned int frequency;
> +};
> +
> +/*
> + * Search the tables for _dfault's which match the given
> + * `query' and apply the `flags' and `mask' to them. Tells
> + * the user which dfault's were changed, or whether none
> + * were matched.
> + */
> +static int dfault_change(struct dfault_query *query)
> +{
> + struct codetag_iterator ct_iter;
> + struct codetag *ct;
> + unsigned int nfound = 0;
> +
> + codetag_lock_module_list(cttype, true);
> + codetag_init_iter(&ct_iter, cttype);
> +
> + while ((ct = codetag_next_ct(&ct_iter))) {
> + struct dfault *df = container_of(ct, struct dfault, tag);
> +
> + if (!codetag_matches_query(&query->q, ct, ct_iter.cmod, df->class))
> + continue;
> +
> + if (query->set_enabled &&
> + query->enabled != df->state.enabled) {
> + if (query->enabled != DFAULT_disabled)
> + static_key_slow_inc(&df->enabled.key);
> + else if (df->state.enabled != DFAULT_disabled)
> + static_key_slow_dec(&df->enabled.key);
> +
> + df->state.enabled = query->enabled;
> + }
> +
> + if (query->set_frequency)
> + df->frequency = query->frequency;
> +
> + pr_debug("changed %s:%d [%s]%s #%d %s",
> + df->tag.filename, df->tag.lineno, df->tag.modname,
> + df->tag.function, query->q.cur_index,
> + dfault_state_strs[df->state.enabled]);
> +
> + nfound++;
> + }
> +
> + pr_debug("dfault: %u matches", nfound);
> +
> + codetag_lock_module_list(cttype, false);
> +
> + return nfound ? 0 : -ENOENT;
> +}
> +
> +#define DFAULT_TOKENS() \
> + x(disable, 0) \
> + x(enable, 0) \
> + x(oneshot, 0) \
> + x(frequency, 1)
> +
> +enum dfault_token {
> +#define x(name, nr_args) TOK_##name,
> + DFAULT_TOKENS()
> +#undef x
> +};
> +
> +static const char * const dfault_token_strs[] = {
> +#define x(name, nr_args) #name,
> + DFAULT_TOKENS()
> +#undef x
> + NULL
> +};
> +
> +static unsigned int dfault_token_nr_args[] = {
> +#define x(name, nr_args) nr_args,
> + DFAULT_TOKENS()
> +#undef x
> +};
> +
> +static enum dfault_token str_to_token(const char *word, unsigned int nr_words)
> +{
> + int tok = match_string(dfault_token_strs, ARRAY_SIZE(dfault_token_strs), word);
> +
> + if (tok < 0) {
> + pr_debug("unknown keyword \"%s\"", word);
> + return tok;
> + }
> +
> + if (nr_words < dfault_token_nr_args[tok]) {
> + pr_debug("insufficient arguments to \"%s\"", word);
> + return -EINVAL;
> + }
> +
> + return tok;
> +}
> +
> +static int dfault_parse_command(struct dfault_query *query,
> + enum dfault_token tok,
> + char *words[], size_t nr_words)
> +{
> + unsigned int i = 0;
> + int ret;
> +
> + switch (tok) {
> + case TOK_disable:
> + query->set_enabled = true;
> + query->enabled = DFAULT_disabled;
> + break;
> + case TOK_enable:
> + query->set_enabled = true;
> + query->enabled = DFAULT_enabled;
> + break;
> + case TOK_oneshot:
> + query->set_enabled = true;
> + query->enabled = DFAULT_oneshot;
> + break;
> + case TOK_frequency:
> + query->set_frequency = 1;
> + ret = kstrtouint(words[i++], 10, &query->frequency);
> + if (ret)
> + return ret;
> +
> + if (!query->set_enabled) {
> + query->set_enabled = 1;
> + query->enabled = DFAULT_enabled;
> + }
> + break;
> + }
> +
> + return i;
> +}
> +
> +static int dynamic_fault_store(char *buf)
> +{
> + struct dfault_query query = { NULL };
> +#define MAXWORDS 9
> + char *tok, *words[MAXWORDS];
> + int ret, nr_words = 0, i = 0;
> +
> + buf = codetag_query_parse(&query.q, buf);
> + if (IS_ERR(buf))
> + return PTR_ERR(buf);
> +
> + while ((tok = strsep_no_empty(&buf, " \t\r\n"))) {
> + if (nr_words == ARRAY_SIZE(words))
> + return -EINVAL; /* ran out of words[] before bytes */
> + words[nr_words++] = tok;
> + }
> +
> + while (i < nr_words) {
> + const char *tok_str = words[i++];
> + enum dfault_token tok = str_to_token(tok_str, nr_words - i);
> +
> + if (tok < 0)
> + return tok;
> +
> + ret = dfault_parse_command(&query, tok, words + i, nr_words - i);
> + if (ret < 0)
> + return ret;
> +
> + i += ret;
> + BUG_ON(i > nr_words);
> + }
> +
> + pr_debug("q->function=\"%s\" q->filename=\"%s\" "
> + "q->module=\"%s\" q->line=%u-%u\n q->index=%u-%u",
> + query.q.function, query.q.filename, query.q.module,
> + query.q.first_line, query.q.last_line,
> + query.q.first_index, query.q.last_index);
> +
> + ret = dfault_change(&query);
> + if (ret < 0)
> + return ret;
> +
> + return 0;
> +}
> +
> +struct dfault_iter {
> + struct codetag_iterator ct_iter;
> +
> + struct seq_buf buf;
> + char rawbuf[4096];
> +};
> +
> +static int dfault_open(struct inode *inode, struct file *file)
> +{
> + struct dfault_iter *iter;
> +
> + iter = kzalloc(sizeof(*iter), GFP_KERNEL);
> + if (!iter)
> + return -ENOMEM;
> +
> + codetag_lock_module_list(cttype, true);
> + codetag_init_iter(&iter->ct_iter, cttype);
> + codetag_lock_module_list(cttype, false);
> +
> + file->private_data = iter;
> + seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
> + return 0;
> +}
> +
> +static int dfault_release(struct inode *inode, struct file *file)
> +{
> + struct dfault_iter *iter = file->private_data;
> +
> + kfree(iter);
> + return 0;
> +}
> +
> +struct user_buf {
> + char __user *buf; /* destination user buffer */
> + size_t size; /* size of requested read */
> + ssize_t ret; /* bytes read so far */
> +};
> +
> +static int flush_ubuf(struct user_buf *dst, struct seq_buf *src)
> +{
> + if (src->len) {
> + size_t bytes = min_t(size_t, src->len, dst->size);
> + int err = copy_to_user(dst->buf, src->buffer, bytes);
> +
> + if (err)
> + return err;
> +
> + dst->ret += bytes;
> + dst->buf += bytes;
> + dst->size -= bytes;
> + src->len -= bytes;
> + memmove(src->buffer, src->buffer + bytes, src->len);
> + }
> +
> + return 0;
> +}
> +
> +static ssize_t dfault_read(struct file *file, char __user *ubuf,
> + size_t size, loff_t *ppos)
> +{
> + struct dfault_iter *iter = file->private_data;
> + struct user_buf buf = { .buf = ubuf, .size = size };
> + struct codetag *ct;
> + struct dfault *df;
> + int err;
> +
> + codetag_lock_module_list(iter->ct_iter.cttype, true);
> + while (1) {
> + err = flush_ubuf(&buf, &iter->buf);
> + if (err || !buf.size)
> + break;
> +
> + ct = codetag_next_ct(&iter->ct_iter);
> + if (!ct)
> + break;
> +
> + df = container_of(ct, struct dfault, tag);
> + dynamic_fault_to_text(&iter->buf, df);
> + seq_buf_putc(&iter->buf, '\n');
> + }
> + codetag_lock_module_list(iter->ct_iter.cttype, false);
> +
> + return err ?: buf.ret;
> +}
> +
> +/*
> + * File_ops->write method for <debugfs>/dynamic_fault/control. Gathers the
> + * command text from userspace, parses and executes it.
> + */
> +static ssize_t dfault_write(struct file *file, const char __user *ubuf,
> + size_t len, loff_t *offp)
> +{
> + char tmpbuf[256];
> +
> + if (len == 0)
> + return 0;
> + /* we don't check *offp -- multiple writes() are allowed */
> + if (len > sizeof(tmpbuf)-1)
> + return -E2BIG;
> + if (copy_from_user(tmpbuf, ubuf, len))
> + return -EFAULT;
> + tmpbuf[len] = '\0';
> + pr_debug("read %zu bytes from userspace", len);
> +
> + dynamic_fault_store(tmpbuf);
> +
> + *offp += len;
> + return len;
> +}
> +
> +static const struct file_operations dfault_ops = {
> + .owner = THIS_MODULE,
> + .open = dfault_open,
> + .release = dfault_release,
> + .read = dfault_read,
> + .write = dfault_write
> +};
> +
> +static int __init dynamic_fault_init(void)
> +{
> + const struct codetag_type_desc desc = {
> + .section = "dynamic_fault_tags",
> + .tag_size = sizeof(struct dfault),
> + };
> + struct dentry *debugfs_file;
> +
> + cttype = codetag_register_type(&desc);
> + if (IS_ERR_OR_NULL(cttype))
> + return PTR_ERR(cttype);
> +
> + debugfs_file = debugfs_create_file("dynamic_faults", 0666, NULL, NULL, &dfault_ops);
> + if (IS_ERR(debugfs_file))
> + return PTR_ERR(debugfs_file);
> +
> + return 0;
> +}
> +module_init(dynamic_fault_init);
> --
> 2.37.2.672.g94769d06f0-goog
>
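
The flush_ubuf()/dfault_read() pair above implements an incremental drain: formatted text is staged in a fixed-size seq_buf and copied out piecewise across read() calls, with any leftover bytes retained for the next call. A hedged user-space model of that pattern (Python, purely illustrative; the class and method names are invented, not kernel API):

```python
# User-space model of the flush_ubuf() drain pattern: formatted output is
# staged in an internal buffer, each read() drains as much as fits into
# the caller's buffer, and the remainder stays staged (the memmove step).

class StagedReader:
    def __init__(self, records):
        self.records = iter(records)   # stands in for codetag_next_ct()
        self.staged = b""              # stands in for iter->buf / rawbuf

    def read(self, size):
        """Return up to `size` bytes, formatting more records as needed."""
        out = b""
        while size:
            # flush_ubuf(): copy what fits, keep the remainder staged
            take = min(len(self.staged), size)
            out += self.staged[:take]
            self.staged = self.staged[take:]   # the memmove() step
            size -= take
            if size == 0:
                break
            # codetag_next_ct() + dynamic_fault_to_text(): stage one line
            rec = next(self.records, None)
            if rec is None:
                break
            self.staged += (rec + "\n").encode()
        return out
```

Reading in chunks smaller than a record still reassembles the full output, which is the property the kernel code relies on when userspace issues short reads.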

2022-08-31 11:08:21

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > ===========================
> > > > Code tagging framework
> > > > ===========================
> > > > Code tag is a structure identifying a specific location in the source code
> > > > which is generated at compile time and can be embedded in an application-
> > > > specific structure. Several applications of code tagging are included in
> > > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > > latency tracking and improved error code reporting.
> > > > Basically, it takes the old trick of "define a special elf section for
> > > > objects of a given type so that we can iterate over them at runtime" and
> > > > creates a proper library for it.
> > >
> > > I might be super dense this morning, but what!? I've skimmed through the
> > > set and I don't think I get it.
> > >
> > > What does this provide that ftrace/kprobes don't already allow?
> >
> > You're kidding, right?
>
> It's a valid question. From the description, its main addition that would
> be hard to do with ftrace or probes is catching where an error code is
> returned. A secondary addition would be catching all historical state and
> not just state since the tracing started.
>
> It's also unclear *who* would enable this. It looks like it would mostly
> have value during the development stage of an embedded platform, to track
> kernel memory usage on a per-application basis in an environment where it
> may be difficult to set up tracing and tracking. Would it ever be enabled
> in production? Would a distribution ever enable this? If it's enabled, any
> overhead cannot be disabled/enabled at run or boot time, so anyone enabling
> this would carry the cost without ever necessarily consuming the data.
>
> It might be an ease-of-use thing. Gathering the information from traces
> is tricky and would require combining multiple different elements; that
> is development effort, but not impossible.
>
> Regardless, asking for an explanation as to why equivalent functionality
> cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.

Fully agreed and this is especially true for a change this size
77 files changed, 3406 insertions(+), 703 deletions(-)

--
Michal Hocko
SUSE Labs

2022-08-31 11:08:32

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 03/30] Lazy percpu counters

On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> From: Kent Overstreet <[email protected]>
>
> This patch adds lib/lazy-percpu-counter.c, which implements counters
> that start out as atomics, but lazily switch to percpu mode if the
> update rate crosses some threshold (arbitrarily set at 256 per second).
>
> Signed-off-by: Kent Overstreet <[email protected]>

Why not use percpu_counter? It has a per-cpu counter that is synchronised
when a batch threshold (default 32) is exceeded and can explicitly sync
the counters when required assuming the synchronised count is only needed
when reading debugfs.

--
Mel Gorman
SUSE Labs
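
For context, the percpu_counter behaviour Mel refers to can be sketched as a user-space model (Python; illustrative only, not the kernel API): each CPU keeps a local delta that is folded into the shared count once it exceeds the batch size, and an explicit sum walks all deltas for an exact read.

```python
# User-space model of percpu_counter batching: updates touch only the
# per-CPU delta until |delta| reaches the batch size, so the common path
# avoids the shared counter; sum() gives the exact, synchronised value.

class PercpuCounter:
    def __init__(self, nr_cpus, batch=32):
        self.count = 0                  # shared, approximately-synced count
        self.deltas = [0] * nr_cpus     # per-CPU local deltas
        self.batch = batch              # default batch of 32, as Mel notes

    def add(self, cpu, amount):
        d = self.deltas[cpu] + amount
        if abs(d) >= self.batch:
            self.count += d             # fold into the shared counter
            d = 0
        self.deltas[cpu] = d

    def sum(self):
        """Explicit sync: exact value, e.g. when reading debugfs."""
        return self.count + sum(self.deltas)
```

The trade-off Mel is pointing at: the per-CPU storage is allocated up front, but updates are cheap and an exact value is only computed when somebody actually reads it.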

2022-08-31 15:42:35

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 3:47 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > > ===========================
> > > > > Code tagging framework
> > > > > ===========================
> > > > > Code tag is a structure identifying a specific location in the source code
> > > > > which is generated at compile time and can be embedded in an application-
> > > > > specific structure. Several applications of code tagging are included in
> > > > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > > > latency tracking and improved error code reporting.
> > > > > Basically, it takes the old trick of "define a special elf section for
> > > > > objects of a given type so that we can iterate over them at runtime" and
> > > > > creates a proper library for it.
> > > >
> > > > I might be super dense this morning, but what!? I've skimmed through the
> > > > set and I don't think I get it.
> > > >
> > > > What does this provide that ftrace/kprobes don't already allow?
> > >
> > > You're kidding, right?
> >
> > It's a valid question. From the description, its main addition that would
> > be hard to do with ftrace or probes is catching where an error code is
> > returned. A secondary addition would be catching all historical state and
> > not just state since the tracing started.
> >
> > It's also unclear *who* would enable this. It looks like it would mostly
> > have value during the development stage of an embedded platform, to track
> > kernel memory usage on a per-application basis in an environment where it
> > may be difficult to set up tracing and tracking. Would it ever be enabled
> > in production? Would a distribution ever enable this? If it's enabled, any
> > overhead cannot be disabled/enabled at run or boot time, so anyone enabling
> > this would carry the cost without ever necessarily consuming the data.

Thank you for the question.
For memory tracking, my intent is to have a mechanism that can be enabled
during field testing (pre-production testing on a large population of
internal users).
The issue we often face is a memory leak that happens in the field but is
very hard to reproduce locally. We get a bug report from the user which
indicates it, but it often does not contain enough information to track the
leak down. Note that quite often these leaks/issues happen in drivers, so
even simply finding out where they came from is a big help.
The way I envision this mechanism being used is to enable basic memory
tracking during field tests and have a user-space process collect the
allocation statistics periodically (say, once an hour). Once it detects a
counter growing unboundedly or atypically (the definition of which is left
to user space), it can enable context capturing only for that specific
location, still keeping the overhead to a minimum while getting more
information about potential issues. Collected stats and contexts are then
attached to the bug report, giving us more visibility into the issue when
we receive it.
The goal is to provide a mechanism with low enough overhead that it can be
enabled all the time during these field tests without affecting the
device's performance profile.
Tracing is very cheap when it's disabled, but having it enabled all the
time would introduce higher overhead than the counter manipulations.
My apologies, I should have clarified all this in the cover letter from
the beginning.

As for other applications, maybe I'm not such an advanced user of tracing,
but I think only the latency tracking application might be done with
tracing, assuming we have all the right tracepoints; I don't see how we
would use tracing for fault injection and descriptive error codes. Again,
I might be mistaken.

Thanks,
Suren.

> >
> > It might be an ease-of-use thing. Gathering the information from traces
> > is tricky and would require combining multiple different elements; that
> > is development effort, but not impossible.
> >
> > Regardless, asking for an explanation as to why equivalent functionality
> > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
>
> Fully agreed and this is especially true for a change this size
> 77 files changed, 3406 insertions(+), 703 deletions(-)
>
> --
> Michal Hocko
> SUSE Labs

2022-08-31 15:43:21

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 03/30] Lazy percpu counters

On Wed, Aug 31, 2022 at 3:02 AM Mel Gorman <[email protected]> wrote:
>
> On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > This patch adds lib/lazy-percpu-counter.c, which implements counters
> > that start out as atomics, but lazily switch to percpu mode if the
> > update rate crosses some threshold (arbitrarily set at 256 per second).
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
>
> Why not use percpu_counter? It has a per-cpu counter that is synchronised
> when a batch threshold (default 32) is exceeded and can explicitly sync
> the counters when required assuming the synchronised count is only needed
> when reading debugfs.

The intent is to use atomic counters for places that are not updated very often.
This would save memory required for the counters. Originally I had a config
option to choose which counter type to use but with lazy counters we sacrifice
memory for performance only when needed while keeping the other counters
small.

>
> --
> Mel Gorman
> SUSE Labs
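
The atomic-to-percpu promotion Suren describes can be sketched as follows (a hedged user-space model in Python with an injected clock; the 256-per-second threshold comes from the patch description, everything else is invented for illustration):

```python
# User-space model of a lazy percpu counter: updates go to a single
# shared value until the observed update rate crosses a threshold,
# after which per-CPU storage is allocated and paid for only then.

class LazyCounter:
    RATE_THRESHOLD = 256   # updates/sec before switching, per the patch

    def __init__(self, nr_cpus, clock):
        self.clock = clock            # callable returning seconds
        self.nr_cpus = nr_cpus
        self.value = 0                # "atomic" mode storage
        self.percpu = None            # allocated only on promotion
        self.window_start = clock()
        self.window_updates = 0

    def add(self, cpu, amount):
        if self.percpu is not None:
            self.percpu[cpu] += amount    # cheap percpu-mode path
            return
        self.value += amount
        self.window_updates += 1
        now = self.clock()
        if now - self.window_start >= 1.0:
            if self.window_updates > self.RATE_THRESHOLD:
                self.percpu = [0] * self.nr_cpus   # promote to percpu
            self.window_start, self.window_updates = now, 0

    def read(self):
        total = self.value
        if self.percpu is not None:
            total += sum(self.percpu)
        return total
```

This captures the memory/performance trade Suren is making: rarely-updated counters stay a single word, and only hot counters ever grow per-CPU storage.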

2022-08-31 16:08:15

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 22/30] Code tagging based fault injection

On Tue, Aug 30, 2022 at 6:52 PM Randy Dunlap <[email protected]> wrote:
>
>
>
> On 8/30/22 14:49, Suren Baghdasaryan wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > This adds a new fault injection capability, based on code tagging.
> >
> > To use, simply insert somewhere in your code
> >
> > dynamic_fault("fault_class_name")
> >
> > and check whether it returns true - if so, inject the error.
> > For example
> >
> > if (dynamic_fault("init"))
> > return -EINVAL;
> >
> > There's no need to define faults elsewhere, as with
> > include/linux/fault-injection.h. Faults show up in debugfs, under
> > /sys/kernel/debug/dynamic_faults, and can be selected based on
> > file/module/function/line number/class, and enabled permanently, or in
> > oneshot mode, or with a specified frequency.
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
>
> Missing Signed-off-by: from Suren.
> See Documentation/process/submitting-patches.rst:
>
> When to use Acked-by:, Cc:, and Co-developed-by:
> ------------------------------------------------
>
> The Signed-off-by: tag indicates that the signer was involved in the
> development of the patch, or that he/she was in the patch's delivery path.

Thanks for the note! Will fix in the next respin.

>
>
> --
> ~Randy

2022-08-31 16:19:42

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 22/30] Code tagging based fault injection

On Wed, Aug 31, 2022 at 3:37 AM Dmitry Vyukov <[email protected]> wrote:
>
> On Tue, 30 Aug 2022 at 23:50, Suren Baghdasaryan <[email protected]> wrote:
> >
> > From: Kent Overstreet <[email protected]>
> >
> > This adds a new fault injection capability, based on code tagging.
> >
> > To use, simply insert somewhere in your code
> >
> > dynamic_fault("fault_class_name")
> >
> > and check whether it returns true - if so, inject the error.
> > For example
> >
> > if (dynamic_fault("init"))
> > return -EINVAL;
>
> Hi Suren,
>
> If this is going to be used by mainline kernel, it would be good to
> integrate this with fail_nth systematic fault injection:
> https://elixir.bootlin.com/linux/latest/source/lib/fault-inject.c#L109
>
> Otherwise these dynamic sites won't be tested by testing systems doing
> systematic fault injection testing.

Hi Dmitry,
Thanks for the information! Will look into it and try to integrate.
Suren.

>
>
> > There's no need to define faults elsewhere, as with
> > include/linux/fault-injection.h. Faults show up in debugfs, under
> > /sys/kernel/debug/dynamic_faults, and can be selected based on
> > file/module/function/line number/class, and enabled permanently, or in
> > oneshot mode, or with a specified frequency.
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > ---
> > include/asm-generic/codetag.lds.h | 3 +-
> > include/linux/dynamic_fault.h | 79 +++++++
> > include/linux/slab.h | 3 +-
> > lib/Kconfig.debug | 6 +
> > lib/Makefile | 2 +
> > lib/dynamic_fault.c | 372 ++++++++++++++++++++++++++++++
> > 6 files changed, 463 insertions(+), 2 deletions(-)
> > create mode 100644 include/linux/dynamic_fault.h
> > create mode 100644 lib/dynamic_fault.c
> >
> > diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
> > index 64f536b80380..16fbf74edc3d 100644
> > --- a/include/asm-generic/codetag.lds.h
> > +++ b/include/asm-generic/codetag.lds.h
> > @@ -9,6 +9,7 @@
> > __stop_##_name = .;
> >
> > #define CODETAG_SECTIONS() \
> > - SECTION_WITH_BOUNDARIES(alloc_tags)
> > + SECTION_WITH_BOUNDARIES(alloc_tags) \
> > + SECTION_WITH_BOUNDARIES(dynamic_fault_tags)
> >
> > #endif /* __ASM_GENERIC_CODETAG_LDS_H */
> > diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
> > new file mode 100644
> > index 000000000000..526a33209e94
> > --- /dev/null
> > +++ b/include/linux/dynamic_fault.h
> > @@ -0,0 +1,79 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#ifndef _LINUX_DYNAMIC_FAULT_H
> > +#define _LINUX_DYNAMIC_FAULT_H
> > +
> > +/*
> > + * Dynamic/code tagging fault injection:
> > + *
> > + * Originally based on the dynamic debug trick of putting types in a special elf
> > + * section, then rewritten using code tagging:
> > + *
> > + * To use, simply insert a call to dynamic_fault("fault_class"), which will
> > + * return true if an error should be injected.
> > + *
> > + * Fault injection sites may be listed and enabled via debugfs, under
> > + * /sys/kernel/debug/dynamic_faults.
> > + */
> > +
> > +#ifdef CONFIG_CODETAG_FAULT_INJECTION
> > +
> > +#include <linux/codetag.h>
> > +#include <linux/jump_label.h>
> > +
> > +#define DFAULT_STATES() \
> > + x(disabled) \
> > + x(enabled) \
> > + x(oneshot)
> > +
> > +enum dfault_enabled {
> > +#define x(n) DFAULT_##n,
> > + DFAULT_STATES()
> > +#undef x
> > +};
> > +
> > +union dfault_state {
> > + struct {
> > + unsigned int enabled:2;
> > + unsigned int count:30;
> > + };
> > +
> > + struct {
> > + unsigned int v;
> > + };
> > +};
> > +
> > +struct dfault {
> > + struct codetag tag;
> > + const char *class;
> > + unsigned int frequency;
> > + union dfault_state state;
> > + struct static_key_false enabled;
> > +};
> > +
> > +bool __dynamic_fault_enabled(struct dfault *df);
> > +
> > +#define dynamic_fault(_class) \
> > +({ \
> > + static struct dfault \
> > + __used \
> > + __section("dynamic_fault_tags") \
> > + __aligned(8) df = { \
> > + .tag = CODE_TAG_INIT, \
> > + .class = _class, \
> > + .enabled = STATIC_KEY_FALSE_INIT, \
> > + }; \
> > + \
> > + static_key_false(&df.enabled.key) && \
> > + __dynamic_fault_enabled(&df); \
> > +})
> > +
> > +#else
> > +
> > +#define dynamic_fault(_class) false
> > +
> > +#endif /* CODETAG_FAULT_INJECTION */
> > +
> > +#define memory_fault() dynamic_fault("memory")
> > +
> > +#endif /* _LINUX_DYNAMIC_FAULT_H */
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index 89273be35743..4be5a93ed15a 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -17,6 +17,7 @@
> > #include <linux/types.h>
> > #include <linux/workqueue.h>
> > #include <linux/percpu-refcount.h>
> > +#include <linux/dynamic_fault.h>
> >
> >
> > /*
> > @@ -468,7 +469,7 @@ static inline void slab_tag_dec(const void *ptr) {}
> >
> > #define krealloc_hooks(_p, _do_alloc) \
> > ({ \
> > - void *_res = _do_alloc; \
> > + void *_res = !memory_fault() ? _do_alloc : NULL; \
> > slab_tag_add(_p, _res); \
> > _res; \
> > })
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index 2790848464f1..b7d03afbc808 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -1982,6 +1982,12 @@ config FAULT_INJECTION_STACKTRACE_FILTER
> > help
> > Provide stacktrace filter for fault-injection capabilities
> >
> > +config CODETAG_FAULT_INJECTION
> > + bool "Code tagging based fault injection"
> > + select CODE_TAGGING
> > + help
> > + Dynamic fault injection based on code tagging
> > +
> > config ARCH_HAS_KCOV
> > bool
> > help
> > diff --git a/lib/Makefile b/lib/Makefile
> > index 99f732156673..489ea000c528 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -231,6 +231,8 @@ obj-$(CONFIG_CODE_TAGGING) += codetag.o
> > obj-$(CONFIG_ALLOC_TAGGING) += alloc_tag.o
> > obj-$(CONFIG_PAGE_ALLOC_TAGGING) += pgalloc_tag.o
> >
> > +obj-$(CONFIG_CODETAG_FAULT_INJECTION) += dynamic_fault.o
> > +
> > lib-$(CONFIG_GENERIC_BUG) += bug.o
> >
> > obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
> > diff --git a/lib/dynamic_fault.c b/lib/dynamic_fault.c
> > new file mode 100644
> > index 000000000000..4c9cd18686be
> > --- /dev/null
> > +++ b/lib/dynamic_fault.c
> > @@ -0,0 +1,372 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +
> > +#include <linux/ctype.h>
> > +#include <linux/debugfs.h>
> > +#include <linux/dynamic_fault.h>
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/seq_buf.h>
> > +
> > +static struct codetag_type *cttype;
> > +
> > +bool __dynamic_fault_enabled(struct dfault *df)
> > +{
> > + union dfault_state old, new;
> > + unsigned int v = df->state.v;
> > + bool ret;
> > +
> > + do {
> > + old.v = new.v = v;
> > +
> > + if (new.enabled == DFAULT_disabled)
> > + return false;
> > +
> > + ret = df->frequency
> > + ? ++new.count >= df->frequency
> > + : true;
> > + if (ret)
> > + new.count = 0;
> > + if (ret && new.enabled == DFAULT_oneshot)
> > + new.enabled = DFAULT_disabled;
> > + } while ((v = cmpxchg(&df->state.v, old.v, new.v)) != old.v);
> > +
> > + if (ret)
> > + pr_debug("returned true for %s:%u", df->tag.filename, df->tag.lineno);
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL(__dynamic_fault_enabled);
> > +
> > +static const char * const dfault_state_strs[] = {
> > +#define x(n) #n,
> > + DFAULT_STATES()
> > +#undef x
> > + NULL
> > +};
> > +
> > +static void dynamic_fault_to_text(struct seq_buf *out, struct dfault *df)
> > +{
> > + codetag_to_text(out, &df->tag);
> > + seq_buf_printf(out, "class:%s %s \"", df->class,
> > + dfault_state_strs[df->state.enabled]);
> > +}
> > +
> > +struct dfault_query {
> > + struct codetag_query q;
> > +
> > + bool set_enabled:1;
> > + unsigned int enabled:2;
> > +
> > + bool set_frequency:1;
> > + unsigned int frequency;
> > +};
> > +
> > +/*
> > + * Search the tables for _dfault's which match the given
> > + * `query' and apply the `flags' and `mask' to them. Tells
> > + * the user which dfault's were changed, or whether none
> > + * were matched.
> > + */
> > +static int dfault_change(struct dfault_query *query)
> > +{
> > + struct codetag_iterator ct_iter;
> > + struct codetag *ct;
> > + unsigned int nfound = 0;
> > +
> > + codetag_lock_module_list(cttype, true);
> > + codetag_init_iter(&ct_iter, cttype);
> > +
> > + while ((ct = codetag_next_ct(&ct_iter))) {
> > + struct dfault *df = container_of(ct, struct dfault, tag);
> > +
> > + if (!codetag_matches_query(&query->q, ct, ct_iter.cmod, df->class))
> > + continue;
> > +
> > + if (query->set_enabled &&
> > + query->enabled != df->state.enabled) {
> > + if (query->enabled != DFAULT_disabled)
> > + static_key_slow_inc(&df->enabled.key);
> > + else if (df->state.enabled != DFAULT_disabled)
> > + static_key_slow_dec(&df->enabled.key);
> > +
> > + df->state.enabled = query->enabled;
> > + }
> > +
> > + if (query->set_frequency)
> > + df->frequency = query->frequency;
> > +
> > + pr_debug("changed %s:%d [%s]%s #%d %s",
> > + df->tag.filename, df->tag.lineno, df->tag.modname,
> > + df->tag.function, query->q.cur_index,
> > + dfault_state_strs[df->state.enabled]);
> > +
> > + nfound++;
> > + }
> > +
> > + pr_debug("dfault: %u matches", nfound);
> > +
> > + codetag_lock_module_list(cttype, false);
> > +
> > + return nfound ? 0 : -ENOENT;
> > +}
> > +
> > +#define DFAULT_TOKENS() \
> > + x(disable, 0) \
> > + x(enable, 0) \
> > + x(oneshot, 0) \
> > + x(frequency, 1)
> > +
> > +enum dfault_token {
> > +#define x(name, nr_args) TOK_##name,
> > + DFAULT_TOKENS()
> > +#undef x
> > +};
> > +
> > +static const char * const dfault_token_strs[] = {
> > +#define x(name, nr_args) #name,
> > + DFAULT_TOKENS()
> > +#undef x
> > + NULL
> > +};
> > +
> > +static unsigned int dfault_token_nr_args[] = {
> > +#define x(name, nr_args) nr_args,
> > + DFAULT_TOKENS()
> > +#undef x
> > +};
> > +
> > +static enum dfault_token str_to_token(const char *word, unsigned int nr_words)
> > +{
> > + int tok = match_string(dfault_token_strs, ARRAY_SIZE(dfault_token_strs), word);
> > +
> > + if (tok < 0) {
> > + pr_debug("unknown keyword \"%s\"", word);
> > + return tok;
> > + }
> > +
> > + if (nr_words < dfault_token_nr_args[tok]) {
> > + pr_debug("insufficient arguments to \"%s\"", word);
> > + return -EINVAL;
> > + }
> > +
> > + return tok;
> > +}
> > +
> > +static int dfault_parse_command(struct dfault_query *query,
> > + enum dfault_token tok,
> > + char *words[], size_t nr_words)
> > +{
> > + unsigned int i = 0;
> > + int ret;
> > +
> > + switch (tok) {
> > + case TOK_disable:
> > + query->set_enabled = true;
> > + query->enabled = DFAULT_disabled;
> > + break;
> > + case TOK_enable:
> > + query->set_enabled = true;
> > + query->enabled = DFAULT_enabled;
> > + break;
> > + case TOK_oneshot:
> > + query->set_enabled = true;
> > + query->enabled = DFAULT_oneshot;
> > + break;
> > + case TOK_frequency:
> > + query->set_frequency = 1;
> > + ret = kstrtouint(words[i++], 10, &query->frequency);
> > + if (ret)
> > + return ret;
> > +
> > + if (!query->set_enabled) {
> > + query->set_enabled = 1;
> > + query->enabled = DFAULT_enabled;
> > + }
> > + break;
> > + }
> > +
> > + return i;
> > +}
> > +
> > +static int dynamic_fault_store(char *buf)
> > +{
> > + struct dfault_query query = { NULL };
> > +#define MAXWORDS 9
> > + char *tok, *words[MAXWORDS];
> > > + int ret, nr_words = 0, i = 0;
> > +
> > + buf = codetag_query_parse(&query.q, buf);
> > + if (IS_ERR(buf))
> > + return PTR_ERR(buf);
> > +
> > + while ((tok = strsep_no_empty(&buf, " \t\r\n"))) {
> > + if (nr_words == ARRAY_SIZE(words))
> > + return -EINVAL; /* ran out of words[] before bytes */
> > + words[nr_words++] = tok;
> > + }
> > +
> > + while (i < nr_words) {
> > + const char *tok_str = words[i++];
> > + enum dfault_token tok = str_to_token(tok_str, nr_words - i);
> > +
> > + if (tok < 0)
> > + return tok;
> > +
> > + ret = dfault_parse_command(&query, tok, words + i, nr_words - i);
> > + if (ret < 0)
> > + return ret;
> > +
> > + i += ret;
> > + BUG_ON(i > nr_words);
> > + }
> > +
> > + pr_debug("q->function=\"%s\" q->filename=\"%s\" "
> > + "q->module=\"%s\" q->line=%u-%u\n q->index=%u-%u",
> > + query.q.function, query.q.filename, query.q.module,
> > + query.q.first_line, query.q.last_line,
> > + query.q.first_index, query.q.last_index);
> > +
> > + ret = dfault_change(&query);
> > + if (ret < 0)
> > + return ret;
> > +
> > + return 0;
> > +}
> > +
> > +struct dfault_iter {
> > + struct codetag_iterator ct_iter;
> > +
> > + struct seq_buf buf;
> > + char rawbuf[4096];
> > +};
> > +
> > +static int dfault_open(struct inode *inode, struct file *file)
> > +{
> > + struct dfault_iter *iter;
> > +
> > + iter = kzalloc(sizeof(*iter), GFP_KERNEL);
> > + if (!iter)
> > + return -ENOMEM;
> > +
> > + codetag_lock_module_list(cttype, true);
> > + codetag_init_iter(&iter->ct_iter, cttype);
> > + codetag_lock_module_list(cttype, false);
> > +
> > + file->private_data = iter;
> > + seq_buf_init(&iter->buf, iter->rawbuf, sizeof(iter->rawbuf));
> > + return 0;
> > +}
> > +
> > +static int dfault_release(struct inode *inode, struct file *file)
> > +{
> > + struct dfault_iter *iter = file->private_data;
> > +
> > + kfree(iter);
> > + return 0;
> > +}
> > +
> > +struct user_buf {
> > + char __user *buf; /* destination user buffer */
> > + size_t size; /* size of requested read */
> > + ssize_t ret; /* bytes read so far */
> > +};
> > +
> > +static int flush_ubuf(struct user_buf *dst, struct seq_buf *src)
> > +{
> > + if (src->len) {
> > + size_t bytes = min_t(size_t, src->len, dst->size);
> > + int err = copy_to_user(dst->buf, src->buffer, bytes);
> > +
> > + if (err)
> > + return err;
> > +
> > + dst->ret += bytes;
> > + dst->buf += bytes;
> > + dst->size -= bytes;
> > + src->len -= bytes;
> > + memmove(src->buffer, src->buffer + bytes, src->len);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static ssize_t dfault_read(struct file *file, char __user *ubuf,
> > + size_t size, loff_t *ppos)
> > +{
> > + struct dfault_iter *iter = file->private_data;
> > + struct user_buf buf = { .buf = ubuf, .size = size };
> > + struct codetag *ct;
> > + struct dfault *df;
> > + int err;
> > +
> > + codetag_lock_module_list(iter->ct_iter.cttype, true);
> > + while (1) {
> > + err = flush_ubuf(&buf, &iter->buf);
> > + if (err || !buf.size)
> > + break;
> > +
> > + ct = codetag_next_ct(&iter->ct_iter);
> > + if (!ct)
> > + break;
> > +
> > + df = container_of(ct, struct dfault, tag);
> > + dynamic_fault_to_text(&iter->buf, df);
> > + seq_buf_putc(&iter->buf, '\n');
> > + }
> > + codetag_lock_module_list(iter->ct_iter.cttype, false);
> > +
> > + return err ?: buf.ret;
> > +}
> > +
> > +/*
> > + * File_ops->write method for <debugfs>/dynamic_fault/control. Gathers the
> > + * command text from userspace, parses and executes it.
> > + */
> > +static ssize_t dfault_write(struct file *file, const char __user *ubuf,
> > + size_t len, loff_t *offp)
> > +{
> > + char tmpbuf[256];
> > +
> > + if (len == 0)
> > + return 0;
> > + /* we don't check *offp -- multiple writes() are allowed */
> > + if (len > sizeof(tmpbuf)-1)
> > + return -E2BIG;
> > + if (copy_from_user(tmpbuf, ubuf, len))
> > + return -EFAULT;
> > + tmpbuf[len] = '\0';
> > + pr_debug("read %zu bytes from userspace", len);
> > +
> > + dynamic_fault_store(tmpbuf);
> > +
> > + *offp += len;
> > + return len;
> > +}
> > +
> > +static const struct file_operations dfault_ops = {
> > + .owner = THIS_MODULE,
> > + .open = dfault_open,
> > + .release = dfault_release,
> > + .read = dfault_read,
> > + .write = dfault_write
> > +};
> > +
> > +static int __init dynamic_fault_init(void)
> > +{
> > + const struct codetag_type_desc desc = {
> > + .section = "dynamic_fault_tags",
> > + .tag_size = sizeof(struct dfault),
> > + };
> > + struct dentry *debugfs_file;
> > +
> > + cttype = codetag_register_type(&desc);
> > + if (IS_ERR_OR_NULL(cttype))
> > + return PTR_ERR(cttype);
> > +
> > + debugfs_file = debugfs_create_file("dynamic_faults", 0666, NULL, NULL, &dfault_ops);
> > + if (IS_ERR(debugfs_file))
> > + return PTR_ERR(debugfs_file);
> > +
> > + return 0;
> > +}
> > +module_init(dynamic_fault_init);
> > --
> > 2.37.2.672.g94769d06f0-goog
> >
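
The enable/oneshot/frequency semantics of __dynamic_fault_enabled() in the patch quoted above can be modelled in user space as follows (Python, single-threaded sketch; the kernel version uses a cmpxchg loop to handle concurrent updates of the packed state word):

```python
# User-space model of the dfault state machine: a site fires according
# to its enabled/oneshot state and an optional frequency, meaning it
# fires once every N calls; a oneshot site disarms after firing once.

DISABLED, ENABLED, ONESHOT = range(3)

class DynamicFault:
    def __init__(self, enabled=DISABLED, frequency=0):
        self.enabled = enabled
        self.frequency = frequency
        self.count = 0

    def should_fail(self):
        if self.enabled == DISABLED:
            return False
        if self.frequency:
            self.count += 1
            fire = self.count >= self.frequency   # ++count >= frequency
        else:
            fire = True
        if fire:
            self.count = 0
            if self.enabled == ONESHOT:
                self.enabled = DISABLED   # oneshot disarms after firing
        return fire
```

With frequency=3 the site fires on every third call; a oneshot site returns True exactly once and then reads as disabled, matching the DFAULT_STATES() transitions in the patch.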

2022-08-31 16:24:58

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote:
> On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > ===========================
> > > > Code tagging framework
> > > > ===========================
> > > > Code tag is a structure identifying a specific location in the source code
> > > > which is generated at compile time and can be embedded in an application-
> > > > specific structure. Several applications of code tagging are included in
> > > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > > latency tracking and improved error code reporting.
> > > > Basically, it takes the old trick of "define a special elf section for
> > > > objects of a given type so that we can iterate over them at runtime" and
> > > > creates a proper library for it.
> > >
> > > I might be super dense this morning, but what!? I've skimmed through the
> > > set and I don't think I get it.
> > >
> > > What does this provide that ftrace/kprobes don't already allow?
> >
> > You're kidding, right?
>
> It's a valid question. From the description, its main addition that would
> be hard to do with ftrace or probes is catching where an error code is
> returned. A secondary addition would be catching all historical state and
> not just state since the tracing started.

Catching all historical state is pretty important in the case of memory
allocation accounting, don't you think?

Also, ftrace can drop events. Not really ideal if under system load your memory
accounting numbers start to drift.

> It's also unclear *who* would enable this. It looks like it would mostly
> have value during the development stage of an embedded platform, to track
> kernel memory usage on a per-application basis in an environment where it
> may be difficult to set up tracing and tracking. Would it ever be enabled
> in production? Would a distribution ever enable this? If it's enabled, any
> overhead cannot be disabled/enabled at run or boot time, so anyone enabling
> this would carry the cost without ever necessarily consuming the data.

The whole point of this is to be cheap enough to enable in production -
especially the latency tracing infrastructure. There's a lot of value to
always-on system visibility infrastructure, so that when a live machine starts
to do something wonky the data is already there.

What we've built here is _far_ cheaper than anything that could be done
with ftrace.

> It might be an ease-of-use thing. Gathering the information from traces
> is tricky and would require combining multiple different elements; that
> is development effort, but not impossible.
>
> Regardless, asking for an explanation as to why equivalent functionality
> cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.

I think perhaps some of the expectation should be on the "ftrace for
everything!" people to explain (a) how their alternative could even be
built and (b) how it would compare in terms of performance and ease of use.

Look, I've been a tracing user for many years, and it has its uses, but some of
the claims I've been hearing from tracing/bpf people when any alternative
tooling is proposed sound like vaporware and bullshitting.

2022-08-31 16:55:14

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 03/30] Lazy percpu counters

On Wed, Aug 31, 2022 at 11:02:49AM +0100, Mel Gorman wrote:
> On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > This patch adds lib/lazy-percpu-counter.c, which implements counters
> > that start out as atomics, but lazily switch to percpu mode if the
> > update rate crosses some threshold (arbitrarily set at 256 per second).
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
>
> Why not use percpu_counter? It has a per-cpu counter that is synchronised
> when a batch threshold (default 32) is exceeded and can explicitly sync
> the counters when required assuming the synchronised count is only needed
> when reading debugfs.

It doesn't switch from atomic mode to percpu mode when the update rate crosses a
threshold like lazy percpu counters do; it allocates all the percpu counters
up front, which makes it a non-starter here.

Also, from my reading of the code... wtf is it even doing, and why would I use
it at all? This looks like old grotty code from ext3; it's not even using
this_cpu_add() - it does preempt_disable()/preempt_enable() just for adding to a local
percpu counter!

Noooooope.

2022-08-31 16:57:31

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 27/30] Code tagging based latency tracking

On Tue, Aug 30, 2022 at 6:53 PM Randy Dunlap <[email protected]> wrote:
>
>
>
> On 8/30/22 14:49, Suren Baghdasaryan wrote:
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index b7d03afbc808..b0f86643b8f0 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -1728,6 +1728,14 @@ config LATENCYTOP
> > Enable this option if you want to use the LatencyTOP tool
> > to find out which userspace is blocking on what kernel operations.
> >
> > +config CODETAG_TIME_STATS
> > + bool "Code tagging based latency measuring"
> > + depends on DEBUG_FS
> > + select TIME_STATS
> > + select CODE_TAGGING
> > + help
> > + Enabling this option makes latency statistics available in debugfs
>
> Missing period at the end of the sentence.

Ack.

>
> --
> ~Randy

2022-08-31 17:18:36

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 8:28 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Wed, Aug 31, 2022 at 3:47 AM Michal Hocko <[email protected]> wrote:
> >
> > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > > > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > > > ===========================
> > > > > > Code tagging framework
> > > > > > ===========================
> > > > > > Code tag is a structure identifying a specific location in the source code
> > > > > > which is generated at compile time and can be embedded in an application-
> > > > > > specific structure. Several applications of code tagging are included in
> > > > > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > > > > latency tracking and improved error code reporting.
> > > > > > Basically, it takes the old trick of "define a special elf section for
> > > > > > objects of a given type so that we can iterate over them at runtime" and
> > > > > > creates a proper library for it.
> > > > >
> > > > > I might be super dense this morning, but what!? I've skimmed through the
> > > > > set and I don't think I get it.
> > > > >
> > > > > What does this provide that ftrace/kprobes don't already allow?
> > > >
> > > > You're kidding, right?
> > >
> > > It's a valid question. From the description, its main addition that would
> > > be hard to do with ftrace or probes is catching where an error code is
> > > returned. A secondary addition would be catching all historical state and
> > > not just state since the tracing started.
> > >
> > > It's also unclear *who* would enable this. It looks like it would mostly
> > > have value during the development stage of an embedded platform to track
> > > kernel memory usage on a per-application basis in an environment where it
> > > may be difficult to set up tracing and tracking. Would it ever be enabled
> > > in production? Would a distribution ever enable this? If it's enabled, any
> > > overhead cannot be disabled/enabled at run or boot time so anyone enabling
> > > this would carry the cost without ever necessarily consuming the data.
>
> Thank you for the question.
> For memory tracking my intent is to have a mechanism that can be enabled
> during field testing (pre-production testing on a large population of
> internal users).
> The issue we are often facing is that memory leaks happen in the field
> but are very hard to reproduce locally. We get a bug report from the user
> which indicates the leak but often does not contain enough information to
> track it down. Note that quite often these leaks/issues happen in the
> drivers, so even simply finding out where they came from is a big help.
> The way I envision this mechanism being used is to enable the basic
> memory tracking in the field tests and have a user-space process collect
> the allocation statistics periodically (say once an hour). Once it
> detects some counter growing infinitely or atypically (the definition of
> this is left to user space) it can enable context capturing only for
> that specific location, still keeping the overhead to a minimum while
> getting more information about potential issues. Collected stats and
> contexts are then attached to the bug report, and we get more visibility
> into the issue when we receive it.
> The goal is to provide a mechanism with low enough overhead that it can
> be enabled all the time during these field tests without affecting the
> device's performance profiles.
> Tracing is very cheap when it's disabled, but having it enabled all the
> time would introduce higher overhead than the counter manipulations.
> My apologies, I should have clarified all this in the cover letter from
> the beginning.
>
> As for other applications, maybe I'm not such an advanced user of
> tracing, but I think only the latency tracking application might be done
> with tracing, assuming we have all the right tracepoints. I don't see
> how we would use tracing for fault injection and descriptive error
> codes. Again, I might be mistaken.

Sorry about the formatting of my reply. Forgot to reconfigure the editor on
the new machine.

>
> Thanks,
> Suren.
>
> > >
> > > It might be an ease-of-use thing. Gathering the information from traces
> > > is tricky and would need combining multiple different elements and that
> > > is development effort but not impossible.
> > >
> > > Whatever asking for an explanation as to why equivalent functionality
> > > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> >
> > Fully agreed and this is especially true for a change this size
> > 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > --
> > Michal Hocko
> > SUSE Labs

2022-08-31 18:03:15

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 22/30] Code tagging based fault injection

On Wed, Aug 31, 2022 at 12:37:14PM +0200, Dmitry Vyukov wrote:
> On Tue, 30 Aug 2022 at 23:50, Suren Baghdasaryan <[email protected]> wrote:
> >
> > From: Kent Overstreet <[email protected]>
> >
> > This adds a new fault injection capability, based on code tagging.
> >
> > To use, simply insert somewhere in your code
> >
> > dynamic_fault("fault_class_name")
> >
> > and check whether it returns true - if so, inject the error.
> > For example
> >
> > if (dynamic_fault("init"))
> > return -EINVAL;
>
> Hi Suren,
>
> If this is going to be used by mainline kernel, it would be good to
> integrate this with fail_nth systematic fault injection:
> https://elixir.bootlin.com/linux/latest/source/lib/fault-inject.c#L109
>
> Otherwise these dynamic sites won't be tested by testing systems doing
> systematic fault injection testing.

That's a discussion we need to have, yeah. We don't want two distinct fault
injection frameworks, so we'll have to discuss whether this is (or can be)
better enough to make a switch worthwhile, and whether a compatibility
interface is needed - or maybe there are enough distinct interesting bits in
both to make merging plausible?

The debugfs interface for this fault injection code is necessarily different
from our existing fault injection - this gives you a fault injection point _per
callsite_, which is huge - e.g. for filesystem testing what I need is to be able
to enable fault injection points within a given module. I can do that easily
with this, not with our current fault injection.

I think the per-callsite fault injection points would also be pretty valuable
for CONFIG_FAULT_INJECTION_USERCOPY, too.

OTOH, existing kernel fault injection can filter based on task - this fault
injection framework doesn't have that. Easy enough to add, though. Similar for
the interval/probability/ratelimit stuff.

fail_function is the odd one out, I'm not sure how that would fit into this
model. Everything else I've seen I think fits into this model.

Also, it sounds like you're more familiar with our existing fault injection than
I am, so if I've misunderstood anything about what it can do please do correct
me.

Interestingly: I just discovered from reading the code that
CONFIG_FAULT_INJECTION_STACKTRACE_FILTER is a thing (hadn't noticed before
because it depends on !X86_64 - what?). That's cool, though.

2022-08-31 20:03:07

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > Whatever asking for an explanation as to why equivalent functionality
> > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
>
> Fully agreed and this is especially true for a change this size
> 77 files changed, 3406 insertions(+), 703 deletions(-)

In the case of memory allocation accounting, you flat cannot do this with ftrace
- you could maybe do a janky version that isn't fully accurate, much slower,
more complicated for the developer to understand and debug and more complicated
for the end user.

But please, I invite anyone who's actually been doing this with ftrace to
demonstrate otherwise.

Ftrace just isn't the right tool for the job here - we're talking about adding
per callsite accounting to some of the fastest fast paths in the kernel.

And the size of the changes for memory allocation accounting are much more
reasonable:
33 files changed, 623 insertions(+), 99 deletions(-)

The code tagging library should exist anyways, it's been open coded half a dozen
times in the kernel already.

And once we've got that, the time stats code is _also_ far simpler than doing it
with ftrace would be. If anyone here has successfully debugged latency issues
with ftrace, I'd really like to hear it. Again, for debugging latency issues you
want something that can always be on, and that's not cheap with ftrace - and
never mind the hassle of correlating start and end wait trace events, building
up histograms, etc. - that's all handled here.

Cheap, simple, easy to use. What more could you want?

2022-08-31 21:40:37

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 12:02 PM Kent Overstreet
<[email protected]> wrote:
>
> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > Whatever asking for an explanation as to why equivalent functionality
> > > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> >
> > Fully agreed and this is especially true for a change this size
> > 77 files changed, 3406 insertions(+), 703 deletions(-)
>
> In the case of memory allocation accounting, you flat cannot do this with ftrace
> - you could maybe do a janky version that isn't fully accurate, much slower,
> more complicated for the developer to understand and debug and more complicated
> for the end user.
>
> But please, I invite anyone who's actually been doing this with ftrace to
> demonstrate otherwise.
>
> Ftrace just isn't the right tool for the job here - we're talking about adding
> per callsite accounting to some of the fastest fast paths in the kernel.
>
> And the size of the changes for memory allocation accounting are much more
> reasonable:
> 33 files changed, 623 insertions(+), 99 deletions(-)
>
> The code tagging library should exist anyways, it's been open coded half a dozen
> times in the kernel already.
>
> And once we've got that, the time stats code is _also_ far simpler than doing it
> with ftrace would be. If anyone here has successfully debugged latency issues
> with ftrace, I'd really like to hear it. Again, for debugging latency issues you
> want something that can always be on, and that's not cheap with ftrace - and
> never mind the hassle of correlating start and end wait trace events, building
> up histograms, etc. - that's all handled here.
>
> Cheap, simple, easy to use. What more could you want?
>

This is very interesting work! Do you have any data about the overhead
this introduces, especially in a production environment? I am
especially interested in memory allocation tracking and detecting
leaks.
(Sorry if you already posted this kind of data somewhere that I missed)

2022-08-31 21:49:51

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 1:56 PM Yosry Ahmed <[email protected]> wrote:
>
> On Wed, Aug 31, 2022 at 12:02 PM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > > Whatever asking for an explanation as to why equivalent functionality
> > > > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> > >
> > > Fully agreed and this is especially true for a change this size
> > > 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > In the case of memory allocation accounting, you flat cannot do this with ftrace
> > - you could maybe do a janky version that isn't fully accurate, much slower,
> > more complicated for the developer to understand and debug and more complicated
> > for the end user.
> >
> > But please, I invite anyone who's actually been doing this with ftrace to
> > demonstrate otherwise.
> >
> > Ftrace just isn't the right tool for the job here - we're talking about adding
> > per callsite accounting to some of the fastest fast paths in the kernel.
> >
> > And the size of the changes for memory allocation accounting are much more
> > reasonable:
> > 33 files changed, 623 insertions(+), 99 deletions(-)
> >
> > The code tagging library should exist anyways, it's been open coded half a dozen
> > times in the kernel already.
> >
> > And once we've got that, the time stats code is _also_ far simpler than doing it
> > with ftrace would be. If anyone here has successfully debugged latency issues
> > with ftrace, I'd really like to hear it. Again, for debugging latency issues you
> > want something that can always be on, and that's not cheap with ftrace - and
> > never mind the hassle of correlating start and end wait trace events, building
> > up histograms, etc. - that's all handled here.
> >
> > Cheap, simple, easy to use. What more could you want?
> >
>
> This is very interesting work! Do you have any data about the overhead
> this introduces, especially in a production environment? I am
> especially interested in memory allocations tracking and detecting
> leaks.

I had the numbers for my previous implementation, before we started using the
lazy percpu counters, but those would not apply to the new implementation. I'll
rerun the measurements and post the exact numbers in a day or so.

> (Sorry if you already posted this kind of data somewhere that I missed)

2022-09-01 05:06:46

by Oscar Salvador

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> ===========================
> Code tagging framework
> ===========================
> Code tag is a structure identifying a specific location in the source code
> which is generated at compile time and can be embedded in an application-
> specific structure. Several applications of code tagging are included in
> this RFC, such as memory allocation tracking, dynamic fault injection,
> latency tracking and improved error code reporting.
> Basically, it takes the old trick of "define a special elf section for
> objects of a given type so that we can iterate over them at runtime" and
> creates a proper library for it.
>
> ===========================
> Memory allocation tracking
> ===========================
> The goal for using codetags for memory allocation tracking is to minimize
> performance and memory overhead. By recording only the call count and
> allocation size, the required operations are kept at the minimum while
> collecting statistics for every allocation in the codebase. With that
> information, if users are interested in more detailed context for a
> specific allocation, they can enable more in-depth context tracking,
> which includes capturing the pid, tgid, task name, allocation size,
> timestamp and call stack for every allocation at the specified code
> location.
> Memory allocation tracking is implemented in two parts:
>
> part1: instruments page and slab allocators to record call count and total
> memory allocated at every allocation in the source code. Every time an
> allocation is performed by an instrumented allocator, the codetag at that
> location increments its call and size counters. Every time the memory is
> freed these counters are decremented. To decrement the counters upon free,
> allocated object needs a reference to its codetag. Page allocators use
> page_ext to record this reference while slab allocators use memcg_data of
> the slab page.
> The data is exposed to the user space via a read-only debugfs file called
> alloc_tags.

Hi Suren,

I just posted a patch [1] and, reading through your changelog and seeing your PoC,
I think we have some kind of overlap.
My patchset aims to give you the stacktrace <-> relationship information and it is
achieved by a small amount of extra code, mostly in page_owner.c and lib/stackdepot.

Of course, your work seems to be more complete wrt. the information you get.

I CCed you in case you want to have a look.

[1] https://lkml.org/lkml/2022/9/1/36

Thanks


--
Oscar Salvador
SUSE Labs

2022-09-01 06:02:50

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 9:52 PM Oscar Salvador <[email protected]> wrote:
>
> On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > ===========================
> > Code tagging framework
> > ===========================
> > Code tag is a structure identifying a specific location in the source code
> > which is generated at compile time and can be embedded in an application-
> > specific structure. Several applications of code tagging are included in
> > this RFC, such as memory allocation tracking, dynamic fault injection,
> > latency tracking and improved error code reporting.
> > Basically, it takes the old trick of "define a special elf section for
> > objects of a given type so that we can iterate over them at runtime" and
> > creates a proper library for it.
> >
> > ===========================
> > Memory allocation tracking
> > ===========================
> > The goal for using codetags for memory allocation tracking is to minimize
> > performance and memory overhead. By recording only the call count and
> > allocation size, the required operations are kept at the minimum while
> > collecting statistics for every allocation in the codebase. With that
> > information, if users are interested in more detailed context for a
> > specific allocation, they can enable more in-depth context tracking,
> > which includes capturing the pid, tgid, task name, allocation size,
> > timestamp and call stack for every allocation at the specified code
> > location.
> > Memory allocation tracking is implemented in two parts:
> >
> > part1: instruments page and slab allocators to record call count and total
> > memory allocated at every allocation in the source code. Every time an
> > allocation is performed by an instrumented allocator, the codetag at that
> > location increments its call and size counters. Every time the memory is
> > freed these counters are decremented. To decrement the counters upon free,
> > allocated object needs a reference to its codetag. Page allocators use
> > page_ext to record this reference while slab allocators use memcg_data of
> > the slab page.
> > The data is exposed to the user space via a read-only debugfs file called
> > alloc_tags.
>
> Hi Suren,
>
> I just posted a patch [1] and reading through your changelog and seeing your PoC,
> I think we have some kind of overlap.
> My patchset aims to give you the stacktrace <-> relationship information and it is
> achieved by a little amount of extra code mostly in page_owner.c/ and lib/stackdepot.
>
> Of course, your works seems to be more complete wrt. the information you get.
>
> I CCed you in case you want to have a look
>
> [1] https://lkml.org/lkml/2022/9/1/36

Hi Oscar,
Thanks for the note. I'll take a look most likely on Friday and will
follow up with you.
Thanks,
Suren.

>
> Thanks
>
>
> --
> Oscar Salvador
> SUSE Labs

2022-09-01 07:06:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote:

> It's also unclear *who* would enable this. It looks like it would mostly
> have value during the development stage of an embedded platform to track
> kernel memory usage on a per-application basis in an environment where it
> may be difficult to set up tracing and tracking. Would it ever be enabled
> in production?

Afaict this is developer only; it is all unconditional code.

> Would a distribution ever enable this?

I would sincerely hope not. Because:

> If it's enabled, any overhead cannot be disabled/enabled at run or
> boot time so anyone enabling this would carry the cost without ever
> necessarily consuming the data.

this.

2022-09-01 07:15:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 03/30] Lazy percpu counters

On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> +static void lazy_percpu_counter_switch_to_pcpu(struct raw_lazy_percpu_counter *c)
> +{
> + u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);

Realize that this is incorrect when used under a raw_spinlock_t.

2022-09-01 07:24:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 27/30] Code tagging based latency tracking

On Tue, Aug 30, 2022 at 02:49:16PM -0700, Suren Baghdasaryan wrote:
> From: Kent Overstreet <[email protected]>
>
> This adds the ability to easily instrument code for measuring latency.
> To use, add the following to calls to your code, at the start and end of
> the event you wish to measure:
>
> code_tag_time_stats_start(start_time);
> code_tag_time_stats_finish(start_time);
>
> Statistics will then show up in debugfs under
> /sys/kernel/debug/time_stats, listed by file and line number.
>
> Statistics measured include weighted averages of frequency, duration, max
> duration, as well as quantiles.
>
> This patch also instruments all calls to init_wait and finish_wait,
> which includes all calls to wait_event. Example debugfs output:

How can't you do this with a simple eBPF script on top of
trace_sched_stat_* and friends?

2022-09-01 07:45:17

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed 31-08-22 15:01:54, Kent Overstreet wrote:
> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > Whatever asking for an explanation as to why equivalent functionality
> > > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> >
> > Fully agreed and this is especially true for a change this size
> > 77 files changed, 3406 insertions(+), 703 deletions(-)
>
> In the case of memory allocation accounting, you flat cannot do this with ftrace
> - you could maybe do a janky version that isn't fully accurate, much slower,
> more complicated for the developer to understand and debug and more complicated
> for the end user.
>
> But please, I invite anyone who's actually been doing this with ftrace to
> demonstrate otherwise.
>
> Ftrace just isn't the right tool for the job here - we're talking about adding
> per callsite accounting to some of the fastest fast paths in the kernel.
>
> And the size of the changes for memory allocation accounting are much more
> reasonable:
> 33 files changed, 623 insertions(+), 99 deletions(-)
>
> The code tagging library should exist anyways, it's been open coded half a dozen
> times in the kernel already.
>
> And once we've got that, the time stats code is _also_ far simpler than doing it
> with ftrace would be. If anyone here has successfully debugged latency issues
> with ftrace, I'd really like to hear it. Again, for debugging latency issues you
> want something that can always be on, and that's not cheap with ftrace - and
> never mind the hassle of correlating start and end wait trace events, builting
> up histograms, etc. - that's all handled here.
>
> Cheap, simple, easy to use. What more could you want?

A big ad on a banner. But more seriously.

This patchset is _huge_ and touches a lot of different areas. It will
be not only hard to review but even harder to maintain long term. So
it is completely reasonable to ask for potential alternatives with a
smaller code footprint. I am pretty sure you are aware of that workflow.

So I find Peter's question completely appropriate, while your response to
it not so much! Maybe ftrace is not the right tool for the intended
job. Maybe there are other ways, and it would be really great to show
that those have been evaluated and are not suitable for reasons a), b)
and c).

E.g. Oscar has been working on extending page_ext to track the number of
allocations for a specific calltrace [1]. Is this a 1:1 replacement? No! But
it can help in environments where page_ext can be enabled and it is
completely non-intrusive to the MM code.

If the page_ext overhead is not desirable/acceptable then I am sure
there are other options. E.g. kprobes/LivePatching framework can hook
into functions and alter their behavior. So why not use that for data
collection? Has this been evaluated at all?

And please note that I am not claiming the presented work is approaching
the problem from a wrong direction. It might very well solve multiple
problems in a single go _but_ the long-term code maintenance burden
really has to be carefully evaluated, and if we can achieve a
reasonable subset of the functionality with existing infrastructure
then I would be inclined to sacrifice some portions for a considerably
smaller code footprint.

[1] http://lkml.kernel.org/r/[email protected]

--
Michal Hocko
SUSE Labs

2022-09-01 08:07:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 11:59:41AM -0400, Kent Overstreet wrote:

> Also, ftrace can drop events. Not really ideal if under system load your memory
> accounting numbers start to drift.

You could attach custom handlers to tracepoints. If you were to replace
these unconditional code hooks of yours with tracepoints then you could
conditionally (say at boot) register custom handlers that do the
accounting you want.

Nobody is mandating you use the ftrace ringbuffer to consume tracepoints.
Many people these days attach eBPF scripts to them and do whatever they
want.

2022-09-01 08:32:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 01, 2022 at 09:05:36AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 31, 2022 at 11:59:41AM -0400, Kent Overstreet wrote:
>
> > Also, ftrace can drop events. Not really ideal if under system load your memory
> > accounting numbers start to drift.
>
> You could attach custom handlers to tracepoints. If you were to replace
> these unconditional code hooks of yours with tracepoints then you could
> conditionally (say at boot) register custom handlers that do the
> accounting you want.
>
> Nobody is mandating you use the ftrace ringbuffer to consume tracepoints.
> Many people these days attach eBPF scripts to them and do whatever they
> want.

Look at kernel/trace/blktrace.c for a fine in-kernel !BPF example of this.

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On 9/1/22 09:05, Peter Zijlstra wrote:
>> Also, ftrace can drop events. Not really ideal if under system load your memory
>> accounting numbers start to drift.
> You could attach custom handlers to tracepoints. If you were to replace
> these unconditional code hooks of yours with tracepoints then you could
> conditionally (say at boot) register custom handlers that do the
> accounting you want.

That is the strategy in RV (kernel/trace/rv/). It is in C, but I am also
adding support for monitors in bpf. The osnoise/timerlat tracers work this
way too, and they are enabled in Fedora/Red Hat/SUSE... production. They
will also be enabled in Ubuntu and Debian (the interwebs say).

The overhead of attaching code to tracepoints (or any "attachable thing") and
processing the data in the kernel is often lower than consuming it in user
space - when it is possible, of course, e.g., when you respect locking rules, etc.

This paper (the basis for RV) shows a little comparison:
https://bristot.me/wp-content/uploads/2019/09/paper.pdf

By doing so, we also avoid problems of losing events... and you can also
generate other events from your attached code.

(It is also way easier to convince a maintainer to add a tracepoint or a trace
event than to add arbitrary code... ;-)

-- Daniel

2022-09-01 08:48:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On 31.08.22 21:01, Kent Overstreet wrote:
> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
>> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
>>> Whatever asking for an explanation as to why equivalent functionality
>>> cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
>>
>> Fully agreed and this is especially true for a change this size
>> 77 files changed, 3406 insertions(+), 703 deletions(-)
>
> In the case of memory allocation accounting, you flat cannot do this with ftrace
> - you could maybe do a janky version that isn't fully accurate, much slower,
> more complicated for the developer to understand and debug and more complicated
> for the end user.
>
> But please, I invite anyone who's actually been doing this with ftrace to
> demonstrate otherwise.
>
> Ftrace just isn't the right tool for the job here - we're talking about adding
> per callsite accounting to some of the fastest fast paths in the kernel.
>
> And the size of the changes for memory allocation accounting are much more
> reasonable:
> 33 files changed, 623 insertions(+), 99 deletions(-)
>
> The code tagging library should exist anyways, it's been open coded half a dozen
> times in the kernel already.

Hi Kent,

independent of the other discussions, if it's open coded already, does
it make sense to factor that already-open-coded part out independently
of the remainder of the full series here?

[I didn't immediately spot whether this series already attempts to
replace that open-coded part]

--
Thanks,

David / dhildenb

2022-09-01 09:25:40

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [RFC PATCH 22/30] Code tagging based fault injection

On Wed, 31 Aug 2022 at 19:30, Kent Overstreet
<[email protected]> wrote:
> > > From: Kent Overstreet <[email protected]>
> > >
> > > This adds a new fault injection capability, based on code tagging.
> > >
> > > To use, simply insert somewhere in your code
> > >
> > > dynamic_fault("fault_class_name")
> > >
> > > and check whether it returns true - if so, inject the error.
> > > For example
> > >
> > > if (dynamic_fault("init"))
> > > return -EINVAL;
> >
> > Hi Suren,
> >
> > If this is going to be used by mainline kernel, it would be good to
> > integrate this with fail_nth systematic fault injection:
> > https://elixir.bootlin.com/linux/latest/source/lib/fault-inject.c#L109
> >
> > Otherwise these dynamic sites won't be tested by testing systems doing
> > systematic fault injection testing.
>
> That's a discussion we need to have, yeah. We don't want two distinct fault
> injection frameworks, we'll have to have a discussion as to whether this is (or
> can be) better enough to make a switch worthwhile, and whether a compatibility
> interface is needed - or maybe there's enough distinct interesting bits in both
> to make merging plausible?
>
> The debugfs interface for this fault injection code is necessarily different
> from our existing fault injection - this gives you a fault injection point _per
> callsite_, which is huge - e.g. for filesystem testing what I need is to be able
> to enable fault injection points within a given module. I can do that easily
> with this, not with our current fault injection.
>
> I think the per-callsite fault injection points would also be pretty valuable
> for CONFIG_FAULT_INJECTION_USERCOPY, too.
>
> OTOH, existing kernel fault injection can filter based on task - this fault
> injection framework doesn't have that. Easy enough to add, though. Similar for
> the interval/probability/ratelimit stuff.
>
> fail_function is the odd one out, I'm not sure how that would fit into this
> model. Everything else I've seen I think fits into this model.
>
> Also, it sounds like you're more familiar with our existing fault injection than
> I am, so if I've misunderstood anything about what it can do please do correct
> me.

What you are saying makes sense. But I can't say if we want to do a
global switch or not. I don't know how many existing users there are
(by users I mean automated testing b/c humans can switch for one-off
manual testing).

However, fail_nth that I mentioned is orthogonal to this. It's a
different mechanism for selecting the fault site that needs to be failed
(similar to what you mentioned as "interval/probability/ratelimit
stuff"). fail_nth allows failing the specified n-th call site in the
specified task. And that's the only mechanism we use in
syzkaller/syzbot.
And I think it can be supported relatively easily (copy a few lines to
the "does this site need to fail" check).

I don't know how exactly you want to use this new mechanism, but I
found fail_nth much better than any of the existing selection
mechanisms, including what this will add for specific site failing.

fail_nth allows failing every site in a given test/syscall one-by-one,
systematically. E.g. we can even have an strace-like utility that reruns
the given test, failing all of its sites in turn:
$ fail_all ./a_unit_test
This can be integrated into any CI system, e.g. running all LTP tests with this.
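The loop Dmitry describes can be sketched in userspace C, following the fail-nth pattern documented in Documentation/fault-injection/fault-injection.rst. This is a hedged sketch rather than the syzkaller implementation; run_one_test-style callbacks and the site limit are placeholders:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Reading fail-nth back after the test returns 0 once the armed fault
 * has been injected; a positive value means it never fired. */
static int fault_was_injected(const char *readback)
{
	return atoi(readback) == 0;
}

/* Run testfn repeatedly, failing its 1st, 2nd, 3rd, ... fault site in
 * turn. testfn is a placeholder for the unit-test body, which must run
 * in this same task. Returns sites covered, or -1 if fail-nth is absent. */
static int fail_all_sites(int (*testfn)(void))
{
	char buf[16];
	int fd, n;

	fd = open("/proc/thread-self/fail-nth", O_RDWR);
	if (fd < 0)
		return -1;	/* kernel lacks CONFIG_FAULT_INJECTION */

	for (n = 1; n < 1000; n++) {	/* arbitrary safety limit */
		snprintf(buf, sizeof(buf), "%d", n);
		if (write(fd, buf, strlen(buf)) < 0)	/* arm the n-th site */
			break;
		testfn();
		memset(buf, 0, sizeof(buf));
		lseek(fd, 0, SEEK_SET);
		if (read(fd, buf, sizeof(buf) - 1) <= 0)
			break;
		if (!fault_was_injected(buf))	/* walked past the last site */
			break;
	}
	close(fd);
	return n - 1;
}
```

A nonzero read-back means the loop has walked past the last fault site reached by the test, i.e. every synchronous error path has been exercised once.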

For file:line-based selection, first, we need to get these file:line
pairs from somewhere; second, line numbers change over time so they can't
be hardcoded in tests; third, it still needs to be per-task, since
unrelated processes can execute the same code.

One downside of fail_nth, though, is that it does not cover background
threads/async work. But we found that there are so many untested
synchronous error paths, that moving to background threads is not
necessary at this point.



> Interestingly: I just discovered from reading the code that
> CONFIG_FAULT_INJECTION_STACKTRACE_FILTER is a thing (hadn't before because it
> depends on !X86_64 - what?). That's cool, though.

2022-09-01 11:53:48

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 11:59:41AM -0400, Kent Overstreet wrote:
> On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote:
> > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > > ===========================
> > > > > Code tagging framework
> > > > > ===========================
> > > > > Code tag is a structure identifying a specific location in the source code
> > > > > which is generated at compile time and can be embedded in an application-
> > > > > specific structure. Several applications of code tagging are included in
> > > > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > > > latency tracking and improved error code reporting.
> > > > > Basically, it takes the old trick of "define a special elf section for
> > > > > objects of a given type so that we can iterate over them at runtime" and
> > > > > creates a proper library for it.
> > > >
> > > > I might be super dense this morning, but what!? I've skimmed through the
> > > > set and I don't think I get it.
> > > >
> > > > What does this provide that ftrace/kprobes don't already allow?
> > >
> > > You're kidding, right?
> >
> > It's a valid question. From the description, its main addition that would
> > be hard to do with ftrace or probes is catching where an error code is
> > returned. A secondary addition would be catching all historical state and
> > not just state since the tracing started.
>
> Catching all historical state is pretty important in the case of memory
> allocation accounting, don't you think?
>

Not always. If the intent is to catch a memory leak that gets worse over
time, early boot should be sufficient. Sure, there might be drivers that leak
memory allocated at init but if it's not a growing leak, it doesn't matter.

> Also, ftrace can drop events. Not really ideal if under system load your memory
> accounting numbers start to drift.
>

As pointed out elsewhere, attaching to the tracepoint and recording relevant
state is an option other than trying to parse a raw ftrace feed. For memory
leaks, there are already tracepoints for page allocation and free that could
be used to track allocations that are not freed at a given point in time.
There is also the kernel memory leak detector although I never had reason
to use it (https://www.kernel.org/doc/html/v6.0-rc3/dev-tools/kmemleak.html)
and it sounds like it would be expensive.
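For illustration only (a deliberately simple-minded sketch, not mmtests code): the userspace side of the tracker Mel describes just has to remember which PFNs seen in mm_page_alloc events have not yet appeared in a matching mm_page_free. The trace_pipe parsing itself is omitted here, and a real tool would record the call site alongside each PFN:

```c
/* Track PFNs from mm_page_alloc / mm_page_free tracepoint events.
 * The fixed-size array is illustrative; sizes are arbitrary. */
#define MAX_OUTSTANDING 65536

static unsigned long pfns[MAX_OUTSTANDING];
static int npfns;

static void page_alloc_event(unsigned long pfn)
{
	if (npfns < MAX_OUTSTANDING)
		pfns[npfns++] = pfn;
}

static void page_free_event(unsigned long pfn)
{
	int i;

	for (i = 0; i < npfns; i++) {
		if (pfns[i] == pfn) {
			pfns[i] = pfns[--npfns];	/* swap-remove */
			return;
		}
	}
	/* frees of untracked pages (e.g. allocated before tracing) ignored */
}

/* Pages allocated but not yet freed: leak candidates at this point in time. */
static int outstanding_pages(void)
{
	return npfns;
}
```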

> > It's also unclear *who* would enable this. It looks like it would mostly
> > have value during the development stage of an embedded platform to track
> > kernel memory usage on a per-application basis in an environment where it
> > may be difficult to setup tracing and tracking. Would it ever be enabled
> > in production? Would a distribution ever enable this? If it's enabled, any
> > overhead cannot be disabled/enabled at run or boot time so anyone enabling
> > this would carry the cost without necessarily consuming the data.
>
> The whole point of this is to be cheap enough to enable in production -
> especially the latency tracing infrastructure. There's a lot of value to
> always-on system visibility infrastructure, so that when a live machine starts
> to do something wonky the data is already there.
>

Sure, there is value but nothing stops the tracepoints being attached as
a boot-time service where interested. For latencies, there are already
bpf examples for tracing individual function latency over time, e.g.
https://github.com/iovisor/bcc/blob/master/tools/funclatency.py although
I haven't used it recently.

Live parsing of ftrace is possible, albeit expensive.
https://github.com/gormanm/mmtests/blob/master/monitors/watch-highorder.pl
tracks counts of high-order allocations and dumps a report on interrupt as
an example of live parsing ftrace and only recording interesting state. It's
not tracking state you are interested in but it demonstrates it is possible
to rely on ftrace alone and monitor from userspace. It's bit-rotted but
can be fixed with

diff --git a/monitors/watch-highorder.pl b/monitors/watch-highorder.pl
index 8c80ae79e556..fd0d477427df 100755
--- a/monitors/watch-highorder.pl
+++ b/monitors/watch-highorder.pl
@@ -52,7 +52,7 @@ my $regex_pagealloc;

# Static regex used. Specified like this for readability and for use with /o
# (process_pid) (cpus ) ( time ) (tpoint ) (details)
-my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9. ]*):\s*([a-zA-Z_]*):\s*(.*)';
my $regex_statname = '[-0-9]*\s\((.*)\).*';
my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';

@@ -73,6 +73,7 @@ sub generate_traceevent_regex {
$regex =~ s/%p/\([0-9a-f]*\)/g;
$regex =~ s/%d/\([-0-9]*\)/g;
$regex =~ s/%lu/\([0-9]*\)/g;
+ $regex =~ s/%lx/\([0-9a-zA-Z]*\)/g;
$regex =~ s/%s/\([A-Z_|]*\)/g;
$regex =~ s/\(REC->gfp_flags\).*/REC->gfp_flags/;
$regex =~ s/\",.*//;

Example output

3 instances order=2 normal gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ZERO
=> trace_event_raw_event_mm_page_alloc+0x7d/0xc0 <ffffffffb1caeccd>
=> __alloc_pages+0x188/0x250 <ffffffffb1cee8a8>
=> kmalloc_large_node+0x3f/0x80 <ffffffffb1d1cd3f>
=> __kmalloc_node+0x321/0x420 <ffffffffb1d22351>
=> kvmalloc_node+0x46/0xe0 <ffffffffb1ca4906>
=> ttm_sg_tt_init+0x88/0xb0 [ttm] <ffffffffc03a02c8>
=> amdgpu_ttm_tt_create+0x4f/0x80 [amdgpu] <ffffffffc04cff0f>
=> ttm_tt_create+0x59/0x90 [ttm] <ffffffffc03a03b9>
=> ttm_bo_handle_move_mem+0x7e/0x1c0 [ttm] <ffffffffc03a0d9e>
=> ttm_bo_validate+0xc5/0x140 [ttm] <ffffffffc03a2095>
=> ttm_bo_init_reserved+0x17b/0x200 [ttm] <ffffffffc03a228b>
=> amdgpu_bo_create+0x1a3/0x470 [amdgpu] <ffffffffc04d36c3>
=> amdgpu_bo_create_user+0x34/0x60 [amdgpu] <ffffffffc04d39c4>
=> amdgpu_gem_create_ioctl+0x131/0x3a0 [amdgpu] <ffffffffc04d94f1>
=> drm_ioctl_kernel+0xb5/0x140 <ffffffffb21652c5>
=> drm_ioctl+0x224/0x3e0 <ffffffffb2165574>
=> amdgpu_drm_ioctl+0x49/0x80 [amdgpu] <ffffffffc04bd2d9>
=> __x64_sys_ioctl+0x8a/0xc0 <ffffffffb1d7c2da>
=> do_syscall_64+0x5c/0x90 <ffffffffb253016c>
=> entry_SYSCALL_64_after_hwframe+0x63/0xcd <ffffffffb260009b>

3 instances order=1 normal gfp_flags=GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
=> trace_event_raw_event_mm_page_alloc+0x7d/0xc0 <ffffffffb1caeccd>
=> __alloc_pages+0x188/0x250 <ffffffffb1cee8a8>
=> __folio_alloc+0x17/0x50 <ffffffffb1cef1a7>
=> vma_alloc_folio+0x8f/0x350 <ffffffffb1d11e4f>
=> __handle_mm_fault+0xa1e/0x1120 <ffffffffb1cc80ee>
=> handle_mm_fault+0xb2/0x280 <ffffffffb1cc88a2>
=> do_user_addr_fault+0x1b9/0x690 <ffffffffb1a89949>
=> exc_page_fault+0x67/0x150 <ffffffffb2534627>
=> asm_exc_page_fault+0x22/0x30 <ffffffffb2600b62>

It's not tracking leaks because that is not what I was interested in at
the time, but it could be done with the same method by recording the PFNs
that were allocated and their call sites, then reporting those never
freed. These days, this approach may
be a bit unexpected but it was originally written 13 years ago. It could
have been done with systemtap back then but my recollection was that it
was difficult to keep systemtap working with rc kernels.

> What we've built here this is _far_ cheaper than anything that could be done
> with ftrace.
>
> > It might be an ease-of-use thing. Gathering the information from traces
> > is tricky and would need combining multiple different elements and that
> > is development effort but not impossible.
> >
> > Whatever asking for an explanation as to why equivalent functionality
> > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
>
> I think perhaps some of the expectation should be on the "ftrace for
> everything!" people to explain a: how their alternative could be even built and
> b: how it would compare in terms of performance and ease of use.
>

Ease of use is a valid criticism, as there is effort required to develop
the state tracking of in-kernel events, be it from live parsing ftrace,
attaching to tracepoints with systemtap/bpf/whatever, and the like. The
main disadvantage of an in-kernel implementation is three-fold. First,
it doesn't work with older kernels without backports. Second, if something
slightly different is needed then it's a kernel rebuild. Third, if the
option is not enabled in the deployed kernel config then you are relying
on the end user being willing to deploy a custom kernel. The initial
investment in doing memory leak tracking or latency tracking by attaching
to tracepoints is significant but it works with older kernels up to a point
and is less sensitive to the kernel config options selected as features
like ftrace are often selected.

--
Mel Gorman
SUSE Labs

2022-09-01 14:47:28

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 01, 2022 at 09:00:17AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote:
>
> > It's also unclear *who* would enable this. It looks like it would mostly
> > have value during the development stage of an embedded platform to track
> > kernel memory usage on a per-application basis in an environment where it
> > may be difficult to setup tracing and tracking. Would it ever be enabled
> > in production?
>
> Afaict this is developer only; it is all unconditional code.
>
> > Would a distribution ever enable this?
>
> I would sincerely hope not. Because:
>
> > If it's enabled, any overhead cannot be disabled/enabled at run or
> > boot time so anyone enabling this would carry the cost without
> > necessarily consuming the data.
>
> this.

We could make it a boot parameter, with the alternatives infrastructure - with a
bit of refactoring there'd be a single function call to nop out, and then we
could also drop the elf sections as well, so that when built in but disabled the
overhead would be practically nil.

2022-09-01 14:48:06

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 03/30] Lazy percpu counters

On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > +static void lazy_percpu_counter_switch_to_pcpu(struct raw_lazy_percpu_counter *c)
> > +{
> > + u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
>
> Realize that this is incorrect when used under a raw_spinlock_t.

Can you elaborate?

2022-09-01 14:48:45

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote:
> On 31.08.22 21:01, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> >> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> >>> Whatever asking for an explanation as to why equivalent functionality
> >>> cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> >>
> >> Fully agreed and this is especially true for a change this size
> >> 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > In the case of memory allocation accounting, you flat cannot do this with ftrace
> > - you could maybe do a janky version that isn't fully accurate, much slower,
> > more complicated for the developer to understand and debug and more complicated
> > for the end user.
> >
> > But please, I invite anyone who's actually been doing this with ftrace to
> > demonstrate otherwise.
> >
> > Ftrace just isn't the right tool for the job here - we're talking about adding
> > per callsite accounting to some of the fastest fast paths in the kernel.
> >
> > And the size of the changes for memory allocation accounting are much more
> > reasonable:
> > 33 files changed, 623 insertions(+), 99 deletions(-)
> >
> > The code tagging library should exist anyways, it's been open coded half a dozen
> > times in the kernel already.
>
> Hi Kent,
>
> independent of the other discussions, if it's open coded already, does
> it make sense to factor that already-open-coded part out independently
> of the remainder of the full series here?

It's discussed in the cover letter, that is exactly how the patch series is
structured.

> [I didn't immediately spot if this series also attempts already to
> replace that open-coded part]

Uh huh.

Honestly, some days it feels like lkml is just as bad as slashdot, with people
wanting to get in their two cents without actually reading...

2022-09-01 14:52:05

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 03/30] Lazy percpu counters

On Thu, 1 Sep 2022 10:32:19 -0400
Kent Overstreet <[email protected]> wrote:

> On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > > +static void lazy_percpu_counter_switch_to_pcpu(struct raw_lazy_percpu_counter *c)
> > > +{
> > > + u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
> >
> > Realize that this is incorrect when used under a raw_spinlock_t.
>
> Can you elaborate?

All allocations (including GFP_ATOMIC) grab normal spin_locks. When
PREEMPT_RT is configured, normal spin_locks turn into mutexes, whereas
raw_spinlocks do not.

Thus, if this is done within a raw_spinlock with PREEMPT_RT configured, it
can cause a schedule while holding a spinlock.
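A sketch of the problematic pattern Steve describes, in kernel context (illustrative names, not code from the patch):

```c
raw_spin_lock(&c->lock);	/* raw_spinlock_t never sleeps, even on RT */

/* Broken on PREEMPT_RT: alloc_percpu_gfp() internally takes spinlock_t
 * locks, which are sleeping locks there, so this can schedule while a
 * raw spinlock is held. GFP_ATOMIC only affects reclaim behaviour; it
 * does not change which locks the allocator itself takes. */
pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC | __GFP_NOWARN);

raw_spin_unlock(&c->lock);
```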

-- Steve

2022-09-01 15:45:05

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On 01.09.22 16:23, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote:
>> On 31.08.22 21:01, Kent Overstreet wrote:
>>> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
>>>> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
>>>>> Whatever asking for an explanation as to why equivalent functionality
>>>>> cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
>>>>
>>>> Fully agreed and this is especially true for a change this size
>>>> 77 files changed, 3406 insertions(+), 703 deletions(-)
>>>
>>> In the case of memory allocation accounting, you flat cannot do this with ftrace
>>> - you could maybe do a janky version that isn't fully accurate, much slower,
>>> more complicated for the developer to understand and debug and more complicated
>>> for the end user.
>>>
>>> But please, I invite anyone who's actually been doing this with ftrace to
>>> demonstrate otherwise.
>>>
>>> Ftrace just isn't the right tool for the job here - we're talking about adding
>>> per callsite accounting to some of the fastest fast paths in the kernel.
>>>
>>> And the size of the changes for memory allocation accounting are much more
>>> reasonable:
>>> 33 files changed, 623 insertions(+), 99 deletions(-)
>>>
>>> The code tagging library should exist anyways, it's been open coded half a dozen
>>> times in the kernel already.
>>
>> Hi Kent,
>>
>> independent of the other discussions, if it's open coded already, does
>> it make sense to factor that already-open-coded part out independently
>> of the remainder of the full series here?
>
> It's discussed in the cover letter, that is exactly how the patch series is
> structured.

Skimming over the patches (that I was CCed on) and skimming over the
cover letter, I got the impression that everything after patch 7 is
introducing something new instead of refactoring something out.

>
>> [I didn't immediately spot if this series also attempts already to
>> replace that open-coded part]
>
> Uh huh.
>
> Honestly, some days it feels like lkml is just as bad as slashdot, with people
> wanting to get in their two cents without actually reading...

... and of course you had to reply like that. I should just have learned
from my last upstream experience with you and kept you on my spam list.

Thanks, bye

--
Thanks,

David / dhildenb

2022-09-01 15:47:46

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 27/30] Code tagging based latency tracking

On Thu, Sep 01, 2022 at 09:11:17AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 30, 2022 at 02:49:16PM -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following to calls to your code, at the start and end of
> > the event you wish to measure:
> >
> > code_tag_time_stats_start(start_time);
> > code_tag_time_stats_finish(start_time);
> >
> > Statistics will then show up in debugfs under
> > /sys/kernel/debug/time_stats, listed by file and line number.
> >
> > Statistics measured include weighted averages of frequency, duration, max
> > duration, as well as quantiles.
> >
> > This patch also instruments all calls to init_wait and finish_wait,
> > which includes all calls to wait_event. Example debugfs output:
>
> How can't you do this with a simple eBPF script on top of
> trace_sched_stat_* and friends?

I know about those tracepoints, and I've never found them to be usable. I've
never successfully used them for debugging latency issues, or known anyone who
has.

And an eBPF script to do everything this does wouldn't be simple at all.
Honestly, the time stats stuff looks _far_ simpler to me than anything involving
tracing - and with tracing you have to correlate the start and end events after
the fact.

2022-09-01 16:25:56

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 31-08-22 15:01:54, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > > Whatever asking for an explanation as to why equivalent functionality
> > > > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> > >
> > > Fully agreed and this is especially true for a change this size
> > > 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > In the case of memory allocation accounting, you flat cannot do this with ftrace
> > - you could maybe do a janky version that isn't fully accurate, much slower,
> > more complicated for the developer to understand and debug and more complicated
> > for the end user.
> >
> > But please, I invite anyone who's actually been doing this with ftrace to
> > demonstrate otherwise.
> >
> > Ftrace just isn't the right tool for the job here - we're talking about adding
> > per callsite accounting to some of the fastest fast paths in the kernel.
> >
> > And the size of the changes for memory allocation accounting are much more
> > reasonable:
> > 33 files changed, 623 insertions(+), 99 deletions(-)
> >
> > The code tagging library should exist anyways, it's been open coded half a dozen
> > times in the kernel already.
> >
> > And once we've got that, the time stats code is _also_ far simpler than doing it
> > with ftrace would be. If anyone here has successfully debugged latency issues
> > with ftrace, I'd really like to hear it. Again, for debugging latency issues you
> > want something that can always be on, and that's not cheap with ftrace - and
> > never mind the hassle of correlating start and end wait trace events, building
> > up histograms, etc. - that's all handled here.
> >
> > Cheap, simple, easy to use. What more could you want?
>
> A big ad on a banner. But more seriously.
>
> This patchset is _huge_ and touching a lot of different areas. It will
> be not only hard to review but even harder to maintain longterm. So
> it is completely reasonable to ask for potential alternatives with a
> smaller code footprint. I am pretty sure you are aware of that workflow.

The patchset is huge because it introduces a reusable part (the first
6 patches introducing code tagging) and 6 different applications in
very different areas of the kernel. We wanted to present all of them
in the RFC to show the variety of cases this mechanism can be reused
for. If the code tagging is accepted, each application can be posted
separately to the appropriate group of people. Hopefully that makes it
easier to review. Those first 6 patches are not that big and are quite
isolated IMHO:

include/linux/codetag.h | 83 ++++++++++
include/linux/lazy-percpu-counter.h | 67 ++++++++
include/linux/module.h | 1 +
kernel/module/internal.h | 1 -
kernel/module/main.c | 4 +
lib/Kconfig | 3 +
lib/Kconfig.debug | 4 +
lib/Makefile | 3 +
lib/codetag.c | 248 ++++++++++++++++++++++++++++
lib/lazy-percpu-counter.c | 141 ++++++++++++++++
lib/string_helpers.c | 3 +-
scripts/kallsyms.c | 13 ++

>
> So I find Peter's question completely appropriate while your response to
> that not so much! Maybe ftrace is not the right tool for the intented
> job. Maybe there are other ways and it would be really great to show
> that those have been evaluated and they are not suitable for a), b) and
> c) reasons.

That's fair.
For memory tracking I looked into using kmemleak and page_owner which
can't match the required functionality at an overhead acceptable for
production and pre-production testing environments. traces + BPF I
haven't evaluated myself but heard from other members of my team who
tried using that in a production environment with poor results. I'll try
to get more specific information on that.

>
> E.g. Oscar has been working on extending page_ext to track number of
> allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> it can help in environments where page_ext can be enabled and it is
> completely non-intrusive to the MM code.

Thanks for pointing out this work. I'll need to review and maybe
profile it before making any claims.

>
> If the page_ext overhead is not desirable/acceptable then I am sure
> there are other options. E.g. kprobes/LivePatching framework can hook
> into functions and alter their behavior. So why not use that for data
> collection? Has this been evaluated at all?

I'm not sure how I can hook into say alloc_pages() to find out where
it was called from without capturing the call stack (which would
introduce an overhead at every allocation). Would love to discuss this
or other alternatives if they can be done with low enough overhead.
Thanks,
Suren.

>
> And please note that I am not claiming the presented work is approaching
> the problem from a wrong direction. It might very well solve multiple
> problems in a single go _but_ the long term code maintenance burden
> really has to be carefully evaluated and if we can achieve a
> reasonable subset of the functionality with an existing infrastructure
> then I would be inclined to sacrifice some portions with a considerably
> smaller code footprint.
>
> [1] http://lkml.kernel.org/r/[email protected]
>
> --
> Michal Hocko
> SUSE Labs

2022-09-01 17:10:38

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 03/30] Lazy percpu counters

On Thu, Sep 01, 2022 at 10:48:39AM -0400, Steven Rostedt wrote:
> On Thu, 1 Sep 2022 10:32:19 -0400
> Kent Overstreet <[email protected]> wrote:
>
> > On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > > > +static void lazy_percpu_counter_switch_to_pcpu(struct raw_lazy_percpu_counter *c)
> > > > +{
> > > > + u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
> > >
> > > Realize that this is incorrect when used under a raw_spinlock_t.
> >
> > Can you elaborate?
>
> All allocations (including GFP_ATOMIC) grab normal spin_locks. When
> PREEMPT_RT is configured, normal spin_locks turn into mutexes, whereas
> raw_spinlocks do not.
>
> Thus, if this is done within a raw_spinlock with PREEMPT_RT configured, it
> can cause a schedule while holding a spinlock.

Thanks, I think we should be good here but I'll document it anyways.

2022-09-01 17:15:09

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 01, 2022 at 05:07:06PM +0200, David Hildenbrand wrote:
> Skimming over the patches (that I was CCed on) and skimming over the
> cover letter, I got the impression that everything after patch 7 is
> introducing something new instead of refactoring something out.

You skimmed over the dynamic debug patch then...

2022-09-01 17:28:32

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 1, 2022 at 8:07 AM David Hildenbrand <[email protected]> wrote:
>
> On 01.09.22 16:23, Kent Overstreet wrote:
> > On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote:
> >> On 31.08.22 21:01, Kent Overstreet wrote:
> >>> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> >>>> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> >>>>> Whatever asking for an explanation as to why equivalent functionality
> >>>>> cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> >>>>
> >>>> Fully agreed and this is especially true for a change this size
> >>>> 77 files changed, 3406 insertions(+), 703 deletions(-)
> >>>
> >>> In the case of memory allocation accounting, you flat cannot do this with ftrace
> >>> - you could maybe do a janky version that isn't fully accurate, much slower,
> >>> more complicated for the developer to understand and debug and more complicated
> >>> for the end user.
> >>>
> >>> But please, I invite anyone who's actually been doing this with ftrace to
> >>> demonstrate otherwise.
> >>>
> >>> Ftrace just isn't the right tool for the job here - we're talking about adding
> >>> per callsite accounting to some of the fastest fast paths in the kernel.
> >>>
> >>> And the size of the changes for memory allocation accounting are much more
> >>> reasonable:
> >>> 33 files changed, 623 insertions(+), 99 deletions(-)
> >>>
> >>> The code tagging library should exist anyways, it's been open coded half a dozen
> >>> times in the kernel already.
> >>
> >> Hi Kent,
> >>
> >> independent of the other discussions, if it's open coded already, does
> >> it make sense to factor that already-open-coded part out independently
> >> of the remainder of the full series here?
> >
> > It's discussed in the cover letter, that is exactly how the patch series is
> > structured.
>
> Skimming over the patches (that I was CCed on) and skimming over the
> cover letter, I got the impression that everything after patch 7 is
> introducing something new instead of refactoring something out.

Hi David,
Yes, you are right, the RFC does incorporate lots of parts which can
be considered separately. They are sent together to present the
overall scope of the proposal but I do intend to send them separately
once we decide if it's worth working on.
Thanks,
Suren.

>
> >
> >> [I didn't immediately spot if this series also attempts already to
> >> replace that open-coded part]
> >
> > Uh huh.
> >
> > Honestly, some days it feels like lkml is just as bad as slashdot, with people
> > wanting to get in their two cents without actually reading...
>
> ... and of course you had to reply like that. I should just have learned
> from my last upstream experience with you and kept you on my spam list.
>
> Thanks, bye
>
> --
> Thanks,
>
> David / dhildenb
>

2022-09-01 17:29:41

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 01, 2022 at 12:05:01PM +0100, Mel Gorman wrote:
> As pointed out elsewhere, attaching to the tracepoint and recording relevant
> state is an option other than trying to parse a raw ftrace feed. For memory
> leaks, there are already tracepoints for page allocation and free that could
> be used to track allocations that are not freed at a given point in time.

Page allocation tracepoints are not sufficient for what we're trying to do here,
and a substantial amount of effort in this patchset has gone into just getting
the hooking locations right - our memory allocation interfaces are not trivial.

That's something people should keep in mind when commenting on the size of this
patchset, since that's effort that would have to be spent for /any/ complete
solution, be it tracepoint based or not.

Additionally, we need to be able to write assertions that verify that our hook
locations are correct, that allocations or frees aren't getting double counted
or missed - highly necessary given the maze of nested memory allocation
interfaces we have (i.e. slab.h), and it's something a tracepoint based
implementation would have to account for - otherwise, a tool isn't very useful
if you can't trust the numbers it's giving you.

And then you have to correlate the allocate and free events, so that you know
which allocate callsite to decrement the amount freed from.

How would you plan on doing that with tracepoints?
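For illustration, here is a minimal userspace sketch of the bookkeeping a tracepoint consumer would need to do that correlation - a table of live allocations keyed by pointer, updated on every alloc/free event. Every name here is invented; it is a sketch of the problem, not any existing tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of tracepoint-side correlation: a table of live allocations,
 * keyed by pointer, so a free event can be attributed back to the
 * callsite that allocated it. All names are illustrative. */
#define MAP_SLOTS 1024

struct live_alloc {
	uintptr_t ptr;        /* 0 marks an empty slot */
	const char *callsite;
	size_t size;
};

static struct live_alloc live[MAP_SLOTS];

static void on_alloc_event(uintptr_t ptr, const char *callsite, size_t size)
{
	for (int i = 0; i < MAP_SLOTS; i++) {
		if (!live[i].ptr) {
			live[i] = (struct live_alloc){ ptr, callsite, size };
			return;
		}
	}
	/* a real tool must also handle table overflow and lost events */
}

static const char *on_free_event(uintptr_t ptr, size_t *size)
{
	for (int i = 0; i < MAP_SLOTS; i++) {
		if (live[i].ptr == ptr) {
			const char *cs = live[i].callsite;
			*size = live[i].size;
			live[i].ptr = 0;
			return cs;   /* which callsite's counter to decrement */
		}
	}
	return NULL;                 /* free event with no matching alloc */
}
```

Every free pays a lookup here, and a single dropped event silently corrupts the numbers - which is exactly the trust problem with a tracing-based tool.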

> There is also the kernel memory leak detector although I never had reason
> to use it (https://www.kernel.org/doc/html/v6.0-rc3/dev-tools/kmemleak.html)
> and it sounds like it would be expensive.

Kmemleak is indeed expensive, and in the past I've had issues with it not
catching everything (I've noticed the kmemleak annotations growing, so maybe
this is less of an issue than it was).

And this is a more complete solution (though not something that could strictly
replace kmemleak): strict memory leaks aren't the only issue, there are also
drivers consuming more memory than expected.

I'll bet you a beer that when people have had this for a while, we're going to have a
bunch of bugs discovered and fixed along the lines of "oh hey, this driver
wasn't supposed to be using this 1 MB of memory, I never noticed that before".

> > > It's also unclear *who* would enable this. It looks like it would mostly
> > > have value during the development stage of an embedded platform to track
> > > kernel memory usage on a per-application basis in an environment where it
> > > may be difficult to setup tracing and tracking. Would it ever be enabled
> > > in production? Would a distribution ever enable this? If it's enabled, any
> > > overhead cannot be disabled/enabled at run or boot time so anyone enabling
> > > this would carry the cost without ever necessarily consuming the data.
> >
> > The whole point of this is to be cheap enough to enable in production -
> > especially the latency tracing infrastructure. There's a lot of value to
> > always-on system visibility infrastructure, so that when a live machine starts
> > to do something wonky the data is already there.
> >
>
> Sure, there is value but nothing stops the tracepoints being attached as
> a boot-time service where interested. For latencies, there is already
> bpf examples for tracing individual function latency over time e.g.
> https://github.com/iovisor/bcc/blob/master/tools/funclatency.py although
> I haven't used it recently.

So this is cool, I'll check it out today.

Tracing of /function/ latency is definitely something you'd want tracing/kprobes
for - that's way more practical than any code tagging-based approach. And if the
output is reliable and useful I could definitely see myself using this, thank
you.

But for data collection where it makes sense to annotate in the source code
where the data collection points are, I see the code-tagging based approach as
simpler - it cuts out a whole bunch of indirection. The diffstat on the code
tagging time stats patch is

8 files changed, 233 insertions(+), 6 deletions(-)

And that includes hooking wait.h - this is really simple, easy stuff.

The memory allocation tracking patches are more complicated because we've got a
ton of memory allocation interfaces and we're aiming for strict correctness
there - because that tool needs strict correctness in order to be useful.

> Live parsing of ftrace is possible, albeit expensive.
> https://github.com/gormanm/mmtests/blob/master/monitors/watch-highorder.pl
> tracks counts of high-order allocations and dumps a report on interrupt as
> an example of live parsing ftrace and only recording interesting state. It's
> not tracking state you are interested in but it demonstrates it is possible
> to rely on ftrace alone and monitor from userspace. It's bit-rotted but
> can be fixed with

Yeah, if this is as far as people have gotten with ftrace on memory allocations
then I don't think tracing is credible here, sorry.

> The ease of use is a criticism as there is effort required to develop
> the state tracking of in-kernel event be it from live parsing ftrace,
> attaching to tracepoints with systemtap/bpf/whatever and the like. The
> main disadvantage with an in-kernel implementation is three-fold. First,
> it doesn't work with older kernels without backports. Second, if something
> slightly different is needed then it's a kernel rebuild. Third, if the
> option is not enabled in the deployed kernel config then you are relying
> on the end user being willing to deploy a custom kernel. The initial
> investment in doing memory leak tracking or latency tracking by attaching
> to tracepoints is significant but it works with older kernels up to a point
> and is less sensitive to the kernel config options selected as features
> like ftrace are often selected.

The next version of this patch set is going to use the alternatives mechanism to
add a boot parameter.

I'm not interested in backporting to older kernels - eesh. People on old
enterprise kernels don't always get all the new shiny things :)

2022-09-01 19:06:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 03/30] Lazy percpu counters

On Thu, Sep 01, 2022 at 10:32:19AM -0400, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > > +static void lazy_percpu_counter_switch_to_pcpu(struct raw_lazy_percpu_counter *c)
> > > +{
> > > + u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
> >
> > Realize that this is incorrect when used under a raw_spinlock_t.
>
> Can you elaborate?

required lock order: raw_spinlock_t < spinlock_t < mutex

allocators live at spinlock_t.

Also see CONFIG_PROVE_RAW_LOCK_NESTING and there might be a document
mentioning all this somewhere.

Additionally, this (obviously) also isn't NMI safe.

2022-09-01 19:50:17

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote:
> On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko <[email protected]> wrote:
[...]
> > So I find Peter's question completely appropriate while your response to
> > that not so much! Maybe ftrace is not the right tool for the intended
> > job. Maybe there are other ways and it would be really great to show
> > that those have been evaluated and they are not suitable for a), b) and
> > c) reasons.
>
> That's fair.
> For memory tracking I looked into using kmemleak and page_owner which
> can't match the required functionality at an overhead acceptable for
> production and pre-production testing environments.

Being more specific would be really helpful. Especially when your cover
letter suggests that you rely on page_owner/memcg metadata as well to
match allocation and their freeing parts.

> traces + BPF I
> haven't evaluated myself but heard from other members of my team who
> tried using that in production environment with poor results. I'll try
> to get more specific information on that.

That would be helpful as well.

> > E.g. Oscar has been working on extending page_ext to track number of
> > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> > it can help in environments where page_ext can be enabled and it is
> > completely non-intrusive to the MM code.
>
> Thanks for pointing out this work. I'll need to review and maybe
> profile it before making any claims.
>
> >
> > If the page_ext overhead is not desirable/acceptable then I am sure
> > there are other options. E.g. kprobes/LivePatching framework can hook
> > into functions and alter their behavior. So why not use that for data
> > collection? Has this been evaluated at all?
>
> I'm not sure how I can hook into say alloc_pages() to find out where
> it was called from without capturing the call stack (which would
> introduce an overhead at every allocation). Would love to discuss this
> or other alternatives if they can be done with low enough overhead.

Yes, tracking back the call trace would be really needed. The question
is whether this is really prohibitively expensive. How much overhead are
we talking about? There is no free lunch here, really. You either have
the overhead during runtime when the feature is used or on the source
code level for all the future development (with a maze of macros and
wrappers).

Thanks!
--
Michal Hocko
SUSE Labs

2022-09-01 19:58:13

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <[email protected]> wrote:
>
> On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko <[email protected]> wrote:
> [...]
> > > So I find Peter's question completely appropriate while your response to
> > > that not so much! Maybe ftrace is not the right tool for the intended
> > > job. Maybe there are other ways and it would be really great to show
> > > that those have been evaluated and they are not suitable for a), b) and
> > > c) reasons.
> >
> > That's fair.
> > For memory tracking I looked into using kmemleak and page_owner which
> > can't match the required functionality at an overhead acceptable for
> > production and pre-production testing environments.
>
> Being more specific would be really helpful. Especially when your cover
> letter suggests that you rely on page_owner/memcg metadata as well to
> match allocation and their freeing parts.

kmemleak is known to be slow and it's even documented [1], so I hope I
can skip that part. For page_owner to provide comparable
information we would have to capture the call stacks for all page
allocations, unlike our proposal, which allows doing that selectively
for specific call sites. I'll post the overhead numbers of call stack
capturing once I'm finished with profiling the latest code, hopefully
sometime tomorrow, in the worst case after the long weekend.

>
> > traces + BPF I
> > haven't evaluated myself but heard from other members of my team who
> > tried using that in production environment with poor results. I'll try
> > to get more specific information on that.
>
> That would be helpful as well.

Ack.

>
> > > E.g. Oscar has been working on extending page_ext to track number of
> > > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> > > it can help in environments where page_ext can be enabled and it is
> > > completely non-intrusive to the MM code.
> >
> > Thanks for pointing out this work. I'll need to review and maybe
> > profile it before making any claims.
> >
> > >
> > > If the page_ext overhead is not desirable/acceptable then I am sure
> > > there are other options. E.g. kprobes/LivePatching framework can hook
> > > into functions and alter their behavior. So why not use that for data
> > > collection? Has this been evaluated at all?
> >
> > I'm not sure how I can hook into say alloc_pages() to find out where
> > it was called from without capturing the call stack (which would
> > introduce an overhead at every allocation). Would love to discuss this
> > or other alternatives if they can be done with low enough overhead.
>
> Yes, tracking back the call trace would be really needed. The question
> is whether this is really prohibitively expensive. How much overhead are
> we talking about? There is no free lunch here, really. You either have
> the overhead during runtime when the feature is used or on the source
> code level for all the future development (with a maze of macros and
> wrappers).

Will post the overhead numbers soon.
What I hear loud and clear is that we need a kernel command-line kill
switch that mitigates the overhead of having this feature. That seems
to be the main concern.
Thanks,
Suren.

[1] https://docs.kernel.org/dev-tools/kmemleak.html#limitations-and-drawbacks

>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

2022-09-01 21:22:15

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 01, 2022 at 12:39:11PM -0700, Suren Baghdasaryan wrote:
> kmemleak is known to be slow and it's even documented [1], so I hope I
> can skip that part. For page_owner to provide comparable
> information we would have to capture the call stacks for all page
> allocations, unlike our proposal, which allows doing that selectively
> for specific call sites. I'll post the overhead numbers of call stack
> capturing once I'm finished with profiling the latest code, hopefully
> sometime tomorrow, in the worst case after the long weekend.

To expand on this further: we're stashing a pointer to the alloc_tag, which is
defined at the allocation callsite. That's how we're able to decrement the
proper counter on free, and why this beats any tracing based approach - with
tracing you'd instead have to correlate allocate/free events. Ouch.
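In rough userspace C, the scheme reads like this - one static tag per callsite holding the counters, and a back-pointer stashed with the allocation so the free side needs no correlation at all. The names are invented for illustration and do not match the RFC's actual interfaces.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace sketch of the alloc_tag scheme: a static tag per callsite
 * holds the call/size counters; a header stored in front of the user
 * memory carries a back-pointer so free() can decrement the right tag
 * directly. Names here are illustrative, not the RFC's API. */
struct alloc_tag {
	const char *file;
	int line;
	long calls;
	long bytes;
};

struct alloc_hdr {                   /* stashed just before user memory */
	struct alloc_tag *tag;
	size_t size;
};

static void *tagged_alloc(struct alloc_tag *tag, size_t size)
{
	struct alloc_hdr *h = malloc(sizeof(*h) + size);
	if (!h)
		return NULL;
	h->tag = tag;
	h->size = size;
	tag->calls++;                /* the whole fast-path cost: */
	tag->bytes += size;          /* two counter updates, no stack walk */
	return h + 1;
}

static void tagged_free(void *p)
{
	if (!p)
		return;
	struct alloc_hdr *h = (struct alloc_hdr *)p - 1;
	h->tag->bytes -= h->size;    /* back-pointer: no correlation needed */
	free(h);
}
```

The kernel version stores the back-pointer in page_ext or the slab page's memcg_data rather than an inline header, but the shape of the accounting is the same.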

> > Yes, tracking back the call trace would be really needed. The question
> > is whether this is really prohibitively expensive. How much overhead are
> > we talking about? There is no free lunch here, really. You either have
> > the overhead during runtime when the feature is used or on the source
> > code level for all the future development (with a maze of macros and
> > wrappers).

The full call stack is really not what you want in most applications - that's
what people think they want at first, and why page_owner works the way it does,
but it turns out that then combining all the different but related stack traces
_sucks_ (so why were you saving them in the first place?), and then you have to
do a separate memory allocation for each stack trace, which destroys performance.

>
> Will post the overhead numbers soon.
> What I hear loud and clear is that we need a kernel command-line kill
> switch that mitigates the overhead of having this feature. That seems
> to be the main concern.
> Thanks,

After looking at this more I don't think we should commit just yet - there's
some tradeoffs to be evaluated, and maybe the thing to do first will be to see
if we can cut down on the (huge!) number of allocation interfaces before adding
more complexity.

The ideal approach, from a performance POV, would be to pass a pointer to the
alloc tag to kmalloc() et al., and then we'd have the actual accounting code in
one place and use a jump label to skip over it when this feature is disabled.

However, there are _many, many_ wrapper functions in our allocation code, and
this approach is going to make the plumbing for the hooks quite a bit bigger
than what we have now - and then, do we want to have this extra alloc_tag
parameter that's not used when CONFIG_ALLOC_TAGGING=n? It's a tiny cost for an
extra unused parameter, but it's a cost - or do we get rid of that with some
extra macro hackery (eww, gross)?

If we do the boot parameter before submission, I think we'll have something
that's maybe not strictly ideal from a performance POV when
CONFIG_ALLOC_TAGGING=y but boot parameter=n, but it'll introduce the minimum
amount of macro insanity.

What we should be able to do pretty easily is discard the alloc_tag structs when
the boot parameter is disabled, because they're in special elf sections and we
already do that (e.g. for .init).
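The section trick itself can be sketched in plain userspace C, assuming GCC and GNU ld (which emits __start_/__stop_ symbols for any section whose name is a valid C identifier); the macro and names below are invented for illustration:

```c
#include <stddef.h>

/* Sketch of the special-ELF-section trick: each DEFINE_CODE_TAG drops a
 * struct into the "code_tags" section, and the linker-provided
 * __start_/__stop_ symbols let us walk every tag at runtime. Because
 * the tags live in one contiguous section, a boot-time switch could
 * also discard them wholesale, as is already done for .init. */
struct code_tag {
	const char *file;
	int line;
};

#define DEFINE_CODE_TAG(name)						\
	static struct code_tag name					\
	__attribute__((used, section("code_tags"))) =			\
		{ __FILE__, __LINE__ }

DEFINE_CODE_TAG(tag_a);
DEFINE_CODE_TAG(tag_b);

extern struct code_tag __start_code_tags[];
extern struct code_tag __stop_code_tags[];

int count_code_tags(void)
{
	int n = 0;
	for (struct code_tag *t = __start_code_tags; t < __stop_code_tags; t++)
		n++;
	return n;
}
```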

2022-09-01 21:47:32

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 27/30] Code tagging based latency tracking

On Thu, 1 Sep 2022 17:38:44 -0400
Steven Rostedt <[email protected]> wrote:

> # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above could almost be done with sqlhist (I haven't implemented
"buckets=10" yet because that's a new feature, so for now let's use log2):

# sqlhist -e 'select comm,prio,cast(delta as log2) from wakeup_lat'

("-e" is to execute the command, as it normally only displays what commands
need to be run to create the synthetic events and histograms)

# cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,prio,delta.log2:vals=hitcount:sort=hitcount:size=2048 [active]
#

{ comm: migration/4 , prio: 0, delta: ~ 2^5 } hitcount: 1
{ comm: migration/0 , prio: 0, delta: ~ 2^4 } hitcount: 2
{ comm: rtkit-daemon , prio: 0, delta: ~ 2^7 } hitcount: 2
{ comm: rtkit-daemon , prio: 0, delta: ~ 2^6 } hitcount: 4
{ comm: migration/0 , prio: 0, delta: ~ 2^5 } hitcount: 8
{ comm: migration/4 , prio: 0, delta: ~ 2^4 } hitcount: 9
{ comm: migration/2 , prio: 0, delta: ~ 2^4 } hitcount: 10
{ comm: migration/5 , prio: 0, delta: ~ 2^4 } hitcount: 10
{ comm: migration/7 , prio: 0, delta: ~ 2^4 } hitcount: 10
{ comm: migration/1 , prio: 0, delta: ~ 2^4 } hitcount: 10
{ comm: migration/6 , prio: 0, delta: ~ 2^4 } hitcount: 10

Totals:
Hits: 76
Entries: 11
Dropped: 0


-- Steve

2022-09-01 21:47:33

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 27/30] Code tagging based latency tracking

On Tue, 30 Aug 2022 14:49:16 -0700
Suren Baghdasaryan <[email protected]> wrote:

> From: Kent Overstreet <[email protected]>
>
> This adds the ability to easily instrument code for measuring latency.
> To use, add the following to calls to your code, at the start and end of
> the event you wish to measure:
>
> code_tag_time_stats_start(start_time);
> code_tag_time_stats_finish(start_time);

So you need to modify the code to see what you want?

>
> Statistics will then show up in debugfs under
> /sys/kernel/debug/time_stats, listed by file and line number.
>
> Statistics measured include weighted averages of frequency, duration, max
> duration, as well as quantiles.
>
> This patch also instruments all calls to init_wait and finish_wait,
> which includes all calls to wait_event. Example debugfs output:
>
> fs/xfs/xfs_trans_ail.c:746 module:xfs func:xfs_ail_push_all_sync
> count: 17
> rate: 0/sec
> frequency: 2 sec
> avg duration: 10 us
> max duration: 232 us
> quantiles (ns): 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128
>
> lib/sbitmap.c:813 module:sbitmap func:sbitmap_finish_wait
> count: 3
> rate: 0/sec
> frequency: 4 sec
> avg duration: 4 sec
> max duration: 4 sec
> quantiles (ns): 0 4288669120 4288669120 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048
>
> net/core/datagram.c:122 module:datagram func:__skb_wait_for_more_packets
> count: 10
> rate: 1/sec
> frequency: 859 ms
> avg duration: 472 ms
> max duration: 30 sec
> quantiles (ns): 0 12279 12279 15669 15669 15669 15669 17217 17217 17217 17217 17217 17217 17217 17217

For function length you could just do something like this:

# cd /sys/kernel/tracing
# echo __skb_wait_for_more_packets > set_ftrace_filter
# echo 1 > function_profile_enabled
# cat trace_stat/function*
Function Hit Time Avg s^2
-------- --- ---- --- ---
__skb_wait_for_more_packets 1 0.000 us 0.000 us 0.000 us
Function Hit Time Avg s^2
-------- --- ---- --- ---
__skb_wait_for_more_packets 1 74.813 us 74.813 us 0.000 us
Function Hit Time Avg s^2
-------- --- ---- --- ---
Function Hit Time Avg s^2
-------- --- ---- --- ---

The above is for a 4 CPU machine. The s^2 is the square of the standard
deviation (this avoids having to do divisions while it runs).

But if you are looking for latency between two events (which can be kprobes
too, where you do not need to rebuild your kernel):

From: https://man.archlinux.org/man/sqlhist.1.en
which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
if not already installed on your distro.

# sqlhist -e -n wakeup_lat 'select end.next_comm as comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from sched_waking as start join sched_switch as end on start.pid = end.next_pid where start.prio < 100'

The above creates a synthetic event called "wakeup_lat" that joins two
events (sched_waking and sched_switch) when the pid field of sched_waking
matches the next_pid field of sched_switch. When there is a match, it will
trigger the wakeup_lat event only if the prio of the sched_waking event is
less than 100 (which in the kernel means any real-time task). The
wakeup_lat event will record the next_comm (as comm field), the pid of
woken task and the time delta in microseconds between the two events.

# echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above starts a histogram tracing the name of the woken task, the
priority and the delta (but placing the delta in buckets of size 10, as we
do not need to see every latency number).

# chrt -f 20 sleep 1

Force something to be woken up that is interesting.

# cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,prio,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048 [active]
#

{ comm: migration/5 , prio: 0, delta: ~ 10-19 } hitcount: 1
{ comm: migration/2 , prio: 0, delta: ~ 10-19 } hitcount: 1
{ comm: sleep , prio: 79, delta: ~ 10-19 } hitcount: 1
{ comm: migration/7 , prio: 0, delta: ~ 10-19 } hitcount: 1
{ comm: migration/4 , prio: 0, delta: ~ 10-19 } hitcount: 1
{ comm: migration/6 , prio: 0, delta: ~ 10-19 } hitcount: 1
{ comm: migration/1 , prio: 0, delta: ~ 10-19 } hitcount: 2
{ comm: migration/0 , prio: 0, delta: ~ 10-19 } hitcount: 1
{ comm: migration/2 , prio: 0, delta: ~ 20-29 } hitcount: 1
{ comm: migration/0 , prio: 0, delta: ~ 20-29 } hitcount: 1

Totals:
Hits: 11
Entries: 10
Dropped: 0

That is a histogram of the wakeup latency of all real time tasks that woke
up. Oh, and it does not drop events unless the number of entries exceeds
the size of the histogram, which I haven't actually encountered, as there
are 2048 buckets by default. But you can make it bigger with the "size"
attribute when creating the histogram.

-- Steve




2022-09-01 22:03:55

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 27/30] Code tagging based latency tracking

On Thu, Sep 01, 2022 at 05:38:44PM -0400, Steven Rostedt wrote:
> On Tue, 30 Aug 2022 14:49:16 -0700
> Suren Baghdasaryan <[email protected]> wrote:
>
> > From: Kent Overstreet <[email protected]>
> >
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following to calls to your code, at the start and end of
> > the event you wish to measure:
> >
> > code_tag_time_stats_start(start_time);
> > code_tag_time_stats_finish(start_time);
>
> So you need to modify the code to see what you want?

Figuring out the _correct_ place to measure is often a significant amount of the
total effort.

Having done so once, why not annotate that in the source code?

> For function length you could just do something like this:
>
> # cd /sys/kernel/tracing
> # echo __skb_wait_for_more_packets > set_ftrace_filter
> # echo 1 > function_profile_enabled
> # cat trace_stat/function*
> Function Hit Time Avg s^2
> -------- --- ---- --- ---
> __skb_wait_for_more_packets 1 0.000 us 0.000 us 0.000 us
> Function Hit Time Avg s^2
> -------- --- ---- --- ---
> __skb_wait_for_more_packets 1 74.813 us 74.813 us 0.000 us
> Function Hit Time Avg s^2
> -------- --- ---- --- ---
> Function Hit Time Avg s^2
> -------- --- ---- --- ---
>
> The above is for a 4 CPU machine. The s^2 is the square of the standard
> deviation (this avoids having to do divisions while it runs).
>
> But if you are looking for latency between two events (which can be kprobes
> too, where you do not need to rebuild your kernel):
>
> From: https://man.archlinux.org/man/sqlhist.1.en
> which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
> if not already installed on your distro.
>
> # sqlhist -e -n wakeup_lat 'select end.next_comm as comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from sched_waking as start join sched_switch as end on start.pid = end.next_pid where start.prio < 100'
>
> The above creates a synthetic event called "wakeup_lat" that joins two
> events (sched_waking and sched_switch) when the pid field of sched_waking
> matches the next_pid field of sched_switch. When there is a match, it will
> trigger the wakeup_lat event only if the prio of the sched_waking event is
> less than 100 (which in the kernel means any real-time task). The
> wakeup_lat event will record the next_comm (as comm field), the pid of
> woken task and the time delta in microseconds between the two events.

So this looks like it's gotten better since I last looked, but it's still not
there yet.

Part of the problem is that the tracepoints themselves are in the wrong place:
your end event is when a task is woken up, but that means spurious wakeups will
cause one wait_event() call to be reported as multiple smaller waits, not one
long wait - oops, now I can't actually find the thing that's causing my
multi-second delay.

Also, in your example you don't have it broken out by callsite. That would be
the first thing I'd need for any real world debugging.

So, it looks like tracing has made some progress over the past 10 years, but
for debugging latency issues it's still not there yet in general. I will
definitely remember function latency tracing the next time I'm doing performance
work, but I expect that to be far too heavy to enable on a live server.

This thing is only a couple hundred lines of code though, so perhaps tracing
shouldn't be the only tool in our toolbox :)

2022-09-01 22:45:47

by Roman Gushchin

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> On Wed, Aug 31, 2022 at 12:02 PM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > > Whatever asking for an explanation as to why equivalent functionality
> > > > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> > >
> > > Fully agreed and this is especially true for a change this size
> > > 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > In the case of memory allocation accounting, you flat cannot do this with ftrace
> > - you could maybe do a janky version that isn't fully accurate, much slower,
> > more complicated for the developer to understand and debug and more complicated
> > for the end user.
> >
> > But please, I invite anyone who's actually been doing this with ftrace to
> > demonstrate otherwise.
> >
> > Ftrace just isn't the right tool for the job here - we're talking about adding
> > per callsite accounting to some of the fastest fast paths in the kernel.
> >
> > And the size of the changes for memory allocation accounting are much more
> > reasonable:
> > 33 files changed, 623 insertions(+), 99 deletions(-)
> >
> > The code tagging library should exist anyways, it's been open coded half a dozen
> > times in the kernel already.
> >
> > And once we've got that, the time stats code is _also_ far simpler than doing it
> > with ftrace would be. If anyone here has successfully debugged latency issues
> > with ftrace, I'd really like to hear it. Again, for debugging latency issues you
> > want something that can always be on, and that's not cheap with ftrace - and
> > never mind the hassle of correlating start and end wait trace events, building
> > up histograms, etc. - that's all handled here.
> >
> > Cheap, simple, easy to use. What more could you want?
> >
>
> This is very interesting work! Do you have any data about the overhead
> this introduces, especially in a production environment? I am
> especially interested in memory allocations tracking and detecting
> leaks.

+1

I think the question whether it indeed can be always turned on in the production
or not is the main one. If not, the advantage over ftrace/bpf/... is not that
obvious. Otherwise it will be indeed a VERY useful thing.

Also, there is a lot of interesting stuff within this patchset, which
might be useful elsewhere. So thanks to Kent and Suren for this work!

Thanks!

2022-09-01 22:52:03

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > This is very interesting work! Do you have any data about the overhead
> > this introduces, especially in a production environment? I am
> > especially interested in memory allocations tracking and detecting
> > leaks.
>
> +1
>
> I think the question whether it indeed can be always turned on in the production
> or not is the main one. If not, the advantage over ftrace/bpf/... is not that
> obvious. Otherwise it will be indeed a VERY useful thing.

Low enough overhead to run in production was my primary design goal.

Stats are kept in a struct that's defined at the callsite. So this adds _no_
pointer chasing to the allocation path, unless we've switched to percpu counters
at that callsite (see the lazy percpu counters patch), where we need to deref
one percpu pointer to save an atomic.

Then we need to stash a pointer to the alloc_tag, so that kfree() can find it.
For slab allocations this uses the same storage area as memcg, so for
allocations that are using that we won't be touching any additional cachelines.
(I wanted the pointer to the alloc_tag to be stored inline with the allocation,
but that would've caused alignment difficulties).

Then there's a pointer deref introduced to the kfree() path, to get back to the
original alloc_tag and subtract the allocation from that callsite. That one
won't be free, and with percpu counters we've got another dependent load too -
hmm, it might be worth benchmarking with just atomics, skipping the percpu
counters.

So the overhead won't be zero, I expect it'll show up in some synthetic
benchmarks, but yes I do definitely expect this to be worth enabling in
production in many scenarios.
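The percpu trade-off reads roughly like this as a single-threaded userspace sketch - a counter starts as one shared atomic and, once it proves hot, switches to per-CPU slots so the fast path stops bouncing a cacheline. The threshold, CPU count and names are all invented; the real lazy percpu counter patch differs in detail.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Single-threaded sketch of a lazy percpu counter: updates go to one
 * shared atomic until the counter proves hot, then per-CPU slots are
 * allocated and the fast path becomes a plain add. Illustrative only. */
#define NR_FAKE_CPUS 4
#define UPGRADE_THRESHOLD 64

struct lazy_counter {
	_Atomic long shared;   /* used until the callsite proves hot */
	_Atomic long updates;
	long *percpu;          /* allocated lazily; NULL until upgrade */
};

static void lazy_counter_add(struct lazy_counter *c, int cpu, long v)
{
	long *pc = c->percpu;
	if (pc) {              /* upgraded fast path: plain per-CPU add */
		pc[cpu] += v;
		return;
	}
	atomic_fetch_add(&c->shared, v);
	if (atomic_fetch_add(&c->updates, 1) == UPGRADE_THRESHOLD)
		c->percpu = calloc(NR_FAKE_CPUS, sizeof(long));
}

static long lazy_counter_read(struct lazy_counter *c)
{
	long sum = atomic_load(&c->shared);   /* reads fold both halves */
	if (c->percpu)
		for (int i = 0; i < NR_FAKE_CPUS; i++)
			sum += c->percpu[i];
	return sum;
}
```

Reads get slower (they sum every CPU's slot), which is the right trade for counters that are written on the allocation fast path and read only when someone looks at the debugfs file.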

2022-09-01 23:14:43

by Roman Gushchin

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 01, 2022 at 06:37:20PM -0400, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> > On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > > This is very interesting work! Do you have any data about the overhead
> > > this introduces, especially in a production environment? I am
> > > especially interested in memory allocations tracking and detecting
> > > leaks.
> >
> > +1
> >
> > I think the question whether it indeed can be always turned on in the production
> > or not is the main one. If not, the advantage over ftrace/bpf/... is not that
> > obvious. Otherwise it will be indeed a VERY useful thing.
>
> Low enough overhead to run in production was my primary design goal.
>
> Stats are kept in a struct that's defined at the callsite. So this adds _no_
> pointer chasing to the allocation path, unless we've switched to percpu counters
> at that callsite (see the lazy percpu counters patch), where we need to deref
> one percpu pointer to save an atomic.
>
> Then we need to stash a pointer to the alloc_tag, so that kfree() can find it.
> For slab allocations this uses the same storage area as memcg, so for
> allocations that are using that we won't be touching any additional cachelines.
> (I wanted the pointer to the alloc_tag to be stored inline with the allocation,
> but that would've caused alignment difficulties).
>
> Then there's a pointer deref introduced to the kfree() path, to get back to the
> original alloc_tag and subtract the allocation from that callsite. That one
> won't be free, and with percpu counters we've got another dependent load too -
> hmm, it might be worth benchmarking with just atomics, skipping the percpu
> counters.
>
> So the overhead won't be zero, I expect it'll show up in some synthetic
> benchmarks, but yes I do definitely expect this to be worth enabling in
> production in many scenarios.

I'm somewhat sceptical, but I usually am. And in this case I'll be really happy
to be wrong.

On the bright side, maybe most of the overhead will come from a few allocations,
so an option to explicitly exclude them will do the trick.

I'd suggest running something like iperf on fast hardware, and maybe some
io_uring stuff too. These are the two places which were historically most
sensitive to (kernel) memory accounting speed.

Thanks!

2022-09-01 23:15:43

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 27/30] Code tagging based latency tracking

On Thu, 1 Sep 2022 17:54:38 -0400
Kent Overstreet <[email protected]> wrote:
>
> So this looks like it's gotten better since I last looked, but it's still not
> there yet.
>
> Part of the problem is that the tracepoints themselves are in the wrong place:
> your end event is when a task is woken up, but that means spurious wakeups will

The end event is when a task is scheduled onto the CPU. The start event is
the first time it is woken up.

> cause one wait_event() call to be reported as multiple smaller waits, not one
> long wait - oops, now I can't actually find the thing that's causing my
> multi-second delay.
>
> Also, in your example you don't have it broken out by callsite. That would be
> the first thing I'd need for any real world debugging.

OK, how about this (currently we can only have 3 keys, but you can create
multiple histograms on the same event).

# echo 'hist:keys=comm,stacktrace,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

(notice the "stacktrace" in the keys)

# cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,stacktrace,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048 [active]
#

{ comm: migration/2 , stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule_idle+0x26/0x40
do_idle+0xb4/0xd0
cpu_startup_entry+0x19/0x20
secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount: 7
{ comm: migration/5 , stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule_idle+0x26/0x40
do_idle+0xb4/0xd0
cpu_startup_entry+0x19/0x20
secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount: 7
{ comm: migration/1 , stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule_idle+0x26/0x40
do_idle+0xb4/0xd0
cpu_startup_entry+0x19/0x20
secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount: 7
{ comm: migration/7 , stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule_idle+0x26/0x40
do_idle+0xb4/0xd0
cpu_startup_entry+0x19/0x20
secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount: 7
{ comm: migration/0 , stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule_idle+0x26/0x40
do_idle+0xb4/0xd0
cpu_startup_entry+0x19/0x20
start_kernel+0x595/0x5be
secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount: 7
{ comm: migration/4 , stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule_idle+0x26/0x40
do_idle+0xb4/0xd0
cpu_startup_entry+0x19/0x20
secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount: 7
{ comm: rtkit-daemon , stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
preempt_schedule_common+0x2d/0x70
preempt_schedule_thunk+0x16/0x18
_raw_spin_unlock_irq+0x2e/0x40
eventfd_write+0xc8/0x290
vfs_write+0xc0/0x2a0
ksys_write+0x5f/0xe0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
, delta: ~ 10-19} hitcount: 1
{ comm: migration/6 , stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule_idle+0x26/0x40
do_idle+0xb4/0xd0
cpu_startup_entry+0x19/0x20
secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount: 7
{ comm: rtkit-daemon , stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule_idle+0x26/0x40
do_idle+0xb4/0xd0
cpu_startup_entry+0x19/0x20
secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 20-29} hitcount: 1
{ comm: rtkit-daemon , stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
preempt_schedule_common+0x2d/0x70
preempt_schedule_thunk+0x16/0x18
_raw_spin_unlock_irq+0x2e/0x40
eventfd_write+0xc8/0x290
vfs_write+0xc0/0x2a0
ksys_write+0x5f/0xe0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
, delta: ~ 30-39} hitcount: 1
{ comm: rtkit-daemon , stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule_idle+0x26/0x40
do_idle+0xb4/0xd0
cpu_startup_entry+0x19/0x20
secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 40-49} hitcount: 1

Totals:
Hits: 53
Entries: 11
Dropped: 0


Not the prettiest thing to read. But hey, we got the full stack of where
these latencies happened!

Yes, it adds some overhead when the events are triggered due to the
stacktrace code, but it's extremely useful information.

>
> So, it looks like tracing has made some progress over the past 10 years,
> but for debugging latency issues it's still not there yet in general. I

I call BS on that statement. The fact that you do not know what has been
added to the kernel in the last 10 years (like you had no idea about
seq_buf, which was added in 2014) tells me that you are totally
clueless about what tracing can and cannot do.

It appears to me that you are too focused on inventing your own wheel that
does exactly what you want before looking to see how things are today. Just
because something didn't fit your needs 10 years ago doesn't mean that it
can't fit your needs today.


> will definitely remember function latency tracing the next time I'm doing
> performance work, but I expect that to be far too heavy to enable on a
> live server.

I run it on production machines all the time. With the filtering in place
it has very little overhead. Mostly in the noise. The best part is that it
has practically zero overhead (but can add some cache pressure) when it's
off, and can be turned on at run time.

The tracing infrastructure is very modular; you can use the parts of it that
you need without the overhead of other parts. As you found out this week,
tracepoints are not the same as trace events: tracepoints are
just a hook in the code that anything can attach to (that's what Daniel's
RV work does), while trace events provide the stored data to be recorded.

I will note that the current histogram code overhead has increased due to
retpolines, but I have code to convert its indirect calls to direct
calls via a switch statement, which drops the overhead by 20%!

https://lore.kernel.org/all/[email protected]/
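The win Steve mentions comes from replacing retpoline-protected indirect calls with direct calls selected by a switch. A minimal userspace illustration of the two dispatch styles (the handler names are hypothetical, not the histogram code's):

```c
#include <assert.h>

/* Two ways to dispatch a per-event handler. With retpolines enabled,
 * the indirect call through a function pointer is expensive; a switch
 * over a small enum compiles to direct calls (often inlined). */
enum key_kind { KEY_PID, KEY_DELTA };

static long handle_pid(long v)   { return v + 1; }
static long handle_delta(long v) { return v * 2; }

typedef long (*key_fn)(long);
static key_fn dispatch_table[] = { handle_pid, handle_delta };

/* Indirect: one retpoline-protected call per event. */
static long dispatch_indirect(enum key_kind k, long v)
{
	return dispatch_table[k](v);
}

/* Direct: the compiler emits plain (or inlined) calls. */
static long dispatch_direct(enum key_kind k, long v)
{
	switch (k) {
	case KEY_PID:
		return handle_pid(v);
	case KEY_DELTA:
		return handle_delta(v);
	}
	return 0;
}
```

Both functions compute the same result; the difference is purely in the generated call sequence, which is where the overhead reduction comes from.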


>
> This thing is only a couple hundred lines of code though, so perhaps
> tracing shouldn't be the only tool in our toolbox :)

I'm already getting complaints from customers/users saying there are
too many tools in the toolbox already. (Do we use ftrace/perf/bpf?) The
idea is to have the tools use mostly the same infrastructure, and not have
each be 100% off on its own, unless there's a clear reason to invent a new wheel
that several people are asking for, not just one or two.

-- Steve

2022-09-01 23:19:13

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 27/30] Code tagging based latency tracking

On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> On Thu, 1 Sep 2022 17:54:38 -0400
> Kent Overstreet <[email protected]> wrote:
> >
> > So this looks like it's gotten better since I last looked, but it's still not
> > there yet.
> >
> > Part of the problem is that the tracepoints themselves are in the wrong place:
> > your end event is when a task is woken up, but that means spurious wakeups will
>
> The end event is when a task is scheduled onto the CPU. The start event is
> the first time it is woken up.

Yeah, that's not what I want. You're just tracing latency due to having more
processes runnable than CPUs.

I don't care about that for debugging, though! I specifically want latency at
the wait_event() level, and related - every time a process blocked _on some
condition_, until that condition became true. Not until some random, potentially
spurious wakeup.
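The kind of instrumentation being argued for here — timing the whole wait at the callsite, so that spurious wakeups inside it cannot split one long wait into several short ones — can be sketched in userspace C. This is illustrative only; the names and macro shape are made up, not the RFC's actual API:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Per-callsite latency stats, in the spirit of the codetag approach:
 * one static tag per wait site, identified by file:line. */
struct latency_tag {
	const char *file;
	int line;
	uint64_t count;
	uint64_t total_ns;
	uint64_t max_ns;
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static void latency_account(struct latency_tag *tag, uint64_t start)
{
	uint64_t d = now_ns() - start;

	tag->count++;
	tag->total_ns += d;
	if (d > tag->max_ns)
		tag->max_ns = d;
}

/* Wrap an entire blocking expression. Internal spurious wakeups cannot
 * split the measurement: only the full wait is accounted, once. */
#define timed_wait(tag, expr) do {		\
	uint64_t _start = now_ns();		\
	(void)(expr);				\
	latency_account((tag), _start);		\
} while (0)
```

Because the timing brackets the whole `wait_event()`-style call rather than individual wakeups, a multi-second delay shows up as one multi-second entry attributed to a single file and line.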


> Not the prettiest thing to read. But hey, we got the full stack of where
> these latencies happened!

Most of the time I _don't_ want full stacktraces, though!

That means I have a ton more output to sort through, and the data is far more
expensive to collect.

I don't know why it's what people go to first - see the page_owner stuff - but
that doesn't get used much either because the output is _really hard to sort
through_.

Most of the time, just a single file and line number is all you want - and
tracing has always made it hard to get at that.


> Yes, it adds some overhead when the events are triggered due to the
> stacktrace code, but it's extremely useful information.
>
> >
> > So, it looks like tracing has made some progress over the past 10 years,
> > but for debugging latency issues it's still not there yet in general. I
>
> I call BS on that statement. The fact that you do not know what has been
> added to the kernel in the last 10 years (like you had no idea about
> seq_buf, which was added in 2014) tells me that you are totally
> clueless about what tracing can and cannot do.
>
> It appears to me that you are too focused on inventing your own wheel that
> does exactly what you want before looking to see how things are today. Just
> because something didn't fit your needs 10 years ago doesn't mean that it
> can't fit your needs today.

...And the ad hominem attacks start.

Steve, I'm not attacking you, and there's room enough in this world for the both
of us to be doing our thing creating new and useful tools.

> I'm already getting complaints from customers/users saying there are
> too many tools in the toolbox already. (Do we use ftrace/perf/bpf?) The
> idea is to have the tools use mostly the same infrastructure, and not have
> each be 100% off on its own, unless there's a clear reason to invent a new wheel
> that several people are asking for, not just one or two.

I would like to see more focus on usability.

That means, in a best case scenario, always-on data collection that I can just
look at, and it'll already be in the format most likely to be useful.

Surely you can appreciate the usefulness of that...?

Tracing started out as a tool for efficiently getting lots of data out of the
kernel, and it's great for that. But I think your focus on the cool thing you
built may be blinding you a bit to alternative approaches...

2022-09-01 23:52:08

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 1, 2022 at 3:54 PM Roman Gushchin <[email protected]> wrote:
>
> On Thu, Sep 01, 2022 at 06:37:20PM -0400, Kent Overstreet wrote:
> > On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> > > On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > > > This is very interesting work! Do you have any data about the overhead
> > > > this introduces, especially in a production environment? I am
> > > > especially interested in memory allocations tracking and detecting
> > > > leaks.
> > >
> > > +1
> > >
> > > I think the main question is whether it can indeed always be turned on in
> > > production or not. If not, the advantage over ftrace/bpf/... is not that
> > > obvious. Otherwise it will indeed be a VERY useful thing.
> >
> > Low enough overhead to run in production was my primary design goal.
> >
> > Stats are kept in a struct that's defined at the callsite. So this adds _no_
> > pointer chasing to the allocation path, unless we've switched to percpu counters
> > at that callsite (see the lazy percpu counters patch), where we need to deref
> > one percpu pointer to save an atomic.
> >
> > Then we need to stash a pointer to the alloc_tag, so that kfree() can find it.
> > For slab allocations this uses the same storage area as memcg, so for
> > allocations that are using that we won't be touching any additional cachelines.
> > (I wanted the pointer to the alloc_tag to be stored inline with the allocation,
> > but that would've caused alignment difficulties).
> >
> > Then there's a pointer deref introduced to the kfree() path, to get back to the
> > original alloc_tag and subtract the allocation from that callsite. That one
> > won't be free, and with percpu counters we've got another dependent load too -
> > hmm, it might be worth benchmarking with just atomics, skipping the percpu
> > counters.
> >
> > So the overhead won't be zero, I expect it'll show up in some synthetic
> > benchmarks, but yes I do definitely expect this to be worth enabling in
> > production in many scenarios.
>
> I'm somewhat sceptical, but I usually am. And in this case I'll be really happy
> to be wrong.
>
> On the bright side, maybe most of the overhead will come from a few allocations,
> so an option to explicitly exclude them will do the trick.
>
> I'd suggest running something like iperf on fast hardware, and maybe some
> io_uring stuff too. These are the two places which were historically most
> sensitive to (kernel) memory accounting speed.

Thanks for the suggestions, Roman. I'll see how I can get this done.
I'll have to find someone with access to fast hardware (Android is not
great for that) and backport the patchset to the supported kernel
version. Will do my best.
Thanks,
Suren.

>
> Thanks!
>

2022-09-01 23:52:19

by Roman Gushchin

[permalink] [raw]
Subject: Re: [RFC PATCH 11/30] mm: introduce slabobj_ext to support slab object extensions

On Tue, Aug 30, 2022 at 02:49:00PM -0700, Suren Baghdasaryan wrote:
> Currently slab pages can store only vectors of obj_cgroup pointers in
> page->memcg_data. Introduce slabobj_ext structure to allow more data
> to be stored for each slab object. Wraps obj_cgroup into slabobj_ext
> to support current functionality while allowing to extend slabobj_ext
> in the future.
>
> Note: ideally the config dependency should be turned the other way around:
> MEMCG should depend on SLAB_OBJ_EXT and {page|slab|folio}.memcg_data would
> be renamed to something like {page|slab|folio}.objext_data. However doing
> this in RFC would introduce considerable churn unrelated to the overall
> idea, so avoiding this until v1.

Hi Suren!

I'd say CONFIG_MEMCG_KMEM and CONFIG_YOUR_NEW_STUFF should both depend on
SLAB_OBJ_EXT.
CONFIG_MEMCG_KMEM depends on CONFIG_MEMCG anyway.

>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/memcontrol.h | 18 ++++--
> init/Kconfig | 5 ++
> mm/kfence/core.c | 2 +-
> mm/memcontrol.c | 60 ++++++++++---------
> mm/page_owner.c | 2 +-
> mm/slab.h | 119 +++++++++++++++++++++++++------------
> 6 files changed, 131 insertions(+), 75 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6257867fbf95..315399f77173 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -227,6 +227,14 @@ struct obj_cgroup {
> };
> };
>
> +/*
> + * Extended information for slab objects stored as an array in page->memcg_data
> + * if MEMCG_DATA_OBJEXTS is set.
> + */
> +struct slabobj_ext {
> + struct obj_cgroup *objcg;
> +} __aligned(8);

Why do we need this alignment requirement?

> +
> /*
> * The memory controller data structure. The memory controller controls both
> * page cache and RSS per cgroup. We would eventually like to provide
> @@ -363,7 +371,7 @@ extern struct mem_cgroup *root_mem_cgroup;
>
> enum page_memcg_data_flags {
> /* page->memcg_data is a pointer to an objcgs vector */
> - MEMCG_DATA_OBJCGS = (1UL << 0),
> + MEMCG_DATA_OBJEXTS = (1UL << 0),
> /* page has been accounted as a non-slab kernel page */
> MEMCG_DATA_KMEM = (1UL << 1),
> /* the next bit after the last actual flag */
> @@ -401,7 +409,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
> unsigned long memcg_data = folio->memcg_data;
>
> VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
> - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
> + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
> VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
>
> return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> @@ -422,7 +430,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
> unsigned long memcg_data = folio->memcg_data;
>
> VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
> - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
> + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
> VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
>
> return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> @@ -517,7 +525,7 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
> */
> unsigned long memcg_data = READ_ONCE(page->memcg_data);
>
> - if (memcg_data & MEMCG_DATA_OBJCGS)
> + if (memcg_data & MEMCG_DATA_OBJEXTS)
> return NULL;
>
> if (memcg_data & MEMCG_DATA_KMEM) {
> @@ -556,7 +564,7 @@ static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
> static inline bool folio_memcg_kmem(struct folio *folio)
> {
> VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
> - VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
> + VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
> return folio->memcg_data & MEMCG_DATA_KMEM;
> }
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 532362fcfe31..82396d7a2717 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -958,6 +958,10 @@ config MEMCG
> help
> Provides control over the memory footprint of tasks in a cgroup.
>
> +config SLAB_OBJ_EXT
> + bool
> + depends on MEMCG
> +
> config MEMCG_SWAP
> bool
> depends on MEMCG && SWAP
> @@ -966,6 +970,7 @@ config MEMCG_SWAP
> config MEMCG_KMEM
> bool
> depends on MEMCG && !SLOB
> + select SLAB_OBJ_EXT
> default y
>
> config BLK_CGROUP
> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> index c252081b11df..c0958e4a32e2 100644
> --- a/mm/kfence/core.c
> +++ b/mm/kfence/core.c
> @@ -569,7 +569,7 @@ static unsigned long kfence_init_pool(void)
> __folio_set_slab(slab_folio(slab));
> #ifdef CONFIG_MEMCG
> slab->memcg_data = (unsigned long)&kfence_metadata[i / 2 - 1].objcg |
> - MEMCG_DATA_OBJCGS;
> + MEMCG_DATA_OBJEXTS;
> #endif
> }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b69979c9ced5..3f407ef2f3f1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2793,7 +2793,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
> folio->memcg_data = (unsigned long)memcg;
> }
>
> -#ifdef CONFIG_MEMCG_KMEM
> +#ifdef CONFIG_SLAB_OBJ_EXT
> /*
> * The allocated objcg pointers array is not accounted directly.
> * Moreover, it should not come from DMA buffer and is not readily
> @@ -2801,38 +2801,20 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
> */
> #define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
>
> -/*
> - * mod_objcg_mlstate() may be called with irq enabled, so
> - * mod_memcg_lruvec_state() should be used.
> - */
> -static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
> - struct pglist_data *pgdat,
> - enum node_stat_item idx, int nr)
> -{
> - struct mem_cgroup *memcg;
> - struct lruvec *lruvec;
> -
> - rcu_read_lock();
> - memcg = obj_cgroup_memcg(objcg);
> - lruvec = mem_cgroup_lruvec(memcg, pgdat);
> - mod_memcg_lruvec_state(lruvec, idx, nr);
> - rcu_read_unlock();
> -}
> -
> -int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
> - gfp_t gfp, bool new_slab)
> +int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> + gfp_t gfp, bool new_slab)
> {
> unsigned int objects = objs_per_slab(s, slab);
> unsigned long memcg_data;
> void *vec;
>
> gfp &= ~OBJCGS_CLEAR_MASK;
> - vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
> + vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> slab_nid(slab));
> if (!vec)
> return -ENOMEM;
>
> - memcg_data = (unsigned long) vec | MEMCG_DATA_OBJCGS;
> + memcg_data = (unsigned long) vec | MEMCG_DATA_OBJEXTS;
> if (new_slab) {
> /*
> * If the slab is brand new and nobody can yet access its
> @@ -2843,7 +2825,7 @@ int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
> } else if (cmpxchg(&slab->memcg_data, 0, memcg_data)) {
> /*
> * If the slab is already in use, somebody can allocate and
> - * assign obj_cgroups in parallel. In this case the existing
> + * assign slabobj_exts in parallel. In this case the existing
> * objcg vector should be reused.
> */
> kfree(vec);
> @@ -2853,6 +2835,26 @@ int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
> kmemleak_not_leak(vec);
> return 0;
> }
> +#endif /* CONFIG_SLAB_OBJ_EXT */
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +/*
> + * mod_objcg_mlstate() may be called with irq enabled, so
> + * mod_memcg_lruvec_state() should be used.
> + */
> +static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
> + struct pglist_data *pgdat,
> + enum node_stat_item idx, int nr)
> +{
> + struct mem_cgroup *memcg;
> + struct lruvec *lruvec;
> +
> + rcu_read_lock();
> + memcg = obj_cgroup_memcg(objcg);
> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
> + mod_memcg_lruvec_state(lruvec, idx, nr);
> + rcu_read_unlock();
> +}
>
> static __always_inline
> struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
> @@ -2863,18 +2865,18 @@ struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
> * slab->memcg_data.
> */
> if (folio_test_slab(folio)) {
> - struct obj_cgroup **objcgs;
> + struct slabobj_ext *obj_exts;
> struct slab *slab;
> unsigned int off;
>
> slab = folio_slab(folio);
> - objcgs = slab_objcgs(slab);
> - if (!objcgs)
> + obj_exts = slab_obj_exts(slab);
> + if (!obj_exts)
> return NULL;
>
> off = obj_to_index(slab->slab_cache, slab, p);
> - if (objcgs[off])
> - return obj_cgroup_memcg(objcgs[off]);
> + if (obj_exts[off].objcg)
> + return obj_cgroup_memcg(obj_exts[off].objcg);
>
> return NULL;
> }
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index e4c6f3f1695b..fd4af1ad34b8 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -353,7 +353,7 @@ static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
> if (!memcg_data)
> goto out_unlock;
>
> - if (memcg_data & MEMCG_DATA_OBJCGS)
> + if (memcg_data & MEMCG_DATA_OBJEXTS)
> ret += scnprintf(kbuf + ret, count - ret,
> "Slab cache page\n");
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 4ec82bec15ec..c767ce3f0fe2 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -422,36 +422,94 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
> return false;
> }
>
> +#ifdef CONFIG_SLAB_OBJ_EXT
> +
> +static inline bool is_kmem_only_obj_ext(void)
> +{
> #ifdef CONFIG_MEMCG_KMEM
> + return sizeof(struct slabobj_ext) == sizeof(struct obj_cgroup *);
> +#else
> + return false;
> +#endif
> +}
> +
> /*
> - * slab_objcgs - get the object cgroups vector associated with a slab
> + * slab_obj_exts - get the pointer to the slab object extension vector
> + * associated with a slab.
> * @slab: a pointer to the slab struct
> *
> - * Returns a pointer to the object cgroups vector associated with the slab,
> + * Returns a pointer to the object extension vector associated with the slab,
> * or NULL if no such vector has been associated yet.
> */
> -static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
> +static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
> {
> unsigned long memcg_data = READ_ONCE(slab->memcg_data);
>
> - VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS),
> + VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJEXTS),
> slab_page(slab));
> VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, slab_page(slab));
>
> - return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> + return (struct slabobj_ext *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> }
>
> -int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
> - gfp_t gfp, bool new_slab);
> -void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> - enum node_stat_item idx, int nr);
> +int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> + gfp_t gfp, bool new_slab);
>
> -static inline void memcg_free_slab_cgroups(struct slab *slab)
> +static inline void free_slab_obj_exts(struct slab *slab)
> {
> - kfree(slab_objcgs(slab));
> + struct slabobj_ext *obj_exts;
> +
> + if (!memcg_kmem_enabled() && is_kmem_only_obj_ext())
> + return;

Hm, not sure I understand this. If kmem is disabled and is_kmem_only_obj_ext()
is true, shouldn't slab->memcg_data always be NULL?

> +
> + obj_exts = slab_obj_exts(slab);
> + kfree(obj_exts);
> slab->memcg_data = 0;
> }
>
> +static inline void prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
> +{
> + struct slab *slab;
> +
> + /* If kmem is the only extension then the vector will be created conditionally */
> + if (is_kmem_only_obj_ext())
> + return;
> +
> + slab = virt_to_slab(p);
> + if (!slab_obj_exts(slab))
> + WARN(alloc_slab_obj_exts(slab, s, flags, false),
> + "%s, %s: Failed to create slab extension vector!\n",
> + __func__, s->name);
> +}

This looks a bit cryptic: the action is wrapped into WARN() and the rest is a
set of (semi-)static checks. Can we please invert it? E.g. something like:

if (slab_alloc_tracking_enabled()) {
slab = virt_to_slab(p);
if (!slab_obj_exts(slab))
WARN(alloc_slab_obj_exts(slab, s, flags, false),
"%s, %s: Failed to create slab extension vector!\n",
__func__, s->name);
}

The rest looks good to me.

Thank you!

2022-09-02 00:17:33

by Roman Gushchin

[permalink] [raw]
Subject: Re: [RFC PATCH 14/30] mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache objects

On Tue, Aug 30, 2022 at 02:49:03PM -0700, Suren Baghdasaryan wrote:
> Use __GFP_NO_OBJ_EXT to prevent recursions when allocating slabobj_ext
> objects. Also prevent slabobj_ext allocations for kmem_cache objects.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>

Patches 12-14 look good to me.
It's probably too early to ack anything, but otherwise I'd ack them.

Thanks!

2022-09-02 00:27:08

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 14/30] mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache objects

On Thu, Sep 1, 2022 at 4:41 PM Roman Gushchin <[email protected]> wrote:
>
> On Tue, Aug 30, 2022 at 02:49:03PM -0700, Suren Baghdasaryan wrote:
> > Use __GFP_NO_OBJ_EXT to prevent recursions when allocating slabobj_ext
> > objects. Also prevent slabobj_ext allocations for kmem_cache objects.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>
> Patches 12-14 look good to me.
> It's probably too early to ack anything, but otherwise I'd ack them.

Thank you for reviewing!

>
> Thanks!

2022-09-02 00:31:31

by Roman Gushchin

[permalink] [raw]
Subject: Re: [RFC PATCH 16/30] mm: enable slab allocation tagging for kmalloc and friends

On Tue, Aug 30, 2022 at 02:49:05PM -0700, Suren Baghdasaryan wrote:
> Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. to record allocations
> and deallocations done by these functions.

One particular case when this functionality might be very useful:
in the past we've seen examples (at Fb) where it was hard to understand
the difference between slab memory sizes of two different kernel versions
due to slab caches merging. Once a slab cache is merged with another large
cache, this data is pretty much lost. So I definitely see value in stats which
are independent of kmem caches.

The performance overhead is a concern here, so more data would be useful.

Thanks!

2022-09-02 00:32:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 27/30] Code tagging based latency tracking

On Thu, 1 Sep 2022 18:55:15 -0400
Kent Overstreet <[email protected]> wrote:

> On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> > On Thu, 1 Sep 2022 17:54:38 -0400
> > Kent Overstreet <[email protected]> wrote:
> > >
> > > So this looks like it's gotten better since I last looked, but it's still not
> > > there yet.
> > >
> > > Part of the problem is that the tracepoints themselves are in the wrong place:
> > > your end event is when a task is woken up, but that means spurious wakeups will
> >
> > The end event is when a task is scheduled onto the CPU. The start event is
> > the first time it is woken up.
>
> Yeah, that's not what I want. You're just tracing latency due to having more
> processes runnable than CPUs.
>
> I don't care about that for debugging, though! I specifically want latency at
> the wait_event() level, and related - every time a process blocked _on some
> condition_, until that condition became true. Not until some random, potentially
> spurious wakeup.

This would be better if we could pass the stack trace from one
event to the next, but that wouldn't be too hard to implement. It just
needs to be done.

But anyway:

# echo 'p:wait prepare_to_wait_event' > /sys/kernel/tracing/kprobe_events

// created an event on prepare_to_wait_event as that's usually called just
// before wait event.

# sqlhist -e -n wait_sched 'select start.common_pid as pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from wait as start join sched_switch as end on start.common_pid = end.prev_pid where end.prev_state & 3'

// Create a "wait_sched" event that traces the time between the
// prepare_to_wait_event call and the scheduler. Only trigger it if the
// schedule happens in the interruptible or uninterruptible states.

# sqlhist -e -n wake_sched 'select start.pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta2 from wait_sched as start join sched_switch as end on start.pid = end.next_pid where start.delta < 50'

// Now attach the wait_sched event to the sched_switch where the task gets
// scheduled back in. But we only care if the delta between the
// prepare_to_wait_event and the schedule is less than 50us. This is a
// hack to only catch cases where a prepare_to_wait_event was done just
// before scheduling out.

# echo 'hist:keys=pid,delta2.buckets=10:sort=delta2' > /sys/kernel/tracing/events/synthetic/wake_sched/trigger

// Now we are going to look at the deltas during which the task was sleeping
// for an event. But this just gives pids and deltas.
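As an aside, the "delta2.buckets=10" modifier above just bins the deltas into
fixed-width buckets. A minimal user-space sketch of the same grouping (plain
Python, not tracing code — purely illustrative):

```python
from collections import Counter

def bucket_hist(deltas_us, width=10):
    # Group latency deltas (in microseconds) into fixed-width bins,
    # mimicking what the hist trigger's "delta2.buckets=10" does in-kernel.
    hist = Counter((d // width) * width for d in deltas_us)
    return sorted(hist.items())
```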

# echo 'hist:keys=pid,stacktrace if delta < 50' >> /sys/kernel/tracing/events/synthetic/wait_sched/trigger

// And this is to get the backtraces of where the task was. This is because
// the stack trace is not available at the schedule in, because the
// sched_switch can only give the stack trace of when a task schedules out.
// Again, this is somewhat a hack.

# cat /sys/kernel/tracing/events/synthetic/wake_sched/hist
# event histogram
#
# trigger info: hist:keys=pid,delta2.buckets=10:vals=hitcount:sort=delta2.buckets=10:size=2048 [active]
#

{ pid: 2114, delta2: ~ 10-19 } hitcount: 1
{ pid: 1389, delta2: ~ 160-169 } hitcount: 1
{ pid: 1389, delta2: ~ 660-669 } hitcount: 1
{ pid: 1389, delta2: ~ 1020-1029 } hitcount: 1
{ pid: 1189, delta2: ~ 500020-500029 } hitcount: 1
{ pid: 1189, delta2: ~ 500030-500039 } hitcount: 1
{ pid: 1195, delta2: ~ 500030-500039 } hitcount: 2
{ pid: 1189, delta2: ~ 500040-500049 } hitcount: 10
{ pid: 1193, delta2: ~ 500040-500049 } hitcount: 3
{ pid: 1197, delta2: ~ 500040-500049 } hitcount: 3
{ pid: 1195, delta2: ~ 500040-500049 } hitcount: 9
{ pid: 1190, delta2: ~ 500050-500059 } hitcount: 55
{ pid: 1197, delta2: ~ 500050-500059 } hitcount: 51
{ pid: 1191, delta2: ~ 500050-500059 } hitcount: 61
{ pid: 1198, delta2: ~ 500050-500059 } hitcount: 56
{ pid: 1195, delta2: ~ 500050-500059 } hitcount: 48
{ pid: 1192, delta2: ~ 500050-500059 } hitcount: 54
{ pid: 1194, delta2: ~ 500050-500059 } hitcount: 50
{ pid: 1196, delta2: ~ 500050-500059 } hitcount: 57
{ pid: 1189, delta2: ~ 500050-500059 } hitcount: 48
{ pid: 1193, delta2: ~ 500050-500059 } hitcount: 52
{ pid: 1194, delta2: ~ 500060-500069 } hitcount: 12
{ pid: 1191, delta2: ~ 500060-500069 } hitcount: 2
{ pid: 1190, delta2: ~ 500060-500069 } hitcount: 7
{ pid: 1198, delta2: ~ 500060-500069 } hitcount: 9
{ pid: 1193, delta2: ~ 500060-500069 } hitcount: 6
{ pid: 1196, delta2: ~ 500060-500069 } hitcount: 5
{ pid: 1192, delta2: ~ 500060-500069 } hitcount: 9
{ pid: 1197, delta2: ~ 500060-500069 } hitcount: 9
{ pid: 1195, delta2: ~ 500060-500069 } hitcount: 6
{ pid: 1189, delta2: ~ 500060-500069 } hitcount: 6
{ pid: 1198, delta2: ~ 500070-500079 } hitcount: 1
{ pid: 1192, delta2: ~ 500070-500079 } hitcount: 2
{ pid: 1193, delta2: ~ 500070-500079 } hitcount: 3
{ pid: 1194, delta2: ~ 500070-500079 } hitcount: 2
{ pid: 1191, delta2: ~ 500070-500079 } hitcount: 3
{ pid: 1190, delta2: ~ 500070-500079 } hitcount: 1
{ pid: 1196, delta2: ~ 500070-500079 } hitcount: 1
{ pid: 1193, delta2: ~ 500080-500089 } hitcount: 1
{ pid: 1192, delta2: ~ 500080-500089 } hitcount: 1
{ pid: 1196, delta2: ~ 500080-500089 } hitcount: 2
{ pid: 1194, delta2: ~ 500090-500099 } hitcount: 1
{ pid: 1197, delta2: ~ 500090-500099 } hitcount: 1
{ pid: 1193, delta2: ~ 500090-500099 } hitcount: 1
{ pid: 61, delta2: ~ 503910-503919 } hitcount: 1
{ pid: 61, delta2: ~ 503920-503929 } hitcount: 1
{ pid: 61, delta2: ~ 503930-503939 } hitcount: 1
{ pid: 61, delta2: ~ 503960-503969 } hitcount: 15
{ pid: 61, delta2: ~ 503970-503979 } hitcount: 18
{ pid: 61, delta2: ~ 503980-503989 } hitcount: 20
{ pid: 61, delta2: ~ 504010-504019 } hitcount: 2
{ pid: 61, delta2: ~ 504020-504029 } hitcount: 1
{ pid: 61, delta2: ~ 504030-504039 } hitcount: 2
{ pid: 58, delta2: ~ 43409960-43409969 } hitcount: 1

Totals:
Hits: 718
Entries: 54
Dropped: 0

The above is useless without the following:

# cat /sys/kernel/tracing/events/synthetic/wait_sched/hist
# event histogram
#
# trigger info: hist:keys=pid:vals=hitcount:__arg_1618_2=pid,__arg_1618_3=common_timestamp.usecs:sort=hitcount:size=2048:clock=global if delta < 10 [active]
#

{ pid: 612 } hitcount: 1
{ pid: 889 } hitcount: 2
{ pid: 1389 } hitcount: 3
{ pid: 58 } hitcount: 3
{ pid: 2096 } hitcount: 5
{ pid: 61 } hitcount: 145
{ pid: 1196 } hitcount: 151
{ pid: 1190 } hitcount: 151
{ pid: 1198 } hitcount: 153
{ pid: 1197 } hitcount: 153
{ pid: 1195 } hitcount: 153
{ pid: 1194 } hitcount: 153
{ pid: 1191 } hitcount: 153
{ pid: 1192 } hitcount: 153
{ pid: 1189 } hitcount: 153
{ pid: 1193 } hitcount: 153

Totals:
Hits: 1685
Entries: 16
Dropped: 0


# event histogram
#
# trigger info: hist:keys=pid,stacktrace:vals=hitcount:sort=hitcount:size=2048 if delta < 10 [active]
#

{ pid: 1389, stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule+0x72/0x110
pipe_read+0x318/0x420
new_sync_read+0x18b/0x1a0
vfs_read+0xf5/0x190
ksys_read+0xab/0xe0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 3
{ pid: 1189, stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule+0x72/0x110
read_events+0x119/0x190
do_io_getevents+0x72/0xe0
__x64_sys_io_getevents+0x59/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 61, stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule+0x72/0x110
schedule_timeout+0x88/0x160
kcompactd+0x364/0x3f0
kthread+0x141/0x170
ret_from_fork+0x22/0x30
} hitcount: 28
{ pid: 1194, stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule+0x72/0x110
read_events+0x119/0x190
do_io_getevents+0x72/0xe0
__x64_sys_io_getevents+0x59/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1197, stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule+0x72/0x110
read_events+0x119/0x190
do_io_getevents+0x72/0xe0
__x64_sys_io_getevents+0x59/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1198, stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule+0x72/0x110
read_events+0x119/0x190
do_io_getevents+0x72/0xe0
__x64_sys_io_getevents+0x59/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1191, stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule+0x72/0x110
read_events+0x119/0x190
do_io_getevents+0x72/0xe0
__x64_sys_io_getevents+0x59/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1196, stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule+0x72/0x110
read_events+0x119/0x190
do_io_getevents+0x72/0xe0
__x64_sys_io_getevents+0x59/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1192, stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule+0x72/0x110
read_events+0x119/0x190
do_io_getevents+0x72/0xe0
__x64_sys_io_getevents+0x59/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1195, stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule+0x72/0x110
read_events+0x119/0x190
do_io_getevents+0x72/0xe0
__x64_sys_io_getevents+0x59/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1190, stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule+0x72/0x110
read_events+0x119/0x190
do_io_getevents+0x72/0xe0
__x64_sys_io_getevents+0x59/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28
{ pid: 1193, stacktrace:
event_hist_trigger+0x290/0x2b0
event_triggers_call+0x52/0xe0
trace_event_buffer_commit+0x193/0x240
trace_event_raw_event_sched_switch+0x120/0x180
__traceiter_sched_switch+0x39/0x50
__schedule+0x310/0x700
schedule+0x72/0x110
read_events+0x119/0x190
do_io_getevents+0x72/0xe0
__x64_sys_io_getevents+0x59/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
} hitcount: 28

Totals:
Hits: 311
Entries: 12
Dropped: 0

Now we just need a tool to map the pids of the delta histogram to the pids
of the stack traces to figure out where the issues may happen.
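Until such a tool exists, the join itself is simple in user space. A hedged
sketch, assuming the two hist file formats shown above (the parsing is mine,
not an existing utility):

```python
import re

# '{ pid: N, delta2: ~ LO-HI } hitcount: C' entries from the wake_sched hist.
DELTA_RE = re.compile(
    r"\{\s*pid:\s*(\d+),\s*delta2:\s*~\s*(\d+)-(\d+)\s*\}\s*hitcount:\s*(\d+)")
# '{ pid: N, stacktrace: <frames> } hitcount: C' blocks from the stacktrace hist.
STACK_RE = re.compile(
    r"\{\s*pid:\s*(\d+),\s*stacktrace:\n(.*?)\}\s*hitcount:", re.S)

def parse_delta_hist(text):
    # Returns (pid, bucket_lo, bucket_hi, hitcount) tuples.
    return [tuple(map(int, m)) for m in DELTA_RE.findall(text)]

def parse_stack_hist(text):
    # Returns pid -> list of stack frames.
    return {int(pid): [ln.strip() for ln in body.splitlines() if ln.strip()]
            for pid, body in STACK_RE.findall(text)}

def worst_offenders(delta_text, stack_text, top=3):
    # Join the two histograms on pid, largest delta buckets first.
    deltas = sorted(parse_delta_hist(delta_text), key=lambda d: -d[1])
    stacks = parse_stack_hist(stack_text)
    return [(pid, lo, stacks.get(pid)) for pid, lo, _hi, _cnt in deltas[:top]]
```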

The above is just to show that there's a lot of infrastructure already
there that does a lot of this work, but it needs improvement. The theme of
this email is: modify what's there to make it work for you before doing
everything from scratch; otherwise we end up with a bunch of stuff that
only does what one of us wants, and is not flexible enough to do what
others may want.

>
>
> > Not the prettiest thing to read. But hey, we got the full stack of where
> > these latencies happened!
>
> Most of the time I _don't_ want full stacktraces, though!

We could easily add a feature to limit how much you want to trace. Perhaps even
a skip level. That is, add skip and depth options to the stacktrace field.

>
> That means I have a ton more output to sort through, and the data is far more
> expensive to collect.

That's what user space tools are for ;-)

>
> I don't know why it's what people go to first - see the page_owner stuff - but
> that doesn't get used much either because the output is _really hard to sort
> through_.
>
> Most of the time, just a single file and line number is all you want - and
> tracing has always made it hard to get at that.

Because we would need to store too much DWARF information in the kernel to
do so. But user space could do this for you with the function/offset
information.
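For the record, the kernel tree already ships scripts/faddr2line, which
resolves exactly the function+offset form these stack traces print. A hedged
sketch of the user-space side — the frame format is taken from the traces
above, and the vmlinux path is an assumption:

```python
import re

# Frame format as printed in the stacktrace histograms,
# e.g. "schedule+0x72/0x110" -> (symbol, offset, symbol_size).
FRAME_RE = re.compile(r"^(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)$")

def parse_frame(frame):
    m = FRAME_RE.match(frame.strip())
    if not m:
        return None
    return m.group("sym"), int(m.group("off"), 16), int(m.group("size"), 16)

def faddr2line_cmd(frame, vmlinux="vmlinux"):
    # Build the command a wrapper script would run: scripts/faddr2line
    # accepts the func+0xoff/0xsize form directly and prints file:line
    # (requires a vmlinux built with debug info).
    if parse_frame(frame) is None:
        return None
    return ["./scripts/faddr2line", vmlinux, frame.strip()]
```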

>
>
> > Yes, it adds some overhead when the events are triggered due to the
> > stacktrace code, but it's extremely useful information.
> >
> > >
> > > So, it looks like tracing has made some progress over the past 10
> > > years, but for debugging latency issues it's still not there yet in
> > > general. I
> >
> > I call BS on that statement. Just because you do not know what has been
> > added to the kernel in the last 10 years (like you had no idea about
> > seq_buf and that was added in 2014) means to me that you are totally
> > clueless on what tracing can and can not do.
> >
> > It appears to me that you are too focused on inventing your own wheel
> > that does exactly what you want before looking to see how things are
> > today. Just because something didn't fit your needs 10 years ago
> > doesn't mean that it can't fit your needs today.
>
> ...And the ad hominem attacks start.

Look, you keep making comments about the tracing infrastructure that you
clearly do not understand. And that is pretty insulting. Sorry, I'm not
sure you realize this, but those comments do turn people off and their
responses will start to become stronger.

>
> Steve, I'm not attacking you, and there's room enough in this world for
> the both of us to be doing our thing creating new and useful tools.

You seem to push back hard when people suggest improving other utilities
to suit your needs.

>
> > I'm already getting complaints from customers/users that are saying
> > there's too many tools in the toolbox already. (Do we use
> > ftrace/perf/bpf?). The idea is to have the tools using mostly the same
> > infrastructure, and not be 100% off on its own, unless there's a clear
> > reason to invent a new wheel that several people are asking for, not
> > just one or two.
>
> I would like to see more focus on usability.

Then let's make the current tools more usable. For example, the synthetic
event kernel interface is horrible. It's an awesome feature that wasn't
getting used because of the interface. This is why I created "sqlhist". It's
now really easy to create synthetic events with that tool. I agree, focus
on usability, but that doesn't always mean creating yet another tool. This
reminds me of:

https://xkcd.com/927/


>
> That means, in a best case scenario, always-on data collection that I can
> just look at, and it'll already be in the format most likely to be useful.
>
> Surely you can appreciate the usefulness of that..?

I find "runtime turn on and off" better than "always on". We have
static_branches today (aka jump labels). I would strongly suggest using
them. You get them automatically from tracepoints. Even sched_stats uses
them.

>
> Tracing started out as a tool for efficiently getting lots of data out of
> the kernel, and it's great for that. But I think your focus on the cool
> thing you built may be blinding you a bit to alternative approaches...

I actually work hard to have the tracing infrastructure help out other
approaches. perf and bpf use the ftrace infrastructure because it is
designed to be modular. Nothing is "must be the ftrace way". I'm not against
the new features you are adding, I just want you to make a little more
effort in incorporating other infrastructures (and perhaps even improving
that infrastructure) to suit your needs.

If ftrace, perf, bpf can't do what you want, take a harder look to see if
you can modify them to do so.

-- Steve

2022-09-02 00:32:15

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 11/30] mm: introduce slabobj_ext to support slab object extensions

On Thu, Sep 1, 2022 at 4:36 PM Roman Gushchin <[email protected]> wrote:
>
> On Tue, Aug 30, 2022 at 02:49:00PM -0700, Suren Baghdasaryan wrote:
> > Currently slab pages can store only vectors of obj_cgroup pointers in
> > page->memcg_data. Introduce slabobj_ext structure to allow more data
> > to be stored for each slab object. Wraps obj_cgroup into slabobj_ext
> > to support current functionality while allowing to extend slabobj_ext
> > in the future.
> >
> > Note: ideally the config dependency should be turned the other way around:
> > MEMCG should depend on SLAB_OBJ_EXT and {page|slab|folio}.memcg_data would
> > be renamed to something like {page|slab|folio}.objext_data. However doing
> > this in RFC would introduce considerable churn unrelated to the overall
> > idea, so avoiding this until v1.
>
> Hi Suren!

Hi Roman,

>
> I'd say CONFIG_MEMCG_KMEM and CONFIG_YOUR_NEW_STUFF should both depend on
> SLAB_OBJ_EXT.
> CONFIG_MEMCG_KMEM depend on CONFIG_MEMCG anyway.

Yes, I agree. I wanted to mention here that the current dependency is
incorrect and should be reworked. Having both depend on
SLAB_OBJ_EXT seems like the right approach.

>
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/memcontrol.h | 18 ++++--
> > init/Kconfig | 5 ++
> > mm/kfence/core.c | 2 +-
> > mm/memcontrol.c | 60 ++++++++++---------
> > mm/page_owner.c | 2 +-
> > mm/slab.h | 119 +++++++++++++++++++++++++------------
> > 6 files changed, 131 insertions(+), 75 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 6257867fbf95..315399f77173 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -227,6 +227,14 @@ struct obj_cgroup {
> > };
> > };
> >
> > +/*
> > + * Extended information for slab objects stored as an array in page->memcg_data
> > + * if MEMCG_DATA_OBJEXTS is set.
> > + */
> > +struct slabobj_ext {
> > + struct obj_cgroup *objcg;
> > +} __aligned(8);
>
> Why do we need this aligment requirement?

To save space by avoiding padding. However, all members today will be
pointers, so it's meaningless and we can safely drop it.

>
> > +
> > /*
> > * The memory controller data structure. The memory controller controls both
> > * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -363,7 +371,7 @@ extern struct mem_cgroup *root_mem_cgroup;
> >
> > enum page_memcg_data_flags {
> > /* page->memcg_data is a pointer to an objcgs vector */
> > - MEMCG_DATA_OBJCGS = (1UL << 0),
> > + MEMCG_DATA_OBJEXTS = (1UL << 0),
> > /* page has been accounted as a non-slab kernel page */
> > MEMCG_DATA_KMEM = (1UL << 1),
> > /* the next bit after the last actual flag */
> > @@ -401,7 +409,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
> > unsigned long memcg_data = folio->memcg_data;
> >
> > VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
> > - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
> > + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
> > VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
> >
> > return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> > @@ -422,7 +430,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
> > unsigned long memcg_data = folio->memcg_data;
> >
> > VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
> > - VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
> > + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
> > VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
> >
> > return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> > @@ -517,7 +525,7 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
> > */
> > unsigned long memcg_data = READ_ONCE(page->memcg_data);
> >
> > - if (memcg_data & MEMCG_DATA_OBJCGS)
> > + if (memcg_data & MEMCG_DATA_OBJEXTS)
> > return NULL;
> >
> > if (memcg_data & MEMCG_DATA_KMEM) {
> > @@ -556,7 +564,7 @@ static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
> > static inline bool folio_memcg_kmem(struct folio *folio)
> > {
> > VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
> > - VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
> > + VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
> > return folio->memcg_data & MEMCG_DATA_KMEM;
> > }
> >
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 532362fcfe31..82396d7a2717 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -958,6 +958,10 @@ config MEMCG
> > help
> > Provides control over the memory footprint of tasks in a cgroup.
> >
> > +config SLAB_OBJ_EXT
> > + bool
> > + depends on MEMCG
> > +
> > config MEMCG_SWAP
> > bool
> > depends on MEMCG && SWAP
> > @@ -966,6 +970,7 @@ config MEMCG_SWAP
> > config MEMCG_KMEM
> > bool
> > depends on MEMCG && !SLOB
> > + select SLAB_OBJ_EXT
> > default y
> >
> > config BLK_CGROUP
> > diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> > index c252081b11df..c0958e4a32e2 100644
> > --- a/mm/kfence/core.c
> > +++ b/mm/kfence/core.c
> > @@ -569,7 +569,7 @@ static unsigned long kfence_init_pool(void)
> > __folio_set_slab(slab_folio(slab));
> > #ifdef CONFIG_MEMCG
> > slab->memcg_data = (unsigned long)&kfence_metadata[i / 2 - 1].objcg |
> > - MEMCG_DATA_OBJCGS;
> > + MEMCG_DATA_OBJEXTS;
> > #endif
> > }
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index b69979c9ced5..3f407ef2f3f1 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2793,7 +2793,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
> > folio->memcg_data = (unsigned long)memcg;
> > }
> >
> > -#ifdef CONFIG_MEMCG_KMEM
> > +#ifdef CONFIG_SLAB_OBJ_EXT
> > /*
> > * The allocated objcg pointers array is not accounted directly.
> > * Moreover, it should not come from DMA buffer and is not readily
> > @@ -2801,38 +2801,20 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
> > */
> > #define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
> >
> > -/*
> > - * mod_objcg_mlstate() may be called with irq enabled, so
> > - * mod_memcg_lruvec_state() should be used.
> > - */
> > -static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
> > - struct pglist_data *pgdat,
> > - enum node_stat_item idx, int nr)
> > -{
> > - struct mem_cgroup *memcg;
> > - struct lruvec *lruvec;
> > -
> > - rcu_read_lock();
> > - memcg = obj_cgroup_memcg(objcg);
> > - lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > - mod_memcg_lruvec_state(lruvec, idx, nr);
> > - rcu_read_unlock();
> > -}
> > -
> > -int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
> > - gfp_t gfp, bool new_slab)
> > +int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > + gfp_t gfp, bool new_slab)
> > {
> > unsigned int objects = objs_per_slab(s, slab);
> > unsigned long memcg_data;
> > void *vec;
> >
> > gfp &= ~OBJCGS_CLEAR_MASK;
> > - vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
> > + vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > slab_nid(slab));
> > if (!vec)
> > return -ENOMEM;
> >
> > - memcg_data = (unsigned long) vec | MEMCG_DATA_OBJCGS;
> > + memcg_data = (unsigned long) vec | MEMCG_DATA_OBJEXTS;
> > if (new_slab) {
> > /*
> > * If the slab is brand new and nobody can yet access its
> > @@ -2843,7 +2825,7 @@ int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
> > } else if (cmpxchg(&slab->memcg_data, 0, memcg_data)) {
> > /*
> > * If the slab is already in use, somebody can allocate and
> > - * assign obj_cgroups in parallel. In this case the existing
> > + * assign slabobj_exts in parallel. In this case the existing
> > * objcg vector should be reused.
> > */
> > kfree(vec);
> > @@ -2853,6 +2835,26 @@ int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
> > kmemleak_not_leak(vec);
> > return 0;
> > }
> > +#endif /* CONFIG_SLAB_OBJ_EXT */
> > +
> > +#ifdef CONFIG_MEMCG_KMEM
> > +/*
> > + * mod_objcg_mlstate() may be called with irq enabled, so
> > + * mod_memcg_lruvec_state() should be used.
> > + */
> > +static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
> > + struct pglist_data *pgdat,
> > + enum node_stat_item idx, int nr)
> > +{
> > + struct mem_cgroup *memcg;
> > + struct lruvec *lruvec;
> > +
> > + rcu_read_lock();
> > + memcg = obj_cgroup_memcg(objcg);
> > + lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > + mod_memcg_lruvec_state(lruvec, idx, nr);
> > + rcu_read_unlock();
> > +}
> >
> > static __always_inline
> > struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
> > @@ -2863,18 +2865,18 @@ struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
> > * slab->memcg_data.
> > */
> > if (folio_test_slab(folio)) {
> > - struct obj_cgroup **objcgs;
> > + struct slabobj_ext *obj_exts;
> > struct slab *slab;
> > unsigned int off;
> >
> > slab = folio_slab(folio);
> > - objcgs = slab_objcgs(slab);
> > - if (!objcgs)
> > + obj_exts = slab_obj_exts(slab);
> > + if (!obj_exts)
> > return NULL;
> >
> > off = obj_to_index(slab->slab_cache, slab, p);
> > - if (objcgs[off])
> > - return obj_cgroup_memcg(objcgs[off]);
> > + if (obj_exts[off].objcg)
> > + return obj_cgroup_memcg(obj_exts[off].objcg);
> >
> > return NULL;
> > }
> > diff --git a/mm/page_owner.c b/mm/page_owner.c
> > index e4c6f3f1695b..fd4af1ad34b8 100644
> > --- a/mm/page_owner.c
> > +++ b/mm/page_owner.c
> > @@ -353,7 +353,7 @@ static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
> > if (!memcg_data)
> > goto out_unlock;
> >
> > - if (memcg_data & MEMCG_DATA_OBJCGS)
> > + if (memcg_data & MEMCG_DATA_OBJEXTS)
> > ret += scnprintf(kbuf + ret, count - ret,
> > "Slab cache page\n");
> >
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 4ec82bec15ec..c767ce3f0fe2 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -422,36 +422,94 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
> > return false;
> > }
> >
> > +#ifdef CONFIG_SLAB_OBJ_EXT
> > +
> > +static inline bool is_kmem_only_obj_ext(void)
> > +{
> > #ifdef CONFIG_MEMCG_KMEM
> > + return sizeof(struct slabobj_ext) == sizeof(struct obj_cgroup *);
> > +#else
> > + return false;
> > +#endif
> > +}
> > +
> > /*
> > - * slab_objcgs - get the object cgroups vector associated with a slab
> > + * slab_obj_exts - get the pointer to the slab object extension vector
> > + * associated with a slab.
> > * @slab: a pointer to the slab struct
> > *
> > - * Returns a pointer to the object cgroups vector associated with the slab,
> > + * Returns a pointer to the object extension vector associated with the slab,
> > * or NULL if no such vector has been associated yet.
> > */
> > -static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
> > +static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
> > {
> > unsigned long memcg_data = READ_ONCE(slab->memcg_data);
> >
> > - VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS),
> > + VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJEXTS),
> > slab_page(slab));
> > VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, slab_page(slab));
> >
> > - return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> > + return (struct slabobj_ext *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> > }
> >
> > -int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
> > - gfp_t gfp, bool new_slab);
> > -void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> > - enum node_stat_item idx, int nr);
> > +int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > + gfp_t gfp, bool new_slab);
> >
> > -static inline void memcg_free_slab_cgroups(struct slab *slab)
> > +static inline void free_slab_obj_exts(struct slab *slab)
> > {
> > - kfree(slab_objcgs(slab));
> > + struct slabobj_ext *obj_exts;
> > +
> > + if (!memcg_kmem_enabled() && is_kmem_only_obj_ext())
> > + return;
>
> Hm, not sure I understand this. If kmem is disabled and is_kmem_only_obj_ext()
> is true, shouldn't slab->memcg_data == NULL (always)?

So, the logic was to skip freeing when the only possible objects in
slab->memcg_data are "struct obj_cgroup" and kmem is disabled.
Otherwise there are other objects stored in slab->memcg_data which
have to be freed. Did I make it more complicated than it should have
been?

>
> > +
> > + obj_exts = slab_obj_exts(slab);
> > + kfree(obj_exts);
> > slab->memcg_data = 0;
> > }
> >
> > +static inline void prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
> > +{
> > + struct slab *slab;
> > +
> > + /* If kmem is the only extension then the vector will be created conditionally */
> > + if (is_kmem_only_obj_ext())
> > + return;
> > +
> > + slab = virt_to_slab(p);
> > + if (!slab_obj_exts(slab))
> > + WARN(alloc_slab_obj_exts(slab, s, flags, false),
> > + "%s, %s: Failed to create slab extension vector!\n",
> > + __func__, s->name);
> > +}
>
> This looks a bit crypric: the action is wrapped into WARN() and the rest is a set
> of (semi-)static checks. Can we, please, invert it? E.g. something like:
>
> if (slab_alloc_tracking_enabled()) {
> slab = virt_to_slab(p);
> if (!slab_obj_exts(slab))
> WARN(alloc_slab_obj_exts(slab, s, flags, false),
> "%s, %s: Failed to create slab extension vector!\n",
> __func__, s->name);
> }

Yeah, this is much more readable. Thanks for the suggestion and for
reviewing the code!

>
> The rest looks good to me.
>
> Thank you!
>

2022-09-02 00:50:54

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> I'd suggest to run something like iperf on a fast hardware. And maybe some
> io_uring stuff too. These are two places which were historically most sensitive
> to the (kernel) memory accounting speed.

I'm getting wildly inconsistent results with iperf.

io_uring-echo-server and rust_echo_bench get me:
Benchmarking: 127.0.0.1:12345
50 clients, running 512 bytes, 60 sec.

Without alloc tagging: 120547 request/sec
With: 116748 request/sec

https://github.com/frevib/io_uring-echo-server
https://github.com/haraldh/rust_echo_bench

How's that look to you? Close enough? :)

2022-09-02 01:31:34

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 01, 2022 at 06:04:46PM -0700, Roman Gushchin wrote:
> On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
> > On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> > > I'd suggest to run something like iperf on a fast hardware. And maybe some
> > > io_uring stuff too. These are two places which were historically most sensitive
> > > to the (kernel) memory accounting speed.
> >
> > I'm getting wildly inconsistent results with iperf.
> >
> > io_uring-echo-server and rust_echo_bench get me:
> > Benchmarking: 127.0.0.1:12345
> > 50 clients, running 512 bytes, 60 sec.
> >
> > Without alloc tagging: 120547 request/sec
> > With: 116748 request/sec
> >
> > https://github.com/frevib/io_uring-echo-server
> > https://github.com/haraldh/rust_echo_bench
> >
> > How's that look to you? Close enough? :)
>
> Yes, this looks good (a bit too good).

Eh, I was hoping for better :)

> I'm not that familiar with io_uring; Jens and Pavel should have a better idea
> of what and how to run (I know they've worked around the kernel memory
> accounting because of its performance in the past, which is why I suspect it
> might be an issue here as well).
>
> This is a recent optimization on the networking side:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Maybe you can try to repeat this experiment.

I'd be more interested in a synthetic benchmark, if you know of any.

2022-09-02 01:54:22

by Roman Gushchin

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> > I'd suggest to run something like iperf on a fast hardware. And maybe some
> > io_uring stuff too. These are two places which were historically most sensitive
> > to the (kernel) memory accounting speed.
>
> I'm getting wildly inconsistent results with iperf.
>
> io_uring-echo-server and rust_echo_bench get me:
> Benchmarking: 127.0.0.1:12345
> 50 clients, running 512 bytes, 60 sec.
>
> Without alloc tagging: 120547 request/sec
> With: 116748 request/sec
>
> https://github.com/frevib/io_uring-echo-server
> https://github.com/haraldh/rust_echo_bench
>
> How's that look to you? Close enough? :)

Yes, this looks good (a bit too good).

I'm not that familiar with io_uring; Jens and Pavel should have a better idea
of what and how to run (I know they've worked around the kernel memory
accounting because of its performance in the past, which is why I suspect it
might be an issue here as well).

This is a recent optimization on the networking side:
https://lore.kernel.org/linux-mm/[email protected]/

Maybe you can try to repeat this experiment.

Thanks!

2022-09-02 02:03:27

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 27/30] Code tagging based latency tracking

On Thu, 1 Sep 2022 21:35:32 -0400
Kent Overstreet <[email protected]> wrote:

> On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote:
> > If ftrace, perf, bpf can't do what you want, take a harder look to see if
> > you can modify them to do so.
>
> Maybe we can use this exchange to make both of our tools better. I like your
> histograms - the quantiles algorithm I've had for years is janky, I've been
> meaning to rip that out, I'd love to take a look at your code for that. And
> having an on/off switch is a good idea, I'll try to add that at some point.
> Maybe you got some ideas from my stuff too.
>
> I'd love to get better tracepoints for measuring latency - what I added to
> init_wait() and finish_wait() was really only a starting point. Figuring out
> the right places to measure is where I'd like to be investing my time in this
> area, and there's no reason we couldn't both be making use of that.

Yes, this is exactly what I'm talking about. I'm not against your work, I
just want you to work more with everyone to come up with ideas that can
help everyone as a whole. That's how "open source communities" are supposed
to work ;-)

The histogram and synthetic events could use some more cleanups. There are a
lot of places that can be improved in that code. But I feel the ideas
behind that code are sound. It's just a matter of making the implementation
a bit more efficient.

>
> e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but not
> all code uses that, init_wait() got me better coverage. But I've already seen
> that that misses things, too, there's more work to be done.

I picked prepare_to_wait() just because I was hacking up something quick
and thought that was "close enough" ;-)

-- Steve

2022-09-02 02:19:06

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 27/30] Code tagging based latency tracking

On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote:
> If ftrace, perf, bpf can't do what you want, take a harder look to see if
> you can modify them to do so.

Maybe we can use this exchange to make both of our tools better. I like your
histograms - the quantiles algorithm I've had for years is janky, I've been
meaning to rip that out, I'd love to take a look at your code for that. And
having an on/off switch is a good idea, I'll try to add that at some point.
Maybe you got some ideas from my stuff too.

I'd love to get better tracepoints for measuring latency - what I added to
init_wait() and finish_wait() was really only a starting point. Figuring out
the right places to measure is where I'd like to be investing my time in this
area, and there's no reason we couldn't both be making use of that.

e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but not
all code uses that, init_wait() got me better coverage. But I've already seen
that that misses things, too, there's more work to be done.

random thought: might try adding a warning in schedule() any time it's called
and codetag_time_stats_start() hasn't been called, that'll be a starting
point...

2022-09-02 12:22:08

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On 9/1/22 7:04 PM, Roman Gushchin wrote:
> On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
>> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
>>> I'd suggest to run something like iperf on a fast hardware. And maybe some
>>> io_uring stuff too. These are two places which were historically most sensitive
>>> to the (kernel) memory accounting speed.
>>
>> I'm getting wildly inconsistent results with iperf.
>>
>> io_uring-echo-server and rust_echo_bench gets me:
>> Benchmarking: 127.0.0.1:12345
>> 50 clients, running 512 bytes, 60 sec.
>>
>> Without alloc tagging: 120547 request/sec
>> With: 116748 request/sec
>>
>> https://github.com/frevib/io_uring-echo-server
>> https://github.com/haraldh/rust_echo_bench
>>
>> How's that look to you? Close enough? :)
>
> Yes, this looks good (a bit too good).
>
> I'm not that familiar with io_uring, Jens and Pavel should have a better idea
> what and how to run (I know they've workarounded the kernel memory accounting
> because of the performance in the past, this is why I suspect it might be an
> issue here as well).

io_uring isn't alloc+free intensive on a per-request basis anymore; it
would not be a good benchmark if the goal is to check for regressions in
that area.

--
Jens Axboe

2022-09-02 20:09:25

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Fri, Sep 02, 2022 at 06:02:12AM -0600, Jens Axboe wrote:
> On 9/1/22 7:04 PM, Roman Gushchin wrote:
> > On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
> >> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> >>> I'd suggest to run something like iperf on a fast hardware. And maybe some
> >>> io_uring stuff too. These are two places which were historically most sensitive
> >>> to the (kernel) memory accounting speed.
> >>
> >> I'm getting wildly inconsistent results with iperf.
> >>
> >> io_uring-echo-server and rust_echo_bench gets me:
> >> Benchmarking: 127.0.0.1:12345
> >> 50 clients, running 512 bytes, 60 sec.
> >>
> >> Without alloc tagging: 120547 request/sec
> >> With: 116748 request/sec
> >>
> >> https://github.com/frevib/io_uring-echo-server
> >> https://github.com/haraldh/rust_echo_bench
> >>
> >> How's that look to you? Close enough? :)
> >
> > Yes, this looks good (a bit too good).
> >
> > I'm not that familiar with io_uring, Jens and Pavel should have a better idea
> > what and how to run (I know they've workarounded the kernel memory accounting
> > because of the performance in the past, this is why I suspect it might be an
> > issue here as well).
>
> io_uring isn't alloc+free intensive on a per request basis anymore, it
> would not be a good benchmark if the goal is to check for regressions in
> that area.

Good to know. The benchmark is still a TCP benchmark though, so still useful.

Matthew suggested
while true; do echo 1 >/tmp/foo; rm /tmp/foo; done

I ran that on tmpfs, and the numbers with and without alloc tagging were
statistically equal - there was a fair amount of variation, it wasn't a super
controlled test, anywhere from 38-41 seconds with 100000 iterations (and alloc
tagging was some of the faster runs).

But with memcg off, it ran in 32-33 seconds. We're piggybacking on the same
mechanism memcg uses for stashing per-object pointers, so it looks like that's
the bigger cost.
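For reference, a bounded version of that loop is easier to time reproducibly than the infinite one shown above. This is a hypothetical sketch (the ITERS knob and the final echo are made up, not part of the original test; the reported runs used 100000 iterations with /tmp on tmpfs):

```shell
# Bounded variant of: while true; do echo 1 >/tmp/foo; rm /tmp/foo; done
# ITERS is a made-up knob; the runs quoted above used 100000 iterations.
N="${ITERS:-1000}"
i=0
while [ "$i" -lt "$N" ]; do
    echo 1 > /tmp/foo
    rm /tmp/foo
    i=$((i + 1))
done
echo "done: $N iterations"
```

Running it under `time` (e.g. `time ITERS=100000 sh bench.sh`) with and without alloc tagging, and again with memcg disabled, would reproduce the comparison described above.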

2022-09-02 20:11:36

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On 9/2/22 1:48 PM, Kent Overstreet wrote:
> On Fri, Sep 02, 2022 at 06:02:12AM -0600, Jens Axboe wrote:
>> On 9/1/22 7:04 PM, Roman Gushchin wrote:
>>> On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
>>>> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
>>>>> I'd suggest to run something like iperf on a fast hardware. And maybe some
>>>>> io_uring stuff too. These are two places which were historically most sensitive
>>>>> to the (kernel) memory accounting speed.
>>>>
>>>> I'm getting wildly inconsistent results with iperf.
>>>>
>>>> io_uring-echo-server and rust_echo_bench gets me:
>>>> Benchmarking: 127.0.0.1:12345
>>>> 50 clients, running 512 bytes, 60 sec.
>>>>
>>>> Without alloc tagging: 120547 request/sec
>>>> With: 116748 request/sec
>>>>
>>>> https://github.com/frevib/io_uring-echo-server
>>>> https://github.com/haraldh/rust_echo_bench
>>>>
>>>> How's that look to you? Close enough? :)
>>>
>>> Yes, this looks good (a bit too good).
>>>
>>> I'm not that familiar with io_uring, Jens and Pavel should have a better idea
>>> what and how to run (I know they've workarounded the kernel memory accounting
>>> because of the performance in the past, this is why I suspect it might be an
>>> issue here as well).
>>
>> io_uring isn't alloc+free intensive on a per request basis anymore, it
>> would not be a good benchmark if the goal is to check for regressions in
>> that area.
>
> Good to know. The benchmark is still a TCP benchmark though, so still useful.
>
> Matthew suggested
> while true; do echo 1 >/tmp/foo; rm /tmp/foo; done
>
> I ran that on tmpfs, and the numbers with and without alloc tagging were
> statistically equal - there was a fair amount of variation, it wasn't a super
> controlled test, anywhere from 38-41 seconds with 100000 iterations (and alloc
> tagging was some of the faster runs).
>
> But with memcg off, it ran in 32-33 seconds. We're piggybacking on the same
> mechanism memcg uses for stashing per-object pointers, so it looks like that's
> the bigger cost.

I've complained about memcg accounting before; the slowness of it is why
io_uring works around it by caching. Anything we account, we try NOT to do
in the fast path because of it; the slowdown is considerable.

You care about efficiency now? I thought that was relegated to
irrelevant 10M IOPS cases.

--
Jens Axboe

2022-09-02 20:18:03

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Fri, Sep 02, 2022 at 01:53:53PM -0600, Jens Axboe wrote:
> I've complained about memcg accounting before, the slowness of it is why
> io_uring works around it by caching. Anything we account we try NOT do
> in the fast path because of it, the slowdown is considerable.

I'm with you on that, it definitely raises an eyebrow.

> You care about efficiency now? I thought that was relegated to
> irrelevant 10M IOPS cases.

I always did, it's just not the only thing I care about.

2022-09-02 20:44:48

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On 9/2/22 2:05 PM, Kent Overstreet wrote:
> On Fri, Sep 02, 2022 at 01:53:53PM -0600, Jens Axboe wrote:
>> I've complained about memcg accounting before, the slowness of it is why
>> io_uring works around it by caching. Anything we account we try NOT do
>> in the fast path because of it, the slowdown is considerable.
>
> I'm with you on that, it definitely raises an eyebrow.
>
>> You care about efficiency now? I thought that was relegated to
>> irrelevant 10M IOPS cases.
>
> I always did, it's just not the only thing I care about.

It's not the only thing anyone cares about.

--
Jens Axboe

2022-09-05 01:56:50

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <[email protected]> wrote:
>
> On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko <[email protected]> wrote:
> [...]
> > > So I find Peter's question completely appropriate while your response to
> > > that not so much! Maybe ftrace is not the right tool for the intented
> > > job. Maybe there are other ways and it would be really great to show
> > > that those have been evaluated and they are not suitable for a), b) and
> > > c) reasons.
> >
> > That's fair.
> > For memory tracking I looked into using kmemleak and page_owner which
> > can't match the required functionality at an overhead acceptable for
> > production and pre-production testing environments.
>
> Being more specific would be really helpful. Especially when your cover
> letter suggests that you rely on page_owner/memcg metadata as well to
> match allocation and their freeing parts.
>
> > traces + BPF I
> > haven't evaluated myself but heard from other members of my team who
> > tried using that in production environment with poor results. I'll try
> > to get more specific information on that.
>
> That would be helpful as well.
>
> > > E.g. Oscar has been working on extending page_ext to track number of
> > > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> > > it can help in environments where page_ext can be enabled and it is
> > > completely non-intrusive to the MM code.
> >
> > Thanks for pointing out this work. I'll need to review and maybe
> > profile it before making any claims.
> >
> > >
> > > If the page_ext overhead is not desirable/acceptable then I am sure
> > > there are other options. E.g. kprobes/LivePatching framework can hook
> > > into functions and alter their behavior. So why not use that for data
> > > collection? Has this been evaluated at all?
> >
> > I'm not sure how I can hook into say alloc_pages() to find out where
> > it was called from without capturing the call stack (which would
> > introduce an overhead at every allocation). Would love to discuss this
> > or other alternatives if they can be done with low enough overhead.
>
> Yes, tracking back the call trace would be really needed. The question
> is whether this is really prohibitively expensive. How much overhead are
> we talking about? There is no free lunch here, really. You either have
> the overhead during runtime when the feature is used or on the source
> code level for all the future development (with a maze of macros and
> wrappers).

As promised, I profiled a simple test that repeatedly makes 10
allocations/frees in a loop and measured the overheads of code tagging,
call stack capturing, and tracing+BPF for page and slab allocations.
Summary:

Page allocations (overheads are compared to get_free_pages() duration):
6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
8.8% lookup_page_ext
1237% call stack capture
139% tracepoint with attached empty BPF program

Slab allocations (overheads are compared to __kmalloc() duration):
With CONFIG_MEMCG_KMEM=y
39% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
55% get_slab_tag_ref
3.9% __ksize
3027% call stack capture
397% tracepoint with attached empty BPF program

With CONFIG_MEMCG_KMEM=n
26% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
72% get_slab_tag_ref
7.4% __ksize
2789% call stack capture
345% tracepoint with attached empty BPF program
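To make these figures concrete: each percentage is the extra time spent in an instrumentation step, expressed relative to the duration of the baseline allocator call, so 3027% means roughly 30x the cost of __kmalloc itself. A tiny sketch of the arithmetic, using made-up durations rather than the measured ones:

```python
def relative_overhead(extra_ns: float, baseline_ns: float) -> float:
    """Extra time spent in an instrumentation step, as a percentage
    of the baseline allocator call duration."""
    return 100.0 * extra_ns / baseline_ns

# Hypothetical per-call durations in nanoseconds (illustrative only,
# not the measured values):
baseline_kmalloc = 100.0
codetag_extra = 39.0         # counter manipulation
stack_capture_extra = 3027.0 # full call stack capture

print(f"codetag: {relative_overhead(codetag_extra, baseline_kmalloc):.1f}%")
print(f"stack:   {relative_overhead(stack_capture_extra, baseline_kmalloc):.1f}%")
```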

Details:
__get_free_pages is used as the page allocation duration baseline
__kmalloc is used as the slab allocation duration baseline

1. Profile with instrumented page allocator
|--50.13%--my__get_free_page
| |
| |--38.99%--_get_free_pages
| | |
| | |--34.75%--__alloc_pages
| | | |
| | | |--27.59%--get_page_from_freelist
| | |
| | --3.98%--_alloc_pages
| | |
| | --0.53%--policy_node
| |
| |--3.45%--lookup_page_ext
| |
| |--1.59%--__lazy_percpu_counter_add
| | |
| | --0.80%--pcpu_alloc
| | memset_orig
| |
| --1.06%--__alloc_tag_add
| |
| --0.80%--__lazy_percpu_counter_add
|
|--35.28%--free_unref_page
| |
| |--23.08%--_raw_spin_unlock_irqrestore
| |
| |--2.39%--preempt_count_add
| | |
| | --0.80%--in_lock_functions
| |
| |--1.59%--free_pcp_prepare
| |
| |--1.33%--preempt_count_sub
| |
| --0.80%--check_preemption_disabled
|
|--4.24%--__free_pages
|
--1.59%--free_pages


2. Profile with non-instrumented page allocator and call stack capturing
|--84.18%--my__get_free_page
| |
| --83.91%--stack_depot_capture_stack
| |
| |--77.99%--stack_trace_save
| | |
| | --77.53%--arch_stack_walk
| | |
| | |--37.17%--unwind_next_frame
| | | |
| | | |--8.44%--__orc_find
| | | |
| | |--10.57%--stack_trace_consume_entry
| | |
| | --9.64%--unwind_get_return_address
| |
| --5.78%--__stack_depot_save
|
|--6.78%--__get_free_pages
| |
| |--5.85%--__alloc_pages
| | |
| | --3.86%--get_page_from_freelist
| | |
| | --1.33%--_raw_spin_unlock_irqrestore
| |
| --0.80%--alloc_pages
|
|--5.19%--free_unref_page
| |
| |--2.73%--_raw_spin_unlock_irqrestore
| |
| --0.60%--free_pcp_prepare
|
--0.73%--__free_pages


3. Profile with non-instrumented page allocator and BPF attached to tracepoint
|--42.42%--my__get_free_page
| |
| --38.53%--perf_trace_kmem_alloc
| |
| |--25.76%--perf_trace_run_bpf_submit
| | |
| | |--21.86%--trace_call_bpf
| | | |
| | | |--4.76%--migrate_enable
| | | |
| | | |--4.55%--migrate_disable
| | | |
| | | |--3.03%--check_preemption_disabled
| | | |
| | | |--0.65%--__this_cpu_preempt_check
| | | |
| | | --0.65%--__rcu_read_unlock
| | |
| | --0.87%--check_preemption_disabled
| |
| |--8.01%--perf_trace_buf_alloc
| | |
| | |--3.68%--perf_swevent_get_recursion_context
| | | |
| | | --0.87%--check_preemption_disabled
| | |
| | --1.30%--check_preemption_disabled
| |
| --0.87%--check_preemption_disabled
|
|--27.71%--__get_free_pages
| |
| |--23.38%--__alloc_pages
| | |
| | --17.75%--get_page_from_freelist
| | |
| | |--8.66%--_raw_spin_unlock_irqrestore
| | | |
| | | --1.95%--preempt_count_sub
| | |
| | --1.08%--preempt_count_add
| |
| --4.33%--alloc_pages
| |
| |--0.87%--policy_node
| |
| --0.65%--policy_nodemask
|
|--15.37%--free_unref_page
| |
| |--6.71%--_raw_spin_unlock_irqrestore
| |
| |--1.52%--check_preemption_disabled
| |
| |--0.65%--free_pcp_prepare
| |
| --0.65%--preempt_count_add
|--4.98%--__free_pages


4. Profile with instrumented slab allocator CONFIG_MEMCG_KMEM=y
|--51.28%--my__get_free_page
| |
| |--21.79%--__kmalloc
| | |
| | |--3.42%--memcg_slab_post_alloc_hook
| | |
| | |--1.71%--kmalloc_slab
| | |
| | --0.85%--should_failslab
| |
| |--11.97%--get_slab_tag_ref
| |
| |--5.56%--__alloc_tag_add
| | |
| | --2.56%--__lazy_percpu_counter_add
| |
| |--2.99%--__lazy_percpu_counter_add
| |
| --0.85%--__ksize
|
--35.90%--kfree
|
|--13.68%--get_slab_tag_ref
|
|--6.41%--__alloc_tag_sub
| |
| --4.70%--__lazy_percpu_counter_add
|
--2.14%--__ksize


5. Profile with non-instrumented slab allocator and call stack
capturing CONFIG_MEMCG_KMEM=y
|--91.50%--my__get_free_page
| |
| --91.13%--stack_depot_capture_stack
| |
| |--85.48%--stack_trace_save
| | |
| | --85.12%--arch_stack_walk
| | |
| | |--40.54%--unwind_next_frame
| | |
| | |--14.30%--__unwind_start
| | |
| | |--11.95%--unwind_get_return_address
| | |
| | --10.48%--stack_trace_consume_entry
| |
| --4.99%--__stack_depot_save
| |
| --0.66%--filter_irq_stacks
|
|--3.01%--__kmalloc
|
|--2.05%--kfree

6. Profile with non-instrumented slab allocator and BPF attached to a
tracepoint CONFIG_MEMCG_KMEM=y
|--72.39%--__kmalloc
| |
| |--57.84%--perf_trace_kmem_alloc
| | |
| | |--38.06%--perf_trace_run_bpf_submit
| | | |
| | | --33.96%--trace_call_bpf
| | | |
| | | |--10.07%--migrate_disable
| | | |
| | | |--4.85%--migrate_enable
| | | |
| | | |--4.10%--check_preemption_disabled
| | | |
| | | |--1.87%--__rcu_read_unlock
| | | |
| | | --0.75%--__rcu_read_lock
| | |
| | --9.70%--perf_trace_buf_alloc
| | |
| | |--2.99%--perf_swevent_get_recursion_context
| | |
| | |--1.12%--check_preemption_disabled
| | |
| | --0.75%--debug_smp_processor_id
| |
| |--2.24%--kmalloc_slab
| |
| |--1.49%--memcg_slab_post_alloc_hook
| |
| --1.12%--__cond_resched
|
|--7.84%--kfree


7. Profile with instrumented slab allocator CONFIG_MEMCG_KMEM=n
|--49.39%--my__get_free_page
| |
| |--22.04%--__kmalloc
| | |
| | |--3.27%--kmalloc_slab
| | |
| | --0.82%--asm_sysvec_apic_timer_interrupt
| | sysvec_apic_timer_interrupt
| | __irq_exit_rcu
| | __softirqentry_text_start
| |
| |--15.92%--get_slab_tag_ref
| |
| |--3.27%--__alloc_tag_add
| | |
| | --2.04%--__lazy_percpu_counter_add
| |
| --2.45%--__lazy_percpu_counter_add
|
|--35.51%--kfree
| |
| |--13.88%--get_slab_tag_ref
| |
| |--11.84%--__alloc_tag_sub
| | |
| | --5.31%--__lazy_percpu_counter_add
| |
| --1.63%--__ksize

8. Profile with non-instrumented slab allocator and call stack
capturing CONFIG_MEMCG_KMEM=n
|--91.70%--my__get_free_page
| |
| --91.48%--stack_depot_capture_stack
| |
| |--85.29%--stack_trace_save
| | |
| | --85.07%--arch_stack_walk
| | |
| | |--45.23%--unwind_next_frame
| | |
| | |--12.89%--__unwind_start
| | |
| | |--10.20%--unwind_get_return_address
| | |
| | --10.12%--stack_trace_consume_entry
| |
| --5.75%--__stack_depot_save
| |
| --0.87%--filter_irq_stacks
|
|--3.28%--__kmalloc
|
--1.89%--kfree

9. Profile with non-instrumented slab allocator and BPF attached to a
tracepoint CONFIG_MEMCG_KMEM=n
|--71.65%--__kmalloc
| |
| |--55.56%--perf_trace_kmem_alloc
| | |
| | |--38.31%--perf_trace_run_bpf_submit
| | | |
| | | |--31.80%--trace_call_bpf
| | | | |
| | | | |--9.96%--migrate_enable
| | | | |
| | | | |--4.98%--migrate_disable
| | | | |
| | | | |--1.92%--check_preemption_disabled
| | | | |
| | | | |--1.92%--__rcu_read_unlock
| | | | |
| | | | --1.15%--__rcu_read_lock
| | | |
| | | --0.77%--check_preemption_disabled
| | |
| | --11.11%--perf_trace_buf_alloc
| | |
| | --4.98%--perf_swevent_get_recursion_context
| | |
| | --1.53%--check_preemption_disabled
| |
| |--2.68%--kmalloc_slab
| |
| --1.15%--__cond_resched
|
--9.58%--kfree


>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

2022-09-05 08:25:07

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <[email protected]> wrote:
[...]
> > Yes, tracking back the call trace would be really needed. The question
> > is whether this is really prohibitively expensive. How much overhead are
> > we talking about? There is no free lunch here, really. You either have
> > the overhead during runtime when the feature is used or on the source
> > code level for all the future development (with a maze of macros and
> > wrappers).
>
> As promised, I profiled a simple code that repeatedly makes 10
> allocations/frees in a loop and measured overheads of code tagging,
> call stack capturing and tracing+BPF for page and slab allocations.
> Summary:
>
> Page allocations (overheads are compared to get_free_pages() duration):
> 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> 8.8% lookup_page_ext
> 1237% call stack capture
> 139% tracepoint with attached empty BPF program

Yes, I am not surprised that call stack capturing is really
expensive compared to the allocator fast path (which is highly
optimized, and I suspect that with a 10 allocation/free loop you mostly
get your memory from the pcp lists). Is this overhead still _that_
visible for somewhat less micro-optimized workloads which have to take
slow paths as well?

Also what kind of stack unwinder is configured (I guess ORC)? This is
not my area but from what I remember the unwinder overhead varies
between ORC and FP.

And just to make it clear: I do realize that some overhead from stack
unwinding is unavoidable, and code tagging would logically have lower
overhead as it performs much less work. But the main point is whether
our existing stack unwinding approach is really prohibitively expensive
to use for debugging purposes on production systems. I might
misremember, but I recall people having bigger concerns with page_owner's
memory footprint than with the actual stack unwinder overhead.
--
Michal Hocko
SUSE Labs

2022-09-05 09:25:12

by Marco Elver

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Mon, 5 Sept 2022 at 10:12, Michal Hocko <[email protected]> wrote:
> On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <[email protected]> wrote:
> [...]
> > > Yes, tracking back the call trace would be really needed. The question
> > > is whether this is really prohibitively expensive. How much overhead are
> > > we talking about? There is no free lunch here, really. You either have
> > > the overhead during runtime when the feature is used or on the source
> > > code level for all the future development (with a maze of macros and
> > > wrappers).
> >
> > As promised, I profiled a simple code that repeatedly makes 10
> > allocations/frees in a loop and measured overheads of code tagging,
> > call stack capturing and tracing+BPF for page and slab allocations.
> > Summary:
> >
> > Page allocations (overheads are compared to get_free_pages() duration):
> > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > 8.8% lookup_page_ext
> > 1237% call stack capture
> > 139% tracepoint with attached empty BPF program
>
> Yes, I am not surprised that the call stack capturing is really
> expensive comparing to the allocator fast path (which is really highly
> optimized and I suspect that with 10 allocation/free loop you mostly get
> your memory from the pcp lists). Is this overhead still _that_ visible
> for somehow less microoptimized workloads which have to take slow paths
> as well?
>
> Also what kind of stack unwinder is configured (I guess ORC)? This is
> not my area but from what I remember the unwinder overhead varies
> between ORC and FP.
>
> And just to make it clear. I do realize that an overhead from the stack
> unwinding is unavoidable. And code tagging would logically have lower
> overhead as it performs much less work. But the main point is whether
> our existing stack unwiding approach is really prohibitively expensive
> to be used for debugging purposes on production systems. I might
> misremember but I recall people having bigger concerns with page_owner
> memory footprint than the actual stack unwinder overhead.

This is just to point out that we've also been looking at cheaper
collection of the stack trace (for KASAN and other sanitizers). The
cheapest way to unwind the stack would be a system with "shadow call
stack" enabled. With compiler support it's available on arm64, see
CONFIG_SHADOW_CALL_STACK. For x86 the hope is that at one point the
kernel will support CET, which newer Intel and AMD CPUs support.
Collecting the call stack would then be a simple memcpy.

2022-09-05 10:06:18

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu 01-09-22 16:15:02, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 12:39:11PM -0700, Suren Baghdasaryan wrote:
> > kmemleak is known to be slow and it's even documented [1], so I hope I
> > can skip that part. For page_owner to provide the comparable
> > information we would have to capture the call stacks for all page
> > allocations unlike our proposal which allows to do that selectively
> > for specific call sites. I'll post the overhead numbers of call stack
> > capturing once I'm finished with profiling the latest code, hopefully
> > sometime tomorrow, in the worst case after the long weekend.
>
> To expand on this further: we're stashing a pointer to the alloc_tag, which is
> defined at the allocation callsite. That's how we're able to decrement the
> proper counter on free, and why this beats any tracing based approach - with
> tracing you'd instead have to correlate allocate/free events. Ouch.
>
> > > Yes, tracking back the call trace would be really needed. The question
> > > is whether this is really prohibitively expensive. How much overhead are
> > > we talking about? There is no free lunch here, really. You either have
> > > the overhead during runtime when the feature is used or on the source
> > > code level for all the future development (with a maze of macros and
> > > wrappers).
>
> The full call stack is really not what you want in most applications - that's
> what people think they want at first, and why page_owner works the way it does,
> but it turns out that then combining all the different but related stack traces
> _sucks_ (so why were you saving them in the first place?), and then you have to
> do a separate memory allocate for each stack track, which destroys performance.

I do agree that the full stack trace is likely not what you need. But
the portion of the stack that you need is not really clear because the
relevant part might be on a different level of the calltrace depending
on the allocation site. Take this as an example:
{traverse, seq_read_iter, single_open_size}->seq_buf_alloc -> kvmalloc -> kmalloc

This whole part of the stack is not really all that interesting, and you
would have to place the tag pretty high at the API layer to catch something
useful. And please remember that the seq_file interface is heavily used
throughout the kernel. I wouldn't suspect seq_file itself to be buggy
(that is well-exercised code), but its users can botch things, and that is
where the leak would happen. There are many other examples like that
where the allocation is done at a lib/infrastructure layer (the sysfs
framework, mempools, network pool allocators and whatnot). We do care
about those users, really. Ad-hoc pool allocators built on top of the
core MM allocators are not uncommon. And I am really skeptical that we
want to add all the source-level tagging changes to each and
every one of them.

This is really my main concern about this whole work. Not only does it add a
considerable maintenance burden to the core MM (because it adds complexity on
top of our existing allocator layers), but it would also need to spread beyond
MM to be useful, because it is usually outside of MM where leaks happen.
--
Michal Hocko
SUSE Labs

2022-09-05 15:21:25

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Sun, 4 Sep 2022 18:32:58 -0700
Suren Baghdasaryan <[email protected]> wrote:

> Page allocations (overheads are compared to get_free_pages() duration):
> 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> 8.8% lookup_page_ext
> 1237% call stack capture
> 139% tracepoint with attached empty BPF program

Have you tried tracepoint with custom callback?

static void my_callback(void *data, unsigned long call_site,
                        const void *ptr, struct kmem_cache *s,
                        size_t bytes_req, size_t bytes_alloc,
                        gfp_t gfp_flags)
{
        struct my_data_struct *my_data = data;

        /* do whatever, e.g. update counters in my_data */
}

[..]
register_trace_kmem_alloc(my_callback, my_data);

Now the my_callback function will be called directly every time the
kmem_alloc tracepoint is hit.

This avoids the perf and BPF overhead.

-- Steve

2022-09-05 18:23:00

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Mon, Sep 5, 2022 at 1:58 AM Marco Elver <[email protected]> wrote:
>
> On Mon, 5 Sept 2022 at 10:12, Michal Hocko <[email protected]> wrote:
> > On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> > > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <[email protected]> wrote:
> > [...]
> > > > Yes, tracking back the call trace would be really needed. The question
> > > > is whether this is really prohibitively expensive. How much overhead are
> > > > we talking about? There is no free lunch here, really. You either have
> > > > the overhead during runtime when the feature is used or on the source
> > > > code level for all the future development (with a maze of macros and
> > > > wrappers).
> > >
> > > As promised, I profiled a simple code that repeatedly makes 10
> > > allocations/frees in a loop and measured overheads of code tagging,
> > > call stack capturing and tracing+BPF for page and slab allocations.
> > > Summary:
> > >
> > > Page allocations (overheads are compared to get_free_pages() duration):
> > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > > 8.8% lookup_page_ext
> > > 1237% call stack capture
> > > 139% tracepoint with attached empty BPF program
> >
> > Yes, I am not surprised that the call stack capturing is really
> > expensive comparing to the allocator fast path (which is really highly
> > optimized and I suspect that with 10 allocation/free loop you mostly get
> > your memory from the pcp lists). Is this overhead still _that_ visible
> > for somehow less microoptimized workloads which have to take slow paths
> > as well?
> >
> > Also what kind of stack unwinder is configured (I guess ORC)? This is
> > not my area but from what I remember the unwinder overhead varies
> > between ORC and FP.
> >
> > And just to make it clear. I do realize that an overhead from the stack
> > unwinding is unavoidable. And code tagging would logically have lower
> > overhead as it performs much less work. But the main point is whether
> > our existing stack unwiding approach is really prohibitively expensive
> > to be used for debugging purposes on production systems. I might
> > misremember but I recall people having bigger concerns with page_owner
> > memory footprint than the actual stack unwinder overhead.
>
> This is just to point out that we've also been looking at cheaper
> collection of the stack trace (for KASAN and other sanitizers). The
> cheapest way to unwind the stack would be a system with "shadow call
> stack" enabled. With compiler support it's available on arm64, see
> CONFIG_SHADOW_CALL_STACK. For x86 the hope is that at one point the
> kernel will support CET, which newer Intel and AMD CPUs support.
> Collecting the call stack would then be a simple memcpy.

Thanks for the note Marco! I'll check out the CONFIG_SHADOW_CALL_STACK
on Android.

2022-09-05 18:39:30

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Mon, Sep 5, 2022 at 1:12 AM Michal Hocko <[email protected]> wrote:
>
> On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <[email protected]> wrote:
> [...]
> > > Yes, tracking back the call trace would be really needed. The question
> > > is whether this is really prohibitively expensive. How much overhead are
> > > we talking about? There is no free lunch here, really. You either have
> > > the overhead during runtime when the feature is used or on the source
> > > code level for all the future development (with a maze of macros and
> > > wrappers).
> >
> > As promised, I profiled a simple program that repeatedly makes 10
> > allocations/frees in a loop and measured the overheads of code tagging,
> > call stack capturing, and tracing+BPF for page and slab allocations.
> > Summary:
> >
> > Page allocations (overheads are compared to get_free_pages() duration):
> > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > 8.8% lookup_page_ext
> > 1237% call stack capture
> > 139% tracepoint with attached empty BPF program
>
> Yes, I am not surprised that the call stack capturing is really
> expensive compared to the allocator fast path (which is really highly
> optimized and I suspect that with 10 allocation/free loop you mostly get
> your memory from the pcp lists). Is this overhead still _that_ visible
> for somehow less microoptimized workloads which have to take slow paths
> as well?

Correct, it's a comparison with the allocation fast path, so in a
sense it represents the worst-case scenario. However, at the same time
the measurements are fair because they measure the overheads against
the same meaningful baseline, and can therefore be used for comparison.

>
> Also what kind of stack unwinder is configured (I guess ORC)? This is
> not my area but from what I remember the unwinder overhead varies
> between ORC and FP.

I used whatever the default is and didn't try other mechanisms. I
don't think the difference would be orders of magnitude better though.

>
> And just to make it clear. I do realize that an overhead from the stack
> unwinding is unavoidable. And code tagging would logically have lower
> overhead as it performs much less work. But the main point is whether
> our existing stack unwinding approach is really prohibitively expensive
> to be used for debugging purposes on production systems. I might
> misremember but I recall people having bigger concerns with page_owner
> memory footprint than the actual stack unwinder overhead.

That's one of those questions which are very difficult to answer (if
it is possible at all) because the answer depends on the usage
scenario. If the workload allocates frequently then the added overhead
will likely affect it; otherwise it might not even be noticeable. In
general, in
pre-production testing we try to minimize the difference in
performance and memory profiles between the software we are testing
and the production one. From that point of view, the smaller the
overhead, the better. I know it's kinda obvious but unfortunately I
have no better answer to that question.

For the memory overhead, in my early internal proposal with assumption
of 10000 instrumented allocation call sites, I've made some
calculations for an 8GB 8-core system (quite typical for Android) and
ended up with the following:

                         per-cpu counters   atomic counters
page_ext references          16MB               16MB
slab object references       10.5MB             10.5MB
alloc_tags                   900KB              312KB
Total memory overhead        27.4MB             26.8MB

so, about 0.34% of the total memory. Our implementation has changed
since then and the number might not be completely correct but it
should be in the ballpark.
I just checked the number of instrumented calls that we currently have
in 6.0-rc3 built with defconfig and it's 165 page allocation and
2684 slab allocation sites. I readily accept that we are probably
missing some allocations and additional modules can also contribute to
these numbers, but my guess is that it's still less than the 10000 I
used in my calculations.
I don't claim that 0.34% overhead is low enough to be always
acceptable, just posting the numbers to provide some reference points.

> --
> Michal Hocko
> SUSE Labs

2022-09-05 18:41:02

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Mon, Sep 5, 2022 at 8:06 AM Steven Rostedt <[email protected]> wrote:
>
> On Sun, 4 Sep 2022 18:32:58 -0700
> Suren Baghdasaryan <[email protected]> wrote:
>
> > Page allocations (overheads are compared to get_free_pages() duration):
> > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > 8.8% lookup_page_ext
> > 1237% call stack capture
> > 139% tracepoint with attached empty BPF program
>
> Have you tried tracepoint with custom callback?
>
> static void my_callback(void *data, unsigned long call_site,
>                         const void *ptr, struct kmem_cache *s,
>                         size_t bytes_req, size_t bytes_alloc,
>                         gfp_t gfp_flags)
> {
>         struct my_data_struct *my_data = data;
>
>         { do whatever }
> }
>
> [..]
> register_trace_kmem_alloc(my_callback, my_data);
>
> Now the my_callback function will be called directly every time the
> kmem_alloc tracepoint is hit.
>
> This avoids that perf and BPF overhead.

Haven't tried that yet but will do. Thanks for the reference code!

>
> -- Steve

2022-09-05 19:18:38

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Aug 31, 2022, at 3:19 AM, Mel Gorman <[email protected]> wrote:

> On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
>> On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
>>> On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
>>>> ===========================
>>>> Code tagging framework
>>>> ===========================
>>>> Code tag is a structure identifying a specific location in the source code
>>>> which is generated at compile time and can be embedded in an application-
>>>> specific structure. Several applications of code tagging are included in
>>>> this RFC, such as memory allocation tracking, dynamic fault injection,
>>>> latency tracking and improved error code reporting.
>>>> Basically, it takes the old trick of "define a special elf section for
>>>> objects of a given type so that we can iterate over them at runtime" and
>>>> creates a proper library for it.
>>>
>>> I might be super dense this morning, but what!? I've skimmed through the
>>> set and I don't think I get it.
>>>
>>> What does this provide that ftrace/kprobes don't already allow?
>>
>> You're kidding, right?
>
> It's a valid question. From the description, its main addition that would
> be hard to do with ftrace or probes is catching where an error code is
> returned. A secondary addition would be catching all historical state and
> not just state since the tracing started.
>
> It's also unclear *who* would enable this. It looks like it would mostly
> have value during the development stage of an embedded platform to track
> kernel memory usage on a per-application basis in an environment where it
> may be difficult to setup tracing and tracking. Would it ever be enabled
> in production? Would a distribution ever enable this? If it's enabled, any
> overhead cannot be disabled/enabled at run or boot time so anyone enabling
> this would carry the cost without necessarily ever consuming the data.
>
> It might be an ease-of-use thing. Gathering the information from traces
> is tricky and would need combining multiple different elements and that
> is development effort but not impossible.
>
> In any case, asking for an explanation as to why equivalent functionality
> cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.

I would note that I have a solution in the making (which pretty much works)
for this matter, and it does not require any kernel changes. It produces a
call stack that leads to the code that led to the syscall failure.

The way it works is by using seccomp to trap syscall failures, and then
setting ftrace function filters and kprobes on conditional branches,
indirect branch targets and function returns.

Using symbolic execution, backtracking is performed and the condition that
led to the failure is then pinpointed.

I hope to share the code soon.

2022-09-05 19:25:58

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Mon, 5 Sep 2022 11:44:55 -0700
Nadav Amit <[email protected]> wrote:

> I would note that I have a solution in the making (which pretty much works)
> for this matter, and it does not require any kernel changes. It produces a
> call stack that leads to the code that led to the syscall failure.
>
> The way it works is by using seccomp to trap syscall failures, and then
> setting ftrace function filters and kprobes on conditional branches,
> indirect branch targets and function returns.

Ooh nifty!

>
> Using symbolic execution, backtracking is performed and the condition that
> led to the failure is then pinpointed.
>
> I hope to share the code soon.

Looking forward to it.

-- Steve

2022-09-05 20:55:15

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Mon, Sep 05, 2022 at 11:08:21AM -0700, Suren Baghdasaryan wrote:
> On Mon, Sep 5, 2022 at 8:06 AM Steven Rostedt <[email protected]> wrote:
> >
> > On Sun, 4 Sep 2022 18:32:58 -0700
> > Suren Baghdasaryan <[email protected]> wrote:
> >
> > > Page allocations (overheads are compared to get_free_pages() duration):
> > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > > 8.8% lookup_page_ext
> > > 1237% call stack capture
> > > 139% tracepoint with attached empty BPF program
> >
> > Have you tried tracepoint with custom callback?
> >
> > static void my_callback(void *data, unsigned long call_site,
> >                         const void *ptr, struct kmem_cache *s,
> >                         size_t bytes_req, size_t bytes_alloc,
> >                         gfp_t gfp_flags)
> > {
> >         struct my_data_struct *my_data = data;
> >
> >         { do whatever }
> > }
> >
> > [..]
> > register_trace_kmem_alloc(my_callback, my_data);
> >
> > Now the my_callback function will be called directly every time the
> > kmem_alloc tracepoint is hit.
> >
> > This avoids that perf and BPF overhead.
>
> Haven't tried that yet but will do. Thanks for the reference code!

Is it really worth the effort of benchmarking tracing API overhead here?

The main cost of a tracing-based approach is going to be the data structure
for remembering outstanding allocations so that free events can be matched to
the appropriate callsite. Regardless of whether it's done with BPF or by
attaching to the tracepoints directly, that's going to be the main overhead.

2022-09-05 22:38:24

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Mon, 5 Sep 2022 16:42:29 -0400
Kent Overstreet <[email protected]> wrote:

> > Haven't tried that yet but will do. Thanks for the reference code!
>
> Is it really worth the effort of benchmarking tracing API overhead here?
>
> The main cost of a tracing-based approach is going to be the data structure
> for remembering outstanding allocations so that free events can be matched to
> the appropriate callsite. Regardless of whether it's done with BPF or by
> attaching to the tracepoints directly, that's going to be the main overhead.

The point I was making here is that you do not need your own hooking
mechanism. You can get the information directly by attaching to the
tracepoint.

> > static void my_callback(void *data, unsigned long call_site,
> >                         const void *ptr, struct kmem_cache *s,
> >                         size_t bytes_req, size_t bytes_alloc,
> >                         gfp_t gfp_flags)
> > {
> >         struct my_data_struct *my_data = data;
> >
> >         { do whatever }
> > }

The "do whatever" is anything you want to do.

Or is the data structure you create with this approach going to be too much
overhead? How hard would a hash or binary-search lookup be?


-- Steve

2022-09-05 23:56:09

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Mon, Sep 05, 2022 at 10:49:38AM +0200, Michal Hocko wrote:
> This is really my main concern about this whole work. Not only it adds a
> considerable maintenance burden to the core MM because

[citation needed]

> it adds on top of
> our existing allocator layers complexity but it would need to spread beyond
> MM to be useful because it is usually outside of MM where leaks happen.

If you want the tracking to happen at a different level of the call stack, just
call _kmalloc() directly and call alloc_tag_add()/sub() yourself.

2022-09-06 00:11:12

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Mon, Sep 05, 2022 at 06:16:50PM -0400, Steven Rostedt wrote:
> On Mon, 5 Sep 2022 16:42:29 -0400
> Kent Overstreet <[email protected]> wrote:
>
> > > Haven't tried that yet but will do. Thanks for the reference code!
> >
> > Is it really worth the effort of benchmarking tracing API overhead here?
> >
> > The main cost of a tracing-based approach is going to be the data structure
> > for remembering outstanding allocations so that free events can be matched to
> > the appropriate callsite. Regardless of whether it's done with BPF or by
> > attaching to the tracepoints directly, that's going to be the main overhead.
>
> The point I was making here is that you do not need your own hooking
> mechanism. You can get the information directly by attaching to the
> tracepoint.
>
> > > static void my_callback(void *data, unsigned long call_site,
> > >                         const void *ptr, struct kmem_cache *s,
> > >                         size_t bytes_req, size_t bytes_alloc,
> > >                         gfp_t gfp_flags)
> > > {
> > >         struct my_data_struct *my_data = data;
> > >
> > >         { do whatever }
> > > }
>
> The "do whatever" is anything you want to do.
>
> Or is the data structure you create with this approach going to be too much
> overhead? How hard would a hash or binary-search lookup be?

If you don't think it's hard, go ahead and show us.

2022-09-06 07:43:44

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Mon 05-09-22 19:46:49, Kent Overstreet wrote:
> On Mon, Sep 05, 2022 at 10:49:38AM +0200, Michal Hocko wrote:
> > This is really my main concern about this whole work. Not only it adds a
> > considerable maintenance burden to the core MM because
>
> [citation needed]

I thought this was clear from the email content (the part you haven't
quoted here). But let me be explicit one more time for you.

I hope we can agree that in order for this kind of tracking to be useful
you need to cover _callers_ of the allocator or in the ideal world
the users/owner of the tracked memory (the latter is sometimes much
harder/impossible to track when the memory is handed over from one peer
to another).

It is not particularly useful IMO to see that a large portion of the
memory has been allocated by say vmalloc or kvmalloc, right? How
much does it really tell you that a lot of memory has been allocated
by kvmalloc or vmalloc? Yet, neither of the two is handled by the
proposed tracking and it would require additional code to be added and
_maintained_ to cover them. But that would still be far from complete;
we have the bulk allocator, mempools, etc.

If that was not enough some of those allocators are used by library code
like seq_file, networking pools, module loader and whatnot. So this
grows and effectively doubles the API space for many allocators as they
need both normal API and the one which can pass the tracking context
down the path to prevent double tracking. Right?

This in my book is a considerable maintenance burden. And especially for
the MM subsystem this means additional burden because we have a very
rich allocators APIs.

You are absolutely right that processing stack traces is a PITA, but it
allows you to see the actual callers irrespective of how many layers of
indirection or library code the call goes through.

> > it adds on top of
> > our existing allocator layers complexity but it would need to spread beyond
> > MM to be useful because it is usually outside of MM where leaks happen.
>
> If you want the tracking to happen at a different level of the call stack, just
> call _kmalloc() directly and call alloc_tag_add()/sub() yourself.

As pointed out above, this just scales poorly and adds to the API space. Not
to mention that direct use of alloc_tag_add can just confuse layers
below which rely on the same thing.

Hope this makes it clearer.
--
Michal Hocko
SUSE Labs

2022-09-06 08:11:57

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Mon 05-09-22 11:03:35, Suren Baghdasaryan wrote:
> On Mon, Sep 5, 2022 at 1:12 AM Michal Hocko <[email protected]> wrote:
> >
> > On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> > > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <[email protected]> wrote:
> > [...]
> > > > Yes, tracking back the call trace would be really needed. The question
> > > > is whether this is really prohibitively expensive. How much overhead are
> > > > we talking about? There is no free lunch here, really. You either have
> > > > the overhead during runtime when the feature is used or on the source
> > > > code level for all the future development (with a maze of macros and
> > > > wrappers).
> > >
> > > As promised, I profiled a simple program that repeatedly makes 10
> > > allocations/frees in a loop and measured the overheads of code tagging,
> > > call stack capturing, and tracing+BPF for page and slab allocations.
> > > Summary:
> > >
> > > Page allocations (overheads are compared to get_free_pages() duration):
> > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > > 8.8% lookup_page_ext
> > > 1237% call stack capture
> > > 139% tracepoint with attached empty BPF program
> >
> > Yes, I am not surprised that the call stack capturing is really
> > expensive compared to the allocator fast path (which is really highly
> > optimized and I suspect that with 10 allocation/free loop you mostly get
> > your memory from the pcp lists). Is this overhead still _that_ visible
> > for somehow less microoptimized workloads which have to take slow paths
> > as well?
>
> Correct, it's a comparison with the allocation fast path, so in a
> sense it represents the worst-case scenario. However, at the same time
> the measurements are fair because they measure the overheads against
> the same meaningful baseline, and can therefore be used for comparison.

Yes, I am not saying it is an unfair comparison. It is just not a
particularly practical one for real-life situations. So I am not sure
you can draw many conclusions from that. Or let me put it differently.
There is no real point comparing the code tagging and stack unwinding
approaches, because the latter is simply more complex: it collects
more state. The main question is whether that additional state
collection is too expensive to be practically used.

> > Also what kind of stack unwinder is configured (I guess ORC)? This is
> > not my area but from what I remember the unwinder overhead varies
> > between ORC and FP.
>
> I used whatever the default is and didn't try other mechanisms. I
> don't think the difference would be orders of magnitude better though.
>
> >
> > And just to make it clear. I do realize that an overhead from the stack
> > unwinding is unavoidable. And code tagging would logically have lower
> > overhead as it performs much less work. But the main point is whether
> > our existing stack unwinding approach is really prohibitively expensive
> > to be used for debugging purposes on production systems. I might
> > misremember but I recall people having bigger concerns with page_owner
> > memory footprint than the actual stack unwinder overhead.
>
> That's one of those questions which are very difficult to answer (if
> it is possible at all) because the answer depends on the usage
> scenario. If the workload allocates frequently then the added overhead
> will likely affect it; otherwise it might not even be noticeable. In
> general, in
> pre-production testing we try to minimize the difference in
> performance and memory profiles between the software we are testing
> and the production one. From that point of view, the smaller the
> overhead, the better. I know it's kinda obvious but unfortunately I
> have no better answer to that question.

This is clear, but it doesn't really tell whether the existing tooling is
unusable for _your_ or any specific scenarios. When we are talking about
adding quite a lot of code and making our allocator APIs more complicated
in order to track the state, we should carefully weigh the benefit and
the cost. As I replied in the other email, I am really skeptical this
patchset is at its final stage, and the more allocators get covered the
more code we have to maintain. So there must be a very strong reason to
add it.

> For the memory overhead, in my early internal proposal with assumption
> of 10000 instrumented allocation call sites, I've made some
> calculations for an 8GB 8-core system (quite typical for Android) and
> ended up with the following:
>
>                          per-cpu counters   atomic counters
> page_ext references          16MB               16MB
> slab object references       10.5MB             10.5MB
> alloc_tags                   900KB              312KB
> Total memory overhead        27.4MB             26.8MB

I do not really think this is all that interesting, because the major
memory overhead contributors (page_ext and objcg) are going to be there
with other approaches that want to match alloc and free, as that clearly
requires storing the allocator objects somewhere.

> so, about 0.34% of the total memory. Our implementation has changed
> since then and the number might not be completely correct but it
> should be in the ballpark.
> I just checked the number of instrumented calls that we currently have
> in 6.0-rc3 built with defconfig and it's 165 page allocation and
> 2684 slab allocation sites. I readily accept that we are probably
> missing some allocations and additional modules can also contribute to
> these numbers, but my guess is that it's still less than the 10000 I
> used in my calculations.

yes, in the current implementation you are missing most indirect users
of the page allocator, as stated elsewhere, so the usefulness can be
really limited. A better coverage will not increase the memory
consumption much, but it will add an additional maintenance burden that
will scale with different use cases.
--
Michal Hocko
SUSE Labs

2022-09-06 16:25:30

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Tue, Sep 6, 2022 at 1:01 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 05-09-22 11:03:35, Suren Baghdasaryan wrote:
> > On Mon, Sep 5, 2022 at 1:12 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> > > > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <[email protected]> wrote:
> > > [...]
> > > > > Yes, tracking back the call trace would be really needed. The question
> > > > > is whether this is really prohibitively expensive. How much overhead are
> > > > > we talking about? There is no free lunch here, really. You either have
> > > > > the overhead during runtime when the feature is used or on the source
> > > > > code level for all the future development (with a maze of macros and
> > > > > wrappers).
> > > >
> > > > As promised, I profiled a simple program that repeatedly makes 10
> > > > allocations/frees in a loop and measured the overheads of code tagging,
> > > > call stack capturing, and tracing+BPF for page and slab allocations.
> > > > Summary:
> > > >
> > > > Page allocations (overheads are compared to get_free_pages() duration):
> > > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > > > 8.8% lookup_page_ext
> > > > 1237% call stack capture
> > > > 139% tracepoint with attached empty BPF program
> > >
> > > Yes, I am not surprised that the call stack capturing is really
> > > expensive compared to the allocator fast path (which is really highly
> > > optimized and I suspect that with 10 allocation/free loop you mostly get
> > > your memory from the pcp lists). Is this overhead still _that_ visible
> > > for somehow less microoptimized workloads which have to take slow paths
> > > as well?
> >
> > Correct, it's a comparison with the allocation fast path, so in a
> > sense it represents the worst-case scenario. However, at the same time
> > the measurements are fair because they measure the overheads against
> > the same meaningful baseline, and can therefore be used for comparison.
>
> Yes, I am not saying it is an unfair comparison. It is just not a
> particularly practical one for real-life situations. So I am not sure
> you can draw many conclusions from that. Or let me put it differently.
> There is no real point comparing the code tagging and stack unwinding
> approaches, because the latter is simply more complex: it collects
> more state. The main question is whether that additional state
> collection is too expensive to be practically used.

You asked me to provide the numbers in one of your replies; that's what I did.

>
> > > Also what kind of stack unwinder is configured (I guess ORC)? This is
> > > not my area but from what I remember the unwinder overhead varies
> > > between ORC and FP.
> >
> > I used whatever the default is and didn't try other mechanisms. I
> > don't think the difference would be orders of magnitude better though.
> >
> > >
> > > And just to make it clear. I do realize that an overhead from the stack
> > > unwinding is unavoidable. And code tagging would logically have lower
> > > overhead as it performs much less work. But the main point is whether
> > > our existing stack unwinding approach is really prohibitively expensive
> > > to be used for debugging purposes on production systems. I might
> > > misremember but I recall people having bigger concerns with page_owner
> > > memory footprint than the actual stack unwinder overhead.
> >
> > That's one of those questions which are very difficult to answer (if
> > it is possible at all) because the answer depends on the usage
> > scenario. If the workload allocates frequently then the added overhead
> > will likely affect it; otherwise it might not even be noticeable. In
> > general, in
> > pre-production testing we try to minimize the difference in
> > performance and memory profiles between the software we are testing
> > and the production one. From that point of view, the smaller the
> > overhead, the better. I know it's kinda obvious but unfortunately I
> > have no better answer to that question.
>
> This is clear, but it doesn't really tell whether the existing tooling is
> unusable for _your_ or any specific scenarios. When we are talking about
> adding quite a lot of code and making our allocator APIs more complicated
> in order to track the state, we should carefully weigh the benefit and
> the cost. As I replied in the other email, I am really skeptical this
> patchset is at its final stage, and the more allocators get covered the
> more code we have to maintain. So there must be a very strong reason to
> add it.

The patchset is quite complete at this point. Instrumenting new
allocators takes 3 lines of code; see how the kmalloc_hooks macro is
used in https://lore.kernel.org/all/[email protected]/

>
> > For the memory overhead, in my early internal proposal with assumption
> > of 10000 instrumented allocation call sites, I've made some
> > calculations for an 8GB 8-core system (quite typical for Android) and
> > ended up with the following:
> >
> >                          per-cpu counters   atomic counters
> > page_ext references          16MB               16MB
> > slab object references       10.5MB             10.5MB
> > alloc_tags                   900KB              312KB
> > Total memory overhead        27.4MB             26.8MB
>
> I do not really think this is all that interesting, because the major
> memory overhead contributors (page_ext and objcg) are going to be there
> with other approaches that want to match alloc and free, as that clearly
> requires storing the allocator objects somewhere.

You mentioned that memory consumption was the more important overhead
in the page_owner approach, so I provided the numbers for that part of
the discussion.

>
> > so, about 0.34% of the total memory. Our implementation has changed
> > since then and the number might not be completely correct but it
> > should be in the ballpark.
> > I just checked the number of instrumented calls that we currently have
> > in 6.0-rc3 built with defconfig and it's 165 page allocation and
> > 2684 slab allocation sites. I readily accept that we are probably
> > missing some allocations and additional modules can also contribute to
> > these numbers, but my guess is that it's still less than the 10000 I
> > used in my calculations.
>
> yes, in the current implementation you are missing most indirect users
> of the page allocator, as stated elsewhere, so the usefulness can be
> really limited. A better coverage will not increase the memory
> consumption much, but it will add an additional maintenance burden that
> will scale with different use cases.

Your comments in the last two emails about needing the stack tracing
and covering indirect users of the allocators make me think that you
missed my reply here:
https://lore.kernel.org/all/CAJuCfpGZ==v0HGWBzZzHTgbo4B_ZBe6V6U4T_788LVWj8HhCRQ@mail.gmail.com/.
I messed up the formatting but hopefully it's still readable. The idea
of having two-stage tracking - the first one very cheap and the second
one more in-depth - should, I think, address your concerns about
indirect users.
Thanks,
Suren.

> --
> Michal Hocko
> SUSE Labs

2022-09-06 18:37:31

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Tue, Sep 06, 2022 at 09:23:31AM +0200, Michal Hocko wrote:
> On Mon 05-09-22 19:46:49, Kent Overstreet wrote:
> > On Mon, Sep 05, 2022 at 10:49:38AM +0200, Michal Hocko wrote:
> > > This is really my main concern about this whole work. Not only it adds a
> > > considerable maintenance burden to the core MM because
> >
> > [citation needed]
>
> I thought this was clear from the email content (the part you haven't
> quoted here). But let me be explicit one more time for you.
>
> I hope we can agree that in order for this kind of tracking to be useful
> you need to cover _callers_ of the allocator or in the ideal world
> the users/owner of the tracked memory (the latter is sometimes much
> harder/impossible to track when the memory is handed over from one peer
> to another).
>
> It is not particularly useful IMO to see that a large portion of the
> memory has been allocated by say vmalloc or kvmalloc, right? How
> much does it really tell you that a lot of memory has been allocated
> by kvmalloc or vmalloc? Yet, neither of the two is handled by the
> proposed tracking and it would require additional code to be added and
> _maintained_ to cover them. But that would still be far from complete;
> we have the bulk allocator, mempools, etc.

Of course - and even a light skimming of the patch set would show that it
does indeed address this. We still have to do vmalloc and percpu memory
allocations, but slab is certainly handled, and that's the big one.

> As pointed above this just scales poorly and adds to the API space. Not
> to mention that direct use of alloc_tag_add can just confuse layers
> below which rely on the same thing.

It might help you make your case if you'd say something about what you'd like
better.

Otherwise, saying "code has to be maintained" is a little bit like saying water
is wet, and we're all engineers here, I think we know that :)

2022-09-07 11:18:19

by Michal Hocko

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Tue 06-09-22 14:20:58, Kent Overstreet wrote:
[...]
> Otherwise, saying "code has to be maintained" is a little bit like saying water
> is wet, and we're all engineers here, I think we know that :)

Hmm, it seems that further discussion doesn't really make much sense
here. I know how to use my time better.
--
Michal Hocko
SUSE Labs

2022-09-07 13:13:43

by Kent Overstreet

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> Hmm, it seems that further discussion doesn't really make much sense
> here. I know how to use my time better.

Just a thought, but I generally find it more productive to propose ideas than to
just be disparaging.

Cheers,
Kent

2022-09-07 14:29:25

by Steven Rostedt

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, 7 Sep 2022 09:04:28 -0400
Kent Overstreet <[email protected]> wrote:

> On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> > Hmm, it seems that further discussion doesn't really make much sense
> > here. I know how to use my time better.
>
> Just a thought, but I generally find it more productive to propose ideas than to
> just be disparaging.
>

But it's not Michal's job to do so. He's just telling you that the given
feature is not worth the burden. He's telling you the issues that he has
with the patch set. It's the submitter's job to address those concerns and
not the maintainer's to tell you how to make it better.

When Linus tells us that a submission is crap, we don't ask him how to make
it less crap, we listen to why he called it crap, and then rewrite to be
not so crappy. If we cannot figure it out, it doesn't get in.

-- Steve

2022-09-08 07:03:58

by Suren Baghdasaryan

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Sep 7, 2022 at 11:35 PM Kent Overstreet
<[email protected]> wrote:
>
> On Wed, Sep 07, 2022 at 09:45:18AM -0400, Steven Rostedt wrote:
> > On Wed, 7 Sep 2022 09:04:28 -0400
> > Kent Overstreet <[email protected]> wrote:
> >
> > > On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> > > > Hmm, it seems that further discussion doesn't really make much sense
> > > > here. I know how to use my time better.
> > >
> > > Just a thought, but I generally find it more productive to propose ideas than to
> > > just be disparaging.
> > >
> >
> > But it's not Michal's job to do so. He's just telling you that the given
> > feature is not worth the burden. He's telling you the issues that he has
> > with the patch set. It's the submitter's job to address those concerns and
> > not the maintainer's to tell you how to make it better.
> >
> > When Linus tells us that a submission is crap, we don't ask him how to make
> > it less crap, we listen to why he called it crap, and then rewrite to be
> > not so crappy. If we cannot figure it out, it doesn't get in.
>
> When Linus tells someone a submission is crap, he _always_ has a sound, and
> _specific_ technical justification for doing so.
>
> "This code is going to be a considerable maintenance burden" is vapid, and lazy.
> It's the kind of feedback made by someone who has looked at the number of lines
> of code a patch touches and not much more.

I would really appreciate it if everyone could please stick to the
technical side of the conversation. That way we can get some
constructive feedback. Everything else is not helpful and at best is a
distraction.
Maintenance burden is a price we pay and I think it's the prerogative
of the maintainers to take that into account. Our job is to prove that
the price is worth paying.


2022-09-08 07:42:09

by Kent Overstreet

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Sep 07, 2022 at 09:45:18AM -0400, Steven Rostedt wrote:
> On Wed, 7 Sep 2022 09:04:28 -0400
> Kent Overstreet <[email protected]> wrote:
>
> > On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> > > Hmm, it seems that further discussion doesn't really make much sense
> > > here. I know how to use my time better.
> >
> > Just a thought, but I generally find it more productive to propose ideas than to
> > just be disparaging.
> >
>
> But it's not Michal's job to do so. He's just telling you that the given
> feature is not worth the burden. He's telling you the issues that he has
> with the patch set. It's the submitter's job to address those concerns and
> not the maintainer's to tell you how to make it better.
>
> When Linus tells us that a submission is crap, we don't ask him how to make
> it less crap, we listen to why he called it crap, and then rewrite to be
> not so crappy. If we cannot figure it out, it doesn't get in.

When Linus tells someone a submission is crap, he _always_ has a sound, and
_specific_ technical justification for doing so.

"This code is going to be a considerable maintenance burden" is vapid, and lazy.
It's the kind of feedback made by someone who has looked at the number of lines
of code a patch touches and not much more.

2022-09-08 07:42:59

by Michal Hocko

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu 08-09-22 02:35:48, Kent Overstreet wrote:
> On Wed, Sep 07, 2022 at 09:45:18AM -0400, Steven Rostedt wrote:
> > On Wed, 7 Sep 2022 09:04:28 -0400
> > Kent Overstreet <[email protected]> wrote:
> >
> > > On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> > > > Hmm, it seems that further discussion doesn't really make much sense
> > > > here. I know how to use my time better.
> > >
> > > Just a thought, but I generally find it more productive to propose ideas than to
> > > just be disparaging.
> > >
> >
> > But it's not Michal's job to do so. He's just telling you that the given
> > feature is not worth the burden. He's telling you the issues that he has
> > with the patch set. It's the submitter's job to address those concerns and
> > not the maintainer's to tell you how to make it better.
> >
> > When Linus tells us that a submission is crap, we don't ask him how to make
> > it less crap, we listen to why he called it crap, and then rewrite to be
> > not so crappy. If we cannot figure it out, it doesn't get in.
>
> When Linus tells someone a submission is crap, he _always_ has a sound, and
> _specific_ technical justification for doing so.
>
> "This code is going to be a considerable maintenance burden" is vapid, and lazy.
> It's the kind of feedback made by someone who has looked at the number of lines
> of code a patch touches and not much more.

Then you have probably missed a huge part of my emails. Please
re-read. If those arguments are not clear, feel free to ask for
clarification. Reducing my whole reasoning and objections to the
sentence above and calling that vapid and lazy is not only unfair but
also disrespectful.

--
Michal Hocko
SUSE Labs

2022-09-08 08:00:52

by Kent Overstreet

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Wed, Sep 07, 2022 at 11:49:37PM -0700, Suren Baghdasaryan wrote:
> I would really appreciate it if everyone could please stick to the
> technical side of the conversation. That way we can get some
> constructive feedback. Everything else is not helpful and at best is a
> distraction.
> Maintenance burden is a price we pay and I think it's the prerogative
> of the maintainers to take that into account. Our job is to prove that
> the price is worth paying.

Well said.

I'd also like to add - slab.h does look pretty overgrown and messy. We've grown
a _lot_ of special-purpose memory allocation interfaces, and I think it
probably is time to try to wrangle that back.

The API complexity isn't just an issue for this patch - it's an issue for
anything that has to wrap and plumb through memory allocation interfaces. It's
a pain point for the Rust people, and also comes up in e.g. the mempool API.

I think we should keep going with the memalloc_no*_save()/restore() approach,
and extend it to other things:

- memalloc_nowait_save()
- memalloc_highpri_save()

(these two get you GFP_ATOMIC).

Also, I don't think these all need to be separate functions; we could have

memalloc_gfp_apply()
memalloc_gfp_restore()

which simply takes a gfp flags argument and applies it to the current
PF_MEMALLOC flags.

We've had long-standing bugs where vmalloc() can't correctly take gfp flags
because some of the allocations it does for page tables don't have them
correctly plumbed through; switching to the memalloc_*_(save|restore) approach
is something people have been wanting in order to fix this - for years.
Actually following through and completing this would let us kill the gfp flags
arguments to our various memory allocators entirely.

I think we can do the same thing with the numa node parameter - kill
kmalloc_node() et al., move it to task_struct with a set of save/restore
functions.

There are probably other things we can do to simplify slab.h if we look more. I've
been hoping to start pushing patches for some of this stuff - it's going to be
some time before I can get to it though, can only handle so many projects in
flight at a time :)

2022-09-08 08:02:06

by Michal Hocko

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu 08-09-22 03:29:50, Kent Overstreet wrote:
> On Thu, Sep 08, 2022 at 09:12:45AM +0200, Michal Hocko wrote:
> > Then you have probably missed a huge part of my emails. Please
> > re-read. If those arguments are not clear, feel free to ask for
> > clarification. Reducing my whole reasoning and objections to the
> > sentence above and calling that vapid and lazy is not only unfair but
> > also disrespectful.
>
> What, where you complained about slab's page allocations showing up in the
> profile instead of slab, and I pointed out to you that actually each and every
> slab call is instrumented, and you're just seeing some double counting (that we
> will no doubt fix?)
>
> Or when you complained about allocation sites where it should actually be the
> caller that should be instrumented, and I pointed out that it'd be quite easy to
> simply change that code to use _kmalloc() and slab_tag_add() directly, if it
> becomes an issue.
>
> Of course, if we got that far, we'd have this code to thank for telling us where
> to look!
>
> Did I miss anything?

Feel free to respond to specific arguments as I wrote them. I won't
repeat them again. Sure, we can discuss how important/relevant those
are. And that _can_ be a productive discussion.

--
Michal Hocko
SUSE Labs

2022-09-08 08:31:59

by Kent Overstreet

Subject: Re: [RFC PATCH 00/30] Code tagging framework and applications

On Thu, Sep 08, 2022 at 09:12:45AM +0200, Michal Hocko wrote:
> Then you have probably missed a huge part of my emails. Please
> re-read. If those arguments are not clear, feel free to ask for
> clarification. Reducing my whole reasoning and objections to the
> sentence above and calling that vapid and lazy is not only unfair but
> also disrespectful.

What, where you complained about slab's page allocations showing up in the
profile instead of slab, and I pointed out to you that actually each and every
slab call is instrumented, and you're just seeing some double counting (that we
will no doubt fix?)

Or when you complained about allocation sites where it should actually be the
caller that should be instrumented, and I pointed out that it'd be quite easy to
simply change that code to use _kmalloc() and slab_tag_add() directly, if it
becomes an issue.

Of course, if we got that far, we'd have this code to thank for telling us where
to look!

Did I miss anything?