Memory allocation profiling, v3 and final:
Overview:
Low-overhead [1] per-callsite memory allocation profiling. It is not just for
debug kernels: the overhead is low enough for it to be deployed in production.
We're aiming to get this in the next merge window, for 6.9. The feedback
we've gotten has been that even out of tree this patchset has already
been useful, and there's a significant amount of other work gated on the
code tagging functionality included in this patchset [2].
Example output:
root@moria-kvm:~# sort -h /proc/allocinfo|tail
3.11MiB 2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode
3.52MiB 225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node
3.75MiB 960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
4.00MiB 2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio
10.5MiB 168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs
14.0MiB 3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof
26.8MiB 6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof
64.5MiB 98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init
98.7MiB 25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof
125MiB 7357 mm/slub.c:2201 module:slub func:alloc_slab_page
Since v2:
- tglx noticed a circular header dependency between sched.h and percpu.h;
a bunch of header cleanups were merged into 6.8 to ameliorate this [3].
- a number of improvements, moving alloc_hooks() annotations to the
correct place for better tracking (mempool), and bugfixes.
- looked at alternate hooking methods.
There were suggestions on alternate methods (compiler attribute,
trampolines), but they wouldn't have made the patchset any cleaner
(we still need to have different function versions for accounting vs. no
accounting to control at which point in a call chain the accounting
happens), and they would have added a dependency on toolchain
support.
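
The need for separate accounting and non-accounting function versions can
be sketched in userspace. This is only an illustration, not the code in
this patchset: struct sketch_alloc_tag, sketch_malloc() and
sketch_malloc_noprof() are hypothetical names, frees are not tracked, and
the macro relies on GNU C statement expressions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-callsite record; not the kernel's own tag structure. */
struct sketch_alloc_tag {
	const char *file;
	int line;
	size_t bytes;	/* total bytes requested from this callsite */
	size_t calls;	/* successful allocations from this callsite */
	struct sketch_alloc_tag *next;
};

/* Tags that have allocated at least once, most recently registered first. */
static struct sketch_alloc_tag *sketch_tags;

/* The non-accounting version: allocates without touching any tag. */
static void *sketch_malloc_noprof(size_t size)
{
	return malloc(size);
}

/* Charge one successful allocation to the callsite's tag. */
static void *sketch_account(struct sketch_alloc_tag *tag, void *p, size_t size)
{
	assert(tag != NULL);
	if (p) {
		if (tag->calls == 0) {
			/* First allocation from this callsite: register it. */
			tag->next = sketch_tags;
			sketch_tags = tag;
		}
		tag->calls++;
		tag->bytes += size;
	}
	return p;
}

/*
 * The accounting version is a macro so that the static tag is instantiated
 * once per callsite (GNU C statement expression, supported by gcc and clang).
 */
#define sketch_malloc(size) ({						\
	static struct sketch_alloc_tag _tag = { __FILE__, __LINE__, 0, 0, NULL }; \
	size_t _size = (size);						\
	sketch_account(&_tag, sketch_malloc_noprof(_size), _size);	\
})
```

A helper that wants its callers charged rather than itself would call
sketch_malloc_noprof() internally and be wrapped in a macro of its own,
which is the choice of accounting point described above.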
Usage:
kconfig options:
- CONFIG_MEM_ALLOC_PROFILING
- CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
- CONFIG_MEM_ALLOC_PROFILING_DEBUG
adds warnings for allocations that weren't accounted because of a
missing annotation
sysctl:
/proc/sys/vm/mem_profiling
Runtime info:
/proc/allocinfo
Notes:
[1]: Overhead
To measure the overhead we are comparing the following configurations:
(1) Baseline with CONFIG_MEMCG_KMEM=n
(2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n)
(3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y)
(4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
(5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
Performance overhead:
To evaluate performance we implemented an in-kernel test executing
multiple get_free_page/free_page and kmalloc/kfree calls with allocation
sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
affinity set to a specific CPU to minimize the noise. Below are results
from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
56 core Intel Xeon:
                        kmalloc                 pgalloc
(1 baseline)             6.764s                 16.902s
(2 default disabled)     6.793s (+0.43%)        17.007s (+0.62%)
(3 default enabled)      7.197s (+6.40%)        23.666s (+40.02%)
(4 runtime enabled)      7.405s (+9.48%)        23.901s (+41.41%)
(5 memcg)               13.388s (+97.94%)       48.460s (+186.71%)
Memory overhead:
Kernel size:
        text        data        bss         dec         diff
(1)     26515311    18890222    17018880    62424413
(2)     26524728    19423818    16740352    62688898    264485
(3)     26524724    19423818    16740352    62688894    264481
(4)     26524728    19423818    16740352    62688898    264485
(5)     26541782    18964374    16957440    62463596    39183
Memory consumption on a 56 core Intel CPU with 125GB of memory:
Code tags: 192 kB
PageExts: 262144 kB (256MB)
SlabExts: 9876 kB (9.6MB)
PcpuExts: 512 kB (0.5MB)
Total overhead is 0.2% of total memory.
[2]: Improved fault injection is the big one; the alloc_hooks() macro
this patchset introduces is also used for per-callsite fault injection
points in the dynamic fault injection patchset, which means we can
easily do fault injection on a per module or per file basis; this makes
it much easier to integrate memory fault injection into existing tests.
Vlastimil recently raised concerns about exposing GFP_NOWAIT as a
PF_MEMALLOC_* flag, as this might introduce GFP_NOWAIT to allocation
paths that have never had their failure paths tested - this is something
we need to address.
[3]: The circular dependency looks to be unavoidable; the issue is that
alloc_tag_save() -> current -> get_current() requires percpu.h, and
percpu.h requires sched.h because of course it does. But this doesn't
actually cause build errors because we're only using macros, so the main
concern is just not leaving a difficult-to-disentangle minefield for
later.
So, sched.h is now pretty close to being a types-only header that
imports types and declares types - these are the header cleanups that
were merged for 6.8.
Kent Overstreet (11):
lib/string_helpers: Add flags param to string_get_size()
scripts/kallsyms: Always include __start and __stop symbols
fs: Convert alloc_inode_sb() to a macro
mm/slub: Mark slab_free_freelist_hook() __always_inline
mempool: Hook up to memory allocation profiling
xfs: Memory allocation profiling fixups
mm: percpu: Introduce pcpuobj_ext
mm: percpu: Add codetag reference into pcpuobj_ext
mm: vmalloc: Enable memory allocation profiling
rhashtable: Plumb through alloc tag
MAINTAINERS: Add entries for code tagging and memory allocation
profiling
Suren Baghdasaryan (24):
mm: enumerate all gfp flags
mm: introduce slabobj_ext to support slab object extensions
mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext
creation
mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation
mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache
objects
slab: objext: introduce objext_flags as extension to
page_memcg_data_flags
lib: code tagging framework
lib: code tagging module support
lib: prevent module unloading if memory is not freed
lib: add allocation tagging support for memory allocation profiling
lib: introduce support for page allocation tagging
mm: percpu: increase PERCPU_MODULE_RESERVE to accommodate allocation
tags
change alloc_pages name in dma_map_ops to avoid name conflicts
mm: enable page allocation tagging
mm: create new codetag references during page splitting
mm/page_ext: enable early_page_ext when
CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
lib: add codetag reference into slabobj_ext
mm/slab: add allocation accounting into slab allocation and free paths
mm/slab: enable slab allocation tagging for kmalloc and friends
mm: percpu: enable per-cpu allocation tagging
lib: add memory allocations report in show_mem()
codetag: debug: skip objext checking when it's for objext itself
codetag: debug: mark codetags for reserved pages as empty
codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext
allocations
Documentation/admin-guide/sysctl/vm.rst | 16 ++
Documentation/filesystems/proc.rst | 28 ++
MAINTAINERS | 16 ++
arch/alpha/kernel/pci_iommu.c | 2 +-
arch/mips/jazz/jazzdma.c | 2 +-
arch/powerpc/kernel/dma-iommu.c | 2 +-
arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +-
arch/powerpc/platforms/ps3/system-bus.c | 4 +-
arch/powerpc/platforms/pseries/vio.c | 2 +-
arch/x86/kernel/amd_gart_64.c | 2 +-
drivers/block/virtio_blk.c | 4 +-
drivers/gpu/drm/gud/gud_drv.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/mmc/core/block.c | 4 +-
drivers/mtd/spi-nor/debugfs.c | 6 +-
.../ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 4 +-
drivers/parisc/ccio-dma.c | 2 +-
drivers/parisc/sba_iommu.c | 2 +-
drivers/scsi/sd.c | 8 +-
drivers/staging/media/atomisp/pci/hmm/hmm.c | 2 +-
drivers/xen/grant-dma-ops.c | 2 +-
drivers/xen/swiotlb-xen.c | 2 +-
fs/xfs/kmem.c | 4 +-
fs/xfs/kmem.h | 10 +-
include/asm-generic/codetag.lds.h | 14 +
include/asm-generic/vmlinux.lds.h | 3 +
include/linux/alloc_tag.h | 188 +++++++++++++
include/linux/codetag.h | 83 ++++++
include/linux/dma-map-ops.h | 2 +-
include/linux/fortify-string.h | 5 +-
include/linux/fs.h | 6 +-
include/linux/gfp.h | 126 +++++----
include/linux/gfp_types.h | 101 +++++--
include/linux/memcontrol.h | 56 +++-
include/linux/mempool.h | 73 +++--
include/linux/mm.h | 8 +
include/linux/mm_types.h | 4 +-
include/linux/page_ext.h | 1 -
include/linux/pagemap.h | 9 +-
include/linux/percpu.h | 27 +-
include/linux/pgalloc_tag.h | 105 +++++++
include/linux/rhashtable-types.h | 11 +-
include/linux/sched.h | 24 ++
include/linux/slab.h | 184 +++++++------
include/linux/string.h | 4 +-
include/linux/string_helpers.h | 11 +-
include/linux/vmalloc.h | 60 +++-
init/Kconfig | 4 +
kernel/dma/mapping.c | 4 +-
kernel/kallsyms_selftest.c | 2 +-
kernel/module/main.c | 25 +-
lib/Kconfig.debug | 31 +++
lib/Makefile | 3 +
lib/alloc_tag.c | 213 +++++++++++++++
lib/codetag.c | 258 ++++++++++++++++++
lib/rhashtable.c | 52 +++-
lib/string_helpers.c | 22 +-
lib/test-string_helpers.c | 4 +-
mm/compaction.c | 7 +-
mm/filemap.c | 6 +-
mm/huge_memory.c | 2 +
mm/hugetlb.c | 8 +-
mm/kfence/core.c | 14 +-
mm/kfence/kfence.h | 4 +-
mm/memcontrol.c | 56 +---
mm/mempolicy.c | 52 ++--
mm/mempool.c | 36 +--
mm/mm_init.c | 10 +
mm/page_alloc.c | 66 +++--
mm/page_ext.c | 13 +
mm/page_owner.c | 2 +-
mm/percpu-internal.h | 26 +-
mm/percpu.c | 120 ++++----
mm/show_mem.c | 15 +
mm/slab.h | 176 ++++++++++--
mm/slab_common.c | 65 ++++-
mm/slub.c | 138 ++++++----
mm/util.c | 44 +--
mm/vmalloc.c | 88 +++---
scripts/kallsyms.c | 13 +
scripts/module.lds.S | 7 +
81 files changed, 2126 insertions(+), 695 deletions(-)
create mode 100644 include/asm-generic/codetag.lds.h
create mode 100644 include/linux/alloc_tag.h
create mode 100644 include/linux/codetag.h
create mode 100644 include/linux/pgalloc_tag.h
create mode 100644 lib/alloc_tag.c
create mode 100644 lib/codetag.c
--
2.43.0.687.g38aa6559b0-goog
From: Kent Overstreet <[email protected]>
The new flags parameter allows controlling
- Whether or not the units suffix is separated by a space, for
compatibility with sort -h
- Whether or not to append a B suffix - we're not always printing
bytes.
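
As a rough userspace illustration of how the flag bits combine, the
sketch below mirrors only the final formatting step. The enum values are
the ones this patch adds; sketch_format_size() is a hypothetical helper
that takes an already-scaled value and a unit index, so it does not
reproduce the scaling and rounding done by string_get_size().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Flag bits as added to include/linux/string_helpers.h by this patch. */
enum string_size_flags {
	STRING_SIZE_BASE2	= (1 << 0),
	STRING_SIZE_NOSPACE	= (1 << 1),
	STRING_SIZE_NOBYTES	= (1 << 2),
};

/*
 * Hypothetical helper, not part of the patch.
 * @scaled:   integer value after scaling to the chosen unit
 * @unit_idx: 0 = no prefix, 1 = kilo/kibi, 2 = mega/mebi, ...
 * Returns what snprintf() returns, or -1 for an unknown unit index.
 */
static int sketch_format_size(unsigned int scaled, size_t unit_idx,
			      unsigned int flags, char *buf, size_t len)
{
	static const char *const units_10[] = { "", "k", "M", "G", "T" };
	static const char *const units_2[]  = { "", "Ki", "Mi", "Gi", "Ti" };
	const char *unit;

	assert(buf != NULL || len == 0);
	if (unit_idx >= sizeof(units_10) / sizeof(units_10[0]))
		return -1;

	unit = (flags & STRING_SIZE_BASE2) ? units_2[unit_idx]
					   : units_10[unit_idx];

	/* Separator and "B" suffix are each controlled by one flag bit. */
	return snprintf(buf, len, "%u%s%s%s", scaled,
			(flags & STRING_SIZE_NOSPACE) ? "" : " ",
			unit,
			(flags & STRING_SIZE_NOBYTES) ? "" : "B");
}
```

Passing 0 for the flags gives the previous STRING_UNITS_10 style with a
space and a B suffix; STRING_SIZE_BASE2 | STRING_SIZE_NOSPACE gives a
compact form that sort -h can parse.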
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: "Noralf Trønnes" <[email protected]>
Cc: Jens Axboe <[email protected]>
---
arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +-
drivers/block/virtio_blk.c | 4 ++--
drivers/gpu/drm/gud/gud_drv.c | 2 +-
drivers/mmc/core/block.c | 4 ++--
drivers/mtd/spi-nor/debugfs.c | 6 ++---
.../ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 4 ++--
drivers/scsi/sd.c | 8 +++----
include/linux/string_helpers.h | 11 +++++-----
lib/string_helpers.c | 22 ++++++++++++++-----
lib/test-string_helpers.c | 4 ++--
mm/hugetlb.c | 8 +++----
11 files changed, 42 insertions(+), 33 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index c6a4ac766b2b..27aa5a083ff0 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -260,7 +260,7 @@ print_mapping(unsigned long start, unsigned long end, unsigned long size, bool e
if (end <= start)
return;
- string_get_size(size, 1, STRING_UNITS_2, buf, sizeof(buf));
+ string_get_size(size, 1, STRING_SIZE_BASE2, buf, sizeof(buf));
pr_info("Mapped 0x%016lx-0x%016lx with %s pages%s\n", start, end, buf,
exec ? " (exec)" : "");
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 2bf14a0e2815..94fba7f57079 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -934,9 +934,9 @@ static void virtblk_update_capacity(struct virtio_blk *vblk, bool resize)
nblocks = DIV_ROUND_UP_ULL(capacity, queue_logical_block_size(q) >> 9);
string_get_size(nblocks, queue_logical_block_size(q),
- STRING_UNITS_2, cap_str_2, sizeof(cap_str_2));
+ STRING_SIZE_BASE2, cap_str_2, sizeof(cap_str_2));
string_get_size(nblocks, queue_logical_block_size(q),
- STRING_UNITS_10, cap_str_10, sizeof(cap_str_10));
+ 0, cap_str_10, sizeof(cap_str_10));
dev_notice(&vdev->dev,
"[%s] %s%llu %d-byte logical blocks (%s/%s)\n",
diff --git a/drivers/gpu/drm/gud/gud_drv.c b/drivers/gpu/drm/gud/gud_drv.c
index 9d7bf8ee45f1..6b1748e1f666 100644
--- a/drivers/gpu/drm/gud/gud_drv.c
+++ b/drivers/gpu/drm/gud/gud_drv.c
@@ -329,7 +329,7 @@ static int gud_stats_debugfs(struct seq_file *m, void *data)
struct gud_device *gdrm = to_gud_device(entry->dev);
char buf[10];
- string_get_size(gdrm->bulk_len, 1, STRING_UNITS_2, buf, sizeof(buf));
+ string_get_size(gdrm->bulk_len, 1, STRING_SIZE_BASE2, buf, sizeof(buf));
seq_printf(m, "Max buffer size: %s\n", buf);
seq_printf(m, "Number of errors: %u\n", gdrm->stats_num_errors);
diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 32d49100dff5..1cded1e9aca4 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -2557,7 +2557,7 @@ static struct mmc_blk_data *mmc_blk_alloc_req(struct mmc_card *card,
blk_queue_write_cache(md->queue.queue, cache_enabled, fua_enabled);
- string_get_size((u64)size, 512, STRING_UNITS_2,
+ string_get_size((u64)size, 512, STRING_SIZE_BASE2,
cap_str, sizeof(cap_str));
pr_info("%s: %s %s %s%s\n",
md->disk->disk_name, mmc_card_id(card), mmc_card_name(card),
@@ -2753,7 +2753,7 @@ static int mmc_blk_alloc_rpmb_part(struct mmc_card *card,
list_add(&rpmb->node, &md->rpmbs);
- string_get_size((u64)size, 512, STRING_UNITS_2,
+ string_get_size((u64)size, 512, STRING_SIZE_BASE2,
cap_str, sizeof(cap_str));
pr_info("%s: %s %s %s, chardev (%d:%d)\n",
diff --git a/drivers/mtd/spi-nor/debugfs.c b/drivers/mtd/spi-nor/debugfs.c
index 2dbda6b6938a..f6c3ca430df1 100644
--- a/drivers/mtd/spi-nor/debugfs.c
+++ b/drivers/mtd/spi-nor/debugfs.c
@@ -85,7 +85,7 @@ static int spi_nor_params_show(struct seq_file *s, void *data)
seq_printf(s, "name\t\t%s\n", info->name);
seq_printf(s, "id\t\t%*ph\n", SPI_NOR_MAX_ID_LEN, nor->id);
- string_get_size(params->size, 1, STRING_UNITS_2, buf, sizeof(buf));
+ string_get_size(params->size, 1, STRING_SIZE_BASE2, buf, sizeof(buf));
seq_printf(s, "size\t\t%s\n", buf);
seq_printf(s, "write size\t%u\n", params->writesize);
seq_printf(s, "page size\t%u\n", params->page_size);
@@ -130,14 +130,14 @@ static int spi_nor_params_show(struct seq_file *s, void *data)
struct spi_nor_erase_type *et = &erase_map->erase_type[i];
if (et->size) {
- string_get_size(et->size, 1, STRING_UNITS_2, buf,
+ string_get_size(et->size, 1, STRING_SIZE_BASE2, buf,
sizeof(buf));
seq_printf(s, " %02x (%s) [%d]\n", et->opcode, buf, i);
}
}
if (!(nor->flags & SNOR_F_NO_OP_CHIP_ERASE)) {
- string_get_size(params->size, 1, STRING_UNITS_2, buf, sizeof(buf));
+ string_get_size(params->size, 1, STRING_SIZE_BASE2, buf, sizeof(buf));
seq_printf(s, " %02x (%s)\n", nor->params->die_erase_opcode, buf);
}
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 14e0d989c3ba..7d5fbebd36fc 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -3457,8 +3457,8 @@ static void mem_region_show(struct seq_file *seq, const char *name,
{
char buf[40];
- string_get_size((u64)to - from + 1, 1, STRING_UNITS_2, buf,
- sizeof(buf));
+ string_get_size((u64)to - from + 1, 1, STRING_SIZE_BASE2,
+ buf, sizeof(buf));
seq_printf(seq, "%-15s %#x-%#x [%s]\n", name, from, to, buf);
}
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0833b3e6aa6e..e23bcb1d1ffa 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2731,10 +2731,10 @@ sd_print_capacity(struct scsi_disk *sdkp,
if (!sdkp->first_scan && old_capacity == sdkp->capacity)
return;
- string_get_size(sdkp->capacity, sector_size,
- STRING_UNITS_2, cap_str_2, sizeof(cap_str_2));
- string_get_size(sdkp->capacity, sector_size,
- STRING_UNITS_10, cap_str_10, sizeof(cap_str_10));
+ string_get_size(sdkp->capacity, sector_size, STRING_SIZE_BASE2,
+ cap_str_2, sizeof(cap_str_2));
+ string_get_size(sdkp->capacity, sector_size, 0,
+ cap_str_10, sizeof(cap_str_10));
sd_printk(KERN_NOTICE, sdkp,
"%llu %d-byte logical blocks: (%s/%s)\n",
diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h
index 58fb1f90eda5..a54467d891db 100644
--- a/include/linux/string_helpers.h
+++ b/include/linux/string_helpers.h
@@ -17,14 +17,13 @@ static inline bool string_is_terminated(const char *s, int len)
return memchr(s, '\0', len) ? true : false;
}
-/* Descriptions of the types of units to
- * print in */
-enum string_size_units {
- STRING_UNITS_10, /* use powers of 10^3 (standard SI) */
- STRING_UNITS_2, /* use binary powers of 2^10 */
+enum string_size_flags {
+ STRING_SIZE_BASE2 = (1 << 0),
+ STRING_SIZE_NOSPACE = (1 << 1),
+ STRING_SIZE_NOBYTES = (1 << 2),
};
-int string_get_size(u64 size, u64 blk_size, enum string_size_units units,
+int string_get_size(u64 size, u64 blk_size, enum string_size_flags flags,
char *buf, int len);
int parse_int_array_user(const char __user *from, size_t count, int **array);
diff --git a/lib/string_helpers.c b/lib/string_helpers.c
index 7713f73e66b0..a5d7d1caed70 100644
--- a/lib/string_helpers.c
+++ b/lib/string_helpers.c
@@ -19,11 +19,17 @@
#include <linux/string.h>
#include <linux/string_helpers.h>
+enum string_size_units {
+ STRING_UNITS_10, /* use powers of 10^3 (standard SI) */
+ STRING_UNITS_2, /* use binary powers of 2^10 */
+};
+
/**
* string_get_size - get the size in the specified units
* @size: The size to be converted in blocks
* @blk_size: Size of the block (use 1 for size in bytes)
- * @units: units to use (powers of 1000 or 1024)
+ * @flags: units to use (powers of 1000 or 1024), whether to include a space
+ * separator, and whether to append the B suffix
* @buf: buffer to format to
* @len: length of buffer
*
@@ -34,14 +40,16 @@
* Return value: number of characters of output that would have been written
* (which may be greater than len, if output was truncated).
*/
-int string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
+int string_get_size(u64 size, u64 blk_size, enum string_size_flags flags,
char *buf, int len)
{
+ enum string_size_units units = flags & STRING_SIZE_BASE2
+ ? STRING_UNITS_2 : STRING_UNITS_10;
static const char *const units_10[] = {
- "B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"
+ "", "k", "M", "G", "T", "P", "E", "Z", "Y"
};
static const char *const units_2[] = {
- "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"
+ "", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi", "Yi"
};
static const char *const *const units_str[] = {
[STRING_UNITS_10] = units_10,
@@ -128,8 +136,10 @@ int string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
else
unit = units_str[units][i];
- return snprintf(buf, len, "%u%s %s", (u32)size,
- tmp, unit);
+ return snprintf(buf, len, "%u%s%s%s%s", (u32)size, tmp,
+ (flags & STRING_SIZE_NOSPACE) ? "" : " ",
+ unit,
+ (flags & STRING_SIZE_NOBYTES) ? "" : "B");
}
EXPORT_SYMBOL(string_get_size);
diff --git a/lib/test-string_helpers.c b/lib/test-string_helpers.c
index 9a68849a5d55..0b01ffca96fb 100644
--- a/lib/test-string_helpers.c
+++ b/lib/test-string_helpers.c
@@ -507,8 +507,8 @@ static __init void __test_string_get_size(const u64 size, const u64 blk_size,
char buf10[string_get_size_maxbuf];
char buf2[string_get_size_maxbuf];
- string_get_size(size, blk_size, STRING_UNITS_10, buf10, sizeof(buf10));
- string_get_size(size, blk_size, STRING_UNITS_2, buf2, sizeof(buf2));
+ string_get_size(size, blk_size, 0, buf10, sizeof(buf10));
+ string_get_size(size, blk_size, STRING_SIZE_BASE2, buf2, sizeof(buf2));
test_string_get_size_check("STRING_UNITS_10", exp_result10, buf10,
size, blk_size);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed1581b670d4..26a8028e4bb7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3475,7 +3475,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
if (i == h->max_huge_pages_node[nid])
return;
- string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+ string_get_size(huge_page_size(h), 1, STRING_SIZE_BASE2, buf, 32);
pr_warn("HugeTLB: allocating %u of page size %s failed node%d. Only allocated %lu hugepages.\n",
h->max_huge_pages_node[nid], buf, nid, i);
h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
@@ -3561,7 +3561,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
if (i < h->max_huge_pages) {
char buf[32];
- string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+ string_get_size(huge_page_size(h), 1, STRING_SIZE_BASE2, buf, 32);
pr_warn("HugeTLB: allocating %lu of page size %s failed. Only allocated %lu hugepages.\n",
h->max_huge_pages, buf, i);
h->max_huge_pages = i;
@@ -3607,7 +3607,7 @@ static void __init report_hugepages(void)
for_each_hstate(h) {
char buf[32];
- string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+ string_get_size(huge_page_size(h), 1, STRING_SIZE_BASE2, buf, 32);
pr_info("HugeTLB: registered %s page size, pre-allocated %ld pages\n",
buf, h->free_huge_pages);
pr_info("HugeTLB: %d KiB vmemmap can be freed for a %s page\n",
@@ -4527,7 +4527,7 @@ static int __init hugetlb_init(void)
char buf[32];
string_get_size(huge_page_size(&default_hstate),
- 1, STRING_UNITS_2, buf, 32);
+ 1, STRING_SIZE_BASE2, buf, 32);
pr_warn("HugeTLB: Ignoring hugepages=%lu associated with %s page size\n",
default_hstate.max_huge_pages, buf);
pr_warn("HugeTLB: Using hugepages=%lu for number of default huge pages\n",
--
2.43.0.687.g38aa6559b0-goog
Introduce an enumeration of GFP bits to let the compiler track the number
of used bits (which depends on the config options) instead of hardcoding
them. That simplifies the __GFP_BITS_SHIFT calculation.
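
The idea can be shown with a much smaller, standalone sketch. All
SKETCH_* names below are hypothetical stand-ins (including the two
config macros); only the pattern of a trailing LAST_BIT enumerator
matches the patch.

```c
#include <assert.h>

/* Hypothetical stand-ins for config options; define one to "enable" it. */
#define SKETCH_CONFIG_LOCKDEP 1
/* #define SKETCH_CONFIG_KASAN_HW_TAGS 1 */

enum {
	SKETCH_DMA_BIT,
	SKETCH_HIGHMEM_BIT,
	SKETCH_ZERO_BIT,
#ifdef SKETCH_CONFIG_KASAN_HW_TAGS
	SKETCH_SKIP_ZERO_BIT,
	SKETCH_SKIP_KASAN_BIT,
#endif
#ifdef SKETCH_CONFIG_LOCKDEP
	SKETCH_NOLOCKDEP_BIT,
#endif
	SKETCH_LAST_BIT	/* number of bits used in this configuration */
};

#define SKETCH_BIT(nr)		(1u << (nr))

#define SKETCH_DMA		SKETCH_BIT(SKETCH_DMA_BIT)
#define SKETCH_HIGHMEM		SKETCH_BIT(SKETCH_HIGHMEM_BIT)
#define SKETCH_ZERO		SKETCH_BIT(SKETCH_ZERO_BIT)
#ifdef SKETCH_CONFIG_LOCKDEP
#define SKETCH_NOLOCKDEP	SKETCH_BIT(SKETCH_NOLOCKDEP_BIT)
#else
#define SKETCH_NOLOCKDEP	0
#endif

/* No hand-maintained count: the shift follows the enum automatically. */
#define SKETCH_BITS_SHIFT	SKETCH_LAST_BIT
#define SKETCH_BITS_MASK	((1u << SKETCH_BITS_SHIFT) - 1)

static_assert(SKETCH_LAST_BIT < 32, "flags must fit in an unsigned int");

static unsigned int sketch_bits_mask(void)
{
	return SKETCH_BITS_MASK;
}
```

Enabling or disabling either stand-in config macro renumbers the later
bits and adjusts SKETCH_BITS_SHIFT without any other change.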
Suggested-by: Petr Tesařík <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/gfp_types.h | 90 +++++++++++++++++++++++++++------------
1 file changed, 62 insertions(+), 28 deletions(-)
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 1b6053da8754..868c8fb1bbc1 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -21,44 +21,78 @@ typedef unsigned int __bitwise gfp_t;
* include/trace/events/mmflags.h and tools/perf/builtin-kmem.c
*/
+enum {
+ ___GFP_DMA_BIT,
+ ___GFP_HIGHMEM_BIT,
+ ___GFP_DMA32_BIT,
+ ___GFP_MOVABLE_BIT,
+ ___GFP_RECLAIMABLE_BIT,
+ ___GFP_HIGH_BIT,
+ ___GFP_IO_BIT,
+ ___GFP_FS_BIT,
+ ___GFP_ZERO_BIT,
+ ___GFP_UNUSED_BIT, /* 0x200u unused */
+ ___GFP_DIRECT_RECLAIM_BIT,
+ ___GFP_KSWAPD_RECLAIM_BIT,
+ ___GFP_WRITE_BIT,
+ ___GFP_NOWARN_BIT,
+ ___GFP_RETRY_MAYFAIL_BIT,
+ ___GFP_NOFAIL_BIT,
+ ___GFP_NORETRY_BIT,
+ ___GFP_MEMALLOC_BIT,
+ ___GFP_COMP_BIT,
+ ___GFP_NOMEMALLOC_BIT,
+ ___GFP_HARDWALL_BIT,
+ ___GFP_THISNODE_BIT,
+ ___GFP_ACCOUNT_BIT,
+ ___GFP_ZEROTAGS_BIT,
+#ifdef CONFIG_KASAN_HW_TAGS
+ ___GFP_SKIP_ZERO_BIT,
+ ___GFP_SKIP_KASAN_BIT,
+#endif
+#ifdef CONFIG_LOCKDEP
+ ___GFP_NOLOCKDEP_BIT,
+#endif
+ ___GFP_LAST_BIT
+};
+
/* Plain integer GFP bitmasks. Do not use this directly. */
-#define ___GFP_DMA 0x01u
-#define ___GFP_HIGHMEM 0x02u
-#define ___GFP_DMA32 0x04u
-#define ___GFP_MOVABLE 0x08u
-#define ___GFP_RECLAIMABLE 0x10u
-#define ___GFP_HIGH 0x20u
-#define ___GFP_IO 0x40u
-#define ___GFP_FS 0x80u
-#define ___GFP_ZERO 0x100u
+#define ___GFP_DMA BIT(___GFP_DMA_BIT)
+#define ___GFP_HIGHMEM BIT(___GFP_HIGHMEM_BIT)
+#define ___GFP_DMA32 BIT(___GFP_DMA32_BIT)
+#define ___GFP_MOVABLE BIT(___GFP_MOVABLE_BIT)
+#define ___GFP_RECLAIMABLE BIT(___GFP_RECLAIMABLE_BIT)
+#define ___GFP_HIGH BIT(___GFP_HIGH_BIT)
+#define ___GFP_IO BIT(___GFP_IO_BIT)
+#define ___GFP_FS BIT(___GFP_FS_BIT)
+#define ___GFP_ZERO BIT(___GFP_ZERO_BIT)
/* 0x200u unused */
-#define ___GFP_DIRECT_RECLAIM 0x400u
-#define ___GFP_KSWAPD_RECLAIM 0x800u
-#define ___GFP_WRITE 0x1000u
-#define ___GFP_NOWARN 0x2000u
-#define ___GFP_RETRY_MAYFAIL 0x4000u
-#define ___GFP_NOFAIL 0x8000u
-#define ___GFP_NORETRY 0x10000u
-#define ___GFP_MEMALLOC 0x20000u
-#define ___GFP_COMP 0x40000u
-#define ___GFP_NOMEMALLOC 0x80000u
-#define ___GFP_HARDWALL 0x100000u
-#define ___GFP_THISNODE 0x200000u
-#define ___GFP_ACCOUNT 0x400000u
-#define ___GFP_ZEROTAGS 0x800000u
+#define ___GFP_DIRECT_RECLAIM BIT(___GFP_DIRECT_RECLAIM_BIT)
+#define ___GFP_KSWAPD_RECLAIM BIT(___GFP_KSWAPD_RECLAIM_BIT)
+#define ___GFP_WRITE BIT(___GFP_WRITE_BIT)
+#define ___GFP_NOWARN BIT(___GFP_NOWARN_BIT)
+#define ___GFP_RETRY_MAYFAIL BIT(___GFP_RETRY_MAYFAIL_BIT)
+#define ___GFP_NOFAIL BIT(___GFP_NOFAIL_BIT)
+#define ___GFP_NORETRY BIT(___GFP_NORETRY_BIT)
+#define ___GFP_MEMALLOC BIT(___GFP_MEMALLOC_BIT)
+#define ___GFP_COMP BIT(___GFP_COMP_BIT)
+#define ___GFP_NOMEMALLOC BIT(___GFP_NOMEMALLOC_BIT)
+#define ___GFP_HARDWALL BIT(___GFP_HARDWALL_BIT)
+#define ___GFP_THISNODE BIT(___GFP_THISNODE_BIT)
+#define ___GFP_ACCOUNT BIT(___GFP_ACCOUNT_BIT)
+#define ___GFP_ZEROTAGS BIT(___GFP_ZEROTAGS_BIT)
#ifdef CONFIG_KASAN_HW_TAGS
-#define ___GFP_SKIP_ZERO 0x1000000u
-#define ___GFP_SKIP_KASAN 0x2000000u
+#define ___GFP_SKIP_ZERO BIT(___GFP_SKIP_ZERO_BIT)
+#define ___GFP_SKIP_KASAN BIT(___GFP_SKIP_KASAN_BIT)
#else
#define ___GFP_SKIP_ZERO 0
#define ___GFP_SKIP_KASAN 0
#endif
#ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP 0x4000000u
+#define ___GFP_NOLOCKDEP BIT(___GFP_NOLOCKDEP_BIT)
#else
#define ___GFP_NOLOCKDEP 0
#endif
-/* If the above are modified, __GFP_BITS_SHIFT may need updating */
/*
* Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -249,7 +283,7 @@ typedef unsigned int __bitwise gfp_t;
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
/* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT ___GFP_LAST_BIT
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/**
--
2.43.0.687.g38aa6559b0-goog
From: Kent Overstreet <[email protected]>
These symbols are used to denote section boundaries: by always including
them we can unify loading sections from modules with loading built-in
sections, which leads to some significant cleanup.
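
For readers unfamiliar with these symbols, here is a small userspace
sketch of the mechanism, assuming gcc or clang with GNU ld or lld on an
ELF target: for a section whose name is a valid C identifier, the linker
provides __start_<name> and __stop_<name>. The section name sketch_tags
and struct sketch_tag are hypothetical and unrelated to the sections
used by this patchset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record type placed in a dedicated section. */
struct sketch_tag {
	const char *name;
	int line;
};

#define SKETCH_TAG_SECTION __attribute__((used, section("sketch_tags")))

/* Three objects scattered through the source, gathered by the linker. */
static struct sketch_tag sketch_tag_a SKETCH_TAG_SECTION = { "a.c", 10 };
static struct sketch_tag sketch_tag_b SKETCH_TAG_SECTION = { "b.c", 20 };
static struct sketch_tag sketch_tag_c SKETCH_TAG_SECTION = { "c.c", 30 };

/* Section boundaries, defined by the linker rather than by any C file. */
extern struct sketch_tag __start_sketch_tags[];
extern struct sketch_tag __stop_sketch_tags[];

/* Number of records the linker collected into the section. */
static size_t sketch_count_tags(void)
{
	assert(__stop_sketch_tags >= __start_sketch_tags);
	return (size_t)(__stop_sketch_tags - __start_sketch_tags);
}

/* Walk the section like an array and sum one field. */
static int sketch_sum_lines(void)
{
	int sum = 0;

	for (struct sketch_tag *t = __start_sketch_tags;
	     t < __stop_sketch_tags; t++)
		sum += t->line;
	return sum;
}
```

Code that walks a section this way only needs the two boundary symbols,
which is why having them available for both built-in and module
sections allows a single loading path.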
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
scripts/kallsyms.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 653b92f6d4c8..47978efe4797 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -204,6 +204,11 @@ static int symbol_in_range(const struct sym_entry *s,
return 0;
}
+static bool string_starts_with(const char *s, const char *prefix)
+{
+ return strncmp(s, prefix, strlen(prefix)) == 0;
+}
+
static int symbol_valid(const struct sym_entry *s)
{
const char *name = sym_name(s);
@@ -211,6 +216,14 @@ static int symbol_valid(const struct sym_entry *s)
/* if --all-symbols is not specified, then symbols outside the text
* and inittext sections are discarded */
if (!all_symbols) {
+ /*
+ * Symbols starting with __start and __stop are used to denote
+ * section boundaries, and should always be included:
+ */
+ if (string_starts_with(name, "__start_") ||
+ string_starts_with(name, "__stop_"))
+ return 1;
+
if (symbol_in_range(s, text_ranges,
ARRAY_SIZE(text_ranges)) == 0)
return 0;
--
2.43.0.687.g38aa6559b0-goog
Currently slab pages can store only vectors of obj_cgroup pointers in
page->memcg_data. Introduce the slabobj_ext structure to allow more data
to be stored for each slab object. Wrap obj_cgroup into slabobj_ext
to support the current functionality while allowing slabobj_ext to be
extended in the future.
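
The layout can be pictured with a userspace sketch. It assumes, as the
flag definitions in memcontrol.h suggest, that the low bits of the
stored word carry flags and the remaining bits carry the vector pointer.
All sketch_* and SKETCH_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct obj_cgroup; its contents do not matter here. */
struct sketch_obj_cgroup {
	int id;
};

/*
 * Per-object extension record. Wrapping the pointer in a struct leaves
 * room for further fields; 8-byte alignment keeps the low pointer bits
 * free for flags.
 */
struct sketch_slabobj_ext {
	struct sketch_obj_cgroup *objcg;
} __attribute__((aligned(8)));

#define SKETCH_DATA_OBJEXTS	((uintptr_t)1 << 0)	/* word points to a vector */
#define SKETCH_DATA_KMEM	((uintptr_t)1 << 1)
#define SKETCH_DATA_FLAGS_MASK	(((uintptr_t)1 << 2) - 1)

/* Store the vector pointer together with the "is a vector" flag. */
static uintptr_t sketch_pack_obj_exts(struct sketch_slabobj_ext *vec)
{
	assert(((uintptr_t)vec & SKETCH_DATA_FLAGS_MASK) == 0);
	return (uintptr_t)vec | SKETCH_DATA_OBJEXTS;
}

/* Recover the vector, or NULL if the word does not hold one. */
static struct sketch_slabobj_ext *sketch_unpack_obj_exts(uintptr_t data)
{
	if (!(data & SKETCH_DATA_OBJEXTS))
		return NULL;
	return (struct sketch_slabobj_ext *)(data & ~SKETCH_DATA_FLAGS_MASK);
}
```

Adding a second member to the record later would change the per-object
stride but not the packing of the vector pointer.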
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/memcontrol.h | 20 ++++++---
include/linux/mm_types.h | 4 +-
init/Kconfig | 4 ++
mm/kfence/core.c | 14 +++---
mm/kfence/kfence.h | 4 +-
mm/memcontrol.c | 56 +++--------------------
mm/page_owner.c | 2 +-
mm/slab.h | 92 +++++++++++++++++++++++++++++---------
mm/slab_common.c | 48 ++++++++++++++++++++
mm/slub.c | 64 +++++++++++++-------------
10 files changed, 189 insertions(+), 119 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 20ff87f8e001..eb1dc181e412 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -348,8 +348,8 @@ struct mem_cgroup {
extern struct mem_cgroup *root_mem_cgroup;
enum page_memcg_data_flags {
- /* page->memcg_data is a pointer to an objcgs vector */
- MEMCG_DATA_OBJCGS = (1UL << 0),
+ /* page->memcg_data is a pointer to a slabobj_ext vector */
+ MEMCG_DATA_OBJEXTS = (1UL << 0),
/* page has been accounted as a non-slab kernel page */
MEMCG_DATA_KMEM = (1UL << 1),
/* the next bit after the last actual flag */
@@ -387,7 +387,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;
VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -408,7 +408,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;
VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -505,7 +505,7 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
*/
unsigned long memcg_data = READ_ONCE(folio->memcg_data);
- if (memcg_data & MEMCG_DATA_OBJCGS)
+ if (memcg_data & MEMCG_DATA_OBJEXTS)
return NULL;
if (memcg_data & MEMCG_DATA_KMEM) {
@@ -551,7 +551,7 @@ static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
static inline bool folio_memcg_kmem(struct folio *folio)
{
VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
- VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
+ VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
return folio->memcg_data & MEMCG_DATA_KMEM;
}
@@ -1633,6 +1633,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
}
#endif /* CONFIG_MEMCG */
+/*
+ * Extended information for slab objects stored as an array in page->memcg_data
+ * if MEMCG_DATA_OBJEXTS is set.
+ */
+struct slabobj_ext {
+ struct obj_cgroup *objcg;
+} __aligned(8);
+
static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
{
__mod_lruvec_kmem_state(p, idx, 1);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8b611e13153e..9ff97f4e74c5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -169,7 +169,7 @@ struct page {
/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
atomic_t _refcount;
-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_SLAB_OBJ_EXT
unsigned long memcg_data;
#endif
@@ -306,7 +306,7 @@ struct folio {
};
atomic_t _mapcount;
atomic_t _refcount;
-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_SLAB_OBJ_EXT
unsigned long memcg_data;
#endif
#if defined(WANT_PAGE_VIRTUAL)
diff --git a/init/Kconfig b/init/Kconfig
index deda3d14135b..8ca5285108be 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -949,10 +949,14 @@ config CGROUP_FAVOR_DYNMODS
Say N if unsure.
+config SLAB_OBJ_EXT
+ bool
+
config MEMCG
bool "Memory controller"
select PAGE_COUNTER
select EVENTFD
+ select SLAB_OBJ_EXT
help
Provides control over the memory footprint of tasks in a cgroup.
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 8350f5c06f2e..964b8482275b 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -595,9 +595,9 @@ static unsigned long kfence_init_pool(void)
continue;
__folio_set_slab(slab_folio(slab));
-#ifdef CONFIG_MEMCG
- slab->memcg_data = (unsigned long)&kfence_metadata_init[i / 2 - 1].objcg |
- MEMCG_DATA_OBJCGS;
+#ifdef CONFIG_MEMCG_KMEM
+ slab->obj_exts = (unsigned long)&kfence_metadata_init[i / 2 - 1].obj_exts |
+ MEMCG_DATA_OBJEXTS;
#endif
}
@@ -645,8 +645,8 @@ static unsigned long kfence_init_pool(void)
if (!i || (i % 2))
continue;
-#ifdef CONFIG_MEMCG
- slab->memcg_data = 0;
+#ifdef CONFIG_MEMCG_KMEM
+ slab->obj_exts = 0;
#endif
__folio_clear_slab(slab_folio(slab));
}
@@ -1139,8 +1139,8 @@ void __kfence_free(void *addr)
{
struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr);
-#ifdef CONFIG_MEMCG
- KFENCE_WARN_ON(meta->objcg);
+#ifdef CONFIG_MEMCG_KMEM
+ KFENCE_WARN_ON(meta->obj_exts.objcg);
#endif
/*
* If the objects of the cache are SLAB_TYPESAFE_BY_RCU, defer freeing
diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
index f46fbb03062b..084f5f36e8e7 100644
--- a/mm/kfence/kfence.h
+++ b/mm/kfence/kfence.h
@@ -97,8 +97,8 @@ struct kfence_metadata {
struct kfence_track free_track;
/* For updating alloc_covered on frees. */
u32 alloc_stack_hash;
-#ifdef CONFIG_MEMCG
- struct obj_cgroup *objcg;
+#ifdef CONFIG_MEMCG_KMEM
+ struct slabobj_ext obj_exts;
#endif
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ed40f9d3a27..7021639d2a6f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2977,13 +2977,6 @@ void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
}
#ifdef CONFIG_MEMCG_KMEM
-/*
- * The allocated objcg pointers array is not accounted directly.
- * Moreover, it should not come from DMA buffer and is not readily
- * reclaimable. So those GFP bits should be masked off.
- */
-#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | \
- __GFP_ACCOUNT | __GFP_NOFAIL)
/*
* mod_objcg_mlstate() may be called with irq enabled, so
@@ -3003,62 +2996,27 @@ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
rcu_read_unlock();
}
-int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
- gfp_t gfp, bool new_slab)
-{
- unsigned int objects = objs_per_slab(s, slab);
- unsigned long memcg_data;
- void *vec;
-
- gfp &= ~OBJCGS_CLEAR_MASK;
- vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
- slab_nid(slab));
- if (!vec)
- return -ENOMEM;
-
- memcg_data = (unsigned long) vec | MEMCG_DATA_OBJCGS;
- if (new_slab) {
- /*
- * If the slab is brand new and nobody can yet access its
- * memcg_data, no synchronization is required and memcg_data can
- * be simply assigned.
- */
- slab->memcg_data = memcg_data;
- } else if (cmpxchg(&slab->memcg_data, 0, memcg_data)) {
- /*
- * If the slab is already in use, somebody can allocate and
- * assign obj_cgroups in parallel. In this case the existing
- * objcg vector should be reused.
- */
- kfree(vec);
- return 0;
- }
-
- kmemleak_not_leak(vec);
- return 0;
-}
-
static __always_inline
struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
{
/*
* Slab objects are accounted individually, not per-page.
* Memcg membership data for each individual object is saved in
- * slab->memcg_data.
+ * slab->obj_exts.
*/
if (folio_test_slab(folio)) {
- struct obj_cgroup **objcgs;
+ struct slabobj_ext *obj_exts;
struct slab *slab;
unsigned int off;
slab = folio_slab(folio);
- objcgs = slab_objcgs(slab);
- if (!objcgs)
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
return NULL;
off = obj_to_index(slab->slab_cache, slab, p);
- if (objcgs[off])
- return obj_cgroup_memcg(objcgs[off]);
+ if (obj_exts[off].objcg)
+ return obj_cgroup_memcg(obj_exts[off].objcg);
return NULL;
}
@@ -3066,7 +3024,7 @@ struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
/*
* folio_memcg_check() is used here, because in theory we can encounter
* a folio where the slab flag has been cleared already, but
- * slab->memcg_data has not been freed yet
+ * slab->obj_exts has not been freed yet
* folio_memcg_check() will guarantee that a proper memory
* cgroup pointer or NULL will be returned.
*/
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 5634e5d890f8..262aa7d25f40 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -377,7 +377,7 @@ static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
if (!memcg_data)
goto out_unlock;
- if (memcg_data & MEMCG_DATA_OBJCGS)
+ if (memcg_data & MEMCG_DATA_OBJEXTS)
ret += scnprintf(kbuf + ret, count - ret,
"Slab cache page\n");
diff --git a/mm/slab.h b/mm/slab.h
index 54deeb0428c6..436a126486b5 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -87,8 +87,8 @@ struct slab {
unsigned int __unused;
atomic_t __page_refcount;
-#ifdef CONFIG_MEMCG
- unsigned long memcg_data;
+#ifdef CONFIG_SLAB_OBJ_EXT
+ unsigned long obj_exts;
#endif
};
@@ -97,8 +97,8 @@ struct slab {
SLAB_MATCH(flags, __page_flags);
SLAB_MATCH(compound_head, slab_cache); /* Ensure bit 0 is clear */
SLAB_MATCH(_refcount, __page_refcount);
-#ifdef CONFIG_MEMCG
-SLAB_MATCH(memcg_data, memcg_data);
+#ifdef CONFIG_SLAB_OBJ_EXT
+SLAB_MATCH(memcg_data, obj_exts);
#endif
#undef SLAB_MATCH
static_assert(sizeof(struct slab) <= sizeof(struct page));
@@ -541,42 +541,90 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
return false;
}
-#ifdef CONFIG_MEMCG_KMEM
+#ifdef CONFIG_SLAB_OBJ_EXT
+
/*
- * slab_objcgs - get the object cgroups vector associated with a slab
+ * slab_obj_exts - get the pointer to the slab object extension vector
+ * associated with a slab.
* @slab: a pointer to the slab struct
*
- * Returns a pointer to the object cgroups vector associated with the slab,
+ * Returns a pointer to the object extension vector associated with the slab,
* or NULL if no such vector has been associated yet.
*/
-static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
+static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
{
- unsigned long memcg_data = READ_ONCE(slab->memcg_data);
+ unsigned long obj_exts = READ_ONCE(slab->obj_exts);
- VM_BUG_ON_PAGE(memcg_data && !(memcg_data & MEMCG_DATA_OBJCGS),
+#ifdef CONFIG_MEMCG
+ VM_BUG_ON_PAGE(obj_exts && !(obj_exts & MEMCG_DATA_OBJEXTS),
slab_page(slab));
- VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, slab_page(slab));
+ VM_BUG_ON_PAGE(obj_exts & MEMCG_DATA_KMEM, slab_page(slab));
- return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct slabobj_ext *)(obj_exts & ~MEMCG_DATA_FLAGS_MASK);
+#else
+ return (struct slabobj_ext *)obj_exts;
+#endif
}
-int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s,
- gfp_t gfp, bool new_slab);
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
- enum node_stat_item idx, int nr);
-#else /* CONFIG_MEMCG_KMEM */
-static inline struct obj_cgroup **slab_objcgs(struct slab *slab)
+int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
+ gfp_t gfp, bool new_slab);
+
+static inline bool need_slab_obj_ext(void)
+{
+ /*
+ * CONFIG_MEMCG_KMEM creates vector of obj_cgroup objects conditionally
+ * inside memcg_slab_post_alloc_hook. No other users for now.
+ */
+ return false;
+}
+
+static inline struct slabobj_ext *
+prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
+{
+ struct slab *slab;
+
+ if (!p)
+ return NULL;
+
+ if (!need_slab_obj_ext())
+ return NULL;
+
+ slab = virt_to_slab(p);
+ if (!slab_obj_exts(slab) &&
+ WARN(alloc_slab_obj_exts(slab, s, flags, false),
+ "%s, %s: Failed to create slab extension vector!\n",
+ __func__, s->name))
+ return NULL;
+
+ return slab_obj_exts(slab) + obj_to_index(s, slab, p);
+}
+
+#else /* CONFIG_SLAB_OBJ_EXT */
+
+static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
{
return NULL;
}
-static inline int memcg_alloc_slab_cgroups(struct slab *slab,
- struct kmem_cache *s, gfp_t gfp,
- bool new_slab)
+static inline int alloc_slab_obj_exts(struct slab *slab,
+ struct kmem_cache *s, gfp_t gfp,
+ bool new_slab)
{
return 0;
}
-#endif /* CONFIG_MEMCG_KMEM */
+
+static inline struct slabobj_ext *
+prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
+{
+ return NULL;
+}
+
+#endif /* CONFIG_SLAB_OBJ_EXT */
+
+#ifdef CONFIG_MEMCG_KMEM
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+ enum node_stat_item idx, int nr);
+#endif
size_t __ksize(const void *objp);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 238293b1dbe1..6bfa1810da5e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -201,6 +201,54 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
return NULL;
}
+#ifdef CONFIG_SLAB_OBJ_EXT
+/*
+ * The allocated slabobj_ext array is not accounted directly.
+ * Moreover, it should not come from DMA buffer and is not readily
+ * reclaimable. So those GFP bits should be masked off.
+ */
+#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | \
+ __GFP_ACCOUNT | __GFP_NOFAIL)
+
+int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
+ gfp_t gfp, bool new_slab)
+{
+ unsigned int objects = objs_per_slab(s, slab);
+ unsigned long obj_exts;
+ void *vec;
+
+ gfp &= ~OBJCGS_CLEAR_MASK;
+ vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
+ slab_nid(slab));
+ if (!vec)
+ return -ENOMEM;
+
+ obj_exts = (unsigned long)vec;
+#ifdef CONFIG_MEMCG
+ obj_exts |= MEMCG_DATA_OBJEXTS;
+#endif
+ if (new_slab) {
+ /*
+ * If the slab is brand new and nobody can yet access its
+ * obj_exts, no synchronization is required and obj_exts can
+ * be simply assigned.
+ */
+ slab->obj_exts = obj_exts;
+ } else if (cmpxchg(&slab->obj_exts, 0, obj_exts)) {
+ /*
+ * If the slab is already in use, somebody can allocate and
+ * assign a slabobj_ext vector in parallel. In this case the existing
+ * vector should be reused.
+ */
+ kfree(vec);
+ return 0;
+ }
+
+ kmemleak_not_leak(vec);
+ return 0;
+}
+#endif /* CONFIG_SLAB_OBJ_EXT */
+
static struct kmem_cache *create_cache(const char *name,
unsigned int object_size, unsigned int align,
slab_flags_t flags, unsigned int useroffset,
diff --git a/mm/slub.c b/mm/slub.c
index 2ef88bbf56a3..1eb1050814aa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -683,10 +683,10 @@ static inline bool __slab_update_freelist(struct kmem_cache *s, struct slab *sla
if (s->flags & __CMPXCHG_DOUBLE) {
ret = __update_freelist_fast(slab, freelist_old, counters_old,
- freelist_new, counters_new);
+ freelist_new, counters_new);
} else {
ret = __update_freelist_slow(slab, freelist_old, counters_old,
- freelist_new, counters_new);
+ freelist_new, counters_new);
}
if (likely(ret))
return true;
@@ -710,13 +710,13 @@ static inline bool slab_update_freelist(struct kmem_cache *s, struct slab *slab,
if (s->flags & __CMPXCHG_DOUBLE) {
ret = __update_freelist_fast(slab, freelist_old, counters_old,
- freelist_new, counters_new);
+ freelist_new, counters_new);
} else {
unsigned long flags;
local_irq_save(flags);
ret = __update_freelist_slow(slab, freelist_old, counters_old,
- freelist_new, counters_new);
+ freelist_new, counters_new);
local_irq_restore(flags);
}
if (likely(ret))
@@ -1881,13 +1881,25 @@ static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s)
NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B;
}
-#ifdef CONFIG_MEMCG_KMEM
-static inline void memcg_free_slab_cgroups(struct slab *slab)
+#ifdef CONFIG_SLAB_OBJ_EXT
+static inline void free_slab_obj_exts(struct slab *slab)
+{
+ struct slabobj_ext *obj_exts;
+
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
+ return;
+
+ kfree(obj_exts);
+ slab->obj_exts = 0;
+}
+#else
+static inline void free_slab_obj_exts(struct slab *slab)
{
- kfree(slab_objcgs(slab));
- slab->memcg_data = 0;
}
+#endif
+#ifdef CONFIG_MEMCG_KMEM
static inline size_t obj_full_size(struct kmem_cache *s)
{
/*
@@ -1966,15 +1978,15 @@ static void __memcg_slab_post_alloc_hook(struct kmem_cache *s,
if (likely(p[i])) {
slab = virt_to_slab(p[i]);
- if (!slab_objcgs(slab) &&
- memcg_alloc_slab_cgroups(slab, s, flags, false)) {
+ if (!slab_obj_exts(slab) &&
+ alloc_slab_obj_exts(slab, s, flags, false)) {
obj_cgroup_uncharge(objcg, obj_full_size(s));
continue;
}
off = obj_to_index(s, slab, p[i]);
obj_cgroup_get(objcg);
- slab_objcgs(slab)[off] = objcg;
+ slab_obj_exts(slab)[off].objcg = objcg;
mod_objcg_state(objcg, slab_pgdat(slab),
cache_vmstat_idx(s), obj_full_size(s));
} else {
@@ -1995,18 +2007,18 @@ void memcg_slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg,
static void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
void **p, int objects,
- struct obj_cgroup **objcgs)
+ struct slabobj_ext *obj_exts)
{
for (int i = 0; i < objects; i++) {
struct obj_cgroup *objcg;
unsigned int off;
off = obj_to_index(s, slab, p[i]);
- objcg = objcgs[off];
+ objcg = obj_exts[off].objcg;
if (!objcg)
continue;
- objcgs[off] = NULL;
+ obj_exts[off].objcg = NULL;
obj_cgroup_uncharge(objcg, obj_full_size(s));
mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s),
-obj_full_size(s));
@@ -2018,16 +2030,16 @@ static __fastpath_inline
void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
int objects)
{
- struct obj_cgroup **objcgs;
+ struct slabobj_ext *obj_exts;
if (!memcg_kmem_online())
return;
- objcgs = slab_objcgs(slab);
- if (likely(!objcgs))
+ obj_exts = slab_obj_exts(slab);
+ if (likely(!obj_exts))
return;
- __memcg_slab_free_hook(s, slab, p, objects, objcgs);
+ __memcg_slab_free_hook(s, slab, p, objects, obj_exts);
}
static inline
@@ -2038,15 +2050,6 @@ void memcg_slab_alloc_error_hook(struct kmem_cache *s, int objects,
obj_cgroup_uncharge(objcg, objects * obj_full_size(s));
}
#else /* CONFIG_MEMCG_KMEM */
-static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
-{
- return NULL;
-}
-
-static inline void memcg_free_slab_cgroups(struct slab *slab)
-{
-}
-
static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
struct list_lru *lru,
struct obj_cgroup **objcgp,
@@ -2314,7 +2317,7 @@ static __always_inline void account_slab(struct slab *slab, int order,
struct kmem_cache *s, gfp_t gfp)
{
if (memcg_kmem_online() && (s->flags & SLAB_ACCOUNT))
- memcg_alloc_slab_cgroups(slab, s, gfp, true);
+ alloc_slab_obj_exts(slab, s, gfp, true);
mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
PAGE_SIZE << order);
@@ -2323,8 +2326,7 @@ static __always_inline void account_slab(struct slab *slab, int order,
static __always_inline void unaccount_slab(struct slab *slab, int order,
struct kmem_cache *s)
{
- if (memcg_kmem_online())
- memcg_free_slab_cgroups(slab);
+ free_slab_obj_exts(slab);
mod_node_page_state(slab_pgdat(slab), cache_vmstat_idx(s),
-(PAGE_SIZE << order));
@@ -3775,6 +3777,7 @@ void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg,
unsigned int orig_size)
{
unsigned int zero_size = s->object_size;
+ struct slabobj_ext *obj_exts;
bool kasan_init = init;
size_t i;
gfp_t init_flags = flags & gfp_allowed_mask;
@@ -3817,6 +3820,7 @@ void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg,
kmemleak_alloc_recursive(p[i], s->object_size, 1,
s->flags, init_flags);
kmsan_slab_alloc(s, p[i], init_flags);
+ obj_exts = prepare_slab_obj_exts_hook(s, flags, p[i]);
}
memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
--
2.43.0.687.g38aa6559b0-goog
Introduce the __GFP_NO_OBJ_EXT flag to prevent recursive allocations
when allocating the slabobj_ext vector for a slab.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/gfp_types.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 868c8fb1bbc1..e36e168d8cfd 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -52,6 +52,9 @@ enum {
#endif
#ifdef CONFIG_LOCKDEP
___GFP_NOLOCKDEP_BIT,
+#endif
+#ifdef CONFIG_SLAB_OBJ_EXT
+ ___GFP_NO_OBJ_EXT_BIT,
#endif
___GFP_LAST_BIT
};
@@ -93,6 +96,11 @@ enum {
#else
#define ___GFP_NOLOCKDEP 0
#endif
+#ifdef CONFIG_SLAB_OBJ_EXT
+#define ___GFP_NO_OBJ_EXT BIT(___GFP_NO_OBJ_EXT_BIT)
+#else
+#define ___GFP_NO_OBJ_EXT 0
+#endif
/*
* Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -133,12 +141,15 @@ enum {
* node with no fallbacks or placement policy enforcements.
*
* %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
+ *
+ * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
*/
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
+#define __GFP_NO_OBJ_EXT ((__force gfp_t)___GFP_NO_OBJ_EXT)
/**
* DOC: Watermark modifiers
--
2.43.0.687.g38aa6559b0-goog
From: Kent Overstreet <[email protected]>
We're introducing alloc tagging, which tracks memory allocations by
callsite. Converting alloc_inode_sb() to a macro means allocations will
be tracked by its caller rather than attributed to the single line inside
the inline helper, which is more useful.
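
To illustrate the difference, here is a minimal userspace sketch. It
assumes a hypothetical record_site() hook standing in for the allocation
tag; none of these names come from the patchset. A tag created through an
inline wrapper always records the wrapper's own line, while a macro
wrapper expands in each caller:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-callsite tag: remembers the last line seen. */
static int last_line;

static void *record_site(int line, void *p)
{
	last_line = line;
	return p;
}

/* The tag is created wherever this macro is expanded. */
#define tagged_alloc(p) record_site(__LINE__, (p))

/* Inline wrapper: the expansion, and so the tag, lives on this one line. */
static inline void *wrapper_inline(void *p)
{
	return tagged_alloc(p);
}

/* Macro wrapper: the expansion happens in each caller. */
#define wrapper_macro(p) tagged_alloc(p)

/* Returns 1 when two inline-wrapper calls are attributed to the same line. */
static int inline_calls_share_line(void)
{
	int first, second;

	wrapper_inline(NULL);
	first = last_line;
	wrapper_inline(NULL);
	second = last_line;
	return first == second;
}

/* Returns 1 when two macro-wrapper calls are attributed to different lines. */
static int macro_calls_differ(void)
{
	int first, second;

	wrapper_macro(NULL);
	first = last_line;
	wrapper_macro(NULL);
	second = last_line;
	return first != second;
}

/* Self-check tying the two observations together. */
static void check_attribution(void)
{
	assert(inline_calls_share_line());
	assert(macro_calls_differ());
}
```

With the inline form, every filesystem's inode allocation would be
reported against the one line in fs.h; with the macro form, each
filesystem gets its own entry.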
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Alexander Viro <[email protected]>
---
include/linux/fs.h | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ed5966a70495..7794b4182bac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3013,11 +3013,7 @@ int setattr_should_drop_sgid(struct mnt_idmap *idmap,
* This must be used for allocating filesystems specific inodes to set
* up the inode reclaim context correctly.
*/
-static inline void *
-alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
-{
- return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
-}
+#define alloc_inode_sb(_sb, _cache, _gfp) kmem_cache_alloc_lru(_cache, &(_sb)->s_inode_lru, _gfp)
extern void __insert_inode_hash(struct inode *, unsigned long hashval);
static inline void insert_inode_hash(struct inode *inode)
--
2.43.0.687.g38aa6559b0-goog
Slab extension objects can't be allocated before slab infrastructure is
initialized. Some caches, like kmem_cache and kmem_cache_node, are created
before slab infrastructure is initialized. Objects from these caches can't
have extension objects. Introduce SLAB_NO_OBJ_EXT slab flag to mark these
caches and avoid creating extensions for objects allocated from these
slabs.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/slab.h | 7 +++++++
mm/slub.c | 5 +++--
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index b5f5ee8308d0..3ac2fc830f0f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -164,6 +164,13 @@
#endif
#define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */
+#ifdef CONFIG_SLAB_OBJ_EXT
+/* Slab created using create_boot_cache */
+#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x20000000U)
+#else
+#define SLAB_NO_OBJ_EXT 0
+#endif
+
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
*
diff --git a/mm/slub.c b/mm/slub.c
index 1eb1050814aa..9fd96238ed39 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5650,7 +5650,8 @@ void __init kmem_cache_init(void)
node_set(node, slab_nodes);
create_boot_cache(kmem_cache_node, "kmem_cache_node",
- sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN, 0, 0);
+ sizeof(struct kmem_cache_node),
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
@@ -5660,7 +5661,7 @@ void __init kmem_cache_init(void)
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
nr_node_ids * sizeof(struct kmem_cache_node *),
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
kmem_cache = bootstrap(&boot_kmem_cache);
kmem_cache_node = bootstrap(&boot_kmem_cache_node);
--
2.43.0.687.g38aa6559b0-goog
Use __GFP_NO_OBJ_EXT to prevent recursion when allocating slabobj_ext
objects. Also prevent slabobj_ext allocations for kmem_cache objects.
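
The guard can be sketched in userspace as follows. This is an
illustration only, under the assumption of a toy allocator built on
malloc(); toy_alloc(), TOY_GFP_NO_EXT and TOY_CACHE_NO_EXT are
hypothetical names chosen to mirror the roles of the slab allocation
path, __GFP_NO_OBJ_EXT and SLAB_NO_OBJ_EXT:

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_GFP_NO_EXT   0x1u	/* hypothetical analogue of __GFP_NO_OBJ_EXT */
#define TOY_CACHE_NO_EXT 0x1u	/* hypothetical analogue of SLAB_NO_OBJ_EXT */

struct toy_cache {
	unsigned int flags;
};

static unsigned int ext_vectors;	/* extension vectors allocated so far */
static unsigned int depth, max_depth;	/* nesting of toy_alloc() calls */

static void *toy_alloc(struct toy_cache *c, size_t size, unsigned int gfp);

/* The vector comes from the same allocator, so it must opt out of the hook. */
static void toy_attach_ext(struct toy_cache *c, unsigned int gfp)
{
	void *vec = toy_alloc(c, 64, gfp | TOY_GFP_NO_EXT);

	if (vec)
		ext_vectors++;
	free(vec);	/* the sketch only counts; it does not keep the vector */
}

static void *toy_alloc(struct toy_cache *c, size_t size, unsigned int gfp)
{
	void *p;

	depth++;
	if (depth > max_depth)
		max_depth = depth;
	assert(depth <= 2);	/* the flag keeps the nesting bounded */

	p = malloc(size);
	if (p && !(c->flags & TOY_CACHE_NO_EXT) && !(gfp & TOY_GFP_NO_EXT))
		toy_attach_ext(c, gfp);

	depth--;
	return p;
}

/* One ordinary allocation: returns how deep the nesting went. */
static unsigned int demo_normal_alloc(void)
{
	struct toy_cache normal = { .flags = 0 };

	max_depth = 0;
	free(toy_alloc(&normal, 32, 0));
	return max_depth;
}

/* One allocation from a flagged cache: returns how many vectors it added. */
static unsigned int demo_boot_alloc(void)
{
	struct toy_cache boot = { .flags = TOY_CACHE_NO_EXT };
	unsigned int before = ext_vectors;

	free(toy_alloc(&boot, 32, 0));
	return ext_vectors - before;
}
```

Without the flag set in toy_attach_ext(), each vector allocation would
try to attach a vector of its own and never terminate.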
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/slab.h | 6 ++++++
mm/slab_common.c | 2 ++
2 files changed, 8 insertions(+)
diff --git a/mm/slab.h b/mm/slab.h
index 436a126486b5..f4ff635091e4 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -589,6 +589,12 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
if (!need_slab_obj_ext())
return NULL;
+ if (s->flags & SLAB_NO_OBJ_EXT)
+ return NULL;
+
+ if (flags & __GFP_NO_OBJ_EXT)
+ return NULL;
+
slab = virt_to_slab(p);
if (!slab_obj_exts(slab) &&
WARN(alloc_slab_obj_exts(slab, s, flags, false),
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 6bfa1810da5e..83fec2dd2e2d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -218,6 +218,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
void *vec;
gfp &= ~OBJCGS_CLEAR_MASK;
+ /* Prevent recursive extension vector allocation */
+ gfp |= __GFP_NO_OBJ_EXT;
vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
slab_nid(slab));
if (!vec)
--
2.43.0.687.g38aa6559b0-goog
Introduce objext_flags to store additional objext flags unrelated to memcg.
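
As a rough illustration of the scheme: flags live in the low bits of an
8-byte-aligned pointer, and a mask derived from "the next bit after the
last actual flag" strips them off again. The sketch below is userspace
C with hypothetical toy_* names, not the kernel definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag layout following the "next bit after the last flag" idiom. */
enum toy_flag_bits {
	TOY_FLAG_A   = 1 << 0,
	TOY_FLAG_B   = 1 << 1,
	NR_TOY_FLAGS = 1 << 2,	/* the next bit after the last actual flag */
};

#define TOY_FLAGS_MASK ((uintptr_t)NR_TOY_FLAGS - 1)

/* 8-byte alignment leaves the low three pointer bits free for flags. */
struct toy_ext {
	void *owner;
} __attribute__((aligned(8)));

static uintptr_t toy_pack(struct toy_ext *vec, uintptr_t flags)
{
	/* The pointer must not already use any of the flag bits. */
	assert(((uintptr_t)vec & TOY_FLAGS_MASK) == 0);
	return (uintptr_t)vec | (flags & TOY_FLAGS_MASK);
}

static struct toy_ext *toy_ptr(uintptr_t word)
{
	return (struct toy_ext *)(word & ~TOY_FLAGS_MASK);
}

static uintptr_t toy_get_flags(uintptr_t word)
{
	return word & TOY_FLAGS_MASK;
}

/* Returns 1 when both the pointer and the flags survive a round trip. */
static int toy_roundtrip_ok(void)
{
	static struct toy_ext vec[4];
	uintptr_t word = toy_pack(vec, TOY_FLAG_A | TOY_FLAG_B);

	return toy_ptr(word) == vec &&
	       toy_get_flags(word) == (TOY_FLAG_A | TOY_FLAG_B);
}
```

If no flags are defined, the "next bit" is bit 0, the mask is 0, and the
stored word is simply the pointer, so one unmasking expression works in
both configurations.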
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/memcontrol.h | 29 ++++++++++++++++++++++-------
mm/slab.h | 4 +---
2 files changed, 23 insertions(+), 10 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eb1dc181e412..f3584e98b640 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -356,7 +356,22 @@ enum page_memcg_data_flags {
__NR_MEMCG_DATA_FLAGS = (1UL << 2),
};
-#define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
+#define __FIRST_OBJEXT_FLAG __NR_MEMCG_DATA_FLAGS
+
+#else /* CONFIG_MEMCG */
+
+#define __FIRST_OBJEXT_FLAG (1UL << 0)
+
+#endif /* CONFIG_MEMCG */
+
+enum objext_flags {
+ /* the next bit after the last actual flag */
+ __NR_OBJEXTS_FLAGS = __FIRST_OBJEXT_FLAG,
+};
+
+#define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
+
+#ifdef CONFIG_MEMCG
static inline bool folio_memcg_kmem(struct folio *folio);
@@ -390,7 +405,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}
/*
@@ -411,7 +426,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
- return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}
/*
@@ -468,11 +483,11 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
if (memcg_data & MEMCG_DATA_KMEM) {
struct obj_cgroup *objcg;
- objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
return obj_cgroup_memcg(objcg);
}
- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}
/*
@@ -511,11 +526,11 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
if (memcg_data & MEMCG_DATA_KMEM) {
struct obj_cgroup *objcg;
- objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
return obj_cgroup_memcg(objcg);
}
- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}
static inline struct mem_cgroup *page_memcg_check(struct page *page)
diff --git a/mm/slab.h b/mm/slab.h
index f4ff635091e4..77cf7474fe46 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -560,10 +560,8 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
slab_page(slab));
VM_BUG_ON_PAGE(obj_exts & MEMCG_DATA_KMEM, slab_page(slab));
- return (struct slabobj_ext *)(obj_exts & ~MEMCG_DATA_FLAGS_MASK);
-#else
- return (struct slabobj_ext *)obj_exts;
#endif
+ return (struct slabobj_ext *)(obj_exts & ~OBJEXTS_FLAGS_MASK);
}
int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
--
2.43.0.687.g38aa6559b0-goog
Add basic infrastructure to support code tagging. Each code tag stores
the information common to all tag types: module name, function, file name
and line number. Provide functions to register a new code tag type and to
iterate over code tags.
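
The underlying mechanism can be sketched in userspace. This is a
minimal illustration, assuming an ELF target and a GNU-compatible
toolchain (GCC or Clang with GNU ld or lld), which synthesize
__start_<name> and __stop_<name> symbols for any section whose name is a
valid C identifier. The demo_* names and the "demo_tags" section are
hypothetical and not part of this patch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_tag {
	unsigned int lineno;
	const char *function;
	const char *filename;
} __attribute__((aligned(8)));

/* Provided by the linker: the bounds of the "demo_tags" section. */
extern struct demo_tag __start_demo_tags[];
extern struct demo_tag __stop_demo_tags[];

/*
 * Emit one tag for the current code location. The explicit alignment on
 * the variable keeps the compiler from padding entries, so the section
 * can be walked as a plain array.
 */
#define DEMO_TAG_HERE()							\
do {									\
	static struct demo_tag _tag					\
	__attribute__((used, aligned(8), section("demo_tags"))) = {	\
		.lineno = __LINE__,					\
		.function = __func__,					\
		.filename = __FILE__,					\
	};								\
	(void)_tag;							\
} while (0)

static void demo_site_one(void)
{
	DEMO_TAG_HERE();
}

static void demo_site_two(void)
{
	DEMO_TAG_HERE();
}

/* Number of tags the linker collected into the section. */
static size_t demo_tag_count(void)
{
	assert(__start_demo_tags <= __stop_demo_tags);
	return (size_t)(__stop_demo_tags - __start_demo_tags);
}

/* Walk the section like an array and look a tag up by function name. */
static const struct demo_tag *demo_find(const char *function)
{
	for (struct demo_tag *t = __start_demo_tags; t < __stop_demo_tags; t++)
		if (strcmp(t->function, function) == 0)
			return t;
	return NULL;
}
```

The tags exist as soon as the program is linked, whether or not the
tagged code has run, which is what lets a registry enumerate every call
site up front.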
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/codetag.h | 71 ++++++++++++++
lib/Kconfig.debug | 4 +
lib/Makefile | 1 +
lib/codetag.c | 199 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 275 insertions(+)
create mode 100644 include/linux/codetag.h
create mode 100644 lib/codetag.c
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
new file mode 100644
index 000000000000..a9d7adecc2a5
--- /dev/null
+++ b/include/linux/codetag.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * code tagging framework
+ */
+#ifndef _LINUX_CODETAG_H
+#define _LINUX_CODETAG_H
+
+#include <linux/types.h>
+
+struct codetag_iterator;
+struct codetag_type;
+struct seq_buf;
+struct module;
+
+/*
+ * An instance of this structure is created in a special ELF section at every
+ * code location being tagged. At runtime, the special section is treated as
+ * an array of these.
+ */
+struct codetag {
+ unsigned int flags; /* used in later patches */
+ unsigned int lineno;
+ const char *modname;
+ const char *function;
+ const char *filename;
+} __aligned(8);
+
+union codetag_ref {
+ struct codetag *ct;
+};
+
+struct codetag_range {
+ struct codetag *start;
+ struct codetag *stop;
+};
+
+struct codetag_module {
+ struct module *mod;
+ struct codetag_range range;
+};
+
+struct codetag_type_desc {
+ const char *section;
+ size_t tag_size;
+};
+
+struct codetag_iterator {
+ struct codetag_type *cttype;
+ struct codetag_module *cmod;
+ unsigned long mod_id;
+ struct codetag *ct;
+};
+
+#define CODE_TAG_INIT { \
+ .modname = KBUILD_MODNAME, \
+ .function = __func__, \
+ .filename = __FILE__, \
+ .lineno = __LINE__, \
+ .flags = 0, \
+}
+
+void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
+struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
+struct codetag *codetag_next_ct(struct codetag_iterator *iter);
+
+void codetag_to_text(struct seq_buf *out, struct codetag *ct);
+
+struct codetag_type *
+codetag_register_type(const struct codetag_type_desc *desc);
+
+#endif /* _LINUX_CODETAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 975a07f9f1cc..0be2d00c3696 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -968,6 +968,10 @@ config DEBUG_STACKOVERFLOW
If in doubt, say "N".
+config CODE_TAGGING
+ bool
+ select KALLSYMS
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"
source "lib/Kconfig.kmsan"
diff --git a/lib/Makefile b/lib/Makefile
index 6b09731d8e61..6b48b22fdfac 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -235,6 +235,7 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
of-reconfig-notifier-error-inject.o
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
+obj-$(CONFIG_CODE_TAGGING) += codetag.o
lib-$(CONFIG_GENERIC_BUG) += bug.o
obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/codetag.c b/lib/codetag.c
new file mode 100644
index 000000000000..7708f8388e55
--- /dev/null
+++ b/lib/codetag.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/codetag.h>
+#include <linux/idr.h>
+#include <linux/kallsyms.h>
+#include <linux/module.h>
+#include <linux/seq_buf.h>
+#include <linux/slab.h>
+
+struct codetag_type {
+ struct list_head link;
+ unsigned int count;
+ struct idr mod_idr;
+ struct rw_semaphore mod_lock; /* protects mod_idr */
+ struct codetag_type_desc desc;
+};
+
+static DEFINE_MUTEX(codetag_lock);
+static LIST_HEAD(codetag_types);
+
+void codetag_lock_module_list(struct codetag_type *cttype, bool lock)
+{
+ if (lock)
+ down_read(&cttype->mod_lock);
+ else
+ up_read(&cttype->mod_lock);
+}
+
+struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
+{
+ struct codetag_iterator iter = {
+ .cttype = cttype,
+ .cmod = NULL,
+ .mod_id = 0,
+ .ct = NULL,
+ };
+
+ return iter;
+}
+
+static inline struct codetag *get_first_module_ct(struct codetag_module *cmod)
+{
+ return cmod->range.start < cmod->range.stop ? cmod->range.start : NULL;
+}
+
+static inline
+struct codetag *get_next_module_ct(struct codetag_iterator *iter)
+{
+ struct codetag *res = (struct codetag *)
+ ((char *)iter->ct + iter->cttype->desc.tag_size);
+
+ return res < iter->cmod->range.stop ? res : NULL;
+}
+
+struct codetag *codetag_next_ct(struct codetag_iterator *iter)
+{
+ struct codetag_type *cttype = iter->cttype;
+ struct codetag_module *cmod;
+ struct codetag *ct;
+
+ lockdep_assert_held(&cttype->mod_lock);
+
+ if (unlikely(idr_is_empty(&cttype->mod_idr)))
+ return NULL;
+
+ ct = NULL;
+ while (true) {
+ cmod = idr_find(&cttype->mod_idr, iter->mod_id);
+
+ /* If module was removed move to the next one */
+ if (!cmod)
+ cmod = idr_get_next_ul(&cttype->mod_idr,
+ &iter->mod_id);
+
+ /* Exit if no more modules */
+ if (!cmod)
+ break;
+
+ if (cmod != iter->cmod) {
+ iter->cmod = cmod;
+ ct = get_first_module_ct(cmod);
+ } else
+ ct = get_next_module_ct(iter);
+
+ if (ct)
+ break;
+
+ iter->mod_id++;
+ }
+
+ iter->ct = ct;
+ return ct;
+}
+
+void codetag_to_text(struct seq_buf *out, struct codetag *ct)
+{
+ seq_buf_printf(out, "%s:%u module:%s func:%s",
+ ct->filename, ct->lineno,
+ ct->modname, ct->function);
+}
+
+static inline size_t range_size(const struct codetag_type *cttype,
+ const struct codetag_range *range)
+{
+ return ((char *)range->stop - (char *)range->start) /
+ cttype->desc.tag_size;
+}
+
+static void *get_symbol(struct module *mod, const char *prefix, const char *name)
+{
+ char buf[64];
+ int res;
+
+ res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
+ if (WARN_ON(res < 1 || res > sizeof(buf)))
+ return NULL;
+
+ return mod ?
+ (void *)find_kallsyms_symbol_value(mod, buf) :
+ (void *)kallsyms_lookup_name(buf);
+}
+
+static struct codetag_range get_section_range(struct module *mod,
+ const char *section)
+{
+ return (struct codetag_range) {
+ get_symbol(mod, "__start_", section),
+ get_symbol(mod, "__stop_", section),
+ };
+}
+
+static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
+{
+ struct codetag_range range;
+ struct codetag_module *cmod;
+ int err;
+
+ range = get_section_range(mod, cttype->desc.section);
+ if (!range.start || !range.stop) {
+ pr_warn("Failed to load code tags of type %s from the module %s\n",
+ cttype->desc.section,
+ mod ? mod->name : "(built-in)");
+ return -EINVAL;
+ }
+
+ /* Ignore empty ranges */
+ if (range.start == range.stop)
+ return 0;
+
+ BUG_ON(range.start > range.stop);
+
+ cmod = kmalloc(sizeof(*cmod), GFP_KERNEL);
+ if (unlikely(!cmod))
+ return -ENOMEM;
+
+ cmod->mod = mod;
+ cmod->range = range;
+
+ down_write(&cttype->mod_lock);
+ err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
+ if (err >= 0)
+ cttype->count += range_size(cttype, &range);
+ up_write(&cttype->mod_lock);
+
+ if (err < 0) {
+ kfree(cmod);
+ return err;
+ }
+
+ return 0;
+}
+
+struct codetag_type *
+codetag_register_type(const struct codetag_type_desc *desc)
+{
+ struct codetag_type *cttype;
+ int err;
+
+ BUG_ON(desc->tag_size <= 0);
+
+ cttype = kzalloc(sizeof(*cttype), GFP_KERNEL);
+ if (unlikely(!cttype))
+ return ERR_PTR(-ENOMEM);
+
+ cttype->desc = *desc;
+ idr_init(&cttype->mod_idr);
+ init_rwsem(&cttype->mod_lock);
+
+ err = codetag_module_init(cttype, NULL);
+ if (unlikely(err)) {
+ kfree(cttype);
+ return ERR_PTR(err);
+ }
+
+ mutex_lock(&codetag_lock);
+ list_add_tail(&cttype->link, &codetag_types);
+ mutex_unlock(&codetag_lock);
+
+ return cttype;
+}
--
2.43.0.687.g38aa6559b0-goog
Add support for code tagging from dynamically loaded modules.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/codetag.h | 12 +++++++++
kernel/module/main.c | 4 +++
lib/codetag.c | 58 +++++++++++++++++++++++++++++++++++++++--
3 files changed, 72 insertions(+), 2 deletions(-)
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index a9d7adecc2a5..386733e89b31 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -42,6 +42,10 @@ struct codetag_module {
struct codetag_type_desc {
const char *section;
size_t tag_size;
+ void (*module_load)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
+ void (*module_unload)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
};
struct codetag_iterator {
@@ -68,4 +72,12 @@ void codetag_to_text(struct seq_buf *out, struct codetag *ct);
struct codetag_type *
codetag_register_type(const struct codetag_type_desc *desc);
+#ifdef CONFIG_CODE_TAGGING
+void codetag_load_module(struct module *mod);
+void codetag_unload_module(struct module *mod);
+#else
+static inline void codetag_load_module(struct module *mod) {}
+static inline void codetag_unload_module(struct module *mod) {}
+#endif
+
#endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 36681911c05a..f400ba076cc7 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -56,6 +56,7 @@
#include <linux/dynamic_debug.h>
#include <linux/audit.h>
#include <linux/cfi.h>
+#include <linux/codetag.h>
#include <linux/debugfs.h>
#include <uapi/linux/module.h>
#include "internal.h"
@@ -1242,6 +1243,7 @@ static void free_module(struct module *mod)
{
trace_module_free(mod);
+ codetag_unload_module(mod);
mod_sysfs_teardown(mod);
/*
@@ -2978,6 +2980,8 @@ static int load_module(struct load_info *info, const char __user *uargs,
/* Get rid of temporary copy. */
free_copy(info, flags);
+ codetag_load_module(mod);
+
/* Done! */
trace_module_load(mod);
diff --git a/lib/codetag.c b/lib/codetag.c
index 7708f8388e55..4ea57fb37346 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -108,15 +108,20 @@ static inline size_t range_size(const struct codetag_type *cttype,
static void *get_symbol(struct module *mod, const char *prefix, const char *name)
{
char buf[64];
+ void *ret;
int res;
res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
if (WARN_ON(res < 1 || res > sizeof(buf)))
return NULL;
- return mod ?
+ preempt_disable();
+ ret = mod ?
(void *)find_kallsyms_symbol_value(mod, buf) :
(void *)kallsyms_lookup_name(buf);
+ preempt_enable();
+
+ return ret;
}
static struct codetag_range get_section_range(struct module *mod,
@@ -157,8 +162,11 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
down_write(&cttype->mod_lock);
err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
- if (err >= 0)
+ if (err >= 0) {
cttype->count += range_size(cttype, &range);
+ if (cttype->desc.module_load)
+ cttype->desc.module_load(cttype, cmod);
+ }
up_write(&cttype->mod_lock);
if (err < 0) {
@@ -197,3 +205,49 @@ codetag_register_type(const struct codetag_type_desc *desc)
return cttype;
}
+
+void codetag_load_module(struct module *mod)
+{
+ struct codetag_type *cttype;
+
+ if (!mod)
+ return;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link)
+ codetag_module_init(cttype, mod);
+ mutex_unlock(&codetag_lock);
+}
+
+void codetag_unload_module(struct module *mod)
+{
+ struct codetag_type *cttype;
+
+ if (!mod)
+ return;
+
+ mutex_lock(&codetag_lock);
+ list_for_each_entry(cttype, &codetag_types, link) {
+ struct codetag_module *found = NULL;
+ struct codetag_module *cmod;
+ unsigned long mod_id, tmp;
+
+ down_write(&cttype->mod_lock);
+ idr_for_each_entry_ul(&cttype->mod_idr, cmod, tmp, mod_id) {
+ if (cmod->mod && cmod->mod == mod) {
+ found = cmod;
+ break;
+ }
+ }
+ if (found) {
+ if (cttype->desc.module_unload)
+ cttype->desc.module_unload(cttype, cmod);
+
+ cttype->count -= range_size(cttype, &cmod->range);
+ idr_remove(&cttype->mod_idr, mod_id);
+ kfree(cmod);
+ }
+ up_write(&cttype->mod_lock);
+ }
+ mutex_unlock(&codetag_lock);
+}
--
2.43.0.687.g38aa6559b0-goog
Skip freeing a module's data section if any of its allocation tags still
has a non-zero count. Otherwise, when those allocations are eventually
freed, accessing their code tags would be a use-after-free (UAF).
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
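Note for reviewers (not part of the commit): the decision this patch
introduces can be sketched in userspace as follows. This is an illustration
under assumptions, not the kernel code. The names fake_module, unload_hook,
tags_unused, run_unload_hooks and free_fake_module are hypothetical, and
malloc()/free() stand in for the module memory allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a module whose data section holds its tags. */
struct fake_module {
	void *data;		/* stands in for the MOD_DATA region */
	size_t live_bytes;	/* bytes still allocated from its callsites */
};

typedef bool (*unload_hook)(const struct fake_module *mod);

/* Example hook: agree to the unload only if nothing is still allocated. */
bool tags_unused(const struct fake_module *mod)
{
	return mod->live_bytes == 0;
}

/* Run every hook; a single refusal makes the overall result false. */
bool run_unload_hooks(const struct fake_module *mod,
		      const unload_hook *hooks, size_t n)
{
	bool unload_ok = true;

	for (size_t i = 0; i < n; i++)
		if (hooks[i] && !hooks[i](mod))
			unload_ok = false;

	return unload_ok;
}

/*
 * Free the data region only when all hooks agreed. When one refused, the
 * region is intentionally leaked so that tags living in it stay valid.
 */
bool free_fake_module(struct fake_module *mod,
		      const unload_hook *hooks, size_t n)
{
	bool unload_ok = run_unload_hooks(mod, hooks, n);

	if (unload_ok) {
		free(mod->data);
		mod->data = NULL;
	}
	return unload_ok;
}
```

The trade-off is a bounded leak of the data region in exchange for never
dereferencing a tag that lives in freed memory.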
include/linux/codetag.h | 6 +++---
kernel/module/main.c | 23 +++++++++++++++--------
lib/codetag.c | 11 ++++++++---
3 files changed, 26 insertions(+), 14 deletions(-)
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 386733e89b31..d98e4c8e86f0 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -44,7 +44,7 @@ struct codetag_type_desc {
size_t tag_size;
void (*module_load)(struct codetag_type *cttype,
struct codetag_module *cmod);
- void (*module_unload)(struct codetag_type *cttype,
+ bool (*module_unload)(struct codetag_type *cttype,
struct codetag_module *cmod);
};
@@ -74,10 +74,10 @@ codetag_register_type(const struct codetag_type_desc *desc);
#ifdef CONFIG_CODE_TAGGING
void codetag_load_module(struct module *mod);
-void codetag_unload_module(struct module *mod);
+bool codetag_unload_module(struct module *mod);
#else
static inline void codetag_load_module(struct module *mod) {}
-static inline void codetag_unload_module(struct module *mod) {}
+static inline bool codetag_unload_module(struct module *mod) { return true; }
#endif
#endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index f400ba076cc7..658b631e76ad 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1211,15 +1211,19 @@ static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
return module_alloc(size);
}
-static void module_memory_free(void *ptr, enum mod_mem_type type)
+static void module_memory_free(void *ptr, enum mod_mem_type type,
+ bool unload_codetags)
{
+ if (!unload_codetags && mod_mem_type_is_core_data(type))
+ return;
+
if (mod_mem_use_vmalloc(type))
vfree(ptr);
else
module_memfree(ptr);
}
-static void free_mod_mem(struct module *mod)
+static void free_mod_mem(struct module *mod, bool unload_codetags)
{
for_each_mod_mem_type(type) {
struct module_memory *mod_mem = &mod->mem[type];
@@ -1230,20 +1234,23 @@ static void free_mod_mem(struct module *mod)
/* Free lock-classes; relies on the preceding sync_rcu(). */
lockdep_free_key_range(mod_mem->base, mod_mem->size);
if (mod_mem->size)
- module_memory_free(mod_mem->base, type);
+ module_memory_free(mod_mem->base, type,
+ unload_codetags);
}
/* MOD_DATA hosts mod, so free it at last */
lockdep_free_key_range(mod->mem[MOD_DATA].base, mod->mem[MOD_DATA].size);
- module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA);
+ module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA, unload_codetags);
}
/* Free a module, remove from lists, etc. */
static void free_module(struct module *mod)
{
+ bool unload_codetags;
+
trace_module_free(mod);
- codetag_unload_module(mod);
+ unload_codetags = codetag_unload_module(mod);
mod_sysfs_teardown(mod);
/*
@@ -1285,7 +1292,7 @@ static void free_module(struct module *mod)
kfree(mod->args);
percpu_modfree(mod);
- free_mod_mem(mod);
+ free_mod_mem(mod, unload_codetags);
}
void *__symbol_get(const char *symbol)
@@ -2298,7 +2305,7 @@ static int move_module(struct module *mod, struct load_info *info)
return 0;
out_enomem:
for (t--; t >= 0; t--)
- module_memory_free(mod->mem[t].base, t);
+ module_memory_free(mod->mem[t].base, t, true);
return ret;
}
@@ -2428,7 +2435,7 @@ static void module_deallocate(struct module *mod, struct load_info *info)
percpu_modfree(mod);
module_arch_freeing_init(mod);
- free_mod_mem(mod);
+ free_mod_mem(mod, true);
}
int __weak module_finalize(const Elf_Ehdr *hdr,
diff --git a/lib/codetag.c b/lib/codetag.c
index 4ea57fb37346..0ad4ea66c769 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -5,6 +5,7 @@
#include <linux/module.h>
#include <linux/seq_buf.h>
#include <linux/slab.h>
+#include <linux/vmalloc.h>
struct codetag_type {
struct list_head link;
@@ -219,12 +220,13 @@ void codetag_load_module(struct module *mod)
mutex_unlock(&codetag_lock);
}
-void codetag_unload_module(struct module *mod)
+bool codetag_unload_module(struct module *mod)
{
struct codetag_type *cttype;
+ bool unload_ok = true;
if (!mod)
- return;
+ return true;
mutex_lock(&codetag_lock);
list_for_each_entry(cttype, &codetag_types, link) {
@@ -241,7 +243,8 @@ void codetag_unload_module(struct module *mod)
}
if (found) {
if (cttype->desc.module_unload)
- cttype->desc.module_unload(cttype, cmod);
+ if (!cttype->desc.module_unload(cttype, cmod))
+ unload_ok = false;
cttype->count -= range_size(cttype, &cmod->range);
idr_remove(&cttype->mod_idr, mod_id);
@@ -250,4 +253,6 @@ void codetag_unload_module(struct module *mod)
up_write(&cttype->mod_lock);
}
mutex_unlock(&codetag_lock);
+
+ return unload_ok;
}
--
2.43.0.687.g38aa6559b0-goog
Introduce CONFIG_MEM_ALLOC_PROFILING, which provides definitions to easily
instrument memory allocators. It registers an "alloc_tags" codetag type
and, when the feature is enabled, exposes the allocation tag information
through the /proc/allocinfo interface.
CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
allocation profiling instrumentation.
Memory allocation profiling can be enabled or disabled at runtime using
/proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n.
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation
profiling by default.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
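Note for reviewers (not part of the commit): the "one static tag per
callsite, collected in a dedicated section" idea can be tried in userspace
with the sketch below. It assumes an ELF target linked with GNU ld or lld,
which provide __start_<name>/__stop_<name> symbols for sections whose name
is a valid C identifier, and GNU C extensions (statement expressions,
attributes). The section name demo_tags and all identifiers are
hypothetical. Counters here are plain fields, not per-CPU, and nothing is
thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for struct alloc_tag; one instance per callsite. */
struct demo_tag {
	const char *file;
	int line;
	size_t bytes;
	size_t calls;
} __attribute__((aligned(8)));

/* Section bounds supplied by the linker (assumption: GNU ld or lld). */
extern struct demo_tag __start_demo_tags[];
extern struct demo_tag __stop_demo_tags[];

/*
 * Allocate and charge the size to a tag private to this callsite. The
 * explicit aligned(8) on the variable keeps the compiler from padding
 * between tags, so the section can be walked as an array.
 */
#define tagged_malloc(size)						\
({									\
	static struct demo_tag _tag					\
		__attribute__((used, aligned(8), section("demo_tags"))) = { \
		.file = __FILE__, .line = __LINE__ };			\
	size_t _size = (size);						\
	void *_p = malloc(_size);					\
	if (_p) {							\
		_tag.bytes += _size;					\
		_tag.calls++;						\
	}								\
	_p;								\
})

/* Number of instrumented callsites linked into the program. */
size_t demo_tag_count(void)
{
	return (size_t)(__stop_demo_tags - __start_demo_tags);
}

/* Sum of the bytes charged to all callsites. */
size_t demo_total_bytes(void)
{
	size_t total = 0;

	for (struct demo_tag *t = __start_demo_tags; t < __stop_demo_tags; t++)
		total += t->bytes;
	return total;
}

/* Two distinct callsites, hence two distinct tags. */
void *alloc_small(void) { return tagged_malloc(32); }
void *alloc_large(void) { return tagged_malloc(4096); }
```

Calling alloc_small() repeatedly charges the same tag each time, because the
tag is a function-local static; that is what makes the report per-callsite
rather than per-call.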
Documentation/admin-guide/sysctl/vm.rst | 16 +++
Documentation/filesystems/proc.rst | 28 +++++
include/asm-generic/codetag.lds.h | 14 +++
include/asm-generic/vmlinux.lds.h | 3 +
include/linux/alloc_tag.h | 133 ++++++++++++++++++++
include/linux/sched.h | 24 ++++
lib/Kconfig.debug | 25 ++++
lib/Makefile | 2 +
lib/alloc_tag.c | 158 ++++++++++++++++++++++++
scripts/module.lds.S | 7 ++
10 files changed, 410 insertions(+)
create mode 100644 include/asm-generic/codetag.lds.h
create mode 100644 include/linux/alloc_tag.h
create mode 100644 lib/alloc_tag.c
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index c59889de122b..a214719492ea 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -43,6 +43,7 @@ Currently, these files are in /proc/sys/vm:
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
+- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y)
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
@@ -425,6 +426,21 @@ e.g., up to one or two maps per allocation.
The default value is 65530.
+mem_profiling
+==============
+
+Enable memory profiling (when CONFIG_MEM_ALLOC_PROFILING=y)
+
+1: Enable memory profiling.
+
+0: Disable memory profiling.
+
+Enabling memory profiling introduces a small performance overhead for all
+memory allocations.
+
+The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT.
+
+
memory_failure_early_kill:
==========================
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 104c6d047d9b..40d6d18308e4 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -688,6 +688,7 @@ files are there, and which are missing.
============ ===============================================================
File Content
============ ===============================================================
+ allocinfo Memory allocation profiling information
apm Advanced power management info
bootconfig Kernel command line obtained from boot config,
and, if there were kernel parameters from the
@@ -953,6 +954,33 @@ also be allocatable although a lot of filesystem metadata may have to be
reclaimed to achieve this.
+allocinfo
+~~~~~~~~~
+
+Provides information about memory allocations at all locations in the code
+base. Each allocation in the code is identified by its source file, line
+number, module and the function calling the allocation. The number of bytes
+allocated and the number of calls at each location are reported.
+
+Example output.
+
+::
+
+ > cat /proc/allocinfo
+
+ 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
+ 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
+ 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
+ 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
+ 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
+ 1.16MiB fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
+ 1.00MiB mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
+ 734KiB fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
+ 640KiB kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
+ 640KiB drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf
+ ...
+
+
meminfo
~~~~~~~
diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
new file mode 100644
index 000000000000..64f536b80380
--- /dev/null
+++ b/include/asm-generic/codetag.lds.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_GENERIC_CODETAG_LDS_H
+#define __ASM_GENERIC_CODETAG_LDS_H
+
+#define SECTION_WITH_BOUNDARIES(_name) \
+ . = ALIGN(8); \
+ __start_##_name = .; \
+ KEEP(*(_name)) \
+ __stop_##_name = .;
+
+#define CODETAG_SECTIONS() \
+ SECTION_WITH_BOUNDARIES(alloc_tags)
+
+#endif /* __ASM_GENERIC_CODETAG_LDS_H */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 5dd3a61d673d..c9997dc50c50 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -50,6 +50,8 @@
* [__nosave_begin, __nosave_end] for the nosave data
*/
+#include <asm-generic/codetag.lds.h>
+
#ifndef LOAD_OFFSET
#define LOAD_OFFSET 0
#endif
@@ -366,6 +368,7 @@
. = ALIGN(8); \
BOUNDED_SECTION_BY(__dyndbg_classes, ___dyndbg_classes) \
BOUNDED_SECTION_BY(__dyndbg, ___dyndbg) \
+ CODETAG_SECTIONS() \
LIKELY_PROFILE() \
BRANCH_PROFILE() \
TRACE_PRINTKS() \
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
new file mode 100644
index 000000000000..cf55a149fa84
--- /dev/null
+++ b/include/linux/alloc_tag.h
@@ -0,0 +1,133 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * allocation tagging
+ */
+#ifndef _LINUX_ALLOC_TAG_H
+#define _LINUX_ALLOC_TAG_H
+
+#include <linux/bug.h>
+#include <linux/codetag.h>
+#include <linux/container_of.h>
+#include <linux/preempt.h>
+#include <asm/percpu.h>
+#include <linux/cpumask.h>
+#include <linux/static_key.h>
+
+struct alloc_tag_counters {
+ u64 bytes;
+ u64 calls;
+};
+
+/*
+ * An instance of this structure is created in a special ELF section at every
+ * allocation callsite. At runtime, the special section is treated as
+ * an array of these. The embedded codetag hooks it into the codetag framework.
+ */
+struct alloc_tag {
+ struct codetag ct;
+ struct alloc_tag_counters __percpu *counters;
+} __aligned(8);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
+static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
+{
+ return container_of(ct, struct alloc_tag, ct);
+}
+
+#ifdef ARCH_NEEDS_WEAK_PER_CPU
+/*
+ * When percpu variables are required to be defined as weak, static percpu
+ * variables can't be used inside a function (see comments for DECLARE_PER_CPU_SECTION).
+ */
+#error "Memory allocation profiling is incompatible with ARCH_NEEDS_WEAK_PER_CPU"
+#endif
+
+#define DEFINE_ALLOC_TAG(_alloc_tag, _old) \
+ static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr); \
+ static struct alloc_tag _alloc_tag __used __aligned(8) \
+ __section("alloc_tags") = { \
+ .ct = CODE_TAG_INIT, \
+ .counters = &_alloc_tag_cntr }; \
+ struct alloc_tag * __maybe_unused _old = alloc_tag_save(&_alloc_tag)
+
+DECLARE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
+ mem_alloc_profiling_key);
+
+static inline bool mem_alloc_profiling_enabled(void)
+{
+ return static_branch_maybe(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
+ &mem_alloc_profiling_key);
+}
+
+static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
+{
+ struct alloc_tag_counters v = { 0, 0 };
+ struct alloc_tag_counters *counter;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ counter = per_cpu_ptr(tag->counters, cpu);
+ v.bytes += counter->bytes;
+ v.calls += counter->calls;
+ }
+
+ return v;
+}
+
+static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes)
+{
+ struct alloc_tag *tag;
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n");
+#endif
+ if (!ref || !ref->ct)
+ return;
+
+ tag = ct_to_alloc_tag(ref->ct);
+
+ this_cpu_sub(tag->counters->bytes, bytes);
+ this_cpu_dec(tag->counters->calls);
+
+ ref->ct = NULL;
+}
+
+static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
+{
+ __alloc_tag_sub(ref, bytes);
+}
+
+static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes)
+{
+ __alloc_tag_sub(ref, bytes);
+}
+
+static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
+{
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ WARN_ONCE(ref && ref->ct,
+ "alloc_tag was not cleared (got tag for %s:%u)\n",
+ ref->ct->filename, ref->ct->lineno);
+
+ WARN_ONCE(!tag, "current->alloc_tag not set\n");
+#endif
+ if (!ref || !tag)
+ return;
+
+ ref->ct = &tag->ct;
+ this_cpu_add(tag->counters->bytes, bytes);
+ this_cpu_inc(tag->counters->calls);
+}
+
+#else
+
+#define DEFINE_ALLOC_TAG(_alloc_tag, _old)
+static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
+static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
+static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
+ size_t bytes) {}
+
+#endif
+
+#endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffe8f618ab86..da68a10517c8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -770,6 +770,10 @@ struct task_struct {
unsigned int flags;
unsigned int ptrace;
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ struct alloc_tag *alloc_tag;
+#endif
+
#ifdef CONFIG_SMP
int on_cpu;
struct __call_single_node wake_entry;
@@ -810,6 +814,7 @@ struct task_struct {
struct task_group *sched_task_group;
#endif
+
#ifdef CONFIG_UCLAMP_TASK
/*
* Clamp values requested for a scheduling entity.
@@ -2183,4 +2188,23 @@ static inline int sched_core_idle_cpu(int cpu) { return idle_cpu(cpu); }
extern void sched_set_stop_task(int cpu, struct task_struct *stop);
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag)
+{
+ swap(current->alloc_tag, tag);
+ return tag;
+}
+
+static inline void alloc_tag_restore(struct alloc_tag *tag, struct alloc_tag *old)
+{
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ WARN(current->alloc_tag != tag, "current->alloc_tag was changed\n");
+#endif
+ current->alloc_tag = old;
+}
+#else
+static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag) { return NULL; }
+#define alloc_tag_restore(_tag, _old)
+#endif
+
#endif
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 0be2d00c3696..78d258ca508f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -972,6 +972,31 @@ config CODE_TAGGING
bool
select KALLSYMS
+config MEM_ALLOC_PROFILING
+ bool "Enable memory allocation profiling"
+ default n
+ depends on PROC_FS
+ depends on !DEBUG_FORCE_WEAK_PER_CPU
+ select CODE_TAGGING
+ help
+ Track allocation source code and record total allocation size
+ initiated at that code location. The mechanism can be used to track
+ memory leaks with a low performance and memory impact.
+
+config MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
+ bool "Enable memory allocation profiling by default"
+ default y
+ depends on MEM_ALLOC_PROFILING
+
+config MEM_ALLOC_PROFILING_DEBUG
+ bool "Memory allocation profiler debugging"
+ default n
+ depends on MEM_ALLOC_PROFILING
+ select MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
+ help
+ Adds warnings with helpful error messages for memory allocation
+ profiling.
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"
source "lib/Kconfig.kmsan"
diff --git a/lib/Makefile b/lib/Makefile
index 6b48b22fdfac..859112f09bf5 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -236,6 +236,8 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
obj-$(CONFIG_CODE_TAGGING) += codetag.o
+obj-$(CONFIG_MEM_ALLOC_PROFILING) += alloc_tag.o
+
lib-$(CONFIG_GENERIC_BUG) += bug.o
obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
new file mode 100644
index 000000000000..4fc031f9cefd
--- /dev/null
+++ b/lib/alloc_tag.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/alloc_tag.h>
+#include <linux/fs.h>
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_buf.h>
+#include <linux/seq_file.h>
+
+static struct codetag_type *alloc_tag_cttype;
+
+DEFINE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
+ mem_alloc_profiling_key);
+
+static void *allocinfo_start(struct seq_file *m, loff_t *pos)
+{
+ struct codetag_iterator *iter;
+ struct codetag *ct;
+ loff_t node = *pos;
+
+ iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+ m->private = iter;
+ if (!iter)
+ return NULL;
+
+ codetag_lock_module_list(alloc_tag_cttype, true);
+ *iter = codetag_get_ct_iter(alloc_tag_cttype);
+ while ((ct = codetag_next_ct(iter)) != NULL && node)
+ node--;
+
+ return ct ? iter : NULL;
+}
+
+static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
+{
+ struct codetag_iterator *iter = (struct codetag_iterator *)arg;
+ struct codetag *ct = codetag_next_ct(iter);
+
+ (*pos)++;
+ if (!ct)
+ return NULL;
+
+ return iter;
+}
+
+static void allocinfo_stop(struct seq_file *m, void *arg)
+{
+ struct codetag_iterator *iter = (struct codetag_iterator *)m->private;
+
+ if (iter) {
+ codetag_lock_module_list(alloc_tag_cttype, false);
+ kfree(iter);
+ }
+}
+
+static void alloc_tag_to_text(struct seq_buf *out, struct codetag *ct)
+{
+ struct alloc_tag *tag = ct_to_alloc_tag(ct);
+ struct alloc_tag_counters counter = alloc_tag_read(tag);
+ s64 bytes = counter.bytes;
+ char val[10], *p = val;
+
+ if (bytes < 0) {
+ *p++ = '-';
+ bytes = -bytes;
+ }
+
+ string_get_size(bytes, 1,
+ STRING_SIZE_BASE2|STRING_SIZE_NOSPACE,
+ p, val + ARRAY_SIZE(val) - p);
+
+ seq_buf_printf(out, "%8s %8llu ", val, counter.calls);
+ codetag_to_text(out, ct);
+ seq_buf_putc(out, ' ');
+ seq_buf_putc(out, '\n');
+}
+
+static int allocinfo_show(struct seq_file *m, void *arg)
+{
+ struct codetag_iterator *iter = (struct codetag_iterator *)arg;
+ char *bufp;
+ size_t n = seq_get_buf(m, &bufp);
+ struct seq_buf buf;
+
+ seq_buf_init(&buf, bufp, n);
+ alloc_tag_to_text(&buf, iter->ct);
+ seq_commit(m, seq_buf_used(&buf));
+ return 0;
+}
+
+static const struct seq_operations allocinfo_seq_op = {
+ .start = allocinfo_start,
+ .next = allocinfo_next,
+ .stop = allocinfo_stop,
+ .show = allocinfo_show,
+};
+
+static void __init procfs_init(void)
+{
+ proc_create_seq("allocinfo", 0444, NULL, &allocinfo_seq_op);
+}
+
+static bool alloc_tag_module_unload(struct codetag_type *cttype,
+ struct codetag_module *cmod)
+{
+ struct codetag_iterator iter = codetag_get_ct_iter(cttype);
+ struct alloc_tag_counters counter;
+ bool module_unused = true;
+ struct alloc_tag *tag;
+ struct codetag *ct;
+
+ for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
+ if (iter.cmod != cmod)
+ continue;
+
+ tag = ct_to_alloc_tag(ct);
+ counter = alloc_tag_read(tag);
+
+ if (WARN(counter.bytes, "%s:%u module %s func:%s has %llu bytes allocated at module unload\n",
+ ct->filename, ct->lineno, ct->modname, ct->function, counter.bytes))
+ module_unused = false;
+ }
+
+ return module_unused;
+}
+
+static struct ctl_table memory_allocation_profiling_sysctls[] = {
+ {
+ .procname = "mem_profiling",
+ .data = &mem_alloc_profiling_key,
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ .mode = 0444,
+#else
+ .mode = 0644,
+#endif
+ .proc_handler = proc_do_static_key,
+ },
+ { }
+};
+
+static int __init alloc_tag_init(void)
+{
+ const struct codetag_type_desc desc = {
+ .section = "alloc_tags",
+ .tag_size = sizeof(struct alloc_tag),
+ .module_unload = alloc_tag_module_unload,
+ };
+
+ alloc_tag_cttype = codetag_register_type(&desc);
+ if (IS_ERR(alloc_tag_cttype))
+ return PTR_ERR(alloc_tag_cttype);
+
+ register_sysctl_init("vm", memory_allocation_profiling_sysctls);
+ procfs_init();
+
+ return 0;
+}
+module_init(alloc_tag_init);
diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index bf5bcf2836d8..45c67a0994f3 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S
@@ -9,6 +9,8 @@
#define DISCARD_EH_FRAME *(.eh_frame)
#endif
+#include <asm-generic/codetag.lds.h>
+
SECTIONS {
/DISCARD/ : {
*(.discard)
@@ -47,12 +49,17 @@ SECTIONS {
.data : {
*(.data .data.[0-9a-zA-Z_]*)
*(.data..L*)
+ CODETAG_SECTIONS()
}
.rodata : {
*(.rodata .rodata.[0-9a-zA-Z_]*)
*(.rodata..L*)
}
+#else
+ .data : {
+ CODETAG_SECTIONS()
+ }
#endif
}
--
2.43.0.687.g38aa6559b0-goog
Introduce helper functions to easily instrument page allocators by
storing a pointer to the allocation tag associated with the code that
allocated the page in a page_ext field.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
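Note for reviewers (not part of the commit): a userspace sketch of the
page_ext idea, where each page carries a tag reference at a fixed offset
inside its extension blob. All names (demo_tag, demo_page_ext, tag_ref_of,
demo_pgalloc_tag_add, demo_pgalloc_tag_sub) and the constants are
hypothetical. The offset is fixed at build time here, whereas in the patch
it comes from page_alloc_tagging_ops.offset, and pages are identified by a
plain index.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE	4096UL
#define DEMO_NR_PAGES	8

struct demo_tag {
	const char *name;
	unsigned long bytes;
	unsigned long calls;
};

/* Per-page extension blob: another user's data, then our tag reference. */
struct demo_page_ext {
	unsigned long other_user_data;
	struct demo_tag *tag_ref;
};

static struct demo_page_ext page_exts[DEMO_NR_PAGES];

/* Stand-in for the offset assigned to this page_ext client. */
static const size_t tag_ref_offset = offsetof(struct demo_page_ext, tag_ref);

/* Locate the tag reference of a page by adding the offset to its blob. */
struct demo_tag **tag_ref_of(unsigned int pfn)
{
	if (pfn >= DEMO_NR_PAGES)
		return NULL;
	return (struct demo_tag **)((char *)&page_exts[pfn] + tag_ref_offset);
}

/* Charge an order-sized allocation to the tag, recorded on the head page. */
void demo_pgalloc_tag_add(unsigned int pfn, struct demo_tag *tag,
			  unsigned int order)
{
	struct demo_tag **ref = tag_ref_of(pfn);

	if (!ref || !tag)
		return;
	*ref = tag;
	tag->bytes += DEMO_PAGE_SIZE << order;
	tag->calls++;
}

/* Undo the charge on free and clear the reference. */
void demo_pgalloc_tag_sub(unsigned int pfn, unsigned int order)
{
	struct demo_tag **ref = tag_ref_of(pfn);

	if (!ref || !*ref)
		return;
	(*ref)->bytes -= DEMO_PAGE_SIZE << order;
	(*ref)->calls--;
	*ref = NULL;
}
```

Because the reference is cleared on free, a second free of the same page is
a no-op in this model instead of driving the counters negative.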
include/linux/page_ext.h | 1 -
include/linux/pgalloc_tag.h | 73 +++++++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 1 +
lib/alloc_tag.c | 17 +++++++++
mm/mm_init.c | 1 +
mm/page_alloc.c | 4 ++
mm/page_ext.c | 4 ++
7 files changed, 100 insertions(+), 1 deletion(-)
create mode 100644 include/linux/pgalloc_tag.h
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index be98564191e6..07e0656898f9 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -4,7 +4,6 @@
#include <linux/types.h>
#include <linux/stacktrace.h>
-#include <linux/stackdepot.h>
struct pglist_data;
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
new file mode 100644
index 000000000000..a060c26eb449
--- /dev/null
+++ b/include/linux/pgalloc_tag.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * page allocation tagging
+ */
+#ifndef _LINUX_PGALLOC_TAG_H
+#define _LINUX_PGALLOC_TAG_H
+
+#include <linux/alloc_tag.h>
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
+#include <linux/page_ext.h>
+
+extern struct page_ext_operations page_alloc_tagging_ops;
+extern struct page_ext *page_ext_get(struct page *page);
+extern void page_ext_put(struct page_ext *page_ext);
+
+static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext)
+{
+ return (void *)page_ext + page_alloc_tagging_ops.offset;
+}
+
+static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref)
+{
+ return (void *)ref - page_alloc_tagging_ops.offset;
+}
+
+static inline union codetag_ref *get_page_tag_ref(struct page *page)
+{
+ if (page && mem_alloc_profiling_enabled()) {
+ struct page_ext *page_ext = page_ext_get(page);
+
+ if (page_ext)
+ return codetag_ref_from_page_ext(page_ext);
+ }
+ return NULL;
+}
+
+static inline void put_page_tag_ref(union codetag_ref *ref)
+{
+ page_ext_put(page_ext_from_codetag_ref(ref));
+}
+
+static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
+ unsigned int order)
+{
+ union codetag_ref *ref = get_page_tag_ref(page);
+
+ if (ref) {
+ alloc_tag_add(ref, task->alloc_tag, PAGE_SIZE << order);
+ put_page_tag_ref(ref);
+ }
+}
+
+static inline void pgalloc_tag_sub(struct page *page, unsigned int order)
+{
+ union codetag_ref *ref = get_page_tag_ref(page);
+
+ if (ref) {
+ alloc_tag_sub(ref, PAGE_SIZE << order);
+ put_page_tag_ref(ref);
+ }
+}
+
+#else /* CONFIG_MEM_ALLOC_PROFILING */
+
+static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
+ unsigned int order) {}
+static inline void pgalloc_tag_sub(struct page *page, unsigned int order) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING */
+
+#endif /* _LINUX_PGALLOC_TAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 78d258ca508f..7bbdb0ddb011 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -978,6 +978,7 @@ config MEM_ALLOC_PROFILING
depends on PROC_FS
depends on !DEBUG_FORCE_WEAK_PER_CPU
select CODE_TAGGING
+ select PAGE_EXTENSION
help
Track allocation source code and record total allocation size
initiated at that code location. The mechanism can be used to track
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 4fc031f9cefd..2d5226d9262d 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -3,6 +3,7 @@
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/module.h>
+#include <linux/page_ext.h>
#include <linux/proc_fs.h>
#include <linux/seq_buf.h>
#include <linux/seq_file.h>
@@ -124,6 +125,22 @@ static bool alloc_tag_module_unload(struct codetag_type *cttype,
return module_unused;
}
+static __init bool need_page_alloc_tagging(void)
+{
+ return true;
+}
+
+static __init void init_page_alloc_tagging(void)
+{
+}
+
+struct page_ext_operations page_alloc_tagging_ops = {
+ .size = sizeof(union codetag_ref),
+ .need = need_page_alloc_tagging,
+ .init = init_page_alloc_tagging,
+};
+EXPORT_SYMBOL(page_alloc_tagging_ops);
+
static struct ctl_table memory_allocation_profiling_sysctls[] = {
{
.procname = "mem_profiling",
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2c19f5515e36..e9ea2919d02d 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -24,6 +24,7 @@
#include <linux/page_ext.h>
#include <linux/pti.h>
#include <linux/pgtable.h>
+#include <linux/stackdepot.h>
#include <linux/swap.h>
#include <linux/cma.h>
#include <linux/crash_dump.h>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 150d4f23b010..edb79a55a252 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -53,6 +53,7 @@
#include <linux/khugepaged.h>
#include <linux/delayacct.h>
#include <linux/cacheinfo.h>
+#include <linux/pgalloc_tag.h>
#include <asm/div64.h>
#include "internal.h"
#include "shuffle.h"
@@ -1100,6 +1101,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
/* Do not let hwpoison pages hit pcplists/buddy */
reset_page_owner(page, order);
page_table_check_free(page, order);
+ pgalloc_tag_sub(page, order);
return false;
}
@@ -1139,6 +1141,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
page_table_check_free(page, order);
+ pgalloc_tag_sub(page, order);
if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),
@@ -1532,6 +1535,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
+ pgalloc_tag_add(page, current, order);
}
static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 4548fcc66d74..3c58fe8a24df 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -10,6 +10,7 @@
#include <linux/page_idle.h>
#include <linux/page_table_check.h>
#include <linux/rcupdate.h>
+#include <linux/pgalloc_tag.h>
/*
* struct page extension
@@ -82,6 +83,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
#if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ &page_alloc_tagging_ops,
+#endif
#ifdef CONFIG_PAGE_TABLE_CHECK
&page_table_check_ops,
#endif
--
2.43.0.687.g38aa6559b0-goog
Redefine page allocators to record allocation tags upon their invocation.
Instrument post_alloc_hook and free_pages_prepare to modify current
allocation tag.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
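Note for reviewers (not part of the commit): the save/call/restore pattern
behind alloc_hooks() and the _noprof naming can be sketched in userspace as
below. It relies on GNU C statement expressions and __thread. All demo_*
names, helper_noprof, helper and last_charged are hypothetical;
current_tag stands in for current->alloc_tag, and the counters are plain
fields rather than per-CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_tag {
	const char *file;
	int line;
	size_t bytes;
	size_t calls;
};

/* Stand-in for current->alloc_tag. */
static __thread struct demo_tag *current_tag;
/* Test aid: the tag charged by the most recent allocation. */
static __thread struct demo_tag *last_charged;

struct demo_tag *demo_tag_save(struct demo_tag *tag)
{
	struct demo_tag *old = current_tag;

	current_tag = tag;
	return old;
}

void demo_tag_restore(struct demo_tag *old)
{
	current_tag = old;
}

/* The uninstrumented allocator charges whichever tag is current. */
void *demo_alloc_noprof(size_t size)
{
	void *p = malloc(size);

	if (p && current_tag) {
		current_tag->bytes += size;
		current_tag->calls++;
		last_charged = current_tag;
	}
	return p;
}

/* Define a tag for this callsite, make it current around the call. */
#define demo_alloc_hooks(_do_alloc)					\
({									\
	static struct demo_tag _tag = { __FILE__, __LINE__, 0, 0 };	\
	struct demo_tag *_old = demo_tag_save(&_tag);			\
	__typeof__(_do_alloc) _res = _do_alloc;				\
	demo_tag_restore(_old);						\
	_res;								\
})

#define demo_alloc(...) demo_alloc_hooks(demo_alloc_noprof(__VA_ARGS__))

/*
 * A helper that wants its callers charged, not itself, calls the _noprof
 * variant internally and is wrapped by its own hook macro.
 */
void *helper_noprof(size_t n)
{
	return demo_alloc_noprof(2 * n);
}
#define helper(...) demo_alloc_hooks(helper_noprof(__VA_ARGS__))
```

Which function in a call chain gets the macro wrapper decides where the
allocation is accounted, which is why the series keeps separate accounted
and _noprof versions of each allocator.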
include/linux/alloc_tag.h | 10 +++
include/linux/gfp.h | 126 ++++++++++++++++++++++++--------------
include/linux/pagemap.h | 9 ++-
mm/compaction.c | 7 ++-
mm/filemap.c | 6 +-
mm/mempolicy.c | 52 ++++++++--------
mm/page_alloc.c | 60 +++++++++---------
7 files changed, 160 insertions(+), 110 deletions(-)
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index cf55a149fa84..6fa8a94d8bc1 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -130,4 +130,14 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
#endif
+#define alloc_hooks(_do_alloc) \
+({ \
+ typeof(_do_alloc) _res; \
+ DEFINE_ALLOC_TAG(_alloc_tag, _old); \
+ \
+ _res = _do_alloc; \
+ alloc_tag_restore(&_alloc_tag, _old); \
+ _res; \
+})
+
#endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index de292a007138..bc0fd5259b0b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -6,6 +6,8 @@
#include <linux/mmzone.h>
#include <linux/topology.h>
+#include <linux/alloc_tag.h>
+#include <linux/sched.h>
struct vm_area_struct;
struct mempolicy;
@@ -175,42 +177,46 @@ static inline void arch_free_page(struct page *page, int order) { }
static inline void arch_alloc_page(struct page *page, int order) { }
#endif
-struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
+struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
-struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+#define __alloc_pages(...) alloc_hooks(__alloc_pages_noprof(__VA_ARGS__))
+
+struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
+#define __folio_alloc(...) alloc_hooks(__folio_alloc_noprof(__VA_ARGS__))
-unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
struct page **page_array);
+#define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
-unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+unsigned long alloc_pages_bulk_array_mempolicy_noprof(gfp_t gfp,
unsigned long nr_pages,
struct page **page_array);
+#define alloc_pages_bulk_array_mempolicy(...) \
+ alloc_hooks(alloc_pages_bulk_array_mempolicy_noprof(__VA_ARGS__))
/* Bulk allocate order-0 pages */
-static inline unsigned long
-alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
-{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL);
-}
+#define alloc_pages_bulk_list(_gfp, _nr_pages, _list) \
+ __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _list, NULL)
-static inline unsigned long
-alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array)
-{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array);
-}
+#define alloc_pages_bulk_array(_gfp, _nr_pages, _page_array) \
+ __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, NULL, _page_array)
static inline unsigned long
-alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
+alloc_pages_bulk_array_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages,
+ struct page **page_array)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();
- return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
+ return alloc_pages_bulk_noprof(gfp, nid, NULL, nr_pages, NULL, page_array);
}
+#define alloc_pages_bulk_array_node(...) \
+ alloc_hooks(alloc_pages_bulk_array_node_noprof(__VA_ARGS__))
+
static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
{
gfp_t warn_gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN);
@@ -230,82 +236,104 @@ static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
* online. For more general interface, see alloc_pages_node().
*/
static inline struct page *
-__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
+__alloc_pages_node_noprof(int nid, gfp_t gfp_mask, unsigned int order)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp_mask);
- return __alloc_pages(gfp_mask, order, nid, NULL);
+ return __alloc_pages_noprof(gfp_mask, order, nid, NULL);
}
+#define __alloc_pages_node(...) alloc_hooks(__alloc_pages_node_noprof(__VA_ARGS__))
+
static inline
-struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
+struct folio *__folio_alloc_node_noprof(gfp_t gfp, unsigned int order, int nid)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
warn_if_node_offline(nid, gfp);
- return __folio_alloc(gfp, order, nid, NULL);
+ return __folio_alloc_noprof(gfp, order, nid, NULL);
}
+#define __folio_alloc_node(...) alloc_hooks(__folio_alloc_node_noprof(__VA_ARGS__))
+
/*
* Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
* prefer the current CPU's closest node. Otherwise node must be valid and
* online.
*/
-static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
- unsigned int order)
+static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
+ unsigned int order)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();
- return __alloc_pages_node(nid, gfp_mask, order);
+ return __alloc_pages_node_noprof(nid, gfp_mask, order);
}
+#define alloc_pages_node(...) alloc_hooks(alloc_pages_node_noprof(__VA_ARGS__))
+
#ifdef CONFIG_NUMA
-struct page *alloc_pages(gfp_t gfp, unsigned int order);
-struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
+struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order);
+struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *mpol, pgoff_t ilx, int nid);
-struct folio *folio_alloc(gfp_t gfp, unsigned int order);
-struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order);
+struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage);
#else
-static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
+static inline struct page *alloc_pages_noprof(gfp_t gfp_mask, unsigned int order)
{
- return alloc_pages_node(numa_node_id(), gfp_mask, order);
+ return alloc_pages_node_noprof(numa_node_id(), gfp_mask, order);
}
-static inline struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
+static inline struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *mpol, pgoff_t ilx, int nid)
{
- return alloc_pages(gfp, order);
+ return alloc_pages_noprof(gfp, order);
}
-static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
+static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
{
return __folio_alloc_node(gfp, order, numa_node_id());
}
-#define vma_alloc_folio(gfp, order, vma, addr, hugepage) \
- folio_alloc(gfp, order)
+#define vma_alloc_folio_noprof(gfp, order, vma, addr, hugepage) \
+ folio_alloc_noprof(gfp, order)
#endif
+
+#define alloc_pages(...) alloc_hooks(alloc_pages_noprof(__VA_ARGS__))
+#define alloc_pages_mpol(...) alloc_hooks(alloc_pages_mpol_noprof(__VA_ARGS__))
+#define folio_alloc(...) alloc_hooks(folio_alloc_noprof(__VA_ARGS__))
+#define vma_alloc_folio(...) alloc_hooks(vma_alloc_folio_noprof(__VA_ARGS__))
+
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
-static inline struct page *alloc_page_vma(gfp_t gfp,
+
+static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
struct vm_area_struct *vma, unsigned long addr)
{
- struct folio *folio = vma_alloc_folio(gfp, 0, vma, addr, false);
+ struct folio *folio = vma_alloc_folio_noprof(gfp, 0, vma, addr, false);
return &folio->page;
}
+#define alloc_page_vma(...) alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
+
+extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
+#define __get_free_pages(...) alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
-extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
-extern unsigned long get_zeroed_page(gfp_t gfp_mask);
+extern unsigned long get_zeroed_page_noprof(gfp_t gfp_mask);
+#define get_zeroed_page(...) alloc_hooks(get_zeroed_page_noprof(__VA_ARGS__))
+
+void *alloc_pages_exact_noprof(size_t size, gfp_t gfp_mask) __alloc_size(1);
+#define alloc_pages_exact(...) alloc_hooks(alloc_pages_exact_noprof(__VA_ARGS__))
-void *alloc_pages_exact(size_t size, gfp_t gfp_mask) __alloc_size(1);
void free_pages_exact(void *virt, size_t size);
-__meminit void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);
-#define __get_free_page(gfp_mask) \
- __get_free_pages((gfp_mask), 0)
+__meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);
+#define alloc_pages_exact_nid(...) \
+ alloc_hooks(alloc_pages_exact_nid_noprof(__VA_ARGS__))
+
+#define __get_free_page(gfp_mask) \
+ __get_free_pages((gfp_mask), 0)
-#define __get_dma_pages(gfp_mask, order) \
- __get_free_pages((gfp_mask) | GFP_DMA, (order))
+#define __get_dma_pages(gfp_mask, order) \
+ __get_free_pages((gfp_mask) | GFP_DMA, (order))
extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
@@ -357,10 +385,14 @@ extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma);
#ifdef CONFIG_CONTIG_ALLOC
/* The below functions must be run on a range from a single zone. */
-extern int alloc_contig_range(unsigned long start, unsigned long end,
+extern int alloc_contig_range_noprof(unsigned long start, unsigned long end,
unsigned migratetype, gfp_t gfp_mask);
-extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
- int nid, nodemask_t *nodemask);
+#define alloc_contig_range(...) alloc_hooks(alloc_contig_range_noprof(__VA_ARGS__))
+
+extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask);
+#define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
+
#endif
void free_contig_range(unsigned long pfn, unsigned long nr_pages);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d..35636e67e2e1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -542,14 +542,17 @@ static inline void *detach_page_private(struct page *page)
#endif
#ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
+struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
#else
-static inline struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
+static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
{
- return folio_alloc(gfp, order);
+ return folio_alloc_noprof(gfp, order);
}
#endif
+#define filemap_alloc_folio(...) \
+ alloc_hooks(filemap_alloc_folio_noprof(__VA_ARGS__))
+
static inline struct page *__page_cache_alloc(gfp_t gfp)
{
return &filemap_alloc_folio(gfp, 0)->page;
diff --git a/mm/compaction.c b/mm/compaction.c
index 4add68d40e8d..f4c0e682c979 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1781,7 +1781,7 @@ static void isolate_freepages(struct compact_control *cc)
* This is a migrate-callback that "allocates" freepages by taking pages
* from the isolated freelists in the block we are migrating to.
*/
-static struct folio *compaction_alloc(struct folio *src, unsigned long data)
+static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long data)
{
struct compact_control *cc = (struct compact_control *)data;
struct folio *dst;
@@ -1800,6 +1800,11 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
return dst;
}
+static struct folio *compaction_alloc(struct folio *src, unsigned long data)
+{
+ return alloc_hooks(compaction_alloc_noprof(src, data));
+}
+
/*
* This is a migrate-callback that "frees" freepages back to the isolated
* freelist. All pages on the freelist are from the same zone, so there is no
diff --git a/mm/filemap.c b/mm/filemap.c
index 750e779c23db..e51e474545ad 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -957,7 +957,7 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
EXPORT_SYMBOL_GPL(filemap_add_folio);
#ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
+struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
{
int n;
struct folio *folio;
@@ -972,9 +972,9 @@ struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
return folio;
}
- return folio_alloc(gfp, order);
+ return folio_alloc_noprof(gfp, order);
}
-EXPORT_SYMBOL(filemap_alloc_folio);
+EXPORT_SYMBOL(filemap_alloc_folio_noprof);
#endif
/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..c329d00b975f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2070,15 +2070,15 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*/
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- page = __alloc_pages(preferred_gfp, order, nid, nodemask);
+ page = __alloc_pages_noprof(preferred_gfp, order, nid, nodemask);
if (!page)
- page = __alloc_pages(gfp, order, nid, NULL);
+ page = __alloc_pages_noprof(gfp, order, nid, NULL);
return page;
}
/**
- * alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
+ * alloc_pages_mpol_noprof - Allocate pages according to NUMA mempolicy.
* @gfp: GFP flags.
* @order: Order of the page allocation.
* @pol: Pointer to the NUMA mempolicy.
@@ -2087,7 +2087,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*
* Return: The page on success or NULL if allocation fails.
*/
-struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
+struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *pol, pgoff_t ilx, int nid)
{
nodemask_t *nodemask;
@@ -2117,7 +2117,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
* First, try to allocate THP only on local node, but
* don't reclaim unnecessarily, just compact.
*/
- page = __alloc_pages_node(nid,
+ page = __alloc_pages_node_noprof(nid,
gfp | __GFP_THISNODE | __GFP_NORETRY, order);
if (page || !(gfp & __GFP_DIRECT_RECLAIM))
return page;
@@ -2130,7 +2130,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
}
}
- page = __alloc_pages(gfp, order, nid, nodemask);
+ page = __alloc_pages_noprof(gfp, order, nid, nodemask);
if (unlikely(pol->mode == MPOL_INTERLEAVE) && page) {
/* skip NUMA_INTERLEAVE_HIT update if numa stats is disabled */
@@ -2146,7 +2146,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
}
/**
- * vma_alloc_folio - Allocate a folio for a VMA.
+ * vma_alloc_folio_noprof - Allocate a folio for a VMA.
* @gfp: GFP flags.
* @order: Order of the folio.
* @vma: Pointer to VMA.
@@ -2161,7 +2161,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
*
* Return: The folio on success or NULL if allocation fails.
*/
-struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
+struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage)
{
struct mempolicy *pol;
@@ -2169,15 +2169,15 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
struct page *page;
pol = get_vma_policy(vma, addr, order, &ilx);
- page = alloc_pages_mpol(gfp | __GFP_COMP, order,
- pol, ilx, numa_node_id());
+ page = alloc_pages_mpol_noprof(gfp | __GFP_COMP, order,
+ pol, ilx, numa_node_id());
mpol_cond_put(pol);
return page_rmappable_folio(page);
}
-EXPORT_SYMBOL(vma_alloc_folio);
+EXPORT_SYMBOL(vma_alloc_folio_noprof);
/**
- * alloc_pages - Allocate pages.
+ * alloc_pages_noprof - Allocate pages.
* @gfp: GFP flags.
* @order: Power of two of number of pages to allocate.
*
@@ -2190,7 +2190,7 @@ EXPORT_SYMBOL(vma_alloc_folio);
* flags are used.
* Return: The page on success or NULL if allocation fails.
*/
-struct page *alloc_pages(gfp_t gfp, unsigned int order)
+struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order)
{
struct mempolicy *pol = &default_policy;
@@ -2201,16 +2201,16 @@ struct page *alloc_pages(gfp_t gfp, unsigned int order)
if (!in_interrupt() && !(gfp & __GFP_THISNODE))
pol = get_task_policy(current);
- return alloc_pages_mpol(gfp, order,
- pol, NO_INTERLEAVE_INDEX, numa_node_id());
+ return alloc_pages_mpol_noprof(gfp, order, pol, NO_INTERLEAVE_INDEX,
+ numa_node_id());
}
-EXPORT_SYMBOL(alloc_pages);
+EXPORT_SYMBOL(alloc_pages_noprof);
-struct folio *folio_alloc(gfp_t gfp, unsigned int order)
+struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
{
- return page_rmappable_folio(alloc_pages(gfp | __GFP_COMP, order));
+ return page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
}
-EXPORT_SYMBOL(folio_alloc);
+EXPORT_SYMBOL(folio_alloc_noprof);
static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
struct mempolicy *pol, unsigned long nr_pages,
@@ -2229,13 +2229,13 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
for (i = 0; i < nodes; i++) {
if (delta) {
- nr_allocated = __alloc_pages_bulk(gfp,
+ nr_allocated = alloc_pages_bulk_noprof(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node + 1, NULL,
page_array);
delta--;
} else {
- nr_allocated = __alloc_pages_bulk(gfp,
+ nr_allocated = alloc_pages_bulk_noprof(gfp,
interleave_nodes(pol), NULL,
nr_pages_per_node, NULL, page_array);
}
@@ -2257,11 +2257,11 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- nr_allocated = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
+ nr_allocated = alloc_pages_bulk_noprof(preferred_gfp, nid, &pol->nodes,
nr_pages, NULL, page_array);
if (nr_allocated < nr_pages)
- nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL,
+ nr_allocated += alloc_pages_bulk_noprof(gfp, numa_node_id(), NULL,
nr_pages - nr_allocated, NULL,
page_array + nr_allocated);
return nr_allocated;
@@ -2273,7 +2273,7 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
* It can accelerate memory allocation especially interleaving
* allocate memory.
*/
-unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+unsigned long alloc_pages_bulk_array_mempolicy_noprof(gfp_t gfp,
unsigned long nr_pages, struct page **page_array)
{
struct mempolicy *pol = &default_policy;
@@ -2293,8 +2293,8 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
nid = numa_node_id();
nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid);
- return __alloc_pages_bulk(gfp, nid, nodemask,
- nr_pages, NULL, page_array);
+ return alloc_pages_bulk_noprof(gfp, nid, nodemask,
+ nr_pages, NULL, page_array);
}
int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index edb79a55a252..58c0e8b948a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4380,7 +4380,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
*
* Returns the number of pages on the list or array.
*/
-unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
struct page **page_array)
@@ -4516,7 +4516,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
pcp_trylock_finish(UP_flags);
failed:
- page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
+ page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
if (page) {
if (page_list)
list_add(&page->lru, page_list);
@@ -4527,13 +4527,13 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto out;
}
-EXPORT_SYMBOL_GPL(__alloc_pages_bulk);
+EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
/*
* This is the 'heart' of the zoned buddy allocator.
*/
-struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
- nodemask_t *nodemask)
+struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
+ int preferred_nid, nodemask_t *nodemask)
{
struct page *page;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
@@ -4595,38 +4595,38 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
return page;
}
-EXPORT_SYMBOL(__alloc_pages);
+EXPORT_SYMBOL(__alloc_pages_noprof);
-struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask)
{
- struct page *page = __alloc_pages(gfp | __GFP_COMP, order,
+ struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
preferred_nid, nodemask);
return page_rmappable_folio(page);
}
-EXPORT_SYMBOL(__folio_alloc);
+EXPORT_SYMBOL(__folio_alloc_noprof);
/*
* Common helper functions. Never use with __GFP_HIGHMEM because the returned
* address cannot represent highmem pages. Use alloc_pages and then kmap if
* you need to access high mem.
*/
-unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
+unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order)
{
struct page *page;
- page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
+ page = alloc_pages_noprof(gfp_mask & ~__GFP_HIGHMEM, order);
if (!page)
return 0;
return (unsigned long) page_address(page);
}
-EXPORT_SYMBOL(__get_free_pages);
+EXPORT_SYMBOL(get_free_pages_noprof);
-unsigned long get_zeroed_page(gfp_t gfp_mask)
+unsigned long get_zeroed_page_noprof(gfp_t gfp_mask)
{
- return __get_free_page(gfp_mask | __GFP_ZERO);
+ return get_free_pages_noprof(gfp_mask | __GFP_ZERO, 0);
}
-EXPORT_SYMBOL(get_zeroed_page);
+EXPORT_SYMBOL(get_zeroed_page_noprof);
/**
* __free_pages - Free pages allocated with alloc_pages().
@@ -4818,7 +4818,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
}
/**
- * alloc_pages_exact - allocate an exact number physically-contiguous pages.
+ * alloc_pages_exact_noprof - allocate an exact number physically-contiguous pages.
* @size: the number of bytes to allocate
* @gfp_mask: GFP flags for the allocation, must not contain __GFP_COMP
*
@@ -4832,7 +4832,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
*
* Return: pointer to the allocated area or %NULL in case of error.
*/
-void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
+void *alloc_pages_exact_noprof(size_t size, gfp_t gfp_mask)
{
unsigned int order = get_order(size);
unsigned long addr;
@@ -4840,13 +4840,13 @@ void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);
- addr = __get_free_pages(gfp_mask, order);
+ addr = get_free_pages_noprof(gfp_mask, order);
return make_alloc_exact(addr, order, size);
}
-EXPORT_SYMBOL(alloc_pages_exact);
+EXPORT_SYMBOL(alloc_pages_exact_noprof);
/**
- * alloc_pages_exact_nid - allocate an exact number of physically-contiguous
+ * alloc_pages_exact_nid_noprof - allocate an exact number of physically-contiguous
* pages on a node.
* @nid: the preferred node ID where memory should be allocated
* @size: the number of bytes to allocate
@@ -4857,7 +4857,7 @@ EXPORT_SYMBOL(alloc_pages_exact);
*
* Return: pointer to the allocated area or %NULL in case of error.
*/
-void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
+void * __meminit alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mask)
{
unsigned int order = get_order(size);
struct page *p;
@@ -4865,7 +4865,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);
- p = alloc_pages_node(nid, gfp_mask, order);
+ p = alloc_pages_node_noprof(nid, gfp_mask, order);
if (!p)
return NULL;
return make_alloc_exact((unsigned long)page_address(p), order, size);
@@ -6283,7 +6283,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
}
/**
- * alloc_contig_range() -- tries to allocate given range of pages
+ * alloc_contig_range_noprof() -- tries to allocate given range of pages
* @start: start PFN to allocate
* @end: one-past-the-last PFN to allocate
* @migratetype: migratetype of the underlying pageblocks (either
@@ -6303,7 +6303,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
* pages which PFN is in [start, end) are allocated for the caller and
* need to be freed with free_contig_range().
*/
-int alloc_contig_range(unsigned long start, unsigned long end,
+int alloc_contig_range_noprof(unsigned long start, unsigned long end,
unsigned migratetype, gfp_t gfp_mask)
{
unsigned long outer_start, outer_end;
@@ -6427,15 +6427,15 @@ int alloc_contig_range(unsigned long start, unsigned long end,
undo_isolate_page_range(start, end, migratetype);
return ret;
}
-EXPORT_SYMBOL(alloc_contig_range);
+EXPORT_SYMBOL(alloc_contig_range_noprof);
static int __alloc_contig_pages(unsigned long start_pfn,
unsigned long nr_pages, gfp_t gfp_mask)
{
unsigned long end_pfn = start_pfn + nr_pages;
- return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
- gfp_mask);
+ return alloc_contig_range_noprof(start_pfn, end_pfn, MIGRATE_MOVABLE,
+ gfp_mask);
}
static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
@@ -6470,7 +6470,7 @@ static bool zone_spans_last_pfn(const struct zone *zone,
}
/**
- * alloc_contig_pages() -- tries to find and allocate contiguous range of pages
+ * alloc_contig_pages_noprof() -- tries to find and allocate contiguous range of pages
* @nr_pages: Number of contiguous pages to allocate
* @gfp_mask: GFP mask to limit search and used during compaction
* @nid: Target node
@@ -6490,8 +6490,8 @@ static bool zone_spans_last_pfn(const struct zone *zone,
*
* Return: pointer to contiguous pages on success, or NULL if not successful.
*/
-struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
- int nid, nodemask_t *nodemask)
+struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask)
{
unsigned long ret, pfn, flags;
struct zonelist *zonelist;
--
2.43.0.687.g38aa6559b0-goog
Account slab allocations using the codetag reference embedded in slabobj_ext.
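
A rough userspace sketch of the bookkeeping, for illustration only: each
object slot in a slab carries a reference to the tag that allocated it, so
the free path can find the tag again from the object pointer alone. All
demo_ names, the fixed slab geometry and the 64-byte object size are
assumptions made for the example; the real types are in mm/slab.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct alloc_tag and struct slabobj_ext. */
struct demo_tag { size_t bytes; };
struct demo_obj_ext { struct demo_tag *ref; };

#define DEMO_OBJS_PER_SLAB 8
#define DEMO_OBJ_SIZE 64

struct demo_slab {
	size_t obj_size;
	unsigned char mem[DEMO_OBJS_PER_SLAB * DEMO_OBJ_SIZE];
	struct demo_obj_ext exts[DEMO_OBJS_PER_SLAB];	/* one slot per object */
};

/* Analogue of obj_to_index(): map an object pointer to its slot. */
static unsigned int demo_obj_to_index(struct demo_slab *slab, void *obj)
{
	return (unsigned int)(((unsigned char *)obj - slab->mem) / slab->obj_size);
}

/* Allocation side: store the tag in the object's slot and charge it. */
static void demo_slab_alloc_hook(struct demo_slab *slab, void *obj,
				 struct demo_tag *tag)
{
	struct demo_obj_ext *ext = &slab->exts[demo_obj_to_index(slab, obj)];

	ext->ref = tag;
	tag->bytes += slab->obj_size;
}

/* Free side: find the tag through the slot and uncharge it. */
static void demo_slab_free_hook(struct demo_slab *slab, void **p, int objects)
{
	for (int i = 0; i < objects; i++) {
		struct demo_obj_ext *ext = &slab->exts[demo_obj_to_index(slab, p[i])];

		if (!ext->ref)
			continue;	/* object was never tagged */
		ext->ref->bytes -= slab->obj_size;
		ext->ref = NULL;
	}
}
```

As in the hunks below, both sides charge the full object size of the cache
rather than the size the caller requested.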
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
mm/slab.h | 26 ++++++++++++++++++++++++++
mm/slub.c | 5 +++++
2 files changed, 31 insertions(+)
diff --git a/mm/slab.h b/mm/slab.h
index 224a4b2305fb..c4bd0d5348cb 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -629,6 +629,32 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
#endif /* CONFIG_SLAB_OBJ_EXT */
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
+static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
+ void **p, int objects)
+{
+ struct slabobj_ext *obj_exts;
+ int i;
+
+ obj_exts = slab_obj_exts(slab);
+ if (!obj_exts)
+ return;
+
+ for (i = 0; i < objects; i++) {
+ unsigned int off = obj_to_index(s, slab, p[i]);
+
+ alloc_tag_sub(&obj_exts[off].ref, s->size);
+ }
+}
+
+#else
+
+static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
+ void **p, int objects) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING */
+
#ifdef CONFIG_MEMCG_KMEM
void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
enum node_stat_item idx, int nr);
diff --git a/mm/slub.c b/mm/slub.c
index 9fd96238ed39..f4d5794c1e86 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3821,6 +3821,11 @@ void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg,
s->flags, init_flags);
kmsan_slab_alloc(s, p[i], init_flags);
obj_exts = prepare_slab_obj_exts_hook(s, flags, p[i]);
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ /* obj_exts can be allocated for other reasons */
+ if (likely(obj_exts) && mem_alloc_profiling_enabled())
+ alloc_tag_add(&obj_exts->ref, current->alloc_tag, s->size);
+#endif
}
memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
--
2.43.0.687.g38aa6559b0-goog
As each allocation tag generates a per-cpu variable, more space is required
to store them. Increase PERCPU_MODULE_RESERVE to provide enough area. A
better long-term solution would be to allocate this memory dynamically.
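
For reference, the arithmetic behind the two reserve sizes can be checked
with a few lines of C. The DEMO_ names are illustrative, and the per-tag
footprint passed to demo_tags_that_fit() is a caller-supplied assumption,
not a figure taken from this patch.

```c
#include <assert.h>

/* Values copied from the percpu.h hunk; the DEMO_ names are illustrative. */
#define DEMO_RESERVE_DEFAULT	(8 << 10)	/* 8192 bytes, 8 KiB */
#define DEMO_RESERVE_PROFILING	(8 << 12)	/* 32768 bytes, 32 KiB */

/* How many times larger the reserve is with profiling enabled. */
static int demo_reserve_growth(void)
{
	return DEMO_RESERVE_PROFILING / DEMO_RESERVE_DEFAULT;
}

/*
 * Upper bound on the number of tags a reserve can hold, given an assumed
 * per-cpu footprint per tag. Ignores other per-cpu users in modules.
 */
static int demo_tags_that_fit(int reserve_bytes, int per_tag_bytes)
{
	return reserve_bytes / per_tag_bytes;
}
```

If each tag needed, say, 16 bytes of per-cpu counters (an assumption), the
larger reserve would hold at most 2048 module tags versus 512 before.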
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
include/linux/percpu.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 8c677f185901..62b5eb45bd89 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -14,7 +14,11 @@
/* enough to cover all DEFINE_PER_CPUs in modules */
#ifdef CONFIG_MODULES
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+#define PERCPU_MODULE_RESERVE (8 << 12)
+#else
#define PERCPU_MODULE_RESERVE (8 << 10)
+#endif
#else
#define PERCPU_MODULE_RESERVE 0
#endif
--
2.43.0.687.g38aa6559b0-goog
From: Kent Overstreet <[email protected]>
It seems we need to be more forceful with the compiler on this one.
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index 9ea03d6e9c9d..4d480784942e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2124,7 +2124,7 @@ bool slab_free_hook(struct kmem_cache *s, void *x, bool init)
return !kasan_slab_free(s, x, init);
}
-static inline bool slab_free_freelist_hook(struct kmem_cache *s,
+static __always_inline bool slab_free_freelist_hook(struct kmem_cache *s,
void **head, void **tail,
int *cnt)
{
--
2.43.0.687.g38aa6559b0-goog
From: Kent Overstreet <[email protected]>
This adds hooks to mempools so that mempool-backed allocations are
attributed to the source line of the caller and show up correctly in
/sys/kernel/debug/allocations.
Various inline functions are converted to wrapper macros so that
alloc_hooks() is invoked at the caller's location and in fewer places.
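
The reason for preferring macros over inline functions here can be shown
with a small userspace sketch. The demo_ names are hypothetical, and
__LINE__ stands in for the static per-callsite tag that alloc_hooks()
defines; this is an illustration, not the mempool code.

```c
#include <assert.h>

/* Hypothetical call-site record; the kernel uses a static struct alloc_tag. */
struct demo_site { int line; };

/* Stand-in for an entry point wrapped in alloc_hooks(): records where it expands. */
#define demo_pool_init(_site) ((_site)->line = __LINE__)

/* Inline wrapper: the hook expands here, so every caller shares this line. */
static inline void demo_init_slab_pool_inline(struct demo_site *site)
{
	demo_pool_init(site);
}

/* Macro wrapper: the hook expands in the caller, one line per call site. */
#define demo_init_slab_pool_macro(_site) demo_pool_init(_site)
```

With the inline version, every user of the helper would be reported under
the header file's line; with the macro version, each user gets its own entry.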
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mempool.h | 73 ++++++++++++++++++++---------------------
mm/mempool.c | 36 ++++++++------------
2 files changed, 49 insertions(+), 60 deletions(-)
diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 7be1e32e6d42..69e65ca515ee 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -5,6 +5,8 @@
#ifndef _LINUX_MEMPOOL_H
#define _LINUX_MEMPOOL_H
+#include <linux/sched.h>
+#include <linux/alloc_tag.h>
#include <linux/wait.h>
#include <linux/compiler.h>
@@ -39,18 +41,32 @@ void mempool_exit(mempool_t *pool);
int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int node_id);
-int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+
+int mempool_init_noprof(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data);
+#define mempool_init(...) \
+ alloc_hooks(mempool_init_noprof(__VA_ARGS__))
extern mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data);
-extern mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
+
+extern mempool_t *mempool_create_node_noprof(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int nid);
+#define mempool_create_node(...) \
+ alloc_hooks(mempool_create_node_noprof(__VA_ARGS__))
+
+#define mempool_create(_min_nr, _alloc_fn, _free_fn, _pool_data) \
+ mempool_create_node(_min_nr, _alloc_fn, _free_fn, _pool_data, \
+ GFP_KERNEL, NUMA_NO_NODE)
extern int mempool_resize(mempool_t *pool, int new_min_nr);
extern void mempool_destroy(mempool_t *pool);
-extern void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) __malloc;
+
+extern void *mempool_alloc_noprof(mempool_t *pool, gfp_t gfp_mask) __malloc;
+#define mempool_alloc(...) \
+ alloc_hooks(mempool_alloc_noprof(__VA_ARGS__))
+
extern void *mempool_alloc_preallocated(mempool_t *pool) __malloc;
extern void mempool_free(void *element, mempool_t *pool);
@@ -62,19 +78,10 @@ extern void mempool_free(void *element, mempool_t *pool);
void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data);
void mempool_free_slab(void *element, void *pool_data);
-static inline int
-mempool_init_slab_pool(mempool_t *pool, int min_nr, struct kmem_cache *kc)
-{
- return mempool_init(pool, min_nr, mempool_alloc_slab,
- mempool_free_slab, (void *) kc);
-}
-
-static inline mempool_t *
-mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)
-{
- return mempool_create(min_nr, mempool_alloc_slab, mempool_free_slab,
- (void *) kc);
-}
+#define mempool_init_slab_pool(_pool, _min_nr, _kc) \
+ mempool_init(_pool, (_min_nr), mempool_alloc_slab, mempool_free_slab, (void *)(_kc))
+#define mempool_create_slab_pool(_min_nr, _kc) \
+ mempool_create((_min_nr), mempool_alloc_slab, mempool_free_slab, (void *)(_kc))
/*
* a mempool_alloc_t and a mempool_free_t to kmalloc and kfree the
@@ -83,17 +90,12 @@ mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)
void *mempool_kmalloc(gfp_t gfp_mask, void *pool_data);
void mempool_kfree(void *element, void *pool_data);
-static inline int mempool_init_kmalloc_pool(mempool_t *pool, int min_nr, size_t size)
-{
- return mempool_init(pool, min_nr, mempool_kmalloc,
- mempool_kfree, (void *) size);
-}
-
-static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)
-{
- return mempool_create(min_nr, mempool_kmalloc, mempool_kfree,
- (void *) size);
-}
+#define mempool_init_kmalloc_pool(_pool, _min_nr, _size) \
+ mempool_init(_pool, (_min_nr), mempool_kmalloc, mempool_kfree, \
+ (void *)(unsigned long)(_size))
+#define mempool_create_kmalloc_pool(_min_nr, _size) \
+ mempool_create((_min_nr), mempool_kmalloc, mempool_kfree, \
+ (void *)(unsigned long)(_size))
/*
* A mempool_alloc_t and mempool_free_t for a simple page allocator that
@@ -102,16 +104,11 @@ static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)
void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data);
void mempool_free_pages(void *element, void *pool_data);
-static inline int mempool_init_page_pool(mempool_t *pool, int min_nr, int order)
-{
- return mempool_init(pool, min_nr, mempool_alloc_pages,
- mempool_free_pages, (void *)(long)order);
-}
-
-static inline mempool_t *mempool_create_page_pool(int min_nr, int order)
-{
- return mempool_create(min_nr, mempool_alloc_pages, mempool_free_pages,
- (void *)(long)order);
-}
+#define mempool_init_page_pool(_pool, _min_nr, _order) \
+ mempool_init(_pool, (_min_nr), mempool_alloc_pages, \
+ mempool_free_pages, (void *)(long)(_order))
+#define mempool_create_page_pool(_min_nr, _order) \
+ mempool_create((_min_nr), mempool_alloc_pages, \
+ mempool_free_pages, (void *)(long)(_order))
#endif /* _LINUX_MEMPOOL_H */
diff --git a/mm/mempool.c b/mm/mempool.c
index dbbf0e9fb424..c47ff883cf36 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -240,17 +240,17 @@ EXPORT_SYMBOL(mempool_init_node);
*
* Return: %0 on success, negative error code otherwise.
*/
-int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data)
+int mempool_init_noprof(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+ mempool_free_t *free_fn, void *pool_data)
{
return mempool_init_node(pool, min_nr, alloc_fn, free_fn,
pool_data, GFP_KERNEL, NUMA_NO_NODE);
}
-EXPORT_SYMBOL(mempool_init);
+EXPORT_SYMBOL(mempool_init_noprof);
/**
- * mempool_create - create a memory pool
+ * mempool_create_node - create a memory pool
* @min_nr: the minimum number of elements guaranteed to be
* allocated for this pool.
* @alloc_fn: user-defined element-allocation function.
@@ -265,17 +265,9 @@ EXPORT_SYMBOL(mempool_init);
*
* Return: pointer to the created memory pool object or %NULL on error.
*/
-mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data)
-{
- return mempool_create_node(min_nr, alloc_fn, free_fn, pool_data,
- GFP_KERNEL, NUMA_NO_NODE);
-}
-EXPORT_SYMBOL(mempool_create);
-
-mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data,
- gfp_t gfp_mask, int node_id)
+mempool_t *mempool_create_node_noprof(int min_nr, mempool_alloc_t *alloc_fn,
+ mempool_free_t *free_fn, void *pool_data,
+ gfp_t gfp_mask, int node_id)
{
mempool_t *pool;
@@ -291,7 +283,7 @@ mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
return pool;
}
-EXPORT_SYMBOL(mempool_create_node);
+EXPORT_SYMBOL(mempool_create_node_noprof);
/**
* mempool_resize - resize an existing memory pool
@@ -374,7 +366,7 @@ int mempool_resize(mempool_t *pool, int new_min_nr)
EXPORT_SYMBOL(mempool_resize);
/**
- * mempool_alloc - allocate an element from a specific memory pool
+ * mempool_alloc_noprof - allocate an element from a specific memory pool
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
* @gfp_mask: the usual allocation bitmask.
@@ -387,7 +379,7 @@ EXPORT_SYMBOL(mempool_resize);
*
* Return: pointer to the allocated element or %NULL on error.
*/
-void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
+void *mempool_alloc_noprof(mempool_t *pool, gfp_t gfp_mask)
{
void *element;
unsigned long flags;
@@ -454,7 +446,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
finish_wait(&pool->wait, &wait);
goto repeat_alloc;
}
-EXPORT_SYMBOL(mempool_alloc);
+EXPORT_SYMBOL(mempool_alloc_noprof);
/**
* mempool_alloc_preallocated - allocate an element from preallocated elements
@@ -562,7 +554,7 @@ void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data)
{
struct kmem_cache *mem = pool_data;
VM_BUG_ON(mem->ctor);
- return kmem_cache_alloc(mem, gfp_mask);
+ return kmem_cache_alloc_noprof(mem, gfp_mask);
}
EXPORT_SYMBOL(mempool_alloc_slab);
@@ -580,7 +572,7 @@ EXPORT_SYMBOL(mempool_free_slab);
void *mempool_kmalloc(gfp_t gfp_mask, void *pool_data)
{
size_t size = (size_t)pool_data;
- return kmalloc(size, gfp_mask);
+ return kmalloc_noprof(size, gfp_mask);
}
EXPORT_SYMBOL(mempool_kmalloc);
@@ -597,7 +589,7 @@ EXPORT_SYMBOL(mempool_kfree);
void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data)
{
int order = (int)(long)pool_data;
- return alloc_pages(gfp_mask, order);
+ return alloc_pages_noprof(gfp_mask, order);
}
EXPORT_SYMBOL(mempool_alloc_pages);
--
2.43.0.687.g38aa6559b0-goog
After alloc_pages is redefined as a macro, the preprocessor replaces
uses of that name, including the alloc_pages member of struct
dma_map_ops. Rename the conflicting member to alloc_pages_op so that it
is not replaced where that is not intended.
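The collision can be reproduced outside the kernel. The following is a
minimal userspace sketch, assuming the wrapper is a function-like macro;
demo_ops, demo_backend, demo_dispatch and hook_calls are hypothetical
names used only for illustration, and the macro body is a stand-in for
alloc_hooks():

```c
/* Stand-in for the real allocator entry point (illustration only). */
static int alloc_pages_noprof(int order)
{
	return 1 << order;
}

static int hook_calls;	/* counts how often the wrapper macro expanded */

/* Function-like macro: expands wherever "alloc_pages" is followed by "(". */
#define alloc_pages(...) (hook_calls++, alloc_pages_noprof(__VA_ARGS__))

struct demo_ops {
	/*
	 * Had this member kept the name alloc_pages, the call
	 * ops->alloc_pages(order) would be rewritten by the macro into
	 * ops->(hook_calls++, alloc_pages_noprof(order)), which does not
	 * compile. The _op suffix keeps the member out of the macro's reach.
	 */
	int (*alloc_pages_op)(int order);
};

/* Hypothetical backend, standing in for dma_common_alloc_pages(). */
static int demo_backend(int order)
{
	return 100 + order;
}

/* Mirrors the shape of the __dma_alloc_pages() call site. */
static int demo_dispatch(const struct demo_ops *ops, int order)
{
	if (!ops->alloc_pages_op)
		return -1;
	return ops->alloc_pages_op(order);
}
```

With the renamed member, calls through the ops table reach the backend
directly and the wrapper macro only fires on real alloc_pages() calls.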
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/alpha/kernel/pci_iommu.c | 2 +-
arch/mips/jazz/jazzdma.c | 2 +-
arch/powerpc/kernel/dma-iommu.c | 2 +-
arch/powerpc/platforms/ps3/system-bus.c | 4 ++--
arch/powerpc/platforms/pseries/vio.c | 2 +-
arch/x86/kernel/amd_gart_64.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/parisc/ccio-dma.c | 2 +-
drivers/parisc/sba_iommu.c | 2 +-
drivers/xen/grant-dma-ops.c | 2 +-
drivers/xen/swiotlb-xen.c | 2 +-
include/linux/dma-map-ops.h | 2 +-
kernel/dma/mapping.c | 4 ++--
13 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c
index c81183935e97..7fcf3e9b7103 100644
--- a/arch/alpha/kernel/pci_iommu.c
+++ b/arch/alpha/kernel/pci_iommu.c
@@ -929,7 +929,7 @@ const struct dma_map_ops alpha_pci_ops = {
.dma_supported = alpha_pci_supported,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
EXPORT_SYMBOL(alpha_pci_ops);
diff --git a/arch/mips/jazz/jazzdma.c b/arch/mips/jazz/jazzdma.c
index eabddb89d221..c97b089b9902 100644
--- a/arch/mips/jazz/jazzdma.c
+++ b/arch/mips/jazz/jazzdma.c
@@ -617,7 +617,7 @@ const struct dma_map_ops jazz_dma_ops = {
.sync_sg_for_device = jazz_dma_sync_sg_for_device,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
EXPORT_SYMBOL(jazz_dma_ops);
diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index 8920862ffd79..f0ae39e77e37 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -216,6 +216,6 @@ const struct dma_map_ops dma_iommu_ops = {
.get_required_mask = dma_iommu_get_required_mask,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
diff --git a/arch/powerpc/platforms/ps3/system-bus.c b/arch/powerpc/platforms/ps3/system-bus.c
index d6b5f5ecd515..56dc6b29a3e7 100644
--- a/arch/powerpc/platforms/ps3/system-bus.c
+++ b/arch/powerpc/platforms/ps3/system-bus.c
@@ -695,7 +695,7 @@ static const struct dma_map_ops ps3_sb_dma_ops = {
.unmap_page = ps3_unmap_page,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
@@ -709,7 +709,7 @@ static const struct dma_map_ops ps3_ioc0_dma_ops = {
.unmap_page = ps3_unmap_page,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
index 2dc9cbc4bcd8..0c90fc4c3796 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -611,7 +611,7 @@ static const struct dma_map_ops vio_dma_mapping_ops = {
.get_required_mask = dma_iommu_get_required_mask,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 2ae98f754e59..c884deca839b 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
.get_sgtable = dma_common_get_sgtable,
.dma_supported = dma_direct_supported,
.get_required_mask = dma_direct_get_required_mask,
- .alloc_pages = dma_direct_alloc_pages,
+ .alloc_pages_op = dma_direct_alloc_pages,
.free_pages = dma_direct_free_pages,
};
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 50ccc4f1ef81..8a1f7f5d1bca 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1710,7 +1710,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
.free = iommu_dma_free,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
.alloc_noncontiguous = iommu_dma_alloc_noncontiguous,
.free_noncontiguous = iommu_dma_free_noncontiguous,
diff --git a/drivers/parisc/ccio-dma.c b/drivers/parisc/ccio-dma.c
index 9ce0d20a6c58..feef537257d0 100644
--- a/drivers/parisc/ccio-dma.c
+++ b/drivers/parisc/ccio-dma.c
@@ -1022,7 +1022,7 @@ static const struct dma_map_ops ccio_ops = {
.map_sg = ccio_map_sg,
.unmap_sg = ccio_unmap_sg,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
diff --git a/drivers/parisc/sba_iommu.c b/drivers/parisc/sba_iommu.c
index 784037837f65..fc3863c09f83 100644
--- a/drivers/parisc/sba_iommu.c
+++ b/drivers/parisc/sba_iommu.c
@@ -1090,7 +1090,7 @@ static const struct dma_map_ops sba_ops = {
.map_sg = sba_map_sg,
.unmap_sg = sba_unmap_sg,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
};
diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
index 76f6f26265a3..29257d2639db 100644
--- a/drivers/xen/grant-dma-ops.c
+++ b/drivers/xen/grant-dma-ops.c
@@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
static const struct dma_map_ops xen_grant_dma_ops = {
.alloc = xen_grant_dma_alloc,
.free = xen_grant_dma_free,
- .alloc_pages = xen_grant_dma_alloc_pages,
+ .alloc_pages_op = xen_grant_dma_alloc_pages,
.free_pages = xen_grant_dma_free_pages,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 0e6c6c25d154..1c4ef5111651 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -403,7 +403,7 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
.dma_supported = xen_swiotlb_dma_supported,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
- .alloc_pages = dma_common_alloc_pages,
+ .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
.max_mapping_size = swiotlb_max_mapping_size,
};
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 4abc60f04209..9ee319851b5f 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -29,7 +29,7 @@ struct dma_map_ops {
unsigned long attrs);
void (*free)(struct device *dev, size_t size, void *vaddr,
dma_addr_t dma_handle, unsigned long attrs);
- struct page *(*alloc_pages)(struct device *dev, size_t size,
+ struct page *(*alloc_pages_op)(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir,
gfp_t gfp);
void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 58db8fd70471..5e2d51e1cdf6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
size = PAGE_ALIGN(size);
if (dma_alloc_direct(dev, ops))
return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
- if (!ops->alloc_pages)
+ if (!ops->alloc_pages_op)
return NULL;
- return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
+ return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
}
struct page *dma_alloc_pages(struct device *dev, size_t size,
--
2.43.0.687.g38aa6559b0-goog
From: Kent Overstreet <[email protected]>
This adds an alloc_hooks() wrapper around kmem_alloc(), so that we can
have allocations accounted to the proper callsite.
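This is also why kmem_zalloc() becomes a macro: a static inline helper
would charge every caller to the one line inside the helper. A minimal
userspace sketch of the difference, where all demo_* names are
hypothetical and the callsite is recorded with __FILE__/__LINE__ rather
than a real alloc_tag:

```c
#include <stddef.h>
#include <stdlib.h>

#define DEMO_ZERO 0x1u	/* hypothetical analogue of KM_ZERO */

/* Hypothetical callsite record; the kernel uses struct alloc_tag. */
struct demo_site {
	const char *file;
	int line;
};

static struct demo_site demo_last_site;	/* last callsite that allocated */

static void *demo_alloc_noprof(size_t size, unsigned int flags)
{
	return (flags & DEMO_ZERO) ? calloc(1, size) : malloc(size);
}

/* Records the location where the macro is expanded, then allocates. */
#define demo_alloc(size, flags)						\
	(demo_last_site.file = __FILE__,				\
	 demo_last_site.line = __LINE__,				\
	 demo_alloc_noprof((size), (flags)))

/* Inline helper: the macro expands here, so all callers share one site. */
static inline void *demo_zalloc_inline(size_t size, unsigned int flags)
{
	return demo_alloc(size, flags | DEMO_ZERO);
}

/* Macro helper: the expansion happens at each caller, one site per caller. */
#define demo_zalloc(size, flags) demo_alloc((size), (flags) | DEMO_ZERO)
```

Two calls to demo_zalloc_inline() report the same line, while two calls
to demo_zalloc() on different lines report different ones.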
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
fs/xfs/kmem.c | 4 ++--
fs/xfs/kmem.h | 10 ++++------
2 files changed, 6 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index c557a030acfe..9aa57a4e2478 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -8,7 +8,7 @@
#include "xfs_trace.h"
void *
-kmem_alloc(size_t size, xfs_km_flags_t flags)
+kmem_alloc_noprof(size_t size, xfs_km_flags_t flags)
{
int retries = 0;
gfp_t lflags = kmem_flags_convert(flags);
@@ -17,7 +17,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
trace_kmem_alloc(size, flags, _RET_IP_);
do {
- ptr = kmalloc(size, lflags);
+ ptr = kmalloc_noprof(size, lflags);
if (ptr || (flags & KM_MAYFAIL))
return ptr;
if (!(++retries % 100))
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index b987dc2c6851..c4cf1dc2a7af 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -6,6 +6,7 @@
#ifndef __XFS_SUPPORT_KMEM_H__
#define __XFS_SUPPORT_KMEM_H__
+#include <linux/alloc_tag.h>
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/mm.h>
@@ -56,18 +57,15 @@ kmem_flags_convert(xfs_km_flags_t flags)
return lflags;
}
-extern void *kmem_alloc(size_t, xfs_km_flags_t);
static inline void kmem_free(const void *ptr)
{
kvfree(ptr);
}
+extern void *kmem_alloc_noprof(size_t, xfs_km_flags_t);
+#define kmem_alloc(...) alloc_hooks(kmem_alloc_noprof(__VA_ARGS__))
-static inline void *
-kmem_zalloc(size_t size, xfs_km_flags_t flags)
-{
- return kmem_alloc(size, flags | KM_ZERO);
-}
+#define kmem_zalloc(_size, _flags) kmem_alloc((_size), (_flags) | KM_ZERO)
/*
* Zone interfaces
--
2.43.0.687.g38aa6559b0-goog
From: Kent Overstreet <[email protected]>
Upcoming alloc tagging patches require a place to stash per-allocation
metadata.
We already do this when memcg is enabled, so this patch generalizes the
obj_cgroup * vector in struct pcpu_chunk by creating a pcpuobj_ext
type, which an upcoming patch will extend, similar to the previous
slabobj_ext patch.
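The shape of the change can be sketched in userspace as follows. All
demo_* names are hypothetical, and DEMO_MIN_ALLOC_SHIFT is an assumed
value chosen for the example, not taken from this patch:

```c
#include <stdlib.h>

#define DEMO_MIN_ALLOC_SHIFT 2	/* assumed: 4-byte minimum allocation unit */

struct demo_cgroup {		/* stand-in for struct obj_cgroup */
	int id;
};

/* One record per minimum-sized unit; more fields can be added later. */
struct demo_obj_ext {
	struct demo_cgroup *cgroup;
};

struct demo_chunk {
	size_t nr_units;
	struct demo_obj_ext *obj_exts;	/* was: struct demo_cgroup **cgroups */
};

static int demo_chunk_init(struct demo_chunk *chunk, size_t nr_units)
{
	/* Sized by the record type, not by a single pointer. */
	chunk->obj_exts = calloc(nr_units, sizeof(*chunk->obj_exts));
	if (!chunk->obj_exts)
		return -1;
	chunk->nr_units = nr_units;
	return 0;
}

static void demo_set_cgroup(struct demo_chunk *chunk, size_t off,
			    struct demo_cgroup *cg)
{
	chunk->obj_exts[off >> DEMO_MIN_ALLOC_SHIFT].cgroup = cg;
}

static struct demo_cgroup *demo_get_cgroup(const struct demo_chunk *chunk,
					   size_t off)
{
	return chunk->obj_exts[off >> DEMO_MIN_ALLOC_SHIFT].cgroup;
}

static void demo_chunk_free(struct demo_chunk *chunk)
{
	free(chunk->obj_exts);
	chunk->obj_exts = NULL;
}
```

Indexing stays the same as before; only the element type changes, so a
later patch can add a field to the record without touching the lookups.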
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: [email protected]
---
mm/percpu-internal.h | 19 +++++++++++++++++--
mm/percpu.c | 30 +++++++++++++++---------------
2 files changed, 32 insertions(+), 17 deletions(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index cdd0aa597a81..e62d582f4bf3 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -32,6 +32,16 @@ struct pcpu_block_md {
int nr_bits; /* total bits responsible for */
};
+struct pcpuobj_ext {
+#ifdef CONFIG_MEMCG_KMEM
+ struct obj_cgroup *cgroup;
+#endif
+};
+
+#ifdef CONFIG_MEMCG_KMEM
+#define NEED_PCPUOBJ_EXT
+#endif
+
struct pcpu_chunk {
#ifdef CONFIG_PERCPU_STATS
int nr_alloc; /* # of allocations */
@@ -64,8 +74,8 @@ struct pcpu_chunk {
int end_offset; /* additional area required to
have the region end page
aligned */
-#ifdef CONFIG_MEMCG_KMEM
- struct obj_cgroup **obj_cgroups; /* vector of object cgroups */
+#ifdef NEED_PCPUOBJ_EXT
+ struct pcpuobj_ext *obj_exts; /* vector of object cgroups */
#endif
int nr_pages; /* # of pages served by this chunk */
@@ -74,6 +84,11 @@ struct pcpu_chunk {
unsigned long populated[]; /* populated bitmap */
};
+static inline bool need_pcpuobj_ext(void)
+{
+ return !mem_cgroup_kmem_disabled();
+}
+
extern spinlock_t pcpu_lock;
extern struct list_head *pcpu_chunk_lists;
diff --git a/mm/percpu.c b/mm/percpu.c
index 4e11fc1e6def..2e5edaad9cc3 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1392,9 +1392,9 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
panic("%s: Failed to allocate %zu bytes\n", __func__,
alloc_size);
-#ifdef CONFIG_MEMCG_KMEM
+#ifdef NEED_PCPUOBJ_EXT
/* first chunk is free to use */
- chunk->obj_cgroups = NULL;
+ chunk->obj_exts = NULL;
#endif
pcpu_init_md_blocks(chunk);
@@ -1463,12 +1463,12 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
if (!chunk->md_blocks)
goto md_blocks_fail;
-#ifdef CONFIG_MEMCG_KMEM
- if (!mem_cgroup_kmem_disabled()) {
- chunk->obj_cgroups =
+#ifdef NEED_PCPUOBJ_EXT
+ if (need_pcpuobj_ext()) {
+ chunk->obj_exts =
pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
- sizeof(struct obj_cgroup *), gfp);
- if (!chunk->obj_cgroups)
+ sizeof(struct pcpuobj_ext), gfp);
+ if (!chunk->obj_exts)
goto objcg_fail;
}
#endif
@@ -1480,7 +1480,7 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
return chunk;
-#ifdef CONFIG_MEMCG_KMEM
+#ifdef NEED_PCPUOBJ_EXT
objcg_fail:
pcpu_mem_free(chunk->md_blocks);
#endif
@@ -1498,8 +1498,8 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
{
if (!chunk)
return;
-#ifdef CONFIG_MEMCG_KMEM
- pcpu_mem_free(chunk->obj_cgroups);
+#ifdef NEED_PCPUOBJ_EXT
+ pcpu_mem_free(chunk->obj_exts);
#endif
pcpu_mem_free(chunk->md_blocks);
pcpu_mem_free(chunk->bound_map);
@@ -1646,9 +1646,9 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
if (!objcg)
return;
- if (likely(chunk && chunk->obj_cgroups)) {
+ if (likely(chunk && chunk->obj_exts)) {
obj_cgroup_get(objcg);
- chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
+ chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = objcg;
rcu_read_lock();
mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
@@ -1663,13 +1663,13 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
{
struct obj_cgroup *objcg;
- if (unlikely(!chunk->obj_cgroups))
+ if (unlikely(!chunk->obj_exts))
return;
- objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
+ objcg = chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup;
if (!objcg)
return;
- chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
+ chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = NULL;
obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
--
2.43.0.687.g38aa6559b0-goog
For all page allocations to be tagged, page_ext has to be initialized
before the first page allocation. Early tasks allocate their stacks
using the page allocator before alloc_node_page_ext() initializes the
page_ext area, unless early_page_ext is enabled. These allocations
therefore generate a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG is
enabled.
Enable early_page_ext whenever CONFIG_MEM_ALLOC_PROFILING_DEBUG=y to
ensure page_ext is initialized prior to any page allocation. This has
all the negative effects associated with early_page_ext, such as a
possibly longer boot time, so we enable it only when debugging with
CONFIG_MEM_ALLOC_PROFILING_DEBUG and not universally for
CONFIG_MEM_ALLOC_PROFILING.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/page_ext.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 3c58fe8a24df..e7d8f1a5589e 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -95,7 +95,16 @@ unsigned long page_ext_size;
static unsigned long total_usage;
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+/*
+ * To ensure correct allocation tagging for pages, page_ext should be available
+ * before the first page allocation. Otherwise early task stacks will be
+ * allocated before page_ext initialization and missing tags will be flagged.
+ */
+bool early_page_ext __meminitdata = true;
+#else
bool early_page_ext __meminitdata;
+#endif
static int __init setup_early_page_ext(char *str)
{
early_page_ext = true;
--
2.43.0.687.g38aa6559b0-goog
Redefine __alloc_percpu, __alloc_percpu_gfp and __alloc_reserved_percpu
to record allocations and deallocations done by these functions.
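The wrapping pattern is roughly: define a static tag at the callsite,
make it current, run the _noprof allocator, then restore the previous
tag. The following is a userspace sketch under simplifying assumptions:
a single global stands in for the per-task current tag, a fixed buffer
stands in for the percpu area, all demo_* names are hypothetical, and
GNU C statement expressions are required (gcc or clang):

```c
#include <stddef.h>

struct demo_tag {
	const char *file;
	int line;
	size_t bytes;		/* bytes charged to this callsite */
};

/* Assumption: one global instead of a per-task current tag. */
static struct demo_tag *demo_current_tag;
static struct demo_tag *demo_last_tag;	/* exposed for inspection only */

static char demo_pool[256];
static size_t demo_pool_used;

/* Bump allocator standing in for pcpu_alloc_noprof(). */
static void *demo_pcpu_alloc_noprof(size_t size)
{
	void *p;

	if (size > sizeof(demo_pool) - demo_pool_used)
		return NULL;
	p = demo_pool + demo_pool_used;
	demo_pool_used += size;

	/* Charge the allocation to whichever callsite is current. */
	if (demo_current_tag)
		demo_current_tag->bytes += size;
	demo_last_tag = demo_current_tag;
	return p;
}

/* Each expansion gets its own static tag, i.e. one tag per callsite. */
#define demo_alloc_hooks(_do_alloc)					\
({									\
	static struct demo_tag _tag = { __FILE__, __LINE__, 0 };	\
	struct demo_tag *_old = demo_current_tag;			\
	__typeof__(_do_alloc) _res;					\
									\
	demo_current_tag = &_tag;					\
	_res = _do_alloc;						\
	demo_current_tag = _old;					\
	_res;								\
})

#define demo_alloc_percpu(_size) \
	demo_alloc_hooks(demo_pcpu_alloc_noprof(_size))
```

Because the tag is restored after the call, nested wrapped allocations
would not leak the inner callsite into the outer one.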
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/alloc_tag.h | 15 +++++++++
include/linux/percpu.h | 23 +++++++++-----
mm/percpu.c | 64 +++++----------------------------------
3 files changed, 38 insertions(+), 64 deletions(-)
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 6fa8a94d8bc1..3fe51e67e231 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -140,4 +140,19 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
_res; \
})
+/*
+ * workaround for a sparse bug: it complains about res_type_to_err() when
+ * typeof(_do_alloc) is a __percpu pointer, but gcc won't let us add a separate
+ * __percpu case to res_type_to_err():
+ */
+#define alloc_hooks_pcpu(_do_alloc) \
+({ \
+ typeof(_do_alloc) _res; \
+ DEFINE_ALLOC_TAG(_alloc_tag, _old); \
+ \
+ _res = _do_alloc; \
+ alloc_tag_restore(&_alloc_tag, _old); \
+ _res; \
+})
+
#endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 62b5eb45bd89..eb4eb264136f 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -2,6 +2,7 @@
#ifndef __LINUX_PERCPU_H
#define __LINUX_PERCPU_H
+#include <linux/alloc_tag.h>
#include <linux/mmdebug.h>
#include <linux/preempt.h>
#include <linux/smp.h>
@@ -9,6 +10,7 @@
#include <linux/pfn.h>
#include <linux/init.h>
#include <linux/cleanup.h>
+#include <linux/sched.h>
#include <asm/percpu.h>
@@ -125,7 +127,6 @@ extern int __init pcpu_page_first_chunk(size_t reserved_size,
pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn);
#endif
-extern void __percpu *__alloc_reserved_percpu(size_t size, size_t align) __alloc_size(1);
extern bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr);
extern bool is_kernel_percpu_address(unsigned long addr);
@@ -133,14 +134,16 @@ extern bool is_kernel_percpu_address(unsigned long addr);
extern void __init setup_per_cpu_areas(void);
#endif
-extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
-extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
-extern void free_percpu(void __percpu *__pdata);
+extern void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
+ gfp_t gfp) __alloc_size(1);
extern size_t pcpu_alloc_size(void __percpu *__pdata);
-DEFINE_FREE(free_percpu, void __percpu *, free_percpu(_T))
-
-extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
+#define __alloc_percpu_gfp(_size, _align, _gfp) \
+ alloc_hooks_pcpu(pcpu_alloc_noprof(_size, _align, false, _gfp))
+#define __alloc_percpu(_size, _align) \
+ alloc_hooks_pcpu(pcpu_alloc_noprof(_size, _align, false, GFP_KERNEL))
+#define __alloc_reserved_percpu(_size, _align) \
+ alloc_hooks_pcpu(pcpu_alloc_noprof(_size, _align, true, GFP_KERNEL))
#define alloc_percpu_gfp(type, gfp) \
(typeof(type) __percpu *)__alloc_percpu_gfp(sizeof(type), \
@@ -149,6 +152,12 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
(typeof(type) __percpu *)__alloc_percpu(sizeof(type), \
__alignof__(type))
+extern void free_percpu(void __percpu *__pdata);
+
+DEFINE_FREE(free_percpu, void __percpu *, free_percpu(_T))
+
+extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
+
extern unsigned long pcpu_nr_pages(void);
#endif /* __LINUX_PERCPU_H */
diff --git a/mm/percpu.c b/mm/percpu.c
index 578531ea1f43..2badcc5e0e71 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1726,7 +1726,7 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
#endif
/**
- * pcpu_alloc - the percpu allocator
+ * pcpu_alloc_noprof - the percpu allocator
* @size: size of area to allocate in bytes
* @align: alignment of area (max PAGE_SIZE)
* @reserved: allocate from the reserved chunk if available
@@ -1740,7 +1740,7 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
* RETURNS:
* Percpu pointer to the allocated area on success, NULL on failure.
*/
-static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
+void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
gfp_t gfp)
{
gfp_t pcpu_gfp;
@@ -1907,6 +1907,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
+ pcpu_alloc_tag_alloc_hook(chunk, off, size);
+
return ptr;
fail_unlock:
@@ -1935,61 +1937,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
return NULL;
}
-
-/**
- * __alloc_percpu_gfp - allocate dynamic percpu area
- * @size: size of area to allocate in bytes
- * @align: alignment of area (max PAGE_SIZE)
- * @gfp: allocation flags
- *
- * Allocate zero-filled percpu area of @size bytes aligned at @align. If
- * @gfp doesn't contain %GFP_KERNEL, the allocation doesn't block and can
- * be called from any context but is a lot more likely to fail. If @gfp
- * has __GFP_NOWARN then no warning will be triggered on invalid or failed
- * allocation requests.
- *
- * RETURNS:
- * Percpu pointer to the allocated area on success, NULL on failure.
- */
-void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp)
-{
- return pcpu_alloc(size, align, false, gfp);
-}
-EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
-
-/**
- * __alloc_percpu - allocate dynamic percpu area
- * @size: size of area to allocate in bytes
- * @align: alignment of area (max PAGE_SIZE)
- *
- * Equivalent to __alloc_percpu_gfp(size, align, %GFP_KERNEL).
- */
-void __percpu *__alloc_percpu(size_t size, size_t align)
-{
- return pcpu_alloc(size, align, false, GFP_KERNEL);
-}
-EXPORT_SYMBOL_GPL(__alloc_percpu);
-
-/**
- * __alloc_reserved_percpu - allocate reserved percpu area
- * @size: size of area to allocate in bytes
- * @align: alignment of area (max PAGE_SIZE)
- *
- * Allocate zero-filled percpu area of @size bytes aligned at @align
- * from reserved percpu area if arch has set it up; otherwise,
- * allocation is served from the same dynamic area. Might sleep.
- * Might trigger writeouts.
- *
- * CONTEXT:
- * Does GFP_KERNEL allocation.
- *
- * RETURNS:
- * Percpu pointer to the allocated area on success, NULL on failure.
- */
-void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
-{
- return pcpu_alloc(size, align, true, GFP_KERNEL);
-}
+EXPORT_SYMBOL_GPL(pcpu_alloc_noprof);
/**
* pcpu_balance_free - manage the amount of free chunks
@@ -2328,6 +2276,8 @@ void free_percpu(void __percpu *ptr)
spin_lock_irqsave(&pcpu_lock, flags);
size = pcpu_free_area(chunk, off);
+ pcpu_alloc_tag_free_hook(chunk, off, size);
+
pcpu_memcg_free_hook(chunk, off, size);
/*
--
2.43.0.687.g38aa6559b0-goog
From: Kent Overstreet <[email protected]>
This wraps all external vmalloc allocation functions with the
alloc_hooks() wrapper, and switches internal allocations to _noprof
variants where appropriate, for the new memory allocation profiling
feature.
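One side effect, visible in the hmm.c and kallsyms_selftest.c hunks of
this patch, is that code taking the address of vmalloc has to name
vmalloc_noprof instead, since a function-like macro has no address. A
minimal userspace sketch with hypothetical demo_* names and a stand-in
wrapper body:

```c
#include <stdlib.h>

static unsigned int demo_hooked_calls;	/* counts wrapper expansions */

static void *demo_vmalloc_noprof(unsigned long size)
{
	return malloc(size);
}

/* Stand-in for alloc_hooks(): expands only when followed by "(". */
#define demo_vmalloc(...) \
	(demo_hooked_calls++, demo_vmalloc_noprof(__VA_ARGS__))

typedef void *(*demo_alloc_fn)(unsigned long size);

/*
 * "return demo_vmalloc;" would not compile here: without a following "("
 * the macro is not expanded, and no function of that name exists.
 */
static demo_alloc_fn demo_get_allocator(void)
{
	return demo_vmalloc_noprof;
}
```

Calls made through such a pointer bypass the wrapper, so they are not
attributed to a callsite of their own.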
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
drivers/staging/media/atomisp/pci/hmm/hmm.c | 2 +-
include/linux/vmalloc.h | 60 ++++++++++----
kernel/kallsyms_selftest.c | 2 +-
mm/util.c | 24 +++---
mm/vmalloc.c | 88 ++++++++++-----------
5 files changed, 103 insertions(+), 73 deletions(-)
diff --git a/drivers/staging/media/atomisp/pci/hmm/hmm.c b/drivers/staging/media/atomisp/pci/hmm/hmm.c
index bb12644fd033..3e2899ad8517 100644
--- a/drivers/staging/media/atomisp/pci/hmm/hmm.c
+++ b/drivers/staging/media/atomisp/pci/hmm/hmm.c
@@ -205,7 +205,7 @@ static ia_css_ptr __hmm_alloc(size_t bytes, enum hmm_bo_type type,
}
dev_dbg(atomisp_dev, "pages: 0x%08x (%zu bytes), type: %d, vmalloc %p\n",
- bo->start, bytes, type, vmalloc);
+ bo->start, bytes, type, vmalloc_noprof);
return bo->start;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..106d78e75606 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -2,6 +2,8 @@
#ifndef _LINUX_VMALLOC_H
#define _LINUX_VMALLOC_H
+#include <linux/alloc_tag.h>
+#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/init.h>
#include <linux/list.h>
@@ -137,26 +139,54 @@ extern unsigned long vmalloc_nr_pages(void);
static inline unsigned long vmalloc_nr_pages(void) { return 0; }
#endif
-extern void *vmalloc(unsigned long size) __alloc_size(1);
-extern void *vzalloc(unsigned long size) __alloc_size(1);
-extern void *vmalloc_user(unsigned long size) __alloc_size(1);
-extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
-extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
-extern void *vmalloc_32(unsigned long size) __alloc_size(1);
-extern void *vmalloc_32_user(unsigned long size) __alloc_size(1);
-extern void *__vmalloc(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
-extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
+extern void *vmalloc_noprof(unsigned long size) __alloc_size(1);
+#define vmalloc(...) alloc_hooks(vmalloc_noprof(__VA_ARGS__))
+
+extern void *vzalloc_noprof(unsigned long size) __alloc_size(1);
+#define vzalloc(...) alloc_hooks(vzalloc_noprof(__VA_ARGS__))
+
+extern void *vmalloc_user_noprof(unsigned long size) __alloc_size(1);
+#define vmalloc_user(...) alloc_hooks(vmalloc_user_noprof(__VA_ARGS__))
+
+extern void *vmalloc_node_noprof(unsigned long size, int node) __alloc_size(1);
+#define vmalloc_node(...) alloc_hooks(vmalloc_node_noprof(__VA_ARGS__))
+
+extern void *vzalloc_node_noprof(unsigned long size, int node) __alloc_size(1);
+#define vzalloc_node(...) alloc_hooks(vzalloc_node_noprof(__VA_ARGS__))
+
+extern void *vmalloc_32_noprof(unsigned long size) __alloc_size(1);
+#define vmalloc_32(...) alloc_hooks(vmalloc_32_noprof(__VA_ARGS__))
+
+extern void *vmalloc_32_user_noprof(unsigned long size) __alloc_size(1);
+#define vmalloc_32_user(...) alloc_hooks(vmalloc_32_user_noprof(__VA_ARGS__))
+
+extern void *__vmalloc_noprof(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+#define __vmalloc(...) alloc_hooks(__vmalloc_noprof(__VA_ARGS__))
+
+extern void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
unsigned long start, unsigned long end, gfp_t gfp_mask,
pgprot_t prot, unsigned long vm_flags, int node,
const void *caller) __alloc_size(1);
-void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
+#define __vmalloc_node_range(...) alloc_hooks(__vmalloc_node_range_noprof(__VA_ARGS__))
+
+void *__vmalloc_node_noprof(unsigned long size, unsigned long align, gfp_t gfp_mask,
int node, const void *caller) __alloc_size(1);
-void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+#define __vmalloc_node(...) alloc_hooks(__vmalloc_node_noprof(__VA_ARGS__))
+
+void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+#define vmalloc_huge(...) alloc_hooks(vmalloc_huge_noprof(__VA_ARGS__))
+
+extern void *__vmalloc_array_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
+#define __vmalloc_array(...) alloc_hooks(__vmalloc_array_noprof(__VA_ARGS__))
+
+extern void *vmalloc_array_noprof(size_t n, size_t size) __alloc_size(1, 2);
+#define vmalloc_array(...) alloc_hooks(vmalloc_array_noprof(__VA_ARGS__))
+
+extern void *__vcalloc_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
+#define __vcalloc(...) alloc_hooks(__vcalloc_noprof(__VA_ARGS__))
-extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
-extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
-extern void *__vcalloc(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
-extern void *vcalloc(size_t n, size_t size) __alloc_size(1, 2);
+extern void *vcalloc_noprof(size_t n, size_t size) __alloc_size(1, 2);
+#define vcalloc(...) alloc_hooks(vcalloc_noprof(__VA_ARGS__))
extern void vfree(const void *addr);
extern void vfree_atomic(const void *addr);
diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
index b4cac76ea5e9..3ea9be364e32 100644
--- a/kernel/kallsyms_selftest.c
+++ b/kernel/kallsyms_selftest.c
@@ -82,7 +82,7 @@ static struct test_item test_items[] = {
ITEM_FUNC(kallsyms_test_func_static),
ITEM_FUNC(kallsyms_test_func),
ITEM_FUNC(kallsyms_test_func_weak),
- ITEM_FUNC(vmalloc),
+ ITEM_FUNC(vmalloc_noprof),
ITEM_FUNC(vfree),
#ifdef CONFIG_KALLSYMS_ALL
ITEM_DATA(kallsyms_test_var_bss_static),
diff --git a/mm/util.c b/mm/util.c
index 291f7945190f..19c90036d3cc 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -639,7 +639,7 @@ void *kvmalloc_node_noprof(size_t size, gfp_t flags, int node)
* about the resulting pointer, and cannot play
* protection games.
*/
- return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+ return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
node, __builtin_return_address(0));
}
@@ -698,12 +698,12 @@ void *kvrealloc_noprof(const void *p, size_t oldsize, size_t newsize, gfp_t flag
EXPORT_SYMBOL(kvrealloc_noprof);
/**
- * __vmalloc_array - allocate memory for a virtually contiguous array.
+ * __vmalloc_array_noprof - allocate memory for a virtually contiguous array.
* @n: number of elements.
* @size: element size.
* @flags: the type of memory to allocate (see kmalloc).
*/
-void *__vmalloc_array(size_t n, size_t size, gfp_t flags)
+void *__vmalloc_array_noprof(size_t n, size_t size, gfp_t flags)
{
size_t bytes;
@@ -711,18 +711,18 @@ void *__vmalloc_array(size_t n, size_t size, gfp_t flags)
return NULL;
return __vmalloc(bytes, flags);
}
-EXPORT_SYMBOL(__vmalloc_array);
+EXPORT_SYMBOL(__vmalloc_array_noprof);
/**
- * vmalloc_array - allocate memory for a virtually contiguous array.
+ * vmalloc_array_noprof - allocate memory for a virtually contiguous array.
* @n: number of elements.
* @size: element size.
*/
-void *vmalloc_array(size_t n, size_t size)
+void *vmalloc_array_noprof(size_t n, size_t size)
{
return __vmalloc_array(n, size, GFP_KERNEL);
}
-EXPORT_SYMBOL(vmalloc_array);
+EXPORT_SYMBOL(vmalloc_array_noprof);
/**
* __vcalloc - allocate and zero memory for a virtually contiguous array.
@@ -730,22 +730,22 @@ EXPORT_SYMBOL(vmalloc_array);
* @size: element size.
* @flags: the type of memory to allocate (see kmalloc).
*/
-void *__vcalloc(size_t n, size_t size, gfp_t flags)
+void *__vcalloc_noprof(size_t n, size_t size, gfp_t flags)
{
return __vmalloc_array(n, size, flags | __GFP_ZERO);
}
-EXPORT_SYMBOL(__vcalloc);
+EXPORT_SYMBOL(__vcalloc_noprof);
/**
- * vcalloc - allocate and zero memory for a virtually contiguous array.
+ * vcalloc_noprof - allocate and zero memory for a virtually contiguous array.
* @n: number of elements.
* @size: element size.
*/
-void *vcalloc(size_t n, size_t size)
+void *vcalloc_noprof(size_t n, size_t size)
{
return __vmalloc_array(n, size, GFP_KERNEL | __GFP_ZERO);
}
-EXPORT_SYMBOL(vcalloc);
+EXPORT_SYMBOL(vcalloc_noprof);
struct anon_vma *folio_anon_vma(struct folio *folio)
{
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..5239f2c9ecae 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3025,12 +3025,12 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
* but mempolicy wants to alloc memory by interleaving.
*/
if (IS_ENABLED(CONFIG_NUMA) && nid == NUMA_NO_NODE)
- nr = alloc_pages_bulk_array_mempolicy(bulk_gfp,
+ nr = alloc_pages_bulk_array_mempolicy_noprof(bulk_gfp,
nr_pages_request,
pages + nr_allocated);
else
- nr = alloc_pages_bulk_array_node(bulk_gfp, nid,
+ nr = alloc_pages_bulk_array_node_noprof(bulk_gfp, nid,
nr_pages_request,
pages + nr_allocated);
@@ -3060,9 +3060,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
break;
if (nid == NUMA_NO_NODE)
- page = alloc_pages(alloc_gfp, order);
+ page = alloc_pages_noprof(alloc_gfp, order);
else
- page = alloc_pages_node(nid, alloc_gfp, order);
+ page = alloc_pages_node_noprof(nid, alloc_gfp, order);
if (unlikely(!page)) {
if (!nofail)
break;
@@ -3119,10 +3119,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
/* Please note that the recursion is strictly bounded. */
if (array_size > PAGE_SIZE) {
- area->pages = __vmalloc_node(array_size, 1, nested_gfp, node,
+ area->pages = __vmalloc_node_noprof(array_size, 1, nested_gfp, node,
area->caller);
} else {
- area->pages = kmalloc_node(array_size, nested_gfp, node);
+ area->pages = kmalloc_node_noprof(array_size, nested_gfp, node);
}
if (!area->pages) {
@@ -3205,7 +3205,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
}
/**
- * __vmalloc_node_range - allocate virtually contiguous memory
+ * __vmalloc_node_range_noprof - allocate virtually contiguous memory
* @size: allocation size
* @align: desired alignment
* @start: vm area range start
@@ -3232,7 +3232,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
*
* Return: the address of the area or %NULL on failure
*/
-void *__vmalloc_node_range(unsigned long size, unsigned long align,
+void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
unsigned long start, unsigned long end, gfp_t gfp_mask,
pgprot_t prot, unsigned long vm_flags, int node,
const void *caller)
@@ -3361,7 +3361,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
}
/**
- * __vmalloc_node - allocate virtually contiguous memory
+ * __vmalloc_node_noprof - allocate virtually contiguous memory
* @size: allocation size
* @align: desired alignment
* @gfp_mask: flags for the page level allocator
@@ -3379,10 +3379,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
*
* Return: pointer to the allocated memory or %NULL on error
*/
-void *__vmalloc_node(unsigned long size, unsigned long align,
+void *__vmalloc_node_noprof(unsigned long size, unsigned long align,
gfp_t gfp_mask, int node, const void *caller)
{
- return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
+ return __vmalloc_node_range_noprof(size, align, VMALLOC_START, VMALLOC_END,
gfp_mask, PAGE_KERNEL, 0, node, caller);
}
/*
@@ -3391,15 +3391,15 @@ void *__vmalloc_node(unsigned long size, unsigned long align,
* than that.
*/
#ifdef CONFIG_TEST_VMALLOC_MODULE
-EXPORT_SYMBOL_GPL(__vmalloc_node);
+EXPORT_SYMBOL_GPL(__vmalloc_node_noprof);
#endif
-void *__vmalloc(unsigned long size, gfp_t gfp_mask)
+void *__vmalloc_noprof(unsigned long size, gfp_t gfp_mask)
{
- return __vmalloc_node(size, 1, gfp_mask, NUMA_NO_NODE,
+ return __vmalloc_node_noprof(size, 1, gfp_mask, NUMA_NO_NODE,
__builtin_return_address(0));
}
-EXPORT_SYMBOL(__vmalloc);
+EXPORT_SYMBOL(__vmalloc_noprof);
/**
* vmalloc - allocate virtually contiguous memory
@@ -3413,12 +3413,12 @@ EXPORT_SYMBOL(__vmalloc);
*
* Return: pointer to the allocated memory or %NULL on error
*/
-void *vmalloc(unsigned long size)
+void *vmalloc_noprof(unsigned long size)
{
- return __vmalloc_node(size, 1, GFP_KERNEL, NUMA_NO_NODE,
+ return __vmalloc_node_noprof(size, 1, GFP_KERNEL, NUMA_NO_NODE,
__builtin_return_address(0));
}
-EXPORT_SYMBOL(vmalloc);
+EXPORT_SYMBOL(vmalloc_noprof);
/**
* vmalloc_huge - allocate virtually contiguous memory, allow huge pages
@@ -3432,16 +3432,16 @@ EXPORT_SYMBOL(vmalloc);
*
* Return: pointer to the allocated memory or %NULL on error
*/
-void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
+void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask)
{
- return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+ return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
NUMA_NO_NODE, __builtin_return_address(0));
}
-EXPORT_SYMBOL_GPL(vmalloc_huge);
+EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
/**
- * vzalloc - allocate virtually contiguous memory with zero fill
+ * vzalloc_noprof - allocate virtually contiguous memory with zero fill
* @size: allocation size
*
* Allocate enough pages to cover @size from the page level
@@ -3453,12 +3453,12 @@ EXPORT_SYMBOL_GPL(vmalloc_huge);
*
* Return: pointer to the allocated memory or %NULL on error
*/
-void *vzalloc(unsigned long size)
+void *vzalloc_noprof(unsigned long size)
{
- return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE,
+ return __vmalloc_node_noprof(size, 1, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE,
__builtin_return_address(0));
}
-EXPORT_SYMBOL(vzalloc);
+EXPORT_SYMBOL(vzalloc_noprof);
/**
* vmalloc_user - allocate zeroed virtually contiguous memory for userspace
@@ -3469,17 +3469,17 @@ EXPORT_SYMBOL(vzalloc);
*
* Return: pointer to the allocated memory or %NULL on error
*/
-void *vmalloc_user(unsigned long size)
+void *vmalloc_user_noprof(unsigned long size)
{
- return __vmalloc_node_range(size, SHMLBA, VMALLOC_START, VMALLOC_END,
+ return __vmalloc_node_range_noprof(size, SHMLBA, VMALLOC_START, VMALLOC_END,
GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
VM_USERMAP, NUMA_NO_NODE,
__builtin_return_address(0));
}
-EXPORT_SYMBOL(vmalloc_user);
+EXPORT_SYMBOL(vmalloc_user_noprof);
/**
- * vmalloc_node - allocate memory on a specific node
+ * vmalloc_node_noprof - allocate memory on a specific node
* @size: allocation size
* @node: numa node
*
@@ -3491,15 +3491,15 @@ EXPORT_SYMBOL(vmalloc_user);
*
* Return: pointer to the allocated memory or %NULL on error
*/
-void *vmalloc_node(unsigned long size, int node)
+void *vmalloc_node_noprof(unsigned long size, int node)
{
- return __vmalloc_node(size, 1, GFP_KERNEL, node,
+ return __vmalloc_node_noprof(size, 1, GFP_KERNEL, node,
__builtin_return_address(0));
}
-EXPORT_SYMBOL(vmalloc_node);
+EXPORT_SYMBOL(vmalloc_node_noprof);
/**
- * vzalloc_node - allocate memory on a specific node with zero fill
+ * vzalloc_node_noprof - allocate memory on a specific node with zero fill
* @size: allocation size
* @node: numa node
*
@@ -3509,12 +3509,12 @@ EXPORT_SYMBOL(vmalloc_node);
*
* Return: pointer to the allocated memory or %NULL on error
*/
-void *vzalloc_node(unsigned long size, int node)
+void *vzalloc_node_noprof(unsigned long size, int node)
{
- return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, node,
+ return __vmalloc_node_noprof(size, 1, GFP_KERNEL | __GFP_ZERO, node,
__builtin_return_address(0));
}
-EXPORT_SYMBOL(vzalloc_node);
+EXPORT_SYMBOL(vzalloc_node_noprof);
#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
#define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
@@ -3529,7 +3529,7 @@ EXPORT_SYMBOL(vzalloc_node);
#endif
/**
- * vmalloc_32 - allocate virtually contiguous memory (32bit addressable)
+ * vmalloc_32_noprof - allocate virtually contiguous memory (32bit addressable)
* @size: allocation size
*
* Allocate enough 32bit PA addressable pages to cover @size from the
@@ -3537,15 +3537,15 @@ EXPORT_SYMBOL(vzalloc_node);
*
* Return: pointer to the allocated memory or %NULL on error
*/
-void *vmalloc_32(unsigned long size)
+void *vmalloc_32_noprof(unsigned long size)
{
- return __vmalloc_node(size, 1, GFP_VMALLOC32, NUMA_NO_NODE,
+ return __vmalloc_node_noprof(size, 1, GFP_VMALLOC32, NUMA_NO_NODE,
__builtin_return_address(0));
}
-EXPORT_SYMBOL(vmalloc_32);
+EXPORT_SYMBOL(vmalloc_32_noprof);
/**
- * vmalloc_32_user - allocate zeroed virtually contiguous 32bit memory
+ * vmalloc_32_user_noprof - allocate zeroed virtually contiguous 32bit memory
* @size: allocation size
*
* The resulting memory area is 32bit addressable and zeroed so it can be
@@ -3553,14 +3553,14 @@ EXPORT_SYMBOL(vmalloc_32);
*
* Return: pointer to the allocated memory or %NULL on error
*/
-void *vmalloc_32_user(unsigned long size)
+void *vmalloc_32_user_noprof(unsigned long size)
{
- return __vmalloc_node_range(size, SHMLBA, VMALLOC_START, VMALLOC_END,
+ return __vmalloc_node_range_noprof(size, SHMLBA, VMALLOC_START, VMALLOC_END,
GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL,
VM_USERMAP, NUMA_NO_NODE,
__builtin_return_address(0));
}
-EXPORT_SYMBOL(vmalloc_32_user);
+EXPORT_SYMBOL(vmalloc_32_user_noprof);
/*
* Atomically zero bytes in the iterator.
--
2.43.0.687.g38aa6559b0-goog
From: Kent Overstreet <[email protected]>
This gives better memory allocation profiling results; rhashtable
allocations will be accounted to the code that initialized the
rhashtable.
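
The idea can be sketched in plain userspace C as below. This is only an
illustration, not the kernel code: struct tag, current_tag, tag_save(),
tag_restore(), account() and struct table are hypothetical stand-ins for
struct alloc_tag, current->alloc_tag, alloc_tag_save(),
alloc_tag_restore(), the accounting done inside the allocator, and
struct rhashtable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct alloc_tag: one counter per call site. */
struct tag {
	const char *site;
	size_t bytes;
};

/* Hypothetical stand-in for current->alloc_tag. */
static struct tag *current_tag;

/* Make @t the active tag and return the previously active one. */
static struct tag *tag_save(struct tag *t)
{
	struct tag *old = current_tag;

	current_tag = t;
	return old;
}

/* Reinstate the tag that was active before tag_save(). */
static void tag_restore(struct tag *old)
{
	current_tag = old;
}

/* Charge @bytes to whichever tag is active (what an allocator would do). */
static void account(size_t bytes)
{
	if (current_tag)
		current_tag->bytes += bytes;
}

/* Hypothetical stand-in for struct rhashtable. */
struct table {
	struct tag *owner;
};

/* Remember which call site initialized the table. */
static void table_init(struct table *t)
{
	t->owner = current_tag;
}

/* Later internal allocations are charged to the initializer's tag. */
static void table_grow(struct table *t, size_t bytes)
{
	struct tag *old = tag_save(t->owner);

	assert(t->owner != NULL);
	account(bytes);
	tag_restore(old);
}
```

With this shape, a resize triggered from an unrelated context is still
charged to the call site that created the table, and the caller's own
tag is left untouched once the allocation is done.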
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/rhashtable-types.h | 11 +++++--
lib/rhashtable.c | 52 +++++++++++++++++++++++++-------
2 files changed, 50 insertions(+), 13 deletions(-)
diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
index b6f3797277ff..015c8298bebc 100644
--- a/include/linux/rhashtable-types.h
+++ b/include/linux/rhashtable-types.h
@@ -9,6 +9,7 @@
#ifndef _LINUX_RHASHTABLE_TYPES_H
#define _LINUX_RHASHTABLE_TYPES_H
+#include <linux/alloc_tag.h>
#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/mutex.h>
@@ -88,6 +89,9 @@ struct rhashtable {
struct mutex mutex;
spinlock_t lock;
atomic_t nelems;
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ struct alloc_tag *alloc_tag;
+#endif
};
/**
@@ -127,9 +131,12 @@ struct rhashtable_iter {
bool end_of_table;
};
-int rhashtable_init(struct rhashtable *ht,
+int rhashtable_init_noprof(struct rhashtable *ht,
const struct rhashtable_params *params);
-int rhltable_init(struct rhltable *hlt,
+#define rhashtable_init(...) alloc_hooks(rhashtable_init_noprof(__VA_ARGS__))
+
+int rhltable_init_noprof(struct rhltable *hlt,
const struct rhashtable_params *params);
+#define rhltable_init(...) alloc_hooks(rhltable_init_noprof(__VA_ARGS__))
#endif /* _LINUX_RHASHTABLE_TYPES_H */
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 6ae2ba8e06a2..b62116f332b8 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -63,6 +63,27 @@ EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held);
#define ASSERT_RHT_MUTEX(HT)
#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static inline void rhashtable_alloc_tag_init(struct rhashtable *ht)
+{
+ ht->alloc_tag = current->alloc_tag;
+}
+
+static inline struct alloc_tag *rhashtable_alloc_tag_save(struct rhashtable *ht)
+{
+ return alloc_tag_save(ht->alloc_tag);
+}
+
+static inline void rhashtable_alloc_tag_restore(struct rhashtable *ht, struct alloc_tag *old)
+{
+ alloc_tag_restore(ht->alloc_tag, old);
+}
+#else
+#define rhashtable_alloc_tag_init(ht)
+static inline struct alloc_tag *rhashtable_alloc_tag_save(struct rhashtable *ht) { return NULL; }
+#define rhashtable_alloc_tag_restore(ht, old)
+#endif
+
static inline union nested_table *nested_table_top(
const struct bucket_table *tbl)
{
@@ -130,7 +151,7 @@ static union nested_table *nested_table_alloc(struct rhashtable *ht,
if (ntbl)
return ntbl;
- ntbl = kzalloc(PAGE_SIZE, GFP_ATOMIC);
+ ntbl = kmalloc_noprof(PAGE_SIZE, GFP_ATOMIC|__GFP_ZERO);
if (ntbl && leaf) {
for (i = 0; i < PAGE_SIZE / sizeof(ntbl[0]); i++)
@@ -157,7 +178,7 @@ static struct bucket_table *nested_bucket_table_alloc(struct rhashtable *ht,
size = sizeof(*tbl) + sizeof(tbl->buckets[0]);
- tbl = kzalloc(size, gfp);
+ tbl = kmalloc_noprof(size, gfp|__GFP_ZERO);
if (!tbl)
return NULL;
@@ -180,8 +201,10 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
size_t size;
int i;
static struct lock_class_key __key;
+ struct alloc_tag * __maybe_unused old = rhashtable_alloc_tag_save(ht);
- tbl = kvzalloc(struct_size(tbl, buckets, nbuckets), gfp);
+ tbl = kvmalloc_node_noprof(struct_size(tbl, buckets, nbuckets),
+ gfp|__GFP_ZERO, NUMA_NO_NODE);
size = nbuckets;
@@ -190,6 +213,8 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
nbuckets = 0;
}
+ rhashtable_alloc_tag_restore(ht, old);
+
if (tbl == NULL)
return NULL;
@@ -975,7 +1000,7 @@ static u32 rhashtable_jhash2(const void *key, u32 length, u32 seed)
}
/**
- * rhashtable_init - initialize a new hash table
+ * rhashtable_init_noprof - initialize a new hash table
* @ht: hash table to be initialized
* @params: configuration parameters
*
@@ -1016,7 +1041,7 @@ static u32 rhashtable_jhash2(const void *key, u32 length, u32 seed)
* .obj_hashfn = my_hash_fn,
* };
*/
-int rhashtable_init(struct rhashtable *ht,
+int rhashtable_init_noprof(struct rhashtable *ht,
const struct rhashtable_params *params)
{
struct bucket_table *tbl;
@@ -1031,6 +1056,8 @@ int rhashtable_init(struct rhashtable *ht,
spin_lock_init(&ht->lock);
memcpy(&ht->p, params, sizeof(*params));
+ rhashtable_alloc_tag_init(ht);
+
if (params->min_size)
ht->p.min_size = roundup_pow_of_two(params->min_size);
@@ -1076,26 +1103,26 @@ int rhashtable_init(struct rhashtable *ht,
return 0;
}
-EXPORT_SYMBOL_GPL(rhashtable_init);
+EXPORT_SYMBOL_GPL(rhashtable_init_noprof);
/**
- * rhltable_init - initialize a new hash list table
+ * rhltable_init_noprof - initialize a new hash list table
* @hlt: hash list table to be initialized
* @params: configuration parameters
*
* Initializes a new hash list table.
*
- * See documentation for rhashtable_init.
+ * See documentation for rhashtable_init_noprof.
*/
-int rhltable_init(struct rhltable *hlt, const struct rhashtable_params *params)
+int rhltable_init_noprof(struct rhltable *hlt, const struct rhashtable_params *params)
{
int err;
- err = rhashtable_init(&hlt->ht, params);
+ err = rhashtable_init_noprof(&hlt->ht, params);
hlt->ht.rhlist = true;
return err;
}
-EXPORT_SYMBOL_GPL(rhltable_init);
+EXPORT_SYMBOL_GPL(rhltable_init_noprof);
static void rhashtable_free_one(struct rhashtable *ht, struct rhash_head *obj,
void (*free_fn)(void *ptr, void *arg),
@@ -1222,6 +1249,7 @@ struct rhash_lock_head __rcu **rht_bucket_nested_insert(
unsigned int index = hash & ((1 << tbl->nest) - 1);
unsigned int size = tbl->size >> tbl->nest;
union nested_table *ntbl;
+ struct alloc_tag * __maybe_unused old = rhashtable_alloc_tag_save(ht);
ntbl = nested_table_top(tbl);
hash >>= tbl->nest;
@@ -1236,6 +1264,8 @@ struct rhash_lock_head __rcu **rht_bucket_nested_insert(
size <= (1 << shift));
}
+ rhashtable_alloc_tag_restore(ht, old);
+
if (!ntbl)
return NULL;
--
2.43.0.687.g38aa6559b0-goog
Include allocations in show_mem reports.
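
The report keeps only the largest tags by bytes, using a small array
kept sorted in descending order. A minimal userspace sketch of that
selection step follows; struct entry, TOP_N and top_insert() are
hypothetical names used for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOP_N 10

/* Hypothetical stand-in for a (codetag, bytes) pair. */
struct entry {
	const char *name;
	size_t bytes;
};

/*
 * Insert @e into @top, which holds @nr entries sorted by descending
 * bytes and has room for TOP_N entries. Returns the new entry count.
 * When the array is full, the smallest entry is dropped to make room.
 */
static size_t top_insert(struct entry *top, size_t nr, struct entry e)
{
	size_t i;

	assert(nr <= TOP_N);

	/* Find the first slot whose entry is smaller than the new one. */
	for (i = 0; i < nr; i++)
		if (e.bytes > top[i].bytes)
			break;

	if (i < TOP_N) {
		/* Full array: forget the last (smallest) entry. */
		if (nr == TOP_N)
			nr--;
		/* Shift the tail down by one and place the new entry. */
		memmove(&top[i + 1], &top[i], sizeof(top[0]) * (nr - i));
		top[i] = e;
		nr++;
	}
	return nr;
}
```

Because the array is small and fixed in size, this needs no allocation
beyond the caller's buffer, which matters when the report is produced
under memory pressure.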
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/alloc_tag.h | 2 ++
lib/alloc_tag.c | 38 ++++++++++++++++++++++++++++++++++++++
mm/show_mem.c | 15 +++++++++++++++
3 files changed, 55 insertions(+)
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 3fe51e67e231..0a5973c4ad77 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -30,6 +30,8 @@ struct alloc_tag {
#ifdef CONFIG_MEM_ALLOC_PROFILING
+void alloc_tags_show_mem_report(struct seq_buf *s);
+
static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
{
return container_of(ct, struct alloc_tag, ct);
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 2d5226d9262d..54312c213860 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -96,6 +96,44 @@ static const struct seq_operations allocinfo_seq_op = {
.show = allocinfo_show,
};
+void alloc_tags_show_mem_report(struct seq_buf *s)
+{
+ struct codetag_iterator iter;
+ struct codetag *ct;
+ struct {
+ struct codetag *tag;
+ size_t bytes;
+ } tags[10], n;
+ unsigned int i, nr = 0;
+
+ codetag_lock_module_list(alloc_tag_cttype, true);
+ iter = codetag_get_ct_iter(alloc_tag_cttype);
+ while ((ct = codetag_next_ct(&iter))) {
+ struct alloc_tag_counters counter = alloc_tag_read(ct_to_alloc_tag(ct));
+
+ n.tag = ct;
+ n.bytes = counter.bytes;
+
+ for (i = 0; i < nr; i++)
+ if (n.bytes > tags[i].bytes)
+ break;
+
+ if (i < ARRAY_SIZE(tags)) {
+ nr -= nr == ARRAY_SIZE(tags);
+ memmove(&tags[i + 1],
+ &tags[i],
+ sizeof(tags[0]) * (nr - i));
+ nr++;
+ tags[i] = n;
+ }
+ }
+
+ for (i = 0; i < nr; i++)
+ alloc_tag_to_text(s, tags[i].tag);
+
+ codetag_lock_module_list(alloc_tag_cttype, false);
+}
+
static void __init procfs_init(void)
{
proc_create_seq("allocinfo", 0444, NULL, &allocinfo_seq_op);
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 8dcfafbd283c..d514c15ca076 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -12,6 +12,7 @@
#include <linux/hugetlb.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
+#include <linux/seq_buf.h>
#include <linux/swap.h>
#include <linux/vmstat.h>
@@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
#ifdef CONFIG_MEMORY_FAILURE
printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ {
+ struct seq_buf s;
+ char *buf = kmalloc(4096, GFP_ATOMIC);
+
+ if (buf) {
+ printk("Memory allocations:\n");
+ seq_buf_init(&s, buf, 4096);
+ alloc_tags_show_mem_report(&s);
+ printk("%s", buf);
+ kfree(buf);
+ }
+ }
+#endif
}
--
2.43.0.687.g38aa6559b0-goog
objext objects are created with the __GFP_NO_OBJ_EXT flag and therefore
have no corresponding objext themselves (otherwise we would get infinite
recursion). When these objects are freed their codetag is empty, which
leads to false warnings when CONFIG_MEM_ALLOC_PROFILING_DEBUG is
enabled. Introduce a special codetag value, CODETAG_EMPTY, to mark
allocations which intentionally lack a codetag and so avoid these
warnings. Set objext codetags to CODETAG_EMPTY before freeing to
indicate that the codetag is expected to be empty.
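
A userspace sketch of the sentinel idea, for illustration only:
struct tag, struct ref, TAG_EMPTY, ref_sub() and the warnings counter
are hypothetical stand-ins for struct alloc_tag, union codetag_ref,
CODETAG_EMPTY, the free-path accounting and the debug WARN.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct alloc_tag. */
struct tag {
	long bytes;
};

/* Hypothetical stand-in for union codetag_ref. */
struct ref {
	struct tag *ct;
};

/* Sentinel meaning "this reference is empty on purpose". */
#define TAG_EMPTY ((struct tag *)1)

/* Counts how often the debug check would have warned. */
static int warnings;

/* Free-path accounting with the debug check folded in. */
static void ref_sub(struct ref *ref, long bytes)
{
	assert(ref != NULL);

	if (!ref->ct) {
		/* No tag and no marker: this is the suspicious case. */
		warnings++;
		return;
	}
	if (ref->ct == TAG_EMPTY) {
		/* Marked as intentionally untagged: stay silent. */
		ref->ct = NULL;
		return;
	}
	ref->ct->bytes -= bytes;
	ref->ct = NULL;
}
```

The sentinel only needs to be a non-NULL value that can never be a valid
tag address, which is why a small integer cast to a pointer is enough.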
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/alloc_tag.h | 26 ++++++++++++++++++++++++++
mm/slab.h | 25 +++++++++++++++++++++++++
mm/slab_common.c | 1 +
mm/slub.c | 8 ++++++++
4 files changed, 60 insertions(+)
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 0a5973c4ad77..1f3207097b03 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -77,6 +77,27 @@ static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
return v;
}
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+
+#define CODETAG_EMPTY (void *)1
+
+static inline bool is_codetag_empty(union codetag_ref *ref)
+{
+ return ref->ct == CODETAG_EMPTY;
+}
+
+static inline void set_codetag_empty(union codetag_ref *ref)
+{
+ if (ref)
+ ref->ct = CODETAG_EMPTY;
+}
+
+#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
+static inline bool is_codetag_empty(union codetag_ref *ref) { return false; }
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes)
{
struct alloc_tag *tag;
@@ -87,6 +108,11 @@ static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes)
if (!ref || !ref->ct)
return;
+ if (is_codetag_empty(ref)) {
+ ref->ct = NULL;
+ return;
+ }
+
tag = ct_to_alloc_tag(ref->ct);
this_cpu_sub(tag->counters->bytes, bytes);
diff --git a/mm/slab.h b/mm/slab.h
index c4bd0d5348cb..cf332a839bf4 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -567,6 +567,31 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
gfp_t gfp, bool new_slab);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+
+static inline void mark_objexts_empty(struct slabobj_ext *obj_exts)
+{
+ struct slabobj_ext *slab_exts;
+ struct slab *obj_exts_slab;
+
+ obj_exts_slab = virt_to_slab(obj_exts);
+ slab_exts = slab_obj_exts(obj_exts_slab);
+ if (slab_exts) {
+ unsigned int offs = obj_to_index(obj_exts_slab->slab_cache,
+ obj_exts_slab, obj_exts);
+ /* codetag should be NULL */
+ WARN_ON(slab_exts[offs].ref.ct);
+ set_codetag_empty(&slab_exts[offs].ref);
+ }
+}
+
+#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
+static inline void mark_objexts_empty(struct slabobj_ext *obj_exts) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
static inline bool need_slab_obj_ext(void)
{
#ifdef CONFIG_MEM_ALLOC_PROFILING
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 21b0b9e9cd9e..d5f75d04ced2 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -242,6 +242,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
* assign slabobj_exts in parallel. In this case the existing
* objcg vector should be reused.
*/
+ mark_objexts_empty(vec);
kfree(vec);
return 0;
}
diff --git a/mm/slub.c b/mm/slub.c
index 4d480784942e..1136ff18b4fe 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1890,6 +1890,14 @@ static inline void free_slab_obj_exts(struct slab *slab)
if (!obj_exts)
return;
+ /*
+ * obj_exts was created with __GFP_NO_OBJ_EXT flag, therefore its
+ * corresponding extension will be NULL. alloc_tag_sub() will throw a
+ * warning if slab has extensions but the extension of an object is
+ * NULL, therefore replace NULL with CODETAG_EMPTY to indicate that
+ * the extension for obj_exts is expected to be NULL.
+ */
+ mark_objexts_empty(obj_exts);
kfree(obj_exts);
slab->obj_exts = 0;
}
--
2.43.0.687.g38aa6559b0-goog
To avoid debug warnings while freeing reserved pages, which were not
allocated with the usual allocators, mark their codetags as empty before
freeing.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/alloc_tag.h | 2 ++
include/linux/mm.h | 8 ++++++++
include/linux/pgalloc_tag.h | 2 ++
mm/mm_init.c | 9 +++++++++
4 files changed, 21 insertions(+)
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 1f3207097b03..102caf62c2a9 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -95,6 +95,7 @@ static inline void set_codetag_empty(union codetag_ref *ref)
#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
static inline bool is_codetag_empty(union codetag_ref *ref) { return false; }
+static inline void set_codetag_empty(union codetag_ref *ref) {}
#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
@@ -155,6 +156,7 @@ static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
size_t bytes) {}
+static inline void set_codetag_empty(union codetag_ref *ref) {}
#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..ac1b661987ed 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -5,6 +5,7 @@
#include <linux/errno.h>
#include <linux/mmdebug.h>
#include <linux/gfp.h>
+#include <linux/pgalloc_tag.h>
#include <linux/bug.h>
#include <linux/list.h>
#include <linux/mmzone.h>
@@ -3112,6 +3113,13 @@ extern void reserve_bootmem_region(phys_addr_t start,
/* Free the reserved page into the buddy system, so it gets managed. */
static inline void free_reserved_page(struct page *page)
{
+ union codetag_ref *ref;
+
+ ref = get_page_tag_ref(page);
+ if (ref) {
+ set_codetag_empty(ref);
+ put_page_tag_ref(ref);
+ }
ClearPageReserved(page);
init_page_count(page);
__free_page(page);
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index 0174aff5e871..ae9b0f359264 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -93,6 +93,8 @@ static inline void pgalloc_tag_split(struct page *page, unsigned int nr)
#else /* CONFIG_MEM_ALLOC_PROFILING */
+static inline union codetag_ref *get_page_tag_ref(struct page *page) { return NULL; }
+static inline void put_page_tag_ref(union codetag_ref *ref) {}
static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
unsigned int order) {}
static inline void pgalloc_tag_sub(struct page *page, unsigned int order) {}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index e9ea2919d02d..f5386632fe86 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2566,6 +2566,7 @@ void __init set_dma_reserve(unsigned long new_dma_reserve)
void __init memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
+ union codetag_ref *ref;
if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) {
int nid = early_pfn_to_nid(pfn);
@@ -2578,6 +2579,14 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
/* KMSAN will take care of these pages. */
return;
}
+
+ /* pages were reserved and not allocated */
+ ref = get_page_tag_ref(page);
+ if (ref) {
+ set_codetag_empty(ref);
+ put_page_tag_ref(ref);
+ }
+
__free_pages_core(page, order);
}
--
2.43.0.687.g38aa6559b0-goog
If slabobj_ext vector allocation for a slab object fails and later
succeeds for another object in the same slab, the slabobj_ext for the
original object will be NULL and will be flagged when
CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled.
Mark failed slabobj_ext vector allocations using a new objext_flags flag
stored in the lower bits of slab->obj_exts. When a later allocation
succeeds, it marks all tag references in the same slabobj_ext vector as
empty to avoid the warnings raised by the
CONFIG_MEM_ALLOC_PROFILING_DEBUG checks.
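
The scheme can be sketched in userspace C with C11 atomics as below.
All names here (struct slab_like, struct ext, EXT_ALLOC_FAIL,
mark_failed(), install_vec(), slab_exts()) are hypothetical and chosen
for illustration. The sketch also refuses to replace a vector that is
already installed; that guard is an assumption of the sketch and is not
taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define EXT_ALLOC_FAIL	((uintptr_t)1)	/* flag kept in the low bit */
#define EXT_FLAGS_MASK	((uintptr_t)1)
#define TAG_EMPTY	((void *)1)	/* "empty on purpose" marker */

/* Hypothetical stand-in for struct slabobj_ext. */
struct ext {
	void *ct;
};

/* Hypothetical stand-in for the relevant part of struct slab. */
struct slab_like {
	_Atomic uintptr_t obj_exts;	/* vector pointer plus flag bits */
	unsigned int objects;
};

/* Record that allocating the vector failed for this slab. */
static void mark_failed(struct slab_like *s)
{
	atomic_store(&s->obj_exts, EXT_ALLOC_FAIL);
}

/* Install @vec; returns 0 on success, -1 if the caller must free @vec. */
static int install_vec(struct slab_like *s, struct ext *vec)
{
	uintptr_t old = atomic_load(&s->obj_exts);
	uintptr_t new_word = (uintptr_t)vec;
	unsigned int i;

	/* The pointer must be aligned so the flag bits are free. */
	assert((new_word & EXT_FLAGS_MASK) == 0);

	/* Assumption of this sketch: never replace a live vector. */
	if (old & ~EXT_FLAGS_MASK)
		return -1;

	/*
	 * An earlier failure means live objects exist without a tag
	 * reference; mark every slot empty so later frees do not warn.
	 */
	if (old & EXT_ALLOC_FAIL)
		for (i = 0; i < s->objects; i++)
			vec[i].ct = TAG_EMPTY;

	if (!atomic_compare_exchange_strong(&s->obj_exts, &old, new_word))
		return -1;	/* lost a race with another installer */
	return 0;
}

/* Strip the flag bits to recover the vector pointer. */
static struct ext *slab_exts(struct slab_like *s)
{
	return (struct ext *)(atomic_load(&s->obj_exts) & ~EXT_FLAGS_MASK);
}
```

Keeping the flag in the same word as the pointer means the failure state
and the vector can be swapped in a single compare-and-exchange.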
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/memcontrol.h | 4 +++-
mm/slab.h | 25 +++++++++++++++++++++++++
mm/slab_common.c | 22 +++++++++++++++-------
3 files changed, 43 insertions(+), 8 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2b010316016c..f95241ca9052 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -365,8 +365,10 @@ enum page_memcg_data_flags {
#endif /* CONFIG_MEMCG */
enum objext_flags {
+ /* slabobj_ext vector failed to allocate */
+ OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG,
/* the next bit after the last actual flag */
- __NR_OBJEXTS_FLAGS = __FIRST_OBJEXT_FLAG,
+ __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1),
};
#define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
diff --git a/mm/slab.h b/mm/slab.h
index cf332a839bf4..7bb3900f83ef 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -586,9 +586,34 @@ static inline void mark_objexts_empty(struct slabobj_ext *obj_exts)
}
}
+static inline void mark_failed_objexts_alloc(struct slab *slab)
+{
+ slab->obj_exts = OBJEXTS_ALLOC_FAIL;
+}
+
+static inline void handle_failed_objexts_alloc(unsigned long obj_exts,
+ struct slabobj_ext *vec, unsigned int objects)
+{
+ /*
+ * If vector previously failed to allocate then we have live
+ * objects with no tag reference. Mark all references in this
+ * vector as empty to avoid warnings later on.
+ */
+ if (obj_exts & OBJEXTS_ALLOC_FAIL) {
+ unsigned int i;
+
+ for (i = 0; i < objects; i++)
+ set_codetag_empty(&vec[i].ref);
+ }
+}
+
+
#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
static inline void mark_objexts_empty(struct slabobj_ext *obj_exts) {}
+static inline void mark_failed_objexts_alloc(struct slab *slab) {}
+static inline void handle_failed_objexts_alloc(unsigned long obj_exts,
+ struct slabobj_ext *vec, unsigned int objects) {}
#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index d5f75d04ced2..489c7a8ba8f1 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -214,29 +214,37 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
gfp_t gfp, bool new_slab)
{
unsigned int objects = objs_per_slab(s, slab);
- unsigned long obj_exts;
- void *vec;
+ unsigned long new_exts;
+ unsigned long old_exts;
+ struct slabobj_ext *vec;
gfp &= ~OBJCGS_CLEAR_MASK;
/* Prevent recursive extension vector allocation */
gfp |= __GFP_NO_OBJ_EXT;
vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
slab_nid(slab));
- if (!vec)
+ if (!vec) {
+ /* Mark vectors which failed to allocate */
+ if (new_slab)
+ mark_failed_objexts_alloc(slab);
+
return -ENOMEM;
+ }
- obj_exts = (unsigned long)vec;
+ new_exts = (unsigned long)vec;
#ifdef CONFIG_MEMCG
- obj_exts |= MEMCG_DATA_OBJEXTS;
+ new_exts |= MEMCG_DATA_OBJEXTS;
#endif
+ old_exts = slab->obj_exts;
+ handle_failed_objexts_alloc(old_exts, vec, objects);
if (new_slab) {
/*
* If the slab is brand new and nobody can yet access its
* obj_exts, no synchronization is required and obj_exts can
* be simply assigned.
*/
- slab->obj_exts = obj_exts;
- } else if (cmpxchg(&slab->obj_exts, 0, obj_exts)) {
+ slab->obj_exts = new_exts;
+ } else if (cmpxchg(&slab->obj_exts, old_exts, new_exts) != old_exts) {
/*
* If the slab is already in use, somebody can allocate and
* assign slabobj_exts in parallel. In this case the existing
--
2.43.0.687.g38aa6559b0-goog
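The failed-vector handling in the patch above can be modelled in
userspace. This is a rough sketch only, not the kernel code: every
demo_* name, the sentinel, and the flag value are hypothetical stand-ins
for OBJEXTS_ALLOC_FAIL, set_codetag_empty() and slab->obj_exts. The
cmpxchg() path for slabs already in use is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag kept in the low bit of the obj_exts word. */
#define DEMO_ALLOC_FAIL ((uintptr_t)1)

/* Hypothetical sentinel meaning "this object is deliberately untagged". */
static const char demo_empty_tag = 0;
#define DEMO_EMPTY ((const void *)&demo_empty_tag)

struct demo_ext {
	const void *ref;	/* tag reference for one object */
};

struct demo_slab {
	uintptr_t obj_exts;	/* vector pointer, or DEMO_ALLOC_FAIL */
};

/* Remember that the extension vector could not be allocated. */
static void demo_mark_failed(struct demo_slab *slab)
{
	slab->obj_exts = DEMO_ALLOC_FAIL;
}

/*
 * Install a freshly allocated vector. If an earlier attempt failed,
 * objects may already be live without a tag, so every reference is
 * marked empty before the vector becomes visible.
 */
static void demo_install(struct demo_slab *slab, struct demo_ext *vec,
			 unsigned int objects)
{
	unsigned int i;

	if (slab->obj_exts & DEMO_ALLOC_FAIL) {
		for (i = 0; i < objects; i++)
			vec[i].ref = DEMO_EMPTY;
	}
	slab->obj_exts = (uintptr_t)vec;
}
```

In this model a slab that never saw a failure keeps its references
NULL, while a slab that did gets them all set to the sentinel.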
From: Kent Overstreet <[email protected]>
The new code & libraries added are being maintained - mark them as such.
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
MAINTAINERS | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 73d898383e51..6da139418775 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5210,6 +5210,13 @@ S: Supported
F: Documentation/process/code-of-conduct-interpretation.rst
F: Documentation/process/code-of-conduct.rst
+CODE TAGGING
+M: Suren Baghdasaryan <[email protected]>
+M: Kent Overstreet <[email protected]>
+S: Maintained
+F: include/linux/codetag.h
+F: lib/codetag.c
+
COMEDI DRIVERS
M: Ian Abbott <[email protected]>
M: H Hartley Sweeten <[email protected]>
@@ -14056,6 +14063,15 @@ F: mm/memblock.c
F: mm/mm_init.c
F: tools/testing/memblock/
+MEMORY ALLOCATION PROFILING
+M: Suren Baghdasaryan <[email protected]>
+M: Kent Overstreet <[email protected]>
+S: Maintained
+F: include/linux/alloc_tag.h
+F: include/linux/codetag_ctx.h
+F: lib/alloc_tag.c
+F: lib/pgalloc_tag.c
+
MEMORY CONTROLLER DRIVERS
M: Krzysztof Kozlowski <[email protected]>
L: [email protected]
--
2.43.0.687.g38aa6559b0-goog
To store a code tag for every slab object, a codetag reference is embedded
into slabobj_ext when CONFIG_MEM_ALLOC_PROFILING=y.
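
As a rough userspace illustration of the layout described above (not the
kernel code), the sketch below keeps one extension entry per slab object
and looks up an object's tag reference by its index within the slab. All
demo_* names and the DEMO_* switches are hypothetical; they only mimic
the shape of slabobj_ext and its config-dependent fields.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two kernel config options. */
#define DEMO_MEMCG_KMEM 1
#define DEMO_MEM_ALLOC_PROFILING 1

struct demo_codetag {
	const char *file;
	int line;
};

union demo_codetag_ref {
	struct demo_codetag *ct;
};

/* Each field exists only when the feature that needs it is enabled. */
struct demo_slabobj_ext {
#if DEMO_MEMCG_KMEM
	void *objcg;
#endif
#if DEMO_MEM_ALLOC_PROFILING
	union demo_codetag_ref ref;
#endif
};

struct demo_slab {
	size_t obj_size;			/* size of one object */
	unsigned int objects;			/* objects in this slab */
	char *base;				/* address of object 0 */
	struct demo_slabobj_ext *obj_exts;	/* one entry per object */
};

/* Index of an object inside its slab. */
static unsigned int demo_obj_index(const struct demo_slab *slab,
				   const void *obj)
{
	return (unsigned int)(((const char *)obj - slab->base) /
			      slab->obj_size);
}

/* Tag reference that belongs to the given object. */
static union demo_codetag_ref *demo_obj_ref(struct demo_slab *slab,
					    const void *obj)
{
	return &slab->obj_exts[demo_obj_index(slab, obj)].ref;
}
```

The point of the sketch is that the reference lives next to the objcg
pointer in the same per-object vector, so both users share one lookup.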
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/memcontrol.h | 5 +++++
lib/Kconfig.debug | 1 +
mm/slab.h | 4 ++++
3 files changed, 10 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f3584e98b640..2b010316016c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1653,7 +1653,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
* if MEMCG_DATA_OBJEXTS is set.
*/
struct slabobj_ext {
+#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *objcg;
+#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ union codetag_ref ref;
+#endif
} __aligned(8);
static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 7bbdb0ddb011..9ecfcdb54417 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -979,6 +979,7 @@ config MEM_ALLOC_PROFILING
depends on !DEBUG_FORCE_WEAK_PER_CPU
select CODE_TAGGING
select PAGE_EXTENSION
+ select SLAB_OBJ_EXT
help
Track allocation source code and record total allocation size
initiated at that code location. The mechanism can be used to track
diff --git a/mm/slab.h b/mm/slab.h
index 77cf7474fe46..224a4b2305fb 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -569,6 +569,10 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
static inline bool need_slab_obj_ext(void)
{
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ if (mem_alloc_profiling_enabled())
+ return true;
+#endif
/*
* CONFIG_MEMCG_KMEM creates vector of obj_cgroup objects conditionally
* inside memcg_slab_post_alloc_hook. No other users for now.
--
2.43.0.687.g38aa6559b0-goog
From: Kent Overstreet <[email protected]>
To store a codetag for every per-cpu allocation, a codetag reference is
embedded into pcpuobj_ext when CONFIG_MEM_ALLOC_PROFILING=y. Hooks that
use the newly introduced codetag are added.
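
A minimal userspace sketch of what such hooks do, under stated
assumptions: the demo_* types are hypothetical, the tag is passed in
explicitly instead of coming from current->alloc_tag, and the shift
value of 2 (a 4-byte minimum allocation unit) is assumed for the
example. The chunk offset is converted into an index into the per-chunk
extension array, and the tag's counters are adjusted on alloc and free.

```c
#include <stddef.h>

/* Assumed minimum allocation unit of 4 bytes for this sketch. */
#define DEMO_MIN_ALLOC_SHIFT 2

struct demo_tag {
	size_t bytes;		/* bytes currently charged to this tag */
	unsigned long calls;	/* live allocations charged to this tag */
};

struct demo_ext {
	struct demo_tag *tag;	/* tag reference for one allocation slot */
};

struct demo_chunk {
	struct demo_ext *obj_exts;	/* may be NULL if never allocated */
};

/* Charge an allocation at chunk offset off to tag. */
static void demo_alloc_hook(struct demo_chunk *chunk, int off, size_t size,
			    struct demo_tag *tag)
{
	if (!chunk->obj_exts)
		return;
	chunk->obj_exts[off >> DEMO_MIN_ALLOC_SHIFT].tag = tag;
	tag->bytes += size;
	tag->calls++;
}

/* Undo the charge recorded for the allocation at chunk offset off. */
static void demo_free_hook(struct demo_chunk *chunk, int off, size_t size)
{
	struct demo_ext *ext;

	if (!chunk->obj_exts)
		return;
	ext = &chunk->obj_exts[off >> DEMO_MIN_ALLOC_SHIFT];
	if (!ext->tag)
		return;
	ext->tag->bytes -= size;
	ext->tag->calls--;
	ext->tag = NULL;
}
```

Both hooks silently do nothing when the chunk has no extension array,
which mirrors the likely(chunk->obj_exts) guard in the patch.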
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/percpu-internal.h | 11 +++++++++--
mm/percpu.c | 26 ++++++++++++++++++++++++++
2 files changed, 35 insertions(+), 2 deletions(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index e62d582f4bf3..7e42f0ca3b7b 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -36,9 +36,12 @@ struct pcpuobj_ext {
#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cgroup;
#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+ union codetag_ref tag;
+#endif
};
-#ifdef CONFIG_MEMCG_KMEM
+#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEM_ALLOC_PROFILING)
#define NEED_PCPUOBJ_EXT
#endif
@@ -86,7 +89,11 @@ struct pcpu_chunk {
static inline bool need_pcpuobj_ext(void)
{
- return !mem_cgroup_kmem_disabled();
+ if (IS_ENABLED(CONFIG_MEM_ALLOC_PROFILING))
+ return true;
+ if (!mem_cgroup_kmem_disabled())
+ return true;
+ return false;
}
extern spinlock_t pcpu_lock;
diff --git a/mm/percpu.c b/mm/percpu.c
index 2e5edaad9cc3..578531ea1f43 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1699,6 +1699,32 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
}
#endif /* CONFIG_MEMCG_KMEM */
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
+ size_t size)
+{
+ if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts)) {
+ alloc_tag_add(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag,
+ current->alloc_tag, size);
+ }
+}
+
+static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+ if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts))
+ alloc_tag_sub_noalloc(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag, size);
+}
+#else
+static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
+ size_t size)
+{
+}
+
+static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+}
+#endif
+
/**
* pcpu_alloc - the percpu allocator
* @size: size of area to allocate in bytes
--
2.43.0.687.g38aa6559b0-goog
When a high-order page is split into smaller ones, each newly split
page should get its own codetag reference. The original codetag is reused
for these pages, but it is recorded as a 0-byte allocation because the
original codetag already accounts for the whole high-order page.
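
The accounting rule can be sketched in userspace as follows. This is an
illustration under assumptions, not the kernel implementation: the
demo_* types are hypothetical, and the refs counter exists only so the
example can show that new references are created while the byte total
stays the same.

```c
#include <stddef.h>

struct demo_tag {
	size_t bytes;		/* bytes charged to this call site */
	unsigned long refs;	/* references pointing at this tag */
};

struct demo_page_ext {
	struct demo_tag *tag;	/* tag reference for one page */
};

/* Point ext at tag and charge bytes to it. */
static void demo_tag_add(struct demo_page_ext *ext, struct demo_tag *tag,
			 size_t bytes)
{
	ext->tag = tag;
	tag->bytes += bytes;
	tag->refs++;
}

/*
 * Split a block of nr pages whose head is exts[0]. Every tail page
 * gets a reference to the head's tag, charged with 0 bytes, because
 * the head already accounts for the full allocation.
 */
static void demo_tag_split(struct demo_page_ext *exts, unsigned int nr)
{
	struct demo_tag *tag = exts[0].tag;
	unsigned int i;

	if (!tag)
		return;
	for (i = 1; i < nr; i++)
		demo_tag_add(&exts[i], tag, 0);
}
```

After the split, freeing any of the pages can find its tag, and the
total charged to the call site is unchanged.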
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/pgalloc_tag.h | 30 ++++++++++++++++++++++++++++++
mm/huge_memory.c | 2 ++
mm/page_alloc.c | 2 ++
3 files changed, 34 insertions(+)
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index a060c26eb449..0174aff5e871 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -62,11 +62,41 @@ static inline void pgalloc_tag_sub(struct page *page, unsigned int order)
}
}
+static inline void pgalloc_tag_split(struct page *page, unsigned int nr)
+{
+ int i;
+ struct page_ext *page_ext;
+ union codetag_ref *ref;
+ struct alloc_tag *tag;
+
+ if (!mem_alloc_profiling_enabled())
+ return;
+
+ page_ext = page_ext_get(page);
+ if (unlikely(!page_ext))
+ return;
+
+ ref = codetag_ref_from_page_ext(page_ext);
+ if (!ref->ct)
+ goto out;
+
+ tag = ct_to_alloc_tag(ref->ct);
+ page_ext = page_ext_next(page_ext);
+ for (i = 1; i < nr; i++) {
+ /* New reference with 0 bytes accounted */
+ alloc_tag_add(codetag_ref_from_page_ext(page_ext), tag, 0);
+ page_ext = page_ext_next(page_ext);
+ }
+out:
+ page_ext_put(page_ext);
+}
+
#else /* CONFIG_MEM_ALLOC_PROFILING */
static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
unsigned int order) {}
static inline void pgalloc_tag_sub(struct page *page, unsigned int order) {}
+static inline void pgalloc_tag_split(struct page *page, unsigned int nr) {}
#endif /* CONFIG_MEM_ALLOC_PROFILING */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94c958f7ebb5..86daae671319 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -38,6 +38,7 @@
#include <linux/sched/sysctl.h>
#include <linux/memory-tiers.h>
#include <linux/compat.h>
+#include <linux/pgalloc_tag.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -2899,6 +2900,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
/* Caller disabled irqs, so they are still disabled here */
split_page_owner(head, nr);
+ pgalloc_tag_split(head, nr);
/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 58c0e8b948a4..4bc5b4720fee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2621,6 +2621,7 @@ void split_page(struct page *page, unsigned int order)
for (i = 1; i < (1 << order); i++)
set_page_refcounted(page + i);
split_page_owner(page, 1 << order);
+ pgalloc_tag_split(page, 1 << order);
split_page_memcg(page, 1 << order);
}
EXPORT_SYMBOL_GPL(split_page);
@@ -4806,6 +4807,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
struct page *last = page + nr;
split_page_owner(page, 1 << order);
+ pgalloc_tag_split(page, 1 << order);
split_page_memcg(page, 1 << order);
while (page < --last)
set_page_refcounted(last);
--
2.43.0.687.g38aa6559b0-goog
Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. as alloc_hooks()
wrappers around *_noprof implementations, so that allocations and
deallocations done by these functions are recorded.
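
The wrapping pattern can be sketched in userspace roughly as below. This
is not the kernel's alloc_hooks() macro: the demo_* names are
hypothetical, demo_current_tag stands in for current->alloc_tag, and
demo_last_tag exists only so the example can inspect the hidden
per-call-site tag. It relies on GNU C statement expressions and
__typeof__, as kernel code does.

```c
#include <stddef.h>
#include <stdlib.h>

struct demo_tag {
	const char *file;
	int line;
	size_t bytes;		/* bytes allocated from this call site */
	unsigned long calls;	/* allocations made from this call site */
};

/* Stand-in for current->alloc_tag: the tag that gets charged. */
static struct demo_tag *demo_current_tag;
/* Inspection aid for the example only. */
static struct demo_tag *demo_last_tag;

/* The _noprof variant does the work and charges the current tag. */
static void *demo_malloc_noprof(size_t size)
{
	void *p = malloc(size);

	if (p && demo_current_tag) {
		demo_current_tag->bytes += size;
		demo_current_tag->calls++;
	}
	return p;
}

/* One static tag per expansion site, made current around the call. */
#define demo_alloc_hooks(_do_alloc)					\
({									\
	static struct demo_tag _tag = { __FILE__, __LINE__, 0, 0 };	\
	struct demo_tag *_old = demo_current_tag;			\
	__typeof__(_do_alloc) _res;					\
									\
	demo_current_tag = demo_last_tag = &_tag;			\
	_res = _do_alloc;						\
	demo_current_tag = _old;					\
	_res;								\
})

#define demo_malloc(...) demo_alloc_hooks(demo_malloc_noprof(__VA_ARGS__))
```

Because the tag is a static object inside the macro expansion, each
textual use of demo_malloc() gets its own counters, while helpers that
call demo_malloc_noprof() directly are charged to whichever caller
installed the current tag.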
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
include/linux/fortify-string.h | 5 +-
include/linux/slab.h | 177 ++++++++++++++++-----------------
include/linux/string.h | 4 +-
mm/slab_common.c | 6 +-
mm/slub.c | 54 +++++-----
mm/util.c | 20 ++--
6 files changed, 134 insertions(+), 132 deletions(-)
diff --git a/include/linux/fortify-string.h b/include/linux/fortify-string.h
index 89a6888f2f9e..55f66bd8a366 100644
--- a/include/linux/fortify-string.h
+++ b/include/linux/fortify-string.h
@@ -697,9 +697,9 @@ __FORTIFY_INLINE void *memchr_inv(const void * const POS0 p, int c, size_t size)
return __real_memchr_inv(p, c, size);
}
-extern void *__real_kmemdup(const void *src, size_t len, gfp_t gfp) __RENAME(kmemdup)
+extern void *__real_kmemdup(const void *src, size_t len, gfp_t gfp) __RENAME(kmemdup_noprof)
__realloc_size(2);
-__FORTIFY_INLINE void *kmemdup(const void * const POS0 p, size_t size, gfp_t gfp)
+__FORTIFY_INLINE void *kmemdup_noprof(const void * const POS0 p, size_t size, gfp_t gfp)
{
const size_t p_size = __struct_size(p);
@@ -709,6 +709,7 @@ __FORTIFY_INLINE void *kmemdup(const void * const POS0 p, size_t size, gfp_t gfp
fortify_panic(__func__);
return __real_kmemdup(p, size, gfp);
}
+#define kmemdup(...) alloc_hooks(kmemdup_noprof(__VA_ARGS__))
/**
* strcpy - Copy a string into another string buffer
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 3ac2fc830f0f..910473e07e86 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -230,7 +230,10 @@ int kmem_cache_shrink(struct kmem_cache *s);
/*
* Common kmalloc functions provided by all allocators
*/
-void * __must_check krealloc(const void *objp, size_t new_size, gfp_t flags) __realloc_size(2);
+void * __must_check krealloc_noprof(const void *objp, size_t new_size,
+ gfp_t flags) __realloc_size(2);
+#define krealloc(...) alloc_hooks(krealloc_noprof(__VA_ARGS__))
+
void kfree(const void *objp);
void kfree_sensitive(const void *objp);
size_t __ksize(const void *objp);
@@ -482,7 +485,10 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
static_assert(PAGE_SHIFT <= 20);
#define kmalloc_index(s) __kmalloc_index(s, true)
-void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_size(1);
+#include <linux/alloc_tag.h>
+
+void *__kmalloc_noprof(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_size(1);
+#define __kmalloc(...) alloc_hooks(__kmalloc_noprof(__VA_ARGS__))
/**
* kmem_cache_alloc - Allocate an object
@@ -494,9 +500,14 @@ void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_siz
*
* Return: pointer to the new object or %NULL in case of error
*/
-void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) __assume_slab_alignment __malloc;
-void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
- gfp_t gfpflags) __assume_slab_alignment __malloc;
+void *kmem_cache_alloc_noprof(struct kmem_cache *cachep,
+ gfp_t flags) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc(...) alloc_hooks(kmem_cache_alloc_noprof(__VA_ARGS__))
+
+void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
+ gfp_t gfpflags) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc_lru(...) alloc_hooks(kmem_cache_alloc_lru_noprof(__VA_ARGS__))
+
void kmem_cache_free(struct kmem_cache *s, void *objp);
/*
@@ -507,29 +518,40 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
* Note that interrupts must be enabled when calling these functions.
*/
void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
-int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
+
+int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
+#define kmem_cache_alloc_bulk(...) alloc_hooks(kmem_cache_alloc_bulk_noprof(__VA_ARGS__))
static __always_inline void kfree_bulk(size_t size, void **p)
{
kmem_cache_free_bulk(NULL, size, p);
}
-void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment
+void *__kmalloc_node_noprof(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment
__alloc_size(1);
-void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
- __malloc;
+#define __kmalloc_node(...) alloc_hooks(__kmalloc_node_noprof(__VA_ARGS__))
-void *kmalloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
+void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
+ int node) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc_node(...) alloc_hooks(kmem_cache_alloc_node_noprof(__VA_ARGS__))
+
+void *kmalloc_trace_noprof(struct kmem_cache *s, gfp_t flags, size_t size)
__assume_kmalloc_alignment __alloc_size(3);
-void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
- int node, size_t size) __assume_kmalloc_alignment
+void *kmalloc_node_trace_noprof(struct kmem_cache *s, gfp_t gfpflags,
+ int node, size_t size) __assume_kmalloc_alignment
__alloc_size(4);
-void *kmalloc_large(size_t size, gfp_t flags) __assume_page_alignment
+#define kmalloc_trace(...) alloc_hooks(kmalloc_trace_noprof(__VA_ARGS__))
+
+#define kmalloc_node_trace(...) alloc_hooks(kmalloc_node_trace_noprof(__VA_ARGS__))
+
+void *kmalloc_large_noprof(size_t size, gfp_t flags) __assume_page_alignment
__alloc_size(1);
+#define kmalloc_large(...) alloc_hooks(kmalloc_large_noprof(__VA_ARGS__))
-void *kmalloc_large_node(size_t size, gfp_t flags, int node) __assume_page_alignment
+void *kmalloc_large_node_noprof(size_t size, gfp_t flags, int node) __assume_page_alignment
__alloc_size(1);
+#define kmalloc_large_node(...) alloc_hooks(kmalloc_large_node_noprof(__VA_ARGS__))
/**
* kmalloc - allocate kernel memory
@@ -585,37 +607,39 @@ void *kmalloc_large_node(size_t size, gfp_t flags, int node) __assume_page_align
* Try really hard to succeed the allocation but fail
* eventually.
*/
-static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
+static __always_inline __alloc_size(1) void *kmalloc_noprof(size_t size, gfp_t flags)
{
if (__builtin_constant_p(size) && size) {
unsigned int index;
if (size > KMALLOC_MAX_CACHE_SIZE)
- return kmalloc_large(size, flags);
+ return kmalloc_large_noprof(size, flags);
index = kmalloc_index(size);
- return kmalloc_trace(
+ return kmalloc_trace_noprof(
kmalloc_caches[kmalloc_type(flags, _RET_IP_)][index],
flags, size);
}
- return __kmalloc(size, flags);
+ return __kmalloc_noprof(size, flags);
}
+#define kmalloc(...) alloc_hooks(kmalloc_noprof(__VA_ARGS__))
-static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t flags, int node)
+static __always_inline __alloc_size(1) void *kmalloc_node_noprof(size_t size, gfp_t flags, int node)
{
if (__builtin_constant_p(size) && size) {
unsigned int index;
if (size > KMALLOC_MAX_CACHE_SIZE)
- return kmalloc_large_node(size, flags, node);
+ return kmalloc_large_node_noprof(size, flags, node);
index = kmalloc_index(size);
- return kmalloc_node_trace(
+ return kmalloc_node_trace_noprof(
kmalloc_caches[kmalloc_type(flags, _RET_IP_)][index],
flags, node, size);
}
- return __kmalloc_node(size, flags, node);
+ return __kmalloc_node_noprof(size, flags, node);
}
+#define kmalloc_node(...) alloc_hooks(kmalloc_node_noprof(__VA_ARGS__))
/**
* kmalloc_array - allocate memory for an array.
@@ -623,16 +647,17 @@ static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t fla
* @size: element size.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline __alloc_size(1, 2) void *kmalloc_array(size_t n, size_t size, gfp_t flags)
+static inline __alloc_size(1, 2) void *kmalloc_array_noprof(size_t n, size_t size, gfp_t flags)
{
size_t bytes;
if (unlikely(check_mul_overflow(n, size, &bytes)))
return NULL;
if (__builtin_constant_p(n) && __builtin_constant_p(size))
- return kmalloc(bytes, flags);
- return __kmalloc(bytes, flags);
+ return kmalloc_noprof(bytes, flags);
+ return kmalloc_noprof(bytes, flags);
}
+#define kmalloc_array(...) alloc_hooks(kmalloc_array_noprof(__VA_ARGS__))
/**
* krealloc_array - reallocate memory for an array.
@@ -641,18 +666,19 @@ static inline __alloc_size(1, 2) void *kmalloc_array(size_t n, size_t size, gfp_
* @new_size: new size of a single member of the array
* @flags: the type of memory to allocate (see kmalloc)
*/
-static inline __realloc_size(2, 3) void * __must_check krealloc_array(void *p,
- size_t new_n,
- size_t new_size,
- gfp_t flags)
+static inline __realloc_size(2, 3) void * __must_check krealloc_array_noprof(void *p,
+ size_t new_n,
+ size_t new_size,
+ gfp_t flags)
{
size_t bytes;
if (unlikely(check_mul_overflow(new_n, new_size, &bytes)))
return NULL;
- return krealloc(p, bytes, flags);
+ return krealloc_noprof(p, bytes, flags);
}
+#define krealloc_array(...) alloc_hooks(krealloc_array_noprof(__VA_ARGS__))
/**
* kcalloc - allocate memory for an array. The memory is set to zero.
@@ -660,16 +686,12 @@ static inline __realloc_size(2, 3) void * __must_check krealloc_array(void *p,
* @size: element size.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline __alloc_size(1, 2) void *kcalloc(size_t n, size_t size, gfp_t flags)
-{
- return kmalloc_array(n, size, flags | __GFP_ZERO);
-}
+#define kcalloc(_n, _size, _flags) kmalloc_array(_n, _size, (_flags) | __GFP_ZERO)
-void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
+void *kmalloc_node_track_caller_noprof(size_t size, gfp_t flags, int node,
unsigned long caller) __alloc_size(1);
-#define kmalloc_node_track_caller(size, flags, node) \
- __kmalloc_node_track_caller(size, flags, node, \
- _RET_IP_)
+#define kmalloc_node_track_caller(...) \
+ alloc_hooks(kmalloc_node_track_caller_noprof(__VA_ARGS__, _RET_IP_))
/*
* kmalloc_track_caller is a special version of kmalloc that records the
@@ -679,11 +701,9 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
* allocator where we care about the real place the memory allocation
* request comes from.
*/
-#define kmalloc_track_caller(size, flags) \
- __kmalloc_node_track_caller(size, flags, \
- NUMA_NO_NODE, _RET_IP_)
+#define kmalloc_track_caller(...) kmalloc_node_track_caller(__VA_ARGS__, NUMA_NO_NODE)
-static inline __alloc_size(1, 2) void *kmalloc_array_node(size_t n, size_t size, gfp_t flags,
+static inline __alloc_size(1, 2) void *kmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags,
int node)
{
size_t bytes;
@@ -691,75 +711,52 @@ static inline __alloc_size(1, 2) void *kmalloc_array_node(size_t n, size_t size,
if (unlikely(check_mul_overflow(n, size, &bytes)))
return NULL;
if (__builtin_constant_p(n) && __builtin_constant_p(size))
- return kmalloc_node(bytes, flags, node);
- return __kmalloc_node(bytes, flags, node);
+ return kmalloc_node_noprof(bytes, flags, node);
+ return __kmalloc_node_noprof(bytes, flags, node);
}
+#define kmalloc_array_node(...) alloc_hooks(kmalloc_array_node_noprof(__VA_ARGS__))
-static inline __alloc_size(1, 2) void *kcalloc_node(size_t n, size_t size, gfp_t flags, int node)
-{
- return kmalloc_array_node(n, size, flags | __GFP_ZERO, node);
-}
+#define kcalloc_node(_n, _size, _flags, _node) \
+ kmalloc_array_node(_n, _size, (_flags) | __GFP_ZERO, _node)
/*
* Shortcuts
*/
-static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
-{
- return kmem_cache_alloc(k, flags | __GFP_ZERO);
-}
+#define kmem_cache_zalloc(_k, _flags) kmem_cache_alloc(_k, (_flags)|__GFP_ZERO)
/**
* kzalloc - allocate memory. The memory is set to zero.
* @size: how many bytes of memory are required.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
+static inline __alloc_size(1) void *kzalloc_noprof(size_t size, gfp_t flags)
{
- return kmalloc(size, flags | __GFP_ZERO);
+ return kmalloc_noprof(size, flags | __GFP_ZERO);
}
+#define kzalloc(...) alloc_hooks(kzalloc_noprof(__VA_ARGS__))
+#define kzalloc_node(_size, _flags, _node) kmalloc_node(_size, (_flags)|__GFP_ZERO, _node)
-/**
- * kzalloc_node - allocate zeroed memory from a particular memory node.
- * @size: how many bytes of memory are required.
- * @flags: the type of memory to allocate (see kmalloc).
- * @node: memory node from which to allocate
- */
-static inline __alloc_size(1) void *kzalloc_node(size_t size, gfp_t flags, int node)
-{
- return kmalloc_node(size, flags | __GFP_ZERO, node);
-}
+extern void *kvmalloc_node_noprof(size_t size, gfp_t flags, int node) __alloc_size(1);
+#define kvmalloc_node(...) alloc_hooks(kvmalloc_node_noprof(__VA_ARGS__))
-extern void *kvmalloc_node(size_t size, gfp_t flags, int node) __alloc_size(1);
-static inline __alloc_size(1) void *kvmalloc(size_t size, gfp_t flags)
-{
- return kvmalloc_node(size, flags, NUMA_NO_NODE);
-}
-static inline __alloc_size(1) void *kvzalloc_node(size_t size, gfp_t flags, int node)
-{
- return kvmalloc_node(size, flags | __GFP_ZERO, node);
-}
-static inline __alloc_size(1) void *kvzalloc(size_t size, gfp_t flags)
-{
- return kvmalloc(size, flags | __GFP_ZERO);
-}
+#define kvmalloc(_size, _flags) kvmalloc_node(_size, _flags, NUMA_NO_NODE)
+#define kvzalloc(_size, _flags) kvmalloc(_size, (_flags)|__GFP_ZERO)
-static inline __alloc_size(1, 2) void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
-{
- size_t bytes;
-
- if (unlikely(check_mul_overflow(n, size, &bytes)))
- return NULL;
+#define kvzalloc_node(_size, _flags, _node) kvmalloc_node(_size, (_flags)|__GFP_ZERO, _node)
- return kvmalloc(bytes, flags);
-}
+#define kvmalloc_array(_n, _size, _flags) \
+({ \
+ size_t _bytes; \
+ \
+ !check_mul_overflow(_n, _size, &_bytes) ? kvmalloc(_bytes, _flags) : NULL; \
+})
-static inline __alloc_size(1, 2) void *kvcalloc(size_t n, size_t size, gfp_t flags)
-{
- return kvmalloc_array(n, size, flags | __GFP_ZERO);
-}
+#define kvcalloc(_n, _size, _flags) kvmalloc_array(_n, _size, (_flags)|__GFP_ZERO)
-extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
+extern void *kvrealloc_noprof(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
__realloc_size(3);
+#define kvrealloc(...) alloc_hooks(kvrealloc_noprof(__VA_ARGS__))
+
extern void kvfree(const void *addr);
DEFINE_FREE(kvfree, void *, if (_T) kvfree(_T))
diff --git a/include/linux/string.h b/include/linux/string.h
index ab148d8dbfc1..14e4fb4340f4 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -214,7 +214,9 @@ extern void kfree_const(const void *x);
extern char *kstrdup(const char *s, gfp_t gfp) __malloc;
extern const char *kstrdup_const(const char *s, gfp_t gfp);
extern char *kstrndup(const char *s, size_t len, gfp_t gfp);
-extern void *kmemdup(const void *src, size_t len, gfp_t gfp) __realloc_size(2);
+extern void *kmemdup_noprof(const void *src, size_t len, gfp_t gfp) __realloc_size(2);
+#define kmemdup(...) alloc_hooks(kmemdup_noprof(__VA_ARGS__))
+
extern void *kvmemdup(const void *src, size_t len, gfp_t gfp) __realloc_size(2);
extern char *kmemdup_nul(const char *s, size_t len, gfp_t gfp);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 83fec2dd2e2d..21b0b9e9cd9e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1234,7 +1234,7 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
return (void *)p;
}
- ret = kmalloc_track_caller(new_size, flags);
+ ret = kmalloc_node_track_caller_noprof(new_size, flags, NUMA_NO_NODE, _RET_IP_);
if (ret && p) {
/* Disable KASAN checks as the object's redzone is accessed. */
kasan_disable_current();
@@ -1258,7 +1258,7 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
*
* Return: pointer to the allocated memory or %NULL in case of error
*/
-void *krealloc(const void *p, size_t new_size, gfp_t flags)
+void *krealloc_noprof(const void *p, size_t new_size, gfp_t flags)
{
void *ret;
@@ -1273,7 +1273,7 @@ void *krealloc(const void *p, size_t new_size, gfp_t flags)
return ret;
}
-EXPORT_SYMBOL(krealloc);
+EXPORT_SYMBOL(krealloc_noprof);
/**
* kfree_sensitive - Clear sensitive information in memory before freeing
diff --git a/mm/slub.c b/mm/slub.c
index f4d5794c1e86..9ea03d6e9c9d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3871,7 +3871,7 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
return object;
}
-void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+void *kmem_cache_alloc_noprof(struct kmem_cache *s, gfp_t gfpflags)
{
void *ret = slab_alloc_node(s, NULL, gfpflags, NUMA_NO_NODE, _RET_IP_,
s->object_size);
@@ -3880,9 +3880,9 @@ void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
return ret;
}
-EXPORT_SYMBOL(kmem_cache_alloc);
+EXPORT_SYMBOL(kmem_cache_alloc_noprof);
-void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
+void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
gfp_t gfpflags)
{
void *ret = slab_alloc_node(s, lru, gfpflags, NUMA_NO_NODE, _RET_IP_,
@@ -3892,10 +3892,10 @@ void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
return ret;
}
-EXPORT_SYMBOL(kmem_cache_alloc_lru);
+EXPORT_SYMBOL(kmem_cache_alloc_lru_noprof);
/**
- * kmem_cache_alloc_node - Allocate an object on the specified node
+ * kmem_cache_alloc_node_noprof - Allocate an object on the specified node
* @s: The cache to allocate from.
* @gfpflags: See kmalloc().
* @node: node number of the target node.
@@ -3907,7 +3907,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_lru);
*
* Return: pointer to the new object or %NULL in case of error
*/
-void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int node)
{
void *ret = slab_alloc_node(s, NULL, gfpflags, node, _RET_IP_, s->object_size);
@@ -3915,7 +3915,7 @@ void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
return ret;
}
-EXPORT_SYMBOL(kmem_cache_alloc_node);
+EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);
/*
* To avoid unnecessary overhead, we pass through large allocation requests
@@ -3932,7 +3932,7 @@ static void *__kmalloc_large_node(size_t size, gfp_t flags, int node)
flags = kmalloc_fix_flags(flags);
flags |= __GFP_COMP;
- folio = (struct folio *)alloc_pages_node(node, flags, order);
+ folio = (struct folio *)alloc_pages_node_noprof(node, flags, order);
if (folio) {
ptr = folio_address(folio);
lruvec_stat_mod_folio(folio, NR_SLAB_UNRECLAIMABLE_B,
@@ -3947,7 +3947,7 @@ static void *__kmalloc_large_node(size_t size, gfp_t flags, int node)
return ptr;
}
-void *kmalloc_large(size_t size, gfp_t flags)
+void *kmalloc_large_noprof(size_t size, gfp_t flags)
{
void *ret = __kmalloc_large_node(size, flags, NUMA_NO_NODE);
@@ -3955,9 +3955,9 @@ void *kmalloc_large(size_t size, gfp_t flags)
flags, NUMA_NO_NODE);
return ret;
}
-EXPORT_SYMBOL(kmalloc_large);
+EXPORT_SYMBOL(kmalloc_large_noprof);
-void *kmalloc_large_node(size_t size, gfp_t flags, int node)
+void *kmalloc_large_node_noprof(size_t size, gfp_t flags, int node)
{
void *ret = __kmalloc_large_node(size, flags, node);
@@ -3965,7 +3965,7 @@ void *kmalloc_large_node(size_t size, gfp_t flags, int node)
flags, node);
return ret;
}
-EXPORT_SYMBOL(kmalloc_large_node);
+EXPORT_SYMBOL(kmalloc_large_node_noprof);
static __always_inline
void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
@@ -3992,26 +3992,26 @@ void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
return ret;
}
-void *__kmalloc_node(size_t size, gfp_t flags, int node)
+void *__kmalloc_node_noprof(size_t size, gfp_t flags, int node)
{
return __do_kmalloc_node(size, flags, node, _RET_IP_);
}
-EXPORT_SYMBOL(__kmalloc_node);
+EXPORT_SYMBOL(__kmalloc_node_noprof);
-void *__kmalloc(size_t size, gfp_t flags)
+void *__kmalloc_noprof(size_t size, gfp_t flags)
{
return __do_kmalloc_node(size, flags, NUMA_NO_NODE, _RET_IP_);
}
-EXPORT_SYMBOL(__kmalloc);
+EXPORT_SYMBOL(__kmalloc_noprof);
-void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
- int node, unsigned long caller)
+void *kmalloc_node_track_caller_noprof(size_t size, gfp_t flags,
+ int node, unsigned long caller)
{
return __do_kmalloc_node(size, flags, node, caller);
}
-EXPORT_SYMBOL(__kmalloc_node_track_caller);
+EXPORT_SYMBOL(kmalloc_node_track_caller_noprof);
-void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
+void *kmalloc_trace_noprof(struct kmem_cache *s, gfp_t gfpflags, size_t size)
{
void *ret = slab_alloc_node(s, NULL, gfpflags, NUMA_NO_NODE,
_RET_IP_, size);
@@ -4021,9 +4021,9 @@ void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
ret = kasan_kmalloc(s, ret, size, gfpflags);
return ret;
}
-EXPORT_SYMBOL(kmalloc_trace);
+EXPORT_SYMBOL(kmalloc_trace_noprof);
-void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
+void *kmalloc_node_trace_noprof(struct kmem_cache *s, gfp_t gfpflags,
int node, size_t size)
{
void *ret = slab_alloc_node(s, NULL, gfpflags, node, _RET_IP_, size);
@@ -4033,7 +4033,7 @@ void *kmalloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
ret = kasan_kmalloc(s, ret, size, gfpflags);
return ret;
}
-EXPORT_SYMBOL(kmalloc_node_trace);
+EXPORT_SYMBOL(kmalloc_node_trace_noprof);
static noinline void free_to_partial_list(
struct kmem_cache *s, struct slab *slab,
@@ -4304,6 +4304,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
unsigned long addr)
{
memcg_slab_free_hook(s, slab, &object, 1);
+ alloc_tagging_slab_free_hook(s, slab, &object, 1);
if (likely(slab_free_hook(s, object, slab_want_init_on_free(s))))
do_slab_free(s, slab, object, object, 1, addr);
@@ -4314,6 +4315,7 @@ void slab_free_bulk(struct kmem_cache *s, struct slab *slab, void *head,
void *tail, void **p, int cnt, unsigned long addr)
{
memcg_slab_free_hook(s, slab, p, cnt);
+ alloc_tagging_slab_free_hook(s, slab, p, cnt);
/*
* With KASAN enabled slab_free_freelist_hook modifies the freelist
* to remove objects, whose reuse must be delayed.
@@ -4640,8 +4642,8 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
#endif /* CONFIG_SLUB_TINY */
/* Note that interrupts must be enabled when calling this function. */
-int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
- void **p)
+int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
+ void **p)
{
int i;
struct obj_cgroup *objcg = NULL;
@@ -4669,7 +4671,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
return i;
}
-EXPORT_SYMBOL(kmem_cache_alloc_bulk);
+EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
/*
diff --git a/mm/util.c b/mm/util.c
index 5a6a9802583b..291f7945190f 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -115,7 +115,7 @@ char *kstrndup(const char *s, size_t max, gfp_t gfp)
EXPORT_SYMBOL(kstrndup);
/**
- * kmemdup - duplicate region of memory
+ * kmemdup_noprof - duplicate region of memory
*
* @src: memory region to duplicate
* @len: memory region length
@@ -124,16 +124,16 @@ EXPORT_SYMBOL(kstrndup);
* Return: newly allocated copy of @src or %NULL in case of error,
* result is physically contiguous. Use kfree() to free.
*/
-void *kmemdup(const void *src, size_t len, gfp_t gfp)
+void *kmemdup_noprof(const void *src, size_t len, gfp_t gfp)
{
void *p;
- p = kmalloc_track_caller(len, gfp);
+ p = kmalloc_node_track_caller_noprof(len, gfp, NUMA_NO_NODE, _RET_IP_);
if (p)
memcpy(p, src, len);
return p;
}
-EXPORT_SYMBOL(kmemdup);
+EXPORT_SYMBOL(kmemdup_noprof);
/**
* kvmemdup - duplicate region of memory
@@ -577,7 +577,7 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
EXPORT_SYMBOL(vm_mmap);
/**
- * kvmalloc_node - attempt to allocate physically contiguous memory, but upon
+ * kvmalloc_node_noprof - attempt to allocate physically contiguous memory, but upon
* failure, fall back to non-contiguous (vmalloc) allocation.
* @size: size of the request.
* @flags: gfp mask for the allocation - must be compatible (superset) with GFP_KERNEL.
@@ -592,7 +592,7 @@ EXPORT_SYMBOL(vm_mmap);
*
* Return: pointer to the allocated memory of %NULL in case of failure
*/
-void *kvmalloc_node(size_t size, gfp_t flags, int node)
+void *kvmalloc_node_noprof(size_t size, gfp_t flags, int node)
{
gfp_t kmalloc_flags = flags;
void *ret;
@@ -614,7 +614,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
kmalloc_flags &= ~__GFP_NOFAIL;
}
- ret = kmalloc_node(size, kmalloc_flags, node);
+ ret = kmalloc_node_noprof(size, kmalloc_flags, node);
/*
* It doesn't really make sense to fallback to vmalloc for sub page
@@ -643,7 +643,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
node, __builtin_return_address(0));
}
-EXPORT_SYMBOL(kvmalloc_node);
+EXPORT_SYMBOL(kvmalloc_node_noprof);
/**
* kvfree() - Free memory.
@@ -682,7 +682,7 @@ void kvfree_sensitive(const void *addr, size_t len)
}
EXPORT_SYMBOL(kvfree_sensitive);
-void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
+void *kvrealloc_noprof(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
{
void *newp;
@@ -695,7 +695,7 @@ void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
kvfree(p);
return newp;
}
-EXPORT_SYMBOL(kvrealloc);
+EXPORT_SYMBOL(kvrealloc_noprof);
/**
* __vmalloc_array - allocate memory for a virtually contiguous array.
--
2.43.0.687.g38aa6559b0-goog
On Mon, Feb 12, 2024 at 01:38:48PM -0800, Suren Baghdasaryan wrote:
> From: Kent Overstreet <[email protected]>
>
> These symbols are used to denote section boundaries: by always including
> them we can unify loading sections from modules with loading built-in
> sections, which leads to some significant cleanup.
>
> Signed-off-by: Kent Overstreet <[email protected]>
Seems reasonable!
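For readers who have not met the boundary-symbol convention, here is a
minimal userspace sketch of the same idea. It assumes GCC or Clang with GNU
ld (or a compatible linker) on an ELF target, where the linker provides
__start_<name> and __stop_<name> for any section whose name is a valid C
identifier. The section name demo_tags and every demo_* identifier are
hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record type; the patchset's equivalent is struct codetag. */
struct demo_tag {
	const char *name;
	int value;
};

/* Place one record in the named section (GCC/Clang on ELF assumed). */
#define DEMO_TAG(_ident, _value)					\
	static struct demo_tag _ident					\
	__attribute__((used, section("demo_tags"), aligned(8))) =	\
		{ .name = #_ident, .value = (_value) }

DEMO_TAG(first, 1);
DEMO_TAG(second, 2);
DEMO_TAG(third, 3);

/* Provided by the linker for sections named like C identifiers. */
extern struct demo_tag __start_demo_tags[];
extern struct demo_tag __stop_demo_tags[];

/* Number of records between the two boundary symbols. */
static size_t demo_tag_count(void)
{
	return (size_t)(__stop_demo_tags - __start_demo_tags);
}

/* Walk the section as if it were an ordinary array. */
static int demo_tag_sum(void)
{
	int sum = 0;

	assert(__start_demo_tags <= __stop_demo_tags);
	for (struct demo_tag *t = __start_demo_tags; t < __stop_demo_tags; t++)
		sum += t->value;
	return sum;
}
```

With the three records above, demo_tag_count() and demo_tag_sum() see the
whole section without any registration call, which is what makes the
built-in and module cases look the same to the loader.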
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:38:49PM -0800, Suren Baghdasaryan wrote:
> From: Kent Overstreet <[email protected]>
>
> We're introducing alloc tagging, which tracks memory allocations by
> callsite. Converting alloc_inode_sb() to a macro means allocations will
> be tracked by its caller, which is a bit more useful.
>
> Signed-off-by: Kent Overstreet <[email protected]>
Yup, getting these all doing direct calls will be nice.
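A small userspace sketch of why the macro conversion matters, assuming GCC
or Clang (it relies on GNU statement expressions, as the kernel does). All
demo_* names are hypothetical and only stand in for the patchset's tagging
machinery.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-callsite record, standing in for the patchset's alloc_tag. */
struct demo_site {
	const char *file;
	int line;
	unsigned long calls;
};

/* Most recently charged site, kept only so the effect can be observed. */
static struct demo_site *demo_last_site;

/* Charge one allocation to the given site and perform it. */
static void *demo_charge(struct demo_site *site, size_t size)
{
	assert(site != NULL);
	site->calls++;
	demo_last_site = site;
	return malloc(size);
}

/* Every expansion of this macro gets its own static record. */
#define demo_alloc(size) ({						\
	static struct demo_site _site = { __FILE__, __LINE__, 0 };	\
	demo_charge(&_site, (size));					\
})

/* Function wrapper: one expansion, so every caller is charged to this line. */
static void *demo_wrapper_fn(size_t size)
{
	return demo_alloc(size);
}

/* Macro wrapper: the expansion, and its record, lands in each caller. */
#define demo_wrapper_macro(size) demo_alloc(size)
```

Two calls through demo_wrapper_fn() share a single record, while two uses of
demo_wrapper_macro() produce two distinct records, one per caller. That is
the difference between a helper function and a macro in terms of where the
allocation is accounted.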
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:38:47PM -0800, Suren Baghdasaryan wrote:
> From: Kent Overstreet <[email protected]>
>
> The new flags parameter allows controlling
> - Whether or not the units suffix is separated by a space, for
> compatibility with sort -h
> - Whether or not to append a B suffix - we're not always printing
> bytes.
>
> Signed-off-by: Kent Overstreet <[email protected]>
If there is a v4, please include some short examples with the .h bit
field documentation. It's pretty obvious right now but these kinds of
things have a habit of growing, so let's set a good example for
documenting the flags.
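As an illustration of the kind of short example meant here, a userspace
sketch of a size formatter with two such flags. This is not the kernel's
string_get_size(); the flag names DEMO_SIZE_NOSPACE and DEMO_SIZE_NOBYTES
and the function demo_format_size() are hypothetical, and the fractional
digit is truncated rather than rounded to keep the sketch short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag names for this sketch; not the kernel's definitions. */
enum demo_size_flags {
	DEMO_SIZE_NOSPACE = 1 << 0,	/* "1.5KiB" instead of "1.5 KiB" */
	DEMO_SIZE_NOBYTES = 1 << 1,	/* "1.5 Ki" instead of "1.5 KiB" */
};

/*
 * Format a byte count in base-2 units with one truncated (not rounded)
 * fractional digit. Returns what snprintf() returns.
 */
static int demo_format_size(unsigned long long bytes, unsigned int flags,
			    char *buf, size_t len)
{
	static const char *const prefix[] = { "", "Ki", "Mi", "Gi", "Ti" };
	const char *b = (flags & DEMO_SIZE_NOBYTES) ? "" : "B";
	const char *sep = (flags & DEMO_SIZE_NOSPACE) ? "" : " ";
	unsigned long long rem = 0;
	size_t i = 0;

	assert(buf != NULL && len > 0);

	while (bytes >= 1024 && i < 4) {
		rem = bytes % 1024;
		bytes /= 1024;
		i++;
	}

	/* Nothing follows the number, so do not emit a trailing space. */
	if (strlen(prefix[i]) + strlen(b) == 0)
		sep = "";

	if (i == 0)
		return snprintf(buf, len, "%llu%s%s", bytes, sep, b);

	return snprintf(buf, len, "%llu.%llu%s%s%s",
			bytes, rem * 10 / 1024, sep, prefix[i], b);
}
```

For 1536 bytes this yields "1.5 KiB" with no flags, "1.5KiB" with
DEMO_SIZE_NOSPACE (the form sort -h accepts), and "1.5Ki" with both flags
set. A comment block of that shape next to the flag definitions would cover
the request.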
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:38:50PM -0800, Suren Baghdasaryan wrote:
> Introduce GFP bits enumeration to let compiler track the number of used
> bits (which depends on the config options) instead of hardcoding them.
> That simplifies __GFP_BITS_SHIFT calculation.
>
> Suggested-by: Petr Tesařík <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
Yeah, looks good.
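The technique can be sketched in a few lines of standalone C. The DEMO_*
names and the DEMO_CONFIG_EXTRA switch are hypothetical stand-ins for the
real GFP bits and config options; only the pattern is meant to carry over.

```c
#include <assert.h>

/* Set to 0 to model a config option being disabled. */
#ifndef DEMO_CONFIG_EXTRA
#define DEMO_CONFIG_EXTRA 1
#endif

/*
 * Each enumerator is a bit position. Conditional entries shift the later
 * ones automatically, so nothing has to be renumbered by hand.
 */
enum {
	DEMO_BIT_DMA,
	DEMO_BIT_HIGHMEM,
#if DEMO_CONFIG_EXTRA
	DEMO_BIT_EXTRA,
#endif
	DEMO_BIT_NOWARN,
	DEMO_BITS_LAST,		/* number of bits in use */
};

#define DEMO_BIT(n)		(1u << (n))
#define DEMO_DMA		DEMO_BIT(DEMO_BIT_DMA)
#define DEMO_HIGHMEM		DEMO_BIT(DEMO_BIT_HIGHMEM)
#define DEMO_NOWARN		DEMO_BIT(DEMO_BIT_NOWARN)

/* The shift and mask fall out of the enum instead of being hardcoded. */
#define DEMO_BITS_SHIFT		DEMO_BITS_LAST
#define DEMO_BITS_MASK		(DEMO_BIT(DEMO_BITS_SHIFT) - 1)

/* Guard against running out of room in a 32-bit flags word. */
static_assert(DEMO_BITS_SHIFT < 32, "too many demo flag bits");
```

With DEMO_CONFIG_EXTRA enabled the shift is 4 and the mask is 0xf; building
with -DDEMO_CONFIG_EXTRA=0 drops both by one bit without touching any other
definition.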
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:38:51PM -0800, Suren Baghdasaryan wrote:
> Currently slab pages can store only vectors of obj_cgroup pointers in
> page->memcg_data. Introduce slabobj_ext structure to allow more data
> to be stored for each slab object. Wrap obj_cgroup into slabobj_ext
> to support current functionality while allowing to extend slabobj_ext
> in the future.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
It looks like this doesn't change which buckets GFP_KERNEL_ACCOUNT comes
out of, is that correct? I'd love it if we didn't have separate buckets
so GFP_KERNEL and GFP_KERNEL_ACCOUNT came from the same pools (so that
the randomized pools would cover GFP_KERNEL_ACCOUNT ...)
Regardless:
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:38:52PM -0800, Suren Baghdasaryan wrote:
> Introduce __GFP_NO_OBJ_EXT flag in order to prevent recursive allocations
> when allocating slabobj_ext on a slab.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:38:55PM -0800, Suren Baghdasaryan wrote:
> Introduce objext_flags to store additional objext flags unrelated to memcg.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:38:53PM -0800, Suren Baghdasaryan wrote:
> Slab extension objects can't be allocated before slab infrastructure is
> initialized. Some caches, like kmem_cache and kmem_cache_node, are created
> before slab infrastructure is initialized. Objects from these caches can't
> have extension objects. Introduce SLAB_NO_OBJ_EXT slab flag to mark these
> caches and avoid creating extensions for objects allocated from these
> slabs.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:38:54PM -0800, Suren Baghdasaryan wrote:
> Use __GFP_NO_OBJ_EXT to prevent recursions when allocating slabobj_ext
> objects. Also prevent slabobj_ext allocations for kmem_cache objects.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
I almost feel like this can be collapsed into earlier patches, but
regardless:
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:38:56PM -0800, Suren Baghdasaryan wrote:
> Add basic infrastructure to support code tagging which stores tag common
> information consisting of the module name, function, file name and line
> number. Provide functions to register a new code tag type and navigate
> between code tags.
>
> Co-developed-by: Kent Overstreet <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/codetag.h | 71 ++++++++++++++
> lib/Kconfig.debug | 4 +
> lib/Makefile | 1 +
> lib/codetag.c | 199 ++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 275 insertions(+)
> create mode 100644 include/linux/codetag.h
> create mode 100644 lib/codetag.c
>
> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> new file mode 100644
> index 000000000000..a9d7adecc2a5
> --- /dev/null
> +++ b/include/linux/codetag.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * code tagging framework
> + */
> +#ifndef _LINUX_CODETAG_H
> +#define _LINUX_CODETAG_H
> +
> +#include <linux/types.h>
> +
> +struct codetag_iterator;
> +struct codetag_type;
> +struct seq_buf;
> +struct module;
> +
> +/*
> + * An instance of this structure is created in a special ELF section at every
> + * code location being tagged. At runtime, the special section is treated as
> + * an array of these.
> + */
> +struct codetag {
> + unsigned int flags; /* used in later patches */
> + unsigned int lineno;
> + const char *modname;
> + const char *function;
> + const char *filename;
> +} __aligned(8);
> +
> +union codetag_ref {
> + struct codetag *ct;
> +};
> +
> +struct codetag_range {
> + struct codetag *start;
> + struct codetag *stop;
> +};
> +
> +struct codetag_module {
> + struct module *mod;
> + struct codetag_range range;
> +};
> +
> +struct codetag_type_desc {
> + const char *section;
> + size_t tag_size;
> +};
> +
> +struct codetag_iterator {
> + struct codetag_type *cttype;
> + struct codetag_module *cmod;
> + unsigned long mod_id;
> + struct codetag *ct;
> +};
> +
> +#define CODE_TAG_INIT { \
> + .modname = KBUILD_MODNAME, \
> + .function = __func__, \
> + .filename = __FILE__, \
> + .lineno = __LINE__, \
> + .flags = 0, \
> +}
> +
> +void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
> +struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
> +struct codetag *codetag_next_ct(struct codetag_iterator *iter);
> +
> +void codetag_to_text(struct seq_buf *out, struct codetag *ct);
> +
> +struct codetag_type *
> +codetag_register_type(const struct codetag_type_desc *desc);
> +
> +#endif /* _LINUX_CODETAG_H */
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 975a07f9f1cc..0be2d00c3696 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -968,6 +968,10 @@ config DEBUG_STACKOVERFLOW
>
> If in doubt, say "N".
>
> +config CODE_TAGGING
> + bool
> + select KALLSYMS
> +
> source "lib/Kconfig.kasan"
> source "lib/Kconfig.kfence"
> source "lib/Kconfig.kmsan"
> diff --git a/lib/Makefile b/lib/Makefile
> index 6b09731d8e61..6b48b22fdfac 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -235,6 +235,7 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
> of-reconfig-notifier-error-inject.o
> obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
>
> +obj-$(CONFIG_CODE_TAGGING) += codetag.o
> lib-$(CONFIG_GENERIC_BUG) += bug.o
>
> obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
> diff --git a/lib/codetag.c b/lib/codetag.c
> new file mode 100644
> index 000000000000..7708f8388e55
> --- /dev/null
> +++ b/lib/codetag.c
> @@ -0,0 +1,199 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/codetag.h>
> +#include <linux/idr.h>
> +#include <linux/kallsyms.h>
> +#include <linux/module.h>
> +#include <linux/seq_buf.h>
> +#include <linux/slab.h>
> +
> +struct codetag_type {
> + struct list_head link;
> + unsigned int count;
> + struct idr mod_idr;
> + struct rw_semaphore mod_lock; /* protects mod_idr */
> + struct codetag_type_desc desc;
> +};
> +
> +static DEFINE_MUTEX(codetag_lock);
> +static LIST_HEAD(codetag_types);
> +
> +void codetag_lock_module_list(struct codetag_type *cttype, bool lock)
> +{
> + if (lock)
> + down_read(&cttype->mod_lock);
> + else
> + up_read(&cttype->mod_lock);
> +}
> +
> +struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> +{
> + struct codetag_iterator iter = {
> + .cttype = cttype,
> + .cmod = NULL,
> + .mod_id = 0,
> + .ct = NULL,
> + };
> +
> + return iter;
> +}
> +
> +static inline struct codetag *get_first_module_ct(struct codetag_module *cmod)
> +{
> + return cmod->range.start < cmod->range.stop ? cmod->range.start : NULL;
> +}
> +
> +static inline
> +struct codetag *get_next_module_ct(struct codetag_iterator *iter)
> +{
> + struct codetag *res = (struct codetag *)
> + ((char *)iter->ct + iter->cttype->desc.tag_size);
> +
> + return res < iter->cmod->range.stop ? res : NULL;
> +}
> +
> +struct codetag *codetag_next_ct(struct codetag_iterator *iter)
> +{
> + struct codetag_type *cttype = iter->cttype;
> + struct codetag_module *cmod;
> + struct codetag *ct;
> +
> + lockdep_assert_held(&cttype->mod_lock);
> +
> + if (unlikely(idr_is_empty(&cttype->mod_idr)))
> + return NULL;
> +
> + ct = NULL;
> + while (true) {
> + cmod = idr_find(&cttype->mod_idr, iter->mod_id);
> +
> + /* If module was removed move to the next one */
> + if (!cmod)
> + cmod = idr_get_next_ul(&cttype->mod_idr,
> + &iter->mod_id);
> +
> + /* Exit if no more modules */
> + if (!cmod)
> + break;
> +
> + if (cmod != iter->cmod) {
> + iter->cmod = cmod;
> + ct = get_first_module_ct(cmod);
> + } else
> + ct = get_next_module_ct(iter);
> +
> + if (ct)
> + break;
> +
> + iter->mod_id++;
> + }
> +
> + iter->ct = ct;
> + return ct;
> +}
> +
> +void codetag_to_text(struct seq_buf *out, struct codetag *ct)
> +{
> + seq_buf_printf(out, "%s:%u module:%s func:%s",
> + ct->filename, ct->lineno,
> + ct->modname, ct->function);
> +}
Thank you for using seq_buf here!
Also, will this need an EXPORT_SYMBOL_GPL()?
> +
> +static inline size_t range_size(const struct codetag_type *cttype,
> + const struct codetag_range *range)
> +{
> + return ((char *)range->stop - (char *)range->start) /
> + cttype->desc.tag_size;
> +}
> +
> +static void *get_symbol(struct module *mod, const char *prefix, const char *name)
> +{
> + char buf[64];
Why is 64 enough? I was expecting KSYM_NAME_LEN here, but perhaps this
is specialized enough to section names that it will not be a problem?
If so, please document it clearly with a comment.
> + int res;
> +
> + res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
> + if (WARN_ON(res < 1 || res > sizeof(buf)))
> + return NULL;
Please use a seq_buf here instead of snprintf, which we're trying to get
rid of.
DECLARE_SEQ_BUF(sb, KSYM_NAME_LEN);
const char *buf;
seq_buf_printf(&sb, "%s%s", prefix, name);
if (seq_buf_has_overflowed(&sb))
	return NULL;
buf = seq_buf_str(&sb);
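Whichever helper is used, the overflow check is the part that matters. A
userspace sketch with plain snprintf(), which returns the length the full
string would have had, so a return value greater than or equal to the buffer
size means the output was truncated. DEMO_NAME_LEN and build_symbol_name()
are hypothetical names for this sketch only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_NAME_LEN 64	/* hypothetical bound, mirroring buf[64] in the patch */

/*
 * Build "<prefix><name>" into out. Returns 0 on success, -1 if the result
 * is empty or would not fit. snprintf() reports the untruncated length, so
 * res >= len (not just res > len) signals truncation.
 */
static int build_symbol_name(char *out, size_t len,
			     const char *prefix, const char *name)
{
	int res;

	assert(out != NULL && len > 0);

	res = snprintf(out, len, "%s%s", prefix, name);
	if (res < 1 || (size_t)res >= len)
		return -1;

	return 0;
}
```

A name of exactly len characters is the boundary case: it needs len + 1
bytes including the terminator, so the sketch rejects it.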
> +
> + return mod ?
> + (void *)find_kallsyms_symbol_value(mod, buf) :
> + (void *)kallsyms_lookup_name(buf);
> +}
> +
> +static struct codetag_range get_section_range(struct module *mod,
> + const char *section)
> +{
> + return (struct codetag_range) {
> + get_symbol(mod, "__start_", section),
> + get_symbol(mod, "__stop_", section),
> + };
> +}
> +
> +static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
> +{
> + struct codetag_range range;
> + struct codetag_module *cmod;
> + int err;
> +
> + range = get_section_range(mod, cttype->desc.section);
> + if (!range.start || !range.stop) {
> + pr_warn("Failed to load code tags of type %s from the module %s\n",
> + cttype->desc.section,
> + mod ? mod->name : "(built-in)");
> + return -EINVAL;
> + }
> +
> + /* Ignore empty ranges */
> + if (range.start == range.stop)
> + return 0;
> +
> + BUG_ON(range.start > range.stop);
> +
> + cmod = kmalloc(sizeof(*cmod), GFP_KERNEL);
> + if (unlikely(!cmod))
> + return -ENOMEM;
> +
> + cmod->mod = mod;
> + cmod->range = range;
> +
> + down_write(&cttype->mod_lock);
> + err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
> + if (err >= 0)
> + cttype->count += range_size(cttype, &range);
> + up_write(&cttype->mod_lock);
> +
> + if (err < 0) {
> + kfree(cmod);
> + return err;
> + }
> +
> + return 0;
> +}
> +
> +struct codetag_type *
> +codetag_register_type(const struct codetag_type_desc *desc)
> +{
> + struct codetag_type *cttype;
> + int err;
> +
> + BUG_ON(desc->tag_size <= 0);
> +
> + cttype = kzalloc(sizeof(*cttype), GFP_KERNEL);
> + if (unlikely(!cttype))
> + return ERR_PTR(-ENOMEM);
> +
> + cttype->desc = *desc;
> + idr_init(&cttype->mod_idr);
> + init_rwsem(&cttype->mod_lock);
> +
> + err = codetag_module_init(cttype, NULL);
> + if (unlikely(err)) {
> + kfree(cttype);
> + return ERR_PTR(err);
> + }
> +
> + mutex_lock(&codetag_lock);
> + list_add_tail(&cttype->link, &codetag_types);
> + mutex_unlock(&codetag_lock);
> +
> + return cttype;
> +}
> --
> 2.43.0.687.g38aa6559b0-goog
>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:38:59PM -0800, Suren Baghdasaryan wrote:
> Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily
> instrument memory allocators. It registers an "alloc_tags" codetag type
> with /proc/allocinfo interface to output allocation tag information when
Please don't add anything new to the top-level /proc directory. This
should likely live in /sys.
> the feature is enabled.
> CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
> allocation profiling instrumentation.
> Memory allocation profiling can be enabled or disabled at runtime using
> /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n.
> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation
> profiling by default.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Co-developed-by: Kent Overstreet <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> Documentation/admin-guide/sysctl/vm.rst | 16 +++
> Documentation/filesystems/proc.rst | 28 +++++
> include/asm-generic/codetag.lds.h | 14 +++
> include/asm-generic/vmlinux.lds.h | 3 +
> include/linux/alloc_tag.h | 133 ++++++++++++++++++++
> include/linux/sched.h | 24 ++++
> lib/Kconfig.debug | 25 ++++
> lib/Makefile | 2 +
> lib/alloc_tag.c | 158 ++++++++++++++++++++++++
> scripts/module.lds.S | 7 ++
> 10 files changed, 410 insertions(+)
> create mode 100644 include/asm-generic/codetag.lds.h
> create mode 100644 include/linux/alloc_tag.h
> create mode 100644 lib/alloc_tag.c
>
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index c59889de122b..a214719492ea 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -43,6 +43,7 @@ Currently, these files are in /proc/sys/vm:
> - legacy_va_layout
> - lowmem_reserve_ratio
> - max_map_count
> +- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y)
> - memory_failure_early_kill
> - memory_failure_recovery
> - min_free_kbytes
> @@ -425,6 +426,21 @@ e.g., up to one or two maps per allocation.
> The default value is 65530.
>
>
> +mem_profiling
> +==============
> +
> +Enable memory profiling (when CONFIG_MEM_ALLOC_PROFILING=y)
> +
> +1: Enable memory profiling.
> +
> +0: Disable memory profiling.
> +
> +Enabling memory profiling introduces a small performance overhead for all
> +memory allocations.
> +
> +The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT.
> +
> +
> memory_failure_early_kill:
> ==========================
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 104c6d047d9b..40d6d18308e4 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -688,6 +688,7 @@ files are there, and which are missing.
> ============ ===============================================================
> File Content
> ============ ===============================================================
> + allocinfo Memory allocations profiling information
> apm Advanced power management info
> bootconfig Kernel command line obtained from boot config,
> and, if there were kernel parameters from the
> @@ -953,6 +954,33 @@ also be allocatable although a lot of filesystem metadata may have to be
> reclaimed to achieve this.
>
>
> +allocinfo
> +~~~~~~~~~
> +
> +Provides information about memory allocations at all locations in the code
> +base. Each allocation in the code is identified by its source file, line
> +number, module and the function calling the allocation. The number of bytes
> +allocated at each location is reported.
> +
> +Example output.
> +
> +::
> +
> + > cat /proc/allocinfo
> +
> + 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
> + 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
> + 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
> + 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
> + 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
> + 1.16MiB fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
> + 1.00MiB mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
> + 734KiB fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
> + 640KiB kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
> + 640KiB drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf
> + ...
> +
> +
> meminfo
> ~~~~~~~
>
> diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
> new file mode 100644
> index 000000000000..64f536b80380
> --- /dev/null
> +++ b/include/asm-generic/codetag.lds.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __ASM_GENERIC_CODETAG_LDS_H
> +#define __ASM_GENERIC_CODETAG_LDS_H
> +
> +#define SECTION_WITH_BOUNDARIES(_name) \
> + . = ALIGN(8); \
> + __start_##_name = .; \
> + KEEP(*(_name)) \
> + __stop_##_name = .;
> +
> +#define CODETAG_SECTIONS() \
> + SECTION_WITH_BOUNDARIES(alloc_tags)
> +
> +#endif /* __ASM_GENERIC_CODETAG_LDS_H */
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index 5dd3a61d673d..c9997dc50c50 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -50,6 +50,8 @@
> * [__nosave_begin, __nosave_end] for the nosave data
> */
>
> +#include <asm-generic/codetag.lds.h>
> +
> #ifndef LOAD_OFFSET
> #define LOAD_OFFSET 0
> #endif
> @@ -366,6 +368,7 @@
> . = ALIGN(8); \
> BOUNDED_SECTION_BY(__dyndbg_classes, ___dyndbg_classes) \
> BOUNDED_SECTION_BY(__dyndbg, ___dyndbg) \
> + CODETAG_SECTIONS() \
> LIKELY_PROFILE() \
> BRANCH_PROFILE() \
> TRACE_PRINTKS() \
> diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
> new file mode 100644
> index 000000000000..cf55a149fa84
> --- /dev/null
> +++ b/include/linux/alloc_tag.h
> @@ -0,0 +1,133 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * allocation tagging
> + */
> +#ifndef _LINUX_ALLOC_TAG_H
> +#define _LINUX_ALLOC_TAG_H
> +
> +#include <linux/bug.h>
> +#include <linux/codetag.h>
> +#include <linux/container_of.h>
> +#include <linux/preempt.h>
> +#include <asm/percpu.h>
> +#include <linux/cpumask.h>
> +#include <linux/static_key.h>
> +
> +struct alloc_tag_counters {
> + u64 bytes;
> + u64 calls;
> +};
> +
> +/*
> + * An instance of this structure is created in a special ELF section at every
> + * allocation callsite. At runtime, the special section is treated as
> + * an array of these. Embedded codetag utilizes codetag framework.
> + */
> +struct alloc_tag {
> + struct codetag ct;
> + struct alloc_tag_counters __percpu *counters;
> +} __aligned(8);
> +
> +#ifdef CONFIG_MEM_ALLOC_PROFILING
> +
> +static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
> +{
> + return container_of(ct, struct alloc_tag, ct);
> +}
> +
> +#ifdef ARCH_NEEDS_WEAK_PER_CPU
> +/*
> + * When percpu variables are required to be defined as weak, static percpu
> + * variables can't be used inside a function (see comments for DECLARE_PER_CPU_SECTION).
> + */
> +#error "Memory allocation profiling is incompatible with ARCH_NEEDS_WEAK_PER_CPU"
Is this enforced via Kconfig as well? (Looks like only alpha and s390?)
> +#endif
> +
> +#define DEFINE_ALLOC_TAG(_alloc_tag, _old) \
> + static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr); \
> + static struct alloc_tag _alloc_tag __used __aligned(8) \
> + __section("alloc_tags") = { \
> + .ct = CODE_TAG_INIT, \
> + .counters = &_alloc_tag_cntr }; \
> + struct alloc_tag * __maybe_unused _old = alloc_tag_save(&_alloc_tag)
> +
> +DECLARE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
> + mem_alloc_profiling_key);
> +
> +static inline bool mem_alloc_profiling_enabled(void)
> +{
> + return static_branch_maybe(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
> + &mem_alloc_profiling_key);
> +}
> +
> +static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
> +{
> + struct alloc_tag_counters v = { 0, 0 };
> + struct alloc_tag_counters *counter;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + counter = per_cpu_ptr(tag->counters, cpu);
> + v.bytes += counter->bytes;
> + v.calls += counter->calls;
> + }
> +
> + return v;
> +}
> +
> +static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes)
> +{
> + struct alloc_tag *tag;
> +
> +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> + WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n");
> +#endif
> + if (!ref || !ref->ct)
> + return;
> +
> + tag = ct_to_alloc_tag(ref->ct);
> +
> + this_cpu_sub(tag->counters->bytes, bytes);
> + this_cpu_dec(tag->counters->calls);
> +
> + ref->ct = NULL;
> +}
> +
> +static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
> +{
> + __alloc_tag_sub(ref, bytes);
> +}
> +
> +static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes)
> +{
> + __alloc_tag_sub(ref, bytes);
> +}
> +
> +static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
> +{
> +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> + WARN_ONCE(ref && ref->ct,
> + "alloc_tag was not cleared (got tag for %s:%u)\n",\
> + ref->ct->filename, ref->ct->lineno);
> +
> + WARN_ONCE(!tag, "current->alloc_tag not set");
> +#endif
> + if (!ref || !tag)
> + return;
> +
> + ref->ct = &tag->ct;
> + this_cpu_add(tag->counters->bytes, bytes);
> + this_cpu_inc(tag->counters->calls);
> +}
> +
> +#else
> +
> +#define DEFINE_ALLOC_TAG(_alloc_tag, _old)
> +static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
> +static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
> +static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
> + size_t bytes) {}
> +
> +#endif
> +
> +#endif /* _LINUX_ALLOC_TAG_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ffe8f618ab86..da68a10517c8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -770,6 +770,10 @@ struct task_struct {
> unsigned int flags;
> unsigned int ptrace;
>
> +#ifdef CONFIG_MEM_ALLOC_PROFILING
> + struct alloc_tag *alloc_tag;
> +#endif
Normally scheduling is very sensitive to having anything early in
task_struct. I would suggest moving this to the CONFIG_SCHED_CORE ifdef
area.
> +
> #ifdef CONFIG_SMP
> int on_cpu;
> struct __call_single_node wake_entry;
> @@ -810,6 +814,7 @@ struct task_struct {
> struct task_group *sched_task_group;
> #endif
>
> +
> #ifdef CONFIG_UCLAMP_TASK
> /*
> * Clamp values requested for a scheduling entity.
> @@ -2183,4 +2188,23 @@ static inline int sched_core_idle_cpu(int cpu) { return idle_cpu(cpu); }
>
> extern void sched_set_stop_task(int cpu, struct task_struct *stop);
>
> +#ifdef CONFIG_MEM_ALLOC_PROFILING
> +static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag)
> +{
> + swap(current->alloc_tag, tag);
> + return tag;
> +}
> +
> +static inline void alloc_tag_restore(struct alloc_tag *tag, struct alloc_tag *old)
> +{
> +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> + WARN(current->alloc_tag != tag, "current->alloc_tag was changed:\n");
> +#endif
> + current->alloc_tag = old;
> +}
> +#else
> +static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag) { return NULL; }
> +#define alloc_tag_restore(_tag, _old)
> +#endif
> +
> #endif
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 0be2d00c3696..78d258ca508f 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -972,6 +972,31 @@ config CODE_TAGGING
> bool
> select KALLSYMS
>
> +config MEM_ALLOC_PROFILING
> + bool "Enable memory allocation profiling"
> + default n
> + depends on PROC_FS
> + depends on !DEBUG_FORCE_WEAK_PER_CPU
> + select CODE_TAGGING
> + help
> + Track allocation source code and record total allocation size
> + initiated at that code location. The mechanism can be used to track
> + memory leaks with a low performance and memory impact.
> +
> +config MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> + bool "Enable memory allocation profiling by default"
> + default y
> + depends on MEM_ALLOC_PROFILING
> +
> +config MEM_ALLOC_PROFILING_DEBUG
> + bool "Memory allocation profiler debugging"
> + default n
> + depends on MEM_ALLOC_PROFILING
> + select MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> + help
> + Adds warnings with helpful error messages for memory allocation
> + profiling.
> +
> source "lib/Kconfig.kasan"
> source "lib/Kconfig.kfence"
> source "lib/Kconfig.kmsan"
> diff --git a/lib/Makefile b/lib/Makefile
> index 6b48b22fdfac..859112f09bf5 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -236,6 +236,8 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
> obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
>
> obj-$(CONFIG_CODE_TAGGING) += codetag.o
> +obj-$(CONFIG_MEM_ALLOC_PROFILING) += alloc_tag.o
> +
> lib-$(CONFIG_GENERIC_BUG) += bug.o
>
> obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> new file mode 100644
> index 000000000000..4fc031f9cefd
> --- /dev/null
> +++ b/lib/alloc_tag.c
> @@ -0,0 +1,158 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/alloc_tag.h>
> +#include <linux/fs.h>
> +#include <linux/gfp.h>
> +#include <linux/module.h>
> +#include <linux/proc_fs.h>
> +#include <linux/seq_buf.h>
> +#include <linux/seq_file.h>
> +
> +static struct codetag_type *alloc_tag_cttype;
> +
> +DEFINE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
> + mem_alloc_profiling_key);
> +
> +static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> +{
> + struct codetag_iterator *iter;
> + struct codetag *ct;
> + loff_t node = *pos;
> +
> + iter = kzalloc(sizeof(*iter), GFP_KERNEL);
> + m->private = iter;
> + if (!iter)
> + return NULL;
> +
> + codetag_lock_module_list(alloc_tag_cttype, true);
> + *iter = codetag_get_ct_iter(alloc_tag_cttype);
> + while ((ct = codetag_next_ct(iter)) != NULL && node)
> + node--;
> +
> + return ct ? iter : NULL;
> +}
> +
> +static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
> +{
> + struct codetag_iterator *iter = (struct codetag_iterator *)arg;
> + struct codetag *ct = codetag_next_ct(iter);
> +
> + (*pos)++;
> + if (!ct)
> + return NULL;
> +
> + return iter;
> +}
> +
> +static void allocinfo_stop(struct seq_file *m, void *arg)
> +{
> + struct codetag_iterator *iter = (struct codetag_iterator *)m->private;
> +
> + if (iter) {
> + codetag_lock_module_list(alloc_tag_cttype, false);
> + kfree(iter);
> + }
> +}
> +
> +static void alloc_tag_to_text(struct seq_buf *out, struct codetag *ct)
> +{
> + struct alloc_tag *tag = ct_to_alloc_tag(ct);
> + struct alloc_tag_counters counter = alloc_tag_read(tag);
> + s64 bytes = counter.bytes;
> + char val[10], *p = val;
> +
> + if (bytes < 0) {
> + *p++ = '-';
> + bytes = -bytes;
> + }
> +
> + string_get_size(bytes, 1,
> + STRING_SIZE_BASE2|STRING_SIZE_NOSPACE,
> + p, val + ARRAY_SIZE(val) - p);
> +
> + seq_buf_printf(out, "%8s %8llu ", val, counter.calls);
> + codetag_to_text(out, ct);
> + seq_buf_putc(out, ' ');
> + seq_buf_putc(out, '\n');
> +}
/me does happy seq_buf dance!
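For context on what is being printed: the counters are kept per CPU and only
the sum across CPUs is meaningful, because an object can be allocated on one
CPU and freed on another. A userspace sketch of that accounting, with a
plain array standing in for the __percpu allocation; DEMO_NR_CPUS and the
demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS 4	/* hypothetical CPU count */

struct demo_counters {
	uint64_t bytes;
	uint64_t calls;
};

/* One slot per CPU, standing in for a __percpu allocation. */
struct demo_tag {
	struct demo_counters percpu[DEMO_NR_CPUS];
};

/* Account an allocation on the CPU where it happened. */
static void demo_add(struct demo_tag *tag, int cpu, uint64_t bytes)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	tag->percpu[cpu].bytes += bytes;
	tag->percpu[cpu].calls++;
}

/* Account a free on the CPU where it happened. */
static void demo_sub(struct demo_tag *tag, int cpu, uint64_t bytes)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	tag->percpu[cpu].bytes -= bytes;	/* may wrap below zero on this CPU */
	tag->percpu[cpu].calls--;
}

/* Only the sum over all CPUs is a meaningful value. */
static struct demo_counters demo_read(const struct demo_tag *tag)
{
	struct demo_counters v = { 0, 0 };

	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		v.bytes += tag->percpu[cpu].bytes;
		v.calls += tag->percpu[cpu].calls;
	}
	return v;
}
```

If CPU 0 allocates 4096 and 512 bytes and CPU 1 frees the 4096-byte object,
CPU 1's slot wraps around as an unsigned value, yet demo_read() still
returns 512 bytes in 1 call because the modular arithmetic cancels out in
the sum.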
> +
> +static int allocinfo_show(struct seq_file *m, void *arg)
> +{
> + struct codetag_iterator *iter = (struct codetag_iterator *)arg;
> + char *bufp;
> + size_t n = seq_get_buf(m, &bufp);
> + struct seq_buf buf;
> +
> + seq_buf_init(&buf, bufp, n);
> + alloc_tag_to_text(&buf, iter->ct);
> + seq_commit(m, seq_buf_used(&buf));
> + return 0;
> +}
> +
> +static const struct seq_operations allocinfo_seq_op = {
> + .start = allocinfo_start,
> + .next = allocinfo_next,
> + .stop = allocinfo_stop,
> + .show = allocinfo_show,
> +};
> +
> +static void __init procfs_init(void)
> +{
> + proc_create_seq("allocinfo", 0444, NULL, &allocinfo_seq_op);
> +}
As mentioned, this really should be in /sys somewhere.
> +
> +static bool alloc_tag_module_unload(struct codetag_type *cttype,
> + struct codetag_module *cmod)
> +{
> + struct codetag_iterator iter = codetag_get_ct_iter(cttype);
> + struct alloc_tag_counters counter;
> + bool module_unused = true;
> + struct alloc_tag *tag;
> + struct codetag *ct;
> +
> + for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
> + if (iter.cmod != cmod)
> + continue;
> +
> + tag = ct_to_alloc_tag(ct);
> + counter = alloc_tag_read(tag);
> +
> + if (WARN(counter.bytes, "%s:%u module %s func:%s has %llu allocated at module unload",
> + ct->filename, ct->lineno, ct->modname, ct->function, counter.bytes))
> + module_unused = false;
> + }
> +
> + return module_unused;
> +}
> +
> +static struct ctl_table memory_allocation_profiling_sysctls[] = {
> + {
> + .procname = "mem_profiling",
> + .data = &mem_alloc_profiling_key,
> +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> + .mode = 0444,
> +#else
> + .mode = 0644,
> +#endif
> + .proc_handler = proc_do_static_key,
> + },
> + { }
> +};
> +
> +static int __init alloc_tag_init(void)
> +{
> + const struct codetag_type_desc desc = {
> + .section = "alloc_tags",
> + .tag_size = sizeof(struct alloc_tag),
> + .module_unload = alloc_tag_module_unload,
> + };
> +
> + alloc_tag_cttype = codetag_register_type(&desc);
> + if (IS_ERR_OR_NULL(alloc_tag_cttype))
> + return PTR_ERR(alloc_tag_cttype);
> +
> + register_sysctl_init("vm", memory_allocation_profiling_sysctls);
> + procfs_init();
> +
> + return 0;
> +}
> +module_init(alloc_tag_init);
> diff --git a/scripts/module.lds.S b/scripts/module.lds.S
> index bf5bcf2836d8..45c67a0994f3 100644
> --- a/scripts/module.lds.S
> +++ b/scripts/module.lds.S
> @@ -9,6 +9,8 @@
> #define DISCARD_EH_FRAME *(.eh_frame)
> #endif
>
> +#include <asm-generic/codetag.lds.h>
> +
> SECTIONS {
> /DISCARD/ : {
> *(.discard)
> @@ -47,12 +49,17 @@ SECTIONS {
> .data : {
> *(.data .data.[0-9a-zA-Z_]*)
> *(.data..L*)
> + CODETAG_SECTIONS()
> }
>
> .rodata : {
> *(.rodata .rodata.[0-9a-zA-Z_]*)
> *(.rodata..L*)
> }
> +#else
> + .data : {
> + CODETAG_SECTIONS()
> + }
> #endif
> }
Otherwise, looks good.
--
Kees Cook
On Mon, Feb 12, 2024 at 01:39:21PM -0800, Suren Baghdasaryan wrote:
> From: Kent Overstreet <[email protected]>
>
> The new code & libraries added are being maintained - mark them as such.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> MAINTAINERS | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 73d898383e51..6da139418775 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5210,6 +5210,13 @@ S: Supported
> F: Documentation/process/code-of-conduct-interpretation.rst
> F: Documentation/process/code-of-conduct.rst
>
> +CODE TAGGING
> +M: Suren Baghdasaryan <[email protected]>
> +M: Kent Overstreet <[email protected]>
> +S: Maintained
> +F: include/linux/codetag.h
> +F: lib/codetag.c
> +
> COMEDI DRIVERS
> M: Ian Abbott <[email protected]>
> M: H Hartley Sweeten <[email protected]>
> @@ -14056,6 +14063,15 @@ F: mm/memblock.c
> F: mm/mm_init.c
> F: tools/testing/memblock/
>
> +MEMORY ALLOCATION PROFILING
> +M: Suren Baghdasaryan <[email protected]>
> +M: Kent Overstreet <[email protected]>
> +S: Maintained
> +F: include/linux/alloc_tag.h
> +F: include/linux/codetag_ctx.h
> +F: lib/alloc_tag.c
> +F: lib/pgalloc_tag.c
Any mailing list to aim at? linux-mm maybe?
Regardless:
Reviewed-by: Kees Cook <[email protected]>
> +
> MEMORY CONTROLLER DRIVERS
> M: Krzysztof Kozlowski <[email protected]>
> L: [email protected]
> --
> 2.43.0.687.g38aa6559b0-goog
>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:39:19PM -0800, Suren Baghdasaryan wrote:
> To avoid debug warnings while freeing reserved pages which were not
> allocated with the usual allocators, mark their codetags as empty before
> freeing.
How do these get their codetags to begin with? Regardless:
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:39:20PM -0800, Suren Baghdasaryan wrote:
> If slabobj_ext vector allocation for a slab object fails and later on it
> succeeds for another object in the same slab, the slabobj_ext for the
> original object will be NULL and will be flagged when
> CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled.
> Mark failed slabobj_ext vector allocations using a new objext_flags flag
> stored in the lower bits of slab->obj_exts. When a new allocation succeeds,
> it marks all tag references in the same slabobj_ext vector as empty to
> avoid the warnings raised by CONFIG_MEM_ALLOC_PROFILING_DEBUG checks.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/memcontrol.h | 4 +++-
> mm/slab.h | 25 +++++++++++++++++++++++++
> mm/slab_common.c | 22 +++++++++++++++-------
> 3 files changed, 43 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 2b010316016c..f95241ca9052 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -365,8 +365,10 @@ enum page_memcg_data_flags {
> #endif /* CONFIG_MEMCG */
>
> enum objext_flags {
> + /* slabobj_ext vector failed to allocate */
> + OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG,
> /* the next bit after the last actual flag */
> - __NR_OBJEXTS_FLAGS = __FIRST_OBJEXT_FLAG,
> + __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1),
> };
>
> #define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
> diff --git a/mm/slab.h b/mm/slab.h
> index cf332a839bf4..7bb3900f83ef 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -586,9 +586,34 @@ static inline void mark_objexts_empty(struct slabobj_ext *obj_exts)
> }
> }
>
> +static inline void mark_failed_objexts_alloc(struct slab *slab)
> +{
> + slab->obj_exts = OBJEXTS_ALLOC_FAIL;
Uh, does this mean slab->obj_exts is suddenly non-NULL? Is everything
that accesses obj_exts expecting this?
-Kees
> +}
> +
> +static inline void handle_failed_objexts_alloc(unsigned long obj_exts,
> + struct slabobj_ext *vec, unsigned int objects)
> +{
> + /*
> + * If vector previously failed to allocate then we have live
> + * objects with no tag reference. Mark all references in this
> + * vector as empty to avoid warnings later on.
> + */
> + if (obj_exts & OBJEXTS_ALLOC_FAIL) {
> + unsigned int i;
> +
> + for (i = 0; i < objects; i++)
> + set_codetag_empty(&vec[i].ref);
> + }
> +}
> +
> +
> #else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
>
> static inline void mark_objexts_empty(struct slabobj_ext *obj_exts) {}
> +static inline void mark_failed_objexts_alloc(struct slab *slab) {}
> +static inline void handle_failed_objexts_alloc(unsigned long obj_exts,
> + struct slabobj_ext *vec, unsigned int objects) {}
>
> #endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index d5f75d04ced2..489c7a8ba8f1 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -214,29 +214,37 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> gfp_t gfp, bool new_slab)
> {
> unsigned int objects = objs_per_slab(s, slab);
> - unsigned long obj_exts;
> - void *vec;
> + unsigned long new_exts;
> + unsigned long old_exts;
> + struct slabobj_ext *vec;
>
> gfp &= ~OBJCGS_CLEAR_MASK;
> /* Prevent recursive extension vector allocation */
> gfp |= __GFP_NO_OBJ_EXT;
> vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> slab_nid(slab));
> - if (!vec)
> + if (!vec) {
> + /* Mark vectors which failed to allocate */
> + if (new_slab)
> + mark_failed_objexts_alloc(slab);
> +
> return -ENOMEM;
> + }
>
> - obj_exts = (unsigned long)vec;
> + new_exts = (unsigned long)vec;
> #ifdef CONFIG_MEMCG
> - obj_exts |= MEMCG_DATA_OBJEXTS;
> + new_exts |= MEMCG_DATA_OBJEXTS;
> #endif
> + old_exts = slab->obj_exts;
> + handle_failed_objexts_alloc(old_exts, vec, objects);
> if (new_slab) {
> /*
> * If the slab is brand new and nobody can yet access its
> * obj_exts, no synchronization is required and obj_exts can
> * be simply assigned.
> */
> - slab->obj_exts = obj_exts;
> - } else if (cmpxchg(&slab->obj_exts, 0, obj_exts)) {
> + slab->obj_exts = new_exts;
> + } else if (cmpxchg(&slab->obj_exts, old_exts, new_exts) != old_exts) {
> /*
> * If the slab is already in use, somebody can allocate and
> * assign slabobj_exts in parallel. In this case the existing
> --
> 2.43.0.687.g38aa6559b0-goog
>
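On the non-NULL question above, a minimal userspace sketch of the low-bit
flag scheme the commit message describes. It assumes that readers of
obj_exts go through an accessor that masks off the flag bits; whether every
kernel caller does so is exactly what the question asks, and this sketch
does not settle it. All demo_* names and bit values are hypothetical and
are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag layout: low two bits of the word are reserved. */
#define DEMO_ALLOC_FAIL ((uintptr_t)0x1)
#define DEMO_FLAGS_MASK ((uintptr_t)0x3)

/* Stand-in for one per-object extension entry. */
struct demo_ext {
	unsigned long ref;
};

/* Stand-in for struct slab: pointer and flags share one word. */
struct demo_slab {
	uintptr_t obj_exts;
};

/* Accessor: strip the flag bits before treating the word as a pointer. */
static struct demo_ext *demo_slab_exts(const struct demo_slab *slab)
{
	return (struct demo_ext *)(slab->obj_exts & ~DEMO_FLAGS_MASK);
}

/* Record that the vector allocation failed; no vector is installed. */
static void demo_mark_failed(struct demo_slab *slab)
{
	slab->obj_exts = DEMO_ALLOC_FAIL;
}

/* True if an earlier vector allocation for this slab failed. */
static int demo_alloc_failed_before(const struct demo_slab *slab)
{
	return (slab->obj_exts & DEMO_ALLOC_FAIL) != 0;
}

/* Install a vector; it must be aligned so the flag bits are free. */
static void demo_install(struct demo_slab *slab, struct demo_ext *vec)
{
	assert(((uintptr_t)vec & DEMO_FLAGS_MASK) == 0);
	slab->obj_exts = (uintptr_t)vec;
}
```

With this layout the raw word is non-zero after demo_mark_failed(), but
demo_slab_exts() still returns NULL. Any code that tests the raw word
against 0 instead of using the accessor would see the difference.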
--
Kees Cook
On Mon, Feb 12, 2024 at 01:39:03PM -0800, Suren Baghdasaryan wrote:
> Redefine page allocators to record allocation tags upon their invocation.
> Instrument post_alloc_hook and free_pages_prepare to modify the current
> allocation tag.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Co-developed-by: Kent Overstreet <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> include/linux/alloc_tag.h | 10 +++
> include/linux/gfp.h | 126 ++++++++++++++++++++++++--------------
> include/linux/pagemap.h | 9 ++-
> mm/compaction.c | 7 ++-
> mm/filemap.c | 6 +-
> mm/mempolicy.c | 52 ++++++++--------
> mm/page_alloc.c | 60 +++++++++---------
> 7 files changed, 160 insertions(+), 110 deletions(-)
>
> diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
> index cf55a149fa84..6fa8a94d8bc1 100644
> --- a/include/linux/alloc_tag.h
> +++ b/include/linux/alloc_tag.h
> @@ -130,4 +130,14 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
>
> #endif
>
> +#define alloc_hooks(_do_alloc) \
> +({ \
> + typeof(_do_alloc) _res; \
> + DEFINE_ALLOC_TAG(_alloc_tag, _old); \
> + \
> + _res = _do_alloc; \
> + alloc_tag_restore(&_alloc_tag, _old); \
> + _res; \
> +})
I am delighted to see that __alloc_size survives this indirection.
AFAICT, all the fortify goo continues to work with this in use.
Reviewed-by: Kees Cook <[email protected]>
-Kees
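For anyone following along, a minimal userspace sketch of the wrapper
pattern, assuming GNU C (statement expressions, __typeof__). The value of
the statement expression is the wrapped call itself, which is plausibly why
an alloc_size annotation on the _noprof function remains visible to the
compiler. Every demo_* name is hypothetical; the real DEFINE_ALLOC_TAG()
and alloc_tag_restore() helpers are not reproduced here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-callsite allocation tag. */
struct demo_tag {
	const char *file;
	int line;
	unsigned long calls;
	size_t bytes;
};

/* Tag in effect for the running call chain (per-task in the kernel). */
static struct demo_tag *demo_current_tag;

/* Kept only so the sketch can be inspected from outside the macro. */
static struct demo_tag *demo_last_charged;

/* Unhooked allocator: charges whatever tag is current, if any. */
__attribute__((malloc, alloc_size(1)))
static void *demo_alloc_noprof(size_t size)
{
	void *p = malloc(size);

	if (p && demo_current_tag) {
		demo_current_tag->calls++;
		demo_current_tag->bytes += size;
		demo_last_charged = demo_current_tag;
	}
	return p;
}

/*
 * One static tag per expansion site. The tag is installed around the
 * wrapped call and the previous tag is restored afterwards.
 */
#define demo_alloc_hooks(_do_alloc)                                 \
({                                                                  \
	static struct demo_tag _tag = { __FILE__, __LINE__, 0, 0 }; \
	struct demo_tag *_old = demo_current_tag;                   \
	__typeof__(_do_alloc) _res;                                 \
                                                                    \
	demo_current_tag = &_tag;                                   \
	_res = _do_alloc;                                           \
	demo_current_tag = _old;                                    \
	_res;                                                       \
})

/* Public name becomes a macro, so each caller gets its own tag. */
#define demo_alloc(...) demo_alloc_hooks(demo_alloc_noprof(__VA_ARGS__))
```

Two separate demo_alloc() call sites end up with two distinct static tags,
while a helper that calls demo_alloc_noprof() directly is charged to
whichever caller installed the current tag. That is the reason the series
keeps both a hooked and a _noprof version of each allocator.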
> +
> #endif /* _LINUX_ALLOC_TAG_H */
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index de292a007138..bc0fd5259b0b 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -6,6 +6,8 @@
>
> #include <linux/mmzone.h>
> #include <linux/topology.h>
> +#include <linux/alloc_tag.h>
> +#include <linux/sched.h>
>
> struct vm_area_struct;
> struct mempolicy;
> @@ -175,42 +177,46 @@ static inline void arch_free_page(struct page *page, int order) { }
> static inline void arch_alloc_page(struct page *page, int order) { }
> #endif
>
> -struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
> +struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
> nodemask_t *nodemask);
> -struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
> +#define __alloc_pages(...) alloc_hooks(__alloc_pages_noprof(__VA_ARGS__))
> +
> +struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
> nodemask_t *nodemask);
> +#define __folio_alloc(...) alloc_hooks(__folio_alloc_noprof(__VA_ARGS__))
>
> -unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> +unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> nodemask_t *nodemask, int nr_pages,
> struct list_head *page_list,
> struct page **page_array);
> +#define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
>
> -unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
> +unsigned long alloc_pages_bulk_array_mempolicy_noprof(gfp_t gfp,
> unsigned long nr_pages,
> struct page **page_array);
> +#define alloc_pages_bulk_array_mempolicy(...) \
> + alloc_hooks(alloc_pages_bulk_array_mempolicy_noprof(__VA_ARGS__))
>
> /* Bulk allocate order-0 pages */
> -static inline unsigned long
> -alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
> -{
> - return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL);
> -}
> +#define alloc_pages_bulk_list(_gfp, _nr_pages, _list) \
> + __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _list, NULL)
>
> -static inline unsigned long
> -alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array)
> -{
> - return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array);
> -}
> +#define alloc_pages_bulk_array(_gfp, _nr_pages, _page_array) \
> + __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, NULL, _page_array)
>
> static inline unsigned long
> -alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
> +alloc_pages_bulk_array_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages,
> + struct page **page_array)
> {
> if (nid == NUMA_NO_NODE)
> nid = numa_mem_id();
>
> - return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
> + return alloc_pages_bulk_noprof(gfp, nid, NULL, nr_pages, NULL, page_array);
> }
>
> +#define alloc_pages_bulk_array_node(...) \
> + alloc_hooks(alloc_pages_bulk_array_node_noprof(__VA_ARGS__))
> +
> static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
> {
> gfp_t warn_gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN);
> @@ -230,82 +236,104 @@ static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
> * online. For more general interface, see alloc_pages_node().
> */
> static inline struct page *
> -__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> +__alloc_pages_node_noprof(int nid, gfp_t gfp_mask, unsigned int order)
> {
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> warn_if_node_offline(nid, gfp_mask);
>
> - return __alloc_pages(gfp_mask, order, nid, NULL);
> + return __alloc_pages_noprof(gfp_mask, order, nid, NULL);
> }
>
> +#define __alloc_pages_node(...) alloc_hooks(__alloc_pages_node_noprof(__VA_ARGS__))
> +
> static inline
> -struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
> +struct folio *__folio_alloc_node_noprof(gfp_t gfp, unsigned int order, int nid)
> {
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> warn_if_node_offline(nid, gfp);
>
> - return __folio_alloc(gfp, order, nid, NULL);
> + return __folio_alloc_noprof(gfp, order, nid, NULL);
> }
>
> +#define __folio_alloc_node(...) alloc_hooks(__folio_alloc_node_noprof(__VA_ARGS__))
> +
> /*
> * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
> * prefer the current CPU's closest node. Otherwise node must be valid and
> * online.
> */
> -static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> - unsigned int order)
> +static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
> + unsigned int order)
> {
> if (nid == NUMA_NO_NODE)
> nid = numa_mem_id();
>
> - return __alloc_pages_node(nid, gfp_mask, order);
> + return __alloc_pages_node_noprof(nid, gfp_mask, order);
> }
>
> +#define alloc_pages_node(...) alloc_hooks(alloc_pages_node_noprof(__VA_ARGS__))
> +
> #ifdef CONFIG_NUMA
> -struct page *alloc_pages(gfp_t gfp, unsigned int order);
> -struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> +struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order);
> +struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order,
> struct mempolicy *mpol, pgoff_t ilx, int nid);
> -struct folio *folio_alloc(gfp_t gfp, unsigned int order);
> -struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
> +struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order);
> +struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
> unsigned long addr, bool hugepage);
> #else
> -static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
> +static inline struct page *alloc_pages_noprof(gfp_t gfp_mask, unsigned int order)
> {
> - return alloc_pages_node(numa_node_id(), gfp_mask, order);
> + return alloc_pages_node_noprof(numa_node_id(), gfp_mask, order);
> }
> -static inline struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> +static inline struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order,
> struct mempolicy *mpol, pgoff_t ilx, int nid)
> {
> - return alloc_pages(gfp, order);
> + return alloc_pages_noprof(gfp, order);
> }
> -static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
> +static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
> {
> return __folio_alloc_node(gfp, order, numa_node_id());
> }
> -#define vma_alloc_folio(gfp, order, vma, addr, hugepage) \
> - folio_alloc(gfp, order)
> +#define vma_alloc_folio_noprof(gfp, order, vma, addr, hugepage) \
> + folio_alloc_noprof(gfp, order)
> #endif
> +
> +#define alloc_pages(...) alloc_hooks(alloc_pages_noprof(__VA_ARGS__))
> +#define alloc_pages_mpol(...) alloc_hooks(alloc_pages_mpol_noprof(__VA_ARGS__))
> +#define folio_alloc(...) alloc_hooks(folio_alloc_noprof(__VA_ARGS__))
> +#define vma_alloc_folio(...) alloc_hooks(vma_alloc_folio_noprof(__VA_ARGS__))
> +
> #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
> -static inline struct page *alloc_page_vma(gfp_t gfp,
> +
> +static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
> struct vm_area_struct *vma, unsigned long addr)
> {
> - struct folio *folio = vma_alloc_folio(gfp, 0, vma, addr, false);
> + struct folio *folio = vma_alloc_folio_noprof(gfp, 0, vma, addr, false);
>
> return &folio->page;
> }
> +#define alloc_page_vma(...) alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
> +
> +extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
> +#define __get_free_pages(...) alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
>
> -extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
> -extern unsigned long get_zeroed_page(gfp_t gfp_mask);
> +extern unsigned long get_zeroed_page_noprof(gfp_t gfp_mask);
> +#define get_zeroed_page(...) alloc_hooks(get_zeroed_page_noprof(__VA_ARGS__))
> +
> +void *alloc_pages_exact_noprof(size_t size, gfp_t gfp_mask) __alloc_size(1);
> +#define alloc_pages_exact(...) alloc_hooks(alloc_pages_exact_noprof(__VA_ARGS__))
>
> -void *alloc_pages_exact(size_t size, gfp_t gfp_mask) __alloc_size(1);
> void free_pages_exact(void *virt, size_t size);
> -__meminit void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);
>
> -#define __get_free_page(gfp_mask) \
> - __get_free_pages((gfp_mask), 0)
> +__meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);
> +#define alloc_pages_exact_nid(...) \
> + alloc_hooks(alloc_pages_exact_nid_noprof(__VA_ARGS__))
> +
> +#define __get_free_page(gfp_mask) \
> + __get_free_pages((gfp_mask), 0)
>
> -#define __get_dma_pages(gfp_mask, order) \
> - __get_free_pages((gfp_mask) | GFP_DMA, (order))
> +#define __get_dma_pages(gfp_mask, order) \
> + __get_free_pages((gfp_mask) | GFP_DMA, (order))
>
> extern void __free_pages(struct page *page, unsigned int order);
> extern void free_pages(unsigned long addr, unsigned int order);
> @@ -357,10 +385,14 @@ extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma);
>
> #ifdef CONFIG_CONTIG_ALLOC
> /* The below functions must be run on a range from a single zone. */
> -extern int alloc_contig_range(unsigned long start, unsigned long end,
> +extern int alloc_contig_range_noprof(unsigned long start, unsigned long end,
> unsigned migratetype, gfp_t gfp_mask);
> -extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
> - int nid, nodemask_t *nodemask);
> +#define alloc_contig_range(...) alloc_hooks(alloc_contig_range_noprof(__VA_ARGS__))
> +
> +extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
> + int nid, nodemask_t *nodemask);
> +#define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
> +
> #endif
> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2df35e65557d..35636e67e2e1 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -542,14 +542,17 @@ static inline void *detach_page_private(struct page *page)
> #endif
>
> #ifdef CONFIG_NUMA
> -struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
> +struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
> #else
> -static inline struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
> +static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> {
> - return folio_alloc(gfp, order);
> + return folio_alloc_noprof(gfp, order);
> }
> #endif
>
> +#define filemap_alloc_folio(...) \
> + alloc_hooks(filemap_alloc_folio_noprof(__VA_ARGS__))
> +
> static inline struct page *__page_cache_alloc(gfp_t gfp)
> {
> return &filemap_alloc_folio(gfp, 0)->page;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4add68d40e8d..f4c0e682c979 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1781,7 +1781,7 @@ static void isolate_freepages(struct compact_control *cc)
> * This is a migrate-callback that "allocates" freepages by taking pages
> * from the isolated freelists in the block we are migrating to.
> */
> -static struct folio *compaction_alloc(struct folio *src, unsigned long data)
> +static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long data)
> {
> struct compact_control *cc = (struct compact_control *)data;
> struct folio *dst;
> @@ -1800,6 +1800,11 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
> return dst;
> }
>
> +static struct folio *compaction_alloc(struct folio *src, unsigned long data)
> +{
> + return alloc_hooks(compaction_alloc_noprof(src, data));
> +}
> +
> /*
> * This is a migrate-callback that "frees" freepages back to the isolated
> * freelist. All pages on the freelist are from the same zone, so there is no
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 750e779c23db..e51e474545ad 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -957,7 +957,7 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
> EXPORT_SYMBOL_GPL(filemap_add_folio);
>
> #ifdef CONFIG_NUMA
> -struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
> +struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
> {
> int n;
> struct folio *folio;
> @@ -972,9 +972,9 @@ struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
>
> return folio;
> }
> - return folio_alloc(gfp, order);
> + return folio_alloc_noprof(gfp, order);
> }
> -EXPORT_SYMBOL(filemap_alloc_folio);
> +EXPORT_SYMBOL(filemap_alloc_folio_noprof);
> #endif
>
> /*
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..c329d00b975f 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2070,15 +2070,15 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
> */
> preferred_gfp = gfp | __GFP_NOWARN;
> preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
> - page = __alloc_pages(preferred_gfp, order, nid, nodemask);
> + page = __alloc_pages_noprof(preferred_gfp, order, nid, nodemask);
> if (!page)
> - page = __alloc_pages(gfp, order, nid, NULL);
> + page = __alloc_pages_noprof(gfp, order, nid, NULL);
>
> return page;
> }
>
> /**
> - * alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
> + * alloc_pages_mpol_noprof - Allocate pages according to NUMA mempolicy.
> * @gfp: GFP flags.
> * @order: Order of the page allocation.
> * @pol: Pointer to the NUMA mempolicy.
> @@ -2087,7 +2087,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
> *
> * Return: The page on success or NULL if allocation fails.
> */
> -struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> +struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order,
> struct mempolicy *pol, pgoff_t ilx, int nid)
> {
> nodemask_t *nodemask;
> @@ -2117,7 +2117,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> * First, try to allocate THP only on local node, but
> * don't reclaim unnecessarily, just compact.
> */
> - page = __alloc_pages_node(nid,
> + page = __alloc_pages_node_noprof(nid,
> gfp | __GFP_THISNODE | __GFP_NORETRY, order);
> if (page || !(gfp & __GFP_DIRECT_RECLAIM))
> return page;
> @@ -2130,7 +2130,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> }
> }
>
> - page = __alloc_pages(gfp, order, nid, nodemask);
> + page = __alloc_pages_noprof(gfp, order, nid, nodemask);
>
> if (unlikely(pol->mode == MPOL_INTERLEAVE) && page) {
> /* skip NUMA_INTERLEAVE_HIT update if numa stats is disabled */
> @@ -2146,7 +2146,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> }
>
> /**
> - * vma_alloc_folio - Allocate a folio for a VMA.
> + * vma_alloc_folio_noprof - Allocate a folio for a VMA.
> * @gfp: GFP flags.
> * @order: Order of the folio.
> * @vma: Pointer to VMA.
> @@ -2161,7 +2161,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> *
> * Return: The folio on success or NULL if allocation fails.
> */
> -struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
> +struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct *vma,
> unsigned long addr, bool hugepage)
> {
> struct mempolicy *pol;
> @@ -2169,15 +2169,15 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
> struct page *page;
>
> pol = get_vma_policy(vma, addr, order, &ilx);
> - page = alloc_pages_mpol(gfp | __GFP_COMP, order,
> - pol, ilx, numa_node_id());
> + page = alloc_pages_mpol_noprof(gfp | __GFP_COMP, order,
> + pol, ilx, numa_node_id());
> mpol_cond_put(pol);
> return page_rmappable_folio(page);
> }
> -EXPORT_SYMBOL(vma_alloc_folio);
> +EXPORT_SYMBOL(vma_alloc_folio_noprof);
>
> /**
> - * alloc_pages - Allocate pages.
> + * alloc_pages_noprof - Allocate pages.
> * @gfp: GFP flags.
> * @order: Power of two of number of pages to allocate.
> *
> @@ -2190,7 +2190,7 @@ EXPORT_SYMBOL(vma_alloc_folio);
> * flags are used.
> * Return: The page on success or NULL if allocation fails.
> */
> -struct page *alloc_pages(gfp_t gfp, unsigned int order)
> +struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order)
> {
> struct mempolicy *pol = &default_policy;
>
> @@ -2201,16 +2201,16 @@ struct page *alloc_pages(gfp_t gfp, unsigned int order)
> if (!in_interrupt() && !(gfp & __GFP_THISNODE))
> pol = get_task_policy(current);
>
> - return alloc_pages_mpol(gfp, order,
> - pol, NO_INTERLEAVE_INDEX, numa_node_id());
> + return alloc_pages_mpol_noprof(gfp, order, pol, NO_INTERLEAVE_INDEX,
> + numa_node_id());
> }
> -EXPORT_SYMBOL(alloc_pages);
> +EXPORT_SYMBOL(alloc_pages_noprof);
>
> -struct folio *folio_alloc(gfp_t gfp, unsigned int order)
> +struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
> {
> - return page_rmappable_folio(alloc_pages(gfp | __GFP_COMP, order));
> + return page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
> }
> -EXPORT_SYMBOL(folio_alloc);
> +EXPORT_SYMBOL(folio_alloc_noprof);
>
> static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
> struct mempolicy *pol, unsigned long nr_pages,
> @@ -2229,13 +2229,13 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
>
> for (i = 0; i < nodes; i++) {
> if (delta) {
> - nr_allocated = __alloc_pages_bulk(gfp,
> + nr_allocated = alloc_pages_bulk_noprof(gfp,
> interleave_nodes(pol), NULL,
> nr_pages_per_node + 1, NULL,
> page_array);
> delta--;
> } else {
> - nr_allocated = __alloc_pages_bulk(gfp,
> + nr_allocated = alloc_pages_bulk_noprof(gfp,
> interleave_nodes(pol), NULL,
> nr_pages_per_node, NULL, page_array);
> }
> @@ -2257,11 +2257,11 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
> preferred_gfp = gfp | __GFP_NOWARN;
> preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
>
> - nr_allocated = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
> + nr_allocated = alloc_pages_bulk_noprof(preferred_gfp, nid, &pol->nodes,
> nr_pages, NULL, page_array);
>
> if (nr_allocated < nr_pages)
> - nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL,
> + nr_allocated += alloc_pages_bulk_noprof(gfp, numa_node_id(), NULL,
> nr_pages - nr_allocated, NULL,
> page_array + nr_allocated);
> return nr_allocated;
> @@ -2273,7 +2273,7 @@ static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
> * It can accelerate memory allocation especially interleaving
> * allocate memory.
> */
> -unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
> +unsigned long alloc_pages_bulk_array_mempolicy_noprof(gfp_t gfp,
> unsigned long nr_pages, struct page **page_array)
> {
> struct mempolicy *pol = &default_policy;
> @@ -2293,8 +2293,8 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
>
> nid = numa_node_id();
> nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid);
> - return __alloc_pages_bulk(gfp, nid, nodemask,
> - nr_pages, NULL, page_array);
> + return alloc_pages_bulk_noprof(gfp, nid, nodemask,
> + nr_pages, NULL, page_array);
> }
>
> int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index edb79a55a252..58c0e8b948a4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4380,7 +4380,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> *
> * Returns the number of pages on the list or array.
> */
> -unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> +unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> nodemask_t *nodemask, int nr_pages,
> struct list_head *page_list,
> struct page **page_array)
> @@ -4516,7 +4516,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> pcp_trylock_finish(UP_flags);
>
> failed:
> - page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
> + page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
> if (page) {
> if (page_list)
> list_add(&page->lru, page_list);
> @@ -4527,13 +4527,13 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>
> goto out;
> }
> -EXPORT_SYMBOL_GPL(__alloc_pages_bulk);
> +EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
>
> /*
> * This is the 'heart' of the zoned buddy allocator.
> */
> -struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
> - nodemask_t *nodemask)
> +struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> + int preferred_nid, nodemask_t *nodemask)
> {
> struct page *page;
> unsigned int alloc_flags = ALLOC_WMARK_LOW;
> @@ -4595,38 +4595,38 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
>
> return page;
> }
> -EXPORT_SYMBOL(__alloc_pages);
> +EXPORT_SYMBOL(__alloc_pages_noprof);
>
> -struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
> +struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
> nodemask_t *nodemask)
> {
> - struct page *page = __alloc_pages(gfp | __GFP_COMP, order,
> + struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
> preferred_nid, nodemask);
> return page_rmappable_folio(page);
> }
> -EXPORT_SYMBOL(__folio_alloc);
> +EXPORT_SYMBOL(__folio_alloc_noprof);
>
> /*
> * Common helper functions. Never use with __GFP_HIGHMEM because the returned
> * address cannot represent highmem pages. Use alloc_pages and then kmap if
> * you need to access high mem.
> */
> -unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
> +unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order)
> {
> struct page *page;
>
> - page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
> + page = alloc_pages_noprof(gfp_mask & ~__GFP_HIGHMEM, order);
> if (!page)
> return 0;
> return (unsigned long) page_address(page);
> }
> -EXPORT_SYMBOL(__get_free_pages);
> +EXPORT_SYMBOL(get_free_pages_noprof);
>
> -unsigned long get_zeroed_page(gfp_t gfp_mask)
> +unsigned long get_zeroed_page_noprof(gfp_t gfp_mask)
> {
> - return __get_free_page(gfp_mask | __GFP_ZERO);
> + return get_free_pages_noprof(gfp_mask | __GFP_ZERO, 0);
> }
> -EXPORT_SYMBOL(get_zeroed_page);
> +EXPORT_SYMBOL(get_zeroed_page_noprof);
>
> /**
> * __free_pages - Free pages allocated with alloc_pages().
> @@ -4818,7 +4818,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
> }
>
> /**
> - * alloc_pages_exact - allocate an exact number physically-contiguous pages.
> + * alloc_pages_exact_noprof - allocate an exact number physically-contiguous pages.
> * @size: the number of bytes to allocate
> * @gfp_mask: GFP flags for the allocation, must not contain __GFP_COMP
> *
> @@ -4832,7 +4832,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
> *
> * Return: pointer to the allocated area or %NULL in case of error.
> */
> -void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
> +void *alloc_pages_exact_noprof(size_t size, gfp_t gfp_mask)
> {
> unsigned int order = get_order(size);
> unsigned long addr;
> @@ -4840,13 +4840,13 @@ void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
> if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
> gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);
>
> - addr = __get_free_pages(gfp_mask, order);
> + addr = get_free_pages_noprof(gfp_mask, order);
> return make_alloc_exact(addr, order, size);
> }
> -EXPORT_SYMBOL(alloc_pages_exact);
> +EXPORT_SYMBOL(alloc_pages_exact_noprof);
>
> /**
> - * alloc_pages_exact_nid - allocate an exact number of physically-contiguous
> + * alloc_pages_exact_nid_noprof - allocate an exact number of physically-contiguous
> * pages on a node.
> * @nid: the preferred node ID where memory should be allocated
> * @size: the number of bytes to allocate
> @@ -4857,7 +4857,7 @@ EXPORT_SYMBOL(alloc_pages_exact);
> *
> * Return: pointer to the allocated area or %NULL in case of error.
> */
> -void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
> +void * __meminit alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mask)
> {
> unsigned int order = get_order(size);
> struct page *p;
> @@ -4865,7 +4865,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
> if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
> gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);
>
> - p = alloc_pages_node(nid, gfp_mask, order);
> + p = alloc_pages_node_noprof(nid, gfp_mask, order);
> if (!p)
> return NULL;
> return make_alloc_exact((unsigned long)page_address(p), order, size);
> @@ -6283,7 +6283,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
> }
>
> /**
> - * alloc_contig_range() -- tries to allocate given range of pages
> + * alloc_contig_range_noprof() -- tries to allocate given range of pages
> * @start: start PFN to allocate
> * @end: one-past-the-last PFN to allocate
> * @migratetype: migratetype of the underlying pageblocks (either
> @@ -6303,7 +6303,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
> * pages which PFN is in [start, end) are allocated for the caller and
> * need to be freed with free_contig_range().
> */
> -int alloc_contig_range(unsigned long start, unsigned long end,
> +int alloc_contig_range_noprof(unsigned long start, unsigned long end,
> unsigned migratetype, gfp_t gfp_mask)
> {
> unsigned long outer_start, outer_end;
> @@ -6427,15 +6427,15 @@ int alloc_contig_range(unsigned long start, unsigned long end,
> undo_isolate_page_range(start, end, migratetype);
> return ret;
> }
> -EXPORT_SYMBOL(alloc_contig_range);
> +EXPORT_SYMBOL(alloc_contig_range_noprof);
>
> static int __alloc_contig_pages(unsigned long start_pfn,
> unsigned long nr_pages, gfp_t gfp_mask)
> {
> unsigned long end_pfn = start_pfn + nr_pages;
>
> - return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
> - gfp_mask);
> + return alloc_contig_range_noprof(start_pfn, end_pfn, MIGRATE_MOVABLE,
> + gfp_mask);
> }
>
> static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
> @@ -6470,7 +6470,7 @@ static bool zone_spans_last_pfn(const struct zone *zone,
> }
>
> /**
> - * alloc_contig_pages() -- tries to find and allocate contiguous range of pages
> + * alloc_contig_pages_noprof() -- tries to find and allocate contiguous range of pages
> * @nr_pages: Number of contiguous pages to allocate
> * @gfp_mask: GFP mask to limit search and used during compaction
> * @nid: Target node
> @@ -6490,8 +6490,8 @@ static bool zone_spans_last_pfn(const struct zone *zone,
> *
> * Return: pointer to contiguous pages on success, or NULL if not successful.
> */
> -struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
> - int nid, nodemask_t *nodemask)
> +struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
> + int nid, nodemask_t *nodemask)
> {
> unsigned long ret, pfn, flags;
> struct zonelist *zonelist;
> --
> 2.43.0.687.g38aa6559b0-goog
>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:39:07PM -0800, Suren Baghdasaryan wrote:
> Account slab allocations using codetag reference embedded into slabobj_ext.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 01:39:08PM -0800, Suren Baghdasaryan wrote:
> Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. to record allocations
> and deallocations done by these functions.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Co-developed-by: Kent Overstreet <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
I'm not a big fan of the _noprof suffix, but anything else I can think
of isn't as descriptive, so:
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 2:49 PM Kees Cook <[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 01:39:20PM -0800, Suren Baghdasaryan wrote:
> > If slabobj_ext vector allocation for a slab object fails and later on it
> > succeeds for another object in the same slab, the slabobj_ext for the
> > original object will be NULL and will be flagged when
> > CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled.
> > Mark failed slabobj_ext vector allocations using a new objext_flags flag
> > stored in the lower bits of slab->obj_exts. When a new allocation succeeds,
> > it marks all tag references in the same slabobj_ext vector as empty to
> > avoid warnings from the CONFIG_MEM_ALLOC_PROFILING_DEBUG checks.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/memcontrol.h | 4 +++-
> > mm/slab.h | 25 +++++++++++++++++++++++++
> > mm/slab_common.c | 22 +++++++++++++++-------
> > 3 files changed, 43 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 2b010316016c..f95241ca9052 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -365,8 +365,10 @@ enum page_memcg_data_flags {
> > #endif /* CONFIG_MEMCG */
> >
> > enum objext_flags {
> > + /* slabobj_ext vector failed to allocate */
> > + OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG,
> > /* the next bit after the last actual flag */
> > - __NR_OBJEXTS_FLAGS = __FIRST_OBJEXT_FLAG,
> > + __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1),
> > };
> >
> > #define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
> > diff --git a/mm/slab.h b/mm/slab.h
> > index cf332a839bf4..7bb3900f83ef 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -586,9 +586,34 @@ static inline void mark_objexts_empty(struct slabobj_ext *obj_exts)
> > }
> > }
> >
> > +static inline void mark_failed_objexts_alloc(struct slab *slab)
> > +{
> > + slab->obj_exts = OBJEXTS_ALLOC_FAIL;
>
> Uh, does this mean slab->obj_exts is suddenly non-NULL? Is everything
> that accesses obj_exts expecting this?
Hi Kees,
Thank you for the reviews!
Yes, I believe everything that accesses slab->obj_exts directly
(currently alloc_slab_obj_exts() and free_slab_obj_exts()) handles this
special non-NULL case. kfence_init_pool() initializes slab->obj_exts
directly, but since it's setting it and not accessing it, it does not
need to handle OBJEXTS_ALLOC_FAIL. All other slab->obj_exts users use
slab_obj_exts() which applies OBJEXTS_FLAGS_MASK and masks out any
special bits.
Thanks,
Suren.
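To make the masking concrete, here is a minimal userspace sketch of the
idea, not the kernel code. The flag value, the struct layouts and the use
of uintptr_t are assumptions chosen for illustration; only the names
OBJEXTS_ALLOC_FAIL, OBJEXTS_FLAGS_MASK, mark_failed_objexts_alloc() and
slab_obj_exts() come from the patch and the reply above. It also assumes
the vector is aligned so that its low bits are free to carry flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only; the kernel derives these from its own enums. */
#define OBJEXTS_ALLOC_FAIL	((uintptr_t)1 << 2)
#define NR_OBJEXTS_FLAGS	(OBJEXTS_ALLOC_FAIL << 1)
#define OBJEXTS_FLAGS_MASK	(NR_OBJEXTS_FLAGS - 1)

struct slabobj_ext {
	void *ref;
};

struct slab {
	uintptr_t obj_exts;	/* vector address with flag bits in the low bits */
};

/* Raw write: the word is now non-zero but does not point at a vector. */
static inline void mark_failed_objexts_alloc(struct slab *slab)
{
	slab->obj_exts = OBJEXTS_ALLOC_FAIL;
}

/* Accessor: strip the flag bits so callers see either a vector or NULL. */
static inline struct slabobj_ext *slab_obj_exts(const struct slab *slab)
{
	return (struct slabobj_ext *)(slab->obj_exts & ~OBJEXTS_FLAGS_MASK);
}
```

Anything that goes through the accessor never sees the failure marker,
which is why only the code that reads the raw word needs to know about it.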
>
> -Kees
>
> > +}
> > +
> > +static inline void handle_failed_objexts_alloc(unsigned long obj_exts,
> > + struct slabobj_ext *vec, unsigned int objects)
> > +{
> > + /*
> > + * If vector previously failed to allocate then we have live
> > + * objects with no tag reference. Mark all references in this
> > + * vector as empty to avoid warnings later on.
> > + */
> > + if (obj_exts & OBJEXTS_ALLOC_FAIL) {
> > + unsigned int i;
> > +
> > + for (i = 0; i < objects; i++)
> > + set_codetag_empty(&vec[i].ref);
> > + }
> > +}
> > +
> > +
> > #else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
> >
> > static inline void mark_objexts_empty(struct slabobj_ext *obj_exts) {}
> > +static inline void mark_failed_objexts_alloc(struct slab *slab) {}
> > +static inline void handle_failed_objexts_alloc(unsigned long obj_exts,
> > + struct slabobj_ext *vec, unsigned int objects) {}
> >
> > #endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
> >
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index d5f75d04ced2..489c7a8ba8f1 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -214,29 +214,37 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > gfp_t gfp, bool new_slab)
> > {
> > unsigned int objects = objs_per_slab(s, slab);
> > - unsigned long obj_exts;
> > - void *vec;
> > + unsigned long new_exts;
> > + unsigned long old_exts;
> > + struct slabobj_ext *vec;
> >
> > gfp &= ~OBJCGS_CLEAR_MASK;
> > /* Prevent recursive extension vector allocation */
> > gfp |= __GFP_NO_OBJ_EXT;
> > vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > slab_nid(slab));
> > - if (!vec)
> > + if (!vec) {
> > + /* Mark vectors which failed to allocate */
> > + if (new_slab)
> > + mark_failed_objexts_alloc(slab);
> > +
> > return -ENOMEM;
> > + }
> >
> > - obj_exts = (unsigned long)vec;
> > + new_exts = (unsigned long)vec;
> > #ifdef CONFIG_MEMCG
> > - obj_exts |= MEMCG_DATA_OBJEXTS;
> > + new_exts |= MEMCG_DATA_OBJEXTS;
> > #endif
> > + old_exts = slab->obj_exts;
> > + handle_failed_objexts_alloc(old_exts, vec, objects);
> > if (new_slab) {
> > /*
> > * If the slab is brand new and nobody can yet access its
> > * obj_exts, no synchronization is required and obj_exts can
> > * be simply assigned.
> > */
> > - slab->obj_exts = obj_exts;
> > - } else if (cmpxchg(&slab->obj_exts, 0, obj_exts)) {
> > + slab->obj_exts = new_exts;
> > + } else if (cmpxchg(&slab->obj_exts, old_exts, new_exts) != old_exts) {
> > /*
> > * If the slab is already in use, somebody can allocate and
> > * assign slabobj_exts in parallel. In this case the existing
> > --
> > 2.43.0.687.g38aa6559b0-goog
> >
>
> --
> Kees Cook
On Mon, Feb 12, 2024 at 01:39:17PM -0800, Suren Baghdasaryan wrote:
> Include allocations in show_mem reports.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/alloc_tag.h | 2 ++
> lib/alloc_tag.c | 38 ++++++++++++++++++++++++++++++++++++++
> mm/show_mem.c | 15 +++++++++++++++
> 3 files changed, 55 insertions(+)
>
> diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
> index 3fe51e67e231..0a5973c4ad77 100644
> --- a/include/linux/alloc_tag.h
> +++ b/include/linux/alloc_tag.h
> @@ -30,6 +30,8 @@ struct alloc_tag {
>
> #ifdef CONFIG_MEM_ALLOC_PROFILING
>
> +void alloc_tags_show_mem_report(struct seq_buf *s);
> +
> static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
> {
> return container_of(ct, struct alloc_tag, ct);
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index 2d5226d9262d..54312c213860 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -96,6 +96,44 @@ static const struct seq_operations allocinfo_seq_op = {
> .show = allocinfo_show,
> };
>
> +void alloc_tags_show_mem_report(struct seq_buf *s)
> +{
> + struct codetag_iterator iter;
> + struct codetag *ct;
> + struct {
> + struct codetag *tag;
> + size_t bytes;
> + } tags[10], n;
> + unsigned int i, nr = 0;
> +
> + codetag_lock_module_list(alloc_tag_cttype, true);
> + iter = codetag_get_ct_iter(alloc_tag_cttype);
> + while ((ct = codetag_next_ct(&iter))) {
> + struct alloc_tag_counters counter = alloc_tag_read(ct_to_alloc_tag(ct));
> +
> + n.tag = ct;
> + n.bytes = counter.bytes;
> +
> + for (i = 0; i < nr; i++)
> + if (n.bytes > tags[i].bytes)
> + break;
> +
> + if (i < ARRAY_SIZE(tags)) {
> + nr -= nr == ARRAY_SIZE(tags);
> + memmove(&tags[i + 1],
> + &tags[i],
> + sizeof(tags[0]) * (nr - i));
> + nr++;
> + tags[i] = n;
> + }
> + }
> +
> + for (i = 0; i < nr; i++)
> + alloc_tag_to_text(s, tags[i].tag);
> +
> + codetag_lock_module_list(alloc_tag_cttype, false);
> +}
> +
> static void __init procfs_init(void)
> {
> proc_create_seq("allocinfo", 0444, NULL, &allocinfo_seq_op);
> diff --git a/mm/show_mem.c b/mm/show_mem.c
> index 8dcfafbd283c..d514c15ca076 100644
> --- a/mm/show_mem.c
> +++ b/mm/show_mem.c
> @@ -12,6 +12,7 @@
> #include <linux/hugetlb.h>
> #include <linux/mm.h>
> #include <linux/mmzone.h>
> +#include <linux/seq_buf.h>
> #include <linux/swap.h>
> #include <linux/vmstat.h>
>
> @@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
> #ifdef CONFIG_MEMORY_FAILURE
> printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
> #endif
> +#ifdef CONFIG_MEM_ALLOC_PROFILING
> + {
> + struct seq_buf s;
> + char *buf = kmalloc(4096, GFP_ATOMIC);
Why 4096? Maybe use PAGE_SIZE instead?
> +
> + if (buf) {
> + printk("Memory allocations:\n");
This needs a printk prefix, or better yet, just use pr_info() or similar.
> + seq_buf_init(&s, buf, 4096);
> + alloc_tags_show_mem_report(&s);
> + printk("%s", buf);
Once a seq_buf "consumes" a char *, please don't use the buffer directly
any more. This should be:
pr_info("%s", seq_buf_str(&s));
Otherwise %NUL termination isn't certain. Very likely, given the use
case here, but let's use good hygiene. :)
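As a rough illustration of why the accessor matters, here is a userspace
analogue of a bounded output buffer. This is a sketch under stated
assumptions, not the kernel's seq_buf: struct sbuf and the sbuf_*() helpers
are hypothetical names. The point is that appends do not write a
terminator, so only the accessor can promise a NUL-terminated string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bounded buffer, loosely modelled on the idea of seq_buf. */
struct sbuf {
	char *buf;
	size_t size;	/* capacity of buf in bytes */
	size_t len;	/* bytes written so far, never more than size */
};

static inline void sbuf_init(struct sbuf *s, char *buf, size_t size)
{
	s->buf = buf;
	s->size = size;
	s->len = 0;
}

/* Append as much of str as fits; no terminator is written here. */
static inline void sbuf_puts(struct sbuf *s, const char *str)
{
	size_t n = strlen(str);
	size_t room = s->size - s->len;

	if (n > room)
		n = room;
	memcpy(s->buf + s->len, str, n);
	s->len += n;
}

/* Terminate first, then hand the text out. */
static inline const char *sbuf_str(struct sbuf *s)
{
	if (s->size == 0)
		return "";
	if (s->len < s->size)
		s->buf[s->len] = '\0';
	else
		s->buf[s->size - 1] = '\0';	/* full: give up the last byte */
	return s->buf;
}
```

Printing the raw buffer after filling it completely would read past the
end; going through the accessor truncates by one byte instead.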
> + kfree(buf);
> + }
> + }
> +#endif
> }
> --
> 2.43.0.687.g38aa6559b0-goog
>
--
Kees Cook
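For readers following the loop in alloc_tags_show_mem_report() quoted
above, the top-N selection can be sketched in userspace as below. This is
an illustration, not the patch: struct top_entry, struct top_list,
top_list_add() and TOP_N are hypothetical names, the name field stands in
for the codetag pointer, and N is 3 instead of 10 to keep it short. The
insertion logic follows the quoted code: find the first smaller entry,
drop the tail entry if the list is full, shift, insert.

```c
#include <stddef.h>
#include <string.h>

#define TOP_N 3	/* the patch keeps 10 entries */

struct top_entry {
	const char *name;	/* stand-in for the codetag pointer */
	size_t bytes;
};

struct top_list {
	struct top_entry tags[TOP_N];
	unsigned int nr;	/* valid entries, kept sorted largest first */
};

static inline void top_list_add(struct top_list *l, struct top_entry n)
{
	unsigned int i;

	/* Find the first slot whose entry is smaller than the candidate. */
	for (i = 0; i < l->nr; i++)
		if (n.bytes > l->tags[i].bytes)
			break;

	if (i < TOP_N) {
		/* When the list is full, the smallest entry falls off the end. */
		if (l->nr == TOP_N)
			l->nr--;
		memmove(&l->tags[i + 1], &l->tags[i],
			sizeof(l->tags[0]) * (l->nr - i));
		l->nr++;
		l->tags[i] = n;
	}
}
```

Each candidate costs at most N comparisons and one small memmove, so the
report needs a single pass over all tags and a fixed amount of stack.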
On Mon, Feb 12, 2024 at 2:45 PM Kees Cook <[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 01:39:19PM -0800, Suren Baghdasaryan wrote:
> > To avoid debug warnings while freeing reserved pages which were not
> > allocated with usual allocators, mark their codetags as empty before
> > freeing.
>
> How do these get their codetags to begin with?
The space for the codetag reference is inside the page_ext and that
reference is set to NULL. So, unless we set the reference as empty
(set it to CODETAG_EMPTY), the free routine will detect that we are
freeing an allocation that has never been accounted for and will issue
a warning. To prevent this warning we use CODETAG_EMPTY to denote
that this codetag reference is expected to be empty because the page was
not allocated in the usual way.
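A small userspace model of that distinction, as a sketch only. The
sentinel value, the struct layouts and free_would_warn() are assumptions
made for illustration; CODETAG_EMPTY and set_codetag_empty() are the names
used in the series. The model separates "never accounted" (NULL) from
"deliberately empty" (the sentinel).

```c
#include <stdbool.h>
#include <stddef.h>

struct codetag {
	const char *filename;
	int lineno;
};

/* Sentinel that is neither NULL nor a real tag; the value is illustrative. */
#define CODETAG_EMPTY ((struct codetag *)1)

struct codetag_ref {
	struct codetag *ct;
};

static inline void set_codetag_empty(struct codetag_ref *ref)
{
	ref->ct = CODETAG_EMPTY;
}

/*
 * Model of the debug check on the free path: returns true when a warning
 * would fire, i.e. the reference was never set and never marked empty.
 */
static inline bool free_would_warn(struct codetag_ref *ref)
{
	if (ref->ct == CODETAG_EMPTY) {
		ref->ct = NULL;		/* expected case: consume the marker */
		return false;
	}
	return ref->ct == NULL;		/* unaccounted allocation being freed */
}
```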
> Regardless:
>
> Reviewed-by: Kees Cook <[email protected]>
>
> --
> Kees Cook
On Mon, 12 Feb 2024 16:10:02 -0800
Kees Cook <[email protected]> wrote:
> > #endif
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > + {
> > + struct seq_buf s;
> > + char *buf = kmalloc(4096, GFP_ATOMIC);
>
> Why 4096? Maybe use PAGE_SIZE instead?
Will it make a difference for architectures that don't have 4096 PAGE_SIZE?
Like PowerPC, which has a PAGE_SIZE of anywhere from 4K to 256K!
-- Steve
On Mon, Feb 12, 2024 at 01:38:46PM -0800, Suren Baghdasaryan wrote:
> Low overhead [1] per-callsite memory allocation profiling. Not just for debug
> kernels, overhead low enough to be deployed in production.
What's the plan for things like devm_kmalloc() and similar relatively
simple wrappers? A while back I was thinking it would be possible to
reimplement at least devm_kmalloc() with a size- and flags-changing helper:
https://lore.kernel.org/all/202309111428.6F36672F57@keescook/
I suspect it could be possible to adapt the alloc_hooks wrapper in this
series similarly:
#define alloc_hooks_prep(_do_alloc, _do_prepare, _do_finish, \
ctx, size, flags) \
({ \
typeof(_do_alloc) _res; \
DEFINE_ALLOC_TAG(_alloc_tag, _old); \
ssize_t _size = (size); \
size_t _usable = _size; \
gfp_t _flags = (flags); \
\
_res = _do_prepare(ctx, &_size, &_flags); \
if (!IS_ERR_OR_NULL(_res)) \
_res = _do_alloc(_size, _flags); \
if (!IS_ERR_OR_NULL(_res)) \
_res = _do_finish(ctx, _usable, _size, _flags, _res); \
_res; \
})
#define devm_kmalloc(dev, size, flags) \
alloc_hooks_prep(kmalloc, devm_alloc_prep, devm_alloc_finish, \
dev, size, flags)
And devm_alloc_prep() and devm_alloc_finish() adapted from the URL
above.
And _do_finish instances could be marked with __realloc_size(2)
-Kees
--
Kees Cook
On Mon, Feb 12, 2024 at 2:43 PM Kees Cook <[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 01:39:21PM -0800, Suren Baghdasaryan wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > The new code & libraries added are being maintained - mark them as such.
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > MAINTAINERS | 16 ++++++++++++++++
> > 1 file changed, 16 insertions(+)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 73d898383e51..6da139418775 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -5210,6 +5210,13 @@ S: Supported
> > F: Documentation/process/code-of-conduct-interpretation.rst
> > F: Documentation/process/code-of-conduct.rst
> >
> > +CODE TAGGING
> > +M: Suren Baghdasaryan <[email protected]>
> > +M: Kent Overstreet <[email protected]>
> > +S: Maintained
> > +F: include/linux/codetag.h
> > +F: lib/codetag.c
> > +
> > COMEDI DRIVERS
> > M: Ian Abbott <[email protected]>
> > M: H Hartley Sweeten <[email protected]>
> > @@ -14056,6 +14063,15 @@ F: mm/memblock.c
> > F: mm/mm_init.c
> > F: tools/testing/memblock/
> >
> > +MEMORY ALLOCATION PROFILING
> > +M: Suren Baghdasaryan <[email protected]>
> > +M: Kent Overstreet <[email protected]>
> > +S: Maintained
> > +F: include/linux/alloc_tag.h
> > +F: include/linux/codetag_ctx.h
> > +F: lib/alloc_tag.c
> > +F: lib/pgalloc_tag.c
>
> Any mailing list to aim at? linux-mm maybe?
Good point. Will add. Thanks!
>
> Regardless:
>
> Reviewed-by: Kees Cook <[email protected]>
>
>
> > +
> > MEMORY CONTROLLER DRIVERS
> > M: Krzysztof Kozlowski <[email protected]>
> > L: [email protected]
> > --
> > 2.43.0.687.g38aa6559b0-goog
> >
>
> --
> Kees Cook
On Mon, Feb 12, 2024 at 4:29 PM Kees Cook <[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 01:38:46PM -0800, Suren Baghdasaryan wrote:
> > Low overhead [1] per-callsite memory allocation profiling. Not just for debug
> > kernels, overhead low enough to be deployed in production.
>
> What's the plan for things like devm_kmalloc() and similar relatively
> simple wrappers? A while back I was thinking it would be possible to
> reimplement at least devm_kmalloc() with a size- and flags-changing helper:
>
> https://lore.kernel.org/all/202309111428.6F36672F57@keescook/
>
> I suspect it could be possible to adapt the alloc_hooks wrapper in this
> series similarly:
>
> #define alloc_hooks_prep(_do_alloc, _do_prepare, _do_finish, \
> ctx, size, flags) \
> ({ \
> typeof(_do_alloc) _res; \
> DEFINE_ALLOC_TAG(_alloc_tag, _old); \
> ssize_t _size = (size); \
> size_t _usable = _size; \
> gfp_t _flags = (flags); \
> \
> _res = _do_prepare(ctx, &_size, &_flags); \
> if (!IS_ERR_OR_NULL(_res)) \
> _res = _do_alloc(_size, _flags); \
> if (!IS_ERR_OR_NULL(_res)) \
> _res = _do_finish(ctx, _usable, _size, _flags, _res); \
> _res; \
> })
>
> #define devm_kmalloc(dev, size, flags) \
> alloc_hooks_prep(kmalloc, devm_alloc_prep, devm_alloc_finish, \
> dev, size, flags)
>
> And devm_alloc_prep() and devm_alloc_finish() adapted from the URL
> above.
>
> And _do_finish instances could be marked with __realloc_size(2)
devm_kmalloc() is definitely a great candidate to account separately.
Looks like it's currently using
alloc_dr()->kmalloc_node_track_caller(), so this series will account
the internal kmalloc_node_track_caller() allocation. We can easily
apply alloc_hooks() to devm_kmalloc() and friends and replace the
kmalloc_node_track_caller() call inside alloc_dr() with
kmalloc_node_track_caller_noprof(). That will move accounting directly
to devm_kmalloc().
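Roughly, the effect of switching the inner call to a _noprof variant can
be modelled in userspace like this. It is a simplified sketch, not the
series' implementation: it relies on GNU statement expressions (GCC and
Clang), the allocators take only a size, current_tag stands in for
current->alloc_tag, and alloc_dr_hooked(), alloc_dr_noprof() and
last_charged are hypothetical names added to make the difference visible.

```c
#include <stddef.h>
#include <stdlib.h>

struct alloc_tag {
	const char *func;
	int line;
	size_t bytes;
	size_t calls;
};

static struct alloc_tag *current_tag;	/* stand-in for current->alloc_tag */
static struct alloc_tag *last_charged;	/* demo aid: tag hit by the last allocation */

/* Innermost allocator: charges whatever tag is current, defines none itself. */
static inline void *kmalloc_noprof(size_t size)
{
	void *p = malloc(size);

	if (p && current_tag) {
		current_tag->bytes += size;
		current_tag->calls++;
		last_charged = current_tag;
	}
	return p;
}

/* One static tag per expansion site, made current for the duration of the call. */
#define alloc_hooks(_do_alloc)						\
({									\
	static struct alloc_tag _tag = {				\
		.func = __func__, .line = __LINE__ };			\
	struct alloc_tag *_old = current_tag;				\
	__typeof__(_do_alloc) _res;					\
	current_tag = &_tag;						\
	_res = _do_alloc;						\
	current_tag = _old;						\
	_res;								\
})

#define kmalloc(size) alloc_hooks(kmalloc_noprof(size))

/* Before: the hooked kmalloc() inside the helper owns the tag. */
static inline void *alloc_dr_hooked(size_t size)
{
	return kmalloc(size);
}

/* After: the helper defines no tag, so the caller's tag is charged. */
static inline void *alloc_dr_noprof(size_t size)
{
	return kmalloc_noprof(size);
}

#define devm_kmalloc(size) alloc_hooks(alloc_dr_noprof(size))
```

With the hooked helper every devm allocation is attributed to the one
line inside the helper; with the _noprof helper each devm_kmalloc() call
site gets its own tag.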
>
> -Kees
>
> --
> Kees Cook
On Mon, Feb 12, 2024 at 01:39:09PM -0800, Suren Baghdasaryan wrote:
> From: Kent Overstreet <[email protected]>
>
> It seems we need to be more forceful with the compiler on this one.
Sure, but why?
>
> Signed-off-by: Kent Overstreet <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Feb 12, 2024 at 4:31 PM Kees Cook <[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 01:39:09PM -0800, Suren Baghdasaryan wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > It seems we need to be more forceful with the compiler on this one.
>
> Sure, but why?
IIRC Kent saw a case when it was not inlined for some reason... Kent,
do you recall this?
>
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
>
> Reviewed-by: Kees Cook <[email protected]>
>
> --
> Kees Cook
On Mon, Feb 12, 2024 at 2:40 PM Kees Cook <[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 01:38:59PM -0800, Suren Baghdasaryan wrote:
> > Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily
> > instrument memory allocators. It registers an "alloc_tags" codetag type
> > with /proc/allocinfo interface to output allocation tag information when
>
> Please don't add anything new to the top-level /proc directory. This
> should likely live in /sys.
Ack. I'll find a more appropriate place for it then.
It just seemed like such generic information which would belong next
to meminfo/zoneinfo and such...
>
> > the feature is enabled.
> > CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
> > allocation profiling instrumentation.
> > Memory allocation profiling can be enabled or disabled at runtime using
> > /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n.
> > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation
> > profiling by default.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Co-developed-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Kent Overstreet <[email protected]>
> > ---
> > Documentation/admin-guide/sysctl/vm.rst | 16 +++
> > Documentation/filesystems/proc.rst | 28 +++++
> > include/asm-generic/codetag.lds.h | 14 +++
> > include/asm-generic/vmlinux.lds.h | 3 +
> > include/linux/alloc_tag.h | 133 ++++++++++++++++++++
> > include/linux/sched.h | 24 ++++
> > lib/Kconfig.debug | 25 ++++
> > lib/Makefile | 2 +
> > lib/alloc_tag.c | 158 ++++++++++++++++++++++++
> > scripts/module.lds.S | 7 ++
> > 10 files changed, 410 insertions(+)
> > create mode 100644 include/asm-generic/codetag.lds.h
> > create mode 100644 include/linux/alloc_tag.h
> > create mode 100644 lib/alloc_tag.c
> >
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index c59889de122b..a214719492ea 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -43,6 +43,7 @@ Currently, these files are in /proc/sys/vm:
> > - legacy_va_layout
> > - lowmem_reserve_ratio
> > - max_map_count
> > +- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y)
> > - memory_failure_early_kill
> > - memory_failure_recovery
> > - min_free_kbytes
> > @@ -425,6 +426,21 @@ e.g., up to one or two maps per allocation.
> > The default value is 65530.
> >
> >
> > +mem_profiling
> > +==============
> > +
> > +Enable memory profiling (when CONFIG_MEM_ALLOC_PROFILING=y)
> > +
> > +1: Enable memory profiling.
> > +
> > +0: Disable memory profiling.
> > +
> > +Enabling memory profiling introduces a small performance overhead for all
> > +memory allocations.
> > +
> > +The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT.
> > +
> > +
> > memory_failure_early_kill:
> > ==========================
> >
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index 104c6d047d9b..40d6d18308e4 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -688,6 +688,7 @@ files are there, and which are missing.
> > ============ ===============================================================
> > File Content
> > ============ ===============================================================
> > + allocinfo Memory allocations profiling information
> > apm Advanced power management info
> > bootconfig Kernel command line obtained from boot config,
> > and, if there were kernel parameters from the
> > @@ -953,6 +954,33 @@ also be allocatable although a lot of filesystem metadata may have to be
> > reclaimed to achieve this.
> >
> >
> > +allocinfo
> > +~~~~~~~~~
> > +
> > +Provides information about memory allocations at all locations in the code
> > +base. Each allocation in the code is identified by its source file, line
> > +number, module and the function calling the allocation. The number of bytes
> > +allocated at each location is reported.
> > +
> > +Example output.
> > +
> > +::
> > +
> > + > cat /proc/allocinfo
> > +
> > + 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
> > + 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
> > + 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
> > + 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
> > + 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
> > + 1.16MiB fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
> > + 1.00MiB mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
> > + 734KiB fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
> > + 640KiB kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
> > + 640KiB drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf
> > + ...
> > +
> > +
> > meminfo
> > ~~~~~~~
> >
> > diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
> > new file mode 100644
> > index 000000000000..64f536b80380
> > --- /dev/null
> > +++ b/include/asm-generic/codetag.lds.h
> > @@ -0,0 +1,14 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +#ifndef __ASM_GENERIC_CODETAG_LDS_H
> > +#define __ASM_GENERIC_CODETAG_LDS_H
> > +
> > +#define SECTION_WITH_BOUNDARIES(_name) \
> > + . = ALIGN(8); \
> > + __start_##_name = .; \
> > + KEEP(*(_name)) \
> > + __stop_##_name = .;
> > +
> > +#define CODETAG_SECTIONS() \
> > + SECTION_WITH_BOUNDARIES(alloc_tags)
> > +
> > +#endif /* __ASM_GENERIC_CODETAG_LDS_H */
> > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> > index 5dd3a61d673d..c9997dc50c50 100644
> > --- a/include/asm-generic/vmlinux.lds.h
> > +++ b/include/asm-generic/vmlinux.lds.h
> > @@ -50,6 +50,8 @@
> > * [__nosave_begin, __nosave_end] for the nosave data
> > */
> >
> > +#include <asm-generic/codetag.lds.h>
> > +
> > #ifndef LOAD_OFFSET
> > #define LOAD_OFFSET 0
> > #endif
> > @@ -366,6 +368,7 @@
> > . = ALIGN(8); \
> > BOUNDED_SECTION_BY(__dyndbg_classes, ___dyndbg_classes) \
> > BOUNDED_SECTION_BY(__dyndbg, ___dyndbg) \
> > + CODETAG_SECTIONS() \
> > LIKELY_PROFILE() \
> > BRANCH_PROFILE() \
> > TRACE_PRINTKS() \
> > diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
> > new file mode 100644
> > index 000000000000..cf55a149fa84
> > --- /dev/null
> > +++ b/include/linux/alloc_tag.h
> > @@ -0,0 +1,133 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * allocation tagging
> > + */
> > +#ifndef _LINUX_ALLOC_TAG_H
> > +#define _LINUX_ALLOC_TAG_H
> > +
> > +#include <linux/bug.h>
> > +#include <linux/codetag.h>
> > +#include <linux/container_of.h>
> > +#include <linux/preempt.h>
> > +#include <asm/percpu.h>
> > +#include <linux/cpumask.h>
> > +#include <linux/static_key.h>
> > +
> > +struct alloc_tag_counters {
> > + u64 bytes;
> > + u64 calls;
> > +};
> > +
> > +/*
> > + * An instance of this structure is created in a special ELF section at every
> > + * allocation callsite. At runtime, the special section is treated as
> > + * an array of these. Embedded codetag utilizes codetag framework.
> > + */
> > +struct alloc_tag {
> > + struct codetag ct;
> > + struct alloc_tag_counters __percpu *counters;
> > +} __aligned(8);
> > +
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > +
> > +static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
> > +{
> > + return container_of(ct, struct alloc_tag, ct);
> > +}
> > +
> > +#ifdef ARCH_NEEDS_WEAK_PER_CPU
> > +/*
> > + * When percpu variables are required to be defined as weak, static percpu
> > + * variables can't be used inside a function (see comments for DECLARE_PER_CPU_SECTION).
> > + */
> > +#error "Memory allocation profiling is incompatible with ARCH_NEEDS_WEAK_PER_CPU"
>
> Is this enforced via Kconfig as well? (Looks like only alpha and s390?)
Unfortunately ARCH_NEEDS_WEAK_PER_CPU is not a Kconfig option but
CONFIG_DEBUG_FORCE_WEAK_PER_CPU is, so that one is handled via Kconfig
(see "depends on !DEBUG_FORCE_WEAK_PER_CPU" in this patch). We have to
avoid both cases because of this:
https://elixir.bootlin.com/linux/latest/source/include/linux/percpu-defs.h#L75,
so I'm trying to provide an informative error here.
>
> > +#endif
> > +
> > +#define DEFINE_ALLOC_TAG(_alloc_tag, _old) \
> > + static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr); \
> > + static struct alloc_tag _alloc_tag __used __aligned(8) \
> > + __section("alloc_tags") = { \
> > + .ct = CODE_TAG_INIT, \
> > + .counters = &_alloc_tag_cntr }; \
> > + struct alloc_tag * __maybe_unused _old = alloc_tag_save(&_alloc_tag)
> > +
> > +DECLARE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
> > + mem_alloc_profiling_key);
> > +
> > +static inline bool mem_alloc_profiling_enabled(void)
> > +{
> > + return static_branch_maybe(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
> > + &mem_alloc_profiling_key);
> > +}
> > +
> > +static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
> > +{
> > + struct alloc_tag_counters v = { 0, 0 };
> > + struct alloc_tag_counters *counter;
> > + int cpu;
> > +
> > + for_each_possible_cpu(cpu) {
> > + counter = per_cpu_ptr(tag->counters, cpu);
> > + v.bytes += counter->bytes;
> > + v.calls += counter->calls;
> > + }
> > +
> > + return v;
> > +}
> > +
> > +static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes)
> > +{
> > + struct alloc_tag *tag;
> > +
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > + WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n");
> > +#endif
> > + if (!ref || !ref->ct)
> > + return;
> > +
> > + tag = ct_to_alloc_tag(ref->ct);
> > +
> > + this_cpu_sub(tag->counters->bytes, bytes);
> > + this_cpu_dec(tag->counters->calls);
> > +
> > + ref->ct = NULL;
> > +}
> > +
> > +static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
> > +{
> > + __alloc_tag_sub(ref, bytes);
> > +}
> > +
> > +static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes)
> > +{
> > + __alloc_tag_sub(ref, bytes);
> > +}
> > +
> > +static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
> > +{
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > + WARN_ONCE(ref && ref->ct,
> > + "alloc_tag was not cleared (got tag for %s:%u)\n",\
> > + ref->ct->filename, ref->ct->lineno);
> > +
> > + WARN_ONCE(!tag, "current->alloc_tag not set");
> > +#endif
> > + if (!ref || !tag)
> > + return;
> > +
> > + ref->ct = &tag->ct;
> > + this_cpu_add(tag->counters->bytes, bytes);
> > + this_cpu_inc(tag->counters->calls);
> > +}
> > +
> > +#else
> > +
> > +#define DEFINE_ALLOC_TAG(_alloc_tag, _old)
> > +static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
> > +static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
> > +static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
> > + size_t bytes) {}
> > +
> > +#endif
> > +
> > +#endif /* _LINUX_ALLOC_TAG_H */
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index ffe8f618ab86..da68a10517c8 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -770,6 +770,10 @@ struct task_struct {
> > unsigned int flags;
> > unsigned int ptrace;
> >
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > + struct alloc_tag *alloc_tag;
> > +#endif
>
> Normally scheduling is very sensitive to having anything early in
> task_struct. I would suggest moving this to the CONFIG_SCHED_CORE ifdef
> area.
Thanks for the warning! We will look into that.
>
> > +
> > #ifdef CONFIG_SMP
> > int on_cpu;
> > struct __call_single_node wake_entry;
> > @@ -810,6 +814,7 @@ struct task_struct {
> > struct task_group *sched_task_group;
> > #endif
> >
> > +
> > #ifdef CONFIG_UCLAMP_TASK
> > /*
> > * Clamp values requested for a scheduling entity.
> > @@ -2183,4 +2188,23 @@ static inline int sched_core_idle_cpu(int cpu) { return idle_cpu(cpu); }
> >
> > extern void sched_set_stop_task(int cpu, struct task_struct *stop);
> >
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > +static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag)
> > +{
> > + swap(current->alloc_tag, tag);
> > + return tag;
> > +}
> > +
> > +static inline void alloc_tag_restore(struct alloc_tag *tag, struct alloc_tag *old)
> > +{
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > + WARN(current->alloc_tag != tag, "current->alloc_tag was changed:\n");
> > +#endif
> > + current->alloc_tag = old;
> > +}
> > +#else
> > +static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag) { return NULL; }
> > +#define alloc_tag_restore(_tag, _old)
> > +#endif
> > +
> > #endif
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index 0be2d00c3696..78d258ca508f 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -972,6 +972,31 @@ config CODE_TAGGING
> > bool
> > select KALLSYMS
> >
> > +config MEM_ALLOC_PROFILING
> > + bool "Enable memory allocation profiling"
> > + default n
> > + depends on PROC_FS
> > + depends on !DEBUG_FORCE_WEAK_PER_CPU
> > + select CODE_TAGGING
> > + help
> > + Track allocation source code and record total allocation size
> > + initiated at that code location. The mechanism can be used to track
> > + memory leaks with a low performance and memory impact.
> > +
> > +config MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> > + bool "Enable memory allocation profiling by default"
> > + default y
> > + depends on MEM_ALLOC_PROFILING
> > +
> > +config MEM_ALLOC_PROFILING_DEBUG
> > + bool "Memory allocation profiler debugging"
> > + default n
> > + depends on MEM_ALLOC_PROFILING
> > + select MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> > + help
> > + Adds warnings with helpful error messages for memory allocation
> > + profiling.
> > +
> > source "lib/Kconfig.kasan"
> > source "lib/Kconfig.kfence"
> > source "lib/Kconfig.kmsan"
> > diff --git a/lib/Makefile b/lib/Makefile
> > index 6b48b22fdfac..859112f09bf5 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -236,6 +236,8 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
> > obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> >
> > obj-$(CONFIG_CODE_TAGGING) += codetag.o
> > +obj-$(CONFIG_MEM_ALLOC_PROFILING) += alloc_tag.o
> > +
> > lib-$(CONFIG_GENERIC_BUG) += bug.o
> >
> > obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
> > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > new file mode 100644
> > index 000000000000..4fc031f9cefd
> > --- /dev/null
> > +++ b/lib/alloc_tag.c
> > @@ -0,0 +1,158 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +#include <linux/alloc_tag.h>
> > +#include <linux/fs.h>
> > +#include <linux/gfp.h>
> > +#include <linux/module.h>
> > +#include <linux/proc_fs.h>
> > +#include <linux/seq_buf.h>
> > +#include <linux/seq_file.h>
> > +
> > +static struct codetag_type *alloc_tag_cttype;
> > +
> > +DEFINE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
> > + mem_alloc_profiling_key);
> > +
> > +static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> > +{
> > + struct codetag_iterator *iter;
> > + struct codetag *ct;
> > + loff_t node = *pos;
> > +
> > + iter = kzalloc(sizeof(*iter), GFP_KERNEL);
> > + m->private = iter;
> > + if (!iter)
> > + return NULL;
> > +
> > + codetag_lock_module_list(alloc_tag_cttype, true);
> > + *iter = codetag_get_ct_iter(alloc_tag_cttype);
> > + while ((ct = codetag_next_ct(iter)) != NULL && node)
> > + node--;
> > +
> > + return ct ? iter : NULL;
> > +}
> > +
> > +static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
> > +{
> > + struct codetag_iterator *iter = (struct codetag_iterator *)arg;
> > + struct codetag *ct = codetag_next_ct(iter);
> > +
> > + (*pos)++;
> > + if (!ct)
> > + return NULL;
> > +
> > + return iter;
> > +}
> > +
> > +static void allocinfo_stop(struct seq_file *m, void *arg)
> > +{
> > + struct codetag_iterator *iter = (struct codetag_iterator *)m->private;
> > +
> > + if (iter) {
> > + codetag_lock_module_list(alloc_tag_cttype, false);
> > + kfree(iter);
> > + }
> > +}
> > +
> > +static void alloc_tag_to_text(struct seq_buf *out, struct codetag *ct)
> > +{
> > + struct alloc_tag *tag = ct_to_alloc_tag(ct);
> > + struct alloc_tag_counters counter = alloc_tag_read(tag);
> > + s64 bytes = counter.bytes;
> > + char val[10], *p = val;
> > +
> > + if (bytes < 0) {
> > + *p++ = '-';
> > + bytes = -bytes;
> > + }
> > +
> > + string_get_size(bytes, 1,
> > + STRING_SIZE_BASE2|STRING_SIZE_NOSPACE,
> > + p, val + ARRAY_SIZE(val) - p);
> > +
> > + seq_buf_printf(out, "%8s %8llu ", val, counter.calls);
> > + codetag_to_text(out, ct);
> > + seq_buf_putc(out, ' ');
> > + seq_buf_putc(out, '\n');
> > +}
>
> /me does happy seq_buf dance!
>
> > +
> > +static int allocinfo_show(struct seq_file *m, void *arg)
> > +{
> > + struct codetag_iterator *iter = (struct codetag_iterator *)arg;
> > + char *bufp;
> > + size_t n = seq_get_buf(m, &bufp);
> > + struct seq_buf buf;
> > +
> > + seq_buf_init(&buf, bufp, n);
> > + alloc_tag_to_text(&buf, iter->ct);
> > + seq_commit(m, seq_buf_used(&buf));
> > + return 0;
> > +}
> > +
> > +static const struct seq_operations allocinfo_seq_op = {
> > + .start = allocinfo_start,
> > + .next = allocinfo_next,
> > + .stop = allocinfo_stop,
> > + .show = allocinfo_show,
> > +};
> > +
> > +static void __init procfs_init(void)
> > +{
> > + proc_create_seq("allocinfo", 0444, NULL, &allocinfo_seq_op);
> > +}
>
> As mentioned, this really should be in /sys somewhere.
Ack.
>
> > +
> > +static bool alloc_tag_module_unload(struct codetag_type *cttype,
> > + struct codetag_module *cmod)
> > +{
> > + struct codetag_iterator iter = codetag_get_ct_iter(cttype);
> > + struct alloc_tag_counters counter;
> > + bool module_unused = true;
> > + struct alloc_tag *tag;
> > + struct codetag *ct;
> > +
> > + for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
> > + if (iter.cmod != cmod)
> > + continue;
> > +
> > + tag = ct_to_alloc_tag(ct);
> > + counter = alloc_tag_read(tag);
> > +
> > + if (WARN(counter.bytes, "%s:%u module %s func:%s has %llu allocated at module unload",
> > + ct->filename, ct->lineno, ct->modname, ct->function, counter.bytes))
> > + module_unused = false;
> > + }
> > +
> > + return module_unused;
> > +}
> > +
> > +static struct ctl_table memory_allocation_profiling_sysctls[] = {
> > + {
> > + .procname = "mem_profiling",
> > + .data = &mem_alloc_profiling_key,
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > + .mode = 0444,
> > +#else
> > + .mode = 0644,
> > +#endif
> > + .proc_handler = proc_do_static_key,
> > + },
> > + { }
> > +};
> > +
> > +static int __init alloc_tag_init(void)
> > +{
> > + const struct codetag_type_desc desc = {
> > + .section = "alloc_tags",
> > + .tag_size = sizeof(struct alloc_tag),
> > + .module_unload = alloc_tag_module_unload,
> > + };
> > +
> > + alloc_tag_cttype = codetag_register_type(&desc);
> > + if (IS_ERR_OR_NULL(alloc_tag_cttype))
> > + return PTR_ERR(alloc_tag_cttype);
> > +
> > + register_sysctl_init("vm", memory_allocation_profiling_sysctls);
> > + procfs_init();
> > +
> > + return 0;
> > +}
> > +module_init(alloc_tag_init);
> > diff --git a/scripts/module.lds.S b/scripts/module.lds.S
> > index bf5bcf2836d8..45c67a0994f3 100644
> > --- a/scripts/module.lds.S
> > +++ b/scripts/module.lds.S
> > @@ -9,6 +9,8 @@
> > #define DISCARD_EH_FRAME *(.eh_frame)
> > #endif
> >
> > +#include <asm-generic/codetag.lds.h>
> > +
> > SECTIONS {
> > /DISCARD/ : {
> > *(.discard)
> > @@ -47,12 +49,17 @@ SECTIONS {
> > .data : {
> > *(.data .data.[0-9a-zA-Z_]*)
> > *(.data..L*)
> > + CODETAG_SECTIONS()
> > }
> >
> > .rodata : {
> > *(.rodata .rodata.[0-9a-zA-Z_]*)
> > *(.rodata..L*)
> > }
> > +#else
> > + .data : {
> > + CODETAG_SECTIONS()
> > + }
> > #endif
> > }
>
> Otherwise, looks good.
Thanks!
>
> --
> Kees Cook
On Mon, Feb 12, 2024 at 2:27 PM Kees Cook <[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 01:38:56PM -0800, Suren Baghdasaryan wrote:
> > Add basic infrastructure to support code tagging which stores tag common
> > information consisting of the module name, function, file name and line
> > number. Provide functions to register a new code tag type and navigate
> > between code tags.
> >
> > Co-developed-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/codetag.h | 71 ++++++++++++++
> > lib/Kconfig.debug | 4 +
> > lib/Makefile | 1 +
> > lib/codetag.c | 199 ++++++++++++++++++++++++++++++++++++++++
> > 4 files changed, 275 insertions(+)
> > create mode 100644 include/linux/codetag.h
> > create mode 100644 lib/codetag.c
> >
> > diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> > new file mode 100644
> > index 000000000000..a9d7adecc2a5
> > --- /dev/null
> > +++ b/include/linux/codetag.h
> > @@ -0,0 +1,71 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * code tagging framework
> > + */
> > +#ifndef _LINUX_CODETAG_H
> > +#define _LINUX_CODETAG_H
> > +
> > +#include <linux/types.h>
> > +
> > +struct codetag_iterator;
> > +struct codetag_type;
> > +struct seq_buf;
> > +struct module;
> > +
> > +/*
> > + * An instance of this structure is created in a special ELF section at every
> > + * code location being tagged. At runtime, the special section is treated as
> > + * an array of these.
> > + */
> > +struct codetag {
> > + unsigned int flags; /* used in later patches */
> > + unsigned int lineno;
> > + const char *modname;
> > + const char *function;
> > + const char *filename;
> > +} __aligned(8);
> > +
> > +union codetag_ref {
> > + struct codetag *ct;
> > +};
> > +
> > +struct codetag_range {
> > + struct codetag *start;
> > + struct codetag *stop;
> > +};
> > +
> > +struct codetag_module {
> > + struct module *mod;
> > + struct codetag_range range;
> > +};
> > +
> > +struct codetag_type_desc {
> > + const char *section;
> > + size_t tag_size;
> > +};
> > +
> > +struct codetag_iterator {
> > + struct codetag_type *cttype;
> > + struct codetag_module *cmod;
> > + unsigned long mod_id;
> > + struct codetag *ct;
> > +};
> > +
> > +#define CODE_TAG_INIT { \
> > + .modname = KBUILD_MODNAME, \
> > + .function = __func__, \
> > + .filename = __FILE__, \
> > + .lineno = __LINE__, \
> > + .flags = 0, \
> > +}
> > +
> > +void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
> > +struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
> > +struct codetag *codetag_next_ct(struct codetag_iterator *iter);
> > +
> > +void codetag_to_text(struct seq_buf *out, struct codetag *ct);
> > +
> > +struct codetag_type *
> > +codetag_register_type(const struct codetag_type_desc *desc);
> > +
> > +#endif /* _LINUX_CODETAG_H */
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index 975a07f9f1cc..0be2d00c3696 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -968,6 +968,10 @@ config DEBUG_STACKOVERFLOW
> >
> > If in doubt, say "N".
> >
> > +config CODE_TAGGING
> > + bool
> > + select KALLSYMS
> > +
> > source "lib/Kconfig.kasan"
> > source "lib/Kconfig.kfence"
> > source "lib/Kconfig.kmsan"
> > diff --git a/lib/Makefile b/lib/Makefile
> > index 6b09731d8e61..6b48b22fdfac 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -235,6 +235,7 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
> > of-reconfig-notifier-error-inject.o
> > obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> >
> > +obj-$(CONFIG_CODE_TAGGING) += codetag.o
> > lib-$(CONFIG_GENERIC_BUG) += bug.o
> >
> > obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
> > diff --git a/lib/codetag.c b/lib/codetag.c
> > new file mode 100644
> > index 000000000000..7708f8388e55
> > --- /dev/null
> > +++ b/lib/codetag.c
> > @@ -0,0 +1,199 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +#include <linux/codetag.h>
> > +#include <linux/idr.h>
> > +#include <linux/kallsyms.h>
> > +#include <linux/module.h>
> > +#include <linux/seq_buf.h>
> > +#include <linux/slab.h>
> > +
> > +struct codetag_type {
> > + struct list_head link;
> > + unsigned int count;
> > + struct idr mod_idr;
> > + struct rw_semaphore mod_lock; /* protects mod_idr */
> > + struct codetag_type_desc desc;
> > +};
> > +
> > +static DEFINE_MUTEX(codetag_lock);
> > +static LIST_HEAD(codetag_types);
> > +
> > +void codetag_lock_module_list(struct codetag_type *cttype, bool lock)
> > +{
> > + if (lock)
> > + down_read(&cttype->mod_lock);
> > + else
> > + up_read(&cttype->mod_lock);
> > +}
> > +
> > +struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> > +{
> > + struct codetag_iterator iter = {
> > + .cttype = cttype,
> > + .cmod = NULL,
> > + .mod_id = 0,
> > + .ct = NULL,
> > + };
> > +
> > + return iter;
> > +}
> > +
> > +static inline struct codetag *get_first_module_ct(struct codetag_module *cmod)
> > +{
> > + return cmod->range.start < cmod->range.stop ? cmod->range.start : NULL;
> > +}
> > +
> > +static inline
> > +struct codetag *get_next_module_ct(struct codetag_iterator *iter)
> > +{
> > + struct codetag *res = (struct codetag *)
> > + ((char *)iter->ct + iter->cttype->desc.tag_size);
> > +
> > + return res < iter->cmod->range.stop ? res : NULL;
> > +}
> > +
> > +struct codetag *codetag_next_ct(struct codetag_iterator *iter)
> > +{
> > + struct codetag_type *cttype = iter->cttype;
> > + struct codetag_module *cmod;
> > + struct codetag *ct;
> > +
> > + lockdep_assert_held(&cttype->mod_lock);
> > +
> > + if (unlikely(idr_is_empty(&cttype->mod_idr)))
> > + return NULL;
> > +
> > + ct = NULL;
> > + while (true) {
> > + cmod = idr_find(&cttype->mod_idr, iter->mod_id);
> > +
> > + /* If module was removed move to the next one */
> > + if (!cmod)
> > + cmod = idr_get_next_ul(&cttype->mod_idr,
> > + &iter->mod_id);
> > +
> > + /* Exit if no more modules */
> > + if (!cmod)
> > + break;
> > +
> > + if (cmod != iter->cmod) {
> > + iter->cmod = cmod;
> > + ct = get_first_module_ct(cmod);
> > + } else
> > + ct = get_next_module_ct(iter);
> > +
> > + if (ct)
> > + break;
> > +
> > + iter->mod_id++;
> > + }
> > +
> > + iter->ct = ct;
> > + return ct;
> > +}
> > +
> > +void codetag_to_text(struct seq_buf *out, struct codetag *ct)
> > +{
> > + seq_buf_printf(out, "%s:%u module:%s func:%s",
> > + ct->filename, ct->lineno,
> > + ct->modname, ct->function);
> > +}
>
> Thank you for using seq_buf here!
>
> Also, will this need an EXPORT_SYMBOL_GPL()?
>
> > +
> > +static inline size_t range_size(const struct codetag_type *cttype,
> > + const struct codetag_range *range)
> > +{
> > + return ((char *)range->stop - (char *)range->start) /
> > + cttype->desc.tag_size;
> > +}
> > +
> > +static void *get_symbol(struct module *mod, const char *prefix, const char *name)
> > +{
> > + char buf[64];
>
> Why is 64 enough? I was expecting KSYM_NAME_LEN here, but perhaps this
> is specialized enough to section names that it will not be a problem?
This buffer holds the name of the section containing codetags,
prefixed with "__start_" or "__stop_". The only current user is
alloc_tag_init(), which sets the section name to "alloc_tags", so the
buffer currently holds either "__start_alloc_tags" or
"__stop_alloc_tags". When more codetag applications are added (like
the ones we have shown in the original RFC [1]), there will be more
section names. 64 was chosen as a value big enough to reasonably hold
the section name with the prefix. But you are right, we should add a
check for the section name size to ensure it always fits. I will add
that to my TODO list.
[1] https://lore.kernel.org/all/[email protected]/
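For illustration only, the length check could look roughly like the
userspace sketch below. The helper name build_section_symbol() and the
SECTION_SYM_MAX constant are made up for this example and are not part
of the patch; snprintf() is used here only because seq_buf is not
available outside the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical limit for this sketch; the patch uses a 64-byte buffer. */
#define SECTION_SYM_MAX 64

/*
 * Build "<prefix><section>" (for example "__start_" + "alloc_tags") in buf.
 * Returns 0 on success, or -1 if the result would be empty or would not
 * fit in size bytes including the terminating NUL.
 */
static int build_section_symbol(char *buf, size_t size,
				const char *prefix, const char *section)
{
	int res;

	assert(size > 0);

	res = snprintf(buf, size, "%s%s", prefix, section);
	/* snprintf() returns the untruncated length: res >= size means truncation. */
	if (res < 1 || (size_t)res >= size)
		return -1;

	return 0;
}
```

A return value equal to the buffer size already means the string was
truncated, which is why the sketch compares with >= rather than >.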
> If so, please document it clearly with a comment.
Will do.
>
> > + int res;
> > +
> > + res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
> > + if (WARN_ON(res < 1 || res > sizeof(buf)))
> > + return NULL;
>
> Please use a seq_buf here instead of snprintf, which we're trying to get
> rid of.
>
> DECLARE_SEQ_BUF(sb, KSYM_NAME_LEN);
> char *buf;
>
> seq_buf_printf(sb, "%s%s", prefix, name);
> if (seq_buf_has_overflowed(sb))
> return NULL;
>
> buf = seq_buf_str(sb);
Will do. Thanks!
>
> > +
> > + return mod ?
> > + (void *)find_kallsyms_symbol_value(mod, buf) :
> > + (void *)kallsyms_lookup_name(buf);
> > +}
> > +
> > +static struct codetag_range get_section_range(struct module *mod,
> > + const char *section)
> > +{
> > + return (struct codetag_range) {
> > + get_symbol(mod, "__start_", section),
> > + get_symbol(mod, "__stop_", section),
> > + };
> > +}
> > +
> > +static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
> > +{
> > + struct codetag_range range;
> > + struct codetag_module *cmod;
> > + int err;
> > +
> > + range = get_section_range(mod, cttype->desc.section);
> > + if (!range.start || !range.stop) {
> > + pr_warn("Failed to load code tags of type %s from the module %s\n",
> > + cttype->desc.section,
> > + mod ? mod->name : "(built-in)");
> > + return -EINVAL;
> > + }
> > +
> > + /* Ignore empty ranges */
> > + if (range.start == range.stop)
> > + return 0;
> > +
> > + BUG_ON(range.start > range.stop);
> > +
> > + cmod = kmalloc(sizeof(*cmod), GFP_KERNEL);
> > + if (unlikely(!cmod))
> > + return -ENOMEM;
> > +
> > + cmod->mod = mod;
> > + cmod->range = range;
> > +
> > + down_write(&cttype->mod_lock);
> > + err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
> > + if (err >= 0)
> > + cttype->count += range_size(cttype, &range);
> > + up_write(&cttype->mod_lock);
> > +
> > + if (err < 0) {
> > + kfree(cmod);
> > + return err;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +struct codetag_type *
> > +codetag_register_type(const struct codetag_type_desc *desc)
> > +{
> > + struct codetag_type *cttype;
> > + int err;
> > +
> > + BUG_ON(desc->tag_size <= 0);
> > +
> > + cttype = kzalloc(sizeof(*cttype), GFP_KERNEL);
> > + if (unlikely(!cttype))
> > + return ERR_PTR(-ENOMEM);
> > +
> > + cttype->desc = *desc;
> > + idr_init(&cttype->mod_idr);
> > + init_rwsem(&cttype->mod_lock);
> > +
> > + err = codetag_module_init(cttype, NULL);
> > + if (unlikely(err)) {
> > + kfree(cttype);
> > + return ERR_PTR(err);
> > + }
> > +
> > + mutex_lock(&codetag_lock);
> > + list_add_tail(&cttype->link, &codetag_types);
> > + mutex_unlock(&codetag_lock);
> > +
> > + return cttype;
> > +}
> > --
> > 2.43.0.687.g38aa6559b0-goog
> >
>
> --
> Kees Cook
On Mon, Feb 12, 2024 at 04:31:14PM -0800, Kees Cook wrote:
> On Mon, Feb 12, 2024 at 01:39:09PM -0800, Suren Baghdasaryan wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > It seems we need to be more forceful with the compiler on this one.
>
> Sure, but why?
Wasn't getting inlined without it, and that's one we do want inlined -
it's only called in one place.
On Mon, Feb 12, 2024 at 2:14 PM Kees Cook <[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 01:38:51PM -0800, Suren Baghdasaryan wrote:
> > Currently slab pages can store only vectors of obj_cgroup pointers in
> > page->memcg_data. Introduce slabobj_ext structure to allow more data
> > to be stored for each slab object. Wrap obj_cgroup into slabobj_ext
> > to support current functionality while allowing to extend slabobj_ext
> > in the future.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>
> It looks like this doesn't change which buckets GFP_KERNEL_ACCOUNT comes
> out of, is that correct? I'd love it if we didn't have separate buckets
> so GFP_KERNEL and GFP_KERNEL_ACCOUNT came from the same pools (so that
> the randomized pools would cover GFP_KERNEL_ACCOUNT ...)
This should not affect KMEM accounting in any way. We are simply
changing the vector of obj_cgroup objects to hold complex objects
which can contain more fields in addition to the original obj_cgroup
(in our case it's the codetag reference).
Unless I misunderstood your question?
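To make the shape of the change concrete, here is a rough userspace
sketch. The field names objcg and ref, the stand-in type bodies and the
two helpers are assumptions made for this example and may not match the
patch exactly; the point is only that each vector element now wraps the
obj_cgroup pointer instead of being that pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the kernel types; only their use as pointers matters here. */
struct obj_cgroup { int id; };
struct codetag { unsigned int lineno; };

union codetag_ref {
	struct codetag *ct;
};

/*
 * Per-object extension record. Before the change the per-slab vector held
 * bare obj_cgroup pointers; after it, each element wraps that pointer and
 * leaves room for more fields (here, the codetag reference).
 */
struct slabobj_ext {
	struct obj_cgroup *objcg;
	union codetag_ref ref;
};

/* Hypothetical helper: zeroed extension vector, one element per slab object. */
static struct slabobj_ext *alloc_obj_exts(size_t objects)
{
	assert(objects > 0);
	return calloc(objects, sizeof(struct slabobj_ext));
}

/* Hypothetical helper: memcg accounting keeps reading the objcg field. */
static struct obj_cgroup *obj_ext_objcg(struct slabobj_ext *vec, size_t idx)
{
	return vec[idx].objcg;
}
```

Which bucket an allocation comes from is decided elsewhere, so nothing
in this layout change touches that choice.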
>
> Regardless:
>
> Reviewed-by: Kees Cook <[email protected]>
>
> --
> Kees Cook
On Mon, Feb 12, 2024 at 07:22:42PM -0500, Steven Rostedt wrote:
> On Mon, 12 Feb 2024 16:10:02 -0800
> Kees Cook <[email protected]> wrote:
>
> > > #endif
> > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > + {
> > > + struct seq_buf s;
> > > + char *buf = kmalloc(4096, GFP_ATOMIC);
> >
> > Why 4096? Maybe use PAGE_SIZE instead?
>
> Will it make a difference for architectures that don't have 4096 PAGE_SIZE?
> Like PowerPC, which has a PAGE_SIZE anywhere from 4K to 256K!
it's just a string buffer
On Mon, Feb 12, 2024 at 8:33 PM Kent Overstreet
<[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 07:22:42PM -0500, Steven Rostedt wrote:
> > On Mon, 12 Feb 2024 16:10:02 -0800
> > Kees Cook <[email protected]> wrote:
> >
> > > > #endif
> > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > > + {
> > > > + struct seq_buf s;
> > > > + char *buf = kmalloc(4096, GFP_ATOMIC);
> > >
> > > Why 4096? Maybe use PAGE_SIZE instead?
> >
> > Will it make a difference for architectures that don't have 4096 PAGE_SIZE?
> > Like PowerPC, which has a PAGE_SIZE anywhere from 4K to 256K!
>
> it's just a string buffer
We should document that __show_mem() prints only the top 10 largest
allocations, therefore as long as this buffer is large enough to hold
10 records we should be good. Technically we could simply print one
record at a time and then the buffer can be smaller.
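As a rough userspace sketch of the one-record-at-a-time idea: the
struct alloc_record type, RECORD_BUF_SIZE and print_records() below are
invented for this example and are not what __show_mem() does today.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record type for this sketch; not a kernel structure. */
struct alloc_record {
	unsigned long long bytes;
	unsigned long long calls;
	const char *location;
};

/* Sized for one formatted line in this sketch, rather than for all ten. */
#define RECORD_BUF_SIZE 128

/*
 * Format and emit records one at a time through a single small buffer.
 * Returns the number of records emitted in full; a record that does not
 * fit in the buffer is skipped rather than printed truncated.
 */
static size_t print_records(FILE *out, const struct alloc_record *recs,
			    size_t nr)
{
	char buf[RECORD_BUF_SIZE];
	size_t printed = 0;

	assert(out != NULL);

	for (size_t i = 0; i < nr; i++) {
		int len = snprintf(buf, sizeof(buf), "%12llu %8llu %s\n",
				   recs[i].bytes, recs[i].calls,
				   recs[i].location);

		if (len < 0 || (size_t)len >= sizeof(buf))
			continue;

		fputs(buf, out);
		printed++;
	}

	return printed;
}
```

With this shape the buffer size depends on the longest single line, not
on the number of records printed.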
On Mon, Feb 12, 2024 at 11:39 PM Suren Baghdasaryan <[email protected]> wrote:
>
> From: Kent Overstreet <[email protected]>
>
> The new flags parameter allows controlling
> - Whether or not the units suffix is separated by a space, for
> compatibility with sort -h
> - Whether or not to append a B suffix - we're not always printing
> bytes.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
It seems most of my points from the previous review were refused...
..
You can move the below under --- cutter, so it won't pollute the git history.
> Cc: Andy Shevchenko <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: "Michael S. Tsirkin" <[email protected]>
> Cc: Jason Wang <[email protected]>
> Cc: "Noralf Trønnes" <[email protected]>
> Cc: Jens Axboe <[email protected]>
> ---
..
> --- a/include/linux/string_helpers.h
> +++ b/include/linux/string_helpers.h
> @@ -17,14 +17,13 @@ static inline bool string_is_terminated(const char *s, int len)
..
> -/* Descriptions of the types of units to
> - * print in */
> -enum string_size_units {
> - STRING_UNITS_10, /* use powers of 10^3 (standard SI) */
> - STRING_UNITS_2, /* use binary powers of 2^10 */
> +enum string_size_flags {
> + STRING_SIZE_BASE2 = (1 << 0),
> + STRING_SIZE_NOSPACE = (1 << 1),
> + STRING_SIZE_NOBYTES = (1 << 2),
> };
Do not kill documentation, I already said that. Or i.o.w. document this.
Also the _SIZE is ambiguous (if you don't want UNITS, use SIZE_FORMAT).
Also why did you kill BASE10 here? (see below as well)
..
> --- a/lib/string_helpers.c
> +++ b/lib/string_helpers.c
> @@ -19,11 +19,17 @@
> #include <linux/string.h>
> #include <linux/string_helpers.h>
>
> +enum string_size_units {
> + STRING_UNITS_10, /* use powers of 10^3 (standard SI) */
> + STRING_UNITS_2, /* use binary powers of 2^10 */
> +};
Why do we need this duplication?
..
> + enum string_size_units units = flags & flags & STRING_SIZE_BASE2
> + ? STRING_UNITS_2 : STRING_UNITS_10;
Double flags check is redundant.
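Since x & x is just x, the expression reduces to a single mask test. A
minimal sketch of what I mean follows; the helper units_from_flags()
and the STRING_SIZE_ALL mask are made-up names for this example, and
the enums are copied from the hunks quoted above.

```c
#include <assert.h>

/* Flag values as quoted in the patch. */
enum string_size_flags {
	STRING_SIZE_BASE2	= (1 << 0),
	STRING_SIZE_NOSPACE	= (1 << 1),
	STRING_SIZE_NOBYTES	= (1 << 2),
};

enum string_size_units {
	STRING_UNITS_10,	/* use powers of 10^3 (standard SI) */
	STRING_UNITS_2,		/* use binary powers of 2^10 */
};

/* Hypothetical mask of all known flags, used only by the check below. */
#define STRING_SIZE_ALL \
	(STRING_SIZE_BASE2 | STRING_SIZE_NOSPACE | STRING_SIZE_NOBYTES)

/* Hypothetical helper: one mask test selects the unit base. */
static enum string_size_units units_from_flags(unsigned int flags)
{
	/* Reject bits this sketch does not know about. */
	assert((flags & ~(unsigned int)STRING_SIZE_ALL) == 0);

	return (flags & STRING_SIZE_BASE2) ? STRING_UNITS_2 : STRING_UNITS_10;
}
```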
--
With Best Regards,
Andy Shevchenko
On Tue, Feb 13, 2024 at 10:26 AM Andy Shevchenko
<[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 11:39 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > From: Kent Overstreet <[email protected]>
> >
> > The new flags parameter allows controlling
> > - Whether or not the units suffix is separated by a space, for
> > compatibility with sort -h
> > - Whether or not to append a B suffix - we're not always printing
> > bytes.
And you effectively failed to _add_ test cases for the modified code.
Formal NAK for this; the rest is open to discussion, the absence of tests is not.
> > Signed-off-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>
> It seems most of my points from the previous review were refused...
>
> ...
>
> You can move the below under --- cutter, so it won't pollute the git history.
>
> > Cc: Andy Shevchenko <[email protected]>
> > Cc: Michael Ellerman <[email protected]>
> > Cc: Benjamin Herrenschmidt <[email protected]>
> > Cc: Paul Mackerras <[email protected]>
> > Cc: "Michael S. Tsirkin" <[email protected]>
> > Cc: Jason Wang <[email protected]>
> > Cc: "Noralf Trønnes" <[email protected]>
> > Cc: Jens Axboe <[email protected]>
> > ---
>
> ...
>
> > --- a/include/linux/string_helpers.h
> > +++ b/include/linux/string_helpers.h
> > @@ -17,14 +17,13 @@ static inline bool string_is_terminated(const char *s, int len)
>
> ...
>
> > -/* Descriptions of the types of units to
> > - * print in */
> > -enum string_size_units {
> > - STRING_UNITS_10, /* use powers of 10^3 (standard SI) */
> > - STRING_UNITS_2, /* use binary powers of 2^10 */
> > +enum string_size_flags {
> > + STRING_SIZE_BASE2 = (1 << 0),
> > + STRING_SIZE_NOSPACE = (1 << 1),
> > + STRING_SIZE_NOBYTES = (1 << 2),
> > };
>
> Do not kill documentation, I already said that. Or i.o.w. document this.
> Also the _SIZE is ambiguous (if you don't want UNITS, use SIZE_FORMAT).
>
> Also why did you kill BASE10 here? (see below as well)
>
> ...
>
> > --- a/lib/string_helpers.c
> > +++ b/lib/string_helpers.c
> > @@ -19,11 +19,17 @@
> > #include <linux/string.h>
> > #include <linux/string_helpers.h>
> >
> > +enum string_size_units {
> > + STRING_UNITS_10, /* use powers of 10^3 (standard SI) */
> > + STRING_UNITS_2, /* use binary powers of 2^10 */
> > +};
>
> Why do we need this duplication?
>
> ...
>
> > + enum string_size_units units = flags & flags & STRING_SIZE_BASE2
> > + ? STRING_UNITS_2 : STRING_UNITS_10;
>
> Double flags check is redundant.
--
With Best Regards,
Andy Shevchenko
On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
[...]
> We're aiming to get this in the next merge window, for 6.9. The feedback
> we've gotten has been that even out of tree this patchset has already
> been useful, and there's a significant amount of other work gated on the
> code tagging functionality included in this patchset [2].
I suspect it will not come as a surprise that I really dislike the
implementation proposed here. I will not repeat my arguments, I have
done so on several occasions already.
Anyway, I didn't go as far as to nak it even though I _strongly_ believe
this debugging feature will add a maintenance overhead for a very long
time. I can live with all the downsides of the proposed implementation
_as long as_ there is a wider agreement from the MM community as this is
where the maintenance cost will be paid. So far I have not seen (m)any
acks by MM developers so aiming for the next merge window is more than
a little rushed.
> 81 files changed, 2126 insertions(+), 695 deletions(-)
--
Michal Hocko
SUSE Labs
On 13.02.24 22:58, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
>>
>> On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
>> [...]
>>> We're aiming to get this in the next merge window, for 6.9. The feedback
>>> we've gotten has been that even out of tree this patchset has already
>>> been useful, and there's a significant amount of other work gated on the
>>> code tagging functionality included in this patchset [2].
>>
>> I suspect it will not come as a surprise that I really dislike the
>> implementation proposed here. I will not repeat my arguments, I have
>> done so on several occasions already.
>>
>> Anyway, I didn't go as far as to nak it even though I _strongly_ believe
>> this debugging feature will add a maintenance overhead for a very long
>> time. I can live with all the downsides of the proposed implementation
>> _as long as_ there is a wider agreement from the MM community as this is
>> where the maintenance cost will be paid. So far I have not seen (m)any
>> acks by MM developers so aiming for the next merge window is more than
>> a little rushed.
>
> We tried other previously proposed approaches and all have their
> downsides without making maintenance much easier. Your position is
> understandable and I think it's fair. Let's see if others see more
> benefit than cost here.
Would it make sense to discuss that at LSF/MM once again, especially
covering why proposed alternatives did not work out? LSF/MM is not "too
far" away (May).
I recall that the last LSF/MM session on this topic was a bit
unfortunate (IMHO not as productive as it could have been). Maybe we can
finally reach a consensus on this.
--
Cheers,
David / dhildenb
On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
> [...]
> > We're aiming to get this in the next merge window, for 6.9. The feedback
> > we've gotten has been that even out of tree this patchset has already
> > been useful, and there's a significant amount of other work gated on the
> > code tagging functionality included in this patchset [2].
>
> I suspect it will not come as a surprise that I really dislike the
> implementation proposed here. I will not repeat my arguments, I have
> done so on several occasions already.
>
> Anyway, I didn't go as far as to nak it even though I _strongly_ believe
> this debugging feature will add a maintenance overhead for a very long
> time. I can live with all the downsides of the proposed implementation
> _as long as_ there is a wider agreement from the MM community as this is
> where the maintenance cost will be paid. So far I have not seen (m)any
> acks by MM developers so aiming for the next merge window is more than
> a little rushed.
We tried other previously proposed approaches and all have their
downsides without making maintenance much easier. Your position is
understandable and I think it's fair. Let's see if others see more
benefit than cost here.
Thanks,
Suren.
>
> > 81 files changed, 2126 insertions(+), 695 deletions(-)
> --
> Michal Hocko
> SUSE Labs
On Tue, Feb 13, 2024 at 10:26:48AM +0200, Andy Shevchenko wrote:
> On Mon, Feb 12, 2024 at 11:39 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > From: Kent Overstreet <[email protected]>
> >
> > The new flags parameter allows controlling
> > - Whether or not the units suffix is separated by a space, for
> > compatibility with sort -h
> > - Whether or not to append a B suffix - we're not always printing
> > bytes.
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>
> ...
>
> You can move the below under --- cutter, so it won't pollute the git history.
>
> > Cc: Andy Shevchenko <[email protected]>
> > Cc: Michael Ellerman <[email protected]>
> > Cc: Benjamin Herrenschmidt <[email protected]>
> > Cc: Paul Mackerras <[email protected]>
> > Cc: "Michael S. Tsirkin" <[email protected]>
> > Cc: Jason Wang <[email protected]>
> > Cc: "Noralf Trønnes" <[email protected]>
> > Cc: Jens Axboe <[email protected]>
> > ---
>
> ...
>
> > --- a/include/linux/string_helpers.h
> > +++ b/include/linux/string_helpers.h
> > @@ -17,14 +17,13 @@ static inline bool string_is_terminated(const char *s, int len)
>
> ...
>
> > -/* Descriptions of the types of units to
> > - * print in */
> > -enum string_size_units {
> > - STRING_UNITS_10, /* use powers of 10^3 (standard SI) */
> > - STRING_UNITS_2, /* use binary powers of 2^10 */
> > +enum string_size_flags {
> > + STRING_SIZE_BASE2 = (1 << 0),
> > + STRING_SIZE_NOSPACE = (1 << 1),
> > + STRING_SIZE_NOBYTES = (1 << 2),
> > };
>
> Do not kill documentation, I already said that. Or i.o.w. document this.
> Also the _SIZE is ambiguous (if you don't want UNITS, use SIZE_FORMAT).
>
> Also why did you kill BASE10 here? (see below as well)
As you should be able to tell from the name, it's a set of flags.
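To make the flags point concrete, here is a minimal user-space sketch of independent formatting bits being OR-ed together. It is not the kernel's string_get_size(); the names size_fmt_flags and format_size() are hypothetical, and the single truncated decimal digit is a simplification chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the STRING_SIZE_* flags; each bit is independent. */
enum size_fmt_flags {
	SIZE_FMT_BASE2   = 1 << 0,	/* powers of 1024 instead of 1000 */
	SIZE_FMT_NOSPACE = 1 << 1,	/* no space between number and unit */
	SIZE_FMT_NOBYTES = 1 << 2,	/* drop the trailing "B" */
};

/* Format size into buf with one truncated decimal digit; returns snprintf()'s result. */
static int format_size(uint64_t size, unsigned int flags, char *buf, size_t len)
{
	static const char *const prefix_10[] = { "", "k", "M", "G", "T" };
	static const char *const prefix_2[] = { "", "Ki", "Mi", "Gi", "Ti" };
	const int base2 = !!(flags & SIZE_FMT_BASE2);
	const uint64_t div = base2 ? 1024 : 1000;
	const char *space = (flags & SIZE_FMT_NOSPACE) ? "" : " ";
	const char *bytes = (flags & SIZE_FMT_NOBYTES) ? "" : "B";
	uint64_t rem = 0;
	int i = 0;

	/* Scale down until the value is smaller than one unit step. */
	while (size >= div && i < 4) {
		rem = size % div;
		size /= div;
		i++;
	}

	/* Plain byte counts get no fraction and no prefix. */
	if (i == 0)
		return snprintf(buf, len, "%llu%s%s",
				(unsigned long long)size, space, bytes);

	return snprintf(buf, len, "%llu.%llu%s%s%s",
			(unsigned long long)size,
			(unsigned long long)(rem * 10 / div),
			space, base2 ? prefix_2[i] : prefix_10[i], bytes);
}
```

With these definitions, format_size(1536, SIZE_FMT_BASE2 | SIZE_FMT_NOSPACE, buf, sizeof(buf)) produces "1.5KiB", the no-space form the patch description wants for sort -h, while passing 0 for the flags produces "1.5 kB".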
> > --- a/lib/string_helpers.c
> > +++ b/lib/string_helpers.c
> > @@ -19,11 +19,17 @@
> > #include <linux/string.h>
> > #include <linux/string_helpers.h>
> >
> > +enum string_size_units {
> > + STRING_UNITS_10, /* use powers of 10^3 (standard SI) */
> > + STRING_UNITS_2, /* use binary powers of 2^10 */
> > +};
>
> Why do we need this duplication?
Because otherwise a lot more code would have to change.
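Roughly, the idea is that the existing lookup tables stay indexed by the old two-value enum, now private to the .c file, and the new flags are translated once at the entry point. A user-space sketch under that assumption (units_from_flags() and divisor_for() are made-up names, not functions from the patch):

```c
/* Public flag bits, as callers would see them (hypothetical names). */
enum size_fmt_flags {
	SIZE_FMT_BASE2   = 1 << 0,
	SIZE_FMT_NOSPACE = 1 << 1,
	SIZE_FMT_NOBYTES = 1 << 2,
};

/* Legacy enum kept private so existing table lookups need no changes. */
enum size_units {
	UNITS_10,	/* powers of 10^3 */
	UNITS_2,	/* powers of 2^10 */
};

/* Translate the public flags to the private enum once, at the entry point. */
static enum size_units units_from_flags(unsigned int flags)
{
	return (flags & SIZE_FMT_BASE2) ? UNITS_2 : UNITS_10;
}

/* Existing-style lookup table, still indexed by the legacy enum. */
static unsigned int divisor_for(enum size_units units)
{
	static const unsigned int divisor[] = {
		[UNITS_10] = 1000,
		[UNITS_2]  = 1024,
	};

	return divisor[units];
}
```

Only the translation function knows about both enums, so code that indexes tables by the legacy values is untouched.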
>
> It seems most of my points from the previous review were refused...
Look, Andy, this is a pretty tiny part of the patchset, yet it's been
eating up a pretty disproportionate amount of time and your review
feedback has been pretty unhelpful - asking for things to be broken up
in ways that would not be bisectable, or (as here) re-asking the same
things that I've already answered and that should've been obvious.
The code works. If you wish to complain about anything being broken, or
if you can come up with anything more actionable than what you've got
here, I will absolutely respond to that, but otherwise I'm just going to
leave things where they sit.
On 13.02.24 23:09, Kent Overstreet wrote:
> On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
>> On 13.02.24 22:58, Suren Baghdasaryan wrote:
>>> On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
>>>>
>>>> On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
>>>> [...]
>>>>> We're aiming to get this in the next merge window, for 6.9. The feedback
>>>>> we've gotten has been that even out of tree this patchset has already
>>>>> been useful, and there's a significant amount of other work gated on the
>>>>> code tagging functionality included in this patchset [2].
>>>>
>>>> I suspect it will not come as a surprise that I really dislike the
>>>> implementation proposed here. I will not repeat my arguments, I have
>>>> done so on several occasions already.
>>>>
>>>> Anyway, I didn't go as far as to nak it even though I _strongly_ believe
>>>> this debugging feature will add a maintenance overhead for a very long
>>>> time. I can live with all the downsides of the proposed implementation
>>>> _as long as_ there is a wider agreement from the MM community as this is
>>>> where the maintenance cost will be paid. So far I have not seen (m)any
>>>> acks by MM developers, so aiming for the next merge window is more than
>>>> a little rushed.
>>>
>>> We tried other previously proposed approaches and all have their
>>> downsides without making maintenance much easier. Your position is
>>> understandable and I think it's fair. Let's see if others see more
>>> benefit than cost here.
>>
>> Would it make sense to discuss that at LSF/MM once again, especially
>> covering why proposed alternatives did not work out? LSF/MM is not "too far"
>> away (May).
>>
>> I recall that the last LSF/MM session on this topic was a bit unfortunate
>> (IMHO not as productive as it could have been). Maybe we can finally reach a
>> consensus on this.
>
> I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
> need to see a serious proposal - what we had at the last LSF was people
> jumping in with half baked alternative proposals that very much hadn't
> been thought through, and I see no need to repeat that.
>
> Like I mentioned, there's other work gated on this patchset; if people
> want to hold this up for more discussion they better be putting forth
> something to discuss.
I'm thinking of ways to achieve Michal's request: "as long as
there is a wider agreement from the MM community". If we can achieve
that without LSF, great! (a bi-weekly MM meeting might also be an option)
--
Cheers,
David / dhildenb
On Mon, Feb 12, 2024 at 05:01:19PM -0800, Suren Baghdasaryan wrote:
> On Mon, Feb 12, 2024 at 2:40 PM Kees Cook <[email protected]> wrote:
> >
> > On Mon, Feb 12, 2024 at 01:38:59PM -0800, Suren Baghdasaryan wrote:
> > > Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily
> > > instrument memory allocators. It registers an "alloc_tags" codetag type
> > > with /proc/allocinfo interface to output allocation tag information when
> >
> > Please don't add anything new to the top-level /proc directory. This
> > should likely live in /sys.
>
> Ack. I'll find a more appropriate place for it then.
> It just seemed like such generic information which would belong next
> to meminfo/zoneinfo and such...
Save yourself a cycle of "rework the whole fs interface only to have
someone else tell you no" and put it in debugfs, not sysfs. Wrangling
with debugfs is easier than all the macro-happy sysfs stuff; you don't
have to integrate with the "device" model; and there is no 'one value
per file' rule.
--D
> >
> > > the feature is enabled.
> > > CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
> > > allocation profiling instrumentation.
> > > Memory allocation profiling can be enabled or disabled at runtime using
> > > /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n.
> > > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation
> > > profiling by default.
> > >
> > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > Co-developed-by: Kent Overstreet <[email protected]>
> > > Signed-off-by: Kent Overstreet <[email protected]>
> > > ---
> > > Documentation/admin-guide/sysctl/vm.rst | 16 +++
> > > Documentation/filesystems/proc.rst | 28 +++++
> > > include/asm-generic/codetag.lds.h | 14 +++
> > > include/asm-generic/vmlinux.lds.h | 3 +
> > > include/linux/alloc_tag.h | 133 ++++++++++++++++++++
> > > include/linux/sched.h | 24 ++++
> > > lib/Kconfig.debug | 25 ++++
> > > lib/Makefile | 2 +
> > > lib/alloc_tag.c | 158 ++++++++++++++++++++++++
> > > scripts/module.lds.S | 7 ++
> > > 10 files changed, 410 insertions(+)
> > > create mode 100644 include/asm-generic/codetag.lds.h
> > > create mode 100644 include/linux/alloc_tag.h
> > > create mode 100644 lib/alloc_tag.c
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > > index c59889de122b..a214719492ea 100644
> > > --- a/Documentation/admin-guide/sysctl/vm.rst
> > > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > > @@ -43,6 +43,7 @@ Currently, these files are in /proc/sys/vm:
> > > - legacy_va_layout
> > > - lowmem_reserve_ratio
> > > - max_map_count
> > > +- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y)
> > > - memory_failure_early_kill
> > > - memory_failure_recovery
> > > - min_free_kbytes
> > > @@ -425,6 +426,21 @@ e.g., up to one or two maps per allocation.
> > > The default value is 65530.
> > >
> > >
> > > +mem_profiling
> > > +==============
> > > +
> > > +Enable memory profiling (when CONFIG_MEM_ALLOC_PROFILING=y)
> > > +
> > > +1: Enable memory profiling.
> > > +
> > > +0: Disable memory profiling.
> > > +
> > > +Enabling memory profiling introduces a small performance overhead for all
> > > +memory allocations.
> > > +
> > > +The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT.
> > > +
> > > +
> > > memory_failure_early_kill:
> > > ==========================
> > >
> > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > index 104c6d047d9b..40d6d18308e4 100644
> > > --- a/Documentation/filesystems/proc.rst
> > > +++ b/Documentation/filesystems/proc.rst
> > > @@ -688,6 +688,7 @@ files are there, and which are missing.
> > > ============ ===============================================================
> > > File Content
> > > ============ ===============================================================
> > > + allocinfo Memory allocations profiling information
> > > apm Advanced power management info
> > > bootconfig Kernel command line obtained from boot config,
> > > and, if there were kernel parameters from the
> > > @@ -953,6 +954,33 @@ also be allocatable although a lot of filesystem metadata may have to be
> > > reclaimed to achieve this.
> > >
> > >
> > > +allocinfo
> > > +~~~~~~~~~
> > > +
> > > +Provides information about memory allocations at all locations in the code
> > > +base. Each allocation in the code is identified by its source file, line
> > > +number, module and the function calling the allocation. The number of bytes
> > > +allocated at each location is reported.
> > > +
> > > +Example output.
> > > +
> > > +::
> > > +
> > > + > cat /proc/allocinfo
> > > +
> > > + 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
> > > + 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
> > > + 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
> > > + 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
> > > + 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
> > > + 1.16MiB fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
> > > + 1.00MiB mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
> > > + 734KiB fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
> > > + 640KiB kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
> > > + 640KiB drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf
> > > + ...
> > > +
> > > +
> > > meminfo
> > > ~~~~~~~
> > >
> > > diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
> > > new file mode 100644
> > > index 000000000000..64f536b80380
> > > --- /dev/null
> > > +++ b/include/asm-generic/codetag.lds.h
> > > @@ -0,0 +1,14 @@
> > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > +#ifndef __ASM_GENERIC_CODETAG_LDS_H
> > > +#define __ASM_GENERIC_CODETAG_LDS_H
> > > +
> > > +#define SECTION_WITH_BOUNDARIES(_name) \
> > > + . = ALIGN(8); \
> > > + __start_##_name = .; \
> > > + KEEP(*(_name)) \
> > > + __stop_##_name = .;
> > > +
> > > +#define CODETAG_SECTIONS() \
> > > + SECTION_WITH_BOUNDARIES(alloc_tags)
> > > +
> > > +#endif /* __ASM_GENERIC_CODETAG_LDS_H */
> > > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> > > index 5dd3a61d673d..c9997dc50c50 100644
> > > --- a/include/asm-generic/vmlinux.lds.h
> > > +++ b/include/asm-generic/vmlinux.lds.h
> > > @@ -50,6 +50,8 @@
> > > * [__nosave_begin, __nosave_end] for the nosave data
> > > */
> > >
> > > +#include <asm-generic/codetag.lds.h>
> > > +
> > > #ifndef LOAD_OFFSET
> > > #define LOAD_OFFSET 0
> > > #endif
> > > @@ -366,6 +368,7 @@
> > > . = ALIGN(8); \
> > > BOUNDED_SECTION_BY(__dyndbg_classes, ___dyndbg_classes) \
> > > BOUNDED_SECTION_BY(__dyndbg, ___dyndbg) \
> > > + CODETAG_SECTIONS() \
> > > LIKELY_PROFILE() \
> > > BRANCH_PROFILE() \
> > > TRACE_PRINTKS() \
> > > diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
> > > new file mode 100644
> > > index 000000000000..cf55a149fa84
> > > --- /dev/null
> > > +++ b/include/linux/alloc_tag.h
> > > @@ -0,0 +1,133 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * allocation tagging
> > > + */
> > > +#ifndef _LINUX_ALLOC_TAG_H
> > > +#define _LINUX_ALLOC_TAG_H
> > > +
> > > +#include <linux/bug.h>
> > > +#include <linux/codetag.h>
> > > +#include <linux/container_of.h>
> > > +#include <linux/preempt.h>
> > > +#include <asm/percpu.h>
> > > +#include <linux/cpumask.h>
> > > +#include <linux/static_key.h>
> > > +
> > > +struct alloc_tag_counters {
> > > + u64 bytes;
> > > + u64 calls;
> > > +};
> > > +
> > > +/*
> > > + * An instance of this structure is created in a special ELF section at every
> > > + * allocation callsite. At runtime, the special section is treated as
> > > + * an array of these. Embedded codetag utilizes codetag framework.
> > > + */
> > > +struct alloc_tag {
> > > + struct codetag ct;
> > > + struct alloc_tag_counters __percpu *counters;
> > > +} __aligned(8);
> > > +
> > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > +
> > > +static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
> > > +{
> > > + return container_of(ct, struct alloc_tag, ct);
> > > +}
> > > +
> > > +#ifdef ARCH_NEEDS_WEAK_PER_CPU
> > > +/*
> > > + * When percpu variables are required to be defined as weak, static percpu
> > > + * variables can't be used inside a function (see comments for DECLARE_PER_CPU_SECTION).
> > > + */
> > > +#error "Memory allocation profiling is incompatible with ARCH_NEEDS_WEAK_PER_CPU"
> >
> > Is this enforced via Kconfig as well? (Looks like only alpha and s390?)
>
> Unfortunately ARCH_NEEDS_WEAK_PER_CPU is not a Kconfig option but
> CONFIG_DEBUG_FORCE_WEAK_PER_CPU is, so that one is handled via Kconfig
> (see "depends on !DEBUG_FORCE_WEAK_PER_CPU" in this patch). We have to
> avoid both cases because of this:
> https://elixir.bootlin.com/linux/latest/source/include/linux/percpu-defs.h#L75,
> so I'm trying to provide an informative error here.
>
> >
> > > +#endif
> > > +
> > > +#define DEFINE_ALLOC_TAG(_alloc_tag, _old) \
> > > + static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr); \
> > > + static struct alloc_tag _alloc_tag __used __aligned(8) \
> > > + __section("alloc_tags") = { \
> > > + .ct = CODE_TAG_INIT, \
> > > + .counters = &_alloc_tag_cntr }; \
> > > + struct alloc_tag * __maybe_unused _old = alloc_tag_save(&_alloc_tag)
> > > +
> > > +DECLARE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
> > > + mem_alloc_profiling_key);
> > > +
> > > +static inline bool mem_alloc_profiling_enabled(void)
> > > +{
> > > + return static_branch_maybe(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
> > > + &mem_alloc_profiling_key);
> > > +}
> > > +
> > > +static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
> > > +{
> > > + struct alloc_tag_counters v = { 0, 0 };
> > > + struct alloc_tag_counters *counter;
> > > + int cpu;
> > > +
> > > + for_each_possible_cpu(cpu) {
> > > + counter = per_cpu_ptr(tag->counters, cpu);
> > > + v.bytes += counter->bytes;
> > > + v.calls += counter->calls;
> > > + }
> > > +
> > > + return v;
> > > +}
> > > +
> > > +static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes)
> > > +{
> > > + struct alloc_tag *tag;
> > > +
> > > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > > + WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n");
> > > +#endif
> > > + if (!ref || !ref->ct)
> > > + return;
> > > +
> > > + tag = ct_to_alloc_tag(ref->ct);
> > > +
> > > + this_cpu_sub(tag->counters->bytes, bytes);
> > > + this_cpu_dec(tag->counters->calls);
> > > +
> > > + ref->ct = NULL;
> > > +}
> > > +
> > > +static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
> > > +{
> > > + __alloc_tag_sub(ref, bytes);
> > > +}
> > > +
> > > +static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes)
> > > +{
> > > + __alloc_tag_sub(ref, bytes);
> > > +}
> > > +
> > > +static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
> > > +{
> > > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > > + WARN_ONCE(ref && ref->ct,
> > > + "alloc_tag was not cleared (got tag for %s:%u)\n",\
> > > + ref->ct->filename, ref->ct->lineno);
> > > +
> > > + WARN_ONCE(!tag, "current->alloc_tag not set");
> > > +#endif
> > > + if (!ref || !tag)
> > > + return;
> > > +
> > > + ref->ct = &tag->ct;
> > > + this_cpu_add(tag->counters->bytes, bytes);
> > > + this_cpu_inc(tag->counters->calls);
> > > +}
> > > +
> > > +#else
> > > +
> > > +#define DEFINE_ALLOC_TAG(_alloc_tag, _old)
> > > +static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
> > > +static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
> > > +static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
> > > + size_t bytes) {}
> > > +
> > > +#endif
> > > +
> > > +#endif /* _LINUX_ALLOC_TAG_H */
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index ffe8f618ab86..da68a10517c8 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -770,6 +770,10 @@ struct task_struct {
> > > unsigned int flags;
> > > unsigned int ptrace;
> > >
> > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > + struct alloc_tag *alloc_tag;
> > > +#endif
> >
> > Normally scheduling is very sensitive to having anything early in
> > task_struct. I would suggest moving this to the CONFIG_SCHED_CORE ifdef
> > area.
>
> Thanks for the warning! We will look into that.
>
> >
> > > +
> > > #ifdef CONFIG_SMP
> > > int on_cpu;
> > > struct __call_single_node wake_entry;
> > > @@ -810,6 +814,7 @@ struct task_struct {
> > > struct task_group *sched_task_group;
> > > #endif
> > >
> > > +
> > > #ifdef CONFIG_UCLAMP_TASK
> > > /*
> > > * Clamp values requested for a scheduling entity.
> > > @@ -2183,4 +2188,23 @@ static inline int sched_core_idle_cpu(int cpu) { return idle_cpu(cpu); }
> > >
> > > extern void sched_set_stop_task(int cpu, struct task_struct *stop);
> > >
> > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > +static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag)
> > > +{
> > > + swap(current->alloc_tag, tag);
> > > + return tag;
> > > +}
> > > +
> > > +static inline void alloc_tag_restore(struct alloc_tag *tag, struct alloc_tag *old)
> > > +{
> > > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > > + WARN(current->alloc_tag != tag, "current->alloc_tag was changed:\n");
> > > +#endif
> > > + current->alloc_tag = old;
> > > +}
> > > +#else
> > > +static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag) { return NULL; }
> > > +#define alloc_tag_restore(_tag, _old)
> > > +#endif
> > > +
> > > #endif
> > > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > > index 0be2d00c3696..78d258ca508f 100644
> > > --- a/lib/Kconfig.debug
> > > +++ b/lib/Kconfig.debug
> > > @@ -972,6 +972,31 @@ config CODE_TAGGING
> > > bool
> > > select KALLSYMS
> > >
> > > +config MEM_ALLOC_PROFILING
> > > + bool "Enable memory allocation profiling"
> > > + default n
> > > + depends on PROC_FS
> > > + depends on !DEBUG_FORCE_WEAK_PER_CPU
> > > + select CODE_TAGGING
> > > + help
> > > + Track allocation source code and record total allocation size
> > > + initiated at that code location. The mechanism can be used to track
> > > + memory leaks with a low performance and memory impact.
> > > +
> > > +config MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> > > + bool "Enable memory allocation profiling by default"
> > > + default y
> > > + depends on MEM_ALLOC_PROFILING
> > > +
> > > +config MEM_ALLOC_PROFILING_DEBUG
> > > + bool "Memory allocation profiler debugging"
> > > + default n
> > > + depends on MEM_ALLOC_PROFILING
> > > + select MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> > > + help
> > > + Adds warnings with helpful error messages for memory allocation
> > > + profiling.
> > > +
> > > source "lib/Kconfig.kasan"
> > > source "lib/Kconfig.kfence"
> > > source "lib/Kconfig.kmsan"
> > > diff --git a/lib/Makefile b/lib/Makefile
> > > index 6b48b22fdfac..859112f09bf5 100644
> > > --- a/lib/Makefile
> > > +++ b/lib/Makefile
> > > @@ -236,6 +236,8 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
> > > obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> > >
> > > obj-$(CONFIG_CODE_TAGGING) += codetag.o
> > > +obj-$(CONFIG_MEM_ALLOC_PROFILING) += alloc_tag.o
> > > +
> > > lib-$(CONFIG_GENERIC_BUG) += bug.o
> > >
> > > obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
> > > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > > new file mode 100644
> > > index 000000000000..4fc031f9cefd
> > > --- /dev/null
> > > +++ b/lib/alloc_tag.c
> > > @@ -0,0 +1,158 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +#include <linux/alloc_tag.h>
> > > +#include <linux/fs.h>
> > > +#include <linux/gfp.h>
> > > +#include <linux/module.h>
> > > +#include <linux/proc_fs.h>
> > > +#include <linux/seq_buf.h>
> > > +#include <linux/seq_file.h>
> > > +
> > > +static struct codetag_type *alloc_tag_cttype;
> > > +
> > > +DEFINE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
> > > + mem_alloc_profiling_key);
> > > +
> > > +static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> > > +{
> > > + struct codetag_iterator *iter;
> > > + struct codetag *ct;
> > > + loff_t node = *pos;
> > > +
> > > + iter = kzalloc(sizeof(*iter), GFP_KERNEL);
> > > + m->private = iter;
> > > + if (!iter)
> > > + return NULL;
> > > +
> > > + codetag_lock_module_list(alloc_tag_cttype, true);
> > > + *iter = codetag_get_ct_iter(alloc_tag_cttype);
> > > + while ((ct = codetag_next_ct(iter)) != NULL && node)
> > > + node--;
> > > +
> > > + return ct ? iter : NULL;
> > > +}
> > > +
> > > +static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
> > > +{
> > > + struct codetag_iterator *iter = (struct codetag_iterator *)arg;
> > > + struct codetag *ct = codetag_next_ct(iter);
> > > +
> > > + (*pos)++;
> > > + if (!ct)
> > > + return NULL;
> > > +
> > > + return iter;
> > > +}
> > > +
> > > +static void allocinfo_stop(struct seq_file *m, void *arg)
> > > +{
> > > + struct codetag_iterator *iter = (struct codetag_iterator *)m->private;
> > > +
> > > + if (iter) {
> > > + codetag_lock_module_list(alloc_tag_cttype, false);
> > > + kfree(iter);
> > > + }
> > > +}
> > > +
> > > +static void alloc_tag_to_text(struct seq_buf *out, struct codetag *ct)
> > > +{
> > > + struct alloc_tag *tag = ct_to_alloc_tag(ct);
> > > + struct alloc_tag_counters counter = alloc_tag_read(tag);
> > > + s64 bytes = counter.bytes;
> > > + char val[10], *p = val;
> > > +
> > > + if (bytes < 0) {
> > > + *p++ = '-';
> > > + bytes = -bytes;
> > > + }
> > > +
> > > + string_get_size(bytes, 1,
> > > + STRING_SIZE_BASE2|STRING_SIZE_NOSPACE,
> > > + p, val + ARRAY_SIZE(val) - p);
> > > +
> > > + seq_buf_printf(out, "%8s %8llu ", val, counter.calls);
> > > + codetag_to_text(out, ct);
> > > + seq_buf_putc(out, ' ');
> > > + seq_buf_putc(out, '\n');
> > > +}
> >
> > /me does happy seq_buf dance!
> >
> > > +
> > > +static int allocinfo_show(struct seq_file *m, void *arg)
> > > +{
> > > + struct codetag_iterator *iter = (struct codetag_iterator *)arg;
> > > + char *bufp;
> > > + size_t n = seq_get_buf(m, &bufp);
> > > + struct seq_buf buf;
> > > +
> > > + seq_buf_init(&buf, bufp, n);
> > > + alloc_tag_to_text(&buf, iter->ct);
> > > + seq_commit(m, seq_buf_used(&buf));
> > > + return 0;
> > > +}
> > > +
> > > +static const struct seq_operations allocinfo_seq_op = {
> > > + .start = allocinfo_start,
> > > + .next = allocinfo_next,
> > > + .stop = allocinfo_stop,
> > > + .show = allocinfo_show,
> > > +};
> > > +
> > > +static void __init procfs_init(void)
> > > +{
> > > + proc_create_seq("allocinfo", 0444, NULL, &allocinfo_seq_op);
> > > +}
> >
> > As mentioned, this really should be in /sys somewhere.
>
> Ack.
>
> >
> > > +
> > > +static bool alloc_tag_module_unload(struct codetag_type *cttype,
> > > + struct codetag_module *cmod)
> > > +{
> > > + struct codetag_iterator iter = codetag_get_ct_iter(cttype);
> > > + struct alloc_tag_counters counter;
> > > + bool module_unused = true;
> > > + struct alloc_tag *tag;
> > > + struct codetag *ct;
> > > +
> > > + for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
> > > + if (iter.cmod != cmod)
> > > + continue;
> > > +
> > > + tag = ct_to_alloc_tag(ct);
> > > + counter = alloc_tag_read(tag);
> > > +
> > > + if (WARN(counter.bytes, "%s:%u module %s func:%s has %llu allocated at module unload",
> > > + ct->filename, ct->lineno, ct->modname, ct->function, counter.bytes))
> > > + module_unused = false;
> > > + }
> > > +
> > > + return module_unused;
> > > +}
> > > +
> > > +static struct ctl_table memory_allocation_profiling_sysctls[] = {
> > > + {
> > > + .procname = "mem_profiling",
> > > + .data = &mem_alloc_profiling_key,
> > > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > > + .mode = 0444,
> > > +#else
> > > + .mode = 0644,
> > > +#endif
> > > + .proc_handler = proc_do_static_key,
> > > + },
> > > + { }
> > > +};
> > > +
> > > +static int __init alloc_tag_init(void)
> > > +{
> > > + const struct codetag_type_desc desc = {
> > > + .section = "alloc_tags",
> > > + .tag_size = sizeof(struct alloc_tag),
> > > + .module_unload = alloc_tag_module_unload,
> > > + };
> > > +
> > > + alloc_tag_cttype = codetag_register_type(&desc);
> > > + if (IS_ERR_OR_NULL(alloc_tag_cttype))
> > > + return PTR_ERR(alloc_tag_cttype);
> > > +
> > > + register_sysctl_init("vm", memory_allocation_profiling_sysctls);
> > > + procfs_init();
> > > +
> > > + return 0;
> > > +}
> > > +module_init(alloc_tag_init);
> > > diff --git a/scripts/module.lds.S b/scripts/module.lds.S
> > > index bf5bcf2836d8..45c67a0994f3 100644
> > > --- a/scripts/module.lds.S
> > > +++ b/scripts/module.lds.S
> > > @@ -9,6 +9,8 @@
> > > #define DISCARD_EH_FRAME *(.eh_frame)
> > > #endif
> > >
> > > +#include <asm-generic/codetag.lds.h>
> > > +
> > > SECTIONS {
> > > /DISCARD/ : {
> > > *(.discard)
> > > @@ -47,12 +49,17 @@ SECTIONS {
> > > .data : {
> > > *(.data .data.[0-9a-zA-Z_]*)
> > > *(.data..L*)
> > > + CODETAG_SECTIONS()
> > > }
> > >
> > > .rodata : {
> > > *(.rodata .rodata.[0-9a-zA-Z_]*)
> > > *(.rodata..L*)
> > > }
> > > +#else
> > > + .data : {
> > > + CODETAG_SECTIONS()
> > > + }
> > > #endif
> > > }
> >
> > Otherwise, looks good.
>
> Thanks!
>
> >
> > --
> > Kees Cook
>
On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
> On 13.02.24 22:58, Suren Baghdasaryan wrote:
> > On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
> > > [...]
> > > > We're aiming to get this in the next merge window, for 6.9. The feedback
> > > > we've gotten has been that even out of tree this patchset has already
> > > > been useful, and there's a significant amount of other work gated on the
> > > > code tagging functionality included in this patchset [2].
> > >
> > > I suspect it will not come as a surprise that I really dislike the
> > > implementation proposed here. I will not repeat my arguments, I have
> > > done so on several occasions already.
> > >
> > > Anyway, I didn't go as far as to nak it even though I _strongly_ believe
> > > this debugging feature will add a maintenance overhead for a very long
> > > time. I can live with all the downsides of the proposed implementation
> > > _as long as_ there is a wider agreement from the MM community as this is
> > > where the maintenance cost will be paid. So far I have not seen (m)any
> > > acks by MM developers, so aiming for the next merge window is more than
> > > a little rushed.
> >
> > We tried other previously proposed approaches and all have their
> > downsides without making maintenance much easier. Your position is
> > understandable and I think it's fair. Let's see if others see more
> > benefit than cost here.
>
> Would it make sense to discuss that at LSF/MM once again, especially
> covering why proposed alternatives did not work out? LSF/MM is not "too far"
> away (May).
>
> I recall that the last LSF/MM session on this topic was a bit unfortunate
> (IMHO not as productive as it could have been). Maybe we can finally reach a
> consensus on this.
I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
need to see a serious proposal - what we had at the last LSF was people
jumping in with half baked alternative proposals that very much hadn't
been thought through, and I see no need to repeat that.
Like I mentioned, there's other work gated on this patchset; if people
want to hold this up for more discussion they better be putting forth
something to discuss.
On Tue, Feb 13, 2024 at 2:17 PM David Hildenbrand <[email protected]> wrote:
>
> On 13.02.24 23:09, Kent Overstreet wrote:
> > On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
> >> On 13.02.24 22:58, Suren Baghdasaryan wrote:
> >>> On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
> >>>>
> >>>> On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
> >>>> [...]
> >>>>> We're aiming to get this in the next merge window, for 6.9. The feedback
> >>>>> we've gotten has been that even out of tree this patchset has already
> >>>>> been useful, and there's a significant amount of other work gated on the
> >>>>> code tagging functionality included in this patchset [2].
> >>>>
> >>>> I suspect it will not come as a surprise that I really dislike the
> >>>> implementation proposed here. I will not repeat my arguments, I have
> >>>> done so on several occasions already.
> >>>>
> >>>> Anyway, I didn't go as far as to nak it even though I _strongly_ believe
> >>>> this debugging feature will add a maintenance overhead for a very long
> >>>> time. I can live with all the downsides of the proposed implementation
> >>>> _as long as_ there is a wider agreement from the MM community as this is
> >>>> where the maintenance cost will be paid. So far I have not seen (m)any
> >>>> acks by MM developers, so aiming for the next merge window is more than
> >>>> a little rushed.
> >>>
> >>> We tried other previously proposed approaches and all have their
> >>> downsides without making maintenance much easier. Your position is
> >>> understandable and I think it's fair. Let's see if others see more
> >>> benefit than cost here.
> >>
> >> Would it make sense to discuss that at LSF/MM once again, especially
> >> covering why proposed alternatives did not work out? LSF/MM is not "too far"
> >> away (May).
> >>
> >> I recall that the last LSF/MM session on this topic was a bit unfortunate
> >> (IMHO not as productive as it could have been). Maybe we can finally reach a
> >> consensus on this.
> >
> > I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
> > need to see a serious proposal - what we had at the last LSF was people
> > jumping in with half baked alternative proposals that very much hadn't
> > been thought through, and I see no need to repeat that.
> >
> > Like I mentioned, there's other work gated on this patchset; if people
> > want to hold this up for more discussion they better be putting forth
> > something to discuss.
>
> I'm thinking of ways to achieve Michal's request: "as long as
> there is a wider agreement from the MM community". If we can achieve
> that without LSF, great! (a bi-weekly MM meeting might also be an option)
There will be a maintenance burden even with the cleanest proposed
approach. We worked hard to make the patchset as clean as possible and
if benefits still don't outweigh the maintenance cost then we should
probably stop trying. At LSF/MM I would rather discuss functional
issues/requirements/improvements than alternative approaches to
instrument allocators.
I'm happy to arrange a separate meeting with MM folks if that would
help to progress on the cost/benefit decision.
>
> --
> Cheers,
>
> David / dhildenb
>
On Tue, Feb 13, 2024 at 11:17:32PM +0100, David Hildenbrand wrote:
> On 13.02.24 23:09, Kent Overstreet wrote:
> > On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
> > > On 13.02.24 22:58, Suren Baghdasaryan wrote:
> > > > On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
> > > > > [...]
> > > > > > We're aiming to get this in the next merge window, for 6.9. The feedback
> > > > > > we've gotten has been that even out of tree this patchset has already
> > > > > > been useful, and there's a significant amount of other work gated on the
> > > > > > code tagging functionality included in this patchset [2].
> > > > >
> > > > > I suspect it will not come as a surprise that I really dislike the
> > > > > implementation proposed here. I will not repeat my arguments, I have
> > > > > done so on several occasions already.
> > > > >
> > > > > Anyway, I didn't go as far as to nak it even though I _strongly_ believe
> > > > > this debugging feature will add a maintenance overhead for a very long
> > > > > time. I can live with all the downsides of the proposed implementation
> > > > > _as long as_ there is a wider agreement from the MM community as this is
> > > > > > where the maintenance cost will be paid. So far I have not seen (m)any
> > > > > > acks by MM developers so aiming into the next merge window is more than
> > > > > > a little rushed.
> > > >
> > > > We tried other previously proposed approaches and all have their
> > > > downsides without making maintenance much easier. Your position is
> > > > understandable and I think it's fair. Let's see if others see more
> > > > benefit than cost here.
> > >
> > > Would it make sense to discuss that at LSF/MM once again, especially
> > > covering why proposed alternatives did not work out? LSF/MM is not "too far"
> > > away (May).
> > >
> > > I recall that the last LSF/MM session on this topic was a bit unfortunate
> > > (IMHO not as productive as it could have been). Maybe we can finally reach a
> > > consensus on this.
> >
> > I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
> > > need to see a serious proposal - what we had at the last LSF was people
> > jumping in with half baked alternative proposals that very much hadn't
> > been thought through, and I see no need to repeat that.
> >
> > Like I mentioned, there's other work gated on this patchset; if people
> > want to hold this up for more discussion they better be putting forth
> > something to discuss.
>
> I'm thinking of ways on how to achieve Michal's request: "as long as there
> is a wider agreement from the MM community". If we can achieve that without
> LSF, great! (a bi-weekly MM meeting might also be an option)
A meeting wouldn't be out of the question, _if_ there is an agenda, but:
What's that coffee mug say? I just survived another meeting that
could've been an email? What exactly is the outcome we're looking for?
Is there info that people are looking for? I think we summed things up
pretty well in the cover letter; if there are specifics that people
want to discuss, that's why we emailed the series out.
There's people in this thread who've used this patchset in production
and diagnosed real issues (gigabytes of memory gone missing, I heard the
other day); I'm personally looking for them to chime in on this thread
(Johannes, Pasha).
If it's just grumbling about "maintenance overhead" we need to get past
- well, people are going to have to accept that we can't deliver
features without writing code, and I'm confident that the hooking in
particular is about as clean as it's going to get, _regardless_ of
toolchain support; and moreover it addresses what's been historically a
pretty gaping hole in our ability to profile and understand the code we
write.
On Tue, Feb 13, 2024 at 2:29 PM Darrick J. Wong <[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 05:01:19PM -0800, Suren Baghdasaryan wrote:
> > On Mon, Feb 12, 2024 at 2:40 PM Kees Cook <[email protected]> wrote:
> > >
> > > On Mon, Feb 12, 2024 at 01:38:59PM -0800, Suren Baghdasaryan wrote:
> > > > Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily
> > > > instrument memory allocators. It registers an "alloc_tags" codetag type
> > > > with /proc/allocinfo interface to output allocation tag information when
> > >
> > > Please don't add anything new to the top-level /proc directory. This
> > > should likely live in /sys.
> >
> > Ack. I'll find a more appropriate place for it then.
> > It just seemed like such generic information which would belong next
> > to meminfo/zoneinfo and such...
>
> Save yourself a cycle of "rework the whole fs interface only to have
> someone else tell you no" and put it in debugfs, not sysfs. Wrangling
> with debugfs is easier than all the macro-happy sysfs stuff; you don't
> have to integrate with the "device" model; and there is no 'one value
> per file' rule.
Thanks for the input. This file used to be in debugfs but reviewers
felt it belonged in /proc if it's to be used in production
environments. Some distros (like Android) disable debugfs in
production.
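For anyone consuming this file from userspace, here is a minimal sketch of
parsing one line of it. It assumes the "size calls file:line module:name
func:name" layout shown in the cover letter's example output; the struct and
function names are hypothetical, and the layout may still change along with
the file's final location.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace view of one allocinfo line. */
struct allocinfo_entry {
	char size[16];            /* human-readable size, e.g. "3.11MiB" */
	unsigned long long calls; /* second column: the per-callsite call counter */
	char file[256];           /* source file of the callsite */
	unsigned int line;        /* line number of the callsite */
	char module[64];
	char func[128];
};

/* Returns 0 on success, -1 if the text does not match the assumed layout. */
int parse_allocinfo_line(const char *text, struct allocinfo_entry *out)
{
	int n;

	assert(text && out);
	memset(out, 0, sizeof(*out));

	/* Field widths are one less than the buffer sizes to leave room for NUL. */
	n = sscanf(text, "%15s %llu %255[^:]:%u module:%63s func:%127s",
		   out->size, &out->calls, out->file, &out->line,
		   out->module, out->func);

	return n == 6 ? 0 : -1;
}
```

A caller would read the file line by line (for example with fgets()) and pass
each line to parse_allocinfo_line(), skipping lines for which it returns -1.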
>
> --D
>
> > >
> > > > the feature is enabled.
> > > > CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
> > > > allocation profiling instrumentation.
> > > > Memory allocation profiling can be enabled or disabled at runtime using
> > > > /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n.
> > > > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation
> > > > profiling by default.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > Co-developed-by: Kent Overstreet <[email protected]>
> > > > Signed-off-by: Kent Overstreet <[email protected]>
> > > > ---
> > > > Documentation/admin-guide/sysctl/vm.rst | 16 +++
> > > > Documentation/filesystems/proc.rst | 28 +++++
> > > > include/asm-generic/codetag.lds.h | 14 +++
> > > > include/asm-generic/vmlinux.lds.h | 3 +
> > > > include/linux/alloc_tag.h | 133 ++++++++++++++++++++
> > > > include/linux/sched.h | 24 ++++
> > > > lib/Kconfig.debug | 25 ++++
> > > > lib/Makefile | 2 +
> > > > lib/alloc_tag.c | 158 ++++++++++++++++++++++++
> > > > scripts/module.lds.S | 7 ++
> > > > 10 files changed, 410 insertions(+)
> > > > create mode 100644 include/asm-generic/codetag.lds.h
> > > > create mode 100644 include/linux/alloc_tag.h
> > > > create mode 100644 lib/alloc_tag.c
> > > >
> > > > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > > > index c59889de122b..a214719492ea 100644
> > > > --- a/Documentation/admin-guide/sysctl/vm.rst
> > > > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > > > @@ -43,6 +43,7 @@ Currently, these files are in /proc/sys/vm:
> > > > - legacy_va_layout
> > > > - lowmem_reserve_ratio
> > > > - max_map_count
> > > > +- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y)
> > > > - memory_failure_early_kill
> > > > - memory_failure_recovery
> > > > - min_free_kbytes
> > > > @@ -425,6 +426,21 @@ e.g., up to one or two maps per allocation.
> > > > The default value is 65530.
> > > >
> > > >
> > > > +mem_profiling
> > > > +==============
> > > > +
> > > > +Enable memory profiling (when CONFIG_MEM_ALLOC_PROFILING=y)
> > > > +
> > > > +1: Enable memory profiling.
> > > > +
> > > > > +0: Disable memory profiling.
> > > > +
> > > > +Enabling memory profiling introduces a small performance overhead for all
> > > > +memory allocations.
> > > > +
> > > > +The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT.
> > > > +
> > > > +
> > > > memory_failure_early_kill:
> > > > ==========================
> > > >
> > > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > > index 104c6d047d9b..40d6d18308e4 100644
> > > > --- a/Documentation/filesystems/proc.rst
> > > > +++ b/Documentation/filesystems/proc.rst
> > > > @@ -688,6 +688,7 @@ files are there, and which are missing.
> > > > ============ ===============================================================
> > > > File Content
> > > > ============ ===============================================================
> > > > + allocinfo Memory allocations profiling information
> > > > apm Advanced power management info
> > > > bootconfig Kernel command line obtained from boot config,
> > > > and, if there were kernel parameters from the
> > > > @@ -953,6 +954,33 @@ also be allocatable although a lot of filesystem metadata may have to be
> > > > reclaimed to achieve this.
> > > >
> > > >
> > > > +allocinfo
> > > > +~~~~~~~
> > > > +
> > > > +Provides information about memory allocations at all locations in the code
> > > > +base. Each allocation in the code is identified by its source file, line
> > > > +number, module and the function calling the allocation. The number of bytes
> > > > +allocated at each location is reported.
> > > > +
> > > > +Example output.
> > > > +
> > > > +::
> > > > +
> > > > + > cat /proc/allocinfo
> > > > +
> > > > + 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
> > > > + 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
> > > > + 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
> > > > + 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
> > > > + 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
> > > > + 1.16MiB fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
> > > > + 1.00MiB mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
> > > > + 734KiB fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
> > > > + 640KiB kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
> > > > + 640KiB drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf
> > > > + ...
> > > > +
> > > > +
> > > > meminfo
> > > > ~~~~~~~
> > > >
> > > > diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
> > > > new file mode 100644
> > > > index 000000000000..64f536b80380
> > > > --- /dev/null
> > > > +++ b/include/asm-generic/codetag.lds.h
> > > > @@ -0,0 +1,14 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > > +#ifndef __ASM_GENERIC_CODETAG_LDS_H
> > > > +#define __ASM_GENERIC_CODETAG_LDS_H
> > > > +
> > > > +#define SECTION_WITH_BOUNDARIES(_name) \
> > > > + . = ALIGN(8); \
> > > > + __start_##_name = .; \
> > > > + KEEP(*(_name)) \
> > > > + __stop_##_name = .;
> > > > +
> > > > +#define CODETAG_SECTIONS() \
> > > > + SECTION_WITH_BOUNDARIES(alloc_tags)
> > > > +
> > > > +#endif /* __ASM_GENERIC_CODETAG_LDS_H */
> > > > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> > > > index 5dd3a61d673d..c9997dc50c50 100644
> > > > --- a/include/asm-generic/vmlinux.lds.h
> > > > +++ b/include/asm-generic/vmlinux.lds.h
> > > > @@ -50,6 +50,8 @@
> > > > * [__nosave_begin, __nosave_end] for the nosave data
> > > > */
> > > >
> > > > +#include <asm-generic/codetag.lds.h>
> > > > +
> > > > #ifndef LOAD_OFFSET
> > > > #define LOAD_OFFSET 0
> > > > #endif
> > > > @@ -366,6 +368,7 @@
> > > > . = ALIGN(8); \
> > > > BOUNDED_SECTION_BY(__dyndbg_classes, ___dyndbg_classes) \
> > > > BOUNDED_SECTION_BY(__dyndbg, ___dyndbg) \
> > > > + CODETAG_SECTIONS() \
> > > > LIKELY_PROFILE() \
> > > > BRANCH_PROFILE() \
> > > > TRACE_PRINTKS() \
> > > > diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
> > > > new file mode 100644
> > > > index 000000000000..cf55a149fa84
> > > > --- /dev/null
> > > > +++ b/include/linux/alloc_tag.h
> > > > @@ -0,0 +1,133 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +/*
> > > > + * allocation tagging
> > > > + */
> > > > +#ifndef _LINUX_ALLOC_TAG_H
> > > > +#define _LINUX_ALLOC_TAG_H
> > > > +
> > > > +#include <linux/bug.h>
> > > > +#include <linux/codetag.h>
> > > > +#include <linux/container_of.h>
> > > > +#include <linux/preempt.h>
> > > > +#include <asm/percpu.h>
> > > > +#include <linux/cpumask.h>
> > > > +#include <linux/static_key.h>
> > > > +
> > > > +struct alloc_tag_counters {
> > > > + u64 bytes;
> > > > + u64 calls;
> > > > +};
> > > > +
> > > > +/*
> > > > + * An instance of this structure is created in a special ELF section at every
> > > > + * allocation callsite. At runtime, the special section is treated as
> > > > + * an array of these. Embedded codetag utilizes codetag framework.
> > > > + */
> > > > +struct alloc_tag {
> > > > + struct codetag ct;
> > > > + struct alloc_tag_counters __percpu *counters;
> > > > +} __aligned(8);
> > > > +
> > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > > +
> > > > +static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
> > > > +{
> > > > + return container_of(ct, struct alloc_tag, ct);
> > > > +}
> > > > +
> > > > +#ifdef ARCH_NEEDS_WEAK_PER_CPU
> > > > +/*
> > > > + * When percpu variables are required to be defined as weak, static percpu
> > > > + * variables can't be used inside a function (see comments for DECLARE_PER_CPU_SECTION).
> > > > + */
> > > > +#error "Memory allocation profiling is incompatible with ARCH_NEEDS_WEAK_PER_CPU"
> > >
> > > Is this enforced via Kconfig as well? (Looks like only alpha and s390?)
> >
> > Unfortunately ARCH_NEEDS_WEAK_PER_CPU is not a Kconfig option but
> > CONFIG_DEBUG_FORCE_WEAK_PER_CPU is, so that one is handled via Kconfig
> > (see "depends on !DEBUG_FORCE_WEAK_PER_CPU" in this patch). We have to
> > avoid both cases because of this:
> > https://elixir.bootlin.com/linux/latest/source/include/linux/percpu-defs.h#L75,
> > so I'm trying to provide an informative error here.
> >
> > >
> > > > +#endif
> > > > +
> > > > +#define DEFINE_ALLOC_TAG(_alloc_tag, _old) \
> > > > + static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr); \
> > > > + static struct alloc_tag _alloc_tag __used __aligned(8) \
> > > > + __section("alloc_tags") = { \
> > > > + .ct = CODE_TAG_INIT, \
> > > > + .counters = &_alloc_tag_cntr }; \
> > > > + struct alloc_tag * __maybe_unused _old = alloc_tag_save(&_alloc_tag)
> > > > +
> > > > +DECLARE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
> > > > + mem_alloc_profiling_key);
> > > > +
> > > > +static inline bool mem_alloc_profiling_enabled(void)
> > > > +{
> > > > + return static_branch_maybe(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
> > > > + &mem_alloc_profiling_key);
> > > > +}
> > > > +
> > > > +static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
> > > > +{
> > > > + struct alloc_tag_counters v = { 0, 0 };
> > > > + struct alloc_tag_counters *counter;
> > > > + int cpu;
> > > > +
> > > > + for_each_possible_cpu(cpu) {
> > > > + counter = per_cpu_ptr(tag->counters, cpu);
> > > > + v.bytes += counter->bytes;
> > > > + v.calls += counter->calls;
> > > > + }
> > > > +
> > > > + return v;
> > > > +}
> > > > +
> > > > +static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes)
> > > > +{
> > > > + struct alloc_tag *tag;
> > > > +
> > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > > > + WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n");
> > > > +#endif
> > > > + if (!ref || !ref->ct)
> > > > + return;
> > > > +
> > > > + tag = ct_to_alloc_tag(ref->ct);
> > > > +
> > > > + this_cpu_sub(tag->counters->bytes, bytes);
> > > > + this_cpu_dec(tag->counters->calls);
> > > > +
> > > > + ref->ct = NULL;
> > > > +}
> > > > +
> > > > +static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
> > > > +{
> > > > + __alloc_tag_sub(ref, bytes);
> > > > +}
> > > > +
> > > > +static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes)
> > > > +{
> > > > + __alloc_tag_sub(ref, bytes);
> > > > +}
> > > > +
> > > > +static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
> > > > +{
> > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > > > + WARN_ONCE(ref && ref->ct,
> > > > + "alloc_tag was not cleared (got tag for %s:%u)\n",\
> > > > + ref->ct->filename, ref->ct->lineno);
> > > > +
> > > > > + WARN_ONCE(!tag, "current->alloc_tag not set\n");
> > > > +#endif
> > > > + if (!ref || !tag)
> > > > + return;
> > > > +
> > > > + ref->ct = &tag->ct;
> > > > + this_cpu_add(tag->counters->bytes, bytes);
> > > > + this_cpu_inc(tag->counters->calls);
> > > > +}
> > > > +
> > > > +#else
> > > > +
> > > > +#define DEFINE_ALLOC_TAG(_alloc_tag, _old)
> > > > +static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
> > > > +static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
> > > > +static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
> > > > + size_t bytes) {}
> > > > +
> > > > +#endif
> > > > +
> > > > +#endif /* _LINUX_ALLOC_TAG_H */
> > > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > > index ffe8f618ab86..da68a10517c8 100644
> > > > --- a/include/linux/sched.h
> > > > +++ b/include/linux/sched.h
> > > > @@ -770,6 +770,10 @@ struct task_struct {
> > > > unsigned int flags;
> > > > unsigned int ptrace;
> > > >
> > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > > + struct alloc_tag *alloc_tag;
> > > > +#endif
> > >
> > > Normally scheduling is very sensitive to having anything early in
> > > > task_struct. I would suggest moving this to the CONFIG_SCHED_CORE ifdef
> > > area.
> >
> > Thanks for the warning! We will look into that.
> >
> > >
> > > > +
> > > > #ifdef CONFIG_SMP
> > > > int on_cpu;
> > > > struct __call_single_node wake_entry;
> > > > @@ -810,6 +814,7 @@ struct task_struct {
> > > > struct task_group *sched_task_group;
> > > > #endif
> > > >
> > > > +
> > > > #ifdef CONFIG_UCLAMP_TASK
> > > > /*
> > > > * Clamp values requested for a scheduling entity.
> > > > @@ -2183,4 +2188,23 @@ static inline int sched_core_idle_cpu(int cpu) { return idle_cpu(cpu); }
> > > >
> > > > extern void sched_set_stop_task(int cpu, struct task_struct *stop);
> > > >
> > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > > +static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag)
> > > > +{
> > > > + swap(current->alloc_tag, tag);
> > > > + return tag;
> > > > +}
> > > > +
> > > > +static inline void alloc_tag_restore(struct alloc_tag *tag, struct alloc_tag *old)
> > > > +{
> > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > > > + WARN(current->alloc_tag != tag, "current->alloc_tag was changed:\n");
> > > > +#endif
> > > > + current->alloc_tag = old;
> > > > +}
> > > > +#else
> > > > +static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag) { return NULL; }
> > > > +#define alloc_tag_restore(_tag, _old)
> > > > +#endif
> > > > +
> > > > #endif
> > > > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > > > index 0be2d00c3696..78d258ca508f 100644
> > > > --- a/lib/Kconfig.debug
> > > > +++ b/lib/Kconfig.debug
> > > > @@ -972,6 +972,31 @@ config CODE_TAGGING
> > > > bool
> > > > select KALLSYMS
> > > >
> > > > +config MEM_ALLOC_PROFILING
> > > > + bool "Enable memory allocation profiling"
> > > > + default n
> > > > + depends on PROC_FS
> > > > + depends on !DEBUG_FORCE_WEAK_PER_CPU
> > > > + select CODE_TAGGING
> > > > + help
> > > > + Track allocation source code and record total allocation size
> > > > + initiated at that code location. The mechanism can be used to track
> > > > + memory leaks with a low performance and memory impact.
> > > > +
> > > > +config MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> > > > + bool "Enable memory allocation profiling by default"
> > > > + default y
> > > > + depends on MEM_ALLOC_PROFILING
> > > > +
> > > > +config MEM_ALLOC_PROFILING_DEBUG
> > > > + bool "Memory allocation profiler debugging"
> > > > + default n
> > > > + depends on MEM_ALLOC_PROFILING
> > > > + select MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> > > > + help
> > > > + Adds warnings with helpful error messages for memory allocation
> > > > + profiling.
> > > > +
> > > > source "lib/Kconfig.kasan"
> > > > source "lib/Kconfig.kfence"
> > > > source "lib/Kconfig.kmsan"
> > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > index 6b48b22fdfac..859112f09bf5 100644
> > > > --- a/lib/Makefile
> > > > +++ b/lib/Makefile
> > > > @@ -236,6 +236,8 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
> > > > obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> > > >
> > > > obj-$(CONFIG_CODE_TAGGING) += codetag.o
> > > > +obj-$(CONFIG_MEM_ALLOC_PROFILING) += alloc_tag.o
> > > > +
> > > > lib-$(CONFIG_GENERIC_BUG) += bug.o
> > > >
> > > > obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
> > > > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > > > new file mode 100644
> > > > index 000000000000..4fc031f9cefd
> > > > --- /dev/null
> > > > +++ b/lib/alloc_tag.c
> > > > @@ -0,0 +1,158 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > +#include <linux/alloc_tag.h>
> > > > +#include <linux/fs.h>
> > > > +#include <linux/gfp.h>
> > > > +#include <linux/module.h>
> > > > +#include <linux/proc_fs.h>
> > > > +#include <linux/seq_buf.h>
> > > > +#include <linux/seq_file.h>
> > > > +
> > > > +static struct codetag_type *alloc_tag_cttype;
> > > > +
> > > > +DEFINE_STATIC_KEY_MAYBE(CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT,
> > > > + mem_alloc_profiling_key);
> > > > +
> > > > +static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> > > > +{
> > > > + struct codetag_iterator *iter;
> > > > + struct codetag *ct;
> > > > + loff_t node = *pos;
> > > > +
> > > > + iter = kzalloc(sizeof(*iter), GFP_KERNEL);
> > > > + m->private = iter;
> > > > + if (!iter)
> > > > + return NULL;
> > > > +
> > > > + codetag_lock_module_list(alloc_tag_cttype, true);
> > > > + *iter = codetag_get_ct_iter(alloc_tag_cttype);
> > > > + while ((ct = codetag_next_ct(iter)) != NULL && node)
> > > > + node--;
> > > > +
> > > > + return ct ? iter : NULL;
> > > > +}
> > > > +
> > > > +static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
> > > > +{
> > > > + struct codetag_iterator *iter = (struct codetag_iterator *)arg;
> > > > + struct codetag *ct = codetag_next_ct(iter);
> > > > +
> > > > + (*pos)++;
> > > > + if (!ct)
> > > > + return NULL;
> > > > +
> > > > + return iter;
> > > > +}
> > > > +
> > > > +static void allocinfo_stop(struct seq_file *m, void *arg)
> > > > +{
> > > > + struct codetag_iterator *iter = (struct codetag_iterator *)m->private;
> > > > +
> > > > + if (iter) {
> > > > + codetag_lock_module_list(alloc_tag_cttype, false);
> > > > + kfree(iter);
> > > > + }
> > > > +}
> > > > +
> > > > +static void alloc_tag_to_text(struct seq_buf *out, struct codetag *ct)
> > > > +{
> > > > + struct alloc_tag *tag = ct_to_alloc_tag(ct);
> > > > + struct alloc_tag_counters counter = alloc_tag_read(tag);
> > > > + s64 bytes = counter.bytes;
> > > > + char val[10], *p = val;
> > > > +
> > > > + if (bytes < 0) {
> > > > + *p++ = '-';
> > > > + bytes = -bytes;
> > > > + }
> > > > +
> > > > + string_get_size(bytes, 1,
> > > > + STRING_SIZE_BASE2|STRING_SIZE_NOSPACE,
> > > > + p, val + ARRAY_SIZE(val) - p);
> > > > +
> > > > + seq_buf_printf(out, "%8s %8llu ", val, counter.calls);
> > > > + codetag_to_text(out, ct);
> > > > + seq_buf_putc(out, ' ');
> > > > + seq_buf_putc(out, '\n');
> > > > +}
> > >
> > > /me does happy seq_buf dance!
> > >
> > > > +
> > > > +static int allocinfo_show(struct seq_file *m, void *arg)
> > > > +{
> > > > + struct codetag_iterator *iter = (struct codetag_iterator *)arg;
> > > > + char *bufp;
> > > > + size_t n = seq_get_buf(m, &bufp);
> > > > + struct seq_buf buf;
> > > > +
> > > > + seq_buf_init(&buf, bufp, n);
> > > > + alloc_tag_to_text(&buf, iter->ct);
> > > > + seq_commit(m, seq_buf_used(&buf));
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static const struct seq_operations allocinfo_seq_op = {
> > > > + .start = allocinfo_start,
> > > > + .next = allocinfo_next,
> > > > + .stop = allocinfo_stop,
> > > > + .show = allocinfo_show,
> > > > +};
> > > > +
> > > > +static void __init procfs_init(void)
> > > > +{
> > > > + proc_create_seq("allocinfo", 0444, NULL, &allocinfo_seq_op);
> > > > +}
> > >
> > > As mentioned, this really should be in /sys somewhere.
> >
> > Ack.
> >
> > >
> > > > +
> > > > +static bool alloc_tag_module_unload(struct codetag_type *cttype,
> > > > + struct codetag_module *cmod)
> > > > +{
> > > > + struct codetag_iterator iter = codetag_get_ct_iter(cttype);
> > > > + struct alloc_tag_counters counter;
> > > > + bool module_unused = true;
> > > > + struct alloc_tag *tag;
> > > > + struct codetag *ct;
> > > > +
> > > > + for (ct = codetag_next_ct(&iter); ct; ct = codetag_next_ct(&iter)) {
> > > > + if (iter.cmod != cmod)
> > > > + continue;
> > > > +
> > > > + tag = ct_to_alloc_tag(ct);
> > > > + counter = alloc_tag_read(tag);
> > > > +
> > > > + if (WARN(counter.bytes, "%s:%u module %s func:%s has %llu allocated at module unload",
> > > > + ct->filename, ct->lineno, ct->modname, ct->function, counter.bytes))
> > > > + module_unused = false;
> > > > + }
> > > > +
> > > > + return module_unused;
> > > > +}
> > > > +
> > > > +static struct ctl_table memory_allocation_profiling_sysctls[] = {
> > > > + {
> > > > + .procname = "mem_profiling",
> > > > + .data = &mem_alloc_profiling_key,
> > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > > > + .mode = 0444,
> > > > +#else
> > > > + .mode = 0644,
> > > > +#endif
> > > > + .proc_handler = proc_do_static_key,
> > > > + },
> > > > + { }
> > > > +};
> > > > +
> > > > +static int __init alloc_tag_init(void)
> > > > +{
> > > > + const struct codetag_type_desc desc = {
> > > > + .section = "alloc_tags",
> > > > + .tag_size = sizeof(struct alloc_tag),
> > > > + .module_unload = alloc_tag_module_unload,
> > > > + };
> > > > +
> > > > + alloc_tag_cttype = codetag_register_type(&desc);
> > > > > + if (IS_ERR(alloc_tag_cttype))
> > > > + return PTR_ERR(alloc_tag_cttype);
> > > > +
> > > > + register_sysctl_init("vm", memory_allocation_profiling_sysctls);
> > > > + procfs_init();
> > > > +
> > > > + return 0;
> > > > +}
> > > > +module_init(alloc_tag_init);
> > > > diff --git a/scripts/module.lds.S b/scripts/module.lds.S
> > > > index bf5bcf2836d8..45c67a0994f3 100644
> > > > --- a/scripts/module.lds.S
> > > > +++ b/scripts/module.lds.S
> > > > @@ -9,6 +9,8 @@
> > > > #define DISCARD_EH_FRAME *(.eh_frame)
> > > > #endif
> > > >
> > > > +#include <asm-generic/codetag.lds.h>
> > > > +
> > > > SECTIONS {
> > > > /DISCARD/ : {
> > > > *(.discard)
> > > > @@ -47,12 +49,17 @@ SECTIONS {
> > > > .data : {
> > > > *(.data .data.[0-9a-zA-Z_]*)
> > > > *(.data..L*)
> > > > + CODETAG_SECTIONS()
> > > > }
> > > >
> > > > .rodata : {
> > > > *(.rodata .rodata.[0-9a-zA-Z_]*)
> > > > *(.rodata..L*)
> > > > }
> > > > +#else
> > > > + .data : {
> > > > + CODETAG_SECTIONS()
> > > > + }
> > > > #endif
> > > > }
> > >
> > > Otherwise, looks good.
> >
> > Thanks!
> >
> > >
> > > --
> > > Kees Cook
> >
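The counter scheme in the quoted alloc_tag.h can be modeled in userspace. This
is a minimal sketch, assuming a fixed number of CPUs and passing the CPU index
explicitly (the patch uses this_cpu_add()/this_cpu_sub() on per-CPU data
instead); all names here are hypothetical. It illustrates why freeing on a
different CPU than the one that allocated is fine: an individual per-CPU slot
may wrap around, but the sum over all CPUs still comes out right.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4 /* assumption: fixed CPU count for this model */

struct sketch_counters {
	uint64_t bytes;
	uint64_t calls;
};

/* One tag per allocation callsite, with one counter slot per CPU. */
struct sketch_tag {
	struct sketch_counters percpu[SKETCH_NR_CPUS];
};

/* Account an allocation of 'bytes' made while running on 'cpu'. */
void sketch_tag_add(struct sketch_tag *tag, int cpu, uint64_t bytes)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	tag->percpu[cpu].bytes += bytes;
	tag->percpu[cpu].calls++;
}

/* Account a free of 'bytes' performed while running on 'cpu'. */
void sketch_tag_sub(struct sketch_tag *tag, int cpu, uint64_t bytes)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	/* Unsigned wraparound is well defined; only the sum is meaningful. */
	tag->percpu[cpu].bytes -= bytes;
	tag->percpu[cpu].calls--;
}

/* Sum all per-CPU slots, as alloc_tag_read() does in the quoted patch. */
struct sketch_counters sketch_tag_read(const struct sketch_tag *tag)
{
	struct sketch_counters v = { 0, 0 };
	int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		v.bytes += tag->percpu[cpu].bytes;
		v.calls += tag->percpu[cpu].calls;
	}
	return v;
}
```

The update path touches only the local slot, so it needs no shared cache line
or lock; the cost is moved to the reader, which walks every CPU.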
On Tue, Feb 13, 2024 at 02:35:29PM -0800, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 2:29 PM Darrick J. Wong <[email protected]> wrote:
> >
> > On Mon, Feb 12, 2024 at 05:01:19PM -0800, Suren Baghdasaryan wrote:
> > > On Mon, Feb 12, 2024 at 2:40 PM Kees Cook <[email protected]> wrote:
> > > >
> > > > On Mon, Feb 12, 2024 at 01:38:59PM -0800, Suren Baghdasaryan wrote:
> > > > > Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily
> > > > > instrument memory allocators. It registers an "alloc_tags" codetag type
> > > > > with /proc/allocinfo interface to output allocation tag information when
> > > >
> > > > Please don't add anything new to the top-level /proc directory. This
> > > > should likely live in /sys.
> > >
> > > Ack. I'll find a more appropriate place for it then.
> > > It just seemed like such generic information which would belong next
> > > to meminfo/zoneinfo and such...
> >
> > Save yourself a cycle of "rework the whole fs interface only to have
> > someone else tell you no" and put it in debugfs, not sysfs. Wrangling
> > with debugfs is easier than all the macro-happy sysfs stuff; you don't
> > have to integrate with the "device" model; and there is no 'one value
> > per file' rule.
>
> Thanks for the input. This file used to be in debugfs but reviewers
> felt it belonged in /proc if it's to be used in production
> environments. Some distros (like Android) disable debugfs in
> production.
FWIW, I agree debugfs is not right. If others feel it's right in /proc,
I certainly won't NAK -- it's just been that we've traditionally been
trying to avoid continuing to pollute the top-level /proc and instead
associate new things with something in /sys.
--
Kees Cook
On 13.02.24 23:30, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 2:17 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 13.02.24 23:09, Kent Overstreet wrote:
>>> On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
>>>> On 13.02.24 22:58, Suren Baghdasaryan wrote:
>>>>> On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
>>>>>>
>>>>>> On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
>>>>>> [...]
>>>>>>> We're aiming to get this in the next merge window, for 6.9. The feedback
>>>>>>> we've gotten has been that even out of tree this patchset has already
>>>>>>> been useful, and there's a significant amount of other work gated on the
>>>>>>> code tagging functionality included in this patchset [2].
>>>>>>
>>>>>> I suspect it will not come as a surprise that I really dislike the
>>>>>> implementation proposed here. I will not repeat my arguments, I have
>>>>>> done so on several occasions already.
>>>>>>
>>>>>> Anyway, I didn't go as far as to nak it even though I _strongly_ believe
>>>>>> this debugging feature will add a maintenance overhead for a very long
>>>>>> time. I can live with all the downsides of the proposed implementation
>>>>>> _as long as_ there is a wider agreement from the MM community as this is
>>>>>> where the maintenance cost will be paid. So far I have not seen (m)any
>>>>>> acks by MM developers so aiming into the next merge window is more than
>>>>>> a little rushed.
>>>>>
>>>>> We tried other previously proposed approaches and all have their
>>>>> downsides without making maintenance much easier. Your position is
>>>>> understandable and I think it's fair. Let's see if others see more
>>>>> benefit than cost here.
>>>>
>>>> Would it make sense to discuss that at LSF/MM once again, especially
>>>> covering why proposed alternatives did not work out? LSF/MM is not "too far"
>>>> away (May).
>>>>
>>>> I recall that the last LSF/MM session on this topic was a bit unfortunate
>>>> (IMHO not as productive as it could have been). Maybe we can finally reach a
>>>> consensus on this.
>>>
>>> I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
>>> need to see a serious proposal - what we had at the last LSF was people
>>> jumping in with half baked alternative proposals that very much hadn't
>>> been thought through, and I see no need to repeat that.
>>>
>>> Like I mentioned, there's other work gated on this patchset; if people
>>> want to hold this up for more discussion they better be putting forth
>>> something to discuss.
>>
>> I'm thinking of ways on how to achieve Michal's request: "as long as
>> there is a wider agreement from the MM community". If we can achieve
>> that without LSF, great! (a bi-weekly MM meeting might also be an option)
>
> There will be a maintenance burden even with the cleanest proposed
> approach.
Yes.
> We worked hard to make the patchset as clean as possible and
> if benefits still don't outweigh the maintenance cost then we should
> probably stop trying.
Indeed.
> At LSF/MM I would rather discuss functional
> issues/requirements/improvements than alternative approaches to
> instrument allocators.
> I'm happy to arrange a separate meeting with MM folks if that would
> help to progress on the cost/benefit decision.
Note that I am only proposing ways forward.
If you think you can easily achieve what Michal requested without all
that, good.
My past experience was that LSF/MM / bi-weekly MM meetings were
extremely helpful to reach consensus.
--
Cheers,
David / dhildenb
On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
<[email protected]> wrote:
>
> On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
> > On 13.02.24 23:30, Suren Baghdasaryan wrote:
> > > On Tue, Feb 13, 2024 at 2:17 PM David Hildenbrand <[email protected]> wrote:
> > > >
> > > > On 13.02.24 23:09, Kent Overstreet wrote:
> > > > > On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
> > > > > > On 13.02.24 22:58, Suren Baghdasaryan wrote:
> > > > > > > On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
> > > > > > > > [...]
> > > > > > > > > We're aiming to get this in the next merge window, for 6.9. The feedback
> > > > > > > > > we've gotten has been that even out of tree this patchset has already
> > > > > > > > > been useful, and there's a significant amount of other work gated on the
> > > > > > > > > code tagging functionality included in this patchset [2].
> > > > > > > >
> > > > > > > > I suspect it will not come as a surprise that I really dislike the
> > > > > > > > implementation proposed here. I will not repeat my arguments, I have
> > > > > > > > done so on several occasions already.
> > > > > > > >
> > > > > > > > Anyway, I didn't go as far as to nak it even though I _strongly_ believe
> > > > > > > > this debugging feature will add a maintenance overhead for a very long
> > > > > > > > time. I can live with all the downsides of the proposed implementation
> > > > > > > > _as long as_ there is a wider agreement from the MM community as this is
> > > > > > > > > where the maintenance cost will be paid. So far I have not seen (m)any
> > > > > > > > > acks by MM developers so aiming into the next merge window is more than
> > > > > > > > > a little rushed.
> > > > > > >
> > > > > > > We tried other previously proposed approaches and all have their
> > > > > > > downsides without making maintenance much easier. Your position is
> > > > > > > understandable and I think it's fair. Let's see if others see more
> > > > > > > benefit than cost here.
> > > > > >
> > > > > > Would it make sense to discuss that at LSF/MM once again, especially
> > > > > > covering why proposed alternatives did not work out? LSF/MM is not "too far"
> > > > > > away (May).
> > > > > >
> > > > > > I recall that the last LSF/MM session on this topic was a bit unfortunate
> > > > > > (IMHO not as productive as it could have been). Maybe we can finally reach a
> > > > > > consensus on this.
> > > > >
> > > > > I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
> > > > > > need to see a serious proposal - what we had at the last LSF was people
> > > > > jumping in with half baked alternative proposals that very much hadn't
> > > > > been thought through, and I see no need to repeat that.
> > > > >
> > > > > Like I mentioned, there's other work gated on this patchset; if people
> > > > > want to hold this up for more discussion they better be putting forth
> > > > > something to discuss.
> > > >
> > > > I'm thinking of ways on how to achieve Michal's request: "as long as
> > > > there is a wider agreement from the MM community". If we can achieve
> > > > that without LSF, great! (a bi-weekly MM meeting might also be an option)
> > >
> > > There will be a maintenance burden even with the cleanest proposed
> > > approach.
> >
> > Yes.
> >
> > > We worked hard to make the patchset as clean as possible and
> > > if benefits still don't outweigh the maintenance cost then we should
> > > probably stop trying.
> >
> > Indeed.
> >
> > > At LSF/MM I would rather discuss functional
> > > issues/requirements/improvements than alternative approaches to
> > > instrument allocators.
> > > I'm happy to arrange a separate meeting with MM folks if that would
> > > help to progress on the cost/benefit decision.
> > Note that I am only proposing ways forward.
> >
> > If you think you can easily achieve what Michal requested without all that,
> > good.
>
> He requested something?
Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
possible until the compiler feature is developed and deployed. And it
still would require changes to the headers, so I don't think it's worth
delaying the feature for years.
On 13.02.24 23:59, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
> <[email protected]> wrote:
>>
>> On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
>>> On 13.02.24 23:30, Suren Baghdasaryan wrote:
>>>> On Tue, Feb 13, 2024 at 2:17 PM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 13.02.24 23:09, Kent Overstreet wrote:
>>>>>> On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
>>>>>>> On 13.02.24 22:58, Suren Baghdasaryan wrote:
>>>>>>>> On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
>>>>>>>>> [...]
>>>>>>>>>> We're aiming to get this in the next merge window, for 6.9. The feedback
>>>>>>>>>> we've gotten has been that even out of tree this patchset has already
>>>>>>>>>> been useful, and there's a significant amount of other work gated on the
>>>>>>>>>> code tagging functionality included in this patchset [2].
>>>>>>>>>
>>>>>>>>> I suspect it will not come as a surprise that I really dislike the
>>>>>>>>> implementation proposed here. I will not repeat my arguments, I have
>>>>>>>>> done so on several occasions already.
>>>>>>>>>
>>>>>>>>> Anyway, I didn't go as far as to nak it even though I _strongly_ believe
>>>>>>>>> this debugging feature will add a maintenance overhead for a very long
>>>>>>>>> time. I can live with all the downsides of the proposed implementation
>>>>>>>>> _as long as_ there is a wider agreement from the MM community as this is
>>>>>>>>> where the maintenance cost will be paid. So far I have not seen (m)any
>>>>>>>>> acks by MM developers so aiming into the next merge window is more than
>>>>>>>>> a little rushed.
>>>>>>>>
>>>>>>>> We tried other previously proposed approaches and all have their
>>>>>>>> downsides without making maintenance much easier. Your position is
>>>>>>>> understandable and I think it's fair. Let's see if others see more
>>>>>>>> benefit than cost here.
>>>>>>>
>>>>>>> Would it make sense to discuss that at LSF/MM once again, especially
>>>>>>> covering why proposed alternatives did not work out? LSF/MM is not "too far"
>>>>>>> away (May).
>>>>>>>
>>>>>>> I recall that the last LSF/MM session on this topic was a bit unfortunate
>>>>>>> (IMHO not as productive as it could have been). Maybe we can finally reach a
>>>>>>> consensus on this.
>>>>>>
>>>>>> I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
>>>>>> need to see a serious proposal - what we had at the last LSF was people
>>>>>> jumping in with half baked alternative proposals that very much hadn't
>>>>>> been thought through, and I see no need to repeat that.
>>>>>>
>>>>>> Like I mentioned, there's other work gated on this patchset; if people
>>>>>> want to hold this up for more discussion they better be putting forth
>>>>>> something to discuss.
>>>>>
>>>>> I'm thinking of ways on how to achieve Michal's request: "as long as
>>>>> there is a wider agreement from the MM community". If we can achieve
>>>>> that without LSF, great! (a bi-weekly MM meeting might also be an option)
>>>>
>>>> There will be a maintenance burden even with the cleanest proposed
>>>> approach.
>>>
>>> Yes.
>>>
>>>> We worked hard to make the patchset as clean as possible and
>>>> if benefits still don't outweigh the maintenance cost then we should
>>>> probably stop trying.
>>>
>>> Indeed.
>>>
>>>> At LSF/MM I would rather discuss functional
>>>> issues/requirements/improvements than alternative approaches to
>>>> instrument allocators.
>>>> I'm happy to arrange a separate meeting with MM folks if that would
>>>> help to progress on the cost/benefit decision.
>>> Note that I am only proposing ways forward.
>>>
>>> If you think you can easily achieve what Michal requested without all that,
>>> good.
>>
>> He requested something?
>
> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> possible until the compiler feature is developed and deployed. And it
> still would require changes to the headers, so don't think it's worth
> delaying the feature for years.
>
I was talking about this: "I can live with all the downsides of the
proposed implementation _as long as_ there is a wider agreement from the MM
community as this is where the maintenance cost will be paid. So far I
have not seen (m)any acks by MM developers".
I certainly cannot be motivated at this point to review and ack this;
unfortunately, there is too much negative energy around here.
--
Cheers,
David / dhildenb
On Tue, Feb 13, 2024 at 02:59:11PM -0800, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
> > > On 13.02.24 23:30, Suren Baghdasaryan wrote:
> > > > On Tue, Feb 13, 2024 at 2:17 PM David Hildenbrand <[email protected]> wrote:
> > > If you think you can easily achieve what Michal requested without all that,
> > > good.
> >
> > He requested something?
>
> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> possible until the compiler feature is developed and deployed. And it
> still would require changes to the headers, so don't think it's worth
> delaying the feature for years.
Hang on, let's look at the actual code.
This is what instrumenting an allocation function looks like:
#define krealloc_array(...) alloc_hooks(krealloc_array_noprof(__VA_ARGS__))
IOW, we have to:
- rename krealloc_array to krealloc_array_noprof
- replace krealloc_array with a single wrapper macro call
Is this really all we're getting worked up over?
The renaming we need regardless, because the thing that makes this
approach efficient enough to run in production is that we account at
_one_ point in the callstack, we don't save entire backtraces.
And thus we need to explicitly annotate which one that is; which means
we need _noprof() versions of functions for when the accounting is done
by an outer wrapper (e.g. mempool).
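
[Illustrative sketch, not code from the patchset.] The rename-plus-wrapper
pattern described above can be mimicked in userspace C. Everything below is an
assumption made for illustration: the fields of struct alloc_tag, the
current_tag global (a stand-in for per-task state), the last_charged helper,
and the buf_alloc/pool_alloc names are all hypothetical. The macro relies on
GNU C statement expressions and __typeof__ (GCC/Clang). It only shows the idea
of charging exactly one point in the call chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-callsite counter; the real structures differ. */
struct alloc_tag {
	const char *file;
	int line;
	size_t bytes;	/* bytes charged to this callsite */
	size_t calls;	/* allocations charged to this callsite */
};

/* Stand-in for a per-task "current tag" pointer. */
static struct alloc_tag *current_tag;
/* Demo helper: remembers which tag was charged most recently. */
static struct alloc_tag *last_charged;

/*
 * Wrap an allocation call: each expansion gets its own static tag,
 * which is made current for the duration of the call and then restored.
 */
#define alloc_hooks(call)						\
({									\
	static struct alloc_tag _tag = { __FILE__, __LINE__, 0, 0 };	\
	struct alloc_tag *_old = current_tag;				\
	current_tag = &_tag;						\
	__typeof__(call) _res = (call);					\
	current_tag = _old;						\
	_res;								\
})

/* Innermost allocator: charges whichever tag is current, if any. */
static void *buf_alloc_noprof(size_t size)
{
	if (current_tag) {
		current_tag->bytes += size;
		current_tag->calls++;
		last_charged = current_tag;
	}
	return malloc(size);
}
#define buf_alloc(...) alloc_hooks(buf_alloc_noprof(__VA_ARGS__))

/*
 * Outer wrapper (think mempool): it calls the _noprof variant internally,
 * so the charge lands on the wrapper's caller rather than on this line.
 */
static void *pool_alloc_noprof(size_t size)
{
	return buf_alloc_noprof(size);
}
#define pool_alloc(...) alloc_hooks(pool_alloc_noprof(__VA_ARGS__))

/* Two distinct callsites; each returns the tag that was charged. */
static struct alloc_tag *demo_direct(void)
{
	free(buf_alloc(16));
	return last_charged;
}

static struct alloc_tag *demo_pool(void)
{
	free(pool_alloc(32));
	return last_charged;
}
```

In this sketch, demo_direct() and demo_pool() end up charging two different
tags, one per expansion of alloc_hooks(). Because pool_alloc_noprof() calls
buf_alloc_noprof() rather than buf_alloc(), the bytes are attributed to the
pool's caller and no tag is created for the line inside the pool.
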
And, as I keep saying: that alloc_hooks() macro will also get us _per
callsite fault injection points_, and we really need that because - if
you guys have been paying attention to other threads - whenever moving
more stuff to PF_MEMALLOC_* flags comes up (including adding
PF_MEMALLOC_NORECLAIM), the issue of small allocations not failing and
not being testable keeps coming up.
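
[Illustrative sketch, not code from the patchset.] One way a wrapper macro
could double as a per-callsite fault injection point is shown below in
userspace C. This is purely hypothetical: struct fault_site, fault_set(),
small_alloc() and the two load_*() callers are invented for the example, sites
are registered lazily on first use (a simplification), and the macro again
uses GNU C statement expressions. It is not a description of any existing
kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-callsite fault-injection point. */
struct fault_site {
	const char *file;
	const char *func;	/* filled in on first use */
	int line;
	bool fail;		/* when set, this callsite's allocations fail */
	struct fault_site *next;
};

/* Sites that have executed at least once. */
static struct fault_site *fault_sites;

/*
 * Wrap an allocation call: register this callsite on first use, then
 * either fail immediately (returning NULL) or perform the real call.
 */
#define alloc_hooks(call)						\
({									\
	static struct fault_site _site = {				\
		.file = __FILE__, .line = __LINE__,			\
	};								\
	if (!_site.func) {						\
		_site.func = __func__;					\
		_site.next = fault_sites;				\
		fault_sites = &_site;					\
	}								\
	_site.fail ? NULL : (call);					\
})

static void *small_alloc_noprof(size_t size)
{
	return malloc(size);
}
#define small_alloc(...) alloc_hooks(small_alloc_noprof(__VA_ARGS__))

/* Toggle failure for every registered site in the named function. */
static int fault_set(const char *func, bool fail)
{
	int matched = 0;

	for (struct fault_site *s = fault_sites; s; s = s->next) {
		if (strcmp(s->func, func) == 0) {
			s->fail = fail;
			matched++;
		}
	}
	return matched;
}

/* Two callers whose small allocations would normally never fail. */
static int load_config(void)
{
	char *buf = small_alloc(8);

	if (!buf)
		return -1;	/* the error path we want to exercise */
	free(buf);
	return 0;
}

static int load_cache(void)
{
	char *buf = small_alloc(8);

	if (!buf)
		return -1;
	free(buf);
	return 0;
}
```

With this sketch, calling fault_set("load_cache", true) after both functions
have run once makes only load_cache() take its error path, while load_config()
keeps succeeding; that per-callsite granularity is the point being argued
above.
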
On Tue, Feb 13, 2024 at 05:29:03PM -0500, Kent Overstreet wrote:
> On Tue, Feb 13, 2024 at 11:17:32PM +0100, David Hildenbrand wrote:
> > On 13.02.24 23:09, Kent Overstreet wrote:
> > > On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
> > > > On 13.02.24 22:58, Suren Baghdasaryan wrote:
> > > > > On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
> > > > > > [...]
> > > > > > > We're aiming to get this in the next merge window, for 6.9. The feedback
> > > > > > > we've gotten has been that even out of tree this patchset has already
> > > > > > > been useful, and there's a significant amount of other work gated on the
> > > > > > > code tagging functionality included in this patchset [2].
> > > > > >
> > > > > > I suspect it will not come as a surprise that I really dislike the
> > > > > > implementation proposed here. I will not repeat my arguments, I have
> > > > > > done so on several occasions already.
> > > > > >
> > > > > > Anyway, I didn't go as far as to nak it even though I _strongly_ believe
> > > > > > this debugging feature will add a maintenance overhead for a very long
> > > > > > time. I can live with all the downsides of the proposed implementation
> > > > > > _as long as_ there is a wider agreement from the MM community as this is
> > > > > > > where the maintenance cost will be paid. So far I have not seen (m)any
> > > > > > > acks by MM developers so aiming into the next merge window is more than
> > > > > > > a little rushed.
> > > > >
> > > > > We tried other previously proposed approaches and all have their
> > > > > downsides without making maintenance much easier. Your position is
> > > > > understandable and I think it's fair. Let's see if others see more
> > > > > benefit than cost here.
> > > >
> > > > Would it make sense to discuss that at LSF/MM once again, especially
> > > > covering why proposed alternatives did not work out? LSF/MM is not "too far"
> > > > away (May).
You want to stall this effort for *three months* to schedule a meeting?
I would love to have better profiling of memory allocations inside XFS
so that we can answer questions like "What's going on with these
allocation stalls?" or "What memory is getting used, and where?" more
quickly than we can now.
Right now I get to stare at tracepoints and printk crap until my eyes
bleed, and maybe dchinner comes to my rescue and figures out what's
going on sooner than I do. More often we just never figure it out
because only the customer can reproduce the problem, the reams of data
produced by ftrace are unmanageable, and BPF isn't always available.
I'm not thrilled by the large increase in macro crap in the allocation
paths, but I don't know of a better way to instrument things. Our
attempts to use _RET_IP_ in XFS to instrument its code paths have never
worked quite right w.r.t. inlined functions and whatnot.
> > > > I recall that the last LSF/MM session on this topic was a bit unfortunate
> > > > (IMHO not as productive as it could have been). Maybe we can finally reach a
> > > > consensus on this.
From my outsider's perspective, nine months have gone by since the last
LSF. Who has come up with a cleaner/better/faster way to do what Suren
and Kent have done? Were those code changes integrated into this
patchset? Or why not?
Most of what I saw in 2023 involved compiler changes (great; now I have
to wait until RHEL 11/Debian 14 to use it) and/or still utilize fugly
macros.
Recalling all the way back to suggestions made in 2022, who wrote the
prototype for doing this via ftrace? Or BPF? How well did that go for
counting allocation events and the like? I saw Tejun saying something
about how they use BPF aggressively inside Meta, but that was about it.
Were any of those solutions significantly better than what's in front of
us here?
I get it, a giant patch forcing everyone to know the difference between
alloc_foo and alloc_foo_noprof offends my (yours?) stylistic
sensibilities. On the other side, making analysis easier during
customer escalations means we kernel people get data, answers, and
solutions sooner instead of watching all our time get eaten up on L4
support and backporting hell.
> > > I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
> > > need to see a serious proposal - what we had at the last LSF was people
> > > jumping in with half baked alternative proposals that very much hadn't
> > > been thought through, and I see no need to repeat that.
> > >
> > > Like I mentioned, there's other work gated on this patchset; if people
> > > want to hold this up for more discussion they better be putting forth
> > > something to discuss.
> >
> > I'm thinking of ways on how to achieve Michal's request: "as long as there
> > is a wider agreement from the MM community". If we can achieve that without
> > LSF, great! (a bi-weekly MM meeting might also be an option)
>
> A meeting wouldn't be out of the question, _if_ there is an agenda, but:
>
> What's that coffee mug say? I just survived another meeting that
> could've been an email?
I congratulate you on your memory of my kitchen mug. Yes, that's what
it says.
> What exactly is the outcome we're looking for?
>
> Is there info that people are looking for? I think we summed things up
> pretty well in the cover letter; if there are specifics that people
> want to discuss, that's why we emailed the series out.
>
> There's people in this thread who've used this patchset in production
> and diagnosed real issues (gigabytes of memory gone missing, I heard the
> other day); I'm personally looking for them to chime in on this thread
> (Johannes, Pasha).
>
> If it's just grumbling about "maintenance overhead" we need to get past
> - well, people are going to have to accept that we can't deliver
> features without writing code, and I'm confident that the hooking in
> particular is about as clean as it's going to get, _regardless_ of
> toolchain support; and moreover it addresses what's been historically a
> pretty gaping hole in our ability to profile and understand the code we
> write.
Are you and Suren willing to pay whatever maintenance overhead there is?
--D
On Wed, Feb 14, 2024 at 12:02:30AM +0100, David Hildenbrand wrote:
> On 13.02.24 23:59, Suren Baghdasaryan wrote:
> > On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
> > <[email protected]> wrote:
> > >
> > > On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
> > > > On 13.02.24 23:30, Suren Baghdasaryan wrote:
> > > > > On Tue, Feb 13, 2024 at 2:17 PM David Hildenbrand <[email protected]> wrote:
> > > > > >
> > > > > > On 13.02.24 23:09, Kent Overstreet wrote:
> > > > > > > On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
> > > > > > > > On 13.02.24 22:58, Suren Baghdasaryan wrote:
> > > > > > > > > On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
> > > > > > > > > > [...]
> > > > > > > > > > > We're aiming to get this in the next merge window, for 6.9. The feedback
> > > > > > > > > > > we've gotten has been that even out of tree this patchset has already
> > > > > > > > > > > been useful, and there's a significant amount of other work gated on the
> > > > > > > > > > > code tagging functionality included in this patchset [2].
> > > > > > > > > >
> > > > > > > > > > I suspect it will not come as a surprise that I really dislike the
> > > > > > > > > > implementation proposed here. I will not repeat my arguments, I have
> > > > > > > > > > done so on several occasions already.
> > > > > > > > > >
> > > > > > > > > > Anyway, I didn't go as far as to nak it even though I _strongly_ believe
> > > > > > > > > > this debugging feature will add a maintenance overhead for a very long
> > > > > > > > > > time. I can live with all the downsides of the proposed implementation
> > > > > > > > > > _as long as_ there is a wider agreement from the MM community as this is
> > > > > > > > > > > where the maintenance cost will be paid. So far I have not seen (m)any
> > > > > > > > > > > acks by MM developers so aiming into the next merge window is more than
> > > > > > > > > > > a little rushed.
> > > > > > > > >
> > > > > > > > > We tried other previously proposed approaches and all have their
> > > > > > > > > downsides without making maintenance much easier. Your position is
> > > > > > > > > understandable and I think it's fair. Let's see if others see more
> > > > > > > > > benefit than cost here.
> > > > > > > >
> > > > > > > > Would it make sense to discuss that at LSF/MM once again, especially
> > > > > > > > covering why proposed alternatives did not work out? LSF/MM is not "too far"
> > > > > > > > away (May).
> > > > > > > >
> > > > > > > > I recall that the last LSF/MM session on this topic was a bit unfortunate
> > > > > > > > (IMHO not as productive as it could have been). Maybe we can finally reach a
> > > > > > > > consensus on this.
> > > > > > >
> > > > > > > I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
> > > > > > > > need to see a serious proposal - what we had at the last LSF was people
> > > > > > > jumping in with half baked alternative proposals that very much hadn't
> > > > > > > been thought through, and I see no need to repeat that.
> > > > > > >
> > > > > > > Like I mentioned, there's other work gated on this patchset; if people
> > > > > > > want to hold this up for more discussion they better be putting forth
> > > > > > > something to discuss.
> > > > > >
> > > > > > I'm thinking of ways on how to achieve Michal's request: "as long as
> > > > > > there is a wider agreement from the MM community". If we can achieve
> > > > > > that without LSF, great! (a bi-weekly MM meeting might also be an option)
> > > > >
> > > > > There will be a maintenance burden even with the cleanest proposed
> > > > > approach.
> > > >
> > > > Yes.
> > > >
> > > > > We worked hard to make the patchset as clean as possible and
> > > > > if benefits still don't outweigh the maintenance cost then we should
> > > > > probably stop trying.
> > > >
> > > > Indeed.
> > > >
> > > > > At LSF/MM I would rather discuss functional
> > > > > issues/requirements/improvements than alternative approaches to
> > > > > instrument allocators.
> > > > > I'm happy to arrange a separate meeting with MM folks if that would
> > > > > help to progress on the cost/benefit decision.
> > > > Note that I am only proposing ways forward.
> > > >
> > > > If you think you can easily achieve what Michal requested without all that,
> > > > good.
> > >
> > > He requested something?
> >
> > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> > possible until the compiler feature is developed and deployed. And it
> > still would require changes to the headers, so don't think it's worth
> > delaying the feature for years.
> >
>
> I was talking about this: "I can live with all the downsides of the proposed
> implementation _as long as_ there is a wider agreement from the MM community as
> this is where the maintenance cost will be paid. So far I have not seen
> (m)any acks by MM developers".
>
> I certainly cannot be motivated at this point to review and ack this,
> unfortunately too much negative energy around here.
David, this kind of reaction is exactly why I was telling Andrew I was
going to submit this as a direct pull request to Linus.
This is an important feature; if we can't stay focused on the technical
and get it done that's what I'll do.
On Tue, 13 Feb 2024 14:38:16 -0800
Kees Cook <[email protected]> wrote:
> > > Save yourself a cycle of "rework the whole fs interface only to have
> > > someone else tell you no" and put it in debugfs, not sysfs. Wrangling
> > > with debugfs is easier than all the macro-happy sysfs stuff; you don't
> > > have to integrate with the "device" model; and there is no 'one value
> > > per file' rule.
> >
> > Thanks for the input. This file used to be in debugfs but reviewers
> > felt it belonged in /proc if it's to be used in production
> > environments. Some distros (like Android) disable debugfs in
> > production.
>
> FWIW, I agree debugfs is not right. If others feel it's right in /proc,
> I certainly won't NAK -- it's just been that we've traditionally been
> trying to avoid continuing to pollute the top-level /proc and instead
> associate new things with something in /sys.
You can create your own file system, but I would suggest using kernfs for it ;-)
If you look in /sys/kernel/ you'll see a bunch of kernel file systems already there:
~# mount |grep kernel
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
-- Steve
On 14.02.24 00:12, Kent Overstreet wrote:
> On Wed, Feb 14, 2024 at 12:02:30AM +0100, David Hildenbrand wrote:
>> On 13.02.24 23:59, Suren Baghdasaryan wrote:
>>> On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
>>> <[email protected]> wrote:
>>>>
>>>> On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
>>>>> On 13.02.24 23:30, Suren Baghdasaryan wrote:
>>>>>> On Tue, Feb 13, 2024 at 2:17 PM David Hildenbrand <[email protected]> wrote:
>>>>>>>
>>>>>>> On 13.02.24 23:09, Kent Overstreet wrote:
>>>>>>>> On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
>>>>>>>>> On 13.02.24 22:58, Suren Baghdasaryan wrote:
>>>>>>>>>> On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
>>>>>>>>>>> [...]
>>>>>>>>>>>> We're aiming to get this in the next merge window, for 6.9. The feedback
>>>>>>>>>>>> we've gotten has been that even out of tree this patchset has already
>>>>>>>>>>>> been useful, and there's a significant amount of other work gated on the
>>>>>>>>>>>> code tagging functionality included in this patchset [2].
>>>>>>>>>>>
>>>>>>>>>>> I suspect it will not come as a surprise that I really dislike the
>>>>>>>>>>> implementation proposed here. I will not repeat my arguments, I have
>>>>>>>>>>> done so on several occasions already.
>>>>>>>>>>>
>>>>>>>>>>> Anyway, I didn't go as far as to nak it even though I _strongly_ believe
>>>>>>>>>>> this debugging feature will add a maintenance overhead for a very long
>>>>>>>>>>> time. I can live with all the downsides of the proposed implementation
>>>>>>>>>>> _as long as_ there is a wider agreement from the MM community as this is
>>>>>>>>>>> where the maintenance cost will be paid. So far I have not seen (m)any
>>>>>>>>>>> acks by MM developers so aiming into the next merge window is more than
>>>>>>>>>>> a little rushed.
>>>>>>>>>>
>>>>>>>>>> We tried other previously proposed approaches and all have their
>>>>>>>>>> downsides without making maintenance much easier. Your position is
>>>>>>>>>> understandable and I think it's fair. Let's see if others see more
>>>>>>>>>> benefit than cost here.
>>>>>>>>>
>>>>>>>>> Would it make sense to discuss that at LSF/MM once again, especially
>>>>>>>>> covering why proposed alternatives did not work out? LSF/MM is not "too far"
>>>>>>>>> away (May).
>>>>>>>>>
>>>>>>>>> I recall that the last LSF/MM session on this topic was a bit unfortunate
>>>>>>>>> (IMHO not as productive as it could have been). Maybe we can finally reach a
>>>>>>>>> consensus on this.
>>>>>>>>
>>>>>>>> I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
>>>>>>>> need to see a serious proposal - what we had at the last LSF was people
>>>>>>>> jumping in with half baked alternative proposals that very much hadn't
>>>>>>>> been thought through, and I see no need to repeat that.
>>>>>>>>
>>>>>>>> Like I mentioned, there's other work gated on this patchset; if people
>>>>>>>> want to hold this up for more discussion they better be putting forth
>>>>>>>> something to discuss.
>>>>>>>
>>>>>>> I'm thinking of ways on how to achieve Michal's request: "as long as
>>>>>>> there is a wider agreement from the MM community". If we can achieve
>>>>>>> that without LSF, great! (a bi-weekly MM meeting might also be an option)
>>>>>>
>>>>>> There will be a maintenance burden even with the cleanest proposed
>>>>>> approach.
>>>>>
>>>>> Yes.
>>>>>
>>>>>> We worked hard to make the patchset as clean as possible and
>>>>>> if benefits still don't outweigh the maintenance cost then we should
>>>>>> probably stop trying.
>>>>>
>>>>> Indeed.
>>>>>
>>>>>> At LSF/MM I would rather discuss functional
>>>>>> issues/requirements/improvements than alternative approaches to
>>>>>> instrument allocators.
>>>>>> I'm happy to arrange a separate meeting with MM folks if that would
>>>>>> help to progress on the cost/benefit decision.
>>>>> Note that I am only proposing ways forward.
>>>>>
>>>>> If you think you can easily achieve what Michal requested without all that,
>>>>> good.
>>>>
>>>> He requested something?
>>>
>>> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
>>> possible until the compiler feature is developed and deployed. And it
>>> still would require changes to the headers, so don't think it's worth
>>> delaying the feature for years.
>>>
>>
>> I was talking about this: "I can live with all the downsides of the proposed
>> implementation _as long as_ there is a wider agreement from the MM community as
>> this is where the maintenance cost will be paid. So far I have not seen
>> (m)any acks by MM developers".
>>
>> I certainly cannot be motivated at this point to review and ack this,
>> unfortunately too much negative energy around here.
>
> David, this kind of reaction is exactly why I was telling Andrew I was
> going to submit this as a direct pull request to Linus.
>
> This is an important feature; if we can't stay focused on the technical
> and get it done that's what I'll do.
Kent, I started this with "Would it make sense" in an attempt to help
Suren and you to finally make progress with this, one way or the other.
I know that there were ways in the past to get the MM community to agree
on such things.
I tried to be helpful, finding ways *not having to* bypass the MM
community to get MM stuff merged.
The reply I got is mostly negative energy.
So you don't need my help here, understood.
But I will fight against any attempts to bypass the MM community.
--
Cheers,
David / dhildenb
On Tue, Feb 13, 2024 at 03:11:15PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 13, 2024 at 05:29:03PM -0500, Kent Overstreet wrote:
> > On Tue, Feb 13, 2024 at 11:17:32PM +0100, David Hildenbrand wrote:
> > > On 13.02.24 23:09, Kent Overstreet wrote:
> > > > On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
> > > > > On 13.02.24 22:58, Suren Baghdasaryan wrote:
> > > > > > On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
> > > > > > >
> > > > > > > On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
> > > > > > > [...]
> > > > > > > > We're aiming to get this in the next merge window, for 6.9. The feedback
> > > > > > > > we've gotten has been that even out of tree this patchset has already
> > > > > > > > been useful, and there's a significant amount of other work gated on the
> > > > > > > > code tagging functionality included in this patchset [2].
> > > > > > >
> > > > > > > I suspect it will not come as a surprise that I really dislike the
> > > > > > > implementation proposed here. I will not repeat my arguments, I have
> > > > > > > done so on several occasions already.
> > > > > > >
> > > > > > > Anyway, I didn't go as far as to nak it even though I _strongly_ believe
> > > > > > > this debugging feature will add a maintenance overhead for a very long
> > > > > > > time. I can live with all the downsides of the proposed implementation
> > > > > > > _as long as_ there is a wider agreement from the MM community as this is
> > > > > > > > where the maintenance cost will be paid. So far I have not seen (m)any
> > > > > > > > acks by MM developers so aiming into the next merge window is more than
> > > > > > > > a little rushed.
> > > > > >
> > > > > > We tried other previously proposed approaches and all have their
> > > > > > downsides without making maintenance much easier. Your position is
> > > > > > understandable and I think it's fair. Let's see if others see more
> > > > > > benefit than cost here.
> > > > >
> > > > > Would it make sense to discuss that at LSF/MM once again, especially
> > > > > covering why proposed alternatives did not work out? LSF/MM is not "too far"
> > > > > away (May).
>
> You want to stall this effort for *three months* to schedule a meeting?
>
> I would love to have better profiling of memory allocations inside XFS
> so that we can answer questions like "What's going on with these
> allocation stalls?" or "What memory is getting used, and where?" more
> quickly than we can now.
>
> Right now I get to stare at tracepoints and printk crap until my eyes
> bleed, and maybe dchinner comes to my rescue and figures out what's
> going on sooner than I do. More often we just never figure it out
> because only the customer can reproduce the problem, the reams of data
> produced by ftrace are unmanageable, and BPF isn't always available.
>
> I'm not thrilled by the large increase in macro crap in the allocation
> paths, but I don't know of a better way to instrument things. Our
> attempts to use _RET_IP_ in XFS to instrument its code paths have never
> worked quite right w.r.t. inlined functions and whatnot.
>
> > > > > I recall that the last LSF/MM session on this topic was a bit unfortunate
> > > > > (IMHO not as productive as it could have been). Maybe we can finally reach a
> > > > > consensus on this.
>
> From my outsider's perspective, nine months have gone by since the last
> LSF. Who has come up with a cleaner/better/faster way to do what Suren
> and Kent have done? Were those code changes integrated into this
> patchset? Or why not?
>
> Most of what I saw in 2023 involved compiler changes (great; now I have
> to wait until RHEL 11/Debian 14 to use it) and/or still utilize fugly
> macros.
>
> Recalling all the way back to suggestions made in 2022, who wrote the
> prototype for doing this via ftrace? Or BPF? How well did that go for
> counting allocation events and the like? I saw Tejun saying something
> about how they use BPF aggressively inside Meta, but that was about it.
>
> Were any of those solutions significantly better than what's in front of
> us here?
>
> I get it, a giant patch forcing everyone to know the difference between
> alloc_foo and alloc_foo_noperf offends my (yours?) stylistic
> sensibilities. On the other side, making analysis easier during
> customer escalations means we kernel people get data, answers, and
> solutions sooner instead of watching all our time get eaten up on L4
> support and backporting hell.
>
> > > > I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
> > > > need to see a serious proposal - what we had at the last LSF was people
> > > > jumping in with half baked alternative proposals that very much hadn't
> > > > been thought through, and I see no need to repeat that.
> > > >
> > > > Like I mentioned, there's other work gated on this patchset; if people
> > > > want to hold this up for more discussion they better be putting forth
> > > > something to discuss.
> > >
> > > I'm thinking of ways on how to achieve Michal's request: "as long as there
> > > is a wider agreement from the MM community". If we can achieve that without
> > > LSF, great! (a bi-weekly MM meeting might also be an option)
> >
> > A meeting wouldn't be out of the question, _if_ there is an agenda, but:
> >
> > What's that coffee mug say? I just survived another meeting that
> > could've been an email?
>
> I congratulate you on your memory of my kitchen mug. Yes, that's what
> it says.
>
> > What exactly is the outcome we're looking for?
> >
> > Is there info that people are looking for? I think we summed things up
> > pretty well in the cover letter; if there are specifics that people
> > want to discuss, that's why we emailed the series out.
> >
> > There's people in this thread who've used this patchset in production
> > and diagnosed real issues (gigabytes of memory gone missing, I heard the
> > other day); I'm personally looking for them to chime in on this thread
> > (Johannes, Pasha).
> >
> > If it's just grumbling about "maintenance overhead" we need to get past
> > - well, people are going to have to accept that we can't deliver
> > features without writing code, and I'm confident that the hooking in
> > particular is about as clean as it's going to get, _regardless_ of
> > toolchain support; and moreover it addresses what's been historically a
> > pretty gaping hole in our ability to profile and understand the code we
> > write.
>
> Are you and Suren willing to pay whatever maintenance overhead there is?
I'm still wondering what this supposed "maintenance overhead" is going
to be...
As I use this patch series I occasionally notice places where a bunch of
memory is being accounted to one line of code, and it would better be
accounted to a caller - but then it's just a couple lines of code to fix
that. You switch that callsite to the _noprof() version of whatever
allocation it's doing, then add an alloc_hooks() wrapper at the place
you do want it accounted.
That doesn't really feel like overhead to me, just the normal tweaking of
your tools to get the most out of them.
I will continue to do that for the code I'm looking at, yes.
If other people are doing that too, it'll be because they're also using
memory allocation profiling and finding it valuable.
I did notice earlier that we're still lacking documentation in the
Documentation/ directory; the workflow for "how you shift accounting to
the right spot" is something that should go in there.
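
The "shift accounting to the right spot" workflow described above can be
sketched in userspace roughly as follows. This is an illustration under
stated assumptions, not the patchset's code: struct alloc_tag here is a
cut-down stand-in, alloc_hooks() is a simplified imitation of the real
macro, and my_malloc(), make_buf(), parse_header(), parse_body() and
bytes_charged_to() are hypothetical names invented for the example. It
needs GCC or Clang (statement expressions, __typeof__).

```c
/* Userspace sketch only; names and mechanics are simplified assumptions. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-callsite tag; one static instance per alloc_hooks() use. */
struct alloc_tag {
	const char *func;		/* function containing the callsite */
	size_t bytes;			/* total bytes charged to this callsite */
	int registered;			/* already linked into all_tags? */
	struct alloc_tag *next;
};

static struct alloc_tag *all_tags;	/* registry, so totals can be reported */
static struct alloc_tag *current_tag;	/* tag charged by the next allocation */

/*
 * Simplified stand-in for alloc_hooks(): plant a static tag at the callsite
 * and make it current while the wrapped expression runs, then restore the
 * previous tag.
 */
#define alloc_hooks(expr)						\
({									\
	static struct alloc_tag _tag = { .func = __func__ };		\
	struct alloc_tag *_old = current_tag;				\
	if (!_tag.registered) {						\
		_tag.registered = 1;					\
		_tag.next = all_tags;					\
		all_tags = &_tag;					\
	}								\
	current_tag = &_tag;						\
	__typeof__(expr) _ret = (expr);					\
	current_tag = _old;						\
	_ret;								\
})

/* Lowest layer: allocates and charges whatever tag is current, if any. */
static void *my_malloc_noprof(size_t size)
{
	void *p = malloc(size);

	if (p && current_tag)
		current_tag->bytes += size;
	return p;
}
#define my_malloc(size) alloc_hooks(my_malloc_noprof(size))

/* Before: every buffer is charged to this one line inside the helper. */
static void *make_buf_before(size_t n)
{
	return my_malloc(n);
}

/* After: the helper calls the _noprof variant and is wrapped itself... */
static void *make_buf_noprof(size_t n)
{
	return my_malloc_noprof(n);
}
#define make_buf(n) alloc_hooks(make_buf_noprof(n))

/* ...so each caller of make_buf() becomes its own accounting point. */
static void *parse_header(void) { return make_buf(64); }
static void *parse_body(void)   { return make_buf(4096); }

/* Sum the bytes charged to callsites located in the named function. */
static size_t bytes_charged_to(const char *func)
{
	size_t total = 0;

	assert(func != NULL);
	for (struct alloc_tag *t = all_tags; t; t = t->next)
		if (strcmp(t->func, func) == 0)
			total += t->bytes;
	return total;
}
```

After calling make_buf_before(100), parse_header() and parse_body(), the
sketch charges 100 bytes to make_buf_before, 64 to parse_header and 4096
to parse_body, and nothing to make_buf_noprof. The change between the
"before" and "after" helper is the couple of lines mentioned above: one
call switched to the _noprof variant, one wrapper macro added.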
On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
> On 13.02.24 23:30, Suren Baghdasaryan wrote:
> > On Tue, Feb 13, 2024 at 2:17 PM David Hildenbrand <[email protected]> wrote:
> > >
> > > On 13.02.24 23:09, Kent Overstreet wrote:
> > > > On Tue, Feb 13, 2024 at 11:04:58PM +0100, David Hildenbrand wrote:
> > > > > On 13.02.24 22:58, Suren Baghdasaryan wrote:
> > > > > > On Tue, Feb 13, 2024 at 4:24 AM Michal Hocko <[email protected]> wrote:
> > > > > > >
> > > > > > > On Mon 12-02-24 13:38:46, Suren Baghdasaryan wrote:
> > > > > > > [...]
> > > > > > > > We're aiming to get this in the next merge window, for 6.9. The feedback
> > > > > > > > we've gotten has been that even out of tree this patchset has already
> > > > > > > > been useful, and there's a significant amount of other work gated on the
> > > > > > > > code tagging functionality included in this patchset [2].
> > > > > > >
> > > > > > > I suspect it will not come as a surprise that I really dislike the
> > > > > > > implementation proposed here. I will not repeat my arguments, I have
> > > > > > > done so on several occasions already.
> > > > > > >
> > > > > > > Anyway, I didn't go as far as to nak it even though I _strongly_ believe
> > > > > > > this debugging feature will add a maintenance overhead for a very long
> > > > > > > time. I can live with all the downsides of the proposed implementation
> > > > > > > _as long as_ there is a wider agreement from the MM community as this is
> > > > > > > where the maintenance cost will be paid. So far I have not seen (m)any
> > > > > > > acks by MM developers so aiming for the next merge window is more than
> > > > > > > a little rushed.
> > > > > >
> > > > > > We tried other previously proposed approaches and all have their
> > > > > > downsides without making maintenance much easier. Your position is
> > > > > > understandable and I think it's fair. Let's see if others see more
> > > > > > benefit than cost here.
> > > > >
> > > > > Would it make sense to discuss that at LSF/MM once again, especially
> > > > > covering why proposed alternatives did not work out? LSF/MM is not "too far"
> > > > > away (May).
> > > > >
> > > > > I recall that the last LSF/MM session on this topic was a bit unfortunate
> > > > > (IMHO not as productive as it could have been). Maybe we can finally reach a
> > > > > consensus on this.
> > > >
> > > > I'd rather not delay for more bikeshedding. Before agreeing to LSF I'd
> > > > need to see a serious proposal - what we had at the last LSF was people
> > > > jumping in with half baked alternative proposals that very much hadn't
> > > > been thought through, and I see no need to repeat that.
> > > >
> > > > Like I mentioned, there's other work gated on this patchset; if people
> > > > want to hold this up for more discussion they better be putting forth
> > > > something to discuss.
> > >
> > > I'm thinking of ways on how to achieve Michal's request: "as long as
> > > there is a wider agreement from the MM community". If we can achieve
> > > that without LSF, great! (a bi-weekly MM meeting might also be an option)
> >
> > There will be a maintenance burden even with the cleanest proposed
> > approach.
>
> Yes.
>
> > We worked hard to make the patchset as clean as possible and
> > if benefits still don't outweigh the maintenance cost then we should
> > probably stop trying.
>
> Indeed.
>
> > At LSF/MM I would rather discuss functional
> > issues/requirements/improvements than alternative approaches to
> > instrument allocators.
> > I'm happy to arrange a separate meeting with MM folks if that would
> > help to progress on the cost/benefit decision.
> Note that I am only proposing ways forward.
>
> If you think you can easily achieve what Michal requested without all that,
> good.
He requested something?
On 13.02.24 23:50, Kent Overstreet wrote:
> On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
[...]
>> Note that I am only proposing ways forward.
>>
>> If you think you can easily achieve what Michal requested without all that,
>> good.
>
> He requested something?
>
This won't get merged without acks from MM people.
--
Cheers,
David / dhildenb
On Tue, Feb 13, 2024 at 10:29:34AM +0200, Andy Shevchenko wrote:
> On Tue, Feb 13, 2024 at 10:26 AM Andy Shevchenko
> <[email protected]> wrote:
> >
> > On Mon, Feb 12, 2024 at 11:39 PM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > From: Kent Overstreet <[email protected]>
> > >
> > > The new flags parameter allows controlling
> > > - Whether or not the units suffix is separated by a space, for
> > > compatibility with sort -h
> > > - Whether or not to append a B suffix - we're not always printing
> > > bytes.
>
> And you effectively missed to _add_ the test cases for the modified code.
> Formal NAK for this, the rest is discussable, the absence of tests is not.
Eh?
The core algorithm for printing out a number in human readable units;
that's definitely worth a test, and I assume there already is one - I
didn't touch that.
But whether or not the units suffix has a space, or a B suffix? That's
not going to break in subtle ways; that either works or it doesn't.
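
For readers following the test-coverage question, the two flag behaviours
being argued over are small enough to table-test. The sketch below is a
userspace approximation, not the kernel's string_get_size(): the function
format_size() and the flag names FMT_NO_SPACE and FMT_NO_BYTES are made up
for illustration, only base-2 units are handled, and the fractional part is
truncated to two digits rather than rounded.

```c
/* Userspace sketch; format_size() and the FMT_* flags are hypothetical. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum {
	FMT_NO_SPACE = 1u << 0,	/* "1.50KiB" instead of "1.50 KiB" (sort -h) */
	FMT_NO_BYTES = 1u << 1,	/* "1.50 Ki" instead of "1.50 KiB" */
};

/* Format size in base-2 units into buf; returns snprintf()'s result. */
static int format_size(uint64_t size, unsigned int flags, char *buf, size_t len)
{
	static const char *const prefix[] = {
		"", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei"
	};
	uint64_t whole = size, rem = 0;
	unsigned int i = 0;
	char unit[4];
	const char *sep;

	/* Scale down by 1024 per step, remembering the last remainder. */
	while (whole >= 1024 && i < 6) {
		rem = whole % 1024;
		whole /= 1024;
		i++;
	}

	/* Unit is the prefix plus an optional "B". */
	snprintf(unit, sizeof(unit), "%s%s", prefix[i],
		 (flags & FMT_NO_BYTES) ? "" : "B");

	/* No separator if it was disabled or there is no unit to separate. */
	sep = (unit[0] && !(flags & FMT_NO_SPACE)) ? " " : "";

	if (i == 0)
		return snprintf(buf, len, "%llu%s%s",
				(unsigned long long)whole, sep, unit);

	return snprintf(buf, len, "%llu.%02llu%s%s",
			(unsigned long long)whole,
			(unsigned long long)(rem * 100 / 1024), sep, unit);
}
```

With this sketch, 1536 formats as "1.50 KiB" with no flags, "1.50KiB" with
FMT_NO_SPACE, "1.50 Ki" with FMT_NO_BYTES and "1.50Ki" with both. Dropping
the separator when the unit string is empty is a choice made here, not
something stated in the patch description.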
On Tue, Feb 13, 2024 at 3:22 PM David Hildenbrand <[email protected]> wrote:
>
> On 14.02.24 00:12, Kent Overstreet wrote:
> > On Wed, Feb 14, 2024 at 12:02:30AM +0100, David Hildenbrand wrote:
> >> On 13.02.24 23:59, Suren Baghdasaryan wrote:
> >>> On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
> >>> <[email protected]> wrote:
> >>>>
[...]
> >>>>
> >>>> He requested something?
> >>>
> >>> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> >>> possible until the compiler feature is developed and deployed. And it
> >>> still would require changes to the headers, so don't think it's worth
> >>> delaying the feature for years.
> >>>
> >>
> >> I was talking about this: "I can live with all the downsides of the proposed
> >> implementation _as long as_ there is a wider agreement from the MM community as
> >> this is where the maintenance cost will be paid. So far I have not seen
> >> (m)any acks by MM developers".
> >>
> >> I certainly cannot be motivated at this point to review and ack this,
> >> unfortunately too much negative energy around here.
> >
> > David, this kind of reaction is exactly why I was telling Andrew I was
> > going to submit this as a direct pull request to Linus.
> >
> > This is an important feature; if we can't stay focused on the technical
> > and get it done that's what I'll do.
>
> Kent, I started this with "Would it make sense" in an attempt to help
> Suren and you to finally make progress with this, one way or the other.
> I know that there were ways in the past to get the MM community to agree
> on such things.
>
> I tried to be helpful, finding ways *not having to* bypass the MM
> community to get MM stuff merged.
>
> The reply I got is mostly negative energy.
>
> So you don't need my help here, understood.
>
> But I will fight against any attempts to bypass the MM community.
Well, I'm definitely not trying to bypass the MM community, that's why
this patchset is posted. Not sure why people can't voice their opinion
on the benefit/cost balance of the patchset over the email... But if a
meeting would be more productive I'm happy to set it up.
>
> --
> Cheers,
>
> David / dhildenb
>
On Tue, Feb 13, 2024 at 06:54:09PM -0500, Pasha Tatashin wrote:
> > > I tried to be helpful, finding ways *not having to* bypass the MM
> > > community to get MM stuff merged.
> > >
> > > The reply I got is mostly negative energy.
> > >
> > > So you don't need my help here, understood.
> > >
> > > But I will fight against any attempts to bypass the MM community.
> >
> > Well, I'm definitely not trying to bypass the MM community, that's why
> > this patchset is posted. Not sure why people can't voice their opinion
> > on the benefit/cost balance of the patchset over the email... But if a
> > meeting would be more productive I'm happy to set it up.
>
> Discussing these concerns during the next available MM Alignment
> session makes sense. At a minimum, Suren and Kent can present their
> reasons for believing the current approach is superior to the
> previously proposed alternatives.
Hang on though: I believe we did so adequately within this thread. Both
in the cover letter, and I further outlined exactly what the hooks
need to do, and cited the exact code.
Nobody seems to be complaining about the specifics, so I'm not sure what
would be on the agenda?
I'll do a more thorough code review, but before the discussion gets
too sidetracked, I wanted to add my POV on the overall merit of the
direction that is being proposed here.
I have backported and used this code for debugging production issues
before. Logging into a random host with an unfamiliar workload and
being able to get a reliable, comprehensive list of kernel memory
consumers is one of the coolest things I have seen in a long
time. This is a huge improvement to sysadmin quality of life.
It's also a huge improvement for MM developers. We're the first points
of contact for memory regressions that can be caused by pretty much
any driver or subsystem in the kernel.
I encourage anybody who is undecided on whether this is worth doing to
build a kernel with these patches applied and run it on their own
machine. I think you'll be surprised what you'll find - and how myopic
and uninformative /proc/meminfo feels in comparison to this. Did you
know there is a lot more to modern filesystems than the VFS objects we
are currently tracking? :)
Then imagine what this looks like on a production host running a
complex mix of filesystems, enterprise networking, bpf programs, gpus
and accelerators etc.
Backporting the code to a slightly older production kernel wasn't too
difficult. The instrumentation layering is explicit, clean, and fairly
centralized, so resolving minor conflicts around the _noprof renames
and the wrappers was pretty straight-forward.
When we talk about maintenance cost, a fair shake would be to weigh it
against the cost and reliability of our current method: evaluating
consumers in the kernel on a case-by-case basis and annotating the
alloc/free sites by hand; then quibbling with the MM community about
whether that consumer is indeed significant enough to warrant an entry
in /proc/meminfo, and what the catchiest name for the stat would be.
I think we can agree that this is vastly less scalable and more
burdensome than central annotations around a handful of mostly static
allocator entry points. Especially considering the rate of change in
the kernel as a whole, and that not everybody will think of the
comprehensive MM picture when writing a random driver. And I think
that's generous - we don't even have the network stack in meminfo.
So I think what we do now isn't working. In the Meta fleet, at any
given time the p50 for unaccounted kernel memory is several gigabytes
per host. The p99 is between 15% and 30% of total memory. That's a
looot of opaque resource usage we have to accept on faith.
For hunting down regressions, all it takes is one untracked consumer
in the kernel to really throw a wrench into things. It's difficult to
find in the noise with tracing, and if it's not growing after an
initial allocation spike, you're pretty much out of luck finding it at
all. Raise your hand if you've written a drgn script to walk pfns and
try to guess consumers from the state of struct page :)
I agree we should discuss how the annotations are implemented on a
technical basis, but my take is that we need something like this.
In a codebase of our size, I don't think the allocator should be
handing out memory without some basic implied tracking of where it's
going. It's a liability for production environments, and it can hide
bad memory management decisions in drivers and other subsystems for a
very long time.
On 14.02.24 00:28, Suren Baghdasaryan wrote:
[...]
> Well, I'm definitely not trying to bypass the MM community, that's why
> this patchset is posted. Not sure why people can't voice their opinion
> on the benefit/cost balance of the patchset over the email... But if a
> meeting would be more productive I'm happy to set it up.
If you can get the acks without any additional meetings, great. The
replies from Pasha and Johannes are encouraging, let's hope core
memory-allocator people will voice their opinion here.
If you come to the conclusion that another meeting would help getting
maintainers' attention and sorting out some of the remaining concerns,
feel free to schedule a meeting with Dave R. I suspect only the slot
next week is already taken. In the past, we also had "special" meetings
just for things to make progress faster.
If you're looking for ideas on what the agenda of such a meeting could
look like, I'll happily discuss that with you off-list.
v2 was more than 3 months ago. If it's really about minor details here,
waiting another 3 months for LSF/MM is indeed not reasonable.
Myself, I'll be happy not having to sit through another LSF/MM session
of that kind. The level of drama is exceptional and I'm hoping it won't
be the new norm in the MM space.
Good luck!
--
Cheers,
David / dhildenb
On 2/14/24 00:08, Kent Overstreet wrote:
> On Tue, Feb 13, 2024 at 02:59:11PM -0800, Suren Baghdasaryan wrote:
>> On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
>> <[email protected]> wrote:
>> >
>> > On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
>> > > On 13.02.24 23:30, Suren Baghdasaryan wrote:
>> > > > On Tue, Feb 13, 2024 at 2:17 PM David Hildenbrand <[email protected]> wrote:
>> > > If you think you can easily achieve what Michal requested without all that,
>> > > good.
>> >
>> > He requested something?
>>
>> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
>> possible until the compiler feature is developed and deployed. And it
>> still would require changes to the headers, so don't think it's worth
>> delaying the feature for years.
>
> Hang on, let's look at the actual code.
>
> This is what instrumenting an allocation function looks like:
>
> #define krealloc_array(...) alloc_hooks(krealloc_array_noprof(__VA_ARGS__))
>
> IOW, we have to:
> - rename krealloc_array to krealloc_array_noprof
> - replace krealloc_array with a one-line wrapper macro
>
> Is this really all we're getting worked up over?
>
> The renaming we need regardless, because the thing that makes this
> approach efficient enough to run in production is that we account at
> _one_ point in the callstack, we don't save entire backtraces.
>
> And thus we need to explicitly annotate which one that is; which means
> we need _noprof() versions of functions for when the accounting is done
> by an outer wrapper (e.g. mempool).
>
> And, as I keep saying: that alloc_hooks() macro will also get us _per
> callsite fault injection points_, and we really need that because - if
> you guys have been paying attention to other threads - whenever moving
> more stuff to PF_MEMALLOC_* flags comes up (including adding
> PF_MEMALLOC_NORECLAIM), the issue of small allocations not failing and
> not being testable keeps coming up.
How exactly do you envision the fault injection to help here? The proposals
are about scoping via a process flag, and the process may then call just
about anything under that scope. So if our tool is per callsite fault
injection points, how do we know which callsites to enable to focus the
fault injection on the particular scope?
On Tue 13-02-24 14:59:11, Suren Baghdasaryan wrote:
> On Tue, Feb 13, 2024 at 2:50 PM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Tue, Feb 13, 2024 at 11:48:41PM +0100, David Hildenbrand wrote:
[...]
> > > If you think you can easily achieve what Michal requested without all that,
> > > good.
> >
> > He requested something?
>
> Yes, a cleaner instrumentation.
Nope, not really. You have indicated you want to target this version for the
_next_ merge window without any acks, really. If you want to go
forward with this then you should gain support from the MM community
at least. Why? Because the whole macro layering is adding maintenance
cost for MM people.
I have expressed why I absolutely hate the additional macro layer. We
have been through similar layers of macros in other areas (not to
mention page allocator interface itself) and it has _always_ turned out
a bad idea long term. I do not see why this case should be any
different.
The whole kernel is moving to a dynamic tracing realm and now we
are going to build a statically macro based tracing infrastructure which
will need tweaking anytime real memory consumers are one layer up the
existing macro infrastructure (do not forget quite a lot of allocations
are in library functions) and/or we need to modify the allocator API
in some way. Call me unimpressed!
Now, I fully recognize that the solution doesn't really have to be
perfect in order to be useful. Hence I never NAKed it even though I really
_dislike_ the approach. I expected you would grow community support
over time if this is indeed the only feasible approach, but that is not
reflected in the series posted here. If you find support I will not
stand in the way.
> Unfortunately the cleanest one is not
> possible until the compiler feature is developed and deployed. And it
> still would require changes to the headers, so don't think it's worth
> delaying the feature for years.
I am pretty sure you have invested a non-trivial amount of time into evaluating
other ways, yet your cover letter is rather modest about any details:
: - looked at alternate hooking methods.
: There were suggestions on alternate methods (compiler attribute,
: trampolines), but they wouldn't have made the patchset any cleaner
: (we still need to have different function versions for accounting vs. no
: accounting to control at which point in a call chain the accounting
: happens), and they would have added a dependency on toolchain
: support.
First immediate question would be: What about page_owner? I do remember
the runtime overhead being discussed but I do not really remember any
actual numbers outside of artificial workloads. Has this been
investigated? Is our stack unwinder the problem? Etc.
Also what are the biggest obstacles to efficiently track allocations via
our tracing infrastructure? Has this been investigated? What were conclusions?
--
Michal Hocko
SUSE Labs
On Wed 14-02-24 01:20:20, Johannes Weiner wrote:
[...]
> I agree we should discuss how the annotations are implemented on a
> technical basis, but my take is that we need something like this.
I do not think there is any disagreement on usefulness of a better
memory allocation tracking. At least for me the primary problem is the
implementation. At LSF/MM last year we have heard that existing tracing
infrastructure hasn't really been explored much. Cover letter doesn't
really talk much about those alternatives so it is really hard to
evaluate whether the proposed solution is indeed our best way to
approach this.
> In a codebase of our size, I don't think the allocator should be
> handing out memory without some basic implied tracking of where it's
> going. It's a liability for production environments, and it can hide
> bad memory management decisions in drivers and other subsystems for a
> very long time.
Fully agreed! It is quite common to see oom reports with a large portion
of memory unaccounted and this really presents additional cost on the
debugging side.
--
Michal Hocko
SUSE Labs
On 2/13/24 03:08, Kent Overstreet wrote:
> On Mon, Feb 12, 2024 at 04:31:14PM -0800, Kees Cook wrote:
>> On Mon, Feb 12, 2024 at 01:39:09PM -0800, Suren Baghdasaryan wrote:
>> > From: Kent Overstreet <[email protected]>
>> >
>> > It seems we need to be more forceful with the compiler on this one.
>>
>> Sure, but why?
>
> Wasn't getting inlined without it, and that's one we do want inlined -
> it's only called in one place.
It would be better to mention this in the changelog so it's clear this is
for performance and not e.g. needed for the code tagging to work as expected.
On Wed, Feb 14, 2024 at 03:46:33PM +0100, Michal Hocko wrote:
> On Wed 14-02-24 01:20:20, Johannes Weiner wrote:
> [...]
> > I agree we should discuss how the annotations are implemented on a
> > technical basis, but my take is that we need something like this.
>
> I do not think there is any disagreement on usefulness of a better
> memory allocation tracking. At least for me the primary problem is the
> implementation. At LSF/MM last year we have heard that existing tracing
> infrastructure hasn't really been explored much. Cover letter doesn't
> really talk much about those alternatives so it is really hard to
> evaluate whether the proposed solution is indeed our best way to
> approach this.
Michal, we covered this before.
To do this with tracing you'd have to build up data structures
separately, in userspace, that would mirror the allocator's data
structures; you would have to track every single allocation so that you
could match up the free event to the place it was allocated.
Even if it could be built, which I doubt, it'd be completely non-viable
because the performance would be terrible.
Like I said, we covered all this before; if you're going to spend so
much time on these threads you really should be making a better attempt
at keeping up with what's been talked about.
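To illustrate the bookkeeping described above, here is a minimal userspace
sketch in C of what a trace consumer would have to maintain. It is not taken
from any existing tool: on_alloc(), on_free(), the table layout, and the
callsite ids are all hypothetical. The point is that a free event carries only
an address, so every live allocation must be remembered somewhere in order to
find the callsite it should be uncharged from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024		/* power of two; a real trace needs far more */
#define EMPTY      0		/* slot never used */
#define TOMBSTONE  1		/* slot freed, keeps probe chains intact */

/* One entry per live allocation seen in the trace (hypothetical layout). */
struct live_alloc {
	uintptr_t addr;		/* EMPTY, TOMBSTONE, or a real address */
	int callsite;		/* id of the allocating callsite */
	size_t size;
};

static struct live_alloc table[TABLE_SIZE];
static size_t bytes_by_callsite[16];

static size_t slot_of(uintptr_t addr)
{
	return (addr >> 4) & (TABLE_SIZE - 1);
}

/* "alloc" event: remember who allocated this address.
 * Assumes the table never fills up; a real consumer would have to resize. */
static void on_alloc(uintptr_t addr, int callsite, size_t size)
{
	size_t i = slot_of(addr);

	while (table[i].addr > TOMBSTONE)	/* linear probing */
		i = (i + 1) & (TABLE_SIZE - 1);
	table[i] = (struct live_alloc){ addr, callsite, size };
	bytes_by_callsite[callsite] += size;
}

/* "free" event: only the address is known, so look the callsite up. */
static void on_free(uintptr_t addr)
{
	size_t i = slot_of(addr);

	while (table[i].addr != EMPTY) {
		if (table[i].addr == addr) {
			bytes_by_callsite[table[i].callsite] -= table[i].size;
			table[i].addr = TOMBSTONE;
			return;
		}
		i = (i + 1) & (TABLE_SIZE - 1);
	}
	/* free without a matching alloc event: nothing to uncharge */
}

static int run_demo(void)
{
	on_alloc(0x1000, 3, 64);
	on_alloc(0x2000, 5, 32);
	on_alloc(0x5000, 3, 16);	/* hashes to the same slot as 0x1000 */
	assert(bytes_by_callsite[3] == 80);

	on_free(0x1000);
	assert(bytes_by_callsite[3] == 16);
	on_free(0x5000);
	assert(bytes_by_callsite[3] == 0);
	assert(bytes_by_callsite[5] == 32);
	return 0;
}
```

Calling run_demo() from a main() exercises the table. Every allocation and
every free costs a table operation here, on top of the cost of emitting and
consuming the trace events themselves.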
On Wed 14-02-24 10:01:14, Kent Overstreet wrote:
> On Wed, Feb 14, 2024 at 03:46:33PM +0100, Michal Hocko wrote:
> > On Wed 14-02-24 01:20:20, Johannes Weiner wrote:
> > [...]
> > > I agree we should discuss how the annotations are implemented on a
> > > technical basis, but my take is that we need something like this.
> >
> > I do not think there is any disagreement on usefulness of a better
> > memory allocation tracking. At least for me the primary problem is the
> > implementation. At LSF/MM last year we have heard that existing tracing
> > infrastructure hasn't really been explored much. Cover letter doesn't
> > really talk much about those alternatives so it is really hard to
> > evaluate whether the proposed solution is indeed our best way to
> > approach this.
>
> Michal, we covered this before.
It is a good practice to summarize previous discussions in the cover
letter. Especially when there are different approaches discussed over a
longer time period or when the topic is controversial.
I do not see anything like that here. Neither for the existing tracing
infrastructure, page owner nor performance concerns discussed before
etc. Look, I do not want to nit pick or insist on formalisms but having
those data points laid out would make any further discussion much more
smooth.
--
Michal Hocko
SUSE Labs
On Wed, Feb 14, 2024 at 05:02:28PM +0100, Michal Hocko wrote:
> On Wed 14-02-24 10:01:14, Kent Overstreet wrote:
> > On Wed, Feb 14, 2024 at 03:46:33PM +0100, Michal Hocko wrote:
> > > On Wed 14-02-24 01:20:20, Johannes Weiner wrote:
> > > [...]
> > > > I agree we should discuss how the annotations are implemented on a
> > > > technical basis, but my take is that we need something like this.
> > >
> > > I do not think there is any disagreement on usefulness of a better
> > > memory allocation tracking. At least for me the primary problem is the
> > > implementation. At LSF/MM last year we have heard that existing tracing
> > > infrastructure hasn't really been explored much. Cover letter doesn't
> > > really talk much about those alternatives so it is really hard to
> > > evaluate whether the proposed solution is indeed our best way to
> > > approach this.
> >
> > Michal, we covered this before.
>
> It is a good practice to summarize previous discussions in the cover
> letter. Especially when there are different approaches discussed over a
> longer time period or when the topic is controversial.
>
> I do not see anything like that here. Neither for the existing tracing
> infrastructure, page owner nor performance concerns discussed before
> etc. Look, I do not want to nit pick or insist on formalisms but having
> those data points laid out would make any further discussion much more
> smooth.
You don't want to nitpick???
Look, you've been consistently sidestepping the technical discussion; it
seems all you want to talk about is process or "your nack".
If we're going to have a technical discussion, it's incumbent upon all
of us to /keep the focus on the technical/; that is everyone's
responsibility.
I'm not going to write a 20 page cover letter and recap every dead end
that was proposed. That would be a lot of useless crap for everyone to
wade through. I'm going to summarize the important stuff, and keep the
focus on what we're doing and documenting it. If you want to take part
in a discussion, it's your responsibility to be reading with
comprehension and finding useful things to say.
You gotta stop with this derailing garbage.
On Wed, Feb 14, 2024 at 11:20:26AM +0100, Vlastimil Babka wrote:
> On 2/14/24 00:08, Kent Overstreet wrote:
> > And, as I keep saying: that alloc_hooks() macro will also get us _per
> > callsite fault injection points_, and we really need that because - if
> > you guys have been paying attention to other threads - whenever moving
> > more stuff to PF_MEMALLOC_* flags comes up (including adding
> > PF_MEMALLOC_NORECLAIM), the issue of small allocations not failing and
> > not being testable keeps coming up.
>
> How exactly do you envision the fault injection to help here? The proposals
> are about scoping via a process flag, and the process may then call just
> about anything under that scope. So if our tool is per callsite fault
> injection points, how do we know which callsites to enable to focus the
> fault injection on the particular scope?
So the question with fault injection is - how do we integrate it into
our existing tests?
We need fault injection that we can integrate into our existing tests
because that's the only way to get the code coverage we need - writing
new tests that cover all the error paths isn't going to happen, and
wouldn't work as well anyways.
But the trouble with injecting memory allocation failures is that
they'll result in errors bubbling up to userspace, and in unpredictable
ways.
We _definitely_ cannot enable random memory allocation faults for the
entire kernel at runtime - or rather we _could_, and that would actually
be great to do as a side project; but that's not something we can do in
our existing automated tests because the results will be completely
unpredictable. If we did that the goal would be to just make sure the
kernel doesn't explode - but what we actually want is for our automated
pass/fail tests to still pass; we need to constrain what will fail.
So we need at a minimum to be able to only enable memory allocation
failures for the code we're interested in testing (file/module) -
enabling memory allocation failures in some other random subsystem we're
not developing or looking at isn't what we want.
Beyond that, it's very much subsystem dependent. For bcachefs, my main
strategy has been to flip on random (1%) memory allocation failures
after the filesystem has mounted. During startup, we do a ton of
allocations (I cover those with separate tests), but after startup we
should be able to run normally in the presence of allocation failures
without ever returning an error to userspace - so that's what I'm trying
to test.
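As a rough illustration of the constraint described above (inject failures only
in the code under test, at a chosen rate), here is a small userspace sketch in
C. None of it comes from the patchset: struct fault_point, fault_enable(),
should_fail(), alloc_or_fail(), and the file paths are hypothetical names, and
in the kernel each point would presumably be a static per-callsite object
rather than an array entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-callsite fault injection point. */
struct fault_point {
	const char *file;	/* source file of the callsite */
	unsigned int percent;	/* chance of failure, 0..100 */
	bool enabled;
};

/* Small deterministic PRNG so that test runs are reproducible. */
static unsigned int fault_seed = 1;

static unsigned int fault_rand(void)
{
	fault_seed = fault_seed * 1103515245u + 12345u;
	return (fault_seed >> 16) & 0x7fff;
}

static bool should_fail(const struct fault_point *fp)
{
	return fp->enabled && (fault_rand() % 100) < fp->percent;
}

/* Return NULL instead of evaluating the allocation when the point fires. */
#define alloc_or_fail(fp, expr) (should_fail(fp) ? NULL : (expr))

/* Enable only the points whose file path starts with the given prefix. */
static void fault_enable(struct fault_point *pts, size_t n,
			 const char *prefix, unsigned int percent)
{
	for (size_t i = 0; i < n; i++) {
		pts[i].enabled = strncmp(pts[i].file, prefix,
					 strlen(prefix)) == 0;
		pts[i].percent = percent;
	}
}

static int run_demo(void)
{
	/* Illustrative paths only. */
	struct fault_point pts[] = {
		{ "fs/bcachefs/example.c", 0, false },
		{ "mm/example.c", 0, false },
	};
	void *p;

	fault_enable(pts, 2, "fs/bcachefs/", 100);

	p = alloc_or_fail(&pts[0], malloc(8));
	assert(p == NULL);	/* code under test: always fails at 100% */

	p = alloc_or_fail(&pts[1], malloc(8));
	assert(p != NULL);	/* other subsystems are left alone */
	free(p);
	return 0;
}
```

With a rate such as 1 instead of 100, the same selection by prefix would give
occasional failures confined to one subsystem, which is the shape of testing
described in the message above.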
On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <[email protected]> wrote:
> > > If you think you can easily achieve what Michal requested without all that,
> > > good.
> >
> > He requested something?
>
> Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> possible until the compiler feature is developed and deployed. And it
> still would require changes to the headers, so don't think it's worth
> delaying the feature for years.
Can we please be told much more about this compiler feature?
Description of what it is, what it does, how it will affect this kernel
feature, etc.
Who is developing it and when can we expect it to become available?
Will we be able to migrate to it without back-compatibility concerns?
(I think "you need quite recent gcc for memory profiling" is
reasonable).
Because: if the maintainability issues which Michal describes will be
significantly addressed with the gcc support then we're kinda reviewing
the wrong patchset. Yes, it may be a maintenance burden initially, but
at some (yet to be revealed) time in the future, this will be addressed
with the gcc support?
On Wed 14-02-24 11:17:20, Kent Overstreet wrote:
[...]
> You gotta stop with this derailing garbage.
It is always a pleasure talking to you, Kent, but let me give you advice
(free of charge of course). Let Suren talk, chances for civilized
and productive discussion are much higher!
I do not have much more to add to the discussion. My point stands: find
support in the MM community if you want to proceed with this work.
--
Michal Hocko
SUSE Labs
On Wed, Feb 14, 2024 at 8:31 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 14-02-24 11:17:20, Kent Overstreet wrote:
> [...]
> > You gotta stop with this derailing garbage.
>
> It is always a pleasure talking to you, Kent, but let me give you advice
> (free of charge of course). Let Suren talk, chances for civilized
> and productive discussion are much higher!
Every time I wake up to a new drama... Sorry, I won't follow up on this
one, so as not to feed the fire.
>
> I do not have much more to add to the discussion. My point stands: find
> support in the MM community if you want to proceed with this work.
> --
> Michal Hocko
> SUSE Labs
On Tue, Feb 13, 2024 at 06:08:45PM -0500, Kent Overstreet wrote:
> This is what instrumenting an allocation function looks like:
>
> #define krealloc_array(...) alloc_hooks(krealloc_array_noprof(__VA_ARGS__))
>
> IOW, we have to:
> - rename krealloc_array to krealloc_array_noprof
> - replace krealloc_array with a one-line wrapper macro call
>
> Is this really all we're getting worked up over?
>
> The renaming we need regardless, because the thing that makes this
> approach efficient enough to run in production is that we account at
> _one_ point in the callstack, we don't save entire backtraces.
I'm probably going to regret getting involved in this thread, but since
Suren already decided to put me on the cc ...
There might be a way to do it without renaming. We have a bit of the
linker script called SCHED_TEXT which lets us implement
in_sched_functions(). ie we could have the equivalent of
include/linux/sched/debug.h:#define __sched __section(".sched.text")
perhaps #define __memalloc __section(".memalloc.text")
which would do all the necessary magic to know where the backtrace
should stop.
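A userspace analogue of this section trick can be sketched in C as follows. It
assumes an ELF target linked with GNU ld or lld, which define __start_<name>
and __stop_<name> symbols for sections whose names are valid C identifiers
(hence "memalloc_text" rather than ".memalloc.text"). my_alloc(), my_caller(),
in_memalloc_functions(), and run_demo() are hypothetical names; only the
address-range test is shown, not an unwinder.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace counterpart of the proposed __memalloc marker. */
#define __memalloc __attribute__((section("memalloc_text"), used))

/* Provided by the linker for sections named like C identifiers. */
extern char __start_memalloc_text[];
extern char __stop_memalloc_text[];

/* Same idea as in_sched_functions(): is addr inside the marked section? */
static int in_memalloc_functions(uintptr_t addr)
{
	return addr >= (uintptr_t)__start_memalloc_text &&
	       addr <  (uintptr_t)__stop_memalloc_text;
}

/* An allocator entry point, placed in the marked section. */
__memalloc void *my_alloc(size_t size)
{
	return malloc(size);
}

/* An ordinary caller, left in the default text section. */
void *my_caller(void)
{
	return my_alloc(32);
}

static int run_demo(void)
{
	/* Function-pointer-to-integer casts are implementation-defined,
	 * but behave as expected with gcc and clang on ELF targets. */
	assert(in_memalloc_functions((uintptr_t)my_alloc));
	assert(!in_memalloc_functions((uintptr_t)my_caller));
	return 0;
}
```

A backtrace walker built on this would skip frames while
in_memalloc_functions() is true and attribute the allocation to the first
frame outside the section.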
On Wed, Feb 14, 2024 at 8:55 AM Andrew Morton <[email protected]> wrote:
>
> On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <[email protected]> wrote:
>
> > > > If you think you can easily achieve what Michal requested without all that,
> > > > good.
> > >
> > > He requested something?
> >
> > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> > possible until the compiler feature is developed and deployed. And it
> > still would require changes to the headers, so don't think it's worth
> > delaying the feature for years.
>
> Can we please be told much more about this compiler feature?
> Description of what it is, what it does, how it will affect this kernel
> feature, etc.
Sure. The compiler support will be in the form of a new __attribute__,
simplified example:
// generate data for the wrapper
static void _alloc_tag()
{
	static struct alloc_tag _alloc_tag __section ("alloc_tags")
		= { .ct = CODE_TAG_INIT, .counter = 0 };
}

static inline int
wrapper (const char *name, int x, int (*callee) (const char *, int),
	 struct alloc_tag *callsite_data)
{
	callsite_data->counter++;
	printf ("Call #%d from %s:%d (%s)\n", callsite_data->counter,
		callsite_data->ct.filename, callsite_data->ct.lineno,
		callsite_data->ct.function);
	int ret = callee (name, x);
	printf ("Returned: %d\n", ret);
	return ret;
}

__attribute__((annotate("callsite_wrapped_by", wrapper, _alloc_tag)))
int foo(const char* name, int x);

int foo(const char* name, int x) {
	printf ("Hello %s, %d!\n", name, x);
	return x;
}
Which we will be able to attach to a function without changing its
name and preserving the namespace (it applies only to functions with
that name, not everything else).
Note that we will still need _noprof versions of the allocators.
>
> Who is developing it and when can we expect it to become available?
Aleksei Vetrov (google) with the help of Nick Desaulniers (google).
Both are CC'ed on this email.
After several iterations Aleksei has a POC which we are evaluating
(https://github.com/llvm/llvm-project/compare/main...noxwell:llvm-project:callsite-wrapper-tree-transform).
Once it's in good shape we are going to engage with the Clang and GCC
communities to get it upstreamed. When it will become available and when
the distributions will pick it up is anybody's guess. Upstreaming is
usually a lengthy process.
>
> Will we be able to migrate to it without back-compatibility concerns?
> (I think "you need quite recent gcc for memory profiling" is
> reasonable).
The migration should be quite straightforward, replacing the macros
with functions with that attribute.
>
>
> Because: if the maintainability issues which Michal describes will be
> significantly addressed with the gcc support then we're kinda reviewing
> the wrong patchset. Yes, it may be a maintenance burden initially, but
> at some (yet to be revealed) time in the future, this will be addressed
> with the gcc support?
That's what I'm aiming for. I just don't want this placed on hold
until the compiler support is widely available, which might take
years.
>
On Wed, Feb 14, 2024 at 08:55:48AM -0800, Andrew Morton wrote:
> On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <[email protected]> wrote:
>
> > > > If you think you can easily achieve what Michal requested without all that,
> > > > good.
> > >
> > > He requested something?
> >
> > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> > possible until the compiler feature is developed and deployed. And it
> > still would require changes to the headers, so don't think it's worth
> > delaying the feature for years.
>
> Can we please be told much more about this compiler feature?
> Description of what it is, what it does, how it will affect this kernel
> feature, etc.
>
> Who is developing it and when can we expect it to become available?
>
> Will we be able to migrate to it without back-compatibility concerns?
> (I think "you need quite recent gcc for memory profiling" is
> reasonable).
>
>
>
> Because: if the maintainability issues which Michal describes will be
> significantly addressed with the gcc support then we're kinda reviewing
> the wrong patchset. Yes, it may be a maintenance burden initially, but
> at some (yet to be revealed) time in the future, this will be addressed
> with the gcc support?
Even if we had compiler magic, after considering it more I don't think
the patchset would be improved by it - I would still prefer to stick
with the macro approach.
There are also a lot of unresolved questions about whether the compiler
approach would even end up being what we need; we need macro expansion to
happen in the caller of the allocation function, and that's another
level of hooking that I don't think the compiler people are even
considering yet, since cpp runs before the main part of the compiler; if
C macros worked and were implemented more like Rust macros I'm sure it
could be done - in fact, I think this could all be done in Rust
_without_ any new compiler support - but in C, this is a lot to ask.
Let's look at the instrumentation again. There's two steps:
- Renaming the original function to _noprof
- Adding a hooked version of the original function.
We need to do the renaming regardless of what approach we take in order
to correctly handle allocations that happen inside the context of an
existing alloc tag hook but should not be accounted to the outer
context; we do that by selecting the alloc_foo() or alloc_foo_noprof()
version as appropriate.
It's important to get this right; consider slab object extension
vectors or the slab allocator allocating pages from the page allocator.
Second step, adding a hooked version of the original function. We do
that with
#define alloc_foo(...) alloc_hooks(alloc_foo_noprof(__VA_ARGS__))
That's pretty clean, if you ask me. The only way to make it more succinct
would be if it were possible for a C macro to define a new macro; then it
could be just
alloc_fn(alloc_foo);
But honestly, the former is probably preferable anyways from a ctags/cscope POV.
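To make the two steps above concrete, here is a minimal userspace sketch in
GNU C (it relies on statement expressions, so gcc or clang is assumed). It is
not the patchset's alloc_hooks(): struct alloc_tag is a simplified stand-in,
and current_tag, last_charged, pool_alloc(), and run_demo() are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the kernel's struct alloc_tag (assumption). */
struct alloc_tag {
	const char *file;
	int line;
	size_t calls;		/* allocations charged to this callsite */
};

static struct alloc_tag *current_tag;	/* tag to charge; not thread-safe */
static struct alloc_tag *last_charged;	/* only used by the demo below */

/*
 * Evaluate an allocation expression with a tag that is static to the
 * callsite where the macro expands. ({ ... }) is a GNU C extension.
 */
#define alloc_hooks(expr)						\
({									\
	static struct alloc_tag _tag = { __FILE__, __LINE__, 0 };	\
	struct alloc_tag *_old = current_tag;				\
	current_tag = &_tag;						\
	__typeof__(expr) _ret = (expr);					\
	current_tag = _old;						\
	_ret;								\
})

/* Step 1: the real allocator is renamed to *_noprof. */
static void *alloc_foo_noprof(size_t size)
{
	if (current_tag) {
		current_tag->calls++;
		last_charged = current_tag;
	}
	return malloc(size);
}

/* Step 2: the original name becomes the hooked version. */
#define alloc_foo(...) alloc_hooks(alloc_foo_noprof(__VA_ARGS__))

/* An outer wrapper (mempool-like) uses the _noprof variant internally... */
static void *pool_alloc_noprof(size_t size)
{
	return alloc_foo_noprof(size);
}

/* ...so the allocation is charged to whoever calls pool_alloc(). */
#define pool_alloc(...) alloc_hooks(pool_alloc_noprof(__VA_ARGS__))

static int run_demo(void)
{
	struct alloc_tag *pool_site;

	for (int i = 0; i < 3; i++)
		free(pool_alloc(16));	/* one callsite, hit three times */
	pool_site = last_charged;
	assert(pool_site->calls == 3);

	free(alloc_foo(8));		/* a different callsite */
	assert(last_charged != pool_site);
	assert(last_charged->calls == 1);
	return 0;
}
```

Calling run_demo() from a main() exercises both callsites. Because
pool_alloc_noprof() calls alloc_foo_noprof() rather than alloc_foo(), the
accounting happens once, at the outermost hooked call, which is the reason the
_noprof variants are needed in the first place.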
On 2/12/24 22:38, Suren Baghdasaryan wrote:
> Currently slab pages can store only vectors of obj_cgroup pointers in
> page->memcg_data. Introduce slabobj_ext structure to allow more data
> to be stored for each slab object. Wrap obj_cgroup into slabobj_ext
> to support current functionality while allowing to extend slabobj_ext
> in the future.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
..
> +static inline bool need_slab_obj_ext(void)
> +{
> + /*
> + * CONFIG_MEMCG_KMEM creates vector of obj_cgroup objects conditionally
> + * inside memcg_slab_post_alloc_hook. No other users for now.
> + */
> + return false;
> +}
> +
> +static inline struct slabobj_ext *
> +prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
> +{
> + struct slab *slab;
> +
> + if (!p)
> + return NULL;
> +
> + if (!need_slab_obj_ext())
> + return NULL;
> +
> + slab = virt_to_slab(p);
> + if (!slab_obj_exts(slab) &&
> + WARN(alloc_slab_obj_exts(slab, s, flags, false),
> + "%s, %s: Failed to create slab extension vector!\n",
> + __func__, s->name))
> + return NULL;
> +
> + return slab_obj_exts(slab) + obj_to_index(s, slab, p);
This is called in slab_post_alloc_hook() and the result stored to obj_exts
but unused. Maybe introduce this only in a later patch where it becomes
relevant?
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -201,6 +201,54 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
> return NULL;
> }
>
> +#ifdef CONFIG_SLAB_OBJ_EXT
> +/*
> + * The allocated objcg pointers array is not accounted directly.
> + * Moreover, it should not come from DMA buffer and is not readily
> + * reclaimable. So those GFP bits should be masked off.
> + */
> +#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | \
> + __GFP_ACCOUNT | __GFP_NOFAIL)
> +
> +int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> + gfp_t gfp, bool new_slab)
Since you're moving this function between files anyway, could you please
instead move it to mm/slub.c. I expect we'll eventually (maybe even soon)
move the rest of performance sensitive kmemcg hooks there as well to make
inlining possible.
> +{
> + unsigned int objects = objs_per_slab(s, slab);
> + unsigned long obj_exts;
> + void *vec;
> +
> + gfp &= ~OBJCGS_CLEAR_MASK;
> + vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> + slab_nid(slab));
> + if (!vec)
> + return -ENOMEM;
> +
> + obj_exts = (unsigned long)vec;
> +#ifdef CONFIG_MEMCG
> + obj_exts |= MEMCG_DATA_OBJEXTS;
> +#endif
> + if (new_slab) {
> + /*
> + * If the slab is brand new and nobody can yet access its
> + * obj_exts, no synchronization is required and obj_exts can
> + * be simply assigned.
> + */
> + slab->obj_exts = obj_exts;
> + } else if (cmpxchg(&slab->obj_exts, 0, obj_exts)) {
> + /*
> + * If the slab is already in use, somebody can allocate and
> + * assign slabobj_exts in parallel. In this case the existing
> + * objcg vector should be reused.
> + */
> + kfree(vec);
> + return 0;
> + }
> +
> + kmemleak_not_leak(vec);
> + return 0;
> +}
> +#endif /* CONFIG_SLAB_OBJ_EXT */
> +
> static struct kmem_cache *create_cache(const char *name,
> unsigned int object_size, unsigned int align,
> slab_flags_t flags, unsigned int useroffset,
> diff --git a/mm/slub.c b/mm/slub.c
> index 2ef88bbf56a3..1eb1050814aa 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -683,10 +683,10 @@ static inline bool __slab_update_freelist(struct kmem_cache *s, struct slab *sla
>
> if (s->flags & __CMPXCHG_DOUBLE) {
> ret = __update_freelist_fast(slab, freelist_old, counters_old,
> - freelist_new, counters_new);
> + freelist_new, counters_new);
> } else {
> ret = __update_freelist_slow(slab, freelist_old, counters_old,
> - freelist_new, counters_new);
> + freelist_new, counters_new);
> }
> if (likely(ret))
> return true;
> @@ -710,13 +710,13 @@ static inline bool slab_update_freelist(struct kmem_cache *s, struct slab *slab,
>
> if (s->flags & __CMPXCHG_DOUBLE) {
> ret = __update_freelist_fast(slab, freelist_old, counters_old,
> - freelist_new, counters_new);
> + freelist_new, counters_new);
> } else {
> unsigned long flags;
>
> local_irq_save(flags);
> ret = __update_freelist_slow(slab, freelist_old, counters_old,
> - freelist_new, counters_new);
> + freelist_new, counters_new);
I can see the mixing of tabs and spaces is wrong, but perhaps don't fix it as
part of this series?
> local_irq_restore(flags);
> }
> if (likely(ret))
> @@ -1881,13 +1881,25 @@ static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s)
> NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B;
> }
>
> -#ifdef CONFIG_MEMCG_KMEM
> -static inline void memcg_free_slab_cgroups(struct slab *slab)
> +#ifdef CONFIG_SLAB_OBJ_EXT
> +static inline void free_slab_obj_exts(struct slab *slab)
Right, freeing is already here, so it makes sense to put the allocation here as well.
> @@ -3817,6 +3820,7 @@ void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg,
> kmemleak_alloc_recursive(p[i], s->object_size, 1,
> s->flags, init_flags);
> kmsan_slab_alloc(s, p[i], init_flags);
> + obj_exts = prepare_slab_obj_exts_hook(s, flags, p[i]);
Yeah here's the hook used. Doesn't it generate a compiler warning? Maybe at
least postpone the call until the result is further used.
> }
>
> memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
On Mon, 2024-02-12 at 13:38 -0800, Suren Baghdasaryan wrote:
> Memory allocation, v3 and final:
>
> Overview:
> Low overhead [1] per-callsite memory allocation profiling. Not just for debug
> kernels, overhead low enough to be deployed in production.
>
> We're aiming to get this in the next merge window, for 6.9. The feedback
> we've gotten has been that even out of tree this patchset has already
> been useful, and there's a significant amount of other work gated on the
> code tagging functionality included in this patchset [2].
>
> Example output:
> root@moria-kvm:~# sort -h /proc/allocinfo|tail
> 3.11MiB 2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode
> 3.52MiB 225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node
> 3.75MiB 960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
> 4.00MiB 2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio
> 10.5MiB 168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs
> 14.0MiB 3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof
> 26.8MiB 6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof
> 64.5MiB 98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init
> 98.7MiB 25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof
> 125MiB 7357 mm/slub.c:2201 module:slub func:alloc_slab_page
>
> Since v2:
> - tglx noticed a circular header dependency between sched.h and percpu.h;
> a bunch of header cleanups were merged into 6.8 to ameliorate this [3].
>
> - a number of improvements, moving alloc_hooks() annotations to the
> correct place for better tracking (mempool), and bugfixes.
>
> - looked at alternate hooking methods.
> There were suggestions on alternate methods (compiler attribute,
> trampolines), but they wouldn't have made the patchset any cleaner
> (we still need to have different function versions for accounting vs. no
> accounting to control at which point in a call chain the accounting
> happens), and they would have added a dependency on toolchain
> support.
>
> Usage:
> kconfig options:
> - CONFIG_MEM_ALLOC_PROFILING
> - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> - CONFIG_MEM_ALLOC_PROFILING_DEBUG
> adds warnings for allocations that weren't accounted because of a
> missing annotation
>
> sysctl:
> /proc/sys/vm/mem_profiling
>
> Runtime info:
> /proc/allocinfo
>
> Notes:
>
> [1]: Overhead
> To measure the overhead we are comparing the following configurations:
> (1) Baseline with CONFIG_MEMCG_KMEM=n
> (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n)
> (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y)
> (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
> (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
>
Thanks for the work on this patchset; it is quite useful.
A clarification question on the data:
I assume configs (2), (3) and (4) have CONFIG_MEMCG_KMEM=n, right?
If so, do you have similar data for configs (2), (3) and (4) but with
CONFIG_MEMCG_KMEM=y for comparison with (5)?
Tim
> Performance overhead:
> To evaluate performance we implemented an in-kernel test executing
> multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> affinity set to a specific CPU to minimize the noise. Below are results
> from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
> 56 core Intel Xeon:
>
>                         kmalloc             pgalloc
> (1 baseline)            6.764s              16.902s
> (2 default disabled)    6.793s (+0.43%)     17.007s (+0.62%)
> (3 default enabled)     7.197s (+6.40%)     23.666s (+40.02%)
> (4 runtime enabled)     7.405s (+9.48%)     23.901s (+41.41%)
> (5 memcg)               13.388s (+97.94%)   48.460s (+186.71%)
>
On Wed, Feb 14, 2024 at 10:44 AM Andy Shevchenko
<[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 01:38:46PM -0800, Suren Baghdasaryan wrote:
> > Memory allocation, v3 and final:
>
> Would be nice to have --base added to cover letter. The very first patch
> can't be applied on today's Linux Next.
Sorry about that. It was based on Linus' ToT at the time of posting
(7521f258ea30 Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm). I also applied
it to mm-unstable with only one trivial merge conflict in one of the
patches.
>
> --
> With Best Regards,
> Andy Shevchenko
>
>
>
On Wed, Feb 14, 2024 at 9:59 AM Vlastimil Babka <[email protected]> wrote:
>
> On 2/12/24 22:38, Suren Baghdasaryan wrote:
> > Currently slab pages can store only vectors of obj_cgroup pointers in
> > page->memcg_data. Introduce slabobj_ext structure to allow more data
> > to be stored for each slab object. Wrap obj_cgroup into slabobj_ext
> > to support current functionality while allowing to extend slabobj_ext
> > in the future.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>
> ...
>
> > +static inline bool need_slab_obj_ext(void)
> > +{
> > + /*
> > + * CONFIG_MEMCG_KMEM creates vector of obj_cgroup objects conditionally
> > + * inside memcg_slab_post_alloc_hook. No other users for now.
> > + */
> > + return false;
> > +}
> > +
> > +static inline struct slabobj_ext *
> > +prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
> > +{
> > + struct slab *slab;
> > +
> > + if (!p)
> > + return NULL;
> > +
> > + if (!need_slab_obj_ext())
> > + return NULL;
> > +
> > + slab = virt_to_slab(p);
> > + if (!slab_obj_exts(slab) &&
> > + WARN(alloc_slab_obj_exts(slab, s, flags, false),
> > + "%s, %s: Failed to create slab extension vector!\n",
> > + __func__, s->name))
> > + return NULL;
> > +
> > + return slab_obj_exts(slab) + obj_to_index(s, slab, p);
>
> This is called in slab_post_alloc_hook() and the result stored to obj_exts
> but unused. Maybe introduce this only in a later patch where it becomes
> relevant?
Ack. I'll move it into the patch where we start using obj_exts.
>
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -201,6 +201,54 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
> > return NULL;
> > }
> >
> > +#ifdef CONFIG_SLAB_OBJ_EXT
> > +/*
> > + * The allocated objcg pointers array is not accounted directly.
> > + * Moreover, it should not come from DMA buffer and is not readily
> > + * reclaimable. So those GFP bits should be masked off.
> > + */
> > +#define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | \
> > + __GFP_ACCOUNT | __GFP_NOFAIL)
> > +
> > +int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > + gfp_t gfp, bool new_slab)
>
> Since you're moving this function between files anyway, could you please
> instead move it to mm/slub.c. I expect we'll eventually (maybe even soon)
> move the rest of performance sensitive kmemcg hooks there as well to make
> inlining possible.
Will do.
>
> > +{
> > + unsigned int objects = objs_per_slab(s, slab);
> > + unsigned long obj_exts;
> > + void *vec;
> > +
> > + gfp &= ~OBJCGS_CLEAR_MASK;
> > + vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > + slab_nid(slab));
> > + if (!vec)
> > + return -ENOMEM;
> > +
> > + obj_exts = (unsigned long)vec;
> > +#ifdef CONFIG_MEMCG
> > + obj_exts |= MEMCG_DATA_OBJEXTS;
> > +#endif
> > + if (new_slab) {
> > + /*
> > + * If the slab is brand new and nobody can yet access its
> > + * obj_exts, no synchronization is required and obj_exts can
> > + * be simply assigned.
> > + */
> > + slab->obj_exts = obj_exts;
> > + } else if (cmpxchg(&slab->obj_exts, 0, obj_exts)) {
> > + /*
> > + * If the slab is already in use, somebody can allocate and
> > + * assign slabobj_exts in parallel. In this case the existing
> > + * objcg vector should be reused.
> > + */
> > + kfree(vec);
> > + return 0;
> > + }
> > +
> > + kmemleak_not_leak(vec);
> > + return 0;
> > +}
> > +#endif /* CONFIG_SLAB_OBJ_EXT */
> > +
> > static struct kmem_cache *create_cache(const char *name,
> > unsigned int object_size, unsigned int align,
> > slab_flags_t flags, unsigned int useroffset,
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 2ef88bbf56a3..1eb1050814aa 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -683,10 +683,10 @@ static inline bool __slab_update_freelist(struct kmem_cache *s, struct slab *sla
> >
> > if (s->flags & __CMPXCHG_DOUBLE) {
> > ret = __update_freelist_fast(slab, freelist_old, counters_old,
> > - freelist_new, counters_new);
> > + freelist_new, counters_new);
> > } else {
> > ret = __update_freelist_slow(slab, freelist_old, counters_old,
> > - freelist_new, counters_new);
> > + freelist_new, counters_new);
> > }
> > if (likely(ret))
> > return true;
> > @@ -710,13 +710,13 @@ static inline bool slab_update_freelist(struct kmem_cache *s, struct slab *slab,
> >
> > if (s->flags & __CMPXCHG_DOUBLE) {
> > ret = __update_freelist_fast(slab, freelist_old, counters_old,
> > - freelist_new, counters_new);
> > + freelist_new, counters_new);
> > } else {
> > unsigned long flags;
> >
> > local_irq_save(flags);
> > ret = __update_freelist_slow(slab, freelist_old, counters_old,
> > - freelist_new, counters_new);
> > + freelist_new, counters_new);
>
> I can see the mixing of tabs and spaces is wrong but perhaps not fix it as
> part of the series?
I'll fix them in the next version.
>
> > local_irq_restore(flags);
> > }
> > if (likely(ret))
> > @@ -1881,13 +1881,25 @@ static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s)
> > NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B;
> > }
> >
> > -#ifdef CONFIG_MEMCG_KMEM
> > -static inline void memcg_free_slab_cgroups(struct slab *slab)
> > +#ifdef CONFIG_SLAB_OBJ_EXT
> > +static inline void free_slab_obj_exts(struct slab *slab)
>
> Right, freeing is already here, so it makes sense to put the allocation here as well.
>
> > @@ -3817,6 +3820,7 @@ void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg,
> > kmemleak_alloc_recursive(p[i], s->object_size, 1,
> > s->flags, init_flags);
> > kmsan_slab_alloc(s, p[i], init_flags);
> > + obj_exts = prepare_slab_obj_exts_hook(s, flags, p[i]);
>
> Yeah here's the hook used. Doesn't it generate a compiler warning? Maybe at
> least postpone the call until the result is further used.
Yes, I'll move that into the patch where we start using it.
Thanks for the review, Vlastimil!
>
> > }
> >
> > memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
>
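For readers following the alloc_slab_obj_exts() hunk quoted above, here is a minimal userspace sketch of the same publish-or-reuse pattern, written with C11 atomics in place of the kernel's cmpxchg(). All demo_* names are hypothetical stand-ins, not kernel API, and the memcg flag bit and kmemleak annotation are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct slabobj_ext. */
struct demo_ext {
	void *owner;
};

/* Hypothetical stand-in for struct slab: only the fields the sketch needs. */
struct demo_slab {
	_Atomic uintptr_t obj_exts;	/* 0 until an extension vector is published */
	unsigned int objects;		/* objects per slab */
};

/*
 * Allocate an extension vector and publish it in slab->obj_exts.
 * Returns 0 on success (including "somebody else already published one")
 * and -ENOMEM if the vector could not be allocated.
 */
static int demo_alloc_obj_exts(struct demo_slab *slab, int new_slab)
{
	uintptr_t expected = 0;
	void *vec;

	assert(slab->objects > 0);
	vec = calloc(slab->objects, sizeof(struct demo_ext));
	if (!vec)
		return -ENOMEM;

	if (new_slab) {
		/* Nobody else can see a brand-new slab yet: plain store. */
		atomic_store_explicit(&slab->obj_exts, (uintptr_t)vec,
				      memory_order_relaxed);
	} else if (!atomic_compare_exchange_strong(&slab->obj_exts, &expected,
						   (uintptr_t)vec)) {
		/* Lost the race: keep the existing vector, drop ours. */
		free(vec);
	}
	return 0;
}
```

Calling demo_alloc_obj_exts() twice on the same slab leaves the first vector in place; the second allocation is freed and the call still returns 0, which matches the "existing vector should be reused" comment in the patch.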
On Wed, Feb 14, 2024 at 9:52 AM Kent Overstreet
<[email protected]> wrote:
>
> On Wed, Feb 14, 2024 at 08:55:48AM -0800, Andrew Morton wrote:
> > On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <[email protected]> wrote:
> >
> > > > > If you think you can easily achieve what Michal requested without all that,
> > > > > good.
> > > >
> > > > He requested something?
> > >
> > > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> > > possible until the compiler feature is developed and deployed. And it
> > > still would require changes to the headers, so don't think it's worth
> > > delaying the feature for years.
> >
> > Can we please be told much more about this compiler feature?
> > Description of what it is, what it does, how it will affect this kernel
> > feature, etc.
> >
> > Who is developing it and when can we expect it to become available?
> >
> > Will we be able to migrate to it without back-compatibility concerns?
> > (I think "you need quite recent gcc for memory profiling" is
> > reasonable).
> >
> >
> >
> > Because: if the maintainability issues which Michal describes will be
> > significantly addressed with the gcc support then we're kinda reviewing
> > the wrong patchset. Yes, it may be a maintenance burden initially, but
> > at some (yet to be revealed) time in the future, this will be addressed
> > with the gcc support?
>
> Even if we had compiler magic, after considering it more I don't think
> the patchset would be improved by it - I would still prefer to stick
> with the macro approach.
>
> There's also a lot of unresolved questions about whether the compiler
> approach would even end up being what we need; we need macro expansion to
> happen in the caller of the allocation function
For the record, that's what this attribute will be doing. So it should
cover our usecase.
> , and that's another
> level of hooking that I don't think the compiler people are even
> considering yet, since cpp runs before the main part of the compiler; if
> C macros worked and were implemented more like Rust macros I'm sure it
> could be done - in fact, I think this could all be done in Rust
> _without_ any new compiler support - but in C, this is a lot to ask.
>
> Let's look at the instrumentation again. There's two steps:
>
> - Renaming the original function to _noprof
> - Adding a hooked version of the original function.
>
> We need to do the renaming regardless of what approach we take in order
> to correctly handle allocations that happen inside the context of an
> existing alloc tag hook but should not be accounted to the outer
> context; we do that by selecting the alloc_foo() or alloc_foo_noprof()
> version as appropriate.
>
> It's important to get this right; consider slab object extension
> vectors or the slab allocator allocating pages from the page allocator.
>
> Second step, adding a hooked version of the original function. We do
> that with
>
> #define alloc_foo(...) alloc_hooks(alloc_foo_noprof(__VA_ARGS__))
>
> That's pretty clean, if you ask me. The only way to make it more succinct
> would be if it were possible for a C macro to define a new macro, then it
> could be just
>
> alloc_fn(alloc_foo);
>
> But honestly, the former is probably preferable anyways from a ctags/cscope POV.
On Wed, Feb 14, 2024 at 11:24:23AM -0800, Suren Baghdasaryan wrote:
> On Wed, Feb 14, 2024 at 9:52 AM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Wed, Feb 14, 2024 at 08:55:48AM -0800, Andrew Morton wrote:
> > > On Tue, 13 Feb 2024 14:59:11 -0800 Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > > > > If you think you can easily achieve what Michal requested without all that,
> > > > > > good.
> > > > >
> > > > > He requested something?
> > > >
> > > > Yes, a cleaner instrumentation. Unfortunately the cleanest one is not
> > > > possible until the compiler feature is developed and deployed. And it
> > > > still would require changes to the headers, so don't think it's worth
> > > > delaying the feature for years.
> > >
> > > Can we please be told much more about this compiler feature?
> > > Description of what it is, what it does, how it will affect this kernel
> > > feature, etc.
> > >
> > > Who is developing it and when can we expect it to become available?
> > >
> > > Will we be able to migrate to it without back-compatibility concerns?
> > > (I think "you need quite recent gcc for memory profiling" is
> > > reasonable).
> > >
> > >
> > >
> > > Because: if the maintainability issues which Michal describes will be
> > > significantly addressed with the gcc support then we're kinda reviewing
> > > the wrong patchset. Yes, it may be a maintenance burden initially, but
> > > at some (yet to be revealed) time in the future, this will be addressed
> > > with the gcc support?
> >
> > Even if we had compiler magic, after considering it more I don't think
> > the patchset would be improved by it - I would still prefer to stick
> > with the macro approach.
> >
> > There's also a lot of unresolved questions about whether the compiler
> > approach would even end up being what we need; we need macro expansion to
> > happen in the caller of the allocation function
>
> For the record, that's what this attribute will be doing. So it should
> cover our usecase.
That wasn't clear in the meeting we had the other day; all that was
discussed there was the attribute syntax, as I recall.
So say that does work out (and I don't think that's a given; if I were a
compiler person I don't think I'd be interested in this strange half
macro, half inline function beast); all that has accomplished is to get
rid of the need for the renaming - the _noprof() versions of functions.
So then how do you distinguish where in the callstack the accounting
happens?
If you say "it happens at the outermost wrapper", then what happens is
- Extra overhead for all the inner wrapper invocations, where they have
to now check "actually, we already have an alloc tag, don't do
anything". That's a cost, and given how much time we spent shaving
cycles and branches during development it's not one we want.
- Inner allocations that shouldn't be accounted to the outer context
are now a major problem, because they silently will be accounted
there and never noticed.
With our approach, inner allocations are by default (i.e. when we
haven't switched them to the _noprof() variant) accounted to their
own alloc tag; that way, when we're reading the /proc/allocinfo
output, we can examine them and check if they should be collapsed to
the outer context. With this approach they won't be seen.
So no, we still don't want the compiler approach.
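To make the hooked vs. _noprof distinction concrete, here is a rough userspace sketch of the macro approach. It is not the patchset's implementation: every demo_* name is hypothetical, the "current tag" is a plain global rather than per-task state, and the macro relies on GNU C statement expressions and __typeof__ (GCC/Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a per-callsite allocation tag. */
struct demo_tag {
	const char *file;
	int line;
	size_t bytes;
	size_t calls;
};

/* Tag installed by the innermost active hook; NULL means "not accounted". */
static struct demo_tag *demo_current_tag;
/* Last tag that was charged, kept only so the result can be inspected. */
static struct demo_tag *demo_last_charged;

/* Unhooked allocator: charges whatever tag the caller installed, if any. */
static void *demo_alloc_noprof(size_t size)
{
	void *p;

	assert(size > 0);
	p = malloc(size);
	if (p && demo_current_tag) {
		demo_current_tag->bytes += size;
		demo_current_tag->calls++;
		demo_last_charged = demo_current_tag;
	}
	return p;
}

/* Expands in the caller, so the static tag is unique per call site. */
#define demo_alloc_hooks(do_alloc)					\
({									\
	static struct demo_tag _tag = { __FILE__, __LINE__, 0, 0 };	\
	struct demo_tag *_old = demo_current_tag;			\
	__typeof__(do_alloc) _res;					\
									\
	demo_current_tag = &_tag;					\
	_res = (do_alloc);						\
	demo_current_tag = _old;					\
	_res;								\
})

#define demo_alloc(size) demo_alloc_hooks(demo_alloc_noprof(size))

/* Helper left hooked: all of its callers are charged to the line inside it. */
static void *demo_helper_own_tag(size_t size)
{
	return demo_alloc(size);
}

/* Helper switched to _noprof: the hook moves out to the helper's callers. */
static void *demo_helper_noprof(size_t size)
{
	return demo_alloc_noprof(size);
}
#define demo_helper(size) demo_alloc_hooks(demo_helper_noprof(size))
```

With demo_helper_own_tag(), two callers end up on one tag (the line inside the helper), so the inner allocation stays visible as its own entry. With demo_helper(), each caller gets its own tag. Choosing between the two is the "where in the call chain does accounting happen" decision discussed above.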
> > > Performance overhead:
> > > To evaluate performance we implemented an in-kernel test executing
> > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> > > affinity set to a specific CPU to minimize the noise. Below are results
> > > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
> > > 56 core Intel Xeon:
> > >
> > > kmalloc pgalloc
> > > (1 baseline) 6.764s 16.902s
> > > (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%)
> > > (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%)
> > > (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%)
> > > (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%)
>
> (6 default disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%)
> (7 default enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%)
I think these numbers are very interesting for folks that already use
memcg. Specifically, the difference between 6 & 7, which seems to be
~0.85% and ~14.25%. IIUC, this means that the extra overhead is
relatively much lower if someone is already using memcgs.
>
> (6) shows a bit better performance than (5) but it's probably noise. I
> would expect them to be roughly the same. Hope this helps.
>
> > >
> >
> >
On Wed, Feb 14, 2024 at 12:17 PM Yosry Ahmed <[email protected]> wrote:
>
> > > > Performance overhead:
> > > > To evaluate performance we implemented an in-kernel test executing
> > > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> > > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> > > > affinity set to a specific CPU to minimize the noise. Below are results
> > > > from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
> > > > 56 core Intel Xeon:
> > > >
> > > > kmalloc pgalloc
> > > > (1 baseline) 6.764s 16.902s
> > > > (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%)
> > > > (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%)
> > > > (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%)
> > > > (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%)
> >
> > (6 default disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%)
> > (7 default enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%)
>
> I think these numbers are very interesting for folks that already use
> memcg. Specifically, the difference between 6 & 7, which seems to be
> ~0.85% and ~14.25%. IIUC, this means that the extra overhead is
> relatively much lower if someone is already using memcgs.
Well, yes, percentage-wise it's much lower. If you look at the
absolute difference between 6 & 7 vs 2 & 3, it's quite close.
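A quick way to check both readings of the table, sketched as plain C with the timings copied from the results above (the demo_* and cfgN names are made up for this illustration). Relative overhead from (6) to (7) works out to roughly 0.86% for kmalloc and 14.3% for pgalloc, against roughly 5.9% and 39.2% from (2) to (3). In absolute terms, the pgalloc deltas are about 6.86s vs 6.66s and the kmalloc deltas about 0.11s vs 0.40s.

```c
#include <assert.h>

/* Wall-clock seconds for one test configuration, copied from the table. */
struct demo_result {
	double kmalloc_s;
	double pgalloc_s;
};

static const struct demo_result cfg2 = {  6.793, 17.007 }; /* default disabled */
static const struct demo_result cfg3 = {  7.197, 23.666 }; /* default enabled */
static const struct demo_result cfg6 = { 13.332, 48.105 }; /* disabled + memcg */
static const struct demo_result cfg7 = { 13.446, 54.963 }; /* enabled + memcg */

/* Relative overhead of 'val' over 'base', in percent. */
static double demo_overhead_pct(double base, double val)
{
	assert(base > 0.0);
	return (val - base) / base * 100.0;
}

/* Absolute overhead of 'val' over 'base', in seconds. */
static double demo_overhead_abs(double base, double val)
{
	return val - base;
}
```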
>
> >
> > (6) shows a bit better performance than (5) but it's probably noise. I
> > would expect them to be roughly the same. Hope this helps.
> >
> > > >
> > >
> > >
On Mon, Feb 12, 2024 at 01:38:47PM -0800, Suren Baghdasaryan wrote:
> - string_get_size(size, 1, STRING_UNITS_2, buf, sizeof(buf));
> + string_get_size(size, 1, STRING_SIZE_BASE2, buf, sizeof(buf));
This patch could be a whole lot smaller if ...
> +++ b/include/linux/string_helpers.h
> @@ -17,14 +17,13 @@ static inline bool string_is_terminated(const char *s, int len)
> return memchr(s, '\0', len) ? true : false;
> }
>
> -/* Descriptions of the types of units to
> - * print in */
> -enum string_size_units {
> - STRING_UNITS_10, /* use powers of 10^3 (standard SI) */
> - STRING_UNITS_2, /* use binary powers of 2^10 */
> +enum string_size_flags {
> + STRING_SIZE_BASE2 = (1 << 0),
> + STRING_SIZE_NOSPACE = (1 << 1),
> + STRING_SIZE_NOBYTES = (1 << 2),
you just added:
#define STRING_UNITS_10 0
#define STRING_UNITS_2 STRING_SIZE_BASE2
and you wouldn't need to change any of the callers.
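A small standalone sketch of how the compatibility defines would let existing callers compile unchanged. The enum and the two defines are taken from the mail above; demo_format_size() is a hypothetical, integer-only toy formatter written for this illustration, and it is not string_get_size().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum string_size_flags {
	STRING_SIZE_BASE2   = (1 << 0),
	STRING_SIZE_NOSPACE = (1 << 1),
	STRING_SIZE_NOBYTES = (1 << 2),
};

/* Old names kept as aliases, so existing callers need no changes. */
#define STRING_UNITS_10	0
#define STRING_UNITS_2	STRING_SIZE_BASE2

/* Toy formatter: truncating integer division, units up to G/Gi only. */
static int demo_format_size(unsigned long long size, unsigned int flags,
			    char *buf, size_t len)
{
	static const char *const units_10[] = { "", "k", "M", "G" };
	static const char *const units_2[]  = { "", "Ki", "Mi", "Gi" };
	const char *const *units =
		(flags & STRING_SIZE_BASE2) ? units_2 : units_10;
	unsigned int base = (flags & STRING_SIZE_BASE2) ? 1024 : 1000;
	unsigned int i = 0;

	assert(buf && len > 0);
	while (size >= base && i < 3) {
		size /= base;
		i++;
	}
	return snprintf(buf, len, "%llu%s%s%s", size,
			(flags & STRING_SIZE_NOSPACE) ? "" : " ",
			units[i],
			(flags & STRING_SIZE_NOBYTES) ? "" : "B");
}
```

An old-style call such as demo_format_size(2048, STRING_UNITS_2, buf, sizeof(buf)) keeps working, while new callers can OR in STRING_SIZE_NOSPACE or STRING_SIZE_NOBYTES.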
On Wed, Feb 14, 2024 at 2:22 PM Dave Chinner <[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 01:39:11PM -0800, Suren Baghdasaryan wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > This adds an alloc_hooks() wrapper around kmem_alloc(), so that we can
> > have allocations accounted to the proper callsite.
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > fs/xfs/kmem.c | 4 ++--
> > fs/xfs/kmem.h | 10 ++++------
> > 2 files changed, 6 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index c557a030acfe..9aa57a4e2478 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -8,7 +8,7 @@
> > #include "xfs_trace.h"
> >
> > void *
> > -kmem_alloc(size_t size, xfs_km_flags_t flags)
> > +kmem_alloc_noprof(size_t size, xfs_km_flags_t flags)
> > {
> > int retries = 0;
> > gfp_t lflags = kmem_flags_convert(flags);
> > @@ -17,7 +17,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
> > trace_kmem_alloc(size, flags, _RET_IP_);
> >
> > do {
> > - ptr = kmalloc(size, lflags);
> > + ptr = kmalloc_noprof(size, lflags);
> > if (ptr || (flags & KM_MAYFAIL))
> > return ptr;
> > if (!(++retries % 100))
> > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> > index b987dc2c6851..c4cf1dc2a7af 100644
> > --- a/fs/xfs/kmem.h
> > +++ b/fs/xfs/kmem.h
> > @@ -6,6 +6,7 @@
> > #ifndef __XFS_SUPPORT_KMEM_H__
> > #define __XFS_SUPPORT_KMEM_H__
> >
> > +#include <linux/alloc_tag.h>
> > #include <linux/slab.h>
> > #include <linux/sched.h>
> > #include <linux/mm.h>
> > @@ -56,18 +57,15 @@ kmem_flags_convert(xfs_km_flags_t flags)
> > return lflags;
> > }
> >
> > -extern void *kmem_alloc(size_t, xfs_km_flags_t);
> > static inline void kmem_free(const void *ptr)
> > {
> > kvfree(ptr);
> > }
> >
> > +extern void *kmem_alloc_noprof(size_t, xfs_km_flags_t);
> > +#define kmem_alloc(...) alloc_hooks(kmem_alloc_noprof(__VA_ARGS__))
> >
> > -static inline void *
> > -kmem_zalloc(size_t size, xfs_km_flags_t flags)
> > -{
> > - return kmem_alloc(size, flags | KM_ZERO);
> > -}
> > +#define kmem_zalloc(_size, _flags) kmem_alloc((_size), (_flags) | KM_ZERO)
> >
> > /*
> > * Zone interfaces
> > --
> > 2.43.0.687.g38aa6559b0-goog
>
> These changes can be dropped - the fs/xfs/kmem.[ch] stuff is now
> gone in linux-xfs/for-next.
Thanks for the note. Will drop in the next submission.
>
> -Dave.
> --
> Dave Chinner
> [email protected]
* Vlastimil Babka <[email protected]> [240214 10:14]:
> On 2/13/24 03:08, Kent Overstreet wrote:
> > On Mon, Feb 12, 2024 at 04:31:14PM -0800, Kees Cook wrote:
> >> On Mon, Feb 12, 2024 at 01:39:09PM -0800, Suren Baghdasaryan wrote:
> >> > From: Kent Overstreet <[email protected]>
> >> >
> >> > It seems we need to be more forceful with the compiler on this one.
> >>
> >> Sure, but why?
> >
> > Wasn't getting inlined without it, and that's one we do want inlined -
> > it's only called in one place.
>
> It would be better to mention this in the changelog so it's clear this is
> for performance and not e.g. needed for the code tagging to work as expected.
Since it's not needed specifically for this set, can we take this patch
out of the set (and any others) and get them upstream first?
My hope is to reduce the count of 35 patches. Fewer patches might get
more reviews and small things like this (should be, are?) easy enough to
get out of the way. But also, it sounds worth doing on its own.
On Mon 12-02-24 13:39:17, Suren Baghdasaryan wrote:
[...]
> @@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
> #ifdef CONFIG_MEMORY_FAILURE
> printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
> #endif
> +#ifdef CONFIG_MEM_ALLOC_PROFILING
> + {
> + struct seq_buf s;
> + char *buf = kmalloc(4096, GFP_ATOMIC);
> +
> + if (buf) {
> + printk("Memory allocations:\n");
> + seq_buf_init(&s, buf, 4096);
> + alloc_tags_show_mem_report(&s);
> + printk("%s", buf);
> + kfree(buf);
> + }
> + }
> +#endif
I am pretty sure I have already objected to this. Memory allocations in
the oom path are simply no go unless there is absolutely no other way
around that. In this case the buffer could be preallocated.
--
Michal Hocko
SUSE Labs
On Thu, Feb 15, 2024 at 1:22 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 12-02-24 13:39:17, Suren Baghdasaryan wrote:
> [...]
> > @@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
> > #ifdef CONFIG_MEMORY_FAILURE
> > printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
> > #endif
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > + {
> > + struct seq_buf s;
> > + char *buf = kmalloc(4096, GFP_ATOMIC);
> > +
> > + if (buf) {
> > + printk("Memory allocations:\n");
> > + seq_buf_init(&s, buf, 4096);
> > + alloc_tags_show_mem_report(&s);
> > + printk("%s", buf);
> > + kfree(buf);
> > + }
> > + }
> > +#endif
>
> I am pretty sure I have already objected to this. Memory allocations in
> the oom path are simply no go unless there is absolutely no other way
> around that. In this case the buffer could be preallocated.
Good point. We will change this to a smaller buffer allocated on the
stack and will print records one-by-one. Thanks!
>
> --
> Michal Hocko
> SUSE Labs
On Thu 15-02-24 06:58:42, Suren Baghdasaryan wrote:
> On Thu, Feb 15, 2024 at 1:22 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 12-02-24 13:39:17, Suren Baghdasaryan wrote:
> > [...]
> > > @@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
> > > #ifdef CONFIG_MEMORY_FAILURE
> > > printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
> > > #endif
> > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > + {
> > > + struct seq_buf s;
> > > + char *buf = kmalloc(4096, GFP_ATOMIC);
> > > +
> > > + if (buf) {
> > > + printk("Memory allocations:\n");
> > > + seq_buf_init(&s, buf, 4096);
> > > + alloc_tags_show_mem_report(&s);
> > > + printk("%s", buf);
> > > + kfree(buf);
> > > + }
> > > + }
> > > +#endif
> >
> > I am pretty sure I have already objected to this. Memory allocations in
> > the oom path are simply no go unless there is absolutely no other way
> > around that. In this case the buffer could be preallocated.
>
> Good point. We will change this to a smaller buffer allocated on the
> stack and will print records one-by-one. Thanks!
__show_mem could be called with very deep call chains. A single
pre-allocated buffer should do just fine.
--
Michal Hocko
SUSE Labs
On Thu, Feb 15, 2024 at 8:45 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 15-02-24 06:58:42, Suren Baghdasaryan wrote:
> > On Thu, Feb 15, 2024 at 1:22 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 12-02-24 13:39:17, Suren Baghdasaryan wrote:
> > > [...]
> > > > @@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
> > > > #ifdef CONFIG_MEMORY_FAILURE
> > > > printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
> > > > #endif
> > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > > + {
> > > > + struct seq_buf s;
> > > > + char *buf = kmalloc(4096, GFP_ATOMIC);
> > > > +
> > > > + if (buf) {
> > > > + printk("Memory allocations:\n");
> > > > + seq_buf_init(&s, buf, 4096);
> > > > + alloc_tags_show_mem_report(&s);
> > > > + printk("%s", buf);
> > > > + kfree(buf);
> > > > + }
> > > > + }
> > > > +#endif
> > >
> > > I am pretty sure I have already objected to this. Memory allocations in
> > > the oom path are simply no go unless there is absolutely no other way
> > > around that. In this case the buffer could be preallocated.
> >
> > Good point. We will change this to a smaller buffer allocated on the
> > stack and will print records one-by-one. Thanks!
>
> __show_mem could be called with a very deep call chains. A single
> pre-allocated buffer should just do ok.
Ack. Will do.
>
> --
> Michal Hocko
> SUSE Labs
On Thu, Feb 15, 2024 at 08:47:59AM -0800, Suren Baghdasaryan wrote:
> On Thu, Feb 15, 2024 at 8:45 AM Michal Hocko <[email protected]> wrote:
> >
> > On Thu 15-02-24 06:58:42, Suren Baghdasaryan wrote:
> > > On Thu, Feb 15, 2024 at 1:22 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Mon 12-02-24 13:39:17, Suren Baghdasaryan wrote:
> > > > [...]
> > > > > @@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
> > > > > #ifdef CONFIG_MEMORY_FAILURE
> > > > > printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
> > > > > #endif
> > > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > > > + {
> > > > > + struct seq_buf s;
> > > > > + char *buf = kmalloc(4096, GFP_ATOMIC);
> > > > > +
> > > > > + if (buf) {
> > > > > + printk("Memory allocations:\n");
> > > > > + seq_buf_init(&s, buf, 4096);
> > > > > + alloc_tags_show_mem_report(&s);
> > > > > + printk("%s", buf);
> > > > > + kfree(buf);
> > > > > + }
> > > > > + }
> > > > > +#endif
> > > >
> > > > I am pretty sure I have already objected to this. Memory allocations in
> > > > the oom path are simply no go unless there is absolutely no other way
> > > > around that. In this case the buffer could be preallocated.
> > >
> > > Good point. We will change this to a smaller buffer allocated on the
> > > stack and will print records one-by-one. Thanks!
> >
> > __show_mem could be called with a very deep call chains. A single
> > pre-allocated buffer should just do ok.
>
> Ack. Will do.
No, we're not going to permanently burn 4k here.
It's completely fine if the allocation fails, there's nothing "unsafe"
about doing a GFP_ATOMIC allocation here.
On Thu 15-02-24 13:29:40, Kent Overstreet wrote:
> On Thu, Feb 15, 2024 at 08:47:59AM -0800, Suren Baghdasaryan wrote:
> > On Thu, Feb 15, 2024 at 8:45 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Thu 15-02-24 06:58:42, Suren Baghdasaryan wrote:
> > > > On Thu, Feb 15, 2024 at 1:22 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Mon 12-02-24 13:39:17, Suren Baghdasaryan wrote:
> > > > > [...]
> > > > > > @@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
> > > > > > #ifdef CONFIG_MEMORY_FAILURE
> > > > > > printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
> > > > > > #endif
> > > > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > > > > + {
> > > > > > + struct seq_buf s;
> > > > > > + char *buf = kmalloc(4096, GFP_ATOMIC);
> > > > > > +
> > > > > > + if (buf) {
> > > > > > + printk("Memory allocations:\n");
> > > > > > + seq_buf_init(&s, buf, 4096);
> > > > > > + alloc_tags_show_mem_report(&s);
> > > > > > + printk("%s", buf);
> > > > > > + kfree(buf);
> > > > > > + }
> > > > > > + }
> > > > > > +#endif
> > > > >
> > > > > I am pretty sure I have already objected to this. Memory allocations in
> > > > > the oom path are simply no go unless there is absolutely no other way
> > > > > around that. In this case the buffer could be preallocated.
> > > >
> > > > Good point. We will change this to a smaller buffer allocated on the
> > > > stack and will print records one-by-one. Thanks!
> > >
> > > __show_mem could be called with a very deep call chains. A single
> > > pre-allocated buffer should just do ok.
> >
> > Ack. Will do.
>
> No, we're not going to permanently burn 4k here.
>
> It's completely fine if the allocation fails, there's nothing "unsafe"
> about doing a GFP_ATOMIC allocation here.
Nobody is talking about safety. This is just the wrong thing to do when
you are likely in an OOM situation. This is a situation where your
GFP_ATOMIC allocation is _likely_ to fail. Yes, yes, you will get some
additional headroom from the memory reserves, but you shouldn't rely on
that because that will make the output unreliable. Not something you want
in a situation where you really want to know that information.
Moreover, you do not need to preallocate a full page.
--
Michal Hocko
SUSE Labs
On Thu, Feb 15, 2024 at 10:33:53AM -0800, Suren Baghdasaryan wrote:
> On Thu, Feb 15, 2024 at 10:29 AM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Thu, Feb 15, 2024 at 08:47:59AM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Feb 15, 2024 at 8:45 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Thu 15-02-24 06:58:42, Suren Baghdasaryan wrote:
> > > > > On Thu, Feb 15, 2024 at 1:22 AM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > On Mon 12-02-24 13:39:17, Suren Baghdasaryan wrote:
> > > > > > [...]
> > > > > > > @@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
> > > > > > > #ifdef CONFIG_MEMORY_FAILURE
> > > > > > > printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
> > > > > > > #endif
> > > > > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > > > > > + {
> > > > > > > + struct seq_buf s;
> > > > > > > + char *buf = kmalloc(4096, GFP_ATOMIC);
> > > > > > > +
> > > > > > > + if (buf) {
> > > > > > > + printk("Memory allocations:\n");
> > > > > > > + seq_buf_init(&s, buf, 4096);
> > > > > > > + alloc_tags_show_mem_report(&s);
> > > > > > > + printk("%s", buf);
> > > > > > > + kfree(buf);
> > > > > > > + }
> > > > > > > + }
> > > > > > > +#endif
> > > > > >
> > > > > > I am pretty sure I have already objected to this. Memory allocations in
> > > > > > the oom path are simply no go unless there is absolutely no other way
> > > > > > around that. In this case the buffer could be preallocated.
> > > > >
> > > > > Good point. We will change this to a smaller buffer allocated on the
> > > > > stack and will print records one-by-one. Thanks!
> > > >
> > > > __show_mem could be called with a very deep call chains. A single
> > > > pre-allocated buffer should just do ok.
> > >
> > > Ack. Will do.
> >
> > No, we're not going to permanently burn 4k here.
>
> We don't need 4K here. Just enough to store one line and then print
> these 10 highest allocations one line at a time. This way we can also
> change that 10 to any higher number we like without any side effects.
There's no reason to make the change at all. If Michal thinks there's
something "dangerous" about allocating a buffer here, he needs to be able
to explain what it is.
On Thu, Feb 15, 2024 at 10:29 AM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, Feb 15, 2024 at 08:47:59AM -0800, Suren Baghdasaryan wrote:
> > On Thu, Feb 15, 2024 at 8:45 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Thu 15-02-24 06:58:42, Suren Baghdasaryan wrote:
> > > > On Thu, Feb 15, 2024 at 1:22 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Mon 12-02-24 13:39:17, Suren Baghdasaryan wrote:
> > > > > [...]
> > > > > > @@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
> > > > > > #ifdef CONFIG_MEMORY_FAILURE
> > > > > > printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
> > > > > > #endif
> > > > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > > > > + {
> > > > > > + struct seq_buf s;
> > > > > > + char *buf = kmalloc(4096, GFP_ATOMIC);
> > > > > > +
> > > > > > + if (buf) {
> > > > > > + printk("Memory allocations:\n");
> > > > > > + seq_buf_init(&s, buf, 4096);
> > > > > > + alloc_tags_show_mem_report(&s);
> > > > > > + printk("%s", buf);
> > > > > > + kfree(buf);
> > > > > > + }
> > > > > > + }
> > > > > > +#endif
> > > > >
> > > > > I am pretty sure I have already objected to this. Memory allocations in
> > > > > the oom path are simply no go unless there is absolutely no other way
> > > > > around that. In this case the buffer could be preallocated.
> > > >
> > > > Good point. We will change this to a smaller buffer allocated on the
> > > > stack and will print records one-by-one. Thanks!
> > >
> > > __show_mem could be called with a very deep call chains. A single
> > > pre-allocated buffer should just do ok.
> >
> > Ack. Will do.
>
> No, we're not going to permanently burn 4k here.
We don't need 4K here. Just enough to store one line and then print
these 10 highest allocations one line at a time. This way we can also
change that 10 to any higher number we like without any side effects.
>
> It's completely fine if the allocation fails, there's nothing "unsafe"
> about doing a GFP_ATOMIC allocation here.
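The "one line at a time" idea above can be sketched in userspace C. This is only an illustration under stated assumptions: struct tag_record, format_record() and report_top() are hypothetical names, not functions from the patchset, and the kernel version would go through seq_buf and printk rather than stdio. The point is that the buffer only has to hold a single line, so the number of records printed is independent of the buffer size.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical record: one allocation callsite and its byte count. */
struct tag_record {
	const char *callsite;
	unsigned long bytes;
};

#define LINE_BUF_SIZE 128	/* assumed to be enough for one line */

/*
 * Format one record into buf. Returns the number of characters stored,
 * not counting the terminating NUL; long lines are truncated to fit.
 */
static int format_record(char *buf, size_t size, const struct tag_record *r)
{
	int n;

	if (size == 0)
		return 0;
	n = snprintf(buf, size, "%12lu %s\n", r->bytes, r->callsite);
	if (n < 0)
		return 0;
	return (size_t)n >= size ? (int)size - 1 : n;
}

/*
 * Print at most max records, reusing a single preallocated line buffer.
 * Returns the number of lines emitted. A static buffer like this one
 * would need serialization against concurrent callers in real code.
 */
static size_t report_top(FILE *out, const struct tag_record *recs,
			 size_t nr, size_t max)
{
	static char line[LINE_BUF_SIZE];
	size_t i;

	for (i = 0; i < nr && i < max; i++) {
		format_record(line, sizeof(line), &recs[i]);
		fputs(line, out);
	}
	return i;
}
```

Raising max from 10 to any larger value changes only the loop bound; the memory cost stays at LINE_BUF_SIZE bytes.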
On 2/15/24 19:29, Kent Overstreet wrote:
> On Thu, Feb 15, 2024 at 08:47:59AM -0800, Suren Baghdasaryan wrote:
>> On Thu, Feb 15, 2024 at 8:45 AM Michal Hocko <[email protected]> wrote:
>> >
>> > On Thu 15-02-24 06:58:42, Suren Baghdasaryan wrote:
>> > > On Thu, Feb 15, 2024 at 1:22 AM Michal Hocko <[email protected]> wrote:
>> > > >
>> > > > On Mon 12-02-24 13:39:17, Suren Baghdasaryan wrote:
>> > > > [...]
>> > > > > @@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
>> > > > > #ifdef CONFIG_MEMORY_FAILURE
>> > > > > printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
>> > > > > #endif
>> > > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
>> > > > > + {
>> > > > > + struct seq_buf s;
>> > > > > + char *buf = kmalloc(4096, GFP_ATOMIC);
>> > > > > +
>> > > > > + if (buf) {
>> > > > > + printk("Memory allocations:\n");
>> > > > > + seq_buf_init(&s, buf, 4096);
>> > > > > + alloc_tags_show_mem_report(&s);
>> > > > > + printk("%s", buf);
>> > > > > + kfree(buf);
>> > > > > + }
>> > > > > + }
>> > > > > +#endif
>> > > >
>> > > > I am pretty sure I have already objected to this. Memory allocations in
>> > > > the oom path are simply no go unless there is absolutely no other way
>> > > > around that. In this case the buffer could be preallocated.
>> > >
>> > > Good point. We will change this to a smaller buffer allocated on the
>> > > stack and will print records one-by-one. Thanks!
>> >
>> > __show_mem could be called with a very deep call chains. A single
>> > pre-allocated buffer should just do ok.
>>
>> Ack. Will do.
>
> No, we're not going to permanently burn 4k here.
>
> It's completely fine if the allocation fails, there's nothing "unsafe"
> about doing a GFP_ATOMIC allocation here.
Well, I think without __GFP_NOWARN it will cause a warning and thus
recursion into __show_mem(), potentially infinite? Which is of course
trivial to fix, but I'd myself rather sacrifice a bit of memory to get this
potentially very useful output, if I enabled the profiling. The necessary
memory overhead of page_ext and slabobj_ext makes the printing buffer
overhead negligible in comparison?
On Thu, Feb 15, 2024 at 09:22:07PM +0100, Vlastimil Babka wrote:
> On 2/15/24 19:29, Kent Overstreet wrote:
> > On Thu, Feb 15, 2024 at 08:47:59AM -0800, Suren Baghdasaryan wrote:
> >> On Thu, Feb 15, 2024 at 8:45 AM Michal Hocko <[email protected]> wrote:
> >> >
> >> > On Thu 15-02-24 06:58:42, Suren Baghdasaryan wrote:
> >> > > On Thu, Feb 15, 2024 at 1:22 AM Michal Hocko <[email protected]> wrote:
> >> > > >
> >> > > > On Mon 12-02-24 13:39:17, Suren Baghdasaryan wrote:
> >> > > > [...]
> >> > > > > @@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
> >> > > > > #ifdef CONFIG_MEMORY_FAILURE
> >> > > > > printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
> >> > > > > #endif
> >> > > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> >> > > > > + {
> >> > > > > + struct seq_buf s;
> >> > > > > + char *buf = kmalloc(4096, GFP_ATOMIC);
> >> > > > > +
> >> > > > > + if (buf) {
> >> > > > > + printk("Memory allocations:\n");
> >> > > > > + seq_buf_init(&s, buf, 4096);
> >> > > > > + alloc_tags_show_mem_report(&s);
> >> > > > > + printk("%s", buf);
> >> > > > > + kfree(buf);
> >> > > > > + }
> >> > > > > + }
> >> > > > > +#endif
> >> > > >
> >> > > > I am pretty sure I have already objected to this. Memory allocations in
> >> > > > the oom path are simply no go unless there is absolutely no other way
> >> > > > around that. In this case the buffer could be preallocated.
> >> > >
> >> > > Good point. We will change this to a smaller buffer allocated on the
> >> > > stack and will print records one-by-one. Thanks!
> >> >
> >> > __show_mem could be called with a very deep call chains. A single
> >> > pre-allocated buffer should just do ok.
> >>
> >> Ack. Will do.
> >
> > No, we're not going to permanently burn 4k here.
> >
> > It's completely fine if the allocation fails, there's nothing "unsafe"
> > about doing a GFP_ATOMIC allocation here.
>
> Well, I think without __GFP_NOWARN it will cause a warning and thus
> recursion into __show_mem(), potentially infinite? Which is of course
> trivial to fix, but I'd myself rather sacrifice a bit of memory to get this
> potentially very useful output, if I enabled the profiling. The necessary
> memory overhead of page_ext and slabobj_ext makes the printing buffer
> overhead negligible in comparison?
__GFP_NOWARN is a good point, we should have that.
But - and correct me if I'm wrong here - doesn't an OOM kick in well
before GFP_ATOMIC 4k allocations are failing? I'd expect the system to
be well and truly hosed at that point.
If we want this report to be 100% reliable, then yes the preallocated
buffer makes sense - but I don't think 100% makes sense here; I think we
can accept ~99% and give back that 4k.
On 2/12/24 22:38, Suren Baghdasaryan wrote:
> Slab extension objects can't be allocated before slab infrastructure is
> initialized. Some caches, like kmem_cache and kmem_cache_node, are created
> before slab infrastructure is initialized. Objects from these caches can't
> have extension objects. Introduce SLAB_NO_OBJ_EXT slab flag to mark these
> caches and avoid creating extensions for objects allocated from these
> slabs.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/slab.h | 7 +++++++
> mm/slub.c | 5 +++--
> 2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index b5f5ee8308d0..3ac2fc830f0f 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -164,6 +164,13 @@
> #endif
> #define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */
>
> +#ifdef CONFIG_SLAB_OBJ_EXT
> +/* Slab created using create_boot_cache */
> +#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x20000000U)
There's
#define SLAB_SKIP_KFENCE ((slab_flags_t __force)0x20000000U)
already, so need some other one?
> +#else
> +#define SLAB_NO_OBJ_EXT 0
> +#endif
> +
> /*
> * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
> *
> diff --git a/mm/slub.c b/mm/slub.c
> index 1eb1050814aa..9fd96238ed39 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5650,7 +5650,8 @@ void __init kmem_cache_init(void)
> node_set(node, slab_nodes);
>
> create_boot_cache(kmem_cache_node, "kmem_cache_node",
> - sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN, 0, 0);
> + sizeof(struct kmem_cache_node),
> + SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
>
> hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
>
> @@ -5660,7 +5661,7 @@ void __init kmem_cache_init(void)
> create_boot_cache(kmem_cache, "kmem_cache",
> offsetof(struct kmem_cache, node) +
> nr_node_ids * sizeof(struct kmem_cache_node *),
> - SLAB_HWCACHE_ALIGN, 0, 0);
> + SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
>
> kmem_cache = bootstrap(&boot_kmem_cache);
> kmem_cache_node = bootstrap(&boot_kmem_cache_node);
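The collision pointed out above is the kind of thing a compile-time check can catch. Below is a minimal userspace sketch, assuming plain unsigned values instead of slab_flags_t. The two colliding values are the ones quoted in the review; SLAB_NO_OBJ_EXT_FIXED is a hypothetical replacement value picked only for this illustration, with no claim that this bit is actually free in include/linux/slab.h.

```c
#include <assert.h>

/* Values as quoted in the review. */
#define SLAB_SKIP_KFENCE	0x20000000U
#define SLAB_NO_OBJ_EXT_V3	0x20000000U	/* as posted: same bit */

/* Hypothetical replacement value, for illustration only. */
#define SLAB_NO_OBJ_EXT_FIXED	0x40000000U

/* True if exactly one bit is set in flag. */
static int is_single_bit(unsigned int flag)
{
	return flag != 0 && (flag & (flag - 1)) == 0;
}

/* True if the two flag values share at least one bit. */
static int flags_collide(unsigned int a, unsigned int b)
{
	return (a & b) != 0;
}

/* Build fails if the replacement value ever overlaps the existing flag. */
_Static_assert((SLAB_NO_OBJ_EXT_FIXED & SLAB_SKIP_KFENCE) == 0,
	       "slab flag bits must not overlap");
```

With the value as posted, any test of the form (flags & SLAB_SKIP_KFENCE) would also be true for every cache marked SLAB_NO_OBJ_EXT, which is why the shared bit matters.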
On Thu, Feb 15, 2024 at 10:31:06PM +0100, Vlastimil Babka wrote:
> On 2/12/24 22:38, Suren Baghdasaryan wrote:
> > Slab extension objects can't be allocated before slab infrastructure is
> > initialized. Some caches, like kmem_cache and kmem_cache_node, are created
> > before slab infrastructure is initialized. Objects from these caches can't
> > have extension objects. Introduce SLAB_NO_OBJ_EXT slab flag to mark these
> > caches and avoid creating extensions for objects allocated from these
> > slabs.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/slab.h | 7 +++++++
> > mm/slub.c | 5 +++--
> > 2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index b5f5ee8308d0..3ac2fc830f0f 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -164,6 +164,13 @@
> > #endif
> > #define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */
> >
> > +#ifdef CONFIG_SLAB_OBJ_EXT
> > +/* Slab created using create_boot_cache */
> > +#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x20000000U)
>
> There's
> #define SLAB_SKIP_KFENCE ((slab_flags_t __force)0x20000000U)
> already, so need some other one?
What's up with the order of flags in that file? They don't seem to
follow any particular ordering.
Seems like some cleanup is in order, but any history/context we should
know first?
On 2/12/24 22:38, Suren Baghdasaryan wrote:
> Use __GFP_NO_OBJ_EXT to prevent recursions when allocating slabobj_ext
> objects. Also prevent slabobj_ext allocations for kmem_cache objects.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> mm/slab.h | 6 ++++++
> mm/slab_common.c | 2 ++
> 2 files changed, 8 insertions(+)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 436a126486b5..f4ff635091e4 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -589,6 +589,12 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
> if (!need_slab_obj_ext())
> return NULL;
>
> + if (s->flags & SLAB_NO_OBJ_EXT)
> + return NULL;
> +
> + if (flags & __GFP_NO_OBJ_EXT)
> + return NULL;
Since we agreed to postpone this function, when it appears later it can have
those in.
> slab = virt_to_slab(p);
> if (!slab_obj_exts(slab) &&
> WARN(alloc_slab_obj_exts(slab, s, flags, false),
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 6bfa1810da5e..83fec2dd2e2d 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -218,6 +218,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> void *vec;
>
> gfp &= ~OBJCGS_CLEAR_MASK;
> + /* Prevent recursive extension vector allocation */
> + gfp |= __GFP_NO_OBJ_EXT;
And this could become part of 6/35 mm: introduce __GFP_NO_OBJ_EXT ... ?
> vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> slab_nid(slab));
> if (!vec)
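The recursion that __GFP_NO_OBJ_EXT is meant to stop can be shown with a small userspace model. Everything here is a stand-in: SKETCH_NO_OBJ_EXT plays the role of __GFP_NO_OBJ_EXT, and sketch_alloc() and alloc_ext_vector() are hypothetical names, not the kernel functions. The allocator creates an extension vector for each object, and the vector itself comes from the same allocator, so without the flag the second allocation would try to create a vector of its own.

```c
#include <stdlib.h>

#define SKETCH_NO_OBJ_EXT 0x1u	/* stand-in for __GFP_NO_OBJ_EXT */

static unsigned int ext_allocs;	/* number of extension vectors created */

static void *sketch_alloc(size_t size, unsigned int flags);

/*
 * Allocate the extension vector. It goes through the same allocator,
 * so the flag is forced on to keep the vector from getting a vector.
 */
static void *alloc_ext_vector(unsigned int flags)
{
	ext_allocs++;
	return sketch_alloc(64, flags | SKETCH_NO_OBJ_EXT);
}

static void *sketch_alloc(size_t size, unsigned int flags)
{
	void *p = malloc(size);

	if (p && !(flags & SKETCH_NO_OBJ_EXT)) {
		void *vec = alloc_ext_vector(flags);

		/* Sketch only: real code would attach vec to the slab. */
		free(vec);
	}
	return p;
}
```

Dropping the "| SKETCH_NO_OBJ_EXT" in alloc_ext_vector() makes this model recurse without bound, which is the failure mode the patch comment describes.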
On 2/15/24 22:37, Kent Overstreet wrote:
> On Thu, Feb 15, 2024 at 10:31:06PM +0100, Vlastimil Babka wrote:
>> On 2/12/24 22:38, Suren Baghdasaryan wrote:
>> > Slab extension objects can't be allocated before slab infrastructure is
>> > initialized. Some caches, like kmem_cache and kmem_cache_node, are created
>> > before slab infrastructure is initialized. Objects from these caches can't
>> > have extension objects. Introduce SLAB_NO_OBJ_EXT slab flag to mark these
>> > caches and avoid creating extensions for objects allocated from these
>> > slabs.
>> >
>> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>> > ---
>> > include/linux/slab.h | 7 +++++++
>> > mm/slub.c | 5 +++--
>> > 2 files changed, 10 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/include/linux/slab.h b/include/linux/slab.h
>> > index b5f5ee8308d0..3ac2fc830f0f 100644
>> > --- a/include/linux/slab.h
>> > +++ b/include/linux/slab.h
>> > @@ -164,6 +164,13 @@
>> > #endif
>> > #define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */
>> >
>> > +#ifdef CONFIG_SLAB_OBJ_EXT
>> > +/* Slab created using create_boot_cache */
>> > +#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x20000000U)
>>
>> There's
>> #define SLAB_SKIP_KFENCE ((slab_flags_t __force)0x20000000U)
>> already, so need some other one?
>
> What's up with the order of flags in that file? They don't seem to
> follow any particular ordering.
Seems mostly in increasing order, except commit 4fd0b46e89879 broke it for
SLAB_RECLAIM_ACCOUNT?
> Seems like some cleanup is in order, but any history/context we should
> know first?
Yeah noted, but no need to sidetrack you.
On Thu 15-02-24 15:33:30, Kent Overstreet wrote:
> On Thu, Feb 15, 2024 at 09:22:07PM +0100, Vlastimil Babka wrote:
> > On 2/15/24 19:29, Kent Overstreet wrote:
> > > On Thu, Feb 15, 2024 at 08:47:59AM -0800, Suren Baghdasaryan wrote:
> > >> On Thu, Feb 15, 2024 at 8:45 AM Michal Hocko <[email protected]> wrote:
> > >> >
> > >> > On Thu 15-02-24 06:58:42, Suren Baghdasaryan wrote:
> > >> > > On Thu, Feb 15, 2024 at 1:22 AM Michal Hocko <[email protected]> wrote:
> > >> > > >
> > >> > > > On Mon 12-02-24 13:39:17, Suren Baghdasaryan wrote:
> > >> > > > [...]
> > >> > > > > @@ -423,4 +424,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
> > >> > > > > #ifdef CONFIG_MEMORY_FAILURE
> > >> > > > > printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
> > >> > > > > #endif
> > >> > > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > >> > > > > + {
> > >> > > > > + struct seq_buf s;
> > >> > > > > + char *buf = kmalloc(4096, GFP_ATOMIC);
> > >> > > > > +
> > >> > > > > + if (buf) {
> > >> > > > > + printk("Memory allocations:\n");
> > >> > > > > + seq_buf_init(&s, buf, 4096);
> > >> > > > > + alloc_tags_show_mem_report(&s);
> > >> > > > > + printk("%s", buf);
> > >> > > > > + kfree(buf);
> > >> > > > > + }
> > >> > > > > + }
> > >> > > > > +#endif
> > >> > > >
> > >> > > > I am pretty sure I have already objected to this. Memory allocations in
> > >> > > > the oom path are simply no go unless there is absolutely no other way
> > >> > > > around that. In this case the buffer could be preallocated.
> > >> > >
> > >> > > Good point. We will change this to a smaller buffer allocated on the
> > >> > > stack and will print records one-by-one. Thanks!
> > >> >
> > >> > __show_mem could be called with a very deep call chains. A single
> > >> > pre-allocated buffer should just do ok.
> > >>
> > >> Ack. Will do.
> > >
> > > No, we're not going to permanently burn 4k here.
> > >
> > > It's completely fine if the allocation fails, there's nothing "unsafe"
> > > about doing a GFP_ATOMIC allocation here.
> >
> > Well, I think without __GFP_NOWARN it will cause a warning and thus
> > recursion into __show_mem(), potentially infinite? Which is of course
> > trivial to fix, but I'd myself rather sacrifice a bit of memory to get this
> > potentially very useful output, if I enabled the profiling. The necessary
> > memory overhead of page_ext and slabobj_ext makes the printing buffer
> > overhead negligible in comparison?
>
> __GFP_NOWARN is a good point, we should have that.
>
> But - and correct me if I'm wrong here - doesn't an OOM kick in well
> before GFP_ATOMIC 4k allocations are failing?
Not really, GFP_ATOMIC users can compete with reclaimers and consume
those reserves.
> I'd expect the system to
> be well and truly hosed at that point.
It is OOMed...
> If we want this report to be 100% reliable, then yes the preallocated
> buffer makes sense - but I don't think 100% makes sense here; I think we
> can accept ~99% and give back that 4k.
Think about that from the memory reserves consumers. The atomic reserve
is a scarce resource and now you want to use it for debugging purposes
for which you could have preallocated.
--
Michal Hocko
SUSE Labs
On Thu, Feb 15, 2024 at 1:50 PM Vlastimil Babka <[email protected]> wrote:
>
> On 2/15/24 22:37, Kent Overstreet wrote:
> > On Thu, Feb 15, 2024 at 10:31:06PM +0100, Vlastimil Babka wrote:
> >> On 2/12/24 22:38, Suren Baghdasaryan wrote:
> >> > Slab extension objects can't be allocated before slab infrastructure is
> >> > initialized. Some caches, like kmem_cache and kmem_cache_node, are created
> >> > before slab infrastructure is initialized. Objects from these caches can't
> >> > have extension objects. Introduce SLAB_NO_OBJ_EXT slab flag to mark these
> >> > caches and avoid creating extensions for objects allocated from these
> >> > slabs.
> >> >
> >> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> >> > ---
> >> > include/linux/slab.h | 7 +++++++
> >> > mm/slub.c | 5 +++--
> >> > 2 files changed, 10 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> >> > index b5f5ee8308d0..3ac2fc830f0f 100644
> >> > --- a/include/linux/slab.h
> >> > +++ b/include/linux/slab.h
> >> > @@ -164,6 +164,13 @@
> >> > #endif
> >> > #define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */
> >> >
> >> > +#ifdef CONFIG_SLAB_OBJ_EXT
> >> > +/* Slab created using create_boot_cache */
> >> > +#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x20000000U)
> >>
> >> There's
> >> #define SLAB_SKIP_KFENCE ((slab_flags_t __force)0x20000000U)
> >> already, so need some other one?
Indeed. I somehow missed it. Thanks for noticing, will fix this in the
next version.
> >
> > What's up with the order of flags in that file? They don't seem to
> > follow any particular ordering.
>
> Seems mostly in increasing order, except commit 4fd0b46e89879 broke it for
> SLAB_RECLAIM_ACCOUNT?
>
> > Seems like some cleanup is in order, but any history/context we should
> > know first?
>
> Yeah noted, but no need to sidetrack you.
On Thu, Feb 15, 2024 at 1:44 PM Vlastimil Babka <[email protected]> wrote:
>
> On 2/12/24 22:38, Suren Baghdasaryan wrote:
> > Use __GFP_NO_OBJ_EXT to prevent recursions when allocating slabobj_ext
> > objects. Also prevent slabobj_ext allocations for kmem_cache objects.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > mm/slab.h | 6 ++++++
> > mm/slab_common.c | 2 ++
> > 2 files changed, 8 insertions(+)
> >
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 436a126486b5..f4ff635091e4 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -589,6 +589,12 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
> > if (!need_slab_obj_ext())
> > return NULL;
> >
> > + if (s->flags & SLAB_NO_OBJ_EXT)
> > + return NULL;
> > +
> > + if (flags & __GFP_NO_OBJ_EXT)
> > + return NULL;
>
> Since we agreed to postpone this function, when it appears later it can have
> those in.
Yes, I think that works. Will have this in the same patch.
>
> > slab = virt_to_slab(p);
> > if (!slab_obj_exts(slab) &&
> > WARN(alloc_slab_obj_exts(slab, s, flags, false),
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 6bfa1810da5e..83fec2dd2e2d 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -218,6 +218,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > void *vec;
> >
> > gfp &= ~OBJCGS_CLEAR_MASK;
> > + /* Prevent recursive extension vector allocation */
> > + gfp |= __GFP_NO_OBJ_EXT;
>
> And this could become part of 6/35 mm: introduce __GFP_NO_OBJ_EXT ... ?
Yes, that will eliminate this patch. Thanks!
>
> > vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > slab_nid(slab));
> > if (!vec)
>
On Thu, Feb 15, 2024 at 10:54:53PM +0100, Michal Hocko wrote:
> On Thu 15-02-24 15:33:30, Kent Overstreet wrote:
> > If we want this report to be 100% reliable, then yes the preallocated
> > buffer makes sense - but I don't think 100% makes sense here; I think we
> > can accept ~99% and give back that 4k.
>
> Think about that from the memory reserves consumers. The atomic reserve
> is a scarce resource and now you want to use it for debugging purposes
> for which you could have preallocated.
_Memory_ is a finite resource that we shouldn't be using unnecessarily.
We don't need this for the entire time we're under memory pressure; just
the short duration it takes to generate the report, then it's back
available for other users.
You would have us dedicate 4k, from system bootup, that can never be
used by other users.
Again: this makes no sense. The whole point of having watermarks and
shared reserves is so that every codepath doesn't have to have its own
dedicated, private reserve, so that we can make better use of a shared
finite resource.
On Thu, 15 Feb 2024 15:33:30 -0500
Kent Overstreet <[email protected]> wrote:
> > Well, I think without __GFP_NOWARN it will cause a warning and thus
> > recursion into __show_mem(), potentially infinite? Which is of course
> > trivial to fix, but I'd myself rather sacrifice a bit of memory to get
> > this potentially very useful output, if I enabled the profiling. The
> > necessary memory overhead of page_ext and slabobj_ext makes the
> > printing buffer overhead negligible in comparison?
>
> __GFP_NOWARN is a good point, we should have that.
>
> But - and correct me if I'm wrong here - doesn't an OOM kick in well
> before GFP_ATOMIC 4k allocations are failing? I'd expect the system to
> be well and truly hosed at that point.
>
> If we want this report to be 100% reliable, then yes the preallocated
> buffer makes sense - but I don't think 100% makes sense here; I think we
> can accept ~99% and give back that 4k.
I just compiled v6.8-rc4 vanilla (with a fedora localmodconfig build) and
saved it off (vmlinux.orig), then I compiled with the following:
Applied the patches but did not enable anything: vmlinux.memtag-off
Enabled MEM_ALLOC_PROFILING: vmlinux.memtag
Enabled MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT: vmlinux.memtag-default-on
Enabled MEM_ALLOC_PROFILING_DEBUG: vmlinux.memtag-debug
And here's what I got:
    text     data     bss      dec     hex filename
29161847 18352730 5619716 53134293 32ac3d5 vmlinux.orig
29162286 18382638 5595140 53140064 32ada60 vmlinux.memtag-off        (+5771)
29230868 18887662 5275652 53394182 32ebb06 vmlinux.memtag            (+259889)
29230746 18887662 5275652 53394060 32eba8c vmlinux.memtag-default-on (+259767) dropped?
29276214 18946374 5177348 53399936 32ed180 vmlinux.memtag-debug      (+265643)
Just adding the patches increases the size by 5k. But the rest shows an
increase of 259k, and you are worried about 4k (and possibly less?)???
-- Steve
On 2/15/24 15:07, Steven Rostedt wrote:
> Just adding the patches increases the size by 5k. But the rest shows an
> increase of 259k, and you are worried about 4k (and possibly less?)???
Doesn't the new page_ext thingy add a pointer per 'struct page', or
~0.2% of RAM, or ~32MB on a 16GB laptop? I, too, am confused why 4k is
even remotely an issue.
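The ~0.2% and ~32MB figures can be reproduced with a few lines of arithmetic. This sketch assumes 4 KiB pages and 8 bytes of per-page metadata (one 64-bit pointer); per_page_overhead() is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Bytes of per-page metadata needed to cover ram_bytes of memory,
 * given the page size and the metadata size per page.
 */
static uint64_t per_page_overhead(uint64_t ram_bytes, uint64_t page_size,
				  uint64_t bytes_per_page)
{
	return ram_bytes / page_size * bytes_per_page;
}
```

For 16 GiB this gives 4194304 pages * 8 bytes = 32 MiB, and 8 / 4096 is about 0.195% of RAM. The same formula gives 256 MiB for 128 GiB, which matches the 262144 kB PageExts figure in the cover letter if that machine's page frames cover about 128 GiB.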
On Thu, 15 Feb 2024 18:16:48 -0500
Steven Rostedt <[email protected]> wrote:
> On Thu, 15 Feb 2024 18:07:42 -0500
> Steven Rostedt <[email protected]> wrote:
>
> > text data bss dec hex filename
> > 29161847 18352730 5619716 53134293 32ac3d5 vmlinux.orig
> > 29162286 18382638 5595140 53140064 32ada60 vmlinux.memtag-off (+5771)
> > 29230868 18887662 5275652 53394182 32ebb06 vmlinux.memtag (+259889)
> > 29230746 18887662 5275652 53394060 32eba8c vmlinux.memtag-default-on (+259767) dropped?
> > 29276214 18946374 5177348 53399936 32ed180 vmlinux.memtag-debug (+265643)
>
> If you plan on running this in production, and this increases the size of
> the text by 68k, have you measured the I$ pressure that this may induce?
> That is, what is the full overhead of having this enabled, as it could
> cause more instruction cache misses?
>
> I wonder if there has been measurements of it off. That is, having this
> configured in but default off still increases the text size by 68k. That
> can't be good on the instruction cache.
>
I should have read the cover letter ;-) (someone pointed me to that on IRC):
> Performance overhead:
> To evaluate performance we implemented an in-kernel test executing
> multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> affinity set to a specific CPU to minimize the noise. Below are results
> from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
> 56 core Intel Xeon:
These are microbenchmarks; were any larger benchmarks run? Microbenchmarks
do not always show I$ issues, because the benchmark itself warms up the
cache. In a larger workload the extra text could slow tasks down by
causing more instruction cache misses.
Running other benchmarks under perf and recording the cache misses between
the different configs would be a good picture to show.
>
>                        kmalloc              pgalloc
> (1 baseline)            6.764s              16.902s
> (2 default disabled)    6.793s (+0.43%)     17.007s (+0.62%)
> (3 default enabled)     7.197s (+6.40%)     23.666s (+40.02%)
> (4 runtime enabled)     7.405s (+9.48%)     23.901s (+41.41%)
> (5 memcg)              13.388s (+97.94%)    48.460s (+186.71%)
>
> Memory overhead:
> Kernel size:
>
>         text      data       bss       dec     diff
> (1) 26515311  18890222  17018880  62424413
> (2) 26524728  19423818  16740352  62688898   264485
> (3) 26524724  19423818  16740352  62688894   264481
> (4) 26524728  19423818  16740352  62688898   264485
> (5) 26541782  18964374  16957440  62463596    39183
Similar to my builds.
>
> Memory consumption on a 56 core Intel CPU with 125GB of memory:
> Code tags: 192 kB
> PageExts: 262144 kB (256MB)
> SlabExts: 9876 kB (9.6MB)
> PcpuExts: 512 kB (0.5MB)
>
> Total overhead is 0.2% of total memory.
All this, and we are still worried about 4k for useful debugging :-/
-- Steve
On Thu, 15 Feb 2024 18:07:42 -0500
Steven Rostedt <[email protected]> wrote:
> text data bss dec hex filename
> 29161847 18352730 5619716 53134293 32ac3d5 vmlinux.orig
> 29162286 18382638 5595140 53140064 32ada60 vmlinux.memtag-off (+5771)
> 29230868 18887662 5275652 53394182 32ebb06 vmlinux.memtag (+259889)
> 29230746 18887662 5275652 53394060 32eba8c vmlinux.memtag-default-on (+259767) dropped?
> 29276214 18946374 5177348 53399936 32ed180 vmlinux.memtag-debug (+265643)
If you plan on running this in production, and this increases the size of
the text by 68k, have you measured the I$ pressure that this may induce?
That is, what is the full overhead of having this enabled, as it could
cause more instruction cache misses?
I wonder if there have been measurements of it off. That is, having this
configured in but default off still increases the text size by 68k. That
can't be good on the instruction cache.
-- Steve
On Thu, Feb 15, 2024 at 06:07:42PM -0500, Steven Rostedt wrote:
> On Thu, 15 Feb 2024 15:33:30 -0500
> Kent Overstreet <[email protected]> wrote:
>
> > > Well, I think without __GFP_NOWARN it will cause a warning and thus
> > > recursion into __show_mem(), potentially infinite? Which is of course
> > > trivial to fix, but I'd myself rather sacrifice a bit of memory to get
> > > this potentially very useful output, if I enabled the profiling. The
> > > necessary memory overhead of page_ext and slabobj_ext makes the
> > > printing buffer overhead negligible in comparison?
> >
> > __GFP_NOWARN is a good point, we should have that.
> >
> > But - and correct me if I'm wrong here - doesn't an OOM kick in well
> > before GFP_ATOMIC 4k allocations are failing? I'd expect the system to
> > be well and truly hosed at that point.
> >
> > If we want this report to be 100% reliable, then yes the preallocated
> > buffer makes sense - but I don't think 100% makes sense here; I think we
> > can accept ~99% and give back that 4k.
>
> I just compiled v6.8-rc4 vanilla (with a fedora localmodconfig build) and
> saved it off (vmlinux.orig), then I compiled with the following:
>
> Applied the patches but did not enable anything: vmlinux.memtag-off
> Enabled MEM_ALLOC_PROFILING: vmlinux.memtag
> Enabled MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT: vmlinux.memtag-default-on
> Enabled MEM_ALLOC_PROFILING_DEBUG: vmlinux.memtag-debug
>
> And here's what I got:
>
> text data bss dec hex filename
> 29161847 18352730 5619716 53134293 32ac3d5 vmlinux.orig
> 29162286 18382638 5595140 53140064 32ada60 vmlinux.memtag-off (+5771)
> 29230868 18887662 5275652 53394182 32ebb06 vmlinux.memtag (+259889)
> 29230746 18887662 5275652 53394060 32eba8c vmlinux.memtag-default-on (+259767) dropped?
> 29276214 18946374 5177348 53399936 32ed180 vmlinux.memtag-debug (+265643)
>
> Just adding the patches increases the size by 5k. But the rest shows an
> increase of 259k, and you are worried about 4k (and possibly less?)???
Most of that is data (505024), not text (68582, or 66k).
The data is mostly the alloc tags themselves (one per allocation
callsite, and you compiled the entire kernel), so that's expected.
Of the text, a lot of that is going to be slowpath stuff - module load
and unload hooks, formatting and printing the output, other assorted bits.
Then there's allocating and deallocating obj extension vectors - not
slowpath but not super fast path, not every allocation.
The fastpath instruction count overhead is pretty small:
- actually doing the accounting - the core of slub.c, page_alloc.c,
percpu.c
- setting/restoring the alloc tag: this is overhead we add to every
allocation callsite, so it's the most relevant - but it's just a few
instructions.
So that's the breakdown. Definitely not zero overhead, but that fixed
memory overhead (and additionally, the percpu counters) is the price we
pay for very low runtime CPU overhead.
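The text and data figures quoted above are deltas between the vmlinux.memtag-off and vmlinux.memtag rows of the size table (the +259889 in the table is instead relative to vmlinux.orig). A short sketch that recomputes them; struct size_row, size_delta() and size_total() are hypothetical helper names.

```c
/* One row of `size` output: section sizes in bytes. */
struct size_row {
	long text;
	long data;
	long bss;
};

/* Per-section difference a - b. */
static struct size_row size_delta(struct size_row a, struct size_row b)
{
	struct size_row d = {
		a.text - b.text,
		a.data - b.data,
		a.bss - b.bss,
	};

	return d;
}

/* Sum of all sections, the "dec" column of `size`. */
static long size_total(struct size_row r)
{
	return r.text + r.data + r.bss;
}
```

With memtag-off = {29162286, 18382638, 5595140} and memtag = {29230868, 18887662, 5275652}, the delta is text +68582, data +505024 and bss -319488, for a net change of 254118 bytes.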
On Thu, Feb 15, 2024 at 06:27:29PM -0500, Steven Rostedt wrote:
> All this, and we are still worried about 4k for useful debugging :-/
Every additional 4k still needs justification. And whether we burn a
reserve on this will have no observable effect on user output in
remotely normal situations; if this allocation ever fails, we've already
been in an OOM situation for a while and we've already printed out this
report many times, with less memory pressure where the allocation would
have succeeded.
On Thu, Feb 15, 2024 at 03:19:33PM -0800, Dave Hansen wrote:
> On 2/15/24 15:07, Steven Rostedt wrote:
> > Just adding the patches increases the size by 5k. But the rest shows an
> > increase of 259k, and you are worried about 4k (and possibly less?)???
>
> Doesn't the new page_ext thingy add a pointer per 'struct page', or
> ~0.2% of RAM, or ~32MB on a 16GB laptop? I, too, am confused why 4k is
> even remotely an issue.
page_ext adds a separate per-page array; it itself does not add a
pointer to struct page, it's an array lookup that uses the page pfn.
We do add a pointer to page_ext, and that's (for now) unavoidable
overhead - but we'll be looking at referencing code tags by index, and
if that works out we'll be able to kill the page_ext dependency and just
store the alloc tag index in page bits.
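The "separate array indexed by pfn" layout can be modelled in userspace as follows. This is a simplified sketch, not the kernel's page_ext code: struct sketch_page_ext, struct sketch_section, section_init() and lookup_page_ext() are hypothetical names, and a single contiguous pfn range is assumed.

```c
#include <stddef.h>
#include <stdlib.h>

/* One slot per page; here it holds only a pointer to an alloc tag. */
struct sketch_page_ext {
	const void *tag;
};

/* A contiguous pfn range with its side array of per-page slots. */
struct sketch_section {
	unsigned long start_pfn;
	unsigned long nr_pages;
	struct sketch_page_ext *ext;
};

/* Allocate a zeroed side array; returns 0 on success, -1 on failure. */
static int section_init(struct sketch_section *s, unsigned long start_pfn,
			unsigned long nr_pages)
{
	s->ext = calloc(nr_pages, sizeof(*s->ext));
	if (!s->ext)
		return -1;
	s->start_pfn = start_pfn;
	s->nr_pages = nr_pages;
	return 0;
}

/* Find the slot for pfn by plain array indexing; NULL if out of range. */
static struct sketch_page_ext *lookup_page_ext(struct sketch_section *s,
					       unsigned long pfn)
{
	if (pfn < s->start_pfn || pfn - s->start_pfn >= s->nr_pages)
		return NULL;
	return &s->ext[pfn - s->start_pfn];
}
```

Nothing is added to the page structure itself in this model; the cost is the side array, one slot per page.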
On Thu, 15 Feb 2024 18:51:41 -0500
Kent Overstreet <[email protected]> wrote:
> Most of that is data (505024), not text (68582, or 66k).
>
And the 4K extra would have been data too.
> The data is mostly the alloc tags themselves (one per allocation
> callsite, and you compiled the entire kernel), so that's expected.
>
> Of the text, a lot of that is going to be slowpath stuff - module load
> and unload hooks, formatting and printing the output, other assorted bits.
>
> Then there's allocating and deallocating obj extension vectors - not
> slowpath but not super fast path, not every allocation.
>
> The fastpath instruction count overhead is pretty small
> - actually doing the accounting - the core of slub.c, page_alloc.c,
> percpu.c
> - setting/restoring the alloc tag: this is overhead we add to every
> allocation callsite, so it's the most relevant - but it's just a few
> instructions.
>
> So that's the breakdown. Definitely not zero overhead, but that fixed
> memory overhead (and additionally, the percpu counters) is the price we
> pay for very low runtime CPU overhead.
But where are the benchmarks that are not micro-benchmarks? How much
overhead does this cause to those? Is it in the noise, or is it noticeable?
-- Steve
On Thu, Feb 15, 2024 at 07:21:41PM -0500, Steven Rostedt wrote:
> On Thu, 15 Feb 2024 18:51:41 -0500
> Kent Overstreet <[email protected]> wrote:
>
> > Most of that is data (505024), not text (68582, or 66k).
> >
>
> And the 4K extra would have been data too.
"It's not that much" isn't an argument for being wasteful.
> > The data is mostly the alloc tags themselves (one per allocation
> > callsite, and you compiled the entire kernel), so that's expected.
> >
> > Of the text, a lot of that is going to be slowpath stuff - module load
> > and unload hooks, formatting and printing the output, other assorted bits.
> >
> > Then there's allocating and deallocating obj extension vectors - not
> > slowpath but not super fast path, not every allocation.
> >
> > The fastpath instruction count overhead is pretty small
> > - actually doing the accounting - the core of slub.c, page_alloc.c,
> > percpu.c
> > - setting/restoring the alloc tag: this is overhead we add to every
> > allocation callsite, so it's the most relevant - but it's just a few
> > instructions.
> >
> > So that's the breakdown. Definitely not zero overhead, but that fixed
> > memory overhead (and additionally, the percpu counters) is the price we
> > pay for very low runtime CPU overhead.
>
> But where are the benchmarks that are not micro-benchmarks? How much
> overhead does this cause to those? Is it in the noise, or is it noticeable?
Microbenchmarks are how we magnify the effect of a change like this to
the most we'll ever see. Barring cache effects, it'll be in the noise.
Cache effects are a concern here because we're now touching task_struct
in the allocation fast path; that is where the
"compiled-in-but-turned-off" overhead comes from, because we can't add
static keys for that code without doubling the amount of icache
footprint, and I don't think that would be a great tradeoff.
So: if your code has fastpath allocations where the hot part of
task_struct isn't in cache, then this will be noticeable overhead to
you, otherwise it won't be.
On Thu, 15 Feb 2024 19:32:38 -0500
Kent Overstreet <[email protected]> wrote:
> > But where are the benchmarks that are not micro-benchmarks? How much
> > overhead does this cause to those? Is it in the noise, or is it noticeable?
>
> Microbenchmarks are how we magnify the effect of a change like this to
> the most we'll ever see. Barring cache effects, it'll be in the noise.
>
> Cache effects are a concern here because we're now touching task_struct
> in the allocation fast path; that is where the
> "compiled-in-but-turned-off" overhead comes from, because we can't add
> static keys for that code without doubling the amount of icache
> footprint, and I don't think that would be a great tradeoff.
>
> So: if your code has fastpath allocations where the hot part of
> task_struct isn't in cache, then this will be noticeable overhead to
> you, otherwise it won't be.
All nice, but where are the benchmarks? This looks like it will have an
effect on cache and you can talk all you want about how it will not be an
issue, but without real world benchmarks, it's meaningless. Numbers talk.
-- Steve
On Thu, Feb 15, 2024 at 07:39:15PM -0500, Steven Rostedt wrote:
> On Thu, 15 Feb 2024 19:32:38 -0500
> Kent Overstreet <[email protected]> wrote:
>
> > > > But where are the benchmarks that are not micro-benchmarks? How much
> > > overhead does this cause to those? Is it in the noise, or is it noticeable?
> >
> > Microbenchmarks are how we magnify the effect of a change like this to
> > the most we'll ever see. Barring cache effects, it'll be in the noise.
> >
> > Cache effects are a concern here because we're now touching task_struct
> > in the allocation fast path; that is where the
> > "compiled-in-but-turned-off" overhead comes from, because we can't add
> > static keys for that code without doubling the amount of icache
> > footprint, and I don't think that would be a great tradeoff.
> >
> > So: if your code has fastpath allocations where the hot part of
> > task_struct isn't in cache, then this will be noticeable overhead to
> > you, otherwise it won't be.
>
> All nice, but where are the benchmarks? This looks like it will have an
> effect on cache and you can talk all you want about how it will not be an
> issue, but without real world benchmarks, it's meaningless. Numbers talk.
Steve, you're being demanding. We provided sufficient benchmarks to show
the overhead is low enough for production, and then I gave you a
detailed breakdown of where our overhead is and where it'll show up. I
think that's reasonable.
On Mon, 12 Feb 2024 13:38:59 -0800 Suren Baghdasaryan <[email protected]> wrote:
> +Example output.
> +
> +::
> +
> + > cat /proc/allocinfo
> +
> + 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
> + 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
> + 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
> + 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
> + 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
I don't really like the fancy MiB stuff. Wouldn't it be better to just
present the amount of memory in plain old bytes, so people can use sort
-n on it? And it's easier to tell big-from-small at a glance because
big has more digits.
Also, the first thing any sort of downstream processing of this data is
going to have to do is to convert the fancified output back into
plain-old-bytes. So why not just emit plain-old-bytes?
If someone wants the fancy output (and nobody does) then that can be
done in userspace.
On Thu, Feb 15, 2024 at 04:54:38PM -0800, Andrew Morton wrote:
> On Mon, 12 Feb 2024 13:38:59 -0800 Suren Baghdasaryan <[email protected]> wrote:
>
> > +Example output.
> > +
> > +::
> > +
> > + > cat /proc/allocinfo
> > +
> > + 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
> > + 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
> > + 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
> > + 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
> > + 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
>
> I don't really like the fancy MiB stuff. Wouldn't it be better to just
> present the amount of memory in plain old bytes, so people can use sort
> -n on it?
They can use sort -h on it; the string_get_size() patch was specifically
so that we could make the output compatible with sort -h.
> And it's easier to tell big-from-small at a glance because
> big has more digits.
>
> Also, the first thing any sort of downstream processing of this data is
> going to have to do is to convert the fancified output back into
> plain-old-bytes. So why not just emit plain-old-bytes?
>
> If someone wants the fancy output (and nobody does) then that can be
> done in userspace.
I like simpler, more discoverable tools; e.g. we've got a bunch of
interesting stuff in scripts/ but it doesn't get used nearly as much -
not as accessible as cat'ing a file, definitely not going to be
installed by default.
I'm just optimizing for the most common use case. I doubt there's going
to be nearly as much consumption by tools, and I'm ok with making them
do the conversion back to bytes if they really need it.
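For tools that do need bytes, the conversion back is small. Below is a rough userspace sketch in C. It assumes the size field looks like the example output above: a decimal number immediately followed by B, KiB, MiB, GiB or TiB. allocinfo_size_to_bytes() is a hypothetical helper name, not something in the patchset.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patchset): convert a size field such
 * as "3.11MiB" or "125MiB" back to an approximate byte count.
 * Returns 0 on success, -1 on a malformed or out-of-range field.
 */
int allocinfo_size_to_bytes(const char *field, uint64_t *bytes)
{
	/* Assumed unit suffixes, taken from the example output. */
	static const struct { const char *suffix; uint64_t mult; } units[] = {
		{ "B", 1 }, { "KiB", 1ULL << 10 }, { "MiB", 1ULL << 20 },
		{ "GiB", 1ULL << 30 }, { "TiB", 1ULL << 40 },
	};
	char *end;
	double value, scaled;
	size_t i;

	assert(field && bytes);

	value = strtod(field, &end);
	/* Reject "no number at all", negative values and NaN. */
	if (end == field || !(value >= 0.0))
		return -1;

	for (i = 0; i < sizeof(units) / sizeof(units[0]); i++) {
		if (strcmp(end, units[i].suffix) != 0)
			continue;
		scaled = value * (double)units[i].mult;
		/* Reject values that do not fit in 64 bits. */
		if (scaled >= 18446744073709551615.0)
			return -1;
		*bytes = (uint64_t)(scaled + 0.5);	/* round to nearest */
		return 0;
	}
	return -1;	/* unknown or missing unit suffix */
}
```

The result is only as precise as the rounded field: "3.11MiB" comes back as 3261071 bytes, whatever the real count was.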
On Thu, 15 Feb 2024 19:50:24 -0500
Kent Overstreet <[email protected]> wrote:
> > All nice, but where are the benchmarks? This looks like it will have an
> > effect on cache and you can talk all you want about how it will not be an
> > issue, but without real world benchmarks, it's meaningless. Numbers talk.
>
> Steve, you're being demanding. We provided sufficient benchmarks to show
> the overhead is low enough for production, and then I gave you a
> detailed breakdown of where our overhead is and where it'll show up. I
> think that's reasonable.
It's not unreasonable or demanding to ask for benchmarks. You showed only
micro-benchmarks that do not show how cache misses may affect the system.
Honestly, it sounds like you did run other benchmarks and didn't like the
results and are fighting to not have to produce them. Really, how hard is
it? There's lots of benchmarks you can run, like hackbench, stress-ng,
dbench. Why is this so difficult for you?
-- Steve
On Thu, Feb 15, 2024 at 08:12:39PM -0500, Steven Rostedt wrote:
> On Thu, 15 Feb 2024 19:50:24 -0500
> Kent Overstreet <[email protected]> wrote:
>
> > > All nice, but where are the benchmarks? This looks like it will have an
> > > effect on cache and you can talk all you want about how it will not be an
> > > issue, but without real world benchmarks, it's meaningless. Numbers talk.
> >
> > Steve, you're being demanding. We provided sufficient benchmarks to show
> > the overhead is low enough for production, and then I gave you a
> > detailed breakdown of where our overhead is and where it'll show up. I
> > think that's reasonable.
>
> It's not unreasonable or demanding to ask for benchmarks. You showed only
> micro-benchmarks that do not show how cache misses may affect the system.
> Honestly, it sounds like you did run other benchmarks and didn't like the
> results and are fighting to not have to produce them. Really, how hard is
> it? There's lots of benchmarks you can run, like hackbench, stress-ng,
> dbench. Why is this so difficult for you?
Woah, this is verging into paranoid conspiracy territory.
No, we haven't done other benchmarks, and if we had we'd be sharing
them. And if I had more time to spend on performance of this patchset
that's not where I'd be spending it; the next thing I'd be looking at
would be assembly output of the hooking code and seeing if I could shave
that down.
But I already put a ton of work into shaving cycles on this patchset,
I'm happy with the results, and I have other responsibilities and other
things I need to be working on.
On Thu, Feb 15, 2024 at 8:00 PM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, Feb 15, 2024 at 04:54:38PM -0800, Andrew Morton wrote:
> > On Mon, 12 Feb 2024 13:38:59 -0800 Suren Baghdasaryan <[email protected]> wrote:
> >
> > > +Example output.
> > > +
> > > +::
> > > +
> > > + > cat /proc/allocinfo
> > > +
> > > + 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
> > > + 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
> > > + 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
> > > + 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
> > > + 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
> >
> > I don't really like the fancy MiB stuff. Wouldn't it be better to just
> > present the amount of memory in plain old bytes, so people can use sort
> > -n on it?
>
> They can use sort -h on it; the string_get_size() patch was specifically
> so that we could make the output compatible with sort -h
>
> > And it's easier to tell big-from-small at a glance because
> > big has more digits.
> >
> > Also, the first thing any sort of downstream processing of this data is
> > going to have to do is to convert the fancified output back into
> > plain-old-bytes. So why not just emit plain-old-bytes?
> >
> > If someone wants the fancy output (and nobody does) then that can be
> > done in userspace.
>
> I like simpler, more discoverable tools; e.g. we've got a bunch of
> interesting stuff in scripts/ but it doesn't get used nearly as much -
> not as accessible as cat'ing a file, definitely not going to be
> installed by default.
I also prefer plain bytes instead of MiB. A driver developer who
wants to verify, down to the byte, the allocations for a new data
structure that they added is going to be disappointed by the rounded
MiB numbers.
The data contained in this file is not consumable without at least
"sort -h -r", so why not just output bytes instead?
There is /proc/slabinfo and there is a slabtop tool.
For raw /proc/allocinfo we can create an alloctop tool that would
parse, sort and show data in human readable format based on various
criteria.
We should also add at the top of this file "allocinfo - version: 1.0",
to allow future extensions (e.g. a column for proc name).
Pasha
On Thu, Feb 15, 2024 at 08:22:44PM -0500, Pasha Tatashin wrote:
> On Thu, Feb 15, 2024 at 8:00 PM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Thu, Feb 15, 2024 at 04:54:38PM -0800, Andrew Morton wrote:
> > > On Mon, 12 Feb 2024 13:38:59 -0800 Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > > +Example output.
> > > > +
> > > > +::
> > > > +
> > > > + > cat /proc/allocinfo
> > > > +
> > > > + 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
> > > > + 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
> > > > + 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
> > > > + 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
> > > > + 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
> > >
> > > I don't really like the fancy MiB stuff. Wouldn't it be better to just
> > > present the amount of memory in plain old bytes, so people can use sort
> > > -n on it?
> >
> > They can use sort -h on it; the string_get_size() patch was specifically
> > so that we could make the output compatible with sort -h
> >
> > > And it's easier to tell big-from-small at a glance because
> > > big has more digits.
> > >
> > > Also, the first thing any sort of downstream processing of this data is
> > > going to have to do is to convert the fancified output back into
> > > plain-old-bytes. So why not just emit plain-old-bytes?
> > >
> > > If someone wants the fancy output (and nobody does) then that can be
> > > done in userspace.
> >
> > I like simpler, more discoverable tools; e.g. we've got a bunch of
> > interesting stuff in scripts/ but it doesn't get used nearly as much -
> > not as accessible as cat'ing a file, definitely not going to be
> > installed by default.
>
> I also prefer plain bytes instead of MiB. A driver developer who
> wants to verify, down to the byte, the allocations for a new data
> structure that they added is going to be disappointed by the rounded
> MiB numbers.
That's a fair point.
> The data contained in this file is not consumable without at least
> "sort -h -r", so why not just output bytes instead?
>
> There is /proc/slabinfo and there is a slabtop tool.
> For raw /proc/allocinfo we can create an alloctop tool that would
> parse, sort and show data in human readable format based on various
> criteria.
>
> We should also add at the top of this file "allocinfo - version: 1.0",
> to allow future extensions (e.g. a column for proc name).
How would we feel about exposing two different versions in /proc? It
should be a pretty minimal addition to .text.
Personally, I hate trying to count long strings of digits by eyeball...
On Thu, Feb 15, 2024 at 5:18 PM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, Feb 15, 2024 at 08:12:39PM -0500, Steven Rostedt wrote:
> > On Thu, 15 Feb 2024 19:50:24 -0500
> > Kent Overstreet <[email protected]> wrote:
> >
> > > > All nice, but where are the benchmarks? This looks like it will have an
> > > > effect on cache and you can talk all you want about how it will not be an
> > > > issue, but without real world benchmarks, it's meaningless. Numbers talk.
> > >
> > > Steve, you're being demanding. We provided sufficient benchmarks to show
> > > the overhead is low enough for production, and then I gave you a
> > > detailed breakdown of where our overhead is and where it'll show up. I
> > > think that's reasonable.
> >
> > It's not unreasonable or demanding to ask for benchmarks. You showed only
> > micro-benchmarks that do not show how cache misses may affect the system.
> > Honestly, it sounds like you did run other benchmarks and didn't like the
> > results and are fighting to not have to produce them. Really, how hard is
> > it? There's lots of benchmarks you can run, like hackbench, stress-ng,
> > dbench. Why is this so difficult for you?
I'll run these benchmarks and will include the numbers in the next cover letter.
>
> Woah, this is verging into paranoid conspiracy territory.
>
> No, we haven't done other benchmarks, and if we had we'd be sharing
> them. And if I had more time to spend on performance of this patchset
> that's not where I'd be spending it; the next thing I'd be looking at
> would be assembly output of the hooking code and seeing if I could shave
> that down.
>
> But I already put a ton of work into shaving cycles on this patchset,
> I'm happy with the results, and I have other responsibilities and other
> things I need to be working on.
On Mon, Feb 12, 2024 at 6:04 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Mon, Feb 12, 2024 at 2:27 PM Kees Cook <[email protected]> wrote:
> >
> > On Mon, Feb 12, 2024 at 01:38:56PM -0800, Suren Baghdasaryan wrote:
> > > Add basic infrastructure to support code tagging which stores tag common
> > > information consisting of the module name, function, file name and line
> > > number. Provide functions to register a new code tag type and navigate
> > > between code tags.
> > >
> > > Co-developed-by: Kent Overstreet <[email protected]>
> > > Signed-off-by: Kent Overstreet <[email protected]>
> > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > ---
> > > include/linux/codetag.h | 71 ++++++++++++++
> > > lib/Kconfig.debug | 4 +
> > > lib/Makefile | 1 +
> > > lib/codetag.c | 199 ++++++++++++++++++++++++++++++++++++++++
> > > 4 files changed, 275 insertions(+)
> > > create mode 100644 include/linux/codetag.h
> > > create mode 100644 lib/codetag.c
> > >
> > > diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> > > new file mode 100644
> > > index 000000000000..a9d7adecc2a5
> > > --- /dev/null
> > > +++ b/include/linux/codetag.h
> > > @@ -0,0 +1,71 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * code tagging framework
> > > + */
> > > +#ifndef _LINUX_CODETAG_H
> > > +#define _LINUX_CODETAG_H
> > > +
> > > +#include <linux/types.h>
> > > +
> > > +struct codetag_iterator;
> > > +struct codetag_type;
> > > +struct seq_buf;
> > > +struct module;
> > > +
> > > +/*
> > > + * An instance of this structure is created in a special ELF section at every
> > > + * code location being tagged. At runtime, the special section is treated as
> > > + * an array of these.
> > > + */
> > > +struct codetag {
> > > + unsigned int flags; /* used in later patches */
> > > + unsigned int lineno;
> > > + const char *modname;
> > > + const char *function;
> > > + const char *filename;
> > > +} __aligned(8);
> > > +
> > > +union codetag_ref {
> > > + struct codetag *ct;
> > > +};
> > > +
> > > +struct codetag_range {
> > > + struct codetag *start;
> > > + struct codetag *stop;
> > > +};
> > > +
> > > +struct codetag_module {
> > > + struct module *mod;
> > > + struct codetag_range range;
> > > +};
> > > +
> > > +struct codetag_type_desc {
> > > + const char *section;
> > > + size_t tag_size;
> > > +};
> > > +
> > > +struct codetag_iterator {
> > > + struct codetag_type *cttype;
> > > + struct codetag_module *cmod;
> > > + unsigned long mod_id;
> > > + struct codetag *ct;
> > > +};
> > > +
> > > +#define CODE_TAG_INIT { \
> > > + .modname = KBUILD_MODNAME, \
> > > + .function = __func__, \
> > > + .filename = __FILE__, \
> > > + .lineno = __LINE__, \
> > > + .flags = 0, \
> > > +}
> > > +
> > > +void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
> > > +struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
> > > +struct codetag *codetag_next_ct(struct codetag_iterator *iter);
> > > +
> > > +void codetag_to_text(struct seq_buf *out, struct codetag *ct);
> > > +
> > > +struct codetag_type *
> > > +codetag_register_type(const struct codetag_type_desc *desc);
> > > +
> > > +#endif /* _LINUX_CODETAG_H */
> > > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > > index 975a07f9f1cc..0be2d00c3696 100644
> > > --- a/lib/Kconfig.debug
> > > +++ b/lib/Kconfig.debug
> > > @@ -968,6 +968,10 @@ config DEBUG_STACKOVERFLOW
> > >
> > > If in doubt, say "N".
> > >
> > > +config CODE_TAGGING
> > > + bool
> > > + select KALLSYMS
> > > +
> > > source "lib/Kconfig.kasan"
> > > source "lib/Kconfig.kfence"
> > > source "lib/Kconfig.kmsan"
> > > diff --git a/lib/Makefile b/lib/Makefile
> > > index 6b09731d8e61..6b48b22fdfac 100644
> > > --- a/lib/Makefile
> > > +++ b/lib/Makefile
> > > @@ -235,6 +235,7 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
> > > of-reconfig-notifier-error-inject.o
> > > obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> > >
> > > +obj-$(CONFIG_CODE_TAGGING) += codetag.o
> > > lib-$(CONFIG_GENERIC_BUG) += bug.o
> > >
> > > obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
> > > diff --git a/lib/codetag.c b/lib/codetag.c
> > > new file mode 100644
> > > index 000000000000..7708f8388e55
> > > --- /dev/null
> > > +++ b/lib/codetag.c
> > > @@ -0,0 +1,199 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +#include <linux/codetag.h>
> > > +#include <linux/idr.h>
> > > +#include <linux/kallsyms.h>
> > > +#include <linux/module.h>
> > > +#include <linux/seq_buf.h>
> > > +#include <linux/slab.h>
> > > +
> > > +struct codetag_type {
> > > + struct list_head link;
> > > + unsigned int count;
> > > + struct idr mod_idr;
> > > + struct rw_semaphore mod_lock; /* protects mod_idr */
> > > + struct codetag_type_desc desc;
> > > +};
> > > +
> > > +static DEFINE_MUTEX(codetag_lock);
> > > +static LIST_HEAD(codetag_types);
> > > +
> > > +void codetag_lock_module_list(struct codetag_type *cttype, bool lock)
> > > +{
> > > + if (lock)
> > > + down_read(&cttype->mod_lock);
> > > + else
> > > + up_read(&cttype->mod_lock);
> > > +}
> > > +
> > > +struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> > > +{
> > > + struct codetag_iterator iter = {
> > > + .cttype = cttype,
> > > + .cmod = NULL,
> > > + .mod_id = 0,
> > > + .ct = NULL,
> > > + };
> > > +
> > > + return iter;
> > > +}
> > > +
> > > +static inline struct codetag *get_first_module_ct(struct codetag_module *cmod)
> > > +{
> > > + return cmod->range.start < cmod->range.stop ? cmod->range.start : NULL;
> > > +}
> > > +
> > > +static inline
> > > +struct codetag *get_next_module_ct(struct codetag_iterator *iter)
> > > +{
> > > + struct codetag *res = (struct codetag *)
> > > + ((char *)iter->ct + iter->cttype->desc.tag_size);
> > > +
> > > + return res < iter->cmod->range.stop ? res : NULL;
> > > +}
> > > +
> > > +struct codetag *codetag_next_ct(struct codetag_iterator *iter)
> > > +{
> > > + struct codetag_type *cttype = iter->cttype;
> > > + struct codetag_module *cmod;
> > > + struct codetag *ct;
> > > +
> > > + lockdep_assert_held(&cttype->mod_lock);
> > > +
> > > + if (unlikely(idr_is_empty(&cttype->mod_idr)))
> > > + return NULL;
> > > +
> > > + ct = NULL;
> > > + while (true) {
> > > + cmod = idr_find(&cttype->mod_idr, iter->mod_id);
> > > +
> > > + /* If module was removed move to the next one */
> > > + if (!cmod)
> > > + cmod = idr_get_next_ul(&cttype->mod_idr,
> > > + &iter->mod_id);
> > > +
> > > + /* Exit if no more modules */
> > > + if (!cmod)
> > > + break;
> > > +
> > > + if (cmod != iter->cmod) {
> > > + iter->cmod = cmod;
> > > + ct = get_first_module_ct(cmod);
> > > + } else
> > > + ct = get_next_module_ct(iter);
> > > +
> > > + if (ct)
> > > + break;
> > > +
> > > + iter->mod_id++;
> > > + }
> > > +
> > > + iter->ct = ct;
> > > + return ct;
> > > +}
> > > +
> > > +void codetag_to_text(struct seq_buf *out, struct codetag *ct)
> > > +{
> > > + seq_buf_printf(out, "%s:%u module:%s func:%s",
> > > + ct->filename, ct->lineno,
> > > + ct->modname, ct->function);
> > > +}
> >
> > Thank you for using seq_buf here!
> >
> > Also, will this need an EXPORT_SYMBOL_GPL()?
Missed this question. I don't think we need EXPORT_SYMBOL_GPL() here
at least for now. Modules don't use these functions. The "alloc_tags"
sections will be generated for each module at compile time but they
themselves do not use it.
> >
> > > +
> > > +static inline size_t range_size(const struct codetag_type *cttype,
> > > + const struct codetag_range *range)
> > > +{
> > > + return ((char *)range->stop - (char *)range->start) /
> > > + cttype->desc.tag_size;
> > > +}
> > > +
> > > +static void *get_symbol(struct module *mod, const char *prefix, const char *name)
> > > +{
> > > + char buf[64];
> >
> > Why is 64 enough? I was expecting KSYM_NAME_LEN here, but perhaps this
> > is specialized enough to section names that it will not be a problem?
>
> This buffer holds the name of the section containing codetags,
> prefixed with "__start_" or "__stop_". The only current user is
> alloc_tag_init(), which sets the section name to "alloc_tags", so the
> buffer currently holds either "__start_alloc_tags" or
> "__stop_alloc_tags". When more codetag applications are added (like
> the ones we have shown in the original RFC [1]), there will be more
> section names. 64 was chosen as a value big enough to reasonably hold
> the section name with the prefix. But you are right, we should add a
> check for the section name size to ensure it always fits. I will add
> it to my TODO list.
>
> [1] https://lore.kernel.org/all/[email protected]/
> > If so, please document it clearly with a comment.
>
> Will do.
>
> >
> > > + int res;
> > > +
> > > + res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
> > > + if (WARN_ON(res < 1 || res > sizeof(buf)))
> > > + return NULL;
> >
> > Please use a seq_buf here instead of snprintf, which we're trying to get
> > rid of.
> >
> > DECLARE_SEQ_BUF(sb, KSYM_NAME_LEN);
> > const char *buf;
> >
> > seq_buf_printf(&sb, "%s%s", prefix, name);
> > if (seq_buf_has_overflowed(&sb))
> > return NULL;
> >
> > buf = seq_buf_str(&sb);
>
> Will do. Thanks!
>
> >
> > > +
> > > + return mod ?
> > > + (void *)find_kallsyms_symbol_value(mod, buf) :
> > > + (void *)kallsyms_lookup_name(buf);
> > > +}
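For reference, here is a userspace sketch of the "does it always fit" check discussed above. This is not kernel code, and build_section_symbol() is a hypothetical name. It assumes the same "%s%s" prefix-plus-section layout as get_symbol(). One detail worth noting: snprintf() returns the length the full string would have had, without the trailing NUL, so a return value equal to the buffer size already means truncation, and the comparison has to be ">=" rather than ">".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for the name-building half of
 * get_symbol(): join prefix and section name into buf.
 * Returns 0 on success, -1 if the result is empty or would not fit.
 */
int build_section_symbol(char *buf, size_t size,
			 const char *prefix, const char *name)
{
	int res;

	assert(buf && prefix && name);

	res = snprintf(buf, size, "%s%s", prefix, name);
	/*
	 * res counts the characters of the complete string, excluding the
	 * NUL terminator. If res == size, the last character was dropped
	 * to make room for the NUL, so that case must be rejected too.
	 */
	if (res < 1 || (size_t)res >= size)
		return -1;
	return 0;
}
```

With a 64-byte buffer, "__start_" plus "alloc_tags" (18 characters) fits comfortably; an 18-byte buffer is rejected because there is no room left for the terminator.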
> > > +
> > > +static struct codetag_range get_section_range(struct module *mod,
> > > + const char *section)
> > > +{
> > > + return (struct codetag_range) {
> > > + get_symbol(mod, "__start_", section),
> > > + get_symbol(mod, "__stop_", section),
> > > + };
> > > +}
> > > +
> > > +static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
> > > +{
> > > + struct codetag_range range;
> > > + struct codetag_module *cmod;
> > > + int err;
> > > +
> > > + range = get_section_range(mod, cttype->desc.section);
> > > + if (!range.start || !range.stop) {
> > > + pr_warn("Failed to load code tags of type %s from the module %s\n",
> > > + cttype->desc.section,
> > > + mod ? mod->name : "(built-in)");
> > > + return -EINVAL;
> > > + }
> > > +
> > > + /* Ignore empty ranges */
> > > + if (range.start == range.stop)
> > > + return 0;
> > > +
> > > + BUG_ON(range.start > range.stop);
> > > +
> > > + cmod = kmalloc(sizeof(*cmod), GFP_KERNEL);
> > > + if (unlikely(!cmod))
> > > + return -ENOMEM;
> > > +
> > > + cmod->mod = mod;
> > > + cmod->range = range;
> > > +
> > > + down_write(&cttype->mod_lock);
> > > + err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
> > > + if (err >= 0)
> > > + cttype->count += range_size(cttype, &range);
> > > + up_write(&cttype->mod_lock);
> > > +
> > > + if (err < 0) {
> > > + kfree(cmod);
> > > + return err;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +struct codetag_type *
> > > +codetag_register_type(const struct codetag_type_desc *desc)
> > > +{
> > > + struct codetag_type *cttype;
> > > + int err;
> > > +
> > > + BUG_ON(desc->tag_size <= 0);
> > > +
> > > + cttype = kzalloc(sizeof(*cttype), GFP_KERNEL);
> > > + if (unlikely(!cttype))
> > > + return ERR_PTR(-ENOMEM);
> > > +
> > > + cttype->desc = *desc;
> > > + idr_init(&cttype->mod_idr);
> > > + init_rwsem(&cttype->mod_lock);
> > > +
> > > + err = codetag_module_init(cttype, NULL);
> > > + if (unlikely(err)) {
> > > + kfree(cttype);
> > > + return ERR_PTR(err);
> > > + }
> > > +
> > > + mutex_lock(&codetag_lock);
> > > + list_add_tail(&cttype->link, &codetag_types);
> > > + mutex_unlock(&codetag_lock);
> > > +
> > > + return cttype;
> > > +}
> > > --
> > > 2.43.0.687.g38aa6559b0-goog
> > >
> >
> > --
> > Kees Cook
On Mon, 12 Feb 2024, Suren Baghdasaryan <[email protected]> wrote:
> Memory allocation, v3 and final:
>
> Overview:
> Low overhead [1] per-callsite memory allocation profiling. Not just for debug
> kernels, overhead low enough to be deployed in production.
>
> We're aiming to get this in the next merge window, for 6.9. The feedback
> we've gotten has been that even out of tree this patchset has already
> been useful, and there's a significant amount of other work gated on the
> code tagging functionality included in this patchset [2].
I wonder if it wouldn't be too much trouble to write at least a brief
overview document under Documentation/ describing what this is all
about? Even as follow-up. People seeing the patch series have the
benefit of the cover letter and the commit messages, but that's hardly
documentation.
We have all these great frameworks and tools but their discoverability
to kernel developers isn't always all that great.
BR,
Jani.
--
Jani Nikula, Intel
On Fri, Feb 16, 2024 at 10:38:00AM +0200, Jani Nikula wrote:
> On Mon, 12 Feb 2024, Suren Baghdasaryan <[email protected]> wrote:
> > Memory allocation, v3 and final:
> >
> > Overview:
> > Low overhead [1] per-callsite memory allocation profiling. Not just for debug
> > kernels, overhead low enough to be deployed in production.
> >
> > We're aiming to get this in the next merge window, for 6.9. The feedback
> > we've gotten has been that even out of tree this patchset has already
> > been useful, and there's a significant amount of other work gated on the
> > code tagging functionality included in this patchset [2].
>
> I wonder if it wouldn't be too much trouble to write at least a brief
> overview document under Documentation/ describing what this is all
> about? Even as follow-up. People seeing the patch series have the
> benefit of the cover letter and the commit messages, but that's hardly
> documentation.
>
> We have all these great frameworks and tools but their discoverability
> to kernel developers isn't always all that great.
commit f589b48789de4b8f77bfc70b9f3ab2013c01eaf2
Author: Kent Overstreet <[email protected]>
Date: Wed Feb 14 01:13:04 2024 -0500
memprofiling: Documentation
Signed-off-by: Kent Overstreet <[email protected]>
diff --git a/Documentation/mm/allocation-profiling.rst b/Documentation/mm/allocation-profiling.rst
new file mode 100644
index 000000000000..d906e9360279
--- /dev/null
+++ b/Documentation/mm/allocation-profiling.rst
@@ -0,0 +1,68 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+MEMORY ALLOCATION PROFILING
+===========================
+
+Low overhead (suitable for production) accounting of all memory allocations,
+tracked by file and line number.
+
+Usage:
+kconfig options:
+ - CONFIG_MEM_ALLOC_PROFILING
+ - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
+ - CONFIG_MEM_ALLOC_PROFILING_DEBUG
+ adds warnings for allocations that weren't accounted because of a
+ missing annotation
+
+sysctl:
+ /proc/sys/vm/mem_profiling
+
+Runtime info:
+ /proc/allocinfo
+
+Example output:
+ root@moria-kvm:~# sort -h /proc/allocinfo|tail
+ 3.11MiB 2850 fs/ext4/super.c:1408 module:ext4 func:ext4_alloc_inode
+ 3.52MiB 225 kernel/fork.c:356 module:fork func:alloc_thread_stack_node
+ 3.75MiB 960 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
+ 4.00MiB 2 mm/khugepaged.c:893 module:khugepaged func:hpage_collapse_alloc_folio
+ 10.5MiB 168 block/blk-mq.c:3421 module:blk_mq func:blk_mq_alloc_rqs
+ 14.0MiB 3594 include/linux/gfp.h:295 module:filemap func:folio_alloc_noprof
+ 26.8MiB 6856 include/linux/gfp.h:295 module:memory func:folio_alloc_noprof
+ 64.5MiB 98315 fs/xfs/xfs_rmap_item.c:147 module:xfs func:xfs_rui_init
+ 98.7MiB 25264 include/linux/gfp.h:295 module:readahead func:folio_alloc_noprof
+ 125MiB 7357 mm/slub.c:2201 module:slub func:alloc_slab_page
+
+
+Theory of operation:
+
+Memory allocation profiling builds off of code tagging, which is a library for
+declaring static structs (that typically describe a file and line number in
+some way, hence code tagging) and then finding and operating on them at runtime
+- i.e. iterating over them to print them in debugfs/procfs.
+
+To add accounting for an allocation call, we replace it with a macro
+invocation, alloc_hooks(), that
+ - declares a code tag
+ - stashes a pointer to it in task_struct
+ - calls the real allocation function
+ - and finally, restores the task_struct alloc tag pointer to its previous value.
+
+This allows for alloc_hooks() calls to be nested, with the most recent one
+taking effect. This is important for allocations internal to the mm/ code that
+do not properly belong to the outer allocation context and should be counted
+separately: for example, slab object extension vectors, or when the slab
+allocates pages from the page allocator.
+
+Thus, proper usage requires determining which function in an allocation call
+stack should be tagged. There are many helper functions that essentially wrap
+e.g. kmalloc() and do a little more work, then are called in multiple places;
+we'll generally want the accounting to happen in the callers of these helpers,
+not in the helpers themselves.
+
+To fix up a given helper, for example foo(), do the following:
+ - switch its allocation call to the _noprof() version, e.g. kmalloc_noprof()
+ - rename it to foo_noprof()
+ - define a macro version of foo() like so:
+ #define foo(...) alloc_hooks(foo_noprof(__VA_ARGS__))
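The save/restore steps listed above can be sketched in plain user-space C. This is only a model under stated assumptions, not the patchset's implementation: struct alloc_tag here is a simplified stand-in, current_tag replaces the pointer kept in task_struct, and sketch_alloc_noprof(), outer_noprof() and last_charged are hypothetical names introduced for illustration. It relies on GNU statement expressions, as the kernel does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified, hypothetical stand-in for a per-callsite tag. */
struct alloc_tag {
	const char *file;
	int line;
	size_t bytes;
	size_t calls;
};

/* Stand-in for the tag pointer the real code keeps in task_struct. */
static struct alloc_tag *current_tag;
/* Illustration aid only: the tag charged by the most recent allocation. */
static struct alloc_tag *last_charged;

/* The "real" allocator: charges whichever tag is current, then allocates. */
static void *sketch_alloc_noprof(size_t size)
{
	if (current_tag) {
		current_tag->bytes += size;
		current_tag->calls++;
	}
	last_charged = current_tag;
	return malloc(size);
}

/* Declare a tag, stash it, run the call, then restore the previous tag. */
#define alloc_hooks(call)						\
({									\
	static struct alloc_tag _tag = { __FILE__, __LINE__, 0, 0 };	\
	struct alloc_tag *_old = current_tag;				\
	current_tag = &_tag;						\
	__typeof__(call) _res = (call);					\
	current_tag = _old;						\
	_res;								\
})

#define sketch_alloc(...) alloc_hooks(sketch_alloc_noprof(__VA_ARGS__))

/* A helper with an internal allocation that must be counted separately. */
static void *outer_noprof(void)
{
	void *inner = sketch_alloc(32);	/* nested hook: charged to its own tag */

	free(inner);
	return sketch_alloc_noprof(100);	/* charged to the caller's tag */
}
#define outer(...) alloc_hooks(outer_noprof(__VA_ARGS__))
```

Calling outer() charges 100 bytes to the tag declared at the outer() call site, while the 32-byte internal allocation lands on the tag declared inside the helper; afterwards current_tag is back to its previous value.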
On 2/13/24 23:38, Kees Cook wrote:
> On Tue, Feb 13, 2024 at 02:35:29PM -0800, Suren Baghdasaryan wrote:
>> On Tue, Feb 13, 2024 at 2:29 PM Darrick J. Wong <[email protected]> wrote:
>> >
>> > On Mon, Feb 12, 2024 at 05:01:19PM -0800, Suren Baghdasaryan wrote:
>> > > On Mon, Feb 12, 2024 at 2:40 PM Kees Cook <[email protected]> wrote:
>> > > >
>> > > > On Mon, Feb 12, 2024 at 01:38:59PM -0800, Suren Baghdasaryan wrote:
>> > > > > Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily
>> > > > > instrument memory allocators. It registers an "alloc_tags" codetag type
>> > > > > with /proc/allocinfo interface to output allocation tag information when
>> > > >
>> > > > Please don't add anything new to the top-level /proc directory. This
>> > > > should likely live in /sys.
>> > >
>> > > Ack. I'll find a more appropriate place for it then.
>> > > It just seemed like such generic information which would belong next
>> > > to meminfo/zoneinfo and such...
>> >
>> > Save yourself a cycle of "rework the whole fs interface only to have
>> > someone else tell you no" and put it in debugfs, not sysfs. Wrangling
>> > with debugfs is easier than all the macro-happy sysfs stuff; you don't
>> > have to integrate with the "device" model; and there is no 'one value
>> > per file' rule.
>>
>> Thanks for the input. This file used to be in debugfs but reviewers
>> felt it belonged in /proc if it's to be used in production
>> environments. Some distros (like Android) disable debugfs in
>> production.
>
> FWIW, I agree debugfs is not right. If others feel it's right in /proc,
> I certainly won't NAK -- it's just been that we've traditionally been
> trying to avoid continuing to pollute the top-level /proc and instead
> associate new things with something in /sys.
Sysfs is really a "one value per file" thing though. /proc might be ok for a
single overview file.
On Thu, Feb 15, 2024 at 5:27 PM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, Feb 15, 2024 at 08:22:44PM -0500, Pasha Tatashin wrote:
> > On Thu, Feb 15, 2024 at 8:00 PM Kent Overstreet
> > <[email protected]> wrote:
> > >
> > > On Thu, Feb 15, 2024 at 04:54:38PM -0800, Andrew Morton wrote:
> > > > On Mon, 12 Feb 2024 13:38:59 -0800 Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > > +Example output.
> > > > > +
> > > > > +::
> > > > > +
> > > > > + > cat /proc/allocinfo
> > > > > +
> > > > > + 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
> > > > > + 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
> > > > > + 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
> > > > > + 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
> > > > > + 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
> > > >
> > > > I don't really like the fancy MiB stuff. Wouldn't it be better to just
> > > > present the amount of memory in plain old bytes, so people can use sort
> > > > -n on it?
> > >
> > > They can use sort -h on it; the string_get_size() patch was specifically
> > > so that we could make the output compatible with sort -h
> > >
> > > > And it's easier to tell big-from-small at a glance because
> > > > big has more digits.
> > > >
> > > > Also, the first thing any sort of downstream processing of this data is
> > > > going to have to do is to convert the fancified output back into
> > > > plain-old-bytes. So why not just emit plain-old-bytes?
> > > >
> > > > If someone wants the fancy output (and nobody does) then that can be
> > > > done in userspace.
> > >
> > > I like simpler, more discoverable tools; e.g. we've got a bunch of
> > > interesting stuff in scripts/ but it doesn't get used nearly as much -
> > > not as accessible as cat'ing a file, definitely not going to be
> > > installed by default.
> >
> > I also prefer plain bytes instead of MiB. A driver developer that
> > wants to verify up-to the byte allocations for a new data structure
> > that they added is going to be disappointed by the rounded MiB
> > numbers.
>
> That's a fair point.
>
> > The data contained in this file is not consumable without at least
> > "sort -h -r", so why not just output bytes instead?
> >
> > There is /proc/slabinfo and there is a slabtop tool.
> > For raw /proc/allocinfo we can create an alloctop tool that would
> > parse, sort and show data in human readable format based on various
> > criteria.
> >
> > We should also add at the top of this file "allocinfo - version: 1.0",
> > to allow future extensions (i.e. column for proc name).
>
> How would we feel about exposing two different versions in /proc? It
> should be a pretty minimal addition to .text.
>
> Personally, I hate trying to count long strings of digits by eyeball...
Maybe something like this work for everyone then?:
160432128 (153MiB) mm/slub.c:1826 module:slub func:alloc_slab_page
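For illustration, the combined "raw bytes plus human-readable" column could be produced along these lines. This is a user-space sketch, not the kernel's string_get_size(): format_alloc_size() is a hypothetical helper, and the three-significant-digit rule is an assumption chosen to match the example values in this thread.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "<bytes> (<scaled><unit>)" into buf, e.g. "160432128 (153MiB)".
 * Returns what snprintf() returns. Values that round up across a
 * digit boundary (e.g. 99.96) are not special-cased in this sketch.
 */
static int format_alloc_size(uint64_t bytes, char *buf, size_t len)
{
	static const char *const units[] = { "B", "KiB", "MiB", "GiB", "TiB" };
	double v = (double)bytes;
	size_t u = 0;
	int prec;

	/* Scale down by 1024 until the value fits the current unit. */
	while (v >= 1024.0 && u + 1 < sizeof(units) / sizeof(units[0])) {
		v /= 1024.0;
		u++;
	}

	/* Roughly three significant digits; plain bytes get no fraction. */
	if (u == 0 || v >= 100.0)
		prec = 0;
	else if (v >= 10.0)
		prec = 1;
	else
		prec = 2;

	return snprintf(buf, len, "%llu (%.*f%s)",
			(unsigned long long)bytes, prec, v, units[u]);
}
```

With the leading column in plain bytes, both sort -n and byte-exact comparisons work, and the parenthesised value stays readable at a glance.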
On Fri, Feb 16, 2024 at 1:02 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Thu, Feb 15, 2024 at 5:27 PM Kent Overstreet
> <[email protected]> wrote:
> >
> > On Thu, Feb 15, 2024 at 08:22:44PM -0500, Pasha Tatashin wrote:
> > > On Thu, Feb 15, 2024 at 8:00 PM Kent Overstreet
> > > <[email protected]> wrote:
> > > >
> > > > On Thu, Feb 15, 2024 at 04:54:38PM -0800, Andrew Morton wrote:
> > > > > On Mon, 12 Feb 2024 13:38:59 -0800 Suren Baghdasaryan <[email protected]> wrote:
> > > > >
> > > > > > +Example output.
> > > > > > +
> > > > > > +::
> > > > > > +
> > > > > > + > cat /proc/allocinfo
> > > > > > +
> > > > > > + 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
> > > > > > + 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
> > > > > > + 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
> > > > > > + 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
> > > > > > + 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
> > > > >
> > > > > I don't really like the fancy MiB stuff. Wouldn't it be better to just
> > > > > present the amount of memory in plain old bytes, so people can use sort
> > > > > -n on it?
> > > >
> > > > They can use sort -h on it; the string_get_size() patch was specifically
> > > > so that we could make the output compatible with sort -h
> > > >
> > > > > And it's easier to tell big-from-small at a glance because
> > > > > big has more digits.
> > > > >
> > > > > Also, the first thing any sort of downstream processing of this data is
> > > > > going to have to do is to convert the fancified output back into
> > > > > plain-old-bytes. So why not just emit plain-old-bytes?
> > > > >
> > > > > If someone wants the fancy output (and nobody does) then that can be
> > > > > done in userspace.
> > > >
> > > > I like simpler, more discoverable tools; e.g. we've got a bunch of
> > > > interesting stuff in scripts/ but it doesn't get used nearly as much -
> > > > not as accessible as cat'ing a file, definitely not going to be
> > > > installed by default.
> > >
> > > I also prefer plain bytes instead of MiB. A driver developer that
> > > wants to verify up-to the byte allocations for a new data structure
> > > that they added is going to be disappointed by the rounded MiB
> > > numbers.
> >
> > That's a fair point.
> >
> > > The data contained in this file is not consumable without at least
> > > "sort -h -r", so why not just output bytes instead?
> > >
> > > There is /proc/slabinfo and there is a slabtop tool.
> > > For raw /proc/allocinfo we can create an alloctop tool that would
> > > parse, sort and show data in human readable format based on various
> > > criteria.
> > >
> > > We should also add at the top of this file "allocinfo - version: 1.0",
> > > to allow future extensions (i.e. column for proc name).
> >
> > How would we feel about exposing two different versions in /proc? It
> > should be a pretty minimal addition to .text.
> >
> > Personally, I hate trying to count long strings of digits by eyeball...
>
> Maybe something like this work for everyone then?:
s/work/would work
making too many mistakes. time for bed...
>
> 160432128 (153MiB) mm/slub.c:1826 module:slub func:alloc_slab_page
On Fri, 16 Feb 2024, Kent Overstreet <[email protected]> wrote:
> On Fri, Feb 16, 2024 at 10:38:00AM +0200, Jani Nikula wrote:
>> I wonder if it wouldn't be too much trouble to write at least a brief
>> overview document under Documentation/ describing what this is all
>> about? Even as follow-up. People seeing the patch series have the
>> benefit of the cover letter and the commit messages, but that's hardly
>> documentation.
>>
>> We have all these great frameworks and tools but their discoverability
>> to kernel developers isn't always all that great.
>
> commit f589b48789de4b8f77bfc70b9f3ab2013c01eaf2
> Author: Kent Overstreet <[email protected]>
> Date: Wed Feb 14 01:13:04 2024 -0500
>
> memprofiling: Documentation
>
> Signed-off-by: Kent Overstreet <[email protected]>
Thanks! Wasn't part of this series and I wasn't aware it existed.
BR,
Jani.
--
Jani Nikula, Intel
On 2/12/24 22:39, Suren Baghdasaryan wrote:
> Introduce helper functions to easily instrument page allocators by
> storing a pointer to the allocation tag associated with the code that
> allocated the page in a page_ext field.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Co-developed-by: Kent Overstreet <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> +
> +#ifdef CONFIG_MEM_ALLOC_PROFILING
> +
> +#include <linux/page_ext.h>
> +
> +extern struct page_ext_operations page_alloc_tagging_ops;
> +extern struct page_ext *page_ext_get(struct page *page);
> +extern void page_ext_put(struct page_ext *page_ext);
> +
> +static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext)
> +{
> + return (void *)page_ext + page_alloc_tagging_ops.offset;
> +}
> +
> +static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref)
> +{
> + return (void *)ref - page_alloc_tagging_ops.offset;
> +}
> +
> +static inline union codetag_ref *get_page_tag_ref(struct page *page)
> +{
> + if (page && mem_alloc_profiling_enabled()) {
> + struct page_ext *page_ext = page_ext_get(page);
> +
> + if (page_ext)
> + return codetag_ref_from_page_ext(page_ext);
I think when structured like this, you're not getting the full benefits of
static keys, and the compiler probably can't improve that on its own.
- page is tested before the static branch is evaluated
- when disabled, the result is NULL, and that's again tested in the callers
> + }
> + return NULL;
> +}
> +
> +static inline void put_page_tag_ref(union codetag_ref *ref)
> +{
> + page_ext_put(page_ext_from_codetag_ref(ref));
> +}
> +
> +static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
> + unsigned int order)
> +{
> + union codetag_ref *ref = get_page_tag_ref(page);
So the more optimal way would be to test mem_alloc_profiling_enabled() here
as the very first thing before trying to get the ref.
> + if (ref) {
> + alloc_tag_add(ref, task->alloc_tag, PAGE_SIZE << order);
> + put_page_tag_ref(ref);
> + }
> +}
> +
> +static inline void pgalloc_tag_sub(struct page *page, unsigned int order)
> +{
> + union codetag_ref *ref = get_page_tag_ref(page);
And same here.
> + if (ref) {
> + alloc_tag_sub(ref, PAGE_SIZE << order);
> + put_page_tag_ref(ref);
> + }
> +}
> +
> +#else /* CONFIG_MEM_ALLOC_PROFILING */
> +
> +static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
> + unsigned int order) {}
> +static inline void pgalloc_tag_sub(struct page *page, unsigned int order) {}
> +
> +#endif /* CONFIG_MEM_ALLOC_PROFILING */
> +
> +#endif /* _LINUX_PGALLOC_TAG_H */
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 78d258ca508f..7bbdb0ddb011 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -978,6 +978,7 @@ config MEM_ALLOC_PROFILING
> depends on PROC_FS
> depends on !DEBUG_FORCE_WEAK_PER_CPU
> select CODE_TAGGING
> + select PAGE_EXTENSION
> help
> Track allocation source code and record total allocation size
> initiated at that code location. The mechanism can be used to track
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index 4fc031f9cefd..2d5226d9262d 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -3,6 +3,7 @@
> #include <linux/fs.h>
> #include <linux/gfp.h>
> #include <linux/module.h>
> +#include <linux/page_ext.h>
> #include <linux/proc_fs.h>
> #include <linux/seq_buf.h>
> #include <linux/seq_file.h>
> @@ -124,6 +125,22 @@ static bool alloc_tag_module_unload(struct codetag_type *cttype,
> return module_unused;
> }
>
> +static __init bool need_page_alloc_tagging(void)
> +{
> + return true;
So this means the page_ext memory overhead is paid unconditionally once
MEM_ALLOC_PROFILING is compile time enabled, even if never enabled during
runtime? That makes it rather costly to be suitable for generic distro
kernels where the code could be compile time enabled, and runtime enabling
suggested in a debugging/support scenario. It's what we do with page_owner,
debug_pagealloc, slub_debug etc.
Ideally we'd have some vmalloc based page_ext flavor for later-than-boot
runtime enablement, as we now have for stackdepot. But that could be
explored later. For now it would be sufficient to add an early_param boot
parameter to control the enablement including page_ext, like page_owner and
other features do.
> +}
> +
> +static __init void init_page_alloc_tagging(void)
> +{
> +}
> +
> +struct page_ext_operations page_alloc_tagging_ops = {
> + .size = sizeof(union codetag_ref),
> + .need = need_page_alloc_tagging,
> + .init = init_page_alloc_tagging,
> +};
> +EXPORT_SYMBOL(page_alloc_tagging_ops);
On 2/12/24 22:39, Suren Baghdasaryan wrote:
> When a high-order page is split into smaller ones, each newly split
> page should get its codetag. The original codetag is reused for these
> pages but it's recorded as 0-byte allocation because original codetag
> already accounts for the original high-order allocated page.
Wouldn't it be possible to adjust the original's accounted size and
redistribute to the split pages for more accuracy?
On 2/12/24 22:39, Suren Baghdasaryan wrote:
> To store code tag for every slab object, a codetag reference is embedded
> into slabobj_ext when CONFIG_MEM_ALLOC_PROFILING=y.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Co-developed-by: Kent Overstreet <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> include/linux/memcontrol.h | 5 +++++
> lib/Kconfig.debug | 1 +
> mm/slab.h | 4 ++++
> 3 files changed, 10 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f3584e98b640..2b010316016c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1653,7 +1653,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> * if MEMCG_DATA_OBJEXTS is set.
> */
> struct slabobj_ext {
> +#ifdef CONFIG_MEMCG_KMEM
> struct obj_cgroup *objcg;
> +#endif
> +#ifdef CONFIG_MEM_ALLOC_PROFILING
> + union codetag_ref ref;
> +#endif
> } __aligned(8);
So this means that compiling with CONFIG_MEM_ALLOC_PROFILING will increase
the memory overhead of arrays allocated for CONFIG_MEMCG_KMEM, even if
allocation profiling itself is not enabled in runtime? Similar concern to
the unconditional page_ext usage, that this would hinder enabling in a
general distro kernel.
The unused field overhead would be smaller than currently page_ext, but
getting rid of it when alloc profiling is not enabled would be more work
than introducing an early boot param for the page_ext case. Could be however
solved similarly to how page_ext is populated dynamically at runtime.
Hopefully it wouldn't add noticeable cpu overhead.
> static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 7bbdb0ddb011..9ecfcdb54417 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -979,6 +979,7 @@ config MEM_ALLOC_PROFILING
> depends on !DEBUG_FORCE_WEAK_PER_CPU
> select CODE_TAGGING
> select PAGE_EXTENSION
> + select SLAB_OBJ_EXT
> help
> Track allocation source code and record total allocation size
> initiated at that code location. The mechanism can be used to track
> diff --git a/mm/slab.h b/mm/slab.h
> index 77cf7474fe46..224a4b2305fb 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -569,6 +569,10 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>
> static inline bool need_slab_obj_ext(void)
> {
> +#ifdef CONFIG_MEM_ALLOC_PROFILING
> + if (mem_alloc_profiling_enabled())
> + return true;
> +#endif
> /*
> * CONFIG_MEMCG_KMEM creates vector of obj_cgroup objects conditionally
> * inside memcg_slab_post_alloc_hook. No other users for now.
On 2/12/24 22:39, Suren Baghdasaryan wrote:
> Account slab allocations using codetag reference embedded into slabobj_ext.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Co-developed-by: Kent Overstreet <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> mm/slab.h | 26 ++++++++++++++++++++++++++
> mm/slub.c | 5 +++++
> 2 files changed, 31 insertions(+)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 224a4b2305fb..c4bd0d5348cb 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -629,6 +629,32 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
>
> #endif /* CONFIG_SLAB_OBJ_EXT */
>
> +#ifdef CONFIG_MEM_ALLOC_PROFILING
> +
> +static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> + void **p, int objects)
> +{
> + struct slabobj_ext *obj_exts;
> + int i;
> +
> + obj_exts = slab_obj_exts(slab);
> + if (!obj_exts)
> + return;
> +
> + for (i = 0; i < objects; i++) {
> + unsigned int off = obj_to_index(s, slab, p[i]);
> +
> + alloc_tag_sub(&obj_exts[off].ref, s->size);
> + }
> +}
> +
> +#else
> +
> +static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> + void **p, int objects) {}
> +
> +#endif /* CONFIG_MEM_ALLOC_PROFILING */
You don't actually use the alloc_tagging_slab_free_hook() anywhere? I see
it's in the next patch, but logically should belong to this one.
> +
> #ifdef CONFIG_MEMCG_KMEM
> void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> enum node_stat_item idx, int nr);
> diff --git a/mm/slub.c b/mm/slub.c
> index 9fd96238ed39..f4d5794c1e86 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3821,6 +3821,11 @@ void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg,
> s->flags, init_flags);
> kmsan_slab_alloc(s, p[i], init_flags);
> obj_exts = prepare_slab_obj_exts_hook(s, flags, p[i]);
> +#ifdef CONFIG_MEM_ALLOC_PROFILING
> + /* obj_exts can be allocated for other reasons */
> + if (likely(obj_exts) && mem_alloc_profiling_enabled())
> + alloc_tag_add(&obj_exts->ref, current->alloc_tag, s->size);
> +#endif
> }
>
> memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
On Fri, Feb 16, 2024 at 05:31:11PM +0100, Vlastimil Babka wrote:
> On 2/12/24 22:39, Suren Baghdasaryan wrote:
> > Account slab allocations using codetag reference embedded into slabobj_ext.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Co-developed-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Kent Overstreet <[email protected]>
> > ---
> > mm/slab.h | 26 ++++++++++++++++++++++++++
> > mm/slub.c | 5 +++++
> > 2 files changed, 31 insertions(+)
> >
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 224a4b2305fb..c4bd0d5348cb 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -629,6 +629,32 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
> >
> > #endif /* CONFIG_SLAB_OBJ_EXT */
> >
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > +
> > +static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> > + void **p, int objects)
> > +{
> > + struct slabobj_ext *obj_exts;
> > + int i;
> > +
> > + obj_exts = slab_obj_exts(slab);
> > + if (!obj_exts)
> > + return;
> > +
> > + for (i = 0; i < objects; i++) {
> > + unsigned int off = obj_to_index(s, slab, p[i]);
> > +
> > + alloc_tag_sub(&obj_exts[off].ref, s->size);
> > + }
> > +}
> > +
> > +#else
> > +
> > +static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> > + void **p, int objects) {}
> > +
> > +#endif /* CONFIG_MEM_ALLOC_PROFILING */
>
> You don't actually use the alloc_tagging_slab_free_hook() anywhere? I see
> it's in the next patch, but logically should belong to this one.
I don't think it makes any sense to quibble about introducing something
in one patch that's not used until the next patch; often times, it's
just easier to review that way.
On Fri, Feb 16, 2024 at 1:45 AM Vlastimil Babka <[email protected]> wrote:
>
> On 2/12/24 22:39, Suren Baghdasaryan wrote:
> > Introduce helper functions to easily instrument page allocators by
> > storing a pointer to the allocation tag associated with the code that
> > allocated the page in a page_ext field.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Co-developed-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Kent Overstreet <[email protected]>
> > +
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > +
> > +#include <linux/page_ext.h>
> > +
> > +extern struct page_ext_operations page_alloc_tagging_ops;
> > +extern struct page_ext *page_ext_get(struct page *page);
> > +extern void page_ext_put(struct page_ext *page_ext);
> > +
> > +static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext)
> > +{
> > + return (void *)page_ext + page_alloc_tagging_ops.offset;
> > +}
> > +
> > +static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref)
> > +{
> > + return (void *)ref - page_alloc_tagging_ops.offset;
> > +}
> > +
> > +static inline union codetag_ref *get_page_tag_ref(struct page *page)
> > +{
> > + if (page && mem_alloc_profiling_enabled()) {
> > + struct page_ext *page_ext = page_ext_get(page);
> > +
> > + if (page_ext)
> > + return codetag_ref_from_page_ext(page_ext);
>
> I think when structured like this, you're not getting the full benefits of
> static keys, and the compiler probably can't improve that on its own.
>
> - page is tested before the static branch is evaluated
> - when disabled, the result is NULL, and that's again tested in the callers
Yes, that sounds right. I'll move the static branch check earlier like
you suggested. Thanks!
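A rough user-space model of the reordering discussed here, to make the intent concrete. Everything in it is an assumption for illustration: profiling_enabled is a plain bool standing in for the static key, sketch_get_ref() stands in for the page_ext lookup, and SKETCH_PAGE_SIZE is an arbitrary value. The point is only that the enabled check comes first, so the disabled path never performs the lookup or the NULL tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* arbitrary value for the sketch */

struct sketch_page {
	size_t tagged_bytes;	/* stand-in for the per-page codetag ref */
};

static bool profiling_enabled;		/* stand-in for the static key */
static unsigned int ref_lookups;	/* counts how often the lookup runs */

/* Stand-in for the page_ext based ref lookup. */
static size_t *sketch_get_ref(struct sketch_page *page)
{
	ref_lookups++;
	return page ? &page->tagged_bytes : NULL;
}

static void sketch_pgalloc_tag_add(struct sketch_page *page, unsigned int order)
{
	size_t *ref;

	/* Test the "static key" first: the disabled path does nothing else. */
	if (!profiling_enabled)
		return;

	ref = sketch_get_ref(page);
	if (ref)
		*ref += (size_t)SKETCH_PAGE_SIZE << order;
}
```

With profiling disabled, sketch_pgalloc_tag_add() returns before touching the page or the lookup, which is the behaviour the review asks for in pgalloc_tag_add() and pgalloc_tag_sub().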
>
> > + }
> > + return NULL;
> > +}
> > +
> > +static inline void put_page_tag_ref(union codetag_ref *ref)
> > +{
> > + page_ext_put(page_ext_from_codetag_ref(ref));
> > +}
> > +
> > +static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
> > + unsigned int order)
> > +{
> > + union codetag_ref *ref = get_page_tag_ref(page);
>
> So the more optimal way would be to test mem_alloc_profiling_enabled() here
> as the very first thing before trying to get the ref.
>
> > + if (ref) {
> > + alloc_tag_add(ref, task->alloc_tag, PAGE_SIZE << order);
> > + put_page_tag_ref(ref);
> > + }
> > +}
> > +
> > +static inline void pgalloc_tag_sub(struct page *page, unsigned int order)
> > +{
> > + union codetag_ref *ref = get_page_tag_ref(page);
>
> And same here.
>
> > + if (ref) {
> > + alloc_tag_sub(ref, PAGE_SIZE << order);
> > + put_page_tag_ref(ref);
> > + }
> > +}
> > +
> > +#else /* CONFIG_MEM_ALLOC_PROFILING */
> > +
> > +static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
> > + unsigned int order) {}
> > +static inline void pgalloc_tag_sub(struct page *page, unsigned int order) {}
> > +
> > +#endif /* CONFIG_MEM_ALLOC_PROFILING */
> > +
> > +#endif /* _LINUX_PGALLOC_TAG_H */
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index 78d258ca508f..7bbdb0ddb011 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -978,6 +978,7 @@ config MEM_ALLOC_PROFILING
> > depends on PROC_FS
> > depends on !DEBUG_FORCE_WEAK_PER_CPU
> > select CODE_TAGGING
> > + select PAGE_EXTENSION
> > help
> > Track allocation source code and record total allocation size
> > initiated at that code location. The mechanism can be used to track
> > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > index 4fc031f9cefd..2d5226d9262d 100644
> > --- a/lib/alloc_tag.c
> > +++ b/lib/alloc_tag.c
> > @@ -3,6 +3,7 @@
> > #include <linux/fs.h>
> > #include <linux/gfp.h>
> > #include <linux/module.h>
> > +#include <linux/page_ext.h>
> > #include <linux/proc_fs.h>
> > #include <linux/seq_buf.h>
> > #include <linux/seq_file.h>
> > @@ -124,6 +125,22 @@ static bool alloc_tag_module_unload(struct codetag_type *cttype,
> > return module_unused;
> > }
> >
> > +static __init bool need_page_alloc_tagging(void)
> > +{
> > + return true;
>
> So this means the page_ext memory overhead is paid unconditionally once
> MEM_ALLOC_PROFILING is compile time enabled, even if never enabled during
> runtime? That makes it rather costly to be suitable for generic distro
> kernels where the code could be compile time enabled, and runtime enabling
> suggested in a debugging/support scenario. It's what we do with page_owner,
> debug_pagealloc, slub_debug etc.
>
> Ideally we'd have some vmalloc based page_ext flavor for later-than-boot
> runtime enablement, as we now have for stackdepot. But that could be
> explored later. For now it would be sufficient to add an early_param boot
> parameter to control the enablement including page_ext, like page_owner and
> other features do.
Sounds reasonable. In v1 of this patchset we used early boot parameter
but after LSF/MM discussion that was changed to runtime controls.
Sounds like we would need both here. Should be easy to add.
Allocating/reclaiming dynamically the space for page_ext, slab_ext,
etc is not trivial and if done would be done separately. I looked into
it before and listed the encountered issues in the cover letter of v2
[1], see "things we could not address" section.
[1] https://lore.kernel.org/all/[email protected]/
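As a sketch of the boot-parameter gating being discussed: the need() callback would return a flag set while parsing the command line, instead of returning true unconditionally. The code below is a user-space model under assumptions; the accepted values ("1"/"on", "0"/"off") and all function names are hypothetical, and a real version would be registered as an early parameter in the kernel rather than called directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Set once from the (hypothetical) boot parameter; off by default. */
static bool mem_profiling_boot_enabled;

/*
 * Parse the parameter value. Returns 0 on success and -1 for a missing
 * or unrecognised value, leaving the current setting untouched.
 */
static int sketch_parse_mem_profiling(const char *val)
{
	if (!val)
		return -1;
	if (!strcmp(val, "1") || !strcmp(val, "on")) {
		mem_profiling_boot_enabled = true;
		return 0;
	}
	if (!strcmp(val, "0") || !strcmp(val, "off")) {
		mem_profiling_boot_enabled = false;
		return 0;
	}
	return -1;
}

/* Model of the page_ext need() callback: reserve space only if enabled. */
static bool sketch_need_page_alloc_tagging(void)
{
	return mem_profiling_boot_enabled;
}
```

Under this model the page_ext space is reserved only when the parameter asks for it, and the runtime control can then toggle accounting on kernels booted with it enabled.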
>
> > +}
> > +
> > +static __init void init_page_alloc_tagging(void)
> > +{
> > +}
> > +
> > +struct page_ext_operations page_alloc_tagging_ops = {
> > + .size = sizeof(union codetag_ref),
> > + .need = need_page_alloc_tagging,
> > + .init = init_page_alloc_tagging,
> > +};
> > +EXPORT_SYMBOL(page_alloc_tagging_ops);
On Fri, Feb 16, 2024 at 6:33 AM Vlastimil Babka <[email protected]> wrote:
>
> On 2/12/24 22:39, Suren Baghdasaryan wrote:
> > When a high-order page is split into smaller ones, each newly split
> > page should get its codetag. The original codetag is reused for these
> > pages but it's recorded as 0-byte allocation because original codetag
> > already accounts for the original high-order allocated page.
>
> Wouldn't it be possible to adjust the original's accounted size and
> redistribute to the split pages for more accuracy?
I can't recall why I didn't do it that way but I'll try to change and
see if something non-obvious comes up. Thanks!
>
On 2/12/24 22:39, Suren Baghdasaryan wrote:
> Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. to record allocations
> and deallocations done by these functions.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Co-developed-by: Kent Overstreet <[email protected]>
> Signed-off-by: Kent Overstreet <[email protected]>
> -}
> +#define kvmalloc(_size, _flags) kvmalloc_node(_size, _flags, NUMA_NO_NODE)
> +#define kvzalloc(_size, _flags) kvmalloc(_size, _flags|__GFP_ZERO)
>
> -static inline __alloc_size(1, 2) void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
This has __alloc_size(1, 2)
> -{
> - size_t bytes;
> -
> - if (unlikely(check_mul_overflow(n, size, &bytes)))
> - return NULL;
> +#define kvzalloc_node(_size, _flags, _node) kvmalloc_node(_size, _flags|__GFP_ZERO, _node)
>
> - return kvmalloc(bytes, flags);
> -}
> +#define kvmalloc_array(_n, _size, _flags) \
> +({ \
> + size_t _bytes; \
> + \
> + !check_mul_overflow(_n, _size, &_bytes) ? kvmalloc(_bytes, _flags) : NULL; \
> +})
But with the calculation now done in a macro, that's gone?
> -static inline __alloc_size(1, 2) void *kvcalloc(size_t n, size_t size, gfp_t flags)
Same here...
> -{
> - return kvmalloc_array(n, size, flags | __GFP_ZERO);
> -}
> +#define kvcalloc(_n, _size, _flags) kvmalloc_array(_n, _size, _flags|__GFP_ZERO)
... transitively?
But that's for Kees to review, I'm just not sure if he missed it or it's fine.
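To make the question concrete, here is a user-space sketch of the two forms side by side. It assumes GCC or Clang (for __builtin_mul_overflow, the alloc_size attribute and statement expressions) and uses malloc() in place of kvmalloc(); sketch_array_alloc() and sketch_array_alloc_macro() are hypothetical names. It does not settle whether the compiler still infers the size through the macro form; that depends on how the underlying allocator is annotated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Function form: the attribute states that the returned object holds
 * n * size bytes, which the compiler can use for bounds diagnostics.
 */
__attribute__((alloc_size(1, 2)))
static inline void *sketch_array_alloc(size_t n, size_t size)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;
	return malloc(bytes);
}

/*
 * Macro form: same overflow check, but there is no function on which to
 * hang alloc_size(1, 2); only the annotation on the allocator that is
 * finally called (here malloc) remains.
 */
#define sketch_array_alloc_macro(n, size)				\
({									\
	size_t _bytes;							\
									\
	!__builtin_mul_overflow((n), (size), &_bytes) ?			\
		malloc(_bytes) : NULL;					\
})
```

Both forms return NULL when n * size overflows; the difference under review is purely in what the compiler is told about the size of the result.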
On Fri, Feb 16, 2024 at 7:36 AM Vlastimil Babka <[email protected]> wrote:
>
> On 2/12/24 22:39, Suren Baghdasaryan wrote:
> > To store code tag for every slab object, a codetag reference is embedded
> > into slabobj_ext when CONFIG_MEM_ALLOC_PROFILING=y.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Co-developed-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Kent Overstreet <[email protected]>
> > ---
> > include/linux/memcontrol.h | 5 +++++
> > lib/Kconfig.debug | 1 +
> > mm/slab.h | 4 ++++
> > 3 files changed, 10 insertions(+)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index f3584e98b640..2b010316016c 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1653,7 +1653,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> > * if MEMCG_DATA_OBJEXTS is set.
> > */
> > struct slabobj_ext {
> > +#ifdef CONFIG_MEMCG_KMEM
> > struct obj_cgroup *objcg;
> > +#endif
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > + union codetag_ref ref;
> > +#endif
> > } __aligned(8);
>
> So this means that compiling with CONFIG_MEM_ALLOC_PROFILING will increase
> the memory overhead of arrays allocated for CONFIG_MEMCG_KMEM, even if
> allocation profiling itself is not enabled in runtime? Similar concern to
> the unconditional page_ext usage, that this would hinder enabling in a
> general distro kernel.
>
> The unused field overhead would be smaller than currently page_ext, but
> getting rid of it when alloc profiling is not enabled would be more work
> than introducing an early boot param for the page_ext case. Could be however
> solved similarly to how page_ext is populated dynamically at runtime.
> Hopefully it wouldn't add noticeable cpu overhead.
Yes, the slabobj_ext overhead is much smaller than the page_ext one, but it
is still considerable and would be harder to eliminate. Boot-time resizing
of the extension object might be doable, but that again would be quite
complex and is better done as a separate patchset. This is lower on my
TODO list than the page_ext work since the overhead is an order of
magnitude smaller.
>
> > static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index 7bbdb0ddb011..9ecfcdb54417 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -979,6 +979,7 @@ config MEM_ALLOC_PROFILING
> > depends on !DEBUG_FORCE_WEAK_PER_CPU
> > select CODE_TAGGING
> > select PAGE_EXTENSION
> > + select SLAB_OBJ_EXT
> > help
> > Track allocation source code and record total allocation size
> > initiated at that code location. The mechanism can be used to track
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 77cf7474fe46..224a4b2305fb 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -569,6 +569,10 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> >
> > static inline bool need_slab_obj_ext(void)
> > {
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > + if (mem_alloc_profiling_enabled())
> > + return true;
> > +#endif
> > /*
> > * CONFIG_MEMCG_KMEM creates vector of obj_cgroup objects conditionally
> > * inside memcg_slab_post_alloc_hook. No other users for now.
>
On Fri, Feb 16, 2024 at 05:52:34PM +0100, Vlastimil Babka wrote:
> On 2/12/24 22:39, Suren Baghdasaryan wrote:
> > Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. to record allocations
> > and deallocations done by these functions.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Co-developed-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Kent Overstreet <[email protected]>
>
>
> > -}
> > +#define kvmalloc(_size, _flags) kvmalloc_node(_size, _flags, NUMA_NO_NODE)
> > +#define kvzalloc(_size, _flags) kvmalloc(_size, _flags|__GFP_ZERO)
> >
> > -static inline __alloc_size(1, 2) void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
>
> This has __alloc_size(1, 2)
>
> > -{
> > - size_t bytes;
> > -
> > - if (unlikely(check_mul_overflow(n, size, &bytes)))
> > - return NULL;
> > +#define kvzalloc_node(_size, _flags, _node) kvmalloc_node(_size, _flags|__GFP_ZERO, _node)
> >
> > - return kvmalloc(bytes, flags);
> > -}
> > +#define kvmalloc_array(_n, _size, _flags) \
> > +({ \
> > + size_t _bytes; \
> > + \
> > + !check_mul_overflow(_n, _size, &_bytes) ? kvmalloc(_bytes, _flags) : NULL; \
> > +})
>
> But with the calculation now done in a macro, that's gone?
>
> > -static inline __alloc_size(1, 2) void *kvcalloc(size_t n, size_t size, gfp_t flags)
>
> Same here...
>
> > -{
> > - return kvmalloc_array(n, size, flags | __GFP_ZERO);
> > -}
> > +#define kvcalloc(_n, _size, _flags) kvmalloc_array(_n, _size, _flags|__GFP_ZERO)
>
> ... transitively?
>
> But that's for Kees to review, I'm just not sure if he missed it or it's fine.
I think this is something we want to keep - we'll fix it.
On Fri, Feb 16, 2024 at 8:39 AM Kent Overstreet
<[email protected]> wrote:
>
> On Fri, Feb 16, 2024 at 05:31:11PM +0100, Vlastimil Babka wrote:
> > On 2/12/24 22:39, Suren Baghdasaryan wrote:
> > > Account slab allocations using codetag reference embedded into slabobj_ext.
> > >
> > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > Co-developed-by: Kent Overstreet <[email protected]>
> > > Signed-off-by: Kent Overstreet <[email protected]>
> > > ---
> > > mm/slab.h | 26 ++++++++++++++++++++++++++
> > > mm/slub.c | 5 +++++
> > > 2 files changed, 31 insertions(+)
> > >
> > > diff --git a/mm/slab.h b/mm/slab.h
> > > index 224a4b2305fb..c4bd0d5348cb 100644
> > > --- a/mm/slab.h
> > > +++ b/mm/slab.h
> > > @@ -629,6 +629,32 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
> > >
> > > #endif /* CONFIG_SLAB_OBJ_EXT */
> > >
> > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > +
> > > +static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> > > + void **p, int objects)
> > > +{
> > > + struct slabobj_ext *obj_exts;
> > > + int i;
> > > +
> > > + obj_exts = slab_obj_exts(slab);
> > > + if (!obj_exts)
> > > + return;
> > > +
> > > + for (i = 0; i < objects; i++) {
> > > + unsigned int off = obj_to_index(s, slab, p[i]);
> > > +
> > > + alloc_tag_sub(&obj_exts[off].ref, s->size);
> > > + }
> > > +}
> > > +
> > > +#else
> > > +
> > > +static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> > > + void **p, int objects) {}
> > > +
> > > +#endif /* CONFIG_MEM_ALLOC_PROFILING */
> >
> > You don't actually use the alloc_tagging_slab_free_hook() anywhere? I see
> > it's in the next patch, but logically should belong to this one.
>
> I don't think it makes any sense to quibble about introducing something
> in one patch that's not used until the next patch; often times, it's
> just easier to review that way.
Yeah, there were several cases where I was debating with myself which
way to split a patch (the same happened, as you noticed, with
prepare_slab_obj_exts_hook()). Since we already moved
prepare_slab_obj_exts_hook(), alloc_tagging_slab_free_hook() will
probably move into the same patch. I'll go over the results once more
to see if the new split makes more sense; if not, I will keep it here.
Thanks!
On 2/12/24 22:39, Suren Baghdasaryan wrote:
> objext objects are created with __GFP_NO_OBJ_EXT flag and therefore have
> no corresponding objext themselves (otherwise we would get an infinite
> recursion). When freeing these objects their codetag will be empty and
> when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled this will lead to false
> warnings. Introduce CODETAG_EMPTY special codetag value to mark
> allocations which intentionally lack codetag to avoid these warnings.
> Set objext codetags to CODETAG_EMPTY before freeing to indicate that
> the codetag is expected to be empty.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/alloc_tag.h | 26 ++++++++++++++++++++++++++
> mm/slab.h | 25 +++++++++++++++++++++++++
> mm/slab_common.c | 1 +
> mm/slub.c | 8 ++++++++
> 4 files changed, 60 insertions(+)
>
> diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
> index 0a5973c4ad77..1f3207097b03 100644
..
> index c4bd0d5348cb..cf332a839bf4 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -567,6 +567,31 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
> int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> gfp_t gfp, bool new_slab);
>
> +
> +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> +
> +static inline void mark_objexts_empty(struct slabobj_ext *obj_exts)
> +{
> + struct slabobj_ext *slab_exts;
> + struct slab *obj_exts_slab;
> +
> + obj_exts_slab = virt_to_slab(obj_exts);
> + slab_exts = slab_obj_exts(obj_exts_slab);
> + if (slab_exts) {
> + unsigned int offs = obj_to_index(obj_exts_slab->slab_cache,
> + obj_exts_slab, obj_exts);
> + /* codetag should be NULL */
> + WARN_ON(slab_exts[offs].ref.ct);
> + set_codetag_empty(&slab_exts[offs].ref);
> + }
> +}
> +
> +#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
> +
> +static inline void mark_objexts_empty(struct slabobj_ext *obj_exts) {}
> +
> +#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
> +
I assume with alloc_slab_obj_exts() moved to slub.c, mark_objexts_empty()
could move there too.
> static inline bool need_slab_obj_ext(void)
> {
> #ifdef CONFIG_MEM_ALLOC_PROFILING
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 21b0b9e9cd9e..d5f75d04ced2 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -242,6 +242,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> * assign slabobj_exts in parallel. In this case the existing
> * objcg vector should be reused.
> */
> + mark_objexts_empty(vec);
> kfree(vec);
> return 0;
> }
> diff --git a/mm/slub.c b/mm/slub.c
> index 4d480784942e..1136ff18b4fe 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1890,6 +1890,14 @@ static inline void free_slab_obj_exts(struct slab *slab)
> if (!obj_exts)
> return;
>
> + /*
> + * obj_exts was created with __GFP_NO_OBJ_EXT flag, therefore its
> + * corresponding extension will be NULL. alloc_tag_sub() will throw a
> + * warning if slab has extensions but the extension of an object is
> + * NULL, therefore replace NULL with CODETAG_EMPTY to indicate that
> + * the extension for obj_exts is expected to be NULL.
> + */
> + mark_objexts_empty(obj_exts);
> kfree(obj_exts);
> slab->obj_exts = 0;
> }
On Thu, Feb 15, 2024 at 10:10 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Thu, Feb 15, 2024 at 1:50 PM Vlastimil Babka <[email protected]> wrote:
> >
> > On 2/15/24 22:37, Kent Overstreet wrote:
> > > On Thu, Feb 15, 2024 at 10:31:06PM +0100, Vlastimil Babka wrote:
> > >> On 2/12/24 22:38, Suren Baghdasaryan wrote:
> > >> > Slab extension objects can't be allocated before slab infrastructure is
> > >> > initialized. Some caches, like kmem_cache and kmem_cache_node, are created
> > >> > before slab infrastructure is initialized. Objects from these caches can't
> > >> > have extension objects. Introduce SLAB_NO_OBJ_EXT slab flag to mark these
> > >> > caches and avoid creating extensions for objects allocated from these
> > >> > slabs.
> > >> >
> > >> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > >> > ---
> > >> > include/linux/slab.h | 7 +++++++
> > >> > mm/slub.c | 5 +++--
> > >> > 2 files changed, 10 insertions(+), 2 deletions(-)
> > >> >
> > >> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > >> > index b5f5ee8308d0..3ac2fc830f0f 100644
> > >> > --- a/include/linux/slab.h
> > >> > +++ b/include/linux/slab.h
> > >> > @@ -164,6 +164,13 @@
> > >> > #endif
> > >> > #define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */
> > >> >
> > >> > +#ifdef CONFIG_SLAB_OBJ_EXT
> > >> > +/* Slab created using create_boot_cache */
> > >> > +#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x20000000U)
> > >>
> > >> There's
> > >> #define SLAB_SKIP_KFENCE ((slab_flags_t __force)0x20000000U)
> > >> already, so need some other one?
>
> Indeed. I somehow missed it. Thanks for noticing, will fix this in the
> next version.
Apparently the only unused slab flag is 0x00000200U, all others seem
to be taken. I'll use it if there are no objections.
>
> > >
> > > What's up with the order of flags in that file? They don't seem to
> > > follow any particular ordering.
> >
> > Seems mostly in increasing order, except commit 4fd0b46e89879 broke it for
> > SLAB_RECLAIM_ACCOUNT?
> >
> > > Seems like some cleanup is in order, but any history/context we should
> > > know first?
> >
> > Yeah noted, but no need to sidetrack you.
On 2/16/24 19:41, Suren Baghdasaryan wrote:
> On Thu, Feb 15, 2024 at 10:10 PM Suren Baghdasaryan <[email protected]> wrote:
>>
>> On Thu, Feb 15, 2024 at 1:50 PM Vlastimil Babka <[email protected]> wrote:
>> >
>> > On 2/15/24 22:37, Kent Overstreet wrote:
>> > > On Thu, Feb 15, 2024 at 10:31:06PM +0100, Vlastimil Babka wrote:
>> > >> On 2/12/24 22:38, Suren Baghdasaryan wrote:
>> > >> > Slab extension objects can't be allocated before slab infrastructure is
>> > >> > initialized. Some caches, like kmem_cache and kmem_cache_node, are created
>> > >> > before slab infrastructure is initialized. Objects from these caches can't
>> > >> > have extension objects. Introduce SLAB_NO_OBJ_EXT slab flag to mark these
>> > >> > caches and avoid creating extensions for objects allocated from these
>> > >> > slabs.
>> > >> >
>> > >> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>> > >> > ---
>> > >> > include/linux/slab.h | 7 +++++++
>> > >> > mm/slub.c | 5 +++--
>> > >> > 2 files changed, 10 insertions(+), 2 deletions(-)
>> > >> >
>> > >> > diff --git a/include/linux/slab.h b/include/linux/slab.h
>> > >> > index b5f5ee8308d0..3ac2fc830f0f 100644
>> > >> > --- a/include/linux/slab.h
>> > >> > +++ b/include/linux/slab.h
>> > >> > @@ -164,6 +164,13 @@
>> > >> > #endif
>> > >> > #define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */
>> > >> >
>> > >> > +#ifdef CONFIG_SLAB_OBJ_EXT
>> > >> > +/* Slab created using create_boot_cache */
>> > >> > +#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x20000000U)
>> > >>
>> > >> There's
>> > >> #define SLAB_SKIP_KFENCE ((slab_flags_t __force)0x20000000U)
>> > >> already, so need some other one?
>>
>> Indeed. I somehow missed it. Thanks for noticing, will fix this in the
>> next version.
>
> Apparently the only unused slab flag is 0x00000200U, all others seem
> to be taken. I'll use it if there are no objections.
OK. Will look into the cleanup and consolidation - we already know
SLAB_MEM_SPREAD became dead with SLAB removed. If worst comes to worst, we
can switch to 64 bits again.
>>
>> > >
>> > > What's up with the order of flags in that file? They don't seem to
>> > > follow any particular ordering.
>> >
>> > Seems mostly in increasing order, except commit 4fd0b46e89879 broke it for
>> > SLAB_RECLAIM_ACCOUNT?
>> >
>> > > Seems like some cleanup is in order, but any history/context we should
>> > > know first?
>> >
>> > Yeah noted, but no need to sidetrack you.
On Mon, Feb 12, 2024 at 02:40:12PM -0800, Kees Cook wrote:
> On Mon, Feb 12, 2024 at 01:38:59PM -0800, Suren Baghdasaryan wrote:
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index ffe8f618ab86..da68a10517c8 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -770,6 +770,10 @@ struct task_struct {
> > unsigned int flags;
> > unsigned int ptrace;
> >
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > + struct alloc_tag *alloc_tag;
> > +#endif
>
> Normally scheduling is very sensitive to having anything early in
> task_struct. I would suggest moving this the CONFIG_SCHED_CORE ifdef
> area.
This is even hotter than the scheduler members; we actually do want it
up front.
On Fri, Feb 16, 2024 at 06:26:06PM -0500, Kent Overstreet wrote:
> On Mon, Feb 12, 2024 at 02:40:12PM -0800, Kees Cook wrote:
> > On Mon, Feb 12, 2024 at 01:38:59PM -0800, Suren Baghdasaryan wrote:
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index ffe8f618ab86..da68a10517c8 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -770,6 +770,10 @@ struct task_struct {
> > > unsigned int flags;
> > > unsigned int ptrace;
> > >
> > > +#ifdef CONFIG_MEM_ALLOC_PROFILING
> > > + struct alloc_tag *alloc_tag;
> > > +#endif
> >
> > Normally scheduling is very sensitive to having anything early in
> > task_struct. I would suggest moving this the CONFIG_SCHED_CORE ifdef
> > area.
>
> This is even hotter than the scheduler members; we actually do want it
> up front.
It is? I would imagine the scheduler would touch stuff more than the
allocator, but whatever works. :)
--
Kees Cook
On Fri, Feb 16, 2024 at 12:18:09PM -0500, Pasha Tatashin wrote:
> > > Personally, I hate trying to count long strings digits by eyeball...
> >
> > Maybe something like this work for everyone then?:
> >
> > 160432128 (153MiB) mm/slub.c:1826 module:slub func:alloc_slab_page
>
> That would be even harder to parse.
>
> This one liner should converts bytes to human readable size:
> sort -rn /proc/allocinfo | numfmt --to=iec
I like this, it doesn't print out that godawful kibibytes crap
On Fri, Feb 16, 2024 at 4:46 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Fri, Feb 16, 2024 at 6:33 AM Vlastimil Babka <[email protected]> wrote:
> >
> > On 2/12/24 22:39, Suren Baghdasaryan wrote:
> > > When a high-order page is split into smaller ones, each newly split
> > > page should get its codetag. The original codetag is reused for these
> > > pages but it's recorded as 0-byte allocation because original codetag
> > > already accounts for the original high-order allocated page.
> >
> > Wouldn't it be possible to adjust the original's accounted size and
> > redistribute to the split pages for more accuracy?
>
> I can't recall why I didn't do it that way but I'll try to change and
> see if something non-obvious comes up. Thanks!
Ok, now I recall what's happening here. alloc_tag_add() effectively
does two things:
1. it sets reference to point to the tag (ref->ct = &tag->ct)
2. it increments tag->counters
In pgalloc_tag_split() by calling
alloc_tag_add(codetag_ref_from_page_ext(page_ext), tag, 0); we
effectively set the reference from new page_ext to point to the
original tag but we keep the tag->counters->bytes counter the same
(incrementing by 0). It still increments tag->counters->calls but I
think we need that because when freeing individual split pages we will
be decrementing this counter for each individual page. We allocated
many pages with one call, then split into smaller pages and will be
freeing them with multiple calls. We need to balance out the call
counter during the split.
I can refactor the part of alloc_tag_add() that sets the reference
into a separate alloc_tag_ref_set() and make it set the reference and
increment tag->counters->calls (with a comment explaining why we need
this increment here). Then I can call alloc_tag_ref_set() from inside
alloc_tag_add() and when splitting pages. I think that will be a bit
clearer.
>
> >
On Fri, Feb 16, 2024 at 8:57 AM Vlastimil Babka <[email protected]> wrote:
>
> On 2/12/24 22:38, Suren Baghdasaryan wrote:
> > Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily
> > instrument memory allocators. It registers an "alloc_tags" codetag type
> > with /proc/allocinfo interface to output allocation tag information when
> > the feature is enabled.
> > CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
> > allocation profiling instrumentation.
> > Memory allocation profiling can be enabled or disabled at runtime using
> > /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n.
> > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation
> > profiling by default.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Co-developed-by: Kent Overstreet <[email protected]>
> > Signed-off-by: Kent Overstreet <[email protected]>
> > ---
> > Documentation/admin-guide/sysctl/vm.rst | 16 +++
> > Documentation/filesystems/proc.rst | 28 +++++
> > include/asm-generic/codetag.lds.h | 14 +++
> > include/asm-generic/vmlinux.lds.h | 3 +
> > include/linux/alloc_tag.h | 133 ++++++++++++++++++++
> > include/linux/sched.h | 24 ++++
> > lib/Kconfig.debug | 25 ++++
> > lib/Makefile | 2 +
> > lib/alloc_tag.c | 158 ++++++++++++++++++++++++
> > scripts/module.lds.S | 7 ++
> > 10 files changed, 410 insertions(+)
> > create mode 100644 include/asm-generic/codetag.lds.h
> > create mode 100644 include/linux/alloc_tag.h
> > create mode 100644 lib/alloc_tag.c
> >
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index c59889de122b..a214719492ea 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -43,6 +43,7 @@ Currently, these files are in /proc/sys/vm:
> > - legacy_va_layout
> > - lowmem_reserve_ratio
> > - max_map_count
> > +- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y)
> > - memory_failure_early_kill
> > - memory_failure_recovery
> > - min_free_kbytes
> > @@ -425,6 +426,21 @@ e.g., up to one or two maps per allocation.
> > The default value is 65530.
> >
> >
> > +mem_profiling
> > +==============
> > +
> > +Enable memory profiling (when CONFIG_MEM_ALLOC_PROFILING=y)
> > +
> > +1: Enable memory profiling.
> > +
> > +0: Disabld memory profiling.
>
> Disable
Ack.
>
> ...
>
> > +allocinfo
> > +~~~~~~~
> > +
> > +Provides information about memory allocations at all locations in the code
> > +base. Each allocation in the code is identified by its source file, line
> > +number, module and the function calling the allocation. The number of bytes
> > +allocated at each location is reported.
>
> See, it even says "number of bytes" :)
Yes, we are changing the output to bytes.
>
> > +
> > +Example output.
> > +
> > +::
> > +
> > + > cat /proc/allocinfo
> > +
> > + 153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
>
> Is "module" meant in the usual kernel module sense? In that case IIRC is
> more common to annotate things e.g. [xfs] in case it's really a module, and
> nothing if it's built it, such as slub. Is that "slub" simply derived from
> "mm/slub.c"? Then it's just redundant?
Sounds good. The new example would look like this:
> sort -rn /proc/allocinfo
127664128 31168 mm/page_ext.c:270 func:alloc_page_ext
56373248 4737 mm/slub.c:2259 func:alloc_slab_page
14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded
14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash
13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio
9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node
4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
3940352 962 mm/memory.c:4214 func:alloc_anon_folio
2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
...
Note that [ctagmod] is the only allocation from a module in this example.
>
> > + 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
> > + 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
> > + 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
> > + 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
> > + 1.16MiB fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
> > + 1.00MiB mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
> > + 734KiB fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
> > + 640KiB kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
> > + 640KiB drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf
> > + ...
> > +
> > +
> > meminfo
>
> ...
>
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index 0be2d00c3696..78d258ca508f 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -972,6 +972,31 @@ config CODE_TAGGING
> > bool
> > select KALLSYMS
> >
> > +config MEM_ALLOC_PROFILING
> > + bool "Enable memory allocation profiling"
> > + default n
> > + depends on PROC_FS
> > + depends on !DEBUG_FORCE_WEAK_PER_CPU
> > + select CODE_TAGGING
> > + help
> > + Track allocation source code and record total allocation size
> > + initiated at that code location. The mechanism can be used to track
> > + memory leaks with a low performance and memory impact.
> > +
> > +config MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> > + bool "Enable memory allocation profiling by default"
> > + default y
>
> I'd go with default n as that I'd select for a general distro.
Well, we have MEM_ALLOC_PROFILING=n by default, so if it was switched
on manually, that is a strong sign that the user wants it enabled IMO.
So, enabling this switch by default seems logical to me. If a distro
wants to have the feature compiled in but disabled by default, then
this is perfectly doable; they just need to set both options appropriately.
Does my logic make sense?
>
> > + depends on MEM_ALLOC_PROFILING
> > +
> > +config MEM_ALLOC_PROFILING_DEBUG
> > + bool "Memory allocation profiler debugging"
> > + default n
> > + depends on MEM_ALLOC_PROFILING
> > + select MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> > + help
> > + Adds warnings with helpful error messages for memory allocation
> > + profiling.
> > +
>
On Fri, Feb 16, 2024 at 6:39 PM Vlastimil Babka <[email protected]> wrote:
>
> On 2/12/24 22:39, Suren Baghdasaryan wrote:
> > objext objects are created with __GFP_NO_OBJ_EXT flag and therefore have
> > no corresponding objext themselves (otherwise we would get an infinite
> > recursion). When freeing these objects their codetag will be empty and
> > when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled this will lead to false
> > warnings. Introduce CODETAG_EMPTY special codetag value to mark
> > allocations which intentionally lack codetag to avoid these warnings.
> > Set objext codetags to CODETAG_EMPTY before freeing to indicate that
> > the codetag is expected to be empty.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/alloc_tag.h | 26 ++++++++++++++++++++++++++
> > mm/slab.h | 25 +++++++++++++++++++++++++
> > mm/slab_common.c | 1 +
> > mm/slub.c | 8 ++++++++
> > 4 files changed, 60 insertions(+)
> >
> > diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
> > index 0a5973c4ad77..1f3207097b03 100644
>
> ...
>
> > index c4bd0d5348cb..cf332a839bf4 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -567,6 +567,31 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
> > int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > gfp_t gfp, bool new_slab);
> >
> > +
> > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> > +
> > +static inline void mark_objexts_empty(struct slabobj_ext *obj_exts)
> > +{
> > + struct slabobj_ext *slab_exts;
> > + struct slab *obj_exts_slab;
> > +
> > + obj_exts_slab = virt_to_slab(obj_exts);
> > + slab_exts = slab_obj_exts(obj_exts_slab);
> > + if (slab_exts) {
> > + unsigned int offs = obj_to_index(obj_exts_slab->slab_cache,
> > + obj_exts_slab, obj_exts);
> > + /* codetag should be NULL */
> > + WARN_ON(slab_exts[offs].ref.ct);
> > + set_codetag_empty(&slab_exts[offs].ref);
> > + }
> > +}
> > +
> > +#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
> > +
> > +static inline void mark_objexts_empty(struct slabobj_ext *obj_exts) {}
> > +
> > +#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
> > +
>
> I assume with alloc_slab_obj_exts() moved to slub.c, mark_objexts_empty()
> could move there too.
No, I think mark_objexts_empty() belongs here. This patch introduced
the function and uses it. Makes sense to me to keep it all together.
>
> > static inline bool need_slab_obj_ext(void)
> > {
> > #ifdef CONFIG_MEM_ALLOC_PROFILING
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 21b0b9e9cd9e..d5f75d04ced2 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -242,6 +242,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > * assign slabobj_exts in parallel. In this case the existing
> > * objcg vector should be reused.
> > */
> > + mark_objexts_empty(vec);
> > kfree(vec);
> > return 0;
> > }
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 4d480784942e..1136ff18b4fe 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1890,6 +1890,14 @@ static inline void free_slab_obj_exts(struct slab *slab)
> > if (!obj_exts)
> > return;
> >
> > + /*
> > + * obj_exts was created with __GFP_NO_OBJ_EXT flag, therefore its
> > + * corresponding extension will be NULL. alloc_tag_sub() will throw a
> > + * warning if slab has extensions but the extension of an object is
> > + * NULL, therefore replace NULL with CODETAG_EMPTY to indicate that
> > + * the extension for obj_exts is expected to be NULL.
> > + */
> > + mark_objexts_empty(obj_exts);
> > kfree(obj_exts);
> > slab->obj_exts = 0;
> > }
>
On 2/19/24 02:04, Suren Baghdasaryan wrote:
> On Fri, Feb 16, 2024 at 6:39 PM Vlastimil Babka <[email protected]> wrote:
>>
>> On 2/12/24 22:39, Suren Baghdasaryan wrote:
>> > objext objects are created with __GFP_NO_OBJ_EXT flag and therefore have
>> > no corresponding objext themselves (otherwise we would get an infinite
>> > recursion). When freeing these objects their codetag will be empty and
>> > when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled this will lead to false
>> > warnings. Introduce CODETAG_EMPTY special codetag value to mark
>> > allocations which intentionally lack codetag to avoid these warnings.
>> > Set objext codetags to CODETAG_EMPTY before freeing to indicate that
>> > the codetag is expected to be empty.
>> >
>> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>> > ---
>> > include/linux/alloc_tag.h | 26 ++++++++++++++++++++++++++
>> > mm/slab.h | 25 +++++++++++++++++++++++++
>> > mm/slab_common.c | 1 +
>> > mm/slub.c | 8 ++++++++
>> > 4 files changed, 60 insertions(+)
>> >
>> > diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
>> > index 0a5973c4ad77..1f3207097b03 100644
>>
>> ...
>>
>> > index c4bd0d5348cb..cf332a839bf4 100644
>> > --- a/mm/slab.h
>> > +++ b/mm/slab.h
>> > @@ -567,6 +567,31 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
>> > int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>> > gfp_t gfp, bool new_slab);
>> >
>> > +
>> > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
>> > +
>> > +static inline void mark_objexts_empty(struct slabobj_ext *obj_exts)
>> > +{
>> > + struct slabobj_ext *slab_exts;
>> > + struct slab *obj_exts_slab;
>> > +
>> > + obj_exts_slab = virt_to_slab(obj_exts);
>> > + slab_exts = slab_obj_exts(obj_exts_slab);
>> > + if (slab_exts) {
>> > + unsigned int offs = obj_to_index(obj_exts_slab->slab_cache,
>> > + obj_exts_slab, obj_exts);
>> > + /* codetag should be NULL */
>> > + WARN_ON(slab_exts[offs].ref.ct);
>> > + set_codetag_empty(&slab_exts[offs].ref);
>> > + }
>> > +}
>> > +
>> > +#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
>> > +
>> > +static inline void mark_objexts_empty(struct slabobj_ext *obj_exts) {}
>> > +
>> > +#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
>> > +
>>
>> I assume with alloc_slab_obj_exts() moved to slub.c, mark_objexts_empty()
>> could move there too.
>
> No, I think mark_objexts_empty() belongs here. This patch introduced
> the function and uses it. Makes sense to me to keep it all together.
Hi,
here I didn't mean moving between patches, but files. alloc_slab_obj_exts()
in slub.c means all callers of mark_objexts_empty() are in slub.c so it
doesn't need to be in slab.h
Also same thing with mark_failed_objexts_alloc() and
handle_failed_objexts_alloc() in patch 34/35.
On Mon, Feb 19, 2024 at 1:17 AM Vlastimil Babka <[email protected]> wrote:
>
> On 2/19/24 02:04, Suren Baghdasaryan wrote:
> > On Fri, Feb 16, 2024 at 6:39 PM Vlastimil Babka <[email protected]> wrote:
> >>
> >> On 2/12/24 22:39, Suren Baghdasaryan wrote:
> >> > objext objects are created with __GFP_NO_OBJ_EXT flag and therefore have
> >> > no corresponding objext themselves (otherwise we would get an infinite
> >> > recursion). When freeing these objects their codetag will be empty and
> >> > when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled this will lead to false
> >> > warnings. Introduce CODETAG_EMPTY special codetag value to mark
> >> > allocations which intentionally lack codetag to avoid these warnings.
> >> > Set objext codetags to CODETAG_EMPTY before freeing to indicate that
> >> > the codetag is expected to be empty.
> >> >
> >> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> >> > ---
> >> > include/linux/alloc_tag.h | 26 ++++++++++++++++++++++++++
> >> > mm/slab.h | 25 +++++++++++++++++++++++++
> >> > mm/slab_common.c | 1 +
> >> > mm/slub.c | 8 ++++++++
> >> > 4 files changed, 60 insertions(+)
> >> >
> >> > diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
> >> > index 0a5973c4ad77..1f3207097b03 100644
> >>
> >> ...
> >>
> >> > index c4bd0d5348cb..cf332a839bf4 100644
> >> > --- a/mm/slab.h
> >> > +++ b/mm/slab.h
> >> > @@ -567,6 +567,31 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
> >> > int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> >> > gfp_t gfp, bool new_slab);
> >> >
> >> > +
> >> > +#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> >> > +
> >> > +static inline void mark_objexts_empty(struct slabobj_ext *obj_exts)
> >> > +{
> >> > + struct slabobj_ext *slab_exts;
> >> > + struct slab *obj_exts_slab;
> >> > +
> >> > + obj_exts_slab = virt_to_slab(obj_exts);
> >> > + slab_exts = slab_obj_exts(obj_exts_slab);
> >> > + if (slab_exts) {
> >> > + unsigned int offs = obj_to_index(obj_exts_slab->slab_cache,
> >> > + obj_exts_slab, obj_exts);
> >> > + /* codetag should be NULL */
> >> > + WARN_ON(slab_exts[offs].ref.ct);
> >> > + set_codetag_empty(&slab_exts[offs].ref);
> >> > + }
> >> > +}
> >> > +
> >> > +#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
> >> > +
> >> > +static inline void mark_objexts_empty(struct slabobj_ext *obj_exts) {}
> >> > +
> >> > +#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
> >> > +
> >>
> >> I assume with alloc_slab_obj_exts() moved to slub.c, mark_objexts_empty()
> >> could move there too.
> >
> > No, I think mark_objexts_empty() belongs here. This patch introduced
> > the function and uses it. Makes sense to me to keep it all together.
>
> Hi,
>
> here I didn't mean moving between patches, but files. alloc_slab_obj_exts()
> in slub.c means all callers of mark_objexts_empty() are in slub.c so it
> doesn't need to be in slab.h
Ah, I see. I misunderstood your comment. Yes, after slab/slob cleanup
this makes sense.
>
> Also same thing with mark_failed_objexts_alloc() and
> handle_failed_objexts_alloc() in patch 34/35.
Ack. Thanks!
>
On Thu, Feb 15, 2024 at 3:56 PM Kent Overstreet
<[email protected]> wrote:
>
> On Thu, Feb 15, 2024 at 06:27:29PM -0500, Steven Rostedt wrote:
> > All this, and we are still worried about 4k for useful debugging :-/
I was planning to refactor this function to print one record at a time
with a smaller buffer but after discussing with Kent, he has plans to
reuse this function and having the report in one buffer is needed for
that.
> Every additional 4k still needs justification. And whether we burn a
> reserve on this will have no observable effect on user output in
> remotely normal situations; if this allocation ever fails, we've already
> been in an OOM situation for awhile and we've already printed out this
> report many times, with less memory pressure where the allocation would
> have succeeded.
I'm not sure this claim will always be true, specifically in the case
of low-end devices with relatively low amounts of reserves and in the
presence of a possible quick memory usage spike. We should also
consider a case when panic_on_oom is set. All we get is one OOM
report, so we get only one chance to capture this report. In any case,
I don't yet have data to prove or disprove this claim but it will be
interesting to test it with data from the field once the feature is
deployed.
For now I think with Vlastimil's __GFP_NOWARN suggestion the code
becomes safe and the only risk is to lose this report. If we get cases
with reports missing this data, we can easily change to reserved
memory.
On Mon 19-02-24 09:17:36, Suren Baghdasaryan wrote:
[...]
> For now I think with Vlastimil's __GFP_NOWARN suggestion the code
> becomes safe and the only risk is to lose this report. If we get cases
> with reports missing this data, we can easily change to reserved
> memory.
This is not just about missing part of the oom report. This is annoying
but not earth shattering. Eating into very small reserves (that might be
the only usable memory while the system is struggling in an OOM situation)
could cause functional problems that would be non-trivial to test for.
All that for debugging purposes is just lame. If you want to reuse the code
for a different purpose then abstract it and allocate the buffer when you
can afford that and use a preallocated one when in an OOM situation.

We have always gone the extra mile to avoid potentially disruptive
operations from the oom handling code and I do not see any good reason
to diverge from that principle.
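
For readers following the thread, the split suggested above (allocate when
memory is affordable, fall back to a preallocated buffer in the OOM path) can
be sketched in userspace C. This is only an illustration under stated
assumptions: format_alloc_report(), report_normal(), report_in_oom() and
REPORT_BUF_SIZE are hypothetical names, and malloc() stands in for a kernel
allocation. None of this is code from the patchset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define REPORT_BUF_SIZE 4096	/* hypothetical size, matching the 4k discussed */

/* Set aside once, up front, so the OOM path never has to allocate. */
static char oom_report_buf[REPORT_BUF_SIZE];

/* Hypothetical formatter: writes into whatever buffer the caller provides. */
static int format_alloc_report(char *buf, size_t len)
{
	return snprintf(buf, len, "top allocation sites: ...\n");
}

/* Normal path: allocating is affordable; on failure the report is skipped. */
static bool report_normal(void)
{
	char *buf = malloc(REPORT_BUF_SIZE);

	if (!buf)
		return false;
	format_alloc_report(buf, REPORT_BUF_SIZE);
	fputs(buf, stdout);
	free(buf);
	return true;
}

/* OOM path: no allocation at all, only the preallocated buffer is used. */
static bool report_in_oom(void)
{
	format_alloc_report(oom_report_buf, sizeof(oom_report_buf));
	fputs(oom_report_buf, stdout);
	return true;
}
```

The formatter does not care where the buffer came from, which is what makes
the code reusable. A real version would also have to serialize access to the
shared static buffer; that is left out here.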
--
Michal Hocko
SUSE Labs
On Tue, Feb 20, 2024 at 05:23:29PM +0100, Michal Hocko wrote:
> On Mon 19-02-24 09:17:36, Suren Baghdasaryan wrote:
> [...]
> > For now I think with Vlastimil's __GFP_NOWARN suggestion the code
> > becomes safe and the only risk is to lose this report. If we get cases
> > with reports missing this data, we can easily change to reserved
> > memory.
>
> This is not just about missing part of the oom report. This is annoying
> but not earth shattering. Eating into very small reserves (that might be
> the only usable memory while the system is struggling in OOM situation)
> could cause functional problems that would be non trivial to test for.
> All that for debugging purposes is just lame. If you want to reuse the code
> for a different purpose then abstract it and allocate the buffer when you
> can afford that and use a preallocated one when in an OOM situation.
>
> We have always gone the extra mile to avoid potentially disruptive
> operations from the oom handling code and I do not see any good reason
> to diverge from that principle.
Michal, I gave you the logic between dedicated reserves and system
reserves. Please stop repeating these vague what-ifs.
On Tue 20-02-24 12:18:49, Kent Overstreet wrote:
> On Tue, Feb 20, 2024 at 05:23:29PM +0100, Michal Hocko wrote:
> > On Mon 19-02-24 09:17:36, Suren Baghdasaryan wrote:
> > [...]
> > > For now I think with Vlastimil's __GFP_NOWARN suggestion the code
> > > becomes safe and the only risk is to lose this report. If we get cases
> > > with reports missing this data, we can easily change to reserved
> > > memory.
> >
> > This is not just about missing part of the oom report. This is annoying
> > but not earth shattering. Eating into very small reserves (that might be
> > the only usable memory while the system is struggling in OOM situation)
> > could cause functional problems that would be non trivial to test for.
> > All that for debugging purposes is just lame. If you want to reuse the code
> > for a different purpose then abstract it and allocate the buffer when you
> > can afford that and use a preallocated one when in an OOM situation.
> >
> > We have always gone the extra mile to avoid potentially disruptive
> > operations from the oom handling code and I do not see any good reason
> > to diverge from that principle.
>
> Michal, I gave you the logic between dedicated reserves and system
> reserves. Please stop repeating these vague what-ifs.
Your argument makes little sense and it seems that it is impossible to
explain that to you. I gave up on discussing this further with you.
Consider this a NAK to any additional allocation from the oom path unless you
can give very _solid_ arguments this is absolutely necessary. "It's gonna be
fine and work most of the time" is not a solid argument.
--
Michal Hocko
SUSE Labs
On Tue, Feb 20, 2024 at 06:24:41PM +0100, Michal Hocko wrote:
> On Tue 20-02-24 12:18:49, Kent Overstreet wrote:
> > On Tue, Feb 20, 2024 at 05:23:29PM +0100, Michal Hocko wrote:
> > > On Mon 19-02-24 09:17:36, Suren Baghdasaryan wrote:
> > > [...]
> > > > For now I think with Vlastimil's __GFP_NOWARN suggestion the code
> > > > becomes safe and the only risk is to lose this report. If we get cases
> > > > with reports missing this data, we can easily change to reserved
> > > > memory.
> > >
> > > This is not just about missing part of the oom report. This is annoying
> > > but not earth shattering. Eating into very small reserves (that might be
> > > the only usable memory while the system is struggling in OOM situation)
> > > could cause functional problems that would be non trivial to test for.
> > > All that for debugging purposes is just lame. If you want to reuse the code
> > > for a different purpose then abstract it and allocate the buffer when you
> > > can afford that and use a preallocated one when in an OOM situation.
> > >
> > > We have always gone the extra mile to avoid potentially disruptive
> > > operations from the oom handling code and I do not see any good reason
> > > to diverge from that principle.
> >
> > Michal, I gave you the logic between dedicated reserves and system
> > reserves. Please stop repeating these vague what-ifs.
>
> Your argument makes little sense and it seems that it is impossible to
> explain that to you. I gave up on discussing this further with you.
It was your choice to not engage with the technical discussion. And if
you're not going to engage, repeating the same arguments that I already
responded to 10 or 20 emails later is a pretty dishonest way to argue.
You've been doing this kind of grandstanding throughout the entire
discussion across every revision of the patchset.
Knock it off.
On 2/19/24 18:17, Suren Baghdasaryan wrote:
> On Thu, Feb 15, 2024 at 3:56 PM Kent Overstreet
> <[email protected]> wrote:
>>
>> On Thu, Feb 15, 2024 at 06:27:29PM -0500, Steven Rostedt wrote:
>> > All this, and we are still worried about 4k for useful debugging :-/
>
> I was planning to refactor this function to print one record at a time
> with a smaller buffer but after discussing with Kent, he has plans to
> reuse this function and having the report in one buffer is needed for
> that.
We are printing to the console, and AFAICS all the code involved uses plain
printk(). I think it would be way easier to have a function using printk() for
this use case than the seq_buf, which is more suitable for /proc and friends.
Then all concerns about buffers would be gone. It wouldn't be that much code
duplication, would it?
>> Every additional 4k still needs justification. And whether we burn a
>> reserve on this will have no observable effect on user output in
>> remotely normal situations; if this allocation ever fails, we've already
>> been in an OOM situation for awhile and we've already printed out this
>> report many times, with less memory pressure where the allocation would
>> have succeeded.
>
> I'm not sure this claim will always be true, specifically in the case
> of low-end devices with relatively low amounts of reserves and in the
That's right, GFP_ATOMIC failures can easily happen without prior OOMs.
Consider a system where userspace allocations fill the memory as they
usually do, up to high watermark. Then a burst of packets is received and
handled by GFP_ATOMIC allocations that deplete the reserves and can't cause
OOMs (OOM is when we fail to reclaim anything, but we are allocating from a
context that can't reclaim), so the very first report would be a GFP_ATOMIC
failure and now it can't allocate that buffer for printing.
I'm sure more such scenarios exist, Cc: Tetsuo who I recall was an expert on
this topic.
> presence of a possible quick memory usage spike. We should also
> consider a case when panic_on_oom is set. All we get is one OOM
> report, so we get only one chance to capture this report. In any case,
> I don't yet have data to prove or disprove this claim but it will be
> interesting to test it with data from the field once the feature is
> deployed.
>
> For now I think with Vlastimil's __GFP_NOWARN suggestion the code
> becomes safe and the only risk is to lose this report. If we get cases
> with reports missing this data, we can easily change to reserved
> memory.
On Tue, Feb 20, 2024 at 10:27 AM Vlastimil Babka <[email protected]> wrote:
>
> On 2/19/24 18:17, Suren Baghdasaryan wrote:
> > On Thu, Feb 15, 2024 at 3:56 PM Kent Overstreet
> > <[email protected]> wrote:
> >>
> >> On Thu, Feb 15, 2024 at 06:27:29PM -0500, Steven Rostedt wrote:
> >> > All this, and we are still worried about 4k for useful debugging :-/
> >
> > I was planning to refactor this function to print one record at a time
> > with a smaller buffer but after discussing with Kent, he has plans to
> > reuse this function and having the report in one buffer is needed for
> > that.
>
> We are printing to console, AFAICS all the code involved uses plain printk()
> I think it would be way easier to have a function using printk() for this
> use case than the seq_buf which is more suitable for /proc and friends. Then
> all concerns about buffers would be gone. It wouldn't be that much of a code
> duplication?
Ok, after discussing this with Kent, I'll change this patch to provide
a function returning N top consumers (the array and N will be provided
by the caller) and then we can print one record at a time with much
less memory needed. That should address reusability concerns, will use
memory more efficiently and will allow for more flexibility (more/less
than 10 records if needed).
Thanks for the feedback, everyone!
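
A rough userspace sketch of the approach described above: the caller supplies
the array and N, the helper fills it with the N largest consumers without
allocating anything, and the caller prints one record at a time. The names
struct tag_usage, top_consumers() and print_top_consumers() are hypothetical
and are not taken from the patchset; in the kernel the printing would go
through printk() rather than printf().

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-callsite record, for illustration only. */
struct tag_usage {
	const char *site;	/* e.g. "mm/slub.c:2201" */
	size_t bytes;
};

/*
 * Fill top[0..n-1] with the n largest entries of all[0..count-1], in
 * descending order of bytes. Returns the number of entries filled in.
 * Uses only the caller-provided array, so no allocation is needed.
 */
static size_t top_consumers(const struct tag_usage *all, size_t count,
			    struct tag_usage *top, size_t n)
{
	size_t filled = 0;

	for (size_t i = 0; i < count; i++) {
		size_t pos = filled;
		size_t last;

		/* Find where this entry belongs among the current top entries. */
		while (pos > 0 && top[pos - 1].bytes < all[i].bytes)
			pos--;
		if (pos >= n)
			continue;	/* smaller than everything we keep */

		/* Shift smaller entries down; drop the last one if full. */
		last = filled < n ? filled : n - 1;
		for (size_t j = last; j > pos; j--)
			top[j] = top[j - 1];
		top[pos] = all[i];
		if (filled < n)
			filled++;
	}
	return filled;
}

/* Print one record per call to printf(); no large report buffer required. */
static void print_top_consumers(const struct tag_usage *top, size_t n)
{
	for (size_t i = 0; i < n; i++)
		printf("%zu bytes %s\n", top[i].bytes, top[i].site);
}
```

Because the array lives with the caller, it can sit on the stack for small N,
and the same helper can serve callers that want more or fewer than 10 records.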
>
> >> Every additional 4k still needs justification. And whether we burn a
> >> reserve on this will have no observable effect on user output in
> >> remotely normal situations; if this allocation ever fails, we've already
> >> been in an OOM situation for awhile and we've already printed out this
> >> report many times, with less memory pressure where the allocation would
> >> have succeeded.
> >
> > I'm not sure this claim will always be true, specifically in the case
> > of low-end devices with relatively low amounts of reserves and in the
>
> That's right, GFP_ATOMIC failures can easily happen without prior OOMs.
> Consider a system where userspace allocations fill the memory as they
> usually do, up to high watermark. Then a burst of packets is received and
> handled by GFP_ATOMIC allocations that deplete the reserves and can't cause
> OOMs (OOM is when we fail to reclaim anything, but we are allocating from a
> context that can't reclaim), so the very first report would be a GFP_ATOMIC
> failure and now it can't allocate that buffer for printing.
>
> I'm sure more such scenarios exist, Cc: Tetsuo who I recall was an expert on
> this topic.
>
> > presence of a possible quick memory usage spike. We should also
> > consider a case when panic_on_oom is set. All we get is one OOM
> > report, so we get only one chance to capture this report. In any case,
> > I don't yet have data to prove or disprove this claim but it will be
> > interesting to test it with data from the field once the feature is
> > deployed.
> >
> > For now I think with Vlastimil's __GFP_NOWARN suggestion the code
> > becomes safe and the only risk is to lose this report. If we get cases
> > with reports missing this data, we can easily change to reserved
> > memory.
>
On 2024/02/21 3:27, Vlastimil Babka wrote:
> I'm sure more such scenarios exist, Cc: Tetsuo who I recall was an expert on
> this topic.
"[PATCH v3 10/35] lib: code tagging framework" says that codetag_lock_module_list()
calls down_read() (i.e. sleeping operation), and
"[PATCH v3 31/35] lib: add memory allocations report in show_mem()" says that
__show_mem() calls alloc_tags_show_mem_report() after kmalloc(GFP_ATOMIC) (i.e.
non-sleeping operation) but alloc_tags_show_mem_report() calls down_read() via
codetag_lock_module_list() !?
If __show_mem() might be called from atomic context (e.g. kmalloc(GFP_ATOMIC)),
this will be a sleep in atomic bug.
If __show_mem() might be called while semaphore is held for write,
this will be a read-lock after write-lock deadlock bug.
This is not a matter of whether to allocate the buffer statically or dynamically.
Please don't hold a lock when trying to report memory usage.
On Wed, Feb 21, 2024 at 5:22 AM Tetsuo Handa
<[email protected]> wrote:
>
> On 2024/02/21 3:27, Vlastimil Babka wrote:
> > I'm sure more such scenarios exist, Cc: Tetsuo who I recall was an expert on
> > this topic.
>
> "[PATCH v3 10/35] lib: code tagging framework" says that codetag_lock_module_list()
> calls down_read() (i.e. sleeping operation), and
> "[PATCH v3 31/35] lib: add memory allocations report in show_mem()" says that
> __show_mem() calls alloc_tags_show_mem_report() after kmalloc(GFP_ATOMIC) (i.e.
> non-sleeping operation) but alloc_tags_show_mem_report() calls down_read() via
> codetag_lock_module_list() !?
>
> If __show_mem() might be called from atomic context (e.g. kmalloc(GFP_ATOMIC)),
> this will be a sleep in atomic bug.
> If __show_mem() might be called while semaphore is held for write,
> this will be a read-lock after write-lock deadlock bug.
>
> This is not a matter of whether to allocate the buffer statically or dynamically.
> Please don't hold a lock when trying to report memory usage.
Thanks for catching this, Tetsuo! Yes, we take the read-lock here to
ensure that the list of modules is stable. I'm thinking I can replace
the down_read() with down_read_trylock() and if we fail (there is a
race with module load/unload) we will skip generating this report. The
probability of racing with module load/unload while in OOM state I
think is quite low, so skipping this report should not cause much
information loss.
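
The trylock idea above can be illustrated with a small userspace model. This
is a sketch under stated assumptions, not kernel code: a C11 atomic counter
stands in for the rw_semaphore (-1 means a writer holds it, n >= 0 means n
readers), and read_trylock(), write_trylock() and show_alloc_report() are
hypothetical names. In the kernel the corresponding call would be
down_read_trylock().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for the module list lock: -1 = writer, n >= 0 = n readers. */
static atomic_int module_list_lock;

/* Take the lock for reading only if no writer holds it; never waits. */
static bool read_trylock(atomic_int *lock)
{
	int readers = atomic_load(lock);

	while (readers >= 0) {
		/* On failure, readers is refreshed with the current value. */
		if (atomic_compare_exchange_weak(lock, &readers, readers + 1))
			return true;
	}
	return false;
}

static void read_unlock(atomic_int *lock)
{
	atomic_fetch_sub(lock, 1);
}

/* Take the lock for writing only if it is completely free. */
static bool write_trylock(atomic_int *lock)
{
	int expected = 0;

	return atomic_compare_exchange_strong(lock, &expected, -1);
}

static void write_unlock(atomic_int *lock)
{
	atomic_store(lock, 0);
}

/* Returns true if the report was printed, false if it was skipped. */
static bool show_alloc_report(void)
{
	/* A module load/unload is in progress: skip rather than sleep. */
	if (!read_trylock(&module_list_lock))
		return false;

	puts("alloc report: ...");
	read_unlock(&module_list_lock);
	return true;
}
```

The report path never waits on the lock, so it cannot sleep in atomic context
or deadlock against a writer; the cost is that the report is dropped whenever
a writer is active.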
>
>
On Tue, Feb 13, 2024 at 05:06:24PM -0500, Kent Overstreet wrote:
> On Tue, Feb 13, 2024 at 10:26:48AM +0200, Andy Shevchenko wrote:
> > On Mon, Feb 12, 2024 at 11:39 PM Suren Baghdasaryan <[email protected]> wrote:
> > It seems most of my points from the previous review were refused...
>
> Look, Andy, this is a pretty tiny part of the patchset, yet it's been
> eating up a pretty disproportionate amount of time and your review
> feedback has been pretty unhelpful - asking for things to be broken up
> in ways that would not be bisectable, or (as here) re-asking the same
> things that I've already answered and that should've been obvious.
>
> The code works. If you wish to complain about anything being broken, or
> if you can come up with anything more actionable than what you've got
> here, I will absolutely respond to that, but otherwise I'm just going to
> leave things where they sit.
I do not understand why I should do *your* job.
Nevertheless, I have just sent my version of this change.
Enjoy!
--
With Best Regards,
Andy Shevchenko