On one side, when using devm_kmalloc(), a memory overhead is added in order
to keep track of the data needed to release the resources automagically.
On the other side, kmalloc() also rounds-up the required memory size in
order to ease memory reuse and avoid memory fragmentation.
Both behavior together can lead to some over memory allocation which can
be avoided.
For example:
- if 4096 bytes of managed memory is required
- "4096 + sizeof(struct devres_node)" bytes are required to the memory
allocator
- 8192 bytes are allocated and nearly half of it is wasted
In such a case, it would be better to really allocate 4096 bytes of memory
and record an "action" to perform the kfree() when needed.
On my 64 bits system:
sizeof(struct devres_node) = 40
sizeof(struct action_devres) = 16
So, a devm_add_action() call will allocate 56, rounded up to 64 bytes.
kmalloc() uses hunks of 8k, 4k, 2k, 1k, 512, 256, 192, 128, 96, 64, 32, 16,
8 bytes.
So in order to save some memory, if the 256 bytes boundary is crossed
because of the overhead of devm_kmalloc(), 2 distinct memory allocations
make sense.
Signed-off-by: Christophe JAILLET <[email protected]>
---
This patch is only a RFC to get feed-back on the proposed approach.
It is compile tested only.
I don't have numbers to see how much memory could be saved.
I don't have numbers on the performance impact.
Should this makes sense to anyone, I would really appreciate getting some
numbers from others to confirm if it make sense or not.
The idea of this patch came to me because of a discussion initiated by
Feng Tang <[email protected]>. He proposes to track wasted memory
allocation in order to give hints on where optimizations can be done.
My approach is to avoid part of these allocations when due to the usage of
a devm_ function.
The drawbacks I see are:
- code is more complex
- this concurs to memory fragmentation because there will be 2 memory
allocations, instead of just 1
- this is slower for every memory allocation because of the while loop
and tests
- the magic 256 constant is maybe not relevant on all systems
- some places of the kernel already take advantage of this over memory
allocation. So unpredictable impacts can occur somewhere! (see [1],
which is part of the [2] thread)
- this makes some assumption in devres.c on how memory allocation works,
which is not a great idea :(
The advantages I see:
- in some cases, it saves some memory :)
- fragmentation is not necessarily an issue, devm_ allocated memory
are rarely freed, right?
One case where such an approach makes sense to me is:
drivers/mtd/nand/onenand/onenand_samsung.c
where (on my system) we would allocate 8k when 4k is required.
Thought?
CJ
[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://lore.kernel.org/all/[email protected]/
---
drivers/base/devres.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/base/devres.c b/drivers/base/devres.c
index 864d0b3f566e..67f87af32914 100644
--- a/drivers/base/devres.c
+++ b/drivers/base/devres.c
@@ -794,6 +794,11 @@ static void devm_kmalloc_release(struct device *dev, void *res)
/* noop */
}
+static void devm_kmalloc_release_with_action(void *res)
+{
+ kfree(res);
+}
+
static int devm_kmalloc_match(struct device *dev, void *res, void *data)
{
return res == data;
@@ -814,11 +819,40 @@ static int devm_kmalloc_match(struct device *dev, void *res, void *data)
*/
void *devm_kmalloc(struct device *dev, size_t size, gfp_t gfp)
{
+ size_t tipping_point;
struct devres *dr;
if (unlikely(!size))
return ZERO_SIZE_PTR;
+ /*
+ * alloc_dr() adds a small memory overhead that may require some over
+ * memory allocation. In such cases, it is more interesting to allocate
+ * the exact amount of memory required and register an "action" to free
+ * it. The overhead of devm_add_action_or_reset() is smaller.
+ */
+ tipping_point = 256;
+ while (tipping_point < KMALLOC_MAX_CACHE_SIZE) {
+ void *ptr;
+ int err;
+
+ if (size <= tipping_point &&
+ size + sizeof(struct devres) > tipping_point) {
+ ptr = kmalloc(size, gfp);
+ if (unlikely(!ptr))
+ return NULL;
+
+ err = devm_add_action_or_reset(dev,
+ devm_kmalloc_release_with_action, ptr);
+ if (unlikely(err))
+ return NULL;
+
+ return ptr;
+ }
+
+ tipping_point <<= 1;
+ }
+
/* use raw alloc_dr for kmalloc caller tracing */
dr = alloc_dr(devm_kmalloc_release, size, gfp, dev_to_node(dev));
if (unlikely(!dr))
--
2.34.1
On Sat, Jul 23, 2022 at 12:04:33PM +0200, Christophe JAILLET wrote:
> On one side, when using devm_kmalloc(), a memory overhead is added in order
> to keep track of the data needed to release the resources automagically.
>
> On the other side, kmalloc() also rounds-up the required memory size in
> order to ease memory reuse and avoid memory fragmentation.
>
> Both behavior together can lead to some over memory allocation which can
> be avoided.
>
> For example:
> - if 4096 bytes of managed memory is required
> - "4096 + sizeof(struct devres_node)" bytes are required to the memory
> allocator
> - 8192 bytes are allocated and nearly half of it is wasted
>
> In such a case, it would be better to really allocate 4096 bytes of memory
> and record an "action" to perform the kfree() when needed.
>
> On my 64 bits system:
> sizeof(struct devres_node) = 40
> sizeof(struct action_devres) = 16
>
> So, a devm_add_action() call will allocate 56, rounded up to 64 bytes.
>
> kmalloc() uses hunks of 8k, 4k, 2k, 1k, 512, 256, 192, 128, 96, 64, 32, 16,
> 8 bytes.
>
> So in order to save some memory, if the 256 bytes boundary is crossed
> because of the overhead of devm_kmalloc(), 2 distinct memory allocations
> make sense.
>
> Signed-off-by: Christophe JAILLET <[email protected]>
> ---
> This patch is only a RFC to get feed-back on the proposed approach.
>
> It is compile tested only.
> I don't have numbers to see how much memory could be saved.
> I don't have numbers on the performance impact.
>
> Should this makes sense to anyone, I would really appreciate getting some
> numbers from others to confirm if it make sense or not.
>
>
> The idea of this patch came to me because of a discussion initiated by
> Feng Tang <[email protected]>. He proposes to track wasted memory
> allocation in order to give hints on where optimizations can be done.
>
> My approach is to avoid part of these allocations when due to the usage of
> a devm_ function.
>
>
> The drawbacks I see are:
> - code is more complex
> - this concurs to memory fragmentation because there will be 2 memory
> allocations, instead of just 1
> - this is slower for every memory allocation because of the while loop
> and tests
> - the magic 256 constant is maybe not relevant on all systems
> - some places of the kernel already take advantage of this over memory
> allocation. So unpredictable impacts can occur somewhere! (see [1],
> which is part of the [2] thread)
> - this makes some assumption in devres.c on how memory allocation works,
> which is not a great idea :(
>
> The advantages I see:
> - in some cases, it saves some memory :)
> - fragmentation is not necessarily an issue, devm_ allocated memory
> are rarely freed, right?
I think devm_ allocated memory does not happen that much, try it on
your systems and see!
Numbers would be great to have, can you run some benchmarks? Try it on
a "common" SoC device (raspberry pi?) and a desktop to compare.
thanks,
greg k-h
Greeting,
FYI, we noticed the following commit (built with gcc-11):
commit: 3d0e198cd7dc63c6ddbf06c028ba04e5ed43e470 ("[RFC PATCH] devres: avoid over memory allocation with managed memory allocation")
url: https://github.com/intel-lab-lkp/linux/commits/Christophe-JAILLET/devres-avoid-over-memory-allocation-with-managed-memory-allocation/20220723-181707
base: https://git.kernel.org/cgit/linux/kernel/git/gregkh/driver-core.git 3fcbf1c77d089fcf0331fd8f3cbbe6c436a3edbd
patch link: https://lore.kernel.org/lkml/92ec2f78e8d38f68da95d9250cf3f86b2fbe78ad.1658570017.git.christophe.jaillet@wanadoo.fr
in testcase: nvml
version: nvml-x86_64-3de7d358f-1_20211217
with following parameters:
test: pmem
group: util
nr_pmem: 1
fs: ext4
mount_option: dax
bp_memmap: 32G!4G
ucode: 0x700001c
on test machine: 16 threads 1 sockets Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz with 48G memory
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>
[ 33.935602][ T339] ------------[ cut here ]------------
[ 33.935605][ T339] WARNING: CPU: 10 PID: 339 at drivers/base/devres.c:1092 devm_kfree (drivers/base/devres.c:1092)
[ 33.935615][ T339] Modules linked in: nd_pmem(+) sysimgblt ahci(+) fb_sys_fops dax_pmem nd_btt libahci nd_e820 acpi_ipmi mei_me intel_uncore(+) libnvdimm drm mxm_wmi libata acpi_cpufreq(-) joydev gpio_ich intel_pch_thermal ipmi_si(+) mei ioatdma(+) dca ipmi_devintf ipmi_msghandler wmi acpi_pad ip_tables
[ 33.935652][ T339] CPU: 10 PID: 339 Comm: systemd-udevd Tainted: G S 5.19.0-rc1-00059-g3d0e198cd7dc #1
[ 33.935656][ T339] Hardware name: Supermicro SYS-5018D-FN4T/X10SDV-8C-TLN4F, BIOS 1.1 03/02/2016
[ 33.935659][ T339] RIP: 0010:devm_kfree (drivers/base/devres.c:1092)
[ 33.935663][ T339] Code: 00 fc ff df 48 8d 78 d8 48 89 f9 48 c1 e9 03 80 3c 11 00 75 1d 48 8b 40 d8 48 39 c7 75 09 48 83 c4 10 e9 97 81 31 ff 0f 0b c3 <0f> 0b 48 83 c4 10 c3 c3 48 89 44 24 08 48 89 3c 24 e8 be 16 32 ff
All code
========
0: 00 fc add %bh,%ah
2: ff (bad)
3: df 48 8d fisttps -0x73(%rax)
6: 78 d8 js 0xffffffffffffffe0
8: 48 89 f9 mov %rdi,%rcx
b: 48 c1 e9 03 shr $0x3,%rcx
f: 80 3c 11 00 cmpb $0x0,(%rcx,%rdx,1)
13: 75 1d jne 0x32
15: 48 8b 40 d8 mov -0x28(%rax),%rax
19: 48 39 c7 cmp %rax,%rdi
1c: 75 09 jne 0x27
1e: 48 83 c4 10 add $0x10,%rsp
22: e9 97 81 31 ff jmpq 0xffffffffff3181be
27: 0f 0b ud2
29: c3 retq
2a:* 0f 0b ud2 <-- trapping instruction
2c: 48 83 c4 10 add $0x10,%rsp
30: c3 retq
31: c3 retq
32: 48 89 44 24 08 mov %rax,0x8(%rsp)
37: 48 89 3c 24 mov %rdi,(%rsp)
3b: e8 be 16 32 ff callq 0xffffffffff3216fe
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 83 c4 10 add $0x10,%rsp
6: c3 retq
7: c3 retq
8: 48 89 44 24 08 mov %rax,0x8(%rsp)
d: 48 89 3c 24 mov %rdi,(%rsp)
11: e8 be 16 32 ff callq 0xffffffffff3216d4
[ 33.935666][ T339] RSP: 0018:ffffc90001f67840 EFLAGS: 00010246
[ 33.935670][ T339] RAX: 0000000000000000 RBX: ffff888906f51358 RCX: dffffc0000000000
[ 33.935672][ T339] RDX: 1ffff11120dea256 RSI: 0000000000000246 RDI: ffff888906f512a4
[ 33.935675][ T339] RBP: ffff888906f51370 R08: ffff888906f512a8 R09: ffffc90001f6779b
[ 33.935677][ T339] R10: fffff520003ecef3 R11: 0000000000000001 R12: ffff888906f51008
[ 33.935679][ T339] R13: ffff88893057eb1c R14: ffff88893057e830 R15: 0000000000000000
[ 33.935681][ T339] FS: 00007f76e113bd40(0000) GS:ffff888bb7900000(0000) knlGS:0000000000000000
[ 33.935684][ T339] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 33.935686][ T339] CR2: 00007f19575cb050 CR3: 0000000c78540001 CR4: 00000000003706e0
[ 33.935689][ T339] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 33.935690][ T339] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 33.935693][ T339] Call Trace:
[ 33.935694][ T339] <TASK>
[ 33.935696][ T339] ? devres_release (drivers/base/devres.c:426)
[ 33.935700][ T339] badblocks_exit (block/badblocks.c:599)
[ 33.935706][ T339] devm_nsio_disable (include/linux/badblocks.h:64 drivers/nvdimm/claim.c:336) libnvdimm
[ 33.935730][ T339] nd_pmem_probe (drivers/nvdimm/pmem.c:652) nd_pmem
[ 33.935736][ T339] nvdimm_bus_probe (drivers/nvdimm/bus.c:91) libnvdimm
[ 33.935753][ T339] really_probe (drivers/base/dd.c:530 drivers/base/dd.c:609)
[ 33.935758][ T339] __driver_probe_device (drivers/base/dd.c:748)
[ 33.935763][ T339] driver_probe_device (drivers/base/dd.c:778)
[ 33.935767][ T339] __driver_attach (drivers/base/dd.c:1151)
[ 33.935772][ T339] ? __device_attach_driver (drivers/base/dd.c:1100)
[ 33.935777][ T339] bus_for_each_dev (drivers/base/bus.c:301)
[ 33.935781][ T339] ? subsys_dev_iter_exit (drivers/base/bus.c:290)
[ 33.935785][ T339] ? klist_add_tail (include/linux/list.h:69 include/linux/list.h:102 lib/klist.c:104 lib/klist.c:137)
[ 33.935790][ T339] bus_add_driver (drivers/base/bus.c:618)
[ 33.935795][ T339] driver_register (drivers/base/driver.c:240)
[ 33.935798][ T339] ? 0xffffffffc108a000
[ 33.935801][ T339] do_one_initcall (init/main.c:1295)
[ 33.935806][ T339] ? trace_event_raw_event_initcall_level (init/main.c:1286)
[ 33.935810][ T339] ? kasan_unpoison (mm/kasan/shadow.c:108 mm/kasan/shadow.c:142)
[ 33.935817][ T339] ? kasan_unpoison (mm/kasan/shadow.c:108 mm/kasan/shadow.c:142)
[ 33.935821][ T339] do_init_module (kernel/module/main.c:2434)
[ 33.935827][ T339] load_module (kernel/module/main.c:2829)
[ 33.935832][ T339] ? layout_and_allocate (kernel/module/main.c:2652)
[ 33.935835][ T339] ? kernel_read_file (arch/x86/include/asm/atomic.h:95 include/linux/atomic/atomic-instrumented.h:191 include/linux/fs.h:2838 fs/kernel_read_file.c:122)
[ 33.935841][ T339] ? __x64_sys_fspick (fs/kernel_read_file.c:38)
[ 33.935845][ T339] ? mmap_region (mm/mmap.c:1889)
[ 33.935851][ T339] ? __do_sys_finit_module (kernel/module/main.c:2930)
[ 33.935854][ T339] __do_sys_finit_module (kernel/module/main.c:2930)
[ 33.935858][ T339] ? __ia32_sys_init_module (kernel/module/main.c:2898)
[ 33.935862][ T339] ? __seccomp_filter (arch/x86/include/asm/bitops.h:214 include/asm-generic/bitops/instrumented-non-atomic.h:135 kernel/seccomp.c:354 kernel/seccomp.c:381 kernel/seccomp.c:413 kernel/seccomp.c:1210)
[ 33.935866][ T339] ? vm_mmap_pgoff (mm/util.c:556)
[ 33.935873][ T339] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 33.935879][ T339] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:115)
[ 33.935884][ T339] RIP: 0033:0x7f76e1925f59
[ 33.935887][ T339] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 07 6f 0c 00 f7 d8 64 89 01 48
All code
========
0: 00 c3 add %al,%bl
2: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
9: 00 00 00
c: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
11: 48 89 f8 mov %rdi,%rax
14: 48 89 f7 mov %rsi,%rdi
17: 48 89 d6 mov %rdx,%rsi
1a: 48 89 ca mov %rcx,%rdx
1d: 4d 89 c2 mov %r8,%r10
20: 4d 89 c8 mov %r9,%r8
23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
30: 73 01 jae 0x33
32: c3 retq
33: 48 8b 0d 07 6f 0c 00 mov 0xc6f07(%rip),%rcx # 0xc6f41
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 retq
9: 48 8b 0d 07 6f 0c 00 mov 0xc6f07(%rip),%rcx # 0xc6f17
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file
# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
--
0-DAY CI Kernel Test Service
https://01.org/lkp