2024-01-25 07:37:22

by Oliver Sang

[permalink] [raw]
Subject: [linus:master] [kmemleak] 39042079a0: kernel-selftests.kvm.memslot_perf_test.fail



Hello,

kernel test robot noticed "kernel-selftests.kvm.memslot_perf_test.fail" on:

commit: 39042079a0c241d09fa6fc3bb67c2ddf60011d0f ("kmemleak: avoid RCU stalls when freeing metadata for per-CPU pointers")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linus/master 9d64bf433c53cab2f48a3fff7a1f2a696bc5229a]
[test failed on linux-next/master ad5c60d66016e544c51ed98635a74073f761f45d]

in testcase: kernel-selftests
version: kernel-selftests-x86_64-60acb023-1_20230329
with following parameters:

group: kvm



compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) with 256G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240125/[email protected]


# timeout set to 120
# selftests: kvm: memslot_perf_test
# Testing map performance with 1 runs, 5 seconds each
# Memslot count too high for this test, decrease the cap (max is 8209)
#
# Testing unmap performance with 1 runs, 5 seconds each
# Test took 9.178259867s for slot setup + 5.035747214s all iterations
# Done 85 iterations, avg 0.059244084s each
# Best runtime result was 0.059244084s per iteration (with 85 iterations)
#
# Testing unmap chunked performance with 1 runs, 5 seconds each
# Test took 9.460993790s for slot setup + 5.001833746s all iterations
# Done 84 iterations, avg 0.059545639s each
# Best runtime result was 0.059545639s per iteration (with 84 iterations)
#
# Testing move active area performance with 1 runs, 5 seconds each
# Test took 9.235748282s for slot setup + 5.001305107s all iterations
# Done 4807 iterations, avg 0.001040421s each
# Best runtime result was 0.001040421s per iteration (with 4807 iterations)
#
# Testing move inactive area performance with 1 runs, 5 seconds each
# Test took 9.113425177s for slot setup + 5.001150112s all iterations
# Done 4534 iterations, avg 0.001103032s each
# Best runtime result was 0.001103032s per iteration (with 4534 iterations)
#
# Testing RW performance with 1 runs, 5 seconds each
#
not ok 71 selftests: kvm: memslot_perf_test # TIMEOUT 120 seconds


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



2024-01-25 12:02:11

by Catalin Marinas

[permalink] [raw]
Subject: Re: [linus:master] [kmemleak] 39042079a0: kernel-selftests.kvm.memslot_perf_test.fail

On Thu, Jan 25, 2024 at 03:34:37PM +0800, kernel test robot wrote:
> kernel test robot noticed "kernel-selftests.kvm.memslot_perf_test.fail" on:
>
> commit: 39042079a0c241d09fa6fc3bb67c2ddf60011d0f ("kmemleak: avoid RCU stalls when freeing metadata for per-CPU pointers")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
[...]
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240125/[email protected]
[...]
> # Testing RW performance with 1 runs, 5 seconds each
> #
> not ok 71 selftests: kvm: memslot_perf_test # TIMEOUT 120 seconds

I'm not sure how this relates to kmemleak, especially the commit above.
It might as well be that the above kmemleak commit increases the system
load and you trip over the timeout above. Is it still reproducible with
a larger timeout (unfortunately I don't have the hardware to try to
reproduce the problem)?

I can see a lockdep warning in the dmesg but that doesn't look related
to kmemleak.

--
Catalin

2024-01-26 08:50:30

by Oliver Sang

[permalink] [raw]
Subject: Re: [linus:master] [kmemleak] 39042079a0: kernel-selftests.kvm.memslot_perf_test.fail

hi, Catalin,

On Thu, Jan 25, 2024 at 12:01:56PM +0000, Catalin Marinas wrote:
> On Thu, Jan 25, 2024 at 03:34:37PM +0800, kernel test robot wrote:
> > kernel test robot noticed "kernel-selftests.kvm.memslot_perf_test.fail" on:
> >
> > commit: 39042079a0c241d09fa6fc3bb67c2ddf60011d0f ("kmemleak: avoid RCU stalls when freeing metadata for per-CPU pointers")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> [...]
> > The kernel config and materials to reproduce are available at:
> > https://download.01.org/0day-ci/archive/20240125/[email protected]
> [...]
> > # Testing RW performance with 1 runs, 5 seconds each
> > #
> > not ok 71 selftests: kvm: memslot_perf_test # TIMEOUT 120 seconds
>
> I'm not sure how this relates to kmemleak, especially the commit above.
> It might as well be that the above kmemleak commit increases the system
> load and you trip over the timeout above. Is it still reproducible with
> a larger timeout (unfortunately I don't have the hardware to try to
> reproduce the problem)?
>
> I can see a lockdep warning in the dmesg but that doesn't look related
> to kmemleak.

below things FYI.

(1) we set same timeout for 39042079a0 and its parent, test always can pass
on parent (by 20 runs), but almost always fail on 39042079a0 (18 out of 20
runs fail).

(2) while running tests, we use a kmemleak monitor.
for parent runs, seems no problem, but for 39042079a0 runs, it will capture
below log:

$ cat kmemleak
unreferenced object 0xff110001198a1720 (size 16):
comm "swapper/0", pid 1, jiffies 4294783051
hex dump (first 16 bytes):
41 43 50 49 20 48 4d 41 54 00 6b 6b 6b 6b 6b a5 ACPI HMAT.kkkkk.
backtrace (crc 45396b86):
[<ffffffff81a97e28>] __kmem_cache_alloc_node+0x228/0x310
[<ffffffff81936de5>] __kmalloc_node_track_caller+0x55/0xf0
[<ffffffff819160b4>] kstrdup+0x34/0x60
[<ffffffff81aada3b>] mt_set_default_dram_perf+0x3cb/0x460
[<ffffffff8658a9a6>] hmat_init+0x2b6/0x3a0
[<ffffffff81002c58>] do_one_initcall+0xd8/0x410
[<ffffffff864aaa4a>] do_initcalls+0x1ca/0x390
[<ffffffff864ab53e>] kernel_init_freeable+0x8ce/0xc80
[<ffffffff83beed8f>] kernel_init+0x1f/0x1d0
[<ffffffff810a07c1>] ret_from_fork+0x31/0x70
[<ffffffff81006341>] ret_from_fork_asm+0x11/0x20

(3) we tried to disable the kmemleak monitor for 39042079a0, then
memslot_perf_test can pass now. there is no kmemleak dump.

(4) we also noticed a test can pass on 39042079a0 now, but always fail on
parent, no matter kmemleak monitor enable or disable.
(normally we won't report a commit if it makes some test start to pass, so this
information is not in original report)

f67f8d4a8c1e1ebc 39042079a0c241d09fa6fc3bb67
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:20 150% 30:30 kernel-selftests.kvm.rseq_test.pass

>
> --
> Catalin