2024-01-19 10:51:55

by Mikhail Gavrilov

Subject: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

Hi,
I run a KASAN-enabled kernel every day because I want to catch
hard-to-reproduce bugs.
Everything worked fine until commit 773688a6cb24b0b3c2ba40354d883348a2befa38.
After that commit, the whole system becomes jerky whenever I compile something.
The sound is interrupted and the cursor moves jerkily if I try to do
anything while all the cores are loaded.

> git bisect bad
773688a6cb24b0b3c2ba40354d883348a2befa38 is the first bad commit
commit 773688a6cb24b0b3c2ba40354d883348a2befa38
Author: Andrey Konovalov <[email protected]>
Date: Mon Nov 20 18:47:19 2023 +0100

kasan: use stack_depot_put for Generic mode

Evict alloc/free stack traces from the stack depot for Generic KASAN once
they are evicted from the quaratine.

For auxiliary stack traces, evict the oldest stack trace once a new one is
saved (KASAN only keeps references to the last two).

Also evict all saved stack traces on krealloc.

To avoid double-evicting and mis-evicting stack traces (in case KASAN's
metadata was corrupted), reset KASAN's per-object metadata that stores
stack depot handles when the object is initialized and when it's evicted
from the quarantine.

Note that stack_depot_put is no-op if the handle is 0.

Link: https://lkml.kernel.org/r/5cef104d9b842899489b4054fe8d1339a71acee0.1700502145.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Evgenii Stepanov <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/kasan/common.c | 3 ++-
mm/kasan/generic.c | 22 ++++++++++++++++++----
mm/kasan/quarantine.c | 26 ++++++++++++++++++++------
3 files changed, 40 insertions(+), 11 deletions(-)

I attached here my build .config and kernel log.
Who could dig into it, please?

--
Best Regards,
Mike Gavrilov.


Attachments:
.config.zip (63.63 kB)
bisect-performance-regression-KASAN-log.zip (1.21 kB)
dmesg-performance-regression-KASAN.zip (43.20 kB)

2024-01-19 10:55:42

by Marco Elver

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Fri, 19 Jan 2024 at 11:46, Mikhail Gavrilov
<[email protected]> wrote:
>
> Hi,
> I run a KASAN-enabled kernel every day because I want to catch
> hard-to-reproduce bugs.
> Everything worked fine until commit 773688a6cb24b0b3c2ba40354d883348a2befa38.
> After that commit, the whole system becomes jerky whenever I compile something.
> The sound is interrupted and the cursor moves jerkily if I try to do
> anything while all the cores are loaded.
>
> > git bisect bad
> 773688a6cb24b0b3c2ba40354d883348a2befa38 is the first bad commit
> commit 773688a6cb24b0b3c2ba40354d883348a2befa38
> Author: Andrey Konovalov <[email protected]>
> Date: Mon Nov 20 18:47:19 2023 +0100
>
> kasan: use stack_depot_put for Generic mode
[...]
> mm/kasan/common.c | 3 ++-
> mm/kasan/generic.c | 22 ++++++++++++++++++----
> mm/kasan/quarantine.c | 26 ++++++++++++++++++++------
> 3 files changed, 40 insertions(+), 11 deletions(-)
>
> I attached here my build .config and kernel log.
> Who could dig into it, please?

I was afraid this would happen - could you try this patch series:
https://lore.kernel.org/all/[email protected]/

Thanks,
-- Marco

2024-01-19 11:00:40

by Marco Elver

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Fri, 19 Jan 2024 at 11:54, Marco Elver <[email protected]> wrote:
>
> On Fri, 19 Jan 2024 at 11:46, Mikhail Gavrilov
> <[email protected]> wrote:
> >
> > Hi,
> > I run a KASAN-enabled kernel every day because I want to catch
> > hard-to-reproduce bugs.
> > Everything worked fine until commit 773688a6cb24b0b3c2ba40354d883348a2befa38.
> > After that commit, the whole system becomes jerky whenever I compile something.
> > The sound is interrupted and the cursor moves jerkily if I try to do
> > anything while all the cores are loaded.
> >
> > > git bisect bad
> > 773688a6cb24b0b3c2ba40354d883348a2befa38 is the first bad commit
> > commit 773688a6cb24b0b3c2ba40354d883348a2befa38
> > Author: Andrey Konovalov <[email protected]>
> > Date: Mon Nov 20 18:47:19 2023 +0100
> >
> > kasan: use stack_depot_put for Generic mode
> [...]
> > mm/kasan/common.c | 3 ++-
> > mm/kasan/generic.c | 22 ++++++++++++++++++----
> > mm/kasan/quarantine.c | 26 ++++++++++++++++++++------
> > 3 files changed, 40 insertions(+), 11 deletions(-)
> >
> > I attached here my build .config and kernel log.
> > Who could dig into it, please?
>
> I was afraid this would happen - could you try this patch series:
> https://lore.kernel.org/all/[email protected]/ [1]

In addition, could you give some additional details about the number
of CPUs in your system?
And if possible, do you have a way to measure performance besides the
obvious lagging of the system? It would be interesting to know if the
fix in [1] regains performance fully.

One major difference is still that an atomic RMW is in the fast paths.
This could be fixed by reverting
773688a6cb24b0b3c2ba40354d883348a2befa38 on top of everything else,
but we're not sure yet that's necessary because the cost of an atomic
RMW really depends on the system you're working with.

2024-01-19 17:55:19

by Mikhail Gavrilov

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Fri, Jan 19, 2024 at 4:00 PM Marco Elver <[email protected]> wrote:
> I was afraid this would happen - could you try this patch series:
> https://lore.kernel.org/all/[email protected]/ [1]

Thanks, this patch series definitely helped.
I can again work at the computer when something is compiling in the background.
Tested-by: Mikhail Gavrilov <[email protected]>

> In addition, could you give some additional details about the number
> of CPUs in your system?

Hardware probe: https://linux-hardware.org/?probe=ba941d7a4e
CPU: AMD Ryzen 7950x

> And if possible, do you have a way to measure performance besides the
> obvious lagging of the system? It would be interesting to know if the
> fix in [1] regains performance fully.

perf-2d5524635b00.data -
https://mega.nz/file/Q0ACSI4a#QQ8Ntbw5zvP_YZMsXPzSr-PxLVCw8fwg2RJaVOghoOQ
perf-773688a6cb24.data -
https://mega.nz/file/F8wAgBZI#OQ75qLFyf2diFXrDs9bP6_5xDevVrs1KlNdeupWSJSQ
perf-with-patchset.data -
https://mega.nz/file/l8ZXnI6Y#SmrZpH2Em6xzlIZgJe50PwSw-zLK_4whRjx3t_058kE

> perf diff perf-2d5524635b00.data perf-773688a6cb24.data
No kallsyms or vmlinux with build-id c64a03a51e9503a251dbec8e5267fb3ae51914f2 was found
# Event 'cycles:P'
#
# Baseline  Delta Abs  Shared Object          Symbol
# ........  .........  .....................  ......................
#
    59.91%    +23.05%  [kernel.vmlinux]       [k] 0xffffffff940065c0
    17.88%     -7.89%  cc1                    [.] 0x0000000000207020
     9.39%     -6.30%  cc1plus                [.] 0x0000000000225110
     1.37%     -1.29%  libpython3.12.so.1.0   [.] 0x00000000000647e0
     1.16%     -0.84%  libcef.so              [.] 0x00000000021720e0
     1.27%     -0.67%  as                     [.] 0x0000000000002090
     0.78%     -0.54%  steamclient.so         [.] 0x00000000001ed915
     0.77%     -0.33%  chrome                 [.] 0x0000000002892080
     0.54%     -0.32%  libc.so.6              [.] _int_malloc
     0.30%     -0.23%  libpixman-1.so.0.43.0  [.] 0x00000000000078a7
     0.31%     -0.19%  libc.so.6              [.] _int_free

> perf diff perf-2d5524635b00.data perf-with-patchset.data
# Event 'cycles:P'
#
# Baseline  Delta Abs  Shared Object          Symbol
# ........  .........  .....................  ..........................
#
    17.88%    +12.61%  cc1                    [.] 0x0000000000207020
               +3.89%  [kernel.vmlinux]       [k] unwind_next_frame
               +3.53%  [kernel.vmlinux]       [k] kasan_check_range
               +2.54%  [kernel.vmlinux]       [k] debug_check_no_obj_freed
     9.39%     +2.10%  cc1plus                [.] 0x0000000000225110
               +1.87%  [kernel.vmlinux]       [k] rcu_is_watching
               +1.41%  [kernel.vmlinux]       [k] lock_release
               +1.24%  [kernel.vmlinux]       [k] __orc_find
               +1.21%  [kernel.vmlinux]       [k] lock_acquire
               +1.08%  [kernel.vmlinux]       [k] stack_trace_consume_entry
               +1.08%  [kernel.vmlinux]       [k] check_preemption_disabled
     1.37%     +1.01%  libpython3.12.so.1.0   [.] 0x00000000000647e0
               +0.96%  [kernel.vmlinux]       [k] stack_access_ok
     1.27%     +0.79%  as                     [.] 0x0000000000002090

Thanks!

--
Best Regards,
Mike Gavrilov.

2024-01-29 22:26:02

by Mikhail Gavrilov

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Fri, Jan 19, 2024 at 10:54 PM Mikhail Gavrilov
<[email protected]> wrote:
>
I continued searching for regressions in the 6.8 kernel
and found another one.

cc478e0b6bdffd20561e1a07941a65f6c8962cab is the first bad commit
commit cc478e0b6bdffd20561e1a07941a65f6c8962cab
Author: Andrey Konovalov <[email protected]>
Date: Tue Jan 9 23:12:34 2024 +0100

kasan: avoid resetting aux_lock

With commit 63b85ac56a64 ("kasan: stop leaking stack trace handles"),
KASAN zeroes out alloc meta when an object is freed. The zeroed out data
purposefully includes alloc and auxiliary stack traces but also
accidentally includes aux_lock.

As aux_lock is only initialized for each object slot during slab creation,
when the freed slot is reallocated, saving auxiliary stack traces for the
new object leads to lockdep reports when taking the zeroed out aux_lock.

Arguably, we could reinitialize aux_lock when the object is reallocated,
but a simpler solution is to avoid zeroing out aux_lock when an object
gets freed.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 63b85ac56a64 ("kasan: stop leaking stack trace handles")
Signed-off-by: Andrey Konovalov <[email protected]>
Reported-by: Paul E. McKenney <[email protected]>
Closes: https://lore.kernel.org/linux-next/5cc0f83c-e1d6-45c5-be89-9b86746fe731@paulmck-laptop/
Reviewed-by: Marco Elver <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/kasan/generic.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)


Here I noticed an FPS drop in the game "Shadow of the Tomb Raider".
To measure performance I used the game's internal benchmark.
Before commit cc478e0b6bdffd20561e1a07941a65f6c8962cab, on commit
aaa2c9a97c22af5bf011f6dd8e0538219b45af88, I got 111 FPS [1].
On commit cc478e0b6bdffd20561e1a07941a65f6c8962cab I get only 63 FPS [2].
Unfortunately, the stackdepot patchset, which I applied on top of
6.8-rc2, didn't restore the initial performance [3].

[1] https://i.postimg.cc/tgvwPTkz/c11-aaa2c9a97c22af5bf011f6dd8e0538219b45af88.png
[2] https://i.postimg.cc/pX8vHDCM/c10-cc478e0b6bdffd20561e1a07941a65f6c8962cab.png
[3] https://i.postimg.cc/hvWCb7dV/6-8-0-0-rc2-with-stackdepot.png

--
Best Regards,
Mike Gavrilov.


Attachments:
bisect-performance-regression-in-games2.zip (1.21 kB)
.config.zip (63.71 kB)

2024-01-29 23:15:04

by Andrey Konovalov

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Mon, Jan 29, 2024 at 11:25 PM Mikhail Gavrilov
<[email protected]> wrote:
>
> On Fri, Jan 19, 2024 at 10:54 PM Mikhail Gavrilov
> <[email protected]> wrote:
> >
> I continued searching for regressions in the 6.8 kernel
> and found another one.
>
> cc478e0b6bdffd20561e1a07941a65f6c8962cab is the first bad commit
> commit cc478e0b6bdffd20561e1a07941a65f6c8962cab
> Author: Andrey Konovalov <[email protected]>
> Date: Tue Jan 9 23:12:34 2024 +0100
>
> kasan: avoid resetting aux_lock
>
> Here I noticed an FPS drop in the game "Shadow of the Tomb Raider".
> To measure performance I used the game's internal benchmark.
> Before commit cc478e0b6bdffd20561e1a07941a65f6c8962cab, on commit
> aaa2c9a97c22af5bf011f6dd8e0538219b45af88, I got 111 FPS [1].
> On commit cc478e0b6bdffd20561e1a07941a65f6c8962cab I get only 63 FPS [2].
> Unfortunately, the stackdepot patchset, which I applied on top of
> 6.8-rc2, didn't restore the initial performance [3].
>
> [1] https://i.postimg.cc/tgvwPTkz/c11-aaa2c9a97c22af5bf011f6dd8e0538219b45af88.png
> [2] https://i.postimg.cc/pX8vHDCM/c10-cc478e0b6bdffd20561e1a07941a65f6c8962cab.png
> [3] https://i.postimg.cc/hvWCb7dV/6-8-0-0-rc2-with-stackdepot.png

Hi Mikhail,

Please try to apply these two patches on top:
https://lore.kernel.org/linux-mm/[email protected]/

They effectively revert the change you mentioned.

Thank you for testing!

2024-02-01 22:08:51

by Mikhail Gavrilov

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Tue, Jan 30, 2024 at 4:14 AM Andrey Konovalov <[email protected]> wrote:
> Hi Mikhail,
>
> Please try to apply these two patches on top:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> They effectively revert the change you mentioned.
>

I tried applying these patches on top of 6.8-rc2 and
6.8-git6764c317b6bb, but unfortunately performance has not changed and
is still at the regressed level.
Maybe we can try something else?

--
Best Regards,
Mike Gavrilov.

2024-02-02 09:08:19

by Marco Elver

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Thu, 1 Feb 2024 at 23:08, Mikhail Gavrilov
<[email protected]> wrote:
>
> On Tue, Jan 30, 2024 at 4:14 AM Andrey Konovalov <[email protected]> wrote:
> > Hi Mikhail,
> >
> > Please try to apply these two patches on top:
> > https://lore.kernel.org/linux-mm/[email protected]/ [1]
> >
> > They effectively revert the change you mentioned.
> >
>
> I tried applying these patches on top of 6.8-rc2 and
> 6.8-git6764c317b6bb, but unfortunately performance has not changed and
> is still at the regressed level.
> Maybe we can try something else?

That's strange - the patches at [1] definitely revert the change you
bisected to. It's possible there is some other strange side-effect. (I
assume that you are still running all this with a KASAN kernel.)

Just so I understand it right:
You say before commit cc478e0b6bdffd20561e1a07941a65f6c8962cab the
game's FPS were good. But that is strange, because at that point we're
already doing stackdepot refcounting, i.e. after commit
773688a6cb24b0b3c2ba40354d883348a2befa38 which you reported as the
initial performance regression. The patches at [2] fixed that problem.

So now it's unclear to me how the simple change in
cc478e0b6bdffd20561e1a07941a65f6c8962cab causes the performance
problem, when in fact this is already with KASAN stackdepot
refcounting enabled but without the performance fixes from [1] and
[2].

[2] https://lore.kernel.org/all/[email protected]/

My questions now would be:
- What was the game's FPS in the last stable kernel (v6.7)?
- Can you collect another set of performance profiles between good and
bad? Maybe it would show where the time in the kernel is spent.
- Could it be an inconclusive bisection?

Thanks,
-- Marco

2024-02-02 16:36:58

by Mikhail Gavrilov

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Fri, Feb 2, 2024 at 2:00 PM Marco Elver <[email protected]> wrote:
>
> > Maybe we can try something else?
>
> That's strange - the patches at [1] definitely revert the change you
> bisected to. It's possible there is some other strange side-effect. (I
> assume that you are still running all this with a KASAN kernel.)

Yes. The build .config did not change between kernel builds.

> Just so I understand it right:
> You say before commit cc478e0b6bdffd20561e1a07941a65f6c8962cab the
> game's FPS were good. But that is strange, because at that point we're
> already doing stackdepot refcounting, i.e. after commit
> 773688a6cb24b0b3c2ba40354d883348a2befa38 which you reported as the
> initial performance regression. The patches at [2] fixed that problem.
>
> So now it's unclear to me how the simple change in
> cc478e0b6bdffd20561e1a07941a65f6c8962cab causes the performance
> problem, when in fact this is already with KASAN stackdepot
> refcounting enabled but without the performance fixes from [1] and
> [2].
>
> [2] https://lore.kernel.org/all/[email protected]/
>
> My questions now would be:
> - What was the game's FPS in the last stable kernel (v6.7)?

[6.7] - 83 FPS - 13060 frames during benchmark.

> - Can you collect another set of performance profiles between good and
> bad? Maybe it would show where the time in the kernel is spent.

Yes,
please look at [aaa2c9a97c22 perf] and [cc478e0b6bdf perf]

> perf diff perf-git-aaa2c9a97c22af5bf011f6dd8e0538219b45af88.data perf-git-cc478e0b6bdffd20561e1a07941a65f6c8962cab.data
No kallsyms or vmlinux with build-id de2a040f828394c5ce34802389239c2a0668fcc7 was found
No kallsyms or vmlinux with build-id 33ab1cd545f96f5ffc2a402a4c4cfa647fd727a0 was found
# Event 'cycles:P'
#
# Baseline  Delta Abs  Shared Object          Symbol
# ........  .........  .....................  ..................................
#
    48.48%    +21.75%  [kernel.kallsyms]      [k] 0xffffffff860065c0
    36.13%    -16.49%  ShadowOfTheTombRaider  [.] 0x00000000001d7f5e
     4.43%     -2.10%  libvulkan_radeon.so    [.] 0x000000000006b870
     3.28%     -0.63%  libcef.so              [.] 0x00000000021720e0
     1.11%     -0.53%  libc.so.6              [.] syscall
     0.65%     -0.24%  libc.so.6              [.] __memmove_avx512_unaligned_erms
     0.31%     -0.14%  libc.so.6              [.] __memset_avx512_unaligned_erms
     0.26%     -0.13%  libm.so.6              [.] __powf_fma
     0.20%     -0.10%  [amdgpu]               [k] amdgpu_bo_placement_from_domain
     0.22%     -0.09%  [amdgpu]               [k] amdgpu_vram_mgr_compatible
     0.67%     -0.09%  armada-drm_dri.so      [.] 0x00000000000192b4
     0.15%     -0.08%  libc.so.6              [.] sem_post@GLIBC_2.2.5
     0.16%     -0.07%  [amdgpu]               [k] amdgpu_vm_bo_update
     0.14%     -0.07%  [amdgpu]               [k] amdgpu_bo_list_entry_cmp
     0.13%     -0.06%  libm.so.6              [.] powf@GLIBC_2.2.5
     0.14%     -0.06%  libMangoHud.so         [.] 0x000000000001c4c0
     0.10%     -0.06%  libc.so.6              [.] __futex_abstimed_wait_common
     0.19%     -0.05%  libGLESv2.so           [.] 0x0000000000160a11
     0.07%     -0.04%  libc.so.6              [.] __new_sem_wait_slow64.constprop.0
     0.10%     -0.04%  radeonsi_dri.so        [.] 0x0000000000019454
     0.05%     -0.03%  [amdgpu]               [k] optc1_get_position
     0.05%     -0.03%  libc.so.6              [.] sem_wait@@GLIBC_2.34
     0.22%     -0.02%  [vdso]                 [.] 0x00000000000005a0
     0.10%     -0.02%  libc.so.6              [.] __memcmp_evex_movbe
               +0.02%  [JIT] tid 8383         [.] 0x00007f2de0052823


> - Could it be an inconclusive bisection?

I checked twice:
[6.7] - 83 FPS
[aaa2c9a97c22] - 111 FPS
[cc478e0b6bdf] - 64 FPS
[6.8-rc2 with patches] - 82 FPS
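
To put the four figures above on one scale, the following Python sketch expresses each as a percentage change relative to the v6.7 result. The FPS values are taken from the list above; choosing v6.7 as the baseline and the helper name `relative_change` are assumptions made for illustration.

```python
# FPS figures reported in the list above.
FPS = {
    "6.7": 83,
    "aaa2c9a97c22": 111,
    "cc478e0b6bdf": 64,
    "6.8-rc2 with patches": 82,
}


def relative_change(fps, baseline):
    """Percentage change of fps relative to baseline."""
    return (fps - baseline) / baseline * 100.0


baseline = FPS["6.7"]
changes = {name: relative_change(value, baseline) for name, value in FPS.items()}

for name, pct in changes.items():
    print(f"{name:>22}: {FPS[name]:>3} FPS ({pct:+.1f}% vs 6.7)")
```

On these numbers, 6.8-rc2 with the patches is about 1% below v6.7, cc478e0b6bdf is about 23% below it, and aaa2c9a97c22 is about 34% above it.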


[6.7] https://i.postimg.cc/15yyzZBr/v6-7.png
[6.7 perf] https://mega.nz/file/QwJ3hbob#RslLFVYgz1SWMcPR3eF9uEpFuqxdgkwXSatWts-1wVA

[aaa2c9a97c22] https://i.postimg.cc/Sxv4VYhg/git-aaa2c9a97c22af5bf011f6dd8e0538219b45af88.png
[aaa2c9a97c22 perf]
https://mega.nz/file/dwQxha4J#2_nBF6uNzY11VX-T-Lr_-60WIMrbl1YEvPgY4CuXqEc

[cc478e0b6bdf] https://i.postimg.cc/W3cQfMfw/git-cc478e0b6bdffd20561e1a07941a65f6c8962cab.png
[cc478e0b6bdf perf]
https://mega.nz/file/hl5kwLTC#_4Fg1KBXCnQ-8OElY7EYmPOoDG6ZeZYnKFjamWpklWw

[6.8-rc2 with patches] https://i.postimg.cc/26dPpVsR/v6-8-rc2-with-patches.png
[6.8-rc2 with patches perf]
https://mega.nz/file/NxgTAb4L#0KO_WU-svpDw60Y3148RZhELPcUtFg3_VCDzJqSyz34

--
Best Regards,
Mike Gavrilov.

2024-02-02 20:15:39

by Mikhail Gavrilov

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Fri, Feb 2, 2024 at 10:20 PM Marco Elver <[email protected]> wrote:
>
> Your config has lockdep enabled, right?

Yes.

> Because cc478e0b6bdf was fixing an issue with lockdep, does your kernel
> before that commit show some lockdep errors?

Let's check it, I attached the kernel log of aaa2c9a97c22.

mikhail@primary-ws ~> uname -r
6.7.0-c11-aaa2c9a97c22af5bf011f6dd8e0538219b45af88+
mikhail@primary-ws ~> sudo dmesg | grep lockdep
[sudo] password for mikhail:
[ 3.115891] rcu: RCU lockdep checking is enabled.
[ 3.125718] The code is fine but needs lockdep annotation, or maybe
[ 3.125786] ? lockdep_init_map_type+0x1a5/0x840
[ 12.967789] INFO: lockdep is turned off.

> Because if lockdep encounters an error it usually
> turns itself off right away, which would explain the improved
> performance. :-)

You are right.
Thanks for digging into it!
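
The check above can be scripted so that a benchmark run is flagged when lockdep has disabled itself. This is a minimal Python sketch, assuming the kernel log is already available as text (for example, captured from `dmesg`); the function name `lockdep_disabled_at` is hypothetical, and the message it matches is the one shown in the log excerpt above.

```python
import re

# Log excerpt copied from the dmesg output above.
LOG = """\
[ 3.115891] rcu: RCU lockdep checking is enabled.
[ 3.125718] The code is fine but needs lockdep annotation, or maybe
[ 3.125786] ? lockdep_init_map_type+0x1a5/0x840
[ 12.967789] INFO: lockdep is turned off.
"""

# Matches a timestamped "INFO: lockdep is turned off." line.
TURNED_OFF = re.compile(r"^\[\s*(\d+\.\d+)\]\s+INFO: lockdep is turned off\.")


def lockdep_disabled_at(log_text):
    """Return the timestamp (seconds since boot) of the first
    'lockdep is turned off' message, or None if there is none."""
    for line in log_text.splitlines():
        m = TURNED_OFF.match(line)
        if m:
            return float(m.group(1))
    return None


stamp = lockdep_disabled_at(LOG)
if stamp is None:
    print("no 'lockdep is turned off' message found")
else:
    print(f"lockdep turned itself off at {stamp:.6f}s after boot")
```

A run where this returns a timestamp was measured without lockdep's overhead for most of its lifetime, so its numbers are not directly comparable with runs where lockdep stayed enabled.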

--
Best Regards,
Mike Gavrilov.


Attachments:
dmesg-aaa2c9a97c22af5bf011f6dd8e0538219b45af88.zip (46.96 kB)

2024-02-19 09:48:25

by Mikhail Gavrilov

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Sat, Feb 3, 2024 at 1:14 AM Mikhail Gavrilov
<[email protected]> wrote:
>
> You are right.
> Thanks for digging into it!
>

This revert [2] is still not merged, at least not in 4f5e5092fdbf, which I checked.
Is there any plan to merge it or find another approach?

[2] https://lore.kernel.org/all/[email protected]/

--
Best Regards,
Mike Gavrilov.

2024-02-19 09:53:06

by Marco Elver

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Mon, 19 Feb 2024 at 10:48, Mikhail Gavrilov
<[email protected]> wrote:
>
> On Sat, Feb 3, 2024 at 1:14 AM Mikhail Gavrilov
> <[email protected]> wrote:
> >
> > You are right.
> > Thanks for digging into it!
> >
>
> This revert [2] is still not merged, at least not in 4f5e5092fdbf, which I checked.
> Is there any plan to merge it or find another approach?
>
> [2] https://lore.kernel.org/all/[email protected]/

I think it's already in -mm and -next. It just takes time, which is a
good thing, after all we want to let -next testing confirm nothing is
wrong with it.

Andrew, is this planned for the next merge window or as a "hot fix"
for the current rc? Given it has the right "Fixes" tags it will make
it to stable kernels eventually, but I also think that the previous
"slow" version is almost unusable on big systems, so it may be
worthwhile considering the current rc.

Thanks,
-- Marco

2024-02-19 10:09:34

by Vlastimil Babka

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On 2/19/24 10:52, Marco Elver wrote:
> On Mon, 19 Feb 2024 at 10:48, Mikhail Gavrilov
> <[email protected]> wrote:
>>
>> On Sat, Feb 3, 2024 at 1:14 AM Mikhail Gavrilov
>> <[email protected]> wrote:
>> >
>> > You are right.
>> > Thanks for digging into it!
>> >
>>
>> This revert [2] is still not merged, at least not in 4f5e5092fdbf, which I checked.
>> Is there any plan to merge it or find another approach?
>>
>> [2] https://lore.kernel.org/all/[email protected]/
>
> I think it's already in -mm and -next. It just takes time, which is a
> good thing, after all we want to let -next testing confirm nothing is
> wrong with it.
>
> Andrew, is this planned for the next merge window or as a "hot fix"
> for the current rc? Given it has the right "Fixes" tags it will make
> it to stable kernels eventually, but I also think that the previous
> "slow" version is almost unusable on big systems, so it may be
> worthwhile considering the current rc.

Yeah it would be best to fix in 6.8 to prevent regressions.

> Thanks,
> -- Marco


2024-02-19 23:28:44

by Andrew Morton

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Mon, 19 Feb 2024 11:09:23 +0100 Vlastimil Babka <[email protected]> wrote:

> On 2/19/24 10:52, Marco Elver wrote:
> > On Mon, 19 Feb 2024 at 10:48, Mikhail Gavrilov
> > <[email protected]> wrote:
> >>
> >> On Sat, Feb 3, 2024 at 1:14 AM Mikhail Gavrilov
> >> <[email protected]> wrote:
> >> >
> >> > You are right.
> >> > Thanks for digging into it!
> >> >
> >>
> >> This revert [2] is still not merged, at least not in 4f5e5092fdbf, which I checked.
> >> Is there any plan to merge it or find another approach?
> >>
> >> [2] https://lore.kernel.org/all/[email protected]/
> >
> > I think it's already in -mm and -next. It just takes time, which is a
> > good thing, after all we want to let -next testing confirm nothing is
> > wrong with it.
> >
> > Andrew, is this planned for the next merge window or as a "hot fix"
> > for the current rc? Given it has the right "Fixes" tags it will make
> > it to stable kernels eventually, but I also think that the previous
> > "slow" version is almost unusable on big systems, so it may be
> > worthwhile considering the current rc.
>
> Yeah it would be best to fix in 6.8 to prevent regressions.
>

I'm all confused.

4434a56ec209 ("stackdepot: make fast paths lock-less again") was
mainlined for v6.8-rc3.

That patch Fixed: 108be8def46e ("lib/stackdepot: allow users to evict
stack traces") which was mainlined for v6.8-rc1, so 4434a56ec209 did
not need a cc:stable?


2024-02-19 23:50:45

by Vlastimil Babka

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load



On 2/20/24 00:28, Andrew Morton wrote:
> On Mon, 19 Feb 2024 11:09:23 +0100 Vlastimil Babka <[email protected]> wrote:
>
>> On 2/19/24 10:52, Marco Elver wrote:
>>> On Mon, 19 Feb 2024 at 10:48, Mikhail Gavrilov
>>> <[email protected]> wrote:
>>>>
>>>> On Sat, Feb 3, 2024 at 1:14 AM Mikhail Gavrilov
>>>> <[email protected]> wrote:
>>>>>
>>>>> You are right.
>>>>> Thanks for digging into it!
>>>>>
>>>>
>>>> This revert [2] is still not merged, at least not in 4f5e5092fdbf, which I checked.
>>>> Is there any plan to merge it or find another approach?
>>>>
>>>> [2] https://lore.kernel.org/all/[email protected]/
>>>
>>> I think it's already in -mm and -next. It just takes time, which is a
>>> good thing, after all we want to let -next testing confirm nothing is
>>> wrong with it.
>>>
>>> Andrew, is this planned for the next merge window or as a "hot fix"
>>> for the current rc? Given it has the right "Fixes" tags it will make
>>> it to stable kernels eventually, but I also think that the previous
>>> "slow" version is almost unusable on big systems, so it may be
>>> worthwhile considering the current rc.
>>
>> Yeah it would be best to fix in 6.8 to prevent regressions.
>>
>
> I'm all confused.
>
> 4434a56ec209 ("stackdepot: make fast paths lock-less again") was
> mainlined for v6.8-rc3.

Uh sorry, I just trusted the info that it's not merged and didn't verify
it myself. Yeah, I can see it is there.

> That patch Fixed: 108be8def46e ("lib/stackdepot: allow users to evict
> stack traces") which was mainlined for v6.8-rc1, so 4434a56ec209 did
> not need a cc:stable?

That's right.

2024-02-20 05:37:24

by Mikhail Gavrilov

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Tue, Feb 20, 2024 at 4:50 AM Vlastimil Babka <[email protected]> wrote:
> >
> > I'm all confused.
> >
> > 4434a56ec209 ("stackdepot: make fast paths lock-less again") was
> > mainlined for v6.8-rc3.
>
> Uh sorry, I just trusted the info that it's not merged and didn't verify
> it myself. Yeah, I can see it is there.
>

Wait, I am talking about these two patches, which are not merged yet:
[PATCH v2 1/2] stackdepot: use variable size records for non-evictable entries
[PATCH v2 2/2] kasan: revert eviction of stack traces in generic mode
https://lore.kernel.org/linux-mm/[email protected]/

--
Best Regards,
Mike Gavrilov.

2024-02-20 17:34:14

by Andrew Morton

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Tue, 20 Feb 2024 10:37:03 +0500 Mikhail Gavrilov <[email protected]> wrote:

> On Tue, Feb 20, 2024 at 4:50 AM Vlastimil Babka <[email protected]> wrote:
> > >
> > > I'm all confused.
> > >
> > > 4434a56ec209 ("stackdepot: make fast paths lock-less again") was
> > > mainlined for v6.8-rc3.
> >
> > Uh sorry, I just trusted the info that it's not merged and didn't verify
> > it myself. Yeah, I can see it is there.
> >
>
> Wait, I am talking about these two patches, which are not merged yet:
> [PATCH v2 1/2] stackdepot: use variable size records for non-evictable entries
> [PATCH v2 2/2] kasan: revert eviction of stack traces in generic mode
> https://lore.kernel.org/linux-mm/[email protected]/

I can move those into the 6.8-rc hotfixes queue, and it appears a
cc:stable will not be required.

However I'm not seeing anything in the changelogs to indicate that
we're fixing a dramatic performance regression, nor why that
regression is occurring.


2024-02-20 18:17:08

by Vlastimil Babka

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On 2/20/24 18:30, Andrew Morton wrote:
> On Tue, 20 Feb 2024 10:37:03 +0500 Mikhail Gavrilov <[email protected]> wrote:
>
>> On Tue, Feb 20, 2024 at 4:50 AM Vlastimil Babka <[email protected]> wrote:
>> > >
>> > > I'm all confused.
>> > >
>> > > 4434a56ec209 ("stackdepot: make fast paths lock-less again") was
>> > > mainlined for v6.8-rc3.
>> >
>> > Uh sorry, I just trusted the info that it's not merged and didn't verify
>> > it myself. Yeah, I can see it is there.
>> >
>>
>> Wait, I am talking about these two patches, which are not merged yet:
>> [PATCH v2 1/2] stackdepot: use variable size records for non-evictable entries
>> [PATCH v2 2/2] kasan: revert eviction of stack traces in generic mode
>> https://lore.kernel.org/linux-mm/[email protected]/
>
> I can move those into the 6.8-rc hotfixes queue, and it appears a
> cc:stable will not be required.
>
> However I'm not seeing anything in the changelogs to indicate that
> we're fixing a dramatic performance regression, nor why that
> regression is occurring.

We also seem to have an unhappy bot with the 2/2 patch :/ although it's not yet
clear if it's a genuine issue.

https://lore.kernel.org/all/[email protected]/

2024-02-20 18:51:55

by Marco Elver

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Tue, 20 Feb 2024 at 19:16, Vlastimil Babka <[email protected]> wrote:
>
> On 2/20/24 18:30, Andrew Morton wrote:
> > On Tue, 20 Feb 2024 10:37:03 +0500 Mikhail Gavrilov <[email protected]> wrote:
> >
> >> On Tue, Feb 20, 2024 at 4:50 AM Vlastimil Babka <[email protected]> wrote:
> >> > >
> >> > > I'm all confused.
> >> > >
> >> > > 4434a56ec209 ("stackdepot: make fast paths lock-less again") was
> >> > > mainlined for v6.8-rc3.
> >> >
> >> > Uh sorry, I just trusted the info that it's not merged and didn't verify
> >> > it myself. Yeah, I can see it is there.
> >> >
> >>
> >> Wait, I am talking about these two patches, which are not merged yet:
> >> [PATCH v2 1/2] stackdepot: use variable size records for non-evictable entries
> >> [PATCH v2 2/2] kasan: revert eviction of stack traces in generic mode
> >> https://lore.kernel.org/linux-mm/[email protected]/
> >
> > I can move those into the 6.8-rc hotfixes queue, and it appears a
> > cc:stable will not be required.
> >
> > However I'm not seeing anything in the changelogs to indicate that
> > we're fixing a dramatic performance regression, nor why that
> > regression is occurring.

It's primarily fixing a regression of memory usage overhead for
stackdepot users in general. Performance is mostly fixed, but patch
2/2 ("kasan: revert eviction of stack traces in generic mode") also
helps with KASAN performance because entries that were being
repeatedly evicted and then reallocated are now allocated just once, so
with increasing system uptime the slow path is taken much less often.
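
The evict-then-reallocate effect described above can be illustrated with a
toy model. This is only a sketch and not kernel code: ToyDepot, run() and
all of their parameters are hypothetical names invented for the example, and
the real stack depot is considerably more involved. The model counts how
often a stack trace has to be stored as a new record (the slow path) when
records are dropped once their last reference leaves the quarantine, and
when records are kept forever.

```python
from collections import deque


class ToyDepot:
    """Toy model of a stack depot: maps a stack trace to a reference count.

    Hypothetical illustration only; it does not mirror the kernel's data
    structures or API.
    """

    def __init__(self, evictable):
        self.evictable = evictable
        self.refs = {}       # trace -> reference count
        self.slow_path = 0   # how many times a new record had to be created

    def save(self, trace):
        if trace in self.refs:
            self.refs[trace] += 1   # fast path: record already exists
        else:
            self.refs[trace] = 1    # slow path: create a new record
            self.slow_path += 1
        return trace                # the trace doubles as its own handle

    def put(self, handle):
        if not self.evictable:
            return                  # non-evictable records are never freed
        self.refs[handle] -= 1
        if self.refs[handle] == 0:
            del self.refs[handle]   # evicted: the next save() is a slow path


def run(evictable, n_allocs=1000, n_traces=50, quarantine_size=10):
    """Allocate n_allocs objects, cycling through n_traces distinct traces.

    Each object holds one reference to its trace until the object falls out
    of a FIFO quarantine of quarantine_size entries.
    """
    depot = ToyDepot(evictable)
    quarantine = deque()
    for i in range(n_allocs):
        quarantine.append(depot.save(i % n_traces))
        if len(quarantine) > quarantine_size:
            depot.put(quarantine.popleft())
    return depot.slow_path


if __name__ == "__main__":
    print("evictable:    ", run(evictable=True))    # 1000 slow-path saves
    print("non-evictable:", run(evictable=False))   # 50 slow-path saves
```

In this model every trace comes back only after its previous record has
already left the quarantine, so with eviction each save takes the slow path,
while without eviction the slow path is taken once per distinct trace. The
numbers depend entirely on the made-up workload and say nothing about real
systems.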

> We also seem to have an unhappy bot with the 2/2 patch :/ although it's not yet
> clear if it's a genuine issue.
>
> https://lore.kernel.org/all/[email protected]/

While it would be nice if 6.8 would not regress over 6.7 (performance
is mostly fixed, memory usage is not), waiting for confirmation what
the rcutorture issue from the bot is about might be good.

Mikhail: since you are testing mainline, in about 4 weeks the fixes
should then reach 6.9-rc in the next merge window. Until then, if it's
not too difficult for you, you can apply those 2 patches in your own
tree.

Thanks,
-- Marco

2024-02-26 10:30:58

by Vlastimil Babka

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On 2/26/24 10:25, Marco Elver wrote:
> On Tue, 20 Feb 2024 at 19:51, Marco Elver <[email protected]> wrote:
>>
>> While it would be nice if 6.8 would not regress over 6.7 (performance
>> is mostly fixed, memory usage is not), waiting for confirmation what
>> the rcutorture issue from the bot is about might be good.
>>
>> Mikhail: since you are testing mainline, in about 4 weeks the fixes
>> should then reach 6.9-rc in the next merge window. Until then, if it's
>> not too difficult for you, you can apply those 2 patches in your own
>> tree.
>
> There are more issues that are fixed by "[PATCH v2 1/2] stackdepot:
> use variable size records for non-evictable entries". See
> https://lore.kernel.org/all/[email protected]/
>
> This will eventually reach stable, but it might be good to reconsider
> mainlining it earlier.

I believe I can see that patch, together with "kasan: revert eviction of
stack traces in generic mode" in mm-hotfixes-stable so it should be on track
for 6.8.

> Thanks,
> -- Marco


2024-02-26 09:57:15

by Marco Elver

Subject: Re: regression/bisected commit 773688a6cb24b0b3c2ba40354d883348a2befa38 make my system completely unusable under high load

On Tue, 20 Feb 2024 at 19:51, Marco Elver <[email protected]> wrote:
>
> On Tue, 20 Feb 2024 at 19:16, Vlastimil Babka <[email protected]> wrote:
> >
> > On 2/20/24 18:30, Andrew Morton wrote:
> > > On Tue, 20 Feb 2024 10:37:03 +0500 Mikhail Gavrilov <[email protected]> wrote:
> > >
> > >> On Tue, Feb 20, 2024 at 4:50 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >> > >
> > >> > > I'm all confused.
> > >> > >
> > >> > > 4434a56ec209 ("stackdepot: make fast paths lock-less again") was
> > >> > > mainlined for v6.8-rc3.
> > >> >
> > >> > Uh sorry, I just trusted the info that it's not merged and didn't verify
> > >> > it myself. Yeah, I can see it is there.
> > >> >
> > >>
> > >> Wait, I am talking about these two patches, which are not merged yet:
> > >> [PATCH v2 1/2] stackdepot: use variable size records for non-evictable entries
> > >> [PATCH v2 2/2] kasan: revert eviction of stack traces in generic mode
> > >> https://lore.kernel.org/linux-mm/20240129100708.39460-1-elver@google.com/
> > >
> > > I can move those into the 6.8-rc hotfixes queue, and it appears a
> > > cc:stable will not be required.
> > >
> > > However I'm not seeing anything in the changelogs to indicate that
> > > we're fixing a dramatic performance regression, nor why that
> > > regression is occurring.
>
> It's primarily fixing a regression of memory usage overhead for
> stackdepot users in general. Performance is mostly fixed, but patch
> 2/2 ("kasan: revert eviction of stack traces in generic mode") also
> helps with KASAN performance because entries that were being
> repeatedly evicted and then reallocated are now allocated just once, so
> with increasing system uptime the slow path is taken much less often.
>
> > We also seem to have an unhappy bot with the 2/2 patch :/ although it's not yet
> > clear if it's a genuine issue.
> >
> > https://lore.kernel.org/all/[email protected]/

This was confirmed to be a non-bug by RCU devs.

> While it would be nice if 6.8 would not regress over 6.7 (performance
> is mostly fixed, memory usage is not), waiting for confirmation what
> the rcutorture issue from the bot is about might be good.
>
> Mikhail: since you are testing mainline, in about 4 weeks the fixes
> should then reach 6.9-rc in the next merge window. Until then, if it's
> not too difficult for you, you can apply those 2 patches in your own
> tree.

There are more issues that are fixed by "[PATCH v2 1/2] stackdepot:
use variable size records for non-evictable entries". See
https://lore.kernel.org/all/[email protected]/

This will eventually reach stable, but it might be good to reconsider
mainlining it earlier.

Thanks,
-- Marco