2024-01-04 13:42:29

by Friedrich Weber

[permalink] [raw]
Subject: Temporary KVM guest hangs connected to KSM and NUMA balancer

Hi,

some of our (Proxmox VE) users have been reporting [1] that guests
occasionally become unresponsive with high CPU usage for some time
(varying between ~1 and more than 60 seconds). After that time, the
guests come back and continue running fine. Windows guests seem most
affected (not responding to pings during the hang, RDP sessions time
out). But we also got reports about Linux guests. This issue was not
present while we provided (host) kernel 5.15 and was first reported when
we rolled out a kernel based on 6.2. The reports seem to concern NUMA
hosts only. Users reported that the issue becomes easier to trigger the
more memory is assigned to the guests. Setting mitigations=off was
reported to alleviate (but not eliminate) the issue. The issue seems to
disappear after disabling KSM.

We can reproduce the issue with a Windows guest on a NUMA host, though
only occasionally and not very reliably. Using a bpftrace script like
[7] we found the hangs to correlate with long-running invocations of
`task_numa_work` (more than 500ms), suggesting a connection to the NUMA
balancer. Indeed, we can't reproduce the issue after disabling the NUMA
balancer with `echo 0 > /proc/sys/kernel/numa_balancing` [2] and got a
user confirming this fixes the issue for them [3].
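
For reference, automatic NUMA balancing can also be checked and toggled
at runtime via sysctl (the same knob as the echo above):

# 1 = automatic NUMA balancing enabled, 0 = disabled
cat /proc/sys/kernel/numa_balancing
sysctl -w kernel.numa_balancing=0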

Since the Windows reproducer is not very stable, we tried to find a
Linux guest reproducer and have found one (described below [0]) that
triggers a very similar (hopefully the same) issue. The reproducer
triggers the hangs also if the host is on current Linux 6.7-rc8
(610a9b8f). A kernel bisect points to the following as the commit
introducing the issue:

f47e5bbb ("KVM: x86/mmu: Zap only TDP MMU leafs in zap range and
mmu_notifier unmap")

which is why I cc'ed Sean and Paolo. Because of the possible KSM
connection I cc'ed Andrew and linux-mm.

Indeed, on f47e5bbb~1 = a80ced6e ("KVM: SVM: fix panic on out-of-bounds
guest IRQ") the reproducer does not trigger the hang, and on f47e5bbb it
triggers the hang.

Currently I don't know enough about the KVM/KSM/NUMA balancer code to
tell how the patch may trigger these issues. Any idea who we could ask
about this, or how we could further debug this would be greatly appreciated!

Let me know if I can provide any more information.

Best,

Friedrich

[0]

Reproducer (example outputs with the host on f47e5bbb):

* Host with 256GiB memory, 2 NUMA nodes (for output of `numactl -H` see [4])
* Disable ksmtuned on the host, then manually enable and "boost" KSM:

echo 1250 > /sys/kernel/mm/ksm/pages_to_scan
echo 1 > /sys/kernel/mm/ksm/run

* Create a Linux guest (e.g. Debian 12) with 140GiB memory. There is no
virtio-balloon-pci device; this prevents the host from reclaiming guest
memory (and mimics the behavior of Windows guests). See [5] for the
QEMU 8.2 command line.
* On the guest, run a program that allocates 128 GiB memory in 1 GiB
chunks, initializes it with an arbitrary byte (ASCII 'A'), sleeps for
60s and exits (allocate.c [6]):

./allocate 128

* Wait until KSM shares more than 30GiB of pages
(/sys/kernel/mm/ksm/pages_sharing exceeds 7864320), which takes ~20
minutes for us.
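
The progress can be watched by polling the KSM counters, for example
(7864320 pages * 4 KiB = 30 GiB):

watch -n 10 cat /sys/kernel/mm/ksm/pages_sharing
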
* Optionally lower KSM pages_to_scan to a reasonable value to rule it
out as a factor:

echo 100 > /sys/kernel/mm/ksm/pages_to_scan
echo 1 > /sys/kernel/mm/ksm/run

* From a different machine, run a continuous `ping -D` against the guest
* On the host, run bpftrace to trace `task_numa_work` invocations that
take more than 500ms [7]
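
The script from [7] can be run like this, for instance (the filename is
just for illustration):

bpftrace numa_work.bt
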
* On the guest, run the allocation program 32 times in parallel, each
process allocating 4x1 GiB of memory (32 * 4 = 128):

for i in $(seq 32); do ./allocate 4 & done

* A few seconds later, while the processes are still running, the guest
becomes unresponsive for some time, so the ping response times increase
greatly. For example, here the guest does not respond at all for ~100
seconds:

[1704360680.898427] 64 bytes from 192.168.18.11: icmp_seq=47 ttl=64 time=0.447 ms
[1704360681.922522] 64 bytes from 192.168.18.11: icmp_seq=48 ttl=64 time=0.515 ms
[1704360710.369961] From 192.168.16.32 icmp_seq=73 Destination Host Unreachable
[1704360713.442023] From 192.168.16.32 icmp_seq=76 Destination Host Unreachable
... repeats ...
[1704360786.081958] From 192.168.16.32 icmp_seq=147 Destination Host Unreachable
[1704360789.154021] From 192.168.16.32 icmp_seq=151 Destination Host Unreachable
[1704360790.194049] 64 bytes from 192.168.18.11: icmp_seq=49 ttl=64 time=107244 ms
[1704360790.196813] 64 bytes from 192.168.18.11: icmp_seq=50 ttl=64 time=106227 ms
[1704360790.196998] 64 bytes from 192.168.18.11: icmp_seq=51 ttl=64 time=105203 ms
[1704360790.206355] 64 bytes from 192.168.18.11: icmp_seq=52 ttl=64 time=104188 ms
[1704360790.206721] 64 bytes from 192.168.18.11: icmp_seq=53 ttl=64 time=103165 ms
[1704360790.206837] 64 bytes from 192.168.18.11: icmp_seq=54 ttl=64 time=102141 ms
[...]
[1704360799.307342] 64 bytes from 192.168.18.11: icmp_seq=163 ttl=64 time=0.335 ms
[1704360800.322355] 64 bytes from 192.168.18.11: icmp_seq=164 ttl=64 time=0.360 ms
[1704360810.481320] 64 bytes from 192.168.18.11: icmp_seq=165 ttl=64 time=9135 ms
[1704360810.481331] 64 bytes from 192.168.18.11: icmp_seq=166 ttl=64 time=8111 ms
[1704360810.481334] 64 bytes from 192.168.18.11: icmp_seq=167 ttl=64 time=7083 ms
[1704360810.481336] 64 bytes from 192.168.18.11: icmp_seq=168 ttl=64 time=6059 ms
[1704360810.481339] 64 bytes from 192.168.18.11: icmp_seq=169 ttl=64 time=5039 ms
[1704360810.481409] 64 bytes from 192.168.18.11: icmp_seq=170 ttl=64 time=4015 ms
[...]
[1704360827.906610] 64 bytes from 192.168.18.11: icmp_seq=191 ttl=64 time=0.591 ms
[1704360828.930570] 64 bytes from 192.168.18.11: icmp_seq=192 ttl=64 time=0.576 ms

* At the same time, bpftrace logs long-running invocations of
`task_numa_work`, some examples:

[1704360683] task_numa_work (tid=15476) took 868 ms
[1704360683] task_numa_work (tid=15457) took 984 ms
[1704360683] task_numa_work (tid=15480) took 1104 ms
[...]
[1704360751] task_numa_work (tid=15462) took 13916 ms
[1704360753] task_numa_work (tid=15453) took 21708 ms
[...]
[1704360805] task_numa_work (tid=15485) took 1029 ms
[1704360807] task_numa_work (tid=15485) took 1245 ms
[1704360807] task_numa_work (tid=15446) took 2483 ms
[1704360808] task_numa_work (tid=15466) took 4149 ms
[1704360810] task_numa_work (tid=15446) took 3409 ms
[...]
[1704360814] task_numa_work (tid=15464) took 1733 ms
[1704360816] task_numa_work (tid=15464) took 1844 ms

* After some time (~100s in the example above) the guest comes back and
ping response times go back to normal. The guest journal logs some soft
lockups, e.g.:

Jan 04 10:33:10 debian kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Jan 04 10:33:10 debian kernel: watchdog: BUG: soft lockup - CPU#10 stuck for 101s! [allocate:1169]
Jan 04 10:33:10 debian kernel: watchdog: BUG: soft lockup - CPU#31 stuck for 101s! [allocate:1180]

* We cannot trigger the hangs anymore after disabling the NUMA balancer
-- ping response times stay under 1ms then.

[1] https://forum.proxmox.com/threads/130727/
[2] https://forum.proxmox.com/threads/130727/page-7#post-601617
[3] https://forum.proxmox.com/threads/130727/page-7#post-603096
[4]

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 128649 MB
node 0 free: 124293 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 128971 MB
node 1 free: 126587 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

[5]

./qemu-system-x86_64 \
-accel kvm \
-chardev 'socket,id=qmp,path=/var/run/qemu-server/101.qmp,server=on,wait=off' \
-mon 'chardev=qmp,mode=control' \
-chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
-mon 'chardev=qmp-event,mode=control' \
-pidfile /var/run/qemu-server/101.pid \
-smp '64,sockets=1,cores=64,maxcpus=64' \
-nodefaults \
-vnc 'unix:/var/run/qemu-server/101.vnc,password=on' \
-cpu qemu64,enforce,+kvm_pv_eoi,+kvm_pv_unhalt \
-m 143360 \
-device 'pci-bridge,id=pci.3,chassis_nr=3,bus=pci.0,addr=0x5' \
-device 'VGA,id=vga,bus=pci.0,addr=0x2' \
-device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1' \
-drive 'file=/dev/pve/vm-101-disk-0,if=none,id=drive-scsi0,format=raw' \
-device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' \
-netdev 'type=tap,id=net0,ifname=tap101i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
-device 'virtio-net-pci,mac=BC:24:11:09:20:0C,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256,bootindex=102' \
-machine 'type=pc'

[6]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>

const size_t chunk_size = 1024 * 1024 * 1024; // 1 GiB

// Allocate one 1 GiB chunk and fill it with byte 'ch' in 64 KiB steps
// (the final 64 KiB of the chunk is left untouched).
void *alloc_chunk(char ch) {
    size_t init_size = 65536;
    size_t init_done = 0;
    void *base = malloc(chunk_size);
    assert(base);

    while (init_done < chunk_size && init_done + init_size < chunk_size) {
        memset((char *)base + init_done, ch, init_size);
        init_done += init_size;
    }

    return base;
}

int main(int argc, char *argv[]) {
    int num_chunks;

    assert(argc == 2);
    num_chunks = atoi(argv[1]);
    assert(num_chunks >= 0 && num_chunks <= 1024);

    char **chunks = malloc(num_chunks * sizeof(char *));

    // Allocate and initialize the requested number of 1 GiB chunks ...
    fprintf(stderr, "alloc %d chunks\n", num_chunks);
    for (int i = 0; i < num_chunks; i++) {
        fprintf(stderr, "alloc #%d: %c\n", i, 'A');
        chunks[i] = alloc_chunk('A');
    }
    // ... then hold on to the memory for a minute before exiting.
    fprintf(stderr, "sleeping 1min\n");
    sleep(60);
    return 0;
}
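
For reference, the program needs no special build flags; something like
this is enough:

gcc -o allocate allocate.c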

[7]

kfunc:task_numa_work { @start[tid] = nsecs; }
kretfunc:task_numa_work /@start[tid]/ {
    $diff = nsecs - @start[tid];
    if ($diff > 500000000) { // 500ms
        time("[%s] ");
        printf("task_numa_work (tid=%d) took %d ms\n", tid, $diff / 1000000);
    }
    delete(@start[tid]);
}



2024-01-11 12:43:41

by Friedrich Weber

[permalink] [raw]
Subject: Re: Temporary KVM guest hangs connected to KSM and NUMA balancer

Hi,

On 04/01/2024 14:42, Friedrich Weber wrote:
> f47e5bbb ("KVM: x86/mmu: Zap only TDP MMU leafs in zap range and
> mmu_notifier unmap")

As this commit mentions tdp_mmu_zap_leafs, I ran the reproducer again on
610a9b8f (6.7-rc8) while tracing >500ms calls to that function (printing
a stacktrace) and task_numa_work via a bpftrace script [1].

Again, there are several invocations of task_numa_work that take >500ms
(the VM appears to be frozen during that time). For instance, one
invocation that takes ~20 seconds:

[1704971602] task_numa_work (tid=52035) took 19995 ms

For this particular thread, in the 20 seconds before that there were 8
invocations of tdp_mmu_zap_leafs that each took >500ms:

[1704971584] tdp_mmu_zap_leafs (tid=52035) took 2291 ms
[1704971586] tdp_mmu_zap_leafs (tid=52035) took 2343 ms
[1704971589] tdp_mmu_zap_leafs (tid=52035) took 2316 ms
[1704971590] tdp_mmu_zap_leafs (tid=52035) took 1663 ms
[1704971591] tdp_mmu_zap_leafs (tid=52035) took 682 ms
[1704971594] tdp_mmu_zap_leafs (tid=52035) took 2706 ms
[1704971597] tdp_mmu_zap_leafs (tid=52035) took 3132 ms
[1704971602] tdp_mmu_zap_leafs (tid=52035) took 4846 ms

They roughly sum up to 20s. The stacktrace is the same for all:

bpf_prog_5ca52691cb9e9fbd_tdp_mmu_zap_lea+345
bpf_prog_5ca52691cb9e9fbd_tdp_mmu_zap_lea+345
bpf_trampoline_380104735946+208
tdp_mmu_zap_leafs+5
kvm_unmap_gfn_range+347
kvm_mmu_notifier_invalidate_range_start+394
__mmu_notifier_invalidate_range_start+156
change_protection+3908
change_prot_numa+105
task_numa_work+1029
bpf_trampoline_6442457341+117
task_numa_work+9
xfer_to_guest_mode_handle_work+261
kvm_arch_vcpu_ioctl_run+1553
kvm_vcpu_ioctl+667
__x64_sys_ioctl+164
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+110

AFAICT this pattern repeats several times. I uploaded the last 150kb of
the bpftrace output to [2]. I can provide the full output if needed.

To me, not knowing much about the KVM/KSM/NUMA balancer interplay, this
looks like task_numa_work triggers several invocations of
tdp_mmu_zap_leafs, each of which takes an unusually long time. If anyone
has a hunch why this might happen, or an idea where to look next, it
would be much appreciated.

Best,

Friedrich

[1]

kfunc:tdp_mmu_zap_leafs { @start_zap[tid] = nsecs; }
kretfunc:tdp_mmu_zap_leafs /@start_zap[tid]/ {
    $diff = nsecs - @start_zap[tid];
    if ($diff > 500000000) { // 500ms
        time("[%s] ");
        printf("tdp_mmu_zap_leafs (tid=%d) took %d ms\n", tid, $diff / 1000000);
        print(kstack());
    }
    delete(@start_zap[tid]);
}

kfunc:task_numa_work { @start_numa[tid] = nsecs; }
kretfunc:task_numa_work /@start_numa[tid]/ {
    $diff = nsecs - @start_numa[tid];
    if ($diff > 500000000) { // 500ms
        time("[%s] ");
        printf("task_numa_work (tid=%d) took %d ms\n", tid, $diff / 1000000);
    }
    delete(@start_numa[tid]);
}

[2] https://paste.debian.net/1303767/


2024-01-11 16:00:20

by Sean Christopherson

[permalink] [raw]
Subject: Re: Temporary KVM guest hangs connected to KSM and NUMA balancer

On Thu, Jan 04, 2024, Friedrich Weber wrote:
> Hi,
>
> some of our (Proxmox VE) users have been reporting [1] that guests
> occasionally become unresponsive with high CPU usage for some time
> (varying between ~1 and more than 60 seconds). After that time, the
> guests come back and continue running fine. Windows guests seem most
> affected (not responding to pings during the hang, RDP sessions time
> out). But we also got reports about Linux guests. This issue was not
> present while we provided (host) kernel 5.15 and was first reported when
> we rolled out a kernel based on 6.2. The reports seem to concern NUMA
> hosts only. Users reported that the issue becomes easier to trigger the
> more memory is assigned to the guests. Setting mitigations=off was
> reported to alleviate (but not eliminate) the issue. The issue seems to
> disappear after disabling KSM.
>
> We can reproduce the issue with a Windows guest on a NUMA host, though
> only occasionally and not very reliably. Using a bpftrace script like
> [7] we found the hangs to correlate with long-running invocations of
> `task_numa_work` (more than 500ms), suggesting a connection to the NUMA
> balancer. Indeed, we can't reproduce the issue after disabling the NUMA
> balancer with `echo 0 > /proc/sys/kernel/numa_balancing` [2] and got a
> user confirming this fixes the issue for them [3].
>
> Since the Windows reproducer is not very stable, we tried to find a
> Linux guest reproducer and have found one (described below [0]) that
> triggers a very similar (hopefully the same) issue. The reproducer
> triggers the hangs also if the host is on current Linux 6.7-rc8
> (610a9b8f). A kernel bisect points to the following as the commit
> introducing the issue:
>
> f47e5bbb ("KVM: x86/mmu: Zap only TDP MMU leafs in zap range and
> mmu_notifier unmap")
>
> which is why I cc'ed Sean and Paolo. Because of the possible KSM
> connection I cc'ed Andrew and linux-mm.
>
> Indeed, on f47e5bbb~1 = a80ced6e ("KVM: SVM: fix panic on out-of-bounds
> guest IRQ") the reproducer does not trigger the hang, and on f47e5bbb it
> triggers the hang.
>
> Currently I don't know enough about the KVM/KSM/NUMA balancer code to
> tell how the patch may trigger these issues. Any idea who we could ask
> about this, or how we could further debug this would be greatly appreciated!

This is a known issue. It's mostly a KVM bug[1][2] (fix posted[3]), but I suspect
that a bug in the dynamic preemption model logic[4] is also contributing to the
behavior by causing KVM to yield on preempt models where it really shouldn't.

[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/[email protected]%2F
[3] https://lore.kernel.org/all/[email protected]
[4] https://lore.kernel.org/all/[email protected]

2024-01-12 16:08:40

by Friedrich Weber

[permalink] [raw]
Subject: Re: Temporary KVM guest hangs connected to KSM and NUMA balancer

On 11/01/2024 17:00, Sean Christopherson wrote:
> This is a known issue. It's mostly a KVM bug[1][2] (fix posted[3]), but I suspect
> that a bug in the dynamic preemption model logic[4] is also contributing to the
> behavior by causing KVM to yield on preempt models where it really shouldn't.

Thanks a lot for the pointers and the proposed fixes!

I still see the same temporary hangs with [3] applied on top of 6.7
(0dd3ee31). However, with [4] applied in addition, I have not seen any
temporary hangs yet.

As the v1 of [3] was reported to fix the reported bug [2] and looks very
similar to the v2 I tried, I wonder whether I might be seeing a slightly
different kind of hang than the one reported in [2] -- also because the
reproducer relies heavily on KSM and, AFAICT, KSM was entirely disabled
in [2]. I'll try to run a few more tests next week.

FWIW, the kernel config relevant to preemption:

CONFIG_PREEMPT_BUILD=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_RCU=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
CONFIG_DRM_I915_PREEMPT_TIMEOUT_COMPUTE=7500
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PREEMPT_TRACER is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

Thanks again!

Friedrich

> [1] https://lore.kernel.org/all/[email protected]
> [2] https://lore.kernel.org/all/[email protected]%2F
> [3] https://lore.kernel.org/all/[email protected]
> [4] https://lore.kernel.org/all/[email protected]


2024-01-16 15:37:41

by Friedrich Weber

[permalink] [raw]
Subject: Re: Temporary KVM guest hangs connected to KSM and NUMA balancer

Hi Sean,

On 11/01/2024 17:00, Sean Christopherson wrote:
> This is a known issue. It's mostly a KVM bug[...] (fix posted[...]), but I suspect
> that a bug in the dynamic preemption model logic[...] is also contributing to the
> behavior by causing KVM to yield on preempt models where it really shouldn't.

I tried the following variants now, each applied on top of 6.7 (0dd3ee31):

* [1], the initial patch series mentioned in the bugreport ("[PATCH 0/2]
KVM: Pre-check mmu_notifier retry on x86")
* [2], its v2 that you linked above ("[PATCH v2] KVM: x86/mmu: Retry
fault before acquiring mmu_lock if mapping is changing")
* [3], the scheduler patch you linked above ("[PATCH] sched/core: Drop
spinlocks on contention iff kernel is preemptible")
* both [2] & [3]

My kernel is PREEMPT_DYNAMIC and, according to
/sys/kernel/debug/sched/preempt, defaults to preempt=voluntary. For case
[3], I additionally tried manually switching to preempt=full.
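
For reference, checking and switching the model at runtime works via the
same debugfs file (the currently active model is shown in parentheses):

cat /sys/kernel/debug/sched/preempt
echo full > /sys/kernel/debug/sched/preempt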

Provided I did not mess up, I get the following results for the
reproducer I posted:

* [1] (the initial patch series): no hangs
* [2] (its v2): hangs
* [3] (the scheduler patch) with preempt=voluntary: no hangs
* [3] (the scheduler patch) with preempt=full: hangs
* [2] & [3]: no hangs

So it seems like:

* [1] (the initial patch series) fixes the hangs, which is consistent
with the feedback in the bugreport [4].
* But weirdly, its v2 [2] does not fix the hangs.
* As long as I stay with preempt=voluntary, [3] (the scheduler patch)
alone is already enough to fix the hangs in my case -- this I did not
expect :)

Does this make sense to you? Happy to double-check or run more tests if
anything seems off.

Best wishes,

Friedrich

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://bugzilla.kernel.org/show_bug.cgi?id=218259#c6


2024-01-16 17:20:51

by Sean Christopherson

[permalink] [raw]
Subject: Re: Temporary KVM guest hangs connected to KSM and NUMA balancer

On Tue, Jan 16, 2024, Friedrich Weber wrote:
> Hi Sean,
>
> On 11/01/2024 17:00, Sean Christopherson wrote:
> > This is a known issue. It's mostly a KVM bug[...] (fix posted[...]), but I suspect
> > that a bug in the dynamic preemption model logic[...] is also contributing to the
> > behavior by causing KVM to yield on preempt models where it really shouldn't.
>
> I tried the following variants now, each applied on top of 6.7 (0dd3ee31):
>
> * [1], the initial patch series mentioned in the bugreport ("[PATCH 0/2]
> KVM: Pre-check mmu_notifier retry on x86")
> * [2], its v2 that you linked above ("[PATCH v2] KVM: x86/mmu: Retry
> fault before acquiring mmu_lock if mapping is changing")
> * [3], the scheduler patch you linked above ("[PATCH] sched/core: Drop
> spinlocks on contention iff kernel is preemptible")
> * both [2] & [3]
>
> My kernel is PREEMPT_DYNAMIC and, according to
> /sys/kernel/debug/sched/preempt, defaults to preempt=voluntary. For case
> [3], I additionally tried manually switching to preempt=full.
>
> Provided I did not mess up, I get the following results for the
> reproducer I posted:
>
> * [1] (the initial patch series): no hangs
> * [2] (its v2): hangs
> * [3] (the scheduler patch) with preempt=voluntary: no hangs
> * [3] (the scheduler patch) with preempt=full: hangs
> * [2] & [3]: no hangs
>
> So it seems like:
>
> * [1] (the initial patch series) fixes the hangs, which is consistent
> with the feedback in the bugreport [4].
> * But weirdly, its v2 [2] does not fix the hangs.
> * As long as I stay with preempt=voluntary, [3] (the scheduler patch)
> alone is already enough to fix the hangs in my case -- this I did not
> expect :)
>
> Does this make sense to you? Happy to double-check or run more tests if
> anything seems off.

Ha! It took me a few minutes to realize what went sideways with v2. KVM has an
in-flight change that switches from host virtual addresses (HVA) to guest physical
frame numbers (GFN) for the retry check, commit 8569992d64b8 ("KVM: Use gfn instead
of hva for mmu_notifier_retry").

That commit is in the KVM pull request for 6.8, and so v2 is based on top of a
branch that contains said commit. But for better or worse (probably worse), the
switch from HVA=>GFN didn't change the _names_ of mmu_invalidate_range_{start,end},
only the type. So v2 applies and compiles cleanly on 6.7, but it's subtly broken
because checking for a GFN match against an HVA range is all but guaranteed to get
false negatives.

If you can try v2 on top of `git://git.kernel.org/pub/scm/virt/kvm/kvm.git next`,
that would be helpful to confirm that I didn't screw up something else.
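
I.e., something like this (the patch file name is just illustrative):

git remote add kvm git://git.kernel.org/pub/scm/virt/kvm/kvm.git
git fetch kvm next
git checkout -b test-v2 kvm/next
git am v2-kvm-x86-mmu-retry-fault.patch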

Thanks very much for reporting back! I'm pretty sure we would have missed the
semantic conflict when backporting the fix to 6.7 and earlier, i.e. you likely
saved us from another round of bug reports for various stable trees.

2024-01-17 13:09:44

by Friedrich Weber

[permalink] [raw]
Subject: Re: Temporary KVM guest hangs connected to KSM and NUMA balancer

On 16/01/2024 18:20, Sean Christopherson wrote:
>> Does this make sense to you? Happy to double-check or run more tests if
>> anything seems off.
>
> Ha! It took me a few minutes to realize what went sideways with v2. KVM has an
> in-flight change that switches from host virtual addresses (HVA) to guest physical
> frame numbers (GFN) for the retry check, commit 8569992d64b8 ("KVM: Use gfn instead
> of hva for mmu_notifier_retry").
>
> That commit is in the KVM pull request for 6.8, and so v2 is based on top of a
> branch that contains said commit. But for better or worse (probably worse), the
> switch from HVA=>GFN didn't change the _names_ of mmu_invalidate_range_{start,end},
> only the type. So v2 applies and compiles cleanly on 6.7, but it's subtly broken
> because checking for a GFN match against an HVA range is all but guaranteed to get
> false negatives.

Oof, that's nifty, good catch! I'll pay more attention to the
base-commit when testing next time. :)

> If you can try v2 on top of `git://git.kernel.org/pub/scm/virt/kvm/kvm.git next`,
> that would be helpful to confirm that I didn't screw up something else.

Pulled that repository and can confirm:

* 1c6d984f ("x86/kvm: Do not try to disable kvmclock if it was not
enabled", current `next`): reproducer hangs
* v2 [1] ("KVM: x86/mmu: Retry fault before acquiring mmu_lock if
mapping is changing") applied on top of 1c6d984f: no hangs anymore

If I understand the discussion on [1] correctly, there might be a v3 --
if so, I'll happily test that too.

> Thanks very much for reporting back! I'm pretty sure we would have missed the
> semantic conflict when backporting the fix to 6.7 and earlier, i.e. you likely
> saved us from another round of bug reports for various stable trees.

Sure! Thanks a lot for taking a look at this!

Best wishes,

Friedrich

[1] https://lore.kernel.org/all/[email protected]/


Subject: [tip: sched/core] sched/core: Drop spinlocks on contention iff kernel is preemptible

The following commit has been merged into the sched/core branch of tip:

Commit-ID: c793a62823d1ce8f70d9cfc7803e3ea436277cda
Gitweb: https://git.kernel.org/tip/c793a62823d1ce8f70d9cfc7803e3ea436277cda
Author: Sean Christopherson <[email protected]>
AuthorDate: Mon, 27 May 2024 17:34:48 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 05 Jun 2024 16:52:36 +02:00

sched/core: Drop spinlocks on contention iff kernel is preemptible

Use preempt_model_preemptible() to detect a preemptible kernel when
deciding whether or not to reschedule in order to drop a contended
spinlock or rwlock. Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
preemption model is "none" or "voluntary". In short, make kernels with
dynamically selected models behave the same as kernels with statically
selected models.

Somewhat counter-intuitively, NOT yielding a lock can provide better
latency for the relevant tasks/processes. E.g. KVM x86's mmu_lock, a
rwlock, is often contended between an invalidation event (takes mmu_lock
for write) and a vCPU servicing a guest page fault (takes mmu_lock for
read). For _some_ setups, letting the invalidation task complete even
if there is mmu_lock contention provides lower latency for *all* tasks,
i.e. the invalidation completes sooner *and* the vCPU services the guest
page fault sooner.

But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
can vary depending on the host VMM, the guest workload, the number of
vCPUs, the number of pCPUs in the host, why there is lock contention, etc.

In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
opposite and removing contention yielding entirely) needs to come with a
big pile of data proving that changing the status quo is a net positive.

Opportunistically document this side effect of preempt=full, as yielding
contended spinlocks can have significant, user-visible impact.

Fixes: c597bfddc9e9 ("sched: Provide Kconfig support for default dynamic preempt mode")
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ankur Arora <[email protected]>
Reviewed-by: Chen Yu <[email protected]>
Link: https://lore.kernel.org/kvm/[email protected]
---
Documentation/admin-guide/kernel-parameters.txt | 4 +++-
include/linux/spinlock.h | 14 ++++++--------
2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 500cfa7..555e6b5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4752,7 +4752,9 @@
 			none - Limited to cond_resched() calls
 			voluntary - Limited to cond_resched() and might_sleep() calls
 			full - Any section that isn't explicitly preempt disabled
-			       can be preempted anytime.
+			       can be preempted anytime. Tasks will also yield
+			       contended spinlocks (if the critical section isn't
+			       explicitly preempt disabled beyond the lock itself).

 	print-fatal-signals=
 			[KNL] debug: print fatal signals
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 3fcd20d..63dd8cf 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -462,11 +462,10 @@ static __always_inline int spin_is_contended(spinlock_t *lock)
  */
 static inline int spin_needbreak(spinlock_t *lock)
 {
-#ifdef CONFIG_PREEMPTION
+	if (!preempt_model_preemptible())
+		return 0;
+
 	return spin_is_contended(lock);
-#else
-	return 0;
-#endif
 }

 /*
@@ -479,11 +478,10 @@ static inline int spin_needbreak(spinlock_t *lock)
  */
 static inline int rwlock_needbreak(rwlock_t *lock)
 {
-#ifdef CONFIG_PREEMPTION
+	if (!preempt_model_preemptible())
+		return 0;
+
 	return rwlock_is_contended(lock);
-#else
-	return 0;
-#endif
 }

 /*