2017-08-20 23:13:08

by Adam Borowski

Subject: kvm splat in mmu_spte_clear_track_bits

Hi!
I'm afraid I keep getting a quite reliable, but random, splat when running
KVM:

------------[ cut here ]------------
WARNING: CPU: 5 PID: 5826 at arch/x86/kvm/mmu.c:717 mmu_spte_clear_track_bits+0x123/0x170
Modules linked in: tun nbd arc4 rtl8xxxu mac80211 cfg80211 rfkill nouveau video ttm
CPU: 5 PID: 5826 Comm: qemu-system-x86 Not tainted 4.13.0-rc5-vanilla-ubsan-00211-g7f680d7ec315 #1
Hardware name: System manufacturer System Product Name/M4A77T, BIOS 2401 05/18/2011
task: ffff880207ef0400 task.stack: ffffc900035e4000
RIP: 0010:mmu_spte_clear_track_bits+0x123/0x170
RSP: 0018:ffffc900035e7ab0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 000000010501cc67 RCX: 0000000000000001
RDX: dead0000000000ff RSI: ffff88020e501df8 RDI: 0000000004140700
RBP: ffffc900035e7ad8 R08: 0000000000000100 R09: 0000000000000003
R10: 0000000000000003 R11: 0000000000000005 R12: 000000000010501c
R13: ffffea0004140700 R14: ffff88020e1d0000 R15: 0000000000000000
FS: 00007f0213fbd700(0000) GS:ffff88022fd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000022187f000 CR4: 00000000000006e0
Call Trace:
drop_spte+0x26/0x130
mmu_page_zap_pte+0xc4/0x160
kvm_mmu_prepare_zap_page+0x65/0x660
kvm_mmu_invalidate_zap_all_pages+0xc5/0x1f0
kvm_mmu_invalidate_zap_pages_in_memslot+0x9/0x10
kvm_page_track_flush_slot+0x86/0xd0
kvm_arch_flush_shadow_memslot+0x9/0x10
__kvm_set_memory_region+0x8fb/0x14f0
kvm_set_memory_region+0x2f/0x50
kvm_vm_ioctl+0x559/0xcc0
? kvm_vcpu_ioctl+0x171/0x620
? __switch_to+0x30b/0x740
do_vfs_ioctl+0xbb/0x8d0
? find_vma+0x23/0x100
? __fget_light+0x94/0x110
SyS_ioctl+0x86/0xa0
entry_SYSCALL_64_fastpath+0x17/0x98
RIP: 0033:0x7f021c80ddc7
RSP: 002b:00007f0213fbc518 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f021c80ddc7
RDX: 00007f0213fbc5b0 RSI: 000000004020ae46 RDI: 000000000000000a
RBP: 0000000000000000 R08: 00007f020c1698a0 R09: 0000000000000000
R10: 00007f020c1698a0 R11: 0000000000000246 R12: 0000000000000006
R13: 00007f022201c000 R14: 0000000000000002 R15: 0000558c3899e550
Code: ae fc 01 48 85 c0 75 1c 4c 89 e7 e8 98 de fd ff 48 8b 05 81 ae fc 01 48 85 c0 74 ba 48 85 c3 0f 95 c3 eb b8 48 85 c3 74 e7 eb dd <0f> ff eb 97 4c 89 e7 66 0f 1f 44 00 00 e8 6b de fd ff eb 97 31
---[ end trace 16c196134f0dd0a9 ]---

After this, there are hundreds of repeats and lots of secondary damage which
kills the host quickly.
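
(For reference, the warning at mmu.c:717 appears to be the refcount
sanity check at the bottom of mmu_spte_clear_track_bits() -- quoting the
4.13-era code from memory, so treat this as approximate:

    pfn = spte_to_pfn(old_spte);

    /*
     * KVM does not hold the refcount of the page used by
     * kvm mmu: before reclaiming the page, unmap it from
     * the mmu first.
     */
    WARN_ON(!kvm_is_reserved_pfn(pfn) && !page_count(pfn_to_page(pfn)));

i.e. a shadow pte still pointed at a page whose refcount had already
dropped to zero.)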

Usually this happens within a few minutes, but sometimes it takes ~half an
hour to reproduce. Because of this, it'd be unpleasant to bisect -- is this
problem already known?


Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄⠀⠀⠀⠀


2017-08-21 01:27:00

by Wanpeng Li

Subject: Re: kvm splat in mmu_spte_clear_track_bits

2017-08-21 7:13 GMT+08:00 Adam Borowski <[email protected]>:
> Hi!
> I'm afraid I keep getting a quite reliable, but random, splat when running
> KVM:

I reported something similar before. https://lkml.org/lkml/2017/6/29/64

Regards,
Wanpeng Li

>
> ------------[ cut here ]------------
> WARNING: CPU: 5 PID: 5826 at arch/x86/kvm/mmu.c:717 mmu_spte_clear_track_bits+0x123/0x170
> Modules linked in: tun nbd arc4 rtl8xxxu mac80211 cfg80211 rfkill nouveau video ttm
> CPU: 5 PID: 5826 Comm: qemu-system-x86 Not tainted 4.13.0-rc5-vanilla-ubsan-00211-g7f680d7ec315 #1
> Hardware name: System manufacturer System Product Name/M4A77T, BIOS 2401 05/18/2011
> task: ffff880207ef0400 task.stack: ffffc900035e4000
> RIP: 0010:mmu_spte_clear_track_bits+0x123/0x170
> RSP: 0018:ffffc900035e7ab0 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 000000010501cc67 RCX: 0000000000000001
> RDX: dead0000000000ff RSI: ffff88020e501df8 RDI: 0000000004140700
> RBP: ffffc900035e7ad8 R08: 0000000000000100 R09: 0000000000000003
> R10: 0000000000000003 R11: 0000000000000005 R12: 000000000010501c
> R13: ffffea0004140700 R14: ffff88020e1d0000 R15: 0000000000000000
> FS: 00007f0213fbd700(0000) GS:ffff88022fd40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 000000022187f000 CR4: 00000000000006e0
> Call Trace:
> drop_spte+0x26/0x130
> mmu_page_zap_pte+0xc4/0x160
> kvm_mmu_prepare_zap_page+0x65/0x660
> kvm_mmu_invalidate_zap_all_pages+0xc5/0x1f0
> kvm_mmu_invalidate_zap_pages_in_memslot+0x9/0x10
> kvm_page_track_flush_slot+0x86/0xd0
> kvm_arch_flush_shadow_memslot+0x9/0x10
> __kvm_set_memory_region+0x8fb/0x14f0
> kvm_set_memory_region+0x2f/0x50
> kvm_vm_ioctl+0x559/0xcc0
> ? kvm_vcpu_ioctl+0x171/0x620
> ? __switch_to+0x30b/0x740
> do_vfs_ioctl+0xbb/0x8d0
> ? find_vma+0x23/0x100
> ? __fget_light+0x94/0x110
> SyS_ioctl+0x86/0xa0
> entry_SYSCALL_64_fastpath+0x17/0x98
> RIP: 0033:0x7f021c80ddc7
> RSP: 002b:00007f0213fbc518 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f021c80ddc7
> RDX: 00007f0213fbc5b0 RSI: 000000004020ae46 RDI: 000000000000000a
> RBP: 0000000000000000 R08: 00007f020c1698a0 R09: 0000000000000000
> R10: 00007f020c1698a0 R11: 0000000000000246 R12: 0000000000000006
> R13: 00007f022201c000 R14: 0000000000000002 R15: 0000558c3899e550
> Code: ae fc 01 48 85 c0 75 1c 4c 89 e7 e8 98 de fd ff 48 8b 05 81 ae fc 01 48 85 c0 74 ba 48 85 c3 0f 95 c3 eb b8 48 85 c3 74 e7 eb dd <0f> ff eb 97 4c 89 e7 66 0f 1f 44 00 00 e8 6b de fd ff eb 97 31
> ---[ end trace 16c196134f0dd0a9 ]---
>
> After this, there are hundreds of repeats and lots of secondary damage which
> kills the host quickly.
>
> Usually this happens within a few minutes, but sometimes it takes ~half an
> hour to reproduce. Because of this, it'd be unpleasant to bisect -- is this
> problem already known?
>
>
> Meow!
> --
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
> ⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
> ⠈⠳⣄⠀⠀⠀⠀

2017-08-21 19:12:10

by Adam Borowski

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Mon, Aug 21, 2017 at 09:26:57AM +0800, Wanpeng Li wrote:
> 2017-08-21 7:13 GMT+08:00 Adam Borowski <[email protected]>:
> > Hi!
> > I'm afraid I keep getting a quite reliable, but random, splat when running
> > KVM:
>
> I reported something similar before. https://lkml.org/lkml/2017/6/29/64

Your problem seems to require OOM; I don't have any memory pressure at all:
running a single 2GB guest while there's nothing big on the host (bloatfox,
xfce, xorg, terminals + some minor junk); 8GB + (untouched) swap. There's
no memory pressure inside the guest either -- none was Linux (I wanted to
test something on hurd, kfreebsd) and I doubt they even got to use all of
their frames.

Also, it doesn't reproduce for me on 4.12.

> > ------------[ cut here ]------------
> > WARNING: CPU: 5 PID: 5826 at arch/x86/kvm/mmu.c:717 mmu_spte_clear_track_bits+0x123/0x170
> > Modules linked in: tun nbd arc4 rtl8xxxu mac80211 cfg80211 rfkill nouveau video ttm
> > CPU: 5 PID: 5826 Comm: qemu-system-x86 Not tainted 4.13.0-rc5-vanilla-ubsan-00211-g7f680d7ec315 #1
> > Hardware name: System manufacturer System Product Name/M4A77T, BIOS 2401 05/18/2011
> > task: ffff880207ef0400 task.stack: ffffc900035e4000
> > RIP: 0010:mmu_spte_clear_track_bits+0x123/0x170
> > RSP: 0018:ffffc900035e7ab0 EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: 000000010501cc67 RCX: 0000000000000001
> > RDX: dead0000000000ff RSI: ffff88020e501df8 RDI: 0000000004140700
> > RBP: ffffc900035e7ad8 R08: 0000000000000100 R09: 0000000000000003
> > R10: 0000000000000003 R11: 0000000000000005 R12: 000000000010501c
> > R13: ffffea0004140700 R14: ffff88020e1d0000 R15: 0000000000000000
> > FS: 00007f0213fbd700(0000) GS:ffff88022fd40000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000000000000 CR3: 000000022187f000 CR4: 00000000000006e0
> > Call Trace:
> > drop_spte+0x26/0x130
> > mmu_page_zap_pte+0xc4/0x160
> > kvm_mmu_prepare_zap_page+0x65/0x660
> > kvm_mmu_invalidate_zap_all_pages+0xc5/0x1f0
> > kvm_mmu_invalidate_zap_pages_in_memslot+0x9/0x10
> > kvm_page_track_flush_slot+0x86/0xd0
> > kvm_arch_flush_shadow_memslot+0x9/0x10
> > __kvm_set_memory_region+0x8fb/0x14f0
> > kvm_set_memory_region+0x2f/0x50
> > kvm_vm_ioctl+0x559/0xcc0
> > ? kvm_vcpu_ioctl+0x171/0x620
> > ? __switch_to+0x30b/0x740
> > do_vfs_ioctl+0xbb/0x8d0
> > ? find_vma+0x23/0x100
> > ? __fget_light+0x94/0x110
> > SyS_ioctl+0x86/0xa0
> > entry_SYSCALL_64_fastpath+0x17/0x98
> > RIP: 0033:0x7f021c80ddc7
> > RSP: 002b:00007f0213fbc518 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f021c80ddc7
> > RDX: 00007f0213fbc5b0 RSI: 000000004020ae46 RDI: 000000000000000a
> > RBP: 0000000000000000 R08: 00007f020c1698a0 R09: 0000000000000000
> > R10: 00007f020c1698a0 R11: 0000000000000246 R12: 0000000000000006
> > R13: 00007f022201c000 R14: 0000000000000002 R15: 0000558c3899e550
> > Code: ae fc 01 48 85 c0 75 1c 4c 89 e7 e8 98 de fd ff 48 8b 05 81 ae fc 01 48 85 c0 74 ba 48 85 c3 0f 95 c3 eb b8 48 85 c3 74 e7 eb dd <0f> ff eb 97 4c 89 e7 66 0f 1f 44 00 00 e8 6b de fd ff eb 97 31
> > ---[ end trace 16c196134f0dd0a9 ]---
> >
> > After this, there are hundreds of repeats and lots of secondary damage which
> > kills the host quickly.
> >
> > Usually this happens within a few minutes, but sometimes it takes ~half an
> > hour to reproduce. Because of this, it'd be unpleasant to bisect -- is this
> > problem already known?

--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄⠀⠀⠀⠀

2017-08-21 19:58:39

by Radim Krčmář

Subject: Re: kvm splat in mmu_spte_clear_track_bits

2017-08-21 21:12+0200, Adam Borowski:
> On Mon, Aug 21, 2017 at 09:26:57AM +0800, Wanpeng Li wrote:
> > 2017-08-21 7:13 GMT+08:00 Adam Borowski <[email protected]>:
> > > Hi!
> > > I'm afraid I keep getting a quite reliable, but random, splat when running
> > > KVM:
> >
> > I reported something similar before. https://lkml.org/lkml/2017/6/29/64
>
> Your problem seems to require OOM; I don't have any memory pressure at all:
> running a single 2GB guest while there's nothing big on the host (bloatfox,
> xfce, xorg, terminals + some minor junk); 8GB + (untouched) swap. There's
> no memory pressure inside the guest either -- none was Linux (I wanted to
> test something on hurd, kfreebsd) and I doubt they even got to use all of
> their frames.

I even tried hurd, but couldn't reproduce ... what is your qemu command
line and the output of host's `grep . /sys/module/kvm*/parameters/*`?

> Also, it doesn't reproduce for me on 4.12.

Great info ... the most suspicious between v4.12 and v4.13-rc5 is the
series with dcdca5fed5f6 ("x86: kvm: mmu: make spte mmio mask more
explicit"), does reverting it help?

`git revert ce00053b1cfca312c22e2a6465451f1862561eab~1..995f00a619584e65e53eff372d9b73b121a7bad5`

Thanks.

2017-08-21 22:32:33

by Adam Borowski

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Mon, Aug 21, 2017 at 09:58:34PM +0200, Radim Krčmář wrote:
> 2017-08-21 21:12+0200, Adam Borowski:
> > On Mon, Aug 21, 2017 at 09:26:57AM +0800, Wanpeng Li wrote:
> > > 2017-08-21 7:13 GMT+08:00 Adam Borowski <[email protected]>:
> > > > I'm afraid I keep getting a quite reliable, but random, splat when running
> > > > KVM:
> > >
> > > I reported something similar before. https://lkml.org/lkml/2017/6/29/64
> >
> > Your problem seems to require OOM; I don't have any memory pressure at all:
> > running a single 2GB guest while there's nothing big on the host (bloatfox,
> > xfce, xorg, terminals + some minor junk); 8GB + (untouched) swap. There's
> > no memory pressure inside the guest either -- none was Linux (I wanted to
> > test something on hurd, kfreebsd) and I doubt they even got to use all of
> > their frames.
>
> I even tried hurd, but couldn't reproduce ...

Also happens with a win10 guest, and with multiple Linuxes.

> what is your qemu command
> line and the output of host's `grep . /sys/module/kvm*/parameters/*`?

qemu-system-x86_64 -enable-kvm -m 2048 -vga qxl -usbdevice tablet \
-net bridge -net nic \
-drive file="$DISK",cache=writeback,index=0,media=disk,discard=on

qemu-system-x86_64 -enable-kvm -m 2048 -vga qxl -usbdevice tablet \
-net bridge -net nic \
-drive file="$DISK",cache=unsafe,index=0,media=disk,discard=on,if=virtio,format=raw

/sys/module/kvm/parameters/halt_poll_ns:200000
/sys/module/kvm/parameters/halt_poll_ns_grow:2
/sys/module/kvm/parameters/halt_poll_ns_shrink:0
/sys/module/kvm/parameters/ignore_msrs:N
/sys/module/kvm/parameters/kvmclock_periodic_sync:Y
/sys/module/kvm/parameters/lapic_timer_advance_ns:0
/sys/module/kvm/parameters/min_timer_period_us:500
/sys/module/kvm/parameters/tsc_tolerance_ppm:250
/sys/module/kvm/parameters/vector_hashing:Y
/sys/module/kvm_amd/parameters/avic:0
/sys/module/kvm_amd/parameters/nested:1
/sys/module/kvm_amd/parameters/npt:1
/sys/module/kvm_amd/parameters/vls:0

> > Also, it doesn't reproduce for me on 4.12.
>
> Great info ... the most suspicious between v4.12 and v4.13-rc5 is the
> series with dcdca5fed5f6 ("x86: kvm: mmu: make spte mmio mask more
> explicit"), does reverting it help?
>
> `git revert ce00053b1cfca312c22e2a6465451f1862561eab~1..995f00a619584e65e53eff372d9b73b121a7bad5`

Alas, doesn't seem to help.

I first installed a Debian stretch guest; the host survived both the
installation and subsequent fooling around. But then I started a win10
guest, which splatted as soon as the initial screen appeared.


Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄⠀⠀⠀⠀

2017-08-23 12:23:03

by Paolo Bonzini

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On 22/08/2017 00:32, Adam Borowski wrote:
> On Mon, Aug 21, 2017 at 09:58:34PM +0200, Radim Krčmář wrote:
>> 2017-08-21 21:12+0200, Adam Borowski:
>>> On Mon, Aug 21, 2017 at 09:26:57AM +0800, Wanpeng Li wrote:
>>>> 2017-08-21 7:13 GMT+08:00 Adam Borowski <[email protected]>:
>>>>> I'm afraid I keep getting a quite reliable, but random, splat when running
>>>>> KVM:
>>>>
>>>> I reported something similar before. https://lkml.org/lkml/2017/6/29/64
>>>
>>> Your problem seems to require OOM; I don't have any memory pressure at all:
>>> running a single 2GB guest while there's nothing big on the host (bloatfox,
>>> xfce, xorg, terminals + some minor junk); 8GB + (untouched) swap. There's
>>> no memory pressure inside the guest either -- none was Linux (I wanted to
>>> test something on hurd, kfreebsd) and I doubt they even got to use all of
>>> their frames.
>>
>> I even tried hurd, but couldn't reproduce ...
>
> Also happens with a win10 guest, and with multiple Linuxes.
>
>> what is your qemu command
>> line and the output of host's `grep . /sys/module/kvm*/parameters/*`?
>
> qemu-system-x86_64 -enable-kvm -m 2048 -vga qxl -usbdevice tablet \
> -net bridge -net nic \
> -drive file="$DISK",cache=writeback,index=0,media=disk,discard=on
>
> qemu-system-x86_64 -enable-kvm -m 2048 -vga qxl -usbdevice tablet \
> -net bridge -net nic \
> -drive file="$DISK",cache=unsafe,index=0,media=disk,discard=on,if=virtio,format=raw
>
> /sys/module/kvm/parameters/halt_poll_ns:200000
> /sys/module/kvm/parameters/halt_poll_ns_grow:2
> /sys/module/kvm/parameters/halt_poll_ns_shrink:0
> /sys/module/kvm/parameters/ignore_msrs:N
> /sys/module/kvm/parameters/kvmclock_periodic_sync:Y
> /sys/module/kvm/parameters/lapic_timer_advance_ns:0
> /sys/module/kvm/parameters/min_timer_period_us:500
> /sys/module/kvm/parameters/tsc_tolerance_ppm:250
> /sys/module/kvm/parameters/vector_hashing:Y
> /sys/module/kvm_amd/parameters/avic:0
> /sys/module/kvm_amd/parameters/nested:1
> /sys/module/kvm_amd/parameters/npt:1
> /sys/module/kvm_amd/parameters/vls:0
>
>>> Also, it doesn't reproduce for me on 4.12.
>>
>> Great info ... the most suspicious between v4.12 and v4.13-rc5 is the
>> series with dcdca5fed5f6 ("x86: kvm: mmu: make spte mmio mask more
>> explicit"), does reverting it help?
>>
>> `git revert ce00053b1cfca312c22e2a6465451f1862561eab~1..995f00a619584e65e53eff372d9b73b121a7bad5`
>
> Alas, doesn't seem to help.
>
> I first installed a Debian stretch guest; the host survived both the
> installation and subsequent fooling around. But then I started a win10
> guest, which splatted as soon as the initial screen appeared.

Can you check if disabling THP on the host also fixes it for you? I
would also try commit 1372324b328cd5dabaef5e345e37ad48c63df2a9 to
identify whether it was caused by a KVM change in 4.13 or something else.

Paolo

2017-08-24 07:43:58

by Wanpeng Li

Subject: Re: kvm splat in mmu_spte_clear_track_bits

2017-08-23 20:22 GMT+08:00 Paolo Bonzini <[email protected]>:
> On 22/08/2017 00:32, Adam Borowski wrote:
>> On Mon, Aug 21, 2017 at 09:58:34PM +0200, Radim Krčmář wrote:
>>> 2017-08-21 21:12+0200, Adam Borowski:
>>>> On Mon, Aug 21, 2017 at 09:26:57AM +0800, Wanpeng Li wrote:
>>>>> 2017-08-21 7:13 GMT+08:00 Adam Borowski <[email protected]>:
>>>>>> I'm afraid I keep getting a quite reliable, but random, splat when running
>>>>>> KVM:
>>>>>
>>>>> I reported something similar before. https://lkml.org/lkml/2017/6/29/64
>>>>
>>>> Your problem seems to require OOM; I don't have any memory pressure at all:
>>>> running a single 2GB guest while there's nothing big on the host (bloatfox,
>>>> xfce, xorg, terminals + some minor junk); 8GB + (untouched) swap. There's
>>>> no memory pressure inside the guest either -- none was Linux (I wanted to
>>>> test something on hurd, kfreebsd) and I doubt they even got to use all of
>>>> their frames.
>>>
>>> I even tried hurd, but couldn't reproduce ...
>>
>> Also happens with a win10 guest, and with multiple Linuxes.
>>
>>> what is your qemu command
>>> line and the output of host's `grep . /sys/module/kvm*/parameters/*`?
>>
>> qemu-system-x86_64 -enable-kvm -m 2048 -vga qxl -usbdevice tablet \
>> -net bridge -net nic \
>> -drive file="$DISK",cache=writeback,index=0,media=disk,discard=on
>>
>> qemu-system-x86_64 -enable-kvm -m 2048 -vga qxl -usbdevice tablet \
>> -net bridge -net nic \
>> -drive file="$DISK",cache=unsafe,index=0,media=disk,discard=on,if=virtio,format=raw
>>
>> /sys/module/kvm/parameters/halt_poll_ns:200000
>> /sys/module/kvm/parameters/halt_poll_ns_grow:2
>> /sys/module/kvm/parameters/halt_poll_ns_shrink:0
>> /sys/module/kvm/parameters/ignore_msrs:N
>> /sys/module/kvm/parameters/kvmclock_periodic_sync:Y
>> /sys/module/kvm/parameters/lapic_timer_advance_ns:0
>> /sys/module/kvm/parameters/min_timer_period_us:500
>> /sys/module/kvm/parameters/tsc_tolerance_ppm:250
>> /sys/module/kvm/parameters/vector_hashing:Y
>> /sys/module/kvm_amd/parameters/avic:0
>> /sys/module/kvm_amd/parameters/nested:1
>> /sys/module/kvm_amd/parameters/npt:1
>> /sys/module/kvm_amd/parameters/vls:0
>>
>>>> Also, it doesn't reproduce for me on 4.12.
>>>
>>> Great info ... the most suspicious between v4.12 and v4.13-rc5 is the
>>> series with dcdca5fed5f6 ("x86: kvm: mmu: make spte mmio mask more
>>> explicit"), does reverting it help?
>>>
>>> `git revert ce00053b1cfca312c22e2a6465451f1862561eab~1..995f00a619584e65e53eff372d9b73b121a7bad5`
>>
>> Alas, doesn't seem to help.
>>
>> I first installed a Debian stretch guest; the host survived both the
>> installation and subsequent fooling around. But then I started a win10
>> guest, which splatted as soon as the initial screen appeared.
>
> Can you check if disabling THP on the host also fixes it for you? I
> would also try commit 1372324b328cd5dabaef5e345e37ad48c63df2a9 to
> identify whether it was caused by a KVM change in 4.13 or something else.

For the OOM testcase, the splat disappears if THP is disabled.

Regards,
Wanpeng Li

2017-08-25 13:14:27

by Adam Borowski

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Thu, Aug 24, 2017 at 03:43:55PM +0800, Wanpeng Li wrote:
> 2017-08-23 20:22 GMT+08:00 Paolo Bonzini <[email protected]>:
> > On 22/08/2017 00:32, Adam Borowski wrote:
> >> On Mon, Aug 21, 2017 at 09:58:34PM +0200, Radim Krčmář wrote:
> >>> 2017-08-21 21:12+0200, Adam Borowski:
> >>>> Also, it doesn't reproduce for me on 4.12.
> >>>
> >>> Great info ... the most suspicious between v4.12 and v4.13-rc5 is the
> >>> series with dcdca5fed5f6 ("x86: kvm: mmu: make spte mmio mask more
> >>> explicit"), does reverting it help?
> >>>
> >>> `git revert ce00053b1cfca312c22e2a6465451f1862561eab~1..995f00a619584e65e53eff372d9b73b121a7bad5`
> >>
> >> Alas, doesn't seem to help.
> >>
> >> I first installed a Debian stretch guest; the host survived both the
> >> installation and subsequent fooling around. But then I started a win10
> >> guest, which splatted as soon as the initial screen appeared.
> >
> > Can you check if disabling THP on the host also fixes it for you?

As in this?

echo never >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/defrag

Still reproduces, with or without reverting
ce00053b1cfca312c22e2a6465451f1862561eab~1..995f00a619584e65e53eff372d9b73b121a7bad5

> > I would also try commit 1372324b328cd5dabaef5e345e37ad48c63df2a9 to
> > identify whether it was caused by a KVM change in 4.13 or something
> > else.

I've run different guests for a couple of hours, no explosions. Thus it
looks like updating Cornelia's email address isn't the cause.

Too bad, there's 15k commits between 1372324b and 7f680d7ec315.
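
(If it comes to an actual bisect, seeding it would be something like

    git bisect start 7f680d7ec315 1372324b

~15k commits is only about 14 steps, but at up to half an hour -- or
more -- per verdict, that's still a long haul.)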

> For the OOM testcase, the splat disappears if THP is disabled.


Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄⠀⠀⠀⠀

2017-08-25 13:41:00

by Paolo Bonzini

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On 25/08/2017 15:14, Adam Borowski wrote:
>>> I would also try commit 1372324b328cd5dabaef5e345e37ad48c63df2a9 to
>>> identify whether it was caused by a KVM change in 4.13 or something
>>> else.
> I've run different guests for a couple of hours, no explosions. Thus it
> looks like updating Cornelia's email address isn't the cause.
>
> Too bad, there's 15k commits between 1372324b and 7f680d7ec315.

So:

- 4.12 works

- 1372324b328cd5dabaef5e345e37ad48c63df2a9 works

- 4.13-rc6 fails

The next ones to test are going to be
c136b84393d4e340e1b53fc7f737dd5827b19ee5 and 4.13-rc1.

This can be a starting point for bisection or pointing the finger at
someone else.

2017-08-27 12:35:13

by Adam Borowski

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Fri, Aug 25, 2017 at 03:40:50PM +0200, Paolo Bonzini wrote:
> On 25/08/2017 15:14, Adam Borowski wrote:
> >>> I would also try commit 1372324b328cd5dabaef5e345e37ad48c63df2a9 to
> >>> identify whether it was caused by a KVM change in 4.13 or something
> >>> else.
> > I've run different guests for a couple of hours, no explosions. Thus it
> > looks like updating Cornelia's email address isn't the cause.
> >
> > Too bad, there's 15k commits between 1372324b and 7f680d7ec315.
>
> So:
>
> - 4.12 works
>
> - 1372324b328cd5dabaef5e345e37ad48c63df2a9 works
>
> - 4.13-rc6 fails
>
> The next ones to test are going to be
> c136b84393d4e340e1b53fc7f737dd5827b19ee5 and 4.13-rc1.

c136b84393d4e340e1b53fc7f737dd5827b19ee5 works

4.13-rc1 works
Which actually isn't so surprising, as I pull from linus/master every Sunday
night/Monday morning, so I would have noticed KVM breakage sooner.

4.13-rc4 works

4.13-rc5 works (but ↓)

v4.13-rc5-173-g58d4e450a490 works... uh oh.
It shouldn't -- it's the merge-base between mainline and what I was running
at the time of the initial crash, and I'm sure anything non-mainline I had
isn't the culprit. After a couple of hours of running various loads inside
and outside KVM, I finally got it to crash.

4.13-rc5 retested fails
Crashed only after two hours or so of testing.

4.13-rc4 apparently works
It survived several hours of varied tests (like 5 debian-installer runs, a
win10 point release upgrade, some hurd package building, openbsd, etc),
all while the host was likewise busy.


Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5
but I wouldn't bet my life on it.


Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄⠀⠀⠀⠀

2017-08-28 15:26:21

by Bernhard Held

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On 08/27/2017 at 02:35 PM, Adam Borowski wrote:
> 4.13-rc5 retested fails
> Crashed only after two hours or so of testing.
>
> 4.13-rc4 apparently works
> It survived several hours of varied tests (like 5 debian-installer runs, a
> win10 point release upgrade, some hurd package building, openbsd, etc),
> all while the host was likewise busy.
>
> Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5
> but I wouldn't bet my life on it.

I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure.

There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related:

$ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit
e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store()
aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl
d041353dc98a mm: fix list corruptions on shmem shrinklist
af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages
c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c
b3a81d0841a9 mm: fix KSM data corruption
99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
0a2dd266dd6b mm: make tlb_flush_pending global
56236a59556c mm: refactor TLB gathering API
a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible"
0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending
16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending
9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context
4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests
9c56771316ef test_kmod: fix the lock in register_test_dev_kmod()
434b06ae23ba test_kmod: fix bug which allows negative values on two config options
a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
75dddef32514 mm: ratelimit PFNs busy info message
d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads
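
(For anyone repeating the revert: 27df70 is a merge commit, so git needs
the mainline parent spelled out, e.g.

    git revert -m 1 27df70

assuming parent 1 is the mainline side -- check with `git show 27df70`
first.)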

Any hint on what to test first is welcome!

Bernhard

2017-08-28 16:01:09

by Takashi Iwai

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Mon, 28 Aug 2017 17:26:05 +0200,
Bernhard Held wrote:
>
> On 08/27/2017 at 02:35 PM, Adam Borowski wrote:
> > 4.13-rc5 retested fails
> > Crashed only after two hours or so of testing.
> >
> > 4.13-rc4 apparently works
> > It survived several hours of varied tests (like 5 debian-installer runs, a
> > win10 point release upgrade, some hurd package building, openbsd, etc),
> > all while the host was likewise busy.
> >
> > Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5
> > but I wouldn't bet my life on it.
>
> I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure.
>
> There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related:
>
> $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit
> e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
> f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store()
> aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl
> d041353dc98a mm: fix list corruptions on shmem shrinklist
> af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages
> c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c
> b3a81d0841a9 mm: fix KSM data corruption
> 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
> 0a2dd266dd6b mm: make tlb_flush_pending global
> 56236a59556c mm: refactor TLB gathering API
> a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible"
> 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending
> 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending
> 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context
> 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests
> 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod()
> 434b06ae23ba test_kmod: fix bug which allows negative values on two config options
> a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
> 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
> 75dddef32514 mm: ratelimit PFNs busy info message
> d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads
>
> Any hint on what to test first is welcome!

Did you get the crash reliably?
I've been struggling to trigger it efficiently, but so far in vain.
Memory pressure alone doesn't seem to be the key to triggering it...


thanks,

Takashi

2017-08-28 16:07:55

by Bernhard Held

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On 08/28/2017 at 06:01 PM, Takashi Iwai wrote:
> On Mon, 28 Aug 2017 17:26:05 +0200,
> Bernhard Held wrote:
>> I get crashes with Win10 in kvm
>>
> Did you get the crash reliably?
> I've been struggling to trigger it efficiently, but so far in vain.
> Memory pressure alone doesn't seem to be the key to triggering it...

Yes, I get the crash pretty reliably with Win10 in kvm after 10 to 30 minutes. Some workload in Windows seems to be necessary to trigger the bug.

2017-08-28 16:17:42

by Takashi Iwai

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Mon, 28 Aug 2017 18:07:37 +0200,
Bernhard Held wrote:
>
> On 08/28/2017 at 06:01 PM, Takashi Iwai wrote:
> > On Mon, 28 Aug 2017 17:26:05 +0200,
> > Bernhard Held wrote:
> >> I get crashes with Win10 in kvm
> >>
> > Did you get the crash reliably?
> > I've been struggling to trigger it efficiently, but so far in vain.
> > Memory pressure alone doesn't seem to be the key to triggering it...
>
> Yes, I get the crash pretty reliably with Win10 in kvm after 10 to 30 minutes. Some workload in Windows seems to be necessary to trigger the bug.

OK, thanks, it's good to know.
Unfortunately I have no Windows on my machine, so it doesn't work for
me ;) But it implies that the workload in the VM is likely more
important than the workload on the host.


Takashi

2017-08-28 16:56:32

by Nadav Amit

Subject: Re: kvm splat in mmu_spte_clear_track_bits

Bernhard Held <[email protected]> wrote:

> On 08/27/2017 at 02:35 PM, Adam Borowski wrote:
>> 4.13-rc5 retested fails
>> Crashed only after two hours or so of testing.
>> 4.13-rc4 apparently works
>> It survived several hours of varied tests (like 5 debian-installer runs, a
>> win10 point release upgrade, some hurd package building, openbsd, etc),
>> all while the host was likewise busy.
>> Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5
>> but I wouldn't bet my life on it.
>
> I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure.
>
> There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related:
>
> $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit
> e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
> f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store()
> aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl
> d041353dc98a mm: fix list corruptions on shmem shrinklist
> af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages
> c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c
> b3a81d0841a9 mm: fix KSM data corruption
> 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
> 0a2dd266dd6b mm: make tlb_flush_pending global
> 56236a59556c mm: refactor TLB gathering API
> a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible"
> 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending
> 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending
> 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context
> 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests
> 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod()
> 434b06ae23ba test_kmod: fix bug which allows negative values on two config options
> a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
> 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
> 75dddef32514 mm: ratelimit PFNs busy info message
> d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads

Don’t blame me for the TLB stuff... My money is on aac2fea94f7a.




2017-08-29 09:19:31

by Bernhard Held

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
> Bernhard Held <[email protected]> wrote:
>
>> On 08/27/2017 at 02:35 PM, Adam Borowski wrote:
>>> 4.13-rc5 retested fails
>>> Crashed only after two hours or so of testing.
>>> 4.13-rc4 apparently works
>>> It survived several hours of varied tests (like 5 debian-installer runs, a
>>> win10 point release upgrade, some hurd package building, openbsd, etc),
>>> all while the host was likewise busy.
>>> Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5
>>> but I wouldn't bet my life on it.
>>
>> I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure.
>>
>> There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related:
>>
>> $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit
>> e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
>> f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store()
>> aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl
>> d041353dc98a mm: fix list corruptions on shmem shrinklist
>> af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages
>> c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c
>> b3a81d0841a9 mm: fix KSM data corruption
>> 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
>> 0a2dd266dd6b mm: make tlb_flush_pending global
>> 56236a59556c mm: refactor TLB gathering API
>> a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible"
>> 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending
>> 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending
>> 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context
>> 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests
>> 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod()
>> 434b06ae23ba test_kmod: fix bug which allows negative values on two config options
>> a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
>> 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
>> 75dddef32514 mm: ratelimit PFNs busy info message
>> d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads
>
> Don’t blame me for the TLB stuff... My money is on aac2fea94f7a .

Amit, thanks for your courage to expose your patch!

I'm more and more confident that aac2fea94f7a is the culprit. Maybe it just accelerates the triggering of the splat. To be more sure the kernel needs to be tested for a couple of days. It would be great if others could assist in testing aac2fea94f7a.

Have fun,
Bernhard

2017-08-29 15:54:03

by Nadav Amit

Subject: Re: kvm splat in mmu_spte_clear_track_bits

Bernhard Held <[email protected]> wrote:

> On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
>> Bernhard Held <[email protected]> wrote:
>>> On 08/27/2017 at 02:35 PM, Adam Borowski wrote:
>>>> 4.13-rc5 retested fails
>>>> Crashed only after two hours or so of testing.
>>>> 4.13-rc4 apparently works
>>>> It survived several hours of varied tests (like 5 debian-installer runs, a
>>>> win10 point release upgrade, some hurd package building, openbsd, etc),
>>>> all while the host was likewise busy.
>>>> Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5
>>>> but I wouldn't bet my life on it.
>>>
>>> I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure.
>>>
>>> There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related:
>>>
>>> $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit
>>> e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
>>> f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store()
>>> aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl
>>> d041353dc98a mm: fix list corruptions on shmem shrinklist
>>> af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages
>>> c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c
>>> b3a81d0841a9 mm: fix KSM data corruption
>>> 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
>>> 0a2dd266dd6b mm: make tlb_flush_pending global
>>> 56236a59556c mm: refactor TLB gathering API
>>> a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible"
>>> 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending
>>> 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending
>>> 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context
>>> 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests
>>> 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod()
>>> 434b06ae23ba test_kmod: fix bug which allows negative values on two config options
>>> a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
>>> 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
>>> 75dddef32514 mm: ratelimit PFNs busy info message
>>> d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads
>> Don’t blame me for the TLB stuff... My money is on aac2fea94f7a .
>
> Amit, thanks for your courage to expose your patch!

Just for the record, aac2fea94f7a is not mine (some others are).

>
> I'm more and more confident that aac2fea94f7a is the culprit. Maybe it just accelerates the triggering of the splat. To be more sure the kernel needs to be tested for a couple of days. It would be great if others could assist in testing aac2fea94f7a.




2017-08-29 16:11:03

by Linus Torvalds

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Tue, Aug 29, 2017 at 7:09 AM, Andrea Arcangeli <[email protected]> wrote:
> Hello,
>
> On Tue, Aug 29, 2017 at 02:59:23PM +0200, Adam Borowski wrote:
>> On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote:
>> > [Put more people to Cc, sorry for growing too much...]
>>
>> We're all interested in 4.13.0 not crashing on us, so that's ok.
>>
>> > On Tue, 29 Aug 2017 11:19:13 +0200,
>> > Bernhard Held wrote:
>> > >
>> > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
>> > > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a .
>> > >
>> > > Amit, thanks for your courage to expose your patch!
>> > >
>> > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it
>> > > just accelerates the triggering of the splat. To be more sure the
>> > > kernel needs to be tested for a couple of days. It would be great if
>> > > others could assist in testing aac2fea94f7a.
>> >
>> > I'm testing with the revert for a while and it seems working.
>>
>> With nothing but aac2fea94f7a reverted, no explosions for me either.
>
> The aforementioned commit has 3 bugs.

Yes. I'm reverting it from my tree.

We should really *really* just tell the stupid MMU notifier users that
they can't sleep.

The MMU notifiers are not going to destroy our VM layer. I hate the
damn crap, and this kind of garbage is an example of why.

So Andrew - please stop taking "MMU notifier can sleep" patches. Start
taking patches that fix the users, not break the VM.

Linus

2017-08-29 12:59:34

by Adam Borowski

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote:
> [Put more people to Cc, sorry for growing too much...]

We're all interested in 4.13.0 not crashing on us, so that's ok.

> On Tue, 29 Aug 2017 11:19:13 +0200,
> Bernhard Held wrote:
> >
> > On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
> > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a .
> >
> > Amit, thanks for your courage to expose your patch!
> >
> > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it
> > just accelerates the triggering of the splat. To be more sure the
> > kernel needs to be tested for a couple of days. It would be great if
> > others could assist in testing aac2fea94f7a.
>
> I'm testing with the revert for a while and it seems working.

With nothing but aac2fea94f7a reverted, no explosions for me either.

--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄⠀⠀⠀⠀

2017-08-29 14:09:39

by Andrea Arcangeli

Subject: Re: kvm splat in mmu_spte_clear_track_bits

Hello,

On Tue, Aug 29, 2017 at 02:59:23PM +0200, Adam Borowski wrote:
> On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote:
> > [Put more people to Cc, sorry for growing too much...]
>
> We're all interested in 4.13.0 not crashing on us, so that's ok.
>
> > On Tue, 29 Aug 2017 11:19:13 +0200,
> > Bernhard Held wrote:
> > >
> > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
> > > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a .
> > >
> > > Amit, thanks for your courage to expose your patch!
> > >
> > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it
> > > just accelerates the triggering of the splat. To be more sure the
> > > kernel needs to be tested for a couple of days. It would be great if
> > > others could assist in testing aac2fea94f7a.
> >
> > I'm testing with the revert for a while and it seems working.
>
> With nothing but aac2fea94f7a reverted, no explosions for me either.

The aforementioned commit has 3 bugs.

1) mmu_notifier_invalidate_range cannot be used in replacement of
mmu_notifier_invalidate_range_start/end. For KVM
mmu_notifier_invalidate_range is a noop, and rightfully so. An MMU
notifier implementation has to implement either the
->invalidate_range method or the invalidate_range_start/end
methods, not both. And if you implement invalidate_range_start/end
like KVM is forced to do, calling mmu_notifier_invalidate_range in
common code is a noop for KVM.

For those MMU notifiers that can get away only implementing
->invalidate_range, the ->invalidate_range is implicitly called by
mmu_notifier_invalidate_range_end(). And only those secondary MMUs
that share the same pagetable with the primary MMU (like AMD
iommuv2) can get away only implementing ->invalidate_range.

So all cases (THP on/off) are broken right now.

To fix this it is enough to replace mmu_notifier_invalidate_range with a
mmu_notifier_invalidate_range_start/mmu_notifier_invalidate_range_end pair.
Either that, or call mmu_notifier_invalidate_page multiple times like before.

2) address + (1UL << compound_order(page)) is buggy; it should be
address + (PAGE_SIZE << compound_order(page)): it's bytes, not pages --
2M, not 512.

3) The whole invalidate_range thing was an attempt to call a single
invalidate while walking multiple 4k ptes that map the same THP
(after a pmd virtual split without a physical compound page THP
split). It's unclear if the rmap_walk will always provide an
address that is 2M aligned as the parameter to try_to_unmap_one in
the presence of THP. I think it also needs an address &= ~((PAGE_SIZE <<
compound_order(page)) - 1) to be safe.
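
Roughly, in try_to_unmap_one() the three fixes together would look
something like this (just a sketch to illustrate, not a real patch):

    /* bytes, not pages (bug 2) */
    unsigned long size = PAGE_SIZE << compound_order(page);
    /* align the start down to the THP boundary (bug 3) */
    unsigned long start = address & ~(size - 1);

    /* bug 1: KVM implements the start/end methods, not ->invalidate_range */
    mmu_notifier_invalidate_range_start(mm, start, start + size);
    /* ... walk and clear the pte(s) under the page table lock ... */
    mmu_notifier_invalidate_range_end(mm, start, start + size);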

The other bug, where you can reproduce the same corruption with OOM, is
unrelated and caused by the OOM reaper. The OOM reaper was even corrupting
data if a task writing to disk got stuck in OOM in the write() syscall
or an async I/O write.

To fix the KVM corruption in the OOM reaper, it needs to call
mmu_notifier_invalidate_range_start/end around
oom_kill.c:unmap_page_range. This additional
mmu_notifier_invalidate_range_start will not be good for the OOM reaper,
because it's yet another case (like the mmap_sem for writing) that
will prevent the OOM reaper from running, hindering its ability to hide
XFS OOM deadlocks and making those resurface. Not in the KVM case, because
we use a spinlock to serialize against the secondary MMU activity and
the KVM critical section under the spinlock isn't going to allocate
memory, but range_start can schedule or block on slow hardware where
the secondary MMU is accessed through PCI (not the KVM case).
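
Just to illustrate what I mean (a sketch against the 4.13-era
__oom_reap_task_mm() loop, untested):

    struct mmu_gather tlb;

    tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
    mmu_notifier_invalidate_range_start(mm, vma->vm_start, vma->vm_end);
    unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end, NULL);
    mmu_notifier_invalidate_range_end(mm, vma->vm_start, vma->vm_end);
    tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);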

My preference is still to make the OOM reaper a config option and let
it grow into the VM at zero cost if disabled, while at the same time
having the option to keep the VM simpler and spend the time fixing the
filesystem bugs instead (while still being able to reproduce them more
easily with OOM reaper disabled).

Thanks,
Andrea

2017-08-29 12:58:02

by Mike Galbraith

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Mon, 2017-08-28 at 09:56 -0700, Nadav Amit wrote:
> Bernhard Held <[email protected]> wrote:
>
> > On 08/27/2017 at 02:35 PM, Adam Borowski wrote:
> >> 4.13-rc5 retested fails
> >> Crashed only after two hours or so of testing.
> >> 4.13-rc4 apparently works
> >> It survived several hours of varied tests (like 5 debian-installer runs, a
> >> win10 point release upgrade, some hurd package building, openbsd, etc),
> >> all while the host was likewise busy.
> >> Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5
> >> but I wouldn't bet my life on it.
> >
> > I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure.
> >
> > There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related:
> >
> > $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit
> > e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
> > f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store()
> > aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl
> > d041353dc98a mm: fix list corruptions on shmem shrinklist
> > af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages
> > c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c
> > b3a81d0841a9 mm: fix KSM data corruption
> > 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
> > 0a2dd266dd6b mm: make tlb_flush_pending global
> > 56236a59556c mm: refactor TLB gathering API
> > a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible"
> > 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending
> > 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending
> > 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context
> > 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests
> > 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod()
> > 434b06ae23ba test_kmod: fix bug which allows negative values on two config options
> > a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
> > 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
> > 75dddef32514 mm: ratelimit PFNs busy info message
> > d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads
>
> Don’t blame me for the TLB stuff... My money is on aac2fea94f7a .

You may be onto something.

FWIW, with an RT host/guest, I reproduced the problem yesterday in
fairly short order, but today, with that commit reverted, and pushing
markedly harder, nada.

(hohum, intermittent bugs tend to do that, they're particularly fond of
showing up about 10 seconds after you report them dead...9..8..7;)

A colleague suggested going back to mmu_notifier_invalidate_page(),
which I'm going to try shortly (hopefully noticing absolutely nothing
the least bit 'interesting'), but first, I'm gonna CC the author of
that _maybe_ culprit patch.

-Mike

2017-08-29 18:28:49

by Jerome Glisse

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Tue, Aug 29, 2017 at 09:10:59AM -0700, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 7:09 AM, Andrea Arcangeli <[email protected]> wrote:
> > Hello,
> >
> > On Tue, Aug 29, 2017 at 02:59:23PM +0200, Adam Borowski wrote:
> >> On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote:
> >> > [Put more people to Cc, sorry for growing too much...]
> >>
> >> We're all interested in 4.13.0 not crashing on us, so that's ok.
> >>
> >> > On Tue, 29 Aug 2017 11:19:13 +0200,
> >> > Bernhard Held wrote:
> >> > >
> >> > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
> >> > > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a .
> >> > >
> >> > > Amit, thanks for your courage to expose your patch!
> >> > >
> >> > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it
> >> > > just accelerates the triggering of the splat. To be more sure the
> >> > > kernel needs to be tested for a couple of days. It would be great if
> >> > > others could assist in testing aac2fea94f7a.
> >> >
> >> > I'm testing with the revert for a while and it seems working.
> >>
> >> With nothing but aac2fea94f7a reverted, no explosions for me either.
> >
> > The aforementioned commit has 3 bugs.
>
> Yes. I'm reverting it from my tree.
>
> We should really *really* just tell the stupid MMU notifier users that
> they can't sleep.

There is no way around sleeping if we ever want to support things like
GPUs. To invalidate a page table on a GPU you need to schedule commands
to do so on the GPU command queue and wait for the GPU to signal that it
has invalidated its page tables/TLB and caches.

We had this discussion before. Either we want to support all the new
fancy GPGPU and AI stuff and all the APIs they rely on, or we should tell
them: sorry guys, not on Linux.

>
> The MMU notifiers are not going to destroy our VM layer. I hate the
> damn crap, and this kind of garbage is an example of why.

The issue here is that nobody calls mmu_notifier_invalidate_range_start/end(),
hence people relied on invalidate_range() not sleeping, like start/end.
Now we can make the decision that start/end can sleep while invalidate_range()
can't, but then we also need to make sure that range_start/end is always
called.

Cheers,
Jérôme

2017-08-29 18:34:11

by Jerome Glisse

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Tue, Aug 29, 2017 at 04:09:24PM +0200, Andrea Arcangeli wrote:
> Hello,
>
> On Tue, Aug 29, 2017 at 02:59:23PM +0200, Adam Borowski wrote:
> > On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote:
> > > [Put more people to Cc, sorry for growing too much...]
> >
> > We're all interested in 4.13.0 not crashing on us, so that's ok.
> >
> > > On Tue, 29 Aug 2017 11:19:13 +0200,
> > > Bernhard Held wrote:
> > > >
> > > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
> > > > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a .
> > > >
> > > > Amit, thanks for your courage to expose your patch!
> > > >
> > > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it
> > > > just accelerates the triggering of the splat. To be more sure the
> > > > kernel needs to be tested for a couple of days. It would be great if
> > > > others could assist in testing aac2fea94f7a.
> > >
> > > I'm testing with the revert for a while and it seems working.
> >
> > With nothing but aac2fea94f7a reverted, no explosions for me either.
>
> The aforementioned commit has 3 bugs.
>
> 1) mmu_notifier_invalidate_range cannot be used in replacement of
> mmu_notifier_invalidate_range_start/end. For KVM
> mmu_notifier_invalidate_range is a noop and rightfully so. A MMU
> notifier implementation has to implement either
> ->invalidate_range method or the invalidate_range_start/end
> methods, not both. And if you implement invalidate_range_start/end
> like KVM is forced to do, calling mmu_notifier_invalidate_range in
> common code is a noop for KVM.
>
> For those MMU notifiers that can get away only implementing
> ->invalidate_range, the ->invalidate_range is implicitly called by
> mmu_notifier_invalidate_range_end(). And only those secondary MMUs
> that share the same pagetable with the primary MMU (like AMD
> iommuv2) can get away only implementing ->invalidate_range.
>
> So all cases (THP on/off) are broken right now.
>
> To fix this is enough to replace mmu_notifier_invalidate_range with
> mmu_notifier_invalidate_range_start;mmu_notifier_invalidate_range_end. Either
> that or call multiple mmu_notifier_invalidate_page like before.

Kirill did regress invalidate_page: it used to be called outside the
spinlock and now it is called inside the spinlock, thus reverting will
reintroduce that regression.

You can refer to the thread about it:

https://lkml.org/lkml/2017/8/9/418

Jérôme

2017-08-29 19:06:46

by Linus Torvalds

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Tue, Aug 29, 2017 at 11:34 AM, Jerome Glisse <[email protected]> wrote:
>
> Kirill did regress invalidate_page: it used to be called outside the
> spinlock and now it is called inside the spinlock, thus reverting will
> reintroduce that regression.

Honestly, this MMU notifier thing has been nothing but a badly
designed mistake from beginning to end, and bad rules for what can
sleep and what can not are one fundamental problem.

There are fundamentally two levels of VM locking, and those two levels
are not going to go away, and we're not budging on them:

- there's the "virtual address" level, which can block. We have a
nice mmap_semaphore, and we guarantee that it's held for writing for
all changes to the virtual memory layout

This is the "mmap/munmap" kind of granularity. The mmu callbacks at
*this* level are fine to block.

- then there is the "page level" VM handling, and honestly, that
*fundamentally* uses a spinlock. If we look at a particular page, that
page is meaningless without the lock. Really.

I honestly believe that any MMU callback at this level needs to be
atomic. Some of them absolutely *have* to be (that "change_pte", for
example).

In that second case, we might have a "begin/end" surrounding the
actual page table walk. And that might sleep, but then it
*fundamentally* cannot actually be about some particular single page
or stable range. Because without the page table spinlock, no such
stability exists. It's purely a "we are not going to start looking at
this range" kind of thing.

I really don't understand why the nVidia crap cannot follow those
simple rules. Because either

(a) you're working with virtual addresses, and you should be able to
work on that virtual layer

(b) you're actually working with physical pages, and you can just
hold on to those physical pages yourself.

I really detest our MMU callbacks. I shouldn't have allowed them to be
merged. And I definitely shouldn't allow them to screw up our VM
layer.

But we have them, and we should work at making sure people do sane things.

And yes, those sane things may include

(a) guaranteeing that the start/end range calls are always done
around the actual locked region.

(b) adding a ton of validation so that people *see* when they break
the rules. Even when they don't use some random notifier crud.

That (b) may involve adding a number of "might_sleep()" calls (not
deep in the notifiers themselves, but in the actual wrapper functions
even when notifiers are compiled out entirely!), but also adding calls
to validate those kinds of "you can't call
mmu_notifier_invalidate_page() without having first called
mmu_notifier_invalidate_range_start() in a sleepable context".
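
Something like this in the wrapper, say (a sketch; the might_sleep()
would have to go in the !CONFIG_MMU_NOTIFIER stub too):

    static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
                                    unsigned long start, unsigned long end)
    {
            might_sleep();  /* catch rule-breakers even with no notifiers registered */
            if (mm_has_notifiers(mm))
                    __mmu_notifier_invalidate_range_start(mm, start, end);
    }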

But (b) definitely should also be a very real onus on the mmu
notifiers themselves. No way can we sleep when we're traversing page
tables. We hold a page table lock. We can sleep before and after, but
not during actual page traversal.

Linus

2017-08-29 19:13:58

by Jerome Glisse

Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Tue, Aug 29, 2017 at 12:06:42PM -0700, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 11:34 AM, Jerome Glisse <[email protected]> wrote:
> >
> > Kirill did regress invalidate_page: it used to be called outside the
> > spinlock and is now called inside it, so reverting would reintroduce
> > that regression.
>
> Honestly, this MMU notifier thing has been nothing but a badly
> designed mistake from beginning to end, and bad rules for what can
> sleep and what can not are one fundamental problem.
>
> There are fundamentally two levels of VM locking, and those two levels
> are not going to go away, and we're not budging on them:
>
> - there's the "virtual address" level, which can block. We have a
> nice mmap_semaphore, and we guarantee that it's held for writing for
> all changes to the virtual memory layout
>
> This is the "mmap/munmap" kind of granularity. The mmu callbacks at
> *this* level are fine to block.
>
> - then there is the "page level" VM handling, and honestly, that
> *fundamentally* uses a spinlock. If we look at a particular page, that
> page is meaningless without the lock. Really.
>
> I honestly believe that any MMU callback at this level needs to be
> atomic. Some of them absolutely *have* to be (that "change_pte", for
> example).
>
> In that second case, we might have a "begin/end" surrounding the
> actual page table walk. And that might sleep, but then it
> *fundamentally* cannot actually be about some particular single page
> or stable range. Because without the page table spinlock, no such
> stability exists. It's purely a "we are not going to start looking at
> this range" kind of thing.
>
> I really don't understand why the nVidia crap cannot follow those
> simple rules. Because either
>
> (a) you're working with virtual addresses, and you should be able to
> work on that virtual layer
>
> (b) you're actually working with physical pages, and you can just
> hold on to those physical pages yourself.
>
> I really detest our MMU callbacks. I shouldn't have allowed them to be
> merged. And I definitely shouldn't allow them to screw up our VM
> layer.
>
> But we have them, and we should work at making sure people do sane things.
>
> And yes, those sane things may include
>
> (a) guaranteeing that the start/end range calls are always done
> around the actual locked region.
>
> (b) adding a ton of validation so that people *see* when they break
> the rules. Even when they don't use some random notifier crud.
>
> That (b) may involve adding a number of "might_sleep()" calls (not
> deep in the notifiers themselves, but in the actual wrapper functions
> even when notifiers are compiled out entirely!), but also adding calls
> to validate those kinds of "you can't call
> mmu_notifier_invalidate_page() without having first called
> mmu_notifier_invalidate_range_start() in a sleepable context".
>
> But (b) definitely should also be a very real onus on the mmu
> notifiers themselves. No way can we sleep when we're traversing page
> tables. We hold a page table lock. We can sleep before and after, but
> not during actual page traversal.

Yes, and I am fine with page traversal being under the spinlock and not
being able to sleep during that. I agree doing otherwise would be
insane. It is just that the existing behavior of try_to_unmap_one()
and page_mkclean_one() has been broken, and that no mmu_notifier
calls were added around the locked section.

I sent a patch that properly computes the range to invalidate and moves
to invalidate_range(), but it is lacking the invalidate_range_start()/
end() calls, so I am going to respin it with range_start/end bracketing
and assume the worst for the range of addresses.
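
For try_to_unmap_one() that means something along these lines (sketch
only; assuming the worst case of a THP-sized mapping, clipped to the
vma):

	/* Widest range the rmap walk may touch, computed up front. */
	unsigned long start = address & PAGE_MASK;
	unsigned long end = min(vma->vm_end,
				start + (PAGE_SIZE << compound_order(page)));

	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
	/* ... page_vma_mapped_walk() and pte clearing, under the PT lock ... */
	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);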

Cheers,
Jérôme

2017-08-29 19:38:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Tue, Aug 29, 2017 at 12:13 PM, Jerome Glisse <[email protected]> wrote:
>
> Yes, and I am fine with page traversal being under the spinlock and not
> being able to sleep during that. I agree doing otherwise would be
> insane. It is just that the existing behavior of try_to_unmap_one()
> and page_mkclean_one() has been broken, and that no mmu_notifier
> calls were added around the locked section.

Yeah, I'm actually surprised that ever worked. I'm surprised that
try_to_unmap_one didn't hold any locks earlier.

In fact, I think at least some of them *did* already hold the page
table locks: ptep_clear_flush_young_notify() and friends very much
should have always held them.

So it's literally just that mmu_notifier_invalidate_page() call that
used to be outside all the locks, but honestly, I think that was
always a bug. It means that you got notified of the page removal
*after* the page was already gone and all locks had been released, so
a completely *different* page could already have been mapped to that
address.

So I think the old code was always broken exactly because the callback
wasn't serialized with the actual action.

> I sent a patch that properly computes the range to invalidate and moves
> to invalidate_range(), but it is lacking the invalidate_range_start()/
> end() calls, so I am going to respin it with range_start/end bracketing
> and assume the worst for the range of addresses.

So surrounding it with start/end _should_ make KVM happy.

KVM people, can you confirm?

But I do note that there's a number of other users of that
"invalidate_page" callback.

I think ib_umem_notifier_invalidate_page() has the exact same blocking
issue, but changing to range_start/end should be good there too.

amdgpu_mn_invalidate_page() and the xen/gntdev also seem to be happy
being replaced with start/end.

In fact, I'm wondering if this actually means that we could get rid of
mmu_notifier_invalidate_page() entirely. There's only a couple of
callers, and the other one seems to be fs/dax.c, and it actually seems
to have the exact same issue that the try_to_unmap_one() code had: it
tried to invalidate an address too late - by the time it was called,
the page had already been cleaned and locks had been released.

So the more I look at that "turn mmu_notifier_invalidate_page() into
invalidate_range_start/end()" the more I think that's fundamentally
the right thing to do.
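
For the single-page callers that conversion degenerates to something
like this (sketch only; address assumed page-aligned):

	/* Before: notify after the fact, with all locks dropped. */
	mmu_notifier_invalidate_page(mm, address);

	/* After: a one-page range bracketing the actual unmap. */
	mmu_notifier_invalidate_range_start(mm, address, address + PAGE_SIZE);
	/* ... clear and flush the pte under the PT lock ... */
	mmu_notifier_invalidate_range_end(mm, address, address + PAGE_SIZE);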

Linus

2017-08-29 20:49:07

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: kvm splat in mmu_spte_clear_track_bits

Hello Linus,

On Tue, Aug 29, 2017 at 12:38:43PM -0700, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 12:13 PM, Jerome Glisse <[email protected]> wrote:
> >
> > Yes, and I am fine with page traversal being under the spinlock and not
> > being able to sleep during that. I agree doing otherwise would be
> > insane. It is just that the existing behavior of try_to_unmap_one()
> > and page_mkclean_one() has been broken, and that no mmu_notifier
> > calls were added around the locked section.
>
> Yeah, I'm actually surprised that ever worked. I'm surprised that
> try_to_unmap_one didn't hold any locks earlier.
>
> In fact, I think at least some of them *did* already hold the page
> table locks: ptep_clear_flush_young_notify() and friends very much
> should have always held them.
>
> So it's literally just that mmu_notifier_invalidate_page() call that
> used to be outside all the locks, but honestly, I think that was
> always a bug. It means that you got notified of the page removal
> *after* the page was already gone and all locks had been released, so
> a completely *different* page could already have been mapped to that
> address.
>
> So I think the old code was always broken exactly because the callback
> wasn't serialized with the actual action.

That's a very interesting point indeed.

I thought the old code was safe because even if the secondary MMU view
keeps crunching on the old page for a little while after the primary
MMU got a new page mapped, you cannot measure it.

If you cannot measure it, it doesn't exist? I.e. undefined.

The same end result you'd get by letting guest mode crunch on the old
page for a little while, after the primary MMU is already computing on
a brand new page, could also materialize if the guest-mode CPU simply
got a turbo boost and computed faster, before ->invalidate_page was
run inside the PT lock.

If the page is swapped in and is the same page that was unmapped by
try_to_unmap_one, then the primary MMU will return to using the same
page that the secondary MMU was still using, and it'll even be coherent
(swapin is probably prevented by the page lock, but even if it
happens it doesn't look like an issue).

If the page is suddenly replaced while the vcpu is in guest mode, no
coherency could be provided anyway unless guest mode was serialized
by something other than the PT lock.

I'll keep thinking about it, this is a quick answer.

On a side note, the other major requirement for the old code not to
have always been broken is that the caller must hold a refcount: the
put_page executed by try_to_unmap_one before calling
mmu_notifier_invalidate_page cannot free the page, or it's unsafe. This
requirement also goes away if using range_start/end.

> > I sent a patch that properly computes the range to invalidate and moves
> > to invalidate_range(), but it is lacking the invalidate_range_start()/
> > end() calls, so I am going to respin it with range_start/end bracketing
> > and assume the worst for the range of addresses.
>
> So surrounding it with start/end _should_ make KVM happy.
>
> KVM people, can you confirm?

Yes that always works as far as I can tell.

For our current problem using start/end like you suggested would
definitely make KVM happy and it is obviously safe.

It would also work to teach page_vma_mapped_walk(&pvmw) to restart
after dropping the lock with page_vma_mapped_walk_done(), in order to
call mmu_notifier_invalidate_page outside the lock. Whether such an
alternative would be safe, however, entirely depends on whether it was
always broken before, which again is a very good point to think about.

The common case for such function is a single pte being unmapped.

>
> But I do note that there's a number of other users of that
> "invalidate_page" callback.
>
> I think ib_umem_notifier_invalidate_page() has the exact same blocking
> issue, but changing to range_start/end should be good there too.
>
> amdgpu_mn_invalidate_page() and the xen/gntdev also seem to be happy
> being replaced with start/end.
>
> In fact, I'm wondering if this actually means that we could get rid of
> mmu_notifier_invalidate_page() entirely. There's only a couple of
> callers, and the other one seems to be fs/dax.c, and it actually seems
> to have the exact same issue that the try_to_unmap_one() code had: it
> tried to invalidate an address too late - by the time it was called,
> the page had already been cleaned and locks had been released.
>
> So the more I look at that "turn mmu_notifier_invalidate_page() into
> invalidate_range_start/end()" the more I think that's fundamentally
> the right thing to do.

mmu_notifier_invalidate_page exists purely as an optimized version of
mmu_notifier_invalidate_range_start/end. Furthermore, when it was
always run inside the PT lock it was quite handy to replace a
ptep_clear_flush with a ptep_clear_flush_notify and be done with
it. It also requires a single branch when the MMU notifier is unarmed.

mmu_notifier_invalidate_range_start has to first block all secondary
MMU page faults, and invalidate the secondary MMU _before_ the pages
are freed. mmu_notifier_invalidate_range_end unblocks the secondary
MMU page faults after the primary MMU has been invalidated too
(i.e. after zapping the ptes).
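
In KVM terms that looks roughly like this (simplified sketch of the
virt/kvm/kvm_main.c callbacks; srcu and TLB flush details omitted):

	static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
							    struct mm_struct *mm,
							    unsigned long start,
							    unsigned long end)
	{
		struct kvm *kvm = mmu_notifier_to_kvm(mn);

		spin_lock(&kvm->mmu_lock);
		/* Block secondary MMU page faults for the duration. */
		kvm->mmu_notifier_count++;
		kvm_unmap_hva_range(kvm, start, end);
		spin_unlock(&kvm->mmu_lock);
	}

	static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
							  struct mm_struct *mm,
							  unsigned long start,
							  unsigned long end)
	{
		struct kvm *kvm = mmu_notifier_to_kvm(mn);

		spin_lock(&kvm->mmu_lock);
		/* Bump the sequence so in-flight faults retry, then unblock. */
		kvm->mmu_notifier_seq++;
		kvm->mmu_notifier_count--;
		spin_unlock(&kvm->mmu_lock);
	}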

mmu_notifier_invalidate_page has the advantage that it takes the
secondary MMU KVM srcu and spinlock a single time.

Thanks!
Andrea

2017-08-29 20:59:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Tue, Aug 29, 2017 at 1:49 PM, Andrea Arcangeli <[email protected]> wrote:
>
> mmu_notifier_invalidate_page has the advantage that it takes the
> secondary MMU KVM srcu and spinlock a single time.

Note that that isn't actually all that much of an advantage - it turns
out that a lot of users have "range_start", but not "range_end".

So in a lot of cases, the "range_start/end()" seems to be exactly as
expensive as just the single "page()" call, simply because the code
didn't really need the whole range, it only wanted to make sure it was
invalidating its data before the range got modified.

KVM ends up doing a partial case of that optimization too: it doesn't
do the srcu lock in the end case, for example. It does want to keep
the sequence numbers for the end case, but that's fairly cheap.
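
The sequence number check on the KVM page fault side is roughly
(sketch; error handling and the actual spte installation omitted):

	mmu_seq = kvm->mmu_notifier_seq;
	smp_rmb();
	pfn = gfn_to_pfn(kvm, gfn);	/* may sleep, mmu_lock not held */

	spin_lock(&kvm->mmu_lock);
	if (mmu_notifier_retry(kvm, mmu_seq))
		goto out_unlock;	/* an invalidate ran meanwhile: retry */
	/* ... install the spte ... */
	spin_unlock(&kvm->mmu_lock);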

So I'd much rather have the simpler rules than have duplicated
interfaces for some very dubious performance advantage.

Linus

2017-08-30 08:19:45

by Michal Hocko

[permalink] [raw]
Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Tue 29-08-17 16:09:24, Andrea Arcangeli wrote:
[...]
> The other bug where you can reproduce the same corruption with OOM is
> unrelated and caused by the OOM reaper. OOM reaper was even corrupting
> data if a task was writing to disk and stuck in OOM in write() syscall
> or async io write.
>
> To fix the KVM corruption in the OOM reaper, it needs to call
> mmu_notifier_invalidate_range_start/end around
> oom_kill.c:unmap_page_range. This additional
> mmu_notifier_invalidate_range_start will not be good for the OOM reaper
> because it's yet another case (like taking the mmap_sem for writing)
> that will prevent the OOM reaper from running, hindering its ability
> to hide XFS OOM deadlocks and making those resurface. Not in the KVM
> case, because we use a spinlock to serialize against the secondary MMU
> activity and the KVM critical section under spinlock isn't going to
> allocate memory, but range_start can schedule or block on slow hardware
> where the secondary MMU is accessed through PCI (not the KVM case).

I am not really familiar with mmu notifiers and what they can actually
do. But from what you wrote above it is indeed not very safe to call
them from the oom reaper. So I will prepare and post a patch to disable
the reaper when mm_has_notifiers().
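
Something along these lines, presumably (sketch only; the exact
placement in the reaper path and the bail-out behavior are assumptions):

	/* Skip mms with registered mmu notifiers: the invalidate
	 * callbacks may block, which the oom reaper cannot afford. */
	if (mm_has_notifiers(mm)) {
		up_read(&mm->mmap_sem);
		return false;	/* give up on reaping this mm */
	}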

Thanks for pointing this out.

--
Michal Hocko
SUSE Labs