2021-09-05 20:41:31

by syzbot

Subject: [syzbot] WARNING: kmalloc bug in memslot_rmap_alloc

Hello,

syzbot found the following issue on:

HEAD commit: f1583cb1be35 Merge tag 'linux-kselftest-next-5.15-rc1' of ..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=11dd6315300000
kernel config: https://syzkaller.appspot.com/x/.config?x=9c582b69de20dde2
dashboard link: https://syzkaller.appspot.com/bug?extid=e0de2333cbf95ea473e8
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.1
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15db7e5d300000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=170e66cd300000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]

L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 8419 at mm/util.c:597 kvmalloc_node+0x111/0x120 mm/util.c:597
Modules linked in:
CPU: 0 PID: 8419 Comm: syz-executor520 Not tainted 5.14.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:kvmalloc_node+0x111/0x120 mm/util.c:597
Code: 01 00 00 00 4c 89 e7 e8 ed 17 0d 00 49 89 c5 e9 69 ff ff ff e8 90 0a d1 ff 41 89 ed 41 81 cd 00 20 01 00 eb 95 e8 7f 0a d1 ff <0f> 0b e9 4c ff ff ff 0f 1f 84 00 00 00 00 00 55 48 89 fd 53 e8 66
RSP: 0018:ffffc90001a7f828 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff888027ee5580 RSI: ffffffff81a51341 RDI: 0000000000000003
RBP: 0000000000400dc0 R08: 000000007fffffff R09: 00000000ffffffff
R10: ffffffff81a512fe R11: 0000000000000000 R12: 0000000380000000
R13: 0000000000000000 R14: 00000000ffffffff R15: dffffc0000000000
FS: 0000000000707300(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007faeea03f6c0 CR3: 0000000074a57000 CR4: 00000000001526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
kvmalloc include/linux/mm.h:806 [inline]
kvmalloc_array include/linux/mm.h:824 [inline]
kvcalloc include/linux/mm.h:829 [inline]
memslot_rmap_alloc+0xf6/0x310 arch/x86/kvm/x86.c:11320
kvm_alloc_memslot_metadata arch/x86/kvm/x86.c:11388 [inline]
kvm_arch_prepare_memory_region+0x48d/0x610 arch/x86/kvm/x86.c:11462
kvm_set_memslot+0xfe/0x1700 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1505
__kvm_set_memory_region+0x761/0x10e0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1668
kvm_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1689 [inline]
kvm_vm_ioctl_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1701 [inline]
kvm_vm_ioctl+0x4c6/0x2330 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4236
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x43ee99
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc276d5138 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000400488 RCX: 000000000043ee99
RDX: 00000000200005c0 RSI: 000000004020ae46 RDI: 0000000000000004
RBP: 0000000000402e80 R08: 0000000000400488 R09: 0000000000400488
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402f10
R13: 0000000000000000 R14: 00000000004ac018 R15: 0000000000400488


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches


2021-09-07 17:49:09

by Sean Christopherson

Subject: Re: [syzbot] WARNING: kmalloc bug in memslot_rmap_alloc

+Linus and Ben

On Sun, Sep 05, 2021, syzbot wrote:
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 8419 at mm/util.c:597 kvmalloc_node+0x111/0x120 mm/util.c:597
> Modules linked in:
> CPU: 0 PID: 8419 Comm: syz-executor520 Not tainted 5.14.0-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:kvmalloc_node+0x111/0x120 mm/util.c:597

...

> Call Trace:
> kvmalloc include/linux/mm.h:806 [inline]
> kvmalloc_array include/linux/mm.h:824 [inline]
> kvcalloc include/linux/mm.h:829 [inline]
> memslot_rmap_alloc+0xf6/0x310 arch/x86/kvm/x86.c:11320
> kvm_alloc_memslot_metadata arch/x86/kvm/x86.c:11388 [inline]
> kvm_arch_prepare_memory_region+0x48d/0x610 arch/x86/kvm/x86.c:11462
> kvm_set_memslot+0xfe/0x1700 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1505
> __kvm_set_memory_region+0x761/0x10e0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1668
> kvm_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1689 [inline]
> kvm_vm_ioctl_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1701 [inline]
> kvm_vm_ioctl+0x4c6/0x2330 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4236

KVM is tripping the WARN_ON_ONCE(size > INT_MAX) added in commit 7661809d493b
("mm: don't allow oversized kvmalloc() calls"). The allocation size is absurd and
doomed to fail in this particular configuration (syzkaller is just throwing garbage
at KVM), but for humongous virtual machines it's feasible that KVM could run afoul
of the sanity check for an otherwise legitimate allocation.
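
For the curious, the check that fires in kvmalloc_node() is:

	/* Don't even allow crazy sizes */
	if (WARN_ON_ONCE(size > INT_MAX))
		return NULL;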

The allocation in question is for KVM's "rmap" to translate a guest pfn to a host
virtual address. The rmap needs an unsigned long per 4KiB page in a memslot, i.e.
on x86-64, 8 bytes per 4096 bytes of guest memory in a memslot. With
INT_MAX=0x7fffffff, KVM will trip the WARN and fail rmap allocations for memslots
>= 1TiB, and Google already has VMs that create 1.5TiB memslots (12TiB of total
guest memory spread across 8 virtual NUMA nodes).
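
Back of the envelope, for a 1TiB memslot that works out to:

	(1ULL << 40) / 4096 * sizeof(unsigned long) == 2^28 * 8 == 2^31 bytes

which is just past INT_MAX (0x7fffffff), so the WARN fires.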

One caveat is that KVM's newfangled "TDP MMU" was designed specifically to avoid
the rmap allocation (among other things), precisely because of its scalability
issues. I.e. it's unlikely that KVM's so-called "legacy MMU", which relies on the
rmaps, would be used for such large VMs. However, KVM's legacy MMU is still the
only option for shadowing nested EPT/NPT, i.e. the rmap allocation would be
problematic if/when nested virtualization is enabled in large VMs.

KVM also has other allocations based on memslot size that are _not_ avoided by
KVM's TDP MMU and may eventually be problematic, though presumably not for quite
some time as it would require petabyte-sized memslots. E.g. a different metadata
array requires 4 bytes per 2MiB of guest memory.
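
For that array the threshold is a long way off:

	INT_MAX / 4 * (2 << 20) ~= 2^29 * 2^21 = 2^50 bytes

i.e. a single memslot would have to approach 1PiB before that WARN fired.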

I don't have any clever ideas to handle this from the KVM side, at least not in the
short term. Long term, I think it would be doable to reduce the rmap size for large
memslots by 512x, but any change of that nature would be very invasive to KVM and
be fairly risky. It also wouldn't prevent syzkaller from triggering this WARN at will.

2021-09-07 19:24:29

by Ben Gardon

Subject: Re: [syzbot] WARNING: kmalloc bug in memslot_rmap_alloc

On Tue, Sep 7, 2021 at 10:30 AM Sean Christopherson <[email protected]> wrote:
>
> +Linus and Ben
>
> On Sun, Sep 05, 2021, syzbot wrote:
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 8419 at mm/util.c:597 kvmalloc_node+0x111/0x120 mm/util.c:597
> > Modules linked in:
> > CPU: 0 PID: 8419 Comm: syz-executor520 Not tainted 5.14.0-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > RIP: 0010:kvmalloc_node+0x111/0x120 mm/util.c:597
>
> ...
>
> > Call Trace:
> > kvmalloc include/linux/mm.h:806 [inline]
> > kvmalloc_array include/linux/mm.h:824 [inline]
> > kvcalloc include/linux/mm.h:829 [inline]
> > memslot_rmap_alloc+0xf6/0x310 arch/x86/kvm/x86.c:11320
> > kvm_alloc_memslot_metadata arch/x86/kvm/x86.c:11388 [inline]
> > kvm_arch_prepare_memory_region+0x48d/0x610 arch/x86/kvm/x86.c:11462
> > kvm_set_memslot+0xfe/0x1700 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1505
> > __kvm_set_memory_region+0x761/0x10e0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1668
> > kvm_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1689 [inline]
> > kvm_vm_ioctl_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1701 [inline]
> > kvm_vm_ioctl+0x4c6/0x2330 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4236
>
> KVM is tripping the WARN_ON_ONCE(size > INT_MAX) added in commit 7661809d493b
> ("mm: don't allow oversized kvmalloc() calls"). The allocation size is absurd and
> doomed to fail in this particular configuration (syzkaller is just throwing garbage
> at KVM), but for humongous virtual machines it's feasible that KVM could run afoul
> of the sanity check for an otherwise legitimate allocation.
>
> The allocation in question is for KVM's "rmap" to translate a guest pfn to a host
> virtual address. The rmap needs an unsigned long per 4KiB page in a memslot, i.e.
> on x86-64, 8 bytes per 4096 bytes of guest memory in a memslot. With
> INT_MAX=0x7fffffff, KVM will trip the WARN and fail rmap allocations for memslots
> >= 1TiB, and Google already has VMs that create 1.5TiB memslots (12TiB of total
> guest memory spread across 8 virtual NUMA nodes).
>
> One caveat is that KVM's newfangled "TDP MMU" was designed specifically to avoid
> the rmap allocation (among other things), precisely because of its scalability
> issues. I.e. it's unlikely that KVM's so-called "legacy MMU", which relies on the
> rmaps, would be used for such large VMs. However, KVM's legacy MMU is still the
> only option for shadowing nested EPT/NPT, i.e. the rmap allocation would be
> problematic if/when nested virtualization is enabled in large VMs.
>
> KVM also has other allocations based on memslot size that are _not_ avoided by
> KVM's TDP MMU and may eventually be problematic, though presumably not for quite
> some time as it would require petabyte-sized memslots. E.g. a different metadata
> array requires 4 bytes per 2MiB of guest memory.

KVM's dirty bitmap requires 1 bit per 4KiB page, so we'd hit this limit even
sooner, with 64TiB memslots.
Still, that can be avoided with Peter Xu's dirty ring, and we're a ways away
from 64TiB memslots.
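
For the bitmap the math is:

	INT_MAX * 8 * 4096 ~= 2^31 * 2^15 = 2^46 bytes

i.e. the bitmap allocation itself crosses INT_MAX at a 64TiB memslot.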

>
> I don't have any clever ideas to handle this from the KVM side, at least not in the
> short term. Long term, I think it would be doable to reduce the rmap size for large
> memslots by 512x, but any change of that nature would be very invasive to KVM and
> be fairly risky. It also wouldn't prevent syzkaller from triggering this WARN at will.

Not the most elegant solution, but KVM could, and perhaps should,
impose a maximum memslot size. KVM operations that act on an entire
memslot (e.g. dirty logging) can take a very long time with terabyte
memslots. Forcing userspace to handle memory in units of a more
reasonable size could be a good limitation to impose sooner rather
than later, while there are few users (if any outside Google) of
these massive memslots.
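
As a strawman, the check could be as simple as the below in
__kvm_set_memory_region(). KVM_MAX_MEMSLOT_SIZE is hypothetical, and the
actual value would need some bikeshedding:

	/*
	 * Hypothetical: cap individual memslots so that per-slot metadata
	 * allocations (rmaps, dirty bitmap, etc...) stay well under INT_MAX.
	 */
	#define KVM_MAX_MEMSLOT_SIZE	(1ULL << 39)	/* 512GiB, strawman */

	if (mem->memory_size > KVM_MAX_MEMSLOT_SIZE)
		return -EINVAL;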

2021-09-08 05:37:37

by Paolo Bonzini

Subject: Re: [syzbot] WARNING: kmalloc bug in memslot_rmap_alloc

On 07/09/21 19:30, Sean Christopherson wrote:
> The allocation in question is for KVM's "rmap" to translate a guest pfn to a host
> virtual address. The rmap needs an unsigned long per 4KiB page in a memslot, i.e.
> on x86-64, 8 bytes per 4096 bytes of guest memory in a memslot. With
> INT_MAX=0x7fffffff, KVM will trip the WARN and fail rmap allocations for memslots
> >= 1TiB, and Google already has VMs that create 1.5TiB memslots (12TiB of total
> guest memory spread across 8 virtual NUMA nodes).

We can just use vmalloc. The warning was only added to kvmalloc(), and
vmalloc suits the KVM rmap just fine.
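
E.g. something along these lines in memslot_rmap_alloc() (untested sketch;
__vmalloc() with __GFP_ZERO keeps the zeroing and GFP_KERNEL_ACCOUNT
accounting that kvcalloc() provided, and array_size() saturates to SIZE_MAX
on overflow so the multiply stays safe):

	-	slot->arch.rmap[i] = kvcalloc(lpages, sz, GFP_KERNEL_ACCOUNT);
	+	slot->arch.rmap[i] = __vmalloc(array_size(lpages, sz),
	+				       GFP_KERNEL_ACCOUNT | __GFP_ZERO);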

The maximum that Red Hat has tested, as far as I know, is about 4TiB
(and it was back when there was no support for virtual NUMA nodes in
QEMU, so it was all in a single memslot).

Paolo