Date: Wed, 29 Nov 2017 16:42:16 -0200
From: Eduardo Habkost
To: Paolo Bonzini
Cc: Wanpeng Li, "linux-kernel@vger.kernel.org", kvm, yfu@redhat.com
Subject: Re: [PATCH] KVM: x86: inject exceptions produced by x86_decode_insn
Message-ID: <20171129184216.GC3037@localhost.localdomain>
References: <1510307378-97452-1-git-send-email-pbonzini@redhat.com>
 <4ff4d2f3-439b-2a8f-ef89-b2a1984e809d@redhat.com>
 <20171129114411.GA16634@localhost.localdomain>
 <4a61fa0a-a4ca-4c06-63c9-2b940eac2601@redhat.com>
In-Reply-To: <4a61fa0a-a4ca-4c06-63c9-2b940eac2601@redhat.com>
User-Agent: Mutt/1.9.1 (2017-09-22)
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 29, 2017 at 12:44:42PM +0100, Paolo Bonzini wrote:
> On 29/11/2017 12:44, Eduardo Habkost wrote:
> > On Mon, Nov 13, 2017 at 09:32:09AM +0100, Paolo Bonzini wrote:
> >> On 13/11/2017 08:15, Wanpeng Li wrote:
> >>> 2017-11-10 17:49 GMT+08:00 Paolo Bonzini:
> >>>> Sometimes, a processor might execute an instruction while another
> >>>> processor is updating the page tables for that instruction's code page,
> >>>> but before the TLB shootdown completes. The interesting case happens
> >>>> if the page is in the TLB.
> >>>>
> >>>> In general, the processor will succeed in executing the instruction and
> >>>> nothing bad happens. However, what if the instruction is an MMIO access?
> >>>> If *that* happens, KVM invokes the emulator, and the emulator gets the
> >>>> updated page tables. If the update side had marked the code page as non
> >>>> present, the page table walk then will fail and so will x86_decode_insn.
> >>>>
> >>>> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> >>>> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> >>>> a fatal error if the instruction cannot simply be reexecuted (as is the
> >>>> case for MMIO). And this in fact happened sometimes when rebooting
> >>>> Windows 2012r2 guests. Just checking ctxt->have_exception and injecting
> >>>> the exception if true is enough to fix the case.
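
To make the control flow in the quoted commit message concrete, here is
a toy model of the decision the caller has to make after a failed
decode. It is only a sketch: every name in it (after_decode, ctxt_model,
the STEP_* values) is hypothetical and none of this is the kernel's
code. The ordering of the checks is also an assumption made for the
illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the kernel's types. */
enum decode_status { DECODE_OK, DECODE_FAILED };

enum next_step {
	STEP_EMULATE,          /* decode worked, go on to emulate */
	STEP_INJECT_EXCEPTION, /* deliver the pending fault to the guest */
	STEP_REEXECUTE,        /* let the guest retry the instruction */
	STEP_INTERNAL_ERROR,   /* report an emulation failure to userspace */
};

struct ctxt_model {
	bool have_exception;   /* models ctxt->have_exception */
};

/*
 * Decide what to do after the decode step.  The order of the checks
 * (pending exception first, then re-execution) is an assumption of
 * this sketch.
 */
static enum next_step after_decode(enum decode_status status,
				   const struct ctxt_model *ctxt,
				   bool can_reexecute)
{
	if (status == DECODE_OK)
		return STEP_EMULATE;
	if (ctxt->have_exception)
		return STEP_INJECT_EXCEPTION;
	if (can_reexecute)
		return STEP_REEXECUTE;
	return STEP_INTERNAL_ERROR;
}
```

In this model the MMIO case corresponds to can_reexecute == false: the
result is STEP_INJECT_EXCEPTION only if the decode step actually
recorded the fault in have_exception, and STEP_INTERNAL_ERROR otherwise.
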
> >>>
> >>> I found the only place which can set ctxt->have_exception is in the
> >>> function x86_emulate_insn(), and x86_decode_insn() will not set
> >>> ctxt->have_exception even if kvm_fetch_guest_virt() returns
> >>> X86EMUL_PROPAGATE_FAULT.
> >>
> >> Hmm, you're right. Looks like Yanan has been (un)lucky when trying out
> >> this patch! :(
> >>
> >> Yanan, can you double check that you can reproduce the issue with an
> >> unpatched kernel? I will work on a kvm-unit-tests testcase
> >
> > We don't have a kvm-unit-tests reproducer for this yet, right?
> >
> > I'm considering trying to write one, but I don't want to
> > duplicate work.
>
> No, I haven't written one yet.

The reproducer (not a full test case) is quite simple, see patch
below.

Now, I've noticed something interesting when running the
reproducer:

If the test_fetch_failure() call happens before we touch
pci-testdev through *mem (like in the patch below), we get an
emulation failure like the one Yanan saw:

  $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.RCPjppRp8i
  enabling apic
  paging enabled
  cr0 = 80010011
  cr3 = 45e000
  cr4 = 20
  KVM internal error. Suberror: 1
  emulation failure
  RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
  RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
  R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
  R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
  RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
  CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
  SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
  DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
  FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
  GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS [-WA]
  LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
  TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
  GDT= 000000000041100a 0000047f
  IDT= 0000000000000000 00000fff
  CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
  DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
  DR6=00000000ffff0ff0 DR7=0000000000000400
  EFER=0000000000000500
  Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??

but if I call test_fetch_failure() after touching *mem, like
this:

  diff --git a/x86/emulator.c b/x86/emulator.c
  index 977ec75..72cb035 100644
  --- a/x86/emulator.c
  +++ b/x86/emulator.c
  @@ -1124,7 +1124,6 @@ int main()
   	alt_insn_page = alloc_page();
   	insn_ram = vmap(virt_to_phys(insn_page), 4096);
   
  -	test_fetch_failure(mem, alt_insn_page);
   
   	// test mov reg, r/m and mov r/m, reg
   	t1 = 0x123456789abcdef;
  @@ -1135,6 +1134,8 @@ int main()
   		     : "memory");
   	report("mov reg, r/m (1)", t2 == 0x123456789abcdef);
   
  +	test_fetch_failure(mem, alt_insn_page);
  +
   	test_simplealu(mem);
   	test_cmps(mem);
   	test_scas(mem);

then I get a KVM_INTERNAL_ERROR_DELIVERY_EV:

  $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.lmXZa46TEA
  enabling apic
  paging enabled
  cr0 = 80010011
  cr3 = 45e000
  cr4 = 20
  PASS: mov reg, r/m (1)
  KVM internal error. Suberror: 3
  extra data[0]: 80000b0e
  extra data[1]: 31
  extra data[2]: 182
  extra data[3]: ff000ff8
  RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
  RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
  R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
  R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
  RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
  CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
  SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
  DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
  FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
  GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS [-WA]
  LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
  TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
  GDT= 000000000041100a 0000047f
  IDT= 0000000000000000 00000fff
  CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
  DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
  DR6=00000000ffff0ff0 DR7=0000000000000400
  EFER=0000000000000500
  Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
  ^C

Also, if I run the reproducer using ept=0, it gets stuck in a
loop re-entering the same "in (%dx),%al" instruction over and
over again. trace-cmd report output:

  qemu-system-x86-18185 [001] 1057573.830491: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
  qemu-system-x86-18185 [001] 1057573.830494: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
  qemu-system-x86-18185 [001] 1057573.830503: kvm_entry: vcpu 0
  qemu-system-x86-18185 [001] 1057573.830504: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
  qemu-system-x86-18185 [001] 1057573.830505: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
  qemu-system-x86-18185 [001] 1057573.830506: kvm_entry: vcpu 0
  qemu-system-x86-18185 [001] 1057573.830507: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
  qemu-system-x86-18185 [001] 1057573.830508: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
  qemu-system-x86-18185 [001] 1057573.830509: kvm_entry: vcpu 0
  qemu-system-x86-18185 [001] 1057573.830510: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
  qemu-system-x86-18185 [001] 1057573.830511: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
  qemu-system-x86-18185 [001] 1057573.830511: kvm_entry: vcpu 0
  qemu-system-x86-18185 [001] 1057573.830512: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
  qemu-system-x86-18185 [001] 1057573.830513: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
  qemu-system-x86-18185 [001] 1057573.830514: kvm_entry: vcpu 0
  qemu-system-x86-18185 [001] 1057573.830514: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
  qemu-system-x86-18185 [001] 1057573.830515: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
  qemu-system-x86-18185 [001] 1057573.830516: kvm_entry: vcpu 0
  qemu-system-x86-18185 [001] 1057573.830517: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
  qemu-system-x86-18185 [001] 1057573.830518: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
  qemu-system-x86-18185 [001] 1057573.830519: kvm_entry: vcpu 0
  qemu-system-x86-18185 [001] 1057573.830521: kvm_exit: reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
  qemu-system-x86-18185 [001] 1057573.830522: kvm_emulate_insn: 0:ffffffffffffc08a: 4d 89 2c 24
  qemu-system-x86-18185 [001] 1057573.830523: kvm_entry: vcpu 0
  [...]

Signed-off-by: Eduardo Habkost
---
 x86/emulator.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/x86/emulator.c b/x86/emulator.c
index e6f27cc..977ec75 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -792,9 +792,11 @@ static void trap_emulator(uint64_t *mem, void *alt_insn_page,
 	extern u8 insn_page[], test_insn[];
 
 	insn_ram = vmap(virt_to_phys(insn_page), 4096);
-	memcpy(alt_insn_page, insn_page, 4096);
-	memcpy(alt_insn_page + (test_insn - insn_page),
-			(void *)(alt_insn->ptr), alt_insn->len);
+	if (alt_insn_page) {
+		memcpy(alt_insn_page, insn_page, 4096);
+		memcpy(alt_insn_page + (test_insn - insn_page),
+				(void *)(alt_insn->ptr), alt_insn->len);
+	}
 	save = inregs;
 
 	/* Load the code TLB with insn_page, but point the page tables at
@@ -805,7 +807,11 @@ static void trap_emulator(uint64_t *mem, void *alt_insn_page,
 	invlpg(insn_ram);
 	/* Load code TLB */
 	asm volatile("call *%0" : : "r"(insn_ram));
-	install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
+	if (alt_insn_page) {
+		install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
+	} else {
+		install_pte(cr3, 1, insn_ram, PT_USER_MASK, 0);
+	}
 	/* Trap, let hypervisor emulate at alt_insn_page */
 	asm volatile("call *%0": : "r"(insn_ram+1));
 
@@ -1096,6 +1102,11 @@ static void test_illegal_movbe(void)
 	handle_exception(UD_VECTOR, 0);
 }
 
+static void test_fetch_failure(void *mem, void *alt_insn_page)
+{
+	trap_emulator(mem, NULL, NULL);
+}
+
 int main()
 {
 	void *mem;
@@ -1113,6 +1124,8 @@ int main()
 	alt_insn_page = alloc_page();
 	insn_ram = vmap(virt_to_phys(insn_page), 4096);
 
+	test_fetch_failure(mem, alt_insn_page);
+
 	// test mov reg, r/m and mov r/m, reg
 	t1 = 0x123456789abcdef;
 	asm volatile("mov %[t1], (%[mem]) \n\t"
-- 
2.13.6

-- 
Eduardo
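
PS: for anyone reading the "Suberror: 3" dump above, the value in
extra data[0] can be picked apart by hand. The sketch below assumes
that extra data[0] is the VMX IDT-vectoring information field (vector
in bits 7:0, type in bits 10:8, error-code-valid in bit 11, valid in
bit 31); that assumption is mine and is not stated anywhere in this
thread. The struct and function names are hypothetical helpers, not
KVM or QEMU code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type for this sketch; not a KVM or QEMU structure. */
struct vectoring_info {
	bool valid;          /* bit 31: the field holds a real event */
	bool has_error_code; /* bit 11: an error code accompanies the event */
	unsigned type;       /* bits 10:8: 3 means hardware exception */
	unsigned vector;     /* bits 7:0: 14 is #PF */
};

/* Split a raw 32-bit value into the fields assumed above. */
static struct vectoring_info decode_vectoring_info(uint32_t raw)
{
	struct vectoring_info v;

	v.valid = (raw >> 31) & 1;
	v.has_error_code = (raw >> 11) & 1;
	v.type = (raw >> 8) & 0x7;
	v.vector = raw & 0xff;
	return v;
}
```

Under that assumption, decode_vectoring_info(0x80000b0e) gives a valid
entry with vector 14 (#PF), type 3 (hardware exception) and an error
code, which would be consistent with a page fault being delivered at
the time of the exit.
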