Date: Tue, 9 Jan 2018 09:24:02 +0800
From: Haozhong Zhang <haozhong.zhang@intel.com>
To: Ross Zwisler <zwisler@gmail.com>
Cc: Wanpeng Li <kernellwp@gmail.com>, kvm@vger.kernel.org,
        Radim Krčmář <rkrcmar@redhat.com>,
        linux-nvdimm <linux-nvdimm@ml01.01.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Paolo Bonzini <pbonzini@redhat.com>,
        Wanpeng Li <wanpeng.li@hotmail.com>
Subject: Re: [PATCH v3 2/4] KVM: X86: Fix loss of exception which has not yet
 injected
Message-ID: <20180109012402.m7x2dygntocb2anx@hz-desktop>
Mail-Followup-To: Ross Zwisler <zwisler@gmail.com>,
        Wanpeng Li <kernellwp@gmail.com>, kvm@vger.kernel.org,
        Radim Krčmář <rkrcmar@redhat.com>,
        linux-nvdimm <linux-nvdimm@ml01.01.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Paolo Bonzini <pbonzini@redhat.com>,
        Wanpeng Li <wanpeng.li@hotmail.com>
References: <1503548506-4457-1-git-send-email-wanpeng.li@hotmail.com>
 <1503548506-4457-2-git-send-email-wanpeng.li@hotmail.com>
 <CAOxpaSUBf8QoOZQ1p4KfUp0jq76OKfGY4Uxs-Gg8ngReD99xww@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <CAOxpaSUBf8QoOZQ1p4KfUp0jq76OKfGY4Uxs-Gg8ngReD99xww@mail.gmail.com>
User-Agent: NeoMutt/20171027
Sender: linux-kernel-owner@vger.kernel.org

On 01/07/18 00:26 -0700, Ross Zwisler wrote:
> On Wed, Aug 23, 2017 at 10:21 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
> > From: Wanpeng Li <wanpeng.li@hotmail.com>
> >
> > vmx_complete_interrupts() assumes that the exception is always injected,
> > so an exception that has not yet been injected would be dropped by
> > kvm_clear_exception_queue(). This patch separates exception.pending from
> > exception.injected. exception.injected indicates that the exception has
> > been injected, or that it must be reinjected because a vmexit occurred
> > during event delivery in VMX non-root operation. exception.pending
> > indicates that the exception is queued; it is cleared when the exception
> > is injected into the guest. Together, exception.pending and
> > exception.injected guarantee that the exception is not lost.
> >
> > Reported-by: Radim Krčmář <rkrcmar@redhat.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
> > ---
> 
> I'm seeing a regression in my QEMU based NVDIMM testing system, and I
> bisected it to this commit.
> 
> The behavior I'm seeing is that heavy I/O to simulated NVDIMMs in
> multiple virtual machines causes the QEMU guests to receive double
> faults, crashing them.  Here's an example backtrace:
> 
> [ 1042.653816] PANIC: double fault, error_code: 0x0
> [ 1042.654398] CPU: 2 PID: 30257 Comm: fsstress Not tainted 4.15.0-rc5 #1
> [ 1042.655169] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> [ 1042.656121] RIP: 0010:memcpy_flushcache+0x4d/0x180
> [ 1042.656631] RSP: 0018:ffffac098c7d3808 EFLAGS: 00010286
> [ 1042.657245] RAX: ffffac0d18ca8000 RBX: 0000000000000fe0 RCX: ffffac0d18ca8000
> [ 1042.658085] RDX: ffff921aaa5df000 RSI: ffff921aaa5e0000 RDI: 000019f26e6c9000
> [ 1042.658802] RBP: 0000000000001000 R08: 0000000000000000 R09: 0000000000000000
> [ 1042.659503] R10: 0000000000000000 R11: 0000000000000000 R12: ffff921aaa5df020
> [ 1042.660306] R13: ffffac0d18ca8000 R14: fffff4c102a977c0 R15: 0000000000001000
> [ 1042.661132] FS:  00007f71530b90c0(0000) GS:ffff921b3b280000(0000) knlGS:0000000000000000
> [ 1042.662051] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1042.662528] CR2: 0000000001156002 CR3: 000000012a936000 CR4: 00000000000006e0
> [ 1042.663093] Call Trace:
> [ 1042.663329]  write_pmem+0x6c/0xa0 [nd_pmem]
> [ 1042.663668]  pmem_do_bvec+0x15f/0x330 [nd_pmem]
> [ 1042.664056]  ? kmem_alloc+0x61/0xe0 [xfs]
> [ 1042.664393]  pmem_make_request+0xdd/0x220 [nd_pmem]
> [ 1042.664781]  generic_make_request+0x11f/0x300
> [ 1042.665135]  ? submit_bio+0x6c/0x140
> [ 1042.665436]  submit_bio+0x6c/0x140
> [ 1042.665754]  ? next_bio+0x18/0x40
> [ 1042.666025]  ? _cond_resched+0x15/0x40
> [ 1042.666341]  submit_bio_wait+0x53/0x80
> [ 1042.666804]  blkdev_issue_zeroout+0xdc/0x210
> [ 1042.667336]  ? __dax_zero_page_range+0xb5/0x140
> [ 1042.667810]  __dax_zero_page_range+0xb5/0x140
> [ 1042.668197]  ? xfs_file_iomap_begin+0x2bd/0x8e0 [xfs]
> [ 1042.668611]  iomap_zero_range_actor+0x7c/0x1b0
> [ 1042.668974]  ? iomap_write_actor+0x170/0x170
> [ 1042.669318]  iomap_apply+0xa4/0x110
> [ 1042.669616]  ? iomap_write_actor+0x170/0x170
> [ 1042.669958]  iomap_zero_range+0x52/0x80
> [ 1042.670255]  ? iomap_write_actor+0x170/0x170
> [ 1042.670616]  xfs_setattr_size+0xd4/0x330 [xfs]
> [ 1042.670995]  xfs_ioc_space+0x27e/0x2f0 [xfs]
> [ 1042.671332]  ? terminate_walk+0x87/0xf0
> [ 1042.671662]  xfs_file_ioctl+0x862/0xa40 [xfs]
> [ 1042.672035]  ? _copy_to_user+0x22/0x30
> [ 1042.672346]  ? cp_new_stat+0x150/0x180
> [ 1042.672663]  do_vfs_ioctl+0xa1/0x610
> [ 1042.672960]  ? SYSC_newfstat+0x3c/0x60
> [ 1042.673264]  SyS_ioctl+0x74/0x80
> [ 1042.673661]  entry_SYSCALL_64_fastpath+0x1a/0x7d
> [ 1042.674239] RIP: 0033:0x7f71525a2dc7
> [ 1042.674681] RSP: 002b:00007ffef97aa778 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 1042.675664] RAX: ffffffffffffffda RBX: 00000000000112bc RCX: 00007f71525a2dc7
> [ 1042.676592] RDX: 00007ffef97aa7a0 RSI: 0000000040305825 RDI: 0000000000000003
> [ 1042.677520] RBP: 0000000000000009 R08: 0000000000000045 R09: 00007ffef97aa78c
> [ 1042.678442] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> [ 1042.679330] R13: 0000000000019e38 R14: 00000000000fcca7 R15: 0000000000000016
> [ 1042.680216] Code: 48 8d 5d e0 4c 8d 62 20 48 89 cf 48 29 d7 48 89 de 48 83 e6 e0 4c 01 e6 48 8d 04 17 4c 8b 02 4c 8b 4a 08 4c 8b 52 10 4c 8b 5a 18 <4c> 0f c3 00 4c 0f c3 48 08 4c 0f c3 50 10 4c 0f c3 58 18 48 83
> 
> This appears to be independent of both the guest kernel version (this
> backtrace has v4.15.0-rc5, but I've seen it with other kernels) and
> the host QEMU version (mine happens to be qemu-2.10.1-2.fc27 in
> Fedora 27).
> 
> The new behavior is due to this commit being present in the host OS
> kernel.  Prior to this commit I could fire up 4 VMs and run xfstests
> on my simulated NVDIMMs, but with this commit applied, several of my
> VMs crash almost immediately under the same testing.
> 
> Reproduction is very simple, at least on my development box.  All you
> need is a pair of VMs (I just did it with clean installs of Fedora
> 27) with NVDIMMs.  Here's a sample QEMU command to get one of these:
> 
> # qemu-system-x86_64 /home/rzwisler/vms/Fedora27.qcow2 \
>     -m 4G,slots=3,maxmem=512G -smp 12 \
>     -machine pc,accel=kvm,nvdimm -enable-kvm \
>     -object memory-backend-file,id=mem1,share,mem-path=/home/rzwisler/nvdimms/nvdimm-1,size=17G \
>     -device nvdimm,memdev=mem1,id=nv1
> 
> In my setup the NVDIMM backing files (/home/rzwisler/nvdimms/nvdimm-1)
> are created on a filesystem on an SSD.
> 
> After these two QEMU guests are up, issue write I/O to the resulting
> /dev/pmem0 devices.  I've triggered the error with xfstests and fio,
> but the simplest way is just:
> 
> # dd if=/dev/zero of=/dev/pmem0
> 
> The double fault should happen in under a minute, definitely before
> the dd processes run out of space on their /dev/pmem0 devices.
> 
> I've reproduced this on multiple development boxes, so I'm pretty sure
> it's not related to a flaky hardware setup.
> 

Thanks for reporting this issue. I'll look into it.

Haozhong