Subject: Re: [BUG] Deadlock due due to interactions of block, RCU, and cpu offline
To: paulmck@linux.vnet.ibm.com
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org, pprakash@codeaurora.org, Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Jens Axboe, Sebastian Andrzej Siewior, Thomas Gleixner, Richard Cochran, Boris Ostrovsky, Richard Weinberger
References: <20170326232843.GA3637@linux.vnet.ibm.com> <20170327181711.GF3637@linux.vnet.ibm.com> <20170620234623.GA16200@linux.vnet.ibm.com> <20170621161853.GB3721@linux.vnet.ibm.com> <20170623033456.GA15959@linux.vnet.ibm.com> <20170628001130.GB3721@linux.vnet.ibm.com> <20170630001855.GL2393@linux.vnet.ibm.com>
From: Jeffrey Hugo
Date: Sun, 20 Aug 2017 13:31:01 -0600
In-Reply-To: <20170630001855.GL2393@linux.vnet.ibm.com>

On 6/29/2017 6:18 PM, Paul E. McKenney wrote:
> On Thu, Jun 29, 2017 at 10:29:12AM -0600, Jeffrey Hugo wrote:
>> On 6/27/2017 6:11 PM, Paul E. McKenney wrote:
>>> On Tue, Jun 27, 2017 at 04:32:09PM -0600, Jeffrey Hugo wrote:
>>>> On 6/22/2017 9:34 PM, Paul E. McKenney wrote:
>>>>> On Wed, Jun 21, 2017 at 09:18:53AM -0700, Paul E. McKenney wrote:
>>>>>> No worries, and I am very much looking forward to seeing the results of
>>>>>> your testing.
>>>>>
>>>>> And please see below for an updated patch based on LKML review and
>>>>> more intensive testing.
>>>>>
>>>>
>>>> I spent some time on this today. It didn't go as I expected. I
>>>> validated the issue is reproducible as before on 4.11 and 4.12 rcs 1
>>>> through 4. However, the version of stress-ng that I was using ran
>>>> into constant errors starting with rc5, making it nearly impossible
>>>> to make progress toward reproduction. Upgrading stress-ng to tip
>>>> fixes the issue, however, I've still been unable to repro the issue.
>>>>
>>>> Its my unfounded suspicion that something went in between rc4 and
>>>> rc5 which changed the timing, and didn't actually fix the issue. I
>>>> will run the test overnight for 5 hours to try to repro.
>>>>
>>>> The patch you sent appears to be based on linux-next, and appears to
>>>> have a number of dependencies which prevent it from cleanly applying
>>>> on anything current that I'm able to repro on at this time. Do you
>>>> want to provide a rebased version of the patch which applies to say
>>>> 4.11? I could easily test that and report back.
>>>
>>> Here is a very lightly tested backport to v4.11.
>>>
>>
>> Works for me. Always reproduced the lockup within 2 minutes on stock
>> 4.11. With the change applied, I was able to test for 2 hours in
>> the same conditions, and 4 hours with the full system and not
>> encounter an issue.
>>
>> Feel free to add:
>> Tested-by: Jeffrey Hugo
>
> Applied, thank you!
>
>> I'm going to go back to 4.12-rc5 and see if I can get either repro
>> the issue, or identify what changed. Hopefully I can get to
>> linux-next and double check the original version of the change as
>> well.
>
> Looking forward to hearing what you find!
>
> Thanx, Paul
>

According to git bisect, the following is what "changed":

commit 9d0eb4624601ac978b9e89be4aeadbd51ab2c830
Merge: 5faab9e 9bc1f09
Author: Linus Torvalds
Date:   Sun Jun 11 11:07:25 2017 -0700

    Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

    Pull KVM fixes from Paolo Bonzini:
     "Bug fixes (ARM, s390, x86)"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
      KVM: async_pf: avoid async pf injection when in guest mode
      KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulation
      arm: KVM: Allow unaligned accesses at HYP
      arm64: KVM: Allow unaligned accesses at EL2
      arm64: KVM: Preserve RES1 bits in SCTLR_EL2
      KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages
      KVM: nVMX: Fix exception injection
      kvm: async_pf: fix rcu_irq_enter() with irqs enabled
      KVM: arm/arm64: vgic-v3: Fix nr_pre_bits bitfield extraction
      KVM: s390: fix ais handling vs cpu model
      KVM: arm/arm64: Fix isues with GICv2 on GICv3 migration

Nothing really stands out to me which would "fix" the issue.

-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a
Linux Foundation Collaborative Project.