Date: Sun, 20 Aug 2017 13:56:58 -0700
From: "Paul E. McKenney"
To: Jeffrey Hugo
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	pprakash@codeaurora.org, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Jens Axboe,
	Sebastian Andrzej Siewior, Thomas Gleixner, Richard Cochran,
	Boris Ostrovsky, Richard Weinberger, Paolo Bonzini
Subject: Re: [BUG] Deadlock due due to interactions of block, RCU, and cpu offline
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170327181711.GF3637@linux.vnet.ibm.com>
	<20170620234623.GA16200@linux.vnet.ibm.com>
	<20170621161853.GB3721@linux.vnet.ibm.com>
	<20170623033456.GA15959@linux.vnet.ibm.com>
	<20170628001130.GB3721@linux.vnet.ibm.com>
	<20170630001855.GL2393@linux.vnet.ibm.com>
Message-Id: <20170820205658.GS11320@linux.vnet.ibm.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Aug 20, 2017 at 01:31:01PM -0600, Jeffrey Hugo wrote:
> On 6/29/2017 6:18 PM, Paul E. McKenney wrote:
> >On Thu, Jun 29, 2017 at 10:29:12AM -0600, Jeffrey Hugo wrote:
> >>On 6/27/2017 6:11 PM, Paul E. McKenney wrote:
> >>>On Tue, Jun 27, 2017 at 04:32:09PM -0600, Jeffrey Hugo wrote:
> >>>>On 6/22/2017 9:34 PM, Paul E. McKenney wrote:
> >>>>>On Wed, Jun 21, 2017 at 09:18:53AM -0700, Paul E. McKenney wrote:
> >>>>>>No worries, and I am very much looking forward to seeing the results of
> >>>>>>your testing.
> >>>>>
> >>>>>And please see below for an updated patch based on LKML review and
> >>>>>more intensive testing.
> >>>>>
> >>>>
> >>>>I spent some time on this today. It didn't go as I expected. I
> >>>>validated the issue is reproducible as before on 4.11 and 4.12 rcs 1
> >>>>through 4. However, the version of stress-ng that I was using ran
> >>>>into constant errors starting with rc5, making it nearly impossible
> >>>>to make progress toward reproduction. Upgrading stress-ng to tip
> >>>>fixes the issue, however, I've still been unable to repro the issue.
> >>>>
> >>>>Its my unfounded suspicion that something went in between rc4 and
> >>>>rc5 which changed the timing, and didn't actually fix the issue. I
> >>>>will run the test overnight for 5 hours to try to repro.
> >>>>
> >>>>The patch you sent appears to be based on linux-next, and appears to
> >>>>have a number of dependencies which prevent it from cleanly applying
> >>>>on anything current that I'm able to repro on at this time. Do you
> >>>>want to provide a rebased version of the patch which applies to say
> >>>>4.11? I could easily test that and report back.
> >>>
> >>>Here is a very lightly tested backport to v4.11.
> >>>
> >>
> >>Works for me. Always reproduced the lockup within 2 minutes on stock
> >>4.11. With the change applied, I was able to test for 2 hours in
> >>the same conditions, and 4 hours with the full system and not
> >>encounter an issue.
> >>
> >>Feel free to add:
> >>Tested-by: Jeffrey Hugo
> >
> >Applied, thank you!
> >
> >>I'm going to go back to 4.12-rc5 and see if I can get either repro
> >>the issue, or identify what changed. Hopefully I can get to
> >>linux-next and double check the original version of the change as
> >>well.
> >
> >Looking forward to hearing what you find!
> >
> >							Thanx, Paul
> >
>
> According to git bisect, the following is what "changed"
>
> commit 9d0eb4624601ac978b9e89be4aeadbd51ab2c830
> Merge: 5faab9e 9bc1f09
> Author: Linus Torvalds
> Date:   Sun Jun 11 11:07:25 2017 -0700
>
>     Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
>
>     Pull KVM fixes from Paolo Bonzini:
>      "Bug fixes (ARM, s390, x86)"
>
>     * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
>       KVM: async_pf: avoid async pf injection when in guest mode
>       KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulation
>       arm: KVM: Allow unaligned accesses at HYP
>       arm64: KVM: Allow unaligned accesses at EL2
>       arm64: KVM: Preserve RES1 bits in SCTLR_EL2
>       KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages
>       KVM: nVMX: Fix exception injection
>       kvm: async_pf: fix rcu_irq_enter() with irqs enabled
>       KVM: arm/arm64: vgic-v3: Fix nr_pre_bits bitfield extraction
>       KVM: s390: fix ais handling vs cpu model
>       KVM: arm/arm64: Fix isues with GICv2 on GICv3 migration
>
> Nothing really stands out to me which would "fix" the issue.

My guess would be an undo of the change that provoked the problem
in the first place. Did you try bisecting within the above group of
commits?

Either way, CCing Paolo for his thoughts?

							Thanx, Paul