Date: Mon, 27 Mar 2017 11:17:11 -0700
From: "Paul E. McKenney"
To: Jeffrey Hugo
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	pprakash@codeaurora.org, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Jens Axboe,
	Sebastian Andrzej Siewior, Thomas Gleixner, Richard Cochran,
	Boris Ostrovsky, Richard Weinberger
Subject: Re: [BUG] Deadlock due due to interactions of block, RCU, and cpu offline
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170326232843.GA3637@linux.vnet.ibm.com>
Message-Id: <20170327181711.GF3637@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Mar 27, 2017 at 12:02:27PM -0600, Jeffrey Hugo wrote:
> Hi Paul.
>
> Thanks for the quick reply.
>
> On 3/26/2017 5:28 PM, Paul E. McKenney wrote:
> >On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
>
> >>It is a race between this work running, and the cpu offline processing.
> >
> >One quick way to test this assumption is to build a kernel with Kconfig
> >options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y.  This will
> >cause call_rcu_sched() to queue the work to a kthread, which can migrate
> >to some other CPU.  If your analysis is correct, this should avoid
> >the deadlock.  (Note that the deadlock should be fixed in any case,
> >just a diagnostic assumption-check procedure.)
>
> I enabled CONFIG_RCU_EXPERT=y, CONFIG_RCU_NOCB_CPU=y,
> CONFIG_RCU_NOCB_CPU_ALL=y in my build.  I've only had time so far to
> do one test run however the issue reproduced, but it took a fair bit
> longer to do so.  An initial look at the data indicates that the
> work is still not running.  An odd observation, the two threads are
> no longer blocked on the same queue, but different ones.

I was afraid of that...

> Let me look at this more and see what is going on now.

Another thing to try would be to affinity the "rcuo" kthreads to
some CPU that is never taken offline, just in case that kthread is
sometimes somehow getting stuck during the CPU-hotplug operation.

> >>What is the opinion of the domain experts?
> >
> >I do hope that we can come up with a better fix.  No offense intended,
> >as coming up with -any- fix in the CPU-hotplug domain is not to be
> >denigrated, but this looks to be at best quite fragile.
> >
> >							Thanx, Paul
> >
>
> None taken.  I'm not particularly attached to the current fix.  I
> agree, it does appear to be quite fragile.
>
> I'm still not sure what a better solution would be though.
> Maybe the RCU framework flushes the work somehow during cpu offline?  It
> would need to ensure further work is not queued after that point,
> which seems like it might be tricky to synchronize.  I don't know
> enough about the working of RCU to even attempt to implement that.

There are some ways that RCU might be able to shrink the window during
which the outgoing CPU's callbacks are in limbo, but they are not free
of risk, so we really need to completely understand what is going on
before making any possibly ill-conceived changes.  ;-)

> In any case, it seems like some more analysis is needed based on the
> latest data.

Looking forward to hearing about what you find!

							Thanx, Paul

> -- 
> Jeffrey Hugo
> Qualcomm Datacenter Technologies as an affiliate of Qualcomm
> Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the
> Code Aurora Forum, a Linux Foundation Collaborative Project.
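
For the experiment of affining the "rcuo" kthreads to a CPU that is
never taken offline, something along the following lines might serve as
a starting point.  This is only a rough sketch and not a tested
procedure: the helper names find_rcuo_pids() and pin_rcuo_kthreads() are
made up for illustration, and it assumes that the kthreads of interest
can be recognized by a command name beginning with "rcuo" in
/proc/<pid>/comm.  Which CPU to pin to is likewise an assumption that
depends on the test setup.

```python
import os
from pathlib import Path


def find_rcuo_pids(proc_root="/proc"):
    """Return sorted PIDs whose command name starts with "rcuo".

    Assumption: the no-CBs kthreads show up under proc_root with a
    "comm" file whose contents begin with "rcuo".
    """
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-task entries such as "self" or "cpuinfo"
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # task exited, or the entry is not readable
        if comm.startswith("rcuo"):
            pids.append(int(entry.name))
    return sorted(pids)


def pin_rcuo_kthreads(cpu, proc_root="/proc"):
    """Pin each matching kthread to `cpu`; return the PIDs that were pinned."""
    pinned = []
    for pid in find_rcuo_pids(proc_root):
        try:
            os.sched_setaffinity(pid, {cpu})
            pinned.append(pid)
        except OSError as err:
            # Typically a permissions problem or a task that has gone away.
            print(f"could not pin pid {pid}: {err}")
    return pinned
```

Run as root, pin_rcuo_kthreads(0) would attempt to pin every matching
kthread to CPU 0, assuming CPU 0 is one that the hotplug test leaves
online.  The returned list gives a quick check of which kthreads were
actually moved before the reproduction is rerun.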