Date: Sun, 26 Mar 2017 16:28:43 -0700
From: "Paul E. McKenney"
To: Jeffrey Hugo
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	pprakash@codeaurora.org, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Jens Axboe,
	Sebastian Andrzej Siewior, Thomas Gleixner, Richard Cochran,
	Boris Ostrovsky, Richard Weinberger
Subject: Re: [BUG] Deadlock due to interactions of block, RCU, and cpu offline
Reply-To: paulmck@linux.vnet.ibm.com
Message-Id: <20170326232843.GA3637@linux.vnet.ibm.com>

On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
> Hello,
> 
> I observe that running stress-ng with the cpu-online and fstat tests
> results in a deadlock of hung tasks:
> 
> [ 366.810486] INFO: task stress-ng-cpu-o:2590 blocked for more than 120 seconds.
> [ 366.817689] Not tainted 4.9.0 #39
> [ 366.821504] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 366.829320] stress-ng-cpu-o D 0 2590 2589 0x00000008
> [ 366.834803] Call trace:
> [ 366.837222] [] __switch_to+0x60/0x70
> [ 366.842338] [] __schedule+0x178/0x648
> [ 366.847550] [] schedule+0x38/0x98
> [ 366.852408] [] blk_mq_freeze_queue_wait+0x64/0x1a8
> [ 366.858749] [] blk_mq_queue_reinit_work+0x74/0x110
> [ 366.865081] [] blk_mq_queue_reinit_dead+0x24/0x30
> [ 366.871335] [] cpuhp_invoke_callback+0x98/0x4a8
> [ 366.877411] [] cpuhp_down_callbacks+0x114/0x150
> [ 366.883484] [] _cpu_down+0x100/0x1d8
> [ 366.888609] [] do_cpu_down+0x4c/0x78
> [ 366.893727] [] cpu_down+0x24/0x30
> [ 366.898593] [] cpu_subsys_offline+0x20/0x30
> [ 366.904318] [] device_offline+0xa8/0xd8
> [ 366.909704] [] online_store+0x4c/0xa8
> [ 366.914907] [] dev_attr_store+0x44/0x60
> [ 366.920294] [] sysfs_kf_write+0x5c/0x78
> [ 366.925672] [] kernfs_fop_write+0xbc/0x1e8
> [ 366.931318] [] __vfs_write+0x48/0x138
> [ 366.936526] [] vfs_write+0xa8/0x1c0
> [ 366.941557] [] SyS_write+0x54/0xb0
> [ 366.946511] [] el0_svc_naked+0x24/0x28
> [ 366.951800] INFO: task stress-ng-fstat:2591 blocked for more than 120 seconds.
> [ 366.959008] Not tainted 4.9.0 #39
> [ 366.962823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 366.970640] stress-ng-fstat D 0 2591 2589 0x00000000
> [ 366.976105] Call trace:
> [ 366.978540] [] __switch_to+0x60/0x70
> [ 366.983658] [] __schedule+0x178/0x648
> [ 366.988870] [] schedule+0x38/0x98
> [ 366.993727] [] blk_mq_freeze_queue_wait+0x64/0x1a8
> [ 367.000068] [] blk_mq_freeze_queue+0x28/0x38
> [ 367.005880] [] lo_release+0x64/0x90
> [ 367.010919] [] __blkdev_put+0x26c/0x2c8
> [ 367.016300] [] blkdev_put+0x54/0x128
> [ 367.021418] [] blkdev_close+0x2c/0x40
> [ 367.026631] [] __fput+0xa0/0x1e0
> [ 367.031401] [] ____fput+0x20/0x30
> [ 367.036266] [] task_work_run+0xc8/0xe8
> [ 367.041557] [] do_notify_resume+0xac/0xb8
> [ 367.047116] [] work_pending+0x8/0x10
> 
> I have tested and found this issue to be reproducible on both x86 and
> ARM64 architectures on 4.7, 4.8, 4.9, 4.10, and 4.11-rc3 kernels.
> 
> Using the below test methodology [1], the issue reproduces within a
> few minutes.
> 
> Using ftrace, I have analyzed the issue on 4.9 and I believe I've
> found the root cause [2].
> 
> Based on my analysis, I have developed a fix [3], which addresses the
> issue: I am able to run stress-ng for over an hour, where I was unable
> to do so before.  However, I do not know the full extent of the
> impacts of this fix, and I am looking for guidance from the community
> to determine the final fix.
> 
> 
> [1] Test methodology
> --------------------
> Boot a multicore system, such as a desktop i5 system, with nr_cpus=2.
> 
> Enable all logging to determine when the deadlock occurs (prints from
> the test stop flowing out of the serial port):
> echo 1 > /sys/module/printk/parameters/ignore_loglevel
> 
> Run stress-ng:
> stress-ng --fstat 1 --cpu-online 1 -t 3600
> 
> Wait for the test output to stop, and the hung task watchdog to fire.
> 
> 
> [2] Analysis
> ------------
> Again, this analysis is based upon the 4.9 kernel, but I believe it
> still applies to newer kernels.
> 
> I conclude that the hung tasks occur due to a race condition which
> results in a deadlock.
> 
> The race condition occurs between "normal" work in the block layer on
> a core (the stress-ng-fstat task in the above dump) and cpu offline of
> that core (the stress-ng-cpu-o task in the above dump).
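
For reference, both hung tasks in the traces above are parked in
blk_mq_freeze_queue_wait(), which in the 4.9-era block/blk-mq.c boils
down to roughly the following (a simplified sketch, not necessarily the
exact source): sleep until the queue's percpu usage counter has dropped
to zero.

	/* Sketch: the wait both hung tasks above are blocked in. */
	static void blk_mq_freeze_queue_wait(struct request_queue *q)
	{
		/* Woken only once q_usage_counter can be seen as zero. */
		wait_event(q->mq_freeze_wq,
			   percpu_ref_is_zero(&q->q_usage_counter));
	}

Nothing wakes this wait up until the final reference to q_usage_counter
is dropped, which is exactly the step that gets stranded in the
analysis below.
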
> 
> The fput() from userspace in the fstat task results in a call to
> blk_mq_freeze_queue(), which drops the last reference to the queue via
> percpu_ref_kill(), and then waits for the ref count of the queue to
> hit 0 in blk_mq_freeze_queue_wait().  percpu_ref_kill() will result in
> __percpu_ref_switch_to_atomic(), which will use call_rcu_sched() to
> set up delayed work to finalize the percpu_ref cleanup and drop the
> ref count to 0.
> 
> Note that call_rcu_sched() queues the work to a per-cpu queue, thus
> the work can only be run on the core it is queued on, by the work
> thread that is pinned to that core.
> 
> It is a race between this work running, and the cpu offline
> processing.

One quick way to test this assumption is to build a kernel with the
Kconfig options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y.
This will cause call_rcu_sched() to queue the work to a kthread, which
can migrate to some other CPU.  If your analysis is correct, this
should avoid the deadlock.  (Note that the deadlock should be fixed in
any case; this is just a diagnostic assumption-check procedure.)

> If the cpu offline processing is able to get to and process the
> RCU/tree:online state before the queued work from the block layer,
> then the pinned work thread will be migrated to another core via
> rcutree_offline_cpu(), and the work will not be able to execute.
> 
> This race condition does not result in deadlock until later in the cpu
> offline processing.  Once we hit the block/mq:prepare state, the block
> layer freezes all the queues and waits for the ref counts to hit 0.
> This normally works because at this point the cpu being offlined is
> dead from cpu:teardown, and the offline processing is occurring on
> another active cpu, so call_rcu_sched() will queue work to an active
> cpu where it can get processed.  However, the fstat process already
> did that work for one of the queues to be frozen in the block layer,
> so the processing of the block/mq:prepare state waits on the same ref
> count as fstat to hit 0.  Thus we see the result of this as the
> stress-ng-cpu-o task above.
> 
> The block/mq:prepare processing stalls the cpu offline processing,
> which causes a deadlock because the processing never reaches the
> RCU/tree:prepare state, which migrates all of the queued work from the
> offline cpu to another cpu; that migration would allow the work that
> the fstat task queued to execute, drop the ref count to 0, and unblock
> both stalled tasks.
> 
> By reordering the cpu offline states such that the shutdown processing
> of RCU/tree:prepare occurs before block/mq:prepare [3], we prevent the
> deadlock by enabling the queued work in the RCU framework to run
> elsewhere, and eventually unblock the tasks waiting on the ref count.
> 
> However, it is not entirely clear what the full ramifications of this
> reorder are.  I assume the ordering of these cpu online/offline states
> is carefully considered, and without that knowledge, I could not say
> for certain that my fix [3] is safe.
> 
> What is the opinion of the domain experts?

I do hope that we can come up with a better fix.  No offense intended,
as coming up with -any- fix in the CPU-hotplug domain is not to be
denigrated, but this one looks to be at best quite fragile.

							Thanx, Paul

> -- 
> Jeffrey Hugo
> Qualcomm Datacenter Technologies as an affiliate of Qualcomm
> Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the
> Code Aurora Forum, a Linux Foundation Collaborative Project.
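
For reference, the deferral that the above analysis hinges on lives in
the 4.9-era lib/percpu-refcount.c: killing the ref switches it to
atomic mode, and both the folding of the per-cpu counts into the atomic
count and the final percpu_ref_put() happen in an RCU-sched callback,
which call_rcu_sched() queues on whatever CPU executes the switch.  A
simplified sketch (locking, error handling, and the confirm_switch
plumbing are elided, so details differ from the real source):

	static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
						  percpu_ref_func_t *confirm_switch)
	{
		/* ... mark the ref atomic, take a temporary reference ... */

		/*
		 * The rest is deferred to an RCU-sched callback, which
		 * call_rcu_sched() queues on the CPU running this code.
		 */
		call_rcu_sched(&ref->rcu, percpu_ref_switch_to_atomic_rcu);
	}

	static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
	{
		struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);

		/* ... fold the per-cpu counts into ref->count ... */

		/* Only here can the count be observed to reach zero. */
		percpu_ref_put(ref);
	}

Until that callback runs on the CPU it was queued on (or the dead CPU's
callbacks are migrated elsewhere), every blk_mq_freeze_queue_wait() on
that queue stays blocked, which is the dependency described in [2].
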
> 
> 
> [3] Proposed fix
> ---8>---
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index afe641c..9b86db9 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -49,6 +49,7 @@ enum cpuhp_state {
>  	CPUHP_ARM_SHMOBILE_SCU_PREPARE,
>  	CPUHP_SH_SH3X_PREPARE,
>  	CPUHP_BLK_MQ_PREPARE,
> +	CPUHP_RCUTREE_PREP2,
>  	CPUHP_TIMERS_DEAD,
>  	CPUHP_NOTF_ERR_INJ_PREPARE,
>  	CPUHP_MIPS_SOC_PREPARE,
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 29de1a9..b46c573 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1289,6 +1289,11 @@ static int __init cpu_hotplug_pm_sync_init(void)
>  	[CPUHP_RCUTREE_PREP] = {
>  		.name			= "RCU/tree:prepare",
>  		.startup.single		= rcutree_prepare_cpu,
> +		.teardown.single	= NULL,
> +	},
> +	[CPUHP_RCUTREE_PREP2] = {
> +		.name			= "RCU/tree:dead",
> +		.startup.single		= NULL,
>  		.teardown.single	= rcutree_dead_cpu,
>  	},
>  	/*
> 
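
A note on why the enum position in the proposed fix changes the
offline-time ordering: during _cpu_down(), the teardown callbacks are
invoked in descending cpuhp_state order, roughly as in the sketch below
(simplified from the 4.9-era kernel/cpu.c, with rollback handling
elided).  With CPUHP_RCUTREE_PREP2 placed above CPUHP_BLK_MQ_PREPARE in
the enum, rcutree_dead_cpu() is therefore invoked before
blk_mq_queue_reinit_dead() on the offline path, which is the reordering
described in [2].

	/*
	 * Sketch of the offline-side loop: teardown runs from the highest
	 * cpuhp_state down to the target, so a state with a larger enum
	 * value is torn down earlier.
	 */
	static int cpuhp_down_callbacks(unsigned int cpu,
					struct cpuhp_cpu_state *st,
					enum cpuhp_state target)
	{
		int ret = 0;

		for (; st->state > target; st->state--) {
			ret = cpuhp_invoke_callback(cpu, st->state, false, NULL);
			if (ret)
				break;	/* rollback handling elided */
		}
		return ret;
	}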