Message-ID: <5373d269-84e4-b199-3011-4c879c480b68@huawei.com>
Date: Sat, 4 Mar 2023 18:42:37 +0800
Subject: Re: [BUG] possible deadlock in __rcu_irq_enter_check_tick
To: Mark Rutland, "Zhang, Qiang1"
CC: "liwei (GF)"
References: <20221012064911.GN4221@paulmck-ThinkPad-P17-Gen-1>
From: Yu Liao
X-Mailing-List: linux-kernel@vger.kernel.org

On 2022/10/19 22:14, Mark Rutland wrote:
> On Tue, Oct 18, 2022 at 03:24:48PM +0100, Mark Rutland wrote:
>> On Tue, Oct 11, 2022 at 11:49:11PM -0700, Paul E. McKenney wrote:
>>> On Tue, Oct 11, 2022 at 09:18:11PM +0800, Yu Liao wrote:
>>>> Hello,
>>>>
>>>> When I run syzkaller, a deadlock problem occurs. The call stack is as follows:
>>>> [ 1088.244366][ C1] ======================================================
>>>> [ 1088.244838][ C1] WARNING: possible circular locking dependency detected
>>>> [ 1088.245313][ C1] 5.10.0-04424-ga472e3c833d3 #1 Not tainted
>>>> [ 1088.245745][ C1] ------------------------------------------------------
>>>
>>> It is quite possible that an unfortunate set of commits were backported
>>> to v5.10. Could you please bisect?
>>>
>>>> [ 1088.246214][ C1] syz-executor.2/932 is trying to acquire lock:
>>>> [ 1088.246628][ C1] ffffa0001440c418 (rcu_node_0){..-.}-{2:2}, at:
>>>> __rcu_irq_enter_check_tick+0x128/0x2f4
>>>> [ 1088.247330][ C1]
>>>> [ 1088.247330][ C1] but task is already holding lock:
>>>> [ 1088.247830][ C1] ffff000224d0c298 (&rq->lock){-.-.}-{2:2}, at:
>>>> try_to_wake_up+0x6e0/0xd40
>>>> [ 1088.248424][ C1]
>>>> [ 1088.248424][ C1] which lock already depends on the new lock.
>>>> [ 1088.248424][ C1]
>>>> [ 1088.249127][ C1]
>>>> [ 1088.249127][ C1] the existing dependency chain (in reverse order) is:
>>>> [ 1088.249726][ C1]
>>>> [ 1088.249726][ C1] -> #1 (&rq->lock){-.-.}-{2:2}:
>>>> [ 1088.250239][ C1] validate_chain+0x6dc/0xb0c
>>>> [ 1088.250591][ C1] __lock_acquire+0x498/0x940
>>>> [ 1088.250942][ C1] lock_acquire+0x228/0x580
>>>> [ 1088.251346][ C1] _raw_spin_lock_irqsave+0xc0/0x15c
>>>> [ 1088.251758][ C1] resched_cpu+0x5c/0x110
>>>> [ 1088.252091][ C1] rcu_implicit_dynticks_qs+0x2b0/0x5d0
>>>> [ 1088.252501][ C1] force_qs_rnp+0x244/0x39c
>>>> [ 1088.252847][ C1] rcu_gp_fqs_loop+0x2e4/0x440
>>>> [ 1088.253219][ C1] rcu_gp_kthread+0x1a4/0x240
>>>> [ 1088.253597][ C1] kthread+0x20c/0x260
>>>> [ 1088.253963][ C1] ret_from_fork+0x10/0x18
>>>> [ 1088.254389][ C1]
>>>> [ 1088.254389][ C1] -> #0 (rcu_node_0){..-.}-{2:2}:
>>>> [ 1088.255296][ C1] check_prev_add+0xe0/0x105c
>>>> [ 1088.256000][ C1] check_prevs_add+0x1c8/0x3d4
>>>> [ 1088.256693][ C1] validate_chain+0x6dc/0xb0c
>>>> [ 1088.257372][ C1] __lock_acquire+0x498/0x940
>>>> [ 1088.257731][ C1] lock_acquire+0x228/0x580
>>>> [ 1088.258079][ C1] _raw_spin_lock+0xa0/0x120
>>>> [ 1088.258425][ C1] __rcu_irq_enter_check_tick+0x128/0x2f4
>>>> [ 1088.258844][ C1] rcu_nmi_enter+0xc4/0xd0
>>>
>>> This is looking like we took an interrupt while holding an rq lock.
>>> Am I reading this correctly? If so, that is bad in and of itself.
>>
>> In this case it's not an interrupt; per the entry bits below:
>>
>>>> [ 1088.259183][ C1] arm64_enter_el1_dbg+0xb0/0x160
>>>> [ 1088.259623][ C1] el1_dbg+0x28/0x50
>>>> [ 1088.260011][ C1] el1_sync_handler+0xf4/0x150
>>>> [ 1088.260481][ C1] el1_sync+0x74/0x100
>>
>> ... this is a synchronous debug exception, which is one of:
>>
>> * A hardware single-step exception
>> * A hardware watchpoint
>> * A hardware breakpoint
>> * A software breakpoint (i.e. a BRK instruction)
>>
>> ... and we have to treat those as NMIs.
>>
>> That could be a kprobe, or a WARN, etc.
>
> Having a go with v6.1-rc1, placing a kprobe on __rcu_irq_enter_check_tick()
> causes a recursive exception which triggers the stack overflow detection, so
> there are bigger problems here, and we'll need to do some further rework of the
> arm64 entry code. FWIW, x86-64 seems fine.
>
> I have a vague recollection that that there was something (some part kprobes,
> perhaps) that didn't like being called in NMI context, which is why debug
> exceptions aren't accounted as true NMIs (but get most of the same treatment).
>
> I'll have to dig into this a bit more; there are a bunch of subtle interactions
> in this area, and I don't want to put a band-aid over this without fully
> understanding the implications.
>
> Once we've figured that out for mainline, we can figure out what needs to go to
> stable.

Hi Mark,

Do you plan to apply Zhang Qiang's patch that treats el1_dbg as an NMI, or do
you have a better solution in mind?

Thanks,
Yu