Date: Thu, 3 May 2018 07:39:41 -0700
From: "Paul E. McKenney"
To: Peter Zijlstra
Cc: Mike Galbraith, Matt Fleming, Ingo Molnar, linux-kernel@vger.kernel.org,
	Michal Hocko
Subject: Re: cpu stopper threads and load balancing leads to deadlock
Reply-To: paulmck@linux.vnet.ibm.com
Message-Id: <20180503143941.GH26088@linux.vnet.ibm.com>
In-Reply-To: <20180503135617.GC12217@hirez.programming.kicks-ass.net>
References: <20180417142119.GA4511@codeblueprint.co.uk>
	<20180420095005.GH4064@hirez.programming.kicks-ass.net>
	<20180424133325.GA3179@codeblueprint.co.uk>
	<1525349542.9956.2.camel@gmx.de>
	<20180503122808.GZ12217@hirez.programming.kicks-ass.net>
	<1525351221.9956.4.camel@gmx.de>
	<20180503124943.GB12217@hirez.programming.kicks-ass.net>
	<1525354359.5576.1.camel@gmx.de>
	<20180503135617.GC12217@hirez.programming.kicks-ass.net>

On Thu, May 03, 2018 at 03:56:17PM +0200, Peter Zijlstra wrote:
> On Thu, May 03, 2018 at 03:32:39PM +0200, Mike Galbraith wrote:
>
> > Dang. With $subject fix applied as well..
>
> That's a NO then... :-(
>
> > [ 151.103732] smpboot: Booting Node 0 Processor 2 APIC 0x4
> > [ 151.104908] =============================
> > [ 151.104909] WARNING: suspicious RCU usage
> > [ 151.104910] 4.17.0.g66d489e-tip-default #84 Tainted: G E
> > [ 151.104911] -----------------------------
> > [ 151.104912] kernel/sched/core.c:1625 suspicious rcu_dereference_check() usage!
> > [ 151.104913]
> > other info that might help us debug this:
> >
> > [ 151.104914]
> > RCU used illegally from offline CPU!
> > rcu_scheduler_active = 2, debug_locks = 0
> > [ 151.104916] 3 locks held by swapper/2/0:
> > [ 151.104916] #0: 00000000560adb60 (stop_cpus_mutex){+.+.}, at: stop_machine_from_inactive_cpu+0x86/0x140
> > [ 151.104923] #1: 00000000e4fb0238 (&p->pi_lock){-.-.}, at: try_to_wake_up+0x2d/0x5f0
> > [ 151.104929] #2: 000000003341403b (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x80
> > [ 151.104934]
> > stack backtrace:
> > [ 151.104937] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G E 4.17.0.g66d489e-tip-default #84
> > [ 151.104938] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
> > [ 151.104938] Call Trace:
> > [ 151.104942] dump_stack+0x78/0xb3
> > [ 151.104945] ttwu_stat+0x121/0x130
> > [ 151.104949] try_to_wake_up+0x2c2/0x5f0
> > [ 151.104953] ? cpu_stop_park+0x30/0x30
> > [ 151.104956] wake_up_q+0x4a/0x70
> > [ 151.104959] cpu_stop_queue_work+0x6b/0xa0
> > [ 151.104963] queue_stop_cpus_work+0x61/0xb0
> > [ 151.104968] stop_machine_from_inactive_cpu+0xd8/0x140
>
> > > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > > index f89014a2c238..a32518c2ba4a 100644
> > > --- a/kernel/stop_machine.c
> > > +++ b/kernel/stop_machine.c
> > > @@ -650,8 +650,10 @@ int stop_machine_from_inactive_cpu(cpu_stop_fn_t fn, void *data,
> > >          /* Schedule work on other CPUs and execute directly for local CPU */
> > >          set_state(&msdata, MULTI_STOP_PREPARE);
> > >          cpu_stop_init_done(&done, num_active_cpus());
> > > -        queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
> > > -                             &done);
> > > +
> > > +        RCU_NONIDLE(queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop,
> > > +                                         &msdata, &done));
> > > +
> > >          ret = multi_cpu_stop(&msdata);
>
> Paul, any clue on what else to try here? The whole MTRR setup is
> radically crazy but it's something we're stuck with (yay hardware) :/
>
> So the issue is that we're doing wakeups from an offline CPU (very early
> during bringup) and RCU (rightfully) complains about that. I thought
> RCU_NONIDLE() was the magic incantation that makes RCU 'watch', but
> clearly it's not enough here.

Huh. No, RCU_NONIDLE() only works for idle, not for offline.

Maybe... Let me take a look. There must be some way to mark a
specific lock acquisition and release as being lockdep-invisible...

Another approach would be to have an architecture-specific thing that
caused RCU to be enabled way earlier on x86.

							Thanx, Paul