Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753292AbcDFTvo (ORCPT ); Wed, 6 Apr 2016 15:51:44 -0400
Received: from e06smtp16.uk.ibm.com ([195.75.94.112]:48630 "EHLO
        e06smtp16.uk.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1753232AbcDFTvm (ORCPT );
        Wed, 6 Apr 2016 15:51:42 -0400
X-IBM-Helo: d06dlp01.portsmouth.uk.ibm.com
X-IBM-MailFrom: heiko.carstens@de.ibm.com
X-IBM-RcptTo: linux-kernel@vger.kernel.org;linux-s390@vger.kernel.org
Date: Wed, 6 Apr 2016 21:51:33 +0200
From: Heiko Carstens
To: Sebastian Andrzej Siewior
Cc: Thomas Gleixner , Sebastian Andrzej Siewior ,
        linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org,
        rt@linutronix.de, Martin Schwidefsky , Anna-Maria Gleixner
Subject: Re: [PATCH] cpu/hotplug: fix rollback during error-out in __cpu_disable()
Message-ID: <20160406195133.GB3485@osiris>
References: <1459765640-13599-1-git-send-email-anna-maria@linutronix.de>
        <20160405104912.GC3937@osiris>
        <57039DC2.6090907@linutronix.de>
        <20160405112336.GB6890@osiris>
        <20160405113637.GC6890@osiris>
        <20160405115129.GE30124@linutronix.de>
        <5703A836.7030708@linutronix.de>
        <20160405121155.GF6890@osiris>
        <20160405155904.GA19022@linutronix.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20160405155904.GA19022@linutronix.de>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-MML: disable
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 16040619-0025-0000-0000-00000EBD2E7F
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 1792
Lines: 36

On Tue, Apr 05, 2016 at 05:59:04PM +0200, Sebastian Andrzej Siewior wrote:
> If we error out in __cpu_disable() (via takedown_cpu() which is
> currently the last one that can fail) we don't rollback entirely to
> CPUHP_ONLINE (where we started) but to CPUHP_AP_ONLINE_IDLE. This
> happens because the former states were on the target CPU (the AP states)
> and during the rollback we go back until the first BP state we started.
> During the next cpu_down attempt (on the same failed CPU) will take
> forever because the cpuhp thread is still down.
>
> The fix this I rollback to where we started in _cpu_down() via a workqueue
> to ensure that those callback will be run on the target CPU in
> non-atomic context (as in normal cpu_up()).
> The workqueues should be working again because the CPU_DOWN_FAILED were
> already invoked.
>
> notify_online() has been marked as ->skip_onerr because otherwise we
> will see the CPU_ONLINE notifier in addition to the CPU_DOWN_FAILED.
> However with ->skip_onerr we neither see CPU_ONLINE nor CPU_DOWN_FAILED
> if something in between (CPU_DOWN_FAILED … CPUHP_TEARDOWN_CPU).
> Currently there is nothing.
>
> This regression got probably introduce in the rework while we introduced
> the hotplug thread to offload the work to the target CPU.
>
> Fixes: 4cb28ced23c4 ("cpu/hotplug: Create hotplug threads")
> Reported-by: Heiko Carstens
> Signed-off-by: Sebastian Andrzej Siewior
> ---
> kernel/cpu.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)

This fixes the issue that a second cpu_down() will take forever, if
__cpu_disable() fails.
However it does not fix the issue that CPU_DOWN_FAILED will be seen on a
different cpu than the cpu that was supposed to be taken offline.
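
For readers following the thread, the partial rollback described in the
quoted changelog can be illustrated with a toy model. This is only a sketch
and not the kernel code: the toy_* names, the four-state ladder and the
function signature are all assumptions made for illustration, and the real
state list in the kernel is much longer.

```c
/*
 * Toy model of the rollback problem (NOT kernel code).
 * All names here are hypothetical; the ladder is heavily simplified.
 */
enum toy_state {
	TOY_OFFLINE,		/* CPU fully down */
	TOY_TEARDOWN_CPU,	/* last step that can fail, run by the control CPU */
	TOY_AP_ONLINE_IDLE,	/* point where the control CPU takes over */
	TOY_ONLINE,		/* where a cpu_down attempt starts */
};

/*
 * Take a toy CPU down from 'start'.
 * teardown_fails: non-zero makes the TOY_TEARDOWN_CPU step fail.
 * full_rollback:  non-zero rolls back to 'start' (the behaviour the patch
 *                 aims for); zero rolls back only to the state at which the
 *                 control CPU began its part (the behaviour being fixed).
 * Returns the state in which the toy CPU ends up.
 */
int toy_cpu_down(int start, int teardown_fails, int full_rollback)
{
	int st = start;
	int bp_start;

	/* Phase 1: steps that run on the target CPU itself (AP states). */
	while (st > TOY_AP_ONLINE_IDLE)
		st--;

	/* Phase 2: steps that run on the control CPU (BP states). */
	bp_start = st;
	while (st > TOY_OFFLINE) {
		st--;
		if (st == TOY_TEARDOWN_CPU && teardown_fails)
			return full_rollback ? start : bp_start;
	}

	return st;
}
```

Under these assumptions, toy_cpu_down(TOY_ONLINE, 1, 0) returns
TOY_AP_ONLINE_IDLE, which corresponds to the state the failed CPU is left in
before the fix, while toy_cpu_down(TOY_ONLINE, 1, 1) returns TOY_ONLINE. The
model says nothing about which CPU observes CPU_DOWN_FAILED.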