Subject: Re: Question on handling managed IRQs when hotplugging CPUs
From: Hannes Reinecke
To: John Garry, Thomas Gleixner
Cc: Keith Busch, Christoph Hellwig, Marc Zyngier, axboe@kernel.dk,
    Peter Zijlstra, Michael Ellerman, Linuxarm, linux-kernel@vger.kernel.org,
    Hannes Reinecke, linux-scsi@vger.kernel.org, linux-block@vger.kernel.org
Date: Fri, 1 Feb 2019 16:56:12 +0100
Message-ID: <745609be-b215-dd2d-c31f-0bd84572f49f@suse.de>
On 1/31/19 6:48 PM, John Garry wrote:
> On 30/01/2019 12:43, Thomas Gleixner wrote:
>> On Wed, 30 Jan 2019, John Garry wrote:
>>> On 29/01/2019 17:20, Keith Busch wrote:
>>>> On Tue, Jan 29, 2019 at 05:12:40PM +0000, John Garry wrote:
>>>>> On 29/01/2019 15:44, Keith Busch wrote:
>>>>>>
>>>>>> Hm, we used to freeze the queues with the CPUHP_BLK_MQ_PREPARE
>>>>>> callback, which would reap all outstanding commands before the
>>>>>> CPU and IRQ are taken offline. That was removed with commit
>>>>>> 4b855ad37194f ("blk-mq: Create hctx for each present CPU"). It
>>>>>> sounds like we should bring something like that back, but make it
>>>>>> more fine-grained to the per-cpu context.
>>>>>>
>>>>>
>>>>> Seems reasonable. But we would need it to deal with drivers which
>>>>> only expose a single queue to blk-mq, but use many queues
>>>>> internally. I think megaraid_sas does this, for example.
>>>>>
>>>>> I would also be slightly concerned with commands being issued from
>>>>> the driver unknown to blk-mq, like SCSI TMFs.
>>>>
>>>> I don't think either of those descriptions sound like good
>>>> candidates for using managed IRQ affinities.
>>>
>>> I wouldn't say that this behaviour is obvious to the developer. I
>>> can't see anything in Documentation/PCI/MSI-HOWTO.txt
>>>
>>> It also seems that this policy of relying on the upper layer to
>>> flush+freeze queues would cause issues if managed IRQs are used by
>>> drivers in other subsystems. Network controllers may have multiple
>>> queues and unsolicited interrupts.
>>
>> It doesn't matter which part is managing flush/freeze of queues as
>> long as something (either common subsystem code, upper layers or the
>> driver itself) does it.
>>
>> So for the megaraid_sas example the blk-mq layer obviously can't do
>> anything because it only sees a single request queue. But the driver
>> could, if the hardware supports it, tell the device to stop queueing
>> completions on the completion queue which is associated with a
>> particular CPU (or set of CPUs) during offline and then wait for the
>> in-flight stuff to be finished. If the hardware does not allow that,
>> then managed interrupts can't work for it.
>>
>
> A rough audit of current SCSI drivers shows that these set
> PCI_IRQ_AFFINITY in some path but don't set Scsi_Host.nr_hw_queues at
> all: aacraid, be2iscsi, csiostor, megaraid, mpt3sas
>
megaraid and mpt3sas don't have that functionality (or at least not
that I'm aware of).

And in general I'm not sure if the above approach is feasible.

Thing is, if we have _managed_ CPU hotplug (ie if the hardware provides
some means of quiescing the CPU before hotplug) then the whole thing is
trivial; disable the SQ and wait for all outstanding commands to
complete. Then trivially all requests are completed and the issue is
resolved. Even with today's infrastructure.

And I'm not sure if we can handle surprise CPU hotplug at all, given
all the possible race conditions. But then I might be wrong.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
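
[For readers following the thread: the quiesce-then-drain approach
Thomas outlines maps naturally onto the kernel's CPU hotplug state
machine. Below is a minimal sketch, assuming a hypothetical driver
"foo" with one hardware completion queue per CPU. All foo_* names and
the per-queue fields are invented for illustration; only the cpuhp_*
registration API is real (see include/linux/cpuhotplug.h).]

/*
 * Sketch: quiesce the per-CPU completion queue on CPU offline and
 * drain in-flight commands before the managed IRQ goes away.
 * Hypothetical driver "foo"; not real driver code.
 */
#include <linux/cpuhotplug.h>
#include <linux/atomic.h>
#include <linux/delay.h>
#include <linux/percpu.h>

struct foo_cq {
	atomic_t	inflight;	/* commands not yet completed */
	bool		stopped;	/* device no longer posts here */
};

static DEFINE_PER_CPU(struct foo_cq, foo_cqs);

/* Hypothetical hooks: ask the HW to stop/start posting completions. */
static void foo_hw_stop_cq(unsigned int cpu) { /* device specific */ }
static void foo_hw_start_cq(unsigned int cpu) { /* device specific */ }

static int foo_cpu_offline(unsigned int cpu)
{
	struct foo_cq *cq = per_cpu_ptr(&foo_cqs, cpu);

	/*
	 * Stop the device from queueing new completions on the CQ
	 * associated with this CPU, then wait for in-flight commands
	 * to drain; the intent is that the drain completes before the
	 * core shuts down the managed IRQ for the outgoing CPU.
	 */
	cq->stopped = true;
	foo_hw_stop_cq(cpu);
	while (atomic_read(&cq->inflight))
		msleep(10);
	return 0;
}

static int foo_cpu_online(unsigned int cpu)
{
	struct foo_cq *cq = per_cpu_ptr(&foo_cqs, cpu);

	cq->stopped = false;
	foo_hw_start_cq(cpu);
	return 0;
}

static int foo_register_hotplug(void)
{
	/* Dynamic state for the sketch; a real driver might claim a
	 * fixed slot in enum cpuhp_state instead. */
	return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "scsi/foo:online",
				 foo_cpu_online, foo_cpu_offline);
}

[As the thread notes, this only works when the hardware can actually
stop a single completion queue; if it can't, there is nothing the
driver callback could quiesce, which is exactly the megaraid_sas /
mpt3sas problem discussed above.]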