Date: Sat, 1 Sep 2018 00:48:46 +0200 (CEST)
From: Thomas Gleixner
To: Kashyap Desai
cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
    Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block
Subject: RE: Affinity managed interrupts vs non-managed interrupts
In-Reply-To: <486f94a563d63c4779498fe8829a546c@mail.gmail.com>
References: <20180829084618.GA24765@ming.t460p>
    <300d6fef733ca76ced581f8c6304bac6@mail.gmail.com>
    <615d78004495aebc53807156d04d988c@mail.gmail.com>
    <486f94a563d63c4779498fe8829a546c@mail.gmail.com>
User-Agent: Alpine 2.21 (DEB 202 2017-01-01)
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 31 Aug 2018, Kashyap Desai wrote:
> > > It is not yet finalized, but it can be based on per sdev outstanding,
> > > shost_busy etc.
> > > We want to use special 16 reply queue for IO acceleration (these
> > > queues are working interrupt coalescing mode. This is a h/w feature)
> >
> > TBH, this does not make any sense whatsoever. Why are you trying to have
> > extra interrupts for coalescing instead of doing the following:
>
> Thomas,
>
> We are using this feature mainly for performance and not for CPU hotplug
> issues.
> I read your below #1 to #4 points are more of addressing CPU hotplug
> stuffs. Right ? If we use all 72 reply queue (all are in interrupt
> coalescing mode) without any extra reply queues, we don't have any issue
> with cpu-msix mapping and cpu hotplug issues. Our major problem with
> that method is latency is very bad on lower QD and/or single worker case.
>
> To solve that problem we have added extra 16 reply queue (this is a
> special h/w feature for performance only) which can be worked in interrupt
> coalescing mode vs existing 72 reply queue will work without any interrupt
> coalescing. Best way to map additional 16 reply queue is map it to the
> local numa node.

Ok. I misunderstood the whole thing a bit. So your real issue is that you
want to have reply queues which are instantaneous, the per cpu ones, and
then the extra 16 which do batching and are shared over a set of CPUs,
right?

> I understand that, it is unique requirement but at the same time we may
> be able to do it gracefully (in irq sub system) as you mentioned
> "irq_set_affinity_hint" should be avoided in low level driver.
> Is it possible to have similar mapping in managed interrupt case as below?
>
> for (i = 0; i < 16; i++)
>         irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
>                               cpumask_of_node(local_numa_node));
>
> Currently we always see managed interrupts for pre-vectors are 0-71 and
> effective cpu is always 0.

The pre-vectors are not affinity managed. They get the default affinity
assigned and at request_irq() the vectors are dynamically spread over CPUs
so that the bulk of interrupts does not end up on CPU0. It has been handled
that way since a0c9259dc4e1 ("irq/matrix: Spread interrupts on allocation").

> We want some changes in current API which can allow us to pass flags
> (like *local numa affinity*) and cpu-msix mapping are from local numa node
> + effective cpu are spread across local numa node.

What you really want is to split the vector space for your device into two
blocks: one for the regular per cpu queues and the other (16 or however
many) which are managed separately, i.e. spread out evenly. That needs some
extensions to the core allocation/management code, but that shouldn't be a
huge problem.

Thanks,

        tglx
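
For illustration only, the two-block split described above could look
roughly like the following userspace C sketch. It is not kernel code and
not the extension proposed in this thread: struct vec_layout,
vec_is_extra() and spread_extra_vectors() are hypothetical names, and the
layout (per cpu vectors first, extra vectors after them) is an assumption
made for the example.

```c
#include <assert.h>

/*
 * Hypothetical layout of a device's vector space (illustration only):
 * vectors [0, nr_percpu) are the per cpu reply queues, and vectors
 * [nr_percpu, nr_percpu + nr_extra) are the shared, batching queues.
 */
struct vec_layout {
	unsigned int nr_percpu;
	unsigned int nr_extra;
};

/* Return 1 if 'vec' falls into the extra (shared) block, 0 otherwise. */
int vec_is_extra(const struct vec_layout *l, unsigned int vec)
{
	assert(vec < l->nr_percpu + l->nr_extra);
	return vec >= l->nr_percpu;
}

/*
 * Pick one target CPU per extra vector, spread evenly over the CPUs of
 * one NUMA node. node_cpus[] lists that node's CPU numbers; target[i]
 * receives the CPU chosen for extra vector i.
 */
void spread_extra_vectors(const unsigned int *node_cpus,
			  unsigned int nr_node_cpus,
			  unsigned int *target, unsigned int nr_extra)
{
	unsigned int i;

	assert(nr_node_cpus > 0);
	for (i = 0; i < nr_extra; i++) {
		/* i < nr_extra, so the index is always < nr_node_cpus. */
		unsigned long long idx =
			(unsigned long long)i * nr_node_cpus / nr_extra;

		target[i] = node_cpus[idx];
	}
}
```

Assuming, purely for the example, that the local node holds CPUs 36-71,
calling spread_extra_vectors() with 16 extra vectors picks CPUs 36, 38,
40, 42, 45 and so on up to 69, so the batching queues do not pile up on
the first CPUs of the node.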