From: Kashyap Desai
Date: Fri, 31 Aug 2018 17:37:22 -0600
Message-ID: <602cee6381b9f435a938bbaf852d07f9@mail.gmail.com>
Subject: RE: Affinity managed interrupts vs non-managed interrupts
To: Thomas Gleixner
Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig, Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

> > > > It is not yet finalized, but it can be based on per sdev outstanding,
> > > > shost_busy etc.
> > > > We want to use special 16 reply queue for IO acceleration (these queues are
> > > > working interrupt coalescing mode. This is a h/w feature)
> > >
> > > TBH, this does not make any sense whatsoever. Why are you trying to have
> > > extra interrupts for coalescing instead of doing the following:
> >
> > Thomas,
> >
> > We are using this feature mainly for performance and not for CPU hotplug
> > issues.
> > I read your below #1 to #4 points are more of addressing CPU hotplug
> > stuffs. Right ?
> > If we use all 72 reply queue (all are in interrupt
> > coalescing mode) without any extra reply queues, we don't have any issue
> > with cpu-msix mapping and cpu hotplug issues. Our major problem with
> > that method is latency is very bad on lower QD and/or single worker case.
> >
> > To solve that problem we have added extra 16 reply queue (this is a
> > special h/w feature for performance only) which can be worked in interrupt
> > coalescing mode vs existing 72 reply queue will work without any interrupt
> > coalescing. Best way to map additional 16 reply queue is map it to the
> > local numa node.
>
> Ok. I misunderstood the whole thing a bit. So your real issue is that you
> want to have reply queues which are instantaneous, the per cpu ones, and
> then the extra 16 which do batching and are shared over a set of CPUs,
> right?

Yes, that is correct. The extra 16 (or however many) queues should be shared
across the CPUs of the *local* NUMA node of the PCI device.

> > I understand that, it is unique requirement but at the same time we may
> > be able to do it gracefully (in irq sub system) as you mentioned
> > "irq_set_affinity_hint" should be avoided in low level driver.
>
> > Is it possible to have similar mapping in managed interrupt case as below ?
> >
> >     for (i = 0; i < 16 ; i++)
> >             irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
> >                                   cpumask_of_node(local_numa_node));
> >
> > Currently we always see managed interrupts for pre-vectors are 0-71 and
> > effective cpu is always 0.
>
> The pre-vectors are not affinity managed. They get the default affinity
> assigned and at request_irq() the vectors are dynamically spread over CPUs
> to avoid that the bulk of interrupts ends up on CPU0. That's handled that
> way since a0c9259dc4e1 ("irq/matrix: Spread interrupts on allocation")

I am not sure whether this works on the 4.18 kernel; I can double check. What
I remember is that the pre_vectors are mapped to CPUs 0-71 in my case and the
effective CPU is always 0.
You mentioned that the pre-vectors should ideally be spread across CPUs; let
me check that.

> > We want some changes in current API which can allow us to pass flags
> > (like *local numa affinity*) and cpu-msix mapping are from local numa node
> > + effective cpu are spread across local numa node.
>
> What you really want is to split the vector space for your device into two
> blocks. One for the regular per cpu queues and the other (16 or how many
> ever) which are managed separately, i.e. spread out evenly. That needs some
> extensions to the core allocation/management code, but that shouldn't be a
> huge problem.

Yes, that is the correct understanding. I can test any proposed patch if that
is what we want to use as best practice. We attempted this ourselves, but due
to our limited knowledge of the irq subsystem we could not settle on anything
close to our requirement.

We did something like the following: we added a new flag,
PCI_IRQ_PRE_VEC_NUMA, to indicate that all pre and post vectors should be
shared within the local NUMA node.

    int irq_flags;
    /* Designated initializer zeroes every field that is not listed. */
    struct irq_affinity desc = { .pre_vectors = 16, .post_vectors = 0 };

    irq_flags = PCI_IRQ_MSIX;
    /* PCI_IRQ_PRE_VEC_NUMA is our new experimental flag, not an upstream one. */
    i = pci_alloc_irq_vectors_affinity(instance->pdev,
            instance->high_iops_vector_start * 2, instance->msix_vectors,
            irq_flags | PCI_IRQ_AFFINITY | PCI_IRQ_PRE_VEC_NUMA, &desc);

I was not able to work out which part of the irq subsystem needs to change.

~ Kashyap

>
> Thanks,
>
> tglx