Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
To: Ming Lei
References: <1566281669-48212-1-git-send-email-longli@linuxonhyperv.com>
CC: Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
    Christoph Hellwig, Sagi Grimberg, linux-nvme,
    Linux Kernel Mailing List, Long Li, Thomas Gleixner, chenxiang
From: John Garry
Date: Tue, 20 Aug 2019 09:59:32 +0100

On 20/08/2019 09:25, Ming Lei wrote:
> On Tue, Aug 20, 2019 at 2:14 PM wrote:
>>
>> From: Long Li
>>
>> This patch set tries to fix interrupt swamp in NVMe devices.
>>
>> On large systems with many CPUs, a number of CPUs may share one NVMe
>> hardware queue. In this situation, several CPUs may be issuing I/Os
>> while all the completions are returned on the CPU to which the
>> hardware queue is bound. This can leave that CPU swamped by
>> interrupts, staying in interrupt mode for an extended time while the
>> other CPUs continue to issue I/O. This can trigger watchdog and RCU
>> timeouts and make the system unresponsive.
>>
>> This patch set addresses this by enforcing scheduling and throttling
>> I/O when the CPU is starved in this situation.
>>
>> Long Li (3):
>>   sched: define a function to report the number of context switches on
>>     a CPU
>>   sched: export idle_cpu()
>>   nvme: complete request in work queue on CPU with flooded interrupts
>>
>>  drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
>>  drivers/nvme/host/nvme.h |  1 +
>>  include/linux/sched.h    |  2 ++
>>  kernel/sched/core.c      |  7 +++++
>>  4 files changed, 66 insertions(+), 1 deletion(-)
>
> Another, simpler solution may be to complete the request in a threaded
> interrupt handler for this case, and meanwhile allow the scheduler to
> run the interrupt thread handler on the CPUs specified by the irq
> affinity mask, as discussed at the following link:
>
> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
>
> Could you try the above solution and see if the lockup can be avoided?
> John Garry should have a workable patch.

Yeah, so we experimented with changing the interrupt handling in the
SCSI driver I maintain to use a threaded IRQ handler plus the patch
below, and saw a significant throughput boost:

--->8

Subject: [PATCH] genirq: Add support to allow thread to use hard irq affinity

Currently the cpu allowed mask for the threaded part of a threaded irq
handler will be set to the effective affinity of the hard irq.

Typically the effective affinity of the hard irq will be for a single
cpu. As such, the threaded handler would always run on the same cpu as
the hard irq.

We have seen scenarios in high data-rate throughput testing where the
cpu handling the interrupt can be totally saturated handling both the
hard interrupt and threaded handler parts, limiting throughput.

Add an IRQF_IRQ_AFFINITY flag to allow the driver requesting the
threaded interrupt to decide the policy for which cpus the threaded
handler may run on.

Signed-off-by: John Garry

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a99b2a..48e8b955989a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
  *                interrupt handler after suspending interrupts. For system
  *                wakeup devices users need to implement wakeup detection in
  *                their interrupt handlers.
+ * IRQF_IRQ_AFFINITY - Use the hard interrupt affinity for setting the cpu
+ *                allowed mask for the threaded handler of a threaded interrupt
+ *                handler, rather than the effective hard irq affinity.
  */
 #define IRQF_SHARED		0x00000080
 #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +77,7 @@
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
 #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_IRQ_AFFINITY	0x00080000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f179bf77..cb483a055512 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -966,9 +966,13 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
 	 * mask pointer. For CPU_MASK_OFFSTACK=n this is optimized out.
 	 */
 	if (cpumask_available(desc->irq_common_data.affinity)) {
+		struct irq_data *irq_data = &desc->irq_data;
 		const struct cpumask *m;
 
-		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
+		if (action->flags & IRQF_IRQ_AFFINITY)
+			m = desc->irq_common_data.affinity;
+		else
+			m = irq_data_get_effective_affinity_mask(irq_data);
 		cpumask_copy(mask, m);
 	} else {
 		valid = false;
-- 
2.17.1
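
For illustration, this is roughly how a driver would opt in when
requesting its completion interrupt. It is only a minimal sketch on top
of the patch above; struct my_dev, my_dev_ack_irq() and
my_dev_complete_requests() are made-up names, not code from any real
driver:

#include <linux/interrupt.h>

struct my_dev;					/* hypothetical driver context */

void my_dev_ack_irq(struct my_dev *dev);	/* hypothetical: quiesce the hw source */
void my_dev_complete_requests(struct my_dev *dev); /* hypothetical: completion work */

static irqreturn_t my_hard_handler(int irq, void *data)
{
	/* Do the bare minimum in hard irq context, then wake the thread. */
	my_dev_ack_irq(data);
	return IRQ_WAKE_THREAD;
}

static irqreturn_t my_thread_fn(int irq, void *data)
{
	/*
	 * Runs in a kernel thread. With IRQF_IRQ_AFFINITY set, the
	 * scheduler may place this thread on any cpu in the irq's
	 * affinity mask, rather than only the effective (typically
	 * single) hard irq cpu.
	 */
	my_dev_complete_requests(data);
	return IRQ_HANDLED;
}

static int my_dev_setup_irq(struct my_dev *dev, int irq)
{
	return request_threaded_irq(irq, my_hard_handler, my_thread_fn,
				    IRQF_IRQ_AFFINITY, "my_dev", dev);
}

The point of the split is that the hard handler only quiesces the
interrupt source, while the heavy completion work moves into the irq
thread, which the scheduler is then free to migrate across the cpus in
the irq affinity mask.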

As Ming mentioned in that same thread, we could even make this policy
for managed interrupts.

Cheers,
John

>
> Thanks,
> Ming Lei
>
> .
>