From: Marc Zyngier <maz@kernel.org>
To: John Garry <john.garry@huawei.com>
Cc: Ming Lei <ming.lei@redhat.com>, tglx@linutronix.de, chenxiang66@hisilicon.com, bigeasy@linutronix.de, linux-kernel@vger.kernel.org, hare@suse.com, hch@lst.de, axboe@kernel.dk, bvanassche@acm.org, peterz@infradead.org, mingo@redhat.com
Subject: Re: [PATCH RFC 1/1] genirq: Make threaded handler use irq affinity for managed interrupt
Date: Tue, 10 Dec 2019 10:28:12 +0000
Message-ID: <048746c22898849d28985c0f65cf2c2a@www.loen.fr>
In-Reply-To: <28424a58-1159-c3f9-1efb-f1366993afcf@huawei.com>
References: <1575642904-58295-1-git-send-email-john.garry@huawei.com> <1575642904-58295-2-git-send-email-john.garry@huawei.com> <20191207080335.GA6077@ming.t460p> <78a10958-fdc9-0576-0c39-6079b9749d39@huawei.com> <20191210014335.GA25022@ming.t460p> <28424a58-1159-c3f9-1efb-f1366993afcf@huawei.com>
On 2019-12-10 09:45, John Garry wrote:
> On 10/12/2019 01:43, Ming Lei wrote:
>> On Mon, Dec 09, 2019 at 02:30:59PM +0000, John Garry wrote:
>>> On 07/12/2019 08:03, Ming Lei wrote:
>>>> On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote:
>>>>> Currently the cpu allowed mask for the threaded part of a
>>>>> threaded irq handler will be set to the effective affinity of
>>>>> the hard irq.
>>>>>
>>>>> Typically the effective affinity of the hard irq will be for a
>>>>> single cpu. As such, the threaded handler would always run on
>>>>> the same cpu as the hard irq.
>>>>>
>>>>> We have seen scenarios in high data-rate throughput testing
>>>>> where the cpu handling the interrupt can be totally saturated
>>>>> handling both the hard interrupt and threaded handler parts,
>>>>> limiting throughput.
>
> Hi Ming,
>
>>>> Frankly speaking, I have never observed a single CPU saturated
>>>> by one storage completion queue's interrupt load, because the
>>>> CPU is still much quicker than current storage devices.
>>>>
>>>> If there are more drives, one CPU won't handle more than one
>>>> queue's (drive's) interrupts if (nr_drive * nr_hw_queues) <
>>>> nr_cpu_cores.
>>>
>>> Are things this simple? I mean, can you guarantee that fio
>>> processes are evenly distributed as such?
>>
>> That is why I ask you for the details of your test. If you mean
>> hisilicon SAS,
>
> Yes, it is.
>
>> the interrupt load should have been distributed well, given the
>> device has multiple reply queues for distributing interrupt load.
>>
>>>> So could you describe your case in a bit more detail? Then we
>>>> can confirm whether this change is really needed.
>>>
>>> The issue is that the CPU is saturated servicing the hard and
>>> threaded parts of the interrupt together - here's the sort of
>>> thing which we saw previously:
>>>
>>> Before:
>>> CPU    %usr   %sys   %irq   %soft   %idle
>>> all     2.9   13.1    1.2     4.6    78.2
>>>   0     0.0   29.3   10.1    58.6     2.0
>>>   1    18.2   39.4    0.0     1.0    41.4
>>>   2     0.0    2.0    0.0     0.0    98.0
>>>
>>> CPU0 effectively has no idle time.
>>
>> The result just shows the saturation; we need to root cause it
>> instead of working around it via random changes.
>>
>>> Then, by allowing the threaded part to roam:
>>>
>>> After:
>>> CPU    %usr   %sys   %irq   %soft   %idle
>>> all     3.5   18.4    2.7     6.8    68.6
>>>   0     0.0   20.6   29.9    29.9    19.6
>>>   1     0.0   39.8    0.0    50.0    10.2
>>>
>>> Note: I think that I may be able to reduce the hard irq part of
>>> the load in the endpoint driver, but not by enough that we would
>>> no longer see this issue.
>>>
>>>>> For when the interrupt is managed, allow the threaded part to
>>>>> run on all cpus in the irq affinity mask.
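(For reference, the quoted change amounts to something like the sketch
below. This is an illustration only - the assumption here is that it is
applied where the IRQ thread's allowed cpumask is computed, i.e. in
irq_thread_check_affinity() in kernel/irq/manage.c - and it is not the
actual RFC patch.)

static void irq_thread_check_affinity(struct irq_desc *desc,
				      struct irqaction *action)
{
	cpumask_var_t mask;
	bool valid = false;

	if (!test_and_clear_bit(IRQTF_AFFINITY, &action->thread_flags))
		return;

	/* Out of memory: flag the affinity update to be retried later. */
	if (!alloc_cpumask_var(&mask, GFP_KERNEL)) {
		set_bit(IRQTF_AFFINITY, &action->thread_flags);
		return;
	}

	raw_spin_lock_irq(&desc->lock);
	if (cpumask_available(desc->irq_common_data.affinity)) {
		const struct cpumask *m;

		if (irqd_affinity_is_managed(&desc->irq_data))
			/* Proposed: let the thread roam over the whole
			 * affinity mask of the managed interrupt. */
			m = desc->irq_common_data.affinity;
		else
			/* Today: follow the effective affinity, which is
			 * typically a single CPU. */
			m = irq_data_get_effective_affinity_mask(&desc->irq_data);
		cpumask_copy(mask, m);
		valid = true;
	}
	raw_spin_unlock_irq(&desc->lock);

	if (valid)
		set_cpus_allowed_ptr(current, mask);
	free_cpumask_var(mask);
}

With something along these lines, a managed interrupt whose affinity
mask spans several CPUs lets the scheduler place the IRQ thread on any
of them, which is what produces the "After" distribution quoted above.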
>>>> I remember that a performance drop was observed with this
>>>> approach in some tests.
>>>
>>> From checking the thread about the NVMe interrupt swamp, just
>>> switching to a threaded handler alone degrades performance. I
>>> didn't see any specific results for this change from Long Li -
>>> https://lkml.org/lkml/2019/8/21/128
>>
>> I am pretty clear on the reason for Azure: it is caused by
>> aggressive interrupt coalescing. That behaviour shouldn't be very
>> common, and it can be addressed by the following patch:
>>
>> http://lists.infradead.org/pipermail/linux-nvme/2019-November/028008.html
>>
>> Then please share your lockup story, such as: which HBA/drivers,
>> the test steps, whether you complete IOs from multiple disks
>> (LUNs) on a single CPU, whether you have multiple queues, and how
>> many active LUNs are involved in the test, ...
>
> There is no lockup, just a potential performance boost with this
> change.
>
> My colleague Xiang Chen can provide specifics of the test, as he is
> the one running it.
>
> But one key bit of info - which I did not think most relevant
> before - is that we have 2x SAS controllers running the throughput
> test on the same host.
>
> As such, the completion queue interrupts would be spread
> identically over the CPUs for each controller. I notice that the
> ARM GICv3 ITS interrupt controller (which we use) does not use the
> generic irq matrix allocator, which I think would really help with
> this.
>
> Hi Marc,
>
> Is there any reason why we couldn't utilise the generic irq matrix
> allocator for GICv3?

For a start, the ITS code predates the matrix allocator by about
three years. Also, my understanding of this allocator is that it
allows x86 to cope with a very small number of possible interrupt
vectors per CPU. The ITS doesn't have such an issue, as:

1) the namespace is global, and not per CPU
2) the namespace is *huge*

Now, what property of the matrix allocator is the ITS code missing?
I'd be more than happy to improve it.

Thanks,

        M.
-- 
Jazz is not dead. It just smells funny...
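(For readers unfamiliar with it, the generic irq matrix allocator
discussed above lives in kernel/irq/matrix.c. The sketch below is a
rough, hypothetical illustration of how an architecture with a small
per-CPU vector space can lean on it; the example_* names are made up
and this is not the real arch/x86/kernel/apic/vector.c code.)

#include <linux/irq.h>
#include <linux/cpumask.h>
#include <linux/errno.h>

/*
 * The matrix allocator tracks a small vector space per CPU and, on
 * allocation, picks both a target CPU from the requested mask and a
 * free vector on that CPU, spreading load across CPUs. The ITS, by
 * contrast, allocates from one huge global LPI namespace, so it has
 * no per-CPU scarcity of this kind to manage.
 */
static struct irq_matrix *example_matrix;

static int example_matrix_init(unsigned int nr_vectors,
			       unsigned int first_usable)
{
	/* One bitmap per possible CPU; vectors in the range
	 * [first_usable, nr_vectors) are available for allocation. */
	example_matrix = irq_alloc_matrix(nr_vectors, first_usable,
					  nr_vectors);
	return example_matrix ? 0 : -ENOMEM;
}

static int example_assign_vector(const struct cpumask *dest,
				 unsigned int *target_cpu)
{
	/* Returns a free vector number on one CPU in @dest (written to
	 * *target_cpu), or -ENOSPC when every CPU in the mask is full. */
	return irq_matrix_alloc(example_matrix, dest, false, target_cpu);
}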