Date: Thu, 22 Aug 2019 09:33:57 +0800
From: Ming Lei
To: Long Li
Cc: Keith Busch, Sagi Grimberg, chenxiang, Peter Zijlstra, Ming Lei,
    John Garry, Linux Kernel Mailing List, linux-nvme, Jens Axboe,
    Ingo Molnar, Thomas Gleixner, Christoph Hellwig,
    "longli@linuxonhyperv.com"
Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
Message-ID: <20190822013356.GC28635@ming.t460p>
References: <1566281669-48212-1-git-send-email-longli@linuxonhyperv.com>
 <20190821094406.GA28391@ming.t460p>

On Wed, Aug 21, 2019 at 04:27:00PM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
> >>>
> >>>On Wed, Aug 21, 2019 at 07:47:44AM +0000, Long Li wrote:
> >>>> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
> >>>> >>>
> >>>> >>>On 20/08/2019 09:25, Ming Lei wrote:
> >>>> >>>> On Tue, Aug 20, 2019 at 2:14 PM <longli@linuxonhyperv.com> wrote:
> >>>> >>>>>
> >>>> >>>>> From: Long Li
> >>>> >>>>>
> >>>> >>>>> This patch set tries to fix interrupt swamp in NVMe devices.
> >>>> >>>>>
> >>>> >>>>> On large systems with many CPUs, a number of CPUs may share one
> >>>> >>>>> NVMe hardware queue. It can happen that several CPUs are issuing
> >>>> >>>>> I/Os, and all the I/Os are returned on the CPU to which the
> >>>> >>>>> hardware queue is bound. This may result in that CPU being
> >>>> >>>>> swamped by interrupts and staying in interrupt mode for an
> >>>> >>>>> extended time while other CPUs continue to issue I/O. This can
> >>>> >>>>> trigger watchdog and RCU timeouts, and make the system
> >>>> >>>>> unresponsive.
> >>>> >>>>>
> >>>> >>>>> This patch set addresses this by enforcing scheduling and
> >>>> >>>>> throttling I/O when the CPU is starved in this situation.
> >>>> >>>>>
> >>>> >>>>> Long Li (3):
> >>>> >>>>>   sched: define a function to report the number of context
> >>>> >>>>>     switches on a CPU
> >>>> >>>>>   sched: export idle_cpu()
> >>>> >>>>>   nvme: complete request in work queue on CPU with flooded
> >>>> >>>>>     interrupts
> >>>> >>>>>
> >>>> >>>>>  drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
> >>>> >>>>>  drivers/nvme/host/nvme.h |  1 +
> >>>> >>>>>  include/linux/sched.h    |  2 ++
> >>>> >>>>>  kernel/sched/core.c      |  7 +++++
> >>>> >>>>>  4 files changed, 66 insertions(+), 1 deletion(-)
> >>>> >>>>
> >>>> >>>> Another, simpler solution may be to complete the request in a
> >>>> >>>> threaded interrupt handler for this case. Meanwhile, allow the
> >>>> >>>> scheduler to run the interrupt thread handler on the CPUs
> >>>> >>>> specified by the irq affinity mask, which was discussed in the
> >>>> >>>> following link:
> >>>> >>>>
> >>>> >>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
> >>>> >>>>
> >>>> >>>> Could you try the above solution and see if the lockup can be
> >>>> >>>> avoided? John Garry should have a workable patch.
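For reference, here is a minimal, hypothetical sketch of the threaded-interrupt
idea suggested above, using the stock request_threaded_irq() API: the hard
handler only acknowledges the hardware and returns IRQ_WAKE_THREAD, and the IRQ
thread then reaps completions in process context. The struct and helper names
(my_queue, my_queue_ack(), my_queue_reap_completions()) are invented for
illustration and are not taken from nvme-pci or any other real driver.

/*
 * Minimal sketch (not actual driver code): defer completion work from
 * hard-irq context to the IRQ thread.
 */
#include <linux/interrupt.h>

struct my_queue {
	int irq;
	/* ... completion queue state ... */
};

/* Placeholder hardware helpers, assumed for illustration only. */
static void my_queue_ack(struct my_queue *q) { /* mask/ack the hw irq */ }
static void my_queue_reap_completions(struct my_queue *q) { /* complete requests */ }

static irqreturn_t my_queue_hardirq(int irq, void *data)
{
	struct my_queue *q = data;

	my_queue_ack(q);
	/* Defer the real completion work to the IRQ thread. */
	return IRQ_WAKE_THREAD;
}

static irqreturn_t my_queue_threadfn(int irq, void *data)
{
	struct my_queue *q = data;

	/* Runs in a kernel thread, so it can be preempted and scheduled. */
	my_queue_reap_completions(q);
	return IRQ_HANDLED;
}

static int my_queue_request_irq(struct my_queue *q)
{
	return request_threaded_irq(q->irq, my_queue_hardirq,
				    my_queue_threadfn, 0, "my_queue", q);
}

By default the IRQ thread is affine to the hard irq's effective affinity,
typically a single CPU; relaxing that constraint is exactly what the genirq
patch quoted below is about.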
> >>>> >>>
> >>>> >>>Yeah, so we experimented with changing the interrupt handling in
> >>>> >>>the SCSI driver I maintain to use a threaded IRQ handler plus the
> >>>> >>>patch below, and saw a significant throughput boost:
> >>>> >>>
> >>>> >>>--->8
> >>>> >>>
> >>>> >>>Subject: [PATCH] genirq: Add support to allow thread to use hard
> >>>> >>>irq affinity
> >>>> >>>
> >>>> >>>Currently the cpu allowed mask for the threaded part of a threaded
> >>>> >>>irq handler will be set to the effective affinity of the hard irq.
> >>>> >>>
> >>>> >>>Typically the effective affinity of the hard irq will be for a
> >>>> >>>single cpu. As such, the threaded handler would always run on the
> >>>> >>>same cpu as the hard irq.
> >>>> >>>
> >>>> >>>We have seen scenarios in high data-rate throughput testing where
> >>>> >>>the cpu handling the interrupt can be totally saturated handling
> >>>> >>>both the hard interrupt and threaded handler parts, limiting
> >>>> >>>throughput.
> >>>> >>>
> >>>> >>>Add an IRQF_IRQ_AFFINITY flag to allow the driver requesting the
> >>>> >>>threaded interrupt to decide the policy of which cpu the threaded
> >>>> >>>handler may run on.
> >>>> >>>
> >>>> >>>Signed-off-by: John Garry
> >>>>
> >>>> Thanks for pointing me to this patch. This fixed the interrupt swamp
> >>>> and made the system stable.
> >>>>
> >>>> However, I'm seeing reduced performance when using threaded
> >>>> interrupts.
> >>>>
> >>>> Here are the test results on a system with 80 CPUs and 10 NVMe disks
> >>>> (32 hardware queues for each disk). The benchmark tool is FIO; I/O
> >>>> pattern: 4k random reads on all NVMe disks, with queue depth = 64,
> >>>> number of jobs = 80, direct=1.
> >>>>
> >>>> With threaded interrupts: 1320k IOPS
> >>>> With just interrupts: 3720k IOPS
> >>>> With just interrupts and my patch: 3700k IOPS
> >>>
> >>>This gap looks too big wrt. threaded interrupts vs. interrupts.
> >>>
> >>>>
> >>>> At the peak IOPS, the overall CPU usage is at around 98-99%. I think
> >>>> the cost of the wakeup and context switch for the NVMe threaded IRQ
> >>>> handler takes some CPU away.
> >>>>
> >>>
> >>>In theory, it shouldn't be so, because most of the time the thread
> >>>should be running on the CPUs of this hctx, and the wakeup cost
> >>>shouldn't be so big. Maybe there is a performance problem somewhere
> >>>wrt. threaded interrupts.
> >>>
> >>>Could you share your test script and environment with us? I will see
> >>>if I can reproduce it in my environment.
>
> Ming, do you have access to L80s_v2 in Azure? This test needs to run on
> that VM size.
>
> Here is the command to benchmark it:
>
> fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
>

I can reproduce the issue on one machine (96 cores) with 4 NVMes (32
queues), so each queue is served by 3 CPUs. IOPS drops by more than 20%
when 'use_threaded_interrupts' is enabled, and the fio log shows CPU
context switches increasing a lot.

Thanks,
Ming
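For comparison with the threaded-interrupt numbers above, here is a rough
sketch of the general shape of the approach in Long Li's series as summarized
in the cover letter: detect from the interrupt handler that the local CPU is
being flooded and bounce the completion work to a workqueue. The helpers
this_cpu_is_flooded() and my_cq_reap() are placeholders; the posted patches
reportedly rely on idle_cpu() and the per-CPU context switch count, which are
not reproduced here.

/*
 * Rough sketch (not the code from the posted series): defer NVMe
 * completion work to a workqueue when the local CPU is flooded by
 * interrupts, so the scheduler, watchdog and RCU can make progress.
 */
#include <linux/kernel.h>
#include <linux/interrupt.h>
#include <linux/workqueue.h>

struct my_cq {
	struct work_struct reap_work;
	/* ... completion queue state ... */
};

/* Placeholder: walk the CQ and complete outstanding requests. */
static void my_cq_reap(struct my_cq *cq) { }

/*
 * Placeholder heuristic. The series reportedly uses scheduler hints
 * such as idle_cpu() and the per-CPU context switch count.
 */
static bool this_cpu_is_flooded(void)
{
	return false;
}

static void my_cq_reap_work(struct work_struct *work)
{
	struct my_cq *cq = container_of(work, struct my_cq, reap_work);

	my_cq_reap(cq);
}

static void my_cq_init(struct my_cq *cq)
{
	INIT_WORK(&cq->reap_work, my_cq_reap_work);
}

static irqreturn_t my_cq_irq(int irq, void *data)
{
	struct my_cq *cq = data;

	if (this_cpu_is_flooded()) {
		/* Punt completions to process context. */
		schedule_work(&cq->reap_work);
		return IRQ_HANDLED;
	}

	my_cq_reap(cq);
	return IRQ_HANDLED;
}

Deferring only when the CPU looks flooded keeps the common case in hard-irq
context, which may be why the "interrupts and my patch" result above stays
close to the plain-interrupt number while the always-threaded configuration
loses throughput.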