Date: Fri, 23 Aug 2019 11:21:30 +0800
From: Ming Lei
To: Sagi Grimberg
Cc: longli@linuxonhyperv.com, Ingo Molnar, Peter Zijlstra, Keith Busch,
    Jens Axboe, Christoph Hellwig, linux-nvme@lists.infradead.org,
    linux-kernel@vger.kernel.org, Long Li, Hannes Reinecke,
    linux-scsi@vger.kernel.org, linux-block@vger.kernel.org
Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU with
 flooded interrupts
On Tue, Aug 20, 2019 at 10:33:38AM -0700, Sagi Grimberg wrote:
>
> > From: Long Li
> >
> > When an NVMe hardware queue is mapped to several CPU queues, it is
> > possible that the CPU this hardware queue is bound to is flooded by
> > returning I/O for other CPUs.
> >
> > For example, consider the following scenario:
> > 1. CPU 0, 1, 2 and 3 share the same hardware queue
> > 2. the hardware queue interrupts CPU 0 for I/O response
> > 3. processes from CPU 1, 2 and 3 keep sending I/Os
> >
> > CPU 0 may be flooded with interrupts from the NVMe device that are I/O
> > responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible that
> > CPU 0 spends all its time serving NVMe and other system interrupts, but
> > doesn't have a chance to run in process context.
> >
> > To fix this, CPU 0 can schedule a work item to complete the I/O request
> > when it detects the scheduler is not making progress. This serves
> > multiple purposes:
> >
> > 1. This CPU has to be scheduled to complete the request. The other CPUs
> > can't issue more I/Os until some previous I/Os are completed. This helps
> > this CPU get out of NVMe interrupts.
> >
> > 2. This acts as a throttling mechanism for NVMe devices, in that they
> > cannot starve a CPU while servicing I/Os from other CPUs.
> >
> > 3. This CPU can make progress on RCU and other work items on its queue.
>
> The problem is indeed real, but this is the wrong approach in my mind.
>
> We already have irqpoll which takes care of properly budgeting polling
> cycles and not hogging the cpu.

The issue isn't unique to NVMe; it can hit any fast device that
interrupts the CPU too frequently. Meanwhile, the interrupt/softirq
handler may take quite a bit of time, so the CPU is easily locked up by
the interrupt/softirq handler, especially in the case of multiple
submission CPUs vs. a single completion CPU. Some SCSI devices have the
same problem too.

Could we consider adding one generic mechanism to cover this kind of
problem?

One approach I thought of is to allocate one backup thread for handling
such an interrupt, which can be marked as IRQF_BACKUP_THREAD by drivers.

Inside do_IRQ(), irqtime is accounted. Before calling action->handler(),
check whether this CPU has spent too long handling IRQs (interrupt or
softirq) and whether it could be locked up. If yes, wake up the backup
thread to handle the interrupt and avoid locking up this CPU.

The threaded interrupt framework is already there, and this way could be
easier to implement. Meanwhile, most of the time the handler still runs
in interrupt context, so we may avoid the performance loss when the CPU
isn't busy enough.

Any comment on this approach?

Thanks,
Ming
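
For reference, the irqpoll mechanism Sagi refers to is the generic
irq_poll API in <linux/irq_poll.h>. Below is a minimal sketch of the
usual driver pattern; struct my_queue and the my_*() helpers are invented
for illustration, and only irq_poll_init(), irq_poll_sched() and
irq_poll_complete() are existing kernel interfaces.

	#include <linux/interrupt.h>
	#include <linux/irq_poll.h>

	#define MY_POLL_BUDGET	64

	struct my_queue {
		struct irq_poll	iop;
		/* ... device-specific completion queue state ... */
	};

	/* Driver-specific helpers, invented for this sketch */
	static void my_disable_queue_irq(struct my_queue *q);
	static void my_enable_queue_irq(struct my_queue *q);
	static bool my_reap_one_completion(struct my_queue *q);

	/* Hard IRQ handler: don't complete I/O here, just kick irq_poll */
	static irqreturn_t my_irq_handler(int irq, void *data)
	{
		struct my_queue *q = data;

		my_disable_queue_irq(q);	/* mask further device IRQs */
		irq_poll_sched(&q->iop);	/* run my_poll() from softirq */
		return IRQ_HANDLED;
	}

	/* Softirq poll callback: at most 'budget' completions per call */
	static int my_poll(struct irq_poll *iop, int budget)
	{
		struct my_queue *q = container_of(iop, struct my_queue, iop);
		int done = 0;

		while (done < budget && my_reap_one_completion(q))
			done++;

		if (done < budget) {
			/* Queue drained: stop polling, unmask the IRQ */
			irq_poll_complete(iop);
			my_enable_queue_irq(q);
		}
		return done;
	}

	static void my_queue_init(struct my_queue *q)
	{
		irq_poll_init(&q->iop, MY_POLL_BUDGET, my_poll);
	}

The point of the pattern is that completions are drained in softirq
context with a fixed per-call budget, so a single pass cannot run
unbounded in hard interrupt context.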
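
And a very rough sketch of the backup-thread idea above, just to make the
proposal concrete. Nothing here exists in the kernel today:
IRQF_BACKUP_THREAD, irq_in_danger_of_lockup() and handle_one_irq_action()
are made-up names, and the lockup check is a placeholder for something
built on the existing irqtime accounting. Only request_threaded_irq() and
__irq_wake_thread() (the internal helper the IRQ core already uses to wake
a threaded handler) are existing interfaces.

	#include <linux/interrupt.h>
	#include <linux/irqdesc.h>

	#define IRQF_BACKUP_THREAD	0x00100000	/* hypothetical new flag */

	/*
	 * Placeholder: "has this CPU spent too much time in hard/soft IRQ
	 * context recently?"  A real patch would derive this from the
	 * existing irqtime accounting (CONFIG_IRQ_TIME_ACCOUNTING).
	 */
	static bool irq_in_danger_of_lockup(void);

	static irqreturn_t handle_one_irq_action(struct irq_desc *desc,
						 struct irqaction *action)
	{
		/*
		 * If the driver opted in and this CPU risks being locked up
		 * by interrupt work, defer to the threaded handler instead
		 * of running the hard IRQ handler inline.
		 */
		if ((action->flags & IRQF_BACKUP_THREAD) &&
		    irq_in_danger_of_lockup()) {
			__irq_wake_thread(desc, action);
			return IRQ_HANDLED;
		}

		/* Common case: handle the interrupt in hard IRQ context */
		return action->handler(action->irq, action->dev_id);
	}

A driver opting in would register with request_threaded_irq() and pass
IRQF_BACKUP_THREAD along with its usual flags, so the fast path stays
unchanged whenever the CPU isn't saturated.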