Date: Tue, 20 Aug 2019 11:52:40 +0200
From: Peter Zijlstra
To: longli@linuxonhyperv.com
Cc: Ingo Molnar, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme@lists.infradead.org,
	linux-kernel@vger.kernel.org, Long Li
Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU with
 flooded interrupts
Message-ID: <20190820095240.GH2332@hirez.programming.kicks-ass.net>
References: <1566281669-48212-1-git-send-email-longli@linuxonhyperv.com>
 <1566281669-48212-4-git-send-email-longli@linuxonhyperv.com>
In-Reply-To: <1566281669-48212-4-git-send-email-longli@linuxonhyperv.com>

On Mon, Aug 19, 2019 at 11:14:29PM -0700, longli@linuxonhyperv.com wrote:
> From: Long Li
>
> When an NVMe hardware queue is mapped to several CPU queues, it is
> possible that the CPU this hardware queue is bound to is flooded by
> returning I/O for other CPUs.
>
> For example, consider the following scenario:
> 1. CPU 0, 1, 2 and 3 share the same hardware queue
> 2. the hardware queue interrupts CPU 0 for I/O response
> 3. processes from CPU 1, 2 and 3 keep sending I/Os
>
> CPU 0 may be flooded with interrupts from the NVMe device that are I/O
> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible that
> CPU 0 spends all its time servicing NVMe and other system interrupts and
> never gets a chance to run in process context.

Ideally -- and there is some code to effect this -- the load-balancer
will move tasks away from this CPU.

> To fix this, CPU 0 can schedule a work to complete the I/O request when
> it detects the scheduler is not making progress. This serves multiple
> purposes:

Suppose the task waiting for the I/O completion is an RT task, and you've
just queued its completion to a regular work. That is an instant priority
inversion.

> 1. This CPU has to be scheduled to complete the request. The other CPUs
>    can't issue more I/Os until some previous I/Os are completed. This
>    helps this CPU get out of NVMe interrupts.
>
> 2. This acts as a throttling mechanism for NVMe devices, in that the
>    device cannot starve a CPU while servicing I/Os from other CPUs.
>
> 3. This CPU can make progress on RCU and other work items on its queue.
>
> Signed-off-by: Long Li
> ---
>  drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
>  drivers/nvme/host/nvme.h |  1 +
>  2 files changed, 57 insertions(+), 1 deletion(-)

WTH does this live in the NVMe driver? Surely something like this should
be in the block layer. I'm thinking there's Fibre Channel connected
storage that should be able to trigger much the same issues.
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 6a9dd68c0f4f..576bb6fce293 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -260,9 +270,54 @@ static void nvme_retry_req(struct request *req)
>  	blk_mq_delay_kick_requeue_list(req->q, delay);
>  }
>
> +static void nvme_complete_rq_work(struct work_struct *work)
> +{
> +	struct nvme_request *nvme_rq =
> +		container_of(work, struct nvme_request, work);
> +	struct request *req = blk_mq_rq_from_pdu(nvme_rq);
> +
> +	nvme_complete_rq(req);
> +}
> +
> +
>  void nvme_complete_rq(struct request *req)
>  {
> -	blk_status_t status = nvme_error_status(req);
> +	blk_status_t status;
> +	int cpu;
> +	u64 switches;
> +	struct nvme_request *nvme_rq;
> +
> +	if (!in_interrupt())
> +		goto skip_check;
> +
> +	nvme_rq = nvme_req(req);
> +	cpu = smp_processor_id();
> +	if (idle_cpu(cpu))
> +		goto skip_check;
> +
> +	/* Check if this CPU is flooded with interrupts */
> +	switches = get_cpu_rq_switches(cpu);
> +	if (this_cpu_read(last_switch) == switches) {
> +		/*
> +		 * If this CPU hasn't made a context switch in
> +		 * MAX_SCHED_TIMEOUT ns (and it's not idle), schedule a work
> +		 * to complete this I/O. This forces this CPU to run
> +		 * non-interrupt code and throttles the other CPU issuing
> +		 * the I/O.
> +		 */

What if there were only a single task on that CPU? Then we'd never
need/want to context switch in the first place.

AFAICT all this is just a whole bunch of gruesome hacks piled on top of
one another.

> +		if (sched_clock() - this_cpu_read(last_clock)
> +				> MAX_SCHED_TIMEOUT) {
> +			INIT_WORK(&nvme_rq->work, nvme_complete_rq_work);
> +			schedule_work_on(cpu, &nvme_rq->work);
> +			return;
> +		}
> +
> +	} else {
> +		this_cpu_write(last_switch, switches);
> +		this_cpu_write(last_clock, sched_clock());
> +	}
> +
> +skip_check:

Aside from everything else, this is just sodding poor coding style. What
is wrong with something like this (a full sketch follows at the end of
this mail):

	if (nvme_complete_throttle(...))
		return;

> +	status = nvme_error_status(req);
>
>  	trace_nvme_complete_rq(req);
>
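For illustration, a completely untested sketch of that refactor. It
reuses the patch's own proposed helpers and per-CPU state --
get_cpu_rq_switches(), last_switch, last_clock, MAX_SCHED_TIMEOUT and the
nvme_request::work member -- none of which exist in mainline;
nvme_complete_throttle() is likewise just a made-up name:

	/*
	 * Decide whether this completion must be punted to a workqueue
	 * because the local CPU has been stuck in interrupt context for
	 * too long. Returns true when the completion has been deferred.
	 */
	static bool nvme_complete_throttle(struct request *req)
	{
		struct nvme_request *nvme_rq = nvme_req(req);
		int cpu;
		u64 switches;

		if (!in_interrupt())
			return false;

		cpu = smp_processor_id();
		if (idle_cpu(cpu))
			return false;

		/* Has this CPU context switched since we last looked? */
		switches = get_cpu_rq_switches(cpu);
		if (this_cpu_read(last_switch) != switches) {
			/* Yes; record the progress, complete in-line. */
			this_cpu_write(last_switch, switches);
			this_cpu_write(last_clock, sched_clock());
			return false;
		}

		if (sched_clock() - this_cpu_read(last_clock) <=
				MAX_SCHED_TIMEOUT)
			return false;

		/*
		 * No context switch in MAX_SCHED_TIMEOUT ns on a busy CPU;
		 * defer the completion so this CPU gets to run process
		 * context again.
		 */
		INIT_WORK(&nvme_rq->work, nvme_complete_rq_work);
		schedule_work_on(cpu, &nvme_rq->work);
		return true;
	}

which turns the top of nvme_complete_rq() into:

	void nvme_complete_rq(struct request *req)
	{
		blk_status_t status;

		if (nvme_complete_throttle(req))
			return;

		status = nvme_error_status(req);
		...
	}

That keeps the fast path free of gotos and puts all the heuristics in
one place.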