2019-08-20 06:15:54

by Long Li

Subject: [PATCH 0/3] fix interrupt swamp in NVMe

From: Long Li <[email protected]>

This patch set tries to fix interrupt swamp in NVMe devices.

On large systems with many CPUs, a number of CPUs may share one NVMe hardware
queue. This can lead to a situation where several CPUs are issuing I/Os and
all the completions are returned on the CPU to which the hardware queue is
bound. That CPU may be swamped by interrupts and stay in interrupt mode for
an extended time while other CPUs continue to issue I/O. This can trigger
watchdog and RCU timeouts, and make the system unresponsive.

This patch set addresses this by enforcing scheduling and throttling I/O when
a CPU is starved in this situation.

Long Li (3):
sched: define a function to report the number of context switches on a
CPU
sched: export idle_cpu()
nvme: complete request in work queue on CPU with flooded interrupts

drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
drivers/nvme/host/nvme.h | 1 +
include/linux/sched.h | 2 ++
kernel/sched/core.c | 7 +++++
4 files changed, 66 insertions(+), 1 deletion(-)

--
2.17.1


2019-08-20 06:16:03

by Long Li

Subject: [PATCH 2/3] sched: export idle_cpu()

From: Long Li <[email protected]>

This function is useful for device drivers to check whether the current CPU
has work to do in process context.
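
For illustration, a driver's completion path could use it roughly as in the
sketch below (cf. patch 3/3 in this series; deferred_completion_work is a
placeholder work item, not part of this patch):

	/* Hypothetical use in a driver's completion path */
	int cpu = smp_processor_id();

	if (in_interrupt() && !idle_cpu(cpu)) {
		/* This CPU has process-context work pending: defer completion */
		schedule_work_on(cpu, &deferred_completion_work);
		return;
	}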

Signed-off-by: Long Li <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 575f1ef7b159..a209941c1770 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1501,6 +1501,7 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags)
extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed);
extern u64 get_cpu_rq_switches(int cpu);
+extern int idle_cpu(int cpu);
#ifdef CONFIG_SMP
extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a76f0e97c2d..d1cedfb38174 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4023,6 +4023,7 @@ int idle_cpu(int cpu)

return 1;
}
+EXPORT_SYMBOL_GPL(idle_cpu);

/**
* available_idle_cpu - is a given CPU idle for enqueuing work.
--
2.17.1

2019-08-20 06:16:28

by Long Li

Subject: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts

From: Long Li <[email protected]>

When an NVMe hardware queue is mapped to several CPU queues, it is possible
that the CPU this hardware queue is bound to is flooded by I/O completions
returned on behalf of other CPUs.

For example, consider the following scenario:
1. CPUs 0, 1, 2 and 3 share the same hardware queue
2. the hardware queue interrupts CPU 0 for I/O responses
3. processes on CPUs 1, 2 and 3 keep sending I/Os

CPU 0 may be flooded with interrupts from the NVMe device that are I/O
responses for CPUs 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0
spends all of its time serving NVMe and other system interrupts and never gets
a chance to run in process context.

To fix this, CPU 0 can schedule a work item to complete the I/O request when
it detects that the scheduler is not making progress. This serves multiple
purposes:

1. This CPU has to be scheduled to complete the request. The other CPUs can't
issue more I/Os until some previous I/Os are completed. This helps this CPU
get out of NVMe interrupt handling.

2. This acts as a throttling mechanism for NVMe devices, in that they cannot
starve a CPU while servicing I/Os from other CPUs.

3. This CPU can make progress on RCU and other work items on its queue.

Signed-off-by: Long Li <[email protected]>
---
drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
drivers/nvme/host/nvme.h | 1 +
2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6a9dd68c0f4f..576bb6fce293 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -28,6 +28,7 @@
#include <linux/t10-pi.h>
#include <linux/pm_qos.h>
#include <asm/unaligned.h>
+#include <linux/sched.h>

#define CREATE_TRACE_POINTS
#include "trace.h"
@@ -97,6 +98,15 @@ static dev_t nvme_chr_devt;
static struct class *nvme_class;
static struct class *nvme_subsys_class;

+/*
+ * The following are for detecting if this CPU is flooded with interrupts.
+ * The timestamp for the last context switch is recorded. If that is at least
+ * MAX_SCHED_TIMEOUT ago, try to recover from interrupt flood
+ */
+static DEFINE_PER_CPU(u64, last_switch);
+static DEFINE_PER_CPU(u64, last_clock);
+#define MAX_SCHED_TIMEOUT 2000000000 /* 2 seconds in ns */
+
static int nvme_revalidate_disk(struct gendisk *disk);
static void nvme_put_subsystem(struct nvme_subsystem *subsys);
static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
@@ -260,9 +270,54 @@ static void nvme_retry_req(struct request *req)
blk_mq_delay_kick_requeue_list(req->q, delay);
}

+static void nvme_complete_rq_work(struct work_struct *work)
+{
+ struct nvme_request *nvme_rq =
+ container_of(work, struct nvme_request, work);
+ struct request *req = blk_mq_rq_from_pdu(nvme_rq);
+
+ nvme_complete_rq(req);
+}
+
+
void nvme_complete_rq(struct request *req)
{
- blk_status_t status = nvme_error_status(req);
+ blk_status_t status;
+ int cpu;
+ u64 switches;
+ struct nvme_request *nvme_rq;
+
+ if (!in_interrupt())
+ goto skip_check;
+
+ nvme_rq = nvme_req(req);
+ cpu = smp_processor_id();
+ if (idle_cpu(cpu))
+ goto skip_check;
+
+ /* Check if this CPU is flooded with interrupts */
+ switches = get_cpu_rq_switches(cpu);
+ if (this_cpu_read(last_switch) == switches) {
+ /*
+ * If this CPU hasn't made a context switch in
+ * MAX_SCHED_TIMEOUT ns (and it's not idle), schedule a work item
+ * to complete this I/O. This forces this CPU to run non-interrupt
+ * code and throttles the other CPUs issuing the I/O
+ */
+ if (sched_clock() - this_cpu_read(last_clock)
+ > MAX_SCHED_TIMEOUT) {
+ INIT_WORK(&nvme_rq->work, nvme_complete_rq_work);
+ schedule_work_on(cpu, &nvme_rq->work);
+ return;
+ }
+
+ } else {
+ this_cpu_write(last_switch, switches);
+ this_cpu_write(last_clock, sched_clock());
+ }
+
+skip_check:
+ status = nvme_error_status(req);

trace_nvme_complete_rq(req);

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0a4a7f5e0de7..a8876e69e476 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -113,6 +113,7 @@ struct nvme_request {
u8 flags;
u16 status;
struct nvme_ctrl *ctrl;
+ struct work_struct work;
};

/*
--
2.17.1

2019-08-20 08:28:45

by Ming Lei

Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe

On Tue, Aug 20, 2019 at 2:14 PM <[email protected]> wrote:
>
> From: Long Li <[email protected]>
>
> This patch set tries to fix interrupt swamp in NVMe devices.
>
> On large systems with many CPUs, a number of CPUs may share one NVMe hardware
> queue. It may have this situation where several CPUs are issuing I/Os, and
> all the I/Os are returned on the CPU where the hardware queue is bound to.
> This may result in that CPU swamped by interrupts and stay in interrupt mode
> for extended time while other CPUs continue to issue I/O. This can trigger
> Watchdog and RCU timeout, and make the system unresponsive.
>
> This patch set addresses this by enforcing scheduling and throttling I/O when
> CPU is starved in this situation.
>
> Long Li (3):
> sched: define a function to report the number of context switches on a
> CPU
> sched: export idle_cpu()
> nvme: complete request in work queue on CPU with flooded interrupts
>
> drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
> drivers/nvme/host/nvme.h | 1 +
> include/linux/sched.h | 2 ++
> kernel/sched/core.c | 7 +++++
> 4 files changed, 66 insertions(+), 1 deletion(-)

Another simpler solution may be to complete request in threaded interrupt
handler for this case. Meantime allow scheduler to run the interrupt thread
handler on CPUs specified by the irq affinity mask, which was discussed by
the following link:

https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/

Could you try the above solution and see if the lockup can be avoided?
John Garry should have a workable patch.
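
For reference, the conversion amounts to something like the sketch below
(illustrative only: the handler names are placeholders, and a real conversion
would reuse the driver's existing CQ processing helpers):

	/* Hard handler: quick check, then wake the irq thread */
	static irqreturn_t nvme_irq_check(int irq, void *data)
	{
		struct nvme_queue *nvmeq = data;

		if (nvme_cqe_pending(nvmeq))
			return IRQ_WAKE_THREAD;
		return IRQ_NONE;
	}

	/* Thread: reap completions in schedulable process context */
	static irqreturn_t nvme_irq_thread(int irq, void *data)
	{
		struct nvme_queue *nvmeq = data;
		u16 start, end;

		nvme_process_cq(nvmeq, &start, &end, -1);
		nvme_complete_cqes(nvmeq, start, end);
		return IRQ_HANDLED;
	}

	/* In queue setup (hypothetical): */
	ret = request_threaded_irq(pci_irq_vector(pdev, nvmeq->cq_vector),
				   nvme_irq_check, nvme_irq_thread,
				   IRQF_SHARED, "nvme", nvmeq);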

Thanks,
Ming Lei

2019-08-20 09:02:26

by John Garry

Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe

On 20/08/2019 09:25, Ming Lei wrote:
> On Tue, Aug 20, 2019 at 2:14 PM <[email protected]> wrote:
>>
>> From: Long Li <[email protected]>
>>
>> This patch set tries to fix interrupt swamp in NVMe devices.
>>
>> On large systems with many CPUs, a number of CPUs may share one NVMe hardware
>> queue. It may have this situation where several CPUs are issuing I/Os, and
>> all the I/Os are returned on the CPU where the hardware queue is bound to.
>> This may result in that CPU swamped by interrupts and stay in interrupt mode
>> for extended time while other CPUs continue to issue I/O. This can trigger
>> Watchdog and RCU timeout, and make the system unresponsive.
>>
>> This patch set addresses this by enforcing scheduling and throttling I/O when
>> CPU is starved in this situation.
>>
>> Long Li (3):
>> sched: define a function to report the number of context switches on a
>> CPU
>> sched: export idle_cpu()
>> nvme: complete request in work queue on CPU with flooded interrupts
>>
>> drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
>> drivers/nvme/host/nvme.h | 1 +
>> include/linux/sched.h | 2 ++
>> kernel/sched/core.c | 7 +++++
>> 4 files changed, 66 insertions(+), 1 deletion(-)
>
> Another simpler solution may be to complete request in threaded interrupt
> handler for this case. Meantime allow scheduler to run the interrupt thread
> handler on CPUs specified by the irq affinity mask, which was discussed by
> the following link:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Could you try the above solution and see if the lockup can be avoided?
> John Garry
> should have workable patch.

Yeah, so we experimented with changing the interrupt handling in the
SCSI driver I maintain to use a threaded IRQ handler plus the patch
below, and saw a significant throughput boost:

--->8

Subject: [PATCH] genirq: Add support to allow thread to use hard irq
affinity

Currently the cpu allowed mask for the threaded part of a threaded irq
handler will be set to the effective affinity of the hard irq.

Typically the effective affinity of the hard irq will be for a single
cpu. As such, the threaded handler would always run on the same cpu as
the hard irq.

We have seen scenarios in high data-rate throughput testing where the
cpu handling the interrupt can be totally saturated handling both the
hard interrupt and threaded handler parts, limiting throughput.

Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
interrupt to decide on the policy of which cpu the threaded handler
may run.

Signed-off-by: John Garry <[email protected]>

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a99b2a..48e8b955989a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
* interrupt handler after suspending interrupts. For system
* wakeup devices users need to implement wakeup detection in
* their interrupt handlers.
+ * IRQF_IRQ_AFFINITY - Use the hard interrupt affinity for setting the cpu
+ * allowed mask for the threaded handler of a threaded interrupt
+ * handler, rather than the effective hard irq affinity.
*/
#define IRQF_SHARED 0x00000080
#define IRQF_PROBE_SHARED 0x00000100
@@ -74,6 +77,7 @@
#define IRQF_NO_THREAD 0x00010000
#define IRQF_EARLY_RESUME 0x00020000
#define IRQF_COND_SUSPEND 0x00040000
+#define IRQF_IRQ_AFFINITY 0x00080000

#define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f179bf77..cb483a055512 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -966,9 +966,13 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
* mask pointer. For CPU_MASK_OFFSTACK=n this is optimized out.
*/
if (cpumask_available(desc->irq_common_data.affinity)) {
+ struct irq_data *irq_data = &desc->irq_data;
const struct cpumask *m;

- m = irq_data_get_effective_affinity_mask(&desc->irq_data);
+ if (action->flags & IRQF_IRQ_AFFINITY)
+ m = desc->irq_common_data.affinity;
+ else
+ m = irq_data_get_effective_affinity_mask(irq_data);
cpumask_copy(mask, m);
} else {
valid = false;
--
2.17.1

As Ming mentioned in that same thread, we could even make this policy
for managed interrupts.

Cheers,
John

>
> Thanks,
> Ming Lei
>
> .
>


2019-08-20 09:54:01

by Peter Zijlstra

Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts

On Mon, Aug 19, 2019 at 11:14:29PM -0700, [email protected] wrote:
> From: Long Li <[email protected]>
>
> When a NVMe hardware queue is mapped to several CPU queues, it is possible
> that the CPU this hardware queue is bound to is flooded by returning I/O for
> other CPUs.
>
> For example, consider the following scenario:
> 1. CPU 0, 1, 2 and 3 share the same hardware queue
> 2. the hardware queue interrupts CPU 0 for I/O response
> 3. processes from CPU 1, 2 and 3 keep sending I/Os
>
> CPU 0 may be flooded with interrupts from NVMe device that are I/O responses
> for CPU 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends
> all the time serving NVMe and other system interrupts, but doesn't have a
> chance to run in process context.

Ideally -- and there is some code to effect this -- the load-balancer will
move tasks away from this CPU.

> To fix this, CPU 0 can schedule a work to complete the I/O request when it
> detects the scheduler is not making progress. This serves multiple purposes:

Suppose the task waiting for the IO completion is an RT task, and you've
just queued it to a regular work item. This is an instant priority inversion.

> 1. This CPU has to be scheduled to complete the request. The other CPUs can't
> issue more I/Os until some previous I/Os are completed. This helps this CPU
> get out of NVMe interrupts.
>
> 2. This acts a throttling mechanisum for NVMe devices, in that it can not
> starve a CPU while servicing I/Os from other CPUs.
>
> 3. This CPU can make progress on RCU and other work items on its queue.
>
> Signed-off-by: Long Li <[email protected]>
> ---
> drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
> drivers/nvme/host/nvme.h | 1 +
> 2 files changed, 57 insertions(+), 1 deletion(-)

WTH does this live in the NVMe driver? Surely something like this should
be in the block layer. I'm thinking there's Fibre Channel connected
storage that should be able to trigger much the same issues.

> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 6a9dd68c0f4f..576bb6fce293 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c

> @@ -260,9 +270,54 @@ static void nvme_retry_req(struct request *req)
> blk_mq_delay_kick_requeue_list(req->q, delay);
> }
>
> +static void nvme_complete_rq_work(struct work_struct *work)
> +{
> + struct nvme_request *nvme_rq =
> + container_of(work, struct nvme_request, work);
> + struct request *req = blk_mq_rq_from_pdu(nvme_rq);
> +
> + nvme_complete_rq(req);
> +}
> +
> +
> void nvme_complete_rq(struct request *req)
> {
> - blk_status_t status = nvme_error_status(req);
> + blk_status_t status;
> + int cpu;
> + u64 switches;
> + struct nvme_request *nvme_rq;
> +
> + if (!in_interrupt())
> + goto skip_check;
> +
> + nvme_rq = nvme_req(req);
> + cpu = smp_processor_id();
> + if (idle_cpu(cpu))
> + goto skip_check;
> +
> + /* Check if this CPU is flooded with interrupts */
> + switches = get_cpu_rq_switches(cpu);
> + if (this_cpu_read(last_switch) == switches) {
> + /*
> + * If this CPU hasn't made a context switch in
> + * MAX_SCHED_TIMEOUT ns (and it's not idle), schedule a work to
> + * complete this I/O. This forces this CPU run non-interrupt
> + * code and throttle the other CPU issuing the I/O
> + */

What if there was only a single task on that CPU? Then we'd never
need/want to context switch in the first place.

AFAICT all this is just a whole bunch of gruesome hacks piled on top one
another.

> + if (sched_clock() - this_cpu_read(last_clock)
> + > MAX_SCHED_TIMEOUT) {
> + INIT_WORK(&nvme_rq->work, nvme_complete_rq_work);
> + schedule_work_on(cpu, &nvme_rq->work);
> + return;
> + }
> +
> + } else {
> + this_cpu_write(last_switch, switches);
> + this_cpu_write(last_clock, sched_clock());
> + }
> +
> +skip_check:

Aside from everything else, this is just sodding poor coding style. What
is wrong with something like:

if (nvme_complete_throttle(...))
return;

> + status = nvme_error_status(req);
>
> trace_nvme_complete_rq(req);
>
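
For reference, such a refactoring could look roughly like the sketch below.
This is only a sketch: it reuses the per-CPU state and deferral logic from
the patch above, and the helper name nvme_complete_throttle() is just the
one suggested above, not an existing function.

	static bool nvme_complete_throttle(struct request *req)
	{
		struct nvme_request *nvme_rq = nvme_req(req);
		u64 switches;
		int cpu;

		if (!in_interrupt())
			return false;

		cpu = smp_processor_id();
		if (idle_cpu(cpu))
			return false;

		/* Check if this CPU is flooded with interrupts */
		switches = get_cpu_rq_switches(cpu);
		if (this_cpu_read(last_switch) != switches) {
			this_cpu_write(last_switch, switches);
			this_cpu_write(last_clock, sched_clock());
			return false;
		}

		if (sched_clock() - this_cpu_read(last_clock) <= MAX_SCHED_TIMEOUT)
			return false;

		/* CPU looks flooded: punt the completion to process context */
		INIT_WORK(&nvme_rq->work, nvme_complete_rq_work);
		schedule_work_on(cpu, &nvme_rq->work);
		return true;
	}

	void nvme_complete_rq(struct request *req)
	{
		blk_status_t status;

		if (nvme_complete_throttle(req))
			return;

		status = nvme_error_status(req);

		trace_nvme_complete_rq(req);
		/* ... rest of the original completion path ... */
	}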

2019-08-20 15:10:10

by Keith Busch

Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe

On Tue, Aug 20, 2019 at 01:59:32AM -0700, John Garry wrote:
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index e8f7f179bf77..cb483a055512 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -966,9 +966,13 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
> * mask pointer. For CPU_MASK_OFFSTACK=n this is optimized out.
> */
> if (cpumask_available(desc->irq_common_data.affinity)) {
> + struct irq_data *irq_data = &desc->irq_data;
> const struct cpumask *m;
>
> - m = irq_data_get_effective_affinity_mask(&desc->irq_data);
> + if (action->flags & IRQF_IRQ_AFFINITY)
> + m = desc->irq_common_data.affinity;
> + else
> + m = irq_data_get_effective_affinity_mask(irq_data);
> cpumask_copy(mask, m);
> } else {
> valid = false;
> --
> 2.17.1
>
> As Ming mentioned in that same thread, we could even make this policy
> for managed interrupts.

Ack, I really like this option!

2019-08-20 17:34:57

by Sagi Grimberg

Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts


> From: Long Li <[email protected]>
>
> When a NVMe hardware queue is mapped to several CPU queues, it is possible
> that the CPU this hardware queue is bound to is flooded by returning I/O for
> other CPUs.
>
> For example, consider the following scenario:
> 1. CPU 0, 1, 2 and 3 share the same hardware queue
> 2. the hardware queue interrupts CPU 0 for I/O response
> 3. processes from CPU 1, 2 and 3 keep sending I/Os
>
> CPU 0 may be flooded with interrupts from NVMe device that are I/O responses
> for CPU 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends
> all the time serving NVMe and other system interrupts, but doesn't have a
> chance to run in process context.
>
> To fix this, CPU 0 can schedule a work to complete the I/O request when it
> detects the scheduler is not making progress. This serves multiple purposes:
>
> 1. This CPU has to be scheduled to complete the request. The other CPUs can't
> issue more I/Os until some previous I/Os are completed. This helps this CPU
> get out of NVMe interrupts.
>
> 2. This acts a throttling mechanisum for NVMe devices, in that it can not
> starve a CPU while servicing I/Os from other CPUs.
>
> 3. This CPU can make progress on RCU and other work items on its queue.

The problem is indeed real, but this is the wrong approach in my mind.

We already have irqpoll, which takes care of properly budgeting polling
cycles and not hogging the cpu.

I've sent an RFC for this particular problem before [1]. At the time, IIRC,
Christoph suggested that we poll the first batch directly from the irq
context and reap the rest in the irqpoll handler.

[1]:
http://lists.infradead.org/pipermail/linux-nvme/2016-October/006497.html

How about something like this instead:
--
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 71127a366d3c..84bf16d75109 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -24,6 +24,7 @@
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/sed-opal.h>
#include <linux/pci-p2pdma.h>
+#include <linux/irq_poll.h>

#include "trace.h"
#include "nvme.h"
@@ -32,6 +33,7 @@
#define CQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_completion))

#define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
+#define NVME_POLL_BUDGET_IRQ 256

/*
* These can be higher, but we need to ensure that any command doesn't
@@ -189,6 +191,7 @@ struct nvme_queue {
u32 *dbbuf_cq_db;
u32 *dbbuf_sq_ei;
u32 *dbbuf_cq_ei;
+ struct irq_poll iop;
struct completion delete_done;
};

@@ -1015,6 +1018,23 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
return found;
}

+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
+{
+ struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, iop);
+ struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
+ u16 start, end;
+ int completed;
+
+ completed = nvme_process_cq(nvmeq, &start, &end, budget);
+ nvme_complete_cqes(nvmeq, start, end);
+ if (completed < budget) {
+ irq_poll_complete(&nvmeq->iop);
+ enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
+ }
+
+ return completed;
+}
+
static irqreturn_t nvme_irq(int irq, void *data)
{
struct nvme_queue *nvmeq = data;
@@ -1028,12 +1048,16 @@ static irqreturn_t nvme_irq(int irq, void *data)
rmb();
if (nvmeq->cq_head != nvmeq->last_cq_head)
ret = IRQ_HANDLED;
- nvme_process_cq(nvmeq, &start, &end, -1);
+ nvme_process_cq(nvmeq, &start, &end, NVME_POLL_BUDGET_IRQ);
nvmeq->last_cq_head = nvmeq->cq_head;
wmb();

if (start != end) {
nvme_complete_cqes(nvmeq, start, end);
+ if (nvme_cqe_pending(nvmeq)) {
+ disable_irq_nosync(irq);
+ irq_poll_sched(&nvmeq->iop);
+ }
return IRQ_HANDLED;
}

@@ -1347,6 +1371,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)

static void nvme_free_queue(struct nvme_queue *nvmeq)
{
+ irq_poll_disable(&nvmeq->iop);
dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
if (!nvmeq->sq_cmds)
@@ -1481,6 +1506,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth)
nvmeq->dev = dev;
spin_lock_init(&nvmeq->sq_lock);
spin_lock_init(&nvmeq->cq_poll_lock);
+ irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ, nvme_irqpoll_handler);
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
--

2019-08-21 07:52:41

by Long Li

Subject: RE: [PATCH 0/3] fix interrupt swamp in NVMe

>>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>
>>>On 20/08/2019 09:25, Ming Lei wrote:
>>>> On Tue, Aug 20, 2019 at 2:14 PM <[email protected]> wrote:
>>>>>
>>>>> From: Long Li <[email protected]>
>>>>>
>>>>> This patch set tries to fix interrupt swamp in NVMe devices.
>>>>>
>>>>> On large systems with many CPUs, a number of CPUs may share one
>>>NVMe
>>>>> hardware queue. It may have this situation where several CPUs are
>>>>> issuing I/Os, and all the I/Os are returned on the CPU where the
>>>hardware queue is bound to.
>>>>> This may result in that CPU swamped by interrupts and stay in
>>>>> interrupt mode for extended time while other CPUs continue to issue
>>>>> I/O. This can trigger Watchdog and RCU timeout, and make the system
>>>unresponsive.
>>>>>
>>>>> This patch set addresses this by enforcing scheduling and throttling
>>>>> I/O when CPU is starved in this situation.
>>>>>
>>>>> Long Li (3):
>>>>> sched: define a function to report the number of context switches on a
>>>>> CPU
>>>>> sched: export idle_cpu()
>>>>> nvme: complete request in work queue on CPU with flooded interrupts
>>>>>
>>>>> drivers/nvme/host/core.c | 57
>>>>> +++++++++++++++++++++++++++++++++++++++-
>>>>> drivers/nvme/host/nvme.h | 1 +
>>>>> include/linux/sched.h | 2 ++
>>>>> kernel/sched/core.c | 7 +++++
>>>>> 4 files changed, 66 insertions(+), 1 deletion(-)
>>>>
>>>> Another simpler solution may be to complete request in threaded
>>>> interrupt handler for this case. Meantime allow scheduler to run the
>>>> interrupt thread handler on CPUs specified by the irq affinity mask,
>>>> which was discussed by the following link:
>>>>
>>>>
>>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
>>>>
>>>> Could you try the above solution and see if the lockup can be avoided?
>>>> John Garry
>>>> should have workable patch.
>>>
>>>Yeah, so we experimented with changing the interrupt handling in the SCSI
>>>driver I maintain to use a threaded handler IRQ handler plus patch below,
>>>and saw a significant throughput boost:
>>>
>>>--->8
>>>
>>>Subject: [PATCH] genirq: Add support to allow thread to use hard irq affinity
>>>
>>>Currently the cpu allowed mask for the threaded part of a threaded irq
>>>handler will be set to the effective affinity of the hard irq.
>>>
>>>Typically the effective affinity of the hard irq will be for a single cpu. As such,
>>>the threaded handler would always run on the same cpu as the hard irq.
>>>
>>>We have seen scenarios in high data-rate throughput testing that the cpu
>>>handling the interrupt can be totally saturated handling both the hard
>>>interrupt and threaded handler parts, limiting throughput.
>>>
>>>Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
>>>interrupt to decide on the policy of which cpu the threaded handler may run.
>>>
>>>Signed-off-by: John Garry <[email protected]>

Thanks for pointing me to this patch. This fixed the interrupt swamp and made the system stable.

However I'm seeing reduced performance when using threaded interrupts.

Here are the test results on a system with 80 CPUs and 10 NVMe disks (32 hardware queues for each disk).
The benchmark tool is FIO; I/O pattern: 4k random reads on all NVMe disks, with queue depth = 64, number of jobs = 80, direct=1.

With threaded interrupts: 1320k IOPS
With just interrupts: 3720k IOPS
With just interrupts and my patch: 3700k IOPS

At the peak IOPS, the overall CPU usage is around 98-99%. I think the cost of doing wake-ups and context switches for the NVMe threaded IRQ handler takes some CPU away.

In this test, I made the following change to make use of IRQF_IRQ_AFFINITY for NVMe:

diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
index a1de501a2729..3fb30d16464e 100644
--- a/drivers/pci/irq.c
+++ b/drivers/pci/irq.c
@@ -86,7 +86,7 @@ int pci_request_irq(struct pci_dev *dev, unsigned int nr, irq_handler_t handler,
va_list ap;
int ret;
char *devname;
- unsigned long irqflags = IRQF_SHARED;
+ unsigned long irqflags = IRQF_SHARED | IRQF_IRQ_AFFINITY;

if (!handler)
irqflags |= IRQF_ONESHOT;

Thanks

Long

>>>
>>>diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index
>>>5b8328a99b2a..48e8b955989a 100644
>>>--- a/include/linux/interrupt.h
>>>+++ b/include/linux/interrupt.h
>>>@@ -61,6 +61,9 @@
>>> * interrupt handler after suspending interrupts. For
>>>system
>>> * wakeup devices users need to implement wakeup
>>>detection in
>>> * their interrupt handlers.
>>>+ * IRQF_IRQ_AFFINITY - Use the hard interrupt affinity for setting the cpu
>>>+ * allowed mask for the threaded handler of a threaded
>>>interrupt
>>>+ * handler, rather than the effective hard irq affinity.
>>> */
>>> #define IRQF_SHARED 0x00000080
>>> #define IRQF_PROBE_SHARED 0x00000100
>>>@@ -74,6 +77,7 @@
>>> #define IRQF_NO_THREAD 0x00010000
>>> #define IRQF_EARLY_RESUME 0x00020000
>>> #define IRQF_COND_SUSPEND 0x00040000
>>>+#define IRQF_IRQ_AFFINITY 0x00080000
>>>
>>> #define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND |
>>>IRQF_NO_THREAD)
>>>
>>>diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index
>>>e8f7f179bf77..cb483a055512 100644
>>>--- a/kernel/irq/manage.c
>>>+++ b/kernel/irq/manage.c
>>>@@ -966,9 +966,13 @@ irq_thread_check_affinity(struct irq_desc *desc,
>>>struct irqaction *action)
>>> * mask pointer. For CPU_MASK_OFFSTACK=n this is optimized out.
>>> */
>>> if (cpumask_available(desc->irq_common_data.affinity)) {
>>>+ struct irq_data *irq_data = &desc->irq_data;
>>> const struct cpumask *m;
>>>
>>>- m = irq_data_get_effective_affinity_mask(&desc-
>>>>irq_data);
>>>+ if (action->flags & IRQF_IRQ_AFFINITY)
>>>+ m = desc->irq_common_data.affinity;
>>>+ else
>>>+ m = irq_data_get_effective_affinity_mask(irq_data);
>>> cpumask_copy(mask, m);
>>> } else {
>>> valid = false;
>>>--
>>>2.17.1
>>>
>>>As Ming mentioned in that same thread, we could even make this policy for
>>>managed interrupts.
>>>
>>>Cheers,
>>>John
>>>
>>>>
>>>> Thanks,
>>>> Ming Lei
>>>>
>>>> .
>>>>
>>>

2019-08-21 08:40:09

by Long Li

Subject: RE: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts

>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>On Mon, Aug 19, 2019 at 11:14:29PM -0700, [email protected]
>>>wrote:
>>>> From: Long Li <[email protected]>
>>>>
>>>> When a NVMe hardware queue is mapped to several CPU queues, it is
>>>> possible that the CPU this hardware queue is bound to is flooded by
>>>> returning I/O for other CPUs.
>>>>
>>>> For example, consider the following scenario:
>>>> 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
>>>> queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2 and
>>>> 3 keep sending I/Os
>>>>
>>>> CPU 0 may be flooded with interrupts from NVMe device that are I/O
>>>> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
>>>> that CPU 0 spends all the time serving NVMe and other system
>>>> interrupts, but doesn't have a chance to run in process context.
>>>
>>>Ideally -- and there is some code to affect this, the load-balancer will move
>>>tasks away from this CPU.
>>>
>>>> To fix this, CPU 0 can schedule a work to complete the I/O request
>>>> when it detects the scheduler is not making progress. This serves multiple
>>>purposes:
>>>
>>>Suppose the task waiting for the IO completion is a RT task, and you've just
>>>queued it to a regular work. This is an instant priority inversion.

This is a choice. We can either not "lock up" the CPU, or finish the I/O on time from the IRQ handler. I think throttling only happens in extreme conditions, which are rare. The purpose is to make the whole system responsive and happy.

>>>
>>>> 1. This CPU has to be scheduled to complete the request. The other
>>>> CPUs can't issue more I/Os until some previous I/Os are completed.
>>>> This helps this CPU get out of NVMe interrupts.
>>>>
>>>> 2. This acts a throttling mechanisum for NVMe devices, in that it can
>>>> not starve a CPU while servicing I/Os from other CPUs.
>>>>
>>>> 3. This CPU can make progress on RCU and other work items on its queue.
>>>>
>>>> Signed-off-by: Long Li <[email protected]>
>>>> ---
>>>> drivers/nvme/host/core.c | 57
>>>> +++++++++++++++++++++++++++++++++++++++-
>>>> drivers/nvme/host/nvme.h | 1 +
>>>> 2 files changed, 57 insertions(+), 1 deletion(-)
>>>
>>>WTH does this live in the NVME driver? Surely something like this should be
>>>in the block layer. I'm thinking there's fiber channel connected storage that
>>>should be able to trigger much the same issues.

Yes, this can be done in the block layer. I'm not sure of the best way to accomplish this, so I implemented an NVMe patch to help test. The test results are promising in that we get 99.5% of the performance while avoiding CPU lockup. The challenge is to find a way to throttle a fast storage device.

>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index
>>>> 6a9dd68c0f4f..576bb6fce293 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>
>>>> @@ -260,9 +270,54 @@ static void nvme_retry_req(struct request *req)
>>>> blk_mq_delay_kick_requeue_list(req->q, delay); }
>>>>
>>>> +static void nvme_complete_rq_work(struct work_struct *work) {
>>>> + struct nvme_request *nvme_rq =
>>>> + container_of(work, struct nvme_request, work);
>>>> + struct request *req = blk_mq_rq_from_pdu(nvme_rq);
>>>> +
>>>> + nvme_complete_rq(req);
>>>> +}
>>>> +
>>>> +
>>>> void nvme_complete_rq(struct request *req) {
>>>> - blk_status_t status = nvme_error_status(req);
>>>> + blk_status_t status;
>>>> + int cpu;
>>>> + u64 switches;
>>>> + struct nvme_request *nvme_rq;
>>>> +
>>>> + if (!in_interrupt())
>>>> + goto skip_check;
>>>> +
>>>> + nvme_rq = nvme_req(req);
>>>> + cpu = smp_processor_id();
>>>> + if (idle_cpu(cpu))
>>>> + goto skip_check;
>>>> +
>>>> + /* Check if this CPU is flooded with interrupts */
>>>> + switches = get_cpu_rq_switches(cpu);
>>>> + if (this_cpu_read(last_switch) == switches) {
>>>> + /*
>>>> + * If this CPU hasn't made a context switch in
>>>> + * MAX_SCHED_TIMEOUT ns (and it's not idle), schedule a
>>>work to
>>>> + * complete this I/O. This forces this CPU run non-interrupt
>>>> + * code and throttle the other CPU issuing the I/O
>>>> + */
>>>
>>>What if there was only a single task on that CPU? Then we'd never
>>>need/want to context switch in the first place.
>>>
>>>AFAICT all this is just a whole bunch of gruesome hacks piled on top one
>>>another.
>>>
>>>> + if (sched_clock() - this_cpu_read(last_clock)
>>>> + > MAX_SCHED_TIMEOUT) {
>>>> + INIT_WORK(&nvme_rq->work,
>>>nvme_complete_rq_work);
>>>> + schedule_work_on(cpu, &nvme_rq->work);
>>>> + return;
>>>> + }
>>>> +
>>>> + } else {
>>>> + this_cpu_write(last_switch, switches);
>>>> + this_cpu_write(last_clock, sched_clock());
>>>> + }
>>>> +
>>>> +skip_check:
>>>
>>>Aside from everything else; this is just sodding poor coding style. What is
>>>wrong with something like:
>>>
>>> if (nvme_complete_throttle(...))
>>> return;
>>>
>>>> + status = nvme_error_status(req);
>>>>
>>>> trace_nvme_complete_rq(req);
>>>>

2019-08-21 08:40:43

by Long Li

Subject: RE: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts

>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>
>>>> From: Long Li <[email protected]>
>>>>
>>>> When a NVMe hardware queue is mapped to several CPU queues, it is
>>>> possible that the CPU this hardware queue is bound to is flooded by
>>>> returning I/O for other CPUs.
>>>>
>>>> For example, consider the following scenario:
>>>> 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
>>>> queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2 and
>>>> 3 keep sending I/Os
>>>>
>>>> CPU 0 may be flooded with interrupts from NVMe device that are I/O
>>>> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
>>>> that CPU 0 spends all the time serving NVMe and other system
>>>> interrupts, but doesn't have a chance to run in process context.
>>>>
>>>> To fix this, CPU 0 can schedule a work to complete the I/O request
>>>> when it detects the scheduler is not making progress. This serves multiple
>>>purposes:
>>>>
>>>> 1. This CPU has to be scheduled to complete the request. The other
>>>> CPUs can't issue more I/Os until some previous I/Os are completed.
>>>> This helps this CPU get out of NVMe interrupts.
>>>>
>>>> 2. This acts a throttling mechanisum for NVMe devices, in that it can
>>>> not starve a CPU while servicing I/Os from other CPUs.
>>>>
>>>> 3. This CPU can make progress on RCU and other work items on its queue.
>>>
>>>The problem is indeed real, but this is the wrong approach in my mind.
>>>
>>>We already have irqpoll which takes care proper budgeting polling cycles
>>>and not hogging the cpu.
>>>
>>>I've sent rfc for this particular problem before [1]. At the time IIRC,
>>>Christoph suggested that we will poll the first batch directly from the irq
>>>context and reap the rest in irqpoll handler.

Thanks for the pointer. I will test and report back.

>>>
>>>[1]:
>>>http://lists.infradead.org/pipermail/linux-nvme/2016-October/006497.html
>>>
>>>How about something like this instead:
>>>--
>>>diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index
>>>71127a366d3c..84bf16d75109 100644
>>>--- a/drivers/nvme/host/pci.c
>>>+++ b/drivers/nvme/host/pci.c
>>>@@ -24,6 +24,7 @@
>>> #include <linux/io-64-nonatomic-lo-hi.h>
>>> #include <linux/sed-opal.h>
>>> #include <linux/pci-p2pdma.h>
>>>+#include <linux/irq_poll.h>
>>>
>>> #include "trace.h"
>>> #include "nvme.h"
>>>@@ -32,6 +33,7 @@
>>> #define CQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_completion))
>>>
>>> #define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>>>+#define NVME_POLL_BUDGET_IRQ 256
>>>
>>> /*
>>> * These can be higher, but we need to ensure that any command doesn't
>>>@@ -189,6 +191,7 @@ struct nvme_queue {
>>> u32 *dbbuf_cq_db;
>>> u32 *dbbuf_sq_ei;
>>> u32 *dbbuf_cq_ei;
>>>+ struct irq_poll iop;
>>> struct completion delete_done;
>>> };
>>>
>>>@@ -1015,6 +1018,23 @@ static inline int nvme_process_cq(struct
>>>nvme_queue *nvmeq, u16 *start,
>>> return found;
>>> }
>>>
>>>+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget) {
>>>+ struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue,
>>>iop);
>>>+ struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>>>+ u16 start, end;
>>>+ int completed;
>>>+
>>>+ completed = nvme_process_cq(nvmeq, &start, &end, budget);
>>>+ nvme_complete_cqes(nvmeq, start, end);
>>>+ if (completed < budget) {
>>>+ irq_poll_complete(&nvmeq->iop);
>>>+ enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
>>>+ }
>>>+
>>>+ return completed;
>>>+}
>>>+
>>> static irqreturn_t nvme_irq(int irq, void *data)
>>> {
>>> struct nvme_queue *nvmeq = data; @@ -1028,12 +1048,16 @@ static
>>>irqreturn_t nvme_irq(int irq, void *data)
>>> rmb();
>>> if (nvmeq->cq_head != nvmeq->last_cq_head)
>>> ret = IRQ_HANDLED;
>>>- nvme_process_cq(nvmeq, &start, &end, -1);
>>>+ nvme_process_cq(nvmeq, &start, &end, NVME_POLL_BUDGET_IRQ);
>>> nvmeq->last_cq_head = nvmeq->cq_head;
>>> wmb();
>>>
>>> if (start != end) {
>>> nvme_complete_cqes(nvmeq, start, end);
>>>+ if (nvme_cqe_pending(nvmeq)) {
>>>+ disable_irq_nosync(irq);
>>>+ irq_poll_sched(&nvmeq->iop);
>>>+ }
>>> return IRQ_HANDLED;
>>> }
>>>
>>>@@ -1347,6 +1371,7 @@ static enum blk_eh_timer_return
>>>nvme_timeout(struct request *req, bool reserved)
>>>
>>> static void nvme_free_queue(struct nvme_queue *nvmeq)
>>> {
>>>+ irq_poll_disable(&nvmeq->iop);
>>> dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
>>> (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
>>> if (!nvmeq->sq_cmds)
>>>@@ -1481,6 +1506,7 @@ static int nvme_alloc_queue(struct nvme_dev
>>>*dev, int qid, int depth)
>>> nvmeq->dev = dev;
>>> spin_lock_init(&nvmeq->sq_lock);
>>> spin_lock_init(&nvmeq->cq_poll_lock);
>>>+ irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ,
>>>nvme_irqpoll_handler);
>>> nvmeq->cq_head = 0;
>>> nvmeq->cq_phase = 1;
>>> nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
>>>--

2019-08-21 10:00:25

by Ming Lei

Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe

On Wed, Aug 21, 2019 at 07:47:44AM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
> >>>
> >>>On 20/08/2019 09:25, Ming Lei wrote:
> >>>> On Tue, Aug 20, 2019 at 2:14 PM <[email protected]> wrote:
> >>>>>
> >>>>> From: Long Li <[email protected]>
> >>>>>
> >>>>> This patch set tries to fix interrupt swamp in NVMe devices.
> >>>>>
> >>>>> On large systems with many CPUs, a number of CPUs may share one
> >>>NVMe
> >>>>> hardware queue. It may have this situation where several CPUs are
> >>>>> issuing I/Os, and all the I/Os are returned on the CPU where the
> >>>hardware queue is bound to.
> >>>>> This may result in that CPU swamped by interrupts and stay in
> >>>>> interrupt mode for extended time while other CPUs continue to issue
> >>>>> I/O. This can trigger Watchdog and RCU timeout, and make the system
> >>>unresponsive.
> >>>>>
> >>>>> This patch set addresses this by enforcing scheduling and throttling
> >>>>> I/O when CPU is starved in this situation.
> >>>>>
> >>>>> Long Li (3):
> >>>>> sched: define a function to report the number of context switches on a
> >>>>> CPU
> >>>>> sched: export idle_cpu()
> >>>>> nvme: complete request in work queue on CPU with flooded interrupts
> >>>>>
> >>>>> drivers/nvme/host/core.c | 57
> >>>>> +++++++++++++++++++++++++++++++++++++++-
> >>>>> drivers/nvme/host/nvme.h | 1 +
> >>>>> include/linux/sched.h | 2 ++
> >>>>> kernel/sched/core.c | 7 +++++
> >>>>> 4 files changed, 66 insertions(+), 1 deletion(-)
> >>>>
> >>>> Another simpler solution may be to complete request in threaded
> >>>> interrupt handler for this case. Meantime allow scheduler to run the
> >>>> interrupt thread handler on CPUs specified by the irq affinity mask,
> >>>> which was discussed by the following link:
> >>>>
> >>>>
> >>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
> >>>>
> >>>> Could you try the above solution and see if the lockup can be avoided?
> >>>> John Garry
> >>>> should have workable patch.
> >>>
> >>>Yeah, so we experimented with changing the interrupt handling in the SCSI
> >>>driver I maintain to use a threaded handler IRQ handler plus patch below,
> >>>and saw a significant throughput boost:
> >>>
> >>>--->8
> >>>
> >>>Subject: [PATCH] genirq: Add support to allow thread to use hard irq affinity
> >>>
> >>>Currently the cpu allowed mask for the threaded part of a threaded irq
> >>>handler will be set to the effective affinity of the hard irq.
> >>>
> >>>Typically the effective affinity of the hard irq will be for a single cpu. As such,
> >>>the threaded handler would always run on the same cpu as the hard irq.
> >>>
> >>>We have seen scenarios in high data-rate throughput testing that the cpu
> >>>handling the interrupt can be totally saturated handling both the hard
> >>>interrupt and threaded handler parts, limiting throughput.
> >>>
> >>>Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
> >>>interrupt to decide on the policy of which cpu the threaded handler may run.
> >>>
> >>>Signed-off-by: John Garry <[email protected]>
>
> Thanks for pointing me to this patch. This fixed the interrupt swamp and make the system stable.
>
> However I'm seeing reduced performance when using threaded interrupts.
>
> Here are the test results on a system with 80 CPUs and 10 NVMe disks (32 hardware queues for each disk)
> Benchmark tool is FIO, I/O pattern: 4k random reads on all NVMe disks, with queue depth = 64, num of jobs = 80, direct=1
>
> With threaded interrupts: 1320k IOPS
> With just interrupts: 3720k IOPS
> With just interrupts and my patch: 3700k IOPS

This gap looks too big wrt. threaded interrupts vs. interrupts.

>
> At the peak IOPS, the overall CPU usage is at around 98-99%. I think the cost of doing wake up and context switch for NVMe threaded IRQ handler takes some CPU away.
>

In theory, it shouldn't be so, because most of the time the thread should be
running on the CPUs of this hctx, and the wakeup cost shouldn't be that big.
Maybe there is a performance problem somewhere wrt. threaded interrupts.

Could you share your test script and environment with us? I will see if I can
reproduce it in my environment.

> In this test, I made the following change to make use of IRQF_IRQ_AFFINITY for NVMe:
>
> diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
> index a1de501a2729..3fb30d16464e 100644
> --- a/drivers/pci/irq.c
> +++ b/drivers/pci/irq.c
> @@ -86,7 +86,7 @@ int pci_request_irq(struct pci_dev *dev, unsigned int nr, irq_handler_t handler,
> va_list ap;
> int ret;
> char *devname;
> - unsigned long irqflags = IRQF_SHARED;
> + unsigned long irqflags = IRQF_SHARED | IRQF_IRQ_AFFINITY;
>
> if (!handler)
> irqflags |= IRQF_ONESHOT;
>

I don't see why IRQF_IRQ_AFFINITY is needed.

John, could you explain a bit why you need the IRQF_IRQ_AFFINITY changes?

The following patch should be enough:

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f179bf77..1e7cffc1c20c 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
if (cpumask_available(desc->irq_common_data.affinity)) {
const struct cpumask *m;

- m = irq_data_get_effective_affinity_mask(&desc->irq_data);
+ if (irqd_affinity_is_managed(&desc->irq_data))
+ m = desc->irq_common_data.affinity;
+ else
+ m = irq_data_get_effective_affinity_mask(
+ &desc->irq_data);
cpumask_copy(mask, m);
} else {
valid = false;


Thanks,
Ming

2019-08-21 10:11:23

by John Garry

Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe

On 21/08/2019 10:44, Ming Lei wrote:
> On Wed, Aug 21, 2019 at 07:47:44AM +0000, Long Li wrote:
>>>>> Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>>>
>>>>> On 20/08/2019 09:25, Ming Lei wrote:
>>>>>> On Tue, Aug 20, 2019 at 2:14 PM <[email protected]> wrote:
>>>>>>>
>>>>>>> From: Long Li <[email protected]>
>>>>>>>
>>>>>>> This patch set tries to fix interrupt swamp in NVMe devices.
>>>>>>>
>>>>>>> On large systems with many CPUs, a number of CPUs may share one
>>>>> NVMe
>>>>>>> hardware queue. It may have this situation where several CPUs are
>>>>>>> issuing I/Os, and all the I/Os are returned on the CPU where the
>>>>> hardware queue is bound to.
>>>>>>> This may result in that CPU swamped by interrupts and stay in
>>>>>>> interrupt mode for extended time while other CPUs continue to issue
>>>>>>> I/O. This can trigger Watchdog and RCU timeout, and make the system
>>>>> unresponsive.
>>>>>>>
>>>>>>> This patch set addresses this by enforcing scheduling and throttling
>>>>>>> I/O when CPU is starved in this situation.
>>>>>>>
>>>>>>> Long Li (3):
>>>>>>> sched: define a function to report the number of context switches on a
>>>>>>> CPU
>>>>>>> sched: export idle_cpu()
>>>>>>> nvme: complete request in work queue on CPU with flooded interrupts
>>>>>>>
>>>>>>> drivers/nvme/host/core.c | 57
>>>>>>> +++++++++++++++++++++++++++++++++++++++-
>>>>>>> drivers/nvme/host/nvme.h | 1 +
>>>>>>> include/linux/sched.h | 2 ++
>>>>>>> kernel/sched/core.c | 7 +++++
>>>>>>> 4 files changed, 66 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> Another simpler solution may be to complete request in threaded
>>>>>> interrupt handler for this case. Meantime allow scheduler to run the
>>>>>> interrupt thread handler on CPUs specified by the irq affinity mask,
>>>>>> which was discussed by the following link:
>>>>>>
>>>>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
>>>>>>
>>>>>> Could you try the above solution and see if the lockup can be avoided?
>>>>>> John Garry
>>>>>> should have workable patch.
>>>>>
>>>>> Yeah, so we experimented with changing the interrupt handling in the SCSI
>>>>> driver I maintain to use a threaded handler IRQ handler plus patch below,
>>>>> and saw a significant throughput boost:
>>>>>
>>>>> --->8
>>>>>
>>>>> Subject: [PATCH] genirq: Add support to allow thread to use hard irq affinity
>>>>>
>>>>> Currently the cpu allowed mask for the threaded part of a threaded irq
>>>>> handler will be set to the effective affinity of the hard irq.
>>>>>
>>>>> Typically the effective affinity of the hard irq will be for a single cpu. As such,
>>>>> the threaded handler would always run on the same cpu as the hard irq.
>>>>>
>>>>> We have seen scenarios in high data-rate throughput testing that the cpu
>>>>> handling the interrupt can be totally saturated handling both the hard
>>>>> interrupt and threaded handler parts, limiting throughput.
>>>>>
>>>>> Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
>>>>> interrupt to decide on the policy of which cpu the threaded handler may run.
>>>>>
>>>>> Signed-off-by: John Garry <[email protected]>
>>
>> Thanks for pointing me to this patch. This fixed the interrupt swamp and make the system stable.
>>
>> However I'm seeing reduced performance when using threaded interrupts.
>>
>> Here are the test results on a system with 80 CPUs and 10 NVMe disks (32 hardware queues for each disk)
>> Benchmark tool is FIO, I/O pattern: 4k random reads on all NVMe disks, with queue depth = 64, num of jobs = 80, direct=1
>>
>> With threaded interrupts: 1320k IOPS
>> With just interrupts: 3720k IOPS
>> With just interrupts and my patch: 3700k IOPS
>
> This gap looks too big wrt. threaded interrupts vs. interrupts.
>
>>
>> At the peak IOPS, the overall CPU usage is at around 98-99%. I think the cost of doing wake up and context switch for NVMe threaded IRQ handler takes some CPU away.
>>
>
> In theory, it shouldn't be so because most of times the thread should be running
> on CPUs of this hctx, and the wakeup cost shouldn't be so big. Maybe there is
> performance problem somewhere wrt. threaded interrupt.
>
> Could you share us your test script and environment? I will see if I can
> reproduce it in my environment.
>
>> In this test, I made the following change to make use of IRQF_IRQ_AFFINITY for NVMe:
>>
>> diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
>> index a1de501a2729..3fb30d16464e 100644
>> --- a/drivers/pci/irq.c
>> +++ b/drivers/pci/irq.c
>> @@ -86,7 +86,7 @@ int pci_request_irq(struct pci_dev *dev, unsigned int nr, irq_handler_t handler,
>> va_list ap;
>> int ret;
>> char *devname;
>> - unsigned long irqflags = IRQF_SHARED;
>> + unsigned long irqflags = IRQF_SHARED | IRQF_IRQ_AFFINITY;
>>
>> if (!handler)
>> irqflags |= IRQF_ONESHOT;
>>
>
> I don't see why IRQF_IRQ_AFFINITY is needed.
>
> John, could you explain it a bit why you need changes on IRQF_IRQ_AFFINITY?

Hi Ming,

The patch I shared was my original solution, based on the driver setting
the IRQF_IRQ_AFFINITY flag to request that the threaded handler use the irq
affinity mask as the handler's cpu allowed mask.

If we want to make this decision based only on whether the irq is
managed or not, then we can drop the flag and just make the change as
you suggest, below.

Thanks,
John

>
> The following patch should be enough:
>
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index e8f7f179bf77..1e7cffc1c20c 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
> if (cpumask_available(desc->irq_common_data.affinity)) {
> const struct cpumask *m;
>
> - m = irq_data_get_effective_affinity_mask(&desc->irq_data);
> + if (irqd_affinity_is_managed(&desc->irq_data))
> + m = desc->irq_common_data.affinity;
> + else
> + m = irq_data_get_effective_affinity_mask(
> + &desc->irq_data);
> cpumask_copy(mask, m);
> } else {
> valid = false;
>
>
> Thanks,
> Ming
>
> .
>


2019-08-21 10:46:53

by Peter Zijlstra

Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts

On Wed, Aug 21, 2019 at 08:37:55AM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
> >>>with flooded interrupts
> >>>
> >>>On Mon, Aug 19, 2019 at 11:14:29PM -0700, [email protected]
> >>>wrote:
> >>>> From: Long Li <[email protected]>
> >>>>
> >>>> When a NVMe hardware queue is mapped to several CPU queues, it is
> >>>> possible that the CPU this hardware queue is bound to is flooded by
> >>>> returning I/O for other CPUs.
> >>>>
> >>>> For example, consider the following scenario:
> >>>> 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
> >>>> queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2 and
> >>>> 3 keep sending I/Os
> >>>>
> >>>> CPU 0 may be flooded with interrupts from NVMe device that are I/O
> >>>> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
> >>>> that CPU 0 spends all the time serving NVMe and other system
> >>>> interrupts, but doesn't have a chance to run in process context.
> >>>
> >>>Ideally -- and there is some code to affect this, the load-balancer will move
> >>>tasks away from this CPU.
> >>>
> >>>> To fix this, CPU 0 can schedule a work to complete the I/O request
> >>>> when it detects the scheduler is not making progress. This serves multiple
> >>>purposes:
> >>>
> >>>Suppose the task waiting for the IO completion is a RT task, and you've just
> >>>queued it to a regular work. This is an instant priority inversion.
>
> This is a choice. We can either not "lock up" the CPU, or finish the I/O on time from IRQ handler. I think throttling only happens in extreme conditions, which is rare. The purpose is to make the whole system responsive and happy.

Can you please use a sane MUA? This is unreadable garbage.

2019-08-21 16:28:54

by Long Li

Subject: RE: [PATCH 0/3] fix interrupt swamp in NVMe

>>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>
>>>On Wed, Aug 21, 2019 at 07:47:44AM +0000, Long Li wrote:
>>>> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>> >>>
>>>> >>>On 20/08/2019 09:25, Ming Lei wrote:
>>>> >>>> On Tue, Aug 20, 2019 at 2:14 PM <[email protected]> wrote:
>>>> >>>>>
>>>> >>>>> From: Long Li <[email protected]>
>>>> >>>>>
>>>> >>>>> This patch set tries to fix interrupt swamp in NVMe devices.
>>>> >>>>>
>>>> >>>>> On large systems with many CPUs, a number of CPUs may share
>>>one
>>>> >>>NVMe
>>>> >>>>> hardware queue. It may have this situation where several CPUs
>>>> >>>>> are issuing I/Os, and all the I/Os are returned on the CPU where
>>>> >>>>> the
>>>> >>>hardware queue is bound to.
>>>> >>>>> This may result in that CPU swamped by interrupts and stay in
>>>> >>>>> interrupt mode for extended time while other CPUs continue to
>>>> >>>>> issue I/O. This can trigger Watchdog and RCU timeout, and make
>>>> >>>>> the system
>>>> >>>unresponsive.
>>>> >>>>>
>>>> >>>>> This patch set addresses this by enforcing scheduling and
>>>> >>>>> throttling I/O when CPU is starved in this situation.
>>>> >>>>>
>>>> >>>>> Long Li (3):
>>>> >>>>> sched: define a function to report the number of context switches
>>>on a
>>>> >>>>> CPU
>>>> >>>>> sched: export idle_cpu()
>>>> >>>>> nvme: complete request in work queue on CPU with flooded
>>>> >>>>> interrupts
>>>> >>>>>
>>>> >>>>> drivers/nvme/host/core.c | 57
>>>> >>>>> +++++++++++++++++++++++++++++++++++++++-
>>>> >>>>> drivers/nvme/host/nvme.h | 1 +
>>>> >>>>> include/linux/sched.h | 2 ++
>>>> >>>>> kernel/sched/core.c | 7 +++++
>>>> >>>>> 4 files changed, 66 insertions(+), 1 deletion(-)
>>>> >>>>
>>>> >>>> Another simpler solution may be to complete request in threaded
>>>> >>>> interrupt handler for this case. Meantime allow scheduler to run
>>>> >>>> the interrupt thread handler on CPUs specified by the irq
>>>> >>>> affinity mask, which was discussed by the following link:
>>>> >>>>
>>>> >>>>
>>>> >>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
>>>> >>>>
>>>> >>>> Could you try the above solution and see if the lockup can be
>>>avoided?
>>>> >>>> John Garry
>>>> >>>> should have workable patch.
>>>> >>>
>>>> >>>Yeah, so we experimented with changing the interrupt handling in
>>>> >>>the SCSI driver I maintain to use a threaded handler IRQ handler
>>>> >>>plus patch below, and saw a significant throughput boost:
>>>> >>>
>>>> >>>--->8
>>>> >>>
>>>> >>>Subject: [PATCH] genirq: Add support to allow thread to use hard
>>>> >>>irq affinity
>>>> >>>
>>>> >>>Currently the cpu allowed mask for the threaded part of a threaded
>>>> >>>irq handler will be set to the effective affinity of the hard irq.
>>>> >>>
>>>> >>>Typically the effective affinity of the hard irq will be for a
>>>> >>>single cpu. As such, the threaded handler would always run on the
>>>same cpu as the hard irq.
>>>> >>>
>>>> >>>We have seen scenarios in high data-rate throughput testing that
>>>> >>>the cpu handling the interrupt can be totally saturated handling
>>>> >>>both the hard interrupt and threaded handler parts, limiting
>>>throughput.
>>>> >>>
>>>> >>>Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the
>>>> >>>threaded interrupt to decide on the policy of which cpu the threaded
>>>handler may run.
>>>> >>>
>>>> >>>Signed-off-by: John Garry <[email protected]>
>>>>
>>>> Thanks for pointing me to this patch. This fixed the interrupt swamp and
>>>make the system stable.
>>>>
>>>> However I'm seeing reduced performance when using threaded
>>>interrupts.
>>>>
>>>> Here are the test results on a system with 80 CPUs and 10 NVMe disks
>>>> (32 hardware queues for each disk) Benchmark tool is FIO, I/O pattern:
>>>> 4k random reads on all NVMe disks, with queue depth = 64, num of jobs
>>>> = 80, direct=1
>>>>
>>>> With threaded interrupts: 1320k IOPS
>>>> With just interrupts: 3720k IOPS
>>>> With just interrupts and my patch: 3700k IOPS
>>>
>>>This gap looks too big wrt. threaded interrupts vs. interrupts.
>>>
>>>>
>>>> At the peak IOPS, the overall CPU usage is at around 98-99%. I think the
>>>cost of doing wake up and context switch for NVMe threaded IRQ handler
>>>takes some CPU away.
>>>>
>>>
>>>In theory, it shouldn't be so because most of times the thread should be
>>>running on CPUs of this hctx, and the wakeup cost shouldn't be so big.
>>>Maybe there is performance problem somewhere wrt. threaded interrupt.
>>>
>>>Could you share us your test script and environment? I will see if I can
>>>reproduce it in my environment.

Ming, do you have access to L80s_v2 in Azure? This test needs to run on that VM size.

Here is the command to benchmark it:

fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1

Thanks

Long

>>>
>>>> In this test, I made the following change to make use of
>>>IRQF_IRQ_AFFINITY for NVMe:
>>>>
>>>> diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
>>>> index a1de501a2729..3fb30d16464e 100644
>>>> --- a/drivers/pci/irq.c
>>>> +++ b/drivers/pci/irq.c
>>>> @@ -86,7 +86,7 @@ int pci_request_irq(struct pci_dev *dev, unsigned int nr, irq_handler_t handler,
>>>> va_list ap;
>>>> int ret;
>>>> char *devname;
>>>> - unsigned long irqflags = IRQF_SHARED;
>>>> + unsigned long irqflags = IRQF_SHARED | IRQF_IRQ_AFFINITY;
>>>>
>>>> if (!handler)
>>>> irqflags |= IRQF_ONESHOT;
>>>>
>>>
>>>I don't see why IRQF_IRQ_AFFINITY is needed.
>>>
>>>John, could you explain it a bit why you need changes on
>>>IRQF_IRQ_AFFINITY?
>>>
>>>The following patch should be enough:
>>>
>>>diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
>>>index e8f7f179bf77..1e7cffc1c20c 100644
>>>--- a/kernel/irq/manage.c
>>>+++ b/kernel/irq/manage.c
>>>@@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
>>> if (cpumask_available(desc->irq_common_data.affinity)) {
>>> const struct cpumask *m;
>>>
>>>- m = irq_data_get_effective_affinity_mask(&desc->irq_data);
>>>+ if (irqd_affinity_is_managed(&desc->irq_data))
>>>+ m = desc->irq_common_data.affinity;
>>>+ else
>>>+ m = irq_data_get_effective_affinity_mask(
>>>+ &desc->irq_data);
>>> cpumask_copy(mask, m);
>>> } else {
>>> valid = false;
>>>
>>>
>>>Thanks,
>>>Ming

2019-08-21 17:59:13

by Long Li

[permalink] [raw]
Subject: RE: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts

>>>Subject: RE: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on
>>>CPU
>>>>>>with flooded interrupts
>>>>>>
>>>>>>
>>>>>>> From: Long Li <[email protected]>
>>>>>>>
>>>>>>> When a NVMe hardware queue is mapped to several CPU queues, it is
>>>>>>> possible that the CPU this hardware queue is bound to is flooded by
>>>>>>> returning I/O for other CPUs.
>>>>>>>
>>>>>>> For example, consider the following scenario:
>>>>>>> 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
>>>>>>> queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2
>>>>>>> and
>>>>>>> 3 keep sending I/Os
>>>>>>>
>>>>>>> CPU 0 may be flooded with interrupts from NVMe device that are I/O
>>>>>>> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
>>>>>>> that CPU 0 spends all the time serving NVMe and other system
>>>>>>> interrupts, but doesn't have a chance to run in process context.
>>>>>>>
>>>>>>> To fix this, CPU 0 can schedule a work to complete the I/O request
>>>>>>> when it detects the scheduler is not making progress. This serves
>>>>>>> multiple
>>>>>>purposes:
>>>>>>>
>>>>>>> 1. This CPU has to be scheduled to complete the request. The other
>>>>>>> CPUs can't issue more I/Os until some previous I/Os are completed.
>>>>>>> This helps this CPU get out of NVMe interrupts.
>>>>>>>
>>>>>>> 2. This acts a throttling mechanisum for NVMe devices, in that it
>>>>>>> can not starve a CPU while servicing I/Os from other CPUs.
>>>>>>>
>>>>>>> 3. This CPU can make progress on RCU and other work items on its
>>>queue.
>>>>>>
>>>>>>The problem is indeed real, but this is the wrong approach in my mind.
>>>>>>
>>>>>>We already have irqpoll which takes care proper budgeting polling
>>>>>>cycles and not hogging the cpu.
>>>>>>
>>>>>>I've sent rfc for this particular problem before [1]. At the time
>>>>>>IIRC, Christoph suggested that we will poll the first batch directly
>>>>>>from the irq context and reap the rest in irqpoll handler.
>>>
>>>Thanks for the pointer. I will test and report back.

Sagi,

Here are the test results.

Benchmark command:
fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1

With your patch: 1720k IOPS
With threaded interrupts: 1320k IOPS
With just interrupts: 3720k IOPS

Interrupts are the fastest, but we need to find a way to throttle them.

Thanks

Long


>>>
>>>>>>
>>>>>>[1]:
>>>>>>http://lists.infradead.org/pipermail/linux-nvme/2016-October/006497.html
>>>>>>
>>>>>>How about something like this instead:
>>>>>>--
>>>>>>diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>>>>>>index 71127a366d3c..84bf16d75109 100644
>>>>>>--- a/drivers/nvme/host/pci.c
>>>>>>+++ b/drivers/nvme/host/pci.c
>>>>>>@@ -24,6 +24,7 @@
>>>>>> #include <linux/io-64-nonatomic-lo-hi.h>
>>>>>> #include <linux/sed-opal.h>
>>>>>> #include <linux/pci-p2pdma.h>
>>>>>>+#include <linux/irq_poll.h>
>>>>>>
>>>>>> #include "trace.h"
>>>>>> #include "nvme.h"
>>>>>>@@ -32,6 +33,7 @@
>>>>>> #define CQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_completion))
>>>>>>
>>>>>> #define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>>>>>>+#define NVME_POLL_BUDGET_IRQ 256
>>>>>>
>>>>>> /*
>>>>>> * These can be higher, but we need to ensure that any command doesn't
>>>>>>@@ -189,6 +191,7 @@ struct nvme_queue {
>>>>>> u32 *dbbuf_cq_db;
>>>>>> u32 *dbbuf_sq_ei;
>>>>>> u32 *dbbuf_cq_ei;
>>>>>>+ struct irq_poll iop;
>>>>>> struct completion delete_done; };
>>>>>>
>>>>>>@@ -1015,6 +1018,23 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
>>>>>> return found;
>>>>>> }
>>>>>>
>>>>>>+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
>>>>>>+{
>>>>>>+ struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, iop);
>>>>>>+ struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>>>>>>+ u16 start, end;
>>>>>>+ int completed;
>>>>>>+
>>>>>>+ completed = nvme_process_cq(nvmeq, &start, &end, budget);
>>>>>>+ nvme_complete_cqes(nvmeq, start, end);
>>>>>>+ if (completed < budget) {
>>>>>>+ irq_poll_complete(&nvmeq->iop);
>>>>>>+ enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
>>>>>>+ }
>>>>>>+
>>>>>>+ return completed;
>>>>>>+}
>>>>>>+
>>>>>> static irqreturn_t nvme_irq(int irq, void *data)
>>>>>> {
>>>>>> struct nvme_queue *nvmeq = data;
>>>>>>@@ -1028,12 +1048,16 @@ static irqreturn_t nvme_irq(int irq, void *data)
>>>>>> rmb();
>>>>>> if (nvmeq->cq_head != nvmeq->last_cq_head)
>>>>>> ret = IRQ_HANDLED;
>>>>>>- nvme_process_cq(nvmeq, &start, &end, -1);
>>>>>>+ nvme_process_cq(nvmeq, &start, &end, NVME_POLL_BUDGET_IRQ);
>>>>>> nvmeq->last_cq_head = nvmeq->cq_head;
>>>>>> wmb();
>>>>>>
>>>>>> if (start != end) {
>>>>>> nvme_complete_cqes(nvmeq, start, end);
>>>>>>+ if (nvme_cqe_pending(nvmeq)) {
>>>>>>+ disable_irq_nosync(irq);
>>>>>>+ irq_poll_sched(&nvmeq->iop);
>>>>>>+ }
>>>>>> return IRQ_HANDLED;
>>>>>> }
>>>>>>
>>>>>>@@ -1347,6 +1371,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>>>>>>
>>>>>> static void nvme_free_queue(struct nvme_queue *nvmeq) {
>>>>>>+ irq_poll_disable(&nvmeq->iop);
>>>>>> dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
>>>>>> (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
>>>>>> if (!nvmeq->sq_cmds)
>>>>>>@@ -1481,6 +1506,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth)
>>>>>> nvmeq->dev = dev;
>>>>>> spin_lock_init(&nvmeq->sq_lock);
>>>>>> spin_lock_init(&nvmeq->cq_poll_lock);
>>>>>>+ irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ, nvme_irqpoll_handler);
>>>>>> nvmeq->cq_head = 0;
>>>>>> nvmeq->cq_phase = 1;
>>>>>> nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
>>>>>>--

2019-08-22 01:54:32

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts


> Sagi,
>
> Here are the test results.
>
> Benchmark command:
> fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
>
> With your patch: 1720k IOPS
> With threaded interrupts: 1320k IOPS
> With just interrupts: 3720k IOPS
>
> Interrupts are the fastest but we need to find a way to throttle it.

This is the workload that generates the flood?
If so, I did not expect that this would be the perf difference.

If completions keep pounding on the cpu, I would expect irqpoll
to simply keep running forever and poll the cqs. There is no
fundamental reason why polling would be faster in an interrupt,
what matters could be:
1. we reschedule more than we need to
2. we don't reap enough completions in every poll round, which
will trigger rearming the interrupt and then when it fires reschedule
another softirq...

Maybe we need to take care of some irq_poll optimizations?

Does this (untested) patch make any difference?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 2f17b488d58e..0e94183eba15 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -12,7 +12,8 @@
#include <linux/irq_poll.h>
#include <linux/delay.h>

-static unsigned int irq_poll_budget __read_mostly = 256;
+static unsigned int irq_poll_budget __read_mostly = 3000;
+unsigned int __read_mostly irqpoll_budget_usecs = 2000;

static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);

@@ -77,32 +78,26 @@ EXPORT_SYMBOL(irq_poll_complete);

static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
{
- struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
- int rearm = 0, budget = irq_poll_budget;
- unsigned long start_time = jiffies;
+ struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
+ unsigned int budget = irq_poll_budget;
+ unsigned long time_limit =
+ jiffies + usecs_to_jiffies(irqpoll_budget_usecs);
+ LIST_HEAD(list);

local_irq_disable();
+ list_splice_init(irqpoll_list, &list);
+ local_irq_enable();

- while (!list_empty(list)) {
+ while (!list_empty(&list)) {
struct irq_poll *iop;
int work, weight;

- /*
- * If softirq window is exhausted then punt.
- */
- if (budget <= 0 || time_after(jiffies, start_time)) {
- rearm = 1;
- break;
- }
-
- local_irq_enable();
-
/* Even though interrupts have been re-enabled, this
* access is safe because interrupts can only add new
* entries to the tail of this list, and only ->poll()
* calls can remove this head entry from the list.
*/
- iop = list_entry(list->next, struct irq_poll, list);
+ iop = list_first_entry(&list, struct irq_poll, list);

weight = iop->weight;
work = 0;
@@ -111,8 +106,6 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)

budget -= work;

- local_irq_disable();
-
/*
* Drivers must not modify the iopoll state, if they
 * consume their assigned weight (or more, some drivers can't
@@ -125,11 +118,21 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
__irq_poll_complete(iop);
else
- list_move_tail(&iop->list, list);
+ list_move_tail(&iop->list, &list);
}
+
+ /*
+ * If softirq window is exhausted then punt.
+ */
+ if (budget <= 0 || time_after_eq(jiffies, time_limit))
+ break;
}

- if (rearm)
+ local_irq_disable();
+
+ list_splice_tail_init(irqpoll_list, &list);
+ list_splice(&list, irqpoll_list);
+ if (!list_empty(irqpoll_list))
__raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);

local_irq_enable();
--

2019-08-22 02:47:18

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe

On Wed, Aug 21, 2019 at 04:27:00PM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
> >>>
> >>>On Wed, Aug 21, 2019 at 07:47:44AM +0000, Long Li wrote:
> >>>> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
> >>>> >>>
> >>>> >>>On 20/08/2019 09:25, Ming Lei wrote:
> >>>> >>>> On Tue, Aug 20, 2019 at 2:14 PM <[email protected]> wrote:
> >>>> >>>>>
> >>>> >>>>> From: Long Li <[email protected]>
> >>>> >>>>>
> >>>> >>>>> This patch set tries to fix interrupt swamp in NVMe devices.
> >>>> >>>>>
> >>>> >>>>> On large systems with many CPUs, a number of CPUs may share
> >>>one
> >>>> >>>NVMe
> >>>> >>>>> hardware queue. It may have this situation where several CPUs
> >>>> >>>>> are issuing I/Os, and all the I/Os are returned on the CPU where
> >>>> >>>>> the
> >>>> >>>hardware queue is bound to.
> >>>> >>>>> This may result in that CPU swamped by interrupts and stay in
> >>>> >>>>> interrupt mode for extended time while other CPUs continue to
> >>>> >>>>> issue I/O. This can trigger Watchdog and RCU timeout, and make
> >>>> >>>>> the system
> >>>> >>>unresponsive.
> >>>> >>>>>
> >>>> >>>>> This patch set addresses this by enforcing scheduling and
> >>>> >>>>> throttling I/O when CPU is starved in this situation.
> >>>> >>>>>
> >>>> >>>>> Long Li (3):
> >>>> >>>>> sched: define a function to report the number of context switches
> >>>on a
> >>>> >>>>> CPU
> >>>> >>>>> sched: export idle_cpu()
> >>>> >>>>> nvme: complete request in work queue on CPU with flooded
> >>>> >>>>> interrupts
> >>>> >>>>>
> >>>> >>>>> drivers/nvme/host/core.c | 57
> >>>> >>>>> +++++++++++++++++++++++++++++++++++++++-
> >>>> >>>>> drivers/nvme/host/nvme.h | 1 +
> >>>> >>>>> include/linux/sched.h | 2 ++
> >>>> >>>>> kernel/sched/core.c | 7 +++++
> >>>> >>>>> 4 files changed, 66 insertions(+), 1 deletion(-)
> >>>> >>>>
> >>>> >>>> Another simpler solution may be to complete request in threaded
> >>>> >>>> interrupt handler for this case. Meantime allow scheduler to run
> >>>> >>>> the interrupt thread handler on CPUs specified by the irq
> >>>> >>>> affinity mask, which was discussed by the following link:
> >>>> >>>>
> >>>> >>>>
> >>>> >>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
> >>>> >>>>
> >>>> >>>> Could you try the above solution and see if the lockup can be
> >>>avoided?
> >>>> >>>> John Garry
> >>>> >>>> should have workable patch.
> >>>> >>>
> >>>> >>>Yeah, so we experimented with changing the interrupt handling in
> >>>> >>>the SCSI driver I maintain to use a threaded handler IRQ handler
> >>>> >>>plus patch below, and saw a significant throughput boost:
> >>>> >>>
> >>>> >>>--->8
> >>>> >>>
> >>>> >>>Subject: [PATCH] genirq: Add support to allow thread to use hard
> >>>> >>>irq affinity
> >>>> >>>
> >>>> >>>Currently the cpu allowed mask for the threaded part of a threaded
> >>>> >>>irq handler will be set to the effective affinity of the hard irq.
> >>>> >>>
> >>>> >>>Typically the effective affinity of the hard irq will be for a
> >>>> >>>single cpu. As such, the threaded handler would always run on the
> >>>same cpu as the hard irq.
> >>>> >>>
> >>>> >>>We have seen scenarios in high data-rate throughput testing that
> >>>> >>>the cpu handling the interrupt can be totally saturated handling
> >>>> >>>both the hard interrupt and threaded handler parts, limiting
> >>>throughput.
> >>>> >>>
> >>>> >>>Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the
> >>>> >>>threaded interrupt to decide on the policy of which cpu the threaded
> >>>handler may run.
> >>>> >>>
> >>>> >>>Signed-off-by: John Garry <[email protected]>
> >>>>
> >>>> Thanks for pointing me to this patch. This fixed the interrupt swamp and
> >>>make the system stable.
> >>>>
> >>>> However I'm seeing reduced performance when using threaded
> >>>interrupts.
> >>>>
> >>>> Here are the test results on a system with 80 CPUs and 10 NVMe disks
> >>>> (32 hardware queues for each disk) Benchmark tool is FIO, I/O pattern:
> >>>> 4k random reads on all NVMe disks, with queue depth = 64, num of jobs
> >>>> = 80, direct=1
> >>>>
> >>>> With threaded interrupts: 1320k IOPS
> >>>> With just interrupts: 3720k IOPS
> >>>> With just interrupts and my patch: 3700k IOPS
> >>>
> >>>This gap looks too big wrt. threaded interrupts vs. interrupts.
> >>>
> >>>>
> >>>> At the peak IOPS, the overall CPU usage is at around 98-99%. I think the
> >>>cost of doing wake up and context switch for NVMe threaded IRQ handler
> >>>takes some CPU away.
> >>>>
> >>>
> >>>In theory, it shouldn't be so because most of times the thread should be
> >>>running on CPUs of this hctx, and the wakeup cost shouldn't be so big.
> >>>Maybe there is performance problem somewhere wrt. threaded interrupt.
> >>>
> >>>Could you share us your test script and environment? I will see if I can
> >>>reproduce it in my environment.
>
> Ming, do you have access to L80s_v2 in Azure? This test needs to run on that VM size.
>
> Here is the command to benchmark it:
>
> fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
>

I can reproduce the issue on one machine (96 cores) with 4 NVMes (32 queues), so
each queue is served by 3 CPUs.

IOPS drops > 20% when 'use_threaded_interrupts' is enabled. The fio log shows that CPU
context switches increase a lot.


Thanks,
Ming

2019-08-22 02:58:01

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe

On Wed, Aug 21, 2019 at 7:34 PM Ming Lei <[email protected]> wrote:
> On Wed, Aug 21, 2019 at 04:27:00PM +0000, Long Li wrote:
> > Here is the command to benchmark it:
> >
> > fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
> >
>
> I can reproduce the issue on one machine(96 cores) with 4 NVMes(32 queues), so
> each queue is served on 3 CPUs.
>
> IOPS drops > 20% when 'use_threaded_interrupts' is enabled. From fio log, CPU
> context switch is increased a lot.

Interestingly, use_threaded_interrupts shows a marginal improvement on
my machine with the same fio profile. It was only 5 NVMes, but they have
one queue per CPU on 112 cores.

2019-08-22 03:10:28

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe

On Thu, Aug 22, 2019 at 10:00 AM Keith Busch <[email protected]> wrote:
>
> On Wed, Aug 21, 2019 at 7:34 PM Ming Lei <[email protected]> wrote:
> > On Wed, Aug 21, 2019 at 04:27:00PM +0000, Long Li wrote:
> > > Here is the command to benchmark it:
> > >
> > > fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
> > >
> >
> > I can reproduce the issue on one machine(96 cores) with 4 NVMes(32 queues), so
> > each queue is served on 3 CPUs.
> >
> > IOPS drops > 20% when 'use_threaded_interrupts' is enabled. From fio log, CPU
> > context switch is increased a lot.
>
> Interestingly use_threaded_interrupts shows a marginal improvement on
> my machine with the same fio profile. It was only 5 NVMes, but they've
> one queue per-cpu on 112 cores.

I have not investigated it yet.

BTW, my fio test is only run against a single hw queue via 'taskset -c
$cpu_list_of_the_queue',
without applying the threaded interrupt affinity patch. The NVMe is Optane.

The same issue can be reproduced after I force a 1:1 mapping by passing
'possible_cpus=32' on the kernel cmd line.

Maybe it is related to kernel options, so I attached the config I used; it is
basically a subset of the RHEL8 kernel config.

Thanks,
Ming Lei


Attachments:
config.tar.gz (30.79 kB)

2019-08-22 13:10:26

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe

On Wed, 21 Aug 2019, Keith Busch wrote:
> On Wed, Aug 21, 2019 at 7:34 PM Ming Lei <[email protected]> wrote:
> > On Wed, Aug 21, 2019 at 04:27:00PM +0000, Long Li wrote:
> > > Here is the command to benchmark it:
> > >
> > > fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
> > >
> >
> > I can reproduce the issue on one machine(96 cores) with 4 NVMes(32 queues), so
> > each queue is served on 3 CPUs.
> >
> > IOPS drops > 20% when 'use_threaded_interrupts' is enabled. From fio log, CPU
> > context switch is increased a lot.
>
> Interestingly use_threaded_interrupts shows a marginal improvement on
> my machine with the same fio profile. It was only 5 NVMes, but they've
> one queue per-cpu on 112 cores.

Which is not surprising because the thread and the hard interrupt are on
the same CPU and there is just that little overhead of the context switch.

The thing is that this really depends on how the scheduler decides to place
the interrupt thread.

If you have a queue for several CPUs, then depending on the load situation
allowing a multi-cpu affinity for the thread can cause lots of task
migration.

But restricting the irq thread to the CPU on which the interrupt is affine
can also starve that CPU. There is no universal rule for that.

Tracing should tell.

Thanks,

tglx



2019-08-23 09:54:48

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts

On Tue, Aug 20, 2019 at 10:33:38AM -0700, Sagi Grimberg wrote:
>
> > From: Long Li <[email protected]>
> >
> > When a NVMe hardware queue is mapped to several CPU queues, it is possible
> > that the CPU this hardware queue is bound to is flooded by returning I/O for
> > other CPUs.
> >
> > For example, consider the following scenario:
> > 1. CPU 0, 1, 2 and 3 share the same hardware queue
> > 2. the hardware queue interrupts CPU 0 for I/O response
> > 3. processes from CPU 1, 2 and 3 keep sending I/Os
> >
> > CPU 0 may be flooded with interrupts from NVMe device that are I/O responses
> > for CPU 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends
> > all the time serving NVMe and other system interrupts, but doesn't have a
> > chance to run in process context.
> >
> > To fix this, CPU 0 can schedule a work to complete the I/O request when it
> > detects the scheduler is not making progress. This serves multiple purposes:
> >
> > 1. This CPU has to be scheduled to complete the request. The other CPUs can't
> > issue more I/Os until some previous I/Os are completed. This helps this CPU
> > get out of NVMe interrupts.
> >
> > 2. This acts a throttling mechanisum for NVMe devices, in that it can not
> > starve a CPU while servicing I/Os from other CPUs.
> >
> > 3. This CPU can make progress on RCU and other work items on its queue.
>
> The problem is indeed real, but this is the wrong approach in my mind.
>
> We already have irqpoll which takes care proper budgeting polling
> cycles and not hogging the cpu.

The issue isn't unique to NVMe; it can happen with any fast device that
interrupts a CPU too frequently. If the interrupt/softirq handler also takes a
fair amount of time, that CPU is easily locked up by the interrupt/softirq
handler, especially in the case of multiple submission CPUs vs. a single
completion CPU.

Some SCSI devices have the same problem too.

Could we consider adding one generic mechanism to cover this kind of
problem?

One approach I thought of is to allocate one backup thread for handling
such an interrupt, which can be marked as IRQF_BACKUP_THREAD by drivers.

Inside do_IRQ(), irqtime is already accounted. Before calling action->handler(),
check whether this CPU has spent too long handling IRQs (interrupt or
softirq) and could be about to lock up. If yes, wake up the backup
thread to handle the interrupt, avoiding a lockup of this CPU.

The threaded interrupt framework is already there, so this way could be easier
to implement. Meanwhile, most of the time the handler still runs in interrupt
context, and we may avoid the performance loss when the CPU isn't busy enough.
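
Just to make that check concrete, here is a rough userspace sketch of the
decision; the 50% irq-time ratio, the per-CPU bookkeeping struct and the
function names are placeholders made up for illustration only, not actual
genirq code:

--
/*
 * Rough userspace model of the IRQF_BACKUP_THREAD idea described above.
 * The 50% irq-time ratio and the bookkeeping layout are illustrative
 * assumptions only.
 */
#include <stdbool.h>
#include <stdio.h>

struct cpu_irq_account {
        unsigned long long irq_time_ns;         /* time spent in irq/softirq */
        unsigned long long window_start_ns;     /* start of the measuring window */
};

/* Decide whether the hard-irq path should defer to the backup thread. */
static bool should_wake_backup_thread(const struct cpu_irq_account *acct,
                                      unsigned long long now_ns)
{
        unsigned long long elapsed = now_ns - acct->window_start_ns;

        if (!elapsed)
                return false;

        /* Wake the backup thread once more than 50% of the window went to IRQs. */
        return acct->irq_time_ns * 2 > elapsed;
}

int main(void)
{
        /* 800us of irq time inside a 1ms window: the CPU is being starved. */
        struct cpu_irq_account acct = {
                .irq_time_ns = 800000,
                .window_start_ns = 0,
        };
        unsigned long long now_ns = 1000000;

        if (should_wake_backup_thread(&acct, now_ns))
                printf("defer completion to the backup irq thread\n");
        else
                printf("handle completion directly in the hard irq\n");
        return 0;
}
--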

Any comment on this approach?

Thanks,
Ming

2019-08-24 00:16:48

by Long Li

[permalink] [raw]
Subject: RE: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts

>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>
>>>> Sagi,
>>>>
>>>> Here are the test results.
>>>>
>>>> Benchmark command:
>>>> fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
>>>>
>>>> With your patch: 1720k IOPS
>>>> With threaded interrupts: 1320k IOPS
>>>> With just interrupts: 3720k IOPS
>>>>
>>>> Interrupts are the fastest but we need to find a way to throttle it.
>>>
>>>This is the workload that generates the flood?
>>>If so I did not expect that this would be the perf difference..
>>>
>>>If completions keep pounding on the cpu, I would expect irqpoll to simply
>>>keep running forever and poll the cqs. There is no fundamental reason why
>>>polling would be faster in an interrupt, what matters could be:
>>>1. we reschedule more than we need to
>>>2. we don't reap enough completions in every poll round, which will trigger
>>>rearming the interrupt and then when it fires reschedule another softirq...
>>>

Yes, I think it's the rescheduling that takes some of the CPU. With the patch there is a lot of ksoftirqd activity (compared to nearly none without the patch).
A 90-second FIO run shows a big difference in context switches across all CPUs:
With patch: 5755849
Without patch: 1462931

>>>Maybe we need to take care of some irq_poll optimizations?
>>>
>>>Does this (untested) patch make any difference?
>>>--
>>>diff --git a/lib/irq_poll.c b/lib/irq_poll.c
>>>index 2f17b488d58e..0e94183eba15 100644
>>>--- a/lib/irq_poll.c
>>>+++ b/lib/irq_poll.c
>>>@@ -12,7 +12,8 @@
>>> #include <linux/irq_poll.h>
>>> #include <linux/delay.h>
>>>
>>>-static unsigned int irq_poll_budget __read_mostly = 256;
>>>+static unsigned int irq_poll_budget __read_mostly = 3000;
>>>+unsigned int __read_mostly irqpoll_budget_usecs = 2000;
>>>
>>> static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);
>>>
>>>@@ -77,32 +78,26 @@ EXPORT_SYMBOL(irq_poll_complete);
>>>
>>> static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>>> {
>>>- struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
>>>- int rearm = 0, budget = irq_poll_budget;
>>>- unsigned long start_time = jiffies;
>>>+ struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
>>>+ unsigned int budget = irq_poll_budget;
>>>+ unsigned long time_limit =
>>>+ jiffies + usecs_to_jiffies(irqpoll_budget_usecs);
>>>+ LIST_HEAD(list);
>>>
>>> local_irq_disable();
>>>+ list_splice_init(irqpoll_list, &list);
>>>+ local_irq_enable();
>>>
>>>- while (!list_empty(list)) {
>>>+ while (!list_empty(&list)) {
>>> struct irq_poll *iop;
>>> int work, weight;
>>>
>>>- /*
>>>- * If softirq window is exhausted then punt.
>>>- */
>>>- if (budget <= 0 || time_after(jiffies, start_time)) {
>>>- rearm = 1;
>>>- break;
>>>- }
>>>-
>>>- local_irq_enable();
>>>-
>>> /* Even though interrupts have been re-enabled, this
>>> * access is safe because interrupts can only add new
>>> * entries to the tail of this list, and only ->poll()
>>> * calls can remove this head entry from the list.
>>> */
>>>- iop = list_entry(list->next, struct irq_poll, list);
>>>+ iop = list_first_entry(&list, struct irq_poll, list);
>>>
>>> weight = iop->weight;
>>> work = 0;
>>>@@ -111,8 +106,6 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>>>
>>> budget -= work;
>>>
>>>- local_irq_disable();
>>>-
>>> /*
>>> * Drivers must not modify the iopoll state, if they
>>> * consume their assigned weight (or more, some drivers can't
>>>@@ -125,11 +118,21 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>>> if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
>>> __irq_poll_complete(iop);
>>> else
>>>- list_move_tail(&iop->list, list);
>>>+ list_move_tail(&iop->list, &list);
>>> }
>>>+
>>>+ /*
>>>+ * If softirq window is exhausted then punt.
>>>+ */
>>>+ if (budget <= 0 || time_after_eq(jiffies, time_limit))
>>>+ break;
>>> }
>>>
>>>- if (rearm)
>>>+ local_irq_disable();
>>>+
>>>+ list_splice_tail_init(irqpoll_list, &list);
>>>+ list_splice(&list, irqpoll_list);
>>>+ if (!list_empty(irqpoll_list))
>>> __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
>>>
>>> local_irq_enable();
>>>--

It got slightly better IOPS. With this 2nd patch, the number of context switches is 5445863 (a ~5% improvement over the 1st patch).

2019-08-24 00:28:27

by Long Li

[permalink] [raw]
Subject: RE: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts

>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>On Tue, Aug 20, 2019 at 10:33:38AM -0700, Sagi Grimberg wrote:
>>>>
>>>> > From: Long Li <[email protected]>
>>>> >
>>>> > When a NVMe hardware queue is mapped to several CPU queues, it is
>>>> > possible that the CPU this hardware queue is bound to is flooded by
>>>> > returning I/O for other CPUs.
>>>> >
>>>> > For example, consider the following scenario:
>>>> > 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
>>>> > queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2
>>>> > and 3 keep sending I/Os
>>>> >
>>>> > CPU 0 may be flooded with interrupts from NVMe device that are I/O
>>>> > responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
>>>> > that CPU 0 spends all the time serving NVMe and other system
>>>> > interrupts, but doesn't have a chance to run in process context.
>>>> >
>>>> > To fix this, CPU 0 can schedule a work to complete the I/O request
>>>> > when it detects the scheduler is not making progress. This serves
>>>multiple purposes:
>>>> >
>>>> > 1. This CPU has to be scheduled to complete the request. The other
>>>> > CPUs can't issue more I/Os until some previous I/Os are completed.
>>>> > This helps this CPU get out of NVMe interrupts.
>>>> >
>>>> > 2. This acts a throttling mechanisum for NVMe devices, in that it
>>>> > can not starve a CPU while servicing I/Os from other CPUs.
>>>> >
>>>> > 3. This CPU can make progress on RCU and other work items on its
>>>queue.
>>>>
>>>> The problem is indeed real, but this is the wrong approach in my mind.
>>>>
>>>> We already have irqpoll which takes care proper budgeting polling
>>>> cycles and not hogging the cpu.
>>>
>>>The issue isn't unique to NVMe, and can be any fast devices which
>>>interrupts CPU too frequently, meantime the interrupt/softirq handler may
>>>take a bit much time, then CPU is easy to be lockup by the interrupt/sofirq
>>>handler, especially in case that multiple submission CPUs vs. single
>>>completion CPU.
>>>
>>>Some SCSI devices has the same problem too.
>>>
>>>Could we consider to add one generic mechanism to cover this kind of
>>>problem?
>>>
>>>One approach I thought of is to allocate one backup thread for handling such
>>>interrupt, which can be marked as IRQF_BACKUP_THREAD by drivers.
>>>
>>>Inside do_IRQ(), irqtime is accounted, before calling action->handler(),
>>>check if this CPU has taken too long time for handling IRQ(interrupt or
>>>softirq) and see if this CPU could be lock up. If yes, wakeup the backup

How do you know if this CPU is spending all its time in do_IRQ()?

Is it something like:

    if (IRQ_time / elapsed_time > a threshold value)
        wake up the backup thread

>>>thread to handle the interrupt for avoiding lockup this CPU.
>>>
>>>The threaded interrupt framework is there, and this way could be easier to
>>>implement. Meantime most time the handler is run in interrupt context and
>>>we may avoid the performance loss when CPU isn't busy enough.
>>>
>>>Any comment on this approach?
>>>
>>>Thanks,
>>>Ming

2019-08-24 12:57:12

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts

On Sat, Aug 24, 2019 at 12:27:18AM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
> >>>with flooded interrupts
> >>>
> >>>On Tue, Aug 20, 2019 at 10:33:38AM -0700, Sagi Grimberg wrote:
> >>>>
> >>>> > From: Long Li <[email protected]>
> >>>> >
> >>>> > When a NVMe hardware queue is mapped to several CPU queues, it is
> >>>> > possible that the CPU this hardware queue is bound to is flooded by
> >>>> > returning I/O for other CPUs.
> >>>> >
> >>>> > For example, consider the following scenario:
> >>>> > 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
> >>>> > queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2
> >>>> > and 3 keep sending I/Os
> >>>> >
> >>>> > CPU 0 may be flooded with interrupts from NVMe device that are I/O
> >>>> > responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
> >>>> > that CPU 0 spends all the time serving NVMe and other system
> >>>> > interrupts, but doesn't have a chance to run in process context.
> >>>> >
> >>>> > To fix this, CPU 0 can schedule a work to complete the I/O request
> >>>> > when it detects the scheduler is not making progress. This serves
> >>>multiple purposes:
> >>>> >
> >>>> > 1. This CPU has to be scheduled to complete the request. The other
> >>>> > CPUs can't issue more I/Os until some previous I/Os are completed.
> >>>> > This helps this CPU get out of NVMe interrupts.
> >>>> >
> >>>> > 2. This acts a throttling mechanisum for NVMe devices, in that it
> >>>> > can not starve a CPU while servicing I/Os from other CPUs.
> >>>> >
> >>>> > 3. This CPU can make progress on RCU and other work items on its
> >>>queue.
> >>>>
> >>>> The problem is indeed real, but this is the wrong approach in my mind.
> >>>>
> >>>> We already have irqpoll which takes care proper budgeting polling
> >>>> cycles and not hogging the cpu.
> >>>
> >>>The issue isn't unique to NVMe, and can be any fast devices which
> >>>interrupts CPU too frequently, meantime the interrupt/softirq handler may
> >>>take a bit much time, then CPU is easy to be lockup by the interrupt/sofirq
> >>>handler, especially in case that multiple submission CPUs vs. single
> >>>completion CPU.
> >>>
> >>>Some SCSI devices has the same problem too.
> >>>
> >>>Could we consider to add one generic mechanism to cover this kind of
> >>>problem?
> >>>
> >>>One approach I thought of is to allocate one backup thread for handling such
> >>>interrupt, which can be marked as IRQF_BACKUP_THREAD by drivers.
> >>>
> >>>Inside do_IRQ(), irqtime is accounted, before calling action->handler(),
> >>>check if this CPU has taken too long time for handling IRQ(interrupt or
> >>>softirq) and see if this CPU could be lock up. If yes, wakeup the backup
>
> How do you know if this CPU is spending all the time in do_IRQ()?
>
> Is it something like:
> If (IRQ_time /elapsed_time > a threshold value)
> wake up the backup thread

Yeah, the above could work in theory.

Another approach I thought of is to monitor average irq gap time on each
CPU.

We could use an EWMA (Exponential Weighted Moving Average) to do it simply,
such as:

curr_irq_gap(cpu) = current start time of do_IRQ() on 'cpu' -
end time of last do_IRQ() on 'cpu'
avg_irq_gap(cpu) = weight_prev * avg_irq_gap(cpu) + weight_curr * curr_irq_gap(cpu)

note:
weight_prev + weight_curr = 1

When avg_irq_gap(cpu) drops below a small enough threshold, we consider an irq
flood to be detected.

'weight_prev' could be chosen large enough to avoid triggering on a short-lived flood.
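
To make the math concrete, here is a quick userspace sketch of the idea; the
shift-based 7/8 vs 1/8 weights, the 10us flood threshold and the simulated
interrupt timings are placeholders picked for illustration, not values from a
real patch:

--
/* Minimal userspace sketch of the EWMA irq-gap detection; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

#define EWMA_SHIFT      3       /* weight_prev = 7/8, weight_curr = 1/8 */
#define FLOOD_GAP_NS    10000   /* treat an average gap below 10us as a flood */

struct irq_gap_state {
        unsigned long long last_irq_end_ns;     /* end time of the last do_IRQ() */
        unsigned long long avg_gap_ns;          /* EWMA of the irq gap */
};

/* Called at the start of interrupt handling with the current timestamp. */
static bool irq_flood_detected(struct irq_gap_state *s, unsigned long long now_ns)
{
        unsigned long long gap = now_ns - s->last_irq_end_ns;

        /* avg = avg * 7/8 + gap * 1/8 */
        s->avg_gap_ns = s->avg_gap_ns - (s->avg_gap_ns >> EWMA_SHIFT)
                        + (gap >> EWMA_SHIFT);

        return s->avg_gap_ns < FLOOD_GAP_NS;
}

/* Called when interrupt handling on this CPU finishes. */
static void irq_note_end(struct irq_gap_state *s, unsigned long long now_ns)
{
        s->last_irq_end_ns = now_ns;
}

int main(void)
{
        struct irq_gap_state s = { .avg_gap_ns = 1000000 };     /* start relaxed */
        unsigned long long t = 0;
        int i;

        /* Simulate interrupts arriving every 2us, each handler taking 1us. */
        for (i = 0; i < 200; i++) {
                if (irq_flood_detected(&s, t)) {
                        printf("irq %d: flood detected, avg gap %llu ns\n",
                               i, s.avg_gap_ns);
                        break;
                }
                irq_note_end(&s, t + 1000);     /* 1us spent in the handler */
                t += 2000;                      /* next irq fires 2us later */
        }
        return 0;
}
--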


Thanks,
Ming