2013-06-20 20:17:18

by Matthew Wilcox

Subject: RFC: Allow block drivers to poll for I/O instead of sleeping


A paper at FAST2012
(http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf) pointed
out the performance overhead of taking interrupts for low-latency block
I/Os. The solution the author investigated was to spin waiting for each
I/O to complete. This is inefficient as Linux submits many I/Os which
are not latency-sensitive, and even when we do submit latency-sensitive
I/Os (eg swap-in), we frequently submit several I/Os before waiting.

This RFC takes a different approach, only spinning when we would
otherwise sleep. To implement this, I add an 'io_poll' function pointer
to backing_dev_info. I include a sample implementation for the NVMe
driver. Next, I add an io_wait() function which will call io_poll()
if it is set. It falls back to calling io_schedule() if anything goes
wrong with io_poll() or the task exceeds its timeslice. Finally, all
that is left is to judiciously replace calls to io_schedule() with
calls to io_wait(). I think I've covered the main contenders with
sleep_on_page(), sleep_on_buffer() and the DIO path.

I've measured the performance benefits of this with a Chatham NVMe
prototype device and a simple
# dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=1000000
The latency of each I/O reduces by about 2.5us (from around 8.0us to
around 5.5us). This matches up quite well with the performance numbers
shown in the FAST2012 paper (which used a similar device).

Is backing_dev_info the right place for this function pointer?
Have I made any bad assumptions about the correct way to retrieve
the backing_dev_info from (eg) a struct page, buffer_head or kiocb?
Should I pass a 'state' into io_wait() instead of retrieving it from
'current'? Is kernel/sched/core.c the right place for io_wait()?
Should it be bdi_wait() instead? Should it be marked with __sched?
Where should I add documentation for the io_poll() function pointer?

NB: The NVMe driver piece of this is for illustrative purposes only and
should not be applied. I've rearranged the diff so that the interesting
pieces appear first.

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..97f8fde 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -68,6 +68,7 @@ struct backing_dev_info {
	unsigned int capabilities; /* Device capabilities */
	congested_fn *congested_fn; /* Function pointer if device is md/dm */
	void *congested_data;	/* Pointer to aux data for congested func */
+	int (*io_poll)(struct backing_dev_info *);

	char *name;

@@ -126,6 +127,8 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi);
void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);

+void io_wait(struct backing_dev_info *bdi);
+
extern spinlock_t bdi_lock;
extern struct list_head bdi_list;

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 58453b8..4840065 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4527,6 +4527,36 @@ long __sched io_schedule_timeout(long timeout)
	return ret;
}

+/*
+ * Wait for an I/O to complete against this backing_dev_info. If the
+ * task exhausts its timeslice polling for completions, call io_schedule()
+ * anyway. If a signal becomes pending, return so the task can handle it.
+ * If io_poll() returns an error, give up and call io_schedule(), but
+ * swallow the error. We may miss an I/O completion (eg if the interrupt
+ * handler gets to it first). Guard against this possibility by returning
+ * if we've been set back to TASK_RUNNING.
+ */
+void io_wait(struct backing_dev_info *bdi)
+{
+	if (bdi && bdi->io_poll) {
+		long state = current->state;
+		while (!need_resched()) {
+			int ret = bdi->io_poll(bdi);
+			if ((ret > 0) || signal_pending_state(state, current)) {
+				set_current_state(TASK_RUNNING);
+				return;
+			}
+			if (current->state == TASK_RUNNING)
+				return;
+			if (ret < 0)
+				break;
+			cpu_relax();
+		}
+	}
+
+	io_schedule();
+}
+
/**
* sys_sched_get_priority_max - return maximum RT priority.
* @policy: scheduling class.
diff --git a/fs/aio.c b/fs/aio.c
index 2bbcacf..7b20397 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -453,11 +453,15 @@ static void kill_ioctx(struct kioctx *ctx)
*/
ssize_t wait_on_sync_kiocb(struct kiocb *iocb)
{
+	struct backing_dev_info *bdi = NULL;
+
+	if (iocb->ki_filp)
+		bdi = iocb->ki_filp->f_mapping->backing_dev_info;
	while (atomic_read(&iocb->ki_users)) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!atomic_read(&iocb->ki_users))
			break;
-		io_schedule();
+		io_wait(bdi);
	}
	__set_current_state(TASK_RUNNING);
	return iocb->ki_user_data;
diff --git a/fs/buffer.c b/fs/buffer.c
index d2a4d1b..4502909 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -63,7 +63,13 @@ EXPORT_SYMBOL(touch_buffer);

static int sleep_on_buffer(void *word)
{
-	io_schedule();
+	struct buffer_head *bh;
+	struct backing_dev_info *bdi = NULL;
+
+	bh = container_of(word, struct buffer_head, b_state);
+	if (bh->b_assoc_map)
+		bdi = bh->b_assoc_map->backing_dev_info;
+	io_wait(bdi);
	return 0;
}

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7ab90f5..5a0fe06 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -410,6 +410,8 @@ static struct bio *dio_await_one(struct dio *dio)
{
	unsigned long flags;
	struct bio *bio = NULL;
+	struct address_space *mapping = dio->iocb->ki_filp->f_mapping;
+	struct backing_dev_info *bdi = mapping->backing_dev_info;

	spin_lock_irqsave(&dio->bio_lock, flags);

@@ -423,7 +425,7 @@ static struct bio *dio_await_one(struct dio *dio)
		__set_current_state(TASK_UNINTERRUPTIBLE);
		dio->waiter = current;
		spin_unlock_irqrestore(&dio->bio_lock, flags);
-		io_schedule();
+		io_wait(bdi);
		/* wake up sets us TASK_RUNNING */
		spin_lock_irqsave(&dio->bio_lock, flags);
		dio->waiter = NULL;
diff --git a/mm/filemap.c b/mm/filemap.c
index 7905fe7..25d3d51 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -178,7 +178,14 @@ EXPORT_SYMBOL(delete_from_page_cache);

static int sleep_on_page(void *word)
{
-	io_schedule();
+	struct page *page = container_of(word, struct page, flags);
+	struct backing_dev_info *bdi = NULL;
+	struct address_space *mapping = page->mapping;
+
+	if (mapping && !((unsigned long)mapping & 1))
+		bdi = mapping->backing_dev_info;
+
+	io_wait(bdi);
	return 0;
}

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index ce79a59..8fe4f27 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -79,7 +82,8 @@ struct nvme_queue {
	u16 sq_head;
	u16 sq_tail;
	u16 cq_head;
-	u16 cq_phase;
+	u8 cq_phase;
+	u8 irq_processed;
	unsigned long cmdid_data[];
};

@@ -727,7 +732,7 @@ static void nvme_make_request(struct request_queue *q, struct bio *bio)
	put_nvmeq(nvmeq);
}

-static irqreturn_t nvme_process_cq(struct nvme_queue *nvmeq)
+static int nvme_process_cq(struct nvme_queue *nvmeq)
{
	u16 head, phase;

@@ -758,13 +767,14 @@ static irqreturn_t nvme_process_cq(struct nvme_queue *nvmeq)
	 * a big problem.
	 */
	if (head == nvmeq->cq_head && phase == nvmeq->cq_phase)
-		return IRQ_NONE;
+		return 0;

	writel(head, nvmeq->q_db + (1 << nvmeq->dev->db_stride));
	nvmeq->cq_head = head;
	nvmeq->cq_phase = phase;

-	return IRQ_HANDLED;
+	nvmeq->irq_processed = 1;
+	return 1;
}

static irqreturn_t nvme_irq(int irq, void *data)
@@ -772,7 +782,9 @@ static irqreturn_t nvme_irq(int irq, void *data)
	irqreturn_t result;
	struct nvme_queue *nvmeq = data;
	spin_lock(&nvmeq->q_lock);
-	result = nvme_process_cq(nvmeq);
+	nvme_process_cq(nvmeq);
+	result = nvmeq->irq_processed ? IRQ_HANDLED : IRQ_NONE;
+	nvmeq->irq_processed = 0;
	spin_unlock(&nvmeq->q_lock);
	return result;
}
@@ -1556,6 +1567,27 @@ static void nvme_config_discard(struct nvme_ns *ns)
	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
}

+static int nvme_io_poll(struct backing_dev_info *bdi)
+{
+	struct request_queue *q = container_of(bdi, struct request_queue,
+							backing_dev_info);
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = ns->dev;
+	int i, found = 0;
+
+	for (i = 1; i < dev->queue_count; i++) {
+		struct nvme_queue *nvmeq = dev->queues[i];
+		if (!nvmeq)
+			continue;
+		spin_lock_irq(&nvmeq->q_lock);
+		if (nvme_process_cq(nvmeq))
+			found++;
+		spin_unlock_irq(&nvmeq->q_lock);
+	}
+
+	return found;
+}
+
static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, int nsid,
			struct nvme_id_ns *id, struct nvme_lba_range_type *rt)
{
@@ -1578,6 +1610,7 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, int nsid,
	blk_queue_make_request(ns->queue, nvme_make_request);
	ns->dev = dev;
	ns->queue->queuedata = ns;
+	ns->queue->backing_dev_info.io_poll = nvme_io_poll;

	disk = alloc_disk(NVME_MINORS);
	if (!disk)


2013-06-23 10:09:30

by Ingo Molnar

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping


* Matthew Wilcox <[email protected]> wrote:

>
> A paper at FAST2012
> (http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf) pointed
> out the performance overhead of taking interrupts for low-latency block
> I/Os. The solution the author investigated was to spin waiting for each
> I/O to complete. This is inefficient as Linux submits many I/Os which
> are not latency-sensitive, and even when we do submit latency-sensitive
> I/Os (eg swap-in), we frequently submit several I/Os before waiting.
>
> This RFC takes a different approach, only spinning when we would
> otherwise sleep. To implement this, I add an 'io_poll' function pointer
> to backing_dev_info. I include a sample implementation for the NVMe
> driver. Next, I add an io_wait() function which will call io_poll()
> if it is set. It falls back to calling io_schedule() if anything goes
> wrong with io_poll() or the task exceeds its timeslice. Finally, all
> that is left is to judiciously replace calls to io_schedule() with
> calls to io_wait(). I think I've covered the main contenders with
> sleep_on_page(), sleep_on_buffer() and the DIO path.
>
> I've measured the performance benefits of this with a Chatham NVMe
> prototype device and a simple
> # dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=1000000
> The latency of each I/O reduces by about 2.5us (from around 8.0us to
> around 5.5us). This matches up quite well with the performance numbers
> shown in the FAST2012 paper (which used a similar device).

Nice speedup!

> Is backing_dev_info the right place for this function pointer?
> Have I made any bad assumptions about the correct way to retrieve
> the backing_dev_info from (eg) a struct page, buffer_head or kiocb?
> Should I pass a 'state' into io_wait() instead of retrieving it from
> 'current'? Is kernel/sched/core.c the right place for io_wait()?
> Should it be bdi_wait() instead? Should it be marked with __sched?
> Where should I add documentation for the io_poll() function pointer?
>
> NB: The NVMe driver piece of this is for illustrative purposes only and
> should not be applied. I've rearranged the diff so that the interesting
> pieces appear first.


-EMISSINGDIFFSTAT

> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c388155..97f8fde 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -68,6 +68,7 @@ struct backing_dev_info {
> 	unsigned int capabilities; /* Device capabilities */
> 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
> 	void *congested_data;	/* Pointer to aux data for congested func */
> +	int (*io_poll)(struct backing_dev_info *);
>
> 	char *name;
>
> @@ -126,6 +127,8 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi);
> void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
> void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
>
> +void io_wait(struct backing_dev_info *bdi);
> +
> extern spinlock_t bdi_lock;
> extern struct list_head bdi_list;
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 58453b8..4840065 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4527,6 +4527,36 @@ long __sched io_schedule_timeout(long timeout)
> 	return ret;
> }
>
> +/*
> + * Wait for an I/O to complete against this backing_dev_info. If the
> + * task exhausts its timeslice polling for completions, call io_schedule()
> + * anyway. If a signal becomes pending, return so the task can handle it.
> + * If io_poll() returns an error, give up and call io_schedule(), but
> + * swallow the error. We may miss an I/O completion (eg if the interrupt
> + * handler gets to it first). Guard against this possibility by returning
> + * if we've been set back to TASK_RUNNING.
> + */
> +void io_wait(struct backing_dev_info *bdi)
> +{
> +	if (bdi && bdi->io_poll) {
> +		long state = current->state;
> +		while (!need_resched()) {
> +			int ret = bdi->io_poll(bdi);
> +			if ((ret > 0) || signal_pending_state(state, current)) {
> +				set_current_state(TASK_RUNNING);
> +				return;
> +			}
> +			if (current->state == TASK_RUNNING)
> +				return;
> +			if (ret < 0)
> +				break;
> +			cpu_relax();
> +		}
> +	}
> +
> +	io_schedule();
> +}

I'm wondering why this makes such a performance difference.

If an IO driver is implemented properly then it will batch up requests for
the controller, and get IRQ-notified on a (sub-)batch of buffers
completed.

If there's any spinning done then it should be NAPI-alike polling: a
single "is stuff completed" polling pass per new block of work submitted,
to opportunistically interleave completion with submission work.

I don't see how active spinning would improve performance
compared to a NAPI-alike technique. Your numbers obviously show a speedup
we'd like to have, I'm just wondering whether the same speedup (or even
more) could be implemented via:

- smart batching that rate-limits completion IRQs in essence
+ NAPI-alike polling

... which would almost never result in IRQ driven completion when we are
close to CPU-bound and while not yet saturating the IO controller's
capacity.

The spinning approach you add has the disadvantage of actively wasting CPU
time, which could be used to run other tasks. In general it's much better
to make sure the completion IRQs are rate-limited and just schedule. This
(combined with a metric ton of fine details) is what the networking code
does in essence, and they have no trouble reaching very high throughput.

Thanks,

Ingo

2013-06-23 18:29:57

by Linus Torvalds

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On Sun, Jun 23, 2013 at 12:09 AM, Ingo Molnar <[email protected]> wrote:
>
> The spinning approach you add has the disadvantage of actively wasting CPU
> time, which could be used to run other tasks. In general it's much better
> to make sure the completion IRQs are rate-limited and just schedule. This
> (combined with a metric ton of fine details) is what the networking code
> does in essence, and they have no trouble reaching very high throughput.

It's not about throughput - it's about latency. Don't ever confuse the
two, they have almost nothing in common. Networking very very seldom
has the kind of "submit and wait for immediate result" issues that
disk reads do.

That said, I dislike the patch intensely. I do not think it's at all a
good idea to look at "need_resched" to say "I can spin now". You're
still wasting CPU cycles.

So Willy, please do *not* mix this up with the scheduler, or at least
not "need_resched". Instead, maybe we should introduce a notion of "if
we are switching to the idle thread, let's see if we can try to do
some IO synchronously".

You could try to do that either *in* the idle thread (which would take
the context switch overhead - maybe negating some of the advantages),
or alternatively hook into the scheduler idle logic before actually
doing the switch.

But anything that starts polling when there are other runnable
processes to be done sounds really debatable. Even if it's "only" 5us
or so. There's a lot of real work that could be done in 5us.

Linus

2013-06-23 22:14:41

by David Ahern

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On 6/23/13 3:09 AM, Ingo Molnar wrote:
> If an IO driver is implemented properly then it will batch up requests for
> the controller, and get IRQ-notified on a (sub-)batch of buffers
> completed.
>
> If there's any spinning done then it should be NAPI-alike polling: a
> single "is stuff completed" polling pass per new block of work submitted,
> to opportunistically interleave completion with submission work.
>
> I don't see how active spinning would improve performance
> compared to a NAPI-alike technique. Your numbers obviously show a speedup
> we'd like to have, I'm just wondering whether the same speedup (or even
> more) could be implemented via:
>
> - smart batching that rate-limits completion IRQs in essence
> + NAPI-alike polling
>
> ... which would almost never result in IRQ driven completion when we are
> close to CPU-bound and while not yet saturating the IO controller's
> capacity.
>
> The spinning approach you add has the disadvantage of actively wasting CPU
> time, which could be used to run other tasks. In general it's much better
> to make sure the completion IRQs are rate-limited and just schedule. This
> (combined with a metric ton of fine details) is what the networking code
> does in essence, and they have no trouble reaching very high throughput.

Networking code has a similar proposal for low latency sockets using
polling: https://lwn.net/Articles/540281/

David

2013-06-24 07:15:56

by Jens Axboe

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On Sun, Jun 23 2013, Ingo Molnar wrote:
> I'm wondering why this makes such a performance difference.

The key ingredient here is simply not going to sleep, only to get an
IRQ and get woken up very shortly again. NAPI and similar approaches
work great for high IOPS cases, where you maintain a certain depth of
IO. For lower queue depth or sync IO (like Willy is running here),
nothing beats the pure app driven poll from a latency perspective. I've
seen plenty of systems where the power management is so aggressive
that you manage to enter lower C-states very quickly, and that then of
course makes things even worse. Intelligent polling would make that less
of a problem.

Willy, I think the general design is fine, hooking in via the bdi is the
only way to get back to the right place from where you need to sleep.
Some thoughts:

- This should be hooked in via blk-iopoll, both of them should call into
the same driver hook for polling completions (rough sketch below).

- It needs to be more intelligent in when you want to poll and when you
want regular irq driven IO.

- With the former note, the app either needs to opt in (and hence
willingly sacrifice CPU cycles of its scheduling slice) or it needs to
be nicer in when it gives up and goes back to irq driven IO.
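
To make the first point concrete, here is roughly the blk-iopoll shape I
have in mind. This is purely a sketch: the nvme_* helpers and the
per-queue 'iopoll' member are invented for illustration.

static int nvme_iopoll_fn(struct blk_iopoll *iop, int budget)
{
	struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue,
						iopoll);
	int done;

	spin_lock_irq(&nvmeq->q_lock);
	done = nvme_reap_cq(nvmeq, budget);	/* invented helper */
	spin_unlock_irq(&nvmeq->q_lock);

	if (done < budget) {
		blk_iopoll_complete(iop);	/* done for now */
		nvme_unmask_cq_irq(nvmeq);	/* invented helper */
	}
	return done;
}

The hard irq handler would then just mask the queue interrupt and call
blk_iopoll_sched(&nvmeq->iopoll), with blk_iopoll_init()/_enable() done
at queue setup time. io_wait() would call into the same nvme_reap_cq()
hook, so the driver only implements one completion-reaping path.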

--
Jens Axboe

2013-06-24 07:17:26

by Jens Axboe

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On Sun, Jun 23 2013, Linus Torvalds wrote:
> nothing in common. Networking very very seldom
> has the kind of "submit and wait for immediate result" issues that
> disk reads do.
>
> That said, I dislike the patch intensely. I do not think it's at all a
> good idea to look at "need_resched" to say "I can spin now". You're
> still wasting CPU cycles.
>
> So Willy, please do *not* mix this up with the scheduler, or at least
> not "need_resched". Instead, maybe we should introduce a notion of "if
> we are switching to the idle thread, let's see if we can try to do
> some IO synchronously".
>
> You could try to do that either *in* the idle thread (which would take
> the context switch overhead - maybe negating some of the advantages),
> or alternatively hook into the scheduler idle logic before actually
> doing the switch.

It can't happen in the idle thread. If you need to take the context
switch, then you've negated pretty much all of the gain of the polled
approach.

> But anything that starts polling when there are other runnable
> processes to be done sounds really debatable. Even if it's "only" 5us
> or so. There's a lot of real work that could be done in 5us.

IMHO that depends mostly on whether the app wants to give up those 5us
of work time just spinning on the completion of important IO. You
obviously can't steal the time of others.

--
Jens Axboe

2013-06-24 08:08:03

by Ingo Molnar

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping


* Linus Torvalds <[email protected]> wrote:

> On Sun, Jun 23, 2013 at 12:09 AM, Ingo Molnar <[email protected]> wrote:
> >
> > The spinning approach you add has the disadvantage of actively wasting
> > CPU time, which could be used to run other tasks. In general it's much
> > better to make sure the completion IRQs are rate-limited and just
> > schedule. This (combined with a metric ton of fine details) is what
> > the networking code does in essence, and they have no trouble reaching
> > very high throughput.
>
> It's not about throughput - it's about latency. Don't ever confuse the
> two, they have almost nothing in common. Networking very very seldom has
> the kind of "submit and wait for immediate result" issues that disk
> reads do.

Yeah, indeed that's true, the dd measurement Matthew did issued IO
one sector at a time, waiting for every sector to complete:

dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=1000000

So my suggestions about batching and IRQ rate control are immaterial...

> That said, I dislike the patch intensely. I do not think it's at all a
> good idea to look at "need_resched" to say "I can spin now". You're
> still wasting CPU cycles.
>
> So Willy, please do *not* mix this up with the scheduler, or at least
> not "need_resched". Instead, maybe we should introduce a notion of "if
> we are switching to the idle thread, let's see if we can try to do some
> IO synchronously".
>
> You could try to do that either *in* the idle thread (which would take
> the context switch overhead - maybe negating some of the advantages), or
> alternatively hook into the scheduler idle logic before actually doing
> the switch.
>
> But anything that starts polling when there are other runnable processes
> to be done sounds really debatable. Even if it's "only" 5us or so.
> There's a lot of real work that could be done in 5us.

I'm wondering, how will this scheme work if the IO completion latency is a
lot more than the 5 usecs in the testcase? What if it takes 20 usecs or
100 usecs or more?

Will we still burn our CPU time, wasting power, inflating this CPU's load
which keeps other CPUs from balancing tasks over to this CPU, etc?

In the 5 usecs case it looks beneficial to do. In the longer-latency cases
I'm not so sure.

Thanks,

Ingo

2013-06-24 08:18:45

by Ingo Molnar

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping


* Jens Axboe <[email protected]> wrote:

> - With the former note, the app either needs to opt in (and hence
> willingly sacrifice CPU cycles of its scheduling slice) or it needs to
> be nicer in when it gives up and goes back to irq driven IO.

The scheduler could look at sleep latency averages of the task in question
- we measure that already in most cases.

If the 'average sleep latency' is below a certain threshold, the
scheduler, if it sees that the CPU is about to go idle, could delay doing
the context switch and do "light idle-polling", for say twice the length
of the expected sleep latency - assuming the CPU is otherwise idle -
before it really schedules away the task and the CPU goes idle.

This would still require an IRQ and a wakeup to be taken, but would avoid
the context switch.
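
In completely untested pseudo-code, with avg_sleep_latency and
sysctl_idle_poll_cutoff being invented names for illustration:

	/* pre-idle path, before we commit to switching to the idle task */
	u64 expect = prev->se.avg_sleep_latency;	/* assumed metric */

	if (!rq->nr_running && expect < sysctl_idle_poll_cutoff) {
		u64 deadline = sched_clock() + 2 * expect;

		/* the completion IRQ sets prev back to TASK_RUNNING */
		while (!need_resched() && sched_clock() < deadline)
			cpu_relax();
	}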

Yet I have an ungood feeling about depending on actual latency values so
explicitly. There will have to be a cutoff value, and if a workload is
just below or just above that threshold then behavior will change
markedly. Such schemes rarely worked out nicely in the past. [Might still
be worth trying it.]

Couldn't the block device driver itself estimate the expected latency of
IO completion and simply poll if that's expected to be very short [such as
there's only a single outstanding IO to a RAM backed device]? IO drivers
doing some polling and waiting in the microseconds range isn't overly
controversial. I'd even do that if the CPU is busy otherwise: the task
should see a proportional slowdown as load increases, with no change in IO
queueing behavior.
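
I.e. something like the sketch below, where every name is invented and
the 10 usecs cutoff is pulled out of thin air:

	/* at completion time: EWMA of recent latencies, 1/8 gain */
	s64 err = (s64)(lat_ns - nvmeq->avg_lat_ns);
	nvmeq->avg_lat_ns += err >> 3;

	/* at wait time: only poll when the expected wait is tiny */
	if (nvmeq->nr_outstanding == 1 && nvmeq->avg_lat_ns < 10000)
		nvme_poll_for_completion(nvmeq);	/* invented */
	else
		nvme_wait_for_irq(nvmeq);		/* invented */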

Thanks,

Ingo

2013-06-24 08:21:53

by Ingo Molnar

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping


* David Ahern <[email protected]> wrote:

> On 6/23/13 3:09 AM, Ingo Molnar wrote:
> >If an IO driver is implemented properly then it will batch up requests for
> >the controller, and get IRQ-notified on a (sub-)batch of buffers
> >completed.
> >
> >If there's any spinning done then it should be NAPI-alike polling: a
> >single "is stuff completed" polling pass per new block of work submitted,
> >to opportunistically interleave completion with submission work.
> >
> >I don't see how active spinning would improve performance
> >compared to a NAPI-alike technique. Your numbers obviously show a speedup
> >we'd like to have, I'm just wondering whether the same speedup (or even
> >more) could be implemented via:
> >
> > - smart batching that rate-limits completion IRQs in essence
> > + NAPI-alike polling
> >
> >... which would almost never result in IRQ driven completion when we are
> >close to CPU-bound and while not yet saturating the IO controller's
> >capacity.
> >
> >The spinning approach you add has the disadvantage of actively wasting CPU
> >time, which could be used to run other tasks. In general it's much better
> >to make sure the completion IRQs are rate-limited and just schedule. This
> >(combined with a metric ton of fine details) is what the networking code
> >does in essence, and they have no trouble reaching very high throughput.
>
> Networking code has a similar proposal for low latency sockets using
> polling: https://lwn.net/Articles/540281/

In that case it might make sense to try the generic approach I suggested
in the previous mail, which would measure average sleep latencies of
tasks, and would do light idle-polling instead of the more expensive
switch-to-the-idle-task context switch plus associated RCU, nohz, etc.
busy-CPU-tear-down and the symmetric build-up work on idle wakeup.

The IO driver would still have to take an IRQ though, preferably on the
CPU that runs the IO task ...

Thanks,

Ingo

2013-06-25 00:11:09

by Steven Rostedt

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On Mon, Jun 24, 2013 at 09:17:18AM +0200, Jens Axboe wrote:
> On Sun, Jun 23 2013, Linus Torvalds wrote:
> >
> > You could try to do that either *in* the idle thread (which would take
> > the context switch overhead - maybe negating some of the advantages),
> > or alternatively hook into the scheduler idle logic before actually
> > doing the switch.
>
> It can't happen in the idle thread. If you need to take the context
> switch, then you've negated pretty much all of the gain of the polled
> approach.

What about hooking into the idle_balance code? That happens if we are
about to go to idle but before the full schedule switch to the idle
task.


In __schedule(void):

	if (unlikely(!rq->nr_running))
		idle_balance(cpu, rq);

-- Steve

2013-06-25 03:01:45

by Matthew Wilcox

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On Mon, Jun 24, 2013 at 09:15:45AM +0200, Jens Axboe wrote:
> Willy, I think the general design is fine, hooking in via the bdi is the
> only way to get back to the right place from where you need to sleep.
> Some thoughts:
>
> - This should be hooked in via blk-iopoll, both of them should call into
> the same driver hook for polling completions.

I actually started working on this, then I realised that it's actually
a bad idea. blk-iopoll's poll function is to poll the single I/O queue
closest to this CPU. The iowait poll function is to poll all queues
that the I/O for this address_space might complete on.

I'm reluctant to ask drivers to define two poll functions, but I'm even
more reluctant to ask them to define one function with two purposes.

> - It needs to be more intelligent in when you want to poll and when you
> want regular irq driven IO.

Oh yeah, absolutely. While the example patch didn't show it, I wouldn't
enable it for all NVMe devices; only ones with sufficiently low latency.
There's also the ability for the driver to look at the number of
outstanding I/Os and return an error (eg -EBUSY) to stop spinning.

> - With the former note, the app either needs to opt in (and hence
> willingly sacrifice CPU cycles of its scheduling slice) or it needs to
> be nicer in when it gives up and goes back to irq driven IO.

Yup. I like the way you framed it. If the task *wants* to spend its
CPU cycles on polling for I/O instead of giving up the remainder of its
time slice, then it should be able to do that. After all, it already can;
it can submit an I/O request via AIO, and then call io_getevents in a
tight loop.
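
Something like this untested sketch (the device path and sizes are only
illustrative; link with -laio):

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
	io_context_t ctx;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	struct timespec zero = { 0, 0 };
	void *buf;
	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

	memset(&ctx, 0, sizeof(ctx));
	io_setup(1, &ctx);
	posix_memalign(&buf, 512, 512);

	io_prep_pread(&cb, fd, buf, 512, 0);
	io_submit(ctx, 1, cbs);

	/* burn our own timeslice: a zero timeout never sleeps */
	while (io_getevents(ctx, 0, 1, &ev, &zero) < 1)
		;

	io_destroy(ctx);
	return 0;
}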

So maybe the right way to do this is with a task flag? If we go that
route, I'd like to further develop this option to allow I/Os to be
designated as "low latency" vs "normal". Taking a page fault would be
"low latency" for all tasks, not just ones that choose to spin for I/O.

2013-06-25 03:07:30

by Matthew Wilcox

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On Mon, Jun 24, 2013 at 08:11:02PM -0400, Steven Rostedt wrote:
> What about hooking into the idle_balance code? That happens if we are
> about to go to idle but before the full schedule switch to the idle
> task.
>
>
> In __schedule(void):
>
> 	if (unlikely(!rq->nr_running))
> 		idle_balance(cpu, rq);

That may be a great place to try it from the PoV of the scheduler, but are
you OK with me threading a struct backing_dev_info * all the way through
the scheduler to idle_balance()? :-)

2013-06-25 03:18:14

by Matthew Wilcox

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On Mon, Jun 24, 2013 at 10:07:51AM +0200, Ingo Molnar wrote:
> I'm wondering, how will this scheme work if the IO completion latency is a
> lot more than the 5 usecs in the testcase? What if it takes 20 usecs or
> 100 usecs or more?

There's clearly a threshold at which it stops making sense, and our
current NAND-based SSDs are almost certainly on the wrong side of that
threshold! I can't wait for one of the "post-NAND" technologies to make
it to market in some form that makes it economical to use in an SSD.

The problem is that some of the people who are looking at those
technologies are crazy. They want to "bypass the kernel" and "do user
space I/O" because "the kernel is too slow". This patch is part of an
effort to show them how crazy they are. And even if it doesn't convince
them, at least users who refuse to rewrite their applications to take
advantage of magical userspace I/O libraries will see real performance
benefits.

2013-06-25 07:07:38

by Bart Van Assche

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On 06/25/13 05:18, Matthew Wilcox wrote:
> On Mon, Jun 24, 2013 at 10:07:51AM +0200, Ingo Molnar wrote:
>> I'm wondering, how will this scheme work if the IO completion latency is a
>> lot more than the 5 usecs in the testcase? What if it takes 20 usecs or
>> 100 usecs or more?
>
> There's clearly a threshold at which it stops making sense, and our
> current NAND-based SSDs are almost certainly on the wrong side of that
> threshold! I can't wait for one of the "post-NAND" technologies to make
> it to market in some form that makes it economical to use in an SSD.
>
> The problem is that some of the people who are looking at those
> technologies are crazy. They want to "bypass the kernel" and "do user
> space I/O" because "the kernel is too slow". This patch is part of an
> effort to show them how crazy they are. And even if it doesn't convince
> them, at least users who refuse to rewrite their applications to take
> advantage of magical userspace I/O libraries will see real performance
> benefits.

Recently I attended an interesting talk about this subject in which it
was proposed not only to bypass the kernel for access to high-IOPS
devices but also to allow byte-addressability for block devices. The
slides that accompanied that talk can be found here (includes a
performance comparison with the traditional block driver API):

Bernard Metzler, On Suitability of High-Performance Networking API for
Storage, OFA Int'l Developer Workshop, April 24, 2013
(http://www.openfabrics.org/ofa-documents/presentations/doc_download/559-on-suitability-of-high-performance-networking-api-for-storage.html).

This approach leaves the choice of whether to use polling or an
interrupt-based completion notification to the user of the new API,
something the Linux InfiniBand RDMA verbs API already allows today.

Bart.

2013-06-25 13:57:32

by Steven Rostedt

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On Mon, 2013-06-24 at 23:07 -0400, Matthew Wilcox wrote:
> On Mon, Jun 24, 2013 at 08:11:02PM -0400, Steven Rostedt wrote:
> > What about hooking into the idle_balance code? That happens if we are
> > about to go to idle but before the full schedule switch to the idle
> > task.
> >
> >
> > In __schedule(void):
> >
> > 	if (unlikely(!rq->nr_running))
> > 		idle_balance(cpu, rq);
>
> That may be a great place to try it from the PoV of the scheduler, but are
> you OK with me threading a struct backing_dev_info * all the way through
> the scheduler to idle_balance()? :-)

Well, there's other ways to pass data down besides parameters. You could
attach something to the task itself.

-- Steve

2013-06-25 14:55:32

by Jens Axboe

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On Mon, Jun 24 2013, Matthew Wilcox wrote:
> On Mon, Jun 24, 2013 at 09:15:45AM +0200, Jens Axboe wrote:
> > Willy, I think the general design is fine, hooking in via the bdi is the
> > only way to get back to the right place from where you need to sleep.
> > Some thoughts:
> >
> > - This should be hooked in via blk-iopoll, both of them should call into
> > the same driver hook for polling completions.
>
> I actually started working on this, then I realised that it's actually
> a bad idea. blk-iopoll's poll function is to poll the single I/O queue
> closest to this CPU. The iowait poll function is to poll all queues
> that the I/O for this address_space might complete on.

blk_iopoll can be tied to "whatever". It was originally designed to be
tied to the queue, which would make it driver-wide. So there's no intent
for it to poll only a subset of the device, though if you tie it to a
completion queue (which would be most natural), then it'd only find
completions there.

I didn't look at your nvme end of the implementation - if you could
reliably map to the right completion queue, then it would have the same
mapping as iopoll on a per-completion queue basis. If you can't, then
you have to poll all of them. That doesn't seem like it would scale well
with more than a few applications banging on a device.

> I'm reluctant to ask drivers to define two poll functions, but I'm even
> more reluctant to ask them to define one function with two purposes.
>
> > - It needs to be more intelligent in when you want to poll and when you
> > want regular irq driven IO.
>
> Oh yeah, absolutely. While the example patch didn't show it, I wouldn't
> enable it for all NVMe devices; only ones with sufficiently low latency.
> There's also the ability for the driver to look at the number of
> outstanding I/Os and return an error (eg -EBUSY) to stop spinning.

There might also be read vs write differences. Say some devices complete
writes very quickly, but reads are slower. Or vice versa. And then
there's the inevitable "some IOs are slow, but usually very fast". But
that can't really be handled except giving up on the polling at some
point.

> > - With the former note, the app either needs to opt in (and hence
> > willingly sacrifice CPU cycles of its scheduling slice) or it needs to
> > be nicer in when it gives up and goes back to irq driven IO.
>
> Yup. I like the way you framed it. If the task *wants* to spend its
> CPU cycles on polling for I/O instead of giving up the remainder of its
> time slice, then it should be able to do that. After all, it already can;
> it can submit an I/O request via AIO, and then call io_getevents in a
> tight loop.

Exactly, that was my point. Or it can busy loop just checking the aio
ring, at which point it's really stupid to be IRQ driven at all. It'd be
much better to have the app reap the completion directly.

> So maybe the right way to do this is with a task flag? If we go that
> route, I'd like to further develop this option to allow I/Os to be
> designated as "low latency" vs "normal". Taking a page fault would be
> "low latency" for all tasks, not just ones that choose to spin for I/O.

Not sure, I'd have to think about it some more. It's a mix of what the
application decides to do, but also what the underlying device can do.
And there might be fs implications, etc.

--
Jens Axboe

2013-06-25 14:58:09

by Jens Axboe

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On Mon, Jun 24 2013, Steven Rostedt wrote:
> On Mon, Jun 24, 2013 at 09:17:18AM +0200, Jens Axboe wrote:
> > On Sun, Jun 23 2013, Linus Torvalds wrote:
> > >
> > > You could try to do that either *in* the idle thread (which would take
> > > the context switch overhead - maybe negating some of the advantages),
> > > or alternatively hook into the scheduler idle logic before actually
> > > doing the switch.
> >
> > It can't happen in the idle thread. If you need to take the context
> > switch, then you've negated pretty much all of the gain of the polled
> > approach.
>
> What about hooking into the idle_balance code? That happens if we are
> about to go to idle but before the full schedule switch to the idle
> task.
>
>
> In __schedule(void):
>
> 	if (unlikely(!rq->nr_running))
> 		idle_balance(cpu, rq);

If you can avoid the switch (sleep/wakeup), then that's what matters. If
you end up sleeping, you've lost that latency game and polling is mostly
useful in the blk_iopoll designed fashion for high iops scenarios.
Besides, you need the task + page context to be able to find out what to
poll for (and when to stop).

--
Jens Axboe

2013-06-25 15:00:30

by Jens Axboe

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On Mon, Jun 24 2013, Matthew Wilcox wrote:
> On Mon, Jun 24, 2013 at 10:07:51AM +0200, Ingo Molnar wrote:
> > I'm wondering, how will this scheme work if the IO completion latency is a
> > lot more than the 5 usecs in the testcase? What if it takes 20 usecs or
> > 100 usecs or more?
>
> There's clearly a threshold at which it stops making sense, and our
> current NAND-based SSDs are almost certainly on the wrong side of that
> threshold! I can't wait for one of the "post-NAND" technologies to make
> it to market in some form that makes it economical to use in an SSD.
>
> The problem is that some of the people who are looking at those
> technologies are crazy. They want to "bypass the kernel" and "do user
> space I/O" because "the kernel is too slow". This patch is part of an
> effort to show them how crazy they are. And even if it doesn't convince
> them, at least users who refuse to rewrite their applications to take
> advantage of magical userspace I/O libraries will see real performance
> benefits.

Fully concur with that. At least on the read side, NAND is just getting
crappier and polled completions are usually not going to be great. On the
write side, however, there are definite gains. Completions in the
10-15usec range aren't unusual. And once we hit PCM, well, it'll be fun.

On the write side, there are plenty of super latency-sensitive customers out there
who would LOVE to poll when/if it's useful. Most often also the same
kind of people who talk the crazy of putting everything in user space.
Which is why I like the polling. If we can get sufficiently close, then
we can shut some of that up.

--
Jens Axboe

2013-06-27 18:11:22

by Rik van Riel

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On 06/23/2013 02:29 PM, Linus Torvalds wrote:

> You could try to do that either *in* the idle thread (which would take
> the context switch overhead - maybe negating some of the advantages),
> or alternatively hook into the scheduler idle logic before actually
> doing the switch.
>
> But anything that starts polling when there are other runnable
> processes to be done sounds really debatable. Even if it's "only" 5us
> or so. There's a lot of real work that could be done in 5us.

Having a hook into the idle code could be useful for
KVM, too.

In certain message passing workloads, we have seen
that the system throughput suffers greatly simply
by having the KVM thread's preempt notifiers save
CPU state when the guest goes idle, and re-load it
when the guest VCPU is activated again.

Avoiding the context switch overhead when nothing
else wants to run, but being immediately preempted
when something does, could be a big win.

Would it make sense to have some hook that says
"I *am* an idle thread right now", as opposed to
"context switch to the idle thread"?

That hook could even run the cpuidle code, and
switch to the real idle task (saving the current
task state), and put the CPU into a C-state, when
the expected sleep time is long...


2013-06-27 18:43:04

by Rik van Riel

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On 06/20/2013 04:17 PM, Matthew Wilcox wrote:

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4527,6 +4527,36 @@ long __sched io_schedule_timeout(long timeout)
> 	return ret;
> }
>
> +/*
> + * Wait for an I/O to complete against this backing_dev_info. If the
> + * task exhausts its timeslice polling for completions, call io_schedule()
> + * anyway. If a signal becomes pending, return so the task can handle it.
> + * If io_poll() returns an error, give up and call io_schedule(), but
> + * swallow the error. We may miss an I/O completion (eg if the interrupt
> + * handler gets to it first). Guard against this possibility by returning
> + * if we've been set back to TASK_RUNNING.
> + */
> +void io_wait(struct backing_dev_info *bdi)
> +{

I would like something a little more generic in the
scheduler code, that could also be used by other
things in the kernel (say, KVM with message passing
workloads).

Maybe something looking a little like this?

void idle_poll(struct idle_poll_info *ipi);

struct idle_poll_info {
	int (*idle_poll_func)(void *data);
	int (*idle_poll_preempt)(void *data);
	void *data;
};

That way the kernel can:

1) mark the current thread as having idle priority,
allowing the scheduler to preempt it if something
else wants to run

2) switch to asynchronous mode if something else wants
to run, or if the average wait for the process is
so long that it is better to go asynchronous and
avoid polling

3) poll for completion if nothing else wants to run

Does that make sense?

Did I forget something you need?

Did I forget something KVM could need?

Is this insane? If so, is it too insane? :)

2013-07-04 01:13:13

by Shaohua Li

Subject: Re: RFC: Allow block drivers to poll for I/O instead of sleeping

On Thu, Jun 20, 2013 at 04:17:13PM -0400, Matthew Wilcox wrote:
>
> A paper at FAST2012
> (http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf) pointed
> out the performance overhead of taking interrupts for low-latency block
> I/Os. The solution the author investigated was to spin waiting for each
> I/O to complete. This is inefficient as Linux submits many I/Os which
> are not latency-sensitive, and even when we do submit latency-sensitive
> I/Os (eg swap-in), we frequently submit several I/Os before waiting.
>
> This RFC takes a different approach, only spinning when we would
> otherwise sleep. To implement this, I add an 'io_poll' function pointer
> to backing_dev_info. I include a sample implementation for the NVMe
> driver. Next, I add an io_wait() function which will call io_poll()
> if it is set. It falls back to calling io_schedule() if anything goes
> wrong with io_poll() or the task exceeds its timeslice. Finally, all
> that is left is to judiciously replace calls to io_schedule() with
> calls to io_wait(). I think I've covered the main contenders with
> sleep_on_page(), sleep_on_buffer() and the DIO path.
>
> I've measured the performance benefits of this with a Chatham NVMe
> prototype device and a simple
> # dd if=/dev/nvme0n1 of=/dev/null iflag=direct bs=512 count=1000000
> The latency of each I/O reduces by about 2.5us (from around 8.0us to
> around 5.5us). This matches up quite well with the performance numbers
> shown in the FAST2012 paper (which used a similar device).

Hi Matthew,
I'm wondering where the 2.5us latency cut comes from. I did a simple test: on
my 3.4GHz Xeon, one CPU can do about 2M application context switches per second.
Assuming switching to idle is faster, switching to idle and back should take
less than 1us. Does the 2.5us latency cut mostly come from deep idle state
exit latency? If so, maybe setting a lower pm_qos value or having a better
idle governor to prevent the CPU from entering deep idle states could help too.
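
For example, a task can hold /dev/cpu_dma_latency open with a low value
to keep CPUs out of deep C-states (sketch; error handling omitted):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
	int32_t us = 0;	/* request zero C-state exit latency */
	int fd = open("/dev/cpu_dma_latency", O_WRONLY);

	write(fd, &us, sizeof(us));
	pause();	/* request holds only while the fd stays open */
	return 0;
}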

Thanks,
Shaohua