2011-03-16 08:18:42

by Shaohua Li

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

2011/1/22 Jens Axboe <[email protected]>:
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
>  block/elevator.c          |    6 +-
>  include/linux/blk_types.h |    2 +
>  include/linux/blkdev.h    |   30 ++++
>  include/linux/elevator.h  |    1 +
>  include/linux/sched.h     |    6 +
>  kernel/exit.c             |    1 +
>  kernel/fork.c             |    3 +
>  kernel/sched.c            |   11 ++-
>  9 files changed, 317 insertions(+), 100 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 960f12c..42dbfcc 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -27,6 +27,7 @@
>  #include <linux/writeback.h>
>  #include <linux/task_io_accounting_ops.h>
>  #include <linux/fault-inject.h>
> +#include <linux/list_sort.h>
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/block.h>
> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
>
>          q = container_of(work, struct request_queue, delay_work.work);
>          spin_lock_irq(q->queue_lock);
> -        q->request_fn(q);
> +        __blk_run_queue(q);
>          spin_unlock_irq(q->queue_lock);
>  }
Hi Jens,
I have some questions about the per-task plugging. Since the request
list is per-task, each task delivers its requests when it finishes the
plug or schedules out. But when one CPU delivers requests to the global
queue, other CPUs don't know. This seems to be a problem. For example:
1. get_request_wait() can only flush the current task's request list;
other CPUs/tasks might still have a lot of requests which aren't sent
to the request_queue. Your ioc-rq-alloc branch is for this, right? Will
it be pushed to 2.6.39 too? I'm wondering if we should limit the
per-task queue length: if there are enough requests there, we force a
flush of the plug (a rough sketch of that idea follows below).
2. Some APIs like blk_delay_work, which call __blk_run_queue(), might
not work, because other CPUs might not have dispatched their requests
to the request queue. So __blk_run_queue() may find no requests, which
might stall devices.
Since one CPU doesn't know about other CPUs' request lists, I'm
wondering if there are other similar issues.
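
To make point 1 concrete, something along these lines is what I have in
mind. This is only a rough sketch: the MAX_PLUG_DEPTH constant and the
plug->count field do not exist in the posted series, and I'm assuming a
blk_flush_plug(current) style helper like the one the series calls on
schedule:

#define MAX_PLUG_DEPTH	16

/* called wherever a request is queued onto the task's on-stack plug */
static void plug_add_request(struct blk_plug *plug, struct request *rq)
{
	list_add_tail(&rq->queuelist, &plug->list);

	/* don't let one task hoard requests: force a flush past a limit */
	if (++plug->count >= MAX_PLUG_DEPTH) {
		blk_flush_plug(current);
		plug->count = 0;
	}
}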

Thanks,
Shaohua


2011-03-16 17:32:07

by Vivek Goyal

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> 2011/1/22 Jens Axboe <[email protected]>:
> > Signed-off-by: Jens Axboe <[email protected]>
> > ---
> >  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
> >  block/elevator.c          |    6 +-
> >  include/linux/blk_types.h |    2 +
> >  include/linux/blkdev.h    |   30 ++++
> >  include/linux/elevator.h  |    1 +
> >  include/linux/sched.h     |    6 +
> >  kernel/exit.c             |    1 +
> >  kernel/fork.c             |    3 +
> >  kernel/sched.c            |   11 ++-
> >  9 files changed, 317 insertions(+), 100 deletions(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 960f12c..42dbfcc 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/writeback.h>
> >  #include <linux/task_io_accounting_ops.h>
> >  #include <linux/fault-inject.h>
> > +#include <linux/list_sort.h>
> >
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/block.h>
> > @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> >
> >          q = container_of(work, struct request_queue, delay_work.work);
> >          spin_lock_irq(q->queue_lock);
> > -        q->request_fn(q);
> > +        __blk_run_queue(q);
> >          spin_unlock_irq(q->queue_lock);
> >  }
> Hi Jens,
> I have some questions about the per-task plugging. Since the request
> list is per-task, and each task delivers its requests at finish flush
> or schedule. But when one cpu delivers requests to global queue, other
> cpus don't know. This seems to have problem. For example:
> 1. get_request_wait() can only flush current task's request list,
> other cpus/tasks might still have a lot of requests, which aren't sent
> to request_queue.

But these requests will be sent to the request queue very soon, as soon
as the task is either scheduled out or explicitly flushes the plug? So we
might wait a bit longer, but that might not matter in general, I guess.

> your ioc-rq-alloc branch is for this, right? Will it
> be pushed to 2.6.39 too? I'm wondering if we should limit per-task
> queue length. If there are enough requests there, we force a flush
> plug.

That's the idea Jens had. But then came the question of maintaining
data structures per task per disk. That makes it complicated.

Even if we move the accounting out of the request queue and do it, say,
at the bdi, ideally we would have to do per-task per-bdi accounting.

Jens seemed to be suggesting that generally the flusher threads are the
main culprit for submitting large amounts of IO. They are already per
bdi. So probably just maintain a per-task limit for the flusher threads.

I am not sure what happens to the direct reclaim path, AIO deep-queue
paths, etc.

> 2. some APIs like blk_delay_work, which call __blk_run_queue() might
> not work. because other CPUs might not dispatch their requests to
> request queue. So __blk_run_queue will eventually find no requests,
> which might stall devices.
> Since one cpu doesn't know other cpus' request list, I'm wondering if
> there are other similar issues.

So again, in this case, if the queue is empty at the time of
__blk_run_queue(), we will probably just experience a little more delay
than intended until some task flushes. But it should not stall the
system?

Thanks
Vivek

2011-03-17 01:00:19

by Shaohua Li

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> > 2011/1/22 Jens Axboe <[email protected]>:
> > > Signed-off-by: Jens Axboe <[email protected]>
> > > ---
> > > block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------
> > > block/elevator.c | 6 +-
> > > include/linux/blk_types.h | 2 +
> > > include/linux/blkdev.h | 30 ++++
> > > include/linux/elevator.h | 1 +
> > > include/linux/sched.h | 6 +
> > > kernel/exit.c | 1 +
> > > kernel/fork.c | 3 +
> > > kernel/sched.c | 11 ++-
> > > 9 files changed, 317 insertions(+), 100 deletions(-)
> > >
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index 960f12c..42dbfcc 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -27,6 +27,7 @@
> > > #include <linux/writeback.h>
> > > #include <linux/task_io_accounting_ops.h>
> > > #include <linux/fault-inject.h>
> > > +#include <linux/list_sort.h>
> > >
> > > #define CREATE_TRACE_POINTS
> > > #include <trace/events/block.h>
> > > @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> > >
> > > q = container_of(work, struct request_queue, delay_work.work);
> > > spin_lock_irq(q->queue_lock);
> > > - q->request_fn(q);
> > > + __blk_run_queue(q);
> > > spin_unlock_irq(q->queue_lock);
> > > }
> > Hi Jens,
> > I have some questions about the per-task plugging. Since the request
> > list is per-task, and each task delivers its requests at finish flush
> > or schedule. But when one cpu delivers requests to global queue, other
> > cpus don't know. This seems to have problem. For example:
> > 1. get_request_wait() can only flush current task's request list,
> > other cpus/tasks might still have a lot of requests, which aren't sent
> > to request_queue.
>
> But very soon these requests will be sent to request queue as soon task
> is either scheduled out or task explicitly flushes the plug? So we might
> wait a bit longer but that might not matter in general, i guess.
Yes, I understand there is just a bit of delay. I don't know how severe
it is, but this could still be a problem, especially for fast storage or
random I/O. My current tests show a slight regression (3% or so) with
Jens's for-2.6.39/core branch. I'm still checking whether it's caused by
the per-task plug, but the per-task plug is the prime suspect.

> > your ioc-rq-alloc branch is for this, right? Will it
> > be pushed to 2.6.39 too? I'm wondering if we should limit per-task
> > queue length. If there are enough requests there, we force a flush
> > plug.
>
> That's the idea jens had. But then came the question of maintaining
> data structures per task per disk. That makes it complicated.
>
> Even if we move the accounting out of request queue and do it say at
> bdi, ideally we shall to do per task per bdi accounting.
>
> Jens seemed to be suggesting that generally fluser threads are the
> main cluprit for submitting large amount of IO. They are already per
> bdi. So probably just maintain a per task limit for flusher threads.
Yep, the flusher is the main spot in my mind. We need to call flush plug
more often for the flusher thread.

> I am not sure what happens to direct reclaim path, AIO deep queue
> paths etc.
The direct reclaim path could build a deep write queue too. It uses
.writepage, and currently there is no flush plug there. Maybe we need to
add a flush plug in shrink_inactive_list too (see the sketch below).
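
Roughly, what I mean is just to bracket the reclaim writeout with a
plug. This is only a sketch and not a real patch; the reclaim internals
named in the comment are just for orientation:

	struct blk_plug plug;

	blk_start_plug(&plug);
	/* ... shrink_page_list() -> pageout() -> mapping->a_ops->writepage() ... */
	blk_finish_plug(&plug);	/* merged requests are dispatched here */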

> > 2. some APIs like blk_delay_work, which call __blk_run_queue() might
> > not work. because other CPUs might not dispatch their requests to
> > request queue. So __blk_run_queue will eventually find no requests,
> > which might stall devices.
> > Since one cpu doesn't know other cpus' request list, I'm wondering if
> > there are other similar issues.
>
> So again in this case if queue is empty at the time of __blk_run_queue(),
> then we will probably just experinece little more delay then intended
> till some task flushes. But should not stall the system?
It does not stall the system, but the device stalls for a little while.

Thanks,
Shaohua

2011-03-17 03:19:32

by Shaohua Li

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On Thu, 2011-03-17 at 09:00 +0800, Shaohua Li wrote:
> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> > On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> > > 2011/1/22 Jens Axboe <[email protected]>:
> > > > Signed-off-by: Jens Axboe <[email protected]>
> > > > ---
> > > > block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------
> > > > block/elevator.c | 6 +-
> > > > include/linux/blk_types.h | 2 +
> > > > include/linux/blkdev.h | 30 ++++
> > > > include/linux/elevator.h | 1 +
> > > > include/linux/sched.h | 6 +
> > > > kernel/exit.c | 1 +
> > > > kernel/fork.c | 3 +
> > > > kernel/sched.c | 11 ++-
> > > > 9 files changed, 317 insertions(+), 100 deletions(-)
> > > >
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index 960f12c..42dbfcc 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -27,6 +27,7 @@
> > > > #include <linux/writeback.h>
> > > > #include <linux/task_io_accounting_ops.h>
> > > > #include <linux/fault-inject.h>
> > > > +#include <linux/list_sort.h>
> > > >
> > > > #define CREATE_TRACE_POINTS
> > > > #include <trace/events/block.h>
> > > > @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> > > >
> > > > q = container_of(work, struct request_queue, delay_work.work);
> > > > spin_lock_irq(q->queue_lock);
> > > > - q->request_fn(q);
> > > > + __blk_run_queue(q);
> > > > spin_unlock_irq(q->queue_lock);
> > > > }
> > > Hi Jens,
> > > I have some questions about the per-task plugging. Since the request
> > > list is per-task, and each task delivers its requests at finish flush
> > > or schedule. But when one cpu delivers requests to global queue, other
> > > cpus don't know. This seems to have problem. For example:
> > > 1. get_request_wait() can only flush current task's request list,
> > > other cpus/tasks might still have a lot of requests, which aren't sent
> > > to request_queue.
> >
> > But very soon these requests will be sent to request queue as soon task
> > is either scheduled out or task explicitly flushes the plug? So we might
> > wait a bit longer but that might not matter in general, i guess.
> Yes, I understand there is just a bit delay. I don't know how severe it
> is, but this still could be a problem, especially for fast storage or
> random I/O. My current tests show slight regression (3% or so) with
> Jens's for 2.6.39/core branch. I'm still checking if it's caused by the
> per-task plug, but the per-task plug is highly suspected.
>
> > > your ioc-rq-alloc branch is for this, right? Will it
> > > be pushed to 2.6.39 too? I'm wondering if we should limit per-task
> > > queue length. If there are enough requests there, we force a flush
> > > plug.
> >
> > That's the idea jens had. But then came the question of maintaining
> > data structures per task per disk. That makes it complicated.
> >
> > Even if we move the accounting out of request queue and do it say at
> > bdi, ideally we shall to do per task per bdi accounting.
> >
> > Jens seemed to be suggesting that generally fluser threads are the
> > main cluprit for submitting large amount of IO. They are already per
> > bdi. So probably just maintain a per task limit for flusher threads.
> Yep, flusher is the main spot in my mind. We need call more flush plug
> for flusher thread.
>
> > I am not sure what happens to direct reclaim path, AIO deep queue
> > paths etc.
> direct reclaim path could build deep write queue too. It
> uses .writepage, currently there is no flush plug there. Maybe we need
> add flush plug in shrink_inactive_list too.
>
> > > 2. some APIs like blk_delay_work, which call __blk_run_queue() might
> > > not work. because other CPUs might not dispatch their requests to
> > > request queue. So __blk_run_queue will eventually find no requests,
> > > which might stall devices.
> > > Since one cpu doesn't know other cpus' request list, I'm wondering if
> > > there are other similar issues.
> >
> > So again in this case if queue is empty at the time of __blk_run_queue(),
> > then we will probably just experinece little more delay then intended
> > till some task flushes. But should not stall the system?
> not stall the system, but device stalls a little time.
Jens,
I need the patch below to recover an ffsb fsync workload, which sees
about a 30% regression with stack plugging.
I guess the reason is that WRITE_SYNC_PLUG doesn't work any more, so if
a context doesn't have a blk_plug, we lose the previous plugging
(request merging). This suggests that all the places where we used
WRITE_SYNC_PLUG before (for example, kjournald) should have a blk_plug
context.

Thanks,
Shaohua


diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cc0ede1..24b7ac2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1039,11 +1039,17 @@ static int __writepage(struct page *page, struct writeback_control *wbc,
 int generic_writepages(struct address_space *mapping,
                        struct writeback_control *wbc)
 {
+	struct blk_plug plug;
+	int ret;
+
 	/* deal with chardevs and other special file */
 	if (!mapping->a_ops->writepage)
 		return 0;
 
-	return write_cache_pages(mapping, wbc, __writepage, mapping);
+	blk_start_plug(&plug);
+	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
+	blk_finish_plug(&plug);
+	return ret;
 }
 
 EXPORT_SYMBOL(generic_writepages);
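
For the kjournald case mentioned above, the conversion would look the
same in spirit. A very rough sketch only, not a real patch; the function
below merely stands in for the commit writeout loop in fs/jbd/commit.c:

	/*
	 * Illustrative only: bracket the commit's buffer submission with an
	 * on-stack plug so the WRITE_SYNC submissions can still be merged,
	 * the way WRITE_SYNC_PLUG used to allow.
	 */
	static void journal_write_commit_buffers(journal_t *journal)
	{
		struct blk_plug plug;

		blk_start_plug(&plug);
		/* ... submit the transaction's buffers, e.g. submit_bh(WRITE_SYNC, bh) ... */
		blk_finish_plug(&plug);
	}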

2011-03-17 09:40:12

by Jens Axboe

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On 2011-03-16 09:18, Shaohua Li wrote:
> 2011/1/22 Jens Axboe <[email protected]>:
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>> block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------
>> block/elevator.c | 6 +-
>> include/linux/blk_types.h | 2 +
>> include/linux/blkdev.h | 30 ++++
>> include/linux/elevator.h | 1 +
>> include/linux/sched.h | 6 +
>> kernel/exit.c | 1 +
>> kernel/fork.c | 3 +
>> kernel/sched.c | 11 ++-
>> 9 files changed, 317 insertions(+), 100 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 960f12c..42dbfcc 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -27,6 +27,7 @@
>> #include <linux/writeback.h>
>> #include <linux/task_io_accounting_ops.h>
>> #include <linux/fault-inject.h>
>> +#include <linux/list_sort.h>
>>
>> #define CREATE_TRACE_POINTS
>> #include <trace/events/block.h>
>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
>>
>> q = container_of(work, struct request_queue, delay_work.work);
>> spin_lock_irq(q->queue_lock);
>> - q->request_fn(q);
>> + __blk_run_queue(q);
>> spin_unlock_irq(q->queue_lock);
>> }
> Hi Jens,
> I have some questions about the per-task plugging. Since the request
> list is per-task, and each task delivers its requests at finish flush
> or schedule. But when one cpu delivers requests to global queue, other
> cpus don't know. This seems to have problem. For example:
> 1. get_request_wait() can only flush current task's request list,
> other cpus/tasks might still have a lot of requests, which aren't sent
> to request_queue. your ioc-rq-alloc branch is for this, right? Will it
> be pushed to 2.6.39 too? I'm wondering if we should limit per-task
> queue length. If there are enough requests there, we force a flush
> plug.

Any task plug is by definition short-lived, since it only persists while
someone is submitting IO, or until the task ends up blocking. It's not
like the current situation, where a plug can persist for some time.
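
For reference, the intended pattern is just this (the two helpers inside
the loop are stand-ins for whatever IO the caller actually issues, not
real functions):

	struct blk_plug plug;

	blk_start_plug(&plug);		/* plug lives on this task's stack */
	while (more_io_to_submit())
		submit_one_bio();	/* queued on current->plug, not yet dispatched */
	blk_finish_plug(&plug);		/* dispatched here, or earlier if the task blocks */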

I don't plan on submitting the ioc-rq-alloc for 2.6.39, it needs more
work. I think we'll end up dropping the limits completely and just
ensuring that the flusher thread doesn't push out too much.

> 2. some APIs like blk_delay_work, which call __blk_run_queue() might
> not work. because other CPUs might not dispatch their requests to
> request queue. So __blk_run_queue will eventually find no requests,
> which might stall devices.
> Since one cpu doesn't know other cpus' request list, I'm wondering if
> there are other similar issues.

If you call blk_run_queue(), it's to kick off something that you
submitted (and that should already be on the queue). So I don't think
this is an issue.

--
Jens Axboe

2011-03-17 09:43:54

by Jens Axboe

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On 2011-03-17 02:00, Shaohua Li wrote:
> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
>>> 2011/1/22 Jens Axboe <[email protected]>:
>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>> ---
>>>> block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------
>>>> block/elevator.c | 6 +-
>>>> include/linux/blk_types.h | 2 +
>>>> include/linux/blkdev.h | 30 ++++
>>>> include/linux/elevator.h | 1 +
>>>> include/linux/sched.h | 6 +
>>>> kernel/exit.c | 1 +
>>>> kernel/fork.c | 3 +
>>>> kernel/sched.c | 11 ++-
>>>> 9 files changed, 317 insertions(+), 100 deletions(-)
>>>>
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 960f12c..42dbfcc 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -27,6 +27,7 @@
>>>> #include <linux/writeback.h>
>>>> #include <linux/task_io_accounting_ops.h>
>>>> #include <linux/fault-inject.h>
>>>> +#include <linux/list_sort.h>
>>>>
>>>> #define CREATE_TRACE_POINTS
>>>> #include <trace/events/block.h>
>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
>>>>
>>>> q = container_of(work, struct request_queue, delay_work.work);
>>>> spin_lock_irq(q->queue_lock);
>>>> - q->request_fn(q);
>>>> + __blk_run_queue(q);
>>>> spin_unlock_irq(q->queue_lock);
>>>> }
>>> Hi Jens,
>>> I have some questions about the per-task plugging. Since the request
>>> list is per-task, and each task delivers its requests at finish flush
>>> or schedule. But when one cpu delivers requests to global queue, other
>>> cpus don't know. This seems to have problem. For example:
>>> 1. get_request_wait() can only flush current task's request list,
>>> other cpus/tasks might still have a lot of requests, which aren't sent
>>> to request_queue.
>>
>> But very soon these requests will be sent to request queue as soon task
>> is either scheduled out or task explicitly flushes the plug? So we might
>> wait a bit longer but that might not matter in general, i guess.
> Yes, I understand there is just a bit delay. I don't know how severe it
> is, but this still could be a problem, especially for fast storage or
> random I/O. My current tests show slight regression (3% or so) with
> Jens's for 2.6.39/core branch. I'm still checking if it's caused by the
> per-task plug, but the per-task plug is highly suspected.

To check this particular case, you can always just bump the request
limit. What test is showing a slowdown? Like the one that Vivek
discovered, we are going to be adding plugs in more places. I didn't go
crazy with those; I wanted to have the infrastructure sane and stable
first.

>>
>> Jens seemed to be suggesting that generally fluser threads are the
>> main cluprit for submitting large amount of IO. They are already per
>> bdi. So probably just maintain a per task limit for flusher threads.
> Yep, flusher is the main spot in my mind. We need call more flush plug
> for flusher thread.
>
>> I am not sure what happens to direct reclaim path, AIO deep queue
>> paths etc.
> direct reclaim path could build deep write queue too. It
> uses .writepage, currently there is no flush plug there. Maybe we need
> add flush plug in shrink_inactive_list too.

If you find and locate these spots, I'd very much appreciate a patch too
:-)

>>> 2. some APIs like blk_delay_work, which call __blk_run_queue() might
>>> not work. because other CPUs might not dispatch their requests to
>>> request queue. So __blk_run_queue will eventually find no requests,
>>> which might stall devices.
>>> Since one cpu doesn't know other cpus' request list, I'm wondering if
>>> there are other similar issues.
>>
>> So again in this case if queue is empty at the time of __blk_run_queue(),
>> then we will probably just experinece little more delay then intended
>> till some task flushes. But should not stall the system?
> not stall the system, but device stalls a little time.

It's not a problem. Say you use blk_delay_work(); that is to delay
something that is already on the queue. Any task plug should be
unrelated. For the request starvation issue, if we had the plug persist
across schedules it would be an issue. But the time frame that a
per-task plug lives for is very short, it's just submitting the IO.
Flushing those plugs would be detrimental to the problem you want to
solve, which is to ensure that those IOs finish faster so that we can
allocate more.

--
Jens Axboe

2011-03-17 09:44:54

by Jens Axboe

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On 2011-03-17 04:19, Shaohua Li wrote:
> On Thu, 2011-03-17 at 09:00 +0800, Shaohua Li wrote:
>> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
>>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
>>>> 2011/1/22 Jens Axboe <[email protected]>:
>>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>>> ---
>>>>> block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------
>>>>> block/elevator.c | 6 +-
>>>>> include/linux/blk_types.h | 2 +
>>>>> include/linux/blkdev.h | 30 ++++
>>>>> include/linux/elevator.h | 1 +
>>>>> include/linux/sched.h | 6 +
>>>>> kernel/exit.c | 1 +
>>>>> kernel/fork.c | 3 +
>>>>> kernel/sched.c | 11 ++-
>>>>> 9 files changed, 317 insertions(+), 100 deletions(-)
>>>>>
>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>> index 960f12c..42dbfcc 100644
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -27,6 +27,7 @@
>>>>> #include <linux/writeback.h>
>>>>> #include <linux/task_io_accounting_ops.h>
>>>>> #include <linux/fault-inject.h>
>>>>> +#include <linux/list_sort.h>
>>>>>
>>>>> #define CREATE_TRACE_POINTS
>>>>> #include <trace/events/block.h>
>>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
>>>>>
>>>>> q = container_of(work, struct request_queue, delay_work.work);
>>>>> spin_lock_irq(q->queue_lock);
>>>>> - q->request_fn(q);
>>>>> + __blk_run_queue(q);
>>>>> spin_unlock_irq(q->queue_lock);
>>>>> }
>>>> Hi Jens,
>>>> I have some questions about the per-task plugging. Since the request
>>>> list is per-task, and each task delivers its requests at finish flush
>>>> or schedule. But when one cpu delivers requests to global queue, other
>>>> cpus don't know. This seems to have problem. For example:
>>>> 1. get_request_wait() can only flush current task's request list,
>>>> other cpus/tasks might still have a lot of requests, which aren't sent
>>>> to request_queue.
>>>
>>> But very soon these requests will be sent to request queue as soon task
>>> is either scheduled out or task explicitly flushes the plug? So we might
>>> wait a bit longer but that might not matter in general, i guess.
>> Yes, I understand there is just a bit delay. I don't know how severe it
>> is, but this still could be a problem, especially for fast storage or
>> random I/O. My current tests show slight regression (3% or so) with
>> Jens's for 2.6.39/core branch. I'm still checking if it's caused by the
>> per-task plug, but the per-task plug is highly suspected.
>>
>>>> your ioc-rq-alloc branch is for this, right? Will it
>>>> be pushed to 2.6.39 too? I'm wondering if we should limit per-task
>>>> queue length. If there are enough requests there, we force a flush
>>>> plug.
>>>
>>> That's the idea jens had. But then came the question of maintaining
>>> data structures per task per disk. That makes it complicated.
>>>
>>> Even if we move the accounting out of request queue and do it say at
>>> bdi, ideally we shall to do per task per bdi accounting.
>>>
>>> Jens seemed to be suggesting that generally fluser threads are the
>>> main cluprit for submitting large amount of IO. They are already per
>>> bdi. So probably just maintain a per task limit for flusher threads.
>> Yep, flusher is the main spot in my mind. We need call more flush plug
>> for flusher thread.
>>
>>> I am not sure what happens to direct reclaim path, AIO deep queue
>>> paths etc.
>> direct reclaim path could build deep write queue too. It
>> uses .writepage, currently there is no flush plug there. Maybe we need
>> add flush plug in shrink_inactive_list too.
>>
>>>> 2. some APIs like blk_delay_work, which call __blk_run_queue() might
>>>> not work. because other CPUs might not dispatch their requests to
>>>> request queue. So __blk_run_queue will eventually find no requests,
>>>> which might stall devices.
>>>> Since one cpu doesn't know other cpus' request list, I'm wondering if
>>>> there are other similar issues.
>>>
>>> So again in this case if queue is empty at the time of __blk_run_queue(),
>>> then we will probably just experinece little more delay then intended
>>> till some task flushes. But should not stall the system?
>> not stall the system, but device stalls a little time.
> Jens,
> I need below patch to recover a ffsb fsync workload, which has about 30%
> regression with stack plug.
> I guess the reason is WRITE_SYNC_PLUG doesn't work now, so if a context
> hasn't blk_plug, we lose previous plug (request merge). This suggests
> all places we use WRITE_SYNC_PLUG before (for example, kjournald) should
> have a blk_plug context.

Good point, those should be auto-converted. I'll take this patch and
double check the others. Thanks!

Does it remove that performance regression completely?

--
Jens Axboe

2011-03-18 01:55:46

by Shaohua Li

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On Thu, 2011-03-17 at 17:44 +0800, Jens Axboe wrote:
> On 2011-03-17 04:19, Shaohua Li wrote:
> > On Thu, 2011-03-17 at 09:00 +0800, Shaohua Li wrote:
> >> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> >>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> >>>> 2011/1/22 Jens Axboe <[email protected]>:
> >>>>> Signed-off-by: Jens Axboe <[email protected]>
> >>>>> ---
> >>>>> block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------
> >>>>> block/elevator.c | 6 +-
> >>>>> include/linux/blk_types.h | 2 +
> >>>>> include/linux/blkdev.h | 30 ++++
> >>>>> include/linux/elevator.h | 1 +
> >>>>> include/linux/sched.h | 6 +
> >>>>> kernel/exit.c | 1 +
> >>>>> kernel/fork.c | 3 +
> >>>>> kernel/sched.c | 11 ++-
> >>>>> 9 files changed, 317 insertions(+), 100 deletions(-)
> >>>>>
> >>>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>>> index 960f12c..42dbfcc 100644
> >>>>> --- a/block/blk-core.c
> >>>>> +++ b/block/blk-core.c
> >>>>> @@ -27,6 +27,7 @@
> >>>>> #include <linux/writeback.h>
> >>>>> #include <linux/task_io_accounting_ops.h>
> >>>>> #include <linux/fault-inject.h>
> >>>>> +#include <linux/list_sort.h>
> >>>>>
> >>>>> #define CREATE_TRACE_POINTS
> >>>>> #include <trace/events/block.h>
> >>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> >>>>>
> >>>>> q = container_of(work, struct request_queue, delay_work.work);
> >>>>> spin_lock_irq(q->queue_lock);
> >>>>> - q->request_fn(q);
> >>>>> + __blk_run_queue(q);
> >>>>> spin_unlock_irq(q->queue_lock);
> >>>>> }
> >>>> Hi Jens,
> >>>> I have some questions about the per-task plugging. Since the request
> >>>> list is per-task, and each task delivers its requests at finish flush
> >>>> or schedule. But when one cpu delivers requests to global queue, other
> >>>> cpus don't know. This seems to have problem. For example:
> >>>> 1. get_request_wait() can only flush current task's request list,
> >>>> other cpus/tasks might still have a lot of requests, which aren't sent
> >>>> to request_queue.
> >>>
> >>> But very soon these requests will be sent to request queue as soon task
> >>> is either scheduled out or task explicitly flushes the plug? So we might
> >>> wait a bit longer but that might not matter in general, i guess.
> >> Yes, I understand there is just a bit delay. I don't know how severe it
> >> is, but this still could be a problem, especially for fast storage or
> >> random I/O. My current tests show slight regression (3% or so) with
> >> Jens's for 2.6.39/core branch. I'm still checking if it's caused by the
> >> per-task plug, but the per-task plug is highly suspected.
> >>
> >>>> your ioc-rq-alloc branch is for this, right? Will it
> >>>> be pushed to 2.6.39 too? I'm wondering if we should limit per-task
> >>>> queue length. If there are enough requests there, we force a flush
> >>>> plug.
> >>>
> >>> That's the idea jens had. But then came the question of maintaining
> >>> data structures per task per disk. That makes it complicated.
> >>>
> >>> Even if we move the accounting out of request queue and do it say at
> >>> bdi, ideally we shall to do per task per bdi accounting.
> >>>
> >>> Jens seemed to be suggesting that generally fluser threads are the
> >>> main cluprit for submitting large amount of IO. They are already per
> >>> bdi. So probably just maintain a per task limit for flusher threads.
> >> Yep, flusher is the main spot in my mind. We need call more flush plug
> >> for flusher thread.
> >>
> >>> I am not sure what happens to direct reclaim path, AIO deep queue
> >>> paths etc.
> >> direct reclaim path could build deep write queue too. It
> >> uses .writepage, currently there is no flush plug there. Maybe we need
> >> add flush plug in shrink_inactive_list too.
> >>
> >>>> 2. some APIs like blk_delay_work, which call __blk_run_queue() might
> >>>> not work. because other CPUs might not dispatch their requests to
> >>>> request queue. So __blk_run_queue will eventually find no requests,
> >>>> which might stall devices.
> >>>> Since one cpu doesn't know other cpus' request list, I'm wondering if
> >>>> there are other similar issues.
> >>>
> >>> So again in this case if queue is empty at the time of __blk_run_queue(),
> >>> then we will probably just experinece little more delay then intended
> >>> till some task flushes. But should not stall the system?
> >> not stall the system, but device stalls a little time.
> > Jens,
> > I need below patch to recover a ffsb fsync workload, which has about 30%
> > regression with stack plug.
> > I guess the reason is WRITE_SYNC_PLUG doesn't work now, so if a context
> > hasn't blk_plug, we lose previous plug (request merge). This suggests
> > all places we use WRITE_SYNC_PLUG before (for example, kjournald) should
> > have a blk_plug context.
>
> Good point, those should be auto-converted. I'll take this patch and
> double check the others. Thanks!
>
> Does it remove that performance regression completely?
Yes, it removes the regression completely on my side.

2011-03-18 06:37:01

by Shaohua Li

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On Thu, 2011-03-17 at 17:43 +0800, Jens Axboe wrote:
> On 2011-03-17 02:00, Shaohua Li wrote:
> > On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> >> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> >>> 2011/1/22 Jens Axboe <[email protected]>:
> >>>> Signed-off-by: Jens Axboe <[email protected]>
> >>>> ---
> >>>> block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------
> >>>> block/elevator.c | 6 +-
> >>>> include/linux/blk_types.h | 2 +
> >>>> include/linux/blkdev.h | 30 ++++
> >>>> include/linux/elevator.h | 1 +
> >>>> include/linux/sched.h | 6 +
> >>>> kernel/exit.c | 1 +
> >>>> kernel/fork.c | 3 +
> >>>> kernel/sched.c | 11 ++-
> >>>> 9 files changed, 317 insertions(+), 100 deletions(-)
> >>>>
> >>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>> index 960f12c..42dbfcc 100644
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -27,6 +27,7 @@
> >>>> #include <linux/writeback.h>
> >>>> #include <linux/task_io_accounting_ops.h>
> >>>> #include <linux/fault-inject.h>
> >>>> +#include <linux/list_sort.h>
> >>>>
> >>>> #define CREATE_TRACE_POINTS
> >>>> #include <trace/events/block.h>
> >>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> >>>>
> >>>> q = container_of(work, struct request_queue, delay_work.work);
> >>>> spin_lock_irq(q->queue_lock);
> >>>> - q->request_fn(q);
> >>>> + __blk_run_queue(q);
> >>>> spin_unlock_irq(q->queue_lock);
> >>>> }
> >>> Hi Jens,
> >>> I have some questions about the per-task plugging. Since the request
> >>> list is per-task, and each task delivers its requests at finish flush
> >>> or schedule. But when one cpu delivers requests to global queue, other
> >>> cpus don't know. This seems to have problem. For example:
> >>> 1. get_request_wait() can only flush current task's request list,
> >>> other cpus/tasks might still have a lot of requests, which aren't sent
> >>> to request_queue.
> >>
> >> But very soon these requests will be sent to request queue as soon task
> >> is either scheduled out or task explicitly flushes the plug? So we might
> >> wait a bit longer but that might not matter in general, i guess.
> > Yes, I understand there is just a bit delay. I don't know how severe it
> > is, but this still could be a problem, especially for fast storage or
> > random I/O. My current tests show slight regression (3% or so) with
> > Jens's for 2.6.39/core branch. I'm still checking if it's caused by the
> > per-task plug, but the per-task plug is highly suspected.
>
> To check this particular case, you can always just bump the request
> limit. What test is showing a slowdown?
This is a simple multi-threaded sequential read. The issue tends to be
request-merge related (not verified yet). Merging drops by about 60%
with stack plugging, according to the fio-reported data. From the trace,
without stack plugging, requests from different threads get merged. But
with it, such merging is impossible because flush_plug doesn't check for
merges, so I think we need to add that again.

Thanks,
Shaohua

2011-03-18 12:55:06

by Jens Axboe

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On 2011-03-18 07:36, Shaohua Li wrote:
> On Thu, 2011-03-17 at 17:43 +0800, Jens Axboe wrote:
>> On 2011-03-17 02:00, Shaohua Li wrote:
>>> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
>>>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
>>>>> 2011/1/22 Jens Axboe <[email protected]>:
>>>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>>>> ---
>>>>>> block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------
>>>>>> block/elevator.c | 6 +-
>>>>>> include/linux/blk_types.h | 2 +
>>>>>> include/linux/blkdev.h | 30 ++++
>>>>>> include/linux/elevator.h | 1 +
>>>>>> include/linux/sched.h | 6 +
>>>>>> kernel/exit.c | 1 +
>>>>>> kernel/fork.c | 3 +
>>>>>> kernel/sched.c | 11 ++-
>>>>>> 9 files changed, 317 insertions(+), 100 deletions(-)
>>>>>>
>>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>>> index 960f12c..42dbfcc 100644
>>>>>> --- a/block/blk-core.c
>>>>>> +++ b/block/blk-core.c
>>>>>> @@ -27,6 +27,7 @@
>>>>>> #include <linux/writeback.h>
>>>>>> #include <linux/task_io_accounting_ops.h>
>>>>>> #include <linux/fault-inject.h>
>>>>>> +#include <linux/list_sort.h>
>>>>>>
>>>>>> #define CREATE_TRACE_POINTS
>>>>>> #include <trace/events/block.h>
>>>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
>>>>>>
>>>>>> q = container_of(work, struct request_queue, delay_work.work);
>>>>>> spin_lock_irq(q->queue_lock);
>>>>>> - q->request_fn(q);
>>>>>> + __blk_run_queue(q);
>>>>>> spin_unlock_irq(q->queue_lock);
>>>>>> }
>>>>> Hi Jens,
>>>>> I have some questions about the per-task plugging. Since the request
>>>>> list is per-task, and each task delivers its requests at finish flush
>>>>> or schedule. But when one cpu delivers requests to global queue, other
>>>>> cpus don't know. This seems to have problem. For example:
>>>>> 1. get_request_wait() can only flush current task's request list,
>>>>> other cpus/tasks might still have a lot of requests, which aren't sent
>>>>> to request_queue.
>>>>
>>>> But very soon these requests will be sent to request queue as soon task
>>>> is either scheduled out or task explicitly flushes the plug? So we might
>>>> wait a bit longer but that might not matter in general, i guess.
>>> Yes, I understand there is just a bit delay. I don't know how severe it
>>> is, but this still could be a problem, especially for fast storage or
>>> random I/O. My current tests show slight regression (3% or so) with
>>> Jens's for 2.6.39/core branch. I'm still checking if it's caused by the
>>> per-task plug, but the per-task plug is highly suspected.
>>
>> To check this particular case, you can always just bump the request
>> limit. What test is showing a slowdown?
> this is a simple multi-threaded seq read. The issue tends to be request
> merge related (not verified yet). The merge reduces about 60% with stack
> plug from fio reported data. From trace, without stack plug, requests
> from different threads get merged. But with it, such merge is impossible
> because flush_plug doesn't check merge, I thought we need add it again.

What we could try is to have the plug flush insert be
ELEVATOR_INSERT_SORT_MERGE and have it look up potential back merges.

Here's a quick hack that does that; I have not tested it at all.

diff --git a/block/blk-core.c b/block/blk-core.c
index e1fcf7a..5256932 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2685,7 +2685,7 @@ static void flush_plug_list(struct blk_plug *plug)
/*
* rq is already accounted, so use raw insert
*/
- __elv_add_request(q, rq, ELEVATOR_INSERT_SORT);
+ __elv_add_request(q, rq, ELEVATOR_INSERT_SORT_MERGE);
}

if (q) {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index ea85e20..cfcc37c 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -465,3 +465,9 @@ int attempt_front_merge(struct request_queue *q, struct request *rq)

return 0;
}
+
+int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
+ struct request *next)
+{
+ return attempt_merge(q, rq, next);
+}
diff --git a/block/blk.h b/block/blk.h
index 49d21af..c8db371 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -103,6 +103,8 @@ int ll_front_merge_fn(struct request_queue *q, struct request *req,
struct bio *bio);
int attempt_back_merge(struct request_queue *q, struct request *rq);
int attempt_front_merge(struct request_queue *q, struct request *rq);
+int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
+ struct request *next);
void blk_recalc_rq_segments(struct request *rq);
void blk_rq_set_mixed_merge(struct request *rq);

diff --git a/block/elevator.c b/block/elevator.c
index 542ce82..f493e18 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -521,6 +521,33 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
return ELEVATOR_NO_MERGE;
}

+/*
+ * Returns true if we merged, false otherwise
+ */
+static bool elv_attempt_insert_merge(struct request_queue *q,
+ struct request *rq)
+{
+ struct request *__rq;
+
+ if (blk_queue_nomerges(q) || blk_queue_noxmerges(q))
+ return false;
+
+ /*
+ * First try one-hit cache.
+ */
+ if (q->last_merge && blk_attempt_req_merge(q, rq, q->last_merge))
+ return true;
+
+ /*
+ * See if our hash lookup can find a potential backmerge.
+ */
+ __rq = elv_rqhash_find(q, blk_rq_pos(rq));
+ if (__rq && blk_attempt_req_merge(q, rq, __rq))
+ return true;
+
+ return false;
+}
+
void elv_merged_request(struct request_queue *q, struct request *rq, int type)
{
struct elevator_queue *e = q->elevator;
@@ -647,6 +674,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
__blk_run_queue(q, false);
break;

+ case ELEVATOR_INSERT_SORT_MERGE:
+ if (elv_attempt_insert_merge(q, rq))
+ break;
case ELEVATOR_INSERT_SORT:
BUG_ON(rq->cmd_type != REQ_TYPE_FS &&
!(rq->cmd_flags & REQ_DISCARD));
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index ec6f72b..d93efcc445 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -166,6 +166,7 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
#define ELEVATOR_INSERT_SORT 3
#define ELEVATOR_INSERT_REQUEUE 4
#define ELEVATOR_INSERT_FLUSH 5
+#define ELEVATOR_INSERT_SORT_MERGE 6

/*
* return values from elevator_may_queue_fn


--
Jens Axboe

2011-03-18 13:52:46

by Jens Axboe

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On 2011-03-18 13:54, Jens Axboe wrote:
> On 2011-03-18 07:36, Shaohua Li wrote:
>> On Thu, 2011-03-17 at 17:43 +0800, Jens Axboe wrote:
>>> On 2011-03-17 02:00, Shaohua Li wrote:
>>>> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
>>>>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
>>>>>> 2011/1/22 Jens Axboe <[email protected]>:
>>>>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>>>>> ---
>>>>>>> block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------
>>>>>>> block/elevator.c | 6 +-
>>>>>>> include/linux/blk_types.h | 2 +
>>>>>>> include/linux/blkdev.h | 30 ++++
>>>>>>> include/linux/elevator.h | 1 +
>>>>>>> include/linux/sched.h | 6 +
>>>>>>> kernel/exit.c | 1 +
>>>>>>> kernel/fork.c | 3 +
>>>>>>> kernel/sched.c | 11 ++-
>>>>>>> 9 files changed, 317 insertions(+), 100 deletions(-)
>>>>>>>
>>>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>>>> index 960f12c..42dbfcc 100644
>>>>>>> --- a/block/blk-core.c
>>>>>>> +++ b/block/blk-core.c
>>>>>>> @@ -27,6 +27,7 @@
>>>>>>> #include <linux/writeback.h>
>>>>>>> #include <linux/task_io_accounting_ops.h>
>>>>>>> #include <linux/fault-inject.h>
>>>>>>> +#include <linux/list_sort.h>
>>>>>>>
>>>>>>> #define CREATE_TRACE_POINTS
>>>>>>> #include <trace/events/block.h>
>>>>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
>>>>>>>
>>>>>>> q = container_of(work, struct request_queue, delay_work.work);
>>>>>>> spin_lock_irq(q->queue_lock);
>>>>>>> - q->request_fn(q);
>>>>>>> + __blk_run_queue(q);
>>>>>>> spin_unlock_irq(q->queue_lock);
>>>>>>> }
>>>>>> Hi Jens,
>>>>>> I have some questions about the per-task plugging. Since the request
>>>>>> list is per-task, and each task delivers its requests at finish flush
>>>>>> or schedule. But when one cpu delivers requests to global queue, other
>>>>>> cpus don't know. This seems to have problem. For example:
>>>>>> 1. get_request_wait() can only flush current task's request list,
>>>>>> other cpus/tasks might still have a lot of requests, which aren't sent
>>>>>> to request_queue.
>>>>>
>>>>> But very soon these requests will be sent to request queue as soon task
>>>>> is either scheduled out or task explicitly flushes the plug? So we might
>>>>> wait a bit longer but that might not matter in general, i guess.
>>>> Yes, I understand there is just a bit delay. I don't know how severe it
>>>> is, but this still could be a problem, especially for fast storage or
>>>> random I/O. My current tests show slight regression (3% or so) with
>>>> Jens's for 2.6.39/core branch. I'm still checking if it's caused by the
>>>> per-task plug, but the per-task plug is highly suspected.
>>>
>>> To check this particular case, you can always just bump the request
>>> limit. What test is showing a slowdown?
>> this is a simple multi-threaded seq read. The issue tends to be request
>> merge related (not verified yet). The merge reduces about 60% with stack
>> plug from fio reported data. From trace, without stack plug, requests
>> from different threads get merged. But with it, such merge is impossible
>> because flush_plug doesn't check merge, I thought we need add it again.
>
> What we could try is have the plug flush insert be
> ELEVATOR_INSERT_SORT_MERGE and have it lookup potential backmerges.
>
> Here's a quick hack that does that, I have not tested it at all.

Gave it a quick test spin; as suspected, it had a few issues. This one
seems to work. Can you toss it through that workload and see if it fares
better?

diff --git a/block/blk-core.c b/block/blk-core.c
index e1fcf7a..e1b29e7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -55,7 +55,7 @@ struct kmem_cache *blk_requestq_cachep;
*/
static struct workqueue_struct *kblockd_workqueue;

-static void drive_stat_acct(struct request *rq, int new_io)
+void drive_stat_acct(struct request *rq, int new_io)
{
struct hd_struct *part;
int rw = rq_data_dir(rq);
@@ -2685,7 +2685,7 @@ static void flush_plug_list(struct blk_plug *plug)
/*
* rq is already accounted, so use raw insert
*/
- __elv_add_request(q, rq, ELEVATOR_INSERT_SORT);
+ __elv_add_request(q, rq, ELEVATOR_INSERT_SORT_MERGE);
}

if (q) {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index ea85e20..27a7926 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -429,12 +429,14 @@ static int attempt_merge(struct request_queue *q, struct request *req,

req->__data_len += blk_rq_bytes(next);

- elv_merge_requests(q, req, next);
+ if (next->cmd_flags & REQ_SORTED) {
+ elv_merge_requests(q, req, next);

- /*
- * 'next' is going away, so update stats accordingly
- */
- blk_account_io_merge(next);
+ /*
+ * 'next' is going away, so update stats accordingly
+ */
+ blk_account_io_merge(next);
+ }

req->ioprio = ioprio_best(req->ioprio, next->ioprio);
if (blk_rq_cpu_valid(next))
@@ -465,3 +467,15 @@ int attempt_front_merge(struct request_queue *q, struct request *rq)

return 0;
}
+
+int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
+ struct request *next)
+{
+ int ret;
+
+ ret = attempt_merge(q, rq, next);
+ if (ret)
+ drive_stat_acct(rq, 0);
+
+ return ret;
+}
diff --git a/block/blk.h b/block/blk.h
index 49d21af..5b8ecbf 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -103,6 +103,9 @@ int ll_front_merge_fn(struct request_queue *q, struct request *req,
struct bio *bio);
int attempt_back_merge(struct request_queue *q, struct request *rq);
int attempt_front_merge(struct request_queue *q, struct request *rq);
+int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
+ struct request *next);
+void drive_stat_acct(struct request *rq, int new_io);
void blk_recalc_rq_segments(struct request *rq);
void blk_rq_set_mixed_merge(struct request *rq);

diff --git a/block/elevator.c b/block/elevator.c
index 542ce82..88bdf81 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -521,6 +521,33 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
return ELEVATOR_NO_MERGE;
}

+/*
+ * Returns true if we merged, false otherwise
+ */
+static bool elv_attempt_insert_merge(struct request_queue *q,
+ struct request *rq)
+{
+ struct request *__rq;
+
+ if (blk_queue_nomerges(q) || blk_queue_noxmerges(q))
+ return false;
+
+ /*
+ * First try one-hit cache.
+ */
+ if (q->last_merge && blk_attempt_req_merge(q, q->last_merge, rq))
+ return true;
+
+ /*
+ * See if our hash lookup can find a potential backmerge.
+ */
+ __rq = elv_rqhash_find(q, blk_rq_pos(rq));
+ if (__rq && blk_attempt_req_merge(q, __rq, rq))
+ return true;
+
+ return false;
+}
+
void elv_merged_request(struct request_queue *q, struct request *rq, int type)
{
struct elevator_queue *e = q->elevator;
@@ -647,6 +674,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
__blk_run_queue(q, false);
break;

+ case ELEVATOR_INSERT_SORT_MERGE:
+ if (elv_attempt_insert_merge(q, rq))
+ break;
case ELEVATOR_INSERT_SORT:
BUG_ON(rq->cmd_type != REQ_TYPE_FS &&
!(rq->cmd_flags & REQ_DISCARD));
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index ec6f72b..d93efcc 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -166,6 +166,7 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
#define ELEVATOR_INSERT_SORT 3
#define ELEVATOR_INSERT_REQUEUE 4
#define ELEVATOR_INSERT_FLUSH 5
+#define ELEVATOR_INSERT_SORT_MERGE 6

/*
* return values from elevator_may_queue_fn

--
Jens Axboe

2011-03-21 06:53:06

by Shaohua Li

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On Fri, 2011-03-18 at 21:52 +0800, Jens Axboe wrote:
> On 2011-03-18 13:54, Jens Axboe wrote:
> > On 2011-03-18 07:36, Shaohua Li wrote:
> >> On Thu, 2011-03-17 at 17:43 +0800, Jens Axboe wrote:
> >>> On 2011-03-17 02:00, Shaohua Li wrote:
> >>>> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> >>>>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> >>>>>> 2011/1/22 Jens Axboe <[email protected]>:
> >>>>>>> Signed-off-by: Jens Axboe <[email protected]>
> >>>>>>> ---
> >>>>>>> block/blk-core.c | 357 ++++++++++++++++++++++++++++++++------------
> >>>>>>> block/elevator.c | 6 +-
> >>>>>>> include/linux/blk_types.h | 2 +
> >>>>>>> include/linux/blkdev.h | 30 ++++
> >>>>>>> include/linux/elevator.h | 1 +
> >>>>>>> include/linux/sched.h | 6 +
> >>>>>>> kernel/exit.c | 1 +
> >>>>>>> kernel/fork.c | 3 +
> >>>>>>> kernel/sched.c | 11 ++-
> >>>>>>> 9 files changed, 317 insertions(+), 100 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>>>>> index 960f12c..42dbfcc 100644
> >>>>>>> --- a/block/blk-core.c
> >>>>>>> +++ b/block/blk-core.c
> >>>>>>> @@ -27,6 +27,7 @@
> >>>>>>> #include <linux/writeback.h>
> >>>>>>> #include <linux/task_io_accounting_ops.h>
> >>>>>>> #include <linux/fault-inject.h>
> >>>>>>> +#include <linux/list_sort.h>
> >>>>>>>
> >>>>>>> #define CREATE_TRACE_POINTS
> >>>>>>> #include <trace/events/block.h>
> >>>>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> >>>>>>>
> >>>>>>> q = container_of(work, struct request_queue, delay_work.work);
> >>>>>>> spin_lock_irq(q->queue_lock);
> >>>>>>> - q->request_fn(q);
> >>>>>>> + __blk_run_queue(q);
> >>>>>>> spin_unlock_irq(q->queue_lock);
> >>>>>>> }
> >>>>>> Hi Jens,
> >>>>>> I have some questions about the per-task plugging. Since the request
> >>>>>> list is per-task, and each task delivers its requests at finish flush
> >>>>>> or schedule. But when one cpu delivers requests to global queue, other
> >>>>>> cpus don't know. This seems to have problem. For example:
> >>>>>> 1. get_request_wait() can only flush current task's request list,
> >>>>>> other cpus/tasks might still have a lot of requests, which aren't sent
> >>>>>> to request_queue.
> >>>>>
> >>>>> But very soon these requests will be sent to request queue as soon task
> >>>>> is either scheduled out or task explicitly flushes the plug? So we might
> >>>>> wait a bit longer but that might not matter in general, i guess.
> >>>> Yes, I understand there is just a bit delay. I don't know how severe it
> >>>> is, but this still could be a problem, especially for fast storage or
> >>>> random I/O. My current tests show slight regression (3% or so) with
> >>>> Jens's for 2.6.39/core branch. I'm still checking if it's caused by the
> >>>> per-task plug, but the per-task plug is highly suspected.
> >>>
> >>> To check this particular case, you can always just bump the request
> >>> limit. What test is showing a slowdown?
> >> this is a simple multi-threaded seq read. The issue tends to be request
> >> merge related (not verified yet). The merge reduces about 60% with stack
> >> plug from fio reported data. From trace, without stack plug, requests
> >> from different threads get merged. But with it, such merge is impossible
> >> because flush_plug doesn't check merge, I thought we need add it again.
> >
> > What we could try is have the plug flush insert be
> > ELEVATOR_INSERT_SORT_MERGE and have it lookup potential backmerges.
> >
> > Here's a quick hack that does that, I have not tested it at all.
>
> Gave it a quick test spin, as suspected it had a few issues. This one
> seems to work. Can you toss it through that workload and see if it fares
> better?
Yes, this fully recovers the regression I saw. But I have accounting
issues:
1. The merged request is already accounted for when it's added to the
plug list.
2. drive_stat_acct() is called without any protection in
__make_request(), so there is a race in the in_flight accounting. The
race has existed since stack plugging was added, so it's not caused by
this patch.
Below is the extra patch I needed to do the test.

---
block/blk-merge.c | 12 +++++-------
block/elevator.c | 9 ++++++---
drivers/md/dm.c | 7 ++++---
fs/partitions/check.c | 3 ++-
include/linux/genhd.h | 12 ++++++------
5 files changed, 23 insertions(+), 20 deletions(-)

Index: linux-2.6/block/blk-merge.c
===================================================================
--- linux-2.6.orig/block/blk-merge.c
+++ linux-2.6/block/blk-merge.c
@@ -429,14 +429,12 @@ static int attempt_merge(struct request_

req->__data_len += blk_rq_bytes(next);

- if (next->cmd_flags & REQ_SORTED) {
- elv_merge_requests(q, req, next);
+ elv_merge_requests(q, req, next);

- /*
- * 'next' is going away, so update stats accordingly
- */
- blk_account_io_merge(next);
- }
+ /*
+ * 'next' is going away, so update stats accordingly
+ */
+ blk_account_io_merge(next);

req->ioprio = ioprio_best(req->ioprio, next->ioprio);
if (blk_rq_cpu_valid(next))
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c
+++ linux-2.6/block/elevator.c
@@ -566,13 +566,16 @@ void elv_merge_requests(struct request_q
{
struct elevator_queue *e = q->elevator;

- if (e->ops->elevator_merge_req_fn)
+ if ((next->cmd_flags & REQ_SORTED) && e->ops->elevator_merge_req_fn)
e->ops->elevator_merge_req_fn(q, rq, next);

elv_rqhash_reposition(q, rq);
- elv_rqhash_del(q, next);

- q->nr_sorted--;
+ if (next->cmd_flags & REQ_SORTED) {
+ elv_rqhash_del(q, next);
+ q->nr_sorted--;
+ }
+
q->last_merge = rq;
}

Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c
+++ linux-2.6/drivers/md/dm.c
@@ -477,7 +477,8 @@ static void start_io_acct(struct dm_io *
cpu = part_stat_lock();
part_round_stats(cpu, &dm_disk(md)->part0);
part_stat_unlock();
- dm_disk(md)->part0.in_flight[rw] = atomic_inc_return(&md->pending[rw]);
+ atomic_set(&dm_disk(md)->part0.in_flight[rw],
+ atomic_inc_return(&md->pending[rw]));
}

static void end_io_acct(struct dm_io *io)
@@ -497,8 +498,8 @@ static void end_io_acct(struct dm_io *io
* After this is decremented the bio must not be touched if it is
* a flush.
*/
- dm_disk(md)->part0.in_flight[rw] = pending =
- atomic_dec_return(&md->pending[rw]);
+ pending = atomic_dec_return(&md->pending[rw]);
+ atomic_set(&dm_disk(md)->part0.in_flight[rw], pending);
pending += atomic_read(&md->pending[rw^0x1]);

/* nudge anyone waiting on suspend queue */
Index: linux-2.6/fs/partitions/check.c
===================================================================
--- linux-2.6.orig/fs/partitions/check.c
+++ linux-2.6/fs/partitions/check.c
@@ -290,7 +290,8 @@ ssize_t part_inflight_show(struct device
{
struct hd_struct *p = dev_to_part(dev);

- return sprintf(buf, "%8u %8u\n", p->in_flight[0], p->in_flight[1]);
+ return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
+ atomic_read(&p->in_flight[1]));
}

#ifdef CONFIG_FAIL_MAKE_REQUEST
Index: linux-2.6/include/linux/genhd.h
===================================================================
--- linux-2.6.orig/include/linux/genhd.h
+++ linux-2.6/include/linux/genhd.h
@@ -109,7 +109,7 @@ struct hd_struct {
int make_it_fail;
#endif
unsigned long stamp;
- int in_flight[2];
+ atomic_t in_flight[2];
#ifdef CONFIG_SMP
struct disk_stats __percpu *dkstats;
#else
@@ -370,21 +370,21 @@ static inline void free_part_stats(struc

static inline void part_inc_in_flight(struct hd_struct *part, int rw)
{
- part->in_flight[rw]++;
+ atomic_inc(&part->in_flight[rw]);
if (part->partno)
- part_to_disk(part)->part0.in_flight[rw]++;
+ atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
}

static inline void part_dec_in_flight(struct hd_struct *part, int rw)
{
- part->in_flight[rw]--;
+ atomic_dec(&part->in_flight[rw]);
if (part->partno)
- part_to_disk(part)->part0.in_flight[rw]--;
+ atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
}

static inline int part_in_flight(struct hd_struct *part)
{
- return part->in_flight[0] + part->in_flight[1];
+ return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
}

static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)

2011-03-21 09:20:36

by Jens Axboe

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On 2011-03-21 07:52, Shaohua Li wrote:
>> Gave it a quick test spin, as suspected it had a few issues. This one
>> seems to work. Can you toss it through that workload and see if it fares
>> better?
> yes, this fully restores the regression I saw. But I have accounting
> issue:

Great!

> 1. The merged request is already accounted when it's added into plug
> list

Good catch. I've updated the patch and merged it now, integrating this
accounting fix.

> 2. drive_stat_acct() is called without any protection in
> __make_request(). So there is race for in_flight accounting. The race
> exists after stack plug is added, so not because of this issue.
> Below is the extra patch I need to do the test.

Looks fine. Can I add your signed-off-by to this patch? I'll merge it as
a separate fix.

--
Jens Axboe

2011-03-22 00:32:19

by Shaohua Li

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On Mon, 2011-03-21 at 17:20 +0800, Jens Axboe wrote:
> On 2011-03-21 07:52, Shaohua Li wrote:
> >> Gave it a quick test spin, as suspected it had a few issues. This one
> >> seems to work. Can you toss it through that workload and see if it fares
> >> better?
> > yes, this fully restores the regression I saw. But I have accounting
> > issue:
>
> Great!
>
> > 1. The merged request is already accounted when it's added into plug
> > list
>
> Good catch. I've updated the patch and merged it now, integrating this
> accounting fix.
>
> > 2. drive_stat_acct() is called without any protection in
> > __make_request(). So there is race for in_flight accounting. The race
> > exists after stack plug is added, so not because of this issue.
> > Below is the extra patch I need to do the test.
>
> Looks fine. Can I add your signed-off-by to this patch? I'll merge it as
> a separate fix.
Sure.
Signed-off-by: Shaohua Li<[email protected]>

2011-03-22 07:36:52

by Jens Axboe

Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging

On 2011-03-22 01:32, Shaohua Li wrote:
> On Mon, 2011-03-21 at 17:20 +0800, Jens Axboe wrote:
>> On 2011-03-21 07:52, Shaohua Li wrote:
>>>> Gave it a quick test spin, as suspected it had a few issues. This one
>>>> seems to work. Can you toss it through that workload and see if it fares
>>>> better?
>>> yes, this fully restores the regression I saw. But I have accounting
>>> issue:
>>
>> Great!
>>
>>> 1. The merged request is already accounted when it's added into plug
>>> list
>>
>> Good catch. I've updated the patch and merged it now, integrating this
>> accounting fix.
>>
>>> 2. drive_stat_acct() is called without any protection in
>>> __make_request(). So there is race for in_flight accounting. The race
>>> exists after stack plug is added, so not because of this issue.
>>> Below is the extra patch I need to do the test.
>>
>> Looks fine. Can I add your signed-off-by to this patch? I'll merge it as
>> a separate fix.
> Sure.
> Signed-off-by: Shaohua Li<[email protected]>

Thanks, patch has been added now.

--
Jens Axboe