2004-03-01 21:43:59

by Chen, Kenneth W

Subject: per-cpu blk_plug_list

We were surprised to find that a global lock still exists in the I/O
path on the 2.6 kernel. This global lock triggers unacceptable lock
contention and prevents the 2.6 kernel from scaling on the I/O side.

blk_plug_list/blk_plug_lock manages the plug/unplug action. When lots
of CPUs submit I/O simultaneously, there is a lot of movement of device
queues on and off that global list. Our measurements showed that
blk_plug_lock contention prevents the linux-2.6.3 kernel from scaling
past 40 thousand I/Os per second in the I/O submit path.

Looking at the vmstat and readprofile output with vanilla 2.6.3 kernel:

procs -----------memory---------- -----io---- -system-- ----cpu----
r b swpd free buff cache bi bo in cs us sy id wa
409 484 0 22470560 10160 181792 89456 5867 46899 26383 7 93 0 0
427 488 0 22466176 10160 181792 90307 6057 46818 26155 7 93 0 0
448 469 0 22462208 10160 181792 84076 5769 48217 26782 7 93 0 0

575324 total 0.0818
225342 blk_run_queues 391.2188
140164 __make_request 38.4221
131765 scsi_request_fn 55.6440
21566 generic_unplug_device 61.2670
6050 get_request 3.7812

All CPUs are pegged at 93% kernel time spinning on the lock.
blk_plug_device() and blk_remove_plug() didn't show up in readprofile
because they spin with irqs disabled and readprofile is based on the
timer interrupt. Nevertheless, it's obvious from blk_run_queues().
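
For reference, every plug and unplug in vanilla 2.6.3 funnels through that
single global lock; roughly (abridged from drivers/block/ll_rw_blk.c):

static LIST_HEAD(blk_plug_list);
static spinlock_t blk_plug_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;

void blk_plug_device(request_queue_t *q)
{
	WARN_ON(!irqs_disabled());
	if (!blk_queue_plugged(q)
	    && !test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags)) {
		spin_lock(&blk_plug_lock);	/* global, taken on every plug */
		list_add_tail(&q->plug_list, &blk_plug_list);
		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
		spin_unlock(&blk_plug_lock);
	}
}

int blk_remove_plug(request_queue_t *q)
{
	WARN_ON(!irqs_disabled());
	if (blk_queue_plugged(q)) {
		spin_lock(&blk_plug_lock);	/* and again on every unplug */
		list_del_init(&q->plug_list);
		del_timer(&q->unplug_timer);
		spin_unlock(&blk_plug_lock);
		return 1;
	}
	return 0;
}

blk_run_queues() walks blk_plug_list under the same lock, which is where the
contention shows up in the profile above.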

The plug management needs to be converted to per-cpu. The idea here
is to manage device queues on per-cpu plug lists. This would also allow
optimal queue kick-off, since submit and kick-off typically occur within
the same system call. With our proof-of-concept patch, here are the
results, again vmstat and readprofile output:

procs -----------memory---------- -----io---- -system-- ----cpu----
r b swpd free buff cache bi bo in cs us sy id wa
30 503 0 22506176 9536 172096 212849 20251 102537 177093 27 20 0 53
11 532 0 22505792 9536 172096 209073 21235 103374 177183 29 20 0 51
27 499 0 22505408 9536 172096 205334 21943 103883 176649 30 20 0 50

1988800 total 0.2828
1174759 cpu_idle 2447.4146
156385 default_idle 4887.0312
72586 do_softirq 108.0149
62463 schedule 12.0492
51913 scsi_request_fn 21.9227
34351 __make_request 9.4164
24370 dio_bio_end_io 63.4635
22652 scsi_end_request 44.2422

It clearly relieves the global lock contention and the kernel time is
now in the expected range. I/O has improved to 110K random I/Os per
second and is now limited only by the disks instead of the kernel.

The workload that triggers this scalability issue is an I/O intensive
database workload running on a 32P system.

Comments on the patch are welcome, I'm sure there will be a couple of
iterations on the solution.

- Ken


Attachments:
percpu_blk_plug.patch (5.41 kB)

2004-03-01 22:06:31

by Jens Axboe

Subject: Re: per-cpu blk_plug_list

On Mon, Mar 01 2004, Chen, Kenneth W wrote:
> We were surprised to find that a global lock still exists in the I/O
> path on the 2.6 kernel. This global lock triggers unacceptable lock
> contention and prevents the 2.6 kernel from scaling on the I/O side.
>
> blk_plug_list/blk_plug_lock manages the plug/unplug action. When lots
> of CPUs submit I/O simultaneously, there is a lot of movement of device
> queues on and off that global list. Our measurements showed that
> blk_plug_lock contention prevents the linux-2.6.3 kernel from scaling
> past 40 thousand I/Os per second in the I/O submit path.
>
> Looking at the vmstat and readprofile output with vanilla 2.6.3 kernel:
>
> procs -----------memory---------- -----io---- -system-- ----cpu----
> r b swpd free buff cache bi bo in cs us sy id wa
> 409 484 0 22470560 10160 181792 89456 5867 46899 26383 7 93 0 0
> 427 488 0 22466176 10160 181792 90307 6057 46818 26155 7 93 0 0
> 448 469 0 22462208 10160 181792 84076 5769 48217 26782 7 93 0 0
>
> 575324 total 0.0818
> 225342 blk_run_queues 391.2188
> 140164 __make_request 38.4221
> 131765 scsi_request_fn 55.6440
> 21566 generic_unplug_device 61.2670
> 6050 get_request 3.7812
>
> All CPUs are pegged at 93% kernel time spinning on the lock.
> blk_plug_device() and blk_remove_plug() didn't show up in readprofile
> because they spin with irqs disabled and readprofile is based on the
> timer interrupt. Nevertheless, it's obvious from blk_run_queues().

Yep doesn't look too good...

> The plug management needs to be converted to per-cpu. The idea here
> is to manage device queues on per-cpu plug lists. This would also allow
> optimal queue kick-off, since submit and kick-off typically occur within
> the same system call. With our proof-of-concept patch, here are the
> results, again vmstat and readprofile output:
>
> procs -----------memory---------- -----io---- -system-- ----cpu----
> r b swpd free buff cache bi bo in cs us sy id wa
> 30 503 0 22506176 9536 172096 212849 20251 102537 177093 27 20 0 53
> 11 532 0 22505792 9536 172096 209073 21235 103374 177183 29 20 0 51
> 27 499 0 22505408 9536 172096 205334 21943 103883 176649 30 20 0 50
>
> 1988800 total 0.2828
> 1174759 cpu_idle 2447.4146
> 156385 default_idle 4887.0312
> 72586 do_softirq 108.0149
> 62463 schedule 12.0492
> 51913 scsi_request_fn 21.9227
> 34351 __make_request 9.4164
> 24370 dio_bio_end_io 63.4635
> 22652 scsi_end_request 44.2422
>
> It clearly relieves the global lock contention and the kernel time is
> now in the expected range. I/O has improved to 110K random I/Os per
> second and is now limited only by the disks instead of the kernel.
>
> The workload that triggers this scalability issue is an I/O intensive
> database workload running on a 32P system.
>
> Comments on the patch are welcome, I'm sure there will be a couple of
> iterations on the solution.

(you should inline patches, or at least make sure they have the proper
mimetype so one can comment on them inlined).

The patch doesn't look all bad, I like the per-cpu plug lists. The
fs/direct-io.c change looks really buggy though: what prevents io from
going to other cpus?! There's also some "cosmetic" stuff, like why
aren't you using the per-cpu helpers?
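
Something along these lines, instead of the open-coded NR_CPUS array
(untested sketch, reusing the blk_plug_struct fields from your patch):

static DEFINE_PER_CPU(struct blk_plug_struct, blk_plug);

void blk_plug_device(request_queue_t *q)
{
	WARN_ON(!irqs_disabled());
	if (!blk_queue_plugged(q)
	    && !test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags)) {
		/* get_cpu_var() disables preemption until put_cpu_var() */
		struct blk_plug_struct *plug = &get_cpu_var(blk_plug);

		spin_lock(&plug->blk_plug_lock);
		list_add_tail(&q->plug_list, &plug->blk_plug_head);
		q->blk_plug = plug;
		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
		spin_unlock(&plug->blk_plug_lock);
		put_cpu_var(blk_plug);
	}
}

blk_run_queues() would then walk per_cpu(blk_plug, i) under for_each_cpu(i),
with the list heads and locks initialised in blk_dev_init() as before.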

--
Jens Axboe

2004-03-01 22:14:54

by Chen, Kenneth W

Subject: RE: per-cpu blk_plug_list

Jens Axboe wrote:
> (you should inline patches, or at least make sure they have
> the proper mimetype so one can comment on them inlined).

My email client sucks and usually screws up inline patches.
Here is the patch again (please don't blame me if it got messed up).

diff -Nurp linux-2.6.3/drivers/block/ll_rw_blk.c linux-2.6.3.blk/drivers/block/ll_rw_blk.c
--- linux-2.6.3/drivers/block/ll_rw_blk.c 2004-02-17 19:57:16
+++ linux-2.6.3.blk/drivers/block/ll_rw_blk.c 2004-03-01 11:36:09
@@ -39,8 +39,7 @@ static kmem_cache_t *request_cachep;
/*
* plug management
*/
-static LIST_HEAD(blk_plug_list);
-static spinlock_t blk_plug_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
+static struct blk_plug_struct blk_plug_array[NR_CPUS];

static wait_queue_head_t congestion_wqh[2];

@@ -1096,10 +1095,14 @@ void blk_plug_device(request_queue_t *q)
*/
if (!blk_queue_plugged(q)
&& !test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags)) {
- spin_lock(&blk_plug_lock);
- list_add_tail(&q->plug_list, &blk_plug_list);
+ struct blk_plug_struct * local_blk_plug =
+ 		&blk_plug_array[smp_processor_id()];
+
+ spin_lock(&local_blk_plug->blk_plug_lock);
+ list_add_tail(&q->plug_list, &local_blk_plug->blk_plug_head);
+ q->blk_plug = local_blk_plug;
mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
- spin_unlock(&blk_plug_lock);
+ spin_unlock(&local_blk_plug->blk_plug_lock);
}
}

@@ -1113,10 +1116,11 @@ int blk_remove_plug(request_queue_t *q)
{
WARN_ON(!irqs_disabled());
if (blk_queue_plugged(q)) {
- spin_lock(&blk_plug_lock);
+ struct blk_plug_struct * blk_plug = q->blk_plug;
+ spin_lock(&blk_plug->blk_plug_lock);
list_del_init(&q->plug_list);
del_timer(&q->unplug_timer);
- spin_unlock(&blk_plug_lock);
+ spin_unlock(&blk_plug->blk_plug_lock);
return 1;
}

@@ -1237,6 +1241,26 @@ void blk_run_queue(struct request_queue

EXPORT_SYMBOL(blk_run_queue);

+/*
+ * blk_run_queues_cpu - fire all plugged queues on this cpu
+ */
+#define blk_plug_entry(entry) list_entry((entry), request_queue_t, plug_list)
+void blk_run_queues_cpu(int cpu)
+{
+ struct blk_plug_struct * cur_blk_plug = &blk_plug_array[cpu];
+ struct list_head * head = &cur_blk_plug->blk_plug_head;
+ spinlock_t *lock = &cur_blk_plug->blk_plug_lock;
+
+ spin_lock_irq(lock);
+ while (!list_empty(head)) {
+ request_queue_t *q = blk_plug_entry(head->next);
+ spin_unlock_irq(lock);
+ q->unplug_fn(q);
+ spin_lock_irq(lock);
+ }
+ spin_unlock_irq(lock);
+}
+
/**
* blk_run_queues - fire all plugged queues
*
@@ -1245,30 +1269,12 @@ EXPORT_SYMBOL(blk_run_queue);
* are currently stopped are ignored. This is equivalent to the older
* tq_disk task queue run.
**/
-#define blk_plug_entry(entry) list_entry((entry), request_queue_t, plug_list)
void blk_run_queues(void)
{
- LIST_HEAD(local_plug_list);
-
- spin_lock_irq(&blk_plug_lock);
-
- /*
- * this will happen fairly often
- */
- if (list_empty(&blk_plug_list))
- goto out;
-
- list_splice_init(&blk_plug_list, &local_plug_list);
-
- while (!list_empty(&local_plug_list)) {
- request_queue_t *q = blk_plug_entry(local_plug_list.next);
+ int i;

- spin_unlock_irq(&blk_plug_lock);
- q->unplug_fn(q);
- spin_lock_irq(&blk_plug_lock);
- }
-out:
- spin_unlock_irq(&blk_plug_lock);
+ for_each_cpu(i)
+ blk_run_queues_cpu(i);
}

EXPORT_SYMBOL(blk_run_queues);
@@ -2687,6 +2693,11 @@ int __init blk_dev_init(void)

for (i = 0; i < ARRAY_SIZE(congestion_wqh); i++)
init_waitqueue_head(&congestion_wqh[i]);
+
+ for (i = 0; i < NR_CPUS; i++) {
+ INIT_LIST_HEAD(&blk_plug_array[i].blk_plug_head);
+ spin_lock_init(&blk_plug_array[i].blk_plug_lock);
+ }
return 0;
}

diff -Nurp linux-2.6.3/fs/direct-io.c linux-2.6.3.blk/fs/direct-io.c
--- linux-2.6.3/fs/direct-io.c 2004-02-17 19:58:16
+++ linux-2.6.3.blk/fs/direct-io.c 2004-03-01 11:36:09
@@ -329,7 +329,7 @@ static struct bio *dio_await_one(struct
if (dio->bio_list == NULL) {
dio->waiter = current;
spin_unlock_irqrestore(&dio->bio_list_lock, flags);
- blk_run_queues();
+ blk_run_queues_cpu(smp_processor_id());
io_schedule();
spin_lock_irqsave(&dio->bio_list_lock, flags);
dio->waiter = NULL;
@@ -960,7 +960,7 @@ direct_io_worker(int rw, struct kiocb *i
if (ret == 0)
ret = dio->result; /* Bytes written */
finished_one_bio(dio); /* This can free the dio */
- blk_run_queues();
+ blk_run_queues_cpu(smp_processor_id());
} else {
finished_one_bio(dio);
ret2 = dio_await_completion(dio);
diff -Nurp linux-2.6.3/include/linux/blkdev.h linux-2.6.3.blk/include/linux/blkdev.h
--- linux-2.6.3/include/linux/blkdev.h 2004-02-17 19:57:20
+++ linux-2.6.3.blk/include/linux/blkdev.h 2004-03-01 11:36:09
@@ -267,6 +267,11 @@ struct blk_queue_tag {
atomic_t refcnt; /* map can be shared */
};

+struct blk_plug_struct {
+ struct list_head blk_plug_head;
+ spinlock_t blk_plug_lock;
+} ____cacheline_aligned;
+
struct request_queue
{
/*
@@ -316,6 +321,7 @@ struct request_queue
int bounce_gfp;

struct list_head plug_list;
+ struct blk_plug_struct *blk_plug; /* blk_plug it belongs to */

/*
* various queue flags, see QUEUE_* below
diff -Nurp linux-2.6.3/include/linux/fs.h linux-2.6.3.blk/include/linux/fs.h
--- linux-2.6.3/include/linux/fs.h 2004-02-17 19:57:20
+++ linux-2.6.3.blk/include/linux/fs.h 2004-03-01 11:36:09
@@ -1152,6 +1152,7 @@ extern int blkdev_put(struct block_devic
extern int bd_claim(struct block_device *, void *);
extern void bd_release(struct block_device *);
extern void blk_run_queues(void);
+extern void blk_run_queues_cpu(int);

/* fs/char_dev.c */
extern int alloc_chrdev_region(dev_t *, unsigned, unsigned, char *);

2004-03-01 22:16:07

by Andrew Morton

Subject: Re: per-cpu blk_plug_list

"Chen, Kenneth W" <[email protected]> wrote:
>
> We were surprised to find that a global lock still exists in the I/O
> path on the 2.6 kernel.

Well I'm surprised that nobody has complained about it yet ;)

Disks are slow. 100 requests per second, perhaps. So a single lock is
generally OK until you have a vast number of disks and lots of
contention-prone CPUs.

Yes, we should fix this.

> blk_plug_list/blk_plug_lock manages the plug/unplug action. When lots
> of CPUs submit I/O simultaneously, there is a lot of movement of device
> queues on and off that global list. Our measurements showed that
> blk_plug_lock contention prevents the linux-2.6.3 kernel from scaling
> past 40 thousand I/Os per second in the I/O submit path.

Looking at the guts of your patch:

/*
* blk_run_queues_cpu - fire all plugged queues on this cpu
*/
#define blk_plug_entry(entry) list_entry((entry), request_queue_t, plug_list)
void blk_run_queues_cpu(int cpu)
{
struct blk_plug_struct * cur_blk_plug = &blk_plug_array[cpu];
struct list_head * head = &cur_blk_plug->blk_plug_head;
spinlock_t *lock = &cur_blk_plug->blk_plug_lock;

spin_lock_irq(lock);
while (!list_empty(head)) {
request_queue_t *q = blk_plug_entry(head->next);
spin_unlock_irq(lock);
q->unplug_fn(q);
spin_lock_irq(lock);
}
spin_unlock_irq(lock);
}

void blk_run_queues(void)
{
int i;

for_each_cpu(i)
blk_run_queues_cpu(i);
}

How on earth do you know that when direct-io calls blk_run_queues_cpu(), it
is still running on the cpu which initially plugged the queue(s)?

And your patch might have made the much-more-frequently-called
blk_run_queues() more expensive.

There are minor issues with lack of preemption safety and not using the
percpu data infrastructure, but they can wait.

You haven't told us how many disks are in use? At 100k IO's per second it
would be 500 to 1000 disks, yes?

I assume you tried it, but does this help?

diff -puN drivers/block/ll_rw_blk.c~a drivers/block/ll_rw_blk.c
--- 25/drivers/block/ll_rw_blk.c~a Mon Mar 1 13:52:37 2004
+++ 25-akpm/drivers/block/ll_rw_blk.c Mon Mar 1 13:52:45 2004
@@ -1251,6 +1251,9 @@ void blk_run_queues(void)
{
LIST_HEAD(local_plug_list);

+ if (list_empty(&blk_plug_list))
+ return;
+
spin_lock_irq(&blk_plug_lock);

/*

_

2004-03-01 22:28:25

by Chen, Kenneth W

Subject: RE: per-cpu blk_plug_list

Andrew Morton wrote:
>>
>> We were surprised to find that a global lock still exists in the I/O
>> path on the 2.6 kernel.
>
> Well I'm surprised that nobody has complained about it yet ;)

We are now, very loudly :-)


> How on earth do you know that when direct-io calls blk_run_queues_cpu(),
> it is still running on the cpu which initially plugged the queue(s)?
>
> And your patch might have made the much-more-frequently-called
> blk_run_queues() more expensive.
>
> There are minor issues with lack of preemption safety and not using the
> percpu data infrastructure, but they can wait.

Absolutely, these are details that need to be worked out, and we are
working on them at the moment.


> You haven't told us how many disks are in use? At 100k IO's per second it
> would be 500 to 1000 disks, yes?

Right again!! We are using 16 fiber-channel controllers connected to
just over 1000 disks.

> I assume you tried it, but does this help?
>
> diff -puN drivers/block/ll_rw_blk.c~a drivers/block/ll_rw_blk.c
> --- 25/drivers/block/ll_rw_blk.c~a Mon Mar 1 13:52:37 2004
> +++ 25-akpm/drivers/block/ll_rw_blk.c Mon Mar 1 13:52:45 2004
> @@ -1251,6 +1251,9 @@ void blk_run_queues(void)
> {
> LIST_HEAD(local_plug_list);
>
> + if (list_empty(&blk_plug_list))
> + return;
> +
> spin_lock_irq(&blk_plug_lock);
>
> /*

Yeah, that's the first thing we tried; it didn't make a dent in the lock
contention at all.

2004-03-01 22:29:49

by Andrew Morton

Subject: Re: per-cpu blk_plug_list

Andrew Morton <[email protected]> wrote:
>
> @@ -1251,6 +1251,9 @@ void blk_run_queues(void)
> {
> LIST_HEAD(local_plug_list);
>
> + if (list_empty(&blk_plug_list))
> + return;
> +

hmm, no, that won't help. There will always be at least one plugged disk.

Perhaps what we need here is a per-task container of "queues which I have
plugged". Or, more accurately, "queues which need to be unplugged so that
I can complete". It'll get tricky because in some situations a task can
need unplugging of a queue which it did not plug, and even a queue to which
it did not submit any I/O.

You should try just nuking the whole plugging scheme, btw. Stick a bunch
of `return' statements at the top of all the plugging functions. I bet
that would run nicely...

2004-03-01 22:36:47

by Andi Kleen

Subject: Re: per-cpu blk_plug_list


What happens when a process that started a plug on one CPU switches CPUs
before unplugging?

I don't see any mechanism in your patch to make sure the queue on the previous
CPU is flushed too. It could be delayed for a long time.

-Andi

2004-03-03 03:44:49

by Jesse Barnes

Subject: Re: per-cpu blk_plug_list

On Mon, Mar 01, 2004 at 01:18:40PM -0800, Chen, Kenneth W wrote:
> blk_plug_list/blk_plug_lock manages the plug/unplug action. When lots
> of CPUs submit I/O simultaneously, there is a lot of movement of device
> queues on and off that global list. Our measurements showed that
> blk_plug_lock contention prevents the linux-2.6.3 kernel from scaling
> past 40 thousand I/Os per second in the I/O submit path.

This helped out our machines quite a bit too. Without the patch, we
weren't able to scale above 80000 IOPS, but now we exceed 110000 (and
parity with our internal XSCSI based tree).

Maybe the plug lists and locks should be per-device though, rather than
per-cpu? That would make the migration case easier I think. Is that
possible?
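
Something like keeping the plugged state in the queue itself, with no shared
list at all? A rough sketch (it assumes a new QUEUE_FLAG_PLUGGED bit, and it
still leaves open how blk_run_queues() finds the queues to kick):

void blk_plug_device(request_queue_t *q)
{
	WARN_ON(!irqs_disabled());
	if (!test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags)
	    && !test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
}

int blk_remove_plug(request_queue_t *q)
{
	WARN_ON(!irqs_disabled());
	if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
		return 0;
	del_timer(&q->unplug_timer);
	return 1;
}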

Thanks,
Jesse

2004-03-03 03:54:30

by Andrew Morton

Subject: Re: per-cpu blk_plug_list

[email protected] (Jesse Barnes) wrote:
>
> On Mon, Mar 01, 2004 at 01:18:40PM -0800, Chen, Kenneth W wrote:
> > blk_plug_list/blk_plug_lock manages the plug/unplug action. When lots of
> > CPUs submit I/O simultaneously, there is a lot of movement of device
> > queues on and off that global list. Our measurements showed that
> > blk_plug_lock contention prevents the linux-2.6.3 kernel from scaling
> > past 40 thousand I/Os per second in the I/O submit path.
>
> This helped out our machines quite a bit too. Without the patch, we
> weren't able to scale above 80000 IOPS, but now we exceed 110000 (and
> parity with our internal XSCSI based tree).
>
> Maybe the plug lists and locks should be per-device though, rather than
> per-cpu? That would make the migration case easier I think. Is that
> possible?

It's possible, yes. It is the preferred solution. We need to identify all
the queues which need to be unplugged to permit a VFS-level IO request to
complete. It involves running down the device stack and running around all
the contributing queues at each level.

Relatively straightforward, but first those dang semaphores in device
mapper need to become spinlocks. I haven't looked into what difficulties
might be present in the RAID implementation.
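
The dm leg of that walk would look much like dm_table_any_congested(),
something like this (names illustrative, untested):

void dm_table_unplug_all(struct dm_table *t)
{
	struct list_head *d, *devices = dm_table_get_devices(t);

	/* kick every underlying queue; stacked devices recurse further down */
	for (d = devices->next; d != devices; d = d->next) {
		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
		request_queue_t *q = bdev_get_queue(dd->bdev);

		if (q->unplug_fn)
			q->unplug_fn(q);
	}
}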

2004-03-03 04:21:12

by Chen, Kenneth W

Subject: RE: per-cpu blk_plug_list

I don't understand the proposal here. There is a per-device lock
already, but a plugged queue needs to be on some list outside itself
so that a group of them can be unplugged later to flush all the I/O.

- Ken


-----Original Message-----
From: Andrew Morton [mailto:[email protected]]
Sent: Tuesday, March 02, 2004 7:56 PM
To: Jesse Barnes
Cc: Chen, Kenneth W; [email protected]
Subject: Re: per-cpu blk_plug_list

> On Mon, Mar 01, 2004 at 01:18:40PM -0800, Chen, Kenneth W wrote:
> > blk_plug_list/blk_plug_lock manages the plug/unplug action. When lots of
> > CPUs submit I/O simultaneously, there is a lot of movement of device
> > queues on and off that global list. Our measurements showed that
> > blk_plug_lock contention prevents the linux-2.6.3 kernel from scaling
> > past 40 thousand I/Os per second in the I/O submit path.
>
> This helped out our machines quite a bit too. Without the patch, we
> weren't able to scale above 80000 IOPS, but now we exceed 110000 (and
> parity with our internal XSCSI based tree).
>
> Maybe the plug lists and locks should be per-device though, rather than
> per-cpu? That would make the migration case easier I think. Is that
> possible?

It's possible, yes. It is the preferred solution. We need to identify all
the queues which need to be unplugged to permit a VFS-level IO request to
complete. It involves running down the device stack and running around all
the contributing queues at each level.

Relatively straightforward, but first those dang semaphores in device
mapper need to become spinlocks. I haven't looked into what difficulties
might be present in the RAID implementation.

2004-03-03 05:12:39

by Andrew Morton

Subject: Re: per-cpu blk_plug_list

"Chen, Kenneth W" <[email protected]> wrote:
>
> I don't understand the proposal here. There is a per-device lock
> already. But the plugged queue need to be on some list outside itself
> so a group of them can be unplugged later on to flush all the I/O.


here's the proposal:


Regarding this:

http://www.ussg.iu.edu/hypermail/linux/kernel/0403.0/0179.html

And also having looked at Miquel's (currently slightly defective)
implementation of the any_congested() API for devicemapper:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.4-rc1/2.6.4-rc1-mm1/broken-out/queue-congestion-dm-implementation.patch

I am thinking that an appropriate way of solving the blk_run_queues() lock
contention problem is to nuke the global plug list altogether and make the
unplug function a method in struct backing_dev_info.

This is conceptually the appropriate place to put it - it is almost always
the case that when we run blk_run_queues() it is on behalf of an
address_space, and the few remaining cases can be simply deleted -
mm/mempool.c is the last one I think.

The implementation of backing_dev_info.unplug() would have to run the
unplug_fn of every queue which contributes to the top-level queue (the
thing which the address_space is sitting on top of).

We discussed this maybe a year ago with Jens but decided against it for
complexity reasons, but gee, dm_table_any_congested() isn't complex. Do we
foresee any particular problem with this?
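
In code the proposal is roughly this (names illustrative, untested): give
backing_dev_info an unplug method,

	void (*unplug_io_fn)(struct backing_dev_info *);

point a plain disk queue's method at something like

static void blk_backing_dev_unplug(struct backing_dev_info *bdi)
{
	request_queue_t *q = container_of(bdi, request_queue_t, backing_dev_info);

	generic_unplug_device(q);
}

and replace the blk_run_queues() calls that act on behalf of an
address_space with

static inline void blk_run_address_space(struct address_space *mapping)
{
	if (mapping)
		mapping->backing_dev_info->unplug_io_fn(mapping->backing_dev_info);
}

dm and md then provide their own unplug_io_fn which walks their tables.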

2004-03-03 09:45:21

by Miquel van Smoorenburg

Subject: Re: per-cpu blk_plug_list

According to Andrew Morton:
> And also having looked at Miquel's (currently slightly defective)
> implementation of the any_congested() API for devicemapper:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.4-rc1/2.6.4-rc1-mm1/broken-out/queue-congestion-dm-implementation.patch
>
> I am thinking that an appropriate way of solving the blk_run_queues() lock
> contention problem is to nuke the global plug list altogether and make the
> unplug function a method in struct backing_dev_info.
>
> This is conceptually the appropriate place to put it - it is almost always
> the case that when we run blk_run_queues() it is on behalf of an
> address_space, and the few remaining cases can be simply deleted -
> mm/mempool.c is the last one I think.
>
> The implementation of backing_dev_info.unplug() would have to run the
> unplug_fn of every queue which contributes to the top-level queue (the
> thing which the address_space is sitting on top of).

But then you need a pointer to the queue. In that case,
you might as well put the congested_fn pointer in the request_queue
too. Then you get something like
https://www.redhat.com/archives/linux-lvm/2004-February/msg00203.html
(though I'd replace "rw" with "bdi_bits" like in the current patch).

Mike.

2004-03-03 09:54:52

by Andrew Morton

Subject: Re: per-cpu blk_plug_list

Miquel van Smoorenburg <[email protected]> wrote:
>
> According to Andrew Morton:
> > And also having looked at Miquel's (currently slightly defective)
> > implementation of the any_congested() API for devicemapper:
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.4-rc1/2.6.4-rc1-mm1/broken-out/queue-congestion-dm-implementation.patch
> >
> > I am thinking that an appropriate way of solving the blk_run_queues() lock
> > contention problem is to nuke the global plug list altogether and make the
> > unplug function a method in struct backing_dev_info.
> >
> > This is conceptually the appropriate place to put it - it is almost always
> > the case that when we run blk_run_queues() it is on behalf of an
> > address_space, and the few remaining cases can be simply deleted -
> > mm/mempool.c is the last one I think.
> >
> > The implementation of backing_dev_info.unplug() would have to run the
> > unplug_fn of every queue which contributes to the top-level queue (the
> > thing which the address_space is sitting on top of).
>
> But then you need a pointer to the queue. In that case,
> you might as well put the congested_fn pointer in the request_queue
> too.

It already is. backing_dev_info is a member of struct request_queue.

> Then you get something like
> https://www.redhat.com/archives/linux-lvm/2004-February/msg00203.html
> (though I'd replace "rw" with "bdi_bits" like in the current patch).

Sure. This is all nice and simple stuff, except for the locking, which
needs serious work.

2004-03-03 16:24:44

by Miquel van Smoorenburg

Subject: Re: per-cpu blk_plug_list

On 2004.03.03 10:54, Andrew Morton wrote:
> Sure. This is all nice and simple stuff, except for the locking, which
> needs serious work.

Is this a step in the right direction?

dm-rwlock.patch

This patch converts the semaphore locks in dm.c to rwlocks. The bios are
assembled into a list under the lock and then submitted to
generic_make_request outside of the lock.

--- linux-2.6.4-rc1-mm2/drivers/md/dm.c.ORIG 2004-03-03 15:08:58.000000000 +0100
+++ linux-2.6.4-rc1-mm2/drivers/md/dm.c 2004-03-03 17:14:32.000000000 +0100
@@ -35,7 +35,7 @@
#define DMF_SUSPENDED 1

struct mapped_device {
- struct rw_semaphore lock;
+ rwlock_t lock;
atomic_t holders;

unsigned long flags;
@@ -189,16 +189,16 @@
*/
static int queue_io(struct mapped_device *md, struct bio *bio)
{
- down_write(&md->lock);
+ write_lock(&md->lock);

if (!test_bit(DMF_BLOCK_IO, &md->flags)) {
- up_write(&md->lock);
+ write_unlock(&md->lock);
return 1;
}

bio_list_add(&md->deferred, bio);

- up_write(&md->lock);
+ write_unlock(&md->lock);
return 0; /* deferred successfully */
}

@@ -263,7 +263,7 @@
return len;
}

-static void __map_bio(struct dm_target *ti, struct bio *clone, struct dm_io *io)
+static int __map_bio(struct dm_target *ti, struct bio *clone, struct dm_io *io)
{
int r;

@@ -282,13 +282,12 @@
*/
atomic_inc(&io->io_count);
r = ti->type->map(ti, clone);
- if (r > 0)
- /* the bio has been remapped so dispatch it */
- generic_make_request(clone);
-
- else if (r < 0)
+
+ if (r < 0)
/* error the io and bail out */
dec_pending(io, -EIO);
+
+ return r;
}

struct clone_info {
@@ -343,7 +342,7 @@
return clone;
}

-static void __clone_and_map(struct clone_info *ci)
+static void __clone_and_map(struct clone_info *ci, struct bio_list *bl)
{
struct bio *clone, *bio = ci->bio;
struct dm_target *ti = dm_table_find_target(ci->md->map, ci->sector);
@@ -356,7 +355,8 @@
*/
clone = clone_bio(bio, ci->sector, ci->idx,
bio->bi_vcnt - ci->idx, ci->sector_count);
- __map_bio(ti, clone, ci->io);
+ if (__map_bio(ti, clone, ci->io) > 0)
+ bio_list_add(bl, clone);
ci->sector_count = 0;

} else if (to_sector(bio->bi_io_vec[ci->idx].bv_len) <= max) {
@@ -379,7 +379,8 @@
}

clone = clone_bio(bio, ci->sector, ci->idx, i - ci->idx, len);
- __map_bio(ti, clone, ci->io);
+ if (__map_bio(ti, clone, ci->io) > 0)
+ bio_list_add(bl, clone);

ci->sector += len;
ci->sector_count -= len;
@@ -394,7 +395,8 @@

clone = split_bvec(bio, ci->sector, ci->idx,
bv->bv_offset, max);
- __map_bio(ti, clone, ci->io);
+ if (__map_bio(ti, clone, ci->io) > 0)
+ bio_list_add(bl, clone);

ci->sector += max;
ci->sector_count -= max;
@@ -403,7 +405,8 @@
len = to_sector(bv->bv_len) - max;
clone = split_bvec(bio, ci->sector, ci->idx,
bv->bv_offset + to_bytes(max), len);
- __map_bio(ti, clone, ci->io);
+ if (__map_bio(ti, clone, ci->io) > 0)
+ bio_list_add(bl, clone);

ci->sector += len;
ci->sector_count -= len;
@@ -414,7 +417,8 @@
/*
* Split the bio into several clones.
*/
-static void __split_bio(struct mapped_device *md, struct bio *bio)
+static void __split_bio(struct mapped_device *md, struct bio *bio,
+ struct bio_list *bl)
{
struct clone_info ci;

@@ -431,7 +435,7 @@

atomic_inc(&md->pending);
while (ci.sector_count)
- __clone_and_map(&ci);
+ __clone_and_map(&ci, bl);

/* drop the extra reference count */
dec_pending(ci.io, 0);
@@ -442,6 +446,21 @@


/*
+ * Requeue the deferred bios by calling generic_make_request.
+ */
+static void flush_deferred_io(struct bio *c)
+{
+ struct bio *n;
+
+ while (c) {
+ n = c->bi_next;
+ c->bi_next = NULL;
+ generic_make_request(c);
+ c = n;
+ }
+}
+
+/*
* The request function that just remaps the bio built up by
* dm_merge_bvec.
*/
@@ -449,15 +468,18 @@
{
int r;
struct mapped_device *md = q->queuedata;
+ struct bio_list list;
+
+ list.head = list.tail = NULL;

- down_read(&md->lock);
+ read_lock(&md->lock);

/*
* If we're suspended we have to queue
* this io for later.
*/
while (test_bit(DMF_BLOCK_IO, &md->flags)) {
- up_read(&md->lock);
+ read_unlock(&md->lock);

if (bio_rw(bio) == READA) {
bio_io_error(bio, bio->bi_size);
@@ -476,7 +498,7 @@
* We're in a while loop, because someone could suspend
* before we get to the following read lock.
*/
- down_read(&md->lock);
+ read_lock(&md->lock);
}

if (!md->map) {
@@ -484,8 +506,15 @@
return 0;
}

- __split_bio(md, bio);
- up_read(&md->lock);
+ __split_bio(md, bio, &list);
+ read_unlock(&md->lock);
+
+ /*
+ * Submit the bio's outside of md->lock.
+ */
+ bio = bio_list_get(&list);
+ flush_deferred_io(bio);
+
return 0;
}

@@ -494,9 +523,9 @@
int r;
struct mapped_device *md = congested_data;

- down_read(&md->lock);
+ read_lock(&md->lock);
r = dm_table_any_congested(md->map, bdi_bits);
- up_read(&md->lock);
+ read_unlock(&md->lock);

return r;
}
@@ -571,7 +600,7 @@
goto bad1;

memset(md, 0, sizeof(*md));
- init_rwsem(&md->lock);
+ rwlock_init(&md->lock);
atomic_set(&md->holders, 1);

md->queue = blk_alloc_queue(GFP_KERNEL);
@@ -634,10 +663,10 @@
{
struct mapped_device *md = (struct mapped_device *) context;

- down_write(&md->lock);
+ write_lock(&md->lock);
md->event_nr++;
wake_up_interruptible(&md->eventq);
- up_write(&md->lock);
+ write_unlock(&md->lock);
}

static void __set_size(struct gendisk *disk, sector_t size)
@@ -724,32 +753,17 @@
}

/*
- * Requeue the deferred bios by calling generic_make_request.
- */
-static void flush_deferred_io(struct bio *c)
-{
- struct bio *n;
-
- while (c) {
- n = c->bi_next;
- c->bi_next = NULL;
- generic_make_request(c);
- c = n;
- }
-}
-
-/*
* Swap in a new table (destroying old one).
*/
int dm_swap_table(struct mapped_device *md, struct dm_table *table)
{
int r;

- down_write(&md->lock);
+ write_lock(&md->lock);

/* device must be suspended */
if (!test_bit(DMF_SUSPENDED, &md->flags)) {
- up_write(&md->lock);
+ write_unlock(&md->lock);
return -EPERM;
}

@@ -758,7 +772,7 @@
if (r)
return r;

- up_write(&md->lock);
+ write_unlock(&md->lock);
return 0;
}

@@ -773,20 +787,20 @@
{
DECLARE_WAITQUEUE(wait, current);

- down_write(&md->lock);
+ write_lock(&md->lock);

/*
* First we set the BLOCK_IO flag so no more ios will be
* mapped.
*/
if (test_bit(DMF_BLOCK_IO, &md->flags)) {
- up_write(&md->lock);
+ write_unlock(&md->lock);
return -EINVAL;
}

set_bit(DMF_BLOCK_IO, &md->flags);
add_wait_queue(&md->wait, &wait);
- up_write(&md->lock);
+ write_unlock(&md->lock);

/*
* Then we wait for the already mapped ios to
@@ -803,12 +817,12 @@
}
set_current_state(TASK_RUNNING);

- down_write(&md->lock);
+ write_lock(&md->lock);
remove_wait_queue(&md->wait, &wait);
set_bit(DMF_SUSPENDED, &md->flags);
if (md->map)
dm_table_suspend_targets(md->map);
- up_write(&md->lock);
+ write_unlock(&md->lock);

return 0;
}
@@ -817,11 +831,11 @@
{
struct bio *def;

- down_write(&md->lock);
+ write_lock(&md->lock);
if (!md->map ||
!test_bit(DMF_SUSPENDED, &md->flags) ||
!dm_table_get_size(md->map)) {
- up_write(&md->lock);
+ write_unlock(&md->lock);
return -EINVAL;
}

@@ -829,7 +843,7 @@
clear_bit(DMF_SUSPENDED, &md->flags);
clear_bit(DMF_BLOCK_IO, &md->flags);
def = bio_list_get(&md->deferred);
- up_write(&md->lock);
+ write_unlock(&md->lock);

flush_deferred_io(def);
blk_run_queues();
@@ -844,9 +858,9 @@
{
uint32_t r;

- down_read(&md->lock);
+ read_lock(&md->lock);
r = md->event_nr;
- up_read(&md->lock);
+ read_unlock(&md->lock);

return r;
}
@@ -854,23 +868,23 @@
int dm_add_wait_queue(struct mapped_device *md, wait_queue_t *wq,
uint32_t event_nr)
{
- down_write(&md->lock);
+ write_lock(&md->lock);
if (event_nr != md->event_nr) {
- up_write(&md->lock);
+ write_unlock(&md->lock);
return 1;
}

add_wait_queue(&md->eventq, wq);
- up_write(&md->lock);
+ write_unlock(&md->lock);

return 0;
}

void dm_remove_wait_queue(struct mapped_device *md, wait_queue_t *wq)
{
- down_write(&md->lock);
+ write_lock(&md->lock);
remove_wait_queue(&md->eventq, wq);
- up_write(&md->lock);
+ write_unlock(&md->lock);
}

/*
@@ -886,11 +900,11 @@
{
struct dm_table *t;

- down_read(&md->lock);
+ read_lock(&md->lock);
t = md->map;
if (t)
dm_table_get(t);
- up_read(&md->lock);
+ read_unlock(&md->lock);

return t;
}

2004-03-03 22:38:01

by Miquel van Smoorenburg

Subject: Re: per-cpu blk_plug_list

In article <[email protected]>,
Miquel van Smoorenburg <[email protected]> wrote:
>On 2004.03.03 10:54, Andrew Morton wrote:
>> Sure. This is all nice and simple stuff, except for the locking, which
>> needs serious work.
>
>Is this a step in the right direction ?
>
>dm-rwlock.patch

Forget this one, I sent a better one to [email protected].

Mike.