2007-01-03 07:45:43

by Jens Axboe

Subject: [BLOCK] 0/4 explicit io plugging

This series of 4 patches switches the block layer to use explicit
plugging instead of the implicit plugging that currently takes place when
io is queued against an empty queue.

The first three patches update RCU to include a QRCU method similar to
SRCU. QRCU is a bit heavier on the reader side, but a _lot_ cheaper for
the synchronization part. The new plugging scheme needs to synchronize
queue plugs for barriers and queue quiescing, so it needs to be cheap.

The fourth patch is the actual meat of the series. It also has a longer
explanation of the benefits of the explicit plugging.

I'm sending this out to get some review of the code, and to ask people
to do some testing. I'm looking for both "hey, it works for me" reports
and benchmark runs. On the performance side, I'm interested both in
high end (lots of CPUs) testing, to see whether this actually does
reduce lock contention and block layer cpu utilization, and in more
basic io performance results on "normal" boxes to make sure we are
not regressing anywhere.

This code is also available in the 'plug' branch of the block layer git
repo:

git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git/

Documentation/RCU/checklist.txt | 13 +
Documentation/RCU/rcu.txt | 6
Documentation/RCU/torture.txt | 15 -
Documentation/RCU/whatisRCU.txt | 3
Documentation/block/biodoc.txt | 5
block/as-iosched.c | 15 -
block/cfq-iosched.c | 8
block/deadline-iosched.c | 9
block/elevator.c | 44 ---
block/ll_rw_blk.c | 483 ++++++++++++++++++++--------------------
block/noop-iosched.c | 8
drivers/block/cciss.c | 6
drivers/block/cpqarray.c | 3
drivers/block/floppy.c | 1
drivers/block/loop.c | 12
drivers/block/pktcdvd.c | 5
drivers/block/rd.c | 2
drivers/block/umem.c | 16 -
drivers/ide/ide-cd.c | 9
drivers/ide/ide-io.c | 25 --
drivers/md/bitmap.c | 1
drivers/md/dm-emc.c | 2
drivers/md/dm-table.c | 14 -
drivers/md/dm.c | 18 -
drivers/md/dm.h | 1
drivers/md/linear.c | 14 -
drivers/md/md.c | 3
drivers/md/multipath.c | 32 --
drivers/md/raid0.c | 17 -
drivers/md/raid1.c | 70 -----
drivers/md/raid10.c | 73 ------
drivers/md/raid5.c | 60 ----
drivers/message/i2o/i2o_block.c | 6
drivers/mmc/mmc_queue.c | 3
drivers/s390/block/dasd.c | 3
drivers/s390/char/tape_block.c | 1
drivers/scsi/ide-scsi.c | 2
drivers/scsi/scsi_lib.c | 47 +--
fs/adfs/inode.c | 1
fs/affs/file.c | 2
fs/befs/linuxvfs.c | 1
fs/bfs/file.c | 1
fs/block_dev.c | 2
fs/buffer.c | 25 --
fs/cifs/file.c | 2
fs/direct-io.c | 7
fs/ecryptfs/mmap.c | 23 -
fs/efs/inode.c | 1
fs/ext2/inode.c | 2
fs/ext3/inode.c | 3
fs/ext4/inode.c | 3
fs/fat/inode.c | 1
fs/freevxfs/vxfs_subr.c | 1
fs/fuse/inode.c | 1
fs/gfs2/ops_address.c | 1
fs/hfs/inode.c | 2
fs/hfsplus/inode.c | 2
fs/hpfs/file.c | 1
fs/isofs/inode.c | 1
fs/jfs/inode.c | 1
fs/jfs/jfs_metapage.c | 1
fs/minix/inode.c | 1
fs/ntfs/aops.c | 4
fs/ntfs/compress.c | 2
fs/ocfs2/aops.c | 1
fs/ocfs2/cluster/heartbeat.c | 4
fs/qnx4/inode.c | 1
fs/reiserfs/inode.c | 1
fs/sysv/itree.c | 1
fs/udf/file.c | 1
fs/udf/inode.c | 1
fs/ufs/inode.c | 1
fs/ufs/truncate.c | 2
fs/xfs/linux-2.6/xfs_aops.c | 1
fs/xfs/linux-2.6/xfs_buf.c | 15 -
include/linux/backing-dev.h | 3
include/linux/blkdev.h | 75 +++---
include/linux/buffer_head.h | 1
include/linux/elevator.h | 8
include/linux/fs.h | 1
include/linux/pagemap.h | 12
include/linux/raid/md.h | 1
include/linux/sched.h | 1
include/linux/srcu.h | 30 ++
include/linux/swap.h | 2
kernel/rcutorture.c | 71 +++++
kernel/sched.c | 1
kernel/srcu.c | 105 ++++++++
mm/filemap.c | 62 -----
mm/nommu.c | 4
mm/page-writeback.c | 8
mm/readahead.c | 11
mm/shmem.c | 1
mm/swap_state.c | 5
mm/swapfile.c | 37 ---
mm/vmscan.c | 6
96 files changed, 632 insertions(+), 989 deletions(-)

--
Jens Axboe



2007-01-03 07:45:43

by Jens Axboe

Subject: [PATCH] 2/4 qrcu: add rcutorture test

From: Oleg Nesterov <[email protected]>

Add rcutorture test for qrcu.

Works for me!

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Josh Triplett <[email protected]>
Acked-by: Paul E. McKenney <[email protected]>
Acked-by: Jens Axboe <[email protected]>
---
include/linux/srcu.h | 4 +-
kernel/rcutorture.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index fcdb749..03a9010 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -64,8 +64,8 @@ struct qrcu_struct {
};

int init_qrcu_struct(struct qrcu_struct *qp);
-int qrcu_read_lock(struct qrcu_struct *qp);
-void qrcu_read_unlock(struct qrcu_struct *qp, int idx);
+int qrcu_read_lock(struct qrcu_struct *qp) __acquires(qp);
+void qrcu_read_unlock(struct qrcu_struct *qp, int idx) __releases(qp);
void synchronize_qrcu(struct qrcu_struct *qp);

/**
diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index 482b11f..bd7fd49 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -465,6 +465,73 @@ static struct rcu_torture_ops srcu_ops = {
};

/*
+ * Definitions for qrcu torture testing.
+ */
+
+static struct qrcu_struct qrcu_ctl;
+
+static void qrcu_torture_init(void)
+{
+ init_qrcu_struct(&qrcu_ctl);
+ rcu_sync_torture_init();
+}
+
+static void qrcu_torture_cleanup(void)
+{
+ synchronize_qrcu(&qrcu_ctl);
+ cleanup_qrcu_struct(&qrcu_ctl);
+}
+
+static int qrcu_torture_read_lock(void) __acquires(&qrcu_ctl)
+{
+ return qrcu_read_lock(&qrcu_ctl);
+}
+
+static void qrcu_torture_read_unlock(int idx) __releases(&qrcu_ctl)
+{
+ qrcu_read_unlock(&qrcu_ctl, idx);
+}
+
+static int qrcu_torture_completed(void)
+{
+ return qrcu_ctl.completed;
+}
+
+static void qrcu_torture_synchronize(void)
+{
+ synchronize_qrcu(&qrcu_ctl);
+}
+
+static int qrcu_torture_stats(char *page)
+{
+ int cnt = 0;
+ int idx = qrcu_ctl.completed & 0x1;
+
+ cnt += sprintf(&page[cnt], "%s%s per-CPU(idx=%d):",
+ torture_type, TORTURE_FLAG, idx);
+
+ cnt += sprintf(&page[cnt], " (%d,%d)",
+ atomic_read(qrcu_ctl.ctr + 0),
+ atomic_read(qrcu_ctl.ctr + 1));
+
+ cnt += sprintf(&page[cnt], "\n");
+ return cnt;
+}
+
+static struct rcu_torture_ops qrcu_ops = {
+ .init = qrcu_torture_init,
+ .cleanup = qrcu_torture_cleanup,
+ .readlock = qrcu_torture_read_lock,
+ .readdelay = srcu_read_delay,
+ .readunlock = qrcu_torture_read_unlock,
+ .completed = qrcu_torture_completed,
+ .deferredfree = rcu_sync_torture_deferred_free,
+ .sync = qrcu_torture_synchronize,
+ .stats = qrcu_torture_stats,
+ .name = "qrcu"
+};
+
+/*
* Definitions for sched torture testing.
*/

@@ -503,8 +570,8 @@ static struct rcu_torture_ops sched_ops = {
};

static struct rcu_torture_ops *torture_ops[] =
- { &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops, &srcu_ops,
- &sched_ops, NULL };
+ { &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops,
+ &srcu_ops, &qrcu_ops, &sched_ops, NULL };

/*
* RCU torture writer kthread. Repeatedly substitutes a new structure
--
1.4.4.2.g02c9

2007-01-03 07:46:08

by Jens Axboe

Subject: [PATCH] 1/4 qrcu: "quick" srcu implementation

From: Oleg Nesterov <[email protected]>

Very much based on ideas, corrections, and patient explanations from
Alan and Paul.

The current srcu implementation is very good for readers; lock/unlock
are extremely cheap. But for that reason it is not possible to avoid
synchronize_sched() and polling in synchronize_srcu().

Jens Axboe wrote:
>
> It works for me, but the overhead is still large. Before it would take
> 8-12 jiffies for a synchronize_srcu() to complete without there actually
> being any reader locks active, now it takes 2-3 jiffies. So it's
> definitely faster, and as suspected the loss of two of three
> synchronize_sched() cut down the overhead to a third.

'qrcu' behaves the same as srcu, but is optimized for writers. The fast path
for synchronize_qrcu() is mutex_lock() + atomic_read() + mutex_unlock().
The slow path is __wait_event(), no polling. However, the reader does
atomic inc/dec on lock/unlock, and the counters are not per-cpu.

Also, unlike srcu, qrcu read lock/unlock can be used in interrupt context,
and 'qrcu_struct' can be compile-time initialized.
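
As a quick usage illustration (this is not part of the patch; the data
structure and helpers other than the qrcu_* calls are made up for the
example), a reader/writer pair protecting a shared pointer with the new
API could look roughly like this:

struct my_data {
	int value;
};

static struct my_data *cur_data;	/* made-up shared pointer */
static struct qrcu_struct my_qrcu;	/* one QRCU domain; set up once
					   with init_qrcu_struct() */

/* Reader side: an atomic inc/dec plus the index bookkeeping */
int read_value(void)
{
	int idx, ret;

	idx = qrcu_read_lock(&my_qrcu);
	ret = rcu_dereference(cur_data)->value;
	qrcu_read_unlock(&my_qrcu, idx);
	return ret;
}

/*
 * Writer side: publish the new version, then wait for old readers to
 * drain. synchronize_qrcu() can block, so process context only.
 */
void update_value(struct my_data *new)
{
	struct my_data *old = cur_data;

	rcu_assign_pointer(cur_data, new);
	synchronize_qrcu(&my_qrcu);
	kfree(old);
}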

See also (a long) discussion:
http://marc.theaimsgroup.com/?t=116370857600003

Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Jens Axboe <[email protected]>
---
include/linux/srcu.h | 30 ++++++++++++++
kernel/srcu.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 135 insertions(+), 0 deletions(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index aca0eee..fcdb749 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -27,6 +27,8 @@
#ifndef _LINUX_SRCU_H
#define _LINUX_SRCU_H

+#include <linux/wait.h>
+
struct srcu_struct_array {
int c[2];
};
@@ -50,4 +52,32 @@ void srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
void synchronize_srcu(struct srcu_struct *sp);
long srcu_batches_completed(struct srcu_struct *sp);

+/*
+ * fully compatible with srcu, but optimized for writers.
+ */
+
+struct qrcu_struct {
+ int completed;
+ atomic_t ctr[2];
+ wait_queue_head_t wq;
+ struct mutex mutex;
+};
+
+int init_qrcu_struct(struct qrcu_struct *qp);
+int qrcu_read_lock(struct qrcu_struct *qp);
+void qrcu_read_unlock(struct qrcu_struct *qp, int idx);
+void synchronize_qrcu(struct qrcu_struct *qp);
+
+/**
+ * cleanup_qrcu_struct - deconstruct a quick-RCU structure
+ * @qp: structure to clean up.
+ *
+ * Must invoke this after you are finished using a given qrcu_struct that
+ * was initialized via init_qrcu_struct(). We reserve the right to
+ * leak memory should you fail to do this!
+ */
+static inline void cleanup_qrcu_struct(struct qrcu_struct *qp)
+{
+}
+
#endif
diff --git a/kernel/srcu.c b/kernel/srcu.c
index 3507cab..53c6989 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -256,3 +256,108 @@ EXPORT_SYMBOL_GPL(srcu_read_unlock);
EXPORT_SYMBOL_GPL(synchronize_srcu);
EXPORT_SYMBOL_GPL(srcu_batches_completed);
EXPORT_SYMBOL_GPL(srcu_readers_active);
+
+/**
+ * init_qrcu_struct - initialize a quick-RCU structure.
+ * @qp: structure to initialize.
+ *
+ * Must invoke this on a given qrcu_struct before passing that qrcu_struct
+ * to any other function. Each qrcu_struct represents a separate domain
+ * of QRCU protection.
+ */
+int init_qrcu_struct(struct qrcu_struct *qp)
+{
+ qp->completed = 0;
+ atomic_set(qp->ctr + 0, 1);
+ atomic_set(qp->ctr + 1, 0);
+ init_waitqueue_head(&qp->wq);
+ mutex_init(&qp->mutex);
+
+ return 0;
+}
+
+/**
+ * qrcu_read_lock - register a new reader for an QRCU-protected structure.
+ * @qp: qrcu_struct in which to register the new reader.
+ *
+ * Counts the new reader in the appropriate element of the qrcu_struct.
+ * Returns an index that must be passed to the matching qrcu_read_unlock().
+ */
+int qrcu_read_lock(struct qrcu_struct *qp)
+{
+ for (;;) {
+ int idx = qp->completed & 0x1;
+ if (likely(atomic_inc_not_zero(qp->ctr + idx)))
+ return idx;
+ }
+}
+
+/**
+ * qrcu_read_unlock - unregister a old reader from an QRCU-protected structure.
+ * @qp: qrcu_struct in which to unregister the old reader.
+ * @idx: return value from corresponding qrcu_read_lock().
+ *
+ * Removes the count for the old reader from the appropriate element of
+ * the qrcu_struct.
+ */
+void qrcu_read_unlock(struct qrcu_struct *qp, int idx)
+{
+ if (atomic_dec_and_test(qp->ctr + idx))
+ wake_up(&qp->wq);
+}
+
+/**
+ * synchronize_qrcu - wait for prior QRCU read-side critical-section completion
+ * @qp: qrcu_struct with which to synchronize.
+ *
+ * Flip the completed counter, and wait for the old count to drain to zero.
+ * As with classic RCU, the updater must use some separate means of
+ * synchronizing concurrent updates. Can block; must be called from
+ * process context.
+ *
+ * Note that it is illegal to call synchronize_qrcu() from the corresponding
+ * QRCU read-side critical section; doing so will result in deadlock.
+ * However, it is perfectly legal to call synchronize_qrcu() on one
+ * qrcu_struct from some other qrcu_struct's read-side critical section.
+ */
+void synchronize_qrcu(struct qrcu_struct *qp)
+{
+ int idx;
+
+ /*
+ * The following memory barrier is needed to ensure that
+ * any prior data-structure manipulation is seen by other
+ * CPUs to happen before picking up the value of
+ * qp->completed.
+ */
+ smp_mb();
+ mutex_lock(&qp->mutex);
+
+ idx = qp->completed & 0x1;
+ if (atomic_read(qp->ctr + idx) == 1)
+ goto out;
+
+ atomic_inc(qp->ctr + (idx ^ 0x1));
+ /* Reduce the likelihood that qrcu_read_lock() will loop */
+ smp_mb__after_atomic_inc();
+ qp->completed++;
+
+ atomic_dec(qp->ctr + idx);
+ __wait_event(qp->wq, !atomic_read(qp->ctr + idx));
+out:
+ mutex_unlock(&qp->mutex);
+ smp_mb();
+ /*
+ * The above smp_mb() is needed in the case that we
+ * see the counter reaching zero, so that we do not
+ * need to block. In this case, we need to make
+ * sure that the CPU does not re-order any subsequent
+ * changes made by the caller to occur prior to the
+ * test, as seen by other CPUs.
+ */
+}
+
+EXPORT_SYMBOL_GPL(init_qrcu_struct);
+EXPORT_SYMBOL_GPL(qrcu_read_lock);
+EXPORT_SYMBOL_GPL(qrcu_read_unlock);
+EXPORT_SYMBOL_GPL(synchronize_qrcu);
--
1.4.4.2.g02c9

2007-01-03 08:28:36

by Jens Axboe

Subject: [PATCH] 3/4 qrcu: add documentation


Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Jens Axboe <[email protected]>
---
Documentation/RCU/checklist.txt | 13 +++++++++++++
Documentation/RCU/rcu.txt | 6 ++++--
Documentation/RCU/torture.txt | 15 +++++++++------
Documentation/RCU/whatisRCU.txt | 3 +++
4 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index f4dffad..36d6185 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -259,3 +259,16 @@ over a rather long period of time, but improvements are always welcome!

Note that, rcu_assign_pointer() and rcu_dereference() relate to
SRCU just as they do to other forms of RCU.
+
+14. QRCU is very similar to SRCU, but features very fast grace-period
+ processing at the expense of heavier-weight read-side operations.
+ The correspondance between QRCU and SRCU is as follows:
+
+ QRCU SRCU
+
+ struct qrcu_struct struct srcu_struct
+ init_qrcu_struct() init_srcu_struct()
+ cleanup_qrcu_struct() cleanup_srcu_struct()
+ qrcu_read_lock() srcu_read_lock()
+ qrcu_read-unlock() srcu_read_unlock()
+ synchronize_qrcu() synchronize_srcu()
diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt
index f84407c..ae1e54e 100644
--- a/Documentation/RCU/rcu.txt
+++ b/Documentation/RCU/rcu.txt
@@ -45,8 +45,10 @@ o How can I see where RCU is currently used in the Linux kernel?

Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu",
"rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh",
- "srcu_read_lock", "srcu_read_unlock", "synchronize_rcu",
- "synchronize_net", and "synchronize_srcu".
+ "qrcu_read_lock", qrcu_read_unlock", "srcu_read_lock",
+ "srcu_read_unlock", "synchronize_rcu", "synchronize_qrcu",
+ "synchronize_net", "synchronize_srcu", rcu_assign_pointer(),
+ and rcu_dereference().

o What guidelines should I follow when writing code that uses RCU?

diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt
index 25a3c3f..2cb0a3b 100644
--- a/Documentation/RCU/torture.txt
+++ b/Documentation/RCU/torture.txt
@@ -35,7 +35,8 @@ nfakewriters This is the number of RCU fake writer threads to run. Fake
different numbers of writers running in parallel.
nfakewriters defaults to 4, which provides enough parallelism
to trigger special cases caused by multiple writers, such as
- the synchronize_srcu() early return optimization.
+ the synchronize_srcu() and synchronize_qrcu() early return
+ optimizations.

stat_interval The number of seconds between output of torture
statistics (via printk()). Regardless of the interval,
@@ -54,11 +55,13 @@ test_no_idle_hz Whether or not to test the ability of RCU to operate in
idle CPUs. Boolean parameter, "1" to test, "0" otherwise.

torture_type The type of RCU to test: "rcu" for the rcu_read_lock() API,
- "rcu_sync" for rcu_read_lock() with synchronous reclamation,
- "rcu_bh" for the rcu_read_lock_bh() API, "rcu_bh_sync" for
- rcu_read_lock_bh() with synchronous reclamation, "srcu" for
- the "srcu_read_lock()" API, and "sched" for the use of
- preempt_disable() together with synchronize_sched().
+ "rcu_sync" for rcu_read_lock() with synchronous
+ reclamation, "rcu_bh" for the rcu_read_lock_bh() API,
+ "rcu_bh_sync" for rcu_read_lock_bh() with synchronous
+ reclamation, "srcu" for the "srcu_read_lock()" API,
+ "qrcu" for the "qrcu_read_lock()" "quick grace period"
+ form of SRCU, and "sched" for the use of preempt_disable()
+ together with synchronize_sched().

verbose Enable debug printk()s. Default is disabled.

diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index e0d6d99..e91650b 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -780,6 +780,8 @@ Markers for RCU read-side critical sections:
rcu_read_unlock_bh
srcu_read_lock
srcu_read_unlock
+ qrcu_read_lock
+ qrcu_read_unlock

RCU pointer/list traversal:

@@ -807,6 +809,7 @@ RCU grace period:
synchronize_sched
synchronize_rcu
synchronize_srcu
+ synchronize_qrcu
call_rcu
call_rcu_bh

--
1.4.4.2.g02c9

--
Jens Axboe

2007-01-03 09:10:46

by Tomas Carnecky

Subject: Re: [PATCH] 3/4 qrcu: add documentation

Jens Axboe wrote:
> diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
> index f4dffad..36d6185 100644
> --- a/Documentation/RCU/checklist.txt
> +++ b/Documentation/RCU/checklist.txt
> @@ -259,3 +259,16 @@ over a rather long period of time, but improvements are always welcome!
>
> Note that, rcu_assign_pointer() and rcu_dereference() relate to
> SRCU just as they do to other forms of RCU.
> +
> +14. QRCU is very similar to SRCU, but features very fast grace-period
> + processing at the expense of heavier-weight read-side operations.
> + The correspondance between QRCU and SRCU is as follows:
> +
> + QRCU SRCU
> +
> + struct qrcu_struct struct srcu_struct
> + init_qrcu_struct() init_srcu_struct()
> + cleanup_qrcu_struct() cleanup_srcu_struct()
> + qrcu_read_lock() srcu_read_lock()
> + qrcu_read-unlock() srcu_read_unlock()

A small typo: qrcu_read_unlock()

tom

2007-01-03 09:37:06

by Jens Axboe

Subject: Re: [PATCH] 3/4 qrcu: add documentation

On Wed, Jan 03 2007, Tomas Carnecky wrote:
> Jens Axboe wrote:
> > diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
> > index f4dffad..36d6185 100644
> > --- a/Documentation/RCU/checklist.txt
> > +++ b/Documentation/RCU/checklist.txt
> > @@ -259,3 +259,16 @@ over a rather long period of time, but improvements are always welcome!
> >
> > Note that, rcu_assign_pointer() and rcu_dereference() relate to
> > SRCU just as they do to other forms of RCU.
> > +
> > +14. QRCU is very similar to SRCU, but features very fast grace-period
> > + processing at the expense of heavier-weight read-side operations.
> > + The correspondance between QRCU and SRCU is as follows:
> > +
> > + QRCU SRCU
> > +
> > + struct qrcu_struct struct srcu_struct
> > + init_qrcu_struct() init_srcu_struct()
> > + cleanup_qrcu_struct() cleanup_srcu_struct()
> > + qrcu_read_lock() srcu_read_lock()
> > + qrcu_read-unlock() srcu_read_unlock()
>
> A small typo: qrcu_read_unlock()

Indeed, thanks, I'll update the repo.

--
Jens Axboe

2007-01-03 09:39:15

by Jens Axboe

Subject: [PATCH] 4/4 block: explicit plugging


Not much luck with the 4th patch; I guess it's too big. I've gzip
attached it now, with the description inlined.

---

Nick writes:

This is a patch to perform block device plugging explicitly in the submitting
process context rather than implicitly by the block device.

There are several advantages to plugging in process context over plugging
by the block device:

- Implicit plugging is only active when the queue empties, so any
advantages are lost if there is parallel IO occurring. Not so with
explicit plugging.

- Implicit plugging relies on a timer, watermarks, and a kind-of-explicit
directive in lock_page which directs plugging. These are heuristics and
can cost performance by holding a block device idle longer than it
should be. Explicit plugging avoids most of these issues by only holding
the device idle when it is known that more requests will be submitted.

- This lock_page directive uses a roundabout way to attempt to minimise
intrusiveness of plugging on the VM. In doing so, it gets needlessly
complex: the VM really is in a good position to direct the block layer
as to the nature of its requests, so there is no need to try to hide
the fact.

- Explicit plugging keeps a process-private queue of requests being held.
This offers some advantages over immediately sending requests to the
block device: firstly, merging can be attempted on requests in this list
(currently only attempted on the head of the list) without taking any
locks; secondly, when unplugging occurs, the requests can be delivered
to the block device queue in a batch, so the lock acquisitions can be
batched up.

On a parallel tiobench benchmark, of the 800 000 calls to __make_request
performed, this patch avoids 490 000 (62%) of queue_lock acquisitions by
early merging on the private plugged list.
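
To make that concrete, here is a small, self-contained sketch of the idea
(illustrative only -- none of the names or structures below are taken from
the actual patch, which is attached gzipped): requests first go onto a
private plug list where the head can be merged without any locking, and
unplugging hands the whole batch to the device queue under a single lock
acquisition.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct request {
	long sector;
	long nr_sectors;
	struct request *next;
};

/* Shared device queue, protected by queue_lock */
static struct request *device_queue;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Process-private plug list: touched without any locking */
struct plug {
	struct request *head;
};

/* Merge with the head of the private list if contiguous, else queue it */
static void plug_request(struct plug *plug, long sector, long nr)
{
	struct request *rq = plug->head;

	if (rq && rq->sector + rq->nr_sectors == sector) {
		rq->nr_sectors += nr;	/* back merge, no queue_lock taken */
		return;
	}
	rq = malloc(sizeof(*rq));
	rq->sector = sector;
	rq->nr_sectors = nr;
	rq->next = plug->head;
	plug->head = rq;
}

/* Unplug: deliver the whole batch under one queue_lock acquisition */
static void unplug(struct plug *plug)
{
	struct request *rq, *next;

	pthread_mutex_lock(&queue_lock);
	for (rq = plug->head; rq; rq = next) {
		next = rq->next;
		rq->next = device_queue;
		device_queue = rq;
	}
	pthread_mutex_unlock(&queue_lock);
	plug->head = NULL;
}

int main(void)
{
	struct plug plug = { .head = NULL };
	struct request *rq;

	plug_request(&plug, 0, 8);
	plug_request(&plug, 8, 8);	/* merges with the previous request */
	plug_request(&plug, 100, 8);
	unplug(&plug);			/* one lock round-trip for the batch */

	for (rq = device_queue; rq; rq = rq->next)
		printf("sector %ld, %ld sectors\n", rq->sector, rq->nr_sectors);
	return 0;
}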

Signed-off-by: Nick Piggin <[email protected]>

Changes so far by me:

- Don't invoke ->request_fn() in blk_queue_invalidate_tags

- Fixup all filesystems for block_sync_page()

- Add blk_delay_queue() to handle the old plugging-on-shortage usage.

- Unconditionally run replug_current_nested() in io_schedule()

- Fixup queue start/stop

- Fixup all the remaining drivers

- Change the namespace (prefix the plug functions with blk_)

- Fixup ext4

- Dead code removal

- Fixup blktrace plug/unplug notifications

- __make_request() cleanups

- bio_sync() fixups

- Kill queue empty checking

- Make barriers work again, using QRCU

- Make blk_sync_queue() work again, reuse barrier SRCU handling

This patch needs more work and some dedicated testing.

Signed-off-by: Jens Axboe <[email protected]>

--
Jens Axboe


Attachments:
0004-block-explicit-plugging.txt.gz (29.89 kB)

2007-01-04 04:36:25

by Nick Piggin

Subject: Re: [PATCH] 4/4 block: explicit plugging

Jens Axboe wrote:
> Nick writes:
>
> This is a patch to perform block device plugging explicitly in the submitting
> process context rather than implicitly by the block device.

Hi Jens,

Hey thanks for doing so much hard work with this, I couldn't have fixed
all the block layer stuff myself. QRCU looks like a good solution for the
barrier/sync operations (/me worried that one wouldn't exist), and a
novel use of RCU!

The only thing I had been thinking about before it is ready for primetime
-- as far as the VM side of things goes -- is whether we should change
the hard calls to address_space operations, such that they might be
avoided or customised when there is no backing block device?

I'm sure the answer to this is "yes", so I have an idea for a simple
implementation... but I'd like to hear thoughts from network fs / raid
people?

Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-05 07:22:47

by Jens Axboe

Subject: Re: [PATCH] 4/4 block: explicit plugging

On Thu, Jan 04 2007, Nick Piggin wrote:
> Jens Axboe wrote:
> >Nick writes:
> >
> >This is a patch to perform block device plugging explicitly in the
> >submitting
> >process context rather than implicitly by the block device.
>
> Hi Jens,
>
> Hey thanks for doing so much hard work with this, I couldn't have fixed
> all the block layer stuff myself. QRCU looks like a good solution for the
> barrier/sync operations (/me worried that one wouldn't exist), and a
> novel use of RCU!
>
> The only thing I had been thinking about before it is ready for primetime
> -- as far as the VM side of things goes -- is whether we should change
> the hard calls to address_space operations, such that they might be
> avoided or customised when there is no backing block device?
>
> I'm sure the answer to this is "yes", so I have an idea for a simple
> implementation... but I'd like to hear thoughts from network fs / raid
> people?

I suppose that would be the proper thing to do, for non __make_request()
operated backing devices. I'll add the hooks, then we can cook up a raid
implementation if need be.

--
Jens Axboe