2002-02-12 23:16:28

by Andrew Morton

Subject: [patch] get_request starvation fix

Second version of this patch, incorporating Suparna's
suggested simplification (the low-water mark was
unnecessary).

This patch is working well here. Hopefully it'll pop up
in an rmap kernel soon.

Bill Irwin has been doing some fairly extensive tuning
and testing of this. Hopefully he'll come out with some
numbers soon.

I include the original description...


==== make_request description ====

Here's a patch which addresses the get_request starvation problem.

The problem we're experiencing is that when the request queue is
exhausted, processes go to sleep until the request queue has refilled
to the batch_nr_requests level. But already-running tasks can simply
whizz in and steal free requests, thus preventing the sleeping tasks
from getting woken.

Some instrumentation illustrates this clearly. All testing was on
ext2. In a `dbench 64' run on stock 2.4.18-pre9, the maximum sleep
duration in __get_request_wait() was 37 seconds, and the sleep times
are distributed very unfairly. Look at this section of the trace:

Feb 7 21:59:06 mnm kernel: dbench[883] wakes after 442ms
Feb 7 21:59:06 mnm kernel: dbench[877] wakes after 442ms
Feb 7 21:59:06 mnm kernel: dbench[868] wakes after 441ms
Feb 7 21:59:06 mnm kernel: kupdated[7] wakes after 11587ms
Feb 7 21:59:06 mnm kernel: dbench[914] wakes after 574ms
Feb 7 21:59:06 mnm kernel: dbench[873] wakes after 2618ms
Feb 7 21:59:06 mnm kernel: dbench[927] wakes after 755ms
Feb 7 21:59:06 mnm kernel: dbench[873] wakes after 263ms
Feb 7 21:59:06 mnm kernel: dbench[924] wakes after 838ms
Feb 7 21:59:06 mnm kernel: dbench[924] wakes after 134ms
Feb 7 21:59:06 mnm kernel: dbench[871] wakes after 18688ms
Feb 7 21:59:06 mnm kernel: dbench[908] wakes after 3245ms
Feb 7 21:59:06 mnm kernel: dbench[895] wakes after 974ms
Feb 7 21:59:06 mnm kernel: dbench[878] wakes after 27401ms
Feb 7 21:59:06 mnm kernel: dbench[895] wakes after 278ms
Feb 7 21:59:06 mnm kernel: dbench[904] wakes after 278ms

With the patch applied, the maximum sleep duration was eight seconds,
but when this was happening, *all* tasks were sleeping for similar
durations:

Feb 7 23:55:40 mnm kernel: dbench[948] wakes after 3978ms
Feb 7 23:55:40 mnm kernel: dbench[935] wakes after 3439ms
Feb 7 23:55:41 mnm kernel: dbench[955] wakes after 2998ms
Feb 7 23:55:41 mnm kernel: dbench[965] wakes after 2847ms
Feb 7 23:55:41 mnm kernel: dbench[994] wakes after 3032ms
Feb 7 23:55:41 mnm kernel: dbench[969] wakes after 2652ms
Feb 7 23:55:41 mnm kernel: dbench[956] wakes after 2197ms
Feb 7 23:55:41 mnm kernel: dbench[993] wakes after 1589ms
Feb 7 23:55:41 mnm kernel: dbench[990] wakes after 1590ms
Feb 7 23:55:41 mnm kernel: dbench[939] wakes after 1585ms
Feb 7 23:55:41 mnm kernel: kupdated[7] wakes after 1555ms
Feb 7 23:55:41 mnm kernel: dbench[976] wakes after 1787ms
Feb 7 23:55:41 mnm kernel: dbench[997] wakes after 1653ms
Feb 7 23:55:41 mnm kernel: dbench[961] wakes after 1891ms
Feb 7 23:55:42 mnm kernel: dbench[966] wakes after 1967ms
Feb 7 23:55:42 mnm kernel: dbench[936] wakes after 1339ms
Feb 7 23:55:42 mnm kernel: dbench[965] wakes after 1216ms
Feb 7 23:55:42 mnm kernel: kjournald[9] wakes after 1253ms
Feb 7 23:55:42 mnm kernel: dbench[990] wakes after 879ms
Feb 7 23:55:42 mnm kernel: kupdated[7] wakes after 849ms
Feb 7 23:55:42 mnm kernel: dbench[959] wakes after 343ms
Feb 7 23:55:42 mnm kernel: dbench[936] wakes after 343ms
Feb 7 23:55:42 mnm kernel: dbench[964] wakes after 492ms
Feb 7 23:55:42 mnm kernel: dbench[966] wakes after 648ms
Feb 7 23:55:42 mnm kernel: kupdated[7] wakes after 413ms
Feb 7 23:55:42 mnm kernel: dbench[988] wakes after 344ms



The actual algorithm I chose is described in the patch. It is
noteworthy that the wakeup strategy has changed from wake-all to
wake-one, which should save a few cycles. The number of context
switches during the entire dbench 64 run fell from 57,000 to
29,000.
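
To make the policy concrete, the take/release rules spelled out in the
comment block of the patch below boil down to something like the
following userspace sketch. This is plain C with pthreads, not the
kernel code: pool_take()/pool_release(), NR_REQUESTS and BATCH are
invented for illustration, and a pthread condition variable does not
give the FIFO ordering that the kernel's exclusive waitqueue provides.

/*
 * Userspace sketch of the wake-one, batch-threshold allocation policy.
 * Not kernel code; names and constants are illustrative only.
 */
#include <pthread.h>

#define NR_REQUESTS     128
#define BATCH           32      /* analogue of batch_requests */

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pool_wait = PTHREAD_COND_INITIALIZER;
static int free_requests = NR_REQUESTS;

/* Take one request; block only when the pool is completely empty. */
void pool_take(void)
{
        pthread_mutex_lock(&pool_lock);
        while (free_requests == 0)              /* sleep only on exhaustion */
                pthread_cond_wait(&pool_wait, &pool_lock);
        free_requests--;                        /* below the batch level we still
                                                   take one: that's the batching */
        pthread_mutex_unlock(&pool_lock);
}

/* Release one request; wake a single waiter only at the batch level. */
void pool_release(void)
{
        pthread_mutex_lock(&pool_lock);
        if (++free_requests >= BATCH)
                pthread_cond_signal(&pool_wait);  /* wake-one, not wake-all */
        pthread_mutex_unlock(&pool_lock);
}

A READA-style caller in this analogy would simply give up instead of
blocking whenever free_requests < BATCH.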

Interestingly, dbench runtimes are still very inconsistent, and
many clients still exit quite early in the run. Beats me.

Rik, this patch will conflict with read-latency2 in the init code.
Easily fixable. I haven't tested with the larger request queue, but I
suggest that we want to keep the 1024 requests, a high-water mark of
queue_nr_requests/4 and a low-water mark of queue_nr_requests/8. I'll
do more testing tomorrow.

Also, you noted the other day that a LILO run *never* terminated when
there was a dbench running. This is in fact not due to request
starvation. It's due to livelock in invalidate_bdev(), which is called
via ioctl(BLKFLSBUF) by LILO. invalidate_bdev() will never terminate
as long as another task is generating locked buffers against the
device.
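
To make the livelock concrete, here is a standalone analogy in plain C
with pthreads. It is not the actual invalidate_bdev() code; the names
and timings are invented. The "scanner" has to restart its pass and
wait whenever it meets a locked entry, while the "writer" keeps locking
fresh entries faster than the scanner can complete a clean pass, so the
scan never terminates:

/*
 * Livelock analogy: a scanner that restarts on every locked entry can
 * be held off forever by a writer that keeps producing locked entries.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_ENTRIES 64

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int locked[NR_ENTRIES];                  /* 1 == "buffer is locked" */

static void *writer(void *arg)
{
        int i = 0;

        for (;;) {                              /* keeps generating locked buffers */
                pthread_mutex_lock(&lock);
                locked[i] = 1;
                i = (i + 1) % NR_ENTRIES;
                pthread_mutex_unlock(&lock);
                usleep(500);
        }
        return NULL;
}

static void *scanner(void *arg)
{
        unsigned long restarts = 0;
        int i;

restart:
        pthread_mutex_lock(&lock);
        for (i = 0; i < NR_ENTRIES; i++) {
                if (locked[i]) {
                        locked[i] = 0;          /* "wait for the I/O to finish" */
                        pthread_mutex_unlock(&lock);
                        usleep(1000);           /* ...during which more entries get locked */
                        restarts++;
                        if (restarts % 1000 == 0)
                                printf("still restarting (%lu passes)\n", restarts);
                        goto restart;
                }
        }
        pthread_mutex_unlock(&lock);
        printf("clean pass after %lu restarts\n", restarts);
        return NULL;
}

int main(void)
{
        pthread_t w, s;

        pthread_create(&w, NULL, writer, NULL);
        usleep(10000);                          /* give the writer a head start */
        pthread_create(&s, NULL, scanner, NULL);
        pthread_join(s, NULL);                  /* never returns while the writer runs */
        return 0;
}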


=====================================

--- linux-2.4.18-pre9/include/linux/blkdev.h Mon Nov 26 11:52:07 2001
+++ linux-akpm/include/linux/blkdev.h Mon Feb 11 00:59:50 2002
@@ -119,9 +119,9 @@ struct request_queue
spinlock_t queue_lock;

/*
- * Tasks wait here for free request
+ * Tasks wait here for free read and write requests
*/
- wait_queue_head_t wait_for_request;
+ wait_queue_head_t wait_for_requests[2];
};

struct blk_dev_struct {
--- linux-2.4.18-pre9/drivers/block/ll_rw_blk.c Thu Feb 7 13:04:11 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c Mon Feb 11 00:59:50 2002
@@ -118,10 +118,14 @@ int * max_readahead[MAX_BLKDEV];
int * max_sectors[MAX_BLKDEV];

/*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
*/
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * The threshold around which we make wakeup decisions
+ */
+static int batch_requests;

static inline int get_max_sectors(kdev_t dev)
{
@@ -352,7 +356,8 @@ static void blk_init_free_list(request_q
q->rq[i&1].count++;
}

- init_waitqueue_head(&q->wait_for_request);
+ init_waitqueue_head(&q->wait_for_requests[0]);
+ init_waitqueue_head(&q->wait_for_requests[1]);
spin_lock_init(&q->queue_lock);
}

@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
#define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
/*
* Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in. Returns NULL if there are no free requests.
*/
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
{
struct request *rq = NULL;
struct request_list *rl = q->rq + rw;
@@ -438,38 +443,102 @@ static inline struct request *get_reques
}

/*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ *
+ * 2: We want to be `fair' to all requesters. We must avoid starvation, and
+ * attempt to ensure that all requesters sleep for a similar duration. Hence
+ * no stealing requests when there are other processes waiting.
+ *
+ * 3: We also wish to support `batching' of requests. So when a process is
+ * woken, we want to allow it to allocate a decent number of requests
+ * before it blocks again, so they can be nicely merged (this only really
+ * matters if the process happens to be adding requests near the head of
+ * the queue).
+ *
+ * 4: We want to avoid scheduling storms. This isn't really important, because
+ * the system will be I/O bound anyway. But it's easy.
+ *
+ * There is tension between requirements 2 and 3. Once a task has woken,
+ * we don't want to allow it to sleep as soon as it takes its second request.
+ * But we don't want currently-running tasks to steal all the requests
+ * from the sleepers. We handle this with wakeup hysteresis around
+ * 0 .. batch_requests and with the assumption that request taking is much,
+ * much faster than request freeing.
+ *
+ * So here's what we do:
+ *
+ * a) A READA requester fails if free_requests < batch_requests
+ *
+ * We don't want READA requests to prevent sleepers from ever
+ * waking.
+ *
+ * When a process wants a new request:
+ *
+ * b) If free_requests == 0, the requester sleeps in FIFO manner.
+ *
+ * b) If 0 < free_requests < batch_requests and there are waiters,
+ * we still take a request non-blockingly. This provides batching.
+ *
+ * c) If free_requests >= batch_requests, the caller is immediately
+ * granted a new request.
+ *
+ * When a request is released:
+ *
+ * d) If free_requests < batch_requests, do nothing.
+ *
+ * f) If free_requests >= batch_requests, wake up a single waiter.
+ *
+ * The net effect is that when a process is woken at the batch_requests level,
+ * it will be able to take approximately (batch_requests) requests before
+ * blocking again (at the tail of the queue).
+ *
+ * This all assumes that the rate of taking requests is much, much higher
+ * than the rate of releasing them. Which is very true.
+ *
+ * -akpm, Feb 2002.
*/
+#undef REQTIMING
static struct request *__get_request_wait(request_queue_t *q, int rw)
{
register struct request *rq;
DECLARE_WAITQUEUE(wait, current);

+#ifdef REQTIMING
+ struct timeval tv1, tv2;
+ unsigned long long t1, t2;
+ unsigned long t;
+ do_gettimeofday(&tv1);
+#endif
+
generic_unplug_device(q);
- add_wait_queue(&q->wait_for_request, &wait);
+ add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
- if (q->rq[rw].count < batch_requests)
+ if (q->rq[rw].count == 0)
schedule();
spin_lock_irq(&io_request_lock);
rq = get_request(q,rw);
spin_unlock_irq(&io_request_lock);
} while (rq == NULL);
- remove_wait_queue(&q->wait_for_request, &wait);
+ remove_wait_queue(&q->wait_for_requests[rw], &wait);
current->state = TASK_RUNNING;
- return rq;
-}
-
-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
- register struct request *rq;

- spin_lock_irq(&io_request_lock);
- rq = get_request(q, rw);
- spin_unlock_irq(&io_request_lock);
- if (rq)
- return rq;
- return __get_request_wait(q, rw);
+#ifdef REQTIMING
+ do_gettimeofday(&tv2);
+ t1 = tv1.tv_sec;
+ t1 *= 1000000;
+ t1 += tv1.tv_usec;
+ t2 = tv2.tv_sec;
+ t2 *= 1000000;
+ t2 += tv2.tv_usec;
+ t = t2 - t1;
+ printk("%s[%d] wakes after %ldms for %s\n", current->comm,
+ current->pid, t / 1000,
+ (rw == WRITE) ? "WRITE" : "READ");
+#endif
+ return rq;
}

/* RO fail safe mechanism */
@@ -546,7 +615,7 @@ static inline void add_request(request_q
/*
* Must be called with io_request_lock held and interrupts disabled
*/
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
{
request_queue_t *q = req->q;
int rw = req->cmd;
@@ -560,8 +629,9 @@ inline void blkdev_release_request(struc
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
- wake_up(&q->wait_for_request);
+ if (++q->rq[rw].count >= batch_requests &&
+ waitqueue_active(&q->wait_for_requests[rw]))
+ wake_up(&q->wait_for_requests[rw]);
}
}

@@ -742,22 +812,30 @@ again:
BUG();
}

- /*
- * Grab a free request from the freelist - if that is empty, check
- * if we are doing read ahead and abort instead of blocking for
- * a free slot.
- */
get_rq:
if (freereq) {
req = freereq;
freereq = NULL;
- } else if ((req = get_request(q, rw)) == NULL) {
- spin_unlock_irq(&io_request_lock);
- if (rw_ahead)
- goto end_io;
-
- freereq = __get_request_wait(q, rw);
- goto again;
+ } else {
+ /*
+ * See description above __get_request_wait()
+ */
+ if (rw_ahead) {
+ if (q->rq[rw].count < batch_requests) {
+ spin_unlock_irq(&io_request_lock);
+ goto end_io;
+ }
+ req = get_request(q, rw);
+ if (req == NULL)
+ BUG();
+ } else {
+ req = get_request(q, rw);
+ if (req == NULL) {
+ spin_unlock_irq(&io_request_lock);
+ freereq = __get_request_wait(q, rw);
+ goto again;
+ }
+ }
}

/* fill up the request-info, and add it to the queue */




2002-02-13 01:29:05

by William Lee Irwin III

Subject: Re: [patch] get_request starvation fix

On Tue, Feb 12, 2002 at 03:13:26PM -0800, Andrew Morton wrote:
> Second version of this patch, incorporating Suparna's
> suggested simplification (the low-water mark was
> unnecessary).
>
> This patch is working well here. Hopefully it'll pop up
> in an rmap kernel soon.
>
> Bill Irwin has been doing some fairly extensive tuning
> and testing of this. Hopefully he'll come out with some
> numbers soon.

Much of my testing has involved varying elevator and bdflush parameters
to find what works best with this. From preliminary results it is clear
that fixing the fairness issues addressed by this patch and the read
latency patch does not impair performance.

I'll do a bunch more runs tonight with the latest version of this.
My prior runs were with an earlier version.


Cheers,
Bill

2002-02-13 13:50:19

by Randy Hron

Subject: Re: [patch] get_request starvation fix


tiobench measurements using 2048 MB worth of files on ext2 fs.
K6-2 475 MHz with 384 MB RAM and an IDE disk.
Sorted by test, number of threads, then kernel.
Read, write, and seek rates in MB/sec.
Latency in milliseconds.
Percent of requests that took longer than 2 and 10 seconds.

2.4.18-pre9-am1 = make_request, read_latency2, and low-latency patches.
2.4.18-pre9-am2 = make_request, read_latency2, and sync_livelock patches.

On sequential reads, kernels with make_request and read_latency2
reduce max latency from around 500 seconds down to 2-8 seconds.
Fairness to requests is vastly improved. Throughput looks better too.

Sequential Reads  Num                      Avg    Maximum     Lat%     Lat%
Kernel            Thr   Rate  (CPU%)   Latency    Latency      >2s     >10s
----------------  ---  ----------------------------------------------------
2.4.18-pre9        32   9.18  12.19%    34.739  545388.95  0.00304  0.00304
2.4.18-pre9-ac1    32  11.48  30.08%    25.736  289601.71  0.00839  0.00839
2.4.18-pre9-am1    32   9.25  11.97%    40.103    2251.30  0.00000  0.00000
2.4.18-pre9-am2    32   9.32  12.11%    39.917    2206.14  0.00000  0.00000

2.4.18-pre9        64   8.83  11.81%    59.045  484270.65  0.02689  0.02651
2.4.18-pre9-ac1    64  10.37  27.12%    49.460  293425.91  0.04273  0.04196
2.4.18-pre9-am1    64   9.53  13.42%    77.737    3840.97  0.00000  0.00000
2.4.18-pre9-am2    64   9.52  13.15%    77.873    3951.99  0.00000  0.00000

2.4.18-pre9       128   8.44  10.92%   109.317  572551.72  0.08507  0.07534
2.4.18-pre9-ac1   128   9.28  24.08%   104.159  333550.33  0.13599  0.13523
2.4.18-pre9-am1   128   9.55  15.38%   153.480    7939.48  0.66070  0.00000
2.4.18-pre9-am2   128   9.59  15.72%   153.244    7648.12  0.80833  0.00000

On random reads, the make_request and read_latency2 patches help a lot at
128 threads. Max latency and the % of high-latency requests are greatly reduced.

Random Reads      Num                      Avg    Maximum     Lat%     Lat%
Kernel            Thr   Rate  (CPU%)   Latency    Latency      >2s     >10s
----------------  ---  ----------------------------------------------------
2.4.18-pre9        32   0.57  1.580%   549.367    1523.39  0.00000  0.00000
2.4.18-pre9-ac1    32   0.53  2.226%   660.193    2966.26  0.00000  0.00000
2.4.18-pre9-am1    32   0.58  1.479%   623.361    1693.90  0.00000  0.00000
2.4.18-pre9-am2    32   0.58  1.501%   619.679    1727.47  0.00000  0.00000

2.4.18-pre9        64   0.61  2.031%  1016.205    2618.34  0.00000  0.00000
2.4.18-pre9-ac1    64   0.56  2.579%  1160.377   72621.35  0.35283  0.32763
2.4.18-pre9-am1    64   0.61  1.711%  1145.865    2865.21  0.00000  0.00000
2.4.18-pre9-am2    64   0.62  1.978%  1135.097    2792.76  0.00000  0.00000

2.4.18-pre9       128   0.61  1.699%  1569.812   53372.49  4.48589  4.41029
2.4.18-pre9-ac1   128   0.55  2.701%  1862.139   75417.90  5.54435  5.21673
2.4.18-pre9-am1   128   0.63  2.511%  2152.480    4710.51  0.00000  0.00000
2.4.18-pre9-am2   128   0.63  2.525%  2145.578    4709.85  0.00000  0.00000

On the write tests, 2.4.18-pre9-ac1 has the best latency numbers.

Sequential Write  Num                      Avg    Maximum     Lat%     Lat%
Kernel            Thr   Rate  (CPU%)   Latency    Latency      >2s     >10s
----------------  ---  ----------------------------------------------------
2.4.18-pre9        32  13.18  35.47%    26.155   33677.84  0.48867  0.00324
2.4.18-pre9-ac1    32  13.66  36.82%    24.738    9008.30  0.03567  0.00000
2.4.18-pre9-am1    32  15.27  53.94%    21.236   39604.59  0.37098  0.00400
2.4.18-pre9-am2    32  15.17  53.66%    21.492   41358.77  0.37155  0.00496

2.4.18-pre9        64  13.47  44.74%    46.296   46673.56  0.76332  0.03968
2.4.18-pre9-ac1    64  11.91  32.45%    55.355   12051.19  0.22926  0.00000
2.4.18-pre9-am1    64  14.57  52.16%    43.430   51274.83  0.75874  0.05665
2.4.18-pre9-am2    64  14.78  53.76%    42.685   54281.45  0.63266  0.07457

2.4.18-pre9       128  12.78  42.55%    91.421   60494.19  1.20201  0.26531
2.4.18-pre9-ac1   128  10.15  27.63%   122.921   15371.80  1.29719  0.00000
2.4.18-pre9-am1   128  13.72  47.73%    83.373   69739.68  1.16043  0.22126
2.4.18-pre9-am2   128  14.17  52.81%    80.642   67952.87  1.10722  0.23041


Random Writes     Num                      Avg    Maximum     Lat%     Lat%
Kernel            Thr   Rate  (CPU%)   Latency    Latency      >2s     >10s
----------------  ---  ----------------------------------------------------
2.4.18-pre9        32   0.60  1.323%     1.915     576.55  0.00000  0.00000
2.4.18-pre9-ac1    32   0.61  1.296%     0.311     481.93  0.00000  0.00000
2.4.18-pre9-am1    32   0.71  2.411%     1.696     624.95  0.00000  0.00000
2.4.18-pre9-am2    32   0.73  2.804%     1.567     711.59  0.00000  0.00000

2.4.18-pre9        64   0.65  1.547%     1.552     558.61  0.00000  0.00000
2.4.18-pre9-ac1    64   0.64  1.457%     0.214      88.78  0.00000  0.00000
2.4.18-pre9-am1    64   0.74  2.649%     1.014     522.44  0.00000  0.00000
2.4.18-pre9-am2    64   0.75  2.647%     1.035     557.22  0.00000  0.00000

2.4.18-pre9       128   0.70  1.823%     1.664     785.13  0.00000  0.00000
2.4.18-pre9-ac1   128   0.67  1.926%     0.339     398.63  0.00000  0.00000
2.4.18-pre9-am1   128   0.76  2.716%     1.527     845.77  0.00000  0.00000
2.4.18-pre9-am2   128   0.75  2.828%     1.371     886.19  0.00000  0.00000

More tests on more kernels at
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
--
Randy Hron

2002-02-15 18:33:09

by Marcelo Tosatti

Subject: Re: [patch] get_request starvation fix



On Tue, 12 Feb 2002, Andrew Morton wrote:

> Second version of this patch, incorporating Suparna's
> suggested simplification (the low-water mark was
> unnecessary).
>
> This patch is working well here. Hopefully it'll pop up
> in an rmap kernel soon.
>
> Bill Irwin has been doing some fairly extensive tuning
> and testing of this. Hopefully he'll come out with some
> numbers soon.
>
> I include the original description...

It seems the real gain (in latency) is caused by the FIFO behaviour.

That is, removing this hunk (against __get_request_wait())

- if (q->rq[rw].count < batch_requests)
+ if (q->rq[rw].count == 0)
schedule();

Would not make _much_ difference latency-wise. Am I right, or am I missing
something?

Anyway, I would like to have the patch cleaned up for 2.4.19-pre (remove
the instrumentation stuff _and_ make it clear in the documentation that
READA requests are not being used in practice).

Thanks a lot for that, Andrew.


2002-02-16 07:33:35

by Andrew Morton

Subject: Re: [patch] get_request starvation fix

Marcelo Tosatti wrote:
>
> ...
> It seems the real gain (in latency) is caused by the FIFO behaviour.

Well there is no net gain in latency. What we've gained is *fairness*,
so the worst-case latency is improved. And yes, it's the FIFO behaviour
which provides that. And it is the exclusive wait which reduces the context
switch rate by 50%.

> That is, removing this hunk (against __get_request_wait())
>
> - if (q->rq[rw].count < batch_requests)
> + if (q->rq[rw].count == 0)
> schedule();
>
> Would not make _much_ difference latency-wise. Am I right, or am I missing
> something?

Well the code which is presently in -rc1 is quite pointless. We
perform the above test a couple of microseconds after determining
that there are zero free requests. So the test you've illustrated
will *always* be true. The test for zero is the right thing to
do - it's just normal waitqueue handling. It's just a little
obfuscated by the implicit test-for-zero in get_request().

However, contrary to my earlier guess, the request batching does
make a measurable difference. Changing the code so that we wake up
a sleeper as soon as any request is freed costs maybe 30%
on `dbench 64'.

> Anyway, I would like to have the patch cleaned up for 2.4.19-pre (remove
> the instrumentation stuff _and_ make it clear in the documentation that
> READA requests are not being used in practice).

READA is used in a few filesystems for directory readahead. (And since
dir-in-pagecache, ext2 is no longer performing directory readahead. Bad.)

I came >this< close to killing READA altogether. I believe it's a
misfeature. If we're under intense seek pressure, the very, very _last_
thing we want to do is to create more seeks by not performing readahead.
But given that the misfeature only applies to creation of new requests,
and that block-contiguous readahead will still work OK, it's not too
serious. Plus it's a stable kernel :)





--- linux-2.4.18-rc1/include/linux/blkdev.h Mon Nov 26 11:52:07 2001
+++ linux-akpm/include/linux/blkdev.h Fri Feb 15 23:06:04 2002
@@ -119,9 +119,9 @@ struct request_queue
spinlock_t queue_lock;

/*
- * Tasks wait here for free request
+ * Tasks wait here for free read and write requests
*/
- wait_queue_head_t wait_for_request;
+ wait_queue_head_t wait_for_requests[2];
};

struct blk_dev_struct {
--- linux-2.4.18-rc1/drivers/block/ll_rw_blk.c Wed Feb 13 12:59:09 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c Fri Feb 15 23:09:17 2002
@@ -118,10 +118,14 @@ int * max_readahead[MAX_BLKDEV];
int * max_sectors[MAX_BLKDEV];

/*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
*/
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * The threshold around which we make wakeup decisions
+ */
+static int batch_requests;

static inline int get_max_sectors(kdev_t dev)
{
@@ -352,7 +356,8 @@ static void blk_init_free_list(request_q
q->rq[i&1].count++;
}

- init_waitqueue_head(&q->wait_for_request);
+ init_waitqueue_head(&q->wait_for_requests[0]);
+ init_waitqueue_head(&q->wait_for_requests[1]);
spin_lock_init(&q->queue_lock);
}

@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
#define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
/*
* Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in. Returns NULL if there are no free requests.
*/
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
{
struct request *rq = NULL;
struct request_list *rl = q->rq + rw;
@@ -438,40 +443,84 @@ static inline struct request *get_reques
}

/*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ *
+ * 2: We want to be `fair' to all requesters. We must avoid starvation, and
+ * attempt to ensure that all requesters sleep for a similar duration. Hence
+ * no stealing requests when there are other processes waiting.
+ *
+ * 3: We also wish to support `batching' of requests. So when a process is
+ * woken, we want to allow it to allocate a decent number of requests
+ * before it blocks again, so they can be nicely merged (this only really
+ * matters if the process happens to be adding requests near the head of
+ * the queue).
+ *
+ * 4: We want to avoid scheduling storms. This isn't really important, because
+ * the system will be I/O bound anyway. But it's easy.
+ *
+ * There is tension between requirements 2 and 3. Once a task has woken,
+ * we don't want to allow it to sleep as soon as it takes its second request.
+ * But we don't want currently-running tasks to steal all the requests
+ * from the sleepers. We handle this with wakeup hysteresis around
+ * 0 .. batch_requests and with the assumption that request taking is much,
+ * much faster than request freeing.
+ *
+ * So here's what we do:
+ *
+ * a) A READA requester fails if free_requests < batch_requests
+ *
+ * We don't want READA requests to prevent sleepers from ever
+ * waking. Note that READA is used extremely rarely - a few
+ * filesystems use it for directory readahead.
+ *
+ * When a process wants a new request:
+ *
+ * b) If free_requests == 0, the requester sleeps in FIFO manner.
+ *
+ * b) If 0 < free_requests < batch_requests and there are waiters,
+ * we still take a request non-blockingly. This provides batching.
+ *
+ * c) If free_requests >= batch_requests, the caller is immediately
+ * granted a new request.
+ *
+ * When a request is released:
+ *
+ * d) If free_requests < batch_requests, do nothing.
+ *
+ * f) If free_requests >= batch_requests, wake up a single waiter.
+ *
+ * The net effect is that when a process is woken at the batch_requests level,
+ * it will be able to take approximately (batch_requests) requests before
+ * blocking again (at the tail of the queue).
+ *
+ * This all assumes that the rate of taking requests is much, much higher
+ * than the rate of releasing them. Which is very true.
+ *
+ * -akpm, Feb 2002.
*/
+
static struct request *__get_request_wait(request_queue_t *q, int rw)
{
register struct request *rq;
DECLARE_WAITQUEUE(wait, current);

generic_unplug_device(q);
- add_wait_queue(&q->wait_for_request, &wait);
+ add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
- if (q->rq[rw].count < batch_requests)
+ if (q->rq[rw].count == 0)
schedule();
spin_lock_irq(&io_request_lock);
rq = get_request(q,rw);
spin_unlock_irq(&io_request_lock);
} while (rq == NULL);
- remove_wait_queue(&q->wait_for_request, &wait);
+ remove_wait_queue(&q->wait_for_requests[rw], &wait);
current->state = TASK_RUNNING;
return rq;
}

-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
- register struct request *rq;
-
- spin_lock_irq(&io_request_lock);
- rq = get_request(q, rw);
- spin_unlock_irq(&io_request_lock);
- if (rq)
- return rq;
- return __get_request_wait(q, rw);
-}
-
/* RO fail safe mechanism */

static long ro_bits[MAX_BLKDEV][8];
@@ -546,7 +595,7 @@ static inline void add_request(request_q
/*
* Must be called with io_request_lock held and interrupts disabled
*/
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
{
request_queue_t *q = req->q;
int rw = req->cmd;
@@ -560,8 +609,9 @@ inline void blkdev_release_request(struc
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
- wake_up(&q->wait_for_request);
+ if (++q->rq[rw].count >= batch_requests &&
+ waitqueue_active(&q->wait_for_requests[rw]))
+ wake_up(&q->wait_for_requests[rw]);
}
}

@@ -742,22 +792,30 @@ again:
BUG();
}

- /*
- * Grab a free request from the freelist - if that is empty, check
- * if we are doing read ahead and abort instead of blocking for
- * a free slot.
- */
get_rq:
if (freereq) {
req = freereq;
freereq = NULL;
- } else if ((req = get_request(q, rw)) == NULL) {
- spin_unlock_irq(&io_request_lock);
- if (rw_ahead)
- goto end_io;
-
- freereq = __get_request_wait(q, rw);
- goto again;
+ } else {
+ /*
+ * See description above __get_request_wait()
+ */
+ if (rw_ahead) {
+ if (q->rq[rw].count < batch_requests) {
+ spin_unlock_irq(&io_request_lock);
+ goto end_io;
+ }
+ req = get_request(q, rw);
+ if (req == NULL)
+ BUG();
+ } else {
+ req = get_request(q, rw);
+ if (req == NULL) {
+ spin_unlock_irq(&io_request_lock);
+ freereq = __get_request_wait(q, rw);
+ goto again;
+ }
+ }
}

/* fill up the request-info, and add it to the queue */



2002-02-16 10:10:43

by Daniel Phillips

Subject: Re: [patch] get_request starvation fix

On February 16, 2002 08:32 am, Andrew Morton wrote:
> However, contrary to my earlier guess, the request batching does
> make a measurable difference. Changing the code so that we wake up
> a sleeper as soon as any request is freed costs maybe 30%
> on `dbench 64'.

Is this consistent with results on other IO benchmarks?

--
Daniel

2002-02-16 10:27:08

by Andrew Morton

Subject: Re: [patch] get_request starvation fix

Daniel Phillips wrote:
>
> On February 16, 2002 08:32 am, Andrew Morton wrote:
> > However, contrary to my earlier guess, the request batching does
> > make a measurable difference. Changing the code so that we wake up
> > a sleeper as soon as any request is freed costs maybe 30%
> > on `dbench 64'.
>
> Is this consistent with results on other IO benchmarks?
>

I dunno. I doubt it - few of the other benchmarks are very
seek-intensive. dbench is somewhat repeatable if you load
it up with enough clients. And average the results across
enough runs. And stand on one leg and squint.