Here's a patch which addresses the get_request starvation problem.
The problem we're experiencing is that when the request queue is
exhausted, processes go to sleep until the request queue has refilled
to the batch_nr_requests level. But already-running tasks can simply
whizz in and steal free requests, thus preventing the sleeping tasks
from getting woken.
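To make the race concrete, here is a condensed sketch of the pre-patch
allocate/release path, pieced together from the '-' lines in the diff
below (not the literal 2.4.18-pre9 source): get_request() hands out
requests with no regard for sleepers, and a sleeper is only woken (and
only retries) once the free count has climbed all the way back up to
batch_requests, which never happens while runners keep draining the list.

static struct request *old_get_request(request_queue_t *q, int rw)
{
	struct request *rq = NULL;
	struct request_list *rl = q->rq + rw;

	if (!list_empty(&rl->free)) {		/* no check for waiters at all */
		rq = blkdev_free_rq(&rl->free);
		list_del(&rq->queue);
		rl->count--;
	}
	return rq;
}

static void old_release_request(request_queue_t *q, int rw)
{
	/* wake-all, and only once the count has refilled to batch_requests */
	if (++q->rq[rw].count >= batch_requests &&
	    waitqueue_active(&q->wait_for_request))
		wake_up(&q->wait_for_request);
}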
Some instrumentation illustrates this clearly. All testing was on
ext2. In a `dbench 64' run on stock 2.4.18-pre9, the maximum sleep
duration in __get_request_wait() was 37 seconds, and it's being very
unfair. Look at this section of the trace:
Feb 7 21:59:06 mnm kernel: dbench[883] wakes after 442ms
Feb 7 21:59:06 mnm kernel: dbench[877] wakes after 442ms
Feb 7 21:59:06 mnm kernel: dbench[868] wakes after 441ms
Feb 7 21:59:06 mnm kernel: kupdated[7] wakes after 11587ms
Feb 7 21:59:06 mnm kernel: dbench[914] wakes after 574ms
Feb 7 21:59:06 mnm kernel: dbench[873] wakes after 2618ms
Feb 7 21:59:06 mnm kernel: dbench[927] wakes after 755ms
Feb 7 21:59:06 mnm kernel: dbench[873] wakes after 263ms
Feb 7 21:59:06 mnm kernel: dbench[924] wakes after 838ms
Feb 7 21:59:06 mnm kernel: dbench[924] wakes after 134ms
Feb 7 21:59:06 mnm kernel: dbench[871] wakes after 18688ms
Feb 7 21:59:06 mnm kernel: dbench[908] wakes after 3245ms
Feb 7 21:59:06 mnm kernel: dbench[895] wakes after 974ms
Feb 7 21:59:06 mnm kernel: dbench[878] wakes after 27401ms
Feb 7 21:59:06 mnm kernel: dbench[895] wakes after 278ms
Feb 7 21:59:06 mnm kernel: dbench[904] wakes after 278ms
With the patch applied, the maximum sleep duration was eight seconds,
but when this was happening, *all* tasks were sleeping for similar
durations:
Feb 7 23:55:40 mnm kernel: dbench[948] wakes after 3978ms
Feb 7 23:55:40 mnm kernel: dbench[935] wakes after 3439ms
Feb 7 23:55:41 mnm kernel: dbench[955] wakes after 2998ms
Feb 7 23:55:41 mnm kernel: dbench[965] wakes after 2847ms
Feb 7 23:55:41 mnm kernel: dbench[994] wakes after 3032ms
Feb 7 23:55:41 mnm kernel: dbench[969] wakes after 2652ms
Feb 7 23:55:41 mnm kernel: dbench[956] wakes after 2197ms
Feb 7 23:55:41 mnm kernel: dbench[993] wakes after 1589ms
Feb 7 23:55:41 mnm kernel: dbench[990] wakes after 1590ms
Feb 7 23:55:41 mnm kernel: dbench[939] wakes after 1585ms
Feb 7 23:55:41 mnm kernel: kupdated[7] wakes after 1555ms
Feb 7 23:55:41 mnm kernel: dbench[976] wakes after 1787ms
Feb 7 23:55:41 mnm kernel: dbench[997] wakes after 1653ms
Feb 7 23:55:41 mnm kernel: dbench[961] wakes after 1891ms
Feb 7 23:55:42 mnm kernel: dbench[966] wakes after 1967ms
Feb 7 23:55:42 mnm kernel: dbench[936] wakes after 1339ms
Feb 7 23:55:42 mnm kernel: dbench[965] wakes after 1216ms
Feb 7 23:55:42 mnm kernel: kjournald[9] wakes after 1253ms
Feb 7 23:55:42 mnm kernel: dbench[990] wakes after 879ms
Feb 7 23:55:42 mnm kernel: kupdated[7] wakes after 849ms
Feb 7 23:55:42 mnm kernel: dbench[959] wakes after 343ms
Feb 7 23:55:42 mnm kernel: dbench[936] wakes after 343ms
Feb 7 23:55:42 mnm kernel: dbench[964] wakes after 492ms
Feb 7 23:55:42 mnm kernel: dbench[966] wakes after 648ms
Feb 7 23:55:42 mnm kernel: kupdated[7] wakes after 413ms
Feb 7 23:55:42 mnm kernel: dbench[988] wakes after 344ms
The actual algorithm I chose is described in the patch. It is
noteworthy that the wakeup strategy has changed from wake-all to
wake-one, which should save a few cycles. The number of context
switches during the entire dbench 64 run fell from 57,000 to
29,000.
Interestingly, dbench runtimes are still very inconsistent, and
many clients still exit quite early in the run. Beats me.
Rik, this patch will conflict with read-latency2 in the init code.
Easily fixable. I haven't tested with the larger request queue, but I
suggest that we want to keep the 1024 requests, a high-water mark of
queue_nr_requests/4 and a low-water mark of queue_nr_requests/8. I'll
do more testing tomorrow.
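Spelled out (illustrative numbers only, using the same formulas as the
init code in the patch):

queue_nr_requests   = 1024;			/* the larger request queue */
batch_requests_high = queue_nr_requests / 4;	/* 256: wake a sleeper here */
batch_requests_low  = queue_nr_requests / 8;	/* 128: waiters sleep below this */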
Also, you noted the other day that a LILO run *never* terminated when
there was a dbench running. This is in fact not due to request
starvation. It's due to livelock in invalidate_bdev(), which is called
via ioctl(BLKFLSBUF) by LILO. invalidate_bdev() will never terminate
as long as another task is generating locked buffers against the
device.
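For reference, the shape of the loop that livelocks is roughly this (a
condensed sketch, not the literal fs/buffer.c code; next_locked_buffer_on()
stands in for the real walk over the buffer LRU lists):

static void invalidate_bdev_sketch(kdev_t dev)
{
	struct buffer_head *bh;
	int slept;

retry:
	slept = 0;
	spin_lock(&lru_list_lock);
	while ((bh = next_locked_buffer_on(dev)) != NULL) {
		atomic_inc(&bh->b_count);
		spin_unlock(&lru_list_lock);
		wait_on_buffer(bh);	/* sleep until this buffer's I/O completes */
		slept = 1;
		spin_lock(&lru_list_lock);
		atomic_dec(&bh->b_count);
	}
	spin_unlock(&lru_list_lock);
	if (slept)
		goto retry;		/* fresh locked buffers keep us going around */
}

As long as the writer keeps producing newly locked buffers against the
device, every pass finds at least one, and the retry never stops.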
--- linux-2.4.18-pre9/drivers/block/ll_rw_blk.c Thu Feb 7 13:04:11 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c Fri Feb 8 00:33:01 2002
@@ -118,10 +118,15 @@ int * max_readahead[MAX_BLKDEV];
int * max_sectors[MAX_BLKDEV];
/*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
*/
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * low- and high-water marks for the queues.
+ */
+static int batch_requests_low;
+static int batch_requests_high;
static inline int get_max_sectors(kdev_t dev)
{
@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
#define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
/*
* Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in. Returns NULL if there are no free requests.
*/
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
{
struct request *rq = NULL;
struct request_list *rl = q->rq + rw;
@@ -438,18 +443,81 @@ static inline struct request *get_reques
}
/*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ *
+ * 2: We want to be `fair' to all requesters. We must avoid starvation, and
+ * attempt to ensure that all requesters sleep for a similar duration. Hence
+ * no stealing requests when there are other processes waiting.
+ *
+ * 3: We also wish to support `batching' of requests. So when a process is
+ * woken, we want to allow it to allocate a decent number of requests
+ * before it blocks again, so they can be nicely merged (this only really
+ * matters if the process happens to be adding requests near the head of
+ * the queue).
+ *
+ * 4: We want to avoid scheduling storms. This isn't really important, because
+ * the system will be I/O bound anyway. But it's easy.
+ *
+ * There is tension between requirements 2 and 3. Once a task has woken,
+ * we don't want to allow it to sleep as soon as it takes its second request.
+ * But we don't want currently-running tasks to steal all the requests
+ * from the sleepers. We handle this with high- and low- water marks and
+ * with the assumption that request taking is much, much faster than
+ * request freeing.
+ *
+ * So here's what we do:
+ *
+ * a) A READA requester fails if free_requests < batch_requests_high
+ *
+ * We don't want READA requests to prevent sleepers from ever
+ * waking.
+ *
+ * When a process wants a new request:
+ *
+ * b) If someone else is waiting on requests and free_requests < low_water,
+ * the requester sleeps in FIFO manner.
+ *
+ * c) If low_water < free_requests < high_water, the caller is immediately
+ * granted a new request.
+ *
+ * d) If nobody is waiting on requests, the caller gets given a request,
+ * if there are any available. Otherwise the caller sleeps.
+ *
+ * When a request is released:
+ *
+ * e) If free_requests < low_water, do nothing.
+ *
+ * f) If free_requests > high_water, wake up a single waiter.
+ *
+ * The net effect is that when a process is woken at the high-water mark,
+ * it will be able to take approximately (high-water - low-water) requests
+ * before blocking again (at the tail of the queue).
+ *
+ * This all assumes that the rate of taking requests is much, much higher
+ * than the rate of releasing them. Which is very true.
+ *
+ * -akpm, Feb 2002.
*/
+#undef REQTIMING
static struct request *__get_request_wait(request_queue_t *q, int rw)
{
register struct request *rq;
DECLARE_WAITQUEUE(wait, current);
+#ifdef REQTIMING
+ struct timeval tv1, tv2;
+ unsigned long long t1, t2;
+ unsigned long t;
+ do_gettimeofday(&tv1);
+#endif
+
generic_unplug_device(q);
- add_wait_queue(&q->wait_for_request, &wait);
+ add_wait_queue_exclusive(&q->wait_for_request, &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
- if (q->rq[rw].count < batch_requests)
+ if (q->rq[rw].count < batch_requests_low)
schedule();
spin_lock_irq(&io_request_lock);
rq = get_request(q,rw);
@@ -457,19 +525,20 @@ static struct request *__get_request_wai
} while (rq == NULL);
remove_wait_queue(&q->wait_for_request, &wait);
current->state = TASK_RUNNING;
- return rq;
-}
-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
- register struct request *rq;
-
- spin_lock_irq(&io_request_lock);
- rq = get_request(q, rw);
- spin_unlock_irq(&io_request_lock);
- if (rq)
- return rq;
- return __get_request_wait(q, rw);
+#ifdef REQTIMING
+ do_gettimeofday(&tv2);
+ t1 = tv1.tv_sec;
+ t1 *= 1000000;
+ t1 += tv1.tv_usec;
+ t2 = tv2.tv_sec;
+ t2 *= 1000000;
+ t2 += tv2.tv_usec;
+ t = t2 - t1;
+ printk("%s[%d] wakes after %ldms\n", current->comm,
+ current->pid, t / 1000);
+#endif
+ return rq;
}
/* RO fail safe mechanism */
@@ -546,7 +615,7 @@ static inline void add_request(request_q
/*
* Must be called with io_request_lock held and interrupts disabled
*/
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
{
request_queue_t *q = req->q;
int rw = req->cmd;
@@ -560,7 +629,8 @@ inline void blkdev_release_request(struc
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
+ if (++q->rq[rw].count >= batch_requests_high &&
+ waitqueue_active(&q->wait_for_request))
wake_up(&q->wait_for_request);
}
}
@@ -742,22 +812,41 @@ again:
BUG();
}
- /*
- * Grab a free request from the freelist - if that is empty, check
- * if we are doing read ahead and abort instead of blocking for
- * a free slot.
- */
get_rq:
if (freereq) {
req = freereq;
freereq = NULL;
- } else if ((req = get_request(q, rw)) == NULL) {
- spin_unlock_irq(&io_request_lock);
- if (rw_ahead)
- goto end_io;
-
- freereq = __get_request_wait(q, rw);
- goto again;
+ } else {
+ /*
+ * See description above __get_request_wait()
+ */
+ if (rw_ahead) {
+ if (q->rq[rw].count < batch_requests_high) {
+ spin_unlock_irq(&io_request_lock);
+ goto end_io;
+ }
+ req = get_request(q, rw);
+ if (req == NULL)
+ BUG();
+ } else {
+ if (waitqueue_active(&q->wait_for_request)) {
+ if (q->rq[rw].count < batch_requests_low) {
+ spin_unlock_irq(&io_request_lock);
+ freereq = __get_request_wait(q, rw);
+ goto again;
+ }
+ req = get_request(q, rw);
+ if (req == NULL)
+ BUG();
+ } else {
+ req = get_request(q, rw);
+ if (req == NULL) {
+ spin_unlock_irq(&io_request_lock);
+ freereq = __get_request_wait(q, rw);
+ goto again;
+ }
+ }
+ }
}
/* fill up the request-info, and add it to the queue */
@@ -1124,8 +1213,10 @@ int __init blk_dev_init(void)
/*
* Batch frees according to queue length
*/
- batch_requests = queue_nr_requests/4;
- printk("block: %d slots per queue, batch=%d\n", queue_nr_requests, batch_requests);
+ batch_requests_high = queue_nr_requests / 4;
+ batch_requests_low = batch_requests_high / 2;
+ printk("block: %d slots per queue, batch_low=%d, batch_high=%d\n",
+ queue_nr_requests, batch_requests_low, batch_requests_high);
#ifdef CONFIG_AMIGA_Z2RAM
z2_init();
On Fri, Feb 08 2002, Andrew Morton wrote:
> Here's a patch which addresses the get_request starvation problem.
[snip]
Agrh, if only I knew you were working on this too :/. Oh well, from a
first-read the patch looks good.
--
Jens Axboe
Andrew Morton wrote:
>
> It is noteworthy that the wakeup strategy has changed from wake-all to
> wake-one, which should save a few cycles.
It is also noteworthy that read latency completely sucks. Not sure if
I've made it worse.
Anyway, with exclusive wakeup we need separate waitqueues for read and
write requests. Updated patch at
http://www.zip.com.au/~akpm/linux/2.4/2.4.18-pre9/make_request.patch
-
Jens Axboe wrote:
>
> On Fri, Feb 08 2002, Andrew Morton wrote:
> > Here's a patch which addresses the get_request starvation problem.
>
> [snip]
>
> Agrh, if only I knew you were working on this too :/. Oh well, from a
> first-read the patch looks good.
Seems that with FIFO fairness, /bin/sync now also livelocks. And
it's pretty easy to see why. There's nothing to make
write_unlocked_buffers() terminate.
We'll worry about that tomorrow. I may just make it give up
after writing (2 * nr_buffers_type[BUF_DIRTY]) buffers.
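For reference, the pre-patch loop is essentially just this (the same code
the fs/buffer.c hunk later in the thread modifies); nothing bounds the
number of passes while another task keeps dirtying buffers:

static void write_unlocked_buffers(kdev_t dev)
{
	do {
		spin_lock(&lru_list_lock);
	} while (write_some_buffers(dev));	/* -EAGAIN while there may be more dirty buffers */
	run_task_queue(&tq_disk);
}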
The patch works well with read-latency2 (and it didn't throw
rejects). Smooth and fast. It's going to take some time and
testing to settle everything in.
-
On Fri, 8 Feb 2002, Andrew Morton wrote:
> + * This all assumes that the rate of taking requests is much, much higher
> + * than the rate of releasing them. Which is very true.
This is not necessarily true for read requests.
If each read request is synchronous and the process will
generate the next read request after the current one
has finished, then it's quite possible to clog up the
queue with read requests which are generated at exactly
the same rate as they're processed.
Couldn't this still cause starvation, even with your patch?
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
Rik van Riel wrote:
>
> On Fri, 8 Feb 2002, Andrew Morton wrote:
>
> > + * This all assumes that the rate of taking requests is much, much higher
> > + * than the rate of releasing them. Which is very true.
>
> This is not necessarily true for read requests.
>
> If each read request is synchronous and the process will
> generate the next read request after the current one
> has finished, then it's quite possible to clog up the
> queue with read requests which are generated at exactly
> the same rate as they're processed.
>
> Couldn't this still cause starvation, even with your patch?
No, that's fine.
The problem which the comment refers to is: how to provide
per-process request batching without running off and creating
per-process reservation pools or such.
What I'm relying on is that when a sleeper is woken (at the high-water
mark), there are at least (high-water - low-water) requests available
before get_request will again sleep. And that the woken process will be
able to grab a decent number of those non-blocking requests. I suspect
it's always true, as long as (high-water - low-water) is "much greater
than" the number of CPUs.
The synchronous reader is well-behaved, and should be nicely
FIFO if we're getting low on requests.
-
On Fri, Feb 08 2002, Andrew Morton wrote:
> Here's a patch which addresses the get_request starvation problem.
[snip]
> Also, you noted the other day that a LILO run *never* terminated when
> there was a dbench running. This is in fact not due to request
> starvation. It's due to livelock in invalidate_bdev(), which is called
> via ioctl(BLKFLSBUF) by LILO. invalidate_bdev() will never terminate
> as long as another task is generating locked buffers against the
> device.
[snip]
Could this below be related?
I get system locks through lilo (bzlilo) from time to time with all the
latest kernels + O(1) + -aa + preempt.
Thanks,
Dieter
> Re: [patch] O(1) scheduler, -J9
> Date: Wed, 30 Jan 2002 17:36:21 +0100
> From: Dieter Nützel <[email protected]>
> To: Ingo Molnar <[email protected]>
> Cc: Robert Love <[email protected]>, Oleg Drokin <[email protected]>
>
> On Wednesday, 30. January 2002 17:26, Dieter Nützel wrote:
> > On Wednesday, 30. January 2002 15:40, Ingo Molnar wrote:
> > > On Wed, 30 Jan 2002, Dieter Nützel wrote:
> > > > As always running "nice +19 make -j40" in my HUGE C++ VTK tree in the
> > > > background ;-)
> > > >
> > > > The mouse feels a little bit sluggish after KMail starts, so I reniced X
> > > > to -10. [...]
> > >
> > > does this make X smooth? nice -10 is a good choice, it's not too low and
> > > not too high, i'm using that value myself.
> >
> > Yes, mostly.
> > In "normal operation mode" X at 0 is good.
> >
> > I only get some very slight but noticeable unsmoothness with several KDE
> > apps running. It could be KDE itself.
> >
> > Latency degradation since -J4 remains. But that could be due to some
> > missing lock-breaks.
> >
> > Robert's latest BKL stuff halved the numbers for the two
> > latencytest0.42-png graphic benches on 2.4.18-pre7 + patches.
> Addition:
> I am most worried about the occasional kernel hangs during make
> bzlilo/bzImage. It appears to happen during lilo (???). The new kernel
> (vmlinux) is still built in /usr/src/linux, but it, together with the
> "latest" *.o and .*.flags files, ends up broken due to ReiserFS
> transaction replay after the reboot (normal behavior for a journaling FS
> without full data journaling).
>
> Maybe a RACE between lilo vs sync?
Dieter Nützel wrote:
>
> On Fri, Feb 08 2002, Andrew Morton wrote:
> > Here's a patch which addresses the get_request starvation problem.
>
> [snip]
>
> > Also, you noted the other day that a LILO run *never* terminated when
> > there was a dbench running. This is in fact not due to request
> > starvation. It's due to livelock in invalidate_bdev(), which is called
> > via ioctl(BLKFLSBUF) by LILO. invalidate_bdev() will never terminate
> > as long as another task is generating locked buffers against the
> > device.
> [snip]
>
> Could this below be related?
> I get system locks through lilo (bzlilo) from time to time with all the
> latest kernels + O(1) + -aa + preempt.
>
Depends what you mean by "system locks". The invalidate_bdev() problem
won't lock the machine. Its symptoms are merely that the ioctl will
not terminate until the process which is writing to disk stops.
In other words: if you run dbench, then lilo, lilo will not complete
until after dbench terminates.
If you're seeing actual have-to-hit-reset lockups then that'll
be due to something quite different.
-
On Friday, 8. February 2002 20:44, Andrew Morton wrote:
> Dieter Nützel wrote:
> > On Fri, Feb 08 2002, Andrew Morton wrote:
> > > Here's a patch which addresses the get_request starvation problem.
> >
> > [snip]
> >
> > > Also, you noted the other day that a LILO run *never* terminated when
> > > there was a dbench running. This is in fact not due to request
> > > starvation. It's due to livelock in invalidate_bdev(), which is called
> > > via ioctl(BLKFLSBUF) by LILO. invalidate_bdev() will never terminate
> > > as long as another task is generating locked buffers against the
> > > device.
> >
> > [snip]
> >
> > Could this below be related?
> > I get system locks through lilo (bzlilo) from time to time with all the
> > latest kernels + O(1) + -aa + preempt.
>
> Depends what you mean by "system locks".
Hard lock up :-(
Nothing in the log files.
No SYSRQ works. Only reset. --- But ReiserFS works. A few corrupted
*.o.flag files, and most of the time a broken /usr/src/vmlinux or freshly
written /boot/vmlinuz file.
> The invalidate_bdev() problem won't lock the machine.
I see.
> In other words: if you run dbench, then lilo, lilo will not complete
> until after dbench terminates.
dbench wasn't running beforehand.
Only several "sync" commands (by hand) during the kernel build help.
> If you're seeing actual have-to-hit-reset lockups then that'll
> be due to something quite different.
Sadly, yes.
Thanks,
Dieter
On Fri, 8 Feb 2002, Andrew Morton wrote:
> In other words: if you run dbench, then lilo, lilo will not complete
> until after dbench terminates.
Actually, lilo completed after 5 minutes, but still only
about halfway through the dbench.
Rik
--
Will hack the VM for food.
http://www.surriel.com/ http://distro.conectiva.com/
On Fri, Feb 08, 2002 at 10:29:37AM -0800, Andrew Morton wrote:
> > > Here's a patch which addresses the get_request starvation problem.
> >
> > Looks like a tremendously important patch. Thanks!
>
> hmm. Dunno. I don't think it's very common to hit get_request
> starvation. Only for dbenchy things, I suspect. Still, it's
> good to have a design which addresses nasty corner cases.
It really works!
tiobench sequential reads with 32 threads went from max latency of
193 seconds down to 7 seconds with your make_request, read-latency2
and low-latency patches. (2.4.18-pre9-am1) The % of high latency
requests also went down.
K6-2 475 MHz with 384 MB RAM and reiserfs on IDE disks.
Average of 3 runs.
Total File size = 1024 MB. (individual file size = 1024 / num-threads)
Read, write, and seek rates in MB/sec.
Latency in milliseconds.
Percent of requests that took longer than 2 and 10 seconds.
Sequential Num Avg Maximum Lat% Lat% CPU
Reads Thr Rate (CPU%) Latency Latency >2s >10s Eff
--- ---------------------------------------------------
2.4.18-pre9-am1 8 8.94 13.33% 10.402 1785.52 0.000 0.000 67
2.4.18-pre9 8 8.85 13.22% 10.537 1954.27 0.000 0.000 67
2.4.18-pre9-am1 16 8.98 13.70% 30.827 4023.09 0.000 0.000 66
2.4.18-pre9 16 8.91 13.60% 31.122 4072.67 0.000 0.000 65
2.4.18-pre9-am1 32 9.00 14.16% 70.834 7032.45 0.000 0.000 64
2.4.18-pre9 32 8.87 13.89% 66.135 193590.10 0.030 0.023 64
Random Num Avg Maximum Lat% Lat% CPU
Reads Thr Rate (CPU%) Latency Latency >2s >10s Eff
--- ---------------------------------------------------
2.4.18-pre9-am1 8 0.69 1.828% 132.242 473.73 0.000 0.000 38
2.4.18-pre9 8 0.71 2.156% 128.869 510.38 0.000 0.000 33
2.4.18-pre9-am1 16 0.72 2.154% 371.288 1410.13 0.000 0.000 33
2.4.18-pre9 16 0.74 2.025% 364.871 1397.82 0.000 0.000 36
2.4.18-pre9-am1 32 0.74 2.229% 816.605 2996.07 0.000 0.000 33
2.4.18-pre9 32 0.75 2.163% 734.035 2841.97 0.000 0.000 35
Sequential Num Avg Maximum Lat% Lat% CPU
Writes Thr Rate (CPU%) Latency Latency >2s >10s Eff
--- ---------------------------------------------------
2.4.18-pre9-am1 8 11.45 52.33% 7.651 15095.52 0.058 0.000 22
2.4.18-pre9 8 9.23 40.84% 9.021 12351.12 0.040 0.000 23
2.4.18-pre9-am1 16 11.48 54.57% 22.253 41922.53 0.214 0.000 21
2.4.18-pre9 16 9.46 41.71% 26.374 28280.18 0.212 0.000 23
2.4.18-pre9-am1 32 11.37 55.91% 52.917 75331.67 0.679 0.004 20
2.4.18-pre9 32 9.50 42.55% 61.944 60970.74 0.770 0.005 22
Random Num Avg Maximum Lat% Lat% CPU
Writes Thr Rate (CPU%) Latency Latency >2s >10s Eff
--- ---------------------------------------------------
2.4.18-pre9-am1 8 0.62 2.393% 1.786 330.56 0.000 0.000 26
2.4.18-pre9 8 0.50 1.216% 1.700 333.32 0.000 0.000 41
2.4.18-pre9-am1 16 0.65 2.527% 4.533 996.54 0.000 0.000 26
2.4.18-pre9 16 0.52 1.267% 4.228 1236.08 0.000 0.000 41
2.4.18-pre9-am1 32 0.65 2.530% 7.980 1907.91 0.000 0.000 26
2.4.18-pre9 32 0.53 1.343% 7.818 2509.73 0.000 0.000 39
> > Do you have a patch for dbench too? :)
>
> /bin/rm?
Good one. It would be helpful though, if you've already done the work.
--
Randy Hron
Andrew Morton wrote:
>
> Here's a patch which addresses the get_request starvation problem.
>
Sharp-eyed Suparna noticed that the algorithm still works if the
low-water mark is set to zero (ie: rip it out) so I did that and
the code is a little simpler. Updated patch at
http://www.zip.com.au/~akpm/linux/2.4/2.4.18-pre9/make_request.patch
Also, here's a patch which fixes the /bin/sync livelock in
write_unlocked_buffers(). It simply bales out after writing
all the buffers which were dirty at the time the function
was called, rather than keeping on trying to write buffers
until the list is empty.
Given that /bin/sync calls write_unlocked_buffers() three times,
that's good enough. sync still takes aaaaaages, but it terminates.
The similar livelock in invalidate_bdev() is a bit trickier. I'm
not really sure what the semantics of invalidate_bdev() against a
device which has an r/w filesystem mounted on it are supposed to be.
I suspect we should just not call the function if the device is in
use, or call fsync_dev() or something.
--- linux-2.4.18-pre9/fs/buffer.c Thu Feb 7 13:04:21 2002
+++ linux-akpm/fs/buffer.c Sun Feb 10 21:50:48 2002
@@ -188,12 +188,13 @@ static void write_locked_buffers(struct
* return without it!
*/
#define NRSYNC (32)
-static int write_some_buffers(kdev_t dev)
+static int write_some_buffers(kdev_t dev, signed long *nr_to_write)
{
struct buffer_head *next;
struct buffer_head *array[NRSYNC];
unsigned int count;
int nr;
+ int ret;
next = lru_list[BUF_DIRTY];
nr = nr_buffers_type[BUF_DIRTY];
@@ -212,29 +213,38 @@ static int write_some_buffers(kdev_t dev
array[count++] = bh;
if (count < NRSYNC)
continue;
-
- spin_unlock(&lru_list_lock);
- write_locked_buffers(array, count);
- return -EAGAIN;
+ ret = -EAGAIN;
+ goto writeout;
}
unlock_buffer(bh);
__refile_buffer(bh);
}
+ ret = 0;
+writeout:
spin_unlock(&lru_list_lock);
-
- if (count)
+ if (count) {
write_locked_buffers(array, count);
- return 0;
+ if (nr_to_write)
+ *nr_to_write -= count;
+ }
+ return ret;
}
/*
- * Write out all buffers on the dirty list.
+ * Because we drop the locking during I/O it's not possible
+ * to write out all the buffers. So the only guarantee that
+ * we can make here is that we write out all the buffers which
+ * were dirty at the time write_unlocked_buffers() was called.
+ * fsync_dev() calls in here three times, so we end up writing
+ * many more buffers than ever appear on BUF_DIRTY.
*/
static void write_unlocked_buffers(kdev_t dev)
{
+ signed long nr_to_write = nr_buffers_type[BUF_DIRTY] * 2;
+
do {
spin_lock(&lru_list_lock);
- } while (write_some_buffers(dev));
+ } while (write_some_buffers(dev, &nr_to_write) && (nr_to_write > 0));
run_task_queue(&tq_disk);
}
@@ -1079,7 +1089,7 @@ void balance_dirty(void)
/* If we're getting into imbalance, start write-out */
spin_lock(&lru_list_lock);
- write_some_buffers(NODEV);
+ write_some_buffers(NODEV, NULL);
/*
* And if we're _really_ out of balance, wait for
@@ -2852,7 +2862,7 @@ static int sync_old_buffers(void)
bh = lru_list[BUF_DIRTY];
if (!bh || time_before(jiffies, bh->b_flushtime))
break;
- if (write_some_buffers(NODEV))
+ if (write_some_buffers(NODEV, NULL))
continue;
return 0;
}
@@ -2951,7 +2961,7 @@ int bdflush(void *startup)
CHECK_EMERGENCY_SYNC
spin_lock(&lru_list_lock);
- if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
+ if (!write_some_buffers(NODEV, NULL) || balance_dirty_state() < 0) {
wait_for_some_buffers(NODEV);
interruptible_sleep_on(&bdflush_wait);
}
-
There's something that's still bothering me - I wonder if I'm missing
something very obvious here.
> * So here's what we do:
> *
> * a) A READA requester fails if free_requests < batch_requests
> *
> * We don't want READA requests to prevent sleepers from ever
> * waking.
> *
> * When a process wants a new request:
> *
> * b) If free_requests == 0, the requester sleeps in FIFO manner.
> *
> * b) If 0 < free_requests < batch_requests and there are waiters,
> * we still take a request non-blockingly. This provides batching.
For a caller who just got an exclusive wakeup on the request queue,
this enables the woken up task to do batching and not sleep again on the
next request it needs. However since we use the same logic for new callers,
don't we still have those starvation issues ?
(i.e new callers end up stealing requests while there are sleepers,
given that wakeups will happen only if the queue builds up beyond the high
water mark)
> *
> * c) If free_requests >= batch_requests, the caller is immediately
> * granted a new request.
> *
> * When a request is released:
> *
> * d) If free_requests < batch_requests, do nothing.
On Mon, Feb 11, 2002 at 01:41:20AM -0800, Andrew Morton wrote:
> Andrew Morton wrote:
> >
> > Here's a patch which addresses the get_request starvation problem.
> >
>
> Sharp-eyed Suparna noticed that the algorithm still works if the
> low-water mark is set to zero (ie: rip it out) so I did that and
> the code is a little simpler. Updated patch at
> http://www.zip.com.au/~akpm/linux/2.4/2.4.18-pre9/make_request.patch
>
Suparna Bhattacharya wrote:
>
> ...
> > * When a process wants a new request:
> > *
> > * b) If free_requests == 0, the requester sleeps in FIFO manner.
> > *
> > * b) If 0 < free_requests < batch_requests and there are waiters,
> > * we still take a request non-blockingly. This provides batching.
>
> For a caller who just got an exclusive wakeup on the request queue,
> this enables the woken up task to do batching and not sleep again on the
> next request it needs. However since we use the same logic for new callers,
> don't we still have those starvation issues ?
Let's think of `incoming' tasks and `outgoing' ones. The incoming
tasks are running tasks which are making new requests, and the
outgoing ones are tasks which were on the waitqueue, and which are
in the process of being woken and granted a request.
Your concern is that incoming tasks can still steal all the requests
away from the sleepers. And sure, they can, until all requests are
gone. At which time the incoming tasks get to wait their turn on
the waitqueue. And generally, the problem is with writes, where we tend
to have a few threads performing a large number of requests.
Another scenario is where incoming tasks just take a handful of requests,
and their rate of taking equals the rate of request freeing, and the
free_request threshold remains greater than zero, less than batch_nr_requests.
This steady state could conceivably happen with the correct number of
threads performing O_SYNC/fsync writes, for example. I'll see if I can
find a way to make this happen.
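One way to try to provoke that steady state might be something like the
untested userspace sketch below: a bunch of processes doing small O_SYNC
writes to separate files on the same disk, so that each new request is
taken at roughly the rate the previous one is freed (the file names,
process count and write sizes are arbitrary).

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC	16
#define NWRITES	2000

int main(void)
{
	char buf[4096];
	char name[64];
	int i, n, fd;

	memset(buf, 0xaa, sizeof(buf));
	for (i = 0; i < NPROC; i++) {
		if (fork() != 0)
			continue;
		/* child: a stream of small synchronous writes */
		sprintf(name, "osync-test-%d", i);
		fd = open(name, O_WRONLY | O_CREAT | O_SYNC, 0644);
		if (fd < 0) {
			perror("open");
			exit(1);
		}
		for (n = 0; n < NWRITES; n++)
			if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
				perror("write");
				exit(1);
			}
		close(fd);
		exit(0);
	}
	for (i = 0; i < NPROC; i++)
		wait(NULL);
	return 0;
}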
The final scenario is where there are many outgoing tasks, so they
compete, and each gets an insufficiently large number of requests.
I suspect this is indeed happening with the current kernel's wake-all
code. But the wake-one change makes this unlikely.
When an outgoing task is woken, we know there are 32 free requests on the
queue. There's an assumption (or design) here that the outgoing task
will be able to get a decent number of those requests. This works. It
may fail if there are a massive number of CPUs. But we need to increase
the overall request queue size anyway - 128 is too small.
Under heavy load, the general operating mode for this code is that
damn near every task is asleep. So most work is done by outgoing threads.
BTW, I suspect the request batching isn't super-important. It'll
certainly decrease the context switch rate very much. But from a
request-merging point of view it's unlikely to make much difference.
-
On Mon, Feb 11, 2002 at 01:41:20AM -0800, Andrew Morton wrote:
> Also, here's a patch which fixes the /bin/sync livelock in
> write_unlocked_buffers(). It simply bales out after writing
> all the buffers which were dirty at the time the function
> was called, rather than keeping on trying to write buffers
> until the list is empty.
I've also seen this issue when I start multiple mkfs commands (10 in
parallel basically make the system useless), but the patch helped.
Any chance it will get into 2.4?
Thanks,
Jesse
On Mon, Feb 11, 2002 at 11:26:44AM -0800, Andrew Morton wrote:
> Suparna Bhattacharya wrote:
> >
> > ...
> > > * When a process wants a new request:
> > > *
> > > * b) If free_requests == 0, the requester sleeps in FIFO manner.
> > > *
> > > * b) If 0 < free_requests < batch_requests and there are waiters,
> > > * we still take a request non-blockingly. This provides batching.
> >
> > For a caller who just got an exclusive wakeup on the request queue,
> > this enables the woken up task to do batching and not sleep again on the
> > next request it needs. However since we use the same logic for new callers,
> > don't we still have those starvation issues ?
>
> Let's think of `incoming' tasks and `outgoing' ones. The incoming
> tasks are running tasks which are making new requests, and the
> outgoing ones are tasks which were on the waitqueue, and which are
> in the process of being woken and granted a request.
>
> Your concern is that incoming tasks can still steal all the requests
> away from the sleepers. And sure, they can, until all requests are
> gone. At which time the incoming tasks get to wait their turn on
> the waitqueue. And generally, the problem is with writes, where we tend
> to have a few threads performing a large number of requests.
>
> Another scenario is where incoming tasks just take a handful of requests,
> and their rate of taking equals the rate of request freeing, and the
> free_request threshold remains greater than zero, less than batch_nr_requests.
> This steady state could conceivably happen with the correct number of
> threads performing O_SYNC/fsync writes, for example. I'll see if I can
> find a way to make this happen.
Thanks for the explanation. (I was off-line for a couple of days, so I
didn't get to respond earlier.) This is closest to the case that I
was concerned about. Perhaps with merging/write-clustering at higher
levels the number of requests might be less, but that is secondary.
What was worrying me is that even when the rate of request taking
is greater than the rate of request freeing, couldn't we have every
1 out of N incoming tasks picking up a just freed up request, thus
starving the waiters ? (The rest N-1 incoming tasks might get to
wait their turn, where N:1 is the ratio of request taking to request
freeing)
>
> The final scenario is where there are many outgoing tasks, so they
> compete, and each gets an insufficiently large number of requests.
> I suspect this is indeed happening with the current kernel's wake-all
> code. But the wake-one change makes this unlikely.
>
> When an outgoing task is woken, we know there are 32 free requests on the
> queue. There's an assumption (or design) here that the outgoing task
> will be able to get a decent number of those requests. This works. It
> may fail if there are a massive number of CPUs. But we need to increase
> the overall request queue size anyway - 128 is too small.
Yes, especially when the system has sufficient resources to increase
the request queue size, one might as well queue up on the request
queue, rather than the wait queue, and minimize those context switches.
>
> Under heavy load, the general operating mode for this code is that
> damn near every task is asleep. So most work is done by outgoing threads.
>
> BTW, I suspect the request batching isn't super-important. It'll
> certainly decrease the context switch rate very much. But from a
> request-merging point of view it's unlikely to make much difference.
>
> -