2000-12-09 18:06:50

by Jens Axboe

Subject: patch: blk-12

Hi,

I've released what will probably be the last blk-xx patch for 2.4, at
least as far as features go. In fact, blk-12 is just minor tweaks and
fixes over the previous version. Highlights of the changes:

o Merge elevator merge and insertion scan. This saves an entire linear
queue scan when we can't merge a new buffer into the existing list.

o Fairer merge accounting: actually take request size into account
when merging.

o Clean up leftover cruft from the previous elevator (nr_segments etc)

o Request queue aging.

o Batch freeing of requests. Stock kernels have very bad behaviour
under I/O load (load here meaning that the request freelist is empty,
which doesn't require much effort...), because as soon as a request is
completed and put back on the freelist, a read/write will grab it
and the queue will be unplugged again. This effectively disables
elevator merging efforts completely. Note -- even though wakeups
of wait_for_request are now not a 1-1 mapping, wake-one semantics
are maintained. The batching idea is sketched just after this list.

o Fix indeterminate sg request completion time, caused by scsi_insert_*
not guaranteeing an immediate queue run.

o Fix off-by-one error in ide-dma setup

o Bump max request size from 64KB to 1MB, and let low level drivers
set their own limits (eg IDE has a 128KB hw limit). There's no need
to limit nice SCSI hardware, since the data phase is where we get
full throttle.

o Remove silly s/390 double request get error
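
For illustration, the batching idea is roughly the following (a sketch
with hypothetical names and an assumed batch size, not the actual
blk-12 code):

/*
 * Completed requests are parked on a pending list and only moved to
 * the freelist -- where a new read/write could grab them and unplug
 * the queue -- once a whole batch has accumulated.  Types and the
 * batch size are made up for the example.
 */
#define BATCH_FREE_COUNT 32		/* assumed batch size */

struct sketch_request {
	struct sketch_request *next;
};

struct sketch_list {
	struct sketch_request *head;
	int count;
};

/* Called when a request completes (sketch). */
static void batched_free_request(struct sketch_list *pending,
				 struct sketch_list *freelist,
				 struct sketch_request *rq)
{
	rq->next = pending->head;
	pending->head = rq;

	if (++pending->count < BATCH_FREE_COUNT)
		return;				/* keep batching */

	/* Release the whole batch at once and wake waiters one time. */
	while (pending->head) {
		struct sketch_request *r = pending->head;

		pending->head = r->next;
		r->next = freelist->head;
		freelist->head = r;
		freelist->count++;
	}
	pending->count = 0;
	/* wake_up(&q->wait_for_request) would go here, once per batch */
}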

It's against 2.4.0-test12-pre7, and can be found here:

*.kernel.org/pub/linux/kernel/axboe/patches/2.4.0-test12-pre7/blk-12.bz2

--
* Jens Axboe <[email protected]>
* SuSE Labs


2000-12-10 03:24:42

by Andrew Morton

Subject: Re: patch: blk-12

Jens Axboe wrote:
>
> o Batch freeing of requests. Stock kernels have very bad behaviour
> under I/O load (load here meaning that the request list is empty,
> doesn't require much effort...), because as soon as a request is
> completed and put back on the freelist, a read/write will grab it
> and the queue will be unplugged again. This effectively disables
> elevator merging efforts completely. Note -- even though wakeups
> of wait_for_request are now not a 1-1 mapping, wake-one semantics
> are maintained.
>

Exclusive waitqueues do not give wake-one semantics. They give
us wake-0.999999999 semantics :)

 1: static struct request *__get_request_wait(request_queue_t *q, int rw)
 2: {
 3:         register struct request *rq;
 4:         DECLARE_WAITQUEUE(wait, current);
 5:
 6:         add_wait_queue_exclusive(&q->wait_for_request, &wait);
 7:         for (;;) {
 8:                 __set_current_state(TASK_UNINTERRUPTIBLE);
 9:                 spin_lock_irq(&io_request_lock);
10:                 rq = get_request(q, rw);
11:                 spin_unlock_irq(&io_request_lock);
12:                 if (rq)
13:                         break;
14:                 generic_unplug_device(q);
15:                 schedule();
16:         }
17:         remove_wait_queue(&q->wait_for_request, &wait);
18:         current->state = TASK_RUNNING;
19:         return rq;
20: }

Suppose there are two tasks on the waitqueue. Someone does a wakeup.
One task wakes, loops around to line 8. It sets TASK_UNINTERRUPTIBLE
and is now eligible for another wakeup. So if someone does a wakeup
on the waitqueue while this task is running statements 9, 10, 11, 12,
13 and 17, that wakeup will be sent to the newly woken task.

The net effect: two tasks were sleeping, there were two wakeups, but
only one task woke up.

Is this a problem in practice? Presumably not. This is probably
because the waitqueue is getting wakeups sent to it all the time,
and we muddle through.

One place where this race _would_ bite is in semaphores. __down()
has the same race, but it does an extra wake_up() on the queue after it
has taken itself off, which fixes things. So `up()' can in fact be
wake-two. err, make that wake-1.999999999.
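
For illustration, that trick transplanted onto __get_request_wait()
would look roughly like this (a sketch of the __down()-style band-aid,
not a proposed patch):

/*
 * Sketch: after taking ourselves off the waitqueue, pass one wakeup
 * back to the queue in case we absorbed a wakeup that was meant for
 * a task which is still sleeping.
 */
static struct request *__get_request_wait(request_queue_t *q, int rw)
{
	register struct request *rq;
	DECLARE_WAITQUEUE(wait, current);

	add_wait_queue_exclusive(&q->wait_for_request, &wait);
	for (;;) {
		__set_current_state(TASK_UNINTERRUPTIBLE);
		spin_lock_irq(&io_request_lock);
		rq = get_request(q, rw);
		spin_unlock_irq(&io_request_lock);
		if (rq)
			break;
		generic_unplug_device(q);
		schedule();
	}
	remove_wait_queue(&q->wait_for_request, &wait);
	current->state = TASK_RUNNING;
	wake_up(&q->wait_for_request);	/* hand back any extra wakeup */
	return rq;
}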

The moral is: be careful with exclusive waitqueues. They're
a minefield.

There's another interesting race with exclusive waitqueues. In
a few places in the kernel a task can be on two exclusive waitqueues
at the same time. If two CPUs simultaneously direct a wakeup to
the two waitqueues they can race and the net effect is that
they both wake the same task. Depending on the behaviour
of the woken task when it returns to the outer waitloop that
can be a lost wakeup.

My proposal to fix this with a global spinlock in __wake_up_common
got nixed at Penguin HQ. Instead I have a patch lying around here
somewhere which makes wake_up_process() return a success value. If
__wake_up_common tries to wake an exclusive process but fails, it
keeps looking for someone to wake up.
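
Roughly, the idea is something like this (a simplified, self-contained
sketch with made-up types, not the actual __wake_up_common() or
wake_up_process() code):

/*
 * try_wake() reports whether it actually moved a sleeper to runnable,
 * so the scan can skip an exclusive waiter that was already awake and
 * keep looking for a waiter that really needs the wakeup.
 */
struct sketch_waiter {
	int sleeping;			/* 1 while the task is in schedule() */
	int exclusive;			/* exclusive (wake-one) waiter? */
	struct sketch_waiter *next;
};

/* Returns 1 if we woke the waiter, 0 if it was already awake. */
static int try_wake(struct sketch_waiter *w)
{
	if (!w->sleeping)
		return 0;
	w->sleeping = 0;
	return 1;
}

static void wake_up_sketch(struct sketch_waiter *head)
{
	struct sketch_waiter *w;

	for (w = head; w; w = w->next) {
		if (w->exclusive) {
			if (try_wake(w))
				break;	/* woke one exclusive waiter, done */
			continue;	/* already awake: keep looking */
		}
		try_wake(w);		/* non-exclusive waiters all get woken */
	}
}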
