[Sorry again for the resend, the previous mail got bounced off lkml]
Bug description:
The AS scheduler alternates between issuing read and write batches, and it
switches batches only after all requests from the previous batch have
completed.

When switching to a write batch while a read request is still in flight, the
scheduler waits for that request to complete and records its intention to
switch by setting ad->changed_batch and the new direction. Unlike the case
where no requests are pending, however, it does not update the batch expire
time for the new write batch.
On completion of the read request, it sees that a switch was pending,
schedules work for kblockd right away and clears the ad->changed_batch flag.
When kblockd then enters dispatch_request, where it is expected to pick up a
write request, it instead ends the write batch right away, because the batch
expire time was never updated and still holds the expiry timestamp of the
previous batch.

The result is write starvation in every case where the scheduler intends to
switch to a write batch while a read request is still in flight: the new
write batch gets reverted to a read batch right away.
This also holds true in the reverse case (switching from a write batch
to a read batch with an in-flight write request).
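To make the sequence above concrete, here is a small user-space sketch of the
state transitions involved. It is only an illustration: the field names
(changed_batch, current_batch_expires, batch_expire[], batch_data_dir) follow
block/as-iosched.c, but the functions, enum and jiffies values are simplified
stand-ins rather than the kernel code.

====
/*
 * Illustrative user-space model of the batch-switch path described above.
 * Field names follow block/as-iosched.c; everything else is a stand-in.
 */
#include <stdbool.h>
#include <stdio.h>

enum { READ_BATCH = 0, WRITE_BATCH = 1 };

struct as_model {
	unsigned long current_batch_expires;
	unsigned long batch_expire[2];	/* batch length per direction, in "jiffies" */
	int batch_data_dir;		/* direction of the pending/new batch */
	bool changed_batch;		/* a batch switch is pending */
};

static unsigned long jiffies;

/* The last in-flight request from the old (read) batch completes. */
static void completed_request(struct as_model *ad, bool with_fix)
{
	if (ad->changed_batch) {
		if (with_fix)	/* the two lines the patch below adds */
			ad->current_batch_expires = jiffies +
				ad->batch_expire[ad->batch_data_dir];
		/* kblockd is scheduled to dispatch the new batch */
		ad->changed_batch = false;
	}
}

/* What the dispatch path sees when it goes to issue the first write. */
static void dispatch_request(const struct as_model *ad)
{
	if (jiffies >= ad->current_batch_expires)
		printf("expiry %lu already passed -> write batch ended, back to reads\n",
		       ad->current_batch_expires);
	else
		printf("write batch may run until %lu\n", ad->current_batch_expires);
}

int main(void)
{
	struct as_model ad = { .batch_expire = { 125, 25 } };	/* made-up values */

	jiffies = 200;			/* "now" */
	ad.current_batch_expires = 150;	/* expiry left over from the read batch */
	ad.batch_data_dir = WRITE_BATCH;/* intention: switch to writes */

	ad.changed_batch = true;
	completed_request(&ad, false);	/* buggy path: expiry not refreshed */
	dispatch_request(&ad);		/* -> write batch ends immediately */

	ad.changed_batch = true;
	completed_request(&ad, true);	/* fixed path: expiry refreshed */
	dispatch_request(&ad);		/* -> write batch gets its full window */
	return 0;
}
====

Without the refreshed expiry, the dispatch side sees a timestamp that is
already in the past and ends the write batch immediately; with it, the write
batch gets its full window.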
I've checked that this bug exists on 2.6.11, 2.6.18, 2.6.24 and
linux-2.6-block git HEAD. I've tested the fix on x86 platforms with
SCSI drives where the driver asks for the next request while a current
request is in-flight.
This patch is based on linux-2.6-block git HEAD.
Bug reproduction:
A simple scenario that reproduces this bug is:
- dd if=/dev/hda3 of=/dev/null &
- lilo
lilo then takes forever to complete.
This can also be reproduced fairly easily with the dd above and another test
program doing msync().

The example test program below should print a message after every iteration,
but without the fix it simply hangs forever. With this bugfix it makes
forward progress.
====
Example test program using msync() (thanks to suleiman AT google DOT com)
#define _GNU_SOURCE		/* for O_NOATIME */
#include <err.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>

static inline uint64_t
rdtsc(void)
{
	uint64_t tsc;

	/* 32-bit x86 form: rdtsc leaves the timestamp in edx:eax ("=A").
	 * On x86-64 this constraint does not return the full 64-bit TSC. */
	__asm __volatile("rdtsc" : "=A" (tsc));
	return (tsc);
}

int
main(int argc, char **argv)
{
	struct stat st;
	uint64_t e, s, t;
	char *p;
	long i;
	int fd;

	if (argc < 2) {
		printf("Usage: %s <file>\n", argv[0]);
		return (1);
	}

	if ((fd = open(argv[1], O_RDWR | O_NOATIME)) < 0)
		err(1, "open");

	if (fstat(fd, &st) < 0)
		err(1, "fstat");

	p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");

	t = 0;
	for (i = 0; i < 1000; i++) {
		/* Dirty the first page and push it out with a synchronous msync(). */
		*p = 0;
		msync(p, 4096, MS_SYNC);

		/* Time a dirtying store, with a compiler barrier around it. */
		s = rdtsc();
		*p = 0;
		__asm __volatile("" ::: "memory");
		e = rdtsc();

		if (argc > 2)
			printf("%ld: %" PRIu64 " cycles %jd %jd\n",
			    i, e - s, (intmax_t)s, (intmax_t)e);
		t += e - s;
	}
	printf("average time: %" PRIu64 " cycles\n", t / 1000);
	return (0);
}
====
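For reference, the program above would typically be built with something like
"gcc -O2 -o msync_test msync_test.c" and pointed at a file on the same disk
the dd above is reading; on an affected kernel the msync() loop stalls
instead of completing.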
block/as-iosched.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 8c39467..d4c9f9c 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -831,6 +831,8 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	}
 
 	if (ad->changed_batch && ad->nr_dispatched == 1) {
+		ad->current_batch_expires = jiffies +
+			ad->batch_expire[ad->batch_data_dir];
 		kblockd_schedule_work(&ad->antic_work);
 		ad->changed_batch = 0;
--
-Divyesh Shah
Well, thanks for the report and test case. I'm a bit out of the
loop when it comes to IO scheduling these days, however the fix
seems good to me.
Does google still use AS scheduling? This doesn't introduce any
performance regressions that you can tell?
Andrew might be away this week I think. Jens, any chance you can
pick this up and get it merged when you feel it is ready?
Acked-by: Nick Piggin <[email protected]>
Thanks,
Nick
On Mon, Jun 16 2008, Nick Piggin wrote:
> Well, thanks for the report and test case. I'm a bit out of the
> loop when it comes to IO scheduling these days, however the fix
> seems good to me.
>
> Does google still use AS scheduling? This doesn't introduce any
> performance regressions that you can tell?
>
> Andrew might be away this week I think. Jens, any chance you can
> pick this up and get it merged when you feel it is ready?
It looks good to me, I'll queue it up.
--
Jens Axboe
On Mon, 16 Jun 2008, Nick Piggin wrote:
> Well, thanks for the report and test case. I'm a bit out of the
> loop when it comes to IO scheduling these days, however the fix
> seems good to me.
>
> Does google still use AS scheduling? This doesn't introduce any
> performance regressions that you can tell?
Yes, we use the AS scheduler. Benchmarking results on our workloads have shown
that AS does as good a job as CFQ (and a better one in some cases). When there
is a single thread reading from or writing to disk, AS performs as well as CFQ.
With multiple threads doing IO such that there are many outstanding requests,
CFQ performs worse than AS in terms of throughput.
-Divyesh.