2005-03-22 21:51:05

by Mark Seger

[permalink] [raw]
Subject: Patch for inconsistent recording of block device statistics

The read/write statistics for both sectors and merges are calculated at
the time requests first enter the request queue but the remainder of the
statistics, such as the number of read/writes are calculated at the time
the I/O completes. As a result, one cannot accurately determine the
data rates read or written at the actual time the I/O is performed.
This behavior is masked with smaller queue sizes but is very real and
was very noticeable with earlier 2.6 kenels using the cfq scheduler
which had a default queue size of 8192 where the time difference between
these sets of counters could exceed 10 seconds for large file writes and
small monitoring intervals such as 1 second. In that environment, one
would see extremely high bursts of I/O, sometimes exceeding 500 or even
1000 MB/sec for the first second or two and then drop to 0 for a long
time while the 'number of operations' counters accurately reflect what
is really happening.

The attached patch fixes this problem by simply accumulating the
read/write sector/merge data in temporary variables stored in the
request queue entry, and when the I/O completes copies those values to
the disk statistics block.

-mark

diff -uprN -X dontdiff ../linux-2.6.11.4/drivers/block/ll_rw_blk.c
../linux-2.6.11.4-mjs/drivers/block/ll_rw_blk.c
--- ../linux-2.6.11.4/drivers/block/ll_rw_blk.c 2005-03-15
19:09:00.000000000 -0500
+++ ../linux-2.6.11.4-mjs/drivers/block/ll_rw_blk.c 2005-03-22
15:43:07.000000000 -0500
@@ -2107,13 +2107,13 @@ void drive_stat_acct(struct request *rq,
return;

if (rw == READ) {
- __disk_stat_add(rq->rq_disk, read_sectors, nr_sectors);
+ rq->read_sectors_accum += nr_sectors;
if (!new_io)
- __disk_stat_inc(rq->rq_disk, read_merges);
+ rq->read_merges_accum += 1;
} else if (rw == WRITE) {
- __disk_stat_add(rq->rq_disk, write_sectors, nr_sectors);
+ rq->write_sectors_accum += nr_sectors;
if (!new_io)
- __disk_stat_inc(rq->rq_disk, write_merges);
+ rq->write_merges_accum += 1;
}
if (new_io) {
disk_round_stats(rq->rq_disk);
@@ -2487,6 +2487,11 @@ get_rq:
req->rq_disk = bio->bi_bdev->bd_disk;
req->start_time = jiffies;

+ req->write_sectors_accum=0;
+ req->write_merges_accum=0;
+ req->read_sectors_accum=0;
+ req->read_merges_accum=0;
+
add_request(q, req);
out:
if (freereq)
@@ -2989,10 +2994,14 @@ void end_that_request_last(struct reques
case WRITE:
__disk_stat_inc(disk, writes);
__disk_stat_add(disk, write_ticks, duration);
+ __disk_stat_add(disk, write_sectors,
req->write_sectors_accum);
+ __disk_stat_add(disk, write_merges,
req->write_merges_accum);
break;
case READ:
__disk_stat_inc(disk, reads);
__disk_stat_add(disk, read_ticks, duration);
+ __disk_stat_add(disk, read_sectors,
req->read_sectors_accum);
+ __disk_stat_add(disk, read_merges,
req->read_merges_accum);
break;
}
disk_round_stats(disk);
diff -uprN -X dontdiff ../linux-2.6.11.4/include/linux/blkdev.h
../linux-2.6.11.4-mjs/include/linux/blkdev.h
--- ../linux-2.6.11.4/include/linux/blkdev.h 2005-03-15
19:09:02.000000000 -0500
+++ ../linux-2.6.11.4-mjs/include/linux/blkdev.h 2005-03-22
15:42:47.000000000 -0500
@@ -176,6 +176,12 @@ struct request {
* For Power Management requests
*/
struct request_pm_state *pm;
+
+ /*
+ * accumulate intermediate stats
+ */
+ unsigned long read_sectors_accum, write_sectors_accum;
+ unsigned long read_merges_accum, write_merges_accum;
};

/*


2005-03-23 09:20:10

by Jens Axboe

[permalink] [raw]
Subject: Re: Patch for inconsistent recording of block device statistics

On Tue, Mar 22 2005, Mark Seger wrote:
> The read/write statistics for both sectors and merges are calculated at
> the time requests first enter the request queue but the remainder of the
> statistics, such as the number of read/writes are calculated at the time
> the I/O completes. As a result, one cannot accurately determine the
> data rates read or written at the actual time the I/O is performed.
> This behavior is masked with smaller queue sizes but is very real and
> was very noticeable with earlier 2.6 kenels using the cfq scheduler
> which had a default queue size of 8192 where the time difference between
> these sets of counters could exceed 10 seconds for large file writes and
> small monitoring intervals such as 1 second. In that environment, one
> would see extremely high bursts of I/O, sometimes exceeding 500 or even
> 1000 MB/sec for the first second or two and then drop to 0 for a long
> time while the 'number of operations' counters accurately reflect what
> is really happening.
>
> The attached patch fixes this problem by simply accumulating the
> read/write sector/merge data in temporary variables stored in the
> request queue entry, and when the I/O completes copies those values to
> the disk statistics block.

I don't like this patch, it adds 4 * sizeof(unsigned long) to struct
request when it can be solved without adding anything. The idea is
sound, though, the current way the stats are done isn't very
interesting.

How about accounting merges the way we currently do it, since that piece
of the stats _is_ interesting at queueing time. And then account
completion in __end_that_request_first(). Untested patch attached.

===== drivers/block/ll_rw_blk.c 1.287 vs edited =====
--- 1.287/drivers/block/ll_rw_blk.c 2005-03-11 21:32:27 +01:00
+++ edited/drivers/block/ll_rw_blk.c 2005-03-23 10:10:39 +01:00
@@ -2294,16 +2293,12 @@
if (!blk_fs_request(rq) || !rq->rq_disk)
return;

- if (rw == READ) {
- __disk_stat_add(rq->rq_disk, read_sectors, nr_sectors);
- if (!new_io)
+ if (!new_io) {
+ if (rw == READ)
__disk_stat_inc(rq->rq_disk, read_merges);
- } else if (rw == WRITE) {
- __disk_stat_add(rq->rq_disk, write_sectors, nr_sectors);
- if (!new_io)
+ else
__disk_stat_inc(rq->rq_disk, write_merges);
- }
- if (new_io) {
+ } else {
disk_round_stats(rq->rq_disk);
rq->rq_disk->in_flight++;
}
@@ -3063,6 +3069,13 @@
(unsigned long long)req->sector);
}

+ if (blk_fs_request(req)) {
+ if (rq_data_dir(req) == READ)
+ __disk_stat_add(req->rq_disk, read_sectors, nr_bytes >> 9);
+ else
+ __disk_stat_add(req->rq_disk, write_sectors, nr_bytes >> 9);
+ }
+
total_bytes = bio_nbytes = 0;
while ((bio = req->bio) != NULL) {
int nbytes;

--
Jens Axboe

2005-03-23 14:40:54

by Mark Seger

[permalink] [raw]
Subject: Re: Patch for inconsistent recording of block device statistics


>I don't like this patch, it adds 4 * sizeof(unsigned long) to struct
>request when it can be solved without adding anything. The idea is
>sound, though, the current way the stats are done isn't very
>interesting.
>
>
Actually I wasn't all that excited about using the extra variable
myself. However, I wasn't entirely sure what was going on and this at
least allowed me to test the concept without doing anything harmful.

>How about accounting merges the way we currently do it, since that piece
>of the stats _is_ interesting at queueing time. And then account
>completion in __end_that_request_first(). Untested patch attached.
>
>
I also agree with your suggestion about keeping the merged counts where
they are and am copying the author of iostat to suggest the man page be
updated to reflect the fact that merges are counts for requests queued
rather than 'issued to the device' as it currently states.

re: your patch - I did try it on both an Operton and Xeon box. It
worked find on the Opeteron and reported 0 for all the sectors on the
Xeon. If nothing immediately jumps to your mind could it have been
something I did wrong? I'll try another build after I send this along,
but I don't see how that will help as I did the first one from a brand
new source kit.

The one thing that still jumps out at me about this patch is that the
sectors are being counted in one routine and the number of I/Os in
another. If the best place to update the sector counts is indeed where
you suggest doing it, is there any reason not to move the update code
for all the disk stats from end_that_request_last() to that same place
as well for consistency and for better assurances that they are updated
as close to the same point in time as possible?

-mark

2005-03-23 15:49:34

by Mark Seger

[permalink] [raw]
Subject: Process level I/O stats?

Apologies if this has been discussed recently but I couldn't find
anything. As I've seen others ask over the years, have there been any
newer thoughts on when and how this capability might be added?
-mark

2005-03-23 15:52:22

by Jens Axboe

[permalink] [raw]
Subject: Re: Patch for inconsistent recording of block device statistics

On Wed, Mar 23 2005, Mark Seger wrote:
>
> >I don't like this patch, it adds 4 * sizeof(unsigned long) to struct
> >request when it can be solved without adding anything. The idea is
> >sound, though, the current way the stats are done isn't very
> >interesting.
> >
> >
> Actually I wasn't all that excited about using the extra variable
> myself. However, I wasn't entirely sure what was going on and this at
> least allowed me to test the concept without doing anything harmful.
>
> >How about accounting merges the way we currently do it, since that piece
> >of the stats _is_ interesting at queueing time. And then account
> >completion in __end_that_request_first(). Untested patch attached.
> >
> >
> I also agree with your suggestion about keeping the merged counts where
> they are and am copying the author of iostat to suggest the man page be
> updated to reflect the fact that merges are counts for requests queued
> rather than 'issued to the device' as it currently states.
>
> re: your patch - I did try it on both an Operton and Xeon box. It
> worked find on the Opeteron and reported 0 for all the sectors on the
> Xeon. If nothing immediately jumps to your mind could it have been
> something I did wrong? I'll try another build after I send this along,
> but I don't see how that will help as I did the first one from a brand
> new source kit.

Sounds very strange, it is generic code so should work for all.
Different storage?

> The one thing that still jumps out at me about this patch is that the
> sectors are being counted in one routine and the number of I/Os in
> another. If the best place to update the sector counts is indeed where
> you suggest doing it, is there any reason not to move the update code
> for all the disk stats from end_that_request_last() to that same place
> as well for consistency and for better assurances that they are updated
> as close to the same point in time as possible?

The reason that the sector counting is done in end_that_request_first()
is that it may not be valid in end_that_request_last().
end_that_request_first() may be invoked several times for a single
request, so I did not move the 'number of io count' there as well as
that would require additional tracking in the request. But
end_that_request_last() will in 99.9% of the cases be called _right_
after end_that_request_first(), so I think it should work fine. The
cases where that doesn't happen is for partial io completions.

--
Jens Axboe

2005-03-23 16:02:36

by Jens Axboe

[permalink] [raw]
Subject: Re: Process level I/O stats?

On Wed, Mar 23 2005, Mark Seger wrote:
> Apologies if this has been discussed recently but I couldn't find
> anything. As I've seen others ask over the years, have there been any
> newer thoughts on when and how this capability might be added?

Several patches have been posted for that, try and search the
linux-kernel archive for per-process io stats.

--
Jens Axboe

2005-03-23 18:23:12

by Mark Seger

[permalink] [raw]
Subject: Re: Patch for inconsistent recording of block device statistics


>>re: your patch - I did try it on both an Operton and Xeon box. It
>>worked find on the Opeteron and reported 0 for all the sectors on the
>>Xeon. If nothing immediately jumps to your mind could it have been
>>something I did wrong? I'll try another build after I send this along,
>>but I don't see how that will help as I did the first one from a brand
>>new source kit.
>>
>>
>
>Sounds very strange, it is generic code so should work for all.
>Different storage?
>
>
Works fine now. Obviously I screwed up something and just wanted to let
you know it was cockpit error on my end.
Is your plan to move this into some future kernel? Do you need anything
more from me at this point?

-mark

2005-03-23 18:35:58

by Jens Axboe

[permalink] [raw]
Subject: Re: Patch for inconsistent recording of block device statistics

On Wed, Mar 23 2005, Mark Seger wrote:
>
> >>re: your patch - I did try it on both an Operton and Xeon box. It
> >>worked find on the Opeteron and reported 0 for all the sectors on the
> >>Xeon. If nothing immediately jumps to your mind could it have been
> >>something I did wrong? I'll try another build after I send this along,
> >>but I don't see how that will help as I did the first one from a brand
> >>new source kit.
> >>
> >>
> >
> >Sounds very strange, it is generic code so should work for all.
> >Different storage?
> >
> >
> Works fine now. Obviously I screwed up something and just wanted to let
> you know it was cockpit error on my end.
> Is your plan to move this into some future kernel? Do you need anything
> more from me at this point?

Yes, I will make sure it gets committed. Thanks for your help so far.

--
Jens Axboe

2005-03-24 02:31:38

by Mark Goodwin

[permalink] [raw]
Subject: Re: Patch for inconsistent recording of block device statistics


On Wed, 23 Mar 2005, Jens Axboe wrote:
> Yes, I will make sure it gets committed. Thanks for your help so far.
>
> --
> Jens Axboe

Jens,

SGI needs this fix. Will it find it's way into SLES9/SP2?
Or should we open a SuSE bug?

Thanks
-- Mark Goodwin

2005-03-24 06:51:35

by Jens Axboe

[permalink] [raw]
Subject: Re: Patch for inconsistent recording of block device statistics

On Thu, Mar 24 2005, Mark Goodwin wrote:
>
> On Wed, 23 Mar 2005, Jens Axboe wrote:
> > Yes, I will make sure it gets committed. Thanks for your help so far.
> >
> > --
> > Jens Axboe
>
> Jens,
>
> SGI needs this fix. Will it find it's way into SLES9/SP2?
> Or should we open a SuSE bug?

A little OT here, but just open a suse bug.

--
Jens Axboe