2002-04-11 11:07:13

by Zlatko Calusic

Subject: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3

Hi!

I have ported Stephen's excellent sard I/O accounting patch to
2.5.8-pre3. Additionally, I have adapted and cleaned up the iostat
utility (originally written by Greg Franks) and created a man page for it.

All this gives Linux much-needed functionality - extensive measurement
and reporting of disk device performance and bottlenecks. No server
should be without it. See below for a real-life story.

The patch, the iostat utility, and the man page can be found at

<URL:http://linux.inet.hr/>

The patch is also appended to this mail, for quicker browsing.


Some observations:
-----------------

The patch is quite well tested and stable (that is, it shouldn't eat
your machine :)) on both UP and SMP configurations, but I have tested
it only on IDE hardware. Please test it on your configuration
(especially SCSI) and report the results.

I had to make some changes, as kernel internals have changed since the
patch was originally written. Alan has also included this patch in his
2.4.x-ac kernel series, but it is not working well there.

The first problem is that somehow, mysteriously, the ios_in_flight
variable drops to -1 when the disks are idle. Of course, this skews
lots of other numbers and iostat reports garbage. I tried to find the
cause of this behaviour, but failed (it looks like we have a request
fired on each disk whose start is never accounted but whose completion
is?!). So I resolved it this way:

if (hd->ios_in_flight)
--hd->ios_in_flight;

which works well, but I would still love to know how the number of
I/Os in flight can drop below zero. :)

The second problem/nuisance is that blk_partition_remap() destroys the
partition information in bio->bi_dev before the request is queued.
That's why the -ac kernel doesn't report per-partition information. To
resolve that, I had to put additional code in locate_hd_struct() to
extract the partition information back again, which is kind of ugly
(first we destroy it, then we recreate it). The other way we could
solve this is to put something like bio->bi_acctdev in the bio
structure and keep the original device there, but that is not only a
duplication of info (which is also ugly), it would also waste precious
memory. I don't know if the block remapping that happens in the
above-mentioned function is strictly necessary. Jens, could you help
with that one and/or suggest which way to go?
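
For illustration, roughly what that bi_acctdev alternative would look
like (a hypothetical sketch only - bi_acctdev is precisely the extra
field described above and does not exist in 2.5.8-pre3, and the exact
blk_partition_remap() signature is assumed):

/* bi_acctdev is the assumed extra kdev_t member discussed above; it
 * does NOT exist in the real 2.5.8-pre3 struct bio. */
static inline void bio_remap_with_acct(struct bio *bio)
{
	bio->bi_acctdev = bio->bi_dev;	/* keep the pre-remap (partition)
					 * device for I/O accounting */
	blk_partition_remap(bio);	/* rewrites bi_dev/bi_sector to the
					 * whole disk */
}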


TODO:
----

Please check the patch on whatever hardware you're equipped with and
report success/failure. Somebody should really go through the man page
and correct the grammar mistakes :), and maybe make it more readable
in places. By the time this patch is ready for inclusion in the Linus
kernel, the old disk statistics counters should be removed (they are
buggy, incomplete, and obsolete).

This patch, together with the iostat utility, is probably a necessary
ingredient for quality measurements of the new IDE-TCQ code. Too bad I
have a Maxtor drive on a Promise controller here (any hope of me
running TCQ on that combination in the future?).


Real life story:
---------------

So, I had this application heavily inserting data into an Oracle
database, and its performance mattered to me. As the disk rattled the
whole time the inserts were going on, I decided I needed to spend some
time reordering datafiles to achieve peak performance (disk I/O was
obviously the performance bottleneck). But after an hour or two I got
only a few percent improvement. How could that happen?!

Well, it turned out that the database and application were already
tuned quite well, and the load on the database was (correctly)
CPU-bound. As I have an SMP machine, I forgot that the 40% free CPU
could be on the second processor, obviously unusable by the app in
question. And the disk rattling was normal: lots of fsyncs, some
seeks...

So, when I finally managed to port this patch to 2.5.x (we are
speaking of a home development database; 2.5.x kernels are OK for
those, right?! :)), iostat showed only ~20% disk utilization! That is,
there was _NO_ heavy I/O contention, and the disks were _NOT_ the
culprit. Live and learn.

If only I had known that before I wasted my time needlessly moving
big datafiles around and fixing the configuration... :(
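
(For reference, a utilization figure like that ~20% falls out of two
samples of the per-device "use" column this patch adds to
/proc/partitions - milliseconds the device was busy. A minimal
standalone C sketch with made-up sample values; the real iostat
utility of course parses the file:)

#include <stdio.h>

int main(void)
{
	/* two samples of the "use" field, taken 5 seconds apart */
	unsigned long use0 = 16504720;	/* made-up value at t0, in ms */
	unsigned long use1 = 16505720;	/* made-up value at t0 + 5s */
	unsigned long interval_ms = 5000;

	double util = 100.0 * (use1 - use0) / interval_ms;
	printf("%%util: %.1f\n", util);	/* prints: %util: 20.0 */
	return 0;
}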


Regards,

Zlatko


Index: 7.12/include/linux/blkdev.h
--- 7.12/include/linux/blkdev.h Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/n/b/1_blkdev.h 1.1.1.1.1.1.1.2.2.1 644)
+++ 7.14/include/linux/blkdev.h Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/n/b/1_blkdev.h 1.1.1.1.1.1.1.2.2.2 644)
@@ -34,6 +34,7 @@
int rq_status; /* should split this into a few status bits */
kdev_t rq_dev;
int errors;
+ unsigned long start_time;
sector_t sector;
unsigned long nr_sectors;
unsigned long hard_sector; /* the hard_* are block layer
Index: 7.12/include/linux/genhd.h
--- 7.12/include/linux/genhd.h Tue, 15 Jan 2002 22:59:02 +0100 zcalusic (linux25/o/b/18_genhd.h 1.1.1.1.1.1 644)
+++ 7.14/include/linux/genhd.h Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/o/b/18_genhd.h 1.1.1.1.1.1.2.1 644)
@@ -61,6 +61,22 @@
unsigned long start_sect;
unsigned long nr_sects;
devfs_handle_t de; /* primary (master) devfs entry */
+
+ /* Performance stats: */
+ unsigned int ios_in_flight;
+ unsigned int io_ticks;
+ unsigned int last_idle_time;
+ unsigned int last_queue_change;
+ unsigned int aveq;
+
+ unsigned int rd_ios;
+ unsigned int rd_merges;
+ unsigned int rd_ticks;
+ unsigned int rd_sectors;
+ unsigned int wr_ios;
+ unsigned int wr_merges;
+ unsigned int wr_ticks;
+ unsigned int wr_sectors;
int number; /* stupid old code wastes space */
};

@@ -238,6 +254,19 @@
#ifdef __KERNEL__

char *disk_name (struct gendisk *hd, int minor, char *buf);
+
+/*
+ * disk_round_stats is used to round off the IO statistics for a disk
+ * for a complete clock tick.
+ */
+void disk_round_stats(struct hd_struct *hd);
+
+/*
+ * Account for the completion of an IO request (used by drivers which
+ * bypass the normal end_request processing)
+ */
+struct request;
+void req_finished_io(struct request *);

extern void devfs_register_partitions (struct gendisk *dev, int minor,
int unregister);
Index: 7.12/drivers/scsi/scsi_lib.c
--- 7.12/drivers/scsi/scsi_lib.c Tue, 12 Feb 2002 09:26:34 +0100 zcalusic (linux25/j/c/26_scsi_lib.c 1.1.1.1.1.2 644)
+++ 7.14/drivers/scsi/scsi_lib.c Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/j/c/26_scsi_lib.c 1.1.1.1.1.2.1.1 644)
@@ -382,6 +382,7 @@
*/
if (req->waiting)
complete(req->waiting);
+ req_finished_io(req);

add_blkdev_randomness(major(req->rq_dev));

Index: 7.12/drivers/block/genhd.c
--- 7.12/drivers/block/genhd.c Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/x/c/3_genhd.c 1.1.1.1.1.1.3.1 644)
+++ 7.14/drivers/block/genhd.c Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/x/c/3_genhd.c 1.1.1.1.1.1.3.2 644)
@@ -168,6 +168,10 @@
read_unlock(&gendisk_lock);
}

+
+/* Normalise the disk performance stats to a notional timer tick of 1ms. */
+#define MSEC(x) ((x) * 1000 / HZ)
+
static int show_partition(struct seq_file *part, void *v)
{
struct gendisk *sgp = v;
@@ -175,19 +179,34 @@
char buf[64];

if (sgp == gendisk_head)
- seq_puts(part, "major minor #blocks name\n\n");
+ seq_puts(part, "major minor #blocks name "
+ "rio rmerge rsect ruse "
+ "wio wmerge wsect wuse "
+ "running use aveq\n\n");

/* show all non-0 size partitions of this disk */
for (n = 0; n < (sgp->nr_real << sgp->minor_shift); n++) {
- if (sgp->part[n].nr_sects == 0)
+ struct hd_struct *hd = &sgp->part[n];
+
+ if (!hd->nr_sects)
continue;
- seq_printf(part, "%4d %4d %10d %s\n",
- sgp->major, n, sgp->sizes[n],
- disk_name(sgp, n, buf));
+ disk_round_stats(hd);
+ seq_printf(part, "%4d %4d %10d %s "
+ "%d %d %d %d %d %d %d %d %d %d %d\n",
+ sgp->major, n, sgp->sizes[n],
+ disk_name(sgp, n, buf),
+ hd->rd_ios, hd->rd_merges,
+ hd->rd_sectors, MSEC(hd->rd_ticks),
+ hd->wr_ios, hd->wr_merges,
+ hd->wr_sectors, MSEC(hd->wr_ticks),
+ hd->ios_in_flight, MSEC(hd->io_ticks),
+ MSEC(hd->aveq));
}

return 0;
}
+
+#undef MSEC

struct seq_operations partitions_op = {
start: part_start,
Index: 7.12/drivers/block/ll_rw_blk.c
--- 7.12/drivers/block/ll_rw_blk.c Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/x/c/9_ll_rw_blk. 1.1.1.1.1.1.1.1.1.1.3.1.1.1 644)
+++ 7.14/drivers/block/ll_rw_blk.c Wed, 10 Apr 2002 18:37:24 +0200 zcalusic (linux25/x/c/9_ll_rw_blk. 1.1.1.1.1.1.1.1.1.1.3.1.1.3 644)
@@ -944,6 +944,136 @@
printk(KERN_ERR "drive_stat_acct: cmd not R/W?\n");
}

+/* Return up to two hd_structs on which to do IO accounting for a given
+ * request. On a partitioned device, we want to account both against
+ * the partition and against the whole disk.
+ */
+static void locate_hd_struct(struct request *req,
+ struct hd_struct **hd1,
+ struct hd_struct **hd2)
+{
+ int p, n_part;
+ struct gendisk *gd;
+ struct hd_struct *hd;
+
+ *hd1 = NULL;
+ *hd2 = NULL;
+
+ gd = get_gendisk(req->rq_dev);
+ if (gd && gd->part) {
+ *hd1 = &gd->part[0];
+
+ /* Try to reconstruct the partition */
+ n_part = 1 << gd->minor_shift;
+ for (p = 1; p < n_part; p++) {
+ hd = &gd->part[p];
+ /* skip empty and extended partitions */
+ if (hd->nr_sects < 3)
+ continue;
+ if (req->sector >= hd->start_sect
+ && req->sector < hd->start_sect + hd->nr_sects)
+ break;
+ }
+ if (p < n_part)
+ *hd2 = hd;
+ }
+}
+
+/* Round off the performance stats on an hd_struct. The average IO
+ * queue length and utilisation statistics are maintained by observing
+ * the current state of the queue length and the amount of time it has
+ * been in this state for. Normally, that accounting is done on IO
+ * completion, but that can result in more than a second's worth of IO
+ * being accounted for within any one second, leading to >100%
+ * utilisation. To deal with that, we do a round-off before returning
+ * the results when reading /proc/partitions, accounting immediately for
+ * all queue usage up to the current jiffies and restarting the counters
+ * again.
+ */
+void disk_round_stats(struct hd_struct *hd)
+{
+ unsigned long now = jiffies;
+
+ hd->aveq += hd->ios_in_flight * (now - hd->last_queue_change);
+ hd->last_queue_change = now;
+
+ if (hd->ios_in_flight)
+ hd->io_ticks += (now - hd->last_idle_time);
+ hd->last_idle_time = now;
+}
+
+static inline void down_ios(struct hd_struct *hd)
+{
+ disk_round_stats(hd);
+ if (hd->ios_in_flight)
+ --hd->ios_in_flight;
+}
+
+static inline void up_ios(struct hd_struct *hd)
+{
+ disk_round_stats(hd);
+ ++hd->ios_in_flight;
+}
+
+static void account_io_start(struct hd_struct *hd, struct request *req,
+ int merge, int sectors)
+{
+ switch (rq_data_dir(req)) {
+ case READ:
+ if (merge)
+ hd->rd_merges++;
+ hd->rd_sectors += sectors;
+ break;
+ case WRITE:
+ if (merge)
+ hd->wr_merges++;
+ hd->wr_sectors += sectors;
+ break;
+ default:
+ break;
+ }
+ if (!merge)
+ up_ios(hd);
+}
+
+static void account_io_end(struct hd_struct *hd, struct request *req)
+{
+ unsigned long duration = jiffies - req->start_time;
+ switch (rq_data_dir(req)) {
+ case READ:
+ hd->rd_ticks += duration;
+ hd->rd_ios++;
+ break;
+ case WRITE:
+ hd->wr_ticks += duration;
+ hd->wr_ios++;
+ break;
+ default:
+ break;
+ }
+ down_ios(hd);
+}
+
+void req_new_io(struct request *req, int merge, int sectors)
+{
+ struct hd_struct *hd1, *hd2;
+
+ locate_hd_struct(req, &hd1, &hd2);
+ if (hd1)
+ account_io_start(hd1, req, merge, sectors);
+ if (hd2)
+ account_io_start(hd2, req, merge, sectors);
+}
+
+void req_finished_io(struct request *req)
+{
+ struct hd_struct *hd1, *hd2;
+
+ locate_hd_struct(req, &hd1, &hd2);
+ if (hd1)
+ account_io_end(hd1, req);
+ if (hd2)
+ account_io_end(hd2, req);
+}
+
/*
* add-request adds a request to the linked list.
* queue lock is held and interrupts disabled, as we muck with the
@@ -1018,6 +1148,8 @@
* counts here.
*/
if (q->merge_requests_fn(q, req, next)) {
+ struct hd_struct *hd1, *hd2;
+
elv_merge_requests(q, req, next);

blkdev_dequeue_request(next);
@@ -1027,6 +1159,15 @@

req->nr_sectors = req->hard_nr_sectors += next->hard_nr_sectors;

+ /* One last thing: we have removed a request, so we now have
+ * one less expected IO to complete for accounting purposes.
+ */
+ locate_hd_struct(req, &hd1, &hd2);
+ if (hd1)
+ down_ios(hd1);
+ if (hd2)
+ down_ios(hd2);
+
blkdev_release_request(next);
}
}
@@ -1117,6 +1258,7 @@
req->biotail = bio;
req->nr_sectors = req->hard_nr_sectors += nr_sectors;
drive_stat_acct(req, nr_sectors, 0);
+ req_new_io(req, 1, nr_sectors);
attempt_back_merge(q, req);
goto out;

@@ -1142,6 +1284,7 @@
req->sector = req->hard_sector = sector;
req->nr_sectors = req->hard_nr_sectors += nr_sectors;
drive_stat_acct(req, nr_sectors, 0);
+ req_new_io(req, 1, nr_sectors);
attempt_front_merge(q, req);
goto out;

@@ -1210,6 +1353,8 @@
req->waiting = NULL;
req->bio = req->biotail = bio;
req->rq_dev = bio->bi_dev;
+ req->start_time = jiffies;
+ req_new_io(req, 0, nr_sectors);
add_request(q, req, insert_here);
out:
if (freereq)
@@ -1649,6 +1794,7 @@
{
if (req->waiting)
complete(req->waiting);
+ req_finished_io(req);

blkdev_release_request(req);
}

--
Zlatko


2002-04-11 14:02:42

by Christoph Hellwig

Subject: Re: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3

On Thu, Apr 11, 2002 at 01:06:59PM +0200, Zlatko Calusic wrote:
> I had to make some changes, as kernel internals have changed since the
> patch was originally written. Alan has also included this patch in his
> 2.4.x-ac kernel series, but it is not working well there.

Oh, it's also in 2.4.19-pre6 and works pretty well..

> The first problem is that somehow, mysteriously, the ios_in_flight
> variable drops to -1 when the disks are idle. Of course, this skews
> lots of other numbers and iostat reports garbage. I tried to find the
> cause of this behaviour, but failed (it looks like we have a request
> fired on each disk whose start is never accounted but whose completion
> is?!). So I resolved it this way:
>
> if (hd->ios_in_flight)
> --hd->ios_in_flight;
>
> which works well, but I would still love to know how the number of
> I/Os in flight can drop below zero. :)

I'm unable to reproduce it here - I only have an idle partition though..

> The second problem/nuisance is that blk_partition_remap() destroys the
> partition information in bio->bi_dev before the request is queued.
> That's why the -ac kernel doesn't report per-partition information.

Bollocks. 2.4 doesn't even have blk_partition_remap(); the individual
drivers do the partition remapping themselves. I wouldn't have
submitted sard for 2.4 inclusion if there were such a bug.

Christoph

2002-04-11 17:21:33

by Zlatko Calusic

Subject: Re: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3

Christoph Hellwig <[email protected]> writes:

> On Thu, Apr 11, 2002 at 01:06:59PM +0200, Zlatko Calusic wrote:
>> I had to make some changes, as kernel internals have changed since the
>> patch was originally written. Alan has also included this patch in his
>> 2.4.x-ac kernel series, but it is not working well there.
>
> Oh, it's also in 2.4.19-pre6 and works pretty well..

Good. One more reason to put it in 2.5.x as well. :)

>
>> The first problem is that somehow, mysteriously, the ios_in_flight
>> variable drops to -1 when the disks are idle. Of course, this skews
>> lots of other numbers and iostat reports garbage. I tried to find the
>> cause of this behaviour, but failed (it looks like we have a request
>> fired on each disk whose start is never accounted but whose completion
>> is?!). So I resolved it this way:
>>
>> if (hd->ios_in_flight)
>> --hd->ios_in_flight;
>>
>> which works well, but I would still love to know how the number of
>> I/Os in flight can drop below zero. :)
>
> I'm unable to reproduce it here - I only have an idle partition though..

Here is how it looks here (SMP machine, it could matter):


major minor #blocks name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq

34 0 40011552 hdg 23 62 189 70 4 3 32 0 -1 946410 -946410
33 0 60051600 hde 8349 18725 214666 108230 3781 15234 152176 91700 -1 927060 -746550
3 0 19938240 hda 848 1023 14565 5470 1303 1542 22768 300 -1 942120 -940710


Notice negative numbers for running, aveq. Kernel is 2.4.19-pre5-ac3.

>
>> Second problem/nuisance is that blk_partition_remap() destroys
>> partition information from the bio->bi_dev before the request is
>> queued. That's why -ac kernel doesn't report per-partition
>> information.
>
> Bullocks. 2.4 doesn't even have blk_partition_remap, but the individual
> drivers do the partition remapping themselves. I wouldn't have submitted
> sard for 2.4 inclusion if there would be such a bug.
>

Oops, you're right. I stand corrected. That particular problem doesn't
exist in 2.4.x; it is obviously a property of the new bio layer, which
exists only in 2.5.x.

Regards,
--
Zlatko

2002-04-11 21:07:15

by Mike Fedyk

Subject: Re: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3

On Thu, Apr 11, 2002 at 07:20:54PM +0200, Zlatko Calusic wrote:
> Christoph Hellwig <[email protected]> writes:
> > I'm unable to reproduce it here - I only have an idle partition though..
>
> Here is how it looks here (SMP machine, it could matter):
>
>
> major minor #blocks name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq
>
> 34 0 40011552 hdg 23 62 189 70 4 3 32 0 -1 946410 -946410
> 33 0 60051600 hde 8349 18725 214666 108230 3781 15234 152176 91700 -1 927060 -746550
> 3 0 19938240 hda 848 1023 14565 5470 1303 1542 22768 300 -1 942120 -940710
>
>
> Notice negative numbers for running, aveq. Kernel is 2.4.19-pre5-ac3.

major minor #blocks name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq

3 0 10037774 ide/host0/bus0/target0/lun0/disc 2381496 3602026 12295120 24796300 2167401 11887863 28592436 168474630 -1 849341880 -666064880
3 1 488344 ide/host0/bus0/target0/lun0/part1 53667 86 430048 544050 18970 47327 535456 2558790 0 499660 3105630
3 2 9549288 ide/host0/bus0/target0/lun0/part2 2327827 3601934 11865056 24252240 2148431 11840536 28056980 165915840 0 16504720 190278950

2.4.19-pre4-ac3

Did I misunderstand when, in a previous message, you reported that
statistics were not reported for partitions? Also, -1 only shows up
for the whole drive and not for the partitions.

Mike

2002-04-11 23:06:36

by Zlatko Calusic

Subject: Re: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3

Mike Fedyk <[email protected]> writes:

> On Thu, Apr 11, 2002 at 07:20:54PM +0200, Zlatko Calusic wrote:
>> Christoph Hellwig <[email protected]> writes:
>> > I'm unable to reproduce it here - I only have an idle partition though..
>>
>> Here is how it looks here (SMP machine, it could matter):
>>
>>
>> major minor #blocks name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq
>>
>> 34 0 40011552 hdg 23 62 189 70 4 3 32 0 -1 946410 -946410
>> 33 0 60051600 hde 8349 18725 214666 108230 3781 15234 152176 91700 -1 927060 -746550
>> 3 0 19938240 hda 848 1023 14565 5470 1303 1542 22768 300 -1 942120 -940710
>>
>>
>> Notice negative numbers for running, aveq. Kernel is 2.4.19-pre5-ac3.
>
> major minor #blocks name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq
>
> 3 0 10037774 ide/host0/bus0/target0/lun0/disc 2381496 3602026 12295120 24796300 2167401 11887863 28592436 168474630 -1 849341880 -666064880
> 3 1 488344 ide/host0/bus0/target0/lun0/part1 53667 86 430048 544050 18970 47327 535456 2558790 0 499660 3105630
> 3 2 9549288 ide/host0/bus0/target0/lun0/part2 2327827 3601934 11865056 24252240 2148431 11840536 28056980 165915840 0 16504720 190278950
>
> 2.4.19-pre4-ac3
>
> Did I misunderstand when, in a previous message, you reported that
> statistics were not reported for partitions? Also, -1 only shows up
> for the whole drive and not for the partitions.
>

I was wrong: per-partition information IS correctly reported in
2.4.x. Only in 2.5.x did I need a workaround to get per-partition
data.

Also, it _seems_ that the ios_in_flight variable (the "running" column
in /proc/partitions) drops to -1 only for whole devices. Your output
(and mine) shows it clearly, yes. This needs further investigation; I
have been unable to find the problem so far.

And, as the aveq value is calculated directly from ios_in_flight, the
average queue value is wrong too (it never goes negative if everything
is all right).
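
To see the arithmetic of the symptom (not its root cause), here is a
minimal standalone C sketch - not kernel code; the MSEC() scaling is
glossed over, and dt is a made-up value mirroring the hdg row above:

#include <stdio.h>

int main(void)
{
	/* ios_in_flight and aveq are unsigned int in struct hd_struct;
	 * /proc/partitions prints them with %d */
	unsigned int ios_in_flight = 0, aveq = 0;
	unsigned int dt = 946410;	/* made-up time since last_queue_change */

	--ios_in_flight;		/* completion without a matching start:
					 * wraps to 0xffffffff */
	aveq += ios_in_flight * dt;	/* wraps to -dt modulo 2^32 */

	/* prints: running -1 aveq -946410 */
	printf("running %d aveq %d\n", (int)ios_in_flight, (int)aveq);
	return 0;
}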
--
Zlatko

2002-04-12 02:22:27

by Don Dupuis

Subject: Re: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3

I have a program called cspm, at http://sourceforge.net/projects/cspm,
that uses Stephen's sard patches. It can show stats for all disks and
controllers, and on a systemwide basis. It shows ios, uses, merges,
and blocks per second. The program is written in Qt. Please have a
look at it and provide some feedback.

Thanks.

Don Dupuis

On Thu, 2002-04-11 at 06:06, Zlatko Calusic wrote:
> [original message and patch quoted in full - snipped]


2002-04-12 10:28:47

by Jochen Suckfuell

Subject: Re: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3


On Fri, Apr 12, 2002 at 01:06:10AM +0200, Zlatko Calusic wrote:
> > major minor #blocks name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq
> > 3 0 10037774 ide/host0/bus0/target0/lun0/disc 2381496 3602026 12295120 24796300 2167401 11887863 28592436 168474630 -1 849341880 -666064880
> >
> Also, it _seems_ that the ios_in_flight variable (the "running" column
> in /proc/partitions) drops to -1 only for whole devices. Your output
> (and mine) shows it clearly, yes. This needs further investigation; I
> have been unable to find the problem so far.

I have seen ios_in_flight as low as -4 already.

> And, as the aveq value is calculated directly from ios_in_flight, the
> average queue value is wrong too (it never goes negative if everything
> is all right).
The use column (one of the most important values here) is also
completely wrong when ios_in_flight is too low: whenever
ios_in_flight != 0, the disk is accounted as busy.

As a workaround, I have patched gen_partition_disk() to correct the
ios_in_flight value. This is better than checking it each time
ios_in_flight is decremented, because it only happens when someone
reads /proc/partitions, so it doesn't slow down disk access.
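
Something along these lines - a hypothetical sketch, since the actual
patch is not included in this thread:

/* Clamp the wrapped unsigned counter when /proc/partitions is
 * generated, leaving the request completion path untouched. */
static void fixup_ios_in_flight(struct hd_struct *hd)
{
	if ((int)hd->ios_in_flight < 0)	/* counter wrapped below zero */
		hd->ios_in_flight = 0;
}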

But I'm still trying to reproduce the problem and track down its cause.

Bye
Jochen

--
Jochen Suckfuell

2002-04-14 19:07:10

by Zlatko Calusic

Subject: Re: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3

Don Dupuis <[email protected]> writes:

> I have a program called cspm, at http://sourceforge.net/projects/cspm,
> that uses Stephen's sard patches. It can show stats for all disks and
> controllers, and on a systemwide basis. It shows ios, uses, merges,
> and blocks per second. The program is written in Qt. Please have a
> look at it and provide some feedback.
>

After lots of problems with Qt 3.0, I finally managed to install it
on my Debian system. Then I had to correct one bug in your source to
get it to compile, and now it just core dumps, apparently while
parsing /proc/partitions (the open() of that file is the last thing it
does before dying with a segmentation fault).

I would like to try it, but I haven't had much luck so far.
--
Zlatko

2002-04-25 11:53:53

by Thomas Tonino

Subject: Re: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3

Zlatko Calusic wrote:

> All this gives Linux much-needed functionality - extensive measurement
> and reporting of disk device performance and bottlenecks. No server
> should be without it. See below for a real-life story.
>
> The patch, the iostat utility, and the man page can be found at
>
> <URL:http://linux.inet.hr/>

This is very useful functionality. I'm currently looking at a similar
patch for the 2.4 series, taken from

http://www.kernel.at/pub/linux/kernel/people/hch/patches/v2.4/2.4.18-pre4/

which seems to work fine so far (with very limited testing). This patch
gives information that is really needed to maintain systems that
actually use their disks, such as mail, database and news servers.

May I vote for inclusion in the mainline kernel, both 2.4 and 2.5?
Again, it is an important part of server maintainability.


Thomas