To: linux-kernel@vger.kernel.org
Cc: "Stephen C. Tweedie", Jens Axboe, Alan Cox
Subject: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3
From: Zlatko Calusic
Reply-To: zlatko.calusic@iskon.hr
Date: Thu, 11 Apr 2002 13:06:59 +0200

Hi!

I have ported Stephen's excellent sard I/O accounting patch to
2.5.8-pre3. Additionally, I have adapted and cleaned up the iostat
utility (originally written by Greg Franks) and created a man page for
it. All this gives Linux much-needed functionality: extensive
measurement and reporting of disk device performance and bottlenecks.
No server should be without it. See below for a real-life story.

The patch, iostat utility, and man page can be found at

Also, find the patch appended to this mail, for quicker browsing.

Some observations:
------------------

The patch is quite well tested and stable (that is, it shouldn't eat
your machine :)) on both UP & SMP configurations, but I have tested it
only on IDE hardware. Please test it on your configuration (especially
SCSI) and report the results. I had to make some changes, as kernel
internals have changed since the time the patch was originally written.

Alan has also included this patch in his 2.4.x-ac kernel series, but
there it is not working well. The first problem is that, somehow, the
ios_in_flight variable mysteriously drops to -1 when the disks are
idle. Of course, this skews lots of other numbers and iostat reports
garbage. I tried to find the cause of this behaviour, but failed (it
looks like we have a request fired on each disk whose start is never
accounted but whose completion is?!). So I resolved it this way:

	if (hd->ios_in_flight)
		--hd->ios_in_flight;

which works well, but I would still love to know how the number of I/Os
in flight can drop below zero. :)

The second problem/nuisance is that blk_partition_remap() destroys the
partition information in bio->bi_dev before the request is queued.
That's why the -ac kernel doesn't report per-partition information. To
resolve that, I had to put additional code in locate_hd_struct() to
extract the partition information back again, which is kind of ugly
(first we destroy it, then we recreate it). The other way we could
solve this would be to put something like bio->bi_acctdev in the bio
structure and keep the original device there, but that is not only a
duplication of information (which is also ugly), it would also waste
precious memory. I don't know if the block remapping done in the
above-mentioned function is strictly necessary. Jens, could you help
with this and/or suggest which way to go?
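To make that trade-off concrete, the alternative would look roughly
like this (a sketch only, not part of the appended patch; bi_acctdev is
a made-up name, and where exactly to set it would need Jens'
confirmation):

	/* hypothetical extra field in struct bio: */
	kdev_t bi_acctdev;	/* pre-remap device, for accounting only */

	/* in blk_partition_remap(), before bi_dev is rewritten: */
	bio->bi_acctdev = bio->bi_dev;

locate_hd_struct() could then map minor(req->bio->bi_acctdev) straight
to the right hd_struct instead of scanning the start_sect ranges, at
the price of one extra kdev_t in every bio in flight.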
TODO:
-----

Please check the patch on whatever hardware you're equipped with and
report success/failure.

Somebody should really go through the man page, correct the grammar
mistakes :), and maybe make it more readable in places.

By the time this patch is ready for inclusion in the Linus kernel, the
old disk statistics counters should be removed (they are buggy,
incomplete, and obsolete).

This patch together with the iostat utility is probably a necessary
ingredient for making quality measurements of the new IDE-TCQ code.
Too bad I have a Maxtor drive on a Promise controller here (any hope
of me running TCQ on that combination in the future?).

Real life story:
----------------

So, I had this application heavily inserting data into an Oracle
database, and its performance mattered to me. As the disk rattled all
the time the inserts were going on, I decided I needed to spend some
time reordering datafiles to achieve peak performance (disk I/O was
the performance bottleneck, obviously). But after an hour or two I got
only a few percent improvement. How could that happen?!

Well, it turned out that the database and application were already
tuned quite well, and the load on the database was (correctly) CPU
bound. As I have an SMP machine, I had forgotten that the 40% free CPU
could be on the second processor, obviously unusable by the
application in question. And the disk rattling was normal: lots of
fsyncs, some seeks...

So, when I finally managed to port this patch to 2.5.x (we are
speaking of a home development database, 2.5.x kernels are OK for
those, right?! :)), iostat showed only ~20% disk utilization! That is,
there was _NO_ heavy I/O contention, and the disks were _NOT_ the
culprit. Live and learn. If only I had known that before I wasted my
time needlessly moving big datafiles around and fixing the
configuration... :(
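For those wondering where that utilization number comes from: iostat
simply samples the new "use" column of /proc/partitions (milliseconds
with at least one IO in flight, see show_partition() in the patch
below) twice and divides the delta by the wall-clock interval. A
minimal userspace sketch of the idea (assuming the field layout from
this patch; "hda" and the 5-second interval are just examples, and
error handling is spartan):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* Return the "use" column for a device, in ms, or -1 if not found. */
	static long read_use(const char *dev)
	{
		FILE *f = fopen("/proc/partitions", "r");
		char line[256], name[64];
		long ret = -1;

		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f)) {
			int major, minor, blocks;
			long rio, rmerge, rsect, ruse;
			long wio, wmerge, wsect, wuse;
			long running, use, aveq;

			/* header and blank lines fail the match, so they are skipped */
			if (sscanf(line, "%d %d %d %63s %ld %ld %ld %ld "
				   "%ld %ld %ld %ld %ld %ld %ld",
				   &major, &minor, &blocks, name,
				   &rio, &rmerge, &rsect, &ruse,
				   &wio, &wmerge, &wsect, &wuse,
				   &running, &use, &aveq) == 15 &&
			    !strcmp(name, dev)) {
				ret = use;
				break;
			}
		}
		fclose(f);
		return ret;
	}

	int main(void)
	{
		long u1, u2;

		u1 = read_use("hda");
		sleep(5);
		u2 = read_use("hda");
		if (u1 < 0 || u2 < 0)
			return 1;
		/* both "use" and the sampling interval are in milliseconds */
		printf("util: %.1f%%\n", (u2 - u1) * 100.0 / 5000.0);
		return 0;
	}

Nothing more to it; sard does the hard part of keeping "use" honest
(that is what disk_round_stats() below is for).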
Regards,
Zlatko

Index: 7.12/include/linux/blkdev.h
--- 7.12/include/linux/blkdev.h Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/n/b/1_blkdev.h 1.1.1.1.1.1.1.2.2.1 644)
+++ 7.14/include/linux/blkdev.h Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/n/b/1_blkdev.h 1.1.1.1.1.1.1.2.2.2 644)
@@ -34,6 +34,7 @@
 	int rq_status;	/* should split this into a few status bits */
 	kdev_t rq_dev;
 	int errors;
+	unsigned long start_time;
 	sector_t sector;
 	unsigned long nr_sectors;
 	unsigned long hard_sector;	/* the hard_* are block layer
Index: 7.12/include/linux/genhd.h
--- 7.12/include/linux/genhd.h Tue, 15 Jan 2002 22:59:02 +0100 zcalusic (linux25/o/b/18_genhd.h 1.1.1.1.1.1 644)
+++ 7.14/include/linux/genhd.h Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/o/b/18_genhd.h 1.1.1.1.1.1.2.1 644)
@@ -61,6 +61,22 @@
 	unsigned long start_sect;
 	unsigned long nr_sects;
 	devfs_handle_t de;	/* primary (master) devfs entry */
+
+	/* Performance stats: */
+	unsigned int ios_in_flight;
+	unsigned int io_ticks;
+	unsigned int last_idle_time;
+	unsigned int last_queue_change;
+	unsigned int aveq;
+
+	unsigned int rd_ios;
+	unsigned int rd_merges;
+	unsigned int rd_ticks;
+	unsigned int rd_sectors;
+	unsigned int wr_ios;
+	unsigned int wr_merges;
+	unsigned int wr_ticks;
+	unsigned int wr_sectors;
 	int number;	/* stupid old code wastes space */
 };
@@ -238,6 +254,19 @@
 #ifdef __KERNEL__
 char *disk_name (struct gendisk *hd, int minor, char *buf);
+
+/*
+ * disk_round_stats is used to round off the IO statistics for a disk
+ * for a complete clock tick.
+ */
+void disk_round_stats(struct hd_struct *hd);
+
+/*
+ * Account for the completion of an IO request (used by drivers which
+ * bypass the normal end_request processing)
+ */
+struct request;
+void req_finished_io(struct request *);
 
 extern void devfs_register_partitions (struct gendisk *dev,
 				       int minor, int unregister);
Index: 7.12/drivers/scsi/scsi_lib.c
--- 7.12/drivers/scsi/scsi_lib.c Tue, 12 Feb 2002 09:26:34 +0100 zcalusic (linux25/j/c/26_scsi_lib.c 1.1.1.1.1.2 644)
+++ 7.14/drivers/scsi/scsi_lib.c Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/j/c/26_scsi_lib.c 1.1.1.1.1.2.1.1 644)
@@ -382,6 +382,7 @@
 	 */
 	if (req->waiting)
 		complete(req->waiting);
+	req_finished_io(req);
 
 	add_blkdev_randomness(major(req->rq_dev));
Index: 7.12/drivers/block/genhd.c
--- 7.12/drivers/block/genhd.c Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/x/c/3_genhd.c 1.1.1.1.1.1.3.1 644)
+++ 7.14/drivers/block/genhd.c Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/x/c/3_genhd.c 1.1.1.1.1.1.3.2 644)
@@ -168,6 +168,10 @@
 	read_unlock(&gendisk_lock);
 }
 
+
+/* Normalise the disk performance stats to a notional timer tick of 1ms. */
+#define MSEC(x) ((x) * 1000 / HZ)
+
 static int show_partition(struct seq_file *part, void *v)
 {
 	struct gendisk *sgp = v;
@@ -175,19 +179,34 @@
 	char buf[64];
 
 	if (sgp == gendisk_head)
-		seq_puts(part, "major minor  #blocks  name\n\n");
+		seq_puts(part, "major minor  #blocks  name "
+			 "rio rmerge rsect ruse "
+			 "wio wmerge wsect wuse "
+			 "running use aveq\n\n");
 
 	/* show all non-0 size partitions of this disk */
 	for (n = 0; n < (sgp->nr_real << sgp->minor_shift); n++) {
-		if (sgp->part[n].nr_sects == 0)
+		struct hd_struct *hd = &sgp->part[n];
+
+		if (!hd->nr_sects)
 			continue;
-		seq_printf(part, "%4d  %4d %10d %s\n",
-			sgp->major, n, sgp->sizes[n],
-			disk_name(sgp, n, buf));
+		disk_round_stats(hd);
+		seq_printf(part, "%4d  %4d %10d %s "
+			   "%d %d %d %d %d %d %d %d %d %d %d\n",
+			   sgp->major, n, sgp->sizes[n],
+			   disk_name(sgp, n, buf),
+			   hd->rd_ios, hd->rd_merges,
+			   hd->rd_sectors, MSEC(hd->rd_ticks),
+			   hd->wr_ios, hd->wr_merges,
+			   hd->wr_sectors, MSEC(hd->wr_ticks),
+			   hd->ios_in_flight, MSEC(hd->io_ticks),
+			   MSEC(hd->aveq));
 	}
 
 	return 0;
 }
+
+#undef MSEC
 
 struct seq_operations partitions_op = {
 	start:	part_start,
Index: 7.12/drivers/block/ll_rw_blk.c
--- 7.12/drivers/block/ll_rw_blk.c Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/x/c/9_ll_rw_blk. 1.1.1.1.1.1.1.1.1.1.3.1.1.1 644)
+++ 7.14/drivers/block/ll_rw_blk.c Wed, 10 Apr 2002 18:37:24 +0200 zcalusic (linux25/x/c/9_ll_rw_blk. 1.1.1.1.1.1.1.1.1.1.3.1.1.3 644)
@@ -944,6 +944,136 @@
 		printk(KERN_ERR "drive_stat_acct: cmd not R/W?\n");
 }
 
+/* Return up to two hd_structs on which to do IO accounting for a given
+ * request.  On a partitioned device, we want to account both against
+ * the partition and against the whole disk.
+ */
+static void locate_hd_struct(struct request *req,
+			     struct hd_struct **hd1,
+			     struct hd_struct **hd2)
+{
+	int p, n_part;
+	struct gendisk *gd;
+	struct hd_struct *hd;
+
+	*hd1 = NULL;
+	*hd2 = NULL;
+
+	gd = get_gendisk(req->rq_dev);
+	if (gd && gd->part) {
+		*hd1 = &gd->part[0];
+
+		/* Try to reconstruct the partition */
+		n_part = 1 << gd->minor_shift;
+		for (p = 1; p < n_part; p++) {
+			hd = &gd->part[p];
+			/* skip empty and extended partitions */
+			if (hd->nr_sects < 3)
+				continue;
+			if (req->sector >= hd->start_sect
+			    && req->sector < hd->start_sect + hd->nr_sects)
+				break;
+		}
+		if (p < n_part)
+			*hd2 = hd;
+	}
+}
+
+/* Round off the performance stats on an hd_struct.  The average IO
+ * queue length and utilisation statistics are maintained by observing
+ * the current state of the queue length and the amount of time it has
+ * been in this state for.  Normally, that accounting is done on IO
+ * completion, but that can result in more than a second's worth of IO
+ * being accounted for within any one second, leading to >100%
+ * utilisation.  To deal with that, we do a round-off before returning
+ * the results when reading /proc/partitions, accounting immediately for
+ * all queue usage up to the current jiffies and restarting the counters
+ * again.
+ */
+void disk_round_stats(struct hd_struct *hd)
+{
+	unsigned long now = jiffies;
+
+	hd->aveq += hd->ios_in_flight * (now - hd->last_queue_change);
+	hd->last_queue_change = now;
+
+	if (hd->ios_in_flight)
+		hd->io_ticks += (now - hd->last_idle_time);
+	hd->last_idle_time = now;
+}
+
+static inline void down_ios(struct hd_struct *hd)
+{
+	disk_round_stats(hd);
+	if (hd->ios_in_flight)
+		--hd->ios_in_flight;
+}
+
+static inline void up_ios(struct hd_struct *hd)
+{
+	disk_round_stats(hd);
+	++hd->ios_in_flight;
+}
+
+static void account_io_start(struct hd_struct *hd, struct request *req,
+			     int merge, int sectors)
+{
+	switch (rq_data_dir(req)) {
+	case READ:
+		if (merge)
+			hd->rd_merges++;
+		hd->rd_sectors += sectors;
+		break;
+	case WRITE:
+		if (merge)
+			hd->wr_merges++;
+		hd->wr_sectors += sectors;
+		break;
+	default:
+		break;
+	}
+	if (!merge)
+		up_ios(hd);
+}
+
+static void account_io_end(struct hd_struct *hd, struct request *req)
+{
+	unsigned long duration = jiffies - req->start_time;
+
+	switch (rq_data_dir(req)) {
+	case READ:
+		hd->rd_ticks += duration;
+		hd->rd_ios++;
+		break;
+	case WRITE:
+		hd->wr_ticks += duration;
+		hd->wr_ios++;
+		break;
+	default:
+		break;
+	}
+	down_ios(hd);
+}
+
+void req_new_io(struct request *req, int merge, int sectors)
+{
+	struct hd_struct *hd1, *hd2;
+
+	locate_hd_struct(req, &hd1, &hd2);
+	if (hd1)
+		account_io_start(hd1, req, merge, sectors);
+	if (hd2)
+		account_io_start(hd2, req, merge, sectors);
+}
+
+void req_finished_io(struct request *req)
+{
+	struct hd_struct *hd1, *hd2;
+
+	locate_hd_struct(req, &hd1, &hd2);
+	if (hd1)
+		account_io_end(hd1, req);
+	if (hd2)
+		account_io_end(hd2, req);
+}
+
 /*
  * add-request adds a request to the linked list.
  * queue lock is held and interrupts disabled, as we muck with the
@@ -1018,6 +1148,8 @@
 	 * counts here.
 	 */
 	if (q->merge_requests_fn(q, req, next)) {
+		struct hd_struct *hd1, *hd2;
+
 		elv_merge_requests(q, req, next);
 
 		blkdev_dequeue_request(next);
@@ -1027,6 +1159,15 @@
 		req->nr_sectors = req->hard_nr_sectors += next->hard_nr_sectors;
 
+		/* One last thing: we have removed a request, so we now have
+		 * one less expected IO to complete for accounting purposes.
+		 */
+		locate_hd_struct(req, &hd1, &hd2);
+		if (hd1)
+			down_ios(hd1);
+		if (hd2)
+			down_ios(hd2);
+
 		blkdev_release_request(next);
 	}
 }
@@ -1117,6 +1258,7 @@
 			req->biotail = bio;
 			req->nr_sectors = req->hard_nr_sectors += nr_sectors;
 			drive_stat_acct(req, nr_sectors, 0);
+			req_new_io(req, 1, nr_sectors);
 
 			attempt_back_merge(q, req);
 			goto out;
@@ -1142,6 +1284,7 @@
 			req->sector = req->hard_sector = sector;
 			req->nr_sectors = req->hard_nr_sectors += nr_sectors;
 			drive_stat_acct(req, nr_sectors, 0);
+			req_new_io(req, 1, nr_sectors);
 
 			attempt_front_merge(q, req);
 			goto out;
@@ -1210,6 +1353,8 @@
 	req->waiting = NULL;
 	req->bio = req->biotail = bio;
 	req->rq_dev = bio->bi_dev;
+	req->start_time = jiffies;
+	req_new_io(req, 0, nr_sectors);
 	add_request(q, req, insert_here);
 out:
 	if (freereq)
@@ -1649,6 +1794,7 @@
 {
 	if (req->waiting)
 		complete(req->waiting);
+	req_finished_io(req);
 	blkdev_release_request(req);
 }

-- 
Zlatko