2002-07-26 09:12:09

by Jan Kasprzak

Subject: sard I/O accounting bug in -ac

Hello,

There is a minor bug in the I/O accounting in 2.4.19-rc1-ac2 (and
probably later revisions as well - at least no fix to this is mentioned
in the changelog). The problem is that the "running" item in /proc/partitions
can get negative. Here is a part of my /proc/partitions:

# head /proc/partitions
major minor #blocks name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq

58 0 204210176 lvma 0 0 0 0 0 0 0 0 0 0 0
34 0 75068280 hdg 12007096 98830582 885673212 17084202 317257 4005187 34592296 32986598 -1071 22122854 34831850
34 1 74805696 hdg1 11988002 98662006 885199472 16164162 312861 3983758 34385688 32524318 4 17485050 5748597
34 2 262552 hdg2 16396 4 131200 151010 4396 21429 206608 462280 0 484160 613290
33 0 45034920 hde 10694513 98320446 660827602 40668484 175714 1606797 14276992 35243644 -1065 18924884 38273425
33 1 44870616 hde1 10128022 63655922 590270586 37513084 171329 1585351 14070264 34922914 5 36875816 29493365
33 2 164272 hde2 15831 0 126648 147970 4385 21446 206728 320730 0 249100 468700
33 64 45034920 hdf 11314436 99923267 889901456 3322406 198326 2108746 18457720 28092418 -1068 20797814 37115864


The negative values are only in entries for the whole disk, never for
its partition:

# grep -- - /proc/partitions
34 0 75068280 hdg 12007082 98830340 885671132 17079932 317257 4005187 34592296 32986598 -1075 22116424 41738490
33 0 45034920 hde 10694440 98319468 660819154 40653024 175714 1606797 14276992 35243644 -1070 18918454 2187572
33 64 45034920 hdf 11314385 99922713 889896600 3308796 198326 2108746 18457720 28092418 -1070 20791384 1032102
22 0 20005650 hdc 3903087 48375999 324447832 40241037 127195 1401386 12232848 26945838 -1070 26995574 36221312
3 0 9770544 hda 33937639 61905727 191734832 39181743 9254145 28327116 75478202 2120759 -1065 17574604 29910514
3 64 19938240 hdb 4225756 35931205 321255608 31092862 114779 1426224 12329296 37265634 -1063 22389064 20898945
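
A quick way to keep an eye on this is to scan /proc/partitions for a
negative value in the "running" column. A minimal user-space sketch
(illustration only; it assumes the 15-column layout shown in the header
line above):

/*
 * check_running.c -- flag /proc/partitions entries whose "running"
 * (ios_in_flight) counter has gone negative.  Illustrative sketch only;
 * it assumes the 15-column sard-style layout shown above.
 */
#include <stdio.h>

int main(void)
{
    char line[512], name[64];
    FILE *f = fopen("/proc/partitions", "r");

    if (!f) {
        perror("/proc/partitions");
        return 1;
    }

    while (fgets(line, sizeof(line), f)) {
        int major, minor;
        unsigned long blocks, rio, rmerge, rsect, ruse;
        unsigned long wio, wmerge, wsect, wuse, use, aveq;
        long running;

        /* the header and blank lines simply fail to match the format */
        if (sscanf(line,
                   "%d %d %lu %63s %lu %lu %lu %lu %lu %lu %lu %lu %ld %lu %lu",
                   &major, &minor, &blocks, name, &rio, &rmerge, &rsect,
                   &ruse, &wio, &wmerge, &wsect, &wuse, &running,
                   &use, &aveq) == 15 && running < 0)
            printf("%s: running = %ld\n", name, running);
    }
    fclose(f);
    return 0;
}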

BTW, is there any documentation describing the meaning of the values
in /proc/partitions? What is the difference between rio/wio and rmerge/wmerge?
In what units are ruse/wuse measured? I thought they were milliseconds
of read/write activity, but they grow by far more than 1000 per second
for some partitions.

Thanks,

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Czech Linux Homepage: http://www.linux.cz/ |
|----------- If you want the holes in your knowledge showing up -----------|
|----------- try teaching someone. -- Alan Cox -----------|


2002-07-26 13:33:58

by Jochen Suckfuell

Subject: Re: sard I/O accounting bug in -ac

On Fri, Jul 26, 2002 at 11:15:20AM +0200, Jan Kasprzak wrote:
> Hello,
>
> There is a minor bug in the I/O accounting in 2.4.19-rc1-ac2 (and
> probably later revisions as well - at least no fix to this is mentioned
> in the changelog). The problem is that the "running" item in /proc/partitions
> can get negative. Here is a part of my /proc/partitions:
> [...]

I tracked this problem down on hosts with SCSI disks, where an
accounting function was called without the appropriate lock held. (My
previous mail including a patch for this bug on SCSI is here:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0207.1/0323.html .)

However, since you don't have SCSI disks, this will most likely not fix
your problem. But before I tracked down this missing-lock issue, I was
able to fix the problem by making the ios_in_flight variable use atomic
operations. Let me guess: you have an SMP machine, where a race
condition is likely to occur?
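
The effect is easy to reproduce in user space: two threads doing an
unsynchronized ++/-- on a plain integer lose updates, and the counter
drifts away from zero (sometimes below it), while an atomic counter
always comes back to zero. A minimal sketch with pthreads and C11
atomics, purely as an analogy and not the 2.4 kernel code:

/*
 * race.c -- user-space analogy only, not kernel code.  "plain" stands in
 * for the old unsigned-int ios_in_flight, "atom" for the atomic_t version
 * in the patch below.  Build with: gcc -O2 -pthread race.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* volatile only keeps the compiler from caching the value in a register;
 * it does NOT make the ++/-- read-modify-write atomic */
static volatile int plain;
static atomic_int atom;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000000; i++) {
        plain++;                     /* load, add, store: updates can be lost */
        atomic_fetch_add(&atom, 1);
        plain--;
        atomic_fetch_sub(&atom, 1);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* "plain" frequently ends up nonzero (and can go negative);
     * "atom" is always 0 */
    printf("plain = %d, atomic = %d\n", plain, atomic_load(&atom));
    return 0;
}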

Below is my patch to use atomic operations (against 2.4.19-pre10) that
fixed the problem for me on an SMP machine with SCSI disks. (I don't
have a non-SCSI SMP machine to check.)

In the first chunk, I also added a check for negative ios_in_flight
values to the function generating the /proc/partitions output, which
resets them to zero. You should remove this part to see whether the
values still become negative.

If this patch works, there must be another place where accounting is
done without proper spinlock handling. The atomic-patch should not be
the final solution for this!

>
> BTW, is there any documentation describing the meaning of the values
> in /proc/partitions? What is the difference between rio/wio and rmerge/wmerge?
> In what units are ruse/wuse measured? I thought they were milliseconds
> of read/write activity, but they grow by far more than 1000 per second
> for some partitions.

rio/wio are the total number of read/write requests processed, while
rmerge/wmerge are the number of successful merges of requests.
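
Put differently: a buffer that is contiguous with a request already
sitting in the queue is folded into that request (counted in rmerge)
instead of being issued as a request of its own (counted in rio); rsect
counts the sectors either way. A simplified model of that accounting,
only an illustration and not the actual ll_rw_blk.c/elevator code
(MAX_REQ_SECTORS is an arbitrary cap picked for the example, not the
kernel's real limit):

/*
 * merge_model.c -- toy model of read-request accounting.  Contiguous
 * buffers are merged into the current request until a size cap is hit,
 * otherwise a new request is started.  Illustration only.
 */
#include <stdio.h>

#define MAX_REQ_SECTORS 128   /* arbitrary cap for the example */

struct stats { unsigned long rio, rmerge, rsect; };

static unsigned long tail;        /* sector just past the current request */
static unsigned long req_size;    /* sectors in the current request */

static void account_buffer(struct stats *st, unsigned long sector,
                           unsigned long nsect)
{
    st->rsect += nsect;
    if (st->rio && sector == tail && req_size + nsect <= MAX_REQ_SECTORS) {
        st->rmerge++;             /* folded into the current request */
        req_size += nsect;
    } else {
        st->rio++;                /* a new request is issued */
        req_size = nsect;
    }
    tail = sector + nsect;
}

int main(void)
{
    struct stats st = { 0, 0, 0 };
    unsigned long s;

    /* a purely sequential stream of 8-sector (4 KB) buffers */
    for (s = 0; s < 16384; s += 8)
        account_buffer(&st, s, 8);

    printf("rio=%lu rmerge=%lu rsect=%lu (rsect/rio=%lu)\n",
           st.rio, st.rmerge, st.rsect, st.rsect / st.rio);
    return 0;
}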


Bye
Jochen

--
Jochen Suckfuell --- http://www.suckfuell.net/jochen/ ---



diff -ru linux/drivers/block/genhd.c linux-2.4.19pre10/drivers/block/genhd.c
--- linux/drivers/block/genhd.c Fri Jun 14 13:09:26 2002
+++ linux-2.4.19pre10/drivers/block/genhd.c Fri Jun 14 11:18:26 2002
@@ -177,6 +177,15 @@
continue;

hd = &gp->part[n]; disk_round_stats(hd);
+
+ if((int)atomic_read(&hd->ios_in_flight) < 0)
+ {
+ /* this happens mysteriously, so here is a cheap workaround */
+ atomic_set(&hd->ios_in_flight, 0);
+
+ printk("get_partition_list() corrected ios_in_flight for %s\n", disk_name(gp, n, buf));
+ }
+
len += sprintf(page + len,
"%4d %4d %10d %s "
"%d %d %d %d %d %d %d %d %d %d %d\n",
@@ -189,7 +198,7 @@
hd->wr_ios, hd->wr_merges,
hd->wr_sectors,
MSEC(hd->wr_ticks),
- hd->ios_in_flight,
+ atomic_read(&hd->ios_in_flight),
MSEC(hd->io_ticks),
MSEC(hd->aveq));
#undef MSEC
diff -ru linux/drivers/block/ll_rw_blk.c linux-2.4.19pre10/drivers/block/ll_rw_blk.c
--- linux/drivers/block/ll_rw_blk.c Fri Jun 14 13:09:27 2002
+++ linux-2.4.19pre10/drivers/block/ll_rw_blk.c Fri Jun 14 11:19:17 2002
@@ -631,10 +631,10 @@
{
unsigned long now = jiffies;

- hd->aveq += (hd->ios_in_flight * (jiffies - hd->last_queue_change));
+ hd->aveq += ((int)atomic_read(&hd->ios_in_flight) * (jiffies - hd->last_queue_change));
hd->last_queue_change = now;

- if (hd->ios_in_flight)
+ if ((int)atomic_read(&hd->ios_in_flight))
hd->io_ticks += (now - hd->last_idle_time);
hd->last_idle_time = now;
}
@@ -643,13 +643,13 @@
static inline void down_ios(struct hd_struct *hd)
{
disk_round_stats(hd);
- --hd->ios_in_flight;
+ atomic_dec(&hd->ios_in_flight);
}

static inline void up_ios(struct hd_struct *hd)
{
disk_round_stats(hd);
- ++hd->ios_in_flight;
+ atomic_inc(&hd->ios_in_flight);
}

static void account_io_start(struct hd_struct *hd, struct request *req,
diff -ru linux/include/linux/genhd.h linux-2.4.19pre10/include/linux/genhd.h
--- linux/include/linux/genhd.h Fri Jun 14 13:16:45 2002
+++ linux-2.4.19pre10/include/linux/genhd.h Fri Jun 14 12:15:43 2002
@@ -12,6 +12,7 @@
#include <linux/config.h>
#include <linux/types.h>
#include <linux/major.h>
+#include <asm/atomic.h>

enum {
/* These three have identical behaviour; use the second one if DOS fdisk gets
@@ -65,7 +66,7 @@
int number; /* stupid old code wastes space */

/* Performance stats: */
- unsigned int ios_in_flight;
+ atomic_t ios_in_flight;
unsigned int io_ticks;
unsigned int last_idle_time;
unsigned int last_queue_change;


2002-07-26 13:54:22

by Jan Kasprzak

Subject: Re: sard I/O accounting bug in -ac

Jochen Suckfuell wrote:
: However, since you don't have SCSI disks, this will most likely not fix
: your problem. But before I tracked down this missing-lock issue, I was
: able to fix the problem by making the ios_in_flight variable use atomic
: operations. Let me guess: you have an SMP machine, where a race
: condition is likely to occur?

Nope. Single-CPU Athlon 850, VIA KT133 (not KT133A), ASUS
board with on-board VIA IDE and an on-board Promise controller, 1.1 GB RAM
(CONFIG_HIGHMEM), 3c985B NIC.

: Below is my patch to use atomic operations (against 2.4.19-pre10) that
: fixed the problem for me on an SMP machine with SCSI disks. (I don't
: have a non-SCSI SMP machine to check.)
:
: In the first chunk, I also added a check for negative ios_in_flight
: values to the function generating the /proc/partitions output, which
: resets them to zero. You should remove this part to see whether the
: values still become negative.
:
: If this patch works, there must be another place where accounting is
: done without proper spinlock handling. The atomic-patch should not be
: the final solution for this!

I'll look at this, but I am about to leave for a holiday,
so don't expect feedback in the next two weeks or so. Also, I don't want
to experiment with this server more than necessary, and this is not a
critical bug.

: rio/wio are the total number of read/write requests processed, while
: rmerge/wmerge are the number of successful merges of requests.

What do the "merges" mean, and how are they related to the
multcount setting of the HDD? I've tried to run "dd if=/dev/hdc bs=4k
of=/dev/null" on a drive with multcount 16 and readahead 8, and the
rsect count was approximately 126.5 times bigger than rio, while
rsect/rmerge was about 62. I've changed the "bs=4k" to "bs=64k" in the
"dd" command, but the rsect/rio and rsect/rmerge ratios were about the
same.
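
Just as a sanity check on those numbers (assuming 512-byte sectors),
the rsect/rio ratio is simply the average size of the requests that
actually reached the drive:

/* Back-of-the-envelope only, assuming 512-byte sectors. */
#include <stdio.h>

int main(void)
{
    double rsect_per_rio = 126.5;  /* observed rsect/rio ratio quoted above */

    printf("average request size: %.1f sectors = %.2f KB\n",
           rsect_per_rio, rsect_per_rio * 512.0 / 1024.0);
    /* prints: average request size: 126.5 sectors = 63.25 KB */
    return 0;
}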

So how does the read() syscall from dd(1) map to rio requests
(and rmerge requests)? Is it related to the drive multcount at all?

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Czech Linux Homepage: http://www.linux.cz/ |
|----------- If you want the holes in your knowledge showing up -----------|
|----------- try teaching someone. -- Alan Cox -----------|