2003-05-31 22:22:40

by Michael Buesch

[permalink] [raw]
Subject: [2.5.70] possible problem with /dev/diskstats

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi.

I've just played around with my server (that has actualy no load)
and I recognized something strange in /dev/diskstats.

Documentation/iostats.txt says about diskstats:
[SNIP]
Field 9 -- # of I/Os currently in progress
The only field that should go to zero. Incremented as requests are
given to appropriate request_queue_t and decremented as they finish.
[SNIP]

But here is a cat /proc/diskstats:
1 0 ram0 0 0 0 0 0 0 0 0 0 0 0
1 1 ram1 0 0 0 0 0 0 0 0 0 0 0
1 2 ram2 0 0 0 0 0 0 0 0 0 0 0
1 3 ram3 0 0 0 0 0 0 0 0 0 0 0
1 4 ram4 0 0 0 0 0 0 0 0 0 0 0
1 5 ram5 0 0 0 0 0 0 0 0 0 0 0
1 6 ram6 0 0 0 0 0 0 0 0 0 0 0
1 7 ram7 0 0 0 0 0 0 0 0 0 0 0
1 8 ram8 0 0 0 0 0 0 0 0 0 0 0
1 9 ram9 0 0 0 0 0 0 0 0 0 0 0
1 10 ram10 0 0 0 0 0 0 0 0 0 0 0
1 11 ram11 0 0 0 0 0 0 0 0 0 0 0
1 12 ram12 0 0 0 0 0 0 0 0 0 0 0
1 13 ram13 0 0 0 0 0 0 0 0 0 0 0
1 14 ram14 0 0 0 0 0 0 0 0 0 0 0
1 15 ram15 0 0 0 0 0 0 0 0 0 0 0
7 0 loop0 0 0 0 0 0 0 0 0 0 0 0
7 1 loop1 0 0 0 0 0 0 0 0 0 0 0
7 2 loop2 0 0 0 0 0 0 0 0 0 0 0
7 3 loop3 0 0 0 0 0 0 0 0 0 0 0
7 4 loop4 0 0 0 0 0 0 0 0 0 0 0
7 5 loop5 0 0 0 0 0 0 0 0 0 0 0
7 6 loop6 0 0 0 0 0 0 0 0 0 0 0
7 7 loop7 0 0 0 0 0 0 0 0 0 0 0
3 0 hda 948 317 16216 4294408142 90 333 848 7309 4294967294 7309215 4280372198
~~~~~~~~~~
3 1 hda1 4 8 0 0
3 2 hda2 1258 16200 424 848
3 64 hdb 434 0 3456 4294402902 337 261 4808 3174 4294967294 7313591 4280362822
~~~~~~~~~~
3 65 hdb1 0 0 0 0
3 66 hdb2 431 3448 601 4808

That's definately not zero. 8)

The server seems to run fine and it seems to be only
a displaying-problem.
Kernel is 2.5.70 vanilla with the attached patch by Andrew Morton.
hda and hdb are two small 2GB ide-disks.
If you need more information, please just ask for what you want. :)

UH, I've just recognized, that this value is one
before UNSIGNED_LONG_MAX (4,294,867,295).

- --
Regards Michael B?sch
http://www.8ung.at/tuxsoft
00:23:28 up 9:55, 5 users, load average: 1.02, 1.04, 1.00


[ANDREW'S PATCH]

This /proc tunable sets the kupdate interval. It has a couple of problems:

- - No way to turn it off completely (userspace dirty memory management
solutions require this).

- - If it has been set to one hour and then the user resets it to five
seconds, that resetting will not take effect for up to an hour.

Fix that up by providing a sysctl handler. Setting the tunable to zero now
disables the kupdate function.


Documentation/filesystems/proc.txt | 2 ++
include/linux/writeback.h | 4 ++++
kernel/sysctl.c | 16 +---------------
mm/page-writeback.c | 20 +++++++++++++++++++-
4 files changed, 26 insertions(+), 16 deletions(-)

diff -puN Documentation/filesystems/proc.txt~writeback-interval-restart Documentation/filesystems/proc.txt
- --- 25/Documentation/filesystems/proc.txt~writeback-interval-restart 2003-05-31 12:22:37.000000000 -0700
+++ 25-akpm/Documentation/filesystems/proc.txt 2003-05-31 12:22:37.000000000 -0700
@@ -1068,6 +1068,8 @@ The pdflush writeback daemons will perio
out to disk. This tunable expresses the interval between those wakeups, in
100'ths of a second.

+Setting this to zero disables periodic writeback altogether.
+
dirty_expire_centisecs
----------------------

diff -puN include/linux/writeback.h~writeback-interval-restart include/linux/writeback.h
- --- 25/include/linux/writeback.h~writeback-interval-restart 2003-05-31 12:22:37.000000000 -0700
+++ 25-akpm/include/linux/writeback.h 2003-05-31 12:22:37.000000000 -0700
@@ -78,6 +78,10 @@ extern int vm_dirty_ratio;
extern int dirty_writeback_centisecs;
extern int dirty_expire_centisecs;

+struct ctl_table;
+struct file;
+int dirty_writeback_centisecs_handler(struct ctl_table *, int, struct file *,
+ void *, size_t *);

void page_writeback_init(void);
void balance_dirty_pages(struct address_space *mapping);
diff -puN kernel/sysctl.c~writeback-interval-restart kernel/sysctl.c
- --- 25/kernel/sysctl.c~writeback-interval-restart 2003-05-31 12:22:37.000000000 -0700
+++ 25-akpm/kernel/sysctl.c 2003-05-31 12:22:37.000000000 -0700
@@ -276,7 +276,6 @@ static ctl_table kern_table[] = {
/* Constants for minimum and maximum testing in vm_table.
We use these as one-element integer vectors. */
static int zero = 0;
- -static int one = 1;
static int one_hundred = 100;


@@ -297,20 +296,7 @@ static ctl_table vm_table[] = {
&sysctl_intvec, NULL, &zero, &one_hundred },
{VM_DIRTY_WB_CS, "dirty_writeback_centisecs",
&dirty_writeback_centisecs, sizeof(dirty_writeback_centisecs), 0644,
- - NULL, &proc_dointvec_minmax, &sysctl_intvec, NULL,
- - /* Here, we define the range of possible values for
- - dirty_writeback_centisecs.
- -
- - The default value is 5 seconds (500 centisec). We will use 1
- - centisec, the smallest possible value that could make any sort of
- - sense. If we allowed the user to set the interval to 0 seconds
- - (which would presumably mean to chew up all of the CPU looking for
- - dirty pages and writing them out, without taking a break), the
- - interval would effectively become 1 second (100 centisecs), due to
- - some nicely documented throttling code in wb_kupdate().
- -
- - There is no maximum legal value for dirty_writeback. */
- - &one , NULL},
+ NULL, dirty_writeback_centisecs_handler },
{VM_DIRTY_EXPIRE_CS, "dirty_expire_centisecs",
&dirty_expire_centisecs, sizeof(dirty_expire_centisecs), 0644,
NULL, &proc_dointvec},
diff -puN mm/page-writeback.c~writeback-interval-restart mm/page-writeback.c
- --- 25/mm/page-writeback.c~writeback-interval-restart 2003-05-31 12:22:37.000000000 -0700
+++ 25-akpm/mm/page-writeback.c 2003-05-31 12:22:37.000000000 -0700
@@ -26,6 +26,7 @@
#include <linux/percpu.h>
#include <linux/notifier.h>
#include <linux/smp.h>
+#include <linux/sysctl.h>

/*
* The maximum number of pages to writeout in a single bdflush/kupdate
@@ -329,7 +330,24 @@ static void wb_kupdate(unsigned long arg
}
if (time_before(next_jif, jiffies + HZ))
next_jif = jiffies + HZ;
- - mod_timer(&wb_timer, next_jif);
+ if (dirty_writeback_centisecs)
+ mod_timer(&wb_timer, next_jif);
+}
+
+/*
+ * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
+ */
+int dirty_writeback_centisecs_handler(ctl_table *table, int write,
+ struct file *file, void *buffer, size_t *length)
+{
+ proc_dointvec(table, write, file, buffer, length);
+ if (dirty_writeback_centisecs) {
+ mod_timer(&wb_timer,
+ jiffies + (dirty_writeback_centisecs * HZ) / 100);
+ } else {
+ del_timer(&wb_timer);
+ }
+ return 0;
}

static void wb_timer_fn(unsigned long unused)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+2S5OoxoigfggmSgRAlavAKCNTA7U3afJJKdZHLHYbytludY1/wCfWrv+
PVN2W38sdFQ7MqQLfk01nY4=
=1pjP
-----END PGP SIGNATURE-----


2003-06-01 18:51:54

by Rick Lindsley

[permalink] [raw]
Subject: Re: [2.5.70] possible problem with /dev/diskstats

Documentation/iostats.txt says about diskstats:
[SNIP]
Field 9 -- # of I/Os currently in progress
The only field that should go to zero. Incremented as requests are
given to appropriate request_queue_t and decremented as they finish.
[SNIP]

But here is a cat /proc/diskstats:
3 0 hda 948 317 16216 4294408142 90 333 848 7309 4294967294 7309215 4280372198
~~~~~~~~~~

Yes, I've had a couple of other reports of this. I suspect there is
a path by which an "I/O" appears to have been completed while none was
begun. I've only noticed this on my hda drive as well. What I didn't
notice was exactly when this behavior began, which may have been useful
in tracking down the problem. You'll notice a couple of other values
there in the 4.2 billion range that probably suffer from a similar
(or maybe the same) issue. I haven't seen this on SCSI drives yet,
just hda drives, which suggests there may be something about that I/O
path that bears some closer scrutiny.

Rick

2003-06-01 19:18:43

by Michael Buesch

[permalink] [raw]
Subject: Re: [2.5.70] possible problem with /dev/diskstats

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

[I think we can take Andrew out of the cc list, because it
seems to be not caused by his patch, then]
On Sunday 01 June 2003 21:05, Rick Lindsley wrote:
> Documentation/iostats.txt says about diskstats:
> [SNIP]
> Field 9 -- # of I/Os currently in progress
> The only field that should go to zero. Incremented as requests are
> given to appropriate request_queue_t and decremented as they
> finish. [SNIP]
>
> But here is a cat /proc/diskstats:
> 3 0 hda 948 317 16216 4294408142 90 333 848 7309 4294967294
> 7309215 4280372198 ~~~~~~~~~~
>
> Yes, I've had a couple of other reports of this. I suspect there is
> a path by which an "I/O" appears to have been completed while none was
> begun. I've only noticed this on my hda drive as well. What I didn't
> notice was exactly when this behavior began, which may have been useful
> in tracking down the problem. You'll notice a couple of other values
> there in the 4.2 billion range that probably suffer from a similar
> (or maybe the same) issue. I haven't seen this on SCSI drives yet,
> just hda drives, which suggests there may be something about that I/O
> path that bears some closer scrutiny.

I've restarted the server and it has an uptime of
21:26:36 up 3:20, 1 user, load average: 0.00, 0.00, 0.00

I've done some disk-io (compiled some software) and the values
now are even lower:

3 0 hda 1623 415 28028 128689845 408 1657 4144 61294 4294967234 11728672 3714354303
3 1 hda1 4 8 1 2
3 2 hda2 1966 28002 2071 4142
3 64 hdb 381 18 3272 124256894 1662 1332 24000 61697 4294967248 11733826 3870100758

Directly after reboot it is:
21:30:55 up 1 min, 1 user, load average: 0.15, 0.08, 0.02

3 0 hda 904 311 15432 4294408137 66 213 564 7358 4294967294 67078 4294856106
3 1 hda1 4 8 0 0
3 2 hda2 1208 15416 282 564
3 64 hdb 45 0 344 4294400418 8 3 88 404 4294967294 72358 4294841430


Where exactly is the code in the kernel, that produces the diskstats file,
so I can try grepping through it?

>
> Rick

- --
Regards Michael B?sch
http://www.8ung.at/tuxsoft
21:23:47 up 2:09, 3 users, load average: 1.01, 1.01, 1.00

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+2lSToxoigfggmSgRAuSgAJ42rV0+KTChl6OzUXf7KftHyi8b/wCfXaaJ
nOQzW3tR2UwVo15DpdRPzUQ=
=nGvx
-----END PGP SIGNATURE-----

2003-06-02 04:09:54

by Rick Lindsley

[permalink] [raw]
Subject: Re: [2.5.70] possible problem with /dev/diskstats

I've done some disk-io (compiled some software) and the values
now are even lower:

Interesting, yes, you now have 60 fewer "I/O's in flight."

Where exactly is the code in the kernel, that produces the diskstats file,
so I can try grepping through it?

The I/O's in flight are incremented in drive_stat_acct() (see ll_rw_blk.c)
and decremented in attempt_merge() and end_that_request_last(). My gut feel
is that end_that_request_last() is getting called more often than expected
via some interesting path, but I've nothing right now to back that up.

Rick

2003-06-02 04:54:38

by Maneesh Soni

[permalink] [raw]
Subject: Re: [2.5.70] possible problem with /dev/diskstats

On Sat, May 31, 2003 at 10:40:21PM +0000, Michael Buesch wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi.
>
> I've just played around with my server (that has actualy no load)
> and I recognized something strange in /dev/diskstats.
>
> Documentation/iostats.txt says about diskstats:
> [SNIP]
> Field 9 -- # of I/Os currently in progress
> The only field that should go to zero. Incremented as requests are
> given to appropriate request_queue_t and decremented as they finish.
> [SNIP]
>
> But here is a cat /proc/diskstats:
> 1 0 ram0 0 0 0 0 0 0 0 0 0 0 0
> 1 1 ram1 0 0 0 0 0 0 0 0 0 0 0
> 1 2 ram2 0 0 0 0 0 0 0 0 0 0 0
> 1 3 ram3 0 0 0 0 0 0 0 0 0 0 0
> 1 4 ram4 0 0 0 0 0 0 0 0 0 0 0
> 1 5 ram5 0 0 0 0 0 0 0 0 0 0 0
> 1 6 ram6 0 0 0 0 0 0 0 0 0 0 0
> 1 7 ram7 0 0 0 0 0 0 0 0 0 0 0
> 1 8 ram8 0 0 0 0 0 0 0 0 0 0 0
> 1 9 ram9 0 0 0 0 0 0 0 0 0 0 0
> 1 10 ram10 0 0 0 0 0 0 0 0 0 0 0
> 1 11 ram11 0 0 0 0 0 0 0 0 0 0 0
> 1 12 ram12 0 0 0 0 0 0 0 0 0 0 0
> 1 13 ram13 0 0 0 0 0 0 0 0 0 0 0
> 1 14 ram14 0 0 0 0 0 0 0 0 0 0 0
> 1 15 ram15 0 0 0 0 0 0 0 0 0 0 0

Rick,

ramdisk stats are also always zero whether you do any IO or not. Any idea
where this can be corrected.

Thanks
Maneesh


--
Maneesh Soni
IBM Linux Technology Center,
IBM India Software Lab, Bangalore.
Phone: +91-80-5044999 email: [email protected]
http://lse.sourceforge.net/

2003-06-02 14:12:10

by Michael Buesch

[permalink] [raw]
Subject: Re: [2.5.70] possible problem with /dev/diskstats

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Monday 02 June 2003 07:10, Maneesh Soni wrote:
> On Sat, May 31, 2003 at 10:40:21PM +0000, Michael Buesch wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Hi.
> >
> > I've just played around with my server (that has actualy no load)
> > and I recognized something strange in /dev/diskstats.
> >
> > Documentation/iostats.txt says about diskstats:
> > [SNIP]
> > Field 9 -- # of I/Os currently in progress
> > The only field that should go to zero. Incremented as requests are
> > given to appropriate request_queue_t and decremented as they finish.
> > [SNIP]
> >
> > But here is a cat /proc/diskstats:
> > 1 0 ram0 0 0 0 0 0 0 0 0 0 0 0
> > 1 1 ram1 0 0 0 0 0 0 0 0 0 0 0
> > 1 2 ram2 0 0 0 0 0 0 0 0 0 0 0
> > 1 3 ram3 0 0 0 0 0 0 0 0 0 0 0
> > 1 4 ram4 0 0 0 0 0 0 0 0 0 0 0
> > 1 5 ram5 0 0 0 0 0 0 0 0 0 0 0
> > 1 6 ram6 0 0 0 0 0 0 0 0 0 0 0
> > 1 7 ram7 0 0 0 0 0 0 0 0 0 0 0
> > 1 8 ram8 0 0 0 0 0 0 0 0 0 0 0
> > 1 9 ram9 0 0 0 0 0 0 0 0 0 0 0
> > 1 10 ram10 0 0 0 0 0 0 0 0 0 0 0
> > 1 11 ram11 0 0 0 0 0 0 0 0 0 0 0
> > 1 12 ram12 0 0 0 0 0 0 0 0 0 0 0
> > 1 13 ram13 0 0 0 0 0 0 0 0 0 0 0
> > 1 14 ram14 0 0 0 0 0 0 0 0 0 0 0
> > 1 15 ram15 0 0 0 0 0 0 0 0 0 0 0
>
> Rick,
>
> ramdisk stats are also always zero whether you do any IO or not. Any idea
> where this can be corrected.

Hmm, yes, I've had /var on ram0 in this example, but it doesn't show
any statistics for it.

> Thanks
> Maneesh

- --
Regards Michael B?sch
http://www.8ung.at/tuxsoft
16:17:13 up 20 min, 1 user, load average: 1.00, 0.97, 0.74

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+21z+oxoigfggmSgRAvqPAJ9ZXw2Rrk1FZQovWUu58xQlNEareACeLVoo
ZVw22gyiSz3ehXdtxaK+LlM=
=Rw7M
-----END PGP SIGNATURE-----