2006-05-01 05:31:04

by NeilBrown

[permalink] [raw]
Subject: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.


When a md array has been idle (no writes) for 20msecs it is marked as
'clean'. This delay turns out to be too short for some real
workloads. So increase it to 200msec (the time to update the metadata
should be a tiny fraction of that) and make it sysfs-configurable.

Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./Documentation/md.txt | 9 ++++++++
./drivers/md/md.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 61 insertions(+), 2 deletions(-)

diff ./Documentation/md.txt~current~ ./Documentation/md.txt
--- ./Documentation/md.txt~current~ 2006-05-01 15:09:20.000000000 +1000
+++ ./Documentation/md.txt 2006-05-01 15:10:18.000000000 +1000
@@ -207,6 +207,15 @@ All md devices contain:
available. It will then appear at md/dev-XXX (depending on the
name of the device) and further configuration is then possible.

+ safe_mode_delay
+ When an md array has seen no write requests for a certain period
+ of time, it will be marked as 'clean'. When another write
+ request arrive, the array is marked as 'dirty' before the write
+ commenses. This is known as 'safe_mode'.
+ The 'certain period' is controlled by this file which stores the
+ period as a number of seconds. The default is 200msec (0.200).
+ Writing a value of 0 disables safemode.
+
sync_speed_min
sync_speed_max
This are similar to /proc/sys/dev/raid/speed_limit_{min,max}

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~ 2006-05-01 15:10:17.000000000 +1000
+++ ./drivers/md/md.c 2006-05-01 15:10:18.000000000 +1000
@@ -43,6 +43,7 @@
#include <linux/suspend.h>
#include <linux/poll.h>
#include <linux/mutex.h>
+#include <linux/ctype.h>

#include <linux/init.h>

@@ -1968,6 +1969,54 @@ static void analyze_sbs(mddev_t * mddev)
}

static ssize_t
+safe_delay_show(mddev_t *mddev, char *page)
+{
+ int msec = (mddev->safemode_delay*1000)/HZ;
+ return sprintf(page, "%d.%03d\n", msec/1000, msec%1000);
+}
+static ssize_t
+safe_delay_store(mddev_t *mddev, const char *cbuf, size_t len)
+{
+ int scale=1;
+ int dot=0;
+ int i;
+ unsigned long msec;
+ char buf[30];
+ char *e;
+ /* remove a period, and count digits after it */
+ if (len >= sizeof(buf))
+ return -EINVAL;
+ strlcpy(buf, cbuf, len);
+ buf[len] = 0;
+ for (i=0; i<len; i++) {
+ if (dot) {
+ if (isdigit(buf[i])) {
+ buf[i-1] = buf[i];
+ scale *= 10;
+ }
+ buf[i] = 0;
+ } else if (buf[i] == '.') {
+ dot=1;
+ buf[i] = 0;
+ }
+ }
+ msec = simple_strtoul(buf, &e, 10);
+ if (e == buf || (*e && *e != '\n'))
+ return -EINVAL;
+ msec = (msec * 1000) / scale;
+ if (msec == 0)
+ mddev->safemode_delay = 0;
+ else {
+ mddev->safemode_delay = (msec*HZ)/1000;
+ if (mddev->safemode_delay == 0)
+ mddev->safemode_delay = 1;
+ }
+ return len;
+}
+static struct md_sysfs_entry md_safe_delay =
+__ATTR(safe_mode_delay, 0644,safe_delay_show, safe_delay_store);
+
+static ssize_t
level_show(mddev_t *mddev, char *page)
{
struct mdk_personality *p = mddev->pers;
@@ -2423,6 +2472,7 @@ static struct attribute *md_default_attr
&md_size.attr,
&md_metadata.attr,
&md_new_device.attr,
+ &md_safe_delay.attr,
NULL,
};

@@ -2695,7 +2745,7 @@ static int do_md_run(mddev_t * mddev)
mddev->safemode = 0;
mddev->safemode_timer.function = md_safemode_timeout;
mddev->safemode_timer.data = (unsigned long) mddev;
- mddev->safemode_delay = (20 * HZ)/1000 +1; /* 20 msec delay */
+ mddev->safemode_delay = (200 * HZ)/1000 +1; /* 200 msec delay */
mddev->in_sync = 1;

ITERATE_RDEV(mddev,rdev,tmp)
@@ -4581,7 +4631,7 @@ void md_write_end(mddev_t *mddev)
if (atomic_dec_and_test(&mddev->writes_pending)) {
if (mddev->safemode == 2)
md_wakeup_thread(mddev->thread);
- else
+ else if (mddev->safemode_delay)
mod_timer(&mddev->safemode_timer, jiffies + mddev->safemode_delay);
}
}


2006-05-01 05:45:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.

NeilBrown <[email protected]> wrote:
>
>
> When a md array has been idle (no writes) for 20msecs it is marked as
> 'clean'. This delay turns out to be too short for some real
> workloads. So increase it to 200msec (the time to update the metadata
> should be a tiny fraction of that) and make it sysfs-configurable.
>
>
> ...
>
> + safe_mode_delay
> + When an md array has seen no write requests for a certain period
> + of time, it will be marked as 'clean'. When another write
> + request arrive, the array is marked as 'dirty' before the write
> + commenses. This is known as 'safe_mode'.
> + The 'certain period' is controlled by this file which stores the
> + period as a number of seconds. The default is 200msec (0.200).
> + Writing a value of 0 disables safemode.
> +

Why not make the units milliseconds? Rename this to safe_mode_delay_msecs
to remove any doubt.

> +static ssize_t
> +safe_delay_store(mddev_t *mddev, const char *cbuf, size_t len)
> +{
> + int scale=1;
> + int dot=0;
> + int i;
> + unsigned long msec;
> + char buf[30];
> + char *e;
> + /* remove a period, and count digits after it */
> + if (len >= sizeof(buf))
> + return -EINVAL;
> + strlcpy(buf, cbuf, len);
> + buf[len] = 0;
> + for (i=0; i<len; i++) {
> + if (dot) {
> + if (isdigit(buf[i])) {
> + buf[i-1] = buf[i];
> + scale *= 10;
> + }
> + buf[i] = 0;
> + } else if (buf[i] == '.') {
> + dot=1;
> + buf[i] = 0;
> + }
> + }
> + msec = simple_strtoul(buf, &e, 10);
> + if (e == buf || (*e && *e != '\n'))
> + return -EINVAL;
> + msec = (msec * 1000) / scale;
> + if (msec == 0)
> + mddev->safemode_delay = 0;
> + else {
> + mddev->safemode_delay = (msec*HZ)/1000;
> + if (mddev->safemode_delay == 0)
> + mddev->safemode_delay = 1;
> + }
> + return len;

And most of that goes away.


2006-05-01 06:14:58

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.

Neil Brown <[email protected]> wrote:
>
> On Sunday April 30, [email protected] wrote:
> > NeilBrown <[email protected]> wrote:
> > >
> > >
> > > When a md array has been idle (no writes) for 20msecs it is marked as
> > > 'clean'. This delay turns out to be too short for some real
> > > workloads. So increase it to 200msec (the time to update the metadata
> > > should be a tiny fraction of that) and make it sysfs-configurable.
> > >
> > >
> > > ...
> > >
> > > + safe_mode_delay
> > > + When an md array has seen no write requests for a certain period
> > > + of time, it will be marked as 'clean'. When another write
> > > + request arrive, the array is marked as 'dirty' before the write
> > > + commenses. This is known as 'safe_mode'.
> > > + The 'certain period' is controlled by this file which stores the
> > > + period as a number of seconds. The default is 200msec (0.200).
> > > + Writing a value of 0 disables safemode.
> > > +
> >
> > Why not make the units milliseconds? Rename this to safe_mode_delay_msecs
> > to remove any doubt.
>
> Because umpteen years ago when I was adding thread-usage statistics to
> /proc/net/rpc/nfsd I used milliseconds and Linus asked me to make it
> seconds - a much more "obvious" unit. See Email below.
> It seems very sensible to me.

That's output. It's easier to do the conversion with output. And I guess
one could argue that lots of people read /proc files, but few write to
them.

Generally I don't think we should be teaching the kernel to accept
pretend-floating-point numbers like this, especially when a) "delay in
milliseconds" is such a simple concept and b) it's so easy to go from float
to milliseconds in userspace.

Do you really expect that humans (really dumb ones ;)) will be echoing
numbers into this file? Or will it mainly be a thing for mdadm to fiddle
with?

2006-05-01 06:02:44

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.

On Sunday April 30, [email protected] wrote:
> NeilBrown <[email protected]> wrote:
> >
> >
> > When a md array has been idle (no writes) for 20msecs it is marked as
> > 'clean'. This delay turns out to be too short for some real
> > workloads. So increase it to 200msec (the time to update the metadata
> > should be a tiny fraction of that) and make it sysfs-configurable.
> >
> >
> > ...
> >
> > + safe_mode_delay
> > + When an md array has seen no write requests for a certain period
> > + of time, it will be marked as 'clean'. When another write
> > + request arrive, the array is marked as 'dirty' before the write
> > + commenses. This is known as 'safe_mode'.
> > + The 'certain period' is controlled by this file which stores the
> > + period as a number of seconds. The default is 200msec (0.200).
> > + Writing a value of 0 disables safemode.
> > +
>
> Why not make the units milliseconds? Rename this to safe_mode_delay_msecs
> to remove any doubt.

Because umpteen years ago when I was adding thread-usage statistics to
/proc/net/rpc/nfsd I used milliseconds and Linus asked me to make it
seconds - a much more "obvious" unit. See Email below.
It seems very sensible to me.

...
> > + msec = simple_strtoul(buf, &e, 10);
> > + if (e == buf || (*e && *e != '\n'))
> > + return -EINVAL;
> > + msec = (msec * 1000) / scale;
> > + if (msec == 0)
> > + mddev->safemode_delay = 0;
> > + else {
> > + mddev->safemode_delay = (msec*HZ)/1000;
> > + if (mddev->safemode_delay == 0)
> > + mddev->safemode_delay = 1;
> > + }
> > + return len;
>
> And most of that goes away.

Maybe it could go in a library :-?

NeilBrown


------------------------------------------------------------
From: Linus Torvalds <[email protected]>
To: Neil Brown <[email protected]>
cc: [email protected]
Subject: Re: PATCH knfsd - stats tidy up.
Date: Tue, 18 Jul 2000 12:21:12 -0700 (PDT)
Content-Type: TEXT/PLAIN; charset=US-ASCII



On Tue, 18 Jul 2000, Neil Brown wrote:
>
> The following patch converts jiffies to milliseconds for output, and
> also makes the number wrap predicatably at 1,000,000 seconds
> (approximately one fortnight).

If no programs depend on the format, I actually prefer format changes like
this to be of the "obvious" kind. One such obvious kind is the format

0.001

which obviously means 0.001 seconds.

And yes, I'm _really_ sorry that a lot of the old /proc files contain
jiffies. Lazy. Ugly. Bad. Much of it my bad.

Doing 0.001 doesn't mean that you have to use floating point, in fact
you've done most of the work already in your ms patch, just splitting
things out a bit works well:

/* gcc knows to combine / and % - generate one "divl" */
unsigned int sec = time / HZ, msec = time % HZ;
msec = (msec * 1000) / HZ;

sprintf(" %d.%03d", sec, msec)

(It's basically the same thing you already do, except it doesn't
re-combine the seconds and milliseconds but just prints them out
separately.. And it has the advantage that if you want to change it to
microseconds some day, you can do so very trivially without breaking the
format. Plus it's readable as hell.)

Linus

2006-05-01 06:15:15

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.

Neil Brown wrote:

>On Sunday April 30, [email protected] wrote:
>
>>NeilBrown <[email protected]> wrote:
>>
>>>
>>>When a md array has been idle (no writes) for 20msecs it is marked as
>>>'clean'. This delay turns out to be too short for some real
>>>workloads. So increase it to 200msec (the time to update the metadata
>>>should be a tiny fraction of that) and make it sysfs-configurable.
>>>
>>>
>>>...
>>>
>>>+ safe_mode_delay
>>>+ When an md array has seen no write requests for a certain period
>>>+ of time, it will be marked as 'clean'. When another write
>>>+ request arrive, the array is marked as 'dirty' before the write
>>>+ commenses. This is known as 'safe_mode'.
>>>+ The 'certain period' is controlled by this file which stores the
>>>+ period as a number of seconds. The default is 200msec (0.200).
>>>+ Writing a value of 0 disables safemode.
>>>+
>>>
>>Why not make the units milliseconds? Rename this to safe_mode_delay_msecs
>>to remove any doubt.
>>
>
>Because umpteen years ago when I was adding thread-usage statistics to
>/proc/net/rpc/nfsd I used milliseconds and Linus asked me to make it
>seconds - a much more "obvious" unit. See Email below.
>It seems very sensible to me.
>

Either way, all ambiguity is removed if you put the unit in the name. And
don't use jiffies because that obviously is not portable (which sounds like
it was Linus' biggest concern).

Once you do that, I don't much care whether you use seconds or milliseconds.
Other than to note that many of our units now are ms, especially when
they're
measuring things at or around the ms order of magnitude. But I'm not
aware of
so many proc values that don't work in integers.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-01 15:17:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.



On Sun, 30 Apr 2006, Andrew Morton wrote:
>
> Generally I don't think we should be teaching the kernel to accept
> pretend-floating-point numbers like this, especially when a) "delay in
> milliseconds" is such a simple concept and b) it's so easy to go from float
> to milliseconds in userspace.
>
> Do you really expect that humans (really dumb ones ;)) will be echoing
> numbers into this file? Or will it mainly be a thing for mdadm to fiddle
> with?

I generally hate interfaces that have some "random base".

So "delay in seconds" is not a random base, because "seconds" is a good SI
base unit, and there's not a lot of question about it. But once you start
talking milliseconds on microseconds, I'd actually much rather have a
"fake floating point number" over having different files have different
(magic) base constants. How do you remember which are milliseconds, which
are microseconds, and which are just seconds?

It should be easy to have a helper function or two that takes a "struct
timeval" and reads/writes a "float".

Linus

2006-05-02 05:56:32

by bert hubert

[permalink] [raw]
Subject: Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.

On Mon, May 01, 2006 at 03:30:19PM +1000, NeilBrown wrote:
> When a md array has been idle (no writes) for 20msecs it is marked as
> 'clean'. This delay turns out to be too short for some real
> workloads. So increase it to 200msec (the time to update the metadata
> should be a tiny fraction of that) and make it sysfs-configurable.

What does this mean, 'too short'? What happens in that case, backing block
devices are still busy writing? When making this configurable, the help text
better explain what the trade offs are.

Thanks.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2006-05-09 01:40:48

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.

On Tuesday May 2, [email protected] wrote:
> On Mon, May 01, 2006 at 03:30:19PM +1000, NeilBrown wrote:
> > When a md array has been idle (no writes) for 20msecs it is marked as
> > 'clean'. This delay turns out to be too short for some real
> > workloads. So increase it to 200msec (the time to update the metadata
> > should be a tiny fraction of that) and make it sysfs-configurable.
>
> What does this mean, 'too short'? What happens in that case, backing block
> devices are still busy writing? When making this configurable, the help text
> better explain what the trade offs are.

"too short" means that the update happens often enough to cause a
noticeable performance degradation.

In an application writes steadily very 21msecs (or maybe 30msecs) then
there will be 2 superblock writes and 1 application write every
21msecs, and this causes enough disk io to close the app down. - I
guess all the updates fill up the 21msec space.

With a larger delay - 200msec - you could still get bad situations
e.g. with the app writing every 210msecs. However 2 superblock
updates plus one app write is a much smaller fraction of 200msecs, so
there shouldn't be as many problems.

Yes, a more detailed explanation should go in Documentation/md.txt

NeilBrown