2008-03-22 11:25:38

by Bernd Schubert

Subject: deadline unfairness

Hello,

Somehow it seems the deadline scheduler is rather unfair. Below is an example
of the md-raid6 initialization of md3, md4 and md5. All three md devices
share the same block devices (we have patched md to allow parallel rebuilds
of shared block devices, since for us the CPU is the bottleneck, not the
block devices).

All rebuilds started at basically the same time. As you can see, md3 is
already done, and md4 now rebuilds substantially faster than md5.


md5 : active raid6 sdk3[0] sde3[5] sdi3[4] sdm3[3] sdc3[2] sdg3[1]
6834869248 blocks level 6, 256k chunk, algorithm 2 [6/6] [UUUUUU]
[=============>.......] resync = 65.8% (1124909328/1708717312) finish=272.2min speed=35734K/sec

md4 : active raid6 sdk2[0] sde2[5] sdi2[4] sdm2[3] sdc2[2] sdg2[1]
6834869248 blocks level 6, 256k chunk, algorithm 2 [6/6] [UUUUUU]
[===============>.....] resync = 77.6% (1327362312/1708717312) finish=123.9min speed=51283K/sec

md3 : active raid6 sdk1[0] sde1[5] sdi1[4] sdm1[3] sdc1[2] sdg1[1]
6834869248 blocks level 6, 256k chunk, algorithm 2 [6/6] [UUUUUU]


Reducing write_expire to 2000ms improved the situation a bit, but noop and
the other schedulers are still far fairer.
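
For reference, these are the usual sysfs knobs we used (sdc stands for any
of the member disks here; adjust the device names):

  # check which scheduler is active (the one shown in brackets)
  cat /sys/block/sdc/queue/scheduler

  # lower deadline's write expiry from the default 5000ms to 2000ms
  echo 2000 > /sys/block/sdc/queue/iosched/write_expire

  # or switch the queue to noop entirely
  echo noop > /sys/block/sdc/queue/scheduler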

Here with noop:

md5 : active raid6 sdk3[0] sde3[5] sdi3[4] sdm3[3] sdc3[2] sdg3[1]
6834869248 blocks level 6, 256k chunk, algorithm 2 [6/6] [UUUUUU]
[=============>.......] resync = 67.3% (1150741776/1708717312) finish=216.8min speed=42875K/sec

md4 : active raid6 sdk2[0] sde2[5] sdi2[4] sdm2[3] sdc2[2] sdg2[1]
6834869248 blocks level 6, 256k chunk, algorithm 2 [6/6] [UUUUUU]
[===============>.....] resync = 79.3% (1355377160/1708717312) finish=134.8min speed=43659K/sec

md3 : active raid6 sdk1[0] sde1[5] sdi1[4] sdm1[3] sdc1[2] sdg1[1]
6834869248 blocks level 6, 256k chunk, algorithm 2 [6/6] [UUUUUU]


This is basically a 2.6.22 kernel + lustre + md-backports, with nothing
done to the scheduler.


Cheers,
Bernd


2008-03-22 12:15:50

by Aaron Carroll

Subject: Re: deadline unfairness

Bernd Schubert wrote:
> Hello,
>
> Somehow it seems the deadline scheduler is rather unfair. Below is an
> example of the md-raid6 initialization of md3, md4 and md5. All three
> md devices share the same block devices (we have patched md to allow
> parallel rebuilds of shared block devices, since for us the CPU is the
> bottleneck, not the block devices).
>
> All rebuilds started at basically the same time. As you can see, md3 is
> already done, and md4 now rebuilds substantially faster than md5.
> [..]
> This is basically a 2.6.22 kernel + lustre + md-backports, with nothing
> done to the scheduler.

Hi Bernd,

There is a deadline bug in pre-2.6.24 kernels where lower-sector requests can starve
higher-sector requests; you might be hitting this bug. It was fixed by commit:
6f5d8aa6382eef2b26032c88656270bdae7f0c42
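
If you have a clone of Linus' tree handy, git can tell you which release
first shipped the fix, e.g.:

  git describe --contains 6f5d8aa6382eef2b26032c88656270bdae7f0c42

which should name a v2.6.24 tag.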


-- Aaron

2008-03-22 13:06:19

by Bernd Schubert

Subject: Re: deadline unfairness

Hello Aaron,

On Saturday 22 March 2008, Aaron Carroll wrote:
> Bernd Schubert wrote:
> > Hello,
> >
> > Somehow it seems the deadline scheduler is rather unfair. Below is an
> > example of the md-raid6 initialization of md3, md4 and md5. All three
> > md devices share the same block devices (we have patched md to allow
> > parallel rebuilds of shared block devices, since for us the CPU is
> > the bottleneck, not the block devices).
> >
> > All rebuilds started at basically the same time. As you can see, md3
> > is already done, and md4 now rebuilds substantially faster than md5.
> > [..]
> > This is basically a 2.6.22 kernel + lustre + md-backports, with
> > nothing done to the scheduler.
>
> Hi Bernd,
>
> There is a deadline bug in pre-2.6.24 kernels where lower-sector requests
> can starve higher-sector requests; you might be hitting this bug. It was
> fixed by commit: 6f5d8aa6382eef2b26032c88656270bdae7f0c42

thanks a lot for your help! I will build a new kernel later today and then
report back whether it helps. Commit dfb3d72a9aa519672c9ae06f0d2f93eccb35482f
also looks useful...

Thanks again,
Bernd

2008-03-22 23:38:53

by Jan Engelhardt

Subject: Re: deadline unfairness


On Mar 22 2008 12:25, Bernd Schubert wrote:
>
> Somehow it seems the deadline scheduler is rather unfair. Below is an
> example of the md-raid6 initialization of md3, md4 and md5. All three
> md devices share the same block devices (we have patched md to allow
> parallel rebuilds of shared block devices, since for us the CPU is the
> bottleneck, not the block devices).

Could you share this patch? It would be really interesting for use
with fast block devices (flash, ramdisk, and such)!

2008-03-24 16:16:21

by Bernd Schubert

Subject: Re: deadline unfairness

Hello Jan,

On Sunday 23 March 2008, Jan Engelhardt wrote:
> On Mar 22 2008 12:25, Bernd Schubert wrote:
> > Somehow it seems the deadline scheduler is rather unfair. Below is an
> > example of the md-raid6 initialization of md3, md4 and md5. All three
> > md devices share the same block devices (we have patched md to allow
> > parallel rebuilds of shared block devices, since for us the CPU is
> > the bottleneck, not the block devices).
>
> Could you share this patch? It would be really interesting for use
> with fast block devices (flash, ramdisk, and such)!

I already tried to push the patch to Neil, but I guess Neil was too busy to
look at it. The patch below is for 2.6.22, but it also applies to 2.6.25-git.


Signed-off-by: Bernd Schubert <[email protected]>

Index: linux-2.6.22/drivers/md/md.c
===================================================================
--- linux-2.6.22.orig/drivers/md/md.c	2007-12-06 19:51:55.000000000 +0100
+++ linux-2.6.22/drivers/md/md.c	2007-12-07 12:07:47.000000000 +0100
@@ -74,6 +74,8 @@ static DEFINE_SPINLOCK(pers_lock);
 
 static void md_print_devices(void);
 
+static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
+
 #define MD_BUG(x...) { printk("md: bug in file %s, line %d\n", __FILE__, __LINE__); md_print_devices(); }
 
 /*
@@ -2843,6 +2845,34 @@ __ATTR(sync_speed_max, S_IRUGO|S_IWUSR,
 
 
 static ssize_t
+sync_force_parallel_show(mddev_t *mddev, char *page)
+{
+	return sprintf(page, "%d\n", mddev->parallel_resync);
+}
+
+static ssize_t
+sync_force_parallel_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	char *e;
+	unsigned long n = simple_strtoul(buf, &e, 10);
+
+	if (!*buf || (*e && *e != '\n') || (n != 0 && n != 1))
+		return -EINVAL;
+
+	mddev->parallel_resync = n;
+
+	if (mddev->sync_thread) {
+		wake_up(&resync_wait);
+	}
+	return len;
+}
+
+/* force parallel resync, even with shared block devices */
+static struct md_sysfs_entry md_sync_force_parallel =
+__ATTR(sync_force_parallel, S_IRUGO|S_IWUSR,
+       sync_force_parallel_show, sync_force_parallel_store);
+
+static ssize_t
 sync_speed_show(mddev_t *mddev, char *page)
 {
 	unsigned long resync, dt, db;
@@ -2980,6 +3010,7 @@ static struct attribute *md_redundancy_a
 	&md_sync_min.attr,
 	&md_sync_max.attr,
 	&md_sync_speed.attr,
+	&md_sync_force_parallel.attr,
 	&md_sync_completed.attr,
 	&md_suspend_lo.attr,
 	&md_suspend_hi.attr,
@@ -5199,8 +5230,6 @@ void md_allow_write(mddev_t *mddev)
 }
 EXPORT_SYMBOL_GPL(md_allow_write);
 
-static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
-
 #define SYNC_MARKS 10
 #define SYNC_MARK_STEP (3*HZ)
 void md_do_sync(mddev_t *mddev)
@@ -5264,8 +5293,9 @@ void md_do_sync(mddev_t *mddev)
 	ITERATE_MDDEV(mddev2,tmp) {
 		if (mddev2 == mddev)
 			continue;
-		if (mddev2->curr_resync &&
-		    match_mddev_units(mddev,mddev2)) {
+		if (!mddev->parallel_resync
+		    && mddev2->curr_resync
+		    && match_mddev_units(mddev,mddev2)) {
 			DEFINE_WAIT(wq);
 			if (mddev < mddev2 && mddev->curr_resync == 2) {
 				/* arbitrarily yield */
Index: linux-2.6.22/include/linux/raid/md_k.h
===================================================================
--- linux-2.6.22.orig/include/linux/raid/md_k.h	2007-12-06 19:51:55.000000000 +0100
+++ linux-2.6.22/include/linux/raid/md_k.h	2007-12-06 19:52:33.000000000 +0100
@@ -170,6 +170,9 @@ struct mddev_s
 	int				sync_speed_min;
 	int				sync_speed_max;
 
+	/* resync even though the same disks are shared among md-devices */
+	int				parallel_resync;
+
 	int				ok_start_degraded;
 	/* recovery/resync flags
 	 * NEEDED: we might need to start a resync/recover
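
With the patch applied, parallel rebuild can then be forced per array
through the new sysfs attribute, e.g. for the arrays above:

  echo 1 > /sys/block/md4/md/sync_force_parallel
  echo 1 > /sys/block/md5/md/sync_force_parallel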

2008-03-24 16:44:48

by Bernd Schubert

Subject: Re: deadline unfairness

Hello Aaron,

On Saturday 22 March 2008, Bernd Schubert wrote:
> Hello Aaron,
>
> On Saturday 22 March 2008, Aaron Carroll wrote:
> > Bernd Schubert wrote:
> > > Hello,
> > >
> > > Somehow it seems the deadline scheduler is rather unfair. Below is
> > > an example of the md-raid6 initialization of md3, md4 and md5. All
> > > three md devices share the same block devices (we have patched md
> > > to allow parallel rebuilds of shared block devices, since for us
> > > the CPU is the bottleneck, not the block devices).
> > >
> > > All rebuilds started at basically the same time. As you can see,
> > > md3 is already done, and md4 now rebuilds substantially faster than md5.
> > > [..]
> > > This is basically a 2.6.22 kernel + lustre + md-backports, with
> > > nothing done to the scheduler.
> >
> > Hi Bernd,
> >
> > There is a deadline bug in pre-2.6.24 kernels where lower-sector requests
> > can starve higher-sector requests; you might be hitting this bug. It was
> > fixed by commit: 6f5d8aa6382eef2b26032c88656270bdae7f0c42
>
> thanks a lot for your help! I will build a new kernel later today and
> then report back whether it helps. Commit
> dfb3d72a9aa519672c9ae06f0d2f93eccb35482f also looks useful...

After applying both patches, it looks much better now.


Thanks again for your help,
Bernd