2015-05-08 08:56:40

by NeilBrown

Subject: [PATCH 0/7] md fixes for -rc2

Hi all,
following are some md fixes for -rc2 that I will be sending to Linus
(hopefully) for -rc3.

Most are for raid5 and many of those are fixes for the new
batched-stripe-writes code.
Sorry Shaohua, I didn't review/test those properly before forwarding
them on. Could you please check them all and make sure there is
nothing you disagree with?

Thanks,
NeilBrown


---

Heinz Mauelshagen (1):
md-raid0: conditional mddev->queue access to suit dm-raid

NeilBrown (6):
md/raid5: new alloc_stripe() to allocate and initialize a stripe.
md/raid5: more incorrect BUG_ON in handle_stripe_fill.
md/raid5: avoid reading parity blocks for full-stripe write to degraded array
md/raid5: don't record new size if resize_stripes fails.
md/raid5: fix allocation of 'scribble' array.
md/raid5: fix handling of degraded stripes in batches.


drivers/md/raid0.c | 5 +-
drivers/md/raid5.c | 123 ++++++++++++++++++++++++++++++----------------------
2 files changed, 73 insertions(+), 55 deletions(-)




2015-05-08 08:56:47

by NeilBrown

Subject: [PATCH 1/7] md-raid0: conditional mddev->queue access to suit dm-raid

From: Heinz Mauelshagen <[email protected]>

This patch is a prerequisite for dm-raid "raid0" support. It allows
dm-raid to use the MD RAID0 personality, which still makes one
unconditional access to mddev->queue; mddev->queue is NULL when dm-raid
is stacked on top of MD.

Most of the conditional mddev->queue accesses already made it upstream,
but this one was missed. It keeps md raid0 from setting disk stack
limits (dm core takes care of those when md sits underneath dm).

Signed-off-by: Heinz Mauelshagen <[email protected]>
Tested-by: Heinz Mauelshagen <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
drivers/md/raid0.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 2cb59a641cd2..6a68ef5246d4 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -188,8 +188,9 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
}
dev[j] = rdev1;

- disk_stack_limits(mddev->gendisk, rdev1->bdev,
- rdev1->data_offset << 9);
+ if (mddev->queue)
+ disk_stack_limits(mddev->gendisk, rdev1->bdev,
+ rdev1->data_offset << 9);

if (rdev1->bdev->bd_disk->queue->merge_bvec_fn)
conf->has_merge_bvec = 1;
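
For readers without the tree handy, the pattern the patch applies can be shown
in isolation: queue-level setup must be skipped whenever the personality does
not own a request queue, because dm owns the queue when dm-raid is stacked on
top of MD. Below is a minimal stand-alone sketch of that pattern; the struct
layouts and disk_stack_limits_stub() are illustrative stand-ins, not the
kernel API.

#include <stdio.h>

struct request_queue { int dummy; };

struct mddev {
        struct request_queue *queue;  /* NULL when dm-raid is stacked on top of MD */
};

/* Stand-in for the kernel's disk_stack_limits(); illustrative only. */
static void disk_stack_limits_stub(struct mddev *mddev)
{
        printf("stacking limits via mddev->queue=%p\n", (void *)mddev->queue);
}

static void setup_member(struct mddev *mddev)
{
        /* The fix in a nutshell: only touch queue-level limits when MD owns
         * the queue.  Under dm-raid, dm core stacks the limits itself. */
        if (mddev->queue)
                disk_stack_limits_stub(mddev);
        else
                printf("no queue (dm-raid on top of MD); dm core stacks limits\n");
}

int main(void)
{
        struct request_queue q = { 0 };
        struct mddev native  = { &q };    /* plain MD array */
        struct mddev stacked = { NULL };  /* dm-raid "raid0" on top of MD */

        setup_member(&native);
        setup_member(&stacked);
        return 0;
}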



2015-05-08 08:56:53

by NeilBrown

Subject: [PATCH 2/7] md/raid5: new alloc_stripe() to allocate and initialize a stripe.

The new batch_lock and batch_list fields are being initialized in
grow_one_stripe() but not in resize_stripes(). This causes a crash
on resize.

So separate the core initialization into a new function and call it
from both allocation sites.

Signed-off-by: NeilBrown <[email protected]>
Fixes: 59fc630b8b5f ("RAID5: batch adjacent full stripe write")
---
drivers/md/raid5.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 77dfd720aaa0..91a1e8b26b52 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1971,17 +1971,30 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
put_cpu();
}

+static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp)
+{
+ struct stripe_head *sh;
+
+ sh = kmem_cache_zalloc(sc, gfp);
+ if (sh) {
+ spin_lock_init(&sh->stripe_lock);
+ spin_lock_init(&sh->batch_lock);
+ INIT_LIST_HEAD(&sh->batch_list);
+ INIT_LIST_HEAD(&sh->lru);
+ atomic_set(&sh->count, 1);
+ }
+ return sh;
+}
static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
{
struct stripe_head *sh;
- sh = kmem_cache_zalloc(conf->slab_cache, gfp);
+
+ sh = alloc_stripe(conf->slab_cache, gfp);
if (!sh)
return 0;

sh->raid_conf = conf;

- spin_lock_init(&sh->stripe_lock);
-
if (grow_buffers(sh, gfp)) {
shrink_buffers(sh);
kmem_cache_free(conf->slab_cache, sh);
@@ -1990,13 +2003,8 @@ static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
sh->hash_lock_index =
conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS;
/* we just created an active stripe so... */
- atomic_set(&sh->count, 1);
atomic_inc(&conf->active_stripes);
- INIT_LIST_HEAD(&sh->lru);

- spin_lock_init(&sh->batch_lock);
- INIT_LIST_HEAD(&sh->batch_list);
- sh->batch_head = NULL;
release_stripe(sh);
conf->max_nr_stripes++;
return 1;
@@ -2109,13 +2117,11 @@ static int resize_stripes(struct r5conf *conf, int newsize)
return -ENOMEM;

for (i = conf->max_nr_stripes; i; i--) {
- nsh = kmem_cache_zalloc(sc, GFP_KERNEL);
+ nsh = alloc_stripe(sc, GFP_KERNEL);
if (!nsh)
break;

nsh->raid_conf = conf;
- spin_lock_init(&nsh->stripe_lock);
-
list_add(&nsh->lru, &newstripes);
}
if (i) {
@@ -2142,13 +2148,11 @@ static int resize_stripes(struct r5conf *conf, int newsize)
lock_device_hash_lock(conf, hash));
osh = get_free_stripe(conf, hash);
unlock_device_hash_lock(conf, hash);
- atomic_set(&nsh->count, 1);
+
for(i=0; i<conf->pool_size; i++) {
nsh->dev[i].page = osh->dev[i].page;
nsh->dev[i].orig_page = osh->dev[i].page;
}
- for( ; i<newsize; i++)
- nsh->dev[i].page = NULL;
nsh->hash_lock_index = hash;
kmem_cache_free(conf->slab_cache, osh);
cnt++;



2015-05-08 08:57:01

by NeilBrown

Subject: [PATCH 3/7] md/raid5: more incorrect BUG_ON in handle_stripe_fill.

It is not incorrect to call handle_stripe_fill() when
a batch of full-stripe writes is active.
It is, however, a BUG if fetch_block() then decides
it needs to actually fetch anything.

So move the 'BUG_ON' to where it belongs.

Signed-off-by: NeilBrown <[email protected]>
Fixes: 59fc630b8b5f ("RAID5: batch adjacent full stripe write")
---
drivers/md/raid5.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 91a1e8b26b52..415cac6d89bd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3302,6 +3302,7 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
*/
BUG_ON(test_bit(R5_Wantcompute, &dev->flags));
BUG_ON(test_bit(R5_Wantread, &dev->flags));
+ BUG_ON(sh->batch_head);
if ((s->uptodate == disks - 1) &&
(s->failed && (disk_idx == s->failed_num[0] ||
disk_idx == s->failed_num[1]))) {
@@ -3370,7 +3371,6 @@ static void handle_stripe_fill(struct stripe_head *sh,
{
int i;

- BUG_ON(sh->batch_head);
/* look for blocks to read/compute, skip this if a compute
* is already in flight, or if the stripe contents are in the
* midst of changing due to a write



2015-05-08 08:57:07

by NeilBrown

Subject: [PATCH 4/7] md/raid5: avoid reading parity blocks for full-stripe write to degraded array

When performing a reconstruct write, we need to read all blocks
that are not being over-written, except for the parity (P and Q) blocks.

The code currently reads the parity blocks as well (as they are not being
over-written!), which is unnecessary.

Signed-off-by: NeilBrown <[email protected]>
Fixes: ea664c8245f3 ("md/raid5: need_this_block: tidy/fix last condition.")
---
drivers/md/raid5.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 415cac6d89bd..85dc0e67fb88 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3282,7 +3282,9 @@ static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
/* reconstruct-write isn't being forced */
return 0;
for (i = 0; i < s->failed; i++) {
- if (!test_bit(R5_UPTODATE, &fdev[i]->flags) &&
+ if (s->failed_num[i] != sh->pd_idx &&
+ s->failed_num[i] != sh->qd_idx &&
+ !test_bit(R5_UPTODATE, &fdev[i]->flags) &&
!test_bit(R5_OVERWRITE, &fdev[i]->flags))
return 1;
}



2015-05-08 08:57:14

by NeilBrown

Subject: [PATCH 5/7] md/raid5: don't record new size if resize_stripes fails.

If any memory allocation in resize_stripes fails we will return
-ENOMEM, but in some cases we update conf->pool_size anyway.

This means that if we try again, the allocations will be assumed
to be larger than they are, and badness results.

So only update pool_size if there is no error.

This bug was introduced in 2.6.17 and the patch is suitable for
-stable.

Fixes: ad01c9e3752f ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array")
Cc: [email protected] (v2.6.17+)
Signed-off-by: NeilBrown <[email protected]>
---
drivers/md/raid5.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 85dc0e67fb88..9748e525e4c0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2216,7 +2216,8 @@ static int resize_stripes(struct r5conf *conf, int newsize)

conf->slab_cache = sc;
conf->active_name = 1-conf->active_name;
- conf->pool_size = newsize;
+ if (!err)
+ conf->pool_size = newsize;
return err;
}




2015-05-08 08:57:27

by NeilBrown

Subject: [PATCH 7/7] md/raid5: fix handling of degraded stripes in batches.

There is no need for special handling of stripe-batches when the array
is degraded.

There may be if there is a failure in the batch, but STRIPE_DEGRADED
does not imply an error.

So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is
degraded.
This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in
check_break_stripe_batch_list() and so the bitmap bit gets cleared
when it shouldn't.

So in check_break_stripe_batch_list(), split the batch up completely -
again STRIPE_DEGRADED isn't meaningful.

Also don't set STRIPE_BATCH_ERR when there is a write error to a
replacement device. This simply removes the replacement device and
requires no extra handling.

Signed-off-by: NeilBrown <[email protected]>
---
drivers/md/raid5.c | 17 +++--------------
1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3873eaa6fa2e..1ba97fdc6df1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1078,9 +1078,6 @@ again:
pr_debug("skip op %ld on disc %d for sector %llu\n",
bi->bi_rw, i, (unsigned long long)sh->sector);
clear_bit(R5_LOCKED, &sh->dev[i].flags);
- if (sh->batch_head)
- set_bit(STRIPE_BATCH_ERR,
- &sh->batch_head->state);
set_bit(STRIPE_HANDLE, &sh->state);
}

@@ -2448,7 +2445,7 @@ static void raid5_end_write_request(struct bio *bi, int error)
}
rdev_dec_pending(rdev, conf->mddev);

- if (sh->batch_head && !uptodate)
+ if (sh->batch_head && !uptodate && !replacement)
set_bit(STRIPE_BATCH_ERR, &sh->batch_head->state);

if (!test_and_clear_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags))
@@ -4214,15 +4211,9 @@ static void check_break_stripe_batch_list(struct stripe_head *sh)
return;

head_sh = sh;
- do {
- sh = list_first_entry(&sh->batch_list,
- struct stripe_head, batch_list);
- BUG_ON(sh == head_sh);
- } while (!test_bit(STRIPE_DEGRADED, &sh->state));

- while (sh != head_sh) {
- next = list_first_entry(&sh->batch_list,
- struct stripe_head, batch_list);
+ list_for_each_entry_safe(sh, next, &head_sh->batch_list, batch_list) {
+
list_del_init(&sh->batch_list);

set_mask_bits(&sh->state, ~STRIPE_EXPAND_SYNC_FLAG,
@@ -4242,8 +4233,6 @@ static void check_break_stripe_batch_list(struct stripe_head *sh)

set_bit(STRIPE_HANDLE, &sh->state);
release_stripe(sh);
-
- sh = next;
}
}




2015-05-08 08:57:20

by NeilBrown

Subject: [PATCH 6/7] md/raid5: fix allocation of 'scribble' array.

As the new 'scribble' array is sized based on chunk size,
we need to make sure the size matches the largest of 'old'
and 'new' chunk sizes when the array is undergoing reshape.

We also potentially need to resize it even when not resizing
the stripe cache, as chunk size can change without changing
number of devices.

So move the 'resize' code into a separate function, and
consider old and new sizes when allocating.

Signed-off-by: NeilBrown <[email protected]>
Fixes: 46d5b785621a ("raid5: use flex_array for scribble data")
---
drivers/md/raid5.c | 65 ++++++++++++++++++++++++++++++++++------------------
1 file changed, 43 insertions(+), 22 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9748e525e4c0..3873eaa6fa2e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2068,6 +2068,35 @@ static struct flex_array *scribble_alloc(int num, int cnt, gfp_t flags)
return ret;
}

+static int resize_chunks(struct r5conf *conf, int new_disks, int new_sectors)
+{
+ unsigned long cpu;
+ int err = 0;
+
+ mddev_suspend(conf->mddev);
+ get_online_cpus();
+ for_each_present_cpu(cpu) {
+ struct raid5_percpu *percpu;
+ struct flex_array *scribble;
+
+ percpu = per_cpu_ptr(conf->percpu, cpu);
+ scribble = scribble_alloc(new_disks,
+ new_sectors / STRIPE_SECTORS,
+ GFP_NOIO);
+
+ if (scribble) {
+ flex_array_free(percpu->scribble);
+ percpu->scribble = scribble;
+ } else {
+ err = -ENOMEM;
+ break;
+ }
+ }
+ put_online_cpus();
+ mddev_resume(conf->mddev);
+ return err;
+}
+
static int resize_stripes(struct r5conf *conf, int newsize)
{
/* Make all the stripes able to hold 'newsize' devices.
@@ -2096,7 +2125,6 @@ static int resize_stripes(struct r5conf *conf, int newsize)
struct stripe_head *osh, *nsh;
LIST_HEAD(newstripes);
struct disk_info *ndisks;
- unsigned long cpu;
int err;
struct kmem_cache *sc;
int i;
@@ -2178,25 +2206,6 @@ static int resize_stripes(struct r5conf *conf, int newsize)
} else
err = -ENOMEM;

- get_online_cpus();
- for_each_present_cpu(cpu) {
- struct raid5_percpu *percpu;
- struct flex_array *scribble;
-
- percpu = per_cpu_ptr(conf->percpu, cpu);
- scribble = scribble_alloc(newsize, conf->chunk_sectors /
- STRIPE_SECTORS, GFP_NOIO);
-
- if (scribble) {
- flex_array_free(percpu->scribble);
- percpu->scribble = scribble;
- } else {
- err = -ENOMEM;
- break;
- }
- }
- put_online_cpus();
-
/* Step 4, return new stripes to service */
while(!list_empty(&newstripes)) {
nsh = list_entry(newstripes.next, struct stripe_head, lru);
@@ -6228,8 +6237,11 @@ static int alloc_scratch_buffer(struct r5conf *conf, struct raid5_percpu *percpu
percpu->spare_page = alloc_page(GFP_KERNEL);
if (!percpu->scribble)
percpu->scribble = scribble_alloc(max(conf->raid_disks,
- conf->previous_raid_disks), conf->chunk_sectors /
- STRIPE_SECTORS, GFP_KERNEL);
+ conf->previous_raid_disks),
+ max(conf->chunk_sectors,
+ conf->prev_chunk_sectors)
+ / STRIPE_SECTORS,
+ GFP_KERNEL);

if (!percpu->scribble || (conf->level == 6 && !percpu->spare_page)) {
free_scratch_buffer(conf, percpu);
@@ -7205,6 +7217,15 @@ static int check_reshape(struct mddev *mddev)
if (!check_stripe_cache(mddev))
return -ENOSPC;

+ if (mddev->new_chunk_sectors > mddev->chunk_sectors ||
+ mddev->delta_disks > 0)
+ if (resize_chunks(conf,
+ conf->previous_raid_disks
+ + max(0, mddev->delta_disks),
+ max(mddev->new_chunk_sectors,
+ mddev->chunk_sectors)
+ ) < 0)
+ return -ENOMEM;
return resize_stripes(conf, (conf->previous_raid_disks
+ mddev->delta_disks));
}
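
To make the sizing rule concrete, here is a back-of-the-envelope calculation
under assumed parameters; the chunk sizes, disk counts and the STRIPE_SECTORS
value below are illustrative assumptions, not numbers taken from this thread.

#include <stdio.h>

#define STRIPE_SECTORS 8  /* 4KiB stripe pages, 512-byte sectors (assumption) */

static int max_i(int a, int b) { return a > b ? a : b; }

int main(void)
{
        /* Hypothetical reshape: chunk size shrinks from 512KiB to 128KiB while
         * the array grows from 4 to 5 devices. */
        int prev_chunk_sectors = 1024;  /* 512KiB in 512-byte sectors */
        int new_chunk_sectors  = 256;   /* 128KiB */
        int prev_disks = 4, new_disks = 5;

        /* Sizing on the new geometry alone would under-allocate ... */
        int naive_rows = new_chunk_sectors / STRIPE_SECTORS;                       /* 32  */

        /* ... so the scribble space is sized for the larger of old and new. */
        int rows  = max_i(prev_chunk_sectors, new_chunk_sectors) / STRIPE_SECTORS; /* 128 */
        int disks = max_i(prev_disks, new_disks);                                  /* 5   */

        printf("scribble: %d rows x %d disks (naive sizing: %d rows)\n",
               rows, disks, naive_rows);
        return 0;
}

With sizing based on only one of the two chunk sizes, a reshape that shrinks
the chunk would leave the scribble array a quarter of what stripes still being
handled with the old geometry need; the max() above guards against exactly
that.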



2015-05-08 19:12:26

by Shaohua Li

Subject: Re: [PATCH 7/7] md/raid5: fix handling of degraded stripes in batches.

On Fri, May 08, 2015 at 06:56:12PM +1000, NeilBrown wrote:
> There is no need for special handling of stripe-batches when the array
> is degraded.
>
> There may be if there is a failure in the batch, but STRIPE_DEGRADED
> does not imply an error.
>
> So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is
> degraded.
> This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in
> check_break_stripe_batch_list() and so the bitmap bit gets cleared
> when it shouldn't.
>
> So in check_break_stripe_batch_list(), split the batch up completely -
> again STRIPE_DEGRADED isn't meaningful.
>
> Also don't set STRIPE_BATCH_ERR when there is a write error to a
> replacement device. This simply removes the replacement device and
> requires no extra handling.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> drivers/md/raid5.c | 17 +++--------------
> 1 file changed, 3 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3873eaa6fa2e..1ba97fdc6df1 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -1078,9 +1078,6 @@ again:
> pr_debug("skip op %ld on disc %d for sector %llu\n",
> bi->bi_rw, i, (unsigned long long)sh->sector);
> clear_bit(R5_LOCKED, &sh->dev[i].flags);
> - if (sh->batch_head)
> - set_bit(STRIPE_BATCH_ERR,
> - &sh->batch_head->state);
> set_bit(STRIPE_HANDLE, &sh->state);
> }

Patches look good to me. I had a question here. Is it possible for some
stripes in a batch to become degraded here but not others? That seems
possible, and then the batch should be split too.

Thanks,
Shaohua

2015-05-13 00:56:12

by NeilBrown

Subject: Re: [PATCH 7/7] md/raid5: fix handling of degraded stripes in batches.

On Fri, 8 May 2015 12:12:23 -0700 Shaohua Li <[email protected]> wrote:

> On Fri, May 08, 2015 at 06:56:12PM +1000, NeilBrown wrote:
> > There is no need for special handling of stripe-batches when the array
> > is degraded.
> >
> > There may be if there is a failure in the batch, but STRIPE_DEGRADED
> > does not imply an error.
> >
> > So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is
> > degraded.
> > This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in
> > check_break_stripe_batch_list() and so the bitmap bit gets cleared
> > when it shouldn't.
> >
> > So in check_break_stripe_batch_list(), split the batch up completely -
> > again STRIPE_DEGRADED isn't meaningful.
> >
> > Also don't set STRIPE_BATCH_ERR when there is a write error to a
> > replacement device. This simply removes the replacement device and
> > requires no extra handling.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > drivers/md/raid5.c | 17 +++--------------
> > 1 file changed, 3 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 3873eaa6fa2e..1ba97fdc6df1 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -1078,9 +1078,6 @@ again:
> > pr_debug("skip op %ld on disc %d for sector %llu\n",
> > bi->bi_rw, i, (unsigned long long)sh->sector);
> > clear_bit(R5_LOCKED, &sh->dev[i].flags);
> > - if (sh->batch_head)
> > - set_bit(STRIPE_BATCH_ERR,
> > - &sh->batch_head->state);
> > set_bit(STRIPE_HANDLE, &sh->state);
> > }
>
> Patches look good to me. I had a question here. Is it possible for some
> stripes in a batch to become degraded here but not others? That seems
> possible, and then the batch should be split too.

Why?

I don't really understand the purpose of splitting up the batch.
The only possible error handling on a full-stripe write is:
- fail a device, or
- record a bad-block.

The first case affects all stripes in a batch equally so there is no need to
split it up.
In the second case it is probably best to record the bad blocks while
iterating through the batch in handle_stripe_clean_event().

What exactly do you expect to happen to the stripes in a batch after they
have been split up?

Thanks,
NeilBrown



2015-05-20 05:56:27

by Shaohua Li

Subject: Re: [PATCH 7/7] md/raid5: fix handling of degraded stripes in batches.

On Wed, May 13, 2015 at 10:56:04AM +1000, NeilBrown wrote:
> On Fri, 8 May 2015 12:12:23 -0700 Shaohua Li <[email protected]> wrote:
>
> > On Fri, May 08, 2015 at 06:56:12PM +1000, NeilBrown wrote:
> > > There is no need for special handling of stripe-batches when the array
> > > is degraded.
> > >
> > > There may be if there is a failure in the batch, but STRIPE_DEGRADED
> > > does not imply an error.
> > >
> > > So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is
> > > degraded.
> > > This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in
> > > check_break_stripe_batch_list() and so the bitmap bit gets cleared
> > > when it shouldn't.
> > >
> > > So in check_break_stripe_batch_list(), split the batch up completely -
> > > again STRIPE_DEGRADED isn't meaningful.
> > >
> > > Also don't set STRIPE_BATCH_ERR when there is a write error to a
> > > replacement device. This simply removes the replacement device and
> > > requires no extra handling.
> > >
> > > Signed-off-by: NeilBrown <[email protected]>
> > > ---
> > > drivers/md/raid5.c | 17 +++--------------
> > > 1 file changed, 3 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > > index 3873eaa6fa2e..1ba97fdc6df1 100644
> > > --- a/drivers/md/raid5.c
> > > +++ b/drivers/md/raid5.c
> > > @@ -1078,9 +1078,6 @@ again:
> > > pr_debug("skip op %ld on disc %d for sector %llu\n",
> > > bi->bi_rw, i, (unsigned long long)sh->sector);
> > > clear_bit(R5_LOCKED, &sh->dev[i].flags);
> > > - if (sh->batch_head)
> > > - set_bit(STRIPE_BATCH_ERR,
> > > - &sh->batch_head->state);
> > > set_bit(STRIPE_HANDLE, &sh->state);
> > > }
> >
> > Patches look good to me. I had a question here. Is it possible for some
> > stripes in a batch to become degraded here but not others? That seems
> > possible, and then the batch should be split too.
>
> Why?
>
> I don't really understand the purpose of splitting up the batch.
> The only possible error handling on a full-stripe write is:
> - fail a device, or
> - record a bad-block.
>
> The first case affects all stripes in a batch equally so there is no need to
> split it up.
> The second case it is probably best to record the bad blocks while iterating
> through the batch in handle_stripe_clean_event().
>
> What exactly do you expect to happen after the stripes in a batch after they
> have been split up?

My original concern was that a device failure can cause some stripes to fail
but not others, e.g. the rdev lookup in ops_run_io returns NULL for some
stripes of a batch but not for all of them. There is no locking there, so that
seems possible. But you are right: the stripes that see no error in ops_run_io
will get an IO error eventually, so all stripes of the batch still end up in
the same state. So my concern is invalid, but I forgot to reply to the email,
sorry.

The batch split is there to handle IO errors, e.g. recording bad blocks and so
on. I'm not confident enough to change the existing code to handle the error
case, so I feel that splitting the batch and handling each stripe in the
normal way is the best thing to do. There may well be a better way. Again, the
device failure case can be ignored, which I didn't realize originally.

BTW, can you apply the fix reported by Maxime for the problem introduced by
the batch patch?
http://marc.info/?l=linux-raid&m=143153461415534&w=2

Thanks,
Shaohua