2013-07-21 10:27:00

by Justin Piszcz

Subject: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)

Hi,

When I run repair on an MD-RAID1 sync_action, the speed slows down and it
stays like this (below) for hours.

The system is then completely unresponsive to user input. I have replaced a
failing SSD; however, after a check, mismatch_cnt seems to increase over
time. When I run repair, the system freezes to user-input. Has anyone else
run into this issue with a RAID-1 volume (2 x SSD) using 0.90 metadata?
Long ago I used to use this same configuration with two physical disks and
there was never a problem.

I left a root shell open, but even writing idle from it does not interrupt
the resync:
# echo idle > /sys/devices/virtual/block/md1/md/sync_action

Every 1.0s: cat /proc/mdstat Sun Jul 21 06:15:38 2013

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
233381376 blocks [2/2] [UU]
[>....................] resync = 0.0% (151616/233381376) finish=36171.5min speed=107K/sec

md0 : active raid1 sdc1[0] sdb1[1]
1048512 blocks [2/2] [UU]

unused devices: <none>

10 minutes later:

233381376 blocks [2/2] [UU]
[>....................] resync = 0.0% (151616/233381376) finish=52219.3min speed=74K/sec

The block where it hangs (151616 here) has been different each time I watched
it; it does not appear to hang at the same block each time.

Justin.
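
A stall like the one reported above can be spotted by sampling the completed-block count from /proc/mdstat twice and comparing. The following is only a sketch: `mdstat_progress` and `is_stalled` are hypothetical helper names (not part of mdadm or the kernel), and the parsing assumes the mdstat layout shown in this report.

```shell
#!/bin/sh
# Hypothetical helper: read mdstat-formatted text on stdin and print
# "<done> <total>" (in blocks) for the given md device.
# Prints nothing when that device has no sync/check/repair in progress.
mdstat_progress() {
    awk -v dev="$1" '
        $1 == dev && $2 == ":" { in_dev = 1; next }  # start of the device stanza
        in_dev && NF == 0      { in_dev = 0 }        # blank line ends the stanza
        in_dev && match($0, /\([0-9]+\/[0-9]+\)/) {
            s = substr($0, RSTART + 1, RLENGTH - 2)  # drop the parentheses
            split(s, a, "/")
            print a[1], a[2]
            exit
        }
    '
}

# Hypothetical helper: succeed when two non-empty samples are identical,
# i.e. the sync made no progress between them.
is_stalled() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example use (assumed 60 second sampling interval):
#   before=$(mdstat_progress md1 < /proc/mdstat)
#   sleep 60
#   after=$(mdstat_progress md1 < /proc/mdstat)
#   is_stalled "$before" "$after" && echo "md1 resync made no progress"
```

A zero-progress interval does not prove a hang on its own (a throttled resync on a busy system can also crawl), so the interval is worth choosing generously.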


2013-07-21 23:03:12

by NeilBrown

Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)

On Sun, 21 Jul 2013 06:26:55 -0400 "Justin Piszcz" <[email protected]>
wrote:

> Hi,
>
> When I run repair on an MD-RAID1 sync_action, the speed slows down and it
> stays like this (below) for hours.
>
> The system is then completely unresponsive to user input. I have replaced a
> failing SSD; however, after a check, mismatch_cnt seems to increase over
> time. When I run repair, the system freezes to user-input. Has anyone else
> run into this issue with a RAID-1 volume (2 x SSD) using 0.90 metadata?
> Long ago I used to use this same configuration with two physical disks and
> there was never a problem.
>
> Even though I left a root shell open, this has no effect to break the
> resync:
> # echo idle > /sys/devices/virtual/block/md1/md/sync_action
>
> Every 1.0s: cat /proc/mdstat Sun Jul 21 06:15:38
> 2013
>
> Personalities : [raid1]
> md1 : active raid1 sdc2[0] sdb2[1]
> 233381376 blocks [2/2] [UU]
> [>....................] resync = 0.0% (151616/233381376)
> finish=36171.5min speed=107K/sec
>
> md0 : active raid1 sdc1[0] sdb1[1]
> 1048512 blocks [2/2] [UU]
>
> unused devices: <none>
>
> 10 minutes later:
>
> 233381376 blocks [2/2] [UU]
> [>....................] resync = 0.0% (151616/233381376)
> finish=52219.3min speed=74K/sec
>
> Where it hangs (151616) or elsewhere, has been different each time I watched
> it, it does not appear to be hanging at the same block each time.
>

Hi Justin,
this is a known bug. Fix has been accepted into mainline for 3.11-rc2.
Hopefully it will get into 3.10.3 (too late for 3.10.2).

NeilBrown



2013-07-25 23:10:55

by Justin Piszcz

Subject: RE: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)



-----Original Message-----
From: NeilBrown [mailto:[email protected]]
Sent: Sunday, July 21, 2013 7:03 PM
To: Justin Piszcz
Cc: [email protected]; [email protected]
Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
SSD)

> Hi Justin,
> this is a known bug. Fix has been accepted into mainline for 3.11-rc2.
> Hopefully it will get into 3.10.3 (too late for 3.10.2).

> NeilBrown


Hi Neil,

Did the fix by chance make it into 3.10.3?

The same issue occurs with 3.10.3 for me as well:

Every 1.0s: cat /proc/mdstat Thu Jul 25 19:09:46 2013

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
233381376 blocks [2/2] [UU]
[>....................] resync = 0.0% (151488/233381376) finish=32045.3min speed=121K/sec

md0 : active raid1 sdc1[0] sdb1[1]
1048512 blocks [2/2] [UU]

unused devices: <none>



Justin.

2013-07-26 00:36:09

by NeilBrown

Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)

On Thu, 25 Jul 2013 19:10:50 -0400 "Justin Piszcz" <[email protected]>
wrote:

>
>
> -----Original Message-----
> From: NeilBrown [mailto:[email protected]]
> Sent: Sunday, July 21, 2013 7:03 PM
> To: Justin Piszcz
> Cc: [email protected]; [email protected]
> Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
> SSD)
>
> > Hi Justin,
> > this is a known bug. Fix has been accepted into mainline for 3.11-rc2.
> > Hopefully it will get into 3.10.3 (too late for 3.10.2).
>
> > NeilBrown
>
>
> Hi Neil,
>
> Did the fix by chance make it into 3.10.3?

No, it looks like it missed again. I gather there was a large inflow of
patches for -stable in the 3.11-rc1 merge window and Greg has been processing
them in batches. Hopefully in 3.10.4.

The relevant patch is commit 30bc9b53878a9921b02e3 in mainline.

NeilBrown






2013-07-26 09:56:56

by Justin Piszcz

Subject: RE: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)



-----Original Message-----
From: NeilBrown [mailto:[email protected]]
Sent: Thursday, July 25, 2013 8:36 PM
To: Justin Piszcz
Cc: [email protected]; [email protected]
Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
SSD)

On Thu, 25 Jul 2013 19:10:50 -0400 "Justin Piszcz" <[email protected]>
wrote:

> Did the fix by chance make it into 3.10.3?

No, it looks like it missed again. I gather there was a large inflow of
patches for -stable in the 3.11-rc1 merge window and Greg has been processing
them in batches. Hopefully in 3.10.4.

The relevant patch is commit 30bc9b53878a9921b02e3 in mainline.

NeilBrown

--

Method to get patch via git and patch kernel:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git log |grep 30bc9b53878a9921b02e3
commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
$ git show 30bc9b53878a9921b02e3b5bc4283ac1c6de102a > /tmp/a
# patch -p1 < /tmp/a
patching file drivers/md/raid1.c
Hunk #1 succeeded at 1848 (offset -1 lines).
Hunk #2 succeeded at 1886 (offset -1 lines).
Hunk #3 succeeded at 1915 (offset -1 lines).

Rebooted and tested: success, thanks!
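
To answer "did the fix make it into release X?" without patching by hand, one option is to search the stable tree's history. Stable backports normally cite the mainline commit in their message ("commit <sha> upstream."), so grepping commit messages within a tag range should find it. This is a sketch only: `find_backport` is a hypothetical helper name, and it assumes a local clone of linux-stable with release tags such as v3.10.3 and v3.10.4.

```shell
#!/bin/sh
# Hypothetical helper: list commits in <range> of the clone at <repo> whose
# commit message mentions the given upstream (mainline) hash.
# Empty output means no backport citing that hash was found in the range.
find_backport() {
    repo="$1"; upstream="$2"; range="$3"
    git -C "$repo" log --oneline --grep="$upstream" "$range"
}

# Example use (assumes ./linux-stable is a clone of the stable tree):
#   find_backport ./linux-stable 30bc9b53878a9921b02e3 v3.10.3..v3.10.4
```

Compared with piping the whole `git log` through grep, restricting the search to a tag range shows directly which stable release picked up the patch.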

One follow-up question:
$ cat /sys/block/md1/md/mismatch_cnt
314112
-> On a live RAID-1 (root filesystem) without swap, is it normal to have
such a high mismatch_cnt even after a repair?

First repair:
Fri Jul 26 05:30:47 EDT 2013: The meta-device /dev/md1 has mismatch_cnt 314112 sectors.
Second repair:
Fri Jul 26 05:30:47 EDT 2013: The meta-device /dev/md1 has mismatch_cnt 313600 sectors.

Should I be concerned?


Testing the patch:

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
233381376 blocks [2/2] [UU]
[>....................] check = 0.3% (838976/233381376) finish=9.2min speed=419488K/sec

md0 : active raid1 sdc1[0] sdb1[1]
1048512 blocks [2/2] [UU]

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
233381376 blocks [2/2] [UU]
[===============>.....] check = 77.5% (180889856/233381376) finish=2.5min speed=342654K/sec

md0 : active raid1 sdc1[0] sdb1[1]
1048512 blocks [2/2] [UU]

Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
233381376 blocks [2/2] [UU]

md0 : active raid1 sdc1[0] sdb1[1]
1048512 blocks [2/2] [UU]


Justin.

2013-07-29 05:57:11

by NeilBrown

Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)

On Fri, 26 Jul 2013 05:56:51 -0400 "Justin Piszcz" <[email protected]>
wrote:

>
>
> -----Original Message-----
> From: NeilBrown [mailto:[email protected]]
> Sent: Thursday, July 25, 2013 8:36 PM
> To: Justin Piszcz
> Cc: [email protected]; [email protected]
> Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
> SSD)
>
> On Thu, 25 Jul 2013 19:10:50 -0400 "Justin Piszcz" <[email protected]>
> wrote:
>
> > Did the fix by chance make it into 3.10.3?
>
> No, it looks like it missed again. I gather there was a large inflow of
> patches for -stable in the 3.11-rc1 merge window and Greg has been
> processing
> them in batches. Hopefully in 3.10.4.
>
> The relevant patch is commit 30bc9b53878a9921b02e3 in mainline.
>
> NeilBrown
>
> --
>
> Method to get patch via git and patch kernel:
>
> $ git clone
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
> $ git log |grep 30bc9b53878a9921b02e3
> commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
> $ git show 30bc9b53878a9921b02e3b5bc4283ac1c6de102a > /tmp/a
> # patch -p1 < /tmp/a
> patching file drivers/md/raid1.c
> Hunk #1 succeeded at 1848 (offset -1 lines).
> Hunk #2 succeeded at 1886 (offset -1 lines).
> Hunk #3 succeeded at 1915 (offset -1 lines).
>
> Reboot- tested, success, thanks..!
>
> One follow-up question:
> $ cat /sys/block/md1/md/mismatch_cnt
> 314112
> -> On a live RAID-1 (root filesystem) without swap, is it normal to have
> such a high mismatch_cnt even after a repair?
>
> First repair:
> Fri Jul 26 05:30:47 EDT 2013: The meta-device /dev/md1 has mismatch_cnt
> 314112 sectors.
> Second repair:
> Fri Jul 26 05:30:47 EDT 2013: The meta-device /dev/md1 has mismatch_cnt
> 313600 sectors.

Those two lines have exactly the same timestamp and array name but different
mismatch counts. That is very strange.

Did you run two consecutive 'repair's on the one array, both with the patched
kernel? If so and the second mismatch_cnt wasn't zero (or close to
it..maybe) then something is definitely wrong.

NeilBrown
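
The test suggested above (two consecutive repairs, with the second expected to report a mismatch count at or near zero) could be scripted roughly as follows. This is a sketch under assumptions: `run_sync_action` and the `MD_SYSFS` override are hypothetical names introduced here, the 60 second poll interval is arbitrary, and on a real system it must run as root.

```shell
#!/bin/sh
# MD_SYSFS is a hypothetical override so the function can be pointed at a
# scratch directory; on a real system the default /sys/block applies.
MD_SYSFS="${MD_SYSFS:-/sys/block}"

# Hypothetical helper: start <action> ("check" or "repair") on array <dev>,
# wait until sync_action reads "idle" again, then print mismatch_cnt.
# Optional third argument: poll interval in seconds (default 60).
run_sync_action() {
    dev="$1"; action="$2"; poll="${3:-60}"
    md="$MD_SYSFS/$dev/md"
    echo "$action" > "$md/sync_action" || return 1
    while [ "$(cat "$md/sync_action")" != "idle" ]; do
        sleep "$poll"
    done
    cat "$md/mismatch_cnt"
}

# Example use: two consecutive repairs on md1.
#   first=$(run_sync_action md1 repair)
#   second=$(run_sync_action md1 repair)
#   echo "first repair: $first sectors, second repair: $second sectors"
```

Logging each result with its own timestamp at the moment it is read would also avoid the ambiguity of two entries carrying the same time.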





2013-07-29 07:33:35

by Justin Piszcz

Subject: RE: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)



-----Original Message-----
From: NeilBrown [mailto:[email protected]]
Sent: Monday, July 29, 2013 1:57 AM
To: Justin Piszcz
Cc: [email protected]; [email protected]
Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x
SSD)

On Fri, 26 Jul 2013 05:56:51 -0400 "Justin Piszcz" <[email protected]>
wrote:

[..]

Further testing shows all is ok now:


Sun Nov 25 02:12:03 EST 2012: Parity check(s) running, sleeping 60 seconds...
Sun Nov 25 02:13:03 EST 2012: Parity check(s) running, sleeping 60 seconds...
Sun Nov 25 02:14:03 EST 2012: cat /sys/block/md0/md/mismatch_cnt
Sun Nov 25 02:14:03 EST 2012: 0
Sun Nov 25 02:14:03 EST 2012: cat /sys/block/md1/md/mismatch_cnt
Sun Nov 25 02:14:03 EST 2012: 0
Sun Nov 25 02:14:03 EST 2012: The meta-device /dev/md0 has no mismatched sectors.
Sun Nov 25 02:14:04 EST 2012: The meta-device /dev/md1 has no mismatched sectors.
Sun Nov 25 02:14:05 EST 2012: All devices are clean...
Sun Nov 25 02:14:05 EST 2012: cat /sys/block/md0/md/mismatch_cnt
Sun Nov 25 02:14:05 EST 2012: 0
Sun Nov 25 02:14:05 EST 2012: cat /sys/block/md1/md/mismatch_cnt
Sun Nov 25 02:14:05 EST 2012: 0

Thanks for your help.

Justin.