2003-02-28 12:09:09

by Mike Black

Subject: RAID SCSI binding

Linux 2.4.20 (but not unique to this kernel -- been this way for over a year):
I have a RAID5 array that doesn't start up properly during boot (I have to stop and restart it after the system is up). I've had
this problem forever and have been trying to fix it.
Here's what it looks like when it's up and running:
md4 : active raid5 sdn1[0] sdt1[6] sds1[5] sdr1[4] sdq1[3] sdp1[2] sdo1[1]
216490752 blocks level 5, 128k chunk, algorithm 0 [7/7] [UUUUUUU]
The partition types are set to 0x83, and the array is started manually from rc.local instead of using auto-start.
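The start from rc.local is basically just a manual assemble, something along these lines (a sketch; I'm not quoting the exact command line, and the member list is taken from the mdstat output above):

mdadm --assemble /dev/md4 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1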

I think I finally have an idea of what's happening: the major device number may be confusing the RAID bootup process.

During boot the devices are listed properly, but then they don't all get bound.

Feb 28 05:42:48 yeti kernel: sda: sda1
Feb 28 05:42:48 yeti kernel: sdb: sdb1
Feb 28 05:42:48 yeti kernel: sdc: sdc1
Feb 28 05:42:48 yeti kernel: sdd: sdd1
Feb 28 05:42:48 yeti kernel: sde: sde1
Feb 28 05:42:48 yeti kernel: sdf: sdf1
Feb 28 05:42:48 yeti kernel: sdg: sdg1
Feb 28 05:42:48 yeti kernel: sdh: sdh1
Feb 28 05:42:48 yeti kernel: sdi: sdi1
Feb 28 05:42:48 yeti kernel: sdj: sdj1
Feb 28 05:42:48 yeti kernel: sdk: sdk1
Feb 28 05:42:48 yeti kernel: sdl: sdl1
Feb 28 05:42:48 yeti kernel: sdm: sdm1
Feb 28 05:42:48 yeti kernel: sdn: sdn1
Feb 28 05:42:48 yeti kernel: sdo: sdo1
Feb 28 05:42:48 yeti kernel: sdp: sdp1
Feb 28 05:42:48 yeti kernel: sdq: sdq1
Feb 28 05:42:48 yeti kernel: sdr: sdr1
Feb 28 05:42:48 yeti kernel: sds: sds1
Feb 28 05:42:48 yeti kernel: sdt: sdt1
Feb 28 05:42:49 yeti kernel: [events: 000000bc]
Feb 28 05:42:49 yeti kernel: md: bind<sdo1,1>
Feb 28 05:42:49 yeti kernel: [events: 000000bc]
Feb 28 05:42:49 yeti kernel: md: bind<sdp1,2>
Feb 28 05:42:49 yeti kernel: [events: 000000bc]
Feb 28 05:42:49 yeti kernel: md: bind<sdn1,3>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sdb1,1>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sdc1,2>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sda1,3>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sde1,4>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sdf1,5>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sdg1,6>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sdh1,7>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sdi1,8>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sdj1,9>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sdk1,10>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sdl1,11>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sdd1,12>
Feb 28 05:42:49 yeti kernel: [events: 000000ab]
Feb 28 05:42:49 yeti kernel: md: bind<sdm1,13>

Note that sdq, sdr, sds, and sdt do not get an md: bind entry.
The array ends up in /proc/mdstat with just sdn1, sdo1, and sdp1 in it.
I then stop and restart the array and everything is OK.
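The stop/restart is nothing fancy, roughly (a sketch; the exact invocation in my script may differ slightly):

mdadm --stop /dev/md4
mdadm --assemble /dev/md4 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1

Here's the kernel log from that restart: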
Feb 28 05:53:30 yeti kernel: md: md4 stopped.
Feb 28 05:53:30 yeti kernel: md: unbind<sdn1,2>
Feb 28 05:53:30 yeti kernel: md: export_rdev(sdn1)
Feb 28 05:53:30 yeti kernel: md: unbind<sdp1,1>
Feb 28 05:53:30 yeti kernel: md: export_rdev(sdp1)
Feb 28 05:53:30 yeti kernel: md: unbind<sdo1,0>
Feb 28 05:53:30 yeti kernel: md: export_rdev(sdo1)
Feb 28 05:53:34 yeti kernel: [events: 000000bc]
Feb 28 05:53:34 yeti kernel: md: bind<sdo1,1>
Feb 28 05:53:34 yeti kernel: [events: 000000bc]
Feb 28 05:53:34 yeti kernel: md: bind<sdp1,2>
Feb 28 05:53:34 yeti kernel: [events: 000000bc]
Feb 28 05:53:34 yeti kernel: md: bind<sdq1,3>
Feb 28 05:53:34 yeti kernel: [events: 000000bc]
Feb 28 05:53:34 yeti kernel: md: bind<sdr1,4>
Feb 28 05:53:34 yeti kernel: [events: 000000bc]
Feb 28 05:53:34 yeti kernel: md: bind<sds1,5>
Feb 28 05:53:34 yeti kernel: [events: 000000bc]
Feb 28 05:53:34 yeti kernel: md: bind<sdt1,6>
Feb 28 05:53:34 yeti kernel: [events: 000000bc]
Feb 28 05:53:34 yeti kernel: md: bind<sdn1,7>
Feb 28 05:53:34 yeti kernel: md: sdn1's event counter: 000000bc
Feb 28 05:53:34 yeti kernel: md: sdt1's event counter: 000000bc
Feb 28 05:53:34 yeti kernel: md: sdn1's event counter: 000000bc
Feb 28 05:53:34 yeti kernel: md: sdt1's event counter: 000000bc
Feb 28 05:53:34 yeti kernel: md: sds1's event counter: 000000bc
Feb 28 05:53:34 yeti kernel: md: sdr1's event counter: 000000bc
Feb 28 05:53:34 yeti kernel: md: sdq1's event counter: 000000bc
Feb 28 05:53:34 yeti kernel: md: sdp1's event counter: 000000bc
Feb 28 05:53:34 yeti kernel: md: sdo1's event counter: 000000bc
Feb 28 05:53:34 yeti kernel: md4: max total readahead window set to 3072k
Feb 28 05:53:34 yeti kernel: md4: 6 data-disks, max readahead per data-disk: 512k
Feb 28 05:53:34 yeti kernel: raid5: device sdn1 operational as raid disk 0
Feb 28 05:53:34 yeti kernel: raid5: device sdt1 operational as raid disk 6
Feb 28 05:53:34 yeti kernel: raid5: device sds1 operational as raid disk 5
Feb 28 05:53:34 yeti kernel: raid5: device sdr1 operational as raid disk 4
Feb 28 05:53:34 yeti kernel: raid5: device sdq1 operational as raid disk 3
Feb 28 05:53:34 yeti kernel: raid5: device sdp1 operational as raid disk 2
Feb 28 05:53:34 yeti kernel: raid5: device sdo1 operational as raid disk 1
Feb 28 05:53:34 yeti kernel: raid5: allocated 7483kB for md4
Feb 28 05:53:34 yeti kernel: md: updating md4 RAID superblock on device
Feb 28 05:53:34 yeti kernel: md: sdn1 [events: 000000bd]<6>(write) sdn1's sb offset: 36081856
Feb 28 05:53:34 yeti kernel: md: sdt1 [events: 000000bd]<6>(write) sdt1's sb offset: 36081856
Feb 28 05:53:34 yeti kernel: md: sds1 [events: 000000bd]<6>(write) sds1's sb offset: 36081856
Feb 28 05:53:34 yeti kernel: md: sdr1 [events: 000000bd]<6>(write) sdr1's sb offset: 36081856
Feb 28 05:53:34 yeti kernel: md: sdq1 [events: 000000bd]<6>(write) sdq1's sb offset: 36081856
Feb 28 05:53:34 yeti kernel: md: sdp1 [events: 000000bd]<6>(write) sdp1's sb offset: 36081856
Feb 28 05:53:34 yeti kernel: md: sdo1 [events: 000000bd]<6>(write) sdo1's sb offset: 36081856


I noticed that the major device number changes starting at sdq.

brw-rw---- 1 root disk 8, 208 Feb 22 2002 /dev/sdn
brw-rw---- 1 root disk 8, 224 Feb 22 2002 /dev/sdo
brw-rw---- 1 root disk 8, 240 Feb 22 2002 /dev/sdp
brw-rw---- 1 root disk 65, 0 Feb 22 2002 /dev/sdq
brw-rw---- 1 root disk 65, 16 Feb 22 2002 /dev/sdr
brw-rw---- 1 root disk 65, 32 Feb 22 2002 /dev/sds
brw-rw---- 1 root disk 65, 48 Feb 22 2002 /dev/sdt
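For reference, each SCSI-disk major covers 16 whole disks at 16 minors apiece (the disk plus 15 partitions), so major 8 runs sda through sdp and major 65 picks up again at sdq, which is exactly where the bind entries stop. A quick way to double-check what the kernel itself reports (illustrative command, not from my original logs):

# print major, minor and name for the md4 member partitions
awk '$4 ~ /^sd[n-t]1$/ { print $1, $2, $4 }' /proc/partitions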

So it looks to me like the md driver doesn't like the array crossing the major device boundary, for some odd reason. What I don't understand is why I can stop and restart the array once the system is fully up, but can't start it from rc.local using the same commands.

We're not hitting the 27-disk maximum (MD_SB_DISKS), so that isn't it.

Looking through the code in md.c, I don't quite see where the problem is (or may be).

Michael D. Black [email protected]
http://www.csi-inc.com/
http://www.csi-inc.com/~mike
321-676-2923, x203
Melbourne FL


2003-02-28 20:40:27

by NeilBrown

Subject: Re: RAID SCSI binding

On Friday February 28, [email protected] wrote:
> Linux 2.4.20 (but not unique to this kernel -- been this way for over a year):
> I have a RAID5 array that doesn't startup properly during boot (I have to stop it and restart after the system is up). I've had
> this problem forever and have been trying to fix it.
> Here's what it looks like when it's up and running:
> md4 : active raid5 sdn1[0] sdt1[6] sds1[5] sdr1[4] sdq1[3] sdp1[2] sdo1[1]
> 216490752 blocks level 5, 128k chunk, algorithm 0 [7/7] [UUUUUUU]
> The partitions types are set to 0x83 and the array is being manually
> started in rc.local instead of doing auto-start.

What tool are you using to start the array? raidstart or mdadm?
There doesn't seem to be enough noise in the logs for it to be
raidstart (no "md: autorun ...") so I assume mdadm.

What do you get if you add "-v" to the mdadm running from rc.local?
Can you show me
mdadm -E /dev/sda1 /dev/sdo1

NeilBrown

>
> So it looks to me like the md driver doesn't like the raid system crossing the major device boundary for some odd reason.
> Although I'm not sure why I can stop and restart after the system is totally up but not start it in rc.local using the same
> commands:

I suspect this is just a coincidence. I cannot imagine how the major
device number could affect things at all.

NeilBrown

2003-03-03 13:22:28

by Mike Black

Subject: Re: RAID SCSI binding

Using mdadm v1.0.0

/dev/sdn1:
Magic : a92b4efc
Version : 00.90.00
UUID : 05084a65:52ffddf9:e971cafd:3aca3445
Creation Time : Wed Nov 7 07:19:15 2001
Raid Level : raid5
Device Size : 36081792 (34.41 GiB 36.94 GB)
Raid Devices : 7
Total Devices : 8
Preferred Minor : 4

Update Time : Fri Feb 28 05:53:34 2003
State : dirty, no-errors
Active Devices : 7
Working Devices : 7
Failed Devices : 1
Spare Devices : 0
Checksum : a1e07486 - correct
Events : 0.189

Layout : left-asymmetric
Chunk Size : 128K

Number Major Minor RaidDevice State
this 0 8 209 0 active sync /dev/sdn1
0 0 8 209 0 active sync /dev/sdn1
1 1 8 225 1 active sync /dev/sdo1
2 2 8 241 2 active sync /dev/sdp1
3 3 65 1 3 active sync /dev/sdq1
4 4 65 17 4 active sync /dev/sdr1
5 5 65 33 5 active sync /dev/sds1
6 6 65 49 6 active sync /dev/sdt1
/dev/sdo1:
Magic : a92b4efc
Version : 00.90.00
UUID : 05084a65:52ffddf9:e971cafd:3aca3445
Creation Time : Wed Nov 7 07:19:15 2001
Raid Level : raid5
Device Size : 36081792 (34.41 GiB 36.94 GB)
Raid Devices : 7
Total Devices : 8
Preferred Minor : 4

Update Time : Fri Feb 28 05:53:34 2003
State : dirty, no-errors
Active Devices : 7
Working Devices : 7
Failed Devices : 1
Spare Devices : 0
Checksum : a1e07498 - correct
Events : 0.189

Layout : left-asymmetric
Chunk Size : 128K

Number Major Minor RaidDevice State
this 1 8 225 1 active sync /dev/sdo1
0 0 8 209 0 active sync /dev/sdn1
1 1 8 225 1 active sync /dev/sdo1
2 2 8 241 2 active sync /dev/sdp1
3 3 65 1 3 active sync /dev/sdq1
4 4 65 17 4 active sync /dev/sdr1
5 5 65 33 5 active sync /dev/sds1
6 6 65 49 6 active sync /dev/sdt1
