2001-03-07 21:31:47

by Otto Meier

Subject: Kernel crash during resync of raid5 on SMP

I have been running a dual-processor SMP system on 2.4.2-ac12 for a
while in degraded mode. Today I put in a new disk to switch to
full raid5 mode. Shortly after the raidhotadd command, the
system crashed with the message "lost interrupt on cpu1".

This continued after reboot. I finally managed to get it running again
by booting with the kernel parameter maxcpus=1. In this one-CPU mode
it finished resyncing.

During this process I was never able to resync with two CPUs.

Now that resyncing has finished, the system runs fine in dual-CPU SMP mode again.

Perhaps there might be an issue with spinlocks during resyncing.

Bye Otto








2001-03-07 21:56:17

by NeilBrown

Subject: Re: Kernel crash during resync of raid5 on SMP

On Wednesday March 7, Otto Meier wrote:
> I have been running a dual-processor SMP system on 2.4.2-ac12 for a
> while in degraded mode. Today I put in a new disk to switch to
> full raid5 mode. Shortly after the raidhotadd command, the
> system crashed with the message "lost interrupt on cpu1".

Was there an Oops? Can we see it? Decoded with ksymoops, of course.
Are you happy to retry (i.e. raidsetfaulty, raidhotremove,
raidhotadd)? If so, could you try with 2.4.2?

Whereabouts in the sync process did it die? Start? End? Middle?
Various?

NeilBrown



2001-03-08 15:21:51

by Otto Meier

Subject: Re: Kernel crash during resync of raid5 on SMP

On Thu, 8 Mar 2001 08:55:28 +1100 (EST), Neil Brown wrote:

>On Wednesday March 7, Otto Meier wrote:
>> I have been running a dual-processor SMP system on 2.4.2-ac12 for a
>> while in degraded mode. Today I put in a new disk to switch to
>> full raid5 mode. Shortly after the raidhotadd command, the
>> system crashed with the message "lost interrupt on cpu1".
>
>Was there an Oops? Can we see it? Decoded with ksymoops, of course.

Unfortunately, I entered this command remotely. The console display was
off at the time.

>Are you happy to retry (i.e. raidsetfaulty, raidhotremove,
>raidhotadd)? If so, could you try with 2.4.2?

I would rather not do that, as everything has been running fine again for a day now.

>Whereabouts in the sync process did it die? Start? End? Middle?
>Various?

After the first crash I needed to reboot. It crashed again shortly after
the boot message saying that it was starting to resync.

This happened several times until I used the kernel parameter maxcpus=1.
Then it worked without a problem. After resyncing finished I could start
it again in SMP mode and everything worked fine again.

Sorry that I cannot shed any more light on it.

Otto





2001-03-09 01:28:59

by NeilBrown

Subject: Re: Kernel crash during resync of raid5 on SMP

On Thursday March 8, Otto Meier wrote:
> On Thu, 8 Mar 2001 08:55:28 +1100 (EST), Neil Brown wrote:
>
> >On Wednesday March 7, Otto Meier wrote:
> >> I have been running a dual-processor SMP system on 2.4.2-ac12 for a
> >> while in degraded mode. Today I put in a new disk to switch to
> >> full raid5 mode. Shortly after the raidhotadd command, the
> >> system crashed with the message "lost interrupt on cpu1".
> >
> >Was there an Oops? Can we see it? Decoded with ksymoops, of course.
>
> Unfortunately, I entered this command remotely. The console display was
> off at the time.
>
> >Are you happy to retry (i.e. raidsetfaulty, raidhotremove,
> >raidhotadd)? If so, could you try with 2.4.2?
>
> I would rather not do that, as everything has been running fine again for a day now.
>

Fair enough. When I get my test machine back I might do some testing
and see if I can reproduce it.
In the meantime, if anyone else sees it and gets an Oops, I would be
interested to see it.

NeilBrown