2007-05-30 01:59:41

by Yinghai Lu

Subject: kexec and aacraid broken

With the latest tree, I cannot use kexec to load 2.6.22-rc3, at least.

I got:

AAC0: adapter kernel panic'd fffffffd
AAC0: adapter kernel failed to start, init status=0


but I can load 2.6.21.3


YH


2007-05-30 02:14:14

by Andrew Morton

Subject: Re: kexec and aacraid broken

On Tue, 29 May 2007 18:59:32 -0700 "Yinghai Lu" <[email protected]> wrote:

> latest tree, can not use kexec to load 2.6.22-rc3 at least.
>
> got:
>
> AAC0: adapter kernel panic'd fffffffd
> AAC0: adapter kernel failed to start, init status=0

One of the two diffs below, I guess. Please do a `patch -R -p1' of this
email and retest?

>
> but can load 2.6.21.3
>

Michal, can you please add this to the regression list?




commit 9e4d4a5d71d673901d9c1df5146ce545c2cc0cc0
Author: Salyzyn, Mark <[email protected]>
Date: Tue May 1 11:43:06 2007 -0400

[SCSI] aacraid: superfluous adapter reset for IBM 8 series ServeRAID controllers

The kexec patch introduced a superfluous (and otherwise inert) reset of
some adapters. The register can have a hardware default value that has
zeros for the undefined interrupts. This patch refines the test of the
interrupt enable register to focus on only the interrupts that affect
the driver in order to detect if an incomplete shutdown of the Adapter
had occurred (kdump).

Signed-off-by: Mark Salyzyn <[email protected]>
Signed-off-by: James Bottomley <[email protected]>

diff --git a/drivers/scsi/aacraid/rx.c b/drivers/scsi/aacraid/rx.c
index b6ee3c0..291cd14 100644
--- a/drivers/scsi/aacraid/rx.c
+++ b/drivers/scsi/aacraid/rx.c
@@ -542,7 +542,7 @@ int _aac_rx_init(struct aac_dev *dev)
 	dev->a_ops.adapter_sync_cmd = rx_sync_cmd;
 	dev->a_ops.adapter_enable_int = aac_rx_disable_interrupt;
 	dev->OIMR = status = rx_readb (dev, MUnit.OIMR);
-	if ((((status & 0xff) != 0xff) || reset_devices) &&
+	if ((((status & 0x0c) != 0x0c) || reset_devices) &&
 	  !aac_rx_restart_adapter(dev, 0))
 		++restart;
 	/*

commit a5694ec545a880f9d23463fddc894f5096cc68fa
Author: Salyzyn, Mark <[email protected]>
Date: Mon Apr 30 13:22:24 2007 -0400

[SCSI] aacraid: kexec fix (reset interrupt handler)

Another layer on this onion also discovered by Duane, the
interrupt enable handler also needed to be set ... The interrupt enable
was called from within the synchronous command handler.

Signed-off-by: Mark Salyzyn <[email protected]>
Signed-off-by: James Bottomley <[email protected]>

diff --git a/drivers/scsi/aacraid/rx.c b/drivers/scsi/aacraid/rx.c
index 0c71315..b6ee3c0 100644
--- a/drivers/scsi/aacraid/rx.c
+++ b/drivers/scsi/aacraid/rx.c
@@ -539,6 +539,8 @@ int _aac_rx_init(struct aac_dev *dev)
 	}

 	/* Failure to reset here is an option ... */
+	dev->a_ops.adapter_sync_cmd = rx_sync_cmd;
+	dev->a_ops.adapter_enable_int = aac_rx_disable_interrupt;
 	dev->OIMR = status = rx_readb (dev, MUnit.OIMR);
 	if ((((status & 0xff) != 0xff) || reset_devices) &&
 	  !aac_rx_restart_adapter(dev, 0))

2007-05-30 11:44:27

by Mark Salyzyn

Subject: RE: kexec and aacraid broken

I believe this issue is a result of the aacraid_commit_reset patch (as
posted for scsi-misc-2.6, enclosed to permit testing) not yet propagated
to the 2.6.22-rc3 tree.

This is the adapter taking longer than 3 minutes to start after a reset.
I seriously doubt either of the patches suggested below will have an
effect. And if they do, they are not the root cause: one reduces the
chances that the card will be reset during initialization (thus, applied,
it would likely mitigate this problem); the other prevents a panic when
the Adapter is reset (removed, it would result in dogs and cats sleeping
with each other).

Please use kernel parameter aacraid.startup_timeout=540 (merely larger
than the default 180 seconds) when spawning the kexec or see if the
aacraid_commit_reset.patch resolves the issue to confirm my hunch.

Sincerely -- Mark Salyzyn


Attachments:
aacraid_commit_reset.patch (3.42 kB)

2007-05-30 13:25:05

by Vivek Goyal

Subject: Re: kexec and aacraid broken

On Wed, May 30, 2007 at 07:44:02AM -0400, Salyzyn, Mark wrote:
> I believe this issue is a result of the aacraid_commit_reset patch (as
> posted for scsi-misc-2.6, enclosed to permit testing) not yet propagated
> to the 2.6.22-rc3 tree.
>
> This is the adapter taking longer than 3 minutes to start after a reset.
> I seriously doubt either of these patches suggested below will have an
> affect. And if they do, they are not root cause, one reduces the chances
> that the card will be reset during initialization (thus applied would
> likely mitigate this problem), the other prevents a panic when the
> Adapter is reset (removed, would result in dogs and cats sleeping with
> each other).
>
> Please use kernel parameter aacraid.startup_timeout=540 (merely larger
> than the default 180 seconds) when spawning the kexec or see if the
> aacraid_commit_reset.patch resolves the issue to confirm my hunch.
>

Hi Mark,

During a normal kexec (not kdump), an adapter reset should not have taken
place at all. The device_shutdown() routines should have taken care to
bring the device to a known sane state in the first kernel so that the
second kernel can initialize it without doing a reset.

With the reset patch, a reset now triggers on every kexec. Previously
that was not the case with kexec, and the adapter used to come up. I think
this needs to be looked into.

Thanks
Vivek

2007-05-30 13:57:36

by Mark Salyzyn

Subject: RE: kexec and aacraid broken

This is clouding the issue, Vivek.

There should be no harm, except to time, resetting the adapter. I do
want to optimize for boot time, but do not view this as a 'bug' if the
Adapter should reset during the initialization procedure. We need
instead to harden the driver to deal with Adapters that behave in an
untimely manner as a result of the reset since this generically deals
with all possible transitions (boot w/o BIOS, w/BIOS, kexec and kdump).

I will look into a possibility the driver is not performing the clean
shutdown as a result of a kexec, but that is a refinement and should not
be considered a fix for *this* reported problem; it merely moves the
problem to a kdump. The driver only disables the interrupts when the
driver is .remove'd (aac_remove_one) and not for .shutdown
(aac_shutdown). The latter merely tells the firmware to stop performing
builds if in progress, flush the cache, and all subsequent writes are
performed in write-through mode; it does not clear out the driver
resources and leaves that to the .remove function only. The failure of
.remove being called may be a result of this being a boot driver?

Also, the code:

dev->OIMR = status = rx_readb (dev, MUnit.OIMR);
if ((((status & 0x0c) != 0x0c) . . .

detects if the adapter's interrupts were disabled, as would happen on a
clean shutdown. Some of the Adapters can NOT disable their interrupts,
and some have a default state with the interrupts enabled. If the
Adapter still has active interrupts, then there is no telling what
transpired before and it is considered a safety measure to reset the
Adapter in these cases. I'd prefer to err on the side of resetting the
Adapter superfluously than deal with a condition where the Adapter could
be in an unknown state with a possibility of sustaining an outstanding
command and associated interrupt (which was the whole reason this code
was introduced).
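
To make the effect of the narrower mask concrete, here is a minimal sketch in plain C, outside the kernel. The helper names needs_restart_old and needs_restart_new and the macro OIMR_DRIVER_BITS are hypothetical, introduced only for illustration; the two expressions are copied from the rx.c hunk quoted earlier in this thread. It assumes, as the explanation above implies, that a set bit in OIMR means the corresponding interrupt is masked.

```c
#include <stdint.h>

/* Hypothetical name for the 0x0c mask used in the rx.c hunk: the
 * interrupt-mask bits the driver actually depends on. */
#define OIMR_DRIVER_BITS 0x0c

/* Test before the "superfluous adapter reset" patch: any zero bit in the
 * low byte of OIMR (or reset_devices) requests an adapter restart. */
int needs_restart_old(uint8_t oimr, int reset_devices)
{
	return ((oimr & 0xff) != 0xff) || reset_devices;
}

/* Test after the patch: only the driver-relevant bits are examined, so a
 * hardware default with zeros in undefined bits no longer forces a restart. */
int needs_restart_new(uint8_t oimr, int reset_devices)
{
	return ((oimr & OIMR_DRIVER_BITS) != OIMR_DRIVER_BITS) || reset_devices;
}
```

For an illustrative register value of 0x0f (undefined upper bits read as zero, driver interrupts masked), the old test requests a restart and the new one does not. For 0xfb (one driver-relevant interrupt still enabled), both request a restart, as does any value when reset_devices is set.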

In time, I am sure, I will refine this code to incorporate Quirks for
adapters that have unusual conditions for the above stated interrupt and
remove the possible superfluous reset.

Yinghai, can you please provide the Adapter designation just in case it
could be the first in this refined list. I will NOT consider this
refinement a bugfix for the same reasons stated above.

Sincerely -- Mark Salyzyn

2007-05-30 14:17:49

by Vivek Goyal

Subject: Re: kexec and aacraid broken

On Wed, May 30, 2007 at 09:57:08AM -0400, Salyzyn, Mark wrote:
> This is clouding the issue, Vivek.
>
> There should be no harm, except to time, resetting the adapter. I do
> want to optimize for boot time, but do not view this as a 'bug' if the
> Adapter should reset during the initialization procedure. We need
> instead to harden the driver to deal with Adapters that behave in an
> untimely manner as a result of the reset since this generically deals
> with all possible transitions (boot w/o BIOS, w/BIOS, kexec and kdump).
>

Hi Mark,

I agree. We should make sure that we are able to do a software
reset of adapters.

> I will look into a possibility the driver is not performing the clean
> shutdown as a result of a kexec, but that is a refinement and should not
> be considered a fix for *this* reported problem; it merely moves the
> problem to a kdump.

Agreed. I just wanted to point out that right now we are
triggering a software reset on every kexec, and that is probably not
required. One can avoid it to save boot time. That was the whole
purpose of the kexec (fastboot) project.

But this is not a fix for this problem. We should be able to reset
the device anyway, and should root-cause this.

> The driver only disables the interrupts when the
> driver is .remove'd (aac_remove_one) and not for .shutdown
> (aac_shutdown). The later merely tells the firmware to stop performing
> builds if in progress, flush the cache, and all subsequent writes are
> performed in write-through mode; it does not clear out the driver
> resources and leaves that to the .remove function only. The failure of
> .remove being called may be a result of this being a boot driver?
>


> Also, the code:
>
> dev->OIMR = status = rx_readb (dev, MUnit.OIMR);
> if ((((status & 0x0c) != 0x0c) . . .
>
> detects if the adapter's interrupts were disabled, as would happen on a
> clean shutdown. Some of the Adapters can NOT disable their interrupts,
> and some have a default state with the interrupts enabled. If the
> Adapter still has active interrupts, then there is no telling what
> transpired before and it is considered a safety measure to reset the
> Adapter in these cases. I'd prefer to err on the side of resetting the
> Adapter superfluously than deal with a condition where the Adapter could
> be in an unknown state with a possibility of sustaining an outstanding
> command and associated interrupt (which was the whole reason this code
> was introduced).
>

So most likely, if we start disabling the interrupts in the .shutdown
routine, we might skip resetting the adapter on every kexec without any
side effects?

Thanks
Vivek

2007-05-30 14:32:32

by Mark Salyzyn

Subject: RE: kexec and aacraid broken

Vivek Goyal [mailto:[email protected]] writes:
> So most likely if we start disabling the interrupts
> in .shutdown routine we might skip resetting adapter
> on every kexec without any side affects?

Not that simple. The .shutdown would need to perform more resource
cleanups of the .remove call to prevent side effects. I need to move
some of the .remove activity into the .shutdown handler to make sure the
adapter is quiesced.
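
The interaction described here can be pictured with a toy model in plain C. This is only a sketch under stated assumptions, not the driver code: struct toy_adapter and every toy_* function are hypothetical, the 0x0c mask is taken from the rx.c hunk quoted in this thread, and a set bit in oimr is assumed to mean "interrupt masked". It models a .shutdown that only flushes the cache versus one that also stops the thread and masks interrupts, and what the next kernel's init-time check would then conclude.

```c
#include <stdint.h>

#define OIMR_DRIVER_BITS 0x0c	/* mask from the quoted rx.c hunk */

/* Hypothetical stand-in for the adapter state that survives a kexec. */
struct toy_adapter {
	uint8_t oimr;		/* interrupt mask register; set bit = masked */
	int cache_flushed;	/* firmware told to flush / go write-through */
	int thread_running;	/* driver worker thread still alive */
};

/* Models .shutdown as described above: flush only, interrupts untouched. */
void toy_shutdown_flush_only(struct toy_adapter *a)
{
	a->cache_flushed = 1;
}

/* Models a .shutdown that also takes over part of the .remove work:
 * stop the thread and mask the adapter's interrupts. */
void toy_shutdown_quiesce(struct toy_adapter *a)
{
	a->cache_flushed = 1;
	a->thread_running = 0;
	a->oimr = 0xff;
}

/* Models the next kernel's init-time decision (reset_devices not set). */
int toy_next_kernel_resets(const struct toy_adapter *a)
{
	return (a->oimr & OIMR_DRIVER_BITS) != OIMR_DRIVER_BITS;
}
```

In this model, an adapter left with oimr == 0x00 after toy_shutdown_flush_only() is reset by the next kernel, while the same adapter after toy_shutdown_quiesce() is not. The real handler has more resources to release than this sketch shows.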

I will hold off on submitting any of these changes until they are
evaluated and tested; I am waiting for feedback from Yinghai on the
other mitigations that I feel are closer to the root cause.

Sincerely -- Mark Salyzyn

2007-05-30 15:59:29

by Mark Salyzyn

Subject: [PATCH] aacraid: fix shutdown handler to also disable interrupts.

Moves quiesce, thread and interrupt shutdown into the aacraid driver's
.shutdown handler. This fix to the aac_shutdown handler will remove the
superfluous reset of the adapter during a (clean) kexec.

This fix may mitigate the active investigation 'kexec and aacraid
broken' but it is unlikely to affect the root cause (issue likely
present in both kexec and kdump). This patch reduces the chance the
problem will occur with a kexec. The fix for root cause is currently
expected to be the minimum value check to the aacraid.startup_timeout
driver variable after an adapter reset within aacraid_commit_reset.patch
submitted on 05/22/2007 and awaiting testing by Yinghai to confirm.
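
As a rough illustration of what a "minimum value check" on the startup timeout after a reset could look like, here is a sketch in plain C. It is not taken from aacraid_commit_reset.patch, which is only attached and not reproduced in this thread: the function name, the macro names and in particular the 300-second floor are hypothetical. Only the 180-second default comes from the discussion.

```c
/* Default startup timeout in seconds, as cited in this thread. */
#define TOY_DEFAULT_STARTUP_TIMEOUT	180

/* Hypothetical floor applied after an adapter reset; the thread does not
 * state the value used by the real patch. */
#define TOY_POST_RESET_MIN_TIMEOUT	300

/* Returns the number of seconds to wait for the adapter firmware to start.
 * After a reset the configured value is raised to the floor if it is
 * smaller; otherwise the configured value is used unchanged. */
unsigned int toy_effective_startup_timeout(unsigned int configured,
					   int adapter_was_reset)
{
	if (adapter_was_reset && configured < TOY_POST_RESET_MIN_TIMEOUT)
		return TOY_POST_RESET_MIN_TIMEOUT;
	return configured;
}
```

With these assumed numbers, the default of 180 stays at 180 when no reset happened, is raised to 300 after a reset, and an explicit aacraid.startup_timeout=540 is left at 540 in both cases.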

This attached patch is against current scsi-misc-2.6

ObligatoryDisclaimer: Please accept my condolences regarding Outlook's
handling of patch attachments.

Signed-off-by: Mark Salyzyn <[email protected]>

Sincerely -- Mark Salyzyn


Attachments:
aacraid_shutdown.patch (1.49 kB)

2007-05-30 17:36:46

by Yinghai Lu

Subject: Re: [PATCH] aacraid: fix shutdown handler to also disable interrupts.

On 5/30/07, Salyzyn, Mark <[email protected]> wrote:
> Moves quiesce, thread and interrupt shutdown into aacraid drivers'
> .shutdown handler. This fix to the aac_shutdown handler will remove the
> superfluous reset of the adapter during a (clean) kexec.
>
> This fix may mitigate the active investigation 'kexec and aacraid
> broken' but it is unlikely to affect the root cause (issue likely
> present in both kexec and kdump). This patch reduces the chance the
> problem will occur with a kexec. The fix for root cause is currently
> expected to be the minimum value check to the aacraid.startup_timeout
> driver variable after an adapter reset within aacraid_commit_reset.patch
> submitted on 05/22/2007 and awaiting testing by Yinghai to confirm.
>
> This attached patch is against current scsi-misc-2.6
>
> ObligatoryDisclaimer: Please accept my condolences regarding Outlook's
> handling of patch attachments.
>
> Signed-off-by: Mark Salyzyn <[email protected]>
>
> Sincerely -- Mark Salyzyn
>
A kernel with this patch (call it patch 4), even without

1. [SCSI] aacraid: superfluous adapter reset for IBM 8 series
ServeRAID controllers
2. [SCSI] aacraid: kexec fix (reset interrupt handler)
3. aacraid_commit_reset.patch

can load another kernel, with or without patches 1, 2 and 3.

YH

2007-05-30 21:20:00

by Yinghai Lu

Subject: Re: kexec and aacraid broken

On 5/30/07, Salyzyn, Mark <[email protected]> wrote:
> Vivek Goyal [mailto:[email protected]] writes:
> > So most likely if we start disabling the interrupts
> > in .shutdown routine we might skip resetting adapter
> > on every kexec without any side affects?
>
> Not that simple. The .shutdown would need to perform more resource
> cleanups of the .remove call to prevent side effects. I need to move
> some of the .remove activity into the .shutdown handler to make sure the
> adapter is quiesced.
>
> I will hold off on submitting any of these changes until they are
> evaluated and tested; I am waiting for feedback from Yinghai on the
> other mitigations that I feel are closer to the root cause.
>
1. [SCSI] aacraid: superfluous adapter reset for IBM 8 series
ServeRAID controllers
2. [SCSI] aacraid: kexec fix (reset interrupt handler)
3. aacraid_commit_reset.patch
4. [PATCH] aacraid: fix shutdown handler to also disable interrupts

A kernel with patch 4, even without patches 1, 2 and 3,
can load another kernel, with or without patches 1, 2 and 3.

YH

2007-05-30 21:22:25

by Yinghai Lu

Subject: Re: kexec and aacraid broken

On 5/30/07, Salyzyn, Mark <[email protected]> wrote:
> I believe this issue is a result of the aacraid_commit_reset patch (as
> posted for scsi-misc-2.6, enclosed to permit testing) not yet propagated
> to the 2.6.22-rc3 tree.
>
> This is the adapter taking longer than 3 minutes to start after a reset.
> I seriously doubt either of these patches suggested below will have an
> affect. And if they do, they are not root cause, one reduces the chances
> that the card will be reset during initialization (thus applied would
> likely mitigate this problem), the other prevents a panic when the
> Adapter is reset (removed, would result in dogs and cats sleeping with
> each other).
>
> Please use kernel parameter aacraid.startup_timeout=540 (merely larger
> than the default 180 seconds) when spawning the kexec or see if the
> aacraid_commit_reset.patch resolves the issue to confirm my hunch.
>

aacraid_commit_reset.patch is in the mainline already.

YH

2007-05-30 21:50:51

by Mark Salyzyn

Subject: RE: kexec and aacraid broken

Yinghai Lu [mailto:[email protected]] writes:
> aacraid_commit_reset.patch is in the mainline already.

But aacraid_commit_reset.patch is not in 2.6.22-rc3 (to which you report
the issue). Does the aacraid_commit_reset.patch work to resolve this
issue all by itself in the kexec'd kernel? Or alternatively did you try
aacraid.startup_timeout=540 as one of the kernel parameters passed to
the kexec'd kernel?

The '[PATCH] aacraid: fix shutdown handler to also disable interrupts'
patch (you refer to this as patch 4) is not to be in the picture because
it will hide the root cause. I believe I have you correct in stating
that this patch (4) resolves the problem... but I expect the problem to
remain with kdump.

Sincerely -- Mark Salyzyn

2007-05-30 22:12:40

by Yinghai Lu

Subject: Re: kexec and aacraid broken

On 5/30/07, Salyzyn, Mark <[email protected]> wrote:
> Yinghai Lu [mailto:[email protected]] writes:
> > aacraid_commit_reset.patch is in the mainline already.
>
> But aacraid_commit_reset.patch is not in 2.6.22-rc3 (to which you report
> the issue). Does the aacraid_commit_reset.patch work to resolve this
> issue all by itself in the kexec'd kernel? Or alternatively did you try
> aacraid.startup_timeout=540 as one of the kernel parameters passed to
> the kexec'd kernel?

No, I still get the adapter kernel panic.

>
> The '[PATCH] aacraid: fix shutdown handler to also disable interrupts'
> patch (you refer to this as patch 4) is not to be in the picture because
> it will hide the root cause. I believe I have you correct in stating
> that this patch (4) resolves the problem... but I expect the problem to
> remain with kdump.

Oh.
Without patch 4, the latest kernel can still kexec into 2.6.21.3.
I will try loading 2.6.22-rc1 etc.

YH

2007-05-31 12:38:21

by Mark Salyzyn

Subject: RE: kexec and aacraid broken

> No, still get adapter kernel panic

Which adapter are you using?

Sincerely -- Mark Salyzyn

2007-05-31 19:59:34

by Yinghai Lu

Subject: Re: kexec and aacraid broken

SUN coguar with 11731

YH

On 5/31/07, Salyzyn, Mark <[email protected]> wrote:
> > No, still get adapter kernel panic
>
> Which adapter are you using?
>
> Sincerely -- Mark Salyzyn
>

2007-05-31 20:46:12

by Mark Salyzyn

Subject: RE: kexec and aacraid broken

Ahhhh. That explains why I am having trouble duplicating this issue thus far.

This is prerelease Firmware on a yet to be released card and thus should
not get any driver workarounds if this issue can be resolved in
Firmware. If this can be duped on a released card with released
Firmware, then the story changes of course; but still does not preclude
a Firmware/Hardware/Drive Compatibility bug ;-} . Until then, please
work this issue via SUN channels so that we get all the necessary card
debug information for our teams to work this.

I will ensure Adaptec will remain on top of this issue since it is
clearly a problem with the Adapter Hardware interfacing. The adapter is
not surviving an IOP_RESET and is going into an Adapter Firmware Kernel
Panic or taking an excessively long period (in the testing thus far >
540 seconds) of time to complete its reset.

Sincerely -- Mark Salyzyn

Yinghai Lu [mailto:[email protected]] sez:
> SUN coguar with 11731
>
> On 5/31/07, Salyzyn, Mark <[email protected]> wrote:
> > > No, still get adapter kernel panic
> >
> > Which adapter are you using?

2007-06-01 11:09:22

by Vivek Goyal

Subject: Re: [PATCH] aacraid: fix shutdown handler to also disable interrupts.

On Wed, May 30, 2007 at 11:59:13AM -0400, Salyzyn, Mark wrote:
> Moves quiesce, thread and interrupt shutdown into aacraid drivers'
> .shutdown handler. This fix to the aac_shutdown handler will remove the
> superfluous reset of the adapter during a (clean) kexec.
>
> This fix may mitigate the active investigation 'kexec and aacraid
> broken' but it is unlikely to affect the root cause (issue likely
> present in both kexec and kdump). This patch reduces the chance the
> problem will occur with a kexec. The fix for root cause is currently
> expected to be the minimum value check to the aacraid.startup_timeout
> driver variable after an adapter reset within aacraid_commit_reset.patch
> submitted on 05/22/2007 and awaiting testing by Yinghai to confirm.
>
> This attached patch is against current scsi-misc-2.6
>
> ObligatoryDisclaimer: Please accept my condolences regarding Outlook's
> handling of patch attachments.
>
> Signed-off-by: Mark Salyzyn <[email protected]>
>

Thanks Mark. This does fix the issue of unnecessary reset of aacraid
adapter over kexec on my machine.

Thanks
Vivek

2007-06-01 17:07:57

by Yinghai Lu

Subject: Re: [PATCH] aacraid: fix shutdown handler to also disable interrupts.

On 6/1/07, Vivek Goyal <[email protected]> wrote:
> On Wed, May 30, 2007 at 11:59:13AM -0400, Salyzyn, Mark wrote:
> > Moves quiesce, thread and interrupt shutdown into aacraid drivers'
> > .shutdown handler. This fix to the aac_shutdown handler will remove the
> > superfluous reset of the adapter during a (clean) kexec.
> >
> > This fix may mitigate the active investigation 'kexec and aacraid
> > broken' but it is unlikely to affect the root cause (issue likely
> > present in both kexec and kdump). This patch reduces the chance the
> > problem will occur with a kexec. The fix for root cause is currently
> > expected to be the minimum value check to the aacraid.startup_timeout
> > driver variable after an adapter reset within aacraid_commit_reset.patch
> > submitted on 05/22/2007 and awaiting testing by Yinghai to confirm.
> >
> > This attached patch is against current scsi-misc-2.6
> >
> > ObligatoryDisclaimer: Please accept my condolences regarding Outlook's
> > handling of patch attachments.
> >
> > Signed-off-by: Mark Salyzyn <[email protected]>
> >
>
> Thanks Mark. This does fix the issue of unnecessary reset of aacraid
> adapter over kexec on my machine.
>
I'm a little confused about that.
This patch makes the shutdown clean, so that even with the tighter check
the next start will not try to reset the adapter firmware. Right, Mark?
Maybe the driver could be smart enough to find out whether it needs to
reset the Adaptec firmware.

YH

2007-06-01 17:34:43

by Mark Salyzyn

Subject: RE: [PATCH] aacraid: fix shutdown handler to also disable interrupts.

Yes, this patch makes sure that the Adapter is shut down correctly, and
thus when the kexec driver loads, it does not automatically reset the
adapter during initialization. This regression was a result of adding
code to the driver to detect if the adapter needed a reset as a result
of an unclean shutdown in order to deal with an issue that came up with
kdump. Kdump does not issue a clean shutdown. As you see, it was the
process of making the driver smarter to find out if it needed to reset
the adaptec fw that triggered the problem.

As noted before, please be advised to go through SUN channels. Upgrade
your Drive(s), SES, Motherboard and Card Firmware to the latest
versions; and make sure you are using compatible drives and drive bays
to see if this problem dealing with the superfluous reset on your
pre-release system goes away. You will be able to trigger this by trying
to perform a kdump on the system, OR by reverting this patch and running
your kexec test. The superfluous reset has yet to cause an issue with a
released card beyond noticing a superfluous Firmware reset as Vivek has
pointed out.

Sincerely -- Mark Salyzyn

From: Yinghai Lu [mailto:[email protected]] sez:
> On 6/1/07, Vivek Goyal <[email protected]> wrote:
> > Thanks Mark. This does fix the issue of unnecessary reset of aacraid
> > adapter over kexec on my machine.
> i'm little confused about that.
> this patch is some clear shutdown, so even next start will have tight
> condition will not try to reset the adapter fw. right Mark?
> Maybe the driver could be smart to find out if it need to
> reset adaptec fw.
>
> YH