2008-02-26 09:27:01

by Kuan Luo

[permalink] [raw]
Subject: [PATCH] sata_nv: fix nmi intr or system hanging in rhel4u6 adma.

Hi, robert

One customer reported that their system received a nmi interrupt after
issuing "dd if=/dev/sdb of=/dev/null" on a defective disk in rhel4u6.
I tested it and found that my system hung both in rhel4u6(2.6.9-67) and
2.6.24-rc7.
The patch can work well, but I am not sure if the patch has other
potential effect on adma.
I attached a file in case of lines breaked.

The below info comes from Gunther Mayer to reproduce the issue.
"
used a Seagate ST3500841NS 3.AE for my test; probably other
seagate drives are also capable of creating media errors with
the new hdparm-8.1:

- compile hdparm-8.1
- hdparm -- yes-i-know-what-i-am-doing --make-bad-sector 60000 /dev/sdb

Unfortunately this does not succeed for nvidia sata controller (timeouts
et al.), but it worked fine on AHCI machine (e.g. FSC R640).

When I insert this newly created defective disk in Ultra 20,
it reboots within seconds after issueing "dd if=/dev/sdb of=/dev/null".
"

Signed-off-by: [email protected]

---

drivers/ata/sata_nv.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/ata/sata_nv.c b/drivers/ata/sata_nv.c
index ed5473b..e824260 100644
--- a/drivers/ata/sata_nv.c
+++ b/drivers/ata/sata_nv.c
@@ -837,9 +837,10 @@ static void nv_adma_tf_read(struct ata_port *ap,
struct ata_taskfile *tf)
all shortly be aborted anyway. We assume that NCQ commands
are not
issued via passthrough, which is the only way that switching
into
ADMA mode could abort outstanding commands. */
- nv_adma_register_mode(ap);
+ struct nv_adma_port_priv *pp = ap->private_data;

- ata_tf_read(ap, tf);
+ if (pp->flags & NV_ADMA_PORT_REGISTER_MODE)
+ ata_tf_read(ap, tf);
}

static unsigned int nv_adma_tf_to_cpb(struct ata_taskfile *tf, __le16
*cpb)

-

Best regards,
Kuan Luo

-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------


Attachments:
nmi-patch2 (811.00 B)
nmi-patch2

2008-02-26 14:41:17

by Robert Hancock

[permalink] [raw]
Subject: Re: [PATCH] sata_nv: fix nmi intr or system hanging in rhel4u6 adma.



Kuan Luo wrote:
> Hi, robert
>
> One customer reported that their system received a nmi interrupt after
> issuing "dd if=/dev/sdb of=/dev/null" on a defective disk in rhel4u6.
> I tested it and found that my system hung both in rhel4u6(2.6.9-67) and
> 2.6.24-rc7.
> The patch can work well, but I am not sure if the patch has other
> potential effect on adma.
> I attached a file in case of lines breaked.
>
> The below info comes from Gunther Mayer to reproduce the issue.
> "
> used a Seagate ST3500841NS 3.AE for my test; probably other
> seagate drives are also capable of creating media errors with
> the new hdparm-8.1:
>
> - compile hdparm-8.1
> - hdparm -- yes-i-know-what-i-am-doing --make-bad-sector 60000 /dev/sdb
>
> Unfortunately this does not succeed for nvidia sata controller (timeouts
> et al.), but it worked fine on AHCI machine (e.g. FSC R640).
>
> When I insert this newly created defective disk in Ultra 20,
> it reboots within seconds after issueing "dd if=/dev/sdb of=/dev/null".
> "
>
> Signed-off-by: [email protected]
>
> ---
>
> drivers/ata/sata_nv.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/ata/sata_nv.c b/drivers/ata/sata_nv.c
> index ed5473b..e824260 100644
> --- a/drivers/ata/sata_nv.c
> +++ b/drivers/ata/sata_nv.c
> @@ -837,9 +837,10 @@ static void nv_adma_tf_read(struct ata_port *ap,
> struct ata_taskfile *tf)
> all shortly be aborted anyway. We assume that NCQ commands
> are not
> issued via passthrough, which is the only way that switching
> into
> ADMA mode could abort outstanding commands. */
> - nv_adma_register_mode(ap);
> + struct nv_adma_port_priv *pp = ap->private_data;
>
> - ata_tf_read(ap, tf);
> + if (pp->flags & NV_ADMA_PORT_REGISTER_MODE)
> + ata_tf_read(ap, tf);
> }
>
> static unsigned int nv_adma_tf_to_cpb(struct ata_taskfile *tf, __le16
> *cpb)

This is basically avoiding switching into register mode, right? I don't
think this is a very good solution as the point of the tf_read function
is that it's supposed to read the taskfile provided by the drive to
diagnose the error, so not doing this isn't a good thing.

Is there a reason why going into register mode should cause a lockup in
this case?

2008-02-26 16:28:38

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] sata_nv: fix nmi intr or system hanging in rhel4u6 adma.

Robert Hancock wrote:
>
>
> Kuan Luo wrote:
>> Hi, robert
>> One customer reported that their system received a nmi interrupt after
>> issuing "dd if=/dev/sdb of=/dev/null" on a defective disk in rhel4u6.
>> I tested it and found that my system hung both in rhel4u6(2.6.9-67) and
>> 2.6.24-rc7.
>> The patch can work well, but I am not sure if the patch has other
>> potential effect on adma.
>> I attached a file in case of lines breaked.
>>
>> The below info comes from Gunther Mayer to reproduce the issue.
>> "
>> used a Seagate ST3500841NS 3.AE for my test; probably other seagate
>> drives are also capable of creating media errors with the new hdparm-8.1:
>> - compile hdparm-8.1 - hdparm -- yes-i-know-what-i-am-doing
>> --make-bad-sector 60000 /dev/sdb
>> Unfortunately this does not succeed for nvidia sata controller (timeouts
>> et al.), but it worked fine on AHCI machine (e.g. FSC R640).
>> When I insert this newly created defective disk in Ultra 20, it
>> reboots within seconds after issueing "dd if=/dev/sdb of=/dev/null". "
>>
>> Signed-off-by: [email protected]
>>
>> ---
>>
>> drivers/ata/sata_nv.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/ata/sata_nv.c b/drivers/ata/sata_nv.c
>> index ed5473b..e824260 100644
>> --- a/drivers/ata/sata_nv.c
>> +++ b/drivers/ata/sata_nv.c
>> @@ -837,9 +837,10 @@ static void nv_adma_tf_read(struct ata_port *ap,
>> struct ata_taskfile *tf)
>> all shortly be aborted anyway. We assume that NCQ commands
>> are not
>> issued via passthrough, which is the only way that switching
>> into
>> ADMA mode could abort outstanding commands. */
>> - nv_adma_register_mode(ap);
>> + struct nv_adma_port_priv *pp = ap->private_data;
>>
>> - ata_tf_read(ap, tf);
>> + if (pp->flags & NV_ADMA_PORT_REGISTER_MODE)
>> + ata_tf_read(ap, tf);
>> }
>>
>> static unsigned int nv_adma_tf_to_cpb(struct ata_taskfile *tf, __le16
>> *cpb)
>
> This is basically avoiding switching into register mode, right? I don't
> think this is a very good solution as the point of the tf_read function
> is that it's supposed to read the taskfile provided by the drive to
> diagnose the error, so not doing this isn't a good thing.

Agree with this analysis -- if ->tf_read() is being called, then
obviously the core wants a current copy of the device's ATA registers.

It is not a good solution to simply avoiding returning meaningful data,
because -- as Robert notes -- we need tf_read for analysis.

Jeff


2008-02-27 04:55:58

by Kuan Luo

[permalink] [raw]
Subject: RE: [PATCH] sata_nv: fix nmi intr or system hanging in rhel4u6 adma.

Jeff wrote:
>robert worte:
> >
> > This is basically avoiding switching into register mode,
> right? I don't
> > think this is a very good solution as the point of the
> tf_read function
> > is that it's supposed to read the taskfile provided by the drive to
> > diagnose the error, so not doing this isn't a good thing.
>
> Agree with this analysis -- if ->tf_read() is being called, then
> obviously the core wants a current copy of the device's ATA registers.
>
> It is not a good solution to simply avoiding returning
> meaningful data,
> because -- as Robert notes -- we need tf_read for analysis.
>
> Jeff
>

The driver got one error : "nv_adma_check_cpb: CPB 0, flags=0x11". The
code entered ata_port_abort -> ata_qc_complete
-> fill_result_tf->nv_adma_tf_read.

Firstly, nv_adma_register_mode failed, showing the below messages:
timeout waiting for ADMA IDLE, stat=0x440
timeout waiting for ADMA LEGACY, stat=0x440

Then enter ata_tf_read function.
I found the system hung at tf->hob_nsect = ioread8(ioaddr->nsect_addr);
Sometimes the screen showed "
CPU0: Machin check Exception 0000000000000004
Bank 4:b200000000070f0f
kernel panic -not syncing: CPU Context corrupt.
"
If nv_adma_register_mode failed, the reg result should be not
meaningful.
I don't know why the systm hung.
-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------

2008-02-27 05:22:27

by Robert Hancock

[permalink] [raw]
Subject: Re: [PATCH] sata_nv: fix nmi intr or system hanging in rhel4u6 adma.

Kuan Luo wrote:
> Jeff wrote:
>> robert worte:
>>> This is basically avoiding switching into register mode,
>> right? I don't
>>> think this is a very good solution as the point of the
>> tf_read function
>>> is that it's supposed to read the taskfile provided by the drive to
>>> diagnose the error, so not doing this isn't a good thing.
>> Agree with this analysis -- if ->tf_read() is being called, then
>> obviously the core wants a current copy of the device's ATA registers.
>>
>> It is not a good solution to simply avoiding returning
>> meaningful data,
>> because -- as Robert notes -- we need tf_read for analysis.
>>
>> Jeff
>>
>
> The driver got one error : "nv_adma_check_cpb: CPB 0, flags=0x11". The
> code entered ata_port_abort -> ata_qc_complete
> -> fill_result_tf->nv_adma_tf_read.
>
> Firstly, nv_adma_register_mode failed, showing the below messages:
> timeout waiting for ADMA IDLE, stat=0x440
> timeout waiting for ADMA LEGACY, stat=0x440
>
> Then enter ata_tf_read function.
> I found the system hung at tf->hob_nsect = ioread8(ioaddr->nsect_addr);
> Sometimes the screen showed "
> CPU0: Machin check Exception 0000000000000004
> Bank 4:b200000000070f0f
> kernel panic -not syncing: CPU Context corrupt.
> "
> If nv_adma_register_mode failed, the reg result should be not
> meaningful.
> I don't know why the systm hung.

I suspect nv_adma_register_mode failing may be the key. I don't know why
this happens. It could be the timeout needs to be longer but I've tried
increasing it significantly without apparent effect. (Essentially it's
waiting for the status to go to the IDLE state, and then clears the GO
bit and waits for LEGACY to be set. In this case both waits timed out.)

2008-02-28 00:22:40

by Robert Hancock

[permalink] [raw]
Subject: Re: [PATCH] sata_nv: fix nmi intr or system hanging in rhel4u6 adma.

Kuan Luo wrote:
> Jeff wrote:
>> robert worte:
>>> This is basically avoiding switching into register mode,
>> right? I don't
>>> think this is a very good solution as the point of the
>> tf_read function
>>> is that it's supposed to read the taskfile provided by the drive to
>>> diagnose the error, so not doing this isn't a good thing.
>> Agree with this analysis -- if ->tf_read() is being called, then
>> obviously the core wants a current copy of the device's ATA registers.
>>
>> It is not a good solution to simply avoiding returning
>> meaningful data,
>> because -- as Robert notes -- we need tf_read for analysis.
>>
>> Jeff
>>
>
> The driver got one error : "nv_adma_check_cpb: CPB 0, flags=0x11". The
> code entered ata_port_abort -> ata_qc_complete
> -> fill_result_tf->nv_adma_tf_read.
>
> Firstly, nv_adma_register_mode failed, showing the below messages:
> timeout waiting for ADMA IDLE, stat=0x440
> timeout waiting for ADMA LEGACY, stat=0x440
>
> Then enter ata_tf_read function.
> I found the system hung at tf->hob_nsect = ioread8(ioaddr->nsect_addr);
> Sometimes the screen showed "
> CPU0: Machin check Exception 0000000000000004
> Bank 4:b200000000070f0f
> kernel panic -not syncing: CPU Context corrupt.
> "
> If nv_adma_register_mode failed, the reg result should be not
> meaningful.
> I don't know why the systm hung.

By the way, this MCE indicates that the CPU integrated northbridge got a
watchdog error indicating that a bus transaction timed out (presumably
on the HyperTransport link). This likely indicates that the chipset
failed to respond to a CPU read of that register. Obviously this is
something we want to avoid causing..

2008-02-28 03:58:32

by Kuan Luo

[permalink] [raw]
Subject: RE: [PATCH] sata_nv: fix nmi intr or system hanging in rhel4u6 adma.

Rober wrote:
> Kuan Luo wrote:
> > Jeff wrote:
> >> robert worte:
> >>> This is basically avoiding switching into register mode,
> >> right? I don't
> >>> think this is a very good solution as the point of the
> >> tf_read function
> >>> is that it's supposed to read the taskfile provided by
> the drive to
> >>> diagnose the error, so not doing this isn't a good thing.
> >> Agree with this analysis -- if ->tf_read() is being called, then
> >> obviously the core wants a current copy of the device's
> ATA registers.
> >>
> >> It is not a good solution to simply avoiding returning
> >> meaningful data,
> >> because -- as Robert notes -- we need tf_read for analysis.
> >>
> >> Jeff
> >>
> >
> > The driver got one error : "nv_adma_check_cpb: CPB 0,
> flags=0x11". The
> > code entered ata_port_abort -> ata_qc_complete
> > -> fill_result_tf->nv_adma_tf_read.
> >
> > Firstly, nv_adma_register_mode failed, showing the below messages:
> > timeout waiting for ADMA IDLE, stat=0x440
> > timeout waiting for ADMA LEGACY, stat=0x440
> >
> > Then enter ata_tf_read function.
> > I found the system hung at tf->hob_nsect =
> ioread8(ioaddr->nsect_addr);
> > Sometimes the screen showed "
> > CPU0: Machin check Exception 0000000000000004
> > Bank 4:b200000000070f0f
> > kernel panic -not syncing: CPU Context corrupt.
> > "
> > If nv_adma_register_mode failed, the reg result should be not
> > meaningful.
> > I don't know why the systm hung.
>
> By the way, this MCE indicates that the CPU integrated
> northbridge got a
> watchdog error indicating that a bus transaction timed out
> (presumably
> on the HyperTransport link). This likely indicates that the chipset
> failed to respond to a CPU read of that register. Obviously this is
> something we want to avoid causing..
>
In the process of debugging this issue, i found hardware sometimes
entered another wrong stat.
nv_adma_interrupt: adma status 0x1540 notifier 0xffffffff notifer_err
0x0 sactive 0x1.

As the status has no NV_ADMA_STAT_CPBERR, the check_commands =
0xffffffff.
Actually, there is only one command(sactive 0x1).

diff --git a/drivers/ata/sata_nv.c b/drivers/ata/sata_nv.c
index ed5473b..7c24e71 100644
--- a/drivers/ata/sata_nv.c
+++ b/drivers/ata/sata_nv.c
@@ -1052,19 +1052,18 @@ static irqreturn_t nv_adma_interrupt(int irq,
void *dev_instance)
if (status & (NV_ADMA_STAT_DONE |
NV_ADMA_STAT_CPBERR |
NV_ADMA_STAT_CMD_COMPLETE)) {
- u32 check_commands = notifier_clears[i];
+ u32 check_commands = 0;
int pos, error = 0;

- if (status & NV_ADMA_STAT_CPBERR) {
- /* Check all active commands */
- if
(ata_tag_valid(ap->link.active_tag))
- check_commands = 1 <<
-
ap->link.active_tag;
- else
- check_commands = ap->
- link.sactive;
- }
+ /* Check all active commands */
+ if (ata_tag_valid(ap->link.active_tag))
+ check_commands = 1 <<
+ ap->link.active_tag;
+ else
+ check_commands = ap->
+ link.sactive;

+ check_commands &= notifier_clears[i];
/** Check CPBs for completed commands */
while ((pos = ffs(check_commands)) &&
!error) {
pos--;

-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------