2007-09-26 20:26:53

by Torsten Kaiser

[permalink] [raw]
Subject: sata_sil24 broken since 2.6.23-rc4-mm1

As reported in the "2.6.23-rc4-mm1"-thread and the "What's in
linux-2.6-block.git for 2.6.24"-thread I'm having trouble that
sometimes on bootup one drive from the SiI-3132 throws errors and
becomes inaccesible.

The latest kernel I have seen this error was 2.6.23-rc7-mm1.
>From 7 boots 2 times the following happend:

Sep 25 07:42:11 treogen [ 33.810000] md1: bitmap initialized from
disk: read 10/10 pages, set 0 bits
Sep 25 07:42:11 treogen [ 33.810000] created bitmap (145 pages) for device md1
Sep 25 07:42:11 treogen [ 63.910000] ata1.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x6 frozen
Sep 25 07:42:11 treogen [ 63.910000] ata1.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 25 07:42:11 treogen [ 63.910000] res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 25 07:42:11 treogen [ 63.910000] ata1.00: status: {DRDY }
Sep 25 07:42:11 treogen [ 63.910000] ata1: hard resetting link
Sep 25 07:42:11 treogen [ 66.210000] ata1: softreset failed (port not ready)
Sep 25 07:42:11 treogen [ 66.210000] ata1: reset failed (errno=-5),
retrying in 8 secs
Sep 25 07:42:11 treogen [ 73.910000] ata1: hard resetting link
Sep 25 07:42:11 treogen [ 76.210000] ata1: softreset failed (port not ready)
Sep 25 07:42:11 treogen [ 76.210000] ata1: reset failed (errno=-5),
retrying in 8 secs
Sep 25 07:42:11 treogen [ 83.910000] ata1: hard resetting link
Sep 25 07:42:11 treogen [ 86.210000] ata1: softreset failed (port not ready)
Sep 25 07:42:11 treogen [ 86.210000] ata1: reset failed (errno=-5),
retrying in 33 secs
Sep 25 07:42:11 treogen [ 118.910000] ata1: limiting SATA link speed
to 1.5 Gbps
Sep 25 07:42:11 treogen [ 118.910000] ata1: hard resetting link
Sep 25 07:42:11 treogen [ 121.210000] ata1: softreset failed (port not ready)
Sep 25 07:42:11 treogen [ 121.210000] ata1: reset failed, giving up
Sep 25 07:42:11 treogen [ 121.210000] ata1.00: disabled
Sep 25 07:42:11 treogen [ 121.210000] ata1: EH complete
Sep 25 07:42:11 treogen [ 121.210000] sd 0:0:0:0: [sda] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Sep 25 07:42:11 treogen [ 121.210000] end_request: I/O error, dev
sda, sector 625137161
Sep 25 07:42:11 treogen [ 121.210000] md: super_written gets
error=-5, uptodate=0
Sep 25 07:42:11 treogen [ 121.210000] raid5: Disk failure on sda2,
disabling device. Operation continuing on 2 devices

Comparing the driver/ata directory from rc3-mm1 and rc4-mm1 the
following change looked the most suspicions to me:
http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=blobdiff;f=drivers/ata/sata_sil24.c;h=3dcb223117be9739ee04d70b6bfc776a4b839a3f;hp=e0cd31aa8002350add53ba6ff07493e503275244;hb=020bc1bd8d369a77bd9379cd9763ac0057651753;hpb=8d4bdf8087e682df98bdb856f6ad451bf6d597e7

That after rc4-mm1 the sata_sil24.c did not change anymore also
matches the occurrence of the error.

To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
the version from rc3-mm1.
I was able to boot the resulting kernel successfully 5 times, without
the error happening again.

Torsten


2007-09-27 04:57:01

by Tejun Heo

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
index 3831920..dc3ddcb 100644
--- a/drivers/ata/sata_sil24.c
+++ b/drivers/ata/sata_sil24.c
@@ -1117,6 +1117,7 @@ static int sil24_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)

host->ports[i]->ioaddr.cmd_addr = port;
host->ports[i]->ioaddr.scr_addr = port + PORT_SCONTROL;
+ ata_std_ports(&ap->ioaddr);

ata_port_pbar_desc(ap, SIL24_HOST_BAR, -1, "host");
ata_port_pbar_desc(ap, SIL24_PORT_BAR, offset, "port");


Attachments:
patch (511.00 B)

2007-09-27 04:59:33

by Tejun Heo

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

Tejun Heo wrote:
> Torsten Kaiser wrote:
>> Comparing the driver/ata directory from rc3-mm1 and rc4-mm1 the
>> following change looked the most suspicions to me:
>> http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=blobdiff;f=drivers/ata/sata_sil24.c;h=3dcb223117be9739ee04d70b6bfc776a4b839a3f;hp=e0cd31aa8002350add53ba6ff07493e503275244;hb=020bc1bd8d369a77bd9379cd9763ac0057651753;hpb=8d4bdf8087e682df98bdb856f6ad451bf6d597e7
>>
>> That after rc4-mm1 the sata_sil24.c did not change anymore also
>> matches the occurrence of the error.
>>
>> To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
>> the version from rc3-mm1.
>> I was able to boot the resulting kernel successfully 5 times, without
>> the error happening again.
>
> Thanks a lot for chasing down the problem. The changed code is address
> initialization path and it's weird that it causes intermittent failures,
> not a consistent one.
>
> Anyways, does the attached patch fix the problem?

If not, can you add printk of iomap[SIL24_PORT_BAR], offset, initialized
cmd_addr and scr_addr in the loop and see whether anything is different
between when the driver works and fails.

Thanks.

--
tejun

2007-09-27 06:14:41

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/27/07, Tejun Heo <[email protected]> wrote:
> Tejun Heo wrote:
> > Torsten Kaiser wrote:
> >> Comparing the driver/ata directory from rc3-mm1 and rc4-mm1 the
> >> following change looked the most suspicions to me:
> >> http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=blobdiff;f=drivers/ata/sata_sil24.c;h=3dcb223117be9739ee04d70b6bfc776a4b839a3f;hp=e0cd31aa8002350add53ba6ff07493e503275244;hb=020bc1bd8d369a77bd9379cd9763ac0057651753;hpb=8d4bdf8087e682df98bdb856f6ad451bf6d597e7
> >>
> >> That after rc4-mm1 the sata_sil24.c did not change anymore also
> >> matches the occurrence of the error.
> >>
> >> To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
> >> the version from rc3-mm1.
> >> I was able to boot the resulting kernel successfully 5 times, without
> >> the error happening again.
> >
> > Thanks a lot for chasing down the problem. The changed code is address
> > initialization path and it's weird that it causes intermittent failures,
> > not a consistent one.
> >
> > Anyways, does the attached patch fix the problem?

I'm starting to *really* hate that bug.
My analysis was wrong, as I booted to modified 2.6.23-rc8-mm1 this
morning, that failed too. (Same error messages as -rc7-mm1 from the
first mail in this thread.)
So it's not that change that causes the breakage.

And I'm not really finding a good pattern to what boots fail and what work.
It seems to only fail, if I completely power off the system for
several hours. (Using the physical switch at the backside of the
powersupply, not the normal soft-off)

One of the five boots I tried yesterday, I also powered the system
completely off that way, but only leaving it off ~10..20 seconds
seemed not to trigger the bug.

But I still think that is not a hardware failure, as the -rc3-mm1
kernel never showed that error, even when I used it several times
after the first -rc4-mm1 failures.

> If not, can you add printk of iomap[SIL24_PORT_BAR], offset, initialized
> cmd_addr and scr_addr in the loop and see whether anything is different
> between when the driver works and fails.

Should I do this anyway?

I compared the dmesg form good and bad boots with -rc7-mm1 but could
not see any difference, so do you think that these additional
diagnostics could show a difference?
Or could you suggest any other debugging options I should try?

Torsten

2007-09-27 06:25:27

by Jeff Garzik

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

Torsten Kaiser wrote:
> On 9/27/07, Tejun Heo <[email protected]> wrote:
>> Tejun Heo wrote:
>>> Torsten Kaiser wrote:
>>>> Comparing the driver/ata directory from rc3-mm1 and rc4-mm1 the
>>>> following change looked the most suspicions to me:
>>>> http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=blobdiff;f=drivers/ata/sata_sil24.c;h=3dcb223117be9739ee04d70b6bfc776a4b839a3f;hp=e0cd31aa8002350add53ba6ff07493e503275244;hb=020bc1bd8d369a77bd9379cd9763ac0057651753;hpb=8d4bdf8087e682df98bdb856f6ad451bf6d597e7
>>>>
>>>> That after rc4-mm1 the sata_sil24.c did not change anymore also
>>>> matches the occurrence of the error.
>>>>
>>>> To confirm my theorie I exchanged the sata_sil24.c from rc8-mm1 with
>>>> the version from rc3-mm1.
>>>> I was able to boot the resulting kernel successfully 5 times, without
>>>> the error happening again.
>>> Thanks a lot for chasing down the problem. The changed code is address
>>> initialization path and it's weird that it causes intermittent failures,
>>> not a consistent one.
>>>
>>> Anyways, does the attached patch fix the problem?
>
> I'm starting to *really* hate that bug.
> My analysis was wrong, as I booted to modified 2.6.23-rc8-mm1 this
> morning, that failed too. (Same error messages as -rc7-mm1 from the
> first mail in this thread.)
> So it's not that change that causes the breakage.
>
> And I'm not really finding a good pattern to what boots fail and what work.
> It seems to only fail, if I completely power off the system for
> several hours. (Using the physical switch at the backside of the
> powersupply, not the normal soft-off)
>
> One of the five boots I tried yesterday, I also powered the system
> completely off that way, but only leaving it off ~10..20 seconds
> seemed not to trigger the bug.
>
> But I still think that is not a hardware failure, as the -rc3-mm1
> kernel never showed that error, even when I used it several times
> after the first -rc4-mm1 failures.
>
>> If not, can you add printk of iomap[SIL24_PORT_BAR], offset, initialized
>> cmd_addr and scr_addr in the loop and see whether anything is different
>> between when the driver works and fails.
>
> Should I do this anyway?
>
> I compared the dmesg form good and bad boots with -rc7-mm1 but could
> not see any difference, so do you think that these additional
> diagnostics could show a difference?
> Or could you suggest any other debugging options I should try?

I think since its a reproducible problem, I think it's easiest to get
you straight to git-bisect. In this case, that would be

1. Clone branch "upstream" of
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git

2. Test. If bug persists, you have narrowed down the problem to the -mm
changes from the SATA developers, that are to be sent for 2.6.24. If
the problem does not persist, then it's a problem added in the -mm
patchset alone, which carries few ATA patches outside of libata-dev.git.

3. If the problem is in libata-dev.git#upstream (likely), you can now
use git-bisect to find the specific commit that causes the problems.
Read the git-bisect man page for full details, but the basics are

a) start with a known good point (v2.6.22? v2.6.23?) and
known bad point (HEAD, aka the most recent commit in
libata-dev.git#upstream)

b) build and boot kernels, marking each as known-good or
known-bad.

c) This process will systematically narrow down the problem
to a single git commit.

Regards,

Jeff



2007-09-27 17:34:45

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/27/07, Jeff Garzik <[email protected]> wrote:
> Torsten Kaiser wrote:
> > I compared the dmesg form good and bad boots with -rc7-mm1 but could
> > not see any difference, so do you think that these additional
> > diagnostics could show a difference?
> > Or could you suggest any other debugging options I should try?
>
> I think since its a reproducible problem, I think it's easiest to get
> you straight to git-bisect. In this case, that would be

Sorry, but I don't think that will work.
It seems that I am able to reproduce the bug, but not reliable. And my
current best guess to make it happen involves the step "leaf the
computer powered off for 8 hours".
I estimate that even with the 8 hour pause only at ~50% of the boots
one drive fails. So I have no safe point to mark a bisect step as
'good'.

> a) start with a known good point (v2.6.22? v2.6.23?) and
> known bad point (HEAD, aka the most recent commit in
> libata-dev.git#upstream)

Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
I will try to look at the diff between these revisions some more, but
the change in sata_sil24.c looked like a perfect match for the
symptoms I was seeing.

What I just noticed, as I wanted two re-add the drive to the RAID:
This time it was not sda, but sdb that was kicked. But otherwise the
errors are perfectly identical.

I will try to make a 2.6.23-rc3.5-mm1 to narrow it down some more...

Torsten

2007-09-27 23:25:29

by Tejun Heo

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

Torsten Kaiser wrote:
> Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
> I will try to look at the diff between these revisions some more, but
> the change in sata_sil24.c looked like a perfect match for the
> symptoms I was seeing.

I think the first thing to do here is to verify 2.6.23-rc3-mm1 still
works fine and my previous debug patch is pretty much meaningless if
address initialization failure isn't the cause.

Thanks.

--
tejun

2007-09-28 05:36:37

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/27/07, Tejun Heo <[email protected]> wrote:
> Torsten Kaiser wrote:
> > Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
> > I will try to look at the diff between these revisions some more, but
> > the change in sata_sil24.c looked like a perfect match for the
> > symptoms I was seeing.
>
> I think the first thing to do here is to verify 2.6.23-rc3-mm1 still
> works fine and my previous debug patch is pretty much meaningless if
> address initialization failure isn't the cause.

After the first trouble with -rc4-mm1 I switched back to -rc3-mm1. I
booted that kernel 7 times over 4 days and never had trouble. (Before
-rc4-mm1 came out, I used -rc3-mm1 for over a week)

So in case of -rc3-mm1 I'm pretty sure that it works.

Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the "correct" situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )

Torsten

2007-09-30 06:00:37

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/28/07, Torsten Kaiser <[email protected]> wrote:
> So in case of -rc3-mm1 I'm pretty sure that it works.

That's still the case.

> Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
> that 9 times, but from a quick look in /var/log/messages, I might not
> have hit the "correct" situation to trigger the error.
> That kernel is vanilla 2.6.23-rc7 plus the patch from
> http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
> ( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )

That is no longer the case. Yesterday this kernel did also show the failure.
The error looked a little bit different, but happend at the same
location during the bootup.
Sadly dmesg overflowed and I was not able to capture the first error.

[ 53.462632] Freeing unused kernel memory: 332k freed
ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
[ 77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
[ 77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[ 77.170909] res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
0x40 (internal error)
[ 77.504292] ata1: soft resetting port
[ 77.544221] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 77.642567] ata1.00: configured for UDMA/100
[ 77.642572] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE,SUGGEST_OK
[ 77.642574] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current]
[descriptor]
[ 77.642578] Descriptor sense data with sense descriptors (in hex):
[ 77.642580] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 77.642584] 00 00 00 af
[ 77.642586] sd 0:0:0:0: [sda] Add. Sense: No additional sense information
[ 77.642589] end_request: I/O error, dev sda, sector 625137161
[ 77.642593] ata1: EH complete
[ 77.642609] sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
[ 77.642616] sd 0:0:0:0: [sda] Write Protect is off
[ 77.642618] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 77.642628] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 77.666927] md: super_written gets error=-5, uptodate=0
[ 77.666930] raid5: Disk failure on sda2, disabling device.
Operation continuing on 1 devices

That "Operation continuing on 1 devices" is a 'little' bit misleading.
A RAID5 with two failed devices will not continue to operate. :(
The same error is repeated several times, so I expect the first error
also looked like that.

Other things I have done to narrow it down:
Comparing 2.6.23-rc3-mm1 and 2.6.23-rc4-mm1 I found the following hunk
from libata-core.c:
@@ -2182,6 +2183,17 @@
if (ap->ops->cable_detect)
ap->cbl = ap->ops->cable_detect(ap);

+ /* We may have SATA bridge glue hiding here irrespective of the
+ reported cable types and sensed types */
+ ata_link_for_each_dev(dev, &ap->link) {
+ if (!ata_dev_enabled(dev))
+ continue;
+ /* SATA drives indicate we have a bridge. We don't know which
+ end of the link the bridge is which is a problem */
+ if (ata_id_is_sata(dev->id))
+ ap->cbl = ATA_CBL_SATA;
+ }
+

But removing this from the current -mm does still show the error, so
thats not the source of the bug. (That test kicked the first drive
(sdb) from the RAID5 with the same error as most times)

Rebooting into the "safe" rc7-sglist-kernel I got the above error that
killed the RAID5.
I rebooted into a system (kernel 2.6.21-rc5-mm2, please not the
2.6.*21*, that is only a rescue system and not updated often) on a
separate partition to rebuild it. When I tried to readd the failed sdb
the system locked up with this error:

Sep 29 22:25:37 treogen [ 205.407893] ata2.00: exception Emask 0x0
SAct 0x0 SErr 0x0 action 0x2 frozen
Sep 29 22:25:37 treogen [ 205.407900] ata2.00: cmd
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Sep 29 22:25:37 treogen [ 205.407901] res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Sep 29 22:25:38 treogen [ 205.720531] ata2: soft resetting port
Sep 29 22:25:38 treogen [ 205.720554] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:25:38 treogen [ 205.730560] ATA: abnormal status 0xEA on
port 0xffffc20001432007
Sep 29 22:25:38 treogen [ 205.740560] ATA: abnormal status 0xEA on
port 0xffffc20001432007
Sep 29 22:26:07 treogen [ 235.343001] ata1.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x2 frozen
Sep 29 22:26:07 treogen [ 235.343007] ata1.00: cmd
61/08:00:bf:52:2a/00:00:01:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 29 22:26:07 treogen [ 235.343009] res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Sep 29 22:26:08 treogen [ 235.655644] ata1: soft resetting port
Sep 29 22:26:08 treogen [ 235.655666] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:08 treogen [ 235.675602] ata2.00: qc timeout (cmd 0xec)
Sep 29 22:26:08 treogen [ 235.675608] ata2.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:26:08 treogen [ 235.675610] ata2.00: revalidation failed (errno=-5)
Sep 29 22:26:08 treogen [ 235.675612] ata2: failed to recover some
devices, retrying in 5 secs
Sep 29 22:26:13 treogen [ 240.668113] ata2: hard resetting port
Sep 29 22:26:15 treogen [ 242.833438] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:15 treogen [ 242.843444] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:26:15 treogen [ 242.853443] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:26:38 treogen [ 265.590759] ata1.00: qc timeout (cmd 0xec)
Sep 29 22:26:38 treogen [ 265.590765] ata1.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:26:38 treogen [ 265.590767] ata1.00: revalidation failed (errno=-5)
Sep 29 22:26:38 treogen [ 265.590769] ata1: failed to recover some
devices, retrying in 5 secs
Sep 29 22:26:43 treogen [ 270.583271] ata1: hard resetting port
Sep 29 22:26:45 treogen [ 272.748596] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:45 treogen [ 272.758601] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:26:45 treogen [ 272.768602] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:26:45 treogen [ 272.788490] ata2.00: qc timeout (cmd 0xec)
Sep 29 22:26:45 treogen [ 272.788494] ata2.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:26:45 treogen [ 272.788496] ata2.00: revalidation failed (errno=-5)
Sep 29 22:26:45 treogen [ 272.788501] ata2.00: limiting speed to UDMA/100:PIO3
Sep 29 22:26:45 treogen [ 272.788503] ata2: failed to recover some
devices, retrying in 5 secs
Sep 29 22:26:50 treogen [ 277.781001] ata2: hard resetting port
Sep 29 22:26:52 treogen [ 279.946326] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:52 treogen [ 279.956331] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:26:52 treogen [ 279.966330] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:27:15 treogen [ 302.703652] ata1.00: qc timeout (cmd 0xec)
Sep 29 22:27:15 treogen [ 302.703661] ata1.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:27:15 treogen [ 302.703663] ata1.00: revalidation failed (errno=-5)
Sep 29 22:27:15 treogen [ 302.703669] ata1.00: limiting speed to UDMA/100:PIO3
Sep 29 22:27:15 treogen [ 302.703671] ata1: failed to recover some
devices, retrying in 5 secs
Sep 29 22:27:20 treogen [ 307.696161] ata1: hard resetting port
Sep 29 22:27:22 treogen [ 309.884773] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:27:22 treogen [ 309.894784] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:27:22 treogen [ 309.904786] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:27:22 treogen [ 309.904940] ata2.00: qc timeout (cmd 0xec)
Sep 29 22:27:22 treogen [ 309.904946] ata2.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:27:22 treogen [ 309.904948] ata2.00: revalidation failed (errno=-5)
Sep 29 22:27:22 treogen [ 309.904951] ata2.00: disabled
Sep 29 22:27:23 treogen [ 310.406968] ata2: EH complete
Sep 29 22:27:23 treogen [ 310.407013] sd 1:0:0:0: [sdb] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Sep 29 22:27:23 treogen [ 310.407018] end_request: I/O error, dev
sdb, sector 19550911

Is it possible that the bug is much older, and only some change in
rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
is incomplete?

Currently I try a 2.6.23-rc4-mm1 with only 629 of 1628 mm-patches applied...

Torsten

2007-09-30 14:36:43

by Tejun Heo

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

Hello, Torsten.

Torsten Kaiser wrote:
> On 9/28/07, Torsten Kaiser <[email protected]> wrote:
>> So in case of -rc3-mm1 I'm pretty sure that it works.
>
> That's still the case.

Ah... that's weird. It would be much better if -rc3-mm1 is broken too. :-P

>> Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
>> that 9 times, but from a quick look in /var/log/messages, I might not
>> have hit the "correct" situation to trigger the error.
>> That kernel is vanilla 2.6.23-rc7 plus the patch from
>> http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
>> ( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )
>
> That is no longer the case. Yesterday this kernel did also show the failure.
> The error looked a little bit different, but happend at the same
> location during the bootup.
> Sadly dmesg overflowed and I was not able to capture the first error.
>
> [ 53.462632] Freeing unused kernel memory: 332k freed
> ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [ 77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
> [ 77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
> [ 77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
> cdb 0x0 data 4096 out
> [ 77.170909] res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
> 0x40 (internal error)

Hmm... SGT not on qword boundary? Please add the following to the end
of sil24_port_start() and report successful and failed kernel boot log.

printk("XXX sil24 cb=%p cb_dma=%llx\n",
cb, (unsigned long long)cb_dma);

Also, does 'dmesg -s 10000000' give full dmesg?

> I rebooted into a system (kernel 2.6.21-rc5-mm2, please not the
> 2.6.*21*, that is only a rescue system and not updated often) on a
> separate partition to rebuild it. When I tried to readd the failed sdb
> the system locked up with this error:
>
> Sep 29 22:25:37 treogen [ 205.407893] ata2.00: exception Emask 0x0
> SAct 0x0 SErr 0x0 action 0x2 frozen
> Sep 29 22:25:37 treogen [ 205.407900] ata2.00: cmd
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
> Sep 29 22:25:37 treogen [ 205.407901] res
> 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
>
> Is it possible that the bug is much older, and only some change in
> rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
> is incomplete?

I can't tell but there is a pretty large userbase of sil24/32 and you
seem to be the only one to report this problem yet. I think it might be
coming somewhere else than libata or sata_sil24 itself. Hmmm... It
would be really great if you can break -rc3-mm1 too. Correct behavior
for something like that for just one -mm version sounds very weird.

Thanks.

--
tejun

2007-09-30 16:19:30

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/30/07, Tejun Heo <[email protected]> wrote:
> Hello, Torsten.
>
> Torsten Kaiser wrote:
> > On 9/28/07, Torsten Kaiser <[email protected]> wrote:
> >> So in case of -rc3-mm1 I'm pretty sure that it works.
> >
> > That's still the case.
>
> Ah... that's weird. It would be much better if -rc3-mm1 is broken too. :-P

It would be much better if it would break all the time. Then bisecting
would not be guesswork. :-P

> >> Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
> >> that 9 times, but from a quick look in /var/log/messages, I might not
> >> have hit the "correct" situation to trigger the error.
> >> That kernel is vanilla 2.6.23-rc7 plus the patch from
> >> http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
> >> ( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )
> >
> > That is no longer the case. Yesterday this kernel did also show the failure.
> > The error looked a little bit different, but happend at the same
> > location during the bootup.
> > Sadly dmesg overflowed and I was not able to capture the first error.
> >
> > [ 53.462632] Freeing unused kernel memory: 332k freed
> > ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
> > [ 77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
> > [ 77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
> > [ 77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
> > cdb 0x0 data 4096 out
> > [ 77.170909] res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
> > 0x40 (internal error)
>
> Hmm... SGT not on qword boundary? Please add the following to the end
> of sil24_port_start() and report successful and failed kernel boot log.
>
> printk("XXX sil24 cb=%p cb_dma=%llx\n",
> cb, (unsigned long long)cb_dma);

Added to my current "2.6.23-rc4-mm0.5"
I will report it's output.

> Also, does 'dmesg -s 10000000' give full dmesg?

That boot ended in a minimal initrd environment that normally only
starts the RAID5 and then opens contained encrypted real root.
I was just able to push the output from dmesg through the serial link,
but had no man pages to tell me about -s ...
And that kind of error was until now a one-of-a-kind one. All other
errors where not "internal error" but "timeout".
But one time I had another SGT related error:
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: exception Emask 0x20
SAct 0x1 SErr 0x0 action 0x2
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: irq_stat 0x00020002,
PCI master abort while fetching SGT
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 11 19:19:24 treogen [ 33.340000] res
50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)

What I find kind of interessing is, that while I got three different
error codes the cmd part of the output was always the same.

> > Is it possible that the bug is much older, and only some change in
> > rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
> > is incomplete?
>
> I can't tell but there is a pretty large userbase of sil24/32 and you
> seem to be the only one to report this problem yet. I think it might be
> coming somewhere else than libata or sata_sil24 itself. Hmmm... It
> would be really great if you can break -rc3-mm1 too. Correct behavior
> for something like that for just one -mm version sounds very weird.

It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
narrowing on rc4-mm1 because that was the first version to break.

I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:

[the 2.6.23-rc4-mm1 series-file has 2013 lines]
Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
(line779): 2 good boots
Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
1045): 1 failed
Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed

Next try: up to patch fs-remove-some-aop_truncated_page.patch

That means from the patches added to the rc4 variant of the mm-kernel
the following are remaining:
git-xfs-build-fix.patch
enforce-noreplace-smp-in-alternative_instructions.patch
paravirt-fix-preemptible-lazy-mode-bug.patch
i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
fix-the-max-path-calculation-in-radix-treec-update.patch
mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
introduce-write_begin-write_end-aops-fix2.patch
implement-simple-fs-aops-fix.patch
ext2-convert-to-new-aops-fix2.patch
ext3-convert-to-new-aops-fix-fix.patch
ext4-convert-to-new-aops-fix-fix.patch
gfs2-convert-to-new-aops-fix.patch
reiserfs-convert-to-new-aops-fix2.patch
hostfs-convert-to-new-aops-fix-fix.patch
ufs-convert-to-new-aops-fix2.patch
sysv-convert-to-new-aops-fix2.patch
minix-convert-to-new-aops-fix2.patch
affs-convert-to-new-aops-fix-fix.patch
memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
update-n_high_memory-node-state-for-memory-hotadd.patch
slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
slub-do-not-use-page-mapping.patch
slub-move-page-offset-to-kmem_cache_cpu-offset.patch
slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
slub-optimize-cacheline-use-for-zeroing.patch

But due to the unreliable nature of the bug, I can't be to sure about that.

Next version is compiled, now again switching the PC off for an hour...

Torsten

2007-09-30 17:41:38

by Tejun Heo

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

Torsten Kaiser wrote:
> That boot ended in a minimal initrd environment that normally only
> starts the RAID5 and then opens contained encrypted real root.
> I was just able to push the output from dmesg through the serial link,
> but had no man pages to tell me about -s ...
> And that kind of error was until now a one-of-a-kind one. All other
> errors where not "internal error" but "timeout".
> But one time I had another SGT related error:
> Sep 11 19:19:24 treogen [ 33.340000] ata1.00: exception Emask 0x20
> SAct 0x1 SErr 0x0 action 0x2
> Sep 11 19:19:24 treogen [ 33.340000] ata1.00: irq_stat 0x00020002,
> PCI master abort while fetching SGT
> Sep 11 19:19:24 treogen [ 33.340000] ata1.00: cmd
> 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
> Sep 11 19:19:24 treogen [ 33.340000] res
> 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)
>
> What I find kind of interessing is, that while I got three different
> error codes the cmd part of the output was always the same.

That's NCQ write command. You'll be using it a lot if you're rebuilding
md5. It seems like something is going wrong with request DMA or sg
mapping. Maybe some change in block/*.[hc]?

> It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
> Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
> narrowing on rc4-mm1 because that was the first version to break.
>
> I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:

Have you tested 2.6.23-rc4 without mm patches? It could be something
introduced between -rc3 and 4.

> [the 2.6.23-rc4-mm1 series-file has 2013 lines]
> Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
> Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
> (line779): 2 good boots
> Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
> 1045): 1 failed
> Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed
>
> Next try: up to patch fs-remove-some-aop_truncated_page.patch
>
> That means from the patches added to the rc4 variant of the mm-kernel
> the following are remaining:
> git-xfs-build-fix.patch
> enforce-noreplace-smp-in-alternative_instructions.patch
> paravirt-fix-preemptible-lazy-mode-bug.patch
> i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
> fix-the-max-path-calculation-in-radix-treec-update.patch
> mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
> introduce-write_begin-write_end-aops-fix2.patch
> implement-simple-fs-aops-fix.patch
> ext2-convert-to-new-aops-fix2.patch
> ext3-convert-to-new-aops-fix-fix.patch
> ext4-convert-to-new-aops-fix-fix.patch
> gfs2-convert-to-new-aops-fix.patch
> reiserfs-convert-to-new-aops-fix2.patch
> hostfs-convert-to-new-aops-fix-fix.patch
> ufs-convert-to-new-aops-fix2.patch
> sysv-convert-to-new-aops-fix2.patch
> minix-convert-to-new-aops-fix2.patch
> affs-convert-to-new-aops-fix-fix.patch
> memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
> memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
> memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
> update-n_high_memory-node-state-for-memory-hotadd.patch
> slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
> slub-do-not-use-page-mapping.patch
> slub-move-page-offset-to-kmem_cache_cpu-offset.patch
> slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
> slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
> slub-optimize-cacheline-use-for-zeroing.patch
>
> But due to the unreliable nature of the bug, I can't be to sure about that.

Yeah, that's what I'm worried about. Bisection is extremely difficult
if errors are intermittent and takes long time to reproduce.

> Next version is compiled, now again switching the PC off for an hour...

Thanks a lot. Much appreciated.

--
tejun

2007-09-30 18:39:54

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/30/07, Tejun Heo <[email protected]> wrote:
> Torsten Kaiser wrote:
> > What I find kind of interessing is, that while I got three different
> > error codes the cmd part of the output was always the same.
>
> That's NCQ write command. You'll be using it a lot if you're rebuilding
> md5.

It's not rebuilding the RAID at that point.
If one drive fails, I reboot into a "safe" kernel, fix the RAID and
that way try the next boot with a clean RAID again.

The error happens when the RAID is initialized, might be the first
write into the superblock to mark it dirty/inuse that triggers the
error.

> It seems like something is going wrong with request DMA or sg
> mapping. Maybe some change in block/*.[hc]?

The sg-chaining-patch stands out, but I have no conclusive proof that
it really is the cause.
As noted in this thread, a long time I thought that rc7 with the
sg-chaining-patch was safe, but one time it also showed the error.

> > It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
> > Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
> > narrowing on rc4-mm1 because that was the first version to break.
> >
> > I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:
>
> Have you tested 2.6.23-rc4 without mm patches? It could be something
> introduced between -rc3 and 4.

Not directly, but I have 4 good boots with one part of the mm-patches.
So I would tend to say that mainline 2.6.23-rc4 does not have this
bug.

> > [the 2.6.23-rc4-mm1 series-file has 2013 lines]
> > Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
> > Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
> > (line779): 2 good boots
> > Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
> > 1045): 1 failed
> > Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed
> >
> > Next try: up to patch fs-remove-some-aop_truncated_page.patch

Looks more like this is OK too.

> > That means from the patches added to the rc4 variant of the mm-kernel
> > the following are remaining:
[snip]
> > memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
> > memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
> > memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
> > update-n_high_memory-node-state-for-memory-hotadd.patch
> > slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
> > slub-do-not-use-page-mapping.patch
> > slub-move-page-offset-to-kmem_cache_cpu-offset.patch
> > slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
> > slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
> > slub-optimize-cacheline-use-for-zeroing.patch
> >
> > But due to the unreliable nature of the bug, I can't be to sure about that.
>
> Yeah, that's what I'm worried about. Bisection is extremely difficult
> if errors are intermittent and takes long time to reproduce.

Yes...

As for the remaining patches:
memoryless-nodes-*
Don't think so: I do have a NUMA system, but both nodes have memory.
flush-cache-before-*
Don't think so: No ia64 system, unchanged from rc3
# grouping pages by mobility patches
... no idee, but seem unchanged
maps2.*
Don't think that related...
remaining slub-* patches
Might be...

As for you printk:
>From two goot boots, I had not had any failures with it:
First one:
Sep 30 19:24:53 treogen [ 3.810000] XXX sil24 cb=ffff810037ef0000
cb_dma=37ef0000
Sep 30 19:24:53 treogen [ 3.820000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
Second:
Sep 30 20:06:22 treogen [ 3.820000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
Sep 30 20:06:22 treogen [ 3.830000] XXX sil24 cb=ffff810037f10000
cb_dma=37f10000

Torsten

2007-10-01 18:00:26

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/30/07, Torsten Kaiser <[email protected]> wrote:
> On 9/30/07, Tejun Heo <[email protected]> wrote:
> > Torsten Kaiser wrote:
> > > What I find kind of interessing is, that while I got three different
> > > error codes the cmd part of the output was always the same.
> >
> > That's NCQ write command. You'll be using it a lot if you're rebuilding
> > md5.
>
> It's not rebuilding the RAID at that point.
> If one drive fails, I reboot into a "safe" kernel, fix the RAID and
> that way try the next boot with a clean RAID again.
>
> The error happens when the RAID is initialized, might be the first
> write into the superblock to mark it dirty/inuse that triggers the
> error.

To make that comment "cmd part of the output was always the same" more
clear: I did not only mean that the first (cmd-)byte was the same, but
the whole line:
cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out

I did find ATA_CMD_FPDMA_WRITE = 0x61in ata.h, but I don't find a good
hint for the 08 that follows. The interpretation XFER_PIO_0 = 0x08
seems wrong...

> > It seems like something is going wrong with request DMA or sg
> > mapping. Maybe some change in block/*.[hc]?
>
> The sg-chaining-patch stands out, but I have no conclusive proof that
> it really is the cause.
> As noted in this thread, a long time I thought that rc7 with the
> sg-chaining-patch was safe, but one time it also showed the error.

If I look at what patches remain, it seems that some other (earlier)
patch that is new in 2.6.23-rc4-mm1 is the trigger, but it will only
fail together with a second patch.

> > > [the 2.6.23-rc4-mm1 series-file has 2013 lines]
> > > Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
> > > Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
> > > (line779): 2 good boots

Up to (incl.) fs-remove-some-aop_truncated_page.patch (line 956): 2 good boots
Up to (incl.) update-n_high_memory-node-state-for-memory-hotadd.patch (line978)
1 good boot, so I would tend to say that this patches are also OK, but
I will recheck this version.

Up to (incl.) maps2-make-proc-pid-smaps-optional-under-config_embeddedpatch-fix.patch
(line 1038): 1 failed

> > > Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
> > > 1045): 1 failed
> > > Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed

Updated list of remaining, new patches in rc4:

> > > That means from the patches added to the rc4 variant of the mm-kernel
> > > the following are remaining:
> [snip]
> > > memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
> > > memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
> > > memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
> > > update-n_high_memory-node-state-for-memory-hotadd.patch

And I think these are OK too.
So I think some earlier patch triggers bug together with one of these patches:

categorize-gfp-flags.patch
categorize-gfp-flags-fix.patch
make-swappiness-safer-to-use.patch
... these should not change any behaivor from there description

> flush-cache-before-*
> Don't think so: No ia64 system, unchanged from rc3

> # grouping pages by mobility patches
> ... no idee, but seem unchanged

> maps2.*
> Don't think that related...

Will continue to break this down...

> As for you printk:
> From two goot boots, I had not had any failures with it:
> First one:
> Sep 30 19:24:53 treogen [ 3.810000] XXX sil24 cb=ffff810037ef0000
> cb_dma=37ef0000
> Sep 30 19:24:53 treogen [ 3.820000] XXX sil24 cb=ffff810037f00000
> cb_dma=37f00000
> Second:
> Sep 30 20:06:22 treogen [ 3.820000] XXX sil24 cb=ffff810037f00000
> cb_dma=37f00000
> Sep 30 20:06:22 treogen [ 3.830000] XXX sil24 cb=ffff810037f10000
> cb_dma=37f10000

Next boot with this kernel was also good:
Oct 1 06:52:32 treogen [ 3.740000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
Oct 1 06:52:32 treogen [ 3.750000] XXX sil24 cb=ffff810037f10000
cb_dma=37f10000

Next kernel (maps2-make-proc-pid-smaps-optional-under-config_embeddedpatch-fix.patch)
Oct 1 18:07:09 treogen [ 3.760000] XXX sil24 cb=ffff810005df0000
cb_dma=5df0000
Oct 1 18:07:09 treogen [ 3.770000] XXX sil24 cb=ffff810005e00000
cb_dma=5e00000
-> failed, but when rebooting instead of a cold boot that kernel worked:
Oct 1 18:14:42 treogen [ 3.800000] XXX sil24 cb=ffff810005df0000
cb_dma=5df0000
Oct 1 18:14:42 treogen [ 3.810000] XXX sil24 cb=ffff810005e00000
cb_dma=5e00000
... same values, so thats not an indicator that trouble will follow.

Next kernel (update-n_high_memory-node-state-for-memory-hotadd.patch)
Oct 1 19:31:38 treogen [ 3.840000] XXX sil24 cb=ffff81007fa60000
cb_dma=7fa60000
Oct 1 19:31:38 treogen [ 3.850000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
that was also a good boot.

Torsten

2007-10-03 15:21:25

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/1/07, Torsten Kaiser <[email protected]> wrote:
> On 9/30/07, Torsten Kaiser <[email protected]> wrote:
> > On 9/30/07, Tejun Heo <[email protected]> wrote:
> > > Torsten Kaiser wrote:
> To make that comment "cmd part of the output was always the same" more
> clear: I did not only mean that the first (cmd-)byte was the same, but
> the whole line:
> cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
>
> I did find ATA_CMD_FPDMA_WRITE = 0x61in ata.h, but I don't find a good
> hint for the 08 that follows. The interpretation XFER_PIO_0 = 0x08
> seems wrong...

Even all of the last bisect steps still showed the same line...

> If I look at what patches remain, it seems that some other (earlier)
> patch that is new in 2.6.23-rc4-mm1 is the trigger, but it will only
> fail together with a second patch.

I'm now finished with bisecting, still 2 patches, but I don't want to
spend another two hours waiting...

And the winners are: (from broken-out patchset of 2.6.23-rc4-mm1)
maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
maps2-move-clear_refs-code-to-task_mmuc.patch

Before these patches I have never seen the bug, with these I only got
two good boots when trying to recreate the problem. But even the
kernels that did one good boot failed on the second try.

> > As for you printk:
>From the last testkernel (up to incl.
maps2-move-clear_refs-code-to-task_mmuc.patch)

Good:
Oct 3 15:50:39 treogen [ 3.780000] XXX sil24 cb=ffff810005dd0000
cb_dma=5dd0000
Oct 3 15:50:39 treogen [ 3.790000] XXX sil24 cb=ffff810005de0000
cb_dma=5de0000

Bad:
Oct 3 16:56:46 treogen [ 3.750000] XXX sil24 cb=ffff810005de0000
cb_dma=5de0000
Oct 3 16:56:46 treogen [ 3.760000] XXX sil24 cb=ffff810005df0000
cb_dma=5df0000

I have *NO* clue how these procfs changes can kill the sata_sil24
driver, but all bisect steps point to these two patches as cause or
trigger of the bug.

Torsten

2007-10-03 15:55:25

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

[CC added to author of the bad patch]

Short recap:
Since 2.6.23-rc4-mm1 all mm-kernel randomly fail one of two drives on
my Silicon Image 3132. This failure happens when my initramfs wants to
start the RAID that is on these drives.

The first error libata throws is:
Oct 3 16:56:46 treogen [ 63.320000] ata2.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x6 frozen
Oct 3 16:56:46 treogen [ 63.320000] ata2.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Oct 3 16:56:46 treogen [ 63.320000] res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 3 16:56:46 treogen [ 63.320000] ata2.00: status: {DRDY }

Resetting the sata link fails, the drive is no longer reachable until a reboot.

I then bisected the mm-patches from 2.6.23-rc4-mm1 with the following result:

On 10/3/07, Torsten Kaiser <[email protected]> wrote:
> I'm now finished with bisecting, still 2 patches, but I don't want to
> spend another two hours waiting...
>
> And the winners are: (from broken-out patchset of 2.6.23-rc4-mm1)
> maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
> maps2-move-clear_refs-code-to-task_mmuc.patch
>
> Before these patches I have never seen the bug, with these I only got
> two good boots when trying to recreate the problem. But even the
> kernels that did one good boot failed on the second try.

The simplify-patch just seems to move some code around, but I see a
real change in the other one:
This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
its code to a new function. But during the move the main for-loop from
clear_refs_smap was changed:

old:
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);

new:
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma))
walk_page_range(mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);

The walk_page_range() is no longer called on vma->vm_mm, but on mm directly.
I don't know how this can kill the sata_sil24-driver, but at least it
looks suspicious.
As I'm not really a kernel hacker, I defer this question to the ones that are.

Torsten

2007-10-03 16:38:40

by Matt Mackall

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On Wed, Oct 03, 2007 at 05:55:10PM +0200, Torsten Kaiser wrote:
> [CC added to author of the bad patch]
>
> Short recap:
> Since 2.6.23-rc4-mm1 all mm-kernel randomly fail one of two drives on
> my Silicon Image 3132. This failure happens when my initramfs wants to
> start the RAID that is on these drives.
>
> The first error libata throws is:
> Oct 3 16:56:46 treogen [ 63.320000] ata2.00: exception Emask 0x0
> SAct 0x1 SErr 0x0 action 0x6 frozen
> Oct 3 16:56:46 treogen [ 63.320000] ata2.00: cmd
> 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
> Oct 3 16:56:46 treogen [ 63.320000] res
> 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> Oct 3 16:56:46 treogen [ 63.320000] ata2.00: status: {DRDY }
>
> Resetting the sata link fails, the drive is no longer reachable until a reboot.
>
> I then bisected the mm-patches from 2.6.23-rc4-mm1 with the following result:
>
> On 10/3/07, Torsten Kaiser <[email protected]> wrote:
> > I'm now finished with bisecting, still 2 patches, but I don't want to
> > spend another two hours waiting...
> >
> > And the winners are: (from broken-out patchset of 2.6.23-rc4-mm1)
> > maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
> > maps2-move-clear_refs-code-to-task_mmuc.patch
> >
> > Before these patches I have never seen the bug, with these I only got
> > two good boots when trying to recreate the problem. But even the
> > kernels that did one good boot failed on the second try.
>
> The simplify-patch just seems to move some code around, but I see a
> real change in the other one:
> This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
> its code to a new function. But during the move the main for-loop from
> clear_refs_smap was changed:
>
> old:
> for (vma = mm->mmap; vma; vma = vma->vm_next)
> if (vma->vm_mm && !is_vm_hugetlb_page(vma))
> walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
> &clear_refs_walk, vma);
>
> new:
> for (vma = mm->mmap; vma; vma = vma->vm_next)
> if (!is_vm_hugetlb_page(vma))
> walk_page_range(mm, vma->vm_start, vma->vm_end,
> &clear_refs_walk, vma);
>
> The walk_page_range() is no longer called on vma->vm_mm, but on mm directly.
> I don't know how this can kill the sata_sil24-driver, but at least it
> looks suspicious.

That code should be fine. Further, it's pretty unlikely that this code
ever gets invoked. This whole interface was only recently added by
Google folks and its usage is pretty obscure.

Oh wait - you're _at_ Google, aren't you? Perhaps you're actually
using clear_refs.

Well I can see no reason why the vma we just got to by the mm->mmap
would have a vm_mm != mm, but I've certainly been wrong before.

Try changing it to:

for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma)) {
if (vma->vm_mm != mm)
printk("WTF: vma->vm_mm %p mm %p\n",
vma->vm_mm, mm);
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
}

--
Mathematics is the supreme nostalgia of our time.

2007-10-03 17:37:09

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/3/07, Matt Mackall <[email protected]> wrote:
> On Wed, Oct 03, 2007 at 05:55:10PM +0200, Torsten Kaiser wrote:
> > This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
> > its code to a new function. But during the move the main for-loop from
> > clear_refs_smap was changed:
> >
> > old:
> > for (vma = mm->mmap; vma; vma = vma->vm_next)
> > if (vma->vm_mm && !is_vm_hugetlb_page(vma))
> > walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
> > &clear_refs_walk, vma);
> >
> > new:
> > for (vma = mm->mmap; vma; vma = vma->vm_next)
> > if (!is_vm_hugetlb_page(vma))
> > walk_page_range(mm, vma->vm_start, vma->vm_end,
> > &clear_refs_walk, vma);
> >
> > The walk_page_range() is no longer called on vma->vm_mm, but on mm directly.
> > I don't know how this can kill the sata_sil24-driver, but at least it
> > looks suspicious.
>
> That code should be fine. Further, it's pretty unlikely that this code
> ever gets invoked. This whole interface was only recently added by
> Google folks and its usage is pretty obscure.
>
> Oh wait - you're _at_ Google, aren't you? Perhaps you're actually
> using clear_refs.

No. I'm just using Google Mail.

And I'm not using any additional patches apart from what is in the mm-patchset.

> Well I can see no reason why the vma we just got to by the mm->mmap
> would have a vm_mm != mm, but I've certainly been wrong before.
>
> Try changing it to:
>
> for (vma = mm->mmap; vma; vma = vma->vm_next)
> if (!is_vm_hugetlb_page(vma)) {
> if (vma->vm_mm != mm)
> printk("WTF: vma->vm_mm %p mm %p\n",
> vma->vm_mm, mm);
> walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
> &clear_refs_walk, vma);
> }

Will try this, but as the error only seems to trigger, if I leave the
pc powered off for more then a hour, you will need to wait until
tomorrow.

Of note might be, that at the time of this error init has not been
started. I'm using a program from initramfs to start the RAID.
The initramfs was primarily build using the genkernel package from
gentoo, but I have removed some parts of it (kernel modules and some
other part I no longer remember)

Torsten

2007-10-03 17:51:39

by Matt Mackall

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On Wed, Oct 03, 2007 at 07:36:55PM +0200, Torsten Kaiser wrote:
> On 10/3/07, Matt Mackall <[email protected]> wrote:
> > On Wed, Oct 03, 2007 at 05:55:10PM +0200, Torsten Kaiser wrote:
> > > This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
> > > its code to a new function. But during the move the main for-loop from
> > > clear_refs_smap was changed:
> > >
> > > old:
> > > for (vma = mm->mmap; vma; vma = vma->vm_next)
> > > if (vma->vm_mm && !is_vm_hugetlb_page(vma))
> > > walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
> > > &clear_refs_walk, vma);
> > >
> > > new:
> > > for (vma = mm->mmap; vma; vma = vma->vm_next)
> > > if (!is_vm_hugetlb_page(vma))
> > > walk_page_range(mm, vma->vm_start, vma->vm_end,
> > > &clear_refs_walk, vma);
> > >
> > > The walk_page_range() is no longer called on vma->vm_mm, but on mm directly.
> > > I don't know how this can kill the sata_sil24-driver, but at least it
> > > looks suspicious.
> >
> > That code should be fine. Further, it's pretty unlikely that this code
> > ever gets invoked. This whole interface was only recently added by
> > Google folks and its usage is pretty obscure.
> >
> > Oh wait - you're _at_ Google, aren't you? Perhaps you're actually
> > using clear_refs.
>
> No. I'm just using Google Mail.

Ahh, I see. Wasn't aware of the trademark dispute..

> Of note might be, that at the time of this error init has not been
> started. I'm using a program from initramfs to start the RAID.
> The initramfs was primarily build using the genkernel package from
> gentoo, but I have removed some parts of it (kernel modules and some
> other part I no longer remember)

Then I really don't see how these patches can have anything to do with
your problem. They only affect some currently obscure files in /proc
and it's unlikely to have any impact at init time. It's more likely
that there's some code alignment Heisenbug or you got a false negative
somewhere in your bisection.

--
Mathematics is the supreme nostalgia of our time.

2007-10-03 18:07:04

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/3/07, Matt Mackall <[email protected]> wrote:
> On Wed, Oct 03, 2007 at 07:36:55PM +0200, Torsten Kaiser wrote:
> > Of note might be, that at the time of this error init has not been
> > started. I'm using a program from initramfs to start the RAID.
> > The initramfs was primarily build using the genkernel package from
> > gentoo, but I have removed some parts of it (kernel modules and some
> > other part I no longer remember)
>
> Then I really don't see how these patches can have anything to do with
> your problem. They only affect some currently obscure files in /proc
> and it's unlikely to have any impact at init time. It's more likely
> that there's some code alignment Heisenbug or you got a false negative
> somewhere in your bisection.

Heisenbug via code alignment might be, but the bisection result seemed
rather reliable.
I have 4 good boots with all mm-patches before the maps2-* ones and 2
out of 3 boots failed with part or all maps2-* patches applied. (Any
many more failed boots with more mm-patches applied.)

But I do think it might be very possible that some (maybe even
correct) change in the behavior of this code triggers the error. So
I'm very interested in what you additional printk outputs to track
down what changed and why this breaks.

I looked at the scripts from the initramfs, but sadly I don't see
anything that crawls throug /proc and might hit your files. Other than
that there is nearly only busybox in there. cryptsetup should run
later and modprobe does not have any modules it could try to load...

Torsten

2007-10-04 05:33:07

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/3/07, Matt Mackall <[email protected]> wrote:
> Well I can see no reason why the vma we just got to by the mm->mmap
> would have a vm_mm != mm, but I've certainly been wrong before.
>
> Try changing it to:
>
> for (vma = mm->mmap; vma; vma = vma->vm_next)
> if (!is_vm_hugetlb_page(vma)) {
> if (vma->vm_mm != mm)
> printk("WTF: vma->vm_mm %p mm %p\n",
> vma->vm_mm, mm);
> walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
> &clear_refs_walk, vma);
> }

You were right.
I was able to trigger the error with above printk added, but nothing
was written to the syslog.

So now I'm rather out of ideas what to test... :(

Torsten

2007-10-04 23:33:06

by Matt Mackall

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On Thu, Oct 04, 2007 at 07:32:52AM +0200, Torsten Kaiser wrote:
> On 10/3/07, Matt Mackall <[email protected]> wrote:
> > Well I can see no reason why the vma we just got to by the mm->mmap
> > would have a vm_mm != mm, but I've certainly been wrong before.
> >
> > Try changing it to:
> >
> > for (vma = mm->mmap; vma; vma = vma->vm_next)
> > if (!is_vm_hugetlb_page(vma)) {
> > if (vma->vm_mm != mm)
> > printk("WTF: vma->vm_mm %p mm %p\n",
> > vma->vm_mm, mm);
> > walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
> > &clear_refs_walk, vma);
> > }
>
> You were right.
> I was able to trigger the error with above printk added, but nothing
> was written to the syslog.
>
> So now I'm rather out of ideas what to test... :(

I'd give your previous bisect step another try.

Looking back at the thread a bit, anything that requires the machine
to be off for more than a couple seconds to manifest stops looking
like software and firmware and starts looking like a heat-related
electrical or mechanical issue. Make sure your backups are current.

--
Mathematics is the supreme nostalgia of our time.

2007-10-05 06:06:22

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/4/07, Matt Mackall <[email protected]> wrote:
> On Thu, Oct 04, 2007 at 07:32:52AM +0200, Torsten Kaiser wrote:
> > So now I'm rather out of ideas what to test... :(
>
> I'd give your previous bisect step another try.

Yes, I thought about that too. But I never seemed to need more than
two tries to make it fail.
So I would only suspect the last good step as wrong positive.
That would then point to the first of your maps2-patches, the moving
of the pagewalker code.
Would you thing that this is a plausible cause?

> Looking back at the thread a bit, anything that requires the machine
> to be off for more than a couple seconds to manifest stops looking
> like software and firmware and starts looking like a heat-related
> electrical or mechanical issue. Make sure your backups are current.

What backups? :-)

Yes, I also thought about hardware trouble, but the bisect result
seemed to consistent.
Also that its not always the same drive that fails, only every time
one of the sil-drives.

I now have activated ATA_DEBUG to see if the good and the bad boots differ.
It looks the same until the RAID5 starts.

Good boot:
[ 40.160000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.160000] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.160000] ata_scsi_dump_cdb: CDB (2:0,0,0) 2a 00 25 42 d6 09 00 00 08
[ 40.160000] ata_sg_setup: 1 sg elements mapped
[ 40.160000] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
[ 40.160000] ata_sg_setup: 1 sg elements mapped
[ 40.160000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.160000] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.320000] nv_swncq_host_interrupt: id 0x3 SWNCQ: qc_active 0x1
dhfis 0x1 dmafis 0x1 sactive 0x0
[ 40.320000] nv_swncq_sdbfis: over
[ 40.320000] ata_scsi_dump_cdb: CDB (3:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.320000] ata_exec_command: ata3: cmd 0xEA
[ 40.390000] ata_hsm_move: ata3: protocol 1 task_state 3 (dev_stat 0x40)
[ 40.390000] ata_hsm_move: ata3: dev 0 command complete, drv_stat 0x40
[ 40.420000] md: considering sdb1 ...
[ 40.440000] md: adding sdb1 ...
[ 40.440000] md: adding sda1 ...
[ 40.450000] md: created md0
[ 40.460000] md: bind<sda1>
[ 40.470000] md: bind<sdb1>
[ 40.480000] md: running: <sdb1><sda1>
[ 40.500000] raid1: raid set md0 active with 2 out of 2 mirrors

Bad boot:
[ 40.060000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.060000] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.060000] ata_scsi_dump_cdb: CDB (2:0,0,0) 2a 00 25 42 d6 09 00 00 08
[ 40.060000] ata_sg_setup: 1 sg elements mapped
[ 40.060000] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
[ 40.060000] ata_sg_setup: 1 sg elements mapped
[ 40.060000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.200000] nv_swncq_host_interrupt: id 0x3 SWNCQ: qc_active 0x1
dhfis 0x1 dmafis 0x1 sactive 0x0
[ 40.200000] nv_swncq_sdbfis: over
[ 40.200000] ata_scsi_dump_cdb: CDB (3:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.200000] ata_exec_command: ata3: cmd 0xEA
[ 40.270000] ata_hsm_move: ata3: protocol 1 task_state 3 (dev_stat 0x40)
[ 40.270000] ata_hsm_move: ata3: dev 0 command complete, drv_stat 0x40
[ 70.060000] ata_scsi_timed_out: ENTER
[ 70.060000] ata_scsi_timed_out: EXIT, ret=0
[ 70.080000] ata_scsi_error: ENTER
[ 70.080000] ata_port_flush_task: ENTER
[ 70.100000] ata1: ata_port_flush_task: EXIT
[ 70.110000] __ata_port_freeze: ata1 port frozen
[ 70.220000] __ata_port_freeze: ata1 port frozen
[ 70.230000] ata_eh_link_autopsy: ENTER
[ 70.240000] ata_eh_link_autopsy: EXIT
[ 70.250000] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[ 70.270000] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[ 70.270000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask
0x4 (timeout)

After [ 40.060000] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
the drive sda falls of the earth and can't be recovered through soft-
or hard-resetting the port by the error handler.

So I will use the weekend to see if I can find out who issues this
command and add more debug to that place...

Torsten

2007-10-07 08:44:35

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/5/07, Torsten Kaiser <[email protected]> wrote:
> So I will use the weekend to see if I can find out who issues this
> command and add more debug to that place...

I added some DPRINTK to sil24_qc_issue and sil24_fill_sg, but I only
found one suspicious thing.

My sil24_fill_sg now looks like this:
static inline void sil24_fill_sg(struct ata_queued_cmd *qc,
struct sil24_sge *sge)
{
struct scatterlist *sg;

ata_for_each_sg(sg, qc) {
sge->addr = cpu_to_le64(sg_dma_address(sg));
sge->cnt = cpu_to_le32(sg_dma_len(sg));
if (ata_sg_is_last(sg, qc))
sge->flags = cpu_to_le32(SGE_TRM);
else
sge->flags = 0;
DPRINTK("flags,addr,cnt = 0x%x, 0x%X, 0x%X\n", sge->flags,
sge->addr, sge->cnt);
sge++;
}
}

Suspicious is, that *all* output from this DPRINTK shows flags as 0x0,
so the last sg is never terminated (SGE_TRM is 1<<31)?
But if that is the cause, how is this working at all? Or am I doing
something stupid?

Timing and outputs from five boots:
good: bad:
more moreboot more
3->35 3->35 3->35 3->35 3->35
3->2a 2->35 2->35 3->2a 3->2a
3->setup 2->2a 2->2a 3->setup 3->setup
2->35 2->35 2->35 2->35 2->35
1->35 3->2a 3->2a 1->35 1->35
2->2a 3->setup 3->setup 2->2a 2->2a
1->2a 1->35 1->35 1->2a 1->2a
2->35 1->2a 1->2a 2->35
1->35 1->35 1->35 1->35
3->int 3->int 3->int 3->int 3->int
3->35 3->35 3->35 3->35 3->35
1->5DF/1439C 1->5DC/1439C 1->5DE/1439C
2->5E0/143BC 2->5DE/143BC 2->5DF/143BC
sg:170E sg:1AAB sg:1A60
XXX:
5DD 5DF 5DC 5DF 5DE
5E0 5E0 5DE 5E0 5DF

The first three columns where working tries, the last two failed one drive.
column 1: ATA_DEBUG added, reboot
column 2: +my additions, reboot
column 3: +my additions, cold boot, wanted to make it fail, but worked
column 4: ATA_DEBUG added, cold boot
column 5: +my additions, cold boot
[x]->[y]: x is the ata-port, 1+2 on the sata_sil24, 3 on sata_nv with swncq
y:35 -> SYNCHRONIZE_CACHE commands that where send to the drive
y:2a -> WRITE_10 commands that where send to the drive
y:setup -> Debug from swncq: nv_swncq_dmafis: dma setup tag 0x0
y:int -> Debug from swncq: nv_swncq_host_interrupt: id 0x3 SWNCQ:
qc_active 0x1 ...

The lines before the XXX:
x->a/b: x is the ata-port, a the paddr from sil24_qc_issue, b the
activate from sil24_qc_issue
All outputs from sil24_qc_issue where identical in each boot sequence,
only differed from run to run.
sg:a: a is the sge->addr from sil24_fill_sg

The lines after the XXX:
This are the addresses that the XXX-printk from sil24_port_start prints.

I hope I explained enough what above table should mean.
This hole sequence (two syncs and one write to each drive) happens
between the output:
[ 40.300000] md1: bitmap initialized from disk: read 10/10 pages, set 87 bits
[ 40.320000] created bitmap (145 pages) for device md1
and the error on a bad boot:
[ 70.680000] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[ 70.700000] ata2.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
or if on a good boot:
[ 40.910000] md: considering sdb1 ...
(sdb1 is part of another raid)

(If someone whats to complete bootlogs, just ask)

So now I have two questions:
1) What happens in sil24_fill_sg with SGE_TRM?
2) If that is ok, should I try to add debug to sil24_error_intr and/or
sil24_host_intr?

Torsten

2007-10-07 14:39:26

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

[Adding Jens Axboe, the author of what looks like the probable cause]
On 10/7/07, Torsten Kaiser <[email protected]> wrote:
> My sil24_fill_sg now looks like this:
> static inline void sil24_fill_sg(struct ata_queued_cmd *qc,
> struct sil24_sge *sge)
> {
> struct scatterlist *sg;
>
> ata_for_each_sg(sg, qc) {
> sge->addr = cpu_to_le64(sg_dma_address(sg));
> sge->cnt = cpu_to_le32(sg_dma_len(sg));
> if (ata_sg_is_last(sg, qc))
> sge->flags = cpu_to_le32(SGE_TRM);
> else
> sge->flags = 0;
> DPRINTK("flags,addr,cnt = 0x%x, 0x%X, 0x%X\n", sge->flags,
> sge->addr, sge->cnt);
> sge++;
> }
> }
>
> Suspicious is, that *all* output from this DPRINTK shows flags as 0x0,
> so the last sg is never terminated (SGE_TRM is 1<<31)?
> But if that is the cause, how is this working at all? Or am I doing
> something stupid?

Looking closer at
http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
the change to libata.h seems bogus :

in ata_qc_first_sg:
old new
return qc->__sg return qc->__sg
qc->__sg - qc->__sg == 0 qc->n_iter=0
-> sg - qc->__sg corresponds to qc->n_iter

in ata_qc_next_sg:
sg++; sg_next(sg); qc->n_iter++;
sg - qc->__sg < qc->n_elem qc->n_iter < qc->nelem
-> sg - qc->__sg corresponds to qc->n_iter

but in ata_sg_is_last:
(sg - qc->__sg) +1 == qc->n_elem qc->n_iter == qc->n_elem
if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
qc->n_iter+1 == qc->n_elem?

That missing +1 would explain, why the SGE_TRM never gets set.

And it would fit the symptoms, that the boot would fail at random. If
the "correct" garbage was in place to where the sglist runs off it
hangs the drive.
And that would even fit the two different errors that I only got one time each:
* a completely illegal access (PCI master abort while fetching SGT)
* wrong alignment of the SGT (SGT no on qword boundary)
At that that times the garbage seemed to point invalid addresses.

But I'm still not understanding, how the kernel could only fail
sometimes at bootup, but after that working without any visible
errors? Is the sil-chip rather intelligent about detecting corrupted
sglists and silently ignoring them?

Torsten

2007-10-11 03:26:16

by Tejun Heo

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

Torsten Kaiser wrote:
> Looking closer at
> http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
> the change to libata.h seems bogus :
>
> in ata_qc_first_sg:
> old new
> return qc->__sg return qc->__sg
> qc->__sg - qc->__sg == 0 qc->n_iter=0
> -> sg - qc->__sg corresponds to qc->n_iter
>
> in ata_qc_next_sg:
> sg++; sg_next(sg); qc->n_iter++;
> sg - qc->__sg < qc->n_elem qc->n_iter < qc->nelem
> -> sg - qc->__sg corresponds to qc->n_iter
>
> but in ata_sg_is_last:
> (sg - qc->__sg) +1 == qc->n_elem qc->n_iter == qc->n_elem
> if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
> qc->n_iter+1 == qc->n_elem?
>
> That missing +1 would explain, why the SGE_TRM never gets set.

Thanks a lot for tracking this down. Does changing the above code fix
your problem?

Jens, Torsten's analysis looks correct && depending on qc state (n_iter)
during iteration doesn't look like a good idea. Those iterators are not
supposed to have side effects. Would it be difficult to implement
sg_last() test?

> And it would fit the symptoms, that the boot would fail at random. If
> the "correct" garbage was in place to where the sglist runs off it
> hangs the drive.
> And that would even fit the two different errors that I only got one time each:
> * a completely illegal access (PCI master abort while fetching SGT)
> * wrong alignment of the SGT (SGT no on qword boundary)
> At that that times the garbage seemed to point invalid addresses.
>
> But I'm still not understanding, how the kernel could only fail
> sometimes at bootup, but after that working without any visible
> errors? Is the sil-chip rather intelligent about detecting corrupted
> sglists and silently ignoring them?

I have no idea why it fails only sometimes.

--
tejun

2007-10-11 05:55:08

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/11/07, Tejun Heo <[email protected]> wrote:
> Torsten Kaiser wrote:
> > Looking closer at
> > http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
> > the change to libata.h seems bogus :
> >
> > in ata_qc_first_sg:
> > old new
> > return qc->__sg return qc->__sg
> > qc->__sg - qc->__sg == 0 qc->n_iter=0
> > -> sg - qc->__sg corresponds to qc->n_iter
> >
> > in ata_qc_next_sg:
> > sg++; sg_next(sg); qc->n_iter++;
> > sg - qc->__sg < qc->n_elem qc->n_iter < qc->nelem
> > -> sg - qc->__sg corresponds to qc->n_iter
> >
> > but in ata_sg_is_last:
> > (sg - qc->__sg) +1 == qc->n_elem qc->n_iter == qc->n_elem
> > if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
> > qc->n_iter+1 == qc->n_elem?
> >
> > That missing +1 would explain, why the SGE_TRM never gets set.
>
> Thanks a lot for tracking this down. Does changing the above code fix
> your problem?

I did not try it.
I'm not an libata expert and while this change looks suspicios, I
can't be 100% sure if that change was intended.
And I did not want to experiment this deep in the code and risk
corrupting the hole drive.

> Jens, Torsten's analysis looks correct && depending on qc state (n_iter)
> during iteration doesn't look like a good idea. Those iterators are not
> supposed to have side effects. Would it be difficult to implement
> sg_last() test?
>
> > And it would fit the symptoms, that the boot would fail at random. If
> > the "correct" garbage was in place to where the sglist runs off it
> > hangs the drive.
> > And that would even fit the two different errors that I only got one time each:
> > * a completely illegal access (PCI master abort while fetching SGT)
> > * wrong alignment of the SGT (SGT no on qword boundary)
> > At that that times the garbage seemed to point invalid addresses.
> >
> > But I'm still not understanding, how the kernel could only fail
> > sometimes at bootup, but after that working without any visible
> > errors? Is the sil-chip rather intelligent about detecting corrupted
> > sglists and silently ignoring them?
>
> I have no idea why it fails only sometimes.

And that is, why I'm so unsure.
The error looks to serious to only cause random failures on one of two
drives on bootup.
I never had trouble with the remaining drive on the SiI-chip or both
drives if one got killed during booting.

I'm guessing that leaving the computer powered down long enough fills
the RAM with a special pattern that really hangs the drive, while
normaly it would just reject the invalid data. (I have ECC-RAM, does
this matter?)

Another guess might be that most of the time the Sil-chip correctly
terminates after the transfer-length is reached, even if SGE_TRM is
missing...

Torsten

2007-10-11 06:26:43

by Tejun Heo

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

Torsten Kaiser wrote:
>>> That missing +1 would explain, why the SGE_TRM never gets set.
>> Thanks a lot for tracking this down. Does changing the above code fix
>> your problem?
>
> I did not try it.
> I'm not an libata expert and while this change looks suspicios, I
> can't be 100% sure if that change was intended.
> And I did not want to experiment this deep in the code and risk
> corrupting the hole drive.

I don't think you would risk too much by changing that bit of code.
Please try it.

>>> But I'm still not understanding, how the kernel could only fail
>>> sometimes at bootup, but after that working without any visible
>>> errors? Is the sil-chip rather intelligent about detecting corrupted
>>> sglists and silently ignoring them?
>> I have no idea why it fails only sometimes.
>
> And that is, why I'm so unsure.
> The error looks to serious to only cause random failures on one of two
> drives on bootup.
> I never had trouble with the remaining drive on the SiI-chip or both
> drives if one got killed during booting.
>
> I'm guessing that leaving the computer powered down long enough fills
> the RAM with a special pattern that really hangs the drive, while
> normaly it would just reject the invalid data. (I have ECC-RAM, does
> this matter?)
>
> Another guess might be that most of the time the Sil-chip correctly
> terminates after the transfer-length is reached, even if SGE_TRM is
> missing...

I have no idea either. We'll probably need a PCI bus tracer to tell
exactly what's going on.

Thanks.

--
tejun

2007-10-11 08:25:45

by Jens Axboe

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On Thu, Oct 11 2007, Tejun Heo wrote:
> Torsten Kaiser wrote:
> > Looking closer at
> > http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
> > the change to libata.h seems bogus :
> >
> > in ata_qc_first_sg:
> > old new
> > return qc->__sg return qc->__sg
> > qc->__sg - qc->__sg == 0 qc->n_iter=0
> > -> sg - qc->__sg corresponds to qc->n_iter
> >
> > in ata_qc_next_sg:
> > sg++; sg_next(sg); qc->n_iter++;
> > sg - qc->__sg < qc->n_elem qc->n_iter < qc->nelem
> > -> sg - qc->__sg corresponds to qc->n_iter
> >
> > but in ata_sg_is_last:
> > (sg - qc->__sg) +1 == qc->n_elem qc->n_iter == qc->n_elem
> > if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
> > qc->n_iter+1 == qc->n_elem?
> >
> > That missing +1 would explain, why the SGE_TRM never gets set.
>
> Thanks a lot for tracking this down. Does changing the above code fix
> your problem?
>
> Jens, Torsten's analysis looks correct && depending on qc state (n_iter)
> during iteration doesn't look like a good idea. Those iterators are not
> supposed to have side effects. Would it be difficult to implement
> sg_last() test?

This is the old ata_sg_is_last:

static inline int
ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
{
if (sg == &qc->pad_sgent)
return 1;
if (qc->pad_len)
return 0;
if (((sg - qc->__sg) + 1) == qc->n_elem)
return 1;
return 0;
}

and the new one:

static inline int
ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
{
if (sg == &qc->pad_sgent)
return 1;
if (qc->pad_len)
return 0;
if (qc->n_iter == qc->n_elem)
return 1;
return 0;
}

->n_iter is how ata_qc_next_sg() walks over the sglist, I don't
understand your reference to why depending on that during iteration
would be bad?

So we could add a test for sg_last() there, but that would turn sg table
iteration into an O(N^2) operation for those drivers that use
ata_sg_is_last() with chained sg tables. I'd much rather just get rid of
ata_sg_is_last(), it's only used to mark end-of-table entries for
hardware. That logic can be performed cheaper.

Torsten, your analysis does look correct. Does it work with this simple
patch?

diff --git a/include/linux/libata.h b/include/linux/libata.h
index 2784163..0152bf4 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -947,7 +947,7 @@ ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
return 1;
if (qc->pad_len)
return 0;
- if (qc->n_iter == qc->n_elem)
+ if ((qc->n_iter + 1) == qc->n_elem)
return 1;
return 0;
}

--
Jens Axboe

2007-10-11 08:37:06

by Tejun Heo

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

Jens Axboe wrote:
> This is the old ata_sg_is_last:
>
> static inline int
> ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
> {
> if (sg == &qc->pad_sgent)
> return 1;
> if (qc->pad_len)
> return 0;
> if (((sg - qc->__sg) + 1) == qc->n_elem)
> return 1;
> return 0;
> }
>
> and the new one:
>
> static inline int
> ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
> {
> if (sg == &qc->pad_sgent)
> return 1;
> if (qc->pad_len)
> return 0;
> if (qc->n_iter == qc->n_elem)
> return 1;
> return 0;
> }
>
> ->n_iter is how ata_qc_next_sg() walks over the sglist, I don't
> understand your reference to why depending on that during iteration
> would be bad?

Because that makes ata_sg iterator macros have hidden side effects
(nothing in the interface suggests it can't be nested and when somebody
actually nests it, finding what went wrong can be pretty difficult). I
think it would be better to have explicit ata_sg_iter passed around if
sg itself isn't enough to walk the sg list.

> So we could add a test for sg_last() there, but that would turn sg table
> iteration into an O(N^2) operation for those drivers that use
> ata_sg_is_last() with chained sg tables. I'd much rather just get rid of
> ata_sg_is_last(), it's only used to mark end-of-table entries for
> hardware. That logic can be performed cheaper.

Yeap, it can be removed but having "is this the last one?" test is just
nicer to low level drivers. With ata_sg_iter, I think we can do it in
O(N) by looking up and caching the next entry.

Thanks.

--
tejun

2007-10-11 10:28:23

by Jens Axboe

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On Thu, Oct 11 2007, Tejun Heo wrote:
> Jens Axboe wrote:
> > This is the old ata_sg_is_last:
> >
> > static inline int
> > ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
> > {
> > if (sg == &qc->pad_sgent)
> > return 1;
> > if (qc->pad_len)
> > return 0;
> > if (((sg - qc->__sg) + 1) == qc->n_elem)
> > return 1;
> > return 0;
> > }
> >
> > and the new one:
> >
> > static inline int
> > ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
> > {
> > if (sg == &qc->pad_sgent)
> > return 1;
> > if (qc->pad_len)
> > return 0;
> > if (qc->n_iter == qc->n_elem)
> > return 1;
> > return 0;
> > }
> >
> > ->n_iter is how ata_qc_next_sg() walks over the sglist, I don't
> > understand your reference to why depending on that during iteration
> > would be bad?
>
> Because that makes ata_sg iterator macros have hidden side effects
> (nothing in the interface suggests it can't be nested and when somebody
> actually nests it, finding what went wrong can be pretty difficult). I
> think it would be better to have explicit ata_sg_iter passed around if
> sg itself isn't enough to walk the sg list.

I think it's implicit in the interface, since you don't pass an sgtable
in. But it's not a big deal to me, if you want it changed, go ahead :-)

> > So we could add a test for sg_last() there, but that would turn sg table
> > iteration into an O(N^2) operation for those drivers that use
> > ata_sg_is_last() with chained sg tables. I'd much rather just get rid of
> > ata_sg_is_last(), it's only used to mark end-of-table entries for
> > hardware. That logic can be performed cheaper.
>
> Yeap, it can be removed but having "is this the last one?" test is just
> nicer to low level drivers. With ata_sg_iter, I think we can do it in
> O(N) by looking up and caching the next entry.

Sure, of you could just lookup sg_last() in the beginning of the loop.
Still seems a bit silly to me just to keep the ata_sg_is_last()
interface, since it can be done for free basically by maintaining an
sglast element in the loop that you then use when the loop is done to
set your SG_END marker (or whatever the driver needs).

--
Jens Axboe

2007-10-11 17:51:22

by Torsten Kaiser

[permalink] [raw]
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/11/07, Tejun Heo <[email protected]> wrote:
> Torsten Kaiser wrote:
> >>> That missing +1 would explain, why the SGE_TRM never gets set.
> >> Thanks a lot for tracking this down. Does changing the above code fix
> >> your problem?
> >
> > I did not try it.
> > I'm not an libata expert and while this change looks suspicios, I
> > can't be 100% sure if that change was intended.
> > And I did not want to experiment this deep in the code and risk
> > corrupting the hole drive.
>
> I don't think you would risk too much by changing that bit of code.
> Please try it.

The debug output now shows that SGE_TRM is getting set.
I will try to reproduce the error tomorrow morning, and report if I'm
still seeing it.

Reporting that I'm no longer see it is sadly not a good indicator, as
it also seems to depend on other factors like memory layout and/or
code alignment.

Torsten