2007-11-20 19:09:24

by noah

[permalink] [raw]
Subject: Possibly SATA related freeze killed networking and RAID

I just had a strange freeze that killed networking and made software
RAID fail two of my harddisks.

There are a bunch of messages from the kernel which I extracted from
the system log after reboot at the end of this mail. I hit power off
in pure paranoia after the box froze, and then started to do disk I/O
again just right after I noticed the messages about two of my RAID
disks had failed on the console.
The network didn't recover when the harddrive suddenly started working again.
I managed to connect an USB keyboard and wake up the monitor from
sleep so I could see some of the messages printed on the console.

I looked through some other threads and found a mention of
smartmontools which I too use (5.37-5ubuntu2).

Kernel 2.6.22-14-generic (Ubuntu Gutsy Gibbon 7.10)
Motherboard: Asus M2N32 WS Professional nForce 590 SLI MCP (MCP55)
CPU: Athlon64 X2 Dual-Core 5600+
RAM: 4GB (passed memtest86 just a few minutes ago)

The harddrives are four Samsung HD501LJ 500GB drives.
sda and sdb have firmware CR100-10 and sdc and sdd have firmware CR100-11.
The drives are just a couple of months old, well cooled and so far
there's nothing interesting reported by S.M.A.R.T.

Software raid is configured like this:
sda1,sdc1 -> md0 (raid 1)
sdb1,sdd1 -> md1 (raid 1)
Both md0 and md1 are then encrypted with dm-crypt and the dm-devices
are then used to form md2 (stripe).

-- noah


# lspci
00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:08.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a1)
00:09.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a2)
00:09.1 SMBus: nVidia Corporation MCP55 SMBus (rev a2)
00:09.2 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:0a.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:0a.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:0c.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:0d.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0d.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0d.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0e.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:0e.1 Audio device: nVidia Corporation MCP55 High Definition Audio (rev a2)
00:10.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:11.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:12.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:14.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:15.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:16.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:17.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] Miscellaneous Control
01:00.0 VGA compatible controller: nVidia Corporation GeForce 8400 GS (rev a1)
02:06.0 Communication controller: Tiger Jet Network Inc. Tiger3XX
Modem/ISDN interface
03:00.0 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X
Bridge (rev 06)
03:00.1 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X
Bridge (rev 06)
08:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE6145
SATA II PCI-E controller (rev a1)



kernel: [734344.717844] irq 21: nobody cared (try booting with the
"irqpoll" option)
kernel: [734344.717866]
kernel: [734344.717866] Call Trace:
kernel: [734344.717868] <IRQ> [__report_bad_irq+30/128]
__report_bad_irq+0x1e/0x80
mdadm: Fail event detected on md device /dev/md1, component device /dev/sdd1
kernel: [734344.717888] [note_interrupt+643/704] note_interrupt+0x283/0x2c0
kernel: [734344.717895] [handle_fasteoi_irq+221/272]
handle_fasteoi_irq+0xdd/0x110
mdadm: Fail event detected on md device /dev/md0, component device /dev/sdc1
kernel: [734344.717901] [do_IRQ+123/256] do_IRQ+0x7b/0x100
kernel: [734344.717904] [default_idle+0/64] default_idle+0x0/0x40
kernel: [734344.717907] [ret_from_intr+0/10] ret_from_intr+0x0/0xa
kernel: [734344.717909] <EOI> [tcp_poll+0/368] tcp_poll+0x0/0x170
kernel: [734344.717918] [default_idle+41/64] default_idle+0x29/0x40
kernel: [734344.717923] [cpu_idle+112/192] cpu_idle+0x70/0xc0
kernel: [734344.717936]
kernel: [734344.717937] handlers:
kernel: [734344.717950] [_end+131265960/2130332920]
(nv_generic_interrupt+0x0/0xe0 [sata_nv])
kernel: [734344.717973] [_end+131152728/2130332920]
(nv_nic_irq_optimized+0x0/0x2b0 [forcedeth])
kernel: [734344.718003] Disabling IRQ #21
kernel: [734356.827155] ata5.00: exception Emask 0x0 SAct 0x0 SErr
0x0 action 0x2 frozen
kernel: [734356.827169] ata5.00: cmd
25/00:90:ff:3c:82/00:01:02:00:00/e0 tag 0 cdb 0x0 data 204800 in
kernel: [734356.827170] res
40/00:b4:cc:88:7f/40:00:02:00:00/e0 Emask 0x4 (timeout)
kernel: [734355.620185] ata6.00: exception Emask 0x0 SAct 0x0 SErr
0x0 action 0x2 frozen
kernel: [734355.620199] ata6.00: cmd
c8/00:08:ef:52:a9/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
kernel: [734355.620200] res
40/00:00:02:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
kernel: [734355.731595] ata6: soft resetting port
kernel: [734355.787308] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: [734358.738407] ata5: port is slow to respond, please be
patient (Status 0xd8)
kernel: [734360.418260] ata5: device not ready (errno=-16), forcing hardreset
kernel: [734360.418262] ata5: hard resetting port
kernel: [734360.643958] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: [734366.509226] ata6.00: qc timeout (cmd 0x27)
kernel: [734366.509231] ata6.00: ata_hpa_resize 1: hpa sectors (0) is
smaller than sectors (976773168)
kernel: [734366.509241] ata6.00: failed to set xfermode (err_mask=0x40)
kernel: [734366.509250] ata6: failed to recover some devices,
retrying in 5 secs
kernel: [734368.296234] ata6: hard resetting port
kernel: [734368.521935] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: [734371.365916] ata5.00: qc timeout (cmd 0x27)
kernel: [734371.365921] ata5.00: ata_hpa_resize 1: hpa sectors (0) is
smaller than sectors (976773168)
kernel: [734371.365928] ata5.00: failed to set xfermode (err_mask=0x40)
kernel: [734371.365936] ata5: failed to recover some devices,
retrying in 5 secs
kernel: [734373.152942] ata5: hard resetting port
kernel: [734373.378644] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: [734379.244112] ata6.00: qc timeout (cmd 0x27)
kernel: [734379.244118] ata6.00: ata_hpa_resize 1: hpa sectors (0) is
smaller than sectors (976773168)
kernel: [734379.244126] ata6.00: failed to set xfermode (err_mask=0x40)
kernel: [734379.244135] ata6: limiting SATA link speed to 1.5 Gbps
kernel: [734379.244138] ata6.00: limiting speed to UDMA/133:PIO3
kernel: [734379.244140] ata6: failed to recover some devices,
retrying in 5 secs
kernel: [734381.031138] ata6: hard resetting port
kernel: [734381.256840] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
kernel: [734384.095108] ata5.00: qc timeout (cmd 0x27)
kernel: [734384.095113] ata5.00: ata_hpa_resize 1: hpa sectors (0) is
smaller than sectors (976773168)
kernel: [734384.095120] ata5.00: failed to set xfermode (err_mask=0x40)
kernel: [734384.095129] ata5: limiting SATA link speed to 1.5 Gbps
kernel: [734384.095131] ata5.00: limiting speed to UDMA/133:PIO3
kernel: [734384.095133] ata5: failed to recover some devices,
retrying in 5 secs
kernel: [734385.882133] ata5: hard resetting port
kernel: [734386.107836] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
kernel: [734391.979018] ata6.00: qc timeout (cmd 0x27)
kernel: [734391.979024] ata6.00: ata_hpa_resize 1: hpa sectors (0) is
smaller than sectors (976773168)
kernel: [734391.979032] ata6.00: failed to set xfermode (err_mask=0x40)
kernel: [734391.979041] ata6.00: disabled
kernel: [734392.159007] ata6: EH complete
kernel: [734392.159018] sd 5:0:0:0: [sdd] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: [734392.159022] end_request: I/O error, dev sdd, sector 11096815
kernel: [734392.159026] raid1: sdd1: rescheduling sector 11096752
kernel: [734394.574692] sd 5:0:0:0: [sdd] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: [734394.574697] end_request: I/O error, dev sdd, sector 976767935
kernel: [734394.574703] md: super_written gets error=-5, uptodate=0
kernel: [734394.574706] raid1: Disk failure on sdd1, disabling device.
kernel: [734394.574707] ^IOperation continuing on 1 devices
kernel: [734394.574960] sd 5:0:0:0: [sdd] READ CAPACITY failed
kernel: [734394.574961] sd 5:0:0:0: [sdd] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: [734394.574964] sd 5:0:0:0: [sdd] Sense not available.
kernel: [734394.575233] sd 5:0:0:0: [sdd] Write Protect is off
kernel: [734394.575235] sd 5:0:0:0: [sdd] Mode Sense: 00 00 00 00
kernel: [734394.575275] sd 5:0:0:0: [sdd] Asking for cache data failed
kernel: [734394.575282] sd 5:0:0:0: [sdd] Assuming drive cache: write through
kernel: [734394.577622] RAID1 conf printout:
kernel: [734394.577624] --- wd:1 rd:2
kernel: [734394.577626] disk 0, wo:1, o:0, dev:sdd1
kernel: [734394.577627] disk 1, wo:0, o:1, dev:sdb1
kernel: [734394.583008] RAID1 conf printout:
kernel: [734394.583010] --- wd:1 rd:2
kernel: [734394.583011] disk 1, wo:0, o:1, dev:sdb1
kernel: [734394.593189] raid1: sdb1: redirecting sector 11096752 to
another mirror
kernel: [734399.802574] ata5.00: qc timeout (cmd 0x27)
kernel: [734399.802580] ata5.00: ata_hpa_resize 1: hpa sectors (0) is
smaller than sectors (976773168)
kernel: [734399.802588] ata5.00: failed to set xfermode (err_mask=0x40)
kernel: [734399.802606] ata5.00: disabled
kernel: [734400.306547] ata5: EH complete
kernel: [734400.306601] sd 4:0:0:0: [sdc] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: [734400.306605] end_request: I/O error, dev sdc, sector 42089727
kernel: [734400.306608] raid1: sdc1: rescheduling sector 42089664
kernel: [734400.306626] raid1: sdc1: rescheduling sector 42089672
kernel: [734400.306642] raid1: sdc1: rescheduling sector 42089680
kernel: [734400.306658] raid1: sdc1: rescheduling sector 42089688
kernel: [734400.306674] raid1: sdc1: rescheduling sector 42089696
kernel: [734400.306690] raid1: sdc1: rescheduling sector 42089704
kernel: [734400.306706] raid1: sdc1: rescheduling sector 42089712
kernel: [734400.306722] raid1: sdc1: rescheduling sector 42089720
kernel: [734400.306738] raid1: sdc1: rescheduling sector 42089728
kernel: [734400.307304] sd 4:0:0:0: [sdc] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: [734400.307307] end_request: I/O error, dev sdc, sector 976767935
kernel: [734400.307313] md: super_written gets error=-5, uptodate=0
kernel: [734400.307315] raid1: Disk failure on sdc1, disabling device.
kernel: [734400.307316] ^IOperation continuing on 1 devices
kernel: [734400.307643] sd 4:0:0:0: [sdc] READ CAPACITY failed
kernel: [734400.307645] sd 4:0:0:0: [sdc] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
kernel: [734400.307647] sd 4:0:0:0: [sdc] Sense not available.
kernel: [734400.307767] sd 4:0:0:0: [sdc] Write Protect is off
kernel: [734400.307768] sd 4:0:0:0: [sdc] Mode Sense: 00 00 00 00
kernel: [734400.307947] sd 4:0:0:0: [sdc] Asking for cache data failed
kernel: [734400.307963] sd 4:0:0:0: [sdc] Assuming drive cache: write through
kernel: [734400.319399] RAID1 conf printout:
kernel: [734400.319401] --- wd:1 rd:2
kernel: [734400.319402] disk 0, wo:1, o:0, dev:sdc1
kernel: [734400.319404] disk 1, wo:0, o:1, dev:sda1
kernel: [734400.330537] RAID1 conf printout:
kernel: [734400.330539] --- wd:1 rd:2
kernel: [734400.330540] disk 1, wo:0, o:1, dev:sda1


2007-11-20 22:07:31

by Alan

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

> kernel: [734344.717844] irq 21: nobody cared (try booting with the
> "irqpoll" option)
> kernel: [734344.717866]

Your machine decided to emit interrupt 21 without an apparent reason.
Whatever caused that made the kernel shut down IRQ 21 at which point the
disk drives on that IRQ were no longer being serviced. Everything on IRQ
21 would have died - which may be why your networking failed too.

What do you have on IRQ 21 and is this a one off ?

2007-11-20 22:16:43

by noah

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

2007/11/20, Alan Cox <[email protected]>:
> > kernel: [734344.717844] irq 21: nobody cared (try booting with the
> > "irqpoll" option)
> > kernel: [734344.717866]
>
> Your machine decided to emit interrupt 21 without an apparent reason.
> Whatever caused that made the kernel shut down IRQ 21 at which point the
> disk drives on that IRQ were no longer being serviced. Everything on IRQ
> 21 would have died - which may be why your networking failed too.
>
> What do you have on IRQ 21 and is this a one off ?

I've had other freezes before but this was the first time I was able
to see what was actually going on.
IRQ 21 appears to be shared between sata_nv and ethernet.

Does this mean my hardware/BIOS is broken somehow?
I'm running the latest BIOS available.

# cat /proc/interrupts CPU0 CPU1
0: 26497 3603163 IO-APIC-edge timer
1: 0 2 IO-APIC-edge i8042
8: 0 0 IO-APIC-edge rtc
9: 0 0 IO-APIC-fasteoi acpi
12: 0 6 IO-APIC-edge i8042
16: 4851 669159 IO-APIC-fasteoi shpchp, libata
20: 0 0 IO-APIC-fasteoi sata_nv
21: 36443 4775430 IO-APIC-fasteoi sata_nv, eth0
22: 31261 4531218 IO-APIC-fasteoi ohci_hcd:usb1, sata_nv
23: 4 1649 IO-APIC-fasteoi HDA Intel, ehci_hcd:usb2
NMI: 0 0
LOC: 3629562 3629543
ERR: 0

-- noah

2007-11-21 00:48:00

by Alan

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

> I've had other freezes before but this was the first time I was able
> to see what was actually going on.
> IRQ 21 appears to be shared between sata_nv and ethernet.
>
> Does this mean my hardware/BIOS is broken somehow?

Not neccessarily. It could a bug in one of the drivers using IRQ 21
(sata_nv or the nvidia ethernet), it could be another inactive device, or
it could be a hardware funny.

Nvidia stuff can be quite hard to diagnose as we have no documentation
but we can try. The first question is whether it is network or disk
triggered - seeing if heavy loads to one or the other trigger the problem
might be a first plan.


Alan

2007-11-21 19:06:51

by noah

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

2007/11/21, Alan Cox <[email protected]>:
> > I've had other freezes before but this was the first time I was able
> > to see what was actually going on.
> > IRQ 21 appears to be shared between sata_nv and ethernet.
> >
> > Does this mean my hardware/BIOS is broken somehow?
>
> Not neccessarily. It could a bug in one of the drivers using IRQ 21
> (sata_nv or the nvidia ethernet), it could be another inactive device, or
> it could be a hardware funny.

How can I tell if there's an inactive device?

> Nvidia stuff can be quite hard to diagnose as we have no documentation
> but we can try. The first question is whether it is network or disk
> triggered - seeing if heavy loads to one or the other trigger the problem
> might be a first plan.

I haven't managed to trigger it again yet but at the time the CPU was
heavily loaded and I was re-indexing a database which caused a lot of
disk activity. I'm quite confident the network was pretty much idle at
the time.

-- noah

2007-11-26 12:16:59

by Pavel Machek

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

Hi!

> > kernel: [734344.717844] irq 21: nobody cared (try booting with the
> > "irqpoll" option)
> > kernel: [734344.717866]
>
> Your machine decided to emit interrupt 21 without an apparent reason.
> Whatever caused that made the kernel shut down IRQ 21 at which point the
> disk drives on that IRQ were no longer being serviced. Everything on IRQ
> 21 would have died - which may be why your networking failed too.

Hmm, perhaps that 'nobody cared' message should be worded more
strongly, and printed and KERN_CRIT?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-11-28 01:55:38

by Tejun Heo

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

Pavel Machek wrote:
> Hi!
>
>>> kernel: [734344.717844] irq 21: nobody cared (try booting with the
>>> "irqpoll" option)
>>> kernel: [734344.717866]
>> Your machine decided to emit interrupt 21 without an apparent reason.
>> Whatever caused that made the kernel shut down IRQ 21 at which point the
>> disk drives on that IRQ were no longer being serviced. Everything on IRQ
>> 21 would have died - which may be why your networking failed too.
>
> Hmm, perhaps that 'nobody cared' message should be worded more
> strongly, and printed and KERN_CRIT?

Agreed. Nobody cared on ATA controllers is usually very effective at
taking the whole machine down. Is there any reason why we don't turn on
irqpoll on turned off IRQs automatically?

Thanks.

--
tejun

2007-11-29 22:17:05

by Phillip Susi

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

Tejun Heo wrote:
> Agreed. Nobody cared on ATA controllers is usually very effective at
> taking the whole machine down. Is there any reason why we don't turn on
> irqpoll on turned off IRQs automatically?

Why does a single spurious interrupt cause it to be shut down? I can
see if the interrupt is stuck on and keeps interrupting constantly, but
if it's just the occasional spurious interrupt, why not just ignore it
and move on?

2007-11-30 00:02:47

by Tejun Heo

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

Phillip Susi wrote:
> Tejun Heo wrote:
>> Agreed. Nobody cared on ATA controllers is usually very effective at
>> taking the whole machine down. Is there any reason why we don't turn on
>> irqpoll on turned off IRQs automatically?
>
> Why does a single spurious interrupt cause it to be shut down? I can
> see if the interrupt is stuck on and keeps interrupting constantly, but
> if it's just the occasional spurious interrupt, why not just ignore it
> and move on?

Because SFF ATA controller don't have IRQ pending bit. You don't know
whether IRQ is raised or not. Plus, accessing the status register which
clears pending IRQ can be very slow on PATA machines. It has to go
through the PCI and ATA bus and come back. So, unconditionally trying
to clear IRQ by accessing Status can incur noticeable overhead if the
IRQ is shared with devices which raise a lot of IRQs.

--
tejun

2007-11-30 00:03:40

by Robert Hancock

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

Phillip Susi wrote:
> Tejun Heo wrote:
>> Agreed. Nobody cared on ATA controllers is usually very effective at
>> taking the whole machine down. Is there any reason why we don't turn on
>> irqpoll on turned off IRQs automatically?
>
> Why does a single spurious interrupt cause it to be shut down? I can
> see if the interrupt is stuck on and keeps interrupting constantly, but
> if it's just the occasional spurious interrupt, why not just ignore it
> and move on?

I'm not certain offhand, but I think there may be such a threshold.
However, an occasional spurious interrupt isn't likely. For a
level-triggered interrupt, an unhandled interrupt will keep interrupting
forever since nobody knows how to clear it (until we decide to disable
the IRQ entirely).

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2007-11-30 13:18:00

by Alan

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

> Why does a single spurious interrupt cause it to be shut down? I can

It doesn't.

> see if the interrupt is stuck on and keeps interrupting constantly, but
> if it's just the occasional spurious interrupt, why not just ignore it
> and move on?

The interrupt is usually level triggered so it continues to create
interrupts until you silence it. The thresholds are about 10,000
interrupt events and on newer kernels we also reset the count if we don't
see any for a while. That works for most stuff except the thinkpad
bluetooth problem.

2007-11-30 13:32:34

by Pavel Machek

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

On Fri 2007-11-30 13:13:44, Alan Cox wrote:
> > Why does a single spurious interrupt cause it to be shut down? I can
>
> It doesn't.
>
> > see if the interrupt is stuck on and keeps interrupting constantly, but
> > if it's just the occasional spurious interrupt, why not just ignore it
> > and move on?
>
> The interrupt is usually level triggered so it continues to create
> interrupts until you silence it. The thresholds are about 10,000
> interrupt events and on newer kernels we also reset the count if we don't
> see any for a while. That works for most stuff except the thinkpad
> bluetooth problem.

Which is confirmed hw problem now, btw.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-11-30 15:01:13

by Mark Lord

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

Pavel Machek wrote:
> On Fri 2007-11-30 13:13:44, Alan Cox wrote:
>>> Why does a single spurious interrupt cause it to be shut down? I can
>> It doesn't.
>>
>>> see if the interrupt is stuck on and keeps interrupting constantly, but
>>> if it's just the occasional spurious interrupt, why not just ignore it
>>> and move on?
>> The interrupt is usually level triggered so it continues to create
>> interrupts until you silence it. The thresholds are about 10,000
>> interrupt events and on newer kernels we also reset the count if we don't
>> see any for a while. That works for most stuff except the thinkpad
>> bluetooth problem.
>
> Which is confirmed hw problem now, btw.
...

What problem is that, exactly?

My Dell has an internal USB BT adapter that briefly appears
and then disappears again on resume (or stays if I have "enabled" it
via the BIOS key).

I wonder if that has anything to do with the (new in) 2.6.23 pauses
that machine has on resume (about every 10th time).

Cheers

2007-11-30 18:46:19

by Phillip Susi

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

Tejun Heo wrote:
> Because SFF ATA controller don't have IRQ pending bit. You don't know
> whether IRQ is raised or not. Plus, accessing the status register which
> clears pending IRQ can be very slow on PATA machines. It has to go
> through the PCI and ATA bus and come back. So, unconditionally trying
> to clear IRQ by accessing Status can incur noticeable overhead if the
> IRQ is shared with devices which raise a lot of IRQs.

There HAS to be a way to determine if that device generated the
interrupt, or the interrupt can not be shared. Since the kernel said
nobody cared about the interrupt, that indicates that the sata driver
checked the status register and realized the sata chip didn't generate
the interrupt, and returned to the kernel letting it know that the
interrupt was not for it.

2007-11-30 20:25:23

by Pavel Machek

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

On Fri 2007-11-30 10:00:55, Mark Lord wrote:
> Pavel Machek wrote:
>> On Fri 2007-11-30 13:13:44, Alan Cox wrote:
>>>> Why does a single spurious interrupt cause it to be shut down? I can
>>> It doesn't.
>>>
>>>> see if the interrupt is stuck on and keeps interrupting constantly, but
>>>> if it's just the occasional spurious interrupt, why not just ignore it
>>>> and move on?
>>> The interrupt is usually level triggered so it continues to create
>>> interrupts until you silence it. The thresholds are about 10,000
>>> interrupt events and on newer kernels we also reset the count if we don't
>>> see any for a while. That works for most stuff except the thinkpad
>>> bluetooth problem.
>> Which is confirmed hw problem now, btw.
> ...
>
> What problem is that, exactly?

Spurious interrupt, interrupt link is disabled after ~15 minutes. It
seems pretty unique to t61.

> My Dell has an internal USB BT adapter that briefly appears
> and then disappears again on resume (or stays if I have "enabled" it
> via the BIOS key).
>
> I wonder if that has anything to do with the (new in) 2.6.23 pauses
> that machine has on resume (about every 10th time).

No idea, but t61 problem seems different.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-11-30 23:57:24

by Tejun Heo

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

Phillip Susi wrote:
> Tejun Heo wrote:
>> Because SFF ATA controller don't have IRQ pending bit. You don't know
>> whether IRQ is raised or not. Plus, accessing the status register which
>> clears pending IRQ can be very slow on PATA machines. It has to go
>> through the PCI and ATA bus and come back. So, unconditionally trying
>> to clear IRQ by accessing Status can incur noticeable overhead if the
>> IRQ is shared with devices which raise a lot of IRQs.
>
> There HAS to be a way to determine if that device generated the
> interrupt, or the interrupt can not be shared. Since the kernel said
> nobody cared about the interrupt, that indicates that the sata driver
> checked the status register and realized the sata chip didn't generate
> the interrupt, and returned to the kernel letting it know that the
> interrupt was not for it.

Surprise, surprise. There's no way to tell whether the controller
raised interrupt or not if command is not in progress. As I said
before, there's no IRQ pending bit. While processing commands, you can
tell by looking at other status registers but when there's nothing in
flight and the controller determines it's a good time to raise a
spurious interrupt, there's no way you can tell. That dang SFF
interface is like 15+ years old.

But we can still make things pretty robust. We're working on it.

Thanks.

--
tejun

2007-12-03 17:15:28

by Phillip Susi

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

Tejun Heo wrote:
> Surprise, surprise. There's no way to tell whether the controller
> raised interrupt or not if command is not in progress. As I said
> before, there's no IRQ pending bit. While processing commands, you can
> tell by looking at other status registers but when there's nothing in
> flight and the controller determines it's a good time to raise a
> spurious interrupt, there's no way you can tell. That dang SFF
> interface is like 15+ years old.
>
> But we can still make things pretty robust. We're working on it.
>
> Thanks.
>

It sounds like you mean that you know the controller did NOT raise the
interrupt ( intentionally/correctly ) if there was no command in
progress, as opposed to not being able to tell. Unless there is some
condition under which it is valid for the controller to raise an
interrupt when it had no commands in progress? And if that's the case
and there's know way to know WHY, that's a broken design.

2007-12-04 01:32:56

by Tejun Heo

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

Phillip Susi wrote:
> Tejun Heo wrote:
>> Surprise, surprise. There's no way to tell whether the controller
>> raised interrupt or not if command is not in progress. As I said
>> before, there's no IRQ pending bit. While processing commands, you can
>> tell by looking at other status registers but when there's nothing in
>> flight and the controller determines it's a good time to raise a
>> spurious interrupt, there's no way you can tell. That dang SFF
>> interface is like 15+ years old.
>>
>> But we can still make things pretty robust. We're working on it.
>
> It sounds like you mean that you know the controller did NOT raise the
> interrupt ( intentionally/correctly ) if there was no command in
> progress, as opposed to not being able to tell. Unless there is some
> condition under which it is valid for the controller to raise an
> interrupt when it had no commands in progress? And if that's the case
> and there's know way to know WHY, that's a broken design.

If everything works correctly, all interrupts can be accounted for.
It's just that there's no margin for erratic behaviors and most ATA
controllers are built really cheap. So, yeah, it's a 15+ years old
half-broken design.

--
tejun

2007-12-10 12:33:39

by noah

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

2007/11/21, noah <[email protected]>:
> 2007/11/21, Alan Cox <[email protected]>:
> > > I've had other freezes before but this was the first time I was able
> > > to see what was actually going on.
> > > IRQ 21 appears to be shared between sata_nv and ethernet.
> > >
> > > Does this mean my hardware/BIOS is broken somehow?
> >
> > Not neccessarily. It could a bug in one of the drivers using IRQ 21
> > (sata_nv or the nvidia ethernet), it could be another inactive device, or
> > it could be a hardware funny.
>
> How can I tell if there's an inactive device?
>
> > Nvidia stuff can be quite hard to diagnose as we have no documentation
> > but we can try. The first question is whether it is network or disk
> > triggered - seeing if heavy loads to one or the other trigger the problem
> > might be a first plan.
>
> I haven't managed to trigger it again yet but at the time the CPU was
> heavily loaded and I was re-indexing a database which caused a lot of
> disk activity. I'm quite confident the network was pretty much idle at
> the time.

The same thing has happened twice now, both during the weekly check of
the md0 and md1 RAID1-arrays. That is, networking on the primary
interface is dead. It's interrupt (irq 21) is shared between sata_nv
and forcedeth.

Is there anything I can do to debug this problem?

I don't have access to the logs right now but will have later.

-- noah

2007-12-10 21:52:18

by Thiemo Nagel

[permalink] [raw]
Subject: Re: Possibly SATA related freeze killed networking and RAID

Hello,

I think, I'm experiencing the same problem:

09:16:34 : NETDEV WATCHDOG: eth0: transmit timed out
09:16:34 : eth0: Got tx_timeout. irq: 00000000
09:16:34 : eth0: Ring at 37e50000
09:16:34 : eth0: Dumping tx registers
09:16:34 : 0: 00000000 000000ff 00000003 025003ca 00000000 00000000
00000000 00000000
09:16:34 : 20: 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000

[...]

09:16:54 : ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
09:16:54 : ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
09:16:54 : ata6.00: cmd 25/00:08:1e:97:48/00:00:19:00:00/e0 tag 0 cdb 0x0
data 4096 in
09:16:54 : res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4
(timeout)
09:16:54 : ata5.00: cmd 25/00:70:1e:97:48/00:00:19:00:00/e0 tag 0 cdb 0x0
data 57344 in
09:16:54 : res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4
(timeout)
09:16:54 : ata6: soft resetting port
09:16:54 : ata5: soft resetting port
09:16:54 : ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
09:16:54 : ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
09:16:54 : NETDEV WATCHDOG: eth0: transmit timed out
09:16:54 : eth0: Got tx_timeout. irq: 00000032
09:16:54 : eth0: Ring at 37e50000
09:16:54 : eth0: Dumping tx registers

A more complete log can be found at:
http://www.e18.physik.tu-muenchen.de/~tnagel/misc/kernel-crash.log

The setup is strikingly similar to that of noah (I'm quoting all of this
by heart, if somebody is interested in more detail, just ask.):

Kernel: 2.6.22 (amd64, Debian patches, tainted)
Mainboard: Asus M2N-SLI Deluxe (nForce 570 SLI MCP --> MCP55, same as noah)
CPU: Athlon64 Dual-Core (same as noah)
RAM: 1GB
HD: 22 x Samsung HD501LJ 500GB (same as noah), 1-6 connected to chipset,
7-22 connected to RocketRaid 2340.

I'm using software RAID like noah, (levels 1, 5 and 6), and like with noah
the problem occurred during RAID check, in my case during heavy NFS load
which had been ongoing for ~4 days. This is the third time, it has
happened, but only this time I could catch the logs via netconsole. The
two affected drives are connected to the chipset and show no SMART errors.

Unfortunately, the kernel is tainted since I'm using HighPoint's drivers
for the RR2340. I don't know whether I can change this easily.

Kind regards,

Thiemo Nagel