LinuxLists.cc - 2.6.16.32 stuck in generic_file_aio

2006-11-29 12:41:48

Subject: 2.6.16.32 stuck in generic_file_aio_write()

Hi,

I've got a machine which occasionally locks up. I can still sysrq it from
a serial console, so it's not entirely dead.

A sysrq-t shows me that it's got a large number of httpd processes stuck
in D state :

httpd D F7619440 2160 11635 2057 11636 (NOTLB)
dbb7ae14 cc9b0550 c33224a0 f7619440 de187604 00000000 000000b3 00000001
000000b3 00000000 ffffffff d374a550 c33224a0 0005b8d8 f04af800 000f75e7
d374a550 cc9b0550 cc9b0678 ef7d33ec ef7d33e8 cc9b0550 ef7d33fc c041bf70
Call Trace:
[<c041bf70>] __mutex_lock_slowpath+0x92/0x43e
[<c0148f29>] generic_file_aio_write+0x5c/0xfa
[<c0148f29>] generic_file_aio_write+0x5c/0xfa
[<c0148f29>] generic_file_aio_write+0x5c/0xfa
[<c01746c9>] permission+0xad/0xcb
[<c01d9c4a>] ext3_file_write+0x3b/0xb0
[<c0166777>] do_sync_write+0xd5/0x130
[<c041d1bf>] _spin_unlock+0xb/0xf
[<c0135c13>] autoremove_wake_function+0x0/0x4b
[<c0166975>] vfs_write+0x1a3/0x1a8
[<c0166a39>] sys_write+0x4b/0x74
[<c0102c03>] sysenter_past_esp+0x54/0x75

After this, the machine is rendered useless (probably due to the fact that
disk IO isn't working anymore).
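
As a side note to the D-state report above, here is a minimal sketch of how the same information could be pulled from /proc while the machine still responds. It is not part of the original post; the helper names `parse_stat` and `d_state_tasks` are hypothetical, and it assumes the usual `pid (comm) state ...` layout of /proc/<pid>/stat.

```python
import os


def parse_stat(text):
    """Split one /proc/<pid>/stat record into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(proc_root="/proc"):
    """Return (pid, comm) for every task in uninterruptible sleep (state D)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError):
            continue  # the task exited between listdir() and open()
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling `d_state_tasks()` on a live Linux system returns the blocked tasks. Once disk IO has stalled completely, starting a new interpreter may itself hang, so sysrq-t over the serial console remains the more dependable tool in that situation.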

The lock debugging gives me this :

D httpd:11635 [cc9b0550, 116] blocked on mutex: [ef7d33e8] {inode_init_once}
.. held by: httpd: 506 [d67e1000, 121]
... acquired at: generic_file_aio_write+0x5c/0xfa

I see similar things to those mentioned in http://lkml.org/lkml/2006/1/10/64,
with the difference that I'm not running software RAID or SATA (it's an
Areca ARC-1110).

So far I can't reproduce it on demand, it 'just' happens. Can someone give
me a pointer on where to start looking ?

Erich, I've CC-ed you since the machine is running an Areca RAID config.
It's also the only used disk subsystem in this machine.

Regards,

Igmar

2006-11-29 15:20:59

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

Hi,

A followup. It crashed again, giving me :

arcmsr0: scsi id=0 lun=0 ccb='0xf7c984e0' poll command abort successfully
end_request: I/O error, dev sda, sector 3724719

and

sd 0:0:0:0: rejecting I/O to offline device
about 15k times.

I'll see if I can upgrade the RAID driver.

Igmar

--
Igmar Palsenberg
JDI ICT

Zutphensestraatweg 85
6953 CJ Dieren
Tel: +31 (0)313 - 496741
Fax: +31 (0)313 - 420996
The Netherlands

mailto: [email protected]

2006-11-30 01:22:45

by erich

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

Dear Igmar Palsenberg,

If you are running arcmsr 1.20.00.13 from the official kernel, that is
the latest version.
Could you check your RAID controller's events and tell me what you find?
You can check "MBIOS"=>"Physical Drive Information"=>"View Drive
Information"=>"Select The Drive"=>"Timeout Count"......
It can tell you which disk misbehaved and caused your RAID volume to go
offline.
The message dump from arcmsr says that something went wrong with your RAID
volume and that it was kicked out of the system.
What is your RAID config?
Areca has a new firmware release (1.42).
If you are using an "sg" device with the scsi passthrough ioctl method to
feed data into Areca's RAID volume, you need to keep each transfer under
512 blocks (256K).
The new firmware will raise that limit to 4096 blocks (2M) per transfer.
Firmware version 1.42 is in the release process but has not yet been put
on the Areca ftp site.
If you need it, please tell me.
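
The transfer limits quoted above can be checked with a little arithmetic. This is a sketch that is not part of the original message: it assumes a "block" is a 512-byte sector, which is what makes 512 blocks equal 256K and 4096 blocks equal 2M, and it treats the limit as inclusive. The function names are hypothetical.

```python
SECTOR_BYTES = 512  # assumption: one "block" is a 512-byte sector


def transfer_limit_bytes(blocks, sector_bytes=SECTOR_BYTES):
    """Largest single transfer, in bytes, for a limit given in blocks."""
    return blocks * sector_bytes


def fits_in_one_transfer(length_bytes, max_blocks):
    """True if a request of length_bytes stays within the block limit."""
    return length_bytes <= transfer_limit_bytes(max_blocks)
```

With these definitions, `transfer_limit_bytes(512)` is 262144 bytes (256K) and `transfer_limit_bytes(4096)` is 2097152 bytes (2M), matching the figures in the message.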

Best Regards
Erich Chen

2006-11-30 09:48:21

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

Hi,

> If you are running arcmsr 1.20.00.13 from the official kernel, that is
> the latest version.

I'm already on that version. I'll see if I can upgrade to 2.6.19 today.

> Could you check your RAID controller's events and tell me what you find?
> You can check "MBIOS"=>"Physical Drive Information"=>"View Drive
> Information"=>"Select The Drive"=>"Timeout Count"......
> It can tell you which disk misbehaved and caused your RAID volume to go
> offline.

I need to be in the BIOS, right ? I couldn't find anything useful with the
cli32 tool.

> The message dump from arcmsr says that something went wrong with your RAID
> volume and that it was kicked out of the system.
> What is your RAID config?

CLI> disk info
Ch  ModelName        Serial#         FirmRev   Capacity  State
===============================================================================
 1  HDT722516DLA380  VDK71BTCDB90KE  V43OA91A  164.7GB   RaidSet Member(1)
 2  HDT722516DLA380  VDN71BTCDEPH7G  V43OA91A  164.7GB   RaidSet Member(1)
 3  HDT722516DLA380  VDN71BTCDES96G  V43OA91A  164.7GB   RaidSet Member(1)
 4  HDT722516DLA380  VDN71BTCDE15KG  V43OA91A  164.7GB   RaidSet Member(1)
===============================================================================

CLI> rsf info
Num  Name           Disks  TotalCap  FreeCap  DiskChannels  State
===============================================================================
 1   Raid Set # 00  4      640.0GB   0.0GB    1234          Normal
===============================================================================

CLI> vsf info
#  Name             Raid#  Level  Capacity  Ch/Id/Lun  State
===============================================================================
1  ARC-1110-VOL#00  1      Raid5  480.0GB   00/00/00   Normal
===============================================================================

A plain RAID 5 config with 4 disks.
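
The capacities in the listings above are consistent with ordinary RAID 5 arithmetic, where one member's worth of space goes to parity. A small sketch, not part of the original message: the function name is hypothetical, and the 160.0GB per-member figure is inferred from the reported 640.0GB raid set total divided over 4 disks rather than stated by the controller.

```python
def raid5_usable_gb(member_count, member_gb):
    """Usable capacity of a RAID 5 set: (n - 1) members hold data."""
    if member_count < 3:
        raise ValueError("RAID 5 needs at least three members")
    return (member_count - 1) * member_gb


# Inferred per-member size: 640.0GB total across 4 channels.
per_member_gb = 640.0 / 4
usable_gb = raid5_usable_gb(4, per_member_gb)
```

Here `usable_gb` comes out at 480.0, which is the capacity shown for ARC-1110-VOL#00.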

> Areca has a new firmware release (1.42).
> If you are using an "sg" device with the scsi passthrough ioctl method to
> feed data into Areca's RAID volume, you need to keep each transfer under
> 512 blocks (256K).
> The new firmware will raise that limit to 4096 blocks (2M) per transfer.
> Firmware version 1.42 is in the release process but has not yet been put
> on the Areca ftp site.

I don't use the sg driver at all. Is the upgrade worth it ? I usually
don't mess with firmware unless I'm told to do so.

> If you need it, please tell me.

Can you send it to me ? Installing it won't hurt I guess :)

Regards,

Igmar

2006-12-01 05:23:07

by Andrew Morton

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

On Wed, 29 Nov 2006 13:41:37 +0100 (CET)
Igmar Palsenberg <[email protected]> wrote:

> I've got a machine which occasionally locks up. I can still sysrq it from
> a serial console, so it's not entirely dead.
>
> A sysrq-t shows me that it's got a large number of httpd processes stuck
> in D state :

There are known deadlocks in generic_file_write() in kernels up to and
including 2.6.17. Pagefaults are involved and I'd need to see the entire
sysrq-T output to determine if you're hitting that bug.

2006-12-01 08:56:23

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

Hi,

> > I've got a machine which occasionally locks up. I can still sysrq it from
> > a serial console, so it's not entirely dead.
> >
> > A sysrq-t shows me that it's got a large number of httpd processes stuck
> > in D state :
>
> There are known deadlocks in generic_file_write() in kernels up to and
> including 2.6.17. Pagefaults are involved and I'd need to see the entire
> sysrq-T output to determine if you're hitting that bug.

It's rather large, but for those who want to look at it :
http://www.jdi-ict.nl/plain/serial-28112006.txt

There is also a dump from a day later, but halfway through it the Areca
controller decided to kick out the array, which still had a lot of
unwritten data waiting to be written :)

That dump is at http://www.jdi-ict.nl/plain/serial-29112006.txt

Regards,

Igmar

2006-12-04 21:04:01

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

> It's rather large, but for those who want to look at it :
> http://www.jdi-ict.nl/plain/serial-28112006.txt

The same problem, this time with 2.6.19. I've done a show tasks, a show
locks, a show regs, and after that, a sync + reboot :)

Log is at http://www.jdi-ict.nl/plain/serial-04122006.txt .

If anyone needs more info : please tell me.

Regards,

Igmar

2006-12-06 15:18:00

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

> > It's rather large, but for those who want to look at it :
> > http://www.jdi-ict.nl/plain/serial-28112006.txt
>
> The same problem, this time with 2.6.19. I've done a show tasks, a show
> locks, a show regs, and after that, a sync + reboot :)
>
> Log is at http://www.jdi-ict.nl/plain/serial-04122006.txt .
>
> If anyone needs more info : please tell me.

Done some more digging : isn't http://lkml.org/lkml/2006/10/13/139 somehow
related ? I do see pagefaults, and inode locks and mmap_locks.

Regards,

Igmar

2006-12-06 15:40:17

by Andrew Morton

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

On Wed, 6 Dec 2006 16:17:10 +0100 (CET)
Igmar Palsenberg <[email protected]> wrote:

>
> > > It's rather large, but for those who want to look at it :
> > > http://www.jdi-ict.nl/plain/serial-28112006.txt
> >
> > The same problem, this time with 2.6.19. I've done a show tasks, a show
> > locks, a show regs, and after that, a sync + reboot :)
> >
> > Log is at http://www.jdi-ict.nl/plain/serial-04122006.txt .
> >
> > If anyone needs more info : please tell me.
>
> Done some more digging : isn't http://lkml.org/lkml/2006/10/13/139 somehow
> related ? I do see pagefaults, and inode locks and mmap_locks.
>

I thought it was, but from my look through your 8-billion-task backtrace,
no task was stuck in D-state with the appropriate call trace.

So I don't know what's causing this. In the first trace you have at least
four D-state kjournalds and a lot of processes stuck on an i_mutex. I
guess it's consistent with an IO system which is losing completion
interrupts. AFAICT in the second trace all you have is a lot of processes
stuck on i_mutex for no obvious reason - I don't know why that would
happen.

How long does it take for this to happen?

Yes, lockdep might find something.

2006-12-06 16:14:26

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

> > Done some more digging : isn't http://lkml.org/lkml/2006/10/13/139 somehow
> > related ? I do see pagefaults, and inode locks and mmap_locks.
> >
>
> I thought it was, but from my look through your 8-billion-task backtrace,
> no task was stuck in D-state with the appropriate call trace.
>
> So I don't know what's causing this. In the first trace you have at least
> four D-state kjournalds and a lot of processes stuck on an i_mutex. I
> guess it's consistent with an IO system which is losing completion
> interrupts.

Hmm.. Is there any way to make sure ? I've got a second machine (almost
identical), which doesn't show this.

The main difference is the running kernel. I've had them on the same
kernel, and the bad machine still crashed.

/proc/interrupts

Bad machine : 18: 11160637 11235698 IO-APIC-fasteoi arcmsr
Good machine : 18: 61658630 79352227 IO-APIC-level arcmsr

Bad machine is running 2.6.19, good is running 2.6.14.7-grsec, which
probably accounts for these changes.

> AFAICT in the second trace all you have is a lot of processes
> stuck on i_mutex for no obvious reason - I don't know why that would
> happen.

It's consistent, and so are the traces.

> How long does it take for this to happen?

Days to a week tops. It does happen less frequently with 2.6.19;
2.6.16.32 triggered it almost daily.

> Yes, lockdep might find something.

I've enabled most debug options. I'll boot the other kernel tomorrow.

Regards,

Igmar

2006-12-07 09:59:15

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

> I thought it was, but from my look through your 8-billion-task backtrace,
> no task was stuck in D-state with the appropriate call trace.

I was afraid of that... Where is the lock on the i_mutex supposed
to be released ? I can't follow the codepath from within an interrupt back
to the fs layer.

> So I don't know what's causing this. In the first trace you have at least
> four D-state kjournalds and a lot of processes stuck on an i_mutex. I
> guess it's consistent with an IO system which is losing completion
> interrupts. AFAICT in the second trace all you have is a lot of processes
> stuck on i_mutex for no obvious reason - I don't know why that would
> happen.

Is there any way to see if it is missing interrupts ? Enabling the
debugging in the areca driver isn't a good idea on this machine: it's a
heavily IO-loaded machine, and the problem seems to take some time to occur.

It *does* happen less often with a 2.6.19 kernel however.

The task dump takes > 10 seconds, which causes the softlockup detector to
trigger. Is there any objection to a patch which disables the lockup
detector during the dump ? It isn't a big issue, since all it does is dump
a stacktrace.

I've enabled most debugging now, I'll see if I can run both a disk and VM
stresstest.

I'll put a .config and a dmesg of the machine booting at
http://www.jdi-ict.nl/plain/ for those who want to look at it.

Regards,

Igmar

2006-12-07 12:30:13

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

> I've enabled most debugging now, I'll see if I can run both a disk and VM
> stresstest.

Running stress now :

stress -c 2 -i 2 -m 8 -d 8 --vm-bytes 20M --vm-hang 5 --hdd-bytes 20M

I'll see what this results in.

> I'll put a .config and a dmesg of the machine booting at
> http://www.jdi-ict.nl/plain/ for those who want to look at it.

dmesg : http://www.jdi-ict.nl/plain/lnx01.dmesg
Kernel config : http://www.jdi-ict.nl/plain/lnx01.config

regards,

Igmar

2006-12-14 08:34:48

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

> > I'll put a .config and a dmesg of the machine booting at
> > http://www.jdi-ict.nl/plain/ for those who want to look at it.
>
> dmesg : http://www.jdi-ict.nl/plain/lnx01.dmesg
> Kernel config : http://www.jdi-ict.nl/plain/lnx01.config

Hmm.. Switching CONFIG_HZ from 1000 to 250 seems to 'fix' the problem.
I haven't seen the issue in nearly a week now. This makes Andrew's theory
about missing interrupts very likely.

Andrew / others : Is there a way to find out if it *is* missing
interrupts ?

Regards,

Igmar

2006-12-14 08:42:33

by Andrew Morton

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

On Thu, 14 Dec 2006 09:15:39 +0100 (CET)
Igmar Palsenberg <[email protected]> wrote:

>
> > > I'll put a .config and a dmesg of the machine booting at
> > > http://www.jdi-ict.nl/plain/ for those who want to look at it.
> >
> > dmesg : http://www.jdi-ict.nl/plain/lnx01.dmesg
> > Kernel config : http://www.jdi-ict.nl/plain/lnx01.config
>
> Hmm.. Switching CONFIG_HZ from 1000 to 250 seems to 'fix' the problem.
> I haven't seen the issue in nearly a week now. This makes Andrew's theory
> about missing interrupts very likely.
>
> Andrew / others : Is there a way to find out if it *is* missing
> interrupts ?
>

umm, nasty. What's in /proc/interrupts?

2006-12-14 08:55:50

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

> > Hmm.. Switching CONFIG_HZ from 1000 to 250 seems to 'fix' the problem.
> > I haven't seen the issue in nearly a week now. This makes Andrew's theory
> > about missing interrupts very likely.
> >
> > Andrew / others : Is there a way to find out if it *is* missing
> > interrupts ?
> >
>
> umm, nasty. What's in /proc/interrupts?

See below. The other machine is mostly identical, except for i8042
missing (probably due to running an older kernel, or small differences in
the kernel config).

Regards,

Igmar

[jdiict@lnx01 ~]$ cat /proc/interrupts
CPU0 CPU1
0: 73702693 74509271 IO-APIC-edge timer
1: 1 1 IO-APIC-edge i8042
4: 2289 8389 IO-APIC-edge serial
8: 0 1 IO-APIC-edge rtc
9: 0 0 IO-APIC-fasteoi acpi
12: 3 1 IO-APIC-edge i8042
16: 203127788 0 IO-APIC-fasteoi uhci_hcd:usb2, eth0
17: 525 492 IO-APIC-fasteoi uhci_hcd:usb4
18: 13000070 67584889 IO-APIC-fasteoi arcmsr
19: 0 0 IO-APIC-fasteoi ehci_hcd:usb1
20: 0 0 IO-APIC-fasteoi uhci_hcd:usb3
NMI: 0 0
LOC: 148127756 148133476
ERR: 0
MIS: 0
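
One rough way to approach the "is it missing interrupts" question from a listing like the one above is to take two /proc/interrupts snapshots some seconds apart and see whether the arcmsr line still advances. The sketch below is not from the original thread; `irq_totals` and `stalled_irqs` are hypothetical helper names, and the parser assumes the column layout shown above.

```python
def irq_totals(text):
    """Map each IRQ label to (sum of per-CPU counts, description)."""
    lines = text.strip().splitlines()
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:cpu_count]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        totals[label.strip()] = (sum(counts), description)
    return totals


def stalled_irqs(before, after, name):
    """Labels whose description mentions name and whose count did not move."""
    stalled = []
    for label, (count, description) in after.items():
        if name not in description or label not in before:
            continue
        if count == before[label][0]:
            stalled.append(label)
    return stalled
```

Feeding it the text of two reads of /proc/interrupts and the name "arcmsr" returns the IRQ labels that did not advance in between. A flat counter only points at lost interrupts if requests were outstanding during the interval; on an idle controller it is normal.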

2006-12-14 09:10:52

by Andrew Morton

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

On Thu, 14 Dec 2006 09:55:38 +0100 (CET)
Igmar Palsenberg <[email protected]> wrote:

>
> > > Hmm.. Switching CONFIG_HZ from 1000 to 250 seems to 'fix' the problem.
> > > I haven't seen the issue in nearly a week now. This makes Andrew's theory
> > > about missing interrupts very likely.
> > >
> > > Andrew / others : Is there a way to find out if it *is* missing
> > > interrupts ?
> > >
> >
> > umm, nasty. What's in /proc/interrupts?
>
> See below. The other machine is mostly identical, except for i8042
> missing (probably due to running an older kernel, or small differences in
> the kernel config).
>

Does the other machine have the same problems?

Are you able to rule out a hardware failure?

> [jdiict@lnx01 ~]$ cat /proc/interrupts
> CPU0 CPU1
> 0: 73702693 74509271 IO-APIC-edge timer
> 1: 1 1 IO-APIC-edge i8042
> 4: 2289 8389 IO-APIC-edge serial
> 8: 0 1 IO-APIC-edge rtc
> 9: 0 0 IO-APIC-fasteoi acpi
> 12: 3 1 IO-APIC-edge i8042
> 16: 203127788 0 IO-APIC-fasteoi uhci_hcd:usb2, eth0
> 17: 525 492 IO-APIC-fasteoi uhci_hcd:usb4
> 18: 13000070 67584889 IO-APIC-fasteoi arcmsr
> 19: 0 0 IO-APIC-fasteoi ehci_hcd:usb1
> 20: 0 0 IO-APIC-fasteoi uhci_hcd:usb3
> NMI: 0 0
> LOC: 148127756 148133476
> ERR: 0
> MIS: 0

The disk interrupt is unshared, which rules out a few software problems, I
guess.

2006-12-14 09:25:17

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

> > See below. The other machine is mostly identical, except for i8042
> > missing (probably due to running an older kernel, or small differences in
> > the kernel config).
> >
>
> Does the other machine have the same problems?

No, but that machine has a lot less disk and network activity.

> Are you able to rule out a hardware failure?

100% ? No, but the hardware is relatively new (about a year old), and of
good quality. It's hard to reproduce, so looking at it when it starts to
fault isn't possible either :(

> The disk interrupt is unshared, which rules out a few software problems, I
> guess.

Indeed. Bah, I hate these kind of things :(

Regards,

Igmar

2007-02-05 10:36:35

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

> Does the other machine have the same problems?

It does. It seems to depend on the interrupt frequency : Setting CONFIG_HZ=250
makes it only appear once a month or so; with CONFIG_HZ=1000, it will
occur within a week. It does happen a lot less on the other machine,
which isn't under as much disk activity load.

> Are you able to rule out a hardware failure?

Well.. It's too much of a coincidence that 2 (almost identical) machines show
the same weird behaviour. What strikes me is that only *disk* interrupts
stop being handled after a while. The machine itself is alive, just all
disk IO is blocked, which makes it pretty much useless.

Erich, could this be some sort of hardware problem ? I know it's a PITA to
reproduce, but setting CONFIG_HZ to 1000 and bashing the machine with
disk activity seems to help :)

Regards,

Igmar

2007-02-06 03:05:29

by erich

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

Dear Igmar Palsenberg,

I cannot be sure it is a hardware problem, but I am interested in
reproducing this case.
If you tell me your platform's configuration, I will try it and give you a good
solution.
Is your RAID adapter running firmware version 1.42?
The Areca firmware fixes some hardware bugs and rare sg length handling in this
version.
Best Regards
Erich Chen

2007-02-12 09:27:56

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

Hi,

> I cannot be sure it is a hardware problem, but I am interested in
> reproducing this case.
> If you tell me your platform's configuration, I will try it and give you a good
> solution.

The machines giving problems are almost identical when it comes to
hardware specs :

Intel SE7520BD2 mainboard (SE7520 chipset)
Dual Intel Xeon 2.8 GHz (other machine : Dual Xeon 3.2 GHz)
4 GB PC3200 ECC (400 MHz) Corsair (other machine : 2GB PC3200 ECC)

> Is your RAID adapter running firmware version 1.42?
> The Areca firmware fixes some hardware bugs and rare sg length handling in this
> version.

It's currently at 1.41. I'll see if I can upgrade it to 1.42. For now,
I've put all available stacktraces when it hung on
http://www.jdi-ict.nl/areca, together with a lspci -v -v and a copy of the
kernel's .config

Please let me know if you need anything else.

Regards,

Igmar

2007-02-19 13:26:09

by Igmar Palsenberg

Subject: Re: 2.6.16.32 stuck in generic_file_aio_write()

Hi,

> I cannot be sure it is a hardware problem, but I am interested in
> reproducing this case.
> If you tell me your platform's configuration, I will try it and give you a good
> solution.
> Is your RAID adapter running firmware version 1.42?
> The Areca firmware fixes some hardware bugs and rare sg length handling in this
> version.

I've hacked up the sysrq code so that it gives me another command : j ,
which dumps the current IRQ status on the console :

SysRq : Show IRQ status
......
Showing info for IRQ 14
status :
depth : 0
wake_depth : 0
irq_count : 38717
irqs_unhandled : 0

Showing info for IRQ 15
status : DISABLED
depth : 1
wake_depth : 0
irq_count : 22
irqs_unhandled : 0

which is the (incomplete) result on my machine after loading a module
that does disable_irq(15) on module load.
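
A dump in this format captured from the serial console could be scanned for suspect lines with a small script. The following is only a sketch and not part of the patch: it assumes the exact field names shown in the dump above (status, irqs_unhandled and so on), and `parse_irq_dump` and `suspicious_irqs` are hypothetical helper names.

```python
def parse_irq_dump(text):
    """Turn 'Showing info for IRQ n' blocks into {irq: {field: value}}."""
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Showing info for IRQ"):
            current = int(line.split()[-1])
            records[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            records[current][key.strip()] = value.strip()
    return records


def suspicious_irqs(records):
    """IRQs that are marked DISABLED or have unhandled interrupts."""
    flagged = []
    for irq, fields in sorted(records.items()):
        disabled = "DISABLED" in fields.get("status", "")
        unhandled = int(fields.get("irqs_unhandled", "0") or 0)
        if disabled or unhandled > 0:
            flagged.append(irq)
    return flagged
```

Run against the sample output above, `suspicious_irqs(parse_irq_dump(dump))` flags IRQ 15, the line that the test module disabled, and leaves IRQ 14 alone.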

I've put the patch at http://www.jdi-ict.nl/areca/sysrq-j.patch
I'll do a follow-up when anything useful comes out.

Regards,

Igmar