2004-10-26 05:38:10

by Jaakko Hyvätti

[permalink] [raw]
Subject: x86_64, LOCKUP on CPU0, kjournald


Last night we got this trace, saved by serial console.

The machine is dual AMD Opteron 240 (1.4GHz), MSI K8D motherboard, 3G
mem, disks in raid configuration with 3ware SATA adapter, so they show
up as scsi devices. Kernel is updated Fedora Core 2 2.6.8-1.521smp,
but the same probably happened with 2.6.6 at least, cannot tell as I
did not have serial console back then. Disks are shared over nfs to
a few very very busy clients, some running Linux, some running IRIX.


NMI Watchdog detected LOCKUP on CPU0, registers:
CPU 0
Modules linked in: loop w83627hf i2c_sensor i2c_isa i2c_core nfsd exportfs lockd sunrpc md5 ipv6 parport_pc lp parport tg3 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mod ohci_hcd button battery asus_acpi ac ext3 jbd 3w_xxxx sd_mod scsi_mod
Pid: 210, comm: kjournald Not tainted 2.6.8-1.521smp
RIP: 0010:[<ffffffff80161b7b>] <ffffffff80161b7b>{cache_alloc_refill+397}
RSP: 0018:00000100bf4efa58 EFLAGS: 00000013
RAX: 00000100bff39000 RBX: 0000000000000031 RCX: 00000100400116d8
RDX: 00000100400116c8 RSI: 0000000000000850 RDI: 0000010040011680
RBP: 000001003ffbf000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000010095e05e00 R12: 00000100400116c8
R13: 0000010040011680 R14: 0000000000000850 R15: 0000010031cdb9d0
FS: 0000002a958624c0(0000) GS:ffffffff804e2d80(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000a02b5c CR3: 0000000000101000 CR4: 00000000000006e0
Process kjournald (pid: 210, threadinfo 00000100bf4ee000, task 0000010037d201f0)
Stack: 000001007e657558 000001007e657558 00000100bdc356e0 000001007e657558
000001003ffb5e00 000001003c2ed180 000001007e657558 ffffffff80161eae
0000000000000202 ffffffff80180d98
Call Trace:<ffffffff80161eae>{kmem_cache_alloc+52} <ffffffff80180d98>{alloc_buffer_head+17}
<ffffffffa003b620>{:jbd:journal_write_metadata_buffer+130}
<ffffffffa0038064>{:jbd:journal_commit_transaction+2847}
<ffffffff801367e2>{autoremove_wake_function+0} <ffffffff801367e2>{autoremove_wake_function+0}
<ffffffffa003b0e3>{:jbd:kjournald+333} <ffffffff801367e2>{autoremove_wake_function+0}
<ffffffff801367e2>{autoremove_wake_function+0} <ffffffffa003af90>{:jbd:commit_timeout+0}
<ffffffff801121e3>{child_rip+8} <ffffffffa003af96>{:jbd:kjournald+0}
<ffffffff801121db>{child_rip+0}

Code: 4c 89 61 08 49 89 0c 24 85 db 0f 8f 2c ff ff ff 8b 45 00 49
console shuts up ...


And here the next boot so you see what is there in the machine:

Bootdata ok (command line is ro root=LABEL=/ console=ttyS0,19200 console=tty0)
Linux version 2.6.8-1.521smp ([email protected]) (gcc version 3.3.3 20040412 (Red Hat Linux 3.3.3-7)) #1 SMP Mon Aug 16 09:32:47 EDT 2004
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000bfff0000 (usable)
BIOS-e820: 00000000bfff0000 - 00000000bffff000 (ACPI data)
BIOS-e820: 00000000bffff000 - 00000000c0000000 (ACPI NVS)
BIOS-e820: 00000000ff7c0000 - 0000000100000000 (reserved)
Scanning NUMA topology in Northbridge 24
Number of nodes 2 (10010)
Node 0 MemBase 0000000000000000 Limit 000000003fffffff
Node 1 MemBase 0000000040000000 Limit 00000000bfff0000
Using node hash shift of 24
Bootmem setup node 0 0000000000000000-000000003fffffff
Bootmem setup node 1 0000000040000000-00000000bfff0000
No mptable found.
ACPI: RSDP (v000 ACPIAM ) @ 0x00000000000f4530
ACPI: RSDT (v001 A M I OEMRSDT 0x07000304 MSFT 0x00000097) @ 0x00000000bfff0000
ACPI: FADT (v001 A M I OEMFACP 0x07000304 MSFT 0x00000097) @ 0x00000000bfff0200
ACPI: MADT (v001 A M I OEMAPIC 0x07000304 MSFT 0x00000097) @ 0x00000000bfff0380
ACPI: SPCR (v001 A M I OEMSPCR 0x07000304 MSFT 0x00000097) @ 0x00000000bfff0400
ACPI: OEMB (v001 A M I OEMBIOS 0x07000304 MSFT 0x00000097) @ 0x00000000bffff040
ACPI: ASF! (v001 AMIASF AMDSTRET 0x00000001 INTL 0x02002026) @ 0x00000000bfff2d90
ACPI: DSDT (v001 0ABCF 0ABCF008 0x00000008 INTL 0x02002026) @ 0x0000000000000000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 15:5 APIC version 16
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: Assigned apic_id 2
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xfebfe000] gsi_base[24])
IOAPIC[1]: Assigned apic_id 3
IOAPIC[1]: apic_id 3, version 17, address 0xfebfe000, GSI 24-27
ACPI: IOAPIC (id[0x04] address[0xfebff000] gsi_base[28])
IOAPIC[2]: Assigned apic_id 4
IOAPIC[2]: apic_id 4, version 17, address 0xfebff000, GSI 28-31
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
Using ACPI (MADT) for SMP configuration information
Checking aperture...
CPU 0: aperture @ ffd4000000 size 32 MB
Aperture from northbridge cpu 0 too small (32 MB)
No AGP bridge found
Built 2 zonelists
Kernel command line: ro root=LABEL=/ console=ttyS0,19200 console=tty0
Initializing CPU#0
PID hash table entries: 1024 (order 10: 16384 bytes)
time.c: Using 1.193182 MHz PIT timer.
time.c: Detected 1395.673 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 65536 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
Memory: 3095016k/3145664k available (2161k kernel code, 0k reserved, 1409k data, 176k init)
Calibrating delay loop... 2744.32 BogoMIPS
Security Scaffold v1.0.0 initialized
SELinux: Initializing.
SELinux: Starting in permissive mode
There is already a security framework initialized, register_security failed.
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256 (order: 0, 4096 bytes)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
Using local APIC NMI watchdog using perfctr0
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU0: AMD Opteron(tm) Processor 240 stepping 01
per-CPU timeslice cutoff: 1023.77 usecs.
task migration cache decay timeout: 2 msecs.
Booting processor 1/1 rip 6000 rsp 10041d67f58
Initializing CPU#1
Calibrating delay loop... 2785.28 BogoMIPS
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
AMD Opteron(tm) Processor 240 stepping 01
Total of 2 processors activated (5529.60 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 pin1=2 pin2=-1
Using local APIC timer interrupts.
Detected 12.461 MHz APIC timer.
checking TSC synchronization across 2 CPUs: passed.
time.c: Using PIT/TSC based timekeeping.
Brought up 2 CPUs
checking if image is initramfs...it isn't (no cpio magic); looks like an initrd
NET: Registered protocol family 16
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
ACPI: Subsystem revision 20040326
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 *5 6 7 9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 *9 10 11 12 14 15)
ACPI: Power Resource [GFAN] (on)
ACPI: Power Resource [LFAN] (on)
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
ACPI: PCI interrupt 0000:00:07.2[D] -> GSI 19 (level, low) -> IRQ 169
ACPI: PCI interrupt 0000:03:00.0[D] -> GSI 19 (level, low) -> IRQ 169
ACPI: PCI interrupt 0000:03:00.1[D] -> GSI 19 (level, low) -> IRQ 169
ACPI: PCI interrupt 0000:03:06.0[A] -> GSI 18 (level, low) -> IRQ 177
ACPI: PCI interrupt 0000:01:01.0[A] -> GSI 28 (level, low) -> IRQ 185
ACPI: PCI interrupt 0000:01:02.0[A] -> GSI 29 (level, low) -> IRQ 193
ACPI: PCI interrupt 0000:01:02.1[B] -> GSI 30 (level, low) -> IRQ 201
testing the IO APIC.......................



Using vector-based indexing
.................................... done.
PCI-DMA: Disabling IOMMU.
vesafb: probe of vesafb0 failed with error -6
IA32 emulation $Id: sys_ia32.c,v 1.32 2002/03/24 13:02:28 ak Exp $
audit: initializing netlink socket (disabled)
audit(1098720917.956:0): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
SELinux: Registering netfilter hooks
Initializing Cryptographic API
ksign: Installing public key data
Loading keyring
- Added public key 9BD64214C19BFDB0
- User ID: Red Hat, Inc. (Kernel Module GPG key)
ksign: invalid packet (ctb=00)
Unable to load default keyring: error=74
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
ACPI: Fan [FN00] (on)
ACPI: Fan [FN01] (on)
ACPI: Processor [CPU1] (supports C1, 8 throttling states)
ACPI: Processor [CPU2] (supports C1)
ACPI: Thermal Zone [THRM] (45 C)
Real Time Clock Driver v1.12
Linux agpgart interface v0.100 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing enabled
?ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
AMD8111: IDE controller at PCI slot 0000:00:07.1
AMD8111: chipset revision 3
AMD8111: not 100% native mode: will probe irqs later
AMD8111: 0000:00:07.1 (rev 03) UDMA133 controller
ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:pio, hdb:pio
ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:pio, hdd:pio
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.0:USB HID core driver
mice: PS/2 mouse device common for all mice
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
input: AT Translated Set 2 keyboard on isa0060/serio0
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
NET: Registered protocol family 2
IP: routing cache hash table of 16384 buckets, 256Kbytes
TCP: Hash tables configured (established 262144 bind 65536)
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
powernow-k8: Power state transitions not supported
powernow-k8: Power state transitions not supported
ACPI: (supports S0 S1 S4 S5)
BIOS EDD facility v0.16 2004-Jun-25, 3 devices found
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
RAMDISK: Compressed image found at block 0
VFS: Mounted root (ext2 filesystem).
SCSI subsystem initialized
3ware Storage Controller device driver for Linux v1.26.00.039.
ACPI: PCI interrupt 0000:01:01.0[A] -> GSI 28 (level, low) -> IRQ 185
scsi0 : Found a 3ware Storage Controller at 0x8c00, IRQ: 185, P-chip: 1.3
scsi0 : 3ware Storage Controller
3w-xxxx: scsi0: AEN: WARNING: Unclean shutdown detected: Unit #0.
3w-xxxx: scsi0: AEN: WARNING: Unclean shutdown detected: Unit #2.
3w-xxxx: scsi0: AEN: WARNING: Unclean shutdown detected: Unit #4.
Using cfq io scheduler
Vendor: 3ware Model: Logical Disk 0 Rev: 1.2
Type: Direct-Access ANSI SCSI revision: 00
SCSI device sda: 145224064 512-byte hdwr sectors (74355 MB)
SCSI device sda: drive cache: write back
sda: sda1 sda2 sda3
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Vendor: 3ware Model: Logical Disk 2 Rev: 1.2
Type: Direct-Access ANSI SCSI revision: 00
SCSI device sdb: 145224064 512-byte hdwr sectors (74355 MB)
SCSI device sdb: drive cache: write back
sdb: sdb1
Attached scsi disk sdb at scsi0, channel 0, id 2, lun 0
Vendor: 3ware Model: Logical Disk 4 Rev: 1.2
Type: Direct-Access ANSI SCSI revision: 00
SCSI device sdc: 145224064 512-byte hdwr sectors (74355 MB)
SCSI device sdc: drive cache: write back
sdc: sdc1
Attached scsi disk sdc at scsi0, channel 0, id 4, lun 0
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting. Commit interval 5 seconds
EXT3-fs: sda3: orphan cleanup on readonly fs
EXT3-fs: sda3: 8 orphan inodes deleted
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
Freeing unused kernel memory: 176k freed
SELinux: Disabled at runtime.
SELinux: Unregistering netfilter hooks
3w-xxxx: scsi0: AEN: INFO: Initialization started: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Initialization complete: Unit #0.
3w-xxxx: scsi0: AEN: INFO: Initialization started: Unit #2.
3w-xxxx: scsi0: AEN: INFO: Initialization complete: Unit #2.
3w-xxxx: scsi0: AEN: INFO: Initialization started: Unit #4.
3w-xxxx: scsi0: AEN: INFO: Initialization complete: Unit #4.



--
[email protected] http://www.iki.fi/hyvatti/ +358 40 5011222
echo 'movl $36,%eax;int $128;movl $0,%ebx;movl $1,%eax;int $128'|as -o/bin/sync


2004-11-01 09:26:22

by Jaakko Hyvätti

[permalink] [raw]
Subject: ext3 and nfsd do not work under load (Re: x86_64, LOCKUP on CPU0, kjournald)


Here is another oops and lockup, with nfsd now there in the trace also:

Unable to handle kernel paging request at ffffffff00000808 RIP:
<ffffffff80161b37>{cache_alloc_refill+329}
PML4 103027 PGD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: w83627hf i2c_sensor i2c_isa i2c_core nfsd exportfs lockd sunrpc md5 ipv6 parport_pc lp parport tg3 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mod ohci_hcd button battery asus_acpi ac ext3 jbd 3w_xxxx sd_mod scsi_mod
Pid: 1968, comm: nfsd Not tainted 2.6.8-1.521smp
RIP: 0010:[<ffffffff80161b37>] <ffffffff80161b37>{cache_alloc_refill+329}
RSP: 0018:000001003ad8d728 EFLAGS: 00010046
RAX: ffffffff00000800 RBX: 00000000ffffffff RCX: 0000010001000000
RDX: 00000100400112c8 RSI: 00000000000000d0 RDI: 0000010040011280
RBP: 000001003ffb7000 R08: a322eb30453b40eb R09: 0000000000000008
R10: 000001004d6bdb00 R11: 000001004d6bdb00 R12: 00000100400112c8
R13: 0000010040011280 R14: 00000000000000d0 R15: 0000000000000001
FS: 0000002a958624c0(0000) GS:ffffffff804e2d80(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffff00000808 CR3: 0000000000101000 CR4: 00000000000006e0
Process nfsd (pid: 1968, threadinfo 000001003ad8c000, task 00000100be7e5550)
Stack: 0000010076ba4388 0000010076ba4388 fffffffffffffff4 0000010076ba4388
000001003ad8d7e8 0000000000000000 000001009c01a188 ffffffff80161eae
0000000000000283 ffffffff80195755
Call Trace:<ffffffff80161eae>{kmem_cache_alloc+52} <ffffffff80195755>{d_alloc+196}
<ffffffff8018a272>{cached_lookup+40} <ffffffff8018bc42>{__lookup_hash+117}
<ffffffff8018bcdc>{lookup_one_len+87} <ffffffffa0173c4e>{:nfsd:compose_entry_fh+279}
<ffffffffa0173e6f>{:nfsd:encode_entry+460} <ffffffff80318579>{thread_return+41}
<ffffffff80248b5c>{__generic_unplug_device+49} <ffffffffa0174192>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffffa017419b>{:nfsd:nfs3svc_encode_entry_plus+9}
<ffffffffa0048aa4>{:ext3:call_filldir+137} <ffffffffa0174192>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffffa0048c0d>{:ext3:ext3_dx_readdir+331} <ffffffffa0174192>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffffa00484e2>{:ext3:ext3_readdir+131} <ffffffffa0174192>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffffa0168679>{:nfsd:fh_verify+1324} <ffffffff80131f3b>{recalc_task_prio+332}
<ffffffffa0174192>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffff80190053>{vfs_readdir+123} <ffffffffa0174192>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffffa016bd8d>{:nfsd:nfsd_readdir+104} <ffffffffa017149a>{:nfsd:nfsd3_proc_readdirplus+245}
<ffffffffa01668c2>{:nfsd:nfsd_dispatch+226} <ffffffffa012fb83>{:sunrpc:svc_process+931}
<ffffffffa01665b7>{:nfsd:nfsd+750} <ffffffffa01662c9>{:nfsd:nfsd+0}
<ffffffff801121e3>{child_rip+8} <ffffffffa01662c9>{:nfsd:nfsd+0}
<ffffffffa01662c9>{:nfsd:nfsd+0} <ffffffff801121db>{child_rip+0}


Code: 48 89 50 08 48 89 02 48 c7 41 08 00 02 20 00 66 83 79 24 ff
RIP <ffffffff80161b37>{cache_alloc_refill+329} RSP <000001003ad8d728>
CR2: ffffffff00000808
NMI Watchdog detected LOCKUP on CPU0, registers:
CPU 0
Modules linked in: w83627hf i2c_sensor i2c_isa i2c_core nfsd exportfs lockd sunrpc md5 ipv6 parport_pc lp parport tg3 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mod ohci_hcd button battery asus_acpi ac ext3 jbd 3w_xxxx sd_mod scsi_mod
Pid: 1968, comm: nfsd Not tainted 2.6.8-1.521smp
RIP: 0010:[<ffffffff80163549>] <ffffffff80163549>{.text.lock.slab+338}
RSP: 0018:ffffffff80480b38 EFLAGS: 00000086
RAX: 000000010fe430f5 RBX: 0000010040011280 RCX: 000000000000001e
RDX: 000000000000001d RSI: 000001003ffb7000 RDI: 0000010040011280
RBP: 0000010040011370 R08: 000001003ffac010 R09: 00000100bfc34d80
R10: 0000010040011280 R11: 0000010040011280 R12: 00000100bfc34e28
R13: 0000000000000002 R14: ffffffff804af7e0 R15: 0000000000000000
FS: 0000002a958624c0(0000) GS:ffffffff804e2d80(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffff00000808 CR3: 0000000000101000 CR4: 00000000000006e0
Process nfsd (pid: 1968, threadinfo 000001003ad8c000, task 00000100be7e5550)
Stack: 0000010001e0fb60 0000000000000000 ffffffff80480b98 0000000000000292
0000000000000000 0000010001e13020 ffffffff80480bb8 000001003ad8d678
0000000000000002 ffffffff80162bcd
Call Trace:<IRQ> <ffffffff80162bcd>{reap_timer_fnc+31} <ffffffff80141c08>{run_timer_softirq+427}
<ffffffff8013d965>{__do_softirq+65} <ffffffff8013d970>{__do_softirq+76}
<ffffffff8013d9fd>{do_softirq+49} <ffffffff80111e91>{apic_timer_interrupt+133}
<EOI> <ffffffff80112d92>{oops_end+80} <ffffffff80112d54>{oops_end+18}
<ffffffff80123f80>{do_page_fault+1008} <ffffffff8017efb4>{__getblk+17}
<ffffffffa004afb1>{:ext3:ext3_getblk+195} <ffffffff8017eed8>{bh_lru_install+221}
<ffffffff8011202d>{error_exit+0} <ffffffff80161b37>{cache_alloc_refill+329}
<ffffffff80195a25>{d_instantiate+202} <ffffffff80161eae>{kmem_cache_alloc+52}
<ffffffff80195755>{d_alloc+196} <ffffffff8018a272>{cached_lookup+40}
<ffffffff8018bc42>{__lookup_hash+117} <ffffffff8018bcdc>{lookup_one_len+87}
<ffffffffa0173c4e>{:nfsd:compose_entry_fh+279} <ffffffffa0173e6f>{:nfsd:encode_entry+460}
<ffffffff80318579>{thread_return+41} <ffffffff80248b5c>{__generic_unplug_device+49}
<ffffffffa0174192>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffffa017419b>{:nfsd:nfs3svc_encode_entry_plus+9}
<ffffffffa0048aa4>{:ext3:call_filldir+137} <ffffffffa0174192>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffffa0048c0d>{:ext3:ext3_dx_readdir+331} <ffffffffa0174192>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffffa00484e2>{:ext3:ext3_readdir+131} <ffffffffa0174192>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffffa0168679>{:nfsd:fh_verify+1324} <ffffffff80131f3b>{recalc_task_prio+332}
<ffffffffa0174192>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffff80190053>{vfs_readdir+123} <ffffffffa0174192>{:nfsd:nfs3svc_encode_entry_plus+0}
<ffffffffa016bd8d>{:nfsd:nfsd_readdir+104} <ffffffffa017149a>{:nfsd:nfsd3_proc_readdirplus+245}
<ffffffffa01668c2>{:nfsd:nfsd_dispatch+226} <ffffffffa012fb83>{:sunrpc:svc_process+931}
<ffffffffa01665b7>{:nfsd:nfsd+750} <ffffffffa01662c9>{:nfsd:nfsd+0}
<ffffffff801121e3>{child_rip+8} <ffffffffa01662c9>{:nfsd:nfsd+0}
<ffffffffa01662c9>{:nfsd:nfsd+0} <ffffffff801121db>{child_rip+0}


Code: 80 7d b8 00 7e f8 e9 b4 f4 ff ff f3 90 80 bb a8 00 00 00 00
console shuts up ...


On Tue, 26 Oct 2004, Jaakko Hyv?tti wrote:
> Last night we got this trace, saved by serial console.
>
> The machine is dual AMD Opteron 240 (1.4GHz), MSI K8D motherboard, 3G
> mem, disks in raid configuration with 3ware SATA adapter, so they show
> up as scsi devices. Kernel is updated Fedora Core 2 2.6.8-1.521smp,
> but the same probably happened with 2.6.6 at least, cannot tell as I
> did not have serial console back then. Disks are shared over nfs to
> a few very very busy clients, some running Linux, some running IRIX.
>
>
> NMI Watchdog detected LOCKUP on CPU0, registers:
> CPU 0
> Modules linked in: loop w83627hf i2c_sensor i2c_isa i2c_core nfsd exportfs lockd sunrpc md5 ipv6 parport_pc lp parport tg3 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mod ohci_hcd button battery asus_acpi ac ext3 jbd 3w_xxxx sd_mod scsi_mod
> Pid: 210, comm: kjournald Not tainted 2.6.8-1.521smp
> RIP: 0010:[<ffffffff80161b7b>] <ffffffff80161b7b>{cache_alloc_refill+397}
> RSP: 0018:00000100bf4efa58 EFLAGS: 00000013
> RAX: 00000100bff39000 RBX: 0000000000000031 RCX: 00000100400116d8
> RDX: 00000100400116c8 RSI: 0000000000000850 RDI: 0000010040011680
> RBP: 000001003ffbf000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000010095e05e00 R12: 00000100400116c8
> R13: 0000010040011680 R14: 0000000000000850 R15: 0000010031cdb9d0
> FS: 0000002a958624c0(0000) GS:ffffffff804e2d80(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000a02b5c CR3: 0000000000101000 CR4: 00000000000006e0
> Process kjournald (pid: 210, threadinfo 00000100bf4ee000, task 0000010037d201f0)
> Stack: 000001007e657558 000001007e657558 00000100bdc356e0 000001007e657558
> 000001003ffb5e00 000001003c2ed180 000001007e657558 ffffffff80161eae
> 0000000000000202 ffffffff80180d98
> Call Trace:<ffffffff80161eae>{kmem_cache_alloc+52} <ffffffff80180d98>{alloc_buffer_head+17}
> <ffffffffa003b620>{:jbd:journal_write_metadata_buffer+130}
> <ffffffffa0038064>{:jbd:journal_commit_transaction+2847}
> <ffffffff801367e2>{autoremove_wake_function+0} <ffffffff801367e2>{autoremove_wake_function+0}
> <ffffffffa003b0e3>{:jbd:kjournald+333} <ffffffff801367e2>{autoremove_wake_function+0}
> <ffffffff801367e2>{autoremove_wake_function+0} <ffffffffa003af90>{:jbd:commit_timeout+0}
> <ffffffff801121e3>{child_rip+8} <ffffffffa003af96>{:jbd:kjournald+0}
> <ffffffff801121db>{child_rip+0}
>
> Code: 4c 89 61 08 49 89 0c 24 85 db 0f 8f 2c ff ff ff 8b 45 00 49
> console shuts up ...

--
Foreca Ltd [email protected]
Pursimiehenkatu 29-31 B, FIN-00150 Helsinki, Finland http://www.foreca.com

2004-11-01 09:34:00

by Andrew Morton

[permalink] [raw]
Subject: Re: ext3 and nfsd do not work under load (Re: x86_64, LOCKUP on CPU0, kjournald)

Jaakko Hyv?tti <[email protected]> wrote:
>
>
> Here is another oops and lockup, with nfsd now there in the trace also:
>
> Unable to handle kernel paging request at ffffffff00000808 RIP:
> <ffffffff80161b37>{cache_alloc_refill+329}
> PML4 103027 PGD 0
> Oops: 0002 [1] SMP
> CPU 0
> Modules linked in: w83627hf i2c_sensor i2c_isa i2c_core nfsd exportfs lockd sunrpc md5 ipv6 parport_pc lp parport tg3 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mod ohci_hcd button battery asus_acpi ac ext3 jbd 3w_xxxx sd_mod scsi_mod
> Pid: 1968, comm: nfsd Not tainted 2.6.8-1.521smp

That's a vendor kernel of some form, yes?

> RIP: 0010:[<ffffffff80161b37>] <ffffffff80161b37>{cache_alloc_refill+329}

I'd suggest that you either recompile this vendor kernel with
CONFIG_DEBUG_SLAB=y or try a later kernel.org kernel and see if it's fixed.
Or look for an updated kernel from your vendor. And check the vendor's
bug tracking system for other reports of this problem.

I don't recall having seen anyone else report this crash.

2004-11-01 10:31:29

by Jaakko Hyvätti

[permalink] [raw]
Subject: Re: ext3 and nfsd do not work under load (Re: x86_64, LOCKUP on CPU0, kjournald)

On Mon, 1 Nov 2004, Andrew Morton wrote:
> Jaakko Hyv?tti <[email protected]> wrote:
> >
> >
> > Here is another oops and lockup, with nfsd now there in the trace also:
> >
> > Unable to handle kernel paging request at ffffffff00000808 RIP:
> > <ffffffff80161b37>{cache_alloc_refill+329}
> > PML4 103027 PGD 0
> > Oops: 0002 [1] SMP
> > CPU 0
> > Modules linked in: w83627hf i2c_sensor i2c_isa i2c_core nfsd exportfs lockd sunrpc md5 ipv6 parport_pc lp parport tg3 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mod ohci_hcd button battery asus_acpi ac ext3 jbd 3w_xxxx sd_mod scsi_mod
> > Pid: 1968, comm: nfsd Not tainted 2.6.8-1.521smp
>
> That's a vendor kernel of some form, yes?

Fedora core 2 latest, I assumed it is very close to kernel.org or even
the same kernel. But I'll try 2.6.9 now. It might take a week to oops
though.

Jaakko

--
Foreca Ltd [email protected]
Pursimiehenkatu 29-31 B, FIN-00150 Helsinki, Finland http://www.foreca.com

2004-11-05 09:49:42

by Jaakko Hyvätti

[permalink] [raw]
Subject: Re: ext3 and nfsd do not work under load (Re: x86_64, LOCKUP on CPU0, kjournald)

On Mon, 1 Nov 2004, Andrew Morton wrote:
> I'd suggest that you either recompile this vendor kernel with
> CONFIG_DEBUG_SLAB=y or try a later kernel.org kernel and see if it's fixed.

A funny detail. CONFIG_DEBUG_SLAB=y makes kernel.org 2.6.9 refuse
rlogin and telnet connections, I suppose because of problems with pty
allocations. Cannot find any debug anywhere about this. Ssh and console
logins work. So I cannot enable CONFIG_DEBUG_SLAB. Will proceed with
2.6.9 though. I'll report when it hangs if it hangs.

Jaakko

--
Foreca Ltd [email protected]
Pursimiehenkatu 29-31 B, FIN-00150 Helsinki, Finland http://www.foreca.com

2004-11-05 10:07:22

by Nigel Cunningham

[permalink] [raw]
Subject: Re: ext3 and nfsd do not work under load (Re: x86_64, LOCKUP on CPU0, kjournald)

Hi.

On Mon, 2004-11-01 at 20:31, Andrew Morton wrote:
> Jaakko Hyv?tti <[email protected]> wrote:
> >
> >
> > Here is another oops and lockup, with nfsd now there in the trace also:
> >
> > Unable to handle kernel paging request at ffffffff00000808 RIP:
> > <ffffffff80161b37>{cache_alloc_refill+329}
> > PML4 103027 PGD 0
> > Oops: 0002 [1] SMP
> > CPU 0
> > Modules linked in: w83627hf i2c_sensor i2c_isa i2c_core nfsd exportfs lockd sunrpc md5 ipv6 parport_pc lp parport tg3 ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mod ohci_hcd button battery asus_acpi ac ext3 jbd 3w_xxxx sd_mod scsi_mod
> > Pid: 1968, comm: nfsd Not tainted 2.6.8-1.521smp
>
> That's a vendor kernel of some form, yes?
>
> > RIP: 0010:[<ffffffff80161b37>] <ffffffff80161b37>{cache_alloc_refill+329}
>

I've seen similar things here. Last night on 2.6.9 + kdb, win4lin,
bootsplash & suspend2 (no cycles done), I could lock up my desktop
machine by doing

cat /dev/hda > /dev/null

Then

cat /proc/meminfo

a few times. I was going to stress test suspend, but never got that far
:>

Regards,

Nigel
--
Nigel Cunningham
Pastoral Worker
Christian Reformed Church of Tuggeranong
PO Box 1004, Tuggeranong, ACT 2901

You see, at just the right time, when we were still powerless, Christ
died for the ungodly. -- Romans 5:6