2006-09-18 13:12:47

by kautzy

Subject: Dual Core Opteron hangs, iommu Entries (x86_64)

Since this is my first post on this list, I would like to say hello to
everyone!

I am experiencing problems with a server with two dual-core Opterons.
Every 5-7 days the system hangs. While it still responds to pings, it
does not react to console input, and I can't log in via ssh either. When
that happens, the only thing one can do is reset the machine. No errors
are logged.

I have checked the memory for errors, and it appears to be OK.

I found a post on this list describing a problem which looks similar to
mine:

http://www.gatago.com/linux/kernel/13699679.html

As mentioned in the above post, dmesg on my server also shows the
following entries:

Allocating PCI resources starting at fb800000 (gap: fb000000:4780000)
Checking aperture...
CPU 0: aperture @ cc24000000 size 32 MB
Aperture from northbridge cpu 0 too small (32 MB)
No AGP bridge found
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ 8000000
Built 1 zonelists

Could those entries have anything to do with the system hangs, and if so,
could booting with iommu=memaper=3 help to solve the problem?
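
For reference, a rough sketch of what a given memaper order would ask for. It assumes the rule described in the kernel's x86_64 boot-options documentation, that iommu=memaper=<order> maps an aperture of 32 MB shifted left by the order; the function name is made up for illustration. This matches the 64 MB (65536 KB) aperture in the dmesg lines above if the default order is 1.

```python
# Assumption: iommu=memaper=<order> sizes the RAM-backed aperture as
# 32 MB << order (per the x86_64 boot-options documentation).
BASE_APERTURE_MB = 32


def memaper_size_mb(order: int) -> int:
    """Return the aperture size in MB requested by a memaper order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return BASE_APERTURE_MB << order


if __name__ == "__main__":
    # Print the size for the first few orders, including the order 3 asked about.
    for order in range(4):
        print(f"iommu=memaper={order} -> {memaper_size_mb(order)} MB")
```

Under that assumption, memaper=3 would take 256 MB of RAM for the aperture instead of the current 64 MB.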

I am running kernel 2.6.17.11 on Debian sarge amd64; the system has 6GB RAM.

I appreciate any suggestions :)



chris

The full output of dmesg:

Bootdata ok (command line is root=/dev/sda8 ro console=tty0 )
Linux version 2.6.17.11-mli1-opteron-v2 (root@mli1) (gcc version 3.3.5
(Debian 1:3.3.5-13)) #1 SMP Mon Sep 11 12:29:02 CEST 2006
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000faff0000 (usable)
BIOS-e820: 00000000faff0000 - 00000000fafff000 (ACPI data)
BIOS-e820: 00000000fafff000 - 00000000fb000000 (ACPI NVS)
BIOS-e820: 00000000ff780000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000180000000 (usable)
DMI 2.3 present.
On node 0 totalpages: 1529283
DMA zone: 2459 pages, LIFO batch:0
DMA32 zone: 1009704 pages, LIFO batch:31
Normal zone: 517120 pages, LIFO batch:31
Intel MultiProcessor Specification v1.1
Virtual Wire compatibility mode.
OEM ID: TYAN Product ID: S2882 APIC at: 0xFEE00000
Processor #0 15:1 APIC version 16
Processor #1 15:1 APIC version 16
Processor #2 15:1 APIC version 16
Processor #3 15:1 APIC version 16
I/O APIC #4 Version 17 at 0xFEC00000.
I/O APIC #5 Version 17 at 0xFEBFF000.
I/O APIC #6 Version 17 at 0xFEBFE000.
Setting APIC routing to flat
Processors: 4
Allocating PCI resources starting at fb800000 (gap: fb000000:4780000)
Checking aperture...
CPU 0: aperture @ cc24000000 size 32 MB
Aperture from northbridge cpu 0 too small (32 MB)
No AGP bridge found
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ 8000000
Built 1 zonelists
Kernel command line: root=/dev/sda8 ro console=tty0
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
time.c: Using 1.193182 MHz WALL PIT GTOD PIT/TSC timer.
time.c: Detected 2190.816 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
Memory: 6038612k/6291456k available (3002k kernel code, 170092k
reserved, 1269k data, 168k init)
Calibrating delay using timer specific routine.. 4390.66 BogoMIPS
(lpj=8781339)
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
Using IO-APIC 4
Using IO-APIC 5
Using IO-APIC 6
GSI 18 sharing vector 0x89 and IRQ 18
GSI 19 sharing vector 0x91 and IRQ 19
GSI 24 sharing vector 0x99 and IRQ 24
GSI 25 sharing vector 0xA1 and IRQ 25
GSI 29 sharing vector 0xA9 and IRQ 29
Using local APIC timer interrupts.
result 12447820
Detected 12.447 MHz APIC timer.
Booting processor 1/4 APIC 0x1
Initializing CPU#1
Calibrating delay using timer specific routine.. 4381.80 BogoMIPS
(lpj=8763613)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
Dual Core AMD Opteron(tm) Processor 275 stepping 02
CPU 1: Syncing TSC to CPU 0.
CPU 1: synchronized TSC with CPU 0 (last diff 6 cycles, maxerr 627 cycles)
Booting processor 2/4 APIC 0x2
Initializing CPU#2
Calibrating delay using timer specific routine.. 4381.88 BogoMIPS
(lpj=8763771)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
Dual Core AMD Opteron(tm) Processor 275 stepping 02
CPU 2: Syncing TSC to CPU 0.
CPU 2: synchronized TSC with CPU 0 (last diff 1 cycles, maxerr 876 cycles)
Booting processor 3/4 APIC 0x3
Initializing CPU#3
Calibrating delay using timer specific routine.. 4381.92 BogoMIPS
(lpj=8763852)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
Dual Core AMD Opteron(tm) Processor 275 stepping 02
CPU 3: Syncing TSC to CPU 0.
CPU 3: synchronized TSC with CPU 0 (last diff 7 cycles, maxerr 864 cycles)
Brought up 4 CPUs
testing NMI watchdog ... OK.
migration_cost=460
NET: Registered protocol family 16
PCI: Using configuration type 1
SCSI subsystem initialized
PCI: Probing PCI hardware
PCI: Probing PCI hardware (bus 00)
Boot video device is 0000:03:06.0
PCI: Using IRQ router default [1022/746b] at 0000:00:07.3
PCI->APIC IRQ transform: 0000:00:07.2[D] -> IRQ 19
PCI->APIC IRQ transform: 0000:03:06.0[A] -> IRQ 18
PCI->APIC IRQ transform: 0000:03:08.0[A] -> IRQ 18
PCI->APIC IRQ transform: 0000:02:09.0[A] -> IRQ 24
PCI->APIC IRQ transform: 0000:02:09.1[B] -> IRQ 25
PCI->APIC IRQ transform: 0000:01:04.0[A] -> IRQ 29
PCI-DMA: Disabling AGP.
PCI-DMA: aperture base @ 8000000 size 65536 KB
PCI-DMA: using GART IOMMU.
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
PCI: Bridge: 0000:00:06.0
IO window: 9000-bfff
MEM window: fca00000-feafffff
PREFETCH window: disabled.
PCI: Bridge: 0000:00:0a.0
IO window: disabled.
MEM window: fc900000-fc9fffff
PREFETCH window: fc600000-fc6fffff
PCI: Bridge: 0000:00:0b.0
IO window: 8000-8fff
MEM window: fc800000-fc8fffff
PREFETCH window: fb500000-fc5fffff
NET: Registered protocol family 2
IP route cache hash table entries: 262144 (order: 9, 2097152 bytes)
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
TCP: Hash tables configured (established 262144 bind 65536)
TCP reno registered
IA32 emulation $Id: sys_ia32.c,v 1.32 2002/03/24 13:02:28 ak Exp $
Installing knfsd (copyright (C) 1996 [email protected]).
Initializing Cryptographic API
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered (default)
io scheduler cfq registered
PCI: MSI quirk detected. PCI_BUS_FLAGS_NO_MSI set for subordinate bus.
PCI: MSI quirk detected. PCI_BUS_FLAGS_NO_MSI set for subordinate bus.
Real Time Clock Driver v1.12ac
Linux agpgart interface v0.101 (c) Dave Jones
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
loop: loaded (max 8 devices)
Intel(R) PRO/1000 Network Driver - version 7.0.33-k2
Copyright (c) 1999-2005 Intel Corporation.
eepro100.c:v1.09j-t 9/29/99 Donald Becker
http://www.scyld.com/network/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin
<[email protected]> and others
eth0: 0000:03:08.0, 00:E0:81:32:F6:36, IRQ 18.
Board assembly 567812-052, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0xd0a6c714).
e100: Intel(R) PRO/100 Network Driver, 3.5.10-k2-NAPI
e100: Copyright(c) 1999-2005 Intel Corporation
tg3.c:v3.59 (June 8, 2006)
eth1: Tigon3 [partno(BCM95704A7) rev 2003 PHY(5704)]
(PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:e0:81:32:f7:ac
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1]
TSOcap[1]
eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
eth2: Tigon3 [partno(BCM95704A7) rev 2003 PHY(5704)]
(PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:e0:81:32:f7:ad
eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1]
TSOcap[1]
eth2: dma_rwctrl[769f4000] dma_mask[64-bit]
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
3ware 9000 Storage Controller device driver for Linux v2.26.02.007.
3w-9xxx: scsi0: AEN: INFO (0x04:0x0055): Battery charging started:.
3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): Battery capacity test is overdue:.
scsi0 : 3ware 9000 Storage Controller
3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfc8ffc00,
IRQ: 29.
3w-9xxx: scsi0: Firmware FE9X 2.08.00.005, BIOS BE9X 2.03.01.052, Ports: 8.
Vendor: AMCC Model: 9500S-8 DISK Rev: 2.08
Type: Direct-Access ANSI SCSI revision: 03
SCSI device sda: 956884992 512-byte hdwr sectors (489925 MB)
sda: Write Protect is off
sda: Mode Sense: 23 00 00 00
SCSI device sda: drive cache: write back, no read (daft)
SCSI device sda: 956884992 512-byte hdwr sectors (489925 MB)
sda: Write Protect is off
sda: Mode Sense: 23 00 00 00
SCSI device sda: drive cache: write back, no read (daft)
sda: sda1 < sda5 sda6 sda7 sda8 sda9 sda10 > sda2 sda3
sd 0:0:0:0: Attached scsi disk sda
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
mice: PS/2 mouse device common for all mice
TCP bic registered
NET: Registered protocol family 1
NET: Registered protocol family 10
IPv6 over IPv4 tunneling driver
NET: Registered protocol family 17
NET: Registered protocol family 15
802.1Q VLAN Support v1.8 Ben Greear <[email protected]>
All bugs added by David S. Miller <[email protected]>
ReiserFS: sda8: found reiserfs format "3.6" with standard journal
ReiserFS: sda8: using ordered data mode
ReiserFS: sda8: journal params: device sda8, size 8192, journal first
block 18, max trans len 1024, max batch 900, max commit age 30, max
trans age 30
ReiserFS: sda8: checking transaction log (sda8)
input: AT Translated Set 2 keyboard as /class/input/input0
ReiserFS: sda8: replayed 15 transactions in 1 seconds
ReiserFS: sda8: Using r5 hash to sort names
VFS: Mounted root (reiserfs filesystem) readonly.
Freeing unused kernel memory: 168k freed
Adding 1951856k swap on /dev/sda5. Priority:-1 extents:1 across:1951856k
Adding 1951856k swap on /dev/sda6. Priority:-2 extents:1 across:1951856k
Adding 1951792k swap on /dev/sda7. Priority:-3 extents:1 across:1951792k
ReiserFS: sda10: found reiserfs format "3.6" with standard journal
ReiserFS: sda10: using ordered data mode
ReiserFS: sda10: journal params: device sda10, size 8192, journal first
block 18, max trans len 1024, max batch 900, max commit age 30, max
trans age 30
ReiserFS: sda10: checking transaction log (sda10)
ReiserFS: sda10: Using r5 hash to sort names
ReiserFS: sda10: Removing [30 40588 0x0 SD]..done
ReiserFS: sda10: Removing [3 40583 0x0 SD]..done
ReiserFS: sda10: Removing [3 40582 0x0 SD]..done
ReiserFS: sda10: Removing [3 40579 0x0 SD]..done
ReiserFS: sda10: There were 4 uncompleted unlinks/truncates. Completed
ReiserFS: sda2: found reiserfs format "3.6" with standard journal
ReiserFS: sda2: using ordered data mode
ReiserFS: sda2: journal params: device sda2, size 8192, journal first
block 18, max trans len 1024, max batch 900, max commit age 30, max
trans age 30
ReiserFS: sda2: checking transaction log (sda2)
ReiserFS: sda2: Using r5 hash to sort names
ReiserFS: sda2: Removing [1306 51393 0x0 SD]..done
ReiserFS: sda2: Removing [1306 51193 0x0 SD]..done
ReiserFS: sda2: There were 2 uncompleted unlinks/truncates. Completed
ReiserFS: sda3: found reiserfs format "3.6" with standard journal
ReiserFS: sda3: using ordered data mode
ReiserFS: sda3: journal params: device sda3, size 8192, journal first
block 18, max trans len 1024, max batch 900, max commit age 30, max
trans age 30
ReiserFS: sda3: checking transaction log (sda3)
ReiserFS: sda3: Using r5 hash to sort names
PM: Writing back config space on device 0000:02:09.1 at offset b (was
164814e4, writing 164414e4)
PM: Writing back config space on device 0000:02:09.1 at offset 3 (was
804000, writing 804010)
PM: Writing back config space on device 0000:02:09.1 at offset 2 (was
2000000, writing 2000003)
PM: Writing back config space on device 0000:02:09.1 at offset 1 (was
2b00000, writing 2b00106)
ADDRCONF(NETDEV_UP): eth2: link is not ready
tg3: eth2: Link is up at 1000 Mbps, full duplex.
tg3: eth2: Flow control is off for TX and off for RX.
ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
eth2: no IPv6 routers present
3w-9xxx: scsi0: AEN: INFO (0x04:0x0056): Battery charging completed:.


2006-09-18 18:30:58

by Jon Mason

Subject: Re: Dual Core Opteron hangs, iommu Entries (x86_64)

On Mon, Sep 18, 2006 at 03:12:41PM +0200, kautzy wrote:
> Since this is my first post on this list, I would like to say hello to
> everyone!
>
> I am experiencing problems with a 2x dual core opteron servers. every
> 5-7 days the system hangs. while it still pings, it does not react on
> console inputs, i can't login via ssh either. when that happens, the
> only thing one can do is to reset the machine. there aren't any errors
> logged.
>
> i have checked the memory for errors, but it looks like it is ok.
>
> I found a post on this list describing a problem which looks similar to
> mine:
>
> http://www.gatago.com/linux/kernel/13699679.html
>
> as mentioned in the above post, a dmesg on my server also shows
> following entries:
>
> Allocating PCI resources starting at fb800000 (gap: fb000000:4780000)
> Checking aperture...
> CPU 0: aperture @ cc24000000 size 32 MB
> Aperture from northbridge cpu 0 too small (32 MB)
> No AGP bridge found
> Your BIOS doesn't leave a aperture memory hole
> Please enable the IOMMU option in the BIOS setup
> This costs you 64 MB of RAM
> Mapping aperture over 65536 KB of RAM @ 8000000
> Built 1 zonelists
>
> can those entries have anything to do with the system crashes, and if,
> can booting with iommu=memaper=3 help to solve the problem?
>
> i am running kernel 2.6.17.11, sarge amd64 , the system has 6GB RAM
>
> i appreciate any suggestions :)

Your problem is that you have more than 4GB of RAM and not enough room
in your IOMMU aperture to handle all of the pending DMA requests.
Dmesg suggests you go into your BIOS and increase your AGP aperture
from 32M to 64M; did you try that?
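
To put a number on the "more than 4GB" part, here is a small sketch that sums the usable RAM above the 4 GB boundary from the BIOS-e820 lines in the posted dmesg. The function name and the regular expression are illustrative only, and the sample holds just the usable ranges from the log.

```python
import re

# Matches lines such as:
# BIOS-e820: 0000000100000000 - 0000000180000000 (usable)
E820_LINE = re.compile(r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\((.+)\)")
FOUR_GB = 1 << 32


def usable_above_4gb(dmesg_text: str) -> int:
    """Return the number of usable bytes the e820 map places above 4 GB."""
    total = 0
    for match in E820_LINE.finditer(dmesg_text):
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        if match.group(3) != "usable":
            continue
        # Count only the part of the range that lies above the 4 GB line.
        total += max(0, end - max(start, FOUR_GB))
    return total


# Usable ranges copied from the dmesg output in the original post.
SAMPLE = """\
BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
BIOS-e820: 0000000000100000 - 00000000faff0000 (usable)
BIOS-e820: 0000000100000000 - 0000000180000000 (usable)
"""

if __name__ == "__main__":
    print(usable_above_4gb(SAMPLE) // (1024 * 1024), "MB usable above 4 GB")
```

For the posted map this gives 2048 MB above the boundary, which is the memory that devices limited to 32-bit DMA can only reach through the aperture.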

Thanks,
Jon


2006-09-18 18:49:44

by kautzy

Subject: Re: Dual Core Opteron hangs, iommu Entries (x86_64)

Jon Mason wrote:
> On Mon, Sep 18, 2006 at 03:12:41PM +0200, kautzy wrote:
>
>> Since this is my first post on this list, I would like to say hello to
>> everyone!
>>
>> I am experiencing problems with a 2x dual core opteron servers. every
>> 5-7 days the system hangs. while it still pings, it does not react on
>> console inputs, i can't login via ssh either. when that happens, the
>> only thing one can do is to reset the machine. there aren't any errors
>> logged.
>>
>> i have checked the memory for errors, but it looks like it is ok.
>>
>> I found a post on this list describing a problem which looks similar to
>> mine:
>>
>> http://www.gatago.com/linux/kernel/13699679.html
>>
>> as mentioned in the above post, a dmesg on my server also shows
>> following entries:
>>
>> Allocating PCI resources starting at fb800000 (gap: fb000000:4780000)
>> Checking aperture...
>> CPU 0: aperture @ cc24000000 size 32 MB
>> Aperture from northbridge cpu 0 too small (32 MB)
>> No AGP bridge found
>> Your BIOS doesn't leave a aperture memory hole
>> Please enable the IOMMU option in the BIOS setup
>> This costs you 64 MB of RAM
>> Mapping aperture over 65536 KB of RAM @ 8000000
>> Built 1 zonelists
>>
>> can those entries have anything to do with the system crashes, and if,
>> can booting with iommu=memaper=3 help to solve the problem?
>>
>> i am running kernel 2.6.17.11, sarge amd64 , the system has 6GB RAM
>>
>> i appreciate any suggestions :)
>>
>
> Your problem is that you have more than 4GB of RAM and not enough room
> in your IOMMU aperature to handle all of the pending DMA requests.
> Dmesg suggests you go into your BIOS and increase your AGP aperature
> from 32M to 64M, did you try that?
>
> Thanks,
> Jon
>
>
Thanks a lot for your reply, Jon!

I should have mentioned that the mainboard has neither an AGP slot nor
AGP aperture settings in the BIOS :(

The biggest problem I am facing is that I always have to wait 5-7 days
to see whether the changes I made since the last reboot had a positive
effect (which unfortunately has never been the case so far) ;-)

regards

chris

>>
>> chris
>>
>> the full output of dmesg:
>>
>> Bootdata ok (command line is root=/dev/sda8 ro console=tty0 )
>> Linux version 2.6.17.11-mli1-opteron-v2 (root@mli1) (gcc version 3.3.5
>> (Debian 1:3.3.5-13)) #1 SMP Mon Sep 11 12:29:02 CEST 2006
>> BIOS-provided physical RAM map:
>> BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
>> BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
>> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
>> BIOS-e820: 0000000000100000 - 00000000faff0000 (usable)
>> BIOS-e820: 00000000faff0000 - 00000000fafff000 (ACPI data)
>> BIOS-e820: 00000000fafff000 - 00000000fb000000 (ACPI NVS)
>> BIOS-e820: 00000000ff780000 - 0000000100000000 (reserved)
>> BIOS-e820: 0000000100000000 - 0000000180000000 (usable)
>> DMI 2.3 present.
>> On node 0 totalpages: 1529283
>> DMA zone: 2459 pages, LIFO batch:0
>> DMA32 zone: 1009704 pages, LIFO batch:31
>> Normal zone: 517120 pages, LIFO batch:31
>> Intel MultiProcessor Specification v1.1
>> Virtual Wire compatibility mode.
>> OEM ID: TYAN Product ID: S2882 APIC at: 0xFEE00000
>> Processor #0 15:1 APIC version 16
>> Processor #1 15:1 APIC version 16
>> Processor #2 15:1 APIC version 16
>> Processor #3 15:1 APIC version 16
>> I/O APIC #4 Version 17 at 0xFEC00000.
>> I/O APIC #5 Version 17 at 0xFEBFF000.
>> I/O APIC #6 Version 17 at 0xFEBFE000.
>> Setting APIC routing to flat
>> Processors: 4
>> Allocating PCI resources starting at fb800000 (gap: fb000000:4780000)
>> Checking aperture...
>> CPU 0: aperture @ cc24000000 size 32 MB
>> Aperture from northbridge cpu 0 too small (32 MB)
>> No AGP bridge found
>> Your BIOS doesn't leave a aperture memory hole
>> Please enable the IOMMU option in the BIOS setup
>> This costs you 64 MB of RAM
>> Mapping aperture over 65536 KB of RAM @ 8000000
>> Built 1 zonelists
>> Kernel command line: root=/dev/sda8 ro console=tty0
>> Initializing CPU#0
>> PID hash table entries: 4096 (order: 12, 32768 bytes)
>> time.c: Using 1.193182 MHz WALL PIT GTOD PIT/TSC timer.
>> time.c: Detected 2190.816 MHz processor.
>> Console: colour VGA+ 80x25
>> Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
>> Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
>> Memory: 6038612k/6291456k available (3002k kernel code, 170092k
>> reserved, 1269k data, 168k init)
>> Calibrating delay using timer specific routine.. 4390.66 BogoMIPS
>> (lpj=8781339)
>> Mount-cache hash table entries: 256
>> CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
>> CPU: L2 Cache: 1024K (64 bytes/line)
>> Using IO-APIC 4
>> Using IO-APIC 5
>> Using IO-APIC 6
>> GSI 18 sharing vector 0x89 and IRQ 18
>> GSI 19 sharing vector 0x91 and IRQ 19
>> GSI 24 sharing vector 0x99 and IRQ 24
>> GSI 25 sharing vector 0xA1 and IRQ 25
>> GSI 29 sharing vector 0xA9 and IRQ 29
>> Using local APIC timer interrupts.
>> result 12447820
>> Detected 12.447 MHz APIC timer.
>> Booting processor 1/4 APIC 0x1
>> Initializing CPU#1
>> Calibrating delay using timer specific routine.. 4381.80 BogoMIPS
>> (lpj=8763613)
>> CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
>> CPU: L2 Cache: 1024K (64 bytes/line)
>> Dual Core AMD Opteron(tm) Processor 275 stepping 02
>> CPU 1: Syncing TSC to CPU 0.
>> CPU 1: synchronized TSC with CPU 0 (last diff 6 cycles, maxerr 627 cycles)
>> Booting processor 2/4 APIC 0x2
>> Initializing CPU#2
>> Calibrating delay using timer specific routine.. 4381.88 BogoMIPS
>> (lpj=8763771)
>> CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
>> CPU: L2 Cache: 1024K (64 bytes/line)
>> Dual Core AMD Opteron(tm) Processor 275 stepping 02
>> CPU 2: Syncing TSC to CPU 0.
>> CPU 2: synchronized TSC with CPU 0 (last diff 1 cycles, maxerr 876 cycles)
>> Booting processor 3/4 APIC 0x3
>> Initializing CPU#3
>> Calibrating delay using timer specific routine.. 4381.92 BogoMIPS
>> (lpj=8763852)
>> CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
>> CPU: L2 Cache: 1024K (64 bytes/line)
>> Dual Core AMD Opteron(tm) Processor 275 stepping 02
>> CPU 3: Syncing TSC to CPU 0.
>> CPU 3: synchronized TSC with CPU 0 (last diff 7 cycles, maxerr 864 cycles)
>> Brought up 4 CPUs
>> testing NMI watchdog ... OK.
>> migration_cost=460
>> NET: Registered protocol family 16
>> PCI: Using configuration type 1
>> SCSI subsystem initialized
>> PCI: Probing PCI hardware
>> PCI: Probing PCI hardware (bus 00)
>> Boot video device is 0000:03:06.0
>> PCI: Using IRQ router default [1022/746b] at 0000:00:07.3
>> PCI->APIC IRQ transform: 0000:00:07.2[D] -> IRQ 19
>> PCI->APIC IRQ transform: 0000:03:06.0[A] -> IRQ 18
>> PCI->APIC IRQ transform: 0000:03:08.0[A] -> IRQ 18
>> PCI->APIC IRQ transform: 0000:02:09.0[A] -> IRQ 24
>> PCI->APIC IRQ transform: 0000:02:09.1[B] -> IRQ 25
>> PCI->APIC IRQ transform: 0000:01:04.0[A] -> IRQ 29
>> PCI-DMA: Disabling AGP.
>> PCI-DMA: aperture base @ 8000000 size 65536 KB
>> PCI-DMA: using GART IOMMU.
>> PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
>> PCI: Bridge: 0000:00:06.0
>> IO window: 9000-bfff
>> MEM window: fca00000-feafffff
>> PREFETCH window: disabled.
>> PCI: Bridge: 0000:00:0a.0
>> IO window: disabled.
>> MEM window: fc900000-fc9fffff
>> PREFETCH window: fc600000-fc6fffff
>> PCI: Bridge: 0000:00:0b.0
>> IO window: 8000-8fff
>> MEM window: fc800000-fc8fffff
>> PREFETCH window: fb500000-fc5fffff
>> NET: Registered protocol family 2
>> IP route cache hash table entries: 262144 (order: 9, 2097152 bytes)
>> TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
>> TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
>> TCP: Hash tables configured (established 262144 bind 65536)
>> TCP reno registered
>> IA32 emulation $Id: sys_ia32.c,v 1.32 2002/03/24 13:02:28 ak Exp $
>> Installing knfsd (copyright (C) 1996 [email protected]).
>> Initializing Cryptographic API
>> io scheduler noop registered
>> io scheduler anticipatory registered
>> io scheduler deadline registered (default)
>> io scheduler cfq registered
>> PCI: MSI quirk detected. PCI_BUS_FLAGS_NO_MSI set for subordinate bus.
>> PCI: MSI quirk detected. PCI_BUS_FLAGS_NO_MSI set for subordinate bus.
>> Real Time Clock Driver v1.12ac
>> Linux agpgart interface v0.101 (c) Dave Jones
>> Floppy drive(s): fd0 is 1.44M
>> FDC 0 is a post-1991 82077
>> loop: loaded (max 8 devices)
>> Intel(R) PRO/1000 Network Driver - version 7.0.33-k2
>> Copyright (c) 1999-2005 Intel Corporation.
>> eepro100.c:v1.09j-t 9/29/99 Donald Becker
>> http://www.scyld.com/network/eepro100.html
>> eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin
>> <[email protected]> and others
>> eth0: 0000:03:08.0, 00:E0:81:32:F6:36, IRQ 18.
>> Board assembly 567812-052, Physical connectors present: RJ45
>> Primary interface chip i82555 PHY #1.
>> General self-test: passed.
>> Serial sub-system self-test: passed.
>> Internal registers self-test: passed.
>> ROM checksum self-test: passed (0xd0a6c714).
>> e100: Intel(R) PRO/100 Network Driver, 3.5.10-k2-NAPI
>> e100: Copyright(c) 1999-2005 Intel Corporation
>> tg3.c:v3.59 (June 8, 2006)
>> eth1: Tigon3 [partno(BCM95704A7) rev 2003 PHY(5704)]
>> (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:e0:81:32:f7:ac
>> eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1]
>> TSOcap[1]
>> eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
>> eth2: Tigon3 [partno(BCM95704A7) rev 2003 PHY(5704)]
>> (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:e0:81:32:f7:ad
>> eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1]
>> TSOcap[1]
>> eth2: dma_rwctrl[769f4000] dma_mask[64-bit]
>> Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
>> ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
>> 3ware 9000 Storage Controller device driver for Linux v2.26.02.007.
>> 3w-9xxx: scsi0: AEN: INFO (0x04:0x0055): Battery charging started:.
>> 3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): Battery capacity test is overdue:.
>> scsi0 : 3ware 9000 Storage Controller
>> 3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfc8ffc00,
>> IRQ: 29.
>> 3w-9xxx: scsi0: Firmware FE9X 2.08.00.005, BIOS BE9X 2.03.01.052, Ports: 8.
>> Vendor: AMCC Model: 9500S-8 DISK Rev: 2.08
>> Type: Direct-Access ANSI SCSI revision: 03
>> SCSI device sda: 956884992 512-byte hdwr sectors (489925 MB)
>> sda: Write Protect is off
>> sda: Mode Sense: 23 00 00 00
>> SCSI device sda: drive cache: write back, no read (daft)
>> SCSI device sda: 956884992 512-byte hdwr sectors (489925 MB)
>> sda: Write Protect is off
>> sda: Mode Sense: 23 00 00 00
>> SCSI device sda: drive cache: write back, no read (daft)
>> sda: sda1 < sda5 sda6 sda7 sda8 sda9 sda10 > sda2 sda3
>> sd 0:0:0:0: Attached scsi disk sda
>> serio: i8042 AUX port at 0x60,0x64 irq 12
>> serio: i8042 KBD port at 0x60,0x64 irq 1
>> mice: PS/2 mouse device common for all mice
>> TCP bic registered
>> NET: Registered protocol family 1
>> NET: Registered protocol family 10
>> IPv6 over IPv4 tunneling driver
>> NET: Registered protocol family 17
>> NET: Registered protocol family 15
>> 802.1Q VLAN Support v1.8 Ben Greear <[email protected]>
>> All bugs added by David S. Miller <[email protected]>
>> ReiserFS: sda8: found reiserfs format "3.6" with standard journal
>> ReiserFS: sda8: using ordered data mode
>> ReiserFS: sda8: journal params: device sda8, size 8192, journal first
>> block 18, max trans len 1024, max batch 900, max commit age 30, max
>> trans age 30
>> ReiserFS: sda8: checking transaction log (sda8)
>> input: AT Translated Set 2 keyboard as /class/input/input0
>> ReiserFS: sda8: replayed 15 transactions in 1 seconds
>> ReiserFS: sda8: Using r5 hash to sort names
>> VFS: Mounted root (reiserfs filesystem) readonly.
>> Freeing unused kernel memory: 168k freed
>> Adding 1951856k swap on /dev/sda5. Priority:-1 extents:1 across:1951856k
>> Adding 1951856k swap on /dev/sda6. Priority:-2 extents:1 across:1951856k
>> Adding 1951792k swap on /dev/sda7. Priority:-3 extents:1 across:1951792k
>> ReiserFS: sda10: found reiserfs format "3.6" with standard journal
>> ReiserFS: sda10: using ordered data mode
>> ReiserFS: sda10: journal params: device sda10, size 8192, journal first
>> block 18, max trans len 1024, max batch 900, max commit age 30, max
>> trans age 30
>> ReiserFS: sda10: checking transaction log (sda10)
>> ReiserFS: sda10: Using r5 hash to sort names
>> ReiserFS: sda10: Removing [30 40588 0x0 SD]..done
>> ReiserFS: sda10: Removing [3 40583 0x0 SD]..done
>> ReiserFS: sda10: Removing [3 40582 0x0 SD]..done
>> ReiserFS: sda10: Removing [3 40579 0x0 SD]..done
>> ReiserFS: sda10: There were 4 uncompleted unlinks/truncates. Completed
>> ReiserFS: sda2: found reiserfs format "3.6" with standard journal
>> ReiserFS: sda2: using ordered data mode
>> ReiserFS: sda2: journal params: device sda2, size 8192, journal first
>> block 18, max trans len 1024, max batch 900, max commit age 30, max
>> trans age 30
>> ReiserFS: sda2: checking transaction log (sda2)
>> ReiserFS: sda2: Using r5 hash to sort names
>> ReiserFS: sda2: Removing [1306 51393 0x0 SD]..done
>> ReiserFS: sda2: Removing [1306 51193 0x0 SD]..done
>> ReiserFS: sda2: There were 2 uncompleted unlinks/truncates. Completed
>> ReiserFS: sda3: found reiserfs format "3.6" with standard journal
>> ReiserFS: sda3: using ordered data mode
>> ReiserFS: sda3: journal params: device sda3, size 8192, journal first
>> block 18, max trans len 1024, max batch 900, max commit age 30, max
>> trans age 30
>> ReiserFS: sda3: checking transaction log (sda3)
>> ReiserFS: sda3: Using r5 hash to sort names
>> PM: Writing back config space on device 0000:02:09.1 at offset b (was
>> 164814e4, writing 164414e4)
>> PM: Writing back config space on device 0000:02:09.1 at offset 3 (was
>> 804000, writing 804010)
>> PM: Writing back config space on device 0000:02:09.1 at offset 2 (was
>> 2000000, writing 2000003)
>> PM: Writing back config space on device 0000:02:09.1 at offset 1 (was
>> 2b00000, writing 2b00106)
>> ADDRCONF(NETDEV_UP): eth2: link is not ready
>> tg3: eth2: Link is up at 1000 Mbps, full duplex.
>> tg3: eth2: Flow control is off for TX and off for RX.
>> ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
>> eth2: no IPv6 routers present
>> 3w-9xxx: scsi0: AEN: INFO (0x04:0x0056): Battery charging completed:.

2006-09-19 07:09:37

by Andi Kleen

[permalink] [raw]
Subject: Re: Dual Core Opteron hangs, iommu Entries (x86_64)

Jon Mason <[email protected]> writes:
>
> Your problem is that you have more than 4GB of RAM and not enough room
> in your IOMMU aperture to handle all of the pending DMA requests.
> Dmesg suggests you go into your BIOS and increase your AGP aperture
> from 32M to 64M. Did you try that?

The kernel ignores 32MB apertures anyway and creates its own 64MB aperture.

Normally an IOMMU overflow should be handled without a hang though, assuming
you don't have a buggy driver. If you don't trust it you can
boot with iommu=panic panic=30 -- then the kernel will always panic
on an overflow and reboot 30 seconds later. The aperture can also be
increased with iommu=memaper=3 or 4.
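
For a sense of what memaper=3 or 4 would cost in RAM, here is a small
sketch of the arithmetic. It assumes the sizing rule given in the kernel's
x86_64 boot-options documentation (aperture size = 32 MB << order); the
function name is made up for illustration. Order 1 gives the 64 MB that
the dmesg above reports.

```python
# Sketch only: aperture size for iommu=memaper=<order>.
# Assumption: size = 32 MB << order, per the kernel's x86_64 boot-options doc.
BASE_APERTURE_MB = 32


def aperture_mb(order: int) -> int:
    """Return the aperture size in MB for a given memaper order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return BASE_APERTURE_MB << order


if __name__ == "__main__":
    # order 1 is the default 64 MB; 3 and 4 are the values suggested above
    for order in (1, 3, 4):
        print(f"iommu=memaper={order}: {aperture_mb(order)} MB")
```

Under that assumption, memaper=3 maps 256 MB and memaper=4 maps 512 MB
of RAM for the aperture.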

But it could be a lot of other things. So far you don't seem to have
any evidence that it's the IOMMU except for some likely misguided googling.

The usual way to check for unknown hangs is to configure a
serial console and see if there is any output during the hang.
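
As a rough sketch, the boot line for such a debugging run could be put
together like this. The starting command line is the one from the dmesg
in this thread; the serial port (ttyS0) and speed (115200n8) are
assumptions, since the thread does not say which port the machine has,
and the helper name is hypothetical.

```python
# Sketch only: build a debugging boot line from the current one.
# Assumptions: serial port ttyS0 at 115200n8 (not stated in the thread).


def add_boot_options(cmdline: str, extra: list[str]) -> str:
    """Append options that are not already present, keeping their order."""
    parts = cmdline.split()
    for opt in extra:
        if opt not in parts:
            parts.append(opt)
    return " ".join(parts)


# Command line as shown in the posted dmesg
current = "root=/dev/sda8 ro console=tty0"

# Serial console plus the panic-on-overflow options mentioned above
debug_opts = ["console=ttyS0,115200n8", "iommu=panic", "panic=30"]

if __name__ == "__main__":
    print(add_boot_options(current, debug_opts))
```

The resulting line goes into the boot loader entry for the kernel; a
second machine on the other end of the serial cable then logs whatever
the kernel prints before the hang.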

Also run memtest86 for at least 12 hours.

-Andi