2005-03-30 22:07:21

by Philip Lawatsch

Subject: AMD64 Machine hardlocks when using memset

>Bootdata ok (command line is BOOT_IMAGE=test ro root=809)
Linux version 2.6.12-rc1 (root@localhost) (gcc version 3.4.3 20041125 (Gentoo Linux 3.4.3-r1, ssp-3.4.3-0, pie-8.7.7)) #1 Wed Mar 30 23:30:20 CEST 2005
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007fff0000 (usable)
BIOS-e820: 000000007fff0000 - 000000007fff3000 (ACPI NVS)
BIOS-e820: 000000007fff3000 - 0000000080000000 (ACPI data)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
BIOS-e820: 00000000fefffc00 - 00000000ff000000 (reserved)
BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
ACPI: RSDP (v000 Nvidia ) @ 0x00000000000f78c0
ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x000000007fff3040
ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x000000007fff30c0
ACPI: MCFG (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x000000007fff9540
ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x000000007fff9480
ACPI: DSDT (v001 NVIDIA AWRDACPI 0x00001000 MSFT 0x0100000e) @ 0x0000000000000000
On node 0 totalpages: 524272
DMA zone: 4096 pages, LIFO batch:1
Normal zone: 520176 pages, LIFO batch:16
HighMem zone: 0 pages, LIFO batch:1
Nvidia board detected. Ignoring ACPI timer override.
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:15 APIC version 16
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: BIOS IRQ0 pin2 override ignored.
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge)
ACPI: IRQ9 used by override.
ACPI: IRQ14 used by override.
ACPI: IRQ15 used by override.
Setting APIC routing to flat
Using ACPI (MADT) for SMP configuration information
Built 1 zonelists
Kernel command line: BOOT_IMAGE=test ro root=809 console=tty0
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 1.193182 MHz PIT timer.
time.c: Detected 2211.376 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Memory: 2056168k/2097088k available (3281k kernel code, 40236k reserved, 1386k data, 188k init)
Calibrating delay loop... 4374.52 BogoMIPS (lpj=2187264)
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU: AMD Athlon(tm) 64 Processor 3500+ stepping 00
Using local APIC NMI watchdog using perfctr0
Using local APIC timer interrupts.
Detected 12.564 MHz APIC timer.
NET: Registered protocol family 16
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
ACPI: Subsystem revision 20050211
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
PCI: Transparent bridge - 0000:00:09.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.HUB0._PRT]
ACPI: PCI Interrupt Link [LNK1] (IRQs 3 4 5 7 9 10 11 *12 14 15)
ACPI: PCI Interrupt Link [LNK2] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNK3] (IRQs 3 4 *5 7 9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK4] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNK5] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LUBA] (IRQs 3 4 *5 7 9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LUBB] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LMAC] (IRQs 3 4 5 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LACI] (IRQs *3 4 5 7 9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LMCI] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LSMB] (IRQs 3 4 5 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LUB2] (IRQs *3 4 5 7 9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LIDE] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LSID] (IRQs 3 4 5 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LFID] (IRQs 3 4 5 7 9 10 11 *12 14 15)
ACPI: PCI Interrupt Link [LPCA] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [APC1] (IRQs *16), disabled.
ACPI: PCI Interrupt Link [APC2] (IRQs *17), disabled.
ACPI: PCI Interrupt Link [APC3] (IRQs *18), disabled.
ACPI: PCI Interrupt Link [APC4] (IRQs *19), disabled.
ACPI: PCI Interrupt Link [APC5] (IRQs *16), disabled.
ACPI: PCI Interrupt Link [APCF] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCG] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCH] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCJ] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCK] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCS] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCL] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCZ] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APSI] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APSJ] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCP] (IRQs 20 21 22 23) *0, disabled.
SCSI subsystem initialized
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
TC classifier action (bugs to [email protected] cc [email protected])
IA32 emulation $Id: sys_ia32.c,v 1.32 2002/03/24 13:02:28 ak Exp $
Total HugeTLB memory allocated, 0
devfs: 2004-01-31 Richard Gooch ([email protected])
devfs: boot_options: 0x0
JFS: nTxBlock = 8192, nTxLock = 65536
SGI XFS with ACLs, large block/inode numbers, no debug enabled
Initializing Cryptographic API
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.4
ACPI: Power Button (FF) [PWRF]
ACPI: Fan [FAN] (on)
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: Thermal Zone [THRM] (40 C)
Real Time Clock Driver v1.12
Non-volatile memory driver v1.2
[drm] Initialized drm 1.0.0 20040925
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
RAMDISK driver initialized: 16 RAM disks of 8192K size 1024 blocksize
loop: loaded (max 8 devices)
forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.31.
ACPI: PCI Interrupt Link [APCH] enabled at IRQ 23
ACPI: PCI interrupt 0000:00:0a.0[A] -> GSI 23 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:00:0a.0 to 64
eth0: forcedeth.c: subsystem: 01043:8141 bound to 0000:00:0a.0
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
NFORCE-CK804: IDE controller at PCI slot 0000:00:06.0
NFORCE-CK804: chipset revision 162
NFORCE-CK804: not 100% native mode: will probe irqs later
NFORCE-CK804: BIOS didn't set cable bits correctly. Enabling workaround.
NFORCE-CK804: 0000:00:06.0 (rev a2) UDMA133 controller
ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:DMA
ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:DMA, hdd:DMA
Probing IDE interface ide0...
hda: PIONEER DVD-RW DVR-109, ATAPI CD/DVD-ROM drive
hdb: PHILIPS DROM5016L, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Probing IDE interface ide1...
Probing IDE interface ide1...
Probing IDE interface ide2...
Probing IDE interface ide3...
Probing IDE interface ide4...
Probing IDE interface ide5...
hda: ATAPI 40X DVD-ROM DVD-R CD-R/RW drive, 2000kB Cache, UDMA(66)
Uniform CD-ROM driver Revision: 3.20
hdb: ATAPI 48X DVD-ROM drive, 512kB Cache, UDMA(33)
libata version 1.10 loaded.
sata_nv version 0.6
ACPI: PCI Interrupt Link [APSI] enabled at IRQ 22
ACPI: PCI interrupt 0000:00:07.0[A] -> GSI 22 (level, low) -> IRQ 22
PCI: Setting latency timer of device 0000:00:07.0 to 64
ata1: SATA max UDMA/133 cmd 0x9F0 ctl 0xBF2 bmdma 0xD800 irq 22
ata2: SATA max UDMA/133 cmd 0x970 ctl 0xB72 bmdma 0xD808 irq 22
ata1: no device found (phy stat 00000000)
scsi0 : sata_nv
ata2: no device found (phy stat 00000000)
scsi1 : sata_nv
ACPI: PCI Interrupt Link [APSJ] enabled at IRQ 21
ACPI: PCI interrupt 0000:00:08.0[A] -> GSI 21 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:00:08.0 to 64
ata3: SATA max UDMA/133 cmd 0x9E0 ctl 0xBE2 bmdma 0xC400 irq 21
ata4: SATA max UDMA/133 cmd 0x960 ctl 0xB62 bmdma 0xC408 irq 21
ata3: dev 0 cfg 49:2f00 82:7c6b 83:7f09 84:4063 85:7c68 86:3e01 87:4063 88:407f
ata3: dev 0 ATA, max UDMA/133, 398297088 sectors: lba48
nv_sata: Primary device added
nv_sata: Primary device removed
nv_sata: Secondary device added
nv_sata: Secondary device removed
ata3: dev 0 configured for UDMA/133
scsi2 : sata_nv
ata4: no device found (phy stat 00000000)
scsi3 : sata_nv
Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sda: drive cache: write back
/dev/scsi/host2/bus0/target0/lun0: p1 p2 < p5 p6 p7 p8 p9 p10 >
Attached scsi disk sda at scsi2, channel 0, id 0, lun 0
Attached scsi generic sg0 at scsi2, channel 0, id 0, lun 0, type 0
usbmon: debugs is not available
mice: PS/2 mouse device common for all mice
md: linear personality registered as nr 1
md: raid0 personality registered as nr 2
md: raid1 personality registered as nr 3
md: raid5 personality registered as nr 4
raid5: automatically using best checksumming function: generic_sse
generic_sse: 6776.000 MB/sec
raid5: using function: generic_sse (6776.000 MB/sec)
raid6: int64x1 2042 MB/s
raid6: int64x2 2949 MB/s
raid6: int64x4 2886 MB/s
raid6: int64x8 1921 MB/s
raid6: sse2x1 906 MB/s
raid6: sse2x2 1773 MB/s
raid6: sse2x4 3062 MB/s
raid6: using algorithm sse2x4 (3062 MB/s)
md: raid6 personality registered as nr 8
md: multipath personality registered as nr 7
md: md driver 0.90.1 MAX_MD_DEVS=256, MD_SB_DISKS=27
device-mapper: 4.4.0-ioctl (2005-01-12) initialised: [email protected]
Advanced Linux Sound Architecture Driver Version 1.0.8 (Thu Jan 13 09:39:32 2005 UTC).
ALSA device list:
No soundcards found.
NET: Registered protocol family 2
IP: routing cache hash table of 16384 buckets, 128Kbytes
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 7, 524288 bytes)
TCP: Hash tables configured (established 524288 bind 65536)
NET: Registered protocol family 1
NET: Registered protocol family 10
Disabled Privacy Extensions on device ffffffff80538620(lo)
IPv6 over IPv4 tunneling driver
NET: Registered protocol family 17
NET: Registered protocol family 15
powernow-k8: Found 1 AMD Athlon 64 / Opteron processors (version 1.00.09e)
powernow-k8: 0 : fid 0xe (2200 MHz), vid 0x6 (1400 mV)
powernow-k8: 1 : fid 0xc (2000 MHz), vid 0x8 (1350 mV)
powernow-k8: 2 : fid 0xa (1800 MHz), vid 0xa (1300 mV)
powernow-k8: 3 : fid 0x2 (1000 MHz), vid 0x12 (1100 mV)
cpu_init done, current fid 0xe, vid 0x6
ACPI wakeup devices:
HUB0 XVR0 XVR1 XVR2 XVR3 USB0 USB2 MMAC MMCI UAR1
ACPI: (supports S0 S1 S3 S4 S5)
BIOS EDD facility v0.16 2004-Jun-25, 1 devices found
devfs_mk_dev: could not append to parent for md/0
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
kjournald starting. Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 188k freed
input: AT Translated Set 2 keyboard on isa0060/serio0
Adding 1959888k swap on /dev/sda6. Priority:-1 extents:1
EXT3 FS on sda9, internal journal
kjournald starting. Commit interval 5 seconds
EXT3 FS on sda8, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
i2c_adapter i2c-0: nForce2 SMBus adapter at 0x4c00
i2c_adapter i2c-1: nForce2 SMBus adapter at 0x4c40
ohci_hcd: 2004 Nov 08 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
ACPI: PCI Interrupt Link [APCF] enabled at IRQ 20
ACPI: PCI interrupt 0000:00:02.0[A] -> GSI 20 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:00:02.0 to 64
ohci_hcd 0000:00:02.0: nVidia Corporation CK804 USB Controller
ohci_hcd 0000:00:02.0: new USB bus registered, assigned bus number 1
ohci_hcd 0000:00:02.0: irq 20, io mem 0xd0104000
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 10 ports detected
ACPI: PCI Interrupt Link [APCL] enabled at IRQ 23
ACPI: PCI interrupt 0000:00:02.1[B] -> GSI 23 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:00:02.1 to 64
ehci_hcd 0000:00:02.1: nVidia Corporation CK804 USB Controller
ehci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 2
ehci_hcd 0000:00:02.1: irq 23, io mem 0xd0105000
PCI: cache line size of 64 is not supported by device 0000:00:02.1
ehci_hcd 0000:00:02.1: park 0
ehci_hcd 0000:00:02.1: USB 2.0 initialized, EHCI 1.00, driver 10 Dec 2004
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 10 ports detected
ACPI: PCI Interrupt Link [APCJ] enabled at IRQ 22
ACPI: PCI interrupt 0000:00:04.0[A] -> GSI 22 (level, low) -> IRQ 22
PCI: Setting latency timer of device 0000:00:04.0 to 64
intel8x0_measure_ac97_clock: measured 49642 usecs
intel8x0: clocking to 46875
usb 1-1: new full speed USB device using ohci_hcd and address 3
usb 1-2: new low speed USB device using ohci_hcd and address 4
usb 1-10: new full speed USB device using ohci_hcd and address 5
cdc_acm 1-1:1.0: ttyACM0: USB ACM device
usbcore: registered new driver cdc_acm
drivers/usb/class/cdc-acm.c: v0.23:USB Abstract Control Model driver for USB modems and ISDN adapters
usbcore: registered new driver hiddev
input: USB HID v1.10 Mouse [B16_b_02 USB-PS/2 Optical Mouse] on usb-0000:00:02.0-2
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.01:USB HID core driver
Bluetooth: Core ver 2.7
NET: Registered protocol family 31
Bluetooth: HCI device and connection manager initialized
Bluetooth: HCI socket layer initialized
Bluetooth: HCI USB driver ver 2.8
usbcore: registered new driver hci_usb
ieee1394: Initialized config rom entry `ip1394'
ohci1394: $Rev: 1250 $ Ben Collins <[email protected]>
ACPI: PCI Interrupt Link [APC1] enabled at IRQ 16
ACPI: PCI interrupt 0000:05:0b.0[A] -> GSI 16 (level, low) -> IRQ 16
ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[16] MMIO=[d0004000-d00047ff] Max Packet=[2048]
USB Universal Host Controller Interface driver v2.2
ieee1394: Host added: ID:BUS[0-00:1023] GUID[0011d800000a00cc]
eth1394: $Rev: 1247 $ Ben Collins <[email protected]>
eth1394: eth1: IEEE-1394 IPv4 over 1394 Ethernet (fw-host0)


Attachments:
dmesg.txt (15.10 kB)

2005-03-31 00:06:23

by Matthias-Christian Ott

Subject: Re: AMD64 Machine hardlocks when using memset

Philip Lawatsch wrote:

>Hi,
>
>
>I do have a very strange problem:
>
>If I memset a ~1meg buffer some thousand times (in the userspace) it
>will hardlock my machine.
>
>I've been using 2.6.12-rc1 and also a lot of other kernels (2.6.9,
>2.6.11). I've tried it both using a 32 bit kernel and a 64 bit kernel.
>When running on the 32 bit kernel the machine hardlocks after about
>15000 iterations, on a 64 bit kernel the machine hardlocks after about
>5000 (the 64 bit system has nearly no background jobs running).
>
>I've been running memcheck for several hours now, but nothing showed up.
>
>
>I've got an Asus A8N-SLI board with 2 gigs of memory and an AMD 3500+ CPU.
>
>The 64 bit kernel was compiled using gcc 3.4.3 and the 32 bit kernel
>using 3.3.5.
>
>
>This simple program will kill my machine:
>
>#include <stdlib.h>
>#include <stdio.h>
>#include <string.h>
>int main(int argc, char *argv[])
>{
> char buf[1024*1024];
> int i;
> for (i=0;i<1024*16;++i)
> {
> printf("%d\n",i);
> memset(buf,0,1024*1024);
> }
> printf("Done\n");
> return 0;
>}
>
>If I usleep for 1ms after each memset the whole thing will happily run
>forever without any problems.
>
>Also if I start it twice (without sleeping in the loop) the machine won't
>hardlock either (tested with a 32 bit kernel).
>
>I'd really appreciate any pointers as to what might be wrong here.
>
>I've tried both kernels with and without preemption.
>
>kind regards Philip
>
You want to allocate a lot of memory (16 GB). You don't have that much
space, so the kernel hangs.

Matthias-Christian Ott

2005-03-31 04:32:50

by Robert Hancock

Subject: Re: AMD64 Machine hardlocks when using memset

Matthias-Christian Ott wrote:
> You want to allocate a lot of memory (16 GB), you don't have that much
> space, so the Kernel hangs.

No, this is not what it is doing. The program is simply wiping the same
1MB block of memory over and over. If it was doing what you say it would
not (or should not) lock the machine anyway.
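
For reference, the behaviour described above (one block allocated once and then wiped repeatedly) can be sketched as follows. This is a hypothetical reconstruction, not Philip's original source: the name wipe_block, its parameters and the read-back of one byte are assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical reconstruction: wipe one `size`-byte block `iterations`
 * times. Returns 0 on success, -1 if size is 0 or the allocation fails. */
int wipe_block(size_t size, long iterations)
{
    if (size == 0)
        return -1;

    unsigned char *buf = malloc(size);
    if (buf == NULL)
        return -1;

    volatile unsigned char sink = 0;
    for (long i = 0; i < iterations; i++) {
        /* The same block is rewritten every pass; nothing new is allocated. */
        memset(buf, 0, size);
        /* Read one byte back so the stores are less likely to be optimised away. */
        sink ^= ((volatile unsigned char *)buf)[size - 1];
    }
    (void)sink;

    free(buf);
    return 0;
}
```

Calling wipe_block(1024 * 1024, 20000) would correspond to a ~1 MB buffer cleared some thousands of times; memory use stays constant no matter how many passes are made.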

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2005-03-31 04:39:07

by Robert Hancock

Subject: Re: AMD64 Machine hardlocks when using memset

Philip Lawatsch wrote:
> Hi,
>
>
> I do have a very strange problem:
>
> If I memset a ~1meg buffer some thousand times (in the userspace) it
> will hardlock my machine.

I thought that this must be impossible, but I tried it on my machine
which is very similar (Asus A8N-SLI, Athlon 64 3500+, 2GB RAM) and to my
surprise it breaks on mine too with kernel 2.6.11. I tested using the
program below. After about a minute or so of this, the machine either
locked hard or rebooted spontaneously. When it locked, there was no oops
message, the NMI watchdog was not triggered and there was no response to
SysRq commands. (I tested it with and without the NVIDIA module loaded.)

This seems pretty terrible: a perfectly legal program running as a
normal user is hard-locking the machine. Does anyone have any suggestions
for debugging this? Also, can somebody else on an x86_64 machine try to
duplicate this?

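#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main( int argc, char* argv[] )
{
    /* Allocate 512 MB once, then zero the whole block repeatedly. */
    char* test = malloc(512*1024*1024);
    int i;
    if( test == NULL )
        return 1;   /* allocation failed, nothing to test */
    for( i=0; i<1000000; i++ )
    {
        memset( test, 0, 512*1024*1024);
    }
    free(test);
    return 0;
}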

Bootdata ok (command line is ro root=LABEL=/)
Linux version 2.6.11-1.7_FC3custom (rob@Newcastle) (gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)) #1 Thu Mar 24 21:23:17 CST 2005
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007fff0000 (usable)
BIOS-e820: 000000007fff0000 - 000000007fff3000 (ACPI NVS)
BIOS-e820: 000000007fff3000 - 0000000080000000 (ACPI data)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
BIOS-e820: 00000000fefffc00 - 00000000ff000000 (reserved)
BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
ACPI: RSDP (v000 Nvidia ) @ 0x00000000000f7d50
ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x000000007fff3040
ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x000000007fff30c0
ACPI: MCFG (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x000000007fff9640
ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x000000007fff9580
ACPI: DSDT (v001 NVIDIA AWRDACPI 0x00001000 MSFT 0x0100000e) @ 0x0000000000000000
On node 0 totalpages: 524272
DMA zone: 4096 pages, LIFO batch:1
Normal zone: 520176 pages, LIFO batch:16
HighMem zone: 0 pages, LIFO batch:1
Nvidia board detected. Ignoring ACPI timer override.
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:15 APIC version 16
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: BIOS IRQ0 pin2 override ignored.
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge)
ACPI: IRQ9 used by override.
ACPI: IRQ14 used by override.
ACPI: IRQ15 used by override.
Setting APIC routing to flat
Using ACPI (MADT) for SMP configuration information
Checking aperture...
CPU 0: aperture @ 1320000000 size 32 MB
Aperture from northbridge cpu 0 too small (32 MB)
No AGP bridge found
Built 1 zonelists
Kernel command line: ro root=LABEL=/ console=tty0
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 1.193182 MHz PIT timer.
time.c: Detected 2211.365 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Memory: 2055568k/2097088k available (2722k kernel code, 40732k reserved, 1239k data, 188k init)
Calibrating delay loop... 4374.52 BogoMIPS (lpj=2187264)
Security Framework v1.0.0 initialized
SELinux: Initializing.
SELinux: Starting in permissive mode
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256 (order: 0, 4096 bytes)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU: AMD Athlon(tm) 64 Processor 3500+ stepping 00
Using local APIC NMI watchdog using perfctr0
Using local APIC timer interrupts.
Detected 12.564 MHz APIC timer.
checking if image is initramfs... it is
NET: Registered protocol family 16
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
ACPI: Subsystem revision 20050211
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
PCI: Transparent bridge - 0000:00:09.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.HUB0._PRT]
ACPI: PCI Interrupt Link [LNK1] (IRQs 3 *4 5 7 9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK2] (IRQs 3 4 5 7 9 10 11 *12 14 15)
ACPI: PCI Interrupt Link [LNK3] (IRQs 3 4 5 *7 9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK4] (IRQs 3 4 *5 7 9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK5] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LUBA] (IRQs 3 4 5 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LUBB] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LMAC] (IRQs *3 4 5 7 9 10 11 12 14 15)
ACPI: PCI Interrupt Link [LACI] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LMCI] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LSMB] (IRQs 3 4 5 7 9 10 11 *12 14 15)
ACPI: PCI Interrupt Link [LUB2] (IRQs 3 4 5 7 9 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LIDE] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LSID] (IRQs 3 4 5 7 9 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LFID] (IRQs 3 4 5 7 9 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LPCA] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [APC1] (IRQs *16), disabled.
ACPI: PCI Interrupt Link [APC2] (IRQs *17), disabled.
ACPI: PCI Interrupt Link [APC3] (IRQs *18), disabled.
ACPI: PCI Interrupt Link [APC4] (IRQs *19), disabled.
ACPI: PCI Interrupt Link [APC5] (IRQs *16), disabled.
ACPI: PCI Interrupt Link [APCF] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCG] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCH] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCJ] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCK] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCS] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCL] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCZ] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APSI] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APSJ] (IRQs 20 21 22 23) *0, disabled.
ACPI: PCI Interrupt Link [APCP] (IRQs 20 21 22 23) *0, disabled.
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
** PCI interrupts are no longer routed automatically. If this
** causes a device to stop working, it is probably because the
** driver failed to call pci_enable_device(). As a temporary
** workaround, the "pci=routeirq" argument restores the old
** behavior. If this argument makes the device work again,
** please email the output of "lspci" to [email protected]
** so I can fix the driver.
PCI-DMA: Disabling IOMMU.
IA32 emulation $Id: sys_ia32.c,v 1.32 2002/03/24 13:02:28 ak Exp $
audit: initializing netlink socket (disabled)
audit(1112221596.395:0): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
SELinux: Registering netfilter hooks
Initializing Cryptographic API
ksign: Installing public key data
Loading keyring
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
ACPI: Fan [FAN] (on)
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: Thermal Zone [THRM] (40 C)
Real Time Clock Driver v1.12
Linux agpgart interface v0.100 (c) Dave Jones
[drm] Initialized drm 1.0.0 20040925
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 76 ports, IRQ sharing enabled
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
NFORCE-CK804: IDE controller at PCI slot 0000:00:06.0
NFORCE-CK804: chipset revision 162
NFORCE-CK804: not 100% native mode: will probe irqs later
NFORCE-CK804: BIOS didn't set cable bits correctly. Enabling workaround.
NFORCE-CK804: 0000:00:06.0 (rev a2) UDMA133 controller
ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:DMA
ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:DMA, hdd:DMA
Probing IDE interface ide0...
hda: LITE-ON DVDRW SOHW-1633S, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Probing IDE interface ide1...
hdc: LITE-ON CD-RW SOHR-5238S, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
Probing IDE interface ide2...
Probing IDE interface ide3...
Probing IDE interface ide4...
Probing IDE interface ide5...
hda: ATAPI 48X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache
Uniform CD-ROM driver Revision: 3.20
hdc: ATAPI 52X CD-ROM CD-R/RW drive, 2048kB Cache, UDMA(33)
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.0:USB HID core driver
mice: PS/2 mouse device common for all mice
md: md driver 0.90.1 MAX_MD_DEVS=256, MD_SB_DISKS=27
NET: Registered protocol family 2
IP: routing cache hash table of 2048 buckets, 112Kbytes
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 9, 3670016 bytes)
TCP: Hash tables configured (established 524288 bind 65536)
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
powernow-k8: Found 1 AMD Athlon 64 / Opteron processors (version 1.00.09e)
powernow-k8: 0 : fid 0xe (2200 MHz), vid 0x2 (1500 mV)
powernow-k8: 1 : fid 0xc (2000 MHz), vid 0x6 (1400 mV)
powernow-k8: 2 : fid 0xa (1800 MHz), vid 0xa (1300 mV)
powernow-k8: 3 : fid 0x2 (1000 MHz), vid 0x12 (1100 mV)
cpu_init done, current fid 0xe, vid 0x2
ACPI wakeup devices:
HUB0 XVR0 XVR1 XVR2 XVR3 USB0 USB2 MMAC MMCI
ACPI: (supports S0 S1 S3 S4 S5)
Freeing unused kernel memory: 188k freed
SCSI subsystem initialized
libata version 1.10 loaded.
sata_nv version 0.6
ACPI: PCI Interrupt Link [APSI] enabled at IRQ 23
ACPI: PCI interrupt 0000:00:07.0[A] -> GSI 23 (level, low) -> IRQ 177
PCI: Setting latency timer of device 0000:00:07.0 to 64
ata1: SATA max UDMA/133 cmd 0x9F0 ctl 0xBF2 bmdma 0xD800 irq 177
ata2: SATA max UDMA/133 cmd 0x970 ctl 0xB72 bmdma 0xD808 irq 177
ata1: no device found (phy stat 00000000)
scsi0 : sata_nv
ata2: no device found (phy stat 00000000)
scsi1 : sata_nv
ACPI: PCI Interrupt Link [APSJ] enabled at IRQ 22
ACPI: PCI interrupt 0000:00:08.0[A] -> GSI 22 (level, low) -> IRQ 185
PCI: Setting latency timer of device 0000:00:08.0 to 64
ata3: SATA max UDMA/133 cmd 0x9E0 ctl 0xBE2 bmdma 0xC400 irq 185
ata4: SATA max UDMA/133 cmd 0x960 ctl 0xB62 bmdma 0xC408 irq 185
ata3: dev 0 cfg 49:2f00 82:346b 83:7d01 84:4003 85:3469 86:3c01 87:4003 88:407f
ata3: dev 0 ATA, max UDMA/133, 312581808 sectors: lba48
nv_sata: Primary device added
nv_sata: Primary device removed
nv_sata: Secondary device added
nv_sata: Secondary device removed
ata3: dev 0 configured for UDMA/133
scsi2 : sata_nv
ata4: dev 0 cfg 49:2f00 82:346b 83:7d01 84:4003 85:3469 86:3c01 87:4003 88:407f
ata4: dev 0 ATA, max UDMA/133, 312581808 sectors: lba48
nv_sata: Primary device added
nv_sata: Primary device removed
nv_sata: Secondary device added
nv_sata: Secondary device removed
ata4: dev 0 configured for UDMA/133
scsi3 : sata_nv
Vendor: ATA Model: ST3160827AS Rev: 3.42
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
SCSI device sda: drive cache: write back
sda:<4>nv_sata: Primary device added
nv_sata: Primary device removed
nv_sata: Secondary device added
nv_sata: Secondary device removed
sda1
Attached scsi disk sda at scsi2, channel 0, id 0, lun 0
Vendor: ATA Model: ST3160827AS Rev: 3.42
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sdb: 312581808 512-byte hdwr sectors (160042 MB)
SCSI device sdb: drive cache: write back
SCSI device sdb: 312581808 512-byte hdwr sectors (160042 MB)
SCSI device sdb: drive cache: write back
sdb:<4>nv_sata: Primary device added
nv_sata: Primary device removed
nv_sata: Secondary device added
nv_sata: Secondary device removed
sdb1 sdb2 sdb3 sdb4 <<4>nv_sata: Primary device added
nv_sata: Primary device removed
nv_sata: Secondary device added
nv_sata: Secondary device removed
sdb5<4>nv_sata: Primary device added
nv_sata: Primary device removed
nv_sata: Secondary device added
nv_sata: Secondary device removed
sdb6 >
Attached scsi disk sdb at scsi3, channel 0, id 0, lun 0
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting. Commit interval 5 seconds
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
security: 3 users, 4 roles, 318 types, 23 bools
security: 53 classes, 10826 rules
SELinux: Completing initialization.
SELinux: Setting up existing superblocks.
SELinux: initialized (dev sdb6, type ext3), uses xattr
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev selinuxfs, type selinuxfs), uses genfs_contexts
SELinux: initialized (dev mqueue, type mqueue), not configured for labeling
SELinux: initialized (dev hugetlbfs, type hugetlbfs), not configured for labeling
SELinux: initialized (dev devpts, type devpts), uses transition SIDs
SELinux: initialized (dev eventpollfs, type eventpollfs), uses genfs_contexts
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev futexfs, type futexfs), uses genfs_contexts
SELinux: initialized (dev pipefs, type pipefs), uses task SIDs
SELinux: initialized (dev sockfs, type sockfs), uses task SIDs
SELinux: initialized (dev proc, type proc), uses genfs_contexts
SELinux: initialized (dev bdev, type bdev), uses genfs_contexts
SELinux: initialized (dev rootfs, type rootfs), uses genfs_contexts
SELinux: initialized (dev sysfs, type sysfs), uses genfs_contexts
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.31.
ACPI: PCI Interrupt Link [APCH] enabled at IRQ 21
ACPI: PCI interrupt 0000:00:0a.0[A] -> GSI 21 (level, low) -> IRQ 193
PCI: Setting latency timer of device 0000:00:0a.0 to 64
eth0: forcedeth.c: subsystem: 01043:8141 bound to 0000:00:0a.0
ACPI: PCI Interrupt Link [APC2] enabled at IRQ 17
ACPI: PCI interrupt 0000:05:0c.0[A] -> GSI 17 (level, low) -> IRQ 201
ACPI: PCI interrupt 0000:05:0c.0[A] -> GSI 17 (level, low) -> IRQ 201
eth1: Yukon Gigabit Ethernet 10/100/1000Base-T Adapter
PrefPort:A RlmtMode:Check Link State
ip_tables: (C) 2000-2002 Netfilter core team
eth0: no link during initialization.
ACPI: PCI interrupt 0000:05:07.0[A] -> GSI 17 (level, low) -> IRQ 201
ice1724: Invalid EEPROM version 1
ACPI: PCI Interrupt Link [APCL] enabled at IRQ 20
ACPI: PCI interrupt 0000:00:02.1[B] -> GSI 20 (level, low) -> IRQ 209
ehci_hcd 0000:00:02.1: EHCI Host Controller
PCI: Setting latency timer of device 0000:00:02.1 to 64
ehci_hcd 0000:00:02.1: irq 209, pci mem 0xd2004000
ehci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 1
eth0: link up.
PCI: cache line size of 64 is not supported by device 0000:00:02.1
ehci_hcd 0000:00:02.1: park 0
ehci_hcd 0000:00:02.1: USB 2.0 initialized, EHCI 1.00, driver 10 Dec 2004
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 10 ports detected
ohci_hcd: 2004 Nov 08 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
ACPI: PCI Interrupt Link [APCF] enabled at IRQ 23
ACPI: PCI interrupt 0000:00:02.0[A] -> GSI 23 (level, low) -> IRQ 177
ohci_hcd 0000:00:02.0: OHCI Host Controller
PCI: Setting latency timer of device 0000:00:02.0 to 64
ohci_hcd 0000:00:02.0: irq 177, pci mem 0xd2003000
ohci_hcd 0000:00:02.0: new USB bus registered, assigned bus number 2
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 10 ports detected
ieee1394: Initialized config rom entry `ip1394'
ohci1394: $Rev: 1223 $ Ben Collins <[email protected]>
ACPI: PCI Interrupt Link [APC1] enabled at IRQ 16
ACPI: PCI interrupt 0000:05:0b.0[A] -> GSI 16 (level, low) -> IRQ 217
ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[217] MMIO=[d1008000-d10087ff] Max Packet=[2048]
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
usb 1-5: new high speed USB device using ehci_hcd and address 4
usb 2-1: new low speed USB device using ohci_hcd and address 2
input: USB HID v1.11 Mouse [Microsoft Microsoft Wireless Optical Mouse? 1.0A] on usb-0000:00:02.0-1
usb 2-4: new full speed USB device using ohci_hcd and address 3
hub 2-4:1.0: USB hub found
hub 2-4:1.0: 4 ports detected
ieee1394: Host added: ID:BUS[0-00:1023] GUID[0011d80000007098]
usb 2-9: new low speed USB device using ohci_hcd and address 4
Initializing USB Mass Storage driver...
ACPI: Power Button (FF) [PWRF]
ibm_acpi: ec object not found
usb 2-9: 05-wait_for_sys timed out on ep0in
EXT3 FS on sdb6, internal journal
device-mapper: 4.4.0-ioctl (2005-01-12) initialised: [email protected]
cdrom: open failed.
cdrom: open failed.
kjournald starting. Commit interval 5 seconds
EXT3 FS on sdb3, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev sdb3, type ext3), uses xattr
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
SELinux: initialized (dev sdb2, type vfat), uses genfs_contexts
NTFS driver 2.1.22 [Flags: R/O MODULE].
NTFS volume version 3.1.
SELinux: initialized (dev sda1, type ntfs), uses genfs_contexts
NTFS volume version 3.1.
SELinux: initialized (dev sdb1, type ntfs), uses genfs_contexts
Adding 1052216k swap on /dev/sdb5. Priority:-1 extents:1
SELinux: initialized (dev binfmt_misc, type binfmt_misc), uses genfs_contexts


--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2005-03-31 07:30:27

by Denis Vlasenko

Subject: Re: AMD64 Machine hardlocks when using memset

On Thursday 31 March 2005 07:38, Robert Hancock wrote:
> Philip Lawatsch wrote:
> > Hi,
> >
> >
> > I do have a very strange problem:
> >
> > If I memset a ~1meg buffer some thousand times (in the userspace) it
> > will hardlock my machine.
>
> I thought that this must be impossible, but I tried it on my machine
> which is very similar (Asus A8N-SLI, Athlon 64 3500+, 2GB RAM) and to my
> surprise it breaks on mine too with kernel 2.6.11. I tested using the
> program below. After about a minute or so of this, the machine either
> locked hard or rebooted spontaneously. When it locked, there was no oops
> message, the NMI watchdog was not triggered and there was no response to
> SysRq commands. (I tested it with and without the NVIDIA module loaded.)
>
> This seems pretty terrible, a perfectly legal program running as a
> normal user is hard-locking the machine. Anyone have any suggestions to
> debug this? Also, can somebody else on an x86_64 try and duplicate this?
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main( int argc, char* argv[] )
> {
> char* test = malloc(512*1024*1024);
> int i;
> for( i=0; i<1000000; i++ )
> {
> memset( test, 0, 512*1024*1024);
> }
> free(test);
> return 0;
> }

This reminds me of a VIA northbridge problem where the BIOS enabled
a feature which was experimental and turned out to be buggy.
It caused oopses ONLY on K7-optimized kernels because
of the movntq stores they use. Those seem to put an awful lot of writes
on the bus.
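
To illustrate the kind of store being discussed, here is a minimal sketch of a buffer fill that uses non-temporal (streaming) stores. It is an assumption-laden example, not code from the K7 kernels mentioned above: it uses the SSE2 intrinsic _mm_stream_si128, which emits movntdq (a close relative of movntq), and the name fill_streaming and its alignment requirements are made up for illustration. On targets without SSE2 it falls back to memset.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Fill `len` bytes at `dst` with `byte`, using streaming stores where SSE2
 * is available. Assumes dst is 16-byte aligned and len is a multiple of 16;
 * returns -1 if either assumption does not hold, 0 otherwise. */
int fill_streaming(void *dst, int byte, size_t len)
{
    if (((uintptr_t)dst & 15) != 0 || (len & 15) != 0)
        return -1;
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi8((char)byte);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(p + i, v);  /* movntdq: write that bypasses the cache */
    _mm_sfence();                    /* order the streamed stores before returning */
#else
    memset(dst, byte, len);          /* fallback on targets without SSE2 */
#endif
    return 0;
}
```

Because these stores do not go through the cache hierarchy, a long run of them goes straight out to memory, which fits the remark about heavy write traffic on the bus.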
--
vda

2005-03-31 07:46:06

by Paul Jackson

Subject: Re: AMD64 Machine hardlocks when using memset

Yup - kills my x86_64 too. I can't stay up for half a minute.
I got a couple of oopses:

Unable to handle kernel paging request at 0000000000002730 RIP:

Unable to handle kernel paging request at ffff81773ffc6918 RIP:

The first try ended with a sudden reboot. The second time, I ctrl-C'd
out while I still had a responsive system.

I thought it might be a CPU temperature issue, so downloaded XMBmon
"Mother Board Monitor Program for X Window System", and hacked the
command line mbmon in it to add this memset loop and report the CPU temp
each time around the loop.

My CPU Temp went from its usual 39 C idle, to 45 C during the memset
loop, which are typical temperatures for this PC. No problem there.

In a couple more tries, I got:
knotify killed with a SIGSEGV
artsd killed with a SIGSEGV
a hard lockup, requiring the big red button
a second oops at the same ffff81773ffc6918 as above.

My CPU, from /proc/cpuinfo, is:
model name : AMD Athlon(tm) 64 Processor 3500+

My mainboard is an MSI K8N Neo2 Platinum. I have 1 GByte of
Corsair XMS DDR400 memory.

I am not overclocking and I am running with standard voltages.

This is on a 2.6.11-rc5 kernel, though I doubt that matters.
I'm guessing it's hardware.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-31 08:15:48

by Paul Jackson

Subject: Re: AMD64 Machine hardlocks when using memset

Denis wrote:
> This reminds me on VIA northbridge problem when BIOS enabled
> a feature which was experimental and turned out to be buggy.

You were close!

I changed my Memory Timing from 1T to 2T, and now it is as solid as a
rock. It has been up 7 minutes as I type this, without a hiccup.

Notice this comment, at http://www.vr-zone.com.sg/?i=1641&p=1&s=0

Well as most Athlon 64 users know, 1T setting improves performance quite
significantly over 2T, but it is also very taxing on the memory and
quite a hit-and-miss when matching different memory with different
boards. From some users' feedback, the Asus A8N SLI can be a little
picky with 1T setting when overclocking, so results might be a little
better with other boards.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-31 09:40:27

by Philip Lawatsch

Subject: Re: AMD64 Machine hardlocks when using memset

Paul Jackson wrote:
> Denis wrote:
>
>>This reminds me on VIA northbridge problem when BIOS enabled
>>a feature which was experimental and turned out to be buggy.
>
>
> You were close!
>
> I changed my Memory Timing from 1T to 2T, and now it is as solid as a
> rock. It has been up 7 minutes as I type this, without a hiccup.
>
> Notice this comment, at http://www.vr-zone.com.sg/?i=1641&p=1&s=0
>
> Well as most Athlon 64 users know, 1T setting improves performance quite
> significantly over 2T, but it is also very taxing on the memory and
> quite a hit-and-miss when matching different memory with different
> boards. From some users' feedback, the Asus A8N SLI can be a little
> picky with 1T setting when overclocking, so results might be a little
> better with other boards.
>

I've now tried the most conservative settings available. The 32 bit
kernel now hangs after about 150000 iterations (compared to about 16000
before), but the 64 bit kernel still hangs after about 5000.
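
Iteration counts like these are only recoverable after a hard lock if they reach the disk before the machine dies. A minimal sketch of one way to record them, assuming a POSIX system, is shown here; the function name run_logged, its parameters and the log format are hypothetical and are not taken from the test program used in this thread.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: memset a buffer `iterations` times and append the
 * pass number to `log_path` every `report_every` passes, calling fsync()
 * so the last value survives a hard lock. Returns the number of passes
 * completed, or -1 if setup fails. */
long run_logged(const char *log_path, size_t size, long iterations,
                long report_every)
{
    if (size == 0 || report_every <= 0)
        return -1;

    char *buf = malloc(size);
    if (buf == NULL)
        return -1;

    int fd = open(log_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    long done = 0;
    for (long i = 1; i <= iterations; i++) {
        memset(buf, (int)(i & 0xff), size);
        done = i;
        if (i % report_every == 0) {
            char line[32];
            int n = snprintf(line, sizeof line, "%ld\n", i);
            /* Stop early if the count cannot be written and flushed. */
            if (write(fd, line, (size_t)n) != n || fsync(fd) != 0)
                break;
        }
    }

    close(fd);
    free(buf);
    return done;
}
```

After a lockup and reboot, the last line of the log gives a lower bound on the number of passes completed, accurate to within report_every. Each fsync() costs time, so a small report_every slows the loop and may change how soon the failure shows up.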

After a ~12 hour memtest86 run, memtest86 crashed (!), filling the
console with some garbage characters and then hanging.

This is driving me crazy.

IMO memtest86 should not hang unless something screws up the memory area
it is loaded into.

I've also tried the newest beta BIOS for the board now; it didn't change
anything.

kind regards Philip

2005-03-31 10:16:23

by Paul Jackson

Subject: Re: AMD64 Machine hardlocks when using memset

Your problem is almost certainly in the hardware area (cpu, bios,
memory, power, northbridge, motherboard, cooling or thereabouts).

> Imo memtest86 should not hang onless something screws up [its] memory area

There is nothing else running when memtest runs. You cannot assume
that your hardware is operating like a sane digital computer when
memtest hangs - the magic of zeros, ones and instruction set
architectures is coming unglued and you are getting a glimpse of the
ugliness that is usually hidden behind the curtain.

Good luck fixing it.

LKML is probably not the place to continue to analyze this, now that
you've recreated it with memtest as well.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-31 11:40:35

by Mikael Pettersson

Subject: Re: AMD64 Machine hardlocks when using memset

Paul Jackson writes:
> Yup - kills my x86_64 too. I can't stay up for half a minute.
...
> My mainboard is an MSI K8N Neo2 Platinum.

I've tested both versions of the test program on two Athlon64 boxes,
and neither has had any problems with them.

My two machines are both VIA K8T800-based (a desktop and a laptop),
but it seems those of you who had problems have nForce-based machines.
So presumably it's either the nForce chipset or your memory timings are
out of spec.

/Mikael

2005-03-31 14:45:20

by Stelian Pop

Subject: Re: AMD64 Machine hardlocks when using memset

On Thu, Mar 31, 2005 at 12:04:59AM +0200, Philip Lawatsch wrote:

> I do have a very strange problem:
>
> If I memset a ~1meg buffer some thousand times (in the userspace) it
> will hardlock my machine.
>
> I've been using 2.6.12-rc1 and also a lot of other kernels (2.6.9,
> 2.6.11). I've tried it both using a 32 bit kernel and a 64 bit kernel.
> When running on the 32 bit kernel the machine hardlocks after about
> 15000 iterations, on a 64 bit kernel the machine hardlocks after about
> 5000 (the 64 bit system has nearly no background jobs running).
>
> I've been running memcheck for several hours now but nothing did show up.
>
>
> I've got an Asus A8N-SLI board with 2 gigs of memory and an AMD 3500+ CPU.
>
> The 64 bit kernel was compiled using gcc 3.4.3 and the 32 bit kernel
> using 3.3.5.
[...]

> powernow-k8: Found 1 AMD Athlon 64 / Opteron processors (version 1.00.09e)
> powernow-k8: 0 : fid 0xe (2200 MHz), vid 0x6 (1400 mV)
> powernow-k8: 1 : fid 0xc (2000 MHz), vid 0x8 (1350 mV)
> powernow-k8: 2 : fid 0xa (1800 MHz), vid 0xa (1300 mV)
> powernow-k8: 3 : fid 0x2 (1000 MHz), vid 0x12 (1100 mV)
> cpu_init done, current fid 0xe, vid 0x6

Just a thought: does deactivating cpufreq change anything?

I haven't tested your program yet, but on my Asus K8NE-Deluxe very
strange things happen if cpufreq/powernow is activated *and*
the cpu frequency is changed...
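
One way to see whether frequency scaling is active while the test runs is to read the cpufreq files under sysfs. The sketch below assumes a kernel that exposes /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor and scaling_cur_freq; whether those files exist depends on the kernel configuration and the loaded driver, and the helper name read_first_line is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of `path` into `out` (at most out_len - 1 characters),
 * stripping the trailing newline. Returns 0 on success, -1 on failure,
 * for example when the file does not exist because cpufreq is not active. */
int read_first_line(const char *path, char *out, size_t out_len)
{
    if (out == NULL || out_len == 0)
        return -1;

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    if (fgets(out, (int)out_len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    out[strcspn(out, "\n")] = '\0';
    return 0;
}
```

Reading scaling_governor before the memset loop and scaling_cur_freq during it would show whether the frequency changes while the test is running; a return value of -1 for both paths suggests cpufreq is not in use.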

Stelian.
--
Stelian Pop <[email protected]>

2005-03-31 18:47:45

by Paul Jackson

Subject: Re: AMD64 Machine hardlocks when using memset

> your memory timings are out of spec.

I don't know what spec applies here, don't really care.
But when I backed off my Memory Timing from 1T to 2T,
my box became stable running this memset() test.

So I am a happy camper, grateful that someone posted
this nice test, and agree with you that it was a memory
timing issue, at least for my system.

Apparently Philip's box has additional "issues". Whatever.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-01 01:37:01

by Robert Hancock

Subject: Re: AMD64 Machine hardlocks when using memset

Philip Lawatsch wrote:
> I've now tried the most conservative settings available. The 32 bit
> kernel now hangs after about 150000 Iterations (compared to about 16000
> before) but the 64 bit kernel still hangs after about 5000.

I'm still seeing this on my system as well, using the most conservative
timings possible (DDR200, all delay parameters except the refresh time
set to the largest possible value) as well as DDR333 with the same
timings and DDR400 with everything set to auto. I also tried the kernel
on the Fedora Core 3 rescue disc (same crash) and in single user mode
(same crash).

So far, the crashes have consisted of either a hang, reboot or panic.
One panic was a "spinlock already locked at kernel/module.c:2022" error.
The other one is below, for what it's worth:

Mar 31 18:55:43 Newcastle kernel: Unable to handle kernel paging request at ffff8100588f5000 RIP:
Mar 31 18:55:43 Newcastle kernel: <ffffffff80236ac7>{clear_page+7}
Mar 31 18:55:43 Newcastle kernel: PGD 8063 PUD a063 PMD 0
Mar 31 18:55:43 Newcastle kernel: Oops: 0002 [1]
Mar 31 18:55:43 Newcastle kernel: CPU 0
Mar 31 18:55:43 Newcastle kernel: Modules linked in: md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) autofs4(U) it87(U) i2c_sensor(U) i2c_isa(U) i2c_dev(U) i2c_core(U) sunrpc(U) pcmcia(U) yenta_socket(U) rsrc_nonstatic(U) pcmcia_core(U) joydev(U) nls_utf8(U) ntfs(U) vfat(U) fat(U) dm_mod(U) video(U) button(U) battery(U) ac(U) usb_storage(U) ohci1394(U) ieee1394(U) ohci_hcd(U) ehci_hcd(U) snd_ice1724(U) snd_ice17xx_ak4xxx(U) snd_ac97_codec(U) snd_pcm_oss(U) snd_mixer_oss(U) snd_pcm(U) snd_timer(U) snd_page_alloc(U) snd_ak4xxx_adda(U) snd_mpu401_uart(U) snd_rawmidi(U) snd_seq_device(U) snd(U) soundcore(U) forcedeth(U) floppy(U) ext3(U) jbd(U) sata_nv(U) libata(U) sd_mod(U) scsi_mod(U)
Mar 31 18:55:43 Newcastle kernel: Pid: 4928, comm: crashtest Not tainted 2.6.11-1.7_FC3custom
Mar 31 18:55:43 Newcastle kernel: RIP: 0010:[<ffffffff80236ac7>] <ffffffff80236ac7>{clear_page+7}
Mar 31 18:55:43 Newcastle kernel: RSP: 0000:ffff810078299ca0 EFLAGS: 00010246
Mar 31 18:55:43 Newcastle kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000200
Mar 31 18:55:43 Newcastle kernel: RDX: ffffffff80478940 RSI: 0000000000000000 RDI: ffff8100588f5000
Mar 31 18:55:43 Newcastle kernel: RBP: ffff81000235f5d0 R08: 0000000000000000 R09: 0000000000000000
Mar 31 18:55:43 Newcastle kernel: R10: 00000000000552fa R11: 0000000000000000 R12: ffff810000000000
Mar 31 18:55:43 Newcastle kernel: R13: ffff81000235f598 R14: 6db6db6db6db6db7 R15: 0000000000000000
Mar 31 18:55:43 Newcastle kernel: FS: 00002aaaaaabeb00(0000) GS:ffffffff80552300(0000) knlGS:0000000000000000
Mar 31 18:55:43 Newcastle kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 31 18:55:43 Newcastle kernel: CR2: ffff8100588f5000 CR3: 00000000791b0000 CR4: 00000000000006e0
Mar 31 18:55:43 Newcastle kernel: Process crashtest (pid: 4928, threadinfo ffff810078298000, task ffff8100788d67e0)
Mar 31 18:55:43 Newcastle kernel: Stack: ffffffff80170bc2 0000000000000019 0000000000000286 0000000000000000
Mar 31 18:55:43 Newcastle kernel: 000000000000000a 000080d20000000a 0000000000000286 0000000000000256
Mar 31 18:55:43 Newcastle kernel: ffffffff80478bc0 0000000000000000
Mar 31 18:55:43 Newcastle kernel: Call Trace:<ffffffff80170bc2>{buffered_rmqueue+1154} <ffffffff80170dac>{__alloc_pages+220}
Mar 31 18:55:43 Newcastle kernel: <ffffffff80181c52>{do_no_page+370} <ffffffff801825c0>{handle_mm_fault+560}
Mar 31 18:55:43 Newcastle kernel: <ffffffff80284f9c>{write_chan+860} <ffffffff80123834>{do_page_fault+1044}
Mar 31 18:55:43 Newcastle kernel: <ffffffff803a3699>{thread_return+41} <ffffffff8010f58d>{error_exit+0}
Mar 31 18:55:43 Newcastle kernel:
Mar 31 18:55:43 Newcastle kernel:
Mar 31 18:55:43 Newcastle kernel: Code: f3 48 ab c3 66 66 66 90 66 66 66 90 66 66 66 90 66 66 66 90
Mar 31 18:55:43 Newcastle kernel: RIP <ffffffff80236ac7>{clear_page+7} RSP <ffff810078299ca0>
Mar 31 18:55:43 Newcastle kernel: CR2: ffff8100588f5000


--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/


2005-04-01 04:38:25

by Robert Hancock

Subject: Re: AMD64 Machine hardlocks when using memset

Stelian Pop wrote:
> Just a thought: does deactivating cpufreq change anything ?
>
> I haven't tested yet your program, but on my Asus K8NE-Deluxe very
> strange things happen if cpufreq/powernow is activated *and*
> the cpu frequency is changed...

Didn't change anything for me, I tried deactivating cpufreq, still
crashes when I run that test program.

This is getting pretty ridiculous.. I've tried memory timings down to
the slowest possible, ran Memtest86 for 4 passes with no errors, and
it's been stable in Windows for a few months now. Still something is
blowing up in Linux with this test though..

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2005-04-01 10:41:34

by Denis Vlasenko

Subject: Re: AMD64 Machine hardlocks when using memset

On Friday 01 April 2005 07:37, Robert Hancock wrote:
> Stelian Pop wrote:
> > Just a thought: does deactivating cpufreq change anything ?
> >
> > I haven't tested yet your program, but on my Asus K8NE-Deluxe very
> > strange things happen if cpufreq/powernow is activated *and*
> > the cpu frequency is changed...
>
> Didn't change anything for me, I tried deactivating cpufreq, still
> crashes when I run that test program.
>
> This is getting pretty ridiculous.. I've tried memory timings down to
> the slowest possible, ran Memtest86 for 4 passes with no errors, and
> it's been stable in Windows for a few months now. Still something is
> blowing up in Linux with this test though..

If you want to dig deeper, go to assembler level.
That is, instead of using memset(), disassemble
your program and make your own

void my_memset(...)
{
asm volatile(/* code sequence from your crashing prog*/);
}

and use that in your memsetting loop. Sure, it won't change anything,
but:

a) we will know exactly which instruction sequence drives
your CPU/chipset crazy
b) others can try to reproduce without danger of memset being
implemented differently on their particular version of gcc/glibc/whatever
c) you can try other memsets in order to learn more about this bug
(for example, if inserting some NOPs in the my_memset body
makes the bug disappear, that will definitely point towards a
defective/overheating CPU, etc.)
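
A minimal sketch of such a my_memset, assuming GCC-style inline asm on
x86-64. The "rep stosq" body is only a placeholder (an assumption, not
the sequence from the crashing program); the name my_memset64 and its
signature are hypothetical. Swap in whatever the disassembly shows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill `count` 64-bit words at `dst` with `value`.
 * The asm body is a placeholder: replace it with the exact instruction
 * sequence disassembled from the crashing program. */
void my_memset64(uint64_t *dst, uint64_t value, size_t count)
{
    assert(((uintptr_t)dst & 7) == 0);   /* expect an 8-byte aligned buffer */
#if defined(__x86_64__) && defined(__GNUC__)
    /* rep stosq stores RAX at [RDI] RCX times, updating RDI and RCX. */
    __asm__ volatile("rep stosq"
                     : "+D"(dst), "+c"(count)
                     : "a"(value)
                     : "memory");
#else
    /* Portable fallback so the sketch still builds on other targets. */
    while (count--)
        *dst++ = value;
#endif
}
```

Calling this from the memsetting loop instead of memset() pins down the
instruction sequence regardless of the gcc/glibc version in use.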
--
vda

2005-04-01 17:27:39

by Ray Lee

Subject: Re: AMD64 Machine hardlocks when using memset

On Thu, 2005-03-31 at 22:37 -0600, Robert Hancock wrote:
> This is getting pretty ridiculous.. I've tried memory timings down to
> the slowest possible, ran Memtest86 for 4 passes with no errors, and
> it's been stable in Windows for a few months now. Still something is
> blowing up in Linux with this test though..

Have you run the same memset test under windows?

I've traced a lot of oddball problems down to bad or marginal power
supplies.

Ray

2005-04-01 19:15:19

by Philip Lawatsch

Subject: Re: AMD64 Machine hardlocks when using memset

Ray Lee wrote:
> On Thu, 2005-03-31 at 22:37 -0600, Robert Hancock wrote:
>
>>This is getting pretty ridiculous.. I've tried memory timings down to
>>the slowest possible, ran Memtest86 for 4 passes with no errors, and
>>it's been stable in Windows for a few months now. Still something is
>>blowing up in Linux with this test though..
>
>
> Have you run the same memset test under windows?
>
> I've traced a lot of oddball problems down to bad or marginal power
> supplies.

So far I've tried 2 PSUs and 3 different brands of memory.

No differences. And due to a lack of Windows I can't really test it.

I'll try a different (not based on nforce 4) motherboard now.


kind regards Philip

2005-04-02 02:32:34

by Robert Hancock

Subject: Re: AMD64 Machine hardlocks when using memset

Ray Lee wrote:
> On Thu, 2005-03-31 at 22:37 -0600, Robert Hancock wrote:
>
>>This is getting pretty ridiculous.. I've tried memory timings down to
>>the slowest possible, ran Memtest86 for 4 passes with no errors, and
>>it's been stable in Windows for a few months now. Still something is
>>blowing up in Linux with this test though..
>
>
> Have you run the same memset test under windows?
>
> I've traced a lot of oddball problems down to bad or marginal power
> supplies.

I've now built a similar test program for Windows. I've let it run over
2000 iterations of 512MB memsets with no problems. On Linux it usually
blew up with under 200 iterations. It does run visibly slower than the
Linux version though - this is after all 32 bit Windows and it was
compiled with crufty old Visual C++ 6.0 so it is probably not that
optimized for this CPU. I will see if I can get a more optimized build
of this to try in Mingw32 or something.. after all if it's related to
some instruction combination or something it may not show up in the
build I have.
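
For reference, a rough sketch of the kind of loop being described:
repeated memsets over one large buffer. The actual test program is not
reproduced here, so the function name, its parameters and the read-back
check are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stress loop: memset one buffer `iterations` times.
 * Returns the number of completed iterations, or -1 if malloc fails. */
long stress_memset(size_t bytes, long iterations)
{
    assert(bytes > 0 && iterations >= 0);

    unsigned char *buf = malloc(bytes);
    if (buf == NULL)
        return -1;

    long done = 0;
    for (long i = 0; i < iterations; i++) {
        unsigned char pattern = (unsigned char)(i & 0xff);
        memset(buf, pattern, bytes);
        /* Volatile read-back so the fill cannot be optimized away. */
        if (((volatile unsigned char *)buf)[bytes - 1] != pattern)
            break;
        done++;
    }
    free(buf);
    return done;
}
```

The run described above would correspond to something like
stress_memset(512UL << 20, 2000) on a 64-bit build.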

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2005-04-02 04:12:31

by Paul Jackson

Subject: Re: AMD64 Machine hardlocks when using memset

Robert wrote:
> It does run visibly slower

The x86_64 memset(), both in user space and the kernel, for whatever gcc
I have, and for a current kernel, uses the "repz stos" or "rep stosq"
prefixed instruction for the bulk of the copy. This combination is a
long running, interruptible Intel string instruction that loops on
itself until the CX register decrements to zero.
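
To make the "interruptible" part concrete, here is a small C model of
what the instruction does architecturally. It is only a sketch: the
struct and function names are hypothetical, and a real CPU may batch
stores internally. The point is that all progress lives in the
registers, so an interrupt can be taken between iterations and the
instruction resumed afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the registers "rep stosq" operates on. */
struct stos_regs {
    uint64_t *rdi;   /* destination pointer, advanced after each store */
    uint64_t  rax;   /* value to store */
    size_t    rcx;   /* remaining iteration count */
};

/* Perform one iteration. Returns 1 if more iterations remain,
 * 0 once the count has reached zero. */
int stos_step(struct stos_regs *r)
{
    assert(r != NULL);
    if (r->rcx == 0)
        return 0;
    *r->rdi++ = r->rax;
    r->rcx--;
    return r->rcx != 0;
}

/* Run to completion, as happens when no interrupt intervenes. */
void stos_run(struct stos_regs *r)
{
    while (stos_step(r))
        ;
}
```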

Was your windows app using "stos"?

I'll wager a nickel that the actual crash you see comes when the
processor has to handle an interrupt while in the middle of this
instruction.

I'll wager a dime it's hardware, though interrupt activity may be
required to provoke it.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-02 04:51:43

by Robert Hancock

Subject: Re: AMD64 Machine hardlocks when using memset

Paul Jackson wrote:
> The x86_64 memset(), both in user space and the kernel, for whatever gcc
> I have, and for a current kernel, uses the "repz stos" or "rep stosq"
> prefixed instruction for the bulk of the copy. This combination is a
> long running, interruptible Intel string instruction that loops on
> itself until the CX register decrements to zero.
>
> Was your windows app using "stos"?
>
> I'll wager a nickel that the actual crash you see comes when the
> processor has to handle an interrupt while in the middle of this
> instruction.
>
> I'll wager a dime it's hardware, though interrupt activity may be
> required to provoke it.

I ended up making a test program which essentially did the same thing
except not using memset (just moving an int* up repeatedly and setting
the value there to 0). That worked fine on both Windows and Linux. I
then tried such a program using a long* compiled as 64-bit on Linux,
that also worked fine. It seems like I can only reproduce it when memset
is actually used..
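
A sketch of that kind of pointer-walking fill, for comparison. The
function names are hypothetical, and the volatile qualifier is an
addition of this sketch: without it an optimizing compiler is free to
turn the loop back into a memset() call, which would defeat the purpose
of the experiment.

```c
#include <assert.h>
#include <stddef.h>

/* Zero `count` ints with one ordinary store per element. */
void fill_ints(int *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    volatile int *p = buf;   /* volatile: keep the stores as written */
    for (size_t i = 0; i < count; i++)
        p[i] = 0;
}

/* Same loop with long, which is 64 bits wide on x86-64 Linux. */
void fill_longs(long *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    volatile long *p = buf;
    for (size_t i = 0; i < count; i++)
        p[i] = 0;
}
```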

I don't remember exactly what the Windows memset was using, that was on
my work machine - it was inline assembly though, and I do know that it
had only one instruction for the whole set, so it was likely "repz stos"
or something similar to that.

As it turns out, the memset in my version of glibc x86_64 is not using
such a string instruction though - it seems to be using two different
sets of instructions depending on the size of the memset (not sure
exactly how they're calculating the threshold between these..) For sizes
below the threshold, this is the inner loop - it's using normal mov
instructions:

3: /* Copy 64 bytes. */
mov %r8,(%rcx)
mov %r8,0x8(%rcx)
mov %r8,0x10(%rcx)
mov %r8,0x18(%rcx)
mov %r8,0x20(%rcx)
mov %r8,0x28(%rcx)
mov %r8,0x30(%rcx)
mov %r8,0x38(%rcx)
add $0x40,%rcx
dec %rax
jne 3b

For sizes above the threshold though, this is the inner loop. It's using
movnti, which is an SSE2 cache-bypassing store:

11: /* Copy 64 bytes without polluting the cache. */
/* We could use movntdq %xmm0,(%rcx) here to further
speed up for large cases but let's not use XMM registers. */
movnti %r8,(%rcx)
movnti %r8,0x8(%rcx)
movnti %r8,0x10(%rcx)
movnti %r8,0x18(%rcx)
movnti %r8,0x20(%rcx)
movnti %r8,0x28(%rcx)
movnti %r8,0x30(%rcx)
movnti %r8,0x38(%rcx)
add $0x40,%rcx
dec %rax
jne 11b

I'm wondering if one does a ton of these cache-bypassing stores whether
something gets hosed because of that. Not sure what that could be
though. I don't imagine the chipset is involved with any of that on the
Athlon 64 - either the CPU or RAM seems the most likely suspect to me.
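
A minimal sketch for exercising the cache-bypassing store on its own
from C, assuming GCC on x86-64 and the _mm_stream_si64 intrinsic from
<emmintrin.h>, which compiles to movnti. The function name fill_nt64 is
hypothetical, and this is not glibc's code, just the same kind of store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si64, _mm_sfence */
#endif

/* Fill `count` 64-bit words at `dst` using non-temporal stores. */
void fill_nt64(long long *dst, long long value, size_t count)
{
    assert(((uintptr_t)dst & 7) == 0);   /* expect 8-byte alignment */
#if defined(__x86_64__) && defined(__SSE2__)
    for (size_t i = 0; i < count; i++)
        _mm_stream_si64(&dst[i], value);   /* emits movnti */
    _mm_sfence();   /* order the streaming stores before later accesses */
#else
    /* Fallback with ordinary stores so the sketch builds elsewhere. */
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
#endif
}
```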

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2005-04-04 10:48:10

by Alan

Subject: Re: AMD64 Machine hardlocks when using memset

On Sad, 2005-04-02 at 05:50, Robert Hancock wrote:
> I'm wondering if one does a ton of these cache-bypassing stores whether
> something gets hosed because of that. Not sure what that could be
> though. I don't imagine the chipset is involved with any of that on the
> Athlon 64 - either the CPU or RAM seems the most likely suspect to me

The glibc version is essentially the "perfect" copy function for the
CPU. If you have any bus/memory problems or chipset bugs it will bite
you.

Alan

2005-04-06 04:06:13

by Robert Hancock

Subject: Re: AMD64 Machine hardlocks when using memset

Alan Cox wrote:
> On Sad, 2005-04-02 at 05:50, Robert Hancock wrote:
>
>>I'm wondering if one does a ton of these cache-bypassing stores whether
>>something gets hosed because of that. Not sure what that could be
>>though. I don't imagine the chipset is involved with any of that on the
>>Athlon 64 - either the CPU or RAM seems the most likely suspect to me
>
>
> The glibc version is essentially the "perfect" copy function for the
> CPU. If you have any bus/memory problems or chipset bugs it will bite
> you.

Anyone have any suggestions on how to track this further? It seems
fairly clear what circumstances are causing it, but as for figuring out
what's at fault..

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2005-04-06 07:01:53

by Rafael J. Wysocki

Subject: Re: AMD64 Machine hardlocks when using memset

Hi,

On Wednesday, 6 of April 2005 06:05, Robert Hancock wrote:
> Alan Cox wrote:
> > On Sad, 2005-04-02 at 05:50, Robert Hancock wrote:
> >
> >>I'm wondering if one does a ton of these cache-bypassing stores whether
> >>something gets hosed because of that. Not sure what that could be
> >>though. I don't imagine the chipset is involved with any of that on the
> >>Athlon 64 - either the CPU or RAM seems the most likely suspect to me
> >
> >
> > The glibc version is essentially the "perfect" copy function for the
> > CPU. If you have any bus/memory problems or chipset bugs it will bite
> > you.
>
> Anyone have any suggestions on how to track this further? It seems
> fairly clear what circumstances are causing it, but as for figuring out
> what's at fault..

Well, I would start from changing memory modules.

Greets,
Rafael


--
- Would you tell me, please, which way I ought to go from here?
- That depends a good deal on where you want to get to.
-- Lewis Carroll "Alice's Adventures in Wonderland"

2005-04-06 08:47:40

by Denis Vlasenko

Subject: Re: AMD64 Machine hardlocks when using memset

[disregard my previous mail. I should have read the whole thread first]

On Saturday 02 April 2005 07:50, Robert Hancock wrote:
> As it turns out, the memset in my version of glibc x86_64 is not using
> such a string instruction though - it seems to be using two different
> sets of instructions depending on the size of the memset (not sure
> exactly how they're calculating the threshold between these..) For sizes
> below the threshold, this is the inner loop - it's using normal mov
> instructions:
>
> 3: /* Copy 64 bytes. */
> mov %r8,(%rcx)
> mov %r8,0x8(%rcx)
> mov %r8,0x10(%rcx)
> mov %r8,0x18(%rcx)
> mov %r8,0x20(%rcx)
> mov %r8,0x28(%rcx)
> mov %r8,0x30(%rcx)
> mov %r8,0x38(%rcx)
> add $0x40,%rcx
> dec %rax
> jne 3b
>
> For sizes above the threshold though, this is the inner loop. It's using
> movnti, which is an SSE2 cache-bypassing store:
>
> 11: /* Copy 64 bytes without polluting the cache. */
> /* We could use movntdq %xmm0,(%rcx) here to further
> speed up for large cases but let's not use XMM registers. */
> movnti %r8,(%rcx)
> movnti %r8,0x8(%rcx)
> movnti %r8,0x10(%rcx)
> movnti %r8,0x18(%rcx)
> movnti %r8,0x20(%rcx)
> movnti %r8,0x28(%rcx)
> movnti %r8,0x30(%rcx)
> movnti %r8,0x38(%rcx)
> add $0x40,%rcx
> dec %rax
> jne 11b

This is a very rarely used instruction. People either do
plain old rep stosl or do 3DNOW or SSE2 non-temporal stores.

Maybe movnti is different (buggy?) in a subtle way.

Does it blow up if you use 3DNOW or SSE2 non-temporal stores?
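
One way to try the SSE2 variant is sketched below, assuming GCC and the
_mm_stream_si128 intrinsic (movntdq) from <emmintrin.h>. The function
name fill_nt128 is hypothetical. Unlike movnti, this store needs a
16-byte aligned destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si128, _mm_set1_epi64x, _mm_sfence */
#endif

/* Fill `bytes` bytes at `dst` with a repeated 64-bit `value`,
 * using 128-bit non-temporal stores. */
void fill_nt128(void *dst, long long value, size_t bytes)
{
    assert(((uintptr_t)dst & 15) == 0);   /* movntdq needs 16-byte alignment */
    assert(bytes % 16 == 0);
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi64x(value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(&p[i], v);   /* emits movntdq */
    _mm_sfence();
#else
    /* Fallback with ordinary stores so the sketch builds elsewhere. */
    long long *p = (long long *)dst;
    for (size_t i = 0; i < bytes / 8; i++)
        p[i] = value;
#endif
}
```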

If yes, then try a different BIOS (the latest is not necessarily the best).
BTW, the 'Athlon bug' was tracked down similarly. A new BIOS enabled a
buggy chipset feature - BOOM! Non-temporals killed the box
(it took several months to figure out back then).
--
vda

2005-04-06 09:31:18

by Philip Lawatsch

Subject: Re: AMD64 Machine hardlocks when using memset

Rafael J. Wysocki wrote:

>>Anyone have any suggestions on how to track this further? It seems
>>fairly clear what circumstances are causing it, but as for figuring out
>>what's at fault..
>
>
> Well, I would start from changing memory modules.

As I wrote earlier, I tried 4 different (but same brand) modules, 2
Infineon and 2 Samsung ones. No difference.

Btw, I've been working (stressing) the machine for one week now and
never had any problems, the system seems rock solid (until I start my
memory stresser).

kind regards Philip

2005-04-06 10:59:09

by Philip Lawatsch

Subject: Re: AMD64 Machine hardlocks when using memset

Robert Hancock wrote:
> Alan Cox wrote:
>
>> On Sad, 2005-04-02 at 05:50, Robert Hancock wrote:
>>
>>> I'm wondering if one does a ton of these cache-bypassing stores
>>> whether something gets hosed because of that. Not sure what that
>>> could be though. I don't imagine the chipset is involved with any of
>>> that on the Athlon 64 - either the CPU or RAM seems the most likely
>>> suspect to me
>>
>>
>>
>> The glibc version is essentially the "perfect" copy function for the
>> CPU. If you have any bus/memory problems or chipset bugs it will bite
>> you.
>
>
> Anyone have any suggestions on how to track this further? It seems
> fairly clear what circumstances are causing it, but as for figuring out
> what's at fault..

Digging through my glibc's source I found that if you memset arrays
<120000 bytes it will use good old mov instructions to do the job. In
case of arrays larger than 120000 bytes it will use movnti instructions
to do the job.
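
The size-based choice can be sketched in C as follows. This is an
illustration only: the enum and function names are hypothetical, the
120000 figure is the one reported above, and which path a size of
exactly 120000 takes is not stated, so this sketch arbitrarily treats
it as the plain mov case.

```c
#include <assert.h>
#include <stddef.h>

/* Threshold reported above for this particular glibc build. */
#define NT_THRESHOLD 120000u

enum store_kind { STORE_MOV, STORE_MOVNTI };

/* Pick the store type the way the message describes: ordinary mov for
 * smaller fills, movnti for larger ones. Exactly NT_THRESHOLD bytes is
 * treated as "small" here, which is an assumption. */
enum store_kind pick_store(size_t bytes)
{
    return bytes > NT_THRESHOLD ? STORE_MOVNTI : STORE_MOV;
}

/* Human-readable name, handy when logging which path a test run took. */
const char *store_name(enum store_kind kind)
{
    assert(kind == STORE_MOV || kind == STORE_MOVNTI);
    return kind == STORE_MOVNTI ? "movnti" : "mov";
}
```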

Thus I refined my test code to use mov for memset regardless of the size
(simply abused glibc's code a little bit)

-> No crash!

Then, after changing all the mov to movnti, my machine frags again :(

It seems that mov'ing does not kill my machine while simply using movnti
does.

kind regards Philip

2005-04-06 11:15:45

by Arjan van de Ven

Subject: Re: AMD64 Machine hardlocks when using memset

On Wed, 2005-04-06 at 12:59 +0200, Philip Lawatsch wrote:
> Robert Hancock wrote:
> > Alan Cox wrote:
> >
> >> On Sad, 2005-04-02 at 05:50, Robert Hancock wrote:
> >>
> >>> I'm wondering if one does a ton of these cache-bypassing stores
> >>> whether something gets hosed because of that. Not sure what that
> >>> could be though. I don't imagine the chipset is involved with any of
> >>> that on the Athlon 64 - either the CPU or RAM seems the most likely
> >>> suspect to me
> >>
> >>
> >>
> >> The glibc version is essentially the "perfect" copy function for the
> >> CPU. If you have any bus/memory problems or chipset bugs it will bite
> >> you.
> >
> >
> > Anyone have any suggestions on how to track this further? It seems
> > fairly clear what circumstances are causing it, but as for figuring out
> > what's at fault..
>
> Digging through my glibc's source I found that if you memset arrays
> <120000 bytes it will use good old mov instructions to do the job. In
> case of arrays larger than 120000 bytes it will use movnti instructions
> to do the job.
>
> Thus I refined my test code to use mov for memset regardless of the size
> (simply abused glibc's code a little bit)
>
> -> No crash!
>
> Then, after changing all the mov to movnti, my machine frags again :(
>
> It seems that mov'ing does not kill my machine while simply using movnti
> does.

movnti also gets a higher bandwidth so that doesn't rule out too much..



2005-04-06 11:22:38

by Philip Lawatsch

Subject: Re: AMD64 Machine hardlocks when using memset

Philip Lawatsch wrote:

>>Anyone have any suggestions on how to track this further? It seems
>>fairly clear what circumstances are causing it, but as for figuring out
>>what's at fault..
>

> It seems that mov'ing does not kill my machine while simply using movnti
> does.

Forget about what I just wrote, I've been able to reproduce this in
32bit mode too although it did take a long while to happen.

And glibc in 32bit mode simply uses mov in a normal loop to write to the
memory.

Looks like using mov in 64bit mode polluted my cache and crippled
performance (have been running some other programs in the background)
and thus perhaps didn't trigger the problem.

I'm going nuts with this.

kind regards Philip