2002-11-14 10:58:45

by Thorsten Mika

[permalink] [raw]
Subject: system lockups and shutdowns fo running processes

Dear all,

i have been facing problems on a machine (amd k6-2) running debian
unstable, kernel 2.4.17 (homemade of ftp.kernel.org sources). the x-window
systems dies unfrequently, sometimes the whole box even hangs. i never
found anything inside the logfiles giving me a clue.

today i found this:

Linux version 2.4.17 ([email protected]) (gcc version 2.95.2 20000220
(Debian GNU/L$
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 0000000004000000 (usable)
BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
On node 0 totalpages: 16384
zone(0): 4096 pages.
zone(1): 12288 pages.
zone(2): 0 pages.
No local APIC present or hardware disabled
Kernel command line: auto BOOT_IMAGE=Linux ro root=301
Initializing CPU#0
Detected 334.091 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 666.82 BogoMIPS
Memory: 62308k/65536k available (1057k kernel code, 2844k reserved, 292k
data, 232k$
Dentry-cache hash table entries: 8192 (order: 4, 65536 bytes)
Inode-cache hash table entries: 4096 (order: 3, 32768 bytes)
Mount-cache hash table entries: 1024 (order: 1, 8192 bytes)
Buffer-cache hash table entries: 4096 (order: 2, 16384 bytes)
Page-cache hash table entries: 16384 (order: 4, 65536 bytes)
CPU: Before vendor init, caps: 008001bf 808009bf 00000000, vendor = 2
Enabling old style K6 write allocation for 64 Mb
CPU: L1 I Cache: 32K (32 bytes/line), D cache 32K (32 bytes/line)
CPU: After vendor init, caps: 008001bf 808009bf 00000000 00000000
CPU: After generic, caps: 008001bf 808009bf 00000000 00000000
CPU: Common caps: 008001bf 808009bf 00000000 00000000
CPU: AMD-K6(tm) 3D processor stepping 00
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch ([email protected])
mtrr: detected mtrr type: none
PCI: PCI BIOS revision 2.10 entry at 0xfb330, last bus=0
PCI: Using configuration type 1
PCI: Probing PCI hardware
Disabling direct PCI/PCI transfers.
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
apm: BIOS version 1.2 Flags 0x07 (Driver version 1.15)
Starting kswapd
parport0: PC-style at 0x378 (0x778) [PCSPP,TRISTATE]
parport0: irq 7 detected
parport0: Printer, HEWLETT-PACKARD DESKJET 690C
pty: 256 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS SHARE_IRQ
SERIAL_PCI ISAPN$
ttyS00 at 0x03f8 (irq = 4) is a 16550A
ttyS01 at 0x02f8 (irq = 3) is a 16550A
lp0: using parport0 (polling).
Real Time Clock Driver v1.10e
Non-volatile memory driver v1.1
block: 128 slots per queue, batch=32
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with
idebus=xx
SIS5513: IDE controller on PCI bus 00 dev 09
SIS5513: chipset revision 208
SIS5513: not 100% native mode: will probe irqs later
SiS5597
ide0: BM-DMA at 0x4000-0x4007, BIOS settings: hda:pio, hdb:pio
ide1: BM-DMA at 0x4008-0x400f, BIOS settings: hdc:pio, hdd:pio
hda: ST38641A, ATA DISK drive
hdb: CDA46801I, ATAPI CD/DVD-ROM drive
hdc: QUANTUM LPS420A, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
hda: 16809660 sectors (8607 MB) w/128KiB Cache, CHS=1046/255/63, UDMA(33)
hdc: 824160 sectors (422 MB) w/98KiB Cache, CHS=1010/16/51, DMA
hdb: set_drive_speed_status: status=0x51 { DriveReady SeekComplete Error }
hdb: set_drive_speed_status: error=0x04
hdb: ATAPI 4X CD-ROM drive, 240kB Cache
Uniform CD-ROM driver Revision: 3.12
Partition check:
hda: hda1 hda2 hda3 hda4 < hda5 hda6 >
hdc: hdc1
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
loop: loaded (max 8 devices)
PPP generic driver version 2.4.1
8139too Fast Ethernet driver 0.9.22
eth0: RealTek RTL8139 Fast Ethernet at 0x6300, 00:30:84:40:cb:8a, IRQ 10
eth0: Identified 8139 chip type 'RTL-8139C'
Linux agpgart interface v0.99 (c) Jeff Hartmann
agpgart: Maximum main memory to use for agp memory: 28M
agpgart: no supported devices found.
mice: PS/2 mouse device common for all mice
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 512 buckets, 4Kbytes
TCP: Hash tables configured (established 4096 bind 4096)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 232k freed
Adding Swap: 130748k swap-space (priority -1)
reiserfs: checking transaction log (device 03:06) ...
Using r5 hash to sort names
ReiserFS version 3.6.25
eth0: Setting half-duplex based on auto-negotiated partner ability 0000.
spurious 8259A interrupt: IRQ7.
hda: set_drive_speed_status: status=0x58hda: dma_intr: status=0x58 {
DriveReady See$
{ DriveReady SeekComplete DataRequest }
general protection fault: 9920
CPU: 0
EIP: 0010:[<c3130187>] Not tainted
EFLAGS: 00013206
eax: 00000008 ebx: 00000001 ecx: c2059920 edx: 00001000
esi: c2059920 edi: 0000000a ebp: 00000005 esp: c116bf30
ds: 0018 es: 0018 ss: 0018
Process kupdated (pid: 7, stackpage=c116b000)
Stack: c2059920 c2059920 c0c8dce0 c012f5b4 c2059920 c116a000 c0211a17
c116a24b
0008e000 00000036 c19c0a40 c340b740 c32bd840 c38ee740 c18b0ea0
c2ff2500
c3e95d80 c3a25d20 c11cc8c0 c3a35aa0 c1ed4600 c1ed49c0 c3f530c0
c27b4600
Call Trace: [<c012f5b4>] [<c0131c47>] [<c0131eb5>] [<c010546c>]

Code: 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20
<1>Unable to handle kernel paging request at virtual address 11b9fcef
printing eip:
c0126416
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0126416>] Not tainted
EFLAGS: 00010246
eax: 00000000 ebx: c25a47e0 ecx: 00000001 edx: 000003e8
esi: ffffffec edi: c1fb9fa4 ebp: c2593680 esp: c1fb9f40
ds: 0018 es: 0018 ss: 0018
Process wmaker (pid: 425, stackpage=c1fb9000)
Stack: c2593680 00000001 c0136d19 c2593680 00000001 c3ccd000 00000000
c1fb9fa4
00000009 c01362be 00000009 c3ccd00f c25a47e0 c3ccd006 00000008
82ddc30a
c0136d5a c01370b1 c1fb8000 c1fb9fa4 08162a90 bffffb8c c013446d
c1fb8000
Call Trace: [<c0136d19>] [<c01362be>] [<c0136d5a>] [<c01370b1>]
[<c013446d>]
[<c0106b23>]

Code: 10 89 ee fc b9 11 00 00 00 f3 a5 8b 7c 24 1c 89 7b 08 8b 54

any hint?

thx in advance,

thorsten

p.s.: as i am not a subscriber to the list, could you please cc me?

--
BOFH excuse #213:

Change your language to Finnish.


2002-11-14 17:03:45

by Thorsten Mika

[permalink] [raw]
Subject: Re: system lockups and shutdowns fo running processes

"Randy.Dunlap" <[email protected]> schrieb:

> On Thu, 14 Nov 2002, Thorsten Mika wrote:

[output of dmesg]

>
> Please run that Oops text thru ksymoops so that it makes some sense.


here it is:
8139too Fast Ethernet driver 0.9.22
CPU: 0
EIP: 0010:[<c3130187>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00013206
eax: 00000008 ebx: 00000001 ecx: c2059920 edx: 00001000
esi: c2059920 edi: 0000000a ebp: 00000005 esp: c116bf30
ds: 0018 es: 0018 ss: 0018
Process kupdated (pid: 7, stackpage=c116b000)
Stack: c2059920 c2059920 c0c8dce0 c012f5b4 c2059920 c116a000 c0211a17
c116a24b
0008e000 00000036 c19c0a40 c340b740 c32bd840 c38ee740 c18b0ea0
c2ff2500
c3e95d80 c3a25d20 c11cc8c0 c3a35aa0 c1ed4600 c1ed49c0 c3f530c0
c27b4600
Call Trace: [<c012f5b4>] [<c0131c47>] [<c0131eb5>] [<c010546c>]
Code: 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20


>>EIP; c3130187 <___strtok+2e1def3/10a9ddcc> <=====

>>ecx; c2059920 <___strtok+1d4768c/10a9ddcc>
>>esi; c2059920 <___strtok+1d4768c/10a9ddcc>
>>esp; c116bf30 <___strtok+e59c9c/10a9ddcc>

Trace; c012f5b4 <vfs_statfs+244/10d0>
Trace; c0131c47 <__wait_on_buffer+187/360>
Trace; c0131eb5 <file_fsync+15/2c0>
Trace; c010546c <machine_power_off+ec/140>

Code; c3130187 <___strtok+2e1def3/10a9ddcc>
00000000 <_EIP>:
Code; c3130187 <___strtok+2e1def3/10a9ddcc> <=====
0: 07 pop %es <=====
Code; c3130188 <___strtok+2e1def4/10a9ddcc>
1: 20 07 and %al,(%edi)
Code; c313018a <___strtok+2e1def6/10a9ddcc>
3: 20 07 and %al,(%edi)
Code; c313018c <___strtok+2e1def8/10a9ddcc>
5: 20 07 and %al,(%edi)
Code; c313018e <___strtok+2e1defa/10a9ddcc>
7: 20 07 and %al,(%edi)
Code; c3130190 <___strtok+2e1defc/10a9ddcc>
9: 20 07 and %al,(%edi)
Code; c3130192 <___strtok+2e1defe/10a9ddcc>
b: 20 07 and %al,(%edi)
Code; c3130194 <___strtok+2e1df00/10a9ddcc>
d: 20 07 and %al,(%edi)
Code; c3130196 <___strtok+2e1df02/10a9ddcc>
f: 20 07 and %al,(%edi)
Code; c3130198 <___strtok+2e1df04/10a9ddcc>
11: 20 07 and %al,(%edi)
Code; c313019a <___strtok+2e1df06/10a9ddcc>
13: 20 00 and %al,(%eax)

<1>Unable to handle kernel paging request at virtual address 11b9fcef
c0126416
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0126416>] Not tainted
EFLAGS: 00010246
eax: 00000000 ebx: c25a47e0 ecx: 00000001 edx: 000003e8
esi: ffffffec edi: c1fb9fa4 ebp: c2593680 esp: c1fb9f40
ds: 0018 es: 0018 ss: 0018
Process wmaker (pid: 425, stackpage=c1fb9000)
Stack: c2593680 00000001 c0136d19 c2593680 00000001 c3ccd000 00000000
c1fb9fa4
00000009 c01362be 00000009 c3ccd00f c25a47e0 c3ccd006 00000008
82ddc30a
c0136d5a c01370b1 c1fb8000 c1fb9fa4 08162a90 bffffb8c c013446d
c1fb8000
Call Trace: [<c0136d19>] [<c01362be>] [<c0136d5a>] [<c01370b1>]
[<c013446d>]
[<c0106b23>]
Code: 10 89 ee fc b9 11 00 00 00 f3 a5 8b 7c 24 1c 89 7b 08 8b 54


>>EIP; c0126416 <generic_file_mmap+a6/b90> <=====

>>ebx; c25a47e0 <___strtok+229254c/10a9ddcc>
>>edi; c1fb9fa4 <___strtok+1ca7d10/10a9ddcc>
>>ebp; c2593680 <___strtok+22813ec/10a9ddcc>
>>esp; c1fb9f40 <___strtok+1ca7cac/10a9ddcc>

Trace; c0136d19 <cdput+199/980>
Trace; c01362be <bdget+10e/150>
Trace; c0136d5a <cdput+1da/980>
Trace; c01370b1 <cdput+531/980>
Trace; c013446d <block_symlink+26d/460>
Trace; c0106b23 <__up_wakeup+f63/2460>

Code; c0126416 <generic_file_mmap+a6/b90>
00000000 <_EIP>:
Code; c0126416 <generic_file_mmap+a6/b90> <=====
0: 10 89 ee fc b9 11 adc %cl,0x11b9fcee(%ecx) <=====
Code; c012641c <generic_file_mmap+ac/b90>
6: 00 00 add %al,(%eax)
Code; c012641e <generic_file_mmap+ae/b90>
8: 00 f3 add %dh,%bl
Code; c0126420 <generic_file_mmap+b0/b90>
a: a5 movsl %ds:(%esi),%es:(%edi)
Code; c0126421 <generic_file_mmap+b1/b90>
b: 8b 7c 24 1c mov 0x1c(%esp,1),%edi
Code; c0126425 <generic_file_mmap+b5/b90>
f: 89 7b 08 mov %edi,0x8(%ebx)
Code; c0126428 <generic_file_mmap+b8/b90>
12: 8b 54 00 00 mov 0x0(%eax,%eax,1),%edx


1 warning and 1 error issued. Results may not be reliable.

for me this is chinese, hope it is what you wanted.

thx in advance,

thorsten


--
Thorsten Mika mailto: [email protected]
Hamm / Germany TM5173-RIPE

Public Key ID: 41338C37 -- http://www.keyserver.net

2002-11-14 18:22:25

by Randy.Dunlap

[permalink] [raw]
Subject: Re: system lockups and shutdowns fo running processes

On Thu, 14 Nov 2002, Richard B. Johnson wrote:

| On Thu, 14 Nov 2002, Thorsten Mika wrote:
|
| > "Randy.Dunlap" <[email protected]> schrieb:
| >
| > > On Thu, 14 Nov 2002, Thorsten Mika wrote:
| >
| > [output of dmesg]
| >
| > >
| > > Please run that Oops text thru ksymoops so that it makes some sense.
| >
| >
| > here it is:
| > 8139too Fast Ethernet driver 0.9.22
| > CPU: 0
| > EIP: 0010:[<c3130187>] Not tainted
| > Using defaults from ksymoops -t elf32-i386 -a i386
| > EFLAGS: 00013206
| > eax: 00000008 ebx: 00000001 ecx: c2059920 edx: 00001000
| > esi: c2059920 edi: 0000000a ebp: 00000005 esp: c116bf30
| > ds: 0018 es: 0018 ss: 0018
| > Process kupdated (pid: 7, stackpage=c116b000)
| > Stack: c2059920 c2059920 c0c8dce0 c012f5b4 c2059920 c116a000 c0211a17
| > c116a24b
| > 0008e000 00000036 c19c0a40 c340b740 c32bd840 c38ee740 c18b0ea0
| > c2ff2500
| > c3e95d80 c3a25d20 c11cc8c0 c3a35aa0 c1ed4600 c1ed49c0 c3f530c0
| > c27b4600
| > Call Trace: [<c012f5b4>] [<c0131c47>] [<c0131eb5>] [<c010546c>]
| > Code: 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20
| >
|
| Yawn. It's trying to execute data...
| <TAB><SPC><TAB><SPC><TAB><SPC>..........
| Possible stack overflow.

Yes, I saw the hex 20 07 20 07 20 07 ... and realized that it shouldn't
be executing there. The question is "how did it get there?".

--
~Randy
"I read part of it all the way through." -- Samuel Goldwyn

2002-11-14 18:18:33

by Richard B. Johnson

[permalink] [raw]
Subject: Re: system lockups and shutdowns fo running processes

On Thu, 14 Nov 2002, Thorsten Mika wrote:

> "Randy.Dunlap" <[email protected]> schrieb:
>
> > On Thu, 14 Nov 2002, Thorsten Mika wrote:
>
> [output of dmesg]
>
> >
> > Please run that Oops text thru ksymoops so that it makes some sense.
>
>
> here it is:
> 8139too Fast Ethernet driver 0.9.22
> CPU: 0
> EIP: 0010:[<c3130187>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00013206
> eax: 00000008 ebx: 00000001 ecx: c2059920 edx: 00001000
> esi: c2059920 edi: 0000000a ebp: 00000005 esp: c116bf30
> ds: 0018 es: 0018 ss: 0018
> Process kupdated (pid: 7, stackpage=c116b000)
> Stack: c2059920 c2059920 c0c8dce0 c012f5b4 c2059920 c116a000 c0211a17
> c116a24b
> 0008e000 00000036 c19c0a40 c340b740 c32bd840 c38ee740 c18b0ea0
> c2ff2500
> c3e95d80 c3a25d20 c11cc8c0 c3a35aa0 c1ed4600 c1ed49c0 c3f530c0
> c27b4600
> Call Trace: [<c012f5b4>] [<c0131c47>] [<c0131eb5>] [<c010546c>]
> Code: 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20
>

Yawn. It's trying to execute data...
<TAB><SPC><TAB><SPC><TAB><SPC>..........
Possible stack overflow.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Bush : The Fourth Reich of America


2002-11-14 18:41:32

by Richard B. Johnson

[permalink] [raw]
Subject: Re: system lockups and shutdowns fo running processes

On Thu, 14 Nov 2002, Randy.Dunlap wrote:

> On Thu, 14 Nov 2002, Richard B. Johnson wrote:
>
> | On Thu, 14 Nov 2002, Thorsten Mika wrote:
> |
> | > "Randy.Dunlap" <[email protected]> schrieb:
> | >
> | > > On Thu, 14 Nov 2002, Thorsten Mika wrote:
> | >
> | > [output of dmesg]
> | >
> | > >
> | > > Please run that Oops text thru ksymoops so that it makes some sense.
> | >
> | >
> | > here it is:
> | > 8139too Fast Ethernet driver 0.9.22
> | > CPU: 0
> | > EIP: 0010:[<c3130187>] Not tainted
> | > Using defaults from ksymoops -t elf32-i386 -a i386
> | > EFLAGS: 00013206
> | > eax: 00000008 ebx: 00000001 ecx: c2059920 edx: 00001000
> | > esi: c2059920 edi: 0000000a ebp: 00000005 esp: c116bf30
> | > ds: 0018 es: 0018 ss: 0018
> | > Process kupdated (pid: 7, stackpage=c116b000)
> | > Stack: c2059920 c2059920 c0c8dce0 c012f5b4 c2059920 c116a000 c0211a17
> | > c116a24b
> | > 0008e000 00000036 c19c0a40 c340b740 c32bd840 c38ee740 c18b0ea0
> | > c2ff2500
> | > c3e95d80 c3a25d20 c11cc8c0 c3a35aa0 c1ed4600 c1ed49c0 c3f530c0
> | > c27b4600
> | > Call Trace: [<c012f5b4>] [<c0131c47>] [<c0131eb5>] [<c010546c>]
> | > Code: 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20
> | >
> |
> | Yawn. It's trying to execute data...
> | <TAB><SPC><TAB><SPC><TAB><SPC>..........
> | Possible stack overflow.
>
> Yes, I saw the hex 20 07 20 07 20 07 ... and realized that it shouldn't
> be executing there. The question is "how did it get there?".
>

Good question. I see kupdated was running at the time. So follow
this logic.... or lack thereof...

(1) kswapd is panting heavily swapping data (executable stuff) to
and from the disk.
(2) kupdated is awakened to write some file buffers to the disk. It
had been swapped out (part of it). Some executable code gets swapped
back in...but, because of an error somewhere in the code-path, you
end up with the wrong file-data being swapped back in.

This looks like some race that happened because somebody didn't take a
lock when they should have. Or, it could be something as simple as
an unhandled error condition in a disk driver where upper level code
'thought' a read had occurred, but it didn't. Lots of things like
this.

The next check might be to disable swapping `swapoff -a` and see if
the condition is repeatable.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Bush : The Fourth Reich of America


2002-11-14 19:20:06

by Willy Tarreau

[permalink] [raw]
Subject: Re: system lockups and shutdowns fo running processes

Hi !

On Thu, Nov 14, 2002 at 01:27:50PM -0500, Richard B. Johnson wrote:
> On Thu, 14 Nov 2002, Thorsten Mika wrote:
> > Call Trace: [<c012f5b4>] [<c0131c47>] [<c0131eb5>] [<c010546c>]
> > Code: 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20
> Yawn. It's trying to execute data...
> <TAB><SPC><TAB><SPC><TAB><SPC>..........
> Possible stack overflow.

It's not <TAB> <SPC>, it's a copy from a text video memory buffer of an empty
portion of the screen. 07 is the attribute (=color), and 20 the <SPC> which
fills empty space. So this does not look like a buffer overflow, but a use of
a bad pointer somewhere. Perhaps a race during a text scroll or something ?

BTW, <TAB> is 9 :-)

Cheers,
Willy

Subject: Re: system lockups and shutdowns fo running processes

"Richard B. Johnson" <[email protected]> writes:

>> Call Trace: [<c012f5b4>] [<c0131c47>] [<c0131eb5>] [<c010546c>]
>> Code: 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20 07 20
>>

>Yawn. It's trying to execute data...
><TAB><SPC><TAB><SPC><TAB><SPC>..........
>Possible stack overflow.

<TAB> = ^I = 09

07 = ^G = <BEL>

Regards
Henning


--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]

Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20