2002-10-05 02:35:40

by Srihari Vijayaraghavan

Subject: Linux-2.4.20-pre8-aa2 oops report.

[1.] One line summary of the problem:
The 2.4.20-pre8aa2 kernel oopsed a couple of times.

[2.] Full description of the problem/report:
Same as above.

[3.] Keywords (i.e., modules, networking, kernel):
I am no kernel developer, but I suspect it may be due to XFree86,
AGPGART/DRM/Radeon. I may be wrong though, please feel free to correct me.

[4.] Kernel version (from /proc/version):
Linux version 2.4.20-pre8aa2 ([email protected]) (gcc version 3.2
20020903 (Red Hat Linux 8.0 3.2-7)) #3 Thu Oct 3 21:07:54 EST 2002

[5.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/oops-tracing.txt)
ksymoops 2.4.5 on i686 2.4.20-pre8aa2. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.20-pre8aa2/ (default)
-m /boot/System.map-2.4.20-pre8aa2 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Oct 5 11:46:39 localhost kernel: kernel BUG at memory.c:419!
Oct 5 11:46:39 localhost kernel: invalid operand: 0000 2.4.20-pre8aa2 #3 Thu
Oct 3 21:07:54 EST 2002
Oct 5 11:46:39 localhost kernel: CPU: 0
Oct 5 11:46:39 localhost kernel: EIP: 0010:[<c01270f6>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Oct 5 11:46:39 localhost kernel: EFLAGS: 00210246
Oct 5 11:46:39 localhost kernel: eax: cb988000 ebx: 00000000 ecx:
cabe4740 edx: 00000000
Oct 5 11:46:39 localhost kernel: esi: cb988000 edi: 00000000 ebp:
00000000 esp: cbcfde84
Oct 5 11:46:39 localhost kernel: ds: 0018 es: 0018 ss: 0018
Oct 5 11:46:39 localhost kernel: Process gnome-session (pid: 5481,
stackpage=cbcfd000)
Oct 5 11:46:39 localhost kernel: Stack: cabe4740 cb988400 00200292 00003000
da97e4c0 00000000 cabe4740 00000000
Oct 5 11:46:39 localhost kernel: c012a5b5 cabe4740 00000000 00000000
00000000 cabe4740 cbcfc000 cbcfdf30
Oct 5 11:46:39 localhost kernel: 0000000b c0116a36 cabe4740 00200202
cabe4740 c011b807 cabe4740 c158f270
Oct 5 11:46:39 localhost kernel: Call Trace: [<c012a5b5>] [<c0116a36>]
[<c011b807>] [<c01213cc>] [<c01215a4>]
Oct 5 11:46:39 localhost kernel: [<c0108c54>] [<c0113c60>] [<c0108f38>]
Oct 5 11:46:39 localhost kernel: Code: 0f 0b a3 01 42 4c 1f c0 89 f6 8b 44 24
24 89 74 24 04 89 5c


>>EIP; c01270f6 <zap_page_range+26/b0> <=====

>>eax; cb988000 <END_OF_CODE+53341a9/????>
>>ecx; cabe4740 <END_OF_CODE+45908e9/????>
>>esi; cb988000 <END_OF_CODE+53341a9/????>
>>esp; cbcfde84 <END_OF_CODE+56aa02d/????>

Trace; c012a5b5 <exit_mmap+b5/130>
Trace; c0116a36 <mmput+56/d0>
Trace; c011b807 <do_exit+87/260>
Trace; c01213cc <sig_exit+9c/a0>
Trace; c01215a4 <dequeue_signal+64/d0>
Trace; c0108c54 <do_signal+1b4/2a0>
Trace; c0113c60 <do_page_fault+0/5a0>
Trace; c0108f38 <signal_return+14/18>

Code; c01270f6 <zap_page_range+26/b0>
00000000 <_EIP>:
Code; c01270f6 <zap_page_range+26/b0> <=====
0: 0f 0b ud2a <=====
Code; c01270f8 <zap_page_range+28/b0>
2: a3 01 42 4c 1f mov %eax,0x1f4c4201
Code; c01270fd <zap_page_range+2d/b0>
7: c0 89 f6 8b 44 24 24 rorb $0x24,0x24448bf6(%ecx)
Code; c0127104 <zap_page_range+34/b0>
e: 89 74 24 04 mov %esi,0x4(%esp,1)
Code; c0127108 <zap_page_range+38/b0>
12: 89 5c 00 00 mov %ebx,0x0(%eax,%eax,1)
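
A note on reading the ksymoops annotations above: in <zap_page_range+26/b0>, the number after "+" is the hexadecimal offset of the address from the start of the symbol, and the number after "/" is the symbol's length. The faulting EIP c01270f6 therefore lies 0x26 bytes into a 0xb0-byte function. The sketch below recovers the function's start address from that notation. The helper names are hypothetical and are not part of ksymoops.

```c
#include <stdio.h>

/* Split a ksymoops-style location such as "zap_page_range+26/b0" into
 * its hexadecimal offset and length. Returns 0 on success, -1 if the
 * string does not contain a "+offset/length" suffix. */
int parse_symbol_location(const char *loc, unsigned long *offset,
                          unsigned long *length)
{
    const char *plus = NULL;

    for (const char *p = loc; *p; p++)
        if (*p == '+')
            plus = p;               /* remember the last '+' */
    if (plus == NULL)
        return -1;
    if (sscanf(plus, "+%lx/%lx", offset, length) != 2)
        return -1;
    return 0;
}

/* Address of the first byte of the function that contains eip. */
unsigned long function_start(unsigned long eip, unsigned long offset)
{
    return eip - offset;
}
```

Called with "zap_page_range+26/b0" and the EIP c01270f6, this places the start of zap_page_range at c01270d0.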

Oct 5 11:46:39 localhost kernel: kernel BUG at memory.c:419!
Oct 5 11:46:39 localhost kernel: invalid operand: 0000 2.4.20-pre8aa2 #3 Thu
Oct 3 21:07:54 EST 2002
Oct 5 11:46:39 localhost kernel: CPU: 0
Oct 5 11:46:39 localhost kernel: EIP: 0010:[<c01270f6>] Not tainted
Oct 5 11:46:39 localhost kernel: EFLAGS: 00210246
Oct 5 11:46:39 localhost kernel: eax: c51a5000 ebx: 00000000 ecx:
c2ec2a80 edx: 00000000
Oct 5 11:46:39 localhost kernel: esi: c51a5000 edi: 00000000 ebp:
00000000 esp: d160df48
Oct 5 11:46:39 localhost kernel: ds: 0018 es: 0018 ss: 0018
Oct 5 11:46:39 localhost kernel: Process gnome-session (pid: 5371,
stackpage=d160d000)
Oct 5 11:46:39 localhost kernel: Stack: c158e380 c013b2cc 00200296 00003000
dbe63ac0 00000000 c2ec2a80 00000000
Oct 5 11:46:39 localhost kernel: c012a5b5 c2ec2a80 00000000 00000000
00000000 c2ec2a80 d160c000 bffff5fc
Oct 5 11:46:39 localhost kernel: 00000100 c0116a36 c2ec2a80 00200206
c2ec2a80 c011b807 c2ec2a80 00001569
Oct 5 11:46:39 localhost kernel: Call Trace: [<c013b2cc>] [<c012a5b5>]
[<c0116a36>] [<c011b807>] [<c011ba13>]
Oct 5 11:46:39 localhost kernel: [<c0108eff>]
Oct 5 11:46:39 localhost kernel: Code: 0f 0b a3 01 42 4c 1f c0 89 f6 8b 44 24
24 89 74 24 04 89 5c


>>EIP; c01270f6 <zap_page_range+26/b0> <=====

>>eax; c51a5000 <[agpgart].bss.end+10701e5/1b9b265>
>>ecx; c2ec2a80 <[serial].bss.end+8e659d/1ac3b9d>
>>esi; c51a5000 <[agpgart].bss.end+10701e5/1b9b265>
>>esp; d160df48 <END_OF_CODE+afba0f1/????>

Trace; c013b2cc <fput+cc/120>
Trace; c012a5b5 <exit_mmap+b5/130>
Trace; c0116a36 <mmput+56/d0>
Trace; c011b807 <do_exit+87/260>
Trace; c011ba13 <sys_exit+13/20>
Trace; c0108eff <system_call+33/38>

Code; c01270f6 <zap_page_range+26/b0>
00000000 <_EIP>:
Code; c01270f6 <zap_page_range+26/b0> <=====
0: 0f 0b ud2a <=====
Code; c01270f8 <zap_page_range+28/b0>
2: a3 01 42 4c 1f mov %eax,0x1f4c4201
Code; c01270fd <zap_page_range+2d/b0>
7: c0 89 f6 8b 44 24 24 rorb $0x24,0x24448bf6(%ecx)
Code; c0127104 <zap_page_range+34/b0>
e: 89 74 24 04 mov %esi,0x4(%esp,1)
Code; c0127108 <zap_page_range+38/b0>
12: 89 5c 00 00 mov %ebx,0x0(%eax,%eax,1)


1 warning issued. Results may not be reliable.
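
The instructions that ksymoops prints after ud2a in both reports are probably not real code. Assuming this kernel uses the usual 2.4 i386 BUG() macro, which emits ud2 followed by a 16-bit line number and a 32-bit pointer to the file name, the bytes "a3 01" decode to line 419 (matching "memory.c:419") and "42 4c 1f c0" to the address of a string. A minimal sketch of that decoding follows. The struct and function names are hypothetical, and the byte layout is an assumption about this kernel build.

```c
#include <stddef.h>
#include <stdint.h>

struct bug_record {
    uint16_t line;       /* source line recorded by BUG() */
    uint32_t file_addr;  /* kernel address of the file-name string */
};

/* Decode the bytes at a BUG() site, assuming the layout
 * 0f 0b (ud2), little-endian 16-bit line, little-endian 32-bit pointer.
 * Returns 0 on success, -1 if the buffer is too short or does not
 * start with ud2. */
int decode_bug_site(const uint8_t *code, size_t len, struct bug_record *out)
{
    if (len < 8 || code[0] != 0x0f || code[1] != 0x0b)
        return -1;
    out->line = (uint16_t)(code[2] | (code[3] << 8));
    out->file_addr = (uint32_t)code[4] |
                     ((uint32_t)code[5] << 8) |
                     ((uint32_t)code[6] << 16) |
                     ((uint32_t)code[7] << 24);
    return 0;
}
```

Fed the first eight bytes of the "Code:" line (0f 0b a3 01 42 4c 1f c0), this yields line 419 and the address c01f4c42.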

[6.] A small shell script or example program which triggers the
problem (if possible)
Unfortunately, no.

[7.] Environment
[7.1.] Software (add the output of the ver_linux script here)
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux localhost.localdomain 2.4.20-pre8aa2 #3 Thu Oct 3 21:07:54 EST 2002 i686
athlon i386 GNU/Linux

Gnu C gcc (GCC) 3.2 20020903 (Red Hat Linux 8.0 3.2-7)
Copyright (C) 2002 Free Software Foundation, Inc. This is free software; see
the source for copying conditions. There is NO warranty; not even for
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Gnu make 3.79.1
util-linux 2.11r
mount 2.11r
modutils 2.4.18
e2fsprogs 1.27
pcmcia-cs 3.1.31
PPP 2.4.1
isdn4k-utils 3.1pre4
Linux C Library 2.2.93
Dynamic linker (ldd) 2.2.93
Procps 2.0.7
Net-tools 1.60
Kbd 1.06
Sh-utils 2.0.12
Modules Loaded ipt_state ip_conntrack ppp_deflate zlib_inflate
zlib_deflate ppp_async ppp_generic slhc sr_mod emu10k1 ac97_codec soundcore
radeon agpgart af_packet iptable_filter ip_tables serial floppy ide-scsi
scsi_mod ide-cd cdrom raid0 md rtc unix

[7.2.] Processor information (from /proc/cpuinfo):
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 1200.075
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat
pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 2392.06

[7.3.] Module information (from /proc/modules):
ipt_state 1080 36 (autoclean)
ip_conntrack 25152 1 (autoclean) [ipt_state]
ppp_deflate 4472 0 (autoclean)
zlib_inflate 21060 0 (autoclean) [ppp_deflate]
zlib_deflate 20632 0 (autoclean) [ppp_deflate]
ppp_async 9344 1 (autoclean)
ppp_generic 19604 3 (autoclean) [ppp_deflate ppp_async]
slhc 6832 1 (autoclean) [ppp_generic]
sr_mod 15960 0 (autoclean)
emu10k1 63488 0 (autoclean)
ac97_codec 13320 0 (autoclean) [emu10k1]
soundcore 5988 4 (autoclean) [emu10k1]
radeon 87416 3
agpgart 19996 3
af_packet 11464 0 (autoclean)
iptable_filter 2412 1 (autoclean)
ip_tables 14328 2 [ipt_state iptable_filter]
serial 50404 1 (autoclean)
floppy 55868 0 (autoclean)
ide-scsi 10512 0
scsi_mod 96788 2 [sr_mod ide-scsi]
ide-cd 33412 0
cdrom 32608 0 [sr_mod ide-cd]
raid0 3912 4 (autoclean)
md 56544 4 [raid0]
rtc 8532 0 (autoclean)
unix 17832 149 (autoclean)

[7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
/proc/ioports:
0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0070-007f : rtc
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : ide0
02f8-02ff : serial(auto)
0376-0376 : ide1
03c0-03df : vga+
03f6-03f6 : ide0
03f8-03ff : serial(auto)
0cf8-0cff : PCI conf1
5000-500f : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
6000-607f : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
c000-cfff : PCI Bus #01
c000-c0ff : ATI Technologies Inc Radeon VE QY
d000-d003 : Advanced Micro Devices [AMD] AMD-760 [IGD4-1P] System Controller
d400-d40f : VIA Technologies, Inc. VT82C586B PIPC Bus Master IDE
d400-d407 : ide0
d408-d40f : ide1
d800-d81f : VIA Technologies, Inc. USB
dc00-dc1f : VIA Technologies, Inc. USB (#2)
e000-e01f : Creative Labs SB Live! EMU10k1
e000-e01f : EMU10K1
e400-e407 : Creative Labs SB Live! MIDI/Game Port

/proc/iomem:
00000000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000f0000-000fffff : System ROM
00100000-1ffeffff : System RAM
00100000-001eae83 : Kernel code
001eae84-0021a73f : Kernel data
1fff0000-1fff2fff : ACPI Non-volatile Storage
1fff3000-1fffffff : ACPI Tables
d0000000-d7ffffff : Advanced Micro Devices [AMD] AMD-760 [IGD4-1P] System
Controller
d8000000-dfffffff : PCI Bus #01
d8000000-dfffffff : ATI Technologies Inc Radeon VE QY
e0000000-e1ffffff : PCI Bus #01
e1000000-e100ffff : ATI Technologies Inc Radeon VE QY
e2000000-e2000fff : Advanced Micro Devices [AMD] AMD-760 [IGD4-1P] System
Controller
ffff0000-ffffffff : reserved
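
Each /proc/iomem line gives an inclusive physical address range in hexadecimal, so the size of a region is end - start + 1. The sketch below computes that size from one line of the listing. The function name is hypothetical.

```c
#include <stdio.h>

/* Size in bytes of an inclusive "start-end" range as printed in
 * /proc/iomem, e.g. "00100000-1ffeffff : System RAM".
 * Returns 0 if the line cannot be parsed. */
unsigned long long iomem_range_size(const char *line)
{
    unsigned long long start, end;

    if (sscanf(line, "%llx-%llx", &start, &end) != 2 || end < start)
        return 0;
    return end - start + 1;
}
```

For the main "System RAM" range above this gives 0x1fef0000 bytes, a little under 511 MiB.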

[7.5.] PCI information ('lspci -vvv' as root)
00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 [IGD4-1P] System
Controller (rev 12)
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort+ >SERR- <PERR-
Latency: 32
Region 0: Memory at d0000000 (32-bit, prefetchable) [size=128M]
Region 1: Memory at e2000000 (32-bit, prefetchable) [size=4K]
Region 2: I/O ports at d000 [disabled] [size=4]
Capabilities: [a0] AGP version 2.0
Status: RQ=15 SBA+ 64bit- FW- Rate=x1,x2
Command: RQ=0 SBA+ AGP+ 64bit- FW- Rate=x1

00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 [IGD4-1P] AGP Bridge
(prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR+ FastB2B-
Status: Cap- 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
Latency: 32
Bus: primary=00, secondary=01, subordinate=01, sec-latency=32
I/O behind bridge: 0000c000-0000cfff
Memory behind bridge: e0000000-e1ffffff
Prefetchable memory behind bridge: d8000000-dfffffff
BridgeCtl: Parity- SERR+ NoISA+ VGA+ MAbort- >Reset- FastB2B-

00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev
40)
Subsystem: VIA Technologies, Inc. VT82C686/A PCI to ISA Bridge
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+
SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
Latency: 0
Capabilities: [c0] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:07.1 IDE interface: VIA Technologies, Inc. VT82C586B PIPC Bus Master IDE
(rev 06) (prog-if 8a [Master SecP PriP])
Subsystem: VIA Technologies, Inc. VT82C586B PIPC Bus Master IDE
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
Latency: 32
Region 4: I/O ports at d400 [size=16]
Capabilities: [c0] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:07.2 USB Controller: VIA Technologies, Inc. USB (rev 16) (prog-if 00
[UHCI])
Subsystem: VIA Technologies, Inc. (Wrong ID) USB Controller
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
Latency: 32, cache line size 08
Interrupt: pin D routed to IRQ 10
Region 4: I/O ports at d800 [size=32]
Capabilities: [80] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:07.3 USB Controller: VIA Technologies, Inc. USB (rev 16) (prog-if 00
[UHCI])
Subsystem: VIA Technologies, Inc. (Wrong ID) USB Controller
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
Latency: 32, cache line size 08
Interrupt: pin D routed to IRQ 10
Region 4: I/O ports at dc00 [size=32]
Capabilities: [80] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:07.4 SMBus: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 40)
Subsystem: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
Interrupt: pin ? routed to IRQ 9
Capabilities: [68] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:0c.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 06)
Subsystem: Creative Labs CT4832 SBLive! Value
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
Latency: 32 (500ns min, 5000ns max)
Interrupt: pin A routed to IRQ 10
Region 0: I/O ports at e000 [size=32]
Capabilities: [dc] Power Management version 1
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:0c.1 Input device controller: Creative Labs SB Live! MIDI/Game Port (rev
06)
Subsystem: Creative Labs Gameport Joystick
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
Latency: 32
Region 0: I/O ports at e400 [size=8]
Capabilities: [dc] Power Management version 1
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-

01:05.0 VGA compatible controller: ATI Technologies Inc Radeon VE QY (prog-if
00 [VGA])
Subsystem: Unknown device 1787:0202
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+
SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
Latency: 32 (2000ns min), cache line size 08
Interrupt: pin A routed to IRQ 11
Region 0: Memory at d8000000 (32-bit, prefetchable) [size=128M]
Region 1: I/O ports at c000 [size=256]
Region 2: Memory at e1000000 (32-bit, non-prefetchable) [size=64K]
Expansion ROM at <unassigned> [disabled] [size=128K]
Capabilities: [58] AGP version 2.0
Status: RQ=47 SBA+ 64bit- FW- Rate=x1,x2,x4
Command: RQ=15 SBA+ AGP+ 64bit- FW- Rate=x1
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-

[7.6.] SCSI information (from /proc/scsi/scsi)
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: RICOH Model: CD-R/RW MP7083A Rev: 1.20
Type: CD-ROM ANSI SCSI revision: 02

[7.7.] Other information that might be relevant to the problem
(please look in /proc and include all information that you
think to be relevant):
None.

[X.] Other notes, patches, fixes, workarounds:
I see the following syslog messages between the oopses:
Oct 5 11:46:39 localhost gdm(pam_unix)[5359]: session closed for user hari
Oct 5 11:46:39 localhost gdm[5359]: gdm_slave_xioerror_handler: Fatal X error
- Restarting :0

I was using XFree86 (the one shipped with Red Hat 8) in 2D at the time of the oops, with
no 3D activity (the only 3D use of this computer is playing the tuxracer game :)

I was doing heavy file system activity just before the oops: I was trying to
measure Ext3 and RAID0 performance by creating a file of about 5-6 GB using dd. I
will see if I can reproduce this on mainline, the RH kernel, etc.

<rant>
This is only the second crash that has ever happened to me (the first one was the
pesky netfilter oops, maybe due to NAT, which didn't make it to the system logs,
and I am still waiting for it to happen again now that I have kernel
debugging/sysrq enabled). I am genuinely worried about the stability of my
favourite OS.
</rant>

Anyway thanks guys, you are all doing a wonderful job on the Linux kernel
project. Please CC me if you can as I am not subscribed to LKML, but I
regularly read the web archives.
--
Hari
[email protected]


2002-10-05 02:58:02

by Srihari Vijayaraghavan

Subject: Re: Linux-2.4.20-pre8-aa2 oops report.

On Saturday 05 October 2002 12:47, Srihari Vijayaraghavan wrote:
> [1.] One line summary of the problem:
> 2.4.20-pre8aa2 Kernel oopsed couple of times.

A little more research shows that the oops happens in the following function
in mm/memory.c:

/*
 * remove user pages in a given range.
 */
void zap_page_range(struct mm_struct *mm, unsigned long address, unsigned long size)
{
	mmu_gather_t *tlb;
	pgd_t * dir;
	unsigned long start = address, end = address + size;
	int freed = 0;

	dir = pgd_offset(mm, address);

	/*
	 * This is a long-lived spinlock. That's fine.
	 * There's no contention, because the page table
	 * lock only protects against kswapd anyway, and
	 * even if kswapd happened to be looking at this
	 * process we _want_ it to get stuck.
	 */
	if (address >= end)
		BUG();
	spin_lock(&mm->page_table_lock);
	flush_cache_range(mm, address, end);
	tlb = tlb_gather_mmu(mm);

	do {
		freed += zap_pmd_range(tlb, dir, address, end - address);
		address = (address + PGDIR_SIZE) & PGDIR_MASK;
		dir++;
	} while (address && (address < end));

	/* this will flush any remaining tlb entries */
	tlb_finish_mmu(tlb, start, end);

	/*
	 * Update rss for the mm_struct (not necessarily current->mm)
	 * Notice that rss is an unsigned long.
	 */
	if (mm->rss > freed)
		mm->rss -= freed;
	else
		mm->rss = 0;
	spin_unlock(&mm->page_table_lock);
}
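
The only BUG() in the function above is the check "address >= end", so the oops means zap_page_range() was called with a range that is empty or wraps past the top of the address space. The userspace sketch below isolates that condition, together with the saturating rss update at the end of the function. The function names are hypothetical, and it does not say which caller passed the bad range.

```c
#include <stdbool.h>

/* Mirrors the sanity check in zap_page_range(): the range
 * [address, address + size) must be non-empty and must not wrap. */
bool zap_range_would_bug(unsigned long address, unsigned long size)
{
    unsigned long end = address + size;   /* unsigned overflow wraps around */

    return address >= end;
}

/* Mirrors the rss update: rss is unsigned, so it is clamped at zero
 * instead of being allowed to underflow. */
unsigned long rss_after_free(unsigned long rss, unsigned long freed)
{
    return rss > freed ? rss - freed : 0;
}
```

A zero size, or an address close enough to the top of the address space that address + size overflows, both trip the check.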

BTW I ran memtest2.x and memtest3.0 overnight a few times in the past, and
every time it completed more than 30 or so passes without errors. I forgot to
mention this in my previous e-mail.
--
Hari
[email protected]

2002-10-05 07:43:25

by Srihari Vijayaraghavan

Subject: Re: Linux-2.4.20-pre8-aa2 oops report.

On Saturday 05 October 2002 13:09, Srihari Vijayaraghavan wrote:
> On Saturday 05 October 2002 12:47, Srihari Vijayaraghavan wrote:
> > [1.] One line summary of the problem:
> > 2.4.20-pre8aa2 Kernel oopsed couple of times.

I was able to produce a couple more oopses.

ksymoops 2.4.5 on i686 2.4.20-pre8aa2. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.20-pre8aa2/ (default)
-m /boot/System.map-2.4.20-pre8aa2 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

ac97_codec: AC97 Audio codec, id: v9(SigmaTel STAC9721/23)
Unable to handle kernel paging request at virtual address c5db0034
c0114517
*pde = 05c001e3
Oops: 0000 2.4.20-pre8aa2 #3 Thu Oct 3 21:07:54 EST 2002
CPU: 0
EIP: 0010:[<c0114517>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00013086
eax: 00000000 ebx: c665b324 ecx: c5db0000 edx: c665b324
esi: c665b31c edi: c01e6ae2 ebp: 00003246 esp: c73b1d90
ds: 0018 es: 0018 ss: 0018
Process modprobe (pid: 1012, stackpage=c73b1000)
Stack: c73b0000 00000002 c66fc000 c73b0000 c0113e82 c01e6ae2 c73b1dfc c73b1f6c
c3e27f8e c73b0000 00000000 c17e17c0 c016e94f c3e0cf80 d90e1390 0001ff9d
c02102ef 00000000 dffcb5f4 da33f340 dffcb580 c3e0cf80 c016f2d4 d90e1390
Call Trace: [<c0113e82>] [<c01e6ae2>] [<c016e94f>] [<c016f2d4>]
[<c016811f>]
[<c0113c60>] [<c0108ff0>] [<c01e6ae2>] [<c0127d19>] [<c012860f>]
[<c0113e0a>]
[<c0128e88>] [<c013b2cc>] [<c0129e9f>] [<c012a1d2>] [<c012a254>]
[<c0113c60>]
[<c0108ff0>]
Code: 8b 51 34 85 d2 74 3f f7 41 14 41 00 00 00 74 36 8b 71 38 89


>>EIP; c0114517 <search_exception_table+17/80> <=====

>>ebx; c665b324 <[emu10k1].data.end+88bb25/8a4881>
>>ecx; c5db0000 <[soundcore].bss.end+39889d/3a891d>
>>edx; c665b324 <[emu10k1].data.end+88bb25/8a4881>
>>esi; c665b31c <[emu10k1].data.end+88bb1d/8a4881>
>>edi; c01e6ae2 <fast_clear_page+12/50>
>>ebp; 00003246 Before first symbol
>>esp; c73b1d90 <END_OF_CODE+d39f39/????>

Trace; c0113e82 <do_page_fault+222/5a0>
Trace; c01e6ae2 <fast_clear_page+12/50>
Trace; c016e94f <do_get_write_access+27f/500>
Trace; c016f2d4 <journal_dirty_metadata+174/200>
Trace; c016811f <ext3_do_update_inode+16f/3e0>
Trace; c0113c60 <do_page_fault+0/5a0>
Trace; c0108ff0 <error_code+34/3c>
Trace; c01e6ae2 <fast_clear_page+12/50>
Trace; c0127d19 <do_wp_page+1b9/1f0>
Trace; c012860f <handle_mm_fault+11f/160>
Trace; c0113e0a <do_page_fault+1aa/5a0>
Trace; c0128e88 <zap_pmd_range+78/80>
Trace; c013b2cc <fput+cc/120>
Trace; c0129e9f <unmap_fixup+12f/140>
Trace; c012a1d2 <do_munmap+292/2d0>
Trace; c012a254 <sys_munmap+44/80>
Trace; c0113c60 <do_page_fault+0/5a0>
Trace; c0108ff0 <error_code+34/3c>

Code; c0114517 <search_exception_table+17/80>
00000000 <_EIP>:
Code; c0114517 <search_exception_table+17/80> <=====
0: 8b 51 34 mov 0x34(%ecx),%edx <=====
Code; c011451a <search_exception_table+1a/80>
3: 85 d2 test %edx,%edx
Code; c011451c <search_exception_table+1c/80>
5: 74 3f je 46 <_EIP+0x46>
Code; c011451e <search_exception_table+1e/80>
7: f7 41 14 41 00 00 00 testl $0x41,0x14(%ecx)
Code; c0114525 <search_exception_table+25/80>
e: 74 36 je 46 <_EIP+0x46>
Code; c0114527 <search_exception_table+27/80>
10: 8b 71 38 mov 0x38(%ecx),%esi
Code; c011452a <search_exception_table+2a/80>
13: 89 00 mov %eax,(%eax)


1 warning issued. Results may not be reliable.

ksymoops 2.4.5 on i686 2.4.20-pre8aa2. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.20-pre8aa2/ (default)
-m /boot/System.map-2.4.20-pre8aa2 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

ac97_codec: AC97 Audio codec, id: v9(SigmaTel STAC9721/23)
Unable to handle kernel paging request at virtual address c2d68358
c014e0f9
*pde = 0823c163
Oops: 0003 2.4.20-pre8aa2 #3 Thu Oct 3 21:07:54 EST 2002
CPU: 0
EIP: 0010:[<c014e0f9>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210282
eax: c70b7f58 ebx: c70b7f40 ecx: c2d68358 edx: c78bae58
esi: c70b7fac edi: c687801f ebp: 4f55e46f esp: c2545ee0
ds: 0018 es: 0018 ss: 0018
Process pam_timestamp_c (pid: 2001, stackpage=c2545000)
Stack: 00200217 c021097c 0001828e c0120b37 c70b7f40 dff200c0 c6878013 0000000c
c6878013 c687801f 00000000 c2545f98 c014486b c70b7ec0 c2545f40 c6878013
c0144e94 c70b7ec0 c2545f40 00000000 00000008 00000000 c73f0d00 00000000
Call Trace: [<c0120b37>] [<c014486b>] [<c0144e94>] [<c0145377>]
[<c0145609>]
[<c014204f>] [<c0108eff>]
Code: 89 11 89 40 04 89 43 18 eb cc 89 f0 89 3c 24 83 c0 3c 89 44


>>EIP; c014e0f9 <d_lookup+d9/110> <=====

>>eax; c70b7f58 <END_OF_CODE+9cc101/????>
>>ebx; c70b7f40 <END_OF_CODE+9cc0e9/????>
>>ecx; c2d68358 <[serial].bss.end+76be75/182bb9d>
>>edx; c78bae58 <END_OF_CODE+11cf001/????>
>>esi; c70b7fac <END_OF_CODE+9cc155/????>
>>edi; c687801f <END_OF_CODE+18c1c8/????>
>>ebp; 4f55e46f Before first symbol
>>esp; c2545ee0 <[floppy].bss.end+2184a5/24e645>

Trace; c0120b37 <schedule_timeout+67/b0>
Trace; c014486b <cached_lookup+1b/70>
Trace; c0144e94 <link_path_walk+3c4/6f0>
Trace; c0145377 <path_lookup+37/40>
Trace; c0145609 <__user_walk+49/60>
Trace; c014204f <sys_lstat64+1f/80>
Trace; c0108eff <system_call+33/38>

Code; c014e0f9 <d_lookup+d9/110>
00000000 <_EIP>:
Code; c014e0f9 <d_lookup+d9/110> <=====
0: 89 11 mov %edx,(%ecx) <=====
Code; c014e0fb <d_lookup+db/110>
2: 89 40 04 mov %eax,0x4(%eax)
Code; c014e0fe <d_lookup+de/110>
5: 89 43 18 mov %eax,0x18(%ebx)
Code; c014e101 <d_lookup+e1/110>
8: eb cc jmp ffffffd6 <_EIP+0xffffffd6>
Code; c014e103 <d_lookup+e3/110>
a: 89 f0 mov %esi,%eax
Code; c014e105 <d_lookup+e5/110>
c: 89 3c 24 mov %edi,(%esp,1)
Code; c014e108 <d_lookup+e8/110>
f: 83 c0 3c add $0x3c,%eax
Code; c014e10b <d_lookup+eb/110>
12: 89 44 00 00 mov %eax,0x0(%eax,%eax,1)


1 warning issued. Results may not be reliable.
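
The number after "Oops:" in the two reports above is the i386 page-fault error code. Bit 0 is set for a protection violation (clear for a not-present page), bit 1 for a write access, and bit 2 for a fault taken in user mode. The sketch below decodes it. The struct and function names are hypothetical.

```c
#include <stdbool.h>

struct fault_info {
    bool protection;  /* bit 0: 1 = protection violation, 0 = page not present */
    bool write;       /* bit 1: 1 = write access, 0 = read access */
    bool user;        /* bit 2: 1 = fault in user mode, 0 = kernel mode */
};

/* Decode the low three bits of an i386 page-fault error code. */
struct fault_info decode_fault_code(unsigned int error_code)
{
    struct fault_info f;

    f.protection = (error_code & 0x1) != 0;
    f.write      = (error_code & 0x2) != 0;
    f.user       = (error_code & 0x4) != 0;
    return f;
}
```

"Oops: 0000" is thus a kernel-mode read of a not-present page, which fits the faulting "mov 0x34(%ecx),%edx" with ecx = c5db0000 and the reported address c5db0034. "Oops: 0003" is a kernel-mode write that hit a protection fault, which fits "mov %edx,(%ecx)" with ecx = c2d68358.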

Steps to reproduce:
1. Login to XFree86/KDE or GNOME
2. Start some open-source heavy-weight applications (I use Mozilla, Open
Office Writer and Calc and Impress)
3. Exit all those applications
4. # mke2fs -j /dev/md0 (or) mke2fs -j /dev/hdc5
5. # mount /dev/md0 /md0
6. # cd /md0
7. # time dd if=/dev/zero of=zero bs=1024 count=1048576 (I have chosen a 1 GB
file because I have 512 MB RAM in the system)
8. # dmesg (to verify if there is an oops)
9. Repeat step 2 and verify if there is an oops
10. Otherwise, repeat steps 1 to 9 a couple of times
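
The dd invocation in step 7 writes 1048576 blocks of 1024 zero bytes, i.e. 1073741824 bytes (1 GiB). For anyone who wants to drive the same write load from a test program, the sketch below is a rough C equivalent. The function name is hypothetical, and it goes through stdio buffering rather than issuing one write per block as dd does.

```c
#include <stdio.h>
#include <stdlib.h>

/* Write `count` blocks of `block_size` zero bytes to `path`, roughly
 * what "dd if=/dev/zero of=path bs=block_size count=count" does.
 * Returns the number of bytes written, or -1 on any error. */
long long write_zero_file(const char *path, size_t block_size, long long count)
{
    if (block_size == 0 || count < 0)
        return -1;

    char *block = calloc(1, block_size);   /* zero-filled buffer */
    if (block == NULL)
        return -1;

    FILE *f = fopen(path, "wb");
    if (f == NULL) {
        free(block);
        return -1;
    }

    long long written = 0;
    for (long long i = 0; i < count; i++) {
        if (fwrite(block, 1, block_size, f) != block_size) {
            written = -1;
            break;
        }
        written += (long long)block_size;
    }

    if (fclose(f) != 0)
        written = -1;
    free(block);
    return written;
}
```

Calling write_zero_file("zero", 1024, 1048576) from /md0 would correspond to step 7.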

Interestingly, both mainline (2.4.20-pre8) and the Red Hat 8 kernel (2.4.18-14) do
not exhibit this regression (in a few attempts).

Please feel free to suggest any ideas to pinpoint the issue if you can. I will
test the system with ReiserFS and Debian Woody (gcc 2.95.4) later today or
tomorrow.
--
Hari
[email protected]

2002-10-10 01:20:00

by Andrea Arcangeli

Subject: Re: Linux-2.4.20-pre8-aa2 oops report. [solved]

Hello Srihari,

On Sat, Oct 05, 2002 at 05:55:01PM +1000, Srihari Vijayaraghavan wrote:
> On Saturday 05 October 2002 13:09, Srihari Vijayaraghavan wrote:
> > On Saturday 05 October 2002 12:47, Srihari Vijayaraghavan wrote:
> > > [1.] One line summary of the problem:
> > > 2.4.20-pre8aa2 Kernel oopsed couple of times.
>
> I was able to produce couple of more oops.

thanks for your detailed reports, please try to reproduce any problem
you had with this incremental fix applied on top of 2.4.20pre8aa2:

--- ul-20021007/kernel/sched.c.~1~ Tue Oct 8 07:14:19 2002
+++ ul-20021007/kernel/sched.c Thu Oct 10 02:29:58 2002
@@ -380,6 +387,7 @@ void wake_up_forked_process(task_t * p)
 		parent = NULL;
 	}
 
+	p->cpu = smp_processor_id();
 	__activate_task(p, rq, parent);
 	spin_unlock_irq(&rq->lock);
 }


I started to get random reports of corruption after I fixed the
scheduler starvation and resurrected a non-weak schedule-child-first
logic in the latest few -aa releases. It took so long because I really
couldn't see anything wrong in that patch (and indeed there wasn't anything
wrong). The new schedule-child-first logic can put the newly forked task in
the expired array (to run it just before the parent to maximize cache
effects, and to avoid favouring children too much by always putting them
in the active array), and it somehow brought to light a core bug in
the o1 scheduler. This bug is not present in 2.5. I found it after some
days of heavy debugging while trying to find out what was wrong with the
schedule-child-first changes. A task running with a wrong
smp_processor_id() generates very weird oopses and crashes; it is one of
the things that has the most unpredictable side effects. The above
patch should bring back total solidity to my tree. Tomorrow I will
release a new -aa with this applied (I may use p->cpu = parent->cpu just
in case it's simpler for the compiler to optimize, but it will be
completely equivalent to the above).
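
To illustrate the kind of inconsistency described here, the following is a toy userspace model, not the kernel's scheduler code. It assumes per-CPU runqueues and a per-task cpu field that the rest of the scheduler uses to find a task's runqueue. All names only mimic the kernel's and are hypothetical, and how a stale cpu value arises in the real kernel is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 2

struct runqueue {
    int nr_running;
};

struct task {
    int cpu;                     /* CPU the task believes it belongs to */
    struct runqueue *queued_on;  /* runqueue it was actually put on */
};

static struct runqueue runqueues[NR_CPUS];

/* The runqueue other scheduler code would look up for this task. */
struct runqueue *task_rq(const struct task *p)
{
    return &runqueues[p->cpu];
}

/* Queue a freshly forked task on the runqueue of the forking CPU.
 * With apply_fix set, the task's cpu field is updated first, as the
 * one-line patch does. */
void wake_up_forked(struct task *p, int this_cpu, bool apply_fix)
{
    struct runqueue *rq = &runqueues[this_cpu];

    if (apply_fix)
        p->cpu = this_cpu;
    rq->nr_running++;
    p->queued_on = rq;
}

/* True if the task's cpu field points at the runqueue it is queued on. */
bool bookkeeping_consistent(const struct task *p)
{
    return task_rq(p) == p->queued_on;
}
```

In this model, a child whose cpu field says 1 but which is woken on CPU 0 ends up queued on one runqueue while its cpu field points at another. Setting the field before activation removes the mismatch.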

Special thanks to Chris Mason for the help and for finding a way to
reproduce it reliably and even for getting the only reliable single oops
out of it (that I happened to discard because at first glance it looked
corrupt like the others ;)

Other 2.4 backports of the o1 scheduler may want to verify that they
didn't inherit this subtle bug. (I just checked that -ac doesn't have it)

Andrea

2002-10-10 10:04:10

by Srihari Vijayaraghavan

Subject: Re: Linux-2.4.20-pre8-aa2 oops report. [solved]

Hello Andrea,

> thanks for your detailed reports, please try to reproduce any problem
> you had with this incremental fix applied on top of 2.4.20pre8aa2:
>
> --- ul-20021007/kernel/sched.c.~1~ Tue Oct 8 07:14:19 2002
> +++ ul-20021007/kernel/sched.c Thu Oct 10 02:29:58 2002
> @@ -380,6 +387,7 @@ void wake_up_forked_process(task_t * p)
> parent = NULL;
> }
>
> + p->cpu = smp_processor_id();
> __activate_task(p, rq, parent);
> spin_unlock_irq(&rq->lock);
> }
>

Thanks. Unfortunately that did not fix the problem.

I was able to reproduce 4 more oopses (all happened one after the other).

ksymoops 2.4.5 on i686 2.4.20-pre8aa2-p1. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.20-pre8aa2-p1/ (default)
-m /boot/System.map-2.4.20-pre8aa2-p1 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Oct 10 19:26:36 localhost kernel: Unable to handle kernel NULL pointer
dereference at virtual address 0000011b
Oct 10 19:26:36 localhost kernel: c01a96b2
Oct 10 19:26:36 localhost kernel: *pde = 00000000
Oct 10 19:26:36 localhost kernel: Oops: 0000 2.4.20-pre8aa2-p1 #4 Thu Oct 10
19:12:17 EST 2002
Oct 10 19:26:36 localhost kernel: CPU: 0
Oct 10 19:26:36 localhost kernel: EIP: 0010:[<c01a96b2>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Oct 10 19:26:36 localhost kernel: EFLAGS: 00010213
Oct 10 19:26:36 localhost kernel: eax: 00000113 ebx: 00000145 ecx:
c37eff64 edx: c69aedc0
Oct 10 19:26:36 localhost kernel: esi: c5bd4000 edi: c69aedc0 ebp:
00000000 esp: c37eff1c
Oct 10 19:26:36 localhost kernel: ds: 0018 es: 0018 ss: 0018
Oct 10 19:26:36 localhost kernel: Process bonobo-activati (pid: 988,
stackpage=c37ef000)
Oct 10 19:26:36 localhost kernel: Stack: c5bd4020 c51daa40 00000004 c014a279
c69aedc0 00000000 00000000 7fffffff
Oct 10 19:26:36 localhost kernel: 00000000 00000000 c014a37f 00000005
c5bd4000 c37eff64 c37eff60 c37ee000
Oct 10 19:26:36 localhost kernel: c37ee000 00000000 00000000 c37effa8
08082fe0 00000000 00000005 c014a4fc
Oct 10 19:26:36 localhost kernel: Call Trace: [<c014a279>] [<c014a37f>]
[<c014a4fc>] [<c0108eff>]
Oct 10 19:26:36 localhost kernel: Code: 8b 48 08 89 44 24 04 89 14 24 8b 44 24
14 89 44 24 08 ff 51


>>EIP; c01a96b2 <sock_poll+12/30> <=====

>>ecx; c37eff64 <[iptable_filter].data.end+12065d9/182e6f5>
>>edx; c69aedc0 <END_OF_CODE+2af69/????>
>>esi; c5bd4000 <[soundcore].bss.end+1d289d/3ae91d>
>>edi; c69aedc0 <END_OF_CODE+2af69/????>
>>esp; c37eff1c <[iptable_filter].data.end+1206591/182e6f5>

Trace; c014a279 <do_pollfd+89/90>
Trace; c014a37f <do_poll+ff/110>
Trace; c014a4fc <sys_poll+16c/2f0>
Trace; c0108eff <system_call+33/38>

Code; c01a96b2 <sock_poll+12/30>
00000000 <_EIP>:
Code; c01a96b2 <sock_poll+12/30> <=====
0: 8b 48 08 mov 0x8(%eax),%ecx <=====
Code; c01a96b5 <sock_poll+15/30>
3: 89 44 24 04 mov %eax,0x4(%esp,1)
Code; c01a96b9 <sock_poll+19/30>
7: 89 14 24 mov %edx,(%esp,1)
Code; c01a96bc <sock_poll+1c/30>
a: 8b 44 24 14 mov 0x14(%esp,1),%eax
Code; c01a96c0 <sock_poll+20/30>
e: 89 44 24 08 mov %eax,0x8(%esp,1)
Code; c01a96c4 <sock_poll+24/30>
12: ff 51 00 call *0x0(%ecx)
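
The fault address of this first oops can be cross-checked against the
register dump: the marked instruction is "mov 0x8(%eax),%ecx" and eax is
00000113, so the access lands at eax + 8. A minimal sketch of that
arithmetic (the helper name fault_addr is made up for illustration and is
not part of ksymoops):

```shell
#!/bin/sh
# fault_addr BASE DISP: effective address of a "DISP(%reg)" operand,
# truncated to 32 bits as on i386. Hypothetical helper for illustration.
fault_addr() {
    printf '%08x\n' $(( ($1 + $2) & 0xffffffff ))
}

fault_addr 0x113 0x8    # eax + 8 from the sock_poll oops
```

This prints 0000011b, the same value as in the "virtual address 0000011b"
line, so the "NULL pointer" here is a small bogus value in eax rather than
a literal zero.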

Oct 10 19:26:36 localhost kernel: CPU: 0
Oct 10 19:26:36 localhost kernel: EIP: 0010:[<c0132998>] Not tainted
Oct 10 19:26:36 localhost kernel: EFLAGS: 00010057
Oct 10 19:26:36 localhost kernel: eax: ffffffff ebx: ffffffbf ecx:
c4973000 edx: ffffffff
Oct 10 19:26:37 localhost kernel: esi: c15870c0 edi: 00000246 ebp:
000001f0 esp: c7635f60
Oct 10 19:26:37 localhost kernel: ds: 0018 es: 0018 ss: 0018
Oct 10 19:26:37 localhost kernel: Process gnome-settings- (pid: 992,
stackpage=c7635000)
Oct 10 19:26:37 localhost kernel: Stack: 00000000 00000000 c7634000 080bdcc8
080bdcc8 bffff618 c014a657 c15870c0
Oct 10 19:26:37 localhost kernel: 000001f0 c31a99c0 c3474000 c7634000
c01150eb c51daac0 c7635fa8 00000000
Oct 10 19:26:37 localhost kernel: fffffff4 c013a8f9 00000000 00000000
c7634000 420d2220 080bdcc8 bffff618
Oct 10 19:26:37 localhost kernel: Call Trace: [<c014a657>] [<c01150eb>]
[<c013a8f9>] [<c0108eff>]
Oct 10 19:26:37 localhost kernel: Code: 89 10 89 42 04 c7 01 00 00 00 00 8b 06
89 48 04 89 01 89 71


>>EIP; c0132998 <__kmem_cache_alloc+78/f0> <=====

>>eax; ffffffff <END_OF_CODE+3967c1a8/????>
>>ebx; ffffffbf <END_OF_CODE+3967c168/????>
>>ecx; c4973000 <[radeon].bss.end+7dda89/186ab09>
>>edx; ffffffff <END_OF_CODE+3967c1a8/????>
>>esi; c15870c0 <_end+12ff710/15786d0>
>>esp; c7635f60 <END_OF_CODE+cb2109/????>

Trace; c014a657 <sys_poll+2c7/2f0>
Trace; c01150eb <do_schedule+15b/240>
Trace; c013a8f9 <sys_writev+69/80>
Trace; c0108eff <system_call+33/38>

Code; c0132998 <__kmem_cache_alloc+78/f0>
00000000 <_EIP>:
Code; c0132998 <__kmem_cache_alloc+78/f0> <=====
0: 89 10 mov %edx,(%eax) <=====
Code; c013299a <__kmem_cache_alloc+7a/f0>
2: 89 42 04 mov %eax,0x4(%edx)
Code; c013299d <__kmem_cache_alloc+7d/f0>
5: c7 01 00 00 00 00 movl $0x0,(%ecx)
Code; c01329a3 <__kmem_cache_alloc+83/f0>
b: 8b 06 mov (%esi),%eax
Code; c01329a5 <__kmem_cache_alloc+85/f0>
d: 89 48 04 mov %ecx,0x4(%eax)
Code; c01329a8 <__kmem_cache_alloc+88/f0>
10: 89 01 mov %eax,(%ecx)
Code; c01329aa <__kmem_cache_alloc+8a/f0>
12: 89 71 00 mov %esi,0x0(%ecx)

Oct 10 19:26:37 localhost kernel: <1>Unable to handle kernel NULL pointer
dereference at virtual address 00000003
Oct 10 19:26:38 localhost kernel: c0131412
Oct 10 19:26:38 localhost kernel: *pde = 00000000
Oct 10 19:26:38 localhost kernel: Oops: 0000 2.4.20-pre8aa2-p1 #4 Thu Oct 10
19:12:17 EST 2002
Oct 10 19:26:38 localhost kernel: CPU: 0
Oct 10 19:26:38 localhost kernel: EIP: 0010:[<c0131412>] Not tainted
Oct 10 19:26:38 localhost kernel: EFLAGS: 00010286
Oct 10 19:26:38 localhost kernel: eax: e4cb0000 ebx: ffffffff ecx:
c020d768 edx: c497378c
Oct 10 19:26:38 localhost kernel: esi: c7634000 edi: c31a99c0 ebp:
0000000b esp: c7635e98
Oct 10 19:26:38 localhost kernel: ds: 0018 es: 0018 ss: 0018
Oct 10 19:26:38 localhost kernel: Process gnome-settings- (pid: 992,
stackpage=c7635000)
Oct 10 19:26:38 localhost kernel: Stack: c020e600 00000005 c31a99c0 c012a513
e4cb0000 00000046 00000001 000001f0
Oct 10 19:26:38 localhost kernel: c31a99c0 c7634000 c0109a10 0000000b
c0116a36 c31a99c0 00000202 c31a99c0
Oct 10 19:26:38 localhost kernel: c011b807 c31a99c0 00000000 c7635f2c
00000000 c0109a10 000001f0 c01095ef
Oct 10 19:26:38 localhost kernel: Call Trace: [<c012a513>] [<c0109a10>]
[<c0116a36>] [<c011b807>] [<c0109a10>]
Oct 10 19:26:38 localhost kernel: [<c01095ef>] [<c0109a61>] [<c0108ff0>]
[<c0132998>] [<c014a657>] [<c01150eb>]
Oct 10 19:26:38 localhost kernel: [<c013a8f9>] [<c0108eff>]
Oct 10 19:26:38 localhost kernel: Code: 39 43 04 74 1f 8d 53 0c 8b 5b 0c 85 db
75 f1 c7 04 24 80 51


>>EIP; c0131412 <vfree+22/80> <=====

>>eax; e4cb0000 <END_OF_CODE+1e32c1a9/????>
>>ebx; ffffffff <END_OF_CODE+3967c1a8/????>
>>ecx; c020d768 <gdt_table+68/e0>
>>edx; c497378c <[radeon].bss.end+7de215/186ab09>
>>esi; c7634000 <END_OF_CODE+cb01a9/????>
>>edi; c31a99c0 <[iptable_filter].data.end+bc0035/182e6f5>
>>esp; c7635e98 <END_OF_CODE+cb2041/????>

Trace; c012a513 <exit_mmap+13/130>
Trace; c0109a10 <do_general_protection+0/a0>
Trace; c0116a36 <mmput+56/d0>
Trace; c011b807 <do_exit+87/260>
Trace; c0109a10 <do_general_protection+0/a0>
Trace; c01095ef <die+7f/80>
Trace; c0109a61 <do_general_protection+51/a0>
Trace; c0108ff0 <error_code+34/3c>
Trace; c0132998 <__kmem_cache_alloc+78/f0>
Trace; c014a657 <sys_poll+2c7/2f0>
Trace; c01150eb <do_schedule+15b/240>
Trace; c013a8f9 <sys_writev+69/80>
Trace; c0108eff <system_call+33/38>

Code; c0131412 <vfree+22/80>
00000000 <_EIP>:
Code; c0131412 <vfree+22/80> <=====
0: 39 43 04 cmp %eax,0x4(%ebx) <=====
Code; c0131415 <vfree+25/80>
3: 74 1f je 24 <_EIP+0x24>
Code; c0131417 <vfree+27/80>
5: 8d 53 0c lea 0xc(%ebx),%edx
Code; c013141a <vfree+2a/80>
8: 8b 5b 0c mov 0xc(%ebx),%ebx
Code; c013141d <vfree+2d/80>
b: 85 db test %ebx,%ebx
Code; c013141f <vfree+2f/80>
d: 75 f1 jne 0 <_EIP>
Code; c0131421 <vfree+31/80>
f: c7 04 24 80 51 00 00 movl $0x5180,(%esp,1)
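
The reported address 00000003 looks odd until the 32-bit wraparound is
taken into account: the marked instruction is "cmp %eax,0x4(%ebx)" and ebx
is ffffffff. A small sketch of the calculation, assuming the shell does its
arithmetic in 64 bits (wrap32 is a made-up helper name):

```shell
#!/bin/sh
# wrap32 VALUE: reduce VALUE modulo 2^32 and print it as 8 hex digits.
# Hypothetical helper for illustration only.
wrap32() {
    printf '%08x\n' $(( $1 & 0xffffffff ))
}

# ebx = ffffffff (that is, -1) and the operand is 0x4(%ebx)
wrap32 $(( 0xffffffff + 0x4 ))
```

This prints 00000003, matching the oops text, so vfree was walking a list
whose next pointer had become ffffffff.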

Oct 10 19:26:38 localhost kernel: CPU: 0
Oct 10 19:26:38 localhost kernel: EIP: 0010:[<c0132998>] Not tainted
Oct 10 19:26:38 localhost kernel: EFLAGS: 00010057
Oct 10 19:26:38 localhost kernel: eax: ffffffff ebx: ffffffbf ecx:
c4973000 edx: ffffffff
Oct 10 19:26:38 localhost kernel: esi: c15870c0 edi: 00000246 ebp:
000001f0 esp: c6721f3c
Oct 10 19:26:38 localhost kernel: ds: 0018 es: 0018 ss: 0018
Oct 10 19:26:38 localhost kernel: Process esd (pid: 998, stackpage=c6721000)
Oct 10 19:26:38 localhost kernel: Stack: 7fffffff 00000017 fffffff4 00000001
c6720000 bffff848 c0149d2c c15870c0
Oct 10 19:26:38 localhost kernel: 000001f0 c0149e39 00000004 00000004
c6721f8c 00000005 08054450 bffff8bc
Oct 10 19:26:38 localhost kernel: bffff8e8 00000004 00000031 bffff8c0
00000000 c4973440 c4973444 c4973448
Oct 10 19:26:38 localhost kernel: Call Trace: [<c0149d2c>] [<c0149e39>]
[<c0108eff>]
Oct 10 19:26:38 localhost kernel: Code: 89 10 89 42 04 c7 01 00 00 00 00 8b 06
89 48 04 89 01 89 71


>>EIP; c0132998 <__kmem_cache_alloc+78/f0> <=====

>>eax; ffffffff <END_OF_CODE+3967c1a8/????>
>>ebx; ffffffbf <END_OF_CODE+3967c168/????>
>>ecx; c4973000 <[radeon].bss.end+7dda89/186ab09>
>>edx; ffffffff <END_OF_CODE+3967c1a8/????>
>>esi; c15870c0 <_end+12ff710/15786d0>
>>esp; c6721f3c <[ac97_codec].data.end+92ab35/b88c79>

Trace; c0149d2c <select_bits_alloc+1c/20>
Trace; c0149e39 <sys_select+f9/4b0>
Trace; c0108eff <system_call+33/38>

Code; c0132998 <__kmem_cache_alloc+78/f0>
00000000 <_EIP>:
Code; c0132998 <__kmem_cache_alloc+78/f0> <=====
0: 89 10 mov %edx,(%eax) <=====
Code; c013299a <__kmem_cache_alloc+7a/f0>
2: 89 42 04 mov %eax,0x4(%edx)
Code; c013299d <__kmem_cache_alloc+7d/f0>
5: c7 01 00 00 00 00 movl $0x0,(%ecx)
Code; c01329a3 <__kmem_cache_alloc+83/f0>
b: 8b 06 mov (%esi),%eax
Code; c01329a5 <__kmem_cache_alloc+85/f0>
d: 89 48 04 mov %ecx,0x4(%eax)
Code; c01329a8 <__kmem_cache_alloc+88/f0>
10: 89 01 mov %eax,(%ecx)
Code; c01329aa <__kmem_cache_alloc+8a/f0>
12: 89 71 00 mov %esi,0x0(%ecx)


1 warning issued. Results may not be reliable.
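
The warning comes from running ksymoops with only its defaults. A sketch of
a helper that prints an invocation naming the symbol sources explicitly is
below. It uses only the -k, -l, -o and -m options listed in the output
above; the file name oops.txt and the function name are assumptions, and
/proc/ksyms and /proc/modules are only valid sources while the kernel that
oopsed is still the one running.

```shell
#!/bin/sh
# ksymoops_cmd KVER OOPSFILE: print (do not run) a ksymoops command line
# that names the symbol sources explicitly. Hypothetical helper; the paths
# follow the defaults shown in the ksymoops output above.
ksymoops_cmd() {
    kver=$1
    oopsfile=$2
    echo "ksymoops -k /proc/ksyms -l /proc/modules" \
         "-o /lib/modules/$kver/ -m /boot/System.map-$kver < $oopsfile"
}

ksymoops_cmd 2.4.20-pre8aa2-p1 oops.txt
```

Printing the command instead of running it makes it easy to check the paths
before decoding.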

I am able to easily reproduce the issue by doing:
1. Login to XFree86/Gnome or KDE
2. Run Mozilla, Open Office Writer/Impress/Calc and exit all of them
3. mke2fs -j /dev/hda9 (that is a blank 2.5 GB partition)
4. mount /dev/hda9 /test
5. cd /test; dd if=/dev/zero of=zero bs=1024 count=1048576
6. Log out of XFree86 and log back in
7. An oops appears in the system logs
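
The disk part of the recipe (steps 3 to 5) could be scripted roughly as
follows. This is only a sketch: the step and disk_steps names and the
RUN/DEV/MNT variables are made up here, and the commands are printed rather
than executed unless RUN=1 is set, because mke2fs wipes the partition.

```shell
#!/bin/sh
# Disk part (steps 3-5) of the reproduction recipe. By default the commands
# are only printed; set RUN=1 to execute them. mke2fs DESTROYS the contents
# of the device, so /dev/hda9 must really be a scratch partition.
DEV=${DEV:-/dev/hda9}
MNT=${MNT:-/test}

# step CMD...: run CMD when RUN=1, otherwise just print it.
step() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

disk_steps() {
    step mke2fs -j "$DEV"
    step mount "$DEV" "$MNT"
    # 1048576 blocks of 1024 bytes = 1 GiB of zeroes
    step dd if=/dev/zero of="$MNT/zero" bs=1024 count=1048576
}

# Usage: "disk_steps" for a dry run, "RUN=1 disk_steps" to really do it.
```

The desktop steps (1, 2 and 6) still have to be done by hand.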

--
Hari
[email protected]

2002-10-13 01:39:55

by Srihari Vijayaraghavan

Subject: 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved])

Hello,

On Thursday 10 October 2002 20:17, Srihari Vijayaraghavan wrote:
> Thanks. Unfortunately that did not fix the problem.
>
> I was able to reproduce 4 more oops. (all happened one after other)
>
> ksymoops 2.4.5 on i686 2.4.20-pre8aa2-p1. Options used

Here is a similar oops report from 2.4.20-pre10aa1.

ksymoops 2.4.5 on i686 2.4.20-pre10aa1. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.20-pre10aa1/ (default)
-m /boot/System.map-2.4.20-pre10aa1 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Oct 11 22:43:19 localhost kernel: Unable to handle kernel paging request at
virtual address cbe8e000
Oct 11 22:43:19 localhost kernel: c01e55e2
Oct 11 22:43:19 localhost kernel: *pde = 0bc001e3
Oct 11 22:43:19 localhost kernel: Oops: 0002 2.4.20-pre10aa1 #3 Fri Oct 11
22:10:08 EST 2002
Oct 11 22:43:19 localhost kernel: CPU: 0
Oct 11 22:43:19 localhost kernel: EIP: 0010:[<c01e55e2>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Oct 11 22:43:19 localhost kernel: EFLAGS: 00013246
Oct 11 22:43:19 localhost kernel: eax: 0000003f ebx: cbe8e000 ecx:
c9f8e000 edx: 00000000
Oct 11 22:43:19 localhost kernel: esi: c3f7d4b0 edi: 000004b0 ebp:
c120c084 esp: c9f8feac
Oct 11 22:43:19 localhost kernel: ds: 0018 es: 0018 ss: 0018
Oct 11 22:43:19 localhost kernel: Process modprobe (pid: 1675,
stackpage=c9f8f000)
Oct 11 22:43:19 localhost kernel: Stack: 00104025 c0126952 cbe8e000 c95bc420
4212c1fc dff87e00 cbc1a140 c0126d7e
Oct 11 22:43:19 localhost kernel: dff87e00 cbc1a140 c3f7d4b0 c95bc420
00000001 4212c1fc c9f8ff24 dff87e00
Oct 11 22:43:19 localhost kernel: cbc1a140 4212c1fc c9f8e000 c011240a
dff87e00 cbc1a140 4212c1fc 00000001
Oct 11 22:43:19 localhost kernel: Call Trace: [<c0126952>] [<c0126d7e>]
[<c011240a>] [<c012869f>] [<c01289d2>]
Oct 11 22:43:19 localhost kernel: [<c0128a54>] [<c0112260>] [<c01075b0>]
Oct 11 22:43:19 localhost kernel: Code: 0f e7 03 0f e7 43 08 0f e7 43 10 0f e7
43 18 0f e7 43 20 0f


>>EIP; c01e55e2 <fast_clear_page+12/50> <=====

>>ebx; cbe8e000 <[sr_mod].bss.end+54ea1a9/1925c229>
>>ecx; c9f8e000 <[sr_mod].bss.end+35ea1a9/1925c229>
>>esi; c3f7d4b0 <[agpgart].bss.end+200695/1b93265>
>>edi; 000004b0 Before first symbol
>>ebp; c120c084 <_end+f86b14/166cb10>
>>esp; c9f8feac <[sr_mod].bss.end+35ec055/1925c229>

Trace; c0126952 <do_anonymous_page+a2/110>
Trace; c0126d7e <handle_mm_fault+8e/160>
Trace; c011240a <do_page_fault+1aa/5a0>
Trace; c012869f <unmap_fixup+12f/140>
Trace; c01289d2 <do_munmap+292/2d0>
Trace; c0128a54 <sys_munmap+44/80>
Trace; c0112260 <do_page_fault+0/5a0>
Trace; c01075b0 <error_code+34/3c>

Code; c01e55e2 <fast_clear_page+12/50>
00000000 <_EIP>:
Code; c01e55e2 <fast_clear_page+12/50> <=====
0: 0f e7 03 movntq %mm0,(%ebx) <=====
Code; c01e55e5 <fast_clear_page+15/50>
3: 0f e7 43 08 movntq %mm0,0x8(%ebx)
Code; c01e55e9 <fast_clear_page+19/50>
7: 0f e7 43 10 movntq %mm0,0x10(%ebx)
Code; c01e55ed <fast_clear_page+1d/50>
b: 0f e7 43 18 movntq %mm0,0x18(%ebx)
Code; c01e55f1 <fast_clear_page+21/50>
f: 0f e7 43 20 movntq %mm0,0x20(%ebx)
Code; c01e55f5 <fast_clear_page+25/50>
13: 0f 00 00 sldtl (%eax)
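
The number after "Oops:" is the i386 page-fault error code, and its low
three bits say what kind of access faulted (bit 0: protection fault vs.
page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). A
small decoder sketch, with a made-up function name:

```shell
#!/bin/sh
# decode_oops CODE: explain the low three bits of an i386 page-fault
# error code as printed after "Oops:". Hypothetical helper name.
decode_oops() {
    code=$(( 0x$1 ))
    if [ $(( code & 1 )) -ne 0 ]; then p="protection fault"; else p="page not present"; fi
    if [ $(( code & 2 )) -ne 0 ]; then w="write"; else w="read"; fi
    if [ $(( code & 4 )) -ne 0 ]; then m="user mode"; else m="kernel mode"; fi
    echo "$p, $w, $m"
}

decode_oops 0002    # the fast_clear_page oops above
```

For 0002 this prints "page not present, write, kernel mode", which fits the
movntq store to cbe8e000 (the value in ebx).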


1 warning issued. Results may not be reliable.

The mainline (2.4.20-pre10) does not exhibit this issue. Unlike
2.4.20-pre8aa1, 2.4.20-pre10aa1 rebooted itself after the above oops.

I am hoping some of these oops might reveal the real issue/reason/bug to
kernel developers one of these days.

And my sincere thanks for your help.
--
Hari
[email protected]

2002-10-13 22:38:01

by Andrea Arcangeli

Subject: Re: 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved])

On Sun, Oct 13, 2002 at 11:53:29AM +1000, Srihari Vijayaraghavan wrote:
> Oct 11 22:43:19 localhost kernel: Process modprobe (pid: 1675,

this smells like a problem with one of your modules. Please make 100%
sure you use exactly the same .config for both 2.4.20pre10 and
2.4.20pre10aa1 and please try to find which is the module that is
crashing the kernel after it's been loaded. Always expect different
kinds of crashes and oopses. You can also try to turn on the slab
debugging option in the kernel hacking menu.

> Code; c01e55e2 <fast_clear_page+12/50>

you also may want to configure the kernel as i686 instead of K7 so
fast_clear_page won't be used to see if it makes any difference.

> The mainline (2.4.20-pre10) does not exhibit this issue. Unlike
> 2.4.20-pre8aa1, 2.4.20-pre10aa1 rebooted itself after the above oops.
>
> I am hoping some of these oops might reveal the real issue/reason/bug to
> kernel developers one of these days.

the place where the oops happens is most certainly not the problem,
either something is wrong with fast_clear_page for whatever hardware
reason, or more likely the module loaded by modprobe is corrupting the
freelist and alloc_pages returned garbage.

btw, how much memory do you have? If you've more than 800M it could be a
broken driver using pte_offset by hand, try to reproduce with mem=800m
in such case. To fix this you should find which is the module that is
destabilizing the kernel.
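
One way to start hunting for the module is to write down what is loaded
when the crash happens and then leave out one candidate per test run. A
minimal sketch, assuming the usual /proc/modules layout with the module
name in the first column (module_names is a made-up helper):

```shell
#!/bin/sh
# module_names [FILE]: print the name column of a /proc/modules-style
# listing, one module per line. FILE defaults to /proc/modules.
# Hypothetical helper for illustration.
module_names() {
    awk '{ print $1 }' "${1:-/proc/modules}"
}

# Usage: module_names > candidates.txt
```

Saving the list before each test makes it possible to compare which modules
were present in the runs that crashed and in the runs that did not.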

thanks for the reports.

Andrea

2002-10-15 01:02:42

by Srihari Vijayaraghavan

Subject: Re: 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved])

Hello Andrea,

> this smells like a problem with one of your modules. Please make 100%
> sure you use exactly the same .config for both 2.4.20pre10 and
> 2.4.20pre10aa1 and please try to find which is the module that is
> crashing the kernel after it's been loaded. Always expect different
> kinds of crashes and oopses. You can also try to turn on the slab
> debugging option in the kernel hacking menu.

Yes, I am using the same .config file from 2.4.20-pre10 on
2.4.20-pre10aa1 (of course I ran make oldconfig and accepted the default
settings that show up on 2.4.20-pre10aa1).
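
A quick way to double-check that two trees really share the same settings
is to compare only the CONFIG_ lines of both files. The sketch below is an
assumption about how one might do it (config_diff is a made-up name); since
make oldconfig adds the options that exist only in -aa, some one-sided
lines are expected.

```shell
#!/bin/sh
# config_diff A B: show CONFIG_ settings that differ between two kernel
# .config files, ignoring comments and line order. Returns diff's exit
# status (0 = identical settings). Hypothetical helper for illustration.
config_diff() {
    a=$(mktemp) && b=$(mktemp) || return 2
    grep '^CONFIG_' "$1" | sort > "$a"
    grep '^CONFIG_' "$2" | sort > "$b"
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}

# Usage: config_diff /path/to/mainline/.config /path/to/aa/.config
```

Lines starting with "<" exist only in the first file and lines starting
with ">" only in the second.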

I think you are right, it has something to do with the kernel modules.

> > Code; c01e55e2 <fast_clear_page+12/50>
>
> you also may want to configure the kernel as i686 instead of K7 so
> fast_clear_page won't be used to see if it makes any difference.

Ok. That didn't really help. Even a kernel compiled for i386 crashes, but
the k7 optimised kernel crashes at Athlon speed :-)

> the place where the oops happens is most certainly not the problem,
> either something is wrong with fast_clear_page for whatever hardware
> reason, or more likely the module loaded by modprobe is corrupting the
> freelist and alloc_pages returned garbage.
>
> btw, how much memory do you have? If you've more than 800M it could be a
> broken driver using pte_offset by hand, try to reproduce with mem=800m
> in such case. To fix this you should find which is the module that is
> destabilizing the kernel.

My computer has 512 MB RAM. No highmem.

I am able to trigger the issue (after 3 attempts [1]) with,
CONFIG_AGP=m
CONFIG_AGP_AMD=y
CONFIG_DRM=y
CONFIG_DRM_RADEON=m

I couldn't trigger the issue (in 5 attempts [1]) without them, so I
suspect it has something to do with them. Testing all of these takes a
lot of time, though, so I expect to have good answers in a couple of
days.

[1]
1. Login to XFree86/Gnome
2. Start Mozilla, Evolution, OpenOffice Writer/Calc/Impress, Konqueror,
KMail. And exit them all.
3. mke2fs -j /dev/hdc9; mount /dev/hdc9 /test;cd /test;dd if=/dev/zero
of=zero bs=1024 count=2097152;cd /
4. Redo step 2
5. Log out and log in and redo step 2
6. Unmount /test

Repeat the above test cycle a few times; on the 3rd attempt or so the
system oopses (when I had the AGP/AMD/DRM/Radeon stuff enabled).

Thanks for your help.

Hari
[email protected]


2002-10-15 12:51:12

by Srihari Vijayaraghavan

Subject: Re: 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved])

Hello Andrea,

> > this smells like a problem with one of your modules. Please make 100%
> > sure you use exactly the same .config for both 2.4.20pre10 and
> > 2.4.20pre10aa1 and please try to find which is the module that is
> > crashing the kernel after it's been loaded. Always expect different
> > kinds of crashes and oopses. You can also try to turn on the slab
> > debugging option in the kernel hacking menu.

That precisely is the reason. The bad news is that system crashes when agpgart
and radeon are compiled as modules, and the good news is that I am unable to
crash it when they are not.

Mainline (2.4.20-pre10) is stable when agpgart and radeon are compiled as
modules.

The problem is much easier to reproduce than I thought, just log in and log
out of XFree86/Gnome few times (3 or more times in my case) is more than
adequate to crash it.

Here is the .config which is stable in -aa1:
CONFIG_AGP=y
CONFIG_AGP_AMD=y
CONFIG_DRM=y
CONFIG_DRM_NEW=y
CONFIG_DRM_RADEON=y

Here is the .config which destabilises the -aa1 kernel:
CONFIG_AGP=m
CONFIG_AGP_AMD=y
CONFIG_DRM=y
CONFIG_DRM_NEW=y
CONFIG_DRM_RADEON=m

Unfortunately the system just reboots without leaving any oops information
in the system logs. If you want, I can try a few older versions of -aa to
find out when it started happening.

Thanks for your help.
--
Hari
[email protected]

2002-10-15 13:58:45

by Srihari Vijayaraghavan

Subject: Re: 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved])

Hello,

> That precisely is the reason. The bad news is that system crashes when
> agpgart and radeon are compiled as modules, and the good news is that I am
> unable to crash it when they are not.

My goodness, I have spoken too early I guess. The -aa kernel crashes whether
agpgart and radeon are modules or not.

> Mainline (2.4.20-pre10) is stable when agpgart and radeon are compiled as
> modules.

That holds true still.

> The problem is much easier to reproduce than I thought, just log in and log
> out of XFree86/Gnome few times (3 or more times in my case) is more than
> adequate to crash it.

That is still the case.
--
Hari
[email protected]

2002-10-16 08:34:06

by Andrea Arcangeli

Subject: Re: 2.4.20-pre10aa1 oops report (was Re: Linux-2.4.20-pre8-aa2 oops report. [solved]) [solved2? ac97]

On Wed, Oct 16, 2002 at 12:13:02AM +1000, Srihari Vijayaraghavan wrote:
> Hello,
>
> > That precisely is the reason. The bad news is that system crashes when
> > agpgart and radeon are compiled as modules, and the good news is that I am
> > unable to crash it when they are not.
>
> My goodness, I have spoken too early I guess. The -aa kernel crashes whether
> agpgart and radeon are modules or not.

I've been running this kernel for 5 days now, very often under heavy load
(also with thousands of tasks with volanomark in the background, aio, and a
flood of writes from /dev/zero), and there's no sign of instability. The
only exception is a rare tcp race that has been reported for 2.4.19 on l-k
too. It is not fatal: it only deadlocks the tcp connection, and you have to
kill the task because readmsg will never return until it gets a signal. I
tried to debug it but with no luck yet; it is most certainly a mainline
issue too, and it triggers only during heavy load.

You probably did something incidentally (not part of your regression
test loop) that corrupted memory. The regression test is a workload that
will show you if the corruption has happened in the past or not, but the
regression test loop is not the thing that is generating the corruption.
The regression test loop is what gets _harmed_ by the corruption, it's
not the culprit.

My crystal ball is telling me that you could reproduce it easily on my
tree because when you feel finally stable and that you can restart doing
your usual work without worrying about oopses, you enjoy yourself
playing some music to relax. And you instead don't play music while you
try to reproduce the problem because you're busy looking at stressing
the kernel and in turn you can't reproduce the bug. Is she right? ;)

Please try with CONFIG_SOUND=n and make sure to run:

rm -r /lib/modules/2.4.20-pre10aa1

before "make modules_install" to avoid running stale modules (also enable
modversions just in case).

I see a pile of oopses all showing ac97 loaded into the kernel, some
also for 2.4.19, but they may be unrelated problems of course. A number
of reports showing definite random mm corruption like yours on top of
2.4.20-pre vanilla (not -aa) have most certainly been affected by the
ac97 bug too (I'm CC'ing the other affected testers, they can try the
same as you). I never tried ac97 (I have a couple of boxes that could
handle it, but I never attempted to play sound on those yet, and the
chipset may be different, so it may not trigger for me after all even if
I could load that module).

Hint: in the past I found it easier to reproduce various module bugs with a
loop like this:

while :; do insmod ac97_codec.o; rmmod ac97_codec.o; done

you can try the above and see if it triggers in seconds.
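
A bounded variant of that loop is sketched below. It is only an
illustration: the stress_module name and the INSMOD/RMMOD override
variables are made up, and unlike the one-liner it passes the bare module
name (without ".o") to rmmod.

```shell
#!/bin/sh
# stress_module MODULE.o [COUNT]: load and unload a module COUNT times
# (default 100), stopping at the first failure. INSMOD and RMMOD may be
# overridden; they default to insmod and rmmod. Hypothetical helper.
stress_module() {
    mod=$1
    count=${2:-100}
    name=$(basename "$mod" .o)    # rmmod gets the name without ".o"
    i=0
    while [ "$i" -lt "$count" ]; do
        "${INSMOD:-insmod}" "$mod" || { echo "insmod failed at iteration $i"; return 1; }
        "${RMMOD:-rmmod}" "$name" || { echo "rmmod failed at iteration $i"; return 1; }
        i=$(( i + 1 ))
    done
    echo "completed $count load/unload cycles"
}

# Usage (as root): stress_module ac97_codec.o 1000
```

Stopping at the first failure keeps the iteration number on the terminal,
which helps if the box locks up shortly afterwards.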

From grepping the l-k db it seems the bug was introduced in 2.4.19.
So I would suggest that you try to reproduce after a:

rm -r 2.4.20pre10aa1/drivers/sound
cp -a 2.4.18/drivers/sound 2.4.20pre10aa1/drivers
cd 2.4.20pre10aa1; make oldconfig ...

(of course you can replace 2.4.20pre10aa1 with 2.4.20pre11 vanilla or
2.4.20pre10ac2)

and see if the instability goes away?

Marcelo also included some further ac97 patch in pre11, maybe
2.4.20pre11aa1 will fix it, you may want to give it a try too when I
release it (OTOH, I'm fixing what seems to be a design bug in the o1
scheduler that is apparently generating a huge cpu waste, so I don't
guarantee that the very first release with these changes will be as
solid as 2.4.20pre10aa1 ;)

Thanks for all the reports,

Andrea