2005-03-31 13:26:07

by Steffen Moser

Subject: Oops with "linux-2.4.29"

Hi all,

one of our file servers (SuSE Linux 7.2, running "linux-2.4.29")
oopsed some days ago - here is the bug report:

[1.] One line summary of the problem:

Kernel "linux-2.4.29" oopses irregularly. The oopses seem to be
triggered by high I/O load on the SCSI subsystem.


[2.] Full description of the problem/report:

The machine, which is mainly used as a "samba" and "NIS/NFS" server on
a comprehensive secondary school's network, is unstable - especially
under heavy I/O load, for example, when a virus scanner ("antivir" {1})
is scanning all of the user data files.

The oopses are not only triggered by "antivir", but also during high
I/O load caused by a "samba" process, for example.

The machine is running quite an old installation of SuSE Linux 7.2. The
kernel in use is "linux-2.4.29". The security-relevant components
("samba", "ssh", "kernel", and so on) have been patched manually. I
haven't had the time to set up this machine on a newer distribution
yet. :-/

We are running software RAID1 on two SCSI hard disks; two further SCSI
disks are running without RAID. Only "ext3fs" is used on this machine.

One thing I have to add: The machine was already a bit unstable some
time ago (Nov 2003). I didn't have the time to write a bug report
at that time. :-(

But you can find some debug output from that time at {2}.

The errors I got in November 2003 (running linux-2.4.22) seem to be
clearly related to disk issues. I assume that the oops I am reporting
within this message is also related to disk, file system or SCSI sub-
system issues.

Between November 2003 and now the machine has run a bit more stably
(perhaps about one crash every two months), but at the moment I have to
reboot it at least once a week.

That's the reason I am writing this report now.

I suppose that the unstable behaviour is caused by hardware problems
(disk problems or SCSI problems?), but I don't know for sure.

"memtest" hasn't found anything, but I've had it running only for about
five hours up to now. The SCSI cabling has also been checked and seems
to be all right. Running a CPU-intensive application in the background
(for example, a distributed.net client) doesn't seem to trigger
the problem.


[3.] Keywords (i.e., modules, networking, kernel):

linux kernel 2.4.29 oops ext3 I/O high load SCSI


[4.] Kernel version (from /proc/version):

| Linux version 2.4.29 (root@fsa01) (gcc version 2.95.3 20010315 (SuSE)) #2 Thu Jan 20 12:25:04 CET 2005


[5.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/oops-tracing.txt)

| v00001@fsa01:~ > ./ksymoops-2.4.11/ksymoops -m /boot/System.map
| ksymoops 2.4.11 on i686 2.4.29. Options used
| -V (default)
| -k /proc/ksyms (default)
| -l /proc/modules (default)
| -o /lib/modules/2.4.29/ (default)
| -m /boot/System.map (specified)
|
| Reading Oops report from the terminal

[...]

| Code: 00 00 00 00 17 3a 23 00 01 00 00 00 04 09 00 00 00 00 00 00
| Unable to handle kernel NULL pointer dereference at virtual address
| 00000000
| ccabe7a0
| *pde = 00000000
| Oops: 0002
| CPU: 0
| EIP: 0010:[<ccabe7a0>] Not tainted
| Using defaults from ksymoops -t elf32-i386 -a i386
| EFLAGS: 00010282
| eax: 00000000 ebx: c13b8214 ecx: 00000017 edx: c0273c5c
| esi: dfe91504 edi: 00000000 ebp: da00a804 esp: d4e5bf20
| ds: 0018 es: 0018 ss: 0018
| Process antivir (pid: 28362, stackpage=d4e5b000)
| Stack: c13b8214 c015991c c0124041 d3baf260 c13b8214 00000000 d3baf260 08179504
| d3baf280 00001000 00000001 00000000 00000000 da00a740 c0124583 d3baf260
| d3baf280 d4e5bf8c c012446c 00000000 d3baf260 ffffffea 00000400 00000000
| Call Trace: [<c015991c>] [<c0124041>] [<c0124583>] [<c012446c>] [<c0130d56>]
| [<c0106b33>]
| Code: 00 00 00 00 17 3a 23 00 01 00 00 00 04 09 00 00 00 00 00 00
|
|
| >>EIP; ccabe7a0 <_end+c7cd2ac/20552b0c> <=====
|
| >>ebx; c13b8214 <_end+10c6d20/20552b0c>
| >>edx; c0273c5c <contig_page_data+dc/3c0>
| >>esi; dfe91504 <_end+1fba0010/20552b0c>
| >>ebp; da00a804 <_end+19d19310/20552b0c>
| >>esp; d4e5bf20 <_end+14b6aa2c/20552b0c>
|
| Trace; c015991c <ext3_get_block+0/64>
| Trace; c0124041 <do_generic_file_read+291/438>
| Trace; c0124583 <generic_file_read+8b/190>
| Trace; c012446c <file_read_actor+0/8c>
| Trace; c0130d56 <sys_read+96/f0>
| Trace; c0106b33 <system_call+33/38>
|
| Code; ccabe7a0 <_end+c7cd2ac/20552b0c>
| 00000000 <_EIP>:
| Code; ccabe7a0 <_end+c7cd2ac/20552b0c> <=====
| 0: 00 00 add %al,(%eax) <=====
| Code; ccabe7a2 <_end+c7cd2ae/20552b0c>
| 2: 00 00 add %al,(%eax)
| Code; ccabe7a4 <_end+c7cd2b0/20552b0c>
| 4: 17 pop %ss
| Code; ccabe7a5 <_end+c7cd2b1/20552b0c>
| 5: 3a 23 cmp (%ebx),%ah
| Code; ccabe7a7 <_end+c7cd2b3/20552b0c>
| 7: 00 01 add %al,(%ecx)
| Code; ccabe7a9 <_end+c7cd2b5/20552b0c>
| 9: 00 00 add %al,(%eax)
| Code; ccabe7ab <_end+c7cd2b7/20552b0c>
| b: 00 04 09 add %al,(%ecx,%ecx,1)


[6.] A small shell script or example program which triggers the
problem (if possible):

Running "antivir /usr/lib/AntiVir/antivir -s -e -del /home /export"
every two hours (started by "cron") will produce an oops like this
within a few days.


[7.] Environment

[7.1.] Software (add the output of the ver_linux script here):

| v00001@fsa01:~ > . /usr/src/linux-2.4.29/scripts/ver_linux
| If some fields are empty or look unusual you may have an old version.
| Compare to the current minimal requirements in Documentation/Changes.
|
| Linux fsa01 2.4.29 #2 Thu Jan 20 12:25:04 CET 2005 i686 unknown
|
| Gnu C 2.95.3
| Gnu make 3.79.1
| binutils 2.10.91.0.4
| util-linux 2.11l
| mount 2.11l
| modutils 2.4.5
| e2fsprogs 1.25
| pcmcia-cs 3.1.25
| quota-tools 3.08.
| Linux C Library x 1 root root 1341670 Dec 18 2001 /lib/libc.so.6
| Dynamic linker (ldd) 2.2.2
| Procps 2.0.7
| Net-tools 1.60
| Kbd 1.04
| Sh-utils 2.0
| Modules Loaded ipv6 3c59x ipchains serial


[7.2.] Processor information (from /proc/cpuinfo):

| processor : 0
| vendor_id : GenuineIntel
| cpu family : 6
| model : 8
| model name : Pentium III (Coppermine)
| stepping : 10
| cpu MHz : 871.044
| cache size : 256 KB
| fdiv_bug : no
| hlt_bug : no
| f00f_bug : no
| coma_bug : no
| fpu : yes
| fpu_exception : yes
| cpuid level : 2
| wp : yes
| flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
| bogomips : 1736.70


[7.3.] Module information (from /proc/modules):

| ipv6 144736 -1 (autoclean)
| 3c59x 25088 1 (autoclean)
| ipchains 38048 0
| serial 43456 1 (autoclean)


[7.4.] SCSI information (from /proc/scsi/scsi):

| Attached devices:
| Host: scsi0 Channel: 00 Id: 00 Lun: 00
| Vendor: IBM Model: DPSS-336950M Rev: S96H
| Type: Direct-Access ANSI SCSI revision: 03
| Host: scsi0 Channel: 00 Id: 01 Lun: 00
| Vendor: IBM Model: DPSS-336950M Rev: S96H
| Type: Direct-Access ANSI SCSI revision: 03
| Host: scsi0 Channel: 00 Id: 03 Lun: 00
| Vendor: HP Model: C1537A Rev: L005
| Type: Sequential-Access ANSI SCSI revision: 02
| Host: scsi0 Channel: 00 Id: 04 Lun: 00
| Vendor: IBM Model: DNES-318350 Rev: SA30
| Type: Direct-Access ANSI SCSI revision: 03
| Host: scsi0 Channel: 00 Id: 05 Lun: 00
| Vendor: IBM Model: DNES-318350 Rev: SA30
| Type: Direct-Access ANSI SCSI revision: 03


[7.5.] Other information that might be relevant to the problem
(please look in /proc and include all information that you
think to be relevant):

| System memory (at time of oops):
|
| total: used: free: shared: buffers: cached:
| Mem: 528605184 510717952 17887232 0 89063424 31199232
| Swap: 526336000 0 526336000
| MemTotal: 516216 kB
| MemFree: 17468 kB
| MemShared: 0 kB
| Buffers: 86976 kB
| Cached: 30468 kB
| SwapCached: 0 kB
| Active: 98716 kB
| Inactive: 18864 kB
| HighTotal: 0 kB
| HighFree: 0 kB
| LowTotal: 516216 kB
| LowFree: 17468 kB
| SwapTotal: 514000 kB
| SwapFree: 514000 kB

| System uptime:
|
| 10:40am up 9 days, 17 min, 2 users, load average: 6.01, 6.00, 6.00

| Partitioning:
|
| fsa01:~ # fdisk -l
|
| Disk /dev/sda: 255 heads, 63 sectors, 4492 cylinders
| Units = cylinders of 16065 * 512 bytes
|
| Device Boot Start End Blocks Id System
| /dev/sda1 * 1 2 16033+ fd Linux raid autodetect
| /dev/sda2 3 4492 36065925 5 Extended
| /dev/sda5 3 47 361431 fd Linux raid autodetect
| /dev/sda6 48 92 361431 fd Linux raid autodetect
| /dev/sda7 93 284 1542208+ fd Linux raid autodetect
| /dev/sda8 285 1814 12289693+ fd Linux raid autodetect
| /dev/sda9 1815 4459 21245931 fd Linux raid autodetect
| /dev/sda10 4460 4491 257008+ 82 Linux swap
|
| Disk /dev/sdb: 255 heads, 63 sectors, 4492 cylinders
| Units = cylinders of 16065 * 512 bytes
|
| Device Boot Start End Blocks Id System
| /dev/sdb1 1 2 16033+ fd Linux raid autodetect
| /dev/sdb2 3 4492 36065925 5 Extended
| /dev/sdb5 3 47 361431 fd Linux raid autodetect
| /dev/sdb6 48 92 361431 fd Linux raid autodetect
| /dev/sdb7 93 284 1542208+ fd Linux raid autodetect
| /dev/sdb8 285 1814 12289693+ fd Linux raid autodetect
| /dev/sdb9 1815 4459 21245931 fd Linux raid autodetect
| /dev/sdb10 4460 4491 257008+ 82 Linux swap
|
| Disk /dev/sdc: 255 heads, 63 sectors, 2231 cylinders
| Units = cylinders of 16065 * 512 bytes
|
| Device Boot Start End Blocks Id System
| /dev/sdc1 1 2231 17920476 83 Linux
|
| Disk /dev/sdd: 255 heads, 63 sectors, 2231 cylinders
| Units = cylinders of 16065 * 512 bytes
|
| Device Boot Start End Blocks Id System
| /dev/sdd1 1 2231 17920476 83 Linux

| Mounts:
|
| fsa01:~ # mount
| /dev/md1 on / type ext3 (rw)
| proc on /proc type proc (rw)
| devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
| /dev/md2 on /var type ext3 (rw)
| /dev/md3 on /usr type ext3 (rw)
| /dev/md0 on /boot type ext3 (rw)
| /dev/md4 on /home type ext3 (rw,usrquota)
| /dev/md5 on /export type ext3 (rw)
| /dev/sdc1 on /export/tausch type ext3 (rw)
| /dev/sdd1 on /export/admin type ext3 (rw)
| shmfs on /dev/shm type shm (rw)

| Disk free status:
|
| fsa01:~ # df -h
| Filesystem Size Used Avail Use% Mounted on
| /dev/md1 342M 53M 271M 17% /
| /dev/md2 342M 38M 286M 12% /var
| /dev/md3 1.4G 903M 504M 65% /usr
| /dev/md0 15M 5.0M 9.3M 35% /boot
| /dev/md4 11G 9.9G 1.0G 91% /home
| /dev/md5 20G 5.7G 13G 31% /export
| /dev/sdc1 17G 59M 15G 1% /export/tausch
| /dev/sdd1 17G 14G 2.0G 88% /export/admin
| shmfs 252M 0 252M 0% /dev/shm


- The relatively high load average (6.01) is caused by six hanging
and unkillable (and un-strace-able) "antivir" processes: one began
to hang when the oops happened, and "cron" started five further
"antivir" processes (which also got stuck) until I noticed the oops
and commented out the line in "/etc/crontab".
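
The Linux load average counts tasks in uninterruptible sleep (state "D")
as well as runnable ones, which would explain a steady value of 6 with
six stuck processes. The sketch below lists such tasks by reading
/proc/<pid>/stat. It assumes that the hung "antivir" processes are in
state "D", which is not stated explicitly above; the function names are
made up for this illustration.

```python
import glob


def state_of(stat_line):
    """Return the one-letter task state from a /proc/<pid>/stat line."""
    # The command name is enclosed in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST ')'.
    return stat_line.rsplit(")", 1)[1].split()[0]


def uninterruptible_pids(proc_root="/proc"):
    """Return the sorted PIDs of all tasks currently in state 'D'."""
    pids = []
    for path in glob.glob(proc_root + "/[0-9]*/stat"):
        try:
            with open(path) as f:
                line = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        if state_of(line) == "D":
            pids.append(int(line.split(" ", 1)[0]))
    return sorted(pids)


if __name__ == "__main__":
    print(uninterruptible_pids())
```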

The output of "ps aux", the output of "smartctl -a /dev/sd[a-d]", "lspci",
the kernel config, and so on can be found at {3}, to keep this mail
from getting too long.

If you need further information or debug material, please let me
know.

Any help will be greatly appreciated! Thank you in advance!

Bye,
Steffen

{1} http://www.antivir.de/en/index.html
{2} http://www.uni-ulm.de/~s_smoser/ml/lkml/2003-11-09_01/
{3} http://www.uni-ulm.de/~s_smoser/ml/lkml/2005-03-31_01/


2005-04-04 15:13:43

by Marcelo Tosatti

Subject: Re: Oops with "linux-2.4.29"


Hi Steffen,

On Thu, Mar 31, 2005 at 03:24:49PM +0200, Steffen Moser wrote:
> Hi all,
>
> one of our file servers (SuSE Linux 7.2, running "linux-2.4.29")
> oopsed some days ago - here is the bug report:
>
> [1.] One line summary of the problem:
>
> Kernel "linux-2.4.29" oopses irregularly. The oopses seem to be
> triggered by high I/O load on the SCSI subsystem.
>
>
> [2.] Full description of the problem/report:
>
> The machine, which is mainly used as a "samba" and "NIS/NFS" server on
> a comprehensive secondary school's network, is unstable - especially
> under heavy I/O load, for example, when a virus scanner ("antivir" {1})
> is scanning all of the user data files.
>
> The oopses are not only triggered by "antivir", but also during high
> I/O load caused by a "samba" process, for example.
>
> The machine is running quite an old installation of SuSE Linux 7.2. The
> kernel which is used is "linux-2.4.29". The security relevant things
> ("samba", "ssh", "kernel", and so on) have been patched manually. I
> haven't had the time to set up this machine based on a newer distri-
> bution, yet. :-/
>

<snip>

> [...]
>
> | Code: 00 00 00 00 17 3a 23 00 01 00 00 00 04 09 00 00 00 00 00 00
> | Unable to handle kernel NULL pointer dereference at virtual address
> | 00000000
> | ccabe7a0
> | *pde = 00000000
> | Oops: 0002
> | CPU: 0
> | EIP: 0010:[<ccabe7a0>] Not tainted
> | Using defaults from ksymoops -t elf32-i386 -a i386
> | EFLAGS: 00010282
> | eax: 00000000 ebx: c13b8214 ecx: 00000017 edx: c0273c5c
> | esi: dfe91504 edi: 00000000 ebp: da00a804 esp: d4e5bf20
> | ds: 0018 es: 0018 ss: 0018
> | Process antivir (pid: 28362, stackpage=d4e5b000)
> | Stack: c13b8214 c015991c c0124041 d3baf260 c13b8214 00000000 d3baf260 08179504
> | d3baf280 00001000 00000001 00000000 00000000 da00a740 c0124583 d3baf260
> | d3baf280 d4e5bf8c c012446c 00000000 d3baf260 ffffffea 00000400 00000000
> | Call Trace: [<c015991c>] [<c0124041>] [<c0124583>] [<c012446c>] [<c0130d56>]
> | [<c0106b33>]
> | Code: 00 00 00 00 17 3a 23 00 01 00 00 00 04 09 00 00 00 00 00 00
> |
> |
> | >>EIP; ccabe7a0 <_end+c7cd2ac/20552b0c> <=====
> |
> | >>ebx; c13b8214 <_end+10c6d20/20552b0c>
> | >>edx; c0273c5c <contig_page_data+dc/3c0>
> | >>esi; dfe91504 <_end+1fba0010/20552b0c>
> | >>ebp; da00a804 <_end+19d19310/20552b0c>
> | >>esp; d4e5bf20 <_end+14b6aa2c/20552b0c>
> |
> | Trace; c015991c <ext3_get_block+0/64>
> | Trace; c0124041 <do_generic_file_read+291/438>
> | Trace; c0124583 <generic_file_read+8b/190>
> | Trace; c012446c <file_read_actor+0/8c>
> | Trace; c0130d56 <sys_read+96/f0>
> | Trace; c0106b33 <system_call+33/38>
> |
> | Code; ccabe7a0 <_end+c7cd2ac/20552b0c>
> | 00000000 <_EIP>:
> | Code; ccabe7a0 <_end+c7cd2ac/20552b0c> <=====
> | 0: 00 00 add %al,(%eax) <=====
> | Code; ccabe7a2 <_end+c7cd2ae/20552b0c>
> | 2: 00 00 add %al,(%eax)
> | Code; ccabe7a4 <_end+c7cd2b0/20552b0c>
> | 4: 17 pop %ss
> | Code; ccabe7a5 <_end+c7cd2b1/20552b0c>
> | 5: 3a 23 cmp (%ebx),%ah
> | Code; ccabe7a7 <_end+c7cd2b3/20552b0c>
> | 7: 00 01 add %al,(%ecx)
> | Code; ccabe7a9 <_end+c7cd2b5/20552b0c>
> | 9: 00 00 add %al,(%eax)
> | Code; ccabe7ab <_end+c7cd2b7/20552b0c>
> | b: 00 04 09 add %al,(%ecx,%ecx,1)

This looks like corruption - ext3_get_block() jumps to a bogus function
which contains bogus instructions. As if ext3_get_block() had been
overwritten with junk data.

Smells like bad hardware, but I'm not certain.
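
For reference, the "Oops: 0002" value in the quoted report is the i386
page fault error code, printed in hex. A minimal sketch of decoding its
three low bits follows (bit 0: protection fault vs. page not present,
bit 1: write vs. read, bit 2: user vs. kernel mode); the function name is
made up for this illustration.

```python
def decode_oops_code(code_hex):
    """Decode the low three bits of an i386 page fault error code."""
    code = int(code_hex, 16)
    return {
        "cause": "protection fault" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


if __name__ == "__main__":
    print(decode_oops_code("0002"))
```

Here 0002 decodes to a kernel-mode write to a page that is not present,
which is consistent with the first bogus instruction "add %al,(%eax)"
being executed with eax = 00000000.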

> [6.] A small shell script or example program which triggers the
> problem (if possible):
>
> Running "antivir /usr/lib/AntiVir/antivir -s -e -del /home /export"
> every two hours (started by "cron") will produce an oops like this
> within a few days.

Do you have other oopses saved? Please send em.

2005-04-08 15:46:03

by Steffen Moser

Subject: Re: Oops with "linux-2.4.29"

Hi Marcelo,

* On Sun, Apr 03, 2005 at 05:20 PM (-0300), Marcelo Tosatti wrote:

> This looks like corruption - ext3_get_block() jumps to a bogus function
> which contains bogus instructions. As if ext3_get_block() had been
> overwritten with junk data.

Thank you very much for your quick reply!

> Smells like bad hardware, but I'm not certain.
>
> > [6.] A small shell script or example program which triggers the
> > problem (if possible):
> >
> > Running "antivir /usr/lib/AntiVir/antivir -s -e -del /home /export"
> > every two hours (started by "cron") will produce an oops like this
> > within a few days.
>
> Do you have other oopses saved? Please send em.

Today, I (finally) got some oopses a few hours after I had started a
little file system stress test running a few "antivir" instances
(I had also been doing this over the last few days, ever since your
reply reached me).

I am still running "linux-2.4.29" on this machine.

I've put the oopses on to my web space:

http://www.uni-ulm.de/~s_smoser/ml/lkml/2005-04-08_01/fsa01_2005-05-08_oopses-syslog.log

The call trace addresses had already been resolved to symbols by "klogd"
(I hadn't started it with "-x"). Therefore I looked into the kernel ring
buffer using "dmesg", but unfortunately I didn't get all oopses from
there (in particular, the first one was no longer there). Nevertheless,
I also put up the (incomplete) oopses which I extracted from "dmesg":

http://www.uni-ulm.de/~s_smoser/ml/lkml/2005-04-08_01/fsa01_2005-05-08_oopses-dmesg.log

I haven't rebooted the machine yet (and I don't have to reboot before
Monday). If I should do further testing, please let me know.

Today, all of the running "antivir" processes were terminated with a
segmentation fault. The last time the machine oopsed (2005-03-31),
they just got stuck. Last time, I also didn't get "Unable to handle
kernel paging request at virtual address" (as happened today), but an
"Unable to handle kernel NULL pointer dereference" instead.
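
The two headlines differ only in the faulting address: on i386 the 2.4
fault handler reports a "NULL pointer dereference" when the address lies
within the first page and a "paging request" otherwise. A minimal sketch
of that distinction, assuming the usual 4 KB i386 page size (the function
name is made up for this illustration):

```python
PAGE_SIZE = 4096  # assumption: standard i386 page size


def oops_headline(fault_address):
    """Pick the oops headline the way the fault address suggests."""
    if fault_address < PAGE_SIZE:
        return "Unable to handle kernel NULL pointer dereference"
    return "Unable to handle kernel paging request"


if __name__ == "__main__":
    print(oops_headline(0x00000000))  # address from the 2005-03-31 oops
    print(oops_headline(0x14d8d3a9))  # address from one of today's oopses
```

So the different wording does not by itself point to a different cause;
it only says that this time the bad pointer was not (close to) zero.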

By the way (but it isn't very surprising): After today's oopses I can
easily produce more oopses by simply accessing the file:

/home/s00469/.seyon/protocols

(I saw that this file was the last one that was accessed by the
antivir processes before they segfaulted).

For example:

| strace cat /home/s00469/.seyon/protocols

results in a SIGSEGV (of course, also without "strace"):

| [...]
|
| open("/usr/lib/locale/en_US/LC_CTYPE", O_RDONLY) = 3
| fstat64(3, {st_mode=S_IFREG|0644, st_size=110304, ...}) = 0
| old_mmap(NULL, 110304, PROT_READ, MAP_PRIVATE, 3, 0) = 0x4013d000
| close(3) = 0
| fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
| open("/home/s00469/.seyon/protocols", O_RDONLY|O_LARGEFILE) = 3
| fstat64(3, {st_mode=S_IFREG|0644, st_size=2113, ...}) = 0
| brk(0x8050000) = 0x8050000
| read(3, <unfinished ...>
| +++ killed by SIGSEGV +++

and produces two further oopses:

| Unable to handle kernel paging request at virtual address 14d8d3a9
| c0124141
| *pde = 00000000
| Oops: 0002
| CPU: 0
| EIP: 0010:[<c0124141>] Not tainted
| Using defaults from ksymoops -t elf32-i386 -a i386
| EFLAGS: 00010202
| eax: 14d8d3a5 ebx: c1022fe8 ecx: 00000000 edx: 00000cb9
| esi: dfec1178 edi: c1022fe8 ebp: d8d3a511 esp: ccfc3f34
| ds: 0018 es: 0018 ss: 0018
| Process cat (pid: 15412, stackpage=ccfc3000)
| Stack: 00000000 dad358c0 0804dac8 dad358e0 00001000 00000001 00000000 00000000
| d2000000 c0124593 dad358c0 dad358e0 ccfc3f8c c012447c 00000000 dad358c0
| ffffffea 00001000 c0111f5d ccfc3fac ccfc2000 00000000 00000000 00001000
| Call Trace: [<c0124593>] [<c012447c>] [<c0111f5d>] [<c0130d66>] [<c0106b9f>]
| Code: 89 78 04 89 07 89 6f 04 89 7d 00 89 6f 08 89 f2 89 f8 e8 b8
|
|
| >>EIP; c0124141 <do_generic_file_read+381/438> <=====
|
| >>ebx; c1022fe8 <_end+d31ad4/20552aec>
| >>esi; dfec1178 <_end+1fbcfc64/20552aec>
| >>edi; c1022fe8 <_end+d31ad4/20552aec>
| >>ebp; d8d3a511 <_end+18a48ffd/20552aec>
| >>esp; ccfc3f34 <_end+ccd2a20/20552aec>
|
| Trace; c0124593 <generic_file_read+8b/190>
| Trace; c012447c <file_read_actor+0/8c>
| Trace; c0111f5d <schedule+2d1/2f8>
| Trace; c0130d66 <sys_read+96/f0>
| Trace; c0106b9f <tracesys+1f/23>
|
| Code; c0124141 <do_generic_file_read+381/438>
| 00000000 <_EIP>:
| Code; c0124141 <do_generic_file_read+381/438> <=====
| 0: 89 78 04 mov %edi,0x4(%eax) <=====
| Code; c0124144 <do_generic_file_read+384/438>
| 3: 89 07 mov %eax,(%edi)
| Code; c0124146 <do_generic_file_read+386/438>
| 5: 89 6f 04 mov %ebp,0x4(%edi)
| Code; c0124149 <do_generic_file_read+389/438>
| 8: 89 7d 00 mov %edi,0x0(%ebp)
| Code; c012414c <do_generic_file_read+38c/438>
| b: 89 6f 08 mov %ebp,0x8(%edi)
| Code; c012414f <do_generic_file_read+38f/438>
| e: 89 f2 mov %esi,%edx
| Code; c0124151 <do_generic_file_read+391/438>
| 10: 89 f8 mov %edi,%eax
| Code; c0124153 <do_generic_file_read+393/438>
| 12: e8 b8 00 00 00 call cf <_EIP+0xcf> c0124210 <generic_file_direct_IO+18/284>
|
| <1>Unable to handle kernel paging request at virtual address 8800002c
| c0141210
| *pde = 00000000
| Oops: 0000
| CPU: 0
| EIP: 0010:[<c0141210>] Not tainted
| EFLAGS: 00010286
| eax: 88000000 ebx: dad358c0 ecx: 88000000 edx: d881b8fc
| esi: 00000000 edi: d881b8fc ebp: d938cdc0 esp: ccfc3dcc
| ds: 0018 es: 0018 ss: 0018
| Process cat (pid: 15412, stackpage=ccfc3000)
| Stack: dad358c0 00000000 d938cdc0 00000001 c013090f dad358c0 d938cdc0 dad358c0
| d938cdc0 00000001 00000003 d938cdc0 c0116bb8 dad358c0 d938cdc0 de620d20
| ccfc3f00 ccfc2000 0000000b d938cedc c011714f d938cdc0 00000002 ccfc3f00
| Call Trace: [<c013090f>] [<c0116bb8>] [<c011714f>] [<c01070f6>] [<c01112c7>]
| [<c0110f24>] [<c012ad46>] [<c012133b>] [<c0121387>] [<c0106c24>] [<c0124141>]
| [<c0124593>] [<c012447c>] [<c0111f5d>] [<c0130d66>] [<c0106b9f>]
| Code: f6 40 2c 01 0f 84 06 01 00 00 39 68 14 0f 85 fd 00 00 00 8b
|
|
| >>EIP; c0141210 <locks_remove_posix+30/158> <=====
|
| >>ebx; dad358c0 <_end+1aa443ac/20552aec>
| >>edx; d881b8fc <_end+1852a3e8/20552aec>
| >>edi; d881b8fc <_end+1852a3e8/20552aec>
| >>ebp; d938cdc0 <_end+1909b8ac/20552aec>
| >>esp; ccfc3dcc <_end+ccd28b8/20552aec>
|
| Trace; c013090f <filp_close+4b/5c>
| Trace; c0116bb8 <put_files_struct+54/bc>
| Trace; c011714f <do_exit+ab/234>
| Trace; c01070f6 <die+56/58>
| Trace; c01112c7 <do_page_fault+3a3/4d4>
| Trace; c0110f24 <do_page_fault+0/4d4>
| Trace; c012ad46 <_alloc_pages+16/18>
| Trace; c012133b <do_anonymous_page+c3/dc>
| Trace; c0121387 <do_no_page+33/188>
| Trace; c0106c24 <error_code+34/3c>
| Trace; c0124141 <do_generic_file_read+381/438>
| Trace; c0124593 <generic_file_read+8b/190>
| Trace; c012447c <file_read_actor+0/8c>
| Trace; c0111f5d <schedule+2d1/2f8>
| Trace; c0130d66 <sys_read+96/f0>
| Trace; c0106b9f <tracesys+1f/23>
|
| Code; c0141210 <locks_remove_posix+30/158>
| 00000000 <_EIP>:
| Code; c0141210 <locks_remove_posix+30/158> <=====
| 0: f6 40 2c 01 testb $0x1,0x2c(%eax) <=====
| Code; c0141214 <locks_remove_posix+34/158>
| 4: 0f 84 06 01 00 00 je 110 <_EIP+0x110> c0141320 <locks_remove_posix+140/158>
| Code; c014121a <locks_remove_posix+3a/158>
| a: 39 68 14 cmp %ebp,0x14(%eax)
| Code; c014121d <locks_remove_posix+3d/158>
| d: 0f 85 fd 00 00 00 jne 110 <_EIP+0x110> c0141320 <locks_remove_posix+140/158>
| Code; c0141223 <locks_remove_posix+43/158>
| 13: 8b 00 mov (%eax),%eax

I can repeat this as often as I want, but I don't think it reveals
anything new - the system has been in an unstable state since the first
oops occurred today at 12:03:22.
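
As a cross-check of the two oopses above, the reported faulting addresses
can be recomputed from the marked instruction and the register dump: both
instructions access memory at eax plus a small displacement. This is only
a worked-arithmetic sketch; the helper name is made up for this
illustration.

```python
def effective_address(base, displacement):
    """Base register plus displacement, wrapped to 32 bits as on i386."""
    return (base + displacement) & 0xFFFFFFFF


if __name__ == "__main__":
    # First oops:  mov %edi,0x4(%eax)    with eax = 14d8d3a5
    print("%08x" % effective_address(0x14d8d3a5, 0x4))
    # Second oops: testb $0x1,0x2c(%eax) with eax = 88000000
    print("%08x" % effective_address(0x88000000, 0x2c))
```

Both results match the addresses in the oops headlines (14d8d3a9 and
8800002c), so in each case it is the pointer value held in eax that is
bad, not the instruction itself.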

The system's uptime is now: 2 days, 16:19.

I've also found some older oopses that I didn't post last time. I didn't
resolve them with "ksymoops" when they appeared, but they were partially
resolved by "klogd" (I no longer have the symbol file from that time). I
also uploaded them to my web space; you'll find them here:

http://www.uni-ulm.de/~s_smoser/ml/lkml/2005-04-08_01/fsa01_2005-03-22_oopses-syslog.log

All in all, both the places where the oopses occur and the way they
occur vary quite a bit, so I suppose that hardware problems are really
very likely.

Perhaps there is a link to these old SCSI errors:

http://www.uni-ulm.de/~s_smoser/ml/lkml/2003-11-09_01/logs/
(already posted in my last mail)

One thing the oopses have in common: A high I/O load on the SCSI system
seems to provoke them. High SCSI load is mainly caused by "antivir" or
by "smbd" (Samba) on this machine. These days I used "antivir" to
provoke the oopses. I no longer have the oopses caused by "smbd" (this
happened in February, IIRC).

Thank you very much for your analysis!

Best regards,
Steffen