2004-10-30 17:08:45

by dap

Subject: 2.6.10-rc1 crashes on recursive directory walk [2.6.9 was OK]


I use xfs and ext3 on a large ftp server with lots of files, and
when I run 'find / -ls' under kernel 2.6.10-rc1, the server crashes
with no Oops or other message. Only the reset button gets a response. I
can reproduce it any time with find, but the point of the crash is random;
it can crash on both xfs and ext3 partitions. 2.6.9 works fine in this
environment.


vm settings:

echo 16384 > /proc/sys/vm/min_free_kbytes
echo 28 > /proc/sys/vm/vfs_cache_pressure
echo 100 > /proc/sys/vm/swappiness

I've tried doubling min_free_kbytes, but it didn't help.
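
For anyone comparing the 2.6.9 and 2.6.10-rc1 boots, it may help to record the three vm tunables listed above before and after changing them. The following is only a sketch: the function name show_vm_settings and its optional directory argument are hypothetical, added for illustration and are not part of the original report.

```shell
# Print "name = value" for the three vm tunables mentioned in the report.
# The optional first argument (a directory) is an assumption made so the
# helper can be pointed at a copy of the files; it defaults to /proc/sys/vm.
show_vm_settings() {
    dir=${1:-/proc/sys/vm}
    for name in min_free_kbytes vfs_cache_pressure swappiness; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            # Older or differently configured kernels may lack a tunable.
            printf '%s = (unavailable)\n' "$name"
        fi
    done
}
```

Running show_vm_settings with no argument reads the live values, so the output can be pasted into a follow-up report.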


--
dap



2004-10-30 17:25:30

by jurriaan

Subject: Re: 2.6.10-rc1 crashes on recursive directory walk [2.6.9 was OK]

From: dap <[email protected]>
Date: Sat, Oct 30, 2004 at 07:08:36PM +0200
>
> I use xfs and ext3 on a large ftp server with lots of files, and
> when I run 'find / -ls' under kernel 2.6.10-rc1, the server crashes
> with no Oops or other message. Only the reset button gets a response. I
> can reproduce it any time with find, but the point of the crash is random;
> it can crash on both xfs and ext3 partitions. 2.6.9 works fine in this
> environment.
>
What are the trailing lines of 'strace find / -ls'?

Those would perhaps help determine what system call is crashing.
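
One possible way to capture those trailing lines is sketched below. The helper names trace_find and last_syscalls and the default log path are hypothetical, chosen only for illustration; the strace options used (-f to follow child processes, -tt for timestamps, -o to write the trace to a file) are standard. Note the assumption that the log survives the crash: with a hard lockup, the last lines written to a local file may never reach the disk.

```shell
# Run the reproducer under strace, writing the trace to a file.
# The default path /var/tmp/find.strace is an assumption, not from the thread.
trace_find() {
    out=${1:-/var/tmp/find.strace}
    strace -f -tt -o "$out" find / -ls > /dev/null
}

# Show the last N lines (default 20) of a saved trace file.
last_syscalls() {
    tail -n "${2:-20}" "$1"
}
```

After a reboot, last_syscalls /var/tmp/find.strace would print whatever part of the trace made it to disk.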

Jurriaan
--
"It shall not happen again, not while I am alive." Nion struck his chest
with his fist. "I have been mild and guileless! I have trusted persons
with suppuration and gangrene for brains."
Jack Vance - Araminta Station
Debian (Unstable) GNU/Linux 2.6.9-mm1 2x6078 bogomips load 0.25

2004-10-30 18:26:44

by dap

Subject: Re: 2.6.10-rc1 crashes on recursive directory walk [2.6.9 was OK]


On Sat, 2004-10-30 at 19:24, Jurriaan wrote:
> From: dap <[email protected]>
> Date: Sat, Oct 30, 2004 at 07:08:36PM +0200
> >
> > I use xfs and ext3 on a large ftp server with lots of files, and
> > when I run 'find / -ls' under kernel 2.6.10-rc1, the server crashes
> > with no Oops or other message. Only the reset button gets a response. I
> > can reproduce it any time with find, but the point of the crash is random;
> > it can crash on both xfs and ext3 partitions. 2.6.9 works fine in this
> > environment.
> >
> What are the trailing lines of 'strace find / -ls'?

The only problem is that it's a production server with large software
raid5 arrays, and lockups can trigger resyncs that lead to significant
performance degradation for days. The users really hate this, so I'll
try to reproduce it on another box this weekend. If I can't, I'll do it
on the production server and send the results.


--
dap


2004-10-31 11:41:10

by Alexander Nyberg

Subject: Re: 2.6.10-rc1 crashes on recursive directory walk [2.6.9 was OK]

> I use xfs and ext3 on a large ftp server with lots of files, and
> when I run 'find / -ls' under kernel 2.6.10-rc1, the server crashes
> with no Oops or other message. Only the reset button gets a response. I
> can reproduce it any time with find, but the point of the crash is random;
> it can crash on both xfs and ext3 partitions. 2.6.9 works fine in this
> environment.
>
>
> vm settings:
>
> echo 16384 > /proc/sys/vm/min_free_kbytes
> echo 28 > /proc/sys/vm/vfs_cache_pressure
> echo 100 > /proc/sys/vm/swappiness
>
> I've tried doubling min_free_kbytes, but it didn't help.
>

Hi,

Please send your .config and dmesg; I can't seem to reproduce this in
my own environment. A bit more description of the setup would also be
nice: so far I've only gathered that one or more raid5 arrays with xfs
and ext3 are involved.

2004-10-31 15:16:24

by dap

Subject: Re: 2.6.10-rc1 crashes on recursive directory walk [2.6.9 was OK]


I also can't reproduce it on another box, so I compiled the -rc1
kernel again with the same config and I'm trying to crash the production
server with that, but it seems I can't.
I'm confused. Maybe this problem was caused by a bad compile due to a
memory error, or maybe the ftp workload just changed and it's a
hard-to-hit bug; I don't know. Anyway, I'm running -rc1 for now and
I'll report if this happens again. Sorry for the premature report.


On Sun, 2004-10-31 at 11:57, Alexander Nyberg wrote:
> Please send your .config and dmesg; I can't seem to reproduce this in
> my own environment. A bit more description of the setup would also be
> nice: so far I've only gathered that one or more raid5 arrays with xfs
> and ext3 are involved.


--
dap


2004-11-05 13:56:09

by dap

Subject: Re: 2.6.10-rc1 crashes on recursive directory walk [2.6.9 was OK]


Hard to hit, but not impossible. :( This time I got an oops:
http://innocence.nightwish.hu/dap/OopsOnFind-2.4.10rc1.jpg

The .config and dmesg are attached. I use a 3rd-party module from
highpoint-tech.com for my hpt1820 card, but this driver has been rock
stable under every other kernel.

The workload is quite complex: slapd, mysql, proftpd (no sendfile
support), apache2 (with sendfile), all of them heavily used, and
occasionally the 'find' that can make the kernel crash.
I'm running 2.6.9 now; I want to confirm that it's really stable and that
the bug was introduced in the -rc1 patch, as I think. If this one crashes
too, I'll write immediately.


EverDream:~# lsmod
Module Size Used by
ipv6 239072 525
ipt_TOS 2880 1
ipt_LOG 7104 1
iptable_mangle 3040 1
tun 7424 1
uhci_hcd 30832 0
ohci_hcd 19812 0
ehci_hcd 27684 0
usbcore 105380 3 uhci_hcd,ohci_hcd,ehci_hcd
intel_agp 20544 1
agpgart 29580 1 intel_agp
quota_v1 4032 3
xfs 585884 2
sata_promise 8356 2
hptmv 215848 40
w83627hf 29480 0
i2c_sensor 3936 1 w83627hf
i2c_isa 2624 0
i2c_i801 8044 0
iptable_filter 3104 1
ip_tables 16992 4 ipt_TOS,ipt_LOG,iptable_mangle,iptable_filter
softdog 5456 1

EverDream:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [multipath]
md7 : active raid5 sdai1[0] sdap1[6] sdan1[5] sdam1[4] sdak1[2] sdaj1[1]
1194849792 blocks level 5, 32k chunk, algorithm 2 [7/6] [UUU_UUU]

md6 : active raid5 sdx1[0] sdah1[10] sdag1[9] sdaf1[8] sdae1[7] sdad1[6] sdac1[5] sdab1[4] sdaa1[3] sdz1[2] sdy1[1]
1991416320 blocks level 5, 32k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]

md1 : active raid5 sdq1[0] sdm1[13] sdb1[12] sdp1[11] sdo1[10] sdn1[9] sdi1[8] sdl1[7] sdk1[6] sdv1[5] sdu1[4] sdt1[3] sds1[2] sdw1[1]
2031747328 blocks level 5, 64k chunk, algorithm 2 [14/14] [UUUUUUUUUUUUUU]
[>....................] resync = 3.7% (5832192/156288256) finish=2384.5min speed=1048K/sec
md2 : active raid1 hdc1[1] hda1[0]
976640 blocks [2/2] [UU]

md4 : active raid1 hdc2[1] hda2[0]
1951808 blocks [2/2] [UU]

md0 : active raid5 sde1[0] sdc1[10] sda1[9] sdar1[8] sdh1[7] sdaq1[6] hdb1[5] sdg1[4] sdj1[3] hdd1[2] sdf1[1]
1200536320 blocks level 5, 128k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
[==>..................] resync = 12.3% (14782080/120053632) finish=847.7min speed=2067K/sec
unused devices: <none>
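
In the mdstat output above, md7 shows [7/6] [UUU_UUU], meaning six of its seven member devices are active. On a box with this many arrays, a small filter can pick out the degraded ones. This is a sketch only: the function name degraded_arrays and its optional file argument are hypothetical, and it assumes the usual layout where the "[total/active]" counter follows the "mdN :" line.

```shell
# List md arrays whose "[total/active]" counter shows missing members.
# Reads /proc/mdstat by default; a saved copy can be passed as $1.
degraded_arrays() {
    awk '
        # Remember the array name from lines such as "md7 : active raid5 ...".
        /^md[0-9]+ :/ { name = $1 }
        # A status line carries a counter such as "[7/6]".
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[1] + 0 != n[2] + 0) print name
        }
    ' "${1:-/proc/mdstat}"
}
```

Applied to the listing above, this should report md7 only, since every other array shows matching counts.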


--
dap


Attachments:
.config (34.57 kB)
dmesg (36.29 kB)