Hallo,
Now I really hope it's the last one; all these rc's are making me mad.
Ok, here it is.
Summary of changes from v2.4.21-rc6 to v2.4.21-rc7
============================================
<[email protected]>:
o [SPARC]: Export phys_base on sparc32
<[email protected]>:
o fix olympic driver build
<[email protected]>:
o Fix Solution Engine 7751 Build
o Define VM_DATA_DEFAULT_FLAGS for SH
<[email protected]>:
o [sparc]: Attempt mul/div emulation handling on all cpus
David S. Miller <[email protected]>:
o [SPARC]: Fix sys_ipc to return ENOSYS instead of EINVAL as appropriate
o [SPARC64]: Implement dump_stack in 2.4.x
o [SPARC64]: Only use power interrupt when button property exists
o [IPV4/IPV6]: Use Jenkins hash for fragment reassembly handling
o [IPV6]: Input full addresses into TCP_SYNQ hash function
o [IPV4]: Add sysctl to control ipfrag_secret_interval
o [SPARC64]: Fix probe error handling in envctrl.c driver
o [SPARC64]: Fix probe error handling in bbc_{envctrl,i2c}.c driver
o [SPARC64]: Fix exploitable holes and bugs in ioctl32 translations
Douglas Gilbert <[email protected]>:
o sg: Fix side effect introduced by last "off by one" fix
Eric Brower <[email protected]>:
o [SPARC]: Refactor AUXIO support
Marcelo Tosatti <[email protected]>:
o Changed EXTRAVERSION to -rc7
Pete Zaitcev <[email protected]>:
o [sparc] Force type in __put_user
o [SPARC]: Fix gcc-3.x builds
Rob Radez <[email protected]>:
o [sparc]: Fix uninitialized spinlock in SRMMU code
o [SPARC]: Kill initialize_secondary, unused
> [[email protected]]
>
> Now I really hope its the last one, all this rc's are making me mad.
Are you quite sure you don't want Alan to get you the updates necessary
for IDE to build as modules for .21 final?
--
Tomas Szepe <[email protected]>
On Tue, 3 Jun 2003, Tomas Szepe wrote:
> > [[email protected]]
> >
> > Now I really hope its the last one, all this rc's are making me mad.
>
> Are you quite sure you don't want Alan to get you the updates necessary
> for IDE to build as modules for .21 final?
Well, I can for sure release -rc8 with that.
I just want this possible -rc8 to be released no later than tonight.
Alan?
Marcelo Tosatti <[email protected]> writes:
> Now I really hope its the last one, all this rc's are making me mad.
i still can't get it to compile for sparc32:
gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -m32 -pipe -mno-fpu -fcall-used-g5 -fcall-used-g7 -nostdinc -iwithprefix include -DKBUILD_BASENAME=ksyms -DEXPORT_SYMTAB -c ksyms.c
/usr/src/linux/include/asm/checksum.h: In function `csum_partial_copy_nocheck':
/usr/src/linux/include/asm/checksum.h:59: error: asm-specifier for variable `d' conflicts with asm clobber list
/usr/src/linux/include/asm/checksum.h:59: error: asm-specifier for variable `l' conflicts with asm clobber list
/usr/src/linux/include/asm/checksum.h: In function `csum_partial_copy_from_user':
/usr/src/linux/include/asm/checksum.h:81: error: asm-specifier for variable `d' conflicts with asm clobber list
/usr/src/linux/include/asm/checksum.h:81: error: asm-specifier for variable `l' conflicts with asm clobber list
/usr/src/linux/include/asm/checksum.h:81: error: asm-specifier for variable `s' conflicts with asm clobber list
/usr/src/linux/include/asm/checksum.h: In function `csum_partial_copy_to_user':
/usr/src/linux/include/asm/checksum.h:108: error: asm-specifier for variable `d' conflicts with asm clobber list
/usr/src/linux/include/asm/checksum.h:108: error: asm-specifier for variable `l' conflicts with asm clobber list
/usr/src/linux/include/asm/checksum.h:108: error: asm-specifier for variable `s' conflicts with asm clobber list
make[3]: *** [ksyms.o] Error 1
make[3]: Leaving directory `/usr/src/linux/kernel'
make[2]: *** [first_rule] Error 2
make[2]: Leaving directory `/usr/src/linux/kernel'
make[1]: *** [_dir_kernel] Error 2
make[1]: Leaving directory `/usr/src/linux'
make: *** [stamp-build] Error 2
not sure when this started. the last kernel i managed to compile was
rc2 (skipped rc3 and rc4, rc5 didn't compile). the last one that will
boot was 2.4.21-pre1. this is on a sun4m Fujitsu TurboSparc.
--alex--
--
| I believe the moment is at hand when, by a paranoiac and active |
| advance of the mind, it will be possible (simultaneously with |
| automatism and other passive states) to systematize confusion |
| and thus to help to discredit completely the world of reality. |
if [ -r System.map ]; then /sbin/depmod -ae -F System.map 2.4.21-rc7; fi
depmod: *** Unresolved symbols in
/lib/modules/2.4.21-rc7/kernel/drivers/net/wan/comx.o
depmod: proc_get_inode
Margit
On Tuesday 03 June 2003 20:45, Margit Schubert-While wrote:
> if [ -r System.map ]; then /sbin/depmod -ae -F System.map 2.4.21-rc7; fi
> depmod: *** Unresolved symbols in
> /lib/modules/2.4.21-rc7/kernel/drivers/net/wan/comx.o
> depmod: proc_get_inode
attached.
hch: I know what you'll say, so don't reply ;-))
ciao, Marc
> > > Now I really hope its the last one, all this rc's are making me mad.
> >
> > Are you quite sure you don't want Alan to get you the updates necessary
> > for IDE to build as modules for .21 final?
>
> Well, I can for sure release -rc8 with that.
>
> I just want this possible -rc8 to be released no later than tonight.
Unfortunately I just committed my test box to production and can't test
Alan's SiImage fixes in rc6-ac2, but if they pan out, please try to
include them in -rc8 as well.
On Tue, Jun 03, 2003 at 11:30:59AM -0700, Alex Romosan wrote:
> Marcelo Tosatti <[email protected]> writes:
>
> > Now I really hope its the last one, all this rc's are making me mad.
>
> i still can't get it to compile for sparc32:
>
> gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -m32 -pipe -mno-fpu -fcall-used-g5 -fcall-used-g7 -nostdinc -iwithprefix include -DKBUILD_BASENAME=ksyms -DEXPORT_SYMTAB -c ksyms.c
> /usr/src/linux/include/asm/checksum.h: In function `csum_partial_copy_nocheck':
> /usr/src/linux/include/asm/checksum.h:59: error: asm-specifier for variable `d' conflicts with asm clobber list
> /usr/src/linux/include/asm/checksum.h:59: error: asm-specifier for variable `l' conflicts with asm clobber list
> /usr/src/linux/include/asm/checksum.h: In function `csum_partial_copy_from_user':
That looks like you either need a different compiler version,
or different binutils version...
Jeff
On Tue, Jun 03, 2003 at 08:50:00PM +0200, Marc-Christian Petersen wrote:
> On Tuesday 03 June 2003 20:45, Margit Schubert-While wrote:
>
> > if [ -r System.map ]; then /sbin/depmod -ae -F System.map 2.4.21-rc7; fi
> > depmod: *** Unresolved symbols in
> > /lib/modules/2.4.21-rc7/kernel/drivers/net/wan/comx.o
> > depmod: proc_get_inode
>
> attached.
>
> hch: I know what you'll say, so don't reply ;-))
So add the message yourself if you don't want me to reply.
For those who haven't heard before: this is _not_ a correct
fix. proc_get_inode is not exported for a reason and the whole
procfs mess in comx needs a rewrite. Given that no one looked
into this over the last three years I guess we should rather
remove the driver..
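For anyone who wants to scan a build log for this class of failure, the depmod report quoted above can be summarised with a few lines of Python. This is only a sketch: the function name unresolved_symbols is made up for illustration, and it assumes the modutils output format quoted in this thread (a "*** Unresolved symbols in" line, the module path on the same or the next line, then one symbol per line). It does nothing about the underlying problem that proc_get_inode is not exported.

```python
def unresolved_symbols(output):
    """Map each module path to the list of symbols depmod could not resolve.

    Assumes the report format quoted in this thread; the module path may be
    on the same line as the "Unresolved symbols in" marker or on the next one.
    """
    marker = "*** Unresolved symbols in"
    result = {}
    current = None
    expect_path = False
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("depmod:"):
            line = line[len("depmod:"):].strip()
        if not line:
            continue
        if expect_path:
            # Previous line ended with the marker; this line is the module path.
            current = line
            result.setdefault(current, [])
            expect_path = False
        elif line.startswith(marker):
            path = line[len(marker):].strip()
            if path:
                current = path
                result.setdefault(current, [])
            else:
                expect_path = True
        elif current is not None:
            # Any other line after a module header names one missing symbol.
            result[current].append(line)
    return result


REPORT = """\
depmod: *** Unresolved symbols in
/lib/modules/2.4.21-rc7/kernel/drivers/net/wan/comx.o
depmod: proc_get_inode
"""

if __name__ == "__main__":
    for module, symbols in unresolved_symbols(REPORT).items():
        print(module, symbols)
```

Run against the report in this thread, it maps comx.o to the single missing symbol proc_get_inode.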
Jeff Garzik <[email protected]> writes:
> On Tue, Jun 03, 2003 at 11:30:59AM -0700, Alex Romosan wrote:
>> Marcelo Tosatti <[email protected]> writes:
>>
>> > Now I really hope its the last one, all this rc's are making me mad.
>>
>> i still can't get it to compile for sparc32:
>>
>> gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -m32 -pipe -mno-fpu -fcall-used-g5 -fcall-used-g7 -nostdinc -iwithprefix include -DKBUILD_BASENAME=ksyms -DEXPORT_SYMTAB -c ksyms.c
>> /usr/src/linux/include/asm/checksum.h: In function `csum_partial_copy_nocheck':
>> /usr/src/linux/include/asm/checksum.h:59: error: asm-specifier for variable `d' conflicts with asm clobber list
>> /usr/src/linux/include/asm/checksum.h:59: error: asm-specifier for variable `l' conflicts with asm clobber list
>> /usr/src/linux/include/asm/checksum.h: In function `csum_partial_copy_from_user':
>
> That looks like you either need a different compiler version,
> or different binutils version...
gcc (GCC) 3.3 (Debian)
GNU ld version 2.14.90.0.4 20030523 Debian GNU/Linux
the same versions work on i386 though...
--alex--
On Tue, Jun 03, 2003 at 12:58:40PM -0700, Alex Romosan wrote:
> Jeff Garzik <[email protected]> writes:
>
> > On Tue, Jun 03, 2003 at 11:30:59AM -0700, Alex Romosan wrote:
> >> Marcelo Tosatti <[email protected]> writes:
> >>
> >> > Now I really hope its the last one, all this rc's are making me mad.
> >>
> >> i still can't get it to compile for sparc32:
> >>
> >> gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -m32 -pipe -mno-fpu -fcall-used-g5 -fcall-used-g7 -nostdinc -iwithprefix include -DKBUILD_BASENAME=ksyms -DEXPORT_SYMTAB -c ksyms.c
> >> /usr/src/linux/include/asm/checksum.h: In function `csum_partial_copy_nocheck':
> >> /usr/src/linux/include/asm/checksum.h:59: error: asm-specifier for variable `d' conflicts with asm clobber list
> >> /usr/src/linux/include/asm/checksum.h:59: error: asm-specifier for variable `l' conflicts with asm clobber list
> >> /usr/src/linux/include/asm/checksum.h: In function `csum_partial_copy_from_user':
> >
> > That looks like you either need a different compiler version,
> > or different binutils version...
>
> gcc (GCC) 3.3 (Debian)
> GNU ld version 2.14.90.0.4 20030523 Debian GNU/Linux
That would do it.
> the same versions work on i386 though...
Yes, but i386 either didn't have the now-invalid clobber lists, or they were
fixed in the -pre portion (as they were on PPC32 as well).
--
Tom Rini
http://gate.crashing.org/~trini/
On Tue, 2003-06-03 at 20:15, [email protected] wrote:
> Unfortunately I just committed my test box to production and can't test
> Alan's SiImage fixes in rc6-ac2, but if they pan out, please try to
> include them in -rc8 as well.
You could add the dma autoenable but the rest should be avoided
On Tue, 2003-06-03 at 13:14, Tom Rini wrote:
> > gcc (GCC) 3.3 (Debian)
> > GNU ld version 2.14.90.0.4 20030523 Debian GNU/Linux
>
> That would do it.
I don't trust anything past gcc-3.2.x on sparc and sparc64.
Use 3.3.x and later at your own peril.
--
David S. Miller <[email protected]>
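The advice above amounts to a simple version gate, which could be sketched in Python as below. The names parse_gcc_version and trusted_for_sparc are hypothetical, and the 3.2 cut-off is taken only from the remark in this thread, not from any documented kernel requirement. The banner string is the one Alex reported.

```python
import re


def parse_gcc_version(banner):
    """Extract (major, minor, patch) from a gcc version banner.

    Takes the first dotted number in the string; a missing patch level
    is treated as 0. Returns None if no version number is found.
    """
    match = re.search(r"\b(\d+)\.(\d+)(?:\.(\d+))?\b", banner)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def trusted_for_sparc(version):
    """True if the compiler is no newer than the 3.2.x series.

    The threshold is an assumption based on the advice in this thread.
    """
    return version[:2] <= (3, 2)


if __name__ == "__main__":
    for banner in ("gcc (GCC) 3.3 (Debian)", "gcc (GCC) 3.2.3"):
        version = parse_gcc_version(banner)
        print(banner, "->", version, trusted_for_sparc(version))
```

A check like this only flags the compiler version; it says nothing about why a given release miscompiles or rejects the sparc headers.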
Hello Dave, thank you for the warning. Now, how about the why,
in layman's terms? TIA, JimL
On Tue, 3 Jun 2003, David S. Miller wrote:
> On Tue, 2003-06-03 at 13:14, Tom Rini wrote:
> > > gcc (GCC) 3.3 (Debian)
> > > GNU ld version 2.14.90.0.4 20030523 Debian GNU/Linux
> > That would do it.
> I don't trust anything past gcc-3.2.x on sparc and sparc64.
> Use 3.3.x and later at your own peril.
--
+------------------------------------------------------------------+
| James W. Laferriere | System Techniques | Give me VMS |
| Network Engineer | P.O. Box 854 | Give me Linux |
| [email protected] | Coudersport PA 16915 | only on AXP |
+------------------------------------------------------------------+
"David S. Miller" <[email protected]> writes:
> On Tue, 2003-06-03 at 13:14, Tom Rini wrote:
>> > gcc (GCC) 3.3 (Debian)
>> > GNU ld version 2.14.90.0.4 20030523 Debian GNU/Linux
>>
>> That would do it.
>
> I don't trust anything past gcc-3.2.x on sparc and sparc64.
> Use 3.3.x and later at your own peril.
recompiled with gcc-3.2.3 and the kernel not only compiled but also
booted. thank you.
--alex--
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi!
Marcelo Tosatti wrote:
> Hallo,
>
> Now I really hope its the last one, all this rc's are making me mad.
>
;-)
So, here's a report on the more positive side...
As I mentioned in some e-mails in the last few days,
I'm currently testing an Asus AP1700-S5 server with
a single Xeon 2.4GHz CPU (FSB533), 512MB RAM and
4x36GB U320SCSI drives (3 of them are assembled as RAID5),
connected via GBit Ethernet to our internal network
root@setup:~ {533} $ lspci
00:00.0 Host bridge: ServerWorks CNB20-HE Host Bridge (rev 31)
00:00.1 Host bridge: ServerWorks CNB20-HE Host Bridge
00:00.2 Host bridge: ServerWorks CNB20-HE Host Bridge
00:02.0 Ethernet controller: Intel Corp. 82540EM Gigabit Ethernet Controller (rev 02)
00:03.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:0f.0 ISA bridge: ServerWorks CSB5 South Bridge (rev 93)
00:0f.1 IDE interface: ServerWorks CSB5 IDE Controller (rev 93)
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 05)
00:0f.3 Host bridge: ServerWorks GCLE Host Bridge
00:10.0 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
00:10.2 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
00:11.0 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
00:11.2 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
02:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 (rev 07)
02:04.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 (rev 07)
03:02.0 Ethernet controller: Intel Corp. 82544GC Gigabit Ethernet Controller (LOM) (rev 02)
root@setup:~ {538} $ uptime
2:05pm up 18:09, 11 users, load average: 8.03, 8.45, 8.15
This system has been running 2.4.21-rc7 for more than 18 hours
now with the following load:
*) an endless loop to create and remove a large file on the
RAID5 (ext3 filesystem):
while true; do time dd if=/dev/zero of=/var/tmp/largefile bs=1M count=2000 ; rm -f /var/tmp/largefile; done
*) some commands to create additional load:
cd /
find . boot/ usr/ tmp/ opt/ var/ -xdev -type f -exec md5sum {} \;
*) NFS copy of a whole 40GB filesystem tree from a Linux NFS server
to the RAID5 (in a loop)
*) the system is also NFS serving a Linux NFS client, which
copies the whole server filesystem into /dev/null
*) Additionally, I have the following programs running:
- Squid (currently used as proxy for our internal web browsers)
- Apache
- jedit (with j2sdk-1.4.1_01)
- StarOffice-5.2
- Mozilla-1.3.1
- and lots of additional programs (shell, sshd, emacs), but
no X server (we are using Linux workstations as X-Terminals)
All in all, there have been more than 190 processes running at any
point in the past 18 hours.
This all produces a permanent load between 7 and 9:
vmstat 1
   procs                        memory      swap           io      system         cpu
 r  b  w    swpd   free   buff   cache   si   so    bi     bo    in    cs  us  sy  id
 0  4  4  111720   3220  11344  423820    0    0     4  18976  4892  4273   2  68  30
 0  4  3  111720   3204  11352  423728   32    0    80  25216  1460  2095   0  15  85
 0  4  3  111716   3332  11352  423364   76    0    92  25796  1432  1895   2  14  84
 0  4  3  111716   3208  11372  423392   48    0   712  26336  1566  2346   4  14  81
 0  6  3  111716   3208  11412  423196  132    0   420  32820  1774  3113  12  19  69
 0  5  3  111716   3376  11440  422340  704    0   924  24444  1570  2811   3  17  79
 6  2  4  111716   2328  11560  423988  536    0   700  32088  2268  4590   6  73  21
11  3  4  111764  63352  11604  321148   16  308   310  36868  2267  5390  12  46  42
root@setup:~ {537} $ uptime
1:37pm up 17:41, 10 users, load average: 7.94, 7.31, 7.18
Under these circumstances, I made the following observations:
a) The system has been running stably for more than 18 hours now
b) It seems to behave quite well, given the load.
Response time for all services (web-proxy, web-server)
is reasonably low (you almost don't notice any delay)
c) Interactive programs (Mozilla, StarOffice, JEdit) are
still quite usable. There is some delay when opening
a file in SO (say, about 2-3 seconds), but that's fine
d) Sometimes (but not really reproducibly) I noticed a
_big_ delay when connecting to the server using SSH
(by "big", I mean 1 minute or so). I eventually
get a connection, and then can work as normal.
e) The server uses a single, but hyperthreaded CPU.
Hyperthreading is enabled, and Linux shows both
logical CPUs:
root@setup:~ {529} $ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.40GHz
stepping : 7
cpu MHz : 2392.169
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 4771.02
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) CPU 2.40GHz
stepping : 7
cpu MHz : 2392.169
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 4771.02
But interrupt distribution seems a little bit strange:
root@setup:~ {530} $ cat /proc/interrupts
           CPU0       CPU1
  0:    6318080          0    IO-APIC-edge  timer
  1:        967          0    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:      32477          0    IO-APIC-edge  serial
  5:   55629300          0   IO-APIC-level  eth0
  9:   85639064          0   IO-APIC-level  acpi, ioc0, ioc1
 11:          0          0   IO-APIC-level  usb-ohci
 15:          2          0    IO-APIC-edge  ide1
NMI:          0          0
LOC:    6318529    6318527
ERR:          0
MIS:          0
With 2.4.21-rc6-ac1, interrupts were counted for both
logical CPUs. Is this a bug or a feature?
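To make the imbalance easier to see, the device interrupt counts can be totalled per logical CPU. The following Python sketch does that for the /proc/interrupts listing above; the function name per_cpu_totals is made up for illustration, and the parser assumes the 2.4-style layout shown here (a CPU header row, numbered IRQ rows, then NMI/LOC/ERR/MIS summary rows, which are skipped).

```python
SAMPLE = """\
           CPU0       CPU1
  0:    6318080          0    IO-APIC-edge  timer
  1:        967          0    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:      32477          0    IO-APIC-edge  serial
  5:   55629300          0   IO-APIC-level  eth0
  9:   85639064          0   IO-APIC-level  acpi, ioc0, ioc1
 11:          0          0   IO-APIC-level  usb-ohci
 15:          2          0    IO-APIC-edge  ide1
NMI:          0          0
LOC:    6318529    6318527
ERR:          0
MIS:          0
"""


def per_cpu_totals(text):
    """Sum the counts of the numbered IRQ rows for each CPU column."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())            # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():     # skip NMI/LOC/ERR/MIS rows
            continue
        for cpu, value in enumerate(rest.split()[:ncpu]):
            totals[cpu] += int(value)
    return totals


if __name__ == "__main__":
    for cpu, total in enumerate(per_cpu_totals(SAMPLE)):
        print("CPU%d: %d" % (cpu, total))
```

For the listing above, every numbered IRQ lands on CPU0 and none on CPU1; only the local timer (LOC) ticks on both.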
HTH
- - andreas
- --
Andreas Haumer | mailto:[email protected]
*x Software + Systeme | http://www.xss.co.at/
Karmarschgasse 51/2/20 | Tel: +43-1-6060114-0
A-1100 Vienna, Austria | Fax: +43-1-6060114-71
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQE+3zMOxJmyeGcXPhERAu6CAKCILyOUfPyGaKG8pvbl4droch6B+ACbBNB/
Dw1L/tRv2JSrOHA12B8BaHM=
=rWPF
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi!
Andreas Haumer wrote:
> Hi!
>
> Marcelo Tosatti wrote:
>
>>Hallo,
>>
>>Now I really hope its the last one, all this rc's are making me mad.
>>
>
> ;-)
>
> So, here's a report on the more positive side...
>
I think I have to take that back... :-((
> As I mentioned in some e-mails in the last few days,
> I'm currently testing an Asus AP1700-S5 server with
> a single Xeon 2.4GHz CPU (FSB533), 512MB RAM and
> 4x36GB U320SCSI drives (3 of them are assembled as RAID5),
> connected via GBit Ethernet to our internal network
>
I had this system running under heavy load for about 24 hours
without problems. I then stopped the stress testing, and had
several system freezes since then.
By system freeze I mean:
*) machine doesn't answer to ping, no reaction to console
keyboard, no message on the console screen, no message
in logfile, no oops, no noticeable system activity
I changed several BIOS settings (disabled hyperthreading,
disabled USB, disabled power management) and tried to run
the kernel with "acpi=off" and "noapic".
I also changed root disk, because I found a SCSI error
message in the logs once.
Nothing seems to help. The system just freezes under light
load at some time between 1 and 8 hours uptime.
It's really strange that it survived heavy load for
more than 24 hours in the first place.
I found some problem reports from several people,
which sound quite similar to the freeze I see here.
These people all had motherboards with serverworks
chipset, GBit ethernet and noticed similar lockups
or system freeze symptoms. From the reports I'm not
sure if the problems still persist or if they should
be solved now. Can someone please comment on that?
Here is some info from the system again:
root@server:~ {505} $ cat /proc/interrupts
           CPU0
  0:     118748    IO-APIC-edge  timer
  1:        274    IO-APIC-edge  keyboard
  2:          0          XT-PIC  cascade
  4:       7011    IO-APIC-edge  serial
  9:    1181037   IO-APIC-level  ioc0, ioc1
 14:       1685   IO-APIC-level  eth0
 15:          2    IO-APIC-edge  ide1
NMI:          0
LOC:     118700
ERR:          0
MIS:          0
root@server:~ {506} $ cat /proc/cmdline
auto BOOT_IMAGE=lx2421rc7 ro root=100 acpi=off
root@server:~ {507} $ uname -a
Linux server 2.4.21-rc7 #1 SMP Wed Jun 4 18:31:15 CEST 2003 i686 unknown
root@server:~ {508} $ lsmod
Module                  Size  Used by    Not tainted
af_packet              13256   1  (autoclean)
e1000                  50028   1  (autoclean)
ext3                   60832   2  (autoclean)
jbd                    40056   2  (autoclean) [ext3]
raid5                  17704   1  (autoclean)
md                     57472   2  (autoclean) [raid5]
xor                     8868   0  (autoclean) [raid5]
unix                   15664  38  (autoclean)
ext2                   33440   4  (autoclean)
sd_mod                 10652  18  (autoclean)
isense                 32404   0  (autoclean) (unused)
mptctl                 19116   0  (autoclean) (unused)
mptscsih               29696   9  (autoclean)
mptbase                32640   5  (autoclean) [isense mptctl mptscsih]
scsi_mod               95748   2  (autoclean) [sd_mod mptscsih]
root@server:~ {511} $ lspci -vvvv
00:00.0 Host bridge: ServerWorks CNB20-HE Host Bridge (rev 31)
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
00:00.1 Host bridge: ServerWorks CNB20-HE Host Bridge
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
00:00.2 Host bridge: ServerWorks CNB20-HE Host Bridge
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
00:02.0 Ethernet controller: Intel Corp. 82540EM Gigabit Ethernet Controller (rev 02)
Subsystem: Intel Corp. 82540EM Gigabit Ethernet Controller
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 32 (63750ns min), cache line size 08
Interrupt: pin A routed to IRQ 14
Region 0: Memory at fd800000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at d800 [size=64]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [e4] PCI-X non-bridge device.
Command: DPERE- ERO+ RBC=0 OST=0
Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000
00:03.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27) (prog-if 00 [VGA])
Subsystem: ATI Technologies Inc: Unknown device 8008
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 32 (2000ns min), cache line size 08
Interrupt: pin A routed to IRQ 10
Region 0: Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
Region 1: I/O ports at d400 [size=256]
Region 2: Memory at fb800000 (32-bit, non-prefetchable) [size=4K]
Expansion ROM at febe0000 [disabled] [size=128K]
Capabilities: [5c] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00:0f.0 ISA bridge: ServerWorks CSB5 South Bridge (rev 93)
Subsystem: ServerWorks CSB5 South Bridge
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
Latency: 32
00:0f.1 IDE interface: ServerWorks CSB5 IDE Controller (rev 93) (prog-if 88 [Master SecP])
Subsystem: ServerWorks CSB5 IDE Controller
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 32, cache line size 08
Region 0: I/O ports at <ignored>
Region 1: I/O ports at <ignored>
Region 2: I/O ports at <ignored>
Region 3: I/O ports at <ignored>
Region 4: I/O ports at a800 [size=16]
00:0f.3 Host bridge: ServerWorks GCLE Host Bridge
Subsystem: ServerWorks: Unknown device 0230
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
00:10.0 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr+ DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
Capabilities: [60]
00:10.2 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
Capabilities: [60]
00:11.0 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr+ DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
Capabilities: [60]
00:11.2 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr+ DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
Capabilities: [60]
02:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 (rev 07)
Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 72 (4250ns min, 4500ns max), cache line size 08
Interrupt: pin A routed to IRQ 9
Region 0: I/O ports at a000 [size=256]
Region 1: Memory at fa000000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at f9800000 (64-bit, non-prefetchable) [size=64K]
Expansion ROM at fe900000 [disabled] [size=1M]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000
Capabilities: [68]
02:04.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 (rev 07)
Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 72 (4250ns min, 4500ns max), cache line size 08
Interrupt: pin B routed to IRQ 9
Region 0: I/O ports at 9800 [size=256]
Region 1: Memory at f9000000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at f8800000 (64-bit, non-prefetchable) [size=64K]
Expansion ROM at fe800000 [disabled] [size=1M]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000
Capabilities: [68]
03:02.0 Ethernet controller: Intel Corp. 82544GC Gigabit Ethernet Controller (LOM) (rev 02)
Subsystem: Intel Corp.: Unknown device 110d
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 32 (63750ns min), cache line size 08
Interrupt: pin A routed to IRQ 5
Region 0: Memory at f8000000 (64-bit, non-prefetchable) [size=128K]
Region 2: Memory at f7800000 (64-bit, non-prefetchable) [size=128K]
Region 4: I/O ports at 9400 [size=32]
Expansion ROM at fe7e0000 [disabled] [size=128K]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [e4] PCI-X non-bridge device.
Command: DPERE- ERO+ RBC=0 OST=0
Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000
Any idea how I should proceed now?
I really could use some help here, I'm running out
of ideas... :-((
- - andreas
- --
Andreas Haumer | mailto:[email protected]
*x Software + Systeme | http://www.xss.co.at/
Karmarschgasse 51/2/20 | Tel: +43-1-6060114-0
A-1100 Vienna, Austria | Fax: +43-1-6060114-71
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQE+4gjsxJmyeGcXPhERAsT4AJ9sylkxso5kXO51+6c5bfskVV2meACgrF33
t8xXYpu6FGPsiQ9VBmnk6ek=
=Yov+
-----END PGP SIGNATURE-----
> Now I really hope its the last one, all this
> rc's are making me mad.
We still have ide problems, and I don't see any potential fixes
for that in the changelog between -rc6 and -rc7.
I tried -rc6 on a whim and had hda report a timeout (dma, I think,
but the message went by kind of quick), then the big freeze with
the disk light stuck on. That never happened in 6 months on the
same hardware running 2.4.19-rc2 (with glibc-2.2.5, gcc-2.95.3,
binutils-2.12.90.0.9, all ext2 filesystems).
I recompiled with all kernel debugging options enabled and
disabled partition statistics, since that was the one thing that
was obviously new about the enabled ide options (I didn't select
any other new options, but of course the kernel code underneath
is probably different, so one could not conclude anything from
such meager testing). It ran for about 8 hours without freezing,
with that drive doing a lot more work than it was doing when it
livelocked.
e2fsck reported errors on the next reboot, though, and it's been
rebooted into 2.4.19-rc2 to get some other work done with it
since then (caching the source for an upgrade of a 2.2.x box,
different libc, yada yada, needs to be reliable until that is
finished).
SiS530/5513, k6-II/450, udma33 Maxtor drive that 2.4.19-rc2 has no problems with.
You can release a 2.4.21 anyway, of course, but without finding
out where the ide livelock (and other big freezes, thinking of
the report on the all-scsi system already posted) originates,
calling it "stable" would be a bit fanciful.
(2.4.19-rc2 has its own quirks, of course, but not "single-threaded
ide livelock with this chipset and ide drive". I can reliably kill
it with 32 threads depth-first scanning different directory trees
on that same disk in parallel, unfortunately without an oops to
show for it. It is not running out of memory (no ENOMEM reports),
merely some mundane race condition or missing lock or whatever.
Change it to 32 forks running in parallel, and they finish
normally, though of course not all that quickly while
seek-thrashing one and the same disk between them.)
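The access pattern described above (many threads, each walking a different directory tree on the same disk) can be approximated from user space with a short Python sketch. This is not the test program used for the report, which was not posted; the function names scan_tree and scan_parallel and the default of 32 workers are illustrative assumptions, and whether it reproduces the hang on any given kernel is untested.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def scan_tree(root):
    """Walk one directory tree top-down and return the number of files seen."""
    count = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        count += len(filenames)
    return count


def scan_parallel(roots, workers=32):
    """Scan each root in its own worker thread; return {root: file_count}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(scan_tree, roots))
    return dict(zip(roots, counts))


if __name__ == "__main__":
    import sys

    # Pass the directory trees to scan on the command line.
    for root, count in scan_parallel(sys.argv[1:]).items():
        print(root, count)
```

The roots should be distinct trees on the disk under test. The fork-based variant mentioned above would use separate processes instead of a thread pool.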
Not what you wanted to hear, right? Oh well.
(Better to find out sooner than release 2.4.21-stable and watch
52 different bug reports on it arrive at the list the next day.)
Regards,
Clayton Weaver
<mailto: [email protected]>
Hi !
[ first, please fix your mailer and cut your lines, it's not easy to quote you in replies ]
On Sun, Jun 08, 2003 at 03:54:48AM -0500, Clayton Weaver wrote:
> > Now I really hope its the last one, all this
> > rc's are making me mad.
>
> We still have ide problems, and I don't see any
> potential fixes for that in the changelog between -rc6 and -rc7.
>
> I tried -rc6 on a whim and had hda report
> a timeout (dma, I think, but the message went by kind of quick), then the big freeze with the
> disk light stuck on, Never happened in 6 months on the same hardware running
> 2.4.19-rc2 (with glibc-2.2.5, gcc-2.95.3, binutils-2.12.90.0.9, all ext2 filesystems).
Did you try with "ide0=nodma", or other similar options ?
> SiS530/5513, k6-II/450, udma33 Maxtor drive that 2.4.19-rc2 has no problems with.
That's not exactly what you said below. You said that you could reliably kill it with 32 threads...
Perhaps you have broken hardware, and 2.4.21 stresses it more than 2.4.19-rc2. Perhaps it's
really an old driver bug, in which case reporting it when you first encountered it would have been
more constructive than telling us at 2.4.21 time that it dies even more easily than a one-year-old
2.4.19-rc2.
> You can release a 2.4.21 anyway, of course, but without finding out where the ide livelock (and other big freezes, thinking of the report on the all-scsi system already posted) originates, calling it "stable" would be a bit fanciful.
That's what -pre and -rc are for : bug reports. The ide code has been included in 2.4.21-pre1,
several months ago. There's always a risk of breaking someone's setup, but obviously, if people
don't try pre-releases and don't report problems in time, how could they hope to get a stable
kernel on their hardware ?
> Not what you wanted to hear, right? Oh well.
>
> (Better to find out sooner than release
> 2.4.21-stable and watch 52 different bug reports on it arrive at the list the next day.)
Well, look through the archives, there have been two patches by Lionel Bouton and Vojtech Pavlik
posted in May for the 5513 driver, to support newer chipsets. I don't know if they have been
included, nor if they also fixed old bugs. Perhaps you'll be interested in checking them.
BTW, someone reported yesterday that his 5513 worked flawlessly in 2.4.20, but behaved like yours
on 2.5.70. Have you tested 2.4.20, or better, have you tried to narrow the problem down to a
particular version (but I bet it will be tied to the introduction of the newer IDE code)?
You may also try -ac kernels which have more recent, but less tested code.
Regards,
Willy
----- Original Message -----
From: Willy Tarreau <[email protected]>
Date: Sun, 8 Jun 2003 11:47:29 +0200
To: Clayton Weaver <[email protected]>
Subject: Re: Linux 2.4.21-rc7
> Hi !
Greets.
> [ first, please fix your mailer and cut your lines, it's not easy to quote you in replies ]
Long lines?
email.com is a web mailer. If it is failing
to wrap where I put newlines, I'll see what I
can do.
> On Sun, Jun 08, 2003 at 03:54:48AM -0500, Clayton Weaver wrote:
> > > Now I really hope its the last one, all this
> > > rc's are making me mad.
> > We still have ide problems, and I don't see any
> > potential fixes for that in the changelog between -rc6 and -rc7.
> >
> > I tried -rc6 on a whim and had hda report
> > a timeout (dma, I think, but the message went by kind of quick), then the big freeze with the
> > disk light stuck on, Never happened in 6 months on the same hardware running
> > 2.4.19-rc2 (with glibc-2.2.5, gcc-2.95.3, binutils-2.12.90.0.9, all ext2 filesystems).
> Did you try with "ide0=nodma", or other similar options ?
No.
Note that "nodma" is unnecessary on this
same box running kernel 2.4.19-rc2. Why would
2.4.21-rcX need it? To pin down whether the
problem is in the ide dma code or some other
part of the ide code?
> > SiS530/5513, k6-II/450, udma33 Maxtor drive that 2.4.19-rc2 has no problems with.
Here is the data on the drive from hdparm
while running under 2.4.19-rc2. rc.local
executes "hdparm -c 1 /dev/hda" at boot.
hdparm -v:
/dev/hda:
multcount = 16 (on)
IO_support = 1 (32-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 1655/255/63, sectors = 26588016, start = 0
hdparm -i:
/dev/hda:
Model=Maxtor 91360U4, FwRev=MA540RR0, SerialNo=C40LMAFC
Config={ Fixed }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=57
BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=26588016
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 *udma2
AdvancedPM=yes: disabled (255) WriteCache=enabled
Drive conforms to: ATA/ATAPI-4 T13 1153D revision 17: 1 2 3 4 5
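To compare drive settings between two kernels (say, the same box booted into 2.4.19-rc2 and into 2.4.21-rcX), the "name = value (note)" lines of the hdparm -v output above can be parsed and diffed. The following is only a sketch: it assumes Python, the function names are invented for illustration, and it deliberately ignores lines such as "geometry" that do not follow the simple pattern.

```python
import re

# Matches lines such as " using_dma    =  1 (on)".
_SETTING_RE = re.compile(r"\s*(\w+)\s*=\s*(\d+)\s*\((.+?)\)\s*$")


def parse_hdparm_settings(text):
    """Return {name: (value, note)} for each simple setting line in the text."""
    settings = {}
    for line in text.splitlines():
        match = _SETTING_RE.match(line)
        if match:
            name, value, note = match.groups()
            settings[name] = (int(value), note)
    return settings


def diff_settings(old, new):
    """Return {name: (old_value, new_value)} for every setting that differs."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}
```

Feeding it the output captured under each kernel would show at a glance whether, for example, using_dma differs between the two boots.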
> That's not exactly what you said below. You said that you could reliably kill it with 32 threads...
> Perhaps you have broken hardware, and 2.4.21 stresses it more than 2.4.19-rc2. Perhaps it's
> really an old driver bug, in which case reporting it when you first encountered it would have been
> more constructive than telling us at 2.4.21 time that it dies even more easily than a one-year-old
> 2.4.19-rc2.
It does not die more easily with 2.4.19-rc2
(in my opinion). It dies in a threads context
but not in a forks context, where the threads
and the forks are doing the same i/o to/from
the same controller/disk (different versions
of same program).
I have also seen it freeze with an unlucky
mouse click in XFree86 4.0 under 2.4.19-rc2,
so I did not assume that the threads hang
was necessarily ide-relevant. Something
disk i/o intensive was merely what it
happened to be doing with those threads,
but that problem seemed to me more thread
related than ide related. (Guess I'll have
to spawn a bunch of threads doing some other
kind of i/o to test that assumption.)
[]
> > (Better to find out sooner than release
> > 2.4.21-stable and watch 52 different bug reports on it arrive at the list the next day.)
> Well, look through the archives, there have been two patches by Lionel Bouton and Vojtech Pavlik
> posted in May for the 5513 driver, to support newer chipsets. I don't know if they have been
> included, nor if they also fixed old bugs. Perhaps you'll be interested in checking them.
(SiS530 is not newer, k6-II era, but it
is worth a look anyway.)
The SiS5513 driver seems fine. You can
hammer on it all day with this motherboard
with gcc, multiple smb mounts, gigabyte ftp or
sftp transfers, etc, in parallel, and no blinks from the hard drive (modulo threads or the X-server under 2.4.19-rc2).
(Why 2.4.19-rc2? It mostly works, ie it is
stable for what I typically use that box
for. Someone running a different application
mix or different hardware might consider it useless crap. It has the lcall fix and a
few other minor bug fixes that were posted
to the kernel list between then and now.)
> BTW, someone reported yesterday that his 5513 worked flawlessly in 2.4.20, but behaved like yours
> on 2.5.70. Have you tested 2.4.20, or better, have you tried to narrow the problem down to a
> particular version (but I bet it will be tied to the introduction of the newer IDE code)?
No. (I do actually need this thing to work at
times.) The newer ide code as the source of the
problem matches my hunch. Maybe the kernel
debugging that I enabled at compile time will
come up with something (*before* the
deadlock, so it can actually log an anomaly).
The newer ide code may have found a bug in
the SiS5513 driver that the old code did
not exercise. Let us hope not, because then
a fix only fixes it for me and other users
of that driver, while lots of people with
other kinds of ide hardware seem to be
reporting similar problems.
My guess is that the problems are upstream
of any specific driver, but that is merely
a hunch. (It is possible that they all do
the same wrong thing *in the drivers*.)
> You may also try -ac kernels which have more recent, but less tested code.
> Regards,
> Willy
Thanks for the insight.
Regards,
Clayton Weaver
<mailto: [email protected]>
Please stop comparing 2.4.19-rc2 to 2.4.21-rc7.
Just go through 2.4.20-pre/-rc and 2.4.21-pre/-rc
and find when things broke if you want them fixed.
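
Going through the releases does not have to mean testing every one. Assuming the releases are in order, the oldest is good, the newest is bad, and the breakage persists once introduced, a binary search finds the first bad one in about log2(N) boots. A sketch in Python follows; first_bad and is_bad are hypothetical names, and the version list contains only versions mentioned in this thread, not the full set of -pre/-rc releases.

```python
def first_bad(versions, is_bad):
    """Return the first bad version in an ordered list.

    Assumes versions[0] is known good, versions[-1] is known bad, and
    that every version after the first bad one is bad as well.
    is_bad(version) is supplied by the caller, e.g. "boot it, run the
    test load, report whether it froze".
    """
    if len(versions) < 2:
        raise ValueError("need at least one good and one bad version")
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid
        else:
            good = mid
    return versions[bad]


# Only the versions named in this thread; a real search would list
# every 2.4.20 and 2.4.21 -pre/-rc release in between.
VERSIONS_FROM_THREAD = [
    "2.4.19-rc2", "2.4.20", "2.4.21-pre1", "2.4.21-rc6", "2.4.21-rc7",
]
```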
--
Bartlomiej
> Note that "nodma" is unnecessary on this
> same box running kernel 2.4.19-rc2. Why would
> 2.4.21-rcX need it? To pin down whether the
> problem is in the ide dma code or some other
> part of the ide code?
exactly, because DMA needs more conditions than PIO to run at all
and even more to run reliably. There are lots of cases where DMA
doesn't work while PIO does.
> It does not die more easily with 2.4.19-rc2
> (in my opinion). It dies in a threads context
> but not in a forks context, where the threads
> and the forks are doing the same i/o to/from
> the same controller/disk (different versions
> of same program).
>
> I have also seen it freeze with an unlucky
> mouse click in XFree86 4.0 under 2.4.19-rc2,
> so I did not assume that the threads hang
> was necessarily ide-relevant. Something
> disk i/o intensive was merely what it
> happened to be doing with those threads,
> but that problem seemed to me more thread
> related than ide related. (Guess I'll have
> to spawn a bunch of threads doing some other
> kind of i/o to test that assumption.)
OK, but a freeze isn't acceptable anyway, whatever you were doing,
because it always means a bug somewhere.
Cheers,
Willy
PS: your lines were shorter this way :-)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi!
Note: I'm reporting this with a different subject line now,
as I got zero replies to my first bugreport. This is still
the same Asus AP1700-S5 server as in my previous reports,
though:
Asus AP1700-S5 server, single Xeon 2.4GHz CPU (FSB533)
512MB registered DDR with ECC, Asus PR-DLS533 motherboard
with ServerWorks GCLE chipset
root@server:~ {535} $ lspci
00:00.0 Host bridge: ServerWorks CNB20-HE Host Bridge (rev 31)
00:00.1 Host bridge: ServerWorks CNB20-HE Host Bridge
00:00.2 Host bridge: ServerWorks CNB20-HE Host Bridge
00:03.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:0f.0 ISA bridge: ServerWorks CSB5 South Bridge (rev 93)
00:0f.1 IDE interface: ServerWorks CSB5 IDE Controller (rev 93)
00:0f.3 Host bridge: ServerWorks GCLE Host Bridge
00:10.0 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
00:10.2 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
00:11.0 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
00:11.2 Host bridge: ServerWorks: Unknown device 0101 (rev 03)
01:02.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 74)
02:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 (rev 07)
02:04.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 (rev 07)
Andreas Haumer wrote:
[...]
> I had this system running under heavy load for about 24 hours
> without problems. I then stopped the stress testing, and had
> several system freezes since then.
>
> With system freeze I mean:
>
> *) machine doesn't answer to ping, no reaction to console
> keyboard, no message on the console screen, no message
> in logfile, no oops, no noticeable system activity
>
I just had another freeze or lockup of this system,
after 1 day and 14 hours uptime. :-(
This time the machine was running with a 3Com 3c905c
100MBit NIC, with the onboard e1000 GBit controllers disabled.
Obviously, this didn't help either...
When I noticed the freeze, I tried to ping the server,
and got a few replies back, but with a delay of more than
60 seconds! I didn't wait that long when I tried to ping
the server on the previous lockups, so maybe the "no answer
to ping" symptom I described is more a "big delay in
answering ping packets" symptom. Does that ring any bell?
Any idea anyone?
- - andreas
- --
Andreas Haumer | mailto:[email protected]
*x Software + Systeme | http://www.xss.co.at/
Karmarschgasse 51/2/20 | Tel: +43-1-6060114-0
A-1100 Vienna, Austria | Fax: +43-1-6060114-71
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQE+5F6HxJmyeGcXPhERApOfAJ4klAsR0lA8Zzk5s22quImzxud6agCgvAi1
FXZuNQV3C4UaKVi9gOvtJFM=
=qL4B
-----END PGP SIGNATURE-----
Hello Andreas,
I am not quite sure if you are experiencing something similar to my problem.
Fact is this:
I have a serverworks based dual PIII board and I am experiencing freezes just
about every day.
Equal setups:
Kernel 2.4.21-rc7
00:00.0 Host bridge: ServerWorks CNB20HE Host Bridge (me: rev 23 you: rev 31)
00:00.1 Host bridge: ServerWorks CNB20HE Host Bridge (rev 01)
Lockups during light load
Differing:
Just about everything else:
                yours:        mine:
Storage System: Symbios       AIC
VGA           : ATI Rage XL   ATI Radeon RV200
Network       : Intel/3com    Intel/Broadcom
Processor     : Xeon UP       PIII SMP
I could already produce oops-messages on the problem and mine all come up in
kmem_cache_alloc_batch. It would be interesting where your box freezes. It
cannot be at this same place, because the code is not there in UP.
Try this (in case you are not working in front of the box):
Start box and switch to text console, enter "setterm -blank 0" to disable
screen blanker. Wait for oops. If we are lucky you will see something, get a
pencil then :-)
--
Regards,
Stephan
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi!
Many thanks for your reply!
Stephan von Krawczynski wrote:
> Hello Andreas,
>
> I am not quite sure if you are experiencing something similar to my problem.
> Fact is this:
>
> I have a serverworks based dual PIII board and I am experiencing freezes just
> about every day.
>
> Equal setups:
>
> Kernel 2.4.21-rc7
> 00:00.0 Host bridge: ServerWorks CNB20HE Host Bridge (me: rev 23 you: rev 31)
> 00:00.1 Host bridge: ServerWorks CNB20HE Host Bridge (rev 01)
>
> Lockups during light load
>
Me too.
I had it running for 24 hours with heavy stress testing
and a load above 7 all the time without problems. I then
stopped this test, and the box locked up 2 hours later,
and locked up about 7 or 8 times in the past few days :-(
>
> Differing:
>
> Just about everything else:
>                 yours:        mine:
> Storage System: Symbios       AIC
This is not a "normal" symbios logic "sym53c8xx"
storage controller, but a "Symbios Logic 53c1030",
which uses the Fusion MPT driver. This is the first
time I'm running this driver, so I don't know if it's
considered stable (but I guess so)
Unfortunately I can't replace it as I don't have any
spare SCSI controller which fits right now.
> VGA           : ATI Rage XL   ATI Radeon RV200
> Network       : Intel/3com    Intel/Broadcom
> Processor     : Xeon UP       PIII SMP
>
>
> I could already produce oops-messages on the problem and mine all come up in
> kmem_cache_alloc_batch. It would be interesting where your box freezes. It
> cannot be at this same place, because the code is not there in UP.
> Try this (in case you are not working in front of the box):
>
> Start box and switch to text console, enter "setterm -blank 0" to disable
> screen blanker. Wait for oops. If we are lucky you will see something, get a
> pencil then :-)
>
I always have the system running with text console and
screen blanking disabled. Alas, I see no oops :-(
IMHO it doesn't look like the kernel crashes with an oops,
it does look more like it suddenly goes into an endless
loop or ridiculously high load somehow.
Last time I had this freeze, I noticed that the system
answered my ICMP ping messages with a delay of more than
60 seconds. This looked like the system was very busy
at that time.
I'm now running with 2.4.20rc2, and also have syslog
routed to another system on the network. We'll see if
I can get any more information out of this.
- - andreas
- --
Andreas Haumer | mailto:[email protected]
*x Software + Systeme | http://www.xss.co.at/
Karmarschgasse 51/2/20 | Tel: +43-1-6060114-0
A-1100 Vienna, Austria | Fax: +43-1-6060114-71
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQE+5HvjxJmyeGcXPhERAvOvAJ94cQS4tlzylHiVU084v7FK/e/aowCgw4w9
M3YWSHXzx9IuKeU4Z6WicEk=
=8102
-----END PGP SIGNATURE-----
On Sat, 7 Jun 2003, Andreas Haumer wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi!
>
> Andreas Haumer wrote:
> > Hi!
> >
> > Marcelo Tosatti wrote:
> >
> >>Hallo,
> >>
> >>Now I really hope its the last one, all this rc's are making me mad.
> >>
> >
> > ;-)
> >
> > So, here's a report on the more positive side...
> >
> I think, I have to take that back... :-((
>
> > As I mentioned in some e-mails in the last few days,
> > I'm currently testing an Asus AP1700-S5 server with
> > a single Xeon 2.4GHz CPU (FSB533), 512MB RAM and
> > 4x36GB U320SCSI drives (3 of them are assembled as RAID5),
> > connected via GBit Ethernet to our internal network
> >
> I had this system running under heavy load for about 24 hours
> without problems. I then stopped the stress testing, and had
> several system freezes since then.
>
> With system freeze I mean:
>
> *) machine doesn't answer to ping, no reaction to console
> keyboard, no message on the console screen, no message
> in logfile, no oops, no noticeable system activity
Maybe the NMI oopser helps?
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi!
Anders Karlsson wrote:
> On Wed, 2003-06-11 at 21:48, Marcelo Tosatti wrote:
>
>>On Sat, 7 Jun 2003, Andreas Haumer wrote:
>
> [snip]
>
>>>I had this system running under heavy load for about 24 hours
>>>without problems. I then stopped the stress testing, and had
>>>several system freezes since then.
>>>
>>>With system freeze I mean:
>>>
>>>*) machine doesn't answer to ping, no reaction to console
>>> keyboard, no message on the console screen, no message
>>> in logfile, no oops, no noticeable system activity
>
>
> I have this problem without actually stressing the machine too hard. The
> average load on my Thinkpad over a weekend would perhaps be 0.05, yet I
> can have several hard hangs where there seems to be no trace of a hang
> at all in logfiles.
>
I have to admit that "system freeze" is a quite unspecific
symptom. It could have a zillion different reasons.
In my case I'm currently chasing SCSI errors which I think
could have something to do with it (besides, it's _not_ an Adaptec
controller, but a LSI 53c1030 with Fusion MPT driver... :-)
In my server logs I sometimes see SCSI timeouts like this:
[...]
scsi : aborting command due to timeout : pid 1148093, scsi0, channel 0, id 1, lun 0 Read (10) 00 00 00 0f af 00 00 10 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=dfca8e00)
IOs outstanding = 3
mptscsih: ioc0: Issue of TaskMgmt Successful!
SCSI host 0 abort (pid 1148093) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
mptscsih: OldReset scheduling BUS_RESET (sc=dfca8e00)
IOs outstanding = 4
SCSI Error Report =-=-= (0:0:0)
SCSI_Status=02h (CHECK CONDITION)
Original_CDB[]: 2A 00 00 3C 4D 78 00 00 02 00 - "WRITE(10)"
SenseData[20h]: 70 00 06 00 00 00 00 18 00 00 00 00 29 02 00 00 00 00 ...
SenseKey=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
SCSI Error Report =-=-= (0:1:0)
SCSI_Status=02h (CHECK CONDITION)
Original_CDB[]: 28 00 00 00 0F AF 00 00 10 00 - "READ(10)"
SenseData[20h]: 70 00 06 00 00 00 00 18 00 00 00 00 29 02 00 00 00 00 ...
SenseKey=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
SCSI Error Report =-=-= (0:2:0)
SCSI_Status=02h (CHECK CONDITION)
Original_CDB[]: 28 00 00 4E 0A 37 00 00 08 00 - "READ(10)"
SenseData[20h]: 70 00 06 00 00 00 00 18 00 00 00 00 29 02 00 00 00 00 ...
SenseKey=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
SCSI Error Report =-=-= (0:3:0)
SCSI_Status=02h (CHECK CONDITION)
Original_CDB[]: 28 00 03 B0 08 6F 00 00 08 00 - "READ(10)"
SenseData[20h]: 70 00 06 00 00 00 00 18 00 00 00 00 29 02 00 00 00 00 ...
SenseKey=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
[...]
There are 4 hot swap SCSI disks in the server, and all of them
eventually report those timeouts (so it's not specific to a single
disk)
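
For reference, the raw bytes in the log above can be decoded by hand. The sketch below assumes Python and the standard fixed-format sense layout (response code 70h/71h in byte 0, sense key in the low nibble of byte 2, ASC/ASCQ in bytes 12 and 13) and the usual 10-byte READ/WRITE CDB layout (LBA in bytes 2 to 5, transfer length in bytes 7 and 8). The function names are made up for illustration and the sense key table is only partial.

```python
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
    0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION", 0xB: "ABORTED COMMAND",
}


def parse_hex_bytes(text):
    """Turn a string like '70 00 06' into a list of integers."""
    return [int(token, 16) for token in text.split()]


def decode_fixed_sense(sense):
    """Decode sense key and ASC/ASCQ from fixed-format sense data."""
    if (sense[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = sense[2] & 0x0F
    return {
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "unknown"),
        "asc": sense[12],
        "ascq": sense[13],
    }


def decode_rw10_cdb(cdb):
    """Decode opcode, start LBA and block count of a READ(10)/WRITE(10) CDB."""
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    return {
        "op": opcodes.get(cdb[0], "opcode %02Xh" % cdb[0]),
        "lba": int.from_bytes(bytes(cdb[2:6]), "big"),
        "blocks": int.from_bytes(bytes(cdb[7:9]), "big"),
    }
```

Applied to the (0:1:0) report, this gives sense key 6h with ASC/ASCQ 29h/02h, matching what the driver printed, and shows that the timed-out READ(10) was for 16 blocks starting at LBA 4015.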
I already replaced cabling, tried a different hot swap (SCA)
cage, and I'm now trying to replace the disks one by one to
eventually find the culprit.
There are two problems with this approach:
1.) After each change I have to wait several hours up to two
days for a SCSI timeout to occur as I can not reproduce
the problem at will.
2.) I'm not _sure_ if those SCSI timeouts are related to the server
freeze symptoms I see. It's just an assumption.
IMHO it could work as follows: SCSI timeouts occur sometimes.
The driver then aborts the command and resets the SCSI bus
to get it into a sane state again. But what if the bus reset
doesn't work as expected and the bus remains unusable for a
while? Could this bring the whole system into this "freeze"
state (the system is still running, but everything waits for
the SCSI bus to recover)? Could this explain the symptom of
those big delays of ICMP ping answer messages I saw?
So the most precious resource for chasing this problem is time,
and this is also the resource which I don't have available as
much as I'd like to... :-(
>
>>Maybe the NMI oopser helps?
>
>
> Marcelo, where can I get hold of this and would there be documentation
> included with it for how to install/use it?
>
Look at /usr/src/linux/Documentation/nmi_watchdog.txt
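
As far as I recall, that file describes booting with the nmi_watchdog= parameter and then checking that the NMI count in /proc/interrupts keeps going up; please verify against the file itself. Under that assumption, a small Python sketch for comparing two snapshots of /proc/interrupts taken some seconds apart could look like this (nmi_counts and watchdog_ticking are invented names):

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.partition(":")
        if label.strip() == "NMI":
            # Keep only the numeric columns; newer kernels append a
            # textual description after the counters.
            return [int(tok) for tok in rest.split() if tok.isdigit()]
    return []


def watchdog_ticking(before_text, after_text):
    """True if every CPU's NMI counter increased between two snapshots."""
    before = nmi_counts(before_text)
    after = nmi_counts(after_text)
    if not before or len(before) != len(after):
        return False
    return all(a > b for a, b in zip(after, before))
```

If the counters stay at zero, the watchdog is not active and a hard lockup will not produce an oops from it.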
Regards,
- - andreas
- --
Andreas Haumer | mailto:[email protected]
*x Software + Systeme | http://www.xss.co.at/
Karmarschgasse 51/2/20 | Tel: +43-1-6060114-0
A-1100 Vienna, Austria | Fax: +43-1-6060114-71
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQE+6El7xJmyeGcXPhERAqykAKCumORTm/lDofkrg52FX33rOfgC/ACeNxR7
l9/znrbi0lZoR/zw+LTdNhI=
=W7Gt
-----END PGP SIGNATURE-----