LinuxLists.cc - strange freeze with VIA C7 dedicated server and libc 2.6.1

2008-06-24 03:01:47

Subject: strange freeze with VIA C7 dedicated server and libc 2.6.1

hi all

I run Gentoo Linux on 3 boxes ( low-cost dedicated servers ) with
VIA C7 CPUs. All of them ran very well for more than a year.

I recently ran emerge --sync and emerge -DNatuv world ( Gentoo updates )
on 2 of the boxes.

The 2 upgraded boxes ( now running glibc-2.6.1 ) are now freezing very
often, mostly under heavy load ( load average > 2 ).

There is nothing in the logs. I checked syslog, kern.log . . . nothing, no
clue.
The other box, still running glibc-2.5-r4, works as well as before.

I tried many kernels, from 2.6.18 to 2.6.24, with and without the hardened
profile; the result is the same.

I'm not the only one: many people in France have had this problem. Debian
users who downgraded libc got a stable server back, but on Gentoo,
downgrading libc seems pretty dangerous.

All the reported problems are in French, because it seems only dedibox
( http://dedibox.fr ) provides low-cost servers using the VIA C7 processor.
If needed I can provide many web pages where people describe the problem in
French ( googling "dedibox freeze libc" gives some ), but I found nothing in
English. If you contact the dedibox.fr admins they will confirm the problem;
perhaps they would even agree to provide a box for testing, who knows . . .

I have nothing to give you: there is nothing in the logs, the box just stops
working as if the power had been switched off.

The problem has happened at least on Debian and Gentoo, which are the most
used Linux distros on dedibox VIA C7 servers.

The exact processor is :
processor : 0
vendor_id : CentaurHauls
cpu family : 6
model : 10
model name : VIA Esther processor 2000MHz
stepping : 9
cpu MHz : 1995.084
cache size : 128 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge cmov pat
clflush acpi mmx fxsr sse sse2 tm pni est tm2 rng rng_en ace ace_en ace2
ace2_en phe phe_en pmm pmm_en
bogomips : 3994.49
clflush size : 64
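
To compare boxes from different reporters, a small parser can check whether a pasted /proc/cpuinfo block matches the vendor, family and model shown in the dump above. This is only a sketch: the function names are made up for illustration, and the match criteria are simply the three values from this dump, not an authoritative way to identify every VIA C7.

```python
def parse_cpuinfo(text):
    """Parse one processor block of /proc/cpuinfo-style text into a dict."""
    info = {}
    last_key = None
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            info[last_key] = value.strip()
        elif line.strip() and last_key is not None:
            # Mail clients wrap long lines such as "flags"; glue them back.
            info[last_key] += " " + line.strip()
    return info


def looks_like_reported_cpu(info):
    """True if vendor, family and model match the dump in this report."""
    return (
        info.get("vendor_id") == "CentaurHauls"
        and info.get("cpu family") == "6"
        and info.get("model") == "10"
    )


# Excerpt of the dump above, including the wrapped "flags" line.
SAMPLE = """\
vendor_id : CentaurHauls
cpu family : 6
model : 10
model name : VIA Esther processor 2000MHz
stepping : 9
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge cmov pat
clflush acpi mmx fxsr sse sse2 tm pni est tm2 rng rng_en ace ace_en ace2
ace2_en phe phe_en pmm pmm_en
"""

info = parse_cpuinfo(SAMPLE)
print(looks_like_reported_cpu(info), info["model name"])
```

On a live box the same functions could be fed the contents of /proc/cpuinfo, one processor block at a time.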

Reproducible: Always

Steps to Reproduce:
1. box using VIA C7 processor with glibc
2. heavy load >2
3. wait 1 or 2 hours

The complete bug report with attachments ( logs and kernel .config )
and gentoo maintainers comments is here :

http://bugs.gentoo.org/show_bug.cgi?id=228263

I am not sure this is a Linux kernel bug, since all the kernels I
tried ( from 2.6.18 to 2.6.24 ) worked perfectly before the upgrade
to the new libc, but the Gentoo maintainers finally told me I should
post to the LKML.

here is the last comment from the gentoo maintainer :
"If the kernel completely locks up then that is a kernel bug or a hardware bug.
It shouldn't be possible to lock up the kernel, regardless of what userland
such as glibc does. Presumably the newer glibc is doing something different
that is triggering the bug. Regardless of what that is and whether it should be
doing it, it shouldn't completely hang the kernel."

Feel free to ask for more details, I'll be happy to provide answers.
( no need to cc me, I have subscribed to follow this problem )

--
Regards

William Waisse
http://waisse.org | http://neoskills.com
http://cahierspip.ww7.be | http://feeder.ww7.be

2008-06-24 09:39:16

by Alan

Subject: Re: strange freeze with VIA C7 dedicated server and libc 2.6.1

> "If the kernel completely locks up then that is a kernel bug or a hardware bug.
> It shouldn't be possible to lock up the kernel, regardless of what userland
> such as glibc does. Presumably the newer glibc is doing something different

Except for bugs in glibc that trigger things happening as root which go
on to do stuff like power down the system (root is allowed to power
down/reboot/etc). That is a fairly unlikely case.

> that is triggering the bug. Regardless of what that is and whether it should be
> doing it, it shouldn't completely hang the kernel."

The first thing is to find out which glibc version is the latest that
works and which is the earliest that fails. Second is to try and find out
what app or event triggers the failure (e.g. can you boot into text
mode with init s and then run 2 or 3 cpu hogs all day?)
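
A minimal sketch of such a CPU-hog run, assuming Python is available on the test box. The names `run_cpu_hogs` and `write_heartbeat` are hypothetical helpers, not an existing tool. The heartbeat is an extra idea rather than something proposed in this thread: because the freezes leave nothing in the logs, a timestamp forced to disk at regular intervals at least narrows down when the box stopped.

```python
import os
import subprocess
import sys
import time

# Child program: spin in a tight loop until the deadline passes.
BUSY_LOOP = (
    "import sys, time\n"
    "deadline = time.monotonic() + float(sys.argv[1])\n"
    "while time.monotonic() < deadline:\n"
    "    pass\n"
)


def run_cpu_hogs(count, seconds):
    """Start `count` busy-loop processes, wait for them, return exit codes."""
    procs = [
        subprocess.Popen([sys.executable, "-c", BUSY_LOOP, str(seconds)])
        for _ in range(count)
    ]
    return [p.wait() for p in procs]


def write_heartbeat(path):
    """Append a timestamp and ask the kernel to flush it to disk."""
    with open(path, "a") as f:
        f.write(time.strftime("%Y-%m-%d %H:%M:%S") + "\n")
        f.flush()
        os.fsync(f.fileno())


# Short demonstration run: two hogs for half a second each.
print(run_cpu_hogs(2, 0.5))
```

For a real test the duration would be hours rather than a fraction of a second, with `write_heartbeat` called from a loop, for example once a minute, while the hogs run.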

Alan

2008-06-24 21:05:30

by william

Subject: Re: strange freeze with VIA C7 dedicated server and libc 2.6.1

> Except for bugs in glibc that trigger things happening as root which go
> on to do stuff like power down the system (root is allowed to power
> down/reboot/etc). That is a fairly unlikely case.

Yes, I know this is really hard to believe, with nothing in the logs . . .
but it happens to at least 20 people: all the upgraded boxes have the
problem, and on all the downgraded boxes the problem disappears.

>> that is triggering the bug. Regardless of what that is and whether it should be
>> doing it, it shouldn't completely hang the kernel."
> The first thing is to find out which glibc version is the latest that
> works, which is the earliest that fails.

Yes, but I couldn't test that myself on a production dedicated server.

The only things that are 100% sure:

gentoo : upgrading from glibc-2.5-r4 to glibc-2.6.1 makes the problem appear.
debian : upgrading from 2.3.6.ds1-3 to 2.3.6.ds1-13etch5 makes the problem
appear.
All the debian users who downgraded their libc to 2.3.6.ds1-3 saw the
problem disappear.
( I suppose the -13 in the debian package name means 2.3.6 plus many
patches, so the 2.3.6.ds1-13etch5 is probably a 2.6.x ? )

( I couldn't downgrade libc on gentoo; downgrading libc on gentoo is a
nearly suicidal idea. )

But now I have good news: the dedibox.fr admins have agreed to lend us a
box for testing purposes.

I can offer a testing shell with unlimited sudo to any kernel developer
interested in investigating this mystery who has a GnuPG key and a web of
trust ( mine is
http://pgpkeys.mit.edu:11371/pks/lookup?op=vindex&search=0x690B4E07 , we
probably have a trust path ).

> Second is to try and find out
> what apps or event is the trigger for the fail (eg can you boot into text
> mode with init s and then run 2 or 3 cpu hogs all day)

I have only a few details on this point:

* my box freezes during the morning SQL updates ( updating 300 MB of SQL
  over 3 hours every morning ), but the script is launched with nice-20
* crontab could be related to the problem: it seems to me that I have had
  fewer freezes since I split one big crontab ( launching a 3-hour
  script ) into 4 smaller crontabs; some other users said that
  disabling big crontabs helped
* the load is not that high, often between 1 and 2

Another thing I did not say in the first mail: after the problem
appeared I installed lm_sensors and watchdog to try to investigate the
problem:

* the temperature is never higher than 54°C, which seems OK for a VIA
  C7, am I wrong? Some people say 54°C is OK, others say it's not
  normal for a VIA C7 in a datacenter . . .
* the watchdog says nothing in the logs, but is able to reboot the box.

Thank you very much for your answer Alan, I was hesitating about
posting a report with no logs, no clues . . . your answer gives me a
little hope ;)

--
Regards

William Waisse
http://waisse.org | http://neoskills.com
http://cahierspip.ww7.be | http://feeder.ww7.be

2008-06-24 21:46:37

by Alan

Subject: Re: strange freeze with VIA C7 dedicated server and libc 2.6.1

> * the watchdog says nothing in the logs, but is able to reboot the box.
>
> Thank you very much for your answer Alan, I were hesitating on
> posting a report with no logs, no clues . . . your answer gives me a
> little hope ;)

Two random thoughts from your last comment

- If you do

echo "2" >/proc/sys/vm/overcommit_memory
echo "80" >/proc/sys/vm/overcommit_ratio

do you instead get out of memory kills (which would imply bad memory
leaks perhaps triggered by glibc ?)

- Does your system pass 'crashme' testing (run as a non root user). If
not then that might give an eventual identification of a crashme run
which takes out the box. We've found kernel bugs, CPU bugs and
combinations of the two before now that way.
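
For the overcommit settings in the first point, it can help to know roughly where the limit will sit before enabling strict mode. The sketch below uses the formula from the kernel's overcommit-accounting documentation (swap plus RAM times overcommit_ratio / 100) and ignores huge pages for simplicity. The sample figures are invented, not taken from any box in this thread, and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Return {field: integer value} from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio):
    """Approximate CommitLimit for overcommit_memory=2, ignoring huge pages."""
    return swap_total_kb + mem_total_kb * overcommit_ratio // 100


# Hypothetical box: 1 GiB of RAM and 512 MiB of swap.
SAMPLE = """\
MemTotal:        1048576 kB
SwapTotal:        524288 kB
"""

mem = parse_meminfo(SAMPLE)
limit = commit_limit_kb(mem["MemTotal"], mem["SwapTotal"], 80)
print(f"approximate CommitLimit: {limit} kB")
```

Once strict mode is on, the kernel reports its own CommitLimit and Committed_AS in /proc/meminfo, so the computed value can be checked against the real one.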

2008-06-25 16:36:29

by Stefan Hellermann

Subject: Re: strange freeze with VIA C7 dedicated server and libc 2.6.1

On Tuesday, 24.06.2008, at 22:28 +0100, Alan Cox wrote:
> > * the watchdog says nothing in the logs, but is able to reboot the box.
> >
> > Thank you very much for your answer Alan, I were hesitating on
> > posting a report with no logs, no clues . . . your answer gives me a
> > little hope ;)
>
> Two random thoughts from your last comment
>
> - If you do
>
> echo "2" >/proc/sys/vm/overcommit_memory
> echo "80" >/proc/sys/vm/overcommit_ratio
>
> do you instead get out of memory kills (which would imply bad memory
> leaks perhaps triggered by glibc ?)
>
> - Does your system pass 'crashme' testing (run as a non root user). If
> not then that might give an eventual identification of a crashme run
> which takes out the box. We've found kernel bugs, CPU bugs and
> combinations of the two before now that way.

Hi!

I've got the same problem with a VIA Epia SN-1800, Gentoo and
glibc-2.6.1. First I had crashes every day, but these came from
madwifi-ng. Now with vanilla-2.6.25.6 and no modules it's crashing about
every 3 weeks with no log I can provide. I have a serial console
connected to it, but I have no other device running 24h to collect the
crash.
I tried glibc-2.7, but with this powerdns-resolver isn't working any
more, and I don't think the problems are gone (only one crash so far).
It's not easy to downgrade glibc on gentoo, but I could try
vanilla-glibc-2.5 if this would help.
I have no big crontab, only a script which rotates logs, and makewhatis.

Where can I find 'crashme'? Is it a tool I can download?

It's a small home server carrying my mail and webspace, so I can do a
bit of testing, but I don't like long downtime :-)

--
Kind Regards
Stefan Hellermann

2009-10-20 13:44:46

by Eric des Courtis

Subject: Re: strange freeze with VIA C7 dedicated server and libc 2.6.1

Hi,

I have the same problem, but I do have a stack trace. I ran crashme
with +2000 666 100 1:00:00 and it seems to work fine. Random
applications crash in the sys_open() call. If I am in X the system
sometimes freezes completely.

Anyway this is the stack trace:

[ 2074.794366] invalid opcode: 0000 [#1] SMP
[ 2074.804264] last sysfs file:
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/resource
[ 2074.804264] Dumping ftrace buffer:
[ 2074.804264] (ftrace buffer empty)
[ 2074.804264] Modules linked in: via drm lp parport viafb
i2c_algo_bit snd_via82xx gameport snd_ac97_codec ac97_bus snd_pcm_oss
snd_mixer_oss snd_pcm snd_page_alloc snd_mpu401_uart snd_seq_dummy
snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq
snd_timer snd_seq_device pcspkr snd lirc_imon i2c_viapro soundcore
lirc_dev via_agp agpgart shpchp usbhid via_rhine 3c59x mii vesafb
fbcon tileblit font bitblit softcursor
[ 2074.804264]
[ 2074.804264] Pid: 2635, comm: lcdproc Not tainted (2.6.28-15-server
#52-Ubuntu) ID-PCM7E PC2500
[ 2074.804264] EIP: 0060:[<c01d1041>] EFLAGS: 00010202 CPU: 0
[ 2074.804264] EIP is at path_lookup_open+0x31/0xa0
[ 2074.804264] EAX: 00000001 EBX: 00000101 ECX: 00000000 EDX: f5dca5b0
[ 2074.804264] ESI: ffffffe9 EDI: f5c8bf04 EBP: f5c8bec0 ESP: f5c8bea8
[ 2074.804264] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 2074.804264] Process lcdproc (pid: 2635, ti=f5c8a000 task=f5dca5b0
task.ti=f5c8a000)
[ 2074.804264] Stack:
[ 2074.804264] 00000001 e84f2000 ffffff9c ffffff9c 00000001 f5c8bf04
f5c8bf70 c01d1d23
[ 2074.804264] f5c8bf04 00000001 00000000 f5c8bf04 f64d3000 00000000
e84f2000 00000000
[ 2074.804264] 00000024 ffffffff 00000000 00000000 00000000 00000000
00000000 f5dca5b0
[ 2074.804264] Call Trace:
[ 2074.804264] [<c01d1d23>] ? do_filp_open+0xb3/0x7c0
[ 2074.804264] [<c01569e0>] ? autoremove_wake_function+0x0/0x50
[ 2074.804264] [<c01daed0>] ? alloc_fd+0xe0/0x100
[ 2074.804264] [<c01c47bf>] ? do_sys_open+0x5f/0x120
[ 2074.804264] [<c01c48e9>] ? sys_open+0x29/0x40
[ 2074.804264] [<c0109eef>] ? sysenter_do_call+0x12/0x2f
[ 2074.804264] Code: 89 5d f4 89 cb 89 75 f8 be e9 ff ff ff 89 7d fc
8b 7d 08 89 45 f0 89 55 ec e8 3c 68 ff ff 85 c0 74 33 89 47 4c 8b 45
0c 80 cf 01 <c7> 47 48 00 00 00 00 89 d9 89 47 44 8b 55 ec 8b 45 f0 89
3c 24
[ 2074.804264] EIP: [<c01d1041>] path_lookup_open+0x31/0xa0 SS:ESP 0068:f5c8bea8
[ 2075.261958] ---[ end trace 59aabadb5240aad2 ]---

And much later (could be unrelated):

[ 2830.975240] lcdproc[9430]: segfault at 1bfef35 ip b7f9e05a sp
bfef2175 error 4 in libc-2.9.so[b7f67000+15c000]
[ 2830.984939] klogd[2111]: segfault at 4 ip b7e1e05a sp bfb6d2b1
error 4 in libc-2.9.so[b7de7000+15c000]
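
The two segfault lines can be decoded a little further. The sketch below assumes the mail-wrapped lines are rejoined as shown in `LINES`. It subtracts the mapping base from the instruction pointer to get the offset inside libc and decodes the x86 page-fault error code (bit 0: protection violation rather than page not present, bit 1: write rather than read, bit 2: user mode). The helper names are hypothetical.

```python
import re

SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def describe_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    mode = "user" if code & 4 else "kernel"
    access = "write" if code & 2 else "read"
    cause = "protection violation" if code & 1 else "page not present"
    return f"{mode} {access}, {cause}"


def parse_segfault(line):
    """Return process name, object, offset into the object and error code."""
    m = SEGFAULT_RE.search(" ".join(line.split()))
    if m is None:
        return None
    return {
        "comm": m["comm"],
        "obj": m["obj"],
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        "error": int(m["err"], 16),
    }


# The two log lines above, with the mail line wrapping undone.
LINES = [
    "lcdproc[9430]: segfault at 1bfef35 ip b7f9e05a sp bfef2175 "
    "error 4 in libc-2.9.so[b7f67000+15c000]",
    "klogd[2111]: segfault at 4 ip b7e1e05a sp bfb6d2b1 "
    "error 4 in libc-2.9.so[b7de7000+15c000]",
]

for line in LINES:
    r = parse_segfault(line)
    print(r["comm"], f"{r['obj']}+{r['offset']:#x}", describe_error(r["error"]))
```

With the lines joined this way, both faults land at the same offset inside libc-2.9.so (0x3705a), so two different processes failed at the same libc instruction. That alone does not say whether the cause is libc, the kernel or the CPU.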

Cheers,

Eric des Courtis

2009-10-20 17:24:09

by Stefan Hellermann

Subject: Re: strange freeze with VIA C7 dedicated server and libc 2.6.1

Hi,

for me the problems are gone. Hardware stayed the same, but I installed
updates for many packages. Currently I'm running vanilla-2.6.31 compiled
with gcc-4.3.2 and a libc from gentoo, glibc-2.9_p20081201-r2.

Cheers
Stefan Hellermann
