2011-04-28 08:42:36

by Nikola Ciprich

Subject: 2.6.32.21 - uptime related crashes?

Hello everybody,

I'm trying to solve a strange issue: today my fourth machine running 2.6.32.21 crashed. What makes the cases similar, apart from the same kernel version, is that all the boxes had very similar uptimes: 214, 216, 216, and 224 days. This might just be a coincidence, but I think it might be important.

Unfortunately I only have backtraces for two of the crashes (and those are trimmed, sorry), and they do not look as similar as I'd like, but maybe there is still something in common:

[<ffffffff81120cc7>] pollwake+0x57/0x60
[<ffffffff81046720>] ? default_wake_function+0x0/0x10
[<ffffffff8103683a>] __wake_up_common+0x5a/0x90
[<ffffffff8103a313>] __wake_up+0x43/0x70
[<ffffffffa0321573>] process_masterspan+0x643/0x670 [dahdi]
[<ffffffffa0326595>] coretimer_func+0x135/0x1d0 [dahdi]
[<ffffffff8105d74d>] run_timer_softirq+0x15d/0x320
[<ffffffffa0326460>] ? coretimer_func+0x0/0x1d0 [dahdi]
[<ffffffff8105690c>] __do_softirq+0xcc/0x220
[<ffffffff8100c40c>] call_softirq+0x1c/0x30
[<ffffffff8100e3ba>] do_softirq+0x4a/0x80
[<ffffffff810567c7>] irq_exit+0x87/0x90
[<ffffffff8100d7b7>] do_IRQ+0x77/0xf0
[<ffffffff8100bc53>] ret_from_intr+0x0/0xa
<EOI> [<ffffffffa019e556>] ? acpi_idle_enter_bm+0x273/0x2a1 [processor]
[<ffffffffa019e54c>] ? acpi_idle_enter_bm+0x269/0x2a1 [processor]
[<ffffffff81280095>] ? cpuidle_idle_call+0xa5/0x150
[<ffffffff8100a18f>] ? cpu_idle+0x4f/0x90
[<ffffffff81323c95>] ? rest_init+0x75/0x80
[<ffffffff81582d7f>] ? start_kernel+0x2ef/0x390
[<ffffffff81582271>] ? x86_64_start_reservations+0x81/0xc0
[<ffffffff81582386>] ? x86_64_start_kernel+0xd6/0x100

This box (actually two of the crashed ones) uses the dahdi_dummy module to generate timing for an Asterisk SW PBX, so maybe it's related to that.


[<ffffffff810a5063>] handle_IRQ_event+0x63/0x1c0
[<ffffffff810a71ae>] handle_edge_irq+0xce/0x160
[<ffffffff8100e1bf>] handle_irq+0x1f/0x30
[<ffffffff8100d7ae>] do_IRQ+0x6e/0xf0
[<ffffffff8100bc53>] ret_from_intr+0x0/0xa
<EOI> [<ffffffff81337f7f>] ? _spin_unlock_irq+0xf/0x40
[<ffffffff81337f79>] ? _spin_unlock_irq+0x9/0x40
[<ffffffff81064b9a>] ? exit_signals+0x8a/0x130
[<ffffffff8105372e>] ? do_exit+0x7e/0x7d0
[<ffffffff8100f8a7>] ? oops_end+0xa7/0xb0
[<ffffffff8100faa6>] ? die+0x56/0x90
[<ffffffff8100c810>] ? do_trap+0x130/0x150
[<ffffffff8100ccca>] ? do_divide_error+0x8a/0xa0
[<ffffffff8103d227>] ? find_busiest_group+0x3d7/0xa00
[<ffffffff8104400b>] ? cpuacct_charge+0x6b/0x90
[<ffffffff8100c045>] ? divide_error+0x15/0x20
[<ffffffff8103d227>] ? find_busiest_group+0x3d7/0xa00
[<ffffffff8103cfff>] ? find_busiest_group+0x1af/0xa00
[<ffffffff81335483>] ? thread_return+0x4ce/0x7bb
[<ffffffff8133bec5>] ? do_nanosleep+0x75/0x30
[<ffffffff810?1?4e>] ? hrtimer_nanosleep+0x9e/0x120
[<ffffffff810?08f0>] ? hrtimer_wakeup+0x0/0x30
[<ffffffff810?183f>] ? sys_nanosleep+0x6f/0x80

The other two boxes don't use it. The only similarity I see here is that both traces seem to be IRQ-handling related, but otherwise the two issues don't have much in common.
Does anybody have an idea where I should look? Of course I should update all those boxes to (at least) the latest 2.6.32.x, and I'll do that for sure, but I'd first like to know where the problem is, whether it has already been fixed, and if not, how to fix it...
I'd be grateful for any help...
BR
nik


--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.: +420 596 603 142
fax: +420 596 621 273
mobil: +420 777 093 799
http://www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: [email protected]
-------------------------------------



2011-04-28 18:35:13

by Willy Tarreau

Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

Hello Nikola,

On Thu, Apr 28, 2011 at 10:26:25AM +0200, Nikola Ciprich wrote:
> Hello everybody,
>
> I'm trying to solve strange issue, today, my fourth machine running 2.6.32.21 just crashed. What makes the cases similar, apart fromn same kernel version is that all boxes had very similar uptimes: 214, 216, 216, and 224 days. This might just be a coincidence, but I think this might be important.

Interestingly, one of our customers just had two machines which crashed
yesterday after 212 days and 212 days + 20 h of uptime respectively. They were
running Debian's 2.6.32-bpo.5-amd64, which is based on 2.6.32.23 AIUI.

The crash looks very similar to the following bug which we have updated:

https://bugzilla.kernel.org/show_bug.cgi?id=16991

(bugzilla doesn't appear to respond as I'm posting this mail).

The top of your output is missing. In our case, as in the reports on the bug
above, there was a divide-by-zero error. Did you happen to spot this one
too, or do you just not know? I observe "divide_error+0x15/0x20" in one
of your reports, so it's possible that it matches the same pattern, at least
for one trace. Just in case, it would be nice to feed the bugzilla entry
above with your reports.

> Unfortunately I only have backtraces of two crashes (and those are trimmed, sorry), and they do not look as similar as I'd like, but still maybe there is something in common:
>
> [<ffffffff81120cc7>] pollwake+0x57/0x60
> [<ffffffff81046720>] ? default_wake_function+0x0/0x10
> [<ffffffff8103683a>] __wake_up_common+0x5a/0x90
> [<ffffffff8103a313>] __wake_up+0x43/0x70
> [<ffffffffa0321573>] process_masterspan+0x643/0x670 [dahdi]
> [<ffffffffa0326595>] coretimer_func+0x135/0x1d0 [dahdi]
> [<ffffffff8105d74d>] run_timer_softirq+0x15d/0x320
> [<ffffffffa0326460>] ? coretimer_func+0x0/0x1d0 [dahdi]
> [<ffffffff8105690c>] __do_softirq+0xcc/0x220
> [<ffffffff8100c40c>] call_softirq+0x1c/0x30
> [<ffffffff8100e3ba>] do_softirq+0x4a/0x80
> [<ffffffff810567c7>] irq_exit+0x87/0x90
> [<ffffffff8100d7b7>] do_IRQ+0x77/0xf0
> [<ffffffff8100bc53>] ret_from_intr+0x0/0xa
> <EOI> [<ffffffffa019e556>] ? acpi_idle_enter_bm+0x273/0x2a1 [processor]
> [<ffffffffa019e54c>] ? acpi_idle_enter_bm+0x269/0x2a1 [processor]
> [<ffffffff81280095>] ? cpuidle_idle_call+0xa5/0x150
> [<ffffffff8100a18f>] ? cpu_idle+0x4f/0x90
> [<ffffffff81323c95>] ? rest_init+0x75/0x80
> [<ffffffff81582d7f>] ? start_kernel+0x2ef/0x390
> [<ffffffff81582271>] ? x86_64_start_reservations+0x81/0xc0
> [<ffffffff81582386>] ? x86_64_start_kernel+0xd6/0x100
>
> this box (actually two of the crashed ones) is using dahdi_dummy module to generate timing for asterisk SW pbx, so maybe it's related to it.
>
>
> [<ffffffff810a5063>] handle_IRQ_event+0x63/0x1c0
> [<ffffffff810a71ae>] handle_edge_irq+0xce/0x160
> [<ffffffff8100e1bf>] handle_irq+0x1f/0x30
> [<ffffffff8100d7ae>] do_IRQ+0x6e/0xf0
> [<ffffffff8100bc53>] ret_from_intr+0x0/0xa
> <EOI> [<ffffffff81337f7f>] ? _spin_unlock_irq+0xf/0x40
> [<ffffffff81337f79>] ? _spin_unlock_irq+0x9/0x40
> [<ffffffff81064b9a>] ? exit_signals+0x8a/0x130
> [<ffffffff8105372e>] ? do_exit+0x7e/0x7d0
> [<ffffffff8100f8a7>] ? oops_end+0xa7/0xb0
> [<ffffffff8100faa6>] ? die+0x56/0x90
> [<ffffffff8100c810>] ? do_trap+0x130/0x150
> [<ffffffff8100ccca>] ? do_divide_error+0x8a/0xa0
> [<ffffffff8103d227>] ? find_busiest_group+0x3d7/0xa00
> [<ffffffff8104400b>] ? cpuacct_charge+0x6b/0x90
> [<ffffffff8100c045>] ? divide_error+0x15/0x20
> [<ffffffff8103d227>] ? find_busiest_group+0x3d7/0xa00
> [<ffffffff8103cfff>] ? find_busiest_group+0x1af/0xa00
> [<ffffffff81335483>] ? thread_return+0x4ce/0x7bb
> [<ffffffff8133bec5>] ? do_nanosleep+0x75/0x30
> [<ffffffff810?1?4e>] ? hrtimer_nanosleep+0x9e/0x120
> [<ffffffff810?08f0>] ? hrtimer_wakeup+0x0/0x30
> [<ffffffff810?183f>] ? sys_nanosleep+0x6f/0x80
>
> another two don't use it. only similarity I see here is that it seems to be IRQ handling related, but both issues don't have anything in common.
> Does anybody have an idea on where should I look? Of course I should update all those boxes to (at least) latest 2.6.32.x, and I'll do it for sure, but still I'd first like to know where the problem was, and if it has been fixed, or how to fix it...
> I'd be gratefull for any help...

There have been quite a few scheduler updates recently. We may be lucky and
the bug may have vanished with those changes, but we may just as well see
the same crash again in 7 months :-/

My coworker Hervé (CC'd), who worked on the issue, suggests that we might have
something which goes wrong past a certain uptime (e.g. 212 days) and which needs
a special event to be triggered (I/O, a process exiting, etc.). I think this
makes quite a lot of sense.

Could you check your CONFIG_HZ so that we could convert those uptimes to
jiffies? Maybe this will ring a bell in someone's head :-/

Best regards,
Willy

2011-04-29 10:04:43

by Nikola Ciprich

Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

(another CC added)

Hello Willy!

I made some statistics of our servers regarding kernel version and uptime.
Here are some of my thoughts:
- I'm 100% sure this problem wasn't present in kernels <= 2.6.30.x (we've got a lot of boxes with uptimes >600 days)
- I'm 90% sure this problem also wasn't present in 2.6.32.16 (we've got 6 boxes running for 235 to 280 days)

What I'm not sure about is whether this is present in 2.6.32.19; I have
2 boxes running 2.6.32.19 for 238 days and one running 2.6.32.20 for 216 days.
I also have a bunch of 2.6.32.23 boxes, which are now getting close to 200 days of uptime.
But I suspect this really is the first problematic version; more on that later.
First, regarding your question about CONFIG_HZ: we use the 250 HZ setting, which leads me to the following:
250 * 60 * 60 * 24 * 199 = 4298400000, which is a value a little over 2**32! So maybe some unsigned long variable
might overflow? Does this make sense?
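
Just to sanity-check the arithmetic, here is a trivial userspace snippet
(plain C, nothing kernel-specific, just the numbers above):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t hz = 250;
	const uint64_t ticks_per_day = hz * 60 * 60 * 24;	/* 21,600,000 */
	uint64_t days = 1;

	/* first whole day during which the tick count passes 2^32 */
	while (days * ticks_per_day <= (1ULL << 32))
		days++;

	printf("2^32 ticks at %llu Hz is passed during day %llu\n",
	       (unsigned long long)hz, (unsigned long long)days);
	printf("199 days = %llu ticks, 2^32 = %llu\n",
	       (unsigned long long)(199 * ticks_per_day),
	       (unsigned long long)(1ULL << 32));
	return 0;
}

It reports that the 2**32 boundary is crossed during day 199 (the exact
crossover is at about 198.8 days), which is consistent with all the crashes
happening somewhere past that mark.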

And as for my suspicion about 2.6.32.19, there is one commit which may be related:

commit 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
Author: Hidetoshi Seto <[email protected]>
Date: Wed Dec 2 17:28:07 2009 +0900

sched, cputime: Introduce thread_group_times()

This is a real fix for problem of utime/stime values decreasing
described in the thread:

http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

- {u,s}time in task_struct are increased every time when the thread
is interrupted by a tick (timer interrupt).

- When a thread exits, its {u,s}time are added to signal->{u,s}time,
after adjusted by task_times().

- When all threads in a thread_group exits, accumulated {u,s}time
(and also c{u,s}time) in signal struct are added to c{u,s}time
in signal struct of the group's parent.
.
.
.

I haven't studied this in detail yet, but it seems to me it might really be related. Hidetoshi-san, do you have an opinion about this?
Could this somehow create or trigger an overflow of some variable which would lead to a division by zero or similar problems?

Any other thoughts?

best regards

nik





On Thu, Apr 28, 2011 at 08:34:34PM +0200, Willy Tarreau wrote:
> Hello Nikola,
>
> On Thu, Apr 28, 2011 at 10:26:25AM +0200, Nikola Ciprich wrote:
> > Hello everybody,
> >
> > I'm trying to solve strange issue, today, my fourth machine running 2.6.32.21 just crashed. What makes the cases similar, apart fromn same kernel version is that all boxes had very similar uptimes: 214, 216, 216, and 224 days. This might just be a coincidence, but I think this might be important.
>
> Interestingly, one of our customers just had two machines who crashed
> yesterday after 212 days and 212+20h respectively. They were running
> debian's 2.6.32-bpo.5-amd64 which is based on 2.6.32.23 AIUI.
>
> The crash looks very similar to the following bug which we have updated :
>
> https://bugzilla.kernel.org/show_bug.cgi?id=16991
>
> (bugzilla doesn't appear to respond as I'm posting this mail).
>
> The top of your ouput is missing. In our case as in the reports on the bug
> above, there was a divide by zero error. Did you happen to spot this one
> too, or do you just not know ? I observe "divide_error+0x15/0x20" in one
> of your reports, so it's possible that it matches the same pattern at least
> for one trace. Just in case, it would be nice to feed the bugzilla entry
> above.
>
> > Unfortunately I only have backtraces of two crashes (and those are trimmed, sorry), and they do not look as similar as I'd like, but still maybe there is something in common:
> >
> > [<ffffffff81120cc7>] pollwake+0x57/0x60
> > [<ffffffff81046720>] ? default_wake_function+0x0/0x10
> > [<ffffffff8103683a>] __wake_up_common+0x5a/0x90
> > [<ffffffff8103a313>] __wake_up+0x43/0x70
> > [<ffffffffa0321573>] process_masterspan+0x643/0x670 [dahdi]
> > [<ffffffffa0326595>] coretimer_func+0x135/0x1d0 [dahdi]
> > [<ffffffff8105d74d>] run_timer_softirq+0x15d/0x320
> > [<ffffffffa0326460>] ? coretimer_func+0x0/0x1d0 [dahdi]
> > [<ffffffff8105690c>] __do_softirq+0xcc/0x220
> > [<ffffffff8100c40c>] call_softirq+0x1c/0x30
> > [<ffffffff8100e3ba>] do_softirq+0x4a/0x80
> > [<ffffffff810567c7>] irq_exit+0x87/0x90
> > [<ffffffff8100d7b7>] do_IRQ+0x77/0xf0
> > [<ffffffff8100bc53>] ret_from_intr+0x0/0xa
> > <EOI> [<ffffffffa019e556>] ? acpi_idle_enter_bm+0x273/0x2a1 [processor]
> > [<ffffffffa019e54c>] ? acpi_idle_enter_bm+0x269/0x2a1 [processor]
> > [<ffffffff81280095>] ? cpuidle_idle_call+0xa5/0x150
> > [<ffffffff8100a18f>] ? cpu_idle+0x4f/0x90
> > [<ffffffff81323c95>] ? rest_init+0x75/0x80
> > [<ffffffff81582d7f>] ? start_kernel+0x2ef/0x390
> > [<ffffffff81582271>] ? x86_64_start_reservations+0x81/0xc0
> > [<ffffffff81582386>] ? x86_64_start_kernel+0xd6/0x100
> >
> > this box (actually two of the crashed ones) is using dahdi_dummy module to generate timing for asterisk SW pbx, so maybe it's related to it.
> >
> >
> > [<ffffffff810a5063>] handle_IRQ_event+0x63/0x1c0
> > [<ffffffff810a71ae>] handle_edge_irq+0xce/0x160
> > [<ffffffff8100e1bf>] handle_irq+0x1f/0x30
> > [<ffffffff8100d7ae>] do_IRQ+0x6e/0xf0
> > [<ffffffff8100bc53>] ret_from_intr+0x0/0xa
> > <EOI> [<ffffffff81337f7f>] ? _spin_unlock_irq+0xf/0x40
> > [<ffffffff81337f79>] ? _spin_unlock_irq+0x9/0x40
> > [<ffffffff81064b9a>] ? exit_signals+0x8a/0x130
> > [<ffffffff8105372e>] ? do_exit+0x7e/0x7d0
> > [<ffffffff8100f8a7>] ? oops_end+0xa7/0xb0
> > [<ffffffff8100faa6>] ? die+0x56/0x90
> > [<ffffffff8100c810>] ? do_trap+0x130/0x150
> > [<ffffffff8100ccca>] ? do_divide_error+0x8a/0xa0
> > [<ffffffff8103d227>] ? find_busiest_group+0x3d7/0xa00
> > [<ffffffff8104400b>] ? cpuacct_charge+0x6b/0x90
> > [<ffffffff8100c045>] ? divide_error+0x15/0x20
> > [<ffffffff8103d227>] ? find_busiest_group+0x3d7/0xa00
> > [<ffffffff8103cfff>] ? find_busiest_group+0x1af/0xa00
> > [<ffffffff81335483>] ? thread_return+0x4ce/0x7bb
> > [<ffffffff8133bec5>] ? do_nanosleep+0x75/0x30
> > [<ffffffff810?1?4e>] ? hrtimer_nanosleep+0x9e/0x120
> > [<ffffffff810?08f0>] ? hrtimer_wakeup+0x0/0x30
> > [<ffffffff810?183f>] ? sys_nanosleep+0x6f/0x80
> >
> > another two don't use it. only similarity I see here is that it seems to be IRQ handling related, but both issues don't have anything in common.
> > Does anybody have an idea on where should I look? Of course I should update all those boxes to (at least) latest 2.6.32.x, and I'll do it for sure, but still I'd first like to know where the problem was, and if it has been fixed, or how to fix it...
> > I'd be gratefull for any help...
>
> There were quite a bunch of scheduler updates recently. We may be lucky and
> hope for the bug to have vanished with the changes, but we may as well see
> the same crash in 7 months :-/
>
> My coworker Hervé (CC'd) who worked on the issue suggests that we might have
> something which goes wrong past a certain uptime (eg: 212 days), which needs
> a special event to be triggered (I/O, process exiting, etc...). I think this
> makes quite some sense.
>
> Could you check your CONFIG_HZ so that we could convert those uptimes to
> jiffies ? Maybe this will ring a bell in someone's head :-/
>
> Best regards,
> Willy
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.: +420 596 603 142
fax: +420 596 621 273
mobil: +420 777 093 799
http://www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: [email protected]
-------------------------------------



2011-04-30 09:36:49

by Willy Tarreau

Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

Hello Nikola,

On Fri, Apr 29, 2011 at 12:02:00PM +0200, Nikola Ciprich wrote:
> (another CC added)
>
> Hello Willy!
>
> I made some statistics of our servers regarding kernel version and uptime.
> Here are some my thoughts:
> - I'm 100% sure this problem wasn't present in kernels <= 2.6.30.x (we've got a lot of boxes with uptimes >600days)
> - I'm 90% sure this problem also wasn't present in 2.6.32.16 (we've got 6 boxes running for 235 to 280days)

OK those are all precious information.

> What I'm not sure is, whether this is present in 2.6.19, I have:
> 2 boxes running 2.6.32.19 for 238days and one 2.6.32.20 for 216days.
> I also have a bunch ov 2.6.32.23 boxes, which are now getting close to 200days uptime.
> But I suspect this really is first problematic version, more on it later.
> First regarding Your question about CONFIG_HZ - we use 250HZ setting, which leads me to following:
> 250 * 60 * 60 * 24 * 199 = 4298400000 which is value a little over 2**32! So maybe some unsingned long variable
> might overflow? Does this make sense?

Yes, of course it makes sense; that was also my worry. 2^32 jiffies at
250 Hz is slightly less than 199 days. Maybe an overflow somewhere keeps
propagating wrong results into some computations. I remember having encountered
a lot of funny things when trying to get 2.4 past the 497-day limit
using the jiffies64 patch, so I would not be surprised at all if we're
in a similar situation here.

Also, I've checked the Debian kernel config where we had the divide overflow,
and it was running at 250 Hz too.
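
(For reference, the same back-of-the-envelope arithmetic for both HZ values;
just a plain userspace check of a 32-bit tick counter, nothing more:)

#include <stdio.h>

int main(void)
{
	const double two32 = 4294967296.0;	/* 2^32 ticks */

	printf("HZ=100: 2^32 ticks = %.1f days\n", two32 / 100 / 86400);
	printf("HZ=250: 2^32 ticks = %.1f days\n", two32 / 250 / 86400);
	return 0;
}

That gives the familiar ~497.1 days at HZ=100 (the old 2.4 limit) and
~198.8 days at HZ=250, just below the uptimes at which these machines
started falling over.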

> And to my suspicion about 2.6.32.19, there is one commit which maybe is related:
>
> commit 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
> Author: Hidetoshi Seto <[email protected]>
> Date: Wed Dec 2 17:28:07 2009 +0900
>
> sched, cputime: Introduce thread_group_times()
>
> This is a real fix for problem of utime/stime values decreasing
> described in the thread:
>
> http://lkml.org/lkml/2009/11/3/522
>
> Now cputime is accounted in the following way:
>
> - {u,s}time in task_struct are increased every time when the thread
> is interrupted by a tick (timer interrupt).
>
> - When a thread exits, its {u,s}time are added to signal->{u,s}time,
> after adjusted by task_times().
>
> - When all threads in a thread_group exits, accumulated {u,s}time
> (and also c{u,s}time) in signal struct are added to c{u,s}time
> in signal struct of the group's parent.
> .
> .
> .
>
> I haven't studied this into detail yet, but it seems to me it might really be related. Hidetoshi-san - do You have some opinion about this?
> Could this somehow either create or invoke the problem with overflow of some variable which would lead to division by zero or similar problems?
>
> Any other thoughts?

There was a kernel parameter in the past that was used to make jiffies wrap
a few minutes after boot; maybe we should revive it to try to reproduce this
without waiting another 7 months :-/

Last, the "advantage" of a suspected regression in a stable series is that
there are a lot fewer patches to test.

Regards,
Willy

Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

On Sat, 30 Apr 2011, Willy Tarreau wrote:
> There was a kernel parameter in the past that was used to make jiffies wrap
> a few minutes after boot, maybe we should revive it to try to reproduce
> without waiting 7 new months :-/

IMHO, such an option should be given a permanent home in Kconfig, and it
should be enabled by default.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2011-04-30 11:54:50

by Willy Tarreau

Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

On Sat, Apr 30, 2011 at 08:22:44AM -0300, Henrique de Moraes Holschuh wrote:
> On Sat, 30 Apr 2011, Willy Tarreau wrote:
> > There was a kernel parameter in the past that was used to make jiffies wrap
> > a few minutes after boot, maybe we should revive it to try to reproduce
> > without waiting 7 new months :-/
>
> IMHO, such an option should be given a permanent home in KCONFIG, and it
> should be enabled by default.

Not exactly, as it affects uptime.

Willy

2011-04-30 12:04:23

by Nikola Ciprich

Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

> There was a kernel parameter in the past that was used to make jiffies wrap
> a few minutes after boot, maybe we should revive it to try to reproduce
> without waiting 7 new months :-/
Hmm, I've googled this up, and it seems to have been a 2.5.x patch, so it will
certainly need some porting...
I'll try to have a look at it tonight and report...

>
> Last, the "advantage" with a suspected regression in a stable series is that
> there are a lot less patches to test.
>
> Regards,
> Willy
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.: +420 596 603 142
fax: +420 596 621 273
mobil: +420 777 093 799

http://www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: [email protected]
-------------------------------------


Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

On Sat, 30 Apr 2011, Willy Tarreau wrote:
> On Sat, Apr 30, 2011 at 08:22:44AM -0300, Henrique de Moraes Holschuh wrote:
> > On Sat, 30 Apr 2011, Willy Tarreau wrote:
> > > There was a kernel parameter in the past that was used to make jiffies wrap
> > > a few minutes after boot, maybe we should revive it to try to reproduce
> > > without waiting 7 new months :-/
> >
> > IMHO, such an option should be given a permanent home in KCONFIG, and it
> > should be enabled by default.
>
> Not exactly as it affects uptime.

Well, the offset to the real uptime is known; is there a way to fudge it
back so that only the uptime reporting is affected?

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2011-04-30 15:56:01

by Greg KH

Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

On Sat, Apr 30, 2011 at 02:02:05PM +0200, Nikola Ciprich wrote:
> > There was a kernel parameter in the past that was used to make jiffies wrap
> > a few minutes after boot, maybe we should revive it to try to reproduce
> > without waiting 7 new months :-/
> hmm, I've googled this up, and it seems to have been 2.5.x patch, so it will
> certainly need some porting..
> I'll try to have a look on it tonight and report...

I don't think any patch is needed, I thought we did that by default now,
but I can't seem to find the code where it happens...

odd.

greg k-h

2011-04-30 16:08:13

by Randy Dunlap

Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

On Sat, 30 Apr 2011 08:57:07 -0700 Greg KH wrote:

> On Sat, Apr 30, 2011 at 02:02:05PM +0200, Nikola Ciprich wrote:
> > > There was a kernel parameter in the past that was used to make jiffies wrap
> > > a few minutes after boot, maybe we should revive it to try to reproduce
> > > without waiting 7 new months :-/
> > hmm, I've googled this up, and it seems to have been 2.5.x patch, so it will
> > certainly need some porting..
> > I'll try to have a look on it tonight and report...
>
> I don't think any patch is needed, I thought we did that by default now,
> but I can't seem to find the code where it happens...
>
> odd.

linux/jiffies.h:

/*
 * Have the 32 bit jiffies value wrap 5 minutes after boot
 * so jiffies wrap bugs show up earlier.
 */
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))


and kernel/timer.c:

u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
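
In other words (a quick userspace illustration of the same arithmetic, not
kernel code, assuming HZ=250 as in the configs discussed in this thread):

#include <stdio.h>

#define HZ 250
/* same expression as in linux/jiffies.h */
#define INITIAL_JIFFIES ((unsigned long)(unsigned int)(-300 * HZ))

int main(void)
{
	unsigned int low32 = (unsigned int)INITIAL_JIFFIES;
	unsigned long long to_wrap = 0x100000000ULL - low32;

	printf("low 32 bits start at %u, wrap after %llu ticks = %llu s\n",
	       low32, to_wrap, to_wrap / HZ);
	return 0;
}

So the low 32 bits of jiffies start 300*HZ ticks short of the wrap and roll
over 5 minutes after boot; jiffies_64 itself of course keeps counting past
that.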


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

2011-04-30 16:50:25

by Willy Tarreau

Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

On Sat, Apr 30, 2011 at 09:08:05AM -0700, Randy Dunlap wrote:
> On Sat, 30 Apr 2011 08:57:07 -0700 Greg KH wrote:
>
> > On Sat, Apr 30, 2011 at 02:02:05PM +0200, Nikola Ciprich wrote:
> > > > There was a kernel parameter in the past that was used to make jiffies wrap
> > > > a few minutes after boot, maybe we should revive it to try to reproduce
> > > > without waiting 7 new months :-/
> > > hmm, I've googled this up, and it seems to have been 2.5.x patch, so it will
> > > certainly need some porting..
> > > I'll try to have a look on it tonight and report...
> >
> > I don't think any patch is needed, I thought we did that by default now,
> > but I can't seem to find the code where it happens...
> >
> > odd.
>
> linux/jiffies.h:
>
> /*
> * Have the 32 bit jiffies value wrap 5 minutes after boot
> * so jiffies wrap bugs show up earlier.
> */
> #define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
>
>
> and kernel/timer.c:
>
> u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;

Thanks Randy.

So that would mean that wrapping jiffies should be unrelated to the
reported panics. Let's wait for Hidetoshi-san's analysis then.

Regards,
Willy

2011-04-30 17:57:18

by Faidon Liambotis

Subject: Re: 2.6.32.21 - uptime related crashes?

If I may add some, hopefully useful, information to the thread:

At work we have an HP c7000 blade enclosure. It's populated with 8 ProLiant
BL460c G1 (Xeon E5405, constant_tsc but not nonstop_tsc) and 4 ProLiant BL460c
G6 (Xeon E5504, constant_tsc + nonstop_tsc). All were booted on the same day
with Debian's 2.6.32-23~bpo50+1 kernel, i.e. upstream 2.6.32.21.

We too experienced problems with just the G6 blades at nearly 215 days of
uptime (on the 19th of April), all at the same time. From our
investigation, it seems that their cpu_clocks suddenly jumped far into the
future and then almost immediately rolled over, having wrapped around
64 bits.

Although all of their (G6s) clocks wrapped around *at the same time*, only one
of them actually crashed at the time, with a second one crashing just a few
days later, on the 28th.

Three of them had the following on their logs:
Apr 18 20:56:07 hn-05 kernel: [17966378.581971] tap0: no IPv6 routers present
Apr 19 10:15:42 hn-05 kernel: [18446743935.365550] BUG: soft lockup - CPU#4 stuck for 17163091968s! [kvm:25913]
[...]
Apr 19 10:15:42 hn-05 kernel: [18446743935.447275] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Apr 19 10:18:32 hn-05 kernel: [ 31.587025] bond0.13: received packet with own address as source address
(full oops at the end of this mail)

Note that the 17163091968s time stuck was *the same* (+/- 1s) in *all
three of them*.
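
As a side note, the huge printk timestamps are themselves consistent with a
64-bit nanosecond clock sitting just below its wrap point (a quick userspace
check, assuming the printk timestamp is simply that clock printed as seconds):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	double wrap_s  = (double)UINT64_MAX / 1e9;	/* 2^64 ns in seconds */
	double stamp_s = 18446743935.365550;		/* from the oops above */

	printf("a 64-bit ns clock wraps at %.3f s\n", wrap_s);
	printf("the oops timestamp %.3f s is ~%.0f s before that\n",
	       stamp_s, wrap_s - stamp_s);
	return 0;
}

i.e. the [18446743935.365550] stamp sits only about 138 seconds short of
2^64 ns.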

What's also very strange, although we're not sure whether it's related, is
that when we took the first line of the above log entry and subtracted
17966378.581971 from Apr 18 20:56:07, this resulted in a boot time that
differed by several hours from the actual boot time (the latter being
verified both with syslog and /proc/stat's btime, which were in
agreement). This was verified post-mortem too, with the date checked to
be correct.

IOW, we have some serious clock drift (calculated at runtime to be
~0.1 s/min) on these machines that probably hasn't become apparent
since they all run NTP. Moreover, we also saw that drift on other
machines (1U, different vendor, different data center, E5520 CPUs) but
not on the G1 blades. Our investigation showed that the drift is there
constantly (it's not a one-time event), but we're not really sure if it's
related to the Apr 18th jump event. Note that the drift is there even
if we boot with "clocksource=hpet" but disappears when booted with
"notsc". Also note that we've verified with 2.6.38 and the drift is
still there.

Regards,
Faidon

1: The full backtrace is:
Apr 18 20:56:07 hn-05 kernel: [17966378.581971] tap0: no IPv6 routers present
Apr 19 06:25:02 hn-05 kernel: imklog 3.18.6, log source = /proc/kmsg started.
Apr 19 10:15:42 hn-05 kernel: [18446743935.365550] BUG: soft lockup - CPU#4 stuck for 17163091968s! [kvm:25913]
Apr 19 10:15:42 hn-05 kernel: [18446743935.447056] Modules linked in: xt_pkttype ext2 tun kvm_intel kvm nf_conntrack_ipv6 ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_
defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables 8021q garp bridge stp bonding dm_round_robin dm_multipath scsi_dh ipmi_poweroff ipmi_devintf loop snd_pcm snd_ti
mer snd soundcore bnx2x psmouse snd_page_alloc serio_raw sd_mod crc_t10dif hpilo ipmi_si ipmi_msghandler container evdev crc32c pcspkr shpchp pci_hotplug libcrc32c power_meter mdio
button processor ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod usbhid hid cciss ehci_hcd uhci_hcd qla2xxx scsi_transport_fc usbcore nls_base tg3 libphy scsi_
tgt scsi_mod thermal fan thermal_sys [last unloaded: scsi_wait_scan]
Apr 19 10:15:42 hn-05 kernel: [18446743935.447111] CPU 4:
Apr 19 10:15:42 hn-05 kernel: [18446743935.447112] Modules linked in: xt_pkttype ext2 tun kvm_intel kvm nf_conntrack_ipv6 ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_
defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables 8021q garp bridge stp bonding dm_round_robin dm_multipath scsi_dh ipmi_poweroff ipmi_devintf loop snd_pcm snd_timer snd soundcore bnx2x psmouse snd_page_alloc serio_raw sd_mod crc_t10dif hpilo ipmi_si ipmi_msghandler container evdev crc32c pcspkr shpchp pci_hotplug libcrc32c power_meter mdio button processor ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod usbhid hid cciss ehci_hcd uhci_hcd qla2xxx scsi_transport_fc usbcore nls_base tg3 libphy scsi_tgt scsi_mod thermal fan thermal_sys [last unloaded: scsi_wait_scan]
Apr 19 10:15:42 hn-05 kernel: [18446743935.447154] Pid: 25913, comm: kvm Not tainted 2.6.32-bpo.5-amd64 #1 ProLiant BL460c G6
Apr 19 10:15:42 hn-05 kernel: [18446743935.447157] RIP: 0010:[<ffffffffa02e209a>] [<ffffffffa02e209a>] kvm_arch_vcpu_ioctl_run+0x785/0xa44 [kvm]
Apr 19 10:15:42 hn-05 kernel: [18446743935.447177] RSP: 0018:ffff88070065fd38 EFLAGS: 00000202
Apr 19 10:15:42 hn-05 kernel: [18446743935.447179] RAX: ffff88070065ffd8 RBX: ffff8804154ba860 RCX: ffff8804154ba860
Apr 19 10:15:42 hn-05 kernel: [18446743935.447182] RDX: ffff8804154ba8b9 RSI: ffff81004c818b10 RDI: ffff8100291a7d48
Apr 19 10:15:42 hn-05 kernel: [18446743935.447184] RBP: ffffffff8101166e R08: 0000000000000000 R09: 0000000000000000
Apr 19 10:15:42 hn-05 kernel: [18446743935.447187] R10: 00007f1b2a2af078 R11: ffffffff802f3405 R12: 0000000000000001
Apr 19 10:15:42 hn-05 kernel: [18446743935.447189] R13: 00000000154ba8b8 R14: ffff8804136ac000 R15: ffff8804154bbd48
Apr 19 10:15:42 hn-05 kernel: [18446743935.447192] FS: 0000000042cbb950(0000) GS:ffff88042e440000(0000) knlGS:0000000000000000
Apr 19 10:15:42 hn-05 kernel: [18446743935.447195] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
Apr 19 10:15:42 hn-05 kernel: [18446743935.447197] CR2: 0000000000000008 CR3: 0000000260c41000 CR4: 00000000000026e0
Apr 19 10:15:42 hn-05 kernel: [18446743935.447200] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 19 10:15:42 hn-05 kernel: [18446743935.447202] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 19 10:15:42 hn-05 kernel: [18446743935.447205] Call Trace:
Apr 19 10:15:42 hn-05 kernel: [18446743935.447215] [<ffffffffa02e2004>] ? kvm_arch_vcpu_ioctl_run+0x6ef/0xa44 [kvm]
Apr 19 10:15:42 hn-05 kernel: [18446743935.447224] [<ffffffff8100f79c>] ? __switch_to+0x285/0x297
Apr 19 10:15:42 hn-05 kernel: [18446743935.447231] [<ffffffffa02d49d1>] ? kvm_vcpu_ioctl+0xf1/0x4e6 [kvm]
Apr 19 10:15:42 hn-05 kernel: [18446743935.447237] [<ffffffff810240da>] ? lapic_next_event+0x18/0x1d
Apr 19 10:15:42 hn-05 kernel: [18446743935.447245] [<ffffffff8106fb77>] ? tick_dev_program_event+0x2d/0x95
Apr 19 10:15:42 hn-05 kernel: [18446743935.447251] [<ffffffff81047e29>] ? finish_task_switch+0x3a/0xaf
Apr 19 10:15:42 hn-05 kernel: [18446743935.447258] [<ffffffff810f9f5a>] ? vfs_ioctl+0x21/0x6c
Apr 19 10:15:42 hn-05 kernel: [18446743935.447261] [<ffffffff810fa4a8>] ? do_vfs_ioctl+0x48d/0x4cb
Apr 19 10:15:42 hn-05 kernel: [18446743935.447268] [<ffffffff81063fcd>] ? sys_timer_settime+0x233/0x283
Apr 19 10:15:42 hn-05 kernel: [18446743935.447272] [<ffffffff810fa537>] ? sys_ioctl+0x51/0x70
Apr 19 10:15:42 hn-05 kernel: [18446743935.447275] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Apr 19 10:18:32 hn-05 kernel: [ 31.587025] bond0.13: received packet with own address as source address

Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

On Sat, 30 Apr 2011, Willy Tarreau wrote:
> So that would mean that wrapping jiffies should be unrelated to the
> reported panics. Let's wait for Hidetoshi-san's analysis then.

Err, it actually could be a problem when it wraps twice, OR it could be
related to conditions that didn't happen yet right after boot.

But that's less likely, of course.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2011-04-30 20:16:05

by Willy Tarreau

Subject: Re: 2.6.32.21 - uptime related crashes?

Hi Faidon,

On Sat, Apr 30, 2011 at 08:39:05PM +0300, Faidon Liambotis wrote:
> If I may add some, hopefully useful, information to the thread:
>
> At work we have an HP c7000 blade enclosure. It's populated with 8 ProLiant
> BL460c G1 (Xeon E5405, constant_tsc but not nonstop_tsc) and 4 ProLiant
> BL460c
> G6 (Xeon E5504, constant_tsc + nonstop_tsc). All were booted at the same day
> with Debian's 2.6.32-23~bpo50+1 kernel, i.e. upstream 2.6.32.21.
>
> We too experienced problems with just the G6 blades at near 215 days
> uptime (on the 19th of April), all at the same time. From our
> investigation, it seems that their cpu_clocks jumped suddenly far in the
> future and then almost immediately rolled over due to wrapping around
> 64-bits.
>
> Although all of their (G6s) clocks wrapped around *at the same time*, only
> one
> of them actually crashed at the time, with a second one crashing just a few
> days later, on the 28th.
>
> Three of them had the following on their logs:
> Apr 18 20:56:07 hn-05 kernel: [17966378.581971] tap0: no IPv6 routers
> present
> Apr 19 10:15:42 hn-05 kernel: [18446743935.365550] BUG: soft lockup - CPU#4
> stuck for 17163091968s! [kvm:25913]
> [...]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447275] [<ffffffff81010b42>] ?
> system_call_fastpath+0x16/0x1b
> Apr 19 10:18:32 hn-05 kernel: [ 31.587025] bond0.13: received packet with
> own address as source address
> (full oops at the end of this mail)
>
> Note that the 17163091968s time stuck was *the same* (+/- 1s) in *all
> three of them*.
>
> What's also very strange, although we're not very sure if related, is
> that when we took the first line of the above log entry and substracted
> 17966378.581971 from Apr 18 20:56:07, this resulted in a boot-time that
> differed several hours from the actual boot time (the latter being
> verified both with syslog and /proc/bstat's btime, which were both in
> agreement). This was verified post-mortem too, with the date checked to
> be correct.

Well, your report is by far the most complete of the three, since you managed
to capture this trace of the wrapping time.

I don't know if it means anything to anyone, but 17163091968 is exactly
0x3FF000000, or 1023*2^24. Too round to be a coincidence. This number
of seconds corresponds to 1000 * 2^32 jiffies at 250 Hz.
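
(Checking that identity, and how it relates to jiffies at 250 Hz, with
trivial userspace arithmetic:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t secs = 17163091968ULL;

	printf("%llu = 0x%llX, 1023 * 2^24 = %llu\n",
	       (unsigned long long)secs, (unsigned long long)secs,
	       (unsigned long long)(1023ULL << 24));
	/* at 250 Hz this is almost exactly 1000 full 32-bit jiffies wraps */
	printf("%llu s * 250 Hz = %llu jiffies, 1000 * 2^32 = %llu\n",
	       (unsigned long long)secs, (unsigned long long)(secs * 250),
	       (unsigned long long)(1000ULL << 32));
	return 0;
}

The jiffies figure works out to exactly 1000 * (2^32 - 2^22), i.e. just shy
of 1000 full 32-bit wraps, so the approximation above is very close.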

> IOW, we have some serious clock drift (calcuated it at runtime to
> ~0.1s/min) on these machines that hasn't been made apparent probably
> since they all run NTP. Moreover, we also saw that drift on other
> machines (1U, different vendor, different data center, E5520 CPUs) but
> not with the G1 blades. Our investigation showed that the drift is there
> constantly (it's not a one-time event) but we're not really sure if it's
> related with the Apr 18th jump event. Note that the drift is there even
> if we boot with "clocksource=hpet" but disappears when booted with
> "notsc". Also note that we've verified with 2.6.38 and the drift is
> still there.
>
> Regards,
> Faidon
>
> 1: The full backtrace is:
> Apr 18 20:56:07 hn-05 kernel: [17966378.581971] tap0: no IPv6 routers
> present
> Apr 19 06:25:02 hn-05 kernel: imklog 3.18.6, log source = /proc/kmsg
> started.
> Apr 19 10:15:42 hn-05 kernel: [18446743935.365550] BUG: soft lockup - CPU#4
> stuck for 17163091968s! [kvm:25913]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447056] Modules linked in:
> xt_pkttype ext2 tun kvm_intel kvm nf_conntrack_ipv6 ip6table_filter
> ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_
> defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables 8021q
> garp bridge stp bonding dm_round_robin dm_multipath scsi_dh ipmi_poweroff
> ipmi_devintf loop snd_pcm snd_ti
> mer snd soundcore bnx2x psmouse snd_page_alloc serio_raw sd_mod crc_t10dif
> hpilo ipmi_si ipmi_msghandler container evdev crc32c pcspkr shpchp
> pci_hotplug libcrc32c power_meter mdio
> button processor ext3 jbd mbcache dm_mirror dm_region_hash dm_log
> dm_snapshot dm_mod usbhid hid cciss ehci_hcd uhci_hcd qla2xxx
> scsi_transport_fc usbcore nls_base tg3 libphy scsi_
> tgt scsi_mod thermal fan thermal_sys [last unloaded: scsi_wait_scan]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447111] CPU 4:
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447112] Modules linked in:
> xt_pkttype ext2 tun kvm_intel kvm nf_conntrack_ipv6 ip6table_filter
> ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_
> defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables 8021q
> garp bridge stp bonding dm_round_robin dm_multipath scsi_dh ipmi_poweroff
> ipmi_devintf loop snd_pcm snd_timer snd soundcore bnx2x psmouse
> snd_page_alloc serio_raw sd_mod crc_t10dif hpilo ipmi_si ipmi_msghandler
> container evdev crc32c pcspkr shpchp pci_hotplug libcrc32c power_meter mdio
> button processor ext3 jbd mbcache dm_mirror dm_region_hash dm_log
> dm_snapshot dm_mod usbhid hid cciss ehci_hcd uhci_hcd qla2xxx
> scsi_transport_fc usbcore nls_base tg3 libphy scsi_tgt scsi_mod thermal fan
> thermal_sys [last unloaded: scsi_wait_scan]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447154] Pid: 25913, comm: kvm
> Not tainted 2.6.32-bpo.5-amd64 #1 ProLiant BL460c G6
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447157] RIP:
> 0010:[<ffffffffa02e209a>] [<ffffffffa02e209a>]
> kvm_arch_vcpu_ioctl_run+0x785/0xa44 [kvm]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447177] RSP:
> 0018:ffff88070065fd38 EFLAGS: 00000202
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447179] RAX: ffff88070065ffd8
> RBX: ffff8804154ba860 RCX: ffff8804154ba860
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447182] RDX: ffff8804154ba8b9
> RSI: ffff81004c818b10 RDI: ffff8100291a7d48
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447184] RBP: ffffffff8101166e
> R08: 0000000000000000 R09: 0000000000000000
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447187] R10: 00007f1b2a2af078
> R11: ffffffff802f3405 R12: 0000000000000001
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447189] R13: 00000000154ba8b8
> R14: ffff8804136ac000 R15: ffff8804154bbd48
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447192] FS:
> 0000000042cbb950(0000) GS:ffff88042e440000(0000) knlGS:0000000000000000
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447195] CS: 0010 DS: 002b ES:
> 002b CR0: 0000000080050033
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447197] CR2: 0000000000000008
> CR3: 0000000260c41000 CR4: 00000000000026e0
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447200] DR0: 0000000000000000
> DR1: 0000000000000000 DR2: 0000000000000000
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447202] DR3: 0000000000000000
> DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447205] Call Trace:
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447215] [<ffffffffa02e2004>] ?
> kvm_arch_vcpu_ioctl_run+0x6ef/0xa44 [kvm]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447224] [<ffffffff8100f79c>] ?
> __switch_to+0x285/0x297
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447231] [<ffffffffa02d49d1>] ?
> kvm_vcpu_ioctl+0xf1/0x4e6 [kvm]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447237] [<ffffffff810240da>] ?
> lapic_next_event+0x18/0x1d
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447245] [<ffffffff8106fb77>] ?
> tick_dev_program_event+0x2d/0x95
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447251] [<ffffffff81047e29>] ?
> finish_task_switch+0x3a/0xaf
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447258] [<ffffffff810f9f5a>] ?
> vfs_ioctl+0x21/0x6c
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447261] [<ffffffff810fa4a8>] ?
> do_vfs_ioctl+0x48d/0x4cb
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447268] [<ffffffff81063fcd>] ?
> sys_timer_settime+0x233/0x283
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447272] [<ffffffff810fa537>] ?
> sys_ioctl+0x51/0x70
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447275] [<ffffffff81010b42>] ?
> system_call_fastpath+0x16/0x1b
> Apr 19 10:18:32 hn-05 kernel: [ 31.587025] bond0.13: received packet with
> own address as source address

Regards,
Willy

2011-05-06 03:15:29

by Hidetoshi Seto

Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

Hi Nikola,

Sorry for not replying sooner.

(2011/04/29 19:02), Nikola Ciprich wrote:
> (another CC added)
>
> Hello Willy!
>
> I made some statistics of our servers regarding kernel version and uptime.
> Here are some my thoughts:
> - I'm 100% sure this problem wasn't present in kernels <= 2.6.30.x (we've got a lot of boxes with uptimes >600days)
> - I'm 90% sure this problem also wasn't present in 2.6.32.16 (we've got 6 boxes running for 235 to 280days)
>
> What I'm not sure is, whether this is present in 2.6.19, I have:
> 2 boxes running 2.6.32.19 for 238days and one 2.6.32.20 for 216days.
> I also have a bunch ov 2.6.32.23 boxes, which are now getting close to 200days uptime.
> But I suspect this really is first problematic version, more on it later.
> First regarding Your question about CONFIG_HZ - we use 250HZ setting, which leads me to following:
> 250 * 60 * 60 * 24 * 199 = 4298400000 which is value a little over 2**32! So maybe some unsingned long variable
> might overflow? Does this make sense?
>
> And to my suspicion about 2.6.32.19, there is one commit which maybe is related:
>
> commit 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
> Author: Hidetoshi Seto <[email protected]>
> Date: Wed Dec 2 17:28:07 2009 +0900
>
> sched, cputime: Introduce thread_group_times()
>
> This is a real fix for problem of utime/stime values decreasing
> described in the thread:
>
> http://lkml.org/lkml/2009/11/3/522
>
> Now cputime is accounted in the following way:
>
> - {u,s}time in task_struct are increased every time when the thread
> is interrupted by a tick (timer interrupt).
>
> - When a thread exits, its {u,s}time are added to signal->{u,s}time,
> after adjusted by task_times().
>
> - When all threads in a thread_group exits, accumulated {u,s}time
> (and also c{u,s}time) in signal struct are added to c{u,s}time
> in signal struct of the group's parent.
> .
> .
> .
>
> I haven't studied this into detail yet, but it seems to me it might really be related. Hidetoshi-san - do You have some opinion about this?
> Could this somehow either create or invoke the problem with overflow of some variable which would lead to division by zero or similar problems?

No.

The commit you pointed to is a change to the runtimes (cputime_t) accounted for
threads, not to uptime/jiffies/ticks. And I don't see how an overflow or
zero-divide could happen there:

    if (total) {
        :
        do_div(temp, total);
        :
    }
    :
    p->prev_utime = max(p->prev_utime, utime);

>
> Any other thoughts?
>
> best regards
>
> nik

From a glance at the diff v2.6.32.16..v2.6.32.23, tick_nohz_* could be
another suspect. Hmm...


Thanks,
H.Seto

2011-05-13 22:08:39

by Nicolas Carlier

Subject: Re: [stable] 2.6.32.21 - uptime related crashes?

Hi Willy,

On Thu, Apr 28, 2011 at 8:34 PM, Willy Tarreau <[email protected]> wrote:
> Hello Nikola,
>
> On Thu, Apr 28, 2011 at 10:26:25AM +0200, Nikola Ciprich wrote:
>> Hello everybody,
>>
>> I'm trying to solve strange issue, today, my fourth machine running 2.6.32.21 just crashed. What makes the cases similar, apart fromn same kernel version is that all boxes had very similar uptimes: 214, 216, 216, and 224 days. This might just be a coincidence, but I think this might be important.
>
> Interestingly, one of our customers just had two machines who crashed
> yesterday after 212 days and 212+20h respectively. They were running
> debian's 2.6.32-bpo.5-amd64 which is based on 2.6.32.23 AIUI.
>
> The crash looks very similar to the following bug which we have updated :
>
> https://bugzilla.kernel.org/show_bug.cgi?id=16991
>
> (bugzilla doesn't appear to respond as I'm posting this mail).
>
> The top of your ouput is missing. In our case as in the reports on the bug
> above, there was a divide by zero error. Did you happen to spot this one
> too, or do you just not know ? I observe "divide_error+0x15/0x20" in one
> of your reports, so it's possible that it matches the same pattern at least
> for one trace. Just in case, it would be nice to feed the bugzilla entry
> above.
>
>> Unfortunately I only have backtraces of two crashes (and those are trimmed, sorry), and they do not look as similar as I'd like, but still maybe there is something in common:
>>
>> [<ffffffff81120cc7>] pollwake+0x57/0x60
>> [<ffffffff81046720>] ? default_wake_function+0x0/0x10
>> [<ffffffff8103683a>] __wake_up_common+0x5a/0x90
>> [<ffffffff8103a313>] __wake_up+0x43/0x70
>> [<ffffffffa0321573>] process_masterspan+0x643/0x670 [dahdi]
>> [<ffffffffa0326595>] coretimer_func+0x135/0x1d0 [dahdi]
>> [<ffffffff8105d74d>] run_timer_softirq+0x15d/0x320
>> [<ffffffffa0326460>] ? coretimer_func+0x0/0x1d0 [dahdi]
>> [<ffffffff8105690c>] __do_softirq+0xcc/0x220
>> [<ffffffff8100c40c>] call_softirq+0x1c/0x30
>> [<ffffffff8100e3ba>] do_softirq+0x4a/0x80
>> [<ffffffff810567c7>] irq_exit+0x87/0x90
>> [<ffffffff8100d7b7>] do_IRQ+0x77/0xf0
>> [<ffffffff8100bc53>] ret_from_intr+0x0/0xa
>> <EOI> [<ffffffffa019e556>] ? acpi_idle_enter_bm+0x273/0x2a1 [processor]
>> [<ffffffffa019e54c>] ? acpi_idle_enter_bm+0x269/0x2a1 [processor]
>> [<ffffffff81280095>] ? cpuidle_idle_call+0xa5/0x150
>> [<ffffffff8100a18f>] ? cpu_idle+0x4f/0x90
>> [<ffffffff81323c95>] ? rest_init+0x75/0x80
>> [<ffffffff81582d7f>] ? start_kernel+0x2ef/0x390
>> [<ffffffff81582271>] ? x86_64_start_reservations+0x81/0xc0
>> [<ffffffff81582386>] ? x86_64_start_kernel+0xd6/0x100
>>
>> this box (actually two of the crashed ones) is using dahdi_dummy module to generate timing for asterisk SW pbx, so maybe it's related to it.
>>
>>
>> [<ffffffff810a5063>] handle_IRQ_event+0x63/0x1c0
>> [<ffffffff810a71ae>] handle_edge_irq+0xce/0x160
>> [<ffffffff8100e1bf>] handle_irq+0x1f/0x30
>> [<ffffffff8100d7ae>] do_IRQ+0x6e/0xf0
>> [<ffffffff8100bc53>] ret_from_intr+0x0/0xa
>> <EOI> [<ffffffff81337f7f>] ? _spin_unlock_irq+0xf/0x40
>> [<ffffffff81337f79>] ? _spin_unlock_irq+0x9/0x40
>> [<ffffffff81064b9a>] ? exit_signals+0x8a/0x130
>> [<ffffffff8105372e>] ? do_exit+0x7e/0x7d0
>> [<ffffffff8100f8a7>] ? oops_end+0xa7/0xb0
>> [<ffffffff8100faa6>] ? die+0x56/0x90
>> [<ffffffff8100c810>] ? do_trap+0x130/0x150
>> [<ffffffff8100ccca>] ? do_divide_error+0x8a/0xa0
>> [<ffffffff8103d227>] ? find_busiest_group+0x3d7/0xa00
>> [<ffffffff8104400b>] ? cpuacct_charge+0x6b/0x90
>> [<ffffffff8100c045>] ? divide_error+0x15/0x20
>> [<ffffffff8103d227>] ? find_busiest_group+0x3d7/0xa00
>> [<ffffffff8103cfff>] ? find_busiest_group+0x1af/0xa00
>> [<ffffffff81335483>] ? thread_return+0x4ce/0x7bb
>> [<ffffffff8133bec5>] ? do_nanosleep+0x75/0x30
>> [<ffffffff810?1?4e>] ? hrtimer_nanosleep+0x9e/0x120
>> [<ffffffff810?08f0>] ? hrtimer_wakeup+0x0/0x30
>> [<ffffffff810?183f>] ? sys_nanosleep+0x6f/0x80
>>
>> another two don't use it. only similarity I see here is that it seems to be IRQ handling related, but both issues don't have anything in common.
>> Does anybody have an idea on where should I look? Of course I should update all those boxes to (at least) latest 2.6.32.x, and I'll do it for sure, but still I'd first like to know where the problem was, and if it has been fixed, or how to fix it...
>> I'd be gratefull for any help...
>
> There were quite a bunch of scheduler updates recently. We may be lucky and
> hope for the bug to have vanished with the changes, but we may as well see
> the same crash in 7 months :-/
>
> My coworker Hervé (CC'd) who worked on the issue suggests that we might have
> something which goes wrong past a certain uptime (eg: 212 days), which needs
> a special event to be triggered (I/O, process exiting, etc...). I think this
> makes quite some sense.
>
> Could you check your CONFIG_HZ so that we could convert those uptimes to
> jiffies ? Maybe this will ring a bell in someone's head :-/
>

We have encountered the same issue on many nodes of our cluster, which ran
a 2.6.32.8 Debian kernel. All the servers which crashed had
almost the same uptime, more than 200 days. But those which didn't
crash had the same uptime as well.

Each time, we had the "divide by zero" in "find_busiest_group".

One explanation could be the difference in the number of tasks since boot.

As the servers fell one by one, and as we were not able to reproduce
the problem quickly, we used the patch provided by Andrew
Dickinson.

Regards,

--
Nicolas Carlier

2011-05-14 19:06:48

by Nikola Ciprich

Subject: Re: 2.6.32.21 - uptime related crashes?

Hello gentlemen,
Nicolas, thanks for the further report; it contradicts my theory that the problem appeared somewhere after 2.6.32.16.
Now I think I know why several of my other machines that have been running 2.6.32.x for a long time didn't crash:

I checked the bugzilla entry for (I believe) the same problem here:
https://bugzilla.kernel.org/show_bug.cgi?id=16991
and Peter Zijlstra asked there whether the reporters' systems were running some RT tasks. Then I realised that all four of my crashed boxes were pacemaker/corosync clusters, and pacemaker uses lots of RT-priority tasks. So I believe this is important, and it might be the reason why other machines seem to be running rock solid: they are not running any RT tasks.
It might also help with hunting this bug. Is anybody else also running RT-priority tasks on the afflicted systems, or did the problem also occur without them?
Cheers!
n.

PS: Hidetoshi-san, by the way, (late) thanks for your reply and the confirmation that your patch should not be the cause of this problem. I'm now sure the problem must have emerged earlier, and I'm sorry for the false accusation :)

On Sat, Apr 30, 2011 at 10:14:36PM +0200, Willy Tarreau wrote:
> Hi Faidon,
>
> On Sat, Apr 30, 2011 at 08:39:05PM +0300, Faidon Liambotis wrote:
> > If I may add some, hopefully useful, information to the thread:
> >
> > At work we have an HP c7000 blade enclosure. It's populated with 8 ProLiant
> > BL460c G1 (Xeon E5405, constant_tsc but not nonstop_tsc) and 4 ProLiant
> > BL460c
> > G6 (Xeon E5504, constant_tsc + nonstop_tsc). All were booted at the same day
> > with Debian's 2.6.32-23~bpo50+1 kernel, i.e. upstream 2.6.32.21.
> >
> > We too experienced problems with just the G6 blades at near 215 days
> > uptime (on the 19th of April), all at the same time. From our
> > investigation, it seems that their cpu_clocks jumped suddenly far in the
> > future and then almost immediately rolled over due to wrapping around
> > 64-bits.
> >
> > Although all of their (G6s) clocks wrapped around *at the same time*, only
> > one
> > of them actually crashed at the time, with a second one crashing just a few
> > days later, on the 28th.
> >
> > Three of them had the following on their logs:
> > Apr 18 20:56:07 hn-05 kernel: [17966378.581971] tap0: no IPv6 routers
> > present
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.365550] BUG: soft lockup - CPU#4
> > stuck for 17163091968s! [kvm:25913]
> > [...]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447275] [<ffffffff81010b42>] ?
> > system_call_fastpath+0x16/0x1b
> > Apr 19 10:18:32 hn-05 kernel: [ 31.587025] bond0.13: received packet with
> > own address as source address
> > (full oops at the end of this mail)
> >
> > Note that the 17163091968s time stuck was *the same* (+/- 1s) in *all
> > three of them*.
> >
> > What's also very strange, although we're not very sure if related, is
> > that when we took the first line of the above log entry and substracted
> > 17966378.581971 from Apr 18 20:56:07, this resulted in a boot-time that
> > differed several hours from the actual boot time (the latter being
> > verified both with syslog and /proc/bstat's btime, which were both in
> > agreement). This was verified post-mortem too, with the date checked to
> > be correct.
>
> Well, your report is by far the most complete of our 3 since you managed
> to get this trace of wraping time.
>
> I don't know if it means anything to anyone, but 17163091968 is exactly
> 0x3FF000000, or 1023*2^24. Too round to be a coincidence. This number
> of seconds corresponds to 1000 * 2^32 jiffies at 250 Hz.
>
> > IOW, we have some serious clock drift (calcuated it at runtime to
> > ~0.1s/min) on these machines that hasn't been made apparent probably
> > since they all run NTP. Moreover, we also saw that drift on other
> > machines (1U, different vendor, different data center, E5520 CPUs) but
> > not with the G1 blades. Our investigation showed that the drift is there
> > constantly (it's not a one-time event) but we're not really sure if it's
> > related with the Apr 18th jump event. Note that the drift is there even
> > if we boot with "clocksource=hpet" but disappears when booted with
> > "notsc". Also note that we've verified with 2.6.38 and the drift is
> > still there.
> >
> > Regards,
> > Faidon
> >
> > 1: The full backtrace is:
> > Apr 18 20:56:07 hn-05 kernel: [17966378.581971] tap0: no IPv6 routers
> > present
> > Apr 19 06:25:02 hn-05 kernel: imklog 3.18.6, log source = /proc/kmsg
> > started.
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.365550] BUG: soft lockup - CPU#4
> > stuck for 17163091968s! [kvm:25913]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447056] Modules linked in:
> > xt_pkttype ext2 tun kvm_intel kvm nf_conntrack_ipv6 ip6table_filter
> > ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_
> > defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables 8021q
> > garp bridge stp bonding dm_round_robin dm_multipath scsi_dh ipmi_poweroff
> > ipmi_devintf loop snd_pcm snd_ti
> > mer snd soundcore bnx2x psmouse snd_page_alloc serio_raw sd_mod crc_t10dif
> > hpilo ipmi_si ipmi_msghandler container evdev crc32c pcspkr shpchp
> > pci_hotplug libcrc32c power_meter mdio
> > button processor ext3 jbd mbcache dm_mirror dm_region_hash dm_log
> > dm_snapshot dm_mod usbhid hid cciss ehci_hcd uhci_hcd qla2xxx
> > scsi_transport_fc usbcore nls_base tg3 libphy scsi_
> > tgt scsi_mod thermal fan thermal_sys [last unloaded: scsi_wait_scan]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447111] CPU 4:
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447112] Modules linked in:
> > xt_pkttype ext2 tun kvm_intel kvm nf_conntrack_ipv6 ip6table_filter
> > ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_
> > defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables 8021q
> > garp bridge stp bonding dm_round_robin dm_multipath scsi_dh ipmi_poweroff
> > ipmi_devintf loop snd_pcm snd_timer snd soundcore bnx2x psmouse
> > snd_page_alloc serio_raw sd_mod crc_t10dif hpilo ipmi_si ipmi_msghandler
> > container evdev crc32c pcspkr shpchp pci_hotplug libcrc32c power_meter mdio
> > button processor ext3 jbd mbcache dm_mirror dm_region_hash dm_log
> > dm_snapshot dm_mod usbhid hid cciss ehci_hcd uhci_hcd qla2xxx
> > scsi_transport_fc usbcore nls_base tg3 libphy scsi_tgt scsi_mod thermal fan
> > thermal_sys [last unloaded: scsi_wait_scan]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447154] Pid: 25913, comm: kvm
> > Not tainted 2.6.32-bpo.5-amd64 #1 ProLiant BL460c G6
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447157] RIP:
> > 0010:[<ffffffffa02e209a>] [<ffffffffa02e209a>]
> > kvm_arch_vcpu_ioctl_run+0x785/0xa44 [kvm]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447177] RSP:
> > 0018:ffff88070065fd38 EFLAGS: 00000202
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447179] RAX: ffff88070065ffd8
> > RBX: ffff8804154ba860 RCX: ffff8804154ba860
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447182] RDX: ffff8804154ba8b9
> > RSI: ffff81004c818b10 RDI: ffff8100291a7d48
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447184] RBP: ffffffff8101166e
> > R08: 0000000000000000 R09: 0000000000000000
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447187] R10: 00007f1b2a2af078
> > R11: ffffffff802f3405 R12: 0000000000000001
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447189] R13: 00000000154ba8b8
> > R14: ffff8804136ac000 R15: ffff8804154bbd48
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447192] FS:
> > 0000000042cbb950(0000) GS:ffff88042e440000(0000) knlGS:0000000000000000
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447195] CS: 0010 DS: 002b ES:
> > 002b CR0: 0000000080050033
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447197] CR2: 0000000000000008
> > CR3: 0000000260c41000 CR4: 00000000000026e0
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447200] DR0: 0000000000000000
> > DR1: 0000000000000000 DR2: 0000000000000000
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447202] DR3: 0000000000000000
> > DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447205] Call Trace:
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447215] [<ffffffffa02e2004>] ?
> > kvm_arch_vcpu_ioctl_run+0x6ef/0xa44 [kvm]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447224] [<ffffffff8100f79c>] ?
> > __switch_to+0x285/0x297
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447231] [<ffffffffa02d49d1>] ?
> > kvm_vcpu_ioctl+0xf1/0x4e6 [kvm]
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447237] [<ffffffff810240da>] ?
> > lapic_next_event+0x18/0x1d
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447245] [<ffffffff8106fb77>] ?
> > tick_dev_program_event+0x2d/0x95
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447251] [<ffffffff81047e29>] ?
> > finish_task_switch+0x3a/0xaf
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447258] [<ffffffff810f9f5a>] ?
> > vfs_ioctl+0x21/0x6c
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447261] [<ffffffff810fa4a8>] ?
> > do_vfs_ioctl+0x48d/0x4cb
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447268] [<ffffffff81063fcd>] ?
> > sys_timer_settime+0x233/0x283
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447272] [<ffffffff810fa537>] ?
> > sys_ioctl+0x51/0x70
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.447275] [<ffffffff81010b42>] ?
> > system_call_fastpath+0x16/0x1b
> > Apr 19 10:18:32 hn-05 kernel: [ 31.587025] bond0.13: received packet with
> > own address as source address
>
> Regards,
> Willy
>

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.: +420 596 603 142
fax: +420 596 621 273
mobil: +420 777 093 799

http://www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: [email protected]
-------------------------------------



2011-05-14 20:46:52

by Willy Tarreau

[permalink] [raw]
Subject: Re: 2.6.32.21 - uptime related crashes?

Hi,

On Sat, May 14, 2011 at 09:04:23PM +0200, Nikola Ciprich wrote:
> Hello gentlemen,
> Nicolas, thanks for the further report; it contradicts my theory that the problem appeared somewhere around 2.6.32.16.

Well, I'd like to be sure what kernel we're talking about. Nicolas said
"2.6.32.8 Debian Kernel", but I suspect it's "2.6.32-8something" instead.
Nicolas, could you please report the exact version as indicated by "uname -a" ?

> Now I think I know why several of my other machines running 2.6.32.x for a long time didn't crash:
>
> I checked the bugzilla entry for (I believe) the same problem here:
> https://bugzilla.kernel.org/show_bug.cgi?id=16991
> and Peter Zijlstra asked there whether the reporters' systems were running any RT tasks. Then I realised that all four of my crashed boxes were pacemaker/corosync clusters, and pacemaker uses lots of RT-priority tasks. So I believe this is important, and might be the reason why the other machines seem to be running rock solid - they are not running any RT tasks.
> It also might help with hunting this bug. Is anyone else also running RT-priority tasks on the affected systems, or did the problem also occur without them?

No, our customer who had two of these boxes crash at the same time was
not running any RT task to the best of my knowledge.

Cheers,
Willy

2011-05-14 21:00:20

by Ben Hutchings

[permalink] [raw]
Subject: Re: 2.6.32.21 - uptime related crashes?

On Sat, 2011-05-14 at 22:45 +0200, Willy Tarreau wrote:
> Hi,
>
> On Sat, May 14, 2011 at 09:04:23PM +0200, Nikola Ciprich wrote:
> > Hello gentlemen,
> > Nicolas, thanks for the further report; it contradicts my theory that the problem appeared somewhere around 2.6.32.16.
>
> Well, I'd like to be sure what kernel we're talking about. Nicolas said
> "2.6.32.8 Debian Kernel", but I suspect it's "2.6.32-8something" instead.
> Nicolas, could you please report the exact version as indicated by "uname -a" ?
[...]

Actually you need to use 'cat /proc/version' (or dpkg) to get the full
version.

Ben.

--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.


Attachments:
signature.asc (828.00 B)
This is a digitally signed message part

2011-05-14 23:13:46

by Nicolas Carlier

[permalink] [raw]
Subject: Re: 2.6.32.21 - uptime related crashes?

Hi,

On Sat, May 14, 2011 at 10:45 PM, Willy Tarreau <[email protected]> wrote:
> Hi,
>
> On Sat, May 14, 2011 at 09:04:23PM +0200, Nikola Ciprich wrote:
>> Hello gentlemen,
>> Nicolas, thanks for the further report; it contradicts my theory that the problem appeared somewhere around 2.6.32.16.
>
> Well, I'd like to be sure what kernel we're talking about. Nicolas said
> "2.6.32.8 Debian Kernel", but I suspect it's "2.6.32-8something" instead.
> Nicolas, could you please report the exact version as indicated by "uname -a" ?


Sorry, I can't provide more information on this version because I
don't use it anymore. I can only correct myself: it was not a
2.6.32.8 kernel but a 2.6.32.7 Debian backport kernel, which had been
recompiled.

Because of this problem I took the opportunity to move to a 2.6.32.26
kernel; however, as there was nothing in the changelog or bugzilla
about a resolution of this issue, we also applied the patch from the
bugzilla entry where this problem was reported:

https://bugzilla.kernel.org/show_bug.cgi?id=16991#c17

>
>> Now I think I know why several of my other machines running 2.6.32.x for a long time didn't crash:
>>
>> I checked the bugzilla entry for (I believe) the same problem here:
>> https://bugzilla.kernel.org/show_bug.cgi?id=16991
>> and Peter Zijlstra asked there whether the reporters' systems were running any RT tasks. Then I realised that all four of my crashed boxes were pacemaker/corosync clusters, and pacemaker uses lots of RT-priority tasks. So I believe this is important, and might be the reason why the other machines seem to be running rock solid - they are not running any RT tasks.
>> It also might help with hunting this bug. Is anyone else also running RT-priority tasks on the affected systems, or did the problem also occur without them?
>
> No, our customer who had two of these boxes crash at the same time was
> not running any RT task to the best of my knowledge.
>

Regards,

--
Nicolas Carlier

2011-05-15 22:57:04

by Faidon Liambotis

[permalink] [raw]
Subject: Re: 2.6.32.21 - uptime related crashes?

On Sat, May 14, 2011 at 09:04:23PM +0200, Nikola Ciprich wrote:
> Nicolas, thanks for the further report; it contradicts my theory that
> the problem appeared somewhere around 2.6.32.16. Now I think I know why
> several of my other machines running 2.6.32.x for a long time didn't
> crash:
>
> I checked bugzilla entry for (I believe the same) problem here:
> https://bugzilla.kernel.org/show_bug.cgi?id=16991

I don't think that bug is related; I for one haven't seen any
backtrace that is similar to the above or involves a divide-by-zero.

> and Peter Zijlstra asked there whether the reporters' systems were running
> any RT tasks. Then I realised that all four of my crashed boxes were
> pacemaker/corosync clusters, and pacemaker uses lots of RT-priority
> tasks. So I believe this is important, and might be the reason why the
> other machines seem to be running rock solid - they are not running any RT
> tasks. It also might help with hunting this bug. Is anyone else also
> running RT-priority tasks on the affected systems, or did the problem
> also occur without them?

No, no RT tasks here. The boxes in my case were just running a lot of
kvm processes.

Regards,
Faidon

2011-05-16 07:39:40

by Apollon Oikonomopoulos

[permalink] [raw]
Subject: Re: 2.6.32.21 - uptime related crashes?

Hello all,

On 01:56 Mon 16 May, Faidon Liambotis wrote:
> > and Peter Zijlstra asked there whether the reporters' systems were running
> > any RT tasks. Then I realised that all four of my crashed boxes were
> > pacemaker/corosync clusters, and pacemaker uses lots of RT-priority
> > tasks. So I believe this is important, and might be the reason why the
> > other machines seem to be running rock solid - they are not running any RT
> > tasks. It also might help with hunting this bug. Is anyone else also
> > running RT-priority tasks on the affected systems, or did the problem
> > also occur without them?
>
> No, no RT tasks here. The boxes in my case were just running a lot of
> kvm processes.

Actually, we are running multipathd, which is an RT process and is heavily
loaded on these particular systems.
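
For anyone who wants to check whether an affected box is running RT tasks at
all, here is a minimal sketch that walks /proc and reports SCHED_FIFO/SCHED_RR
processes ("ps -eo pid,cls,rtprio,comm" gives the same information):

    #include <ctype.h>
    #include <dirent.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        DIR *proc = opendir("/proc");
        struct dirent *de;

        if (!proc) {
            perror("/proc");
            return 1;
        }
        while ((de = readdir(proc))) {
            char comm[256] = "?";
            char path[300];
            FILE *f;
            int pid, policy;

            if (!isdigit((unsigned char)de->d_name[0]))
                continue;
            pid = atoi(de->d_name);
            policy = sched_getscheduler(pid);
            if (policy != SCHED_FIFO && policy != SCHED_RR)
                continue;
            /* comm is the second field of /proc/<pid>/stat, in parentheses */
            snprintf(path, sizeof(path), "/proc/%d/stat", pid);
            f = fopen(path, "r");
            if (f) {
                if (fscanf(f, "%*d (%255[^)])", comm) != 1)
                    strcpy(comm, "?");
                fclose(f);
            }
            printf("%6d  %-10s  %s\n", pid,
                   policy == SCHED_FIFO ? "SCHED_FIFO" : "SCHED_RR", comm);
        }
        closedir(proc);
        return 0;
    }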

Regards,
Apollon

2011-06-28 02:27:02

by john stultz

[permalink] [raw]
Subject: Re: 2.6.32.21 - uptime related crashes?

On Sat, Apr 30, 2011 at 10:39 AM, Faidon Liambotis <[email protected]> wrote:
> We too experienced problems with just the G6 blades at nearly 215 days of
> uptime (on the 19th of April), all at the same time. From our investigation, it
> seems that their cpu_clocks jumped suddenly far in the future and then
> almost immediately rolled over due to wrapping around 64-bits.
>
> Although all of their (G6s) clocks wrapped around *at the same time*, only
> one
> of them actually crashed at the time, with a second one crashing just a few
> days later, on the 28th.
>
> Three of them had the following on their logs:
> Apr 18 20:56:07 hn-05 kernel: [17966378.581971] tap0: no IPv6 routers
> present
> Apr 19 10:15:42 hn-05 kernel: [18446743935.365550] BUG: soft lockup - CPU#4
> stuck for 17163091968s! [kvm:25913]

So, did this issue ever get any traction or get resolved?

From the softlockup message, I suspect we hit a multiply overflow in
the underlying sched_clock() implementation.

Because the goal of sched_clock() is to be very fast, lightweight, and
safe from locking issues (so it can be called anywhere), handling
transient corner cases internally has been avoided, as that would
require costly locking and extra overhead. Because of this,
sched_clock() users should take care to be robust in the face of
transient errors.
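
For what it's worth, here is a rough sketch of where such a multiply overflow
would land on these machines. It assumes the 2.6.32-era x86 conversion, which
is roughly ns = (cycles * cyc2ns_scale) >> 10 with
cyc2ns_scale = (1000000 << 10) / cpu_khz; the CPU frequency is a guess
(E5520-class, ~2.27 GHz):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long cpu_khz = 2266000ULL;                   /* assumed ~2.27 GHz */
        unsigned long long scale   = (1000000ULL << 10) / cpu_khz; /* ~451 */
        /* largest TSC value for which cycles * scale still fits in 64 bits */
        unsigned long long max_cycles = ~0ULL / scale;
        double days = (double)max_cycles / (cpu_khz * 1000.0) / 86400.0;

        printf("cyc2ns scale        : %llu\n", scale);
        printf("64-bit product safe : up to %llu cycles\n", max_cycles);
        printf("                    = about %.1f days of uptime\n", days);
        return 0;
    }

With the assumed frequency this works out to roughly 209 days, i.e. right in
the range of the ~215-day uptimes reported in this thread.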

Peter: I wonder if the soft lockup code should be using the
(hopefully) more robust timekeeping code (i.e. get_seconds()) for its
get_timestamp function? I'd worry that you might have trouble catching
cases where the system was locked up such that the timekeeping accounting
code didn't get to run, but you have the same problem in the
jiffies-based sched_clock code as well (since timekeeping increments
jiffies in most cases).

That said, I didn't see from any of the backtraces in this thread why
the system actually crashed. The softlockup message on its own
shouldn't do that, so I suspect there's still a related issue
somewhere else here.

thanks
-john

2011-06-28 05:19:23

by Willy Tarreau

[permalink] [raw]
Subject: Re: 2.6.32.21 - uptime related crashes?

On Mon, Jun 27, 2011 at 07:25:31PM -0700, john stultz wrote:
> On Sat, Apr 30, 2011 at 10:39 AM, Faidon Liambotis <[email protected]> wrote:
> > We too experienced problems with just the G6 blades at nearly 215 days of
> > uptime (on the 19th of April), all at the same time. From our investigation, it
> > seems that their cpu_clocks jumped suddenly far in the future and then
> > almost immediately rolled over due to wrapping around 64-bits.
> >
> > Although all of their (G6s) clocks wrapped around *at the same time*, only
> > one
> > of them actually crashed at the time, with a second one crashing just a few
> > days later, on the 28th.
> >
> > Three of them had the following on their logs:
> > Apr 18 20:56:07 hn-05 kernel: [17966378.581971] tap0: no IPv6 routers
> > present
> > Apr 19 10:15:42 hn-05 kernel: [18446743935.365550] BUG: soft lockup - CPU#4
> > stuck for 17163091968s! [kvm:25913]
>
> So, did this issue ever get any traction or get resolved?

I'm not aware of any news on the subject unfortunately. We asked our customer
to reboot both machines one week apart so that in 6 months they don't crash at
the same time :-/

(...)
> That said, I didn't see from any of the backtraces in this thread why
> the system actually crashed. The softlockup message on its own
> shouldn't do that, so I suspect there's still a related issue
> somewhere else here.

One of the traces clearly showed that the kernel's clock had wrapped
or jumped, because the reported uptime suddenly went forward to something
like 2^32/HZ seconds IIRC.
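
For reference, both figures are easy to check: the printk timestamps in that
log sit just below the 64-bit nanosecond wrap, and 2^32 jiffies at HZ=250 is
about 199 days. A small sketch, assuming the usual CONFIG_PRINTK_TIME
timestamps that come from the scheduler clock:

    #include <stdio.h>

    int main(void)
    {
        unsigned long long wrap_ns = ~0ULL;   /* 2^64 - 1 ns */
        unsigned long long hz = 250;          /* assumed CONFIG_HZ */

        /* the log showed 18446743935.365550 s just before the BUG line */
        printf("64-bit ns counter wraps at ~%.6f s\n", wrap_ns / 1e9);
        printf("2^32 jiffies at HZ=%llu = %.0f s (~%.1f days)\n",
               hz, 4294967296.0 / hz, 4294967296.0 / hz / 86400.0);
        return 0;
    }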

Thus it is possible that we have two bugs: one in the clock making it
jump forward, and another somewhere else causing an overflow when the clock
jumps too far.

Regards,
Willy

2011-06-28 06:27:34

by Apollon Oikonomopoulos

[permalink] [raw]
Subject: Re: 2.6.32.21 - uptime related crashes?

On 07:17 Tue 28 Jun, Willy Tarreau wrote:
> On Mon, Jun 27, 2011 at 07:25:31PM -0700, john stultz wrote:
> > That said, I didn't see from any of the backtraces in this thread why
> > the system actually crashed. The softlockup message on its own
> > shouldn't do that, so I suspect there's still a related issue
> > somewhere else here.
>
> One of the traces clearly showed that the kernel's uptime had wrapped
> or jumped, because the uptime suddenly jumped forwards to something
> like 2^32/HZ seconds IIRC.
>
> Thus it is possible that we have two bugs, one on the clock making it
> jump forwards and one somewhere else causing an overflow when the clock
> jumps too far.

Our last machine with wrapped time crashed a month ago, almost a month after
the time wrap. One thing I noticed was that although the machine seemed
healthy apart from the time wrap, there were random scheduling
glitches, mostly visible as high ping times to the KVM guests
running on the machine. Unfortunately I don't have any exact numbers, so I
suppose the best I can do is describe what we saw.

All scheduler statistics under /proc/sched_debug on the host seemed normal;
however, pinging a VM from outside would show random spikes on the order of
hundreds of ms among the usual 1-2 ms times. Moving the VM to another host
would restore sane ping times, and any other VM moved to this host would exhibit
the same behaviour. Ping times to the host itself from outside were stable.
This was also accompanied by bad I/O performance in the KVM guests themselves
and the strange effect that the total CPU time on the VMs' munin graphs would
add up to less than 100% * #CPUs. Neither the host nor the guests were
experiencing heavy load.

As a side note, this was similar to the behaviour we had experienced once when
some of multipathd's path checkers (which are RT tasks IIRC) had crashed,
although this time restarting multipathd didn't help.

Regards,
Apollon