LinuxLists.cc - Re: HPET regression in 2.6.26 versus 2.6.25 -- RCU problem

2008-08-09 22:36:24

Subject: Re: HPET regression in 2.6.26 versus 2.6.25 -- RCU problem

OK, sorry for several hours of delay, but I had to work this
morning and just got home.

> > I am completely ignorant about how the kernel works, so any guesses I have
> > are probably worthless... but I'll throw some out anyway:
> >
> > 1. Maybe HPET is used (if present) for timing by RCU, so disabling it
> > forces RCU to work differently. (Pure guess here: I know nothing about
> > RCU, and haven't even tried looking at its code.)
>
> RCU doesn't use HPET directly. Most of its time-dependent behavior
> comes from its being invoked from the scheduling-clock interrupt.

OK. It was just a guess, anyway, but in my weak attempts to apply logic
to the problem I thought: a locking issue would not go away merely by
disabling HPET, but if HPET touches the inner workings of RCU (or something
on which RCU depends) then it would make sense that disabling HPET causes
RCU to behave differently.
I was just brainstorming, though....

> > 2. Maybe my hardware is broken. We need see one initcall return that
> > report over 280,000 msecs... when the entire boot->freeze time was about
> > 3 secs. On the other hand, 2.6.25 (and before) work just fine with HPET
> > enabled.
>
> For CONFIG_CLASSIC_RCU and !CONFIG_PREEMPT, in-kernel infinite spin loops
> will cause synchronize_rcu() to hang. For other RCU configurations,
> spinning with interrupts disabled will result in similar hangs. Invoking
> synchronize_rcu() very early in boot (before rcu_init() has been called)
> will of course also hang.
>
> Could you please let me know whether your config has CONFIG_CLASSIC_RCU
> or CONFIG_PREEMPT_RCU?

[My apologies for the poor writing above. The sentence "We need see one
initcall return that report over 280,000 msecs..." was supposed to say
"We *DID* see one initcall return that *reported* over 280,000 msecs..."
In other words, something funky is going on with this machine's timers
in the crashing kernels.]

OK, I don't believe Paul was here for the beginning of this thread on
Monday, so before supplying the info requested I need to provide some
context on my situation. I have one machine ("desktop") which works fine
with 2.6.2[67] kernels, with mboard = "Gigabyte GA-M59SLI-S5"; and I have
two machines ("fileserver", "webserver") on which 2.6.2[67] kernels freeze,
both with mboard = "ECS AMD690GM-M2". I also am interested in getting the
Debian stock kernel working for their upcoming stable release, as well as
getting my own custom kernels working again.

First, here is the .config info for the Debian stock kernel called
"linux-image-2.6.26-1-amd64":
====================
$ egrep 'HPET|RCU|PREEMPT' config-2.6.26-1-amd64
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_CLASSIC_RCU=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_HPET=y
CONFIG_HPET_MMAP=y
# CONFIG_RCU_TORTURE_TEST is not set
====================
This kernel freezes on webserver/fileserver, but runs fine on desktop. (The
binary is identical, having moved it from desktop to the others via NFS instead
of downloading a separate instance from the Debian repositories.)

Here is info from the custom .config for my FREEZING fileserver machine, which
is not the same as the desktop, and not the same as Debian stock:
====================
$ egrep 'HPET|RCU|PREEMPT' config-2.6.26-2s11950.080804.fileserver.uvesafb
CONFIG_CLASSIC_RCU=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_HPET=y
CONFIG_HPET_RTC_IRQ=y
CONFIG_HPET_MMAP=y
====================
This was derived from the working .config for 2.6.25 on fileserver:
====================
$ egrep 'HPET|RCU|PREEMPT' config-2.6.25-7.080720.fileserver.uvesafb
CONFIG_CLASSIC_RCU=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_HPET=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_HPET_MMAP=y
====================

After reading Paul's email, but before replying, I applied the changes
to PREEMPT and PREEMPT_RCU and built 2.6.27-rc2 from my git tree on
fileserver. This kernel FREEZES on fileserver, like the custom and
Debian stock 2.6.26 kernels mentioned above:
====================
$ egrep 'HPET|RCU|PREEMPT' config-2.6.27-rc2.080809.preempt+rcu
# CONFIG_CLASSIC_RCU is not set
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_RCU=y
CONFIG_RCU_TRACE=y
CONFIG_HPET=y
CONFIG_HPET_RTC_IRQ=y
CONFIG_HPET_MMAP=y
# CONFIG_PREEMPT_TRACER is not set
====================

Here is info from the custom .config for my WORKING desktop machine, which
is not the same as fileserver/webserver, and not the same as Debian stock:
====================
$ egrep 'HPET|RCU|PREEMPT' config-2.6.26-1.080801.desktop.uvesafb
CONFIG_CLASSIC_RCU=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
# CONFIG_PREEMPT_RCU is not set
CONFIG_HPET=y
CONFIG_HPET_RTC_IRQ=y
CONFIG_HPET_MMAP=y
====================

(My custom configurations originated with the Debian stock config, but I
disabled drivers and features irrelevant for my hardware, then tweaked
each .config according to each machine's specific hardware and usage.
All machines work fine using my custom configs for 2.6.25 kernels and
earlier.)

> > 3. I was able to find the commit that introduced the freeze
> > (3def3d6ddf43dbe20c00c3cbc38dfacc8586998f), so there has to be a connection
> > between that commit and the RCU problem. Is it possible that a prexisting
> > error or oversight in the code was merely exposed by that commit? (And
> > only on certain hardware?) Or does that code itself contain the error?
>
> Thank you for finding the commit -- should be quite helpful!!!
>
> A quick look reveals what appears to be reader-writer locking rather
> than RCU. It does run in early boot before rcu_init(), so if it managed
> to call synchronize_rcu() somehow you indeed would see a hang. I do
> not see such a call, but then again, I don't know this code much at all.
>
> This is the second time in as many days that motivated RCU's working
> correctly before rcu_init()... Hmmm...

Again, I think Paul was not here for the previous messages in this thread. A
bit of recap may be in order:

The commit that first causes the freeze (I assume that no later commit
independently causes one, but that is unknown at this point) touches 3
files:

arch/x86/kernel/e820_64.c: Here, the algorithm was altered to remove
several calls to a function called request_resource(<args>), replacing them
with a single call to insert_resource(<args>). I have no idea whether this
change is problematic, but I observe that "request" sounds read-only, while
"insert" implies read-write behavior. (NB: this file no longer exists, and
its contents have been merged into 'e820.c'.)

arch/x86/kernel/setup_64.c: Here, several calls of insert_resource(<args>)
are added in 2 functions.

include/asm-x86/e820_64.h: Here, a function prototype is modified to reflect
changes made in 'e820_64.c'.

Booting the 2.6.26 kernels on fileserver with "debug initcall_debug" reveals
that the last function called before the freeze is called "inet_init()".
(The inet_init() function itself is not important here; one desperate
experiment I tried, disabling most of the kernel... including CONFIG_NET...
caused the freeze to occur in pci_init() instead.) The inet_init() function
is located in net/ipv4/af_inet.c, and freezes in a loop which calls
inet_register_protosw(<arg>):

===== BEGIN CODE EXCERPT ========
	/* Register the socket-side information for inet_create. */
	for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
		INIT_LIST_HEAD(r);

	for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
		inet_register_protosw(q);
===== END EXCERPT ========

The inet_register_protosw(<arg>) function calls list_add_rcu(<args>) in a
block of code enclosed between spin_lock_bh(<arg>) and spin_unlock_bh(<arg>).
Again, I don't know what I'm doing, but it looks like this is where
inet_init() touches RCU features. Just before inet_register_protosw() hits
"return;" it calls synchronize_net(); this is a tiny function, which calls
might_sleep() and synchronize_rcu().

At synchronize_rcu(), the freeze occurs, and it does so on the very first
call to inet_register_protosw(<arg>) in that loop.

To quote Daffy Duck: "Something's amiss here...." I lack the knowledge
and skills to know whether commit 3def3d... is really to blame, or whether
the changes it made simply revealed breakage in the other code which was
already present. Indeed, none of you seem to be having any problem at all;
nor am I, on my "desktop" machine!

> > If any has any test code I can run to detect massive HPET breakage on
> > these motherboards, I'll be glad to do so. Or any other experimental
> > code changes, for that matter.
>
> If you can answer my CONFIG_CLASSIC_RCU vs. CONFIG_PREEMPT_RCU question
> above, I should be able to provide you a diagnostic patch that would say
> which CPU RCU was waiting on. At least assuming that at least one CPU
> was still taking the scheduling-clock interrupt, that is. ;-)

[More poor grammar apologies: "If any has any test code..." ==>
"If *anyone has any test code..."]

Thank you for the help. This problem is frustrating, but incredibly
interesting to me. I have never had this sort of problem with any previous
kernel, so I have never had an opportunity to play bug-catcher before. By
pursuing the matter this far, I have learned elementary usage of 'git', I
have had a chance to peek at the kernel source code itself, and have even
successfully inserted code (only harmless printk()'s, though) and built the
modified kernel without errors afterward! Without this regression, I would
have had none of this fun!

A few closing comments, then:

1. I don't think the PREEMPT options in .config are to blame. The Debian
stock 2.6.26 kernel runs on "desktop", but freezes on "fileserver". That
makes it look like a hardware issue, but 2.6.25 ran fine. [init_headache()]

2. Commit 3def3d... draws the line between 2.6.25 working on "fileserver"
and pre-2.6.26 not working on "fileserver". The changes in e820.c seem to
modify a function called e820_reserve_resources() from requesting resources
to inserting resources. (The changes in setup.c don't affect me, since the
additional call of insert_resource() is in a block depending on CONFIG_KEXEC,
which is disabled in my custom kernels.) Something about this commit causes
inet_init() -- which calls inet_register_protosw(), which calls
synchronize_net(), which calls synchronize_rcu() -- to freeze.
[init_migraine()]

3. Whatever the cause -- whether the commit is doing something wrong, or
whether it just exposed something else that wasn't right to begin with --
the problem can just be made to go away by using "hpet=disabled" as a boot
parameter. [init_apoplexy()]

4. The problem seems to only manifest itself on an ECS AMD690GM-M2
motherboard, since of the thousands of users of Debian Sid I am the only
one reporting a problem on the Debian BTS, and no one else on the LKML is
experiencing it either. [init_fatal_aneurism()]

However, even though I am the only one plagued by this problem, it is clear
that this hardware ran 2.6.25 just fine. Maybe the full extent of the
problem is yet to be seen, since the vast majority of Linux users run
distributions with older kernels. So, I'm viewing this as a chance for
me to finally be able to contribute, until one of 3 things is discovered:
the problem is my fault, the problem is my hardware's fault, or the problem
is a bug in the kernel.

Thanks Paul (and Peter and Yinghai),
Dave W.

2008-08-10 15:15:36

by Paul E. McKenney


Subject: Re: HPET regression in 2.6.26 versus 2.6.25 -- RCU problem

On Sat, Aug 09, 2008 at 03:35:48PM -0700, David Witbrodt wrote:
> OK, sorry for several hours of delay, but I had to work this
> morning and just got home.
>
> > > I am completely ignorant about how the kernel works, so any guesses I have
> > > are probably worthless... but I'll throw some out anyway:
> > >
> > > 1. Maybe HPET is used (if present) for timing by RCU, so disabling it
> > > forces RCU to work differently. (Pure guess here: I know nothing about
> > > RCU, and haven't even tried looking at its code.)
> >
> > RCU doesn't use HPET directly. Most of its time-dependent behavior
> > comes from its being invoked from the scheduling-clock interrupt.
>
> OK. It was just a guess, anyway, but in my weak attempts to apply logic
> to the problem I thought: a locking issue would not go away merely by
> disabling HPET, but if HPET touches the inner workings of RCU (or something
> on which RCU depends) then it would make sense that disabling HPET causes
> RCU to behave differently.
> I was just brainstorming, though....

One other possibility would be something like:

rcu_read_lock();
/* something that waits for the HPET. */
rcu_read_unlock();

I don't know of any such code sequence, but if one did exist somewhere
in the kernel, then HPET failure could stall a synchronize_rcu().
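
To make that failure mode concrete, here is a rough user-space model of how a
Classic-RCU-style grace period completes. It is only a sketch under stated
assumptions: struct toy_gp, toy_gp_start(), toy_report_qs() and toy_gp_done()
are names invented for illustration and are not kernel APIs. The idea is that
a grace period can end only once every CPU has passed through a quiescent
state, so a CPU stuck waiting inside a read-side critical section never clears
its bit and the waiter never returns.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, not kernel code: one bit per CPU that still
 * owes a quiescent state for the current grace period. */
struct toy_gp {
	unsigned long cpumask;
};

/* Start a grace period: each of the ncpus CPUs must check in. */
static void toy_gp_start(struct toy_gp *gp, unsigned int ncpus)
{
	if (ncpus >= 8 * sizeof(unsigned long))
		gp->cpumask = ~0UL;
	else
		gp->cpumask = (1UL << ncpus) - 1;
}

/* A CPU passes through a quiescent state (for example a context switch). */
static void toy_report_qs(struct toy_gp *gp, unsigned int cpu)
{
	assert(cpu < 8 * sizeof(unsigned long));
	gp->cpumask &= ~(1UL << cpu);
}

/* A synchronize_rcu()-like waiter could return only once this is true. */
static bool toy_gp_done(const struct toy_gp *gp)
{
	return gp->cpumask == 0;
}
```

In this model, a CPU that spins forever between rcu_read_lock() and
rcu_read_unlock() simply never calls toy_report_qs(), so toy_gp_done() stays
false no matter what the other CPUs do.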

> > > 2. Maybe my hardware is broken. We need see one initcall return that
> > > report over 280,000 msecs... when the entire boot->freeze time was about
> > > 3 secs. On the other hand, 2.6.25 (and before) work just fine with HPET
> > > enabled.
> >
> > For CONFIG_CLASSIC_RCU and !CONFIG_PREEMPT, in-kernel infinite spin loops
> > will cause synchronize_rcu() to hang. For other RCU configurations,
> > spinning with interrupts disabled will result in similar hangs. Invoking
> > synchronize_rcu() very early in boot (before rcu_init() has been called)
> > will of course also hang.
> >
> > Could you please let me know whether your config has CONFIG_CLASSIC_RCU
> > or CONFIG_PREEMPT_RCU?
>
> [My apologies for the poor writing above. The sentence "We need see one
> initcall return that report over 280,000 msecs..." was supposed to say
> "We *DID* see one initcall return that *reported* over 280,000 msecs..."
> In other words, something funky is going on with this machine's timers
> in the crashing kernels.]

No need to apologize -- I did understand your intent.

> OK, I don't believe Paul was here for the beginning of this thread on
> Monday, so before supplying the info requested I need to provide some
> context on my situation. I have one machine ("desktop") which works fine
> with 2.6.2[67] kernels, with mboard = "Gigabyte GA-M59SLI-S5"; and I have
> two machines ("fileserver", "webserver") on which 2.6.2[67] kernels freeze,
> both with mboard = "ECS AMD690GM-M2". I also am interested in getting the
> Debian stock kernel working for their upcoming stable release, as well as
> getting my own custom kernels working again.

OK, so at least the desktop machine is multi-CPU, and perhaps the
fileserver as well.

> First, here is the .config info for the Debian stock kernel called
> "linux-image-2.6.26-1-amd64":
> ====================
> $ egrep 'HPET|RCU|PREEMPT' config-2.6.26-1-amd64
> CONFIG_PREEMPT_NOTIFIERS=y
> CONFIG_CLASSIC_RCU=y

OK. Classic RCU has not changed much recently. This also indicates
an infinite loop in kernel code (or a CPU locking up completely, which
is quite rare, but can still happen).

You did try preemptable RCU, which is much more recent (and thus much
more subject to suspicion), but got the same result.

I will see about putting together a diagnostic patch for Classic RCU.
The approach will be to record jiffies (or some such) at the beginning
of the grace period (in rcu_start_batch()), then have
rcu_check_callbacks() complain if:

1. it is running on a CPU that has been holding up grace periods for
a long time (say one second). This will identify the culprit
assuming that the culprit has not disabled hardware irqs,
instruction execution, or some such.

2. it is running on a CPU that is not holding up grace periods,
but grace periods have been held up for an even longer time
(say two seconds).

In either case, some sort of exponential backoff would be needed to
avoid multi-gigabyte log files. Of course, all of this assumes that
the machine remains healthy enough to actually get any such messages
somewhere that you can see them, but so it goes...
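
The two cases above can be sketched in user space roughly as follows. This is
a model of the plan, not the eventual patch: toy_classify() and the enum are
hypothetical names, the timestamps are plain seconds, and the one-second and
two-second thresholds are just the "say" values mentioned above.

```c
#include <assert.h>

enum toy_stall { TOY_NO_STALL, TOY_SELF_STALL, TOY_OTHER_STALL };

/*
 * Hypothetical model of the planned check. 'now' and 'gp_start' are
 * timestamps in seconds, 'cpumask' has one bit per CPU still holding
 * up the grace period, and 'cpu' is the CPU running the check.
 */
static enum toy_stall toy_classify(unsigned long now, unsigned long gp_start,
				   unsigned long cpumask, unsigned int cpu)
{
	/* Signed difference, so a wrapped counter still compares sanely
	 * on the usual two's-complement targets. */
	long elapsed = (long)(now - gp_start);

	assert(cpu < 8 * sizeof(unsigned long));
	if ((cpumask & (1UL << cpu)) && elapsed >= 1)
		return TOY_SELF_STALL;	/* case 1: this CPU is the culprit */
	if (cpumask != 0 && elapsed >= 2)
		return TOY_OTHER_STALL;	/* case 2: some other CPU is */
	return TOY_NO_STALL;
}
```

Giving the culprit the shorter threshold means it gets the first chance to
report on itself (and dump its own stack) before any other CPU complains
about it.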

Thanx, Paul


2008-08-11 01:35:48

by Paul E. McKenney


Subject: [PATCH diagnostic] Re: HPET regression in 2.6.26 versus 2.6.25 -- RCU problem

On Sun, Aug 10, 2008 at 08:15:20AM -0700, Paul E. McKenney wrote:
> I will see about putting together a diagnostic patch for Classic RCU.
> The approach will be to record jiffies (or some such) at the beginning
> of the grace period (in rcu_start_batch()), then have
> rcu_check_callbacks() complain if:
>
> 1. it is running on a CPU that has been holding up grace periods for
> a long time (say one second). This will identify the culprit
> assuming that the culprit has not disabled hardware irqs,
> instruction execution, or some such.
>
> 2. it is running on a CPU that is not holding up grace periods,
> but grace periods have been held up for an even longer time
> (say two seconds).
>
> In either case, some sort of exponential backoff would be needed to
> avoid multi-gigabyte log files. Of course, all of this assumes that
> the machine remains healthy enough to actually get any such messages
> somewhere that you can see them, but so it goes...

And here is the patch. It is still a bit raw, so the results should
be viewed with some suspicion. It adds a default-off kernel configuration
option, CONFIG_RCU_CPU_STALL, which must be enabled.

Rather than exponential backoff, it backs off to once per 30 seconds.
My feeling upon thinking on it was that if you have stalled RCU grace
periods for that long, a few extra printk() messages are probably the
least of your worries...

Signed-off-by: Paul E. McKenney <[email protected]>
---

include/linux/rcuclassic.h | 3 +
kernel/rcuclassic.c | 80 +++++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 13 +++++++
3 files changed, 96 insertions(+)

diff -urpNa -X dontdiff linux-2.6.27-rc1/include/linux/rcuclassic.h linux-2.6.27-rc1-cpustall/include/linux/rcuclassic.h
--- linux-2.6.27-rc1/include/linux/rcuclassic.h 2008-07-30 08:48:16.000000000 -0700
+++ linux-2.6.27-rc1-cpustall/include/linux/rcuclassic.h 2008-08-10 12:21:22.000000000 -0700
@@ -46,6 +46,9 @@ struct rcu_ctrlblk {
 	long cur; /* Current batch number. */
 	long completed; /* Number of the last completed batch */
 	int next_pending; /* Is the next batch already waiting? */
+#ifdef CONFIG_RCU_CPU_STALL
+	unsigned long gp_check; /* Time grace period should end, in seconds. */
+#endif /* #ifdef CONFIG_RCU_CPU_STALL */

 	int signaled;

diff -urpNa -X dontdiff linux-2.6.27-rc1/kernel/rcuclassic.c linux-2.6.27-rc1-cpustall/kernel/rcuclassic.c
--- linux-2.6.27-rc1/kernel/rcuclassic.c 2008-07-30 08:48:17.000000000 -0700
+++ linux-2.6.27-rc1-cpustall/kernel/rcuclassic.c 2008-08-10 17:51:32.000000000 -0700
@@ -47,6 +47,7 @@
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/mutex.h>
+#include <linux/time.h>

#ifdef CONFIG_DEBUG_LOCK_ALLOC
static struct lock_class_key rcu_lock_key;
@@ -269,6 +270,81 @@ static void rcu_do_batch(struct rcu_data
* rcu_check_quiescent_state calls rcu_start_batch(0) to start the next grace
* period (if necessary).
*/
+
+#ifdef CONFIG_RCU_CPU_STALL
+
+static inline void record_gp_check_time(struct rcu_ctrlblk *rcp)
+{
+	rcp->gp_check = get_seconds() + 3;
+}
+static void print_other_cpu_stall(struct rcu_ctrlblk *rcp)
+{
+	int cpu;
+	long delta;
+
+	/* Only let one CPU complain about others per time interval. */
+
+	spin_lock(&rcp->lock);
+	delta = get_seconds() - rcp->gp_check;
+	if (delta < 2L ||
+	    cpus_empty(rcp->cpumask)) {
+		spin_unlock(&rcp->lock);
+		return;
+		rcp->gp_check = get_seconds() + 30;
+	}
+	spin_unlock(&rcp->lock);
+
+	/* OK, time to rat on our buddy... */
+
+	printk(KERN_ERR "RCU detected CPU stalls:");
+	for_each_cpu_mask(cpu, rcp->cpumask)
+		printk(" %d", cpu);
+	printk(" (detected by %d, t=%lu/%lu)\n",
+	       smp_processor_id(), get_seconds(), rcp->gp_check);
+}
+static void print_cpu_stall(struct rcu_ctrlblk *rcp)
+{
+	printk(KERN_ERR "RCU detected CPU %d stall (t=%lu/%lu)\n",
+	       smp_processor_id(), get_seconds(), rcp->gp_check);
+	dump_stack();
+	spin_lock(&rcp->lock);
+	if ((long)(get_seconds() - rcp->gp_check) >= 0L)
+		rcp->gp_check = get_seconds() + 30;
+	spin_unlock(&rcp->lock);
+}
+static inline void check_cpu_stall(struct rcu_ctrlblk *rcp,
+				   struct rcu_data *rdp)
+{
+	long delta;
+
+	delta = get_seconds() - rcp->gp_check;
+	if (cpu_isset(smp_processor_id(), rcp->cpumask) && delta >= 0L) {
+
+		/* We haven't checked in, so go dump stack. */
+
+		print_cpu_stall(rcp);
+
+	} else if (!cpus_empty(rcp->cpumask) && delta >= 2L) {
+
+		/* They had two seconds to dump stack, so complain. */
+
+		print_other_cpu_stall(rcp);
+
+	}
+}
+
+#else /* #ifdef CONFIG_RCU_CPU_STALL */
+
+static inline void record_gp_check_time(struct rcu_ctrlblk *rcp)
+{
+}
+static inline void check_cpu_stall(struct rcu_ctrlblk *rcp,
+ struct rcu_data *rdp)
+{
+}
+
+#endif /* #else #ifdef CONFIG_RCU_CPU_STALL */
+
/*
* Register a new batch of callbacks, and start it up if there is currently no
* active batch and the batch to be registered has not already occurred.
@@ -285,6 +361,7 @@ static void rcu_start_batch(struct rcu_c
 		 */
 		smp_wmb();
 		rcp->cur++;
+		record_gp_check_time(rcp);

/*
* Accessing nohz_cpu_mask before incrementing rcp->cur needs a
@@ -468,6 +545,9 @@ static void rcu_process_callbacks(struct

static int __rcu_pending(struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
{
+	/* Check for CPU stalls, if enabled. */
+	check_cpu_stall(rcp, rdp);
+
 	/* This cpu has pending rcu entries and the grace period
 	 * for them has completed.
 	 */
diff -urpNa -X dontdiff linux-2.6.27-rc1/lib/Kconfig.debug linux-2.6.27-rc1-cpustall/lib/Kconfig.debug
--- linux-2.6.27-rc1/lib/Kconfig.debug 2008-07-30 08:48:17.000000000 -0700
+++ linux-2.6.27-rc1-cpustall/lib/Kconfig.debug 2008-08-10 12:14:18.000000000 -0700
@@ -597,6 +597,19 @@ config RCU_TORTURE_TEST_RUNNABLE
Say N here if you want the RCU torture tests to start only
after being manually enabled via /proc.

+config RCU_CPU_STALL
+	bool "Check for stalled CPUs delaying RCU grace periods"
+	depends on CLASSIC_RCU
+	default n
+	help
+	  This option causes RCU to printk information on which
+	  CPUs are delaying the current grace period, but only when
+	  the grace period extends for excessive time periods.
+
+	  Say Y if you want RCU to perform such checks.
+
+	  Say N if you are unsure.
+
config KPROBES_SANITY_TEST
bool "Kprobes sanity tests"
depends on DEBUG_KERNEL

2008-08-11 01:38:21

On Mon, Aug 11, 2008 at 11:22:21AM -0700, David Witbrodt wrote:
>
>
> > > Well, I was hoping to see something interesting. I ran it with parameters
> > > "debug initcall_debug", and it locked up at the same place. I let it for
> > > 15 minutes, in case of some delayed reaction. Nada.
> >
> > Interesting. The causes could be:
> >
> > o Scheduling-clock interrupts aren't happening, as Ingo suggested.
>
> Does anyone have a short answer to this question: Were the changes between
> 2.6.25 and 2.6.26 so major that interrupts are NOW being used that were not
> being used before?

Not that I am aware of, but I must defer to others who know more about
Linux's timer interrupts than do I.

Thanx, Paul

> Again, I don't even pretend to understand the kernel's inner workings, but
> 2.6.25 _did_ work on this hardware... and with HPET enabled.
>
>
> DW

2008-08-13 00:25:23

by Paul E. McKenney


Subject: [PATCH diagnostic] Prevent console flood when one CPU sees another AWOL via RCU

On Mon, Aug 11, 2008 at 06:17:28AM -0700, Paul E. McKenney wrote:
> On Mon, Aug 11, 2008 at 01:38:17PM +0200, Ingo Molnar wrote:
> >
> > * Paul E. McKenney <[email protected]> wrote:
> >
> > > And here is the patch. It is still a bit raw, so the results should
> > > be viewed with some suspicion. It adds a default-off kernel parameter
> > > CONFIG_RCU_CPU_STALL which must be enabled.
> > >
> > > Rather than exponential backoff, it backs off to once per 30 seconds.
> > > My feeling upon thinking on it was that if you have stalled RCU grace
> > > periods for that long, a few extra printk() messages are probably the
> > > least of your worries...
> >
> > while this wont debug problems were timer irqs are genuinely stuck for
> > long periods of time, it should find problems with RCU completion logic
> > itself in the presence of correct timer irqs - and the lack of any
> > messages from this debug option should point the finger more firmly in
> > the direction of stalled timer irqs.
> >
> > So i find this debug feature rather useful and have applied it to
> > tip/core/rcu (and cleaned it up a bit). I renamed the config option to
> > CONFIG_DEBUG_RCU_STALL to make it more in line with usual debug option
> > names. Lets see whether -tip testing finds any false positives.
>
> Sounds good!
>
> For whatever it is worth, this diagnostic can also locate latency issues
> in non-CONFIG_PREEMPT kernels, even when those problems are outside of
> preempt_disable() regions. Latency tracer is of course a better tool
> for things -inside- of preempt_disable() regions.

One small change needed to keep from flooding the console when one
CPU notices that another is AWOL. Unless I am missing something subtle.
Otherwise the cleanups look good!
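
For anyone wondering why the deadline has to move, here is a small user-space
sketch of the rate limiting. It is only a model: toy_should_report() and
toy_count_reports() are hypothetical helpers rather than anything in
rcuclassic.c, and the check is assumed to run once per second. Once the
deadline has passed, a check that never pushes it forward would fire on every
later tick, which is the console flood.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BACKOFF_SECS 30	/* same back-off interval as the patch */

/*
 * Hypothetical helper: returns true when a stall report is due and, if so,
 * advances *next_report so the next one is 30 seconds away.
 */
static bool toy_should_report(unsigned long now, unsigned long *next_report)
{
	assert(next_report);
	if ((long)(now - *next_report) < 0)
		return false;		/* deadline not reached yet */
	*next_report = now + TOY_BACKOFF_SECS;
	return true;
}

/* Count the reports issued when the check runs once per second. */
static int toy_count_reports(unsigned long start, unsigned long end,
			     unsigned long next_report)
{
	int reports = 0;
	unsigned long t;

	for (t = start; t < end; t++)
		if (toy_should_report(t, &next_report))
			reports++;
	return reports;
}
```

With the deadline advanced on each report, a 90-second stall in this model
yields a handful of messages. Drop the assignment to *next_report and every
tick after the first deadline reports instead.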

Signed-off-by: Paul E. McKenney <[email protected]>
---

rcuclassic.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
index 56b8712..dab2676 100644
--- a/kernel/rcuclassic.c
+++ b/kernel/rcuclassic.c
@@ -308,6 +308,7 @@ static void print_other_cpu_stall(struct rcu_ctrlblk *rcp)
 		spin_unlock(&rcp->lock);
 		return;
 	}
+	rcp->gp_check = get_seconds() + 30;
 	spin_unlock(&rcp->lock);

 	/* OK, time to rat on our buddy... */