This is my proposal to print the NMI stack traces from an RCU stall safely.
Here's the gist of it.

Patch 1: move the trace_seq out of the tracing code. It's useful for other
purposes too. Like writing from an NMI context.

Patch 2: Add a per_cpu "printk_func" that printk calls. By default it calls
vprintk_default(), which does what it has always done. This allows us to
override what printk() calls normally on a per-CPU basis.

Patch 3: Have the NMI handler that dumps the stack trace just change the
printk_func to call an NMI-safe printk function that writes to a per-CPU
trace_seq. When all NMI handlers have chimed in, the original caller prints
out the trace_seqs for each CPU from a printk-safe context.
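To make that concrete, here's a minimal sketch of how the pieces fit
together (the names follow the patch descriptions above; treat it as an
illustration, not the exact diff):

typedef int (*printk_func_t)(const char *fmt, va_list args);

/* Patch 2: printk() calls through this per-CPU hook. */
DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;

/* Patch 3: an NMI-safe printk that only fills a per-CPU trace_seq. */
static DEFINE_PER_CPU(struct trace_seq, nmi_print_seq);

static int nmi_vprintk(const char *fmt, va_list args)
{
        struct trace_seq *s = this_cpu_ptr(&nmi_print_seq);
        unsigned int len = s->len;

        trace_seq_vprintf(s, fmt, args);
        return s->len - len;
}

/* In the NMI backtrace handler (regs is the handler's argument): */
/* divert printk() for this CPU only, dump, then restore. */
this_cpu_write(printk_func, nmi_vprintk);
show_regs(regs);
this_cpu_write(printk_func, vprintk_default);

That way the NMI path never touches the console or logbuf locks; the
flushing of each CPU's trace_seq happens later from a safe context.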
This is much less intrusive than the other versions out there.
Note: the biggest change was moving the trace_seq code out of trace_output.c
and into the new trace_seq.c library.
Thoughts?
Steven Rostedt (Red Hat) (3):
trace_seq: Move the trace_seq code to lib/
printk: Add per_cpu printk func to allow printk to be diverted
x86/nmi: Perform a safe NMI stack trace on all CPUs
---
arch/x86/kernel/apic/hw_nmi.c | 66 ++++++++-
include/linux/percpu.h | 3 +
include/linux/printk.h | 2 +
include/linux/trace_seq.h | 68 ++--------
kernel/printk/printk.c | 38 ++++--
kernel/trace/trace.c | 24 ----
kernel/trace/trace_output.c | 268 -------------------------------------
kernel/trace/trace_output.h | 16 ---
lib/Makefile | 2 +-
lib/trace_seq.c | 303 ++++++++++++++++++++++++++++++++++++++++++
10 files changed, 412 insertions(+), 378 deletions(-)
On Thu, 19 Jun 2014, Steven Rostedt wrote:
> This is my proposal to print the NMI stack traces from an RCU stall safely.
> Here's the gist of it.
>
> Patch 1: move the trace_seq out of the tracing code. It's useful for other
> purposes too. Like writing from an NMI context.
>
> Patch 2: Add a per_cpu "printk_func" that printk calls. By default it calls
> vprintk_default(), which does what it has always done. This allows us to
> override what printk() calls normally on a per-CPU basis.
>
> Patch 3: Have the NMI handler that dumps the stack trace just change the
> printk_func to call an NMI-safe printk function that writes to a per-CPU
> trace_seq. When all NMI handlers have chimed in, the original caller prints
> out the trace_seqs for each CPU from a printk-safe context.
>
> This is much less intrusive than the other versions out there.
I agree this is less intrusive than having printk() use two versions of
the buffers and perform merging; OTOH, it doesn't really seem to be a
fully clean and systematic solution either.
I had a different idea earlier today, and Petr seems to have implemented
it already; I guess he'll be sending it out as an RFC tomorrow for
comparison.
The idea basically is to *switch* what arch_trigger_all_cpu_backtrace()
and arch_trigger_all_cpu_backtrace_handler() are doing; i.e. use the NMI
as a way to stop all the CPUs (one by one), and let the CPU that is
sending the NMIs around actually walk and dump the stacks of the CPUs
receiving the NMI IPI.
It's the most trivial approach I've been able to come up with, and should
be usable for everybody (RCU stall detector and sysrq). The only tricky
part is: if we want pt_regs to be part of the dump as well, how to pass
those cleanly between the 'stopped' CPU and the CPU that is doing the
printing. Other than that, it's just moving a few lines of code around, I
believe.
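Roughly, the flow on the initiator would look like this (the handshake
details are only illustrative; timeouts and error handling are omitted):

static struct pt_regs *remote_regs;     /* saved by the target's NMI handler */

static void dump_all_cpu_stacks(void)
{
        int cpu, this_cpu = get_cpu();

        for_each_online_cpu(cpu) {
                if (cpu == this_cpu)
                        continue;
                apic->send_IPI_mask(cpumask_of(cpu), NMI_VECTOR);
                /* the target saves its pt_regs and spins in its handler */
                while (!ACCESS_ONCE(remote_regs))
                        cpu_relax();
                show_regs(remote_regs);         /* printk() from normal context */
                ACCESS_ONCE(remote_regs) = NULL; /* lets the target leave its NMI */
        }
        put_cpu();
}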
What do you think?
--
Jiri Kosina
SUSE Labs
On Thu, 19 Jun 2014 23:56:36 +0200 (CEST)
Jiri Kosina <[email protected]> wrote:
> I agree this is less intrusive than having printk() use two versions of
> the buffers and perform merging; OTOH, it doesn't really seem to be a
> fully clean and systematic solution either.
I'm not sure why you consider it not fully clean. I think it is quite
elegant ;-)
>
> I had a different idea earlier today, and Petr seems to have implemented
> it already; I guess he'll be sending it out as an RFC tomorrow for
> comparison.
>
> The idea basically is to *switch* what arch_trigger_all_cpu_backtrace()
> and arch_trigger_all_cpu_backtrace_handler() are doing; i.e. use the NMI
> as a way to stop all the CPUs (one by one), and let the CPU that is
> sending the NMIs around actually walk and dump the stacks of the CPUs
> receiving the NMI IPI.
And this is cleaner? If you stop a CPU via NMI, what happens when
something else goes wrong and that CPU never starts back up? This
sounds like something that could cause more problems than the ones
it was reporting on.
The code had better handle all the strange cases where an NMI comes in
after the code that sent it timed out waiting for the NMIs to return.
Then you also need to print out the data while the NMIs still spin.
>
> It's the most trivial approach I've been able to come up with, and should
> be usable for everybody (RCU stall detector and sysrq). The only tricky
> part is: if we want pt_regs to be part of the dump as well, how to pass
> those cleanly between the 'stopped' CPU and the CPU that is doing the
> printing. Other than that, it's just moving a few lines of code around, I
> believe.
Note, I first tried to just save the data and pass it to the calling
functions, but found that rather messy, as there is a lot of stuff
being dumped, and it's different for x86_64 vs i386, and depending on
whether the NMI happened in userspace or in the kernel. That's when I
realized it would be trivial to just switch what printk does and then
have the NMI do the printing to another NMI-safe buffer.
It was even more convenient that trace_seq already existed and made it
much easier to implement.
-- Steve
On Thu, 19 Jun 2014, Steven Rostedt wrote:
> > The idea basically is to *switch* what arch_trigger_all_cpu_backtrace()
> > and arch_trigger_all_cpu_backtrace_handler() are doing; i.e. use the NMI
> > as a way to stop all the CPUs (one by one), and let the CPU that is
> > sending the NMIs around actually walk and dump the stacks of the CPUs
> > receiving the NMI IPI.
>
> And this is cleaner? If you stop a CPU via NMI, what happens when
> something else goes wrong and that CPU never starts back up? This
> sounds like something that could cause more problems than the ones
> it was reporting on.
It's going to get NMI in exactly the same situations it does with the
current arch_trigger_all_cpu_backtrace(), the only difference being that
it doesn't try to invoke printk() from inside NMI. The IPI-NMI is used
solely as a point of synchronization for the stack dumping.
> Then you also need to print out the data while the NMIs still spin.
Exactly, that's actually the whole point.
Thanks,
--
Jiri Kosina
SUSE Labs
On Fri, 20 Jun 2014 01:03:28 +0200 (CEST)
Jiri Kosina <[email protected]> wrote:
> On Thu, 19 Jun 2014, Steven Rostedt wrote:
>
> > > The idea basically is to *switch* what arch_trigger_all_cpu_backtrace()
> > > and arch_trigger_all_cpu_backtrace_handler() are doing; i.e. use the NMI
> > > as a way to stop all the CPUs (one by one), and let the CPU that is
> > > sending the NMIs around actually walk and dump the stacks of the CPUs
> > > receiving the NMI IPI.
> >
> > And this is cleaner? If you stop a CPU via NMI, what happens when
> > something else goes wrong and that CPU never starts back up? This
> > sounds like something that could cause more problems than the ones
> > it was reporting on.
>
> It's going to get NMI in exactly the same situations it does with the
> current arch_trigger_all_cpu_backtrace(), the only difference being that
> it doesn't try to invoke printk() from inside NMI. The IPI-NMI is used
> solely as a point of synchronization for the stack dumping.
Well, all CPUs are going to be spinning until the main CPU prints
everything out. That's not quite the same thing as what it used to do.
>
> > Then you also need to print out the data while the NMIs still spin.
>
> Exactly, that's actually the whole point.
But this stops everything with a big hammer, until everything gets
printed out, not just the one CPU that happens to be stuck.
-- Steve
On Thu, 19 Jun 2014, Steven Rostedt wrote:
> > > > The idea basically is to *switch* what arch_trigger_all_cpu_backtrace()
> > > > and arch_trigger_all_cpu_backtrace_handler() are doing; i.e. use the NMI
> > > > as a way to stop all the CPUs (one by one), and let the CPU that is
> > > > sending the NMIs around actually walk and dump the stacks of the CPUs
> > > > receiving the NMI IPI.
> > >
> > > And this is cleaner? If you stop a CPU via NMI, what happens when
> > > something else goes wrong and that CPU never starts back up? This
> > > sounds like something that could cause more problems than the ones
> > > it was reporting on.
> >
> > It's going to get NMI in exactly the same situations it does with the
> > current arch_trigger_all_cpu_backtrace(), the only difference being that
> > it doesn't try to invoke printk() from inside NMI. The IPI-NMI is used
> > solely as a point of synchronization for the stack dumping.
>
> Well, all CPUs are going to be spinning until the main CPU prints
> everything out. That's not quite the same thing as what it used to do.
Is there a reason for all CPUs to be spinning until everything is printed
out? Each CPU will be spinning only until its own stack is printed out,
while the other CPUs keep running uninterrupted.
I don't think there is a need for a global stop_machine()-like
synchronization here. The printing CPU will be sending IPI to the CPU N+1
only after it has finished printing CPU N stacktrace.
Thanks,
--
Jiri Kosina
SUSE Labs
On Fri, 20 Jun 2014 01:27:26 +0200 (CEST)
Jiri Kosina <[email protected]> wrote:
>
> I don't think there is a need for a global stop_machine()-like
> synchronization here. The printing CPU will be sending IPI to the CPU N+1
> only after it has finished printing CPU N stacktrace.
So you plan on sending an IPI to a CPU then wait for it to acknowledge
that it is spinning, and then print out the data and then tell the CPU
it can stop spinning?
-- Steve
On Thu, 19 Jun 2014, Steven Rostedt wrote:
> > I don't think there is a need for a global stop_machine()-like
> > synchronization here. The printing CPU will be sending IPI to the CPU N+1
> > only after it has finished printing CPU N stacktrace.
>
> So you plan on sending an IPI to a CPU then wait for it to acknowledge
> that it is spinning, and then print out the data and then tell the CPU
> it can stop spinning?
Yes, that was exactly my idea. You have to be synchronized with the CPU
receiving the NMI anyway in case you'd like to get its pt_regs and dump
those as part of the dump.
--
Jiri Kosina
SUSE Labs
On Fri 2014-06-20 01:38:59, Jiri Kosina wrote:
> On Thu, 19 Jun 2014, Steven Rostedt wrote:
>
> > > I don't think there is a need for a global stop_machine()-like
> > > synchronization here. The printing CPU will be sending IPI to the CPU N+1
> > > only after it has finished printing CPU N stacktrace.
> >
> > So you plan on sending an IPI to a CPU then wait for it to acknowledge
> > that it is spinning, and then print out the data and then tell the CPU
> > it can stop spinning?
>
> Yes, that was exactly my idea. You have to be synchronized with the CPU
> receiving the NMI anyway in case you'd like to get its pt_regs and dump
> those as part of the dump.
This approach did not work after all. There was still the same
race: if we stop a CPU in the middle of printk(), it does not help
to move the printing task to another CPU ;-) We would need to
make a copy of the regs and all the stacks to unblock the CPU.
Hmm, in general, if we want a consistent snapshot, we need to temporarily
store the information in NMI context and put it into the main ring buffer
in normal context. We either need to copy the stacks or copy the printed text.
I am starting to like Steven's solution with the trace_seq buffer. I see the
following advantages:

+ the snapshot is pretty good;
  + we still send NMI to all CPUs at the "same" time

+ only minimal time is spent in NMI context;
  + CPUs are not blocked by each other to get sequential output

+ minimum of new code;
  + the trace_seq buffer is already implemented and used
  + it might get even better after getting attention from new users

Of course, it also has some disadvantages:

+ needs quite a big per-CPU buffer;
  + but we would need some extra space to copy the data anyway

+ the trace might get truncated;
  + but 1 page should be enough in most cases
  + we could make it configurable

+ delay until the message appears in the ring buffer and console;
  + better than freezing
  + still saved in the core file
  + the crash tool could be improved to find the traces
Note that the above solution only solves printing of the stack.
There are still other locations where printk is called in NMI
context. IMHO, some of them are helpful:
./arch/x86/kernel/nmi.c: WARN(in_nmi(),
./arch/x86/mm/kmemcheck/kmemcheck.c: WARN_ON_ONCE(in_nmi());
./arch/x86/mm/fault.c: WARN_ON_ONCE(in_nmi());
./arch/x86/mm/fault.c: WARN_ON_ONCE(in_nmi());

./mm/vmalloc.c: BUG_ON(in_nmi());
./lib/genalloc.c: BUG_ON(in_nmi());
./lib/genalloc.c: BUG_ON(in_nmi());
./include/linux/hardirq.h: BUG_ON(in_nmi());
And some are probably less important:
./arch/x86/platform/uv/uv_nmi.c several locations here
./arch/m68k/mac/macints.c- printk("... pausing, press NMI to resume ...");
Well, there are only a few. Maybe we could share the trace_seq buffer
here.
Of course, there is still the possibility of implementing a lockless
buffer. But it would be much more complicated than the current one.
I am not sure that we really want it.
Best Regards,
Petr
On Fri, Jun 20, 2014 at 6:35 PM, Petr Mládek <[email protected]> wrote:
> On Fri 2014-06-20 01:38:59, Jiri Kosina wrote:
>> On Thu, 19 Jun 2014, Steven Rostedt wrote:
>>
>> > > I don't think there is a need for a global stop_machine()-like
>> > > synchronization here. The printing CPU will be sending IPI to the CPU N+1
>> > > only after it has finished printing CPU N stacktrace.
>> >
>> > So you plan on sending an IPI to a CPU then wait for it to acknowledge
>> > that it is spinning, and then print out the data and then tell the CPU
>> > it can stop spinning?
>>
>> Yes, that was exactly my idea. You have to be synchronized with the CPU
>> receiving the NMI anyway in case you'd like to get its pt_regs and dump
>> those as part of the dump.
>
> This approach did not work after all. There was still the same
> race: if we stop a CPU in the middle of printk(), it does not help
> to move the printing task to another CPU ;-) We would need to
> make a copy of the regs and all the stacks to unblock the CPU.
>
> Hmm, in general, if we want a consistent snapshot, we need to temporarily
> store the information in NMI context and put it into the main ring buffer
> in normal context. We either need to copy the stacks or copy the printed text.
>
>
> I am starting to like Steven's solution with the trace_seq buffer. I see the
> following advantages:
>
> + the snapshot is pretty good;
>   + we still send NMI to all CPUs at the "same" time
>
> + only minimal time is spent in NMI context;
>   + CPUs are not blocked by each other to get sequential output
>
> + minimum of new code;
>   + the trace_seq buffer is already implemented and used
>   + it might get even better after getting attention from new users
>
>
> Of course, it also has some disadvantages:
>
> + needs quite a big per-CPU buffer;
>   + but we would need some extra space to copy the data anyway
>
> + the trace might get truncated;
>   + but 1 page should be enough in most cases
>   + we could make it configurable
>
> + delay until the message appears in the ring buffer and console;
>   + better than freezing
>   + still saved in the core file
>   + the crash tool could be improved to find the traces
>
>
> Note that the above solution only solves printing of the stack.
> There are still other locations where printk is called in NMI
> context. IMHO, some of them are helpful:
>
> ./arch/x86/kernel/nmi.c: WARN(in_nmi(),
> ./arch/x86/mm/kmemcheck/kmemcheck.c: WARN_ON_ONCE(in_nmi());
> ./arch/x86/mm/fault.c: WARN_ON_ONCE(in_nmi());
> ./arch/x86/mm/fault.c: WARN_ON_ONCE(in_nmi());
>
> ./mm/vmalloc.c: BUG_ON(in_nmi());
> ./lib/genalloc.c: BUG_ON(in_nmi());
> ./lib/genalloc.c: BUG_ON(in_nmi());
> ./include/linux/hardirq.h: BUG_ON(in_nmi());
>
> And some are probably less important:
>
> ./arch/x86/platform/uv/uv_nmi.c several locations here
> ./arch/m68k/mac/macints.c- printk("... pausing, press NMI to resume ...");
>
>
> Well, there are only a few. Maybe we could share the trace_seq buffer
> here.
>
> Of course, there is still the possibility of implementing a lockless
> buffer. But it would be much more complicated than the current one.
> I am not sure that we really want it.
Let me join the discussion.

In the past I've implemented a similar feature for x86 in the openvz kernel,
and now I'm thinking about it for arm.

Originally I thought that seizing all cpus one by one and printing from the
initiator was the best approach, and I had started preparing arguments against
an over-engineered printk... But it seems deferring the printk output isn't
that bad an idea; it is much easier and arch-independent.

Saving the context is very arch-specific. Printing is a problem too:
show_regs() is a horrible mess; on most arches it always prints something
from the current context. I'm thinking about cleaning this mess up, but
doing that for all arches will certainly take some time.
Instead of per-cpu buffers, printk might use part of the existing ring
buffer -- the initiator cpu allocates space for the target cpu and flushes
it into the common stream after it finishes printing. Probably this kind
of transactional model could also be used on a single cpu for multi-line
KERN_CONT.
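To illustrate, the API could have roughly this shape (purely hypothetical;
nothing like it exists in printk today, all names are made up):

struct printk_reservation {
        char    *buf;   /* slot carved out of the main log buffer */
        size_t  size;   /* reserved length */
        size_t  used;   /* filled so far */
};

/* Initiator: carve out a slot under logbuf_lock; may fail if full. */
int printk_reserve(struct printk_reservation *r, size_t size);

/* Fill the private slot; no lock needed once it is reserved. */
int printk_reserved_printf(struct printk_reservation *r, const char *fmt, ...);

/* Initiator: splice the slot into the common stream, drop the unused tail. */
void printk_commit(struct printk_reservation *r);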
>
>
> Best Regards,
> Petr
On Tue, 24 Jun 2014, Konstantin Khlebnikov wrote:
> Originally I thought that seizing all cpus one by one and printing from
> the initiator was the best approach, and I had started preparing arguments
> against an over-engineered printk...
By "seizing" I guess you mean sending IPI, right?
What do you do if you'd interrupt it in the middle of printk() in order
not to deadlock by trying to take the buffer lock on the initiator CPU?
--
Jiri Kosina
SUSE Labs
On Wed, Jun 25, 2014 at 2:01 PM, Jiri Kosina <[email protected]> wrote:
> On Tue, 24 Jun 2014, Konstantin Khlebnikov wrote:
>
>> Originally I thought that seizing all cpus one by one and printing from
>> the initiator was the best approach, and I had started preparing arguments
>> against an over-engineered printk...
>
> By "seizing" I guess you mean sending IPI, right?
>
> What do you do if you'd interrupt it in the middle of printk() in order
> not to deadlock by trying to take the buffer lock on the initiator CPU?
Yep, under "seize" I mean NMI IPI which gathering state of registers from cpu.
Also it can keep cpu in nmi handler while initiator cpu unwinds its stack trace.
Code of dump-all-stack in arch/sparc does something like that.
You're right, moving printk from one cpu to another doesn't fixes recursion,
but we could send NMI IPI under logbuf_lock. I don't think that this is good
approach, but it seems realistic. Of course all printing from other NMI
handlers must be done in some weird way.
I like the context-dependent printk routing from this patchset.
I'm just not quite happy with this straightforward implementation.
Rerouting the output into different buffers might serve different purposes;
it should make the code simpler instead of adding yet another knot.
For example, printk continuation might use it too.
>
> --
> Jiri Kosina
> SUSE Labs
On Wed 2014-06-25 15:04:41, Konstantin Khlebnikov wrote:
> On Wed, Jun 25, 2014 at 2:01 PM, Jiri Kosina <[email protected]> wrote:
> > On Tue, 24 Jun 2014, Konstantin Khlebnikov wrote:
>
> I like the context-dependent printk routing from this patchset.
> I'm just not quite happy with this straightforward implementation.
> Rerouting the output into different buffers might serve different purposes;
> it should make the code simpler instead of adding yet another knot.
> For example, printk continuation might use it too.
Note that we could not easily share the buffer between normal and NMI
context; we would end up with the same locking problems we have now.
In fact, any sharing is complicated when NMI is involved, especially
when we need to access the data from both NMI and normal context.
And we need to copy the data into the main ring buffer from normal context.
This patchset works only because arch_trigger_all_cpu_backtrace() is
called from normal context, and because all other calls are ignored until
the previous call is finished; see "backtrace_flag".
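For reference, that serialization in arch/x86/kernel/apic/hw_nmi.c looks
roughly like this (paraphrased, not an exact quote):

if (test_and_set_bit(0, &backtrace_flag))
        /* a backtrace is already in progress; ignore this request */
        return;

/* ... send the NMIs and wait for all CPUs to report, or time out ... */

clear_bit(0, &backtrace_flag);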
Best Regards,
Petr
On Tue 2014-06-24 17:32:15, Konstantin Khlebnikov wrote:
>
> Instead of per-cpu buffers, printk might use part of the existing ring
> buffer -- the initiator cpu allocates space for the target cpu and flushes
> it into the common stream after it finishes printing. Probably this kind
> of transactional model could also be used on a single cpu for multi-line
> KERN_CONT.
I wanted to think more about this.
The evident advantage is that we do not need extra buffers. It might also
be used instead of the cont buffer.
One challenge is that we do not know in advance how much space is needed.
If we do not reserve enough, we might not be able to reserve more later;
this would happen when the main ring buffer is locked by some blocked
process. If we reserve too much, we waste information, because we have to
remove old messages to make the space.

Also, we could not rotate the ring buffer over the reserved space, so we
might miss the last messages from the normal context when the system goes
down.
Another question is whether it could simplify the printk() code. The
reservation logic would probably be more complicated than the current
logic around the cont buffer.
Best Regards,
Petr