Hello
Was trying out current git (4.15-rc7) and the perf_fuzzer very quickly
will lock up my Haswell test machine so solidly that I don't get any debug
info, even with a serial console.
I'll try enabling various debug options to see if I can get a more useful
bug report.
Vince
* Vince Weaver <[email protected]> wrote:
> Hello
>
> Was trying out current git (4.15-rc7) and the perf_fuzzer very quickly
> will lock up my Haswell test machine so solidly that I don't get any debug
> info, even with a serial console.
That's pretty concerning ...
> I'll try enabling various debug options to see if I can get a more useful
> bug report.
Note that the page table isolation (PTI) feature has a number of effects on perf
and on NMI handlers, so one of the things to try would be to disable PTI.
Thanks,
Ingo
On Mon, 8 Jan 2018, Ingo Molnar wrote:
>
> Note that the page table isolation (PTI) feature has a number of effects on perf
> and on NMI handlers, so one of the things to try would be to disable PTI.
Yes, it seems to be a KPTI issue.
With KPTI disabled I can fuzz for a few hours, no problems.
With KPTI enabled the perf_fuzzer on this haswell machine will lock up
within minutes.
It's a hard lockup with nothing over the serial console. It is not
deterministic/reproducible either.
Vince
On Mon, Jan 08, 2018 at 03:29:42PM -0500, Vince Weaver wrote:
> On Mon, 8 Jan 2018, Ingo Molnar wrote:
> >
> > Note that the page table isolation (PTI) feature has a number of effects on perf
> > and on NMI handlers, so one of the things to try would be to disable PTI.
>
> Yes, it seems to be a KPTI issue.
>
> With KPTI disabled I can fuzz for a few hours, no problems.
>
> With KPTI enabled the perf_fuzzer on this haswell machine will lock up
> within minutes.
I'll go take a look...
On Tue, Jan 09, 2018 at 11:25:07AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 08, 2018 at 03:29:42PM -0500, Vince Weaver wrote:
> > On Mon, 8 Jan 2018, Ingo Molnar wrote:
> > >
> > > Note that the page table isolation (PTI) feature has a number of effects on perf
> > > and on NMI handlers, so one of the things to try would be to disable PTI.
> >
> > Yes, it seems to be a KPTI issue.
> >
> > With KPTI disabled I can fuzz for a few hours, no problems.
> >
> > With KPTI enabled the perf_fuzzer on this haswell machine will lock up
> > within minutes.
>
> I'll go take a look...
OK, so I'm running on an IVB-EP with PTI enabled. I insta triggered a
lockdep splat, after fixing that I get NMI watchdog triggers that look
like false positives; and I suspect:
[ 290.791675] perf: Dynamic interrupt throttling disabled, can hang your system!
is responsible for that. That is, subsequent stacktraces for the same
CPU are unrelated, which make it look like the CPU actually makes
progress.
So remind me again, how are you running that fuzzer? I'm running
./fast_repro99.sh as root.
On Tue, Jan 09, 2018 at 02:26:02PM +0100, Peter Zijlstra wrote:
> OK, so I'm running on an IVB-EP with PTI enabled. I insta triggered a
> lockdep splat
---
Subject: perf: Fix lock inversion between perf,trace,cpuhp
From: Peter Zijlstra <[email protected]>
Date: Tue Jan 9 13:10:30 CET 2018
Lockdep gifted us:
perf_trace_init()
#0 mutex_lock(&event_mutex)
perf_trace_event_init()
perf_trace_event_reg()
tp_event->class->reg() := tracepoint_probe_register
#1 mutex_lock(&tracepoints_mutex)
trace_point_add_func()
#2 static_key_enable()
#2 do_cpu_up()
perf_event_init_cpu()
#3 mutex_lock(&pmus_lock)
#4 mutex_lock(&ctx->mutex)
perf_event_task_disable()
mutex_lock(¤t->perf_event_mutex)
#4 ctx = perf_event_ctx_lock()
#5 perf_event_for_each_child()
do_exit()
task_work_run()
__fput()
perf_release()
perf_event_release_kernel()
#4 mutex_lock(&ctx->mutex)
#5 mutex_lock(&event->child_mutex)
free_event()
_free_event()
event->destroy() := perf_trace_destroy
#0 mutex_lock(&event_mutex);
Fix that by moving the free_event() out from under the locks.
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/events/core.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1231,6 +1231,10 @@ static void put_ctx(struct perf_event_co
* perf_event_context::lock
* perf_event::mmap_mutex
* mmap_sem
+ *
+ * cpu_hotplug_lock
+ * pmus_lock
+ * cpuctx->mutex / perf_event_context::mutex
*/
static struct perf_event_context *
perf_event_ctx_lock_nested(struct perf_event *event, int nesting)
@@ -4196,6 +4200,7 @@ int perf_event_release_kernel(struct per
{
struct perf_event_context *ctx = event->ctx;
struct perf_event *child, *tmp;
+ LIST_HEAD(free_list);
/*
* If we got here through err_file: fput(event_file); we will not have
@@ -4268,8 +4273,7 @@ int perf_event_release_kernel(struct per
struct perf_event, child_list);
if (tmp == child) {
perf_remove_from_context(child, DETACH_GROUP);
- list_del(&child->child_list);
- free_event(child);
+ list_move(&child->child_list, &free_list);
/*
* This matches the refcount bump in inherit_event();
* this can't be the last reference.
@@ -4284,6 +4288,11 @@ int perf_event_release_kernel(struct per
}
mutex_unlock(&event->child_mutex);
+ list_for_each_entry_safe(child, tmp, &free_list, child_list) {
+ list_del(&child->child_list);
+ free_event(child);
+ }
+
no_ctx:
put_event(event); /* Must be the 'last' reference */
return 0;
On Tue, 9 Jan 2018, Peter Zijlstra wrote:
> So remind me again, how are you running that fuzzer? I'm running
> ./fast_repro99.sh as root.
I'm running ./fast_repro98.sh on a regular haswell machine with paranoid
set to "0".
I'll try your patch and see if it makes a difference.
I can also try on a few other machines. I started setting up a fuzzing
cluster for this kind of thing but the multiport serial card I got for
logging the serial consoles crashes fairly quickly under load and I
haven't had time to track that down.
Vince
On Tue, Jan 09, 2018 at 08:44:23AM -0500, Vince Weaver wrote:
> On Tue, 9 Jan 2018, Peter Zijlstra wrote:
>
> > So remind me again, how are you running that fuzzer? I'm running
> > ./fast_repro99.sh as root.
>
> I'm running ./fast_repro98.sh on a regular haswell machine with paranoid
> set to "0".
>
> I'll try your patch and see if it makes a difference.
I suspect not, it shouldn't be PTI specific.
> I can also try on a few other machines. I started setting up a fuzzing
> cluster for this kind of thing but the multiport serial card I got for
> logging the serial consoles crashes fairly quickly under load and I
> haven't had time to track that down.
So I don't think HSW has PT, which is something I cannot test on my IVB.
OTOH HSW has PEBS real_ip, but that uses _less_ magic.
In any case, I found yet another lockdep splat, trying to figure out wth
to do about that. Will then continue running with fast_repro98.
On Tue, 9 Jan 2018, Peter Zijlstra wrote:
> > I'll try your patch and see if it makes a difference.
>
> I suspect not, it shouldn't be PTI specific.
yes, applying your patch didn't help, still locks up on the Haswell
machine.
Is there any debugging I could turn on that would help? I tried KASAN
but it didn't help. I think I have the regular lockdep stuff enabled.
alt-sysrq doesn't work either (or at least, the version using BREAK over
the serial console doesn't, I can maybe try hooking up a keyboard/display
to see if that helps).
Vince
On Tue, Jan 09, 2018 at 10:24:55AM -0500, Vince Weaver wrote:
> On Tue, 9 Jan 2018, Peter Zijlstra wrote:
>
> > > I'll try your patch and see if it makes a difference.
> >
> > I suspect not, it shouldn't be PTI specific.
>
> yes, applying your patch didn't help, still locks up on the Haswell
> machine.
So CONFIG_PAGE_TABLE_ISOLATION=y and booting with "pti=off" makes it
'work', right?
> Is there any debugging I could turn on that would help? I tried KASAN
> but it didn't help. I think I have the regular lockdep stuff enabled.
>
> alt-sysrq doesn't work either (or at least, the version using BREAK over
> the serial console doesn't, I can maybe try hooking up a keyboard/display
> to see if that helps).
The below is always my first try to get something out of the machine,
after that its printk() stuffing code to see how far we get..
In particular I'd start instrumenting the NMI entry_64.S code, because
that's really the biggest difference between PTI and !PTI :/ all rather
bothersome I'm afraid.
Really sucks I cannot as yet reproduce.
---
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5e486b6509e5..fc9021fd6e3c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1172,6 +1172,11 @@
parameter will force ia64_sal_cache_flush to call
ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.
+ force_early_printk
+ Forcefully uses early_console (as per earlyprintk=)
+ usage for regular printk, bypassing everything,
+ including the syslog (dmesg will be empty).
+
forcepae [X86-32]
Forcefully enable Physical Address Extension (PAE).
Many Pentium M systems disable PAE but may have a
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index b9006617710f..c1a4ed17c22c 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -365,6 +365,75 @@ __packed __aligned(4)
#endif
;
+#ifdef CONFIG_EARLY_PRINTK
+struct console *early_console;
+
+static bool __read_mostly force_early_printk;
+
+static int __init force_early_printk_setup(char *str)
+{
+ force_early_printk = true;
+ return 0;
+}
+early_param("force_early_printk", force_early_printk_setup);
+
+static int early_printk_cpu = -1;
+
+static int early_vprintk(const char *fmt, va_list args)
+{
+ int n, cpu, old;
+ char buf[512];
+
+ cpu = get_cpu();
+ /*
+ * Test-and-Set inter-cpu spinlock with recursion.
+ */
+ for (;;) {
+ /*
+ * c-cas to avoid the exclusive bouncing on spin.
+ * Depends on the memory barrier implied by cmpxchg
+ * for ACQUIRE semantics.
+ */
+ old = READ_ONCE(early_printk_cpu);
+ if (old == -1) {
+ old = cmpxchg(&early_printk_cpu, -1, cpu);
+ if (old == -1)
+ break;
+ }
+ /*
+ * Allow recursion for interrupts and the like.
+ */
+ if (old == cpu)
+ break;
+
+ cpu_relax();
+ }
+
+ n = vscnprintf(buf, sizeof(buf), fmt, args);
+ early_console->write(early_console, buf, n);
+
+ /*
+ * Unlock -- in case @old == @cpu, this is a no-op.
+ */
+ smp_store_release(&early_printk_cpu, old);
+ put_cpu();
+
+ return n;
+}
+
+asmlinkage __visible void early_printk(const char *fmt, ...)
+{
+ va_list ap;
+
+ if (!early_console)
+ return;
+
+ va_start(ap, fmt);
+ early_vprintk(fmt, ap);
+ va_end(ap);
+}
+#endif
+
/*
* The logbuf_lock protects kmsg buffer, indices, counters. This can be taken
* within the scheduler's rq lock. It must be released before calling
@@ -1692,6 +1761,16 @@ asmlinkage int vprintk_emit(int facility, int level,
int printed_len;
bool in_sched = false;
+#ifdef CONFIG_KGDB_KDB
+ if (unlikely(kdb_trap_printk && kdb_printf_cpu < 0))
+ return vkdb_printf(KDB_MSGSRC_PRINTK, fmt, args);
+#endif
+
+#ifdef CONFIG_EARLY_PRINTK
+ if (force_early_printk && early_console)
+ return early_vprintk(fmt, args);
+#endif
+
if (level == LOGLEVEL_SCHED) {
level = LOGLEVEL_DEFAULT;
in_sched = true;
@@ -1784,18 +1863,7 @@ EXPORT_SYMBOL(printk_emit);
int vprintk_default(const char *fmt, va_list args)
{
- int r;
-
-#ifdef CONFIG_KGDB_KDB
- /* Allow to pass printk() to kdb but avoid a recursion. */
- if (unlikely(kdb_trap_printk && kdb_printf_cpu < 0)) {
- r = vkdb_printf(KDB_MSGSRC_PRINTK, fmt, args);
- return r;
- }
-#endif
- r = vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
-
- return r;
+ return vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
}
EXPORT_SYMBOL_GPL(vprintk_default);
@@ -1826,7 +1894,12 @@ asmlinkage __visible int printk(const char *fmt, ...)
int r;
va_start(args, fmt);
- r = vprintk_func(fmt, args);
+#ifdef CONFIG_EARLY_PRINTK
+ if (force_early_printk && early_console)
+ r = vprintk_default(fmt, args);
+ else
+#endif
+ r = vprintk_func(fmt, args);
va_end(args);
return r;
@@ -1863,26 +1936,6 @@ static bool suppress_message_printing(int level) { return false; }
#endif /* CONFIG_PRINTK */
-#ifdef CONFIG_EARLY_PRINTK
-struct console *early_console;
-
-asmlinkage __visible void early_printk(const char *fmt, ...)
-{
- va_list ap;
- char buf[512];
- int n;
-
- if (!early_console)
- return;
-
- va_start(ap, fmt);
- n = vscnprintf(buf, sizeof(buf), fmt, ap);
- va_end(ap);
-
- early_console->write(early_console, buf, n);
-}
-#endif
-
static int __add_preferred_console(char *name, int idx, char *options,
char *brl_options)
{
On Tue, 9 Jan 2018, Peter Zijlstra wrote:
> So CONFIG_PAGE_TABLE_ISOLATION=y and booting with "pti=off" makes it
> 'work', right?
yes. Previously I was changing CONFIG_PAGE_TABLE_ISOLATION and
recompiling, but just now I booted with it set to yes and pti=off and the
fuzzer has been running fine for a half hour (usually it crashes in under
5 minutes).
I did see these in the logs which I don't think I've seen before.
WARNING: stack recursion on stack type 2
WARNING: can't dereference iret registers at 000000000783fea8 for ip paranoid_entry+0x2e/0x90
WARNING: can't dereference registers at 00000000f0698d17 for ip paranoid_entry+0x4c/0x90
WARNING: stack going in the wrong direction? ip=native_sched_clock+0x9/0x90
> The below is always my first try to get something out of the machine,
> after that its printk() stuffing code to see how far we get..
>
> In particular I'd start instrumenting the NMI entry_64.S code, because
> that's really the biggest difference between PTI and !PTI :/ all rather
> bothersome I'm afraid.
I'll try that next. Also getting a few other machines up and into a
state that I can start fuzzing on them.
(extra challenge, the lab my machines is in possibly has a leak in the
roof, and they're calling for an inch of rain on top of 3 feet of
existing snow so I might have to shut everything down and relocate on
short notice).
Vince
On Tue, Jan 09, 2018 at 10:56:52AM -0500, Vince Weaver wrote:
> On Tue, 9 Jan 2018, Peter Zijlstra wrote:
>
> > So CONFIG_PAGE_TABLE_ISOLATION=y and booting with "pti=off" makes it
> > 'work', right?
>
> yes. Previously I was changing CONFIG_PAGE_TABLE_ISOLATION and
> recompiling, but just now I booted with it set to yes and pti=off and the
> fuzzer has been running fine for a half hour (usually it crashes in under
> 5 minutes).
>
> I did see these in the logs which I don't think I've seen before.
>
> WARNING: stack recursion on stack type 2
> WARNING: can't dereference iret registers at 000000000783fea8 for ip paranoid_entry+0x2e/0x90
> WARNING: can't dereference registers at 00000000f0698d17 for ip paranoid_entry+0x4c/0x90
> WARNING: stack going in the wrong direction? ip=native_sched_clock+0x9/0x90
I've seen that last one, but not the ones before. Josh, this isn't
healty, right? :-)
> > The below is always my first try to get something out of the machine,
> > after that its printk() stuffing code to see how far we get..
> >
> > In particular I'd start instrumenting the NMI entry_64.S code, because
> > that's really the biggest difference between PTI and !PTI :/ all rather
> > bothersome I'm afraid.
>
> I'll try that next. Also getting a few other machines up and into a
> state that I can start fuzzing on them.
>
> (extra challenge, the lab my machines is in possibly has a leak in the
> roof, and they're calling for an inch of rain on top of 3 feet of
> existing snow so I might have to shut everything down and relocate on
> short notice).
Ouch, good luck with that!
On Tue, Jan 09, 2018 at 04:12:53PM +0100, Peter Zijlstra wrote:
> In any case, I found yet another lockdep splat, trying to figure out wth
> to do about that.
An of course, testing this one yields yet another lockdep splat..
onwards to #3 :/
---
Subject: perf: Fix another perf,trace,cpuhp lock inversion
From: Peter Zijlstra <[email protected]>
Date: Tue Jan 9 17:07:59 CET 2018
Lockdep complained about:
perf_trace_init()
#0 mutex_lock(&event_mutex)
perf_trace_event_init()
perf_trace_event_reg()
tp_event->class->reg() := tracepoint_probe_register
#1 mutex_lock(&tracepoints_mutex)
trace_point_add_func()
#2 static_key_enable()
#2 do_cpu_up()
perf_event_init_cpu()
#3 mutex_lock(&pmus_lock)
#4 mutex_lock(&ctx->mutex)
perf_ioctl()
#4 ctx = perf_event_ctx_lock()
_perf_iotcl()
ftrace_profile_set_filter()
#0 mutex_lock(&event_mutex)
Fudge it for now by noting that the tracepoint state does not depend
on the event <-> context relation. Ugly though :/
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/events/core.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8525,6 +8525,29 @@ perf_event_set_addr_filter(struct perf_e
return ret;
}
+static int
+perf_tracepoint_set_filter(struct perf_event *event, char *filter_str)
+{
+ struct perf_event_context *ctx = event->ctx;
+ int ret;
+
+ /*
+ * Beware, here be dragons!!
+ *
+ * the tracepoint muck will deadlock against ctx->mutex, but the tracepoint
+ * stuff does not actually need it. So temporarily drop ctx->mutex. As per
+ * perf_event_ctx_lock() we already have a reference on ctx.
+ *
+ * This can result in event getting moved to a different ctx, but that
+ * does not affect the tracepoint state.
+ */
+ mutex_unlock(&ctx->mutex);
+ ret = ftrace_profile_set_filter(event, event->attr.config, filter_str);
+ mutex_lock(&ctx->mutex);
+
+ return ret;
+}
+
static int perf_event_set_filter(struct perf_event *event, void __user *arg)
{
char *filter_str;
@@ -8541,8 +8564,7 @@ static int perf_event_set_filter(struct
if (IS_ENABLED(CONFIG_EVENT_TRACING) &&
event->attr.type == PERF_TYPE_TRACEPOINT)
- ret = ftrace_profile_set_filter(event, event->attr.config,
- filter_str);
+ ret = perf_tracepoint_set_filter(event, filter_str);
else if (has_addr_filter(event))
ret = perf_event_set_addr_filter(event, filter_str);
* Peter Zijlstra <[email protected]> wrote:
> On Tue, Jan 09, 2018 at 10:24:55AM -0500, Vince Weaver wrote:
> > On Tue, 9 Jan 2018, Peter Zijlstra wrote:
> >
> > > > I'll try your patch and see if it makes a difference.
> > >
> > > I suspect not, it shouldn't be PTI specific.
> >
> > yes, applying your patch didn't help, still locks up on the Haswell
> > machine.
>
> So CONFIG_PAGE_TABLE_ISOLATION=y and booting with "pti=off" makes it
> 'work', right?
>
> > Is there any debugging I could turn on that would help? I tried KASAN
> > but it didn't help. I think I have the regular lockdep stuff enabled.
> >
> > alt-sysrq doesn't work either (or at least, the version using BREAK over
> > the serial console doesn't, I can maybe try hooking up a keyboard/display
> > to see if that helps).
>
> The below is always my first try to get something out of the machine,
> after that its printk() stuffing code to see how far we get..
>
> In particular I'd start instrumenting the NMI entry_64.S code, because
> that's really the biggest difference between PTI and !PTI :/ all rather
> bothersome I'm afraid.
>
> Really sucks I cannot as yet reproduce.
>
> ---
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 5e486b6509e5..fc9021fd6e3c 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1172,6 +1172,11 @@
> parameter will force ia64_sal_cache_flush to call
> ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.
>
> + force_early_printk
> + Forcefully uses early_console (as per earlyprintk=)
> + usage for regular printk, bypassing everything,
> + including the syslog (dmesg will be empty).
BTW., this looks very useful - please consider pushing it upstream if it does not
hurt any of the other syslog usecases.
Thanks,
Ingo
On Tue, Jan 09, 2018 at 05:16:06PM +0100, Ingo Molnar wrote:
> > + force_early_printk
> > + Forcefully uses early_console (as per earlyprintk=)
> > + usage for regular printk, bypassing everything,
> > + including the syslog (dmesg will be empty).
>
> BTW., this looks very useful - please consider pushing it upstream if it does not
> hurt any of the other syslog usecases.
I've posted it a number of times, the printk people didn't hate it but
haven't merged it either. Instead they continue working on making it a
bigger trainwreck...
I'll post it again later when we're not chasing spectres and meltdowns.
On Tue, Jan 09, 2018 at 05:05:51PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 09, 2018 at 10:56:52AM -0500, Vince Weaver wrote:
> > On Tue, 9 Jan 2018, Peter Zijlstra wrote:
> >
> > > So CONFIG_PAGE_TABLE_ISOLATION=y and booting with "pti=off" makes it
> > > 'work', right?
> >
> > yes. Previously I was changing CONFIG_PAGE_TABLE_ISOLATION and
> > recompiling, but just now I booted with it set to yes and pti=off and the
> > fuzzer has been running fine for a half hour (usually it crashes in under
> > 5 minutes).
> >
> > I did see these in the logs which I don't think I've seen before.
> >
> > WARNING: stack recursion on stack type 2
> > WARNING: can't dereference iret registers at 000000000783fea8 for ip paranoid_entry+0x2e/0x90
> > WARNING: can't dereference registers at 00000000f0698d17 for ip paranoid_entry+0x4c/0x90
> > WARNING: stack going in the wrong direction? ip=native_sched_clock+0x9/0x90
>
> I've seen that last one, but not the ones before. Josh, this isn't
> healty, right? :-)
Looks like some funky things may be happening on the stack there, though
we have seen some issues related to ORC getting confused by paravirt
patching. I'm a bit swamped at the moment but I'll look deeper as time
permits...
--
Josh
I tried it again with your patch and force_early_printk, no luck.
I can start dropping printks around the NMI code but I feel like I don't
really know what I'm doing.
Also I managed to hit (presumably) the same bug on a skylake machine.
That one doesn't have a serial cable hooked up to it, I'll try to see if I
can find one.
I am running debian-unstable with gcc 7.2 if it makes a difference.
Vince
On Tue, 9 Jan 2018 17:14:00 +0100
Peter Zijlstra <[email protected]> wrote:
> On Tue, Jan 09, 2018 at 04:12:53PM +0100, Peter Zijlstra wrote:
>
> > In any case, I found yet another lockdep splat, trying to figure out wth
> > to do about that.
>
> An of course, testing this one yields yet another lockdep splat..
> onwards to #3 :/
>
> ---
> Subject: perf: Fix another perf,trace,cpuhp lock inversion
> From: Peter Zijlstra <[email protected]>
> Date: Tue Jan 9 17:07:59 CET 2018
>
> Lockdep complained about:
>
> perf_trace_init()
> #0 mutex_lock(&event_mutex)
> perf_trace_event_init()
> perf_trace_event_reg()
> tp_event->class->reg() := tracepoint_probe_register
> #1 mutex_lock(&tracepoints_mutex)
> trace_point_add_func()
> #2 static_key_enable()
>
>
>
> #2 do_cpu_up()
> perf_event_init_cpu()
> #3 mutex_lock(&pmus_lock)
> #4 mutex_lock(&ctx->mutex)
>
>
> perf_ioctl()
> #4 ctx = perf_event_ctx_lock()
> _perf_iotcl()
> ftrace_profile_set_filter()
> #0 mutex_lock(&event_mutex)
>
>
> Fudge it for now by noting that the tracepoint state does not depend
> on the event <-> context relation. Ugly though :/
>
Looking at ftrace_profile_set_filter(), I see it starts with:
mutex_lock(&event_mutex);
How much of a big deal would it be if we move taking event_mutex() into
perf_ioctl(), and then make ftrace_profile_set_filter() not take the
event_mutex. This is the only place that function is used. Would that
work?
Below is a patch (totally untested, not even compiled)
-- Steve
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4df5b695bf0d..9fac7ac14b32 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4741,9 +4741,11 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
struct perf_event_context *ctx;
long ret;
+ mutex_lock(&event_mutex);
ctx = perf_event_ctx_lock(event);
ret = _perf_ioctl(event, cmd, arg);
perf_event_ctx_unlock(event, ctx);
+ mutex_unlock(&event_mutex);
return ret;
}
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 61e7f0678d33..46c2e5d20662 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -2256,6 +2256,7 @@ static int ftrace_function_set_filter(struct perf_event *event,
}
#endif /* CONFIG_FUNCTION_TRACER */
+/* Must have event_mutex held */
int ftrace_profile_set_filter(struct perf_event *event, int event_id,
char *filter_str)
{
@@ -2263,17 +2264,14 @@ int ftrace_profile_set_filter(struct perf_event *event, int event_id,
struct event_filter *filter;
struct trace_event_call *call;
- mutex_lock(&event_mutex);
+ lockdep_assert_held(&event_mutex);
call = event->tp_event;
-
- err = -EINVAL;
if (!call)
- goto out_unlock;
+ return -EINVAL;
- err = -EEXIST;
if (event->filter)
- goto out_unlock;
+ return -EEXIST;
err = create_filter(call, filter_str, false, &filter);
if (err)
@@ -2288,9 +2286,6 @@ int ftrace_profile_set_filter(struct perf_event *event, int event_id,
if (err || ftrace_event_is_function(call))
__free_filter(filter);
-out_unlock:
- mutex_unlock(&event_mutex);
-
return err;
}
On Tue, Jan 09, 2018 at 12:53:46PM -0500, Steven Rostedt wrote:
> Looking at ftrace_profile_set_filter(), I see it starts with:
>
> mutex_lock(&event_mutex);
>
> How much of a big deal would it be if we move taking event_mutex() into
> perf_ioctl(), and then make ftrace_profile_set_filter() not take the
> event_mutex. This is the only place that function is used. Would that
> work?
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 4df5b695bf0d..9fac7ac14b32 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4741,9 +4741,11 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> struct perf_event_context *ctx;
> long ret;
>
> + mutex_lock(&event_mutex);
> ctx = perf_event_ctx_lock(event);
> ret = _perf_ioctl(event, cmd, arg);
> perf_event_ctx_unlock(event, ctx);
> + mutex_unlock(&event_mutex);
>
> return ret;
> }
This would globally serialize all perf_ioctl()'s, also that event_mutex
is for trace_events and really does not belong in perf.
So no, I really rather would not do this.
The alternative I was thinking of was lifting the cpuhp lock out from
under event_mutex, that would also break the chain, but would probably
be lots of work for trace bits.
On Tue, 9 Jan 2018 19:02:07 +0100
Peter Zijlstra <[email protected]> wrote:
> This would globally serialize all perf_ioctl()'s, also that event_mutex
> is for trace_events and really does not belong in perf.
>
> So no, I really rather would not do this.
>
> The alternative I was thinking of was lifting the cpuhp lock out from
> under event_mutex, that would also break the chain, but would probably
> be lots of work for trace bits.
I think we tried that once, and it became an never ending whack a mole
process. I prefer not to do that either.
How bad is your work around patch that you posted?
-- Steve
On Tue, Jan 09, 2018 at 01:09:57PM -0500, Steven Rostedt wrote:
> On Tue, 9 Jan 2018 19:02:07 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > This would globally serialize all perf_ioctl()'s, also that event_mutex
> > is for trace_events and really does not belong in perf.
> >
> > So no, I really rather would not do this.
> >
> > The alternative I was thinking of was lifting the cpuhp lock out from
> > under event_mutex, that would also break the chain, but would probably
> > be lots of work for trace bits.
>
> I think we tried that once, and it became an never ending whack a mole
> process. I prefer not to do that either.
Figures..
> How bad is your work around patch that you posted?
Not too bad, just tricky..
On Tue, 9 Jan 2018, Vince Weaver wrote:
> Also I managed to hit (presumably) the same bug on a skylake machine.
> That one doesn't have a serial cable hooked up to it, I'll try to see if I
> can find one.
>
> I am running debian-unstable with gcc 7.2 if it makes a difference.
I built the kernel with gcc 6.4 in case it would make a difference; it
didn't.
Also tried turning off the NMI watchdog in case that was interfering,
didn't help.
Vince
On Tue, Jan 09, 2018 at 11:07:16AM -0600, Josh Poimboeuf wrote:
> On Tue, Jan 09, 2018 at 05:05:51PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 09, 2018 at 10:56:52AM -0500, Vince Weaver wrote:
> > > On Tue, 9 Jan 2018, Peter Zijlstra wrote:
> > >
> > > > So CONFIG_PAGE_TABLE_ISOLATION=y and booting with "pti=off" makes it
> > > > 'work', right?
> > >
> > > yes. Previously I was changing CONFIG_PAGE_TABLE_ISOLATION and
> > > recompiling, but just now I booted with it set to yes and pti=off and the
> > > fuzzer has been running fine for a half hour (usually it crashes in under
> > > 5 minutes).
For the crash, you might try enabling CONFIG_DEBUG_ENTRY and seeing if
that gives you any output.
> > > I did see these in the logs which I don't think I've seen before.
> > >
> > > WARNING: stack recursion on stack type 2
> > > WARNING: can't dereference iret registers at 000000000783fea8 for ip paranoid_entry+0x2e/0x90
> > > WARNING: can't dereference registers at 00000000f0698d17 for ip paranoid_entry+0x4c/0x90
> > > WARNING: stack going in the wrong direction? ip=native_sched_clock+0x9/0x90
This all looks very weird. The stack pointers -- 000000000783fea8 and
00000000f0698d17 -- are obviously very wrong. I will try to recreate
locally.
--
Josh
On Tue, Jan 09, 2018 at 08:44:23AM -0500, Vince Weaver wrote:
> On Tue, 9 Jan 2018, Peter Zijlstra wrote:
>
> > So remind me again, how are you running that fuzzer? I'm running
> > ./fast_repro99.sh as root.
>
> I'm running ./fast_repro98.sh on a regular haswell machine with paranoid
> set to "0".
I'm seeing things like:
Cannot open /sys/kernel/tracing/kprobe_events
this is likely caused by me not having anything mounted there. Rostedt
provided the magic incantation to make that work, I'll go try now.
At least I seem to have cured all lockdep splats.. of course, the moment
I hit send on this someone bad is bound to happen.
On Thu, 11 Jan 2018, Peter Zijlstra wrote:
> I'm seeing things like:
>
> Cannot open /sys/kernel/tracing/kprobe_events
>
> this is likely caused by me not having anything mounted there. Rostedt
> provided the magic incantation to make that work, I'll go try now.
The perf_fuzzer krpobe code is possibly really out of date, I haven't
really kept up with recent developments and I don't usually fuzz as root.
> At least I seem to have cured all lockdep splats.. of course, the moment
> I hit send on this someone bad is bound to happen.
Should I be trying any of those patches?
Currently I can still lock my machines pretty quickly. I have been doing
some more tests. Machines don't crash if paranoid is "2" but will if
paranoid is "1".
Vince
On Thu, Jan 11, 2018 at 10:13:53AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 09, 2018 at 08:44:23AM -0500, Vince Weaver wrote:
> > On Tue, 9 Jan 2018, Peter Zijlstra wrote:
> >
> > > So remind me again, how are you running that fuzzer? I'm running
> > > ./fast_repro99.sh as root.
> >
> > I'm running ./fast_repro98.sh on a regular haswell machine with paranoid
> > set to "0".
>
> I'm seeing things like:
>
> Cannot open /sys/kernel/tracing/kprobe_events
>
> this is likely caused by me not having anything mounted there. Rostedt
> provided the magic incantation to make that work, I'll go try now.
>
> At least I seem to have cured all lockdep splats.. of course, the moment
> I hit send on this someone bad is bound to happen.
So while the IVB-EP (6/62/4) goes wobbly after a while, I also started
fuzzing on my HSW-EX (6/63/4) and there it runs without issue for over
30 minutes.
(I get an unchecked MSR splat from the RAPL pmu, but that is non-fatal)
But so far no weird hangs.. running tip/master + below.
Sadly the only other HSW I have is a laptop, and those are absolutely
useless for debugging (no serial :/) but I'll go see if I can reproduce.
FWIW these test-boxes are running debian/testing
---
Documentation/admin-guide/kernel-parameters.txt | 5 +
arch/x86/events/intel/ds.c | 33 ++++---
kernel/events/core.c | 47 +++++++++-
kernel/locking/lockdep.c | 10 ++
kernel/printk/printk.c | 119 +++++++++++++++++-------
kernel/smp.c | 10 ++
6 files changed, 171 insertions(+), 53 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8bc512782bd2..7fa1a59461f7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1169,6 +1169,11 @@
parameter will force ia64_sal_cache_flush to call
ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.
+ force_early_printk
+ Forcefully uses early_console (as per earlyprintk=)
+ usage for regular printk, bypassing everything,
+ including the syslog (dmesg will be empty).
+
forcepae [X86-32]
Forcefully enable Physical Address Extension (PAE).
Many Pentium M systems disable PAE but may have a
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 8156e47da7ba..18c25ab28557 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -372,10 +372,9 @@ static int alloc_pebs_buffer(int cpu)
static void release_pebs_buffer(int cpu)
{
struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
- struct debug_store *ds = hwev->ds;
void *cea;
- if (!ds || !x86_pmu.pebs)
+ if (!x86_pmu.pebs)
return;
kfree(per_cpu(insn_buffer, cpu));
@@ -384,7 +383,6 @@ static void release_pebs_buffer(int cpu)
/* Clear the fixmap */
cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.pebs_buffer;
ds_clear_cea(cea, x86_pmu.pebs_buffer_size);
- ds->pebs_buffer_base = 0;
dsfree_pages(hwev->ds_pebs_vaddr, x86_pmu.pebs_buffer_size);
hwev->ds_pebs_vaddr = NULL;
}
@@ -419,16 +417,14 @@ static int alloc_bts_buffer(int cpu)
static void release_bts_buffer(int cpu)
{
struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
- struct debug_store *ds = hwev->ds;
void *cea;
- if (!ds || !x86_pmu.bts)
+ if (!x86_pmu.bts)
return;
/* Clear the fixmap */
cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.bts_buffer;
ds_clear_cea(cea, BTS_BUFFER_SIZE);
- ds->bts_buffer_base = 0;
dsfree_pages(hwev->ds_bts_vaddr, BTS_BUFFER_SIZE);
hwev->ds_bts_vaddr = NULL;
}
@@ -454,16 +450,22 @@ void release_ds_buffers(void)
if (!x86_pmu.bts && !x86_pmu.pebs)
return;
- get_online_cpus();
- for_each_online_cpu(cpu)
+ for_each_possible_cpu(cpu)
+ release_ds_buffer(cpu);
+
+ for_each_possible_cpu(cpu) {
+ /*
+ * Again, ignore errors from offline CPUs, they will no longer
+ * observe cpu_hw_events.ds and not program the DS_AREA when
+ * they come up.
+ */
fini_debug_store_on_cpu(cpu);
+ }
for_each_possible_cpu(cpu) {
release_pebs_buffer(cpu);
release_bts_buffer(cpu);
- release_ds_buffer(cpu);
}
- put_online_cpus();
}
void reserve_ds_buffers(void)
@@ -483,8 +485,6 @@ void reserve_ds_buffers(void)
if (!x86_pmu.pebs)
pebs_err = 1;
- get_online_cpus();
-
for_each_possible_cpu(cpu) {
if (alloc_ds_buffer(cpu)) {
bts_err = 1;
@@ -521,11 +521,14 @@ void reserve_ds_buffers(void)
if (x86_pmu.pebs && !pebs_err)
x86_pmu.pebs_active = 1;
- for_each_online_cpu(cpu)
+ for_each_possible_cpu(cpu) {
+ /*
+ * Ignores wrmsr_on_cpu() errors for offline CPUs they
+ * will get this call through intel_pmu_cpu_starting().
+ */
init_debug_store_on_cpu(cpu);
+ }
}
-
- put_online_cpus();
}
/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4e1a1bf8d867..fb079be8971a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1231,6 +1231,10 @@ static void put_ctx(struct perf_event_context *ctx)
* perf_event_context::lock
* perf_event::mmap_mutex
* mmap_sem
+ *
+ * cpu_hotplug_lock
+ * pmus_lock
+ * cpuctx->mutex / perf_event_context::mutex
*/
static struct perf_event_context *
perf_event_ctx_lock_nested(struct perf_event *event, int nesting)
@@ -4196,6 +4200,7 @@ int perf_event_release_kernel(struct perf_event *event)
{
struct perf_event_context *ctx = event->ctx;
struct perf_event *child, *tmp;
+ LIST_HEAD(free_list);
/*
* If we got here through err_file: fput(event_file); we will not have
@@ -4268,8 +4273,7 @@ int perf_event_release_kernel(struct perf_event *event)
struct perf_event, child_list);
if (tmp == child) {
perf_remove_from_context(child, DETACH_GROUP);
- list_del(&child->child_list);
- free_event(child);
+ list_move(&child->child_list, &free_list);
/*
* This matches the refcount bump in inherit_event();
* this can't be the last reference.
@@ -4284,6 +4288,11 @@ int perf_event_release_kernel(struct perf_event *event)
}
mutex_unlock(&event->child_mutex);
+ list_for_each_entry_safe(child, tmp, &free_list, child_list) {
+ list_del(&child->child_list);
+ free_event(child);
+ }
+
no_ctx:
put_event(event); /* Must be the 'last' reference */
return 0;
@@ -8526,6 +8535,29 @@ perf_event_set_addr_filter(struct perf_event *event, char *filter_str)
return ret;
}
+static int
+perf_tracepoint_set_filter(struct perf_event *event, char *filter_str)
+{
+ struct perf_event_context *ctx = event->ctx;
+ int ret;
+
+ /*
+ * Beware, here be dragons!!
+ *
+ * the tracepoint muck will deadlock against ctx->mutex, but the tracepoint
+ * stuff does not actually need it. So temporarily drop ctx->mutex. As per
+ * perf_event_ctx_lock() we already have a reference on ctx.
+ *
+ * This can result in event getting moved to a different ctx, but that
+ * does not affect the tracepoint state.
+ */
+ mutex_unlock(&ctx->mutex);
+ ret = ftrace_profile_set_filter(event, event->attr.config, filter_str);
+ mutex_lock(&ctx->mutex);
+
+ return ret;
+}
+
static int perf_event_set_filter(struct perf_event *event, void __user *arg)
{
char *filter_str;
@@ -8542,8 +8574,7 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
if (IS_ENABLED(CONFIG_EVENT_TRACING) &&
event->attr.type == PERF_TYPE_TRACEPOINT)
- ret = ftrace_profile_set_filter(event, event->attr.config,
- filter_str);
+ ret = perf_tracepoint_set_filter(event, filter_str);
else if (has_addr_filter(event))
ret = perf_event_set_addr_filter(event, filter_str);
@@ -9178,7 +9209,13 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
if (!try_module_get(pmu->module))
return -ENODEV;
- if (event->group_leader != event) {
+ /*
+ * A number of pmu->event_init() methods iterate the sibling_list to,
+ * for example, validate if the group fits on the PMU. Therefore,
+ * if this is a sibling event, acquire the ctx->mutex to protect
+ * the sibling_list.
+ */
+ if (event->group_leader != event && pmu->task_ctx_nr != perf_sw_context) {
/*
* This ctx->mutex can nest when we're called through
* inheritance. See the perf_event_ctx_lock_nested() comment.
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 5fa1324a4f29..fb81dc92aaa6 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2668,8 +2668,12 @@ check_usage_backwards(struct task_struct *curr, struct held_lock *this,
this, 0, irqclass);
}
+DECLARE_PER_CPU(struct cpumask *, smp_call_mask);
+
void print_irqtrace_events(struct task_struct *curr)
{
+ struct cpumask *__mask = this_cpu_read(smp_call_mask);
+
printk("irq event stamp: %u\n", curr->irq_events);
printk("hardirqs last enabled at (%u): [<%p>] %pS\n",
curr->hardirq_enable_event, (void *)curr->hardirq_enable_ip,
@@ -2683,6 +2687,12 @@ void print_irqtrace_events(struct task_struct *curr)
printk("softirqs last disabled at (%u): [<%p>] %pS\n",
curr->softirq_disable_event, (void *)curr->softirq_disable_ip,
(void *)curr->softirq_disable_ip);
+
+ if (__mask) {
+ printk("smp_function_call mask: %*pbl\n", cpumask_pr_args(__mask));
+ } else {
+ printk("smp_function_call mask: (NULL)\n");
+ }
}
static int HARDIRQ_verbose(struct lock_class *class)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index b9006617710f..c1a4ed17c22c 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -365,6 +365,75 @@ __packed __aligned(4)
#endif
;
+#ifdef CONFIG_EARLY_PRINTK
+struct console *early_console;
+
+static bool __read_mostly force_early_printk;
+
+static int __init force_early_printk_setup(char *str)
+{
+ force_early_printk = true;
+ return 0;
+}
+early_param("force_early_printk", force_early_printk_setup);
+
+static int early_printk_cpu = -1;
+
+static int early_vprintk(const char *fmt, va_list args)
+{
+ int n, cpu, old;
+ char buf[512];
+
+ cpu = get_cpu();
+ /*
+ * Test-and-Set inter-cpu spinlock with recursion.
+ */
+ for (;;) {
+ /*
+ * c-cas to avoid the exclusive bouncing on spin.
+ * Depends on the memory barrier implied by cmpxchg
+ * for ACQUIRE semantics.
+ */
+ old = READ_ONCE(early_printk_cpu);
+ if (old == -1) {
+ old = cmpxchg(&early_printk_cpu, -1, cpu);
+ if (old == -1)
+ break;
+ }
+ /*
+ * Allow recursion for interrupts and the like.
+ */
+ if (old == cpu)
+ break;
+
+ cpu_relax();
+ }
+
+ n = vscnprintf(buf, sizeof(buf), fmt, args);
+ early_console->write(early_console, buf, n);
+
+ /*
+ * Unlock -- in case @old == @cpu, this is a no-op.
+ */
+ smp_store_release(&early_printk_cpu, old);
+ put_cpu();
+
+ return n;
+}
+
+asmlinkage __visible void early_printk(const char *fmt, ...)
+{
+ va_list ap;
+
+ if (!early_console)
+ return;
+
+ va_start(ap, fmt);
+ early_vprintk(fmt, ap);
+ va_end(ap);
+}
+#endif
+
/*
* The logbuf_lock protects kmsg buffer, indices, counters. This can be taken
* within the scheduler's rq lock. It must be released before calling
@@ -1692,6 +1761,16 @@ asmlinkage int vprintk_emit(int facility, int level,
int printed_len;
bool in_sched = false;
+#ifdef CONFIG_KGDB_KDB
+ if (unlikely(kdb_trap_printk && kdb_printf_cpu < 0))
+ return vkdb_printf(KDB_MSGSRC_PRINTK, fmt, args);
+#endif
+
+#ifdef CONFIG_EARLY_PRINTK
+ if (force_early_printk && early_console)
+ return early_vprintk(fmt, args);
+#endif
+
if (level == LOGLEVEL_SCHED) {
level = LOGLEVEL_DEFAULT;
in_sched = true;
@@ -1784,18 +1863,7 @@ EXPORT_SYMBOL(printk_emit);
int vprintk_default(const char *fmt, va_list args)
{
- int r;
-
-#ifdef CONFIG_KGDB_KDB
- /* Allow to pass printk() to kdb but avoid a recursion. */
- if (unlikely(kdb_trap_printk && kdb_printf_cpu < 0)) {
- r = vkdb_printf(KDB_MSGSRC_PRINTK, fmt, args);
- return r;
- }
-#endif
- r = vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
-
- return r;
+ return vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
}
EXPORT_SYMBOL_GPL(vprintk_default);
@@ -1826,7 +1894,12 @@ asmlinkage __visible int printk(const char *fmt, ...)
int r;
va_start(args, fmt);
- r = vprintk_func(fmt, args);
+#ifdef CONFIG_EARLY_PRINTK
+ if (force_early_printk && early_console)
+ r = vprintk_default(fmt, args);
+ else
+#endif
+ r = vprintk_func(fmt, args);
va_end(args);
return r;
@@ -1863,26 +1936,6 @@ static bool suppress_message_printing(int level) { return false; }
#endif /* CONFIG_PRINTK */
-#ifdef CONFIG_EARLY_PRINTK
-struct console *early_console;
-
-asmlinkage __visible void early_printk(const char *fmt, ...)
-{
- va_list ap;
- char buf[512];
- int n;
-
- if (!early_console)
- return;
-
- va_start(ap, fmt);
- n = vscnprintf(buf, sizeof(buf), fmt, ap);
- va_end(ap);
-
- early_console->write(early_console, buf, n);
-}
-#endif
-
static int __add_preferred_console(char *name, int idx, char *options,
char *brl_options)
{
diff --git a/kernel/smp.c b/kernel/smp.c
index 084c8b3a2681..d014602e13f7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -257,6 +257,8 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
irq_work_run();
}
+DEFINE_PER_CPU(const struct cpumask *, smp_call_mask);
+
/*
* smp_call_function_single - Run a function on a specific CPU
* @func: The function to run. This must be fast and non-blocking.
@@ -296,11 +298,15 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
csd_lock(csd);
}
+ this_cpu_write(smp_call_mask, cpumask_of(cpu));
+
err = generic_exec_single(cpu, csd, func, info);
if (wait)
csd_lock_wait(csd);
+ this_cpu_write(smp_call_mask, NULL);
+
put_cpu();
return err;
@@ -457,6 +463,8 @@ void smp_call_function_many(const struct cpumask *mask,
__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
}
+ this_cpu_write(smp_call_mask, cfd->cpumask_ipi);
+
/* Send a message to all CPUs in the map */
arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
@@ -468,6 +476,8 @@ void smp_call_function_many(const struct cpumask *mask,
csd_lock_wait(csd);
}
}
+
+ this_cpu_write(smp_call_mask, NULL);
}
EXPORT_SYMBOL(smp_call_function_many);
On Thu, Jan 11, 2018 at 10:26:14AM -0500, Vince Weaver wrote:
> On Thu, 11 Jan 2018, Peter Zijlstra wrote:
>
> > I'm seeing things like:
> >
> > Cannot open /sys/kernel/tracing/kprobe_events
> >
> > this is likely caused by me not having anything mounted there. Rostedt
> > provided the magic incantation to make that work, I'll go try now.
>
> The perf_fuzzer krpobe code is possibly really out of date, I haven't
> really kept up with recent developments and I don't usually fuzz as root.
I was thinking that running as root should be worse than user.. I can
switch to fuzzing as user.. who knows.
> > At least I seem to have cured all lockdep splats.. of course, the moment
> > I hit send on this someone bad is bound to happen.
>
> Should I be trying any of those patches?
Can't hurt, but I doubt it'll help sone, nothing there is PTI specific.
The force_early_printk stuff helps with printk deadlocks though, hit a
bunch of them earlier today when I build a kernel without them in.
> Currently I can still lock my machines pretty quickly. I have been doing
> some more tests. Machines don't crash if paranoid is "2" but will if
> paranoid is "1".
OK, I'm going to try fuzzing as a user with paranoid=0, and if that
doesn't help, I'm going to switch to linus' tree with my patches on.
On Thu, 11 Jan 2018, Peter Zijlstra wrote:
> OK, I'm going to try fuzzing as a user with paranoid=0, and if that
> doesn't help, I'm going to switch to linus' tree with my patches on.
OK, I'm fuzzing on a core2 machine and it locks up too.
It did give the following first (but it kept going for at least a minute
after this oops before locking solid).
Jan 11 11:38:57 core2 kernel: [ 823.930123] PGD 22ea067 P4D 22ea067 PUD 22ed067 PMD 11a520063 PTE 800000011a7a1063
Jan 11 11:38:57 core2 kernel: [ 823.938012] Oops: 0002 [#1] SMP PTI
Jan 11 11:38:57 core2 kernel: [ 823.968898] CPU: 1 PID: 17688 Comm: perf_fuzzer Not tainted 4.15.0-rc7+ #211
Jan 11 11:38:57 core2 kernel: [ 823.976152] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012
Jan 11 11:38:57 core2 kernel: [ 823.985259] RIP: 0033:0x7fbda0042b3c
Jan 11 11:38:57 core2 kernel: [ 823.989002] RSP: 002b:00007ffed7fa46c8 EFLAGS: 00010246
Jan 11 11:38:57 core2 kernel: [ 823.994434] RAX: 000000000000462d RBX: 000000000000000c RCX: 00007fbda0042b3c
Jan 11 11:38:57 core2 kernel: [ 824.001786] RDX: 0000000000000000 RSI: 00007ffed7fa46dc RDI: 000000000000462d
Jan 11 11:38:57 core2 kernel: [ 824.009135] RBP: 00007ffed7fa46e0 R08: 00007fbda032e0e4 R09: 0000000000000000
Jan 11 11:38:57 core2 kernel: [ 824.016483] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401950
Jan 11 11:38:57 core2 kernel: [ 824.023832] R13: 00007ffed7fa6ae0 R14: 0000000000000000 R15: 0000000000000000
Jan 11 11:38:57 core2 kernel: [ 824.031198] FS: 00007fbda0540700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
Jan 11 11:38:57 core2 kernel: [ 824.039530] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 11 11:38:57 core2 kernel: [ 824.045477] CR2: ffff88011a7a1000 CR3: 000000011a998000 CR4: 00000000000407e0
Jan 11 11:38:57 core2 kernel: [ 824.052827] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 11 11:38:57 core2 kernel: [ 824.060176] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Jan 11 11:38:57 core2 kernel: [ 824.072767] CR2: ffff88011a7a1000
Jan 11 11:38:57 core2 kernel: [ 824.076248] ---[ end trace 31d5336f74638197 ]--
On Thu, 11 Jan 2018, Vince Weaver wrote:
> On Thu, 11 Jan 2018, Peter Zijlstra wrote:
>
> > OK, I'm going to try fuzzing as a user with paranoid=0, and if that
> > doesn't help, I'm going to switch to linus' tree with my patches on.
>
> OK, I'm fuzzing on a core2 machine and it locks up too.
>
> It did give the following first (but it kept going for at least a minute
> after this oops before locking solid).
This was missing the first few lines which made it to the serial log but
not the console:
[ 823.919729] BUG: unable to handle kernel paging request at ffff88011a7a1000
[ 823.926928] IP: 0x7fbda0042b3c
I'm dumping vmlinux and can't find address 0x7fbda0042b3c at all.
Should I be able to find it there? Or is KASLR/KPTI or similar stopping
me?
> Jan 11 11:38:57 core2 kernel: [ 823.930123] PGD 22ea067 P4D 22ea067 PUD 22ed067 PMD 11a520063 PTE 800000011a7a1063
> Jan 11 11:38:57 core2 kernel: [ 823.938012] Oops: 0002 [#1] SMP PTI
> Jan 11 11:38:57 core2 kernel: [ 823.968898] CPU: 1 PID: 17688 Comm: perf_fuzzer Not tainted 4.15.0-rc7+ #211
> Jan 11 11:38:57 core2 kernel: [ 823.976152] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012
> Jan 11 11:38:57 core2 kernel: [ 823.985259] RIP: 0033:0x7fbda0042b3c
> Jan 11 11:38:57 core2 kernel: [ 823.989002] RSP: 002b:00007ffed7fa46c8 EFLAGS: 00010246
> Jan 11 11:38:57 core2 kernel: [ 823.994434] RAX: 000000000000462d RBX: 000000000000000c RCX: 00007fbda0042b3c
> Jan 11 11:38:57 core2 kernel: [ 824.001786] RDX: 0000000000000000 RSI: 00007ffed7fa46dc RDI: 000000000000462d
> Jan 11 11:38:57 core2 kernel: [ 824.009135] RBP: 00007ffed7fa46e0 R08: 00007fbda032e0e4 R09: 0000000000000000
> Jan 11 11:38:57 core2 kernel: [ 824.016483] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401950
> Jan 11 11:38:57 core2 kernel: [ 824.023832] R13: 00007ffed7fa6ae0 R14: 0000000000000000 R15: 0000000000000000
> Jan 11 11:38:57 core2 kernel: [ 824.031198] FS: 00007fbda0540700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
> Jan 11 11:38:57 core2 kernel: [ 824.039530] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jan 11 11:38:57 core2 kernel: [ 824.045477] CR2: ffff88011a7a1000 CR3: 000000011a998000 CR4: 00000000000407e0
> Jan 11 11:38:57 core2 kernel: [ 824.052827] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jan 11 11:38:57 core2 kernel: [ 824.060176] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> Jan 11 11:38:57 core2 kernel: [ 824.072767] CR2: ffff88011a7a1000
> Jan 11 11:38:57 core2 kernel: [ 824.076248] ---[ end trace 31d5336f74638197 ]--
>
On Thu, Jan 11, 2018 at 11:58:24AM -0500, Vince Weaver wrote:
> On Thu, 11 Jan 2018, Vince Weaver wrote:
>
> > On Thu, 11 Jan 2018, Peter Zijlstra wrote:
> >
> > > OK, I'm going to try fuzzing as a user with paranoid=0, and if that
> > > doesn't help, I'm going to switch to linus' tree with my patches on.
> >
> > OK, I'm fuzzing on a core2 machine and it locks up too.
> >
> > It did give the following first (but it kept going for at least a minute
> > after this oops before locking solid).
>
> This was missing the first few lines which made it to the serial log but
> not the console:
>
> [ 823.919729] BUG: unable to handle kernel paging request at ffff88011a7a1000
> [ 823.926928] IP: 0x7fbda0042b3c
>
> I'm dumping vmlinux and can't find address 0x7fbda0042b3c at all.
That's a user address..
> Should I be able to find it there? Or is KASLR/KPTI or similar stopping
> me?
Something is definitely funny, lemme try and make sense of this splat.
> > Jan 11 11:38:57 core2 kernel: [ 823.930123] PGD 22ea067 P4D 22ea067 PUD 22ed067 PMD 11a520063 PTE 800000011a7a1063
> > Jan 11 11:38:57 core2 kernel: [ 823.938012] Oops: 0002 [#1] SMP PTI
> > Jan 11 11:38:57 core2 kernel: [ 823.968898] CPU: 1 PID: 17688 Comm: perf_fuzzer Not tainted 4.15.0-rc7+ #211
> > Jan 11 11:38:57 core2 kernel: [ 823.976152] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012
> > Jan 11 11:38:57 core2 kernel: [ 823.985259] RIP: 0033:0x7fbda0042b3c
> > Jan 11 11:38:57 core2 kernel: [ 823.989002] RSP: 002b:00007ffed7fa46c8 EFLAGS: 00010246
> > Jan 11 11:38:57 core2 kernel: [ 823.994434] RAX: 000000000000462d RBX: 000000000000000c RCX: 00007fbda0042b3c
> > Jan 11 11:38:57 core2 kernel: [ 824.001786] RDX: 0000000000000000 RSI: 00007ffed7fa46dc RDI: 000000000000462d
> > Jan 11 11:38:57 core2 kernel: [ 824.009135] RBP: 00007ffed7fa46e0 R08: 00007fbda032e0e4 R09: 0000000000000000
> > Jan 11 11:38:57 core2 kernel: [ 824.016483] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401950
> > Jan 11 11:38:57 core2 kernel: [ 824.023832] R13: 00007ffed7fa6ae0 R14: 0000000000000000 R15: 0000000000000000
> > Jan 11 11:38:57 core2 kernel: [ 824.031198] FS: 00007fbda0540700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
> > Jan 11 11:38:57 core2 kernel: [ 824.039530] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Jan 11 11:38:57 core2 kernel: [ 824.045477] CR2: ffff88011a7a1000 CR3: 000000011a998000 CR4: 00000000000407e0
> > Jan 11 11:38:57 core2 kernel: [ 824.052827] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > Jan 11 11:38:57 core2 kernel: [ 824.060176] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> > Jan 11 11:38:57 core2 kernel: [ 824.072767] CR2: ffff88011a7a1000
> > Jan 11 11:38:57 core2 kernel: [ 824.076248] ---[ end trace 31d5336f74638197 ]--
> >
>
on the same core2 machine I got this which didn't crash the machine (but
the perf_fuzzer process is stuck)
[ 4592.608066] INFO: task systemd-logind:488 blocked for more than 120 seconds.
[ 4592.615159] Not tainted 4.15.0-rc7+ #211
[ 4592.619648] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4592.627512] systemd-logind D 0 488 1 0x80000006
[ 4592.633039] Call Trace:
[ 4592.635529] ? __schedule+0x35d/0x507
[ 4592.639239] ? usleep_range+0x51/0x51
[ 4592.642948] ? schedule+0x7b/0x84
[ 4592.646310] ? schedule_timeout+0x21/0xc7
[ 4592.650368] ? ttwu_do_wakeup+0x19/0x10d
[ 4592.654339] ? do_wait_for_common+0xfe/0x12f
[ 4592.658653] ? wake_up_q+0x40/0x40
[ 4592.662106] ? wait_for_common+0x35/0x3f
[ 4592.666075] ? __wait_rcu_gp+0xe9/0xfa
[ 4592.669872] ? synchronize_sched+0x3e/0x43
[ 4592.674019] ? call_rcu_bh+0x13/0x13
[ 4592.677641] ? trace_raw_output_rcu_utilization+0x42/0x42
[ 4592.683086] ? evdev_release+0x81/0xcf [evdev]
[ 4592.687578] ? __fput+0xdf/0x18f
[ 4592.690852] ? task_work_run+0x6b/0x7f
[ 4592.694650] ? do_exit+0x4b1/0x994
[ 4592.698103] ? __perf_event_task_sched_out+0x62/0x302
[ 4592.703200] ? do_group_exit+0x3b/0xa9
[ 4592.707000] ? get_signal+0x490/0x50b
[ 4592.710714] ? do_signal+0x23/0x5ea
[ 4592.714249] ? schedule_hrtimeout_range_clock+0x38/0xcf
[ 4592.719521] ? ep_scan_ready_list+0x176/0x195
[ 4592.723926] ? SYSC_epoll_wait+0x2ea/0x39d
[ 4592.728074] ? prepare_exit_to_usermode+0xd1/0x102
[ 4592.732913] ? entry_SYSCALL_64_fastpath+0x6e/0x70
[ 4601.468052] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 4601.473590] 1-...0: (1 GPs behind) idle=74e/140000000000000/0 softirq=139228/139229 fqs=28744
[ 4601.482338] (detected by 0, t=60007 jiffies, g=67462, c=67461, q=1637)
[ 4601.488996] Sending NMI from CPU 0 to CPUs 1:
On Thu, 11 Jan 2018, Vince Weaver wrote:
> on the same core2 machine I got this which didn't crash the machine (but
> the perf_fuzzer process is stuck)
also got this one:
Cannot open /sys/kernel/tracing/kprobe_events
[ 408.159243] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [perf_fuzzer:22326]
[ 408.193102] CPU: 1 PID: 22326 Comm: perf_fuzzer Tainted: G D 4.15.0-rc7+ #211
[ 408.201507] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012
[ 408.210436] RIP: 0010:smp_call_function_single+0xac/0xc3
[ 408.215797] RSP: 0018:ffffc90008307de0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
[ 408.223421] RAX: 0000000000000000 RBX: ffffc90008307de0 RCX: 0000000000000000
[ 408.230602] RDX: 0000000000000001 RSI: 00000000000008fb RDI: 0000000000000300
[ 408.237777] RBP: ffffc90008307e30 R08: fffffffc3146e801 R09: 0000000000000003
[ 408.244957] R10: 0000000000000000 R11: ffff88011a98af80 R12: 0000000000000001
[ 408.252137] R13: 0000000000000000 R14: 00007f05bdefb104 R15: 000000000000001f
[ 408.259318] FS: 00007f05be10d700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
[ 408.267462] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 408.273258] CR2: 0000000000735980 CR3: 000000011a31a000 CR4: 00000000000407e0
[ 408.280439] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 408.287620] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 408.294798] Call Trace:
[ 408.297300] ? perf_event_sysfs_show+0x1d/0x1d
[ 408.301792] ? ctx_resched+0x84/0x84
[ 408.305418] ? task_function_call+0x35/0x48
[ 408.309649] task_function_call+0x35/0x48
[ 408.313706] ? perf_ctx_lock+0x1e/0x1e
[ 408.317506] event_function_call+0x60/0xa5
[ 408.321652] ? ctx_resched+0x84/0x84
[ 408.325278] ? _perf_event_disable+0x3e/0x3e
[ 408.329598] perf_event_for_each_child+0x4d/0x6b
[ 408.334267] perf_event_task_enable+0x66/0x9a
[ 408.338673] SyS_prctl+0x152/0x3ca
[ 408.342126] ? SyS_write+0x69/0x78
[ 408.345581] entry_SYSCALL_64_fastpath+0x17/0x70
[ 408.350248] RIP: 0033:0x7f05bdc3ee7a
[ 408.353872] RSP: 002b:00007ffc80e425c8 EFLAGS: 00000202
[ 408.353874] Code: 01 74 04 f3 90 eb f4 83 48 18 01 48 89 d1 44 89 d7 48 89 f2 48 89 de e8 c9 fe ff ff 45 85 e4 74 12 8b 53 18 80 e2 01 74 04 f3 90 <eb> f4 eb 04 0f ff eb ad 48 83 c4 38 5b 41 5a 41 5c 5d 49 8d 62
On Wed, 10 Jan 2018, Josh Poimboeuf wrote:
> For the crash, you might try enabling CONFIG_DEBUG_ENTRY and seeing if
> that gives you any output.
I did enable that, didn't seem to help on the haswell machien at least.
> > > > WARNING: can't dereference iret registers at 000000000783fea8 for ip paranoid_entry+0x2e/0x90
> > > > WARNING: can't dereference registers at 00000000f0698d17 for ip paranoid_entry+0x4c/0x90
> > > > WARNING: stack going in the wrong direction? ip=native_sched_clock+0x9/0x90
>
> This all looks very weird. The stack pointers -- 000000000783fea8 and
> 00000000f0698d17 -- are obviously very wrong. I will try to recreate
> locally.
On a related note, on a core2 machine with the perf_fuzzer I got this too:
Jan 11 13:44:01 core2 kernel: [ 1078.931403] WARNING: stack recursion on stack type 4
Jan 11 13:44:01 core2 kernel: [ 1078.931411] WARNING: can't dereference registers at 000000002c6beb99 for ip swapgs_restore_regs_and_return_to_usermode+0x2b/0x7c
Vince
On Thu, Jan 11, 2018 at 01:20:10PM -0500, Vince Weaver wrote:
> On Thu, 11 Jan 2018, Vince Weaver wrote:
>
> > on the same core2 machine I got this which didn't crash the machine (but
> > the perf_fuzzer process is stuck)
>
> also got this one:
>
> Cannot open /sys/kernel/tracing/kprobe_events
> [ 408.159243] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [perf_fuzzer:22326]
Yes, I got plenty of those. Mostly they appear false positives, because
the next time the soft watchdog triggers on the same CPU the callchains
are completely unrelated. So its just 'slow' not stuck.
On Thu, Jan 11, 2018 at 06:03:47PM +0100, Peter Zijlstra wrote:
> > On Thu, 11 Jan 2018, Vince Weaver wrote:
> > [ 823.919729] BUG: unable to handle kernel paging request at ffff88011a7a1000
> > [ 823.926928] IP: 0x7fbda0042b3c
> >
> > I'm dumping vmlinux and can't find address 0x7fbda0042b3c at all.
>
> That's a user address..
> > [ 823.930123] PGD 22ea067 P4D 22ea067 PUD 22ed067 PMD 11a520063 PTE 800000011a7a1063
> > [ 823.938012] Oops: 0002 [#1] SMP PTI
> > [ 823.968898] CPU: 1 PID: 17688 Comm: perf_fuzzer Not tainted 4.15.0-rc7+ #211
> > [ 823.976152] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012
> > [ 823.985259] RIP: 0033:0x7fbda0042b3c
> > [ 823.989002] RSP: 002b:00007ffed7fa46c8 EFLAGS: 00010246
> > [ 823.994434] RAX: 000000000000462d RBX: 000000000000000c RCX: 00007fbda0042b3c
> > [ 824.001786] RDX: 0000000000000000 RSI: 00007ffed7fa46dc RDI: 000000000000462d
> > [ 824.009135] RBP: 00007ffed7fa46e0 R08: 00007fbda032e0e4 R09: 0000000000000000
> > [ 824.016483] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401950
> > [ 824.023832] R13: 00007ffed7fa6ae0 R14: 0000000000000000 R15: 0000000000000000
> > [ 824.031198] FS: 00007fbda0540700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
> > [ 824.039530] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 824.045477] CR2: ffff88011a7a1000 CR3: 000000011a998000 CR4: 00000000000407e0
> > [ 824.052827] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 824.060176] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> > [ 824.072767] CR2: ffff88011a7a1000
> > [ 824.076248] ---[ end trace 31d5336f74638197 ]--
So its faulting while looking up the userspace page tables, so
something buggered. The CR3 has bit 12 clear so its a kernel CR3, which
is also somewhat puzzling.
Oooh, right, its the CR3 from when we report the OOPS, which is less
than useful.
I can't get further than that the page-tables are screwed..
On Thu, Jan 11, 2018 at 02:00:27PM -0500, Vince Weaver wrote:
> On Wed, 10 Jan 2018, Josh Poimboeuf wrote:
>
> > For the crash, you might try enabling CONFIG_DEBUG_ENTRY and seeing if
> > that gives you any output.
>
> I did enable that, didn't seem to help on the haswell machien at least.
>
> > > > > WARNING: can't dereference iret registers at 000000000783fea8 for ip paranoid_entry+0x2e/0x90
> > > > > WARNING: can't dereference registers at 00000000f0698d17 for ip paranoid_entry+0x4c/0x90
> > > > > WARNING: stack going in the wrong direction? ip=native_sched_clock+0x9/0x90
> >
> > This all looks very weird. The stack pointers -- 000000000783fea8 and
> > 00000000f0698d17 -- are obviously very wrong. I will try to recreate
> > locally.
>
> On a related note, on a core2 machine with the perf_fuzzer I got this too:
>
> Jan 11 13:44:01 core2 kernel: [ 1078.931403] WARNING: stack recursion on stack type 4
> Jan 11 13:44:01 core2 kernel: [ 1078.931411] WARNING: can't dereference registers at 000000002c6beb99 for ip swapgs_restore_regs_and_return_to_usermode+0x2b/0x7c
Yuck. This time it was stack recursion on the entry stack. In the
previous error, recursion was detected on the IRQ stack. Otherwise they
look quite similar.
Was that also with nopti?
I haven't had a chance to try it yet, any idea if these would be
recreatable in a VM?
--
Josh
On Thu, Jan 11, 2018 at 01:21:12PM -0600, Josh Poimboeuf wrote:
> Yuck. This time it was stack recursion on the entry stack. In the
> previous error, recursion was detected on the IRQ stack. Otherwise they
> look quite similar.
>
> Was that also with nopti?
Both with pti enabled, nopti makes things work again.
> I haven't had a chance to try it yet, any idea if these would be
> recreatable in a VM?
Good question; typically we run the perf fuzzer on read hardware because
the PMU doesn't virtualize much. But if software perf is enough to
trigger this, that would certainly help.
On Thu, 11 Jan 2018, Peter Zijlstra wrote:
> On Thu, Jan 11, 2018 at 01:21:12PM -0600, Josh Poimboeuf wrote:
> > Yuck. This time it was stack recursion on the entry stack. In the
> > previous error, recursion was detected on the IRQ stack. Otherwise they
> > look quite similar.
> >
> > Was that also with nopti?
>
> Both with pti enabled, nopti makes things work again.
I think I have hit those errors even with pti disabled but now I'll have
to double check.
> > I haven't had a chance to try it yet, any idea if these would be
> > recreatable in a VM?
>
> Good question; typically we run the perf fuzzer on read hardware because
> the PMU doesn't virtualize much. But if software perf is enough to
> trigger this, that would certainly help.
I can only trigger them with perf_event_paranoid=0 (not with
perf_event_paranoid=2) so it might depend on how much of the PMU is
virtualized.
Vince
Not sure if this info helps, but if I make perf_fuzzer *not* create AUX
mmap() buffers, I'm unable to reproduce the hangs both on core2 and
haswell.
Vince
On Thu, 11 Jan 2018, Vince Weaver wrote:
> Not sure if this info helps, but if I make perf_fuzzer *not* create AUX
> mmap() buffers, I'm unable to reproduce the hangs both on core2 and
> haswell.
Confirmed, I can crash the system without the fuzzer, just by doing
perf record --per-thread -e intel_bts// /bin/ls
on my haswell system.
Vince
On Thu, Jan 11, 2018 at 03:15:16PM -0500, Vince Weaver wrote:
>
> Not sure if this info helps, but if I make perf_fuzzer *not* create AUX
> mmap() buffers, I'm unable to reproduce the hangs both on core2 and
> haswell.
That certainly reduces the amount of code to stare at. Will do so
tomorrow. Thanks!
On Thu, 11 Jan 2018, Vince Weaver wrote:
> On Thu, 11 Jan 2018, Peter Zijlstra wrote:
>
> > On Thu, Jan 11, 2018 at 01:21:12PM -0600, Josh Poimboeuf wrote:
> > > Yuck. This time it was stack recursion on the entry stack. In the
> > > previous error, recursion was detected on the IRQ stack. Otherwise they
> > > look quite similar.
> > >
> > > Was that also with nopti?
> >
> > Both with pti enabled, nopti makes things work again.
>
> I think I have hit those errors even with pti disabled but now I'll have
> to double check.
I can confirm this, I am able to trigger the stack recursion warning even
when "pti=off" is set.
Jan 11 15:34:47 core2 kernel: [ 320.668900] WARNING: stack recursion on stack type 4
Jan 11 15:34:47 core2 kernel: [ 320.668909] WARNING: can't dereference registers at 00000000d5ae0491 for ip swapgs_restore_regs_and_return_to_usermode+0x28/0x7c
Vince
On Thu, Jan 11, 2018 at 03:40:44PM -0500, Vince Weaver wrote:
> On Thu, 11 Jan 2018, Vince Weaver wrote:
>
> > Not sure if this info helps, but if I make perf_fuzzer *not* create AUX
> > mmap() buffers, I'm unable to reproduce the hangs both on core2 and
> > haswell.
>
> Confirmed, I can crash the system without the fuzzer, just by doing
>
> perf record --per-thread -e intel_bts// /bin/ls
>
> on my haswell system.
It makes my IVB very ill, starts spewing RCU stall warnings, but is
otherwise very unresponsive.
Awesome... I'll prod at it when my brain works again.
On Thu, 11 Jan 2018, Peter Zijlstra wrote:
> It makes my IVB very ill, starts spewing RCU stall warnings, but is
> otherwise very unresponsive.
>
> Awesome... I'll prod at it when my brain works again.
>
Not sure if it's related, but I hit this on the core2 machine fuzzing
overnight with "pti=off"
Jan 11 19:03:03 core2 kernel: [12816.125397] WARNING: CPU: 0 PID: 3144 at kernel/events/core.c:5097 perf_mmap_close+0x129/0x216
Jan 11 19:03:03 core2 kernel: [12816.126204] WARNING: CPU: 1 PID: 3197 at kernel/events/ring_buffer.c:569 __rb_free_aux+0x1a/0xb6
Jan 11 19:03:03 core2 kernel: [12816.126219] CPU: 1 PID: 3197 Comm: perf_fuzzer Not tainted 4.15.0-rc7+ #211
Jan 11 19:03:03 core2 kernel: [12816.126220] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012
Jan 11 19:03:03 core2 kernel: [12816.126222] RIP: 0010:__rb_free_aux+0x1a/0xb6
Jan 11 19:03:03 core2 kernel: [12816.126223] RSP: 0000:ffffc90007417c08 EFLAGS: 00010006
Jan 11 19:03:03 core2 kernel: [12816.126224] RAX: 0000000080110000 RBX: ffff8801197aee00 RCX: 0000000000000000
Jan 11 19:03:03 core2 kernel: [12816.126225] RDX: 0000000000040400 RSI: 00000000000400f6 RDI: ffff8801197aee00
Jan 11 19:03:03 core2 kernel: [12816.126225] RBP: ffff88011fc91000 R08: 0000000000000020 R09: 0000000000000030
Jan 11 19:03:03 core2 kernel: [12816.126226] R10: ffffc90007417c28 R11: 0000000000000246 R12: 0000000000000000
Jan 11 19:03:03 core2 kernel: [12816.126227] R13: 0000000000000001 R14: ffff88011901a800 R15: 0000000000000000
Jan 11 19:03:03 core2 kernel: [12816.126228] FS: 00007f1957682700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
Jan 11 19:03:03 core2 kernel: [12816.126229] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 11 19:03:03 core2 kernel: [12816.126230] CR2: 00000000026be684 CR3: 000000011a79a000 CR4: 00000000000407e0
Jan 11 19:03:03 core2 kernel: [12816.126231] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 11 19:03:03 core2 kernel: [12816.126231] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Jan 11 19:03:03 core2 kernel: [12816.126232] Call Trace:
Jan 11 19:03:03 core2 kernel: [12816.126236] perf_aux_output_end+0xf4/0x100
Jan 11 19:03:03 core2 kernel: [12816.126239] intel_bts_interrupt+0x9e/0xf5
Jan 11 19:03:03 core2 kernel: [12816.126241] intel_pmu_handle_irq+0x72/0x3dd
Jan 11 19:03:03 core2 kernel: [12816.126245] ? flush_tlb_mm_range+0xb0/0xca
Jan 11 19:03:03 core2 kernel: [12816.126248] ? radix_tree_next_chunk+0x73/0x26b
Jan 11 19:03:03 core2 kernel: [12816.126249] ? get_page+0x5/0xa
Jan 11 19:03:03 core2 kernel: [12816.126251] ? mm_counter_file+0x5/0x14
Jan 11 19:03:03 core2 kernel: [12816.126254] ? alloc_set_pte+0x1b9/0x1cf
Jan 11 19:03:03 core2 kernel: [12816.126255] ? unlock_page+0xa/0x20
Jan 11 19:03:03 core2 kernel: [12816.126256] ? filemap_map_pages+0x182/0x1f4
Jan 11 19:03:03 core2 kernel: [12816.126258] ? reuse_swap_page+0x7a/0x115
Jan 11 19:03:03 core2 kernel: [12816.126259] ? wp_page_reuse+0x31/0x3a
Jan 11 19:03:03 core2 kernel: [12816.126260] ? do_wp_page+0x16d/0x242
Jan 11 19:03:03 core2 kernel: [12816.126262] ? __handle_mm_fault+0x67c/0x6f1
Jan 11 19:03:03 core2 kernel: [12816.126264] ? perf_event_nmi_handler+0x27/0x3e
Jan 11 19:03:03 core2 kernel: [12816.126266] ? perf_event_nmi_handler+0x1b/0x3e
Jan 11 19:03:03 core2 kernel: [12816.126267] perf_event_nmi_handler+0x27/0x3e
Jan 11 19:03:03 core2 kernel: [12816.126269] nmi_handle+0x52/0xf5
Jan 11 19:03:03 core2 kernel: [12816.126271] default_do_nmi+0x41/0xda
Jan 11 19:03:03 core2 kernel: [12816.126273] do_nmi+0x92/0x102
Jan 11 19:03:03 core2 kernel: [12816.126275] nmi+0x67/0xb0
Jan 11 19:03:03 core2 kernel: [12816.126277] RIP: 0033:0x40fa77
Jan 11 19:03:03 core2 kernel: [12816.126277] RSP: 002b:00007fff93ea1d48 EFLAGS: 00000202
Jan 11 19:03:03 core2 kernel: [12816.126278] RAX: 0000000000000000 RBX: 000000000000000c RCX: 000000000007887c
Jan 11 19:03:03 core2 kernel: [12816.126279] RDX: 0000000000000000 RSI: 00007f1957470620 RDI: 00007f19574714e0
Jan 11 19:03:03 core2 kernel: [12816.126280] RBP: 00007fff93ea1d60 R08: 0000000000000000 R09: 00007f1957682700
Jan 11 19:03:03 core2 kernel: [12816.126281] R10: 00007f19576829d0 R11: 0000000000000246 R12: 0000000000401950
Jan 11 19:03:03 core2 kernel: [12816.126281] R13: 00007fff93ea4150 R14: 0000000000000000 R15: 0000000000000000
Jan 11 19:03:03 core2 kernel: [12816.126282] Code: 38 48 01 c7 48 c7 47 08 00 00 00 00 e9 4c 98 00 00 66 66 66 66 90 65 8b 05 57 59 f1 7e a9 ff ff ff 7f 41 54 55 53 48 89 fb 74 02 <0f> ff 48 8b bb e0 00 00 00 48 85 ff 74 1c ff 93 c8 00 00 00 48
Jan 11 19:03:03 core2 kernel: [12816.126299] ---[ end trace d8df98463050a325 ]---
Jan 11 19:03:03 core2 kernel: [12816.445380] CPU: 0 PID: 3144 Comm: perf_fuzzer Tainted: G W 4.15.0-rc7+ #211
Jan 11 19:03:03 core2 kernel: [12816.453689] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012
Jan 11 19:03:03 core2 kernel: [12816.462611] RIP: 0010:perf_mmap_close+0x129/0x216
Jan 11 19:03:03 core2 kernel: [12816.467360] RSP: 0018:ffffc9000729fe58 EFLAGS: 00010202
Jan 11 19:03:03 core2 kernel: [12816.472630] RAX: 0000000000000001 RBX: ffff8801197aee00 RCX: 0000000000000000
Jan 11 19:03:03 core2 kernel: [12816.479808] RDX: 0000000000000000 RSI: 00000000000008fb RDI: ffff8801197aee00
Jan 11 19:03:03 core2 kernel: [12816.486984] RBP: ffff88011a5b1000 R08: 0000000000000001 R09: ffff880118c94d80
Jan 11 19:03:03 core2 kernel: [12816.494160] R10: ffffc9000729fe30 R11: ffff880118c94d80 R12: ffff88011914b300
Jan 11 19:03:03 core2 kernel: [12816.501336] R13: ffff88011a133a28 R14: ffff88011a5b1270 R15: ffff8801197aebc8
Jan 11 19:03:03 core2 kernel: [12816.508512] FS: 00007f1957682700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
Jan 11 19:03:03 core2 kernel: [12816.516652] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 11 19:03:03 core2 kernel: [12816.522441] CR2: 0000000000837f80 CR3: 0000000118caa000 CR4: 00000000000407f0
Jan 11 19:03:03 core2 kernel: [12816.529618] DR0: 0000000000000000 DR1: 0000000000000040 DR2: 0000000000000040
Jan 11 19:03:03 core2 kernel: [12816.536796] DR3: 0000000000000040 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Jan 11 19:03:03 core2 kernel: [12816.543971] Call Trace:
Jan 11 19:03:03 core2 kernel: [12816.546467] remove_vma+0x2d/0x5f
Jan 11 19:03:03 core2 kernel: [12816.549828] do_munmap+0x2d7/0x2fb
Jan 11 19:03:03 core2 kernel: [12816.553277] vm_munmap+0x54/0x7a
Jan 11 19:03:03 core2 kernel: [12816.556553] SyS_munmap+0x1f/0x24
Jan 11 19:03:03 core2 kernel: [12816.559917] entry_SYSCALL_64_fastpath+0x17/0x70
Jan 11 19:03:03 core2 kernel: [12816.564580] RIP: 0033:0x7f19571af757
Jan 11 19:03:03 core2 kernel: [12816.568200] RSP: 002b:00007fff93ea1ce8 EFLAGS: 00000206
Jan 11 19:03:03 core2 kernel: [12816.568201] Code: 00 f0 49 29 44 24 60 49 8b 45 40 48 8b 93 c0 00 00 00 48 89 df 48 29 90 c8 00 00 00 e8 f0 43 00 00 8b 83 d0 00 00 00 85 c0 74 02 <0f> ff 4c 89 f7 e8 47 68 4b 00 f0 ff 4b 78 48 8d bd 90 02 00 00
Jan 11 19:03:03 core2 kernel: [12816.592397] ---[ end trace d8df98463050a326 ]---
Hello
I reported this back in January, but I think it got lost since everyone
was busy with other more pressing matters.
But in any case, the perf_fuzzer still can trigger these type of messages
and just wanted to see if they were a cause for concern, or just noise.
[66620.496076] WARNING: can't dereference registers at 0000000051f78a40 for ip interrupt_entry+0xba/0xc0
[66620.506117] WARNING: stack recursion on stack type 4
[67126.898984] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xd/0x90
[67148.214712] WARNING: can't dereference iret registers at 00000000c8f3c864 for ip error_exit+0x20/0x20
Vince
On Tue, May 01, 2018 at 09:29:38AM -0400, Vince Weaver wrote:
> Hello
>
> I reported this back in January, but I think it got lost since everyone
> was busy with other more pressing matters.
>
> But in any case, the perf_fuzzer still can trigger these type of messages
> and just wanted to see if they were a cause for concern, or just noise.
>
> [66620.496076] WARNING: can't dereference registers at 0000000051f78a40 for ip interrupt_entry+0xba/0xc0
> [66620.506117] WARNING: stack recursion on stack type 4
> [67126.898984] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xd/0x90
> [67148.214712] WARNING: can't dereference iret registers at 00000000c8f3c864 for ip error_exit+0x20/0x20
Sorry, it did fall off my radar thanks to Spectre/Meltdown. I just
started looking at some similar reports, so your timing is perfect.
I suspect these warnings are similar to issues I saw before with the FP
unwinder, where it's trying to read the stack of a running task while
the task is simultaneously writing to the stack. So basically it's
probably just noise, but we should try to get rid of the warnings
because the unwinder is expected to get lost in such a case.
Can you try the following patch?
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 563e28d14f2c..5eb796865b5b 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -137,7 +137,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
*/
if (visit_mask) {
if (*visit_mask & (1UL << info->type)) {
- printk_deferred_once(KERN_WARNING "WARNING: stack recursion on stack type %d\n", info->type);
+ if (task == current)
+ printk_deferred_once(KERN_WARNING "WARNING: stack recursion on stack type %d\n",
+ info->type);
goto unknown;
}
*visit_mask |= 1UL << info->type;
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index feb28fee6cea..570435c52ca0 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -7,7 +7,14 @@
#include <asm/orc_lookup.h>
#define orc_warn(fmt, ...) \
- printk_deferred_once(KERN_WARNING pr_fmt("WARNING: " fmt), ##__VA_ARGS__)
+ printk_deferred_once(KERN_WARNING "WARNING: " fmt, ##__VA_ARGS__)
+
+#define orc_warn_current(fmt, ...) \
+({ \
+ if (state->task == current) \
+ printk_deferred_once(KERN_WARNING "WARNING: " fmt, \
+ ##__VA_ARGS__); \
+})
extern int __start_orc_unwind_ip[];
extern int __stop_orc_unwind_ip[];
@@ -400,8 +407,8 @@ bool unwind_next_frame(struct unwind_state *state)
case ORC_REG_R10:
if (!state->regs || !state->full_regs) {
- orc_warn("missing regs for base reg R10 at ip %pB\n",
- (void *)state->ip);
+ orc_warn_current("missing regs for base reg R10 at ip %pB\n",
+ (void *)state->ip);
goto done;
}
sp = state->regs->r10;
@@ -409,8 +416,8 @@ bool unwind_next_frame(struct unwind_state *state)
case ORC_REG_R13:
if (!state->regs || !state->full_regs) {
- orc_warn("missing regs for base reg R13 at ip %pB\n",
- (void *)state->ip);
+ orc_warn_current("missing regs for base reg R13 at ip %pB\n",
+ (void *)state->ip);
goto done;
}
sp = state->regs->r13;
@@ -418,8 +425,8 @@ bool unwind_next_frame(struct unwind_state *state)
case ORC_REG_DI:
if (!state->regs || !state->full_regs) {
- orc_warn("missing regs for base reg DI at ip %pB\n",
- (void *)state->ip);
+ orc_warn_current("missing regs for base reg DI at ip %pB\n",
+ (void *)state->ip);
goto done;
}
sp = state->regs->di;
@@ -427,8 +434,8 @@ bool unwind_next_frame(struct unwind_state *state)
case ORC_REG_DX:
if (!state->regs || !state->full_regs) {
- orc_warn("missing regs for base reg DX at ip %pB\n",
- (void *)state->ip);
+ orc_warn_current("missing regs for base reg DX at ip %pB\n",
+ (void *)state->ip);
goto done;
}
sp = state->regs->dx;
@@ -463,8 +470,8 @@ bool unwind_next_frame(struct unwind_state *state)
case ORC_TYPE_REGS:
if (!deref_stack_regs(state, sp, &state->ip, &state->sp)) {
- orc_warn("can't dereference registers at %p for ip %pB\n",
- (void *)sp, (void *)orig_ip);
+ orc_warn_current("can't dereference registers at %p for ip %pB\n",
+ (void *)sp, (void *)orig_ip);
goto done;
}
@@ -475,8 +482,8 @@ bool unwind_next_frame(struct unwind_state *state)
case ORC_TYPE_REGS_IRET:
if (!deref_stack_iret_regs(state, sp, &state->ip, &state->sp)) {
- orc_warn("can't dereference iret registers at %p for ip %pB\n",
- (void *)sp, (void *)orig_ip);
+ orc_warn_current("can't dereference iret registers at %p for ip %pB\n",
+ (void *)sp, (void *)orig_ip);
goto done;
}
@@ -518,8 +525,8 @@ bool unwind_next_frame(struct unwind_state *state)
if (state->stack_info.type == prev_type &&
on_stack(&state->stack_info, (void *)state->sp, sizeof(long)) &&
state->sp <= prev_sp) {
- orc_warn("stack going in the wrong direction? ip=%pB\n",
- (void *)orig_ip);
+ orc_warn_current("stack going in the wrong direction? ip=%pB\n",
+ (void *)orig_ip);
goto done;
}
On Tue, 1 May 2018, Josh Poimboeuf wrote:
> Can you try the following patch?
I applied the patch, but the warnings don't really look that different.
[ 62.220322] WARNING: stack recursion on stack type 4
[ 62.220326] WARNING: can't dereference registers at 000000009ca2e86d for ip swapgs_restore_regs_and_return_to_usermode+0x79/0x87
[ 367.597013] WARNING: stack going in the wrong direction? ip=native_sched_clock+0x9/0x90
Vince
On Tue, May 01, 2018 at 03:59:31PM -0400, Vince Weaver wrote:
> On Tue, 1 May 2018, Josh Poimboeuf wrote:
>
> > Can you try the following patch?
>
> I applied the patch, but the warnings don't really look that different.
>
> [ 62.220322] WARNING: stack recursion on stack type 4
> [ 62.220326] WARNING: can't dereference registers at 000000009ca2e86d for ip swapgs_restore_regs_and_return_to_usermode+0x79/0x87
> [ 367.597013] WARNING: stack going in the wrong direction? ip=native_sched_clock+0x9/0x90
Bummer. I'll need to come up with a patch to dump more info.
--
Josh
On Tue, May 01, 2018 at 05:04:58PM -0500, Josh Poimboeuf wrote:
> On Tue, May 01, 2018 at 03:59:31PM -0400, Vince Weaver wrote:
> > On Tue, 1 May 2018, Josh Poimboeuf wrote:
> >
> > > Can you try the following patch?
> >
> > I applied the patch, but the warnings don't really look that different.
> >
> > [ 62.220322] WARNING: stack recursion on stack type 4
> > [ 62.220326] WARNING: can't dereference registers at 000000009ca2e86d for ip swapgs_restore_regs_and_return_to_usermode+0x79/0x87
> > [ 367.597013] WARNING: stack going in the wrong direction? ip=native_sched_clock+0x9/0x90
>
> Bummer. I'll need to come up with a patch to dump more info.
After looking closer, I realized that at least some of these warnings
are due to bad unwind hints in the entry code. Can you try this patch
instead of the last one?
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index be63330c5511..73f5d4c10304 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -165,6 +165,7 @@ For 32-bit we have the following conventions - kernel is built with
.endif
popq %rdx
popq %rsi
+ UNWIND_HINT_IRET_REGS offset=16
.if \pop_rdi
popq %rdi
.endif
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 805f52703ee3..c6c54ce30090 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -306,7 +306,6 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
*/
syscall_return_via_sysret:
/* rcx and r11 are already restored (see code above) */
- UNWIND_HINT_EMPTY
POP_REGS pop_rdi=0 skip_r11rcx=1
/*
@@ -315,6 +314,7 @@ syscall_return_via_sysret:
*/
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
+ UNWIND_HINT_EMPTY
pushq RSP-RDI(%rdi) /* RSP */
pushq (%rdi) /* RDI */
@@ -666,6 +666,7 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
*/
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
+ UNWIND_HINT_EMPTY
/* Copy the IRET frame to the trampoline stack. */
pushq 6*8(%rdi) /* SS */
@@ -1640,6 +1641,7 @@ nmi_restore:
* at the "iret" frame.
*/
addq $6*8, %rsp
+ UNWIND_HINT_IRET_REGS
/*
* Clear "NMI executing". Set DF first so that we can easily
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 92b6a2c21631..d66f14bb6c50 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1239,7 +1239,7 @@ static int update_insn_state_regs(struct instruction *insn, struct insn_state *s
struct cfi_reg *cfa = &state->cfa;
struct stack_op *op = &insn->stack_op;
- if (cfa->base != CFI_SP)
+ if (cfa->base != CFI_SP && cfa->base != CFI_SP_INDIRECT)
return 0;
/* push */
On Wed, 2 May 2018, Josh Poimboeuf wrote:
> After looking closer, I realized that at least some of these warnings
> are due to bad unwind hints in the entry code. Can you try this patch
> instead of the last one?
with just this new patch applied I still get warnings such as this:
[ 469.436218] WARNING: can't dereference registers at 00000000886d9235 for ip apic_timer_interrupt+0xa/0x20
[ 790.499655] WARNING: stack recursion on stack type 2
[ 790.907092] WARNING: stack going in the wrong direction? ip=native_sched_clock+0x9/0x90
[ 3632.876656] WARNING: can't dereference iret registers at 000000001754e5aa for ip nmi_restore+0x16/0x2b
[ 3650.161250] WARNING: missing regs for base reg R10 at ip native_sched_clock+0xd/0x90
Vince
On Fri, May 04, 2018 at 10:35:53AM -0400, Vince Weaver wrote:
> On Wed, 2 May 2018, Josh Poimboeuf wrote:
>
> > After looking closer, I realized that at least some of these warnings
> > are due to bad unwind hints in the entry code. Can you try this patch
> > instead of the last one?
>
> with just this new patch applied I still get warnings such as this:
>
> [ 469.436218] WARNING: can't dereference registers at 00000000886d9235 for ip apic_timer_interrupt+0xa/0x20
> [ 790.499655] WARNING: stack recursion on stack type 2
> [ 790.907092] WARNING: stack going in the wrong direction? ip=native_sched_clock+0x9/0x90
> [ 3632.876656] WARNING: can't dereference iret registers at 000000001754e5aa for ip nmi_restore+0x16/0x2b
> [ 3650.161250] WARNING: missing regs for base reg R10 at ip native_sched_clock+0xd/0x90
The 'nmi_restore' warning points to a bug in my patch, but the others
are head scratchers. Here's a patch which combines the first two
patches, plus improves the existing warnings a bit. Can you try it?
Also, any tips for reproducing this locally? I cloned the perf fuzzer
github. Is it as simple as just "make" and "./run_tests.sh"?
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index be63330c5511..73f5d4c10304 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -165,6 +165,7 @@ For 32-bit we have the following conventions - kernel is built with
.endif
popq %rdx
popq %rsi
+ UNWIND_HINT_IRET_REGS offset=16
.if \pop_rdi
popq %rdi
.endif
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 805f52703ee3..2908ed2ef698 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -306,7 +306,6 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
*/
syscall_return_via_sysret:
/* rcx and r11 are already restored (see code above) */
- UNWIND_HINT_EMPTY
POP_REGS pop_rdi=0 skip_r11rcx=1
/*
@@ -315,6 +314,7 @@ syscall_return_via_sysret:
*/
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
+ UNWIND_HINT_EMPTY
pushq RSP-RDI(%rdi) /* RSP */
pushq (%rdi) /* RDI */
@@ -666,6 +666,7 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
*/
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
+ UNWIND_HINT_EMPTY
/* Copy the IRET frame to the trampoline stack. */
pushq 6*8(%rdi) /* SS */
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 563e28d14f2c..d8dbf7c3c2f1 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -137,7 +137,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
*/
if (visit_mask) {
if (*visit_mask & (1UL << info->type)) {
- printk_deferred_once(KERN_WARNING "WARNING: stack recursion on stack type %d\n", info->type);
+ if (task == current)
+ printk_deferred(KERN_WARNING "WARNING: stack recursion on stack type %d\n",
+ info->type);
goto unknown;
}
*visit_mask |= 1UL << info->type;
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index feb28fee6cea..1324ffdb861d 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -7,7 +7,14 @@
#include <asm/orc_lookup.h>
#define orc_warn(fmt, ...) \
- printk_deferred_once(KERN_WARNING pr_fmt("WARNING: " fmt), ##__VA_ARGS__)
+ printk_deferred(KERN_WARNING pr_fmt("WARNING: " fmt), ##__VA_ARGS__)
+
+#define orc_warn_current(fmt, ...) \
+({ \
+ if (state->task == current) \
+ printk_deferred(KERN_WARNING "WARNING: " fmt, \
+ ##__VA_ARGS__); \
+})
extern int __start_orc_unwind_ip[];
extern int __stop_orc_unwind_ip[];
@@ -400,8 +407,8 @@ bool unwind_next_frame(struct unwind_state *state)
case ORC_REG_R10:
if (!state->regs || !state->full_regs) {
- orc_warn("missing regs for base reg R10 at ip %pB\n",
- (void *)state->ip);
+ orc_warn_current("missing regs for base reg R10 at ip %pB\n",
+ (void *)state->ip);
goto done;
}
sp = state->regs->r10;
@@ -409,8 +416,8 @@ bool unwind_next_frame(struct unwind_state *state)
case ORC_REG_R13:
if (!state->regs || !state->full_regs) {
- orc_warn("missing regs for base reg R13 at ip %pB\n",
- (void *)state->ip);
+ orc_warn_current("missing regs for base reg R13 at ip %pB\n",
+ (void *)state->ip);
goto done;
}
sp = state->regs->r13;
@@ -418,8 +425,8 @@ bool unwind_next_frame(struct unwind_state *state)
case ORC_REG_DI:
if (!state->regs || !state->full_regs) {
- orc_warn("missing regs for base reg DI at ip %pB\n",
- (void *)state->ip);
+ orc_warn_current("missing regs for base reg DI at ip %pB\n",
+ (void *)state->ip);
goto done;
}
sp = state->regs->di;
@@ -427,8 +434,8 @@ bool unwind_next_frame(struct unwind_state *state)
case ORC_REG_DX:
if (!state->regs || !state->full_regs) {
- orc_warn("missing regs for base reg DX at ip %pB\n",
- (void *)state->ip);
+ orc_warn_current("missing regs for base reg DX at ip %pB\n",
+ (void *)state->ip);
goto done;
}
sp = state->regs->dx;
@@ -463,8 +470,8 @@ bool unwind_next_frame(struct unwind_state *state)
case ORC_TYPE_REGS:
if (!deref_stack_regs(state, sp, &state->ip, &state->sp)) {
- orc_warn("can't dereference registers at %p for ip %pB\n",
- (void *)sp, (void *)orig_ip);
+ orc_warn_current("can't dereference registers at %p for ip %pB\n",
+ (void *)sp, (void *)orig_ip);
goto done;
}
@@ -475,8 +482,8 @@ bool unwind_next_frame(struct unwind_state *state)
case ORC_TYPE_REGS_IRET:
if (!deref_stack_iret_regs(state, sp, &state->ip, &state->sp)) {
- orc_warn("can't dereference iret registers at %p for ip %pB\n",
- (void *)sp, (void *)orig_ip);
+ orc_warn_current("can't dereference iret registers at %p for ip %pB\n",
+ (void *)sp, (void *)orig_ip);
goto done;
}
@@ -518,8 +525,8 @@ bool unwind_next_frame(struct unwind_state *state)
if (state->stack_info.type == prev_type &&
on_stack(&state->stack_info, (void *)state->sp, sizeof(long)) &&
state->sp <= prev_sp) {
- orc_warn("stack going in the wrong direction? ip=%pB\n",
- (void *)orig_ip);
+ orc_warn_current("stack going in the wrong direction? ip=%pB\n",
+ (void *)orig_ip);
goto done;
}
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 92b6a2c21631..d66f14bb6c50 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1239,7 +1239,7 @@ static int update_insn_state_regs(struct instruction *insn, struct insn_state *s
struct cfi_reg *cfa = &state->cfa;
struct stack_op *op = &insn->stack_op;
- if (cfa->base != CFI_SP)
+ if (cfa->base != CFI_SP && cfa->base != CFI_SP_INDIRECT)
return 0;
/* push */
On Fri, 4 May 2018, Josh Poimboeuf wrote:
> Also, any tips for reproducing this locally? I cloned the perf fuzzer
> github. Is it as simple as just "make" and "./run_tests.sh"?
run_tests only runs the perf_event regressiong tests.
To run the fuzzer, enter the "fuzzer" directory and either run
"./fast_repro98.sh"
which will run the fuzzer repeatedly, or you can instead just run
./perf_fuzzer
but that sometimes eventually errors out if it manages to get stuck in a
signal storm.
While running on haswell and skylake machines the WARNINGs trigger fairly
quickly (within minutes usually).
You might also encounter a handful of other known kernel warnings which
I've reported in the past and usually just ignore.
Vince
On Fri, 4 May 2018, Josh Poimboeuf wrote:
>
> The 'nmi_restore' warning points to a bug in my patch, but the others
> are head scratchers. Here's a patch which combines the first two
> patches, plus improves the existing warnings a bit. Can you try it?
with that updated patch I hit
May 4 21:51:20 haswell kernel: [19245.450607] WARNING: stack recursion on stack type 2
May 4 22:21:29 haswell kernel: [21055.268717] WARNING: can't dereference registers at 000000006546ba71 for ip ret_from_intr+0x6/0x1d
May 4 22:36:22 haswell kernel: [21948.106762] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xe/0x90
May 4 22:36:22 haswell kernel: [21948.115377] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xe/0x90
May 4 22:36:22 haswell kernel: [21948.124086] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xd/0x90
May 4 22:36:22 haswell kernel: [21948.124088] WARNING: stack going in the wrong direction? ip=intel_pmu_handle_irq+0x12/0x4a0
May 4 22:36:22 haswell kernel: [21948.124097] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xe/0x90
May 4 22:36:22 haswell kernel: [21948.150189] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xe/0x90
May 4 22:36:22 haswell kernel: [21948.150199] WARNING: stack going in the wrong direction? ip=intel_pmu_handle_irq+0xe/0x4a0
the last bit repeated for a few minutes (flooding the log with a few
thousand entries that look mostly similar)
Vince
On Sat, May 05, 2018 at 11:38:16AM -0400, Vince Weaver wrote:
> On Fri, 4 May 2018, Josh Poimboeuf wrote:
> >
> > The 'nmi_restore' warning points to a bug in my patch, but the others
> > are head scratchers. Here's a patch which combines the first two
> > patches, plus improves the existing warnings a bit. Can you try it?
>
> with that updated patch I hit
>
> May 4 21:51:20 haswell kernel: [19245.450607] WARNING: stack recursion on stack type 2
> May 4 22:21:29 haswell kernel: [21055.268717] WARNING: can't dereference registers at 000000006546ba71 for ip ret_from_intr+0x6/0x1d
> May 4 22:36:22 haswell kernel: [21948.106762] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xe/0x90
> May 4 22:36:22 haswell kernel: [21948.115377] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xe/0x90
> May 4 22:36:22 haswell kernel: [21948.124086] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xd/0x90
> May 4 22:36:22 haswell kernel: [21948.124088] WARNING: stack going in the wrong direction? ip=intel_pmu_handle_irq+0x12/0x4a0
> May 4 22:36:22 haswell kernel: [21948.124097] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xe/0x90
> May 4 22:36:22 haswell kernel: [21948.150189] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xe/0x90
> May 4 22:36:22 haswell kernel: [21948.150199] WARNING: stack going in the wrong direction? ip=intel_pmu_handle_irq+0xe/0x4a0
>
> the last bit repeated for a few minutes (flooding the log with a few
> thousand entries that look mostly similar)
Thanks. I can recreate now, so I'll stop bugging you for a bit. This
fuzzer is really good at finding unwinder issues.
--
Josh
On Sat, May 05, 2018 at 01:29:12PM -0500, Josh Poimboeuf wrote:
> On Sat, May 05, 2018 at 11:38:16AM -0400, Vince Weaver wrote:
> > On Fri, 4 May 2018, Josh Poimboeuf wrote:
> > >
> > > The 'nmi_restore' warning points to a bug in my patch, but the others
> > > are head scratchers. Here's a patch which combines the first two
> > > patches, plus improves the existing warnings a bit. Can you try it?
> >
> > with that updated patch I hit
> >
> > May 4 21:51:20 haswell kernel: [19245.450607] WARNING: stack recursion on stack type 2
> > May 4 22:21:29 haswell kernel: [21055.268717] WARNING: can't dereference registers at 000000006546ba71 for ip ret_from_intr+0x6/0x1d
> > May 4 22:36:22 haswell kernel: [21948.106762] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xe/0x90
> > May 4 22:36:22 haswell kernel: [21948.115377] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xe/0x90
> > May 4 22:36:22 haswell kernel: [21948.124086] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xd/0x90
> > May 4 22:36:22 haswell kernel: [21948.124088] WARNING: stack going in the wrong direction? ip=intel_pmu_handle_irq+0x12/0x4a0
> > May 4 22:36:22 haswell kernel: [21948.124097] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xe/0x90
> > May 4 22:36:22 haswell kernel: [21948.150189] WARNING: stack going in the wrong direction? ip=native_sched_clock+0xe/0x90
> > May 4 22:36:22 haswell kernel: [21948.150199] WARNING: stack going in the wrong direction? ip=intel_pmu_handle_irq+0xe/0x4a0
> >
> > the last bit repeated for a few minutes (flooding the log with a few
> > thousand entries that look mostly similar)
>
> Thanks. I can recreate now, so I'll stop bugging you for a bit. This
> fuzzer is really good at finding unwinder issues.
Deja vu. Most of these are related to perf PEBS, similar to the
following issue:
b8000586c90b ("perf/x86/intel: Cure bogus unwind from PEBS entries")
This is basically the ORC version of that. setup_pebs_sample_data() is
assembling a franken-pt_regs which ORC isn't happy about. RIP is
inconsistent with some of the other registers (like RSP and RBP).
Peter, any ideas?
--
Josh
On Sun, May 06, 2018 at 06:49:35PM -0500, Josh Poimboeuf wrote:
> Deja vu. Most of these are related to perf PEBS, similar to the
> following issue:
>
> b8000586c90b ("perf/x86/intel: Cure bogus unwind from PEBS entries")
>
> This is basically the ORC version of that. setup_pebs_sample_data() is
> assembling a franken-pt_regs which ORC isn't happy about. RIP is
> inconsistent with some of the other registers (like RSP and RBP).
>
> Peter, any ideas?
Urgh..
Something like so perhaps? It's a bit of a hack, but I couldn't quickly
think of something nicer.
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 707b2a96e516..86f0c15dcc2d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2997,6 +2997,9 @@ static int intel_pmu_hw_config(struct perf_event *event)
}
if (x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);
+
+ if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
+ event->attr.sample_type |= __PERF_SAMPLE_CALLCHAIN_EARLY;
}
if (needs_branch_stack(event)) {
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 8a10a045b57b..2115ac8336b4 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1183,17 +1183,21 @@ static void setup_pebs_sample_data(struct perf_event *event,
data->data_src.val = val;
}
+ /*
+ * We must however always use iregs for the unwinder to stay sane; the
+ * record BP,SP,IP can point into thin air when the record is from a
+ * previous PMI context or an (I)RET happend between the record and
+ * PMI.
+ */
+ if (sample_type & PERF_SAMPLE_CALLCHAIN)
+ data->callchain = perf_callchain(event, iregs);
+
/*
* We use the interrupt regs as a base because the PEBS record does not
* contain a full regs set, specifically it seems to lack segment
* descriptors, which get used by things like user_mode().
*
* In the simple case fix up only the IP for PERF_SAMPLE_IP.
- *
- * We must however always use BP,SP from iregs for the unwinder to stay
- * sane; the record BP,SP can point into thin air when the record is
- * from a previous PMI context or an (I)RET happend between the record
- * and PMI.
*/
*regs = *iregs;
@@ -1212,15 +1216,8 @@ static void setup_pebs_sample_data(struct perf_event *event,
regs->si = pebs->si;
regs->di = pebs->di;
- /*
- * Per the above; only set BP,SP if we don't need callchains.
- *
- * XXX: does this make sense?
- */
- if (!(sample_type & PERF_SAMPLE_CALLCHAIN)) {
- regs->bp = pebs->bp;
- regs->sp = pebs->sp;
- }
+ regs->bp = pebs->bp;
+ regs->sp = pebs->sp;
#ifndef CONFIG_X86_32
regs->r8 = pebs->r8;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index b8e288a1f740..55b93cd0d48a 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -143,6 +143,8 @@ enum perf_event_sample_format {
PERF_SAMPLE_PHYS_ADDR = 1U << 19,
PERF_SAMPLE_MAX = 1U << 20, /* non-ABI */
+
+ __PERF_SAMPLE_CALLCHAIN_EARLY = 1UL << 63,
};
/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 67612ce359ad..27c9e0f99f30 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6380,7 +6380,9 @@ void perf_prepare_sample(struct perf_event_header *header,
if (sample_type & PERF_SAMPLE_CALLCHAIN) {
int size = 1;
- data->callchain = perf_callchain(event, regs);
+ if (!(sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY))
+ data->callchain = perf_callchain(event, regs);
+
size += data->callchain->nr;
header->size += size * sizeof(u64);
On Thu, May 10, 2018 at 03:48:41PM +0200, Peter Zijlstra wrote:
> On Sun, May 06, 2018 at 06:49:35PM -0500, Josh Poimboeuf wrote:
>
> > Deja vu. Most of these are related to perf PEBS, similar to the
> > following issue:
> >
> > b8000586c90b ("perf/x86/intel: Cure bogus unwind from PEBS entries")
> >
> > This is basically the ORC version of that. setup_pebs_sample_data() is
> > assembling a franken-pt_regs which ORC isn't happy about. RIP is
> > inconsistent with some of the other registers (like RSP and RBP).
> >
> > Peter, any ideas?
>
> Urgh..
>
> Something like so perhaps? It's a bit of a hack, but I couldn't quickly
> think of something nicer.
Thanks, seems like this would work. I'll give it some testing.
--
Josh
On Thu, May 10, 2018 at 03:48:41PM +0200, Peter Zijlstra wrote:
> On Sun, May 06, 2018 at 06:49:35PM -0500, Josh Poimboeuf wrote:
>
> > Deja vu. Most of these are related to perf PEBS, similar to the
> > following issue:
> >
> > b8000586c90b ("perf/x86/intel: Cure bogus unwind from PEBS entries")
> >
> > This is basically the ORC version of that. setup_pebs_sample_data() is
> > assembling a franken-pt_regs which ORC isn't happy about. RIP is
> > inconsistent with some of the other registers (like RSP and RBP).
> >
> > Peter, any ideas?
>
> Urgh..
>
> Something like so perhaps? It's a bit of a hack, but I couldn't quickly
> think of something nicer.
I had to convert perf_callchain() to a global function to get it to
compile, but it does seem to fix the issue.
--
Josh