2017-09-05 07:34:22

by Markus Trippelsdorf

Subject: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

Current mainline git (24e700e291d52bd2) hangs when building software
concurrently (for example perf).
The issue is not 100% reproducible (sometimes building perf succeeds),
so bisecting will not work.
Magic SysRq key doesn't work and there is nothing in the logs.
Enabling CONFIG_PROVE_LOCKING makes the issue go away.

Any ideas on how to debug this further?

--
Markus


2017-09-05 08:54:03

by Peter Zijlstra

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Tue, Sep 05, 2017 at 09:27:38AM +0200, Markus Trippelsdorf wrote:
> Current mainline git (24e700e291d52bd2) hangs when building software
> concurrently (for example perf).
> The issue is not 100% reproducible (sometimes building perf succeeds),
> so bisecting will not work.

Sadly I cannot reproduce, I had:

while :; do make clean; make; done

running on tools/perf for a while, and now have:

while :; do make O=defconfig-build clean; make O=defconfig-build -j80; done

running, all smooth sailing, although there's the hope that the moment I
hit send on this email the box comes unstuck.

> Magic SysRq key doesn't work and there is nothing in the logs.
> Enabling CONFIG_PROVE_LOCKING makes the issue go away.

SysRq not working is suspicious.. and I take it the NMI watchdog also
isn't firing?

> Any ideas on how to debug this further?

So you have a (real) serial line on that box?

Could you try something like:

debug ignore_loglevel sysrq_always_enabled earlyprintk=serial,ttyS0,115200 force_early_printk

with the below patch applied? That always gives me the most reliable
output.

---
 kernel/printk/printk.c | 119 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 86 insertions(+), 33 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index fc47863f629c..b17099fbc7ce 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -365,6 +365,75 @@ __packed __aligned(4)
 #endif
 ;

+#ifdef CONFIG_EARLY_PRINTK
+struct console *early_console;
+
+static bool __read_mostly force_early_printk;
+
+static int __init force_early_printk_setup(char *str)
+{
+	force_early_printk = true;
+	return 0;
+}
+early_param("force_early_printk", force_early_printk_setup);
+
+static int early_printk_cpu = -1;
+
+static int early_vprintk(const char *fmt, va_list args)
+{
+	int n, cpu, old;
+	char buf[512];
+
+	cpu = get_cpu();
+	/*
+	 * Test-and-Set inter-cpu spinlock with recursion.
+	 */
+	for (;;) {
+		/*
+		 * c-cas to avoid the exclusive bouncing on spin.
+		 * Depends on the memory barrier implied by cmpxchg
+		 * for ACQUIRE semantics.
+		 */
+		old = READ_ONCE(early_printk_cpu);
+		if (old == -1) {
+			old = cmpxchg(&early_printk_cpu, -1, cpu);
+			if (old == -1)
+				break;
+		}
+		/*
+		 * Allow recursion for interrupts and the like.
+		 */
+		if (old == cpu)
+			break;
+
+		cpu_relax();
+	}
+
+	n = vscnprintf(buf, sizeof(buf), fmt, args);
+	early_console->write(early_console, buf, n);
+
+	/*
+	 * Unlock -- in case @old == @cpu, this is a no-op.
+	 */
+	smp_store_release(&early_printk_cpu, old);
+	put_cpu();
+
+	return n;
+}
+
+asmlinkage __visible void early_printk(const char *fmt, ...)
+{
+	va_list ap;
+
+	if (!early_console)
+		return;
+
+	va_start(ap, fmt);
+	early_vprintk(fmt, ap);
+	va_end(ap);
+}
+#endif
+
 /*
  * The logbuf_lock protects kmsg buffer, indices, counters. This can be taken
  * within the scheduler's rq lock. It must be released before calling
@@ -1704,6 +1773,16 @@ asmlinkage int vprintk_emit(int facility, int level,
 	int printed_len = 0;
 	bool in_sched = false;

+#ifdef CONFIG_KGDB_KDB
+	if (unlikely(kdb_trap_printk && kdb_printf_cpu < 0))
+		return vkdb_printf(KDB_MSGSRC_PRINTK, fmt, args);
+#endif
+
+#ifdef CONFIG_EARLY_PRINTK
+	if (force_early_printk && early_console)
+		return early_vprintk(fmt, args);
+#endif
+
 	if (level == LOGLEVEL_SCHED) {
 		level = LOGLEVEL_DEFAULT;
 		in_sched = true;
@@ -1796,18 +1875,7 @@ EXPORT_SYMBOL(printk_emit);

 int vprintk_default(const char *fmt, va_list args)
 {
-	int r;
-
-#ifdef CONFIG_KGDB_KDB
-	/* Allow to pass printk() to kdb but avoid a recursion. */
-	if (unlikely(kdb_trap_printk && kdb_printf_cpu < 0)) {
-		r = vkdb_printf(KDB_MSGSRC_PRINTK, fmt, args);
-		return r;
-	}
-#endif
-	r = vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
-
-	return r;
+	return vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
 }
 EXPORT_SYMBOL_GPL(vprintk_default);

@@ -1838,7 +1906,12 @@ asmlinkage __visible int printk(const char *fmt, ...)
 	int r;

 	va_start(args, fmt);
-	r = vprintk_func(fmt, args);
+#ifdef CONFIG_EARLY_PRINTK
+	if (force_early_printk && early_console)
+		r = vprintk_default(fmt, args);
+	else
+#endif
+	r = vprintk_func(fmt, args);
 	va_end(args);

 	return r;
@@ -1875,26 +1948,6 @@ static bool suppress_message_printing(int level) { return false; }

 #endif /* CONFIG_PRINTK */

-#ifdef CONFIG_EARLY_PRINTK
-struct console *early_console;
-
-asmlinkage __visible void early_printk(const char *fmt, ...)
-{
-	va_list ap;
-	char buf[512];
-	int n;
-
-	if (!early_console)
-		return;
-
-	va_start(ap, fmt);
-	n = vscnprintf(buf, sizeof(buf), fmt, ap);
-	va_end(ap);
-
-	early_console->write(early_console, buf, n);
-}
-#endif
-
 static int __add_preferred_console(char *name, int idx, char *options,
 				   char *brl_options)
 {
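
For readers who want to experiment with the locking pattern used by
early_vprintk() in the patch above, here is a minimal userspace sketch.
It is not kernel code: it assumes C11 atomics as a stand-in for
cmpxchg() and smp_store_release(), and the names owner_lock(),
owner_unlock() and owner_is_held() are hypothetical, chosen only for
this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* -1 means "unlocked"; any other value is the id of the current owner.
 * This plays the role of early_printk_cpu in the patch. */
static atomic_int owner = -1;

/*
 * Acquire the lock on behalf of `self` (a non-negative id).
 * Returns the value the caller must pass to owner_unlock():
 * -1 for a fresh acquisition, `self` for a nested (recursive) one.
 */
static int owner_lock(int self)
{
	assert(self >= 0);

	for (;;) {
		/* Plain load first, so waiters do not hammer the line with CAS. */
		int old = atomic_load_explicit(&owner, memory_order_relaxed);

		if (old == -1) {
			int expected = -1;

			if (atomic_compare_exchange_strong_explicit(
				    &owner, &expected, self,
				    memory_order_acquire, memory_order_relaxed))
				return -1;	/* we took the lock */
			old = expected;		/* someone else won the race */
		}

		/* Recursion: we already hold the lock (e.g. from an interrupt). */
		if (old == self)
			return self;

		/* Otherwise spin and retry. */
	}
}

/* Release: storing back `self` after a nested acquisition is a no-op. */
static void owner_unlock(int old)
{
	atomic_store_explicit(&owner, old, memory_order_release);
}

static bool owner_is_held(void)
{
	return atomic_load(&owner) != -1;
}
```

A caller would pair the two as `int old = owner_lock(id); ... owner_unlock(old);`.
Returning the previous owner is what makes the nested unlock harmless:
only the outermost unlock writes -1.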

2017-09-05 09:55:53

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.05 at 10:53 +0200, Peter Zijlstra wrote:
> On Tue, Sep 05, 2017 at 09:27:38AM +0200, Markus Trippelsdorf wrote:
> > Current mainline git (24e700e291d52bd2) hangs when building software
> > concurrently (for example perf).
> > The issue is not 100% reproducible (sometimes building perf succeeds),
> > so bisecting will not work.
>
> Sadly I cannot reproduce, I had:
>
> while :; do make clean; make; done
>
> running on tools/perf for a while, and now have:
>
> while :; do make O=defconfig-build clean; make O=defconfig-build -j80; done
>
> running, all smooth sailing, although there's the hope that the moment I
> hit send on this email the box comes unstuck.
>
> > Magic SysRq key doesn't work and there is nothing in the logs.
> > Enabling CONFIG_PROVE_LOCKING makes the issue go away.
>
> SysRq not working is suspicious.. and I take it the NMI watchdog also
> isn't firing?

Yes.

> > Any ideas on how to debug this further?
>
> So you have a (real) serial line on that box?

Sadly, no. But hopefully somebody else (with a proper kernel debugging
setup) will reproduce the issue soon.

--
Markus

2017-09-06 12:52:55

by Thomas Gleixner

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Tue, 5 Sep 2017, Markus Trippelsdorf wrote:
> On 2017.09.05 at 10:53 +0200, Peter Zijlstra wrote:
> > > Any ideas on how to debug this further?
> >
> > So you have a (real) serial line on that box?
>
> Sadly, no. But hopefully somebody else (with a proper kernel debugging
> setup) will reproduce the issue soon.

Does the machine respond to ping or is it entirely dead?

Thanks,

tglx

2017-09-06 13:15:10

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.06 at 14:52 +0200, Thomas Gleixner wrote:
> On Tue, 5 Sep 2017, Markus Trippelsdorf wrote:
> > On 2017.09.05 at 10:53 +0200, Peter Zijlstra wrote:
> > > > Any ideas on how to debug this further?
> > >
> > > So you have a (real) serial line on that box?
> >
> > Sadly, no. But hopefully somebody else (with a proper kernel debugging
> > setup) will reproduce the issue soon.
>
> Does the machine respond to ping or is it entirely dead?

It is entirely dead and doesn't respond to ping.

--
Markus

2017-09-07 06:28:49

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.06 at 15:15 +0200, Markus Trippelsdorf wrote:
> On 2017.09.06 at 14:52 +0200, Thomas Gleixner wrote:
> > On Tue, 5 Sep 2017, Markus Trippelsdorf wrote:
> > > On 2017.09.05 at 10:53 +0200, Peter Zijlstra wrote:
> > > > > Any ideas on how to debug this further?
> > > >
> > > > So you have a (real) serial line on that box?
> > >
> > > Sadly, no. But hopefully somebody else (with a proper kernel debugging
> > > setup) will reproduce the issue soon.
> >
> > Does the machine respond to ping or is it entirely dead?
>
> It is entirely dead and doesn't respond to ping.

The bug even kills the host (running 4.13) when running 24e700e2 in qemu
(kvm) and compiling stuff in parallel in the guest.
I see an RCU CPU stall in dmesg (on the host), but unfortunately cannot
save it, because nothing gets written to disk after the stall.
Connecting to qemu via gdb also doesn't work.

--
Markus

2017-09-08 05:35:43

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.07 at 08:28 +0200, Markus Trippelsdorf wrote:
> On 2017.09.06 at 15:15 +0200, Markus Trippelsdorf wrote:
> > On 2017.09.06 at 14:52 +0200, Thomas Gleixner wrote:
> > > On Tue, 5 Sep 2017, Markus Trippelsdorf wrote:
> > > > On 2017.09.05 at 10:53 +0200, Peter Zijlstra wrote:
> > > > > > Any ideas on how to debug this further?
> > > > >
> > > > > So you have a (real) serial line on that box?
> > > >
> > > > Sadly, no. But hopefully somebody else (with a proper kernel debugging
> > > > setup) will reproduce the issue soon.
> > >
> > > Does the machine respond to ping or is it entirely dead?
> >
> > It is entirely dead and doesn't respond to ping.
>
> The bug even kills the host (running 4.13) when running 24e700e2 in qemu
> (kvm) and compiling stuff in parallel in the guest.
> I see an RCU CPU stall in dmesg (on the host), but unfortunately cannot
> save it, because nothing gets written to disk after the stall.
> Connecting to qemu via gdb also doesn't work.

My guess would be a bug in a low level function (asm) that only hits AMD
machines. I'm running an old Phenom II X4 processor. My config is
attached.

--
Markus


Attachments:
(No filename) (1.12 kB)
config (82.32 kB)

2017-09-08 06:26:53

by Thomas Gleixner

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:

CC+ Borislav. He might have access to such a beast

> On 2017.09.07 at 08:28 +0200, Markus Trippelsdorf wrote:
> > On 2017.09.06 at 15:15 +0200, Markus Trippelsdorf wrote:
> > > On 2017.09.06 at 14:52 +0200, Thomas Gleixner wrote:
> > > > On Tue, 5 Sep 2017, Markus Trippelsdorf wrote:
> > > > > On 2017.09.05 at 10:53 +0200, Peter Zijlstra wrote:
> > > > > > > Any ideas on how to debug this further?
> > > > > >
> > > > > > So you have a (real) serial line on that box?
> > > > >
> > > > > Sadly, no. But hopefully somebody else (with a proper kernel debugging
> > > > > setup) will reproduce the issue soon.
> > > >
> > > > Does the machine respond to ping or is it entirely dead?
> > >
> > > It is entirely dead and doesn't respond to ping.
> >
> > The bug even kills the host (running 4.13) when running 24e700e2 in qemu
> > (kvm) and compiling stuff in parallel in the guest.
> > I see an RCU CPU stall in dmesg (on the host), but unfortunately cannot
> > save it, because nothing gets written to disk after the stall.
> > Connecting to qemu via gdb also doesn't work.
>
> My guess would be a bug in a low level function (asm) that only hits AMD
> machines. I'm running an old Phenom II X4 processor. My config is
> attached.
>
> --
> Markus
>

2017-09-08 08:05:54

by Borislav Petkov

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Fri, Sep 08, 2017 at 08:26:44AM +0200, Thomas Gleixner wrote:
> On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:
>
> CC+ Borislav. He might have access to such a beast

Can I have /proc/cpuinfo and dmesg pls, in order to see whether I have
something similar?

Private mail's fine too.

Thx.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-08 09:16:31

by Borislav Petkov

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Fri, Sep 08, 2017 at 10:05:36AM +0200, Borislav Petkov wrote:
> On Fri, Sep 08, 2017 at 08:26:44AM +0200, Thomas Gleixner wrote:
> > On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:
> >
> > CC+ Borislav. He might have access to such a beast
>
> Can I have /proc/cpuinfo and dmesg pls, in order to see whether I have
> something similar?
>
> Private mail's fine too.

So I don't have exactly your model - mine is model 2, stepping 3 but I see
something strange too, in dmesg:

+blkid[981]: segfault at 8 ip 00007f8267a4a3fd sp 00007ffc77045de0 error 4 in ld-linux-x86-64.so.2[7f8267a3a000+22000]
+blkid[984]: segfault at 8 ip 00007fe6035583fd sp 00007ffe9456e5b0 error 4 in ld-linux-x86-64.so.2[7fe603548000+22000]
+blkid[987]: segfault at 8 ip 00007fc695f123fd sp 00007fff1838f5f0 error 4 in ld-linux-x86-64.so.2[7fc695f02000+22000]
+blkid[990]: segfault at 8 ip 00007fd2a8d7c3fd sp 00007ffe4f663870 error 4 in ld-linux-x86-64.so.2[7fd2a8d6c000+22000]
+blkid[993]: segfault at 8 ip 00007fdb865fb3fd sp 00007ffc2ff85050 error 4 in ld-linux-x86-64.so.2[7fdb865eb000+22000]
+blkid[996]: segfault at 8 ip 00007fc28b90a3fd sp 00007ffc7233e0e0 error 4 in ld-linux-x86-64.so.2[7fc28b8fa000+22000]
+blkid[999]: segfault at 8 ip 00007fb95934c3fd sp 00007ffed513d660 error 4 in ld-linux-x86-64.so.2[7fb95933c000+22000]
+blkid[1002]: segfault at 8 ip 00007fb5facc83fd sp 00007ffece190fe0 error 4 in ld-linux-x86-64.so.2[7fb5facb8000+22000]
+blkid[1005]: segfault at 8 ip 00007f113cc693fd sp 00007ffcf6bbd3f0 error 4 in ld-linux-x86-64.so.2[7f113cc59000+22000]
+blkid[1008]: segfault at 8 ip 00007f1ad0a593fd sp 00007ffeee162df0 error 4 in ld-linux-x86-64.so.2[7f1ad0a49000+22000]
+blkid[1011]: segfault at 8 ip 00007fdd003183fd sp 00007ffde1f69e60 error 4 in ld-linux-x86-64.so.2[7fdd00308000+22000]
+blkid[1014]: segfault at 8 ip 00007ffb3240c3fd sp 00007ffc75a88180 error 4 in ld-linux-x86-64.so.2[7ffb323fc000+22000]
+blkid[1017]: segfault at 8 ip 00007f88b6d683fd sp 00007ffef5dbe830 error 4 in ld-linux-x86-64.so.2[7f88b6d58000+22000]
+blkid[1020]: segfault at 8 ip 00007fec7760c3fd sp 00007ffc9dd05890 error 4 in ld-linux-x86-64.so.2[7fec775fc000+22000]
+blkid[1026]: segfault at 8 ip 00007f5a31ecc3fd sp 00007fffaf3604b0 error 4 in ld-linux-x86-64.so.2[7f5a31ebc000+22000]
+logsave[1027]: segfault at 8 ip 00007f237d2033fd sp 00007fff53933e60 error 4 in ld-linux-x86-64.so.2[7f237d1f3000+22000]
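
As an aid to reading the lines above: the number after "error" is the
x86 page-fault error code, and "segfault at 8" is the faulting address.
The sketch below decodes the low five bits of that code. The enum names
and describe_pf() are made up for this illustration and are not the
kernel's own identifiers.

```c
#include <stddef.h>
#include <stdio.h>

/* Low bits of the x86 page-fault error code (illustrative names). */
enum {
	PF_PROT  = 1 << 0,	/* 0: page not present, 1: protection violation */
	PF_WRITE = 1 << 1,	/* 0: read access, 1: write access */
	PF_USER  = 1 << 2,	/* 0: kernel-mode access, 1: user-mode access */
	PF_RSVD  = 1 << 3,	/* reserved bit set in a paging structure */
	PF_INSTR = 1 << 4,	/* fault was an instruction fetch */
};

/* Format a short human-readable description of `code` into `buf`. */
static const char *describe_pf(unsigned long code, char *buf, size_t len)
{
	const char *mode = (code & PF_USER) ? "user" : "kernel";
	const char *access = (code & PF_INSTR) ? "instruction fetch"
			   : (code & PF_WRITE) ? "write" : "read";
	const char *cause = (code & PF_RSVD) ? "reserved bit set"
			  : (code & PF_PROT) ? "protection violation"
			  : "page not present";

	snprintf(buf, len, "%s-mode %s, %s", mode, access, cause);
	return buf;
}
```

With this decoding, "error 4" reads as a user-mode read of a page that
is not present, which at address 8 looks like a small offset from a
NULL pointer inside ld-linux-x86-64.so.2.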

and then

git pull
...

Fast-forward
error: merge died of signal 7

Lemme try 4.13.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-08 09:48:20

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.08 at 11:16 +0200, Borislav Petkov wrote:
> On Fri, Sep 08, 2017 at 10:05:36AM +0200, Borislav Petkov wrote:
> > On Fri, Sep 08, 2017 at 08:26:44AM +0200, Thomas Gleixner wrote:
> > > On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:
> > >
> > > CC+ Borislav. He might have access to such a beast
> >
> > Can I have /proc/cpuinfo and dmesg pls, in order to see whether I have
> > something similar?
> >
> > Private mail's fine too.
>
> So I don't have exactly your model - mine is model 2, stepping 3 but I see
> something strange too, in dmesg:

I'm pretty sure the bug is in the merged 'x86-mm-for-linus' branch:
Either Andy's "PCID optimized TLB flushing" (would be my guess) or
'encrypted memory' support by Tom Lendacky.

(Bisecting is hard, because sometimes I can compile stuff for over 15
minutes without hitting the bug. At other times the machine locks up
hard when starting X11 already.)

--
Markus

2017-09-08 10:35:19

by Ingo Molnar

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf


* Markus Trippelsdorf <[email protected]> wrote:

> On 2017.09.08 at 11:16 +0200, Borislav Petkov wrote:
> > On Fri, Sep 08, 2017 at 10:05:36AM +0200, Borislav Petkov wrote:
> > > On Fri, Sep 08, 2017 at 08:26:44AM +0200, Thomas Gleixner wrote:
> > > > On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:
> > > >
> > > > CC+ Borislav. He might have access to such a beast
> > >
> > > Can I have /proc/cpuinfo and dmesg pls, in order to see whether I have
> > > something similar?
> > >
> > > Private mail's fine too.
> >
> > So I don't have exactly your model - mine is model 2, stepping 3 but I see
> > something strange too, in dmesg:
>
> I'm pretty sure the bug is in the merged 'x86-mm-for-linus' branch:
> Either Andy's "PCID optimized TLB flushing" (would be my guess) or
> 'encrypted memory' support by Tom Lendacky.
>
> (Bisecting is hard, because sometimes I can compile stuff for over 15
> minutes without hitting the bug. At other times the machine locks up
> hard when starting X11 already.)

Do you have the 72c0098d92ce fix?

Thanks,

Ingo

2017-09-08 10:39:10

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.08 at 12:35 +0200, Ingo Molnar wrote:
>
> * Markus Trippelsdorf <[email protected]> wrote:
>
> > On 2017.09.08 at 11:16 +0200, Borislav Petkov wrote:
> > > On Fri, Sep 08, 2017 at 10:05:36AM +0200, Borislav Petkov wrote:
> > > > On Fri, Sep 08, 2017 at 08:26:44AM +0200, Thomas Gleixner wrote:
> > > > > On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:
> > > > >
> > > > > CC+ Borislav. He might have access to such a beast
> > > >
> > > > Can I have /proc/cpuinfo and dmesg pls, in order to see whether I have
> > > > something similar?
> > > >
> > > > Private mail's fine too.
> > >
> > > So I don't have exactly your model - mine is model 2, stepping 3 but I see
> > > something strange too, in dmesg:
> >
> > I'm pretty sure the bug is in the merged 'x86-mm-for-linus' branch:
> > Either Andy's "PCID optimized TLB flushing" (would be my guess) or
> > 'encrypted memory' support by Tom Lendacky.
> >
> > (Bisecting is hard, because sometimes I can compile stuff for over 15
> > minutes without hitting the bug. At other times the machine locks up
> > hard when starting X11 already.)
>
> Do you have the 72c0098d92ce fix?

Yes. The bug still happens on the current git tree (which has the fix
already):
% git describe
v4.13-9217-g5969d1bb3082

--
Markus

2017-09-08 11:30:42

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.08 at 12:39 +0200, Markus Trippelsdorf wrote:
> On 2017.09.08 at 12:35 +0200, Ingo Molnar wrote:
> >
> > * Markus Trippelsdorf <[email protected]> wrote:
> >
> > > On 2017.09.08 at 11:16 +0200, Borislav Petkov wrote:
> > > > On Fri, Sep 08, 2017 at 10:05:36AM +0200, Borislav Petkov wrote:
> > > > > On Fri, Sep 08, 2017 at 08:26:44AM +0200, Thomas Gleixner wrote:
> > > > > > On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:
> > > > > >
> > > > > > CC+ Borislav. He might have access to such a beast
> > > > >
> > > > > Can I have /proc/cpuinfo and dmesg pls, in order to see whether I have
> > > > > something similar?
> > > > >
> > > > > Private mail's fine too.
> > > >
> > > > So I don't have exactly your model - mine is model 2, stepping 3 but I see
> > > > something strange too, in dmesg:
> > >
> > > I'm pretty sure the bug is in the merged 'x86-mm-for-linus' branch:
> > > Either Andy's "PCID optimized TLB flushing" (would be my guess) or
> > > 'encrypted memory' support by Tom Lendacky.
> > >
> > > (Bisecting is hard, because sometimes I can compile stuff for over 15
> > > minutes without hitting the bug. At other times the machine locks up
> > > hard when starting X11 already.)
> >
> > Do you have the 72c0098d92ce fix?
>
> Yes. The bug still happens on the current git tree (which has the fix
> already):

The bug is definitely caused by Andy Lutomirski's "PCID optimized TLB
flushing" patches. Tom is off the hook.

--
Markus

2017-09-08 14:51:21

by Borislav Petkov

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Fri, Sep 08, 2017 at 11:16:14AM +0200, Borislav Petkov wrote:
...
> +blkid[1020]: segfault at 8 ip 00007fec7760c3fd sp 00007ffc9dd05890 error 4 in ld-linux-x86-64.so.2[7fec775fc000+22000]
> +blkid[1026]: segfault at 8 ip 00007f5a31ecc3fd sp 00007fffaf3604b0 error 4 in ld-linux-x86-64.so.2[7f5a31ebc000+22000]
> +logsave[1027]: segfault at 8 ip 00007f237d2033fd sp 00007fff53933e60 error 4 in ld-linux-x86-64.so.2[7f237d1f3000+22000]

Yap, definitely no segfaults with 4.13

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-08 16:13:22

by Andy Lutomirski

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Fri, Sep 8, 2017 at 4:30 AM, Markus Trippelsdorf
<[email protected]> wrote:
> On 2017.09.08 at 12:39 +0200, Markus Trippelsdorf wrote:
>> On 2017.09.08 at 12:35 +0200, Ingo Molnar wrote:
>> >
>> > * Markus Trippelsdorf <[email protected]> wrote:
>> >
>> > > On 2017.09.08 at 11:16 +0200, Borislav Petkov wrote:
>> > > > On Fri, Sep 08, 2017 at 10:05:36AM +0200, Borislav Petkov wrote:
>> > > > > On Fri, Sep 08, 2017 at 08:26:44AM +0200, Thomas Gleixner wrote:
>> > > > > > On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:
>> > > > > >
>> > > > > > CC+ Borislav. He might have access to such a beast
>> > > > >
>> > > > > Can I have /proc/cpuinfo and dmesg pls, in order to see whether I have
>> > > > > something similar?
>> > > > >
>> > > > > Private mail's fine too.
>> > > >
>> > > > So I don't have exactly your model - mine is model 2, stepping 3 but I see
>> > > > something strange too, in dmesg:
>> > >
>> > > I'm pretty sure the bug is in the merged 'x86-mm-for-linus' branch:
>> > > Either Andy's "PCID optimized TLB flushing" (would be my guess) or
>> > > 'encrypted memory' support by Tom Lendacky.
>> > >
>> > > (Bisecting is hard, because sometimes I can compile stuff for over 15
>> > > minutes without hitting the bug. At other times the machine locks up
>> > > hard when starting X11 already.)
>> >
>> > Do you have the 72c0098d92ce fix?
>>
>> Yes. The bug still happens on the current git tree (which has the fix
>> already):
>
> The bug is definitely caused by Andy Lutomirski's "PCID optimized TLB
> flushing" patches. Tom is off the hook.

I'm pretty sure it can't be PCID per se, since these CPUs are way too
old and are very unlikely to have PCID.

It could plausibly be the lazy TLB flushing changes.

2017-09-08 17:16:37

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.08 at 09:12 -0700, Andy Lutomirski wrote:
> On Fri, Sep 8, 2017 at 4:30 AM, Markus Trippelsdorf
> <[email protected]> wrote:
> > On 2017.09.08 at 12:39 +0200, Markus Trippelsdorf wrote:
> >> On 2017.09.08 at 12:35 +0200, Ingo Molnar wrote:
> >> >
> >> > * Markus Trippelsdorf <[email protected]> wrote:
> >> >
> >> > > On 2017.09.08 at 11:16 +0200, Borislav Petkov wrote:
> >> > > > On Fri, Sep 08, 2017 at 10:05:36AM +0200, Borislav Petkov wrote:
> >> > > > > On Fri, Sep 08, 2017 at 08:26:44AM +0200, Thomas Gleixner wrote:
> >> > > > > > On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:
> >> > > > > >
> >> > > > > > CC+ Borislav. He might have access to such a beast
> >> > > > >
> >> > > > > Can I have /proc/cpuinfo and dmesg pls, in order to see whether I have
> >> > > > > something similar?
> >> > > > >
> >> > > > > Private mail's fine too.
> >> > > >
> >> > > > So I don't have exactly your model - mine is model 2, stepping 3 but I see
> >> > > > something strange too, in dmesg:
> >> > >
> >> > > I'm pretty sure the bug is in the merged 'x86-mm-for-linus' branch:
> >> > > Either Andy's "PCID optimized TLB flushing" (would be my guess) or
> >> > > 'encrypted memory' support by Tom Lendacky.
> >> > >
> >> > > (Bisecting is hard, because sometimes I can compile stuff for over 15
> >> > > minutes without hitting the bug. At other times the machine locks up
> >> > > hard when starting X11 already.)
> >> >
> >> > Do you have the 72c0098d92ce fix?
> >>
> >> Yes. The bug still happens on the current git tree (which has the fix
> >> already):
> >
> > The bug is definitely caused by Andy Lutomirski's "PCID optimized TLB
> > flushing" patches. Tom is off the hook.
>
> I'm pretty sure it can't be PCID per se, since these CPUs are way too
> old and are very unlikely to have PCID.

Yes, the CPU doesn't support PCID (but it does support PGE).

> It could plausibly be the lazy TLB flushing changes.

Yes, I've narrowed it down to:

commit 94b1b03b519b81c494900cb112aa00ed205cc2d9
Author: Andy Lutomirski <[email protected]>
Date: Thu Jun 29 08:53:17 2017 -0700

x86/mm: Rework lazy TLB mode and TLB freshness tracking


Theoretically you guys should be able to reproduce the issue by using
the "nopcid" boot option.

--
Markus

2017-09-08 21:47:23

by Andy Lutomirski

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Fri, Sep 8, 2017 at 10:16 AM, Markus Trippelsdorf
<[email protected]> wrote:
> On 2017.09.08 at 09:12 -0700, Andy Lutomirski wrote:
>> On Fri, Sep 8, 2017 at 4:30 AM, Markus Trippelsdorf
>> <[email protected]> wrote:
>> > On 2017.09.08 at 12:39 +0200, Markus Trippelsdorf wrote:
>> >> On 2017.09.08 at 12:35 +0200, Ingo Molnar wrote:
>> >> >
>> >> > * Markus Trippelsdorf <[email protected]> wrote:
>> >> >
>> >> > > On 2017.09.08 at 11:16 +0200, Borislav Petkov wrote:
>> >> > > > On Fri, Sep 08, 2017 at 10:05:36AM +0200, Borislav Petkov wrote:
>> >> > > > > On Fri, Sep 08, 2017 at 08:26:44AM +0200, Thomas Gleixner wrote:
>> >> > > > > > On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:
>> >> > > > > >
>> >> > > > > > CC+ Borislav. He might have access to such a beast
>> >> > > > >
>> >> > > > > Can I have /proc/cpuinfo and dmesg pls, in order to see whether I have
>> >> > > > > something similar?
>> >> > > > >
>> >> > > > > Private mail's fine too.
>> >> > > >
>> >> > > > So I don't have exactly your model - mine is model 2, stepping 3 but I see
>> >> > > > something strange too, in dmesg:
>> >> > >
>> >> > > I'm pretty sure the bug is in the merged 'x86-mm-for-linus' branch:
>> >> > > Either Andy's "PCID optimized TLB flushing" (would be my guess) or
>> >> > > 'encrypted memory' support by Tom Lendacky.
>> >> > >
>> >> > > (Bisecting is hard, because sometimes I can compile stuff for over 15
>> >> > > minutes without hitting the bug. At other times the machine locks up
>> >> > > hard when starting X11 already.)
>> >> >
>> >> > Do you have the 72c0098d92ce fix?
>> >>
>> >> Yes. The bug still happens on the current git tree (which has the fix
>> >> already):
>> >
>> > The bug is definitely caused by Andy Lutomirski's "PCID optimized TLB
>> > flushing" patches. Tom is off the hook.
>>
>> I'm pretty sure it can't be PCID per se, since these CPUs are way too
>> old and are very unlikely to have PCID.
>
> Yes, the CPU doesn't support PCID (but it does support PGE).
>
>> It could plausibly be the lazy TLB flushing changes.
>
> Yes, I've narrowed it down to:
>
> commit 94b1b03b519b81c494900cb112aa00ed205cc2d9
> Author: Andy Lutomirski <[email protected]>
> Date: Thu Jun 29 08:53:17 2017 -0700
>
> x86/mm: Rework lazy TLB mode and TLB freshness tracking
>
>
> Theoretically you guys should be able to reproduce the issue by using
> the "nopcid" boot option.
>

Any chance you could test with CONFIG_DEBUG_VM=y? There are lots of
potentially useful assertions in that code.

Can you also post your /proc/cpuinfo? And can you re-confirm that a
problematic guest kernel is causing problems in the *host*?

2017-09-08 21:57:08

by Borislav Petkov

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Fri, Sep 08, 2017 at 02:47:00PM -0700, Andy Lutomirski wrote:
> Any chance you could test with CONFIG_DEBUG_VM=y? There are lots of
> potentially useful assertions in that code.
>
> Can you also post your /proc/cpuinfo? And can you re-confirm that a
> problematic guest kernel is causing problems in the *host*?

Also, have you seen any MCEs during early boot, after the freezes?

You probably wouldn't have because we don't log them on F10h due to
broken BIOSen. So add "mce=bootlog" to your grub and warm-reset your box
after one of those freezes and send me dmesg. It should have an MCE in
there, if what I think is happening is actually happening.

Thx.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-08 23:08:16

by Andy Lutomirski

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

[Linus, I added you to get your opinion on whether the last bit here
is a problem.]

On Fri, Sep 8, 2017 at 2:56 PM, Borislav Petkov <[email protected]> wrote:
> On Fri, Sep 08, 2017 at 02:47:00PM -0700, Andy Lutomirski wrote:
>> Any chance you could test with CONFIG_DEBUG_VM=y? There are lots of
>> potentially useful assertions in that code.
>>
>> Can you also post your /proc/cpuinfo? And can you re-confirm that a
>> problematic guest kernel is causing problems in the *host*?
>
> Also, have you seen any MCEs during early boot, after the freezes?
>
> You probably wouldn't have because we don't log them on F10h due to
> broken BIOSen. So add "mce=bootlog" to your grub and warm-reset your box
> after one of those freezes and send me dmesg. It should have an MCE in
> there, if what I think is happening is actually happening.
>

Here's my theory as to what's happening.

Before my patch, flush_tlb_mm_range() guaranteed that the range would
be flushed on all CPUs prior to returning. With the patch, it only
promises that it will be flushed on all CPUs prior to anyone trying to
access it on the CPU in question. This has two consequences:

1. A kernel thread that accidentally reads or writes a user address
could hit a stale TLB entry. This seems harmless in the sense that
this can only happen if we already have a bug.

2. The CPU itself could see the TLB entry and do nefarious
architecturally invisible things with it.

I bet that #2 dramatically increases the chance that we hit erratum 383.

I can imagine a case where we have a problem even in the absence of an
erratum. Specifically, suppose we have some page mapped. CPU A
writes to it using combining (it's mapped WC or an explicit streaming
write is done). CPU B removes the TLB entry and does
flush_tlb_mm_range(). CPU B would expect that all writes to the page
are done, but CPU A's write is still sitting in the streaming buffers.

I *think* this is impossible because CPU A's mm_cpumask manipulations
are atomic and should therefore force out the streaming write buffers,
but maybe there's some other scenario where this matters.

--Andy
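
The two flush_tlb_mm_range() guarantees contrasted in the mail above can
be modelled with a toy generation counter. This is only a sketch of the
idea under simplified assumptions, not the code from the commit:
toy_mm, toy_cpu, flush_eager(), flush_deferred(), switch_to_mm() and
may_hold_stale_entry() are all hypothetical names, and real TLBs,
IPIs and concurrency are left out entirely.

```c
#include <stdbool.h>

#define NR_CPUS 4

struct toy_mm {
	unsigned long tlb_gen;	/* bumped on every flush request */
};

struct toy_cpu {
	bool lazy;		/* running a kernel thread, not using the mm */
	unsigned long seen_gen;	/* last generation this CPU has caught up to */
};

static struct toy_cpu cpus[NR_CPUS];

/* Old guarantee: every CPU has flushed by the time the call returns. */
static void flush_eager(struct toy_mm *mm)
{
	mm->tlb_gen++;
	for (int i = 0; i < NR_CPUS; i++)
		cpus[i].seen_gen = mm->tlb_gen;
}

/* New guarantee: lazy CPUs are skipped and catch up later. */
static void flush_deferred(struct toy_mm *mm)
{
	mm->tlb_gen++;
	for (int i = 0; i < NR_CPUS; i++)
		if (!cpus[i].lazy)
			cpus[i].seen_gen = mm->tlb_gen;
}

/* A lazy CPU catches up before it uses the mm again. */
static void switch_to_mm(struct toy_mm *mm, int cpu)
{
	cpus[cpu].lazy = false;
	cpus[cpu].seen_gen = mm->tlb_gen;
}

/* True while the CPU could still hold entries from an older generation. */
static bool may_hold_stale_entry(const struct toy_mm *mm, int cpu)
{
	return cpus[cpu].seen_gen != mm->tlb_gen;
}
```

In this model the window between flush_deferred() and switch_to_mm() is
where consequences 1 and 2 from the list above would apply: software
never legitimately uses the stale entry, but the hardware still has it.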

2017-09-08 23:23:14

by Linus Torvalds

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Fri, Sep 8, 2017 at 4:07 PM, Andy Lutomirski <[email protected]> wrote:
>
> I *think* this is impossible because CPU A's mm_cpumask manipulations
> are atomic and should therefore force out the streaming write buffers,
> but maybe there's some other scenario where this matters.

I don't think atomic memops do that.

They enforce globally visible ordering, but since they happen in the
cache and are not actually visible to the outside, that doesn't actually
affect any streaming write buffers.

Then, if somebody else requests a cacheline that we have exclusive
ownership to, the write buffers just need to flush before we give up
that cacheline.

So a locked memory op is *not* serializing, it only enforces memory
ordering. Big difference.

Only fully serializing instructions will serialize with the write
buffers, and they are expensive as hell (partly exactly _due_ to these
kinds of issues).

So this change to delay invalidation does sound fairly scary..

Linus

2017-09-09 00:00:28

by Andy Lutomirski

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Fri, Sep 8, 2017 at 4:23 PM, Linus Torvalds wrote:
> On Fri, Sep 8, 2017 at 4:07 PM, Andy Lutomirski wrote:
>>
>> I *think* this is impossible because CPU A's mm_cpumask manipulations
>> are atomic and should therefore force out the streaming write buffers,
>> but maybe there's some other scenario where this matters.
>
> I don't think atomic memops do that.
>
> They enforce globally visible ordering, but since they happen in the
> cache and are not actually visible to the outside, that doesn't actually
> affect any streaming write buffers.
>
> Then, if somebody else requests a cacheline that we have exclusive
> ownership to, the write buffers just need to flush before we give up
> that cacheline.
>
> So a locked memory op is *not* serializing, it only enforces memory
> ordering. Big difference.
>
> Only fully serializing instructions will serialize with the write
> buffers, and they are expensive as hell (partly exactly _due_ to these
> kinds of issues).

I'm not convinced. The SDM says (Vol 3, 11.3, under WC):

If the WC buffer is partially filled, the writes may be delayed until
the next occurrence of a serializing event; such as, an SFENCE or
MFENCE instruction, CPUID execution, a read or write to uncached
memory, an interrupt occurrence, or a LOCK instruction execution.

Thanks, Intel, for defining "serializing event" differently here than
anywhere else in the whole manual.

Anyhow, I can think of two cases where this is relevant.

1. The kernel wants to reclaim a page of normal memory, so it unmaps
it and flushes. Another CPU has an entry for that page in its WC
buffer. I don't think we care whether the flush causes the WC write
to really hit RAM because it's unobservable -- we just need to make
sure it is ordered, as seen by software, before the flush operation
completes. From the quote above, I think we're okay here.

2. The kernel is unmapping some IO memory (e.g. a GPU command buffer).
It wants a guarantee that, when flush_tlb_mm_range returns, all CPUs
are really done writing to it. Here I'm less convinced. The SDM
quote certainly suggests to me that we have a promise that the WC
write has *started* before flush_tlb_mm_range returns, but I'm not
sure I believe that it's guaranteed to have retired. That being said,
I'm not sure that this is observable either -- anything the kernel
does that depends on the writes being done presumably has to involve
further IO to the same device, and I suspect that WC write; lock write
to memory; observe that write on another CPU; IO on other CPU really
does guarantee that everything hits the bus in order.

FWIW, I'm not sure that we ever had a guarantee that IO writes were
all fully done before flush_tlb_mm_range would return. Can't they
still be hanging out in the PCI bridge or whatever until someone
*reads* that device?

>
> So this change to delay invalidation does sound fairly scary..

With PCID, we're fundamentally delaying invalidation. I think the
worry is more that we're not guaranteeing that every CPU that could
have accessed the page being flushed has executed a real serializing
instruction.

If we want to force invalidation/serialization, I can see two
reasonable solutions:

1. Revert this behavior change: continue sending IPIs to lazy CPUs.
The problem is that this will totally wipe out the performance gain,
and that gain seemed to be substantial in some microbenchmarks at
least.

2. Get rid of lazy mode. With PCID at least, switching to init_mm isn't so bad.

I'd prefer to leave it as is except on the buggy AMD CPUs, though,
since the current code is nice and fast.

2017-09-09 01:05:32

by Linus Torvalds

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Fri, Sep 8, 2017 at 5:00 PM, Andy Lutomirski wrote:
>
> I'm not convinced. The SDM says (Vol 3, 11.3, under WC):
>
> If the WC buffer is partially filled, the writes may be delayed until
> the next occurrence of a serializing event; such as, an SFENCE or
> MFENCE instruction, CPUID execution, a read or write to uncached
> memory, an interrupt occurrence, or a LOCK instruction execution.
>
> Thanks, Intel, for defining "serializing event" differently here than
> anywhere else in the whole manual.

Yeah, it's really badly defined. Ok, maybe a locked instruction does
actually wait for it.. It should be invisible to anything, regardless.

> 1. The kernel wants to reclaim a page of normal memory, so it unmaps
> it and flushes. Another CPU has an entry for that page in its WC
> buffer. I don't think we care whether the flush causes the WC write
> to really hit RAM because it's unobservable -- we just need to make
> sure it is ordered, as seen by software, before the flush operation
> completes. From the quote above, I think we're okay here.

Agreed.

> 2. The kernel is unmapping some IO memory (e.g. a GPU command buffer).
> It wants a guarantee that, when flush_tlb_mm_range returns, all CPUs
> are really done writing to it. Here I'm less convinced. The SDM
> quote certainly suggests to me that we have a promise that the WC
> write has *started* before flush_tlb_mm_range returns, but I'm not
> sure I believe that it's guaranteed to have retired.

If others have writable TLB entries, what keeps them from just
continuing to write for a long time afterwards?

> I'd prefer to leave it as is except on the buggy AMD CPUs, though,
> since the current code is nice and fast.

So is there a patch to detect the 383 erratum and serialize for those?
I may have missed that part.

Linus

2017-09-09 01:39:33

by Andy Lutomirski

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf



> On Sep 8, 2017, at 6:05 PM, Linus Torvalds wrote:
>
>> On Fri, Sep 8, 2017 at 5:00 PM, Andy Lutomirski wrote:
>>
>> I'm not convinced. The SDM says (Vol 3, 11.3, under WC):
>>
>> If the WC buffer is partially filled, the writes may be delayed until
>> the next occurrence of a serializing event; such as, an SFENCE or
>> MFENCE instruction, CPUID execution, a read or write to uncached
>> memory, an interrupt occurrence, or a LOCK instruction execution.
>>
>> Thanks, Intel, for defining "serializing event" differently here than
>> anywhere else in the whole manual.
>
> Yeah, it's really badly defined. Ok, maybe a locked instruction does
> actually wait for it.. It should be invisible to anything, regardless.
>
>> 1. The kernel wants to reclaim a page of normal memory, so it unmaps
>> it and flushes. Another CPU has an entry for that page in its WC
>> buffer. I don't think we care whether the flush causes the WC write
>> to really hit RAM because it's unobservable -- we just need to make
>> sure it is ordered, as seen by software, before the flush operation
>> completes. From the quote above, I think we're okay here.
>
> Agreed.
>
>> 2. The kernel is unmapping some IO memory (e.g. a GPU command buffer).
>> It wants a guarantee that, when flush_tlb_mm_range returns, all CPUs
>> are really done writing to it. Here I'm less convinced. The SDM
>> quote certainly suggests to me that we have a promise that the WC
>> write has *started* before flush_tlb_mm_range returns, but I'm not
>> sure I believe that it's guaranteed to have retired.
>
> If others have writable TLB entries, what keeps them from just
> continuing to write for a long time afterwards?

Whoever unmaps the resource by kicking out their drm fd? I admit I'm just trying to think of the worst case.

>
>> I'd prefer to leave it as is except on the buggy AMD CPUs, though,
>> since the current code is nice and fast.
>
> So is there a patch to detect the 383 erratum and serialize for those?
> I may have missed that part.
>

The patch is in my head. It's imaginarily attached to this email.


> Linus

2017-09-09 06:39:12

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.08 at 23:56 +0200, Borislav Petkov wrote:
> On Fri, Sep 08, 2017 at 02:47:00PM -0700, Andy Lutomirski wrote:
> > Any chance you could test with CONFIG_DEBUG_VM=y? There are lots of
> > potentially useful assertions in that code.
> >
> > Can you also post your /proc/cpuinfo? And can you re-confirm that a
> > problematic guest kernel is causing problems in the *host*?
>
> Also, have you seen any MCEs during early boot, after the freezes?
>
> You probably wouldn't have because we don't log them on F10h due to
> broken BIOSen. So add "mce=bootlog" to your grub and warm-reset your box
> after one of those freezes and send me dmesg. It should have an MCE in
> there, if what I think is happening is actually happening.

Unfortunately the machine hangs in the BIOS after the first warm-reset.
Probably when it encounters an MCE it doesn't expect. I have to
warm-reset a second time to get to the boot-loader. So it is impossible
for me to see any possible MCE.

--
Markus

2017-09-09 08:13:46

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.08 at 14:47 -0700, Andy Lutomirski wrote:
> On Fri, Sep 8, 2017 at 10:16 AM, Markus Trippelsdorf wrote:
> > On 2017.09.08 at 09:12 -0700, Andy Lutomirski wrote:
> >> On Fri, Sep 8, 2017 at 4:30 AM, Markus Trippelsdorf wrote:
> >> > On 2017.09.08 at 12:39 +0200, Markus Trippelsdorf wrote:
> >> >> On 2017.09.08 at 12:35 +0200, Ingo Molnar wrote:
> >> >> >
> >> >> > * Markus Trippelsdorf wrote:
> >> >> >
> >> >> > > On 2017.09.08 at 11:16 +0200, Borislav Petkov wrote:
> >> >> > > > On Fri, Sep 08, 2017 at 10:05:36AM +0200, Borislav Petkov wrote:
> >> >> > > > > On Fri, Sep 08, 2017 at 08:26:44AM +0200, Thomas Gleixner wrote:
> >> >> > > > > > On Fri, 8 Sep 2017, Markus Trippelsdorf wrote:
> >> >> > > > > >
> >> >> > > > > > CC+ Borislav. He might have access to such a beast
> >> >> > > > >
> >> >> > > > > Can I have /proc/cpuinfo and dmesg pls, in order to see whether I have
> >> >> > > > > something similar?
> >> >> > > > >
> >> >> > > > > Private mail's fine too.
> >> >> > > >
> >> >> > > > So I don't have exactly your model - mine is model 2, stepping 3 but I see
> >> >> > > > something strange too, in dmesg:
> >> >> > >
> >> >> > > I'm pretty sure the bug is in the merged 'x86-mm-for-linus' branch:
> >> >> > > Either Andy's "PCID optimized TLB flushing" (would be my guess) or
> >> >> > > 'encrypted memory' support by Tom Lendacky.
> >> >> > >
> >> >> > > (Bisecting is hard, because sometimes I can compile stuff for over 15
> >> >> > > minutes without hitting the bug. At other times the machine locks up
> >> >> > > hard when starting X11 already.)
> >> >> >
> >> >> > Do you have the 72c0098d92ce fix?
> >> >>
> >> >> Yes. The bug still happens on the current git tree (which has the fix
> >> >> already):
> >> >
> >> > The bug is definitely caused by Andy Lutomirski's "PCID optimized TLB
> >> > flushing" patches. Tom is off the hook.
> >>
> >> I'm pretty sure it can't be PCID per se, since these CPUs are way too
> >> old and are very unlikely to have PCID.
> >
> > Yes, the CPU doesn't support PCID (but it does support PGE).
> >
> >> It could plausibly be the lazy TLB flushing changes.
> >
> > Yes, I've narrowed it down to:
> >
> > commit 94b1b03b519b81c494900cb112aa00ed205cc2d9
> > Author: Andy Lutomirski <[email protected]>
> > Date: Thu Jun 29 08:53:17 2017 -0700
> >
> > x86/mm: Rework lazy TLB mode and TLB freshness tracking
> >
> >
> > Theoretically you guys should be able to reproduce the issue by using
> > the "nopcid" boot option.
> >
>
> Any chance you could test with CONFIG_DEBUG_VM=y? There are lots of
> potentially useful assertions in that code.

CONFIG_DEBUG_VM=y doesn't change anything. I still get the hard hang
without anything in the logs.

> Can you also post your /proc/cpuinfo? And can you re-confirm that a
> problematic guest kernel is causing problems in the *host*?

processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 955 Processor
stepping : 2
microcode : 0x10000db
cpu MHz : 3210.960
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs : tlb_mmatch apic_c1e fxsave_leak sysret_ss_attrs null_seg amd_e400
bogomips : 6424.50
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Unfortunately I cannot reproduce the qemu (kvm) problem anymore. (Perhaps
I have not tried long enough).
Anyway, kvm has code that should handle erratum_383.
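
For reference, the tlb_mmatch entry in the "bugs" line above is how the
kernel reports erratum 383 in /proc/cpuinfo. A small sketch of checking
for it from userspace follows. The helper name cpuinfo_line_has_flag is
made up for this example; it only matches whole whitespace-separated
words, so that e.g. "e400" does not match "amd_e400".

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `flag` appears as a whole
 * whitespace-separated word in `line` (one line of /proc/cpuinfo).
 */
static bool cpuinfo_line_has_flag(const char *line, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = line;

	if (len == 0)
		return false;

	while ((p = strstr(p, flag)) != NULL) {
		bool starts_word = (p == line) || p[-1] == ' ' || p[-1] == '\t';
		bool ends_word = p[len] == '\0' || p[len] == ' ' ||
				 p[len] == '\t' || p[len] == '\n';

		if (starts_word && ends_word)
			return true;
		p += len;	/* keep searching after this partial match */
	}
	return false;
}
```

Called with the bugs line quoted above and "tlb_mmatch", the helper
returns true.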

--
Markus

2017-09-09 10:18:25

by Borislav Petkov

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 08:39:08AM +0200, Markus Trippelsdorf wrote:
> Unfortunately the machine hangs in the BIOS after the first warm-reset.
> Probably when it encounters an MCE it doesn't expect. I have to
> warm-reset a second time to get to the boot-loader. So it is impossible
> for me to see any possible MCE.

Ok, let's try to disable the syncflood before the test. As root:

# V=$(setpci -s 18.3 0x44.l)
# echo $V
# V=$(printf "0x%x" $((0x$V & ~(1 << 21))))
# setpci -s 18.3 0x44.l=$V
# echo $V

I've added the echo $V commands so that you can paste their output in
your reply and I can see the values.

And then run the triggering sequence again, better not on an X terminal
but in the text console to see any MCEs when it freezes. I remember you
saying that you don't have serial connected to it so catching the MCE
would need more staring :)

Thx.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-09 11:07:54

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.09 at 12:18 +0200, Borislav Petkov wrote:
> On Sat, Sep 09, 2017 at 08:39:08AM +0200, Markus Trippelsdorf wrote:
> > Unfortunately the machine hangs in the BIOS after the first warm-reset.
> > Probably when it encounters an MCE it doesn't expect. I have to
> > warm-reset a second time to get to the boot-loader. So it is impossible
> > for me to see any possible MCE.
>
> Ok, let's try to disable the syncflood before the test. As root:
>
> # V=$(setpci -s 18.3 0x44.l)
> # echo $V

4a70005c

> # V=$(printf "0x%x" $((0x$V & ~(1 << 21))))
> # setpci -s 18.3 0x44.l=$V
> # echo $V

0x4a50005c

> I've added the echo $V commands so that you can paste their output in
> your reply and I can see the values.
>
> And then run the triggering sequence again, better not on an X terminal
> but in the text console to see any MCEs when it freezes. I remember you
> saying that you don't have serial connected to it so catching the MCE
> would need more staring :)

It doesn't work. Compiling in a text console just freezes the machine
before any MCE gets printed.

--
Markus

2017-09-09 13:07:43

by Borislav Petkov

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 01:07:49PM +0200, Markus Trippelsdorf wrote:
> It doesn't work. Compiling in a text console just freezes the machine
> before any MCE gets printed.

Ok, let's turn off all syncflood bits. Hunk below. Do a

$ dmesg | grep syncflood

to check it worked. It says

[ 1.557017] quirk_syncflood: 0x44: 0xa900044
[ 1.561431] quirk_syncflood: 0x44: wrote 0xa800040
[ 1.566361] quirk_syncflood: 0x180: 0x700022
[ 1.570775] quirk_syncflood: 0x180: wrote 0x20

here.

Also, make sure you boot with "pci=check_enable_amd_mmconf" on the
kernel cmdline because someone broke extended PCI cfg space again on
those machines. At least on my test box here... But that's something
I'll deal with later. :-\
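
As a cross-check of the numbers in this thread, the bit clearing done by
the quirk hunk further down can be replayed in plain userspace C. This
is a sketch only: BIT() is redefined locally, and the names MASK_0X44,
MASK_0X180 and clear_bits are made up here. The masks themselves are
copied from the hunk.

```c
#include <stdint.h>

#define BIT(n)	(1u << (n))

/* Bits cleared at config offset 0x44 by the quirk hunk. */
#define MASK_0X44	(BIT(30) | BIT(21) | BIT(20) | BIT(2))

/* Bits cleared at config offset 0x180 by the quirk hunk. */
#define MASK_0X180	(BIT(22) | BIT(21) | BIT(20) | BIT(9) | BIT(8) | \
			 BIT(7) | BIT(6) | BIT(1))

/* Return val with every bit in mask cleared. */
static uint32_t clear_bits(uint32_t val, uint32_t mask)
{
	return val & ~mask;
}
```

With the dmesg values quoted above, clear_bits(0xa900044, MASK_0X44)
gives 0xa800040 and clear_bits(0x700022, MASK_0X180) gives 0x20. The
earlier setpci attempt cleared only bit 21, which turns 0x4a70005c into
0x4a50005c.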

Thanks.

---
diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
index eaa591cfd98b..c6a4430d2222 100644
--- a/arch/x86/kernel/quirks.c
+++ b/arch/x86/kernel/quirks.c
@@ -626,6 +626,32 @@ static void amd_disable_seq_and_redirect_scrub(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_16H_NB_F3,
 			amd_disable_seq_and_redirect_scrub);

+static void quirk_syncflood(struct pci_dev *misc)
+{
+	u32 val;
+
+	pci_read_config_dword(misc, 0x44, &val);
+	pr_info("%s: 0x44: 0x%x\n", __func__, val);
+
+	val &= ~(BIT(30) | BIT(21) | BIT(20) | BIT(2));
+
+	pci_write_config_dword(misc, 0x44, val);
+	pci_read_config_dword(misc, 0x44, &val);
+	pr_info("%s: 0x44: wrote 0x%x\n", __func__, val);
+
+	pci_read_config_dword(misc, 0x180, &val);
+	pr_info("%s: 0x180: 0x%x\n", __func__, val);
+
+	val &= ~(BIT(22) | BIT(21) | BIT(20) | BIT(9) | BIT(8) | BIT(7) | BIT(6) | BIT(1));
+
+	pci_write_config_dword(misc, 0x180, val);
+	pci_read_config_dword(misc, 0x180, &val);
+	pr_info("%s: 0x180: wrote 0x%x\n", __func__, val);
+}
+
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_10H_NB_MISC,
+			quirk_syncflood);
+
 #if defined(CONFIG_X86_64) && defined(CONFIG_X86_MCE)
 #include <linux/jump_label.h>
 #include <asm/string_64.h>




--
Regards/Gruss,
Boris.


2017-09-09 13:37:50

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.09 at 15:07 +0200, Borislav Petkov wrote:
> On Sat, Sep 09, 2017 at 01:07:49PM +0200, Markus Trippelsdorf wrote:
> > It doesn't work. Compiling in a text console just freezes the machine
> > before any MCE gets printed.
>
> Ok, let's turn off all syncflood bits. Hunk below. Do a
>
> $ dmesg | grep syncflood
>
> to check it worked. It says
>
> [ 1.557017] quirk_syncflood: 0x44: 0xa900044
> [ 1.561431] quirk_syncflood: 0x44: wrote 0xa800040
> [ 1.566361] quirk_syncflood: 0x180: 0x700022
> [ 1.570775] quirk_syncflood: 0x180: wrote 0x20
>
> here.
>
> Also, make sure you boot with "pci=check_enable_amd_mmconf" on the
> kernel cmdline because someone broke extended PCI cfg space again on
> those machines. At least on my test box here... But that's something
> I'll deal with later. :-\

Thanks. This one worked:

mce: [Hardware Error]: CPU: 0 Machine Check Exception: 4 Bank 4: fa000010000b0c0f
mce: [Hardware Error]: TSC b75d6ef4ad MISC c00a00001000000
mce: [Hardware Error]: PROCESSOR 2:100f42 TIME 1504963036 SOCKET 0 APIC 0 microcode 1000db

(I had to copy the above by hand, so it may not be 100% accurate).

--
Markus

2017-09-09 13:39:58

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.09 at 15:37 +0200, Markus Trippelsdorf wrote:
> On 2017.09.09 at 15:07 +0200, Borislav Petkov wrote:
> > On Sat, Sep 09, 2017 at 01:07:49PM +0200, Markus Trippelsdorf wrote:
> > > It doesn't work. Compiling in a text console just freezes the machine
> > > before any MCE gets printed.
> >
> > Ok, let's turn off all syncflood bits. Hunk below. Do a
> >
> > $ dmesg | grep syncflood
> >
> > to check it worked. It says
> >
> > [ 1.557017] quirk_syncflood: 0x44: 0xa900044
> > [ 1.561431] quirk_syncflood: 0x44: wrote 0xa800040
> > [ 1.566361] quirk_syncflood: 0x180: 0x700022
> > [ 1.570775] quirk_syncflood: 0x180: wrote 0x20
> >
> > here.
> >
> > Also, make sure you boot with "pci=check_enable_amd_mmconf" on the
> > kernel cmdline because someone broke extended PCI cfg space again on
> > those machines. At least on my test box here... But that's something
> > I'll deal with later. :-\
>
> Thanks. This one worked:
>
> mce: [Hardware Error]: CPU: 0 Machine Check Exception: 4 Bank 4: fa000010000b0c0f
> mce: [Hardware Error]: TSC b75d6ef4ad MISC c00a00001000000
> mce: [Hardware Error]: PROCESSOR 2:100f42 TIME 1504963036 SOCKET 0 APIC 0 microcode 1000db

Decoded:

CPU: 0 Machine Check Exception: 4 Bank 4: fa000010000b0c0f
Hardware event. This is not a software error.
CPU 0 0 data cache TSC b75d6ef4ad
TIME 1504963036 Sat Sep 9 15:17:16 2017
STATUS 0 MCGSTATUS 0
CPUID Vendor AMD Family 16 Model 4
SOCKET 0 APIC 0 microcode 1000db


--
Markus

2017-09-09 14:07:20

by Borislav Petkov

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 03:39:54PM +0200, Markus Trippelsdorf wrote:
> > mce: [Hardware Error]: CPU: 0 Machine Check Exception: 4 Bank 4: fa000010000b0c0f
> > mce: [Hardware Error]: TSC b75d6ef4ad MISC c00a00001000000
> > mce: [Hardware Error]: PROCESSOR 2:100f42 TIME 1504963036 SOCKET 0 APIC 0 microcode 1000db
>
> Decoded:
>
> CPU: 0 Machine Check Exception: 4 Bank 4: fa000010000b0c0f
> Hardware event. This is not a software error.
> CPU 0 0 data cache TSC b75d6ef4ad
> TIME 1504963036 Sat Sep 9 15:17:16 2017
> STATUS 0 MCGSTATUS 0
> CPUID Vendor AMD Family 16 Model 4
> SOCKET 0 APIC 0 microcode 1000db

Yeah, this is not really decoding it - I need to address that case of
uncorrectable MCE not being decoded too.

In any case, it is not E383:

MC4_STATUS[Val|Over|UC|EN|MiscV|PCC|UECC|EEC: GART cache table walk encountered an invalid PTE (0x05)|ET: TLB(tt:GEN;ll:LG)]: 0xfa0020000005001b

And those should actually be masked out:

"BIOS is recommended to mask GART table walk errors by setting the bit
in MSRC001_0048 corresponding to F3x40[GartTblWkEn]."

And we disable those but for some reason, it doesn't stick :-)

Do

# modprobe msr
# rdmsr -a 0x00000410
# rdmsr -a 0xc0010048

as root.

Thanks.

--
Regards/Gruss,
Boris.


2017-09-09 14:20:19

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.09 at 16:07 +0200, Borislav Petkov wrote:
> On Sat, Sep 09, 2017 at 03:39:54PM +0200, Markus Trippelsdorf wrote:
> > > mce: [Hardware Error]: CPU: 0 Machine Check Exception: 4 Bank 4: fa000010000b0c0f
> > > mce: [Hardware Error]: TSC b75d6ef4ad MISC c00a00001000000
> > > mce: [Hardware Error]: PROCESSOR 2:100f42 TIME 1504963036 SOCKET 0 APIC 0 microcode 1000db
> >
> > Decoded:
> >
> > CPU: 0 Machine Check Exception: 4 Bank 4: fa000010000b0c0f
> > Hardware event. This is not a software error.
> > CPU 0 0 data cache TSC b75d6ef4ad
> > TIME 1504963036 Sat Sep 9 15:17:16 2017
> > STATUS 0 MCGSTATUS 0
> > CPUID Vendor AMD Family 16 Model 4
> > SOCKET 0 APIC 0 microcode 1000db
>
> Yeah, this is not really decoding it - I need to address that case of
> uncorrectable MCE not being decoded too.
>
> In any case, it is not E383:
>
> MC4_STATUS[Val|Over|UC|EN|MiscV|PCC|UECC|EEC: GART cache table walk encountered an invalid PTE (0x05)|ET: TLB(tt:GEN;ll:LG)]: 0xfa0020000005001b
>
> And those should actually be masked out:
>
> "BIOS is recommended to mask GART table walk errors by setting the bit
> in MSRC001_0048 corresponding to F3x40[GartTblWkEn]."
>
> And we disable those but for some reason, it doesn't stick :-)
>
> Do
>
> # rdmsr -a 0x00000410

3fffffff
0
0
0

> # rdmsr -a 0xc0010048

780400
780400
780400
780400

--
Markus

2017-09-09 14:33:52

by Borislav Petkov

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 04:20:14PM +0200, Markus Trippelsdorf wrote:
> > # rdmsr -a 0x00000410
>
> 3fffffff
> 0
> 0
> 0

WTF?! Those should be equal on every CPU. Yikes, we need to pay
attention to those... Grrr.

# wrmsr -a 0x00000410 0x3ffffbff

should fix your issue.
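
The MSR number and the value in the wrmsr line above can be reproduced
with a little arithmetic. The sketch below assumes the architectural
machine-check bank layout (MC0_CTL at 0x400, four MSRs per bank: CTL,
STATUS, ADDR, MISC) and, going by the value used in this thread, that
bit 10 of MC4_CTL is the GART table walk error enable. The function
names are made up for illustration.

```c
#include <stdint.h>

#define MSR_MC0_CTL		0x400u	/* first bank's control MSR */
#define GART_TBL_WALK_BIT	10	/* assumed from 0x3fffffff -> 0x3ffffbff */

/* Each bank owns four consecutive MSRs: CTL, STATUS, ADDR, MISC. */
static uint32_t mci_ctl_msr(unsigned int bank)
{
	return MSR_MC0_CTL + 4 * bank;
}

static uint32_t mci_status_msr(unsigned int bank)
{
	return mci_ctl_msr(bank) + 1;
}

static uint32_t mci_addr_msr(unsigned int bank)
{
	return mci_ctl_msr(bank) + 2;
}

/* Return ctl with GART table walk error reporting masked off. */
static uint64_t mask_gart_tbl_walk(uint64_t ctl)
{
	return ctl & ~(1ULL << GART_TBL_WALK_BIT);
}
```

For bank 4 this gives 0x410 for MC4_CTL, and masking 0x3fffffff yields
the 0x3ffffbff written above.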

--
Regards/Gruss,
Boris.


2017-09-09 14:43:54

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.09 at 16:33 +0200, Borislav Petkov wrote:
> On Sat, Sep 09, 2017 at 04:20:14PM +0200, Markus Trippelsdorf wrote:
> > > # rdmsr -a 0x00000410
> >
> > 3fffffff
> > 0
> > 0
> > 0
>
> WTF?! Those should be equal on every CPU. Yikes, we need to pay
> attention to those... Grrr.
>
> # wrmsr -a 0x00000410 0x3ffffbff
>
> should fix your issue.

No, it doesn't work:

x4 ~ # rdmsr -a 0x00000410
3fffffff
0
0
0
x4 ~ # wrmsr -a 0x00000410 0x3ffffbff
x4 ~ # rdmsr -a 0x00000410
3ffffbff
0
0
0

--
Markus

2017-09-09 16:32:30

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.09 at 16:43 +0200, Markus Trippelsdorf wrote:
> On 2017.09.09 at 16:33 +0200, Borislav Petkov wrote:
> > On Sat, Sep 09, 2017 at 04:20:14PM +0200, Markus Trippelsdorf wrote:
> > > > # rdmsr -a 0x00000410
> > >
> > > 3fffffff
> > > 0
> > > 0
> > > 0
> >
> > WTF?! Those should be equal on every CPU. Yikes, we need to pay
> > attention to those... Grrr.
> >
> > # wrmsr -a 0x00000410 0x3ffffbff
> >
> > should fix your issue.
>
> No, it doesn't work:
>
> x4 ~ # rdmsr -a 0x00000410
> 3fffffff
> 0
> 0
> 0
> x4 ~ # wrmsr -a 0x00000410 0x3ffffbff
> x4 ~ # rdmsr -a 0x00000410
> 3ffffbff
> 0
> 0
> 0

Also tried the following patch. It does not help.

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3b413065c613..9ee1edb0929f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1580,7 +1580,7 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)

 	/* This should be disabled by the BIOS, but isn't always */
 	if (c->x86_vendor == X86_VENDOR_AMD) {
-		if (c->x86 == 15 && cfg->banks > 4) {
+		if ((c->x86 == 15 || c->x86 == 16) && cfg->banks > 4) {
 			/*
 			 * disable GART TBL walk error reporting, which
 			 * trips off incorrectly with the IOMMU & 3ware

--
Markus

2017-09-09 17:05:54

by Borislav Petkov

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 06:32:25PM +0200, Markus Trippelsdorf wrote:
> Also tried the following patch. It does not help.

Ok, another theory. This one still needs to be fixed properly, but that's
for later.

For some reason (insufficient coffee maybe), I have mistyped your
MCi_STATUS value earlier. Your mail says it is "fa000010000b0c0f". Do
you still have a screen photo to verify it?

Because if so, the correct error type is:

MC4_STATUS[Val|Over|UC|EN|MiscV|PCC|EEC: Protocol error (link, L3, probe filter) (0x0b)|ET: BUS(pp:OBS;t:NOTIMOUT;r4:GEN;ii:GEN;ll:LG)]: 0xfa000010000b0c0f

And for that I'd need the MC4_ADDR value too.
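
The flag part of that decode can be checked by hand. The sketch below
uses the architectural MCi_STATUS flag positions (Val at bit 63 down to
PCC at bit 57) and assumes, as the two decoded lines in this thread
suggest, that the extended error code sits in bits 20:16. The struct and
function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCI_STATUS_VAL		(1ULL << 63)	/* valid error */
#define MCI_STATUS_OVER		(1ULL << 62)	/* previous errors lost */
#define MCI_STATUS_UC		(1ULL << 61)	/* uncorrected error */
#define MCI_STATUS_EN		(1ULL << 60)	/* error enabled */
#define MCI_STATUS_MISCV	(1ULL << 59)	/* MISC register valid */
#define MCI_STATUS_ADDRV	(1ULL << 58)	/* ADDR register valid */
#define MCI_STATUS_PCC		(1ULL << 57)	/* processor context corrupt */

/* Hypothetical container for the decoded fields. */
struct mci_status_fields {
	bool val, over, uc, en, miscv, addrv, pcc;
	unsigned int ext_err_code;	/* assumed: bits 20:16 */
	unsigned int err_code;		/* bits 15:0 */
};

static struct mci_status_fields decode_mci_status(uint64_t status)
{
	struct mci_status_fields f;

	f.val   = (status & MCI_STATUS_VAL) != 0;
	f.over  = (status & MCI_STATUS_OVER) != 0;
	f.uc    = (status & MCI_STATUS_UC) != 0;
	f.en    = (status & MCI_STATUS_EN) != 0;
	f.miscv = (status & MCI_STATUS_MISCV) != 0;
	f.addrv = (status & MCI_STATUS_ADDRV) != 0;
	f.pcc   = (status & MCI_STATUS_PCC) != 0;
	f.ext_err_code = (unsigned int)((status >> 16) & 0x1f);
	f.err_code = (unsigned int)(status & 0xffff);
	return f;
}
```

For 0xfa000010000b0c0f this yields Val, Over, UC, EN, MiscV and PCC set,
AddrV clear, and an extended error code of 0x0b. The mistyped value
0xfa0020000005001b yields 0x05 instead, matching the earlier GART decode.
AddrV being clear is consistent with no ADDR showing up in the MCE output
so far.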

So can you please apply the patch below on top of the syncflood quirk
patch and retrigger, take a photo of the MCE and send it to me?

Thanks.

---
commit e84e5ad290c7c26af69a721148f404766529509b
Author: Borislav Petkov <[email protected]>
Date: Sat Sep 9 00:55:50 2017 +0200

x86/MCE/AMD: Collect error info even if valid bits are not set

The MCA banks log error info into MCA_ADDR, MCA_MISC0, and MCA_SYND even
if the corresponding valid bits are not set:

"Error handlers should save the values in MCA_ADDR, MCA_MISC0,
and MCA_SYND even if MCA_STATUS[AddrV], MCA_STATUS[MiscV], and
MCA_STATUS[SyndV] are zero."

Do so by setting those bits so that code down the MCE processing path
doesn't need to be changed.

Signed-off-by: Borislav Petkov <[email protected]>

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3b413065c613..c63c7ef326c7 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -436,6 +436,20 @@ static inline void mce_gather_info(struct mce *m, struct pt_regs *regs)
 		if (mca_cfg.rip_msr)
 			m->ip = mce_rdmsrl(mca_cfg.rip_msr);
 	}
+
+	/*
+	 * Error handlers should save the values in MCA_ADDR, MCA_MISC0, and
+	 * MCA_SYND even if MCA_STATUS[AddrV], MCA_STATUS[MiscV], and
+	 * MCA_STATUS[SyndV] are zero.
+	 */
+	if (m->cpuvendor == X86_VENDOR_AMD) {
+		u64 status = MCI_STATUS_ADDRV | MCI_STATUS_MISCV;
+
+		if (mce_flags.smca)
+			status |= MCI_STATUS_SYNDV;
+
+		m->status |= status;
+	}
 }

 int mce_available(struct cpuinfo_x86 *c)

--
Regards/Gruss,
Boris.


2017-09-09 17:23:57

by Markus Trippelsdorf

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.09 at 19:05 +0200, Borislav Petkov wrote:
> On Sat, Sep 09, 2017 at 06:32:25PM +0200, Markus Trippelsdorf wrote:
> > Also tried the following patch. It does not help.
>
> Ok, another theory. This one still needs to be fixed properly, but that's
> for later.
>
> For some reason (insufficient coffee maybe), I have mistyped your
> MCi_STATUS value earlier. Your mail says it is "fa000010000b0c0f". Do
> you still have a screen photo to verify it?

I double checked and the value is correct.

> Because if so, the correct error type is:
>
> MC4_STATUS[Val|Over|UC|EN|MiscV|PCC|EEC: Protocol error (link, L3, probe filter) (0x0b)|ET: BUS(pp:OBS;t:NOTIMOUT;r4:GEN;ii:GEN;ll:LG)]: 0xfa000010000b0c0f
>
> And for that I'd need the MC4_ADDR value too.
>
> So can you please apply the patch below on top of the syncflood quirk
> patch and retrigger, take a photo of the MCE and send it to me?

Hmm, the output is exactly the same as before your patch.

--
Markus

2017-09-09 17:36:52

by Borislav Petkov

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 07:23:52PM +0200, Markus Trippelsdorf wrote:
> Hmm, the output is exactly the same as before your patch.

Bah, that patch doesn't account for the fact that we're rereading the
status field again in do_machine_check().

Ok, let's force MCi_ADDR out. On top:

---
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c63c7ef326c7..e5580da2c491 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -240,8 +240,7 @@ static void __print_mce(struct mce *m)
 	}

 	pr_emerg(HW_ERR "TSC %llx ", m->tsc);
-	if (m->addr)
-		pr_cont("ADDR %llx ", m->addr);
+	pr_cont("ADDR %llx ", m->addr);
 	if (m->misc)
 		pr_cont("MISC %llx ", m->misc);

@@ -636,8 +635,9 @@ static void mce_read_aux(struct mce *m, int i)
 	if (m->status & MCI_STATUS_MISCV)
 		m->misc = mce_rdmsrl(msr_ops.misc(i));

+	m->addr = mce_rdmsrl(msr_ops.addr(i));
+
 	if (m->status & MCI_STATUS_ADDRV) {
-		m->addr = mce_rdmsrl(msr_ops.addr(i));

 		/*
 		 * Mask the reported address by the reported granularity.

---
Thanks.

--
Regards/Gruss,
Boris.


2017-09-09 17:49:58

by Andy Lutomirski

Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Fri, Sep 8, 2017 at 6:39 PM, Andy Lutomirski wrote:
>
>
>> On Sep 8, 2017, at 6:05 PM, Linus Torvalds wrote:
>>
>>> On Fri, Sep 8, 2017 at 5:00 PM, Andy Lutomirski wrote:
>>>
>>> I'm not convinced. The SDM says (Vol 3, 11.3, under WC):
>>>
>>> If the WC buffer is partially filled, the writes may be delayed until
>>> the next occurrence of a serializing event; such as, an SFENCE or
>>> MFENCE instruction, CPUID execution, a read or write to uncached
>>> memory, an interrupt occurrence, or a LOCK instruction execution.
>>>
>>> Thanks, Intel, for defining "serializing event" differently here than
>>> anywhere else in the whole manual.
>>
>> Yeah, it's really badly defined. Ok, maybe a locked instruction does
>> actually wait for it.. It should be invisible to anything, regardless.
>>
>>> 1. The kernel wants to reclaim a page of normal memory, so it unmaps
>>> it and flushes. Another CPU has an entry for that page in its WC
>>> buffer. I don't think we care whether the flush causes the WC write
>>> to really hit RAM because it's unobservable -- we just need to make
>>> sure it is ordered, as seen by software, before the flush operation
>>> completes. From the quote above, I think we're okay here.
>>
>> Agreed.
>>
>>> 2. The kernel is unmapping some IO memory (e.g. a GPU command buffer).
>>> It wants a guarantee that, when flush_tlb_mm_range returns, all CPUs
>>> are really done writing to it. Here I'm less convinced. The SDM
>>> quote certainly suggests to me that we have a promise that the WC
>>> write has *started* before flush_tlb_mm_range returns, but I'm not
>>> sure I believe that it's guaranteed to have retired.
>>
>> If others have writable TLB entries, what keeps them from just
>> continuing to write for a long time afterwards?
>
> Whoever unmaps the resource by kicking out their drm fd? I admit I'm just trying to think of the worst case.
>
>>
>>> I'd prefer to leave it as is except on the buggy AMD CPUs, though,
>>> since the current code is nice and fast.
>>
>> So is there a patch to detect the 383 erratum and serialize for those?
>> I may have missed that part.
>>
>
> The patch is in my head. It's imaginarily attached to this email.

After contemplating the info from Boris and Markus, I think I need to
add a #3 to the list of reasons my patch could be problematic:

3. If a CPU frees a page table (or PUD or PMD or whatever), that CPU
will flush before the memory goes back to the system. If that flush
is deferred on a different CPU that has the pointer to the freed table
cached in its TLB, then that CPU can speculatively load complete
garbage into its TLB.

I don't think this should be observable, but I can easily imagine it
triggering errata or weird ill-advised machine checks.

Anyway, if I need to change the behavior back, I can do it in one of two
ways. I can just switch to init_mm instead of going lazy, which is
expensive, but not *that* expensive on CPUs with PCID. Or I can do it
the way we used to do it and send the flush IPI to lazy CPUs. The
latter will only have a performance impact when a flush happens, but
the performance hit is much higher when there's a flush.

--Andy

2017-09-09 18:02:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 9, 2017 at 10:49 AM, Andy Lutomirski <[email protected]> wrote:
>
> Anyway, if I need to change the behavior back, I can do it in one of two
> ways. I can just switch to init_mm instead of going lazy, which is
> expensive, but not *that* expensive on CPUs with PCID. Or I can do it
> the way we used to do it and send the flush IPI to lazy CPUs. The
> latter will only have a performance impact when a flush happens, but
> the performance hit is much higher when there's a flush.

Why not both?

Let's at least entertain the idea. In particular, we don't send IPI's
to *all* CPU's. We only send them to the set of CPU's that could have
that MM cached.

And that set _may_ be very limited. In the best case, it's just the
current CPU, and no IPI is needed at all.

Which means that maybe we can use that set of CPU's as guidance to how
we should treat lazy.

We can *also* take PCID support into account.

So what I would suggest is something like

- if we have PCID support, _and_ the set of CPU's is more than just
us, just switch to init_mm. The switch is cheaper than the IPI's.

- otherwise do what we used to do, with the IPI.
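
The two-way choice above can be sketched as a small decision function. This is standalone C, not kernel code: choose_lazy_policy(), have_pcid and other_cpus_with_mm are hypothetical names made up for illustration, and only the decision rule itself comes from the text.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; this is not kernel code. */
enum lazy_policy {
	LAZY_SWITCH_TO_INIT_MM,	/* leave the mm right away, no IPI needed later */
	LAZY_KEEP_MM_TAKE_IPI	/* old behavior: stay on the mm, accept flush IPIs */
};

/*
 * have_pcid:          the CPU supports PCID
 * other_cpus_with_mm: number of CPUs besides this one that may have the
 *                     mm cached (assumed to be derived from the mm's cpumask)
 */
enum lazy_policy choose_lazy_policy(bool have_pcid,
				    unsigned int other_cpus_with_mm)
{
	/* "PCID support, _and_ the set of CPU's is more than just us" */
	if (have_pcid && other_cpus_with_mm > 0)
		return LAZY_SWITCH_TO_INIT_MM;

	/* otherwise fall back to the IPI-based scheme */
	return LAZY_KEEP_MM_TAKE_IPI;
}
```

As said, the exact thresholds could be tuned later; the sketch only encodes the first cut.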

The exact heuristics could be tuned later, but considering Markus's
report, and considering that not so many people have really even
heavily tested the new code yet (so _one_ report now means that there
are probably a shitload of machines that would show it later), I
really think we need to steer back towards our old behavior. But at
the same time, I think we can take advantage of newer CPU's that _do_
have PCID.

Hmm?

Linus

2017-09-09 18:14:48

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.09 at 19:36 +0200, Borislav Petkov wrote:
> On Sat, Sep 09, 2017 at 07:23:52PM +0200, Markus Trippelsdorf wrote:
> > Hmm, the output is exactly the same as before your patch.
>
> Bah, that patch doesn't account for the fact that we're rereading the
> status field again in do_machine_check().
>
> Ok, let's force MCi_ADDR out. On top:

Thanks, will try it later.

I think the issue gets fixed by:

# wrmsr -a 0xc0010015 0x1000018

Setting bit 3 of the Hardware Configuration Register to 1.

Quote from the docs:
»TlbCacheDis: cacheable memory disable. Read-write. 0=Enables performance optimization that
assumes PML4, PDP, PDE, and PTE entries are in cacheable WB-DRAM; memory type checks may
be bypassed, and addresses outside of WB-DRAM may result in undefined behavior or NB protocol
errors. 1=Disables performance optimization and allows PML4, PDP, PDE and PTE entries to be in
any memory type. Operating systems that maintain page tables in memory types other than WB-
DRAM must set TlbCacheDis to insure proper operation.«
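
For reference, the bit arithmetic behind the wrmsr value can be written out as below. This is only a sketch: the MSR number and the bit position come from the command and the quote above, while the macro and function names are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; the numbers are the ones used in this mail. */
#define HWCR_MSR		0xc0010015u	/* Hardware Configuration Register */
#define HWCR_TLBCACHEDIS	(1ULL << 3)	/* bit 3: TlbCacheDis */

/* True if TlbCacheDis is set in a value read from the register. */
bool hwcr_tlbcachedis_enabled(uint64_t hwcr)
{
	return (hwcr & HWCR_TLBCACHEDIS) != 0;
}

/* Return the register value with TlbCacheDis set and all other bits kept. */
uint64_t hwcr_with_tlbcachedis(uint64_t hwcr)
{
	return hwcr | HWCR_TLBCACHEDIS;
}
```

0x1000018 has bits 3, 4 and 24 set. Writing a fixed value replaces the whole register, so on a box where the other bits differ it is probably safer to read the current value with rdmsr first and OR in bit 3.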


I've been successfully compiling for over 15 minutes now.

--
Markus

2017-09-09 18:26:24

by Borislav Petkov

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 08:14:45PM +0200, Markus Trippelsdorf wrote:
> # wrmsr -a 0xc0010015 0x1000018

I know but I'd still like to see the exact error signature.

So please clear that bit 3 and try to catch that MCE together with the
ADDR.

Thanks.


--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-09 18:26:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 9, 2017 at 11:14 AM, Markus Trippelsdorf
<[email protected]> wrote:
>
> I think the issue gets fixed by:
>
> # wrmsr -a 0xc0010015 0x1000018
>
> Setting bit 3 of the Hardware Configuration Register to 1.
>
> Quote from the docs:
> »TlbCacheDis: cacheable memory disable. Read-write. 0=Enables performance optimization that
> assumes PML4, PDP, PDE, and PTE entries are in cacheable WB-DRAM

Uhhuh.

The page directories should *definitely* always be in cacheable
memory, so it should be ok for that bit to be 0, and it's possible
that setting it to 1 will seriously screw up performance.

But the fact that that fixes it for you does indicate that it's not
just a stale TLB entry or something, it really is some CPU using page
tables after they have been free'd and been re-allocated to something
else (and *then* they may point to garbage).

So I do think it's a sign that we definitely need that IPI for you.

Linus

2017-09-09 18:30:04

by Borislav Petkov

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 11:26:27AM -0700, Linus Torvalds wrote:
> But the fact that that fixes it for you does indicate that it's not
> just a stale TLB entry or something, it really is some CPU using page
> tables after they have been free'd and been re-allocated to something
> else (and *then* they may point to garbage).

Cool, I was trying to think of a good use case how we'd hit that. I
guess you just gave one. :)

Thx.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-09 18:46:44

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.09 at 20:26 +0200, Borislav Petkov wrote:
> On Sat, Sep 09, 2017 at 08:14:45PM +0200, Markus Trippelsdorf wrote:
> > # wrmsr -a 0xc0010015 0x1000018
>
> I know but I'd still like to see the exact error signature.
>
> So please clear that bit 3 and try to catch that MCE together with the
> ADDR.

OK. ADDR is 12. The rest is the same (modulo time).

--
Markus

2017-09-09 18:47:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 9, 2017 at 11:29 AM, Borislav Petkov <[email protected]> wrote:
> On Sat, Sep 09, 2017 at 11:26:27AM -0700, Linus Torvalds wrote:
>> But the fact that that fixes it for you does indicate that it's not
>> just a stale TLB entry or something, it really is some CPU using page
>> tables after they have been free'd and been re-allocated to something
>> else (and *then* they may point to garbage).
>
> Cool, I was trying to think of a good use case how we'd hit that. I
> guess you just gave one. :)

The thing is, even with the delayed TLB flushing, I don't think it
should be *so* delayed that we should be seeing a TLB fill from
garbage page tables.

But the part in Andy's patch that worries me the most is that

+ cpumask_clear_cpu(cpu, mm_cpumask(mm));

in enter_lazy_tlb(). It means that we won't be notified by people
invalidating the page tables, and while we then do re-validate the TLB
when we switch back from lazy mode, I still worry. I'm not entirely
convinced by that tlb_gen logic.

I can't actually see anything *wrong* in the tlb_gen logic, but it worries me.
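
To make the concern concrete, here is a toy model of a generation-counter catch-up of the kind described above. It is a sketch under assumptions, not the actual tlb_gen code: struct gen_mm, struct gen_cpu and both functions are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model; all names are hypothetical and this is not the kernel's code. */
struct gen_mm  { uint64_t tlb_gen; };		/* bumped on every flush request */
struct gen_cpu { uint64_t seen_tlb_gen; };	/* generation this CPU caught up to */

/* Someone invalidates page tables: only the counter moves, no IPI is sent. */
void gen_request_flush(struct gen_mm *mm)
{
	mm->tlb_gen++;
}

/*
 * Called when the CPU switches back from lazy mode.
 * Returns true if it had missed flushes and had to catch up.
 */
bool gen_leave_lazy(struct gen_cpu *cpu, const struct gen_mm *mm)
{
	if (cpu->seen_tlb_gen == mm->tlb_gen)
		return false;

	/* a real implementation would flush the TLB here */
	cpu->seen_tlb_gen = mm->tlb_gen;
	return true;
}
```

In this model the catch-up happens only at the moment lazy mode is left. Between gen_request_flush() and gen_leave_lazy() the CPU still holds whatever it had cached, which is exactly the window in question.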

Linus

2017-09-09 19:10:05

by Borislav Petkov

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 11:47:33AM -0700, Linus Torvalds wrote:
> The thing is, even with the delayed TLB flushing, I don't think it
> should be *so* delayed that we should be seeing a TLB fill from
> garbage page tables.

Yeah, but we can't know what kind of speculative accesses happen between
the removal from the mask and the actual flushing.

> But the part in Andy's patch that worries me the most is that
>
> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
>
> in enter_lazy_tlb(). It means that we won't be notified by people
> invalidating the page tables, and while we then do re-validate the TLB
> when we switch back from lazy mode, I still worry. I'm not entirely
> convinced by that tlb_gen logic.
>
> I can't actually see anything *wrong* in the tlb_gen logic, but it worries me.

Yeah, sounds like we're uncovering a situation of possibly stale
mappings which we haven't had before. Or at least widening that window.

And I still need to analyze what that MCE on Markus' machine is saying
exactly. The TlbCacheDis thing is an optimization which does away with
memory type checks. But we probably will have to disable it on those
boxes as we can't guarantee pagetable elements are all in WB mem...

Or we can guarantee them in WB but the lazy flushing delays the actual
clearing of the TLB entries so much so that they end up pointing to
garbage, as you say, which is not in WB mem and thus causes the protocol
error.

Hmm. All still wet.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-09 19:11:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 08:46:38PM +0200, Markus Trippelsdorf wrote:
> OK. ADDR is 12. The rest is the same (modulo time).

I'm assuming that's 12 hex... yeah, "ADDR %llx ".

Dammit, that should have "0x" prepended. Grrr, I'll fix all that next
week.
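
The ambiguity is easy to show in userspace. A minimal sketch, assuming a made-up helper format_mce_addr() that is not the kernel's __print_mce():

```c
#include <stdio.h>

/*
 * Hypothetical helper for illustration only.
 * Formats the address the way the current message does ("ADDR %llx ")
 * or with an explicit "0x" prefix. Returns the number of characters
 * written, as snprintf() does.
 */
int format_mce_addr(char *buf, size_t len, unsigned long long addr,
		    int with_prefix)
{
	if (with_prefix)
		return snprintf(buf, len, "ADDR 0x%llx ", addr);

	return snprintf(buf, len, "ADDR %llx ", addr);
}
```

With addr = 0x12 the first form gives "ADDR 12 ", which reads like decimal twelve, while the prefixed form gives "ADDR 0x12 ".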

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-09 19:19:47

by Borislav Petkov

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 09:11:33PM +0200, Borislav Petkov wrote:
> On Sat, Sep 09, 2017 at 08:46:38PM +0200, Markus Trippelsdorf wrote:
> > OK. ADDR is 12. The rest is the same (modulo time).
>
> I'm assuming that's 12 hex... yeah, "ADDR %llx ".
>
> Dammit, that should have "0x" prepended. Grrr, I'll fix all that next
> week.

Ok, that 0x12 looks like it fits the TlbCacheDis thing (bits [5:1]):

"0_1001b

Link: A specific coherent-only packet from a CPU was issued to an
IO link. This may be caused by software which addresses page table
structures in a memory type other than cacheable WB-DRAM without
properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
example, when page table structure addresses are above top of memory. In
such cases, the NB will generate an MCE if it sees a mismatch between
the memory operation generated by the core and the link type. See
2.9.3.1.2 [Determining The Access Destination for CPU Accesses]."
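
The bit extraction behind that match, as a small sketch (mce_addr_bits_5_1() is a hypothetical helper, not an existing kernel function):

```c
#include <stdint.h>

/* Hypothetical helper: return bits [5:1] of the reported ADDR value. */
unsigned int mce_addr_bits_5_1(uint64_t addr)
{
	return (unsigned int)((addr >> 1) & 0x1f);
}
```

0x12 is 1_0010b; dropping bit 0 leaves 0_1001b (0x9), which is the encoding quoted above.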

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-09 19:21:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 9, 2017 at 12:09 PM, Borislav Petkov <[email protected]> wrote:
>
> Yeah, but we can't know what kind of speculative accesses happen between
> the removal from the mask and the actual flushing.

Indeed. The speculative kernel thread accesses while lazy could easily
trigger this.

And I guess those are pretty fundamental. So..

Linus

2017-09-09 19:28:53

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 9, 2017 at 12:09 PM, Borislav Petkov <[email protected]> wrote:
> On Sat, Sep 09, 2017 at 11:47:33AM -0700, Linus Torvalds wrote:
>> The thing is, even with the delayed TLB flushing, I don't think it
>> should be *so* delayed that we should be seeing a TLB fill from
>> garbage page tables.
>
> Yeah, but we can't know what kind of speculative accesses happen between
> the removal from the mask and the actual flushing.
>
>> But the part in Andy's patch that worries me the most is that
>>
>> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
>>
>> in enter_lazy_tlb(). It means that we won't be notified by people
>> invalidating the page tables, and while we then do re-validate the TLB
>> when we switch back from lazy mode, I still worry. I'm not entirely
>> convinced by that tlb_gen logic.
>>
>> I can't actually see anything *wrong* in the tlb_gen logic, but it worries me.
>
> Yeah, sounds like we're uncovering a situation of possibly stale
> mappings which we haven't had before. Or at least widening that window.
>
> And I still need to analyze what that MCE on Markus' machine is saying
> exactly. The TlbCacheDis thing is an optimization which does away with
> memory type checks. But we probably will have to disable it on those
> boxes as we can't guarantee pagetable elements are all in WB mem...
>
> Or we can guarantee them in WB but the lazy flushing delays the actual
> clearing of the TLB entries so much so that they end up pointing to
> garbage, as you say, which is not in WB mem and thus causes the protocol
> error.
>
> Hmm. All still wet.
>

I think it's my theory #3. The CPU has a "paging-structure cache"
(Intel lingo) that points to a freed page. The CPU speculatively
follows it and gets complete garbage, triggering this MCE and who
knows what else.

I propose the following fix. If PCID is on, then, in
enter_lazy_tlb(), we switch to init_mm with the no-flush flag set.
(And we give init_mm its own dedicated ASID to keep it simple and fast
-- no need to use the LRU ASID mapping to assign one dynamically.) We
clear the bit in mm_cpumask. That is, we more or less just skip the
whole lazy TLB optimization and rely on PCID CPUs having reasonably
fast CR3 writes. No extra IPIs. I suppose I need to benchmark this.
It will certainly slow down workloads that rapidly toggle between a
user thread and a kernel thread because it forces serialization on
each mm switch, but maybe that's not so bad.

If PCID is off, then we leave the old CR3 value when we go lazy, and
we also leave the flag in mm_cpumask set. When a flush is requested,
we send out the IPI and switch to init_mm (and flush because we have
no choice). IOW, the no-PCID behavior goes back to what it used to
be.

For the PCID case, I'm relying on this language in the SDM (vol 3, 4.10):

When a logical processor creates entries in the TLBs (Section 4.10.2)
and paging-structure caches (Section
4.10.3), it associates those entries with the current PCID. When using
entries in the TLBs and paging-structure
caches to translate a linear address, a logical processor uses only
those entries associated with the current PCID
(see Section 4.10.2.4 for an exception).

This is also just common sense -- a CPU that makes any assumptions
about a paging-structure cache for an inactive ASID is just nuts,
especially if it assumes that the result of following it is at all
sane. IOW, we really should be able to switch to ASID 1 and back to 0
without any flushes without worrying that the old page tables for ASID
1 might get freed afterwards. Obviously we need to flush if we switch
back to PCID 1, but the code already does this.
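
For readers not familiar with the no-flush flag: with PCID enabled, the low 12 bits of CR3 carry the PCID and bit 63 of the value written asks the CPU to keep the cached translations for that PCID. A sketch of how such a value is composed, with hypothetical names (toy_make_cr3() and the TOY_ macros are not the kernel's helpers):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; only the bit layout is architectural. */
#define TOY_CR3_PCID_MASK	0xfffULL	/* CR3 bits 11:0 hold the PCID */
#define TOY_CR3_NOFLUSH		(1ULL << 63)	/* bit 63: do not flush this PCID */

/*
 * pgd_phys: physical address of the top-level page table (page aligned)
 * pcid:     PCID to run under
 * noflush:  keep cached translations tagged with this PCID
 */
uint64_t toy_make_cr3(uint64_t pgd_phys, unsigned int pcid, bool noflush)
{
	uint64_t cr3 = (pgd_phys & ~TOY_CR3_PCID_MASK) |
		       ((uint64_t)pcid & TOY_CR3_PCID_MASK);

	if (noflush)
		cr3 |= TOY_CR3_NOFLUSH;

	return cr3;
}
```

Going lazy under this proposal would then be a single CR3 write with the init_mm page table, its dedicated PCID and the no-flush bit set.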


Also, sorry Rik, this means your old increased laziness optimization
is dead in the water. It will have exactly the same speculative load
problem.

2017-09-09 19:38:02

by Borislav Petkov

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 12:28:30PM -0700, Andy Lutomirski wrote:
> I propose the following fix. If PCID is on, then, in
> enter_lazy_tlb(), we switch to init_mm with the no-flush flag set.
> (And we give init_mm its own dedicated ASID to keep it simple and fast
> -- no need to use the LRU ASID mapping to assign one dynamically.) We
> clear the bit in mm_cpumask. That is, we more or less just skip the
> whole lazy TLB optimization and rely on PCID CPUs having reasonably
> fast CR3 writes. No extra IPIs. I suppose I need to benchmark this.
> It will certainly slow down workloads that rapidly toggle between a
> user thread and a kernel thread because it forces serialization on
> each mm switch, but maybe that's not so bad.

Sounds ok so far.

> If PCID is off, then we leave the old CR3 value when we go lazy, and
> we also leave the flag in mm_cpumask set. When a flush is requested,
> we send out the IPI and switch to init_mm (and flush because we have
> no choice). IOW, the no-PCID behavior goes back to what it used to
> be.

Ok, question: why can't we load the new CR3 value too, immediately? Or
are we saying, we might get to return to the same CR3 we had before we
were lazy so we won't need to do an unnecessary CR3 write with the same
value. A microoptimization, if you will.

Yes?

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-10 04:42:35

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 9, 2017 at 12:37 PM, Borislav Petkov <[email protected]> wrote:
> On Sat, Sep 09, 2017 at 12:28:30PM -0700, Andy Lutomirski wrote:
>> I propose the following fix. If PCID is on, then, in
>> enter_lazy_tlb(), we switch to init_mm with the no-flush flag set.
>> (And we give init_mm its own dedicated ASID to keep it simple and fast
>> -- no need to use the LRU ASID mapping to assign one dynamically.) We
>> clear the bit in mm_cpumask. That is, we more or less just skip the
>> whole lazy TLB optimization and rely on PCID CPUs having reasonably
>> fast CR3 writes. No extra IPIs. I suppose I need to benchmark this.
>> It will certainly slow down workloads that rapidly toggle between a
>> user thread and a kernel thread because it forces serialization on
>> each mm switch, but maybe that's not so bad.
>
> Sounds ok so far.
>
>> If PCID is off, then we leave the old CR3 value when we go lazy, and
>> we also leave the flag in mm_cpumask set. When a flush is requested,
>> we send out the IPI and switch to init_mm (and flush because we have
>> no choice). IOW, the no-PCID behavior goes back to what it used to
>> be.
>
> Ok, question: why can't we load the new CR3 value too, immediately? Or
> are we saying, we might get to return to the same CR3 we had before we
> were lazy so we won't need to do an unnecessary CR3 write with the same
> value. A microoptimization, if you will.

It is indeed a microoptimization, but it's a microoptimization that
we've had in the kernel for a long, long time.

But it may be an ill-advised microoptimization, or at least a poorly
implemented one historically. The microoptimization mostly affects
workloads that have a process on an otherwise idle CPU that frequently
sleeps for very short times. With the optimization, we avoid two TLB
flushes and two serializing instructions every time we sleep.
Historically, we got a bunch of useless IPIs, too, depending on the
workload.

The problem is that the implementation, which lives in
kernel/sched/core.c for the most part, involves some extra reference
counting, and there are NUMA workloads with many cores all running the
same mm that pay a *huge* cost in refcounting, since all the CPUs are
hammering the same refcount. And this refcount is (I think) basically
pointless on x86 and maybe on most architectures.

PeterZ and Ingo, would you be okay with adding a define so arches can
opt out of the task_struct::active_mm field entirely? That is, with
the option set, task_struct wouldn't have an active_mm field, the core
wouldn't call mmgrab and mmdrop, and the arch would be responsible for
that bookkeeping instead? x86, and presumably all arches without
cross-core invalidation, would probably prefer to just shoot down the
old mm entirely in __mmput() rather than trying to figure out when to
finish freeing old mms. After all, exit_mmap() is going to send an
IPI regardless, so I see no reason to have the scheduler core pin an
old dead mm just because some random kernel thread's active_mm field
points to it.

IOW, if I'm going to reintroduce something like what the old lazy mode
did on x86, I'd rather do it right.

--Andy

2017-09-10 20:23:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, Sep 09, 2017 at 09:42:12PM -0700, Andy Lutomirski wrote:
> PeterZ and Ingo, would you be okay with adding a define so arches can
> opt out of the task_struct::active_mm field entirely? That is, with
> the option set, task_struct wouldn't have an active_mm field, the core
> wouldn't call mmgrab and mmdrop, and the arch would be responsible for
> that bookkeeping instead? x86, and presumably all arches without
> cross-core invalidation, would probably prefer to just shoot down the
> old mm entirely in __mmput() rather than trying to figure out when to
> finish freeing old mms. After all, exit_mmap() is going to send an
> IPI regardless, so I see no reason to have the scheduler core pin an
> old dead mm just because some random kernel thread's active_mm field
> points to it.

I'm only quickly skimming this thread, but I don't see anything too
worrisome being proposed.

If you're in LA next week we can talk about it in more detail if you
want.

2017-09-10 20:26:02

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf



> On Sep 10, 2017, at 1:22 PM, Peter Zijlstra <[email protected]> wrote:
>
>> On Sat, Sep 09, 2017 at 09:42:12PM -0700, Andy Lutomirski wrote:
>> PeterZ and Ingo, would you be okay with adding a define so arches can
>> opt out of the task_struct::active_mm field entirely? That is, with
>> the option set, task_struct wouldn't have an active_mm field, the core
>> wouldn't call mmgrab and mmdrop, and the arch would be responsible for
>> that bookkeeping instead? x86, and presumably all arches without
>> cross-core invalidation, would probably prefer to just shoot down the
>> old mm entirely in __mmput() rather than trying to figure out when to
>> finish freeing old mms. After all, exit_mmap() is going to send an
>> IPI regardless, so I see no reason to have the scheduler core pin an
>> old dead mm just because some random kernel thread's active_mm field
>> points to it.
>
> I'm only quickly skimming this thread, but I don't see anything too
> worrisome being proposed.
>
> If you're in LA next week we can talk about it in more detail if you
> want.

I'll be there.

2017-09-11 01:12:27

by Rik van Riel

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sat, 2017-09-09 at 12:28 -0700, Andy Lutomirski wrote:
> -
> I propose the following fix.  If PCID is on, then, in
> enter_lazy_tlb(), we switch to init_mm with the no-flush flag set.
> (And we give init_mm its own dedicated ASID to keep it simple and
> fast
> -- no need to use the LRU ASID mapping to assign one
> dynamically.)  We
> clear the bit in mm_cpumask.  That is, we more or less just skip the
> whole lazy TLB optimization and rely on PCID CPUs having reasonably
> fast CR3 writes.  No extra IPIs.

Avoiding the IPIs is probably what matters the most, especially
on systems with deep C states, and virtual machines where the
host may be running something else, causing the IPI service time
to go through the roof for idle VCPUs.

> Also, sorry Rik, this means your old increased laziness optimization
> is dead in the water.  It will have exactly the same speculative load
> problem.

Doesn't a memory barrier solve that speculative load
problem?

The memory barrier could be added only to the path
that potentially skips reloading the TLB, under the
assumption that a memory barrier is cheaper than a
TLB reload (even with ASID).

2017-09-11 01:47:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sun, Sep 10, 2017 at 6:12 PM, Rik van Riel <[email protected]> wrote:
> On Sat, 2017-09-09 at 12:28 -0700, Andy Lutomirski wrote:
>> -
>> I propose the following fix. If PCID is on, then, in
>> enter_lazy_tlb(), we switch to init_mm with the no-flush flag set.
>> (And we give init_mm its own dedicated ASID to keep it simple and
>> fast
>> -- no need to use the LRU ASID mapping to assign one
>> dynamically.) We
>> clear the bit in mm_cpumask. That is, we more or less just skip the
>> whole lazy TLB optimization and rely on PCID CPUs having reasonably
>> fast CR3 writes. No extra IPIs.
>
> Avoiding the IPIs is probably what matters the most, especially
> on systems with deep C states, and virtual machines where the
> host may be running something else, causing the IPI service time
> to go through the roof for idle VCPUs.
>
>> Also, sorry Rik, this means your old increased laziness optimization
>> is dead in the water. It will have exactly the same speculative load
>> problem.
>
> Doesn't a memory barrier solve that speculative load
> problem?
>
> The memory barrier could be added only to the path
> that potentially skips reloading the TLB, under the
> assumption that a memory barrier is cheaper than a
> TLB reload (even with ASID).

No, nothing stops the problematic speculative load. Here's the issue.
One CPU removes a reference to a page table from a higher-level page
table, flushes, and then frees the page table. Then it re-allocates
it and writes something unrelated there. Another CPU that has CR3
pointing to the page hierarchy in question could have a reference to
the freed table in its paging structure cache. Even if it's
guaranteed to not try to access the addresses in question (because
they're user addresses and the other CPU is in kernel mode, etc),
there is never a guarantee that the CPU doesn't randomly try to fill
its TLB for the affected addresses. This results in invalid PTEs in
the TLB, possible accesses using bogus memory types, and maybe even
reads from IO space.

It looks like we actually need to propagate flushes everywhere that
could have references to the flushed range, even if the software won't
access that range.
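
A toy model of the sequence, in case it helps. It is deliberately simplistic and every name in it is made up: memory is an array of pages, and the other CPU's paging-structure cache is a single remembered page index.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NPAGES 4

/* Toy model; nothing here corresponds to real kernel structures. */
struct toy_mem    { uint64_t page[TOY_NPAGES]; };	/* current content of each page */
struct toy_walker { int cached_page; bool cached; };	/* one paging-structure cache entry */

/* What a flush that actually reaches this CPU would do. */
void toy_walker_flush(struct toy_walker *w)
{
	w->cached = false;
}

/*
 * A speculative fill: if the CPU still has a cached pointer to a page
 * table, it reads whatever that page contains now, whether or not the
 * page is still a page table. Returns false if nothing was cached.
 */
bool toy_speculative_fill(const struct toy_walker *w,
			  const struct toy_mem *mem, uint64_t *entry)
{
	if (!w->cached)
		return false;

	*entry = mem->page[w->cached_page];
	return true;
}
```

If page 2 is freed and reused for unrelated data while a lazy CPU still has cached_page == 2, toy_speculative_fill() hands back the unrelated data as if it were a PTE. Only toy_walker_flush() on that CPU closes the window; no software on that CPU has to touch the address for the fill to happen.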

2017-09-11 15:08:21

by Rik van Riel

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On Sun, 2017-09-10 at 18:46 -0700, Andy Lutomirski wrote:
>
> No, nothing stops the problematic speculative load.  Here's the
> issue.
> One CPU removes a reference to a page table from a higher-level page
> table, flushes, and then frees the page table.  Then it re-allocates
> it and writes something unrelated there.  Another CPU that has CR3
> pointing to the page hierarchy in question could have a reference to
> the freed table in its paging structure cache.  Even if it's
> guaranteed to not try to access the addresses in question (because
> they're user addresses and the other CPU is in kernel mode, etc),
> there is never a guarantee that the CPU doesn't randomly try to fill
> its TLB for the affected addresses.  This results in invalid PTEs in
> the TLB, possible accesses using bogus memory types, and maybe even
> reads from IO space.

Good point, I had forgotten all about memory accesses
that do not originate with software behavior.

2017-09-12 07:14:44

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

On 2017.09.09 at 11:26 -0700, Linus Torvalds wrote:
> On Sat, Sep 9, 2017 at 11:14 AM, Markus Trippelsdorf
> <[email protected]> wrote:
> >
> > I think the issue gets fixed by:
> >
> > # wrmsr -a 0xc0010015 0x1000018
> >
> > Setting bit 3 of the Hardware Configuration Register to 1.
> >
> > Quote from the docs:
> > »TlbCacheDis: cacheable memory disable. Read-write. 0=Enables performance optimization that
> > assumes PML4, PDP, PDE, and PTE entries are in cacheable WB-DRAM
>
> Uhhuh.
>
> The page directories should *definitely* always be in cacheable
> memory, so it should be ok for that bit to be 0, and it's possible
> that setting it to 1 will seriously screw up performance.

Well, I don't see any dramatic performance decrease on my box.
For instance compile times are roughly the same (a bit quicker in fact).
And in day to day usage I notice absolutely no difference.

--
Markus

2017-09-17 17:04:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf


* Andy Lutomirski <[email protected]> wrote:

> PeterZ and Ingo, would you be okay with adding a define so arches can
> opt out of the task_struct::active_mm field entirely? That is, with
> the option set, task_struct wouldn't have an active_mm field, the core
> wouldn't call mmgrab and mmdrop, and the arch would be responsible for
> that bookkeeping instead? x86, and presumably all arches without
> cross-core invalidation, would probably prefer to just shoot down the
> old mm entirely in __mmput() rather than trying to figure out when to
> finish freeing old mms. After all, exit_mmap() is going to send an
> IPI regardless, so I see no reason to have the scheduler core pin an
> old dead mm just because some random kernel thread's active_mm field
> points to it.
>
> IOW, if I'm going to reintroduce something like what the old lazy mode
> did on x86, I'd rather do it right.

How realistic would it be to get rid of ::active_mm on all architectures
at once?

Thanks,

Ingo