2018-12-11 22:25:38

by Borislav Petkov

Subject: [RFC PATCH 0/4] x86/alternative: Add ALTERNATIVE_3

From: Borislav Petkov <[email protected]>

Hi,

here's a patchset which adds an ALTERNATIVE_3() macro so that we can
finally do RDTSCP in rdtsc_ordered() while also not breaking all the
possible qemu CPU models out there, like some old pentium which doesn't
even have MFENCE. Which is the reason for the three-insn alternative
macro, btw.

The stuff works here on all machines I've tested it on but since we're
too close to the merge window, I'll hold on to it for another round and
let it soak slowly into 2019. Thus the RFC tag. In the meantime, it'll
collect your comments and ideas.

:-)

Thanks!

Borislav Petkov (4):
x86/alternatives: Add macro comments
x86/alternatives: Print containing function
x86/alternatives: Add an ALTERNATIVE_3() macro
x86/TSC: Use RDTSCP

arch/x86/include/asm/alternative.h | 39 ++++++++++++++++++++++++------
arch/x86/include/asm/msr.h | 16 ++++++++++--
arch/x86/kernel/alternative.c | 4 +--
3 files changed, 48 insertions(+), 11 deletions(-)

--
2.19.1



2018-12-11 22:24:52

by Borislav Petkov

Subject: [RFC PATCH 1/4] x86/alternatives: Add macro comments

From: Borislav Petkov <[email protected]>

... so that when one stares at the .s output, one can find her way
around the resulting asm magic.

For example, ALTERNATIVE looks like this now:

# ALT: oldnstr
661:
...
662:
# ALT: padding
.skip ...
663:
.pushsection .altinstructions,"a"

...

.popsection
.pushsection .altinstr_replacement, "ax"
# ALT: replacement 1
6641:
...
6651:
.popsection

Merge __OLDINSTR() into OLDINSTR(), while at it.

No functional changes.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/include/asm/alternative.h | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index d7faa16622d8..8d38b296a18a 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -87,13 +87,12 @@ static inline int alternatives_text_reserved(void *start, void *end)
#define alt_total_slen alt_end_marker"b-661b"
#define alt_rlen(num) e_replacement(num)"f-"b_replacement(num)"f"

-#define __OLDINSTR(oldinstr, num) \
+#define OLDINSTR(oldinstr, num) \
+ "# ALT: oldnstr\n" \
"661:\n\t" oldinstr "\n662:\n" \
+ "# ALT: padding\n" \
".skip -(((" alt_rlen(num) ")-(" alt_slen ")) > 0) * " \
- "((" alt_rlen(num) ")-(" alt_slen ")),0x90\n"
-
-#define OLDINSTR(oldinstr, num) \
- __OLDINSTR(oldinstr, num) \
+ "((" alt_rlen(num) ")-(" alt_slen ")),0x90\n" \
alt_end_marker ":\n"

/*
@@ -109,7 +108,9 @@ static inline int alternatives_text_reserved(void *start, void *end)
* additionally longer than the first replacement alternative.
*/
#define OLDINSTR_2(oldinstr, num1, num2) \
+ "# ALT: oldinstr2\n" \
"661:\n\t" oldinstr "\n662:\n" \
+ "# ALT: padding2\n" \
".skip -((" alt_max_short(alt_rlen(num1), alt_rlen(num2)) " - (" alt_slen ")) > 0) * " \
"(" alt_max_short(alt_rlen(num1), alt_rlen(num2)) " - (" alt_slen ")), 0x90\n" \
alt_end_marker ":\n"
@@ -122,8 +123,9 @@ static inline int alternatives_text_reserved(void *start, void *end)
" .byte " alt_rlen(num) "\n" /* replacement len */ \
" .byte " alt_pad_len "\n" /* pad len */

-#define ALTINSTR_REPLACEMENT(newinstr, feature, num) /* replacement */ \
- b_replacement(num)":\n\t" newinstr "\n" e_replacement(num) ":\n\t"
+#define ALTINSTR_REPLACEMENT(newinstr, feature, num) /* replacement */ \
+ "# ALT: replacement " #num "\n" \
+ b_replacement(num)":\n\t" newinstr "\n" e_replacement(num) ":\n"

/* alternative assembly primitive: */
#define ALTERNATIVE(oldinstr, newinstr, feature) \
--
2.19.1


2018-12-11 22:25:31

by Borislav Petkov

Subject: [RFC PATCH 2/4] x86/alternatives: Print containing function

From: Borislav Petkov <[email protected]>

... in the "debug-alternative" output so that one can find her way
easier when staring at the vmlinux disassembly.

For example:

apply_alternatives: feat: 3*32+18, old: (read_tsc+0x0/0x10 (ffffffff8101d1c0) len: 5), repl: (ffffffff824e6d33, len: 5)
                                         ^^^^^^^^^^^^^^^^^
ffffffff8101d1c0: old_insn: 0f 31 90 90 90
ffffffff824e6d33: rpl_insn: 0f ae e8 0f 31
ffffffff8101d1c0: final_insn: 0f ae e8 0f 31

No functional changes.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/alternative.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ebeac487a20c..d458c7973c56 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -393,10 +393,10 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
continue;
}

- DPRINTK("feat: %d*32+%d, old: (%px len: %d), repl: (%px, len: %d), pad: %d",
+ DPRINTK("feat: %d*32+%d, old: (%pS (%px) len: %d), repl: (%px, len: %d), pad: %d",
a->cpuid >> 5,
a->cpuid & 0x1f,
- instr, a->instrlen,
+ instr, instr, a->instrlen,
replacement, a->replacementlen, a->padlen);

DUMP_BYTES(instr, a->instrlen, "%px: old_insn: ", instr);
--
2.19.1


2018-12-11 22:25:56

by Borislav Petkov

Subject: [RFC PATCH 3/4] x86/alternatives: Add an ALTERNATIVE_3() macro

From: Borislav Petkov <[email protected]>

Similar to ALTERNATIVE_2(), ALTERNATIVE_3() selects between 3 possible
variants. Will be used for adding RDTSCP to the rdtsc_ordered()
alternatives.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/include/asm/alternative.h | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 8d38b296a18a..f24d111f366b 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -115,6 +115,16 @@ static inline int alternatives_text_reserved(void *start, void *end)
"(" alt_max_short(alt_rlen(num1), alt_rlen(num2)) " - (" alt_slen ")), 0x90\n" \
alt_end_marker ":\n"

+#define OLDINSTR_3(oldinsn, n1, n2, n3) \
+ "# ALT: oldinstr3\n" \
+ "661:\n\t" oldinsn "\n662:\n" \
+ "# ALT: padding3\n" \
+ ".skip -((" alt_max_short(alt_max_short(alt_rlen(n1), alt_rlen(n2)), alt_rlen(n3)) \
+ " - (" alt_slen ")) > 0) * " \
+ "(" alt_max_short(alt_max_short(alt_rlen(n1), alt_rlen(n2)), alt_rlen(n3)) \
+ " - (" alt_slen ")), 0x90\n" \
+ alt_end_marker ":\n"
+
#define ALTINSTR_ENTRY(feature, num) \
" .long 661b - .\n" /* label */ \
" .long " b_replacement(num)"f - .\n" /* new instruction */ \
@@ -148,6 +158,19 @@ static inline int alternatives_text_reserved(void *start, void *end)
ALTINSTR_REPLACEMENT(newinstr2, feature2, 2) \
".popsection\n"

+#define ALTERNATIVE_3(oldinsn, newinsn1, feat1, newinsn2, feat2, newinsn3, feat3) \
+ OLDINSTR_3(oldinsn, 1, 2, 3) \
+ ".pushsection .altinstructions,\"a\"\n" \
+ ALTINSTR_ENTRY(feat1, 1) \
+ ALTINSTR_ENTRY(feat2, 2) \
+ ALTINSTR_ENTRY(feat3, 3) \
+ ".popsection\n" \
+ ".pushsection .altinstr_replacement, \"ax\"\n" \
+ ALTINSTR_REPLACEMENT(newinsn1, feat1, 1) \
+ ALTINSTR_REPLACEMENT(newinsn2, feat2, 2) \
+ ALTINSTR_REPLACEMENT(newinsn3, feat3, 3) \
+ ".popsection\n"
+
/*
* Alternative instructions for different CPU types or capabilities.
*
--
2.19.1
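
For reference, a C caller of the new macro looks like the rdtsc_ordered()
conversion in the next patch. A minimal sketch, with the instruction strings
and feature flags taken from patch 4/4; the wrapper function itself is purely
illustrative, needs <asm/alternative.h> and <asm/cpufeatures.h>, and assumes
x86-64 (where RDTSC/RDTSCP zero the upper halves of RAX/RDX):

static __always_inline unsigned long long tsc_read_sketch(void)
{
	unsigned long low, high;

	asm volatile(ALTERNATIVE_3("rdtsc",
				   "mfence; rdtsc", X86_FEATURE_MFENCE_RDTSC,
				   "lfence; rdtsc", X86_FEATURE_LFENCE_RDTSC,
				   "rdtscp", X86_FEATURE_RDTSCP)
		     : "=a" (low), "=d" (high)
		     : /* no inputs */
		     : "ecx");	/* RDTSCP clobbers ECX with MSR_TSC_AUX */

	return ((unsigned long long)high << 32) | low;
}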


2018-12-11 22:26:31

by Borislav Petkov

Subject: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

From: Borislav Petkov <[email protected]>

Currently, the kernel uses

[LM]FENCE; RDTSC

in the timekeeping code, to guarantee monotonicity of time where the
*FENCE is selected based on vendor.

Replace that sequence with RDTSCP which is faster or on-par and gives
the same guarantees.

A microbenchmark on Intel shows that the change is on-par.

On AMD, the change is on-par with the current LFENCE-prefixed RDTSC on
some machines and slightly better with RDTSCP on others.

The comparison is done with the LFENCE-prefixed RDTSC (and not with the
MFENCE-prefixed one, as one would normally expect) because all modern
AMD families make LFENCE serializing and thus avoid the heavy MFENCE by
effectively enabling X86_FEATURE_LFENCE_RDTSC.

Co-developed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: John Stultz <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/msr.h | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 91e4cf189914..5cc3930cb465 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -217,6 +217,8 @@ static __always_inline unsigned long long rdtsc(void)
*/
static __always_inline unsigned long long rdtsc_ordered(void)
{
+ DECLARE_ARGS(val, low, high);
+
/*
* The RDTSC instruction is not ordered relative to memory
* access. The Intel SDM and the AMD APM are both vague on this
@@ -227,9 +229,19 @@ static __always_inline unsigned long long rdtsc_ordered(void)
* ordering guarantees as reading from a global memory location
* that some other imaginary CPU is updating continuously with a
* time stamp.
+ *
+ * Thus, use the preferred barrier on the respective CPU, aiming for
+ * RDTSCP as the default.
*/
- barrier_nospec();
- return rdtsc();
+ asm volatile(ALTERNATIVE_3("rdtsc",
+ "mfence; rdtsc", X86_FEATURE_MFENCE_RDTSC,
+ "lfence; rdtsc", X86_FEATURE_LFENCE_RDTSC,
+ "rdtscp", X86_FEATURE_RDTSCP)
+ : EAX_EDX_RET(val, low, high)
+ /* RDTSCP clobbers ECX with MSR_TSC_AUX. */
+ :: "ecx");
+
+ return EAX_EDX_VAL(val, low, high);
}

static inline unsigned long long native_read_pmc(int counter)
--
2.19.1
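
The on-par numbers above come from microbenchmarks that are not part of the
patch. A minimal standalone sketch along the following lines is enough to
compare the two read sequences on a given machine (x86-64 only, loop count
arbitrary; build with gcc -O2 -o tsc_bench tsc_bench.c):

/* tsc_bench.c - rough cost comparison of LFENCE;RDTSC vs RDTSCP */
#include <stdint.h>
#include <stdio.h>

#define LOOPS (10 * 1000 * 1000)

static inline uint64_t lfence_rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t rdtscp_read(void)
{
	uint32_t lo, hi;

	/* RDTSCP clobbers ECX with MSR_TSC_AUX */
	asm volatile("rdtscp" : "=a" (lo), "=d" (hi) :: "ecx");
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	volatile uint64_t sink = 0;
	uint64_t t0, t1;

	t0 = lfence_rdtsc();
	for (int i = 0; i < LOOPS; i++)
		sink += lfence_rdtsc();
	t1 = lfence_rdtsc();
	printf("lfence;rdtsc: %.1f cycles/read\n", (double)(t1 - t0) / LOOPS);

	t0 = rdtscp_read();
	for (int i = 0; i < LOOPS; i++)
		sink += rdtscp_read();
	t1 = rdtscp_read();
	printf("rdtscp:       %.1f cycles/read\n", (double)(t1 - t0) / LOOPS);

	(void)sink;
	return 0;
}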


2018-12-11 23:01:10

by Andy Lutomirski

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Tue, Dec 11, 2018 at 2:23 PM Borislav Petkov <[email protected]> wrote:
>
> From: Borislav Petkov <[email protected]>
>
> Currently, the kernel uses
>
> [LM]FENCE; RDTSC
>
> in the timekeeping code, to guarantee monotonicity of time where the
> *FENCE is selected based on vendor.
>
> Replace that sequence with RDTSCP which is faster or on-par and gives
> the same guarantees.
>
> A microbenchmark on Intel shows that the change is on-par.
>
> On AMD, the change is either on-par with the current LFENCE-prefixed
> RDTSC and some are slightly better with RDTSCP.
>
> The comparison is done with the LFENCE-prefixed RDTSC (and not with the
> MFENCE-prefixed one, as one would normally expect) because all modern
> AMD families make LFENCE serializing and thus avoid the heavy MFENCE by
> effectively enabling X86_FEATURE_LFENCE_RDTSC.
>
> Co-developed-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Borislav Petkov <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: John Stultz <[email protected]>
> Cc: [email protected]
> Link: https://lkml.kernel.org/r/[email protected]
> ---
> arch/x86/include/asm/msr.h | 16 ++++++++++++++--
> 1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index 91e4cf189914..5cc3930cb465 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -217,6 +217,8 @@ static __always_inline unsigned long long rdtsc(void)
> */
> static __always_inline unsigned long long rdtsc_ordered(void)
> {
> + DECLARE_ARGS(val, low, high);
> +
> /*
> * The RDTSC instruction is not ordered relative to memory
> * access. The Intel SDM and the AMD APM are both vague on this
> @@ -227,9 +229,19 @@ static __always_inline unsigned long long rdtsc_ordered(void)
> * ordering guarantees as reading from a global memory location
> * that some other imaginary CPU is updating continuously with a
> * time stamp.
> + *
> + * Thus, use the preferred barrier on the respective CPU, aiming for
> + * RDTSCP as the default.
> */
> - barrier_nospec();
> - return rdtsc();
> + asm volatile(ALTERNATIVE_3("rdtsc",
> + "mfence; rdtsc", X86_FEATURE_MFENCE_RDTSC,
> + "lfence; rdtsc", X86_FEATURE_LFENCE_RDTSC,
> + "rdtscp", X86_FEATURE_RDTSCP)
> + : EAX_EDX_RET(val, low, high)
> + /* RDTSCP clobbers ECX with MSR_TSC_AUX. */
> + :: "ecx");
> +
> + return EAX_EDX_VAL(val, low, high);
> }

This whole series seems reasonable, except that it caused me to look
at barrier_nospec(). And I had a bit of a WTF moment, as in "WTF does
RDTSC have to do with a speculation protection barrier". Does it
actually make sense?

2018-12-11 23:14:07

by Tom Lendacky

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On 12/11/2018 04:59 PM, Andy Lutomirski wrote:
> On Tue, Dec 11, 2018 at 2:23 PM Borislav Petkov <[email protected]> wrote:
>>
>> From: Borislav Petkov <[email protected]>
>>
>> Currently, the kernel uses
>>
>> [LM]FENCE; RDTSC
>>
>> in the timekeeping code, to guarantee monotonicity of time where the
>> *FENCE is selected based on vendor.
>>
>> Replace that sequence with RDTSCP which is faster or on-par and gives
>> the same guarantees.
>>
>> A microbenchmark on Intel shows that the change is on-par.
>>
>> On AMD, the change is either on-par with the current LFENCE-prefixed
>> RDTSC and some are slightly better with RDTSCP.
>>
>> The comparison is done with the LFENCE-prefixed RDTSC (and not with the
>> MFENCE-prefixed one, as one would normally expect) because all modern
>> AMD families make LFENCE serializing and thus avoid the heavy MFENCE by
>> effectively enabling X86_FEATURE_LFENCE_RDTSC.
>>
>> Co-developed-by: Thomas Gleixner <[email protected]>
>> Signed-off-by: Thomas Gleixner <[email protected]>
>> Signed-off-by: Borislav Petkov <[email protected]>
>> Cc: Tom Lendacky <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: John Stultz <[email protected]>
>> Cc: [email protected]
>> Link: https://lkml.kernel.org/r/[email protected]
>> ---
>> arch/x86/include/asm/msr.h | 16 ++++++++++++++--
>> 1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
>> index 91e4cf189914..5cc3930cb465 100644
>> --- a/arch/x86/include/asm/msr.h
>> +++ b/arch/x86/include/asm/msr.h
>> @@ -217,6 +217,8 @@ static __always_inline unsigned long long rdtsc(void)
>> */
>> static __always_inline unsigned long long rdtsc_ordered(void)
>> {
>> + DECLARE_ARGS(val, low, high);
>> +
>> /*
>> * The RDTSC instruction is not ordered relative to memory
>> * access. The Intel SDM and the AMD APM are both vague on this
>> @@ -227,9 +229,19 @@ static __always_inline unsigned long long rdtsc_ordered(void)
>> * ordering guarantees as reading from a global memory location
>> * that some other imaginary CPU is updating continuously with a
>> * time stamp.
>> + *
>> + * Thus, use the preferred barrier on the respective CPU, aiming for
>> + * RDTSCP as the default.
>> */
>> - barrier_nospec();
>> - return rdtsc();
>> + asm volatile(ALTERNATIVE_3("rdtsc",
>> + "mfence; rdtsc", X86_FEATURE_MFENCE_RDTSC,
>> + "lfence; rdtsc", X86_FEATURE_LFENCE_RDTSC,
>> + "rdtscp", X86_FEATURE_RDTSCP)
>> + : EAX_EDX_RET(val, low, high)
>> + /* RDTSCP clobbers ECX with MSR_TSC_AUX. */
>> + :: "ecx");
>> +
>> + return EAX_EDX_VAL(val, low, high);
>> }
>
> This whole series seems reasonable, except that it caused me to look
> at barrier_nospec(). And I had a bit of a WTF moment, as in "WTF does
> RDTSC have to do with a speculation protection barrier". Does it
> actually make sense?

It does seem overloaded in that sense, but the feature means that LFENCE
is serializing and so can be used in rdtsc_ordered. In the same sense,
barrier_nospec is looking for whether LFENCE is serializing and preferring
that over MFENCE since it is lighter weight.

In light of how they're being used now, they could probably stand to be
renamed in some way.

Thanks,
Tom

>

2018-12-11 23:39:01

by Alexey Dobriyan

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

> And I had a bit of a WTF moment, as in "WTF does
> RDTSC have to do with a speculation protection barrier".
> Does it actually make sense?

It doesn't. There was too much s/lfence/barrier_nospec/ apparently.

> + asm volatile(ALTERNATIVE_3("rdtsc",
> + "mfence; rdtsc", X86_FEATURE_MFENCE_RDTSC,
> + "lfence; rdtsc", X86_FEATURE_LFENCE_RDTSC,
> + "rdtscp", X86_FEATURE_RDTSCP)
> + : EAX_EDX_RET(val, low, high)
> + /* RDTSCP clobbers ECX with MSR_TSC_AUX. */
> + :: "ecx");

I have a question: does alternatives ordering matter? CPU can have
both features.

And other code in this file uses "c" for clobber.

2018-12-11 23:41:54

by Borislav Petkov

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Tue, Dec 11, 2018 at 11:12:41PM +0000, Lendacky, Thomas wrote:
> It does seem overloaded in that sense, but the feature means that LFENCE
> is serializing and so can be used in rdtsc_ordered. In the same sense,
> barrier_nospec is looking for whether LFENCE is serializing and preferring
> that over MFENCE since it is lighter weight.
>
> In light of how they're being used now, they could probably stand to be
> renamed in some way.

Actually, come to think of it, what really matters here is whether
LFENCE is serializing or not. Because if so, you wanna replace with LFENCE
as it is lighter. And in that case a single alternative() - not _2() -
should suffice.

BUT(!), that still is not good enough if you do some qemu CPU models
like pentium or so which don't even have MFENCE and cause stuff like
this:

https://lkml.kernel.org/r/[email protected]

Which means, that you *do* have to alternate between

* no insn at all
* MFENCE
* LFENCE, if it is serializing

so barrier_nospec() does the right thing, AFAICS. And this is why we
need an ALTERNATIVE_3() to add RDTSCP into the mix too.

WRT renaming, I guess we can do something like:

* X86_FEATURE_MFENCE_RDTSC -> X86_FEATURE_MFENCE - to mean that CPU has
MFENCE support.

and

* X86_FEATURE_LFENCE_RDTSC -> X86_FEATURE_LFENCE_SERIALIZING

Or something to that effect.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-12-11 23:44:27

by Borislav Petkov

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Wed, Dec 12, 2018 at 02:37:53AM +0300, Alexey Dobriyan wrote:
> I have a question: does alternatives ordering matter?

Yes.

> CPU can have both features.

The last present feature wins.

> And other code in this file uses "c" for clobber.

Ok.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
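
The ordering matters because apply_alternatives() walks the .altinstructions
entries in the order they were emitted and re-patches the site for every entry
whose feature bit is set, so the last applicable entry is what ends up in the
image. A tiny userspace sketch of just that selection logic (the struct and
names here are illustrative, not the kernel's alt_instr layout):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct alt_entry {
	const char *feature;
	bool present;
	const char *replacement;
};

int main(void)
{
	const char *site = "rdtsc";	/* oldinstr built into the image */
	struct alt_entry entries[] = {	/* emission order from ALTERNATIVE_3() */
		{ "MFENCE_RDTSC", true, "mfence; rdtsc" },
		{ "LFENCE_RDTSC", true, "lfence; rdtsc" },
		{ "RDTSCP",       true, "rdtscp"        },
	};

	for (size_t i = 0; i < sizeof(entries) / sizeof(entries[0]); i++)
		if (entries[i].present)
			site = entries[i].replacement;	/* later entries overwrite earlier ones */

	printf("patched-in sequence: %s\n", site);	/* prints "rdtscp" */
	return 0;
}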

2018-12-12 00:08:59

by Borislav Petkov

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Wed, Dec 12, 2018 at 12:43:08AM +0100, Borislav Petkov wrote:
> > And other code in this file uses "c" for clobber.
>
> Ok.

Except that clobbers need to be full register names:

./arch/x86/include/asm/msr.h: In function ‘do_hres’:
./arch/x86/include/asm/msr.h:236:2: error: unknown register name ‘c’ in ‘asm’
asm volatile(ALTERNATIVE_3("rdtsc",
^~~

...

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-12-12 02:26:19

by Andy Lutomirski

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

> On Dec 11, 2018, at 3:39 PM, Borislav Petkov <[email protected]> wrote:
>
>> On Tue, Dec 11, 2018 at 11:12:41PM +0000, Lendacky, Thomas wrote:
>> It does seem overloaded in that sense, but the feature means that LFENCE
>> is serializing and so can be used in rdtsc_ordered. In the same sense,
>> barrier_nospec is looking for whether LFENCE is serializing and preferring
>> that over MFENCE since it is lighter weight.
>>
>> In light of how they're being used now, they could probably stand to be
>> renamed in some way.
>
> Actually, come to think of it, what really matters here is whether
> LFENCE is serializing or not. Because if so, you wanna replace with LFENCE
> as it is lighter. And in that case a single alternative() - not _2() -
> should suffice.
>
> BUT(!), that still is not good enough if you do some qemu CPU models
> like pentium or so which don't even have MFENCE and cause stuff like
> this:
>
> https://lkml.kernel.org/r/[email protected]
>
> Which means, that you *do* have to alternate between
>
> * no insn at all
> * MFENCE
> * LFENCE, if it is serializing
>
> so barrier_nospec() does the right thing, AFAICS. And this is why we
> need an ALTERNATIVE_3() to add RDTSCP into the mix too.
>
> WRT renaming, I guess we can do something like:
>
> * X86_FEATURE_MFENCE_RDTSC -> X86_FEATURE_MFENCE - to mean that CPU has
> MFENCE support.
>
> and
>
> * X86_FEATURE_LFENCE_RDTSC -> X86_FEATURE_LFENCE_SERIALIZING
>
> Or something to that effect.

This makes me nervous, since no one knows what “serializing” means.
IIRC AMD specifically documents that MFENCE is required before RDTSC
to get sensible ordering. So it’s entirely plausible to me that
LFENCE is okay for Spectre mitigation but MFENCE is needed for RDTSC
on some CPU.

2018-12-12 10:01:02

by Peter Zijlstra

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Tue, Dec 11, 2018 at 06:24:44PM -0800, Andy Lutomirski wrote:
> > On Dec 11, 2018, at 3:39 PM, Borislav Petkov <[email protected]> wrote:
> >
> >> On Tue, Dec 11, 2018 at 11:12:41PM +0000, Lendacky, Thomas wrote:
> >> It does seem overloaded in that sense, but the feature means that LFENCE
> >> is serializing and so can be used in rdtsc_ordered. In the same sense,
> >> barrier_nospec is looking for whether LFENCE is serializing and preferring
> >> that over MFENCE since it is lighter weight.
> >>
> >> In light of how they're being used now, they could probably stand to be
> >> renamed in some way.
> >
> > Actually, come to think of it, what really matters here is whether
> > LFENCE is serializing or not. Because if so, you wanna replace with LFENCE
> > as it is lighter. And in that case a single alternative() - not _2() -
> > should suffice.
> >
> > BUT(!), that still is not good enough if you do some qemu CPU models
> > like pentium or so which don't even have MFENCE and cause stuff like
> > this:
> >
> > https://lkml.kernel.org/r/[email protected]
> >
> > Which means, that you *do* have to alternate between
> >
> > * no insn at all
> > * MFENCE
> > * LFENCE, if it is serializing
> >
> > so barrier_nospec() does the right thing, AFAICS. And this is why we
> > need an ALTERNATIVE_3() to add RDTSCP into the mix too.
> >
> > WRT renaming, I guess we can do something like:
> >
> > * X86_FEATURE_MFENCE_RDTSC -> X86_FEATURE_MFENCE - to mean that CPU has
> > MFENCE support.
> >
> > and
> >
> > * X86_FEATURE_LFENCE_RDTSC -> X86_FEATURE_LFENCE_SERIALIZING
> >
> > Or something to that effect.
>
> This makes me nervous, since no one knows what “serializing” means.
> IIRC AMD specifically documents that MFENCE is required before RDTSC
> to get sensible ordering. So it’s entirely plausible to me that
> LFENCE is okay for Spectre mitigation but MFENCE is needed for RDTSC
> on some CPU.

What we want is IFENCE, an instruction that flushes the complete
pipeline. Or alternatively put: holds completion until all prior issued
instructions complete.

MFENCE always did that (and a ton more), LFENCE seems to have always
done that on Intel, but AMD at some point actually implemented LFENCE as
it was specified (only hold completion until all preceding loads are
complete) and they (now) have this MSR bit to 'fix' that.

At some point in the past (when all this spectre LFENCE muck was
relatively fresh) I suggested we call the thing: instruction_fence() or
something like that, maybe we ought to still do that now.

Re RDTSC, waiting for all preceding instructions to complete is
'obviously' sufficient for two RDTSC instructions not to get re-ordered
either.

2018-12-12 10:10:35

by Borislav Petkov

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Tue, Dec 11, 2018 at 06:24:44PM -0800, Andy Lutomirski wrote:
> This makes me nervous, since no one knows what “serializing” means.

Why no one? If you wanna say that X86_FEATURE_LFENCE_SERIALIZING is not
really telling, so is X86_FEATURE_LFENCE_RDTSC, TBH. :)

> IIRC AMD specifically documents that MFENCE is required before RDTSC
> to get sensible ordering. So it’s entirely plausible to me that
> LFENCE is okay for Spectre mitigation but MFENCE is needed for RDTSC
> on some CPU.

Look at init_amd(), the if (cpu_has(c, X86_FEATURE_XMM2)) branch where
we make LFENCE serializing. The logic with the new names would be:


	asm volatile(ALTERNATIVE_3("rdtsc",
				   "mfence; rdtsc", X86_FEATURE_MFENCE,
				   "lfence; rdtsc", X86_FEATURE_LFENCE_SERIALIZING,
				   "rdtscp", X86_FEATURE_RDTSCP)

RDTSC is put there during build. At boot time:

	if CPU has MFENCE
		use MFENCE to stop RDTSC speculation
	if LFENCE is serializing
		use LFENCE...
	if CPU has RDTSCP
		even better, use that as it is the fastest or on par.

Of course the order of those is important.

Ok?

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-12-12 12:06:08

by Andrea Parri

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Wed, Dec 12, 2018 at 10:59:12AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 11, 2018 at 06:24:44PM -0800, Andy Lutomirski wrote:
> > > On Dec 11, 2018, at 3:39 PM, Borislav Petkov <[email protected]> wrote:
> > >
> > >> On Tue, Dec 11, 2018 at 11:12:41PM +0000, Lendacky, Thomas wrote:
> > >> It does seem overloaded in that sense, but the feature means that LFENCE
> > >> is serializing and so can be used in rdtsc_ordered. In the same sense,
> > >> barrier_nospec is looking for whether LFENCE is serializing and preferring
> > >> that over MFENCE since it is lighter weight.
> > >>
> > >> In light of how they're being used now, they could probably stand to be
> > >> renamed in some way.
> > >
> > > Actually, come to think of it, what really matters here is whether
> > > LFENCE is serializing or not. Because if so, you wanna replace with LFENCE
> > > as it is lighter. And in that case a single alternative() - not _2() -
> > > should suffice.
> > >
> > > BUT(!), that still is not good enough if you do some qemu CPU models
> > > like pentium or so which don't even have MFENCE and cause stuff like
> > > this:
> > >
> > > https://lkml.kernel.org/r/[email protected]
> > >
> > > Which means, that you *do* have to alternate between
> > >
> > > * no insn at all
> > > * MFENCE
> > > * LFENCE, if it is serializing
> > >
> > > so barrier_nospec() does the right thing, AFAICS. And this is why we
> > > need an ALTERNATIVE_3() to add RDTSCP into the mix too.
> > >
> > > WRT renaming, I guess we can do something like:
> > >
> > > * X86_FEATURE_MFENCE_RDTSC -> X86_FEATURE_MFENCE - to mean that CPU has
> > > MFENCE support.
> > >
> > > and
> > >
> > > * X86_FEATURE_LFENCE_RDTSC -> X86_FEATURE_LFENCE_SERIALIZING
> > >
> > > Or something to that effect.
> >
> > This makes me nervous, since no one knows what “serializing” means.
> > IIRC AMD specifically documents that MFENCE is required before RDTSC
> > to get sensible ordering. So it’s entirely plausible to me that
> > LFENCE is okay for Spectre mitigation but MFENCE is needed for RDTSC
> > on some CPU.
>
> What we want is IFENCE, an instruction that flushes the complete
> pipeline. Or alternatively put: holds completion until all prior issued
> instructions complete.
>
> MFENCE always did that (and a ton more), LFENCE seems to have always
> done that on Intel, but AMD at some point actually implemented LFENCE as
> it was specified (only hold completion until all preceding loads are
> complete) and they (now) have this MSR bit to 'fix' that.

For the record, the "Software techniques for managing speculation on AMD
processors" white paper states:

"Instructions that cause the machine to temporarily stop inserting
new instructions into the machine for execution and wait for
execution of older instructions to finish are referred to as
dispatch serializing instructions."

and

"MITIGATION G-2

Description: Set an MSR in the processor so that LFENCE is a
dispatch serializing instruction and then use LFENCE in code
streams to serialize dispatch (LFENCE is faster than RDTSCP which
is also dispatch serializing). This mode of LFENCE may be enabled
by setting MSR C001_1029[1]=1.

Effect: Upon encountering an LFENCE when the MSR bit is set,
dispatch will stop until the LFENCE instruction becomes the oldest
instruction in the machine.

Applicability: All AMD family 10h/12h/14h/15h/16h/17h processors
support this MSR. LFENCE support is indicated by CPUID function1
EDX bit 26, SSE2. AMD family 0Fh/11h processors support LFENCE as
serializing always but do not support this MSR. AMD plans support
for this MSR and access to this bit for all future processors."

I could not find similar information in the AMD APM though; Section 7.6.4
("Serializing Instructions") of this manual describes a different/stronger
notion of "serialization", IIUC.


>
> At some point in the past (when all this spectre LFENCE muck was
> relatively fresh) I suggested we call the thing: instruction_fence() or
> something like that, maybe we ought to still do that now.

FWIW, I do find the name rdtsc_ordered() somehow too evocative... ;-)
maybe simply rdtsc_nospec() would be a better choice?

Andrea


>
> Re RDTSC, waiting for all preceding instructions to complete is
> 'obviously' sufficient for two RDTSC instructions not to get re-ordered
> either.

2018-12-12 14:19:04

by Tom Lendacky

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On 12/11/2018 08:24 PM, Andy Lutomirski wrote:
>> On Dec 11, 2018, at 3:39 PM, Borislav Petkov <[email protected]> wrote:
>>
>>> On Tue, Dec 11, 2018 at 11:12:41PM +0000, Lendacky, Thomas wrote:
>>> It does seem overloaded in that sense, but the feature means that LFENCE
>>> is serializing and so can be used in rdtsc_ordered. In the same sense,
>>> barrier_nospec is looking for whether LFENCE is serializing and preferring
>>> that over MFENCE since it is lighter weight.
>>>
>>> In light of how they're being used now, they could probably stand to be
>>> renamed in some way.
>>
>> Actually, come to think of it, what really matters here is whether
>> LFENCE is serializing or not. Because if so, you wanna replace with LFENCE
>> as it is lighter. And in that case a single alternative() - not _2() -
>> should suffice.
>>
>> BUT(!), that still is not good enough if you do some qemu CPU models
>> like pentium or so which don't even have MFENCE and cause stuff like
>> this:
>>
>> https://lkml.kernel.org/r/[email protected]
>>
>> Which means, that you *do* have to alternate between
>>
>> * no insn at all
>> * MFENCE
>> * LFENCE, if it is serializing
>>
>> so barrier_nospec() does the right thing, AFAICS. And this is why we
>> need an ALTERNATIVE_3() to add RDTSCP into the mix too.
>>
>> WRT renaming, I guess we can do something like:
>>
>> * X86_FEATURE_MFENCE_RDTSC -> X86_FEATURE_MFENCE - to mean that CPU has
>> MFENCE support.
>>
>> and
>>
>> * X86_FEATURE_LFENCE_RDTSC -> X86_FEATURE_LFENCE_SERIALIZING
>>
>> Or something to that effect.
>
> This makes me nervous, since no one knows what “serializing” means.
> IIRC AMD specifically documents that MFENCE is required before RDTSC
> to get sensible ordering. So it’s entirely plausible to me that
> LFENCE is okay for Spectre mitigation but MFENCE is needed for RDTSC
> on some CPU.

As long as MSR 0xc0011029[1] is set (MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT),
then LFENCE is a proper, lighter weight solution for ordering RDTSC. So
we're good there.

Thanks,
Tom

>

2018-12-12 14:20:41

by Tom Lendacky

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On 12/12/2018 08:15 AM, Tom Lendacky wrote:
> On 12/11/2018 08:24 PM, Andy Lutomirski wrote:
>>> On Dec 11, 2018, at 3:39 PM, Borislav Petkov <[email protected]> wrote:
>>>
>>>> On Tue, Dec 11, 2018 at 11:12:41PM +0000, Lendacky, Thomas wrote:
>>>> It does seem overloaded in that sense, but the feature means that LFENCE
>>>> is serializing and so can be used in rdtsc_ordered. In the same sense,
>>>> barrier_nospec is looking for whether LFENCE is serializing and preferring
>>>> that over MFENCE since it is lighter weight.
>>>>
>>>> In light of how they're being used now, they could probably stand to be
>>>> renamed in some way.
>>>
>>> Actually, come to think of it, what really matters here is whether
>>> LFENCE is serializing or not. Because if so, you wanna replace with LFENCE
>>> as it is lighter. And in that case a single alternative() - not _2() -
>>> should suffice.
>>>
>>> BUT(!), that still is not good enough if you do some qemu CPU models
>>> like pentium or so which don't even have MFENCE and cause stuff like
>>> this:
>>>
>>> https://lkml.kernel.org/r/[email protected]
>>>
>>> Which means, that you *do* have to alternate between
>>>
>>> * no insn at all
>>> * MFENCE
>>> * LFENCE, if it is serializing
>>>
>>> so barrier_nospec() does the right thing, AFAICS. And this is why we
>>> need an ALTERNATIVE_3() to add RDTSCP into the mix too.
>>>
>>> WRT renaming, I guess we can do something like:
>>>
>>> * X86_FEATURE_MFENCE_RDTSC -> X86_FEATURE_MFENCE - to mean that CPU has
>>> MFENCE support.
>>>
>>> and
>>>
>>> * X86_FEATURE_LFENCE_RDTSC -> X86_FEATURE_LFENCE_SERIALIZING
>>>
>>> Or something to that effect.
>>
>> This makes me nervous, since no one knows what “serializing” means.
>> IIRC AMD specifically documents that MFENCE is required before RDTSC
>> to get sensible ordering. So it’s entirely plausible to me that
>> LFENCE is okay for Spectre mitigation but MFENCE is needed for RDTSC
>> on some CPU.
>
> As long as MSR 0xc0011029[1] is set (MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT),
> then LFENCE is a proper, lighter weight solution for ordering RDTSC. So
> we're good there.

I should add that I'll see about getting documentation updated with this
information.

Thanks,
Tom

>
> Thanks,
> Tom
>
>>

2018-12-12 18:09:47

by Andy Lutomirski

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Wed, Dec 12, 2018 at 2:08 AM Borislav Petkov <[email protected]> wrote:
>
> On Tue, Dec 11, 2018 at 06:24:44PM -0800, Andy Lutomirski wrote:
> > This makes me nervous, since no one knows what “serializing” means.
>
> Why no one? If you wanna say that X86_FEATURE_LFENCE_SERIALIZING is not
> really telling, so is X86_FEATURE_LFENCE_RDTSC, TBH. :)

You're proving my point, I think. CPUID, IRET, MOV to CR, etc are
"serializing". LFENCE, on many CPUd and depending on MSRs, is a
different kind of serializing. MFENCE is something else. All LOCK
instructions are some kind of barrier, but I don't think anyone calls
them "serializing".

The uaccess users of barrier_nospec() are presumably looking for a
speculation barrier in the sense of "CPU, please don't execute the
code after this until you're sure that this code should be executed
for real and until all inputs are known, not guessed."

The property I want for RDTSC ordering is much weaker: I want it to be
ordered like a load. Imagine that, instead of an on-chip TSC, the TSC
is literally a location in main memory that gets incremented by an
extra dedicated CPU every nanosecond or so. I want users of RDTSC to
work as if they were reading such a location in memory using an
ordinary load. I believe this gives the real desired property that it
should be impossible to observe the TSC going backwards. This is a
much weaker form of serialization.
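
One way to picture that "ordered like a load" requirement is a cross-CPU
handoff: if one CPU reads the TSC and publishes the value, another CPU that
consumes the value must not then read an older TSC. A small two-thread sketch
of that check, assuming a synchronized, invariant TSC and using plain RDTSCP
in place of the kernel's rdtsc_ordered() (build with gcc -O2 -pthread
handoff.c):

/* handoff.c - sanity-check timestamp monotonicity across a thread handoff */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define ITERATIONS 1000000

static _Atomic uint64_t mailbox;	/* 0 = empty; a real TSC value is never 0 here */

static inline uint64_t tsc_read(void)
{
	uint32_t lo, hi;

	/* "memory" keeps the compiler from moving the read around the
	 * mailbox accesses; RDTSCP clobbers ECX with MSR_TSC_AUX. */
	asm volatile("rdtscp" : "=a" (lo), "=d" (hi) :: "ecx", "memory");
	return ((uint64_t)hi << 32) | lo;
}

static void *receiver(void *arg)
{
	for (int i = 0; i < ITERATIONS; i++) {
		uint64_t sent, now;

		while ((sent = atomic_load_explicit(&mailbox, memory_order_acquire)) == 0)
			;	/* spin until the sender publishes a timestamp */

		now = tsc_read();
		if (now < sent)
			printf("time went backwards: %llu < %llu\n",
			       (unsigned long long)now, (unsigned long long)sent);

		atomic_store_explicit(&mailbox, 0, memory_order_release);
	}
	return arg;
}

int main(void)
{
	pthread_t rx;

	pthread_create(&rx, NULL, receiver, NULL);

	for (int i = 0; i < ITERATIONS; i++) {
		while (atomic_load_explicit(&mailbox, memory_order_acquire) != 0)
			;	/* wait until the previous value was consumed */

		atomic_store_explicit(&mailbox, tsc_read(), memory_order_release);
	}

	pthread_join(rx, NULL);
	puts("done");
	return 0;
}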

2018-12-12 18:46:50

by Borislav Petkov

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Wed, Dec 12, 2018 at 10:07:03AM -0800, Andy Lutomirski wrote:
> You're proving my point, I think. CPUID, IRET, MOV to CR, etc are
> "serializing". LFENCE, on many CPUd and depending on MSRs, is a
> different kind of serializing. MFENCE is something else. All LOCK
> instructions are some kind of barrier, but I don't think anyone calls
> them "serializing".

Yeah, peterz and I hashed it out a bit today on IRC about the different
meanings of serializing. I see your point now.

> The uaccess users of barrier_nospec() are presumably looking for a
> speculation barrier in the sense of "CPU, please don't execute the
> code after this until you're sure that this code should be executed
> for real and until all inputs are known, not guessed."

Yeah, I believe AMD's paper has this nicely written:

"MITIGATION G-2

Description: Set an MSR in the processor so that LFENCE is a dispatch
serializing instruction and then use LFENCE in code streams to
serialize dispatch (LFENCE is faster than RDTSCP which is also dispatch
serializing). This mode of LFENCE may be enabled by setting MSR
C001_1029[1]=1.

Effect: Upon encountering an LFENCE when the MSR bit is set, dispatch
will stop until the LFENCE instruction becomes the oldest instruction in
the machine."

https://developer.amd.com/wp-content/resources/90343-B_SoftwareTechniquesforManagingSpeculation_WP_7-18Update_FNL.pdf

which is basically what you want for the whole mitigation crap if you
want to kill speculation - you simply hold dispatch until the LFENCE
retires.

> The property I want for RDTSC ordering is much weaker: I want it to be
> ordered like a load. Imagine that, instead of an on-chip TSC, the TSC
> is literally a location in main memory that gets incremented by an
> extra dedicated CPU every nanosecond or so. I want users of RDTSC to
> work as if they were reading such a location in memory using an
> ordinary load. I believe this gives the real desired property that it
> should be impossible to observe the TSC going backwards. This is a
> much weaker form of serialization.

Well, in that case you need something new.

Because, the moment you have a RDTSC in flight and a second RDTSC comes
in and that second RDTSC must *not* bypass the first one and execute
earlier due to OoO, you need to impose some ordering. And that's pretty
much uarch-dependent, I'd say.

And I guess on AMD the way to do that is to stop dispatch until the
first RDTSC retires.

Can it be done faster? Sure. And I'm pretty sure there's a lot of pesky
little hw details we're not even hearing of, which get in the way.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-12-12 18:53:46

by Andy Lutomirski

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Wed, Dec 12, 2018 at 10:45 AM Borislav Petkov <[email protected]> wrote:
>
> On Wed, Dec 12, 2018 at 10:07:03AM -0800, Andy Lutomirski wrote:
> > You're proving my point, I think. CPUID, IRET, MOV to CR, etc are
> > "serializing". LFENCE, on many CPUd and depending on MSRs, is a
> > different kind of serializing. MFENCE is something else. All LOCK
> > instructions are some kind of barrier, but I don't think anyone calls
> > them "serializing".
>
> Yeah, peterz and I hashed it out a bit today on IRC about the different
> meanings of serializing. I see your point now.
>
> > The uaccess users of barrier_nospec() are presumably looking for a
> > speculation barrier in the sense of "CPU, please don't execute the
> > code after this until you're sure that this code should be executed
> > for real and until all inputs are known, not guessed."
>
> Yeah, I believe AMD's paper has this nicely written:
>
> "MITIGATION G-2
>
> Description: Set an MSR in the processor so that LFENCE is a dispatch
> serializing instruction and then use LFENCE in code streams to
> serialize dispatch (LFENCE is faster than RDTSCP which is also dispatch
> serializing). This mode of LFENCE may be enabled by setting MSR
> C001_1029[1]=1.
>
> Effect: Upon encountering an LFENCE when the MSR bit is set, dispatch
> will stop until the LFENCE instruction becomes the oldest instruction in
> the machine."
>
> https://developer.amd.com/wp-content/resources/90343-B_SoftwareTechniquesforManagingSpeculation_WP_7-18Update_FNL.pdf
>
> which is basically what you want for the whole mitigation crap if you
> want to kill speculation - you simply hold dispatch until the LFENCE
> retires.
>
> > The property I want for RDTSC ordering is much weaker: I want it to be
> > ordered like a load. Imagine that, instead of an on-chip TSC, the TSC
> > is literally a location in main memory that gets incremented by an
> > extra dedicated CPU every nanosecond or so. I want users of RDTSC to
> > work as if they were reading such a location in memory using an
> > ordinary load. I believe this gives the real desired property that it
> > should be impossible to observe the TSC going backwards. This is a
> > much weaker form of serialization.
>
> Well, in that case you need something new.
>
> Because, the moment you have a RDTSC in flight and a second RDTSC comes
> in and that second RDTSC must *not* bypass the first one and execute
> earlier due to OoO, you need to impose some ordering. And that's pretty
> much uarch-dependent, I'd say.
>
> And I guess on AMD the way to do that is to stop dispatch until the
> first RDTSC retires.
>
> Can it be done faster? Sure. And I'm pretty sure there's a lot of pesky
> little hw details we're not even hearing of, which get in the way.
>

As far as I know, RDTSCP gets the job done, as does LFENCE, RDTSC on
Intel. There was a big discussion a few years ago where we changed it
from LFENCE;RDTSC;LFENCE to just LFENCE;RDTSC after everyone was
reasonably convinced that the uarch would not dispatch two RDTSCs
backwards if the first one was immediately preceded by LFENCE.

2018-12-12 20:02:56

by Borislav Petkov

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Wed, Dec 12, 2018 at 10:50:30AM -0800, Andy Lutomirski wrote:
> As far as I know, RDTSCP gets the job done, as does LFENCE, RDTSC on
> Intel.

Same on AMD when LFENCE has been made dispatch-serializing.

> There was a big discussion a few years ago where we changed it
> from LFENCE;RDTSC;LFENCE to just LFENCE;RDTSC after everyone was
> reasonably convinced that the uarch would not dispatch two RDTSCs
> backwards if the first one was immediately preceeded by LFENCE.

Yeah, the second one won't pass the LFENCE so you won't see time going
backwards, sure.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-12-12 20:11:55

by Andy Lutomirski

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Wed, Dec 12, 2018 at 12:00 PM Borislav Petkov <[email protected]> wrote:
>
> On Wed, Dec 12, 2018 at 10:50:30AM -0800, Andy Lutomirski wrote:
> > As far as I know, RDTSCP gets the job done, as does LFENCE, RDTSC on
> > Intel.
>
> Same on AMD when LFENCE has been made dispatch-serializing.
>
> > There was a big discussion a few years ago where we changed it
> > from LFENCE;RDTSC;LFENCE to just LFENCE;RDTSC after everyone was
> > reasonably convinced that the uarch would not dispatch two RDTSCs
> > backwards if the first one was immediately preceeded by LFENCE.
>
> Yeah, the second one won't pass the LFENCE so you won't see time going
> backwards, sure.
>

Also that the uarch doesn't turn:

LFENCE
RDTSC
load some memory

into a load followed a cycle later by RDTSC. If that happened, then
some cooperating CPUs could observe time going backwards.

But RDTSC has no dependencies, so, barring odd register renaming
effects or similar, this isn't going to be a problem in practice.
Also, the actual clock reading code is complicated enough that we have
several cycles of wiggle room :)

2018-12-12 20:31:44

by Borislav Petkov

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Wed, Dec 12, 2018 at 12:09:00PM -0800, Andy Lutomirski wrote:
> Also that the uarch doesn't turn:
>
> LFENCE
> RDTSC
> load some memory
>
> into a load followed a cycle later by RDTSC. If that happened, then
> some cooperating CPUs could observe time going backwards.

How so? Details?

Because I wouldn't be surprised at all if the logical CPUs on the same
node would actually see the same timestamp counter, which is propagated
from the northbridge into the cores instead of maintaining it in each
core and having to synchronize it periodically.

But maybe you mean something completely different...

> But RDTSC has no dependencies, so, barring odd register renaming
> effects or similar, this isn't going to be a problem in practice.

Did you talk to hw guys about it? Because I don't see anything blocking
the CPU from reordering those...

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-12-14 13:40:39

by David Laight

Subject: RE: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

From: Borislav Petkov
> Sent: 12 December 2018 18:45
...
> > The property I want for RDTSC ordering is much weaker: I want it to be
> > ordered like a load. Imagine that, instead of an on-chip TSC, the TSC
> > is literally a location in main memory that gets incremented by an
> > extra dedicated CPU every nanosecond or so. I want users of RDTSC to
> > work as if they were reading such a location in memory using an
> > ordinary load. I believe this gives the real desired property that it
> > should be impossible to observe the TSC going backwards. This is a
> > much weaker form of serialization.
>
> Well, in that case you need something new.
>
> Because, the moment you have a RDTSC in flight and a second RDTSC comes
> in and that second RDTSC must *not* bypass the first one and execute
> earlier due to OoO, you need to impose some ordering. And that's pretty
> much uarch-dependent, I'd say.
>
> And I guess on AMD the way to do that is to stop dispatch until the
> first RDTSC retires.
>
> Can it be done faster? Sure. And I'm pretty sure there's a lot of pesky
> little hw details we're not even hearing of, which get in the way.

ISTR one of the problems with RDTSC serialising is that it is used
for micro-benchmarks.
So you want to time all the instructions between a pair of RDTSC.
This doesn't work well if RDTSC doesn't wait for all instructions
to have executed.
The serialisation requirements for spectre mitigation are different.

David
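
For the micro-benchmark use David describes, the commonly documented idiom
brackets the measured code between a serialized starting read and an
RDTSCP-based ending read, so nothing drifts into or out of the measured
window. A sketch of that pattern (the CPUID bracketing follows Intel's
benchmarking guidance; the measured body is a placeholder):

#include <stdint.h>
#include <stdio.h>

/* CPUID serializes before the starting RDTSC; RDTSCP waits for the measured
 * code to finish; the trailing CPUID keeps later instructions from creeping
 * into the measured window. */
static inline uint64_t bench_start(void)
{
	uint32_t lo, hi;

	asm volatile("cpuid\n\t"
		     "rdtsc"
		     : "=a" (lo), "=d" (hi)
		     : "0" (0)
		     : "ebx", "ecx");
	return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t bench_end(void)
{
	uint32_t lo, hi;

	asm volatile("rdtscp\n\t"
		     "mov %%eax, %0\n\t"
		     "mov %%edx, %1\n\t"
		     "cpuid"
		     : "=r" (lo), "=r" (hi)
		     :
		     : "eax", "ebx", "ecx", "edx");
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	volatile int x = 0;
	uint64_t t0, t1;

	t0 = bench_start();
	x += 42;			/* placeholder for the code under test */
	t1 = bench_end();

	printf("cycles: %llu (x=%d)\n", (unsigned long long)(t1 - t0), x);
	return 0;
}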


2018-12-15 18:55:14

by Andy Lutomirski

Subject: Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

On Fri, Dec 14, 2018 at 5:39 AM David Laight <[email protected]> wrote:
>
> From: Borislav Petkov
> > Sent: 12 December 2018 18:45
> ...
> > > The property I want for RDTSC ordering is much weaker: I want it to be
> > > ordered like a load. Imagine that, instead of an on-chip TSC, the TSC
> > > is literally a location in main memory that gets incremented by an
> > > extra dedicated CPU every nanosecond or so. I want users of RDTSC to
> > > work as if they were reading such a location in memory using an
> > > ordinary load. I believe this gives the real desired property that it
> > > should be impossible to observe the TSC going backwards. This is a
> > > much weaker form of serialization.
> >
> > Well, in that case you need something new.
> >
> > Because, the moment you have a RDTSC in flight and a second RDTSC comes
> > in and that second RDTSC must *not* bypass the first one and execute
> > earlier due to OoO, you need to impose some ordering. And that's pretty
> > much uarch-dependent, I'd say.
> >
> > And I guess on AMD the way to do that is to stop dispatch until the
> > first RDTSC retires.
> >
> > Can it be done faster? Sure. And I'm pretty sure there's a lot of pesky
> > little hw details we're not even hearing of, which get in the way.
>
> ISTR one of the problems with RDTSC serialising is that it is used
> for micro-benchmarks.

If you're benchmarking with that level of detail, you're probably
doing RDTSC directly instead of using the vDSO. Or, even better,
RDPMC.

Subject: [tip:x86/alternatives] x86/alternatives: Add an ALTERNATIVE_3() macro

Commit-ID: 71a93c26930471e976dd184ef91931b2a5393afc
Gitweb: https://git.kernel.org/tip/71a93c26930471e976dd184ef91931b2a5393afc
Author: Borislav Petkov <[email protected]>
AuthorDate: Mon, 10 Dec 2018 16:17:23 +0100
Committer: Borislav Petkov <[email protected]>
CommitDate: Wed, 16 Jan 2019 12:42:50 +0100

x86/alternatives: Add an ALTERNATIVE_3() macro

Similar to ALTERNATIVE_2(), ALTERNATIVE_3() selects between 3 possible
variants. Will be used for adding RDTSCP to the rdtsc_ordered()
alternatives.

Signed-off-by: Borislav Petkov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: X86 ML <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/alternative.h | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 373e2baca6ce..4c74073a19cc 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -122,6 +122,16 @@ static inline int alternatives_text_reserved(void *start, void *end)
"(" alt_max_short(alt_rlen(num1), alt_rlen(num2)) " - (" alt_slen ")), 0x90\n" \
alt_end_marker ":\n"

+#define OLDINSTR_3(oldinsn, n1, n2, n3) \
+ "# ALT: oldinstr3\n" \
+ "661:\n\t" oldinsn "\n662:\n" \
+ "# ALT: padding3\n" \
+ ".skip -((" alt_max_short(alt_max_short(alt_rlen(n1), alt_rlen(n2)), alt_rlen(n3)) \
+ " - (" alt_slen ")) > 0) * " \
+ "(" alt_max_short(alt_max_short(alt_rlen(n1), alt_rlen(n2)), alt_rlen(n3)) \
+ " - (" alt_slen ")), 0x90\n" \
+ alt_end_marker ":\n"
+
#define ALTINSTR_ENTRY(feature, num) \
" .long 661b - .\n" /* label */ \
" .long " b_replacement(num)"f - .\n" /* new instruction */ \
@@ -155,6 +165,19 @@ static inline int alternatives_text_reserved(void *start, void *end)
ALTINSTR_REPLACEMENT(newinstr2, feature2, 2) \
".popsection\n"

+#define ALTERNATIVE_3(oldinsn, newinsn1, feat1, newinsn2, feat2, newinsn3, feat3) \
+ OLDINSTR_3(oldinsn, 1, 2, 3) \
+ ".pushsection .altinstructions,\"a\"\n" \
+ ALTINSTR_ENTRY(feat1, 1) \
+ ALTINSTR_ENTRY(feat2, 2) \
+ ALTINSTR_ENTRY(feat3, 3) \
+ ".popsection\n" \
+ ".pushsection .altinstr_replacement, \"ax\"\n" \
+ ALTINSTR_REPLACEMENT(newinsn1, feat1, 1) \
+ ALTINSTR_REPLACEMENT(newinsn2, feat2, 2) \
+ ALTINSTR_REPLACEMENT(newinsn3, feat3, 3) \
+ ".popsection\n"
+
/*
* Alternative instructions for different CPU types or capabilities.
*

Subject: [tip:x86/alternatives] x86/alternatives: Add macro comments

Commit-ID: 1c1ed4731cc81942a5b25f284a85257573829b9e
Gitweb: https://git.kernel.org/tip/1c1ed4731cc81942a5b25f284a85257573829b9e
Author: Borislav Petkov <[email protected]>
AuthorDate: Mon, 10 Dec 2018 12:21:48 +0100
Committer: Borislav Petkov <[email protected]>
CommitDate: Wed, 16 Jan 2019 12:40:07 +0100

x86/alternatives: Add macro comments

... so that when one stares at the .s output, one can find her way
around the resulting asm magic.

With it, ALTERNATIVE looks like this now:

# ALT: oldnstr
661:
...
662:
# ALT: padding
.skip ...
663:
.pushsection .altinstructions,"a"

...

.popsection
.pushsection .altinstr_replacement, "ax"
# ALT: replacement 1
6641:
...
6651:
.popsection

Merge __OLDINSTR() into OLDINSTR(), while at it.

No functional changes.

Signed-off-by: Borislav Petkov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: X86 ML <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/alternative.h | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 0660e14690c8..373e2baca6ce 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -94,13 +94,12 @@ static inline int alternatives_text_reserved(void *start, void *end)
#define alt_total_slen alt_end_marker"b-661b"
#define alt_rlen(num) e_replacement(num)"f-"b_replacement(num)"f"

-#define __OLDINSTR(oldinstr, num) \
+#define OLDINSTR(oldinstr, num) \
+ "# ALT: oldnstr\n" \
"661:\n\t" oldinstr "\n662:\n" \
+ "# ALT: padding\n" \
".skip -(((" alt_rlen(num) ")-(" alt_slen ")) > 0) * " \
- "((" alt_rlen(num) ")-(" alt_slen ")),0x90\n"
-
-#define OLDINSTR(oldinstr, num) \
- __OLDINSTR(oldinstr, num) \
+ "((" alt_rlen(num) ")-(" alt_slen ")),0x90\n" \
alt_end_marker ":\n"

/*
@@ -116,7 +115,9 @@ static inline int alternatives_text_reserved(void *start, void *end)
* additionally longer than the first replacement alternative.
*/
#define OLDINSTR_2(oldinstr, num1, num2) \
+ "# ALT: oldinstr2\n" \
"661:\n\t" oldinstr "\n662:\n" \
+ "# ALT: padding2\n" \
".skip -((" alt_max_short(alt_rlen(num1), alt_rlen(num2)) " - (" alt_slen ")) > 0) * " \
"(" alt_max_short(alt_rlen(num1), alt_rlen(num2)) " - (" alt_slen ")), 0x90\n" \
alt_end_marker ":\n"
@@ -129,8 +130,9 @@ static inline int alternatives_text_reserved(void *start, void *end)
" .byte " alt_rlen(num) "\n" /* replacement len */ \
" .byte " alt_pad_len "\n" /* pad len */

-#define ALTINSTR_REPLACEMENT(newinstr, feature, num) /* replacement */ \
- b_replacement(num)":\n\t" newinstr "\n" e_replacement(num) ":\n\t"
+#define ALTINSTR_REPLACEMENT(newinstr, feature, num) /* replacement */ \
+ "# ALT: replacement " #num "\n" \
+ b_replacement(num)":\n\t" newinstr "\n" e_replacement(num) ":\n"

/* alternative assembly primitive: */
#define ALTERNATIVE(oldinstr, newinstr, feature) \

Subject: [tip:x86/alternatives] x86/alternatives: Print containing function

Commit-ID: c1d4e4192aa4e7408a81c32a77e7c867a07f8aa2
Gitweb: https://git.kernel.org/tip/c1d4e4192aa4e7408a81c32a77e7c867a07f8aa2
Author: Borislav Petkov <[email protected]>
AuthorDate: Mon, 10 Dec 2018 12:30:30 +0100
Committer: Borislav Petkov <[email protected]>
CommitDate: Wed, 16 Jan 2019 12:41:57 +0100

x86/alternatives: Print containing function

... in the "debug-alternative" output so that one can find her way
easier when staring at the vmlinux disassembly.

For example:

apply_alternatives: feat: 3*32+18, old: (read_tsc+0x0/0x10 (ffffffff8101d1c0) len: 5), repl: (ffffffff824e6d33, len: 5)
                                         ^^^^^^^^^^^^^^^^^
ffffffff8101d1c0: old_insn: 0f 31 90 90 90
ffffffff824e6d33: rpl_insn: 0f ae e8 0f 31
ffffffff8101d1c0: final_insn: 0f ae e8 0f 31

No functional changes.

Signed-off-by: Borislav Petkov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: X86 ML <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/kernel/alternative.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ebeac487a20c..d458c7973c56 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -393,10 +393,10 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
continue;
}

- DPRINTK("feat: %d*32+%d, old: (%px len: %d), repl: (%px, len: %d), pad: %d",
+ DPRINTK("feat: %d*32+%d, old: (%pS (%px) len: %d), repl: (%px, len: %d), pad: %d",
a->cpuid >> 5,
a->cpuid & 0x1f,
- instr, a->instrlen,
+ instr, instr, a->instrlen,
replacement, a->replacementlen, a->padlen);

DUMP_BYTES(instr, a->instrlen, "%px: old_insn: ", instr);