2018-01-31 20:14:02

by Nadav Amit

Subject: [PATCH] x86: Align TLB invalidation info

The TLB invalidation info is allocated on the stack, which might cause
it to be unaligned. Since this information may be transferred to
different cores for TLB shootdown, this might result in an additional
cache-line bouncing between the cores.

GCC provides a way to deal with it by using
__builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.

Signed-off-by: Nadav Amit <[email protected]>

Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
---
arch/x86/mm/tlb.c | 21 +++++++++++----------
include/linux/compiler-gcc.h | 5 +++++
include/linux/compiler_types.h | 4 ++++
3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5bfe61a5e8e3..bab7bb5d982f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -574,37 +574,38 @@ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag)
{
+ struct flush_tlb_info *info;
int cpu;

- struct flush_tlb_info info = {
- .mm = mm,
- };
+ info = __alloca_with_align(sizeof(*info),
+ SMP_CACHE_BYTES * BITS_PER_BYTE);
+ info->mm = mm;

cpu = get_cpu();

/* This is also a barrier that synchronizes with switch_mm(). */
- info.new_tlb_gen = inc_mm_tlb_gen(mm);
+ info->new_tlb_gen = inc_mm_tlb_gen(mm);

/* Should we flush just the requested range? */
if ((end != TLB_FLUSH_ALL) &&
!(vmflag & VM_HUGETLB) &&
((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
- info.start = start;
- info.end = end;
+ info->start = start;
+ info->end = end;
} else {
- info.start = 0UL;
- info.end = TLB_FLUSH_ALL;
+ info->start = 0UL;
+ info->end = TLB_FLUSH_ALL;
}

if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
VM_WARN_ON(irqs_disabled());
local_irq_disable();
- flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
+ flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
local_irq_enable();
}

if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
- flush_tlb_others(mm_cpumask(mm), &info);
+ flush_tlb_others(mm_cpumask(mm), info);

put_cpu();
}
diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index 631354acfa72..aea9a2e69417 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -314,6 +314,11 @@
#define __designated_init __attribute__((designated_init))
#endif

+#if GCC_VERSION >= 60100
+#define __alloca_with_align(size, alignment) \
+ __builtin_alloca_with_align(size, alignment)
+#endif
+
#endif /* gcc version >= 40000 specific checks */

#if !defined(__noclone)
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index 6b79a9bba9a7..c71297d95c74 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -271,4 +271,8 @@ struct ftrace_likely_data {
# define __native_word(t) (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
#endif

+#ifndef __alloca_with_align
+#define __alloca_with_align(size, alignment) __builtin_alloca(size)
+#endif
+
#endif /* __LINUX_COMPILER_TYPES_H */
--
2.14.1
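
A note on the builtin's semantics: the alignment argument of
__builtin_alloca_with_align() is given in bits, which is why the call in the
patch multiplies SMP_CACHE_BYTES by BITS_PER_BYTE. A minimal, self-contained
userspace sketch of what the builtin guarantees (not kernel code; the 64-byte
cache-line size is an assumption):

/* Compile with: gcc -O2 -o align_demo align_demo.c (GCC builtin; the patch
 * above only enables it for GCC_VERSION >= 60100). */
#include <stdio.h>
#include <stdint.h>

#define CACHE_BYTES	64	/* assumed cache-line size */
#define BITS_PER_BYTE	8

int main(void)
{
	/* The second argument is the requested alignment in *bits*. */
	void *p = __builtin_alloca_with_align(40, CACHE_BYTES * BITS_PER_BYTE);

	/* Prints 0: the allocation starts on a cache-line boundary. */
	printf("offset within cache line: %lu\n",
	       (unsigned long)((uintptr_t)p % CACHE_BYTES));
	return 0;
}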



2018-01-31 20:26:17

by Andy Lutomirski

Subject: Re: [PATCH] x86: Align TLB invalidation info



> On Jan 31, 2018, at 12:11 PM, Nadav Amit <[email protected]> wrote:
>
> The TLB invalidation info is allocated on the stack, which might cause
> it to be unaligned. Since this information may be transferred to
> different cores for TLB shootdown, this might result in an additional
> cache-line bouncing between the cores.
>
> GCC provides a way to deal with it by using
> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>

Eww. How about __aligned?
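
A minimal sketch of that alternative, as a self-contained userspace program:
the stack variable itself carries the aligned attribute (which the kernel's
__aligned() macro wraps), so no alloca is needed. The struct fields and the
64-byte line size are assumptions for illustration, not the actual v2 patch:

#include <stdio.h>
#include <stdint.h>

#define CACHE_BYTES 64	/* assumed cache-line size */

struct flush_info_demo {
	void *mm;
	unsigned long start, end;
	unsigned long new_tlb_gen;
};

int main(void)
{
	/* The attribute forces the stack slot onto a 64-byte boundary,
	 * so the whole (32-byte) struct sits in a single cache line. */
	struct flush_info_demo info __attribute__((aligned(CACHE_BYTES))) = {
		.start = 0,
		.end = ~0UL,
	};

	printf("offset within cache line: %lu, size: %zu\n",
	       (unsigned long)((uintptr_t)&info % CACHE_BYTES), sizeof(info));
	return 0;
}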



2018-01-31 20:49:31

by Nadav Amit

Subject: Re: [PATCH] x86: Align TLB invalidation info

Andy Lutomirski <[email protected]> wrote:

>
>
>> On Jan 31, 2018, at 12:11 PM, Nadav Amit <[email protected]> wrote:
>>
>> The TLB invalidation info is allocated on the stack, which might cause
>> it to be unaligned. Since this information may be transferred to
>> different cores for TLB shootdown, this might result in an additional
>> cache-line bouncing between the cores.
>>
>> GCC provides a way to deal with it by using
>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>
> Eww. How about __aligned?

Err.. Stupid me. For some reason I remembered I tried it and it didn’t have
the desired effect, which caused me to assume it does not work for variables
on the stack. Anyhow, it does the work. I’ll submit v2.


2018-01-31 20:51:11

by Dave Hansen

Subject: Re: [PATCH] x86: Align TLB invalidation info

On 01/31/2018 12:48 PM, Nadav Amit wrote:
>>> On Jan 31, 2018, at 12:11 PM, Nadav Amit <[email protected]> wrote:
>>>
>>> The TLB invalidation info is allocated on the stack, which might cause
>>> it to be unaligned. Since this information may be transferred to
>>> different cores for TLB shootdown, this might result in an additional
>>> cache-line bouncing between the cores.
>>>
>>> GCC provides a way to deal with it by using
>>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>> Eww. How about __aligned?
> Err.. Stupid me. For some reason I remembered I tried it and it didn’t have
> the desired effect, which caused me to assume it does not work for variables
> on the stack. Anyhow, it does the work. I’ll submit v2.

Also, just curious, but did you find this situation by inspection or did
it show up in a profile somewhere?

2018-01-31 20:54:31

by Nadav Amit

Subject: Re: [PATCH] x86: Align TLB invalidation info

Dave Hansen <[email protected]> wrote:

> On 01/31/2018 12:48 PM, Nadav Amit wrote:
>>>> On Jan 31, 2018, at 12:11 PM, Nadav Amit <[email protected]> wrote:
>>>>
>>>> The TLB invalidation info is allocated on the stack, which might cause
>>>> it to be unaligned. Since this information may be transferred to
>>>> different cores for TLB shootdown, this might result in an additional
>>>> cache-line bouncing between the cores.
>>>>
>>>> GCC provides a way to deal with it by using
>>>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>>> Eww. How about __aligned?
>> Err.. Stupid me. For some reason I remembered I tried it and it didn’t have
>> the desired effect, which caused me to assume it does not work for variables
>> on the stack. Anyhow, it does the work. I’ll submit v2.
>
> Also, just curious, but did you find this situation by inspection or did
> it show up in a profile somewhere?

Actually, no. I considered adding data to info for a different reason
(which I eventually abandoned); I was assuming you would all kill me for
increasing the size, so this was a preemptive step.


2018-01-31 21:02:14

by Andy Lutomirski

Subject: Re: [PATCH] x86: Align TLB invalidation info

On Wed, Jan 31, 2018 at 12:48 PM, Nadav Amit <[email protected]> wrote:
> Andy Lutomirski <[email protected]> wrote:
>
>>
>>
>>> On Jan 31, 2018, at 12:11 PM, Nadav Amit <[email protected]> wrote:
>>>
>>> The TLB invalidation info is allocated on the stack, which might cause
>>> it to be unaligned. Since this information may be transferred to
>>> different cores for TLB shootdown, this might result in an additional
>>> cache-line bouncing between the cores.
>>>
>>> GCC provides a way to deal with it by using
>>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>>
>> Eww. How about __aligned?
>
> Err.. Stupid me. For some reason I remembered I tried it and it didn’t have
> the desired effect, which caused me to assume it does not work for variables
> on the stack. Anyhow, it does the work. I’ll submit v2.
>

You're probably remembering that __aligned(16) malfunctions on older
GCC versions. But __aligned(64), which is what you want, has always
been okay AFAIK.

2018-01-31 21:04:23

by Dave Hansen

Subject: Re: [PATCH] x86: Align TLB invalidation info

On 01/31/2018 12:11 PM, Nadav Amit wrote:
> The TLB invalidation info is allocated on the stack, which might cause
> it to be unaligned. Since this information may be transferred to
> different cores for TLB shootdown, this might result in an additional
> cache-line bouncing between the cores.
>
> GCC provides a way to deal with it by using
> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.

It doesn't really *bounce*, though, does it? I don't see any writes on
the remote side. The remote use seems entirely read-only.

You also don't have to exhaustively test this, but I'd love to see at
least a sanity check with a microbenchmark (or something) that, yes,
this does help *something*. Maybe it makes the remote
flush_tlb_func_common() run faster because it's pulling in fewer lines,
or maybe you can even detect fewer misses in there.
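
One way to run the kind of sanity check being asked for, sketched from
userspace with perf_event_open(2): count hardware cache misses around a
shootdown-heavy loop and compare the aligned and misaligned kernels. Counting
kernel-side events this way needs a permissive perf_event_paranoid setting,
and the workload in the middle is a placeholder:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static int open_cache_miss_counter(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CACHE_MISSES;
	attr.disabled = 1;
	attr.exclude_kernel = 0;	/* the misses of interest happen in
					 * flush_tlb_func_common(), i.e. in
					 * kernel context */

	return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
	uint64_t misses;
	int fd = open_cache_miss_counter();

	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

	/* ... placeholder: run a TLB-shootdown-heavy loop here, e.g. a
	 * multithreaded madvise(MADV_DONTNEED) workload ... */

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	if (read(fd, &misses, sizeof(misses)) == (ssize_t)sizeof(misses))
		printf("cache misses: %llu\n", (unsigned long long)misses);
	close(fd);
	return 0;
}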

2018-01-31 21:12:49

by Nadav Amit

Subject: Re: [PATCH] x86: Align TLB invalidation info

Dave Hansen <[email protected]> wrote:

> On 01/31/2018 12:11 PM, Nadav Amit wrote:
>> The TLB invalidation info is allocated on the stack, which might cause
>> it to be unaligned. Since this information may be transferred to
>> different cores for TLB shootdown, this might result in an additional
>> cache-line bouncing between the cores.
>>
>> GCC provides a way to deal with it by using
>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>
> It doesn't really *bounce*, though, does it? I don't see any writes on
> the remote side. The remote use seems entirely read-only.
>
> You also don't have to exhaustively test this, but I'd love to see at
> least a sanity check with a microbenchmark (or something) that, yes,
> this does help *something*. Maybe it makes the remote
> flush_tlb_func_common() run faster because it's pulling in fewer lines,
> or maybe you can even detect fewer misses in there.

I agree that with the whole Meltdown/Spectre entry-cost it might not even be
measurable, at least on small ( < 2 sockets) machines. But I do not think it
worth profiling. Basically, AFAIK, all the data structures that are used for
inter-processor communication by the kernel are aligned, and this is an
exception.


2018-01-31 21:16:37

by Andy Lutomirski

Subject: Re: [PATCH] x86: Align TLB invalidation info

On Wed, Jan 31, 2018 at 1:09 PM, Nadav Amit <[email protected]> wrote:
> Dave Hansen <[email protected]> wrote:
>
>> On 01/31/2018 12:11 PM, Nadav Amit wrote:
>>> The TLB invalidation info is allocated on the stack, which might cause
>>> it to be unaligned. Since this information may be transferred to
>>> different cores for TLB shootdown, this might result in an additional
>>> cache-line bouncing between the cores.
>>>
>>> GCC provides a way to deal with it by using
>>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>>
>> It doesn't really *bounce*, though, does it? I don't see any writes on
>> the remote side. The remote use seems entirely read-only.
>>
>> You also don't have to exhaustively test this, but I'd love to see at
>> least a sanity check with a microbenchmark (or something) that, yes,
>> this does help *something*. Maybe it makes the remote
>> flush_tlb_func_common() run faster because it's pulling in fewer lines,
>> or maybe you can even detect fewer misses in there.
>
> I agree that with the whole Meltdown/Spectre entry-cost it might not even be
> measurable, at least on small ( < 2 sockets) machines. But I do not think it
> worth profiling. Basically, AFAIK, all the data structures that are used for
> inter-processor communication by the kernel are aligned, and this is an
> exception.
>

This is only going to be measurable at all on NUMA, I suspect.

2018-01-31 21:18:57

by Nadav Amit

Subject: Re: [PATCH] x86: Align TLB invalidation info

Andy Lutomirski <[email protected]> wrote:

> On Wed, Jan 31, 2018 at 1:09 PM, Nadav Amit <[email protected]> wrote:
>> Dave Hansen <[email protected]> wrote:
>>
>>> On 01/31/2018 12:11 PM, Nadav Amit wrote:
>>>> The TLB invalidation info is allocated on the stack, which might cause
>>>> it to be unaligned. Since this information may be transferred to
>>>> different cores for TLB shootdown, this might result in an additional
>>>> cache-line bouncing between the cores.
>>>>
>>>> GCC provides a way to deal with it by using
>>>> __builtin_alloca_with_align(). Use it to avoid the bouncing cache lines.
>>>
>>> It doesn't really *bounce*, though, does it? I don't see any writes on
>>> the remote side. The remote use seems entirely read-only.
>>>
>>> You also don't have to exhaustively test this, but I'd love to see at
>>> least a sanity check with a microbenchmark (or something) that, yes,
>>> this does help *something*. Maybe it makes the remote
>>> flush_tlb_func_common() run faster because it's pulling in fewer lines,
>>> or maybe you can even detect fewer misses in there.
>>
>> I agree that with the whole Meltdown/Spectre entry-cost it might not even be
>> measurable, at least on small ( < 2 sockets) machines. But I do not think it
>> worth profiling. Basically, AFAIK, all the data structures that are used for
>> inter-processor communication by the kernel are aligned, and this is an
>> exception.
>
> This is only going to be measurable at all on NUMA, I suspect.

Yes, I meant <= 2 ...



2018-02-01 05:39:30

by Nadav Amit

Subject: Re: [PATCH] x86: Align TLB invalidation info

Dave Hansen <[email protected]> wrote:

> On 01/31/2018 01:09 PM, Nadav Amit wrote:
>>> You also don't have to exhaustively test this, but I'd love to see at
>>> least a sanity check with a microbenchmark (or something) that, yes,
>>> this does help *something*. Maybe it makes the remote
>>> flush_tlb_func_common() run faster because it's pulling in fewer lines,
>>> or maybe you can even detect fewer misses in there.
>> I agree that with the whole Meltdown/Spectre entry-cost it might not even be
>> measurable, at least on small ( < 2 sockets) machines. But I do not think it
>> worth profiling. Basically, AFAIK, all the data structures that are used for
>> inter-processor communication by the kernel are aligned, and this is an
>> exception.
>
> I'm certainly not nak'ing this. I think your patch is likely a good
> idea. But, could you please take ten or twenty minutes to go see if
> practice matches your assumptions? I'd really appreciate it. If you
> can't measure it, then no biggie.

[CC’ing the mailing list]

Per your request, I measured it (which perhaps I should have done before). I
caused a misalignment intentionally by adding some padding to flush_tlb_info
and compared it with an aligned version.

I used ftrace to measure the execution time of flush_tlb_func_remote() on a
2-socket Haswell machine, using a microbenchmark I wrote for some research
project.

It turns out that your skepticism may be correct - in both cases the
function execution time is roughly 400ns (a 2% improvement in the aligned
case, which is probably noise).

So it is up to you whether you want to discard the patch.

Regards,
Nadav

2018-02-01 09:39:58

by Peter Zijlstra

Subject: Re: [PATCH] x86: Align TLB invalidation info

On Wed, Jan 31, 2018 at 09:38:46PM -0800, Nadav Amit wrote:

> I used ftrace to measure the execution time of flush_tlb_func_remote() on a
> 2-socket Haswell machine, using a microbenchmark I wrote for some research
> project.

However cool ftrace is, it is _really_ bad for such uses. The cost of
using ftrace is many, many times higher than any change you could effect
by this.

A microbench and/or perf is what you should use for this.

2018-02-01 18:46:44

by Nadav Amit

Subject: Re: [PATCH] x86: Align TLB invalidation info

Peter Zijlstra <[email protected]> wrote:

> On Wed, Jan 31, 2018 at 09:38:46PM -0800, Nadav Amit wrote:
>
>> I used ftrace to measure the execution time of flush_tlb_func_remote() on a
>> 2-socket Haswell machine, using a microbenchmark I wrote for some research
>> project.
>
> However cool ftrace is, it is _really_ bad for such uses. The cost of
> using ftrace is many, many times higher than any change you could effect
> by this.
>
> A microbench and/or perf is what you should use for this.

Don’t expect to see a remote NUMA access impact, whose cost is a few tens of
nanoseconds in microbenchmarks. (And indeed I did not.) Each iteration of
#PF - MADV_DONTNEED takes several microseconds, and the impact is lost in
the noise.

You are right that ftrace introduces overheads, but the variance is
relatively low. If I stretch the struct to 3 cache lines, I see a 20ns
overhead. Anyhow, I think this line of code got more than its fair share of
attention.
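
For reference, a self-contained sketch of the kind of #PF/MADV_DONTNEED
iteration described above. The actual microbenchmark is not public, so this is
a reconstruction; to exercise the remote-flush path the process would also need
threads running on other CPUs so that each madvise() triggers a shootdown:

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define ITERS 100000

int main(void)
{
	struct timespec t0, t1;
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < ITERS; i++) {
		p[0] = 1;			/* touch: take a page fault */
		madvise(p, len, MADV_DONTNEED);	/* drop the page again */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.0f ns per #PF + MADV_DONTNEED iteration\n", ns / ITERS);
	return 0;
}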


