2015-07-11 20:25:30

by David Daney

Subject: [PATCH 0/3] arm64, mm: Use IPIs for TLB invalidation.

From: David Daney <[email protected]>

This patch set (or something like it) is needed for the Cavium
ThunderX, but its performance improvements may make it compelling on
its own merits.

Summary: On ThunderX we cannot use broadcast TLB invalidation, so we
use IPIs where necessary. The funny thing is that it also happens to
make workloads similar to kernel builds much faster.

David Daney (3):
arm64, mm: Add flush_tlb_all_local() function.
arm64, mm: Use flush_tlb_all_local() in flush_context().
arm64, mm: Use IPIs for TLB invalidation.

arch/arm64/include/asm/tlbflush.h | 64 ++++++++-------------------------------
arch/arm64/mm/context.c | 2 +-
arch/arm64/mm/flush.c | 46 ++++++++++++++++++++++++++++
3 files changed, 59 insertions(+), 53 deletions(-)

--
1.9.1


2015-07-11 20:25:32

by David Daney

Subject: [PATCH 1/3] arm64, mm: Add flush_tlb_all_local() function.

From: David Daney <[email protected]>

To be used in follow-on patch.

Signed-off-by: David Daney <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 934815d..42c09ec 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -105,6 +105,13 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
dsb(ish);
}

+static inline void flush_tlb_all_local(void)
+{
+ dsb(ishst);
+ asm("tlbi vmalle1");
+ isb();
+}
+
static inline void __flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
unsigned long addr;
--
1.9.1

2015-07-11 20:26:08

by David Daney

Subject: [PATCH 2/3] arm64, mm: Use flush_tlb_all_local() in flush_context().

From: David Daney <[email protected]>

When CONFIG_SMP is enabled, we end up calling flush_context() on each
CPU (indirectly) from __new_context(). Because of this, doing a
broadcast TLB invalidation is overkill, as every CPU will be doing a
local invalidation anyway.

Change the scope of the TLB invalidation operation to be local,
resulting in nr_cpus invalidations rather than nr_cpus^2 (each of the
nr_cpus calls would otherwise broadcast to all nr_cpus CPUs).

On CPUs with a large ASID space this path is not taken often, but when
it is, this reduces the overhead.

Benchmarked "time make -j48" kernel build with and without the patch on
Cavium ThunderX system, one run to warm up the caches, and then five
runs measured:

           original    with-patch
  mean     139.299s    139.0766s
  S.D.       0.321       0.159

Probably a little faster, but could be measurement noise.

Signed-off-by: David Daney <[email protected]>
---
arch/arm64/mm/context.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index 76c1e6c..ab5b8d3 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -48,7 +48,7 @@ static void flush_context(void)
{
/* set the reserved TTBR0 before flushing the TLB */
cpu_set_reserved_ttbr0();
- flush_tlb_all();
+ flush_tlb_all_local();
if (icache_is_aivivt())
__flush_icache_all();
}
--
1.9.1

2015-07-11 20:25:33

by David Daney

Subject: [PATCH 3/3] arm64, mm: Use IPIs for TLB invalidation.

From: David Daney <[email protected]>

Most broadcast TLB invalidations are unnecessary, so when invalidating
for a given mm/vma, target only the needed CPUs via an IPI.

For global TLB invalidations, also use IPI.

Tested on Cavium ThunderX.

This change reduces 'time make -j48' for a kernel build from 139s to
116s (83% as long).

The patch is needed because of a ThunderX Pass1 erratum: exclusive
store operations are unreliable in the presence of broadcast TLB
invalidations. The performance improvements shown make it compelling
even without the need for the erratum workaround.

Signed-off-by: David Daney <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 67 ++++++---------------------------------
arch/arm64/mm/flush.c | 46 +++++++++++++++++++++++++++
2 files changed, 56 insertions(+), 57 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 42c09ec..2c132b0 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -63,46 +63,22 @@
* only require the D-TLB to be invalidated.
* - kaddr - Kernel virtual memory address
*/
-static inline void flush_tlb_all(void)
-{
- dsb(ishst);
- asm("tlbi vmalle1is");
- dsb(ish);
- isb();
-}
-
-static inline void flush_tlb_mm(struct mm_struct *mm)
-{
- unsigned long asid = (unsigned long)ASID(mm) << 48;
+void flush_tlb_all(void);

- dsb(ishst);
- asm("tlbi aside1is, %0" : : "r" (asid));
- dsb(ish);
-}
+void flush_tlb_mm(struct mm_struct *mm);

static inline void flush_tlb_page(struct vm_area_struct *vma,
unsigned long uaddr)
{
- unsigned long addr = uaddr >> 12 |
- ((unsigned long)ASID(vma->vm_mm) << 48);
-
- dsb(ishst);
- asm("tlbi vae1is, %0" : : "r" (addr));
- dsb(ish);
+ /* Simplify to entire mm. */
+ flush_tlb_mm(vma->vm_mm);
}

static inline void __flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- unsigned long asid = (unsigned long)ASID(vma->vm_mm) << 48;
- unsigned long addr;
- start = asid | (start >> 12);
- end = asid | (end >> 12);
-
- dsb(ishst);
- for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12))
- asm("tlbi vae1is, %0" : : "r"(addr));
- dsb(ish);
+ /* Simplify to entire mm. */
+ flush_tlb_mm(vma->vm_mm);
}

static inline void flush_tlb_all_local(void)
@@ -112,40 +88,17 @@ static inline void flush_tlb_all_local(void)
isb();
}

-static inline void __flush_tlb_kernel_range(unsigned long start, unsigned long end)
-{
- unsigned long addr;
- start >>= 12;
- end >>= 12;
-
- dsb(ishst);
- for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12))
- asm("tlbi vaae1is, %0" : : "r"(addr));
- dsb(ish);
- isb();
-}
-
-/*
- * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
- * necessarily a performance improvement.
- */
-#define MAX_TLB_RANGE (1024UL << PAGE_SHIFT)
-
static inline void flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- if ((end - start) <= MAX_TLB_RANGE)
- __flush_tlb_range(vma, start, end);
- else
- flush_tlb_mm(vma->vm_mm);
+ /* Simplify to entire mm. */
+ flush_tlb_mm(vma->vm_mm);
}

static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
- if ((end - start) <= MAX_TLB_RANGE)
- __flush_tlb_kernel_range(start, end);
- else
- flush_tlb_all();
+ /* Simplify to all. */
+ flush_tlb_all();
}

/*
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 4dfa397..45f24d3 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -20,6 +20,7 @@
#include <linux/export.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
+#include <linux/smp.h>

#include <asm/cacheflush.h>
#include <asm/cachetype.h>
@@ -27,6 +28,51 @@

#include "mm.h"

+static void flush_tlb_local(void *info)
+{
+ asm volatile("\n"
+ " tlbi vmalle1\n"
+ " isb sy"
+ );
+}
+
+static void flush_tlb_mm_local(void *info)
+{
+ unsigned long asid = (unsigned long)info;
+
+ asm volatile("\n"
+ " tlbi aside1, %0\n"
+ " isb sy"
+ : : "r" (asid)
+ );
+}
+
+void flush_tlb_all(void)
+{
+ /* Make sure page table modifications are visible. */
+ dsb(ishst);
+ /* IPI to all CPUs to do local flush. */
+ on_each_cpu(flush_tlb_local, NULL, 1);
+
+}
+EXPORT_SYMBOL(flush_tlb_all);
+
+void flush_tlb_mm(struct mm_struct *mm)
+{
+ if (!mm) {
+ flush_tlb_all();
+ } else {
+ unsigned long asid = (unsigned long)ASID(mm) << 48;
+ /* Make sure page table modifications are visible. */
+ dsb(ishst);
+ /* IPI to all CPUs to do local flush. */
+ on_each_cpu_mask(mm_cpumask(mm),
+ flush_tlb_mm_local, (void *)asid, 1);
+ }
+
+}
+EXPORT_SYMBOL(flush_tlb_mm);
+
void flush_cache_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end)
{
--
1.9.1

2015-07-11 22:06:33

by Sergei Shtylyov

Subject: Re: [PATCH 3/3] arm64, mm: Use IPIs for TLB invalidation.

Hello.

On 07/11/2015 11:25 PM, David Daney wrote:

> From: David Daney <[email protected]>

> Most broadcast TLB invalidations are unnecessary. So when
> invalidating for a given mm/vma target the only the needed CPUs via

The only the needed?

> and IPI.

> For global TLB invalidations, also use IPI.

> Tested on Cavium ThunderX.

> This change reduces 'time make -j48' on kernel from 139s to 116s (83%
> as long).

> The patch is needed because of a ThunderX Pass1 erratum: Exclusive
> store operations unreliable in the presence of broadcast TLB
> invalidations. The performance improvements shown make it compelling
> even without the erratum workaround need.

> Signed-off-by: David Daney <[email protected]>

WBR, Sergei

2015-07-12 21:59:07

by Catalin Marinas

Subject: Re: [PATCH 3/3] arm64, mm: Use IPIs for TLB invalidation.

On Sat, Jul 11, 2015 at 01:25:23PM -0700, David Daney wrote:
> From: David Daney <[email protected]>
>
> Most broadcast TLB invalidations are unnecessary. So when
> invalidating for a given mm/vma target the only the needed CPUs via
> and IPI.
>
> For global TLB invalidations, also use IPI.
>
> Tested on Cavium ThunderX.
>
> This change reduces 'time make -j48' on kernel from 139s to 116s (83%
> as long).

Have you tried something like kernbench? It tends to be more consistent
than a simple "time make".

However, the main question is how the performance looks on systems with
far fewer CPUs (say 4 to 8). The results are highly dependent on the
type of application, CPU and SoC implementation (I've done similar
benchmarks in the past), so I don't think there's a simple answer here.

> The patch is needed because of a ThunderX Pass1 erratum: Exclusive
> store operations unreliable in the presence of broadcast TLB
> invalidations. The performance improvements shown make it compelling
> even without the erratum workaround need.

This performance argument is debatable; I need more data, and not just
for the Cavium boards and kernel building. In the meantime, it's an
erratum workaround and it needs to follow the other workarounds we have
in the kernel, with a proper Kconfig option and alternatives that can
be patched in or out at run-time (I wonder whether jump labels would be
better suited here).

--
Catalin

2015-07-13 18:18:26

by Will Deacon

Subject: Re: [PATCH 3/3] arm64, mm: Use IPIs for TLB invalidation.

On Sat, Jul 11, 2015 at 09:25:23PM +0100, David Daney wrote:
> From: David Daney <[email protected]>
>
> Most broadcast TLB invalidations are unnecessary. So when
> invalidating for a given mm/vma target the only the needed CPUs via
> and IPI.
>
> For global TLB invalidations, also use IPI.
>
> Tested on Cavium ThunderX.
>
> This change reduces 'time make -j48' on kernel from 139s to 116s (83%
> as long).

Any idea *why* you're seeing such an improvement? Some older kernels had
a bug where we'd try to flush a negative (i.e. huge) range by page, so it
would be nice to rule that out. I assume these measurements are using
mainline?

Having TLBI responsible for that much of a kernel build doesn't feel
right to me and doesn't line up with the profiles I'm used to seeing.

You have 16-bit ASIDs, right?

Will

2015-07-13 19:31:38

by David Daney

Subject: Re: [PATCH 3/3] arm64, mm: Use IPIs for TLB invalidation.

On 07/13/2015 11:17 AM, Will Deacon wrote:
> On Sat, Jul 11, 2015 at 09:25:23PM +0100, David Daney wrote:
>> From: David Daney <[email protected]>
>>
>> Most broadcast TLB invalidations are unnecessary. So when
>> invalidating for a given mm/vma target the only the needed CPUs via
>> and IPI.
>>
>> For global TLB invalidations, also use IPI.
>>
>> Tested on Cavium ThunderX.
>>
>> This change reduces 'time make -j48' on kernel from 139s to 116s (83%
>> as long).
>
> Any idea *why* you're seeing such an improvement? Some older kernels had
> a bug where we'd try to flush a negative (i.e. huge) range by page, so it
> would be nice to rule that out. I assume these measurements are using
> mainline?

I have an untested multi-part theory:

1) Most of the invalidations in the kernel build will be for a mm that
was only used on a single CPU (the current CPU), so IPIs are for the
most part not needed. We win by not having to synchronize across all
CPUs waiting for the DSB to complete. I think most of it occurs at
process exit. Q: why do anything at process exit? The use of ASIDs
should make TLB invalidations at process death unnecessary.

2) By simplifying the VA range invalidations to just a single ASID-based
invalidation, we are issuing many fewer TLBI broadcasts. The overhead
of refilling the local TLB with still-needed mappings may be lower than
the overhead of all those TLBI operations.

>
> Having TLBI responsible for that amount of a kernel build doesn't feel
> right to me and doesn't line-up with the profiles I'm used to seeing.

I don't have enough information to comment on this at the moment.

>
> You have 16-bit ASIDs, right?

Correct. This means we aren't doing the rollover work very often, and
that it is therefore not a significant source of system overhead.


>
> Will
>

2015-07-14 11:13:48

by Catalin Marinas

Subject: Re: [PATCH 3/3] arm64, mm: Use IPIs for TLB invalidation.

On Mon, Jul 13, 2015 at 11:58:24AM -0700, David Daney wrote:
> On 07/13/2015 11:17 AM, Will Deacon wrote:
> >On Sat, Jul 11, 2015 at 09:25:23PM +0100, David Daney wrote:
> >>From: David Daney <[email protected]>
> >>
> >>Most broadcast TLB invalidations are unnecessary. So when
> >>invalidating for a given mm/vma target the only the needed CPUs via
> >>and IPI.
> >>
> >>For global TLB invalidations, also use IPI.
> >>
> >>Tested on Cavium ThunderX.
> >>
> >>This change reduces 'time make -j48' on kernel from 139s to 116s (83%
> >>as long).
> >
> >Any idea *why* you're seeing such an improvement? Some older kernels had
> >a bug where we'd try to flush a negative (i.e. huge) range by page, so it
> >would be nice to rule that out. I assume these measurements are using
> >mainline?
>
> I have an untested multi-part theory:
>
> 1) Most of the invalidations in the kernel build will be for a mm that was
> only used on a single CPU (the current CPU), so IPIs are for the most part
> not needed. We win by not having to synchronize across all CPUs waiting for
> the DSB to complete. I think most of it occurs at process exit. Q: why do
> anything at process exit? The use of ASIDs should make TLB invalidations at
> process death unnecessary.

I think for process exit, something like the patch below may work (but
it needs proper review and a lot of testing to make sure I haven't
missed anything; note that it's only valid for the current ASID
allocation algorithm on arm64, which does not allow ASID reuse until
roll-over):

------------8<---------------------------
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index 3a0242c7eb8d..0176cda688cb 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -38,7 +38,8 @@ static inline void __tlb_remove_table(void *_table)
static inline void tlb_flush(struct mmu_gather *tlb)
{
if (tlb->fullmm) {
- flush_tlb_mm(tlb->mm);
+ /* Deferred until ASID roll-over */
+ WARN_ON(atomic_read(&tlb->mm->mm_users));
} else {
struct vm_area_struct vma = { .vm_mm = tlb->mm, };
flush_tlb_range(&vma, tlb->start, tlb->end);
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 934815d45eda..2e595933864a 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -150,6 +150,13 @@ static inline void __flush_tlb_pgtable(struct mm_struct *mm,
{
unsigned long addr = uaddr >> 12 | ((unsigned long)ASID(mm) << 48);

+ /*
+ * Check for concurrent users of this mm. If there are no users with
+ * user space, we do not have any (speculative) page table walkers.
+ */
+ if (!atomic_read(&mm->mm_users))
+ return;
+
dsb(ishst);
asm("tlbi vae1is, %0" : : "r" (addr));
dsb(ish);
------------8<---------------------------

AFAICT, we have three main cases for full mm TLBI (and another when the
VA range is too large):

1. fork - dup_mmap() needs to flush the parent after changing the pages
to read-only for CoW. Here we can't really do anything
2. sys_exit - exit_mmap() clearing the page tables, the above TLBI
deferring would help
3. sys_execve - by the time we call exit_mmap(old_mm), we already
activated the new mm via exec_mmap(), so deferring TLBI should work

BTW, if we do the TLBI deferring to the ASID roll-over event, your
flush_context() patch to use local TLBI would no longer work. It is
called from __new_context() when allocating a new ASID, so it needs to
be broadcast to all the CPUs.

> 2) By simplifying the VA range invalidations to just a single ASID based
> invalidation, we are issuing many fewer TLBI broadcasts. The overhead of
> refilling the local TLB with still needed mappings may be lower than the
> overhead of all those TLBI operations.

That's the munmap case, usually. In our tests we haven't seen large
ranges, mostly 1-2 4KB pages (especially with kernbench, where the
median file size fits in 4KB). Maybe the new batching code for x86
could help ARM as well if we implement it: we would still issue TLBIs,
but it allows us to issue a single DSB at the end.

Once we manage to optimise the current implementation, maybe it would
still be faster on a large machine (48 cores) with IPIs but it is highly
dependent on the type of workload (single-threaded tasks would benefit).
Also note that under KVM the cost of the IPI is much higher.

--
Catalin

2015-07-14 11:40:35

by Will Deacon

Subject: Re: [PATCH 3/3] arm64, mm: Use IPIs for TLB invalidation.

On Tue, Jul 14, 2015 at 12:13:42PM +0100, Catalin Marinas wrote:
> BTW, if we do the TLBI deferring to the ASID roll-over event, your
> flush_context() patch to use local TLBI would no longer work. It is
> called from __new_context() when allocating a new ASID, so it needs to
> be broadcast to all the CPUs.

What we can do instead is:

- Keep track of the CPUs on which an mm has been active
- Do a local TLBI if only the current CPU is in the list
- Move to the same ASID allocation algorithm as arch/arm/
- Change the ASID re-use policy so that we only mark an ASID as free
if we succeeded in performing a local TLBI, postponing anything else
until rollover

That should handle the fork() + exec() case nicely, I reckon. I tried
something similar in the past for arch/arm/, but it didn't make a difference
on any of the platforms I have access to (where TLBI traffic was cheap).
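
The first two bullets could look roughly like the sketch below
(completely untested, and flush_tlb_mm_maybe_local() is just a made-up
name for illustration); it reuses the mm_cpumask() tracking we already
do in switch_mm():

static inline void flush_tlb_mm_maybe_local(struct mm_struct *mm)
{
	unsigned long asid = (unsigned long)ASID(mm) << 48;

	/*
	 * Sketch only: assumes preemption is disabled around the check
	 * and that mm_cpumask() never under-reports where mm has run.
	 */
	if (cpumask_equal(mm_cpumask(mm),
			  cpumask_of(smp_processor_id()))) {
		/* mm has only ever run here: local invalidation suffices */
		dsb(nshst);
		asm("tlbi aside1, %0" : : "r" (asid));
		dsb(nsh);
	} else {
		/* mm has run elsewhere: keep the broadcast behaviour */
		dsb(ishst);
		asm("tlbi aside1is, %0" : : "r" (asid));
		dsb(ish);
	}
}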

It would *really* help if I had some Thunder-X hardware...

> That the munmap case usually. In our tests, we haven't seen large
> ranges, mostly 1-2 4KB pages (especially with kernbench when median file
> size fits in 4KB). Maybe the new batching code for x86 could help ARM as
> well if we implement it. We would still issue TLBIs but it allows us to
> issue a single DSB at the end.

Again, I looked at this in the past but it turns out that the DSB ISHST
needed to publish PTEs tends to sync TLBIs on most cores (even though
it's not an architectural requirement), so postponing the full DSB to
the end didn't help on existing microarchitectures.

Finally, it might be worth dusting off the leaf-only TLBI stuff you
looked at in the past. It doesn't reduce the message traffic, but I can't
see it making things worse.
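
For reference, the leaf-only variant would just switch the TLBI op:
VALE1IS invalidates only the last-level (leaf) entry for the address,
leaving cached intermediate table entries alone. Something like the
illustrative, untested snippet below (flush_tlb_page_leaf() is a
made-up name):

static inline void flush_tlb_page_leaf(struct vm_area_struct *vma,
				       unsigned long uaddr)
{
	unsigned long addr = uaddr >> 12 |
		((unsigned long)ASID(vma->vm_mm) << 48);

	/* leaf-only: drop the last-level entry, keep table entries */
	dsb(ishst);
	asm("tlbi vale1is, %0" : : "r" (addr));
	dsb(ish);
}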

Will

2015-07-14 13:09:28

by Catalin Marinas

Subject: Re: [PATCH 3/3] arm64, mm: Use IPIs for TLB invalidation.

On Tue, Jul 14, 2015 at 12:40:30PM +0100, Will Deacon wrote:
> On Tue, Jul 14, 2015 at 12:13:42PM +0100, Catalin Marinas wrote:
> > BTW, if we do the TLBI deferring to the ASID roll-over event, your
> > flush_context() patch to use local TLBI would no longer work. It is
> > called from __new_context() when allocating a new ASID, so it needs to
> > be broadcast to all the CPUs.
>
> What we can do instead is:
>
> - Keep track of the CPUs on which an mm has been active

We already do this in switch_mm().

> - Do a local TLBI if only the current CPU is in the list

This would be beneficial independent of the following two items. I think
it's worth doing.

> - Move to the same ASID allocation algorithm as arch/arm/

This is useful to avoid the IPI on roll-over. With 16-bit ASIDs, I don't
think this is too urgent but, well, the benchmarks may say otherwise.

> - Change the ASID re-use policy so that we only mark an ASID as free
> if we succeeded in performing a local TLBI, postponing anything else
> until rollover
>
> That should handle the fork() + exec() case nicely, I reckon. I tried
> something similar in the past for arch/arm/, but it didn't make a difference
> on any of the platforms I have access to (where TLBI traffic was cheap).
>
> It would *really* help if I had some Thunder-X hardware...

I agree. With only 8 CPUs, we don't notice any difference with the above
optimisations.

> > That the munmap case usually. In our tests, we haven't seen large
> > ranges, mostly 1-2 4KB pages (especially with kernbench when median file
> > size fits in 4KB). Maybe the new batching code for x86 could help ARM as
> > well if we implement it. We would still issue TLBIs but it allows us to
> > issue a single DSB at the end.
>
> Again, I looked at this in the past but it turns out that the DSB ISHST
> needed to publish PTEs tends to sync TLBIs on most cores (even though
> it's not an architectural requirement), so postponing the full DSB to
> the end didn't help on existing microarchitectures.

We could postpone all the TLBI, including the first DSB ISHST. But I
need to look in detail at the recent TLBI batching patches for x86;
they do it to reduce IPIs, but we could similarly use them to reduce
the total sync time after broadcast (i.e. DSB for pte, lots of TLBIs,
DSB for TLBI sync).
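
Roughly, the shape would be something like the sketch below (names and
sizes invented for illustration, this is not the actual x86 interface):
collect the encoded ASID|VA values while unmapping, then pay for one
DSB ISHST and one DSB ISH for the whole batch.

struct tlbi_batch {
	unsigned long entries[32];	/* encoded as for TLBI VAE1IS */
	unsigned int nr;
};

static void tlbi_batch_flush(struct tlbi_batch *b)
{
	unsigned int i;

	dsb(ishst);		/* publish all the pte updates once */
	for (i = 0; i < b->nr; i++)
		asm("tlbi vae1is, %0" : : "r" (b->entries[i]));
	dsb(ish);		/* wait for all the TLBIs once */
	b->nr = 0;
}

static void tlbi_batch_add(struct tlbi_batch *b, struct mm_struct *mm,
			   unsigned long uaddr)
{
	unsigned long addr = uaddr >> 12 | ((unsigned long)ASID(mm) << 48);

	if (b->nr == ARRAY_SIZE(b->entries))
		tlbi_batch_flush(b);	/* don't drop invalidations */
	b->entries[b->nr++] = addr;
}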

> Finally, it might be worth dusting off the leaf-only TLBI stuff you
> looked at in the past. It doesn't reduce the message traffic, but I can't
> see it making things worse.

I didn't see a difference but I'll post them to the list.

--
Catalin