This series adds a few optimizations to reduce the trap cost in the TLB
flush path. SBI calls for remote TLB flushes should only be made if
absolutely required.
This series is based on Christoph's series:
http://lists.infradead.org/pipermail/linux-riscv/2019-August/006148.html
Changes from v2->v3:
1. Split the patches into smaller ones, one per optimization.
Atish Patra (3):
RISC-V: Issue a local tlbflush if possible.
RISC-V: Issue a tlb page flush if possible
RISC-V: Do not invoke SBI call if cpumask is empty
arch/riscv/mm/tlbflush.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
--
2.21.0
In RISC-V, a TLB flush happens via SBI, which is expensive. If the local
cpu is the only cpu in the cpumask, there is no need to invoke an SBI call.
Just do a local flush and return.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/mm/tlbflush.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index df93b26f1b9d..36430ee3bed9 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -2,6 +2,7 @@
#include <linux/mm.h>
#include <linux/smp.h>
+#include <linux/sched.h>
#include <asm/sbi.h>
void flush_tlb_all(void)
@@ -13,9 +14,23 @@ static void __sbi_tlb_flush_range(struct cpumask *cmask, unsigned long start,
unsigned long size)
{
struct cpumask hmask;
+ unsigned int cpuid = get_cpu();
+ if (!cmask) {
+ riscv_cpuid_to_hartid_mask(cpu_online_mask, &hmask);
+ goto issue_sfence;
+ }
+
+ if (cpumask_test_cpu(cpuid, cmask) && cpumask_weight(cmask) == 1) {
+ local_flush_tlb_all();
+ goto done;
+ }
riscv_cpuid_to_hartid_mask(cmask, &hmask);
+
+issue_sfence:
sbi_remote_sfence_vma(hmask.bits, start, size);
+done:
+ put_cpu();
}
void flush_tlb_mm(struct mm_struct *mm)
--
2.21.0
If the tlbflush request is for a single page only, there is no need to do a
complete local TLB shootdown.
Just do a local TLB flush for the given address.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/mm/tlbflush.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 36430ee3bed9..9f58b3790baa 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -22,7 +22,10 @@ static void __sbi_tlb_flush_range(struct cpumask *cmask, unsigned long start,
}
if (cpumask_test_cpu(cpuid, cmask) && cpumask_weight(cmask) == 1) {
- local_flush_tlb_all();
+ if (size <= PAGE_SIZE && size != -1)
+ local_flush_tlb_page(start);
+ else
+ local_flush_tlb_all();
goto done;
}
riscv_cpuid_to_hartid_mask(cmask, &hmask);
--
2.21.0
SBI calls are expensive. If the cpumask is empty, there is no need to
trap via SBI as no remote TLB flushing is required.
Signed-off-by: Atish Patra <[email protected]>
---
arch/riscv/mm/tlbflush.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 9f58b3790baa..2bd3c418d769 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -21,6 +21,9 @@ static void __sbi_tlb_flush_range(struct cpumask *cmask, unsigned long start,
goto issue_sfence;
}
+ if (cpumask_empty(cmask))
+ goto done;
+
if (cpumask_test_cpu(cpuid, cmask) && cpumask_weight(cmask) == 1) {
if (size <= PAGE_SIZE && size != -1)
local_flush_tlb_page(start);
--
2.21.0
On Wed, Aug 21, 2019 at 05:46:43PM -0700, Atish Patra wrote:
> + if (size <= PAGE_SIZE && size != -1)
> + local_flush_tlb_page(start);
> + else
> + local_flush_tlb_all();
As Andreas pointed out, (unsigned long)-1 is actually larger than
PAGE_SIZE, so we don't need the extra check.
On Wed, Aug 21, 2019 at 05:46:44PM -0700, Atish Patra wrote:
> SBI calls are expensive. If cpumask is empty, there is no need to
> trap via SBI as no remote tlb flushing is required.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/mm/tlbflush.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> index 9f58b3790baa..2bd3c418d769 100644
> --- a/arch/riscv/mm/tlbflush.c
> +++ b/arch/riscv/mm/tlbflush.c
> @@ -21,6 +21,9 @@ static void __sbi_tlb_flush_range(struct cpumask *cmask, unsigned long start,
> goto issue_sfence;
> }
>
> + if (cpumask_empty(cmask))
> + goto done;
I think this can even be done before the get_cpu() to optimize it a little
further.
On Thu, 2019-08-22 at 03:50 +0200, Christoph Hellwig wrote:
> On Wed, Aug 21, 2019 at 05:46:43PM -0700, Atish Patra wrote:
> > + if (size <= PAGE_SIZE && size != -1)
> > + local_flush_tlb_page(start);
> > + else
> > + local_flush_tlb_all();
>
> As Andreas pointed out (unsigned long)-1 is actually larger than
> PAGE_SIZE, so we don't need the extra check.
Ahh, yes. Sorry, I missed his comment in the earlier email. Fixed it.
--
Regards,
Atish
On Thu, 2019-08-22 at 03:46 +0200, Christoph Hellwig wrote:
> On Wed, Aug 21, 2019 at 05:46:42PM -0700, Atish Patra wrote:
> > In RISC-V, a TLB flush happens via SBI, which is expensive. If the local
> > cpu is the only cpu in the cpumask, there is no need to invoke an SBI call.
> >
> > Just do a local flush and return.
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > arch/riscv/mm/tlbflush.c | 15 +++++++++++++++
> > 1 file changed, 15 insertions(+)
> >
> > diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> > index df93b26f1b9d..36430ee3bed9 100644
> > --- a/arch/riscv/mm/tlbflush.c
> > +++ b/arch/riscv/mm/tlbflush.c
> > @@ -2,6 +2,7 @@
> >
> > #include <linux/mm.h>
> > #include <linux/smp.h>
> > +#include <linux/sched.h>
> > #include <asm/sbi.h>
> >
> > void flush_tlb_all(void)
> > @@ -13,9 +14,23 @@ static void __sbi_tlb_flush_range(struct cpumask
> > *cmask, unsigned long start,
> > unsigned long size)
> > {
> > struct cpumask hmask;
> > + unsigned int cpuid = get_cpu();
> >
> > + if (!cmask) {
> > + riscv_cpuid_to_hartid_mask(cpu_online_mask, &hmask);
> > + goto issue_sfence;
> > + }
> > +
> > + if (cpumask_test_cpu(cpuid, cmask) && cpumask_weight(cmask) ==
> > 1) {
> > + local_flush_tlb_all();
> > + goto done;
> > + }
>
> I think a single core on an SMP kernel is a valid enough use case given
> how few distros still have UP kernels. So maybe this should rather be:
>
> if (!cmask)
> cmask = cpu_online_mask;
>
> if (cpumask_test_cpu(cpuid, cmask) && cpumask_weight(cmask) ==
> 1) {
> local_flush_tlb_all();
> } else {
> riscv_cpuid_to_hartid_mask(cmask, &hmask);
> sbi_remote_sfence_vma(hmask.bits, start, size);
> }
The downside of this is that, for every !cmask case in true SMP (probably
the more common case), it will execute two extra cpumask instructions. As
the tlbflush path is performance critical, I think we should favor the more
common case (SMP with more than one core).
Thoughts?
--
Regards,
Atish
On Wed, Aug 21, 2019 at 05:46:42PM -0700, Atish Patra wrote:
> In RISC-V, a TLB flush happens via SBI, which is expensive. If the local
> cpu is the only cpu in the cpumask, there is no need to invoke an SBI call.
>
> Just do a local flush and return.
>
> Signed-off-by: Atish Patra <[email protected]>
> ---
> arch/riscv/mm/tlbflush.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> index df93b26f1b9d..36430ee3bed9 100644
> --- a/arch/riscv/mm/tlbflush.c
> +++ b/arch/riscv/mm/tlbflush.c
> @@ -2,6 +2,7 @@
>
> #include <linux/mm.h>
> #include <linux/smp.h>
> +#include <linux/sched.h>
> #include <asm/sbi.h>
>
> void flush_tlb_all(void)
> @@ -13,9 +14,23 @@ static void __sbi_tlb_flush_range(struct cpumask *cmask, unsigned long start,
> unsigned long size)
> {
> struct cpumask hmask;
> + unsigned int cpuid = get_cpu();
>
> + if (!cmask) {
> + riscv_cpuid_to_hartid_mask(cpu_online_mask, &hmask);
> + goto issue_sfence;
> + }
> +
> + if (cpumask_test_cpu(cpuid, cmask) && cpumask_weight(cmask) == 1) {
> + local_flush_tlb_all();
> + goto done;
> + }
I think a single core on an SMP kernel is a valid enough use case given
how few distros still have UP kernels. So maybe this should rather be:
if (!cmask)
cmask = cpu_online_mask;
if (cpumask_test_cpu(cpuid, cmask) && cpumask_weight(cmask) == 1) {
local_flush_tlb_all();
} else {
riscv_cpuid_to_hartid_mask(cmask, &hmask);
sbi_remote_sfence_vma(hmask.bits, start, size);
}
On Thu, 2019-08-22 at 03:51 +0200, Christoph Hellwig wrote:
> On Wed, Aug 21, 2019 at 05:46:44PM -0700, Atish Patra wrote:
> > SBI calls are expensive. If cpumask is empty, there is no need to
> > trap via SBI as no remote tlb flushing is required.
> >
> > Signed-off-by: Atish Patra <[email protected]>
> > ---
> > arch/riscv/mm/tlbflush.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> > index 9f58b3790baa..2bd3c418d769 100644
> > --- a/arch/riscv/mm/tlbflush.c
> > +++ b/arch/riscv/mm/tlbflush.c
> > @@ -21,6 +21,9 @@ static void __sbi_tlb_flush_range(struct cpumask
> > *cmask, unsigned long start,
> > goto issue_sfence;
> > }
> >
> > + if (cpumask_empty(cmask))
> > + goto done;
>
> I think this can even be done before the get_cpu() to optimize it a
> little further.
Yeah. I can just return directly in this case and call get_cpu after
this. Thanks for the suggestion.
--
Regards,
Atish
On Thu, Aug 22, 2019 at 04:01:24AM +0000, Atish Patra wrote:
> The downside of this is that, for every !cmask case in true SMP (probably
> the more common case), it will execute two extra cpumask instructions. As
> the tlbflush path is performance critical, I think we should favor the
> more common case (SMP with more than one core).
Actually, looking at both the current mainline code and the code from my
cleanups tree, I don't think remote_sfence_vma / __sbi_tlb_flush_range
can ever be called with a NULL cpumask, as we always have a valid mm.
So this is a bit of a moot point, and we can drop handling that case
entirely. With that we can also use a simple if / else for the local
cpu only vs remote case. Btw, what was the reason you didn't like
using cpumask_any_but like x86, which should be more efficient than
cpumask_test_cpu + hweight?
On Thu, 2019-08-22 at 06:27 +0200, [email protected] wrote:
> On Thu, Aug 22, 2019 at 04:01:24AM +0000, Atish Patra wrote:
> > The downside of this is that, for every !cmask case in true SMP
> > (probably the more common case), it will execute two extra cpumask
> > instructions. As the tlbflush path is performance critical, I think
> > we should favor the more common case (SMP with more than one core).
>
> Actually, looking at both the current mainline code and the code from my
> cleanups tree, I don't think remote_sfence_vma / __sbi_tlb_flush_range
> can ever be called with a NULL cpumask, as we always have a valid mm.
>
Yes. You are correct.
As both cpumask functions here will crash if the cpumask is NULL, we
should probably leave a comment warning about the consequence of the
cpumask being NULL.
> So this is a bit of a moot point, and we can drop handling that case
> entirely. With that we can also use a simple if / else for the local
> cpu only vs remote case.
Done.
> Btw, what was the reason you didn't like
> using cpumask_any_but like x86, which should be more efficient than
> cpumask_test_cpu + hweight?
I had it in the v2 patch but removed it, as it can potentially return a
garbage value if the cpumask is empty.
However, we are already checking for an empty cpumask before the local
cpu check. I will replace cpumask_test_cpu + hweight with
cpumask_any_but().
--
Regards,
Atish