2020-07-08 12:43:18

by Zhenyu Ye

Subject: [RFC PATCH v5 0/2] arm64: tlb: add support for TLBI RANGE instructions

ARMv8.4-TLBI provides TLBI invalidation instructions that apply to a
range of input addresses. This series adds support for this feature.

I tested this feature on an FPGA machine whose CPUs support the TLBI range
instructions. As the number of pages increases, the performance improves
significantly. At page num = 256, the flush is more than 10 times faster
with the range instructions.

Below is the test data when the stride = PTE:

[page num]   [classic]   [tlbi range]
         1       16051         13524
         2       11366         11146
         3       11582         12171
         4       11694         11101
         5       12138         12267
         6       12290         11105
         7       12400         12002
         8       12837         11097
         9       14791         12140
        10       15461         11087
        16       18233         11094
        32       26983         11079
        64       43840         11092
       128       77754         11098
       256      145514         11089
       512      280932         11111
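
For a sense of why the range instructions help: with the encoding added in
patch 2/2, a 256-page flush that used to take one classic TLBI per page can
be covered by a single TLBI RVAE1IS with SCALE = 1 and NUM = 3, since

	__TLBI_RANGE_PAGES(3, 1) = (3 + 1) << (5 * 1 + 1) = 4 << 6 = 256 pages

so the cost of the range-based flush stays roughly flat as the page count
grows, while the classic loop grows linearly.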

See more details in:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

--
ChangeList:
v5:
- rebase this series on Linux 5.8-rc4.
- remove the __TG macro.
- move the odd range_pages check into the loop.

v4:
combine __flush_tlb_range() and the direct use of the TLBI range
instructions into the same function, with a single loop for both.

v3:
rebase this series on Linux 5.7-rc1.

v2:
Link: https://lkml.org/lkml/2019/11/11/348

Zhenyu Ye (2):
arm64: tlb: Detect the ARMv8.4 TLBI RANGE feature
arm64: tlb: Use the TLBI RANGE feature in arm64

arch/arm64/include/asm/cpucaps.h | 3 +-
arch/arm64/include/asm/sysreg.h | 3 +
arch/arm64/include/asm/tlbflush.h | 101 +++++++++++++++++++++++++-----
arch/arm64/kernel/cpufeature.c | 10 +++
4 files changed, 102 insertions(+), 15 deletions(-)

--
2.19.1



2020-07-08 12:44:01

by Zhenyu Ye

Subject: [RFC PATCH v5 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64

Add __TLBI_VADDR_RANGE macro and rewrite __flush_tlb_range().

In this patch, we only use the TLBI RANGE feature if the stride == PAGE_SIZE,
because when stride > PAGE_SIZE, usually only a small number of pages need
to be flushed and classic tlbi instructions are more effective.

We can also use 'end - start < threshold number' to decide which way
to go; however, different hardware may have different thresholds, so
I'm not sure if this is feasible.
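
To illustrate how the loop added below picks NUM and SCALE, here is a
minimal userspace sketch (not part of the patch) using simplified copies of
the range macros; the page count of 300 is only an example, and the sketch
assumes an even count, since an odd remainder falls back to the classic
per-page TLBI in the real code:

#include <stdio.h>

#define __TLBI_RANGE_PAGES(num, scale)	(((num) + 1UL) << (5 * (scale) + 1))
#define TLBI_RANGE_MASK			0x1fUL
#define __TLBI_RANGE_NUM(pages, scale)	\
	(((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK)

int main(void)
{
	unsigned long pages = 300;	/* even number of pages to flush */
	int scale = 0;

	while (pages > 0) {
		/* num is -1 when the 5-bit field at this scale is zero */
		int num = (int)__TLBI_RANGE_NUM(pages, scale) - 1;

		if (num >= 0) {
			/* one TLBI RVA(L)E1IS would cover these pages */
			printf("scale=%d num=%d -> %lu pages\n",
			       scale, num, __TLBI_RANGE_PAGES(num, scale));
			pages -= __TLBI_RANGE_PAGES(num, scale);
		}
		scale++;
	}
	return 0;
}

With pages = 300 this prints two lines (44 pages at scale 0, then 256 pages
at scale 1), i.e. the whole range takes two range instructions instead of
300 classic ones.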

Signed-off-by: Zhenyu Ye <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 104 ++++++++++++++++++++++++++----
1 file changed, 90 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bc3949064725..30975ddb8f06 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -50,6 +50,16 @@
__tlbi(op, (arg) | USER_ASID_FLAG); \
} while (0)

+#define __tlbi_last_level(op1, op2, arg, last_level) do { \
+ if (last_level) { \
+ __tlbi(op1, arg); \
+ __tlbi_user(op1, arg); \
+ } else { \
+ __tlbi(op2, arg); \
+ __tlbi_user(op2, arg); \
+ } \
+} while (0)
+
/* This macro creates a properly formatted VA operand for the TLBI */
#define __TLBI_VADDR(addr, asid) \
({ \
@@ -59,6 +69,60 @@
__ta; \
})

+/*
+ * Get translation granule of the system, which is decided by
+ * PAGE_SIZE. Used by TTL.
+ * - 4KB : 1
+ * - 16KB : 2
+ * - 64KB : 3
+ */
+static inline unsigned long get_trans_granule(void)
+{
+ switch (PAGE_SIZE) {
+ case SZ_4K:
+ return 1;
+ case SZ_16K:
+ return 2;
+ case SZ_64K:
+ return 3;
+ default:
+ return 0;
+ }
+}
+
+/*
+ * This macro creates a properly formatted VA operand for the TLBI RANGE.
+ * The value bit assignments are:
+ *
+ * +----------+------+-------+-------+-------+----------------------+
+ * | ASID | TG | SCALE | NUM | TTL | BADDR |
+ * +-----------------+-------+-------+-------+----------------------+
+ * |63 48|47 46|45 44|43 39|38 37|36 0|
+ *
+ * The address range is determined by below formula:
+ * [BADDR, BADDR + (NUM + 1) * 2^(5*SCALE + 1) * PAGESIZE)
+ *
+ */
+#define __TLBI_VADDR_RANGE(addr, asid, scale, num, ttl) \
+ ({ \
+ unsigned long __ta = (addr) >> PAGE_SHIFT; \
+ __ta &= GENMASK_ULL(36, 0); \
+ __ta |= (unsigned long)(ttl) << 37; \
+ __ta |= (unsigned long)(num) << 39; \
+ __ta |= (unsigned long)(scale) << 44; \
+ __ta |= get_trans_granule() << 46; \
+ __ta |= (unsigned long)(asid) << 48; \
+ __ta; \
+ })
+
+/* These macros are used by the TLBI RANGE feature. */
+#define __TLBI_RANGE_PAGES(num, scale) (((num) + 1) << (5 * (scale) + 1))
+#define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3)
+
+#define TLBI_RANGE_MASK GENMASK_ULL(4, 0)
+#define __TLBI_RANGE_NUM(range, scale) \
+ (((range) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK)
+
/*
* TLB Invalidation
* ================
@@ -181,32 +245,44 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long stride, bool last_level)
{
+ int num = 0;
+ int scale = 0;
unsigned long asid = ASID(vma->vm_mm);
unsigned long addr;
+ unsigned long range_pages;

start = round_down(start, stride);
end = round_up(end, stride);
+ range_pages = (end - start) >> PAGE_SHIFT;

- if ((end - start) >= (MAX_TLBI_OPS * stride)) {
+ if ((!cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) &&
+ (end - start) >= (MAX_TLBI_OPS * stride)) ||
+ range_pages >= MAX_TLBI_RANGE_PAGES) {
flush_tlb_mm(vma->vm_mm);
return;
}

- /* Convert the stride into units of 4k */
- stride >>= 12;
-
- start = __TLBI_VADDR(start, asid);
- end = __TLBI_VADDR(end, asid);
-
dsb(ishst);
- for (addr = start; addr < end; addr += stride) {
- if (last_level) {
- __tlbi(vale1is, addr);
- __tlbi_user(vale1is, addr);
- } else {
- __tlbi(vae1is, addr);
- __tlbi_user(vae1is, addr);
+ while (range_pages > 0) {
+ if (cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) &&
+ stride == PAGE_SIZE && range_pages % 2 == 0) {
+ num = __TLBI_RANGE_NUM(range_pages, scale) - 1;
+ if (num >= 0) {
+ addr = __TLBI_VADDR_RANGE(start, asid, scale,
+ num, 0);
+ __tlbi_last_level(rvale1is, rvae1is, addr,
+ last_level);
+ start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT;
+ range_pages -= __TLBI_RANGE_PAGES(num, scale);
+ }
+ scale++;
+ continue;
}
+
+ addr = __TLBI_VADDR(start, asid);
+ __tlbi_last_level(vale1is, vae1is, addr, last_level);
+ start += stride;
+ range_pages -= stride >> PAGE_SHIFT;
}
dsb(ish);
}
--
2.19.1


2020-07-08 12:44:06

by Zhenyu Ye

Subject: [RFC PATCH v5 1/2] arm64: tlb: Detect the ARMv8.4 TLBI RANGE feature

ARMv8.4-TLBI provides TLBI invalidation instructions that apply to a
range of input addresses. This patch detects this feature.
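
For reference, the new capability is consumed in patch 2/2 of this series
roughly as follows (simplified from the __flush_tlb_range() rewrite there):

	if (cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) &&
	    stride == PAGE_SIZE && range_pages % 2 == 0) {
		/* use the TLBI RVAE1IS / RVALE1IS range operations */
	}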

Signed-off-by: Zhenyu Ye <[email protected]>
---
arch/arm64/include/asm/cpucaps.h | 3 ++-
arch/arm64/include/asm/sysreg.h | 3 +++
arch/arm64/kernel/cpufeature.c | 10 ++++++++++
3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index d7b3bb0cb180..96fe898bfb5f 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -62,7 +62,8 @@
#define ARM64_HAS_GENERIC_AUTH 52
#define ARM64_HAS_32BIT_EL1 53
#define ARM64_BTI 54
+#define ARM64_HAS_TLBI_RANGE 55

-#define ARM64_NCAPS 55
+#define ARM64_NCAPS 56

#endif /* __ASM_CPUCAPS_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 463175f80341..b4eb2e5601f2 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -617,6 +617,9 @@
#define ID_AA64ISAR0_SHA1_SHIFT 8
#define ID_AA64ISAR0_AES_SHIFT 4

+#define ID_AA64ISAR0_TLBI_RANGE_NI 0x0
+#define ID_AA64ISAR0_TLBI_RANGE 0x2
+
/* id_aa64isar1 */
#define ID_AA64ISAR1_I8MM_SHIFT 52
#define ID_AA64ISAR1_DGH_SHIFT 48
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 9fae0efc80c1..5491bf47e62c 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2058,6 +2058,16 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
.sign = FTR_UNSIGNED,
},
#endif
+ {
+ .desc = "TLB range maintenance instruction",
+ .capability = ARM64_HAS_TLBI_RANGE,
+ .type = ARM64_CPUCAP_SYSTEM_FEATURE,
+ .matches = has_cpuid_feature,
+ .sys_reg = SYS_ID_AA64ISAR0_EL1,
+ .field_pos = ID_AA64ISAR0_TLB_SHIFT,
+ .sign = FTR_UNSIGNED,
+ .min_field_value = ID_AA64ISAR0_TLBI_RANGE,
+ },
{},
};

--
2.19.1


2020-07-08 18:27:44

by Catalin Marinas

Subject: Re: [RFC PATCH v5 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64

On Wed, Jul 08, 2020 at 08:40:31PM +0800, Zhenyu Ye wrote:
> Add __TLBI_VADDR_RANGE macro and rewrite __flush_tlb_range().
>
> In this patch, we only use the TLBI RANGE feature if the stride == PAGE_SIZE,
> because when stride > PAGE_SIZE, usually only a small number of pages need
> to be flushed and classic tlbi instructions are more effective.

Why are they more effective? I guess a range op would work on this as
well, say unmapping a large THP range. If we ignore this stride ==
PAGE_SIZE check, it could make the code easier to read.

> We can also use 'end - start < threshold number' to decide which way
> to go; however, different hardware may have different thresholds, so
> I'm not sure if this is feasible.
>
> Signed-off-by: Zhenyu Ye <[email protected]>
> ---
> arch/arm64/include/asm/tlbflush.h | 104 ++++++++++++++++++++++++++----
> 1 file changed, 90 insertions(+), 14 deletions(-)

Could you please rebase these patches on top of the arm64 for-next/tlbi
branch:

git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/tlbi

> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index bc3949064725..30975ddb8f06 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -50,6 +50,16 @@
> __tlbi(op, (arg) | USER_ASID_FLAG); \
> } while (0)
>
> +#define __tlbi_last_level(op1, op2, arg, last_level) do { \
> + if (last_level) { \
> + __tlbi(op1, arg); \
> + __tlbi_user(op1, arg); \
> + } else { \
> + __tlbi(op2, arg); \
> + __tlbi_user(op2, arg); \
> + } \
> +} while (0)
> +
> /* This macro creates a properly formatted VA operand for the TLBI */
> #define __TLBI_VADDR(addr, asid) \
> ({ \
> @@ -59,6 +69,60 @@
> __ta; \
> })
>
> +/*
> + * Get translation granule of the system, which is decided by
> + * PAGE_SIZE. Used by TTL.
> + * - 4KB : 1
> + * - 16KB : 2
> + * - 64KB : 3
> + */
> +static inline unsigned long get_trans_granule(void)
> +{
> + switch (PAGE_SIZE) {
> + case SZ_4K:
> + return 1;
> + case SZ_16K:
> + return 2;
> + case SZ_64K:
> + return 3;
> + default:
> + return 0;
> + }
> +}

Maybe you can factor out this switch statement in the for-next/tlbi
branch to be shared with TTL.

> +/*
> + * This macro creates a properly formatted VA operand for the TLBI RANGE.
> + * The value bit assignments are:
> + *
> + * +----------+------+-------+-------+-------+----------------------+
> + * | ASID | TG | SCALE | NUM | TTL | BADDR |
> + * +-----------------+-------+-------+-------+----------------------+
> + * |63 48|47 46|45 44|43 39|38 37|36 0|
> + *
> + * The address range is determined by below formula:
> + * [BADDR, BADDR + (NUM + 1) * 2^(5*SCALE + 1) * PAGESIZE)
> + *
> + */
> +#define __TLBI_VADDR_RANGE(addr, asid, scale, num, ttl) \

I don't see a non-zero ttl passed to this macro but I suspect this would
change once rebased on top of the TTL patches.

> + ({ \
> + unsigned long __ta = (addr) >> PAGE_SHIFT; \
> + __ta &= GENMASK_ULL(36, 0); \
> + __ta |= (unsigned long)(ttl) << 37; \
> + __ta |= (unsigned long)(num) << 39; \
> + __ta |= (unsigned long)(scale) << 44; \
> + __ta |= get_trans_granule() << 46; \
> + __ta |= (unsigned long)(asid) << 48; \
> + __ta; \
> + })
> +
> +/* These macros are used by the TLBI RANGE feature. */
> +#define __TLBI_RANGE_PAGES(num, scale) (((num) + 1) << (5 * (scale) + 1))
> +#define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3)
> +
> +#define TLBI_RANGE_MASK GENMASK_ULL(4, 0)
> +#define __TLBI_RANGE_NUM(range, scale) \
> + (((range) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK)
> +
> /*
> * TLB Invalidation
> * ================
> @@ -181,32 +245,44 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
> unsigned long start, unsigned long end,
> unsigned long stride, bool last_level)
> {
> + int num = 0;
> + int scale = 0;
> unsigned long asid = ASID(vma->vm_mm);
> unsigned long addr;
> + unsigned long range_pages;
>
> start = round_down(start, stride);
> end = round_up(end, stride);
> + range_pages = (end - start) >> PAGE_SHIFT;
>
> - if ((end - start) >= (MAX_TLBI_OPS * stride)) {
> + if ((!cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) &&
> + (end - start) >= (MAX_TLBI_OPS * stride)) ||
> + range_pages >= MAX_TLBI_RANGE_PAGES) {
> flush_tlb_mm(vma->vm_mm);
> return;
> }

Is there any value in this range_pages check here? What's the value of
MAX_TLBI_RANGE_PAGES? If we have TLBI range ops, we make a decision here
but without including the stride. Further down we use the stride to skip
the TLBI range ops.

>
> - /* Convert the stride into units of 4k */
> - stride >>= 12;
> -
> - start = __TLBI_VADDR(start, asid);
> - end = __TLBI_VADDR(end, asid);
> -
> dsb(ishst);
> - for (addr = start; addr < end; addr += stride) {
> - if (last_level) {
> - __tlbi(vale1is, addr);
> - __tlbi_user(vale1is, addr);
> - } else {
> - __tlbi(vae1is, addr);
> - __tlbi_user(vae1is, addr);
> + while (range_pages > 0) {

BTW, I think we can even drop the "range_" from range_pages, it's just
the number of pages.

> + if (cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) &&
> + stride == PAGE_SIZE && range_pages % 2 == 0) {
> + num = __TLBI_RANGE_NUM(range_pages, scale) - 1;
> + if (num >= 0) {
> + addr = __TLBI_VADDR_RANGE(start, asid, scale,
> + num, 0);
> + __tlbi_last_level(rvale1is, rvae1is, addr,
> + last_level);
> + start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT;
> + range_pages -= __TLBI_RANGE_PAGES(num, scale);
> + }
> + scale++;
> + continue;
> }
> +
> + addr = __TLBI_VADDR(start, asid);
> + __tlbi_last_level(vale1is, vae1is, addr, last_level);
> + start += stride;
> + range_pages -= stride >> PAGE_SHIFT;
> }
> dsb(ish);
> }

I think the algorithm is correct, though I need to work it out on a
piece of paper.

The code could benefit from some comments (above the loop) on how the
range is built and the right scale found.

--
Catalin

2020-07-09 06:53:50

by Zhenyu Ye

Subject: Re: [RFC PATCH v5 2/2] arm64: tlb: Use the TLBI RANGE feature in arm64

On 2020/7/9 2:24, Catalin Marinas wrote:
> On Wed, Jul 08, 2020 at 08:40:31PM +0800, Zhenyu Ye wrote:
>> Add __TLBI_VADDR_RANGE macro and rewrite __flush_tlb_range().
>>
>> In this patch, we only use the TLBI RANGE feature if the stride == PAGE_SIZE,
>> because when stride > PAGE_SIZE, usually only a small number of pages need
>> to be flushed and classic tlbi instructions are more effective.
>
> Why are they more effective? I guess a range op would work on this as
> well, say unmapping a large THP range. If we ignore this stride ==
> PAGE_SIZE, it could make the code easier to read.
>

OK, I will remove the stride == PAGE_SIZE check here.

>> We can also use 'end - start < threshold number' to decide which way
>> to go; however, different hardware may have different thresholds, so
>> I'm not sure if this is feasible.
>>
>> Signed-off-by: Zhenyu Ye <[email protected]>
>> ---
>> arch/arm64/include/asm/tlbflush.h | 104 ++++++++++++++++++++++++++----
>> 1 file changed, 90 insertions(+), 14 deletions(-)
>
> Could you please rebase these patches on top of the arm64 for-next/tlbi
> branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/tlbi
>

OK, I will send a formal version of this series soon.

>>
>> - if ((end - start) >= (MAX_TLBI_OPS * stride)) {
>> + if ((!cpus_have_const_cap(ARM64_HAS_TLBI_RANGE) &&
>> + (end - start) >= (MAX_TLBI_OPS * stride)) ||
>> + range_pages >= MAX_TLBI_RANGE_PAGES) {
>> flush_tlb_mm(vma->vm_mm);
>> return;
>> }
>
> Is there any value in this range_pages check here? What's the value of
> MAX_TLBI_RANGE_PAGES? If we have TLBI range ops, we make a decision here
> but without including the stride. Further down we use the stride to skip
> the TLBI range ops.
>

MAX_TLBI_RANGE_PAGES is defined as __TLBI_RANGE_PAGES(31, 3), which is
determined by the ARMv8.4 spec. The address range covered by a single
range operation is given by the formula below:

[BADDR, BADDR + (NUM + 1) * 2^(5*SCALE + 1) * PAGESIZE)

which has nothing to do with the stride. After the stride == PAGE_SIZE
check below is removed, this will be clearer.
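
For reference, expanding the macro with the definitions in this patch:

	MAX_TLBI_RANGE_PAGES = __TLBI_RANGE_PAGES(31, 3)
	                     = (31 + 1) << (5 * 3 + 1)
	                     = 32 << 16
	                     = 2^21 pages (8GB of address space with 4KB pages)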


>> }
>
> I think the algorithm is correct, though I need to work it out on a
> piece of paper.
>
> The code could benefit from some comments (above the loop) on how the
> range is built and the right scale found.
>

OK.

Thanks,
Zhenyu